Python code for a keyword extraction algorithm combining the LDA topic model, the Word2Vec word-embedding model, and TextRank
Time: 2024-02-03 13:15:33
Below is Python code for a keyword extraction algorithm that combines the LDA topic model, the Word2Vec word-embedding model, and TextRank:
```python
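import jieba
import numpy as np
from gensim import corpora, models
from gensim.models import Word2Vec

# Load the stop-word list (one word per line); a set gives fast lookups
with open('stopwords.txt', encoding='UTF-8') as f:
    stopwords = {line.strip() for line in f}

# Load the corpus: one document per non-empty line
with open('data.txt', encoding='UTF-8') as f:
    corpus = [line.strip() for line in f if line.strip()]

# Tokenize with jieba, dropping stop words and whitespace-only tokens
texts = [[w for w in jieba.cut(doc) if w.strip() and w not in stopwords]
         for doc in corpus]

# Train Word2Vec (gensim >= 4.0 names the dimension `vector_size`, not `size`)
w2v_model = Word2Vec(texts, vector_size=100, window=5, min_count=1, workers=4)

# Train the LDA topic model
dictionary = corpora.Dictionary(texts)
corpus_bow = [dictionary.doc2bow(text) for text in texts]
lda_model = models.LdaModel(corpus_bow, num_topics=10, id2word=dictionary)


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    return float(np.dot(u, v) / denom) if denom else 0.0


def textrank_scores(words, window=5, damping=0.85, iterations=50):
    """TextRank on a word co-occurrence graph; returns {word: score}, summing to 1."""
    vocab = sorted(set(words))
    index = {w: i for i, w in enumerate(vocab)}
    adj = np.zeros((len(vocab), len(vocab)))
    # Link every pair of distinct words that co-occur within the window
    for i, w in enumerate(words):
        for other in words[i + 1:i + window]:
            if other != w:
                adj[index[w], index[other]] += 1.0
                adj[index[other], index[w]] += 1.0
    out_weight = adj.sum(axis=0)
    out_weight[out_weight == 0] = 1.0  # isolated words: avoid division by zero
    transition = adj / out_weight      # column-normalized edge weights
    rank = np.full(len(vocab), 1.0 / len(vocab))
    for _ in range(iterations):
        rank = (1 - damping) / len(vocab) + damping * transition @ rank
    return dict(zip(vocab, rank / rank.sum()))


# Extract keywords for each document
keywords_list = []
for text in texts:
    if not text:  # nothing left after stop-word removal
        keywords_list.append([])
        continue
    bow = dictionary.doc2bow(text)
    # LDA topic distribution, most probable topic first.
    # lda_model[bow] may return fewer than num_topics entries, so iterate
    # over what is returned instead of indexing a fixed range(10).
    lda_dist = sorted(lda_model[bow], key=lambda x: x[1], reverse=True)
    # Document vector: mean of the Word2Vec vectors of its words
    doc_vector = np.mean([w2v_model.wv[w] for w in text], axis=0)
    # TextRank score of each word in this document
    tr_scores = textrank_scores(text)
    # Combine the three signals; keep the best score per candidate word
    text_words = set(text)
    candidates = {}
    for topic_id, _ in lda_dist:
        for word, prob in lda_model.show_topic(topic_id, topn=20):
            if word in text_words:
                # The original compared each word with the literal token '主题',
                # which raises KeyError when that token is not in the vocabulary;
                # similarity to the document vector is used here instead.
                score = (0.5 * prob
                         + 0.3 * cosine(w2v_model.wv[word], doc_vector)
                         + 0.2 * tr_scores[word])
                candidates[word] = max(score, candidates.get(word, score))
    top = sorted(candidates.items(), key=lambda x: x[1], reverse=True)[:5]
    keywords_list.append([word for word, _ in top])

# Print the keyword list (one list of up to five keywords per document)
print(keywords_list)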
```
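The code above first loads the stop-word list and the corpus, then tokenizes each document and trains a Word2Vec model and an LDA topic model on the tokenized corpus. For each document it then obtains the topic distribution from LDA, the word vectors from Word2Vec, and a per-word weight from TextRank. Finally, the three signals are combined as a weighted sum (0.5 for the LDA topic-word probability, 0.3 for the Word2Vec similarity, 0.2 for the TextRank score), and the five highest-scoring words are kept as each document's keywords.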
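The fusion step can also be looked at in isolation. The following is a minimal sketch, assuming each of the three signals has already been reduced to a per-word score dict. The helper names `min_max` and `fuse_scores` are hypothetical, and the min-max normalization is an addition that the code above does not perform (it sums the raw values); it is included here because topic-word probabilities, cosine similarities, and TextRank scores live on different scales. The weights 0.5, 0.3, and 0.2 are the ones used above, and the toy scores are invented for illustration only.

```python
def min_max(scores):
    """Rescale a {word: score} dict to [0, 1]; constant inputs map to 0.0."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {w: (s - lo) / span if span else 0.0 for w, s in scores.items()}


def fuse_scores(lda, w2v, textrank, weights=(0.5, 0.3, 0.2), top_k=5):
    """Weighted sum of three normalized score dicts; returns the top_k words."""
    signals = [min_max(lda), min_max(w2v), min_max(textrank)]
    # Only words scored by all three signals are candidates
    common = set(lda) & set(w2v) & set(textrank)
    fused = {
        w: sum(weight * signal[w] for weight, signal in zip(weights, signals))
        for w in common
    }
    return sorted(fused, key=fused.get, reverse=True)[:top_k]


# Toy scores, invented purely for illustration
lda = {"model": 0.08, "topic": 0.05, "data": 0.01}
w2v = {"model": 0.70, "topic": 0.90, "data": 0.20}
textrank = {"model": 0.40, "topic": 0.35, "data": 0.25}
print(fuse_scores(lda, w2v, textrank, top_k=2))  # ['model', 'topic']
```

Whether to normalize before summing is a design choice: without it, the signal with the largest numeric range tends to dominate the ranking regardless of the weights assigned to it.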