Programming task: implement TF-IDF keyword extraction in Python
Date: 2023-12-06 19:02:58
A TF-IDF-based keyword extraction algorithm written in Python.
Here is a simple Python program that extracts keywords with TF-IDF:
```python
import os
import jieba
from sklearn.feature_extraction.text import TfidfVectorizer

# Load the stop-word list from stopwords.txt in the current directory
def get_stopwords_list():
    stopwords_file = os.path.join(os.getcwd(), 'stopwords.txt')
    with open(stopwords_file, 'r', encoding='utf-8') as f:
        return [line.strip() for line in f]

# Segment the text with jieba and drop stop words
def cut_words(text, stopwords_list):
    words = jieba.cut(text)
    return ' '.join(word for word in words if word not in stopwords_list)

# Extract the topK keywords of each document by TF-IDF weight
def tfidf_extract(texts, topK):
    stopwords_list = get_stopwords_list()  # load once, reuse for every text
    corpus = [cut_words(text, stopwords_list) for text in texts]
    # TfidfVectorizer already combines CountVectorizer and TfidfTransformer,
    # so no separate transformer step is needed
    vectorizer = TfidfVectorizer()
    tfidf = vectorizer.fit_transform(corpus)
    words = vectorizer.get_feature_names_out()
    weight = tfidf.toarray()
    for i in range(len(weight)):
        # indices of this document's terms, sorted by descending TF-IDF weight
        top_indices = weight[i].argsort()[::-1][:topK]
        print("Keywords:")
        for index in top_indices:
            if weight[i][index] > 0:  # skip terms absent from this document
                print(words[index], end=" ")
        print("\n")

if __name__ == '__main__':
    texts = ['中文语料库', '汉语分词', '自然语言处理', '机器学习', '深度学习']
    topK = 3
    tfidf_extract(texts, topK)
```
The program uses the jieba library for Chinese word segmentation and scikit-learn's TfidfVectorizer to compute the TF-IDF weight matrix. A stop-word list filters out meaningless tokens, and for each document the topK terms with the highest TF-IDF weights are printed as its keywords.
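To make the weighting itself concrete, here is a minimal, dependency-free sketch of the TF-IDF computation on a pre-tokenized toy corpus. The function name `tfidf_weights` is illustrative, and the smoothed-idf formula shown (the variant scikit-learn applies by default, before its additional L2 normalization) is an assumption about which TF-IDF flavor is wanted:

```python
import math

def tfidf_weights(corpus):
    """Per-document TF-IDF weights for a list of token lists.

    Uses the smoothed idf = ln((1 + N) / (1 + df)) + 1, where N is the
    number of documents and df the number of documents containing the term.
    """
    n_docs = len(corpus)
    # document frequency: in how many documents does each term appear?
    df = {}
    for doc in corpus:
        for term in set(doc):
            df[term] = df.get(term, 0) + 1
    weights = []
    for doc in corpus:
        # raw term frequency within this document
        tf = {term: doc.count(term) for term in set(doc)}
        weights.append({
            term: tf[term] * (math.log((1 + n_docs) / (1 + df[term])) + 1)
            for term in tf
        })
    return weights

corpus = [['machine', 'learning'], ['deep', 'learning'], ['corpus']]
w = tfidf_weights(corpus)
```

Here 'learning' appears in two of the three documents, so its idf (and hence its weight) is lower than that of 'machine', which appears in only one; this is exactly the effect that pushes corpus-wide common words down the keyword ranking.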