Code for extracting text keywords with TF-IDF
Posted: 2023-06-12 08:02:22
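Sure, I can help with that. Below is a simple Python example that uses the TF-IDF vectorizer from the sklearn library and a stemmer from the nltk library to extract keywords from text: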
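```
import re

from sklearn.feature_extraction.text import TfidfVectorizer
from nltk.stem import SnowballStemmer

# Sample corpus
corpus = ["This is the first document.",
          "This is the second document.",
          "And this is the third one.",
          "Is this the first document?"]

# Initialize the stemmer
stemmer = SnowballStemmer('english')

# Tokenize on word characters, then stem each token.
# A plain text.split() would keep punctuation, so "document." and
# "document?" would be counted as two different terms.
def stem_tokenize(text):
    return [stemmer.stem(word) for word in re.findall(r"\w+", text)]

# Compute the TF-IDF matrix (one row per document, one column per term).
# Stop words are filtered after tokenization, i.e. against the stemmed tokens.
vectorizer = TfidfVectorizer(tokenizer=stem_tokenize, stop_words='english')
X = vectorizer.fit_transform(corpus)

# Get the vocabulary; get_feature_names() was removed in scikit-learn 1.2
feature_names = vectorizer.get_feature_names_out()

for i in range(len(corpus)):
    # TF-IDF weights of document i as a flat array
    tfidf_scores = X[i].toarray().flatten()
    # Indexes of the three highest weights, in descending order
    top_indexes = tfidf_scores.argsort()[-3:][::-1]
    # Skip terms with weight 0: they do not occur in this document
    top_features = [(feature_names[idx], round(float(tfidf_scores[idx]), 3))
                    for idx in top_indexes if tfidf_scores[idx] > 0]
    print("Top features for document {}: {}".format(i, top_features))
```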
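The code uses SnowballStemmer to reduce each word to its stem and TfidfVectorizer to compute a TF-IDF vector for every document. It then reads each document's TF-IDF scores and prints that document's top three keywords.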
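To make the scores less of a black box, here is a minimal pure-Python sketch of the weighting that TfidfVectorizer applies with its default settings: raw term counts, a smoothed idf of ln((1 + n) / (1 + df)) + 1, and L2 normalization per document. The helper names `tfidf_scores` and `top_keywords` and the hand-picked token lists are assumptions made for illustration. They are not part of sklearn, and they are not what the stemming pipeline above would produce.

```python
import math
from collections import Counter


def tfidf_scores(docs):
    """Return one {term: weight} dict per tokenized document (hypothetical helper)."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed idf, matching sklearn's default smooth_idf=True
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1
           for term, count in df.items()}

    results = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts
        weights = {term: count * idf[term] for term, count in tf.items()}
        # L2-normalize so every non-empty document vector has length 1
        norm = math.sqrt(sum(w * w for w in weights.values()))
        results.append({t: w / norm for t, w in weights.items()} if norm else {})
    return results


def top_keywords(weights, k=3):
    """Return the k highest-weighted (term, weight) pairs (hypothetical helper)."""
    return sorted(weights.items(), key=lambda kv: kv[1], reverse=True)[:k]


# Hand-picked tokens standing in for already tokenized, stop-word-filtered text
docs = [
    ["first", "document"],
    ["second", "document"],
    ["third", "one"],
    ["first", "document"],
]

scores = tfidf_scores(docs)
for i, weights in enumerate(scores):
    print("Top features for document {}: {}".format(i, top_keywords(weights)))
```

In this toy corpus "document" appears in three of the four documents, so its idf is low and the rarer term "second" outranks it in the second document. That is the effect that makes TF-IDF useful for picking keywords.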