How can I extract the top 200 terms from a document collection, ranked by TF-IDF value in descending order, using Python?
Posted: 2023-06-02 22:05:40 | Views: 86
You can do this with TfidfVectorizer from Python's scikit-learn (sklearn) library. The code is as follows:
```
from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

# Document collection
docs = [
    "This is the first document",
    "This is the second document",
    "And this is the third one",
    "Is this the first document?",
]

# Fit TfidfVectorizer to get a documents-by-terms matrix of TF-IDF weights
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs)

# Vocabulary terms, in the same order as the columns of X
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
features = vectorizer.get_feature_names_out()
df = pd.DataFrame(X.toarray(), columns=features)

# Sum each term's TF-IDF weight over all documents, sort descending, keep the top 200
top_words = df.sum(axis=0).sort_values(ascending=False)[:200].index.tolist()
print("Top 200 words by TF-IDF:")
print(', '.join(top_words))
```
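This code prints the top 200 terms in the document collection, ranked in descending order by their TF-IDF value summed across all documents. If the vocabulary contains fewer than 200 terms, as in this small example, all of them are returned.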