Keyword extraction from a text file with the tf-idf algorithm and the NLTK library
Posted: 2024-05-07 20:20:23
First, install the NLTK library and download the required corpora and models. This can be done with the following code:
```
!pip install nltk
import nltk
nltk.download('stopwords')                    # stop word lists
nltk.download('punkt')                        # tokenizer models used by word_tokenize
nltk.download('wordnet')                      # data used by WordNetLemmatizer
nltk.download('averaged_perceptron_tagger')   # POS tagger
```
Then tf-idf keyword extraction proceeds in the following steps:
1. Import the necessary libraries:
```
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from sklearn.feature_extraction.text import TfidfVectorizer
```
2. Read the text file and preprocess it: strip punctuation, remove stop words, and lemmatize:
```
# Read the text file
with open('text.txt', 'r') as f:
    text = f.read()
# Strip punctuation
text = text.translate(str.maketrans('', '', string.punctuation))
# Tokenize and remove stop words
stop_words = set(stopwords.words('english'))
tokens = word_tokenize(text)
filtered_tokens = [word for word in tokens if word.lower() not in stop_words]
# Lemmatize
lemmatizer = WordNetLemmatizer()
lemmatized_tokens = [lemmatizer.lemmatize(token) for token in filtered_tokens]
```
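One caveat about the step above: `WordNetLemmatizer.lemmatize` treats every token as a noun by default, so verbs such as "running" are returned unchanged. The POS tagger downloaded earlier can supply the part of speech, but its Penn Treebank tags first have to be mapped to the single-character codes WordNet expects. A minimal sketch (the helper name `penn_to_wordnet` is our own, not part of NLTK):

```python
def penn_to_wordnet(tag):
    # Map a Penn Treebank tag to the single-character POS codes
    # that WordNetLemmatizer.lemmatize(token, pos=...) accepts
    if tag.startswith('J'):
        return 'a'  # adjective
    if tag.startswith('V'):
        return 'v'  # verb
    if tag.startswith('R'):
        return 'r'  # adverb
    return 'n'      # noun (the lemmatizer's default)

# Plugged into the pipeline above, this would look like:
#   tagged = nltk.pos_tag(filtered_tokens)
#   lemmatized_tokens = [lemmatizer.lemmatize(tok, pos=penn_to_wordnet(tag))
#                        for tok, tag in tagged]
```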
3. Compute tf-idf weights with TfidfVectorizer and extract the keywords:
```
# Join the tokens back into one document: TfidfVectorizer expects
# an iterable of documents, not a list of individual tokens
doc = ' '.join(lemmatized_tokens)
tfidf = TfidfVectorizer()
tfidf_matrix = tfidf.fit_transform([doc])
feature_names = tfidf.get_feature_names_out()
# Rank terms by their tf-idf score and keep the top n
top_n = 10
scores = tfidf_matrix.toarray()[0]
top_features = sorted(zip(scores, feature_names), reverse=True)[:top_n]
keywords = [feature[1] for feature in top_features]
```
Finally, print the extracted keywords:
```
print(keywords)
```
Note that this assumes the text file is named 'text.txt'; change the file name and path to match your setup. The number of extracted keywords is controlled by the top_n variable.
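For intuition about what TfidfVectorizer is computing, the weights can be reproduced by hand. A minimal sketch of scikit-learn's default scheme, assuming raw term counts for tf, the smoothed idf = ln((1+n)/(1+df)) + 1, and L2 normalization per document (the function name `tfidf_weights` is ours):

```python
import math

def tfidf_weights(docs):
    # Document frequencies over a whitespace-tokenized corpus
    n = len(docs)
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    df = {w: sum(1 for toks in tokenized if w in toks) for w in vocab}
    # Smoothed idf, matching scikit-learn's default: ln((1+n)/(1+df)) + 1
    idf = {w: math.log((1 + n) / (1 + df[w])) + 1 for w in vocab}
    rows = []
    for toks in tokenized:
        raw = [toks.count(w) * idf[w] for w in vocab]
        norm = math.sqrt(sum(x * x for x in raw)) or 1.0  # L2 normalization
        rows.append([x / norm for x in raw])
    return vocab, rows

vocab, rows = tfidf_weights(["apple banana apple", "banana cherry"])
# "apple" occurs twice in doc 0 and only in doc 0, so it dominates that row
```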