Python: tokenize English text, remove stop words, normalize case, lemmatize, strip punctuation, and finally extract keywords with TF-IDF (with concrete code)
Posted: 2023-07-10 17:19:22 Views: 210
Sure. Here is example code that preprocesses English text in Python and extracts keywords with TF-IDF:
```python
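import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer
from sklearn.feature_extraction.text import TfidfVectorizer

# Download the NLTK data used below (skipped if already present);
# which tokenizer resource is needed depends on the NLTK version
for resource in ('punkt', 'punkt_tab', 'stopwords', 'wordnet'):
    nltk.download(resource, quiet=True)

# Read the text file
with open('example.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Convert to lowercase
text = text.lower()

# Tokenize, then keep only alphabetic tokens (drops punctuation and numbers)
tokens = word_tokenize(text)
tokens = [word for word in tokens if word.isalpha()]

# Remove stop words
stop_words = set(stopwords.words('english'))
tokens = [word for word in tokens if word not in stop_words]

# Lemmatize (without a POS tag, lemmatize() treats every word as a noun)
lemmatizer = WordNetLemmatizer()
tokens = [lemmatizer.lemmatize(word) for word in tokens]

# Stem; the keywords then come out as stems such as "studi",
# so skip this step if you want readable words
porter = PorterStemmer()
tokens = [porter.stem(word) for word in tokens]

# Join the processed tokens back into a single string
text = ' '.join(tokens)

# TF-IDF keyword extraction. With a single document the IDF factor is the
# same for every term, so the ranking reduces to term frequency.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform([text])
# get_feature_names() was removed in scikit-learn 1.2
feature_names = vectorizer.get_feature_names_out()
scores = tfidf.toarray()[0]

# Print the top 10 keywords and their TF-IDF scores
for idx in scores.argsort()[::-1][:10]:
    print(f"{feature_names[idx]}\t{scores[idx]:.4f}")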
```
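The code above first reads a text file, then lowercases the text, tokenizes it and strips punctuation, removes stop words, lemmatizes and stems the remaining tokens, and finally uses TF-IDF to rank the terms and print the 10 highest-scoring keywords.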
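The script above fits TF-IDF on a single document, where every term appears in the same number of documents, so the IDF part cannot tell terms apart. The following is a minimal pure-Python sketch of how the scores behave once there are several documents. The toy corpus `docs` and the helper names `tfidf_scores` and `top_keywords` are made up for illustration. The formula used here (smoothed IDF, then L2 normalization per document) is intended to mirror the default settings of `TfidfVectorizer`, but treat it as an approximation for understanding, not a replacement for the library.

```python
import math
from collections import Counter

# Hypothetical toy corpus: each document is already a list of cleaned tokens
docs = [
    ["cat", "sat", "mat", "cat"],
    ["dog", "sat", "log"],
    ["cat", "chased", "dog"],
]

def tfidf_scores(docs):
    """Return one {term: score} dict per document (smoothed IDF, L2-normalized)."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    df = Counter(term for doc in docs for term in set(doc))
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    results = []
    for doc in docs:
        tf = Counter(doc)  # raw term counts within this document
        raw = {t: count * idf[t] for t, count in tf.items()}
        # L2-normalize; fall back to 1.0 so an empty document does not divide by zero
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
        results.append({t: v / norm for t, v in raw.items()})
    return results

def top_keywords(scores, k=3):
    """Return the k highest-scoring (term, score) pairs, best first."""
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:k]

all_scores = tfidf_scores(docs)
for i, scores in enumerate(all_scores):
    print(i, [(term, round(score, 3)) for term, score in top_keywords(scores)])
```

In the second document, `log` outranks `dog` and `sat` even though all three occur once, because `log` appears in no other document. That is the effect a single-document run cannot show. To get it with the script above, pass a list of several preprocessed strings to `fit_transform` instead of `[text]`.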