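Python: tokenize English text, remove stop words, normalize case, lemmatize, strip punctuation, then extract keywords with TF-IDF and get word frequencies (with concrete code)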
Posted: 2023-07-10 17:19:56
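Sure. The following Python code tokenizes English text, removes stop words, lowercases it, lemmatizes it, strips punctuation, and finally uses TF-IDF to extract keywords and obtain word frequencies: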
```python
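import string
from collections import Counter

import nltk
import pandas as pd
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from nltk.stem.porter import PorterStemmer  # alternative to lemmatization, see below
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer

# Download the NLTK data used below (only needed once);
# newer NLTK releases look for 'punkt_tab' instead of 'punkt'
for resource in ('punkt', 'punkt_tab', 'stopwords', 'wordnet'):
    nltk.download(resource, quiet=True)

# Read the text file
with open('text.txt', 'r', encoding='utf-8') as file:
    text = file.read()

stop_words = set(stopwords.words('english'))
table = str.maketrans('', '', string.punctuation)
lemmatizer = WordNetLemmatizer()

def preprocess(sentence):
    # Lowercase, then tokenize
    words = word_tokenize(sentence.lower())
    # Strip punctuation characters from every token
    words = [w.translate(table) for w in words]
    # Remove stop words and tokens left empty by the previous step
    words = [w for w in words if w and w not in stop_words]
    # Lemmatize (treats each word as a noun by default);
    # for stemming, use PorterStemmer().stem(w) here instead
    return [lemmatizer.lemmatize(w) for w in words]

# Treat each sentence as one document; fitting TF-IDF on single words
# would reduce the scores to plain word counts
documents = [preprocess(s) for s in sent_tokenize(text)]
documents = [d for d in documents if d]

# Word frequencies: print the 10 most common words
word_freq = Counter(w for d in documents for w in d)
print(word_freq.most_common(10))

# Extract keywords with TF-IDF
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform([' '.join(d) for d in documents])
feature_names = vectorizer.get_feature_names_out()  # get_feature_names() was removed in scikit-learn 1.2
df = pd.DataFrame(tfidf.toarray(), columns=feature_names)
# Print the 10 keywords with the highest summed TF-IDF score
print(df.sum().sort_values(ascending=False)[:10])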
```
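To make the scores produced by `TfidfVectorizer` less opaque, here is a minimal standard-library sketch of the computation. It assumes the smoothed formula idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalization of each document, which to my understanding matches scikit-learn's documented defaults. The helper name `tfidf_rows` and the toy documents are made up for illustration.

```python
import math
from collections import Counter

def tfidf_rows(documents):
    """Return one {term: weight} dict per tokenized document (hypothetical helper)."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears
    doc_freq = Counter(term for doc in documents for term in set(doc))
    # Smoothed inverse document frequency (assumed to mirror scikit-learn's default)
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1 for t, df in doc_freq.items()}
    rows = []
    for doc in documents:
        counts = Counter(doc)
        weights = {t: c * idf[t] for t, c in counts.items()}
        # L2-normalize each document so that long documents do not dominate
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({t: w / norm for t, w in weights.items()} if norm else {})
    return rows

# Toy, already-preprocessed documents (assumed data)
docs = [["cat", "sat", "mat"], ["cat", "chased", "mouse"], ["dog", "sat"]]
rows = tfidf_rows(docs)

# Sum the weights over all documents to rank the terms
totals = Counter()
for row in rows:
    totals.update(row)
for term, score in totals.most_common(3):
    print(f"{term}: {score:.3f}")
```

Terms that occur in fewer documents get a larger idf, so within a single document a rare term outweighs a common one that appears the same number of times.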
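Note that the script above relies on external libraries such as nltk, sklearn (scikit-learn), and pandas, which must be installed beforehand, and that the NLTK tokenizer, stop-word, and WordNet data must be downloaded once with `nltk.download`. In addition, the keywords extracted with TF-IDF may need to be filtered and adjusted for the actual use case.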
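As one possible way to do that filtering, the sketch below post-processes a term-to-score mapping: it drops very short tokens, purely numeric tokens, and anything on a custom blocklist, then keeps the top k. The function `filter_keywords`, its parameters, and the sample scores are hypothetical, and the thresholds would need tuning for a real corpus.

```python
def filter_keywords(scores, min_len=3, blocklist=(), top_k=10):
    """Keep the top_k highest-scoring terms that pass simple sanity checks."""
    blocked = set(blocklist)
    kept = {
        term: score
        for term, score in scores.items()
        if len(term) >= min_len      # drop fragments such as "nt" or "s"
        and not term.isdigit()       # drop bare numbers
        and term not in blocked      # drop domain-specific noise words
    }
    # Sort by score, highest first, and truncate
    return sorted(kept.items(), key=lambda item: item[1], reverse=True)[:top_k]

# Made-up scores, standing in for something like df.sum().to_dict()
sample_scores = {"model": 3.2, "nt": 2.9, "2023": 2.5,
                 "data": 2.1, "said": 1.8, "training": 1.7}
for term, score in filter_keywords(sample_scores, blocklist={"said"}, top_k=3):
    print(term, score)
```

Post-filtering like this keeps the vectorizer untouched, so the same scores can be re-filtered with different thresholds without refitting.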