Using Python to tokenize English text, remove stop words, normalize case, lemmatize, and strip punctuation, then extract keywords with gensim — with complete code
Date: 2023-07-10 12:19:26
Below is Python code that lowercases English text, strips punctuation, tokenizes it, removes stop words, lemmatizes the tokens, and then extracts keywords with gensim's TF-IDF model:
```python
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from string import punctuation
from gensim import corpora, models

# First run may require downloading NLTK data:
# nltk.download('punkt'); nltk.download('stopwords'); nltk.download('wordnet')

# Stop-word list and lemmatizer
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

# TF-IDF needs more than one document: with a single document every
# term's IDF is log2(1/1) = 0, so all weights vanish
texts = [
    "The quick brown fox jumped over the lazy dog.",
    "The dog slept while the fox ran across the brown field.",
    "A lazy afternoon is perfect for a quick nap.",
]

def preprocess(text):
    # Lowercase
    text = text.lower()
    # Strip punctuation
    text = ''.join(c for c in text if c not in punctuation)
    # Tokenize
    tokens = nltk.word_tokenize(text)
    # Remove stop words
    tokens = [t for t in tokens if t not in stop_words]
    # Lemmatize
    return [lemmatizer.lemmatize(t) for t in tokens]

docs = [preprocess(t) for t in texts]

# Build the dictionary and the bag-of-words corpus
dictionary = corpora.Dictionary(docs)
corpus = [dictionary.doc2bow(doc) for doc in docs]

# Train the TF-IDF model and apply it to the corpus
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]

# Print each document's top 3 keywords by TF-IDF weight
for doc in corpus_tfidf:
    for term_id, weight in sorted(doc, key=lambda x: x[1], reverse=True)[:3]:
        print(dictionary[term_id], weight)
```
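If the NLTK data downloads are not available, the lowercasing, punctuation-stripping, and stop-word steps can be sketched with the standard library alone. The small `STOP_WORDS` set below is a hypothetical stand-in for NLTK's full English list, and whitespace splitting stands in for `word_tokenize`:

```python
from string import punctuation

# Hypothetical miniature stop-word set (NLTK's real list has ~179 entries)
STOP_WORDS = {"the", "over", "a", "an", "of"}

def preprocess(text):
    # Lowercase
    text = text.lower()
    # Strip punctuation via a translation table
    text = text.translate(str.maketrans("", "", punctuation))
    # Tokenize on whitespace and drop stop words
    return [t for t in text.split() if t not in STOP_WORDS]

tokens = preprocess("The quick brown fox jumped over the lazy dog.")
# → ['quick', 'brown', 'fox', 'jumped', 'lazy', 'dog']
```

This skips lemmatization, which genuinely needs WordNet, but covers the purely mechanical steps of the pipeline.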
The exact weights depend on the corpus statistics: terms unique to one document (such as "jumped" in the first sentence) receive the highest TF-IDF scores, while a term that appears in every document has an IDF of log2(1) = 0 and is dropped from the results entirely. That is also why running this pipeline over a single-document corpus prints nothing.