Keyword extraction from long text with visualization: complete Python code example (code only)
Date: 2023-09-30 13:07:25
Below is a complete Python example that extracts keywords from a long English text and visualizes them as a word cloud:
```python
import pandas as pd
import matplotlib.pyplot as plt
from wordcloud import WordCloud
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
# Read the text file
with open('text_file.txt', 'r', encoding='utf-8') as f:
    text = f.read()
# Preprocess: lowercase, keep only letters and whitespace (drops punctuation and digits), remove stop words
stopwords = ['the', 'and', 'of', 'in', 'to', 'that', 'is', 'for', 'it', 'with', 'as', 'was', 'on', 'by', 'at', 'an', 'be', 'this', 'which', 'or', 'from', 'but', 'not', 'are', 'have', 'they', 'has', 'their']
text = text.lower()
text = ''.join([c for c in text if c.isalpha() or c.isspace()])
text = ' '.join([word for word in text.split() if word not in stopwords])
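# Count word frequencies (the whole text is treated as a single document)
vectorizer = CountVectorizer()
word_count = vectorizer.fit_transform([text])
# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
words = vectorizer.get_feature_names_out()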
# Compute TF-IDF values
transformer = TfidfTransformer()
tfidf = transformer.fit_transform(word_count)
# Convert the TF-IDF matrix to a DataFrame (one column per word)
tfidf_df = pd.DataFrame(tfidf.toarray(), columns=words)
# Take the top-N keywords and their TF-IDF values
top_n = 10
keywords = tfidf_df.sum().sort_values(ascending=False)[:top_n]
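# Visualize the keywords as a word cloud (generate_from_frequencies expects a word -> weight mapping)
wordcloud = WordCloud(background_color='white', width=800, height=600).generate_from_frequencies(keywords.to_dict())
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()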
```
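Explanation:
1. First, read the text file and preprocess it: lowercase the text, then remove punctuation and stop words.
2. Next, use `CountVectorizer` to count word frequencies and `TfidfTransformer` to compute TF-IDF values. Because the whole text is passed in as a single document, every word gets the same IDF, so the resulting ranking matches the ranking by raw word frequency.
3. Convert the TF-IDF values to a DataFrame and take the top N keywords with their TF-IDF values.
4. Finally, use `WordCloud` to visualize the keywords.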
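The single-document limitation noted in step 2 can be worked around by splitting the long text into smaller pieces. The following is a minimal standard-library sketch of that idea, not part of the original answer: it assumes each non-empty line is one "document", matches ASCII letters only, and sums unnormalized TF-IDF scores across lines. The helper name `top_keywords` is hypothetical; the IDF formula follows the smoothed form that scikit-learn uses by default, `ln((1 + n) / (1 + df)) + 1`.

```python
import math
import re
from collections import Counter


def top_keywords(text, top_n=10, stopwords=frozenset()):
    """Rank words by TF-IDF summed over lines (hypothetical helper)."""
    # Assumption: each non-empty line is one "document", so IDF can differ per word.
    chunks = [line for line in text.lower().splitlines() if line.strip()]
    # Assumption: ASCII letters only; digits and punctuation are dropped.
    tokenized = [
        [w for w in re.findall(r"[a-z]+", chunk) if w not in stopwords]
        for chunk in chunks
    ]
    n_docs = len(tokenized)
    # Document frequency: the number of lines that contain each word.
    doc_freq = Counter(word for tokens in tokenized for word in set(tokens))
    scores = Counter()
    for tokens in tokenized:
        for word, count in Counter(tokens).items():
            # Smoothed IDF; scores are not L2-normalized in this sketch.
            idf = math.log((1 + n_docs) / (1 + doc_freq[word])) + 1
            scores[word] += count * idf
    return scores.most_common(top_n)


if __name__ == "__main__":
    sample = "cats chase mice\ncats sleep all day\ndogs chase cats"
    for word, score in top_keywords(sample, top_n=3):
        print(f"{word}: {score:.3f}")
```

The returned list of `(word, score)` pairs can be turned into a dictionary with `dict(...)` and passed to `WordCloud.generate_from_frequencies` in place of `keywords`.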