Python data preprocessing: jieba word segmentation, stopword removal, word-frequency analysis, and word-cloud generation
Date: 2023-10-23 08:12:44
Below is Python code that preprocesses a text file, segments it with jieba, removes stopwords, counts word frequencies, and generates a word cloud:
```python
import jieba
from collections import Counter

import matplotlib.pyplot as plt
from wordcloud import WordCloud

# Read the text file
with open('text.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Segment the text with jieba (returns a generator of tokens)
words = jieba.cut(text)

# Remove stopwords, plus whitespace-only tokens that segmentation can produce
stopwords = {'的', '我们', '你们', '他们', '她们'}
words = [word for word in words if word.strip() and word not in stopwords]

# Count word frequencies and print the 10 most common words
word_counts = Counter(words)
word_counts_top10 = word_counts.most_common(10)
print(word_counts_top10)

# Build the word cloud from the frequency counts
wordcloud = WordCloud(font_path='msyh.ttc', width=800, height=600,
                      background_color='white')
wordcloud.generate_from_frequencies(word_counts)

# Display the word cloud
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()
```
Here, `text.txt` is the text file to process, `stopwords` is the list of stopwords to remove, and `word_counts_top10` holds the 10 most frequent words from the count. `msyh.ttc` is a font file (Microsoft YaHei), needed so that Chinese characters render correctly in the word cloud.
Note: before running the code you need to install the `jieba` and `wordcloud` packages, e.g.:
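The filtering and counting steps can be isolated into a small helper so they are easy to test without reading a file or running jieba. This is a minimal sketch; the helper name `filter_and_count` and the sample token list are illustrative, not part of the original code:

```python
from collections import Counter

def filter_and_count(words, stopwords, top_n=10):
    """Drop stopwords and whitespace-only tokens, then return the top_n most common words."""
    kept = [w for w in words if w.strip() and w not in stopwords]
    return Counter(kept).most_common(top_n)

# Example with a pre-segmented token list (as jieba.cut would produce)
tokens = ['我们', '喜欢', '自然', '语言', '处理', '喜欢', '的', ' ', '处理', '喜欢']
stopwords = {'的', '我们'}
print(filter_and_count(tokens, stopwords, top_n=3))
# → [('喜欢', 3), ('处理', 2), ('自然', 1)]
```

In a larger project the stopword list would typically be loaded from a file (one word per line) rather than hard-coded.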
```bash
pip install jieba wordcloud
```