Generate a character word cloud for Dream of the Red Chamber in Python, removing modal particles and character aliases
Posted: 2023-10-25 19:26:04
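This task uses two libraries, jieba (Chinese word segmentation) and wordcloud. The steps are as follows:
1. Install the jieba and wordcloud libraries (the leading `!` is notebook syntax; omit it when running pip from a shell)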
```python
!pip install jieba wordcloud
```
2. Read the text of Dream of the Red Chamber
```python
# Read the whole novel into memory (assumes a UTF-8 encoded file)
with open('hongloumeng.txt', 'r', encoding='utf-8') as f:
    text = f.read()
```
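Plain-text editions of the novel are often saved as GBK/GB18030 rather than UTF-8, in which case the `open` call above raises `UnicodeDecodeError`. The following is a minimal sketch of a fallback reader, assuming the file uses one of the listed encodings. The helper name `read_text` and the encoding list are illustrative and not part of the original answer.

```python
def read_text(path, encodings=("utf-8", "gb18030")):
    """Return the file contents, trying each encoding in order."""
    for enc in encodings:
        try:
            with open(path, "r", encoding=enc) as f:
                return f.read()
        except UnicodeDecodeError:
            continue  # wrong guess, try the next encoding
    raise ValueError(f"could not decode {path!r} with any of {encodings}")
```

With this helper, step 2 becomes `text = read_text('hongloumeng.txt')`.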
3. Tokenize the text and remove stop words and character aliases
```python
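import jieba
import jieba.posseg as pseg

# Load the stop-word list (one word per line); a set gives fast lookups
with open('stopwords.txt', 'r', encoding='utf-8') as f:
    stopwords = {line.strip() for line in f if line.strip()}

# Load the list of character aliases to exclude (one alias per line)
with open('alias.txt', 'r', encoding='utf-8') as f:
    alias = {line.strip() for line in f if line.strip()}

# Load the custom dictionary before tokenizing so its entries take effect
jieba.load_userdict('userdict.txt')

# Tokenize with part-of-speech tags; keep nouns (tags starting with 'n')
# that are neither stop words nor aliases
words = []
for word, flag in pseg.cut(text):
    if flag.startswith('n') and word not in stopwords and word not in alias:
        words.append(word)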
```
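The filtering rule in this step can be isolated into a small function and checked without running jieba. The sketch below is illustrative only: `filter_words` is a hypothetical helper, and the hand-tagged sample stands in for the `(word, flag)` pairs that `pseg.cut` yields. Passing `pos_prefix="nr"` (the tag jieba uses for person names) would narrow the result from all nouns to names only.

```python
def filter_words(tagged, stopwords, alias, pos_prefix="n"):
    """Keep words whose POS tag starts with pos_prefix and that are not excluded."""
    excluded = set(stopwords) | set(alias)
    return [word for word, flag in tagged
            if flag.startswith(pos_prefix) and word not in excluded]

# Hand-tagged sample standing in for the output of pseg.cut (illustrative only)
sample = [("宝玉", "nr"), ("笑", "v"), ("道", "v"), ("呢", "y"),
          ("宝二爷", "nr"), ("黛玉", "nr"), ("园子", "n")]

kept = filter_words(sample, stopwords={"呢"}, alias={"宝二爷"})
print(kept)  # ['宝玉', '黛玉', '园子']
```

Note that the noun filter already drops the modal particle "呢" (tagged `y`) and the verbs, so the stop-word list mainly matters for unwanted nouns.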
4. Generate the word cloud
```python
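from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Join the filtered words into one space-separated string
text = ' '.join(words)

# Build the word cloud; font_path must point to a font with Chinese glyphs
# ('msyh.ttc' is Microsoft YaHei, which ships with Windows)
wordcloud = WordCloud(font_path='msyh.ttc',
                      background_color='white',
                      width=800,
                      height=600,
                      max_words=100,
                      min_font_size=10,
                      max_font_size=80).generate(text)

# Display the word cloud
plt.figure(figsize=(10, 8))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()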
```
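`generate(text)` tokenizes the joined string again and, with default settings, may also count two-word collocations. Since the words are already segmented, one alternative is to count them directly and hand the counts to the library. This is a minimal sketch of that approach, not the original answer's method. The helper name `word_frequencies` and the sample list are illustrative.

```python
from collections import Counter

def word_frequencies(words, max_words=100):
    """Return {word: count} for the max_words most frequent words."""
    return dict(Counter(words).most_common(max_words))

# Small hand-made sample standing in for the filtered `words` list
freqs = word_frequencies(["宝玉", "黛玉", "宝玉", "宝钗", "宝玉", "黛玉"], max_words=2)
print(freqs)  # {'宝玉': 3, '黛玉': 2}
```

The resulting dict can be passed to `WordCloud(...).generate_from_frequencies(freqs)` in place of `.generate(text)`.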
The complete code is as follows:
```python
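import jieba
import jieba.posseg as pseg
from wordcloud import WordCloud
import matplotlib.pyplot as plt

# Read the whole novel into memory (assumes a UTF-8 encoded file)
with open('hongloumeng.txt', 'r', encoding='utf-8') as f:
    text = f.read()

# Load the stop-word list (one word per line); a set gives fast lookups
with open('stopwords.txt', 'r', encoding='utf-8') as f:
    stopwords = {line.strip() for line in f if line.strip()}

# Load the list of character aliases to exclude (one alias per line)
with open('alias.txt', 'r', encoding='utf-8') as f:
    alias = {line.strip() for line in f if line.strip()}

# Load the custom dictionary before tokenizing so its entries take effect
jieba.load_userdict('userdict.txt')

# Tokenize with part-of-speech tags; keep nouns (tags starting with 'n')
# that are neither stop words nor aliases
words = []
for word, flag in pseg.cut(text):
    if flag.startswith('n') and word not in stopwords and word not in alias:
        words.append(word)

# Join the filtered words into one space-separated string
text = ' '.join(words)

# Build the word cloud; font_path must point to a font with Chinese glyphs
# ('msyh.ttc' is Microsoft YaHei, which ships with Windows)
wordcloud = WordCloud(font_path='msyh.ttc',
                      background_color='white',
                      width=800,
                      height=600,
                      max_words=100,
                      min_font_size=10,
                      max_font_size=80).generate(text)

# Display the word cloud
plt.figure(figsize=(10, 8))
plt.imshow(wordcloud, interpolation='bilinear')
plt.axis('off')
plt.show()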
```
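Here, `stopwords.txt` and `alias.txt` are the stop-word list and the character-alias list, respectively; you can compile them yourself or use existing lists. `userdict.txt` is a custom dictionary for adding special words, such as uncommon personal names or place names.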
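A jieba user dictionary is a UTF-8 text file with one entry per line: the word, then an optional frequency, then an optional part-of-speech tag, separated by spaces. The sketch below writes such a file for a few character names. The helper `write_userdict`, the sample names, and the frequency value of 1000 are assumptions chosen for illustration, not values from the original answer.

```python
def write_userdict(path, names, freq=1000, tag="nr"):
    """Write one 'word freq tag' entry per line in jieba's user-dict format."""
    with open(path, "w", encoding="utf-8") as f:
        for name in names:
            f.write(f"{name} {freq} {tag}\n")

# Illustrative entries; 'nr' marks each word as a person name
sample_names = ["贾宝玉", "林黛玉", "王熙凤"]
```

Calling `write_userdict('userdict.txt', sample_names)` produces lines such as `贾宝玉 1000 nr`, which `jieba.load_userdict('userdict.txt')` can then load.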