Word-Frequency Statistics for Water Margin in Python
Date: 2023-10-20 12:35:19
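Counting word frequencies in Water Margin (《水浒传》) with Python is a common text-analysis exercise. The text is first split into words with the jieba segmentation library, and then the number of occurrences of each word is counted.
In the given code, jieba segments the text of Water Margin, and the dictionary counts records how often each word appears. Words that say little about the novel, such as “两个” (two), “一个” (one), and “只见” (only to see), are excluded from the result, and different forms that refer to the same person are merged into a single entry: for example, “宋江道” (Song Jiang said) is counted as “宋江” (Song Jiang).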
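The filtering and merging step can be tried without jieba. The following is a minimal sketch, assuming the text has already been segmented into a list of tokens. The `ALIASES` table, the shortened `EXCLUDES` set, and the `count_words` helper are hypothetical names introduced for illustration and are not part of the original code.

```python
from collections import Counter

# Hypothetical alias table: each variant maps to one canonical name.
ALIASES = {"宋江道": "宋江"}
# Words to drop from the result (a small subset, for illustration only).
EXCLUDES = {"两个", "一个", "只见"}


def count_words(tokens):
    """Count tokens, skipping single characters and excluded words."""
    counts = Counter()
    for token in tokens:
        if len(token) == 1:
            continue  # single-character tokens are skipped
        word = ALIASES.get(token, token)  # merge variants into one name
        if word in EXCLUDES:
            continue
        counts[word] += 1
    return counts


# Hand-made token list standing in for jieba output.
sample = ["宋江", "宋江道", "只见", "李逵", "道", "一个", "李逵", "宋江"]
print(count_words(sample))
```

Here excluded words are skipped while counting rather than deleted afterwards; the resulting counts are the same either way.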
Finally, the words are sorted by frequency in descending order, and the ten most frequent words are printed.
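As a small illustration of the ranking step, the sketch below sorts a hand-made frequency dictionary in descending order and keeps the first entries. The counts are made-up sample values, not results from the novel, and `top_n` is a hypothetical helper name.

```python
def top_n(counts, n=10):
    """Return the n most frequent (word, count) pairs, highest first."""
    items = list(counts.items())
    items.sort(key=lambda x: x[1], reverse=True)
    return items[:n]


# Made-up sample counts, for illustration only.
sample_counts = {"宋江": 5, "李逵": 3, "武松": 4}
for word, count in top_n(sample_counts, n=2):
    print("{0:<10}{1:>5}".format(word, count))
```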
The code for the Water Margin word-frequency count is as follows:
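```python
import jieba

# Words to leave out of the final ranking.
excludes = {"两个","一个","只见","如何","那里","哥哥","说道","军马","头领","众人","这里","兄弟","梁山泊","出来","小人","今日","这个","先锋","三个","因此","人马","问道","起来","便是","妇人","好汉","不是","不知","不曾","只是","如此","次日","我们","不得","如今","看时","不敢","来到","且说","一面","只得","山寨","原来","将军","却是"}

# Read the whole novel; bytes that cannot be decoded are ignored.
with open("水浒传.txt", "r", encoding="gb18030", errors="ignore") as f:
    txt = f.read()

# Segment the text into a list of words.
words = jieba.lcut(txt)

counts = {}
for word in words:
    if len(word) == 1:
        # Skip single-character tokens.
        continue
    elif word == "宋江道" or word == "宋江":
        # Merge different forms that refer to the same character.
        rword = "宋江"
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1

# Remove excluded words; pop() avoids a KeyError if a word never occurred.
for word in excludes:
    counts.pop(word, None)

# Sort by frequency, highest first, and print the ten most frequent words.
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for word, count in items[:10]:
    print("{0:<10}{1:>5}".format(word, count))
```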