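Python jieba: The text file “红楼梦.txt” contains the first 20 chapters of the novel Dream of the Red Chamber (《红楼梦》). Segment the text in “红楼梦.txt” into words and normalize the character names: 凤姐, 凤姐儿, and 凤丫头 become 凤姐; 宝玉, 二爷, and 宝二爷 become 宝玉; 黛玉, 颦儿, 林妹妹, and 黛玉道 become 黛玉; 宝钗 and 宝丫头 become 宝钗; 贾母 and 老祖宗 become 贾母; 袭人 and 袭人道 become 袭人; 贾政 and 贾政道 become 贾政; 贾琏 and 琏二爷 become 贾琏. Extract the character names that appear at least 40 times, and write each name with its appearance count to out.txt, sorted in descending order of count.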
Posted: 2023-07-04 21:09:24 Views: 251
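You can segment the text with Python's jieba library, map each alias to a canonical name with a dictionary, then count each character's appearances and write the counts to a file in descending order.
The code is as follows: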
```python
import jieba
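# Read the text file
with open("红楼梦.txt", "r", encoding="utf-8") as f:
    text = f.read()
# Segment the text into a list of words
words = jieba.lcut(text)
# Normalize character names: map each alias to its canonical name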
mapping = {
    "凤姐儿": "凤姐",
    "凤丫头": "凤姐",
    "二爷": "宝玉",
    "宝二爷": "宝玉",
    "颦儿": "黛玉",
    "林妹妹": "黛玉",
    "黛玉道": "黛玉",
    "宝丫头": "宝钗",
    "老祖宗": "贾母",
    "袭人道": "袭人",
    "贾政道": "贾政",
    "琏二爷": "贾琏",
}
normalized_words = []
for word in words:
    if word in mapping:
        word = mapping[word]
    normalized_words.append(word)
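# Count appearances. Candidates use the normalized forms ("宝玉", not "贾宝玉"),
# otherwise the aliases merged by the mapping would never be counted.
count = {}
for name in ["宝玉", "黛玉", "宝钗", "贾母", "凤姐", "史湘云", "妙玉", "元春", "贾琏", "李纨", "贾环", "贾兰", "贾蓉", "贾芸", "贾珠", "秦可卿", "晴雯", "袭人", "贾政"]:
    count[name] = normalized_words.count(name)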
# Sort by count in descending order and write to the file
with open("out.txt", "w", encoding="utf-8") as f:
    for name, freq in sorted(count.items(), key=lambda x: x[1], reverse=True):
        if freq >= 40:
            f.write(name + " " + str(freq) + "\n")
```
Running the code produces a file named “out.txt” that lists each character appearing at least 40 times together with the appearance count.
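The normalize-then-count step can also be tried on its own, without the novel or jieba installed. The following is a minimal sketch, not part of the original answer: the helper name `count_characters`, its parameters, and the tiny token list are all hypothetical. It assumes the tokens have already been produced by something like `jieba.lcut`, and it uses `collections.Counter` to count every token in a single pass rather than calling `list.count` once per name.

```python
from collections import Counter

# Hypothetical helper for illustration; not from the original answer.
def count_characters(tokens, aliases, candidates, min_count=40):
    """Return (name, count) pairs for candidates seen at least min_count times."""
    # Map each alias to its canonical name; other tokens pass through unchanged.
    normalized = (aliases.get(tok, tok) for tok in tokens)
    freq = Counter(normalized)
    rows = [(name, freq[name]) for name in candidates if freq[name] >= min_count]
    # Sort by count descending; break ties by name so the order is deterministic.
    return sorted(rows, key=lambda row: (-row[1], row[0]))

# Made-up token list standing in for the output of jieba.lcut(text).
tokens = ["宝玉", "二爷", "说", "林妹妹", "黛玉", "宝二爷", "凤丫头"]
aliases = {"二爷": "宝玉", "宝二爷": "宝玉", "林妹妹": "黛玉", "凤丫头": "凤姐"}
result = count_characters(tokens, aliases, ["宝玉", "黛玉", "凤姐"], min_count=2)
print(result)
```

With the real text, the same function would be called with `min_count=40` and each returned pair written to out.txt on its own line. Whether aliases such as 宝二爷 come out of the segmenter as single tokens depends on jieba's dictionary, so the counts are only as good as the segmentation.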