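Count how many characters xiyouji5.txt contains, obtain the text's frequency distribution (hint: the FreqDist() function can be used), and, based on that distribution, plot the frequency distribution of the 1000 most frequent words. I need the final, complete code. The text in the figure and on the axes must not be garbled. D:\xiyouji5.txt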


Date: 2024-10-16 14:16:04 Views: 53

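This task can be done with Python's `nltk` library for the frequency analysis and `matplotlib` for the plot. Make sure both are installed; if not, install them with:

```bash
pip install nltk matplotlib
```

Then write the code in these steps:

1. Import the required libraries and select a font that contains Chinese glyphs
2. Load the text file
3. Preprocess the text (drop punctuation, whitespace, and other non-Chinese symbols)
4. Count the characters and build the frequency distribution
5. Plot the frequencies of the 1000 most frequent items

Two points about the approach. First, `nltk.word_tokenize` does not segment Chinese, because Chinese text has no spaces between words, and `string.punctuation` covers only ASCII punctuation. The code below therefore treats each Chinese character as one token, which also answers the "how many characters" part of the question. For word-level counts, the text would first have to pass through a Chinese word segmenter such as `jieba`. Second, matplotlib's default font has no Chinese glyphs, so a Chinese font must be set explicitly or the labels render as empty boxes.

Here is the complete code:

```python
import re

import matplotlib.pyplot as plt
from nltk.probability import FreqDist

# Use a font that contains Chinese glyphs so labels are not garbled.
# SimHei and Microsoft YaHei ship with Windows; adjust the list on other systems.
plt.rcParams["font.sans-serif"] = ["SimHei", "Microsoft YaHei"]
plt.rcParams["axes.unicode_minus"] = False  # keep the minus sign renderable

# Read the text file
with open("D:/xiyouji5.txt", "r", encoding="utf-8") as file:
    text = file.read()

# Preprocessing: keep only Chinese characters, which drops punctuation,
# whitespace, digits, and Latin letters
tokens = re.findall(r"[\u4e00-\u9fff]", text)
print(f"Total characters: {len(tokens)}")

# Build the frequency distribution
fdist = FreqDist(tokens)
print(f"Distinct characters: {len(fdist)}")

# Take the 1000 most frequent tokens and their counts
top_1000 = fdist.most_common(1000)
words = [word for word, _ in top_1000]
freqs = [freq for _, freq in top_1000]

# Plot frequency against rank (a histogram of the counts would be the wrong chart)
plt.figure(figsize=(12, 6))
plt.plot(range(1, len(freqs) + 1), freqs)
step = 50  # label every 50th rank; 1000 tick labels would overlap
plt.xticks(range(1, len(words) + 1, step), words[::step])
plt.xlabel('词汇')
plt.ylabel('频率')
plt.title('前1000个高频词汇频率分布')
plt.show()
```

Running the code prints the total and distinct character counts and shows a line chart of the 1000 most frequent characters, ordered by rank. Check that the encoding is correct so the file is parsed properly: Chinese text files created on Windows are often GBK or GB18030 rather than UTF-8, in which case the `encoding` argument of `open()` has to be changed.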
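The encoding check can be sketched as a small helper that tries a list of candidate encodings in order. This is a minimal sketch using only the standard library. The function name `read_text_with_fallback`, the candidate list, and the sample file are assumptions made for illustration; the actual encoding of `D:\xiyouji5.txt` is not known.

```python
import os
import tempfile


def read_text_with_fallback(path, encodings=("utf-8-sig", "gb18030")):
    """Return (text, encoding) for the first candidate that decodes cleanly.

    UTF-8 is tried first because it is strict: GBK bytes almost never
    form valid UTF-8, so a wrong guess fails instead of silently
    producing garbled text.
    """
    with open(path, "rb") as f:
        raw = f.read()
    for enc in encodings:
        try:
            return raw.decode(enc), enc
        except UnicodeDecodeError:
            continue
    raise ValueError(f"none of {encodings} could decode {path}")


# Demo on a made-up sample file written in GBK (hypothetical data,
# not the real xiyouji5.txt).
sample = "西游记"
gbk_path = os.path.join(tempfile.mkdtemp(), "sample_gbk.txt")
with open(gbk_path, "wb") as f:
    f.write(sample.encode("gbk"))

text, used = read_text_with_fallback(gbk_path)
print(used, text == sample)
```

In the full script, `text, _ = read_text_with_fallback("D:/xiyouji5.txt")` could stand in for the `open(...)` call. The order of candidates matters: GB18030 accepts many byte sequences, so placing it first could misread a UTF-8 file.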
