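Read the file data.txt and count how often each English word occurs in it, then print the 10 most frequent words together with their counts. Notes:
(1) Ignore case: a word counts as the same word whether it appears in uppercase or lowercase.
(2) Only consider words longer than 2 characters, and exclude some words that are unrelated to the gist of the text: ['all', 'and', 'are', 'for', 'have', 'our', 'that', 'the', 'their', 'this', 'will', 'with']
(3) Strip all punctuation, i.e. treat every punctuation mark as a word separator (hint: the punctuation constant in the string library can help).
The expected output is: development:74 people:53 work:42 support:39 ensure:38 improve:35 year:34 government:33 more:33 new:31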
Date: 2023-08-06 12:13:52
The code is as follows:
```python
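import string

# Words to exclude because they are unrelated to the gist of the text
stop_words = {'all', 'and', 'are', 'for', 'have', 'our', 'that', 'the',
              'their', 'this', 'will', 'with'}

# Read the file
with open('data.txt', 'r') as f:
    text = f.read()

# Convert all letters to lowercase so counting is case-insensitive
text = text.lower()

# Treat every punctuation mark as a word separator
for punctuation in string.punctuation:
    text = text.replace(punctuation, ' ')

# Split the text into a list of words
words = text.split()

# Count word frequencies, skipping short words and stop words.
# Stop words are filtered as whole words here; removing them with
# text.replace(word, '') would also cut them out of longer words
# (for example, "other" would become "or").
word_frequency = {}
for word in words:
    if len(word) > 2 and word not in stop_words:
        word_frequency[word] = word_frequency.get(word, 0) + 1

# Sort by frequency in descending order
sorted_word_frequency = sorted(word_frequency.items(), key=lambda x: x[1], reverse=True)

# Print the top 10 words and their counts (fewer if there are fewer distinct words)
for word, count in sorted_word_frequency[:10]:
    print(f"{word}:{count} ", end='')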
```
The output is:
```
development:74 people:53 work:42 support:39 ensure:38 improve:35 year:34 government:33 more:33 new:31
```
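The same steps (lowercase, replace punctuation with spaces, drop short words and stop words, take the top 10) can also be written more compactly with `collections.Counter` and `str.translate`. The following is a minimal sketch rather than the required solution: the function name `top_words`, the constant `STOP_WORDS`, and the sample sentence are made up for illustration, and it works on a string so it can be tried without data.txt.

```python
import string
from collections import Counter

# Stop-word list taken from the task statement
STOP_WORDS = {'all', 'and', 'are', 'for', 'have', 'our', 'that', 'the',
              'their', 'this', 'will', 'with'}


def top_words(text, n=10):
    """Return the n most frequent (word, count) pairs (hypothetical helper)."""
    # Map every ASCII punctuation character to a space
    table = str.maketrans(string.punctuation, ' ' * len(string.punctuation))
    words = text.lower().translate(table).split()
    # Keep whole words longer than 2 characters that are not stop words
    counts = Counter(w for w in words if len(w) > 2 and w not in STOP_WORDS)
    return counts.most_common(n)


if __name__ == '__main__':
    # Made-up sample text, not the contents of data.txt
    sample = "The People work; the people SUPPORT their work. Work for all!"
    print(' '.join(f"{w}:{c}" for w, c in top_words(sample)))
    # prints: work:3 people:2 support:1
```

To apply it to the exercise, one would read data.txt into a string and pass that to `top_words`. Words with equal counts come back in the order they first appear in the text.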