Write code to find the 100 most frequent words in the collected works of Shakespeare, excluding the words on a stop-word list that carry no real meaning.


Posted: 2024-02-03 13:13:31 | Views: 53

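Here is one possible Python implementation:

```python
import re
from collections import Counter

# Load the stop-word list (one word per line), normalized to lowercase
with open("stopwords.txt", "r", encoding="utf-8") as f:
    stopwords = {line.strip().lower() for line in f if line.strip()}

# Load the collected works of Shakespeare
with open("shakespeare.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Replace punctuation, digits, and any other non-letter characters with spaces
text = re.sub(r"[^a-zA-Z]+", " ", text)

# Convert the text to lowercase so that "The" and "the" count as the same word
text = text.lower()

# Split the text into a list of words
words = text.split()

# Drop stop words and words of two letters or fewer
filtered_words = [word for word in words if word not in stopwords and len(word) > 2]

# Count word frequencies and take the 100 most common words
word_counts = Counter(filtered_words)
top_words = word_counts.most_common(100)

# Print the results
for word, count in top_words:
    print(word, count)
```

Note that this code relies on a stop-word file (stopwords.txt) containing common words that carry no real meaning, such as "the", "and", and "but". If you do not have such a file, consider using the stop-word list that ships with the NLTK library. Also note that because apostrophes are replaced with spaces, contractions such as "don't" are split into fragments; the length filter removes the short fragments, and it also removes legitimate one- and two-letter words.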
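The NLTK fallback mentioned in the answer could look roughly like the sketch below. It is only one way to arrange it, not part of the original answer: the helper names `load_stopwords` and `top_words` and the `min_length` parameter are hypothetical, and the file names are assumed to match those used in the answer. The NLTK branch assumes the package is installed (`pip install nltk`) and that the machine can download the `stopwords` corpus on first use.

```python
import re
from collections import Counter
from pathlib import Path


def load_stopwords(path="stopwords.txt"):
    """Return a set of lowercase stop words from `path`, or from NLTK if the file is missing.

    Hypothetical helper: the name and default path are assumptions.
    """
    file = Path(path)
    if file.exists():
        with file.open(encoding="utf-8") as f:
            return {line.strip().lower() for line in f if line.strip()}

    # Fallback: NLTK's English stop-word list. Imported lazily so that the
    # file-based branch works even when NLTK is not installed.
    import nltk
    from nltk.corpus import stopwords

    nltk.download("stopwords", quiet=True)  # one-time download, needs network access
    return set(stopwords.words("english"))


def top_words(text, stopwords, n=100, min_length=3):
    """Return the n most frequent words in `text` that are not stop words.

    Words are runs of ASCII letters, lowercased; words shorter than
    `min_length` are dropped, mirroring the length filter in the answer.
    """
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(w for w in words if w not in stopwords and len(w) >= min_length)
    return counts.most_common(n)


# Example usage (assumes shakespeare.txt is in the working directory):
#
#     with open("shakespeare.txt", encoding="utf-8") as f:
#         for word, count in top_words(f.read(), load_stopwords()):
#             print(word, count)
```

Keeping the counting logic in a function that takes the stop-word set as an argument makes it easy to try different lists on the same text. One caveat: this tokenizer splits contractions at the apostrophe, so stop-list entries that contain an apostrophe never match a token.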
