
Using NLTK to find the 100 most frequent words in text files on a computer

Time: 2023-11-15 13:11:22 Views: 200


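The following Python code uses NLTK to find the 100 most frequent words in a text file:

```python
import nltk
from nltk.corpus import stopwords
from collections import Counter

# One-time setup: download the tokenizer models and the stop-word list
# (recent NLTK releases may also need nltk.download('punkt_tab'))
nltk.download('punkt')
nltk.download('stopwords')

# Read the contents of the text file
with open('file.txt', 'r', encoding='utf-8') as file:
    text = file.read()

# Lowercase the text, then split it into tokens
words = nltk.word_tokenize(text.lower())

# Drop stop words and non-alphabetic tokens such as punctuation
stop_words = set(stopwords.words('english'))
words = [word for word in words if word.isalpha() and word not in stop_words]

# Count word frequencies
word_freq = Counter(words)

# Take the 100 most frequent words
top_words = word_freq.most_common(100)

# Print the result
print(top_words)
```

Explanation:

- Read the contents of the text file into the string variable `text`.
- Convert the text to lowercase, tokenize it with `nltk.word_tokenize()`, and store the tokens in the list `words`.
- Load the English stop words from NLTK's `stopwords` corpus and store them in the set `stop_words`.
- Use a list comprehension to remove the stop words. The `word.isalpha()` check also discards punctuation tokens, which `word_tokenize()` returns as separate items and which would otherwise dominate the counts.
- Use `collections.Counter` to count how often each word occurs. The result, `word_freq`, is a `dict` subclass that maps each word to its count.
- Call the `most_common()` method to get the 100 most frequent words as a list of `(word, count)` tuples, stored in `top_words`.
- Finally, print the top 100 words together with their counts.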
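The script above reads a single file, `file.txt`, while the question asks about the text files on a computer. A minimal sketch of how the counts might be aggregated over a whole directory tree is shown below. The function name `top_words_in_dir`, its parameters, and the `*.txt` pattern are assumptions made for illustration, and a simple regular expression stands in for `nltk.word_tokenize()` so that the sketch runs without NLTK data; it is not the tokenizer used above.

```python
import re
from collections import Counter
from pathlib import Path

# Regex stand-in for nltk.word_tokenize (an assumption, not NLTK's tokenizer):
# matches runs of ASCII letters, optionally with one apostrophe part ("don't").
WORD_RE = re.compile(r"[a-z]+(?:'[a-z]+)?")


def top_words_in_dir(root, n=100, stop_words=frozenset(), pattern="*.txt"):
    """Return the n most common words across all files matching pattern under root."""
    counts = Counter()
    for path in Path(root).rglob(pattern):
        if not path.is_file():
            continue
        # errors="ignore" drops bytes that are not valid UTF-8 instead of raising
        text = path.read_text(encoding="utf-8", errors="ignore")
        tokens = WORD_RE.findall(text.lower())
        counts.update(w for w in tokens if w not in stop_words)
    return counts.most_common(n)
```

To combine this with the NLTK stop-word list, `set(stopwords.words('english'))` can be passed as `stop_words`, for example `top_words_in_dir('some_folder', stop_words=stop_words)`, where `some_folder` is a placeholder path.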
