Write a function in Jupyter whose parameter is a file name; count how often each word occurs in the file and print the three most frequent words

Date: 2024-09-06 17:04:59 Views: 59

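To write a function in Jupyter Notebook that counts how often each word appears in a file and prints the three most frequent words, follow these steps.

First, install the `nltk` library, which provides text-processing tools, including tokenizers. If it is not installed yet, run:

```python
!pip install nltk
```

Then import the required modules and download the stop-word list and the English tokenizer data:

```python
import re
from collections import Counter

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the stop-word list and tokenizer data (skipped if already present)
nltk.download('stopwords')
nltk.download('punkt')
nltk.download('punkt_tab')  # recent NLTK releases load this resource instead of 'punkt'
```

Next, define the function `word_frequency_analysis`:

```python
def word_frequency_analysis(filename):
    # Read the file and lowercase it so that counting is case-insensitive
    with open(filename, 'r', encoding='utf-8') as file:
        text = file.read().lower()

    # Replace punctuation and other non-word characters with spaces
    # (\W leaves letters, digits and underscores in place)
    cleaned_text = re.sub(r'\W+', ' ', text)

    # Tokenize and drop English stop words; build the set once, not per word
    stop_words = set(stopwords.words('english'))
    words = [word for word in word_tokenize(cleaned_text) if word not in stop_words]

    # Count the occurrences of each word
    word_counts = Counter(words)

    # Take the three most frequent words and their counts
    top_three_words = word_counts.most_common(3)

    print("Top three most frequent words:")
    for word, count in top_three_words:
        print(f"{word}: {count}")


# Call the function
file_name = "your_file.txt"  # replace with the file you want to analyze
word_frequency_analysis(file_name)
```

Replace `your_file.txt` with the name of the file you actually want to analyze. The function reads the file, cleans the text, tokenizes it, removes English stop words, counts word frequencies, and prints the three most frequent words.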
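If installing `nltk` is not an option, the same counting step can be sketched with the standard library alone. The version below is a minimal sketch, not part of the original answer: the name `top_words` and its parameter `n` are hypothetical, it assumes English text, and it does not remove stop words, so common words such as "the" will usually dominate the result.

```python
import os
import re
import tempfile
from collections import Counter


def top_words(filename, n=3):
    """Return the n most common words in a text file as (word, count) pairs."""
    with open(filename, "r", encoding="utf-8") as f:
        text = f.read().lower()
    # Treat each run of ASCII letters, optionally joined by apostrophes
    # (as in "don't"), as one word. Digits and punctuation are ignored.
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text)
    return Counter(words).most_common(n)


# Demo on a throwaway file so the cell runs without any external data.
with tempfile.NamedTemporaryFile(
    "w", suffix=".txt", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("Red red red. Blue, blue! Green?")
    sample_path = tmp.name

for word, count in top_words(sample_path):
    print(f"{word}: {count}")

os.remove(sample_path)  # clean up the temporary file
```

Returning the list instead of printing it inside the function keeps the result reusable in later notebook cells; printing is left to the caller.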
