
Tokenizing an English text file and removing stop words with Python

Time: 2024-12-26 22:28:51 Views: 10

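In Python, the nltk (Natural Language Toolkit) library can handle this kind of text processing, including tokenization and stop word removal. Here is a simple example:

```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# Download the required resources (only needed on the first run).
nltk.download('punkt')
nltk.download('punkt_tab')  # recent NLTK releases load this instead of 'punkt'
nltk.download('stopwords')


def process_text(file_path):
    # Read the file contents.
    with open(file_path, 'r', encoding='utf-8') as file:
        text = file.read()

    # Tokenize the text into words and punctuation marks.
    tokenized_words = word_tokenize(text)

    # English stop word list, stored as a set for fast membership tests.
    stop_words = set(stopwords.words('english'))

    # Remove stop words, comparing case-insensitively.
    filtered_words = [word for word in tokenized_words if word.casefold() not in stop_words]
    return filtered_words


file_to_process = "example.txt"  # replace with the path to the English file you want to process
tokens = process_text(file_to_process)
print("Processed words:", tokens)
```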
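
To make the two steps easier to inspect without downloading any NLTK data, here is a minimal standard-library sketch of the same idea. It is not the NLTK approach from the answer: the regex tokenizer is a much cruder stand-in for `word_tokenize`, and `DEMO_STOP_WORDS`, `simple_tokenize`, and `remove_stop_words` are hypothetical names introduced only for illustration. One visible difference is that this tokenizer discards punctuation, whereas `word_tokenize` emits tokens such as `,` and `.`, which are not stop words and therefore stay in the NLTK result unless filtered separately.

```python
import re

# Hypothetical, deliberately tiny stop word set for illustration only;
# NLTK's stopwords.words('english') list is much longer.
DEMO_STOP_WORDS = {"a", "an", "and", "the", "is", "are", "of", "to", "in", "on", "over"}


def simple_tokenize(text):
    """Return runs of ASCII letters, allowing one internal apostrophe (e.g. "Don't")."""
    return re.findall(r"[A-Za-z]+(?:'[A-Za-z]+)?", text)


def remove_stop_words(tokens, stop_words=DEMO_STOP_WORDS):
    """Keep only tokens whose case-folded form is not in stop_words."""
    return [token for token in tokens if token.casefold() not in stop_words]


sample = "The quick brown fox jumps over the lazy dog, and the dog is not amused."
tokens = simple_tokenize(sample)
filtered = remove_stop_words(tokens)
print(filtered)
```

Comparing against the case-folded token, as the original answer does, means that "The" and "the" are both removed while the surviving tokens keep their original capitalization.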
