Using the stop-word dictionary stop_words to remove stop words from every txt document in a folder

Posted: 2024-03-03 11:49:17

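Below is a Python example that removes stop words from every txt document in a folder using the stop-word collection `stop_words`. Replace `stop_words` with your own stop-word list (in the code it is a Python `set`, which makes membership tests fast), and replace `input_folder` and `output_folder` with your own input and output folder paths.

```python
import os
import re

# Define the stop-word set (replace with your own)
stop_words = {"is", "the", "of", "and", "a", "to", "in", "that", "it", "with",
              "for", "as", "on", "was", "at", "by", "an", "be", "this", "which",
              "from", "or", "were", "are", "you", "your", "we", "our"}

# Define the input and output folder paths
input_folder = "/path/to/input/folder"
output_folder = "/path/to/output/folder"

# Create the output folder if it does not exist yet
os.makedirs(output_folder, exist_ok=True)

# Iterate over all txt files in the input folder
for filename in os.listdir(input_folder):
    if filename.endswith(".txt"):
        # Build the input and output file paths
        input_path = os.path.join(input_folder, filename)
        output_path = os.path.join(output_folder, filename)

        # Open the input file and read its contents
        with open(input_path, "r", encoding="utf-8") as f:
            text = f.read()

        # Remove punctuation and digits
        text = re.sub(r"[^\w\s]", "", text)
        text = re.sub(r"\d+", "", text)

        # Remove stop words (compare in lower case so "The" matches "the")
        words = text.split()
        words = [word for word in words if word.lower() not in stop_words]
        text = " ".join(words)

        # Write the processed text to the output file
        with open(output_path, "w", encoding="utf-8") as f:
            f.write(text)
```

The code iterates over every file in the input folder whose name ends in `.txt` and processes each one as follows:

1. Open the file and read its contents.
2. Remove punctuation and digits.
3. Remove stop words.
4. Write the processed text to a file of the same name in the output folder.

Two things to keep in mind: `text.split()` followed by `" ".join(words)` collapses all whitespace, so the original line breaks are not preserved; and splitting on whitespace only works for languages that separate words with spaces, so Chinese text has to be segmented into words before this filter can be applied. Adapt the code to your needs, for example by replacing the stop-word set or changing the input and output folder paths.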
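Since the answer asks you to substitute your own stop-word list, one way to do that is to load it from a text file instead of hard-coding it. The following is a minimal sketch, not part of the original answer: the helper names `load_stop_words` and `remove_stop_words` are hypothetical, and it assumes the file is UTF-8 encoded with one stop word per line.

```python
def load_stop_words(path, encoding="utf-8"):
    """Return a set of stop words read from a file.

    Assumption: the file holds one stop word per line.
    Blank lines are skipped, and words are lower-cased so that
    matching can be case-insensitive.
    """
    words = set()
    with open(path, "r", encoding=encoding) as f:
        for line in f:
            word = line.strip().lower()
            if word:
                words.add(word)
    return words


def remove_stop_words(text, stop_words):
    """Drop whitespace-separated tokens whose lower-cased form is a stop word."""
    kept = [w for w in text.split() if w.lower() not in stop_words]
    return " ".join(kept)
```

With these helpers, the hard-coded set in the script could be replaced by something like `stop_words = load_stop_words("stopwords.txt")`, where the file name is only a placeholder for your own file.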
