
Python code for removing stopwords from Chinese text

Date: 2023-07-23 22:22:54 Views: 99

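Here is a basic Python example that removes stopwords from Chinese text:

```
import jieba


# Load the Chinese stopword list (one word per line, UTF-8 encoded)
def load_stopwords(file_path):
    with open(file_path, "r", encoding="utf-8") as f:
        # Strip whitespace and skip blank lines so "" never enters the set
        return {line.strip() for line in f if line.strip()}


# Remove stopwords from a piece of text
def remove_stopwords(text, stopwords):
    words = jieba.lcut(text)  # segment the text into a list of words
    filtered_words = [word for word in words if word not in stopwords]
    # Chinese is written without spaces, so join with an empty string
    return "".join(filtered_words)


# Load the Chinese stopword list
stopwords = load_stopwords("stopwords.txt")

# Test
text = "今天是个好日子，我们一起去郊游吧。"
filtered_text = remove_stopwords(text, stopwords)
print(filtered_text)
```

The code first reads a Chinese stopword file as UTF-8 and stores the words in a set, which makes each membership check fast. The `remove_stopwords` function then takes a string and a stopword set, removes the stopwords, and returns the processed text. Inside the function, `jieba` segments the input text into words, each word is checked against the stopword set, and words that are not stopwords are collected in the `filtered_words` list. Finally, `join` concatenates the words in `filtered_words` into a single string, which the function returns.

Note that this is only a basic example. A real application may need further adjustments: punctuation is removed only if it appears in the stopword list, and the result depends on how `jieba` segments the text. It may also need a more complete and accurate Chinese stopword list.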
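As one possible adjustment along those lines, the sketch below merges several stopword sources into a single set and filters an already segmented token list. The function names `build_stopword_set` and `filter_tokens`, and the small word lists, are hypothetical and chosen only for illustration. The tokens are written out by hand so the snippet runs without `jieba`; in practice they would come from `jieba.lcut(text)`.

```python
from typing import Iterable, List, Set


def build_stopword_set(*word_lists: Iterable[str]) -> Set[str]:
    """Merge several stopword sources into one set, ignoring blank entries."""
    merged: Set[str] = set()
    for words in word_lists:
        merged.update(w.strip() for w in words if w.strip())
    return merged


def filter_tokens(tokens: Iterable[str], stopwords: Set[str]) -> List[str]:
    """Keep tokens that are neither stopwords nor pure whitespace."""
    return [t for t in tokens if t.strip() and t not in stopwords]


# Hypothetical inputs: a tiny general-purpose list plus a punctuation list.
general = ["是", "个", "我们", "吧", ""]
punctuation = ["，", "。", "！", "？"]
stopwords = build_stopword_set(general, punctuation)

# Hand-written tokens standing in for the output of jieba.lcut(text).
tokens = ["今天", "是", "个", "好日子", "，", "我们", "一起", "去", "郊游", "吧", "。"]
kept = filter_tokens(tokens, stopwords)
print(" ".join(kept))  # prints: 今天 好日子 一起 去 郊游
```

Returning a list rather than a joined string leaves the choice of separator to the caller: a space-separated string suits tools that expect whitespace-delimited tokens, while `"".join(kept)` reproduces the unspaced output of the original example.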
