Removing stop words with the jieba library
Date: 2023-12-21 08:57:42
The following example shows how to remove stop words from text segmented with the jieba library:
```python
import jieba

# Load the stop-word file (one word per line) into a set for fast lookups
stopwords_file = "stopwords.txt"
with open(stopwords_file, "r", encoding="utf-8") as f:
    stopwords = {line.strip() for line in f}

# Read the text to be processed
input_file = "input.txt"
with open(input_file, "r", encoding="utf-8") as f:
    text = f.read()

# Segment the text with jieba (jieba.cut returns a generator of tokens)
seg = jieba.cut(text)

# Keep only the tokens that are not stop words
seg_list = []
for word in seg:
    if word not in stopwords:
        seg_list.append(word)

# Write the remaining tokens to the output file, separated by spaces
output_file = "output.txt"
with open(output_file, "w", encoding="utf-8") as f:
    f.write(" ".join(seg_list))
```
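The code segments the text with `jieba.cut`, then iterates over the resulting tokens and discards any that appear among the stop words. The stop words are stored in the file `stopwords.txt`, one per line. The filtered tokens are written to `output.txt`, separated by spaces. Because `jieba.cut` also yields punctuation and whitespace as tokens, these remain in the output unless they are listed in the stop-word file.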
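The filtering step can also be pulled out into small reusable functions. The following is a minimal sketch rather than part of the original script: the helper names `load_stopwords` and `remove_stopwords` and the sample data are hypothetical. It assumes the stop words arrive as lines of text and the tokens as any iterable of strings, and it additionally drops whitespace-only tokens, which the script above keeps.

```python
from typing import Iterable, List, Set


def load_stopwords(lines: Iterable[str]) -> Set[str]:
    """Build a stop-word set from lines of text, ignoring blank lines."""
    return {line.strip() for line in lines if line.strip()}


def remove_stopwords(tokens: Iterable[str], stopwords: Set[str]) -> List[str]:
    """Keep tokens that are neither stop words nor whitespace-only."""
    return [tok for tok in tokens if tok.strip() and tok not in stopwords]


# Hypothetical sample data; in the script above the tokens
# would come from jieba.cut(text) and the lines from stopwords.txt.
stopwords = load_stopwords(["的\n", "了\n", "\n"])
tokens = ["我", "的", "猫", " ", "睡", "了"]
print(remove_stopwords(tokens, stopwords))  # ['我', '猫', '睡']
```

Since `remove_stopwords` accepts any iterable, the generator returned by `jieba.cut(text)` can be passed to it directly. Storing the stop words in a set keeps each membership check at roughly constant time, which matters when the stop-word file is large.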