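I have two text files, one containing positive reviews and one containing negative reviews. I want to use these two Chinese text files as a training set and implement a Naive Bayes classifier with sklearn that distinguishes positive reviews from negative ones. Please give example code.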
Time: 2023-10-29 13:05:43
Sure, here is a simple example:
```python
import jieba
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Read the files: one review per line, skipping blank lines
with open('positive.txt', 'r', encoding='utf-8') as f:
    pos_data = [line.strip() for line in f if line.strip()]
with open('negative.txt', 'r', encoding='utf-8') as f:
    neg_data = [line.strip() for line in f if line.strip()]

# Load the stopword list (a set makes membership checks fast)
with open('stopwords.txt', 'r', encoding='utf-8') as f:
    stopwords = set(f.read().splitlines())

# Segment each review with jieba and remove stopwords
pos_words = []
for line in pos_data:
    words = jieba.cut(line, cut_all=False)
    words = [w for w in words if w.strip() and w not in stopwords]
    pos_words.append(' '.join(words))
neg_words = []
for line in neg_data:
    words = jieba.cut(line, cut_all=False)
    words = [w for w in words if w.strip() and w not in stopwords]
    neg_words.append(' '.join(words))

# Convert the text to word-count vectors.
# The default token_pattern only keeps tokens of two or more characters,
# which would drop single-character words such as "好"; this pattern keeps them.
vectorizer = CountVectorizer(token_pattern=r'(?u)\b\w+\b')
X = vectorizer.fit_transform(pos_words + neg_words)
y = [1] * len(pos_words) + [0] * len(neg_words)  # 1 = positive, 0 = negative

# Train the model
clf = MultinomialNB()
clf.fit(X, y)

# Test the model on a new sentence
test_text = '这个产品真心不错!'
test_words = ' '.join(jieba.cut(test_text, cut_all=False))
test_vec = vectorizer.transform([test_words])
pred = clf.predict(test_vec)
if pred[0] == 1:
    print('Positive review')
else:
    print('Negative review')
```
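This example first reads the positive and negative review files, then preprocesses each line by segmenting it with jieba and removing stopwords. It then uses sklearn's CountVectorizer to convert the text into word-count vectors and trains a MultinomialNB Naive Bayes classifier on them. Finally, it predicts the label of a single test sentence and prints the result.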
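A single test sentence says little about how well the classifier generalizes. As a rough sketch of one way to check this, the snippet below holds out part of the data and reports accuracy on it. The tiny review lists here are hypothetical and already segmented (tokens separated by spaces) so the snippet runs without the input files; in practice they would be the `pos_words` and `neg_words` lists built above. The split ratio and `random_state` are arbitrary choices, not recommendations.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical, already-segmented reviews (assumption: stand-ins for the
# pos_words / neg_words lists produced by the jieba preprocessing step).
pos_words = ['质量 很 好', '非常 满意', '物流 快 值得 购买', '真心 不错']
neg_words = ['质量 太 差', '非常 失望', '物流 慢 不 推荐', '做工 粗糙']

texts = pos_words + neg_words
labels = [1] * len(pos_words) + [0] * len(neg_words)  # 1 = positive, 0 = negative

# Hold out a quarter of the data; stratify keeps both classes in each split.
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, stratify=labels, random_state=0
)

# Wrapping both steps in a pipeline means the vocabulary is learned from the
# training split only. token_pattern keeps single-character tokens.
model = make_pipeline(
    CountVectorizer(token_pattern=r'(?u)\b\w+\b'),
    MultinomialNB(),
)
model.fit(X_train, y_train)

y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
print(f'held-out accuracy: {accuracy:.2f}')
```

With only a handful of reviews the printed number is not meaningful; the point is the structure. On a real data set, the same pipeline can be passed to `cross_val_score` from `sklearn.model_selection` for a more stable estimate.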