Using scikit-learn, classify documents from the four categories ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian'] of the "20 newsgroups" text dataset with a naive Bayes classifier
Posted: 2023-12-30 14:03:00
Sure. First, import the required libraries and load the dataset:
```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
# Load the train and test splits, restricted to the four target categories
categories = ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)
```
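Next, the text has to be converted into numeric vectors. `CountVectorizer` does this by building a vocabulary from the training documents and counting how often each word occurs in each document.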
```python
# Convert text to word-count vectors; fit the vocabulary on the training set only
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(newsgroups_train.data)
X_test = vectorizer.transform(newsgroups_test.data)
```
Then we can train a `MultinomialNB` model on the count vectors and use it to make predictions.
```python
# Train the model
clf = MultinomialNB()
clf.fit(X_train, newsgroups_train.target)
# Predict class labels for both splits
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
# Compute accuracy
acc_train = accuracy_score(newsgroups_train.target, y_pred_train)
acc_test = accuracy_score(newsgroups_test.target, y_pred_test)
print("Training accuracy:", acc_train)
print("Test accuracy:", acc_test)
```
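The last two lines print the accuracy on the training set and on the test set. The test accuracy is the number to look at, since the training accuracy is measured on documents the model has already seen.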
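If you want to try the same fit/predict flow without downloading the dataset, the sketch below runs it on a tiny hand-made corpus. The sentences and their labels are invented for illustration. Using `make_pipeline` is a convenience swapped in here rather than part of the answer above: it chains the vectorizer and the classifier so that raw strings can be passed to `fit` and `predict` directly.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hand-made training data, for illustration only
train_docs = [
    "the patient was given a new drug",
    "doctors recommend this treatment",
    "render the image with opengl",
    "the graphics card draws polygons",
]
train_labels = ["sci.med", "sci.med", "comp.graphics", "comp.graphics"]

# Vectorizer and classifier chained into a single estimator
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_docs, train_labels)

# Classify two unseen sentences
new_docs = ["which drug did the doctors recommend", "opengl polygons image"]
pred = model.predict(new_docs)
print(list(pred))  # ['sci.med', 'comp.graphics']

# Accuracy on the training sentences themselves
toy_acc = accuracy_score(train_labels, model.predict(train_docs))
print("Training accuracy:", toy_acc)
```

With the real data, the same pipeline object could be fitted on `newsgroups_train.data` and `newsgroups_train.target`, which avoids keeping separate `X_train` and `X_test` matrices around.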