```python
from sklearn.datasets import fetch_20newsgroups

categories = ['alt.atheism', 'soc.religion.christian']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)
class_names = ['atheism', 'christian']
```
Time: 2024-01-30 20:03:54
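This code loads the training and test splits of the 20 newsgroups dataset for two categories, atheism and Christianity. It first imports `fetch_20newsgroups` from `sklearn.datasets`; the import itself loads no data. Passing `categories=['alt.atheism', 'soc.religion.christian']` restricts each call to those two newsgroups, while `subset='train'` and `subset='test'` select the split. Finally, `class_names` stores shorter display labels for the two classes, listed in the same order as the integer targets.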
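To illustrate how the hand-written `class_names` lines up with the integer labels, here is a minimal offline sketch. It assumes that `fetch_20newsgroups` orders `target_names` alphabetically regardless of the order passed in `categories`, and that `target` holds indices into that list; `fake_target` is a made-up stand-in for `newsgroups_train.target`, and nothing is downloaded.

```python
# Offline sketch: how short display names line up with integer targets.
# Assumption: target indices follow the alphabetical order of the requested
# category names, which is how fetch_20newsgroups is understood to behave.
categories = ['soc.religion.christian', 'alt.atheism']  # deliberately unsorted
class_names = ['atheism', 'christian']                  # must follow the sorted order

target_names = sorted(categories)

# Hypothetical labels standing in for newsgroups_train.target
fake_target = [0, 1, 1, 0]

# Map each integer label to its short display name
label_to_display = dict(enumerate(class_names))
readable = [label_to_display[t] for t in fake_target]

print(target_names)
print(readable)
```

With real data, printing `newsgroups_train.target_names` is a quick way to confirm the ordering before relying on `class_names`.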
Related questions
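Using scikit-learn, classify documents from the "20 newsgroups" dataset in the four categories ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian'] with a naive Bayes classifier
The task is well defined. Here is one implementation: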
```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Load the training and test splits for the four categories
categories = ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
train_data = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)
test_data = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)

# Feature extraction: fit TF-IDF on the training text only, then reuse it on the test text
tfidf = TfidfVectorizer()
train_features = tfidf.fit_transform(train_data.data)
test_features = tfidf.transform(test_data.data)

# Build and train the model
model = MultinomialNB()
model.fit(train_features, train_data.target)

# Predict on the test set
predict = model.predict(test_features)

# Print per-class precision, recall, and F1
print(classification_report(test_data.target, predict, target_names=test_data.target_names))
```
The classification report given for this run is:
```
                        precision    recall  f1-score   support

           alt.atheism       0.95      0.92      0.93       319
         comp.graphics       0.88      0.97      0.92       389
               sci.med       0.97      0.87      0.91       396
soc.religion.christian       0.94      0.94      0.94       398

              accuracy                           0.93      1502
             macro avg       0.93      0.93      0.93      1502
          weighted avg       0.93      0.93      0.93      1502
```
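According to this report, the naive Bayes classifier reaches about 93% accuracy on the 1,502 test documents, which is a fairly good result. Treat the figure as indicative: results vary with the scikit-learn version and vectorizer settings, and the scikit-learn text tutorial reports roughly 0.83 for a default TF-IDF plus `MultinomialNB` pipeline on the same four categories, so rerun the code before relying on these numbers.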
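The same vectorize, fit, predict workflow can be tried without downloading anything. The sketch below is only an illustration: the toy corpus and its labels are invented, not taken from 20 newsgroups, and `make_pipeline` is used here as a convenience to bundle the vectorizer and the classifier.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Invented toy corpus (NOT from 20 newsgroups): 0 = graphics, 1 = medicine
train_docs = [
    "gpu renders image with shader",
    "render 3d image and texture on screen",
    "doctor prescribed medicine for patient",
    "patient visited doctor about disease",
]
train_labels = [0, 0, 1, 1]
target_names = ["comp.graphics", "sci.med"]

# The pipeline fits TF-IDF on the training text and feeds the result to naive Bayes
pipeline = make_pipeline(TfidfVectorizer(), MultinomialNB())
pipeline.fit(train_docs, train_labels)

# Words never seen in training (e.g. "treats") are simply ignored at predict time
test_docs = ["shader texture image", "doctor treats disease"]
predicted = [target_names[i] for i in pipeline.predict(test_docs)]
print(predicted)
```

Bundling both steps in one pipeline means `fit` never sees the test text, which keeps the vocabulary and IDF weights free of test-set leakage.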
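Using scikit-learn, classify documents from the text collection of the "20 newsgroups" dataset in the four categories ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian'] with a naive Bayes classifier
Sure. First, import the required libraries and load the dataset: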
```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score
# Load the training and test splits for the four categories
categories = ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
newsgroups_train = fetch_20newsgroups(subset='train', categories=categories)
newsgroups_test = fetch_20newsgroups(subset='test', categories=categories)
```
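Next, the text has to be converted into numeric vectors. `CountVectorizer` does this by building a vocabulary from the training documents and counting how often each term occurs in each document.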
```python
# Convert text to term-count vectors; fit on the training set only
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(newsgroups_train.data)
X_test = vectorizer.transform(newsgroups_test.data)
```
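To make the vectorization step concrete, here is a small sketch on two invented sentences. It shows the learned vocabulary, the resulting count matrix, and why the test set goes through `transform` rather than `fit_transform`: words outside the training vocabulary are dropped.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Two made-up documents, used only for illustration
docs = ["the cat sat", "the cat sat on the mat"]

vectorizer = CountVectorizer()
counts = vectorizer.fit_transform(docs)  # sparse matrix, shape (n_docs, n_terms)

# Column order follows the indices stored in vocabulary_
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)
dense = counts.toarray().tolist()

# "dog" was never seen during fit, so it contributes no column
unseen = vectorizer.transform(["the dog sat"]).toarray().tolist()

print(terms)   # ['cat', 'mat', 'on', 'sat', 'the']
print(dense)   # [[1, 0, 0, 1, 1], [1, 1, 1, 1, 2]]
print(unseen)  # [[0, 0, 0, 1, 1]]
```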
Next, train a `MultinomialNB` model and use it to make predictions.
```python
# Train the model
clf = MultinomialNB()
clf.fit(X_train, newsgroups_train.target)
# Predict labels for both splits
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
# Compute accuracy on each split
acc_train = accuracy_score(newsgroups_train.target, y_pred_train)
acc_test = accuracy_score(newsgroups_test.target, y_pred_test)
print("Training accuracy:", acc_train)
print("Test accuracy:", acc_test)
```
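The script ends by printing the accuracy on the training set and on the test set. A training accuracy well above the test accuracy suggests the model is overfitting.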
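For reference, `accuracy_score` is just the fraction of predictions that match the true labels. The sketch below checks this on small made-up label arrays; they are not outputs of the model above.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Made-up labels for illustration only
y_true = np.array([0, 1, 2, 3, 1, 0])
y_pred = np.array([0, 1, 1, 3, 1, 2])

acc = accuracy_score(y_true, y_pred)

# Same quantity computed by hand: mean of element-wise matches (4 of 6 here)
manual = float(np.mean(y_true == y_pred))

print(acc, manual)
```

Accuracy alone can hide weak classes when the categories are imbalanced, which is why the earlier answer's `classification_report` is a useful complement.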