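Using PyCharm and scikit-learn, classify the text of the "20 newsgroups" dataset with a naive Bayes classifier. 1. Write the program in PyCharm with scikit-learn. 2. Following the tutorial, use a naive Bayes classifier to classify documents from the four categories ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian'] of the 20 newsgroups dataset. 3. Analyze the classification results, reporting the precision, recall, and F1-score for each category.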
Posted: 2024-03-25 21:37:56 Views: 193
First, load the training and test splits of the 20 newsgroups dataset, restricted to the four categories:
```python
from sklearn.datasets import fetch_20newsgroups

# Restrict the dataset to the four categories named in the task.
categories = ['alt.atheism', 'comp.graphics', 'sci.med', 'soc.religion.christian']
# A fixed random_state makes the shuffled order reproducible.
data_train = fetch_20newsgroups(subset='train', categories=categories, shuffle=True, random_state=42)
data_test = fetch_20newsgroups(subset='test', categories=categories, shuffle=True, random_state=42)
```
Next, use CountVectorizer to convert the text into token-count features, then use TfidfTransformer to reweight those counts with TF-IDF:
```python
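from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Learn the vocabulary from the training text and build a sparse count matrix.
count_vect = CountVectorizer()
X_train_counts = count_vect.fit_transform(data_train.data)
# Learn IDF weights from the training counts and apply them.
tfidf_transformer = TfidfTransformer()
X_train_tfidf = tfidf_transformer.fit_transform(X_train_counts)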
```
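To make the TF-IDF step concrete, here is a minimal pure-Python sketch of the weighting that TfidfTransformer applies with its default settings: raw term counts multiplied by a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each document. The function name `tfidf_rows` and the two sample sentences are invented for illustration, and the whitespace tokenizer is a simplification of CountVectorizer's regex-based tokenization.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) with one L2-normalized TF-IDF dict per document."""
    # Simplified tokenizer: lowercase and split on whitespace.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents each term occurs in.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    # Smoothed IDF, as if one extra document contained every term.
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {term: count * idf[term] for term, count in counts.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()} if norm else {})
    return sorted(df), rows

# Hypothetical sample documents, not taken from the dataset.
docs = ["the doctor prescribed medicine", "the gpu renders the image"]
vocab, rows = tfidf_rows(docs)
print(vocab)
print(rows[0])
```

In the first document every word occurs once, so "the", which appears in both documents, ends up with a lower weight than "doctor", which appears in only one.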
Then train a naive Bayes classifier with MultinomialNB and predict the labels of the test set:
```python
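from sklearn.naive_bayes import MultinomialNB

# Fit multinomial naive Bayes on the TF-IDF features of the training set.
clf = MultinomialNB().fit(X_train_tfidf, data_train.target)
# Use transform (not fit_transform) so the test set reuses the training vocabulary and IDF weights.
X_test_counts = count_vect.transform(data_test.data)
X_test_tfidf = tfidf_transformer.transform(X_test_counts)
predicted = clf.predict(X_test_tfidf)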
```
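The vectorize, transform, and classify steps above can also be chained in a scikit-learn Pipeline, which removes the risk of calling `fit_transform` on the test data by mistake. The sketch below is one possible arrangement, not the only one: it uses TfidfVectorizer, which is equivalent to CountVectorizer followed by TfidfTransformer, and a tiny invented corpus with made-up labels in place of `data_train` and `data_test` so that it runs without downloading the dataset.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Toy corpus invented for illustration; it stands in for data_train.data / data_train.target.
train_docs = [
    "god faith church prayer",
    "church faith bible",
    "pixel image render graphics",
    "image render shader",
]
train_labels = ["religion", "religion", "graphics", "graphics"]

text_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),  # counts + TF-IDF weighting in one step
    ("clf", MultinomialNB()),      # multinomial naive Bayes on the TF-IDF features
])
text_clf.fit(train_docs, train_labels)

# predict() applies the fitted vectorizer to the new text before classifying.
predicted = text_clf.predict(["faith and prayer", "render an image"])
print(predicted)
```

With the real data, the same pipeline would be fitted with `text_clf.fit(data_train.data, data_train.target)` and evaluated with `text_clf.predict(data_test.data)`.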
Finally, use the classification_report function to compute the precision, recall, and F1-score of each category:
```python
from sklearn.metrics import classification_report
print(classification_report(data_test.target, predicted, target_names=data_test.target_names))
```
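For readers who want to see what the report measures, the following sketch computes the same per-class quantities by hand: precision is TP / (TP + FP), recall is TP / (TP + FN), and F1 is their harmonic mean. The macro average is the unweighted mean over classes, and the weighted average weights each class by its support. The helper `per_class_report` and the toy label lists are hypothetical, written only for this illustration.

```python
from collections import Counter

def per_class_report(y_true, y_pred):
    """Return {label: {"precision", "recall", "f1", "support"}} for each label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, guess in zip(y_true, y_pred):
        if truth == guess:
            tp[truth] += 1
        else:
            fp[guess] += 1  # predicted this label, but it was wrong
            fn[truth] += 1  # missed an instance of the true label
    report = {}
    for label in sorted(set(y_true) | set(y_pred)):
        predicted_as = tp[label] + fp[label]
        support = tp[label] + fn[label]
        precision = tp[label] / predicted_as if predicted_as else 0.0
        recall = tp[label] / support if support else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        report[label] = {"precision": precision, "recall": recall, "f1": f1, "support": support}
    return report

# Toy labels invented for illustration: 2 "med" and 4 "graphics" documents.
y_true = ["med", "med", "graphics", "graphics", "graphics", "graphics"]
y_pred = ["med", "graphics", "graphics", "graphics", "graphics", "graphics"]

report = per_class_report(y_true, y_pred)
macro_f1 = sum(r["f1"] for r in report.values()) / len(report)
total = sum(r["support"] for r in report.values())
weighted_f1 = sum(r["f1"] * r["support"] for r in report.values()) / total
print(report)
print(round(macro_f1, 4), round(weighted_f1, 4))
```

Because the larger class also has the higher F1 here, the weighted average comes out above the macro average, which is why both rows appear in the report.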
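The reported output is shown below. Treat the figures as illustrative, since a rerun may print different values depending on the scikit-learn version and dataset options: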
```
                       precision    recall  f1-score   support

           alt.atheism      0.93      0.80      0.86       319
         comp.graphics      0.94      0.95      0.94       389
               sci.med      0.93      0.94      0.93       396
soc.religion.christian      0.87      0.97      0.92       398

              accuracy                          0.91      1502
             macro avg      0.92      0.91      0.91      1502
          weighted avg      0.92      0.91      0.91      1502
```
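The results show that all four categories reach an F1-score of at least 0.86, with comp.graphics the highest at 0.94. alt.atheism has the lowest recall (0.80), while soc.religion.christian has the lowest precision (0.87) but the highest recall (0.97).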
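The report shows how well each category does but not which categories are mistaken for one another. One plausible reading of the table is that some alt.atheism posts are being labeled soc.religion.christian, but that is only a hypothesis until the individual errors are counted. The sketch below counts (true, predicted) pairs in plain Python; the helper names and the short label lists are invented for illustration, and scikit-learn's `confusion_matrix` in `sklearn.metrics` provides the same information as an array.

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Count how often each (true label, predicted label) pair occurs."""
    return Counter(zip(y_true, y_pred))

def top_confusions(y_true, y_pred, k=3):
    """Return the k most frequent misclassifications as ((true, predicted), count)."""
    pairs = confusion_counts(y_true, y_pred)
    errors = Counter({pair: n for pair, n in pairs.items() if pair[0] != pair[1]})
    return errors.most_common(k)

# Toy labels invented for illustration; with the real data, pass the
# category names looked up from data_test.target and predicted.
y_true = ["alt.atheism", "alt.atheism", "alt.atheism", "soc.religion.christian", "sci.med"]
y_pred = ["soc.religion.christian", "soc.religion.christian", "alt.atheism",
          "soc.religion.christian", "comp.graphics"]

for (truth, guess), count in top_confusions(y_true, y_pred):
    print(f"{truth} -> {guess}: {count}")
```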