Using scikit-learn, classify text from the "20 newsgroups" dataset with a Naive Bayes classifier
Sure, I can provide some code to accomplish this task.
First, you need to install the scikit-learn package, which can be done with the following command:
```
pip install scikit-learn
```
Then, import the necessary libraries:
```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report
```
Next, load the dataset with the following code:
```python
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
train_data = fetch_20newsgroups(subset='train', categories=categories)
test_data = fetch_20newsgroups(subset='test', categories=categories)
```
This loads the training and test splits of the "20 newsgroups" dataset restricted to four categories.
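If you want to confirm what was loaded, you can inspect the returned objects; `target_names` and `data` are attributes of the bunch object that `fetch_20newsgroups` returns. This is an optional sanity check, not a required step:
```python
# Optional sanity check on the loaded subsets
print(train_data.target_names)   # the four category names
print(len(train_data.data))      # number of training documents
print(len(test_data.data))       # number of test documents
print(train_data.data[0][:200])  # beginning of the first training document
```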
Next, use CountVectorizer to convert the text into bag-of-words count vectors:
```python
vectorizer = CountVectorizer()
train_vectors = vectorizer.fit_transform(train_data.data)
test_vectors = vectorizer.transform(test_data.data)
```
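At this point `train_vectors` and `test_vectors` are sparse count matrices of shape (number of documents, vocabulary size). As an optional variation that is not part of the steps above, `TfidfVectorizer` is a common drop-in replacement for `CountVectorizer` if you prefer TF-IDF weighting over raw counts:
```python
# The count vectors are sparse matrices: (n_documents, n_vocabulary_terms)
print(train_vectors.shape)
print(test_vectors.shape)

# Optional alternative: TF-IDF weighting instead of raw term counts
from sklearn.feature_extraction.text import TfidfVectorizer
tfidf = TfidfVectorizer()
train_tfidf = tfidf.fit_transform(train_data.data)
test_tfidf = tfidf.transform(test_data.data)
```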
Then, create a multinomial Naive Bayes classifier and train it:
```python
classifier = MultinomialNB()
classifier.fit(train_vectors, train_data.target)
```
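Once trained, the classifier can also label new, unseen text. The two example strings below are made up purely for illustration; any text can be passed through the same vectorizer:
```python
# Classify a couple of new documents (the example strings are illustrative only)
new_docs = ["OpenGL on the GPU is fast", "God is love"]
new_vectors = vectorizer.transform(new_docs)
for doc, label in zip(new_docs, classifier.predict(new_vectors)):
    print(f"{doc!r} => {train_data.target_names[label]}")
```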
Finally, classify the test data and print a classification report with the following code:
```python
predicted = classifier.predict(test_vectors)
print(classification_report(test_data.target, predicted))
```
This prints a classification report that includes precision, recall, and F1-score for each category.
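If you also want a single accuracy number or a confusion matrix, `sklearn.metrics` provides `accuracy_score` and `confusion_matrix` as well; this is an optional addition to the report above:
```python
from sklearn.metrics import accuracy_score, confusion_matrix

print("Accuracy:", accuracy_score(test_data.target, predicted))
print(confusion_matrix(test_data.target, predicted))
```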
The complete code is shown below:
```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report

# Load the training and test splits for four categories
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
train_data = fetch_20newsgroups(subset='train', categories=categories)
test_data = fetch_20newsgroups(subset='test', categories=categories)

# Convert the raw text into bag-of-words count vectors
vectorizer = CountVectorizer()
train_vectors = vectorizer.fit_transform(train_data.data)
test_vectors = vectorizer.transform(test_data.data)

# Train a multinomial Naive Bayes classifier
classifier = MultinomialNB()
classifier.fit(train_vectors, train_data.target)

# Evaluate on the test split
predicted = classifier.predict(test_vectors)
print(classification_report(test_data.target, predicted))
```
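As an optional variant, `make_pipeline` from `sklearn.pipeline` can combine the vectorizer and the classifier into a single estimator, so raw text is passed straight to `fit` and `predict`. Here is a minimal sketch of that style, equivalent in spirit to the code above:
```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline
from sklearn.metrics import classification_report

categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
train_data = fetch_20newsgroups(subset='train', categories=categories)
test_data = fetch_20newsgroups(subset='test', categories=categories)

# Vectorizer and classifier combined into one estimator
model = make_pipeline(CountVectorizer(), MultinomialNB())
model.fit(train_data.data, train_data.target)   # raw text goes in directly
predicted = model.predict(test_data.data)
print(classification_report(test_data.target, predicted))
```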