Filtering a Spam Email Dataset with Naive Bayes
Date: 2024-02-25 13:59:41
The following example code uses Python's scikit-learn library and the naive Bayes algorithm to train and test a classifier on the Enron-Spam dataset:
```python
from sklearn.naive_bayes import MultinomialNB
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
# Load the data (the CSV is expected to have 'text' and 'class' columns)
data = pd.read_csv('enron-spam-dataset/spam.csv', encoding='latin-1')
# Split the dataset into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(data['text'], data['class'], test_size=0.2, random_state=42)
# Convert the text into count vectors (vocabulary is learned from the training set only)
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)
# Train the naive Bayes classifier
clf = MultinomialNB()
clf.fit(X_train_counts, y_train)
# Transform the test set with the fitted vectorizer and compute the accuracy
X_test_counts = vectorizer.transform(X_test)
accuracy = clf.score(X_test_counts, y_test)
print('Accuracy:', accuracy)
```
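Here, 'enron-spam-dataset/spam.csv' is the path to the Enron-Spam dataset file, which the code expects to contain a 'text' column and a 'class' column. The code splits the dataset into an 80% training set and a 20% test set, with random_state=42 making the split reproducible. CountVectorizer then converts the text into count vectors; it is fitted on the training set only and reused to transform the test set, so no test-set vocabulary leaks into training. Finally, the code trains a MultinomialNB naive Bayes classifier and reports its accuracy (the fraction of correctly classified emails) on the test set.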
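To make the classification step less of a black box, here is a minimal pure-Python sketch of the idea behind a multinomial naive Bayes classifier on word counts. It is an illustration, not scikit-learn's implementation: the function names `train_multinomial_nb` and `predict`, the whitespace tokenization, and the four toy emails are all hypothetical and chosen only for this example. The smoothing value `alpha=1.0` (Laplace smoothing) mirrors the default of MultinomialNB.

```python
import math
from collections import Counter, defaultdict


def train_multinomial_nb(docs, labels, alpha=1.0):
    """Collect class priors and per-class word counts (hypothetical helper)."""
    class_doc_counts = Counter(labels)
    word_counts = defaultdict(Counter)  # label -> word -> count
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = text.lower().split()  # naive whitespace tokenization (assumption)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    total_docs = len(docs)
    return {
        "alpha": alpha,
        "vocab": vocab,
        "log_prior": {c: math.log(n / total_docs) for c, n in class_doc_counts.items()},
        "word_counts": word_counts,
        "total_words": {c: sum(word_counts[c].values()) for c in class_doc_counts},
    }


def predict(model, text):
    """Return the label with the highest log posterior for one document."""
    alpha, vocab = model["alpha"], model["vocab"]
    scores = {}
    for label, log_prior in model["log_prior"].items():
        # Smoothed denominator: words seen in this class plus alpha per vocabulary word
        denom = model["total_words"][label] + alpha * len(vocab)
        score = log_prior
        for token in text.lower().split():
            if token not in vocab:
                continue  # words never seen in training are skipped
            score += math.log((model["word_counts"][label][token] + alpha) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


# Toy data, made up for illustration only
train_docs = [
    "win cash prize now",
    "cheap pills win big",
    "meeting agenda for monday",
    "please review the attached report",
]
train_labels = ["spam", "spam", "ham", "ham"]

model = train_multinomial_nb(train_docs, train_labels)
print(predict(model, "win a cash prize"))       # spam
print(predict(model, "monday meeting report"))  # ham
```

Working in log space avoids numerical underflow when many small word probabilities are multiplied, and the smoothing term keeps a word that never appeared in one class from driving that class's probability to zero.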