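Hand-code a naive Bayes classifier (with Laplace smoothing) to filter spam; the program should download the required dataset itself. Provide the complete code.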
Posted: 2024-03-26 10:36:17
Sure. Below is a complete hand-written naive Bayes classifier (with Laplace smoothing) for filtering spam. It uses the numpy and pandas libraries in Python.
```python
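import re

import numpy as np
import pandas as pd


# Naive Bayes classifier (multinomial model over word counts)
class NaiveBayesClassifier:
    def __init__(self, alpha=1):
        self.alpha = alpha      # Laplace smoothing parameter
        self.num_classes = 2    # binary task: 0 = ham, 1 = spam
        self.vocab = {}         # word -> column index, built in fit()
        self.counts = None      # per-class word counts, allocated in fit()

    @staticmethod
    def tokenize(text):
        # Lowercase a raw message and split it into word tokens
        return re.findall(r"[a-z0-9']+", str(text).lower())

    # Train the model
    def fit(self, X, y):
        y = np.asarray(y)
        docs = [self.tokenize(x) for x in X]
        # Build the vocabulary: one index per distinct training word
        self.vocab = {}
        for words in docs:
            for word in words:
                self.vocab.setdefault(word, len(self.vocab))
        # Count word occurrences per class; the last column is reserved
        # for words that never appear in the training data
        self.counts = np.zeros((self.num_classes, len(self.vocab) + 1))
        for words, label in zip(docs, y):
            for word in words:
                self.counts[label, self.vocab[word]] += 1
        # Log priors and Laplace-smoothed log conditional probabilities
        self.priors = np.zeros(self.num_classes)
        self.conditionals = np.zeros_like(self.counts)
        for i in range(self.num_classes):
            self.priors[i] = np.log((y == i).sum() / len(y))
            self.conditionals[i] = np.log(
                (self.counts[i] + self.alpha)
                / (self.counts[i].sum() + self.alpha * (len(self.vocab) + 1))
            )

    # Predict
    def predict(self, X):
        unk = len(self.vocab)  # index of the unseen-word column
        yhat = np.zeros(len(X), dtype=int)
        for i, x in enumerate(X):
            idx = [self.vocab.get(word, unk) for word in self.tokenize(x)]
            sums = [self.priors[j] + self.conditionals[j][idx].sum()
                    for j in range(self.num_classes)]
            yhat[i] = np.argmax(sums)
        return yhat

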
# Load the dataset. The URL and the 'text'/'label' column names are kept
# from the original answer; adjust them if the hosted file differs.
spam = pd.read_csv('https://raw.githubusercontent.com/justmarkham/pycon-2016-tutorial/master/data/sms_spam.csv', encoding='latin-1')
X = spam['text'].values
y = spam['label'].values
y = np.array([1 if label == 'spam' else 0 for label in y])
# Split into training and test sets (the last 1000 shuffled rows are the test set)
np.random.seed(0)
indices = np.random.permutation(len(X))
X_train = X[indices[:-1000]]
y_train = y[indices[:-1000]]
X_test = X[indices[-1000:]]
y_test = y[indices[-1000:]]
# Train the model and predict
clf = NaiveBayesClassifier(alpha=1)
clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)
# Compute accuracy
accuracy = np.mean(y_pred == y_test)
print("Accuracy:", accuracy)
```
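In the code above, a naive Bayes classifier is used to filter spam. Laplace smoothing keeps any conditional probability from being zero, and working with log probabilities avoids numerical underflow. The script loads a spam dataset from a CSV file hosted on GitHub and splits it into a training set and a 1000-message test set. Finally, it reports the model's accuracy on the test set.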
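To make the two numerical points concrete, here is a minimal sketch on made-up numbers; it is not part of the classifier. The word counts, the 3-word vocabulary, and the 1000 probabilities of 1e-5 are all assumptions chosen for illustration. It shows that an unsmoothed estimate gives an unseen word probability 0, that the smoothed estimate `(count + alpha) / (total + alpha * V)` does not, and that a long product of small probabilities underflows while the sum of their logs stays finite.

```python
import numpy as np

# Toy word counts for one class over a 3-word vocabulary (made-up numbers);
# the third word never appeared in this class during training.
counts = np.array([3.0, 1.0, 0.0])
alpha = 1.0          # Laplace smoothing strength
V = len(counts)      # vocabulary size

# Unsmoothed estimate: the unseen word gets probability exactly 0,
# which would zero out the whole product for any message containing it.
p_raw = counts / counts.sum()

# Laplace-smoothed estimate: (count + alpha) / (total + alpha * V)
p_smooth = (counts + alpha) / (counts.sum() + alpha * V)

# Multiplying many small probabilities underflows to 0.0 in float64,
# while summing their logarithms stays finite.
probs = np.full(1000, 1e-5)
product = np.prod(probs)           # underflows to 0.0
log_sum = np.sum(np.log(probs))    # about -11512.9

print("raw:", p_raw)
print("smoothed:", p_smooth)
print("product:", product, "log-sum:", log_sum)
```

Because the logarithm is monotonic, comparing log-scores across classes picks the same class as comparing the products would, so nothing is lost by staying in log space.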