Spam email classification with Naive Bayes, without sklearn or nltk
You can implement Naive Bayes spam classification with pandas and the Python standard library. Here is a simple implementation:
```python
import pandas as pd
from collections import Counter

# Read the data (SMS Spam Collection format: column v1 = label, v2 = text)
data = pd.read_csv('spam.csv', encoding='latin-1')
data = data[['v1', 'v2']]
data = data.rename(columns={'v1': 'label', 'v2': 'text'})

# Split into training and test sets (80% / 20%)
train_data = data.sample(frac=0.8, random_state=1)
test_data = data.drop(train_data.index)

# Prior probabilities P(spam) and P(ham)
label_counts = train_data['label'].value_counts()
spam_count = label_counts['spam']
ham_count = label_counts['ham']
total_count = len(train_data)
p_spam = spam_count / total_count
p_ham = ham_count / total_count

# Collect the words of each class
spam_words = []
ham_words = []
for _, row in train_data.iterrows():
    words = row['text'].split()
    if row['label'] == 'spam':
        spam_words += words
    else:
        ham_words += words

# Conditional probabilities P(word | class) with Laplace (add-one) smoothing
spam_word_count = len(spam_words)
ham_word_count = len(ham_words)
spam_counter = Counter(spam_words)
ham_counter = Counter(ham_words)
vocab = set(spam_words) | set(ham_words)
vocab_size = len(vocab)

spam_word_dict = {}
ham_word_dict = {}
for word in vocab:
    spam_word_dict[word] = (spam_counter[word] + 1) / (spam_word_count + vocab_size)
    ham_word_dict[word] = (ham_counter[word] + 1) / (ham_word_count + vocab_size)

# Classify a new text
def predict(text):
    words = text.split()
    p_spam_given_text = p_spam
    p_ham_given_text = p_ham
    for word in words:
        # Unseen words fall back to the smoothed probability of a zero-count word
        p_spam_given_text *= spam_word_dict.get(word, 1 / (spam_word_count + vocab_size))
        p_ham_given_text *= ham_word_dict.get(word, 1 / (ham_word_count + vocab_size))
    return 'spam' if p_spam_given_text > p_ham_given_text else 'ham'
```
Here, `data` is a dataset containing labels and texts, and `train_data` and `test_data` are the training and test sets obtained by splitting it. `p_spam` and `p_ham` are the prior probabilities, while `spam_word_dict` and `ham_word_dict` hold the conditional probabilities with Laplace smoothing. The `predict` function classifies a new text.
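As a quick sanity check, the held-out `test_data` can be used to measure accuracy; a minimal sketch, reusing the `predict` function defined above:
```python
# Measure the accuracy of predict() on the held-out test set
correct = 0
for _, row in test_data.iterrows():
    if predict(row['text']) == row['label']:
        correct += 1
print(f'Test accuracy: {correct / len(test_data):.4f}')
```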
Note that this is only a simple implementation; there is still plenty of room for optimization.
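For example, one common refinement is to accumulate log probabilities instead of multiplying raw probabilities, which avoids floating-point underflow on long messages. A minimal sketch of such a variant, assuming the `p_spam`, `p_ham`, `spam_word_dict`, `ham_word_dict`, `spam_word_count`, `ham_word_count`, and `vocab_size` values defined above:
```python
import math

def predict_log(text):
    # Accumulate log probabilities to avoid underflow on long texts
    log_p_spam = math.log(p_spam)
    log_p_ham = math.log(p_ham)
    for word in text.split():
        # Unseen words fall back to the smoothed zero-count probability
        log_p_spam += math.log(spam_word_dict.get(word, 1 / (spam_word_count + vocab_size)))
        log_p_ham += math.log(ham_word_dict.get(word, 1 / (ham_word_count + vocab_size)))
    return 'spam' if log_p_spam > log_p_ham else 'ham'
```
Other easy improvements include lowercasing the text and stripping punctuation during tokenization, applied consistently to both training and prediction.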