Spam classification with naive Bayes, without sklearn
Date: 2023-05-18 10:06:56
You can implement a naive Bayes spam classifier with Python's nltk library. Here is a simple implementation:
```python
import random

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-time setup: word_tokenize and stopwords need these NLTK resources
# (newer NLTK releases may ask for 'punkt_tab' instead of 'punkt').
# nltk.download('punkt')
# nltk.download('stopwords')

stop_words = set(stopwords.words('english'))


def read_emails(path):
    # Assumption: each training file stores one email per line
    with open(path, 'r') as f:
        return [line.strip() for line in f if line.strip()]


def preprocess(text):
    # Tokenize, lowercase, and keep alphabetic tokens that are not stop words
    return [word.lower() for word in word_tokenize(text)
            if word.isalpha() and word.lower() not in stop_words]


# Read the spam and ham (legitimate) emails as lists of token lists
spam_docs = [preprocess(email) for email in read_emails('spam.txt')]
ham_docs = [preprocess(email) for email in read_emails('ham.txt')]

# Build the vocabulary
all_words = set(word for doc in spam_docs + ham_docs for word in doc)
word_features = sorted(all_words)


def extract_features(document):
    # Map each vocabulary word to whether it occurs in the document
    document_words = set(document)
    return {word: (word in document_words) for word in word_features}


# Build the training set: one (features, label) pair per email
spam_features = [(extract_features(doc), 'spam') for doc in spam_docs]
ham_features = [(extract_features(doc), 'ham') for doc in ham_docs]
train_set = spam_features + ham_features
random.shuffle(train_set)

# Train the model
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Test the model: each test file is treated as a single email and goes
# through the same preprocessing as the training data
with open('test_spam.txt', 'r') as f:
    test_spam_words = preprocess(f.read())
with open('test_ham.txt', 'r') as f:
    test_ham_words = preprocess(f.read())
test_spam_features = extract_features(test_spam_words)
test_ham_features = extract_features(test_ham_words)
print('Test Spam:', classifier.classify(test_spam_features))
print('Test Ham:', classifier.classify(test_ham_features))
```
Here, spam.txt and ham.txt hold the spam and legitimate (ham) emails used for training, and test_spam.txt and test_ham.txt hold the emails used for testing.
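
If you would rather not depend on nltk either, the classifier itself is small enough to write with the standard library. The following is a minimal sketch, not the implementation nltk uses: it is the multinomial variant (word counts with add-one Laplace smoothing) rather than the boolean word-presence features in the code above. The names `train_nb` and `classify_nb` and the toy emails are made up for illustration.

```python
import math
from collections import Counter


def train_nb(docs):
    """docs is a list of (tokens, label) pairs; returns the counts needed to classify."""
    label_counts = Counter()   # number of documents per label
    word_counts = {}           # label -> Counter of word occurrences
    vocab = set()
    for tokens, label in docs:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokens)
        vocab.update(tokens)
    return label_counts, word_counts, vocab


def classify_nb(model, tokens):
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    best_label, best_score = None, float('-inf')
    for label, n_docs in label_counts.items():
        # Start from the log prior P(label)
        score = math.log(n_docs / total_docs)
        total_words = sum(word_counts[label].values())
        for word in tokens:
            if word not in vocab:
                continue  # ignore words never seen during training
            # Add-one smoothing keeps unseen (word, label) pairs from zeroing the product
            score += math.log((word_counts[label][word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data, invented for illustration only
train_docs = [
    (['win', 'free', 'prize', 'now'], 'spam'),
    (['free', 'cash', 'offer'], 'spam'),
    (['meeting', 'schedule', 'tomorrow'], 'ham'),
    (['project', 'report', 'meeting'], 'ham'),
]
model = train_nb(train_docs)
print(classify_nb(model, ['free', 'prize']))      # spam
print(classify_nb(model, ['meeting', 'report']))  # ham
```

Scores are summed in log space so that multiplying many small probabilities does not underflow. On real data you would feed in the tokenized, stop-word-filtered emails instead of the toy lists.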