Write Python code for spam email classification
Posted: 2023-05-25 09:02:25
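A Python implementation of spam email classification based on the Naive Bayes algorithm
Below is a simple Python program for spam email classification: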
```python
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Download the NLTK resources used below (skipped if already present).
# Newer NLTK releases need "punkt_tab" for word_tokenize; older ones use "punkt".
for resource in ("punkt", "punkt_tab", "stopwords"):
    nltk.download(resource, quiet=True)


def read_emails(path):
    """Read one email per line, skipping blank lines."""
    with open(path, "r", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


stop_words = set(stopwords.words("english"))


def preprocess(email):
    """Tokenize, lowercase, and drop stop words; return a space-joined string."""
    tokens = [word.lower() for word in word_tokenize(email)]
    return " ".join(word for word in tokens if word not in stop_words)


# Read the email data (spam.txt and ham.txt are assumed to hold one email per line)
spam = read_emails("spam.txt")
ham = read_emails("ham.txt")

# Tokenize and remove stop words, keeping one document per email
documents = [preprocess(email) for email in spam + ham]
y = [1] * len(spam) + [0] * len(ham)  # 1 = spam, 0 = ham

# Feature extraction: one row of word counts per email
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(documents)

# Build and train the model
clf = MultinomialNB()
clf.fit(X, y)

# Predict a new email (passed as a single document, not word by word)
email_to_predict = "Buy Viagra now, 50% off!"
X_new = vectorizer.transform([preprocess(email_to_predict)])
if clf.predict(X_new)[0] == 1:
    print("This is a spam email.")
else:
    print("This is a ham email.")
```
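This code classifies emails with a Naive Bayes classifier. It first reads the spam and ham (legitimate) email data, tokenizes it, and removes stop words. It then uses CountVectorizer to extract features, converting the text into a word-count matrix. Finally, it trains a MultinomialNB classifier and predicts the class of a new email.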
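To try the same steps without the `spam.txt` and `ham.txt` files, here is a minimal sketch that runs on a few in-memory emails. The training sentences, the labels, and the `classify` helper are all made up for illustration. It also swaps NLTK for scikit-learn's built-in English stop word list (`stop_words="english"`), which is an assumption made for convenience and not part of the code above.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy data, invented for illustration only.
train_emails = [
    "Win a free prize now",
    "Cheap pills, buy now",
    "Limited offer, claim your free money",
    "Meeting moved to Monday morning",
    "Please review the attached report",
    "Lunch tomorrow with the project team",
]
train_labels = [1, 1, 1, 0, 0, 0]  # 1 = spam, 0 = ham

# CountVectorizer lowercases and tokenizes each email; stop_words="english"
# drops common English words using scikit-learn's built-in list.
model = make_pipeline(CountVectorizer(stop_words="english"), MultinomialNB())
model.fit(train_emails, train_labels)

# Inspect the word-count matrix: one row per email, one column per word.
vectorizer = model.named_steps["countvectorizer"]
counts = vectorizer.transform(train_emails)
print(counts.shape)


def classify(email):
    """Return 'spam' or 'ham' for a single email string (hypothetical helper)."""
    return "spam" if model.predict([email])[0] == 1 else "ham"


print(classify("Claim your free prize now"))
print(classify("Please review the report before the meeting"))
```

Wrapping the vectorizer and the classifier in one pipeline means a new email goes through exactly the same preprocessing as the training data. Six emails are far too few for a usable model, so the predictions here only show the mechanics.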