Bayesian spam filtering: Python code
Date: 2023-09-17 18:03:43
Bayesian classification is a commonly used method for filtering spam. Below is a simple example of a Bayesian spam filter implemented in Python:
```
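import math
import os
import re

# For each word, the number of training emails of that class containing it
spam_words = {}
ham_words = {}
# Number of training emails seen per class
doc_counts = {"spam": 0, "ham": 0}


def tokenize(text):
    # Lowercase the text and return its distinct words, so that each word
    # counts at most once per email. Note: \w+ does not segment Chinese
    # text; a run of CJK characters comes out as a single token.
    return set(re.findall(r'\w+', text.lower()))


def train(directory, word_counts, label):
    # Count, for every word, how many emails in `directory` contain it
    for filename in os.listdir(directory):
        with open(os.path.join(directory, filename), 'r', encoding='utf-8') as file:
            words = tokenize(file.read())
        doc_counts[label] += 1
        for word in words:
            word_counts[word] = word_counts.get(word, 0) + 1


def train_spam(directory):
    train(directory, spam_words, "spam")


def train_ham(directory):
    train(directory, ham_words, "ham")


def calculate_spam_prob(text):
    # Returns the log-odds of spam versus ham: positive means spam is more
    # likely. (A sum of log-probabilities alone is never positive, so the
    # score has to compare the two classes.)
    n_spam, n_ham = doc_counts["spam"], doc_counts["ham"]
    # Smoothed class prior
    score = math.log((n_spam + 1) / (n_ham + 1))
    for word in tokenize(text):
        spam_count = spam_words.get(word, 0)
        ham_count = ham_words.get(word, 0)
        if spam_count + ham_count > 0:  # skip words never seen in training
            # Smoothed estimates of P(word | spam) and P(word | ham)
            p_word_spam = (spam_count + 0.5) / (n_spam + 1)
            p_word_ham = (ham_count + 0.5) / (n_ham + 1)
            score += math.log(p_word_spam / p_word_ham)
    return score


def classify_email(text):
    if calculate_spam_prob(text) > 0:
        return "spam"
    else:
        return "ham"


# Train on spam emails (the directory must exist and contain text files)
train_spam("spam_directory")
# Train on legitimate (ham) emails
train_ham("ham_directory")
# Classify an email
email_text = "This is a spam email"
classification = classify_email(email_text)
print(classification)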
```
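For reference, a naive Bayes filter of this kind usually compares the two classes in log space. The following is a sketch of that decision rule, assuming the words $w_1,\dots,w_n$ present in the email are conditionally independent given the class and ignoring words that are absent:

$$
\log\frac{P(\text{spam}\mid w_1,\dots,w_n)}{P(\text{ham}\mid w_1,\dots,w_n)}
= \log\frac{P(\text{spam})}{P(\text{ham})}
+ \sum_{i=1}^{n}\log\frac{P(w_i\mid\text{spam})}{P(w_i\mid\text{ham})}
$$

The email is labeled spam when this quantity is positive. Summing logarithms rather than multiplying probabilities avoids numerical underflow on long emails.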
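In the code above, `tokenize` splits the text into words; `train_spam` and `train_ham` train on spam and legitimate (ham) emails respectively; `calculate_spam_prob` computes a spam score for the email; and `classify_email` decides whether the email is spam. To use it, place real spam and ham emails in the `spam_directory` and `ham_directory` folders, then pass the text of the email to be classified to `classify_email`.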
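To try the workflow without a real corpus, the following self-contained sketch builds two small temporary folders, trains on them, and scores two messages. The helper names (`count_words`, `spam_log_odds`, `write_samples`, `demo`) and the four sample emails are assumptions made up for illustration; the smoothing constants are one reasonable choice, not the only one.

```python
import math
import os
import re
import tempfile


def tokenize(text):
    # Distinct lowercase words; each word counts once per email
    return set(re.findall(r"\w+", text.lower()))


def count_words(directory):
    # Returns (number of emails, {word: number of emails containing it})
    counts, n_docs = {}, 0
    for filename in os.listdir(directory):
        with open(os.path.join(directory, filename), encoding="utf-8") as f:
            words = tokenize(f.read())
        n_docs += 1
        for word in words:
            counts[word] = counts.get(word, 0) + 1
    return n_docs, counts


def spam_log_odds(text, spam_model, ham_model):
    # Positive result: spam is more likely; negative: ham is more likely
    (n_spam, spam_counts), (n_ham, ham_counts) = spam_model, ham_model
    score = math.log((n_spam + 1) / (n_ham + 1))  # smoothed class prior
    for word in tokenize(text):
        s, h = spam_counts.get(word, 0), ham_counts.get(word, 0)
        if s + h > 0:  # ignore words never seen in training
            p_word_spam = (s + 0.5) / (n_spam + 1)
            p_word_ham = (h + 0.5) / (n_ham + 1)
            score += math.log(p_word_spam / p_word_ham)
    return score


def write_samples(directory, texts):
    # Hypothetical helper: writes one file per sample email
    os.makedirs(directory)
    for i, text in enumerate(texts):
        with open(os.path.join(directory, f"{i}.txt"), "w", encoding="utf-8") as f:
            f.write(text)


def demo():
    # Train on invented sample emails stored in a temporary directory
    with tempfile.TemporaryDirectory() as root:
        spam_dir = os.path.join(root, "spam_directory")
        ham_dir = os.path.join(root, "ham_directory")
        write_samples(spam_dir, ["win a free prize now", "free money click now"])
        write_samples(ham_dir, ["meeting agenda for monday", "lunch on monday?"])
        spam_model = count_words(spam_dir)
        ham_model = count_words(ham_dir)
    return (
        spam_log_odds("claim your free prize", spam_model, ham_model),
        spam_log_odds("monday meeting notes", spam_model, ham_model),
    )


spam_score, ham_score = demo()
print("spam" if spam_score > 0 else "ham")
print("spam" if ham_score > 0 else "ham")
```

With so few training emails the scores rest on only a handful of words, so a real filter would need a much larger and more varied corpus before its labels could be trusted.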