Implementing spam filtering in Java
Posted: 2024-03-26 12:40:05
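Spam filtering can be implemented with a machine learning algorithm. This answer shows one approach based on the naive Bayes classifier.
First, prepare two folders, one holding spam emails and one holding normal (ham) emails. Each file in a folder should contain the plain text of a single email.
Then proceed as follows: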
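1. Tokenize: split the text of each email into words and treat each word as a feature.
2. Count word frequencies: for each word, count how many times it occurs in the spam emails and in the ham emails, which gives two frequency vectors.
3. Compute probabilities: use naive Bayes to estimate, for each feature, the probability that it occurs in a spam email and in a ham email.
4. Predict the class: tokenize the new email, look up the per-class probability of each of its features, combine them under the naive Bayes independence assumption into a score for spam and a score for ham, and predict the class with the larger score.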
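The four steps can be summarized as a single decision rule. What follows is a sketch of the standard multinomial naive Bayes formulation with add-one (Laplace) smoothing, not necessarily the exact formula every implementation uses. The symbols are introduced here for illustration: $c$ is a class (spam or ham), $N_c$ the number of training emails in class $c$, $\mathrm{count}(w, c)$ the number of times word $w$ occurs in class $c$, $T_c$ the total number of word occurrences in class $c$, $|V|$ the vocabulary size, and $n_w$ the number of times $w$ occurs in the email being classified.

```latex
\[
P(c) = \frac{N_c}{N_{\text{spam}} + N_{\text{ham}}}, \qquad
P(w \mid c) = \frac{\mathrm{count}(w, c) + 1}{T_c + |V|}
\]
\[
\hat{c} = \arg\max_{c \in \{\text{spam},\, \text{ham}\}}
\Bigl[ \log P(c) + \sum_{w} n_w \log P(w \mid c) \Bigr]
\]
```

Summing logarithms instead of multiplying probabilities avoids floating-point underflow on long emails, and the added 1 keeps a word that was never seen in one class from forcing that class's probability to zero.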
The Java implementation follows:
```java
import java.io.*;
import java.util.*;
public class SpamFilter {
    // Per-class word counts: word -> occurrences across all training emails of that class
    private static final Map<String, Integer> spamWords = new HashMap<>();
    private static final Map<String, Integer> hamWords = new HashMap<>();
    private static final Set<String> vocabulary = new HashSet<>();
    // Total number of word occurrences per class (denominator of the word likelihoods)
    private static long spamTokens = 0;
    private static long hamTokens = 0;
    // Class priors P(spam) and P(ham), estimated from the number of training emails
    private static double pSpam = 0.0;
    private static double pHam = 0.0;

    public static void main(String[] args) throws IOException {
        String spamFolder = "spamFolder/";
        String hamFolder = "hamFolder/";
        String testFile = "testFile.txt";
        // Train the model
        train(spamFolder, hamFolder);
        // Classify one test email
        String testText = readText(testFile);
        boolean isSpam = classify(testText);
        System.out.println(isSpam ? "spam" : "ham");
    }

    public static void train(String spamFolder, String hamFolder) throws IOException {
        // listFiles returns null if the path is not a readable directory
        File[] spamFiles = new File(spamFolder).listFiles(File::isFile);
        File[] hamFiles = new File(hamFolder).listFiles(File::isFile);
        if (spamFiles == null || hamFiles == null) {
            throw new IOException("Training folder is missing or unreadable");
        }
        int totalDocs = spamFiles.length + hamFiles.length;
        if (totalDocs == 0) {
            throw new IllegalArgumentException("No training emails found");
        }
        // Accumulate word counts over the spam emails
        for (File file : spamFiles) {
            for (Map.Entry<String, Integer> entry : countWords(readText(file.getPath())).entrySet()) {
                spamWords.merge(entry.getKey(), entry.getValue(), Integer::sum);
                vocabulary.add(entry.getKey());
                spamTokens += entry.getValue();
            }
        }
        // Accumulate word counts over the ham emails
        for (File file : hamFiles) {
            for (Map.Entry<String, Integer> entry : countWords(readText(file.getPath())).entrySet()) {
                hamWords.merge(entry.getKey(), entry.getValue(), Integer::sum);
                vocabulary.add(entry.getKey());
                hamTokens += entry.getValue();
            }
        }
        // Priors come from the share of emails in each class, not from word counts
        pSpam = (double) spamFiles.length / totalDocs;
        pHam = (double) hamFiles.length / totalDocs;
    }

    public static boolean classify(String text) {
        Map<String, Integer> wordCounts = countWords(text);
        // Work in log space so that many small probabilities do not underflow
        double logSpamScore = Math.log(pSpam);
        double logHamScore = Math.log(pHam);
        int vocabSize = vocabulary.size();
        for (Map.Entry<String, Integer> entry : wordCounts.entrySet()) {
            String word = entry.getKey();
            if (!vocabulary.contains(word)) {
                continue; // ignore words never seen during training
            }
            int occurrences = entry.getValue();
            // Laplace (add-one) smoothing: (count + 1) / (total tokens in class + vocabulary size)
            double pWordGivenSpam = (spamWords.getOrDefault(word, 0) + 1.0) / (spamTokens + vocabSize);
            double pWordGivenHam = (hamWords.getOrDefault(word, 0) + 1.0) / (hamTokens + vocabSize);
            logSpamScore += Math.log(pWordGivenSpam) * occurrences;
            logHamScore += Math.log(pWordGivenHam) * occurrences;
        }
        return logSpamScore > logHamScore;
    }

    // Reads the whole file into one string, using the platform default charset
    private static String readText(String fileName) throws IOException {
        StringBuilder sb = new StringBuilder();
        // try-with-resources closes the reader even if readLine throws
        try (BufferedReader reader = new BufferedReader(new FileReader(fileName))) {
            String line;
            while ((line = reader.readLine()) != null) {
                sb.append(line).append("\n");
            }
        }
        return sb.toString();
    }

    // Lowercases the text, splits it into words, and counts each word that passes the filters
    private static Map<String, Integer> countWords(String text) {
        Map<String, Integer> wordCounts = new HashMap<>();
        // Split on anything that is not a letter or digit, so "free!" and "free" count as the same word
        for (String word : text.toLowerCase().split("[^\\p{L}\\p{N}]+")) {
            // Skip very short tokens and stop words
            if (word.length() > 2 && !StopWords.isStopWord(word)) {
                wordCounts.merge(word, 1, Integer::sum);
            }
        }
        return wordCounts;
    }
}
class StopWords {
    private static final Set<String> stopWords = new HashSet<>();

    static {
        String[] words = {"a", "an", "the", "this", "that", "these", "those", "is", "am", "are", "was", "were", "be", "been", "being", "of", "in", "on", "at", "to", "for", "with", "by", "about", "from", "as", "but", "or", "and", "not"};
        stopWords.addAll(Arrays.asList(words));
    }

    public static boolean isStopWord(String word) {
        return stopWords.contains(word);
    }
}
```
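In the code, the `train` method trains the model, `classify` predicts the class of an email, `readText` reads a text file, `countWords` counts word frequencies, and the `StopWords` class filters out stop words.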
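Because the listing above needs two folders of emails on disk, it can be handy to see the same train-then-classify flow on a few in-memory strings. The following is a minimal sketch under stated assumptions: the class `ToyNaiveBayes`, its methods `addExample`, `logScore` and `isSpam`, and the four training sentences are all hypothetical and exist only for illustration. It also skips stop-word and length filtering to stay short.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory naive Bayes classifier, for illustration only.
final class ToyNaiveBayes {
    private final Map<String, Integer> spamCounts = new HashMap<>();
    private final Map<String, Integer> hamCounts = new HashMap<>();
    private final Set<String> vocabulary = new HashSet<>();
    private int spamTokens, hamTokens, spamDocs, hamDocs;

    // Adds one labeled training email to the model.
    void addExample(String text, boolean isSpam) {
        Map<String, Integer> counts = isSpam ? spamCounts : hamCounts;
        for (String word : tokenize(text)) {
            counts.merge(word, 1, Integer::sum);
            vocabulary.add(word);
            if (isSpam) spamTokens++; else hamTokens++;
        }
        if (isSpam) spamDocs++; else hamDocs++;
    }

    // Log prior plus the sum of smoothed log likelihoods for the chosen class.
    double logScore(String text, boolean spamClass) {
        Map<String, Integer> counts = spamClass ? spamCounts : hamCounts;
        int tokens = spamClass ? spamTokens : hamTokens;
        int docs = spamClass ? spamDocs : hamDocs;
        double score = Math.log((double) docs / (spamDocs + hamDocs));
        for (String word : tokenize(text)) {
            if (!vocabulary.contains(word)) {
                continue; // words unseen in training carry no evidence
            }
            // Add-one smoothing over the vocabulary
            score += Math.log((counts.getOrDefault(word, 0) + 1.0) / (tokens + vocabulary.size()));
        }
        return score;
    }

    boolean isSpam(String text) {
        return logScore(text, true) > logScore(text, false);
    }

    // Lowercases and splits on runs of characters other than a-z and 0-9.
    private static List<String> tokenize(String text) {
        List<String> words = new ArrayList<>();
        for (String w : text.toLowerCase().split("[^a-z0-9]+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words;
    }

    public static void main(String[] args) {
        ToyNaiveBayes nb = new ToyNaiveBayes();
        nb.addExample("win free money now", true);
        nb.addExample("free prize claim now", true);
        nb.addExample("meeting agenda for monday", false);
        nb.addExample("project report attached", false);
        System.out.println(nb.isSpam("claim your free money"));   // prints "true"
        System.out.println(nb.isSpam("monday project meeting"));  // prints "false"
    }
}
```

With this toy data the spam class holds 8 word occurrences, the ham class 7, and the vocabulary 13 words. For "claim your free money" the spam likelihood is (2/21)(3/21)(2/21) against (1/20)^3 for ham, so the spam score wins; "your" is ignored because it never appeared in training.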
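Note that this implementation is only an example. Real-world use calls for further optimization and attention to detail, such as stripping HTML tags and handling attachments.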
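As one example of such preprocessing, HTML tags could be stripped before the text reaches `countWords`. The sketch below is a rough regex-based approach under stated assumptions: the class `HtmlStripper` and its method `stripHtml` are hypothetical names, only four common entities are decoded, and malformed markup may not be handled well. A dedicated HTML parser would likely be more robust for production use.

```java
import java.util.regex.Pattern;

// Hypothetical helper that reduces an HTML email body to plain text, for illustration only.
final class HtmlStripper {
    // <script>...</script> and <style>...</style> blocks, including their content
    private static final Pattern SCRIPT_OR_STYLE =
            Pattern.compile("(?is)<(script|style)\\b[^>]*>.*?</\\1\\s*>");
    // Any remaining tag such as <p>, </div> or <br/>
    private static final Pattern TAG = Pattern.compile("<[^>]+>");
    private static final Pattern WHITESPACE = Pattern.compile("\\s+");

    static String stripHtml(String html) {
        String text = SCRIPT_OR_STYLE.matcher(html).replaceAll(" ");
        text = TAG.matcher(text).replaceAll(" ");
        // Decode a few common entities; "&amp;" goes last so "&amp;lt;" becomes "&lt;", not "<"
        text = text.replace("&nbsp;", " ")
                   .replace("&lt;", "<")
                   .replace("&gt;", ">")
                   .replace("&amp;", "&");
        // Collapse the whitespace left behind by removed tags
        return WHITESPACE.matcher(text).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        System.out.println(stripHtml("<p>Win a <b>FREE</b> prize!</p>")); // prints "Win a FREE prize!"
    }
}
```

In the filter above, this would amount to calling `stripHtml` on the result of `readText` before passing it to `countWords`, in both training and classification, so that tag names such as `div` or `href` do not end up as features.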