Bayesian Chinese Spam Filter Based on Crossed N-gram

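A Bayesian Chinese spam filter classifies email using Bayes' theorem, and it is an effective approach for Chinese text in particular. Because Chinese has no explicit word boundaries, however, conventional word segmentation can limit the filter's performance. The paper "Bayesian Chinese Spam Filter Based on Crossed N-gram" proposes a method that needs no word segmentation in order to address this problem.

Conventional Bayesian filters usually depend on preprocessing the message content, including accurate word segmentation. Chinese word segmentation is complex and error-prone, and its errors directly reduce the filter's accuracy. The method in this study sidesteps the difficulty by using crossed N-grams. A crossed N-gram is a variant of the N-gram: it works on runs of N consecutive characters rather than on segmented words, so it captures more context while avoiding the problems caused by inaccurate segmentation.

The advantage of this approach is that it requires neither a pre-installed dictionary nor a complex segmentation step, which simplifies installation and maintenance. The filter is therefore easier to deploy and can adapt to different users and environments. Because it is still based on Bayesian statistics, it keeps the efficiency and scalability of a Bayesian classifier and can keep improving its filtering performance as it learns from new mail samples.

The authors, from Lanzhou University of Technology, Jiangxi University of Science and Technology, and PLA University of Science and Technology, validated the method experimentally. They presumably compared conventional word segmentation with the crossed N-gram method and showed the new method's advantages in reducing misclassification and improving filtering efficiency.

In summary, the Bayesian Chinese spam filter based on crossed N-grams is a text classification technique that avoids the Chinese word segmentation problem and improves the accuracy and practicality of Bayesian filtering for Chinese spam. It is most useful for systems that handle large volumes of Chinese email, especially those that cannot rely on an accurate segmentation tool.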

Bayesian Chinese Spam Filter Based on Crossed N-gram

Jianshe DONG
School of Computer and Communication
Lanzhou University of Technology
Lanzhou, Gansu Province 730050, China
dongjs@lut.cn

Haixia CAO
College of Information Engineering
Jiangxi University of Science and Technology
Ganzhou, Jiangxi Province 314000, China
caohaixia318@163.com

Peng LIU
Research Center of Military Grid
PLA University of Science and Technology
Nanjing, Jiangsu Province 210050, China
milgrid@163.com

Li REN
College of Network Education
Lanzhou University of Technology
Lanzhou, Gansu Province 730050, China
renli@lut.cn

Abstract

Naive Bayesian spam email filters are a well-known and powerful type of filter that can easily be induced from a dataset of sample cases. However, the problem of segmenting Chinese email into words restricts their performance. In this paper, we present a Bayesian Chinese spam filter based on crossed N-grams. This method does not need to segment Chinese emails into words, so it is not restricted by inaccurate word segmentation. It also does not need a word segmentation dictionary, and because its space and time efficiency are improved, it is easy to install on the user terminal to construct an individualized spam filter. The independence assumption of the naive Bayes method is relaxed to some degree. Experimental results show that the proposed method achieves high accuracy at low cost.
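
The segmentation-free idea can be illustrated with a minimal Python sketch. It extracts plain overlapping character N-grams, which is an assumption made for illustration: this excerpt does not define how the paper constructs its crossed N-grams, and the function name char_ngrams is hypothetical.

```python
def char_ngrams(text: str, n: int = 2) -> list[str]:
    """Return overlapping character n-grams of text, ignoring whitespace.

    No dictionary or word segmenter is involved: every run of n
    consecutive characters becomes one feature.
    NOTE: plain n-grams are an assumption; the paper's exact
    "crossed N-gram" construction is not given in this excerpt.
    """
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


# A six-character phrase yields five overlapping bigrams.
print(char_ngrams("免费领取奖品", 2))
# ['免费', '费领', '领取', '取奖', '奖品']
```

Some of these bigrams cross what a segmenter would treat as word boundaries (for example 费领), so a segmentation error cannot remove a feature.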

1. Introduction

Mass unsolicited electronic mail, often known as spam, has recently increased enormously and has become a serious threat not only to the Internet but also to society. The flood of spam wastes large amounts of network resources and disrupts normal email correspondence.

In September 2001, 8% of all emails in the US were spam. By July 2002, this fraction had increased to 35% [1]. More recent studies report that a business user in North America received 10 spam emails per day on average in 2003, and that this number is expected to grow by a factor of four by 2008 [2]. Furthermore, AOL and MSN report blocking 2.4 billion spam emails daily before they reach their customers’ inboxes. This traffic corresponds to about 80% of daily incoming email at AOL [3]. The problem is also serious in China. The Anti-spam Center of ISC [4] reports that in March 2006 a user in China received 19.33 spam emails per week on average and that 63.97% of all emails were spam. The weekly count is 2.03 spam emails higher than in October 2005. In 2005, 68.55% of spam emails in China were sent in Chinese.

Over the past few years, different approaches have been proposed to resist spammers. Some use a Bayesian-like approach [6, 7] or a rule-based approach [8, 9], and some use a cryptographic solution to the spamming problem [10].

The Bayesian spam email filter proposed by Sahami et al. [6] became popular. The filter is based on the naive Bayes classifier. It is powerful and can easily be induced from a dataset of sample cases. However, the strong conditional independence and distribution assumptions underlying naive Bayes classifiers can lead to poor classification performance, because the assumed type of probability distribution, e.g., a normal distribution, may not describe the data appropriately, or because some of the conditional independence assumptions do not hold. Chinese emails must also be segmented into words before a Bayesian method can filter them, and inaccurate word segmentation severely limits filtering accuracy.
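
To make the naive Bayes classifier and its independence assumption concrete, here is a minimal Python sketch. It is not the authors' implementation: the class name NaiveBayesFilter, the add-one (Laplace) smoothing, the use of plain character bigrams in place of segmented words, and the toy training messages are all assumptions for illustration.

```python
import math
from collections import Counter


def char_bigrams(text):
    """Overlapping character bigrams, ignoring whitespace (assumed feature set)."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + 2]) for i in range(len(chars) - 1)]


class NaiveBayesFilter:
    """Hypothetical multinomial naive Bayes filter over character bigrams."""

    def __init__(self):
        self.counts = {"spam": Counter(), "ham": Counter()}
        self.docs = {"spam": 0, "ham": 0}

    def train(self, text, label):
        self.counts[label].update(char_bigrams(text))
        self.docs[label] += 1

    def log_score(self, text, label):
        # Assumes at least one training message per class has been seen.
        vocab = len(set(self.counts["spam"]) | set(self.counts["ham"]))
        total = sum(self.counts[label].values())
        # log P(class) from the share of training messages in that class
        score = math.log(self.docs[label] / sum(self.docs.values()))
        # Naive independence: per-feature log-likelihoods are simply summed.
        for gram in char_bigrams(text):
            # Add-one smoothing (an assumption) avoids log(0) for unseen bigrams.
            score += math.log((self.counts[label][gram] + 1) / (total + vocab))
        return score

    def classify(self, text):
        return max(("spam", "ham"), key=lambda label: self.log_score(text, label))


# Toy training data, invented for this example.
nb = NaiveBayesFilter()
nb.train("免费领取奖品 点击链接", "spam")
nb.train("恭喜中奖 免费奖品", "spam")
nb.train("明天下午开会 请准时参加", "ham")
nb.train("附件是会议纪要 请查收", "ham")

print(nb.classify("免费奖品"))        # spam
print(nb.classify("请准时参加会议"))  # ham
```

The loop in log_score is where the conditional independence assumption appears: each bigram contributes its own term regardless of its neighbours. Overlapping bigrams share characters, so adjacent features are not actually independent; the sketch ignores that.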

Proceedings of the Sixth International Conference on Intelligent Systems Design and Applications (ISDA'06)
