Bayesian Chinese Spam Filter Based on Crossed N-gram
Jianshe DONG
School of Computer and Communication
Lanzhou University of Technology
Lanzhou, Gansu Province 730050, China
dongjs@lut.cn

Haixia CAO
College of Information Engineering
Jiangxi University of Science and Technology
Ganzhou, Jiangxi Province 314000, China
caohaixia318@163.com

Peng LIU
Research Center of Military Grid
PLA University of Science and Technology
Nanjing, Jiangsu Province 210050, China
milgrid@163.com

Li REN
College of Network Education
Lanzhou University of Technology
Lanzhou, Gansu Province 730050, China
renli@lut.cn

Abstract
Naive Bayesian spam email filters are a well-known
and powerful type of filter that can easily be
induced from a dataset of sample cases. However, the
problem of word segmentation for Chinese email
restricts their performance. In this paper, we present a
Bayesian Chinese spam filter based on crossed N-grams.
This method does not need to segment Chinese emails
into words, so it avoids being restricted by
inaccurate word segmentation. It also does not need
a word segmentation dictionary, and it is easy to
install on the user terminal to construct an
individualized spam filter, since its space and time
efficiency are improved. The independence assumption
of the naive Bayes method is relaxed to some degree.
Experimental results show that the proposed method
achieves high accuracy at low cost.
1. Introduction
Mass unsolicited electronic mail, commonly known as
spam, has increased enormously in recent years and has
become a serious threat not only to the Internet but
also to society. The flood of spam wastes a large
amount of network resources and disrupts normal
email correspondence.
In September 2001, 8% of all emails in the US were
spam. By July 2002, this fraction had increased to
35% [1]. More recent studies report that, in North
America, a business user received 10 spam emails on
average per day in 2003, and that this number is
expected to grow by a factor of four by 2008 [2].
Furthermore, AOL and MSN report blocking 2.4 billion
spam emails daily from reaching their customers'
inboxes. This traffic corresponds to about 80% of
daily incoming emails at AOL [3]. The problem is also
serious in China. The Anti-spam Center of ISC [4]
reports that in March 2006 a user in China received
19.33 spam emails on average per week and that 63.97%
of all emails were spam; this is 2.03 more spam
emails than in October 2005. In China, 68.55% of spam
emails were sent in Chinese in 2005.
Over the past few years, different approaches have
been proposed to resist spammers. Some use a
Bayesian-like approach [6, 7] or a rule-based
approach [8, 9], and some use a cryptographic
solution to protect against spamming [10].
The concept of Bayesian spam email filters
suggested by Sahami et al. [6] has become popular. The
filter is based on the naive Bayes classifier. It is
powerful and can easily be induced from a dataset of
sample cases. However, the strong conditional
independence and distribution assumptions underlying
such classifiers can lead to poor classification
performance, because the assumed type of probability
distribution (e.g., a normal distribution) may not
describe the data appropriately, or because some of
the conditional independence assumptions do not hold.
Chinese emails must also be segmented into words
before the Bayesian method is used for filtering, and
inaccurate word segmentation badly degrades the
filtering accuracy.
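
To make the idea of filtering without word segmentation concrete, the following is a minimal sketch, not the method of this paper. It assumes plain overlapping character bigrams as features (the crossed N-gram construction itself is not defined in this section) and a multinomial naive Bayes model with Laplace smoothing. The names char_ngrams, NgramNaiveBayes, and the parameter alpha are hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def char_ngrams(text, n=2):
    """Return overlapping character n-grams; no word segmentation is needed."""
    chars = [c for c in text if not c.isspace()]
    return ["".join(chars[i:i + n]) for i in range(len(chars) - n + 1)]


class NgramNaiveBayes:
    """Illustrative multinomial naive Bayes over character n-grams (assumed model)."""

    LABELS = ("spam", "ham")

    def __init__(self, n=2, alpha=1.0):
        self.n = n                  # n-gram length (assumption: bigrams by default)
        self.alpha = alpha          # Laplace smoothing constant (assumption)
        self.counts = {lbl: Counter() for lbl in self.LABELS}
        self.docs = {lbl: 0 for lbl in self.LABELS}

    def train(self, text, label):
        """Add the n-gram counts of one labeled email to the model."""
        self.counts[label].update(char_ngrams(text, self.n))
        self.docs[label] += 1

    def log_score(self, text, label):
        """Log prior plus the sum of smoothed log likelihoods of each n-gram."""
        vocab = len(set(self.counts["spam"]) | set(self.counts["ham"]))
        total = sum(self.counts[label].values())
        # Smoothed class prior, so an unseen class never yields log(0).
        n_docs = sum(self.docs.values())
        score = math.log((self.docs[label] + 1) / (n_docs + len(self.LABELS)))
        for gram in char_ngrams(text, self.n):
            num = self.counts[label][gram] + self.alpha
            den = total + self.alpha * vocab
            score += math.log(num / den)
        return score

    def classify(self, text):
        """Return the label with the higher log score."""
        return max(self.LABELS, key=lambda lbl: self.log_score(text, lbl))


if __name__ == "__main__":
    nb = NgramNaiveBayes()
    nb.train("免费中奖，点击领取大奖", "spam")
    nb.train("明天下午三点开会讨论项目", "ham")
    print(nb.classify("点击领取免费大奖"))
```

Because the features are taken directly from the character stream, no segmentation dictionary is loaded. Adjacent bigrams share a character, so they are clearly not independent. The sketch makes no attempt to model that dependence.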
Proceedings of the Sixth International Conference on Intelligent Systems Design and Applications (ISDA'06)
0-7695-2528-8/06 $20.00 © 2006