Text classification based on SMO and fuzzy model
Mengqi Pei, Xing Wu*
School of Computer Engineering and Science
Shanghai University
Shanghai, China
xingwu@shu.edu.cn
Abstract—In this article we propose a text classification system using chi-value as the feature selection method and the SMO (sequential minimal optimization) algorithm as the classifier. In addition, we use a fuzzy model of fuzzy concepts to describe documents' classification labels and entropy to calculate the uncertainty of a document's classification result. Experimental results demonstrate that the proposed method reaches 87% or higher text classification accuracy.
Keywords—text classification; SMO; fuzzy model; fuzzy concept; entropy
I. INTRODUCTION
Nowadays, statistical learning methods have become the dominant approach in text categorization, because they involve fewer subjective factors than knowledge engineering methods. In addition, many statistical learning techniques have a solid theoretical foundation, well-defined evaluation standards, and good performance.
Statistical learning methods rely on effective feature extraction to achieve good learning results, so extracting effective features and avoiding noise interference is important for improving machine learning performance. One effective feature extraction method is chi-value, which compares the contribution of a given word between one category and the others. The method is widely adopted: recently, Xiuxia Chen proposed an automatic web music resource crawler system [1] using chi-value as the feature selection method, and Yunfei Qiu proposed an improved chi-value feature selection method [2]. The feature vector is mostly established with the TF-IDF method and can be described by an algebraic model named the Vector Space Model, in which each dimension represents a term and the value of a dimension is set to non-zero if the term appears in the document. After feature extraction, an algorithm is needed to classify the samples. According to Yiming Yang [3], the support vector machine (SVM) method based on the vector space model (VSM) works best in text classification.
SVM is a popular classification technology in data mining due to its simple structure and good classification performance. Huang Yuqing proposed an SVM with a mixed kernel function [4], and Yuanchao Liu proposed an abstract sentence classification method for scientific papers based on a transductive SVM [5]. In 1998, an optimized SVM training method named sequential minimal optimization (SMO) was put forward by Platt John [6]. It has become the fastest quadratic programming optimization algorithm and performs especially well for linear SVMs and sparse data. In this paper, we use SMO as our classification algorithm.
The result of classification is usually described as a crisp set, which dichotomizes testing data into two groups: members and nonmembers. However, many classification concepts do not exhibit this characteristic: an input document does not always belong exclusively to one category. A fuzzy set differs from a crisp set in that its elements have degrees of membership. In this article, we use a fuzzy model of fuzzy concepts based on the fuzzy semantic models proposed by Yingxu Wang [7] to describe our classification results.
Entropy is a measure of the unpredictability of a random variable. In this paper, we use entropy as the metric of the uncertainty of a document's classification confidence.
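As a minimal sketch of this idea (the paper does not give code), the Shannon entropy of a document's class-membership distribution could be computed as follows; the function name and input format are illustrative assumptions:

```python
import math

def entropy(probabilities):
    """Shannon entropy (in bits) of a class-probability distribution.

    A higher value means the classification result is less certain:
    a document assigned entirely to one class has entropy 0, while a
    document split evenly across classes has maximal entropy.
    """
    return -sum(p * math.log2(p) for p in probabilities if p > 0)
```

For example, a document classified with certainty (`[1.0]`) yields entropy 0, while an even split over two categories (`[0.5, 0.5]`) yields entropy 1 bit.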
II. PROPOSED WORK
A. Text classification system based on SMO
In this paper, we build a text classification application with a complete pipeline. We use chi-value to extract features and TF-IDF as the feature weight to describe documents. Each document is represented as a feature vector using the vector space model, and the classification algorithm we chose is SMO, an optimized training algorithm for SVMs.
First, all training documents need to be preprocessed by
segmenting into words and filtering out the stop words.
After preprocessing we obtain the individual words from all documents; we can then select a subset of those words as the keyword set using chi-value, which is defined as follows:
\chi^2(t, c) = \frac{N\,(AD - CB)^2}{(A+C)(B+D)(A+B)(C+D)}
where t indicates a term, c indicates a category, N is the number of training documents, A is the number of documents that both contain t and belong to c, B is the number of documents that contain t but do not belong to c, C is the number of documents that do not contain t but belong to c, and D is the number of documents that neither contain t nor belong to c.
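As a minimal sketch of this definition (assuming the four counts A, B, C, and D have already been tallied from the training corpus), the chi-value of a term–category pair could be computed as:

```python
def chi_value(A, B, C, D):
    """Chi-square statistic of a term t with respect to a category c.

    A: documents containing t and belonging to c
    B: documents containing t but not belonging to c
    C: documents not containing t but belonging to c
    D: documents neither containing t nor belonging to c
    """
    N = A + B + C + D
    numerator = N * (A * D - C * B) ** 2
    denominator = (A + C) * (B + D) * (A + B) * (C + D)
    # A zero denominator means t or c is absent from the corpus.
    return numerator / denominator if denominator else 0.0
```

A term that is independent of the category (e.g. A = B = C = D) scores 0, while a term concentrated in one category scores high, which is what makes the statistic usable for ranking candidate keywords.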
Next, the relevant terms from the documents can be represented by the VSM, whose elements are TF-IDF weights. The TF-IDF weight is defined as follows:
W_{ij} = \frac{tf_{ij}}{\max_i tf_i} \times \log\frac{N}{n_i}
where W_{ij} indicates the TF-IDF weight of term i toward category j, tf_{ij} is the term frequency of i in j, and max_i tf_i is the maximum term frequency of i over all categories. N is the total number of training documents and n_i is the number of documents in which term i appears.
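The weight above can be sketched directly from its definition; the function name and argument order are illustrative assumptions, with the counts assumed to be precomputed:

```python
import math

def tfidf_weight(tf_ij, max_tf_i, N, n_i):
    """TF-IDF weight of term i toward category j, per the formula above.

    tf_ij:    term frequency of i in category j
    max_tf_i: maximum term frequency of i over all categories
    N:        total number of training documents
    n_i:      number of documents in which term i appears
    """
    # Normalized term frequency times inverse document frequency.
    return (tf_ij / max_tf_i) * math.log(N / n_i)
```

Rare terms (small n_i) receive a large inverse-document-frequency factor, so they dominate the feature vector, while terms appearing in every document get weight log(N/N) = 0.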
____________________________________
978-1-4799-4419-4 /14/$31.00 ©2014 IEEE