Two-level hierarchical combination method for text classification
Wen Li a,b,*, Duoqian Miao a, Weili Wang a,b

a Department of Computer Science and Technology, Tongji University, Shanghai 201804, China
b Information Engineering School, Nanchang University, Nanchang 330031, China
Keywords:
Text classification
Combination method
Variable precision rough sets
Support vector machine
k nearest neighbor

Abstract
Text classification has been recognized as one of the key techniques for organizing digital data. The intuition that every algorithm is biased toward certain data, and that a high-performance classifier can therefore be built by combining different algorithms, has long motivated research on classifier combination. In this paper, we propose a two-level hierarchical algorithm that systematically combines the strengths of the support vector machine (SVM) and k nearest neighbor (KNN) techniques based on variable precision rough sets (VPRS) to improve the precision of text classification. First, an extension of the regular SVM named the variable precision rough SVM (VPRSVM), which partitions the feature space into three kinds of approximation regions, is presented. Second, a modified KNN algorithm named the restrictive k nearest neighbor (RKNN) classifier is put forward to reclassify texts in the boundary region effectively and efficiently. The proposed algorithm overcomes the drawbacks of the SVM's sensitivity to noise and the KNN's low efficiency. Experimental results compared with those of traditional algorithms indicate that the proposed method improves the overall performance significantly.
1. Introduction
Text classification (TC), also known as text categorization, aims at automating the process of assigning documents to a set of previously fixed categories and has long been an active research topic. Many popular algorithms have been applied to text categorization. The No Free Lunch (NFL) theorems (Wolpert & Macready, 1997) have shown that no learning algorithm can be universally superior and that every algorithm is biased toward certain data. When the data fit the underlying classification strategy well, the system accuracy can be very high, and vice versa (Tan, Cheng, & Ghanem, 2005). Among the many well-known algorithms, the support vector machine (SVM) (Joachims, 1998) and k nearest neighbor (kNN) (Cover & Hart, 1967) are widely used because of their excellent learning performance both in theory and in practice. But despite their advantages, they also have weaknesses and limitations.
The SVM is well founded in terms of computational learning theory and very open to theoretical understanding. The final classifier obtained by the SVM depends only on a small portion of the training samples, i.e. the support vectors, which is good for implementation. However, this also makes the SVM sensitive to noises and outliers, and the patterns that are wrongly classified tend to lie near the separating hyperplane (Zhang & Wang, 2008).
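To make this dependence on support vectors concrete, the minimal Python sketch below (an illustration, not code from this paper) evaluates the standard kernel SVM decision function f(x) = sum_i alpha_i y_i K(x_i, x) + b; only the support vectors carry nonzero coefficients, so a few mislabeled points that become support vectors near the hyperplane can shift the whole boundary.

import numpy as np

def svm_decision(x, support_vectors, dual_coefs, bias, kernel):
    # f(x) = sum_i (alpha_i * y_i) * K(sv_i, x) + b: only the support vectors
    # (points with nonzero dual coefficients) contribute to the decision.
    return sum(coef * kernel(sv, x)
               for sv, coef in zip(support_vectors, dual_coefs)) + bias

# Toy usage with a linear kernel and two hypothetical support vectors.
linear = lambda a, b: float(np.dot(a, b))
svs = np.array([[1.0, 2.0], [-1.0, -0.5]])
coefs = np.array([0.8, -0.8])   # each entry is alpha_i * y_i
print(svm_decision(np.array([0.5, 1.0]), svs, coefs, bias=0.1, kernel=linear))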
KNN is a well-known statistical approach in pattern recognition. It is also known as one of the top-performing methods on the benchmark Reuters corpus (Yang & Liu, 1999). As an instance-based learning algorithm, KNN simply stores all of the training examples as the classifier and delays learning until the prediction phase. When the amount of training data is huge, considerable time is therefore spent during the classification process of KNN. Besides, the performance of KNN may be affected by noisy data (Srisawat, Phienthrakul, & Kijsirikul, 2006).
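This cost structure is visible directly in a plain kNN implementation (again a Python illustration under simplifying assumptions, with random vectors standing in for document features): there is no training step, and every prediction must scan the entire stored training set.

import numpy as np
from collections import Counter

def knn_predict(query, train_X, train_y, k=5):
    # No training phase: every prediction computes the distance from the
    # query to every stored example, which is why classification is slow
    # on large corpora.
    dists = np.linalg.norm(train_X - query, axis=1)   # distance to every example
    nearest = np.argsort(dists)[:k]                   # indices of the k closest
    votes = Counter(train_y[i] for i in nearest)
    return votes.most_common(1)[0][0]

# Toy usage.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 20)), rng.integers(0, 3, size=100)
print(knn_predict(rng.normal(size=20), X, y, k=5))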
Researchers have long pursued the promise of harnessing multiple text classifiers to synthesize a more accurate classification procedure by combining the outputs of the contributing classifiers (Bennett, Dumais, & Horvitz, 2005). In this paper, we present a hybrid algorithm based on variable precision rough sets (VPRS) that combines the respective strengths of SVM and KNN in order to improve classification accuracy. The proposed method is a two-stage algorithm. First, by introducing VPRS theory into the support vector machine, a variable precision rough SVM (VPRSVM) is presented. The transformed feature space is partitioned by VPRSVM, where lower and upper approximations of each category are defined. Second, based on an analysis of the characteristics of boundary-region texts, a modified KNN algorithm, namely the restrictive k nearest neighbor (RKNN) classifier, is put forward; it is built on a reduced set of candidate classes and only needs to classify testing documents in the boundary region, which makes it both effective and efficient.
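The overall flow can be sketched as follows (a hedged Python illustration: the function names, the score callables, and the two thresholds standing in for the VPRS approximation regions are assumptions made here for clarity; the precise VPRSVM definitions are given later in the paper).

def two_level_classify(doc, category_scores, rknn, lower=0.8, upper=0.2):
    # Stage 1: score the document against every category's SVM. Scores at or
    # above `lower` place it in that category's lower approximation (accept
    # directly); scores between `upper` and `lower` leave it in the boundary
    # region as a candidate class.
    scores = {c: score_fn(doc) for c, score_fn in category_scores.items()}

    confident = [c for c, s in scores.items() if s >= lower]
    if confident:                              # lower approximation: done
        return max(confident, key=scores.get)

    candidates = [c for c, s in scores.items() if s >= upper]
    if not candidates:                         # nothing plausible: best SVM score
        return max(scores, key=scores.get)

    # Stage 2: the restrictive kNN votes only among the remaining candidates,
    # so only boundary-region documents pay the kNN cost.
    return rknn(doc, candidates)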
Since uncertainties in the labeling are taken into account, our
approach tries to provide a practical mechanism to deal with
real-world noisy text data. Analysis of the different approximation
* Corresponding author at: Department of Computer Science and Technology, Tongji University, Shanghai 201804, China. Tel.: +86 15900799568.
E-mail addresses: jx_wenli@yahoo.com.cn (W. Li), miaoduoqian@163.com (D. Miao), ken.wlwang@gmail.com (W. Wang).