Utilizing High-quality Feature Extension Mode to
Classify Chinese Short-text
Xinghua Fan
College of Computer Science and Technology
Chongqing University of Posts and Telecommunications, Chongqing 400065, China
fanxh@cqupt.edu.cn
Hongge Hu
College of Computer Science and Technology
Chongqing University of Posts and Telecommunications, Chongqing 400065, China
huge120806@163.com
Abstract—This paper presents a method for classifying
Chinese short-texts that have a weak concept signal, in which
high-quality feature extension modes are extracted and used
effectively. In the method, a feature extension mode is
defined as a set of terms that co-occur in the training data,
and three measures that decide whether a mode is high-quality,
i.e., confidence, category homoplasy and relevancy strength,
are presented. Then, an algorithm that extracts high-quality
feature extension modes from the training data is designed.
Next, a Chinese short-text classification algorithm utilizing
feature extension modes is presented, in which a short-text is
extended by adding new features or modifying the weights of
initial features, according to the relationship between
non-feature terms and feature extension modes. The experiments
show that (1) high-quality feature extension modes help to
improve Chinese short-text classification; and (2) the proposed
method achieves higher classification performance than
conventional text classification methods.
Index Terms—Chinese short-text classification, co-
occurrence relationship, high-quality feature extension
mode, feature extension
I. INTRODUCTION
With the rapid development of information technology,
the forms of information transmission have been continually
enriched. Short-text, which generally has no more than
160 characters and is typified by SMS messages, online
chatting and netizen comments, has become an important
channel for the dissemination of public information. However,
with the explosive growth of short-text, the propagation of
pornography, violence, rumors, reactionary remarks, fraud and
illegal advertising carried by short-texts on the network
inevitably becomes a hidden threat to social stability. We
must therefore effectively monitor, intercept and filter such
harmful information. Short-text classification technology
[1-5], which assigns a short-text to one of several
pre-defined classes based on content analysis, is an effective
way of addressing the above-mentioned problems.
Because short-texts have inherent defects such as short
length, weak concept signal and high ambiguity, short-text
categorization is a very challenging task. So far, there are
two approaches to handling the difficulties of short-text
classification: one makes use of extra information in external
resources, such as HowNet or a background corpus, to assist
classification [3-5]; the other mines internal implied
information to help categorization [6-7]. Although the first
approach increases the amount of information in the short-text,
it depends strongly on building and accessing external
resources; besides, it is hard to ensure the homogeneity of
the additional information and the internal information. For the
second approach, Zelikovitz [6-7] tried to exploit the internal
relations of terms by singular value decomposition and
achieved some effect, but the method demands too much
computing capacity when processing large amounts of
data.
In this paper, we focus on the first approach and
implement it by utilizing the co-occurrence relationships
hidden in the training data, which is used as the background
corpus, to classify Chinese short-text. As important extra
information, a co-occurrence relationship, which is a set of
terms that co-occur in the background corpus, is helpful for
short-text classification [5]. A process such as that in [5],
which introduced simple co-occurrence relationships without
considering their quality, may bring noise into short-text
classification, making it difficult to obtain a satisfactory
classification result. If we call the collection of
co-occurrence relationships among terms used for short-text
classification a feature extension mode library, it is clear
that the core issues in improving classification performance
are to build a high-quality feature extension mode library and
to find an effective method of utilizing the high-quality
feature extension modes. This requires solving the following
problems. (1) What measures can determine whether a
feature extension mode, i.e., a co-occurrence relationship, is
of high quality? (2) How can the measures be utilized to extract
JOURNAL OF NETWORKS, VOL. 5, NO. 12, DECEMBER 2010 1417
© 2010 ACADEMY PUBLISHER
doi:10.4304/jnw.5.12.1417-1425