International Joint Conference on Natural Language Processing, pages 614–622,
Nagoya, Japan, 14-18 October 2013.
Context-Based Chinese Word Segmentation using SVM Machine-
Learning Algorithm without Dictionary Support
Chia-ming Lee
Department of Engineering Science
and Ocean Engineering,
National Taiwan University,
Taipei, Taiwan (R.O.C.)
trueming@gmail.com
Chien-Kang Huang
Department of Engineering Science
and Ocean Engineering,
National Taiwan University,
Taipei, Taiwan (R.O.C.)
ckhuang@ntu.edu.tw
Abstract
This paper presents a new machine-learning
Chinese word segmentation (CWS) approach
that defines CWS as a break-point classifica-
tion problem, where a break point is the
boundary between two consecutive words.
Further, this paper exploits a support vector
machine (SVM) classifier, which learns the
segmentation rules of the Chinese language
from a context model of break points in a
corpus.
Additionally, we have designed an effective
feature set for building the context model,
and a systematic approach for creating the
positive and negative samples used for train-
ing the classifier. Unlike traditional ap-
proaches, which require the assistance of
large-scale information sources such as dic-
tionaries or linguistic tagging, the proposed
approach selects the most frequent words in
the corpus as its learning source. In this way,
CWS can be carried out on any novel corpus
without such supporting resources. Accord-
ing to our experimental results, the proposed
approach achieves results competitive with
the Chinese Knowledge and Information
Processing (CKIP) system from Academia
Sinica.
1 Introduction
Chinese sentences are sequences of characters
that are not delimited by whitespace or any
other word-boundary symbol, so Chinese word
segmentation (CWS) is one of the fundamental
problems in Chinese natural language pro-
cessing.
One of the major issues in existing CWS re-
search is the resolution of word segmentation
ambiguities. The conventional approach to am-
biguity detection is to use two maximum
matching methods (MMs), which scan corpora
forward (Forward Maximum Matching, FMM)
and backward (Backward Maximum Matching,
BMM) based on dictionaries (Kit, Pan, & Chen,
2002). Meanwhile, disambiguation methods can
be classified into two categories: rule-based
methods and statistics-based methods (Ma &
Chen, 2003b). The disambiguation problem is
often accompanied by the problem of unknown-
word, or out-of-vocabulary (OOV), extraction
(K.-J. Chen & Ma, 2002). Besides the
MMs with dictionaries, which are also known as
word-based approaches, there are character-
based approaches. The word-based approach
treats words as the basic unit of a language, and
the character-based approach labels each charac-
ter as the beginning, middle, or end of a word.
Character-based approaches are often imple-
mented with a machine-learning classification
algorithm for handling disambiguation (Wang,
Zong, & Su, 2012). In addition to dictionaries,
other linguistic resources such as part-of-speech
(POS) or semantic information can be integrated
for further improvement (M.-y. Zhang, Lu, &
Zou, 2004).
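The two matching directions above can be sketched as follows. This is an illustrative sketch only: the toy dictionary and the classic ambiguous string 研究生命 ("research" + "life" vs. "graduate student" + "fate") are our own assumptions, not data from this paper.

```python
def fmm_segment(text, dictionary, max_word_len=4):
    """Forward Maximum Matching: greedily take the longest dictionary
    word starting at each position, falling back to single characters."""
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words

def bmm_segment(text, dictionary, max_word_len=4):
    """Backward Maximum Matching: the same greedy scan, from the end."""
    words, j = [], len(text)
    while j > 0:
        for length in range(min(max_word_len, j), 0, -1):
            candidate = text[j - length:j]
            if length == 1 or candidate in dictionary:
                words.insert(0, candidate)
                j -= length
                break
    return words

dictionary = {"研究", "生命", "研究生"}
print(fmm_segment("研究生命", dictionary))  # ['研究生', '命']
print(bmm_segment("研究生命", dictionary))  # ['研究', '生命']
```

When the two scans disagree, as here, an overlapping ambiguity has been detected and a disambiguation method must choose between the candidate segmentations.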
In addition to disambiguation strategies, many
researchers develop methods for identifying the
best word sequence in their CWS systems. The
hidden Markov model (HMM) (Lin, 2006; M.-y.
Zhang et al., 2004), maximum entropy (ME),
mutual information (MI), and boundary depend-
ency (Peng & Schuurmans, 2001) are often used.
Theoretically, obtaining the best CWS result
amounts to finding the optimal word sequence.
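As a minimal illustration of word-sequence optimization, the following sketch scores every segmentation path with word log-probabilities and picks the best one by dynamic programming over break positions. The probabilities here are hypothetical stand-ins for scores a real model (e.g. an HMM) would estimate from a corpus; this is not the scoring used by the systems cited above.

```python
import math

def best_segmentation(text, word_logprob, max_word_len=4):
    """Dynamic programming over break points: best[j] holds the best
    score for the prefix text[:j] and the start of its last word.
    Assumes every character is coverable by some dictionary entry."""
    n = len(text)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for j in range(1, n + 1):
        for length in range(1, min(max_word_len, j) + 1):
            lp = word_logprob.get(text[j - length:j])
            if lp is not None and best[j - length][0] + lp > best[j][0]:
                best[j] = (best[j - length][0] + lp, j - length)
    words, j = [], n  # backtrack from the end of the sentence
    while j > 0:
        i = best[j][1]
        words.insert(0, text[i:j])
        j = i
    return words

# Hypothetical unigram log-probabilities for the ambiguous string 研究生命.
logprob = {"研究": -1.0, "生命": -1.0, "研究生": -2.5, "命": -3.0}
print(best_segmentation("研究生命", logprob))  # ['研究', '生命']
```

Here the path 研究/生命 (score -2.0) beats 研究生/命 (score -5.5), so the ambiguity is resolved by the global sequence score rather than by greedy matching.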
As described above, existing CWS research
takes either words or characters as the core unit
of its methodology. Instead of resolving word
ambiguity, finding the best word sequence, or
joining characters into words, we redefine the
CWS problem as the identification of “break
points” among the “joint points” in Chinese
character sequences. In this paper, we define a
“joint