Chinese Spelling Errors Detection based on CSLM
Zhaoyi Guo
School of Computer Science
Leshan Normal University
Leshan, China
gzy125@163.com
Xingyuan Chen
School of Computer Science
Leshan Normal University
Leshan, China
cxyforpaper@gmail.com
Peng Jin*
School of Computer Science
Leshan Normal University
Leshan, China
jandp@pku.edu.cn
Si-Yuan Jing
School of Computer Science
Leshan Normal University
Leshan, China
siyuan-jing@hotmail.com
Abstract—Spelling errors are very common in various electronic documents, and they sometimes have serious consequences. To solve this problem, methods based on the n-gram language model are the most commonly used. The CSLM (continuous space language model), which represents a word as a vector, differs from traditional models. In this paper, we experimented with a specific CSLM, namely the CBOW (Continuous Bag-of-Words) model, to detect spelling errors. Since spelling errors in Chinese are usually wrong characters rather than wrong words, we trained character vectors on a large Chinese corpus and then judged whether a Chinese character is correct by its probability of occurrence in a given context. Experimental results show that the method based on the CSLM outperforms the n-gram language model.
Keywords-Spelling error detection; Continuous space language model; N-gram language model; Character vectors
I. INTRODUCTION
Spelling errors are very common in our daily lives. These mistakes make texts difficult for readers to understand. Moreover, they can cause serious problems if they appear in sensitive settings, such as formal documents. As far as the Chinese language is concerned, a typical story is that "乌鲁木齐" (Urumqi) was written with a wrong first character, so the post office could not find the destination for the goods, which led to a large waste of money. However, detecting spelling errors manually is expensive and time-consuming. Therefore, many researchers have explored methods that automatically detect spelling errors in electronic documents.
In this paper, we propose a novel method based on the CSLM (continuous space language model) [1] to find wrong Chinese characters in real contexts. We compared the new method with the traditional method based on the n-gram language model. Experimental results confirm the effectiveness and efficiency of the new method.
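To make the idea concrete, the following is a minimal sketch of how a CBOW-style model could score a character by its probability in a given context. The vocabulary, embedding dimension, and random embedding matrices here are illustrative stand-ins, not the model trained in the paper:

```python
import numpy as np

# CBOW-style scoring sketch: the embedding matrices are random
# stand-ins; in practice they would be trained on a large corpus.
rng = np.random.default_rng(0)
vocab = ["我", "们", "在", "乌", "鲁", "木", "齐", "右"]
idx = {ch: i for i, ch in enumerate(vocab)}
dim = 8
W_in = rng.normal(size=(len(vocab), dim))   # input (context) embeddings
W_out = rng.normal(size=(len(vocab), dim))  # output (target) embeddings

def cbow_prob(context, target):
    """P(target | context): softmax over the averaged context vectors."""
    h = W_in[[idx[c] for c in context]].mean(axis=0)
    scores = W_out @ h
    e = np.exp(scores - scores.max())       # numerically stable softmax
    return e[idx[target]] / e.sum()

# A character whose probability in its context falls below a threshold
# would be flagged as a potential spelling error.
p = cbow_prob(["乌", "鲁", "齐"], "木")
print(float(p))
```

With trained embeddings, a genuinely wrong character would receive a much lower probability than the correct one in the same context.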
II. RELATED WORK
There are many previous studies on Chinese spelling error detection. Among them, methods based on n-gram language models are widely used.
C. H. Chang [2] proposed a method based on the bi-gram language model. Y. J. Huang et al. [3] introduced a printing system that uses the n-gram language model to correct spelling errors and reduce the waste of resources. J. F. Yeh et al. [4] proposed a novel spelling error detection and correction method based on an n-gram ranked inverted index list. Moreover, in the SIGHAN Bake-off [5], a famous international Chinese language processing contest, many researchers have proposed methods for Chinese spelling error detection.
Because methods based on the n-gram language model are widely used in spelling error detection, we briefly recall some basic knowledge about it in this section.
Supposing there is a Chinese sentence $S = w_1 w_2 \cdots w_m$, in which $w_i$ is the $i$-th Chinese character, the goal of a language model is to calculate the probability $p(S)$.
A classical language model is illustrated in the following equation:

$p(S) = \prod_{i=1}^{m} p(w_i \mid w_1 w_2 \cdots w_{i-1}).$  (1)
In order to simplify the computation, we assume that the appearance of a word relies only on the $n-1$ words in front of it. This is called the n-gram language model. If $n = 2$, the model is called a bi-gram language model; if $n = 3$, it is called a tri-gram language model. Bi-gram and tri-gram models are the two most commonly used n-gram language models. Supposing $w_i^j$ denotes $w_i \cdots w_j$, in the n-gram language model we get the equation:
$p\left(w_i \mid w_{i-n+1}^{i-1}\right) = \dfrac{c\left(w_{i-n+1}^{i}\right)}{c\left(w_{i-n+1}^{i-1}\right)}.$  (2)
In (2), $c(\cdot)$ is the number of occurrences of its argument in the corpus.
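The estimates in (1) and (2) can be illustrated with a toy bi-gram model; the corpus and sentences below are hypothetical stand-ins for a real training corpus:

```python
from collections import Counter

# Toy illustration of Eqs. (1) and (2) for n = 2 (bi-gram):
# p(w_i | w_{i-1}) = c(w_{i-1} w_i) / c(w_{i-1}), and p(S) is the
# chain-rule product of these conditional probabilities.
corpus = "我们在乌鲁木齐我们在北京"
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def bigram_prob(prev, cur):
    """Maximum-likelihood estimate of p(cur | prev), as in Eq. (2)."""
    return bigrams[(prev, cur)] / unigrams[prev]

def sentence_prob(sentence):
    """Product of bi-gram probabilities, the n = 2 case of Eq. (1)."""
    p = 1.0
    for prev, cur in zip(sentence, sentence[1:]):
        p *= bigram_prob(prev, cur)
    return p

print(bigram_prob("在", "乌"))        # "在" is followed by "乌" in 1 of 2 cases
print(sentence_prob("我们在乌鲁木齐"))
```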
The main disadvantage of the n-gram language model is data sparseness. Smoothing methods, such as additive smoothing, are introduced to alleviate the effects of data sparseness.
The method compared with ours in the experiments is based on the n-gram language model with additive smoothing. This method is commonly used for Chinese spelling error detection in real applications.
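A minimal sketch of such a baseline, assuming a bi-gram model with add-one smoothing; the corpus, threshold, and test sentences are hypothetical stand-ins:

```python
from collections import Counter

# Bi-gram spelling-error detection with additive (add-one) smoothing:
# p(cur | prev) = (c(prev cur) + 1) / (c(prev) + V), where V is the
# vocabulary size. Characters with low smoothed probability are flagged.
corpus = "我们在乌鲁木齐我们在北京"
V = len(set(corpus))
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def smoothed_prob(prev, cur):
    return (bigrams[(prev, cur)] + 1) / (unigrams[prev] + V)

def detect(sentence, threshold=0.15):
    """Flag characters whose smoothed bi-gram probability is low."""
    return [cur for prev, cur in zip(sentence, sentence[1:])
            if smoothed_prob(prev, cur) < threshold]

print(detect("我们在右鲁木齐"))  # "乌" miswritten as "右"
print(detect("我们在乌鲁木齐"))  # correct sentence
```

Note that an unseen character depresses both bi-grams it participates in, so the character next to the true error may be flagged as well.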
* Peng Jin is the corresponding author
2015 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology
978-1-4673-9618-9/15 $31.00 © 2015 IEEE
DOI 10.1109/WI-IAT.2015.62