Proceedings of the Third CIPS-SIGHAN Joint Conference on Chinese Language Processing, pages 179–185,
Wuhan, China, 20-21 October 2014
Introduction to BIT Chinese Spelling Correction
System at CLP 2014 Bake-off
School of Computer Science and Technology,
Beijing Institute of Technology
luis328@foxmail.com
School of Computer Science and Technology,
Beijing Institute of Technology
pjian@bit.edu.cn
School of Computer Science and Technology,
Beijing Institute of Technology
hhy63@bit.edu.cn
Abstract
This paper describes the Chinese spelling
correction system submitted by BIT to
Task 2 of the CLP 2014 Bake-off. The system
consists of two main parts: 1) An N-gram
model is adopted to retrieve the
non-words that are wrongly split
by word segmentation. The non-words
are then corrected according to word frequency,
pronunciation similarity, shape
similarity, and POS (part-of-speech) tag.
2) For wrong words, abnormal POS tags
are used to locate them, and dependency
relation matching is employed
to correct them. Experimental results
demonstrate the effectiveness of our
system.
1. Introduction
Spelling check, an automatic mechanism
for detecting and correcting human spelling errors,
is a common task in every written language. The
number of people learning Chinese as a Foreign
Language (CFL) has boomed in recent decades
and is expected to grow even
larger in the years to come. However, unlike
the English learning environment, where many
learning techniques have been developed, tools to
support CFL learners are relatively rare,
especially those that can automatically detect and
correct Chinese spelling and grammatical errors.
For example, Microsoft Word® does not yet
support these functions for Chinese, although it
has supported them for English for years. In the CLP 2014 Bake-off,
essays written by CFL learners were collected for
developing automatic spelling checkers. The
aim is that, through such evaluation campaigns,
more innovative computer-assisted techniques
will be developed, more effective Chinese
learning resources will be built, and
state-of-the-art NLP techniques will be advanced for
educational applications.
By analyzing the training data released by the
CLP 2014 Bake-off Task 2¹ and the test data used
in the SIGHAN Bake-off 2013², we find that the
main errors fall into two types. The first is wrong
characters that produce “non-words”, which are
similar to OOV (out-of-vocabulary) words. For example,
a writer may misspell “身邊” as “生邊”, and
“根據” as “根處” (the former arises from
similar pronunciation, and the latter
from similar shape). These are
not words at all and therefore do not exist in the
vocabulary. The second type is words that are
in the dictionary but incorrect in the sentence.
Some of them are misspellings, like “情愛”
in the phrase “情愛的王宜家”, which is a misspelling
of the word “親愛”. However, “情愛” can be found
in the dictionary, so it is not a non-word. Others
are words that are used incorrectly. This
usually happens when the writer does not
understand their meaning clearly. For example, writers
often confuse “在” and “再”, as in “高雄是
再台灣南部一個現代化城市”. Here, “在”,
not “再”, is the right one. To distinguish them from
non-words, we call these words “wrong words”.
According to statistics obtained from the
training data of the CLP 2014 Bake-off, there are
nearly 3,400 wrong words, about twice
as many as the 1,800 non-words.
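
The distinction between the two error types can be illustrated with a minimal sketch. It is not the system described in this paper: the vocabulary `VOCAB` is a toy stand-in for a real dictionary, and the function name `classify_error` is hypothetical. It only shows that a non-word fails a dictionary lookup, while a wrong word passes the lookup and can be recognized as an error only from context (here, from the known intended word).

```python
# Toy vocabulary: a hypothetical stand-in for a real dictionary.
VOCAB = {"身邊", "根據", "親愛", "情愛", "在", "再"}


def classify_error(written: str, intended: str, vocab=VOCAB) -> str:
    """Label a (written, intended) pair using the two error types above."""
    if written == intended:
        return "correct"
    if written not in vocab:
        # The written form is not in the dictionary at all.
        return "non-word"
    # The written form is a valid word, but not the one the sentence needs.
    return "wrong word"


# The four examples discussed in the text.
examples = [("生邊", "身邊"), ("根處", "根據"), ("情愛", "親愛"), ("再", "在")]
labels = [classify_error(written, intended) for written, intended in examples]
```

In practice the intended word is unknown, which is why wrong words cannot be found by dictionary lookup alone and need contextual evidence.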
Spelling check and correction is a traditional
task in natural language processing. Pollock and
Zamora (1984) built a misspelling dictionary for
spelling check. Chang (1995) adopted a bi-gram
language model to substitute confusable
characters. Zhang et al. (2000) proposed an
approximate word matching method to detect and
correct spelling errors. Liu et al. (2011)
¹ http://www.cipsc.org.cn/clp2014/webpage/cn/four_bakeoffs/Bakeoff2014cfp_ChtSpellingCheck_cn.htm
² http://tm.itc.ntnu.edu.tw/CNLP/?q=node/27