The Construction of a kind of Chat Corpus in Chinese Word Segmentation
Xia Yang
Laboratory of Intelligent Information
Processing and Application
Leshan Normal University
Leshan, China
yangxia113@gmail.com
Peng Jin
Laboratory of Intelligent Information
Processing and Application
Leshan Normal University
Leshan, China
jandp@pku.edu.cn
Xingyuan Chen
Laboratory of Intelligent Information
Processing and Application
Leshan Normal University
Leshan, China
cxyforpaper@gmail.com
Abstract—In this paper, we present a chat corpus for Chinese
word segmentation and describe its construction process. The
corpus is built by combining automatic segmentation with
manual correction. The automatic segmentation is performed
using Natural Language Processing Information Retrieval
(NLPIR). For manual correction, the errors produced by NLPIR
are categorized and annotation suggestions are put forward.
By combining these two methods, our study, which is
preliminary, can easily be extended to other chat texts.
Moreover, the corpus produced in this work can serve as a
good standard for research on Chinese word segmentation,
especially for dialogue.
Keywords-Chat Corpus; Chinese Word Segmentation;
manual annotation
I. INTRODUCTION
Chinese word segmentation is one of the most fundamental
and important technologies in Chinese information
processing. It is a key and difficult step in automatic text
categorization, information retrieval, information filtering,
literature indexing, automatic text generation, etc.
Furthermore, it is a key link in other Chinese applications,
such as named entity recognition, syntactic analysis, and
semantic analysis.
At present, there are three kinds of algorithms for Chinese
word segmentation: string matching algorithms [1-3],
algorithms based on understanding of the words [4-6], and
statistical algorithms [7-9].
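As an illustration of the string matching family, the following is a minimal sketch of forward maximum matching, one common dictionary-based method. It is not taken from the cited works; the function name, the `max_len` limit, and the toy lexicon are assumptions made for this example.

```python
def forward_max_match(text, lexicon, max_len=4):
    """Greedy left-to-right segmentation by longest dictionary match."""
    words = []
    i = 0
    while i < len(text):
        # Try the longest candidate first and shrink until the lexicon matches.
        for size in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + size]
            # A single character is always accepted as a fallback.
            if size == 1 or candidate in lexicon:
                words.append(candidate)
                i += size
                break
    return words


# Hypothetical toy lexicon, for illustration only.
LEXICON = {"中文", "分词", "技术", "中文分词"}

if __name__ == "__main__":
    print(forward_max_match("中文分词技术", LEXICON))
```

Because the method depends entirely on the lexicon, words missing from it (for example new network words) fall back to single characters, which is one reason chat text is hard for this family of algorithms.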
To test the effectiveness of these algorithms, specific
evaluation criteria must be defined, and for this purpose
building a manually proofread segmented corpus becomes very
important. However, most existing corpora have been built
from news text and communication text [10-11]. Annotated
chat corpora are therefore seriously lacking. Unlike news
text, the language of chat records is mainly informal and
colloquial. Abbreviations and misspelled words are
frequently used on the Internet, and, what is worse, newly
popular network words and emoticons are mixed in as well.
So, building a chat record corpus becomes particularly
important.
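The paper does not specify its evaluation criteria at this point. As a hedged sketch, one common way to score a segmenter against a manually corrected corpus is word-level precision, recall, and F1 over character spans. The function names below are hypothetical.

```python
def to_spans(words):
    """Convert a word sequence into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for word in words:
        spans.add((start, start + len(word)))
        start += len(word)
    return spans


def precision_recall_f1(gold_words, pred_words):
    """Score a predicted segmentation against a gold segmentation."""
    gold = to_spans(gold_words)
    pred = to_spans(pred_words)
    correct = len(gold & pred)  # words with identical boundaries
    precision = correct / len(pred) if pred else 0.0
    recall = correct / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


if __name__ == "__main__":
    gold = ["中文", "分词", "技术"]   # manually corrected segmentation
    pred = ["中文分词", "技术"]       # automatic segmentation
    print(precision_recall_f1(gold, pred))
```

A word counts as correct only when both of its boundaries match the gold annotation, so a single wrong boundary penalizes both neighboring words.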
Using instant messaging software, we collected chat records
and analyzed their characteristics in aspects such as word
formation, word shape, and morphemes. This analysis brought
to light several problems that may affect the results of
word segmentation. To build the corpus, we combine automatic
tagging with manual correction. In this way, the corpus is
built more efficiently and the workload of subjective manual
annotation is greatly reduced. As a result of our study, we
obtained a high-quality chat corpus of 400K sentences
(24,000 KB).
The rest of this paper is organized in three parts: Section
2, Section 3, and Section 4. Section 2 describes how the
corpus is built, while Section 3, which
2015 IEEE/WIC/ACM International Conference on Web Intelligence and Intelligent Agent Technology
978-1-4673-9618-9/15 $31.00 © 2015 IEEE
DOI 10.1109/WI-IAT.2015.196