ICCPOL 2003, pp. 109-117, Shenyang (China)
A Statistical Approach to Extract Chinese Chunk
Candidates from Large Corpora
ZHANG Le, LÜ Xue-qiang, SHEN Yan-na, YAO Tian-shun
Institute of Computer Software & Theory,
School of Information Science & Engineering, Northeastern University
Shenyang, 110004 China
Email: ejoy@xinhuanet.com, studystrong@sohu.com, neusyn@sohu.com, tsyao@mail.neu.edu.cn
Abstract
The extraction of chunk candidates from real corpora is one of the fundamental tasks in building an example-based
machine translation model. This paper presents a statistical approach to extracting Chinese chunk candidates from
large monolingual corpora. The first step is to extract large N-grams (up to 20-grams) from the raw corpus. Two
newly proposed Fast Statistical Substring Reduction (FSSR) algorithms are then applied to the initial N-gram set to
remove unnecessary N-grams using their frequency information. The two algorithms are efficient (both have a
time complexity of O(n)) and can reduce the size of the N-gram set by up to 50%. Finally, mutual information
is used to obtain chunk candidates from the reduced N-gram set.
Perhaps the biggest contribution of this paper is that it applies the Fast Statistical Substring
Reduction algorithm to large corpora for the first time and demonstrates the effectiveness and efficiency of this algorithm, which, we
hope, will shed new light on large-scale corpus-oriented research. Experiments on three corpora of different
sizes show that this method can extract chunk candidates from gigabyte-scale corpora efficiently under current
computational power. We obtain an extraction accuracy of 86.3% on the People Daily 2000 news corpus.
Key Words: Chunk extraction, N-gram, Substring Reduction, Corpus
1 Introduction
With the rapid development of computational power and the availability of large online corpora (BNC (Clear, 1993),
People Daily (YÜ et al., 2002)), there has been a dramatic shift in computational linguistics from manually constructing
knowledge bases to partially or totally automatic knowledge acquisition by applying statistical learning methods to
large corpora (see SU, 1996, for an overview). The concept of the chunk was first raised by Abney (1991) in the early
nineties to make the task of language parsing easier. He suggested developing a chunk-based parser that decomposes
sentences into chunks, with each chunk being a syntactic unit, for easier parsing. Chunks, especially bilingual chunk
pairs, can be a useful ingredient in an example-based machine translation (EBMT) system, helping to obtain better translation
results (YAO et al., 2000). Other NLP tasks such as information retrieval, knowledge discovery and the construction
of semantic dictionaries can also benefit from such resources. Most text chunking methods applied to English
require either a parsing stage that turns the raw corpus into parse trees, or a corpus that already carries syntactic information
(such as a treebank) (Erik and Sabine, 2000). Unfortunately, neither kind of resource is widely available for Chinese.
Some language-specific properties of Chinese impose further challenges on Chinese chunking. Unlike Western languages,
which have explicit word boundaries, Chinese has no word separation in a sentence, and a sentence must be segmented into
words before further processing. Segmentation is a field that has been researched for decades, with more than two
dozen methods tried in the literature (Ponte and Croft, 1996; Palmer, 1997; Teahan, 2000). Even today no segmentation
method totally satisfies native speakers. In addition, the Chinese language enjoys more freedom in sentence structure than
English, making even shallow parsing a formidable task.
To address these problems, this paper presents a novel approach to extracting Chinese chunk candidates from large
unsegmented raw corpora. The originality of our approach resides in statistical substring reduction, a procedure
that aims at efficiently removing unnecessary N-grams from the extracted N-gram set. The first step is to acquire large N-gram
(up to 20-gram) statistics from the raw corpus. This initial N-gram set contains a vast number of "garbage strings"
which do not have any meaning at all. The work of Fung (1994) showed that, without the help of a machine-readable
dictionary, the trigrams and 4-grams extracted from raw Chinese corpora contain only 31.3% and 36.75% valid phrases, respectively.
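To make the pipeline concrete, the following sketch (our illustration, not code from the paper) counts character-based N-grams from raw text and then applies a naive statistical substring reduction: an N-gram that occurs exactly as often as a longer N-gram containing it is judged redundant and dropped. The file name, frequency threshold and maximum N are placeholder assumptions, and the quadratic reduction loop is only for illustration; the paper's two FSSR algorithms achieve the same reduction in O(n) time.

# Illustrative sketch: character N-gram counting plus a naive
# statistical substring reduction (not the paper's O(n) FSSR algorithms).
from collections import Counter

def count_ngrams(text, max_n=20, min_freq=2):
    """Count all character N-grams of length 2..max_n in the raw text."""
    counts = Counter()
    for n in range(2, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return {g: f for g, f in counts.items() if f >= min_freq}

def naive_substring_reduction(ngrams):
    """Drop every N-gram whose frequency equals that of a longer N-gram
    containing it; such a substring carries no information of its own."""
    kept = dict(ngrams)
    for longer, freq in ngrams.items():
        for n in range(2, len(longer)):          # proper substrings only
            for i in range(len(longer) - n + 1):
                sub = longer[i:i + n]
                if kept.get(sub) == freq:
                    del kept[sub]
    return kept

if __name__ == "__main__":
    # 'corpus.txt' is a placeholder for an unsegmented Chinese text file.
    with open("corpus.txt", encoding="utf-8") as f:
        raw = f.read()
    reduced = naive_substring_reduction(count_ngrams(raw))
    for gram, freq in sorted(reduced.items(), key=lambda x: -x[1])[:20]:
        print(freq, gram)

On real corpora the reduced set still contains many meaningless strings; the later step of the method uses mutual information to filter the surviving N-grams into chunk candidates.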