ICCPOL 2003, pp. 109-117, Shenyang (China)
A Statistical Approach to Extract Chinese Chunk
Candidates from Large Corpora
ZHANG Le, LÜ Xue-qiang, SHEN Yan-na, YAO Tian-shun
Institute of Computer Software & Theory,
School of Information Science & Engineering, Northeastern University
Shenyang, 110004 China
Email: ejoy@xinhuanet.com, studystrong@sohu.com, neusyn@sohu.com, tsyao@mail.neu.edu.cn
Abstract
The extraction of chunk candidates from real corpora is one of the fundamental tasks in building an example-based
machine translation model. This paper presents a statistical approach to extracting Chinese chunk candidates from
large monolingual corpora. The first step is to extract large N-grams (up to 20-grams) from the raw corpus. Two
newly proposed Fast Statistical Substring Reduction (FSSR) algorithms are then applied to the initial N-gram set to
remove unnecessary N-grams using their frequency information. The two algorithms are efficient (both have a
time complexity of O(n)) and can reduce the size of the N-gram set by up to 50%. Finally, mutual information
is used to obtain chunk candidates from the reduced N-gram set.
Perhaps the biggest contribution of this paper is that it applies the Fast Statistical Substring
Reduction algorithm to large corpora for the first time and demonstrates the effectiveness and efficiency of this algorithm, which, we
hope, will shed new light on large-scale corpus-oriented research. Experiments on three corpora of different
sizes show that this method can extract chunk candidates from gigabyte-scale corpora efficiently under current
computational power. We obtain an extraction accuracy of 86.3% on the People Daily 2000 news corpus.
Key Words: Chunk extraction, N-gram, Substring Reduction, Corpus
1 Introduction
With the rapid development of computational power and the availability of large online corpora (BNC (Clear, 1993),
People Daily (YÜ et al., 2002)), there has been a dramatic shift in computational linguistics from manually constructing
knowledge bases to partially or totally automatic knowledge acquisition by applying statistical learning methods to
large corpora (see SU, 1996, for an overview). The concept of the chunk was first raised by Abney (1991) in the early
nineties to make the task of language parsing easier. He suggested developing a chunk-based parser that decomposes
sentences into chunks, with each chunk being a syntactic unit, for easier parsing. Chunks, especially bilingual chunk
pairs, can be a useful ingredient in an example-based machine translation (EBMT) system, helping to obtain better translation
results (YAO et al., 2000). Other NLP tasks such as information retrieval, knowledge discovery and the construction
of semantic dictionaries can also benefit from such resources. Most text chunking methods applied to English
require either a parsing stage that turns the raw corpus into parse trees, or a corpus that already carries syntactic information
(such as a treebank) (Erik and Sabine, 2000). Unfortunately, neither kind of resource is widely available for Chinese.
Some language-specific properties of Chinese impose further challenges on Chinese chunking. Unlike Western languages,
which have explicit word boundaries, Chinese has no word separation in a sentence, and a sentence must be segmented into
words before further processing. Segmentation is a field that has been researched for decades, with more than two
dozen methods tried in the literature (Ponte and Croft, 1996; Palmer, 1997; Teahan, 2000). Even today no segmentation
method totally satisfies native speakers. In addition, the Chinese language enjoys more freedom in sentence structure than
English, making even shallow parsing a formidable task.
To address these problems, this paper presents a novel approach to extracting Chinese chunk candidates from large
unsegmented raw corpora. The originality of our approach resides in statistical substring reduction, a procedure
that aims at efficiently removing unnecessary N-grams from the extracted N-gram set. The first step is to acquire large N-gram
(up to 20-gram) statistics from the raw corpus. This initial N-gram set contains a vast number of "garbage strings"
which do not have any meaning at all. The work of Fung (1994) showed that, without the help of a machine-readable
dictionary, the trigrams and 4-grams extracted from raw Chinese corpora contain only 31.3% and 36.75% valid phrases, respectively.
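To make the pipeline concrete, the following sketch (our illustration, not code from the paper) counts character-based N-grams from raw text and then applies a naive statistical substring reduction: an N-gram that occurs exactly as often as a longer N-gram containing it is judged redundant and dropped. The file name, frequency threshold and maximum N are placeholder assumptions, and the quadratic reduction loop is only for illustration; the paper's two FSSR algorithms achieve the same reduction in O(n) time.

# Illustrative sketch: character N-gram counting plus a naive
# statistical substring reduction (not the paper's O(n) FSSR algorithms).
from collections import Counter

def count_ngrams(text, max_n=20, min_freq=2):
    """Count all character N-grams of length 2..max_n in the raw text."""
    counts = Counter()
    for n in range(2, max_n + 1):
        for i in range(len(text) - n + 1):
            counts[text[i:i + n]] += 1
    return {g: f for g, f in counts.items() if f >= min_freq}

def naive_substring_reduction(ngrams):
    """Drop every N-gram whose frequency equals that of a longer N-gram
    containing it; such a substring carries no information of its own."""
    kept = dict(ngrams)
    for longer, freq in ngrams.items():
        for n in range(2, len(longer)):          # proper substrings only
            for i in range(len(longer) - n + 1):
                sub = longer[i:i + n]
                if kept.get(sub) == freq:
                    del kept[sub]
    return kept

if __name__ == "__main__":
    # 'corpus.txt' is a placeholder for an unsegmented Chinese text file.
    with open("corpus.txt", encoding="utf-8") as f:
        raw = f.read()
    reduced = naive_substring_reduction(count_ngrams(raw))
    for gram, freq in sorted(reduced.items(), key=lambda x: -x[1])[:20]:
        print(freq, gram)

On real corpora the reduced set still contains many meaningless strings; the later step of the method uses mutual information to filter the surviving N-grams into chunk candidates.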