International Joint Conference on Natural Language Processing, pages 614–622,
Nagoya, Japan, 14-18 October 2013.
Context-Based Chinese Word Segmentation using SVM Machine-
Learning Algorithm without Dictionary Support
Chia-ming Lee
Department of Engineering Science
and Ocean Engineering,
National Taiwan University,
Taipei, Taiwan (R.O.C.)
trueming@gmail.com
Chien-Kang Huang
Department of Engineering Science
and Ocean Engineering,
National Taiwan University,
Taipei, Taiwan (R.O.C.)
ckhuang@ntu.edu.tw
Abstract
This paper presents a new machine-learning
Chinese word segmentation (CWS) approach
that defines CWS as a break-point classifica-
tion problem, where a break point is the
boundary between two consecutive words.
Further, this paper exploits a support vector
machine (SVM) classifier, which learns the
segmentation rules of the Chinese language
from a context model of break points in a
corpus.
Additionally, we have designed an effective
feature set for building the context model,
and a systematic approach for creating the
positive and negative samples used for train-
ing the classifier. Unlike traditional ap-
proaches, which require the assistance of
large-scale information sources such as dic-
tionaries or linguistic tagging, the proposed
approach selects the most frequent words in
the corpus as its learning source. In this way,
CWS can be carried out on any novel corpus
without such supporting resources. Accord-
ing to our experimental results, the proposed
approach achieves results competitive with
the Chinese Knowledge and Information
Processing (CKIP) system from Academia
Sinica.
1 Introduction
Chinese sentences are sequences of characters
that are not delimited by whitespace or any
other word-boundary symbol, so Chinese word
segmentation (CWS) is one of the fundamental
problems in Chinese natural language pro-
cessing.
One of the major issues in existing CWS re-
search is the resolution of word segmentation
ambiguities. The conventional approach to am-
biguity detection is to use two maximum
matching methods (MMs), which scan corpora
forward (Forward Maximum Matching, FMM)
and backward (Backward Maximum Matching,
BMM) based on dictionaries (Kit, Pan, & Chen,
2002). Meanwhile, disambiguation methods can
be classified into two categories: rule-based
methods and statistics-based methods (Ma &
Chen, 2003b). The disambiguation problem is
often accompanied by the problem of unknown-
word, or out-of-vocabulary (OOV), extraction
(K.-J. Chen & Ma, 2002). Besides the
MMs with dictionaries, which are also known as
word-based approaches, there are character-
based approaches. The word-based approach
treats words as the basic unit of a language, and
the character-based approach labels each charac-
ter as the beginning, middle, or end of a word.
Character-based approaches are often imple-
mented with a machine-learning classification
algorithm for handling disambiguation (Wang,
Zong, & Su, 2012). In addition to dictionaries,
other linguistic resources such as part-of-speech
(POS) or semantic information can be integrated
for further improvement (M.-y. Zhang, Lu, &
Zou, 2004).
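The two matching directions above can be sketched as follows. This is an illustrative sketch only: the toy dictionary and the classic ambiguous string 研究生命 ("research" + "life" vs. "graduate student" + "fate") are our own assumptions, not data from this paper.

```python
def fmm_segment(text, dictionary, max_word_len=4):
    """Forward Maximum Matching: greedily take the longest dictionary
    word starting at each position, falling back to single characters."""
    words, i = [], 0
    while i < len(text):
        for length in range(min(max_word_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if length == 1 or candidate in dictionary:
                words.append(candidate)
                i += length
                break
    return words

def bmm_segment(text, dictionary, max_word_len=4):
    """Backward Maximum Matching: the same greedy scan, from the end."""
    words, j = [], len(text)
    while j > 0:
        for length in range(min(max_word_len, j), 0, -1):
            candidate = text[j - length:j]
            if length == 1 or candidate in dictionary:
                words.insert(0, candidate)
                j -= length
                break
    return words

dictionary = {"研究", "生命", "研究生"}
print(fmm_segment("研究生命", dictionary))  # ['研究生', '命']
print(bmm_segment("研究生命", dictionary))  # ['研究', '生命']
```

When the two scans disagree, as here, an overlapping ambiguity has been detected and a disambiguation method must choose between the candidate segmentations.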
In addition to disambiguation strategies, many
researchers develop methods for identifying the
best word sequence in their CWS systems. The
hidden Markov model (HMM) (Lin, 2006; M.-y.
Zhang et al., 2004), maximum entropy (ME),
mutual information (MI), and boundary depend-
ency (Peng & Schuurmans, 2001) are often used.
Theoretically, obtaining the best CWS result
amounts to finding the optimal word sequence.
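As a minimal illustration of word-sequence optimization, the following sketch scores every segmentation path with word log-probabilities and picks the best one by dynamic programming over break positions. The probabilities here are hypothetical stand-ins for scores a real model (e.g. an HMM) would estimate from a corpus; this is not the scoring used by the systems cited above.

```python
import math

def best_segmentation(text, word_logprob, max_word_len=4):
    """Dynamic programming over break points: best[j] holds the best
    score for the prefix text[:j] and the start of its last word.
    Assumes every character is coverable by some dictionary entry."""
    n = len(text)
    best = [(-math.inf, None)] * (n + 1)
    best[0] = (0.0, None)
    for j in range(1, n + 1):
        for length in range(1, min(max_word_len, j) + 1):
            lp = word_logprob.get(text[j - length:j])
            if lp is not None and best[j - length][0] + lp > best[j][0]:
                best[j] = (best[j - length][0] + lp, j - length)
    words, j = [], n  # backtrack from the end of the sentence
    while j > 0:
        i = best[j][1]
        words.insert(0, text[i:j])
        j = i
    return words

# Hypothetical unigram log-probabilities for the ambiguous string 研究生命.
logprob = {"研究": -1.0, "生命": -1.0, "研究生": -2.5, "命": -3.0}
print(best_segmentation("研究生命", logprob))  # ['研究', '生命']
```

Here the path 研究/生命 (score -2.0) beats 研究生/命 (score -5.5), so the ambiguity is resolved by the global sequence score rather than by greedy matching.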
As described above, existing CWS research
takes either words or characters as the core unit
of its methodology. Instead of resolving word
ambiguity, finding the best word sequence, or
joining characters into words, we redefine the
CWS problem as the identification of “break
points” among the “joint points” in Chinese
character sequences. In this paper, we define a
“joint