Applying Lexicalized Statistical Patterns in Chinese Linguistic Analysis


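This research paper on exploiting lexicalized statistical patterns in Chinese linguistic analysis examines how linguistic information can be obtained through search engines and a simple string matching strategy. The authors, Yu Zhao and Maosong Sun, are with the Department of Computer Science and Technology at Tsinghua University, and are affiliated with the State Key Lab on Intelligent Technology and Systems and the National Lab for Information Science and Technology. They study three levels of Chinese analysis: lexical, syntactic and semantic.

The lexicalized statistical pattern is the core concept. It treats a linguistic structure as a sequence of lexical elements and counts how often such patterns occur in a large web corpus, here the SogouT corpus. Rather than relying on the complex rules or models of traditional linguistic analysis, the approach relies on statistical regularities in large amounts of data, which simplifies the analysis and makes it more efficient.

At the lexical level, lexicalized statistical patterns help identify common collocations and polysemy, which matters for understanding sentence meaning and for building dictionaries. Frequency counts from a search engine reveal co-occurrence relations between words, from which their association and usage habits can be assessed.

At the syntactic level, the patterns can identify the structural type of a phrase, such as a verb-object or subject-predicate structure, which guides syntactic analysis and the construction of parse trees. Statistics on how specific syntactic patterns are distributed in the corpus can support parsing algorithms in resolving complex sentence structure.

At the semantic level, although Chinese does not have the rich morphological inflection of English, lexicalized statistical patterns can still reveal implicit meanings of words, for example metaphorical or symbolic senses found through frequent collocations in context. The patterns can also be used to identify and analyze compounds, idioms, and the specific senses of polysemous words in different contexts.

The experimental results show that lexicalized statistical patterns are effective in Chinese linguistic analysis, in particular for analyzing the cohesion of phrases, determining the phrasal category, and discovering patient objects. These findings are relevant to natural language processing (NLP) tasks such as text mining, machine translation and information retrieval.

Keywords: lexicalized statistical pattern, Chinese linguistic analysis, web corpus, natural language processing.

The paper offers both a conceptual framework and evidence of practical feasibility, giving future work on Chinese language processing something to build on.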
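The frequency counting described above can be illustrated with a minimal Python sketch. It is not the authors' specialized search engine: the function name `pattern_frequency`, the three-sentence toy corpus, and the example patterns are all assumptions made for illustration, and exact substring matching stands in for the "simple string matching strategy" the paper mentions.

```python
from typing import Iterable


def pattern_frequency(pattern: str, corpus: Iterable[str]) -> int:
    """Count non-overlapping exact occurrences of `pattern` over all documents.

    Hypothetical helper: a real system would query an index over a web-scale
    corpus instead of scanning strings in memory.
    """
    if not pattern:
        raise ValueError("pattern must be a non-empty string")
    return sum(doc.count(pattern) for doc in corpus)


# Toy corpus (invented sentences, not taken from SogouT).
toy_corpus = [
    "他正在学习汉语。",
    "我们开始学习汉语了。",
    "老师发了一份学习材料。",
]

# Frequency counts for a word and two longer lexicalized patterns built on it.
counts = {
    pattern: pattern_frequency(pattern, toy_corpus)
    for pattern in ("学习", "学习汉语", "学习材料")
}
print(counts)
```

Because the patterns are plain strings, no word segmentation or annotation is needed, which is the property that makes this kind of counting feasible on raw web text.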

Exploiting Lexicalized Statistical Patterns in Chinese Linguistic Analysis

Yu Zhao and Maosong Sun

Department of Computer Science and Technology,

State Key Lab on Intelligent Technology and Systems,

National Lab for Information Science and Technology,

Tsinghua University, Beijing 100084, China

Abstract. The web corpus has been used for linguistic analysis with the help of search engines. In this paper, we describe the concept of lexicalized patterns, which we exploit to obtain statistical information using a simple string matching strategy via search engines. We discuss the usage of lexicalized statistical patterns at three linguistic levels of Chinese analysis: lexical, syntactic and semantic. We develop a specialized search engine to get frequency counts for these patterns on the SogouT corpus. Experimental results show that lexicalized statistical patterns are effective in analyzing the cohesion of phrases, determining the phrasal category and discovering patient objects.

Keywords: Lexicalized statistical pattern, Chinese linguistic analysis, Web corpus, Natural language processing.

1 Introduction

Most current statistical natural language processing systems rely on large, well-organized annotated corpora. For example, the state-of-the-art dependency parser uses Treebanks to extract POS-tag features [9]. Nevertheless, these corpora are highly time-consuming and labor-intensive to build and extend. Moreover, they are mostly limited in size. The main cause of error in many natural language processing tasks is the lack of relevant statistical information in the training set.

Let us consider the task of determining the phrasal category, which is one of the most significant issues in shallow parsing. In Chinese, a chunk that has two components, VP+NP, can be either a verbal phrase or a substantive phrase. For instance, the phrase meaning "farewell ceremony" and the phrase meaning "say goodbye to a friend" are both composed of VP+NP, but the former is a substantive phrase and the latter is a verbal phrase. However, the famous Stanford parser incorrectly categorizes the former as a verbal phrase, as shown in Figure 1. Resolving this type of error requires information that is not present in Treebanks.
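One way such missing information could be recovered from raw text is to compare how often a VP+NP chunk appears in noun-like versus verb-like contexts. The Python sketch below is purely illustrative and makes several assumptions: the excerpt does not list the patterns the authors actually use, so the prefix frames (`NOMINAL_PREFIXES`, `VERBAL_PREFIXES`), the function names, the example chunks, and the toy corpus are all hypothetical.

```python
from typing import Sequence

# Hypothetical context frames (assumptions, not the paper's patterns):
# a demonstrative/numeral + measure word tends to precede a substantive phrase,
# while an aspect adverb or an inchoative verb tends to precede a verbal phrase.
NOMINAL_PREFIXES = ("一份", "这份")
VERBAL_PREFIXES = ("正在", "开始")


def count_in_frames(chunk: str, prefixes: Sequence[str], corpus: Sequence[str]) -> int:
    """Sum exact-match counts of each `prefix + chunk` string over the corpus."""
    return sum(doc.count(prefix + chunk) for prefix in prefixes for doc in corpus)


def guess_category(chunk: str, corpus: Sequence[str]) -> str:
    """Pick the category whose frames match more often; ties stay undecided."""
    nominal = count_in_frames(chunk, NOMINAL_PREFIXES, corpus)
    verbal = count_in_frames(chunk, VERBAL_PREFIXES, corpus)
    if nominal > verbal:
        return "substantive"
    if verbal > nominal:
        return "verbal"
    return "undecided"


# Toy corpus (invented sentences standing in for a web-scale corpus).
toy_corpus = [
    "老师发了一份学习材料。",
    "这份学习材料很有用。",
    "他正在学习汉语。",
    "我们开始学习汉语了。",
]

for chunk in ("学习材料", "学习汉语", "阅读报纸"):
    print(chunk, guess_category(chunk, toy_corpus))
```

The "undecided" outcome is kept explicit because a chunk that never occurs in any frame gives no evidence either way, a situation that becomes rarer as the corpus grows.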

Therefore, a growing number of researchers have come to recognize the potential of web-scale corpora for NLP tasks. The key advantage of the web corpus lies in

The 2008 version is available online at http://www.sogou.com/labs/dl/t.html

M. Sun et al. (Eds.): CCL and NLP-NABD 2013, LNAI 8202, pp. 238–246, 2013.

© Springer-Verlag Berlin Heidelberg 2013
