Naxi Sentence Similarity Calculation Based on Improved Chunking Edit Distance


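"This paper proposes a method for computing sentence similarity in Naxi, based on an improved chunking edit distance. Taking into account grammatical features of Naxi, such as verb-final word order and the tendency of nouns and verbs to appear in chunks, it defines Naxi NP (noun phrase) and VP (verb phrase) chunks and extracts the corresponding chunk rules. Using a Naxi-Chinese dictionary, Naxi words are mapped to their Chinese counterparts, and the similarity of the Chinese words is used to compute sentence similarity. The paper also covers the exchange method, replacement cost, and semantic similarity. The authors are Huihui Zhang, Zhengtao Yu, Longhua Shen, Jianyi Guo, and Cunli Mao, from the School of Information Engineering and Automation and the Key Laboratory of Intelligent Information Processing at Kunming University of Science and Technology, and from the China Research and Development Academy of Machinery Equipment."

Computing Naxi sentence similarity is a complex task because Naxi has a distinctive grammatical structure. In this paper, the researchers first define NP and VP chunks based on two properties of Naxi: verbs usually sit toward the end of the sentence, and nouns and verbs often appear in chunks. This step is the basis for understanding sentence structure, because it helps identify the main constituents of a sentence. Next, by building a Naxi-Chinese dictionary, they map Naxi vocabulary to Chinese vocabulary. This step is essential for cross-language similarity calculation, because vocabulary and grammar can differ greatly between languages.

When computing sentence similarity, the traditional edit distance may fail to reflect semantic similarity accurately, which is why the paper refers to an "improved chunking edit distance". Edit distance measures how similar two strings are by quantifying their difference as the minimum cost of the insertion, deletion, and substitution operations needed to turn one into the other. For Naxi, the characteristics and structural complexity of the language mean that more factors must be considered, such as chunk-level operations and the preservation of meaning. The researchers may therefore have introduced a specific exchange method and replacement cost strategy to better fit the features of Naxi.

Semantic similarity is another key element of sentence similarity calculation. It may involve a deeper understanding of words, for example capturing their semantic information through word sense disambiguation, context dependence, or word vector techniques. Taking semantic similarity into account makes it possible to assess how similar two sentences really are even when their surface forms differ considerably.

The method proposed in the paper combines linguistic chunking analysis, vocabulary mapping, an improved edit distance, and semantic similarity calculation, and offers an effective solution for natural language processing tasks in a language as distinctive as Naxi. The approach is applicable to Naxi and may also serve as a reference for processing other languages with similar grammatical structures.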
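The summary above defines edit distance as the minimum cost of the insertions, deletions, and substitutions that turn one sequence into another. As a point of reference, here is a minimal sketch of that classic (unimproved) definition. The function name `edit_distance` and the unit default costs are assumptions chosen for illustration; this is not the paper's improved variant.

```python
def edit_distance(a, b, ins_cost=1, del_cost=1, sub_cost=1):
    """Minimum total cost of insertions, deletions, and substitutions
    that turn sequence a into sequence b (classic dynamic programming)."""
    # prev[j] holds the cost of turning the first i-1 items of a
    # into the first j items of b; row 0 is pure insertion.
    prev = [j * ins_cost for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * del_cost]  # turning i items into nothing: i deletions
        for j in range(1, len(b) + 1):
            same = a[i - 1] == b[j - 1]
            curr.append(min(
                prev[j] + del_cost,                        # delete a[i-1]
                curr[j - 1] + ins_cost,                    # insert b[j-1]
                prev[j - 1] + (0 if same else sub_cost),   # keep or substitute
            ))
        prev = curr
    return prev[-1]


print(edit_distance("kitten", "sitting"))  # 3
```

Because the function only compares items for equality, it works on lists of words or chunks as well as on strings, which is what makes a chunk-level variant possible.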

Vol. , No. , 2013 1

Naxi Sentence Similarity Calculating Based on Improved Chunking Edit-Distance

Huihui Zhang

School of Information Engineering and Automation

Kunming University of Science and Technology, Kunming

Key Laboratory of Intelligent Information Processing

Kunming University of Science and Technology, China

E-mail: glitter_zhang@163.com

Zhengtao Yu*

School of Information Engineering and Automation

Kunming University of Science and Technology, Kunming

Key Laboratory of Intelligent Information Processing

Kunming University of Science and Technology, China

E-mail: ztyu@hotmail.com

*Corresponding author

Longhua Shen

China Research and Development Academy of Machinery Equipment, Beijing

E-mail: lhsheng@liip.cn

Jianyi Guo, Cunli Mao

The School of Information Engineering and Automation

Kunming University of Science and Technology, China

Key Laboratory of Intelligent Information Processing

Kunming University of Science and Technology, China

E-mail: gjade86@hotmail.com, mcl@163.com

Abstract: A method for Naxi sentence similarity calculation is proposed that targets the characteristics of the Naxi language. First, because Naxi verbs are placed toward the end of the sentence and nouns and verbs appear in chunks, Naxi NP and VP chunks are defined and chunk rules are extracted. Chunks such as NP and VP are then extracted from Naxi sentences according to these chunking rules. Next, each Naxi word is mapped to a Chinese word using the Naxi-Chinese Dictionary, and the semantic similarity of Naxi words is calculated from the similarity of the corresponding Chinese words. Chunk similarity is calculated by combining the Chinese word similarities. Chunk similarity defines the replacement cost of the chunk edit operation, and Naxi sentence similarity is computed from this replacement cost. Finally, an experiment on Naxi sentence similarity is carried out. The experimental results show that the proposed method performs better than the other methods compared, and that the chunk exchange method can effectively improve the accuracy of Naxi sentence similarity.

Keywords: Naxi; Sentence similarity; Chunk; Edit-distance.
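The pipeline outlined in the abstract (map Naxi words to Chinese, derive chunk similarity from word similarity, use it as the replacement cost in a chunk-level edit distance) can be sketched roughly as follows. Everything concrete here is an assumption made for illustration, since the excerpt gives no formulas: the dictionary keys are placeholder tokens rather than real Naxi words, `TOY_SIM` is a hand-set table standing in for a real Chinese word-similarity resource, chunk similarity is taken as a symmetric average of best word matches, the replacement cost is taken as 1 minus chunk similarity, and the distance is normalized by the longer sentence length. The chunk exchange operation mentioned in the abstract is not modeled.

```python
# Hypothetical Naxi-to-Chinese dictionary; keys are placeholders, not real Naxi words.
NAXI_TO_CHINESE = {
    "n_i": "我", "n_book": "书", "n_paper": "报纸",
    "n_read": "读", "n_look": "看",
}

# Toy Chinese word-similarity table (assumed values, for illustration only).
TOY_SIM = {frozenset(("读", "看")): 0.8, frozenset(("书", "报纸")): 0.6}


def word_sim(w1, w2):
    """Similarity of two Naxi words via their mapped Chinese words."""
    c1, c2 = NAXI_TO_CHINESE.get(w1), NAXI_TO_CHINESE.get(w2)
    if c1 is None or c2 is None:       # unmapped word: fall back to identity
        return 1.0 if w1 == w2 else 0.0
    if c1 == c2:
        return 1.0
    return TOY_SIM.get(frozenset((c1, c2)), 0.0)


def chunk_sim(chunk_a, chunk_b):
    """Assumed combination rule: symmetric average of best word matches.
    A chunk is a (tag, words) pair; different tags score 0."""
    (tag_a, words_a), (tag_b, words_b) = chunk_a, chunk_b
    if tag_a != tag_b or not words_a or not words_b:
        return 0.0

    def directed(xs, ys):
        return sum(max(word_sim(x, y) for y in ys) for x in xs) / len(xs)

    return (directed(words_a, words_b) + directed(words_b, words_a)) / 2


def chunk_edit_distance(s1, s2):
    """Edit distance over chunk sequences; insert/delete cost 1,
    replacement cost 1 - chunk_sim (an assumption)."""
    prev = [float(j) for j in range(len(s2) + 1)]
    for i in range(1, len(s1) + 1):
        curr = [float(i)]
        for j in range(1, len(s2) + 1):
            sub = 1.0 - chunk_sim(s1[i - 1], s2[j - 1])
            curr.append(min(prev[j] + 1.0, curr[j - 1] + 1.0, prev[j - 1] + sub))
        prev = curr
    return prev[-1]


def sentence_sim(s1, s2):
    """Normalize the distance by the longer sentence (assumed normalization)."""
    longest = max(len(s1), len(s2))
    return 1.0 if longest == 0 else 1.0 - chunk_edit_distance(s1, s2) / longest


# Two verb-final chunk sequences: NP NP VP.
s1 = [("NP", ["n_i"]), ("NP", ["n_book"]), ("VP", ["n_read"])]
s2 = [("NP", ["n_i"]), ("NP", ["n_paper"]), ("VP", ["n_look"])]
print(round(sentence_sim(s1, s2), 3))
```

With a plain equality-based edit distance these two sentences would differ in two of three chunks. The soft replacement cost lets related chunks count as partial matches, which is the intuition behind using chunk similarity as the cost.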

1 INTRODUCTION

Dongba, also called the Naxi pictographic script, is currently the only living pictographic script in the world and has attracted widespread attention from researchers around the world (Lei Shi, 2005; Yu Sui-sheng, 2008). Naxi sentence similarity calculation is the foundation of Naxi-Chinese bilingual retrieval and bilingual learning. In China, research on Chinese sentence similarity computation includes the following. Zhifang Sui and Shiwen Yu proposed the Skeletal-Dependency-Tree-Based Computational Model for Sentence Similarity for machine translation (Zhifang Sui and Shiwen Yu, 1998); Sujian Li proposed a quantitative relevance calculation model based on HowNet and Cilin (Su-jian Li, 2002); Xueqiang Lv considered the two factors of word-form and word-order similarity, and proposed a sentence similarity model and a most-similar-sentence search algorithm (Xueqiang Lv, Feiliang Ren, Huangzhi Dan and Tianshun Yao, 2003); Wanxiang Che proposed similar Chinese sentence retrieval based on improved edit distance (Bing
