A Language-Independent Hybrid Approach for Multi-Word Expression Extraction
Yinghong Liang, Department of Software Engineering, Jingling Institute of Technology, Nanjing, China, liangyh@jit.edu.cn
Hongye Tan, Department of Computer and Information Technology, Shanxi University, Shanxi, China, hytan_2006@126.com
Hui Li, Department of Software Engineering, Jingling Institute of Technology, Nanjing, China, lihui@jit.edu.cn
Zhigang Wang, Department of Software Engineering, Jingling Institute of Technology, Nanjing, China, friend@jit.edu.cn
Wenming Gui, Department of Software Engineering, Jingling Institute of Technology, Nanjing, China, gwm@jit.edu.cn
Abstract—Failing to identify multi-word expressions (MWEs) may cause serious problems for many Natural Language Processing (NLP) tasks. Previous approaches depend heavily on language-specific knowledge and pre-existing NLP tools. However, many languages (including Chinese) have fewer such resources and tools than English, so an approach that automatically learns effective features from corpora, without relying on language-specific resources, is needed. In this paper, we develop a hybrid approach that combines a Bi-directional Long Short-Term Memory (Bi-LSTM) network, word correlation degree calculation, and weakly supervised K-means clustering to capture both the sequence information and the correlation degree of phrases from specific contexts, and we use them to train a multi-word expression detector for multiple languages without any manually encoded features. Experimental results show that the Chinese and English MWEs extracted by this hybrid approach are better than those of the baseline algorithms, which verifies that the hybrid approach is effective.
Keywords—Multi-Word Expression; Bi-LSTM; Language-Independent
I. INTRODUCTION
As research in natural language processing has deepened, researchers have found that a major factor affecting performance improvement is the accurate extraction of multi-word expressions (MWEs). Most researchers adopt the definition of MWE given by Sag et al. in 2002 [1]: a single meaning unit that combines two or more words. For example:
English S1: I only [want some more] [white coffee].
Chinese S2: 早晨[洗完澡]，他怀着[忐忑不安]的心情与导师见了一面。
Translation S2: He met his tutor [in a rather nervous state] [after a bath] in the morning.
In S1, [want some more] is a compound verb and [white coffee] is a compound noun.
In S2, [洗完澡] [after a bath] is a verb phrase with a loose structure, and [忐忑不安] [in a rather nervous state] is an idiom.
MWE extraction is a special case of phrase recognition, and it is regarded as a difficult bottleneck problem in the field of Natural Language Processing. Constructing Chinese MWE data is time-consuming work. To avoid this problem, most researchers have used English-Chinese parallel corpora to extract Chinese MWEs [2-6]. Other researchers have labeled small-scale corpora and then used them to extract Chinese MWEs.
Most previous methods treated multi-word expression (MWE) extraction as a classification problem and designed many lexical and syntactic features. These features are often derived from language-specific resources, which makes such methods difficult to apply to different languages.
For example, in S1, when predicting the type of the compound-verb candidate "want some more", forward sequence information such as "I" can help the classifier label "want" as the beginning of a compound verb. In addition, considering S2, "洗完澡" [after a bath] is a verb phrase with a loose structure: "洗澡" is a phrase, and "洗" is related to the following context "澡". However, for feature engineering methods it is hard to establish a relation between "洗" and "澡", or between "I" and "want", because there is no direct dependency path between them.
Recently, deep learning techniques have been widely used to model complex structures and have proven effective for many NLP tasks, such as relation extraction [7] and sentiment analysis [8]. A key advantage of these neural architectures is that they can capture the meaning of linguistic phenomena ranging from individual words [9] to longer-range linguistic contexts at the sentence level [10]. The Bi-directional Long Short-Term Memory (Bi-LSTM) model [11] is a two-way recurrent neural network (RNN) [12] that can capture both the preceding and the following context of each word.
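To make this concrete, the sketch below (our illustration, not the authors' actual model; the hyperparameters and the B/I/O tag set are assumptions) shows how a Bi-LSTM can score a tag for every token, so that an MWE span such as [want some more] would correspond to the tag sequence B I I:

import torch
import torch.nn as nn

class BiLSTMTagger(nn.Module):
    """Illustrative per-token tagger; tags 0/1/2 stand for hypothetical B/I/O labels."""
    def __init__(self, vocab_size, embed_dim=100, hidden_dim=128, num_tags=3):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        # bidirectional=True lets each position see both its preceding
        # and its following context, as described above.
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        # Forward and backward hidden states are concatenated (2 * hidden_dim).
        self.fc = nn.Linear(2 * hidden_dim, num_tags)

    def forward(self, token_ids):
        # token_ids: (batch, seq_len) integer word indices
        embedded = self.embed(token_ids)
        outputs, _ = self.lstm(embedded)   # (batch, seq_len, 2 * hidden_dim)
        return self.fc(outputs)            # per-token tag scores

# Toy usage on an 8-token sentence such as S1.
model = BiLSTMTagger(vocab_size=1000)
tokens = torch.randint(0, 1000, (1, 8))   # stand-in word indices
tag_scores = model(tokens)
print(tag_scores.shape)                   # torch.Size([1, 8, 3])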
In this work, we present a Bi-LSTM neural network to model sequence information from specific contexts, which does not require manual effort to find the best relevant features. The sequence is a language-independent structure for MWE extraction. Taking advantage of word semantic