Earth Mover's Distance Regularization for Bilingual Lexicon Induction from Non-Parallel Data

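The research paper "Inducing Bilingual Lexica From Non-Parallel Data With Earth Mover's Distance Regularization" was published at COLING 2016, the International Conference on Computational Linguistics. It studies how to induce bilingual lexica from non-parallel data. By introducing Earth Mover's Distance (EMD) regularization, it handles words that have more than one translation, which makes it suitable for cross-lingual processing in low-resource languages and domains.

In natural language processing and computational linguistics, building a bilingual lexicon is the foundation of cross-lingual tasks, especially for languages and domains with limited resources. Traditional methods usually assume that each source-language word has exactly one target-language translation, i.e. a one-to-one translation assumption. This assumption does not hold in real language, because a word can have several senses that correspond to several target-language words.

The authors, Meng Zhang, Yang Liu, Huanbo Luan, Yiqun Liu and Maosong Sun, propose a method that uses the Earth Mover's Distance to relax the one-to-one assumption. EMD measures the difference between two probability distributions and is commonly used in image processing and transportation problems. In bilingual lexicon induction, it lets the model consider many-to-one and one-to-many mappings between source-language and target-language words.

The paper brings EMD into the training process so that source and target words can be matched flexibly. The model can therefore learn not only the most likely translation pairs but also more complex phenomena such as polysemy and synonymy. This improves bilingual lexicon induction, particularly on non-parallel data, which is the more common situation in practice.

In the experiments, the authors presumably compare their method with existing approaches, including statistical and deep learning models, and show its advantages across language pairs and tasks. They may also discuss the choice of parameters for the EMD regularizer, training efficiency, and adaptation to different data distributions.

Overall, the paper offers a new approach to translation ambiguity that may benefit cross-lingual information retrieval, machine translation and multilingual text understanding, and may improve cross-lingual processing for low-resource languages.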

where $C_{st}$ defines the cost of matching the target word $w_t^T$ and the source word $w_s^S$ (illustrated by the distance between words in Figure 1), and $f_t^T$ (resp. $f_s^S$) is the weight associated with $w_t^T$ (resp. $w_s^S$) (illustrated by the sizes of the shapes in Figure 1(b)). The weights are chosen to be the number of times a word appears in the corpus. Once the linear program is solved, the matrix $T$ stores the matching information between source and target vocabularies. This cross-lingual matching can be interpreted as translation. For example, a non-zero $T_{st}$ can be seen as evidence to translate the source word $w_s^S$ to the target word $w_t^T$.

Besides the vocabulary-level matching, the EMD program brings an additional benefit. As mentioned in Section 1, it automatically retrieves multiple translations for a source word as long as the program finds it appropriate (cf. Figure 1). In the following section, we will strengthen this desirable capability by bringing the EMD program from a post-processing step to the training phase.
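The post-processing linear program described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: it assumes SciPy's `linprog` as the solver, and the function name `solve_emd`, the toy cost matrix and the toy word counts are hypothetical.

```python
import numpy as np
from scipy.optimize import linprog


def solve_emd(cost, f_src, f_tgt):
    """Solve min sum_{s,t} T[s,t] * cost[s,t] subject to
    T >= 0, sum_s T[s,t] <= f_tgt[t], sum_t T[s,t] == f_src[s].

    cost has shape (V_src, V_tgt); the returned T has the same shape.
    (Hypothetical helper, written only for illustration.)
    """
    n_src, n_tgt = cost.shape
    c = cost.ravel()  # T is flattened row-major: index s * n_tgt + t
    # Equality rows: each source word ships out exactly its weight.
    A_eq = np.kron(np.eye(n_src), np.ones((1, n_tgt)))
    # Inequality rows: each target word receives at most its weight.
    A_ub = np.kron(np.ones((1, n_src)), np.eye(n_tgt))
    res = linprog(c, A_ub=A_ub, b_ub=f_tgt, A_eq=A_eq, b_eq=f_src,
                  bounds=(0, None), method="highs")
    if not res.success:
        raise ValueError(res.message)
    return res.x.reshape(n_src, n_tgt)


# Toy data (assumed): 2 source words, 3 target words.
cost = np.array([[0.1, 0.2, 0.9],
                 [0.8, 0.9, 0.1]])
f_src = np.array([5.0, 2.0])       # source word counts
f_tgt = np.array([3.0, 3.0, 4.0])  # target word counts

T = solve_emd(cost, f_src, f_tgt)
for s, t in zip(*np.nonzero(T > 1e-9)):
    print(f"source {s} -> target {t} (flow {T[s, t]:.1f})")
```

With these toy numbers the first source word carries more weight than any single nearby target word can absorb, so its flow is split across two target words. That is the "multiple translations" behaviour the text refers to.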

3 Approach

In typical scenarios, resources available to bilingual lexicon inducers include non-parallel corpora $\mathcal{C}^S$ and $\mathcal{C}^T$, and a seed lexicon $d$. In order to utilize these resources to train bilingual word embeddings, a straightforward idea is to devise a learning objective that combines a monolingual term and a seed term.

The monolingual term $J_{\text{mono}}$ is responsible for explaining regularities in corpora $\mathcal{C}^S$ and $\mathcal{C}^T$. Since the two corpora are non-parallel, $J_{\text{mono}}$ consists of two monolingual submodels that are independent of each other:

$$J_{\text{mono}}\left(W^S, W^T\right) = J_{\text{mono}}\left(W^S\right) + J_{\text{mono}}\left(W^T\right). \tag{2}$$

Following common practice (Gouws et al., 2015), we choose the well-established skip-gram model (Mikolov et al., 2013a) for our monolingual term.
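To make the monolingual term concrete, the sketch below evaluates a skip-gram log-likelihood for one language and sums two such terms, one per language, which mirrors the independence of the two submodels. It assumes the plain full-softmax form of skip-gram with separate input and output embedding matrices; the excerpt does not say which variant or approximation the authors train, and the names `skipgram_log_prob` and `mono_objective` are hypothetical.

```python
import numpy as np


def skipgram_log_prob(W_in, W_out, center, context):
    """log p(context | center) under a full-softmax skip-gram model.

    W_in, W_out: (dim, V) matrices whose columns are the input and
    output embeddings (assumed layout, one column per word).
    """
    scores = W_out.T @ W_in[:, center]  # one score per vocabulary word
    scores = scores - scores.max()      # stabilise the softmax numerically
    return scores[context] - np.log(np.exp(scores).sum())


def mono_objective(W_in, W_out, pairs):
    """Sum of log-probabilities over (center, context) index pairs."""
    return sum(skipgram_log_prob(W_in, W_out, w, c) for w, c in pairs)


rng = np.random.default_rng(0)
# Toy source-side and target-side models (random, for illustration only).
Ws_in, Ws_out = rng.normal(size=(4, 6)), rng.normal(size=(4, 6))
Wt_in, Wt_out = rng.normal(size=(4, 5)), rng.normal(size=(4, 5))

# The two submodels share no parameters, so the joint term is a plain sum.
j_mono = (mono_objective(Ws_in, Ws_out, [(0, 1), (1, 2)])
          + mono_objective(Wt_in, Wt_out, [(3, 4)]))
print(j_mono)
```

Because each half depends only on its own embeddings, the monolingual term alone gives the two embedding spaces no reason to align, which is why the seed term introduced next is needed.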

The seed term $J_{\text{seed}}$ encourages embeddings of word translation pairs in a seed lexicon $d$ to move near, which can be achieved via an $L_2$ regularizer:

$$J_{\text{seed}}\left(W^S, W^T\right) = -\sum_{\langle s,t\rangle \in d} \left\lVert W_s^S - W_t^T \right\rVert^2, \tag{3}$$

where $s \in \left\{1, ..., V^S\right\}$ and $W_s^S$ is the $s$-th column of $W^S$ (i.e. the embedding of the $s$-th source word $w_s^S$), and notations are similar for the target side.
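The seed term and its gradients can be sketched in a few lines of NumPy. This is an illustration under assumptions: the norm is taken to be the squared Euclidean norm (the exponent is not legible in the extracted text), embeddings are stored as columns as the text describes, and the function names are hypothetical.

```python
import numpy as np


def seed_term(W_src, W_tgt, seed_pairs):
    """J_seed = -sum over <s, t> in d of ||W_src[:, s] - W_tgt[:, t]||^2.

    The squared L2 norm is an assumption made for this sketch.
    """
    s_idx, t_idx = zip(*seed_pairs)
    diff = W_src[:, list(s_idx)] - W_tgt[:, list(t_idx)]
    return -np.sum(diff ** 2)


def seed_term_grads(W_src, W_tgt, seed_pairs):
    """Gradients of J_seed with respect to both embedding matrices."""
    g_src = np.zeros_like(W_src)
    g_tgt = np.zeros_like(W_tgt)
    for s, t in seed_pairs:
        diff = W_src[:, s] - W_tgt[:, t]
        g_src[:, s] -= 2.0 * diff  # d/dW_src[:, s] of -||diff||^2
        g_tgt[:, t] += 2.0 * diff  # d/dW_tgt[:, t] of -||diff||^2
    return g_src, g_tgt


# Toy embeddings (columns are words) and a two-entry seed lexicon.
W_src = np.array([[1.0, 0.0],
                  [0.0, 1.0]])
W_tgt = np.array([[1.0, 0.0],
                  [0.0, 3.0]])
d = [(0, 0), (1, 1)]

print(seed_term(W_src, W_tgt, d))  # first pair coincides, second is 2 apart

# One small gradient-ascent step pulls the seed pairs closer together.
g_src, g_tgt = seed_term_grads(W_src, W_tgt, d)
W_src2, W_tgt2 = W_src + 0.1 * g_src, W_tgt + 0.1 * g_tgt
print(seed_term(W_src2, W_tgt2, d))
```

Since the objective is maximized, the update adds the gradient; the term reaches its maximum of zero when every seed pair shares the same embedding.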

However, as shown in our experiment, a simple linear combination of the monolingual term and the seed term is insufficient to provide satisfactory performance. We propose to introduce the Earth Mover's Distance into the training phase, as an additional term in the learning objective:

$$J_{\text{EMD}}\left(W^S, W^T, T\right) = -\sum_{s=1}^{V^S} \sum_{t=1}^{V^T} T_{st}\, C_{st} \tag{4}$$

with constraints

$$T_{st} \ge 0, \qquad \sum_{s=1}^{V^S} T_{st} \le f_t^T,\ t \in \left\{1, ..., V^T\right\}, \qquad \sum_{t=1}^{V^T} T_{st} = f_s^S,\ s \in \left\{1, ..., V^S\right\}. \tag{5}$$

Note that, unlike the post-processing case (1), the ground distance matrix $C$ is now parametrized by bilingual embeddings $W^S$ and $W^T$, and therefore adjustable during training.
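The dependence of the ground distance on the embeddings can be illustrated as follows. The sketch assumes that the ground distance is the Euclidean distance between a source and a target embedding, which the excerpt does not state explicitly. It evaluates the EMD term for a given transport matrix and checks that matrix against the constraints. All function names and numbers are hypothetical.

```python
import numpy as np


def ground_distance(W_src, W_tgt):
    """C[s, t] = Euclidean distance between source column s and target
    column t (the choice of distance is an assumption of this sketch)."""
    diff = W_src[:, :, None] - W_tgt[:, None, :]  # (dim, V_src, V_tgt)
    return np.sqrt((diff ** 2).sum(axis=0))


def emd_term(W_src, W_tgt, T):
    """J_EMD = -sum_{s,t} T[s, t] * C[s, t], with C built from embeddings."""
    return -np.sum(T * ground_distance(W_src, W_tgt))


def is_feasible(T, f_src, f_tgt, tol=1e-9):
    """Check T >= 0, row sums equal f_src, column sums at most f_tgt."""
    return bool((T >= -tol).all()
                and np.allclose(T.sum(axis=1), f_src)
                and (T.sum(axis=0) <= f_tgt + tol).all())


# Toy setup: two source words, one target word, 2-dimensional embeddings.
W_src = np.array([[0.0, 3.0],
                  [0.0, 4.0]])  # columns: (0, 0) and (3, 4)
W_tgt = np.array([[0.0],
                  [0.0]])       # single target word at the origin
T = np.array([[1.0],
              [2.0]])           # all source weight flows to that target

print(is_feasible(T, f_src=np.array([1.0, 2.0]), f_tgt=np.array([3.0])))
print(emd_term(W_src, W_tgt, T))

# Moving the second source word closer to the target raises the objective,
# which is the sense in which C is adjustable during training.
W_src_closer = W_src.copy()
W_src_closer[:, 1] = [1.5, 2.0]
print(emd_term(W_src_closer, W_tgt, T))
```

For a fixed $T$, the term rewards embeddings that shorten the distances carrying the most flow; for fixed embeddings, choosing $T$ is the same linear program as in the post-processing case.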

Putting everything together, we arrive at our overall learning objective to maximize:

$$J\left(W^S, W^T, T\right) = J_{\text{mono}}\left(W^S, W^T\right) + \lambda_{\text{seed}}\, J_{\text{seed}}\left(W^S, W^T\right) + \lambda_{\text{EMD}}\, J_{\text{EMD}}\left(W^S, W^T, T\right) \tag{6}$$

with constraints (5) inherited from the EMD. The hyperparameters $\lambda_{\text{seed}}$ and $\lambda_{\text{EMD}}$ control the relative weighting of the terms. In this form, we can naturally view the EMD term as a regularizer that can potentially …
