Iterative Learning of Parallel Lexicons and Phrases from Non-Parallel Corpora
Meiping Dong†, Yang Liu†‡∗, Huanbo Luan†, Maosong Sun†‡, Tatsuya Izuha+, Dakun Zhang#
†State Key Laboratory of Intelligent Technology and Systems,
Tsinghua National Laboratory for Information Science and Technology,
Department of Computer Sci. and Tech., Tsinghua University, Beijing, China
‡Jiangsu Collaborative Innovation Center for Language Competence, Jiangsu, China
+Toshiba Corporation Corporate Research & Development Center
#Toshiba (China) R&D Center
hellodmp@163.com, {liuyang2011,sms}@tsinghua.edu.cn, luanhuanbo@gmail.com
tatsuya.izuha@toshiba.co.jp, zhangdakun@toshiba.com.cn
∗Yang Liu is the corresponding author.
Abstract
While parallel corpora are an indispensable resource for data-driven multilingual natural language processing tasks such as machine translation, they are limited in quantity, quality, and coverage. As a result, learning translation models from non-parallel corpora has become increasingly important, especially for low-resource languages. In this work, we propose a joint model for iteratively learning parallel lexicons and phrases from non-parallel corpora. The model is trained using a Viterbi EM algorithm that alternates between constructing parallel phrases using lexicons and updating lexicons based on the constructed parallel phrases. Experiments on Chinese-English datasets show that our approach learns better parallel lexicons and phrases and significantly improves translation performance.
1 Introduction
Parallel corpora, which are collections of parallel texts, play a critical role in data-driven multilingual natural language processing (NLP) tasks such as statistical machine translation (MT) and cross-lingual information retrieval. For example, in statistical MT, parallel corpora serve as the central source for estimating translation model parameters [Brown et al., 1993; Koehn et al., 2003; Chiang, 2005]. It is widely accepted that the quantity, quality, and coverage of parallel corpora have an important effect on the performance of statistical MT systems.
Despite the apparent success of data-driven multilingual NLP techniques, the availability of large-scale, wide-coverage, high-quality parallel corpora remains a major challenge. For most language pairs, parallel corpora are nonexistent. Even for the top handful of resource-rich languages, the available parallel corpora are usually unbalanced because the major sources are government documents or news articles.
As a result, learning translation models from non-parallel corpora has attracted intensive attention from the community [Koehn and Knight, 2002; Fung and Cheung, 2004; Munteanu and Marcu, 2006; Quirk et al., 2007; Ueffing et al., 2007; Haghighi et al., 2008; Bertoldi and Federico, 2009; Cettolo et al., 2010; Daumé III and Jagarlamudi, 2011; Ravi and Knight, 2011; Nuhn et al., 2012; Dou and Knight, 2012; Klementiev et al., 2012; Zhang and Zong, 2013; Dou et al., 2014]. Most existing approaches focus on learning word-based models: either bilingual lexicons or IBM models. Based on canonical correlation analysis (CCA), Haghighi et al. [2008] leverage orthographic and context features to induce word translation pairs. Ravi and Knight [2011] cast training IBM models on monolingual data as a decipherment problem. However, word-based models are not expressive enough to capture non-local dependencies and are therefore insufficient to yield high-quality translations.
Recently, several authors have moved a step further to learn phrase-based models from non-parallel corpora [Klementiev et al., 2012; Zhang and Zong, 2013]. Zhang and Zong [2013] propose to use a parallel lexicon to retrieve parallel phrases from non-parallel corpora. They show that their approach can learn new translations and improve translation performance. However, their approach is unidirectional: it only uses lexicons to identify parallel phrases. In fact, it is possible to learn better lexicons from the extracted phrase pairs in the reverse direction, which potentially constitutes a "find-one-get-more" loop [Fung and Cheung, 2004], as illustrated in the sketch below.
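To make the retrieval step concrete, the following is a minimal Python sketch of lexicon-driven phrase pair retrieval. The coverage heuristic, the threshold, and all names and toy data are illustrative assumptions, not Zhang and Zong's actual method.

```python
# Illustrative sketch of lexicon-based parallel phrase retrieval:
# a candidate phrase pair is scored by how many of its source words
# the current bilingual lexicon can map into the target phrase.

def coverage_score(src_phrase, tgt_phrase, lexicon):
    """Fraction of source words with a lexicon translation in tgt_phrase."""
    tgt_words = set(tgt_phrase)
    covered = sum(1 for f in src_phrase
                  if any(e in tgt_words for e in lexicon.get(f, ())))
    return covered / len(src_phrase)

def retrieve_parallel_phrases(src_phrases, tgt_phrases, lexicon, threshold=0.5):
    """Pair each source phrase with its best-covered target phrase."""
    pairs = []
    for f_phrase in src_phrases:
        best = max(tgt_phrases,
                   key=lambda e: coverage_score(f_phrase, e, lexicon))
        if coverage_score(f_phrase, best, lexicon) >= threshold:
            pairs.append((f_phrase, best))
    return pairs

# Toy usage: a two-entry seed lexicon retrieves one phrase pair.
lexicon = {"中国": ["china", "chinese"], "经济": ["economy"]}
src = [("中国", "经济")]
tgt = [("the", "chinese", "economy"), ("weather", "report")]
print(retrieve_parallel_phrases(src, tgt, lexicon))
# [(('中国', '经济'), ('the', 'chinese', 'economy'))]
```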
In this paper, we propose an iterative approach to learning bilingual lexicons and phrases jointly from non-parallel corpora. Given two sets of monolingual phrases that might contain parallel phrases, we develop a generative model based on IBM model 1 [Brown et al., 1993], which treats the mapping between phrase pairs as a latent variable. The model is trained using a Viterbi EM algorithm. Experiments on Chinese-English datasets show that iterative training significantly improves the quality of the learned bilingual lexicons and phrases and benefits end-to-end MT systems.
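As a rough illustration of this alternation, here is a self-contained sketch in which an IBM-Model-1-style scoring function stands in for the full model defined later; the hard within-phrase counts are a simplification of Model 1's fractional counts, and all names and toy data are assumptions for illustration only.

```python
# Illustrative Viterbi EM sketch: the E-step hard-matches each source
# phrase to its best target phrase under the current lexicon, and the
# M-step re-estimates lexical probabilities from the matched pairs.
from collections import defaultdict

def phrase_pair_score(f_phrase, e_phrase, t):
    """IBM-Model-1-style score: for each source word, average its
    lexical probability over the words of the target phrase."""
    score = 1.0
    for f in f_phrase:
        score *= sum(t[(f, e)] for e in e_phrase) / len(e_phrase)
    return score

def viterbi_em(src_phrases, tgt_phrases, t, iterations=5):
    pairs = []
    for _ in range(iterations):
        # E-step (Viterbi): construct parallel phrases with the lexicon.
        pairs = [(f, max(tgt_phrases,
                         key=lambda e: phrase_pair_score(f, e, t)))
                 for f in src_phrases]
        # M-step: update the lexicon from the constructed phrase pairs.
        counts, totals = defaultdict(float), defaultdict(float)
        for f_phrase, e_phrase in pairs:
            for f in f_phrase:
                for e in e_phrase:
                    counts[(f, e)] += 1.0
                    totals[f] += 1.0
        t = defaultdict(lambda: 1e-6,
                        {fe: c / totals[fe[0]] for fe, c in counts.items()})
    return t, pairs

# Toy usage: seed the lexicon with two entries to break symmetry.
t0 = defaultdict(lambda: 0.01)
t0[("中国", "chinese")] = 0.5
t0[("天气", "weather")] = 0.5
src = [("中国", "经济"), ("天气",)]
tgt = [("chinese", "economy"), ("weather",)]
t, pairs = viterbi_em(src, tgt, t0, iterations=3)
print(pairs)
# [(('中国', '经济'), ('chinese', 'economy')), (('天气',), ('weather',))]
```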
2 Preliminaries
We begin with a brief introduction to IBM model 1, which is
the core component of our generative model.
Let $\mathbf{f} = f_1, \ldots, f_J$ be a foreign sentence with $J$ words. We use $f_j$ to denote the $j$-th word in the foreign sentence. Similarly, $\mathbf{e} = e_1, \ldots, e_I$ is an English sentence with $I$ words and $e_i$ is the $i$-th word. $f$ and $e$ denote single foreign and English words, respectively. A word alignment $\mathbf{a} = a_1, \ldots, a_J$