Iterative Learning of Parallel Lexicons and Phrases from Non-Parallel Corpora
Meiping Dong†, Yang Liu†‡∗, Huanbo Luan†, Maosong Sun†‡, Tatsuya Izuha+, Dakun Zhang#
†State Key Laboratory of Intelligent Technology and Systems,
Tsinghua National Laboratory for Information Science and Technology,
Department of Computer Sci. and Tech., Tsinghua University, Beijing, China
‡Jiangsu Collaborative Innovation Center for Language Competence, Jiangsu, China
+Toshiba Corporation Corporate Research & Development Center
#Toshiba (China) R&D Center
hellodmp@163.com, {liuyang2011,sms}@tsinghua.edu.cn, luanhuanbo@gmail.com
tatsuya.izuha@toshiba.co.jp, zhangdakun@toshiba.com.cn
∗Yang Liu is the corresponding author.
Abstract
While parallel corpora are an indispensable resource for data-driven multilingual natural language processing tasks such as machine translation, they are limited in quantity, quality, and coverage. As a result, learning translation models from non-parallel corpora has become increasingly important, especially for low-resource languages. In this work, we propose a joint model for iteratively learning parallel lexicons and phrases from non-parallel corpora. The model is trained using a Viterbi EM algorithm that alternates between constructing parallel phrases using lexicons and updating lexicons based on the constructed parallel phrases. Experiments on Chinese-English datasets show that our approach learns better parallel lexicons and phrases and significantly improves translation performance.
1 Introduction
Parallel corpora, which are collections of parallel texts, play a critical role in data-driven multilingual natural language processing (NLP) tasks such as statistical machine translation (MT) and cross-lingual information retrieval. For example, in statistical MT, parallel corpora serve as the central source for estimating translation model parameters [Brown et al., 1993; Koehn et al., 2003; Chiang, 2005]. It is widely accepted that the quantity, quality, and coverage of parallel corpora have an important effect on the performance of statistical MT systems.
Despite the apparent success of data-driven multilingual NLP techniques, the availability of large-scale, wide-coverage, high-quality parallel corpora remains a major challenge. For most language pairs, parallel corpora are nonexistent. Even for the top handful of resource-rich languages, the available parallel corpora are usually unbalanced because the major sources are government documents or news articles.
As a result, learning translation models from non-parallel corpora has attracted intensive attention from the community [Koehn and Knight, 2002; Fung and Cheung, 2004; Munteanu and Marcu, 2006; Quirk et al., 2007; Ueffing et al., 2007; Haghighi et al., 2008; Bertoldi and Federico, 2009; Cettolo et al., 2010; Daumé III and Jagarlamudi, 2011; Ravi and Knight, 2011; Nuhn et al., 2012; Dou and Knight, 2012; Klementiev et al., 2012; Zhang and Zong, 2013; Dou et al., 2014]. Most existing approaches focus on learning word-based models: either bilingual lexicons or IBM models. Based on canonical correlation analysis (CCA), Haghighi et al. [2008] leverage orthographic and context features to induce word translation pairs. Ravi and Knight [2011] cast training IBM models on monolingual data as a decipherment problem. However, word-based models are not expressive enough to capture non-local dependencies and are therefore insufficient to yield high-quality translations.
Recently, several authors have moved a step further to learn phrase-based models from non-parallel corpora [Klementiev et al., 2012; Zhang and Zong, 2013]. Zhang and Zong [2013] propose to use a parallel lexicon to retrieve parallel phrases from non-parallel corpora. They show that their approach can learn new translations and improve translation performance. However, their approach is unidirectional: it only uses lexicons to identify parallel phrases. In fact, it is possible to learn better lexicons from the extracted phrase pairs in the reverse direction, which potentially constitutes a "find-one-get-more" loop [Fung and Cheung, 2004], as illustrated in the sketch below.
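To make the retrieval step concrete, the following is a minimal Python sketch of lexicon-driven phrase pair retrieval. The coverage heuristic, the threshold, and all names and toy data are illustrative assumptions, not Zhang and Zong's actual method.

```python
# Illustrative sketch of lexicon-based parallel phrase retrieval:
# a candidate phrase pair is scored by how many of its source words
# the current bilingual lexicon can map into the target phrase.

def coverage_score(src_phrase, tgt_phrase, lexicon):
    """Fraction of source words with a lexicon translation in tgt_phrase."""
    tgt_words = set(tgt_phrase)
    covered = sum(1 for f in src_phrase
                  if any(e in tgt_words for e in lexicon.get(f, ())))
    return covered / len(src_phrase)

def retrieve_parallel_phrases(src_phrases, tgt_phrases, lexicon, threshold=0.5):
    """Pair each source phrase with its best-covered target phrase."""
    pairs = []
    for f_phrase in src_phrases:
        best = max(tgt_phrases,
                   key=lambda e: coverage_score(f_phrase, e, lexicon))
        if coverage_score(f_phrase, best, lexicon) >= threshold:
            pairs.append((f_phrase, best))
    return pairs

# Toy usage: a two-entry seed lexicon retrieves one phrase pair.
lexicon = {"中国": ["china", "chinese"], "经济": ["economy"]}
src = [("中国", "经济")]
tgt = [("the", "chinese", "economy"), ("weather", "report")]
print(retrieve_parallel_phrases(src, tgt, lexicon))
# [(('中国', '经济'), ('the', 'chinese', 'economy'))]
```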
In this paper, we propose an iterative approach to learning bilingual lexicons and phrases jointly from non-parallel corpora. Given two sets of monolingual phrases that might contain parallel phrases, we develop a generative model based on IBM model 1 [Brown et al., 1993], which treats the mapping between phrase pairs as a latent variable. The model is trained using a Viterbi EM algorithm. Experiments on Chinese-English datasets show that iterative training significantly improves the quality of the learned bilingual lexicons and phrases and benefits end-to-end MT systems.
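As a rough illustration of this alternation, here is a self-contained sketch in which an IBM-Model-1-style scoring function stands in for the full model defined later; the hard within-phrase counts are a simplification of Model 1's fractional counts, and all names and toy data are assumptions for illustration only.

```python
# Illustrative Viterbi EM sketch: the E-step hard-matches each source
# phrase to its best target phrase under the current lexicon, and the
# M-step re-estimates lexical probabilities from the matched pairs.
from collections import defaultdict

def phrase_pair_score(f_phrase, e_phrase, t):
    """IBM-Model-1-style score: for each source word, average its
    lexical probability over the words of the target phrase."""
    score = 1.0
    for f in f_phrase:
        score *= sum(t[(f, e)] for e in e_phrase) / len(e_phrase)
    return score

def viterbi_em(src_phrases, tgt_phrases, t, iterations=5):
    pairs = []
    for _ in range(iterations):
        # E-step (Viterbi): construct parallel phrases with the lexicon.
        pairs = [(f, max(tgt_phrases,
                         key=lambda e: phrase_pair_score(f, e, t)))
                 for f in src_phrases]
        # M-step: update the lexicon from the constructed phrase pairs.
        counts, totals = defaultdict(float), defaultdict(float)
        for f_phrase, e_phrase in pairs:
            for f in f_phrase:
                for e in e_phrase:
                    counts[(f, e)] += 1.0
                    totals[f] += 1.0
        t = defaultdict(lambda: 1e-6,
                        {fe: c / totals[fe[0]] for fe, c in counts.items()})
    return t, pairs

# Toy usage: seed the lexicon with two entries to break symmetry.
t0 = defaultdict(lambda: 0.01)
t0[("中国", "chinese")] = 0.5
t0[("天气", "weather")] = 0.5
src = [("中国", "经济"), ("天气",)]
tgt = [("chinese", "economy"), ("weather",)]
t, pairs = viterbi_em(src, tgt, t0, iterations=3)
print(pairs)
# [(('中国', '经济'), ('chinese', 'economy')), (('天气',), ('weather',))]
```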
2 Preliminaries
We begin with a brief introduction to IBM model 1, which is
the core component of our generative model.
Let $\mathbf{f} = f_1, \ldots, f_J$ be a foreign sentence with $J$ words. We use $f_j$ to denote the $j$-th word in the foreign sentence. Similarly, $\mathbf{e} = e_1, \ldots, e_I$ is an English sentence with $I$ words and $e_i$ is the $i$-th word. $f$ and $e$ denote single foreign and English words, respectively. A word alignment $\mathbf{a} = a_1, \ldots, a_J$