Iterative Learning of Parallel Lexicons and Phrases from Non-Parallel Corpora
Meiping Dong
Yang Liu
Huanbo Luan
Maosong Sun
Tatsuya Izuha
Dakun Zhang
State Key Laboratory of Intelligent Technology and Systems
Tsinghua National Laboratory for Information Science and Technology
Department of Computer Sci. and Tech., Tsinghua University, Beijing, China
Jiangsu Collaborative Innovation Center for Language Competence, Jiangsu, China, {liuyang2011,sms},
Toshiba Corporation Corporate Research & Development Center
Toshiba (China) R&D Center
While parallel corpora are an indispensable resource for data-driven
multilingual natural language processing tasks such as machine
translation, they are limited in quantity, quality and coverage. As a
result, learning translation models from non-parallel corpora has become
increasingly important, especially for low-resource languages. In this
work, we propose a joint model for iteratively learning parallel lexicons
and phrases from non-parallel corpora. The model is trained using a
Viterbi EM algorithm that alternates between constructing parallel
phrases using lexicons and updating lexicons based on the constructed
parallel phrases. Experiments on Chinese-English datasets show that our
approach learns better parallel lexicons and phrases and improves
translation performance significantly.
1 Introduction
Parallel corpora, which are collections of parallel texts, play a
critical role in data-driven multilingual natural language processing
(NLP) tasks such as statistical machine translation (MT) and
cross-lingual information retrieval. For example, in statistical MT,
parallel corpora serve as the central source for estimating translation
model parameters [Brown et al., 1993; Koehn et al., 2003; Chiang, 2005].
It is widely accepted that the quantity, quality, and coverage of
parallel corpora have an important effect on the performance of
statistical MT systems.
Despite the apparent success of data-driven multilingual NLP techniques,
the availability of large-scale, wide-coverage, high-quality parallel
corpora still remains a major challenge. For most language pairs,
parallel corpora are nonexistent. Even for the top handful of
resource-rich languages, the available parallel corpora are usually
unbalanced because the major sources are government documents or news
articles.
As a result, learning translation models from non-parallel corpora has
attracted intensive attention from the community [Koehn and Knight, 2002;
Fung and Cheung, 2004; Munteanu and Marcu, 2006; Quirk et al., 2007;
Ueffing et al., 2007; Haghighi et al., 2008; Bertoldi and Federico, 2009;
Cettolo et al., 2010; Daumé III and Jagarlamudi, 2011; Ravi and Knight,
2011; Nuhn et al., 2012; Dou and Knight, 2012; Klementiev et al., 2012;
Zhang and Zong, 2013; Dou et al., 2014]. Most existing approaches focus
on learning word-based models: either bilingual lexicons or IBM models.
Based on canonical correlation analysis (CCA), Haghighi et al. [2008]
leverage orthographic and context features to induce word translation
pairs. Ravi and Knight [2011] treat training IBM models on monolingual
data as a decipherment problem. However, word-based models are not
expressive enough to capture non-local dependencies and therefore are
insufficient to yield high-quality translations.
Yang Liu is the corresponding author.
Recently, several authors have moved a step further to learn
phrase-based models from non-parallel corpora [Klementiev et al., 2012;
Zhang and Zong, 2013]. Zhang and Zong [2013] propose to use a parallel
lexicon to retrieve parallel phrases from non-parallel corpora. They show
that their approach can learn new translations and improve translation
performance. However, their approach is unidirectional: it only uses
lexicons to identify parallel phrases. In fact, it is possible to learn
better lexicons from the extracted phrase pairs in the reverse direction,
which potentially constitutes a "find-one-get-more" loop [Fung and
Cheung, 2004].
In this paper, we propose an iterative approach to learning bilingual
lexicons and phrases jointly from non-parallel corpora. Given two sets of
monolingual phrases that might contain parallel phrases, we develop a
generative model based on IBM model 1 [Brown et al., 1993], which treats
the mapping between phrase pairs as a latent variable. The model is
trained using a Viterbi EM algorithm. Experiments on Chinese-English
datasets show that iterative training significantly improves the quality
of learned bilingual lexicons and phrases and benefits end-to-end MT
systems.
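
The alternation between matching phrases and updating the lexicon can be pictured with a small sketch. It is only an illustration under strong assumptions, not the model developed in this paper: phrases are scored with a plain IBM model 1 style likelihood, each foreign phrase is simply paired with its best-scoring English phrase, and unseen word pairs receive a small constant probability. The names `model1_score`, `viterbi_em`, `floor`, and the toy French-English data are all hypothetical.

```python
from collections import defaultdict

NULL = "<null>"  # empty English word, as in IBM model 1

def model1_score(f_phrase, e_phrase, t, floor=1e-6):
    """IBM model 1 style likelihood of f_phrase given e_phrase (up to a constant)."""
    e_words = [NULL] + list(e_phrase)
    score = 1.0
    for f in f_phrase:
        score *= sum(t.get((f, e), floor) for e in e_words) / len(e_words)
    return score

def viterbi_em(f_phrases, e_phrases, seed_lexicon, iterations=3, floor=1e-6):
    """Alternate between hard phrase matching and lexicon re-estimation."""
    t = dict(seed_lexicon)  # t[(f, e)] approximates p(f | e)
    matches = {}
    for _ in range(iterations):
        # Hard (Viterbi) step: keep only the single best English phrase
        # for each foreign phrase under the current lexicon.
        matches = {
            f_phrase: max(e_phrases,
                          key=lambda e_phrase: model1_score(f_phrase, e_phrase, t, floor))
            for f_phrase in f_phrases
        }
        # Update step: re-estimate the lexicon from fractional word
        # alignment counts collected on the matched phrase pairs.
        counts = defaultdict(float)
        totals = defaultdict(float)
        for f_phrase, e_phrase in matches.items():
            e_words = [NULL] + list(e_phrase)
            for f in f_phrase:
                z = sum(t.get((f, e), floor) for e in e_words)
                for e in e_words:
                    c = t.get((f, e), floor) / z
                    counts[(f, e)] += c
                    totals[e] += c
        t = {(f, e): c / totals[e] for (f, e), c in counts.items()}
    return t, matches

# Toy data (hypothetical): the seed lexicon knows only two word pairs.
f_phrases = [("maison", "bleue"), ("chat", "noir")]
e_phrases = [("blue", "house"), ("black", "cat")]
seed_lexicon = {("maison", "house"): 0.9, ("chat", "cat"): 0.9}

lexicon, matches = viterbi_em(f_phrases, e_phrases, seed_lexicon)
print(matches)
print(sorted(lexicon.items(), key=lambda kv: -kv[1])[:4])
```

In this toy run the seed entries are enough to pair each foreign phrase with an English phrase, and the update step then assigns probability mass to word pairs that were absent from the seed lexicon, which is the "find-one-get-more" effect in miniature.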
2 Preliminaries
We begin with a brief introduction to IBM model 1, which is
the core component of our generative model.
Let $\mathbf{f} = f_1, \dots, f_J$ be a foreign sentence with $J$ words.
We use $f_j$ to denote the $j$-th word in the foreign sentence.
Similarly, $\mathbf{e} = e_1, \dots, e_I$ is an English sentence with $I$
words and $e_i$ is the $i$-th word. $f$ and $e$ denote single foreign and
English words, respectively. A word alignment
$\mathbf{a} = a_1, \dots, a_J$