A New Approach to Machine Translation Using Few Parallel Resources


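Machine Translation with Minimal Reliance on Parallel Resources, co-authored by Dr George Tambouratzis, Dr Marina Vassiliou and Dr Sokratis Sofianopoulos, is a monograph presenting a new methodology for machine translation (MT). Its core idea is to rely on widely available monolingual corpora (large collections of text) rather than on large parallel corpora, which gives it a distinctive starting point relative to statistical machine translation (SMT). The book sets out the basic principles and the system architecture of the methodology, with emphasis on how information extracted from monolingual data can compensate for scarce parallel resources. The authors report a series of experiments comparing their system with other MT systems, using the standard evaluation metrics BLEU, NIST, Meteor and TER to quantify translation quality. The approach aims to reduce the dependence on parallel corpora. The book is accompanied by freely available code with which readers can build their own MT systems, which makes it of practical value to language professionals and researchers. Readers are expected to have basic knowledge of machine translation and natural language processing. The book is both an academic contribution and a practical guide, suited to readers who want to understand the limitations of current MT technology and ways of addressing them. It is part of the SpringerBriefs in Statistics series, which offers concise treatments of statistical research. All content is protected by copyright and may not be reproduced or distributed without permission. The book shows how MT technology can be advanced when resources are limited, which is relevant to both MT research and practical applications.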

Thus, to develop an SMT system from a source language to a target language,

SL-TL parallel corpora (where the same text is expressed in both the source and the

target language) of the order of millions of tokens are required to allow the

extraction of meaningful models for translation. Such corpora are hard to obtain,

particularly for less resourced languages and are frequently restricted to a speciﬁc

domain (or a narrow range of domains). In addition, it is accepted that existing

parallel corpora are not suitable for creating general-purpose MT systems that focus

on other domains. For this reason, in SMT, researchers are increasingly using

syntax-based models as well as investigating the extraction of information from

monolingual corpora, including lexical translation probabilities (Koehn and Knight

2002; Klementiev et al. 2012) and topic-speciﬁc information (Su et al. 2012).

The other key CBMT family is that of EBMT, where translations are generated

via a reasoning-by-analogy process, as introduced conceptually by Nagao (1984).

EBMT is based on having a large set of known pairs of sentences, each pair

including one input sentence (in SL) and its corresponding translation (in TL).

In EBMT, translations are generated by analogy, by appropriately processing the

large library of examples during the actual translation phase rather than during a

learning/training phase (Wu 2005), selecting the best matching one to the input

sentence, and then replacing on the translation side the appropriate elements (for

instance, tokens or phrases). A comprehensive review of the ﬁeld of EBMT is

provided by Hutchins (2005). The key difﬁculty in EBMT concerns searching

through the entire set of sentence pairs to determine the one whose SL side best

matches the input sentence, and then making the appropriate replacements.
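
The retrieve-and-replace process described above can be illustrated with a minimal sketch. It is not taken from any of the cited EBMT systems: the example base `EXAMPLES`, the lexicon `LEXICON` and both function names are hypothetical, the English-German sentences are toy data, and similarity is measured with Python's standard `difflib.SequenceMatcher` purely for convenience.

```python
from difflib import SequenceMatcher

# Hypothetical toy example base of (SL sentence, TL sentence) pairs.
EXAMPLES = [
    ("the cat sleeps", "die Katze schläft"),
    ("the dog eats", "der Hund frisst"),
]

# Hypothetical toy SL->TL lexicon used for the replacement step.
LEXICON = {
    "cat": "Katze",
    "mouse": "Maus",
    "dog": "Hund",
    "sleeps": "schläft",
    "eats": "frisst",
}


def best_example(sl_tokens, examples):
    """Return the pair whose SL side is most similar to the input tokens."""
    def similarity(pair):
        return SequenceMatcher(None, sl_tokens, pair[0].split()).ratio()
    # Linear scan over the whole example base: the search step named above.
    return max(examples, key=similarity)


def translate_by_analogy(sentence, examples=EXAMPLES, lexicon=LEXICON):
    """Retrieve the closest example and substitute the tokens that differ."""
    sl_tokens = sentence.split()
    ex_sl, ex_tl = best_example(sl_tokens, examples)
    ex_sl_tokens = ex_sl.split()
    tl_tokens = ex_tl.split()
    # The toy replacement step only handles inputs as long as the example.
    if len(sl_tokens) != len(ex_sl_tokens):
        return ex_tl
    for new_word, old_word in zip(sl_tokens, ex_sl_tokens):
        if new_word != old_word and new_word in lexicon and old_word in lexicon:
            old_tl, new_tl = lexicon[old_word], lexicon[new_word]
            tl_tokens = [new_tl if tok == old_tl else tok for tok in tl_tokens]
    return " ".join(tl_tokens)


print(translate_by_analogy("the mouse sleeps"))
```

Even in this toy form, every input is compared against every stored example, and the replacement step needs extra knowledge (here a lexicon) to know what to substitute on the TL side.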

In a bid to achieve higher translation quality, researchers have studied approaches

that combine principles from more than one MT paradigm, in order to combine

their advantages. This effort has led to a family of methods collectively known as

Hybrid MT. Examples of HMT include the systems by Eisele et al. (2008) and

Quirk and Menezes (2006), while the latest HMT activity is reported by Costa-jussà

et al. (2013). The general convergence of more recent MT methodologies towards

the combination of the most promising characteristics of each paradigm has been

documented by Wu (2005), having started from pure MT systems belonging to one

of the main paradigms (for instance, RBMT, SMT or EBMT) and increasingly

progressing towards systems that combine characteristics from multiple paradigms.

Thus, the effort to continuously improve translation quality leads to less clearly

separable MT systems.

Concluding this brief survey, it is widely accepted that the most popular CBMT

method is SMT, which needs parallel corpora aligned at a sentence basis. The

popularity of SMT is closely related to the fact that open code has been released for

use by researchers and MT practitioners, via the Moses software package (Koehn

et al. 2007), and further developments in terms of open-source software are

supported. This fact renders the creation of an MT system straightforward, provided

that suitable resources are available. However, the compilation of parallel corpora is

both expensive and time-consuming. Thus, alternative techniques have been studied

for creating MT systems requiring resources which may be less informative but are

also less expensive to collect or to create from scratch. These aim to eliminate the

6 1 Preliminaries

parallel corpus needed in SMT (or at least drastically reduce its size), employing

instead either comparable corpora or monolingual corpora. Monolingual resources

can be easily assembled for any language, for instance by harvesting the Web with

relatively low effort. Methods following this approach had been proposed by

Dologlou et al. (2003); Carbonell et al. (2006); Markantonatou et al. (2009).

Though these methods do not provide a translation quality as high as SMT, their

ability to develop MT systems with a very limited amount of specialised resources

represents a valid starting point which is interesting in terms of research. This

enables the development of a working MT system even for language pairs with very

few language resources.

It is on the basis of the aforementioned works that the PRESEMT methodology

has been developed. The design brief for PRESEMT has been to create a

language-independent methodology that—with limited resources—can translate

unconstrained texts giving a quality suitable for gisting purposes. In this

methodology, the key design decision is to use a large monolingual corpus to extract most

of the information required for the MT system. Thus, the dependence on a large

parallel corpus is alleviated, requiring only a small parallel corpus (whose size is

only a few hundred sentences) to provide information on the mapping of sentence

structures from SL to TL. According to the preceding review of MT systems,

PRESEMT can be classified as a Hybrid MT system, based on the argumentation of

Quirk and Menezes (2006) and Wu (2005), combining certain elements of EBMT

and SMT. The question, thus, becomes to what extent this methodology is more

readily portable to new language pairs, while at the same time providing a

translation accuracy which is comparable to that of state-of-the-art MT systems, and

whether the translation quality is not too heavily compromised by the design

decisions.
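
To make the idea of drawing on a monolingual corpus more concrete, the following sketch shows one simple way TL-side text can help choose between competing translation candidates. It is not the PRESEMT algorithm, which is not specified in this section: the corpus `TL_CORPUS`, the candidate lists and both function names are hypothetical, and plain bigram counting is used here only as a stand-in for whatever information the actual system extracts.

```python
from collections import Counter
from itertools import product

# Hypothetical toy monolingual TL corpus; in practice this would be large.
TL_CORPUS = [
    "he sat on the river bank",
    "she opened a bank account",
    "the bank account was empty",
]


def bigram_counts(corpus):
    """Count adjacent token pairs over all sentences of the corpus."""
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        counts.update(zip(tokens, tokens[1:]))
    return counts


def pick_translation(candidate_lists, counts):
    """Pick the candidate sequence whose bigrams are most frequent in the corpus."""
    def score(sequence):
        return sum(counts[bigram] for bigram in zip(sequence, sequence[1:]))
    # Exhaustive enumeration is fine for a toy input; it does not scale.
    return max(product(*candidate_lists), key=score)


counts = bigram_counts(TL_CORPUS)
# Hypothetical candidates per source token, e.g. from a bilingual lexicon.
candidates = [["the"], ["bench", "bank"], ["account"]]
print(" ".join(pick_translation(candidates, counts)))
```

No parallel data is consulted when choosing between "bench" and "bank": the decision rests entirely on which word sequence is attested in the monolingual text.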

1.4 The PRESEMT Methodology in a Nutshell

To introduce the design principles of PRESEMT, the most intuitive way is to

consider a potential application. An assumption is made that a person wishes to

locate some piece of information over the Web. Most search engines retrieve Web

pages which can cover the entire globe based on speciﬁc keywords, and naturally

the corresponding text may be written in any language. As the use of the Internet

expands, it is highly likely that the users will retrieve documents from several

languages. Therefore, to obtain information, especially in the cases where only a

few documents answering a query may be available, it is more than helpful to the

users to provide an automatic translation of sufﬁcient quality just for gisting. In that

respect, the requirement is to transform the information, which is expressed in a

given language, into the individual’s native language.

A number of automatic translation systems are available over the Web, such as

Google Translate, Bing Translator, SYSTRAN, where the user is prompted to either

enter a text or alternatively define a Web page to receive their translation. As a rule,

1.3 Advantages and Disadvantages of Main MT Paradigms 7

such MT systems currently give rather poor translations, apart from specific

language pairs, which have been extensively developed and fine-tuned. In addition, it

can be seen that for language pairs involving less widely-used languages

subsequent versions of these MT systems sometimes even display a deterioration rather

than improvement of the translation quality (examples of such non-monotonic

behaviour will be discussed in subsequent chapters).

What is required from the user’s point of view is a higher level of quality: a

translation that is draft-level but comprehensible to the average user. However, due to

the natural language complexity it would probably be too complex a task to design

a system that can produce translations of an appropriate quality for each and every

domain. What is probably of more interest is to address speciﬁc user needs.

This necessitates developing an MT system that can be rapidly adapted to cover

a new language pair, as well as to allow the user to effectively modify an existing

MT system for a given language pair so that it better matches his/her requirements.

The related set of requirements has formed the starting step for the PRESEMT

project. This system needs to be characterised by limited reliance on

specialised resources, inherent language independence and the ability to modify the

language resources used (e.g. the corpora used).

The main requirements for the PRESEMT system are to generate translations

fast (a real-time or near real-time response is of prime importance) and to be able to

develop new language pairs in a simple manner, without requiring specialised

linguistic tools. In the modern multilingual environment of the European Union as

well as beyond the boundaries of the Union itself, there exists an increased

requirement for creating translation systems even for language pairs with limited

availability of linguistic tools.

To cover these main requirements, the decision was taken to adopt

cross-disciplinary ideas, mainly borrowed from the machine learning and

computational intelligence domains. Thus, the core METIS idea (Dologlou et al. 2003) of

combining machine-learning approaches with large monolingual corpora is retained

in PRESEMT.

This idea is enhanced by a repertoire of pattern recognition and artificial

intelligence techniques for linguistic applications ranging from the alignment of

sentences and the creation of compatible phrases in different languages to the

optimisation of system parameters. In this way, it is expected that substantial

progress will be achieved in terms of (i) translation quality versus speed and

(ii) language portability and ease of development of new language pairs. The two

key objectives of the PRESEMT methodology are listed below:

Objective 1: Flexibility and adaptability—The MT system must be adaptable to

user requirements and preferences, thus making it possible to address the issue of

online translation for the masses. To that end, the PRESEMT MT system has a

modular architecture, in order to maintain the integrity of the individual modules

and facilitate local (i.e. module-speciﬁc) modiﬁcations, without affecting the system

as a whole. For instance, the user may choose to retrain the MT system by

retrieving corpora from the Web, having thus the freedom to specify resources,

thematic domains and languages, to which the system will be called to adapt.

Objective 2: Language independence—The MT system must be customisable to new

languages. It is based on a language-independent method ensuring easy and

cost-effective portability to new language pairs, without significant restrictions in

the choice of either source or target languages. The support of the

language-independent aspect of the prototype significantly reduces the human effort

involved in the collection and processing (annotation, validation, etc.) of textual

resources, as Web-sourced content serves as the major source of linguistic

knowledge. At the same time, the provision of a list of key required resources

(many of which can be shared between languages or easily modified when changing

the language pair handled) supports the user in creating a new language pair within

a limited number of days, by making use of existing modules and adapting

resources accordingly.

1.5 Closing Note on Implementation

The issue of implementation and utilisation is of prime importance. Thus, the

implemented MT system discussed in this volume is available for download and

experimentation. The reader may visit the project’s website to gain more

information on the MT system and download the PRESEMT package coupled with

some initial resources for the German-English and Greek-English language pairs

that allow the running of an initial system. As an alternative, the fully functional

online MT system can be accessed over the link www.presemt.eu/live. The website

also provides detailed technical documentation and links to the standalone versions

of the major PRESEMT modules hosted at Google Code, to encourage reuse.

References

Arnold D (1986) Eurotra: a European Perspective on MT. Proc IEEE 74(7):979–992

Arnold D, Sadler L (1992) EUROTRA: an assessment of the current state of the EC’s MT

programme. In: Translation and the computer, 13: the theory and practice of machine

translation—a marriage of convenience? ASLIB, London

Brown PF, Della Pietra SA, Della Pietra VJ, Mercer RL (1993) The mathematics of statistical

machine translation: parameter estimation. Comput Linguist 19(2):263–311

Carbonell J, Klein S, Miller D, Steinbaum M, Grassiany T, Frey J (2006) Context-based machine

translation. In: Proceedings of the 7th AMTA Conference, Cambridge, MA, USA, pp 19–28

Costa-jussà MR, Banchs R, Rapp R, Lambert P, Eberle K, Babych B (2013) Workshop on hybrid

approaches to translation: overview and developments. In: Proceedings of the 2nd HYTRA

workshop, held within ACL-2013, Soﬁa, Bulgaria, pp 1–6

Dologlou Y, Markantonatou S, Tambouratzis G, Yannoutsou O, Fourla S, Ioannou N (2003)

Using monolingual corpora for statistical machine translation: the METIS system. In:

Proceedings of the EAMT-CLAW’03 Workshop, Dublin, Ireland, 15–17 May, pp 61–68

Eisele A, Federmann C, Uszkoreit H, Saint-Amand H, Kay M, Jellinghaus M, Hunsicker S,

Herrmann T, Chen Y (2008) Hybrid machine translation architectures within and beyond the

EuroMatrix project. In: Hutchins J, v.Hahn W (eds) Proceedings of EAMT 2008 Conference,

22–23 September 2008, Hamburg, Germany, pp 27–34

Forcada ML, Ginestí-Rosell M, Nordfalk J, O’Regan J, Ortiz-Rojas S, Pérez-Ortiz JA,

Sánchez-Martínez F, Ramírez-Sánchez G, Tyers FM (2011) Apertium: a free/open-source

platform for rule-based machine translation. Mach Transl 25:127–144

Hutchins J (1996) ALPAC: the (in)famous report. MT News Int 14:9–12

Hutchins J (2005) Example-based machine translation: a review and commentary. Mach Transl

19:197–211

Jurafsky D, Martin JH (2009) Speech and language processing: an introduction to natural language

processing. In: Computational linguistics and speech recognition, 2nd edn. Pearson

Educational, Upper Saddle River, pp 895–944. ISBN 978-0-13-504196-1

Klementiev A, Irvine A, Callison-Burch C, Yarowsky D (2012) Toward statistical machine

translation without parallel corpora. In: Proceedings of EACL2012, Avignon, France, 23–25

April, pp. 130–140

Koehn P (2010) Statistical machine translation. Cambridge University Press. xii, 433 pp. ISBN

978-0-521-87415-1

Koehn P, Knight K (2002) Learning a translation lexicon from monolingual corpora. In:

Proceedings of the ACL-02 workshop on Unsupervised lexical acquisition, Philadelphia,

Pennsylvania, U.S.A., 12 July 2002, pp. 9–16

Koehn P, Hoang H, Birch A, Callison-Burch C, Federico M, Bertoldi N, Cowan B, Shen W,

Moran C, Zens R, Dyer C, Bojar O, Constantin A, Herbst E (2007) Moses: open source toolkit

for statistical machine translation. In: ACL 2007: proceedings of demo and poster sessions,

Prague, Czech Republic, June 2007, pp 177–180

Markantonatou S, Soﬁanopoulos S, Yannoutsou O, Vassiliou M (2009) Hybrid machine

translation for low- and middle-density languages. In: Nirenburg S (ed) Language engineering

for lesser-studied languages. IOS Press, pp 243–274. ISBN 978-1-58603-954-7

Nagao M (1984) A framework of a mechanical translation between Japanese and English by

analogy principle. In: Elithorn A, Banerji R (eds) Artiﬁcial and human intelligence: edited

review papers presented at the international NATO Symposium, October 1981, Lyons, France.

Amsterdam: North Holland, pp 173–180

Quirk C, Menezes A (2006) Dependency treelet translation: the convergence of statistical and

example-based machine translation? Mach Transl 20:45–66

Sánchez-Cartagena VM, Pérez-Ortiz JA, Sánchez-Martínez F (2015) A generalised alignment

template formalism and its application to the inference of shallow-transfer machine translation

rules from scarce bilingual corpora. Comput Speech Lang (Special Issue on Hybrid Machine

Translation) 32(1):49–90

Senellart J, Dienes P, Varadi T (2001) New generation SYSTRAN system. In: Proceedings of the

8th MT Summit, 18–22 September, Santiago de Compostella, Spain, pp 311–316

Surcin S, Lange E, Senellart J (2007) Rapid development of new language pairs at SYSTRAN. In:

Proceedings of MT Summit XI, 10–14 September, Copenhagen, Denmark, pp 443–449

Su J, Wu H, Wang H, Chen Y, Shi X, Dong H, Liu Q (2012) Translation Model Adaptation for

Statistical Machine Translation with Monolingual Topic Information. In: Proceedings of

ACL2012 Conference, Jeju, Republic of Korea, July 2012, pp 459–468

Wu D (2005) MT model space: statistical versus compositional versus example-based machine

translation. Mach Transl 19:213–227

Yang J, Enoue S, Senellart J, Croiset T (2009) SYSTRAN Chinese-English and English-Chinese

hybrid machine translation systems. In: Proceedings of CWMT-2009: the 5th China Workshop

on Machine Translation, Nanjing, China, 16–17 October, p 8
