A New Approach to Machine Translation Using Few Parallel Resources
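Machine Translation with Minimal Reliance on Parallel Resources, co-authored by Dr. George Tambouratzis, Dr. Marina Vassiliou and Dr. Sokratis Sofianopoulos, is a monograph presenting a new methodology in the field of machine translation (MT). Its core idea is a technique that relies on widely available monolingual corpora (such as large-scale text collections) rather than depending heavily on large parallel corpora, which gives it a distinctive starting point within statistical machine translation (SMT).

The book sets out the basic principles and system architecture of the methodology, with emphasis on how information extracted from monolingual data can compensate for scarce parallel resources. In a series of experiments the authors compare their system with other MT systems, using standard evaluation metrics such as BLEU, NIST, Meteor and TER to quantify translation quality. The approach aims to improve translation efficiency and to reduce dependence on parallel corpora.

The book is also accompanied by freely available code with which readers can build their own MT systems, which is of practical value to language professionals and researchers. Basic knowledge of machine translation and of natural language processing is a prerequisite. The book is both an academic contribution and a practical guide, suited to readers who want to understand the limitations of modern MT technology and ways of addressing them.

The SpringerBriefs in Statistics series, of which this book is a part, offers concise introductions to research in statistics. All content is protected by copyright and may not be copied or distributed without permission. The book shows how machine translation can be advanced when resources are limited, which is relevant to both current and future MT research and practical applications.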
Thus, to develop an SMT system from a source language to a target language,
SL-TL parallel corpora (where the same text is expressed in both the source and the
target language) of the order of millions of tokens are required to allow the
extraction of meaningful models for translation. Such corpora are hard to obtain,
particularly for less resourced languages and are frequently restricted to a specific
domain (or a narrow range of domains). In addition, it is accepted that existing
parallel corpora are not suitable for creating general-purpose MT systems that focus
on other domains. For this reason, in SMT, researchers are increasingly using
syntax-based models as well as investigating the extraction of information from
monolingual corpora, including lexical translation probabilities (Koehn and Knight
2002; Klementiev et al. 2012) and topic-specific information (Su et al. 2012).
The other key CBMT family is that of EBMT, where translations are generated
via a reasoning-by-analogy process, as introduced conceptually by Nagao (1984).
EBMT is based on having a large set of known pairs of sentences, each pair
including one input sentence (in SL) and its corresponding translation (in TL).
In EBMT, translations are generated by analogy, by appropriately processing the
large library of examples during the actual translation phase rather than during a
learning/training phase (Wu 2005), selecting the best matching one to the input
sentence, and then replacing on the translation side the appropriate elements (for
instance, tokens or phrases). A comprehensive review of the field of EBMT is
provided by Hutchins (2005). The key difficulty in EBMT concerns searching
through the entire set of sentence pairs to determine the one whose SL side best
matches the input sentence, and then making the appropriate replacements.
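The retrieve-and-replace step described above can be illustrated with a minimal Python sketch. It is not taken from any of the cited EBMT systems: the example base, the toy German-English lexicon, the function names and the use of `difflib.SequenceMatcher` as the similarity measure are all assumptions made for illustration only.

```python
from difflib import SequenceMatcher

# Hypothetical example base: (SL sentence, TL translation) pairs.
EXAMPLES = [
    ("ich habe einen hund", "i have a dog"),
    ("sie trinkt kaffee", "she drinks coffee"),
]

# Hypothetical bilingual lexicon used only for the replacement step.
LEXICON = {"hund": "dog", "katze": "cat", "kaffee": "coffee", "tee": "tea"}


def best_example(source_tokens, examples):
    """Return the example pair whose SL side is most similar to the input."""
    def similarity(pair):
        return SequenceMatcher(None, source_tokens, pair[0].split()).ratio()
    return max(examples, key=similarity)


def translate_by_analogy(sentence):
    """Pick the closest example, then swap differing words on the TL side."""
    src = sentence.split()
    ex_src_text, ex_tgt_text = best_example(src, EXAMPLES)
    ex_src = ex_src_text.split()
    out = ex_tgt_text.split()
    matcher = SequenceMatcher(None, ex_src, src)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Only handle equal-length substitutions; anything else is left as is.
        if tag == "replace" and (i2 - i1) == (j2 - j1):
            for old, new in zip(ex_src[i1:i2], src[j1:j2]):
                if old in LEXICON and new in LEXICON:
                    out = [LEXICON[new] if t == LEXICON[old] else t for t in out]
    return " ".join(out)


print(translate_by_analogy("ich habe eine katze"))
print(translate_by_analogy("sie trinkt tee"))
```

The sketch scans every example for each input, which is exactly the search cost the paragraph identifies as the key difficulty; a realistic system would need an index over the example base.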
In a bid to achieve higher translation quality, researchers have studied approaches
that combine principles from more than one MT paradigm, in order to pool
their advantages. This effort has led to a family of methods collectively known as
Hybrid MT (HMT). Examples of HMT include the systems by Eisele et al. (2008) and
Quirk and Menezes (2006), while the latest HMT activity is reported by Costa-jussà
et al. (2013). The general convergence of more recent MT methodologies towards
the combination of the most promising characteristics of each paradigm has been
documented by Wu (2005): the field started from pure MT systems belonging to one
of the main paradigms (for instance, RBMT, SMT or EBMT) and has increasingly
progressed towards systems that combine characteristics from multiple paradigms.
Thus, the effort to continuously improve translation quality leads to less clearly
separable MT systems.
Concluding this brief survey, it is widely accepted that the most popular CBMT
method is SMT, which needs parallel corpora aligned at a sentence basis. The
popularity of SMT is closely related to the fact that open code has been released for
use by researchers and MT practitioners, via the Moses software package (Koehn
et al. 2007), and further developments in terms of open-source software are
supported. This fact renders the creation of an MT system straightforward, provided
that suitable resources are available. However, the compilation of parallel corpora is
both expensive and time-consuming. Thus, alternative techniques have been studied
for creating MT systems requiring resources which may be less informative but are
also less expensive to collect or to create from scratch. These aim to eliminate the
parallel corpus needed in SMT (or at least drastically reduce its size), employing
instead either comparable corpora or monolingual corpora. Monolingual resources
can be easily assembled for any language, for instance by harvesting the Web with
relatively low effort. Methods following this approach have been proposed by
Dologlou et al. (2003), Carbonell et al. (2006) and Markantonatou et al. (2009).
Though these methods do not provide a translation quality as high as SMT, their
ability to develop MT systems with a very limited amount of specialised resources
represents a valid starting point which is interesting in terms of research. This
enables the development of a working MT system even for language pairs with very
few language resources.
It is on the basis of the aforementioned works that the PRESEMT methodology
has been developed. The design brief for PRESEMT has been to create a
language-independent methodology that—with limited resources—can translate
unconstrained texts giving a quality suitable for gisting purposes. In this
methodology, the key design decision is to use a large monolingual corpus to extract most
of the information required for the MT system. Thus, the dependence on a large
parallel corpus is alleviated, requiring only a small parallel corpus (whose size is
only a few hundred sentences) to provide information on the mapping of sentence
structures from SL to TL. According to the preceding review of MT systems,
PRESEMT can be classified as a Hybrid MT system, based on the argumentation of
Quirk and Menezes (2006) and Wu (2005), combining certain elements of EBMT
and SMT. The question, thus, becomes to what extent this methodology is more
readily portable to new language pairs, while at the same time providing a
translation accuracy which is comparable to that of state-of-the-art MT systems, and
whether the translation quality is not too heavily compromised by the design
decisions.
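The central design decision, drawing most of the required information from a large TL monolingual corpus, can be illustrated with a small Python sketch. This is not the PRESEMT algorithm: the toy corpus, the bilingual lexicon, the bigram-count scoring and the function names are assumptions chosen for illustration, and the sketch ignores the reordering information that PRESEMT obtains from its small parallel corpus.

```python
from collections import Counter
from itertools import product

# Tiny stand-in for a large TL monolingual corpus (assumption).
TL_CORPUS = [
    "the bank approved the loan",
    "she sat on the river bank",
    "the bench in the park",
    "he sat on the bench",
]

# Hypothetical bilingual lexicon: each SL token maps to candidate TL tokens.
LEXICON = {
    "er": ["he"],
    "saß": ["sat"],
    "auf": ["on"],
    "der": ["the"],
    "bank": ["bank", "bench"],  # German "Bank" is ambiguous
}


def bigram_counts(corpus):
    """Count adjacent token pairs over all corpus sentences."""
    counts = Counter()
    for sentence in corpus:
        tokens = sentence.split()
        counts.update(zip(tokens, tokens[1:]))
    return counts


def score(tokens, counts):
    """Sum the corpus frequencies of the candidate's bigrams."""
    return sum(counts[bigram] for bigram in zip(tokens, tokens[1:]))


def translate(sentence, lexicon, counts):
    """Enumerate lexicon-based candidates and keep the best-attested one."""
    options = [lexicon[token] for token in sentence.split()]
    candidates = [list(choice) for choice in product(*options)]
    return " ".join(max(candidates, key=lambda c: score(c, counts)))


COUNTS = bigram_counts(TL_CORPUS)
print(translate("er saß auf der bank", LEXICON, COUNTS))
```

Here the monolingual counts alone resolve the ambiguous word, since "the bench" is attested more often than "the bank" in the toy corpus; no parallel data is consulted for lexical choice.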
1.4 The PRESEMT Methodology in a Nutshell
To introduce the design principles of PRESEMT, the most intuitive way is to
consider a potential application. An assumption is made that a person wishes to
locate some piece of information over the Web. Most search engines retrieve Web
pages which can cover the entire globe based on specific keywords, and naturally
the corresponding text may be written in any language. As the use of the Internet
expands, it is highly likely that the users will retrieve documents from several
languages. Therefore, to obtain information, especially in the cases where only a
few documents answering a query may be available, it is more than helpful to the
users to provide an automatic translation of sufficient quality just for gisting. In that
respect, the requirement is to transform the information, which is expressed in a
given language, into the individual’s native language.
A number of automatic translation systems are available over the Web, such as
Google Translate, Bing Translator and SYSTRAN, where the user is prompted to either
enter a text or alternatively specify a Web page to receive its translation. As a rule,
such MT systems currently give rather poor translations, apart from specific
language pairs, which have been extensively developed and fine-tuned. In addition, it
can be seen that, for language pairs involving less widely used languages,
subsequent versions of these MT systems sometimes even display a deterioration rather
than improvement of the translation quality (examples of such non-monotonic
behaviour will be discussed in subsequent chapters).
What is required from the user’s point of view is a higher level of quality, one that is
still draft but comprehensible as far as the average user is concerned. However, due to
the complexity of natural language, it would probably be too complex a task to design
a system that can produce translations of an appropriate quality for each and every
domain. What is probably of more interest is to address specific user needs.
This necessitates developing an MT system that can be rapidly adapted to cover
a new language pair, as well as to allow the user to effectively modify an existing
MT system for a given language pair so that it better matches his/her requirements.
The related set of requirements has formed the starting point for the PRESEMT
project. The system needs to be characterised by limited reliance on
specialised resources, inherent language independence and the ability to modify the
language resources used (e.g. the corpora used).
The main requirements for the PRESEMT system are to generate translations
fast (a real-time or near real-time response is of prime importance) and to be able to
develop new language pairs in a simple manner, without requiring specialised
linguistic tools. In the modern multilingual environment of the European Union as
well as beyond the boundaries of the Union itself, there exists an increased
requirement for creating translation systems even for language pairs with limited
availability of linguistic tools.
To cover these main requirements, the decision was taken to adopt
cross-disciplinary ideas, mainly borrowed from the machine learning and
computational intelligence domains. Thus, the core METIS idea (Dologlou et al. 2003) of
combining machine-learning approaches with large monolingual corpora is retained
in PRESEMT.
This idea is enhanced by a repertoire of pattern recognition and artificial
intelligence techniques for linguistic applications ranging from the alignment of
sentences and the creation of compatible phrases in different languages to the
optimisation of system parameters. In this way, it is expected that substantial
progress will be achieved in terms of (i) translation quality versus speed and
(ii) language portability and ease of development of new language pairs. The two
key objectives of the PRESEMT methodology are listed below:
Objective 1: Flexibility and adaptability—The MT system must be adaptable to
user requirements and preferences, thus making it possible to address the issue of
online translation for the masses. To that end, the PRESEMT MT system has a
modular architecture, in order to maintain the integrity of the individual modules
and facilitate local (i.e. module-specific) modifications, without affecting the system
as a whole. For instance, the user may choose to retrain the MT system by
retrieving corpora from the Web, having thus the freedom to specify resources,
thematic domains and languages, to which the system will be called to adapt.
Objective 2: Language independence—The MT must be customisable to new
languages. It is based on a language-independent method ensuring easy and
cost-effective portability to new language pairs, without significant restrictions in
the choice of either source or target languages. The support of the
language-independent aspect of the prototype reduces significantly the human effort
involved in the collection and processing (annotation, validation, etc.) of textual
resources, as Web-sourced content serves as the major source of linguistic
knowledge. At the same time, the provision of a list of key required resources
(many of which can be shared between languages or easily modified when changing
the language pair handled) supports the user in creating a new language pair within
a limited number of days, by making use of existing modules and adapting
resources accordingly.
1.5 Closing Note on Implementation
The issue of implementation and utilisation is of prime importance. Thus, the
implemented MT system discussed in this volume is available for download and
experimentation. The reader may visit the project’s website to gain more
information on the MT system and download the PRESEMT package coupled with
some initial resources for the German-English and Greek-English language pairs
that allow the running of an initial system. As an alternative, the fully functional
online MT system can be accessed over the link www.presemt.eu/live. The website
also provides detailed technical documentation and links to the standalone versions
of the major PRESEMT modules hosted at Google Code, to encourage reuse.
References
Arnold D (1986) Eurotra: a European Perspective on MT. Proc IEEE 74(7):979–992
Arnold D, Sadler L (1992) EUROTRA: an assessment of the current state of the EC’s MT
programme. In: Translation and the computer, 13: the theory and practice of machine
translation—a marriage of convenience? ASLIB, London
Brown PF, Della Pietra SA, Della Pietra VJ, Mercer RL (1993) The mathematics of statistical
machine translation: parameter estimation. Comput Linguist 19(2):263–311
Carbonell J, Klein S, Miller D, Steinbaum M, Grassiany T, Frey J (2006) Context-based machine
translation. In: Proceedings of the 7th AMTA Conference, Cambridge, MA, USA, pp 19–28
Costa-jussà MR, Banchs R, Rapp R, Lambert P, Eberle K, Babych B (2013) Workshop on hybrid
approaches to translation: overview and developments. In: Proceedings of the 2nd HYTRA
workshop, held within ACL-2013, Sofia, Bulgaria, pp 1–6
Dologlou Y, Markantonatou S, Tambouratzis G, Yannoutsou O, Fourla S, Ioannou N (2003)
Using monolingual corpora for statistical machine translation: the METIS system. In:
Proceedings of the EAMT-CLAW’03 Workshop, Dublin, Ireland, 15–17 May, pp 61–68
Eisele A, Federmann C, Uszkoreit H, Saint-Amand H, Kay M, Jellinghaus M, Hunsicker S,
Herrmann T, Chen Y (2008) Hybrid machine translation architectures within and beyond the
EuroMatrix project. In: Hutchins J, v.Hahn W (eds) Proceedings of EAMT 2008 Conference,
22–23 September 2008, Hamburg, Germany, pp 27–34
Forcada ML, Ginestí-Rosell M, Nordfalk J, O’Regan J, Ortiz-Rojas S, Pérez-Ortiz JA,
Sánchez-Martínez F, Ramírez-Sánchez G, Tyers FM (2011) Apertium: a free/open-source
platform for rule-based machine translation. Mach Transl 25:127–144
Hutchins J (1996) ALPAC: the (in)famous report. MT News Int 14:9–12
Hutchins J (2005) Example-based machine translation: a review and commentary. Mach Transl
19:197–211
Jurafsky D, Martin JH (2009) Speech and language processing: an introduction to natural language
processing. In: Computational linguistics and speech recognition, 2nd edn. Pearson
Educational, Upper Saddle River, pp 895–944. ISBN 978-0-13-504196-1
Klementiev A, Irvine A, Callison-Burch C, Yarowsky D (2012) Toward statistical machine
translation without parallel corpora. In: Proceedings of EACL2012, Avignon, France, 23–25
April, pp 130–140
Koehn P (2010) Statistical machine translation. Cambridge University Press. xii, 433 pp. ISBN
978-0-521-87415-1
Koehn P, Knight K (2002) Learning a translation lexicon from monolingual corpora. In:
Proceedings of the ACL-02 workshop on Unsupervised lexical acquisition, Philadelphia,
Pennsylvania, U.S.A., 12 July 2002, pp 9–16
Koehn P, Hoang H, Birch A, Callison-Burch C, Federico M, Bertoldi N, Cowan B, Shen W,
Moran C, Zens R, Dyer C, Bojar O, Constantin A, Herbst E (2007) Moses: open source toolkit
for statistical machine translation. In: ACL 2007: proceedings of demo and poster sessions,
Prague, Czech Republic, June 2007, pp 177–180
Markantonatou S, Sofianopoulos S, Yannoutsou O, Vassiliou M (2009) Hybrid machine
translation for low- and middle-density languages. In: Nirenburg S (ed) Language engineering
for lesser-studied languages. IOS Press, pp 243–274. ISBN 978-1-58603-954-7
Nagao M (1984) A framework of a mechanical translation between Japanese and English by
analogy principle. In: Elithorn A, Banerji R (eds) Artificial and human intelligence: edited
review papers presented at the international NATO Symposium, October 1981, Lyons, France.
Amsterdam: North Holland, pp 173–180
Quirk C, Menezes A (2006) Dependency treelet translation: the convergence of statistical and
example-based machine translation? Mach Transl 20:45–66
Sánchez-Cartagena VM, Pérez-Ortiz JA, Sánchez-Martínez F (2015) A generalised alignment
template formalism and its application to the inference of shallow-transfer machine translation
rules from scarce bilingual corpora. Comput Speech Lang (Special Issue on Hybrid Machine
Translation) 32(1):49–90
Senellart J, Dienes P, Varadi T (2001) New generation SYSTRAN system. In: Proceedings of the
8th MT Summit, 18–22 September, Santiago de Compostela, Spain, pp 311–316
Surcin S, Lange E, Senellart J (2007) Rapid development of new language pairs at SYSTRAN. In:
Proceedings of MT Summit XI, 10–14 September, Copenhagen, Denmark, pp 443–449
Su J, Wu H, Wang H, Chen Y, Shi X, Dong H, Liu Q (2012) Translation Model Adaptation for
Statistical Machine Translation with Monolingual Topic Information. In: Proceedings of
ACL2012 Conference, Jeju, Republic of Korea, July 2012, pp 459–468
Wu D (2005) MT model space: statistical versus compositional versus example-based machine
translation. Mach Transl 19:213–227
Yang J, Enoue S, Senellart J, Croiset T (2009) SYSTRAN Chinese-English and English-Chinese
hybrid machine translation systems. In: Proceedings of CWMT-2009: the 5th China Workshop
on Machine Translation, Nanjing, China, 16–17 October, p 8