Phrasal Syntactic Category Sequence Model Improves Machine Translation Quality


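"This paper proposes a new method, the phrasal syntactic category sequence (PSCS) model, which aims to improve phrase-based machine translation (PBMT) systems so that they produce more grammatical translations. The authors parse the target-language sentences of the bilingual training corpus, assign a syntactic category to each phrase pair, and build a PSCS model from the result. The model is then integrated into a standard PBMT system, where it improves translation quality by 0.7 BLEU points over the baseline."

Machine translation is a central task in natural language processing: its goal is to convert text in one language into another language automatically. Traditional statistical approaches such as phrase-based machine translation (PBMT) rely on extracting phrase pairs and estimating their translation probabilities. These systems often struggle to produce well-formed syntactic structure, so their output can be ungrammatical.

The PSCS model proposed in this paper addresses that problem. First, the target-language side of the bilingual training corpus is parsed to obtain syntactic category information for each phrase. A syntactic category describes the grammatical function and structural role of a word or phrase in a sentence, for example noun, verb, or adjective for words and noun phrase or verb phrase for larger units. This analysis supplies target-side structural information that plain phrase pairs do not carry.

Next, syntactic categories are introduced into the standard phrase pair extraction procedure, so that every phrase pair receives a syntactic label. These labels are then used to build the PSCS model, which is estimated from the same parallel training data as the phrase pairs themselves.

The PSCS model is then incorporated log-linearly into a standard PBMT system. During decoding, the system uses the model to prefer translation options that are more grammatical. A notable advantage of the method is its simplicity: it improves translation quality without complex architectural changes while preserving the efficiency of the system.

In the experiments, the PBMT system with the PSCS model gains 0.7 BLEU points over the baseline system. BLEU is a widely used metric for machine translation quality; a higher score means the output is closer to human reference translations. The gain therefore suggests that the PSCS model improves the grammaticality and fluency of the translations.

Keywords: machine translation, natural language processing, phrase-based machine translation, syntactic category, sequence model.

This work shows one way to improve machine translation output by integrating syntactic information into a phrase-based system.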
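The labeling step described above can be illustrated with a minimal Python sketch. It is not the authors' procedure: the excerpt does not say how a phrase is labeled when it does not match a constituent, so the fallback label "X", the function name phrase_category, and the hand-written constituent table are all assumptions made for illustration.

```python
from typing import Dict, Tuple

# (start, end) token offsets into the target sentence, end exclusive.
Span = Tuple[int, int]


def phrase_category(constituents: Dict[Span, str], span: Span,
                    fallback: str = "X") -> str:
    """Return the label of the constituent that exactly covers `span`.

    `fallback` is returned when the phrase is not a syntactic constituent.
    The fallback label is an assumption of this sketch, not taken from
    the paper.
    """
    return constituents.get(span, fallback)


# Hypothetical constituent spans for the parsed target sentence
# "the cat sat on the mat" (tokens 0..5).
constituents: Dict[Span, str] = {
    (0, 6): "S",
    (0, 2): "NP",  # the cat
    (2, 6): "VP",  # sat on the mat
    (3, 6): "PP",  # on the mat
    (4, 6): "NP",  # the mat
}

np_label = phrase_category(constituents, (0, 2))   # "the cat" is a constituent
non_label = phrase_category(constituents, (1, 3))  # "cat sat" is not
print(np_label, non_label)
```

In a real system the constituent table would come from a parser run over the target side of the training corpus, and each extracted phrase pair would store the label returned for its target span.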

A. Gelbukh (Ed.): CICLing 2012, Part II, LNCS 7182, pp. 52–59, 2012.

Phrasal Syntactic Category Sequence Model for Phrase-Based MT

Hailong Cao, Eiichiro Sumita, Tiejun Zhao, and Sheng Li

Harbin Institute of Technology, China

National Institute of Information and Communications Technology, Japan

{hailong,tjzhao,shengli}@mtlab.hit.edu.cn, eiichiro.sumita@nict.go.jp

Abstract. Incorporating target syntax into phrase-based machine translation (PBMT) can generate syntactically well-formed translations. We propose a novel phrasal syntactic category sequence (PSCS) model which allows a PBMT decoder to prefer more grammatical translations. We parse all the sentences on the target side of the bilingual training corpus. In the standard phrase pair extraction procedure, we assign a syntactic category to each phrase pair and build a PSCS model from the parallel training data. Then, we log-linearly incorporate the PSCS model into a standard PBMT system. Our method is very simple and yields a 0.7 BLEU point improvement when compared to the baseline PBMT system.
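To make the log-linear incorporation concrete, here is a minimal Python sketch. The excerpt does not define the PSCS feature, so the sketch assumes it is the log-probability of a hypothesis's phrase category sequence under an add-one-smoothed bigram model. The function names, feature names, weights, and toy training sequences are all hypothetical.

```python
import math
from collections import Counter
from typing import Dict, Sequence, Tuple

Model = Tuple[Counter, Counter, int]


def train_bigram(sequences: Sequence[Sequence[str]]) -> Model:
    """Count category bigrams, padding each sequence with <s> and </s>."""
    bigrams: Counter = Counter()
    contexts: Counter = Counter()
    vocab = set()
    for seq in sequences:
        padded = ["<s>", *seq, "</s>"]
        vocab.update(padded[1:])  # every symbol that can be predicted
        for prev, cur in zip(padded, padded[1:]):
            bigrams[(prev, cur)] += 1
            contexts[prev] += 1
    return bigrams, contexts, len(vocab)


def pscs_logprob(seq: Sequence[str], model: Model) -> float:
    """Add-one-smoothed bigram log-probability of a category sequence."""
    bigrams, contexts, vocab_size = model
    padded = ["<s>", *seq, "</s>"]
    return sum(
        math.log((bigrams[(prev, cur)] + 1) / (contexts[prev] + vocab_size))
        for prev, cur in zip(padded, padded[1:])
    )


def loglinear_score(features: Dict[str, float],
                    weights: Dict[str, float]) -> float:
    """Standard log-linear model score: weighted sum of feature values."""
    return sum(weights[name] * value for name, value in features.items())


# Toy category sequences standing in for those observed in training data.
model = train_bigram([["NP", "VP"], ["NP", "VP", "PP"], ["PP", "NP", "VP"]])

# Hypothetical weights; in practice they would be tuned on held-out data.
weights = {"tm": 1.0, "lm": 0.5, "pscs": 0.3}

# Two hypotheses with identical baseline features but different
# phrase category sequences.
hyp_a = {"tm": -2.0, "lm": -4.0, "pscs": pscs_logprob(["NP", "VP"], model)}
hyp_b = {"tm": -2.0, "lm": -4.0, "pscs": pscs_logprob(["VP", "NP"], model)}

score_a = loglinear_score(hyp_a, weights)
score_b = loglinear_score(hyp_b, weights)
print(score_a > score_b)
```

With a positive weight on the PSCS feature, the decoder in this sketch ranks the hypothesis whose category sequence resembles the training data above the one whose sequence was never observed, which is the kind of preference the abstract describes.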

Keywords: machine translation, natural language processing, phrase-based machine translation.

1 Introduction

Both PBMT models (Koehn et al., 2003; Chiang, 2005) and syntax-based machine translation models (Yamada et al., 2000; Quirk et al., 2005; Galley et al., 2006; Liu et al., 2006; Marcu et al., 2006; and numerous others) are state-of-the-art statistical machine translation (SMT) methods. Over the last several years, an increasing amount of work has been done to combine the advantages of the two approaches. DeNeefe et al. (2007) made a quantitative comparison of the phrase pairs that each model has to work with and found it is useful to improve the phrasal coverage of their string-to-tree model. Liu et al. (2007) proposed forest-to-string rules to capture the non-syntactic phrases in their tree-to-string model. Zhang et al. (2008) proposed a tree sequence based tree-to-tree model which can describe non-syntactic phrases with syntactic structure information.

The converse of the above methods is to incorporate syntactic information into the PBMT model. Zollmann and Venugopal (2006) started with a complete set of phrases as extracted by traditional PBMT heuristics, and then annotated the target side of each phrasal entry with the label of the constituent node in the target-side parse tree that
