A Systematic Comparison of Smoothing Techniques for Sentence-Level BLEU
Boxing Chen and Colin Cherry
National Research Council Canada
first.last@nrc-cnrc.gc.ca
Abstract
BLEU is the de facto standard machine translation (MT) evaluation metric. However, because BLEU computes a geometric mean of n-gram precisions, it often correlates poorly with human judgment at the sentence level. Therefore, several smoothing techniques have been proposed. This paper systematically compares 7 smoothing techniques for sentence-level BLEU. Three of them are proposed here for the first time, and they correlate better with human judgments at the sentence level than the other smoothing techniques. Moreover, we also compare the performance of the 7 smoothing techniques in statistical machine translation tuning.
1 Introduction
Since its invention, BLEU (Papineni et al., 2002) has been the most widely used metric for both machine translation (MT) evaluation and tuning. Many other metrics correlate better with human judgments of translation quality than BLEU, as shown in recent WMT Evaluation Task reports (Callison-Burch et al., 2011; Callison-Burch et al., 2012). However, BLEU remains the de facto standard evaluation and tuning metric. This is probably due to the following facts:

1. BLEU is language independent (except for word segmentation decisions).

2. BLEU can be computed quickly. This is important when choosing a tuning metric.

3. BLEU seems to be the best tuning metric from a quality point of view; i.e., models trained using BLEU obtain the highest scores from humans and even from other metrics (Cer et al., 2010).
One of the main criticisms of BLEU is that it correlates poorly with human judgments at the sentence level. Because it computes a geometric mean of n-gram precisions, if a higher-order n-gram precision (e.g., n = 4) of a sentence is 0, then the BLEU score of the entire sentence is 0, no matter how many 1-grams or 2-grams are matched. Therefore, several smoothing techniques for sentence-level BLEU have been proposed (Lin and Och, 2004; Gao and He, 2013).
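To make the zero-score problem concrete, the following minimal Python sketch (our illustration, not code from the paper) computes clipped n-gram precisions for a single hypothesis/reference pair; one unmatched 4-gram is enough to drive the geometric mean, and hence unsmoothed sentence-level BLEU, to zero.

```python
import math
from collections import Counter

def ngram_precision(hyp, ref, n):
    """Clipped n-gram precision of a hypothesis against one reference."""
    hyp_counts = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    return matches / max(sum(hyp_counts.values()), 1)

# Illustrative pair: a reasonable hypothesis that shares no 4-gram with the reference.
hyp = "a cat sat on the mat".split()
ref = "the cat is on the mat".split()

precisions = [ngram_precision(hyp, ref, n) for n in range(1, 5)]
print(precisions)   # [0.666..., 0.4, 0.25, 0.0]
geo_mean = math.prod(precisions) ** 0.25 if all(precisions) else 0.0
print(geo_mean)     # 0.0 -- the whole sentence scores zero despite 1- and 2-gram matches
```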
In this paper, we systematically compare 7 smoothing techniques for sentence-level BLEU. Three of them are proposed here for the first time, and on the WMT metrics task they correlate better with human judgments at the sentence level than the other smoothing techniques. Moreover, we compare the performance of the 7 smoothing techniques in statistical machine translation tuning on NIST Chinese-to-English and Arabic-to-English tasks. We show that when tuning optimizes the expected sum of these sentence-level metrics (as advocated by Cherry and Foster (2012) and Gao and He (2013), among others), all of these metrics perform similarly in terms of their ability to produce strong BLEU scores on a held-out test set.
2 BLEU and smoothing
2.1 BLEU
Suppose we have a translation T and its reference R. BLEU is computed from the precision P(N, T, R) and the brevity penalty BP(T, R):

$$\mathrm{BLEU}(N, T, R) = P(N, T, R) \times \mathrm{BP}(T, R) \qquad (1)$$
where P(N, T, R) is the geometric mean of the n-gram precisions p_n:

$$P(N, T, R) = \left( \prod_{n=1}^{N} p_n \right)^{1/N} \qquad (2)$$
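For reference, the sketch below assembles equations (1) and (2) into an unsmoothed sentence-level BLEU. The excerpt above does not define BP(T, R), so we assume the standard brevity penalty of Papineni et al. (2002), BP = min(1, exp(1 - |R|/|T|)); the helper names are ours.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(hyp, ref, max_n=4):
    """Unsmoothed sentence-level BLEU following equations (1) and (2)."""
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        precisions.append(matches / max(sum(hyp_counts.values()), 1))
    # Equation (2): geometric mean of p_1 .. p_N (zero if any p_n is zero).
    p = math.prod(precisions) ** (1.0 / max_n) if all(precisions) else 0.0
    # Assumed standard brevity penalty: min(1, exp(1 - |R|/|T|)).
    bp = min(1.0, math.exp(1.0 - len(ref) / len(hyp))) if hyp else 0.0
    # Equation (1): BLEU(N, T, R) = P(N, T, R) * BP(T, R).
    return p * bp
```

Roughly speaking, each smoothing technique compared in this paper amounts to replacing the all-or-nothing treatment of zero n-gram precisions above.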