A Systematic Comparison of Smoothing Techniques for Sentence-Level BLEU
Boxing Chen and Colin Cherry
National Research Council Canada
first.last@nrc-cnrc.gc.ca
Abstract
BLEU is the de facto standard machine translation (MT) evaluation metric. However, because BLEU computes a geometric mean of n-gram precisions, it often correlates poorly with human judgment at the sentence level. Therefore, several smoothing techniques have been proposed. This paper systematically compares 7 smoothing techniques for sentence-level BLEU. Three of them are proposed here for the first time, and they correlate better with human judgments at the sentence level than the other smoothing techniques. Moreover, we also compare the performance of the 7 smoothing techniques in statistical machine translation tuning.
1 Introduction
Since its invention, BLEU (Papineni et al., 2002) has been the most widely used metric for both machine translation (MT) evaluation and tuning. Many other metrics correlate better with human judgments of translation quality than BLEU, as shown in recent WMT Evaluation Task reports (Callison-Burch et al., 2011; Callison-Burch et al., 2012). However, BLEU remains the de facto standard evaluation and tuning metric. This is probably due to the following facts:

1. BLEU is language independent (except for word segmentation decisions).

2. BLEU can be computed quickly. This is important when choosing a tuning metric.

3. BLEU seems to be the best tuning metric from a quality point of view; i.e., models trained using BLEU obtain the highest scores from humans and even from other metrics (Cer et al., 2010).
One of the main criticisms of BLEU is that it correlates poorly with human judgments at the sentence level. Because it computes a geometric mean of n-gram precisions, if a higher-order n-gram precision (e.g., n = 4) of a sentence is 0, then the BLEU score of the entire sentence is 0, no matter how many 1-grams or 2-grams are matched. Therefore, several smoothing techniques for sentence-level BLEU have been proposed (Lin and Och, 2004; Gao and He, 2013).
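To make the zero-score problem concrete, the following minimal Python sketch (our illustration, not code from the paper) computes clipped n-gram precisions for a single hypothesis/reference pair; one unmatched 4-gram is enough to drive the geometric mean, and hence unsmoothed sentence-level BLEU, to zero.

```python
import math
from collections import Counter

def ngram_precision(hyp, ref, n):
    """Clipped n-gram precision of a hypothesis against one reference."""
    hyp_counts = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
    ref_counts = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
    matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
    return matches / max(sum(hyp_counts.values()), 1)

# Illustrative pair: a reasonable hypothesis that shares no 4-gram with the reference.
hyp = "a cat sat on the mat".split()
ref = "the cat is on the mat".split()

precisions = [ngram_precision(hyp, ref, n) for n in range(1, 5)]
print(precisions)   # [0.666..., 0.4, 0.25, 0.0]
geo_mean = math.prod(precisions) ** 0.25 if all(precisions) else 0.0
print(geo_mean)     # 0.0 -- the whole sentence scores zero despite 1- and 2-gram matches
```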
In this paper, we systematically compare 7 smoothing techniques for sentence-level BLEU. Three of them are proposed here for the first time, and on the WMT metrics task they correlate better with human judgments at the sentence level than the other smoothing techniques. Moreover, we compare the performance of the 7 smoothing techniques in statistical machine translation tuning on NIST Chinese-to-English and Arabic-to-English tasks. We show that when tuning optimizes the expected sum of these sentence-level metrics (as advocated by Cherry and Foster (2012) and Gao and He (2013), among others), all of these metrics perform similarly in terms of their ability to produce strong BLEU scores on a held-out test set.
2 BLEU and smoothing
2.1 BLEU
Suppose we have a translation T and its reference R. BLEU is computed from the precision P(N, T, R) and the brevity penalty BP(T, R):

$$\mathrm{BLEU}(N, T, R) = P(N, T, R) \times \mathrm{BP}(T, R) \qquad (1)$$
where P(N, T, R) is the geometric mean of the n-gram precisions p_n:

$$P(N, T, R) = \left( \prod_{n=1}^{N} p_n \right)^{1/N} \qquad (2)$$
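For reference, the sketch below assembles equations (1) and (2) into an unsmoothed sentence-level BLEU. The excerpt above does not define BP(T, R), so we assume the standard brevity penalty of Papineni et al. (2002), BP = min(1, exp(1 - |R|/|T|)); the helper names are ours.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def sentence_bleu(hyp, ref, max_n=4):
    """Unsmoothed sentence-level BLEU following equations (1) and (2)."""
    precisions = []
    for n in range(1, max_n + 1):
        hyp_counts, ref_counts = ngrams(hyp, n), ngrams(ref, n)
        matches = sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
        precisions.append(matches / max(sum(hyp_counts.values()), 1))
    # Equation (2): geometric mean of p_1 .. p_N (zero if any p_n is zero).
    p = math.prod(precisions) ** (1.0 / max_n) if all(precisions) else 0.0
    # Assumed standard brevity penalty: min(1, exp(1 - |R|/|T|)).
    bp = min(1.0, math.exp(1.0 - len(ref) / len(hyp))) if hyp else 0.0
    # Equation (1): BLEU(N, T, R) = P(N, T, R) * BP(T, R).
    return p * bp
```

Roughly speaking, each smoothing technique compared in this paper amounts to replacing the all-or-nothing treatment of zero n-gram precisions above.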