context is in fact a more difficult problem. This is because most of the difficulty in automatically evaluating language generation models lies in the large set of correct answers. Dialogue response generation given solely the context intuitively has a higher diversity (or entropy) than translation given text in a source language, or surface realization given some intermediate form (Artstein et al., 2009).
3 Evaluation Metrics
Given a dialogue context and a proposed response, our goal is to automatically evaluate how appropriate the proposed response is to the conversation. We focus on metrics that compare it to the ground truth response of the conversation. In particular, we investigate two approaches: word-based similarity metrics and word-embedding-based similarity metrics.
3.1 Word Overlap-based Metrics
We first consider metrics that evaluate the amount of word-overlap between the proposed response and the ground-truth response. We examine the BLEU and METEOR scores that have been used for machine translation, and the ROUGE score that has been used for automatic summarization. While these metrics have been shown to correlate with human judgements in their target domains (Papineni et al., 2002a; Lin, 2004), they have not been thoroughly investigated for dialogue systems.²
We denote the ground truth response as $r$ (thus we assume that there is a single candidate ground truth response), and the proposed response as $\hat{r}$. The $j$'th token in the ground truth response $r$ is denoted by $w_j$, with $\hat{w}_j$ denoting the $j$'th token in the proposed response $\hat{r}$.
BLEU. BLEU (Papineni et al., 2002a) analyzes the co-occurrences of n-grams in the ground truth and the proposed responses. It first computes an n-gram precision for the whole dataset (we assume that there is a single candidate ground truth response per context):

$$P_n(r, \hat{r}) = \frac{\sum_k \min(h(k, r), h(k, \hat{r}_i))}{\sum_k h(k, r_i)}$$

where $k$ indexes all possible n-grams of length $n$ and $h(k, r)$ is the number of n-grams $k$ in $r$.³ To avoid the drawbacks of using a precision score, namely that it favours shorter (candidate) sentences, the authors introduce a brevity penalty. BLEU-N, where $N$ is the maximum length of n-grams considered, is defined as:

$$\text{BLEU-N} := b(r, \hat{r}) \exp\left(\sum_{n=1}^{N} \beta_n \log P_n(r, \hat{r})\right)$$

where $\beta_n$ is a weighting that is usually uniform, and $b(\cdot)$ is the brevity penalty. The most commonly used version of BLEU uses $N = 4$. Modern versions of BLEU also use sentence-level smoothing, as the geometric mean often results in scores of 0 if there is no 4-gram overlap (Chen and Cherry, 2014). Note that BLEU is usually calculated at the corpus-level, and was originally designed for use with multiple reference sentences.

² To the best of our knowledge, only BLEU has been evaluated quantitatively in the dialogue system setting, by Galley et al. (2015a) on the Twitter domain. However, they carried out their experiments in a very different setting with multiple ground truth responses, which are rarely available in practice, and without providing any qualitative analysis of their results.
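As a concrete illustration, the BLEU-N computation above can be sketched in a few lines of Python. This is a simplified sentence-level variant with uniform weights $\beta_n = 1/N$, standard clipped n-gram precision, and add-one smoothing (one of several schemes discussed by Chen and Cherry, 2014); the function and variable names are illustrative, not from any reference implementation.

```python
from collections import Counter
import math

def ngram_counts(tokens, n):
    # h(k, r): count of each n-gram k of length n in token list r
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu_n(reference, candidate, max_n=4):
    # Sentence-level BLEU-N sketch: uniform weights beta_n = 1/N and
    # add-one smoothing so a missing 4-gram match does not zero the score.
    log_prec = 0.0
    for n in range(1, max_n + 1):
        ref = ngram_counts(reference, n)
        cand = ngram_counts(candidate, n)
        # clipped matches: sum over n-grams k of min(h(k, r), h(k, r̂))
        overlap = sum(min(ref[k], c) for k, c in cand.items())
        total = sum(cand.values())
        log_prec += (1.0 / max_n) * math.log((overlap + 1) / (total + 1))
    # brevity penalty b(r, r̂): penalize candidates shorter than the reference
    r_len, c_len = len(reference), len(candidate)
    bp = 1.0 if c_len >= r_len else math.exp(1 - r_len / max(c_len, 1))
    return bp * math.exp(log_prec)
```

With identical sentences the clipped precisions are all 1 and the score is 1.0; disjoint sentences are pulled toward 0 only by the smoothing terms, which is why corpus-level aggregation or smoothing choice matters in practice.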
METEOR. The METEOR metric (Banerjee and Lavie, 2005) was introduced to address several weaknesses in BLEU. It creates an explicit alignment between the candidate and target responses. The alignment is based on exact token matching, followed by WordNet synonyms, stemmed tokens, and then paraphrases. Given a set of alignments, the METEOR score is the harmonic mean of precision and recall between the proposed and ground truth sentence.
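The final scoring step can be sketched as follows. This is a heavy simplification: it aligns tokens by exact match only, omitting METEOR's WordNet-synonym, stemming, and paraphrase stages as well as its fragmentation penalty, and the `alpha` weighting is an illustrative choice rather than the official parameterization.

```python
from collections import Counter

def meteor_sketch(reference, candidate, alpha=0.9):
    # Simplified METEOR-style score: exact-token alignment only.
    ref = Counter(reference)
    cand = Counter(candidate)
    matches = sum(min(ref[w], c) for w, c in cand.items())
    if matches == 0:
        return 0.0
    precision = matches / len(candidate)
    recall = matches / len(reference)
    # harmonic mean weighted toward recall, in the spirit of METEOR's F_mean
    return precision * recall / (alpha * precision + (1 - alpha) * recall)
```

Because the mean is weighted toward recall, a short candidate that matches perfectly is still penalized for covering only part of the reference.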
ROUGE. ROUGE (Lin, 2004) is a set of evaluation metrics used for automatic summarization. We consider ROUGE-L, which is an F-measure based on the Longest Common Subsequence (LCS) between a candidate and target sentence. The LCS is a set of words which occur in two sentences in the same order; however, unlike n-grams, the words do not have to be contiguous, i.e., there can be other words in between the words of the LCS.
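ROUGE-L is straightforward to sketch: a standard dynamic program computes the LCS length, which is then turned into an F-measure over precision and recall. The `beta` value below is an illustrative choice; Lin (2004) uses a large beta so that the measure is dominated by recall.

```python
def lcs_length(a, b):
    # dynamic-programming Longest Common Subsequence length over token lists
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(reference, candidate, beta=1.2):
    # ROUGE-L F-measure: F = (1 + beta^2) * P * R / (R + beta^2 * P)
    lcs = lcs_length(reference, candidate)
    if lcs == 0:
        return 0.0
    p = lcs / len(candidate)  # precision: LCS relative to candidate length
    r = lcs / len(reference)  # recall: LCS relative to reference length
    return (1 + beta ** 2) * p * r / (r + beta ** 2 * p)
```

For example, "the cat sat" and "the dog sat" share the non-contiguous subsequence "the ... sat", so their LCS length is 2 even though only one bigram position agrees.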
³ Note that the min in this equation calculates the number of co-occurrences of n-gram $k$ between the ground truth response $r$ and the proposed response $\hat{r}$, as it computes the fewest appearances of $k$ in either response.