context is in fact a more difficult problem. This is because most of the difficulty in automatically evaluating language generation models lies in the large set of correct answers. Dialogue response generation given solely the context intuitively has a higher diversity (or entropy) than translation given text in a source language, or surface realization given some intermediate form (Artstein et al., 2009).
3 Evaluation Metrics
Given a dialogue context and a proposed response, our goal is to automatically evaluate how appropriate the proposed response is to the conversation. We focus on metrics that compare it to the ground truth response of the conversation. In particular, we investigate two approaches: word-based similarity metrics and word-embedding-based similarity metrics.
3.1 Word Overlap-based Metrics
We first consider metrics that evaluate the amount of word-overlap between the proposed response and the ground-truth response. We examine the BLEU and METEOR scores that have been used for machine translation, and the ROUGE score that has been used for automatic summarization. While these metrics have been shown to correlate with human judgements in their target domains (Papineni et al., 2002a; Lin, 2004), they have not been thoroughly investigated for dialogue systems.²
We denote the ground truth response as $r$ (thus we assume that there is a single candidate ground truth response), and the proposed response as $\hat{r}$. The $j$'th token in the ground truth response $r$ is denoted by $w_j$, with $\hat{w}_j$ denoting the $j$'th token in the proposed response $\hat{r}$.
BLEU. BLEU (Papineni et al., 2002a) analyzes the co-occurrences of n-grams in the ground truth and the proposed responses. It first computes an n-gram precision for the whole dataset (we assume that there is a single candidate ground truth response per context):

$$P_n(r, \hat{r}) = \frac{\sum_k \min(h(k, r), h(k, \hat{r}_i))}{\sum_k h(k, r_i)}$$

where $k$ indexes all possible n-grams of length $n$ and $h(k, r)$ is the number of n-grams $k$ in $r$.³ To avoid the drawbacks of using a precision score, namely that it favours shorter (candidate) sentences, the authors introduce a brevity penalty. BLEU-N, where $N$ is the maximum length of n-grams considered, is defined as:

$$\text{BLEU-N} := b(r, \hat{r}) \exp\left(\sum_{n=1}^{N} \beta_n \log P_n(r, \hat{r})\right)$$

where $\beta_n$ is a weighting that is usually uniform, and $b(\cdot)$ is the brevity penalty. The most commonly used version of BLEU uses $N = 4$. Modern versions of BLEU also use sentence-level smoothing, as the geometric mean often results in scores of 0 if there is no 4-gram overlap (Chen and Cherry, 2014). Note that BLEU is usually calculated at the corpus-level, and was originally designed for use with multiple reference sentences.

² To the best of our knowledge, only BLEU has been evaluated quantitatively in the dialogue system setting, by Galley et al. (2015a) on the Twitter domain. However, they carried out their experiments in a very different setting with multiple ground truth responses, which are rarely available in practice, and without providing any qualitative analysis of their results.
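As a concrete illustration, the BLEU-N computation above can be sketched in a few lines of Python. This is a simplified sentence-level variant with uniform weights $\beta_n = 1/N$, standard clipped n-gram precision, and add-one smoothing (one of several schemes discussed by Chen and Cherry, 2014); the function and variable names are illustrative, not from any reference implementation.

```python
from collections import Counter
import math

def ngram_counts(tokens, n):
    # h(k, r): count of each n-gram k of length n in token list r
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def bleu_n(reference, candidate, max_n=4):
    # Sentence-level BLEU-N sketch: uniform weights beta_n = 1/N and
    # add-one smoothing so a missing 4-gram match does not zero the score.
    log_prec = 0.0
    for n in range(1, max_n + 1):
        ref = ngram_counts(reference, n)
        cand = ngram_counts(candidate, n)
        # clipped matches: sum over n-grams k of min(h(k, r), h(k, r̂))
        overlap = sum(min(ref[k], c) for k, c in cand.items())
        total = sum(cand.values())
        log_prec += (1.0 / max_n) * math.log((overlap + 1) / (total + 1))
    # brevity penalty b(r, r̂): penalize candidates shorter than the reference
    r_len, c_len = len(reference), len(candidate)
    bp = 1.0 if c_len >= r_len else math.exp(1 - r_len / max(c_len, 1))
    return bp * math.exp(log_prec)
```

With identical sentences the clipped precisions are all 1 and the score is 1.0; disjoint sentences are pulled toward 0 only by the smoothing terms, which is why corpus-level aggregation or smoothing choice matters in practice.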
METEOR. The METEOR metric (Banerjee and Lavie, 2005) was introduced to address several weaknesses in BLEU. It creates an explicit alignment between the candidate and target responses. The alignment is based on exact token matching, followed by WordNet synonyms, stemmed tokens, and then paraphrases. Given a set of alignments, the METEOR score is the harmonic mean of precision and recall between the proposed and ground truth sentence.
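The final scoring step can be sketched as follows. This is a heavy simplification: it aligns tokens by exact match only, omitting METEOR's WordNet-synonym, stemming, and paraphrase stages as well as its fragmentation penalty, and the `alpha` weighting is an illustrative choice rather than the official parameterization.

```python
from collections import Counter

def meteor_sketch(reference, candidate, alpha=0.9):
    # Simplified METEOR-style score: exact-token alignment only.
    ref = Counter(reference)
    cand = Counter(candidate)
    matches = sum(min(ref[w], c) for w, c in cand.items())
    if matches == 0:
        return 0.0
    precision = matches / len(candidate)
    recall = matches / len(reference)
    # harmonic mean weighted toward recall, in the spirit of METEOR's F_mean
    return precision * recall / (alpha * precision + (1 - alpha) * recall)
```

Because the mean is weighted toward recall, a short candidate that matches perfectly is still penalized for covering only part of the reference.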
ROUGE. ROUGE (Lin, 2004) is a set of evaluation metrics used for automatic summarization. We consider ROUGE-L, which is an F-measure based on the Longest Common Subsequence (LCS) between a candidate and target sentence. The LCS is a set of words which occur in two sentences in the same order; however, unlike n-grams, the words do not have to be contiguous, i.e., there can be other words in between the words of the LCS.
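ROUGE-L is straightforward to sketch: a standard dynamic program computes the LCS length, which is then turned into an F-measure over precision and recall. The `beta` value below is an illustrative choice; Lin (2004) uses a large beta so that the measure is dominated by recall.

```python
def lcs_length(a, b):
    # dynamic-programming Longest Common Subsequence length over token lists
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l(reference, candidate, beta=1.2):
    # ROUGE-L F-measure: F = (1 + beta^2) * P * R / (R + beta^2 * P)
    lcs = lcs_length(reference, candidate)
    if lcs == 0:
        return 0.0
    p = lcs / len(candidate)  # precision: LCS relative to candidate length
    r = lcs / len(reference)  # recall: LCS relative to reference length
    return (1 + beta ** 2) * p * r / (r + beta ** 2 * p)
```

For example, "the cat sat" and "the dog sat" share the non-contiguous subsequence "the ... sat", so their LCS length is 2 even though only one bigram position agrees.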
³ Note that the min in this equation calculates the number of co-occurrences of n-gram $k$ between the ground truth response $r$ and the proposed response $\hat{r}$, as it computes the fewest appearances of $k$ in either response.