Domain            #Passages   #Q/A pairs   Passage length   #Turns per passage
Children’s Sto.         750        10.5k              211                 14.0
Literature            1,815        25.5k              284                 15.6
Mid/High Sch.         1,911        28.6k              306                 15.0
News                  1,902        28.7k              268                 15.1
Wikipedia             1,821        28.0k              245                 15.4
Out of domain
Science                 100         1.5k              251                 15.3
Reddit                  100         1.7k              361                 16.6
Total                 8,399         127k              271                 15.2

Table 2: Distribution of domains in CoQA.
in order to limit the number of possible answers.
We encourage this by automatically copying the
highlighted text into the answer box and allowing
them to edit the copied text in order to generate a natural answer. We found that 78% of the answers have at least one edit, such as changing a word’s case or adding punctuation.
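As a rough illustration (not part of the paper's pipeline), this edit rate can be estimated by comparing each auto-copied span with the final free-form answer; the (copied_span, final_answer) pairing below is a hypothetical representation of the annotation data.

```python
# Minimal sketch: estimate the fraction of answers that were edited after the
# highlighted rationale was auto-copied into the answer box.
# `pairs` is a hypothetical list of (copied_span, final_answer) string pairs.
def edited_fraction(pairs):
    edited = sum(1 for span, answer in pairs if span.strip() != answer.strip())
    return edited / len(pairs)

# Example: one answer adds a period, the other is left unchanged -> 0.5
print(edited_fraction([("Jessica", "Jessica."), ("80", "80")]))
```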
3.2 Passage Selection
We select passages from seven diverse domains:
children’s stories from MCTest (Richardson et al., 2013), literature from Project Gutenberg [4], middle and high school English exams from RACE (Lai et al., 2017), news articles from CNN (Hermann et al., 2015), articles from Wikipedia, science articles from AI2 Science Questions (Welbl et al., 2017), and Reddit articles from the Writing Prompts dataset (Fan et al., 2018).
Not all passages in these domains are equally
good for generating interesting conversations. A
passage with just one entity often results in questions that focus entirely on that entity. Therefore, we select passages with multiple entities, events, and pronominal references using Stanford CoreNLP (Manning et al., 2014). We truncate long articles to the first few paragraphs, so that each passage is around 200 words.
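A minimal sketch of such a filter, using the stanza package from the Stanford NLP group as a stand-in for the CoreNLP pipeline used in the paper; the entity and pronoun thresholds are illustrative assumptions, and event detection is omitted.

```python
import stanza  # pip install stanza; run stanza.download('en') on first use

nlp = stanza.Pipeline('en', processors='tokenize,pos,ner')

def keep_passage(text, min_entities=2, min_pronouns=1):
    """Heuristic filter: keep passages mentioning several entities and pronouns."""
    doc = nlp(text)
    n_entities = len({ent.text for ent in doc.ents})  # distinct named entities
    n_pronouns = sum(word.upos == 'PRON'
                     for sent in doc.sentences for word in sent.words)
    return n_entities >= min_entities and n_pronouns >= min_pronouns

def truncate(text, target_words=200):
    """Keep leading paragraphs until roughly 200 words have been accumulated."""
    kept, n_words = [], 0
    for paragraph in text.split('\n\n'):
        kept.append(paragraph)
        n_words += len(paragraph.split())
        if n_words >= target_words:
            break
    return '\n\n'.join(kept)
```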
Table 2 shows the distribution of domains. We
reserve the Science and Reddit domains for out-of-
domain evaluation. For each in-domain dataset, we
split the data such that there are 100 passages in the
development set, 100 passages in the test set, and
the rest in the training set. For each out-of-domain
dataset, we just have 100 passages in the test set.
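A sketch of the per-domain split described above; random assignment of passages to splits is an assumption, since the paper does not state how passages are allocated.

```python
import random

def split_domain(passages, n_dev=100, n_test=100, seed=0):
    """Split one in-domain passage list into train/dev/test (100 dev, 100 test, rest train)."""
    passages = list(passages)
    random.Random(seed).shuffle(passages)  # assumed: random assignment to splits
    dev = passages[:n_dev]
    test = passages[n_dev:n_dev + n_test]
    train = passages[n_dev + n_test:]
    return train, dev, test
```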
[4] Project Gutenberg: https://www.gutenberg.org
3.3 Collecting Multiple Answers
Some questions in CoQA may have multiple valid
answers. For example, another answer for Q4 in Figure 2 is A Republican candidate. In order to account for answer variations, we collect three additional answers for all questions in the development and test data. Since our data is conversational, questions influence answers, which in turn influence the
follow-up questions. In the previous example, if
the original answer was A Republican candidate,
then the following question Which party does he
belong to? would not have occurred in the first
place. When we show questions from an existing
conversation to new answerers, it is likely they will
deviate from the original answers, which makes the
conversation incoherent. It is thus important to
bring them to a common ground with the original
answer.
We achieve this by turning the answer collection
task into a game of predicting original answers.
First, we show a question to a new answerer, and
when she answers it, we show the original answer
and ask her to verify if her answer matches the
original. For the next question, we ask her to guess
the original answer and verify again. We repeat this
process until the conversation is complete. In our
pilot experiment, the human F1 score increased by 5.4% when we used this verification setup.
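A sketch of this answer-prediction game as a simple loop; the callbacks ask_annotator, show_original, and record are hypothetical stand-ins for the crowdsourcing interface, not released CoQA tooling.

```python
def collect_verified_answers(conversation, ask_annotator, show_original, record):
    """conversation: sequence of (question, original_answer) turns (hypothetical format)."""
    for question, original_answer in conversation:
        guess = ask_annotator(question)    # new answerer tries to predict the original answer
        show_original(original_answer)     # reveal it to keep the dialogue on common ground
        matches = guess.strip().lower() == original_answer.strip().lower()
        record(question, guess, matches)   # store the additional answer and whether it matched
```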
4 Dataset Analysis
What makes the CoQA dataset conversational com-
pared to existing reading comprehension datasets
like SQuAD? How does the conversation flow from
one turn to the next? What linguistic phenomena
do the questions in CoQA exhibit? We answer
these questions below.
4.1 Comparison with SQuAD 2.0
SQuAD has become the de facto dataset for read-
ing comprehension. In the following, we perform an in-depth comparison of CoQA and the latest version of SQuAD (Rajpurkar et al., 2018). Figure 3(a) and Figure 3(b) show the distribution of frequent trigram prefixes. While coreferences are non-existent in SQuAD, almost every sector of CoQA contains coreferences (he, him, she, it, they), indicating that CoQA is highly conversational. Because of
the free-form nature of answers, we expect a richer
variety of questions in CoQA than in SQuAD. While
nearly half of SQuAD questions are dominated by
what questions, the distribution of CoQA is spread