Domain            #Passages   #Q/A pairs   Passage length   #Turns per passage
Children’s Sto.         750        10.5k              211                 14.0
Literature            1,815        25.5k              284                 15.6
Mid/High Sch.         1,911        28.6k              306                 15.0
News                  1,902        28.7k              268                 15.1
Wikipedia             1,821        28.0k              245                 15.4
Out of domain
Science                 100         1.5k              251                 15.3
Reddit                  100         1.7k              361                 16.6
Total                 8,399         127k              271                 15.2

Table 2: Distribution of domains in CoQA.
in order to limit the number of possible answers.
We encourage this by automatically copying the
highlighted text into the answer box and allowing
them to edit the copied text in order to generate a natural answer. We found that 78% of the answers have at least one edit, such as changing a word’s case or adding punctuation.
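As a rough illustration (not part of the paper's pipeline), this edit rate can be estimated by comparing each auto-copied span with the final free-form answer; the (copied_span, final_answer) pairing below is a hypothetical representation of the annotation data.

```python
# Minimal sketch: estimate the fraction of answers that were edited after the
# highlighted rationale was auto-copied into the answer box.
# `pairs` is a hypothetical list of (copied_span, final_answer) string pairs.
def edited_fraction(pairs):
    edited = sum(1 for span, answer in pairs if span.strip() != answer.strip())
    return edited / len(pairs)

# Example: one answer adds a period, the other is left unchanged -> 0.5
print(edited_fraction([("Jessica", "Jessica."), ("80", "80")]))
```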
3.2 Passage Selection
We select passages from seven diverse domains:
children’s stories from MCTest (Richardson et al., 2013), literature from Project Gutenberg [4], middle and high school English exams from RACE (Lai et al., 2017), news articles from CNN (Hermann et al., 2015), articles from Wikipedia, science articles from AI2 Science Questions (Welbl et al., 2017), and Reddit articles from the Writing Prompts dataset (Fan et al., 2018).
Not all passages in these domains are equally
good for generating interesting conversations. A
passage with just one entity often results in questions that focus entirely on that entity. Therefore, we select passages with multiple entities, events, and pronominal references using Stanford CoreNLP (Manning et al., 2014). We truncate long articles to the first few paragraphs, so that each passage is around 200 words.
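A minimal sketch of such a filter, using the stanza package from the Stanford NLP group as a stand-in for the CoreNLP pipeline used in the paper; the entity and pronoun thresholds are illustrative assumptions, and event detection is omitted.

```python
import stanza  # pip install stanza; run stanza.download('en') on first use

nlp = stanza.Pipeline('en', processors='tokenize,pos,ner')

def keep_passage(text, min_entities=2, min_pronouns=1):
    """Heuristic filter: keep passages mentioning several entities and pronouns."""
    doc = nlp(text)
    n_entities = len({ent.text for ent in doc.ents})  # distinct named entities
    n_pronouns = sum(word.upos == 'PRON'
                     for sent in doc.sentences for word in sent.words)
    return n_entities >= min_entities and n_pronouns >= min_pronouns

def truncate(text, target_words=200):
    """Keep leading paragraphs until roughly 200 words have been accumulated."""
    kept, n_words = [], 0
    for paragraph in text.split('\n\n'):
        kept.append(paragraph)
        n_words += len(paragraph.split())
        if n_words >= target_words:
            break
    return '\n\n'.join(kept)
```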
Table 2 shows the distribution of domains. We
reserve the Science and Reddit domains for out-of-
domain evaluation. For each in-domain dataset, we
split the data such that there are 100 passages in the
development set, 100 passages in the test set, and
the rest in the training set. For each out-of-domain
dataset, we just have 100 passages in the test set.
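A sketch of the per-domain split described above; random assignment of passages to splits is an assumption, since the paper does not state how passages are allocated.

```python
import random

def split_domain(passages, n_dev=100, n_test=100, seed=0):
    """Split one in-domain passage list into train/dev/test (100 dev, 100 test, rest train)."""
    passages = list(passages)
    random.Random(seed).shuffle(passages)  # assumed: random assignment to splits
    dev = passages[:n_dev]
    test = passages[n_dev:n_dev + n_test]
    train = passages[n_dev + n_test:]
    return train, dev, test
```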
[4] Project Gutenberg: https://www.gutenberg.org
3.3 Collecting Multiple Answers
Some questions in CoQA may have multiple valid
answers. For example, another answer for Q4 in Figure 2 is A Republican candidate. In order to account for answer variations, we collect three additional answers for all questions in the development and test data. Since our data is conversational, questions influence answers, which in turn influence the
follow-up questions. In the previous example, if
the original answer was A Republican candidate,
then the following question Which party does he
belong to? would not have occurred in the first
place. When we show questions from an existing
conversation to new answerers, it is likely they will
deviate from the original answers, which makes the
conversation incoherent. It is thus important to
bring them to a common ground with the original
answer.
We achieve this by turning the answer collection
task into a game of predicting original answers.
First, we show a question to a new answerer, and
when she answers it, we show the original answer
and ask her to verify if her answer matches the
original. For the next question, we ask her to guess
the original answer and verify again. We repeat this
process until the conversation is complete. In our
pilot experiment, the human F1 score increased by 5.4% when we used this verification setup.
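A sketch of this answer-prediction game as a simple loop; the callbacks ask_annotator, show_original, and record are hypothetical stand-ins for the crowdsourcing interface, not released CoQA tooling.

```python
def collect_verified_answers(conversation, ask_annotator, show_original, record):
    """conversation: sequence of (question, original_answer) turns (hypothetical format)."""
    for question, original_answer in conversation:
        guess = ask_annotator(question)    # new answerer tries to predict the original answer
        show_original(original_answer)     # reveal it to keep the dialogue on common ground
        matches = guess.strip().lower() == original_answer.strip().lower()
        record(question, guess, matches)   # store the additional answer and whether it matched
```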
4 Dataset Analysis
What makes the CoQA dataset conversational com-
pared to existing reading comprehension datasets
like SQuAD? How does the conversation flow from
one turn to the next? What linguistic phenomena
do the questions in CoQA exhibit? We answer
these questions below.
4.1 Comparison with SQuAD 2.0
SQuAD has become the de facto dataset for read-
ing comprehension. In the following, we perform an in-depth comparison of CoQA and the latest version of SQuAD (Rajpurkar et al., 2018). Figure 3(a) and Figure 3(b) show the distribution of frequent trigram prefixes. While coreferences are non-existent in SQuAD, almost every sector of CoQA contains coreferences (he, him, she, it, they), indicating that CoQA is highly conversational. Because of
the free-form nature of answers, we expect a richer
variety of questions in CoQA than in SQuAD. While
nearly half of SQuAD questions are dominated by
what questions, the distribution of CoQA is spread