DuReader: a Chinese Machine Reading Comprehension Dataset from Real-world Applications
Wei He, Kai Liu, Jing Liu, Yajuan Lyu, Shiqi Zhao, Xinyan Xiao, Yuan Liu, Yizhong Wang, Hua Wu, Qiaoqiao She, Xuan Liu, Tian Wu, Haifeng Wang
Baidu Inc., Beijing, China
{hewei06, liukai20, liujing46, lvyajuan, zhaoshiqi, xiaoxinyan, liuyuan04, wangyizhong01, wu_hua, sheqiaoqiao, liuxuan, wutian, wanghaifeng}@baidu.com
Abstract
This paper introduces DuReader, a new large-scale, open-domain Chinese machine reading comprehension (MRC) dataset designed to address real-world MRC. DuReader has three advantages over previous MRC datasets: (1) data sources: questions and documents are based on Baidu Search and Baidu Zhidao [1]; answers are manually generated. (2) question types: it provides rich annotations for more question types, especially yes-no and opinion questions, which leaves more opportunity for the research community. (3) scale: it contains 200K questions, 420K answers and 1M documents; it is the largest Chinese MRC dataset so far. Experiments show that human performance is well above current state-of-the-art baseline systems, leaving plenty of room for the community to make improvements. To help the community make these improvements, both DuReader [2] and the baseline systems [3] have been posted online. We also organize a shared competition to encourage the exploration of more models. Since the release of the task, there have been significant improvements over the baselines.
1 Introduction
The task of machine reading comprehension (MRC) aims to empower machines to answer questions after reading articles (Rajpurkar et al., 2016; Nguyen et al., 2016). In recent years, a number of datasets have been developed for MRC, as shown in Table 1. These datasets have led to advances such as Match-LSTM (Wang and Jiang, 2017), BiDAF (Seo et al., 2016), AoA Reader (Cui et al., 2017), DCN (Xiong et al., 2017) and R-Net (Wang et al., 2017). This paper hopes to advance MRC even further with the release of DuReader, challenging the community to deal with more realistic data sources, more types of questions and larger scale, as illustrated in Tables 1-4. Table 1 highlights DuReader's advantages over previous datasets in terms of data sources and scale. Tables 2-4 highlight DuReader's advantages in the range of questions.

[1] Zhidao (https://zhidao.baidu.com) is the largest Chinese community-based question answering (CQA) site in the world.
[2] http://ai.baidu.com/broad/download?dataset=dureader
[3] https://github.com/baidu/DuReader
Ideally, a good dataset should be based on questions from real applications. However, many existing datasets have been forced to make various compromises, such as: (1) cloze task: data are synthesized by removing a keyword from a text, and the task is to fill in the missing keyword (Hermann et al., 2015; Cui et al., 2016; Hill et al., 2015). (2) multiple-choice exams: Richardson et al. (2013) collect both fictional stories and the corresponding multiple-choice questions by crowdsourcing, while Lai et al. (2017) collect multiple-choice questions from English exams. (3) crowdsourcing: Turkers are given documents (e.g., articles from the news and/or Wikipedia) and are asked to construct questions after reading the documents (Trischler et al., 2017; Rajpurkar et al., 2016; Kočiský et al., 2017).
These limitations have motivated the construction of datasets based on queries that real users submitted to real search engines. MS-MARCO (Nguyen et al., 2016) is based on Bing logs (in English), and DuReader (this paper) is based on the logs of Baidu Search (in Chinese). Besides question sources, DuReader complements MS-MARCO and other datasets in the following ways:
question types: DuReader contains a richer in-
arXiv:1711.05073v4 [cs.CL] 11 Jun 2018