DuReader: a Chinese Machine Reading Comprehension Dataset from Real-world Applications
Wei He, Kai Liu, Jing Liu, Yajuan Lyu, Shiqi Zhao, Xinyan Xiao, Yuan Liu, Yizhong Wang, Hua Wu, Qiaoqiao She, Xuan Liu, Tian Wu, Haifeng Wang
Baidu Inc., Beijing, China
{hewei06, liukai20, liujing46, lvyajuan, zhaoshiqi, xiaoxinyan, liuyuan04, wangyizhong01, wu_hua, sheqiaoqiao, liuxuan, wutian, wanghaifeng}@baidu.com
Abstract
This paper introduces DuReader, a new large-scale, open-domain Chinese machine reading comprehension (MRC) dataset designed to address real-world MRC. DuReader has three advantages over previous MRC datasets: (1) data sources: questions and documents are based on Baidu Search and Baidu Zhidao [1]; answers are manually generated. (2) question types: it provides rich annotations for more question types, especially yes-no and opinion questions, which leaves more opportunity for the research community. (3) scale: it contains 200K questions, 420K answers and 1M documents; it is the largest Chinese MRC dataset so far. Experiments show that human performance is well above current state-of-the-art baseline systems, leaving plenty of room for the community to make improvements. To help the community make these improvements, both DuReader [2] and the baseline systems [3] have been posted online. We also organize a shared competition to encourage the exploration of more models. Since the release of the task, there have been significant improvements over the baselines.
1 Introduction
The task of machine reading comprehension (MRC) aims to empower machines to answer questions after reading articles (Rajpurkar et al., 2016; Nguyen et al., 2016). In recent years, a number of datasets have been developed for MRC, as shown in Table 1. These datasets have led to advances such as Match-LSTM (Wang and Jiang, 2017), BiDAF (Seo et al., 2016), AoA Reader (Cui et al., 2017), DCN (Xiong et al., 2017) and R-Net (Wang et al., 2017). This paper hopes to advance MRC even further with the release of DuReader, challenging the community to deal with more realistic data sources, more types of questions and larger scale, as illustrated in Tables 1-4. Table 1 highlights DuReader's advantages over previous datasets in terms of data sources and scale. Tables 2-4 highlight DuReader's advantages in the range of questions.

[1] Zhidao (https://zhidao.baidu.com) is the largest Chinese community-based question answering (CQA) site in the world.
[2] http://ai.baidu.com/broad/download?dataset=dureader
[3] https://github.com/baidu/DuReader
Ideally, a good dataset should be based on questions from real applications. However, many existing datasets have been forced to make various compromises, such as: (1) cloze task: data are synthesized by removing a keyword from a text, and the task is to fill in the missing keyword (Hermann et al., 2015; Cui et al., 2016; Hill et al., 2015). (2) multiple-choice exams: Richardson et al. (2013) collect both fictional stories and the corresponding multiple-choice questions by crowdsourcing, while Lai et al. (2017) collect multiple-choice questions from English exams. (3) crowdsourcing: Turkers are given documents (e.g., articles from the news and/or Wikipedia) and are asked to construct questions after reading the documents (Trischler et al., 2017; Rajpurkar et al., 2016; Kočiský et al., 2017).
These limitations have motivated the construction of datasets based on queries that real users submitted to real search engines. MS-MARCO (Nguyen et al., 2016) is based on Bing logs (in English), and DuReader (this paper) is based on the logs of Baidu Search (in Chinese). Besides question sources, DuReader complements MS-MARCO and other datasets in the following ways:
question types: DuReader contains a richer in-
arXiv:1711.05073v4 [cs.CL] 11 Jun 2018