The Ubuntu Dialogue Corpus: A Large-Scale Resource for Multi-Turn Dialogue Research


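"The Ubuntu Dialogue Corpus is a large dataset for research on unstructured multi-turn dialogue systems. It contains almost 1 million dialogues, with a total of over 7 million utterances and 100 million words. The dataset is a unique resource for building dialogue managers based on neural language models that can make use of large amounts of unlabeled data. It combines the multi-turn property of the Dialog State Tracking Challenge datasets with the unstructured nature of interactions on microblog services such as Twitter. The paper also describes two neural learning architectures suitable for analyzing the dataset, and provides benchmark performance on the task of selecting the best next response."

In natural language processing, dialogue systems are an important branch of artificial intelligence concerned with letting computers converse with humans in a natural and coherent way. The corpus stands out for its scale and its large number of multi-turn conversations, which let researchers train and test dialogue managers on real, unstructured dialogue. Traditional dialogue systems usually rely on hand-crafted rules or statistical models, and these approaches often adapt poorly to complex, unstructured conversations. With the development of deep learning, and of neural language models in particular, data-driven approaches have become feasible.

The large amount of unlabeled data in the corpus is well suited to training such models. Through unsupervised or semi-supervised learning, a model can pick up language patterns and dialogue strategies directly from the conversations, which improves the naturalness and coherence of its responses. The two neural architectures described in the paper are recurrent (an RNN and an LSTM variant); such models handle sequential data well and capture context. In a dialogue system they are used to predict the next step of a conversation, that is, to select the most suitable response given the preceding context.

A key evaluation task is selecting the best next response: given the conversation so far, the model ranks a set of candidate responses and is judged on whether the response that actually followed ranks near the top. The benchmark results in the paper give later researchers a reference point against which to measure improved models.

In short, the Ubuntu Dialogue Corpus is a valuable resource for building more capable and adaptable dialogue systems. Its scale and unstructured nature let researchers study human-computer dialogue that is closer to real-world conditions, and the architectures and benchmarks in the paper give follow-up work a starting point.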
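The next-response selection task described above can be illustrated with a small ranking metric. The following is a minimal sketch, assuming each test example consists of a list of model scores (one per candidate response) and the index of the response that actually followed. The function names `recall_at_k` and `mean_recall_at_k` and the scores are hypothetical illustrations, not taken from the paper.

```python
from typing import List, Sequence, Tuple


def recall_at_k(scores: Sequence[float], true_index: int, k: int) -> bool:
    """Return True if the true response is among the k highest-scored candidates."""
    # Rank candidate indices from highest to lowest score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return true_index in ranked[:k]


def mean_recall_at_k(examples: List[Tuple[Sequence[float], int]], k: int) -> float:
    """Fraction of examples whose true response lands in the top k."""
    hits = [recall_at_k(scores, true_index, k) for scores, true_index in examples]
    return sum(hits) / len(hits)


# Hypothetical scores for three contexts, each with three candidate responses.
examples = [
    ([0.9, 0.1, 0.3], 0),  # true response ranked first
    ([0.2, 0.8, 0.5], 2),  # true response ranked second
    ([0.4, 0.6, 0.1], 2),  # true response ranked last
]

if __name__ == "__main__":
    for k in (1, 2, 3):
        print(f"Recall@{k}: {mean_recall_at_k(examples, k):.3f}")
```

A larger `k` is a more lenient criterion, so the metric can only stay the same or increase as `k` grows.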


The Ubuntu Dialogue Corpus: A Large Dataset for Research in Unstructured Multi-Turn Dialogue Systems

Ryan Lowe∗*, Nissan Pow*, Iulian V. Serban† and Joelle Pineau*

* School of Computer Science, McGill University, Montreal, Canada

† Department of Computer Science and Operations Research, Université de Montréal, Montreal, Canada

Abstract

This paper introduces the Ubuntu Dialogue Corpus, a dataset containing almost 1 million multi-turn dialogues, with a total of over 7 million utterances and 100 million words. This provides a unique resource for research into building dialogue managers based on neural language models that can make use of large amounts of unlabeled data. The dataset has both the multi-turn property of conversations in the Dialog State Tracking Challenge datasets, and the unstructured nature of interactions from microblog services such as Twitter. We also describe two neural learning architectures suitable for analyzing this dataset, and provide benchmark performance on the task of selecting the best next response.

1 Introduction

The ability for a computer to converse in a natural and coherent manner with a human has long been held as one of the primary objectives of artificial intelligence (AI). In this paper we consider the problem of building dialogue agents that have the ability to interact in one-on-one multi-turn conversations on a diverse set of topics. We primarily target unstructured dialogues, where there is no a priori logical representation for the information exchanged during the conversation. This is in contrast to recent systems which focus on structured dialogue tasks, using a slot-filling representation [10, 27, 32].

We observe that in several subfields of AI (computer vision, speech recognition, machine translation) fundamental breakthroughs were achieved in recent years using machine learning methods, more specifically with neural architectures [1]; however, it is worth noting that many of the most successful approaches, in particular convolutional and recurrent neural networks, were known for many years prior. It is therefore reasonable to attribute this progress to three major factors: 1) the public distribution of very large rich datasets [5], 2) the availability of substantial computing power, and 3) the development of new training methods for neural architectures, in particular leveraging unlabeled data. Similar progress has not yet been observed in the development of dialogue systems. We hypothesize that this is due to the lack of sufficiently large datasets, and aim to overcome this barrier by providing a new large corpus for research in multi-turn conversation.

∗ The first two authors contributed equally.

The new Ubuntu Dialogue Corpus consists of almost one million two-person conversations extracted from the Ubuntu chat logs¹, used to receive technical support for various Ubuntu-related problems. The conversations have an average of 8 turns each, with a minimum of 3 turns. All conversations are carried out in text form (not audio). The dataset is orders of magnitude larger than structured corpora such as those of the Dialogue State Tracking Challenge [32]. It is on the same scale as recent datasets for solving problems such as question answering and analysis of microblog services, such as Twitter [22, 25, 28, 33], but each conversation in our dataset includes several more turns, as well as longer utterances. Furthermore, because it targets a specific domain, namely technical support, it can be used as a case study for the development of AI agents in targeted applications, in contrast to chatbot agents that often lack a well-defined goal [26].
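The selection criteria stated above (two-person conversations with at least 3 turns) can be expressed as a simple filter. This is a minimal sketch under stated assumptions: a dialogue is represented as a list of `(speaker, utterance)` pairs and every pair counts as one turn. The type `Turn`, the functions `is_valid_dialogue` and `average_turns`, and the sample chat lines are all hypothetical; this is not the authors' extraction pipeline.

```python
from typing import List, Tuple

# Assumed representation: one (speaker, utterance) pair per turn.
Turn = Tuple[str, str]


def is_valid_dialogue(turns: List[Turn], min_turns: int = 3) -> bool:
    """Keep only two-person dialogues with at least `min_turns` turns."""
    speakers = {speaker for speaker, _ in turns}
    return len(speakers) == 2 and len(turns) >= min_turns


def average_turns(dialogues: List[List[Turn]]) -> float:
    """Mean number of turns per dialogue."""
    return sum(len(d) for d in dialogues) / len(dialogues)


# Hypothetical sample data, invented for illustration only.
raw_dialogues = [
    [("alice", "my wifi is not detected"),
     ("bob", "which card do you have?"),
     ("alice", "a broadcom one"),
     ("bob", "try installing the proprietary driver")],
    [("carol", "hello?"), ("dave", "hi")],  # too short: 2 turns
    [("erin", "anyone here?"),
     ("erin", "guess not"),
     ("erin", "bye")],  # only one speaker
]

kept = [d for d in raw_dialogues if is_valid_dialogue(d)]

if __name__ == "__main__":
    print(f"kept {len(kept)} of {len(raw_dialogues)} dialogues")
    print(f"average turns: {average_turns(kept):.1f}")
```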

In addition to the corpus, we present learning architectures suitable for analyzing this dataset, ranging from the simple frequency-inverse docu-

¹ These logs are available from 2004 to 2015 at http://irclogs.ubuntu.com/

arXiv:1506.08909v3 [cs.CL] 4 Feb 2016

