The Ubuntu Dialogue Corpus: A Large Dataset for Research in
Unstructured Multi-Turn Dialogue Systems
Ryan Lowe∗*, Nissan Pow*, Iulian V. Serban† and Joelle Pineau*
*School of Computer Science, McGill University, Montreal, Canada
†Department of Computer Science and Operations Research, Université de Montréal, Montreal, Canada
Abstract
This paper introduces the Ubuntu Dialogue Corpus, a dataset containing almost 1 million multi-turn dialogues, with a total of over 7 million utterances and 100 million words. This provides a unique resource for research into building dialogue managers based on neural language models that can make use of large amounts of unlabeled data. The dataset has both the multi-turn property of conversations in the Dialog State Tracking Challenge datasets, and the unstructured nature of interactions from microblog services such as Twitter. We also describe two neural learning architectures suitable for analyzing this dataset, and provide benchmark performance on the task of selecting the best next response.
1 Introduction
The ability of a computer to converse in a natural and coherent manner with a human has long been held as one of the primary objectives of artificial intelligence (AI). In this paper we consider the problem of building dialogue agents that have the ability to interact in one-on-one multi-turn conversations on a diverse set of topics. We primarily target unstructured dialogues, where there is no a priori logical representation for the information exchanged during the conversation. This is in contrast to recent systems which focus on structured dialogue tasks, using a slot-filling representation [10, 27, 32].
We observe that in several subfields of AI (computer vision, speech recognition, machine translation), fundamental breakthroughs were achieved in recent years using machine learning methods, more specifically with neural architectures [1]; however, it is worth noting that many of the most successful approaches, in particular convolutional and recurrent neural networks, were known for many years prior. It is therefore reasonable to attribute this progress to three major factors: 1) the public distribution of very large, rich datasets [5], 2) the availability of substantial computing power, and 3) the development of new training methods for neural architectures, in particular leveraging unlabeled data. Similar progress has not yet been observed in the development of dialogue systems. We hypothesize that this is due to the lack of sufficiently large datasets, and aim to overcome this barrier by providing a new large corpus for research in multi-turn conversation.
∗ The first two authors contributed equally.
The new Ubuntu Dialogue Corpus consists of almost one million two-person conversations extracted from the Ubuntu chat logs¹, which are used to receive technical support for various Ubuntu-related problems. The conversations have an average of 8 turns each, with a minimum of 3 turns. All conversations are carried out in text form (not audio). The dataset is orders of magnitude larger than structured corpora such as those of the Dialog State Tracking Challenge [32]. It is on the same scale as recent datasets for solving problems such as question answering and analysis of microblog services, such as Twitter [22, 25, 28, 33], but each conversation in our dataset includes several more turns, as well as longer utterances. Furthermore, because it targets a specific domain, namely technical support, it can be used as a case study for the development of AI agents in targeted applications, in contrast to chatbot agents that often lack a well-defined goal [26].
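As a rough illustration of the corpus properties described above (two-person conversations, at least 3 turns, and summary counts of dialogues, utterances and words), the following minimal sketch filters and summarizes a collection of dialogues. The in-memory representation (a list of (speaker, utterance) pairs), the function names, and the treatment of each utterance as one turn are assumptions made for illustration only; they are not the released file format or the extraction procedure used to build the corpus.

```python
from statistics import mean
from typing import Dict, List, Tuple

# Hypothetical representation: a dialogue is a list of (speaker, utterance)
# pairs. This is an assumption for illustration, not the corpus file format.
Dialogue = List[Tuple[str, str]]

MIN_TURNS = 3  # minimum dialogue length stated in the text


def is_two_person(dialogue: Dialogue) -> bool:
    """True if exactly two distinct speakers take part in the dialogue."""
    return len({speaker for speaker, _ in dialogue}) == 2


def keep(dialogue: Dialogue) -> bool:
    """Keep two-person dialogues with at least MIN_TURNS turns.

    Assumption: each (speaker, utterance) pair counts as one turn.
    """
    return is_two_person(dialogue) and len(dialogue) >= MIN_TURNS


def summarize(dialogues: List[Dialogue]) -> Dict[str, float]:
    """Count dialogues, utterances and whitespace-separated words."""
    kept = [d for d in dialogues if keep(d)]
    return {
        "dialogues": len(kept),
        "utterances": sum(len(d) for d in kept),
        "words": sum(len(text.split()) for d in kept for _, text in d),
        "mean_turns": mean(len(d) for d in kept) if kept else 0.0,
    }
```

Calling summarize on a list of such dialogues returns the counts for the conversations that pass the filter; word counts here use plain whitespace splitting, which is only a crude stand-in for a real tokenizer.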
In addition to the corpus, we present learning
architectures suitable for analyzing this dataset,
ranging from the simple frequency-inverse docu-
¹ These logs are available from 2004 to 2015 at http://irclogs.ubuntu.com/
arXiv:1506.08909v3 [cs.CL] 4 Feb 2016