Deep Reinforcement Learning for Dialogue Generation
Jiwei Li¹, Will Monroe¹, Alan Ritter² and Dan Jurafsky¹
¹Dept of Computer Science, Stanford University
²Dept of Computer Science and Engineering, Ohio State University
{jiweil,wmonroe4,jurafsky}@stanford.edu, ritter.1492@osu.edu
Abstract
Recent neural models of dialogue generation offer great promise for generating responses for conversational agents, but tend to be short-sighted, predicting utterances one at a time while ignoring their influence on future outcomes. Modeling the future direction of a dialogue is crucial to generating coherent, interesting dialogues, a need which led traditional NLP models of dialogue to draw on reinforcement learning. In this paper, we show how to integrate these goals, applying deep reinforcement learning to model future reward in chatbot dialogue. The model simulates dialogues between two virtual agents, using policy gradient methods to reward sequences that display three useful conversational properties: informativity (non-repetitive turns), coherence, and ease of answering (related to forward-looking function). We evaluate our model on diversity and dialogue length, as well as with human judges, showing that the proposed algorithm generates more interactive responses and fosters a more sustained conversation in dialogue simulation. This work marks a first step toward learning a neural conversational model based on the long-term success of dialogues.
1 Introduction
Neural response generation (Li et al., 2015; Vinyals and Le, 2015; Luan et al., 2016; Wen et al., 2015; Shang et al., 2015; Yao et al., 2015; Xu et al., 2016; Wen et al., 2016; Li et al., 2016) is of growing interest. The LSTM sequence-to-sequence (SEQ2SEQ) model (Sutskever et al., 2014) is one type of neural generation model that maximizes the probability of generating a response given the previous dialogue turn. This approach enables the incorporation of rich context when mapping between consecutive dialogue turns (Sordoni et al., 2015) in a way not possible, for example, with MT-based dialogue models (Ritter et al., 2011).
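The training criterion just described — maximize the probability of a response given the previous turn — reduces to summing token-level log-probabilities under the decoder. A minimal sketch follows; the stand-in probability model and all token probabilities here are invented for illustration and are not the authors' implementation:

```python
import math

def toy_next_token_probs(context_tokens):
    # Stand-in for an LSTM decoder's softmax over the vocabulary.
    # Fixed, invented probabilities: generic tokens score highly
    # regardless of the context (a point returned to below).
    return {"i": 0.4, "don't": 0.2, "know": 0.2, "i'm": 0.1, "</s>": 0.1}

def mle_loss(source_turn, target_turn):
    """Negative log-likelihood of the target response, token by token:
    the SEQ2SEQ objective maximizes p(target | source)."""
    context = list(source_turn)
    nll = 0.0
    for token in target_turn:
        probs = toy_next_token_probs(context)
        nll -= math.log(probs.get(token, 1e-10))
        context.append(token)  # teacher forcing: condition on gold tokens
    return nll

loss = mle_loss(["how", "old", "are", "you", "?"], ["i", "don't", "know"])
```

Training drives this loss down over all (context, response) pairs in the corpus, with no term that looks beyond the current turn.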
Despite the success of SEQ2SEQ models in dialogue generation, two problems emerge. First, SEQ2SEQ models are trained by predicting the next dialogue turn in a given conversational context using the maximum-likelihood estimation (MLE) objective function. However, it is not clear how well MLE approximates the real-world goal of chatbot development: teaching a machine to converse with humans, while providing interesting, diverse, and informative feedback that keeps users engaged. One concrete example is that SEQ2SEQ models tend to generate highly generic responses such as "I don't know" regardless of the input (Sordoni et al., 2015; Serban et al., 2015b; Serban et al., 2015c; Li et al., 2015). This can be ascribed to the high frequency of generic responses found in the training set and their compatibility with a diverse range of conversational contexts. Yet "I don't know" is clearly not a good action to take, since it closes the conversation down.
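The frequency-based explanation above can be made concrete with a toy example: when a generic reply is the most common response in many different contexts, greedy decoding under an MLE-trained model will select it in each of them. The training pairs and counts below are invented for this sketch:

```python
from collections import Counter, defaultdict

# Invented (context, response) training pairs: the generic reply
# recurs across contexts, so it dominates each context's counts.
training_pairs = [
    ("how old are you ?", "i don't know"),
    ("how old are you ?", "i don't know"),
    ("how old are you ?", "i'm 16"),
    ("where are you from ?", "i don't know"),
    ("where are you from ?", "i don't know"),
    ("where are you from ?", "england"),
]

by_context = defaultdict(Counter)
for context, response in training_pairs:
    by_context[context][response] += 1

# Greedy decoding under MLE approximates picking the most frequent
# response per context -- the generic reply wins everywhere.
mle_pick = {ctx: counts.most_common(1)[0][0]
            for ctx, counts in by_context.items()}
```

Nothing in the MLE objective penalizes this outcome; a reward that values keeping the conversation going is needed to rule it out.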
Another common problem, illustrated in Table 1 (the example in the bottom left), is when the system becomes stuck in an infinite loop of repetitive responses. This is due to MLE-based SEQ2SEQ models' inability to account for repetition. In example 2, the dialogue falls into an infinite loop after three turns, with both agents generating dull, generic utterances like i don't know what you are talking about and you don't know what you are saying. Looking at the entire conversation, utterance (2) i'm 16 turns out to be a bad action to take. While it is an informative and coherent response to utterance (1) asking about age, it offers no way of continuing the conversation.¹
¹A similar rule is often suggested in improvisational comedy: https://en.wikipedia.org/wiki/Yes,_and...
arXiv:1606.01541v1 [cs.CL] 5 Jun 2016