Reinforcing Coherence for Sequence to Sequence Model in Dialogue Generation
Hainan Zhang, Yanyan Lan, Jiafeng Guo, Jun Xu and Xueqi Cheng
University of Chinese Academy of Sciences, Beijing, China
CAS Key Lab of Network Data Science and Technology,
Institute of Computing Technology, Chinese Academy of Sciences
zhanghainan@software.ict.ac.cn, {lanyanyan, guojiafeng, junxu, cxq}@ict.ac.cn
Abstract
The sequence to sequence (Seq2Seq) approach has
gained great attention in the field of single-turn
dialogue generation. However, one serious problem
is that most existing Seq2Seq-based models
tend to generate common responses that lack specific
meaning. Our analysis shows that the underlying
reason is that Seq2Seq is equivalent to optimizing
the Kullback–Leibler (KL) divergence, and thus does
not penalize the case in which the generated probability
is high while the true probability is low. However,
the true probability is unknown, which poses
challenges for tackling this problem. Inspired by
the fact that the coherence (i.e. similarity) between
post and response is consistent with human eval-
uation, we hypothesize that the true probability of
a response is proportional to the coherence degree.
The coherence scores are then used as the reward
function in a reinforcement learning framework to
penalize the case in which the generated probability is
high while the true probability is low. Three dif-
ferent types of coherence models, including an un-
learned similarity function, a pretrained semantic
matching function, and an end-to-end dual learn-
ing architecture, are proposed in this paper. Ex-
perimental results on both Chinese Weibo dataset
and English Subtitle dataset show that the pro-
posed models produce more specific and meaning-
ful responses, yielding better performance than
Seq2Seq models in terms of both metric-based and
human evaluations.
1 Introduction
This paper focuses on the problem of single-turn dialogue
generation, which is expected to automatically generate an
appropriate response for a given post. Following the
conventional data-driven generation framework of statistical
machine translation, most existing neural conversation models
are based on a Seq2Seq architecture [Sutskever et al., 2014].
In these models, a recurrent neural network (RNN) encoder
first encodes the input post into a vector, and an RNN decoder
then generates the response. To learn the model parameters,
maximum likelihood estimation (MLE) is applied to the training data,
which consists of many post-response pairs. The intrinsic philosophy
is that, with proper parameters, the generated probability serves
as an estimate of the true probability.
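For concreteness, this objective can be sketched as follows (the notation below is ours, chosen for illustration, and is not taken from the paper): given a set of post-response pairs $\mathcal{D}$, MLE fits the generated probability $Q_\theta$ by solving
\[
\max_\theta \sum_{(x,y)\in\mathcal{D}} \log Q_\theta(y \mid x)
= \max_\theta \sum_{(x,y)\in\mathcal{D}} \sum_{t=1}^{|y|} \log Q_\theta(y_t \mid y_{<t}, x),
\]
where $x$ is the post, $y$ the response, and $y_{<t}$ the previously generated tokens.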
Though Seq2Seq has the ability to generate fluent re-
sponses, one serious problem is that the generated responses
are usually common, such as ‘I do not know’, ‘What does
this mean?’ and ‘Haha’ [Li et al., 2016a; Mou et al., 2017].
Clearly, these kinds of responses lack the specific meaning
needed to widen and deepen the dialogue, which harms the
users’ experience. Our analysis shows that the main reason is
that the objective of Seq2Seq is equivalent to minimizing the
KL divergence between the generated probability and the true
probability. However, the KL divergence is not symmetric, and
thus it does not penalize the case in which the generated
probability is high while the true probability is low,
which is exactly the case of common responses.
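To spell out this step (a standard argument, using the notation introduced above): with $P$ the true probability and $Q_\theta$ the generated probability, MLE is equivalent, up to the entropy of $P$ (a constant in $\theta$), to minimizing
\[
\mathrm{KL}(P \,\|\, Q_\theta) = \sum_{y} P(y \mid x) \log \frac{P(y \mid x)}{Q_\theta(y \mid x)}.
\]
Every term is weighted by $P(y \mid x)$, so a common response $y$ with low true probability contributes almost nothing to this objective even when $Q_\theta(y \mid x)$ is large; only the reverse direction $\mathrm{KL}(Q_\theta \,\|\, P)$ would penalize such misplaced probability mass.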
In this paper, we propose to utilize the coherence (i.e.
similarity) between the generated response and the original
post as an estimate of the true probability, inspired by the
fact that the similarity between post and response embeddings
is consistent with human evaluation. Specifically, three kinds
of coherence models are
adopted in this paper. Firstly, an unlearned similarity func-
tion, such as cosine similarity, can be directly used as the co-
herence model. Secondly, previous semantic text matching
models can be regarded as good candidates for measuring
the coherence between a post and its corresponding response.
In this paper, we use two pretrained matching functions,
i.e., the GRU bilinear model [Socher et al., 2013] and
MatchPyramid [Pang et al., 2016], which are representatives
of two different kinds of deep matching models, i.e.,
representation-focused and interaction-focused methods.
Thirdly, an end-to-end dual learning architecture similar to
[Xia et al., 2016] can be adopted to jointly learn the parameters
of the response generation model and the coherence model. After
that, the coherence model is used as the reward function in
a reinforcement learning framework for optimization, which
guides the learning process to penalize the case in which the
generated probability is high while the true probability is low.
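As an illustration of the simplest variant (the unlearned cosine-similarity coherence model used as a REINFORCE-style reward), the following is a minimal sketch; the embedding averaging, the function names, and the use of a sampled response with a baseline are our own assumptions for illustration, not the paper's exact formulation:

```python
import numpy as np

def sentence_embedding(tokens, word_vectors):
    """Average word vectors to obtain a sentence embedding (illustrative choice)."""
    vecs = [word_vectors[w] for w in tokens if w in word_vectors]
    if not vecs:  # no known words: return a zero vector of the right size
        return np.zeros(next(iter(word_vectors.values())).shape)
    return np.mean(vecs, axis=0)

def coherence_reward(post_tokens, response_tokens, word_vectors):
    """Unlearned coherence model: cosine similarity between post and response embeddings."""
    p = sentence_embedding(post_tokens, word_vectors)
    r = sentence_embedding(response_tokens, word_vectors)
    denom = np.linalg.norm(p) * np.linalg.norm(r)
    return float(np.dot(p, r) / denom) if denom > 0 else 0.0

def reinforce_loss(log_probs, reward, baseline=0.0):
    """REINFORCE-style objective: scale the sampled response's total log-likelihood
    by (reward - baseline); minimizing this pushes probability mass toward
    responses that are coherent with the post."""
    return -(reward - baseline) * sum(log_probs)
```

In training, a response would be sampled from the Seq2Seq decoder, its per-token log-probabilities collected as `log_probs`, and the gradient of this loss backpropagated through the decoder; a pretrained matching model (GRU bilinear or MatchPyramid) or the dual-learning coherence model would simply take the place of `coherence_reward`.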
We evaluate the proposed models on two public datasets,
i.e. the Chinese Weibo and the English Subtitle dataset. Ex-
perimental results show that our models significantly outper-