
a rule-based agent is employed to warm-start the system
[111]. Then, supervised learning is conducted on the actions
generated by the rules. In an online shopping scenario, if
the dialogue state is “Recommendation”, the “Recommendation”
action is triggered and the system retrieves products from
the product database; if the state is “Comparison”, the
system compares the target products/brands [111]. The dialogue
policy can be further trained end-to-end with reinforcement
learning, guiding the system's decisions toward the final
performance measure. [14] applied deep reinforcement learning
to strategic conversation, simultaneously learning the feature
representation and the dialogue policy; the system outperformed
several baselines, including random, rule-based, and
supervised methods.
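The warm-start stage described above can be sketched as a simple state-to-action lookup whose outputs become supervised training data. This is a minimal illustration; the state and action names are hypothetical, not taken from [111]:

```python
# Minimal sketch of a rule-based dialogue policy used to warm-start
# learning: each tracked dialogue state deterministically triggers a
# system action. State/action names here are illustrative only.

RULES = {
    "Recommendation": "retrieve_products",   # query the product database
    "Comparison": "compare_products",        # compare target products/brands
    "Greeting": "greet_user",
}

def rule_based_policy(dialogue_state):
    """Map a dialogue state to a system action; fall back to a clarifying question."""
    return RULES.get(dialogue_state, "ask_clarification")

# The (state, action) pairs generated by these rules can then serve as
# supervised training data for a learned policy.
warm_start_data = [(s, rule_based_policy(s))
                   for s in ["Recommendation", "Comparison", "Checkout"]]
```

The learned policy trained on such pairs can then be refined with reinforcement learning, as described above.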
2.1.4 Natural Language Generation
The natural language generation component converts an abstract
dialogue action into a natural language surface utterance. As
noted in [78], a good generator usually relies on several
factors: adequacy, fluency, readability, and variation.
Conventional approaches to NLG typically perform sentence
planning, which maps the input semantic symbols into an
intermediate form representing the utterance, such as a
tree-like or template structure, and then converts the
intermediate structure into the final response through surface
realization [90; 79].
[94] and [95] introduced neural network (NN) based approaches
to NLG, using an LSTM-based structure similar to the RNNLM
[52]. The dialogue act type and its slot-value pairs are
transformed into a 1-hot control vector that is given as
additional input, which ensures that the generated utterance
conveys the intended meaning. [94] used a forward RNN generator
together with a CNN reranker and a backward RNN reranker; all
the sub-modules are jointly optimized to generate utterances
conditioned on the required dialogue act.
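The control-vector encoding can be sketched as follows: a binary vector concatenating a one-hot act-type indicator with per-slot presence indicators, in the spirit of [94; 95]. The act and slot inventory here is a made-up example:

```python
# Sketch of encoding a dialogue act and its slot-value pairs as a binary
# control vector that conditions an RNN generator, in the spirit of
# [94; 95]. The act/slot inventory is illustrative.

ACT_TYPES = ["inform", "request", "confirm"]
SLOTS = ["name", "food", "area", "price"]

def encode_dialogue_act(act_type, filled_slots):
    """Return a binary control vector: [one-hot act type | slot indicators]."""
    act_part = [1 if a == act_type else 0 for a in ACT_TYPES]
    slot_part = [1 if s in filled_slots else 0 for s in SLOTS]
    return act_part + slot_part

vec = encode_dialogue_act("inform", {"name": "Golden Wok", "area": "centre"})
# vec == [1, 0, 0, 1, 0, 1, 0]
```

In the cited systems this vector is fed to the generator at each decoding step (and gated, as described next) so that mentioned slots are neither omitted nor duplicated.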
To address the slot-information omission and duplication
problems in surface realization, [95] used an additional
control cell to gate the dialogue act. [83] extended this
approach by gating the input token vector of the LSTM with the
dialogue act, and it was later extended to the multi-domain
setting via multiple adaptation steps [96]. [123] adopted an
encoder-decoder LSTM-based structure that incorporates the
question information, semantic slot values, and dialogue act
type to generate correct answers. It uses an attention
mechanism to attend to the key information conditioned on the
current decoding state of the decoder. By encoding the
dialogue act type embedding, the neural network-based model is
able to generate variant answers in response to different act
types. [20] also presented a natural language generator based
on the sequence-to-sequence approach that can be trained to
produce natural language strings as well as deep-syntax
dependency trees from input dialogue acts. It was then
extended with the preceding user utterance and responses [19],
enabling the model to entrain to (adapt to) users' ways of
speaking and thereby provide contextually appropriate
responses.
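The attention step used by models such as [123] can be sketched as generic dot-product attention: the decoder state scores each encoded key-information vector, and a softmax over the scores weights their sum. This is a standard illustration with made-up dimensions, not the authors' exact formulation:

```python
import math

# Generic dot-product attention sketch: the decoder state attends over
# encoded key-information vectors (e.g. semantic slot encodings).
# Dimensions and values are made up.

def softmax(xs):
    m = max(xs)                                  # subtract max for stability
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def attend(decoder_state, encodings):
    """Return the attention-weighted sum of the encodings."""
    scores = [sum(d * e for d, e in zip(decoder_state, enc))
              for enc in encodings]
    weights = softmax(scores)
    dim = len(encodings[0])
    return [sum(w * enc[i] for w, enc in zip(weights, encodings))
            for i in range(dim)]

context = attend([1.0, 0.0], [[2.0, 0.0], [0.0, 2.0]])
# weights favour the first encoding, so context leans toward [2.0, 0.0]
```

The resulting context vector changes with the decoding state, letting the generator focus on different slots at different positions in the output.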
2.2 End-to-End Methods
Traditional task-oriented dialogue systems involve a lot of
domain-specific handcrafting, which makes them difficult to
adapt to new domains [7]. [120] further noted that the
conventional pipeline of task-oriented dialogue systems has
two main limitations. One is the credit assignment problem:
the end user's feedback is hard to propagate back to each
upstream module. The second is process interdependence: the
input of one component depends on the output of another, so
when one component is adapted to a new environment or
retrained with new data, all the other components must be
adapted accordingly to preserve global optimality, and slots
and features might change as well. This process requires
significant human effort.
With the advance of end-to-end neural generative models in
recent years, many attempts have been made to construct an
end-to-end trainable framework for task-oriented dialogue
systems. Note that more details about neural generative models
will be discussed when we introduce non-task-oriented systems.
Instead of the traditional pipeline, the end-to-end model uses
a single module and interacts with structured external
databases. [97] and [7] introduced a network-based end-to-end
trainable task-oriented dialogue system, which treated dialogue
system learning as the problem of learning a mapping from
dialogue histories to system responses, and applied an
encoder-decoder model to train the whole system. However, the
system is trained in a supervised fashion: not only does it
require a lot of training data, but it may also fail to find a
good policy robustly due to the lack of exploration of dialogue
control in the training data. [120] first presented an
end-to-end reinforcement learning approach that jointly trains
dialogue state tracking and policy learning in the dialogue
manager in order to optimize the system's actions more
robustly. In the conversation, the agent asks the user a series
of Yes/No questions to find the correct answer; this approach
was shown to be promising when applied to the task-oriented
dialogue problem of guessing the famous person a user is
thinking of. [45] trained the end-to-end system as a
task-completion neural dialogue system, whose final goal is to
complete a task such as movie-ticket booking.
Task-oriented systems usually need to query an outside
knowledge base. Previous systems achieved this by issuing a
symbolic query to the knowledge base to retrieve entries based
on their attributes: semantic parsing is performed on the input
to construct a symbolic query representing the agent's beliefs
about the user goal [97; 103; 45]. This approach has two
drawbacks: (1) the retrieved results carry no information about
the uncertainty in semantic parsing, and (2) the retrieval
operation is non-differentiable, so the parser and dialogue
policy must be trained separately. This makes online end-to-end
learning from user feedback difficult once the system is
deployed. [21] augmented existing recurrent network
architectures with a differentiable attention-based key-value
retrieval mechanism over the entries of a knowledge base,
inspired by key-value memory networks [54]. [18] replaced
symbolic queries with an induced “soft” posterior distribution
over the knowledge base that indicates which entities the user
is interested in, and integrated this soft retrieval process
with a reinforcement learner. [102] combined an RNN with
domain-specific knowledge encoded as software and system action
templates.
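The contrast between a hard symbolic query and a differentiable “soft” lookup can be illustrated by scoring every knowledge-base entry against the agent's slot beliefs and normalizing with a softmax. This is a toy sketch in the spirit of [18; 21]; the KB contents and the simple expected-match scoring rule are illustrative, not the papers' learned models:

```python
import math

# Toy sketch of differentiable "soft" KB retrieval: rather than issuing
# a hard symbolic query, score every knowledge-base entry against the
# agent's (possibly uncertain) slot beliefs and normalize the scores
# into a posterior distribution over entries. KB contents are made up.

KB = [
    {"name": "Golden Wok", "food": "chinese", "area": "centre"},
    {"name": "Pizza Hut",  "food": "italian", "area": "south"},
    {"name": "Taj Mahal",  "food": "indian",  "area": "centre"},
]

def soft_lookup(beliefs, kb=KB, temperature=1.0):
    """Return a softmax distribution over KB entries given slot beliefs.

    `beliefs` maps each slot to a distribution over values, so parsing
    uncertainty flows into retrieval instead of being discarded by a
    hard query.
    """
    scores = []
    for entry in kb:
        # Expected match: probability mass the beliefs put on this entry's values.
        scores.append(sum(dist.get(entry.get(slot), 0.0)
                          for slot, dist in beliefs.items()))
    exps = [math.exp(s / temperature) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

posterior = soft_lookup({"food": {"chinese": 0.7, "indian": 0.3},
                         "area": {"centre": 1.0}})
# Most mass falls on "Golden Wok"; "Taj Mahal" is second.
```

Because every step is smooth in the belief values, gradients can flow from downstream dialogue decisions back through the retrieval into the parser, which is precisely what the hard-query pipeline prevents.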
3. NON-TASK-ORIENTED DIALOGUE SYSTEM
Unlike task-oriented dialogue systems, which aim to complete
specific tasks for users, non-task-oriented dialogue systems
(also known as chatbots) focus on conversing with hu-