Though SeqGAN (Yu et al., 2017) proposed using rollout to obtain a reward for each generated word, the variance of these rewards is typically too high to be useful in practice, and the computational cost of rollout can also be prohibitive. Below we describe how to use the proposed guider network to define intermediate rewards, leading to a definition of a feature-matching reward.
Feature-Matching Rewards
We first define an intermediate reward for generating a particular word. The idea is to match the ground-truth features from the CNN encoder in Figure 1 with those generated from the guider network. Equation (8) indicates that the further the generated feature is from the true feature, the smaller the reward should be. To this end, for each time $t$, we define the intermediate reward for generating the current word as
$$ r^{g}_{t} = \frac{1}{2c}\sum_{i=1}^{c}\Big( D_{\cos}\big(f_t, \hat{f}_t\big) + D_{\cos}\big(f_t - f_{t-i},\, \hat{f}_t - f_{t-i}\big) \Big)\,, $$
where $\hat{f}_t = G_{\psi}(s^{G}_{t-c-1}, f_{t-c})$ is the predicted feature. Intuitively, $f_t - f_{t-i}$ measures the difference between the generated sentences in feature space; the reward is high if it matches the predicted feature transition $\hat{f}_t - f_{t-i}$ from the guider network.
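To make the computation concrete, below is a minimal PyTorch sketch of this intermediate reward, assuming the per-step CNN features $f_1,\dots,f_t$ and the guider prediction $\hat{f}_t$ are already available as tensors and that $D_{\cos}$ denotes cosine similarity; the function and variable names are illustrative, not from the paper.

```python
import torch
import torch.nn.functional as F

def feature_matching_reward(f, f_hat_t, t, c):
    """Sketch of r^g_t = 1/(2c) * sum_{i=1..c} [cos(f_t, f_hat_t)
                                               + cos(f_t - f_{t-i}, f_hat_t - f_{t-i})].

    f:        tensor of shape [T, d] (or list of [d] tensors) of CNN features
    f_hat_t:  tensor of shape [d], guider prediction G_psi(s^G_{t-c-1}, f_{t-c})
    Assumes t >= c so that f[t - i] is always defined.
    """
    f_t = f[t]
    reward = 0.0
    for i in range(1, c + 1):
        sim_abs = F.cosine_similarity(f_t, f_hat_t, dim=-1)           # D_cos(f_t, f̂_t)
        sim_delta = F.cosine_similarity(f_t - f[t - i],                # actual feature transition
                                        f_hat_t - f[t - i], dim=-1)    # vs. predicted transition
        reward = reward + sim_abs + sim_delta
    return reward / (2 * c)
```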
At the last step of text generation, i.e., $t = T$, the corresponding reward measures the quality of the whole generated sentence, and is thus called the final reward. The final reward is defined differently from the intermediate reward, as discussed below for both the unconditional- and conditional-generation cases.
Note that a token generated at time $t$ will influence not only the rewards received at that time but also the rewards at subsequent time steps. Thus we propose to define the cumulative reward $\sum_{i=t}^{T}\gamma^{i} r^{g}_{i}$, with $\gamma$ a discount factor, as a feature-matching reward. Intuitively, this encourages the generator to focus on achieving higher long-term rewards. Finally, in order to apply policy gradient to update the generator, we combine the feature-matching reward with the problem-specific final reward, to form a Q-value reward specified below.
Similar to SeqGAN, the final reward is defined as the output of a discriminator that evaluates the quality of the whole generated sentence, i.e., the smaller the output, the less likely the generation is a true sentence. As a result, we combine the adversarial reward $r^{f} \in [0, 1]$ given by the discriminator (Yu et al., 2017) with the guider-matching rewards to define a Q-value reward as $Q_t = \big(\sum_{i=t}^{T}\gamma^{i} r^{g}_{i}\big) \times r^{f}$.

Algorithm 1 Model-based Imitation Learning for Text Generation
Require: generator policy $\pi_{\phi}$; guider network $G_{\psi}$; a sequence dataset $\{X_{1\ldots T}\}$ generated by some expert policy.
1: Initialize $G_{\psi}$, $D_{\theta}$ with random weights.
2: while Imitation Learning phase do
3:     Update generator $\pi_{\phi}$ and guider $G_{\psi}$ with the MLE loss.
4: end while
5: while Reinforcement Learning phase do
6:     Generate a sequence $Y_{1\ldots T} \sim \pi_{\phi}$.
7:     Compute $Q_t$ and update $\pi_{\phi}$.
8: end while
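Returning to the Q-value reward defined above, the sketch below computes $Q_t$ for every step of a sampled sequence from the per-step feature-matching rewards and the discriminator score. The discount value and tensor names are assumptions for illustration only.

```python
import torch

def q_value_rewards(r_g, r_f, gamma=0.95):
    """Sketch of Q_t = (sum_{i=t}^{T} gamma^i * r^g_i) * r^f for all t.

    r_g:   tensor of shape [T] with intermediate rewards r^g_1..r^g_T
    r_f:   scalar adversarial reward in [0, 1] from the discriminator
    Note: indexing here is zero-based, so gamma^i runs over i = 0..T-1.
    """
    T = r_g.shape[0]
    discounts = gamma ** torch.arange(T, dtype=r_g.dtype)   # gamma^i
    discounted = discounts * r_g                             # gamma^i * r^g_i
    # Reverse cumulative sum gives sum_{i=t}^{T} gamma^i * r^g_i for every t.
    tail_sums = torch.flip(torch.cumsum(torch.flip(discounted, [0]), dim=0), [0])
    return tail_sums * r_f
```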
Generator Optimization
The generator is initialized by pre-training on sentences with an autoencoder structure, based on MLE training. After that, the Q-value reward $Q_t$ is used as the reward at each time $t$, with standard policy-gradient optimization methods used to update the generator. Specifically, the policy gradient is
$$ \nabla_{\phi} J = \mathbb{E}_{(s_{t-1}, y_t)\sim \rho_{\pi}}\big[ Q_t \, \nabla_{\phi} \log p(y_t \mid s_{t-1}; \phi, \varphi) \big]\,, $$
$$ \nabla_{\varphi} J = \mathbb{E}_{(s_{t-1}, y_t)\sim \rho_{\pi}}\big[ Q_t \, \nabla_{\varphi} \log p(y_t \mid s_{t-1}; \phi, \varphi) \big]\,, $$
where $p(y_t \mid s_{t-1}; \phi, \varphi)$ is the probability of generating $y_t$ given $s_{t-1}$ in the generator. Algorithm 1 describes the proposed model-based imitation learning framework for text generation.
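For reference, here is a minimal sketch of one such policy-gradient update, assuming the generator has already produced a sequence and exposed the per-step log-probabilities $\log p(y_t \mid s_{t-1}; \phi, \varphi)$ as a differentiable tensor; the function and variable names are illustrative.

```python
import torch

def policy_gradient_step(log_probs, q_values, optimizer):
    """REINFORCE-style update: ascend E[Q_t * grad log p(y_t | s_{t-1})].

    log_probs: tensor [T] of log p(y_t | s_{t-1}; phi, varphi) for a sampled sequence
    q_values:  tensor [T] of Q-value rewards Q_t (treated as constants)
    optimizer: optimizer over the generator parameters (phi, varphi)
    """
    loss = -(q_values.detach() * log_probs).sum()   # negate: the optimizer minimizes
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

Detaching `q_values` reflects that $Q_t$ acts as a fixed reward signal in the policy gradient, so gradients flow only through the log-probabilities.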
Model-based or Model-free
Text generation seeks to generate the next word (action) given the current (sub-)sentence (state). The generator is regarded as an agent that learns a policy to predict the next word given its current state. In previous work (Ranzato et al., 2016), a metric reward is given and the generator is trained solely to maximize this metric reward by trial and error; this is model-free learning. In the proposed method, the guider network models the environment dynamics and is trained by minimizing a cosine-distance loss between its predictions and the ground-truth features on real text. For generator training, the generator maximizes a reward determined by both the metric and the guider network, and thus performs model-free learning with model-based boosting (Gu et al., 2016). A model-predictive-control scheme is also incorporated in our method, where the guider network helps select the next word at each time step.
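As a small illustration of the guider objective described above, the sketch below minimizes a cosine-distance loss between the guider's predicted feature and the ground-truth CNN feature on real text; the exact loss form and tensor names are assumptions, since this section only specifies that the training is cosine-similarity based.

```python
import torch
import torch.nn.functional as F

def guider_loss(f_hat, f_true):
    """Cosine-distance loss 1 - cos(f_hat, f_true), averaged over the batch.

    f_hat:  guider prediction G_psi(s^G, f), shape [batch, d]
    f_true: ground-truth CNN feature from real text, shape [batch, d]
    """
    return (1.0 - F.cosine_similarity(f_hat, f_true, dim=-1)).mean()
```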