generated samples to improve the generator and could result
in mode collapse problems. Feature Matching (Zhang et al.
2017) provides a mechanism that matches the latent feature
distributions of real and generated sequences via a kernel-
ized discrepancy metric to alleviate the weak guidance and mode collapse problems. However, such enhancement only happens after the whole text sample has been generated, and thus the guiding signal is still sparse during training.
Reinforcement learning (RL), on the other hand, faces a similar difficulty when reward signals are sparse (Kulkarni
et al. 2016). Hierarchical RL is one of the promising tech-
niques for handling the sparse reward issue (Sutton, Precup,
and Singh 1999). A typical approach in hierarchical RL is
to manually identify the hierarchical structure for the agent
by defining several low-level sub-tasks and learning micro-
policies for each sub-task while learning a macro-policy for
choosing which sub-task to solve. Such methods can be very
effective when the hierarchical structure is known a priori
using domain knowledge in a given specific task, but fail
to flexibly adapt to other tasks. Recently, Vezhnevets et al. (2017) proposed an end-to-end framework for hierarchical RL where the sub-tasks are not identified manually but implicitly learned by a MANAGER module, which takes the current state as input and outputs a goal embedding vector to guide the low-level WORKER module.
In this work, we model the text generation procedure via
adversarial training and policy gradient (Yu et al. 2017). To
address the sparse reward issue in long text generation, we
follow (Vezhnevets et al. 2017) and propose a hierarchy de-
sign, i.e. MANAGER and WORKER, for the generator. As the
reward function in our case is a discriminative model rather
than a black box in (Vezhnevets et al. 2017), the high-level
feature extracted by the discriminator given the current gen-
erated word sequence is sent to the MANAGER module. As
such, the MANAGER module can also be viewed as a spy that leaks information from the discriminator to better guide the generator. To our knowledge, this is the first work that considers information leaking in the GAN framework for better training of the generator and combines it with hierarchical RL to address long text generation problems.
Methodology
We formalize the text generation problem as a sequen-
tial decision making process (Bachman and Precup 2015).
Specifically, at each timestep $t$, the agent takes the previously generated words as its current state, denoted as $s_t = (x_1, \ldots, x_i, \ldots, x_t)$, where $x_i$ represents a word token in the given vocabulary $V$. A $\theta$-parameterized generative net $G_\theta$, which corresponds to a stochastic policy, maps $s_t$ to a distribution over the whole vocabulary, i.e. $G_\theta(\cdot\,|\,s_t)$, from which the action $x_{t+1}$, i.e. the next word to select, is sampled. We also train a $\phi$-parameterized discriminative model $D_\phi$ that provides a scalar guiding signal $D_\phi(s_T)$ for $G_\theta$ to adjust its parameters once the whole sentence $s_T$ has been generated.
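For concreteness, the following minimal PyTorch-style sketch illustrates this rollout loop; `policy_net`, `discriminator`, `bos_id`, and `T` are illustrative placeholders standing in for $G_\theta$, $D_\phi$, a start token, and the target length, not the exact LeakGAN modules.

```python
import torch

# Hypothetical rollout of the sequential decision process described above.
# `policy_net(prefix) -> logits over V` and `discriminator(sequence) -> scalar`
# stand in for G_theta and D_phi; any modules with these interfaces would fit.

def rollout(policy_net, discriminator, bos_id, T, device="cpu"):
    state = torch.tensor([[bos_id]], device=device)           # s_t: previously generated words
    for _ in range(T):
        logits = policy_net(state)                            # parameters of G_theta(. | s_t)
        probs = torch.softmax(logits, dim=-1)                 # distribution over the vocabulary V
        next_word = torch.multinomial(probs, num_samples=1)   # sample the action x_{t+1}
        state = torch.cat([state, next_word], dim=1)          # append x_{t+1} to form s_{t+1}
    reward = discriminator(state)                             # scalar signal D_phi(s_T), only at the end
    return state, reward
```

Note that the scalar reward is available only after the full sequence has been produced, which is exactly the sparsity issue discussed next.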
As we discussed previously, although the above adversarial training is principled, the scalar guiding signal becomes relatively less informative as the sentence length $T$ grows. To address this, the proposed LeakGAN framework allows the discriminator $D_\phi$ to provide additional information, denoted as features $f_t$, of the current sentence $s_t$ (which is internally used by $D_\phi$ itself for discrimination) to the generator $G_\theta(\cdot\,|\,s_t)$. In LeakGAN, a hierarchical RL architecture is used as a promising mechanism to effectively incorporate such leaked information $f_t$ into the generation procedure of $G_\theta$ (also see Figure 1).
Leaked Features from D as Guiding Signals
Different from typical model-free RL settings where the reward function is a black box, our adversarial text generation uses $D_\phi$ as a learned reward function. Typically, $D_\phi$ is a neural network and can be decomposed into a feature extractor $\mathcal{F}(\cdot\,; \phi_f)$ and a final sigmoid classification layer with weight vector $\phi_l$. Mathematically, given input $s$, we have

$$D_\phi(s) = \mathrm{sigmoid}\big(\phi_l^\top \mathcal{F}(s; \phi_f)\big) = \mathrm{sigmoid}(\phi_l^\top f), \qquad (1)$$

where $\phi = (\phi_f, \phi_l)$ and $\mathrm{sigmoid}(z) = 1/(1 + e^{-z})$.
$f = \mathcal{F}(s; \phi_f)$ is the feature vector of $s$ in the last layer of $D_\phi$, which is to be leaked to the generator $G_\theta$. As shown in Eq. (1), for a given $D_\phi$, the reward value for each state $s$ mainly depends on the extracted features $f$. As such, the objective of getting a higher reward from $D_\phi$ is equivalent to finding a higher-reward region in the extracted feature space $\mathcal{F}(\mathcal{S}; \phi_f) = \{\mathcal{F}(s; \phi_f)\}_{s \in \mathcal{S}}$. Specifically, our feature extractor $\mathcal{F}(\cdot\,; \phi_f)$ in $D_\phi$ is implemented by a CNN (Zhang and LeCun 2015); thus $\mathcal{F}(s; \phi_f)$ outputs the CNN feature map vector as $f$ after its convolution-pooling-activation layers. Other neural network models such as an LSTM (Hochreiter and Schmidhuber 1997) can also be used to implement $D_\phi$.
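A minimal PyTorch-style sketch of such a feature-leaking discriminator is shown below; it exposes both the sigmoid score and the last-layer feature vector $f$. The embedding dimension, filter count, kernel size, and single convolution layer are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class FeatureLeakingDiscriminator(nn.Module):
    # Sketch of D_phi(s) = sigmoid(phi_l^T F(s; phi_f)) with a CNN feature
    # extractor; all layer sizes are illustrative choices.

    def __init__(self, vocab_size, emb_dim=64, num_filters=128, kernel_size=5):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # F(.; phi_f): convolution -> activation -> pooling over the sequence
        self.conv = nn.Conv1d(emb_dim, num_filters, kernel_size, padding=kernel_size // 2)
        # phi_l: final classification layer on top of the feature vector f
        self.classifier = nn.Linear(num_filters, 1)

    def extract_feature(self, tokens):
        # tokens: (batch, seq_len) word ids of the (partial) sentence s_t
        x = self.embed(tokens).transpose(1, 2)        # (batch, emb_dim, seq_len)
        h = torch.relu(self.conv(x))                  # (batch, num_filters, seq_len)
        f = torch.max(h, dim=2).values                # max-over-time pooling -> feature vector f
        return f

    def forward(self, tokens):
        f = self.extract_feature(tokens)              # the feature to be leaked to the generator
        prob = torch.sigmoid(self.classifier(f))      # D_phi(s) = sigmoid(phi_l^T f)
        return prob, f
```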
Compared to the scalar signal $D_\phi(s)$, the feature vector $f$ is a much more informative guiding signal for $G_\theta$, since it indicates where the currently generated words lie in the extracted feature space.
A Hierarchical Structure of G
At each step $t$ of the generation procedure, to utilize the leaked information $f_t$ from $D_\phi$, we follow hierarchical RL (Vezhnevets et al. 2017) and adopt a hierarchical architecture for $G_\theta$. Specifically, we introduce a MANAGER module, an LSTM that takes the extracted feature vector $f_t$ as its input at each step $t$ and outputs a goal vector $g_t$, which is then fed into the WORKER module to guide the generation of the next word so as to approach the higher-reward region in $\mathcal{F}(\mathcal{S}; \phi_f)$. Next we first describe the detailed generator model in LeakGAN and then show how the MANAGER and WORKER are trained with the guiding signals from $D_\phi$.
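As a preview, the sketch below shows one way such a MANAGER step could be realized, made precise in Eqs. (2)-(3) of the next paragraph: an LSTM cell consumes the leaked feature $f_t$ and emits an L2-normalized goal vector $g_t$. Taking $\hat{g}_t$ to be the LSTM output directly is a simplifying assumption, and the dimensions are illustrative.

```python
import torch
import torch.nn as nn

class Manager(nn.Module):
    # Sketch of the MANAGER: an LSTM cell consumes the leaked feature f_t and
    # emits an L2-normalized goal vector g_t. Taking g_hat_t to be the LSTM
    # output directly is a simplifying assumption; dimensions are illustrative.

    def __init__(self, feature_dim, goal_dim):
        super().__init__()
        self.goal_dim = goal_dim
        self.cell = nn.LSTMCell(feature_dim, goal_dim)        # M(.; theta_m)

    def init_state(self, batch_size, device="cpu"):
        # all-zero initial hidden and cell states (h_0^M)
        zeros = torch.zeros(batch_size, self.goal_dim, device=device)
        return (zeros, zeros.clone())

    def forward(self, f_t, state):
        h_m, c_m = self.cell(f_t, state)                      # g_hat_t, h_t^M = M(f_t, h_{t-1}^M)
        g_t = h_m / (h_m.norm(dim=-1, keepdim=True) + 1e-8)   # g_t = g_hat_t / ||g_hat_t||
        return g_t, (h_m, c_m)
```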
Generation Process. The MANAGER and WORKER modules both start from an all-zero hidden state, denoted as $h_0^M$ and $h_0^W$ respectively. At each step, the MANAGER receives the leaked feature vector $f_t$ from the discriminator $D_\phi$, which is further combined with the current hidden state of the MANAGER to produce the goal vector $g_t$ as

$$\hat{g}_t,\ h_t^M = \mathcal{M}(f_t, h_{t-1}^M; \theta_m), \qquad (2)$$
$$g_t = \hat{g}_t / \lVert \hat{g}_t \rVert, \qquad (3)$$