For broader readers who may not be familiar with reinforcement learning, we briefly introduce its concepts via their counterparts or equivalent concepts in supervised models, with the RL terms in parentheses: our goal is to train an extractor (agent A) to label entities, event triggers and argument roles (actions a) in text (environment e); to commit correct labels, the extractor consumes features (state s) and follows the ground truth (expert E); a reward R is issued to the extractor according to whether its output differs from the ground truth and how serious the difference is – as shown in Figure 1, a repeated mistake is definitely more serious – and the extractor improves the extraction model (policy π) by pursuing maximized rewards.
Our framework can be briefly described as follows: given a sentence, our extractor scans the sentence and determines the boundaries and types of entities and event triggers using Q-Learning (Section 3.1); meanwhile, the extractor determines the relations between triggers and entities – argument roles – with policy gradient (Section 3.2). During the training epochs, GANs estimate rewards which stimulate the extractor to pursue the optimal joint model (Section 4).
3 Framework and Approach
3.1 Q-Learning for Entities and Triggers
Entity and trigger detection is often modeled as a sequence labeling problem, where long-term dependency is a core characteristic, and reinforcement learning is a well-suited method (Maes et al., 2007).
From the RL perspective, our extractor (agent A) explores the environment, i.e., unstructured natural language sentences, as it goes through the sequences and commits labels (actions a) for the tokens. When the extractor arrives at the t-th token in the sentence, it observes information from the environment and its previous action $a_{t-1}$ as its current state $s_t$; once the extractor commits a current action $a_t$ and moves to the next token, it has a new state $s_{t+1}$. The information from the environment is the token's context embedding $v_t$, which is usually acquired from Bi-LSTM (Hochreiter and Schmidhuber, 1997) outputs; the previous action $a_{t-1}$ may impose some constraint on the current action $a_t$, e.g., I-ORG does not follow B-PER.²

² In this work, we use the BIO scheme, e.g., "B-Meet" indicates that the token is the beginning of a Meet trigger, "I-ORG" means that the token is inside an organization phrase, and "O" denotes null.
With the aforementioned notations, we have

$$ s_t = \langle v_t, a_{t-1} \rangle. \quad (1) $$
To determine the current action $a_t$, we generate a series of Q-tables with

$$ Q_{sl}(s_t, a_t) = f_{sl}(s_t \mid s_{t-1}, s_{t-2}, \ldots, a_{t-1}, a_{t-2}, \ldots), \quad (2) $$
where $f_{sl}(\cdot)$ denotes a function that determines the Q-values using the current state as well as previous states and actions. Then we obtain

$$ \hat{a}_t = \arg\max_{a_t} Q_{sl}(s_t, a_t). \quad (3) $$
Equations 2 and 3 suggest that an RNN-based framework which consumes the current input as well as previous inputs and outputs can be adopted, and we use a unidirectional LSTM as in (Bakker, 2002). The full pipeline is illustrated in Figure 2.
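To make Equations 1-3 concrete, the following is a minimal sketch (not the authors' released code) of such a label-decoding network: a unidirectional LSTM consumes, at each step, the token's context embedding $v_t$ together with an embedding of the previous action $a_{t-1}$, emits $Q_{sl}(s_t, \cdot)$ over the label set, and takes the greedy label as $\hat{a}_t$. The class and parameter names (QLabeler, ctx_dim, action_dim, hidden_dim) and all dimensions are illustrative assumptions.

```python
import torch
import torch.nn as nn

class QLabeler(nn.Module):
    def __init__(self, ctx_dim, n_labels, action_dim=25, hidden_dim=128):
        super().__init__()
        self.action_emb = nn.Embedding(n_labels + 1, action_dim)  # extra index = "sequence start"
        self.lstm = nn.LSTM(ctx_dim + action_dim, hidden_dim, batch_first=True)
        self.q_head = nn.Linear(hidden_dim, n_labels)
        self.n_labels = n_labels

    def forward(self, ctx):
        # ctx: (1, T, ctx_dim) context embeddings, e.g. Bi-LSTM outputs over the sentence
        T = ctx.size(1)
        prev_action = torch.tensor([self.n_labels])    # placeholder "start" action
        hidden, q_values, actions = None, [], []
        for t in range(T):
            # s_t = <v_t, a_{t-1}>                                          (Eq. 1)
            step_in = torch.cat([ctx[:, t], self.action_emb(prev_action)], dim=-1)
            out, hidden = self.lstm(step_in.unsqueeze(1), hidden)
            q_t = self.q_head(out.squeeze(1))          # Q_sl(s_t, a) for every label (Eq. 2)
            a_t = q_t.argmax(dim=-1)                   # greedy label \hat{a}_t       (Eq. 3)
            q_values.append(q_t)
            actions.append(a_t)
            prev_action = a_t
        return torch.stack(q_values, dim=1), torch.stack(actions, dim=1)
```

During training the previous action could instead be sampled or taken from the gold sequence; the greedy choice here simply mirrors Equation 3.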
For each label (action $a_t$) with regard to $s_t$, a reward $r_t = r(s_t, a_t)$ is assigned to the extractor (agent). We use Q-learning to pursue the optimal sequence labeling model (policy π) by maximizing the expected value of the sum of future rewards $E(R_t)$, where $R_t$ represents the sum of discounted future rewards $r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots$ with a discount factor $\gamma$, which determines the influence of future states on the current one.
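As a small worked example of the return just defined, the helper below (a hypothetical function, not part of the paper) accumulates $R_t = r_t + \gamma r_{t+1} + \gamma^2 r_{t+2} + \ldots$ from a list of per-step rewards.

```python
def discounted_return(rewards, gamma=0.9):
    """rewards: per-step rewards r_t, ..., r_T; returns the discounted sum R_t."""
    R = 0.0
    for r in reversed(rewards):
        R = r + gamma * R
    return R

# e.g. discounted_return([1.0, -1.0, 1.0]) == 1.0 - 0.9 + 0.81 == 0.91
```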
We utilize the Bellman Equation to update the Q-value with regard to the currently assigned label, so as to approximate an optimal model (policy $\pi^*$):

$$ Q^{\pi^*}_{sl}(s_t, a_t) = r_t + \gamma \max_{a_{t+1}} Q_{sl}(s_{t+1}, a_{t+1}). \quad (4) $$
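A minimal sketch of this update is given below, assuming per-token Q-values of shape (T, n_labels) and a zero bootstrap at the end of the sentence; the function name bellman_targets and that terminal-state convention are our own assumptions rather than details stated in the paper.

```python
import torch

def bellman_targets(q_values, actions, rewards, gamma=0.9):
    # q_values: (T, n_labels) detached Q-values; actions: (T,) chosen labels;
    # rewards:  length-T per-token rewards r_t.
    updated = q_values.clone()
    T = q_values.size(0)
    for t in range(T):
        # r_t + gamma * max_{a_{t+1}} Q_sl(s_{t+1}, a_{t+1})   (Eq. 4)
        bootstrap = q_values[t + 1].max() if t + 1 < T else torch.tensor(0.0)
        updated[t, actions[t]] = rewards[t] + gamma * bootstrap
    return updated
```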
As illustrated in Figure 3, when the extractor assigns a wrong label to the "death" token because the Q-value of Die ranks first, Equation 4 will penalize the Q-value with regard to the wrong label; in later epochs, if the extractor commits the correct label of Execute, the Q-value will be boosted and the decision reinforced.
We minimize the loss in terms of mean squared error between the original and updated Q-values, notated as $Q'_{sl}(s_t, a_t)$:

$$ L_{sl} = \frac{1}{n} \sum_t^{n} \sum_a \left( Q'_{sl}(s_t, a_t) - Q_{sl}(s_t, a_t) \right)^2 \quad (5) $$

and apply back-propagation to optimize the parameters in the neural network.
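Putting Equations 4 and 5 together, one training step might look like the sketch below, which reuses the hypothetical QLabeler and bellman_targets sketches above; the choice of optimizer and the detaching of the targets are our own assumptions rather than details given in the paper.

```python
import torch.nn.functional as F

def q_learning_step(model, optimizer, ctx, rewards, gamma=0.9):
    q_values, actions = model(ctx)                # QLabeler sketch: (1, T, n_labels), (1, T)
    q_values, actions = q_values.squeeze(0), actions.squeeze(0)
    # Targets are held fixed (detached) so gradients flow only through Q_sl(s_t, a_t).
    targets = bellman_targets(q_values.detach(), actions, rewards, gamma)
    loss = F.mse_loss(q_values, targets)          # mean squared error of Eq. 5
    optimizer.zero_grad()
    loss.backward()                               # back-propagate to the network parameters
    optimizer.step()
    return loss.item()
```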