Here, the subscript of $\mathbb{E}$ enumerates the variables being integrated over, where states and actions are sampled sequentially from the dynamics model $P(s_{t+1} \mid s_t, a_t)$ and policy $\pi(a_t \mid s_t)$, respectively. The colon notation $a:b$ refers to the inclusive range $(a, a+1, \dots, b)$. These formulas are well known and straightforward to obtain; they follow directly from Proposition 1, which will be stated shortly.
The choice $\Psi_t = A^{\pi}(s_t, a_t)$ yields almost the lowest possible variance, though in practice the advantage function is not known and must be estimated. This statement can be intuitively justified by the following interpretation of the policy gradient: a step in the policy gradient direction should increase the probability of better-than-average actions and decrease the probability of worse-than-average actions. The advantage function, by its definition $A^{\pi}(s, a) = Q^{\pi}(s, a) - V^{\pi}(s)$, measures whether the action is better or worse than the policy's default behavior. Hence, we should choose $\Psi_t$ to be the advantage function $A^{\pi}(s_t, a_t)$, so that the gradient term $\Psi_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$ points in the direction of increased $\pi_\theta(a_t \mid s_t)$ if and only if $A^{\pi}(s_t, a_t) > 0$. See Greensmith et al. (2004) for a more rigorous analysis of the variance of policy gradient estimators and the effect of using a baseline.
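To make this sign argument concrete, the following minimal sketch (an illustration, not part of the paper) takes a single step along $\hat{A}_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$ for a toy softmax policy over three actions at one state; the logits parameterization, step size, and advantage values are illustrative assumptions.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

theta = np.zeros(3)            # logits of a toy softmax policy (hypothetical)
action = 1                     # the sampled action a_t
lr = 0.5                       # illustrative step size

for advantage in (+1.0, -1.0): # better- vs. worse-than-average action
    pi = softmax(theta)
    # For softmax logits, grad_theta log pi(action) = one_hot(action) - pi.
    grad_log_pi = -pi.copy()
    grad_log_pi[action] += 1.0
    new_pi = softmax(theta + lr * advantage * grad_log_pi)
    print(f"A_hat={advantage:+.1f}: pi(a_t) {pi[action]:.3f} -> {new_pi[action]:.3f}")
```

The probability of $a_t$ increases exactly when the advantage is positive, matching the interpretation above.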
We will introduce a parameter $\gamma$ that allows us to reduce variance by downweighting rewards corresponding to delayed effects, at the cost of introducing bias. This parameter corresponds to the discount factor used in discounted formulations of MDPs, but we treat it as a variance reduction parameter in an undiscounted problem; this technique was analyzed theoretically by Marbach & Tsitsiklis (2003); Kakade (2001b); Thomas (2014). The discounted value functions are given by:
$$V^{\pi,\gamma}(s_t) := \mathbb{E}_{s_{t+1:\infty},\, a_{t:\infty}}\left[\sum_{l=0}^{\infty} \gamma^l r_{t+l}\right] \qquad Q^{\pi,\gamma}(s_t, a_t) := \mathbb{E}_{s_{t+1:\infty},\, a_{t+1:\infty}}\left[\sum_{l=0}^{\infty} \gamma^l r_{t+l}\right] \tag{4}$$
$$A^{\pi,\gamma}(s_t, a_t) := Q^{\pi,\gamma}(s_t, a_t) - V^{\pi,\gamma}(s_t). \tag{5}$$
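As a rough numerical companion to these definitions (a sketch under illustrative assumptions, not the paper's method), the snippet below computes empirical discounted returns for a recorded finite-length trajectory and subtracts a stand-in value estimate to form advantage estimates; the truncation to a finite horizon, the `rewards` array, and the `values` array are all hypothetical.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """R[t] = sum_{l >= 0} gamma^l * rewards[t + l] for a finite trajectory."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Hypothetical trajectory data: per-step rewards and a stand-in for
# V^{pi,gamma}(s_t), e.g. the output of a learned value function.
rewards = np.array([0.0, 1.0, 0.0, 2.0])
values  = np.array([0.8, 1.5, 1.0, 1.9])
gamma   = 0.99

q_hat = discounted_returns(rewards, gamma)  # Monte Carlo estimate of Q^{pi,gamma}(s_t, a_t)
a_hat = q_hat - values                      # estimate of A^{pi,gamma}(s_t, a_t), as in Eq. (5)
```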
The discounted approximation to the policy gradient is defined as follows:
$$g^{\gamma} := \mathbb{E}_{s_{0:\infty},\, a_{0:\infty}}\left[\sum_{t=0}^{\infty} A^{\pi,\gamma}(s_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]. \tag{6}$$
The following section discusses how to obtain biased (but not too biased) estimators for $A^{\pi,\gamma}$, giving us noisy estimates of the discounted policy gradient in Equation (6).
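For concreteness, one way such a noisy, sample-based estimate $\sum_t \hat{A}_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)$ might be assembled for a single trajectory is sketched below using PyTorch; this is an assumption-laden illustration rather than the paper's implementation, where `policy_logits` is a hypothetical module mapping states to action logits and `advantages` would come from an advantage estimator such as the one sketched after Equation (5).

```python
import torch

def policy_gradient_estimate(policy_logits, states, actions, advantages):
    """Sample-based estimate of Eq. (6) for a single trajectory."""
    logits = policy_logits(states)                      # shape (T, num_actions)
    dist = torch.distributions.Categorical(logits=logits)
    log_probs = dist.log_prob(actions)                  # log pi_theta(a_t | s_t)
    # Advantages are treated as fixed weights on the score function.
    surrogate = (advantages.detach() * log_probs).sum()
    return torch.autograd.grad(surrogate, list(policy_logits.parameters()))
```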
Before proceeding, we will introduce the notion of a γ-just estimator of the advantage function, which is an estimator that does not introduce bias when we use it in place of $A^{\pi,\gamma}$ (which is not known and must be estimated) in Equation (6) to estimate $g^{\gamma}$.¹ Consider an advantage estimator $\hat{A}_t(s_{0:\infty}, a_{0:\infty})$, which may in general be a function of the entire trajectory.
Definition 1. The estimator $\hat{A}_t$ is γ-just if
$$\mathbb{E}_{s_{0:\infty},\, a_{0:\infty}}\left[\hat{A}_t(s_{0:\infty}, a_{0:\infty})\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right] = \mathbb{E}_{s_{0:\infty},\, a_{0:\infty}}\left[A^{\pi,\gamma}(s_t, a_t)\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right]. \tag{7}$$
It follows immediately that if $\hat{A}_t$ is γ-just for all $t$, then
$$\mathbb{E}_{s_{0:\infty},\, a_{0:\infty}}\left[\sum_{t=0}^{\infty} \hat{A}_t(s_{0:\infty}, a_{0:\infty})\, \nabla_\theta \log \pi_\theta(a_t \mid s_t)\right] = g^{\gamma}. \tag{8}$$
One sufficient condition for $\hat{A}_t$ to be γ-just is that $\hat{A}_t$ decomposes as the difference between two functions $Q_t$ and $b_t$, where $Q_t$ can depend on any trajectory variables but gives an unbiased estimator of the γ-discounted Q-function, and $b_t$ is an arbitrary function of the states and actions sampled before $a_t$.
Proposition 1. Suppose that $\hat{A}_t$ can be written in the form $\hat{A}_t(s_{0:\infty}, a_{0:\infty}) = Q_t(s_{t:\infty}, a_{t:\infty}) - b_t(s_{0:t}, a_{0:t-1})$ such that for all $(s_t, a_t)$, $\mathbb{E}_{s_{t+1:\infty},\, a_{t+1:\infty} \mid s_t, a_t}\left[Q_t(s_{t:\infty}, a_{t:\infty})\right] = Q^{\pi,\gamma}(s_t, a_t)$. Then $\hat{A}_t$ is γ-just.
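As one concrete instance of this decomposition (an illustration consistent with the definition of $Q^{\pi,\gamma}$ in Equation (4), not an exhaustive list), take the empirical discounted return from time $t$ together with a baseline that depends only on $s_t$:
$$\hat{A}_t = \underbrace{\sum_{l=0}^{\infty} \gamma^l r_{t+l}}_{Q_t(s_{t:\infty},\, a_{t:\infty})} - \underbrace{V(s_t)}_{b_t(s_{0:t},\, a_{0:t-1})}, \qquad \mathbb{E}_{s_{t+1:\infty},\, a_{t+1:\infty} \mid s_t, a_t}\left[\sum_{l=0}^{\infty} \gamma^l r_{t+l}\right] = Q^{\pi,\gamma}(s_t, a_t),$$
so this estimator is γ-just by Proposition 1, for any function $V$ of the current state.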
¹Note that we have already introduced bias by using $A^{\pi,\gamma}$ in place of $A^{\pi}$; here we are concerned with obtaining an unbiased estimate of $g^{\gamma}$, which is a biased estimate of the policy gradient of the undiscounted MDP.