a) The randomness in the reward $R$,
b) The randomness in the transition $P^\pi$, and
c) The next-state value distribution $Z(X', A')$.
In particular, we make the usual assumption that these three quantities are independent. In this section we will show that (5) is a contraction mapping whose unique fixed point is the random return $Z^\pi$.
3.3.1. CONTRACTION IN $\bar{d}_p$
Consider the process $Z_{k+1} := \mathcal{T}^\pi Z_k$, starting with some $Z_0 \in \mathcal{Z}$. We may expect the limiting expectation of $\{Z_k\}$ to converge exponentially quickly, as usual, to $Q^\pi$. As we now show, the process converges in a stronger sense: $\mathcal{T}^\pi$ is a contraction in $\bar{d}_p$, which implies that all moments also converge exponentially quickly.
Lemma 3. $\mathcal{T}^\pi : \mathcal{Z} \to \mathcal{Z}$ is a $\gamma$-contraction in $\bar{d}_p$.
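As a quick numerical illustration (a minimal sketch, with a randomly generated MDP whose sizes, kernel, policy, and two-point reward distributions are all arbitrary choices), we can apply $\mathcal{T}^\pi$ exactly to two finitely supported value distributions and check that the maximal 1-Wasserstein distance $\bar{d}_1$ shrinks by at least a factor of $\gamma$:

```python
import numpy as np
from scipy.stats import wasserstein_distance

rng = np.random.default_rng(0)
nS, nA, gamma = 3, 2, 0.9

P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P(x'|x, a)
pi = rng.dirichlet(np.ones(nA), size=nS)        # pi(a|x)
R_vals = rng.normal(size=(nS, nA, 2))           # two reward outcomes per (x, a)
R_probs = np.full((nS, nA, 2), 0.5)

def random_Z(n_atoms=5):
    # A finitely supported value distribution: (x, a) -> (atoms, probabilities).
    return {(x, a): (rng.normal(size=n_atoms), np.full(n_atoms, 1 / n_atoms))
            for x in range(nS) for a in range(nA)}

def bellman_op(Z):
    # Exact T^pi Z(x, a): a finite mixture over the reward outcome,
    # the next state x', the next action a' ~ pi, and the atom of Z(x', a').
    out = {}
    for x in range(nS):
        for a in range(nA):
            vals, probs = [], []
            for i in range(2):                  # the two reward outcomes
                for x2 in range(nS):
                    for a2 in range(nA):
                        zv, zp = Z[(x2, a2)]
                        vals.append(R_vals[x, a, i] + gamma * zv)
                        probs.append(R_probs[x, a, i] * P[x, a, x2]
                                     * pi[x2, a2] * zp)
            out[(x, a)] = (np.concatenate(vals), np.concatenate(probs))
    return out

def d_bar(Z1, Z2):
    # Maximal 1-Wasserstein distance over all state-action pairs.
    return max(wasserstein_distance(Z1[k][0], Z2[k][0], Z1[k][1], Z2[k][1])
               for k in Z1)

Z1, Z2 = random_Z(), random_Z()
before, after = d_bar(Z1, Z2), d_bar(bellman_op(Z1), bellman_op(Z2))
print(f"ratio {after / before:.3f} <= gamma = {gamma}")  # contraction holds
```

Because the backup is computed exactly as a finite mixture, the contraction holds on every run rather than only in expectation.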
Using Lemma 3 and Banach's fixed point theorem, we conclude that $\mathcal{T}^\pi$ has a unique fixed point. By inspection, this fixed point must be $Z^\pi$ as defined in (1). As we assume all moments are bounded, this is sufficient to conclude that the sequence $\{Z_k\}$ converges to $Z^\pi$ in $\bar{d}_p$ for $1 \le p \le \infty$.
To conclude, we remark that not all distributional metrics are equal; for example, Chung & Sobel (1987) have shown that $\mathcal{T}^\pi$ is not a contraction in total variation distance. Similar results can be derived for the Kullback-Leibler divergence and the Kolmogorov distance.
3.3.2. CONTRACTION IN CENTERED MOMENTS
Observe that $d_2(U, V)$ (and more generally, $d_p$) relates to a coupling $C(\omega) := U(\omega) - V(\omega)$, in the sense that

$$d_2^2(U, V) \le \mathbb{E}\big[(U - V)^2\big] = \mathbb{V}(C) + (\mathbb{E}\, C)^2.$$
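This bound is easy to check numerically. The following minimal sketch, with arbitrarily chosen Gaussian samples, compares an arbitrary coupling of $U$ and $V$ against the comonotone (sorted) coupling, which attains $d_2$ for distributions on the real line:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 100_000
u = rng.normal(0.0, 1.0, n)          # samples of U
v = rng.normal(0.5, 2.0, n)          # samples of V, paired arbitrarily with U

c = u - v                            # an arbitrary coupling C = U - V
arbitrary = np.mean(c ** 2)          # E[(U - V)^2] = V(C) + (E C)^2
assert np.isclose(arbitrary, c.var() + c.mean() ** 2)

# The optimal coupling for d_2 on the real line pairs equal quantiles.
optimal = np.mean((np.sort(u) - np.sort(v)) ** 2)   # approximates d_2^2(U, V)

print(f"d_2^2(U, V)  ~= {optimal:.4f}")
print(f"E[(U - V)^2]  = {arbitrary:.4f}  (>= d_2^2, as claimed)")
```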
As a result, we cannot directly use $d_2$ to bound the variance difference $|\mathbb{V}(\mathcal{T}^\pi Z(x, a)) - \mathbb{V}(Z^\pi(x, a))|$. However, $\mathcal{T}^\pi$ is in fact a contraction in variance (Sobel, 1982; see also the appendix). In general, $\mathcal{T}^\pi$ is not a contraction in the $p$th centered moment, $p > 2$, but the centered moments of the iterates $\{Z_k\}$ still converge exponentially quickly to those of $Z^\pi$; the proof extends the result of Rösler (1992).
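To make the rates concrete, consider a hypothetical single-state MDP with a self-loop, where $Z_{k+1} \stackrel{D}{=} R + \gamma Z_k$ and $R$ is independent of $Z_k$. The exact moment recursions below (a sketch under these assumptions, not a general implementation) show the mean gap contracting at rate $\gamma$ and the variance gap at rate $\gamma^2$:

```python
# Single-state MDP with a self-loop: Z_{k+1} =_D R + gamma * Z_k, R ⟂ Z_k.
gamma = 0.9
mean_r, var_r = 1.0, 4.0                 # moments of the reward R (assumed)

mean_star = mean_r / (1 - gamma)         # fixed-point mean, i.e. Q^pi
var_star = var_r / (1 - gamma ** 2)      # fixed-point variance of Z^pi

mean_k, var_k = 0.0, 0.0                 # Z_0: a point mass at zero
for k in range(1, 6):
    mean_k = mean_r + gamma * mean_k     # E Z_{k+1} = E R + gamma * E Z_k
    var_k = var_r + gamma ** 2 * var_k   # V Z_{k+1} = V R + gamma^2 * V Z_k
    print(f"k={k}: |mean gap|={abs(mean_k - mean_star):.4f}, "
          f"|var gap|={abs(var_k - var_star):.4f}")
```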
3.4. Control
Thus far we have considered a fixed policy $\pi$, and studied the behaviour of its associated operator $\mathcal{T}^\pi$. We now set out to understand the distributional operators of the control setting – where we seek a policy $\pi$ that maximizes value – and the corresponding notion of an optimal value distribution. As with the optimal value function, this notion is intimately tied to that of an optimal policy. However, while all optimal policies attain the same value $Q^*$, in our case a difficulty arises: in general there are many optimal value distributions.
In this section we show that the distributional analogue of the Bellman optimality operator converges, in a weak sense, to the set of optimal value distributions. However, this operator is not a contraction in any metric between distributions, and is in general much more temperamental than the policy evaluation operators. We believe the convergence issues we outline here are a symptom of the inherent instability of greedy updates, as highlighted by e.g. Tsitsiklis (2002) and most recently Harutyunyan et al. (2016).
Let $\Pi^*$ be the set of optimal policies. We begin by characterizing what we mean by an optimal value distribution.

Definition 1 (Optimal value distribution). An optimal value distribution is the v.d. of an optimal policy. The set of optimal value distributions is $\mathcal{Z}^* := \{ Z^{\pi^*} : \pi^* \in \Pi^* \}$.
We emphasize that not all value distributions with expectation $Q^*$ are optimal: they must match the full distribution of the return under some optimal policy.
Definition 2. A greedy policy $\pi$ for $Z \in \mathcal{Z}$ maximizes the expectation of $Z$. The set of greedy policies for $Z$ is

$$\mathcal{G}_Z := \Big\{ \pi : \sum_a \pi(a \,|\, x)\, \mathbb{E}\, Z(x, a) = \max_{a' \in \mathcal{A}} \mathbb{E}\, Z(x, a') \Big\}.$$
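A minimal sketch of this definition, using made-up atoms for a single state with three actions, recovers the greedy set as the argmax of the per-action expectations; note how ties (actions 0 and 1 below) arise even though the underlying distributions differ:

```python
import numpy as np

# Hypothetical atoms and probabilities for |A| = 3 actions at one state x.
atoms = np.array([[0.0, 10.0],      # action 0: risky, E Z(x, 0) = 5.0
                  [5.0, 5.0],       # action 1: safe,  E Z(x, 1) = 5.0
                  [1.0, 2.0]])      # action 2: worse, E Z(x, 2) = 1.5
probs = np.full_like(atoms, 0.5)

q = (atoms * probs).sum(axis=1)     # E Z(x, a) for each action
greedy_actions = np.flatnonzero(np.isclose(q, q.max()))
print(q, "-> greedy actions:", greedy_actions)   # actions 0 and 1 tie
```

Such ties are precisely how distinct value distributions can share the same expectation.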
Recall that the expected Bellman optimality operator $\mathcal{T}$ is

$$\mathcal{T} Q(x, a) = \mathbb{E}\, R(x, a) + \gamma\, \mathbb{E}_P \max_{a' \in \mathcal{A}} Q(x', a'). \qquad (6)$$
The maximization at $x'$ corresponds to some greedy policy. Although this policy is implicit in (6), we cannot ignore it in the distributional setting. We will call a distributional Bellman optimality operator any operator $\mathcal{T}$ which implements a greedy selection rule, i.e.

$$\mathcal{T} Z = \mathcal{T}^\pi Z \quad \text{for some } \pi \in \mathcal{G}_Z.$$
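The following sketch implements one such operator on a hypothetical MDP (deterministic rewards, for brevity): it selects a greedy policy for the current $Z$, breaking ties by lowest action index, and then performs one exact $\mathcal{T}^\pi$ backup under that policy:

```python
import numpy as np

rng = np.random.default_rng(2)
nS, nA, gamma = 3, 2, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P(x'|x, a)
R = rng.normal(size=(nS, nA))                   # deterministic rewards

Z = rng.normal(size=(nS, nA, 5))                # equally weighted atoms of Z(x, a)

def optimality_backup(Z):
    n_atoms = Z.shape[2]
    q = Z.mean(axis=2)                          # E Z(x, a)
    greedy = q.argmax(axis=1)                   # one greedy policy in G_Z
    atoms = np.empty((nS, nA, nS * n_atoms))
    weights = np.empty((nS, nA, nS * n_atoms))
    for x in range(nS):
        for a in range(nA):
            # T Z(x, a): mixture over x' of R(x, a) + gamma * Z(x', greedy(x')).
            atoms[x, a] = np.concatenate(
                [R[x, a] + gamma * Z[x2, greedy[x2]] for x2 in range(nS)])
            weights[x, a] = np.repeat(P[x, a] / n_atoms, n_atoms)
    return atoms, weights

atoms, weights = optimality_backup(Z)
print(atoms.shape, weights.sum(axis=2))         # weights sum to 1 per (x, a)
```

The tie-breaking rule is exactly the freedom that makes $\mathcal{T}$ non-unique; note also that the backup produces weighted atoms, so iterating it in practice requires a projection or resampling step.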
As in the policy evaluation setting, we are interested in the behaviour of the iterates $Z_{k+1} := \mathcal{T} Z_k$, $Z_0 \in \mathcal{Z}$. Our first result is to assert that $\mathbb{E}\, Z_k$ behaves as expected.
Lemma 4. Let $Z_1, Z_2 \in \mathcal{Z}$. Then

$$\| \mathbb{E}\, \mathcal{T} Z_1 - \mathbb{E}\, \mathcal{T} Z_2 \|_\infty \le \gamma\, \| \mathbb{E}\, Z_1 - \mathbb{E}\, Z_2 \|_\infty ,$$

and in particular $\mathbb{E}\, Z_k \to Q^*$ exponentially quickly.
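Since $\mathbb{E}\, \mathcal{T} Z$ depends on $Z$ only through $\mathbb{E}\, Z$, Lemma 4 reduces to the classical $\gamma$-contraction of the operator in (6). The sketch below (with an arbitrary random MDP) checks this on two random Q-tables:

```python
import numpy as np

rng = np.random.default_rng(3)
nS, nA, gamma = 4, 3, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))   # P(x'|x, a)
R = rng.normal(size=(nS, nA))                   # E R(x, a)

def expected_backup(Q):
    # T Q(x, a) = E R(x, a) + gamma * E_P max_{a'} Q(x', a'), as in (6).
    return R + gamma * P @ Q.max(axis=1)

Q1, Q2 = rng.normal(size=(nS, nA)), rng.normal(size=(nS, nA))
before = np.abs(Q1 - Q2).max()
after = np.abs(expected_backup(Q1) - expected_backup(Q2)).max()
print(f"||T Q1 - T Q2||_inf = {after:.4f} "
      f"<= gamma * ||Q1 - Q2||_inf = {gamma * before:.4f}")
```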
By inspecting Lemma 4, we might expect that $Z_k$ converges quickly in $\bar{d}_p$ to some fixed point in $\mathcal{Z}^*$. Unfortunately, convergence is neither quick nor assured to reach a fixed point. In fact, the best we can hope for is pointwise convergence, not even to the set $\mathcal{Z}^*$ but to the larger set of nonstationary optimal value distributions.