against exploitation, but ultimately these can all be addressed
formally within the framework of RL.
III. REINFORCEMENT LEARNING ALGORITHMS
So far, we have introduced the key formalism used in RL, the MDP, and briefly noted some of the challenges it poses. In the following, we will distinguish between different classes of RL algorithms. There are two main approaches to solving RL problems: methods based on value functions and methods based on policy search. There is also a hybrid, actor-critic approach, which employs both value functions and policy search. We will now explain these approaches and other useful concepts for solving RL problems.
A. Value Functions
Value function methods are based on estimating the value (expected return) of being in a given state. The state-value function $V^\pi(s)$ is the expected return when starting in state $s$ and following $\pi$ henceforth:
\[
V^\pi(s) = \mathbb{E}[R \mid s, \pi] \tag{2}
\]
The optimal policy, $\pi^*$, has a corresponding state-value function $V^*(s)$, and vice versa; the optimal state-value function can be defined as
\[
V^*(s) = \max_\pi V^\pi(s) \quad \forall s \in \mathcal{S}. \tag{3}
\]
If we had $V^*(s)$ available, the optimal policy could be retrieved by choosing among all actions available at $s_t$ and picking the action $a$ that maximises $\mathbb{E}_{s_{t+1} \sim \mathcal{T}(s_{t+1} \mid s_t, a)}[V^*(s_{t+1})]$.
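As a concrete illustration of this retrieval step (a minimal sketch, not taken from the works cited here), suppose a small tabular MDP whose transition probabilities are stored in a NumPy array; the array layout and function name below are illustrative assumptions. Note that the construction requires access to $\mathcal{T}$, which motivates the quality function introduced next.
\begin{verbatim}
import numpy as np

def greedy_policy_from_v(V_star, T):
    """Pick, in every state, the action maximising E_{s'~T(s'|s,a)}[V*(s')].

    V_star: array of shape [num_states]
    T:      array of shape [num_states, num_actions, num_states],
            where T[s, a, s_next] = p(s_next | s, a)  (assumed layout)
    """
    expected_next_value = T @ V_star           # shape [num_states, num_actions]
    return expected_next_value.argmax(axis=1)  # greedy action for every state

# Tiny illustrative MDP with 2 states and 2 actions.
T = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.8, 0.2], [0.2, 0.8]]])
V_star = np.array([0.0, 1.0])
print(greedy_policy_from_v(V_star, T))  # [1 1]: both states prefer action 1
\end{verbatim}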
In the RL setting, the transition dynamics $\mathcal{T}$ are unavailable. Therefore, we construct another function, the state-action-value or quality function $Q^\pi(s, a)$, which is similar to $V^\pi$, except that the initial action $a$ is provided, and $\pi$ is only followed from the succeeding state onwards:
\[
Q^\pi(s, a) = \mathbb{E}[R \mid s, a, \pi]. \tag{4}
\]
The best policy, given $Q^\pi(s, a)$, can be found by choosing $a$ greedily at every state: $\arg\max_a Q^\pi(s, a)$. Under this policy, we can also define $V^\pi(s)$ by maximising $Q^\pi(s, a)$: $V^\pi(s) = \max_a Q^\pi(s, a)$.
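As a minimal illustration with a hypothetical tabular estimate of $Q^\pi$, both the greedy policy and the corresponding state values can be read off the table directly, without access to $\mathcal{T}$:
\begin{verbatim}
import numpy as np

# Hypothetical tabular estimate of Q^pi, shape [num_states, num_actions].
Q = np.array([[0.2, 0.7],
              [0.5, 0.1]])

greedy_policy = Q.argmax(axis=1)  # argmax_a Q(s, a) in every state
V = Q.max(axis=1)                 # V(s) = max_a Q(s, a) under that greedy policy

print(greedy_policy)  # [1 0]
print(V)              # [0.7 0.5]
\end{verbatim}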
Dynamic Programming: To actually learn $Q^\pi$, we exploit the Markov property and define the function as a Bellman equation [13], which has the following recursive form:
\[
Q^\pi(s_t, a_t) = \mathbb{E}_{s_{t+1}}[r_{t+1} + \gamma Q^\pi(s_{t+1}, \pi(s_{t+1}))]. \tag{5}
\]
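The recursion in (5) can be turned into an iterative policy-evaluation backup when a model is available; the following sketch is illustrative only, with assumed array shapes and an expected immediate reward standing in for the expectation over $r_{t+1}$:
\begin{verbatim}
import numpy as np

def evaluate_policy(T, R, pi, gamma=0.9, sweeps=100):
    """Repeatedly apply the Bellman backup of Eq. (5) for a fixed policy pi.

    T:  [S, A, S'] transition probabilities (assumed layout)
    R:  [S, A]     expected immediate reward for taking action a in state s
    pi: [S]        deterministic policy, pi[s] = action chosen in s
    """
    S, A = R.shape
    Q = np.zeros((S, A))
    for _ in range(sweeps):
        next_value = Q[np.arange(S), pi]   # Q(s', pi(s')) for every state s'
        # Q(s, a) <- E_{s'}[ r + gamma * Q(s', pi(s')) ]
        Q = R + gamma * (T @ next_value)
    return Q
\end{verbatim}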
This means that $Q^\pi$ can be improved by bootstrapping, i.e., we can use the current values of our estimate of $Q^\pi$ to improve our estimate. This is the foundation of Q-learning [159] and the state-action-reward-state-action (SARSA) algorithm [112]:
\[
Q^\pi(s_t, a_t) \leftarrow Q^\pi(s_t, a_t) + \alpha\delta, \tag{6}
\]
where $\alpha$ is the learning rate and $\delta = Y - Q^\pi(s_t, a_t)$ the temporal difference (TD) error; here, $Y$ is a target as in a standard regression problem. SARSA, an on-policy learning algorithm, is used to improve the estimate of $Q^\pi$ by using transitions generated by the behavioural policy (the policy derived from $Q^\pi$), which results in setting $Y = r_t + \gamma Q^\pi(s_{t+1}, a_{t+1})$. Q-learning is off-policy, as $Q^\pi$ is instead updated by transitions that were not necessarily generated by the derived policy. Instead, Q-learning uses $Y = r_t + \gamma \max_a Q^\pi(s_{t+1}, a)$, which directly approximates $Q^*$.
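The two targets can be contrasted in a short tabular sketch of the update in (6); this is a minimal illustration rather than a reference implementation, and the array layout is an assumption:
\begin{verbatim}
import numpy as np

def td_update(Q, s, a, r, s_next, a_next=None, alpha=0.1, gamma=0.99):
    """One application of Eq. (6): Q(s, a) <- Q(s, a) + alpha * delta.

    If a_next is supplied, the SARSA (on-policy) target is used:
        Y = r + gamma * Q(s_next, a_next)
    otherwise the Q-learning (off-policy) target is used:
        Y = r + gamma * max_a Q(s_next, a)
    """
    if a_next is not None:               # SARSA
        Y = r + gamma * Q[s_next, a_next]
    else:                                # Q-learning
        Y = r + gamma * Q[s_next].max()
    delta = Y - Q[s, a]                  # temporal difference (TD) error
    Q[s, a] += alpha * delta
    return Q
\end{verbatim}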
To find $Q^*$ from an arbitrary $Q^\pi$, we use generalised policy iteration, where policy iteration consists of policy evaluation and policy improvement. Policy evaluation improves
the estimate of the value function, which can be achieved
by minimising TD errors from trajectories experienced by
following the policy. As the estimate improves, the policy can
naturally be improved by choosing actions greedily based on
the updated value function. Instead of performing these steps
separately to convergence (as in policy iteration), generalised
policy iteration allows for interleaving the steps, such that
progress can be made more rapidly.
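A sketch of generalised policy iteration in this interleaved form is given below, using the Q-learning target together with an $\epsilon$-greedy behavioural policy; the environment interface (reset/step returning next state, reward and a termination flag) and the hyperparameters are assumptions for illustration.
\begin{verbatim}
import numpy as np

def q_learning(env, num_states, num_actions,
               episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Generalised policy iteration: evaluation and improvement interleaved."""
    Q = np.zeros((num_states, num_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Policy improvement (implicit): act (mostly) greedily w.r.t. Q.
            if np.random.rand() < epsilon:
                a = np.random.randint(num_actions)
            else:
                a = int(Q[s].argmax())
            s_next, r, done = env.step(a)  # assumed environment interface
            # Policy evaluation: one bootstrapped TD update (Q-learning target).
            Y = r + gamma * (0.0 if done else Q[s_next].max())
            Q[s, a] += alpha * (Y - Q[s, a])
            s = s_next
    return Q
\end{verbatim}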
B. Sampling
Instead of bootstrapping value functions using dynamic
programming methods, Monte Carlo methods estimate the
expected return (2) from a state by averaging the return from
multiple rollouts of a policy. Because of this, pure Monte Carlo
methods can also be applied in non-Markovian environments.
On the other hand, they can only be used in episodic MDPs,
as a rollout has to terminate for the return to be calculated.
It is possible to get the best of both methods by combining
TD learning and Monte Carlo policy evaluation, as is done in
the TD(λ) algorithm [135]. Similarly to the discount factor,
the λ in TD(λ) is used to interpolate between Monte Carlo
evaluation and bootstrapping. As demonstrated in Figure 3,
this results in an entire spectrum of RL methods based around
the amount of sampling utilised.
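For contrast with the bootstrapped targets above, the following sketch (illustrative, with assumed names) computes the full discounted return of a single completed episode, as used in Monte Carlo evaluation of (2); TD($\lambda$) blends such full returns with one-step bootstrapped targets as $\lambda$ is varied between 0 and 1.
\begin{verbatim}
def discounted_returns(rewards, gamma=0.99):
    """Return R_t = sum_k gamma^k * r_{t+k+1} for every step of one
    completed episode; Monte Carlo evaluation requires termination."""
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    return list(reversed(returns))

# Averaging returns[0] over many rollouts from state s estimates V^pi(s).
print(discounted_returns([0.0, 0.0, 1.0], gamma=0.5))  # [0.25, 0.5, 1.0]
\end{verbatim}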
Another major value-function based method relies on learning the advantage function $A^\pi(s, a)$ [6, 43]. Unlike producing absolute state-action values, as with $Q^\pi$, $A^\pi$ instead represents relative state-action values. Learning relative values is akin to removing a baseline or average level of a signal; more intuitively, it is easier to learn that one action has better consequences than another, than it is to learn the actual return from taking the action. $A^\pi$ represents a relative advantage of actions through the simple relationship $A^\pi = Q^\pi - V^\pi$, and is also closely related to the baseline method of variance reduction within gradient-based policy search methods [164]. The idea of advantage updates has been utilised in many recent DRL algorithms [157, 40, 85, 123].
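As a small numerical illustration of the relationship $A^\pi = Q^\pi - V^\pi$, using a hypothetical tabular $Q^\pi$ and a stochastic policy (so that $V^\pi(s) = \sum_a \pi(a \mid s) Q^\pi(s, a)$):
\begin{verbatim}
import numpy as np

# Hypothetical tabular Q^pi, shape [num_states, num_actions].
Q = np.array([[1.0, 3.0],
              [2.0, 2.5]])
# Stochastic policy pi(a|s); V^pi(s) = sum_a pi(a|s) * Q^pi(s, a).
pi = np.array([[0.5, 0.5],
               [0.9, 0.1]])
V = (pi * Q).sum(axis=1, keepdims=True)
A = Q - V  # A^pi = Q^pi - V^pi: relative, rather than absolute, action values
print(A)   # the policy-weighted average of each row is zero
\end{verbatim}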
C. Policy Search
Policy search methods do not need to maintain a value function model, but directly search for an optimal policy $\pi^*$. Typically, a parameterised policy $\pi_\theta$ is chosen, whose parameters are updated to maximise the expected return $\mathbb{E}[R \mid \theta]$ using either gradient-based or gradient-free optimisation [26].
Neural networks that encode policies have been successfully
trained using both gradient-free [37, 23, 64] and gradient-
based [164, 163, 46, 79, 122, 123, 74] methods. Gradient-free
optimisation can effectively cover low-dimensional parameter spaces, but despite some successes in applying such methods to large networks [64], gradient-based training remains the method of
choice for most DRL algorithms, being more sample-efficient