cremental. In the online setting the behavior policy is either identical to the learning policy or is updated once every few transitions. In the offline setting, by contrast, the agent has no control over how the data are generated and is simply provided with a fixed data set of experiences; there the behavior policy is usually stochastic and might be unknown to the agent. Data processing during learning may use a batch or an incremental algorithm. A batch algorithm processing the collected observations can freely access any element at any time. An incremental algorithm continues to learn whenever a new data sample becomes available; in principle its computation need not depend on the whole data set of observations but may rely only on the last sample. One possibility is to alternate between phases of exploration, where a set of training examples is grown by interacting with the system, and phases of learning, where the whole batch of observations is used; this is called the growing batch learning problem. In practice, the growing batch approach is the modeling of choice when applying batch reinforcement learning algorithms to real systems, as sketched below.
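As an illustration, a minimal sketch of this growing batch scheme might look as follows (the `collect_episode` and `fit_batch` callables are hypothetical placeholders, not part of the paper):

```python
def growing_batch_learning(collect_episode, fit_batch, initial_policy,
                           n_phases=10, episodes_per_phase=5):
    """Growing batch RL: alternate exploration phases that grow the
    data set with learning phases that re-fit on the whole batch."""
    data = []                      # all transitions collected so far
    policy = initial_policy
    for _ in range(n_phases):
        # Exploration phase: grow the set of training examples
        # by interacting with the system under the current policy.
        for _ in range(episodes_per_phase):
            data.extend(collect_episode(policy))
        # Learning phase: a batch algorithm can freely access
        # any element of the whole batch at any time.
        policy = fit_batch(data)
    return policy
```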
A central goal of RL is to develop algorithms that learn online, in which case performance should improve once every few transition samples. In this paper we propose and empirically study an online algorithm that evaluates policies with BRM using SVR, called online API-BRM$_\varepsilon$. A crucial difference from the offline case is that policy improvements must be performed once every few samples, before an accurate evaluation of the current policy can be completed, as in the following sketch.
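Schematically, this interleaving of sample collection, evaluation, and improvement can be sketched as follows (all names are illustrative placeholders; this shows only the control flow, not the actual online API-BRM$_\varepsilon$ implementation):

```python
def online_api(env_step, evaluate_brm, improve, policy, s0,
               n_steps=10_000, improve_every=50, epsilon=0.1):
    """Schematic online API loop: collect samples with the current
    policy and improve it once every `improve_every` transitions,
    before the policy has been evaluated to high accuracy."""
    samples, s = [], s0
    for t in range(1, n_steps + 1):
        a = policy(s)                      # behavior = current (soft) policy
        r, s_next = env_step(s, a)         # (r_t, s_{t+1}) ~ P(., . | s_t, a_t)
        samples.append((s, a, r, s_next))
        if t % improve_every == 0:
            q = evaluate_brm(samples, policy)  # placeholder for the SVR-based BRM step
            policy = improve(q, epsilon)       # e.g. epsilon-greedy improvement
        s = s_next
    return policy
```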
Online API-BRM$_\varepsilon$ collects its own samples, which makes exploration necessary, and it does not suffer from sub-optimality: it always finds the solution to the approximation problem. Being a non-parametric learning method, it can automatically adapt to the complexity of the problem through the choice of an appropriate kernel. After describing some theoretical background, we introduce online API-BRM$_\varepsilon$ and provide an experimental evaluation on the inverted pendulum and bicycle balancing problems.
2. Background and notation
A finite-action discounted MDP can be defined as a tuple $(S, A, P, R, \gamma)$, with $S$ a measurable state space and $A$ a finite set of available actions. $P$ is a mapping giving, for each state-action pair, a distribution over $\mathbb{R} \times S$ with marginals $P(\cdot|s,a)$ (the transition probability) and $R(\cdot|s,a)$, which determines the expected immediate reward when the agent makes a transition. At stage $t$ an action $a_t \in A$ is selected by the agent controlling the process and, in response, the pair $(r_t, s_{t+1})$ is drawn from the distribution $P(r, s \mid s_t, a_t)$, i.e. $(r_t, s_{t+1}) \sim P(r, s \mid s_t, a_t)$, where $r_t$ is the reward the agent receives and $s_{t+1}$ the next MDP state.
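As a toy illustration (hypothetical names, not from the paper), a transition can be drawn from a small generative model of $P$ like this:

```python
import random

# Toy generative model: for each (s, a), a list of
# (probability, reward, next_state) outcomes defining P(r, s' | s, a).
P = {
    (0, 'left'):  [(1.0, 0.0, 0)],
    (0, 'right'): [(0.8, 1.0, 1), (0.2, 0.0, 0)],
    (1, 'left'):  [(1.0, 0.0, 0)],
    (1, 'right'): [(1.0, 5.0, 1)],
}

def sample_transition(s, a):
    """Draw (r_t, s_{t+1}) ~ P(r, s | s_t, a_t)."""
    outcomes = P[(s, a)]
    probs = [p for p, _, _ in outcomes]
    _, r, s_next = random.choices(outcomes, weights=probs)[0]
    return r, s_next
```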
An agent in RL is usually assumed to be very simple, consisting mainly of an action selection policy such that $a_t = \pi(s_t)$. More generally, a stationary stochastic policy maps states to distributions over the action space, with $\pi_t(a|s)$ denoting the probability that the agent selects action $a$ in state $s$ at time $t$. Stochastic policies are also called soft when they do not commit to a single action per state. An $\varepsilon$-greedy policy is a soft policy which, for some $0 \le \varepsilon \le 1$, deterministically picks a particular action with probability $1 - \varepsilon$ and a uniformly random action with probability $\varepsilon$. We will then use $a \sim \pi(\cdot|s)$ to indicate that action $a$ is chosen according to this probability function in state $s$.
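A minimal sketch of this $\varepsilon$-greedy selection rule (assuming some action value function `q` is given):

```python
import random

def epsilon_greedy(q, s, actions, epsilon):
    """Pick a fixed (greedy) action w.r.t. q with probability
    1 - epsilon, and a uniformly random action with probability epsilon."""
    if random.random() < epsilon:
        return random.choice(actions)           # explore
    return max(actions, key=lambda a: q(s, a))  # exploit

# Usage: a = epsilon_greedy(q, s, actions=[0, 1, 2], epsilon=0.1)
```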
For an agent following the policy $\pi$, consider the sequence of rewards $\{r_t : t \ge 1\}$ obtained when the MDP is started from a state-action pair $(s_1, a_1) \sim \nu \in M(S \times A)$. The action value function $Q^\pi$ is defined as $Q^\pi(s,a) = \mathbb{E}\left\{\sum_{t=1}^{\infty} \gamma^{t-1} r_t \mid s_1 = s, a_1 = a, \pi\right\}$, where $\gamma \in [0,1)$ is a discount factor.
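This definition can be read directly as a Monte Carlo estimate: roll out $\pi$ from $(s,a)$ and average the truncated discounted return (a sketch reusing the hypothetical `sample_transition` from above):

```python
def mc_action_value(sample_transition, pi, s, a, gamma=0.95,
                    n_rollouts=1000, horizon=200):
    """Monte Carlo estimate of Q^pi(s, a) = E[sum_t gamma^(t-1) r_t]."""
    total = 0.0
    for _ in range(n_rollouts):
        st, at, discount = s, a, 1.0
        for _ in range(horizon):       # truncate the infinite sum
            r, st = sample_transition(st, at)
            total += discount * r
            discount *= gamma
            at = pi(st)                # follow pi thereafter
    return total / n_rollouts
```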
A policy $\pi = \hat{\pi}(\cdot, Q)$ is greedy w.r.t. an action value function