
REGAL: A Regularization based Algorithm for Reinforcement
Learning in Weakly Communicating MDPs
Peter L. Bartlett
Computer Science Division, and
Department of Statistics
University of California at Berkeley
Berkeley, CA 94720
Ambuj Tewari
Toyota Technological Institute at Chicago
6045 S. Kenwood Ave.
Chicago, IL 60615
Abstract
We provide an algorithm that achieves the optimal regret rate in an unknown weakly communicating Markov Decision Process (MDP). The algorithm proceeds in episodes where, in each episode, it picks a policy using regularization based on the span of the optimal bias vector. For an MDP with $S$ states and $A$ actions whose optimal bias vector has span bounded by $H$, we show a regret bound of $\tilde{O}(HS\sqrt{AT})$. We also relate the span to various diameter-like quantities associated with the MDP, demonstrating how our results improve on previous regret bounds.
1 INTRODUCTION
In reinforcement learning, an agent interacts with an
environment while trying to maximize the total reward
it accumulates. Markov Decision Processes (MDPs)
are the most commonly used model for the environ-
ment. To every MDP is associated a state space S and
an action space A. Suppose there are S states and A
actions. The parameters of the MDP then consist of
$S \cdot A$ state transition distributions $P_{s,a}$ and $S \cdot A$ rewards $r(s,a)$. When the agent takes action $a$ in state $s$, it receives reward $r(s,a)$ and the probability that it moves to state $s'$ is $P_{s,a}(s')$. The agent does not know the parameters $P_{s,a}$ and $r(s,a)$ of the MDP in advance but has to learn them by directly interacting with the environment. In doing so, it faces the exploration vs. exploitation trade-off that Kearns and Singh [2002] succinctly describe as,
“. . . should the agent exploit its cumulative
experience so far, by executing the action
that currently seems best, or should it exe-
cute a different action, with the hope of gain-
ing information or experience that could lead
to higher future payoffs? Too little explo-
ration can prevent the agent from ever con-
verging to the optimal behavior, while too
much exploration can prevent the agent from
gaining near-optimal payoff in a timely fash-
ion.”
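To make the tabular model above concrete, here is a minimal sketch (not from the paper) of how such an MDP and a single interaction step could be represented; the names S, A, P, r, and step are illustrative choices of ours, not notation fixed by the text.

    import numpy as np

    # Illustrative tabular MDP with S states and A actions (all names here are our own).
    S, A = 4, 2
    rng = np.random.default_rng(0)

    # P[s, a] is the transition distribution P_{s,a}(.) over next states.
    P = rng.random((S, A, S))
    P /= P.sum(axis=2, keepdims=True)  # normalize each (s, a) slice into a probability vector

    # r[s, a] is the reward r(s, a), drawn here from [0, 1).
    r = rng.random((S, A))

    def step(s, a):
        """Take action a in state s: receive r(s, a) and move to s' ~ P_{s,a}."""
        s_next = rng.choice(S, p=P[s, a])
        return r[s, a], s_next

The agent is only allowed to call step; it never sees P or r directly, which is exactly the source of the exploration vs. exploitation dilemma quoted above.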
Suppose the agent uses an algorithm $G$ to choose its actions based on the history of its interactions with the MDP starting from some initial state $s_1$. Denoting the (random) reward obtained at time $t$ by $r_t$, the algorithm's expected reward until time $T$ is
$$ R_G(s_1, T) = \mathbb{E}\!\left[ \sum_{t=1}^{T} r_t \right]. $$
Suppose $\lambda^*$ is the optimal per-step reward. An important quantity used to measure how well $G$ is handling the exploration vs. exploitation trade-off is the regret,
$$ \Delta_G(s_1, T) = \lambda^* T - R_G(s_1, T). $$
If $\Delta_G(s_1, T)$ is $o(T)$ then $G$ is indeed learning something useful about the MDP, since its expected average reward then converges to the optimal value $\lambda^*$ (which can only be computed with knowledge of the MDP parameters) in the limit $T \to \infty$.
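As an illustration (not part of the paper), the following sketch reuses the MDP sketch above to run a fixed deterministic policy and evaluate the empirical counterpart of this regret, $\lambda^* T$ minus the reward actually collected. The value assigned to lambda_star below is a placeholder: computing the true $\lambda^*$ requires the MDP parameters $P$ and $r$.

    def run_policy(policy, s1, T):
        """Follow a deterministic stationary policy (an integer array mapping
        states to actions) for T steps from state s1; return the total reward."""
        s, total = s1, 0.0
        for _ in range(T):
            reward, s = step(s, policy[s])
            total += reward
        return total

    # Hypothetical example: a fixed policy and an assumed optimal per-step reward.
    policy = np.zeros(S, dtype=int)   # always take action 0 (purely illustrative)
    lambda_star = 0.8                 # placeholder; the true lambda* needs P and r
    T = 10_000
    empirical_regret = lambda_star * T - run_policy(policy, s1=0, T=T)

A learning algorithm $G$ would replace the fixed policy with action choices that adapt to the observed history, and the question is how slowly such a quantity can be made to grow with $T$.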
However, asymptotic results are of limited value, and it is therefore desirable to have finite-time bounds on the regret. To obtain such results, one has to work with a restricted class of MDPs. In the theory of MDPs, four fundamental subclasses have been studied. Unless otherwise specified, by a policy we mean a deterministic stationary policy, i.e., simply a map $\pi : S \to A$.
Ergodic Every policy induces a single recurrent class, i.e., it is possible to reach any state from any other state.

Unichain Every policy induces a single recurrent class plus a possibly empty set of transient states.