expressive models themselves require substantially more
data, and that otherwise efficient algorithms like Dyna-Q
are vulnerable to poor model approximations.
3. Background
In reinforcement learning, the goal is to learn a policy to
control a system with states x ∈ X and actions u ∈ U
in environment E, so as to maximize the expected sum of
returns according to a reward function r(x, u). The dy-
namical system is defined by an initial state distribution
p(x_1) and a dynamics distribution p(x_{t+1}|x_t, u_t). At each
time step t ∈ [1, T], the agent chooses an action u_t according to its current policy π(u_t|x_t), and observes a reward r(x_t, u_t). The agent then experiences a transition to a
new state sampled from the dynamics distribution, and we
can express the resulting state visitation frequency of the
policy π as ρ^π(x_t). Define R_t = Σ_{i=t}^{T} γ^{(i−t)} r(x_i, u_i); the goal is to maximize the expected sum of returns, given by R = E_{r_{i≥1}, x_{i≥1}∼E, u_{i≥1}∼π}[R_1], where γ is a discount
factor that prioritizes earlier rewards over later ones. With
γ < 1, we can also set T = ∞, though we use a finite hori-
zon for all of the tasks in our experiments. The expected re-
turn R can be optimized using a variety of model-free and
model-based algorithms. In this section, we review several
of these methods that we build on in our work.
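As a concrete illustration of the discounted return R_t defined above, the following sketch computes it for a sampled trajectory with a backward recursion; the array and function names are ours, chosen only for illustration.

```python
import numpy as np

def discounted_returns(rewards, gamma):
    """Compute R_t = sum_{i=t}^{T} gamma^(i-t) * r_i for every step t of a
    sampled trajectory, using the recursion R_t = r_t + gamma * R_{t+1}."""
    returns = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns

# Example: a three-step trajectory of rewards with gamma = 0.99.
print(discounted_returns(np.array([1.0, 0.0, 2.0]), 0.99))  # [2.9602 1.98 2.]
```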
Model-Free Reinforcement Learning. When the sys-
tem dynamics p(x_{t+1}|x_t, u_t) are not known, as is often
the case with physical systems such as robots, policy gra-
dient methods (Peters & Schaal, 2006) and value function
or Q-function learning with function approximation (Sut-
ton et al., 1999) are often preferred. Policy gradient meth-
ods provide a simple, direct approach to RL, which can
succeed on high-dimensional problems, but potentially re-
quires a large number of samples (Schulman et al., 2015;
2016). Off-policy algorithms that use value or Q-function
approximation can in principle achieve better data effi-
ciency (Lillicrap et al., 2016). However, adapting such
methods to continuous tasks typically requires optimizing
two function approximators on different objectives. We in-
stead build on standard Q-learning, which has a single ob-
jective. We summarize Q-learning in this section. The Q
function Q^π(x_t, u_t) corresponding to a policy π is defined as the expected return from x_t after taking action u_t and following the policy π thereafter:
Q^π(x_t, u_t) = E_{r_{i≥t}, x_{i>t}∼E, u_{i>t}∼π}[R_t | x_t, u_t]    (1)
Q-learning learns a greedy deterministic policy
µ(x_t) = arg max_u Q(x_t, u), which corresponds to π(u_t|x_t) = δ(u_t = µ(x_t)). Let θ^Q parametrize the action-value function and β be an arbitrary exploration policy; the learning objective is to minimize the Bellman error, where we fix the target y_t:
L(θ^Q) = E_{x_t∼ρ^β, u_t∼β, r_t∼E}[(Q(x_t, u_t|θ^Q) − y_t)^2]
y_t = r(x_t, u_t) + γQ(x_{t+1}, µ(x_{t+1}))    (2)
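To make the objective in Equation (2) concrete, here is a minimal tabular sketch (our own toy example, with a small discrete state and action space) of a single update that regresses Q(x_t, u_t) toward the fixed target y_t:

```python
import numpy as np

n_states, n_actions, gamma, alpha = 5, 3, 0.99, 0.1
Q = np.zeros((n_states, n_actions))  # tabular stand-in for Q(x, u | theta^Q)

def q_learning_step(x, u, r, x_next):
    """One gradient step on (Q(x, u) - y)^2 with the target
    y = r + gamma * max_u' Q(x_next, u') held fixed."""
    y = r + gamma * Q[x_next].max()          # fixed target y_t
    Q[x, u] -= alpha * 2.0 * (Q[x, u] - y)   # gradient of the squared error

# Example transition (x_t, u_t, r_t, x_{t+1}) gathered by an exploration policy.
q_learning_step(0, 1, 1.0, 2)
```

The inner maximization over actions is trivial here only because the action set is discrete.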
For continuous action problems, Q-learning becomes diffi-
cult, because it requires maximizing a complex, nonlinear
function at each update. For this reason, continuous do-
mains are often tackled using actor-critic methods (Konda
& Tsitsiklis, 1999; Hafner & Riedmiller, 2011; Silver et al.,
2014; Lillicrap et al., 2016), where a separate parame-
terized “actor” policy π is learned in addition to the Q-
function or value function “critic,” such as Deep Determin-
istic Policy Gradient (DDPG) algorithm (Lillicrap et al.,
2016).
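The sketch below illustrates, with our own placeholder architectures and dimensions (and omitting DDPG's target networks and replay buffer), how such actor-critic methods end up optimizing two function approximators on two different objectives:

```python
import torch
import torch.nn as nn

obs_dim, act_dim, gamma = 8, 2, 0.99
critic = nn.Sequential(nn.Linear(obs_dim + act_dim, 64), nn.ReLU(), nn.Linear(64, 1))
actor = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU(), nn.Linear(64, act_dim), nn.Tanh())
critic_opt = torch.optim.Adam(critic.parameters(), lr=1e-3)
actor_opt = torch.optim.Adam(actor.parameters(), lr=1e-4)

def actor_critic_update(x, u, r, x_next):
    # Critic objective: Bellman error as in Equation (2), with the argmax
    # over actions replaced by the actor's action at the next state.
    with torch.no_grad():
        y = r + gamma * critic(torch.cat([x_next, actor(x_next)], dim=-1))
    critic_loss = ((critic(torch.cat([x, u], dim=-1)) - y) ** 2).mean()
    critic_opt.zero_grad(); critic_loss.backward(); critic_opt.step()

    # Actor objective: make the critic's value of the actor's own actions as
    # large as possible -- a separate objective from the critic's.
    actor_loss = -critic(torch.cat([x, actor(x)], dim=-1)).mean()
    actor_opt.zero_grad(); actor_loss.backward(); actor_opt.step()
```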
In order to describe our method in the following sections, it
will be useful to also define the value function V^π(x_t) and advantage function A^π(x_t, u_t) of a given policy π:
V^π(x_t) = E_{r_{i≥t}, x_{i>t}∼E, u_{i≥t}∼π}[R_t | x_t]
A^π(x_t, u_t) = Q^π(x_t, u_t) − V^π(x_t).    (3)
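For intuition, the relationship in Equation (3) can be written out directly in a small discrete example (the array shapes and the uniform stand-in policy below are ours):

```python
import numpy as np

n_states, n_actions = 4, 3
Q_pi = np.random.randn(n_states, n_actions)           # stand-in for Q^pi(x, u)
pi = np.full((n_states, n_actions), 1.0 / n_actions)  # stand-in policy pi(u|x)

# V^pi(x) is the expected Q-value under the policy's own action choice, and
# A^pi(x, u) measures how much better u is than the policy's average action.
V_pi = (pi * Q_pi).sum(axis=1, keepdims=True)  # V^pi(x) = E_{u~pi}[Q^pi(x, u)]
A_pi = Q_pi - V_pi                             # A^pi(x, u) = Q^pi(x, u) - V^pi(x)
```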
Model-Based Reinforcement Learning. If we know the
dynamics p(x_{t+1}|x_t, u_t), or if we can approximate them with some learned model p̂(x_{t+1}|x_t, u_t), we can use
model-based RL and optimal control. While a wide range
of model-based RL and control methods have been pro-
posed in the literature (Deisenroth et al., 2013; Kober &
Peters, 2012), two are particularly relevant for this work:
iterative LQG (iLQG) (Li & Todorov, 2004) and Dyna-
Q (Sutton, 1990). The iLQG algorithm optimizes tra-
jectories by iteratively constructing locally optimal lin-
ear feedback controllers under a local linearization of the
dynamics p̂(x_{t+1}|x_t, u_t) = N(f_{xt} x_t + f_{ut} u_t, F_t) and a quadratic expansion of the rewards r(x_t, u_t) (Tassa et al., 2012). Under linear dynamics and quadratic rewards, the action-value function Q(x_t, u_t) and value function V(x_t) are locally quadratic and can be computed by dynamic programming. The optimal policy can be derived
analytically from the quadratic Q(x_t, u_t) and V(x_t) functions, and corresponds to a linear feedback controller g(x_t) = û_t + k_t + K_t(x_t − x̂_t), where k_t is an open-loop term, K_t is the closed-loop feedback matrix, and x̂_t and û_t are the states and actions of the nominal trajectory, which is the average trajectory of the controller. Employing the maximum entropy objective (Levine & Koltun, 2013), we can also construct a linear-Gaussian controller, where c is a scalar to adjust for arbitrary scaling of the reward magnitudes:
π_t^{iLQG}(u_t|x_t) = N(û_t + k_t + K_t(x_t − x̂_t), −cQ_{u,u,t}^{−1})    (4)
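As a sketch of how the controller in Equation (4) is used at run time, the function below samples an action given quantities that an iLQG backward pass is assumed to have already produced (the nominal trajectory x̂_t, û_t, the gains k_t, K_t, and the Hessian Q_{u,u,t}); the function and argument names are ours:

```python
import numpy as np

def ilqg_action(x_t, x_hat_t, u_hat_t, k_t, K_t, Q_uu_t, c=1.0, rng=None):
    """Sample u_t ~ N(u_hat_t + k_t + K_t (x_t - x_hat_t), -c * inv(Q_uu_t)).
    Assumes Q_uu_t (the Hessian of Q with respect to u at time t) is negative
    definite, so that -c * inv(Q_uu_t) is a valid covariance matrix."""
    rng = np.random.default_rng() if rng is None else rng
    mean = u_hat_t + k_t + K_t @ (x_t - x_hat_t)
    cov = -c * np.linalg.inv(Q_uu_t)
    return rng.multivariate_normal(mean, cov)
```

Taking c → 0 shrinks the covariance to zero and recovers the deterministic linear feedback controller g(x_t) described above.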
When the dynamics are not known, a particularly effective
way to use iLQG is to combine it with learned time-varying