Experience Replay for Least-Squares
Policy Iteration
Quan Liu Xin Zhou Fei Zhu Qiming Fu Yuchen Fu
Abstract—Policy iteration, which evaluates and improves the control policy iteratively, is a reinforcement learning method. Policy evaluation with the least-squares method can extract more useful information from the empirical data and therefore improve data efficiency. However, most existing online least-squares policy iteration methods use each sample only once, which results in a low sample utilization rate. To improve the utilization efficiency, we propose an experience replay for least-squares policy iteration (ERLSPI) algorithm and prove its convergence. The ERLSPI method combines online least-squares policy iteration with experience replay: it stores the samples generated online and reuses them in the least-squares update of the control policy. We apply the ERLSPI method to the inverted pendulum system, a typical benchmark problem. The experimental results show that the method can effectively take advantage of previous experience and knowledge, improve the sample utilization efficiency, and accelerate convergence.
Index Terms—Reinforcement learning, experience replay, least-squares, policy iteration.
Manuscript received September 10, 2013; accepted July 23, 2014. This work was supported by the National Natural Science Foundation of China (61303108, 61272005, 61373094, 61103045), the Natural Science Foundation of Jiangsu Province (BK2012616), the Higher Education Natural Science Foundation of Jiangsu Province (13KJB520020), the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172014K04), and the Suzhou Industrial Application of Basic Research Program (SYG201422). Recommended by Associate Editor Warren Dixon.
Citation: Quan Liu, Xin Zhou, Fei Zhu, Qiming Fu, Yuchen Fu. Experience replay for least-squares policy iteration. IEEE/CAA Journal of Automatica Sinica, 2014, 1(3): 274−281.
Quan Liu is with the School of Computer Science and Technology, Soochow University, Jiangsu 215006, China, and also with the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China (e-mail: quanliu@suda.edu.cn).
Xin Zhou is with the School of Computer Science and Technology, Soochow University, Jiangsu 215006, China (e-mail: 504828465@qq.com).
Fei Zhu is with the School of Computer Science and Technology, Soochow University, Jiangsu 215006, China, and also with the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China (e-mail: zhufei@suda.edu.cn).
Qiming Fu and Yuchen Fu are with the School of Computer Science and Technology, Soochow University, Jiangsu 215006, China (e-mail: fqm_1@126.com; yuchenfu@suda.edu.cn).
I. INTRODUCTION
Reinforcement learning (RL) interacts with the environment and learns how to map situations to actions so as to obtain the maximum cumulative reward. The agent constantly tries to find the action that yields the maximum reward. In reinforcement learning, an action affects not only the immediate reward but also the next state and, through it, all subsequent rewards. Reinforcement learning is characterized by trial-and-error search and delayed reward [1−2]. It performs well in complex nonlinear systems with large state spaces and is widely used in process control, task scheduling, robot design, gaming, and many other fields [3−5].
In reinforcement learning, we use value functions to estimate the long-term cumulative reward of a state or of a state-action pair: the V-function evaluates states, while the Q-function evaluates state-action pairs. A policy is a mapping from the state space to the action space; by comparing value functions, the agent seeks the optimal policy and eventually reaches the goal [6].
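For concreteness, the two value functions under a fixed policy $\pi$ can be sketched with the usual definitions (the discount factor $\gamma$ and the reward sequence $r_{t+1}$ are assumptions of this sketch, not notation introduced so far):
\begin{align}
V^{\pi}(x)   &= \mathrm{E}\bigg[\sum_{t=0}^{\infty}\gamma^{t} r_{t+1} \,\bigg|\, x_{0}=x,\ \pi\bigg], \\
Q^{\pi}(x,u) &= \mathrm{E}\bigg[\sum_{t=0}^{\infty}\gamma^{t} r_{t+1} \,\bigg|\, x_{0}=x,\ u_{0}=u,\ \pi\bigg].
\end{align}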
Policy iteration (PI) is an important reinforcement learning method: its algorithms evaluate the current policy by computing its value functions and then improve the policy greedily with respect to them. Least-squares methods can be used advantageously in reinforcement learning algorithms. Bradtke et al. proposed the least-squares temporal difference (LSTD) algorithm based on the V-function [7−8]. Although it requires more computation per time step than traditional temporal difference (TD) algorithms, it extracts more useful information from the empirical data and therefore improves data efficiency. However, because greedy action selection based on the V-function requires a model of the transition dynamics, which is unavailable in the model-free case, LSTD addresses only prediction problems and cannot be used for control problems [9−11]. Lagoudakis et al. extended it to the least-squares temporal difference for Q-functions (LSTD-Q) [12−13] and proposed the least-squares policy iteration (LSPI) algorithm, which made the approach applicable to control problems. LSPI uses LSTD-Q for policy evaluation and accurately estimates the current policy from all samples collected in advance; it is therefore a typical offline algorithm. Reinforcement learning, however, is also capable of learning from online interaction with the environment. Busoniu et al. extended the offline least-squares policy iteration algorithm to the online case [14] and presented an online least-squares policy iteration (online LSPI) algorithm. This algorithm uses samples generated online by an exploratory policy and improves the policy every few steps. However, it uses each sample only once, which results in a low utilization rate of the empirical data. Therefore, in the early stage of learning, when few samples are available, it is difficult to obtain a good control policy, which leads to poor initial performance and slow convergence.
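To sketch the least-squares machinery underlying LSPI and online LSPI (this is the standard LSTD-Q form; the feature vector $\phi$, parameter vector $\theta$, discount factor $\gamma$, and evaluated policy $\pi$ are assumed notation, not symbols defined so far in this section), the Q-function is approximated linearly and the parameters are obtained from a linear system assembled from the transition samples $(x_i, u_i, r_i, x_i')$:
\begin{align}
\hat{Q}(x,u) &= \phi(x,u)^{\mathrm{T}}\theta, \qquad A\theta = b, \\
A &= \sum_{i=1}^{N}\phi(x_i,u_i)\big(\phi(x_i,u_i)-\gamma\,\phi(x_i',\pi(x_i'))\big)^{\mathrm{T}}, \qquad
b = \sum_{i=1}^{N}\phi(x_i,u_i)\,r_i.
\end{align}
Reusing stored samples amounts to accumulating more terms into $A$ and $b$ before solving, which is the property exploited by experience replay below.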
Experience replay (ER) methods can reuse prior empirical data and therefore reduce the number of samples required to obtain a good policy [15]: the samples generated online are stored and later replayed to update the current policy. In this work, combining the ER method with online LSPI, we propose an ER for least-squares policy iteration (ERLSPI) algorithm based on linear least-squares function approximation. The algorithm reuses the sample data collected online and extracts more information from them, improving both the empirical utilization rate and the convergence speed.
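As an illustrative sketch only, not the paper's exact ERLSPI procedure, the following Python fragment shows the core mechanism of combining an experience replay buffer with an LSTD-Q style least-squares update; the class name, feature map, action set, and regularization term are all assumptions made for this example.

import numpy as np

class ReplayLSTDQ:
    """Illustrative experience-replay + LSTD-Q sketch (not the paper's exact ERLSPI).

    Stores online samples (x, u, r, x') and periodically re-solves the
    least-squares system A * theta = b over the whole replay buffer.
    """

    def __init__(self, feature_fn, n_features, actions, gamma=0.95, reg=1e-3):
        self.phi = feature_fn          # assumed feature map: phi(x, u) -> R^n_features
        self.n = n_features
        self.actions = actions         # assumed finite action set, e.g. [-50, 0, 50]
        self.gamma = gamma
        self.reg = reg                 # small diagonal term to keep A well conditioned
        self.buffer = []               # replay memory of (x, u, r, x_next) tuples
        self.theta = np.zeros(n_features)

    def store(self, x, u, r, x_next):
        """Keep every online sample so it can be replayed later."""
        self.buffer.append((x, u, r, x_next))

    def greedy_action(self, x):
        """Greedy policy with respect to the current linear Q-function."""
        return max(self.actions, key=lambda u: self.phi(x, u) @ self.theta)

    def replay_update(self):
        """Rebuild A and b from ALL stored samples and solve for theta."""
        A = self.reg * np.eye(self.n)
        b = np.zeros(self.n)
        for (x, u, r, x_next) in self.buffer:
            f = self.phi(x, u)
            f_next = self.phi(x_next, self.greedy_action(x_next))
            A += np.outer(f, f - self.gamma * f_next)
            b += f * r
        self.theta = np.linalg.solve(A, b)

In an online setting, the agent would call store() after every transition and replay_update() every few steps, so that early samples keep contributing to later policy improvements instead of being used only once.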
II. RELATED THEORIES
A. Markov Decision Process
A state signal that successfully retains all pertinent information is said to have the Markov property. In reinforcement learning, if a state has the Markov property, then the response of the environment at time t + 1 depends only on the representation of the state and action at time t. A reinforcement learning task that satisfies the Markov property is called a Markov decision process (MDP). An MDP can be defined by four elements: the state space X, the action space U, the transition probability function f, and the reward function