Experience Replay for Least-Squares
Policy Iteration
Quan Liu Xin Zhou Fei Zhu Qiming Fu Yuchen Fu
Abstract—Policy iteration, which evaluates and improves the control policy iteratively, is a reinforcement learning method. Policy evaluation with the least-squares method can extract more useful information from the empirical data and therefore improve data efficiency. However, most existing online least-squares policy iteration methods use each sample only once, which results in a low sample utilization rate. To improve the utilization efficiency, we propose an experience replay for least-squares policy iteration (ERLSPI) algorithm and prove its convergence. The ERLSPI method combines online least-squares policy iteration with experience replay: it stores the samples generated online and reuses them in the least-squares update of the control policy. We apply the ERLSPI method to the inverted pendulum system, a typical benchmark problem. The experimental results show that the method can effectively take advantage of previous experience and knowledge, improve the sample utilization efficiency, and accelerate convergence.
Index Terms—Reinforcement learning, experience replay, least-squares, policy iteration.
Manuscript received September 10, 2013; accepted July 23, 2014. This work was supported by the National Natural Science Foundation of China (61303108, 61272005, 61373094, 61103045), the Natural Science Foundation of Jiangsu Province (BK2012616), the Higher Education Natural Science Foundation of Jiangsu Province (13KJB520020), the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University (93K172014K04), and the Suzhou Industrial Application of Basic Research Program (SYG201422). Recommended by Associate Editor Warren Dixon.
Citation: Quan Liu, Xin Zhou, Fei Zhu, Qiming Fu, Yuchen Fu. Experience replay for least-squares policy iteration. IEEE/CAA Journal of Automatica Sinica, 2014, 1(3): 274−281.
Quan Liu is with the School of Computer Science and Technology, Soochow University, Jiangsu 215006, China, and also with the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China (e-mail: quanliu@suda.edu.cn).
Xin Zhou is with the School of Computer Science and Technology, Soochow University, Jiangsu 215006, China (e-mail: 504828465@qq.com).
Fei Zhu is with the School of Computer Science and Technology, Soochow University, Jiangsu 215006, China, and also with the Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012, China (e-mail: zhufei@suda.edu.cn).
Qiming Fu and Yuchen Fu are with the School of Computer Science and Technology, Soochow University, Jiangsu 215006, China (e-mail: fqm_1@126.com; yuchenfu@suda.edu.cn).
I. INTRODUCTION
Reinforcement learning (RL) interacts with the environment and learns how to map situations to actions so as to obtain the maximum cumulative reward. The agent constantly tries to find the action that yields the maximum reward. In reinforcement learning, an action affects not only the immediate reward but also the next state and, through it, all subsequent rewards. Reinforcement learning is characterized by trial-and-error search and delayed reward [1−2]. It performs well in complex nonlinear systems with large state spaces and is widely used in process control, task scheduling, robot design, gaming, and many other fields [3−5].
In reinforcement learning, we use value functions to estimate the long-term cumulative reward of a state or of a state-action pair: the V-function evaluates states, while the Q-function evaluates state-action pairs. A policy is a mapping from the state space to the action space; by comparing value functions, the agent seeks the optimal policy and eventually reaches the goal [6].
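For concreteness, the two value functions under a fixed policy $\pi$ can be sketched with the usual definitions (the discount factor $\gamma$ and the reward sequence $r_{t+1}$ are assumptions of this sketch, not notation introduced so far):
\begin{align}
V^{\pi}(x)   &= \mathrm{E}\bigg[\sum_{t=0}^{\infty}\gamma^{t} r_{t+1} \,\bigg|\, x_{0}=x,\ \pi\bigg], \\
Q^{\pi}(x,u) &= \mathrm{E}\bigg[\sum_{t=0}^{\infty}\gamma^{t} r_{t+1} \,\bigg|\, x_{0}=x,\ u_{0}=u,\ \pi\bigg].
\end{align}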
Policy iteration (PI) is an important reinforcement learning method: its algorithms evaluate the current policy by computing its value functions and then improve the policy greedily with respect to them. Least-squares methods can be used advantageously in reinforcement learning algorithms. Bradtke et al. proposed the least-squares temporal difference (LSTD) algorithm based on the V-function [7−8]. Although it requires more computation per time step than traditional temporal difference (TD) algorithms, it extracts more useful information from the empirical data and therefore improves data efficiency. However, because greedy action selection based on the V-function requires a model of the transition dynamics, which is unavailable in the model-free case, LSTD addresses only prediction problems and cannot be used for control problems [9−11]. Lagoudakis et al. extended it to the least-squares temporal difference for Q-functions (LSTD-Q) [12−13] and proposed the least-squares policy iteration (LSPI) algorithm, which made the approach applicable to control problems. LSPI uses LSTD-Q for policy evaluation and accurately estimates the current policy from all samples collected in advance; it is therefore a typical offline algorithm. Reinforcement learning, however, is also capable of learning from online interaction with the environment. Busoniu et al. extended the offline least-squares policy iteration algorithm to the online case [14] and presented an online least-squares policy iteration (online LSPI) algorithm. This algorithm uses samples generated online by an exploratory policy and improves the policy every few steps. However, it uses each sample only once, which results in a low utilization rate of the empirical data. Therefore, in the early stage of learning, when few samples are available, it is difficult to obtain a good control policy, which leads to poor initial performance and slow convergence.
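To sketch the least-squares machinery underlying LSPI and online LSPI (this is the standard LSTD-Q form; the feature vector $\phi$, parameter vector $\theta$, discount factor $\gamma$, and evaluated policy $\pi$ are assumed notation, not symbols defined so far in this section), the Q-function is approximated linearly and the parameters are obtained from a linear system assembled from the transition samples $(x_i, u_i, r_i, x_i')$:
\begin{align}
\hat{Q}(x,u) &= \phi(x,u)^{\mathrm{T}}\theta, \qquad A\theta = b, \\
A &= \sum_{i=1}^{N}\phi(x_i,u_i)\big(\phi(x_i,u_i)-\gamma\,\phi(x_i',\pi(x_i'))\big)^{\mathrm{T}}, \qquad
b = \sum_{i=1}^{N}\phi(x_i,u_i)\,r_i.
\end{align}
Reusing stored samples amounts to accumulating more terms into $A$ and $b$ before solving, which is the property exploited by experience replay below.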
Experience replay (ER) methods can reuse prior empirical data and therefore reduce the number of samples required to obtain a good policy [15]: the samples generated online are stored and later replayed to update the current policy. In this work, combining the ER method with online LSPI, we propose an ER for least-squares policy iteration (ERLSPI) algorithm based on linear least-squares function approximation. The algorithm reuses the sample data collected online and extracts more information from them, improving both the empirical utilization rate and the convergence speed.
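As an illustrative sketch only, not the paper's exact ERLSPI procedure, the following Python fragment shows the core mechanism of combining an experience replay buffer with an LSTD-Q style least-squares update; the class name, feature map, action set, and regularization term are all assumptions made for this example.

import numpy as np

class ReplayLSTDQ:
    """Illustrative experience-replay + LSTD-Q sketch (not the paper's exact ERLSPI).

    Stores online samples (x, u, r, x') and periodically re-solves the
    least-squares system A * theta = b over the whole replay buffer.
    """

    def __init__(self, feature_fn, n_features, actions, gamma=0.95, reg=1e-3):
        self.phi = feature_fn          # assumed feature map: phi(x, u) -> R^n_features
        self.n = n_features
        self.actions = actions         # assumed finite action set, e.g. [-50, 0, 50]
        self.gamma = gamma
        self.reg = reg                 # small diagonal term to keep A well conditioned
        self.buffer = []               # replay memory of (x, u, r, x_next) tuples
        self.theta = np.zeros(n_features)

    def store(self, x, u, r, x_next):
        """Keep every online sample so it can be replayed later."""
        self.buffer.append((x, u, r, x_next))

    def greedy_action(self, x):
        """Greedy policy with respect to the current linear Q-function."""
        return max(self.actions, key=lambda u: self.phi(x, u) @ self.theta)

    def replay_update(self):
        """Rebuild A and b from ALL stored samples and solve for theta."""
        A = self.reg * np.eye(self.n)
        b = np.zeros(self.n)
        for (x, u, r, x_next) in self.buffer:
            f = self.phi(x, u)
            f_next = self.phi(x_next, self.greedy_action(x_next))
            A += np.outer(f, f - self.gamma * f_next)
            b += f * r
        self.theta = np.linalg.solve(A, b)

In an online setting, the agent would call store() after every transition and replay_update() every few steps, so that early samples keep contributing to later policy improvements instead of being used only once.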
II. RELATED THEORIES
A. Markov Decision Process
A state signal that successfully retains all pertinent information is said to have the Markov property. In reinforcement learning, if a state has the Markov property, then the response of the environment at time t + 1 depends only on the representation of the state and action at time t. A reinforcement learning task that satisfies the Markov property is called a Markov decision process (MDP). An MDP can be defined by four elements: the state space X, the action space U, the transition probability function f, and the reward function