
REGAL: A Regularization based Algorithm for Reinforcement
Learning in Weakly Communicating MDPs
Peter L. Bartlett
Computer Science Division, and
Department of Statistics
University of California at Berkeley
Berkeley, CA 94720
Ambuj Tewari
Toyota Technological Institute at Chicago
6045 S. Kenwood Ave.
Chicago, IL 60615
Abstract
We provide an algorithm that achieves the optimal regret rate in an unknown weakly communicating Markov Decision Process (MDP). The algorithm proceeds in episodes where, in each episode, it picks a policy using regularization based on the span of the optimal bias vector. For an MDP with $S$ states and $A$ actions whose optimal bias vector has span bounded by $H$, we show a regret bound of $\tilde{O}(HS\sqrt{AT})$. We also relate the span to various diameter-like quantities associated with the MDP, demonstrating how our results improve on previous regret bounds.
1 INTRODUCTION
In reinforcement learning, an agent interacts with an
environment while trying to maximize the total reward
it accumulates. Markov Decision Processes (MDPs)
are the most commonly used model for the environ-
ment. To every MDP is associated a state space S and
an action space A. Suppose there are S states and A
actions. The parameters of the MDP then consist of
$S \cdot A$ state transition distributions $P_{s,a}$ and $S \cdot A$ rewards $r(s,a)$. When the agent takes action $a$ in state $s$, it receives reward $r(s,a)$ and the probability that it moves to state $s'$ is $P_{s,a}(s')$. The agent does not know the parameters $P_{s,a}$ and $r(s,a)$ of the MDP in advance but has to learn them by directly interacting with the environment. In doing so, it faces the exploration vs. exploitation trade-off that Kearns and Singh [2002] succinctly describe as,
“. . . should the agent exploit its cumulative
experience so far, by executing the action
that currently seems best, or should it exe-
cute a different action, with the hope of gain-
ing information or experience that could lead
to higher future payoffs? Too little explo-
ration can prevent the agent from ever con-
verging to the optimal behavior, while too
much exploration can prevent the agent from
gaining near-optimal payoff in a timely fash-
ion.”
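To make the tabular model above concrete, here is a minimal sketch (not from the paper) of how such an MDP and a single interaction step could be represented; the names S, A, P, r, and step are illustrative choices of ours, not notation fixed by the text.

    import numpy as np

    # Illustrative tabular MDP with S states and A actions (all names here are our own).
    S, A = 4, 2
    rng = np.random.default_rng(0)

    # P[s, a] is the transition distribution P_{s,a}(.) over next states.
    P = rng.random((S, A, S))
    P /= P.sum(axis=2, keepdims=True)  # normalize each (s, a) slice into a probability vector

    # r[s, a] is the reward r(s, a), drawn here from [0, 1).
    r = rng.random((S, A))

    def step(s, a):
        """Take action a in state s: receive r(s, a) and move to s' ~ P_{s,a}."""
        s_next = rng.choice(S, p=P[s, a])
        return r[s, a], s_next

The agent is only allowed to call step; it never sees P or r directly, which is exactly the source of the exploration vs. exploitation dilemma quoted above.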
Suppose the agent uses an algorithm $G$ to choose its actions based on the history of its interactions with the MDP starting from some initial state $s_1$. Denoting the (random) reward obtained at time $t$ by $r_t$, the algorithm's expected reward until time $T$ is
$$ R_G(s_1, T) = \mathbb{E}\!\left[ \sum_{t=1}^{T} r_t \right]. $$
Suppose $\lambda^*$ is the optimal per-step reward. An important quantity used to measure how well $G$ is handling the exploration vs. exploitation trade-off is the regret,
$$ \Delta_G(s_1, T) = \lambda^* T - R_G(s_1, T). $$
If $\Delta_G(s_1, T)$ is $o(T)$ then $G$ is indeed learning something useful about the MDP, since its expected average reward then converges to the optimal value $\lambda^*$ (which can only be computed with knowledge of the MDP parameters) in the limit $T \to \infty$.
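As an illustration (not part of the paper), the following sketch reuses the MDP sketch above to run a fixed deterministic policy and evaluate the empirical counterpart of this regret, $\lambda^* T$ minus the reward actually collected. The value assigned to lambda_star below is a placeholder: computing the true $\lambda^*$ requires the MDP parameters $P$ and $r$.

    def run_policy(policy, s1, T):
        """Follow a deterministic stationary policy (an integer array mapping
        states to actions) for T steps from state s1; return the total reward."""
        s, total = s1, 0.0
        for _ in range(T):
            reward, s = step(s, policy[s])
            total += reward
        return total

    # Hypothetical example: a fixed policy and an assumed optimal per-step reward.
    policy = np.zeros(S, dtype=int)   # always take action 0 (purely illustrative)
    lambda_star = 0.8                 # placeholder; the true lambda* needs P and r
    T = 10_000
    empirical_regret = lambda_star * T - run_policy(policy, s1=0, T=T)

A learning algorithm $G$ would replace the fixed policy with action choices that adapt to the observed history, and the question is how slowly such a quantity can be made to grow with $T$.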
However, asymptotic results are of limited value, and it is therefore desirable to have finite-time bounds on the regret. To obtain such results, one has to work with a restricted class of MDPs. In the theory of MDPs, four fundamental subclasses have been studied. Unless otherwise specified, by a policy we mean a deterministic stationary policy, i.e., simply a map $\pi : S \to A$.
Ergodic Every policy induces a single recurrent class, i.e., it is possible to reach any state from any other state.

Unichain Every policy induces a single recurrent class plus a possibly empty set of transient states.