Preface to the First Edition
The overall problem of learning from interaction to achieve goals is still far from being
solved, but our understanding of it has improved significantly. We can now place
component ideas, such as temporal-difference learning, dynamic programming, and function
approximation, within a coherent perspective with respect to the overall problem.
Our goal in writing this book was to provide a clear and simple account of the key
ideas and algorithms of reinforcement learning. We wanted our treatment to be accessible
to readers in all of the related disciplines, but we could not cover all of these perspectives
in detail. For the most part, our treatment takes the point of view of artificial intelligence
and engineering. Coverage of connections to other fields we leave to others or to another
time. We also chose not to produce a rigorous formal treatment of reinforcement learning.
We did not reach for the highest possible level of mathematical abstraction and did not
rely on a theorem–proof format. We tried to choose a level of mathematical detail that
points the mathematically inclined in the right directions without distracting from the
simplicity and potential generality of the underlying ideas.
In some sense we have been working toward this book for thirty years, and we have lots
of people to thank. First, we thank those who have personally helped us develop the overall
view presented in this book: Harry Klopf, for helping us recognize that reinforcement
learning needed to be revived; Chris Watkins, Dimitri Bertsekas, John Tsitsiklis, and
Paul Werbos, for helping us see the value of the relationships to dynamic programming;
John Moore and Jim Kehoe, for insights and inspirations from animal learning theory;
Oliver Selfridge, for emphasizing the breadth and importance of adaptation; and, more
generally, our colleagues and students who have contributed in countless ways: Ron
Williams, Charles Anderson, Satinder Singh, Sridhar Mahadevan, Steve Bradtke, Bob
Crites, Peter Dayan, and Leemon Baird. Our view of reinforcement learning has been
significantly enriched by discussions with Paul Cohen, Paul Utgoff, Martha Steenstrup,
Gerry Tesauro, Mike Jordan, Leslie Kaelbling, Andrew Moore, Chris Atkeson, Tom
Mitchell, Nils Nilsson, Stuart Russell, Tom Dietterich, Tom Dean, and Bob Narendra.
We thank Michael Littman, Gerry Tesauro, Bob Crites, Satinder Singh, and Wei Zhang
for providing specifics of Sections 4.7, 15.1, 15.4, 15.5, and 15.6 respectively. We thank
the Air Force Office of Scientific Research, the National Science Foundation, and GTE
Laboratories for their long and farsighted support.
We also wish to thank the many people who have read drafts of this book and
provided valuable comments, including Tom Kalt, John Tsitsiklis, Pawel Cichosz, Olle
Gällmo, Chuck Anderson, Stuart Russell, Ben Van Roy, Paul Steenstrup, Paul Cohen,
Sridhar Mahadevan, Jette Randlov, Brian Sheppard, Thomas O'Connell, Richard Coggins,
Cristina Versino, John H. Hiett, Andreas Badelt, Jay Ponte, Joe Beck, Justus Piater,
Martha Steenstrup, Satinder Singh, Tommi Jaakkola, Dimitri Bertsekas, Torbjörn Ekman,
Christina Björkman, Jakob Carlström, and Olle Palmgren. Finally, we thank Gwyn
Mitchell for helping in many ways, and Harry Stanton and Bob Prior for being our
champions at MIT Press.