Getting Started with Reinforcement Learning: An Overview of Sutton and Barto's Second Edition


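Reinforcement Learning: An Introduction (second edition), by Richard S. Sutton and Andrew G. Barto, is a classic text in artificial intelligence. It covers reinforcement learning, a machine learning approach in which an agent learns, by interacting with its environment, how to make decisions that maximize long-term reward. The authors lay out the basic concepts, the theoretical framework, and example applications of the field.

1. Introduction to reinforcement learning: The book frames reinforcement learning as learning in an uncertain environment. An agent tries different actions, receives feedback from the environment (usually rewards or penalties), and uses that feedback to improve its behavior policy. The correct action for each state does not have to be specified in advance; the agent improves through trial and error.

2. Examples and elements: The book is rich in examples. Tic-tac-toe serves as the introductory example, showing how a learner can arrive at a good strategy by repeatedly trying different moves. The chapter also covers the basic concepts of value functions, policies, state spaces, action spaces, and reward functions, which underpin the design and analysis of reinforcement learning algorithms.

3. Limitations and scope: The authors point out limitations of reinforcement learning: high-dimensional state spaces can be hard to handle, and learning can require a large amount of trial and error. The book also discusses how reinforcement learning differs from supervised and unsupervised learning, and when it is the most suitable choice.

4. Further exploration: The tic-tac-toe example is extended near the end of Chapter 1, where readers can look more closely at how reinforcement learning applies to more complex strategic problems. The example shows how learning from play gradually raises the player's win rate.

5. Historical review: The book reviews early work related to reinforcement learning, including the research of A. Harry Klopf, which laid foundations for the modern field. The authors encourage readers to consult the historical literature to understand where current techniques come from.

6. Version updates and feedback: At the time of this draft the second edition was nearly complete, but it might still gain one more case study and a final index. The authors welcome reports of errors and omissions, as well as suggestions for relevant citations, so that corrections can be made before printing.

Overall, the second edition is a thorough and practical guide for researchers, engineers, and students who want to understand the principles and applications of reinforcement learning, including its real-world uses and open challenges.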

Summary of Notation

|S|    number of elements in set S
t    discrete time step
T, T(t)    final time step of an episode, or of the episode including time step t
A_t    action at time t
S_t    state at time t, typically due, stochastically, to S_{t−1} and A_{t−1}
R_t    reward at time t, typically due, stochastically, to S_{t−1} and A_{t−1}
π    policy (decision-making rule)
π(s)    action taken in state s under deterministic policy π
π(a|s)    probability of taking action a in state s under stochastic policy π
π(a|s, θ)    probability of taking action a in state s given parameter vector θ
G_t    return (cumulative discounted reward) following time t (Section 3.3)
G_{t:t+n}, G_{t:h}    n-step return from t + 1 to h (discounted and corrected)
Ḡ_{t:h}    flat return (undiscounted and uncorrected) from t + 1 to h (Section 5.8)
G^λ_t    λ-return (Section 12.1)
G^λ_{t:h}    truncated, corrected λ-return (Section 12.3)
G^{λs}_t    λ-return, corrected by estimated state values (Section 12.8)
G^{λa}_t    λ-return, corrected by estimated action values (Section 12.8)
p(s′, r|s, a)    probability of transition to state s′ with reward r, from state s and action a
p(s′|s, a)    probability of transition to state s′, from state s taking action a
r(s, a, s′)    expected immediate reward on transition from s to s′ under action a
v_π(s)    value of state s under policy π (expected return)
v_*(s)    value of state s under the optimal policy
q_π(s, a)    value of taking action a in state s under policy π
q_*(s, a)    value of taking action a in state s under the optimal policy
V, V_t    array estimates of state-value function v_π or v_*
Q, Q_t    array estimates of action-value function q_π or q_*
Q̄_t    expected approximate action value, e.g., Q̄_t = Σ_a π(a|S_t) Q_{t−1}(S_t, a)
U_t    target for estimate at time t
δ_t    temporal-difference error at t (a random variable) (Section 6.1)
w, w_t    d-vector of weights underlying an approximate value function
w_i, w_{t,i}    ith component of learnable weight vector
d    dimensionality: the number of components of w
d′    alternate dimensionality: the number of components of θ
m    number of 1s in a sparse binary feature vector
v̂(s, w)    approximate value of state s given weight vector w
v_w(s)    alternate notation for v̂(s, w)
q̂(s, a, w)    approximate value of state–action pair s, a given weight vector w
x(s)    vector of features visible when in state s
x(s, a)    vector of features visible when in state s taking action a
x_i(s), x_i(s, a)    ith component of vector x(s) or x(s, a)
x_t    shorthand for x(S_t) or x(S_t, A_t)
w^T x    inner product of vectors, w^T x = Σ_i w_i x_i; e.g., v̂(s, w) = w^T x(s)
μ(s)    on-policy distribution over states (Section 9.2)
μ    |S|-vector of the μ(s) for all s ∈ S
‖x‖²_μ    μ-weighted squared norm of any vector x(s), i.e., ‖x‖²_μ = Σ_s μ(s) x(s)² (Section 11.4)
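Two of the entries above are defined by short formulas: the linear approximate value v̂(s, w) = w^T x(s) and the μ-weighted norm Σ_s μ(s) x(s)². The sketch below evaluates both on small made-up vectors. The function names `v_hat` and `mu_weighted_sq_norm` and all numeric values are illustrative assumptions, not taken from the book.

```python
def v_hat(w, x):
    """Linear approximate value: the inner product w^T x(s)."""
    if len(w) != len(x):
        raise ValueError("w and x must have the same dimensionality d")
    return sum(w_i * x_i for w_i, x_i in zip(w, x))


def mu_weighted_sq_norm(mu, x):
    """mu-weighted squared norm: sum over states s of mu(s) * x(s)^2."""
    if len(mu) != len(x):
        raise ValueError("mu and x must have one entry per state")
    return sum(m * x_s * x_s for m, x_s in zip(mu, x))


# Hypothetical weight vector and feature vector with d = 3 components.
w = [0.5, -1.0, 2.0]
x_s = [1.0, 0.0, 1.0]
print(v_hat(w, x_s))  # 0.5*1 + (-1)*0 + 2*1 = 2.5

# Hypothetical distribution over |S| = 3 states and a per-state vector.
mu = [0.5, 0.25, 0.25]
x = [2.0, 0.0, -2.0]
print(mu_weighted_sq_norm(mu, x))  # 0.5*4 + 0.25*0 + 0.25*4 = 3.0
```

Plain lists keep the example dependency-free; with larger feature vectors an array library would be the more natural choice.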

Chapter 1

Introduction

The idea that we learn by interacting with our environment is probably the first to occur to us when we think about the nature of learning. When an infant plays, waves its arms, or looks about, it has no explicit teacher, but it does have a direct sensorimotor connection to its environment. Exercising this connection produces a wealth of information about cause and effect, about the consequences of actions, and about what to do in order to achieve goals. Throughout our lives, such interactions are undoubtedly a major source of knowledge about our environment and ourselves. Whether we are learning to drive a car or to hold a conversation, we are acutely aware of how our environment responds to what we do, and we seek to influence what happens through our behavior. Learning from interaction is a foundational idea underlying nearly all theories of learning and intelligence.

In this book we explore a computational approach to learning from interaction. Rather than directly theorizing about how people or animals learn, we primarily explore idealized learning situations and evaluate the effectiveness of various learning methods. That is, we adopt the perspective of an artificial intelligence researcher or engineer. We explore designs for machines that are effective in solving learning problems of scientific or economic interest, evaluating the designs through mathematical analysis or computational experiments. The approach we explore, called reinforcement learning, is much more focused on goal-directed learning from interaction than are other approaches to machine learning.

1.1 Reinforcement Learning

Reinforcement learning is learning what to do (how to map situations to actions) so as to maximize a numerical reward signal. The learner is not told which actions to take, but instead must discover which actions yield the most reward by trying them. In the most interesting and challenging cases, actions may affect not only the immediate reward but also the next situation and, through that, all subsequent rewards. These two characteristics, trial-and-error search and delayed reward, are the two most important distinguishing features of reinforcement learning.
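To make delayed reward concrete, the book's notation uses a return G_t, the cumulative discounted reward following time t. The sketch below computes such a return for a short reward sequence. The discount rate `gamma` is the standard symbol but is not defined in this excerpt, and the reward sequence is a made-up example.

```python
def discounted_return(rewards, gamma):
    """Return G_t = R_{t+1} + gamma*R_{t+2} + gamma^2*R_{t+3} + ...

    `rewards` lists R_{t+1}, R_{t+2}, ... up to the end of the episode.
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    g = 0.0
    # Work backwards so each step applies G_k = R_{k+1} + gamma * G_{k+1}.
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical episode: no reward for three steps, then a reward of 1.
rewards = [0.0, 0.0, 0.0, 1.0]
print(discounted_return(rewards, gamma=0.5))  # 0.125
print(discounted_return(rewards, gamma=1.0))  # 1.0
```

With `gamma` below 1, a reward that arrives later counts for less, so an action taken now is credited for its eventual consequences, but at a discount.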

Reinforcement learning, like many topics whose names end with “ing,” such as machine learning and mountaineering, is simultaneously a problem, a class of solution methods that work well on the problem, and the field that studies this problem and its solution methods. It is convenient to use a single name for all three things, but at the same time essential to keep the three conceptually separate. In particular, the distinction between problems and solution methods is very important in reinforcement learning; failing to make this distinction is the source of many confusions.

Footnote: The relationships to psychology and neuroscience are summarized in Chapters 14 and 15.

We formalize the problem of reinforcement learning using ideas from dynamical systems theory, specifically, as the optimal control of incompletely-known Markov decision processes. The details of this formalization must wait until Chapter 3, but the basic idea is simply to capture the most important aspects of the real problem facing a learning agent interacting over time with its environment to achieve a goal. A learning agent must be able to sense the state of its environment to some extent and must be able to take actions that affect the state. The agent also must have a goal or goals relating to the state of the environment. Markov decision processes are intended to include just these three aspects (sensation, action, and goal) in their simplest possible forms without trivializing any of them. Any method that is well suited to solving such problems we consider to be a reinforcement learning method.
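The three aspects named above (sensation, action, and goal) can be sketched as a small interaction loop. Everything here is an illustrative assumption rather than the book's formalization: the `Corridor` environment, its `reset`/`step` interface, and the two policies are hypothetical names chosen for this example.

```python
import random


class Corridor:
    """Hypothetical environment: positions 0..4, with the goal at position 4."""

    def __init__(self, length=5):
        self.length = length
        self.state = 0

    def reset(self):
        self.state = 0
        return self.state

    def step(self, action):
        """Apply action (-1 = left, +1 = right); return (state, reward, done)."""
        self.state = min(max(self.state + action, 0), self.length - 1)
        done = self.state == self.length - 1
        reward = 1.0 if done else 0.0  # the goal, expressed as a reward signal
        return self.state, reward, done


def run_episode(env, policy, max_steps=1000):
    """Sense the state, act, observe the consequence, and repeat."""
    state = env.reset()
    total_reward = 0.0
    for t in range(max_steps):
        action = policy(state)                  # action
        state, reward, done = env.step(action)  # sensation of the new state
        total_reward += reward
        if done:
            return total_reward, t + 1
    return total_reward, max_steps


def always_right(state):
    return 1


rng = random.Random(0)


def random_policy(state):
    return rng.choice([-1, 1])


print(run_episode(Corridor(), always_right))  # (1.0, 4)
print(run_episode(Corridor(), random_policy))  # reaches the goal more slowly, if at all
```

The loop itself involves no learning. It only fixes the interface within which a learning method would have to choose `action` from `state`.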

Reinforcement learning is different from supervised learning, the kind of learning studied in most current research in the field of machine learning. Supervised learning is learning from a training set of labeled examples provided by a knowledgeable external supervisor. Each example is a description of a situation together with a specification (the label) of the correct action the system should take in that situation, which is often to identify a category to which the situation belongs. The object of this kind of learning is for the system to extrapolate, or generalize, its responses so that it acts correctly in situations not present in the training set. This is an important kind of learning, but alone it is not adequate for learning from interaction. In interactive problems it is often impractical to obtain examples of desired behavior that are both correct and representative of all the situations in which the agent has to act. In uncharted territory, where one would expect learning to be most beneficial, an agent must be able to learn from its own experience.

Reinforcement learning is also different from what machine learning researchers call unsupervised learning, which is typically about finding structure hidden in collections of unlabeled data. The terms supervised learning and unsupervised learning would seem to exhaustively classify machine learning paradigms, but they do not. Although one might be tempted to think of reinforcement learning as a kind of unsupervised learning because it does not rely on examples of correct behavior, reinforcement learning is trying to maximize a reward signal instead of trying to find hidden structure. Uncovering structure in an agent’s experience can certainly be useful in reinforcement learning, but by itself does not address the reinforcement learning problem of maximizing a reward signal. We therefore consider reinforcement learning to be a third machine learning paradigm, alongside supervised learning and unsupervised learning and perhaps other paradigms as well.

One of the challenges that arise in reinforcement learning, and not in other kinds of learning, is the trade-off between exploration and exploitation. To obtain a lot of reward, a reinforcement learning agent must prefer actions that it has tried in the past and found to be effective in producing reward. But to discover such actions, it has to try actions that it has not selected before. The agent has to exploit what it has already experienced in order to obtain reward, but it also has to explore in order to make better action selections in the future. The dilemma is that neither exploration nor exploitation can be pursued exclusively without failing at the task. The agent must try a variety of actions and progressively favor those that appear to be best. On a stochastic task, each action must be tried many times to gain a reliable estimate of its expected reward. The exploration–exploitation dilemma has been intensively studied by mathematicians for many decades, yet remains unresolved. For now, we simply note that the entire issue of balancing exploration and exploitation does not even arise in supervised and unsupervised learning, at least in their purest forms.
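One simple way to mix the two behaviors is an epsilon-greedy rule: with a small probability pick an action at random (explore), otherwise pick the action with the highest current estimate (exploit). The passage above does not name this rule; it is used here only as a minimal sketch. The function name, the Gaussian reward model, and the arm means are hypothetical.

```python
import random


def epsilon_greedy_bandit(true_means, epsilon, steps, seed=0):
    """Run epsilon-greedy on a Gaussian bandit; return (estimates, counts)."""
    rng = random.Random(seed)
    k = len(true_means)
    estimates = [0.0] * k  # running estimate of each action's expected reward
    counts = [0] * k       # how many times each action has been tried
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(k)                           # explore
        else:
            action = max(range(k), key=lambda a: estimates[a])  # exploit
        reward = rng.gauss(true_means[action], 1.0)  # stochastic reward
        counts[action] += 1
        # Incremental sample average: new = old + (reward - old) / n
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts


estimates, counts = epsilon_greedy_bandit([0.0, 1.0, 3.0], epsilon=0.1, steps=5000)
print(counts)     # the arm with mean 3.0 should receive most of the pulls
print(estimates)  # its estimate should be close to 3.0
```

Because rewards are noisy, each estimate becomes reliable only after many pulls, which is why the rule keeps exploring at a fixed rate instead of committing to the first action that looks good.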

Another key feature of reinforcement learning is that it explicitly considers the whole problem of a goal-directed agent interacting with an uncertain environment. This is in contrast to many approaches that consider subproblems without addressing how they might fit into a larger picture. For example, we have mentioned that much of machine learning research is concerned with supervised learning without explicitly specifying how such an ability would finally be useful. Other researchers have developed theories of planning with general goals, but without considering planning’s role in real-time decision making, or the question of where the predictive models necessary for planning would come from. Although these approaches have yielded many useful results, their focus on isolated subproblems is a significant limitation.
