Introduction to Reinforcement Learning: The 2017 Final Draft Explained
Reinforcement Learning: An Introduction, second edition (final draft, dated November 5, 2017), by Richard S. Sutton and Andrew G. Barto, is an accessible introductory textbook on reinforcement learning. The text was revised and refined continuously between 2014 and 2017, and it covers the basic concepts of reinforcement learning, worked examples, and a discussion of the field's scope and limitations. Reinforcement learning (RL) is a branch of machine learning concerned with how an agent learns an optimal policy by interacting with its environment so as to maximize cumulative reward. The book first defines the core concepts of reinforcement learning, such as the environment, states, actions, reward functions, and policies, which together constitute the basic elements of a reinforcement learning problem. Practical applications are varied, including board games (such as Go and chess), robot control, and autonomous driving. Through an extended Tic-Tac-Toe example, the authors show how reinforcement learning improves its decisions by repeated trial and learning, giving readers an intuitive picture of how it works. Chapter 1 also surveys the history of reinforcement learning, noting early contributors and milestone events, which is helpful for understanding how the field developed. Although the text is complete, further revisions are still planned, such as adding new case studies to Chapter 16, thoroughly checking the references, and adding an index; the authors encourage readers to report errors or omissions so they can be corrected before the final printed version. The book is a Bradford Book published by The MIT Press (Cambridge, Massachusetts, and London, England) and is dedicated to the memory of A. Harry Klopf. It aims to give readers a comprehensive and approachable foundation in reinforcement learning, and it is a valuable resource for researchers and engineers entering the field.
Summary of Notation

$|S|$  number of elements in set $S$
$t$  discrete time step
$T$, $T(t)$  final time step of an episode, or of the episode including time step $t$
$A_t$  action at time $t$
$S_t$  state at time $t$, typically due, stochastically, to $S_{t-1}$ and $A_{t-1}$
$R_t$  reward at time $t$, typically due, stochastically, to $S_{t-1}$ and $A_{t-1}$
$\pi$  policy, decision-making rule
$\pi(s)$  action taken in state $s$ under deterministic policy $\pi$
$\pi(a|s)$  probability of taking action $a$ in state $s$ under stochastic policy $\pi$
$\pi(a|s,\theta)$  probability of taking action $a$ in state $s$ given parameter $\theta$
$G_t$  return (cumulative discounted reward) following time $t$ (Section 3.3)
$\bar{G}_{t:h}$  flat return (uncorrected, undiscounted) from $t+1$ to $h$ (Section 5.8)
$G^{\lambda s}_t$  $\lambda$-return, corrected by estimated state values (Section 12.1)
$G^{\lambda a}_t$  $\lambda$-return, corrected by estimated action values (Section 12.1)
$G^{\lambda s}_{t:h}$  truncated, corrected $\lambda$-return, with state values (Section 12.3)
$G^{\lambda a}_{t:h}$  truncated, corrected $\lambda$-return, with action values (Section 12.3)
$p(s',r|s,a)$  probability of transition to state $s'$ with reward $r$, from state $s$ and action $a$
$p(s'|s,a)$  probability of transition to state $s'$, from state $s$ taking action $a$
$r(s,a,s')$  expected immediate reward on transition from $s$ to $s'$ under action $a$
$v_\pi(s)$  value of state $s$ under policy $\pi$ (expected return)
$v_*(s)$  value of state $s$ under the optimal policy
$q_\pi(s,a)$  value of taking action $a$ in state $s$ under policy $\pi$
$q_*(s,a)$  value of taking action $a$ in state $s$ under the optimal policy
$V$, $V_t$  array estimates of state-value function $v_\pi$ or $v_*$
$Q$, $Q_t$  array estimates of action-value function $q_\pi$ or $q_*$
$\delta_t$  temporal-difference error at $t$ (a random variable) (Section 6.1)
$\mathbf{w}$, $\mathbf{w}_t$  $d$-vector of weights underlying an approximate value function
$w_i$, $w_{t,i}$  $i$th component of learnable weight vector
$d$  dimensionality—the number of components of $\mathbf{w}$
$d'$  alternate dimensionality—the number of components of $\theta$
$m$  number of 1s in a sparse binary feature vector
$\hat{v}(s,\mathbf{w})$  approximate value of state $s$ given weight vector $\mathbf{w}$
$v_{\mathbf{w}}(s)$  alternate notation for $\hat{v}(s,\mathbf{w})$
$\hat{q}(s,a,\mathbf{w})$  approximate value of state–action pair $s,a$ given weight vector $\mathbf{w}$
$\mathbf{x}(s)$  vector of features visible when in state $s$
$\mathbf{x}(s,a)$  vector of features visible when in state $s$ taking action $a$
$x_i(s)$, $x_i(s,a)$  $i$th component of vector $\mathbf{x}(s)$ or $\mathbf{x}(s,a)$
$\mathbf{x}_t$  shorthand for $\mathbf{x}(S_t)$ or $\mathbf{x}(S_t,A_t)$
$\mathbf{w}^\top\mathbf{x}$  inner product of vectors, $\mathbf{w}^\top\mathbf{x} \doteq \sum_i w_i x_i$; e.g., $\hat{v}(s,\mathbf{w}) \doteq \mathbf{w}^\top\mathbf{x}(s)$
$\mu(s)$  on-policy distribution over states (Section 9.2)
$\boldsymbol{\mu}$  $|S|$-vector of the $\mu(s)$
$\|\mathbf{x}\|^2_\mu$  $\mu$-weighted norm of any vector $\mathbf{x}(s)$, i.e., $\sum_s \mu(s)x(s)^2$ (Section 11.4)
$\mathbf{v}$, $\mathbf{v}_t$  secondary $d$-vector of weights, used to learn $\mathbf{w}$ (Chapter 11)
$\mathbf{z}_t$  $d$-vector of eligibility traces at time $t$ (Chapter 12)
$\theta$, $\theta_t$  parameter vector of target policy (Chapter 13)
$\pi_\theta$  policy corresponding to parameter $\theta$
$J(\pi)$, $J(\theta)$  performance measure for policy $\pi$ or $\pi_\theta$
$h(s,a,\theta)$  a preference for selecting action $a$ in state $s$ based on $\theta$
$b$  behavior policy selecting actions while learning about target policy $\pi$, or a baseline function $b: S \mapsto \mathbb{R}$ for policy-gradient methods, or a branching factor
$\rho_{t:h}$  importance sampling ratio for time $t$ to time $h$ (Section 5.5)
$\rho_t$  importance sampling ratio for time $t$ alone, $\rho_t = \rho_{t:t}$
$r(\pi)$  average reward (reward rate) for policy $\pi$ (Section 10.3)
$\bar{R}_t$  estimate of $r(\pi)$ at time $t$
$\mathbf{A}$  $d \times d$ matrix $\mathbf{A} \doteq \mathbb{E}\big[\mathbf{x}_t(\mathbf{x}_t - \gamma\mathbf{x}_{t+1})^\top\big]$ (Section 11.4)
$\mathbf{b}$  $d$-dimensional vector $\mathbf{b} \doteq \mathbb{E}[R_{t+1}\mathbf{x}_t]$
$\mathbf{w}_{\mathrm{TD}}$  TD fixed point, $\mathbf{w}_{\mathrm{TD}} \doteq \mathbf{A}^{-1}\mathbf{b}$ (a $d$-vector)
$\mathbf{I}$  identity matrix
$\mathbf{P}$  $|S| \times |S|$ matrix of state-transition probabilities under $\pi$
$\mathbf{D}$  $|S| \times |S|$ diagonal matrix with $\mu(s)$ on its diagonal
$\mathbf{X}$  $|S| \times d$ matrix with $\mathbf{x}(s)$ as its rows
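To make the linear-approximation entries above concrete, here is a minimal NumPy sketch (not from the book) that forms sample estimates of the matrix $\mathbf{A} \doteq \mathbb{E}[\mathbf{x}_t(\mathbf{x}_t - \gamma\mathbf{x}_{t+1})^\top]$ and vector $\mathbf{b} \doteq \mathbb{E}[R_{t+1}\mathbf{x}_t]$ from a batch of transitions and then solves for the TD fixed point $\mathbf{w}_{\mathrm{TD}} \doteq \mathbf{A}^{-1}\mathbf{b}$. The random features and rewards are purely illustrative placeholders, not data from any task in the book.

```python
import numpy as np

rng = np.random.default_rng(0)
d, gamma, n_samples = 4, 0.9, 10_000

# Illustrative sampled transitions: features x_t, x_{t+1}, and rewards R_{t+1}.
x_t = rng.normal(size=(n_samples, d))
x_next = rng.normal(size=(n_samples, d))
rewards = rng.normal(size=n_samples)

# Sample averages:  A ≈ E[x_t (x_t - γ x_{t+1})^T],  b ≈ E[R_{t+1} x_t].
A = (x_t[:, :, None] * (x_t - gamma * x_next)[:, None, :]).mean(axis=0)
b = (rewards[:, None] * x_t).mean(axis=0)

# TD fixed point w_TD = A^{-1} b (assuming A is nonsingular for this sketch),
# and the resulting linear value estimates v̂(s,w) = w^T x(s).
w_td = np.linalg.solve(A, b)
v_hat = x_t @ w_td  # approximate values for the sampled states
print(w_td)
```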
Chapter 1
Introduction
The idea that we learn by interacting with our environment is probably the first to occur to us when
we think about the nature of learning. When an infant plays, waves its arms, or looks about, it has no
explicit teacher, but it does have a direct sensorimotor connection to its environment. Exercising this
connection produces a wealth of information about cause and effect, about the consequences of actions,
and about what to do in order to achieve goals. Throughout our lives, such interactions are undoubtedly
a major source of knowledge about our environment and ourselves. Whether we are learning to drive a
car or to hold a conversation, we are acutely aware of how our environment responds to what we do, and
we seek to influence what happens through our behavior. Learning from interaction is a foundational
idea underlying nearly all theories of learning and intelligence.
In this book we explore a computational approach to learning from interaction. Rather than directly
theorizing about how people or animals learn, we explore idealized learning situations and evaluate the
effectiveness of various learning methods. That is, we adopt the perspective of an artificial intelligence
researcher or engineer. We explore designs for machines that are effective in solving learning problems of
scientific or economic interest, evaluating the designs through mathematical analysis or computational
experiments. The approach we explore, called reinforcement learning, is much more focused on goal-
directed learning from interaction than are other approaches to machine learning.
1.1 Reinforcement Learning
Reinforcement learning is learning what to do—how to map situations to actions—so as to maximize
a numerical reward signal. The learner is not told which actions to take, but instead must discover
which actions yield the most reward by trying them. In the most interesting and challenging cases,
actions may affect not only the immediate reward but also the next situation and, through that, all
subsequent rewards. These two characteristics—trial-and-error search and delayed reward—are the two
most important distinguishing features of reinforcement learning.
Reinforcement learning, like many topics whose names end with “ing,” such as machine learning
and mountaineering, is simultaneously a problem, a class of solution methods that work well on the
problem, and the field that studies this problem and its solution methods. It is convenient to use a
single name for all three things, but at the same time essential to keep the three conceptually separate.
In particular, the distinction between problems and solution methods is very important in reinforcement
learning; failing to make this distinction is the source of many confusions.
We formalize the problem of reinforcement learning using ideas from dynamical systems theory,
specifically, as the optimal control of incompletely-known Markov decision processes. The details of this
formalization must wait until Chapter 3, but the basic idea is simply to capture the most important
aspects of the real problem facing a learning agent interacting over time with its environment to achieve
a goal. A learning agent must be able to sense the state of its environment to some extent and must be
able to take actions that affect the state. The agent also must have a goal or goals relating to the state of
the environment. Markov decision processes are intended to include just these three aspects—sensation,
action, and goal—in their simplest possible forms without trivializing any of them. Any method that
is well suited to solving such problems we consider to be a reinforcement learning method.
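As a concrete, purely hypothetical illustration of this formalization, the sketch below encodes a tiny two-state Markov decision process as a table of transition probabilities $p(s',r|s,a)$, the quantity defined in the notation summary; the particular states, actions, and rewards are invented for illustration only.

```python
# A tiny, made-up MDP encoded as p(s', r | s, a): for each (state, action) pair,
# a list of (next_state, reward, probability) triples whose probabilities sum to 1.
mdp = {
    ("s0", "left"):  [("s0", 0.0, 0.9), ("s1", 1.0, 0.1)],
    ("s0", "right"): [("s1", 1.0, 0.8), ("s0", 0.0, 0.2)],
    ("s1", "left"):  [("s0", 0.0, 1.0)],
    ("s1", "right"): [("s1", 2.0, 1.0)],
}

def expected_reward(state, action):
    """r(s, a): expected immediate reward for taking `action` in `state`."""
    return sum(prob * reward for _, reward, prob in mdp[(state, action)])

# Sanity check: every (s, a) pair defines a proper probability distribution.
for (s, a), outcomes in mdp.items():
    assert abs(sum(p for _, _, p in outcomes) - 1.0) < 1e-9
    print(s, a, "expected reward:", expected_reward(s, a))
```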
Reinforcement learning is different from supervised learning, the kind of learning studied in most
current research in the field of machine learning. Supervised learning is learning from a training set
of labeled examples provided by a knowledgeable external supervisor. Each example is a description of
a situation together with a specification—the label—of the correct action the system should take to
that situation, which is often to identify a category to which the situation belongs. The object of this
kind of learning is for the system to extrapolate, or generalize, its responses so that it acts correctly
in situations not present in the training set. This is an important kind of learning, but alone it is
not adequate for learning from interaction. In interactive problems it is often impractical to obtain
examples of desired behavior that are both correct and representative of all the situations in which the
agent has to act. In uncharted territory—where one would expect learning to be most beneficial—an
agent must be able to learn from its own experience.
Reinforcement learning is also different from what machine learning researchers call unsupervised
learning, which is typically about finding structure hidden in collections of unlabeled data. The terms
supervised learning and unsupervised learning would seem to exhaustively classify machine learning
paradigms, but they do not. Although one might be tempted to think of reinforcement learning as a
kind of unsupervised learning because it does not rely on examples of correct behavior, reinforcement
learning is trying to maximize a reward signal instead of trying to find hidden structure. Uncovering
structure in an agent’s experience can certainly be useful in reinforcement learning, but by itself does
not address the reinforcement learning problem of maximizing a reward signal. We therefore consider
reinforcement learning to be a third machine learning paradigm, alongside supervised learning and
unsupervised learning and perhaps other paradigms as well.
One of the challenges that arise in reinforcement learning, and not in other kinds of learning, is the
trade-off between exploration and exploitation. To obtain a lot of reward, a reinforcement learning
agent must prefer actions that it has tried in the past and found to be effective in producing reward.
But to discover such actions, it has to try actions that it has not selected before. The agent has to
exploit what it has already experienced in order to obtain reward, but it also has to explore in order to
make better action selections in the future. The dilemma is that neither exploration nor exploitation
can be pursued exclusively without failing at the task. The agent must try a variety of actions and
progressively favor those that appear to be best. On a stochastic task, each action must be tried many
times to gain a reliable estimate of its expected reward. The exploration–exploitation dilemma has been
intensively studied by mathematicians for many decades, yet remains unresolved. For now, we simply
note that the entire issue of balancing exploration and exploitation does not even arise in supervised
and unsupervised learning, at least in their purest forms.
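The following sketch shows one simple, hypothetical way to act on this trade-off (it is not a method prescribed in this chapter): an ε-greedy rule on a small stochastic bandit task that usually exploits the action with the highest estimated reward but occasionally explores at random, maintaining a sample-average estimate of each action's expected reward. The task, its reward means, and the value of ε are invented for illustration.

```python
import random

random.seed(0)

# Hypothetical stochastic task: three actions with unknown true mean rewards.
true_means = [0.2, 0.5, 0.8]
def pull(action):
    return true_means[action] + random.gauss(0.0, 1.0)

epsilon = 0.1                        # probability of exploring
estimates = [0.0] * len(true_means)  # sample-average reward estimates
counts = [0] * len(true_means)

for step in range(5000):
    if random.random() < epsilon:                        # explore: random action
        action = random.randrange(len(true_means))
    else:                                                # exploit: current best estimate
        action = max(range(len(true_means)), key=lambda a: estimates[a])
    reward = pull(action)
    counts[action] += 1
    # Incremental sample-average update of this action's estimated value.
    estimates[action] += (reward - estimates[action]) / counts[action]

print(estimates)  # estimates approach the true means; the best action is tried most often
```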
Another key feature of reinforcement learning is that it explicitly considers the whole problem of a
goal-directed agent interacting with an uncertain environment. This is in contrast to many approaches
that consider subproblems without addressing how they might fit into a larger picture. For example, we
have mentioned that much of machine learning research is concerned with supervised learning without
explicitly specifying how such an ability would finally be useful. Other researchers have developed
theories of planning with general goals, but without considering planning’s role in real-time decision
making, or the question of where the predictive models necessary for planning would come from. Although
these approaches have yielded many useful results, their focus on isolated subproblems is a
significant limitation.