Bertsekas, "Reinforcement Learning Course Notes"


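RLCOURSECOMPLETE.pdf is a set of course notes on reinforcement learning written by Professor Dimitri P. Bertsekas for students at Arizona State University. It is published by Athena Scientific (Belmont, Massachusetts; mailing address Post Office Box 805, Nashua, NH 03060, U.S.A.). The publisher can be reached by email at info@athenasc.com and on the web at <http://www.athenasc.com>.

The notes cover the core theory of exact and approximate dynamic programming (DP), which underlies reinforcement learning (RL). Chapter 1 opens with the off-line training and on-line play of AlphaZero, a program that combines deep reinforcement learning, Monte Carlo tree search, and position evaluation, and that learns through self-play with little or no domain knowledge. Section 1.2, on deterministic dynamic programming, begins with the mathematical formulation of finite horizon problems. There, DP is presented as a method that breaks a complex decision problem into a sequence of subproblems in order to minimize (or maximize) an objective function, by iteratively updating cost functions and policies.

The notes may also treat algorithms such as value iteration, policy iteration, and Q-learning, along with practical issues such as the system model (deterministic or stochastic), the size of the state space, and computational complexity. Later chapters may introduce Markov decision processes (MDP), the standard model for an agent that interacts with a random environment to maximize expected reward, as well as convergence analysis, error analysis, and continuous state and control spaces. The notes are aimed at students, researchers, and practitioners with an interest in artificial intelligence and machine learning.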

Sec. 1.1 AlphaZero, Off-Line Training, and On-Line Play

To understand the overall structure of AlphaZero, and its connection to our DP/RL methodology, it is useful to divide its design into two parts: off-line training, which is an algorithm that learns how to evaluate chess positions, and how to steer itself towards good positions with a default/base chess player, and on-line play, which is an algorithm that generates good moves in real time against a human or computer opponent, using the training it went through off-line. We will next briefly describe these algorithms, and relate them to DP concepts and principles.

Off-Line Training and Policy Iteration

This is the part of the program that learns how to play through off-line self-training, and is illustrated in Fig. 1.1.1. The algorithm generates a sequence of chess players and position evaluators. A chess player assigns “probabilities” to all possible moves at any given chess position (these are the probabilities with which the player selects the possible moves at the given position). A position evaluator assigns a numerical score to any given chess position (akin to a “probability” of winning the game from that position), and thus predicts quantitatively the performance of a player starting from any position. The chess player and the position evaluator are represented by two neural networks, a policy network and a value network, which accept a chess position and generate a set of move probabilities and a position evaluation, respectively.†

In the more conventional DP-oriented terms of these notes, a position is the state of the game, a position evaluator is a cost function that gives (an estimate of) the optimal cost-to-go at a given state, and the chess player is a randomized policy for selecting actions/controls at a given state.‡
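The correspondence between game terms and DP terms can be made concrete with a minimal sketch. Everything in it is an assumption made for illustration: the integer states, the goal state `GOAL`, the two controls, and the probabilities are hypothetical and have nothing to do with chess. It only shows the two kinds of objects involved: a randomized policy that maps a state to control probabilities, and a cost function that maps a state to an estimated cost-to-go.

```python
import random
from typing import Dict

# Hypothetical toy problem: states are integers on a line, controls shift the state.
GOAL = 10  # assumed goal state, for illustration only


def player(state: int) -> Dict[int, float]:
    """Randomized policy: probability of each control at the given state."""
    if state < GOAL:
        return {-1: 0.2, +1: 0.8}
    return {-1: 0.8, +1: 0.2}


def position_evaluator(state: int) -> float:
    """Cost function: a rough estimate of the cost-to-go from the state."""
    return float(abs(GOAL - state))


def select_control(state: int, rng: random.Random) -> int:
    """Sample a control with the probabilities assigned by the player."""
    probs = player(state)
    controls = list(probs)
    weights = [probs[u] for u in controls]
    return rng.choices(controls, weights=weights, k=1)[0]


rng = random.Random(0)
u = select_control(3, rng)  # a control drawn from the player's distribution at state 3
```

In AlphaZero both functions are neural networks; here they are hand-written so that the interfaces are easy to see.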

of Tetris, also based on the method of policy iteration, is described by Scherrer et al. [SGG15], who mention several related antecedent works. For a better understanding of the connections of AlphaZero and AlphaGo Zero with Tesauro’s programs and the concepts developed here, the “Methods” section of the paper [SSS17] is recommended.

† Here the neural networks play the role of function approximators; see Chapter 3. By viewing a player as a function that assigns move probabilities to a position, and a position evaluator as a function that assigns a numerical score to a position, the policy and value networks provide approximations to these functions based on training with data (training algorithms for neural networks and other approximation architectures are also discussed in the RL books [Ber19a], [Ber20a], and the neuro-dynamic programming book [BeT96]).

‡ One more complication is that chess and Go are two-player games, while most of our development will involve single-player optimization. However, DP theory extends to two-player games, although we will not focus on this extension. Alternately, we can consider training a game program to play against a known fixed opponent; this is a one-player setting.

6 Exact and Approximate Dynamic Programming Chap. 1

[Figure 1.1.1: diagram of the training loop. Recoverable labels: Self-Learning/Policy Iteration, Policy Evaluation, Policy Improvement, Neural Network, Value, Policy, “Learned from scratch ... with 4 hours of training!”]

Figure 1.1.1 Illustration of the AlphaZero training algorithm. It generates a sequence of position evaluators and chess players. The position evaluator and the chess player are represented by two neural networks, a value network and a policy network, which accept a chess position and generate a position evaluation and a set of move probabilities, respectively.

The overall training algorithm is a form of policy iteration, a classical DP algorithm that will be of primary interest to us in these notes. Starting from a given player, it repeatedly generates (approximately) improved players, and settles on a final player that is judged empirically to be “best” out of all the players generated.† Policy iteration may be separated conceptually into two stages (see Fig. 1.1.1).

† Quoting from the paper [SSS17]: “The AlphaGo Zero self-play algorithm can similarly be understood as an approximate policy iteration scheme in which MCTS is used for both policy improvement and policy evaluation. Policy improvement starts with a neural network policy, executes an MCTS based on that policy’s recommendations, and then projects the (much stronger) search policy back into the function space of the neural network. Policy evaluation is applied to the (much stronger) search policy: the outcomes of self-play games are also projected back into the function space of the neural network. These projection steps are achieved by training the neural network parameters to match the search probabilities and self-play game outcome respectively.” Note, however, that a two-person game player, trained through self-play, is not guaranteed theoretically to play well against a particular human or computer player.

(a) Policy evaluation: Given the current player and a chess position, the outcome of a game played out from the position provides a single data point. Many data points are thus collected, and are used to train a value network, whose output serves as the position evaluator for that player.

(b) Policy improvement: Given the current player and its position evaluator, trial move sequences are selected and evaluated for the remainder of the game starting from many positions. An improved player is then generated by adjusting the move probabilities of the current player towards the trial moves that have yielded the best results. In AlphaZero this is done with a complicated algorithm called Monte Carlo Tree Search. However, policy improvement can also be done more simply. For example one could try all possible move sequences from a given position, extending forward to a given number of moves, and then evaluate the terminal position with the player’s position evaluator. The move evaluations obtained in this way are used to nudge the move probabilities of the current player towards more successful moves, thereby obtaining data that is used to train a policy network that represents the new player.
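The two stages (a) and (b) can be sketched on a toy problem. This is not the AlphaZero algorithm: there is no MCTS and no neural network, lookup tables stand in for the value and policy networks, and policy improvement uses the simpler alternative described in (b) with a lookahead of one move. The chain of states `0..GOAL`, the unit cost per move, the number of games, and the step size are all hypothetical choices.

```python
import random

GOAL = 4                  # hypothetical terminal state
CONTROLS = (-1, +1)       # hypothetical moves: one step left or right
STATES = range(GOAL + 1)


def step(state, control):
    """Deterministic toy dynamics on 0..GOAL; every move costs 1."""
    return min(GOAL, max(0, state + control)), 1.0


def play_out(policy, state, rng, max_moves=60):
    """Play one game from `state` with the randomized policy; return its cost."""
    total = 0.0
    for _ in range(max_moves):
        if state == GOAL:
            break
        controls = list(policy[state])
        weights = [policy[state][u] for u in controls]
        u = rng.choices(controls, weights=weights)[0]
        state, cost = step(state, u)
        total += cost
    return total


def evaluate_policy(policy, rng, games_per_state=300):
    """(a) Policy evaluation: average the outcomes of games from each state.
    A lookup table stands in for the trained value network."""
    value = {GOAL: 0.0}
    for s in STATES:
        if s != GOAL:
            outcomes = [play_out(policy, s, rng) for _ in range(games_per_state)]
            value[s] = sum(outcomes) / len(outcomes)
    return value


def improve_policy(policy, value, step_size=0.5):
    """(b) Policy improvement: nudge probabilities toward the best trial move."""
    new_policy = {}
    for s in policy:
        q = {}
        for u in CONTROLS:
            nxt, cost = step(s, u)
            q[u] = cost + value[nxt]       # one-move lookahead with the evaluator
        best = min(q, key=q.get)
        new_policy[s] = {u: (1 - step_size) * p + step_size * (u == best)
                         for u, p in policy[s].items()}
    return new_policy


rng = random.Random(0)
policy = {s: {u: 0.5 for u in CONTROLS} for s in STATES if s != GOAL}
for _ in range(5):
    value = evaluate_policy(policy, rng)
    policy = improve_policy(policy, value)
```

Each pass through the loop produces a new player from the previous one. In the state next to the goal, the move toward the goal is always evaluated as best, so its probability rises at every iteration; in the other states the outcome depends on the sampled games.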

Tesauro’s TD-Gammon program [Tes94] is similarly based on approximate policy iteration, but uses a different methodology for approximate policy evaluation [it is based on the TD(λ) algorithm]; see the book [BeT96], Section 8.6, for a detailed description. Moreover, it does not use a policy network and MCTS. It involves only a value network, which replicates the functionality of a policy network by generating moves on-line via a one-step or two-step lookahead minimization.
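A minimal sketch of how a value function alone can generate moves through one-step lookahead minimization follows. It is deterministic and single-player, so it leaves out the dice and the opponent that backgammon involves. The states, controls, unit stage cost, and the hand-written `value_network` are hypothetical stand-ins, not TD-Gammon's actual components.

```python
GOAL = 10                  # hypothetical target state
CONTROLS = (-1, 0, +1)     # hypothetical moves


def transition(state, control):
    """Deterministic toy dynamics."""
    return state + control


def stage_cost(state, control):
    """Every move costs one unit in this toy problem."""
    return 1.0


def value_network(state):
    """Stand-in for a trained value network: estimated cost-to-go."""
    return float(abs(GOAL - state))


def one_step_lookahead(state):
    """Pick the control that minimizes stage cost plus estimated cost-to-go."""
    def q(u):
        return stage_cost(state, u) + value_network(transition(state, u))
    return min(CONTROLS, key=q)


move = one_step_lookahead(3)  # moves toward GOAL
```

No policy is stored anywhere: the move is recomputed from the value function each time a state is encountered, which is the sense in which the value network replicates the functionality of a policy network.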

On-Line Play and Approximation in Value Space - Rollout

Suppose that a “final” player has been obtained through the AlphaZero off-line training process just described. It could then be used in principle to play chess against any human or computer opponent, since it is capable of generating move probabilities at each given chess position using its policy network. In particular, during on-line play, at a given position the player can simply choose the move of highest probability supplied by the off-line trained policy network. This player would play very fast on-line, but it would not play good enough chess to beat strong human opponents. The extraordinary strength of AlphaZero is attained only after the player and its position evaluator obtained from off-line training have been embedded into another algorithm, which we refer to as the “on-line player.” Given the policy network/player obtained off-line and its value network/position evaluator, this algorithm plays as follows (see Fig. 1.1.2).

At a given position, it generates a lookahead tree of all possible multiple move and countermove sequences, up to a given depth. It then runs the off-line obtained player for some more moves, and then evaluates the effect of the remaining moves by using the position evaluator of the off-line obtained value network. Actually the middle portion, called “truncated rollout,” is not used in the published version of AlphaZero/chess [SHS17]; the first portion (multistep lookahead) is quite long and implemented efficiently, so that the rollout portion is not essential. Rollout is used in AlphaGo [SHM16], and plays a very important role in the final version of Tesauro’s backgammon program [TeG96]. The reason is that in backgammon, long multistep lookahead is not possible because of rapid expansion of the lookahead tree with every move.
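The three portions of the on-line player (multistep lookahead, truncated rollout with the off-line obtained player, and terminal cost approximation) can be sketched for a single-player deterministic problem, so there are no countermoves. The toy dynamics, the costs, the `base_policy`, and the deliberately crude `value_approx` below are hypothetical.

```python
TARGET = 5                 # hypothetical target state
CONTROLS = (-1, 0, +1)     # hypothetical moves


def transition(state, control):
    """Deterministic toy dynamics."""
    return state + control


def stage_cost(state, control):
    """Cost of a move: distance of the current state from the target."""
    return abs(state - TARGET)


def base_policy(state):
    """Stand-in for the off-line obtained player: step toward the target."""
    return (TARGET > state) - (TARGET < state)


def value_approx(state):
    """Stand-in for the off-line obtained position evaluator (crude on purpose)."""
    return 2.0 * abs(state - TARGET)


def rollout_cost(state, rollout_moves=2):
    """Truncated rollout: run the base policy, then apply the evaluator."""
    total = 0.0
    for _ in range(rollout_moves):
        u = base_policy(state)
        total += stage_cost(state, u)
        state = transition(state, u)
    return total + value_approx(state)


def lookahead_cost(state, depth):
    """Exhaustive lookahead minimization over `depth` moves, then rollout."""
    if depth == 0:
        return rollout_cost(state)
    return min(stage_cost(state, u) + lookahead_cost(transition(state, u), depth - 1)
               for u in CONTROLS)


def online_move(state, depth=2):
    """Choose the first move of the best lookahead sequence."""
    def q(u):
        return stage_cost(state, u) + lookahead_cost(transition(state, u), depth - 1)
    return min(CONTROLS, key=q)


best = online_move(2)  # first move of the best two-move lookahead from state 2
```

Setting `rollout_moves=0` removes the middle portion and leaves lookahead followed directly by the evaluator, which is the arrangement the text attributes to the published AlphaZero. The lookahead tree grows as `len(CONTROLS) ** depth`, so a longer rollout is the cheaper way to look further ahead.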


[Figure 1.1.2: diagram of the on-line player. Recoverable labels: ON-LINE PLAY, OFF-LINE TRAINING, Current Position, Lookahead Tree, States $x_{k+1}$, States $x_{k+2}$, Truncated Rollout, Off-Line Obtained Player, Off-Line Obtained Cost Approximation, Position Evaluator.]

Figure 1.1.2 Illustration of an on-line player such as the one used in AlphaGo, AlphaZero, and Tesauro’s backgammon program [TeG96]. At a given position, it generates a lookahead tree of multiple moves up to some depth, then runs the off-line obtained player for some more moves, and evaluates the effect of the remaining moves by using the position evaluator of the off-line player.

We should note that the preceding description of AlphaZero and related games is oversimplified. We will be discussing refinements and details as the notes progress. However, DP ideas with cost function approximations, similar to the on-line player illustrated in Fig. 1.1.2, will be central for our purposes. Moreover, the algorithmic division between off-line training and on-line policy implementation will be conceptually very important for our purposes.

Note that the off-line training and the on-line play algorithms may be decoupled and may be designed independently. For example the off-line training portion may be very simple, such as using a simple known policy for rollout without truncation, or without terminal cost approximation. Conversely, a sophisticated process may be used for off-line training of a terminal cost function approximation, which is used immediately following one-step or multistep lookahead in a value space approximation scheme.

In control system design, similar architectures to the ones of AlphaZero and TD-Gammon are employed in model predictive control (MPC). There, the number of steps in lookahead minimization is called the control interval, while the total number of steps in lookahead minimization and truncated rollout is called the prediction interval; see e.g., Magni et al. [MDM01].† The benefit of truncated rollout in providing an economical substitute for longer lookahead minimization is well known within this context.

† The Matlab toolbox for MPC design explicitly allows the user to set these two intervals.

[Figure 1.2.1: diagram of one stage of the system. Recoverable labels: Stage k, Future Stages, Control $u_k$, Cost $g_k(x_k, u_k)$, Terminal Cost $g_N(x_N)$, Deterministic Transition $x_{k+1} = f_k(x_k, u_k)$.]

Figure 1.2.1 Illustration of a deterministic N-stage optimal control problem. Starting from state $x_k$, the next state under control $u_k$ is generated nonrandomly, according to $x_{k+1} = f_k(x_k, u_k)$, and a stage cost $g_k(x_k, u_k)$ is incurred.
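The deterministic N-stage problem of Figure 1.2.1 can be simulated in a few lines. The sketch assumes that the cost of a control sequence is the sum of the stage costs plus a terminal cost at the final state, as the figure's labels suggest. The particular functions `f`, `g`, and `g_terminal` (scalar state, quadratic costs) are hypothetical examples.

```python
def simulate(x0, controls, f, g, g_terminal):
    """Apply controls u_0, ..., u_{N-1} from state x0.

    Returns (total cost, final state), where the total cost is the sum of
    stage costs g(k, x_k, u_k) plus the terminal cost g_terminal(x_N).
    """
    x = x0
    total = 0.0
    for k, u in enumerate(controls):
        total += g(k, x, u)      # stage cost incurred at stage k
        x = f(k, x, u)           # deterministic transition to x_{k+1}
    return total + g_terminal(x), x


# Hypothetical example: scalar state, additive control, quadratic costs.
def f(k, x, u):
    return x + u


def g(k, x, u):
    return x ** 2 + u ** 2


def g_terminal(x):
    return x ** 2


cost, x_final = simulate(2, [-1, -1, 0], f, g, g_terminal)
# Stage costs are 5, 2, 0 and the terminal cost is 0, so cost is 7.0 and x_final is 0.
```

Because the transitions are deterministic, the whole state trajectory is fixed once the initial state and the control sequence are chosen, so comparing control sequences amounts to comparing the values returned by `simulate`.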

Dynamic programming frameworks with cost function approximations that are similar to the on-line player illustrated in Fig. 1.1.2, are also known as approximate dynamic programming, or neuro-dynamic programming, and will be central for our purposes. They will be generically referred to as approximation in value space in these notes.†

1.2 DETERMINISTIC DYNAMIC PROGRAMMING

In all DP problems, the central object is a discrete-time dynamic system that generates a sequence of states under the influence of control. The system may evolve deterministically or randomly (under the additional influence of a random disturbance).

1.2.1 Finite Horizon Problem Formulation

In finite horizon problems the system evolves over a finite number N of time steps (also called stages). The state and control at time k of the system will be generally denoted by $x_k$ and $u_k$, respectively. In deterministic systems, $x_{k+1}$ is generated nonrandomly, i.e., it is determined solely by $x_k$ and $u_k$;

† The names “approximate dynamic programming” and “neuro-dynamic programming” are often used as synonyms to RL. However, RL is generally thought to also subsume the methodology of approximation in policy space, which involves search for optimal parameters within a parametrized set of policies. The search is done with methods that are largely unrelated to DP, such as for example stochastic gradient or random search methods. Approximation in policy space may be used off-line to design a policy that can be used for on-line rollout. It will be discussed very briefly here, but a fuller account that is consistent in terminology with the present notes may be found in Chapter 5 of the RL book [Ber19a].
