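Infinite Horizon Reinforcement Learning and Optimal Control

"RL_MONOGRAPH5.pdf is a chapter from the textbook on reinforcement learning and optimal control by Professor Dimitri P. Bertsekas of the Massachusetts Institute of Technology. The chapter focuses on infinite horizon reinforcement learning. It is still a draft, may contain errors, and lacks complete literature references; the author welcomes feedback and suggestions. The latest revision date is April 1, 2019. The chapter covers approximation in value space and its performance bounds, limited lookahead, rollout, approximate policy iteration, fitted value iteration, and simulation-based policy iteration with parametric approximation."

Reinforcement learning is an important branch of artificial intelligence in which an agent improves its policy by trial and error while interacting with an environment. Infinite horizon reinforcement learning addresses how to learn and make decisions effectively when no termination time is fixed in advance.

5.1.1. Limited lookahead: the agent chooses the current action by considering only a finite number of future steps rather than the entire future. This lowers the computational cost but may sacrifice some performance.

5.1.2. Rollout: a technique that estimates the long-term effect of a policy by following it for a finite number of steps. It can be viewed as an approximation to evaluating the policy over the full horizon, and it is used in practice to trade off computation against accuracy.

5.1.3. Approximate policy iteration: when the state space is too large to store or compute the value function of every state exactly, policy iteration can be carried out with approximations. Approximate policy iteration combines policy improvement with approximate updates of the value function in order to handle large-scale problems.

5.2. Fitted value iteration: a method that uses a set of samples to train a function approximator (such as a neural network) to fit the value function. It can handle continuous state spaces, and the approximation can be improved iteratively.

5.3. Simulation-based policy iteration with parametric approximation: when policies are parameterized, the agent can learn and adjust the parameters to optimize its long-term return. Self-learning systems and actor-critic systems are examples of this approach.

5.3.1. Self-learning and actor-critic systems: a self-learning system updates the policy and the value function together, whereas an actor-critic system uses two interacting components: the actor updates the policy and the critic evaluates it, so that the policy can be improved.

These topics form a foundation of reinforcement learning theory and practice and are important for understanding how to apply it to real problems. Limited lookahead, rollout, and the approximation methods all aim to find near-optimal policies under computational constraints.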

12 Inﬁnite Horizon Reinforcement Learning Chap. 5

Figure 5.1.4 Illustration of typical behavior of approximate PI (horizontal axis: PI index $k$; vertical axis: $J_{\mu^k}$ relative to $J^*$, with the error zone of width $(\epsilon + 2\alpha\delta)/(1-\alpha)^2$ marked). In the early iterations, the method tends to make rapid and fairly monotonic progress, until $J_{\mu^k}$ gets within an error zone of size less than $(\epsilon + 2\alpha\delta)/(1-\alpha)^2$. After that $J_{\mu^k}$ oscillates randomly within that zone.

$$\max_{i=1,\ldots,n} \big| J_{\mu^k}(i) - J^*(i) \big|$$

becomes less or equal to

$$\frac{\epsilon + 2\alpha\delta}{(1-\alpha)^2}$$

asymptotically as $k \to \infty$.

The preceding performance bound is not particularly useful in practical terms. Significantly, however, it is in qualitative agreement with the empirical behavior of approximate PI. In the beginning, the method tends to make rapid and fairly monotonic progress, but eventually it gets into an oscillatory pattern. This happens after $J_{\mu^k}$ gets within an error zone of size $(\epsilon + 2\alpha\delta)/(1-\alpha)^2$ or smaller, and then $J_{\mu^k}$ oscillates fairly randomly within that zone; see Fig. 5.1.4. In practice, the error bound of Prop. 5.1.4 tends to be pessimistic, so the zone of oscillation is usually much narrower than what is suggested by the bound. However, the bound itself can be proved to be tight in the worst case. This is shown with an example in the book [BeT96], Section 6.2.3. Note also that the bound of Prop. 5.1.4 holds for discounted problems with infinite state and control spaces, when there are infinitely many policies (see [Ber18a], Prop. 2.4.3).
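To get a feel for how pessimistic the bound can be, the following minimal sketch evaluates the error zone width $(\epsilon + 2\alpha\delta)/(1-\alpha)^2$ for a few discount factors. It assumes the standard reading of the bound in [BeT96], in which $\delta$ is the policy evaluation error and $\epsilon$ the policy improvement error; the function name and the sample values are illustrative only.

```python
def approx_pi_error_zone(eps: float, delta: float, alpha: float) -> float:
    """Return (eps + 2*alpha*delta) / (1 - alpha)**2, the asymptotic error zone width.

    eps and delta are assumed to be the policy improvement and policy
    evaluation error levels; alpha is the discount factor.
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("discount factor alpha must lie in [0, 1)")
    return (eps + 2.0 * alpha * delta) / (1.0 - alpha) ** 2


# Illustrative values: exact policy improvement, evaluation error of 0.1.
for alpha in (0.5, 0.9, 0.99):
    width = approx_pi_error_zone(eps=0.0, delta=0.1, alpha=alpha)
    print(f"alpha={alpha}: error zone width <= {width:.1f}")
```

With these sample values the width grows from 0.4 at $\alpha = 0.5$ to 18 at $\alpha = 0.9$ and about 1980 at $\alpha = 0.99$, which shows how quickly the guarantee loosens as $\alpha$ approaches 1.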

Performance Bound for the Case Where Policies Converge

Generally, the policy sequence $\{\mu^k\}$ generated by approximate PI may

Sec. 5.2 Fitted Value Iteration 15

where the policy $\tilde\mu^k$ is obtained from the minimization

$$\tilde\mu^k(i) \in \arg\min_{u \in U(i)} \sum_{j=1}^n p_{ij}(u) \big( g(i,u,j) + \alpha \tilde J_k(j) \big).$$
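The minimization above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the array layout `p[u, i, j]` and `g[u, i, j]`, the function name `greedy_policy`, and the simplification that every control is admissible at every state are not taken from the text.

```python
import numpy as np


def greedy_policy(p, g, alpha, J):
    """One-step lookahead policy with respect to the cost vector J.

    p[u, i, j]: probability of moving from state i to j under control u.
    g[u, i, j]: cost of that transition.
    Returns, for each state i, a control index attaining the minimum of
    sum_j p[u, i, j] * (g[u, i, j] + alpha * J[j]).
    """
    q = np.einsum("uij,uij->ui", p, g + alpha * J[None, None, :])
    return q.argmin(axis=0)


# Tiny illustration: control 0 stays put, control 1 switches state;
# any transition that lands in state 1 costs 1.
p = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [1.0, 0.0]]])
g = np.zeros((2, 2, 2))
g[:, :, 1] = 1.0
J_tilde = np.array([0.0, 10.0])
policy = greedy_policy(p, g, alpha=0.9, J=J_tilde)
print(policy)
```

Here state 1 is expensive under `J_tilde`, so the resulting policy stays in state 0 and switches out of state 1.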



It turns out that such estimates are possible, but under assumptions whose validity may be hard to guarantee. In particular, it is natural to assume that the error in generating the value iterates $(T\tilde J_k)(i)$ is within some $\delta > 0$ for every state $i$ and iteration $k$, i.e., that

$$\max_{i=1,\ldots,n} \Big| \tilde J_{k+1}(i) - \min_{u \in U(i)} \sum_{j=1}^n p_{ij}(u) \big( g(i,u,j) + \alpha \tilde J_k(j) \big) \Big| \le \delta. \tag{5.19}$$

It is then possible to show that asymptotically, as $k \to \infty$, the cost error (5.17) becomes less or equal to $\delta/(1-\alpha)$, while the policy error (5.18) becomes less or equal to $2\delta/(1-\alpha)^2$.
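The cost error bound $\delta/(1-\alpha)$ can be checked numerically with a minimal sketch. It assumes a small randomly generated discounted problem and models the approximation error as uniform noise of magnitude at most $\delta$ added to each exact VI iterate, so that condition (5.19) holds by construction. The problem data, the noise model, and all names are illustrative assumptions.

```python
import numpy as np


def bellman(p, g, alpha, J):
    """Exact VI operator: (TJ)(i) = min_u sum_j p[u,i,j] * (g[u,i,j] + alpha*J[j])."""
    return np.einsum("uij,uij->ui", p, g + alpha * J[None, None, :]).min(axis=0)


def perturbed_vi(p, g, alpha, delta, iters, rng):
    """VI in which every iterate deviates from T J_k by at most delta per state."""
    J = np.zeros(p.shape[1])
    for _ in range(iters):
        noise = rng.uniform(-delta, delta, size=J.shape)
        J = bellman(p, g, alpha, J) + noise
    return J


rng = np.random.default_rng(0)
n, m, alpha, delta = 5, 3, 0.9, 0.05
p = rng.random((m, n, n))
p /= p.sum(axis=2, keepdims=True)   # normalize rows into probabilities
g = rng.random((m, n, n))

# Reference solution J* from many exact VI iterations.
J_star = np.zeros(n)
for _ in range(2000):
    J_star = bellman(p, g, alpha, J_star)

J_tilde = perturbed_vi(p, g, alpha, delta, iters=500, rng=rng)
cost_error = np.max(np.abs(J_tilde - J_star))
print(f"cost error {cost_error:.4f} vs bound {delta / (1 - alpha):.4f}")
```

The check relies on the contraction property of $T$: each iteration shrinks the previous error by the factor $\alpha$ and adds at most $\delta$, which gives the limit $\delta/(1-\alpha)$.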

Such error bounds are given in Section 6.5.3 of the book [BeT96] (see also Prop. 2.5.3 of [Ber12]), but it is important to note that the condition (5.19) may not be satisfied by the natural least squares regression/fitted VI scheme of Section 3.3. This is illustrated by the following simple example from [TsV96] (see also [BeT96], Section 6.5.3), which shows that the errors from successive approximate value iterations can accumulate to the point where the condition (5.19) cannot be maintained, and the approximate value iterates $\tilde J_k$ can grow unbounded.

Example 5.2.1 (Error Amplification in Approximate Value Iteration)

Consider a two-state discounted problem with states 1 and 2, and a single policy. The transitions are deterministic: from state 1 to state 2, and from state 2 to state 2. The transitions are also cost-free; see Fig. 5.2.1. Thus the Bellman equation is

$$J(1) = \alpha J(2), \qquad J(2) = \alpha J(2),$$

and its unique solution is $J^*(1) = J^*(2) = 0$. Moreover, exact VI has the form

$$J_{k+1}(1) = \alpha J_k(2), \qquad J_{k+1}(2) = \alpha J_k(2).$$

We consider a VI approach that approximates cost functions within the one-dimensional subspace of linear functions $S = \{(r, 2r) \mid r \in \Re\}$; this is a favorable choice since the optimal cost function $J^* = (0, 0)$ belongs to $S$. We use a weighted least squares regression scheme. In particular, given $\tilde J_k = (r_k, 2r_k)$, we find $\tilde J_{k+1} = (r_{k+1}, 2r_{k+1})$ as follows; see Fig. 5.2.2:

(a) We compute the exact VI iterate from $\tilde J_k$: $T\tilde J_k = \big( \alpha \tilde J_k(2), \alpha \tilde J_k(2) \big) = (2\alpha r_k, 2\alpha r_k)$.
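The regression step that maps this iterate back into $S$ is not specified in the text above. As a hedged sketch of how amplification can arise, the code below assumes equal weights on the two states; the weights, the function names, and the starting value $r_0 = 1$ are assumptions made for illustration. Under that assumption, minimizing $(r - 2\alpha r_k)^2 + (2r - 2\alpha r_k)^2$ over $r$ gives $r_{k+1} = (6\alpha/5)\, r_k$, so the iterates shrink when $\alpha < 5/6$ and grow without bound when $\alpha > 5/6$, even though exact VI converges to $J^* = (0, 0)$ for every $\alpha < 1$.

```python
import numpy as np

PHI = np.array([1.0, 2.0])  # feature values defining S = {(r, 2r)}


def fitted_vi_step(r, alpha, weights=(1.0, 1.0)):
    """One fitted VI step in the two-state example.

    The exact VI iterate from (r, 2r) is (2*alpha*r, 2*alpha*r); it is then
    projected onto S by weighted least squares. Equal weights are an
    assumption of this sketch, not taken from the text.
    """
    target = np.array([2.0 * alpha * r, 2.0 * alpha * r])
    w = np.asarray(weights, dtype=float)
    # Closed-form minimizer of sum_i w_i * (PHI_i * r_new - target_i)**2.
    return float(np.sum(w * PHI * target) / np.sum(w * PHI * PHI))


def run(alpha, r0=1.0, iters=50):
    """Iterate the fitted VI step and return the final parameter r."""
    r = r0
    for _ in range(iters):
        r = fitted_vi_step(r, alpha)
    return r


for alpha in (0.8, 0.9):
    print(f"alpha={alpha}: r after 50 steps = {run(alpha):.3f}")
```

Passing different `weights` changes the growth factor, which is one way to explore how the choice of regression weights affects stability in this example.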
