Reinforcement Learning Fundamentals: An Overview of Theory and Algorithms

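Updated 2024-07-15 · 652 KB PDF

"Reinforcement Learning - Theory and Algorithms.pdf is a resource on the foundations of reinforcement learning. It covers the basic concepts of reinforcement learning, preliminaries on MDPs (Markov decision processes), sample complexity with a generative model, strategic exploration, and policy gradient methods."

Reinforcement learning studies how an agent learns an optimal behavior policy by interacting with an environment. The book opens with the **Markov decision process (MDP)**, the basic framework of reinforcement learning. An MDP is a dynamical system whose state transition probabilities depend on the current state and the chosen action; at each time step the agent takes an action and receives feedback (a reward) from the environment.

The **interaction protocol** describes how the agent and the environment interact: the agent observes the current state and selects an action, then the environment moves to a new state and returns a reward. The **objective, policies, and value functions** are the core concepts of an MDP: the agent's objective is to maximize long-run cumulative reward, a policy specifies how the agent selects actions, and a value function measures the expected return of a policy.

The **Bellman equations** are a central tool in MDP theory. They come as consistency equations (for a fixed policy) and optimality equations (characterizing the best policy). **Q-value iteration** and **policy iteration** are two standard planning algorithms for computing an optimal policy of an MDP.

Next, the part on **sample complexity** examines how many samples an agent needs to learn a good policy when a generative model is available. It compares the naive approach of **accurate model estimation** with a more refined approach that uses a **sparse model**, and discusses lower bounds.

The **strategic exploration** chapters likely address the balance between exploration and exploitation, that is, between acquiring new information and using existing knowledge. **Policy gradient methods** are an optimization technique widely used in modern reinforcement learning, in which the agent adjusts policy parameters to maximize expected return. The book covers policy gradient methods in detail, including the optimization process, **softmax policies**, and **relative entropy regularization**, as well as the **natural policy gradient**, a more effective method that accounts for the geometry of the policy parameter space. The book offers a comprehensive theoretical foundation for reinforcement learning and its algorithms, and suits both beginners and researchers who want a deeper understanding of the field.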


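The proof is completed by noting that $(P^{\pi'} - P^{\pi}) Q^{\pi} \le 0$. To see this, observe that:

$$
[(P^{\pi'} - P^{\pi}) Q^{\pi}]_{s,a} = E_{s' \sim P(\cdot|s,a)} \left[ Q^{\pi}(s', \pi'(s')) - Q^{\pi}(s', \pi(s')) \right] \le 0
$$

where we use $\pi = \pi_Q$ in the last step.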

1.2 Planning in MDPs

Planning refers to the problem of computing $\pi^\star$ given the MDP specification $M = (S, A, P, r, \gamma)$. This section reviews classical planning algorithms that compute $Q^\star$.

1.2.1 Q-Value Iteration

A simple algorithm is to iteratively apply the fixed-point mapping: starting at some $Q$, we iteratively apply $\mathcal{T}$:

$$
Q \leftarrow \mathcal{T} Q.
$$

This algorithm is referred to as Q-value iteration.
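As a minimal sketch of the update above, the following Python code assumes that $\mathcal{T}$ is the Bellman optimality operator, $(\mathcal{T}Q)(s,a) = r(s,a) + \gamma \sum_{s'} P(s'|s,a) \max_{a'} Q(s',a')$. The two-state MDP, the function names, and the stopping tolerance are illustrative assumptions and do not come from the text.

```python
def bellman_optimality_operator(Q, P, r, gamma):
    """Return TQ, where (TQ)(s,a) = r(s,a) + gamma * sum_t P(t|s,a) * max_b Q(t,b)."""
    n_states = len(Q)
    # V_Q(s) = max_a Q(s, a): the value of acting greedily with respect to Q.
    V = [max(Q[s]) for s in range(n_states)]
    return [
        [r[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(n_states))
         for a in range(len(Q[s]))]
        for s in range(n_states)
    ]


def q_value_iteration(P, r, gamma, tol=1e-10, max_iters=10_000):
    """Repeat Q <- TQ until successive iterates differ by less than tol (sup norm)."""
    Q = [[0.0] * len(row) for row in r]
    for _ in range(max_iters):
        Q_next = bellman_optimality_operator(Q, P, r, gamma)
        gap = max(abs(Q_next[s][a] - Q[s][a])
                  for s in range(len(Q)) for a in range(len(Q[s])))
        Q = Q_next
        if gap < tol:
            break
    return Q


# Hypothetical two-state, two-action MDP; P[s][a][t] is the probability of
# moving from state s to state t under action a.
P = [
    [[1.0, 0.0], [0.0, 1.0]],  # state 0: action 0 stays, action 1 moves to state 1
    [[0.0, 1.0], [1.0, 0.0]],  # state 1: action 0 stays, action 1 moves to state 0
]
r = [[0.0, 0.0], [1.0, 0.0]]   # only (state 1, action 0) is rewarded
gamma = 0.9

Q_star = q_value_iteration(P, r, gamma)
# Greedy policy with respect to the computed Q-values.
greedy = [max(range(len(row)), key=row.__getitem__) for row in Q_star]
```

In this toy example the greedy policy moves from state 0 to state 1 and then stays there, which can be checked by hand: staying in state 1 forever earns $1/(1-\gamma) = 10$.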

Lemma 1.5. (contraction) For any two vectors $Q, Q' \in \mathbb{R}^{|S||A|}$,

$$
\|\mathcal{T}Q - \mathcal{T}Q'\|_\infty \le \gamma \|Q - Q'\|_\infty
$$

Proof: First, let us show that for all $s$, $|V_Q(s) - V_{Q'}(s)| \le \max_{a \in A} |Q(s,a) - Q'(s,a)|$. Assume $V_Q(s) > V_{Q'}(s)$ (the other direction is symmetric), and let $a$ be the greedy action for $Q$ at $s$. Then

$$
|V_Q(s) - V_{Q'}(s)| = Q(s,a) - \max_{a' \in A} Q'(s,a') \le Q(s,a) - Q'(s,a) \le \max_{a \in A} |Q(s,a) - Q'(s,a)|.
$$

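Using this,

$$
\begin{aligned}
\|\mathcal{T}Q - \mathcal{T}Q'\|_\infty &= \gamma \|P V_Q - P V_{Q'}\|_\infty \\
&= \gamma \|P (V_Q - V_{Q'})\|_\infty \\
&\le \gamma \|V_Q - V_{Q'}\|_\infty \\
&= \gamma \max_s |V_Q(s) - V_{Q'}(s)| \\
&\le \gamma \max_s \max_a |Q(s,a) - Q'(s,a)| \\
&= \gamma \|Q - Q'\|_\infty
\end{aligned}
$$

where the first inequality uses that each element of $P(V_Q - V_{Q'})$ is a convex average of $V_Q - V_{Q'}$ and the second inequality uses our claim above.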
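The contraction property can be spot-checked numerically. The sketch below draws random MDPs and random pairs of Q-tables and compares the two sides of the inequality, assuming $\mathcal{T}$ is the Bellman optimality operator with $V_Q(s) = \max_a Q(s,a)$. The helper names, problem sizes, and value ranges are illustrative assumptions, and a passing check is only a sanity test, not a substitute for the proof.

```python
import random


def apply_T(Q, P, r, gamma):
    """(TQ)(s,a) = r(s,a) + gamma * sum_t P(t|s,a) * max_b Q(t,b)."""
    n = len(Q)
    V = [max(Q[s]) for s in range(n)]
    return [[r[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(n))
             for a in range(len(Q[s]))] for s in range(n)]


def sup_norm_diff(A, B):
    """Largest absolute entrywise difference between two Q-tables."""
    return max(abs(x - y) for row_a, row_b in zip(A, B)
               for x, y in zip(row_a, row_b))


def random_mdp(n_states, n_actions, rng):
    """Random transition kernel (each row normalized to sum to 1) and rewards in [0, 1]."""
    P, r = [], []
    for _ in range(n_states):
        P_s, r_s = [], []
        for _ in range(n_actions):
            weights = [rng.random() + 1e-3 for _ in range(n_states)]
            total = sum(weights)
            P_s.append([w / total for w in weights])
            r_s.append(rng.random())
        P.append(P_s)
        r.append(r_s)
    return P, r


def contraction_holds(trials=200, n_states=4, n_actions=3, gamma=0.9, seed=0):
    """Return True if ||TQ - TQ'|| <= gamma * ||Q - Q'|| in every random trial."""
    rng = random.Random(seed)
    for _ in range(trials):
        P, r = random_mdp(n_states, n_actions, rng)
        Q1 = [[rng.uniform(-5, 5) for _ in range(n_actions)] for _ in range(n_states)]
        Q2 = [[rng.uniform(-5, 5) for _ in range(n_actions)] for _ in range(n_states)]
        lhs = sup_norm_diff(apply_T(Q1, P, r, gamma), apply_T(Q2, P, r, gamma))
        rhs = gamma * sup_norm_diff(Q1, Q2)
        if lhs > rhs + 1e-9:  # small slack for floating-point rounding
            return False
    return True
```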

The following result bounds the suboptimality of the greedy policy itself, based on the error in the Q-value function.

Lemma 1.6. [Singh and Yee [1994]] For any vector $Q \in \mathbb{R}^{|S||A|}$,

$$
V^{\pi_Q} \ge V^\star - \frac{2 \|Q - Q^\star\|_\infty}{1 - \gamma} \mathbb{1}
$$

where $\mathbb{1}$ denotes the vector of all ones.
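To make the bound concrete, the following sketch perturbs $Q^\star$ on a random MDP, extracts the greedy policy $\pi_Q$, and checks the inequality state by state. It assumes $V^{\pi_Q}$ can be approximated by iterating the Bellman consistency update for a fixed policy. All function names, sizes, and the noise level are hypothetical choices for illustration.

```python
import random


def apply_T(Q, P, r, gamma):
    """Bellman optimality operator on a Q-table."""
    n = len(Q)
    V = [max(Q[s]) for s in range(n)]
    return [[r[s][a] + gamma * sum(P[s][a][t] * V[t] for t in range(n))
             for a in range(len(Q[s]))] for s in range(n)]


def solve_q_star(P, r, gamma, iters=2000):
    """Approximate Q* by applying T many times from the zero table."""
    Q = [[0.0] * len(row) for row in r]
    for _ in range(iters):
        Q = apply_T(Q, P, r, gamma)
    return Q


def evaluate_policy(pi, P, r, gamma, iters=2000):
    """Approximate V^pi for a deterministic policy via V <- r_pi + gamma * P_pi V."""
    n = len(pi)
    V = [0.0] * n
    for _ in range(iters):
        V = [r[s][pi[s]] + gamma * sum(P[s][pi[s]][t] * V[t] for t in range(n))
             for s in range(n)]
    return V


def greedy_bound_holds(seed=0, n_states=5, n_actions=3, gamma=0.9, noise=0.3):
    """Check V^{pi_Q}(s) >= V*(s) - 2 * ||Q - Q*|| / (1 - gamma) for every state s."""
    rng = random.Random(seed)
    # Random MDP: normalized transition rows and rewards in [0, 1].
    P, r = [], []
    for _ in range(n_states):
        P_s = []
        for _ in range(n_actions):
            w = [rng.random() + 1e-3 for _ in range(n_states)]
            total = sum(w)
            P_s.append([x / total for x in w])
        P.append(P_s)
        r.append([rng.random() for _ in range(n_actions)])

    Q_star = solve_q_star(P, r, gamma)
    V_star = [max(row) for row in Q_star]
    # Q is a noisy estimate of Q*; err is its sup-norm error.
    Q = [[q + rng.uniform(-noise, noise) for q in row] for row in Q_star]
    err = max(abs(Q[s][a] - Q_star[s][a])
              for s in range(n_states) for a in range(n_actions))
    pi_Q = [max(range(n_actions), key=lambda a: Q[s][a]) for s in range(n_states)]
    V_pi = evaluate_policy(pi_Q, P, r, gamma)
    bound = 2.0 * err / (1.0 - gamma)
    return all(V_pi[s] >= V_star[s] - bound - 1e-9 for s in range(n_states))
```

The $1/(1-\gamma)$ factor means the guarantee loosens quickly as $\gamma$ approaches 1, so on small examples like this one the observed gap is often far below the bound.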
