Statistical Reinforcement Learning: A Modern Perspective and Detailed Algorithms


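Statistical Reinforcement Learning, written by Professor Masashi Sugiyama, surveys a wide range of reinforcement learning algorithms from a modern perspective. Its central theme is the use of statistical learning and parameter estimation in reinforcement learning, which ties the different methods together across learning scenarios. The author divides the algorithms into two broad classes: model-free methods and model-based methods.

Model-free methods do not explicitly model the dynamics of the environment; they learn and improve a policy from interaction data. They include value-function-based policy iteration algorithms, which estimate the state-action value function to guide decisions (Q-learning and SARSA are well-known value-based examples), and policy search algorithms such as policy gradient methods, which adjust the policy parameters directly by iterative optimization. These methods make few assumptions about the environment and can handle complex environments, but they may need a large amount of experience to converge to an optimal solution.

Model-based methods instead estimate a model of the environment, that is, the state transitions and rewards of the underlying Markov decision process (MDP), so that transitions and rewards can be predicted more precisely. Their advantage is that the model can be used for planning, which may make learning more efficient; the accuracy and complexity of the model are the usual challenges.

The book is part of the Machine Learning & Pattern Recognition Series from Chapman & Hall/CRC. The series encourages concrete examples, applications, and practical methods alongside theory, and it welcomes neighboring fields such as natural language processing, computer vision, and game AI. Statistical Reinforcement Learning is an accessible textbook and a valuable reference for students and researchers who want to understand the basic principles of reinforcement learning, learn how statistical methods apply to it, and explore how model-based and data-driven strategies can be combined.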

Part I

Introduction

Chapter 1

Introduction to Reinforcement Learning

Reinforcement learning is aimed at controlling a computer agent so that a target task is achieved in an unknown environment.

In this chapter, we first give an informal overview of reinforcement learning in Section 1.1. Then we provide a more formal formulation of reinforcement learning in Section 1.2. Finally, the book is summarized in Section 1.3.

1.1 Reinforcement Learning

A schematic of reinforcement learning is given in Figure 1.1. In an unknown environment (e.g., in a maze), a computer agent (e.g., a robot) takes an action (e.g., to walk) based on its own control policy. Then its state is updated (e.g., by moving forward) and evaluation of that action is given as a “reward” (e.g., praise, neutral, or scolding). Through such interaction with the environment, the agent is trained to achieve a certain task (e.g., getting out of the maze) without explicit guidance. A crucial advantage of reinforcement learning is its non-greedy nature. That is, the agent is trained not to improve performance in a short term (e.g., greedily approaching an exit of the maze), but to optimize the long-term achievement (e.g., successfully getting out of the maze).
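The interaction loop described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the one-dimensional corridor, the names `step`, `random_policy`, and `run_episode`, and the reward of 1.0 at the exit are all hypothetical and do not come from the book.

```python
import random

# Hypothetical one-dimensional "maze": positions 0..4, exit at position 4.
EXIT = 4
ACTIONS = (-1, +1)  # step west or step east

def step(state, action):
    """Environment: return the next state and the reward for this action."""
    next_state = min(max(state + action, 0), EXIT)
    reward = 1.0 if next_state == EXIT else 0.0  # "praise" only at the exit
    return next_state, reward

def random_policy(state, rng):
    """Agent: an untrained control policy that ignores the state."""
    return rng.choice(ACTIONS)

def run_episode(policy, rng, max_steps=1000):
    """Interact until the exit is reached; return the (state, action, reward) history."""
    state, history = 0, []
    for _ in range(max_steps):
        action = policy(state, rng)
        next_state, reward = step(state, action)
        history.append((state, action, reward))
        state = next_state
        if state == EXIT:
            break
    return history

history = run_episode(random_policy, random.Random(0))
print(len(history), "steps, total reward", sum(r for _, _, r in history))
```

Training would replace `random_policy` with a policy that is adjusted using the recorded rewards; the loop itself stays the same.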

A reinforcement learning problem contains various technical components such as states, actions, transitions, rewards, policies, and values. Before going into mathematical details (which will be provided in Section 1.2), we intuitively explain these concepts through illustrative reinforcement learning problems here.

Let us consider a maze problem (Figure 1.2), where a robot agent is located in a maze and we want to guide it to the goal without explicit supervision about which direction to go. States are positions in the maze which the robot agent can visit. In the example illustrated in Figure 1.3, there are 21 states in the maze. Actions are possible directions along which the robot agent can move. In the example illustrated in Figure 1.4, there are 4 actions which correspond to movement toward the north, south, east, and west directions. States and actions are fundamental elements that define a reinforcement learning problem.

FIGURE 1.1: Reinforcement learning. (Diagram labels: Agent, Environment, State, Action, Reward.)

Transitions specify how states are connected to each other through actions (Figure 1.5). Thus, knowing the transitions intuitively means knowing the map of the maze. Rewards specify the incomes/costs that the robot agent receives when making a transition from one state to another by a certain action. In the case of the maze example, the robot agent receives a positive reward when it reaches the goal. More specifically, a positive reward is provided when making a transition from state 12 to state 17 by action “east” or from state 18 to state 17 by action “north” (Figure 1.6). Thus, knowing the rewards intuitively means knowing the location of the goal state. To emphasize the fact that a reward is given to the robot agent right after taking an action and making a transition to the next state, it is also referred to as an immediate reward.
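As a concrete illustration of transitions and immediate rewards, here is a minimal Python sketch. The 2 x 3 maze, its column-wise numbering, the goal at state 5, and the function names are assumptions made for this example; the 21-state maze in the book's figures is not reproduced here.

```python
# Hypothetical 2 x 3 maze, numbered column by column from the top-left:
#   1 3 5
#   2 4 6
ROWS, COLS = 2, 3
GOAL = 5
STATES = range(1, ROWS * COLS + 1)
ACTIONS = ("north", "south", "east", "west")

def transition(state, action):
    """Return the next state; bumping into the outer wall leaves the state unchanged."""
    row, col = (state - 1) % ROWS, (state - 1) // ROWS
    if action == "north" and row > 0:
        row -= 1
    elif action == "south" and row < ROWS - 1:
        row += 1
    elif action == "east" and col < COLS - 1:
        col += 1
    elif action == "west" and col > 0:
        col -= 1
    return col * ROWS + row + 1

def reward(state, action, next_state):
    """Immediate reward: +1 for a transition that enters the goal, 0 otherwise."""
    return 1.0 if next_state == GOAL and state != GOAL else 0.0

# List every rewarded (state, action, next state) triple.
rewarded = [
    (s, a, transition(s, a))
    for s in STATES
    for a in ACTIONS
    if reward(s, a, transition(s, a)) > 0
]
print(rewarded)
```

In this toy maze the only rewarded transitions are "east" from state 3 and "north" from state 6, which mirrors the two rewarded transitions into the goal in the maze example above.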

Under the above setup, the goal of reinforcement learning is to find the policy for controlling the robot agent that allows it to receive the maximum amount of rewards in the long run. Here, a policy specifies an action the robot agent takes at each state (Figure 1.7). Through a policy, a series of states and actions that the robot agent takes from a start state to an end state is specified. Such a series is called a trajectory (see Figure 1.7 again). The sum of immediate rewards along a trajectory is called the return. In practice, rewards that can be obtained in the distant future are often discounted because receiving rewards earlier is regarded as preferable. In the maze task, such a discounting strategy urges the robot agent to reach the goal as quickly as possible.
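The effect of discounting can be checked numerically. The sketch below assumes the common convention in which the reward received at step t is weighted by gamma raised to the power t - 1; the discount factor 0.9 and the two reward sequences are hypothetical values chosen for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum of immediate rewards, with the t-th reward weighted by gamma ** (t - 1)."""
    total = 0.0
    # Accumulate from the last reward backwards: R_t = r_t + gamma * R_{t+1}.
    for r in reversed(rewards):
        total = r + gamma * total
    return total

GAMMA = 0.9  # assumed discount factor

# Immediate rewards along two hypothetical trajectories that both end at the goal.
short_path = [0.0, 0.0, 1.0]            # goal reached on the 3rd step
long_path = [0.0, 0.0, 0.0, 0.0, 1.0]   # goal reached on the 5th step

print("short path:", discounted_return(short_path, GAMMA))
print("long path: ", discounted_return(long_path, GAMMA))
print("undiscounted:", discounted_return(short_path, 1.0), discounted_return(long_path, 1.0))
```

Without discounting (gamma = 1) both trajectories have the same return, so nothing favors the shorter one. With gamma = 0.9 the shorter trajectory scores roughly 0.81 against roughly 0.66, which is why discounting pushes the agent toward the goal quickly.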

To find the optimal policy efficiently, it is useful to view the return as a function of the initial state. This is called the (state-)value. The values can be efficiently obtained via dynamic programming, which is a general method for solving a complex optimization problem by breaking it down into simpler subproblems recursively. With the hope that many subproblems are actually the same, dynamic programming solves such overlapped subproblems only once and reuses the solutions to reduce the computation costs.
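One way to make this concrete is value iteration, a standard dynamic-programming scheme. The excerpt does not name a specific algorithm, so this is a swapped-in illustration rather than the book's procedure. The 2 x 3 maze, the goal at state 5, the reward of 1 for entering the goal, and the discount factor 0.9 are all assumptions.

```python
# Hypothetical 2 x 3 maze, numbered column by column from the top-left:
#   1 3 5
#   2 4 6
ROWS, COLS = 2, 3
GOAL = 5
STATES = range(1, ROWS * COLS + 1)
ACTIONS = ("north", "south", "east", "west")

def transition(state, action):
    """Deterministic move; bumping into the outer wall leaves the state unchanged."""
    row, col = (state - 1) % ROWS, (state - 1) // ROWS
    if action == "north" and row > 0:
        row -= 1
    elif action == "south" and row < ROWS - 1:
        row += 1
    elif action == "east" and col < COLS - 1:
        col += 1
    elif action == "west" and col > 0:
        col -= 1
    return col * ROWS + row + 1

def backup(state, values, gamma):
    """Best one-step lookahead: immediate reward plus discounted neighbor value."""
    candidates = []
    for action in ACTIONS:
        nxt = transition(state, action)
        r = 1.0 if nxt == GOAL else 0.0  # reward for entering the goal
        candidates.append(r + gamma * values[nxt])
    return max(candidates)

def value_iteration(gamma=0.9, tol=1e-9):
    """Reuse stored neighbor values until no state value changes by more than tol."""
    values = {s: 0.0 for s in STATES}  # the goal is terminal, so its value stays 0
    while True:
        delta = 0.0
        for s in STATES:
            if s == GOAL:
                continue
            new_value = backup(s, values, gamma)
            delta = max(delta, abs(new_value - values[s]))
            values[s] = new_value
        if delta < tol:
            return values

values = value_iteration()
for s in STATES:
    print(s, round(values[s], 3))
```

Each state's value is computed from the stored values of its neighbors instead of by enumerating every trajectory from that state, which is the reuse of overlapping subproblems described above. In this toy maze the values decay by a factor of 0.9 for every extra step needed to reach the goal.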

In the maze problem, the value of a state can be computed from the values

of neighboring states. For example, let us compute the value of state 7 (see
