against exploitation, but ultimately these can all be addressed
formally within the framework of RL.
III. REINFORCEMENT LEARNING ALGORITHMS
So far, we have introduced the key formalism used in RL, the MDP, and briefly noted some of the challenges it poses. In the following, we will distinguish between different classes of RL algorithms. There are two main approaches to solving RL problems: methods based on value functions and methods based on policy search. There is also a hybrid, actor-critic approach, which employs both value functions and policy search. We will now explain these approaches and other useful concepts for solving RL problems.
A. Value Functions
Value function methods are based on estimating the value (expected return) of being in a given state. The state-value function $V^\pi(s)$ is the expected return when starting in state $s$ and following $\pi$ henceforth:
\[
V^\pi(s) = \mathbb{E}[R \mid s, \pi] \tag{2}
\]
The optimal policy, $\pi^*$, has a corresponding state-value function $V^*(s)$, and vice versa; the optimal state-value function can be defined as
\[
V^*(s) = \max_\pi V^\pi(s) \quad \forall s \in \mathcal{S}. \tag{3}
\]
If we had $V^*(s)$ available, the optimal policy could be retrieved by choosing among all actions available at $s_t$ and picking the action $a$ that maximises $\mathbb{E}_{s_{t+1} \sim \mathcal{T}(s_{t+1} \mid s_t, a)}[V^*(s_{t+1})]$.
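As a concrete illustration of this retrieval step (a minimal sketch, not taken from the works cited here), suppose a small tabular MDP whose transition probabilities are stored in a NumPy array; the array layout and function name below are illustrative assumptions. Note that the construction requires access to $\mathcal{T}$, which motivates the quality function introduced next.
\begin{verbatim}
import numpy as np

def greedy_policy_from_v(V_star, T):
    """Pick, in every state, the action maximising E_{s'~T(s'|s,a)}[V*(s')].

    V_star: array of shape [num_states]
    T:      array of shape [num_states, num_actions, num_states],
            where T[s, a, s_next] = p(s_next | s, a)  (assumed layout)
    """
    expected_next_value = T @ V_star           # shape [num_states, num_actions]
    return expected_next_value.argmax(axis=1)  # greedy action for every state

# Tiny illustrative MDP with 2 states and 2 actions.
T = np.array([[[0.9, 0.1], [0.1, 0.9]],
              [[0.8, 0.2], [0.2, 0.8]]])
V_star = np.array([0.0, 1.0])
print(greedy_policy_from_v(V_star, T))  # [1 1]: both states prefer action 1
\end{verbatim}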
In the RL setting, the transition dynamics $\mathcal{T}$ are unavailable. Therefore, we construct another function, the state-action-value or quality function $Q^\pi(s, a)$, which is similar to $V^\pi$, except that the initial action $a$ is provided, and $\pi$ is only followed from the succeeding state onwards:
\[
Q^\pi(s, a) = \mathbb{E}[R \mid s, a, \pi]. \tag{4}
\]
The best policy, given $Q^\pi(s, a)$, can be found by choosing $a$ greedily at every state: $\arg\max_a Q^\pi(s, a)$. Under this policy, we can also define $V^\pi(s)$ by maximising $Q^\pi(s, a)$: $V^\pi(s) = \max_a Q^\pi(s, a)$.
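As a minimal illustration with a hypothetical tabular estimate of $Q^\pi$, both the greedy policy and the corresponding state values can be read off the table directly, without access to $\mathcal{T}$:
\begin{verbatim}
import numpy as np

# Hypothetical tabular estimate of Q^pi, shape [num_states, num_actions].
Q = np.array([[0.2, 0.7],
              [0.5, 0.1]])

greedy_policy = Q.argmax(axis=1)  # argmax_a Q(s, a) in every state
V = Q.max(axis=1)                 # V(s) = max_a Q(s, a) under that greedy policy

print(greedy_policy)  # [1 0]
print(V)              # [0.7 0.5]
\end{verbatim}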
Dynamic Programming: To actually learn $Q^\pi$, we exploit the Markov property and define the function as a Bellman equation [13], which has the following recursive form:
\[
Q^\pi(s_t, a_t) = \mathbb{E}_{s_{t+1}}[r_{t+1} + \gamma Q^\pi(s_{t+1}, \pi(s_{t+1}))]. \tag{5}
\]
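The recursion in (5) can be turned into an iterative policy-evaluation backup when a model is available; the following sketch is illustrative only, with assumed array shapes and an expected immediate reward standing in for the expectation over $r_{t+1}$:
\begin{verbatim}
import numpy as np

def evaluate_policy(T, R, pi, gamma=0.9, sweeps=100):
    """Repeatedly apply the Bellman backup of Eq. (5) for a fixed policy pi.

    T:  [S, A, S'] transition probabilities (assumed layout)
    R:  [S, A]     expected immediate reward for taking action a in state s
    pi: [S]        deterministic policy, pi[s] = action chosen in s
    """
    S, A = R.shape
    Q = np.zeros((S, A))
    for _ in range(sweeps):
        next_value = Q[np.arange(S), pi]   # Q(s', pi(s')) for every state s'
        # Q(s, a) <- E_{s'}[ r + gamma * Q(s', pi(s')) ]
        Q = R + gamma * (T @ next_value)
    return Q
\end{verbatim}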
This means that $Q^\pi$ can be improved by bootstrapping, i.e., we can use the current values of our estimate of $Q^\pi$ to improve our estimate. This is the foundation of Q-learning [159] and the state-action-reward-state-action (SARSA) algorithm [112]:
\[
Q^\pi(s_t, a_t) \leftarrow Q^\pi(s_t, a_t) + \alpha\delta, \tag{6}
\]
where $\alpha$ is the learning rate and $\delta = Y - Q^\pi(s_t, a_t)$ the temporal difference (TD) error; here, $Y$ is a target as in a standard regression problem. SARSA, an on-policy learning algorithm, is used to improve the estimate of $Q^\pi$ by using transitions generated by the behavioural policy (the policy derived from $Q^\pi$), which results in setting $Y = r_t + \gamma Q^\pi(s_{t+1}, a_{t+1})$. Q-learning is off-policy, as $Q^\pi$ is instead updated by transitions that were not necessarily generated by the derived policy. Instead, Q-learning uses $Y = r_t + \gamma \max_a Q^\pi(s_{t+1}, a)$, which directly approximates $Q^*$.
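The two targets can be contrasted in a short tabular sketch of the update in (6); this is a minimal illustration rather than a reference implementation, and the array layout is an assumption:
\begin{verbatim}
import numpy as np

def td_update(Q, s, a, r, s_next, a_next=None, alpha=0.1, gamma=0.99):
    """One application of Eq. (6): Q(s, a) <- Q(s, a) + alpha * delta.

    If a_next is supplied, the SARSA (on-policy) target is used:
        Y = r + gamma * Q(s_next, a_next)
    otherwise the Q-learning (off-policy) target is used:
        Y = r + gamma * max_a Q(s_next, a)
    """
    if a_next is not None:               # SARSA
        Y = r + gamma * Q[s_next, a_next]
    else:                                # Q-learning
        Y = r + gamma * Q[s_next].max()
    delta = Y - Q[s, a]                  # temporal difference (TD) error
    Q[s, a] += alpha * delta
    return Q
\end{verbatim}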
To find $Q^*$ from an arbitrary $Q^\pi$, we use generalised policy iteration, where policy iteration consists of policy evaluation and policy improvement. Policy evaluation improves
the estimate of the value function, which can be achieved
by minimising TD errors from trajectories experienced by
following the policy. As the estimate improves, the policy can
naturally be improved by choosing actions greedily based on
the updated value function. Instead of performing these steps
separately to convergence (as in policy iteration), generalised
policy iteration allows for interleaving the steps, such that
progress can be made more rapidly.
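A sketch of generalised policy iteration in this interleaved form is given below, using the Q-learning target together with an $\epsilon$-greedy behavioural policy; the environment interface (reset/step returning next state, reward and a termination flag) and the hyperparameters are assumptions for illustration.
\begin{verbatim}
import numpy as np

def q_learning(env, num_states, num_actions,
               episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
    """Generalised policy iteration: evaluation and improvement interleaved."""
    Q = np.zeros((num_states, num_actions))
    for _ in range(episodes):
        s, done = env.reset(), False
        while not done:
            # Policy improvement (implicit): act (mostly) greedily w.r.t. Q.
            if np.random.rand() < epsilon:
                a = np.random.randint(num_actions)
            else:
                a = int(Q[s].argmax())
            s_next, r, done = env.step(a)  # assumed environment interface
            # Policy evaluation: one bootstrapped TD update (Q-learning target).
            Y = r + gamma * (0.0 if done else Q[s_next].max())
            Q[s, a] += alpha * (Y - Q[s, a])
            s = s_next
    return Q
\end{verbatim}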
B. Sampling
Instead of bootstrapping value functions using dynamic
programming methods, Monte Carlo methods estimate the
expected return (2) from a state by averaging the return from
multiple rollouts of a policy. Because of this, pure Monte Carlo
methods can also be applied in non-Markovian environments.
On the other hand, they can only be used in episodic MDPs,
as a rollout has to terminate for the return to be calculated.
It is possible to get the best of both methods by combining
TD learning and Monte Carlo policy evaluation, as is done in
the TD(λ) algorithm [135]. Similarly to the discount factor,
the λ in TD(λ) is used to interpolate between Monte Carlo
evaluation and bootstrapping. As demonstrated in Figure 3,
this results in an entire spectrum of RL methods based around
the amount of sampling utilised.
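For contrast with the bootstrapped targets above, the following sketch (illustrative, with assumed names) computes the full discounted return of a single completed episode, as used in Monte Carlo evaluation of (2); TD($\lambda$) blends such full returns with one-step bootstrapped targets as $\lambda$ is varied between 0 and 1.
\begin{verbatim}
def discounted_returns(rewards, gamma=0.99):
    """Return R_t = sum_k gamma^k * r_{t+k+1} for every step of one
    completed episode; Monte Carlo evaluation requires termination."""
    returns, G = [], 0.0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.append(G)
    return list(reversed(returns))

# Averaging returns[0] over many rollouts from state s estimates V^pi(s).
print(discounted_returns([0.0, 0.0, 1.0], gamma=0.5))  # [0.25, 0.5, 1.0]
\end{verbatim}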
Another major value-function based method relies on learning the advantage function $A^\pi(s, a)$ [6, 43]. Unlike producing absolute state-action values, as with $Q^\pi$, $A^\pi$ instead represents relative state-action values. Learning relative values is akin to removing a baseline or average level of a signal; more intuitively, it is easier to learn that one action has better consequences than another, than it is to learn the actual return from taking the action. $A^\pi$ represents a relative advantage of actions through the simple relationship $A^\pi = Q^\pi - V^\pi$, and is also closely related to the baseline method of variance reduction within gradient-based policy search methods [164]. The idea of advantage updates has been utilised in many recent DRL algorithms [157, 40, 85, 123].
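As a small numerical illustration of the relationship $A^\pi = Q^\pi - V^\pi$, using a hypothetical tabular $Q^\pi$ and a stochastic policy (so that $V^\pi(s) = \sum_a \pi(a \mid s) Q^\pi(s, a)$):
\begin{verbatim}
import numpy as np

# Hypothetical tabular Q^pi, shape [num_states, num_actions].
Q = np.array([[1.0, 3.0],
              [2.0, 2.5]])
# Stochastic policy pi(a|s); V^pi(s) = sum_a pi(a|s) * Q^pi(s, a).
pi = np.array([[0.5, 0.5],
               [0.9, 0.1]])
V = (pi * Q).sum(axis=1, keepdims=True)
A = Q - V  # A^pi = Q^pi - V^pi: relative, rather than absolute, action values
print(A)   # the policy-weighted average of each row is zero
\end{verbatim}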
C. Policy Search
Policy search methods do not need to maintain a value function model, but directly search for an optimal policy $\pi^*$. Typically, a parameterised policy $\pi_\theta$ is chosen, whose parameters are updated to maximise the expected return $\mathbb{E}[R \mid \theta]$ using either gradient-based or gradient-free optimisation [26].
Neural networks that encode policies have been successfully
trained using both gradient-free [37, 23, 64] and gradient-
based [164, 163, 46, 79, 122, 123, 74] methods. Gradient-free
optimisation can effectively cover low-dimensional parameter spaces, but despite some successes in applying such methods to large networks [64], gradient-based training remains the method of
choice for most DRL algorithms, being more sample-efficient