is to periodically set the weights of the target network to the current weights of the main network (e.g. Mnih et al. (2015)) or to use a polyak-averaged² (Polyak and Juditsky, 1992) version of the main network instead (Lillicrap et al., 2015).
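A minimal sketch of such a Polyak-averaging update in PyTorch, assuming main and target networks with identical architectures and an assumed averaging coefficient tau:

```python
import torch

def polyak_update(main_net: torch.nn.Module, target_net: torch.nn.Module, tau: float = 0.005) -> None:
    """Exponential moving average of the main network's parameters:
    theta_target <- (1 - tau) * theta_target + tau * theta_main
    """
    with torch.no_grad():
        for p_main, p_target in zip(main_net.parameters(), target_net.parameters()):
            p_target.mul_(1.0 - tau).add_(tau * p_main)
```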
2.3 Deep Deterministic Policy Gradients (DDPG)
Deep Deterministic Policy Gradients (DDPG) (Lillicrap et al., 2015) is a model-free RL algorithm
for continuous action spaces. Here we sketch it only informally; see Lillicrap et al. (2015) for more details. In DDPG we maintain two neural networks: a target policy (also called an actor) π : S → A and an action-value function approximator (called the critic) Q : S × A → R. The critic's job is to approximate the actor's action-value function Q^π.
Episodes are generated using a behavioral policy which is a noisy version of the target policy, e.g. π_b(s) = π(s) + N(0, 1). The critic is trained in a similar way as the Q-function in DQN but the targets y_t are computed using actions outputted by the actor, i.e. y_t = r_t + γQ(s_{t+1}, π(s_{t+1})).
The actor is trained with mini-batch gradient descent on the loss L_a = −E_s Q(s, π(s)), where s is sampled from the replay buffer. The gradient of L_a w.r.t. actor parameters can be computed by backpropagation through the combined critic and actor networks.
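A minimal PyTorch sketch of these two updates, assuming actor/critic networks, their target copies, separate optimizers, and a replay-buffer batch of tensors (all names below are illustrative):

```python
import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, actor_target, critic_target,
                actor_opt, critic_opt, gamma=0.99):
    # batch is assumed to hold tensors sampled from the replay buffer
    s, a, r, s_next = batch["s"], batch["a"], batch["r"], batch["s_next"]

    # Critic: regress Q(s, a) towards the bootstrapped target
    # y = r + gamma * Q_target(s', pi_target(s'))
    with torch.no_grad():
        y = r + gamma * critic_target(s_next, actor_target(s_next))
    critic_loss = F.mse_loss(critic(s, a), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: minimize L_a = -E_s[Q(s, pi(s))]; the gradient flows
    # through the critic into the actor's parameters.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```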
2.4 Universal Value Function Approximators (UVFA)
Universal Value Function Approximators (UVFA) (Schaul et al., 2015a) is an extension of DQN to
the setup where there is more than one goal we may try to achieve. Let G be the space of possible goals. Every goal g ∈ G corresponds to some reward function r_g : S × A → R. Every episode starts with sampling a state-goal pair from some distribution p(s_0, g). The goal stays fixed for the whole episode. At every timestep the agent gets as input not only the current state but also the current goal, so the policy has the form π : S × G → A, and it gets the reward r_t = r_g(s_t, a_t). The Q-function now depends not only on a state-action pair but also on a goal: Q^π(s_t, a_t, g) = E[R_t | s_t, a_t, g]. Schaul et al. (2015a) show that in this setup it is possible to train an approximator to the Q-function using direct bootstrapping from the Bellman equation (just like in the case of DQN) and that a greedy policy derived from it can generalize to previously unseen state-action pairs. The extension of this approach to DDPG is straightforward.
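A minimal sketch of a goal-conditioned Q-function in the DQN setting, where the goal is concatenated to the state and the same goal is used in the bootstrapped target (network layout and names are illustrative assumptions):

```python
import torch
import torch.nn as nn

class GoalConditionedQ(nn.Module):
    """Q(s, a, g): the goal is simply concatenated to the state as extra input."""
    def __init__(self, state_dim, goal_dim, n_actions, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + goal_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, state, goal):
        return self.net(torch.cat([state, goal], dim=-1))

def bootstrapped_target(q_target, r, s_next, g, gamma=0.98):
    # y = r + gamma * max_a Q_target(s', a, g); the same goal g appears on both sides.
    with torch.no_grad():
        return r + gamma * q_target(s_next, g).max(dim=-1).values
```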
3 Hindsight Experience Replay
3.1 A motivating example
Consider a bit-flipping environment with the state space S = {0, 1}^n and the action space A = {0, 1, . . . , n − 1} for some integer n in which executing the i-th action flips the i-th bit of the state. For every episode we sample uniformly an initial state as well as a target state and the policy gets a reward of −1 as long as it is not in the target state, i.e. r_g(s, a) = −[s ≠ g].
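A minimal sketch of this environment, assuming a simple reset/step interface (class and method names are illustrative):

```python
import numpy as np

class BitFlipEnv:
    """Bit-flipping environment: action i flips bit i; reward is -1 until state == goal."""
    def __init__(self, n: int, seed: int = 0):
        self.n = n
        self.rng = np.random.default_rng(seed)

    def reset(self):
        # Sample the initial state and the goal uniformly from {0, 1}^n.
        self.state = self.rng.integers(0, 2, size=self.n)
        self.goal = self.rng.integers(0, 2, size=self.n)
        return self.state.copy(), self.goal.copy()

    def step(self, action: int):
        self.state[action] ^= 1                 # flip the chosen bit
        done = bool(np.array_equal(self.state, self.goal))
        reward = 0.0 if done else -1.0          # r_g(s, a) = -[s != g]
        return self.state.copy(), reward, done
```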
Figure 1: Bit-flipping experiment (plot of success rate vs. number of bits n for DQN and DQN+HER).
Standard RL algorithms are bound to fail in this environment for n > 40 because they will never experience any reward other than −1.
Notice that using techniques for improving exploration (e.g. VIME
(Houthooft et al., 2016), count-based exploration (Ostrovski et al.,
2017) or bootstrapped DQN (Osband et al., 2016)) does not help
here because the real problem is not in lack of diversity of states
being visited, rather it is simply impractical to explore such a large
state space. The standard solution to this problem would be to use
a shaped reward function which is more informative and guides the
agent towards the goal, e.g. r_g(s, a) = −||s − g||^2. While using a
shaped reward solves the problem in our toy environment, it may be
difficult to apply to more complicated problems. We investigate the
results of reward shaping experimentally in Sec. 4.4.
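For concreteness, a sketch of the two reward functions discussed above for the bit-flipping task (function names are illustrative):

```python
import numpy as np

def sparse_reward(s: np.ndarray, g: np.ndarray) -> float:
    # r_g(s, a) = -[s != g]: -1 until the goal is reached, 0 afterwards.
    return 0.0 if np.array_equal(s, g) else -1.0

def shaped_reward(s: np.ndarray, g: np.ndarray) -> float:
    # r_g(s, a) = -||s - g||^2: for bit vectors this counts the mismatched bits.
    return -float(np.sum((s - g) ** 2))
```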
Instead of shaping the reward we propose a different solution which
does not require any domain knowledge. Consider an episode with
² A polyak-averaged version of a parametric model M which is being trained is a model whose parameters are computed as an exponential moving average of the parameters of M over time.