3 BACKGROUND
We are interested in solving a Markov Decision Process (MDP) augmented with a set of goals G
(each a state or set of states) that we would like an agent to learn. We define an MDP augmented with
a set of goals as a Universal MDP (UMDP). A UMDP is a tuple U = (S, G, A, T, R, γ), in which S
is the set of states; G is the set of goals; A is the set of actions; T is the transition probability function
in which T(s, a, s′) is the probability of transitioning to state s′ when action a is taken in state s; R
is the reward function; γ is the discount rate ∈ [0, 1). At the beginning of each episode in a UMDP,
a goal g ∈ G is selected for the entirety of the episode. The solution to a UMDP is a control policy
π : S, G → A that maximizes the value function v_π(s, g) = E_π[Σ_{n=0}^∞ γ^n R_{t+n+1} | s_t = s, g_t = g]
for an initial state s and goal g.
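As a concrete (if simplified) illustration, the sketch below estimates v_π(s, g) for a single episode by rolling out a goal-conditioned policy and accumulating the discounted reward; the `env` and `policy` interfaces are hypothetical placeholders, not part of the paper.

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted sum of rewards: sum_n gamma^n * R_{t+n+1} for one episode."""
    return sum(gamma ** n * r for n, r in enumerate(rewards))

def rollout(env, policy, goal, gamma=0.99, max_steps=500):
    """Run one UMDP episode: the goal is fixed for the whole episode
    and the policy conditions on (state, goal). Assumes a gym-like env
    whose step() returns (next_state, reward, done)."""
    state = env.reset()
    rewards = []
    for _ in range(max_steps):
        action = policy(state, goal)            # pi : S, G -> A
        state, reward, done = env.step(action)
        rewards.append(reward)
        if done:
            break
    return discounted_return(rewards, gamma)
```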
In order to implement hierarchical agents in tasks with continuous state and actions spaces, we will
use two techniques from the RL literature: (i) the Universal Value Function Approximator (UVFA)
(Schaul et al., 2015) and (ii) Hindsight Experience Replay (Andrychowicz et al., 2017). The UVFA
will be used to estimate the action-value function of a goal-conditioned policy π,
q_π(s, g, a) = E_π[Σ_{n=0}^∞ γ^n R_{t+n+1} | s_t = s, g_t = g, a_t = a]. In our experiments, the UVFAs used will be in the
form of feedforward neural networks. UVFAs are important for learning goal-conditioned policies
because they can potentially generalize Q-values from certain regions of the (state, goal, action)
tuple space to other regions of the tuple space, which can accelerate learning. However, UVFAs
are less helpful in difficult tasks with sparse reward functions. In these tasks, because the sparse
reward is rarely achieved, the UVFA will not have large regions of the (state, goal, action) tuple
space with relatively high Q-values that it can generalize to other regions. For this reason, we
also use Hindsight Experience Replay (HER) (Andrychowicz et al., 2017). HER is a data augmentation
technique that can accelerate learning in sparse reward tasks. HER first creates copies of the [state,
action, reward, next state, goal] transitions that are collected in traditional off-policy RL. In the copied
transitions, the original goal element is replaced with a state that was actually achieved during the
episode, which guarantees that at least one of the HER transitions will contain the sparse reward.
These HER transitions in turn help the UVFA learn about regions of the (state, goal, action) tuple
space that should have relatively high Q-values, which the UVFA can then potentially extrapolate to
the other areas of the tuple space that may be more relevant for achieving the current set of goals.
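To make the relabeling step concrete, below is a minimal sketch of hindsight relabeling for a completed episode. It uses the episode's final achieved state as the substitute goal, which is one simple choice; the transition layout, `reward_fn`, and `goal_of` mapping are illustrative assumptions rather than the paper's exact implementation.

```python
import copy

def her_relabel(episode, reward_fn, goal_of):
    """Create hindsight copies of an episode's transitions.

    episode   : list of dicts {state, action, reward, next_state, goal}
    reward_fn : sparse reward, e.g. 0 if the goal is achieved, else -1 (assumed)
    goal_of   : maps a state to the goal it achieves (assumed)
    """
    # Use the final state actually reached as the substitute goal.
    hindsight_goal = goal_of(episode[-1]["next_state"])
    relabeled = []
    for transition in episode:
        copy_t = copy.deepcopy(transition)
        copy_t["goal"] = hindsight_goal
        # Recompute the reward w.r.t. the substituted goal; the last
        # transition is now guaranteed to receive the sparse reward.
        copy_t["reward"] = reward_fn(copy_t["next_state"], hindsight_goal)
        relabeled.append(copy_t)
    return relabeled
```

Both the original and the relabeled copies would then be stored in the replay buffer used to train the UVFA.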
4 HIERARCHICAL ACTOR-CRITIC (HAC)
We introduce an HRL framework, Hierarchical Actor-Critic (HAC), that can efficiently learn the levels in a
multi-level hierarchy in parallel. HAC contains two components: (i) a particular hierarchical archi-
tecture and (ii) a method for learning the levels of the hierarchy simultaneously and independently.
In this section, we will more formally present our proposed system as a UMDP transformation
operation.
The purpose of our framework is to efficiently learn a k-level hierarchy Π_{k−1} consisting of k
individual policies π_0, . . . , π_{k−1}, in which k is a hyperparameter chosen by the user. In order
to learn π_0, . . . , π_{k−1} in parallel, our framework transforms the original UMDP, U_original =
(S, G, A, T, R, γ), into a set of k UMDPs U_0, . . . , U_{k−1}, in which U_i = (S_i, G_i, A_i, T_i, R_i, γ_i). In
the remainder of the section, we will describe these tuples at a high level. See section 7.3 in the
Appendix for the full definition of each UMDP tuple.
4.1 STATE, GOAL, AND ACTION SPACES
In our approach, each level of the UMDP hierarchy learns its own deterministic policy:
π_i : S_i, G_i → A_i, 0 ≤ i ≤ k − 1. The state space for every level i is identical to the state space in the original
problem: S_i = S. Since each level will learn to solve a shortest path problem with respect to a goal
state, we set the goal space at each level i to be identical to the state space: G_i = S. Finally, the
action space at all levels except the bottom-most level is identical to the goal space of the next level
down (i.e. the state space): A_i = S, i > 0. These levels output subgoal states for the next lower
level to achieve. The action space of the bottom-most level is identical to the set of primitive actions
that are available to the agent: A_0 = A.
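The sketch below writes out this space assignment for a k-level agent; the dictionary layout and the space objects it holds are hypothetical placeholders, used only to make the per-level indexing explicit.

```python
def build_level_spaces(S, A, k):
    """Per-level (state, goal, action) spaces for a k-level hierarchy.

    S : the original state space, A : the original primitive action space,
    k : number of levels (a user-chosen hyperparameter).
    """
    levels = []
    for i in range(k):
        levels.append({
            "state_space":  S,                  # S_i = S for every level
            "goal_space":   S,                  # G_i = S (goals are states)
            "action_space": A if i == 0 else S  # A_0 = A; A_i = S for i > 0 (subgoal states)
        })
    return levels
```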