3 BACKGROUND
We are interested in solving a Markov Decision Process (MDP) augmented with a set of goals G
(each a state or set of states) that we would like an agent to learn. We define an MDP augmented with
a set of goals as a Universal MDP (UMDP). A UMDP is a tuple U = (S, G, A, T, R, γ), in which S
is the set of states; G is the set of goals; A is the set of actions; T is the transition probability function
in which T(s, a, s′) is the probability of transitioning to state s′ when action a is taken in state s; R
is the reward function; γ is the discount rate ∈ [0, 1). At the beginning of each episode in a UMDP,
a goal g ∈ G is selected for the entirety of the episode. The solution to a UMDP is a control policy
π : S, G → A that maximizes the value function v_π(s, g) = E_π[Σ_{n=0}^∞ γ^n R_{t+n+1} | s_t = s, g_t = g]
for an initial state s and goal g.
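As a concrete (if simplified) illustration, the sketch below estimates v_π(s, g) for a single episode by rolling out a goal-conditioned policy and accumulating the discounted reward; the `env` and `policy` interfaces are hypothetical placeholders, not part of the paper.

```python
def discounted_return(rewards, gamma=0.99):
    """Discounted sum of rewards: sum_n gamma^n * R_{t+n+1} for one episode."""
    return sum(gamma ** n * r for n, r in enumerate(rewards))

def rollout(env, policy, goal, gamma=0.99, max_steps=500):
    """Run one UMDP episode: the goal is fixed for the whole episode
    and the policy conditions on (state, goal). Assumes a gym-like env
    whose step() returns (next_state, reward, done)."""
    state = env.reset()
    rewards = []
    for _ in range(max_steps):
        action = policy(state, goal)            # pi : S, G -> A
        state, reward, done = env.step(action)
        rewards.append(reward)
        if done:
            break
    return discounted_return(rewards, gamma)
```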
In order to implement hierarchical agents in tasks with continuous state and actions spaces, we will
use two techniques from the RL literature: (i) the Universal Value Function Approximator (UVFA)
(Schaul et al., 2015) and (ii) Hindsight Experience Replay (Andrychowicz et al., 2017). The UVFA
will be used to estimate the action-value function of a goal-conditioned policy π,
q_π(s, g, a) = E_π[Σ_{n=0}^∞ γ^n R_{t+n+1} | s_t = s, g_t = g, a_t = a]. In our experiments, the UVFAs used will be in the
form of feedforward neural networks. UVFAs are important for learning goal-conditioned policies
because they can potentially generalize Q-values from certain regions of the (state, goal, action)
tuple space to other regions of the tuple space, which can accelerate learning. However, UVFAs
are less helpful in difficult tasks with sparse reward functions. In these tasks, because the sparse
reward is rarely achieved, the UVFA will not have large regions of the (state, goal, action) tuple
space with relatively high Q-values that it can generalize to other regions. For this reason, we
also use Hindsight Experience Replay (HER) (Andrychowicz et al., 2017). HER is a data augmentation
technique that can accelerate learning in sparse reward tasks. HER first creates copies of the [state,
action, reward, next state, goal] transitions that are collected in traditional off-policy RL. In the copied
transitions, the original goal element is replaced with a state that was actually achieved during the
episode, which guarantees that at least one of the HER transitions will contain the sparse reward.
These HER transitions in turn help the UVFA learn about regions of the (state, goal, action) tuple
space that should have relatively high Q-values, which the UVFA can then potentially extrapolate to
the other areas of the tuple space that may be more relevant for achieving the current set of goals.
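To make the relabeling step concrete, below is a minimal sketch of hindsight relabeling for a completed episode. It uses the episode's final achieved state as the substitute goal, which is one simple choice; the transition layout, `reward_fn`, and `goal_of` mapping are illustrative assumptions rather than the paper's exact implementation.

```python
import copy

def her_relabel(episode, reward_fn, goal_of):
    """Create hindsight copies of an episode's transitions.

    episode   : list of dicts {state, action, reward, next_state, goal}
    reward_fn : sparse reward, e.g. 0 if the goal is achieved, else -1 (assumed)
    goal_of   : maps a state to the goal it achieves (assumed)
    """
    # Use the final state actually reached as the substitute goal.
    hindsight_goal = goal_of(episode[-1]["next_state"])
    relabeled = []
    for transition in episode:
        copy_t = copy.deepcopy(transition)
        copy_t["goal"] = hindsight_goal
        # Recompute the reward w.r.t. the substituted goal; the last
        # transition is now guaranteed to receive the sparse reward.
        copy_t["reward"] = reward_fn(copy_t["next_state"], hindsight_goal)
        relabeled.append(copy_t)
    return relabeled
```

Both the original and the relabeled copies would then be stored in the replay buffer used to train the UVFA.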
4 HIERARCHICAL ACTOR-CRITIC (HAC)
We introduce an HRL framework, Hierarchical Actor-Critic (HAC), that can efficiently learn the levels in a
multi-level hierarchy in parallel. HAC contains two components: (i) a particular hierarchical archi-
tecture and (ii) a method for learning the levels of the hierarchy simultaneously and independently.
In this section, we will more formally present our proposed system as a UMDP transformation
operation.
The purpose of our framework is to efficiently learn a k-level hierarchy Π_{k−1} consisting of k
individual policies π_0, . . . , π_{k−1}, in which k is a hyperparameter chosen by the user. In order
to learn π_0, . . . , π_{k−1} in parallel, our framework transforms the original UMDP, U_original =
(S, G, A, T, R, γ), into a set of k UMDPs U_0, . . . , U_{k−1}, in which U_i = (S_i, G_i, A_i, T_i, R_i, γ_i). In
the remainder of the section, we will describe these tuples at a high level. See section 7.3 in the
Appendix for the full definition of each UMDP tuple.
4.1 STATE, GOAL, AND ACTION SPACES
In our approach, each level of the UMDP hierarchy learns its own deterministic policy:
π_i : S_i, G_i → A_i, 0 ≤ i ≤ k − 1. The state space for every level i is identical to the state space in the original
problem: S_i = S. Since each level will learn to solve a shortest path problem with respect to a goal
state, we set the goal space at each level i to be identical to the state space: G_i = S. Finally, the
action space at all levels except the bottom-most level is identical to the goal space of the next level
down (i.e. the state space): A_i = S, i > 0. These levels output subgoal states for the next lower
level to achieve. The action space of the bottom-most level is identical to the set of primitive actions
that are available to the agent: A_0 = A.
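The sketch below writes out this space assignment for a k-level agent; the dictionary layout and the space objects it holds are hypothetical placeholders, used only to make the per-level indexing explicit.

```python
def build_level_spaces(S, A, k):
    """Per-level (state, goal, action) spaces for a k-level hierarchy.

    S : the original state space, A : the original primitive action space,
    k : number of levels (a user-chosen hyperparameter).
    """
    levels = []
    for i in range(k):
        levels.append({
            "state_space":  S,                  # S_i = S for every level
            "goal_space":   S,                  # G_i = S (goals are states)
            "action_space": A if i == 0 else S  # A_0 = A; A_i = S for i > 0 (subgoal states)
        })
    return levels
```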