Deep Reinforcement Learning in Large Discrete Action Spaces
Gabriel Dulac-Arnold*, Richard Evans*, Hado van Hasselt, Peter Sunehag, Timothy Lillicrap, Jonathan Hunt, Timothy Mann, Theophane Weber, Thomas Degris, Ben Coppin
DULACARNOLD@GOOGLE.COM
Google DeepMind
*Equal contribution.
arXiv:1512.07679v2 [cs.AI] 4 Apr 2016
Abstract

Being able to reason in an environment with a large number of discrete actions is essential to bringing reinforcement learning to a larger class of problems. Recommender systems, industrial plants and language models are only some of the many real-world tasks involving large numbers of discrete actions for which current methods are difficult or even impossible to apply.

An ability to generalize over the set of actions as well as sub-linear complexity relative to the size of the set are both necessary to handle such tasks. Current approaches are not able to provide both of these, which motivates the work in this paper. Our proposed approach leverages prior information about the actions to embed them in a continuous space upon which it can generalize. Additionally, approximate nearest-neighbor methods allow for logarithmic-time lookup complexity relative to the number of actions, which is necessary for time-wise tractable training. This combined approach allows reinforcement learning methods to be applied to large-scale learning problems previously intractable with current methods. We demonstrate our algorithm's abilities on a series of tasks having up to one million actions.
1. Introduction

Advanced AI systems will likely need to reason with a large number of possible actions at every step. Recommender systems used in large systems such as YouTube and Amazon must reason about hundreds of millions of items every second, and control systems for large industrial processes may have millions of possible actions that can be applied at every time step. All of these systems are fundamentally reinforcement learning (Sutton & Barto, 1998) problems, but current algorithms are difficult or impossible to apply.
In this paper, we present a new policy architecture which operates efficiently with a large number of actions. We achieve this by leveraging prior information about the actions to embed them in a continuous space upon which the actor can generalize. This embedding also allows the policy's complexity to be decoupled from the cardinality of our action set. Our policy produces a continuous action within this space, and then uses an approximate nearest-neighbor search to find the set of closest discrete actions in logarithmic time. We can either apply the closest action in this set directly to the environment, or fine-tune this selection by selecting the highest-valued action in this set relative to a cost function. This approach allows for generalization over the action set in logarithmic time, which is necessary for making both learning and acting tractable in time.

We begin by describing our problem space and then detail our policy architecture, demonstrating how we can train it using policy gradient methods in an actor-critic framework. We demonstrate the effectiveness of our policy on various tasks with up to one million actions, but with the intent that our approach could scale well beyond millions of actions.
2. Definitions

We consider a Markov Decision Process (MDP) where A is the set of discrete actions, S is the set of discrete states, P : S × A × S → R is the transition probability distribution, R : S × A → R is the reward function, and γ ∈ [0, 1] is a discount factor for future rewards. Each action a ∈ A corresponds to an n-dimensional vector, such that a ∈ R^n. This vector provides information related to the action. In the same manner, each state s ∈ S is a vector s ∈ R^m.

The return of an episode in the MDP is the discounted sum of rewards received by the agent during that episode: R_t = Σ_{i=t}^{T} γ^{i−t} r(s_i, a_i). The goal of RL is to learn a policy π : S → A which maximizes the expected return over all episodes, E[R_1]. The state-action value function Q^π(s, a) = E[R_1 | s_1 = s, a_1 = a, π] is the expected return starting from a given state s and taking an action a, following π thereafter. Q^π can be expressed in a recursive manner using the Bellman equation:

Q^π(s, a) = r(s, a) + γ Σ_{s′} P(s′ | s, a) Q^π(s′, π(s′)).

In this paper, both Q and π are approximated by parametrized functions.
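As a concrete reading of the return defined above, the following minimal Python sketch computes the discounted sum R_t = Σ_{i=t}^{T} γ^{i−t} r(s_i, a_i) for a recorded episode; the reward values are purely illustrative and not taken from the paper.

```python
# Minimal sketch: discounted return R_t for one episode.
# The rewards below are hypothetical placeholders, not data from the paper.

def discounted_return(rewards, gamma, t=0):
    """R_t = sum over i >= t of gamma^(i - t) * r_i."""
    return sum(gamma ** k * r for k, r in enumerate(rewards[t:]))

rewards = [0.0, 1.0, 0.0, 2.0]                 # r(s_i, a_i) for i = 0..T
print(discounted_return(rewards, gamma=0.99))  # 0 + 0.99*1 + 0.99^2*0 + 0.99^3*2
```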
3. Problem Description

There are two primary families of policies often used in RL systems: value-based, and actor-based policies.

For value-based policies, the policy's decisions are directly conditioned on the value function. One of the more common examples is a policy that is greedy relative to the value function:

π_Q(s) = argmax_{a∈A} Q(s, a).    (1)

In the common case that the value function is a parameterized function which takes both state and action as input, |A| evaluations are necessary to choose an action. This quickly becomes intractable, especially if the parameterized function is costly to evaluate, as is the case with deep neural networks. This approach does, however, have the desirable property of being capable of generalizing over actions when using a smooth function approximator. If a_i and a_j are similar, learning about a_i will also inform us about a_j. Not only does this make learning more efficient, it also allows value-based policies to use the action features to reason about previously unseen actions. Unfortunately, execution complexity grows linearly with |A|, which renders this approach intractable when the number of actions grows significantly.
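To make the linear cost concrete, the sketch below shows the greedy policy of Equation (1) implemented naively; the toy linear Q-function, sizes and names are illustrative assumptions rather than anything from the paper. Choosing a single action requires one Q evaluation per element of A, which is exactly what becomes intractable for very large action sets.

```python
import numpy as np

# Sketch of the greedy value-based policy pi_Q(s) = argmax_a Q(s, a) from Equation (1).
# q_network is a stand-in for any parameterized Q(s, a); a toy linear model is used
# here only so the example runs.

rng = np.random.default_rng(0)
num_actions, state_dim, action_dim = 10_000, 8, 4
action_features = rng.normal(size=(num_actions, action_dim))  # one vector per discrete action
weights = rng.normal(size=(state_dim + action_dim,))

def q_network(state, action):
    return float(np.concatenate([state, action]) @ weights)

def greedy_policy(state):
    # One Q evaluation per action: O(|A|) work per decision.
    values = [q_network(state, a) for a in action_features]
    return int(np.argmax(values))

state = rng.normal(size=state_dim)
print(greedy_policy(state))
```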
In a standard actor-critic approach, the policy is explicitly defined by a parameterized actor function π_θ : S → A. In practice, π_θ is often a classifier-like function approximator, which scales linearly in relation to the number of actions. However, actor-based architectures avoid the computational cost of evaluating a likely costly Q-function on every action in the argmax in Equation (1). Nevertheless, actor-based approaches do not generalize over the action space as naturally as value-based approaches, and cannot extend to previously unseen actions.

Sub-linear complexity relative to the action space and an ability to generalize over actions are both necessary to handle the tasks we are interested in. Current approaches are not able to provide both of these, which motivates the approach described in this paper.
[Figure 1. Wolpertinger Architecture. The actor maps the state to a proto-action; a k-NN lookup over the action embedding returns the k closest discrete actions; the critic then selects the highest-valued of these via an argmax, and that action is applied to the environment.]
4. Proposed Approach

We propose a new policy architecture which we call the Wolpertinger architecture. This architecture avoids the heavy cost of evaluating all actions while retaining generalization over actions. This policy builds upon the actor-critic (Sutton & Barto, 1998) framework. We define both an efficient action-generating actor, and utilize the critic to refine our actor's choices for the full policy. We use multi-layer neural networks as function approximators for both our actor and critic functions. We train this policy using Deep Deterministic Policy Gradient (Lillicrap et al., 2015).

The Wolpertinger policy's algorithm is described fully in Algorithm 1 and illustrated in Figure 1. We will detail these in the following sections.

Algorithm 1 Wolpertinger Policy
  State s previously received from environment.
  â = f_θπ(s)                          {Receive proto-action from actor.}
  A_k = g_k(â)                         {Retrieve k approximately closest actions.}
  a = argmax_{a_j ∈ A_k} Q_θQ(s, a_j)
  Apply a to environment; receive r, s′.
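As a concrete reading of Algorithm 1, the sketch below assumes a trained actor f_θπ, a critic Q_θQ and a matrix of action embeddings are already available; all function names are hypothetical, and an exact nearest-neighbor search stands in for the approximate index the paper relies on.

```python
import numpy as np

# Sketch of one Wolpertinger policy step (Algorithm 1), assuming `actor(state)`
# returns a proto-action in R^n and `critic(state, action)` returns a scalar
# Q-value. Exact k-NN is used here for clarity; the paper uses an approximate
# nearest-neighbor index for logarithmic-time lookup.

def wolpertinger_action(state, actor, critic, action_embeddings, k=10):
    proto_action = actor(state)                                   # a_hat = f_theta_pi(s)
    dists = np.linalg.norm(action_embeddings - proto_action, axis=1)
    candidate_ids = np.argpartition(dists, k)[:k]                 # A_k = g_k(a_hat)
    q_values = [critic(state, action_embeddings[i]) for i in candidate_ids]
    return candidate_ids[int(np.argmax(q_values))]                # index of the chosen action
```

Per decision, the cost is one actor pass, one (approximate) nearest-neighbor lookup and k critic evaluations, rather than one critic evaluation per element of A.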
4.1. Action Generation

Our architecture reasons over actions within a continuous space R^n, and then maps this output to the discrete action set A. We will first define:

f_θπ : S → R^n,    f_θπ(s) = â.

f_θπ is a function parametrized by θ^π, mapping from the state representation space R^m to the action representation space R^n. This function provides a proto-action in R^n for a given state, which will likely not be a valid action, i.e. it is likely that â ∉ A. Therefore, we need to be able to map from â to an element in A. We can do this with:

g : R^n → A,    g_k(â) = argmin^k_{a∈A} |a − â|_2.

g_k is a k-nearest-neighbor mapping from a continuous space to a discrete set (for k = 1 this is a simple nearest-neighbor lookup). It returns the k actions in A that are closest to â by L2 distance. In the exact case, this lookup is of the same complexity as the argmax in the value-function-derived policies described in Section 3, but each step of evaluation is an L2 distance instead of a full value-function evaluation. This task has been extensively studied in the approximate nearest neighbor literature, and the lookup can be performed in an approximate manner in logarithmic time (Muja & Lowe, 2014). This step is described by the bottom half of Figure 1, where we can see the actor network producing a proto-action, and the k-nearest neighbors being chosen from the action embedding.
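The lookup g_k can be served by any nearest-neighbor index. The sketch below uses scikit-learn's NearestNeighbors as an illustrative stand-in (the paper cites FLANN-style approximate methods, Muja & Lowe, 2014); the embedding dimensions and random action vectors are assumptions for the example only.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

# Sketch of g_k: map a proto-action a_hat in R^n to the indices of its k closest
# discrete actions. The embeddings here are random placeholders; in practice they
# come from prior information about the actions.

rng = np.random.default_rng(0)
action_embeddings = rng.normal(size=(100_000, 16))   # one row per discrete action

# A tree-based index gives sub-linear lookups; FLANN/FAISS-style approximate
# indices would be used for logarithmic-time behaviour at very large |A|.
index = NearestNeighbors(n_neighbors=10, algorithm="ball_tree").fit(action_embeddings)

def g_k(proto_action):
    _, neighbor_ids = index.kneighbors(proto_action.reshape(1, -1))
    return neighbor_ids[0]    # A_k: indices of the k nearest actions

proto_action = rng.normal(size=16)
print(g_k(proto_action))
```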
4.2. Action Refinement

Depending on how well the action representation is structured, actions with a low Q-value may occasionally sit closest to â even in a part of the space where most actions have a high Q-value. Additionally, certain actions may be near each other in the action embedding space, but in certain states they must be distinguished, as one has a particularly low long-term value relative to its neighbors. In both of these cases, simply selecting the closest element to â from the set of actions generated previously is not ideal.

To avoid picking these outlier actions, and to generally improve the finally emitted action, the second phase of the algorithm, which is described by the top part of Figure 1, refines the choice of action by selecting the highest-scoring action according to Q_θQ:

π_θ(s) = argmax_{a ∈ g_k∘f_θπ(s)} Q_θQ(s, a).    (2)

This equation is described more explicitly in Algorithm 1. It introduces π_θ, which is the full Wolpertinger policy. The parameter θ represents both the parameters of the action generation element in θ^π and of the critic in θ^Q.

As we demonstrate in Section 7, this second pass makes our algorithm significantly more robust to imperfections in the choice of action representation, and is essential in making our system learn in certain domains. The size of the generated action set, k, is task specific, and allows for an explicit trade-off between policy quality and speed.
4.3. Training with Policy Gradient

Although the architecture of our policy is not fully differentiable, we argue that we can nevertheless train our policy by following the policy gradient of f_θπ. We will first consider the training of a simpler policy, one defined only as π̃_θ = g ∘ f_θπ. In this initial case we can consider that the policy is f_θπ and that the effects of g are a deterministic aspect of the environment. This allows us to maintain a standard policy gradient approach to train f_θπ on its output â, effectively interpreting the effects of g as environmental dynamics. Similarly, the argmax operation in Equation (2) can be seen as introducing a non-stationary aspect to the environmental dynamics.
4.4. Wolpertinger Training

The training algorithm's goal is to find a parameterized policy π_θ* which maximizes its expected return over the episode's length. To do this, we find a parametrization θ* of our policy which maximizes its expected return over an episode: θ* = argmax_θ E[R_1 | π_θ].

We perform this optimization using Deep Deterministic Policy Gradient (DDPG) (Lillicrap et al., 2015) to train both f_θπ and Q_θQ. DDPG draws from two stability-inducing aspects of Deep Q-Networks (Mnih et al., 2015) to extend Deterministic Policy Gradient (Silver et al., 2014) to neural network function approximators, by introducing a replay buffer (Lin, 1992) and target networks. DPG is similar to work introduced by NFQCA (Hafner & Riedmiller, 2011) and leverages the gradient update originally introduced by ADHDP (Prokhorov et al., 1997).

The goal of these algorithms is to perform policy iteration by alternately performing policy evaluation on the current policy with Q-learning, and then improving upon the current policy by following the policy gradient.
The critic is trained from samples stored in a replay buffer (Mnih et al., 2015). Actions stored in the replay buffer are generated by π_θπ, but the policy gradient ∇_a Q_θQ(s, a) is taken at â = f_θπ(s). This allows the learning algorithm to leverage the otherwise ignored information of which action was actually executed for training the critic, while taking the policy gradient at the actual output of f_θπ. The target action in the Q-update is generated by the full policy π_θ and not simply f_θπ.
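A condensed sketch of this update in PyTorch is given below. It is our own illustration of the scheme just described, not the authors' code: the critic is regressed toward a target computed with the full policy π_θ on s′ while using the executed action from the buffer, and the actor gradient is taken at â = f_θπ(s). The networks, optimizers and the `full_policy` helper (proto-action → k-NN → critic argmax, as in Algorithm 1) are assumed to be defined elsewhere.

```python
import torch

# Hypothetical one-step DDPG-style update for the Wolpertinger policy. `actor`,
# `critic`, their target copies, the optimizers, and `full_policy` are assumed
# to exist; `a_executed` holds the embedding vectors of the actions that were
# actually applied to the environment.

def train_step(batch, actor, critic, target_actor, target_critic,
               actor_opt, critic_opt, full_policy, gamma=0.99):
    s, a_executed, r, s_next = batch

    # Critic update: the target action comes from the *full* policy pi_theta on s',
    # while the Q-value being regressed uses the action actually executed.
    with torch.no_grad():
        a_next = full_policy(s_next, target_actor, target_critic)
        y = r + gamma * target_critic(s_next, a_next)
    critic_loss = torch.nn.functional.mse_loss(critic(s, a_executed), y)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor update: policy gradient taken at the proto-action a_hat = f_theta_pi(s),
    # not at the discrete action stored in the buffer.
    a_hat = actor(s)
    actor_loss = -critic(s, a_hat).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
```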
A detailed description of the algorithm is available in the