ing towards high-reward regions. More recently, several
papers have noted the connection between Q-learning and
policy gradient methods in the framework of maximum en-
tropy learning (O’Donoghue et al., 2016; Haarnoja et al.,
2017; Nachum et al., 2017a; Schulman et al., 2017a). While
most of the prior model-free works assume a discrete action
space, Nachum et al. (2017b) approximate the maximum en-
tropy distribution with a Gaussian and Haarnoja et al. (2017)
with a sampling network trained to draw samples from the
optimal policy. Although the soft Q-learning algorithm pro-
posed by Haarnoja et al. (2017) has a value function and
actor network, it is not a true actor-critic algorithm: the
Q-function is estimating the optimal Q-function, and the
actor does not directly affect the Q-function except through
the data distribution. Hence, Haarnoja et al. (2017) moti-
vates the actor network as an approximate sampler, rather
than the actor in an actor-critic algorithm. Crucially, the
convergence of this method hinges on how well this sampler
approximates the true posterior. In contrast, we prove that
our method converges to the optimal policy from a given
policy class, regardless of the policy parameterization. Fur-
thermore, these prior maximum entropy methods generally
do not exceed the performance of state-of-the-art off-policy
algorithms, such as DDPG, when learning from scratch,
though they may have other benefits, such as improved ex-
ploration and ease of fine-tuning. In our experiments, we
demonstrate that our soft actor-critic algorithm does in fact
exceed the performance of prior state-of-the-art off-policy
deep RL methods by a wide margin.
3. Preliminaries
We first introduce notation and summarize the standard and
maximum entropy reinforcement learning frameworks.
3.1. Notation
We address policy learning in continuous action spaces.
We consider an infinite-horizon Markov decision process
(MDP), defined by the tuple $(\mathcal{S}, \mathcal{A}, p, r)$, where the state
space $\mathcal{S}$ and the action space $\mathcal{A}$ are continuous, and the
unknown state transition probability $p : \mathcal{S} \times \mathcal{S} \times \mathcal{A} \to [0, \infty)$
represents the probability density of the next state $s_{t+1} \in \mathcal{S}$
given the current state $s_t \in \mathcal{S}$ and action $a_t \in \mathcal{A}$.
The environment emits a bounded reward
$r : \mathcal{S} \times \mathcal{A} \to [r_{\min}, r_{\max}]$ on each transition.
We will use $\rho_\pi(s_t)$ and $\rho_\pi(s_t, a_t)$ to denote the state and
state-action marginals of the trajectory distribution induced by a policy $\pi(a_t|s_t)$.
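Concretely, one standard way to unpack these marginals from the trajectory
distribution $p(s_0) \prod_t \pi(a_t|s_t)\, p(s_{t+1}|s_t, a_t)$ (written out here for
concreteness) is

$$\rho_\pi(s_t) = \int p(s_0) \prod_{l=0}^{t-1} \pi(a_l|s_l)\, p(s_{l+1}|s_l, a_l)\, \mathrm{d}s_{0:t-1}\, \mathrm{d}a_{0:t-1},
\qquad
\rho_\pi(s_t, a_t) = \rho_\pi(s_t)\, \pi(a_t|s_t).$$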
3.2. Maximum Entropy Reinforcement Learning
Standard RL maximizes the expected sum of rewards
$\sum_t \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}[r(s_t, a_t)]$. We will consider a more general
maximum entropy objective (see e.g. Ziebart (2010)), which favors stochastic
policies by augmenting the objective with the expected entropy of the policy
over $\rho_\pi(s_t)$:
$$J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\big[ r(s_t, a_t) + \alpha \mathcal{H}(\pi(\,\cdot\,|s_t)) \big]. \qquad (1)$$
The temperature parameter $\alpha$ determines the relative importance of the
entropy term against the reward, and thus controls the stochasticity of the
optimal policy. The maximum entropy objective differs from the standard maximum
expected reward objective used in conventional reinforcement learning, though
the conventional objective can be recovered in the limit as $\alpha \to 0$. For
the rest of this paper, we will omit writing the temperature explicitly, as it
can always be subsumed into the reward by scaling it by $\alpha^{-1}$.
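As a concrete illustration (a minimal sketch, not part of the paper's method):
for a sampled finite-horizon trajectory, the objective in Eq. (1) can be
estimated by adding an entropy bonus of $-\log \pi(a_t|s_t)$ to each reward,
scaled by the temperature. The names rewards, log_probs, and alpha below are
assumptions made for the sketch.

import numpy as np

# Sketch only: Monte Carlo estimate of the maximum entropy objective in Eq. (1)
# for a single sampled trajectory. rewards[t] holds r(s_t, a_t) and log_probs[t]
# holds log pi(a_t | s_t) from rolling out a stochastic policy; the quantity
# -log pi(a_t | s_t) is a single-sample estimate of the entropy H(pi(.|s_t)).
def max_entropy_return(rewards, log_probs, alpha=0.2):
    rewards = np.asarray(rewards, dtype=np.float64)
    log_probs = np.asarray(log_probs, dtype=np.float64)
    # Each term adds the reward plus alpha times the sampled entropy bonus.
    return float(np.sum(rewards + alpha * (-log_probs)))

# Example: a three-step trajectory.
print(max_entropy_return([1.0, 0.5, 0.0], [-1.2, -0.8, -1.5]))

Setting alpha to zero in this sketch recovers the conventional return,
matching the limit $\alpha \to 0$ discussed above.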
This objective has a number of conceptual and practical
advantages. First, the policy is incentivized to explore more
widely, while giving up on clearly unpromising avenues.
Second, the policy can capture multiple modes of near-
optimal behavior. In problem settings where multiple ac-
tions seem equally attractive, the policy will commit equal
probability mass to those actions. Lastly, prior work has ob-
served improved exploration with this objective (Haarnoja
et al., 2017; Schulman et al., 2017a), and in our experi-
ments, we observe that it considerably improves learning
speed over state-of-the-art methods that optimize the conventional RL objective
function. We can extend the objective to infinite horizon problems by
introducing a discount factor $\gamma$ to ensure that the sum of expected
rewards and entropies is
finite. Writing down the maximum entropy objective for the
infinite horizon discounted case is more involved (Thomas,
2014) and is deferred to Appendix A.
Prior methods have proposed directly solving for the op-
timal Q-function, from which the optimal policy can be
recovered (Ziebart et al., 2008; Fox et al., 2016; Haarnoja
et al., 2017). We will discuss how we can devise a soft
actor-critic algorithm through a policy iteration formulation,
where we instead evaluate the Q-function of the current
policy and update the policy through an off-policy gradient
update. Though such algorithms have previously been pro-
posed for conventional reinforcement learning, our method
is, to our knowledge, the first off-policy actor-critic method
in the maximum entropy reinforcement learning framework.
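To make this alternation concrete before the formal treatment in Section 4, the
following is a minimal tabular sketch, written under simplifying assumptions
(finite state and action sets, an unrestricted policy class, and the temperature
folded into the reward); it is an illustration, not the exact procedure derived
below. It alternates a soft Bellman backup for the current policy with a
softmax policy improvement step.

import numpy as np

# Tabular sketch of alternating soft policy evaluation and soft policy
# improvement on a small random MDP (illustrative assumptions only).
# Evaluation backup: Q(s,a) <- r(s,a) + gamma * E_{s'}[V(s')],
# with the soft value V(s) = E_{a~pi}[Q(s,a) - log pi(a|s)].
# Improvement: pi(a|s) proportional to exp(Q(s,a)), i.e. the softmax policy.
rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a, s'] transition densities
R = rng.normal(size=(nS, nA))                  # bounded-ish rewards r(s, a)

pi = np.full((nS, nA), 1.0 / nA)               # start from the uniform policy
Q = np.zeros((nS, nA))

for _ in range(200):
    # Soft policy evaluation: repeatedly apply the soft Bellman backup for pi.
    for _ in range(100):
        V = np.sum(pi * (Q - np.log(pi + 1e-12)), axis=1)  # soft state value
        Q = R + gamma * P @ V
    # Soft policy improvement: pi(a|s) proportional to exp(Q(s, a)).
    logits = Q - Q.max(axis=1, keepdims=True)
    pi = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

print("soft-optimal policy:\n", np.round(pi, 3))

With an unrestricted tabular policy class the improvement step is exact; the
derivation in the next section handles the general case of a restricted,
parameterized policy class.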
4. From Soft Policy Iteration to Soft
Actor-Critic
Our off-policy soft actor-critic algorithm can be derived
starting from a maximum entropy variant of the policy it-
eration method. We will first present this derivation, verify
that the corresponding algorithm converges to the optimal
policy from its density class, and then present a practical
deep reinforcement learning algorithm based on this theory.