ing towards high-reward regions. More recently, several
papers have noted the connection between Q-learning and
policy gradient methods in the framework of maximum en-
tropy learning (O’Donoghue et al., 2016; Haarnoja et al.,
2017; Nachum et al., 2017a; Schulman et al., 2017a). While
most of the prior model-free works assume a discrete action
space, Nachum et al. (2017b) approximate the maximum en-
tropy distribution with a Gaussian and Haarnoja et al. (2017)
with a sampling network trained to draw samples from the
optimal policy. Although the soft Q-learning algorithm pro-
posed by Haarnoja et al. (2017) has a value function and
actor network, it is not a true actor-critic algorithm: the
Q-function is estimating the optimal Q-function, and the
actor does not directly affect the Q-function except through
the data distribution. Hence, Haarnoja et al. (2017) moti-
vates the actor network as an approximate sampler, rather
than the actor in an actor-critic algorithm. Crucially, the
convergence of this method hinges on how well this sampler
approximates the true posterior. In contrast, we prove that
our method converges to the optimal policy from a given
policy class, regardless of the policy parameterization. Fur-
thermore, these prior maximum entropy methods generally
do not exceed the performance of state-of-the-art off-policy
algorithms, such as DDPG, when learning from scratch,
though they may have other benefits, such as improved ex-
ploration and ease of fine-tuning. In our experiments, we
demonstrate that our soft actor-critic algorithm does in fact
exceed the performance of prior state-of-the-art off-policy
deep RL methods by a wide margin.
3. Preliminaries
We first introduce notation and summarize the standard and
maximum entropy reinforcement learning frameworks.
3.1. Notation
We address policy learning in continuous action spaces.
We consider an infinite-horizon Markov decision process
(MDP), defined by the tuple $(\mathcal{S}, \mathcal{A}, p, r)$, where the state
space $\mathcal{S}$ and the action space $\mathcal{A}$ are continuous, and the
unknown state transition probability $p : \mathcal{S} \times \mathcal{S} \times \mathcal{A} \to [0, \infty)$
represents the probability density of the next state $s_{t+1} \in \mathcal{S}$
given the current state $s_t \in \mathcal{S}$ and action $a_t \in \mathcal{A}$.
The environment emits a bounded reward
$r : \mathcal{S} \times \mathcal{A} \to [r_{\min}, r_{\max}]$ on each transition.
We will use $\rho_\pi(s_t)$ and $\rho_\pi(s_t, a_t)$ to denote the state and
state-action marginals of the trajectory distribution induced by a policy $\pi(a_t|s_t)$.
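Concretely, one standard way to unpack these marginals from the trajectory
distribution $p(s_0) \prod_t \pi(a_t|s_t)\, p(s_{t+1}|s_t, a_t)$ (written out here for
concreteness) is

$$\rho_\pi(s_t) = \int p(s_0) \prod_{l=0}^{t-1} \pi(a_l|s_l)\, p(s_{l+1}|s_l, a_l)\, \mathrm{d}s_{0:t-1}\, \mathrm{d}a_{0:t-1},
\qquad
\rho_\pi(s_t, a_t) = \rho_\pi(s_t)\, \pi(a_t|s_t).$$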
3.2. Maximum Entropy Reinforcement Learning
Standard RL maximizes the expected sum of rewards
$\sum_t \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}[r(s_t, a_t)]$. We will consider a more general
maximum entropy objective (see e.g. Ziebart (2010)), which favors stochastic
policies by augmenting the objective with the expected entropy of the policy
over $\rho_\pi(s_t)$:
$$J(\pi) = \sum_{t=0}^{T} \mathbb{E}_{(s_t, a_t) \sim \rho_\pi}\big[ r(s_t, a_t) + \alpha \mathcal{H}(\pi(\,\cdot\,|s_t)) \big]. \qquad (1)$$
The temperature parameter $\alpha$ determines the relative importance of the
entropy term against the reward, and thus controls the stochasticity of the
optimal policy. The maximum entropy objective differs from the standard maximum
expected reward objective used in conventional reinforcement learning, though
the conventional objective can be recovered in the limit as $\alpha \to 0$. For
the rest of this paper, we will omit writing the temperature explicitly, as it
can always be subsumed into the reward by scaling it by $\alpha^{-1}$.
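As a concrete illustration (a minimal sketch, not part of the paper's method):
for a sampled finite-horizon trajectory, the objective in Eq. (1) can be
estimated by adding an entropy bonus of $-\log \pi(a_t|s_t)$ to each reward,
scaled by the temperature. The names rewards, log_probs, and alpha below are
assumptions made for the sketch.

import numpy as np

# Sketch only: Monte Carlo estimate of the maximum entropy objective in Eq. (1)
# for a single sampled trajectory. rewards[t] holds r(s_t, a_t) and log_probs[t]
# holds log pi(a_t | s_t) from rolling out a stochastic policy; the quantity
# -log pi(a_t | s_t) is a single-sample estimate of the entropy H(pi(.|s_t)).
def max_entropy_return(rewards, log_probs, alpha=0.2):
    rewards = np.asarray(rewards, dtype=np.float64)
    log_probs = np.asarray(log_probs, dtype=np.float64)
    # Each term adds the reward plus alpha times the sampled entropy bonus.
    return float(np.sum(rewards + alpha * (-log_probs)))

# Example: a three-step trajectory.
print(max_entropy_return([1.0, 0.5, 0.0], [-1.2, -0.8, -1.5]))

Setting alpha to zero in this sketch recovers the conventional return,
matching the limit $\alpha \to 0$ discussed above.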
This objective has a number of conceptual and practical
advantages. First, the policy is incentivized to explore more
widely, while giving up on clearly unpromising avenues.
Second, the policy can capture multiple modes of near-
optimal behavior. In problem settings where multiple ac-
tions seem equally attractive, the policy will commit equal
probability mass to those actions. Lastly, prior work has ob-
served improved exploration with this objective (Haarnoja
et al., 2017; Schulman et al., 2017a), and in our experi-
ments, we observe that it considerably improves learning
speed over state-of-the-art methods that optimize the conventional RL objective
function. We can extend the objective to infinite horizon problems by
introducing a discount factor $\gamma$ to ensure that the sum of expected
rewards and entropies is
finite. Writing down the maximum entropy objective for the
infinite horizon discounted case is more involved (Thomas,
2014) and is deferred to Appendix A.
Prior methods have proposed directly solving for the op-
timal Q-function, from which the optimal policy can be
recovered (Ziebart et al., 2008; Fox et al., 2016; Haarnoja
et al., 2017). We will discuss how we can devise a soft
actor-critic algorithm through a policy iteration formulation,
where we instead evaluate the Q-function of the current
policy and update the policy through an off-policy gradient
update. Though such algorithms have previously been pro-
posed for conventional reinforcement learning, our method
is, to our knowledge, the first off-policy actor-critic method
in the maximum entropy reinforcement learning framework.
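To make this alternation concrete before the formal treatment in Section 4, the
following is a minimal tabular sketch, written under simplifying assumptions
(finite state and action sets, an unrestricted policy class, and the temperature
folded into the reward); it is an illustration, not the exact procedure derived
below. It alternates a soft Bellman backup for the current policy with a
softmax policy improvement step.

import numpy as np

# Tabular sketch of alternating soft policy evaluation and soft policy
# improvement on a small random MDP (illustrative assumptions only).
# Evaluation backup: Q(s,a) <- r(s,a) + gamma * E_{s'}[V(s')],
# with the soft value V(s) = E_{a~pi}[Q(s,a) - log pi(a|s)].
# Improvement: pi(a|s) proportional to exp(Q(s,a)), i.e. the softmax policy.
rng = np.random.default_rng(0)
nS, nA, gamma = 4, 3, 0.9
P = rng.dirichlet(np.ones(nS), size=(nS, nA))  # P[s, a, s'] transition densities
R = rng.normal(size=(nS, nA))                  # bounded-ish rewards r(s, a)

pi = np.full((nS, nA), 1.0 / nA)               # start from the uniform policy
Q = np.zeros((nS, nA))

for _ in range(200):
    # Soft policy evaluation: repeatedly apply the soft Bellman backup for pi.
    for _ in range(100):
        V = np.sum(pi * (Q - np.log(pi + 1e-12)), axis=1)  # soft state value
        Q = R + gamma * P @ V
    # Soft policy improvement: pi(a|s) proportional to exp(Q(s, a)).
    logits = Q - Q.max(axis=1, keepdims=True)
    pi = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

print("soft-optimal policy:\n", np.round(pi, 3))

With an unrestricted tabular policy class the improvement step is exact; the
derivation in the next section handles the general case of a restricted,
parameterized policy class.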
4. From Soft Policy Iteration to Soft
Actor-Critic
Our off-policy soft actor-critic algorithm can be derived
starting from a maximum entropy variant of the policy it-
eration method. We will first present this derivation, verify
that the corresponding algorithm converges to the optimal
policy from its density class, and then present a practical
deep reinforcement learning algorithm based on this theory.