PPO Reinforcement Learning: Discrete and Continuous Action Spaces
PPO (Proximal Policy Optimization) is a reinforcement learning algorithm that works with both discrete and continuous action spaces. Its core idea is to constrain, at each update, how far the new policy can move away from the old policy, so that the updated policy never strays too far from the policy that collected the data; this keeps learning stable. PPO comes in two variants: PPO-Penalty and PPO-Clip.
For discrete action spaces the policy is a categorical distribution over actions. PPO-Penalty constrains the difference between the new and old policies with a KL-divergence penalty, while PPO-Clip constrains it with a clipping function: the probability ratio between the new and old policies is kept within a fixed interval around 1, whose width is controlled by the hyperparameter ε.
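For reference, these are the standard surrogate objectives from the PPO paper, where $r_t(\theta)=\pi_\theta(a_t\mid s_t)/\pi_{\theta_{\text{old}}}(a_t\mid s_t)$ is the probability ratio and $\hat{A}_t$ is the advantage estimate:

$$
L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\right]
$$

$$
L^{\mathrm{KLPEN}}(\theta) = \hat{\mathbb{E}}_t\!\left[r_t(\theta)\,\hat{A}_t \;-\; \beta\,\mathrm{KL}\!\big[\pi_{\theta_{\text{old}}}(\cdot\mid s_t)\,\|\,\pi_\theta(\cdot\mid s_t)\big]\right]
$$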
For continuous action spaces, PPO represents the policy as a normal (Gaussian) distribution whose mean and standard deviation are the policy parameters. During training, a neural network outputs the mean (and, typically, the log standard deviation), and actions are sampled from the resulting distribution. PPO-Clip carries over unchanged: the clipped ratio is computed from the log-probabilities of the sampled actions under the new and old Gaussian policies. A sketch of such a Gaussian policy head is shown below.
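The following is a minimal sketch of a Gaussian policy head for continuous actions; the layer sizes, the Tanh activations, and the state-independent learned log standard deviation are assumptions of this sketch, not details from the text above.

```python
import torch
import torch.nn as nn
from torch.distributions import Normal

class GaussianPolicy(nn.Module):
    # Illustrative sketch: hidden sizes and the state-independent log_std
    # are example choices, not prescribed by the text above.
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.mu = nn.Linear(hidden, act_dim)               # action mean
        self.log_std = nn.Parameter(torch.zeros(act_dim))  # learned log std

    def forward(self, obs):
        h = self.body(obs)
        return Normal(self.mu(h), self.log_std.exp())

# Sampling and the log-probability that enters the PPO ratio:
policy = GaussianPolicy(obs_dim=3, act_dim=1)
dist = policy(torch.randn(1, 3))
action = dist.sample()
log_prob = dist.log_prob(action).sum(dim=-1)
```

A state-independent log standard deviation keeps exploration noise from collapsing too early in training; some implementations predict it from the state instead.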
Below is example code that applies PPO to the Pendulum (inverted pendulum) task:
```python
import gym
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
from torch.distributions import Normal

# Actor-critic network: the actor head outputs the action mean,
# the critic head outputs the state value.
class ActorCritic(nn.Module):
    def __init__(self):
        super(ActorCritic, self).__init__()
        self.fc1 = nn.Linear(3, 64)     # Pendulum observations are 3-dimensional
        self.fc2 = nn.Linear(64, 64)
        self.actor = nn.Linear(64, 1)   # action mean
        self.critic = nn.Linear(64, 1)  # state value

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        actor = torch.tanh(self.actor(x)) * 2  # map the mean into the action range [-2, 2]
        critic = self.critic(x)
        return actor, critic

# PPO-Clip agent with a fixed exploration std of 1.0
class PPO:
    def __init__(self):
        self.gamma = 0.99        # discount factor
        self.lmbda = 0.95        # GAE lambda
        self.eps_clip = 0.2      # clipping range epsilon
        self.K = 10              # optimization epochs per update
        self.actor_critic = ActorCritic()
        self.optimizer = optim.Adam(self.actor_critic.parameters(), lr=0.001)

    def select_action(self, state):
        state = torch.FloatTensor(state.reshape(1, -1))
        with torch.no_grad():
            actor, _ = self.actor_critic(state)
        dist = Normal(actor, torch.ones(1, 1))  # Gaussian policy with fixed std = 1
        action = dist.sample()
        return action.item()

    def update(self, memory):
        states = torch.FloatTensor(memory.states)
        actions = torch.FloatTensor(memory.actions).unsqueeze(-1)
        old_log_probs = torch.FloatTensor(memory.log_probs).unsqueeze(-1)
        returns = torch.FloatTensor(memory.returns).unsqueeze(-1)
        advantages = torch.FloatTensor(memory.advantages).unsqueeze(-1)
        for _ in range(self.K):
            actor, critic = self.actor_critic(states)
            dist = Normal(actor, torch.ones_like(actor))
            log_probs = dist.log_prob(actions)
            # Probability ratio between the new and old policies
            ratios = torch.exp(log_probs - old_log_probs)
            # Clipped surrogate objective
            surr1 = ratios * advantages
            surr2 = torch.clamp(ratios, 1 - self.eps_clip, 1 + self.eps_clip) * advantages
            actor_loss = -torch.min(surr1, surr2).mean()
            critic_loss = F.mse_loss(critic, returns)
            loss = actor_loss + 0.5 * critic_loss
            self.optimizer.zero_grad()
            loss.backward()
            self.optimizer.step()

# Train PPO (Pendulum-v0 uses the old Gym API; newer Gym/Gymnasium releases
# use Pendulum-v1 and return a 5-tuple from step())
env = gym.make('Pendulum-v0')
ppo = PPO()
memory = Memory()  # rollout buffer; a sketch is given after this listing
for i in range(1000):
    state = env.reset()
    done = False
    while not done:
        action = ppo.select_action(state)
        next_state, reward, done, _ = env.step([action])
        memory.add(state, action, reward, next_state, done)
        state = next_state
    if i % 10 == 0:
        memory.calculate_returns(ppo.actor_critic, ppo.gamma, ppo.lmbda)
        ppo.update(memory)
        memory.clear()

# Test the trained policy
state = env.reset()
done = False
while not done:
    action = ppo.select_action(state)
    next_state, reward, done, _ = env.step([action])
    env.render()
    state = next_state
env.close()
```
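The training loop above relies on a `Memory` rollout buffer (`add`, `calculate_returns`, `clear`, plus the `states`, `actions`, `log_probs`, `returns`, and `advantages` attributes read in `update`) that the snippet never defines. Below is a minimal sketch that fits those calls; recomputing the old log-probabilities inside `calculate_returns` with the same fixed std of 1 used in `select_action`, and estimating advantages with GAE, are assumptions of this sketch.

```python
import torch
from torch.distributions import Normal

class Memory:
    """Rollout buffer sketch matching the calls in the example above.
    The GAE computation and the recomputation of old log-probs here
    (with the same fixed std = 1 used in select_action) are assumptions;
    the original snippet never defines Memory."""
    def __init__(self):
        self.clear()

    def add(self, state, action, reward, next_state, done):
        self.states.append(state)
        self.actions.append(action)
        self.rewards.append(reward)
        self.next_states.append(next_state)
        self.dones.append(done)

    def clear(self):
        self.states, self.actions, self.rewards = [], [], []
        self.next_states, self.dones = [], []
        self.log_probs, self.returns, self.advantages = [], [], []

    def calculate_returns(self, actor_critic, gamma, lmbda):
        states = torch.FloatTensor(self.states)
        next_states = torch.FloatTensor(self.next_states)
        actions = torch.FloatTensor(self.actions).unsqueeze(-1)
        rewards = torch.FloatTensor(self.rewards)
        dones = torch.FloatTensor(self.dones)
        with torch.no_grad():
            mu, values = actor_critic(states)
            _, next_values = actor_critic(next_states)
            old_log_probs = Normal(mu, torch.ones_like(mu)).log_prob(actions).squeeze(-1)
            values = values.squeeze(-1)
            next_values = next_values.squeeze(-1)
        # Generalized Advantage Estimation (GAE)
        advantages = torch.zeros_like(rewards)
        gae = 0.0
        for t in reversed(range(len(rewards))):
            delta = rewards[t] + gamma * next_values[t] * (1 - dones[t]) - values[t]
            gae = delta + gamma * lmbda * (1 - dones[t]) * gae
            advantages[t] = gae
        returns = advantages + values
        # Normalizing advantages is a common PPO stabilization trick
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
        self.log_probs = old_log_probs.tolist()
        self.returns = returns.tolist()
        self.advantages = advantages.tolist()
```

Normalizing the advantages before the update keeps the scale of the surrogate loss roughly constant across tasks; without it, the loss magnitude depends directly on the size of the environment's rewards.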