Write drone landing code based on the DDPG algorithm with PyCharm and PyTorch
This is a fairly complex project that requires some background knowledge. Below is a simple code example for reference only; it is not guaranteed to be fully correct or runnable and will need to be modified and tuned for your actual setup.
First, install PyTorch and OpenAI Gym. Note that the code below uses the classic Gym API, where `reset` returns only the observation and `step` returns four values, so an older Gym release (0.25.x or earlier) may be required:
```
pip install torch gym
```
Next, we define a drone landing environment:
```python
import gym
import numpy as np


class LandingEnv(gym.Env):
    """A toy landing environment: the agent controls vertical thrust."""

    def __init__(self):
        # Observation: a 3-dimensional vector with each component in [0, 1]
        self.observation_space = gym.spaces.Box(low=0, high=1, shape=(3,))
        # Action: a single throttle input in [-1, 1]
        self.action_space = gym.spaces.Box(low=-1, high=1, shape=(1,))
        self.goal = np.array([0.5, 0.5, 0.0])   # target state (altitude 0)
        self.state = np.array([0.0, 0.0, 1.0])  # start at altitude 1
        self.time_step = 0
        self.max_time_step = 100

    def reset(self):
        self.state = np.array([0.0, 0.0, 1.0])
        self.time_step = 0
        return self.state

    def step(self, action):
        self.time_step += 1
        action = np.clip(action, self.action_space.low, self.action_space.high)
        noise = np.random.normal(0, 0.1)  # process noise on the throttle
        # Only the last state component (altitude) is driven by the action
        next_state = np.clip(
            self.state + np.array([0.0, 0.0, action[0] + noise]),
            self.observation_space.low,
            self.observation_space.high,
        )
        # Reward: negative sum of absolute differences between state and goal
        reward = -np.sum(np.abs(next_state - self.goal))
        done = (self.time_step >= self.max_time_step)
        self.state = next_state
        return next_state, reward, done, {}

    def render(self, mode="human"):
        pass
```
In this environment, the drone must descend to the target altitude within the allotted time. The observation space is a 3-dimensional vector (intended to represent the drone's height, velocity, and acceleration, although the toy dynamics above only update the last component), and the action space is a 1-dimensional vector representing the throttle input. The reward is the negative sum of absolute differences between the state and the goal.
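Before moving on, it can help to run a quick smoke test of the environment with random actions. This is only a sanity check and uses nothing beyond the interfaces defined above:
```python
# Quick sanity check of LandingEnv with a random policy
env = LandingEnv()
state = env.reset()
total_reward = 0.0
done = False
while not done:
    action = env.action_space.sample()        # random throttle in [-1, 1]
    state, reward, done, _ = env.step(action)
    total_reward += reward
print("episode return with random policy:", total_reward)
```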
Next, we define a DDPG agent:
```python
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F

# Use the GPU if one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")


class Actor(nn.Module):
    """Deterministic policy: maps a state to an action in [-max_action, max_action]."""

    def __init__(self, state_dim, action_dim, max_action):
        super(Actor, self).__init__()
        self.fc1 = nn.Linear(state_dim, 256)
        self.fc2 = nn.Linear(256, 256)
        self.fc3 = nn.Linear(256, action_dim)
        self.max_action = max_action

    def forward(self, state):
        x = F.relu(self.fc1(state))
        x = F.relu(self.fc2(x))
        return self.max_action * torch.tanh(self.fc3(x))


class Critic(nn.Module):
    """Q-function: maps a (state, action) pair to a scalar value."""

    def __init__(self, state_dim, action_dim):
        super(Critic, self).__init__()
        self.fc1 = nn.Linear(state_dim + action_dim, 256)
        self.fc2 = nn.Linear(256, 256)
        self.fc3 = nn.Linear(256, 1)

    def forward(self, state, action):
        x = torch.cat([state, action], 1)
        x = F.relu(self.fc1(x))
        x = F.relu(self.fc2(x))
        return self.fc3(x)


class DDPG(object):
    def __init__(self, state_dim, action_dim, max_action):
        self.actor = Actor(state_dim, action_dim, max_action).to(device)
        self.actor_target = Actor(state_dim, action_dim, max_action).to(device)
        self.actor_target.load_state_dict(self.actor.state_dict())
        self.actor_optimizer = optim.Adam(self.actor.parameters(), lr=1e-3)

        self.critic = Critic(state_dim, action_dim).to(device)
        self.critic_target = Critic(state_dim, action_dim).to(device)
        self.critic_target.load_state_dict(self.critic.state_dict())
        self.critic_optimizer = optim.Adam(self.critic.parameters(), lr=1e-3)

        self.max_action = max_action

    def select_action(self, state):
        state = torch.FloatTensor(state.reshape(1, -1)).to(device)
        return self.actor(state).cpu().data.numpy().flatten()

    def train(self, replay_buffer, batch_size=256, discount=0.99, tau=0.001):
        # reward and done are expected with shape (batch_size, 1)
        state, action, next_state, reward, done = replay_buffer.sample(batch_size)
        state = torch.FloatTensor(state).to(device)
        action = torch.FloatTensor(action).to(device)
        next_state = torch.FloatTensor(next_state).to(device)
        reward = torch.FloatTensor(reward).to(device)
        done = torch.FloatTensor(done).to(device)

        # Critic update: regress Q(s, a) toward the bootstrapped target
        target_Q = self.critic_target(next_state, self.actor_target(next_state))
        target_Q = reward + ((1 - done) * discount * target_Q).detach()
        current_Q = self.critic(state, action)
        critic_loss = F.mse_loss(current_Q, target_Q)
        self.critic_optimizer.zero_grad()
        critic_loss.backward()
        self.critic_optimizer.step()

        # Actor update: maximize the critic's estimate of Q(s, pi(s))
        actor_loss = -self.critic(state, self.actor(state)).mean()
        self.actor_optimizer.zero_grad()
        actor_loss.backward()
        self.actor_optimizer.step()

        # Soft-update the target networks
        for param, target_param in zip(self.critic.parameters(), self.critic_target.parameters()):
            target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)
        for param, target_param in zip(self.actor.parameters(), self.actor_target.parameters()):
            target_param.data.copy_(tau * param.data + (1 - tau) * target_param.data)
```
In this DDPG agent we use an Actor network and a Critic network: the Actor maps a state to an action, and the Critic maps a state-action pair to a Q-value. During training we sample mini-batches from an experience replay buffer, and we use target networks with soft updates to stabilize learning.
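The training loop below relies on a `ReplayBuffer` that is not defined above. Here is a minimal sketch of one; the class name and the methods `add`, `sample`, and `__len__` are chosen here simply to match the usage that follows. It stores rewards and done flags with shape `(batch_size, 1)` so they broadcast correctly against the critic output:
```python
import numpy as np


class ReplayBuffer:
    """Minimal FIFO experience replay buffer (a sketch, not production code)."""

    def __init__(self, max_size=100000):
        self.storage = []
        self.max_size = max_size

    def add(self, state, action, next_state, reward, done):
        if len(self.storage) >= self.max_size:
            self.storage.pop(0)  # drop the oldest transition
        self.storage.append((state, action, next_state, reward, done))

    def sample(self, batch_size):
        idx = np.random.randint(0, len(self.storage), size=batch_size)
        states, actions, next_states, rewards, dones = [], [], [], [], []
        for i in idx:
            s, a, ns, r, d = self.storage[i]
            states.append(s)
            actions.append(a)
            next_states.append(ns)
            rewards.append([r])       # keep shape (batch_size, 1)
            dones.append([float(d)])  # keep shape (batch_size, 1)
        return (np.array(states), np.array(actions), np.array(next_states),
                np.array(rewards), np.array(dones))

    def __len__(self):
        return len(self.storage)
```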
Finally, we can put the environment and the agent together for training:
```python
env = LandingEnv()
agent = DDPG(state_dim=3, action_dim=1, max_action=1.0)
replay_buffer = ReplayBuffer()

batch_size = 256
max_steps = 100000  # total environment steps; tune for your problem
expl_noise = 0.1    # std of Gaussian exploration noise

state = env.reset()
for step in range(max_steps):
    # Add exploration noise to the deterministic policy output
    action = agent.select_action(state)
    action = np.clip(action + np.random.normal(0, expl_noise, size=action.shape), -1.0, 1.0)
    next_state, reward, done, _ = env.step(action)
    replay_buffer.add(state, action, next_state, reward, done)
    # Only start training once the buffer holds at least one full batch
    if len(replay_buffer) >= batch_size:
        agent.train(replay_buffer, batch_size=batch_size)
    state = next_state
    if done:
        state = env.reset()
```
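Once training finishes (or at checkpoints along the way), the learned weights can be saved and reloaded. A minimal example, reusing the `device` defined above; the file name is chosen here purely for illustration:
```python
# Save the trained actor (the critic can be saved the same way if needed)
torch.save(agent.actor.state_dict(), "ddpg_actor.pth")

# Later, restore the weights into a freshly constructed agent
agent = DDPG(state_dim=3, action_dim=1, max_action=1.0)
agent.actor.load_state_dict(torch.load("ddpg_actor.pth", map_location=device))
agent.actor.eval()
```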
After training, we can evaluate the trained agent:
```python
state = env.reset()
episode_return = 0.0
done = False
while not done:
    action = agent.select_action(state)  # deterministic policy, no exploration noise
    state, reward, done, _ = env.step(action)
    episode_return += reward
print("evaluation episode return:", episode_return)
```
That is a simple drone-landing example based on the DDPG algorithm. Keep in mind that this is only sample code; a real project will require many further modifications and adjustments.