How to train with multiple GPUs in PyTorch
Training with multiple GPUs in PyTorch can significantly speed up the training process. The steps are as follows:
1. Import the required libraries and modules
```
import torch
import torch.nn as nn
import torch.nn.functional as F  # used in the model's forward pass below
import torch.optim as optim
from torch.utils.data import DataLoader
from torchvision import datasets, transforms
from torch.nn.parallel import DistributedDataParallel as DDP
import torch.distributed as dist
```
2. Set the hyperparameters
```
batch_size = 64
learning_rate = 0.001
epochs = 10
num_workers = 4
num_classes = 10
```
3. Define the model
```
class Net(nn.Module):
    def __init__(self):
        super(Net, self).__init__()
        self.conv1 = nn.Conv2d(3, 32, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=3, padding=1)
        # CIFAR-10 images are 32x32; two 2x2 max-pools reduce them to 8x8
        self.fc1 = nn.Linear(64*8*8, 512)
        self.fc2 = nn.Linear(512, num_classes)

    def forward(self, x):
        x = F.relu(self.conv1(x))
        x = F.max_pool2d(x, 2)
        x = F.relu(self.conv2(x))
        x = F.max_pool2d(x, 2)
        x = x.view(-1, 64*8*8)
        x = F.relu(self.fc1(x))
        x = self.fc2(x)
        return x

model = Net()
```
4. Initialize the distributed training environment
```
dist.init_process_group(backend='nccl', init_method='env://')
rank = dist.get_rank()        # on multi-node setups, use the local rank on this node instead
size = dist.get_world_size()  # used by the DistributedSampler below
torch.cuda.set_device(rank)
```
5. Move the model to its GPU and wrap it with DistributedDataParallel
```
model.to(rank)
model = DDP(model, device_ids=[rank])
```
6. Load the data
```
train_dataset = datasets.CIFAR10(root='./data', train=True, download=True, transform=transforms.ToTensor())
train_sampler = torch.utils.data.distributed.DistributedSampler(train_dataset, num_replicas=size, rank=rank)
train_loader = DataLoader(train_dataset, batch_size=batch_size, shuffle=False, num_workers=num_workers, pin_memory=True, sampler=train_sampler)
```
7. Define the loss function and optimizer
```
criterion = nn.CrossEntropyLoss().cuda(rank)
optimizer = optim.SGD(model.parameters(), lr=learning_rate, momentum=0.9)
```
8. Start training
```
for epoch in range(epochs):
    train_sampler.set_epoch(epoch)  # reshuffle the data differently each epoch
    for batch_idx, (data, target) in enumerate(train_loader):
        data, target = data.cuda(rank), target.cuda(rank)
        optimizer.zero_grad()
        output = model(data)
        loss = criterion(output, target)
        loss.backward()
        optimizer.step()
        if batch_idx % 100 == 0:
            print('Train Epoch: {} [{}/{} ({:.0f}%)]\tLoss: {:.6f}'.format(
                epoch, batch_idx * len(data), len(train_sampler),
                100. * batch_idx / len(train_loader), loss.item()))
```
9. Shut down the distributed training environment
```
dist.destroy_process_group()
```
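Putting the pieces together: the launcher described below starts one copy of the training script per GPU, so the snippets above typically live in a single train.py. The following skeleton is only a sketch of that structure; it reuses the names defined in the steps above (Net, the hyperparameters) and elides the bodies already shown.
```
# Structural sketch of train.py; assumes Net and the hyperparameters from the
# steps above are defined in this file, and elides the parts shown in steps 6-8.
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # Step 4: the launcher sets MASTER_ADDR, MASTER_PORT, RANK and WORLD_SIZE,
    # which init_method='env://' reads.
    dist.init_process_group(backend='nccl', init_method='env://')
    rank = dist.get_rank()
    torch.cuda.set_device(rank)  # on multi-node runs, use the local rank here

    # Steps 3 and 5: build the model on this process's GPU and wrap it in DDP
    model = DDP(Net().to(rank), device_ids=[rank])

    # Steps 6-8: dataset, DistributedSampler, DataLoader, loss, optimizer and
    # the training loop, exactly as shown above
    ...

    # Step 9: shut down the process group
    dist.destroy_process_group()

if __name__ == '__main__':
    main()
```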
These are the steps for training with multiple GPUs in PyTorch: initializing the distributed training environment, distributing the model across the GPUs, defining the loss function and optimizer, and so on. Note that multi-GPU training must run in a distributed environment; the processes are typically started with the torch.distributed.launch command. For example, to train across two machines with two GPUs each, run the following on the first machine:
```
python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=0 --master_addr="ip_address" --master_port=1234 train.py
```
Here --nproc_per_node is the number of GPUs used per node, --nnodes is the number of nodes, --node_rank is the index of the current node, --master_addr is the IP address of the master node, and --master_port is its port. On the second machine, run the same command with --node_rank set to 1, as shown below.
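Concretely, the only flag that changes on the second machine is --node_rank:
```
python -m torch.distributed.launch --nproc_per_node=2 --nnodes=2 --node_rank=1 --master_addr="ip_address" --master_port=1234 train.py
```
On recent PyTorch releases, torch.distributed.launch is deprecated in favor of torchrun, which accepts essentially the same arguments; check torchrun --help for the exact flag names in your version.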