One weird trick for parallelizing convolutional neural networks
Alex Krizhevsky
Google Inc.
akrizhevsky@google.com
April 29, 2014
Abstract
I present a new way to parallelize the training of convolutional neural networks across multiple GPUs. The method scales significantly better than all alternatives when applied to modern convolutional neural networks.
1 Introduction
This is meant to be a short note introducing a new way to parallelize the training of convolutional neural networks with stochastic gradient descent (SGD). I present two variants of the algorithm. The first variant perfectly simulates the synchronous execution of SGD on one core, while the second introduces an approximation such that it no longer perfectly simulates SGD, but nonetheless works better in practice.
2 Existing approaches
Convolutional neural networks are big models trained on big datasets. So there are two obvious ways to parallelize their training:
• across the model dimension, where different workers train different parts of the model, and
• across the data dimension, where different workers train on different data examples.
These are called model parallelism and data parallelism, respectively.
In model parallelism, whenever the model part (subset of neuron activities) trained by one worker requires output from a model part trained by another worker, the two workers must synchronize. In contrast, in data parallelism the workers must synchronize model parameters (or parameter gradients) to ensure that they are training a consistent model.
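To make the data-parallel synchronization step concrete, here is a minimal numpy sketch of one SGD step with K simulated workers on a single linear layer; the shapes, learning rate, and squared-error loss are illustrative only, and a real multi-GPU implementation would replace the explicit gradient averaging with an all-reduce.

```python
import numpy as np

# Minimal simulation of data-parallel SGD: each of K workers holds a replica
# of the same weight matrix, computes a gradient on its own slice of the
# batch, and the gradients are averaged (an all-reduce on real hardware)
# before every replica applies the identical update.

rng = np.random.default_rng(0)
K = 4                                            # number of workers
batch, d_in, d_out = 128, 256, 64                # illustrative sizes
lr = 0.01

W = rng.standard_normal((d_in, d_out)) * 0.01    # shared initial weights
x = rng.standard_normal((batch, d_in))           # one global batch
y = rng.standard_normal((batch, d_out))          # targets (squared-error loss)

# Split the batch across workers along the data dimension.
x_shards = np.array_split(x, K)
y_shards = np.array_split(y, K)

# Each worker computes a local gradient on its own shard.
local_grads = []
for xs, ys in zip(x_shards, y_shards):
    err = xs @ W - ys
    local_grads.append(xs.T @ err / len(xs))

# Synchronization step: average the gradients across workers.
g = np.mean(local_grads, axis=0)
W -= lr * g                                      # every replica now holds the same updated W
```

Because the shards are equal-sized, the averaged gradient equals the gradient of the full batch, so all replicas stay consistent after every step.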
In general, we should exploit all dimensions of parallelism. Neither scheme is better than the other a priori, but the relative degree to which we exploit each should be informed by the model architecture. In particular, model parallelism is efficient when the amount of computation per neuron activity is high (because the neuron activity is the unit being communicated), while data parallelism is efficient when the amount of computation per weight is high (because the weight is the unit being communicated).
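To make the trade-off concrete, here is a back-of-the-envelope comparison of work per communicated value for one convolutional layer and one fully-connected layer; the layer sizes are only illustrative (roughly AlexNet-scale) and are not taken from this note.

```python
# Multiply-adds (macs) per communicated value, per example, under each scheme.
# Layer sizes are illustrative, not taken from any particular model.

# A convolutional layer: 384 filters of size 3x3x256 over a 13x13 output map.
conv_weights = 384 * 3 * 3 * 256        # values exchanged under data parallelism
conv_acts = 384 * 13 * 13               # values exchanged under model parallelism
conv_macs = conv_weights * 13 * 13      # each weight is reused at every output position

# A fully-connected layer: 9216 inputs -> 4096 outputs.
fc_weights = 9216 * 4096                # values exchanged under data parallelism
fc_acts = 4096                          # values exchanged under model parallelism
fc_macs = fc_weights                    # each weight is used exactly once per example

print("conv: macs per weight    ", conv_macs // conv_weights)  # 169  -> data parallelism cheap
print("fc:   macs per weight    ", fc_macs // fc_weights)      # 1    -> data parallelism costly
print("conv: macs per activation", conv_macs // conv_acts)     # 2304
print("fc:   macs per activation", fc_macs // fc_acts)         # 9216 -> model parallelism cheap
```

With these sizes the convolutional layer does two orders of magnitude more work per weight than the fully-connected layer, so a weight exchange is far easier to amortize there; the fully-connected layer does the most work per activation and has a small representation, so communicating activations suits it better.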
Another factor affecting all of this is batch size. We can make data parallelism arbitrarily efficient if we are willing to increase the batch size (because the weight synchronization step is performed once per batch). But very big batch sizes adversely affect the rate at which SGD converges as well as the quality of the final solution. So here I target batch sizes in the hundreds or possibly thousands of examples.
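The amortization is easy to quantify: with P parameters synchronized once per batch of B examples, each example accounts for P/B communicated values. A small illustration, using a parameter count that is only an order-of-magnitude guess for a large convnet:

```python
# Per-example communication cost of data parallelism falls as 1/batch_size,
# because the weight (or gradient) exchange happens once per batch.
params = 60_000_000                      # illustrative parameter count
for batch_size in (128, 512, 1024, 8192):
    values_per_example = params / batch_size
    print(f"batch {batch_size:5d}: {values_per_example:10.0f} values exchanged per example")
```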
3 Some observations
Modern convolutional neural nets consist of two types of layers with rather different properties:
• Convolutional layers cumulatively contain about 90-95% of the computation, about 5% of the parameters, and have large representations.
• Fully-connected layers contain about 5-10% of the computation, about 95% of the parameters, and have small representations.
Knowing this, it is natural to ask whether we should parallelize these two types of layers in different ways. In particular, data parallelism appears attractive for convolutional layers, while model parallelism appears attractive for fully-connected layers.
This is precisely what I’m proposing. In the remainder of this note I will explain the scheme in more detail and also mention several nice properties.
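As a rough illustration of the hybrid scheme, the following numpy sketch simulates K workers: the convolutional stage is data-parallel (each worker handles only its own examples; a stored feature matrix stands in for the conv stack), while a fully-connected layer is model-parallel (each worker owns a column slice of the weight matrix and processes every worker's examples). All shapes are illustrative.

```python
import numpy as np

# Hybrid parallelism sketch: data parallelism for the convolutional stage,
# model parallelism for a fully-connected layer.

rng = np.random.default_rng(0)
K = 4                                            # number of workers
per_worker_batch, conv_dim, fc_out = 32, 2304, 512

# Data-parallel stage: each worker produces conv features for its own examples.
# (A stored random feature matrix stands in for a real conv stack here.)
conv_features = [rng.standard_normal((per_worker_batch, conv_dim)) for _ in range(K)]

# Model-parallel stage: the fully-connected weight matrix is split by output
# columns, one slice per worker.
W_fc = rng.standard_normal((conv_dim, fc_out)) * 0.01
W_slices = np.array_split(W_fc, K, axis=1)

# Each worker gathers every worker's conv features (the communication step)
# and applies only its own slice of the fully-connected weights.
all_features = np.concatenate(conv_features, axis=0)        # (K*batch, conv_dim)
partial_outputs = [all_features @ W_k for W_k in W_slices]   # each (K*batch, fc_out/K)

# Concatenating the per-worker partial outputs recovers the full fully-connected
# activations, matching what a single device would compute.
fc_parallel = np.concatenate(partial_outputs, axis=1)
fc_serial = all_features @ W_fc
assert np.allclose(fc_parallel, fc_serial)
```

In this sketch the only communication is gathering the workers' conv features into the fully-connected stage; since those representations are small relative to the fully-connected weight matrices, this is far cheaper than synchronizing the fully-connected weights themselves.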