One weird trick for parallelizing convolutional neural networks
Alex Krizhevsky
Google Inc.
akrizhevsky@google.com
April 29, 2014
Abstract
I present a new way to parallelize the training of convolutional neural networks across multiple GPUs. The method scales significantly better than all alternatives when applied to modern convolutional neural networks.
1 Introduction
This is meant to be a short note introducing a new way to parallelize the training of convolutional neural networks with stochastic gradient descent (SGD). I present two variants of the algorithm. The first variant perfectly simulates the synchronous execution of SGD on one core, while the second introduces an approximation such that it no longer perfectly simulates SGD, but nonetheless works better in practice.
2 Existing approaches
Convolutional neural networks are big models trained on big datasets. So there are two obvious ways to parallelize their training:
• across the model dimension, where different workers train different parts of the model, and
• across the data dimension, where different workers train on different data examples.
These are called model parallelism and data parallelism, respectively.
In model parallelism, whenever the model part (subset of neuron activities) trained by one worker requires output from a model part trained by another worker, the two workers must synchronize. In contrast, in data parallelism the workers must synchronize model parameters (or parameter gradients) to ensure that they are training a consistent model.
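To make the data-parallel synchronization step concrete, here is a minimal numpy sketch of one SGD step with K simulated workers on a single linear layer; the shapes, learning rate, and squared-error loss are illustrative only, and a real multi-GPU implementation would replace the explicit gradient averaging with an all-reduce.

```python
import numpy as np

# Minimal simulation of data-parallel SGD: each of K workers holds a replica
# of the same weight matrix, computes a gradient on its own slice of the
# batch, and the gradients are averaged (an all-reduce on real hardware)
# before every replica applies the identical update.

rng = np.random.default_rng(0)
K = 4                                            # number of workers
batch, d_in, d_out = 128, 256, 64                # illustrative sizes
lr = 0.01

W = rng.standard_normal((d_in, d_out)) * 0.01    # shared initial weights
x = rng.standard_normal((batch, d_in))           # one global batch
y = rng.standard_normal((batch, d_out))          # targets (squared-error loss)

# Split the batch across workers along the data dimension.
x_shards = np.array_split(x, K)
y_shards = np.array_split(y, K)

# Each worker computes a local gradient on its own shard.
local_grads = []
for xs, ys in zip(x_shards, y_shards):
    err = xs @ W - ys
    local_grads.append(xs.T @ err / len(xs))

# Synchronization step: average the gradients across workers.
g = np.mean(local_grads, axis=0)
W -= lr * g                                      # every replica now holds the same updated W
```

Because the shards are equal-sized, the averaged gradient equals the gradient of the full batch, so all replicas stay consistent after every step.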
In general, we should exploit all dimensions of parallelism. Neither scheme is better than the other a priori, but the relative degree to which we exploit each should be informed by the model architecture. In particular, model parallelism is efficient when the amount of computation per neuron activity is high (because the neuron activity is the unit being communicated), while data parallelism is efficient when the amount of computation per weight is high (because the weight is the unit being communicated).
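To make the trade-off concrete, here is a back-of-the-envelope comparison of work per communicated value for one convolutional layer and one fully-connected layer; the layer sizes are only illustrative (roughly AlexNet-scale) and are not taken from this note.

```python
# Multiply-adds (macs) per communicated value, per example, under each scheme.
# Layer sizes are illustrative, not taken from any particular model.

# A convolutional layer: 384 filters of size 3x3x256 over a 13x13 output map.
conv_weights = 384 * 3 * 3 * 256        # values exchanged under data parallelism
conv_acts = 384 * 13 * 13               # values exchanged under model parallelism
conv_macs = conv_weights * 13 * 13      # each weight is reused at every output position

# A fully-connected layer: 9216 inputs -> 4096 outputs.
fc_weights = 9216 * 4096                # values exchanged under data parallelism
fc_acts = 4096                          # values exchanged under model parallelism
fc_macs = fc_weights                    # each weight is used exactly once per example

print("conv: macs per weight    ", conv_macs // conv_weights)  # 169  -> data parallelism cheap
print("fc:   macs per weight    ", fc_macs // fc_weights)      # 1    -> data parallelism costly
print("conv: macs per activation", conv_macs // conv_acts)     # 2304
print("fc:   macs per activation", fc_macs // fc_acts)         # 9216 -> model parallelism cheap
```

With these sizes the convolutional layer does two orders of magnitude more work per weight than the fully-connected layer, so a weight exchange is far easier to amortize there; the fully-connected layer does the most work per activation and has a small representation, so communicating activations suits it better.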
Another factor affecting all of this is batch size. We can make data parallelism arbitrarily efficient if we are willing to increase the batch size (because the weight synchronization step is performed once per batch). But very big batch sizes adversely affect the rate at which SGD converges as well as the quality of the final solution. So here I target batch sizes in the hundreds or possibly thousands of examples.
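The amortization is easy to quantify: with P parameters synchronized once per batch of B examples, each example accounts for P/B communicated values. A small illustration, using a parameter count that is only an order-of-magnitude guess for a large convnet:

```python
# Per-example communication cost of data parallelism falls as 1/batch_size,
# because the weight (or gradient) exchange happens once per batch.
params = 60_000_000                      # illustrative parameter count
for batch_size in (128, 512, 1024, 8192):
    values_per_example = params / batch_size
    print(f"batch {batch_size:5d}: {values_per_example:10.0f} values exchanged per example")
```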
3 Some observations
Modern convolutional neural nets consist of two types of layers with rather different properties:
• Convolutional layers cumulatively contain about 90-95% of the computation, about 5% of the parameters, and have large representations.
• Fully-connected layers contain about 5-10% of the computation, about 95% of the parameters, and have small representations.
Knowing this, it is natural to ask whether we should parallelize these two types of layers in different ways. In particular, data parallelism appears attractive for convolutional layers, while model parallelism appears attractive for fully-connected layers.
This is precisely what I’m proposing. In the remainder of this note I will explain the scheme in more detail and also mention several nice properties.
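As a rough illustration of the hybrid scheme, the following numpy sketch simulates K workers: the convolutional stage is data-parallel (each worker handles only its own examples; a stored feature matrix stands in for the conv stack), while a fully-connected layer is model-parallel (each worker owns a column slice of the weight matrix and processes every worker's examples). All shapes are illustrative.

```python
import numpy as np

# Hybrid parallelism sketch: data parallelism for the convolutional stage,
# model parallelism for a fully-connected layer.

rng = np.random.default_rng(0)
K = 4                                            # number of workers
per_worker_batch, conv_dim, fc_out = 32, 2304, 512

# Data-parallel stage: each worker produces conv features for its own examples.
# (A stored random feature matrix stands in for a real conv stack here.)
conv_features = [rng.standard_normal((per_worker_batch, conv_dim)) for _ in range(K)]

# Model-parallel stage: the fully-connected weight matrix is split by output
# columns, one slice per worker.
W_fc = rng.standard_normal((conv_dim, fc_out)) * 0.01
W_slices = np.array_split(W_fc, K, axis=1)

# Each worker gathers every worker's conv features (the communication step)
# and applies only its own slice of the fully-connected weights.
all_features = np.concatenate(conv_features, axis=0)        # (K*batch, conv_dim)
partial_outputs = [all_features @ W_k for W_k in W_slices]   # each (K*batch, fc_out/K)

# Concatenating the per-worker partial outputs recovers the full fully-connected
# activations, matching what a single device would compute.
fc_parallel = np.concatenate(partial_outputs, axis=1)
fc_serial = all_features @ W_fc
assert np.allclose(fc_parallel, fc_serial)
```

In this sketch the only communication is gathering the workers' conv features into the fully-connected stage; since those representations are small relative to the fully-connected weight matrices, this is far cheaper than synchronizing the fully-connected weights themselves.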