Multi-column Deep Neural Networks for Image Classification
Dan Cireşan, Ueli Meier and Jürgen Schmidhuber
IDSIA-USI-SUPSI
Galleria 2, 6928 Manno-Lugano, Switzerland
{dan,ueli,juergen}@idsia.ch
Abstract
Traditional methods of computer vision and machine
learning cannot match human performance on tasks such
as the recognition of handwritten digits or traffic signs. Our
biologically plausible, wide and deep artificial neural net-
work architectures can. Small (often minimal) receptive
fields of convolutional winner-take-all neurons yield large
network depth, resulting in roughly as many sparsely con-
nected neural layers as found in mammals between retina
and visual cortex. Only winner neurons are trained. Sev-
eral deep neural columns become experts on inputs pre-
processed in different ways; their predictions are averaged.
Graphics cards allow for fast training. On the very com-
petitive MNIST handwriting benchmark, our method is the
first to achieve near-human performance. On a traffic sign
recognition benchmark it outperforms humans by a factor
of two. We also improve the state-of-the-art on a plethora
of common image classification benchmarks.
1. Introduction
Recent publications suggest that unsupervised pre-
training of deep, hierarchical neural networks improves su-
pervised pattern classification [2, 10]. Here we train such
nets by simple online back-propagation, setting new, greatly
improved records on MNIST [19], Latin letters [13], Chi-
nese characters [22], traffic signs [33], NORB (jittered, clut-
tered) [20] and CIFAR10 [17] benchmarks.
We focus on deep convolutional neural networks (DNN),
introduced by [11], improved by [19], refined and simpli-
fied by [1, 32, 7]. Lately, DNN proved their mettle on data
sets ranging from handwritten digits (MNIST) [5, 7] and handwritten characters [6] to 3D toys (NORB) and faces [34].
DNNs fully unfold their potential when they are wide (many
maps per layer) and deep (many layers) [7]. But training
them requires weeks, months, even years on CPUs. High
data transfer latency prevents multi-threading and multi-
CPU code from saving the situation. In recent years, how-
ever, fast parallel neural net code for graphics cards (GPUs)
has overcome this problem. Carefully designed GPU code
for image classification can be up to two orders of magni-
tude faster than its CPU counterpart [35, 34]. Hence, to train
huge DNN in hours or days, we implement them on GPU,
building upon the work of [5, 7]. The training algorithm
is fully online, i.e. weight updates occur after each error
back-propagation step. We will show that properly trained
wide and deep DNNs can outperform all previous methods,
and demonstrate that unsupervised initialization/pretraining
is not necessary (although we don’t deny that it might help
sometimes, especially for datasets with few samples per
class). We also show how combining several DNN columns
into a Multi-column DNN (MCDNN) further decreases the
error rate by 30-40%.
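As a minimal sketch of these two ingredients, and not the paper's implementation, the following Python code illustrates (a) fully online training, with a weight update after every back-propagated sample, and (b) the MCDNN rule of averaging the predictions of several columns trained on differently preprocessed inputs. A toy linear softmax classifier stands in for each DNN column; all names, sizes and hyper-parameters below are illustrative assumptions.

# Minimal sketch, NOT the paper's implementation: (a) fully online training,
# i.e. a weight update after every back-propagated sample, and (b) MCDNN-style
# averaging of several columns trained on differently preprocessed inputs.
# A toy linear softmax classifier stands in for each DNN column; all names,
# sizes and hyper-parameters are assumptions.
import numpy as np

rng = np.random.default_rng(0)
n_classes, n_features, n_columns, lr = 10, 64, 5, 0.01

def preprocess(X, c):
    # Each column sees its own preprocessing (the paper uses different image
    # normalizations); here it is just a per-column rescaling.
    return X * (0.5 + 0.25 * c)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

# One weight matrix per column; in the paper each column is a deep CNN.
columns = [rng.normal(0.0, 0.01, (n_classes, n_features))
           for _ in range(n_columns)]

# Toy training data.
X_train = rng.normal(size=(1000, n_features))
y_train = rng.integers(0, n_classes, size=1000)

for c, W in enumerate(columns):
    for x, y in zip(preprocess(X_train, c), y_train):
        p = softmax(W @ x)                              # forward pass
        grad = np.outer(p - np.eye(n_classes)[y], x)    # backprop (cross-entropy)
        W -= lr * grad                                  # update after EACH sample

def mcdnn_predict(x):
    # Average the class probabilities of all columns, then pick the best class.
    probs = [softmax(W @ preprocess(x, c)) for c, W in enumerate(columns)]
    return int(np.mean(probs, axis=0).argmax())

Averaging needs no extra training on top of the columns; each column remains an independent expert for its own preprocessing.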
2. Architecture
The initially random weights of the DNN are iteratively
trained to minimize the classification error on a set of la-
beled training images; generalization performance is then
tested on a separate set of test images. Our architecture does
this by combining several techniques in a novel way:
(1) Unlike the small NNs used in many applications, which were either shallow [32] or had few maps per layer (LeNet7 [20]), ours are deep and have hundreds of maps
per layer, inspired by the Neocognitron [11], with many
(6-10) layers of non-linear neurons stacked on top of each
other, comparable to the number of layers found between
retina and visual cortex of macaque monkeys [3].
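For illustration only, a column of this kind might be stacked as in the following sketch, written here with the PyTorch library (the paper uses its own GPU code); the map counts, kernel sizes and the assumed 29x29 single-channel input are assumptions, not the exact configurations reported in the paper.

# Hypothetical sketch of one wide, deep column: small (3x3) convolutional
# receptive fields, max-pooling (a winner-take-all operation) and hundreds of
# maps per layer.  Written with PyTorch for brevity; the paper uses its own
# GPU implementation, and these layer sizes are assumptions.
import torch
import torch.nn as nn

column = nn.Sequential(
    nn.Conv2d(1, 100, kernel_size=3), nn.Tanh(),    # 100 maps
    nn.MaxPool2d(2),                                # winner-take-all pooling
    nn.Conv2d(100, 200, kernel_size=3), nn.Tanh(),  # 200 maps
    nn.MaxPool2d(2),
    nn.Conv2d(200, 300, kernel_size=3), nn.Tanh(),  # 300 maps
    nn.MaxPool2d(2),
    nn.Flatten(),
    nn.Linear(300, 300), nn.Tanh(),                 # fully connected layer
    nn.Linear(300, 10),                             # one output unit per class
)

logits = column(torch.randn(1, 1, 29, 29))          # assumes 29x29 gray inputs

Counting convolutional, pooling and fully connected stages, this stack has eight layers, within the 6-10 range mentioned above.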
(2) It was shown [14] that such multi-layered DNN are
hard to train by standard gradient descent [36, 18, 28], the
method of choice from a mathematical/algorithmic point
of view. Today’s computers, however, are fast enough for
this, more than 60000 times faster than those of the early
90s¹. Carefully designed code for massively parallel graph-
ics processing units (GPUs normally used for video games)
allows for gaining an additional speedup factor of 50-100
over serial code for standard computers. Given enough la-
beled data, our networks do not need additional heuristics
¹ 1991: 486DX-33 MHz; 2011: i7-990X 3.46 GHz.