(2003). Section 5.15 is mostly about Deep Belief Networks (DBNs,
2006) and related stacks of Autoencoders (AEs, Section 5.7), both
pre-trained by UL to facilitate subsequent BP-based SL (compare
Sections 5.6.1, 5.10). Section 5.16 mentions the first SL-based
GPU-CNNs (2006), BP-trained MPCNNs (2007), and LSTM stacks
(2007). Sections 5.17–5.22 focus on official competitions with
secret test sets won by (mostly purely supervised) deep NNs
since 2009, in sequence recognition, image classification, image
segmentation, and object detection. Many RNN results depended
on LSTM (Section 5.13); many FNN results depended on GPU-based
FNN code developed since 2004 (Sections 5.16–5.19), in particular,
GPU-MPCNNs (Section 5.19). Section 5.24 mentions recent tricks
for improving DL in NNs, many of them closely related to earlier
tricks from the previous millennium (e.g., Sections 5.6.2, 5.6.3).
Section 5.25 discusses how artificial NNs can help to understand
biological NNs; Section 5.26 addresses the possibility of DL in NNs
with spiking neurons.
5.1. Early NNs since the 1940s (and the 1800s)
Early NN architectures (McCulloch & Pitts, 1943) did not learn.
The first ideas about UL were published a few years later (Hebb,
1949). The following decades brought simple NNs trained by
SL (e.g., Narendra & Thathachar, 1974; Rosenblatt, 1958, 1962;
Widrow & Hoff, 1962) and UL (e.g., Grossberg, 1969; Kohonen,
1972; von der Malsburg, 1973; Willshaw & von der Malsburg,
1976), as well as closely related associative memories
(e.g., Hopfield, 1982; Palm, 1980).
In a sense NNs have been around even longer, since early
supervised NNs were essentially variants of linear regression
methods going back at least to the early 1800s (e.g., Gauss, 1809, 1821;
Legendre, 1805); Gauss also refers to his work of 1795. Early NNs
had a maximal CAP depth of 1 (Section 3).
5.2. Around 1960: visual cortex provides inspiration for DL (Sections
5.4, 5.11)
Simple cells and complex cells were found in the cat’s visual
cortex (e.g., Hubel & Wiesel, 1962; Wiesel & Hubel, 1959). These
cells fire in response to certain properties of visual sensory inputs,
such as the orientation of edges. Complex cells exhibit more
spatial invariance than simple cells. This inspired later deep NN
architectures (Sections 5.4, 5.11) used in certain modern award-
winning Deep Learners (Sections 5.19–5.22).
5.3. 1965: deep networks based on the Group Method of Data
Handling
Networks trained by the Group Method of Data Handling
(GMDH) (Ivakhnenko, 1968, 1971; Ivakhnenko & Lapa, 1965;
Ivakhnenko, Lapa, & McDonough, 1967) were perhaps the first DL
systems of the Feedforward Multilayer Perceptron type, although
there was earlier work on NNs with a single hidden layer
(e.g., Joseph, 1961; Viglione, 1970). The units of GMDH nets
may have polynomial activation functions implementing
Kolmogorov–Gabor polynomials (more general than other widely used
NN activation functions, Section 2). Given a training set,
layers are incrementally grown and trained by regression analysis
(e.g., Gauss, 1809, 1821; Legendre, 1805) (Section 5.1), then pruned
with the help of a separate validation set (using today’s
terminology), where Decision Regularization is used to weed out superfluous
units (compare Section 5.6.3). The numbers of layers and units per
layer can be learned in problem-dependent fashion. To my
knowledge, this was the first example of open-ended, hierarchical
representation learning in NNs (Section 4.3). A paper of 1971 already
described a deep GMDH network with 8 layers (Ivakhnenko, 1971).
There have been numerous applications of GMDH-style nets, e.g.
Farlow (1984), Ikeda, Ochiai, and Sawaragi (1976), Ivakhnenko
(1995), Kondo (1998), Kondo and Ueno (2008), Kordík, Náplava,
Snorek, and Genyk-Berezovskyj (2003), Madala and Ivakhnenko
(1994) and Witczak, Korbicz, Mrugalski, and Patton (2006).
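For concreteness, the following minimal sketch illustrates the GMDH idea under simplifying assumptions: each candidate unit is a degree-2 Kolmogorov–Gabor polynomial of two inputs from the previous layer, fitted by least-squares regression on the training set, and pruned according to error on a separate validation set. Function names such as fit_unit and grow_gmdh are illustrative, not Ivakhnenko's original procedure.

    # Minimal GMDH-style sketch (illustrative; not Ivakhnenko's exact procedure).
    # Each candidate unit is a degree-2 Kolmogorov-Gabor polynomial of two inputs
    # from the previous layer, fitted by least-squares regression on the training
    # set; the best units (by validation error) survive, and layers are grown
    # until the validation error stops improving.
    import itertools
    import numpy as np

    def fit_unit(xi, xj, y):
        # Least-squares fit of a quadratic polynomial in two inputs.
        A = np.column_stack([np.ones_like(xi), xi, xj, xi * xj, xi**2, xj**2])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        return coef

    def unit_output(coef, xi, xj):
        A = np.column_stack([np.ones_like(xi), xi, xj, xi * xj, xi**2, xj**2])
        return A @ coef

    def grow_gmdh(X_tr, y_tr, X_va, y_va, width=8, max_layers=8):
        best_err = np.inf
        for _layer in range(max_layers):
            candidates = []
            for i, j in itertools.combinations(range(X_tr.shape[1]), 2):
                coef = fit_unit(X_tr[:, i], X_tr[:, j], y_tr)
                va_out = unit_output(coef, X_va[:, i], X_va[:, j])
                candidates.append((np.mean((va_out - y_va) ** 2), i, j, coef))
            candidates.sort(key=lambda c: c[0])   # keep only the best units
            survivors = candidates[:width]        # (validation-based pruning)
            if survivors[0][0] >= best_err:       # stop growing layers once the
                break                             # validation error stalls
            best_err = survivors[0][0]
            X_tr = np.column_stack([unit_output(c, X_tr[:, i], X_tr[:, j])
                                    for _, i, j, c in survivors])
            X_va = np.column_stack([unit_output(c, X_va[:, i], X_va[:, j])
                                    for _, i, j, c in survivors])
        return best_err

In this sketch the surviving units' outputs become the inputs of the next layer, and growth stops once the validation error no longer improves, a simple stand-in for the validation-based pruning described above.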
5.4. 1979: convolution + weight replication + subsampling (Neocognitron)
Apart from deep GMDH networks (Section 5.3), the
Neocognitron (Fukushima, 1979, 1980, 2013a) was perhaps the first artificial
NN that deserved the attribute deep, and the first to incorporate
the neurophysiological insights of Section 5.2. It introduced
convolutional NNs (today often called CNNs or convnets), where the
(typically rectangular) receptive field of a convolutional unit with
given weight vector (a filter) is shifted step by step across a 2-
dimensional array of input values, such as the pixels of an image
(usually there are several such filters). The resulting 2D array of
subsequent activation events of this unit can then provide inputs
to higher-level units, and so on. Due to massive weight replication
(Section 2), relatively few parameters (Section 4.4) may be
necessary to describe the behavior of such a convolutional layer.
Subsampling or downsampling layers consist of units whose
fixed-weight connections originate from physical neighbors in the
convolutional layers below. Subsampling units become active if at
least one of their inputs is active; their responses are insensitive to
certain small image shifts (compare Section 5.2).
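The following toy sketch illustrates weight replication and subsampling in isolation (it does not implement the Neocognitron's learning rules): a single hand-wired filter is shifted across an image, and the resulting feature map is downsampled by spatial averaging over 2x2 neighborhoods. The filter and array sizes are arbitrary assumptions.

    # Toy sketch of weight replication and subsampling (illustrative only;
    # not the Neocognitron's learning rules). A single shared filter is
    # shifted across the image, and the resulting feature map is downsampled
    # by spatial averaging over 2x2 neighborhoods.
    import numpy as np

    def convolve2d(image, filt):
        fh, fw = filt.shape
        oh, ow = image.shape[0] - fh + 1, image.shape[1] - fw + 1
        out = np.zeros((oh, ow))
        for r in range(oh):
            for c in range(ow):
                # The same weights are reused at every position: weight
                # replication keeps the number of parameters small.
                out[r, c] = np.sum(image[r:r + fh, c:c + fw] * filt)
        return out

    def subsample2x2(fmap):
        h, w = fmap.shape[0] // 2 * 2, fmap.shape[1] // 2 * 2
        f = fmap[:h, :w]
        # Spatial averaging over 2x2 blocks; responses become insensitive
        # to small shifts of the input (compare Section 5.2).
        return 0.25 * (f[0::2, 0::2] + f[1::2, 0::2] + f[0::2, 1::2] + f[1::2, 1::2])

    image = np.random.rand(28, 28)
    edge_filter = np.array([[1.0, 0.0, -1.0]] * 3)  # a hand-wired vertical-edge filter
    feature_map = convolve2d(image, edge_filter)    # 26 x 26 array of activations
    pooled = subsample2x2(feature_map)              # 13 x 13, shift-tolerant

Because the same weight vector is reused at every position, the whole layer is described by the filter's few parameters, and the averaged responses change little under small shifts of the input.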
The Neocognitron is very similar to the architecture of modern,
contest-winning, purely supervised, feedforward, gradient-based
Deep Learners with alternating convolutional and downsampling
layers (e.g., Sections 5.19–5.22). Fukushima, however, did not set
the weights by supervised backpropagation (Sections 5.5, 5.8), but
by local, WTA-based unsupervised learning rules (e.g., Fukushima,
2013b), or by pre-wiring. In that sense he was not concerned with the
DL problem (Section 5.9), although his architecture was
comparatively deep indeed. For downsampling purposes he used Spatial
Averaging (Fukushima, 1980, 2011) instead of Max-Pooling (MP,
Section 5.11), currently a particularly convenient and popular WTA
mechanism. Today’s DL combinations of CNNs and MP and BP also
profit a lot from later work (e.g., Sections 5.8, 5.16, 5.19).
5.5. 1960–1981 and beyond: development of backpropagation (BP)
for NNs
The minimization of errors through gradient descent (Hadamard,
1908) in the parameter space of complex, nonlinear, differentiable
(Leibniz, 1684), multi-stage, NN-related systems has been
discussed at least since the early 1960s (e.g., Amari, 1967; Bryson,
1961; Bryson & Denham, 1961; Bryson & Ho, 1969; Director &
Rohrer, 1969; Dreyfus, 1962; Kelley, 1960; Pontryagin,
Boltyanskii, Gamkrelidze, & Mishchenko, 1961; Wilkinson, 1965), initially
within the framework of Euler–Lagrange equations in the Calculus
of Variations (e.g., Euler, 1744).
Steepest descent in the weight space of such systems can be
performed (Bryson, 1961; Bryson & Ho, 1969; Kelley, 1960) by
iterating the chain rule (Leibniz, 1676; L’Hôpital, 1696) à la Dynamic
Programming (DP) (Bellman, 1957). A simplified derivation of this
backpropagation method uses the chain rule only (Dreyfus, 1962).
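The following sketch illustrates this reverse sweep under simplifying assumptions (the tanh nonlinearity, layer sizes, and function names are illustrative, not taken from the cited work): the chain rule is iterated backwards through per-layer Jacobians, reusing intermediate derivatives in DP style, to obtain the gradient of a scalar error with respect to each layer's weights.

    # Sketch of the reverse sweep: the gradient of a scalar error E with respect
    # to each layer's weights is obtained by iterating the chain rule backwards
    # through per-layer Jacobians, reusing intermediate derivatives in dynamic-
    # programming style. Layer functions and names are illustrative assumptions.
    import numpy as np

    def forward(x, weights):
        activations = [x]
        for W in weights:
            x = np.tanh(W @ x)              # one nonlinear, differentiable stage
            activations.append(x)
        return activations

    def backward(activations, weights, dE_dy):
        grads = []
        delta = dE_dy                       # dE / d(output of last layer)
        for W, a_prev, a in zip(reversed(weights),
                                reversed(activations[:-1]),
                                reversed(activations[1:])):
            dz = (1.0 - a ** 2) * delta         # chain rule through tanh
            grads.append(np.outer(dz, a_prev))  # dE/dW for this layer
            delta = W.T @ dz                    # Jacobian (transposed) passes dE back
        return list(reversed(grads))            # one gradient per layer

    rng = np.random.default_rng(0)
    weights = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
    acts = forward(rng.standard_normal(3), weights)
    target = np.zeros(2)
    grads = backward(acts, weights, acts[-1] - target)  # E = 0.5 * ||y - target||^2
    weights = [W - 0.01 * g for W, g in zip(weights, grads)]  # steepest-descent step

A steepest-descent step then subtracts a small multiple of each gradient from the corresponding weight matrix.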
The systems of the 1960s were already efficient in the DP sense.
However, they backpropagated derivative information through
standard Jacobian matrix calculations from one "layer" to the
previous one, without explicitly addressing either direct links across
several layers or potential additional efficiency gains due to
network sparsity (but perhaps such enhancements seemed obvious
to the authors). Given all the prior work on learning in multilayer
NN-like systems (see also Section 5.3 on deep nonlinear nets since