(2003). Section 5.15 is mostly about Deep Belief Networks (DBNs,
2006) and related stacks of Autoencoders (AEs, Section 5.7), both
pre-trained by UL to facilitate subsequent BP-based SL (compare
Sections 5.6.1, 5.10). Section 5.16 mentions the first SL-based
GPU-CNNs (2006), BP-trained MPCNNs (2007), and LSTM stacks
(2007). Sections 5.17–5.22 focus on official competitions with
secret test sets won by (mostly purely supervised) deep NNs
since 2009, in sequence recognition, image classification, image
segmentation, and object detection. Many RNN results depended
on LSTM (Section 5.13); many FNN results depended on GPU-based
FNN code developed since 2004 (Sections 5.16–5.19), in particular,
GPU-MPCNNs (Section 5.19). Section 5.24 mentions recent tricks
for improving DL in NNs, many of them closely related to earlier
tricks from the previous millennium (e.g., Sections 5.6.2, 5.6.3).
Section 5.25 discusses how artificial NNs can help to understand
biological NNs; Section 5.26 addresses the possibility of DL in NNs
with spiking neurons.
5.1. Early NNs since the 1940s (and the 1800s)
Early NN architectures (McCulloch & Pitts, 1943) did not learn.
The first ideas about UL were published a few years later (Hebb,
1949). The following decades brought simple NNs trained by
SL (e.g., Narendra & Thathachar, 1974; Rosenblatt, 1958, 1962;
Widrow & Hoff, 1962) and UL (e.g., Grossberg, 1969; Kohonen,
1972; von der Malsburg, 1973; Willshaw & von der Malsburg,
1976), as well as closely related associative memories
(e.g., Hopfield, 1982; Palm, 1980).
In a sense NNs have been around even longer, since early
supervised NNs were essentially variants of linear regression
methods going back at least to the early 1800s (e.g., Gauss, 1809, 1821;
Legendre, 1805); Gauss also refers to his work of 1795. Early NNs
had a maximal CAP depth of 1 (Section 3).
5.2. Around 1960: visual cortex provides inspiration for DL (Sections
5.4, 5.11)
Simple cells and complex cells were found in the cat’s visual
cortex (e.g., Hubel & Wiesel, 1962; Wiesel & Hubel, 1959). These
cells fire in response to certain properties of visual sensory inputs,
such as the orientation of edges. Complex cells exhibit more
spatial invariance than simple cells. This inspired later deep NN
architectures (Sections 5.4, 5.11) used in certain modern award-
winning Deep Learners (Sections 5.19–5.22).
5.3. 1965: deep networks based on the Group Method of Data
Handling
Networks trained by the Group Method of Data Handling
(GMDH) (Ivakhnenko, 1968, 1971; Ivakhnenko & Lapa, 1965;
Ivakhnenko, Lapa, & McDonough, 1967) were perhaps the first DL
systems of the Feedforward Multilayer Perceptron type, although
there was earlier work on NNs with a single hidden layer
(e.g., Joseph, 1961; Viglione, 1970). The units of GMDH nets
may have polynomial activation functions implementing
Kolmogorov–Gabor polynomials (more general than other widely used
NN activation functions, Section 2). Given a training set,
layers are incrementally grown and trained by regression analysis
(e.g., Gauss, 1809, 1821; Legendre, 1805) (Section 5.1), then pruned
with the help of a separate validation set (using today’s
terminology), where Decision Regularization is used to weed out superfluous
units (compare Section 5.6.3). The numbers of layers and units per
layer can be learned in problem-dependent fashion. To my
knowledge, this was the first example of open-ended, hierarchical
representation learning in NNs (Section 4.3). A paper of 1971 already
described a deep GMDH network with 8 layers (Ivakhnenko, 1971).
There have been numerous applications of GMDH-style nets, e.g.
Farlow (1984), Ikeda, Ochiai, and Sawaragi (1976), Ivakhnenko
(1995), Kondo (1998), Kondo and Ueno (2008), Kordík, Náplava,
Snorek, and Genyk-Berezovskyj (2003), Madala and Ivakhnenko
(1994) and Witczak, Korbicz, Mrugalski, and Patton (2006).
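For concreteness, the following minimal sketch illustrates the GMDH idea under simplifying assumptions: each candidate unit is a degree-2 Kolmogorov–Gabor polynomial of two inputs from the previous layer, fitted by least-squares regression on the training set, and pruned according to error on a separate validation set. Function names such as fit_unit and grow_gmdh are illustrative, not Ivakhnenko's original procedure.

    # Minimal GMDH-style sketch (illustrative; not Ivakhnenko's exact procedure).
    # Each candidate unit is a degree-2 Kolmogorov-Gabor polynomial of two inputs
    # from the previous layer, fitted by least-squares regression on the training
    # set; the best units (by validation error) survive, and layers are grown
    # until the validation error stops improving.
    import itertools
    import numpy as np

    def fit_unit(xi, xj, y):
        # Least-squares fit of a quadratic polynomial in two inputs.
        A = np.column_stack([np.ones_like(xi), xi, xj, xi * xj, xi**2, xj**2])
        coef, *_ = np.linalg.lstsq(A, y, rcond=None)
        return coef

    def unit_output(coef, xi, xj):
        A = np.column_stack([np.ones_like(xi), xi, xj, xi * xj, xi**2, xj**2])
        return A @ coef

    def grow_gmdh(X_tr, y_tr, X_va, y_va, width=8, max_layers=8):
        best_err = np.inf
        for _layer in range(max_layers):
            candidates = []
            for i, j in itertools.combinations(range(X_tr.shape[1]), 2):
                coef = fit_unit(X_tr[:, i], X_tr[:, j], y_tr)
                va_out = unit_output(coef, X_va[:, i], X_va[:, j])
                candidates.append((np.mean((va_out - y_va) ** 2), i, j, coef))
            candidates.sort(key=lambda c: c[0])   # keep only the best units
            survivors = candidates[:width]        # (validation-based pruning)
            if survivors[0][0] >= best_err:       # stop growing layers once the
                break                             # validation error stalls
            best_err = survivors[0][0]
            X_tr = np.column_stack([unit_output(c, X_tr[:, i], X_tr[:, j])
                                    for _, i, j, c in survivors])
            X_va = np.column_stack([unit_output(c, X_va[:, i], X_va[:, j])
                                    for _, i, j, c in survivors])
        return best_err

In this sketch the surviving units' outputs become the inputs of the next layer, and growth stops once the validation error no longer improves, a simple stand-in for the validation-based pruning described above.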
5.4. 1979: convolution + weight replication + subsampling (Neocognitron)
Apart from deep GMDH networks (Section 5.3), the
Neocognitron (Fukushima, 1979, 1980, 2013a) was perhaps the first artificial
NN that deserved the attribute deep, and the first to incorporate
the neurophysiological insights of Section 5.2. It introduced
convolutional NNs (today often called CNNs or convnets), where the
(typically rectangular) receptive field of a convolutional unit with
given weight vector (a filter) is shifted step by step across a 2-
dimensional array of input values, such as the pixels of an image
(usually there are several such filters). The resulting 2D array of
subsequent activation events of this unit can then provide inputs
to higher-level units, and so on. Due to massive weight replication
(Section 2), relatively few parameters (Section 4.4) may be
necessary to describe the behavior of such a convolutional layer.
Subsampling or downsampling layers consist of units whose
fixed-weight connections originate from physical neighbors in the
convolutional layers below. Subsampling units become active if at
least one of their inputs is active; their responses are insensitive to
certain small image shifts (compare Section 5.2).
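The following toy sketch illustrates weight replication and subsampling in isolation (it does not implement the Neocognitron's learning rules): a single hand-wired filter is shifted across an image, and the resulting feature map is downsampled by spatial averaging over 2x2 neighborhoods. The filter and array sizes are arbitrary assumptions.

    # Toy sketch of weight replication and subsampling (illustrative only;
    # not the Neocognitron's learning rules). A single shared filter is
    # shifted across the image, and the resulting feature map is downsampled
    # by spatial averaging over 2x2 neighborhoods.
    import numpy as np

    def convolve2d(image, filt):
        fh, fw = filt.shape
        oh, ow = image.shape[0] - fh + 1, image.shape[1] - fw + 1
        out = np.zeros((oh, ow))
        for r in range(oh):
            for c in range(ow):
                # The same weights are reused at every position: weight
                # replication keeps the number of parameters small.
                out[r, c] = np.sum(image[r:r + fh, c:c + fw] * filt)
        return out

    def subsample2x2(fmap):
        h, w = fmap.shape[0] // 2 * 2, fmap.shape[1] // 2 * 2
        f = fmap[:h, :w]
        # Spatial averaging over 2x2 blocks; responses become insensitive
        # to small shifts of the input (compare Section 5.2).
        return 0.25 * (f[0::2, 0::2] + f[1::2, 0::2] + f[0::2, 1::2] + f[1::2, 1::2])

    image = np.random.rand(28, 28)
    edge_filter = np.array([[1.0, 0.0, -1.0]] * 3)  # a hand-wired vertical-edge filter
    feature_map = convolve2d(image, edge_filter)    # 26 x 26 array of activations
    pooled = subsample2x2(feature_map)              # 13 x 13, shift-tolerant

Because the same weight vector is reused at every position, the whole layer is described by the filter's few parameters, and the averaged responses change little under small shifts of the input.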
The Neocognitron is very similar to the architecture of modern,
contest-winning, purely supervised, feedforward, gradient-based
Deep Learners with alternating convolutional and downsampling
layers (e.g., Sections 5.19–5.22). Fukushima, however, did not set
the weights by supervised backpropagation (Sections 5.5, 5.8), but
by local, WTA-based unsupervised learning rules (e.g., Fukushima,
2013b), or by pre-wiring. In that sense he was not concerned with the
DL problem (Section 5.9), although his architecture was
comparatively deep indeed. For downsampling purposes he used Spatial
Averaging (Fukushima, 1980, 2011) instead of Max-Pooling (MP,
Section 5.11), currently a particularly convenient and popular WTA
mechanism. Today’s DL combinations of CNNs and MP and BP also
profit a lot from later work (e.g., Sections 5.8, 5.16, 5.19).
5.5. 1960–1981 and beyond: development of backpropagation (BP)
for NNs
The minimization of errors through gradient descent (Hadamard,
1908) in the parameter space of complex, nonlinear, differentiable
(Leibniz, 1684), multi-stage, NN-related systems has been
discussed at least since the early 1960s (e.g., Amari, 1967; Bryson,
1961; Bryson & Denham, 1961; Bryson & Ho, 1969; Director &
Rohrer, 1969; Dreyfus, 1962; Kelley, 1960; Pontryagin,
Boltyanskii, Gamkrelidze, & Mishchenko, 1961; Wilkinson, 1965), initially
within the framework of Euler–Lagrange equations in the Calculus
of Variations (e.g., Euler, 1744).
Steepest descent in the weight space of such systems can be
performed (Bryson, 1961; Bryson & Ho, 1969; Kelley, 1960) by
iterating the chain rule (Leibniz, 1676; L’Hôpital, 1696) à la Dynamic
Programming (DP) (Bellman, 1957). A simplified derivation of this
backpropagation method uses the chain rule only (Dreyfus, 1962).
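The following sketch illustrates this reverse sweep under simplifying assumptions (the tanh nonlinearity, layer sizes, and function names are illustrative, not taken from the cited work): the chain rule is iterated backwards through per-layer Jacobians, reusing intermediate derivatives in DP style, to obtain the gradient of a scalar error with respect to each layer's weights.

    # Sketch of the reverse sweep: the gradient of a scalar error E with respect
    # to each layer's weights is obtained by iterating the chain rule backwards
    # through per-layer Jacobians, reusing intermediate derivatives in dynamic-
    # programming style. Layer functions and names are illustrative assumptions.
    import numpy as np

    def forward(x, weights):
        activations = [x]
        for W in weights:
            x = np.tanh(W @ x)              # one nonlinear, differentiable stage
            activations.append(x)
        return activations

    def backward(activations, weights, dE_dy):
        grads = []
        delta = dE_dy                       # dE / d(output of last layer)
        for W, a_prev, a in zip(reversed(weights),
                                reversed(activations[:-1]),
                                reversed(activations[1:])):
            dz = (1.0 - a ** 2) * delta         # chain rule through tanh
            grads.append(np.outer(dz, a_prev))  # dE/dW for this layer
            delta = W.T @ dz                    # Jacobian (transposed) passes dE back
        return list(reversed(grads))            # one gradient per layer

    rng = np.random.default_rng(0)
    weights = [rng.standard_normal((4, 3)), rng.standard_normal((2, 4))]
    acts = forward(rng.standard_normal(3), weights)
    target = np.zeros(2)
    grads = backward(acts, weights, acts[-1] - target)  # E = 0.5 * ||y - target||^2
    weights = [W - 0.01 * g for W, g in zip(weights, grads)]  # steepest-descent step

A steepest-descent step then subtracts a small multiple of each gradient from the corresponding weight matrix.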
The systems of the 1960s were already efficient in the DP sense.
However, they backpropagated derivative information through
standard Jacobian matrix calculations from one "layer" to the
previous one, without explicitly addressing either direct links across
several layers or potential additional efficiency gains due to
network sparsity (but perhaps such enhancements seemed obvious
to the authors). Given all the prior work on learning in multilayer
NN-like systems (see also Section 5.3 on deep nonlinear nets since