1990). An MDL-based, Bayesian argument suggests that flat minima correspond to “simple” NNs and low
expected overfitting. Compare Sec. 5.6.4 and more recent developments mentioned in Sec. 5.24.
5.6.4 Potential Benefits of UL for SL
The notation of Sec. 2 introduced teacher-given labels d_t. Many papers of the previous millennium, how-
ever, were about unsupervised learning (UL) without a teacher (e.g., Hebb, 1949; von der Malsburg, 1973;
Kohonen, 1972, 1982, 1988; Willshaw and von der Malsburg, 1976; Grossberg, 1976a,b; Watanabe, 1985;
Pearlmutter and Hinton, 1986; Barrow, 1987; Field, 1987; Oja, 1989; Barlow et al., 1989; Baldi and Hornik,
1989; Rubner and Tavan, 1989; Sanger, 1989; Ritter and Kohonen, 1989; Rubner and Schulten, 1990;
Földiák, 1990; Martinetz et al., 1990; Kosko, 1990; Mozer, 1991; Palm, 1992; Atick et al., 1992; Miller,
1994; Saund, 1994; Földiák and Young, 1995; Deco and Parra, 1997); see also post-2000 work (e.g.,
Carreira-Perpinan, 2001; Wiskott and Sejnowski, 2002; Franzius et al., 2007; Waydo and Koch, 2008).
Many UL methods are designed to maximize entropy-related, information-theoretic (Boltzmann, 1909;
Shannon, 1948; Kullback and Leibler, 1951) objectives (e.g., Linsker, 1988; Barlow et al., 1989; MacKay
and Miller, 1990; Plumbley, 1991; Schmidhuber, 1992b,c; Schraudolph and Sejnowski, 1993; Redlich,
1993; Zemel, 1993; Zemel and Hinton, 1994; Field, 1994; Hinton et al., 1995; Dayan and Zemel, 1995;
Amari et al., 1996; Deco and Parra, 1997). Many do this to uncover and disentangle hidden underlying
sources of signals (e.g., Jutten and Herault, 1991; Schuster, 1992; Andrade et al., 1993; Molgedey and
Schuster, 1994; Comon, 1994; Cardoso, 1994; Bell and Sejnowski, 1995; Karhunen and Joutsensalo, 1995;
Belouchrani et al., 1997; Hyvärinen et al., 2001; Szabó et al., 2006). UL can also serve to extract invariant
features from different data items (e.g., Becker, 1991; Schmidhuber and Prelinger, 1993; Taylor et al.,
2011) through coupled NNs (also called Siamese NNs, e.g., Bromley et al., 1993; Hadsell et al., 2006).
Many UL methods automatically and robustly generate distributed, sparse representations of input pat-
terns (Földiák, 1990; Hinton and Ghahramani, 1997; Lewicki and Olshausen, 1998; Hyvärinen et al., 1999;
Hochreiter and Schmidhuber, 1999; Falconbridge et al., 2006) through well-known feature detectors (e.g.,
Olshausen and Field, 1996; Schmidhuber et al., 1996), such as off-center-on-surround-like structures, as
well as orientation sensitive edge detectors and Gabor filters (Gabor, 1946). They extract simple features
related to those observed in early visual pre-processing stages of biological systems (e.g., De Valois et al.,
1982; Jones and Palmer, 1987).
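To make the last point concrete, the following minimal numpy sketch builds an oriented Gabor kernel in the standard textbook form (a Gaussian envelope multiplied by a sinusoidal carrier); the function name and default parameters are illustrative choices, not taken from any of the cited works.

```python
# Minimal sketch of a 2-D Gabor filter kernel (standard textbook form),
# shown only to illustrate the kind of oriented feature detector mentioned above.
import numpy as np

def gabor_kernel(size=15, sigma=3.0, theta=0.0, lam=6.0, gamma=0.5, psi=0.0):
    """Oriented Gabor kernel: Gaussian envelope times sinusoidal carrier."""
    half = size // 2
    y, x = np.mgrid[-half:half + 1, -half:half + 1]
    xr = x * np.cos(theta) + y * np.sin(theta)    # rotate coordinates by theta
    yr = -x * np.sin(theta) + y * np.cos(theta)
    envelope = np.exp(-(xr ** 2 + (gamma * yr) ** 2) / (2 * sigma ** 2))
    carrier = np.cos(2 * np.pi * xr / lam + psi)
    return envelope * carrier

edge_detector = gabor_kernel(theta=np.pi / 4)     # 45-degree oriented filter
print(edge_detector.shape)                        # (15, 15)
```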
UL can help to encode input data in a form advantageous for further processing. In the context of
DL, one important goal of UL is redundancy reduction. Ideally, given an ensemble of input patterns,
redundancy reduction through a deep NN will create a factorial code (a code with statistically independent
components) of the ensemble (Barlow et al., 1989; Barlow, 1989), to disentangle the unknown factors of
variation (compare Bengio et al., 2013). Such codes may be sparse and can be advantageous for (1) data
compression, (2) speeding up subsequent BP (Becker, 1991), (3) trivialising the task of subsequent naive
yet optimal Bayes classifiers (Schmidhuber et al., 1996).
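The defining property of a factorial code, statistical independence of its components, can be checked directly. The following minimal numpy sketch (a hypothetical toy example, not taken from the cited works) estimates pairwise mutual information between binary code components; for an ideal factorial code these estimates approach zero, while redundant components reveal themselves by large values.

```python
# Toy check of the factorial-code property: estimate pairwise mutual
# information (in nats) between binary code components from samples.
import numpy as np

def pairwise_mutual_information(codes):
    """codes: (n_patterns, n_units) array of binary (0/1) code activations."""
    n, m = codes.shape
    mi = np.zeros((m, m))
    for i in range(m):
        for j in range(m):
            if i == j:
                continue
            joint = np.zeros((2, 2))
            for a in (0, 1):
                for b in (0, 1):
                    joint[a, b] = np.mean((codes[:, i] == a) & (codes[:, j] == b))
            pi, pj = joint.sum(axis=1), joint.sum(axis=0)
            for a in (0, 1):
                for b in (0, 1):
                    if joint[a, b] > 0:
                        mi[i, j] += joint[a, b] * np.log(joint[a, b] / (pi[a] * pj[b]))
    return mi

rng = np.random.default_rng(0)
factorial = rng.integers(0, 2, size=(10000, 4))              # independent components
redundant = np.hstack([factorial[:, :2], factorial[:, :2]])  # duplicated components
print(pairwise_mutual_information(factorial).max())   # close to 0: factorial code
print(pairwise_mutual_information(redundant).max())   # close to log 2: redundant code
```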
Most early UL FNNs had a single layer. Methods for deeper UL FNNs include hierarchical (Sec. 4.3)
self-organizing Kohonen maps (e.g., Koikkalainen and Oja, 1990; Lampinen and Oja, 1992; Versino and
Gambardella, 1996; Dittenbach et al., 2000; Rauber et al., 2002), hierarchical Gaussian potential function
networks (Lee and Kil, 1991), the Self-Organising Tree Algorithm (SOTA) (Herrero et al., 2001), and
nonlinear Autoencoders (AEs) with more than 3 (e.g., 5) layers (Kramer, 1991; Oja, 1991; DeMers and
Cottrell, 1993). Such AE NNs (Rumelhart et al., 1986) can be trained to map input patterns to themselves,
for example, by compactly encoding them through activations of units of a narrow bottleneck hidden layer.
See (Baldi, 2012) for limitations of certain nonlinear AEs.
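To illustrate the bottleneck principle, here is a minimal numpy sketch (a simplified toy example under assumed settings, not a reimplementation of any cited architecture): a single narrow tanh hidden layer is trained by plain gradient descent to reconstruct its own inputs.

```python
# Minimal bottleneck autoencoder sketch: 8-D inputs are mapped to themselves
# through a 2-unit hidden layer, trained with squared reconstruction error.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 8))   # 8-D data on a 2-D subspace
n_in, n_hid = X.shape[1], 2                               # bottleneck width 2

W1 = rng.normal(scale=0.1, size=(n_in, n_hid)); b1 = np.zeros(n_hid)
W2 = rng.normal(scale=0.1, size=(n_hid, n_in)); b2 = np.zeros(n_in)
lr = 0.01

for step in range(5000):
    H = np.tanh(X @ W1 + b1)            # bottleneck code
    Y = H @ W2 + b2                     # reconstruction of the input
    err = Y - X                         # d(loss)/dY for squared error
    # backpropagate the reconstruction error through decoder and encoder
    dW2 = H.T @ err / len(X); db2 = err.mean(axis=0)
    dH = err @ W2.T * (1 - H ** 2)      # tanh derivative
    dW1 = X.T @ dH / len(X); db1 = dH.mean(axis=0)
    W1 -= lr * dW1; b1 -= lr * db1
    W2 -= lr * dW2; b2 -= lr * db2

print("reconstruction MSE:", np.mean((np.tanh(X @ W1 + b1) @ W2 + b2 - X) ** 2))
```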
Other nonlinear UL methods include Predictability Minimization (PM) (Schmidhuber, 1992c), where
nonlinear feature detectors fight nonlinear predictors, trying to become both informative and as unpre-
dictable as possible, and LOCOCODE (Hochreiter and Schmidhuber, 1999), where FMS (Sec. 5.6.3) finds
low-complexity AEs with low-precision weights describable by few bits of information, often yielding
sparse or factorial codes. PM-based UL was applied not only to FNNs but also to RNNs (e.g., Schmidhu-
ber, 1993b; Lindstädt, 1993a,b). Compare Sec. 5.10 on UL-based RNN stacks (1991), as well as later UL
RNNs (e.g., Klapper-Rybicka et al., 2001; Steil, 2007).
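In simplified form (a sketch that omits the additional terms used in practice to keep code units informative and within bounds), the PM game can be written as follows. Given code units $y_1, \dots, y_m$ produced by the feature detectors, each predictor $P_i$ is trained to minimize its squared prediction error
\[
\min_{P_i} \; E\big[(P_i(y_1,\dots,y_{i-1},y_{i+1},\dots,y_m) - y_i)^2\big],
\]
while the feature detectors are trained to maximize the total prediction error
\[
\max \; \sum_{i=1}^{m} E\big[(P_i(y_1,\dots,y_{i-1},y_{i+1},\dots,y_m) - y_i)^2\big].
\]
At equilibrium each $P_i$ approximates the conditional expectation of $y_i$ given the other code units, so the feature detectors are pushed toward codes whose components carry information that cannot be predicted from the remaining components.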