Deep Learning Architectures: Theoretical Advantages and Practical Challenges


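"Learning Deep Architectures for AI" is a monograph by Yoshua Bengio, published in Foundations and Trends in Machine Learning, Vol. 2, No. 1 (2009), and runs to 127 pages. It examines why deep architectures matter for artificial intelligence and what theoretical advantages they offer. The author begins with the key question of how to train deep architectures and emphasizes the value of shared features and abstract representations across tasks.

The first section, "Introduction", focuses on the challenges of training deep networks, including how effective learning algorithms can handle complex hierarchical structure. Section 1.2 discusses the importance of intermediate representations: the layers of a network learn general features and concepts that can be reused across different tasks. Section 1.3 sets out the desiderata for learning AI, such as generalization, managing complexity, and expressive power.

The second section, "Theoretical Advantages of Deep Architectures", analyzes the advantages of deep networks in computational efficiency and generalization. The part on computational complexity argues that deep models can perform better when data are limited, while non-local generalization helps with more complex pattern recognition. The analysis of the limits of local template matching exposes the limitations of shallow models, which deep networks overcome by learning distributed representations.

The following sections describe several key deep learning techniques. Section 4 discusses multi-layer neural networks, including their structure and the difficulty of training them, and the use of unsupervised learning methods such as auto-encoders in deep networks. It also covers deep generative architectures and convolutional neural networks (CNNs), which are applied to tasks such as image recognition and generation. Section 5 turns to energy-based models such as Boltzmann machines and restricted Boltzmann machines, which model complex data through probability distributions, and to the contrastive divergence algorithm used to train them.

The monograph provides the theoretical foundations of deep learning and covers a range of key techniques and methods, making it valuable for understanding the central role of deep architectures in AI and the strategies for training them. Progress in deep learning lies not only in improving model performance but also in using theory and practice to refine network structure so that it can cope with increasingly complex real-world problems.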

Theoretical Advantages of Deep Architectures

In this section, we present a motivating argument for the study of learning algorithms for deep architectures, by way of theoretical results revealing potential limitations of architectures with insufficient depth. This part of the monograph (this section and the next) motivates the algorithms described in the later sections, and can be skipped without making the remainder difficult to follow.

The main point of this section is that some functions cannot be efficiently represented (in terms of number of tunable elements) by architectures that are too shallow. These results suggest that it would be worthwhile to explore learning algorithms for deep architectures, which might be able to represent some functions otherwise not efficiently representable. Where simpler and shallower architectures fail to efficiently represent (and hence to learn) a task of interest, we can hope for learning algorithms that could set the parameters of a deep architecture for this task.

We say that the expression of a function is compact when it has few computational elements, i.e., few degrees of freedom that need to be tuned by learning. So for a fixed number of training examples, and short of other sources of knowledge injected in the learning algorithm, we would expect that compact representations of the target function would yield better generalization.

More precisely, functions that can be compactly represented by a depth k architecture might require an exponential number of computational elements to be represented by a depth k − 1 architecture. Since the number of computational elements one can afford depends on the number of training examples available to tune or select them, the consequences are not only computational but also statistical: poor generalization may be expected when using an insufficiently deep architecture for representing some functions.

We consider the case of fixed-dimension inputs, where the computation performed by the machine can be represented by a directed acyclic graph where each node performs a computation that is the application of a function on its inputs, each of which is the output of another node in the graph or one of the external inputs to the graph. The whole graph can be viewed as a circuit that computes a function applied to the external inputs. When the set of functions allowed for the computation nodes is limited to logic gates, such as {AND, OR, NOT}, this is a Boolean circuit, or logic circuit.
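As an illustration of a circuit viewed as a directed acyclic graph, the following minimal sketch evaluates a small Boolean circuit over the gate set {AND, OR, NOT}. The names `GATES`, `XOR_CIRCUIT`, and `evaluate`, and the choice of XOR as the computed function, are assumptions made for this example and do not come from the monograph.

```python
from typing import Callable, Dict, List, Tuple

# Allowed computational elements: the gate set {AND, OR, NOT}.
GATES: Dict[str, Callable[..., bool]] = {
    "AND": lambda *xs: all(xs),
    "OR": lambda *xs: any(xs),
    "NOT": lambda x: not x,
}

# Hypothetical circuit for XOR(x1, x2) = (x1 OR x2) AND NOT (x1 AND x2).
# Each entry is (node name, gate name, input names). Entries are listed in
# topological order, so every input is available before it is used.
XOR_CIRCUIT: List[Tuple[str, str, List[str]]] = [
    ("n_or", "OR", ["x1", "x2"]),
    ("n_and", "AND", ["x1", "x2"]),
    ("n_not", "NOT", ["n_and"]),
    ("out", "AND", ["n_or", "n_not"]),
]


def evaluate(circuit, inputs: Dict[str, bool], output: str = "out") -> bool:
    """Evaluate the circuit: each node applies its gate to its inputs."""
    values = dict(inputs)  # external inputs of the graph
    for name, gate, args in circuit:
        values[name] = GATES[gate](*(values[a] for a in args))
    return values[output]
```

Calling `evaluate(XOR_CIRCUIT, {"x1": True, "x2": False})` walks the graph once from the external inputs to the output node.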

To formalize the notion of depth of architecture, one must introduce the notion of a set of computational elements. An example of such a set is the set of computations that can be performed by logic gates. Another is the set of computations that can be performed by an artificial neuron (depending on the values of its synaptic weights). A function can be expressed by the composition of computational elements from a given set. It is defined by a graph which formalizes this composition, with one node per computational element. Depth of architecture refers to the depth of that graph, i.e., the longest path from an input node to an output node. When the set of computational elements is the set of computations an artificial neuron can perform, depth corresponds to the number of layers in a neural network. Let us explore the notion of depth with examples of architectures of different depths. Consider the function f(x) = x ∗ sin(a ∗ x + b). It can be expressed as the composition of simple operations such as addition, subtraction, multiplication, and the sin operation.
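The notion of depth can be made concrete with a small sketch that writes f(x) = x ∗ sin(a ∗ x + b) as a graph of elementary operations and measures the longest path from an input to the output. The node names, the `GRAPH` dictionary layout, and the choice to treat a and b as external inputs are assumptions of this example.

```python
import math

# Graph for f(x) = x * sin(a * x + b): node -> (operation, input nodes).
# Names that are not keys of GRAPH ("x", "a", "b") are external inputs.
GRAPH = {
    "ax": ("mul", ["a", "x"]),
    "ax_plus_b": ("add", ["ax", "b"]),
    "s": ("sin", ["ax_plus_b"]),
    "f": ("mul", ["x", "s"]),
}

# The set of computational elements assumed here: multiply, add, sin.
OPS = {
    "mul": lambda u, v: u * v,
    "add": lambda u, v: u + v,
    "sin": math.sin,
}


def depth(node: str) -> int:
    """Length of the longest path from any external input to this node."""
    if node not in GRAPH:
        return 0  # external inputs perform no computation
    _, args = GRAPH[node]
    return 1 + max(depth(a) for a in args)


def evaluate(node: str, inputs: dict) -> float:
    """Compute the value of a node by recursively evaluating its inputs."""
    if node not in GRAPH:
        return inputs[node]
    op, args = GRAPH[node]
    return OPS[op](*(evaluate(a, inputs) for a in args))
```

With this set of elements, `depth("f")` is 4: multiply, add, sin, multiply. A different set of computational elements would in general give a different depth for the same function.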

Footnote: The target function is the function that we would like the learner to discover.

2.1 Computational Complexity

A two-layer circuit of logic gates can represent any Boolean function [127]. Any Boolean function can be written as a sum of products (disjunctive normal form: AND gates on the first layer with optional negation of inputs, and OR gate on the second layer) or a product of sums (conjunctive normal form: OR gates on the first layer with optional negation of inputs, and AND gate on the second layer). To understand the limitations of shallow architectures, the first result to consider is that with depth-two logical circuits, most Boolean functions require an exponential (with respect to input size) number of logic gates [198] to be represented.
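The sum-of-products construction can be sketched as follows: one AND term per input pattern on which the function outputs 1, followed by a single OR. This is a naive construction (it makes no attempt to merge terms), and the helper names `dnf_terms`, `eval_dnf`, and the `majority` example function are assumptions of this sketch.

```python
from itertools import product


def dnf_terms(f, d: int):
    """Return one AND term (a d-bit pattern) per input on which f is 1."""
    return [bits for bits in product((0, 1), repeat=d) if f(*bits)]


def eval_dnf(terms, bits) -> int:
    """Evaluate the depth-two circuit built from the given terms.

    First layer: each AND gate fires when the input matches its pattern
    (inputs are negated where the pattern has a 0).
    Second layer: a single OR gate over all AND gates.
    """
    return int(any(all(b == t for b, t in zip(bits, term)) for term in terms))


def majority(b1: int, b2: int, b3: int) -> int:
    """Hypothetical example function: 1 when at least two inputs are 1."""
    return int(b1 + b2 + b3 >= 2)


terms = dnf_terms(majority, 3)
```

The number of AND gates equals the number of inputs mapped to 1, which can be as large as 2^d for d inputs. This is why being able to represent every function in two layers says nothing about doing so compactly.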

More interestingly, there are functions computable with a polynomial-size logic gates circuit of depth k that require exponential size when restricted to depth k − 1 [62]. The proof of this theorem relies on earlier results [208] showing that d-bit parity circuits of depth 2 have exponential size. The d-bit parity function is defined as usual:

$$
\text{parity} : (b_1, \ldots, b_d) \in \{0,1\}^d \mapsto
\begin{cases}
1, & \text{if } \sum_{i=1}^{d} b_i \text{ is even} \\
0, & \text{otherwise.}
\end{cases}
$$
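To give a feel for the gap between shallow and deeper circuits on parity, the sketch below compares two ways of computing it under the convention above (output 1 when the sum of the bits is even). It counts the AND terms of the naive depth-two sum of products and the gates of a balanced tree of two-input XOR gates. Treating XOR as a single gate is an assumption of this example: over {AND, OR, NOT} each XOR would itself be a small constant-size sub-circuit. The function names are hypothetical.

```python
from itertools import product


def parity(bits) -> int:
    """1 if the sum of the bits is even, 0 otherwise."""
    return 1 if sum(bits) % 2 == 0 else 0


def parity_deep(bits):
    """Compute parity with a balanced tree of 2-input XORs and a final NOT.

    Returns (value, number of XOR gates, number of XOR levels).
    """
    layer = list(bits)
    n_gates = 0
    levels = 0
    while len(layer) > 1:
        # Pair up neighbours; an unpaired last element is carried over.
        nxt = [layer[i] ^ layer[i + 1] for i in range(0, len(layer) - 1, 2)]
        n_gates += len(nxt)
        if len(layer) % 2:
            nxt.append(layer[-1])
        layer = nxt
        levels += 1
    # XOR of all bits is 1 when the sum is odd, so negate it.
    return 1 - layer[0], n_gates, levels


def dnf_size(d: int) -> int:
    """Number of AND terms in the naive sum of products for d-bit parity."""
    return sum(parity(bits) for bits in product((0, 1), repeat=d))
```

For d = 8 the tree uses 7 XOR gates on 3 levels, while the sum of products has 128 terms, i.e. 2^(d − 1). The tree grows linearly with d and the depth-two form doubles with every added bit.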

One might wonder whether these computational complexity results for Boolean circuits are relevant to machine learning. See [140] for an early survey of theoretical results in computational complexity relevant to learning algorithms. Interestingly, many of the results for Boolean circuits can be generalized to architectures whose computational elements are linear threshold units (also known as artificial neurons [125]), which compute

$$
f(x) = \mathbf{1}_{w^\top x + b \ge 0} \tag{2.1}
$$

with parameters w and b. The fan-in of a circuit is the maximum number of inputs of a particular element. Circuits are often organized in layers, like multi-layer neural networks, where elements in a layer only take their input from elements in the previous layer(s), and the first layer is the neural network input. The size of a circuit is the number of its computational elements (excluding input elements, which do not perform any computation).
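A linear threshold unit, and the size and fan-in of a layered circuit built from such units, can be sketched in a few lines. The representation of a unit as a (weights, bias) pair, the helper names, and the hand-picked weights of the two-layer XOR example are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

Unit = Tuple[Sequence[float], float]  # (weights w, bias b)


def threshold_unit(w: Sequence[float], b: float, x: Sequence[float]) -> int:
    """Linear threshold unit: 1 when w . x + b >= 0, else 0."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else 0


def run_layers(layers: List[List[Unit]], x: Sequence[float]) -> List[int]:
    """Each layer takes its inputs from the outputs of the previous layer."""
    h = list(x)  # the input elements perform no computation
    for layer in layers:
        h = [threshold_unit(w, b, h) for w, b in layer]
    return h


def circuit_size(layers: List[List[Unit]]) -> int:
    """Number of computational elements, excluding the input elements."""
    return sum(len(layer) for layer in layers)


def fan_in(layers: List[List[Unit]]) -> int:
    """Maximum number of inputs of any single element."""
    return max(len(w) for layer in layers for w, _ in layer)


# Hypothetical two-layer threshold circuit computing XOR of two bits.
XOR_LAYERS: List[List[Unit]] = [
    [((1, 1), -1), ((-1, -1), 1)],  # OR(x1, x2) and NAND(x1, x2)
    [((1, 1), -2)],                 # AND of the two hidden units
]
```

Here `circuit_size(XOR_LAYERS)` is 3 and `fan_in(XOR_LAYERS)` is 2. The three units are counted, while the two input elements are not.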
