gradient is set to 1 since ∂L/∂L = 1, thus initializing the recursion. Manual or automatic differentiation
then only requires defining the partial derivative associated with each type of operation performed by any node of the graph. When implementing gradient descent algorithms with manual differentiation, the result tends to be verbose, brittle code that lacks modularity – all bad things in terms of software engineering. A better approach is to express the flow graph in terms of objects that modularize how to compute outputs from inputs as well as how to compute the partial derivatives necessary for gradient descent. One can pre-define the operations of these objects (in a “forward propagation” or fprop method) and their partial derivatives (in a “backward propagation” or bprop method) and encapsulate these computations in an object that knows how to compute its output given its inputs, and how to compute the gradient with respect to its inputs given the gradient with respect to its output. This is the strategy adopted in the Theano library (http://deeplearning.net/software/theano/) with its Op objects (Bergstra et al., 2010), as well as in libraries such as Torch (http://www.torch.ch; Collobert et al., 2011b) and Lush (http://lush.sourceforge.net).
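To make the fprop/bprop strategy concrete, the following minimal Python sketch (illustrative only; it is not Theano's actual Op interface, and the class and method names are assumptions chosen to match the terminology above) shows a node that computes an elementwise product on the forward pass and back-propagates gradients to its inputs:

import numpy as np

class MultiplyNode:
    # Elementwise product node: output = a * b.

    def fprop(self, a, b):
        # Cache the inputs; they are needed to compute the partial derivatives.
        self.a, self.b = a, b
        return a * b

    def bprop(self, grad_output):
        # Given dL/d(output), return (dL/da, dL/db) via the chain rule:
        # d(a*b)/da = b and d(a*b)/db = a.
        return grad_output * self.b, grad_output * self.a

# Usage: forward pass, then backward pass seeded with dL/dL = 1.
node = MultiplyNode()
out = node.fprop(np.array([2.0, 3.0]), np.array([4.0, 5.0]))
grad_a, grad_b = node.bprop(np.ones_like(out))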
Compared to Torch and Lush, Theano adds an interesting ingredient which makes it a full-fledged automatic differentiation tool: symbolic computation. The flow graph itself (without the numerical values attached) can be viewed as a symbolic representation (in a data structure) of a numerical computation. In Theano, the gradient computation is first performed symbolically, i.e., each Op object knows how to create other Ops corresponding to the computation of the partial derivatives associated with that Op. Hence the symbolic differentiation of the output of a flow graph with respect to any or all of its input nodes can be performed easily in most cases, yielding another flow graph which specifies how to compute these gradients, given the input of the original graph. Since the gradient graph typically contains the original graph (mapping parameters to loss) as a sub-graph, in order to make computations efficient it is important to automate (as done in Theano) a number of simplifications, which are graph transformations preserving the
semantics of the output (given the input) but yielding smaller (or more numerically stable or more efficiently computed) graphs (e.g., removing redundant computations). To take advantage of the fact that computing the loss gradient includes as a first step computing the loss itself, it is advantageous to structure the code so that both the loss and its gradient are computed at once, with a single graph having multiple outputs.
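As an illustration, the hedged sketch below (the variable names x, w, target and the squared-error loss are assumptions made for the example, not taken from the text) builds a scalar loss symbolically, derives its gradient with T.grad, and compiles a single Theano function with both as outputs, so the loss sub-graph is shared:

import numpy as np
import theano
import theano.tensor as T

x = T.dvector('x')
w = T.dvector('w')
target = T.dscalar('target')

loss = (T.dot(w, x) - target) ** 2          # scalar loss
grad_w = T.grad(loss, w)                    # symbolic gradient graph

# One compiled function, two outputs: the loss graph is reused for the gradient.
loss_and_grad = theano.function([x, w, target], [loss, grad_w])

l, g = loss_and_grad(np.array([1.0, 2.0]), np.array([0.5, -0.3]), 1.0)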
The advantages of performing gradient computations symbolically are numerous. First of all, one can readily compute gradients over gradients, i.e., second derivatives, which are useful for some learning algorithms. Second, one can define algorithms or training criteria involving gradients themselves, as required for example in the Contractive Auto-Encoder (which uses the norm of a Jacobian matrix in its training criterion, i.e., really requires second derivatives, which here are cheap to compute).
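For instance, a contractive penalty of this kind can be expressed directly with symbolic gradients. The sketch below is an assumption-laden illustration (the sigmoid encoder with weights W, b is chosen for the example, and theano.gradient.jacobian is used to build the Jacobian symbolically), not the implementation of the cited Contractive Auto-Encoder:

import theano
import theano.tensor as T

x = T.dvector('x')
W = T.dmatrix('W')
b = T.dvector('b')

h = T.nnet.sigmoid(T.dot(W, x) + b)          # hidden representation h(x)
jac = theano.gradient.jacobian(h, x)         # dh/dx, built as another symbolic graph
penalty = T.sum(jac ** 2)                    # squared Frobenius norm of the Jacobian

penalty_fn = theano.function([x, W, b], penalty)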
Third, it makes it easy to implement other useful graph transformations such as graph simplifications or numerical optimizations and transformations that help make the numerical results more robust and more efficient (such as working in the domain of logarithms of probabilities rather than in the domain of probabilities directly). Other potential beneficial applications of such symbolic manipulations include parallelization and additional differential operators (such as the R-operator, recently implemented in Theano, which is very useful to compute the product of a Jacobian matrix ∂f(x)/∂x or Hessian matrix ∂²L(x,θ)/∂θ² with a vector without ever having to actually compute and store the matrix itself (Pearlmutter, 1994)).
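Because first derivatives remain symbolic expressions, a Hessian-vector product can be obtained simply by differentiating twice, without ever materializing the Hessian (Theano's Rop provides an alternative route for Jacobian-vector products). The toy loss and the names theta and v below are illustrative assumptions:

import numpy as np
import theano
import theano.tensor as T

theta = T.dvector('theta')
v = T.dvector('v')

loss = T.sum(theta ** 4)                     # toy scalar loss L(theta)
g = T.grad(loss, theta)                      # first derivatives dL/dtheta
hv = T.grad(T.sum(g * v), theta)             # (d^2 L / dtheta^2) v, no Hessian stored

hessian_vector = theano.function([theta, v], hv)
print(hessian_vector(np.array([1.0, 2.0]), np.array([1.0, 0.0])))  # [12., 0.]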
3 Hyper-Parameters
A pure learning algorithm can be seen as a func-
tion taking training data as input and producing
as output a function (e.g. a predictor) or model
(i.e. a bunch of functions). However, in practice,
many learning algorithms involve hyper-parameters, i.e., annoying knobs to be adjusted. In many algorithms, such as Deep Learning algorithms, the number
of hyper-parameters (ten or more!) can make the idea
of having to adjust all of them unappealing. In addi-
tion, it has been shown that the use of computer clus-