Introduction and Practice of Deep Learning: Recommended by Elon Musk, Co-founder of Tesla and SpaceX
Deep Learning, written by three leading experts in the field, is an authoritative and comprehensive introduction to deep learning, and has been highly praised by Elon Musk, co-founder and CEO of Tesla and SpaceX. Deep learning is a branch of machine learning that lets computers learn from experience, understand the world, and build complex hierarchies of concepts. Unlike traditional programming, deep learning models learn on their own; a human programmer does not need to specify in advance all of the knowledge the system requires.

The book covers a broad range of deep learning topics, including the mathematical and conceptual foundations: linear algebra, probability theory, information theory, numerical computation, and the basics of machine learning. It also treats practical techniques in depth: deep feedforward networks, regularization methods, optimization algorithms, convolutional neural networks (CNNs), sequence modeling, and applications in natural language processing, speech recognition, computer vision, online recommender systems, bioinformatics, and games.

For beginners and researchers alike, the book offers a path from first steps to practice: how to download datasets, define loss functions, and build and train models such as logistic regression, multilayer perceptrons (MLPs), and convolutional networks (LeNet). It also introduces autoencoders and their variants, such as the denoising autoencoder (dA), which play an important role in unsupervised learning and data preprocessing.

Along the way, readers will pick up Theano and Python techniques, together with practical tips and tricks for training multi-layer models. The tutorials include hands-on material: how to test models, run the code, and configure and optimize different kinds of networks. Overall, Deep Learning is a textbook tailored to students, researchers, and software engineers, designed to help them understand and apply this cutting-edge technology, whether in industrial product development or in research projects.
Deep Learning Tutorial, Release 0.1
# NLL is a symbolic variable ; to get the actual value of NLL, this symbolic
# expression has to be compiled into a Theano function (see the Theano
# tutorial for more details)
NLL = -T.sum(T.log(p_y_given_x)[T.arange(y.shape[0]), y])
# note on syntax: T.arange(y.shape[0]) is a vector of integers [0,1,2,...,len(y)].
# Indexing a matrix M by the two vectors [0,1,...,K], [a,b,...,k] returns the
# elements M[0,a], M[1,b], ..., M[K,k] as a vector. Here, we use this
# syntax to retrieve the log-probability of the correct labels, y.
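As the comments note, NLL is only a symbolic expression until it is compiled. The following is a minimal sketch of how such an expression might be built and compiled, assuming a hypothetical softmax model with made-up dimensions (784 inputs, 10 classes); none of these names or shapes come from the tutorial itself:

import numpy
import theano
import theano.tensor as T

# symbolic inputs: a minibatch of examples and their integer labels
x = T.matrix('x')
y = T.ivector('y')

# hypothetical softmax model (784 inputs, 10 classes)
W = theano.shared(numpy.zeros((784, 10)), name='W')
b = theano.shared(numpy.zeros(10), name='b')
p_y_given_x = T.nnet.softmax(T.dot(x, W) + b)

# the symbolic NLL from above, compiled into a callable function
NLL = -T.sum(T.log(p_y_given_x)[T.arange(y.shape[0]), y])
nll = theano.function([x, y], NLL)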
3.4.2 Stochastic Gradient Descent
What is ordinary gradient descent? It is a simple algorithm in which we repeatedly make small steps downward on an error surface defined by a loss function of some parameters. For the purpose of ordinary gradient descent we consider that the training data is rolled into the loss function. The pseudocode of this algorithm can then be described as:
# GRADIENT DESCENT
while True:
    loss = f(params)
    d_loss_wrt_params = ... # compute gradient
    params -= learning_rate * d_loss_wrt_params
    if <stopping condition is met>:
        return params
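To make the pseudocode concrete, here is a runnable sketch in plain numpy that minimizes a toy quadratic loss; the loss function, learning rate, and stopping tolerance are all invented for the example:

import numpy

def f_and_grad(params):
    # toy quadratic loss (params - 3)^2 and its gradient
    loss = numpy.sum((params - 3.0) ** 2)
    return loss, 2.0 * (params - 3.0)

params = numpy.zeros(5)
learning_rate = 0.1
while True:
    loss, d_loss_wrt_params = f_and_grad(params)
    params -= learning_rate * d_loss_wrt_params
    if loss < 1e-8:  # stopping condition
        break
# every entry of params is now close to the minimizer 3.0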
Stochastic gradient descent (SGD) works according to the same principles as ordinary gradient descent, but
proceeds more quickly by estimating the gradient from just a few examples at a time instead of the entire
training set. In its purest form, we estimate the gradient from just a single example at a time.
# STOCHASTIC GRADIENT DESCENT
for (x_i, y_i) in training_set:
    # imagine an infinite generator
    # that may repeat examples (if there is only a finite training set)
    loss = f(params, x_i, y_i)
    d_loss_wrt_params = ... # compute gradient
    params -= learning_rate * d_loss_wrt_params
    if <stopping condition is met>:
        return params
The variant that we recommend for deep learning is a further twist on stochastic gradient descent using so-called "minibatches". Minibatch SGD works identically to SGD, except that we use more than one training example to make each estimate of the gradient. This technique reduces variance in the estimate of the gradient, and often makes better use of the hierarchical memory organization in modern computers.
for (x_batch, y_batch) in train_batches:
    # imagine an infinite generator
    # that may repeat examples
    loss = f(params, x_batch, y_batch)
    d_loss_wrt_params = ... # compute gradient using theano
    params -= learning_rate * d_loss_wrt_params
    if <stopping condition is met>:
        return params
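The train_batches sequence above is left abstract. One way to realize the "infinite generator that may repeat examples" is a simple slicing generator over numpy arrays, sketched here with an assumed batch size of 20 (the function name and batch size are illustrative choices, not part of the tutorial):

def iterate_minibatches(data_x, data_y, batch_size=20):
    # cycle through the dataset forever, yielding consecutive slices
    n = data_x.shape[0]
    while True:
        for start in range(0, n - batch_size + 1, batch_size):
            stop = start + batch_size
            yield data_x[start:stop], data_y[start:stop]

# usage: train_batches = iterate_minibatches(train_x, train_y)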
There is a tradeoff in the choice of the minibatch size B. The reduction of variance and the use of SIMD instructions help most when increasing B from 1 to 2, but the marginal improvement fades rapidly to nothing. With large B, time is wasted in reducing the variance of the gradient estimator; that time would be better spent on additional gradient steps. An optimal B is model-, dataset-, and hardware-dependent, and can be anywhere from 1 to perhaps several hundred. In the tutorial we set it to 20, but this choice is almost arbitrary (though harmless).
Note: If you are training for a fixed number of epochs, the minibatch size becomes important because it controls the number of updates made to your parameters. Training the same model for 10 epochs with a batch size of 1 yields completely different results from training for the same 10 epochs with a batch size of 20. Keep this in mind when switching between batch sizes, and be prepared to tweak all the other parameters according to the batch size used.
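As a worked example of why this matters, assume a hypothetical training set of 50,000 examples (a made-up size, not from the tutorial); over the same 10 epochs, batch size 1 performs twenty times as many parameter updates as batch size 20:

n_examples, n_epochs = 50000, 10  # assumed sizes for illustration
for batch_size in (1, 20):
    n_updates = n_epochs * (n_examples // batch_size)
    print(batch_size, n_updates)  # prints 1 500000, then 20 25000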
All the code blocks above show pseudocode of what the algorithm looks like. Implementing such an algorithm in Theano can be done as follows:
# Minibatch Stochastic Gradient Descent

# assume loss is a symbolic description of the loss function given
# the symbolic variables params (shared variable), x_batch, y_batch;

# compute gradient of loss with respect to params
d_loss_wrt_params = T.grad(loss, params)

# compile the MSGD step into a theano function
updates = [(params, params - learning_rate * d_loss_wrt_params)]
MSGD = theano.function([x_batch, y_batch], loss, updates=updates)

for (x_batch, y_batch) in train_batches:
    # here x_batch and y_batch are elements of train_batches and
    # therefore numpy arrays; function MSGD also updates the params
    print('Current loss is ', MSGD(x_batch, y_batch))
    if stopping_condition_is_met:
        return params
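Putting these pieces together, the following is a self-contained sketch of minibatch SGD for a small logistic regression on random data. The shapes, learning rate, batch size, and number of epochs are arbitrary illustrative choices, not values from the tutorial:

import numpy
import theano
import theano.tensor as T

rng = numpy.random.RandomState(0)
train_x = rng.randn(1000, 20).astype(theano.config.floatX)
train_y = rng.randint(0, 2, size=1000).astype('int32')

x_batch = T.matrix('x_batch')
y_batch = T.ivector('y_batch')
W = theano.shared(numpy.zeros((20, 2), dtype=theano.config.floatX), name='W')
b = theano.shared(numpy.zeros(2, dtype=theano.config.floatX), name='b')
params = [W, b]

# mean negative log-likelihood of the correct labels
p_y_given_x = T.nnet.softmax(T.dot(x_batch, W) + b)
loss = -T.mean(T.log(p_y_given_x)[T.arange(y_batch.shape[0]), y_batch])

# one gradient and one update rule per parameter
grads = T.grad(loss, params)
learning_rate = 0.1
updates = [(p, p - learning_rate * g) for p, g in zip(params, grads)]
MSGD = theano.function([x_batch, y_batch], loss, updates=updates)

batch_size = 20
for epoch in range(5):
    for start in range(0, train_x.shape[0], batch_size):
        MSGD(train_x[start:start + batch_size], train_y[start:start + batch_size])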
3.4.3 Regularization
There is more to machine learning than optimization. When we train our model from data we are trying
to prepare it to do well on new examples, not the ones it has already seen. The training loop above for
MSGD does not take this into account, and may overfit the training examples. A way to combat overfitting
is through regularization. There are several techniques for regularization; the ones we will explain here are
L1/L2 regularization and early-stopping.
L1 and L2 regularization
L1 and L2 regularization involve adding an extra term to the loss function, which penalizes certain parameter
configurations. Formally, if our loss function is:
$$\mathrm{NLL}(\theta, D) = - \sum_{i=0}^{|D|} \log P(Y = y^{(i)} \mid x^{(i)}, \theta)$$
then the regularized loss will be:
$$E(\theta, D) = \mathrm{NLL}(\theta, D) + \lambda R(\theta)$$
or, in our case
$$E(\theta, D) = \mathrm{NLL}(\theta, D) + \lambda \|\theta\|_p^p$$
where

$$\|\theta\|_p = \left( \sum_{j=0}^{|\theta|} |\theta_j|^p \right)^{1/p}$$

which is the L_p norm of θ. λ is a hyper-parameter which controls the relative importance of the regularization term. Commonly used values for p are 1 and 2, hence the L1/L2 nomenclature. If p = 2, the regularizer is also called "weight decay".
In principle, adding a regularization term to the loss will encourage smooth network mappings in a neural
network (by penalizing large values of the parameters, which decreases the amount of nonlinearity that
the network models). More intuitively, the two terms (NLL and R(θ)) correspond to modelling the data
well (NLL) and having “simple” or “smooth” solutions (R(θ)). Thus, minimizing the sum of both will, in
theory, correspond to finding the right trade-off between the fit to the training data and the “generality” of
the solution that is found. To follow Occam’s razor principle, this minimization should find us the simplest
solution (as measured by our simplicity criterion) that fits the training data.
Note that the fact that a solution is “simple” does not mean that it will generalize well. Empirically, it
was found that performing such regularization in the context of neural networks helps with generalization,
especially on small datasets. The code block below shows how to compute the loss in Python when it contains both an L1 regularization term weighted by λ_1 and an L2 regularization term weighted by λ_2:
# symbolic Theano variable that represents the L1 regularization term
L1 = T.sum(abs(param))

# symbolic Theano variable that represents the squared L2 term
L2_sqr = T.sum(param ** 2)

# the regularized loss
loss = NLL + lambda_1 * L1 + lambda_2 * L2_sqr
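A real model usually has several parameter tensors rather than a single param. The following is a hedged sketch of the same idea over a list of parameters; the shapes and regularization weights here are invented for illustration, and NLL is assumed to be a symbolic loss defined elsewhere:

import numpy
import theano
import theano.tensor as T

# hypothetical model parameters; in practice these come from your model
params = [theano.shared(numpy.ones((5, 5)), name='W'),
          theano.shared(numpy.ones(5), name='b')]
lambda_1, lambda_2 = 0.001, 0.0001  # assumed regularization weights

# sum each regularization term over every parameter in the model
L1 = sum(T.sum(abs(p)) for p in params)
L2_sqr = sum(T.sum(p ** 2) for p in params)
# regularized loss, given some symbolic NLL defined elsewhere:
# loss = NLL + lambda_1 * L1 + lambda_2 * L2_sqr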
Early-Stopping
Early-stopping combats overfitting by monitoring the model’s performance on a validation set. A validation
set is a set of examples that we never use for gradient descent, but which is also not a part of the test set. The
validation examples are considered to be representative of future test examples. We can use them during
training because they are not part of the test set. If the model’s performance ceases to improve sufficiently
on the validation set, or even degrades with further optimization, then the heuristic implemented here gives
up on much further optimization.
The choice of when to stop is a judgement call and a few heuristics exist, but these tutorials will make use
of a strategy based on a geometrically increasing amount of patience.
# early-stopping parameters
patience = 5000                # look at this many examples regardless
patience_increase = 2          # wait this much longer when a new best is
                               # found
improvement_threshold = 0.995  # a relative improvement of this much is
                               # considered significant
validation_frequency = min(n_train_batches, patience / 2)
                               # go through this many
                               # minibatches before checking the network
                               # on the validation set; in this case we
                               # check every epoch

best_params = None
best_validation_loss = numpy.inf
test_score = 0.
start_time = time.clock()

done_looping = False
epoch = 0
while (epoch < n_epochs) and (not done_looping):
    # Report "1" for first epoch, "n_epochs" for last epoch
    epoch = epoch + 1
    for minibatch_index in xrange(n_train_batches):

        d_loss_wrt_params = ... # compute gradient
        params -= learning_rate * d_loss_wrt_params # gradient descent

        # iteration number. We want it to start at 0.
        iter = (epoch - 1) * n_train_batches + minibatch_index
        # note that if we do `iter % validation_frequency` it will be
        # true for iter = 0 which we do not want. We want it true for
        # iter = validation_frequency - 1.
        if (iter + 1) % validation_frequency == 0:

            this_validation_loss = ... # compute zero-one loss on validation set

            if this_validation_loss < best_validation_loss:

                # improve patience if loss improvement is good enough
                if this_validation_loss < best_validation_loss * improvement_threshold:
                    patience = max(patience, iter * patience_increase)

                best_params = copy.deepcopy(params)
                best_validation_loss = this_validation_loss

        if patience <= iter:
            done_looping = True
            break

# POSTCONDITION:
# best_params refers to the best out-of-sample parameters observed during the optimization
If we run out of batches of training data before running out of patience, then we just go back to the beginning
of the training set and repeat.
Note: The validation_frequency should always be smaller than the patience. The code should check performance at least twice before running out of patience. This is the reason we used the formulation validation_frequency = min(value, patience/2).
Note: This algorithm could possibly be improved by using a test of statistical significance rather than the
simple comparison, when deciding whether to increase the patience.
3.4.4 Testing
After the loop exits, the best_params variable refers to the best-performing model on the validation set. If
we repeat this procedure for another model class, or even another random initialization, we should use the
same train/valid/test split of the data, and get other best-performing models. If we have to choose what the
best model class or the best initialization was, we compare the best_validation_loss for each model. When
we have finally chosen the model we think is the best (on validation data), we report that model’s test set
performance. That is the performance we expect on unseen examples.
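In code, this selection step can be a simple comparison over the recorded validation losses; here is a minimal sketch with made-up numbers for three model classes (none of these values come from the tutorial):

# hypothetical (best_validation_loss, test_score) pairs from three runs
runs = {'logreg': (0.085, 0.090), 'mlp': (0.042, 0.047), 'lenet': (0.011, 0.009)}
best_model = min(runs, key=lambda name: runs[name][0])  # choose on validation loss
print(best_model, runs[best_model][1])  # report only that model's test score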
3.4.5 Recap
That’s it for the optimization section. The technique of early-stopping requires us to partition the set of
examples into three sets (training D_train, validation D_valid, test D_test). The training set is used for minibatch
stochastic gradient descent on the differentiable approximation of the objective function. As we perform
this gradient descent, we periodically consult the validation set to see how our model is doing on the real
objective function (or at least our empirical estimate of it). When we see a good model on the validation set,
we save it. When it has been a long time since seeing a good model, we abandon our search and return the
best parameters found, for evaluation on the test set.
3.5 Theano/Python Tips
3.5.1 Loading and Saving Models
When you’re doing experiments, it can take hours (sometimes days!) for gradient-descent to find the best
parameters. You will want to save those weights once you find them. You may also want to save your
current-best estimates as the search progresses.
Pickle the numpy ndarrays from your shared variables
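A minimal sketch of that advice, assuming two hypothetical shared variables W and b standing in for a trained model's parameters: pickle the underlying parameter values, not the shared variables or compiled functions themselves.

import pickle
import numpy
import theano

# hypothetical shared variables standing in for a trained model's parameters
W = theano.shared(numpy.zeros((784, 10)), name='W')
b = theano.shared(numpy.zeros(10), name='b')

# save: pickle the underlying numpy ndarrays
with open('model_params.pkl', 'wb') as f:
    pickle.dump([W.get_value(), b.get_value()], f)

# load: read the ndarrays back into the shared variables
with open('model_params.pkl', 'rb') as f:
    W_value, b_value = pickle.load(f)
W.set_value(W_value)
b.set_value(b_value)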