蒙特利尔大学深度学习教程0.1入门指南

需积分: 9 154 浏览量更新于2024-07-17 收藏 1.59MB PDF 举报

本篇深度学习教程文档"DeepLearningTutorial_0.1.pdf"由LISAlab, University of Montreal于2015年9月1日发布，是一份综合性的深度学习入门指南。教程详细介绍了深度学习的基本概念、实践步骤和技术应用。第1章是许可证部分，确保了读者对文档使用的法律条款有所了解。接下来的章节深入浅出地引导读者进入深度学习的世界：第3章"Getting Started"首先指导用户下载必要的工具（如Theano和Python库），并提供了一些数据集的获取途径。然后，讲解了监督优化在深度学习中的基础，介绍了损失函数的概念，并举例说明如何定义和使用逻辑回归模型进行MNIST手写数字分类。第4章聚焦于逻辑回归模型的实际应用，从模型构建、损失函数设计、创建分类类到模型训练和测试，逐步展示了整个流程。这部分内容对于理解浅层神经网络的工作原理至关重要。第5章介绍多层感知器（Multilayer Perceptron, MLP），展示了如何从逻辑回归模型扩展到多层结构，以及训练技巧。通过实例演示了将这些组件整合在一起的方法。第6章转向卷积神经网络（Convolutional Neural Networks, CNN），主要讨论了LeNet的设计动机，强调了稀疏连接、共享权重等关键概念。进一步解释了卷积和池化操作，完整介绍了LeNet模型的架构，包括如何运行代码和训练过程中的实用技巧。第7章探讨了denoising autoencoders (dA)，一种用于无监督学习和特征提取的重要工具，它们通过降噪来学习数据的内在表示。这篇教程不仅涵盖了深度学习的基本框架，还提供了实践经验，使读者能够逐步掌握从基础到高级的深度学习技术，适用于初学者和希望提升技能的专业人士。通过阅读和实践教程中的示例，读者将能够更好地理解和构建自己的深度学习项目。

Deep Learning Tutorial, Release 0.1

# NLL is a symbolic variable ; to get the actual value of NLL, this symbolic

# expression has to be compiled into a Theano function (see the Theano

# tutorial for more details)

NLL = -T.sum(T.log(p_y_given_x)[T.arange(y.shape[0]), y])

# note on syntax: T.arange(y.shape[0]) is a vector of integers [0,1,2,...,len(y)].

# Indexing a matrix M by the two vectors [0,1,...,K], [a,b,...,k] returns the

# elements M[0,a], M[1,b], ..., M[K,k] as a vector. Here, we use this

# syntax to retrieve the log-probability of the correct labels, y.

3.4.2 Stochastic Gradient Descent

What is ordinary gradient descent? it is a simple algorithm in which we repeatedly make small steps down-

ward on an error surface deﬁned by a loss function of some parameters. For the purpose of ordinary gradient

descent we consider that the training data is rolled into the loss function. Then the pseudocode of this algo-

rithm can be described as :

# GRADIENT DESCENT

while True:

loss = f(params)

d_loss_wrt_params = ... # compute gradient

params -= learning_rate

d_loss_wrt_params

if <stopping condition is met>:

return params

Stochastic gradient descent (SGD) works according to the same principles as ordinary gradient descent, but

proceeds more quickly by estimating the gradient from just a few examples at a time instead of the entire

training set. In its purest form, we estimate the gradient from just a single example at a time.

# STOCHASTIC GRADIENT DESCENT

for (x_i,y_i) in training_set:

# imagine an infinite generator

# that may repeat examples (if there is only a finite training set)

loss = f(params, x_i, y_i)

d_loss_wrt_params = ... # compute gradient

params -= learning_rate

d_loss_wrt_params

if <stopping condition is met>:

return params

The variant that we recommend for deep learning is a further twist on stochastic gradient descent using so-

called “minibatches”. Minibatch SGD works identically to SGD, except that we use more than one training

example to make each estimate of the gradient. This technique reduces variance in the estimate of the

gradient, and often makes better use of the hierarchical memory organization in modern computers.

for (x_batch,y_batch) in train_batches:

# imagine an infinite generator

# that may repeat examples

loss = f(params, x_batch, y_batch)

d_loss_wrt_params = ... # compute gradient using theano

params -= learning_rate

d_loss_wrt_params

10 Chapter 3. Getting Started

欢迎加入非盈利Python编程学习交流QQ群783462347，群里免费提供500+本Python书籍！

Deep Learning Tutorial, Release 0.1

if <stopping condition is met>:

return params

There is a tradeoff in the choice of the minibatch size B. The reduction of variance and use of SIMD

instructions helps most when increasing B from 1 to 2, but the marginal improvement fades rapidly to

nothing. With large B, time is wasted in reducing the variance of the gradient estimator, that time would be

better spent on additional gradient steps. An optimal B is model-, dataset-, and hardware-dependent, and

can be anywhere from 1 to maybe several hundreds. In the tutorial we set it to 20, but this choice is almost

arbitrary (though harmless).

Note: If you are training for a ﬁxed number of epochs, the minibatch size becomes important because it

controls the number of updates done to your parameters. Training the same model for 10 epochs using a

batch size of 1 yields completely different results compared to training for the same 10 epochs but with a

batchsize of 20. Keep this in mind when switching between batch sizes and be prepared to tweak all the

other parameters acording to the batch size used.

All code-blocks above show pseudocode of how the algorithm looks like. Implementing such algorithm in

Theano can be done as follows :

# Minibatch Stochastic Gradient Descent

# assume loss is a symbolic description of the loss function given

# the symbolic variables params (shared variable), x_batch, y_batch;

# compute gradient of loss with respect to params

d_loss_wrt_params = T.grad(loss, params)

# compile the MSGD step into a theano function

updates = [(params, params - learning_rate

d_loss_wrt_params)]

MSGD = theano.function([x_batch,y_batch], loss, updates=updates)

for (x_batch, y_batch) in train_batches:

# here x_batch and y_batch are elements of train_batches and

# therefore numpy arrays; function MSGD also updates the params

print(’Current loss is ’, MSGD(x_batch, y_batch))

if stopping_condition_is_met:

return params

3.4.3 Regularization

There is more to machine learning than optimization. When we train our model from data we are trying

to prepare it to do well on new examples, not the ones it has already seen. The training loop above for

MSGD does not take this into account, and may overﬁt the training examples. A way to combat overﬁtting

is through regularization. There are several techniques for regularization; the ones we will explain here are

L1/L2 regularization and early-stopping.

3.4. A Primer on Supervised Optimization for Deep Learning 11

欢迎加入非盈利Python编程学习交流QQ群783462347，群里免费提供500+本Python书籍！

Deep Learning Tutorial, Release 0.1

L1 and L2 regularization

L1 and L2 regularization involve adding an extra term to the loss function, which penalizes certain parameter

conﬁgurations. Formally, if our loss function is:

NLL(θ, D) = −

|D|

i=0

log P (Y = y

(i)

, θ)

then the regularized loss will be:

E(θ, D) = NLL(θ, D) + λR(θ)

or, in our case

E(θ, D) = NLL(θ, D) + λ||θ||

where

||θ||





|θ|

j=0

|θ





which is the L

norm of θ. λ is a hyper-parameter which controls the relative importance of the regularization

parameter. Commonly used values for p are 1 and 2, hence the L1/L2 nomenclature. If p=2, then the

regularizer is also called “weight decay”.

In principle, adding a regularization term to the loss will encourage smooth network mappings in a neural

network (by penalizing large values of the parameters, which decreases the amount of nonlinearity that

the network models). More intuitively, the two terms (NLL and R(θ)) correspond to modelling the data

well (NLL) and having “simple” or “smooth” solutions (R(θ)). Thus, minimizing the sum of both will, in

theory, correspond to ﬁnding the right trade-off between the ﬁt to the training data and the “generality” of

the solution that is found. To follow Occam’s razor principle, this minimization should ﬁnd us the simplest

solution (as measured by our simplicity criterion) that ﬁts the training data.

Note that the fact that a solution is “simple” does not mean that it will generalize well. Empirically, it

was found that performing such regularization in the context of neural networks helps with generalization,

especially on small datasets. The code block below shows how to compute the loss in python when it

contains both a L1 regularization term weighted by λ

and L2 regularization term weighted by λ

# symbolic Theano variable that represents the L1 regularization term

L1 = T.sum(abs(param))

# symbolic Theano variable that represents the squared L2 term

L2_sqr = T.sum(param

# the loss

loss = NLL + lambda_1

L1 + lambda_2

Early-Stopping

Early-stopping combats overﬁtting by monitoring the model’s performance on a validation set. A validation

set is a set of examples that we never use for gradient descent, but which is also not a part of the test set. The

12 Chapter 3. Getting Started

欢迎加入非盈利Python编程学习交流QQ群783462347，群里免费提供500+本Python书籍！

Deep Learning Tutorial, Release 0.1

validation examples are considered to be representative of future test examples. We can use them during

training because they are not part of the test set. If the model’s performance ceases to improve sufﬁciently

on the validation set, or even degrades with further optimization, then the heuristic implemented here gives

up on much further optimization.

The choice of when to stop is a judgement call and a few heuristics exist, but these tutorials will make use

of a strategy based on a geometrically increasing amount of patience.

# early-stopping parameters

patience = 5000 # look as this many examples regardless

patience_increase = 2 # wait this much longer when a new best is

# found

improvement_threshold = 0.995 # a relative improvement of this much is

# considered significant

validation_frequency = min(n_train_batches, patience/2)

# go through this many

# minibatches before checking the network

# on the validation set; in this case we

# check every epoch

best_params = None

best_validation_loss = numpy.inf

test_score = 0.

start_time = time.clock()

done_looping = False

epoch = 0

while (epoch < n_epochs) and (not done_looping):

# Report "1" for first epoch, "n_epochs" for last epoch

epoch = epoch + 1

for minibatch_index in xrange(n_train_batches):

d_loss_wrt_params = ... # compute gradient

params -= learning_rate

d_loss_wrt_params # gradient descent

# iteration number. We want it to start at 0.

iter = (epoch - 1)

n_train_batches + minibatch_index

# note that if we do ‘iter % validation_frequency‘ it will be

# true for iter = 0 which we do not want. We want it true for

# iter = validation_frequency - 1.

if (iter + 1) % validation_frequency == 0:

this_validation_loss = ... # compute zero-one loss on validation set

if this_validation_loss < best_validation_loss:

# improve patience if loss improvement is good enough

if this_validation_loss < best_validation_loss

improvement_threshold:

patience = max(patience, iter

patience_increase)

best_params = copy.deepcopy(params)

best_validation_loss = this_validation_loss

if patience <= iter:

3.4. A Primer on Supervised Optimization for Deep Learning 13

欢迎加入非盈利Python编程学习交流QQ群783462347，群里免费提供500+本Python书籍！

Deep Learning Tutorial, Release 0.1

done_looping = True

break

# POSTCONDITION:

# best_params refers to the best out-of-sample parameters observed during the optimization

If we run out of batches of training data before running out of patience, then we just go back to the beginning

of the training set and repeat.

Note: The validation_frequency should always be smaller than the patience. The code should

check at least two times how it performs before running out of patience. This is the reason we used the

formulation validation_frequency = min( value, patience/2.)

Note: This algorithm could possibly be improved by using a test of statistical signiﬁcance rather than the

simple comparison, when deciding whether to increase the patience.

3.4.4 Testing

After the loop exits, the best_params variable refers to the best-performing model on the validation set. If

we repeat this procedure for another model class, or even another random initialization, we should use the

same train/valid/test split of the data, and get other best-performing models. If we have to choose what the

best model class or the best initialization was, we compare the best_validation_loss for each model. When

we have ﬁnally chosen the model we think is the best (on validation data), we report that model’s test set

performance. That is the performance we expect on unseen examples.

3.4.5 Recap

That’s it for the optimization section. The technique of early-stopping requires us to partition the set of

examples into three sets (training D

train

, validation D

valid

, test D

test

). The training set is used for minibatch

stochastic gradient descent on the differentiable approximation of the objective function. As we perform

this gradient descent, we periodically consult the validation set to see how our model is doing on the real

objective function (or at least our empirical estimate of it). When we see a good model on the validation set,

we save it. When it has been a long time since seeing a good model, we abandon our search and return the

best parameters found, for evaluation on the test set.

3.5 Theano/Python Tips

3.5.1 Loading and Saving Models

When you’re doing experiments, it can take hours (sometimes days!) for gradient-descent to ﬁnd the best

parameters. You will want to save those weights once you ﬁnd them. You may also want to save your

current-best estimates as the search progresses.

Pickle the numpy ndarrays from your shared variables

14 Chapter 3. Getting Started

欢迎加入非盈利Python编程学习交流QQ群783462347，群里免费提供500+本Python书籍！

剩余172页未读，继续阅读

weixin_38743737

粉丝: 376
资源: 2万+

蒙特利尔大学深度学习教程0.1入门指南

Deep Learning Tutorial Release 0.1

Deep Learning Tutorial Release 0.1 (Python)

operating_system_tutorial.pdf

ros noetic moveit中的Planning Adapter Tutorials功能python示例

用自动化编码器实现低剂量CT重建pytorch代码

使用phpstudy做出打卡系统的代码

ansys_fluent_tutorial_guide_2022_r1.pdf

(base) C:\Users\李浩杨\Desktop\machine_learning_notebook>jupyter trust LR Tutorial.ipynb [TrustNotebookApp] ERROR | Notebook missing: LR 明明有LR Tutorial.ipynb文件，为啥还会这样？应该怎么操作呢？

161204_mastering_the_freertos_real_time_kernel-a_hands- on_tutorial_guide

cannot import name 'cross_validation' from 'sklearn' (C:\Users\86183\miniconda3\lib\site-packages\sklearn\__init__.py)

最新资源

cannot import name 'cross_validation' from 'sklearn' (C:\Users\86183\miniconda3\lib\site-packages\sklearn\init.py)