Introduction and Practice of Deep Learning: Recommended by Elon Musk, Co-founder of Tesla and SpaceX
Deep Learning, written by three leading experts in the field, is an authoritative and comprehensive introduction to deep learning, and has been highly praised by Elon Musk, co-founder and CEO of Tesla and SpaceX. Deep learning is a branch of machine learning that lets computers learn from experience, understand the world, and build complex hierarchies of concepts. Unlike traditional programming, deep learning models learn on their own; a human programmer does not need to specify in advance all of the knowledge the system requires.

The book covers a broad range of deep learning topics, including the mathematical and conceptual foundations: linear algebra, probability theory, information theory, numerical computation, and the basics of machine learning. It also treats practical techniques in depth: deep feedforward networks, regularization methods, optimization algorithms, convolutional neural networks (CNNs), sequence modeling, and applications in natural language processing, speech recognition, computer vision, online recommender systems, bioinformatics, and games.

For beginners and researchers alike, the book offers a path from first steps to practice: how to download datasets, define loss functions, and build and train models such as logistic regression, multilayer perceptrons (MLPs), and convolutional networks (LeNet). It also introduces autoencoders and their variants, such as the denoising autoencoder (dA), which play an important role in unsupervised learning and data preprocessing.

Along the way, readers will pick up Theano and Python techniques, together with practical tips and tricks for training multi-layer models. The tutorials include hands-on material: how to test models, run the code, and configure and optimize different kinds of networks. Overall, Deep Learning is a textbook tailored to students, researchers, and software engineers, designed to help them understand and apply this cutting-edge technology, whether in industrial product development or in research projects.
Deep Learning Tutorial, Release 0.1
# NLL is a symbolic variable ; to get the actual value of NLL, this symbolic
# expression has to be compiled into a Theano function (see the Theano
# tutorial for more details)
NLL = -T.sum(T.log(p_y_given_x)[T.arange(y.shape[0]), y])
# note on syntax: T.arange(y.shape[0]) is a vector of integers [0,1,2,...,len(y)].
# Indexing a matrix M by the two vectors [0,1,...,K], [a,b,...,k] returns the
# elements M[0,a], M[1,b], ..., M[K,k] as a vector. Here, we use this
# syntax to retrieve the log-probability of the correct labels, y.
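As the comments note, NLL is only a symbolic expression until it is compiled. The following is a minimal sketch of how such an expression might be built and compiled, assuming a hypothetical softmax model with made-up dimensions (784 inputs, 10 classes); none of these names or shapes come from the tutorial itself:

import numpy
import theano
import theano.tensor as T

# symbolic inputs: a minibatch of examples and their integer labels
x = T.matrix('x')
y = T.ivector('y')

# hypothetical softmax model (784 inputs, 10 classes)
W = theano.shared(numpy.zeros((784, 10)), name='W')
b = theano.shared(numpy.zeros(10), name='b')
p_y_given_x = T.nnet.softmax(T.dot(x, W) + b)

# the symbolic NLL from above, compiled into a callable function
NLL = -T.sum(T.log(p_y_given_x)[T.arange(y.shape[0]), y])
nll = theano.function([x, y], NLL)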
3.4.2 Stochastic Gradient Descent
What is ordinary gradient descent? It is a simple algorithm in which we repeatedly make small steps downward on an error surface defined by a loss function of some parameters. For the purpose of ordinary gradient descent we consider that the training data is rolled into the loss function. The pseudocode of this algorithm can then be described as:
# GRADIENT DESCENT
while True:
    loss = f(params)
    d_loss_wrt_params = ... # compute gradient
    params -= learning_rate * d_loss_wrt_params
    if <stopping condition is met>:
        return params
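To make the pseudocode concrete, here is a runnable sketch in plain numpy that minimizes a toy quadratic loss; the loss function, learning rate, and stopping tolerance are all invented for the example:

import numpy

def f_and_grad(params):
    # toy quadratic loss (params - 3)^2 and its gradient
    loss = numpy.sum((params - 3.0) ** 2)
    return loss, 2.0 * (params - 3.0)

params = numpy.zeros(5)
learning_rate = 0.1
while True:
    loss, d_loss_wrt_params = f_and_grad(params)
    params -= learning_rate * d_loss_wrt_params
    if loss < 1e-8:  # stopping condition
        break
# every entry of params is now close to the minimizer 3.0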
Stochastic gradient descent (SGD) works according to the same principles as ordinary gradient descent, but
proceeds more quickly by estimating the gradient from just a few examples at a time instead of the entire
training set. In its purest form, we estimate the gradient from just a single example at a time.
# STOCHASTIC GRADIENT DESCENT
for (x_i, y_i) in training_set:
    # imagine an infinite generator
    # that may repeat examples (if there is only a finite training set)
    loss = f(params, x_i, y_i)
    d_loss_wrt_params = ... # compute gradient
    params -= learning_rate * d_loss_wrt_params
    if <stopping condition is met>:
        return params
The variant that we recommend for deep learning is a further twist on stochastic gradient descent using so-called "minibatches". Minibatch SGD works identically to SGD, except that we use more than one training example to make each estimate of the gradient. This technique reduces variance in the estimate of the gradient, and often makes better use of the hierarchical memory organization in modern computers.
for (x_batch, y_batch) in train_batches:
    # imagine an infinite generator
    # that may repeat examples
    loss = f(params, x_batch, y_batch)
    d_loss_wrt_params = ... # compute gradient using theano
    params -= learning_rate * d_loss_wrt_params
    if <stopping condition is met>:
        return params
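The train_batches sequence above is left abstract. One way to realize the "infinite generator that may repeat examples" is a simple slicing generator over numpy arrays, sketched here with an assumed batch size of 20 (the function name and batch size are illustrative choices, not part of the tutorial):

def iterate_minibatches(data_x, data_y, batch_size=20):
    # cycle through the dataset forever, yielding consecutive slices
    n = data_x.shape[0]
    while True:
        for start in range(0, n - batch_size + 1, batch_size):
            stop = start + batch_size
            yield data_x[start:stop], data_y[start:stop]

# usage: train_batches = iterate_minibatches(train_x, train_y)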
There is a tradeoff in the choice of the minibatch size B. The reduction of variance and the use of SIMD instructions help most when increasing B from 1 to 2, but the marginal improvement fades rapidly to nothing. With large B, time is wasted in reducing the variance of the gradient estimator; that time would be better spent on additional gradient steps. An optimal B is model-, dataset-, and hardware-dependent, and can be anywhere from 1 to perhaps several hundred. In the tutorial we set it to 20, but this choice is almost arbitrary (though harmless).
Note: If you are training for a fixed number of epochs, the minibatch size becomes important because it controls the number of updates made to your parameters. Training the same model for 10 epochs with a batch size of 1 yields completely different results from training for the same 10 epochs with a batch size of 20. Keep this in mind when switching between batch sizes, and be prepared to tweak all the other parameters according to the batch size used.
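As a worked example of why this matters, assume a hypothetical training set of 50,000 examples (a made-up size, not from the tutorial); over the same 10 epochs, batch size 1 performs twenty times as many parameter updates as batch size 20:

n_examples, n_epochs = 50000, 10  # assumed sizes for illustration
for batch_size in (1, 20):
    n_updates = n_epochs * (n_examples // batch_size)
    print(batch_size, n_updates)  # prints 1 500000, then 20 25000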
All the code blocks above show pseudocode of what the algorithm looks like. Implementing such an algorithm in Theano can be done as follows:
# Minibatch Stochastic Gradient Descent

# assume loss is a symbolic description of the loss function given
# the symbolic variables params (shared variable), x_batch, y_batch;

# compute gradient of loss with respect to params
d_loss_wrt_params = T.grad(loss, params)

# compile the MSGD step into a theano function
updates = [(params, params - learning_rate * d_loss_wrt_params)]
MSGD = theano.function([x_batch, y_batch], loss, updates=updates)

for (x_batch, y_batch) in train_batches:
    # here x_batch and y_batch are elements of train_batches and
    # therefore numpy arrays; function MSGD also updates the params
    print('Current loss is ', MSGD(x_batch, y_batch))
    if stopping_condition_is_met:
        return params
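Putting these pieces together, the following is a self-contained sketch of minibatch SGD for a small logistic regression on random data. The shapes, learning rate, batch size, and number of epochs are arbitrary illustrative choices, not values from the tutorial:

import numpy
import theano
import theano.tensor as T

rng = numpy.random.RandomState(0)
train_x = rng.randn(1000, 20).astype(theano.config.floatX)
train_y = rng.randint(0, 2, size=1000).astype('int32')

x_batch = T.matrix('x_batch')
y_batch = T.ivector('y_batch')
W = theano.shared(numpy.zeros((20, 2), dtype=theano.config.floatX), name='W')
b = theano.shared(numpy.zeros(2, dtype=theano.config.floatX), name='b')
params = [W, b]

# mean negative log-likelihood of the correct labels
p_y_given_x = T.nnet.softmax(T.dot(x_batch, W) + b)
loss = -T.mean(T.log(p_y_given_x)[T.arange(y_batch.shape[0]), y_batch])

# one gradient and one update rule per parameter
grads = T.grad(loss, params)
learning_rate = 0.1
updates = [(p, p - learning_rate * g) for p, g in zip(params, grads)]
MSGD = theano.function([x_batch, y_batch], loss, updates=updates)

batch_size = 20
for epoch in range(5):
    for start in range(0, train_x.shape[0], batch_size):
        MSGD(train_x[start:start + batch_size], train_y[start:start + batch_size])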
3.4.3 Regularization
There is more to machine learning than optimization. When we train our model from data we are trying
to prepare it to do well on new examples, not the ones it has already seen. The training loop above for
MSGD does not take this into account, and may overfit the training examples. A way to combat overfitting
is through regularization. There are several techniques for regularization; the ones we will explain here are
L1/L2 regularization and early-stopping.
L1 and L2 regularization
L1 and L2 regularization involve adding an extra term to the loss function, which penalizes certain parameter
configurations. Formally, if our loss function is:
$$\mathrm{NLL}(\theta, D) = - \sum_{i=0}^{|D|} \log P(Y = y^{(i)} \mid x^{(i)}, \theta)$$
then the regularized loss will be:
$$E(\theta, D) = \mathrm{NLL}(\theta, D) + \lambda R(\theta)$$
or, in our case
$$E(\theta, D) = \mathrm{NLL}(\theta, D) + \lambda \|\theta\|_p^p$$
where

$$\|\theta\|_p = \left( \sum_{j=0}^{|\theta|} |\theta_j|^p \right)^{1/p}$$

which is the L_p norm of θ. λ is a hyper-parameter which controls the relative importance of the regularization term. Commonly used values for p are 1 and 2, hence the L1/L2 nomenclature. If p = 2, the regularizer is also called "weight decay".
In principle, adding a regularization term to the loss will encourage smooth network mappings in a neural
network (by penalizing large values of the parameters, which decreases the amount of nonlinearity that
the network models). More intuitively, the two terms (NLL and R(θ)) correspond to modelling the data
well (NLL) and having “simple” or “smooth” solutions (R(θ)). Thus, minimizing the sum of both will, in
theory, correspond to finding the right trade-off between the fit to the training data and the “generality” of
the solution that is found. To follow Occam’s razor principle, this minimization should find us the simplest
solution (as measured by our simplicity criterion) that fits the training data.
Note that the fact that a solution is “simple” does not mean that it will generalize well. Empirically, it
was found that performing such regularization in the context of neural networks helps with generalization,
especially on small datasets. The code block below shows how to compute the loss in Python when it contains both an L1 regularization term weighted by λ_1 and an L2 regularization term weighted by λ_2:
# symbolic Theano variable that represents the L1 regularization term
L1 = T.sum(abs(param))

# symbolic Theano variable that represents the squared L2 term
L2_sqr = T.sum(param ** 2)

# the regularized loss
loss = NLL + lambda_1 * L1 + lambda_2 * L2_sqr
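A real model usually has several parameter tensors rather than a single param. The following is a hedged sketch of the same idea over a list of parameters; the shapes and regularization weights here are invented for illustration, and NLL is assumed to be a symbolic loss defined elsewhere:

import numpy
import theano
import theano.tensor as T

# hypothetical model parameters; in practice these come from your model
params = [theano.shared(numpy.ones((5, 5)), name='W'),
          theano.shared(numpy.ones(5), name='b')]
lambda_1, lambda_2 = 0.001, 0.0001  # assumed regularization weights

# sum each regularization term over every parameter in the model
L1 = sum(T.sum(abs(p)) for p in params)
L2_sqr = sum(T.sum(p ** 2) for p in params)
# regularized loss, given some symbolic NLL defined elsewhere:
# loss = NLL + lambda_1 * L1 + lambda_2 * L2_sqr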
Early-Stopping
Early-stopping combats overfitting by monitoring the model’s performance on a validation set. A validation
set is a set of examples that we never use for gradient descent, but which is also not a part of the test set. The
validation examples are considered to be representative of future test examples. We can use them during
training because they are not part of the test set. If the model’s performance ceases to improve sufficiently
on the validation set, or even degrades with further optimization, then the heuristic implemented here gives
up on much further optimization.
The choice of when to stop is a judgement call and a few heuristics exist, but these tutorials will make use
of a strategy based on a geometrically increasing amount of patience.
# early-stopping parameters
patience = 5000                # look at this many examples regardless
patience_increase = 2          # wait this much longer when a new best is
                               # found
improvement_threshold = 0.995  # a relative improvement of this much is
                               # considered significant
validation_frequency = min(n_train_batches, patience / 2)
                               # go through this many
                               # minibatches before checking the network
                               # on the validation set; in this case we
                               # check every epoch

best_params = None
best_validation_loss = numpy.inf
test_score = 0.
start_time = time.clock()

done_looping = False
epoch = 0
while (epoch < n_epochs) and (not done_looping):
    # Report "1" for first epoch, "n_epochs" for last epoch
    epoch = epoch + 1
    for minibatch_index in xrange(n_train_batches):

        d_loss_wrt_params = ... # compute gradient
        params -= learning_rate * d_loss_wrt_params # gradient descent

        # iteration number. We want it to start at 0.
        iter = (epoch - 1) * n_train_batches + minibatch_index
        # note that if we do `iter % validation_frequency` it will be
        # true for iter = 0 which we do not want. We want it true for
        # iter = validation_frequency - 1.
        if (iter + 1) % validation_frequency == 0:

            this_validation_loss = ... # compute zero-one loss on validation set

            if this_validation_loss < best_validation_loss:

                # improve patience if loss improvement is good enough
                if this_validation_loss < best_validation_loss * improvement_threshold:
                    patience = max(patience, iter * patience_increase)

                best_params = copy.deepcopy(params)
                best_validation_loss = this_validation_loss

        if patience <= iter:
            done_looping = True
            break

# POSTCONDITION:
# best_params refers to the best out-of-sample parameters observed during the optimization
If we run out of batches of training data before running out of patience, then we just go back to the beginning
of the training set and repeat.
Note: The validation_frequency should always be smaller than the patience. The code should check performance at least twice before running out of patience. This is the reason we used the formulation validation_frequency = min(value, patience/2).
Note: This algorithm could possibly be improved by using a test of statistical significance rather than the
simple comparison, when deciding whether to increase the patience.
3.4.4 Testing
After the loop exits, the best_params variable refers to the best-performing model on the validation set. If
we repeat this procedure for another model class, or even another random initialization, we should use the
same train/valid/test split of the data, and get other best-performing models. If we have to choose what the
best model class or the best initialization was, we compare the best_validation_loss for each model. When
we have finally chosen the model we think is the best (on validation data), we report that model’s test set
performance. That is the performance we expect on unseen examples.
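In code, this selection step can be a simple comparison over the recorded validation losses; here is a minimal sketch with made-up numbers for three model classes (none of these values come from the tutorial):

# hypothetical (best_validation_loss, test_score) pairs from three runs
runs = {'logreg': (0.085, 0.090), 'mlp': (0.042, 0.047), 'lenet': (0.011, 0.009)}
best_model = min(runs, key=lambda name: runs[name][0])  # choose on validation loss
print(best_model, runs[best_model][1])  # report only that model's test score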
3.4.5 Recap
That’s it for the optimization section. The technique of early-stopping requires us to partition the set of
examples into three sets (training D_train, validation D_valid, test D_test). The training set is used for minibatch
stochastic gradient descent on the differentiable approximation of the objective function. As we perform
this gradient descent, we periodically consult the validation set to see how our model is doing on the real
objective function (or at least our empirical estimate of it). When we see a good model on the validation set,
we save it. When it has been a long time since seeing a good model, we abandon our search and return the
best parameters found, for evaluation on the test set.
3.5 Theano/Python Tips
3.5.1 Loading and Saving Models
When you’re doing experiments, it can take hours (sometimes days!) for gradient-descent to find the best
parameters. You will want to save those weights once you find them. You may also want to save your
current-best estimates as the search progresses.
Pickle the numpy ndarrays from your shared variables
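A minimal sketch of that advice, assuming two hypothetical shared variables W and b standing in for a trained model's parameters: pickle the underlying parameter values, not the shared variables or compiled functions themselves.

import pickle
import numpy
import theano

# hypothetical shared variables standing in for a trained model's parameters
W = theano.shared(numpy.zeros((784, 10)), name='W')
b = theano.shared(numpy.zeros(10), name='b')

# save: pickle the underlying numpy ndarrays
with open('model_params.pkl', 'wb') as f:
    pickle.dump([W.get_value(), b.get_value()], f)

# load: read the ndarrays back into the shared variables
with open('model_params.pkl', 'rb') as f:
    W_value, b_value = pickle.load(f)
W.set_value(W_value)
b.set_value(b_value)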