Stochastic Gradient Descent Tricks: Efficient Strategies for Neural Network Training (Microsoft Research, 2012)


"Stochastic Gradient Descent Tricks是一篇由Léon Bottou在2012年发表于微软研究（Microsoft Research）的文章，着重介绍了在大型数据集上训练神经网络时，为何随机梯度下降（Stochastic Gradient Descent, SGD）是一种有效的学习算法。文章首先倡导使用随机反向传播（stochastic back-propagation），这是SGD的一个具体应用实例。 SGD的核心概念是，它在每次迭代中仅使用一小部分训练样本（通常是随机选择的）来更新模型参数，而非一次性处理整个数据集。这种方法的优势在于，当数据量庞大时，可以显著减少计算成本，避免内存限制，并加速训练过程。相比于批量梯度下降（Batch Gradient Descent），SGD对于在线学习（online learning）和大规模分布式环境非常适用。在文章的第二部分，作者解释了SGD的工作原理。每一步，模型基于单个或少数样本的梯度方向进行更新，这样可以及时捕获数据的局部特性，有助于模型更快地收敛到局部最优解。尽管全局最优解可能不被找到，但在很多情况下，SGD能够提供具有竞争力的性能。此外，文章提供了关于如何有效实施SGD的实用建议，包括学习率调整策略（如衰减学习率、动量法等）、模型正则化技术以及如何处理噪声数据等问题。作者强调了在实际应用中调整SGD参数的重要性，以适应特定任务和数据分布。 Stochastic Gradient Descent Tricks是一篇深入浅出的指南，不仅阐述了SGD的基本理论，还为在实际工程场景中优化神经网络训练过程提供了宝贵的实践指导。对于那些处理大规模数据和复杂模型的机器学习从业者来说，理解和掌握这些技巧至关重要。"

The convergence speed of stochastic gradient descent is in fact limited by the noisy approximation of the true gradient. When the learning rates decrease too slowly, the variance of the parameter estimate $w_t$ decreases equally slowly. When the learning rates decrease too quickly, the expectation of the parameter estimate $w_t$ takes a very long time to approach the optimum.

– When the Hessian matrix of the cost function at the optimum is strictly positive definite, the best convergence speed is achieved using learning rates $\gamma_t \sim t^{-1}$ (e.g. [14]). The expectation of the residual error then decreases with similar speed, that is, $E(\rho) \sim t^{-1}$. These theoretical convergence rates are frequently observed in practice.

– When we relax these regularity assumptions, the theory suggests slower asymptotic convergence rates, typically like $E(\rho) \sim t^{-1/2}$ (e.g., [28]). In practice, the convergence only slows down during the final stage of the optimization process. This may not matter in practice because one often stops the optimization before reaching this stage (see section 3.1).
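The effect of the learning rate schedule can be seen on a very small problem. The sketch below is an illustration under stated assumptions, not code from the paper: it takes the cost $Q(z, w) = \frac{1}{2}(w - z)^2$, whose Hessian is the constant 1, and estimates the mean of Gaussian samples. The function name `sgd_mean` and the two schedules are hypothetical choices. With a learning rate of $1/t$, the SGD iterate coincides with the running average of the samples seen so far, so its variance shrinks like $t^{-1}$. With a constant learning rate, the iterate keeps fluctuating around the optimum.

```python
import random

def sgd_mean(samples, schedule, w0=0.0):
    """Run SGD on Q(z, w) = 0.5 * (w - z)**2, whose gradient is (w - z)."""
    w = w0
    for t, z in enumerate(samples, start=1):
        w -= schedule(t) * (w - z)
    return w

rng = random.Random(0)
samples = [rng.gauss(1.0, 1.0) for _ in range(10_000)]

w_decay = sgd_mean(samples, lambda t: 1.0 / t)  # learning rate ~ 1/t
w_const = sgd_mean(samples, lambda t: 0.1)      # learning rate never decreases

sample_mean = sum(samples) / len(samples)
print(abs(w_decay - sample_mean))  # tiny: the 1/t iterate is the running mean
print(abs(w_const - sample_mean))  # typically much larger: the noise is not averaged out
```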

Second order stochastic gradient descent (2SGD) multiplies the gradients by a positive definite matrix $\Gamma_t$ approaching the inverse of the Hessian:

$$w_{t+1} = w_t - \gamma_t \Gamma_t \nabla_w Q(z_t, w_t). \tag{5}$$

Unfortunately, this modification does not reduce the stochastic noise and therefore does not significantly improve the variance of $w_t$. Although constants are improved, the expectation of the residual error still decreases like $t^{-1}$, that is, $E(\rho) \sim t^{-1}$ at best (e.g. [1], appendix).
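A single 2SGD update of the form given in equation (5) might be sketched as below. The two-dimensional quadratic cost, the fixed matrix `hessian_inverse`, and the helper names `matvec` and `sgd2_step` are assumptions made for illustration. The example uses an exact, noise-free gradient, in which case one rescaled step with a learning rate of 1 lands on the minimum. With a noisy stochastic gradient the same rescaling is applied to the noise as well, which is consistent with the remark above that the noise is not reduced.

```python
def matvec(m, v):
    """Multiply a matrix (given as a list of rows) by a vector."""
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]

def sgd2_step(w, grad, gamma, Gamma):
    """One 2SGD update: w <- w - gamma * (Gamma applied to grad)."""
    scaled = matvec(Gamma, grad)
    return [w_i - gamma * s_i for w_i, s_i in zip(w, scaled)]

# Hypothetical cost 0.5 * (4*w0**2 + 1*w1**2); its Hessian is diag(4, 1).
hessian_inverse = [[0.25, 0.0], [0.0, 1.0]]
w = [2.0, -3.0]
grad = [4.0 * w[0], 1.0 * w[1]]  # exact gradient at w, no sampling noise
w_next = sgd2_step(w, grad, gamma=1.0, Gamma=hessian_inverse)
print(w_next)
```

Passing the identity matrix as `Gamma` recovers the plain first-order SGD update.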

Therefore, as an optimization algorithm, stochastic gradient descent is asymptotically much slower than a typical batch algorithm. However, this is not the whole story...

3 When to use Stochastic Gradient Descent?

During the last decade, data sizes have grown faster than the speed of processors. In this context, the capabilities of statistical machine learning methods are limited by the computing time rather than the sample size. The analysis presented in this section shows that stochastic gradient descent performs very well in this context.

Use stochastic gradient descent when training time is the bottleneck.

3.1 The trade-offs of large scale learning

Let $f^* = \arg\min_f E(f)$ be the best possible prediction function. Since we seek the prediction function from a parametrized family of functions $\mathcal{F}$, let

