combine Hessian information from multiple subfunctions in a
much more natural and efficient way than previous work,
and avoids the requirement of large minibatches per up-
date step to accurately estimate the full Hessian. More-
over, we develop a novel method to maintain computa-
tional tractability and limit the memory requirements of
this quasi-Newton method in the face of high dimensional
optimization problems (large M). We do this by storing
and manipulating the subfunctions in a shared, adaptive
low dimensional subspace, determined by the recent his-
tory of the gradients and iterates.
Thus our optimization method can usefully estimate and
utilize powerful second-order information while simulta-
neously combatting two potential sources of computational
intractability: large numbers of subfunctions (large N) and
a high-dimensional optimization domain (large M). More-
over, the use of a second order approximation means that
minimal or no adjustment of hyperparameters is required.
We refer to the resulting algorithm as Sum of Functions
Optimizer (SFO). We demonstrate that the combination of
techniques and new ideas inherent in SFO results in faster
optimization on seven disparate example problems. Fi-
nally, we release the optimizer and the test suite as open
source Python and MATLAB packages.
2. Algorithm
Our goal is to combine the benefits of stochastic and quasi-
Newton optimization techniques. We first describe the gen-
eral procedure by which we optimize the parameters x.
We then describe the construction of the shared low di-
mensional subspace which makes the algorithm tractable
in terms of computational overhead and memory for large
problems. This is followed by a description of the BFGS
method by which an online Hessian approximation is main-
tained for each subfunction. Finally, we end this section
with a review of implementation details.
2.1. Approximating Functions
We define a series of functions $G^t(x)$ intended to approximate $F(x)$,
$$G^t(x) = \sum_{i=1}^{N} g_i^t(x), \qquad (3)$$
where the superscript t indicates the learning iteration.
Each $g_i^t(x)$ serves as a quadratic approximation to the cor-
responding $f_i(x)$. The functions $g_i^t(x)$ will be stored, and
one of them will be updated per learning step.
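As a concrete illustration (this is a minimal sketch with names of our own choosing, not the released SFO code), each stored quadratic approximation and their sum in Equation 3 can be written as:

```python
import numpy as np

class QuadraticModel:
    """Stored quadratic approximation g_i^t(x) to subfunction f_i, expanded
    around the point x0 where f_i was last evaluated:
    g(x) = f + (x - x0)^T grad + 0.5 (x - x0)^T hess (x - x0)."""
    def __init__(self, f, grad, hess, x0):
        self.f, self.grad, self.hess, self.x0 = f, grad, hess, x0

    def __call__(self, x):
        d = x - self.x0
        return self.f + d @ self.grad + 0.5 * d @ self.hess @ d

def G(models, x):
    """Approximating objective G^t(x) = sum_i g_i^t(x) (Equation 3)."""
    return sum(g(x) for g in models)
```

Because each term is quadratic, $G^t(x)$ is itself quadratic, which is what makes the exact minimization in the next section possible.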
2.2. Update Steps
As is illustrated in Figure 1, optimization is performed by
repeating the steps:
1. Choose a vector $x^t$ by minimizing the approximating
objective function $G^{t-1}(x)$,
$$x^t = \operatorname*{argmin}_x \, G^{t-1}(x). \qquad (4)$$
Since $G^{t-1}(x)$ is a sum of quadratic functions
$g_i^{t-1}(x)$, it can be exactly minimized by a Newton
step,
$$x^t = x^{t-1} - \eta^t \left(H^{t-1}\right)^{-1} \frac{\partial G^{t-1}\left(x^{t-1}\right)}{\partial x}, \qquad (5)$$
where $H^{t-1}$ is the Hessian of $G^{t-1}(x)$. The step
length $\eta^t$ is typically unity, and will be discussed in
Section 3.5.
2. Choose an index $j \in \{1 \ldots N\}$, and update the cor-
responding approximating subfunction $g_j^t(x)$ using a
second order power series around $x^t$, while leaving all
other subfunctions unchanged,
$$g_i^t(x) = \begin{cases} g_i^{t-1}(x) & i \neq j \\[4pt] f_i\left(x^t\right) + \left(x - x^t\right)^T f_i'\left(x^t\right) + \frac{1}{2}\left(x - x^t\right)^T H_i^t \left(x - x^t\right) & i = j \end{cases}. \qquad (6)$$
The constant and first order term in Equation 6 are set
by evaluating the subfunction and gradient, $f_j(x^t)$ and
$f_j'(x^t)$. The quadratic term $H_j^t$ is set by using the BFGS
algorithm to generate an online approximation to the true
Hessian of subfunction $j$ based on its history of gradient
evaluations (see Section 2.4). The Hessian of the summed
approximating function $G^t(x)$ in Equation 5 is the sum of
the Hessians for each $g_j^t(x)$, $H^t = \sum_j H_j^t$.
2.3. A Shared, Adaptive, Low-Dimensional
Representation
The dimensionality $M$ of $x \in \mathbb{R}^M$ is typically large. As
a result, the memory and computational cost of working
directly with the matrices $H_i^t \in \mathbb{R}^{M \times M}$ is typically pro-
hibitive, as is the cost of storing the history terms $\Delta f'$ and
$\Delta x$ required by BFGS (see Section 2.4). To reduce the
dimensionality from $M$ to a tractable value, all history is
instead stored and all updates computed in a lower dimen-
sional subspace, with dimensionality between $K_{\min}$ and
$K_{\max}$. This subspace is constructed such that it includes
the most recent gradient and position for every subfunc-
tion, and thus $K_{\min} \geq 2N$. This guarantees that the sub-
space includes both the steepest gradient descent direction
over the full batch, and the update directions from the most
recent Newton steps (Equation 5).
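One simple way to realize such a subspace (a batch QR sketch of our own, for illustration; an incremental construction would avoid re-orthonormalizing from scratch) is to stack the most recent iterate and gradient of every subfunction and orthonormalize them:

```python
import numpy as np

def build_subspace(recent_positions, recent_gradients):
    """Orthonormal basis P spanning the latest position and gradient of every
    subfunction, so up to K = 2N directions before deduplication."""
    V = np.column_stack(list(recent_positions) + list(recent_gradients))  # M x 2N
    Q, R = np.linalg.qr(V)
    keep = np.abs(np.diag(R)) > 1e-10   # drop numerically dependent directions
    return Q[:, keep]                    # P in R^{M x K}, K <= 2N

def to_subspace(P, v):
    """Store a full-dimensional vector as its K-dimensional representation."""
    return P.T @ v
```

History vectors and Hessian factors kept in this representation cost $O(K)$ and $O(K^2)$ memory rather than $O(M)$ and $O(M^2)$.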