S2GD: An Optimization Analysis of Semi-Stochastic Gradient Descent in Deep Learning


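"Semi-Stochastic Gradient Descent Methods" describes an optimization algorithm used in deep learning, reinforcement learning and machine learning that combines the advantages of full gradients and stochastic gradients. It was proposed by Jakub Konečný and Peter Richtárik in a 2015 paper.

Semi-Stochastic Gradient Descent (S2GD) is an efficient method for minimizing the average of a large number of smooth convex loss functions. Training in deep learning, reinforcement learning and machine learning typically means optimizing many parameters by iterating over the losses of many samples. Classical Gradient Descent (GD) computes the gradient over all samples at every iteration, whereas Stochastic Gradient Descent (SGD) uses one randomly chosen sample, or a small batch, to reduce the per-iteration cost.

In each epoch, S2GD computes one full gradient followed by a random number of stochastic gradients, where that number follows a geometric law. The key result is that the method outputs an ε-accurate solution in expectation with less total work. Measured in evaluations of a single loss gradient, the work is O((n + κ) log(1/ε)), which is O((1 + κ/n) log(1/ε)) passes over the data, where n is the number of samples, κ is the condition number and ε is the target accuracy. The condition number κ = L/µ, the ratio of the smoothness constant to the strong convexity constant, measures how hard the objective is to optimize: a small κ means an easier problem, a large κ a harder one. S2GD reaches this bound by running O(log(1/ε)) epochs, each consisting of one full gradient evaluation and O(κ) stochastic gradient evaluations. When S2GD is limited to a single epoch, it needs O((κ/ε) log(1/ε)) stochastic gradient evaluations; compared with related variance-reduced methods such as SVRG and SAGA, its main advantage lies in the multi-epoch setting.

An important property of S2GD is that it balances the use of full and stochastic gradients, which lowers the computational cost while keeping a good convergence rate. This makes it more efficient than plain GD or plain SGD on large datasets. Because S2GD contains SVRG as a special case, it inherits some of SVRG's strengths, such as fast convergence and good scaling to large datasets.

In short, Semi-Stochastic Gradient Descent is an optimization technique that improves the efficiency of gradient descent by combining full and stochastic gradients, and it is particularly suited to modern machine learning models trained on large amounts of data. By controlling how often and in what form gradients are computed, it reduces computational cost at a given accuracy and therefore speeds up training.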
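The epoch structure described above can be illustrated with a minimal Python sketch; it is not the authors' implementation. It assumes the SVRG-style variance-reduced step y ← y − h(g + ∇f_i(y) − ∇f_i(x)) and draws the inner-loop length t_j with probability proportional to (1 − νh)^(m − t) for t = 1, ..., m. The function name `s2gd`, the ridge-regression test problem and all parameter values are hypothetical choices made for illustration.

```python
import numpy as np

def s2gd(grad_i, n, x0, h, m, nu, epochs, rng):
    """Sketch of S2GD: per epoch, one full gradient plus a random number of stochastic steps."""
    x = np.asarray(x0, dtype=float).copy()
    # Assumed law for the inner-loop length: P(t_j = t) proportional to (1 - nu*h)^(m - t)
    t_values = np.arange(1, m + 1)
    weights = (1.0 - nu * h) ** (m - t_values)
    probs = weights / weights.sum()
    for _ in range(epochs):
        # Full gradient at the epoch's reference point x (n component gradients)
        g = np.mean([grad_i(i, x) for i in range(n)], axis=0)
        y = x.copy()
        t_j = int(rng.choice(t_values, p=probs))
        for _ in range(t_j):
            i = rng.integers(n)
            # Assumed SVRG-style variance-reduced step
            y = y - h * (g + grad_i(i, y) - grad_i(i, x))
        x = y
    return x

# Hypothetical test problem: ridge-regularized least squares,
# f_i(x) = 0.5 * (a_i . x - b_i)^2 + 0.5 * lam * ||x||^2
rng = np.random.default_rng(0)
n, d, lam = 200, 5, 0.1
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def grad_i(i, x):
    return A[i] * (A[i] @ x - b[i]) + lam * x

# Smoothness constant of the individual losses and the closed-form minimizer
L = np.max(np.sum(A**2, axis=1)) + lam
x_star = np.linalg.solve(A.T @ A / n + lam * np.eye(d), A.T @ b / n)

x0 = np.zeros(d)
x_hat = s2gd(grad_i, n, x0, h=0.1 / L, m=3 * n, nu=lam, epochs=50, rng=rng)
err0 = np.linalg.norm(x0 - x_star)
err = np.linalg.norm(x_hat - x_star)
print(f"initial error {err0:.3e}, final error {err:.3e}")
```

Setting `nu=0` makes the inner-loop length uniform on 1, ..., m, while a larger `nu` shifts probability toward longer inner loops.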

Note that for all j, the expected number of iterations of the inner loop, E(t_j), is equal to

$$\xi = \xi(m, h) \overset{\text{def}}{=} \frac{\sum_{t=1}^{m} t\,(1 - \nu h)^{m-t}}{\sum_{t=1}^{m} (1 - \nu h)^{m-t}}. \tag{5}$$

Also note that ξ ∈ [(m+1)/2, m), with the lower bound attained for ν = 0, and the upper bound for νh → 1.
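As a quick numerical check of the stated range for ξ, the following sketch evaluates the expectation directly. It assumes t_j takes the value t with probability proportional to (1 − νh)^(m − t) for t = 1, ..., m; the function name `expected_inner_steps` is hypothetical.

```python
def expected_inner_steps(m, h, nu):
    """E(t_j), assuming P(t_j = t) is proportional to (1 - nu*h)**(m - t) for t = 1..m."""
    ts = range(1, m + 1)
    weights = [(1.0 - nu * h) ** (m - t) for t in ts]
    return sum(t * w for t, w in zip(ts, weights)) / sum(weights)

m = 10
xi_lower = expected_inner_steps(m, h=0.1, nu=0.0)    # nu = 0: uniform law on 1..m
xi_upper = expected_inner_steps(m, h=0.999, nu=1.0)  # nu*h close to 1: mass concentrates on t = m
print(xi_lower, xi_upper)
```

With ν = 0 every t is equally likely, so the mean is (m + 1)/2; as νh approaches 1 the weight of t = m dominates and the mean approaches, but never reaches, m.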

2.2 S2GD+

We also implement Algorithm 2, which we call S2GD+. In our experiments, the performance of this method is superior to all methods we tested, including S2GD. However, we do not analyze the complexity of this method and leave this as an open problem.

Algorithm 2 S2GD+

parameters: α ≥ 1 (e.g., α = 1)

1. Run SGD for a single pass over the data (i.e., n iterations); output x

2. Starting from x_0 = x, run a version of S2GD in which t_j = αn for all j

In brief, S2GD+ starts by running SGD for 1 epoch (1 pass over the data) and then switches to a variant of S2GD in which the number of the inner iterations, t_j, is not random, but fixed to be n or a small multiple of n.

The motivation for this method is the following. It is common knowledge that SGD is able to progress much more in one pass over the data than GD (where this would correspond to a single gradient step). However, the very first step of S2GD is the computation of the full gradient of f. Hence, by starting with a single pass over data using SGD and then switching to S2GD, we obtain a superior method in practice.
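The two steps of S2GD+ can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the SGD pass uses a constant stepsize `h_sgd`, the S2GD phase uses the SVRG-style step y ← y − h(g + ∇f_i(y) − ∇f_i(x)), and the function name `s2gd_plus`, the ridge-regression problem and all parameter values are hypothetical.

```python
import numpy as np

def s2gd_plus(grad_i, n, x0, h_sgd, h, alpha, epochs, rng):
    """Sketch of S2GD+: one SGD pass, then S2GD with fixed inner-loop length alpha * n."""
    x = np.asarray(x0, dtype=float).copy()
    # Step 1: a single pass of plain SGD (n iterations); constant stepsize is an assumption
    for _ in range(n):
        i = rng.integers(n)
        x = x - h_sgd * grad_i(i, x)
    # Step 2: S2GD variant with t_j = alpha * n for every epoch j
    t_fixed = int(alpha * n)
    for _ in range(epochs):
        g = np.mean([grad_i(i, x) for i in range(n)], axis=0)  # full gradient
        y = x.copy()
        for _ in range(t_fixed):
            i = rng.integers(n)
            # Assumed SVRG-style variance-reduced step
            y = y - h * (g + grad_i(i, y) - grad_i(i, x))
        x = y
    return x

# Hypothetical test problem: ridge-regularized least squares
rng = np.random.default_rng(1)
n, d, lam = 200, 5, 0.1
A = rng.normal(size=(n, d))
b = A @ rng.normal(size=d) + 0.1 * rng.normal(size=n)

def grad_i(i, x):
    return A[i] * (A[i] @ x - b[i]) + lam * x

L = np.max(np.sum(A**2, axis=1)) + lam
x_star = np.linalg.solve(A.T @ A / n + lam * np.eye(d), A.T @ b / n)

x0 = np.zeros(d)
x_plus = s2gd_plus(grad_i, n, x0, h_sgd=0.1 / L, h=0.1 / L, alpha=1, epochs=40, rng=rng)
err0_plus = np.linalg.norm(x0 - x_star)
err_plus = np.linalg.norm(x_plus - x_star)
print(f"initial error {err0_plus:.3e}, final error {err_plus:.3e}")
```

The SGD pass gives the first full-gradient computation a better reference point than x0, which is the motivation stated in the text.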

3 Summary of Results

In this section we summarize some of the main results and contributions of this work.

1. Complexity for strongly convex f. If f is strongly convex, S2GD needs

W = O((n + κ) log(1/ε)) (6)

work (measured as the total number of evaluations of the stochastic gradient, accounting for the full gradient evaluations as well) to output an ε-approximate solution (in expectation or in high probability), where κ = L/µ is the condition number. This is achieved by running S2GD with stepsize h = O(1/L), j = O(log(1/ε)) epochs (this is also equal to the number of full gradient evaluations) and m = O(κ) (this is also roughly equal to the number of stochastic gradient evaluations in a single epoch). The complexity results are stated in detail in Sections 4 and 5 (see Theorems 4, 5 and 6; see also (27) and (26)).
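The way the bound in (6) is assembled, epochs times (n + m), can be made concrete with a rough order-of-magnitude calculation. The sketch below sets every hidden constant to 1, which is an assumption (the O-notation does not fix them), and the function names are hypothetical. For comparison it also evaluates the classical O(nκ log(1/ε)) bound for gradient descent, which is not taken from this excerpt.

```python
import math

def s2gd_work(n, kappa, eps, c_epochs=1.0, c_m=1.0):
    """Rough work count for S2GD: epochs * (n + m), with hypothetical constants."""
    epochs = math.ceil(c_epochs * math.log(1.0 / eps))  # j = O(log(1/eps))
    m = math.ceil(c_m * kappa)                          # m = O(kappa)
    return epochs * (n + m)                             # one full gradient plus up to m steps

def gd_work(n, kappa, eps, c=1.0):
    """Rough work count for gradient descent: O(kappa * log(1/eps)) iterations of cost n."""
    iterations = math.ceil(c * kappa * math.log(1.0 / eps))
    return iterations * n

w_s2gd = s2gd_work(n=1000, kappa=100, eps=1e-3)
w_gd = gd_work(n=1000, kappa=100, eps=1e-3)
print(w_s2gd, w_gd)
```

The point of the comparison is the dependence on n and κ: S2GD's work grows with their sum, whereas the gradient-descent bound grows with their product.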

Using a single pass of SGD as an initialization strategy was already considered in [11]. However, the authors claim that their implementation of vanilla SAG did not benefit from it. S2GD does benefit from such an initialization due to it starting, in theory, with a (heavy) full gradient computation.
