methods, as well as a deeper look into their abilities to guarantee improvement in the underlying
expected risk R. We start to investigate these topics in the next subsection.
We remark in passing that the stochastic and batch approaches mentioned here have analogues
in the simulation and stochastic optimization communities, where they are referred to as stochastic
approximation (SA) and sample average approximation (SAA), respectively [60].
Inset 3.1: Herbert Robbins and Stochastic Approximation
The paper by Robbins and Monro [126] represents a landmark in the history of numerical
optimization methods. Together with the invention of back propagation [128, 129], it also
represents one of the most notable developments in the field of machine learning. The SG
method was first proposed in [126], not as a gradient method, but as a Markov chain.
Viewed more broadly, the works by Robbins and Monro [126] and Kalman [80] mark the
beginning of the field of stochastic approximation, which studies the behavior of iterative meth-
ods that use noisy signals. The initial focus on optimization led to the study of algorithms
that track the solution of the ordinary differential equation ˙w = −∇F (w). Stochastic approx-
imation theory has had a major impact in signal processing and in areas closer to the subject
of this paper, such as pattern recognition [2] and neural networks [18].
After receiving his PhD, Herbert Robbins became a lecturer at New York University, where
he co-authored with Richard Courant the popular book What is Mathematics? [36], which is
still in print after more than seven decades [37]. Robbins went on to become one of the
most prominent mathematicians of the second half of the twentieth century, known for his
contributions to probability, algebra, and graph theory.
3.3 Motivation for Stochastic Methods
Before discussing the strengths of stochastic methods such as SG, one should not lose sight of
the fact that batch approaches possess some intrinsic advantages. First, when one has reduced
the stochastic problem of minimizing the expected risk R to focus exclusively on the deterministic
problem of minimizing the empirical risk R_n, the use of full gradient information at each iterate
opens the door for many deterministic gradient-based optimization methods. That is, in a batch
approach, one has at their disposal the wealth of nonlinear optimization techniques that have been
developed over the past decades, including the full gradient method (3.8), but also accelerated
gradient, conjugate gradient, quasi-Newton, and inexact Newton methods [113]. (See §6 and §7 for
discussion of these techniques.) Second, due to the sum structure of R_n, a batch method can easily
benefit from parallelization since the bulk of the computation lies in evaluations of R_n and ∇R_n.
Calculations of these quantities can even be done in a distributed manner.
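To illustrate how the sum structure enables parallelism, the following sketch computes ∇R_n as a sum of independent per-chunk contributions. It assumes a hypothetical least-squares loss f_i(w) = ½(aᵢᵀw − bᵢ)², and all data and function names are illustrative, not part of the source.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

# Hypothetical least-squares losses: f_i(w) = 0.5 * (a_i^T w - b_i)^2,
# so grad f_i(w) = a_i * (a_i^T w - b_i) and grad R_n(w) = (1/n) * sum_i grad f_i(w).
rng = np.random.default_rng(0)
n, d = 1000, 5
A = rng.normal(size=(n, d))
b = rng.normal(size=n)

def partial_grad(w, idx):
    """Gradient contribution of the samples indexed by idx (one chunk)."""
    r = A[idx] @ w - b[idx]
    return A[idx].T @ r

def full_grad(w, workers=4):
    """grad R_n(w), assembled from independent chunks; each chunk could
    just as well live on a different machine."""
    chunks = np.array_split(np.arange(n), workers)
    with ThreadPoolExecutor(workers) as ex:
        parts = list(ex.map(lambda idx: partial_grad(w, idx), chunks))
    return sum(parts) / n

w = np.zeros(d)
g_parallel = full_grad(w)
g_serial = A.T @ (A @ w - b) / n   # reference computation, no chunking
assert np.allclose(g_parallel, g_serial)
```

Because the chunks interact only through a final sum, the same pattern carries over directly to distributed evaluation of R_n and ∇R_n.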
Despite these advantages, there are intuitive, practical, and theoretical reasons for following a
stochastic approach. Let us motivate them by contrasting the hallmark SG iteration (3.7) with the
full batch gradient iteration (3.8).
Intuitive Motivation On an intuitive level, SG employs information more efficiently than a
batch method. To see this, consider a situation in which a training set, call it S, consists of ten
copies of a set S_sub. A minimizer of empirical risk for the larger set S is clearly given by a minimizer
for the smaller set S_sub, but if one were to apply a batch approach to minimize R_n over S, then