Course notes on
Optimization for Machine Learning
Gabriel Peyré
CNRS & DMA
École Normale Supérieure
gabriel.peyre@ens.fr
https://mathematical-tours.github.io
www.numerical-tours.com
May 9, 2020
Abstract
This document presents first order optimization methods and their applications to machine learning.
This is not a course on machine learning (in particular it does not cover modeling and statistical consid-
erations) and it is focussed on the use and analysis of cheap methods that can scale to large datasets and
models with lots of parameters. These methods are variations around the notion of “gradient descent”,
so that the computation of gradients plays a major role. This course covers basic theoretical properties
of optimization problems (in particular convex analysis and first order differential calculus), the gradient
descent method, the stochastic gradient method, automatic differentiation, shallow and deep networks.
Contents
1 Motivation in Machine Learning
  1.1 Unconstrained optimization
  1.2 Regression
  1.3 Classification
2 Basics of Convex Analysis
  2.1 Existence of Solutions
  2.2 Convexity
  2.3 Convex Sets
3 Derivative and gradient
  3.1 Gradient
  3.2 First Order Conditions
  3.3 Least Squares
  3.4 Link with PCA
  3.5 Classification
  3.6 Chain Rule
4 Gradient Descent Algorithm
  4.1 Steepest Descent Direction
  4.2 Gradient Descent
5 Convergence Analysis
  5.1 Quadratic Case
  5.2 General Case
6 Regularization
  6.1 Penalized Least Squares
  6.2 Ridge Regression
  6.3 Lasso
  6.4 Iterative Soft Thresholding
7 Stochastic Optimization
  7.1 Minimizing Sums and Expectation
  7.2 Batch Gradient Descent (BGD)
  7.3 Stochastic Gradient Descent (SGD)
  7.4 Stochastic Gradient Descent with Averaging (SGA)
  7.5 Stochastic Averaged Gradient Descent (SAG)
8 Multi-Layers Perceptron
  8.1 MLP and its derivative
  8.2 MLP and Gradient Computation
  8.3 Universality
9 Automatic Differentiation
  9.1 Computational Graphs
  9.2 Forward Mode of Automatic Differentiation
  9.3 Reverse Mode of Automatic Differentiation
  9.4 Feed-forward Compositions
  9.5 Feed-forward Architecture
  9.6 Recurrent Architectures
1 Motivation in Machine Learning
1.1 Unconstrained optimization
In most of this chapter, we consider unconstrained convex optimization problems of the form
$$ \inf_{x \in \mathbb{R}^p} f(x), \tag{1} $$
and try to devise "cheap" algorithms with a low computational cost per iteration to approximate a minimizer
when it exists. The class of algorithms considered is first order, i.e. they make use of gradient information.
In the following, we denote
$$ \operatorname{argmin}_{x} f(x) \overset{\text{def.}}{=} \{ x \in \mathbb{R}^p \,;\, f(x) = \inf f \}, $$
the set of points (it is not necessarily a singleton since the minimizer might be non-unique) that
achieve the minimum of the function f. One might have $\operatorname{argmin} f = \emptyset$ (this situation is discussed below),
but in case a minimizer exists, we denote the optimization problem as
$$ \min_{x \in \mathbb{R}^p} f(x). \tag{2} $$
In a typical learning scenario, f(x) is the empirical risk for regression or classification, and p is the number
of parameters. For instance, in the simplest case of linear models, we denote by $(a_i, y_i)_{i=1}^n$ the data, where the
$a_i \in \mathbb{R}^p$ are the features. In the following, we denote by $A \in \mathbb{R}^{n \times p}$ the matrix whose rows are the $a_i$.

Figure 1: Left: linear regression, middle: linear classifier, right: loss function for classification.
Figure 2: Left: non-existence of minimizer, middle: multiple minimizers, right: uniqueness.
1.2 Regression
For regression, $y_i \in \mathbb{R}$, in which case
$$ f(x) = \frac{1}{2} \sum_{i=1}^{n} \big( y_i - \langle x, a_i \rangle \big)^2 = \frac{1}{2} \| A x - y \|^2, \tag{3} $$
is the least squares quadratic risk function (see Fig. 1). Here $\langle u, v \rangle = \sum_{i=1}^{p} u_i v_i$ is the canonical inner product
in $\mathbb{R}^p$ and $\| \cdot \|^2 = \langle \cdot, \cdot \rangle$.
1.3 Classification
For classification, $y_i \in \{-1, 1\}$, in which case
$$ f(x) = \sum_{i=1}^{n} \ell\big( -y_i \langle x, a_i \rangle \big) = L(-\operatorname{diag}(y) A x) \tag{4} $$
where $\ell$ is a smooth approximation of the 0-1 loss $1_{\mathbb{R}^+}$. For instance $\ell(u) = \log(1 + \exp(u))$, and $\operatorname{diag}(y) \in \mathbb{R}^{n \times n}$ is the diagonal matrix with the $y_i$ along the diagonal (see Fig. 1, right). Here the separable loss function
$L : \mathbb{R}^n \to \mathbb{R}$ is, for $z \in \mathbb{R}^n$, $L(z) = \sum_i \ell(z_i)$.
2 Basics of Convex Analysis
2.1 Existence of Solutions
In general, there might be no solution to the optimization problem (1). This is of course the case if f is unbounded
from below, for instance $f(x) = -x^2$, in which case the value of the infimum is $-\infty$. But this might also
happen if f does not grow at infinity, for instance $f(x) = e^{-x}$, for which $\inf f = 0$ but there is no minimizer.
In order to show existence of a minimizer, and that the set of minimizers is bounded (otherwise one can
have problems with optimization algorithms that could escape to infinity), one needs to show that one can
replace the whole space $\mathbb{R}^p$ by a compact subset $\Omega \subset \mathbb{R}^p$ (i.e. $\Omega$ is bounded and closed) and that f is
continuous on $\Omega$ (one can replace this by a weaker condition, that f is lower semi-continuous, but we ignore
this here). A way to show that one can consider only a bounded set is to show that $f(x) \to +\infty$ when

Figure 3: Coercivity condition for least squares.
Figure 4: Convex vs. non-convex functions ; Strictly convex vs. non strictly convex functions.
$\|x\| \to +\infty$. Such a function is called coercive. In this case, one can choose any $x_0 \in \mathbb{R}^p$ and consider its
associated lower level set
$$ \Omega = \{ x \in \mathbb{R}^p \,;\, f(x) \leqslant f(x_0) \} $$
which is bounded because of coercivity, and closed because f is continuous. One can actually show that for
a convex function, having a bounded set of minimizers is equivalent to the function being coercive (this is not
the case for non-convex functions; for instance $f(x) = \min(1, x^2)$ has a single minimizer but is not coercive).
Example 1 (Least squares). For instance, for the quadratic loss function $f(x) = \frac{1}{2} \|Ax - y\|^2$, coercivity holds
if and only if $\ker(A) = \{0\}$ (this corresponds to the overdetermined setting). Indeed, if $\ker(A) \neq \{0\}$ and $x^\star$
is a solution, then $x^\star + u$ is also a solution for any $u \in \ker(A)$, so that the set of minimizers is unbounded. On
the contrary, if $\ker(A) = \{0\}$, we will show later that the minimizer is unique, see Fig. 3. If $\ell$ is strictly
convex, the same conclusion holds in the case of classification.
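Since $\ker(A) = \{0\}$ is equivalent to A having full column rank, the condition of Example 1 can be checked numerically; the sketch below (not from the notes, assuming NumPy) contrasts a generic overdetermined matrix with an underdetermined one.

```python
import numpy as np

rng = np.random.default_rng(0)

# Overdetermined case: n > p with generic Gaussian entries,
# so A has full column rank, i.e. ker(A) = {0}.
A_over = rng.standard_normal((50, 3))
assert np.linalg.matrix_rank(A_over) == 3

# Underdetermined case: p > n forces a non-trivial kernel,
# so the set of minimizers of ||Ax - y||^2 is unbounded.
A_under = rng.standard_normal((3, 50))
assert np.linalg.matrix_rank(A_under) < 50
```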
2.2 Convexity
Convex functions define the main class of functions which are somehow "simple" to optimize, in the sense
that all minimizers are global minimizers, and that there are often efficient methods to find these minimizers
(at least for smooth convex functions). A convex function is such that for any pair of points $(x, y) \in (\mathbb{R}^p)^2$,
$$ \forall t \in [0, 1], \quad f((1-t)x + ty) \leqslant (1-t) f(x) + t f(y) \tag{5} $$
which means that the function is below its secant (and actually also above its tangent when this is well
defined), see Fig. 4. If $x^\star$ is a local minimizer of a convex f, then $x^\star$ is a global minimizer, i.e. $x^\star \in \operatorname{argmin} f$.
Convex functions are very convenient because they are stable under lots of transformations. In particular,
if f, g are convex and a, b are positive, then $af + bg$ is convex (the set of convex functions is itself an infinite
dimensional convex cone!) and so is $\max(f, g)$. If $g : \mathbb{R}^q \to \mathbb{R}$ is convex and $B \in \mathbb{R}^{q \times p}$, $b \in \mathbb{R}^q$, then
$f(x) = g(Bx + b)$ is convex. This shows immediately that the square loss appearing in (3) is convex, since
$\| \cdot \|^2 / 2$ is convex (as a sum of squares). Similarly, if $\ell$ (and hence L) is convex, then the classification
loss function (4) is itself convex.
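The secant inequality (5) can be probed numerically for the square loss of (3); this quick check (not from the notes, assuming NumPy) samples random pairs of points and random $t \in [0, 1]$, with a small floating-point tolerance.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, 4
A = rng.standard_normal((n, p))
y = rng.standard_normal(n)
f = lambda x: 0.5 * np.linalg.norm(A @ x - y) ** 2  # the square loss (3)

# Check the secant inequality (5) on random pairs (x1, x2) and random t.
for _ in range(100):
    x1, x2 = rng.standard_normal(p), rng.standard_normal(p)
    t = rng.uniform()
    lhs = f((1 - t) * x1 + t * x2)
    rhs = (1 - t) * f(x1) + t * f(x2)
    assert lhs <= rhs + 1e-9  # small slack for floating-point round-off
```

Of course a finite sample cannot prove convexity; it only fails to falsify it, which is why the notes rely on the structural argument (composition of a convex function with an affine map) instead.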

Figure 5: Comparison of convex functions $f : \mathbb{R}^p \to \mathbb{R}$ (for p = 1) and convex sets $C \subset \mathbb{R}^p$ (for p = 2).
Strict convexity. When f is convex, one can strengthen the condition (5) and impose that the inequality
is strict for $t \in \,]0, 1[$ (see Fig. 4, right), i.e.
$$ \forall t \in \,]0, 1[, \quad f((1-t)x + ty) < (1-t) f(x) + t f(y). \tag{6} $$
In this case, if a minimizer $x^\star$ exists, then it is unique. Indeed, if $x_1^\star \neq x_2^\star$ were two different minimizers, one
would have by strict convexity $f\big( \frac{x_1^\star + x_2^\star}{2} \big) < \frac{1}{2} f(x_1^\star) + \frac{1}{2} f(x_2^\star) = f(x_1^\star)$, which is impossible.
Example 2 (Least squares). For the quadratic loss function $f(x) = \frac{1}{2} \|Ax - y\|^2$, strict convexity is equivalent
to $\ker(A) = \{0\}$. Indeed, we see later that its second derivative is $\partial^2 f(x) = A^\top A$ and that strict convexity
is implied by the eigenvalues of $A^\top A$ being strictly positive. Since the eigenvalues of $A^\top A$ are always non-negative,
this is equivalent to $\ker(A^\top A) = \{0\}$ (no vanishing eigenvalue), and $A^\top A z = 0$ implies $\langle A^\top A z, z \rangle = \|Az\|^2 = 0$, i.e.
$z \in \ker(A)$, so that $\ker(A^\top A) = \ker(A)$.
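The eigenvalue characterization of Example 2 is easy to verify numerically; the sketch below (not part of the notes, assuming NumPy) computes the spectrum of $A^\top A$ for a generic overdetermined matrix and confirms it is strictly positive.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 3))       # generic n > p matrix: ker(A) = {0}

# A^T A is symmetric, so eigvalsh applies; its eigenvalues are real and >= 0.
eigvals = np.linalg.eigvalsh(A.T @ A)

assert np.all(eigvals >= -1e-10)       # positive semi-definite in all cases
assert eigvals.min() > 0               # strictly positive <=> ker(A) = {0}
```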
2.3 Convex Sets
A set Ω ⊂ R
p
is said to be convex if for any (x, y) ∈ Ω
2
, (1 − t)x + ty ∈ Ω for t ∈ [0, 1]. The
connexion between convex function and convex sets is that a function f is convex if and only if its epigraph
epi(f)
def.
=
(x, t) ∈ R
p+1
; t > f (x)
is a convex set.
Remark 1 (Convexity of the set of minimizers). In general, minimizers $x^\star$ might be non-unique, as shown on
Figure 3. When f is convex, the set $\operatorname{argmin}(f)$ of minimizers is itself a convex set. Indeed, if $x_1^\star$ and $x_2^\star$ are
minimizers, so that in particular $f(x_1^\star) = f(x_2^\star) = \min(f)$, then $f((1-t)x_1^\star + t x_2^\star) \leqslant (1-t) f(x_1^\star) + t f(x_2^\star) = f(x_1^\star) = \min(f)$, so that $(1-t)x_1^\star + t x_2^\star$ is itself a minimizer. Figure 5 shows convex and non-convex sets.
3 Derivative and gradient
3.1 Gradient
If f is differentiable along each axis, we denote
$$ \nabla f(x) \overset{\text{def.}}{=} \Big( \frac{\partial f(x)}{\partial x_1}, \ldots, \frac{\partial f(x)}{\partial x_p} \Big)^\top \in \mathbb{R}^p $$
the gradient vector, so that $\nabla f : \mathbb{R}^p \to \mathbb{R}^p$ is a vector field. Here the partial
derivatives (when they exist) are defined as
$$ \frac{\partial f(x)}{\partial x_k} \overset{\text{def.}}{=} \lim_{\eta \to 0} \frac{f(x + \eta \delta_k) - f(x)}{\eta} $$
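The limit definition of the partial derivatives above can be approximated by a finite difference with a small step $\eta$; the sketch below (not part of the notes, assuming NumPy) compares this approximation, computed along each canonical direction $\delta_k$, against the closed-form gradient $A^\top(Ax - y)$ of the least squares loss, which the notes derive later in Section 3.3.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 30, 4
A = rng.standard_normal((n, p))
y = rng.standard_normal(n)
f = lambda x: 0.5 * np.linalg.norm(A @ x - y) ** 2

x = rng.standard_normal(p)
eta = 1e-6  # small step in the finite-difference quotient

# Finite-difference partial derivatives along each canonical direction delta_k
g_fd = np.array([
    (f(x + eta * np.eye(p)[k]) - f(x)) / eta for k in range(p)
])

# Closed-form gradient of the least squares loss (derived later in the notes)
g_exact = A.T @ (A @ x - y)

assert np.allclose(g_fd, g_exact, atol=1e-3)
```

The one-sided difference has an error of order $\eta$ times the curvature, hence the loose tolerance; a centered difference $(f(x + \eta \delta_k) - f(x - \eta \delta_k))/(2\eta)$ would be more accurate.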