Define the class of functions on the unit Euclidean ball in $\mathbb{R}^d$,
\[
\mathcal{F}_B = \left\{ f(\cdot\,;\theta) : \|W_i\|_F \le B \right\},
\]
where $\|W_i\|_F$ denotes the Frobenius norm of $W_i$. Then we have
\[
\mathcal{R}_n(\mathcal{F}_B) \lesssim \frac{\sqrt{L}\, B^L}{\sqrt{n}}.
\]
This result is from [GRS18], which also shows that it is possible to remove the $\sqrt{L}$ factor at the cost of a worse dependence on $n$. See also [NTS15].
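As a quick illustration of what this bound implies (with a target accuracy $\epsilon$ introduced here purely for the sake of the calculation), rearranging gives
\[
\frac{\sqrt{L}\, B^L}{\sqrt{n}} \le \epsilon
\quad\Longleftrightarrow\quad
n \ge \frac{L\, B^{2L}}{\epsilon^2},
\]
so the sample size needed to drive this complexity bound below $\epsilon$ grows exponentially with the depth $L$ whenever $B > 1$.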
2.9 The mismatch between benign overfitting and uniform convergence
It is instructive to consider the implications of the generalization bounds we have reviewed in this section for the phenomenon of benign overfitting, which has been observed in deep learning. For concreteness, suppose that $\ell$ is the quadratic loss. Consider a neural network function $\widehat f \in \mathcal{F}$ chosen so that $\widehat L(\widehat f) = 0$. For an appropriate complexity hierarchy $\mathcal{F} = \bigcup_r \mathcal{F}_r$, suppose that $\widehat f$ is chosen to minimize the complexity $r(\widehat f)$, defined as the smallest $r$ for which $\widehat f \in \mathcal{F}_r$, subject to the interpolation constraint $\widehat L(\widehat f) = 0$. What do the bounds based on uniform convergence imply about the excess risk $L(\widehat f) - \inf_{f \in \mathcal{F}} L(f)$ of this minimum-complexity interpolant?
Theorems 2.9, 2.10, and 2.11 imply upper bounds on risk in terms of various notions of scale of network parameters. For these bounds to be meaningful for a given probability distribution, there must be an interpolating $\widehat f$ for which the complexity $r(\widehat f)$ grows suitably slowly with the sample size $n$, so that the excess risk bounds converge to zero.
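To make "suitably slowly" concrete (using the norm-bounded classes $\mathcal{F}_B$ from the bound above as the hierarchy, with the index $r$ identified with the norm bound $B$; this identification is made here only for illustration), suppose the interpolant found at sample size $n$ lies in $\mathcal{F}_{B_n}$. The corresponding complexity bound is of order $\sqrt{L}\, B_n^L/\sqrt{n}$, which tends to zero precisely when
\[
B_n = o\!\left(n^{1/(2L)}\right),
\]
so the weight norms of the interpolating network may grow with $n$, but only at this slow polynomial rate.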
An easy example is when there is an $f^* \in \mathcal{F}_r$ with $L(f^*) = 0$, where $r$ is a fixed complexity. Notice that this implies not just that the conditional expectation is in $\mathcal{F}_r$, but that there is no noise, that is, almost surely $y = f^*(x)$. In that case, if we choose $\widehat f$ as the minimum-complexity interpolant satisfying $\widehat L(\widehat f) = 0$, then its complexity will certainly satisfy $r(\widehat f) \le r(f^*) = r$. And then, as the sample size $n$ increases, $L(\widehat f)$ will approach zero. In fact, since $\widehat L(\widehat f) = 0$, Theorem 2.2 implies a faster rate in this case: $L(\widehat f) = O\big((\log n)^4\, \bar{\mathcal{R}}_n^2(\mathcal{F}_r)\big)$.
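To see why this is a faster rate, suppose for illustration that the worst-case Rademacher complexity satisfies $\bar{\mathcal{R}}_n(\mathcal{F}_r) \le C_r/\sqrt{n}$ for some constant $C_r$ depending only on $r$ (the scaling exhibited, for instance, by the norm-based bound above). Then the bound reads
\[
L(\widehat f) = O\!\left(\frac{(\log n)^4\, C_r^2}{n}\right),
\]
a rate of order $1/n$ up to logarithmic factors, rather than the $1/\sqrt{n}$ rate obtained from a Rademacher complexity bound without the interpolation condition.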
Theorem 2.3 shows that if we balance the complexity with the fit to the training data, then we can hope to enjoy excess risk as good as the best bound for any $\mathcal{F}_r$ in the complexity hierarchy. If we always choose a perfect fit to the data, there is no trade-off between complexity and empirical risk; but when there is a prediction rule $f^*$ with finite complexity and zero risk, then once the sample size is sufficiently large, the best trade-off does correspond to a perfect fit to the data. To summarize: when there is no noise, that is, when $y = f^*(x)$ and $f^* \in \mathcal{F}$, classical theory shows that a minimum-complexity interpolant $\widehat f \in \mathcal{F}$ will have risk $L(\widehat f)$ converging to zero as the sample size increases.
But what if there is noise, that is, there is no deterministic relationship between $x$ and $y$? Then it turns out that the bounds on the excess risk $L(\widehat f) - L(f^*_{\mathcal{F}})$ presented in this section must become vacuous: they can never decrease below a constant, no matter how large the sample size. This is because these bounds do not rely on any properties of the distribution on $\mathcal{X}$, and hence are also true in a fixed design setting, where the excess risk is at least the noise level.
To make this precise, fix $x_1, \ldots, x_n \in \mathcal{X}$ and define the fixed design risk
\[
L_{|x}(f) := \frac{1}{n} \sum_{i=1}^n \mathbb{E}\left[ \ell(f(x_i), y) \,\middle|\, x = x_i \right].
\]
Then the decomposition (4) extends to this risk: for any $\widehat f$ and $f^*$,
\[
L_{|x}(\widehat f) - L_{|x}(f^*)
= \left[ L_{|x}(\widehat f) - \widehat L(\widehat f) \right]
+ \left[ \widehat L(\widehat f) - \widehat L(f^*) \right]
+ \left[ \widehat L(f^*) - L_{|x}(f^*) \right].
\]
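For intuition about why this forces the bounds to be vacuous under noise, here is a minimal sketch for the quadratic loss, under the additive-noise assumption (introduced here only for illustration) that $y = f^*(x) + \xi$ with $\mathbb{E}[\xi \mid x] = 0$ and $\mathrm{Var}(\xi \mid x) = \sigma^2 > 0$. Conditioning on $x = x_i$,
\[
\mathbb{E}\left[ (f(x_i) - y)^2 \mid x = x_i \right] = (f(x_i) - f^*(x_i))^2 + \sigma^2,
\]
so $L_{|x}(f) = \frac{1}{n} \sum_{i=1}^n (f(x_i) - f^*(x_i))^2 + \sigma^2 \ge \sigma^2$ for every $f$. If $\widehat f$ interpolates the noisy labels, then $\widehat L(\widehat f) = 0$, and the first bracket above equals $L_{|x}(\widehat f) \ge \sigma^2$: the term that uniform convergence is meant to control cannot fall below the noise level, however large $n$ is. Moreover, since $\widehat f(x_i) = y_i = f^*(x_i) + \xi_i$, the fixed design excess risk is $\frac{1}{n} \sum_{i=1}^n \xi_i^2$, which concentrates around $\sigma^2$.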