Deep Learning Theory: Initialization and Loss Landscape Analysis

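The notes "Deep Learning Theory", written by Evgenii Golikov, cover key concepts in deep learning such as initialization, loss surfaces, generalization, and neural tangent kernel theory. Future revisions are planned to add more topics, such as expressivity, mean-field theory, and the double descent phenomenon.

These notes are compiled from lectures given at the Moscow Institute of Physics and Technology (MIPT) and the Yandex School of Data Analysis (YSDA), and aim to examine the theoretical foundations of deep learning. They first introduce the core concepts:

1. **Generalization**: the ability of a model to perform well on data it has not seen. The notes may discuss how regularization, network architecture, and the optimization strategy affect generalization.
2. **Global convergence**: whether gradient-based training reaches a global minimum of the training loss rather than a poor local one. The notes may discuss how the initialization scheme affects the speed and stability of convergence.
3. **From weight space to function space**: how a configuration of weights determines the behavior of the model on the input space. This part may touch on the expressive power and complexity of networks.

The notes then explain in detail why **initialization** matters:

- **Preserving variance**: to keep the signal from vanishing or exploding as it passes through many layers, the initialization has to relate the variance of a layer's input to the variance of its output.
- **Linear layers**: linear layers are usually initialized with schemes such as Xavier or He initialization, which preserve the variance of the forward pass.
- **ReLU layers**: ReLU needs an adapted initialization such as He initialization, which accounts for ReLU zeroing out negative pre-activations.
- **Tanh layers**: the initialization for tanh has to take its saturating (squashing) behavior into account.
- **Dynamical stability**: how the initialization affects the dynamics of gradient descent, including a stability analysis for linear and ReLU layers.
- **GD dynamics for orthogonal initialization**: orthogonal initialization gives better-behaved dynamics, which helps to avoid vanishing or exploding gradients and speeds up training.

Next comes an analysis of the **loss surface**, in particular for wide non-linear networks. It may cover the relation between flat minima and generalization, and what the loss surface tells us about the difficulties of training.

The notes also cover **neural tangent kernel theory**, which describes the training dynamics of a network through its linearization around the initialization.

Finally, the notes state that future revisions will add more topics, such as **expressivity** (the ability of a network to represent complex functions), **mean-field theory** (a tool for understanding the behavior of very large networks), and **double descent** (in some settings, as model complexity grows, the test error decreases, rises, and then decreases again).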

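In order to study its stability, we have to consider the derivative of the C-map at $c^l = 1$. Let us first compute the derivative for $c^l < 1$:

$$
\frac{\partial c^{l+1}}{\partial c^{l}} = q_\infty^{-1} \sigma_w^2 \, \mathbb{E}_{(z^a, z^b)^T \sim \mathcal{N}(0, I)} \, \phi\left(\sqrt{q_\infty}\, z^a\right) \phi'\left(\sqrt{q_\infty}\left(c^l z^a + \sqrt{1 - (c^l)^2}\, z^b\right)\right) \sqrt{q_\infty} \left(z^a - \frac{z^b c^l}{\sqrt{1 - (c^l)^2}}\right). \tag{2.37}
$$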

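We shall use the following identity (Gaussian integration by parts), which holds for any differentiable $F$ that grows slowly enough for the boundary term to vanish:

$$
\mathbb{E}_{z \sim \mathcal{N}(0,1)} F(z) z = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} F(z) z e^{-z^2/2} \, dz = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} (-F(z)) \, d e^{-z^2/2} = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{+\infty} F'(z) e^{-z^2/2} \, dz = \mathbb{E}_{z \sim \mathcal{N}(0,1)} F'(z). \tag{2.38}
$$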
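The identity (2.38) is easy to check numerically. The sketch below is only an illustration: the choice $F = \tanh$ and the helper name `gauss_expectation` are assumptions made for this example, and the Gaussian expectations are approximated by Gauss-Hermite quadrature from NumPy.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

def gauss_expectation(f, num_nodes=100):
    """Approximate E f(z) for z ~ N(0, 1) by Gauss-Hermite quadrature."""
    # hermegauss uses the weight exp(-z^2 / 2), so its weights sum to sqrt(2 * pi).
    nodes, weights = hermegauss(num_nodes)
    return float(np.sum(weights * f(nodes)) / np.sqrt(2.0 * np.pi))

def F(z):
    """Illustrative test function (an assumption of this example)."""
    return np.tanh(z)

def dF(z):
    """Derivative of F."""
    return 1.0 - np.tanh(z) ** 2

lhs = gauss_expectation(lambda z: F(z) * z)   # E F(z) z
rhs = gauss_expectation(dF)                   # E F'(z)
print(f"E F(z) z = {lhs:.6f}, E F'(z) = {rhs:.6f}")
```

The two printed numbers should agree up to the quadrature error.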

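We begin the integration by analyzing one part of this equation, namely the expectation over $z^b$ of the term that contains $z^b$:

$$
\mathbb{E}_{z^b \sim \mathcal{N}(0,1)} \, \phi'\left(\sqrt{q_\infty}\left(c^l z^a + \sqrt{1 - (c^l)^2}\, z^b\right)\right) \frac{\sqrt{q_\infty}\, z^b c^l}{\sqrt{1 - (c^l)^2}} = q_\infty \, \mathbb{E}_{z^b \sim \mathcal{N}(0,1)} \, \phi''\left(\sqrt{q_\infty}\left(c^l z^a + \sqrt{1 - (c^l)^2}\, z^b\right)\right) c^l. \tag{2.39}
$$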

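Hence,

$$
\frac{\partial c^{l+1}}{\partial c^{l}} = q_\infty^{-1} \sigma_w^2 \, \mathbb{E}_{z^a \sim \mathcal{N}(0,1)} \, \phi(u^a) \, \mathbb{E}_{z^b \sim \mathcal{N}(0,1)} \left( \sqrt{q_\infty}\, z^a \phi'\left(u^b(z^a, z^b)\right) - q_\infty c^l \phi''\left(u^b(z^a, z^b)\right) \right), \tag{2.40}
$$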

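where $u^a = \sqrt{q_\infty}\, z^a$, while $u^b(z^a, z^b) = \sqrt{q_\infty}\left(c^l z^a + \sqrt{1 - (c^l)^2}\, z^b\right)$. Consider the limit $c^l \to 1$, in which $u^b \to \sqrt{q_\infty}\, z^a$:

$$
\lim_{c^l \to 1} \frac{\partial c^{l+1}}{\partial c^{l}} = q_\infty^{-1} \sigma_w^2 \, \mathbb{E}_{z \sim \mathcal{N}(0,1)} \, \phi\left(\sqrt{q_\infty}\, z\right) \left( \sqrt{q_\infty}\, z \, \phi'\left(\sqrt{q_\infty}\, z\right) - q_\infty \phi''\left(\sqrt{q_\infty}\, z\right) \right). \tag{2.41}
$$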

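Let us compute the first term first, applying Gaussian integration by parts with $F(z) = \sqrt{q_\infty}\, \phi(\sqrt{q_\infty}\, z) \phi'(\sqrt{q_\infty}\, z)$:

$$
\mathbb{E}_{z \sim \mathcal{N}(0,1)} \, \phi\left(\sqrt{q_\infty}\, z\right) \sqrt{q_\infty}\, z \, \phi'\left(\sqrt{q_\infty}\, z\right) = q_\infty \, \mathbb{E}_{z \sim \mathcal{N}(0,1)} \left( \left(\phi'\left(\sqrt{q_\infty}\, z\right)\right)^2 + \phi\left(\sqrt{q_\infty}\, z\right) \phi''\left(\sqrt{q_\infty}\, z\right) \right). \tag{2.42}
$$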

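This gives the final result, since the terms containing $\phi \phi''$ cancel:

$$
\lim_{c^l \to 1} \frac{\partial c^{l+1}}{\partial c^{l}} = \sigma_w^2 \, \mathbb{E}_{z \sim \mathcal{N}(0,1)} \left(\phi'\left(\sqrt{q_\infty}\, z\right)\right)^2 = \chi_1. \tag{2.43}
$$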

We see that $\chi_1$ governs the stability of the correlation of strongly correlated hidden representations or, equivalently, of nearby input points. For $\chi_1 < 1$, nearby points with $c^l \approx 1$ become more correlated as they propagate through the layers, so initially different points become more and more similar. We refer to this regime as ordered. In contrast, for $\chi_1 > 1$, nearby points separate as they propagate deeper into the network. We refer to this regime as chaotic. Hence the case $\chi_1 = 1$ is the edge of chaos.
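To make the two regimes concrete, here is a minimal sketch that evaluates $\chi_1$ from (2.43) for $\phi = \tanh$. It rests on assumptions not shown in this section: the variance map is taken to be $q \mapsto \sigma_w^2 \, \mathbb{E}\, \phi(\sqrt{q} z)^2 + \sigma_b^2$ with fixed point $q_\infty$ (the usual mean-field form), and the function names and parameter values are illustrative.

```python
import numpy as np
from numpy.polynomial.hermite_e import hermegauss

NODES, WEIGHTS = hermegauss(100)
WEIGHTS = WEIGHTS / np.sqrt(2.0 * np.pi)   # normalize so that the weights sum to 1

def gauss_mean(f):
    """E f(z) for z ~ N(0, 1), by Gauss-Hermite quadrature."""
    return float(np.sum(WEIGHTS * f(NODES)))

def q_fixed_point(sigma_w2, sigma_b2, num_iters=500):
    """Iterate the assumed variance map q -> sigma_w^2 E tanh(sqrt(q) z)^2 + sigma_b^2."""
    q = 1.0
    for _ in range(num_iters):
        q = sigma_w2 * gauss_mean(lambda z: np.tanh(np.sqrt(q) * z) ** 2) + sigma_b2
    return q

def chi_1(sigma_w2, sigma_b2):
    """chi_1 = sigma_w^2 E (phi'(sqrt(q_inf) z))^2 for phi = tanh, as in (2.43)."""
    q_inf = q_fixed_point(sigma_w2, sigma_b2)
    return sigma_w2 * gauss_mean(lambda z: (1.0 - np.tanh(np.sqrt(q_inf) * z) ** 2) ** 2)

chi_small = chi_1(sigma_w2=0.5, sigma_b2=0.1)   # small weight variance
chi_large = chi_1(sigma_w2=9.0, sigma_b2=0.1)   # large weight variance
print(f"chi_1 = {chi_small:.3f} for sigma_w^2 = 0.5, chi_1 = {chi_large:.3f} for sigma_w^2 = 9")
```

Under these assumptions the small weight variance lands in the ordered regime ($\chi_1 < 1$, since $|\phi'| \le 1$ for tanh) and the large one in the chaotic regime.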

2.2 Dynamical stability

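Following [Pennington et al., 2017], let us turn our attention to the input-output Jacobian:

$$
J = \frac{\partial h_{L+1}}{\partial h_1} = \prod_{l=1}^{L} W_l D_l \in \mathbb{R}^{n_{L+1} \times n_1}. \tag{2.44}
$$

Here $D_l = \operatorname{diag}(\phi'(h_l))$, $W_l \in \mathbb{R}^{n_{l+1} \times n_l}$, and the product is ordered with the largest index on the left: $J = W_L D_L \cdots W_1 D_1$.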

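We now compute the mean squared Frobenius norm of $J$, that is, the expected trace of $J^T J \in \mathbb{R}^{n_1 \times n_1}$. The entries of $W_L$ are i.i.d. with zero mean and variance $v_L$, so $\mathbb{E}\, W_L^T W_L = n_{L+1} v_L I$, and

$$
\begin{aligned}
\mathbb{E} \|J\|_F^2 = \mathbb{E} \operatorname{tr}(J^T J)
&= \mathbb{E}_{W_{0:L}} \operatorname{tr}\left( \left(\prod_{l=1}^{L} W_l D_l\right)^T \left(\prod_{l=1}^{L} W_l D_l\right) \right) \\
&= \operatorname{tr}\left( \mathbb{E}_{W_{0:L}} \left( \left(\prod_{l=1}^{L} W_l D_l\right)^T \left(\prod_{l=1}^{L} W_l D_l\right) \right) \right) \\
&= \operatorname{tr}\left( \mathbb{E}_{W_{0:L-1}} \left( \left(\prod_{l=1}^{L-1} W_l D_l\right)^T D_L \, \mathbb{E}_{W_L}\left(W_L^T W_L\right) D_L \left(\prod_{l=1}^{L-1} W_l D_l\right) \right) \right) \\
&= n_{L+1} v_L \operatorname{tr}\left( \mathbb{E}_{W_{0:L-1}} \left( \left(\prod_{l=1}^{L-1} W_l D_l\right)^T D_L^2 \left(\prod_{l=1}^{L-1} W_l D_l\right) \right) \right).
\end{aligned} \tag{2.45}
$$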

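Assuming that $\operatorname{tr}(D_l^2)$ does not depend on $W_{0:l}$ $\forall l \in [L]$ allows us to proceed with the calculation, using $\mathbb{E}_{W_{l-1}} \left( W_{l-1}^T D_l^2 W_{l-1} \right) = v_{l-1} \operatorname{tr}(D_l^2) I$ at each step:

$$
\begin{aligned}
\mathbb{E} \|J\|_F^2
&= n_{L+1} v_L v_{L-1} \operatorname{tr}(D_L^2) \operatorname{tr}\left( \mathbb{E}_{W_{0:L-2}} \left( \left(\prod_{l=1}^{L-2} W_l D_l\right)^T D_{L-1}^2 \left(\prod_{l=1}^{L-2} W_l D_l\right) \right) \right) \\
&= n_{L+1} v_L \left( \prod_{l=2}^{L} v_{l-1} \operatorname{tr}(D_l^2) \right) \operatorname{tr}(D_1^2)
= n_{L+1} \prod_{l=1}^{L} v_l \operatorname{tr}(D_l^2).
\end{aligned} \tag{2.46}
$$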

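Suppose we aim to normalize the backward dynamics: $v_l = \sigma_w^2 / n_{l+1}$ $\forall l \in [L]$. Assume then (see Section 2.1.3) that the entries of $h_l$ are distributed as $\mathcal{N}(0, q_\infty)$ $\forall l \in [L]$, so that $\operatorname{tr}(D_l^2) \approx n_l \, \mathbb{E}_{z \sim \mathcal{N}(0,1)} (\phi'(\sqrt{q_\infty}\, z))^2$. Then the calculation above gives us the mean average eigenvalue of $J^T J$:

$$
\mathbb{E} \, \frac{1}{n_1} \sum_{i=1}^{n_1} \lambda_i = \frac{1}{n_1} \mathbb{E} \|J\|_F^2 = \left( \sigma_w^2 \, \mathbb{E}_{z \sim \mathcal{N}(0,1)} \left(\phi'\left(\sqrt{q_\infty}\, z\right)\right)^2 \right)^L = \chi_1^L. \tag{2.47}
$$

Hence $\chi_1^L$ is the mean average eigenvalue of $J^T J$, that is, the mean squared singular value of the input-output Jacobian of a network of depth $L$.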
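The relation between the Frobenius norm of the Jacobian and $\chi_1^L$ can be checked by simulation. The sketch below deliberately swaps in ReLU instead of tanh, because for ReLU $\mathbb{E} (\phi'(z))^2 = 1/2$ whatever the pre-activation variance is, so $\chi_1 = \sigma_w^2 / 2$ needs no fixed-point computation. The width, depth, $\sigma_w^2$, and the function name are illustrative assumptions.

```python
import numpy as np

def mean_sq_singular_value(n, depth, sigma_w2, rng):
    """Return ||J||_F^2 / n for J = W_L D_L ... W_1 D_1 in a random ReLU net of width n."""
    h = rng.normal(size=n)                 # pre-activations h_1 of the first layer
    jac = np.eye(n)
    for _ in range(depth):
        d = (h > 0).astype(float)          # diagonal of D_l = diag(phi'(h_l)) for ReLU
        w = rng.normal(scale=np.sqrt(sigma_w2 / n), size=(n, n))   # v_l = sigma_w^2 / n
        jac = (w * d) @ jac                # left-multiply by W_l D_l
        h = w @ (d * h)                    # h_{l+1} = W_l phi(h_l), where phi(h) = D h
    return float(np.sum(jac ** 2) / n)

rng = np.random.default_rng(0)
n, depth, sigma_w2 = 200, 4, 2.5
chi = 0.5 * sigma_w2                       # chi_1 for ReLU
estimate = float(np.mean([mean_sq_singular_value(n, depth, sigma_w2, rng)
                          for _ in range(50)]))
print(f"Monte Carlo estimate: {estimate:.3f}, chi_1^L = {chi ** depth:.3f}")
```

With all widths equal, fan-in and fan-out scaling coincide, which is why a single variance $\sigma_w^2 / n$ is enough here.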

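Let us assume that our non-linearity is positively homogeneous: $\phi(\beta z) = \beta \phi(z)$ for all $\beta \geq 0$. This property holds for leaky ReLU with arbitrary slope; in particular, it holds in the linear case. Then $\phi(h_l) = D_l h_l$, and we have the following:

$$
h_{L+1} = J h_1; \quad q_{L+1} = \frac{1}{n_{L+1}} \mathbb{E} \|J h_1\|_2^2 = \frac{1}{n_{L+1}} h_1^T \left(\mathbb{E}\, J^T J\right) h_1 = \frac{1}{n_{L+1}} \mathbb{E} \sum_{i=1}^{n_1} \lambda_i \left(e_i^T h_1\right)^2, \tag{2.48}
$$

where $\lambda_i$ and $e_i$ are the eigenvalues and orthonormal eigenvectors of $J^T J$. Similarly, for the backward pass,

$$
\delta_1 = J^T \delta_{L+1}; \quad \mathbb{E} \|\delta_1\|_2^2 = \mathbb{E} \|J^T \delta_{L+1}\|_2^2 = \delta_{L+1}^T \left(\mathbb{E}\, J J^T\right) \delta_{L+1} = \mathbb{E} \sum_{i=1}^{n_{L+1}} \tilde\lambda_i \left(\tilde e_i^T \delta_{L+1}\right)^2, \tag{2.49}
$$

where $\tilde\lambda_i$ and $\tilde e_i$ are the eigenvalues and orthonormal eigenvectors of $J J^T$.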

One can perceive $q_{L+1}$ as the mean normalized squared length of the network output. We may want to study the distribution of normalized squared lengths instead.

In this case it suffices to study the empirical spectral density of $J^T J$:

$$
\hat\rho(x) = \frac{1}{n_1} \sum_{i=1}^{n_1} \delta(x - \lambda_i). \tag{2.50}
$$

Although $\hat\rho$ is random, it converges to a deterministic limiting spectral density $\rho$ as $n \to \infty$ if we assume $n_l = \alpha_l n$ $\forall l \in [L + 1]$ with $\alpha_l$ being constant.

Assume all matrices $W_l$ are square: $n_1 = \ldots = n_{L+1} = n$. In this case the choice $v_l = 1/n$ normalizes both forward and backward dynamics. On the other hand, in the linear case the limiting spectrum can be parameterized as (see [Pennington et al., 2017]):

$$
\lambda(\phi) = \frac{\sin^{L+1}((L + 1)\phi)}{\sin \phi \, \sin^{L}(L\phi)}, \tag{2.51}
$$

where $\phi$ is an angle parameter, not the non-linearity. We shall prove this result in the upcoming section. Notice that $\lim_{\phi \to 0} \lambda(\phi) = (L + 1)^{L+1} / L^L \sim e(L + 1)$ for large $L$. Hence in this case, although we preserve the lengths of input vectors on average, some input vectors get expanded with positive probability during forward propagation, while some get contracted. The same holds for the backward dynamics.
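The limit of $\lambda(\phi)$ and its large-$L$ behavior can be verified with a few lines of code. This is a sketch with hypothetical function names; the parameterization is evaluated in log space so that the high powers of small sines do not underflow.

```python
import numpy as np

def spectrum_param(phi, depth):
    """lambda(phi) from (2.51) for 0 < phi < pi / (depth + 1), evaluated in log space."""
    log_lam = ((depth + 1) * np.log(np.sin((depth + 1) * phi))
               - np.log(np.sin(phi))
               - depth * np.log(np.sin(depth * phi)))
    return float(np.exp(log_lam))

def spectrum_edge(depth):
    """The limit of lambda(phi) as phi -> 0, i.e. (L + 1)^(L + 1) / L^L."""
    return (depth + 1) ** (depth + 1) / depth ** depth

for depth in (1, 10, 100):
    print(depth, spectrum_param(1e-6, depth), spectrum_edge(depth), np.e * (depth + 1))
```

For $L = 1$ the limit equals 4, the right edge of the support $[0, 4]$ of the Wishart spectrum, and for $L = 100$ the ratio of the limit to $e(L + 1)$ is already within one percent of 1.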

2.2.1 Linear case

Our goal is to compute the limiting spectrum of the matrix $JJ^T \in \mathbb{R}^{n \times n}$ with $J = \prod_{l=1}^{L} W_l$, where all $W_l$ are composed of i.i.d. Gaussians with variance $1/n$; this is referred to as the product Wishart ensemble. The case of $L = 1$, $WW^T$, is known as the Wishart ensemble. The limiting spectrum of the Wishart ensemble is known as the Marchenko-Pastur law [Marchenko and Pastur, 1967]:

$$
\rho_{WW^T}(x) = \frac{1}{2\pi} \sqrt{\frac{4}{x} - 1} \; \mathrm{Ind}_{[0,4]}(x). \tag{2.52}
$$
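As a sanity check of (2.52), one can compare the moments of this density with the eigenvalue moments of a sampled matrix $WW^T$. The sketch below is an illustration only: the matrix size, the seed, and the helper `mp_moment` are assumptions, and the moments of the density are computed by a midpoint rule after the substitution $x = t^2$, which removes the integrable singularity at $x = 0$.

```python
import numpy as np

def mp_moment(k, num_points=200_000):
    """k-th moment of the density (2.52), by the midpoint rule after substituting x = t^2."""
    # With x = t^2, the integrand x^k rho(x) dx becomes t^(2k) sqrt(4 - t^2) / pi dt on [0, 2].
    step = 2.0 / num_points
    t = (np.arange(num_points) + 0.5) * step
    return float(np.sum(t ** (2 * k) * np.sqrt(4.0 - t ** 2)) * step / np.pi)

rng = np.random.default_rng(0)
n = 500
w = rng.normal(scale=np.sqrt(1.0 / n), size=(n, n))   # i.i.d. Gaussian entries, variance 1/n
eigvals = np.linalg.eigvalsh(w @ w.T)

for k in (0, 1, 2):
    print(f"k = {k}: empirical {np.mean(eigvals ** k):.3f}, from (2.52) {mp_moment(k):.3f}")
print(f"largest eigenvalue: {eigvals.max():.3f}")
```

The zeroth moment confirms that the density is normalized, and the largest sampled eigenvalue should sit close to the right edge of the support, 4.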

It is possible to derive the limiting spectrum of $JJ^T$ by using the so-called S-transform, which we shall define later. A high-level algorithm is the following. First, we compute the S-transform of the Wishart ensemble:

$$
S_{WW^T}(z) = \frac{1}{1 + z}. \tag{2.53}
$$

The S-transform has the following fundamental property. Given two asymptotically free random matrices $A$ and $B$, we have [Voiculescu, 1987]

$$
S_{AB} = S_A S_B \tag{2.54}
$$

in the limit of $n \to \infty$.

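As we shall see later, the S-transform of $J_L J_L^T$ with $J_L = \prod_{l=1}^{L} W_l$ depends only on traces of the form $n^{-1} \operatorname{tr}((J_L J_L^T)^k)$, which are invariant under cyclic permutations of the matrices $W_l$. This allows us to compute $S_{JJ^T}$:

$$
S_{JJ^T} = S_{W_L^T W_L J_{L-1} J_{L-1}^T} = S_{W_L^T W_L} S_{J_{L-1} J_{L-1}^T} = \prod_{l=1}^{L} S_{W_l^T W_l} = S_{W^T W}^L. \tag{2.55}
$$

The last equation holds since all $W_l$ are distributed identically. The final step is to recover the spectrum of $JJ^T$ from its S-transform.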
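Before going through the S-transform machinery, the target of the computation can be previewed by simulation. The sketch below samples a product of $L = 2$ Gaussian matrices and compares the largest eigenvalue of $JJ^T$ with $(L + 1)^{L+1} / L^L$, the $\phi \to 0$ limit of the parameterization (2.51). The matrix size, depth, and seed are illustrative assumptions, and at finite $n$ the agreement is only approximate.

```python
import numpy as np

rng = np.random.default_rng(0)
n, depth = 400, 2

# J = W_L ... W_1 with i.i.d. Gaussian entries of variance 1/n.
jac = np.eye(n)
for _ in range(depth):
    w = rng.normal(scale=np.sqrt(1.0 / n), size=(n, n))
    jac = w @ jac
eigvals = np.linalg.eigvalsh(jac @ jac.T)

edge = (depth + 1) ** (depth + 1) / depth ** depth   # equals 27/4 for L = 2
print(f"mean eigenvalue: {eigvals.mean():.3f}")
print(f"largest eigenvalue: {eigvals.max():.3f}, predicted edge: {edge:.3f}")
```

The mean eigenvalue stays close to 1 (lengths are preserved on average), while the spectrum spreads out well beyond the Wishart edge of 4.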

Free independence. We say that $A$ and $B$ are freely independent, or just free, if for all $k \in \mathbb{N}$:

$$
\tau\big( (P_1(A) - \tau(P_1(A))) (Q_1(B) - \tau(Q_1(B))) \cdots (P_k(A) - \tau(P_k(A))) (Q_k(B) - \tau(Q_k(B))) \big) = 0, \tag{2.56}
$$

where $\forall i \in [k]$ $P_i$ and $Q_i$ are polynomials, while $\tau(A) = n^{-1} \mathbb{E} \operatorname{tr}(A)$ is an analogue of the expectation for scalar random variables. Note that $\tau$ is a linear operator and $\tau(I) = 1$. Compare the above with the definition of classical independence:

$$
\tau\big( (P(A) - \tau(P(A))) (Q(B) - \tau(Q(B))) \big) = 0, \tag{2.57}
$$

for all polynomials $P$ and $Q$.

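Note that two scalar-valued random variables are free iff one of them is constant; indeed, taking $k = 2$ and $P_i(x) = Q_i(x) = x$, freeness requires the following quantity to vanish:

$$
\mathbb{E} \big( (\xi - \mathbb{E}\xi)(\eta - \mathbb{E}\eta)(\xi - \mathbb{E}\xi)(\eta - \mathbb{E}\eta) \big) = \mathbb{E} \big( (\xi - \mathbb{E}\xi)^2 (\eta - \mathbb{E}\eta)^2 \big) = \big( \mathbb{E} (\xi - \mathbb{E}\xi)^2 \big) \big( \mathbb{E} (\eta - \mathbb{E}\eta)^2 \big) = \operatorname{Var}\xi \operatorname{Var}\eta. \tag{2.58}
$$

Hence having $\operatorname{Var}\xi = 0$ or $\operatorname{Var}\eta = 0$ is necessary; this implies $\xi = \mathrm{const}$ or $\eta = \mathrm{const}$, which in turn gives free independence.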

This means that the notion of free independence is too strong for scalar random variables. The reason for this is their commutativity; only non-commutative objects can have a non-trivial notion of free independence. As for random matrices with classically independent entries, they have the remarkable property that they become free in the limit of $n \to \infty$:

$$
\lim_{n \to \infty} \tau\big( (P_1(A_n) - \tau(P_1(A_n))) (Q_1(B_n) - \tau(Q_1(B_n))) \cdots (P_k(A_n) - \tau(P_k(A_n))) (Q_k(B_n) - \tau(Q_k(B_n))) \big) = 0, \tag{2.59}
$$

for $A_n$ and $B_n \in \mathbb{R}^{n \times n}$ such that the moments $\tau(A_n^k)$ and $\tau(B_n^k)$ are finite for large $n$ for $k \in \mathbb{N}$. We shall say that the two sequences $\{A_n\}$ and $\{B_n\}$ are asymptotically free as $n \to \infty$.

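Asymptotic free independence for Wigner matrices. In order to illustrate the above property, consider $X$ and $Y$ being classically independent $n \times n$ Wigner matrices, i.e. $X_{ij} = X_{ji} \sim \mathcal{N}(0, n^{-1})$, and similarly for $Y$. Of course, $\tau(X) = \tau(Y) = 0$, while $\tau(X^2 Y^2) = n^{-1} \operatorname{tr}(\mathbb{E} X^2 \, \mathbb{E} Y^2) = n^{-1} \operatorname{tr}(I) = 1$. Let us compute $\tau(XYXY)$:

$$
\tau(XYXY) = \frac{1}{n} \mathbb{E} \sum_{i,j,k,l} X_{ij} Y_{jk} X_{kl} Y_{li} = \frac{1}{n^3} \left( \sum_{i,j,k,l} (\delta_{ik}\delta_{jl} + \delta_{il}\delta_{jk})(\delta_{jl}\delta_{ki} + \delta_{ji}\delta_{kl}) - Cn \right) = \frac{1}{n^3} \left( n^2 + (3 - C)n \right) = O_{n \to \infty}(n^{-1}), \tag{2.60}
$$

where the correction $Cn$, with a constant $C$, accounts for the diagonal entries, for which the Kronecker-delta expression for the covariance is not exact.

Since $\tau(X) = \tau(Y) = 0$, freeness would require $\tau(XYXY) = 0$. This means that $X$ and $Y$ are not freely independent; however, it suggests that they become free in the limit of large $n$.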
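The decay of $\tau(XYXY)$ with $n$ can also be observed by Monte Carlo. The following is a sketch under illustrative assumptions: the helper names, sample counts, and matrix sizes are made up for this example, and $\tau$ is estimated by averaging normalized traces over independent samples.

```python
import numpy as np

def sample_wigner(num_samples, n, rng):
    """Symmetric matrices with X_ij = X_ji ~ N(0, 1/n), diagonal included."""
    g = rng.normal(scale=np.sqrt(1.0 / n), size=(num_samples, n, n))
    upper = np.triu(g)                                        # upper triangle with diagonal
    return upper + np.transpose(np.triu(g, k=1), (0, 2, 1))   # mirror the strict upper part

def tau_estimates(n, num_samples, rng):
    """Monte Carlo estimates of tau(X^2 Y^2) and tau(XYXY), with tau(A) = E tr(A) / n."""
    x = sample_wigner(num_samples, n, rng)
    y = sample_wigner(num_samples, n, rng)
    xy = x @ y
    # tr(AB) = sum_ij A_ij B_ji, computed per sample.
    tau_x2y2 = np.mean(np.einsum("bij,bji->b", x @ x, y @ y)) / n
    tau_xyxy = np.mean(np.einsum("bij,bji->b", xy, xy)) / n
    return float(tau_x2y2), float(tau_xyxy)

rng = np.random.default_rng(0)
results = {
    20: tau_estimates(20, num_samples=2000, rng=rng),
    80: tau_estimates(80, num_samples=400, rng=rng),
}
for n, (t_x2y2, t_xyxy) in results.items():
    print(f"n = {n}: tau(X^2 Y^2) = {t_x2y2:.3f}, tau(XYXY) = {t_xyxy:.4f}")
```

The estimate of $\tau(X^2 Y^2)$ stays near 1 for both sizes, while the estimate of $\tau(XYXY)$ shrinks roughly in proportion to $1/n$.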

see also https://mast.queensu.ca/~speicher/survey.html
