Deep Learning Generalization Theory: Quantitative and Qualitative Insights into Neural Networks


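The paper "Generalization in Deep Learning" examines the generalization ability of deep learning and gives a mathematically rigorous theoretical analysis of one of the field's open problems. Unlike earlier bound-based theories, this theory is quantitatively tight for each individual dataset while remaining competitive in the qualitative insight it provides. Its central results explain how deep learning can generalize well despite large capacity, complexity, possible algorithmic instability, non-robustness, and sharp local minima, answering open questions in the literature about the mechanism of generalization in deep learning.

The paper first reviews the practical success of deep learning and its influence on the foundational ideas of machine learning and artificial intelligence. The authors then point out that, despite this success, theoretical challenges remain, in particular why and how deep models generalize under such complex conditions. By analyzing the neural network structure directly, they show how their theory tightens the generalization error quantitatively, in a way that is consistent across datasets yet specific to each one.

The main contribution is a new theoretical framework for generalization that quantifies a model's predictive performance beyond the training set, which helps explain why deep learning models keep performing well on complex tasks. The theory considers not only the statistical properties of the model but also algorithmic behavior and optimization during learning, such as weight distributions and gradient update strategies.

The paper also notes the limitations of the theory, including the assumptions it may rely on, its applicability to certain special architectures or training methods, and its treatment of overfitting and noisy data. The authors further propose directions for future research, encouraging work on deeper generalization mechanisms such as robustness to adversarial examples and the effect of training data quality and quantity on generalization.

Overall, "Generalization in Deep Learning" contributes to the theoretical study of deep learning: it fills a gap in existing generalization theory and offers researchers a new perspective for understanding and improving generalization. It is a useful reference for researchers and engineers who want to understand the internal mechanisms of deep learning.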

Kawaguchi, Kaelbling, and Bengio

Figure 1: An illustration of differences in assumptions. Statistical learning theory analyzes the generalization behaviors of $f_{A(S_m)}$ over randomly-drawn unspecified datasets $S_m \in \mathcal{D}$ according to some unspecified distribution $P \in \mathcal{P}$. Intuitively, statistical learning theory is concerned more with questions regarding a set $\mathcal{P} \times \mathcal{D}$ because of the unspecified nature of $(P, S_m)$, whereas certain empirical studies (e.g., Zhang et al. 2017) can focus on questions regarding each specified point $(P, S_m) \in \mathcal{P} \times \mathcal{D}$.

Lower bounds, necessary conditions and tightness in statistical learning theory are typically defined via a worst-case distribution $P_{\text{worst}} \in \mathcal{P}$. For instance, classical “no free lunch” theorems and certain lower bounds on the generalization gap (e.g., Mohri et al. 2012, Section 3.4) have been proven for the worst-case distribution $P_{\text{worst}} \in \mathcal{P}$. Therefore, “tight” and “necessary” typically mean “tight” and “necessary” for the set $\mathcal{P} \times \mathcal{D}$ (e.g., through the worst or average case), but not for each particular point $(P, S_m) \in \mathcal{P} \times \mathcal{D}$. From this viewpoint, we can understand that even if the quality of the set $\mathcal{P} \times \mathcal{D}$ is “bad” overall, there may exist a “good” point $(P, S_m) \in \mathcal{P} \times \mathcal{D}$.
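The distinction between a "bad" set and a "good" point can be made concrete with a toy example. The sketch below is not the paper's construction: the four-point input space, the train/test split, and the majority-label learner (`learn_constant`) are all assumptions chosen for illustration. It averages off-training-set accuracy over every possible binary target (the set) and compares that with the accuracy for one particular target (a point).

```python
from itertools import product

TRAIN_X = (0, 1)  # inputs whose labels the learner sees (assumed split)
TEST_X = (2, 3)   # inputs held out from training


def learn_constant(train_labels):
    """Toy learner: predict the majority training label everywhere (ties -> 0)."""
    return 1 if sum(train_labels) * 2 > len(train_labels) else 0


def off_training_accuracy(target):
    """Accuracy on TEST_X of the toy learner trained on TRAIN_X for one target."""
    prediction = learn_constant([target[x] for x in TRAIN_X])
    return sum(prediction == target[x] for x in TEST_X) / len(TEST_X)


# The "set": every binary labeling of the four inputs, weighted equally.
all_targets = list(product((0, 1), repeat=4))
average_accuracy = sum(off_training_accuracy(t) for t in all_targets) / len(all_targets)

# One "point": the target that labels every input 1.
good_point_accuracy = off_training_accuracy((1, 1, 1, 1))

print(average_accuracy, good_point_accuracy)
```

Averaged over all targets the learner is no better than chance on unseen inputs, yet for the single all-ones target it is perfect, which mirrors the set-versus-point distinction in the text.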

Several approaches in statistical learning theory, such as the data-dependent and Bayesian approaches (Herbrich and Williamson, 2002; Dziugaite and Roy, 2017), use more assumptions on the set $\mathcal{P} \times \mathcal{D}$ to take advantage of more prior and posterior information; these can tackle Problem 1. However, these approaches do not apply to Problem 2, as they still depend on factors other than the given $(P, S_m, f)$. For example, data-dependent bounds with the luckiness framework (Shawe-Taylor et al., 1998; Herbrich and Williamson, 2002) and empirical Rademacher complexity (Koltchinskii and Panchenko, 2000; Bartlett et al., 2002) still depend on a concept of hypothesis spaces (or the sequence of hypothesis spaces), and the robustness approach (Xu and Mannor, 2012) depends on datasets other than a given $S_m$ via the definition of robustness (i.e., in Section 2, $\zeta(S_m)$ is a data-dependent term, but the definition of $\zeta$ itself and $\Omega$ depend on datasets other than $S_m$).

We note that analyzing a set $\mathcal{P} \times \mathcal{D}$ is of significant interest in its own right and is a natural task in line with the field of computational complexity (e.g., categorizing a set of problem instances into subsets with or without polynomial solvability). Indeed, the situation where theory focuses more on a set and many practical studies focus on each element in the set is prevalent in computer science (see the discussion in Appendix B.1 for more detail). We further validate the logical consistency of our observations in Appendix B.2 and propose several practical roles of generalization theory in Appendix B.3.

4. Direct Analyses of Neural Networks

In the previous section, we extended Problem 1 to Problem 2, and identified the different assumptions in theoretical and empirical studies. Accordingly, this section aims to solve these problems, both in the case of each specified dataset and the case of random unspecified datasets. To achieve this goal, this section presents a direct analysis of neural networks, rather than deriving results about neural networks from more generic theories based on capacity, Rademacher complexity, stability, or robustness. This section focuses on the generalization gap $R[f_{A(S_m)}] - \hat{R}_{S_m}[f_{A(S_m)}]$ with a training dataset $S_m$ and with squared loss. For 0-1 loss with multi-labels, our probabilistic bound is presented in Appendix A.2.
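As a numerical illustration of the quantity this section studies, the following sketch estimates the generalization gap under squared loss: the expected risk minus the empirical (training) risk of a learned model. Everything specific here is an assumption made for the example and not taken from the paper: the data distribution in `sample`, the one-parameter least-squares learner `fit_slope` standing in for the algorithm $A$, and the Monte Carlo estimate of the expected risk from a large fresh sample.

```python
import random


def squared_loss(prediction, target):
    return (prediction - target) ** 2


def empirical_risk(model, dataset):
    """Average squared loss of `model` over a list of (x, y) pairs."""
    return sum(squared_loss(model(x), y) for x, y in dataset) / len(dataset)


def sample(rng, size):
    """Assumed distribution: x uniform on [-1, 1], y = 2x + Gaussian noise."""
    data = []
    for _ in range(size):
        x = rng.uniform(-1.0, 1.0)
        data.append((x, 2.0 * x + rng.gauss(0.0, 0.1)))
    return data


def fit_slope(dataset):
    """Stand-in learning algorithm: least squares for the model y = a * x."""
    a = sum(x * y for x, y in dataset) / sum(x * x for x, _ in dataset)
    return lambda x: a * x


rng = random.Random(0)
train_set = sample(rng, 20)          # plays the role of the training dataset
model = fit_slope(train_set)         # plays the role of the learned model

train_risk = empirical_risk(model, train_set)
# The expected risk is not computable exactly here, so approximate it
# with the average loss on a large independent sample.
expected_risk_estimate = empirical_risk(model, sample(rng, 50_000))
generalization_gap = expected_risk_estimate - train_risk

print(train_risk, expected_risk_estimate, generalization_gap)
```

The gap computed this way belongs to one specific training set and one specific learned model, which is the per-dataset viewpoint the section adopts.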

4.1 Model Description via Deep Paths

We consider general neural networks of any depth that have the structure of a directed acyclic graph (DAG) with ReLU nonlinearity and/or max pooling. This includes any structure of a feedforward network with convolutional and/or fully connected layers, potentially with skip connections. For pedagogical purposes, we first discuss our model description for layered networks without skip connections, and then describe it for DAGs.

Layered nets without skip connections. Let $h^{(l)}(x, w) \in \mathbb{R}^{n_l}$ be the pre-activation vector of the $l$-th hidden layer, where $n_l$ is the width of the $l$-th hidden layer, and $w$ represents the trainable parameters. Let $H$ be the number of hidden layers. For layered networks without skip connections, the pre-activation (or pre-nonlinearity) vector of the $l$-th layer can be written as

$$h^{(l)}(x, w) = W^{(l)} \sigma^{(l-1)}\left(h^{(l-1)}(x, w)\right),$$

with a boundary definition $\sigma^{(0)}\left(h^{(0)}(x, w)\right) \equiv x$, where $\sigma^{(l-1)}$ represents nonlinearity via ReLU and/or max pooling at the $(l-1)$-th hidden layer, and $W^{(l)} \in \mathbb{R}^{n_l \times n_{l-1}}$ is a matrix of weight parameters connecting the $(l-1)$-th layer to the $l$-th layer. Here, $W^{(l)}$ can have any structure (e.g., shared and sparse weights to represent a convolutional layer). Let $\dot{\sigma}^{(l)}(x, w)$ be a vector with each element being 0 or 1 such that $\sigma^{(l)}\left(h^{(l)}(x, w)\right) = \dot{\sigma}^{(l)}(x, w) \circ h^{(l)}(x, w)$, which is an element-wise product of the vectors $\dot{\sigma}^{(l)}(x, w)$ and $h^{(l)}(x, w)$. Then, we can write the pre-activation of the $k$-th output unit at the last layer $l = H + 1$ as

$$h_k^{(H+1)}(x, w) = \sum_{j_H=1}^{n_H} W_{k j_H}^{(H+1)} \dot{\sigma}_{j_H}^{(H)}(x, w)\, h_{j_H}^{(H)}(x, w).$$

By expanding $h^{(l)}(x, w)$ repeatedly and exchanging the sum and product via the distributive law of multiplication,

$$h_k^{(H+1)}(x, w) = \sum_{j_H=1}^{n_H} \sum_{j_{H-1}=1}^{n_{H-1}} \cdots \sum_{j_0=1}^{n_0} W_{k j_H j_{H-1} \cdots j_0}\, \dot{\sigma}_{j_H j_{H-1} \cdots j_1}(x, w)\, x_{j_0},$$

where $W_{k j_H j_{H-1} \cdots j_0} = W_{k j_H}^{(H+1)} \prod_{l=1}^{H} W_{j_l j_{l-1}}^{(l)}$ and $\dot{\sigma}_{j_H j_{H-1} \cdots j_1}(x, w) = \prod_{l=1}^{H} \dot{\sigma}_{j_l}^{(l)}(x, w)$. By merging the indices $j_0, \ldots, j_H$ into $j$ with some bijection between $\{1, \ldots, n_0\} \times \cdots \times \{1, \ldots, n_H\} \ni (j_0, \ldots, j_H)$ and $\{1, \ldots, n_0 n_1 \cdots n_H\} \ni j$,

$$h_k^{(H+1)}(x, w) = \sum_{j} \bar{w}_{k,j}\, \bar{\sigma}_j(x, w)\, \bar{x}_j,$$

where $\bar{w}_{k,j}$, $\bar{\sigma}_j(x, w)$ and $\bar{x}_j$ respectively represent $W_{k j_H j_{H-1} \cdots j_0}$, $\dot{\sigma}_{j_H j_{H-1} \cdots j_1}(x, w)$ and $x_{j_0}$ with the change of indices (i.e., $\bar{\sigma}_j(x, w)$ and $\bar{x}_j$ respectively contain the $n_0$ numbers and $n_1 \cdots n_H$ numbers of the same copy of each $\dot{\sigma}_{j_H j_{H-1} \cdots j_1}(x, w)$ and $x_{j_0}$). Note that $\sum_j$ represents summation over all the paths from the input $x$ to the $k$-th output unit.
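The path expansion above can be checked numerically on a small network. The sketch below is an illustration under stated assumptions, not the authors' code: it uses a fully connected ReLU network without max pooling, arbitrarily chosen layer widths (3, 4, 3, 2), random Gaussian weights, and hypothetical helper names (`relu_mask`, `forward`, `path_sum`). It computes the output pre-activations once by the usual layer-by-layer recursion and once by summing, over every index tuple $(j_0, \ldots, j_H)$, the product of the weights along the path, the 0/1 activation indicators along the path, and the input coordinate $x_{j_0}$.

```python
import random
from itertools import product


def relu_mask(h):
    """0/1 vector m such that relu(h) equals the element-wise product m * h."""
    return [1.0 if v > 0 else 0.0 for v in h]


def matvec(W, v):
    return [sum(w * x for w, x in zip(row, v)) for row in W]


def forward(weights, x):
    """Layer-by-layer recursion; returns output pre-activations and hidden masks."""
    masks, activation = [], x          # boundary case: layer-0 "activation" is x
    for W in weights[:-1]:             # hidden layers 1..H
        h = matvec(W, activation)
        mask = relu_mask(h)
        masks.append(mask)
        activation = [m * v for m, v in zip(mask, h)]
    return matvec(weights[-1], activation), masks


def path_sum(weights, masks, x, k):
    """Sum over all paths (j_0, ..., j_H) from the input to output unit k."""
    H = len(weights) - 1
    widths = [len(x)] + [len(W) for W in weights[:-1]]   # n_0, ..., n_H
    total = 0.0
    for j in product(*(range(n) for n in widths)):
        w_path = weights[H][k][j[H]]                      # output-layer weight
        s_path = 1.0
        for l in range(1, H + 1):
            w_path *= weights[l - 1][j[l]][j[l - 1]]      # weight of layer l on the path
            s_path *= masks[l - 1][j[l]]                  # indicator of layer l on the path
        total += w_path * s_path * x[j[0]]
    return total


rng = random.Random(0)
sizes = [3, 4, 3, 2]                                      # n_0, n_1, n_2, outputs (assumed)
weights = [[[rng.gauss(0.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
x = [rng.gauss(0.0, 1.0) for _ in range(sizes[0])]

outputs, masks = forward(weights, x)
path_outputs = [path_sum(weights, masks, x, k) for k in range(sizes[-1])]
print(outputs)
print(path_outputs)
```

The two lists should agree up to floating-point rounding. The number of paths grows as the product of the layer widths, so this explicit enumeration is only practical for tiny networks; it serves to make the index bookkeeping concrete rather than as a way to evaluate a model.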
