contrast to existing depth separations (Delalleau & Bengio, 2011; Eldan & Shamir, 2016; Telgarsky,
2016; Cohen & Shashua, 2016) in function space, our result shows that even depth-2 networks of
linear size can already represent any labeling of the training data.
The role of implicit regularization. While explicit regularizers like dropout and weight-decay
may not be essential for generalization, it is certainly the case that not all models that fit the training
data well generalize well. Indeed, in neural networks, we almost always choose our model as the
output of running stochastic gradient descent. Appealing to linear models, we analyze how SGD
acts as an implicit regularizer. For linear models, SGD always converges to a solution with small
norm. Hence, the algorithm itself is implicitly regularizing the solution. We further show on small
data sets that even Gaussian kernel methods can generalize well with no regularization. Though this
does not explain why certain architectures generalize better than others, it does suggest that more
investigation is needed to understand exactly what properties are inherited by models trained with
SGD.
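To illustrate the linear-model claim, here is a minimal numpy sketch (our own toy setup, not the paper's experiment; the problem sizes and learning rate are illustrative): running SGD from a zero initialization on an underdetermined least-squares problem converges to the minimum-norm solution X⁺y.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                        # fewer samples than parameters
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

w = np.zeros(d)                       # start at the origin
lr = 0.01
for epoch in range(2000):
    for i in rng.permutation(n):      # one SGD pass over the data
        w -= lr * (X[i] @ w - y[i]) * X[i]   # gradient of 0.5*(x_i @ w - y_i)^2

w_min_norm = np.linalg.pinv(X) @ y    # minimum-norm least-squares solution
print(np.linalg.norm(w - w_min_norm)) # ~0: SGD found the min-norm solution
```

The mechanism is simple: every update is a multiple of a data point, so an iterate started at the origin never leaves the row span of X; within that span there is exactly one interpolating solution, and it is the one of minimum norm.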
1.2 RELATED WORK
Hardt et al. (2016) give an upper bound on the generalization error of a model trained with stochastic
gradient descent in terms of the number of steps gradient descent took. Their analysis goes through
the notion of uniform stability (Mukherjee et al., 2002; Bousquet & Elisseeff, 2002; Poggio et al.,
2004). As we point out in this work, uniform stability of a learning algorithm is independent of
the labeling of the training data. Hence, the concept is not strong enough to distinguish between
models trained on the true labels (small generalization error) and models trained on random labels
(high generalization error). This also highlights why the analysis of Hardt et al. (2016) for
non-convex optimization was rather pessimistic, allowing only very few passes over the data. Our
results show that, empirically, training neural networks is not uniformly stable for many passes over
the data. Consequently, a weaker stability notion is necessary to make further progress along this
direction.
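To see why stability is label-oblivious, recall the (replace-one) notion of uniform stability used by Hardt et al. (2016), following Bousquet & Elisseeff (2002): a randomized algorithm A is β-uniformly stable with respect to a loss ℓ if

\[
\sup_{S \simeq S'} \; \sup_{z} \; \mathbb{E}\bigl[\ell(A_S; z) - \ell(A_{S'}; z)\bigr] \;\le\; \beta,
\]

where S ≃ S' ranges over all pairs of training sets differing in a single example and the expectation is over the randomness of A. Because the supremum is taken over every possible data set, the same β must hold whether the labels are correct or random, so the bound cannot separate the two regimes.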
There has been much work on the representational power of neural networks, starting from universal
approximation theorems for multi-layer perceptrons (Cybenko, 1989; Mhaskar, 1993; Delalleau &
Bengio, 2011; Mhaskar & Poggio, 2016; Eldan & Shamir, 2016; Telgarsky, 2016; Cohen & Shashua,
2016). All of these results are at the population level, characterizing which mathematical functions
certain families of neural networks can express over the entire domain. We instead study the
representational power of neural networks for a finite sample of size n. This leads to a very simple
proof that even O(n)-sized two-layer perceptrons have universal finite-sample expressivity.
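To convey the flavor of the argument, here is a numpy sketch of one such construction (a simplified variant in the spirit of our proof, not the proof itself): a two-layer ReLU network with n hidden units that exactly memorizes n points with arbitrary real labels.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 10
X = rng.normal(size=(n, d))           # n distinct data points
y = rng.normal(size=n)                # arbitrary real labels

a = rng.normal(size=d)                # random direction; projections distinct a.s.
z = X @ a
order = np.argsort(z)
z_sorted, y_sorted = z[order], y[order]

b = np.empty(n)                       # one ReLU breakpoint per sorted projection
b[0] = z_sorted[0] - 1.0
b[1:] = (z_sorted[:-1] + z_sorted[1:]) / 2.0

# H[i, j] = ReLU(a @ x_i - b_j) is lower triangular with a positive diagonal,
# so the system H w = y always has an exact solution.
H = np.maximum(z_sorted[:, None] - b[None, :], 0.0)
w = np.linalg.solve(H, y_sorted)

pred = np.maximum((X @ a)[:, None] - b[None, :], 0.0) @ w
print(np.max(np.abs(pred[order] - y_sorted)))   # ~0: all n labels fit exactly
```

The network computes x ↦ Σ_j w_j max(⟨a, x⟩ − b_j, 0), so it has n hidden units and d + 2n parameters, matching the linear-size claim up to constants.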
2 EFFECTIVE CAPACITY OF NEURAL NETWORKS
Our goal is to understand the effective model capacity of feed-forward neural networks. Toward this
goal, we choose a methodology inspired by non-parametric randomization tests. Specifically, we
take a candidate architecture and train it both on the true data and on a copy of the data in which the
true labels were replaced by random labels. In the second case, there is no longer any relationship
between the instances and the class labels. As a result, learning is impossible. Intuition suggests that
this impossibility should manifest itself clearly during training, e.g., by training not converging or
slowing down substantially. To our surprise, the training process for several standard architectures is
largely unaffected by this transformation of the labels. This poses a conceptual challenge: whatever
justification we previously had for expecting a small generalization error no longer applies to the
case of random labels.
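The following hypothetical sketch mimics the randomization test at toy scale (the actual experiments use Inception, AlexNet, and MLPs on CIFAR10 and ImageNet; here a small numpy MLP on synthetic data stands in): the same architecture is trained once on labels carrying real signal and once on uniformly random labels, and the training curves are compared.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 200, 20, 512
X = rng.normal(size=(n, d))
runs = {
    "true labels":   (X @ rng.normal(size=d) > 0).astype(float),  # learnable rule
    "random labels": rng.integers(0, 2, size=n).astype(float),    # pure noise
}

for name, y in runs.items():
    W1 = rng.normal(size=(d, h)) * np.sqrt(2.0 / d)   # fresh init per run
    w2 = np.zeros(h)
    lr = 0.05
    for step in range(5000):                # full-batch GD on the logistic loss
        A = np.maximum(X @ W1, 0.0)         # hidden ReLU activations
        p = 1.0 / (1.0 + np.exp(-(A @ w2))) # predicted probabilities
        g = (p - y) / n                     # gradient w.r.t. the logits
        grad_w2 = A.T @ g
        grad_W1 = X.T @ (np.outer(g, w2) * (A > 0.0))
        w2 -= lr * grad_w2
        W1 -= lr * grad_W1
    print(name, "train accuracy:", np.mean((p > 0.5) == y))
```

Because the hidden-layer features generically have full row rank here, any labeling is separable in feature space, so at this toy scale both runs typically drive training accuracy to (near) one, mirroring the qualitative finding on real architectures.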
To gain further insight into this phenomenon, we experiment with different levels of randomization,
exploring the continuum between no label noise and completely corrupted labels. We also try out
different randomizations of the inputs (rather than labels), arriving at the same general conclusion.
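For reference, corruptions of this kind can be generated as follows (a hypothetical numpy sketch; the function names are ours and not part of any experiment code):

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt_labels(y, p, num_classes):
    """Replace a random fraction p of the labels with uniform random classes;
    p = 0 is the clean data set, p = 1 is fully random labels."""
    y = y.copy()
    flip = rng.random(len(y)) < p
    y[flip] = rng.integers(0, num_classes, size=flip.sum())
    return y

def shuffle_pixels(X, per_image=False):
    """Permute pixel positions of flattened images: one fixed permutation
    shared by all images, or a fresh permutation per image ('random pixels')."""
    n, d = X.shape
    if per_image:
        return np.stack([x[rng.permutation(d)] for x in X])
    return X[:, rng.permutation(d)]

def gaussian_inputs(X):
    """Replace the images with Gaussian noise matching the data's moments."""
    return rng.normal(X.mean(), X.std(), size=X.shape)
```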
The experiments are run on two image classification datasets, the CIFAR10 dataset (Krizhevsky
& Hinton, 2009) and the ImageNet (Russakovsky et al., 2015) ILSVRC 2012 dataset. We test the
Inception V3 (Szegedy et al., 2015) architecture on ImageNet and a smaller version of Inception,
AlexNet (Krizhevsky et al., 2012), and MLPs on CIFAR10. Please see Section A in the appendix for
details of the experimental setup.