contrast to existing depth separations (Delalleau & Bengio, 2011; Eldan & Shamir, 2016; Telgarsky,
2016; Cohen & Shashua, 2016) in function space, our result shows that even depth-2 networks of
linear size can already represent any labeling of the training data.
The role of implicit regularization. While explicit regularizers like dropout and weight-decay
may not be essential for generalization, it is certainly the case that not all models that fit the training
data well generalize well. Indeed, in neural networks, we almost always choose our model as the
output of running stochastic gradient descent. Appealing to linear models, we analyze how SGD
acts as an implicit regularizer. For linear models, SGD always converges to a solution with small
norm. Hence, the algorithm itself is implicitly regularizing the solution. We further show on small
data sets that even Gaussian kernel methods can generalize well with no regularization. Though this
does not explain why certain architectures generalize better than others, it does suggest that more
investigation is needed to understand exactly what properties are inherited by models trained with
SGD.
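To illustrate the linear-model claim, here is a minimal numpy sketch (our own toy setup, not the paper's experiment; the problem sizes and learning rate are illustrative): running SGD from a zero initialization on an underdetermined least-squares problem converges to the minimum-norm solution X⁺y.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                        # fewer samples than parameters
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

w = np.zeros(d)                       # start at the origin
lr = 0.01
for epoch in range(2000):
    for i in rng.permutation(n):      # one SGD pass over the data
        w -= lr * (X[i] @ w - y[i]) * X[i]   # gradient of 0.5*(x_i @ w - y_i)^2

w_min_norm = np.linalg.pinv(X) @ y    # minimum-norm least-squares solution
print(np.linalg.norm(w - w_min_norm)) # ~0: SGD found the min-norm solution
```

The mechanism is simple: every update is a multiple of a data point, so an iterate started at the origin never leaves the row span of X; within that span there is exactly one interpolating solution, and it is the one of minimum norm.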
1.2 RELATED WORK
Hardt et al. (2016) give an upper bound on the generalization error of a model trained with stochastic
gradient descent in terms of the number of steps gradient descent took. Their analysis goes through
the notion of uniform stability (Mukherjee et al., 2002; Bousquet & Elisseeff, 2002; Poggio et al.,
2004). As we point out in this work, uniform stability of a learning algorithm is independent of
the labeling of the training data. Hence, the concept is not strong enough to distinguish between
models trained on the true labels (small generalization error) and models trained on random labels
(high generalization error). This also highlights why the analysis of Hardt et al. (2016) for
non-convex optimization was rather pessimistic, allowing only very few passes over the data. Our
results show that, empirically, training neural networks is not uniformly stable for many passes over
the data. Consequently, a weaker stability notion is necessary to make further progress along this
direction.
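To see why stability is label-oblivious, recall the (replace-one) notion of uniform stability used by Hardt et al. (2016), following Bousquet & Elisseeff (2002): a randomized algorithm A is β-uniformly stable with respect to a loss ℓ if

\[
\sup_{S \simeq S'} \; \sup_{z} \; \mathbb{E}\bigl[\ell(A_S; z) - \ell(A_{S'}; z)\bigr] \;\le\; \beta,
\]

where S ≃ S' ranges over all pairs of training sets differing in a single example and the expectation is over the randomness of A. Because the supremum is taken over every possible data set, the same β must hold whether the labels are correct or random, so the bound cannot separate the two regimes.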
There has been much work on the representational power of neural networks, starting from universal
approximation theorems for multi-layer perceptrons (Cybenko, 1989; Mhaskar, 1993; Delalleau &
Bengio, 2011; Mhaskar & Poggio, 2016; Eldan & Shamir, 2016; Telgarsky, 2016; Cohen & Shashua,
2016). All of these results are at the population level, characterizing which mathematical functions
certain families of neural networks can express over the entire domain. We instead study the
representational power of neural networks for a finite sample of size n. This leads to a very simple
proof that even O(n)-sized two-layer perceptrons have universal finite-sample expressivity.
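To convey the flavor of the argument, here is a numpy sketch of one such construction (a simplified variant in the spirit of our proof, not the proof itself): a two-layer ReLU network with n hidden units that exactly memorizes n points with arbitrary real labels.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 50, 10
X = rng.normal(size=(n, d))           # n distinct data points
y = rng.normal(size=n)                # arbitrary real labels

a = rng.normal(size=d)                # random direction; projections distinct a.s.
z = X @ a
order = np.argsort(z)
z_sorted, y_sorted = z[order], y[order]

b = np.empty(n)                       # one ReLU breakpoint per sorted projection
b[0] = z_sorted[0] - 1.0
b[1:] = (z_sorted[:-1] + z_sorted[1:]) / 2.0

# H[i, j] = ReLU(a @ x_i - b_j) is lower triangular with a positive diagonal,
# so the system H w = y always has an exact solution.
H = np.maximum(z_sorted[:, None] - b[None, :], 0.0)
w = np.linalg.solve(H, y_sorted)

pred = np.maximum((X @ a)[:, None] - b[None, :], 0.0) @ w
print(np.max(np.abs(pred[order] - y_sorted)))   # ~0: all n labels fit exactly
```

The network computes x ↦ Σ_j w_j max(⟨a, x⟩ − b_j, 0), so it has n hidden units and d + 2n parameters, matching the linear-size claim up to constants.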
2 EFFECTIVE CAPACITY OF NEURAL NETWORKS
Our goal is to understand the effective model capacity of feed-forward neural networks. Toward this
goal, we choose a methodology inspired by non-parametric randomization tests. Specifically, we
take a candidate architecture and train it both on the true data and on a copy of the data in which the
true labels were replaced by random labels. In the second case, there is no longer any relationship
between the instances and the class labels. As a result, learning is impossible. Intuition suggests that
this impossibility should manifest itself clearly during training, e.g., by training not converging or
slowing down substantially. To our surprise, the training process for several standard architectures is
largely unaffected by this transformation of the labels. This poses a conceptual challenge: whatever
justification we previously had for expecting a small generalization error no longer applies to the
case of random labels.
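The following hypothetical sketch mimics the randomization test at toy scale (the actual experiments use Inception, AlexNet, and MLPs on CIFAR10 and ImageNet; here a small numpy MLP on synthetic data stands in): the same architecture is trained once on labels carrying real signal and once on uniformly random labels, and the training curves are compared.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, h = 200, 20, 512
X = rng.normal(size=(n, d))
runs = {
    "true labels":   (X @ rng.normal(size=d) > 0).astype(float),  # learnable rule
    "random labels": rng.integers(0, 2, size=n).astype(float),    # pure noise
}

for name, y in runs.items():
    W1 = rng.normal(size=(d, h)) * np.sqrt(2.0 / d)   # fresh init per run
    w2 = np.zeros(h)
    lr = 0.05
    for step in range(5000):                # full-batch GD on the logistic loss
        A = np.maximum(X @ W1, 0.0)         # hidden ReLU activations
        p = 1.0 / (1.0 + np.exp(-(A @ w2))) # predicted probabilities
        g = (p - y) / n                     # gradient w.r.t. the logits
        grad_w2 = A.T @ g
        grad_W1 = X.T @ (np.outer(g, w2) * (A > 0.0))
        w2 -= lr * grad_w2
        W1 -= lr * grad_W1
    print(name, "train accuracy:", np.mean((p > 0.5) == y))
```

Because the hidden-layer features generically have full row rank here, any labeling is separable in feature space, so at this toy scale both runs typically drive training accuracy to (near) one, mirroring the qualitative finding on real architectures.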
To gain further insight into this phenomenon, we experiment with different levels of randomization,
exploring the continuum between no label noise and completely corrupted labels. We also try out
different randomizations of the inputs (rather than labels), arriving at the same general conclusion.
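For reference, corruptions of this kind can be generated as follows (a hypothetical numpy sketch; the function names are ours and not part of any experiment code):

```python
import numpy as np

rng = np.random.default_rng(0)

def corrupt_labels(y, p, num_classes):
    """Replace a random fraction p of the labels with uniform random classes;
    p = 0 is the clean data set, p = 1 is fully random labels."""
    y = y.copy()
    flip = rng.random(len(y)) < p
    y[flip] = rng.integers(0, num_classes, size=flip.sum())
    return y

def shuffle_pixels(X, per_image=False):
    """Permute pixel positions of flattened images: one fixed permutation
    shared by all images, or a fresh permutation per image ('random pixels')."""
    n, d = X.shape
    if per_image:
        return np.stack([x[rng.permutation(d)] for x in X])
    return X[:, rng.permutation(d)]

def gaussian_inputs(X):
    """Replace the images with Gaussian noise matching the data's moments."""
    return rng.normal(X.mean(), X.std(), size=X.shape)
```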
The experiments are run on two image classification datasets, the CIFAR10 dataset (Krizhevsky
& Hinton, 2009) and the ImageNet (Russakovsky et al., 2015) ILSVRC 2012 dataset. We test the
Inception V3 (Szegedy et al., 2015) architecture on ImageNet and a smaller version of Inception,
AlexNet (Krizhevsky et al., 2012), and MLPs on CIFAR10. Please see Section A in the appendix for
details of the experimental setup.