Define the class of functions on the unit Euclidean ball in $\mathbb{R}^d$,
\[
\mathcal{F}_B = \left\{ f(\cdot\,;\theta) : \|W_i\|_F \le B \right\},
\]
where $\|W_i\|_F$ denotes the Frobenius norm of $W_i$. Then we have
\[
\mathcal{R}_n(\mathcal{F}_B) \lesssim \frac{\sqrt{L}\, B^L}{\sqrt{n}}.
\]
This result is from [GRS18], which also shows that it is possible to remove the $\sqrt{L}$ factor at the cost of a worse dependence on $n$. See also [NTS15].
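As a quick illustration of what this bound implies (with a target accuracy $\epsilon$ introduced here purely for the sake of the calculation), rearranging gives
\[
\frac{\sqrt{L}\, B^L}{\sqrt{n}} \le \epsilon
\quad\Longleftrightarrow\quad
n \ge \frac{L\, B^{2L}}{\epsilon^2},
\]
so the sample size needed to drive this complexity bound below $\epsilon$ grows exponentially with the depth $L$ whenever $B > 1$.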
2.9 The mismatch between benign overfitting and uniform convergence
It is instructive to consider the implications of the generalization bounds we have reviewed in this section for the phenomenon of benign overfitting, which has been observed in deep learning. For concreteness, suppose that $\ell$ is the quadratic loss. Consider a neural network function $\widehat f \in \mathcal{F}$ chosen so that $\widehat L(\widehat f) = 0$. For an appropriate complexity hierarchy $\mathcal{F} = \bigcup_r \mathcal{F}_r$, suppose that $\widehat f$ is chosen to minimize the complexity $r(\widehat f)$, defined as the smallest $r$ for which $\widehat f \in \mathcal{F}_r$, subject to the interpolation constraint $\widehat L(\widehat f) = 0$. What do the bounds based on uniform convergence imply about the excess risk $L(\widehat f) - \inf_{f \in \mathcal{F}} L(f)$ of this minimum-complexity interpolant?
Theorems 2.9, 2.10, and 2.11 imply upper bounds on risk in terms of various notions of scale of network parameters. For these bounds to be meaningful for a given probability distribution, there must be an interpolating $\widehat f$ for which the complexity $r(\widehat f)$ grows suitably slowly with the sample size $n$, so that the excess risk bounds converge to zero.
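To make "suitably slowly" concrete (using the norm-bounded classes $\mathcal{F}_B$ from the bound above as the hierarchy, with the index $r$ identified with the norm bound $B$; this identification is made here only for illustration), suppose the interpolant found at sample size $n$ lies in $\mathcal{F}_{B_n}$. The corresponding complexity bound is of order $\sqrt{L}\, B_n^L/\sqrt{n}$, which tends to zero precisely when
\[
B_n = o\!\left(n^{1/(2L)}\right),
\]
so the weight norms of the interpolating network may grow with $n$, but only at this slow polynomial rate.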
An easy example is when there is an $f^* \in \mathcal{F}_r$ with $L(f^*) = 0$, where $r$ is a fixed complexity. Notice that this implies not just that the conditional expectation is in $\mathcal{F}_r$, but that there is no noise, that is, almost surely $y = f^*(x)$. In that case, if we choose $\widehat f$ as the minimum-complexity interpolant satisfying $\widehat L(\widehat f) = 0$, then its complexity will certainly satisfy $r(\widehat f) \le r(f^*) = r$. And then, as the sample size $n$ increases, $L(\widehat f)$ will approach zero. In fact, since $\widehat L(\widehat f) = 0$, Theorem 2.2 implies a faster rate in this case: $L(\widehat f) = O\big((\log n)^4\, \bar{\mathcal{R}}_n^2(\mathcal{F}_r)\big)$.
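To see why this is a faster rate, suppose for illustration that the worst-case Rademacher complexity satisfies $\bar{\mathcal{R}}_n(\mathcal{F}_r) \le C_r/\sqrt{n}$ for some constant $C_r$ depending only on $r$ (the scaling exhibited, for instance, by the norm-based bound above). Then the bound reads
\[
L(\widehat f) = O\!\left(\frac{(\log n)^4\, C_r^2}{n}\right),
\]
a rate of order $1/n$ up to logarithmic factors, rather than the $1/\sqrt{n}$ rate obtained from a Rademacher complexity bound without the interpolation condition.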
Theorem 2.3 shows that if we balance the complexity with the fit to the training data, then we can hope to enjoy excess risk as good as the best bound for any $\mathcal{F}_r$ in the complexity hierarchy. If we always choose a perfect fit to the data, there is no trade-off between complexity and empirical risk; but when there is a prediction rule $f^*$ with finite complexity and zero risk, then once the sample size is sufficiently large, the best trade-off does correspond to a perfect fit to the data. To summarize: when there is no noise, that is, when $y = f^*(x)$ and $f^* \in \mathcal{F}$, classical theory shows that a minimum-complexity interpolant $\widehat f \in \mathcal{F}$ will have risk $L(\widehat f)$ converging to zero as the sample size increases.
But what if there is noise, that is, there is no deterministic relationship between $x$ and $y$? Then it turns out that the bounds on the excess risk $L(\widehat f) - L(f^*_{\mathcal{F}})$ presented in this section must become vacuous: they can never decrease below a constant, no matter how large the sample size. This is because these bounds do not rely on any properties of the distribution on $\mathcal{X}$, and hence are also true in a fixed design setting, where the excess risk is at least the noise level.
To make this precise, fix $x_1, \ldots, x_n \in \mathcal{X}$ and define the fixed design risk
\[
L_{|x}(f) := \frac{1}{n} \sum_{i=1}^n \mathbb{E}\left[ \ell(f(x_i), y) \,\middle|\, x = x_i \right].
\]
Then the decomposition (4) extends to this risk: for any $\widehat f$ and $f^*$,
\[
L_{|x}(\widehat f) - L_{|x}(f^*)
= \left[ L_{|x}(\widehat f) - \widehat L(\widehat f) \right]
+ \left[ \widehat L(\widehat f) - \widehat L(f^*) \right]
+ \left[ \widehat L(f^*) - L_{|x}(f^*) \right].
\]
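For intuition about why this forces the bounds to be vacuous under noise, here is a minimal sketch for the quadratic loss, under the additive-noise assumption (introduced here only for illustration) that $y = f^*(x) + \xi$ with $\mathbb{E}[\xi \mid x] = 0$ and $\mathrm{Var}(\xi \mid x) = \sigma^2 > 0$. Conditioning on $x = x_i$,
\[
\mathbb{E}\left[ (f(x_i) - y)^2 \mid x = x_i \right] = (f(x_i) - f^*(x_i))^2 + \sigma^2,
\]
so $L_{|x}(f) = \frac{1}{n} \sum_{i=1}^n (f(x_i) - f^*(x_i))^2 + \sigma^2 \ge \sigma^2$ for every $f$. If $\widehat f$ interpolates the noisy labels, then $\widehat L(\widehat f) = 0$, and the first bracket above equals $L_{|x}(\widehat f) \ge \sigma^2$: the term that uniform convergence is meant to control cannot fall below the noise level, however large $n$ is. Moreover, since $\widehat f(x_i) = y_i = f^*(x_i) + \xi_i$, the fixed design excess risk is $\frac{1}{n} \sum_{i=1}^n \xi_i^2$, which concentrates around $\sigma^2$.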