A Farewell to the Bias-Variance Tradeoff?
An Overview of the Theory of Overparameterized Machine Learning
Yehuda Dar*        Vidya Muthukumar†        Richard G. Baraniuk‡
Abstract
The rapid recent progress in machine learning (ML) has raised a number of scientific questions
that challenge the longstanding dogma of the field. One of the most important riddles is the good
empirical generalization of overparameterized models. Overparameterized models are excessively
complex with respect to the size of the training dataset, which results in them perfectly fitting
(i.e., interpolating) the training data, which is usually noisy. Such interpolation of noisy data is
traditionally associated with detrimental overfitting, and yet a wide range of interpolating models
– from simple linear models to deep neural networks – have recently been observed to generalize
extremely well on fresh test data. Indeed, the recently discovered double descent phenomenon
has revealed that highly overparameterized models often improve over the best underparameterized
model in test performance.
Understanding learning in this overparameterized regime requires new theory and foundational
empirical studies, even for the simplest case of the linear model. The underpinnings of this under-
standing have been laid in very recent analyses of overparameterized linear regression and related
statistical learning tasks, which resulted in precise analytic characterizations of double descent.
This paper provides a succinct overview of this emerging theory of overparameterized ML (hence-
forth abbreviated as TOPML) that explains these recent findings through a statistical signal pro-
cessing perspective. We emphasize the unique aspects that define the TOPML research area as a
subfield of modern ML theory and outline interesting open questions that remain.
1. Introduction
Deep learning techniques have revolutionized the way many engineering and scientific problems
are addressed, establishing the data-driven approach as a leading choice for practical success. Con-
temporary deep learning methods are extreme, extensively developed versions of classical machine
learning (ML) settings that were previously restricted by limited computational resources and in-
sufficient availability of training data. Established practice today is to learn a highly complex deep
neural network (DNN) from a set of training examples that, while itself large, is quite small rela-
tive to the number of parameters in the DNN. While such overparameterized DNNs comprise the
state-of-the-art in ML practice, the fundamental reasons for this practical success remain unclear.
Particularly mysterious are two empirical observations: i) the explicit benefit (in generalization) of
adding more parameters into the model, and ii) the ability of these models to generalize well despite
perfectly fitting even noisy training data. These observations endure across diverse architectures in
modern ML — while they were first made for complex, state-of-the-art DNNs (Neyshabur et al.,
* Electrical and Computer Engineering Department, Rice University. E-mail: ydar@rice.edu
† Schools of Electrical and Computer Engineering, and Industrial and Systems Engineering, Georgia Institute of Technology. E-mail: vmuthukumar8@gatech.edu
‡ Electrical and Computer Engineering Department, Rice University. E-mail: richb@rice.edu

Figure 1: Double descent of test errors (i.e., generalization errors) with respect to the complex-
ity of the learned model. TOPML studies often consider settings in which the learned
model complexity is expressed as the number of (independently tunable) parameters in
the model. In this qualitative demonstration, the global minimum of the test error is
achieved by maximal overparameterization.
2014; Zhang et al., 2017), they have since been unearthed in far simpler model families including
wide neural networks, kernel methods, and even linear models (Belkin et al., 2018b; Spigler et al.,
2019; Geiger et al., 2020; Belkin et al., 2019a).
In this paper, we survey the recently developed theory of overparameterized machine learning
(henceforth abbreviated as TOPML) that establishes foundational mathematical principles underly-
ing phenomena related to interpolation (i.e., perfect fitting) of training data. We will shortly provide
a formal definition of overparameterized ML, but describe here some salient properties that a model
must satisfy to qualify as overparameterized. First, such a model must be highly complex in the
sense that its number of independently tunable parameters¹ is significantly higher than the number
of examples in the training dataset. Second, such a model must not be explicitly regularized in
any way. DNNs are popular instances of overparameterized models that are usually trained with-
out explicit regularization (see, e.g., Neyshabur et al., 2014; Zhang et al., 2017). This combination
of overparameterization and lack of explicit regularization yields a learned model that interpolates
the training examples and therefore achieves zero training error on any training dataset. The train-
ing data is usually considered to be noisy realizations from an underlying data class (i.e., noisy
data model). Hence, interpolating models perfectly fit both the underlying data and the noise in
their training examples. Conventional statistical learning has always associated such perfect fitting
of noise with poor generalization performance (e.g., Friedman et al., 2001, p. 194); hence, it is re-
markable that these interpolating solutions often generalize well to new test data beyond the training
dataset.
The observation that overparameterized models interpolate noise and yet generalize well was
first elucidated in pioneering experiments by Neyshabur et al. (2014); Zhang et al. (2017). These
findings sparked widespread conversation across deep learning practitioners and theorists, and in-
spired several new research directions in deep learning theory. However, meaningful progress in understanding the inner workings of overparameterized DNNs has remained elusive² owing to their multiple challenging aspects. Indeed, the way for recent TOPML research was largely paved by the discovery of similar empirical behavior in far simpler parametric model families (Belkin et al., 2019a; Geiger et al., 2020; Spigler et al., 2019; Advani et al., 2020).

1. The number of parameters is an accepted measure of learned model complexity for simple cases such as ordinary least squares regression in the underparameterized regime and underlies classical complexity measures like Akaike’s information criterion (Akaike, 1998). For overparameterized models, the correct definition of learned model complexity is an open question at the heart of TOPML research. See Section 7.3 for a detailed exposition.
In particular, Belkin et al. (2019a); Spigler et al. (2019) explicitly evaluated the test error of
these model families as a function of the number of tunable parameters and showed that it exhibits
a remarkable double descent behavior (pictured in Figure 1). The first descent occurs in the under-
parameterized regime and is a consequence of the classical bias–variance tradeoff; indeed, the test
error peaks when the learned model first becomes sufficiently complex to interpolate the training
data. More unusual is the behavior in the overparameterized regime: there, the test error is observed
to decrease monotonically with the number of parameters, forming the second descent in the double
descent curve. The double descent phenomenon suggests that the classical bias–variance tradeoff,
a cornerstone of conventional ML, is predictive only in the underparameterized regime where the
learned model is not sufficiently complex to interpolate the training data. A fascinating implica-
tion of the double descent phenomenon is that the global minimum of the generalization error can
be achieved by a highly overparameterized model even without explicit regularization, and despite
perfect fitting of noisy training data.
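To make the double descent curve of Figure 1 concrete, here is a minimal numerical sketch (our own illustration, not an experiment from the papers cited above): it fits minimum ℓ2-norm least squares on top of a random ReLU feature map and sweeps the number of features across the interpolation threshold. The data model, the feature map, and all dimensions are arbitrary choices for demonstration; with settings like these, the test error typically peaks near width ≈ n and then descends again in the overparameterized regime.

```python
import numpy as np

rng = np.random.default_rng(0)

def double_descent_curve(n_train=40, n_test=2000, d=60, noise_std=0.5,
                         widths=(5, 10, 20, 30, 40, 50, 80, 160, 320)):
    """Test MSE of min-norm least squares on random ReLU features,
    swept across widths straddling the interpolation threshold (width = n_train)."""
    beta = rng.normal(size=d) / np.sqrt(d)                       # ground-truth linear signal
    X_tr, X_te = rng.normal(size=(n_train, d)), rng.normal(size=(n_test, d))
    y_tr = X_tr @ beta + noise_std * rng.normal(size=n_train)    # noisy training labels
    y_te = X_te @ beta + noise_std * rng.normal(size=n_test)

    errors = {}
    for p in widths:
        W = rng.normal(size=(d, p)) / np.sqrt(d)                 # random feature map x -> relu(x W)
        F_tr, F_te = np.maximum(X_tr @ W, 0.0), np.maximum(X_te @ W, 0.0)
        theta = np.linalg.pinv(F_tr) @ y_tr                      # minimum ℓ2-norm least-squares fit
        errors[p] = np.mean((F_te @ theta - y_te) ** 2)
    return errors

for p, err in double_descent_curve().items():
    print(f"width {p:4d}:  test MSE {err:.3f}")
```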
Somewhat surprisingly, a good explanation for the double descent behavior did not exist until
recently even for the simplest case of the overparameterized linear model — for example, classi-
cal theories that aim to examine the optimal number of variables in a linear regression task (e.g.,
Breiman and Freedman, 1983) only considered the underparameterized regime³. Particularly mys-
terious was the behavior of linear models that interpolate noise in data. The aforementioned double
descent experiments (Belkin et al., 2019a; Spigler et al., 2019) inspired substantial TOPML research
in the simple linear model⁴. The theoretical foundations of this research originate in the studies by
Belkin et al. (2020); Bartlett et al. (2020); Hastie et al. (2019); Muthukumar et al. (2020b) in early
2019. These studies provide a precise analytical characterization of the test error of minimum-norm
solutions to the task of overparameterized linear regression with the squared loss (henceforth abbre-
viated as LS regression). Of course, linear regression is a much simpler task than learning a deep
neural network; however, it is a natural starting point for theory and itself of great independent in-
terest for a number of reasons. First, minimum ℓ2-norm solutions to linear regression do not include explicit regularization to avoid interpolating noise in training data, unlike more classical estimators such as ridge and the Lasso. Second, the closed form of the minimum ℓ2-norm solution to linear regression is equivalent to the solution obtained by gradient descent when the optimization process is initialized at zero (Engl et al., 1996); thus, this solution arises easily and ubiquitously in practice.
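The equivalence noted above can be checked numerically. The sketch below (our own illustration, with arbitrary dimensions and step size, not code from the cited works) compares the closed-form minimum ℓ2-norm interpolator, computed via the pseudoinverse, with plain gradient descent on the squared loss initialized at zero in an overparameterized linear regression.

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 100                       # overparameterized: d parameters, only n < d examples
X = rng.normal(size=(n, d))
y = rng.normal(size=n)

# Closed-form minimum ℓ2-norm interpolating solution: theta = X^+ y
theta_minnorm = np.linalg.pinv(X) @ y

# Gradient descent on the squared loss (1/2)||X theta - y||^2, initialized at zero
theta = np.zeros(d)
step = 1.0 / np.linalg.norm(X, ord=2) ** 2       # step size below the smoothness constant
for _ in range(20000):
    theta -= step * X.T @ (X @ theta - y)

print("training residual of GD solution:", np.linalg.norm(X @ theta - y))
print("distance from min-norm solution: ", np.linalg.norm(theta - theta_minnorm))
# The iterates stay in the row space of X, so gradient descent from zero converges
# to the minimum ℓ2-norm interpolator; both printed quantities should be near zero.
```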
2. For example, the original paper (Zhang et al., 2017) now appears in a 2021 version of Communications of the
ACM (Zhang et al., 2021). The latter survey highlights that almost all of the questions posed in the original paper
remain open.
3. This problem is usually called principal component regression. Recently, Xu and Hsu (2019) showed that double
descent can occur with principal component regression in the overparameterized regime.
4. The pioneering experiments by Belkin et al. (2018b) also inspired a parallel thread of mathematical research on
harmless interpolation of noise by nonparametric and local methods beginning with Belkin et al. (2018a, 2019b);
Liang and Rakhlin (2020). These models are not explicitly parameterized and are not surveyed in this paper, but
possess possible connections to TOPML research of future interest.

1.1. Contents of this paper
In this paper, we survey the emerging field of TOPML research with a principal focus on founda-
tional principles developed in the past few years. Compared to other recent surveys (Bartlett et al.,
2021; Belkin, 2021), we take a more elementary signal processing perspective to elucidate these
principles. Formally, we define the TOPML research area as the sub-field of ML theory where
1. there is clear consideration of exact or near interpolation of training data
2. the learned model complexity is high with respect to the training dataset size. Note that the
complexity of the learned model is typically affected by (implicit or explicit) regularization
aspects as a consequence of the learning process.
Importantly, this definition highlights that while TOPML was inspired by observations in deep
learning, several aspects of deep learning theory do not involve overparameterization. More strik-
ingly, TOPML is relevant to diverse model families other than DNNs.
The first studies of TOPML were conducted for the linear regression task; accordingly, much of
our treatment centers around a comprehensive survey of overparameterized linear regression. How-
ever, TOPML goes well beyond the linear regression task. Overparameterization naturally arises in
diverse ML tasks, such as classification (e.g., Muthukumar et al., 2020a), subspace learning for di-
mensionality reduction (Dar et al., 2020), data generation (Luzi et al., 2021), and dictionary learning
for sparse representations (Sulam et al., 2020). In addition, overparameterization arises in various
learning settings that are more complex than elementary fully supervised learning: unsupervised and
semi-supervised learning (Dar et al., 2020), transfer learning (Dar and Baraniuk, 2020), pruning of
learned models (Chang et al., 2021), and others. We also survey recent work in these topics.
This paper is organized as follows. In Section 2 we introduce the basics of interpolating so-
lutions in overparameterized learning, as a machine learning domain that is outside the scope of
the classical bias–variance tradeoff. In Section 3 we overview recent results on overparameterized
regression. Here, we provide intuitive explanations of the fundamentals of overparameterized learn-
ing by taking a signal processing perspective. In Section 4 we review the state-of-the-art findings on
overparameterized classification. In Section 5 we overview recent work on overparameterized sub-
space learning. In Section 6 we examine recent research on overparameterized learning problems
beyond regression and classification. In Section 7 we discuss the main open questions in the theory
of overparameterized ML.
2. Beyond the classical bias–variance tradeoff: The realm of interpolating solutions
We start by setting up basic notation and definitions for both underparameterized and overparam-
eterized models. Consider a supervised learning setting with n training examples in the dataset
D = {(x_i, y_i)}_{i=1}^n. The examples in D reflect an unknown functional relationship f_true : X → Y that in its ideal, clean form maps a d-dimensional input vector x ∈ X ⊆ R^d to an output

y = f_true(x),    (1)
where the output domain Y depends on the specific problem (e.g., Y = R in regression with scalar
response values, and Y = {0, 1} in binary classification). The mathematical notations and analysis
in this paper consider a one-dimensional output domain Y, unless otherwise specified (e.g., in Section 5). One can also perceive f_true as a signal that should be estimated from the measurements in D.
Importantly, it is commonly the case that the output y is degraded by noise. For example, a popular
noisy data model for regression extends (1) into
y = f_true(x) + ε,    (2)

where ε ∼ P_ε is a scalar-valued noise term that has zero mean and variance σ^2. Moreover, the noise ε is also independent of the input x ∼ P_x.

The input vector can be perceived as a random vector x ∼ P_x that, together with the unknown mapping f_true and the relevant noise model, induces a probability distribution P_{x,y} over X × Y for the input-output pair (x, y). Moreover, the n training examples {(x_i, y_i)}_{i=1}^n in D are usually considered to be i.i.d. random draws from P_{x,y}.
The goal of supervised learning is to provide a mapping f : X → Y such that f(x) constitutes
a good estimate of the true output y for a new sample (x, y) ∼ P_{x,y}. This mapping f is learned from the n training examples in the dataset D; consequently, a mapping f that operates well on a new (x, y) ∼ P_{x,y} beyond the training dataset is said to generalize well. Let us now formulate the learning process and its performance evaluation. Consider a loss function L : Y × Y → R_{≥0} that evaluates the distance, or error, between two elements in the output domain. Then, the learning of f is done by minimizing the training error given by

E_train(f) = (1/n) ∑_{i=1}^{n} L(f(x_i), y_i).    (3)
The training error is simply the empirical average (over the training data) of the error in estimating
an output y given an input x. The search for the mapping f that minimizes E_train(f) is done within a limited set F of mappings that are induced by specific computational architectures. For example, in linear regression with scalar-valued response, F denotes the set of all the linear mappings from X = R^d to Y = R. Accordingly, the training process can be written as

f̂ = arg min_{f ∈ F} E_train(f)    (4)

where f̂ denotes the mapping with minimal training error among the constrained set of mappings
F. The generalization performance of a mapping f is evaluated using the test error
E_test(f) = E_{x,y}[L(f(x), y)]    (5)

where (x, y) ∼ P_{x,y} are random test data. The best generalization performance corresponds to the lowest test error, which is achieved by the optimal mapping

f_opt = arg min_{f : X → Y} E_test(f).    (6)
Note that the optimal mapping as defined above does not posit any restrictions on the function
class. If the solution space F of the constrained optimization problem in (4) includes the optimal
mapping f_opt, the learning architecture is considered well specified. Otherwise, the training procedure cannot possibly induce the optimal mapping f_opt, and the learning architecture is said to be misspecified.
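As a concrete and entirely illustrative instance of the setup in Eqs. (1)–(6), the sketch below picks a quadratic f_true with additive Gaussian noise, draws i.i.d. training data as in (2), minimizes the training error (3) over an affine class F as in (4), and estimates the test error (5) by Monte Carlo. Because the affine class excludes the quadratic f_opt, it also illustrates a misspecified architecture; all function names and parameter values are our own choices, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

def f_true(x):
    """Unknown ground-truth mapping (a quadratic, chosen only for illustration)."""
    return x[..., 0] ** 2 - 0.5 * x[..., 1]

def sample(n, d=2, noise_std=0.3):
    """Draw n i.i.d. pairs from the noisy data model y = f_true(x) + eps, Eq. (2)."""
    x = rng.normal(size=(n, d))
    return x, f_true(x) + noise_std * rng.normal(size=n)

X_tr, y_tr = sample(50)          # training dataset D
X_te, y_te = sample(100_000)     # large fresh sample approximating P_{x,y}

# F = affine maps; minimizing the squared-loss training error (3) over F, i.e. (4),
# is ordinary least squares on inputs augmented with a constant feature.
A = lambda X: np.hstack([X, np.ones((X.shape[0], 1))])
w, *_ = np.linalg.lstsq(A(X_tr), y_tr, rcond=None)
f_hat = lambda X: A(X) @ w

E_train = np.mean((f_hat(X_tr) - y_tr) ** 2)   # Eq. (3)
E_test = np.mean((f_hat(X_te) - y_te) ** 2)    # Monte Carlo estimate of Eq. (5)
print(f"E_train = {E_train:.3f}, E_test = {E_test:.3f}")
# f_opt is quadratic, so the affine class F is misspecified here; E_test stays well
# above the noise floor noise_std**2 no matter how large the training set becomes.
```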