
CO11CH22_Ganguli ARjats.cls February 13, 2020 10:27

Annual Review of Condensed Matter Physics

Statistical Mechanics of Deep Learning

Yasaman Bahri,¹ Jonathan Kadmon,² Jeffrey Pennington,¹ Sam S. Schoenholz,¹ Jascha Sohl-Dickstein,¹ and Surya Ganguli¹,²

¹Google Brain, Google Inc., Mountain View, California 94043, USA
²Department of Applied Physics, Stanford University, Stanford, California 94035, USA; email: sganguli@stanford.edu

Annu. Rev. Condens. Matter Phys. 2020. 11:501–28

First published as a Review in Advance on December 9, 2019

The Annual Review of Condensed Matter Physics is online at conmatphys.annualreviews.org

https://doi.org/10.1146/annurev-conmatphys-031119-050745

Copyright © 2020 by Annual Reviews. All rights reserved

Keywords

neural networks, machine learning, dynamical phase transitions, chaos, spin glasses, jamming, random matrix theory, interacting particle systems, nonequilibrium statistical mechanics

Abstract

The recent striking success of deep neural networks in machine learning raises profound questions about the theoretical principles underlying their success. For example, what can such deep networks compute? How can we train them? How does information propagate through them? Why can they generalize? And how can we teach them to imagine? We review recent work in which methods of physical analysis rooted in statistical mechanics have begun to provide conceptual insights into these questions. These insights yield connections between deep learning and diverse physical and mathematical topics, including random landscapes, spin glasses, jamming, dynamical phase transitions, chaos, Riemannian geometry, random matrix theory, free probability, and nonequilibrium statistical mechanics. Indeed, the fields of statistical mechanics and machine learning have long enjoyed a rich history of strongly coupled interactions, and recent advances at the intersection of statistical mechanics and deep learning suggest these interactions will only deepen going forward.


1. INTRODUCTION

Deep neural networks, with multiple hidden layers (1), have achieved remarkable success across many fields, including machine vision (2), speech recognition (3), natural language processing (4), reinforcement learning (5), and even modeling of animals and humans themselves in neuroscience (6, 7), psychology (8, 9), and education (10). However, the methods used to arrive at successful deep neural networks remain a highly practiced art, filled with many heuristics, rather than an exact science. This raises exciting challenges and opportunities for the theoretical sciences in creating a mature theory of deep neural networks that is powerful enough to guide a wide set of engineering design choices in deep learning. Although we are still currently far from any such mature theory, a recently emerged body of work at the intersection of statistical mechanics and deep learning has begun to provide theoretical insights into how deep networks learn and compute, sometimes suggesting new and improved methods for deep learning driven by these theoretical insights.

Here, we review this body of work, which builds upon a long and rich history of interaction between statistical mechanics and machine learning (11–15). Interestingly, this body of work leads to many new bridges between statistical mechanics and deep learning, as we discuss below. In the remainder of this introduction, we provide frameworks for two major branches of machine learning. The first is supervised learning, which concerns the process of learning input–output maps from examples. The second is unsupervised learning, which concerns the process of learning and exploiting hidden patterns of structure in data. With these two frameworks in hand, we introduce in Section 1.3 several foundational theoretical questions of deep learning discussed in this review, and their connections to a diversity of topics related to statistical mechanics.

1.1. Overall Framework of Supervised Learning

Image classification is a classic example of supervised learning. In the image classification problem, one must learn a mapping from a pixel representation of an image to a class label for that image (e.g., cat, dog). To learn this map, a neural network is trained on a training set of images along with their correct class labels. This is called a supervised learning problem because the correct class labels are given to the network during training. Indeed, a seminal advance that popularized deep learning was a significant improvement in image classification by deep networks (2).

More formally, the simplest version of a feed-forward neural network with D layers is specified by D weight matrices W^1, ..., W^D and D layers of neural activity vectors x^1, ..., x^D, with N_l neurons in each layer l, so that x^l ∈ R^{N_l} and W^l is an N_l × N_{l−1} matrix. The feed-forward dynamics elicited by an input x^0 presented to the network is given by

x^l = φ(h^l),  h^l = W^l x^{l−1} + b^l,  for l = 1, ..., D,     1.

where b^l is a vector of biases, h^l is the pattern of inputs to neurons at layer l, and φ is a single-neuron scalar nonlinearity that acts component-wise to transform inputs h^l to activities x^l. We henceforth collectively denote all N neural network parameters {W^l, b^l}_{l=1}^{D} by the N-dimensional parameter vector w, and the final output of the network in response to the input x^0 by the vector y = x^D(x^0, w), where the function x^D is defined recursively in Equation 1.
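As a concrete illustration, the recursion in Equation 1 can be written in a few lines of NumPy. This is a minimal sketch, not code from the review; the layer widths, tanh nonlinearity, and 1/√N weight scaling below are illustrative assumptions:

```python
import numpy as np

def forward(x0, weights, biases, phi=np.tanh):
    """Iterate Equation 1: h^l = W^l x^{l-1} + b^l, x^l = phi(h^l)."""
    x = x0
    for W, b in zip(weights, biases):
        h = W @ x + b   # pre-activations h^l at layer l
        x = phi(h)      # activities x^l
    return x            # final output y = x^D(x^0, w)

# Example: a D = 3 network with layer widths N_0..N_3 = 4, 5, 3, 2
rng = np.random.default_rng(0)
widths = [4, 5, 3, 2]
weights = [rng.normal(size=(m, n)) / np.sqrt(n)
           for n, m in zip(widths[:-1], widths[1:])]
biases = [np.zeros(m) for m in widths[1:]]
y = forward(rng.normal(size=widths[0]), weights, biases)
```

The parameter vector w of the text corresponds to the flattened collection of all entries of `weights` and `biases`.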

A supervised learning task is specified by a joint distribution P(x^0, y) over possible inputs x^0 and outputs y. A key goal of supervised learning is to find an optimal set of parameters that minimizes the test error on a randomly chosen input–output pair (x^0, y):

E_Test(w) = ∫ dx^0 dy P(x^0, y) L(y, ŷ),     2.


where the loss function L(y, ŷ) penalizes any discrepancy between the correct output y and the network prediction ŷ = x^D(x^0, w). For example, a simple loss function is the squared loss L = (1/2)(y − ŷ)^2. However, in real-world applications, it may not be possible to either directly access or even mathematically specify the data distribution P. For example, in image classification, x^0 could denote a vector of pixel intensities of an image, whereas y could denote a probability distribution over image category labels. However, one can often access a finite data set D = {x^{0,μ}, y^μ}_{μ=1}^{P} of P independent identically distributed (i.i.d.) samples drawn from P (e.g., example images of cats and dogs). One can then attempt to choose parameters w to minimize the training error,

E_Train(w, D) = (1/P) Σ_{μ=1}^{P} L(y^μ, ŷ^μ),     3.

or the average mismatch between correct answers y^μ and network predictions ŷ^μ = x^D(x^{0,μ}, w) on the specific training set D. Many approaches to supervised learning attempt to minimize this training error, potentially with an additional cost function on w to promote generalization to accurate predictions on new inputs, as we discuss below.
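To make the minimization of E_Train concrete, here is a minimal sketch (not from the review) of descending the training error by gradient descent. A linear model with squared loss is used so the gradient has a closed form; the toy data set, learning rate, and step count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy training set D = {(x^{0,mu}, y^mu)}: a noisy linear teacher
P, N = 50, 3
X = rng.normal(size=(P, N))
w_teacher = np.array([1.0, -2.0, 0.5])
Y = X @ w_teacher + 0.1 * rng.normal(size=P)

def E_train(w):
    """Equation 3 with squared loss L = (1/2)(y - yhat)^2."""
    return 0.5 * np.mean((Y - X @ w) ** 2)

w = np.zeros(N)   # initial point in parameter space
lr = 0.1          # learning rate
for _ in range(500):
    grad = -X.T @ (Y - X @ w) / P   # gradient of E_train at w
    w -= lr * grad
```

After the loop, E_train(w) sits near the noise floor of the teacher, far below its value at the initialization w = 0; for a deep network the same descent is run on the nonconvex landscape discussed in Section 3.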

1.2. Overall Framework of Unsupervised Learning

In addition to learning input–output maps, another key branch of machine learning, known as unsupervised learning, concerns modeling and understanding the structure of complex data. For example, how can we describe the structure of natural images, sounds, and language? If we could accurately model probability distributions over such complex data, then we could generate naturalistic data, as well as correct errors in image acquisition (16), speech recordings (17), or human language generation (4).

Of course, the distribution over such complex data as images and sounds cannot be mathematically specified, but we often have access to an empirical distribution of P samples:

q(x) = (1/P) Σ_{μ=1}^{P} δ(x − x^μ).     4.

For example, each x^μ could denote a vector of pixel intensities for images, or a time series of pressure variations for sound.

The goal of unsupervised learning is to adjust the parameters w of a family of distributions p(x; w) to find one similar to the data distribution q(x). This is often done by maximizing the log likelihood of the data with respect to model parameters w:

l(w) = ∫ dx q(x) log p(x; w).     5.

This learning principle modifies p to assign high probability to data points, and consequently low probability elsewhere, thereby moving the model distribution, p(x; w), closer to the data distribution, q(x). Indeed, we review further connections between the log-likelihood function and an information theoretic divergence between distributions, as well as free energy and entropy, in Section 6.2.
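As a toy instance of this principle (an illustrative sketch, not from the review), one can estimate l(w) in Equation 5 on samples for a one-dimensional Gaussian family p(x; μ, σ). For this family the maximizer is known in closed form, namely the sample mean and standard deviation:

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(loc=3.0, scale=2.0, size=10_000)  # samples from q(x)

def avg_log_likelihood(mu, sigma, x):
    """Sample estimate of l(w) = E_q[log p(x; mu, sigma)]."""
    return np.mean(-0.5 * np.log(2.0 * np.pi * sigma**2)
                   - (x - mu) ** 2 / (2.0 * sigma**2))

# For the Gaussian family, the maximum-likelihood parameters are
# exactly the sample moments.
mu_hat, sigma_hat = data.mean(), data.std()
```

Any other choice of (μ, σ) scores a lower average log-likelihood on the same samples, reflecting how maximizing l(w) pulls the model p(x; w) toward the data distribution q(x).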

Once a good model p(x; w) is found, it has many uses. For example, one can sample from it to imagine new data. One can also use it to denoise or fill in missing entries in a data vector x. Furthermore, if the distribution p consists of a generative process that transforms latent, or hidden, variables h into the visible data vector x, then the inferred latent variables h, rather than the


data vector itself, can aid in solving subsequent supervised learning tasks. This approach has been very successful, for example, in natural language processing, where the hidden layers of a network trained simply to generate language form useful internal representations for solving subsequent language processing tasks (4).

Interestingly, the process of choosing p can be thought of as an inverse statistical mechanics problem (18). Traditionally, many problems in the theory of equilibrium statistical mechanics involve starting from a Boltzmann distribution p(x; w) over microstates x, with couplings w, and computing bulk statistics of x from p. In contrast, machine learning involves sampling from microstates x and deducing an appropriate distribution p(x; w).
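This forward/inverse contrast can be made concrete with the simplest possible Boltzmann family (an illustrative sketch, not from the review): independent ±1 spins with fields w, p(x; w) ∝ exp(Σ_i w_i x_i), for which the magnetizations ⟨x_i⟩ = tanh(w_i) solve the inverse problem in closed form:

```python
import numpy as np

rng = np.random.default_rng(3)

# Forward problem: sample microstates x in {-1, +1}^3 from
# p(x; w) ∝ exp(sum_i w_i x_i), i.e., P(x_i = +1) = 1 / (1 + e^{-2 w_i})
w_true = np.array([0.5, -1.0, 0.2])
p_up = 1.0 / (1.0 + np.exp(-2.0 * w_true))
samples = np.where(rng.random((100_000, 3)) < p_up, 1.0, -1.0)

# Inverse problem: deduce the couplings w from the samples
# by inverting <x_i> = tanh(w_i)
w_hat = np.arctanh(samples.mean(axis=0))
```

With interacting spins the inverse problem no longer has a closed form, which is precisely where learning algorithms for p(x; w) enter.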

1.3. Foundational Theoretical Questions in Deep Learning

With the above minimal frameworks for supervised and unsupervised learning in hand, we can now introduce foundational theoretical questions in the field of deep learning and how ideas from statistical physics have begun to shed light on these questions. On the supervised side, we discuss four questions. First, what is the advantage of depth D? In principle, what functions can be computed in Equation 1 for large, but not small, D? We address this question in Section 2 by making a connection to dynamical phase transitions between order and chaos.

Second, many methods for minimizing the training error in Equation 3 involve descending the error landscape over the parameter vector w given by E_Train(w, D) via (stochastic) gradient descent. What is the shape of this landscape and when can we descend to points of low training error? We address these questions in Section 3, making various connections to the statistical mechanics of energy landscapes with quenched disorder, including phenomena like random Gaussian landscapes, spin glasses, and jamming. Indeed, E_Train(w, D) could be thought of as such an energy function over thermal degrees of freedom w, where the data D play the role of quenched disorder.

Third, when minimizing E_Train(w, D) via gradient descent, one must start at an initial point w, which is often chosen randomly. How can one choose the random initialization to accelerate subsequent gradient descent? In Section 4, we show that theories of signal propagation through random deep networks provide clues to good initializations, making connections to topics in random matrix theory, free probability, and functional path integrals.

Fourth, though many learning algorithms minimize E_Train in Equation 3, possibly with extra regularization on the parameters w, the key goal is to minimize the inaccessible test error E_Test in Equation 2 on a randomly chosen new input not necessarily present in the training data D. It is then critical to achieve a small generalization error E_Gen = E_Test − E_Train. When can one achieve such a small generalization error, especially in situations in which the number of parameters N can far exceed the number of data points P? We address this question in Section 5, making connections to topics like phase transitions in random matrix spectra, free field theories, and interacting particle systems.

On the unsupervised side, the theoretical development is much less mature. However, in Section 6, we review work in deep unsupervised learning that connects to ideas in equilibrium statistical mechanics, like free-energy minimization, as well as nonequilibrium statistical mechanics, like the Jarzynski equality and heat dissipation in irreversible processes.

2. EXPRESSIVITY OF DEEP NETWORKS

Seminal results (19, 20) demonstrate that shallow networks, with only one hidden layer of neurons, can universally approximate any Borel measurable function from one finite-dimensional space to another, given enough hidden neurons. These results then raise a fundamental question: Why are


deeper neural networks with many hidden layers at all functionally advantageous in solving key problems in machine learning and artificial intelligence?

2.1. Efficient Computation of Special Functions by Deep Networks

Importantly, the early results on function approximation in References 19 and 20 do not specify how many hidden neurons are required to approximate, or express, any given function by a shallow network. A key factor thought to underlie the success of deep networks over their shallow cousins is their high expressivity. This notion corresponds primarily to two intuitions. The first is that deep networks can compactly express highly complex functions over input space in a way that shallow networks with one hidden layer and the same number of neurons cannot. The second intuition, which has captured the imagination of machine learning (21) and neuroscience (22) alike, is that deep neural networks can disentangle highly curved decision boundaries in input space into flattened decision boundaries in hidden space, to aid the performance of simple linear

classifiers. To more precisely define a decision boundary, consider the deep network y = x^D(x^0, w) in Equation 1, where the final output y is restricted to a scalar function y. This network can perform a binary classification task by partitioning the input vectors x^0 into two classes, according to whether y = x^D(x^0, w) is positive or negative. The codimension-1 manifold obeying the equation x^D(x^0, w) = 0 is then the network's decision boundary. Note that one could also define the decision boundary similarly in the penultimate hidden layer x^{D−1}. Whereas the decision boundary in this hidden layer must be a linear hyperplane, by virtue of the linear map from x^{D−1} to the scalar h^D, the decision boundary in input space could potentially be highly curved by virtue of the highly nonlinear map from x^0 to x^{D−1} in Equation 1.

Focusing on the first intuition, several works have exhibited examples of particular complex functions that can be computed with a number of neurons that grows polynomially with the number of input dimensions when using a deep network, but require a number of neurons that instead grows exponentially in the input dimension when using a shallow network (23–27). The theoretical techniques employed in these works both limited the applicability of theory to specific nonlinearities and dictated the particular measure of deep functional complexity involved. For example, Reference 23 focused on rectified linear unit (ReLU) nonlinearities and the number of linear regions as a complexity measure; Reference 24 focused on sum-product networks, which compute polynomials, and the number of monomials in the polynomial as a complexity measure; and Reference 28 focused on Pfaffian nonlinearities and topological measures of complexity, like the sum of Betti numbers of a decision boundary. These works thus left open a fundamental question: Are the particular example functions efficiently computed by particular deep networks merely rare curiosities, or in some sense is any function computed by a generic deep network, with more general nonlinearities, not efficiently computable by a shallow network?

2.2. Expressivity Through Transient Chaos

Recent work (29) addressed this question by combining Riemannian geometry and dynamical mean-field theory (30) to analyze the propagation of signals through random deep networks in which the weights and biases are chosen i.i.d. from zero-mean Gaussian distributions. In a phase plane formed by the variance of the weights and biases, this work revealed a dynamical phase transition between ordered and chaotic regimes of signal propagation [see Figure 1a,b for an example where the nonlinearity φ in Equation 1 is taken to be φ(x) = tanh x]. Intuitively, for small weights, relative to the strength of biases, nearby input points coalesce as they propagate through the layers of a deep network and the feed-forward map stays within the linear regime. However, for large weights, signal propagation corresponds to alternating linear expansion and nonlinear

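The ordered and chaotic regimes described above can be probed directly by simulation. The following is a minimal sketch under the random-network ensemble just described (i.i.d. zero-mean Gaussian weights and biases in Equation 1 with φ = tanh); the specific widths, depths, and variance values are illustrative assumptions. Two nearby inputs are propagated through the same random network, and their final separation is measured:

```python
import numpy as np

def final_distance(sigma_w, sigma_b, depth=40, width=500, eps=1e-3, seed=0):
    """Propagate two nearby inputs through a random deep tanh network
    (Equation 1 with i.i.d. Gaussian W^l, b^l) and return their distance."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=width)
    x2 = x1 + eps * rng.normal(size=width)  # a small perturbation of x1
    for _ in range(depth):
        W = rng.normal(scale=sigma_w / np.sqrt(width), size=(width, width))
        b = rng.normal(scale=sigma_b, size=width)
        x1, x2 = np.tanh(W @ x1 + b), np.tanh(W @ x2 + b)
    return np.linalg.norm(x1 - x2)

d_ordered = final_distance(sigma_w=0.5, sigma_b=0.1)  # small weights
d_chaotic = final_distance(sigma_w=2.5, sigma_b=0.1)  # large weights
```

In the ordered phase the two trajectories coalesce, so d_ordered shrinks toward zero with depth, while in the chaotic phase they decorrelate and the separation saturates at an O(1) scale per neuron, mirroring the transition shown in Figure 1a,b.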
