
CO11CH22_Ganguli ARjats.cls February 13, 2020 10:27

Annual Review of Condensed Matter Physics

Statistical Mechanics of Deep Learning

Yasaman Bahri,¹ Jonathan Kadmon,² Jeffrey Pennington,¹ Sam S. Schoenholz,¹ Jascha Sohl-Dickstein,¹ and Surya Ganguli¹,²

¹Google Brain, Google Inc., Mountain View, California 94043, USA
²Department of Applied Physics, Stanford University, Stanford, California 94035, USA; email: sganguli@stanford.edu

Annu. Rev. Condens. Matter Phys. 2020. 11:501–28

First published as a Review in Advance on December 9, 2019

The Annual Review of Condensed Matter Physics is online at conmatphys.annualreviews.org

https://doi.org/10.1146/annurev-conmatphys-031119-050745

Copyright © 2020 by Annual Reviews. All rights reserved

Keywords

neural networks, machine learning, dynamical phase transitions, chaos, spin glasses, jamming, random matrix theory, interacting particle systems, nonequilibrium statistical mechanics

Abstract

The recent striking success of deep neural networks in machine learning raises profound questions about the theoretical principles underlying their success. For example, what can such deep networks compute? How can we train them? How does information propagate through them? Why can they generalize? And how can we teach them to imagine? We review recent work in which methods of physical analysis rooted in statistical mechanics have begun to provide conceptual insights into these questions. These insights yield connections between deep learning and diverse physical and mathematical topics, including random landscapes, spin glasses, jamming, dynamical phase transitions, chaos, Riemannian geometry, random matrix theory, free probability, and nonequilibrium statistical mechanics. Indeed, the fields of statistical mechanics and machine learning have long enjoyed a rich history of strongly coupled interactions, and recent advances at the intersection of statistical mechanics and deep learning suggest these interactions will only deepen going forward.


1. INTRODUCTION

Deep neural networks, with multiple hidden layers (1), have achieved remarkable success across many fields, including machine vision (2), speech recognition (3), natural language processing (4), reinforcement learning (5), and even modeling of animals and humans themselves in neuroscience (6, 7), psychology (8, 9), and education (10). However, the methods used to arrive at successful deep neural networks remain a highly practiced art, filled with many heuristics, rather than an exact science. This raises exciting challenges and opportunities for the theoretical sciences in creating a mature theory of deep neural networks that is powerful enough to guide a wide set of engineering design choices in deep learning. Although we are still currently far from any such mature theory, a recently emerged body of work at the intersection of statistical mechanics and deep learning has begun to provide theoretical insights into how deep networks learn and compute, sometimes suggesting new and improved methods for deep learning driven by these theoretical insights.

Here, we review this body of work, which builds upon a long and rich history of interaction between statistical mechanics and machine learning (11–15). Interestingly, this body of work leads to many new bridges between statistical mechanics and deep learning, as we discuss below. In the remainder of this introduction, we provide frameworks for two major branches of machine learning. The first is supervised learning, which concerns the process of learning input–output maps from examples. The second is unsupervised learning, which concerns the process of learning and exploiting hidden patterns of structure in data. With these two frameworks in hand, we introduce in Section 1.3 several foundational theoretical questions of deep learning discussed in this review, and their connections to a diversity of topics related to statistical mechanics.

1.1. Overall Framework of Supervised Learning

Image classification is a classic example of supervised learning. In the image classification problem, one must learn a mapping from a pixel representation of an image to a class label for that image (e.g., cat, dog). To learn this map, a neural network is trained on a training set of images along with their correct class labels. This is called a supervised learning problem because the correct class labels are given to the network during training. Indeed, a seminal advance that popularized deep learning was a significant improvement in image classification by deep networks (2).

More formally, the simplest version of a feed-forward neural network with D layers is specified by D weight matrices W^1, ..., W^D and D layers of neural activity vectors x^1, ..., x^D, with N_l neurons in each layer l, so that x^l ∈ R^{N_l} and W^l is an N_l × N_{l−1} matrix. The feed-forward dynamics elicited by an input x^0 presented to the network is given by

x^l = φ(h^l),  h^l = W^l x^{l−1} + b^l,  for l = 1, ..., D,     1.

where b^l is a vector of biases, h^l is the pattern of inputs to neurons at layer l, and φ is a single-neuron scalar nonlinearity that acts component-wise to transform inputs h^l to activities x^l. We henceforth collectively denote all N neural network parameters {W^l, b^l}_{l=1}^{D} by the N-dimensional parameter vector w, and the final output of the network in response to the input x^0 by the vector y = x^D(x^0, w), where the function x^D is defined recursively in Equation 1.
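As a concrete illustration, the recursion in Equation 1 can be written in a few lines of NumPy. This is a minimal sketch, not code from the review; the layer widths, tanh nonlinearity, and 1/√N weight scaling below are illustrative assumptions:

```python
import numpy as np

def forward(x0, weights, biases, phi=np.tanh):
    """Iterate Equation 1: h^l = W^l x^{l-1} + b^l, x^l = phi(h^l)."""
    x = x0
    for W, b in zip(weights, biases):
        h = W @ x + b   # pre-activations h^l at layer l
        x = phi(h)      # activities x^l
    return x            # final output y = x^D(x^0, w)

# Example: a D = 3 network with layer widths N_0..N_3 = 4, 5, 3, 2
rng = np.random.default_rng(0)
widths = [4, 5, 3, 2]
weights = [rng.normal(size=(m, n)) / np.sqrt(n)
           for n, m in zip(widths[:-1], widths[1:])]
biases = [np.zeros(m) for m in widths[1:]]
y = forward(rng.normal(size=widths[0]), weights, biases)
```

The parameter vector w of the text corresponds to the flattened collection of all entries of `weights` and `biases`.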

A supervised learning task is specified by a joint distribution P(x^0, y) over possible inputs x^0 and outputs y. A key goal of supervised learning is to find an optimal set of parameters that minimizes the test error on a randomly chosen input–output pair (x^0, y):

E_Test(w) = ∫ dx^0 dy P(x^0, y) L(y, ŷ),     2.


where the loss function L(y, ŷ) penalizes any discrepancy between the correct output y and the network prediction ŷ = x^D(x^0, w). For example, a simple loss function is the squared loss L = (1/2)(y − ŷ)^2. However, in real-world applications, it may not be possible to either directly access or even mathematically specify the data distribution P. For example, in image classification, x^0 could denote a vector of pixel intensities of an image, whereas y could denote a probability distribution over image category labels. However, one can often access a finite data set D = {x^{0,μ}, y^μ}_{μ=1}^{P} of P independent identically distributed (i.i.d.) samples drawn from P (e.g., example images of cats and dogs). One can then attempt to choose parameters w to minimize the training error,

E_Train(w, D) = (1/P) Σ_{μ=1}^{P} L(y^μ, ŷ^μ),     3.

or the average mismatch between correct answers y^μ and network predictions ŷ^μ = x^D(x^{0,μ}, w) on the specific training set D. Many approaches to supervised learning attempt to minimize this training error, potentially with an additional cost function on w to promote generalization to accurate predictions on new inputs, as we discuss below.
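To make the minimization of E_Train concrete, here is a minimal sketch (not from the review) of descending the training error by gradient descent. A linear model with squared loss is used so the gradient has a closed form; the toy data set, learning rate, and step count are illustrative assumptions:

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy training set D = {(x^{0,mu}, y^mu)}: a noisy linear teacher
P, N = 50, 3
X = rng.normal(size=(P, N))
w_teacher = np.array([1.0, -2.0, 0.5])
Y = X @ w_teacher + 0.1 * rng.normal(size=P)

def E_train(w):
    """Equation 3 with squared loss L = (1/2)(y - yhat)^2."""
    return 0.5 * np.mean((Y - X @ w) ** 2)

w = np.zeros(N)   # initial point in parameter space
lr = 0.1          # learning rate
for _ in range(500):
    grad = -X.T @ (Y - X @ w) / P   # gradient of E_train at w
    w -= lr * grad
```

After the loop, E_train(w) sits near the noise floor of the teacher, far below its value at the initialization w = 0; for a deep network the same descent is run on the nonconvex landscape discussed in Section 3.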

1.2. Overall Framework of Unsupervised Learning

In addition to learning input–output maps, another key branch of machine learning, known as unsupervised learning, concerns modeling and understanding the structure of complex data. For example, how can we describe the structure of natural images, sounds, and language? If we could accurately model probability distributions over such complex data, then we could generate naturalistic data, as well as correct errors in image acquisition (16), speech recordings (17), or human language generation (4).

Of course, the distribution over such complex data as images and sounds cannot be mathematically specified, but we often have access to an empirical distribution of P samples:

q(x) = (1/P) Σ_{μ=1}^{P} δ(x − x^μ).     4.

For example, each x^μ could denote a vector of pixel intensities for images, or a time series of pressure variations for sound.

The goal of unsupervised learning is to adjust the parameters w of a family of distributions p(x; w) to find one similar to the data distribution q(x). This is often done by maximizing the log likelihood of the data with respect to model parameters w:

l(w) = ∫ dx q(x) log p(x; w).     5.

This learning principle modifies p to assign high probability to data points, and consequently low probability elsewhere, thereby moving the model distribution, p(x; w), closer to the data distribution, q(x). Indeed, we review further connections between the log-likelihood function and an information theoretic divergence between distributions, as well as free energy and entropy, in Section 6.2.
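As a toy instance of this principle (an illustrative sketch, not from the review), one can estimate l(w) in Equation 5 on samples for a one-dimensional Gaussian family p(x; μ, σ). For this family the maximizer is known in closed form, namely the sample mean and standard deviation:

```python
import numpy as np

rng = np.random.default_rng(2)
data = rng.normal(loc=3.0, scale=2.0, size=10_000)  # samples from q(x)

def avg_log_likelihood(mu, sigma, x):
    """Sample estimate of l(w) = E_q[log p(x; mu, sigma)]."""
    return np.mean(-0.5 * np.log(2.0 * np.pi * sigma**2)
                   - (x - mu) ** 2 / (2.0 * sigma**2))

# For the Gaussian family, the maximum-likelihood parameters are
# exactly the sample moments.
mu_hat, sigma_hat = data.mean(), data.std()
```

Any other choice of (μ, σ) scores a lower average log-likelihood on the same samples, reflecting how maximizing l(w) pulls the model p(x; w) toward the data distribution q(x).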

Once a good model p(x; w) is found, it has many uses. For example, one can sample from it to imagine new data. One can also use it to denoise or fill in missing entries in a data vector x. Furthermore, if the distribution p consists of a generative process that transforms latent, or hidden, variables h into the visible data vector x, then the inferred latent variables h, rather than the


data vector itself, can aid in solving subsequent supervised learning tasks. This approach has been very successful, for example, in natural language processing, where the hidden layers of a network trained simply to generate language form useful internal representations for solving subsequent language processing tasks (4).

Interestingly, the process of choosing p can be thought of as an inverse statistical mechanics problem (18). Traditionally, many problems in the theory of equilibrium statistical mechanics involve starting from a Boltzmann distribution p(x; w) over microstates x, with couplings w, and computing bulk statistics of x from p. In contrast, machine learning involves sampling from microstates x and deducing an appropriate distribution p(x; w).
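This forward/inverse contrast can be made concrete with the simplest possible Boltzmann family (an illustrative sketch, not from the review): independent ±1 spins with fields w, p(x; w) ∝ exp(Σ_i w_i x_i), for which the magnetizations ⟨x_i⟩ = tanh(w_i) solve the inverse problem in closed form:

```python
import numpy as np

rng = np.random.default_rng(3)

# Forward problem: sample microstates x in {-1, +1}^3 from
# p(x; w) ∝ exp(sum_i w_i x_i), i.e., P(x_i = +1) = 1 / (1 + e^{-2 w_i})
w_true = np.array([0.5, -1.0, 0.2])
p_up = 1.0 / (1.0 + np.exp(-2.0 * w_true))
samples = np.where(rng.random((100_000, 3)) < p_up, 1.0, -1.0)

# Inverse problem: deduce the couplings w from the samples
# by inverting <x_i> = tanh(w_i)
w_hat = np.arctanh(samples.mean(axis=0))
```

With interacting spins the inverse problem no longer has a closed form, which is precisely where learning algorithms for p(x; w) enter.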

1.3. Foundational Theoretical Questions in Deep Learning

With the above minimal frameworks for supervised and unsupervised learning in hand, we can now introduce foundational theoretical questions in the field of deep learning and how ideas from statistical physics have begun to shed light on these questions. On the supervised side, we discuss four questions. First, what is the advantage of depth D? In principle, what functions can be computed in Equation 1 for large, but not small, D? We address this question in Section 2 by making a connection to dynamical phase transitions between order and chaos.

Second, many methods for minimizing the training error in Equation 3 involve descending the error landscape over the parameter vector w given by E_Train(w, D) via (stochastic) gradient descent. What is the shape of this landscape and when can we descend to points of low training error? We address these questions in Section 3, making various connections to the statistical mechanics of energy landscapes with quenched disorder, including phenomena like random Gaussian landscapes, spin glasses, and jamming. Indeed, E_Train(w, D) could be thought of as such an energy function over thermal degrees of freedom w, where the data D play the role of quenched disorder.

Third, when minimizing E_Train(w, D) via gradient descent, one must start at an initial point w, which is often chosen randomly. How can one choose the random initialization to accelerate subsequent gradient descent? In Section 4, we show that theories of signal propagation through random deep networks provide clues to good initializations, making connections to topics in random matrix theory, free probability, and functional path integrals.

Fourth, though many learning algorithms minimize E_Train in Equation 3, possibly with extra regularization on the parameters w, the key goal is to minimize the inaccessible test error E_Test in Equation 2 on a randomly chosen new input not necessarily present in the training data D. It is then critical to achieve a small generalization error E_Gen = E_Test − E_Train. When can one achieve such a small generalization error, especially in situations in which the number of parameters N can far exceed the number of data points P? We address this question in Section 5, making connections to topics like phase transitions in random matrix spectra, free field theories, and interacting particle systems.

On the unsupervised side, the theoretical development is much less mature. However, in Section 6, we review work in deep unsupervised learning that connects to ideas in equilibrium statistical mechanics, like free-energy minimization, as well as nonequilibrium statistical mechanics, like the Jarzynski equality and heat dissipation in irreversible processes.

2. EXPRESSIVITY OF DEEP NETWORKS

Seminal results (19, 20) demonstrate that shallow networks, with only one hidden layer of neurons, can universally approximate any Borel measurable function from one finite-dimensional space to another, given enough hidden neurons. These results then raise a fundamental question: Why are


deeper neural networks with many hidden layers at all functionally advantageous in solving key problems in machine learning and artificial intelligence?

2.1. Efficient Computation of Special Functions by Deep Networks

Importantly, the early results on function approximation in References 19 and 20 do not specify how many hidden neurons are required to approximate, or express, any given function by a shallow network. A key factor thought to underlie the success of deep networks over their shallow cousins is their high expressivity. This notion corresponds primarily to two intuitions. The first is that deep networks can compactly express highly complex functions over input space in a way that shallow networks with one hidden layer and the same number of neurons cannot. The second intuition, which has captured the imagination of machine learning (21) and neuroscience (22) alike, is that deep neural networks can disentangle highly curved decision boundaries in input space into flattened decision boundaries in hidden space, to aid the performance of simple linear

classifiers. To more precisely define a decision boundary, consider the deep network y = x^D(x^0, w) in Equation 1, where the final output y is restricted to a scalar function y. This network can perform a binary classification task by partitioning the input vectors x^0 into two classes, according to whether y = x^D(x^0, w) is positive or negative. The codimension-1 manifold obeying the equation x^D(x^0, w) = 0 is then the network's decision boundary. Note that one could also define the decision boundary similarly in the penultimate hidden layer x^{D−1}. Whereas the decision boundary in this hidden layer must be a linear hyperplane, by virtue of the linear map from x^{D−1} to the scalar h^D, the decision boundary in input space could potentially be highly curved by virtue of the highly nonlinear map from x^0 to x^{D−1} in Equation 1.

Focusing on the first intuition, several works have exhibited examples of particular complex functions that can be computed with a number of neurons that grows polynomially with the number of input dimensions when using a deep network, but require a number of neurons that instead grows exponentially in the input dimension when using a shallow network (23–27). The theoretical techniques employed in these works both limited the applicability of theory to specific nonlinearities and dictated the particular measure of deep functional complexity involved. For example, Reference 23 focused on rectified linear unit (ReLU) nonlinearities and the number of linear regions as a complexity measure; Reference 24 focused on sum-product networks, which compute polynomials, and the number of monomials in the polynomial as a complexity measure; and Reference 28 focused on Pfaffian nonlinearities and topological measures of complexity, like the sum of Betti numbers of a decision boundary. These works thus left open a fundamental question: Are the particular example functions efficiently computed by particular deep networks merely rare curiosities, or in some sense is any function computed by a generic deep network, with more general nonlinearities, not efficiently computable by a shallow network?

2.2. Expressivity Through Transient Chaos

Recent work (29) addressed this question by combining Riemannian geometry and dynamical mean-field theory (30) to analyze the propagation of signals through random deep networks in which the weights and biases are chosen i.i.d. from zero-mean Gaussian distributions. In a phase plane formed by the variance of the weights and biases, this work revealed a dynamical phase transition between ordered and chaotic regimes of signal propagation [see Figure 1a,b for an example where the nonlinearity φ in Equation 1 is taken to be φ(x) = tanh x]. Intuitively, for small weights, relative to the strength of biases, nearby input points coalesce as they propagate through the layers of a deep network and the feed-forward map stays within the linear regime. However, for large weights, signal propagation corresponds to alternating linear expansion and nonlinear

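The ordered and chaotic regimes described above can be probed directly by simulation. The following is a minimal sketch under the random-network ensemble just described (i.i.d. zero-mean Gaussian weights and biases in Equation 1 with φ = tanh); the specific widths, depths, and variance values are illustrative assumptions. Two nearby inputs are propagated through the same random network, and their final separation is measured:

```python
import numpy as np

def final_distance(sigma_w, sigma_b, depth=40, width=500, eps=1e-3, seed=0):
    """Propagate two nearby inputs through a random deep tanh network
    (Equation 1 with i.i.d. Gaussian W^l, b^l) and return their distance."""
    rng = np.random.default_rng(seed)
    x1 = rng.normal(size=width)
    x2 = x1 + eps * rng.normal(size=width)  # a small perturbation of x1
    for _ in range(depth):
        W = rng.normal(scale=sigma_w / np.sqrt(width), size=(width, width))
        b = rng.normal(scale=sigma_b, size=width)
        x1, x2 = np.tanh(W @ x1 + b), np.tanh(W @ x2 + b)
    return np.linalg.norm(x1 - x2)

d_ordered = final_distance(sigma_w=0.5, sigma_b=0.1)  # small weights
d_chaotic = final_distance(sigma_w=2.5, sigma_b=0.1)  # large weights
```

In the ordered phase the two trajectories coalesce, so d_ordered shrinks toward zero with depth, while in the chaotic phase they decorrelate and the separation saturates at an O(1) scale per neuron, mirroring the transition shown in Figure 1a,b.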
