没有合适的资源?快使用搜索试试~ 我知道了~
首页《深度学习最优化》综述论文
资源详情
资源评论
资源推荐

Optimization for deep learning: an overview
Ruoyu Sun
∗
April 28, 2020
Abstract
Optimization is a critical component in deep learning. We think optimization for neural net-
works is an interesting topic for theoretical research due to various reasons. First, its tractability
despite non-convexity is an intriguing question, and may greatly expand our understanding of
tractable problems. Second, classical optimization theory is far from enough to explain many
phenomenons. Therefore, we would like to understand the challenges and opportunities from
a theoretical perspective, and review the existing research in this field. First, we discuss the
issue of gradient explosion/vanishing and the more general issue of undesirable spectrum, and
then discuss practical solutions including careful initialization, normalization methods, and skip
connections. Second, we review generic optimization methods used in training neural networks,
such as stochastic gradient descent (SGD) and adaptive gradient methods, and existing theoret-
ical results. Third, we review existing research on the global issues of neural network training,
including results on global landscape, mode connectivity, lottery ticket hypothesis and neural
tangent kernel (NTK).
1 Introduction
Optimization has been a critical component of neural network research for a long time. However,
it is not clear at first sight whether neural network problems are good topics for theoretical study,
as they are both too simple and too complicated. On one hand, they are “simple” because typical
neural network problems are just a special instance of a unconstrained continuous optimization
problem (this view is problematic though, as argued later), which is itself a subarea of optimization.
On the other hand, neural network problems are indeed “complicated” because of the composition
of many non-linear functions. If we want to open the “black box” of the neural networks and
look carefully at the inner structure, we may find ourselves like a child in a big maze with little
clue what is going on. In contrast to the rich theory of many optimization branches such as
convex optimization and integer programming, the theoretical appeal of this special yet complicated
unconstrained problem is not clear.
That being said, there are a few reasons that make neural network optimization an interesting
topic of theoretical research. First, neural networks may provide us a new class of tractable op-
∗
Department of Industrial and Enterprise Systems Engineering (ISE), and affiliated to Coordinated Science Labo-
ratory and Department of ECE, University of Illinois at Urbana-Champaign, Urbana, IL. Email: ruoyus@illinois.edu.
1

timization problems beyond convex problems. A somewhat related analogy is the development of
conic optimization: in 1990’s, researchers realized that many seemingly non-convex problems can
actually be reformulated as conic optimization problems (e.g. semi-definite programming) which
are convex problems, thus the boundary of tractability has advanced significantly. Neural network
problems are surely not the worst non-convex optimization problems and their global optima could
be found relatively easily in many cases. Admittedly, they are also not the best non-convex prob-
lems either. In fact, they are like wild animals and proper tuning is needed to make them work,
but if we can understand their behavior and tame these animals, they will be very powerful tools
to us.
Second, existing nonlinear optimization theory is far from enough to explain the practical be-
havior of neural network training. A well known difficulty in training neural-nets is called “gradient
explosion/vanishing” due to the concatenation of many layers. Properly chosen initialization and/or
other techniques are needed to train deep neural networks, but not always enough. This poses a
great challenge for theoretical analysis, because what conditions are needed for theoretical analysis
is not very clear. Even proving convergence (to stationary points) of the existing method with the
practically used stepsize seems a difficult task. In addition, some seemingly simple methods like
SGD with cyclical step-size and Adam work very well in practice, and the current theory is far from
explaining their effectiveness. Overall, there is still much space for rigorous convergence analysis
and algorithm design.
Third, although the basic formulation a theoretician has in mind (and the focus of this article)
is an unconstrained problem, neural network is essentially a way to parametrize the optimization
variable and thus can be applied to a wide range of problems, including reinforcement learning
and min-max optimization. In principle, any optimization problem can be combined with neural
networks. As long as a cascade of multiple parameters appears, the problem suddenly faces all
the issues we discussed above: the parametrization may cause complicated landscape, and the
convergence analysis may be quite difficult. Understanding the basic unconstrained formulation
is just the first step towards understanding neural networks in a broader setting, and presumably
there can be richer optimization theory or algorithmic ingredients that can be developed.
To keep the survey simple, we will focus on the supervised learning problem with feedforward
neural networks. We will not discuss more complicated formulations such as GANs (generative
adversarial networks) and deep reinforcement learning, and do not discuss more complicated ar-
chitecture such as RNN (recurrent neural network), attention and Capsule. In a broader context,
theory for supervised learning contains at least representation, optimization and generalization (see
Section 1.1), and we do not discuss representation and generalization in detail.
This article is written for researchers who are interested in theoretical understanding of opti-
mization for neural networks. Prior knowledge on optimization methods and basic theory will be
very helpful (see, e.g., [23, 189, 30] for preparation). Existing surveys on optimization for deep
learning are intended for general machine learning audience, such as Chapter 8 of the book Good-
fellow et al. [75]. These surveys often do not discuss optimization theoretical aspects in depth.
In contrast, in this article, we emphasize more on the theoretical results while trying to make it
2

accessible for non-theory readers. Simple examples that illustrate the intuition will be provided if
possible, and we will not explain the details of the theorems.
1.1 Big picture: decomposition of theory
A useful and popular meta-method to develop theory is decomposition. We first briefly review
the role of optimization in machine learning, and then discuss how to decompose the theory of
optimization for deep learning.
Representation, optimization and generalization. The goal of supervised learning is to
find a function that approximates the underlying function based on observed samples. The first
step is to find a rich family of functions (such as neural networks) that can represent the desirable
function. The second step is to identify the parameter of the function by minimizing a certain loss
function. The third step is to use the function found in the second step to make predictions on
unseen test data, and the resulting error is called test error. The test error can be decomposed into
representation error, optimization error and generalization error, corresponding to the error caused
by each of the three steps.
In machine learning, the three subjects representation, optimization and generalization are
often studied separately. For instance, when studying representation power of a certain family
of functions, we often do not care whether the optimization problem can be solved well. When
studying the generalization error, we often assume that the global optima have been found (see
[93] for a survey of generalization). Similarly, when studying optimization properties, we often do
not explicitly consider the generalization error (but sometimes we assume the representation error
is zero).
Decomposition of optimization issues. Optimization issues of deep learning are rather
complicated, and further decomposition is needed. The development of optimization can be divided
into three steps. The first step is to make the algorithm start running and converge to a reasonable
solution such as a stationary point. The second step is to make the algorithm converge as fast as
possible. The third step is to ensure the algorithm converge to a solution with a low objective value
(e.g. global minima). There is an extra step of achieving good test accuracy, but this is beyond the
scope of optimization. In short, we divide the optimization issues into three parts: convergence,
convergence speed and global quality.
Optimization issues
Local issues
(
Convergence issue: gradient explosion/vanishing
Convergence speed issue
Global issues: bad local minima, plaeaus, etc.
Most works are reviewed in three sections: Section 4, Section 5 and Section 6. Roughly speaking,
each section is mainly motivated by one of the three parts of optimization theory. However, this
partition is not precise as the boundaries between the three parts are blurred. For instance, some
techniques discussed in Section 4 can also improve the convergence rate, and some results in Section
6 address the convergence issue as well as global issues. Another reason of the partition is that
3

Optimization terms optimization variable objective stepsize
Deep learning terms weight, parameter training loss learning rate
Table 1: Optimization and machine learning terminology: the terms in the same column represent the same thing.
they represent three rather separate subareas of neural network optimization, and are developed
somewhat independently.
1.2 Terminology and Outline
Terminology. The terminology of optimization and deep learning are somewhat different, and in
this article we use them interchangeably. See Table 1 for a comparison of some major terms.
Outline. The structure of the article is as follows. In Section 2, we present the formulation of a
typical neural network optimization problem for supervised learning. In Section 3, we present back
propagation (BP) and discuss the basic convergence results. In Section 4, we discuss neural-net
specific tricks for training a neural network, and some underlying theory. In particular, we discuss
a major challenge called gradient explosion/vanishing, and review main solutions such as careful
initialization and normalization methods. In Section 5, we discuss generic algorithm design which
treats neural networks as generic non-convex optimization problems. In particular, we review SGD
with various learning rate schedules, adaptive gradient methods, large-scale distributed training,
second order methods and the existing convergence and iteration complexity results. In Section
6, we review research on global optimization of neural networks, including global landscape, mode
connectivity, lottery ticket hypothesis and neural tangent kernel (NTK).
2 Problem Formulation
In this section, we present the optimization formulation for a supervised learning problem. Suppose
we are given data points x
i
∈ R
d
x
, y
i
∈ R
d
y
, i = 1, . . . , n, where n is the number of samples. The
input instance x
i
can represent a feature vector of an object, an image, a vector that presents a
word, etc. The output instance y
i
can represent a real-valued vector or scalar such as in a regression
problem, or an integer-valued vector or scalar such as in a classification problem.
We want the computer to predict y
i
based on the information of x
i
, so we want to learn the
underlying mapping that maps each x
i
to y
i
. To approximate the mapping, we use a neural network
f
θ
: R
d
x
→ R
d
y
, which maps an input x to a predicted output ˆy. A standard fully-connected neural
network is given by
f
θ
(x) = W
L
φ(W
L−1
. . . φ(W
2
φ(W
1
x))), (1)
where φ : R → R is the neuron activation function (sometimes simply called “activation” or
“neuron”), W
j
is a matrix of dimension d
j
× d
j−1
, j = 1, . . . , L and θ = (W
1
, . . . , W
L
) represents
the collection of all parameters. Here we define d
0
= d
x
and d
L
= d
y
. When applying the scalar
4

function φ to a vector v, we apply φ to each entry of v. Another way to write down the neural
network is to use a recursion formula:
z
0
= x; z
l
= φ(W
l
z
l−1
), l = 1, . . . , L. (2)
Note that in practice, the recursive expression should be z
l
= φ(W
l
z
l−1
+ b
l
). For simplicity of
presentation, throughout the paper, we often skip the “bias” term b
l
in the expression of neural
networks and just use the simplified version (2).
We want to pick the parameter of the neural network so that the predicted output ˆy
i
= f
θ
(x
i
)
is close to the true output y
i
, thus we want to minimize the distance between y
i
and ˆy
i
. For a
certain distance metric `(·, ·), the problem of finding the optimal parameters can be written as
min
θ
F (θ) ,
1
n
n
X
i=1
`(y
i
, f
θ
(x
i
)). (3)
For regression problems, `(y, z) is often chosen to be the quadratic loss function `(y, z) = ky −zk
2
.
For binary classification problem, a popular choice of ` is `(y, z) = log(1 + exp(−yz)).
Technically, the neural network given by (2) should be called fully connected feed-forward
networks (FCN). Neural networks used in practice often have more complicated structure. For
computer vision tasks, convolutional neural networks (CNN) are standard. In natural language
processing, extra layers such as “attention” are commonly added. Nevertheless, for our purpose
of understanding the optimization problem, we mainly discuss the FCN model (2) throughout this
article, though in few cases the results for CNN will be mentioned.
For a better understanding of the problem (3), we relate it to several classical optimization
problems.
2.1 Relation with Least Squares
One special form of (3) is the linear regression problem (least squares):
min
w∈R
d×1
ky − w
T
Xk
2
, (4)
where X = (x
1
, . . . , x
n
) ∈ R
d×n
, y ∈ R
1×n
. If there is only one linear neuron that maps the input
x to w
T
x and the loss function is quadratic, then the general neural network problem (3) reduces
to the least square problem (4). We explicitly mention the least square problem for two reasons.
First, it is one of the simplest forms of a neural network problem. Second, when understanding
neural network optimization, researchers have constantly resorted to insight gained from analyzing
linear regression.
2.2 Relation with Matrix Factorization
Neural network optimization (3) is closely related to a fundamental problem in numerical computa-
tion: matrix factorization. If there is only one hidden layer of linear neurons and the loss function
5
剩余50页未读,继续阅读















安全验证
文档复制为VIP权益,开通VIP直接复制

评论0