Convolutional neural networks (CNNs) have achieved unprecedented success in computer vision, yet we still lack a comprehensive understanding of why they work so well. Isma Hadji and Richard P. Wildes of the Department of Electrical Engineering and Computer Science at York University published the paper "What Do We Understand About Convolutional Networks?", which surveys the technical foundations, building blocks, current state, and research prospects of convolutional networks, presenting our current understanding of CNNs.
What Do We Understand About Convolutional Networks?
Isma Hadji and Richard P. Wildes
Department of Electrical Engineering and Computer Science
arXiv:1803.08834v1 [cs.CV] 23 Mar 2018
Over the past few years major computer vision research efforts have focused on convolutional neural networks, commonly referred to as ConvNets or CNNs. These efforts have resulted in new state-of-the-art performance on a wide range of classification (e.g. [64,88,139]) and regression (e.g. [36,97,159]) tasks. In contrast, while the history of such approaches can be traced back a number of years (e.g. [49, 91]), theoretical understanding of how these systems achieve their outstanding results lags. In fact, many current contributions in the computer vision field use ConvNets as a black box that works while having only a vague idea of why it works, which is unsatisfactory from a scientific point of view. In particular, there are two main complementary concerns: (1) For learned aspects (e.g. convolutional kernels), exactly what has been learned? (2) For architecture design aspects (e.g. number of layers, number of kernels/layer, pooling strategy, choice of nonlinearity), why are some choices better than others? The answers to these questions will not only improve the scientific understanding of ConvNets, but also increase their practical applicability.
Moreover, current realizations of ConvNets require massive amounts of data for training [84, 88, 91] and the design decisions made greatly impact performance [23, 77]. Deeper theoretical understanding should lessen dependence on data-driven design. While empirical studies have investigated the operation of implemented networks, to
date, their results largely have been limited to visualizations of internal processing to understand what is happening at the different layers of a ConvNet [104,133,154].
In response to the above noted state of affairs, this document will review the most prominent proposals using multilayer convolutional architectures. Importantly, the various components of a typical convolutional network will be discussed through a review of different approaches that base their design decisions on biological findings and/or sound theoretical bases. In addition, the different attempts at understanding ConvNets via visualizations and empirical studies will be reviewed. The ultimate goal is to shed light on the role of each layer of processing involved in a ConvNet architecture, distill what we currently understand about ConvNets and highlight critical open problems.
1.3 Outline of report
This report is structured as follows: The present chapter has motivated the need for a review of our understanding of convolutional networks. Chapter 2 will describe various multilayer networks and present the most successful architectures used in computer vision applications. Chapter 3 will more specifically focus on each one of the building blocks of typical convolutional networks and discuss the design of the different components from both biological and theoretical perspectives. Finally, chapter 4 will describe the current trends in ConvNet design and efforts towards ConvNet understanding and highlight some critical outstanding shortcomings that remain.
2 Multilayer networks
This chapter gives a succinct overview of the most prominent multilayer architectures used in computer vision, in general. Notably, while this chapter covers the most important contributions in the literature, it does not aim to provide a comprehensive review of such architectures, as such reviews are available elsewhere (e.g. [17, 56, 90]). Instead, the purpose of this chapter is to set the stage for the remainder of the document and its detailed presentation and discussion of what currently is understood about convolutional networks applied to visual information processing.
2.1 Multilayer architectures
Prior to the recent success of deep learning-based networks, state-of-the-art computer vision systems for recognition relied on two separate but complementary steps. First, the input data is transformed via a set of hand-designed operations (e.g. convolutions with a basis set, local or global encoding methods) into a suitable form. The transformations that the input incurs usually entail finding a compact and/or abstract representation of the input data, while injecting several invariances depending on the task at hand. The goal of this transformation is to change the data in a way that makes it more amenable to being readily separated by a classifier. Second, the transformed data is used to train some sort of classifier (e.g. Support Vector Machines) to recognize the content of the input signal. The performance of any classifier used is usually heavily affected by the transformations applied.
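To make this two-stage pipeline concrete, the following is a minimal sketch in NumPy. The feature extractor and the nearest-centroid classifier are illustrative, hypothetical stand-ins for the engineered descriptors and classifiers (e.g. SVMs) discussed above, not methods from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def hand_designed_transform(image):
    """Stage 1: a toy hand-designed feature extractor (hypothetical).

    Computes gradient energies and intensity statistics -- stand-ins for
    engineered descriptors such as filter-bank responses or orientation
    histograms.
    """
    gx = np.diff(image, axis=1)  # horizontal gradients
    gy = np.diff(image, axis=0)  # vertical gradients
    return np.array([
        np.mean(np.abs(gx)),  # horizontal edge energy
        np.mean(np.abs(gy)),  # vertical edge energy
        image.mean(),         # global brightness
        image.std(),          # global contrast
    ])

def fit_nearest_centroid(features, labels):
    """Stage 2: a minimal classifier trained on the transformed data
    (a nearest-centroid stand-in for, e.g., a Support Vector Machine)."""
    return {c: features[labels == c].mean(axis=0) for c in np.unique(labels)}

def predict(centroids, feature):
    return min(centroids, key=lambda c: np.linalg.norm(feature - centroids[c]))

# Toy data: class 0 = smooth images, class 1 = striped (high-gradient) images.
smooth = [rng.normal(0.5, 0.01, (8, 8)) for _ in range(20)]
striped = [np.tile([0.0, 1.0], (8, 4)) + rng.normal(0, 0.01, (8, 8))
           for _ in range(20)]
X = np.array([hand_designed_transform(im) for im in smooth + striped])
y = np.array([0] * 20 + [1] * 20)

centroids = fit_nearest_centroid(X, y)
test_image = np.tile([0.0, 1.0], (8, 4))  # an unseen striped image
print(predict(centroids, hand_designed_transform(test_image)))  # -> 1
```

Note how the classifier in stage 2 never sees raw pixels, only the transformed representation; this is exactly the dependence on the chosen transformation that the text points out.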
Multilayer architectures with learning bring about a different outlook on the problem by proposing to learn not only the classifier, but also the required transformation operations directly from the data. This form of learning is commonly referred to as representation learning [7,90], which when used in the context of deep multilayer architectures is called deep learning.
Multilayer architectures can be defined as computational models that allow for extracting useful information from the input data at multiple levels of abstraction. Generally, multilayer architectures are designed to amplify important aspects of the input at higher layers, while becoming more and more robust to less significant variations. Most multilayer architectures stack simple building block modules with alternating linear and nonlinear functions. Over the years, a plethora of various multilayer architectures were proposed and this section will cover the most prominent such architectures adopted for computer vision applications. In particular, artificial neural network architectures will be the focus due to their prominence. For the sake of succinctness, such networks will be referred to more simply as neural networks in the remainder of this document.
2.1.1 Neural networks
A typical neural network architecture is made of an input layer, x, an output layer, y, and a stack of multiple hidden layers, h, where each layer consists of multiple cells or units, as depicted in Figure 2.1. Usually, each hidden unit, $h_j$, receives input from all units at the previous layer and is defined as a weighted combination of the inputs followed by a nonlinearity according to

$$h_j = F\Big(b_j + \sum_i w_{ij} x_i\Big)$$

where $w_{ij}$ are the weights controlling the strength of the connections between the input units and the hidden unit, $b_j$ is a small bias of the hidden unit and $F(\cdot)$ is some saturating nonlinearity such as the sigmoid.
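The hidden-unit computation above can be sketched in a few lines of NumPy; the layer sizes and random weights below are illustrative assumptions, chosen only to show the shapes involved:

```python
import numpy as np

def sigmoid(a):
    """The saturating nonlinearity F(.) from the text."""
    return 1.0 / (1.0 + np.exp(-a))

def hidden_layer(x, W, b):
    """Computes h_j = F(b_j + sum_i w_ij * x_i) for all hidden units at once.

    W[i, j] is the weight from input unit i to hidden unit j,
    b[j] is the small bias of hidden unit j.
    """
    return sigmoid(b + x @ W)

rng = np.random.default_rng(0)
x = rng.normal(size=3)              # input layer with 3 units
W = rng.normal(size=(3, 4)) * 0.1   # weights into 4 hidden units
b = np.zeros(4)                     # biases

h = hidden_layer(x, W, b)           # hidden layer activations, shape (4,)

# Stacking such modules -- each a weighted (linear) combination followed
# by a nonlinearity -- yields the deep multilayer architectures discussed
# in this section.
W2, b2 = rng.normal(size=(4, 2)) * 0.1, np.zeros(2)
y = hidden_layer(h, W2, b2)         # output layer, shape (2,)
```

Because the sigmoid saturates, every activation lies strictly between 0 and 1, which is the sense in which F(.) is called a saturating nonlinearity.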
Deep neural networks can be seen as a modern day instantiation of Rosenblatt's perceptron and multilayer perceptron. Although neural network models