Convolutional neural networks (CNNs) have achieved unprecedented success in computer vision, yet we still lack a comprehensive understanding of why they work so well. Isma Hadji and Richard P. Wildes of the Department of Electrical Engineering and Computer Science at York University published the paper "What Do We Understand About Convolutional Networks?", which surveys the technical foundations, building blocks, current state, and research prospects of convolutional networks, summarizing our present understanding of CNNs.
What Do We Understand About
Convolutional Networks?
Isma Hadji and Richard P. Wildes
Department of Electrical Engineering and Computer Science
York University
Toronto, Ontario
Canada
arXiv:1803.08834v1 [cs.CV] 23 Mar 2018

Chapter 1
Introduction
1.1 Motivation
Over the past few years major computer vision research efforts have focused on
convolutional neural networks, commonly referred to as ConvNets or CNNs. These
efforts have resulted in new state-of-the-art performance on a wide range of classification
(e.g., [64,88,139]) and regression (e.g., [36,97,159]) tasks. In contrast, while the
history of such approaches can be traced back a number of years (e.g., [49,91]),
theoretical understanding of how these systems achieve their outstanding results lags behind.
In fact, many current contributions in the computer vision field use ConvNets
as a black box that works while having only a vague idea of why it works, which
is unsatisfactory from a scientific point of view. In particular, there are two
main complementary concerns: (1) For learned aspects (e.g., convolutional kernels),
exactly what has been learned? (2) For architecture design aspects (e.g., number
of layers, number of kernels per layer, pooling strategy, choice of nonlinearity), why
are some choices better than others? The answers to these questions will not only
improve the scientific understanding of ConvNets, but also increase their practical
applicability.
Moreover, current realizations of ConvNets require massive amounts of data for
training [84,88,91], and the design decisions made greatly impact performance [23,77].
Deeper theoretical understanding should lessen dependence on data-driven design.
While empirical studies have investigated the operation of implemented networks, to
date their results have largely been limited to visualizations of internal processing,
aimed at understanding what is happening at the different layers of a ConvNet [104,133,154].
1.2 Objective
In response to the above-noted state of affairs, this document will review the most
prominent proposals using multilayer convolutional architectures. Importantly, the
various components of a typical convolutional network will be discussed through a
review of different approaches that base their design decisions on biological findings
and/or sound theoretical bases. In addition, the different attempts at understanding
ConvNets via visualizations and empirical studies will be reviewed. The ultimate
goal is to shed light on the role of each layer of processing involved in a ConvNet
architecture, distill what we currently understand about ConvNets, and highlight
critical open problems.
1.3 Outline of report
This report is structured as follows: The present chapter has motivated the need for
a review of our understanding of convolutional networks. Chapter 2 will describe
various multilayer networks and present the most successful architectures used in
computer vision applications. Chapter 3 will more specifically focus on each one
of the building blocks of typical convolutional networks and discuss the design of
the different components from both biological and theoretical perspectives. Finally,
Chapter 4 will describe the current trends in ConvNet design and efforts towards
ConvNet understanding, and highlight some critical outstanding shortcomings that
remain.

Chapter 2
Multilayer Networks
This chapter gives a succinct overview of the most prominent multilayer architectures
used in computer vision. Notably, while this chapter covers the most
important contributions in the literature, it will not provide a comprehensive
review of such architectures, as such reviews are available elsewhere (e.g., [17,56,90]).
Instead, the purpose of this chapter is to set the stage for the remainder
of the document and its detailed presentation and discussion of what is currently
understood about convolutional networks applied to visual information processing.
2.1 Multilayer architectures
Prior to the recent success of deep learning-based networks, state-of-the-art computer
vision systems for recognition relied on two separate but complementary steps.
First, the input data was transformed via a set of hand-designed operations (e.g.,
convolutions with a basis set, local or global encoding methods) into a suitable form.
These transformations usually entail finding a compact and/or abstract representation
of the input data, while injecting several invariances depending on the task at hand.
The goal of the transformation is to change the data in a way that makes it more
readily separable by a classifier. Second, the transformed data is used to train some
sort of classifier (e.g., Support Vector Machines) to recognize the content of the
input signal. The performance of any classifier used is usually heavily affected by
the chosen transformations.
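As a minimal sketch of this classic two-step pipeline (the text names no specific implementation; the filter bank and the nearest-centroid classifier below are illustrative stand-ins, with the latter substituting for an SVM to keep the example self-contained):

```python
import numpy as np

def hand_designed_transform(signal, filters):
    """Step 1: convolve a 1-D input with a fixed (not learned) filter bank,
    then globally max-pool each response into a compact feature vector.
    Max-pooling injects some translation invariance."""
    responses = [np.convolve(signal, f, mode="valid") for f in filters]
    return np.array([r.max() for r in responses])

class NearestCentroidClassifier:
    """Step 2: a simple classifier trained only on the transformed data
    (an illustrative stand-in for the SVMs mentioned in the text)."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.array([X[y == c].mean(axis=0) for c in self.classes_])
        return self
    def predict(self, X):
        # Assign each sample to the class with the nearest centroid.
        d = np.linalg.norm(X[:, None, :] - self.centroids_[None, :, :], axis=2)
        return self.classes_[d.argmin(axis=1)]
```

The key property this sketch captures is the separation of concerns: the transform is designed by hand and never updated, and only the classifier is fit to data, so the classifier's performance is bounded by how well the fixed transform separates the classes.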
Multilayer architectures with learning bring a different outlook on the
problem by proposing to learn, not only the classifier, but also the required
transformation operations, directly from the data. This form of learning is commonly
referred to as representation learning [7,90], which, when used in the context of deep
multilayer architectures, is called deep learning.

Multilayer architectures can be defined as computational models that allow for
extracting useful information from the input data at multiple levels of abstraction.
Generally, multilayer architectures are designed to amplify important aspects of the
input at higher layers, while becoming increasingly robust to less significant
variations. Most multilayer architectures stack simple building-block modules with
alternating linear and nonlinear functions. Over the years, a plethora of multilayer
architectures has been proposed, and this section covers the most prominent
such architectures adopted for computer vision applications. In particular, artificial
neural network architectures will be the focus due to their prominence. For the sake
of succinctness, such networks will be referred to more simply as neural networks in
the following.
2.1.1 Neural networks
A typical neural network architecture is made of an input layer, $x$, an output layer,
$y$, and a stack of multiple hidden layers, $h$, where each layer consists of multiple
cells or units, as depicted in Figure 2.1. Usually, each hidden unit, $h_j$, receives input
from all units at the previous layer and is defined as a weighted combination of the
inputs followed by a nonlinearity according to

$$h_j = F\left(b_j + \sum_i w_{ij} x_i\right) \qquad (2.1)$$

where $w_{ij}$ are the weights controlling the strength of the connections between the
input units and the hidden unit, $b_j$ is a small bias of the hidden unit and $F(\cdot)$ is
some saturating nonlinearity such as the sigmoid.
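Equation (2.1) can be sketched in a few lines of NumPy; here the sigmoid is used as the saturating nonlinearity $F(\cdot)$, and the whole layer is computed at once as the vectorized form $h = F(Wx + b)$ (function names are illustrative, not from the paper):

```python
import numpy as np

def sigmoid(z):
    # A saturating nonlinearity, one common choice for F(.) in Eq. (2.1).
    return 1.0 / (1.0 + np.exp(-z))

def hidden_layer(x, W, b):
    """Compute all hidden units h_j = F(b_j + sum_i w_ij * x_i) at once.

    x: (n_in,) input units; W: (n_hidden, n_in) weights w_ij;
    b: (n_hidden,) biases b_j. Returns the (n_hidden,) hidden activations.
    """
    return sigmoid(W @ x + b)
```

Stacking several such layers, each feeding its activations to the next, yields the multilayer networks discussed in this chapter.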
Deep neural networks can be seen as a modern-day instantiation of Rosenblatt's
perceptron [122] and multilayer perceptron [123]. Although neural network models