A Survey of the Latest Deep Convolutional Neural Network Architectures: The Key to Improved Performance

Updated 2024-07-17 · 873 KB PDF

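Deep convolutional neural networks (CNNs) are a key technology behind the strong results achieved in computer vision in recent years. As a special type of neural network, a CNN uses multiple stages of nonlinear feature extraction to learn hierarchical representations from data automatically, which gives it strong learning capacity. With the growing availability of large datasets and improvements in hardware processing units, research on CNN architectures has accelerated and produced many notable deep CNN designs. Recent CNN architectures compete for high efficiency on challenging benchmarks, which shows that innovative network design and parameter optimization strategies are essential for improving CNN performance on vision tasks such as image recognition, object detection, and semantic segmentation. This survey reviews these recent deep CNN architectures in detail, including but not limited to:

1. **Deep learning modules**: how stacked-layer designs such as residual connections, attention mechanisms, and self-attention networks strengthen a model's learning and generalization ability.
2. **Convolutional layer innovations**: optimization of kernel size, stride, and padding strategy, together with convolution variants (such as depthwise separable convolution and mixed convolution), and how they reduce computation and memory consumption while maintaining or improving performance.
3. **Pooling layers and downsampling**: how different pooling types (max pooling, average pooling, global pooling) and spatial pyramid pooling across layers reduce dimensionality while preserving information.
4. **Dilated convolution and skip connections**: the role of these two techniques in enlarging the receptive field and preserving low-level feature information, and how they promote feature fusion.
5. **Parameter optimization and regularization**: the choice of optimizer (such as Adam or SGD), learning-rate schedules (such as learning-rate decay and warm-up), and strategies against overfitting, such as batch normalization and Dropout.
6. **Transfer learning and fine-tuning**: applying and adapting pretrained models (such as VGG, ResNet, and Inception) to specific tasks, and how existing large-scale pretraining data can improve performance on new tasks.
7. **Dynamic network structures**: the use of adaptive networks, deformable convolution, and generative adversarial networks (GANs) in CNN architectures, which increases model flexibility and adaptability to complex scenes.
8. **Hardware acceleration and parallel computing**: how to design efficient parallel computation schemes for hardware platforms such as GPUs and TPUs to accelerate CNN training and inference.

This survey gives readers a comprehensive view of recent progress and key breakthroughs in deep CNN architectures, with the aim of helping researchers and engineers understand and apply these techniques and advance the field of computer vision.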
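
Two of the techniques named in the summary can be made concrete with simple arithmetic. The sketch below uses the standard textbook formulas for the weight count of a depthwise separable convolution and for the spatial extent of a dilated kernel. The function names and the example layer sizes are illustrative assumptions, not values taken from the survey.

```python
def standard_conv_weights(k: int, c_in: int, c_out: int) -> int:
    """Weights in a standard k x k convolution (bias terms ignored)."""
    return k * k * c_in * c_out


def depthwise_separable_weights(k: int, c_in: int, c_out: int) -> int:
    """Weights in a depthwise k x k convolution followed by a 1 x 1 pointwise one."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise


def dilated_kernel_extent(k: int, dilation: int) -> int:
    """Side length of the input region spanned by a k x k kernel with the given dilation."""
    return k + (k - 1) * (dilation - 1)


if __name__ == "__main__":
    # Hypothetical layer: 3 x 3 kernel, 64 input channels, 128 output channels.
    print(standard_conv_weights(3, 64, 128))
    print(depthwise_separable_weights(3, 64, 128))
    # A 3 x 3 kernel with dilation 2 spans the same region as a 5 x 5 kernel.
    print(dilated_kernel_extent(3, 2))
```

For this hypothetical layer the separable form needs roughly an eighth of the weights, which is the kind of saving the summary refers to when it mentions reduced computation and memory consumption.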

3.1. Late 1980s-1999: Origin of CNN

CNNs have been applied to visual tasks since the late 1980s. In 1989, LeCun et al. proposed the first multilayered CNN, named ConvNet, whose origin was rooted in Fukushima’s Neocognitron [48,49]. LeCun proposed supervised training of ConvNet using the backpropagation algorithm, in contrast to the unsupervised reinforcement learning scheme used by its predecessor, Neocognitron. Supervised training gives a CNN the ability to learn features automatically from raw input, rather than relying on the handcrafted features required by traditional ML methods. This ConvNet showed successful results on handwritten digit and zip code recognition problems. In 1998, LeCun improved ConvNet and used it to classify characters in a document recognition application. This modified architecture was named LeNet-5 [52,53]; it improved on the initial CNN because it can extract feature representations hierarchically from raw pixels. LeNet-5's reliance on fewer parameters, together with its consideration of the spatial topology of images, enabled the CNN to recognize rotational variants of an image. Owing to its good performance in optical character recognition, its commercial use in ATMs and banks started in 1993 and 1996, respectively. Although LeNet-5 achieved many successes, the main concern was that its discrimination power did not scale to classification tasks other than handwritten character recognition.
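
The remark about LeNet-5 relying on fewer parameters can be illustrated by comparing a convolutional layer, which shares one small kernel across all image positions, with a fully connected layer that produces the same number of outputs. The sketch below assumes the commonly cited first-layer configuration of LeNet-5 (a 32 x 32 single-channel input, 5 x 5 kernels, six feature maps); these sizes are not stated in the text above and are used only for illustration.

```python
def conv_layer_params(kernel: int, in_maps: int, out_maps: int) -> int:
    """Trainable parameters of a convolutional layer: shared kernel weights plus one bias per map."""
    return out_maps * (kernel * kernel * in_maps + 1)


def dense_layer_params(n_in: int, n_out: int) -> int:
    """Trainable parameters of a fully connected layer: one weight per input-output pair plus biases."""
    return n_in * n_out + n_out


if __name__ == "__main__":
    in_side, kernel, out_maps = 32, 5, 6      # assumed LeNet-5-style first layer
    out_side = in_side - kernel + 1           # 28: no padding, stride 1

    conv = conv_layer_params(kernel, in_maps=1, out_maps=out_maps)
    dense = dense_layer_params(in_side * in_side, out_maps * out_side * out_side)

    print(f"convolutional layer: {conv} parameters")
    print(f"fully connected layer with the same output size: {dense} parameters")
```

Under these assumptions the convolutional layer needs 156 parameters, while the fully connected equivalent needs several million, because weight sharing ties every position in a feature map to the same kernel.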

3.2. Early 2000s: Stagnation of CNN

In the late 1990s and early 2000s, interest in NNs was low, and little research explored the role of CNNs in applications such as object detection and video surveillance. The use of CNNs in ML tasks became dormant because performance improved only marginally, with no noticeable decrease in error. At that time, other statistical methods, in particular SVM [54,55], became more popular than CNN owing to their good performance. It was widely presumed in the early 2000s that the backpropagation algorithm used to train CNNs was not effective at converging to optimal points, and that the features it learned in a supervised fashion were not useful compared with handcrafted features. Meanwhile, several researchers kept working on CNNs and tried to optimize their performance. In 2003, Simard et al. [58] improved the CNN architecture and showed good results compared with SVM on the handwritten digit benchmark dataset MNIST [51,56,59]. This performance improvement expedited CNN research by extending its application from optical character recognition (OCR) to character recognition in other scripts [59–61], deployment in image sensors for face detection in video conferencing, regulation of street crime, and so on. Likewise, CNN-based systems were commercialized for detecting and tracking customers. Moreover, the potential of CNNs in other applications, such as medical image segmentation, anomaly detection, and robot vision, was also evaluated [63,64,65].

3.3. 2006-2011: Revival of CNN

Deep NNs have a very complex structure and a time-intensive training phase that sometimes spanned weeks or even months. In the early 2000s, there was no appropriate approach for training deep networks. Moreover, CNNs did not scale well enough to be applied to complex problems. These challenges halted the use of CNNs in ML-related tasks.

To address these problems, many methods were developed in 2006 to overcome the difficulties encountered in training deep CNNs and in learning invariant features. Hinton proposed a greedy layer-wise training approach for deep architectures in 2006. One of the factors that brought deep CNNs into the limelight was a renaissance of deep learning [67,68]. Huang et al. used max pooling instead of subsampling, which showed good results by learning invariant features. In late 2006, researchers started using graphics processing units (GPUs) to accelerate the training of DNN and CNN architectures. In 2007, NVIDIA launched the CUDA programming platform [70,71], which allows the parallel processing capabilities of GPUs to be exploited to a much greater degree. The use of GPUs for NNs, together with hardware improvements, revived research in CNNs. In 2010, Fei-Fei Li’s group at Stanford established a large database of images known as ImageNet, containing millions of labeled images. This database was coupled with the annual ILSVRC competition, in which the performance of submitted models was evaluated and scored.
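
The switch from subsampling to max pooling mentioned above can be sketched in a few lines. The example assumes that "subsampling" means plain averaging over non-overlapping windows (the original LeNet-style subsampling also applied a trainable coefficient and bias, omitted here); the helper names and the toy 4 x 4 input are hypothetical.

```python
def pool2d(image, size, reduce_fn):
    """Apply reduce_fn to each non-overlapping size x size window of a 2-D list."""
    rows, cols = len(image), len(image[0])
    pooled = []
    for r in range(0, rows - size + 1, size):
        pooled_row = []
        for c in range(0, cols - size + 1, size):
            window = [image[r + i][c + j] for i in range(size) for j in range(size)]
            pooled_row.append(reduce_fn(window))
        pooled.append(pooled_row)
    return pooled


def max_pool(image, size=2):
    """Keep the strongest response in each window."""
    return pool2d(image, size, max)


def mean_pool(image, size=2):
    """Average each window (the simplified stand-in for subsampling)."""
    return pool2d(image, size, lambda window: sum(window) / len(window))


if __name__ == "__main__":
    # One strong activation (9) and one weak activation (1).
    image = [
        [0, 0, 0, 0],
        [0, 9, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 1],
    ]
    # The same pattern with the strong activation moved by one pixel.
    shifted = [
        [9, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 1],
    ]
    print(max_pool(image))    # strong response kept at full magnitude
    print(mean_pool(image))   # strong response diluted by the zeros around it
    print(max_pool(shifted) == max_pool(image))
```

Max pooling passes the strongest response through unchanged and gives the same output when that response moves within its window, which is one way to read the claim that it helps in learning invariant features. Averaging dilutes the response with its neighbours.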

3.4. 2012-2014: Boom of CNN

The availability of big training data, hardware advancements, and computational resources contributed to advances in CNN algorithms and to a renaissance of CNNs in object detection, image classification, and segmentation tasks. However, the success of CNNs in image classification was not just the result of high-tech hardware; it was also due to architectural modifications, parameter optimization, the incorporation of regulatory units, and the reformulation and readjustment of connections within the network [33,34,74].

The main breakthrough in CNN performance was brought by AlexNet. AlexNet won the 2012-ILSVRC competition, which was one of the most difficult challenges for image detection and classification. AlexNet improved performance by exploiting depth (incorporating multiple levels of transformation) and introduced a regularization term in CNN. The exemplary performance of AlexNet (a CNN method) suggested that the performance of CNNs on vision-related tasks can be improved by increasing representational depth. The improved performance of AlexNet in 2012-ILSVRC also suggested that the saturation in CNN performance before 2006 was largely due to the unavailability of enough training data and computational resources, which made it hard to train a high-capacity CNN without deterioration of performance.

With CNNs becoming more of a commodity in the computer vision (CV) field, a number of attempts have been made to improve the original architecture of AlexNet. Similarly, in 2013 and 2014 researchers worked on parameter optimization to improve CNN performance on diverse applications with only a small increase in computational requirements. In 2013, Zeiler et al. defined a mechanism to visualize the learned filters of each CNN layer. This visualization approach was used to improve feature extraction by reducing the size of the filters. Similarly, the VGG architecture proposed by the Oxford group, which was runner-up at the 2014-ILSVRC competition, made the receptive field much smaller than that of AlexNet but with increased volume. In VGG, depth is increased from 9 layers to 16, with the volume of feature maps doubling at each layer. In the same year, GoogLeNet, which won the 2014-ILSVRC competition, not only rearranged parameters to reduce computational cost but also increased width along with depth to improve CNN performance. GoogLeNet introduced the concept of blocks, within which multiscale and multilevel transformations are incorporated to capture both local and global information [26,76,77]. The use of multilevel transformations helps a CNN handle image details at various levels. In 2012-14, the main improvement in the learning capacity of CNNs was achieved by increasing depth and by parameter optimization strategies. This suggests that representational depth is beneficial in improving the generalization of a classifier.
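
The trade-off behind VGG's small filters can be checked with a short calculation. The sketch below uses the standard result that a stack of stride-1 convolutions covers a receptive field of 1 + layers x (kernel - 1), and compares the weight count of stacked 3 x 3 layers with a single larger kernel covering the same region. The channel count of 64 is an arbitrary assumption, and this arithmetic is an illustration rather than something stated in the excerpt.

```python
def stacked_receptive_field(kernel: int, layers: int) -> int:
    """Receptive field side length of `layers` stacked kernel x kernel convolutions (stride 1, no dilation)."""
    return 1 + layers * (kernel - 1)


def stacked_conv_weights(kernel: int, layers: int, channels: int) -> int:
    """Total weights when every layer maps `channels` inputs to `channels` outputs (biases ignored)."""
    return layers * kernel * kernel * channels * channels


if __name__ == "__main__":
    channels = 64  # hypothetical width, the same for every layer

    # Three stacked 3 x 3 layers see the same 7 x 7 region as one 7 x 7 layer.
    print(stacked_receptive_field(kernel=3, layers=3))
    print(stacked_receptive_field(kernel=7, layers=1))

    # The stacked version needs fewer weights and applies more nonlinearities.
    print(stacked_conv_weights(kernel=3, layers=3, channels=channels))
    print(stacked_conv_weights(kernel=7, layers=1, channels=channels))
```

With these assumptions, three 3 x 3 layers cover a 7 x 7 region with roughly half the weights of a single 7 x 7 layer, which is consistent with the point that greater depth and smaller filters were used together in this period.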


A Survey of the Recent Architectures of Deep Convolutional Neural Networks.pdf
