The step function format for exponential decay is:
$\eta_t = \eta_0 \, \beta^{\lfloor t/\epsilon \rfloor}$ (10)
where $\eta_0$ is the initial learning rate, $\beta$ is the decay factor, and $\epsilon$ is the number of epochs between decay steps. The common practice is to use a learning rate decay of $\beta = 0.1$
to reduce the learning rate by a factor of 10 at each stage.
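As a concrete illustration, the following is a minimal Python sketch of such a step-decay schedule; the initial rate of 0.1, the drop factor of 0.1, and the interval of 30 epochs are illustrative assumptions rather than values prescribed above.

def step_decay(initial_lr, epoch, drop_factor=0.1, epochs_per_drop=30):
    # Multiply the learning rate by drop_factor every epochs_per_drop epochs,
    # i.e. eta_t = eta_0 * drop_factor ** floor(epoch / epochs_per_drop).
    return initial_lr * (drop_factor ** (epoch // epochs_per_drop))

# Starting at 0.1, the rate drops to 0.01 at epoch 30 and to 0.001 at epoch 60.
for epoch in (0, 29, 30, 60):
    print(epoch, step_decay(0.1, epoch))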
G. Weight decay
Weight decay is used in training deep learning models as an
L2 regularization approach, which helps to prevent overfitting
the network and improves model generalization. L2 regularization for
the network weights $\theta$ can be defined as:
$\Omega = \lVert \theta \rVert^{2}$ (11)
$\hat{\mathcal{E}}(\theta) = \mathcal{E}(\theta) + \tfrac{1}{2}\lambda\,\Omega$ (12)
where $\mathcal{E}(\theta)$ is the original error criterion and $\lambda$ is the weight-decay coefficient.
The gradient of the regularization term with respect to the weights $\theta$ is:
$\frac{\partial\left(\tfrac{1}{2}\lambda\Omega\right)}{\partial \theta} = \lambda\,\theta$ (13)
General practice is to use a small value for $\lambda$, typically on the order of $10^{-4}$. A smaller $\lambda$
will accelerate training.
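To make the update rule concrete, the following is a minimal Python/NumPy sketch of a single SGD step with L2 weight decay; the learning rate and decay coefficient used here are illustrative assumptions.

import numpy as np

def sgd_step_with_weight_decay(theta, grad_loss, lr=0.01, weight_decay=1e-4):
    # Per Eq. (13), the gradient of (1/2) * lambda * ||theta||^2 is lambda * theta,
    # so weight decay simply adds lambda * theta to the gradient of the loss.
    grad = grad_loss + weight_decay * theta
    return theta - lr * grad

# Example usage with random weights and a dummy loss gradient.
theta = np.random.randn(10)
grad_loss = np.random.randn(10)
theta = sgd_step_with_weight_decay(theta, grad_loss)

Note that the decay term shrinks the weights toward zero at every step, which is why this regularizer is referred to as weight decay.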
Other necessary components for efficient training include
data preprocessing and augmentation, network initialization
approaches, batch normalization, activation functions,
regularization with dropout, and different optimization
approaches (as discussed in Section 4).
In the last few decades, many efficient approaches have been
proposed for better training of deep neural networks. Before
2006, attempts at training deep architectures failed: training
a deep supervised feed-forward neural network tended to yield
worse results (in both training and test error) than shallow ones
(with 1 or 2 hidden layers). Hinton's revolutionary work on
DBNs spearheaded a change in this in 2006 [50, 53].
Due to their composition, the many layers of a DNN are more
capable of representing highly varying nonlinear functions
than shallow learning approaches [56, 57, 58].
Moreover, DNNs are more efficient for learning because of the
combination of feature extraction and classification layers. The
following sections discuss the different DL approaches in detail,
along with their necessary components.
III. CONVOLUTIONAL NEURAL NETWORKS (CNN)
A. CNN overview
This network structure was first proposed by Fukushima in
1988 [48]. It was not widely used at the time, however, due to the
limits of the computational hardware available for training the network. In the 1990s,
LeCun et al. applied a gradient-based learning algorithm to
CNNs and obtained successful results for the handwritten digit
classification problem [49]. After that, researchers further
improved CNNs and reported state-of-the-art results in many
recognition tasks. CNNs have several advantages over DNNs,
including being more similar to the human visual processing
system, being highly optimized in structure for processing 2D
and 3D images, and being effective at learning and extracting
abstractions of 2D features. The max pooling layer of CNNs is
effective in absorbing shape variations. Moreover, composed of
sparse connections with tied weights, CNNs have significantly
fewer parameters than a fully connected network of similar size.
Above all, CNNs are trained with a gradient-based learning
algorithm and suffer less from the vanishing gradient
problem. Given that the gradient-based algorithm trains the
whole network to minimize an error criterion directly, CNNs
can produce highly optimized weights.
Fig. 11. The overall architecture of the CNN includes an input layer, multiple alternating convolution and max-pooling layers, one fully-connected
layer and one classification layer.
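As an illustrative sketch of the overall architecture in Fig. 11, a minimal CNN of this form can be written in Python with PyTorch as follows; the channel counts, kernel sizes, and the 28x28 grayscale input (e.g., handwritten-digit images) are illustrative assumptions rather than a specific architecture taken from the text.

import torch
import torch.nn as nn

class SimpleCNN(nn.Module):
    # Follows the layout of Fig. 11: alternating convolution and max-pooling
    # layers, one fully-connected layer, and one classification layer.
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=5, padding=2),   # convolution layer
            nn.ReLU(),
            nn.MaxPool2d(2),                              # max-pooling layer
            nn.Conv2d(16, 32, kernel_size=5, padding=2),  # convolution layer
            nn.ReLU(),
            nn.MaxPool2d(2),                              # max-pooling layer
        )
        self.fc = nn.Linear(32 * 7 * 7, 128)              # fully-connected layer
        self.classifier = nn.Linear(128, num_classes)     # classification layer

    def forward(self, x):
        x = self.features(x)
        x = torch.flatten(x, 1)
        x = torch.relu(self.fc(x))
        return self.classifier(x)

# Forward pass on a batch of four 28x28 grayscale images.
logits = SimpleCNN()(torch.randn(4, 1, 28, 28))

Because the convolution layers use sparse connections with tied weights, this model has far fewer parameters than a fully connected network operating on the same input, as noted above.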