Algorithm 1: Distributed Synchronous SGD on Node k.
Input: Dataset 𝑋, minibatch size 𝑏 per node, the number of nodes 𝑁, optimization function SGD, init parameters 𝑤 = {𝑤[0], ···, 𝑤[𝑀]}
for 𝑡 = 0, 1, ··· do
    𝐺_𝑡^𝑘 ← 0;
    for 𝑖 = 1, ···, 𝑏 do
        Sample data 𝑥 from 𝑋;
        𝐺_𝑡^𝑘 ← 𝐺_𝑡^𝑘 + (1/(𝑁𝑏)) ∇𝑓(𝑥; 𝑤_𝑡);
    end
    All-Reduce 𝐺_𝑡^𝑘: 𝐺_𝑡 ← Σ_{𝑘=1}^{𝑁} 𝐺_𝑡^𝑘;
    𝑤_{𝑡+1} ← SGD(𝑤_𝑡, 𝐺_𝑡);
end
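As a rough illustration, the following is a minimal PyTorch-style sketch of one iteration of Algorithm 1. It assumes `torch.distributed` has already been initialized with 𝑁 processes, and that `model`, `loss_fn`, the 𝑏 local `(x, y)` samples, and the learning rate are supplied by the caller; the update step here is plain SGD for simplicity.

```python
# Sketch of one iteration of Algorithm 1 on node k (assumes torch.distributed is initialized
# with N processes; `model`, `loss_fn`, and `samples` are user-provided).
import torch
import torch.distributed as dist

def train_step(model, loss_fn, samples, b, N, lr=0.1):
    # G_t^k <- 0
    grads = [torch.zeros_like(p) for p in model.parameters()]
    for x, y in samples:                      # b local samples drawn from X
        model.zero_grad()
        loss_fn(model(x), y).backward()       # computes per-sample gradient ∇f(x; w_t)
        for g, p in zip(grads, model.parameters()):
            g += p.grad / (N * b)             # G_t^k <- G_t^k + (1/(Nb)) ∇f(x; w_t)
    # All-Reduce: G_t <- sum over k of G_t^k
    for g in grads:
        dist.all_reduce(g, op=dist.ReduceOp.SUM)
    # w_{t+1} <- SGD(w_t, G_t), here a plain gradient step
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p -= lr * g
```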
solution even in convex settings, or even diverge in DL training. Reddi et al. [84] pinpoint the exponential moving average of past squared gradients as a reason for such failures. Recall that the introduction of the exponential average was well-motivated to tackle the key flaw of the Adagrad algorithm: it should prevent the LRs from becoming infinitesimally small as training progresses by limiting the reliance of the update to essentially only the past few gradients. However, this short-term memory of the gradients can indeed cause significant convergence issues in other scenarios. To resolve this issue, the authors propose a new variant of Adam, AMSGrad, which relies on long-term memory of past gradients. AMSGrad uses the maximum of past squared gradients rather than the exponential average to update the parameters. Liu et al. [64] argue that the root cause of the bad convergence problem suffered by Adam is that the adaptive LR has undesirably large variance in the early stage of model training, due to the limited amount of training samples being used. Thus, to reduce such variance, it is better to use smaller LRs in the first few epochs of training. The authors propose Rectified Adam (RAdam) to rectify the variance of the adaptive LR.
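To make the contrast concrete, below is a minimal NumPy sketch of the second-moment bookkeeping in Adam versus AMSGrad. The variable names (`v`, `v_hat`, `grad`) and the default `beta2` are illustrative assumptions, not code from [84]; the point is only that AMSGrad replaces the exponential average in the denominator with a running maximum, giving it long-term memory of large past gradients.

```python
# Sketch: second-moment updates of Adam vs. AMSGrad (names and defaults are illustrative).
import numpy as np

def adam_second_moment(v, grad, beta2=0.999):
    # Exponential moving average: old squared gradients decay away (short-term memory).
    return beta2 * v + (1.0 - beta2) * grad**2

def amsgrad_second_moment(v, v_hat, grad, beta2=0.999):
    v = beta2 * v + (1.0 - beta2) * grad**2
    v_hat = np.maximum(v_hat, v)   # running maximum: never decreases (long-term memory)
    return v, v_hat

# The parameter update then divides by sqrt(v) for Adam but sqrt(v_hat) for AMSGrad:
#   w <- w - lr * m / (sqrt(v_or_v_hat) + eps)
```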
Choosing an optimizer is a crucial step when training DNNs, since it affects both the training speed and the final predictive performance. Although adaptive optimization methods, including AdaGrad, RMSProp, AdaDelta, and Adam, are becoming increasingly popular, how to choose the optimal one is, to date, still theoretically elusive and intractable. Instead, practitioners rely on empirical studies [113] and benchmarking [88]. Wilson et al. [113] observed that the solutions found by adaptive methods generalize worse (often significantly worse) than those found by SGD, even when these solutions have better training performance. However, Choi et al. [25] suggest that popular adaptive gradient methods never under-perform momentum or gradient descent. They point out that comparisons among optimizers are sensitive to the hyper-parameter tuning protocols.
4 LARGE BATCH TRAINING
Large DNNs and large datasets have fueled the development of deep learning [28, 42, 55, 56, 94, 104].
However, training large models on massive datasets is compute-intensive. For instance, training SOTA DL models such as BERT and ResNet-50 takes 3 days on 16 TPUv3 chips and 29 hours on 8 Tesla P100 GPUs, respectively [28, 42]. An intuitive way to accelerate training is to add more computational power (e.g., more GPU nodes) and use data parallelism (see Alg. 1). Since communication (i.e., synchronizing the updates at each iteration) is a bottleneck, each GPU must be utilized as much as possible to amortize the communication cost. Therefore, large batches should be used to distribute more data to each GPU, as sketched below.
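As a concrete but hypothetical setup, the sketch below shows the usual data-parallel recipe in PyTorch: each of the `world_size` GPUs consumes its own local batch, so the effective batch size per optimizer step is `local_batch * world_size`. Here `model` and `train_dataset` are assumed to be defined elsewhere, and the script is assumed to be launched with one process per GPU (e.g., via torchrun).

```python
# Sketch of data-parallel training with a large per-GPU batch (assumes one process per GPU
# on a single node; `model` and `train_dataset` are user-provided).
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

dist.init_process_group(backend="nccl")             # one process per GPU
rank, world_size = dist.get_rank(), dist.get_world_size()
torch.cuda.set_device(rank)                         # assumes rank == local GPU index

local_batch = 256                                   # large per-GPU batch to amortize communication
sampler = DistributedSampler(train_dataset)         # shards the dataset across workers
loader = DataLoader(train_dataset, batch_size=local_batch, sampler=sampler)

ddp_model = DDP(model.cuda(rank), device_ids=[rank])
optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1, momentum=0.9)

# Effective (global) batch size seen by the optimizer per step:
global_batch = local_batch * world_size
```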
The nontrivial growth of batch size often results in test performance degradation, as observed in [45, 52, 54, 61]. We describe the training difficulties introduced by large