Algorithm 1: Distributed Synchronous SGD on Node k.
Input: Dataset 𝑋, minibatch size 𝑏 per node, the number of nodes 𝑁, optimization function SGD, init parameters 𝑤 = {𝑤[0], ···, 𝑤[𝑀]}
for 𝑡 = 0, 1, ··· do
    𝐺_𝑡^𝑘 ← 0;
    for 𝑖 = 1, ···, 𝑏 do
        Sample data 𝑥 from 𝑋;
        𝐺_𝑡^𝑘 ← 𝐺_𝑡^𝑘 + (1/(𝑁𝑏)) ∇𝑓(𝑥; 𝑤_𝑡);
    end
    All-Reduce 𝐺_𝑡^𝑘: 𝐺_𝑡 ← Σ_{𝑘=1}^{𝑁} 𝐺_𝑡^𝑘;
    𝑤_{𝑡+1} ← SGD(𝑤_𝑡, 𝐺_𝑡);
end
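As a rough illustration, the following is a minimal PyTorch-style sketch of one iteration of Algorithm 1. It assumes `torch.distributed` has already been initialized with 𝑁 processes, and that `model`, `loss_fn`, the 𝑏 local `(x, y)` samples, and the learning rate are supplied by the caller; the update step here is plain SGD for simplicity.

```python
# Sketch of one iteration of Algorithm 1 on node k (assumes torch.distributed is initialized
# with N processes; `model`, `loss_fn`, and `samples` are user-provided).
import torch
import torch.distributed as dist

def train_step(model, loss_fn, samples, b, N, lr=0.1):
    # G_t^k <- 0
    grads = [torch.zeros_like(p) for p in model.parameters()]
    for x, y in samples:                      # b local samples drawn from X
        model.zero_grad()
        loss_fn(model(x), y).backward()       # computes per-sample gradient ∇f(x; w_t)
        for g, p in zip(grads, model.parameters()):
            g += p.grad / (N * b)             # G_t^k <- G_t^k + (1/(Nb)) ∇f(x; w_t)
    # All-Reduce: G_t <- sum over k of G_t^k
    for g in grads:
        dist.all_reduce(g, op=dist.ReduceOp.SUM)
    # w_{t+1} <- SGD(w_t, G_t), here a plain gradient step
    with torch.no_grad():
        for p, g in zip(model.parameters(), grads):
            p -= lr * g
```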
solution even in convex settings, or even diverge in DL training. Reddi et al. [84] pinpoint the exponential moving average of past squared gradients as a reason for such failures. Recall that the introduction of the exponential average was well-motivated to tackle the key flaw of the Adagrad algorithm: it should prevent the LRs from becoming infinitesimally small as training progresses by limiting the reliance of the update to essentially only the past few gradients. However, this short-term memory of the gradients can indeed cause significant convergence issues in other scenarios. To resolve this issue, the authors propose a new variant of Adam, AMSGrad, which relies on long-term memory of past gradients. AMSGrad uses the maximum of past squared gradients rather than the exponential average to update the parameters. Liu et al. [64] argue that the root cause of the bad convergence problem suffered by Adam is that the adaptive LR has undesirably large variance in the early stage of model training, due to the limited amount of training samples being used. Thus, to reduce such variance, it is better to use smaller LRs in the first few epochs of training. The authors propose Rectified Adam (RAdam) to rectify the variance of the adaptive LR.
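To make the contrast concrete, below is a minimal NumPy sketch of the second-moment bookkeeping in Adam versus AMSGrad. The variable names (`v`, `v_hat`, `grad`) and the default `beta2` are illustrative assumptions, not code from [84]; the point is only that AMSGrad replaces the exponential average in the denominator with a running maximum, giving it long-term memory of large past gradients.

```python
# Sketch: second-moment updates of Adam vs. AMSGrad (names and defaults are illustrative).
import numpy as np

def adam_second_moment(v, grad, beta2=0.999):
    # Exponential moving average: old squared gradients decay away (short-term memory).
    return beta2 * v + (1.0 - beta2) * grad**2

def amsgrad_second_moment(v, v_hat, grad, beta2=0.999):
    v = beta2 * v + (1.0 - beta2) * grad**2
    v_hat = np.maximum(v_hat, v)   # running maximum: never decreases (long-term memory)
    return v, v_hat

# The parameter update then divides by sqrt(v) for Adam but sqrt(v_hat) for AMSGrad:
#   w <- w - lr * m / (sqrt(v_or_v_hat) + eps)
```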
Choosing an optimizer is a crucial step when training DNNs, since it affects both the training speed and the final predictive performance. Although adaptive optimization methods, including AdaGrad, RMSProp, AdaDelta, and Adam, are becoming increasingly popular, how to choose the optimal one is, to date, still theoretically elusive and intractable. Instead, practitioners rely on empirical studies [113] and benchmarking [88]. Wilson et al. [113] observed that the solutions found by adaptive methods generalize worse (often significantly worse) than those found by SGD, even when these solutions have better training performance. However, Choi et al. [25] suggest that popular adaptive gradient methods never under-perform momentum or gradient descent. They point out that comparisons among optimizers are sensitive to the hyper-parameter tuning protocols.
4 LARGE BATCH TRAINING
Large DNNs and large datasets have fueled the development of deep learning [28, 42, 55, 56, 94, 104].
However, training large models on massive datasets is compute-intensive. For instance, training SOTA DL models such as BERT and ResNet-50 takes 3 days on 16 TPUv3 chips and 29 hours on 8 Tesla P100 GPUs, respectively [28, 42]. An intuitive way to accelerate training is to add more computational power (e.g., more GPU nodes) and use data parallelism (see Alg. 1). Since communication (i.e., synchronizing the updates at each iteration) is a bottleneck, each GPU must be utilized as much as possible to amortize the communication cost. Therefore, large batches should be used to distribute more data to each GPU, as sketched below.
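As a concrete but hypothetical setup, the sketch below shows the usual data-parallel recipe in PyTorch: each of the `world_size` GPUs consumes its own local batch, so the effective batch size per optimizer step is `local_batch * world_size`. Here `model` and `train_dataset` are assumed to be defined elsewhere, and the script is assumed to be launched with one process per GPU (e.g., via torchrun).

```python
# Sketch of data-parallel training with a large per-GPU batch (assumes one process per GPU
# on a single node; `model` and `train_dataset` are user-provided).
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, DistributedSampler

dist.init_process_group(backend="nccl")             # one process per GPU
rank, world_size = dist.get_rank(), dist.get_world_size()
torch.cuda.set_device(rank)                         # assumes rank == local GPU index

local_batch = 256                                   # large per-GPU batch to amortize communication
sampler = DistributedSampler(train_dataset)         # shards the dataset across workers
loader = DataLoader(train_dataset, batch_size=local_batch, sampler=sampler)

ddp_model = DDP(model.cuda(rank), device_ids=[rank])
optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.1, momentum=0.9)

# Effective (global) batch size seen by the optimizer per step:
global_batch = local_batch * world_size
```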
The nontrivial growth of batch size often results in test performance degradation, as observed in [45, 52, 54, 61]. We describe the training difficulties introduced by large