Stochastic Gradient Push for Distributed Deep Learning
Mahmoud Assran 1 2   Nicolas Loizou 1 3   Nicolas Ballas 1   Mike Rabbat 1

1 Facebook AI Research, Montréal, QC, Canada
2 Department of Electrical and Computer Engineering, McGill University, Montréal, QC, Canada
3 School of Mathematics, University of Edinburgh, Edinburgh, Scotland
Correspondence to: Mahmoud Assran <mahmoud.assran@mail.mcgill.ca>.
Abstract
Distributed data-parallel algorithms aim to accel-
erate the training of deep neural networks by par-
allelizing the computation of large mini-batch gra-
dient updates across multiple nodes. Approaches
that synchronize nodes using exact distributed av-
eraging (e.g., via ALLREDUCE) are sensitive to
stragglers and communication delays. The PUSH-
SUM gossip algorithm is robust to these issues,
but only performs approximate distributed aver-
aging. This paper studies Stochastic Gradient
Push (SGP), which combines PUSHSUM with
stochastic gradient updates. We prove that SGP
converges to a stationary point of smooth, non-
convex objectives at the same sub-linear rate as
SGD, and that all nodes achieve consensus. We
empirically validate the performance of SGP on
image classification (ResNet-50, ImageNet) and
machine translation (Transformer, WMT’16 En-
De) workloads.
1. Introduction
Deep Neural Networks (DNNs) are the state-of-the-art ma-
chine learning approach in many application areas, includ-
ing computer vision (He et al., 2016) and natural language
processing (Vaswani et al., 2017). Stochastic Gradient De-
scent (SGD) is the current workhorse for training neural
networks. The algorithm optimizes the network parameters,
x, to minimize a loss function, f(·), through gradient de-
scent, where the loss function's gradients are approximated
using a subset of training examples (a mini-batch). DNNs
often require large amounts of training data and trainable
parameters, imposing non-trivial computational require-
ments (Wu et al., 2016; Mahajan et al., 2018).
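For concreteness, one common way to write this update (in generic notation used here only for illustration; the step size γ, per-example loss F, and mini-batch B_k below are not notation fixed by this paper) is

x^{(k+1)} = x^{(k)} - \gamma \, \frac{1}{|\mathcal{B}_k|} \sum_{\xi \in \mathcal{B}_k} \nabla F\big(x^{(k)}; \xi\big),

where F(x; ξ) denotes the loss on a single training example ξ and B_k is the mini-batch sampled at iteration k.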
Large mini-batch parallel SGD is usually adopted for dis-
tributed training of deep networks (Goyal et al., 2017; Li
et al., 2014). Worker nodes compute local mini-batch gradi-
ents of the loss function on different subsets of the data, and
then calculate an exact inter-node average gradient using
either the ALLREDUCE communication primitive, in syn-
chronous implementations (Goyal et al., 2017; Akiba et al.,
2017), or using a central parameter server, in asynchronous
implementations (Dean et al., 2012). Using a parameter
server to aggregate gradients introduces a potential bottle-
neck and a central point of failure (Lian et al., 2017). The
ALLREDUCE primitive computes the exact average gradient
at all workers in a decentralized manner, avoiding issues as-
sociated with centralized communication and computation.
However, exact averaging algorithms like ALLREDUCE
are not robust in communication-constrained settings, i.e.,
where the network bandwidth is a significant bottleneck.
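To make this synchronous pattern concrete, the sketch below simulates AllReduce-based parallel SGD within a single process on a toy least-squares problem. The data, worker count, batch size, and step size are illustrative assumptions, and np.mean stands in for the exact average that ALLREDUCE would compute across machines.

```python
import numpy as np

rng = np.random.default_rng(0)
num_workers, dim, lr = 4, 10, 0.1
A = rng.normal(size=(400, dim))                       # synthetic least-squares data
b = A @ rng.normal(size=dim) + 0.01 * rng.normal(size=400)
shards = np.array_split(np.arange(400), num_workers)  # disjoint data shard per worker

x = np.zeros(dim)                                     # every worker holds this replica
for step in range(100):
    local_grads = []
    for w in range(num_workers):
        # Each worker computes a mini-batch gradient on its own shard.
        idx = rng.choice(shards[w], size=32, replace=False)
        local_grads.append(A[idx].T @ (A[idx] @ x - b[idx]) / len(idx))
    # Exact averaging: the role ALLREDUCE plays across machines.
    avg_grad = np.mean(local_grads, axis=0)
    # All workers apply the identical update, so replicas stay synchronized.
    x = x - lr * avg_grad
print("final training loss:", 0.5 * np.mean((A @ x - b) ** 2))
```

Because every replica must apply the same averaged gradient before proceeding, a single slow worker or delayed message stalls the entire iteration, which is the sensitivity described above.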
This issue motivates the investigation of a decentralized and
inexact version of SGD to reduce the communication over-
head associated with distributed training. There have been
numerous decentralized optimization algorithms proposed
and studied in the control-systems literature that leverage
gossip-based approaches for the computation of aggregate
information; see the survey of Nedić et al. (2018) and ref-
erences therein. State-of-the-art gossip-based optimization
methods build on the PUSHSUM algorithm for distributed
averaging (Kempe et al., 2003; Nedić et al., 2018). Rather
than computing exact averages (as with ALLREDUCE), this
line of work uses less-coupled message passing and com-
putes approximate averages. The tradeoff is that approxi-
mate distributed averaging also injects additional noise in
the average gradient estimate.
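As a minimal sketch of the PUSHSUM idea (assuming, for illustration only, a static directed ring with self-loops and uniform mixing weights), each node repeatedly pushes scaled copies of a value and a weight to its out-neighbors; the ratio of the two converges to the network-wide average:

```python
import numpy as np

n = 8
rng = np.random.default_rng(0)
values = rng.normal(size=n)          # each node starts with a private scalar
true_avg = values.mean()

# Column-stochastic mixing matrix for a directed ring with self-loops:
# node i keeps half of its mass and pushes half to node (i + 1) % n.
P = np.zeros((n, n))
for i in range(n):
    P[i, i] = 0.5
    P[(i + 1) % n, i] = 0.5

x = values.copy()                    # PushSum numerators
w = np.ones(n)                       # PushSum weights (de-biasing denominators)
for k in range(60):
    x = P @ x                        # mix numerators along directed edges
    w = P @ w                        # mix weights identically
estimate = x / w                     # each node's estimate of the average
print("max deviation from true average:", np.abs(estimate - true_avg).max())
```

Because only the ratio x/w needs to converge, the mixing weights only have to be column-stochastic rather than doubly stochastic, which is what allows directed (asymmetric) communication.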
In this work we study Stochastic Gradient Push (SGP), an
algorithm blending parallel SGD and PUSHSUM. SGP en-
ables the use of generic communication topologies that may
be directed (asymmetric), sparse, and time-varying. In con-
trast, existing gossip-based approaches explored in the con-
text of training DNNs (Lian et al., 2017; Jiang et al., 2017)
are constrained to use symmetric communication (i.e., if
node i sends to j, then i must also receive from j before
proceeding), and thus inherently require deadlock avoidance
and additional synchronization, making them slower and more
sensitive to stragglers. Moreover, SGP can be seen as a gen-
eralization of parallel SGD and these previous approaches.
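As a schematic illustration of this blend (a toy least-squares simulation under the same illustrative assumptions as the sketches above: a static directed ring, uniform mixing weights, and arbitrarily chosen step size and batch size; the precise update and its analysis appear in later sections), each node interleaves a local stochastic gradient step with one PUSHSUM gossip round:

```python
import numpy as np

rng = np.random.default_rng(1)
n, dim, lr = 8, 10, 0.05
A = rng.normal(size=(800, dim))
b = A @ rng.normal(size=dim)
shards = np.array_split(np.arange(800), n)   # each node trains on its own shard

# Column-stochastic mixing matrix for a directed ring with self-loops.
P = np.zeros((n, n))
for i in range(n):
    P[i, i] = 0.5
    P[(i + 1) % n, i] = 0.5

X = np.zeros((n, dim))                       # PushSum numerators, one row per node
w = np.ones(n)                               # PushSum weights
for step in range(300):
    for i in range(n):
        z = X[i] / w[i]                      # de-biased local parameters
        idx = rng.choice(shards[i], size=32, replace=False)
        grad = A[idx].T @ (A[idx] @ z - b[idx]) / len(idx)
        X[i] -= lr * grad                    # local stochastic gradient step
    X = P @ X                                # one PushSum gossip round
    w = P @ w
consensus = X / w[:, None]                   # nodes' parameters approach consensus
print("spread across nodes:", np.ptp(consensus, axis=0).max())
```

If the mixing step computed an exact average at every iteration, all nodes would perform identical updates as in synchronous parallel SGD, which is roughly the sense in which SGP generalizes that baseline.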