1-Bit Stochastic Gradient Descent
and its Application to Data-Parallel Distributed Training of Speech DNNs
Frank Seide¹, Hao Fu¹,², Jasha Droppo³, Gang Li¹, and Dong Yu³
¹Microsoft Research Asia, 5 Danling Street, Haidian District, Beijing 100080, P.R.C.
²Institute of Microelectronics, Tsinghua University, 100084 Beijing, P.R.C.
³Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA
{fseide,jdroppo,ganl,dongyu}@microsoft.com, fuhao9202@hotmail.com
Abstract
We show empirically that in SGD training of deep neural networks, one can, at no or nearly no loss of accuracy, quantize the gradients aggressively—to but one bit per value—if the quantization error is carried forward across minibatches (error feedback). This size reduction makes it feasible to parallelize SGD through data-parallelism with fast processors like recent GPUs.
We implement data-parallel deterministically distributed SGD by combining this finding with AdaGrad, automatic minibatch-size selection, double buffering, and model parallelism. Unexpectedly, quantization benefits AdaGrad, giving a small accuracy gain.
For a typical Switchboard DNN with 46M parameters, we reach computation speeds of 27k frames per second (kfps) when using 2880 samples per minibatch, and 51 kfps with 16k samples per minibatch, on a server with 8 K20X GPUs. This corresponds to speed-ups over a single GPU of 3.6 and 6.3, respectively. Seven training passes over 309h of data complete in under 7h. A 160M-parameter model training processes 3300h of data in under 16h on 20 dual-GPU servers—a 10 times speed-up—albeit at a small accuracy loss.
1. Introduction and Related Work
At present, the best context-dependent deep-neural-network HMMs, or CD-DNN-HMMs [1, 2], are trained primarily with error back-propagation, or BP. BP is a form of stochastic gradient descent, or SGD. For production-size models and corpora, this is time consuming and can take many days or weeks, even on the currently fastest hardware, graphics processing units (GPUs). While attempts at parallelizing SGD training across multiple compute nodes have been successful for sparsely connected networks like those used for image processing, success has been moderate for speech DNNs, which are fully connected.
For example, Google's DistBelief system successfully utilizes 16,000 cores for the ImageNet task [3] through asynchronous SGD, an implementation of Hogwild [4]; yet for a speech model with 42M parameters, a 1,600-core DistBelief configuration [5] is only marginally faster than a single recent GPU. Similarly, [6] achieved a 28-fold speed-up with 64 GPUs for their 1.9B-parameter vision network, while [7] reports a 3.2-times speed-up using 4 GPUs for speech.
This paper focuses on parallelization in a data-parallel fashion. In data parallelism, each minibatch is split over multiple compute nodes, and each node computes a sub-gradient on its sub-minibatch. These sub-gradients, of the same dimension as the full model, must be summed over all nodes and redistributed. Applied directly to typical training configurations, this process is infeasible due to the high bandwidth it would take to exchange sub-minibatch gradients across nodes. Avenues for improving the efficiency of data parallelism are to increase the minibatch size and to reduce how much data gets exchanged [8].
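As an illustration of the data-parallel step just described (a minimal sketch, not the authors' implementation), the following Python snippet splits a minibatch across simulated nodes and sums the resulting sub-gradients; grad_fn, minibatch, and num_nodes are hypothetical names, and the plain sum stands in for the actual exchange and redistribution.

    import numpy as np

    def data_parallel_gradient(grad_fn, minibatch, num_nodes):
        # Split the minibatch into one sub-minibatch per node.
        sub_batches = np.array_split(minibatch, num_nodes)
        # Each node computes a sub-gradient on its own share only.
        sub_grads = [grad_fn(sb) for sb in sub_batches]
        # Exchange step: sum over all nodes (a plain sum stands in
        # for the all-reduce that would redistribute the result).
        return np.sum(sub_grads, axis=0)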
We focus on the latter and propose to reduce bandwidth by aggressively quantizing the sub-gradients—to but one bit per value. We show that this causes no, or almost no, loss of word accuracy—but only if the quantization error is carried forward across minibatches, i.e. the error made in quantizing the gradient of one minibatch is added (fed back) to the gradient of the next minibatch. This is a common technique in other areas, such as sigma-delta modulation for DACs [9] or image rasterization, and it is a key difference from the well-known R-prop method [27].
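The sketch below shows the mechanics of 1-bit quantization with error feedback under stated assumptions (it is not the paper's exact implementation): the previous minibatch's quantization residual is added to the current gradient before quantizing, and whatever the 1-bit representation loses is carried into the next minibatch. Using a single reconstruction magnitude equal to the mean absolute value is an illustrative choice.

    import numpy as np

    def one_bit_quantize(grad, residual):
        # Error feedback: add the quantization error left over from
        # the previous minibatch before quantizing the current gradient.
        g = grad + residual
        # One bit per value: keep only the sign.
        bits = g >= 0
        # A single shared reconstruction magnitude (illustrative choice).
        scale = np.mean(np.abs(g))
        reconstructed = np.where(bits, scale, -scale)
        # Whatever was lost is remembered and fed into the next minibatch.
        new_residual = g - reconstructed
        return bits, scale, new_residual

In an actual exchange, only the packed bits and the scale would be transmitted; each node reconstructs the values, sums them with the other nodes' contributions, and updates its model copy.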
Some prior work on speeding up model training considered changes of model structure and training approach, e.g. factoring the network into a hierarchy [10, 11]; low-rank approximations [12, 13]; second-order ("Hessian-Free") methods [14, 15]; model averaging [16]; or ADMM, which cleverly tweaks the objective function for better parallelizability [17, 18]. The last three typically require more data passes, but make up for it through good parallelization properties.
In the paper at hand, we aim at unchanged convergence behavior. Also, unlike Hogwild/ASGD [4, 5], we desire deterministic behavior. In this category, an alternative to data parallelism is model parallelism, where models are distributed over nodes [5, 8]. One can also parallelize over layers [19]: each GPU processes one or more consecutive layers, with data flowing up and down through the layers between GPUs; as a consequence, gradients only become available at a delay of one or more minibatches (depending on the layer). This achieved a 3.3-times speed-up on 4 GPUs, but it does not scale beyond the number of layers, and load balancing is problematic. That work showed, however, that delayed updates can work, and it motivated the double-buffering technique we apply in this paper.
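The double-buffering idea can be sketched as follows (hypothetical function names; written sequentially here, whereas a real implementation would overlap the two steps on separate threads or streams): while the sub-gradient of the current minibatch is being computed, the previous minibatch's gradient is still being exchanged, so the model update arrives with a one-minibatch delay.

    def train_double_buffered(minibatches, compute_subgradient, exchange_and_apply):
        in_flight = None  # gradient currently being exchanged across nodes
        for batch in minibatches:
            grad = compute_subgradient(batch)      # compute on this node's share
            if in_flight is not None:
                exchange_and_apply(in_flight)      # overlapped with compute in practice
            in_flight = grad
        if in_flight is not None:
            exchange_and_apply(in_flight)          # drain the last delayed update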
We will next describe data-parallel DNN training. Then, Section 3 will introduce the 1-bit quantization approach, and Section 4 the data-parallel SGD system we implemented based on this. Finally, Section 5 will give experimental results for quantization, interaction with AdaGrad, impact of double buffering, and combination with model parallelism.
2. Data-Parallel Deterministically Distributed SGD Training
A deep neural network (DNN) is a conventional multi-layer perceptron (MLP [20]) with many layers, where training is commonly initialized by a pretraining algorithm [21, 22, 23]. A CD-DNN-HMM models the posterior probability P(s|o) of a tied triphone state, or senone, s [24, 1], given an observation vector o. For details, please see, for example, [23].
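For reference, in the standard hybrid setup (see e.g. [23]) this posterior is converted into a scaled likelihood for HMM decoding by dividing out the senone prior:

    p(o \mid s) \;\propto\; \frac{P(s \mid o)}{P(s)},

where P(s) is the prior probability of senone s, typically estimated from the training alignment.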
The best DNNs to this date are often trained using the common error back-propagation (BP) technique [25], which is a form of stochastic gradient descent.