1-Bit Stochastic Gradient Descent
and its Application to Data-Parallel Distributed Training of Speech DNNs
Frank Seide¹, Hao Fu¹,², Jasha Droppo³, Gang Li¹, and Dong Yu³
¹Microsoft Research Asia, 5 Danling Street, Haidian District, Beijing 100080, P.R.C.
²Institute of Microelectronics, Tsinghua University, 100084 Beijing, P.R.C.
³Microsoft Research, One Microsoft Way, Redmond, WA 98052, USA
{fseide,jdroppo,ganl,dongyu}@microsoft.com, fuhao9202@hotmail.com
Abstract
We show empirically that in SGD training of deep neural networks, one can, at no or nearly no loss of accuracy, quantize the gradients aggressively—to but one bit per value—if the quantization error is carried forward across minibatches (error feedback). This size reduction makes it feasible to parallelize SGD through data-parallelism with fast processors like recent GPUs.
We implement data-parallel deterministically distributed SGD by combining this finding with AdaGrad, automatic minibatch-size selection, double buffering, and model parallelism. Unexpectedly, quantization benefits AdaGrad, giving a small accuracy gain.
For a typical Switchboard DNN with 46M parameters, we reach computation speeds of 27k frames per second (kfps) when using 2880 samples per minibatch, and 51 kfps with 16k samples per minibatch, on a server with 8 K20X GPUs. This corresponds to speed-ups over a single GPU of 3.6 and 6.3, respectively. Seven training passes over 309h of data complete in under 7h. A 160M-parameter model training processes 3300h of data in under 16h on 20 dual-GPU servers—a 10 times speed-up—albeit at a small accuracy loss.
1. Introduction and Related Work
At present, the best context-dependent deep-neural-network HMMs, or CD-DNN-HMMs [1, 2], are trained primarily with error back-propagation, or BP. BP is a form of stochastic gradient descent, or SGD. For production-size models and corpora, this is time consuming and can take many days or weeks, even on the currently fastest hardware, graphics processing units (GPUs). While attempts at parallelizing SGD training across multiple compute nodes have been successful for sparsely connected networks like those used for image processing, success has been moderate for speech DNNs, which are fully connected.
For example, Google's DistBelief system successfully utilizes 16,000 cores for the ImageNet task [3] through asynchronous SGD, an implementation of Hogwild [4]; yet for a speech model with 42M parameters, a 1,600-core DistBelief configuration [5] is only marginally faster than a single recent GPU. Similarly, [6] achieved a 28-fold speed-up with 64 GPUs for their 1.9B-parameter vision network, while [7] reports a 3.2-times speed-up using 4 GPUs for speech.
This paper focuses on parallelization in a data-parallel fashion. In data parallelism, each minibatch is split over multiple compute nodes, and each node computes a sub-gradient on its sub-minibatch. These sub-gradients, of the same dimension as the full model, must be summed over all nodes and redistributed. Applied directly to typical training configurations, this process is infeasible due to the high bandwidth it would take to exchange sub-minibatch gradients across nodes. Avenues for improving the efficiency of data parallelism are to increase the minibatch size and to reduce how much data gets exchanged [8].
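As an illustration of the data-parallel step just described (a minimal sketch, not the authors' implementation), the following Python snippet splits a minibatch across simulated nodes and sums the resulting sub-gradients; grad_fn, minibatch, and num_nodes are hypothetical names, and the plain sum stands in for the actual exchange and redistribution.

    import numpy as np

    def data_parallel_gradient(grad_fn, minibatch, num_nodes):
        # Split the minibatch into one sub-minibatch per node.
        sub_batches = np.array_split(minibatch, num_nodes)
        # Each node computes a sub-gradient on its own share only.
        sub_grads = [grad_fn(sb) for sb in sub_batches]
        # Exchange step: sum over all nodes (a plain sum stands in
        # for the all-reduce that would redistribute the result).
        return np.sum(sub_grads, axis=0)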
We focus on the latter and propose to reduce bandwidth by aggressively quantizing the sub-gradients—to but one bit per value. We show that this causes no, or almost no, loss of word accuracy—but only if the quantization error is carried forward across minibatches, i.e. the error made in quantizing the gradient of one minibatch is added (fed back) to the gradient of the next minibatch. This is a common technique in other areas, such as sigma-delta modulation for DACs [9] or image rasterization, and it is a key difference from the well-known R-prop method [27].
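The sketch below shows the mechanics of 1-bit quantization with error feedback under stated assumptions (it is not the paper's exact implementation): the previous minibatch's quantization residual is added to the current gradient before quantizing, and whatever the 1-bit representation loses is carried into the next minibatch. Using a single reconstruction magnitude equal to the mean absolute value is an illustrative choice.

    import numpy as np

    def one_bit_quantize(grad, residual):
        # Error feedback: add the quantization error left over from
        # the previous minibatch before quantizing the current gradient.
        g = grad + residual
        # One bit per value: keep only the sign.
        bits = g >= 0
        # A single shared reconstruction magnitude (illustrative choice).
        scale = np.mean(np.abs(g))
        reconstructed = np.where(bits, scale, -scale)
        # Whatever was lost is remembered and fed into the next minibatch.
        new_residual = g - reconstructed
        return bits, scale, new_residual

In an actual exchange, only the packed bits and the scale would be transmitted; each node reconstructs the values, sums them with the other nodes' contributions, and updates its model copy.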
Some prior work on speeding up model training considered changes of model structure and training approach, e.g. factoring the network into a hierarchy [10, 11]; low-rank approximations [12, 13]; second-order ("Hessian-Free") methods [14, 15]; model averaging [16]; or ADMM, which cleverly tweaks the objective function for better parallelizability [17, 18]. The last three typically require more data passes, but make up for it through good parallelization properties.
In the paper at hand, we aim at unchanged convergence behavior. Also, unlike Hogwild/ASGD [4, 5], we desire deterministic behavior. In this category, an alternative to data parallelism is model parallelism, where models are distributed over nodes [5, 8]. One can also parallelize over layers [19]: each GPU processes one or more consecutive layers, with data flowing up and down through the layers between GPUs; as a consequence, gradients only become available at a delay of one or more minibatches (depending on the layer). This achieved a 3.3-times speed-up on 4 GPUs, but it does not scale beyond the number of layers, and load balancing is problematic. That work showed, however, that delayed updates can work, and it motivated the double-buffering technique we apply in this paper.
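The double-buffering idea can be sketched as follows (hypothetical function names; written sequentially here, whereas a real implementation would overlap the two steps on separate threads or streams): while the sub-gradient of the current minibatch is being computed, the previous minibatch's gradient is still being exchanged, so the model update arrives with a one-minibatch delay.

    def train_double_buffered(minibatches, compute_subgradient, exchange_and_apply):
        in_flight = None  # gradient currently being exchanged across nodes
        for batch in minibatches:
            grad = compute_subgradient(batch)      # compute on this node's share
            if in_flight is not None:
                exchange_and_apply(in_flight)      # overlapped with compute in practice
            in_flight = grad
        if in_flight is not None:
            exchange_and_apply(in_flight)          # drain the last delayed update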
We will next describe data-parallel DNN training. Then, Section 3 will introduce the 1-bit quantization approach, and Section 4 the data-parallel SGD system we implemented based on this. Finally, Section 5 will give experimental results for quantization, interaction with AdaGrad, impact of double buffering, and combination with model parallelism.
2. Data-Parallel Deterministically Distributed SGD Training
A deep neural network (DNN) is a conventional multi-layer perceptron (MLP [20]) with many layers, where training is commonly initialized by a pretraining algorithm [21, 22, 23]. A CD-DNN-HMM models the posterior probability P(s|o) of a tied triphone state, or senone, s [24, 1], given an observation vector o. For details, please see, for example, [23].
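For reference, in the standard hybrid setup (see e.g. [23]) this posterior is converted into a scaled likelihood for HMM decoding by dividing out the senone prior:

    p(o \mid s) \;\propto\; \frac{P(s \mid o)}{P(s)},

where P(s) is the prior probability of senone s, typically estimated from the training alignment.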
The best DNNs to this date are often trained using the common error back-propagation (BP) technique [25], which is a form of stochastic gradient descent.