scalar loss $\sum_i c_i L_i$. This procedure gives us the sum of clipped gradients. Under this setup, the difficulty is computing the per-example gradient norm $\|\nabla L_i\|_2$. We emphasize two technicalities that enable computing this quantity without instantiating the full per-example gradient $\nabla L_i$.
First, for a typical neural net layer $l$ with parameters $W^{(l)}$ (without parameter sharing), the per-example gradient w.r.t. the parameters can be easily computed using the input to the layer $a^{(l)}$ and the gradient of the loss w.r.t. the output $g^{(l)}$, both of which are available during backpropagation. Second, for a large vector formed by concatenating several small vectors $u = [u_1, \dots, u_k]$, its Euclidean norm is simply the norm of the vector of norms, i.e., $\|u\|_2 = \|(\|u_1\|_2, \dots, \|u_k\|_2)\|_2$.
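As a minimal numerical check of the second observation (illustrative only; the block sizes are arbitrary):

import torch

u1, u2, u3 = torch.randn(3), torch.randn(5), torch.randn(2)
# Norm of the concatenated vector ...
full_norm = torch.cat([u1, u2, u3]).norm(2)
# ... equals the norm of the vector of per-block norms.
norm_of_norms = torch.stack([u1.norm(2), u2.norm(2), u3.norm(2)]).norm(2)
assert torch.allclose(full_norm, norm_of_norms)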
The second observation means that computing the per-example gradient norm $\|\nabla L_i\|_2$ can be done by computing the per-example gradient norms for the individual layers of the neural net, $\|\nabla_{W^{(1)}} L_i\|_2, \dots, \|\nabla_{W^{(L)}} L_i\|_2$, one at a time ($L$ is the layer count). Moreover, the first observation implies that the norms for each layer can be computed using quantities freely available to a typical backward pass. Overall, the per-example gradient norm of any network without parameter sharing can be computed in a layer-by-layer fashion, with only one per-example gradient tensor for a single layer being instantiated at any time.
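For concreteness, the following PyTorch sketch carries out this layer-by-layer procedure on a hypothetical two-layer network (the shapes, clipping threshold, and variable names are illustrative and not part of our released implementation): per-example gradients are formed one layer at a time from the saved layer inputs and output gradients, immediately reduced to norms, combined via the norm-of-norms identity, and finally used to form the clipping coefficients $c_i$ for the reweighted backward pass.

import torch

B, d0, d1, d2 = 8, 20, 16, 1
x = torch.randn(B, d0)
W1 = torch.randn(d1, d0, requires_grad=True)
W2 = torch.randn(d2, d1, requires_grad=True)

# Forward pass, keeping the input a^(l) to each linear layer.
a1 = x                                    # input to layer 1
s1 = a1 @ W1.T                            # output of layer 1
a2 = torch.relu(s1)                       # input to layer 2
s2 = a2 @ W2.T                            # output of layer 2
loss_per_example = (s2 ** 2).mean(dim=1)  # any per-example scalar loss L_i

# g^(l): gradient of the summed loss w.r.t. each layer output. Row i of each
# layer output only affects L_i, so g^(l)[i] is the per-example quantity we need.
g1, g2 = torch.autograd.grad(loss_per_example.sum(), (s1, s2), retain_graph=True)

# One per-example gradient tensor instantiated at a time, immediately reduced to a norm.
per_layer_norms = []
for a, g in ((a1, g1), (a2, g2)):
    per_example_grad = torch.einsum('bp,bd->bpd', g, a)            # g_i a_i^T per example
    per_layer_norms.append(per_example_grad.flatten(1).norm(2, dim=1))
per_example_norm = torch.stack(per_layer_norms, dim=1).norm(2, dim=1)  # norm of norms

# Clip coefficients c_i (treated as constants via detach); backpropagating the
# scalar loss sum_i c_i L_i then leaves the sum of clipped per-example gradients
# in W1.grad and W2.grad.
C = 1.0
clip_coef = (C / (per_example_norm + 1e-6)).clamp(max=1.0)
(clip_coef.detach() * loss_per_example).sum().backward()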
4.2 GHOST CLIPPING FOR TRANSFORMERS WITH SEQUENTIAL DATA
The trick by Lee & Kifer (2020) still requires instantiating the per-example gradient of individual layers (although not simultaneously). This can be problematic in terms of memory for Transformers with large embedding layers.³ Here, we present a specialized procedure for computing the per-example gradient norm for linear and embedding layers when they are applied to sequential data.⁴ This procedure reduces memory footprint and can be viewed as a generalization of the Goodfellow (2015) trick that additionally handles sequential inputs.
Let $a \in \mathbb{R}^{B \times T \times d}$ be the input to a linear layer with weight matrix $W \in \mathbb{R}^{p \times d}$, and $s \in \mathbb{R}^{B \times T \times p}$ be the output with $s_{i,j} = W a_{i,j}$. Let $g \in \mathbb{R}^{B \times T \times p}$ be the gradient of the loss w.r.t. the output $s$. Here, $T$ is the number of time steps in the input, and we omitted biases for simplicity. Simple calculation shows that the per-example gradient is the product of two matrices:

$\nabla_W L_i = g_i^\top a_i \in \mathbb{R}^{p \times d}.$    (1)
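For concreteness, (1) is a single batched contraction over the time dimension. The sketch below uses made-up shapes; in practice $a$ and $g$ would come from the forward and backward passes rather than torch.randn:

import torch

B, T, d, p = 4, 10, 6, 5
a = torch.randn(B, T, d)   # inputs to the linear layer, one length-T sequence per example
g = torch.randn(B, T, p)   # gradients of the loss w.r.t. the layer outputs
# Naive route via (1): instantiate all B per-example gradients of shape (p, d).
per_example_grads = torch.einsum('btp,btd->bpd', g, a)
naive_sq_norms = per_example_grads.flatten(1).norm(2, dim=1) ** 2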
Since the per-example gradient norms are the end goal, the per-example gradients $\{\nabla_W L_i\}_{i=1}^{B}$ themselves need not be instantiated explicitly. More precisely, we observe that the squared per-example gradient norm for this layer, $\|\nabla_W L_i\|_F^2$, obeys the following identity:

$\|\nabla_W L_i\|_F^2 = \mathrm{vec}(a_i a_i^\top)^\top \mathrm{vec}(g_i g_i^\top).$    (2)
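Continuing the illustrative shapes from the sketch above, (2) amounts to forming the two $T \times T$ Gram matrices, multiplying them elementwise, and summing, which equals the dot product of their vectorizations (again a sketch, not our released implementation):

# Ghost-norm route via (2): the (B, p, d) per-example gradients are never materialized.
gram_a = torch.einsum('btd,bsd->bts', a, a)    # a_i a_i^T for each example, shape (B, T, T)
gram_g = torch.einsum('btp,bsp->bts', g, g)    # g_i g_i^T for each example, shape (B, T, T)
ghost_sq_norms = (gram_a * gram_g).sum(dim=(1, 2))

# Matches the squared norms obtained by instantiating (1) directly.
assert torch.allclose(ghost_sq_norms, naive_sq_norms, rtol=1e-4)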
See Appendix F for a derivation. Implemented with common primitives in machine learning libraries, (2) has a memory complexity of order $O(BT^2)$ when $a_i a_i^\top, g_i g_i^\top \in \mathbb{R}^{T \times T}$ are instantiated,⁵,⁶ as opposed to $O(Bpd)$ in the naïve approach which goes through instantiating (1).⁷
The memory efficiency of this procedure is exemplified with off-the-shelf pretrained language models, most of which have large embeddings. For instance, for GPT-2, $d \approx 50{,}000$ and $p = 768$ for the embedding layer, and the context window $T \leq 1024$.⁸ Our method in theory reduces the memory cost of this large embedding layer by at least a factor of 22. In practice, we also observe significant savings, since embedding layers can be a major source of memory spending for training large language models.⁹
To stress-test ghost clipping, we compare it with 4 baselines: the PyTorch package Opacus that implements DP optimization by instantiating per-example gradients, the approach by Lee & Kifer (2020), non-private training in PyTorch, and naïve DP optimization implemented in
For GPT-2, per-example gradients w.r.t. the embedding for ten examples alone occupy
~
1.5GB of memory.
4
An embedding layer is essentially a linear layer: The embedding lookup operation applied to indices is
equivalent to a matrix multiplication of the embedding matrix with one-hot encoded indices.
5
This is assuming the space complexity for multiplying two matrices
A ∈ R
m×n
and
B ∈ R
n×p
is roughly
O(mp), which is the case for most workloads running on a framework like PyTorch.
6
More sophisticated solutions may even avoid instantiating
a
i
a
>
i
and
g
i
g
>
i
entirely by trading in more
run-time. Custom CUDA kernels are likely needed to make these solutions fast in practice.
7
We omitted the cost of storing
a
i
and
g
i
, since our goal is to compare the additional cost induced by
computing gradient norms.
8
In practice, for fine-tuning tasks, the maximum sequence length is usually a few hundred.
9
While there are alternative approaches for reducing the memory footprint of embedding layers during
training, these methods tend to introduce extra hyperparameters that require tuning and privacy spending.