scalar loss $\sum_i c_i L_i$. This procedure gives us the sum of clipped gradients. Under this setup, the difficulty is computing the per-example gradient norm $\|\nabla L_i\|_2$. We emphasize two technicalities that enable computing this quantity without instantiating the full per-example gradient $\nabla L_i$.
First, for a typical neural net layer $l$ with parameters $W^{(l)}$ (without parameter sharing), the per-example gradient w.r.t. the parameters can be easily computed using the input to the layer $a^{(l)}$ and the gradient of the loss w.r.t. the output $g^{(l)}$, both of which are available during backpropagation. Second, for a large vector formed by concatenating several small vectors $u = [u_1, \dots, u_k]$, its Euclidean norm is simply the norm of the vector of norms, i.e., $\|u\|_2 = \|(\|u_1\|_2, \dots, \|u_k\|_2)\|_2$.
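As a minimal numerical check of the second observation (illustrative only; the block sizes are arbitrary):

import torch

u1, u2, u3 = torch.randn(3), torch.randn(5), torch.randn(2)
# Norm of the concatenated vector ...
full_norm = torch.cat([u1, u2, u3]).norm(2)
# ... equals the norm of the vector of per-block norms.
norm_of_norms = torch.stack([u1.norm(2), u2.norm(2), u3.norm(2)]).norm(2)
assert torch.allclose(full_norm, norm_of_norms)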
The second observation means that computing the per-example gradient norm $\|\nabla L_i\|_2$ can be done by computing the per-example gradient norms for the individual layers of the neural net, $\|\nabla_{W^{(1)}} L_i\|_2, \dots, \|\nabla_{W^{(L)}} L_i\|_2$, one at a time ($L$ is the layer count). Moreover, the first observation implies that the norms for each layer can be computed using quantities freely available to a typical backward pass. Overall, the per-example gradient norm of any network without parameter sharing can be computed in a layer-by-layer fashion, with only one per-example gradient tensor for a single layer being instantiated at any time.
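For concreteness, the following PyTorch sketch carries out this layer-by-layer procedure on a hypothetical two-layer network (the shapes, clipping threshold, and variable names are illustrative and not part of our released implementation): per-example gradients are formed one layer at a time from the saved layer inputs and output gradients, immediately reduced to norms, combined via the norm-of-norms identity, and finally used to form the clipping coefficients $c_i$ for the reweighted backward pass.

import torch

B, d0, d1, d2 = 8, 20, 16, 1
x = torch.randn(B, d0)
W1 = torch.randn(d1, d0, requires_grad=True)
W2 = torch.randn(d2, d1, requires_grad=True)

# Forward pass, keeping the input a^(l) to each linear layer.
a1 = x                                    # input to layer 1
s1 = a1 @ W1.T                            # output of layer 1
a2 = torch.relu(s1)                       # input to layer 2
s2 = a2 @ W2.T                            # output of layer 2
loss_per_example = (s2 ** 2).mean(dim=1)  # any per-example scalar loss L_i

# g^(l): gradient of the summed loss w.r.t. each layer output. Row i of each
# layer output only affects L_i, so g^(l)[i] is the per-example quantity we need.
g1, g2 = torch.autograd.grad(loss_per_example.sum(), (s1, s2), retain_graph=True)

# One per-example gradient tensor instantiated at a time, immediately reduced to a norm.
per_layer_norms = []
for a, g in ((a1, g1), (a2, g2)):
    per_example_grad = torch.einsum('bp,bd->bpd', g, a)            # g_i a_i^T per example
    per_layer_norms.append(per_example_grad.flatten(1).norm(2, dim=1))
per_example_norm = torch.stack(per_layer_norms, dim=1).norm(2, dim=1)  # norm of norms

# Clip coefficients c_i (treated as constants via detach); backpropagating the
# scalar loss sum_i c_i L_i then leaves the sum of clipped per-example gradients
# in W1.grad and W2.grad.
C = 1.0
clip_coef = (C / (per_example_norm + 1e-6)).clamp(max=1.0)
(clip_coef.detach() * loss_per_example).sum().backward()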
4.2 GHOST CLIPPING FOR TRANSFORMERS WITH SEQUENTIAL DATA
The trick by Lee & Kifer (2020) still requires instantiating the per-example gradient of individual layers (although not simultaneously). This can be problematic in terms of memory for Transformers with large embedding layers.³ Here, we present a specialized procedure for computing the per-example gradient norm for linear and embedding layers when they are applied to sequential data.⁴ This procedure reduces memory footprint and can be viewed as a generalization of the Goodfellow (2015) trick that additionally handles sequential inputs.
Let $a \in \mathbb{R}^{B \times T \times d}$ be the input to a linear layer with weight matrix $W \in \mathbb{R}^{p \times d}$, and $s \in \mathbb{R}^{B \times T \times p}$ be the output with $s_{i,j} = W a_{i,j}$. Let $g \in \mathbb{R}^{B \times T \times p}$ be the gradient of the loss w.r.t. the output $s$. Here, $T$ is the number of time steps in the input, and we omitted biases for simplicity. Simple calculation shows that the per-example gradient is the product of two matrices:

$\nabla_W L_i = g_i^\top a_i \in \mathbb{R}^{p \times d}.$    (1)
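For concreteness, (1) is a single batched contraction over the time dimension. The sketch below uses made-up shapes; in practice $a$ and $g$ would come from the forward and backward passes rather than torch.randn:

import torch

B, T, d, p = 4, 10, 6, 5
a = torch.randn(B, T, d)   # inputs to the linear layer, one length-T sequence per example
g = torch.randn(B, T, p)   # gradients of the loss w.r.t. the layer outputs
# Naive route via (1): instantiate all B per-example gradients of shape (p, d).
per_example_grads = torch.einsum('btp,btd->bpd', g, a)
naive_sq_norms = per_example_grads.flatten(1).norm(2, dim=1) ** 2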
Since the per-example gradient norms are the end goal, the per-example gradients $\{\nabla_W L_i\}_{i=1}^{B}$ themselves need not be instantiated explicitly. More precisely, we observe that the squared per-example gradient norm for this layer, $\|\nabla_W L_i\|_F^2$, obeys the following identity:

$\|\nabla_W L_i\|_F^2 = \mathrm{vec}(a_i a_i^\top)^\top \mathrm{vec}(g_i g_i^\top).$    (2)
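Continuing the illustrative shapes from the sketch above, (2) amounts to forming the two $T \times T$ Gram matrices, multiplying them elementwise, and summing, which equals the dot product of their vectorizations (again a sketch, not our released implementation):

# Ghost-norm route via (2): the (B, p, d) per-example gradients are never materialized.
gram_a = torch.einsum('btd,bsd->bts', a, a)    # a_i a_i^T for each example, shape (B, T, T)
gram_g = torch.einsum('btp,bsp->bts', g, g)    # g_i g_i^T for each example, shape (B, T, T)
ghost_sq_norms = (gram_a * gram_g).sum(dim=(1, 2))

# Matches the squared norms obtained by instantiating (1) directly.
assert torch.allclose(ghost_sq_norms, naive_sq_norms, rtol=1e-4)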
See Appendix F for a derivation. Implemented with common primitives in machine learning libraries, (2) has a memory complexity of order $O(BT^2)$ when $a_i a_i^\top, g_i g_i^\top \in \mathbb{R}^{T \times T}$ are instantiated,⁵,⁶ as opposed to $O(Bpd)$ in the naïve approach which goes through instantiating (1).⁷
The memory efficiency of this procedure is exemplified with off-the-shelf pretrained language models, most of which have large embeddings. For instance, for GPT-2, $d \approx 50{,}000$ and $p = 768$ for the embedding layer, and the context window $T \leq 1024$.⁸ Our method in theory reduces the memory cost of this large embedding layer by at least a factor of 22. In practice, we also observe significant savings, since embedding layers can be a major source of memory spending for training large language models.⁹
To stress-test ghost clipping, we compare it with 4 baselines: the PyTorch package Opacus that implements DP optimization by instantiating per-example gradients, the approach by Lee & Kifer (2020), non-private training in PyTorch, and naïve DP optimization implemented in
For GPT-2, per-example gradients w.r.t. the embedding for ten examples alone occupy
~
1.5GB of memory.
4
An embedding layer is essentially a linear layer: The embedding lookup operation applied to indices is
equivalent to a matrix multiplication of the embedding matrix with one-hot encoded indices.
5
This is assuming the space complexity for multiplying two matrices
A ∈ R
m×n
and
B ∈ R
n×p
is roughly
O(mp), which is the case for most workloads running on a framework like PyTorch.
6
More sophisticated solutions may even avoid instantiating
a
i
a
>
i
and
g
i
g
>
i
entirely by trading in more
run-time. Custom CUDA kernels are likely needed to make these solutions fast in practice.
7
We omitted the cost of storing
a
i
and
g
i
, since our goal is to compare the additional cost induced by
computing gradient norms.
8
In practice, for fine-tuning tasks, the maximum sequence length is usually a few hundred.
9
While there are alternative approaches for reducing the memory footprint of embedding layers during
training, these methods tend to introduce extra hyperparameters that require tuning and privacy spending.