While we focus on non-convex neural network objectives,
the algorithm we consider is applicable to any finite-sum
objective of the form
\[
\min_{w \in \mathbb{R}^d} f(w) \quad \text{where} \quad f(w) \overset{\text{def}}{=} \frac{1}{n} \sum_{i=1}^{n} f_i(w). \tag{1}
\]
For a machine learning problem, we typically take $f_i(w) = \ell(x_i, y_i; w)$, that is, the loss of the prediction on example $(x_i, y_i)$ made with model parameters $w$. We assume there are $K$ clients over which the data is partitioned, with $P_k$ the set of indexes of data points on client $k$, and $n_k = |P_k|$.
Thus, we can re-write the objective (1) as
\[
f(w) = \sum_{k=1}^{K} \frac{n_k}{n} F_k(w) \quad \text{where} \quad F_k(w) = \frac{1}{n_k} \sum_{i \in P_k} f_i(w).
\]
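This decomposition is an exact identity for any partition of the data. The following minimal NumPy sketch checks it numerically under an illustrative squared-error loss and synthetic data; all names and values here are our own assumptions, not part of the paper's experimental setup.

```python
# Numerical check: f(w) = (1/n) sum_i f_i(w) equals sum_k (n_k/n) F_k(w)
# for any partition of the indices into client sets P_k.
import numpy as np

rng = np.random.default_rng(0)
n, d, K = 1000, 10, 5
X, y = rng.normal(size=(n, d)), rng.normal(size=n)
w = rng.normal(size=d)

losses = 0.5 * (X @ w - y) ** 2        # f_i(w) for every example i (toy squared-error loss)
f_direct = losses.mean()               # f(w) = (1/n) sum_i f_i(w)

# An arbitrary partition of the indices into K client sets P_k.
partition = np.array_split(rng.permutation(n), K)
f_federated = sum(len(P_k) / n * losses[P_k].mean() for P_k in partition)

assert np.isclose(f_direct, f_federated)  # the identity holds regardless of the partition
```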
If the partition $P_k$ was formed by distributing the training examples over the clients uniformly at random, then we would have $\mathbb{E}_{P_k}[F_k(w)] = f(w)$, where the expectation is over the set of examples assigned to a fixed client $k$. This is the IID assumption typically made by distributed optimization algorithms; we refer to the case where this does not hold (that is, $F_k$ could be an arbitrarily bad approximation to $f$) as the non-IID setting.
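The distinction can be made concrete with a small sketch: under a uniform-at-random partition every $F_k(w)$ stays close to $f(w)$, while a deliberately skewed split (here, sorting examples by target, chosen purely for illustration and unrelated to the partitions used in the paper's experiments) lets the client objectives drift apart.

```python
# IID vs. non-IID partitions of the same toy dataset and squared-error loss.
import numpy as np

rng = np.random.default_rng(0)
n, d, K = 1000, 10, 5
X, y = rng.normal(size=(n, d)), rng.normal(size=n)
w = rng.normal(size=d)

losses = 0.5 * (X @ w - y) ** 2        # f_i(w) for every example i
f_w = losses.mean()                     # f(w)

def client_objectives(partition):
    """Per-client average losses F_k(w) for a given index partition."""
    return [losses[P_k].mean() for P_k in partition]

iid_partition    = np.array_split(rng.permutation(n), K)   # examples assigned uniformly at random
noniid_partition = np.array_split(np.argsort(y), K)        # each client sees a narrow slice of targets

print("f(w)           :", round(f_w, 3))
print("IID     F_k(w) :", [round(v, 3) for v in client_objectives(iid_partition)])     # all near f(w)
print("non-IID F_k(w) :", [round(v, 3) for v in client_objectives(noniid_partition)])  # can be far from f(w)
```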
In data center optimization, communication costs are relatively small and computational costs dominate, with much of the recent emphasis on using GPUs to lower these costs. In contrast, in federated optimization communication costs dominate: we will typically be limited by an upload bandwidth of 1 MB/s or less. Further, clients will typically only volunteer to participate in the optimization when they are charged, plugged in, and on an unmetered wi-fi connection, and we expect each client to participate in only a small number of update rounds per day. On the other hand, since any single on-device dataset is small compared to the total dataset size, and modern smartphones have relatively fast processors (including GPUs), computation becomes essentially free compared to communication costs for many model types. Thus, our goal is to use additional computation in order to decrease the number of rounds of communication needed to train a model. There are two primary ways we can add computation: 1) increased parallelism, where we use more clients working independently between each communication round; and 2) increased computation on each client, where, rather than performing a simple computation like a gradient calculation, each client performs a more complex calculation between communication rounds. We investigate both of these approaches, but the speedups we achieve are due primarily to adding more computation on each client, once a minimum level of parallelism over clients is used.
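A rough back-of-envelope calculation illustrates this asymmetry; only the roughly 1 MB/s upload bound is taken from the text above, while the model size and on-device throughput below are illustrative assumptions.

```python
# Why communication dominates: compare the cost of one model upload to one local epoch.
params = 5_000_000                 # assumed model size (number of float32 parameters)
bytes_per_param = 4
upload_bandwidth = 1e6             # bytes/s, i.e. roughly the 1 MB/s bound from the text

upload_seconds = params * bytes_per_param / upload_bandwidth   # ~20 s per round just to upload

local_examples = 1_000             # assumed size of a single on-device dataset
examples_per_second = 10_000       # assumed on-device training throughput
epoch_seconds = local_examples / examples_per_second           # ~0.1 s per local epoch

print(f"upload per round: {upload_seconds:.0f} s")
print(f"one local epoch:  {epoch_seconds:.2f} s")
# Even tens of extra local epochs cost far less than a single upload, which is why
# trading additional local computation for fewer communication rounds pays off.
```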
Related Work
Distributed training by iteratively averaging locally trained models has been studied by McDonald et al. [28] for the perceptron and Povey et al. [31] for speech recognition DNNs. Zhang et al. [42] studies an asynchronous approach with “soft” averaging. These works only consider the cluster / data center setting (at most 16 workers, wall-clock time based on fast networks), and do not consider datasets that are unbalanced and non-IID, properties that are essential to the federated learning setting. We adapt this style of algorithm to the federated setting and perform the appropriate empirical evaluation, which asks different questions than those relevant in the data center setting and requires different methodology.
Using similar motivation to ours, Neverova et al. [29] also discusses the advantages of keeping sensitive user data on device. The work of Shokri and Shmatikov [35] is related in several ways: they focus on training deep networks, emphasize the importance of privacy, and address communication costs by only sharing a subset of the parameters during each round of communication; however, they also do not consider unbalanced and non-IID data, and the empirical evaluation is limited.
In the convex setting, the problem of distributed optimization and estimation has received significant attention [4, 15, 33], and some algorithms do focus specifically on communication efficiency [45, 34, 40, 27, 43]. In addition to assuming convexity, this existing work generally requires that the number of clients is much smaller than the number of examples per client, that the data is distributed across the clients in IID fashion, and that each node has an identical number of data points; all of these assumptions are violated in the federated optimization setting. Asynchronous distributed forms of SGD have also been applied to training neural networks, e.g., Dean et al. [12], but these approaches require a prohibitive number of updates in the federated setting. Distributed consensus algorithms (e.g., [41]) relax the IID assumption, but are still not a good fit for communication-constrained optimization over very many clients.
One endpoint of the (parameterized) algorithm family we consider is simple one-shot averaging, where each client solves for the model that minimizes (possibly regularized) loss on their local data, and these models are averaged to produce the final global model. This approach has been studied extensively in the convex case with IID data, and it is known that in the worst case, the global model produced is no better than training a model on a single client [44, 3, 46].
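As a concrete illustration of this endpoint, the sketch below has each client solve a small regularized least-squares problem on its local data and then averages the resulting parameter vectors once; the ridge loss, synthetic data, and data-size weighting are our own illustrative choices rather than those of the cited works.

```python
# One-shot averaging: each client fits a local model once, the server averages the results.
import numpy as np

rng = np.random.default_rng(0)
K, d = 5, 10
# Each client holds its own (X, y) dataset; synthetic data for illustration only.
clients = [(rng.normal(size=(200, d)), rng.normal(size=200)) for _ in range(K)]

def fit_local(X, y, lam=0.1):
    """Minimize the local regularized squared loss in closed form (ridge regression)."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

local_models = [fit_local(X, y) for X, y in clients]

# Average the local solutions, weighting each client by its share of the data
# (uniform here, since every client holds the same number of examples).
n_k = np.array([len(y) for _, y in clients])
w_global = np.average(local_models, axis=0, weights=n_k / n_k.sum())
```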
2 The FederatedAveraging Algorithm
The recent multitude of successful applications of deep learning have almost exclusively relied on variants of stochastic gradient descent (SGD) for optimization; in fact, many advances can be understood as adapting the structure of the model (and hence the loss function) to be more amenable to optimization by simple gradient-based methods [16]. Thus, it is natural that we build algorithms for