Towards Federated Learning at Scale: System Design
Configuration
The server is configured based on the aggregation mechanism selected for the chosen devices (e.g., simple or Secure Aggregation). The server sends the FL plan and an FL checkpoint with the global model to each of the devices.
Reporting
The server waits for the participating devices to report updates. As updates are received, the server aggregates them using Federated Averaging and instructs the reporting devices when to reconnect (see also Sec. 2.3). If enough devices report in time, the round is successfully completed and the server updates its global model; otherwise the round is abandoned.
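The aggregation step can be sketched as a weighted average of the reported updates. This is a minimal batch illustration only; the actual system aggregates incrementally as updates arrive, and the weighting-by-example-count convention shown here is the standard Federated Averaging formulation, not a detail stated in this section.

```python
def federated_average(updates):
    """Combine model updates weighted by each device's example count.

    updates: list of (weights, n_examples) pairs, where weights is a
    list of floats of equal length across devices.
    """
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    avg = [0.0] * dim
    for weights, n in updates:
        # Each device contributes proportionally to its share of examples.
        for i in range(dim):
            avg[i] += weights[i] * (n / total)
    return avg
```

For example, averaging `([1.0, 2.0], 1)` with `([3.0, 4.0], 3)` weights the second device three times as heavily.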
As seen in Fig. 1, straggling devices which do not report back in time or do not react to configuration by the server are simply ignored. The protocol has a certain tolerance for such drop-outs, which is configurable per FL task.
The selection and reporting phases are specified by a set of parameters which spawn flexible time windows. For example, for the selection phase the server considers a device participant goal count, a timeout, and a minimal percentage of the goal count which is required to run the round. The selection phase lasts until the goal count is reached or a timeout occurs; in the latter case, the round will be started or abandoned depending on whether the minimal goal count has been reached.
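The selection-phase decision described above can be sketched as follows. The parameter names (`goal_count`, `min_fraction`, `timeout_s`) and the polling-style `accept_device` callback are illustrative assumptions, not the production system's actual interfaces.

```python
import time

def run_selection_phase(accept_device, goal_count, min_fraction, timeout_s,
                        now=time.monotonic):
    """Collect participants until the goal count is reached or the timeout fires."""
    deadline = now() + timeout_s
    participants = []
    while len(participants) < goal_count and now() < deadline:
        device = accept_device()       # next checked-in device, or None
        if device is not None:
            participants.append(device)
    # On timeout, start or abandon the round depending on whether the
    # minimal percentage of the goal count was reached.
    if len(participants) >= goal_count * min_fraction:
        return participants            # enough devices: run the round
    return None                        # below the minimal goal count: abandon
```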
2.3 Pace Steering
Pace steering is a flow-control mechanism regulating the pattern of device connections. It enables the FL server both to scale down to handle small FL populations and to scale up to very large FL populations.
Pace steering is based on a simple mechanism: the server suggests to the device the optimum time window in which to reconnect. The device attempts to respect this suggestion, modulo its eligibility.
In the case of small FL populations, pace steering is used to ensure that a sufficient number of devices connect to the server simultaneously. This is important both for the rate of task progress and for the security properties of the Secure Aggregation protocol. The server uses a stateless probabilistic algorithm, requiring no additional device/server communication, to suggest reconnection times to rejected devices so that subsequent check-ins are likely to arrive contemporaneously.
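One way such a stateless suggestion could work is to point every rejected device at the next shared "rendezvous" instant, plus a little jitter. The alignment period and jitter bound below are invented for this sketch; the paper does not specify the actual algorithm.

```python
import random

def suggest_reconnect(now_s, period_s=600, jitter_s=30, rng=random):
    # Compute the next period boundary purely from the current time, so the
    # server keeps no per-device state: all devices rejected within the same
    # period are steered toward the same instant.
    next_rendezvous = (now_s // period_s + 1) * period_s
    # A small jitter spreads the arrivals over a few seconds rather than
    # having every device hit the server at the exact same moment.
    return next_rendezvous + rng.uniform(0.0, jitter_s)
```

Because the target depends only on the clock, two devices rejected at different times within one period still receive suggestions at most `jitter_s` apart.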
For large FL populations, pace steering is used to randomize device check-in times, avoiding the "thundering herd" problem, and to instruct devices to connect as frequently as needed to run all scheduled FL tasks, but not more.
Pace steering also takes into account the diurnal oscillation in the number of active devices, and is able to adjust the time window accordingly, avoiding excessive activity during peak hours without hurting FL performance during other times of the day.
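A toy illustration of such diurnal adjustment: stretch the suggested reconnect window when many devices are active and shrink it off-peak, so the per-second check-in rate at the server stays roughly flat. The activity profile and scaling rule here are hypothetical stand-ins; a real deployment would use measured device counts.

```python
import math

def window_length_s(hour, base_len_s=600.0):
    # Toy diurnal profile: activity peaks around 20:00 and bottoms out
    # around 08:00 (a cosine is used purely for illustration).
    activity = 0.5 + 0.5 * math.cos((hour - 20) / 24.0 * 2.0 * math.pi)
    # More active devices -> a longer window to spread check-ins over,
    # keeping the server's incoming connection rate roughly constant.
    return base_len_s * (0.5 + activity)
```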
3 DEVICE
[Figure 2: Device Architecture. The figure shows an application process containing the app, its data, an example store, and the configured FL runtime, separated by a process boundary (inter- or intra-app); the FL server delivers the model and (training) plan to the device and receives model updates in return.]
This section describes the software architecture running on a device participating in FL. It covers our Android implementation, but the architectural choices made here are not particularly platform-specific.
The device's first responsibility in on-device learning is to maintain a repository of locally collected data for model training and evaluation. Applications are responsible for making their data available to the FL runtime as an example store by implementing an API we provide. An application's example store might, for example, be an SQLite database recording action suggestions shown to the user and whether or not those suggestions were accepted. We recommend that applications limit the total storage footprint of their example stores, and automatically remove old data after a pre-designated expiration time, where appropriate. We provide utilities to make these tasks easy. Data stored on devices may be vulnerable to threats like malware or physical disassembly of the phone, so we recommend that applications follow best practices for on-device data security, including ensuring that data is encrypted at rest in the platform-recommended manner.
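A minimal sketch of such an application-side example store, using the SQLite shape mentioned above (suggestions shown to the user plus whether they were accepted) and the recommended expiration of old data. The schema, class name, and methods are illustrative assumptions, not the actual API the system provides.

```python
import sqlite3
import time

class ExampleStore:
    """Toy example store: records (suggestion, accepted) rows and
    expires rows older than max_age_s, per the retention advice above."""

    def __init__(self, path=":memory:", max_age_s=60 * 86400):
        self.db = sqlite3.connect(path)
        self.max_age_s = max_age_s
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS examples ("
            " created_s REAL, suggestion TEXT, accepted INTEGER)")

    def record(self, suggestion, accepted, now_s=None):
        now_s = time.time() if now_s is None else now_s
        self.db.execute("INSERT INTO examples VALUES (?, ?, ?)",
                        (now_s, suggestion, int(accepted)))

    def iter_examples(self, now_s=None):
        # Drop expired rows before handing data to the FL runtime.
        now_s = time.time() if now_s is None else now_s
        self.db.execute("DELETE FROM examples WHERE created_s < ?",
                        (now_s - self.max_age_s,))
        yield from self.db.execute(
            "SELECT suggestion, accepted FROM examples ORDER BY created_s")
```

In a real deployment the database file would additionally be encrypted at rest in the platform-recommended manner, which this in-memory sketch omits.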
The FL runtime, when provided a task by the FL server, accesses an appropriate example store to compute model updates or to evaluate model quality on held-out data. Fig. 2 shows the relationship between the example store and the FL runtime. Control flow consists of the following steps:
Programmatic Configuration
An application configures