Towards Federated Learning at Scale: System Design
Configuration
The server is configured based on the aggregation mechanism selected for the chosen devices (e.g., simple or Secure Aggregation). The server sends the FL plan and an FL checkpoint with the global model to each of the devices.
Reporting
The server waits for the participating devices to report updates. As updates are received, the server aggregates them using Federated Averaging and instructs the reporting devices when to reconnect (see also Sec. 2.3). If enough devices report in time, the round is successfully completed and the server updates its global model; otherwise the round is abandoned.
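The aggregation step can be sketched as a weighted average of the reported updates. This is a minimal batch illustration only; the actual system aggregates incrementally as updates arrive, and the weighting-by-example-count convention shown here is the standard Federated Averaging formulation, not a detail stated in this section.

```python
def federated_average(updates):
    """Combine model updates weighted by each device's example count.

    updates: list of (weights, n_examples) pairs, where weights is a
    list of floats of equal length across devices.
    """
    total = sum(n for _, n in updates)
    dim = len(updates[0][0])
    avg = [0.0] * dim
    for weights, n in updates:
        # Each device contributes proportionally to its share of examples.
        for i in range(dim):
            avg[i] += weights[i] * (n / total)
    return avg
```

For example, averaging `([1.0, 2.0], 1)` with `([3.0, 4.0], 3)` weights the second device three times as heavily.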
As seen in Fig. 1, straggling devices which do not report back in time or do not react to configuration by the server are simply ignored. The protocol has a certain tolerance for such drop-outs, which is configurable per FL task.
The selection and reporting phases are specified by a set of parameters which spawn flexible time windows. For example, for the selection phase the server considers a device participant goal count, a timeout, and a minimal percentage of the goal count which is required to run the round. The selection phase lasts until the goal count is reached or a timeout occurs; in the latter case, the round will be started or abandoned depending on whether the minimal goal count has been reached.
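The selection-phase decision described above can be sketched as follows. The parameter names (`goal_count`, `min_fraction`, `timeout_s`) and the polling-style `accept_device` callback are illustrative assumptions, not the production system's actual interfaces.

```python
import time

def run_selection_phase(accept_device, goal_count, min_fraction, timeout_s,
                        now=time.monotonic):
    """Collect participants until the goal count is reached or the timeout fires."""
    deadline = now() + timeout_s
    participants = []
    while len(participants) < goal_count and now() < deadline:
        device = accept_device()       # next checked-in device, or None
        if device is not None:
            participants.append(device)
    # On timeout, start or abandon the round depending on whether the
    # minimal percentage of the goal count was reached.
    if len(participants) >= goal_count * min_fraction:
        return participants            # enough devices: run the round
    return None                        # below the minimal goal count: abandon
```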
2.3 Pace Steering
Pace steering is a flow-control mechanism regulating the pattern of device connections. It enables the FL server both to scale down to handle small FL populations and to scale up to very large FL populations.
Pace steering is based on a simple mechanism: the server suggests to the device the optimum time window in which to reconnect. The device attempts to respect this suggestion, modulo its eligibility.
In the case of small FL populations, pace steering is used to ensure that a sufficient number of devices connect to the server simultaneously. This is important both for the rate of task progress and for the security properties of the Secure Aggregation protocol. The server uses a stateless probabilistic algorithm, requiring no additional device/server communication, to suggest reconnection times to rejected devices so that subsequent check-ins are likely to arrive contemporaneously.
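One way such a stateless suggestion could work is to point every rejected device at the next shared "rendezvous" instant, plus a little jitter. The alignment period and jitter bound below are invented for this sketch; the paper does not specify the actual algorithm.

```python
import random

def suggest_reconnect(now_s, period_s=600, jitter_s=30, rng=random):
    # Compute the next period boundary purely from the current time, so the
    # server keeps no per-device state: all devices rejected within the same
    # period are steered toward the same instant.
    next_rendezvous = (now_s // period_s + 1) * period_s
    # A small jitter spreads the arrivals over a few seconds rather than
    # having every device hit the server at the exact same moment.
    return next_rendezvous + rng.uniform(0.0, jitter_s)
```

Because the target depends only on the clock, two devices rejected at different times within one period still receive suggestions at most `jitter_s` apart.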
For large FL populations, pace steering is used to randomize device check-in times, avoiding the "thundering herd" problem, and to instruct devices to connect as frequently as needed to run all scheduled FL tasks, but not more.
Pace steering also takes into account the diurnal oscillation in the number of active devices, and is able to adjust the time window accordingly, avoiding excessive activity during peak hours without hurting FL performance during other times of the day.
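A toy illustration of such diurnal adjustment: stretch the suggested reconnect window when many devices are active and shrink it off-peak, so the per-second check-in rate at the server stays roughly flat. The activity profile and scaling rule here are hypothetical stand-ins; a real deployment would use measured device counts.

```python
import math

def window_length_s(hour, base_len_s=600.0):
    # Toy diurnal profile: activity peaks around 20:00 and bottoms out
    # around 08:00 (a cosine is used purely for illustration).
    activity = 0.5 + 0.5 * math.cos((hour - 20) / 24.0 * 2.0 * math.pi)
    # More active devices -> a longer window to spread check-ins over,
    # keeping the server's incoming connection rate roughly constant.
    return base_len_s * (0.5 + activity)
```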
3 DEVICE
[Figure 2: Device Architecture. The figure shows an application process containing the app, its data, an example store, and the configured FL runtime, separated by a process boundary (inter- or intra-app); the FL server delivers the model and (training) plan to the device and receives model updates in return.]
This section describes the software architecture running on a device participating in FL. It covers our Android implementation, but the architectural choices made here are not particularly platform-specific.
The device's first responsibility in on-device learning is to maintain a repository of locally collected data for model training and evaluation. Applications are responsible for making their data available to the FL runtime as an example store by implementing an API we provide. An application's example store might, for example, be an SQLite database recording action suggestions shown to the user and whether or not those suggestions were accepted. We recommend that applications limit the total storage footprint of their example stores, and automatically remove old data after a pre-designated expiration time, where appropriate. We provide utilities to make these tasks easy. Data stored on devices may be vulnerable to threats like malware or physical disassembly of the phone, so we recommend that applications follow best practices for on-device data security, including ensuring that data is encrypted at rest in the platform-recommended manner.
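A minimal sketch of such an application-side example store, using the SQLite shape mentioned above (suggestions shown to the user plus whether they were accepted) and the recommended expiration of old data. The schema, class name, and methods are illustrative assumptions, not the actual API the system provides.

```python
import sqlite3
import time

class ExampleStore:
    """Toy example store: records (suggestion, accepted) rows and
    expires rows older than max_age_s, per the retention advice above."""

    def __init__(self, path=":memory:", max_age_s=60 * 86400):
        self.db = sqlite3.connect(path)
        self.max_age_s = max_age_s
        self.db.execute(
            "CREATE TABLE IF NOT EXISTS examples ("
            " created_s REAL, suggestion TEXT, accepted INTEGER)")

    def record(self, suggestion, accepted, now_s=None):
        now_s = time.time() if now_s is None else now_s
        self.db.execute("INSERT INTO examples VALUES (?, ?, ?)",
                        (now_s, suggestion, int(accepted)))

    def iter_examples(self, now_s=None):
        # Drop expired rows before handing data to the FL runtime.
        now_s = time.time() if now_s is None else now_s
        self.db.execute("DELETE FROM examples WHERE created_s < ?",
                        (now_s - self.max_age_s,))
        yield from self.db.execute(
            "SELECT suggestion, accepted FROM examples ORDER BY created_s")
```

In a real deployment the database file would additionally be encrypted at rest in the platform-recommended manner, which this in-memory sketch omits.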
The FL runtime, when provided a task by the FL server, accesses an appropriate example store to compute model updates or to evaluate model quality on held-out data. Fig. 2 shows the relationship between the example store and the FL runtime. Control flow consists of the following steps:
Programmatic Configuration
An application configures