• Same label, different features (concept drift): The conditional distributions P_i(x |y) may vary across
clients even if P(y) is shared. The same label y can have very different features x for different
clients, e.g. due to cultural differences, weather effects, standards of living, etc. For example, images
of homes can vary dramatically around the world and items of clothing vary widely. Even within the
U.S., images of parked cars in the winter will be snow-covered only in certain parts of the country. The
same label can also look very different at different times, and at different time scales: day vs. night,
seasonal effects, natural disasters, fashion and design trends, etc.
• Same features, different label (concept shift): The conditional distribution P_i(y |x) may vary across
clients, even if P(x) is the same. Because of personal preferences, the same feature vectors in a
training data item can have different labels. For example, labels that reflect sentiment or next word
predictors have personal and regional variation.
• Quantity skew or unbalancedness: Different clients can hold vastly different amounts of data.
Real-world federated learning datasets likely contain a mixture of these effects, and the characterization
of cross-client differences in real-world partitioned datasets is an important open question. Most empirical
work on synthetic non-IID datasets (e.g. [289]) has focused on label distribution skew, where a non-IID
dataset is formed by partitioning a “flat” existing dataset based on the labels. A better understanding of
the nature of real-world non-IID datasets will allow for the construction of controlled but realistic non-IID
datasets for testing algorithms and assessing their resilience to different degrees of client heterogeneity.
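To make the label-distribution-skew construction above concrete, the following sketch partitions a flat labeled dataset across clients by drawing each class's client proportions from a Dirichlet distribution, one common synthetic construction; the function name, the concentration parameter alpha, and the use of NumPy are illustrative choices rather than details from the text. Smaller alpha produces stronger label skew, and the unequal per-class draws also induce some quantity skew.

import numpy as np

def dirichlet_label_skew_partition(labels, num_clients, alpha, seed=0):
    # Partition example indices into num_clients shards with label
    # distribution skew: each class's examples are split across clients
    # according to a Dirichlet(alpha) draw (smaller alpha = more skew).
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        class_idx = rng.permutation(np.flatnonzero(labels == c))
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        # Cumulative proportions give split points within this class.
        splits = (np.cumsum(proportions)[:-1] * len(class_idx)).astype(int)
        for client_id, idx in enumerate(np.split(class_idx, splits)):
            client_indices[client_id].extend(idx.tolist())
    return [np.array(idx) for idx in client_indices]

# Example: split a 10-class dataset across 10 clients with strong skew.
# shards = dirichlet_label_skew_partition(train_labels, num_clients=10, alpha=0.1)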
Further, different non-IID regimes may require the development of different mitigation strategies. For
example, under feature-distribution skew, because P(y |x) is assumed to be common, the problem is at least
in principle well specified, and training a single global model that learns P(y |x) may be appropriate. When
the same features map to different labels on different clients, some form of personalization (Section 3.3)
may be essential to learning the true labeling functions.
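As a simple illustration of the kind of personalization discussed in Section 3.3, each client could fine-tune a copy of the global model on its own local data so that the local copy can track that client's P(y |x). The sketch below assumes a hypothetical PyTorch global_model and per-client data loader; it is one possible personalization baseline, not a method prescribed by the text.

import copy
import torch

def personalize_by_finetuning(global_model, client_loader, epochs=1, lr=1e-3):
    # Fine-tune a copy of the shared global model on one client's local
    # data, yielding a personalized model for that client.
    model = copy.deepcopy(global_model)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for x, y in client_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
    return model

# Each client keeps its own personalized copy of the trained global model:
# personalized = {cid: personalize_by_finetuning(global_model, loaders[cid])
#                 for cid in client_ids}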
Violations of independence Violations of independence are introduced any time the distribution Q changes
over the course of training; a prominent example is in cross-device FL, where devices typically need to meet
eligibility requirements in order to participate in training (see Section 1.1.2). Devices typically meet those
requirements at night local time (when they are more likely to be charging, on free wi-fi, and idle), and so
there may be significant diurnal patterns in device availability. Further, because local time of day corresponds
directly to longitude, this introduces a strong geographic bias in the source of the data. Eichner et al.
[151] described this issue and some mitigation strategies, but many open questions remain.
Dataset shift Finally, we note that the temporal dependence of the distributions Q and P may introduce
dataset shift in the classic sense (differences between the train and test distributions). Furthermore, other
criteria may make the set of clients eligible to train a federated model different from the set of clients where
that model will be deployed. For example, training may require devices with more memory than is needed
for inference. These issues are explored in more depth in Section 6. Adapting techniques for handling
dataset shift to federated learning is another interesting open question.
3.1.1 Strategies for Dealing with Non-IID Data
The original goal of federated learning, training a single global model on the union of client datasets,
becomes harder with non-IID data. One natural approach is to modify existing algorithms (e.g. through