• Same label, different features (concept drift): The conditional distributions P_i(x |y) may vary across
clients even if P(y) is shared. The same label y can have very different features x for different
clients, e.g. due to cultural differences, weather effects, standards of living, etc. For example, images
of homes can vary dramatically around the world and items of clothing vary widely. Even within the
U.S., images of parked cars in the winter will be snow-covered only in certain parts of the country. The
same label can also look very different at different times, and at different time scales: day vs. night,
seasonal effects, natural disasters, fashion and design trends, etc.
• Same features, different label (concept shift): The conditional distribution P_i(y |x) may vary across
clients, even if P(x) is the same. Because of personal preferences, the same feature vectors in a
training data item can have different labels. For example, labels that reflect sentiment or next word
predictors have personal and regional variation.
• Quantity skew or unbalancedness: Different clients can hold vastly different amounts of data.
Real-world federated learning datasets likely contain a mixture of these effects, and the characterization
of cross-client differences in real-world partitioned datasets is an important open question. Most empirical
work on synthetic non-IID datasets (e.g. [289]) has focused on label distribution skew, where a non-IID
dataset is formed by partitioning a “flat” existing dataset based on the labels. A better understanding of
the nature of real-world non-IID datasets will allow for the construction of controlled but realistic non-IID
datasets for testing algorithms and assessing their resilience to different degrees of client heterogeneity.
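To make the label-distribution-skew construction above concrete, the following sketch partitions a flat labeled dataset across clients by drawing each class's client proportions from a Dirichlet distribution, one common synthetic construction; the function name, the concentration parameter alpha, and the use of NumPy are illustrative choices rather than details from the text. Smaller alpha produces stronger label skew, and the unequal per-class draws also induce some quantity skew.

import numpy as np

def dirichlet_label_skew_partition(labels, num_clients, alpha, seed=0):
    # Partition example indices into num_clients shards with label
    # distribution skew: each class's examples are split across clients
    # according to a Dirichlet(alpha) draw (smaller alpha = more skew).
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    client_indices = [[] for _ in range(num_clients)]
    for c in np.unique(labels):
        class_idx = rng.permutation(np.flatnonzero(labels == c))
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        # Cumulative proportions give split points within this class.
        splits = (np.cumsum(proportions)[:-1] * len(class_idx)).astype(int)
        for client_id, idx in enumerate(np.split(class_idx, splits)):
            client_indices[client_id].extend(idx.tolist())
    return [np.array(idx) for idx in client_indices]

# Example: split a 10-class dataset across 10 clients with strong skew.
# shards = dirichlet_label_skew_partition(train_labels, num_clients=10, alpha=0.1)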
Further, different non-IID regimes may require the development of different mitigation strategies. For
example, under feature-distribution skew, because P(y |x) is assumed to be common, the problem is at least
in principle well specified, and training a single global model that learns P(y |x) may be appropriate. When
the same features map to different labels on different clients, some form of personalization (Section 3.3)
may be essential to learning the true labeling functions.
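As a simple illustration of the kind of personalization discussed in Section 3.3, each client could fine-tune a copy of the global model on its own local data so that the local copy can track that client's P(y |x). The sketch below assumes a hypothetical PyTorch global_model and per-client data loader; it is one possible personalization baseline, not a method prescribed by the text.

import copy
import torch

def personalize_by_finetuning(global_model, client_loader, epochs=1, lr=1e-3):
    # Fine-tune a copy of the shared global model on one client's local
    # data, yielding a personalized model for that client.
    model = copy.deepcopy(global_model)
    optimizer = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    model.train()
    for _ in range(epochs):
        for x, y in client_loader:
            optimizer.zero_grad()
            loss = loss_fn(model(x), y)
            loss.backward()
            optimizer.step()
    return model

# Each client keeps its own personalized copy of the trained global model:
# personalized = {cid: personalize_by_finetuning(global_model, loaders[cid])
#                 for cid in client_ids}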
Violations of independence Violations of independence are introduced any time the distribution Q changes
over the course of training; a prominent example is in cross-device FL, where devices typically need to meet
eligibility requirements in order to participate in training (see Section 1.1.2). Devices typically meet those
requirements at night local time (when they are more likely to be charging, on free wi-fi, and idle), and so
there may be significant diurnal patterns in device availability. Further, because local time of day corresponds
directly to longitude, this introduces a strong geographic bias in the source of the data. Eichner et al.
[151] described this issue and some mitigation strategies, but many open questions remain.
Dataset shift Finally, we note that the temporal dependence of the distributions Q and P may introduce
dataset shift in the classic sense (differences between the train and test distributions). Furthermore, other
criteria may make the set of clients eligible to train a federated model different from the set of clients where
that model will be deployed. For example, training may require devices with more memory than is needed
for inference. These issues are explored in more depth in Section 6. Adapting techniques for handling
dataset shift to federated learning is another interesting open question.
3.1.1 Strategies for Dealing with Non-IID Data
The original goal of federated learning, training a single global model on the union of client datasets,
becomes harder with non-IID data. One natural approach is to modify existing algorithms (e.g. through