for the input. GN here indicates replacing all Batch Normalization layers in ResNet-50 with Group Normalization. The patch embedding is applied to 1 × 1 patches extracted from the CNN feature map instead of from raw images.
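To make this concrete, the sketch below shows one way such a hybrid patch embedding can be implemented in PyTorch: a 1 × 1 convolution over the CNN feature map is equivalent to linearly projecting each spatial position into a token. The module and variable names, channel count, and embedding dimension are illustrative assumptions, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class HybridPatchEmbed(nn.Module):
    """Patch embedding over 1 x 1 patches of a CNN feature map (sketch)."""

    def __init__(self, backbone, feat_channels=2048, embed_dim=768):
        super().__init__()
        self.backbone = backbone  # e.g. ResNet-50 truncated before pooling (with GN)
        # A 1x1 convolution is a linear projection of each spatial position,
        # i.e. a patch embedding with patch size 1 x 1.
        self.proj = nn.Conv2d(feat_channels, embed_dim, kernel_size=1)

    def forward(self, x):
        feat = self.backbone(x)                    # (B, C, H', W') feature map
        tokens = self.proj(feat)                   # (B, D, H', W')
        return tokens.flatten(2).transpose(1, 2)   # (B, H' * W', D) token sequence

# Dummy stand-in for the CNN backbone, just to make the sketch runnable.
dummy_backbone = nn.Conv2d(3, 2048, kernel_size=16, stride=16)
tokens = HybridPatchEmbed(dummy_backbone)(torch.randn(2, 3, 224, 224))  # (2, 196, 768)
```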
3.2 Federated Learning Methods
We apply one of the most popular parallel methods (FedAVG [43]) and one of the most popular serial methods (CWT [7]) as training algorithms (see schematic descriptions in Figure 1).
Federated Averaging.
FedAVG combines local stochastic gradient descent (SGD) on each client with iterative model averaging [43]. Specifically, in each communication round a fraction of the local clients is randomly sampled, and the server sends the current global model to each of these clients. Each selected client then performs E epochs of local SGD on its local training data and synchronously sends its local gradients back to the central server for aggregation. The server then applies the averaged gradients to update its global model, and the process repeats.
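For concreteness, the following is a minimal sketch of one FedAVG communication round. It assumes each client exposes a hypothetical train_local(model, epochs) helper that runs local SGD on the client's private data and returns the updated weights together with the local sample count; averaging the returned weights is equivalent to applying the averaged local updates to the shared starting point.

```python
import copy
import random

def fedavg_round(global_model, clients, frac=1.0, local_epochs=1):
    """One FedAVG communication round (illustrative sketch)."""
    # 1. Randomly sample a fraction of the local clients.
    m = max(1, int(frac * len(clients)))
    selected = random.sample(clients, m)

    local_states, local_sizes = [], []
    for client in selected:
        # 2. The server sends the current global model to the client,
        #    which performs E epochs of local SGD on its own data.
        local_model = copy.deepcopy(global_model)
        state, n = client.train_local(local_model, epochs=local_epochs)
        local_states.append(state)
        local_sizes.append(n)

    # 3. The server aggregates the local results by weighted averaging
    #    and overwrites the global model before the next round.
    total = sum(local_sizes)
    avg_state = {
        key: sum((n / total) * state[key] for state, n in zip(local_states, local_sizes))
        for key in local_states[0]
    }
    global_model.load_state_dict(avg_state)
    return global_model
```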
Cyclic Weight Transfer.
In contrast to FedAVG, where the local clients are trained synchronously and in parallel, the local clients in CWT are trained serially and cyclically. In each round of training, CWT trains a global model on one local client with its local data for E epochs, and then cyclically transfers this global model to the next client for training, until every local client has been trained on once [7]. The training process then cycles through the clients repeatedly until the model converges or a predefined number of communication rounds is reached.
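A corresponding sketch of CWT, under the same assumption of a hypothetical train_local helper (here returning the updated model), highlights the contrast: there is no aggregation step, and the single global model is simply handed from client to client in a fixed cyclic order.

```python
def cwt_train(global_model, clients, comm_rounds=100, local_epochs=1):
    """Cyclic Weight Transfer (illustrative sketch)."""
    for _ in range(comm_rounds):
        for client in clients:  # fixed cyclic order over the local clients
            # Train the current global model on this client's local data,
            # then transfer the updated weights to the next client.
            global_model = client.train_local(global_model, epochs=local_epochs)
    return global_model
```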
4 Experiments
Our experiments are designed to answer the following research questions, which are important for the practical deployment of FL methods while also aiding our understanding of (vision) Transformer architectures.
• Are Transformers able to learn a better global model in FL settings compared to CNNs, which have been the de-facto approach for FL tasks (section 4.2)?
• Are Transformers especially capable of handling heterogeneous data partitions (section 4.3.1)?
• Do Transformers reduce communication costs as compared to CNNs (section 4.3.2)?
• What practical tips can help practitioners deploy Transformers in FL (section 4.4)?
Experimental code is included in the supplement and will be made public after blind review.
4.1 Experimental Setup
Following [7, 20], we evaluate different FL methods on the Kaggle Diabetic Retinopathy competition dataset (denoted as Retina) [26] and the CIFAR-10 dataset [29]. Specifically, we binarize the labels in the Retina dataset into Healthy (positive) and Diseased (negative), randomly selecting 6,000 balanced images for training, 3,000 images as the global validation dataset, and 3,000 images as the global testing dataset, following [7]. We use the original CIFAR-10 test set as the global test dataset, set aside 5,000 images from the original training dataset as the global validation dataset, and use the remaining 45,000 images as the training dataset. Detailed image pre-processing steps for the Retina and CIFAR-10 datasets are given in Appendix A.1. We simulate three data partitions for both Retina and CIFAR-10: one IID partition and two non-IID partitions with label distribution skew. Each data partition contains 4 simulated clients for Retina and 5 for CIFAR-10. We use the mean Kolmogorov-Smirnov (KS) statistic between every pair of clients to measure the degree of label distribution skew: KS = 0 indicates an IID partition, while KS = 1 indicates an extremely non-IID partition. The detailed data partitions are shown in Appendix A.1.
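As one concrete way to realize this measure (a sketch only; the exact partition construction is given in Appendix A.1), the two-sample KS statistic can be computed on the clients' label arrays with scipy.stats.ks_2samp and averaged over all client pairs; the example partition below is hypothetical.

```python
from itertools import combinations

import numpy as np
from scipy.stats import ks_2samp

def mean_pairwise_ks(client_labels):
    """Mean KS statistic over all pairs of clients' label arrays.

    0 means every client has the same label distribution (IID); values near 1
    mean the label distributions barely overlap (extreme non-IID).
    """
    stats = [ks_2samp(a, b).statistic for a, b in combinations(client_labels, 2)]
    return float(np.mean(stats))

# Hypothetical example: 5 CIFAR-10-like clients, each holding only two classes.
rng = np.random.default_rng(0)
clients = [rng.choice([2 * i, 2 * i + 1], size=9000) for i in range(5)]
print(mean_pairwise_ks(clients))  # 1.0 here: completely disjoint label sets
```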
We use a linear learning rate warm-up and decay scheduler for VIT-FL. The learning rate scheduler for FL with CNNs is selected from either linear warm-up and decay or step decay. Gradient clipping (at global norm 1) is applied to stabilize training. We set the number of local training epochs in all FL methods to 1 and the total number of communication rounds to 100, unless otherwise stated. More implementation details are given in Appendix A.2.
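Below is a minimal PyTorch sketch of the linear warm-up and decay schedule combined with gradient clipping at global norm 1; the model, base learning rate, warm-up length, and step counts are placeholders rather than the paper's settings.

```python
import torch
from torch.optim.lr_scheduler import LambdaLR

def linear_warmup_decay(optimizer, warmup_steps, total_steps):
    """LR rises linearly to its base value over `warmup_steps`,
    then decays linearly to zero by `total_steps`."""
    def lr_lambda(step):
        if step < warmup_steps:
            return (step + 1) / warmup_steps
        return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))
    return LambdaLR(optimizer, lr_lambda)

# Placeholder model and data standing in for a client's local training loop.
model = torch.nn.Linear(16, 2)
optimizer = torch.optim.SGD(model.parameters(), lr=0.03)
scheduler = linear_warmup_decay(optimizer, warmup_steps=500, total_steps=5_000)

for step in range(5_000):
    x, y = torch.randn(8, 16), torch.randint(0, 2, (8,))
    loss = torch.nn.functional.cross_entropy(model(x), y)
    optimizer.zero_grad()
    loss.backward()
    # Clip gradients at global norm 1 to stabilize training.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
    optimizer.step()
    scheduler.step()
```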