Advisor: 오상윤
Author: 이승준 (Seungjun Lee)
Date issued: 2023-02
Identifier: 32400
URI: https://aurora.ajou.ac.kr/handle/2018.oak/24569
Description: Thesis (Master's), Graduate School of Ajou University, Department of Artificial Intelligence, 2023. 2

Abstract (translated from Korean): Federated learning was proposed to train deep neural networks on user data numbering in the hundreds of thousands. The technique has attracted considerable attention because it can protect personal information. However, important problems remain to be solved. The first is the limit on the number of clients that can participate simultaneously. As the number of clients grows, the single parameter server easily becomes a bottleneck, and stragglers become more likely. The second is data heterogeneity, which degrades the accuracy of the global model. Because user data must remain on user devices to protect privacy, the data shuffling that conventional distributed deep learning uses to homogenize data is difficult to apply. This study proposes CCFed, a client clustering and model aggregation method that increases the number of simultaneously participating clients while mitigating the data heterogeneity problem. CCFed uses the set partition problem to distribute data evenly across clusters, which mitigates the effect of non-IID data and improves training performance. In the experiments of this study, CCFed improved accuracy on benchmark datasets by about 2.5 to 7 percentage points over FedAvg while using only about 50% of the rounds FedAvg required.

Table of contents:
1 Introduction
2 Background
2.1 Federated Learning
2.2 Effect of the number of local updates and the number of clients
2.2.1 Relation between the number of local updates and the number of clients
2.2.2 Backward of the many clients
2.3 Heterogeneous Dataset in Federated Learning
3 Proposed Method
3.1 Motivation of Client Clustering
3.2 Architecture Overview
3.3 Problem Formulation
3.3.1 Client modeling
3.3.2 Client Data Summary
3.3.3 Cluster
3.3.4 Cluster-level Data Heterogeneity
3.4 CCFed: Client Clustering Federated Learning
3.4.1 Initializing Phase
3.4.2 Clustering Phase
3.4.3 Learning Phase
3.4.4 Repeated execution of the Clustering Phase
4 Experiment
4.1 Experiment Setting
4.1.1 Models and Datasets
4.1.2 Learning Parameter
4.2 Results
4.2.1 Training Performance of CCFed
4.2.2 Increasing the Number of Clusters
4.2.3 Effect of Varying the Re-clustering Interval
4.2.4 Effect of Set Partitioning on the Clustering
5 Related Works
6 Conclusion and Future works
7 Bibliography

Language: eng
Publisher: The Graduate School, Ajou University
Rights: Ajou University theses are protected by copyright.
Title: Can hierarchical client clustering mitigate the data heterogeneity effect in federated learning?
Type: Thesis
Affiliation: 아주대학교 대학원 (Ajou University Graduate School)
Department: Graduate School, Department of Artificial Intelligence
Date awarded: 2023-02
Degree: Master
URL: https://dcoll.ajou.ac.kr/dcollection/common/orgView/000000032400
Keywords: client clustering; data heterogeneity; federated learning; hierarchical aggregation

Abstract: Federated learning (FL) was proposed for training a deep neural network model on data from millions of users. The technique has attracted considerable attention owing to its privacy-preserving characteristic. However, two major challenges remain. The first is the limit on the number of clients that can participate simultaneously. As the number of clients increases, the single parameter server easily becomes a bottleneck, and stragglers become more likely. The second is data heterogeneity, which adversely affects the accuracy of the global model. Because data must remain on user devices to preserve privacy, data shuffling, which traditional distributed deep learning uses to homogenize training data, cannot be applied. This work proposes CCFed, a client clustering and model aggregation method that increases the number of simultaneously participating clients and mitigates the data heterogeneity problem. CCFed improves learning performance by using set partition modeling to distribute data evenly between clusters, which mitigates the effect of a non-IID environment. Experiments on benchmark datasets show that CCFed achieves 2.5 to 7 percentage points higher accuracy than FedAvg while requiring only approximately 50% of the training rounds.
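
The abstract describes grouping clients so that data is evenly distributed between clusters. The record does not give the algorithm itself, so the following is only a minimal sketch of the general idea, not CCFed's actual method. It assumes each client reports a hypothetical data summary consisting of per-class sample counts, and it swaps in a simple greedy heuristic in place of the set partition formulation. The names `imbalance` and `cluster_clients` are illustrative and do not come from the thesis.

```python
from typing import Dict, List, Tuple


def imbalance(hist: List[int]) -> int:
    """Gap between the most and least represented class in a histogram."""
    return max(hist) - min(hist)


def cluster_clients(
    label_counts: Dict[str, List[int]], num_clusters: int
) -> List[List[str]]:
    """Greedily assign clients to clusters to keep cluster label histograms even.

    label_counts maps a client id to its per-class sample counts
    (an assumed form of client data summary; raw data never leaves the client).
    """
    num_classes = len(next(iter(label_counts.values())))
    clusters: List[List[str]] = [[] for _ in range(num_clusters)]
    totals = [[0] * num_classes for _ in range(num_clusters)]

    # Place the clients with the most samples first; sorted() keeps ties stable.
    order = sorted(label_counts, key=lambda c: -sum(label_counts[c]))
    for client in order:
        counts = label_counts[client]

        def cost(k: int) -> Tuple[int, int]:
            # Imbalance of cluster k if this client joined it;
            # ties are broken in favour of the currently smaller cluster.
            merged = [t + c for t, c in zip(totals[k], counts)]
            return (imbalance(merged), sum(totals[k]))

        best = min(range(num_clusters), key=cost)
        clusters[best].append(client)
        totals[best] = [t + c for t, c in zip(totals[best], counts)]
    return clusters


# Four clients, each holding samples of only one of two classes (strongly non-IID).
example = {"a": [10, 0], "b": [0, 10], "c": [10, 0], "d": [0, 10]}
print(cluster_clients(example, num_clusters=2))
```

In this toy case each client is individually skewed, but pairing complementary clients gives every cluster a balanced label histogram, which is the property the abstract attributes to the clustering step. A greedy pass like this gives no optimality guarantee and may leave clusters uneven in size on less regular inputs.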