Abstract
To address the accuracy degradation caused by feature redundancy, this paper designs a federated learning strategy that combines a composite meta-consistency loss with multi-head attention. First, this paper decorrelates features based on duality theory for constrained optimization to eliminate redundant information, and improves the stability of the model through gradient-based regularization; the composite meta-consistency loss is constructed from these two optimization methods. Experiments show that, compared with recent algorithms, the maximum accuracy on CIFAR-10 and Oxford-Pets is improved by 0.82% and 2.19%, respectively. This paper then introduces multi-head attention into the federated learning framework. By capturing richer context information during feature extraction, and by combining the inner-layer and outer-layer updates of meta-learning, the framework copes effectively with the differing data distributions of clients and ultimately accelerates convergence. Compared with other algorithms, the average accuracy over the first 40 rounds on the MNIST, CIFAR-10, and CIFAR-100 data sets is higher. On CIFAR-10, SVHN, and Oxford-Pets, taking Robust-HDP as the benchmark, the speedup ratio reaches 1.5, 1.42, and 1.34, respectively, faster than the other algorithms.
Introduction
In recent years, machine learning has been widely applied in daily life and across industries, with a profound impact on social development. Machine learning relies heavily on model training supported by massive data. However, with growing awareness of data security and privacy protection, data owners have become less willing to share sensitive data, and governments have successively promulgated regulations on privacy protection, such as the European Union's General Data Protection Regulation (Goddard, 2017). To address the resulting data-silo problem, Google proposed the concept of federated learning (FL). FL is a distributed machine learning framework that aims to build a global model across distributed clients (Konečný et al., 2016; McMahan et al., 2016; Yu et al., 2020). FL can effectively break down data silos, allowing different institutions and individuals to legally share dispersed data and jointly train high-quality models.
In FL, the design and application of the loss function directly affect the training effect and final performance of the model. The cross-entropy loss function is widely used in federated classification tasks, but it has several shortcomings: (a) it handles non-independently and identically distributed (non-IID) data poorly, which may bias the global model; (b) it imposes no consistency constraints on the feature space, so features remain strongly correlated and carry redundant information; and (c) because clients hold different amounts of data and cross-entropy is sensitive to label noise and false labels, it easily overfits local data. These problems leave room for improvement in the accuracy and speed of FL training. Several methods have been proposed to optimize the FL loss function. Wang et al. (2018) analyzed the convergence bound of distributed gradient descent for minimizing the FL loss function. Wei et al. (2020) and Chen et al. (2021) proposed theoretical convergence bounds and expected convergence-rate optimizations of FL algorithms, respectively, aiming to improve convergence performance and minimize loss functions. Ghosh et al. (2020) proposed an iterative federated clustering algorithm to analyze the convergence rate of strongly convex smooth loss functions. Dinh et al. (2021) and Li et al. (2021b) proposed the FEDL model and the Fed-LSGAN framework, respectively, to improve training stability and generation quality. Zhang et al. (2021) and Shlezinger et al. (2021) further optimized FL loss minimization through a three-tier collaborative FL architecture and a universal vector quantization approach. Wang et al. (2020) designed a ratio loss function to mitigate data imbalance. Li et al. (2021a) and Dong et al. (2022) optimized loss minimization through blockchain-assisted learning and the SphereFed framework, respectively. The composite meta-consistency loss (CMCL) constructed in this paper reduces feature redundancy, constrains the correlation of the feature space, and reduces the influence of noise to improve the utilization of effective information, enabling the model to focus on extracting and using important, discriminative features. Ultimately, it improves classification or prediction accuracy and model robustness.
In addition, reducing communication overhead by optimizing the FL framework is another important direction of FL research. In FL, communication cost is a key consideration: frequent data exchange may significantly increase network overhead and latency. Accelerating the convergence of FL reduces the overall bandwidth requirement of the system and improves its efficiency and scalability. Several studies have sought to improve FL performance. Li et al. (2023) proposed a method that achieves both generality and personalization in FL using equiangular tight-frame classifiers. Lee et al. (2021) preserved the global view of not-true classes through a federated not-true distillation algorithm. Oh et al. (2021) proposed the FedBABU algorithm, which updates the model body during training and fine-tunes the head during evaluation. The SCAFFOLD algorithm proposed by Karimireddy et al. (2019) corrects client drift. Jhunjhunwala et al. (2023) developed the FedExP mechanism, which accelerates FL via extrapolation inspired by the projection onto convex sets (POCS) method. Kim et al. (2024) improved the alignment of local models and the aggregation of the global model with the FedDr+ algorithm.
In theory, meta-learning can exploit inner updates on each client followed by outer updates that aggregate them, so that the model quickly adapts to new tasks with little training data and fewer communication rounds; through the multi-head attention mechanism, the model can integrate global information from other clients during local updates. This effective use of global information reduces the model's dependence on frequent communication and speeds up convergence of the global model. Several studies have explored fault diagnosis and meta-learning (Feng et al., 2021; Tao et al., 2022; Yang et al., 2022b); for example, Tao Hongfeng designed a fault diagnosis method combining parameter optimization and feature measurement. On combining FL with meta-learning, some studies focus on improving model performance and convergence. Khodak et al. (2019) developed a framework to enhance meta-testing, while Liu et al. (2021b) and Yue et al. (2022) proposed the NUFM algorithm to accelerate convergence and achieve distributed interference recognition. Wang et al. (2022) used the PrivRec model for personalized FL (PFL). Others emphasize fine-tuning models and improving communication efficiency. Xiong et al. (2022) and Yang et al. (2022a) used model-agnostic meta-learning (MAML) and the G-FML framework to personalize FL, and Liu et al. (2021a) introduced the communication-efficient PFA+ algorithm. Noble et al. (2021) explored the convergence of convex and non-convex federated learning algorithms under differential privacy. Malekmohammadi et al. (2024) proposed the Robust-HDP model to reduce model-update noise and improve the stability and accuracy of FL systems.
The existing FL methods, such as FedAvg, DPFedAvg, and Robust-HDP in Table 1, while making progress in improving model accuracy and handling data heterogeneity, often overlook the constraints between the feature spaces of different tasks. This leads to imbalanced feature representations during the learning of multiple tasks. Such imbalanced representations prevent the model from effectively sharing information between tasks, thereby reducing its generalization ability. Additionally, these methods fail to handle redundant data effectively, with some redundant data being processed repeatedly across tasks, increasing the risk of overfitting to specific tasks and affecting the model’s stability. Especially in cases where there is significant heterogeneity in data distribution, existing FL methods typically assume that data is independently and identically distributed (IID), but in practice, client data is often non-IID. Non-IID data refer to scenarios where the data distributions on different clients are not the same, either due to differences in data types, class distributions, or the way data are collected. This inconsistency in data distribution introduces challenges, such as slower convergence, reduced model stability, and increased risks of overfitting, as the model struggles to generalize across clients with different data characteristics.
Table 1. Comparison of FL Methods and Their Contributions.
Note. FedAvg = federated averaging; IID = independently and identically distributed; DPFedAvg = differentially private FedAvg; Robust-HDP = robust hierarchical Dirichlet process; FedSAM = federated sharpness aware minimization; MetaVers = meta-learned versatile; FL = federated learning; CMCL = composite meta-consistency loss; MAML = model-agnostic meta-learning; SAM = sharpness-aware minimization.
To address these issues, this paper proposes a new CMCL loss function, which introduces a constraint-based decorrelation technique to improve the independence between features and avoid unnecessary noise. This innovation allows the model to focus on independent feature dimensions during training, enhancing feature representation balance and stability while avoiding overfitting to specific task data. Additionally, this paper introduces a multi-head attention mechanism, which helps the model capture richer context information during feature extraction, effectively addressing the challenges posed by non-IID data and improving the model’s adaptability to client data heterogeneity. Building on this, the paper also integrates the MAML method, accelerating the convergence speed of FL through inner and outer-layer updates, further improving the model’s generalization ability. Through these innovations, the proposed optimized framework not only significantly enhances the accuracy of FL but also accelerates the convergence process, addressing the limitations of existing methods in non-IID data environments.
The key contributions of this paper are as follows:
(1) A new CMCL loss function is proposed, which improves the independence between features through decorrelation, reduces interference from redundant information, and enhances the model's stability and generalization ability. (2) A new FL framework is designed for the first time by combining the multi-head attention mechanism with model-agnostic meta-learning; together they improve the expressiveness and adaptability of the model, achieve faster convergence, and better handle the differing data distributions across clients. (3) A series of experiments, including comparison, robustness, and hyperparameter-sensitivity experiments, jointly verify the effectiveness of the MetaFed method, providing a powerful solution for scalable, efficient, and robust FL.
Vertical FL
Vertical FL is a type of FL. FL is a distributed machine learning approach designed to protect data privacy and security while enabling collaborative learning across multiple devices or organizations. Whereas traditional machine learning typically requires data to be centralized on a server for training, FL trains the model locally on the device hosting the data and uploads only model updates, never the raw data.
Vertical FL applies when different data holders possess different features of the data, and their data samples may or may not overlap. Taking two data owners (companies A and B) as an example, the system structure of vertical FL (McMahan et al., 2016) is introduced. The FL system architecture in Figure 1 consists of three parts, which extend naturally to the case of multiple data owners.

Figure 1. Vertical federated learning system architecture.
Duality theory is a mathematical tool widely used to solve constrained optimization problems. Its basic idea is to transform the original (primal) problem into a dual problem and solve the primal problem indirectly by solving the dual. In many cases, solving the dual problem is simpler or more efficient than solving the primal problem directly, especially in complex settings such as nonlinear and convex optimization.
Liu et al. (2022) explored the application of Lagrange duality in energy systems and transparent computing. Robey et al. (2021) used non-convex duality theory and semi-infinite optimization to analyze and improve robust learning, especially adversarial training. Ji and Lejeune (2021) used duality theory to address challenges in robust learning and stochastic constrained optimization.
In constrained optimization, the primal problem is usually expressed in the following form:

$$\min_{x} \; f(x)$$

under the conditions

$$g_i(x) \le 0, \quad i = 1, \dots, m; \qquad h_j(x) = 0, \quad j = 1, \dots, p.$$

The dual problem is generally constructed by introducing Lagrange multipliers $\lambda_i \ge 0$ and $\nu_j$ to form the Lagrangian

$$L(x, \lambda, \nu) = f(x) + \sum_{i=1}^{m} \lambda_i g_i(x) + \sum_{j=1}^{p} \nu_j h_j(x),$$

whose infimum over $x$ defines the dual function $d(\lambda, \nu) = \inf_{x} L(x, \lambda, \nu)$. The goal of the dual problem is to find the dual variables $(\lambda, \nu)$, with $\lambda \ge 0$, that maximize $d(\lambda, \nu)$.
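As a concrete numerical illustration (ours, not from the paper), consider the one-dimensional problem $\min_x x^2$ subject to $x \ge 1$. Its Lagrangian $L(x, \lambda) = x^2 + \lambda(1 - x)$ is minimized at $x = \lambda/2$, giving the dual function $d(\lambda) = \lambda - \lambda^2/4$. The following Python sketch checks that the dual maximum matches the primal optimum $f(1) = 1$, as strong duality predicts for this convex problem:

```python
import numpy as np

# Dual function d(lambda) = lambda - lambda^2 / 4 for: min x^2  s.t.  x >= 1.
lams = np.linspace(0.0, 4.0, 401)
dual_vals = lams - lams**2 / 4.0

lam_star = lams[np.argmax(dual_vals)]
print(f"dual optimum d* = {dual_vals.max():.4f} at lambda = {lam_star:.2f}")
# Prints d* = 1.0000 at lambda = 2.00, matching the primal optimum f(1) = 1.
```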
This paper applies duality theory to constrained optimization problems with data-driven uncertainties and complex constraints, where new formulas using Lagrange multipliers are introduced to solve specific challenges in non-IID data distribution and robust learning, with a focus on improving optimization accuracy and stability by enhancing duality formulas. This contribution extends the application of traditional duality theory to more complex real-world optimization problems involving multiple types of uncertainty.
Gradient-based regularization is a technique that measures the difference between the task gradient and the global gradient in order to adjust the update direction of the model and enhance its stability. A standard form consistent with this description is

$$R_{\text{grad}} = \left\| \nabla_{\theta} \mathcal{L}_{\text{task}}(\theta) - \nabla_{\theta} \mathcal{L}_{\text{global}}(\theta) \right\|_2^{2}. \tag{2}$$

In equation (2), $\nabla_{\theta}\mathcal{L}_{\text{task}}(\theta)$ is the gradient of the local task loss and $\nabla_{\theta}\mathcal{L}_{\text{global}}(\theta)$ is the gradient of the global objective; minimizing their discrepancy keeps local updates aligned with the global training direction.
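A minimal PyTorch sketch of this penalty, assuming equation (2) is the squared L2 gradient discrepancy above (the function name and interface are ours, not the paper's):

```python
import torch

def grad_alignment_penalty(task_loss, global_loss, params):
    # Gradients of the local task loss and the global objective; create_graph=True
    # keeps them differentiable so the penalty itself can be backpropagated.
    g_task = torch.autograd.grad(task_loss, params, create_graph=True)
    g_glob = torch.autograd.grad(global_loss, params, create_graph=True)
    # Squared L2 distance between the two gradient vectors, as in equation (2).
    return sum(((gt - gg) ** 2).sum() for gt, gg in zip(g_task, g_glob))
```

Adding this penalty to the training objective discourages local updates whose direction drifts away from the global one.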
Meta-learning focuses on making the model "learn to learn": it equips the model with the ability to adjust hyperparameters and optimizes the learning algorithm itself. The core idea is to enable machine learning systems to quickly adapt to new tasks or environments by learning how to learn. Meta-learning is widely used in many fields.
Jeon et al. (2024) used meta-variational dropout to personalize FL models in non-IID settings, improving classification accuracy. Alsulaimawi (2024) introduced Meta-FL, a framework that improves global model performance by leveraging optimization-based meta-aggregators to achieve superior accuracy, scalability, and efficiency. Wang et al. (2023) introduced a memory-based stochastic algorithm for MAML that ensures convergence and vanishing errors, making it suitable for continual learning and cross-device FL scenarios. Lim et al. (2024) introduced MetaVers, a meta-learning-based approach for PFL that achieves state-of-the-art performance on PFL benchmarks. Lan et al. (2023) enhanced the convergence of meta-learning by using historical local-adaptation models to constrain the inner-loop direction, overcoming the local-adaptation instability caused by non-convex loss functions and randomly sampled updates.
Figure 2 shows the overall flow of meta-learning. In the training process, the MAML method optimizes the generalization ability of the model through inner-layer updating and outer-layer updating. The standard formulation is

$$\theta_i' = \theta - \alpha \nabla_{\theta} \mathcal{L}_{\mathcal{T}_i}(f_{\theta}) \qquad \text{(inner-layer update)},$$

$$\theta \leftarrow \theta - \beta \nabla_{\theta} \sum_{\mathcal{T}_i} \mathcal{L}_{\mathcal{T}_i}(f_{\theta_i'}) \qquad \text{(outer-layer update)},$$

where $\alpha$ and $\beta$ are the inner and outer learning rates and $\mathcal{T}_i$ denotes the $i$-th task.
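To make the two updates concrete, here is a self-contained PyTorch sketch of one MAML round on a toy linear-regression model; the task interface, learning rates, and model are illustrative, not the paper's configuration:

```python
import torch

def loss_fn(w, b, x, y):
    # Mean-squared error of a linear model.
    return ((x @ w + b - y) ** 2).mean()

def maml_round(w, b, tasks, alpha=0.01, beta=0.001):
    outer_loss = 0.0
    for (xs, ys), (xq, yq) in tasks:  # (support, query) batches per task
        # Inner-layer update: theta_i' = theta - alpha * grad L_support(theta)
        gw, gb = torch.autograd.grad(loss_fn(w, b, xs, ys), (w, b),
                                     create_graph=True)
        w_i, b_i = w - alpha * gw, b - alpha * gb
        # Query loss evaluated at the adapted parameters theta_i'
        outer_loss = outer_loss + loss_fn(w_i, b_i, xq, yq)
    # Outer-layer update: theta <- theta - beta * grad of summed query losses
    gw, gb = torch.autograd.grad(outer_loss, (w, b))
    return w - beta * gw, b - beta * gb

w = torch.zeros(3, requires_grad=True)
b = torch.zeros(1, requires_grad=True)
tasks = [((torch.randn(8, 3), torch.randn(8)),
          (torch.randn(8, 3), torch.randn(8))) for _ in range(4)]
w, b = maml_round(w, b, tasks)
```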
The multi-head attention mechanism is a variant of the attention mechanism commonly used in deep learning models, especially for processing multiple information sources or learning multiple representations. It computes multiple attention weights in parallel and attends to the inputs from different perspectives, enhancing the expressiveness of the model when dealing with complex relationships and multi-modal data.

Figure 2. Meta-learning process framework.
The multi-head attention mechanism is widely used in current FL (Chen et al., 2024; Choudhry et al., 2024; Wang et al., 2024). Li et al. (2024) proposed residual attention for FL (RAFL), which uses multiple attention mechanisms to enrich personalized feature information. Jiang et al. (2023) improved the personalization of local models while aggregating them into new global models. Wu and Kwon (2023) proposed a recommendation system that introduces the multi-head attention mechanism into a personalized federated knowledge distillation model.
Figure 3 shows multi-head attention using a fully connected layer to implement learnable linear transformations:

Figure 3. Multi-head attention flowchart.
As shown in Figure 3, given a query $\mathbf{q}$, key $\mathbf{k}$, and value $\mathbf{v}$, each attention head first applies its own learnable linear transformations and then computes attention:

$$\text{head}_i = \text{Attention}\!\left(\mathbf{W}_i^{(q)}\mathbf{q}, \; \mathbf{W}_i^{(k)}\mathbf{k}, \; \mathbf{W}_i^{(v)}\mathbf{v}\right), \quad i = 1, \dots, h.$$

Then the operation of concatenating the outputs of the multiple attention heads and applying a final linear transformation $\mathbf{W}_o$ is shown in equation (7):

$$\text{MultiHead}(\mathbf{q}, \mathbf{k}, \mathbf{v}) = \mathbf{W}_o \begin{bmatrix} \text{head}_1 \\ \vdots \\ \text{head}_h \end{bmatrix}. \tag{7}$$
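For concreteness, here is a compact PyTorch sketch of this computation, using the common scaled dot-product form of each head (the dimensions and names are illustrative, not the paper's exact configuration):

```python
import torch
import torch.nn as nn

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model=64, num_heads=8):
        super().__init__()
        assert d_model % num_heads == 0
        self.h, self.d_k = num_heads, d_model // num_heads
        self.w_q = nn.Linear(d_model, d_model)  # learnable W^(q) (fully connected)
        self.w_k = nn.Linear(d_model, d_model)  # learnable W^(k)
        self.w_v = nn.Linear(d_model, d_model)  # learnable W^(v)
        self.w_o = nn.Linear(d_model, d_model)  # W_o applied after concatenation

    def forward(self, q, k, v):
        B, T, D = q.shape
        def split(x):  # (B, T, D) -> (B, h, T, d_k): one slice per head
            return x.view(B, -1, self.h, self.d_k).transpose(1, 2)
        q, k, v = split(self.w_q(q)), split(self.w_k(k)), split(self.w_v(v))
        scores = q @ k.transpose(-2, -1) / self.d_k ** 0.5
        heads = torch.softmax(scores, dim=-1) @ v
        concat = heads.transpose(1, 2).reshape(B, T, D)  # equation (7): concat
        return self.w_o(concat)

mha = MultiHeadAttention()
x = torch.randn(2, 10, 64)  # (batch, sequence, features)
out = mha(x, x, x)          # self-attention; out.shape == (2, 10, 64)
```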
While earlier methods use multi-head attention to improve personalization by enhancing local feature extraction (as in RAFL, Li et al., 2024, and other PFL models), this paper goes further by addressing the lack of feature-space constraints between different tasks, which leads to unbalanced feature representations and redundant data processing. By introducing the CMCL loss function, this paper enhances feature independence and stability, and combines it with multi-head attention to capture richer context information during feature extraction. This approach not only refines the model's expressiveness but also integrates the MAML framework, allowing FL to adapt effectively to data heterogeneity and accelerate convergence, thus offering a more comprehensive solution that improves both accuracy and training efficiency.
Feature redundancy allows the model to learn the same information multiple times while ignoring features that truly contribute to the task, which not only reduces learning efficiency but may also lead to overfitting. Yi et al. (2024) proposed FedPE, an FL framework that combines adaptive pruning expansion, an error-compensation strategy, and fair aggregation. Zhou et al. (2024) proposed a new PFL framework combining adaptive pruning of edge data to handle non-IID data. Yan et al. (2024a, 2024b) introduced cluster-contrastive federated clustering and CCFC++, methods that combine representation learning with federated clustering to improve clustering performance.
In order to reduce the feature redundancy and the correlation between features, this paper constructs an optimization problem using the dual theory to find the optimal feature transformation matrix, which realizes the removal of the correlation between features and reduces the influence of redundant information. In addition, this paper also improves the stability of model training through gradient-based regularization. Finally, the CMCL algorithm is constructed by integrating feature decorrelation and gradient-based regularization into cross-entropy loss, in order to obtain better results in FL training.
CMCL Algorithm
Feature decorrelation is an important part of CMCL, and this subsection explains the design of the scheme. Huang et al. (2024) introduced salience-guided feature decorrelation (SGFD) for vision-based reinforcement learning, which uses random Fourier features and saliency maps to eliminate feature correlations and achieve generalization. Wen et al. (2024) proposed a Fourier feature decorrelation-based sample-focus method for locating dense crowds, using the Fourier transform and cross-covariance operators to decouple feature correlations and improve the model's focus on relevant target features. In this paper, feature decorrelation is based mainly on duality theory.
The Construction of the Duality Problem for Feature Decorrelation
In this paper, we hope to reduce the redundancy among features by means of feature decorrelation, so as to improve the generalization ability of the model. According to duality theory for constrained optimization, the problem can be formulated as an optimization problem whose goal is to find a transformation matrix $\mathbf{W}$ such that the transformed features $\mathbf{Z} = \mathbf{X}\mathbf{W}$ are mutually uncorrelated, that is, the covariance matrix of $\mathbf{Z}$ is (approximately) diagonal.

The optimization problem is specifically defined as follows: minimize the redundancy of the transformed features, measured by the off-diagonal entries of $\mathbf{W}^{\top}\boldsymbol{\Sigma}\mathbf{W}$, where $\boldsymbol{\Sigma}$ is the covariance matrix of the input features, under the condition of orthonormality, $\mathbf{W}^{\top}\mathbf{W} = \mathbf{I}$.

Then, according to duality theory, we construct the Lagrangian dual problem. By introducing the Lagrange multiplier matrix $\boldsymbol{\Lambda}$ for the orthonormality constraint, we obtain the Lagrangian $L(\mathbf{W}, \boldsymbol{\Lambda})$.

By taking the derivative of $L(\mathbf{W}, \boldsymbol{\Lambda})$ with respect to $\mathbf{W}$ and setting it to zero, the stationarity condition reduces to the eigenvalue problem $\boldsymbol{\Sigma}\mathbf{W} = \mathbf{W}\boldsymbol{\Lambda}$, so the optimal transformation is formed from eigenvectors of the covariance matrix.

Above, we established the goal of finding the transformation matrix $\mathbf{W}$; in practice, it is computed in the following four steps.
(1) Calculate the covariance matrix of the data. In neural network training, we first compute the covariance matrix $\boldsymbol{\Sigma}$ of the (centered) features. (2) Eigenvalue decomposition. Eigenvalues and eigenvectors are obtained through the eigenvalue decomposition of the covariance matrix $\boldsymbol{\Sigma}$. (3) Select the principal components. In this step, we select the eigenvectors corresponding to the largest eigenvalues to form the transformation matrix $\mathbf{W}$. (4) Data transformation. Using the selected eigenvector matrix $\mathbf{W}$, the features are projected as $\mathbf{Z} = \mathbf{X}\mathbf{W}$, yielding decorrelated features.
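A minimal NumPy sketch of these four steps (ours; the variable names are illustrative):

```python
import numpy as np

def decorrelate(X, k=None):
    """PCA-style feature decorrelation following the four steps above.
    X: (n_samples, n_features); k: number of principal components to keep."""
    Xc = X - X.mean(axis=0)                 # center the data
    cov = np.cov(Xc, rowvar=False)          # step 1: covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)  # step 2: eigenvalue decomposition
    order = np.argsort(eigvals)[::-1]       # sort by decreasing variance
    W = eigvecs[:, order[:k]]               # step 3: select principal components
    Z = Xc @ W                              # step 4: transform the data
    return Z, W

# The transformed features are mutually uncorrelated: the covariance of Z is
# diagonal, so the off-diagonal entries below print as (numerically) zero.
Z, W = decorrelate(np.random.randn(500, 8) @ np.random.randn(8, 8), k=8)
print(np.round(np.cov(Z, rowvar=False), 3))
```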
By decorrelating features at the local level using the described process, each client can generate a more meaningful representation of its local data distribution. When these representations are aggregated, the global model benefits from more personalized and generalizable features, improving convergence speed and model robustness. Moreover, by reducing redundant feature information and focusing on the most relevant features, the communication overhead in FL is reduced. Since only the transformed, decoupled features need to be transmitted rather than raw, high-dimensional data, the communication efficiency is significantly improved, particularly in federated scenarios with limited bandwidth or high communication cost.
Figures 4 and 5 show the heat maps of feature correlation for the CIFAR-10, CIFAR-100, and MNIST data sets before and after feature decorrelation, where red represents positive correlation, blue negative correlation, and white no correlation. The diagonal shows each feature's correlation with itself, which is 1 and therefore red. As shown in Figures 4 and 5, the heat maps of the three data sets change from red and blue to predominantly white, indicating that the correlation between features decreases significantly.

Figure 4. Heat maps of feature correlation before feature decorrelation.

Figure 5. Heat maps of feature correlation after feature decorrelation.
In machine learning, high correlation between adjacent features may indicate duplicated information. By analyzing the mean correlation of adjacent features, redundant features can be identified, helping to reduce feature dimensionality and improve model efficiency and performance. Table 2 shows the change in mean correlation of adjacent features in the three data sets before and after feature decorrelation; it confirms that the duality-theory-based feature decorrelation scheme constructed in this paper is effective.
Table 2. Mean Correlation of Adjacent Features.
In this paper, we hope to optimize both the predictive ability and feature decorrelation of the model. Therefore, the final loss function should combine three objectives:
(1) The basic task loss (cross-entropy). (2) A decorrelation regularization term that reduces feature redundancy by minimizing the trace of the feature covariance matrix. (3) A gradient-consistency term that measures the difference between the task gradient and the global gradient, adjusting the update direction of the model and enhancing its stability.

Therefore, the final CMCL is given by equation (13):

$$\mathcal{L}_{\text{CMCL}} = \mathcal{L}_{\text{CE}} + \lambda_{1}\,\mathrm{tr}\!\left(\boldsymbol{\Sigma}_{\mathbf{Z}}\right) + \lambda_{2}\left\| \nabla_{\theta}\mathcal{L}_{\text{task}} - \nabla_{\theta}\mathcal{L}_{\text{global}} \right\|_2^{2}, \tag{13}$$

where $\boldsymbol{\Sigma}_{\mathbf{Z}}$ is the covariance matrix of the transformed features and $\lambda_1$ and $\lambda_2$ weight the two regularization terms.
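A PyTorch sketch of equation (13), reusing the gradient-alignment penalty sketched earlier; the weights lam1 and lam2 are illustrative hyperparameters, not values from the paper:

```python
import torch
import torch.nn.functional as F

def cmcl_loss(logits, targets, features, grad_penalty, lam1=0.01, lam2=0.1):
    ce = F.cross_entropy(logits, targets)              # (1) basic task loss
    z = features - features.mean(dim=0, keepdim=True)  # center the features
    cov = (z.t() @ z) / (features.size(0) - 1)         # feature covariance
    decor = torch.trace(cov)                           # (2) trace penalty
    return ce + lam1 * decor + lam2 * grad_penalty     # (3) gradient term
```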
In this section, we present the experimental results of CMCL in FL (hereinafter referred to as FedCMCL), including an ablation study and comparisons with FedAvg (McMahan et al., 2016), FedDr+ (Kim et al., 2024), FedSAM (Li et al., 2025), and MetaVers (Lim et al., 2024).
Experimental Setup

Data Set and Model
To simulate a real FL scenario involving multiple data sets, we tested six data sets: CIFAR-10, CIFAR-100, MNIST, SVHN, Fashion-MNIST, and Oxford-Pets.
Ablation Experiment
Figure 6 shows the change in accuracy when training with the cross-entropy loss function and with FedCMCL on three data sets (CIFAR-10, CIFAR-100, and MNIST). The horizontal axis of each chart represents the number of training rounds, and the vertical axis represents model accuracy. Each chart contains two lines:

Figure 6. Cross-entropy loss function and composite meta-consistency loss (CMCL) accuracy on different data sets.
As shown in Figure 6, the blue line represents the accuracy of the model trained with the cross-entropy loss function, and the orange line the accuracy of the model trained with FedCMCL. Figure 6(a) and (b) show that the orange line rises noticeably faster than the blue line, indicating that FedCMCL improves accuracy more quickly during training. In Figure 6(c), the orange line is higher in the early stage, although the gap with the blue line narrows later. Overall, FedCMCL shows clear advantages on the different data sets. Thanks to the feature decorrelation in FedCMCL, the learned features focus on dimensions that are independent and helpful for decision-making, avoiding interference from redundant information and improving accuracy. As shown in Table 3, FedCMCL outperforms cross-entropy in first-round accuracy on the CIFAR-10, CIFAR-100, and MNIST data sets. At the same time, FedCMCL uses gradient-based regularization to further enhance robustness and ensure that performance improves steadily during training. Together, these factors enable FedCMCL to adapt quickly to task changes across data sets and to outperform the traditional cross-entropy loss in the final stage, as the 40th-round accuracy in Table 3 further confirms.
Table 3. Comparison of Composite Meta-Consistency Loss in Federated Learning (FedCMCL) and Cross-Entropy Accuracy in Round 1 and Round 40.
Table 4 shows the difference between FedCMCL and FedAvg (McMahan et al., 2016), FedDr+ (Kim et al., 2024), FedSAM (Li et al., 2025), and MetaVers (Lim et al., 2024).
Table 4. Results of Different Algorithms on Different Data Sets.
Note. Acc = accuracy; FedAvg = federated averaging; FedSAM = federated sharpness aware minimization; MetaVers = meta-learned versatile; FedCMCL = composite meta-consistency loss in federated learning.
Table 4 presents the accuracy, precision, and recall of each algorithm on four data sets (CIFAR-10, SVHN, Fashion-MNIST, and Oxford-Pets). FedCMCL significantly outperforms the other algorithms in most respects, demonstrating its strength in FL. Specifically, on CIFAR-10, FedCMCL achieves the highest accuracy, outperforming the second-place algorithm by a notable margin. In precision, however, MetaVers slightly outperforms FedCMCL; this is because MetaVers employs a more refined feature-representation learning mechanism that enhances class separability. On SVHN, FedCMCL again achieves the highest accuracy, but FedSAM obtains better precision and recall, which is attributable to FedSAM's sharpness-aware minimization (SAM) strategy: converging to flatter minima reduces overfitting to specific classes. On Fashion-MNIST, FedCMCL leads in accuracy, while MetaVers slightly outperforms it in precision and recall; Fashion-MNIST has relatively simple visual patterns, and MetaVers's contrastive learning better distinguishes fine-grained class differences. On Oxford-Pets, FedCMCL achieves the highest accuracy, well ahead of the other baselines, though MetaVers marginally surpasses it in recall, since MetaVers retains more fine-grained feature details that help identify hard-to-classify instances.
Overall, the results across all data sets demonstrate that FedCMCL is the most consistently strong-performing algorithm, achieving the highest accuracy in all four data sets while maintaining competitive precision and recall. Its superior accuracy highlights its effectiveness in handling diverse data distributions, making it a robust choice for FL. Although some precision and recall values are slightly lower than those of other methods, FedCMCL maintains a balanced performance across all metrics, reinforcing its reliability and adaptability.
In the above, this paper addressed the accuracy problem of FL through CMCL. However, to speed up convergence, reduce overall bandwidth requirements and latency, and further improve the efficiency and scalability of FL, this paper builds on that work by designing a meta-FL framework that uses CMCL to improve accuracy while better addressing the communication overhead of FL.
Design of a Meta-FL Framework With Multi-Head Attention
Through the multi-head attention mechanism, the model can fuse global information from other clients during local updates, adapting more quickly to the heterogeneous data distributions of different clients. This effective use of global information reduces the model's dependence on frequent communication and speeds up convergence of the global model. We then combine the MAML inner-layer and outer-layer updates: the inner-layer update is performed on each client, and these updates are aggregated for the outer-layer update, so that the model can quickly adapt to new tasks with a small amount of training data and fewer communication rounds. This mechanism allows each client model to approach the global optimum in a short time, reducing overall training time.
The structure of the multi-head attention-based meta-FL framework constructed in this paper (hereinafter referred to as MetaFed) is shown in Figure 7.

Figure 7. Algorithmic flow of meta-federated learning based on multi-head attention.
This algorithm describes the process of meta-FL based on a multi-head attention mechanism. The entire process is divided into a client side and a server side, with model updates occurring both locally and globally. The primary inputs to the MetaFed framework are the local client data and the global model parameters $\theta$. Each client initializes its model from the global parameters and trains on its local data.

First, the server initializes the global model parameters $\theta$ and distributes them to all participating clients. Each client feeds its local data through the model, where the multi-head attention layer fuses global context information during feature extraction to produce the predicted output.

Once the predicted output is generated, each client performs an inner-layer update using the MAML method, yielding updated model parameters $\theta_i'$, which are uploaded to the server.

After receiving the updated parameters $\theta_i'$ from all clients, the server aggregates them and performs the outer-layer update to obtain the new global parameters $\theta$, which are broadcast in the next communication round.
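Putting the round together, here is a high-level Python sketch under our reading of the description above; the client interface (local_batches, loss_fn), the Reptile-style approximation of the outer update, and the all-float parameters are simplifying assumptions, not the paper's exact algorithm:

```python
import copy
import torch

def metafed_round(global_model, clients, alpha=0.01, beta=0.5):
    client_states = []
    for client in clients:
        local = copy.deepcopy(global_model)        # receive global theta
        opt = torch.optim.SGD(local.parameters(), lr=alpha)
        for x, y in client.local_batches():        # inner-layer updates
            opt.zero_grad()
            loss = client.loss_fn(local(x), y)     # forward pass includes the
            loss.backward()                        # multi-head attention layer
            opt.step()
        client_states.append(local.state_dict())   # upload theta_i'
    # Outer-layer update, approximated Reptile-style:
    # theta <- theta + beta * (mean_i(theta_i') - theta)
    with torch.no_grad():
        for name, p in global_model.state_dict().items():
            avg = torch.stack([s[name] for s in client_states]).mean(dim=0)
            p.copy_(p + beta * (avg - p))
```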
The computational requirements of the MetaFed framework mainly involve feature decorrelation, model updates, and attention computations. The CMCL process requires calculating the covariance matrix, performing feature decorrelation, and selecting the principal components, whose time cost depends on the number of clients and the scale of the data set (e.g., feature decorrelation for CIFAR-10, CIFAR-100, MNIST, SVHN, and Oxford-Pets in this paper takes 5.7281 s, 6.3243 s, 2.1977 s, 9.3658 s, and 6.4839 s, respectively). The multi-head attention mechanism also adds computational complexity, as it performs attention-layer computations to fuse information across clients during local updates, ultimately speeding up convergence and improving generalization. The choice of hyperparameters affects the computational requirements of the multi-head attention.
In this section, we present the experimental results of MetaFed. We use three data sets, SVHN, Oxford-Pets, and MNIST, and compare MetaFed with RHDP (Malekmohammadi et al., 2024), PFA, WeiAvg (Liu et al., 2021a), FedAvg (McMahan et al., 2016), FedDr+ (Kim et al., 2024), FedSAM (Li et al., 2025), and MetaVers (Lim et al., 2024) in terms of average accuracy.
Comparative Experiment
The average accuracy reflects the overall performance of different algorithms during training and helps evaluate the stability and effect of the model. A higher average accuracy means the model maintains good classification performance over many rounds. The experimental results in Figure 8 cover the MNIST, CIFAR-10, and CIFAR-100 data sets and show the average accuracy of the different algorithms over the first 40 rounds, with convergence guaranteed.

Figure 8. Comparison of average accuracy of MetaFed with other algorithms on different data sets.
As shown in Figure 8, the performance of the MetaFed framework on the different data sets highlights its advantages in convergence speed and accuracy. On MNIST, MetaFed achieves an average accuracy of 98.54%, far higher than RHDP's 84.6% and PFA's 82.16%, indicating that it effectively integrates global information through multi-head attention and quickly adapts to simple data sets. On CIFAR-10, MetaFed also performs well, reaching 41.62%, above RHDP (35.03%) and PFA (29.21%), which confirms that combining the MAML method improves adaptation to new tasks. On the more complex CIFAR-100, however, MetaFed's advantage weakens: it reaches only 30.92%, slightly above RHDP's 30.6% but better than the baseline's 25.02%. This shows that while MetaFed performs well on simple and moderately complex data sets, its advantage diminishes on more complex ones.
Table 5 presents the performance of different methods on various data sets, including CIFAR-10, SVHN, and Oxford-Pets. MetaFed consistently outperforms the other algorithms in terms of both accuracy and convergence speed. On the CIFAR-10 data set, MetaFed achieves the highest accuracy of 90.75%, surpassing MetaVers (89.40%) and Robust-HDP (86.71%). Notably, MetaFed achieves a significant speedup ratio of 1.5, outperforming DPFedAvg (1.35) and Robust-HDP (1). Similarly, on the SVHN data set, MetaFed achieves 94.28% accuracy and a speedup ratio of 1.42, demonstrating better performance than FedSAM (90.15%) and MetaVers (93.04%). In terms of training efficiency, MetaFed requires 4872.96 s on SVHN, which is lower than Robust-HDP (6901.72 s) and DPFedAvg (5412.04 s), confirming its efficiency in reducing both time and computational overhead. Additionally, MetaFed achieves the highest accuracy on the Oxford-Pets data set (96.18%) with a speedup ratio of 1.34, outperforming FedSAM (95.35%) and MetaVers (95.23%). These results demonstrate that MetaFed not only achieves high accuracy but also significantly reduces training time compared to other FL methods, offering a compelling tradeoff between performance and efficiency. The integration of CMCL with multi-head attention in MetaFed contributes to its fast convergence and robust performance, making it a superior choice for FL tasks across different data sets.
Table 5. Performance of Different Methods on Various Data Sets.
Note. DPFedAvg = differentially private federated averaging; Robust-HDP = robust hierarchical Dirichlet process; FedSAM = federated sharpness aware minimization; MetaVers = meta-learned versatile.
The results in Figure 9 show that while increasing the number of clients generally causes a slight drop in accuracy and slower convergence, the performance drop for the proposed approach is small. Specifically, the 100-client model reached roughly 79% to 80% accuracy at round 60, while 50 clients reached 81% and 20 clients showed slightly higher accuracy and faster convergence. Despite the increase in the number of clients, the performance decline is small, indicating that the proposed method, which combines meta-federated learning and multi-head attention, maintains good accuracy and convergence even with more clients, demonstrating its robustness and scalability.

Figure 9. Trend of results for different numbers of clients.
The robustness experiment results in Table 6 show that MetaFed consistently outperforms other methods in handling both label and input noise on the CIFAR-10 data set. While FedAvg experiences the greatest accuracy decline under both types of noise, with a drop from 63.38% to 60.15% for label noise and 63.38% to 54.56% for input noise, MetaFed maintains the highest post-noise accuracy, dropping from 89.34% to 86.83% under label noise and from 89.34% to 86.28% under input noise. MetaFed also achieves faster convergence with fewer rounds (80 for label noise and 85 for input noise) compared to other methods, making it more efficient. Overall, MetaFed demonstrates superior robustness to noisy data, maintaining both high accuracy and reduced training time, outperforming other algorithms such as FedDr+, FedSAM, and MetaVers.
Table 6. Corresponding Results of Different Methods on CIFAR-10 in the Noisy Case.
Note. FedAvg = federated averaging; FedSAM = federated sharpness aware minimization; MetaVers = meta-learned versatile.
The hyperparameter sensitivity results of multi-head attention in Table 7 indicate that increasing the number of heads and hidden layer dimensions typically improves the performance of the model in terms of accuracy, precision, and recall. Experiment 9 achieved the highest performance with 16 heads, 512-dimensional hidden layers, and a learning rate of 0.0005, particularly in terms of accuracy (89.56%), precision (90.34%), and recall (86.78%). A smaller learning rate (0.0005) often performs better in certain configurations, especially when the hidden layer is large and attention is focused. Overall, the best configuration to achieve optimal performance is to combine more heads, larger hidden layers, and lower learning rates.
In this paper, we address the accuracy problem in FL by introducing the CMCL technique, which reduces redundancy and improves feature independence. This enhances the model’s ability to focus on learning the most relevant features, thereby improving the generalization ability and accuracy of the FL model. However, to further improve the efficiency and scalability of FL, we design a meta-FL framework (MetaFed) that combines CMCL with MAML and a multi-head attention mechanism. The multi-head attention mechanism enables the model to efficiently fuse global information from other clients during local updates, allowing it to adapt to the heterogeneous data distribution across clients more effectively. This reduces the model’s reliance on frequent communication, speeding up the convergence of the global model and reducing both communication overhead and latency. Furthermore, by integrating MAML, we perform inner-layer updates on each client, followed by outer-layer updates, which allows the model to adapt to new tasks with a minimal amount of training data and fewer communication rounds. This accelerates convergence and enables each client model to approach the global optimal solution quickly, thus reducing the overall training time. The MetaFed framework, combining the power of CMCL, MAML, and multi-head attention, not only improves the accuracy and robustness of FL but also ensures better communication efficiency and scalability across diverse clients, making it a highly efficient solution for large-scale FL tasks.
Table 7. Cross-Validation of Hyperparameter Sensitivity for Multi-Head Attention.
In this paper, to improve the accuracy of FL, a new loss function, the CMCL function, is proposed to encourage the model to maintain consistent performance across different tasks. To further improve the model's effectiveness, an entropy penalty and gradient-based regularization are used to prevent overfitting, enhance the generalization ability and stability of the model, and ultimately improve the accuracy of FL. Experimental results on six data sets (CIFAR-10, CIFAR-100, MNIST, SVHN, Fashion-MNIST, and Oxford-Pets) verify the effectiveness of CMCL. Then, to reduce the communication overhead of FL, this paper introduces multi-head attention into the FL framework, improving the model's expressive ability by capturing richer contextual information during feature extraction; combined with MAML's inner-layer and outer-layer updates, the FL framework can quickly adapt to new tasks and converge faster. The experimental results show that on the Oxford-Pets and CIFAR-10 data sets, the models combining the multi-head attention mechanism and the MAML method outperform traditional methods in convergence speed, further validating the effectiveness of the proposed method.
Footnotes
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was funded by the National Natural Science Foundation of China (No. 61972334).
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
