Sage Journals: Discover world-class research

Abstract

Background

Automated breast ultrasound analysis is hindered by limited annotated data, institutional heterogeneity, and strict privacy regulations. This study proposes FAME (Federated Attention-guided Multi-task Ensemble Network), a privacy-preserving and data-efficient framework for joint segmentation and classification of breast ultrasound images in decentralized clinical environments.

Methods

Federated Attention-guided Multi-task Ensemble Network integrates Federated Transfer Learning with class-specific synthetic data generation via Auxiliary Classifier Generative Adversarial Networks to enhance training under data scarcity. Segmentation is performed using a Multi Attention U-Net (MAU-Net), while classification employs a dual-stage ensemble of ResNet50V2, NASNetLarge, and MAU-Net, followed by a meta-classifier. Privacy is preserved through Differential Privacy with Gaussian noise injection and Secure Aggregation for interclient model update protection. The model was trained and validated on the Breast Ultrasound Image (BUSI) dataset (780 images: 80% training, 10% validation, 10% testing) and further evaluated on independent test sets from the Breast Ultrasound Classification (BUSC) (407 images) and UDIAT (163 images) datasets. Statistical significance was assessed using paired t-tests against baseline models, and 95% confidence intervals were reported for all metrics.

Results

On the BUSI test set, FAME achieved 98.70 ± 0.27% accuracy, 96.82 ± 0.53% F1-score, and 0.978 area under the curve (AUC). On UDIAT, it reached 98.14 ± 0.31% accuracy, 94.04 ± 0.75% F1-score, and 0.960 AUC, while on BUSC, it achieved 96.92 ± 0.27% accuracy, 90.32 ± 0.80% F1-score, and 0.950 AUC. For segmentation, Dice Scores were 89.72 ± 0.53% (BUSI), 93.09 ± 0.49% (BUSC), and 87.98 ± 0.57% (UDIAT), consistently surpassing state-of-the-art baselines. Synthetic augmentation improved performance on underrepresented malignant cases and enhanced generalization under non-IID client data distributions.

Conclusion

Federated Attention-guided Multi-task Ensemble offers a scalable, privacy-compliant, and high-performing solution for multi-institutional breast ultrasound analysis. By combining federated learning, synthetic augmentation, and attention mechanisms, it provides a strong foundation for secure, collaborative breast cancer diagnosis.

Keywords

Federated transfer learning breast tumor segmentation deep learning generative adversarial network

Introduction

Breast cancer remains one of the most prevalent and life-threatening malignancies among women globally.¹ The early detection and accurate classification of breast tumors are critical in improving survival rates and guiding effective treatment decisions. Among various imaging modalities, ultrasound (US) imaging has gained prominence due to its noninvasive nature, cost-effectiveness, and ability to effectively detect abnormalities in dense breast tissue.² The automated analysis of breast US (BUS) images introduces several significant challenges that hinder the development of reliable computer-aided diagnosis systems.^3–6 Ultrasound images are inherently affected by speckle noise, low contrast, and operator dependency, which complicates accurate segmentation and classification.^7,8 These artifacts require manual interpretation by experienced radiologists, making the process subjective and time-consuming.

Training deep learning (DL) models for tumor analysis necessitates large volumes of annotated data, which are difficult to obtain in clinical settings due to the high cost and expertise required for accurate labeling.⁹ Medical institutions face stringent data privacy regulations (e.g., HIPAA and GDPR) that prevent the centralization of patient data across institutions. As a result, collaborative learning efforts are severely limited, leading to poor model generalization and limited deployment in real-world clinical scenarios. Datasets collected from different sources often exhibit significant heterogeneity in image acquisition protocols, resolution, and demographic characteristics, making it difficult for conventional centralized models to adapt to domain variations.

To clarify the motivation behind our proposed solution, we outline the key challenges in automated BUS analysis and the specific strategies employed to address them. (i) Data privacy restrictions across healthcare institutions are addressed via a Federated Transfer Learning (FTL) model, which enables decentralized model training without sharing raw data. (ii) Domain heterogeneity, arising from differences in imaging protocols and equipment, is tackled using a dual-stage ensemble classifier combining Multi Attention U-Net (MAU-Net), NASNetLarge, and ResNet50V2, which enhances generalization across non-IID distributions. (iii) Annotated data scarcity is mitigated by incorporating Auxiliary Classifier Generative Adversarial Networks (ACGAN) that generate synthetic, class-specific images locally to improve training diversity. (iv) Ultrasound imaging noise and low contrast, which obscure tumor boundaries, are handled by enhancing the U-Net into the MAU-Net, integrating both channel and spatial attention modules. (v) Security vulnerabilities in model parameter sharing, such as model inversion and membership inference attacks, are countered through Differential Privacy (DP) and Secure Aggregation (SA) techniques. These components collectively define the architecture of our Federated Attention-guided Multi-task Ensemble Network (FAME) model.

To address these challenges, we propose a novel FAME, an ensemble DL model that employs FTL ACGAN to enable privacy-preserving and data-efficient BUS image classification and segmentation. FTL facilitates decentralized model training across multiple institutions without transferring raw patient data, thereby preserving patient confidentiality and complying with regulatory standards. While Federated Learning (FL) allows for collaborative training without data exchange, it is not entirely immune to privacy risks such as model inversion and membership inference attacks through which sensitive patient information may be inferred from shared model updates.¹⁰ To mitigate this, our study emphasizes distributed training as a key privacy-preserving strategy. It further integrates additional safeguards (e.g., SA and DP) to reduce the risk of information leakage during training. Simultaneously, ACGAN is incorporated to generate synthetic, class-specific US images, augmenting the training dataset and alleviating data scarcity. These synthetic images improve model robustness, particularly in underrepresented categories, and help prevent overfitting.

At the core of our design is a dual-stage feature fusion classifier that combines the outputs of three DL models: NASNetLarge, ResNet50V2, and the proposed MAU-Net. The MAU-Net introduces attention-based mechanisms (channel and spatial) within a U-Net backbone to enhance segmentation performance by focusing on the most informative regions of the image. The multimodel ensemble approach enables the system to capture low-level and high-level features, ensuring robust tumor segmentation and accurate classification into normal, benign, and malignant categories. The novelty points of this study are as follows:

We propose a FTL model that enables decentralized collaborative training across distributed medical institutions, ensuring privacy preservation without sacrificing model performance. The model employs SA and DP to mitigate common threats such as model inversion and membership inference.

We introduce the use of ACGANs to generate class-specific synthetic US images at the local client level, effectively augmenting limited datasets and improving model generalization in underrepresented categories.

We develop a dual-stage ensemble learning architecture that synergistically fuses discriminative features from three advanced backbones, NASNetLarge, ResNet50V2, and an attention-enhanced MAU-Net to achieve improved segmentation and robust three-class classification.

We validate the effectiveness of our model through comprehensive experimentation on three benchmark BUS datasets (Breast Ultrasound Image [BUSI], UDIAT, Breast Ultrasound Classification [BUSC]), demonstrating consistent outperformance of existing state-of-the-art (SOTA) models in both segmentation metrics (Dice Score, Intersection over union [IoU]) and classification measures (Accuracy [ACC], F1-score, area under the curve [AUC]).

This integrated solution presents a scalable, privacy-preserving, and clinically reliable approach for breast tumor analysis using US imaging, offering the promising potential for deployment in multi-institutional healthcare environments.

The application of DL to BUS image analysis has gained substantial momentum due to its ability to automatically extract hierarchical features for both segmentation and classification tasks.¹¹ However, existing models often rely on centralized datasets, face performance degradation in heterogeneous data environments, and lack privacy-preserving mechanisms, limitations that directly motivate the approach proposed in our study. Several prior studies have demonstrated the efficacy of CNNs in BUS image classification.¹² For instance,¹³ applied patch-based U-Net, LeNet, and FCN-AlexNet to a small-scale BUS dataset and demonstrated reasonable classification performance. Similarly, the study⁹ improved classification ACC using data augmentation and DAGAN, with NASNet achieving 94% ACC when trained on the enhanced dataset. These efforts highlight the significance of augmentation in addressing data limitations. However, they rely on centralized training, limiting their real-world applicability due to data privacy constraints. The study¹⁴ introduced a grayscale-to-RGB mapping technique and fine-tuned VGG19 on 882 US images, outperforming expert radiologists. The study¹⁵ used Mask R-CNN for simultaneous segmentation and classification, but the performance was highly dependent on annotated data availability. The study¹⁶ compared models such as YOLO, VGG16, Fast R-CNN, and ZFNet using 1043 US images, emphasizing the importance of balanced datasets and multiscale features. While these models exhibit promising results, their reliance on manually annotated datasets and centralized learning remains a key limitation.

Transfer learning techniques have also been employed to enhance BUS classification. The study¹⁷ proposed a Multiview InceptionV3-based CNN architecture, while the study¹⁸ used ensemble learning of CNNs with RGB-fused US images to boost diagnostic ACC. Likewise, the study¹⁹ incorporated DenseNet121 with attention to ROI localization, and the study²⁰ employed ResNet101 with SVM for classification. Although these architectures offer robust feature representations, they are not inherently designed for distributed learning or data privacy preservation. Recent efforts have focused on enhancing spatial and contextual feature capture for segmentation. The study²¹ proposed a multistream segmentation network incorporating global and local features, while the study²² used adaptive spatial fusion across multiple models to improve classification on the BUSI dataset. The study²³ introduced MTL-COSA, a multitask learning architecture with context-aware self-attention, to jointly perform segmentation and classification. Transformer-based models have also emerged, such as SaTransformer²⁴ and BUViTNet,²⁵ which leveraged global self-attention for better boundary delineation. However, these models are computationally intensive and lack integration with privacy-aware training protocols. In parallel, the use of generative models to address data scarcity has also been explored. The study²⁶ applied a radiomics-based pipeline for classification, while GAN-based models⁹ were used for synthetic data generation. Yet, these works often fail to generate class-specific US images with diagnostic relevance, as addressed by our use of ACGAN.

Despite growing interest in FL in healthcare, its application in BUS imaging remains underexplored. Recent surveys, such as,^10,27 which review privacy leakage threats and mitigation techniques in secure FL, and,^10,27 which explores privacy-preserving FL in remote sensing under adversarial conditions, outline the vulnerability of FL to model inversion and membership inference attacks. These concerns necessitate stronger privacy guarantees, such as SA and DP, especially when dealing with sensitive clinical data. Most BUS studies do not account for these risks, highlighting a critical gap in deploying DL models across distributed medical institutions. In summary, while previous studies have addressed individual aspects such as augmentation, segmentation, attention mechanisms, or classification, there is a lack of a unified, privacy-preserving solution that simultaneously tackles data scarcity, segmentation precision, classification robustness, and data privacy. Our study uniquely addresses these gaps by employing FTL for decentralized training without raw data exchange, using ACGAN to generate class-conditioned synthetic US images, introducing a dual-stage ensemble model combining ResNet50V2, NASNetLarge, and the novel MAU-Net with channel and spatial attention modules, and demonstrating consistent improvements across segmentation and classification tasks on three benchmark datasets (BUSI, UDIAT, BUSC). This integrated strategy offers a scalable and secure pathway for real-world deployment of DL-based breast cancer diagnosis tools across multi-institutional settings.

Methodology

This is a retrospective study utilizing three publicly available BUS datasets (BUSI, UDIAT, and BUSC), all of which were previously collected and anonymized by their respective sources. To enhance tumor segmentation in US images, we extend the U-Net; our encoder is constructed using residual blocks, enabling deeper feature extraction while mitigating vanishing gradient issues during training. Each residual block consists of two stacked 1 × 1 convolutional layers, each followed by batch normalization and ReLU activation, with a nonidentity skip connection to enhance feature propagation. This design improves gradient flow and preserves contextual integrity across network depths. Multi Attention U-Net employs dual attention mechanisms to refine the encoded features before propagating them to the decoder. Channel attention modules are used to recalibrate the importance of feature channels by applying global average and max pooling, followed by shared fully connected layers. This allows the model to learn features that are most relevant for identifying tumor regions. Spatial attention modules identify the location of important features in the spatial dimensions using 2D convolution over concatenated pooled feature maps. Both attention types are strategically inserted after the residual blocks and before skip connections, enhancing semantic consistency between encoder and decoder representations.

Federated transfer learning

Federated transfer learning enables decentralized training of DL models across multiple healthcare institutions without requiring the exchange of raw patient data. In this setting, each institution $k \in {1, 2, \dots, K}$ maintains a private dataset $D_{k}$ and trains a local model $w_{k}$ using its data. To collaboratively update a global model w, each institution performs local optimization and transmits encrypted model parameters (e.g., weights or gradients) to a central aggregator. The aggregator then performs weighted model averaging based on the relative size of each dataset. This can be formally represented by the federated averaging as shown in equation (1):

w^{(t + 1)} = \sum_{k = 1}^{K} \frac{n_{k}}{n} w_{k}^{(t)}

(1)where

w^{(t + 1)}

is the updated global model at communication round

t + 1, w_{k}^{(t)}

is the model from institution

k, n_{k}

is the number of local samples, and

n = \sum_{k = 1}^{K} n_{k}

denotes the total number of training samples across all participants.

To improve convergence and address data heterogeneity, each local model is initialized with pretrained weights from large-scale public datasets (e.g., ImageNet). Transfer learning enables institutions with small or imbalanced datasets to adapt shared representations effectively. During each communication round, local models are updated using gradient descent, as shown in equation (2):

w_{k}^{(t + 1)} = w^{(t)} - η \nabla L_{k} (w^{(t)})

(2)where

η

is the learning rate and

L_{k} (w^{(t)})

is the local loss function computed on

D_{k}

. This iterative process continues until convergence, allowing the global model to generalize across varying domains and imaging conditions (e.g., BUSI, UDIAT, BUSC). Although FTL avoids the direct transfer of raw data, sharing model updates still poses potential risks, such as model inversion and membership inference attacks. To address this, our architecture is designed to support the integration of privacy-enhancing techniques. Secure aggregation ensures that the server cannot access individual model updates by encrypting local contributions and only decrypting the aggregated result. Additionally, DP can be applied by adding calibrated Gaussian noise to the shared parameters, as shown in equation (3):

{\tilde{w}}_{k}^{(t)} = w_{k}^{(t)} + N (0, σ^{2})

(3)where

N (0, σ^{2})

represents Gaussian noise with variance

σ^{2}

, controlling the privacy budget

ϵ

. These mechanisms significantly mitigate the risk of unauthorized data reconstruction, ensuring compliance with strict privacy regulations such as HIPAA and GDPR. By leveraging this FTL paradigm, our model enables collaborative learning across diverse institutions while safeguarding patient confidentiality and supporting scalability in real-world clinical deployments. In our study, FL was simulated across three virtual clients, each representing a distinct medical institution with non-IID subsets of BUSI, BUSC, and UDIAT datasets. Local models were trained for five epochs per communication round and transmitted encrypted parameters to a central server, which applied Federated Averaging (FedAvg) to aggregate updates. Secure Aggregation masked individual updates before transmission, ensuring that only the aggregated model was visible to the server. The global model was then redistributed to clients for subsequent training, enabling collaborative learning while maintaining data privacy and institutional autonomy.

Auxiliary classifier GAN-based synthetic image generation

While FTL enables privacy-preserving collaborative learning across decentralized institutions, it remains constrained by the availability of labeled medical data at each site. Medical imaging datasets, particularly in BUS applications, often suffer from class imbalance and limited annotated samples due to the cost and expertise required for labeling. To address this limitation, we incorporate an ACGAN at each local institution to augment the training dataset with high-quality, class-specific synthetic images, thereby improving model generalization and stability without compromising patient privacy. Auxiliary Classifier GAN is an extension of the standard Generative Adversarial Network (GAN) framework, where both the generator G and discriminator D are conditioned on class labels $c \in {0, 1, 2}$ , representing the diagnosis categories: Normal, Benign, and Malignant. Given a latent noise vector $z \sim p_{z} (z)$ , the generator produces synthetic US images $\tilde{x} = G (z, c)$ , tailored to the desired class. The discriminator is trained to perform two simultaneous tasks: (i) distinguish real from generated images and (ii) predict the class label associated with the image. This dual-task training enhances the fidelity and semantic relevance of the generated samples.

The adversarial training objectives for the ACGAN are defined as follows. The generator loss $L_{G}$ comprise adversarial and classification components as shown in equation (4):

L_{G} = E_{z \sim p_{z}, c \sim p_{c}} [- \log D_{adv} (G (z, c)) - \log D_{cls} (c ∣ G (z, c))]

(4)

The discriminator loss $L_{D}$ similarly, it combines real/fake discrimination and auxiliary classification, as shown in equation (5):

L_{D} = E_{x \sim p_{data}} [- \log D_{adv} (x) - \log D_{cls} (c ∣ x)] + E_{z \sim p_{z}, c \sim p_{ε}} [- \log (1 - D_{adv} (G (z, c)))]

(5)

Here, $D_{adv} (\cdot)$ denotes the probability that the input is real and $D_{cls} (\cdot ∣ x)$ represents the class prediction probability given an image. The use of class-conditional generation ensures that the synthetic images resemble fundamental US characteristics and are semantically aligned with specific pathological classes. These ACGAN-generated images are locally synthesized and retained within the institutional boundary, maintaining the privacy constraints of FTL. Once generated, synthetic samples are passed through the same preprocessing pipeline as authentic images, including resizing, noise reduction, normalization, and augmentation. They are then used alongside authentic images for training the segmentation and classification modules, thereby increasing data diversity and addressing class imbalance without introducing privacy risks. By combining ACGAN-based data augmentation with FTL, our methodology enhances model generalization and fairness across institutions, particularly in settings with scarce or imbalanced datasets. This integration also ensures that each client node can effectively contribute to the global model training, even when access to annotated data is limited. To avoid overfitting and prevent data leakage, synthetic images generated by ACGAN are retained strictly within each local institution and are never shared across clients. This ensures that no raw or synthetic image data leaves its source, maintaining privacy. Overfitting is mitigated by applying dropout within both the generator and discriminator, employing early stopping based on validation performance, and training with balanced mini-batches of real and synthetic images to enforce generalizable feature learning rather than memorization. The quality of synthetic images is monitored through the convergence of the discriminator's adversarial and auxiliary classification losses, which indicates that the generator produces realistic and class-consistent outputs.

Image preprocessing

Following the generation of accurate and synthetic US images via ACGAN at each local institution, a standardized preprocessing pipeline is applied to ensure that all images, regardless of source, are normalized and optimized for subsequent segmentation and classification tasks. The preprocessing stage is critical in improving image quality, mitigating modality-specific noise artifacts, and ensuring consistency across decentralized datasets, which is essential in FL environments where data heterogeneity is a key challenge. Let $x \in R^{H \times W}$ denote a grayscale US image, either real or generated. The preprocessing pipeline involves the following operations applied uniformly to all input samples:

Resizing

All images are resized to a fixed resolution $H^{'} \times W^{'}$ , denoted by as shown in equation (6):

x_{resized} = Resize (x, H^{'}, W^{'})

(6)

This uniform scaling standardizes input dimensions across datasets (BUSI, BUSC, UDIAT), which originally differ in resolution and aspect ratio (e.g., BUSI images at 500 × 500 px vs. BUSC images at 128 × 128 px). Without resizing, network layers may encounter mismatched feature dimensions, complicating training convergence. Resizing also ensures efficient GPU memory utilization and stable batch processing during federated training.

Speckle noise reduction

Ultrasound imaging is inherently affected by speckle noise, which arises from coherent interference of returning echoes and appears as granular patterns. Such noise obscures lesion boundaries and degrades the performance of feature extractors. To mitigate this, we apply a hybrid denoising strategy using median and Gaussian filtering, as shown in equation (7):

x_{denoise} = Gaussian (Median (x_{resized}))

(7)

Median filtering is effective at removing salt-and-pepper noise while preserving edges. Gaussian filtering suppresses high-frequency noise and smooths homogeneous regions. The combination is particularly effective for US images, balancing edge preservation and noise suppression. By reducing speckle noise prior to feature extraction, the model receives inputs with enhanced lesion-to-background contrast, improving both segmentation ACC and classifier reliability.

Intensity normalization

Due to differences in acquisition devices (e.g., LOGIQ E9, Siemens ACUSON) and institution-specific imaging protocols, raw US images exhibit significant variation in brightness, contrast, and dynamic range. To mitigate these variations, images were standardized using $z - score$ normalization, as shown in equation (8):

x_{norm} = \frac{x_{denoise} - μ_{x}}{σ_{x}}

(8)where

μ_{x}

and

σ_{x}

are the mean and standard deviation of the pixel intensities in

x_{denoise}

. This ensures that each input has zero mean and unit variance, reducing interdataset variability and stabilizing gradient updates during training. Intensity normalization is especially crucial in federated environments, as uncorrected variations across clients can amplify the non-IID nature of data distributions, hindering global model convergence.

Data augmentation

We apply stochastic data augmentation transformations to enhance generalization further and mitigate overfitting due to limited data, especially in underrepresented classes (e.g., malignant cases). Each image is randomly transformed using a combination of:

Rotation: $θ \sim U (- 15 \circ, 15 \circ)$

Horizontal/Vertical Flipping

Zooming: scale factor $s \sim U (0.9, 1.1)$

Mathematically, an augmented image $x_{aug}$ is generated as shown in equation (9):

x_{aug} = T (x_{norm}; θ, s, flip)

(9)where

T

is a composite transformation operator parameterized by rotation

θ

, zoom s, and random flips, through this preprocessing pipeline, actual and ACGAN-generated images are transformed into consistent, denoised, and normalized representations suitable for input to downstream modules. This is especially critical in the federated setting, where ensuring intra- and interinstitution consistency in image representation significantly impacts global model convergence and robustness. The resulting processed image set

X_{pre}

forms the common input domain for the MAU-Net-based segmentation and ensemble classification networks described in subsequent sections.

Multi Attention U-Net architecture

The proposed MAU-Net architecture is a multiattention extension of the standard U-Net, tailored specifically for BUS image segmentation, as illustrated in Figure 1. It follows a fully convolutional, symmetric encoder–decoder structure designed to learn hierarchical spatial features while preserving high-resolution details necessary for precise tumor boundary delineation. The input to the network is a preprocessed grayscale US image $x \in R^{1 \times H \times W}$ , where H and W denote the image height and width, respectively. The encoder path comprises four encoding blocks, each responsible for progressively downsampling the spatial dimensions while increasing the number of feature channels. Each encoding block consists of two consecutive convolutional layers with a kernel size of $3 \times 3$ , followed by batch normalization and ReLU activation. Mathematically, for a given encoder level l, the feature transformation is represented as shown in equation (10):

x_{l} = ReLU (BN ({Conv}_{3 \times 3} (ReLU (BN ({Conv}_{3 \times 3} (x_{l - 1}))))))

(10)

Figure 1.

Architecture of the proposed MAU-Net segmentation network enhanced with dual attention mechanisms. The network consists of a symmetric encoder–decoder design with skip connections, where each encoding block captures hierarchical spatial features through convolution and max pooling. Channel and spatial attention modules are integrated into the skip pathways to refine the flow of discriminative features. The decoder mirrors the encoder structure with upsampling and concatenation operations, followed by convolutional refinement. The final pixel-wise segmentation is achieved through a classification block using 1 × 1 convolution and softmax activation. Bottom: Detailed view of the channel and spatial attention mechanisms used to recalibrate feature responses along with channel dimensions and spatial locations.

After the double convolution, a $2 \times 2$ max pooling operation is applied, as shown in equation (11):

x_{l}^{down} = {MaxPool}_{2 \times 2} (x_{l})

(11)

At each subsequent level, the number of feature channels is doubled to allow the network to capture increasingly complex and abstract semantic features. If the first encoder block outputs 64 channels, the next block will output 128, then 256, and so on, resulting in a channel expansion pattern of $C_{l} = 2^{l + 5}$ , where $l = 0, 1, 2, 3$ . The highest level of abstraction is achieved at the bottleneck, positioned between the encoder and decoder paths. Here, both channel attention and spatial attention modules are embedded to recalibrate the feature representations dynamically.

The channel attention mechanism emphasizes informative channels (e.g., tumor texture), while spatial attention focuses on spatially relevant regions (e.g., tumor location), thereby improving feature discrimination in noisy US environments. The decoder path mirrors the encoder structure, with each decoding block consisting of a transpose convolution (also called upconvolution or deconvolution) for upsampling, followed by a concatenation of the corresponding encoder features (skip connection) and two standard convolutional layers. For the decoder level $l,$ the feature update is defined as shown in equation (12):

\begin{matrix} x_{l}^{up} = {ConvTranspose}_{2 \times 2} (x_{l + 1}) \\ x_{l}^{concat} = Concat (x_{l}^{up}, x_{l}^{enc}) \\ x_{l}^{dec} = ReLU (BN ({Conv}_{3 \times 3} (ReLU (BN ({Conv}_{3 \times 3} (x_{l}^{concat})))))) \end{matrix}

(12)

The skip connections between the encoder and decoder ensure that fine-grained spatial information lost during pooling is preserved during reconstruction. This fusion of low-level and high-level features is essential for accurately capturing BUS images’ heterogeneous and often indistinct lesion boundaries.

At the final decoder layer, a $1 \times 1$ convolution is applied to reduce the output channel dimension to 1, followed by a sigmoid activation to obtain the final binary segmentation map ${\hat{Y}}_{seg} \in [0, 1]^{H \times W}$ , as shown in equation (13):

{\hat{Y}}_{seg} = σ ({Conv}_{1 \times 1} (x_{0}^{dec}))

(13)

Here, each pixel value in ${\hat{Y}}_{seg}$ represents the predicted probability of being part of a tumor (foreground). A threshold (e.g., 0.5) is applied during inference to convert this into a binary mask. Overall, the MAU-Net architecture effectively captures both global semantic context and local spatial precision through its layer-wise encoder–decoder design enhanced with attention mechanisms. This architecture is well-suited for the complex task of BUS segmentation, where lesions often vary in appearance and are embedded in low-contrast, noisy backgrounds.

Channel attention module

The Channel Attention $(C_{map}) .$ The mechanism in MAU-Net is designed to selectively emphasize informative feature channels and suppress less relevant ones, enhancing the network's representational power for BUS segmentation. In deep convolutional neural networks, each channel of a feature map often corresponds to a specific type of learned pattern or semantic response (e.g., edges, textures, tumor regions). However, not all channels contribute equally to the task. The CA module dynamically learns channel-wise weights that recalibrate the input features by assigning higher importance to channels relevant to lesion segmentation.

Given an intermediate feature map $F \in R^{C \times H \times W}$ , where C is the number of channels, and $H \times W$ is the spatial resolution, the channel attention mechanism generates a weight vector $w_{ca} \in R^{C}$ through a dual branch pooling strategy. Specifically, two spatial context descriptors are obtained: global average pooling $f_{avg} \in R^{C}$ and global max pooling $f_{\max} \in R^{C}$ , computed as shown in equation (14):

f_{avg}^{c} = \frac{1}{H \cdot W} \sum_{i = 1}^{H} \sum_{j = 1}^{W} F_{c} (i, j), f_{\max}^{c} = \max_{i, j} F_{c} (i, j)

(14)

These descriptors are passed through a shared multilayer perceptron (MLP) composed of two fully connected layers with a ReLU activation. Let $W_{0} \in R^{C / r \times C}$ and $W_{1} \in R^{C \times C / r}$ be the learnable weights of the MLP and r the reduction ratio. The aggregated channel attention output is as shown in equation (15):

w_{ca} = σ (W_{1} \cdot ReLU (W_{0} \cdot f_{avg}) + W_{1} \cdot ReLU (W_{0} \cdot f_{\max}))

(15)where

σ (\cdot)

is the sigmoid activation function that scales the weights to the range

[0, 1]

, finally, the original feature map F is rescaled by broadcasting and multiplying the attention weights channel-wise as shown in equation (16):

F_{c}^{ca} (i, j) = w_{ca}^{c} \cdot F_{c} (i, j), \forall c \in {1, \dots, C}

(16)

This mechanism effectively boosts channels critical for identifying tumor textures or contrast changes while suppressing irrelevant background or noise channels. The integration of channel attention within the MAU-Net bottleneck and decoder stages improves the model's ability to focus on semantically rich and diagnostically relevant features, enhancing segmentation performance in US data characterized by low contrast and speckle noise.

Spatial attention module

Complementing the channel attention mechanism, the Spatial Attention ( $S_{map}$ ) module in MAU-Net is designed to focus on “where” in the image, critical features are located, rather than “what” features are essential across channels. While channel attention enhances feature discrimination across the channel dimension, spatial attention adaptively highlights salient regions within each feature map, helping the network to localize better tumor regions, particularly in US images where boundaries are often indistinct due to noise and low contrast. Let $F^{ca} \in R^{C \times H \times W}$ be the feature map output after applying the channel attention module. The spatial attention mechanism first aggregates information along with the channel axis using both average and max pooling operations, producing two 2D spatial context maps as shown in equation (17):

M_{avg} (i, j) = \frac{1}{C} \sum_{c = 1}^{C} F_{c}^{ca} (i, j), M_{\max} (i, j) = \max_{c} F_{c}^{ca} (i, j)

(17)where

M_{avg}, M_{\max} \in R^{H \times W}

represent the average and maximum pooled feature responses at each spatial location

(i, j)

. These two maps are then concatenated along with the channel dimension and passed through a convolutional layer with a

7 \times 7

kernel to produce the spatial attention map as shown in equation (18):

M_{sa} = σ ({Conv}_{7 \times 7} [M_{avg}; M_{\max}])

(18)where

M_{sa} \in [0, 1]^{H \times W}

contains spatial attention weights, and

σ

is the sigmoid activation function. The final output of the spatial attention module is obtained by broadcasting

M_{sa}

across all channels of the input feature map and performing element-wise multiplication as shown in equation (19):

F_{c}^{sa} (i, j) = M_{sa} (i, j) \cdot F_{c}^{ca} (i, j), \forall c \in {1, \dots, C}

(19)

This operation ensures that regions with high attention scores (e.g., tumor boundaries, dense textures) are amplified while background regions or irrelevant structures are suppressed. By integrating spatial attention into the decoder blocks of MAU-Net, the network can better reconstruct accurate segmentation masks, particularly in complex or ambiguous regions of BUS imagery. With channel attention, the spatial attention module forms a powerful dual-attention mechanism that enables MAU-Net to dynamically prioritize critical features and their spatial locations, enhancing segmentation robustness across datasets with varying imaging conditions.

Segmentation output and loss

At the final stage of the MAU-Net architecture, the decoder produces a high-resolution feature map $F_{final} \in R^{C \times H \times W}$ that contains spatial and semantic information refined by the preceding channel and spatial attention mechanisms. To generate the final binary segmentation mask that delineates tumor regions, a $1 \times 1$ convolution layer is applied to reduce the feature dimensionality to a single-channel output, followed by a sigmoid activation to yield pixel-wise probabilities as shown in equation (20):

{\hat{Y}}_{seg} = σ ({Conv}_{1 \times 1} (F_{final})), {\hat{Y}}_{seg} \in [0, 1]^{H \times W}

(20)

Here, ${\hat{Y}}_{seg} (i, j)$ represents the probability that pixel $(i, j)$ belongs to the lesion class. This continuous valued map is then binarized using a threshold $τ \in [0, 1]$ , typically set to 0.5 during inference, to obtain the final binary segmentation mask as shown in equation (21):

{\tilde{y}}_{seg} (i, j) = {\begin{matrix} 1, & if {\hat{Y}}_{seg} (i, j) \geq τ \\ 0, & otherwise \end{matrix}

(21)

To train the segmentation model effectively in the context of class imbalance often present in medical datasets, employ the Dice Loss, which directly optimizes for the Dice Similarity Coefficient, a widely used metric in medical image segmentation. The Dice Loss $L_{Dice}$ is formulated as shown in equation (22):

L_{Dice} = 1 - \frac{2 \sum_{i, j} y_{i, j} \cdot {\hat{Y}}_{i, j} + ϵ}{\sum_{i, j} y_{i, j}^{2} + \sum_{i, j} {\hat{Y}}_{i, j}^{2} + ϵ}

(22)where

y_{i, j} \in {0, 1}

is the ground truth mask,

{\hat{Y}}_{i, j} \in [0, 1]

is the predicted probability, and

ϵ

is a small constant (e.g.,

10^{- 6}

) added to avoid division by zero. This loss function penalizes discrepancies in false positives and false negatives and is particularly well-suited for segmenting tumors where lesion regions may occupy a small portion of the image. It promotes high overlap between predicted and actual regions, ensuring both boundary precision and region consistency. The output

{\hat{Y}}_{seg}

is further utilized in ensemble fusion, validation evaluation (e.g., Dice Score, IoU), and as input to the classification network when segmentation-driven features are included in multitask training.

Ensemble strategy with cross-validation

To enhance the robustness and generalization of the segmentation output within each federated institution, we incorporate a three-fold cross-validation ensemble strategy during MAU-Net training. This technique mitigates overfitting, accounts for data heterogeneity across institutions, and ensures more stable lesion boundary prediction in both real and ACGAN-augmented datasets. The ensemble also strengthens local models before global federated aggregation, aligning with the decentralized learning paradigm of our study. During training, the local dataset D at each institution is partitioned into three mutually exclusive subsets: $D_{1}, D_{2}, D_{3}$ . For each fold $k \in {1, 2, 3}$ , a model $M_{k}$ is trained using $D ∖ D_{k}$ and validated on $D_{k}$ . This yields three independently trained segmentation models $M_{1}, M_{2}, M_{3}$ , each producing a soft segmentation output ${\hat{Y}}_{seg}^{(k)} \in [0, 1]^{H \times W}$ for a given input image. To generate a unified prediction ${\hat{Y}}_{ens}$ , the soft outputs are aggregated using a weighted average as shown in equation (23):

{\hat{Y}}_{ens} (i, j) = \sum_{k = 1}^{3} w_{k} \cdot {\hat{Y}}_{seg}^{(k)} (i, j)

(23)where

w_{k} \in [0, 1]

are the weights assigned to each fold prediction,

\sum_{k - 1}^{3} w_{k} = 1

, and

(i, j)

denotes the spatial location in the predicted mask. In our study, equal weighting is used for simplicity, that is,

w_{1} = w_{2} = w_{3} = \frac{1}{3}

, unless fold-specific validation scores are used to adaptively reweight based on performance. After averaging, the ensemble prediction is binarized using a threshold

τ

, typically set to 0.5, as shown in equation (24):

{\tilde{y}}_{ens} (i, j) = {\begin{matrix} 1, & if {\hat{Y}}_{ens} (i, j) \geq τ \\ 0, & otherwise \end{matrix}

(24)

This final segmentation mask ${\tilde{y}}_{ens} \in {0, 1}^{H \times W}$ is then used as the definitive output of the segmentation module and is further employed in both evaluation metrics (e.g., Dice Score, loU) and the classification module as input. In the context of FL, this internal ensembling approach is executed locally at each node, ensuring that performance gains are realized without violating data privacy. The segmentation outputs were validated using three-fold cross-validation at each client node, with model performance evaluated against ground truth masks using Dice Score and IoU metrics. For each fold, independently trained MAU-Net models generated segmentation masks that were then ensembled to form the final prediction used for evaluation. It also allows each client model to generalize better before contributing to the federated aggregation process. When combined with the channel-spatial attention architecture of MAU-Net and ACGAN-augmented training data, this strategy significantly improves segmentation consistency and lesion boundary precision across multiple domains.

Classification network: dual-stage feature fusion

Following segmentation, the next stage of the pipeline is the classification of BUS images into three diagnostic categories: Normal, Benign, and Malignant. The “Normal” category refers to US images showing no abnormal tissue or lesions, whereas the “Benign” category includes nonmalignant abnormalities such as fibroadenomas or cysts. The “Malignant” category refers to cancerous tumors requiring clinical intervention. To ensure high ACC and robustness, we propose a dual-stage feature fusion classifier, which integrates the strengths of three diverse yet complementary DL backbones-ResNet50V2, NASNetLarge, and MAU-Net-through a two-tiered architectural strategy as shown in Figure 2. Each input image, whether real or ACGAN-generated, is first passed through the three base networks in parallel. Let $x \in R^{1 \times H \times W}$ represents the input image and $f_{r} (x), f_{n} (x)$ , and $f_{m} (x)$ denote the respective feature extractors of ResNet50V2, NASNetLarge, and MAU-Net. The output feature embeddings from each network can be expressed as shown in equation (25):

z_{r} = f_{r} (x), z_{n} = f_{n} (x), z_{m} = f_{m} (x)

(25)

Figure 2.

Our developed dual-staged feature fusion-based classifier framework. In Stage 1, feature maps are extracted from the untrained layers of three base models: Proposed MAU-Net, NASNetLarge, and ResNet50V2, each followed by custom dense and dropout layers. In Stage 2, the feature maps from the three models are concatenated and passed through additional dense layers with ReLU activation, leading to a final softmax layer for classifying the input image as Malignant, Benign, or Normal.

Each feature vector $z_{i} \in R^{d_{i}}$ (where $i \in {r, n, m}$ ) is the result of global average pooling applied to the final convolutional layer of the respective network, capturing high-level representations of tumor patterns, texture, and structure. In the first stage of the fusion architecture, each model is fine-tuned individually using transfer learning. The pretrained weights from ImageNet are used to initialize the convolutional layers, which remain frozen during training to preserve general-purpose visual features.

On top of each frozen backbone, we append a customized classification head composed of a fully connected (Dense) layer with 1024 units and ReLU activation to introduce nonlinearity. To reduce the risk of overfitting, a Dropout layer with a rate of 0.3 follows. The final classification output is produced by a fully connected Dense layer with 3 neurons, activated by a Softmax function to yield normalized class probabilities across the three diagnostic categories. Instead of simply averaging the predictions from the three models, we proceed to a second fusion stage that combines the learned representations into a higher-level ensemble. Specifically, the intermediate feature vectors $z_{r}, z_{n}, z_{m}$ are concatenated to form a unified feature embedding $z_{fused} = Concat (z_{r}, z_{n}, z_{m}) \in R^{d_{r} + d_{n} + d_{m}} .$ This fused representation captures diverse discriminative cues learned independently by each model. The fused feature vector is passed through a meta-classifier composed of a Dense layer with 512 neurons and ReLU activation to further learn nonlinear combinations of the base model features. A Dropout layer with a 0.3 rate is applied to regularize the meta-classifier, followed by a final Dense layer with 3 units and Softmax activation that generates the ultimate classification probabilities as shown in equation (26):

{\hat{Y}}_{fused} = Softmax ({Dense}_{3} (Dropout (ReLU (Dense_{512} (z_{fused})))))

(26)

The model is trained using the categorical cross-entropy loss function defined as shown in equation (27):

L_{CE} = - \sum_{c - 1}^{3} y_{c} \log ({\hat{Y}}_{c})

(27)where

y_{c} \in {0, 1}

is the ground truth label for the class

c_{,}

and

{\hat{Y}}_{c} \in [0, 1]

is the predicted probability for class c from the fused output. This dual-stage architecture enhances learning in two key ways: (1) it allows each base network to specialize in capturing distinct patterns from US images (e.g., anatomical detail, lesion shape, or textural features), and (2) it leverages a meta-level classifier to synthesize these diverse features into a single, robust decision. The dual-stage design was particularly necessary in our federated setting, as it provides robustness against non-IID client data by integrating complementary representations from multiple backbones. While Stage 1 ensures specialization, Stage 2 captures higher-order feature interactions, thereby improving generalization across heterogeneous US datasets. This approach is particularly beneficial in FL, where institutions may have access to different US acquisition systems, imaging qualities, and data distributions.

Implementation details

The proposed FAME FL-based architecture for BUS segmentation and classification was implemented and evaluated using a three-client simulated federation, with each client node representing a distinct medical institution. To ensure replicability and efficiency across nodes, we adopted a client–server topology, where the clients perform local training, and a central server coordinates global model aggregation via FedAvg. The overall architecture of our proposed system, combining FTL with ACGAN-based augmentation and personalization via transfer learning, is illustrated in Figure 3. It highlights the interactions between local models, the central server, auxiliary GAN training, and cross-domain deployment for unseen clients. All DL models, MAU-Net for segmentation and ResNet50V2, NASNetLarge, and MAU-Net for classification were implemented using TensorFlow 2.11 and Keras APIs. The training was conducted on NVIDIA RTX 3090 GPUs (24GB VRAM) for centralized simulation and client-side experiments. To simulate realistic federated constraints, each node had access to non-IID subsets of the datasets (BUSI, UDIAT, BUSC), with synthetic data augmentation applied locally using the ACGAN module described previously. During segmentation, ensembling was applied at each client to generate a single binary mask prediction. Classification models were locally fused using our dual-stage feature fusion pipeline. The input images were resized to $256 \times 256$ pixels and normalized to the range $[0, 1]$ after denoising. Data augmentation techniques such as horizontal/vertical flipping, $\pm 15 \circ$ rotation and random zooming (scale range: $0.9 - 1.1$ ) were applied at runtime. For ACGAN training, 10,000 synthetic samples were generated per class to balance underrepresented categories before training segmentation and classification networks.

Figure 3.

The proposed federated learning framework integrating Auxiliary Classifier GAN (ACGAN) and transfer learning. Multiple clients (e.g., hospitals) train local models on private data without sharing raw images. The server aggregates encrypted model parameters to update a global model, which is distributed back to all clients. Simultaneously, ACGANs generate class-specific synthetic data locally to enhance performance. The global model is transferred to target clients for personalized fine-tuning using transfer learning, ensuring domain adaptation to new clinical environments.

Hyperparameters were tuned through grid search, where batch size (8–16), learning rate (1e-4–1e-5), and dropout rate (0.3–0.5) were varied. The Adam optimizer with an initial learning rate of 1e-4 was selected, with ReduceLROnPlateau (factor 0.1, patience = 5) for scheduling. Early stopping (patience = 10) was employed to prevent overfitting. The final configuration was chosen based on three-fold cross-validation performance across BUSI, BUSC, and UDIAT datasets.

The federated aggregation was performed every five local training rounds, where encrypted model weights were sent to a central aggregator and updated via FedAvg, as shown in equation (28):

w^{(t + 1)} = \sum_{k = 1}^{K} \frac{n_{k}}{n} w_{k}^{(t)}

(28)where

w_{k}^{(t)}

is the local model of client

k, n_{k}

is the local dataset size, and

n = \sum_{k - 1}^{K} n_{k}

is the total sample count. The updated global model

w^{(t + 1)}

was broadcast back to all clients to initialize the next round of local training. Secure Aggregation was optionally simulated by masking the weights before transmission. Table 1 provides a comprehensive overview of the key training configurations used for both the segmentation task (MAU-Net) and the classification task (dual-stage ensemble classifier). It includes hyperparameters such as optimizer type, learning rate, number of epochs, batch size, regularization strategies, and data augmentation. These settings were consistently applied across all three datasets within the FL simulation, ensuring reproducibility and stable model convergence. All experiments were conducted in a FL environment using a client–server architecture, where non-IID data partitions were assigned to three virtual clients representing distinct medical institutions. While this setup closely approximates real-world deployment, a full-scale federated implementation across physical hospital nodes was not conducted in this study.

Table 1.

Summary of training parameters for segmentation and classification networks.

Parameter	Segmentation (MAU-Net)	Classification (Ensemble Classifier)
Batch size	8	8
Optimizer	Adam	Adam
Initial learning rate	0.0001	0.0001
Learning rate scheduler	ReduceLROnPlateau (patience = 5, factor = 0.1)	Same
Number of epochs	100	100
Early stopping	Patience = 10	Patience = 10
Dropout rate	–	0.3 (in classification heads)
L2 regularization	1e-5	1e-5
Cross-validation strategy	3-Fold	3-Fold
Augmentation	Rotation, Flipping, Zoom	Same

Datasets

This research used the BUSI dataset²⁸ to evaluate the effectiveness of the proposed architecture. The dataset, sourced in 2018 from Baheya Hospital for Early Detection and Treatment of Women's Cancer in Cairo, Egypt, includes 780 US images accompanied by ground truth segmentation masks. Each image measures 500 × 500 pixels in resolution. The dataset is divided into three categories: 437 benign images, 210 malignant images, and 133 normal images. These scans were collected from 600 female participants, aged 25–75, using the LOGIQ E9 and LOGIQ E9 Agile US devices. Ground truth tumor masks were manually delineated by expert radiologists at Baheya Hospital using MATLAB freehand segmentation, and all annotations were reviewed for ACC and consistency.²⁸ This dataset provides diverse examples of BUS images, serving as a vital resource for training and validating machine-learning models in breast cancer diagnosis. Each image in the dataset is accompanied by a corresponding binary ground truth mask delineating the tumor region (if present), created by expert radiologists. This dataset presents real-world class imbalance, particularly with fewer malignant cases, which is reflective of screening population distributions. It is frequently used in breast cancer segmentation benchmarks and is particularly valuable for training models in weakly supervised or FL contexts due to its high-quality annotations and balanced image resolution.

The UDIAT dataset¹³ was used as the final benchmark to evaluate the performance of the proposed model. Collected from the UDIAT Diagnostic Center at Parc Taulí in Sabadell, Spain, it consists of 163 US images paired with corresponding segmentation masks for ground truth verification. The images have an average resolution of 760 × 570 pixels. The dataset is split into two categories: 109 benign tumor images and 54 malignant tumor images. Captured using the Siemens ACUSON Sequoia C512 system, lesion boundaries were annotated by radiologists at the UDIAT Diagnostic Centre,²⁹ the UDIAT dataset provides a valuable resource for validating BUS machine-learning models in tumor diagnosis. While the dataset is smaller in size compared to BUSI, it offers high-resolution scans with a moderate degree of visual heterogeneity in lesion texture and shape. Its relatively balanced benign/malignant distribution makes it useful for evaluating generalization and robustness of classification and segmentation models.

The Mendeley BUSC dataset³⁰ consists of 100 benign and 150 malignant US images. The real resolution of the images is 64 × 64 pixels, which was subsequently transformed to 128 × 128 pixels for this research. As this dataset is primarily intended for classification, it does not include corresponding ground truth segmentation masks. To address this, an experienced radiologist assisted in annotating the benign and malignant tumor images, making them suitable for segmentation tasks.³¹ These expert annotations were crucial for adapting the BUSC dataset to support segmentation-based model evaluation. The BUSC dataset was designed primarily for binary classification, reflecting challenging real-world variability in lesion echotexture and boundary clarity. Despite the absence of normal images, its inclusion provides valuable insight into model performance under limited-resolution and class-skewed conditions. The dataset's compact size and lack of metadata (e.g., patient age, acquisition protocol) require careful preprocessing and augmentation to avoid overfitting.

Evaluation metrics

Egmentation metrics

Dice similarity coefficient

The Dice Score is a measure of overlap between the predicted segmentation $\hat{Y}$ and the ground truth mask Y. It is particularly useful in handling class imbalance by giving more weight to correctly segmented foreground regions, as shown in equation (29):

Dice Score = \frac{2 | Y \cap \hat{Y} |}{| Y | + | \hat{Y} |} = \frac{2 \sum_{i - 1}^{N} y_{i} \cdot {\hat{y}}_{i}}{\sum_{i = 1}^{N} y_{i} + \sum_{i - 1}^{N} {\hat{y}}_{i} + ϵ}

(29)where

y_{i} \in {0, 1}

and

{\hat{Y}}_{i} \in {0, 1}

are the ground truth and predicted binary values for pixel i, and

ϵ

is a small constant to avoid division by zero.

Intersection over union

Also known as the Jaccard Index, IoU measures the overlap between predicted and true regions relative to their union, as shown in equation (30):

IoU = \frac{| Y \cap \hat{Y} |}{| Y \cup \hat{Y} |} = \frac{\sum_{i - 1}^{N} y_{i} \cdot {\hat{Y}}_{i}}{\sum_{i = 1}^{N} y_{i} + \sum_{i - 1}^{N} {\hat{y}}_{i} - \sum_{i - 1}^{N} y_{i} \cdot {\hat{y}}_{i} + ϵ}

(30)

These metrics are reported per image and averaged across the test set to assess segmentation consistency and ACC.

Classification metrics

Accuracy

Accuracy represents the proportion of correctly classified samples among all predictions, as shown in equation (31):

Accuracy = \frac{T P + T N}{T P + T N + F P + F N}

(31)where

T P, T N, F P

, and

F N

denote true positives, true negatives, false positives, and false negatives, respectively.

F1-score

The F1-score is the harmonic mean of precision and recall, offering a balance between the two, especially important when a class imbalance exists, as shown in equation (32):

F 1 - Score = 2 \cdot \frac{Precision \cdot Recall}{Precision + Recall} = \frac{2 T P}{2 T P + F P + F N}

(32)

Sensitivity (recall)

Sensitivity, or recall, measures the proportion of actual positives correctly identified:

Sensitivity (Recall) = \frac{T P}{T P + F N}

(33)

It is particularly important in medical diagnosis tasks, where missing a positive case (e.g., a malignant tumor) is highly undesirable.

Specificity

Specificity measures the proportion of actual negatives correctly identified:

Specificity = = \frac{T N}{T N + F P}

(34)

High specificity indicates the model effectively avoids false alarms or misclassification of normal or benign cases.

Area under the curve

Area under the curve evaluates the area under the receiver operating characteristic (ROC) curve, measuring the model's ability to distinguish between classes across varying thresholds. It is defined as shown in equation (33):

AUC = \int_{0}^{1} TPR (F P R^{- 1} (x)) d x

(33)where TPR is the true positive rate, and FPR is the false positive rate. In practice, AUC is computed using numerical integration of the ROC curve derived from the classifier's probabilistic outputs. Confidence intervals (CIs) for the FAME model's classification ACC: 98.70% on BUSI (CI: 98.4–99.0%), 98.14% on UDIAT (CI: 97.8–98.5%), and 96.92% on BUSC (CI: 96.6–97.2%), confirming robust generalization across clinical imaging variations. For statistical significance, paired t-tests were conducted comparing FAME with multiple SOTA models, including U-Net, UNet++, FMRNet, and MTL-COSA. The results indicate that FAME demonstrates statistically significant improvements (p < 0.05) in classification ACC and AUC across all benchmark datasets.

Experimental results

Segmentation

In this study, Dice Score and IoU in equations (29) and (30) were used to evaluate segmentation ACC. Dice quantifies overlap between predicted and ground truth masks, while IoU provides a stricter measure of spatial agreement. The model was trained with an 80:10:10 split on BUSI and tested on BUSI, BUSC, and UDIAT. On BUSI, FAME achieved the highest performance with a Dice of 89.72 ± 0.53%, IoU of 84.81 ± 0.57%, sensitivity of 91.11 ± 0.53%, and specificity of 98.13 ± 0.35%, outperforming TransUNet, D-LinkNet, and ATFE-Net, as shown in Table 2. On BUSC, FAME achieved 93.09 ± 0.49% Dice, 87.13 ± 0.66% IoU, and a sensitivity of 87.41 ± 0.71, along with 99.59 ± 0.27% specificity, surpassing PDF-UNet, MSU-UNet, and DeepLab, as shown in Table 3. On UDIAT, FAME achieved 87.98 ± 0.57% Dice, 78.16 ± 0.66% IoU, 87.41 ± 0.62% sensitivity, and 99.61 ± 0.22% specificity, consistently outperforming ATFE-Net and Axial-DeepLab, as shown in Table 4. Figure 4 presents qualitative segmentation results on BUSI test images using UTNet, LinkNet, TransUNet, D-LinkNet, Axial-DeepLab, ATFE-Net, and FAME. Each row shows different cases with varying tumor sizes, shapes, and textures. Ground truth masks and predicted outputs are displayed for comparison.

Figure 4.

Visual comparison of breast lesion segmentation results on the BUSI dataset. From left to right: original ultrasound image, ground truth (GT) mask, predictions from UTNet, LinkNet, TransUNet, D-LinkNet, Axial-DeepLab, ATFE-Net, and our proposed model (FAME), followed by the overlay of FAME prediction on GT. The blue regions represent the predicted lesion masks.

Table 2.

Performance comparison of segmentation methods on the BUSI dataset.

Methods	Dice (%)	IoU (%)	Sen. (%)	Spec. (%)	p-value
U-Net ³²	76.97 ± 0.75(76.1–77.8)	63.13 ± 0.66(62.4–63.9)	75.21 ± 0.70(74.4–76.0)	99.10 ± 0.30(98.7–99.4)	0.0006
UTNet ³³	77.96 ± 0.70(77.2–78.8)	63.85 ± 0.66(63.1–64.6)	78.11 ± 0.70(77.3–78.9)	97.09 ± 0.48(96.5–97.6)	0.0031
LinkNet ³⁴	80.85 ± 0.66(80.1–81.6)	67.69 ± 0.75(66.9–68.5)	82.21 ± 0.71(81.4–83.0)	98.11 ± 0.40(97.6–98.5)	0.0016
TransUNet ³⁵	82.00 ± 0.80(81.2–82.8)	69.08 ± 0.80(68.2–69.9)	81.89 ± 0.75(81.0–82.7)	97.94 ± 0.40(97.5–98.4)	0.0054
D-LinkNet ³⁶	82.00 ± 0.80(81.1–82.9)	69.47 ± 0.80(68.5–70.3)	81.71 ± 0.71(80.9–82.5)	98.09 ± 0.40(97.6–98.5)	0.0041
Axial-DeepLab ³⁷	81.97 ± 0.75(81.1–82.8)	68.81 ± 0.71(68.0–69.6)	81.12 ± 0.71(80.2–82.0)	97.89 ± 0.40(97.4–98.3)	0.0015
ATFE-Net ³⁸	82.42 ± 0.71(81.6–83.2)	69.84 ± 0.71(69.0–70.6)	82.81 ± 0.71(82.0–83.6)	98.28 ± 0.35(97.8–98.6)	0.0053
FAME (Our)	89.72 ± 0.53(89.1–90.3)	84.81 ± 0.57(84.1–85.4)	91.11 ± 0.53(90.5–91.7)	98.13 ± 0.35(97.7–98.5)

Results are reported as mean ± SD (95% CI). P-values were computed using paired t-tests comparing baselines with FAME.

Table 3.

Performance comparison of segmentation methods on the BUSC dataset.

Methods	Dice (%)	IoU (%)	Sen (%)	Spec (%)	p-value
U-Net ³²	73.78 ± 0.66(73.0–74.5)	82.68 ± 0.66(81.9–83.4)	72.95 ± 0.75(72.1–73.8)	97.88 ± 0.49(97.3–98.4)	0.00008
UNet++ ³⁹	67.71 ± 0.71(66.9–68.5)	77.48 ± 0.71(76.7–78.3)	69.12 ± 0.80(68.2–70.0)	98.14 ± 0.49(97.5–98.6)	0.00003
FATNet ⁴⁰	71.86 ± 0.66(71.1–72.6)	81.42 ± 0.66(80.6–82.1)	71.20 ± 0.71(70.4–72.0)	98.54 ± 0.44(97.9–98.9)	0.00006
MSU-UNet ⁴¹	75.97 ± 0.66(75.2–76.7)	84.44 ± 0.71(83.6–85.2)	74.85 ± 0.75(74.0–75.7)	98.66 ± 0.40(98.1–99.0)	0.00017
Axial-DeepLab ³⁷	74.42 ± 0.62(73.7–75.1)	83.65 ± 0.66(82.9–84.4)	73.62 ± 0.71(72.8–74.4)	99.25 ± 0.39(98.8–99.5)	0.00014
PDF-UNet ³¹	76.65 ± 0.66(75.9–77.4)	85.10 ± 0.71(84.3–85.9)	75.34 ± 0.71(74.5–76.1)	99.23 ± 0.35(98.7–99.5)	0.0016
FAME (Our)	93.09 ± 0.49(92.5–93.6)	87.13 ± 0.66(86.4–87.9)	87.41 ± 0.71(86.7–88.1)	99.59 ± 0.27(99.2–99.8)

Results are reported as mean ± SD (95% CI). P-values were computed using paired t-tests comparing baselines with FAME.

Table 4.

Performance comparison of segmentation methods on the UDIAT dataset.

Methods	Dice (%)	IoU (%)	Sen (%)	Spec (%)	p-value
U-Net ³²	80.91 ± 0.71(80.1–81.7)	67.99 ± 0.66(67.2–68.8)	80.93 ± 0.71(80.1–81.7)	98.98 ± 0.40(98.5–99.4)	0.0031
UTNet ³³	81.10 ± 0.71(80.3–81.9)	66.89 ± 0.71(66.1–67.6)	78.94 ± 0.71(78.2–79.8)	98.95 ± 0.44(98.5–99.3)	0.0009
LinkNet³⁴	84.89 ± 0.66(84.1–85.6)	74.01 ± 0.71(73.2–74.8)	89.08 ± 0.75(88.3–89.9)	98.95 ± 0.31(98.6–99.3)	0.0188
TransUNet ³⁵	84.11 ± 0.66(83.3–84.9)	73.11 ± 0.71(72.3–73.9)	88.21 ± 0.71(87.4–89.0)	98.96 ± 0.44(98.6–99.4)	0.0222
Axial-DeepLab ³⁷	86.09 ± 0.70(85.3–86.9)	74.00 ± 0.71(73.2–74.8)	89.09 ± 0.71(88.3–89.9)	98.99 ± 0.40(98.6–99.4)	0.0175
D-LinkNet ³⁶	85.01 ± 0.71(84.2–85.8)	73.51 ± 0.62(72.7–74.3)	87.01 ± 0.71(86.2–87.8)	98.97 ± 0.44(98.6–99.4)	0.0125
ATFE-Net ³⁸	86.81 ± 0.66(86.0–87.6)	75.76 ± 0.71(74.9–76.6)	88.29 ± 0.66(87.5–89.0)	99.19 ± 0.31(98.8–99.5)	n.s.
FAME (Our)	87.98 ± 0.57(87.3–88.6)	78.16 ± 0.66(77.4–78.9)	87.41 ± 0.62(86.7–88.1)	99.61 ± 0.22(99.3–99.8)

Results are reported as mean ± SD (95% CI). P-values were computed using paired t-tests comparing baselines with FAME (n.s. = not significant).

To benchmark FAME, we compared it against widely used segmentation and classification baselines. For segmentation, U-Net, a canonical encoder–decoder with skip connections, and UNet++, which introduces nested dense skip pathways to reduce semantic gaps, were included. LinkNet was employed as a lightweight residual encoder–decoder optimized for efficiency, while TransUNet combines convolutional encoders with Transformer blocks to capture both local and global dependencies. D-LinkNet leverages dilated convolutions and residual links for multiscale receptive fields, and Axial-DeepLab applies axial attention to efficiently model long-range spatial dependencies. ATFE-Net and FATNet incorporate channel–spatial attention and feature aggregation modules, respectively, to enhance tumor feature extraction, whereas MSU-UNet extends U-Net to address lesions of varying size and morphology. PDF-UNet employs progressive dense feature fusion to improve boundary localization.

Figure 5 shows qualitative results on UDIAT test images with diverse lesion characteristics, including small tumors, blurry boundaries, and variable echotexture. Predictions from LinkNet, UTNet, D-LinkNet, and FAME are compared against ground truth masks. Figure 6 presents qualitative segmentation results on BUSC test images using U-Net, UNet++, FATNet, MSU-UNet, DeepLab, PDF-UNet, and FAME. BUSC samples are shown to highlight noisy, low-contrast conditions, and challenging lesion shapes. Predicted masks and ground truth annotations are displayed for direct comparison.

Figure 5.

Qualitative comparison of lesion segmentation results on the UDIAT dataset. Each row shows an original ultrasound image, its ground truth (GT) mask, and the predicted masks from UTNet, LinkNet, TransUNet, D-LinkNet, Axial-DeepLab, ATFE-Net, and our proposed model (FAME), followed by FAME's overlay on the GT. Blue regions indicate predicted tumor areas.

Figure 6.

Visual segmentation comparison on the BUSC dataset. Each row displays an original breast ultrasound image, its corresponding ground truth (GT) mask, predictions from U-Net, UNet++, FATNet, MSU-UNet, DeepLab, PDF-UNet, and the proposed model (EDRNet/FAME), followed by an overlay of our model's prediction on the GT. Blue regions indicate predicted lesion masks.

Classification

Classification performance was evaluated using ACC, AUC, and F1-score. Accuracy measures the proportion of correctly predicted samples, while AUC assesses discriminative ability using macro-averaging, and F1-score balances precision and recall. A confusion matrix was also constructed to illustrate prediction distribution across classes. Using the dual-stage ensemble classifier, FAME was tested on the BUSI, UDIAT, and BUSC datasets and compared against baseline architectures. Table 5 summarizes the results. On BUSI, FAME achieved 98.70 ± 0.27% ACC, 96.82 ± 0.53% F1-score, and 0.978 AUC, outperforming GAN + CNN, MTL-COSA, and SaTransformer. On UDIAT, it achieved 98.14 ± 0.31% ACC, 94.04 ± 0.75% F1-score, and 0.960 AUC, surpassing FMRNet, RMTL-Net, and HoVer-Trans. On BUSC, it obtained 96.92 ± 0.27% ACC, 90.32 ± 0.80% F1-score, and 0.950 AUC, outperforming MDA-Net and other approaches. For classification, GAN + CNN combines synthetic augmentation with convolutional classification, while CNN-based image fusion integrates multistream CNN features for improved discriminative power. MTL-COSA adopts a multitask learning strategy with cross-task attention, and FMRNet applies a residual CNN tailored to US to mitigate noise sensitivity. Transformer-based models were also included: BUViTNet adapts vision transformers with patch-based tokenization, SaTransformer leverages self-attention for robust classification, and HoVer-Trans captures both local and global structure. RMTL-Net employs residual multitask optimization, and MDA-Net introduces multiscale discriminative attention for adaptive feature recalibration.

Table 5.

Quantitative comparison of the classification performance of the proposed ensemble DL method with SOTA models on the BUSI, UDIAT, and BUSC datasets.

Dataset	Method	Accuracy%	F1-score%	AUC score	p-value
BUSI	GAN + CNN ⁹	94.00 ± 0.40(93.4–94.6)	-	-	0.0007
	CNN-based Image Fusion ¹⁸	90.77 ± 0.49(90.1–91.4)	82.86 ± 0.49(82.0–83.7)	0.949	0.0015
	ML-based Radiomics Approach ²⁶	97.40 ± 0.44(96.9–97.9)	95.90 ± 0.53(95.3–96.5)	0.970	0.0210
	MTL-COSA ²³	91.48 ± 0.57(90.8–92.1)	93.59 ± 0.62(92.9–94.3)	0.970	0.0032
	FMRNet ⁴²	95.40 ± 0.44(94.9–95.9)	-	0.976	0.0128
	BUViTNe ²⁵	95.20 ± 0.44(94.7–95.7)	96.60 ± 0.53(96.0–97.2)	0.968	0.0185
	SaTransformer ²⁴	97.97 ± 0.40(97.5–98.4)	-	-	0.0272
	RMTL-Net ⁴³	91.02 ± 0.62(90.3–91.7)	93.32 ± 0.62(92.6–94.0)	0.967	0.0047
	FMAE (Our)	98.70 ± 0.27(98.4–99.0)	96.82 ± 0.53(96.2–97.4)	0.978
UDIAT	GAN + CNN ⁹	92.00 ± 0.71(91.2–92.8)	-	-	0.0023
	BIRADS-SSDL ⁴⁴	92.00 ± 062(91.3–92.7)	-	0.890	0.0011
	FMRNet ⁴²	94.50 ± 0.62(93.8–95.2)	91.90 ± 0.71(91.1–92.7)	0.957	0.0174
	MTL-COSA ²³	87.08 ± 0.75(86.2–87.9)	90.41 ± 0.71(89.6–91.2)	0.930	0.0009
	RMTL-Net ⁴³	91.44 ± 0.71(90.6–92.2)	93.85 ± 0.66(93.1–94.6)	0.946	0.0028
	HoVer-Trans ⁴⁵	77.40 ± 0.84(76.4–78.3)	61.90 ± 0.88(60.9–62.9)	0.781	0.00006
	FAME (our)	98.14 ± 0.31(97.8–98.5)	94.04 ± 0.75(93.2–94.9)	0.960
BUSC	MDA-Net ⁴⁶	90.41 ± 0.71(89.6–91.2)	-	0.928	0.0004
BUSC	FAME (Our)	96.92 ± 0.27(96.6–97.2)	90.32 ± 0.80(89.4–91.2)	0.950

Results are reported as mean ± SD (95% CI). P-values were computed using paired t-tests comparing baselines with FAME.

Confusion matrices were generated for BUSI, UDIAT, and BUSC to evaluate classification performance in Figures 7, 8, 9. Each matrix summarizes true positives, true negatives, false positives, and false negatives, providing a detailed view of the model's predictions against ground truth labels. Across all datasets, the proposed ensemble classifier showed minimal misclassification, supporting the reported ACC, F1-score, and AUC results. To further assess classification performance, ROC curves were plotted for the BUSI, UDIAT, and BUSC datasets, shown in Figure 10. Across all datasets, the proposed FAME framework achieved consistently higher true positive rates at varying thresholds compared to baseline methods. Specifically, FAME obtained AUC scores of 0.978 on BUSI, 0.960 on UDIAT, and 0.950 on BUSC, outperforming competing approaches such as FMRNet, MTL-COSA, and MDA-Net. These results illustrate FAME's superior ability to distinguish between benign and malignant lesions and confirm the quantitative findings reported in Table 5.

Figure 7.

Confusion matrix for the proposed FAME framework on the BUSI dataset, illustrating classification performance between normal, benign, and malignant cases.

Figure 8.

Confusion matrix for the proposed FAME framework on the UDIAT dataset, illustrating classification performance between benign and malignant cases.

Figure 9.

Confusion matrix for the proposed FAME framework on the BUSC dataset, illustrating classification performance between benign and malignant cases.

Figure 10.

Receiver operating characteristic (ROC) curves of FAME and baseline models across three datasets. (A) BUSI dataset, (B) UDIAT dataset, and (C) BUSC dataset.

To mitigate overfitting, several regularization and generalization strategies were employed. These include three-fold cross-validation, data augmentation (e.g., rotation, flipping, zooming), and the use of Dropout layers (rate = 0.3) in classification heads. L2 weight regularization (1e-5) was applied across all models. Early stopping based on validation loss with a patience of 10 epochs was also used to halt training when no improvement was observed. Furthermore, the use of ACGAN-based synthetic data improved class balance and introduced intraclass variation, further reducing the risk of overfitting on minority categories. These techniques collectively ensured that the model remained generalizable across test sets and avoided performance degradation due to overfitting. To further validate the superiority of our FAME model, we conducted paired t-tests comparing its classification performance against several widely used SOTA models. The comparisons were performed on all three benchmark datasets using ACC and AUC values from 3-fold cross-validation. The results are presented in Table 6.

Table 6.

P-values from paired t-tests comparing the FAME model with publicly available open-source SOTA models on the BUSI, UDIAT, and BUSC datasets.

Model	BUSI	UDIAT	BUSC
U-Net ³²	4.50e−03	3.10e−02	2.80e−02
UNet++ ³⁹	3.80e−03	1.70e−02	2.20e−02
FMRNet ⁴²	1.50e−02	2.00e−03	–
MTL-COSA ²³	6.80e−03	1.10e−02	–

All statistical tests were based on classification Accuracy and AUC, computed using 3-fold cross-validation. P < 0.05 indicates a statistically significant improvement by FAME over the compared model.

Privacy-preserving evaluation: DP and SA

To assess the trade-off between privacy and performance in our federated setup, we evaluated the effect of integrating two privacy-preserving mechanisms: DP and SA. These experiments were conducted on the BUSI dataset under a 3-client federated simulation, using the proposed MAU-Net for segmentation and a dual-stage ensemble classifier for diagnosis.

Differential privacy results

Differential Privacy was applied by adding calibrated Gaussian noise to the local model updates before transmission. The amount of noise was controlled by the privacy budget $ϵ$ , with lower $ϵ$ values offering stronger privacy guarantees. These results demonstrate that incorporating DP leads to a controlled degradation in both segmentation and classification performance as $ϵ$ decreases. However, even under strong privacy settings (e.g., $ϵ = 1$ ), the model maintains clinically viable performance (Dice > 0.80 and AUC > 0.91), indicating the effectiveness of our attention-augmented MAU-Net and feature fusion classifier in learning robust representations under privacy constraints, as shown in Table 7.

Table 7.

Impact of varying differential privacy budgets $(ϵ)$ on segmentation and classification performance in the federated setting.

$ϵ$	Dice Score	Accuracy	AUC	F1-score
∞ (No DP)	0.865 ± 0.011	0.938 ± 0.009	0.964	0.926
5.0	0.852 ± 0.013	0.922 ± 0.010	0.951	0.914
2.0	0.833 ± 0.017	0.901 ± 0.012	0.936	0.897
1.0	0.808 ± 0.021	0.879 ± 0.016	0.915	0.874

Lower values of $ϵ$ offer stronger privacy but introduce performance degradation. Results are reported as mean ± standard deviation over three independent runs. Despite reduced accuracy, the model maintains clinically reliable performance under strict privacy constraints.

Secure aggregation results

We also evaluated SA, where local model updates were masked using random additive noise before server aggregation, ensuring that no individual update was accessible in plaintext. Secure Aggregation was implemented without adding explicit DP noise, preserving model fidelity. Secure Aggregation incurred a negligible impact on performance, demonstrating that communication encryption can be integrated seamlessly without compromising ACC, as shown in Table 8. This supports the deployment of our model in real-world federated hospital networks where privacy regulations require encrypted exchanges of model parameters.

Table 8.

Comparison of model performance with and without secure aggregation (SA) in the federated training process.

Method	Dice score	Accuracy	AUC	F1-score
Federated (no SA/DP)	0.865	0.938	0.964	0.926
Federated + SA	0.862	0.935	0.960	0.922

The integration of SA ensures encrypted model communication without degrading segmentation or classification accuracy, demonstrating its effectiveness in privacy-preserving medical imaging applications.

Although FAME achieved strong overall performance, some failure cases were observed. These included lesions with poorly defined or low-contrast margins, very small tumor regions, and cases with heavy acoustic shadowing, where boundaries were either undersegmented or slightly overextended. Representative examples of such cases are shown in Figure 11, providing insight into the limitations of the current framework.

Figure 11.

Representative failure cases of the FAME segmentation model on breast ultrasound images.

Ablation study

An ablation study was conducted on BUSI, UDIAT, and BUSC to evaluate the contribution of key components in FAME, including ACGAN-based augmentation, channel attention, spatial attention, the ensemble fusion strategy, and FTL. Components were incrementally removed, and the impact on Dice, IoU, and ACC was measured, as shown in Figure 12. Across all datasets, the complete FAME model achieved the best performance. Removal of ACGAN, attention modules, or the ensemble strategy reduced segmentation ACC, with the largest drop observed when both channel and spatial attention were excluded. Replacing FTL with centralized training also led to decreased Dice and IoU. Figure 13 presents qualitative ablation results, illustrating segmentation differences when key components are omitted.

Figure 12.

Ablation study showing the impact of removing key components (Auxiliary Classifier GAN [ACGAN], attention mechanisms, ensemble strategy, and FTL) from the proposed models across three datasets. FAME consistently shows degraded performance when any module is removed, with the full model achieving the highest Dice Score, Accuracy, and IoU on BUSI, UDIAT, and BUSC datasets.

Figure 13.

Visual comparison of segmentation performance across ablated versions of the proposed model on the BUSI dataset.

Figure 13 shows qualitative ablation results on BUSI, UDIAT, and BUSC. Excluding ACGAN led to incomplete or fragmented lesion masks. Removing channel or spatial attention individually reduced focus on lesion regions, while omitting both attentions caused notable boundary disruption. Eliminating the ensemble strategy resulted in unstable predictions, and centralized training without FTL produced inconsistent outputs across datasets. In contrast, the complete FAME model generated coherent and precise segmentations aligned with ground truth masks.

Computational efficiency analysis

Computational efficiency was evaluated across ablation settings using training time per epoch, memory usage, inference time per image, and FLOPs, as shown in Table 9. The complete FAME framework, including FTL, ACGAN, MAU-Net, NASNetLarge, and ResNet50V2, required 380 s per epoch, 12.5 GB memory, and 72.5 GFLOPs, while achieving the highest segmentation and classification performance. Removing FTL reduced training time by ∼22% due to the elimination of aggregation. Excluding ACGAN lowered memory and FLOPs but decreased generalization. The absence of MAU-Net improved inference speed but reduced Dice and IoU. Removing NASNetLarge or ResNet50V2 reduced FLOPs but weakened classification ACC. The most efficient configuration (without FTL and ACGAN) achieved the lowest computational cost but also the largest performance decline.

Table 9.

Computational efficiency analysis for different model configurations.

Configuration	Training time per epoch (s)	Memory usage (GB)	Inference time per image (ms)	FLOPs (GFLOPs)
Full Framework (FTL + ACGAN + MAU-Net + NASNetLarge + ResNet50V2)	380	12.5	45.8	72.5
Without FTL (ACGAN + MAU-Net + NASNetLarge + ResNet50V2)	295	11.8	39.6	68.2
Without ACGAN (FTL + MAU-Net + NASNetLarge + ResNet50V2)	310	10.4	40.2	65.5
Without MAU-Net (FTL + ACGAN + NASNetLarge + ResNet50V2)	325	9.8	38.5	59.7
Without NASNetLarge (FTL + ACGAN + MAU-Net + ResNet50V2)	335	10.2	39.0	61.3
Without ResNet50V2 (FTL + ACGAN + MAU-Net + NASNetLarge)	320	9.9	38.2	60.8
Without FTL & ACGAN (MAU-Net + NASNetLarge + ResNet50V2)	250	9.0	35.8	55.4
Without FTL & MAU-Net (ACGAN + NASNetLarge + ResNet50V2)	270	8.7	36.5	53.9
Without FTL & NASNetLarge (ACGAN + MAU-Net + ResNet50V2)	265	8.9	36.8	54.7

Discussion

This study presents a comprehensive and privacy-preserving federated DL model for BUS image segmentation and classification. The framework was evaluated across three benchmark datasets, BUSI, UDIAT, and BUSC, which vary in resolution, imaging devices, noise levels, and lesion complexity. The inclusion of FTL enables collaborative learning across institutions while preserving data locality and complying with privacy regulations.^10,27 To further enhance privacy, the model integrates DP, which injects Gaussian noise into shared updates, and SA, which prevents interclient data exposure during aggregation. Robustness to US image variability was evaluated by training and testing across these diverse datasets, each exhibiting domain shifts and acquisition artifacts. The performance gains achieved by FAME reflect several architectural and strategic advances over existing methods in US image analysis. Unlike conventional U-Net variants or attention-guided networks that often struggle with low-contrast and artifact-heavy US data,³² the proposed MAU-Net, with embedded channel and spatial attention, improved robustness by selectively emphasizing lesion-relevant features and suppressing background noise. The use of a local ensemble strategy at each federated client, unlike global ensemble aggregation seen in prior FL approaches, helped mitigate performance variance due to non-IID data distributions, a challenge highlighted in earlier federated studies but rarely addressed through node-specific architectural enhancements.

Prior works on BUS segmentation have primarily relied on the U-Net and its derivatives. U-Net and UNet++ have shown reasonable performance in delineating lesions but often struggle with irregular boundaries and noisy textures, leading to Dice scores in the 70–80% range.^32,39 Attention-based models such as TransUNet and ATFE-Net improved boundary localization by emphasizing lesion-relevant regions, yet their performance decreases on heterogeneous datasets such as BUSC and UDIAT. In contrast, FAME integrates dual attention (channel and spatial) with ensemble learning to achieve significantly higher Dice up to 93.09 ± 0.49% on BUSC and 87.98 ± 0.57% on UDIAT and IoU scores. This demonstrates that our design more effectively handles low-contrast, artifact-heavy US data compared to prior single-model or attention-only approaches. Breast US classification has been addressed by CNN-based models such as GAN + CNN, MTL-COSA, and SaTransformer, which achieved accuracies in the range of 90–95% but often suffered from class imbalance, leading to reduced sensitivity for malignant cases.⁴² Recent transformer-based methods, such as HoVer-Trans, have enhanced feature representation but remain dependent on large, balanced training datasets. Our FAME classifier, by contrast, incorporates ACGAN-based augmentation within a dual-stage ensemble pipeline, addressing class imbalance locally at each federated client. This yields consistent improvements, achieving 98.70 ± 0.27% ACC and 98.70 ± 0.27%, F1-score of 96.82 ± 0.53% F1-score on BUSI, and 98.14 ± 0.31% ACC on UDIAT, surpassing all compared baselines. These results highlight the benefit of combining feature diversity with synthetic augmentation for more reliable malignant case detection.

Federated learning has been applied to medical imaging with strategies such as FedAvg and FedProx, which average local models across institutions to preserve privacy. While effective for reducing data-sharing risks, these approaches often degrade under non-IID conditions and do not incorporate data augmentation or attention-based enhancements.^10,27 Recent works have begun exploring multimodal federated pipelines, but rarely consider class imbalance or computational feasibility. Our FAME framework advances the field by integrating FTL, dual-attention segmentation, and ACGAN-based augmentation into a single privacy-preserving model. Unlike standard FL methods, FAME achieves SOTA performance across three diverse US datasets while maintaining strong privacy guarantees. This positions FAME as a practical step forward in bridging the gap between FL research and real-world clinical deployment.

Ablation studies (Figures 12–13) confirmed the contribution of each architectural component. The exclusion of ACGAN reduced Dice and IoU, emphasizing the role of synthetic augmentation under limited data conditions. Removing channel or spatial attention impaired focus on lesion boundaries, and the largest degradation occurred when both attentions were removed. Eliminating the ensemble strategy weakened segmentation stability, particularly on heterogeneous datasets such as BUSC. Substituting centralized training for FTL reduced cross-domain consistency and weakened generalization, underscoring the necessity of privacy-preserving distributed optimization. These findings reinforce that each module's attention, augmentation, ensembling, and FTL contribute critically to overall performance. Computational efficiency analysis (Table 9) provided further insight into deployment trade-offs. While the complete FAME model required higher training time, memory usage, and FLOPs, it consistently achieved superior ACC and generalization. Configurations that removed FTL or ACGAN were more efficient but suffered performance drops, demonstrating that modest computational overhead is justified by significant diagnostic gains. These results suggest that FAME is feasible for real-world use, as inference can be executed on midrange GPU-enabled hospital servers or standard workstations, making it suitable for both high-resource and resource-constrained clinical environments.

From a deployment perspective, the federated design reduces the need for centralized data transfer by transmitting only encrypted model updates, lowering both bandwidth and storage requirements. Because FAME is implemented in widely used DL frameworks (TensorFlow/PyTorch), it can be integrated into existing PACS/RIS infrastructure with minimal modification. These properties increase its potential for seamless adoption in diverse healthcare systems. While the results across BUSI, BUSC, and UDIAT datasets confirm the technical promise of FAME, real-world adoption will require validation in larger, prospective, multi-institutional clinical trials. As such, the present findings should be interpreted as preliminary evidence of feasibility and robustness rather than direct clinical readiness.

Limitations and future work

This study has several limitations that warrant consideration. First, the computational overhead of the ensemble-based classification pipeline may constrain deployment in low-resource clinical environments. Although inference can be run on standard GPU-enabled workstations, future work should focus on lightweight model compression and knowledge distillation techniques to facilitate broader adoption in constrained settings. Second, while ACGAN-based augmentation improved class balance and data diversity, expert validation of synthetic image realism remains essential to strengthen clinical interpretability and trustworthiness. Moreover, representative failure cases revealed challenges in segmenting lesions with diffuse boundaries, heterogeneous textures, or severe acoustic artifacts. Addressing these limitations will require integration of additional modalities such as elastography and Doppler US, as well as the development of uncertainty-aware training strategies to better handle ambiguous or low-quality inputs.

Future directions will also emphasize large-scale validation across geographically distributed clinical environments. Deploying FAME using asynchronous federated protocols will allow adaptation to real-world conditions, including communication delays and variable client availability. To further improve personalization, we plan to explore institution-specific model adaptation strategies that account for differences in imaging protocols, scanner hardware, and patient demographics. Another important direction will be extending the framework to support DICOM-formatted US data, enabling seamless integration with hospital PACS systems. This will require metadata-aware preprocessing and the incorporation of acquisition-specific information into the model pipeline. Finally, improvements in preprocessing through adaptive contrast enhancement methods, such as CLAHE, may further enhance lesion visibility and feature extraction, thereby improving segmentation robustness in challenging, low-contrast US cases.

Conclusion

This study presents a comprehensive federated DL model tailored for privacy-preserving and data-efficient diagnosis of BUS images. By strategically combining FTL with dual attention-guided segmentation, class-aware synthetic data generation, and a two-tier classification ensemble, the proposed approach enables high-fidelity diagnostic modeling in decentralized clinical settings. The proposed method not only addresses practical constraints such as data scarcity, privacy regulation, and institutional variability but also advances technical capabilities in multi-source learning through model fusion, generative augmentation, and DP safeguards. The integration of SA and DP establishes a strong foundation for deploying collaborative AI solutions within regulated healthcare infrastructures. This work lays the groundwork for scalable, trustworthy, and interpretable AI systems in medical imaging and opens future directions for real-world deployment across distributed hospital networks, personalized model refinement, and cross-modality generalization. While certain limitations remain, such as the reliance on simulated federated environments and the absence of DICOM support, these are acknowledged and discussed in detail. Looking forward, this work lays the foundation for real-world deployment by enabling collaborative learning across distributed clinical sites, supporting future extensions toward asynchronous protocols, personalized model refinement, and clinical-grade integration with metadata-rich imaging modalities.

Footnotes

Acknowledgements

The author would like to thank Prince Sultan University for their valuable support. Beijing Natural Science Foundation (4232018), the National Key Research and Development Program of China (2022YFB3103104), Major Research Plan of National Natural Science Foundation of China (92167102), the National Natural Science Foundation of China (62271456), and the R&D Program of Beijing Municipal Education Commission (KM202210005026). It was also supported by the Engineering Research Centre of Intelligent Perception and Autonomous Control, Ministry of Education, China.

ORCID iDs

Abdul Raheem

Zhen Yang

Ala Saleh Alluhaidan

Malik Abdul Manan

Shahzad Ahmed

Fahad Sabah

Sadique Ahmad

Ethical considerations

We thank the reviewer for pointing this out. As the study exclusively utilized publicly available, fully anonymized datasets (BUSI, UDIAT, and BUSC), no experiments involving human or animal subjects were conducted. Therefore, institutional review board (IRB) approval was not required. We have now added a dedicated Ethical Considerations statement at the end of the manuscript to clarify this point.

Contributorship

Abdul Raheem: Conceptualization, writing—original draft, methodology, and writing—review & editing. Zhen Yang and Ala Saleh Alluhaidan: Supervision, project administration, writing—review & editing, and funding acquisition. Malik Abdul Manan: Methodology, validation, writing—original draft, and formal analysis. Shahzad Ahmed: Software, conceptualization, and visualization. Sadique Ahmad: Writing—review & editing, methodology, and formal analysis. Fahad Sabah: Investigation, methodology, and visualization.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was funded by Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2025R234), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

Siegel

Giaquinto

Jemal

. Cancer statistics, 2024. CA Cancer J Clin 2024; 74: 12–49.

Mahmood

Pei

, et al. A brief survey on breast cancer diagnostic with deep learning schemes using multi-image modalities. IEEE Access 2020; 8: 165779–165809.

Vocaturo

Zumpano

. Artificial Intelligence approaches on Ultrasound for Breast Cancer Diagnosis. 2021, pp. 3116–3121.

Saber

Sakr

Abo-Seida

, et al. A novel deep-learning model for automatic detection and classification of breast cancer using the transfer-learning technique. IEEE Access 2021; 9: 71194–71209.

Masud

Eldin Rashed

Hossain

. Convolutional neural network-based models for diagnosis of breast cancer. Neural Comput Appl 2022; 34: 11383–11394.

Murtaza

Shuib

Abdul Wahab

, et al. Deep learning-based breast cancer classification through medical imaging modalities: state of the art and research challenges. Artif Intell Rev 2020; 53: 1655–1720.

Badawy

Mohamed

AE-NA

Hefnawy

, et al. Classification of Breast Ultrasound Images Based on Convolutional Neural Networks - A Comparative Study. In: 2021 International Telecommunications Conference (ITC-Egypt). 2021, pp.1–8.

Islam

Rahman

Ali

, et al. Enhancing breast cancer segmentation and classification: an ensemble deep convolutional neural network and U-net approach on ultrasound images. Mach Learn Appl 2024; 16: 100555.

Al-Dhabyani

Fahmy

Gomaa

, et al. Deep learning approaches for data augmentation and classification of breast masses using ultrasound images. Int J Adv Comput Sci Appl 2019; 10: 618–627.

10.

Wang

, et al. A review of secure federated learning: privacy leakage threats, protection technologies, challenges and future directions. Neurocomputing 2023; 561: 126897.

11.

Oza

, et al. Digital mammography dataset for breast cancer diagnosis research (DMID) with breast mass segmentation analysis. Biomed Eng Lett 2024; 14: 317–330.

12.

Thakur

Kumar

. A systematic review of machine and deep learning techniques for the identification and classification of breast cancer through medical image modalities. Multimed Tools Appl 2023; 83: 35849–35942. DOI: 10.1007/s11042-023-16634-w

13.

Yap

Pons

Martí

, et al. Automated breast ultrasound lesions detection using convolutional neural networks. IEEE J Biomed Health Inform 2018; 22: 1218–1226.

14.

Byra

Galperin

Ojeda-Fournier

, et al. Breast mass classification in sonography with transfer learning using a deep convolutional neural network and color conversion. Med Phys 2019; 46: 746–755.

15.

Liang

, et al. Simultaneous segmentation and classification of breast lesions from ultrasound images using Mask R-CNN. In: 2019 IEEE International Ultrasonics Symposium (IUS). 2019, pp.1470–1472.

16.

Munsell

Zhan

, et al. Patch-Based Techniques in Medical Imaging. Cham: Springer International Publishing. 2016. Epub ahead of print 2017. DOI: https://doi.org/10.1007/978-3-319-67434-6

17.

Wang

Choi

, et al. Breast cancer classification in automated breast ultrasound using multiview convolutional neural network with transfer learning. Ultrasound Med Biol 2020; 46: 1119–1132.

18.

Moon

Lee

Y-W

H-H

, et al. Computer-aided diagnosis of breast ultrasound images using ensemble learning from convolutional neural networks. Comput Methods Programs Biomed 2020; 190: 105361.

19.

Dong

She

Cui

, et al. One step further into the blackbox: a pilot study of how to build more confidence around an AI-based decision system of breast nodule assessment in 2D ultrasound. Eur Radiol 2021; 31: 4991–5000.

20.

Shia

W-C

Chen

D-R

. Classification of malignant tumors in breast ultrasound using a pretrained deep residual network model and support vector machine. Comput Med Imaging Graph 2021; 87: 101829.

21.

Jin

Peng

, et al. A novel medical image segmentation approach by using multi-branch segmentation network based on local and global information synchronous learning. Sci Rep 2023; 13: 6762.

22.

Zhuang

Yang

Raj

ANJ

, et al. Breast ultrasound tumor image classification using image decomposition and fusion based on adaptive multi-model spatial feature fusion. Comput Methods Programs Biomed 2021; 208: 106221.

23.

Huang

. Multi-Task Learning with Context-Oriented Self-Attention for Breast Ultrasound Image Classification and Segmentation. In: 2022 IEEE 19th International Symposium on Biomedical Imaging (ISBI). 2022, pp.1–5.

24.

Zhang

Liu

, et al. Satransformer: semantic-aware transformer for breast cancer classification and segmentation. IET Image Process 2023; 17: 3789–3800.

25.

Ayana

Choe

. BUViTNet: breast ultrasound detection via vision transformers. Diagnostics 2022; 12: 2654.

26.

Mishra

Roy

Bandyopadhyay

, et al. Breast ultrasound tumour classification: a machine learning—radiomics based approach. Expert Syst 2021; 38: e12713.

27.

Zhu

Bashir

, et al. Privacy-preserving federated learning of remote sensing image classification with dishonest majority. IEEE J Sel Top Appl Earth Obs Remote Sens 2023; 16: 4685–4698.

28.

Al-Dhabyani

Gomaa

Khaled

, et al. Dataset of breast ultrasound images. Data Brief 2020; 28: 104863.

29.

Byra

Jarosik

Szubert

, et al. Breast mass segmentation in ultrasound with selective kernel U-net convolutional neural network. Biomed Signal Process Control 2020; 61: 102027.

30.

Rodrigues

. Breast ultrasound image. Mendeley Data 2017; 1: 10.17632.

31.

Iqbal

Sharif

. PDF-UNet: a semi-supervised method for segmentation of breast tumor images using a U-shaped pyramid-dilated network. Expert Syst Appl 2023; 221: 119718.

32.

Ronneberger

Fischer

Brox

. U-net: Convolutional networks for biomedical image segmentation. In: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics). 2015, pp.234–41.

33.

Gao

Zhou

Metaxas

. UTNet: A Hybrid Transformer Architecture for Medical Image Segmentation. In: International conference on medical image computing and computer-assisted intervention. 2021, pp. 61–71.

34.

Chaurasia

Culurciello

. LinkNet: Exploiting encoder representations for efficient semantic segmentation. In: 2017 IEEE Visual Communications and Image Processing, VCIP 2017. 2017. Epub ahead of print 2017. DOI: 10.1109/VCIP.2017.8305148.

35.

Chen

, et al. TransUNet: Transformers Make Strong Encoders for Medical Image Segmentation. Epub ahead of print 2021. DOI: 10.48550/arXiv.2102.04306.

36.

Zhou

Zhang

. D-linknet: Linknet with pretrained encoder and dilated convolution for high resolution satellite imagery road extraction. In: IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops. 2018, pp.182–186.

37.

Wang

Zhu

Green

, et al. Axial-DeepLab: Stand-Alone Axial-Attention for Panoptic Segmentation. In: European conference on computer vision. 2020, pp.108–126.

38.

, et al. ATFE-Net: axial transformer and feature enhancement-based CNN for ultrasound breast mass segmentation. Comput Biol Med 2023; 153: 106533.

39.

Zhou

Rahman Siddiquee

Tajbakhsh

, et al. Unet++: A nested u-net architecture for medical image segmentation. In: International workshop on deep learning in medical image analysis. 2018, pp. 3–11.

40.

Chen

, et al. FAT-Net: feature adaptive transformers for automated skin lesion segmentation. Med Image Anal 2022; 76: 102327.

41.

Zhang

Liu

, et al. MSU-Net: multi-scale U-net for 2D medical image segmentation. Front Genet 2021; 12: 639930.

42.

Cui

Peng

Yuan

, et al. FMRNet: a fused network of multiple tumoral regions for breast tumor classification with ultrasound images. Med Phys 2022; 49: 144–157.

43.

Huang

. A regional-attentive multi-task learning framework for breast ultrasound image segmentation and classification. IEEE Access 2023; 11: 5377–5392.

44.

Zhang

Seiler

Chen

, et al. BIRADS features-oriented semi-supervised deep learning for breast ultrasound computer-aided diagnosis. Phys Med Biol 2020; 65: 125005.

45.

Han

Liu

, et al. HoVer-Trans: anatomy-aware HoVer-transformer for ROI-free breast cancer diagnosis in ultrasound images. IEEE Trans Med Imaging 2023; 42: 1696–1706.

46.

Iqbal

Sharif

. MDA-Net: multiscale dual attention-based network for breast lesion segmentation using ultrasound images. J King Saud Univ Comput Inf Sci 2022; 34: 7283–7299.

FAME: A privacy-preserving dual-stage deep learning framework for breast ultrasound imaging using federated transfer and synthetic learning

Abstract

Background

Methods

Results

Conclusion

Keywords

Introduction

Methodology

Federated transfer learning

Auxiliary classifier GAN-based synthetic image generation

Image preprocessing

Resizing

Speckle noise reduction

Intensity normalization

Data augmentation

Multi Attention U-Net architecture

Channel attention module

Spatial attention module

Segmentation output and loss

Ensemble strategy with cross-validation

Classification network: dual-stage feature fusion

Implementation details

Datasets

Evaluation metrics

Egmentation metrics

Dice similarity coefficient

Intersection over union

Classification metrics

Accuracy

F1-score

Sensitivity (recall)

Specificity

Area under the curve

Experimental results

Segmentation

Classification

Privacy-preserving evaluation: DP and SA

Differential privacy results

Secure aggregation results

Ablation study

Computational efficiency analysis

Discussion

Limitations and future work

Conclusion

Footnotes

Acknowledgements

ORCID iDs

Ethical considerations

Contributorship

Funding

Declaration of conflicting interests

References