
Abstract

Objective

Early and accurate identification of skin lesions—ranging from benign irregularities to life-threatening cancers—is crucial for improving clinical outcomes. However, existing skin lesion datasets suffer from severe class imbalance, and there is limited consensus on effective augmentation strategies. This study aims to develop a robust framework that mitigates these limitations while enhancing diagnostic accuracy and interpretability.

Methods

We introduce a novel transfer learning-based framework termed hierarchical attention stacked ensemble (HASE), which integrates multiple EfficientNetV1 backbones through three distinct stacking schemes: HASE: independent TA, HASE: serial stacked TA, and HASE: parallel stacked TA. Here, TA denotes the triplet-attention module encompassing soft attention, channel attention, and squeeze-excitation mechanisms. To address prediction fusion, we propose an advanced ensemble aggregation method—Matthews-correlation-coefficient weighted averaging (MWA)—further extended into a multi-level MWA (ML-MWA) formulation. Additionally, four augmentation strategies were systematically evaluated to identify the most effective ensemble configuration.

Results

Experimental evaluations on the HAM10000 dataset demonstrated that the proposed framework achieved an outstanding accuracy of 93.96%, surpassing several state-of-the-art approaches. The use of Grad-CAM visualizations further enhanced model interpretability by effectively localizing lesion-relevant regions.

Conclusion

The proposed HASE framework not only delivers superior diagnostic accuracy but also alleviates challenges associated with class imbalance, limited dataset diversity, and high computational cost. By combining hierarchical attention and multi-level ensemble weighting, it establishes a reliable and interpretable solution for early and precise skin lesion classification, offering significant potential for real-world dermatological applications and improved patient care.

Keywords

Skin lesion classification, hierarchical attention stacked ensemble (HASE), Matthews-correlation-coefficient weighted averaging (MWA), triplet-attention (TA), augmentation, gradient class activation map (Grad-CAM)

Introduction

Skin lesions represent abnormal alterations in the skin’s appearance or structure and are associated with a wide range of dermatological conditions. These conditions can vary from common problems, such as acne, to more serious and potentially life-threatening diseases like skin cancer. Although skin disorders manifest with diverse symptoms, the presence of lesions is not their sole characteristic. Lesions may result from numerous causes, including infections, inflammatory responses, allergic reactions, malignancies, insect bites, trauma, autoimmune diseases, genetic factors, environmental exposures, vascular irregularities, warts, and cysts.¹ They are generally classified according to their potential health risk. Benign lesions, such as moles, skin tags, warts, seborrheic keratoses, and hemangiomas, are typically harmless. In contrast, malignant lesions, including basal cell carcinoma, squamous cell carcinoma, and melanoma, are cancerous and capable of spreading, posing serious health risks.²

Prompt and accurate identification of skin conditions traditionally relied on clinical examination and diagnostic testing. Delays in detection or insufficient attention could lead to severe consequences, including skin cancers, which rank among the most common forms of cancer globally. Although melanoma is less frequent than other types of skin cancer, it remains the primary cause of skin cancer-related mortality.³ Recent data suggest that around 2.2% of the population may be diagnosed with melanoma during their lifetime, with approximately 97,610 new cases and 7,990 deaths estimated in the United States in 2023. Furthermore, over 1.4 million individuals were living with melanoma in the US, underscoring its substantial public health impact.⁴

Timely identification of skin lesions is crucial for preventing their progression into more serious conditions. Despite this, many individuals remain unaware of existing skin abnormalities, partly due to the complexity and cost associated with conventional medical evaluations. Dermatoscopy, a non-invasive imaging technique that combines magnification and illumination, assists in examining skin lesions and facilitates early cancer detection. However, the effectiveness of dermatoscopy largely depends on the examiner’s skill, leaving room for human error.⁵

Artificial intelligence (AI), particularly through machine learning (ML) and deep learning (DL) frameworks, has demonstrated considerable promise in automating the detection of skin lesions. These approaches allow for rapid analysis of medical images, supporting early diagnosis and better clinical outcomes. Nevertheless, several challenges remain. Existing methods tend to favor classes with abundant data, struggle to extract rich features from transfer learning (TL) models without fine-tuning, and encounter difficulties in integrating multiple models effectively. Moreover, limited interpretability and biases arising from overlapping validation and testing datasets further hinder practical deployment. TL architectures such as DenseNet and ResNet also present limitations, including rigid scaling constraints, reliance on manual design choices, and high computational demands, which restrict their adaptability and efficiency, particularly in resource-limited settings.⁶

To overcome these obstacles, researchers have explored convolutional neural networks (CNNs) and ensemble learning strategies. While these approaches aim to mitigate the shortcomings of individual models, conventional ensemble techniques—such as majority voting, softmax averaging, and weighted averaging—do not adequately consider the relative importance of each predictor, often resulting in suboptimal performance. Additionally, post-prediction ensembling alone may fail when handling images with high variance, as no single model consistently identifies the correct class. These limitations highlight the importance of pre-prediction stacking, which complements traditional ensembling by addressing data variability and enhancing overall predictive accuracy.

Our methodology was meticulously crafted to address the abovementioned key challenges in skin lesion detection, aiming to resolve the following research questions (RQs). These questions formed the cornerstone of our architectural framework, guiding the development of innovative and effective solutions.

RQ1: How can the issue of severe class imbalance be mitigated, and what is the optimal approach for doing so?

– Most skin lesion datasets have significant class imbalance, which can bias models toward the majority classes. Approaches like data augmentation or generative adversarial networks (GANs) can help reduce this imbalance. Identifying the most effective augmentation strategy to ensure reliable performance on completely unseen data remains a key research goal.

RQ2: How can TL models be optimally tuned for specific tasks?

– With many TL models available, choosing and adapting the most suitable model is challenging. Models pretrained on datasets such as ImageNet often have fixed architectures that may not perfectly match task-specific needs. Optimizing these models for superior performance on the target task is therefore crucial.

RQ3: What methods effectively identify critical features, particularly significant regions in the data?

– In classification tasks, not all image regions contribute equally to feature extraction. Irrelevant or redundant regions can reduce model performance. Highlighting and focusing on the most important regions is essential to improve classification accuracy.

RQ4: Is a single algorithm sufficient, or is ensemble learning (EL) necessary? If so, which EL approach is most effective?

– Relying on a single algorithm can result in misclassification, especially for complex data. EL combines the strengths of multiple models to produce more reliable predictions. However, conventional methods like majority voting or simple averaging often fail to optimally weight individual models. A dynamic weighting strategy is, therefore, needed to enhance overall performance.⁷

RQ5: What are the limitations of relying solely on post-prediction ensembling, and how can pre-prediction stacking help?

– Post-prediction ensembling alone may struggle with images exhibiting high variance, as no single model can consistently predict the correct class. Integrating a pre-prediction stacking mechanism helps overcome these limitations, enabling more robust feature extraction and improved classification accuracy.⁸

These research questions formed the foundation of our approach, which introduced several key contributions:

Addressing class imbalance: A comprehensive augmentation framework was developed and evaluated using four strategies: no augmentation (NA), prior augmentation (PrA), training data augmentation (TDA), and posterior augmentation (PoA). The most effective method was selected based on its performance on unseen data, ensuring balanced class representation and reducing bias toward majority classes.

Optimizing TL models: EfficientNetV1 architectures were adopted for their flexibility and computational efficiency. These models were customized with additional layers and novel modules, resulting in the “Hierarchical Attention Stacked Ensemble (HASE)” framework, which effectively captured both shallow and deep feature representations.

Focusing on critical features: Triplet-attention (TA) mechanisms, consisting of soft attention integration (SAI), channel attention integration (CAI), and squeeze-excitation attention integration (SEAI), were incorporated to emphasize the most relevant regions of the input data, enhancing the model’s focus on key features.

Pre-prediction stacking: Three stacking configurations—HASE: independent TA, HASE: serial stacked TA, and HASE: parallel stacked TA—were implemented before training. This strategy combined extracted features to capture the most meaningful and deep patterns.

Novel ensemble strategy: The Matthews-correlation-coefficient weighted averaging (MWA) approach was introduced to calculate and apply optimal prediction weights across models. Its extension, multi-level MWA (ML-MWA), further improved performance by leveraging predictions from multiple layers, enhancing robustness, accuracy, and generalization.

Enhancing interpretability: Gradient class activation maps (Grad-CAMs) were integrated to visualize and highlight regions associated with specific skin conditions, improving transparency and reliability while providing valuable insights for clinical applications.

Literature review

Skin lesion classification has received significant attention within medical imaging and AI. Despite notable advancements, challenges such as class imbalance, dataset-specific optimization, and the effective use of attention mechanisms remain. This section reviews existing approaches, highlighting both their contributions and limitations to provide context for the present study.

TL has been extensively applied in skin lesion classification. Hosny et al.⁹ utilized AlexNet to classify melanoma and nevus, achieving high accuracy, but did not incorporate attention mechanisms that could improve diagnostic precision. Tajerian et al.¹⁰ reported 84.30% accuracy with EfficientNet-B1, demonstrating its capability to detect pigmented lesions; however, reliance on general features limited performance for dataset-specific characteristics. Wang et al.¹¹ employed DenseNet-121 and VGG-16 to extract multiscale features, achieving 91.24% accuracy, yet the absence of dataset-specific fine-tuning reduced adaptability. Mahbod et al.¹² studied the influence of image size on TL-based classification, reaching 86.2% balanced accuracy, though the computational requirements limited real-time applicability. Popescu et al.¹³ combined TL with collective intelligence to achieve 86.71% accuracy, but did not validate results on an independent test set, raising questions about model robustness. Howal and Wagh¹⁴ proposed the ILENET–LinkNet architecture integrating preprocessing, attention-based segmentation, and hybrid feature learning, demonstrating improved skin lesion classification performance through score-level fusion, although the multi-stage design increases architectural complexity.

Hybrid architectures that combine CNNs and transformers have demonstrated effectiveness in capturing both local and global features. Khan and Khan¹⁵ developed SkinViT, which integrates outlook attention with transformers and achieved 91.09% accuracy; however, its high computational cost limited scalability. Dong et al.¹⁶ proposed TC-Net, effectively combining CNN and transformer features to achieve improved segmentation performance, yet the model’s complexity hindered practical implementation. Nie et al.¹⁷ introduced a hybrid CNN–transformer approach with focal loss, attaining 89.48% accuracy, though it struggled to extract deeper features in more complex cases.

Attention mechanisms have become increasingly popular for emphasizing critical features in classification tasks. Nguyen et al.¹⁸ applied DL with soft attention, reporting accuracies of 90% and 86% across different models, but did not compare alternative attention strategies. Datta et al.¹⁹ implemented soft attention, achieving 93.4% accuracy, yet faced challenges in optimizing color channel weights, limiting generalizability. Saarela and Georgieva²⁰ used Bayesian inference to improve interpretability, achieving 80% accuracy, but their method fell short in classification precision compared to other techniques.

To address these limitations, more sophisticated attention mechanisms have been proposed. Singh et al.²¹ combined Bayesian MultiResUNet with DenseNet-169 for segmentation and classification, reaching 86.67% accuracy, yet it struggled with complex lesion types. Khan et al.²² introduced an entropy-optimized attention mechanism within a DL framework, achieving over 90% accuracy, although the robustness of their model on independent test sets was not thoroughly evaluated.

EL methods have been widely investigated to improve classification performance. Ajmal et al.²³ applied fuzzy entropy optimization within an ensemble framework, achieving high accuracy on the HAM10000 and ISIC 2018 datasets; however, high computational demands and the lack of evaluation on real-world datasets limited its applicability. Rahman et al.²⁴ employed an ensemble of five deep networks, attaining 88% accuracy, but the method did not offer dataset-specific optimization.

Nidhi et al.²⁵ and Abir et al.²⁶ classified skin lesions from the PAD-UFES-20 dataset using a single TL model and no ensemble technique. Ahmmed et al.²⁷ took the same approach with the PH2 dataset.

Data augmentation has played a crucial role in mitigating class imbalance. Gouda et al.²⁸ enhanced image quality using ESRGAN prior to classification, achieving 83.2% accuracy, yet did not fully resolve persistent imbalance issues. Sun et al.²⁹ leveraged augmented datasets along with supplementary metadata, reaching 89.5% accuracy, but the augmentation procedure lacked sufficient transparency, limiting reproducibility.

Studies³⁰–³³ applied augmentation techniques to the ISIC 2017–2020 datasets with TL but did not explore ensemble methods.

Despite these advances, many approaches remain constrained by small datasets, limited dataset-specific fine-tuning, and inadequate validation on independent test sets. Challenges related to computational efficiency and scalability persist, particularly for real-world deployment. Additionally, traditional ensemble techniques often fail to assign optimal weights to individual models, which can reduce overall effectiveness.

Building on these observations, our study proposes a novel framework designed to overcome the limitations of existing approaches. A critical first step involves selecting the most effective augmentation strategy to address class imbalance. By incorporating TA in serial, parallel, and independent stacking configurations, the framework enhances feature extraction and emphasizes the most relevant regions. Fine-tuning TL models for skin-specific characteristics reduces dependence on generalized ImageNet-pretrained architectures. Furthermore, our ensemble strategy employs Matthews-correlation-coefficient weighted averaging (MWA) to dynamically assign optimal prediction weights, ensuring robust and consistent performance across diverse datasets. Collectively, these innovations offer a comprehensive solution to current challenges and advance the state of skin lesion classification.

Materials and methods

This study was conducted as a theoretical and computational investigation utilizing TL and ensemble learning techniques for skin lesion classification. The experimental work was carried out in the authors' laboratory over a period of approximately 6 months.

Dataset description

This study utilized a publicly available dermatoscopic dataset to provide a comprehensive and diverse analysis of skin lesion classification.

The dataset, Human Against Machine (HAM10000), was obtained from the Harvard Dataverse repository.³⁴ It consists of a carefully curated set of 10,015 dermatoscopic images in JPG format, divided into seven distinct classes.

The seven classes included in the dataset are melanoma (MEL), nevus (NV), vascular lesions (VASC), actinic keratosis (AK), basal cell carcinoma (BCC), benign keratosis (BKL), and dermatofibroma (DF). Among these, MEL, AK, and BCC are classified as malignant lesions, whereas NV, BKL, and DF are benign. Some types of VASC may also exhibit malignant characteristics.

Table 1 gives an overview of the dataset used in this study. Tables 2 and 3 give a small worked example of MWA, and Table 4 lists the trainable parameters of each architecture.

Table 1.

Brief information of the HAM10000 dataset.

Images	Format	Classes	Source
10015	JPG	7	Harvard Dataverse

HAM10000: Human Against Machine.

Table 2.

Input data for Matthews-correlation-coefficient weighted averaging (MWA).

Sample	True label	Model A Prob [0, 1]	Model B Prob [0, 1]
1	0	[0.8, 0.2]	[0.6, 0.4]
2	1	[0.4, 0.6]	[0.7, 0.3]
3	0	[0.9, 0.1]	[0.5, 0.5]

Table 3.

Weighted ensemble results (Matthews-correlation-coefficient weighted averaging (MWA)).

Sample	Ensemble Prob [0,1]	Prediction
1	[0.8, 0.2]	0
2	[0.4, 0.6]	1
3	[0.9, 0.1]	0
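
The toy example in Tables 2 and 3 can be reproduced with a short sketch. The weighting rule is not spelled out in this section, so the code assumes that each model's weight is its MCC with negative or undefined values clipped to zero, normalized so the weights sum to one. The function names are illustrative, and scoring on the same three samples that are then fused is done only to mirror the tables; in practice the MCC would be measured on validation data.

```python
from collections import Counter
from math import sqrt


def argmax(probs):
    """Index of the largest probability (ties resolve to the lowest index)."""
    return max(range(len(probs)), key=lambda i: probs[i])


def mcc(y_true, y_pred):
    """Multiclass Matthews correlation coefficient; 0.0 when it is undefined."""
    n = len(y_true)
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    true_counts, pred_counts = Counter(y_true), Counter(y_pred)
    classes = set(true_counts) | set(pred_counts)
    cov_tp = correct * n - sum(true_counts[k] * pred_counts[k] for k in classes)
    cov_tt = n * n - sum(c * c for c in true_counts.values())
    cov_pp = n * n - sum(c * c for c in pred_counts.values())
    denom = sqrt(cov_tt * cov_pp)
    return cov_tp / denom if denom else 0.0


def mwa(y_true, model_probs):
    """Assumed rule: weight_i = max(MCC_i, 0) / sum_j max(MCC_j, 0)."""
    scores = [max(mcc(y_true, [argmax(p) for p in probs]), 0.0)
              for probs in model_probs]
    total = sum(scores)
    # Fall back to uniform weights if no model has a positive MCC.
    weights = ([s / total for s in scores] if total
               else [1.0 / len(scores)] * len(scores))
    n_classes = len(model_probs[0][0])
    ensemble = [
        [sum(w * probs[i][c] for w, probs in zip(weights, model_probs))
         for c in range(n_classes)]
        for i in range(len(y_true))
    ]
    return weights, ensemble


# Values copied from Table 2.
y_true = [0, 1, 0]
model_a = [[0.8, 0.2], [0.4, 0.6], [0.9, 0.1]]
model_b = [[0.6, 0.4], [0.7, 0.3], [0.5, 0.5]]

weights, ensemble = mwa(y_true, [model_a, model_b])
predictions = [argmax(p) for p in ensemble]
```

Under these assumptions Model A classifies all three samples correctly (MCC of 1), while Model B predicts class 0 every time, which leaves its MCC undefined and therefore clipped to zero. The ensemble then reduces to Model A's probabilities, matching Table 3.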

Table 4.

Trainable parameters for each architecture.

Serial stacked attention (SSA) architectures
SSA-ENb0 - 7,380,404	SSA-ENb1 - 9,886,040	SSA-ENb2 - 10,787,258
SSA-ENb3 - 14,075,360	SSA-ENb4 - 20,934,016	SSA-ENb5 - 31,732,456
SSA-ENb6 - 44,133,648	SSA-ENb7 - 67,191,176
Parallel stacked attention (PSA) architectures
PSA-ENb0 - 23,466,628	PSA-ENb1 - 25,972,264	PSA-ENb2 - 24,334,218
PSA-ENb3 - 30,174,128	PSA-ENb4 - 37,045,328	PSA-ENb5 - 47,856,312
PSA-ENb6 - 60,270,048	PSA-ENb7 - 83,340,120
Independent attention (ISA) architectures
SAI-ENb0 - 12,308,515	CAI-ENb0 - 9,641,571	SEAI-ENb0 - 9,600,133
SAI-ENb1 - 14,814,151	CAI-ENb1 - 12,147,207	SEAI-ENb1 - 11,413,417
SAI-ENb2 - 14,535,721	CAI-ENb2 - 12,655,209	SEAI-ENb2 - 12,613,771
SAI-ENb3 - 17,543,503	CAI-ENb3 - 15,662,991	SEAI-ENb3 - 15,621,553
SAI-ENb4 - 25,862,127	CAI-ENb4 - 23,195,183	SEAI-ENb4 - 23,153,745
SAI-ENb5 - 36,660,567	CAI-ENb5 - 33,993,623	SEAI-ENb5 - 33,952,185
SAI-ENb6 - 49,061,759	CAI-ENb6 - 46,394,815	SEAI-ENb6 - 46,353,377
SAI-ENb7 - 72,119,287	CAI-ENb7 - 69,452,343	SEAI-ENb7 - 69,410,905

Figure 1 presents representative examples from each class, showing one sample per category. The dataset’s substantial class imbalance is further illustrated in the class distribution visualization in Figure 2.

Figure 1.

Sample images from the HAM10000 dataset: (a) NV: nevus; (b) MEL: melanoma; (c) BKL: benign keratosis; (d) BCC: basal cell carcinoma; (e) AK: actinic keratosis; (f) VASC: vascular lesions; and (g) DF: dermatofibroma.

Figure 2.

Sample distribution for each class in the HAM10000 dataset.

The dataset was carefully preprocessed to meet the objectives of this study. Additional details regarding the specific versions used are available in HAM10000.³⁵

Methodological approach

The methodological framework of this study started with dataset acquisition, followed by comprehensive data preprocessing. The dataset was subsequently divided into two primary subsets: a main training set and an independent testing set. The independent testing set was completely held out during training and validation, providing truly unseen data for final evaluation.

To mitigate class imbalance, four distinct data augmentation strategies were employed:

No augmentation (NA): Only the original dataset was used, without generating any synthetic images.

Prior augmentation (PrA): Synthetic images were created prior to data splitting, which could result in overlap, where both original and augmented images from the same source might appear in training, validation, and testing sets.

Training data augmentation (TDA): Augmentation was applied solely to the training data, keeping validation and testing sets independent and unchanged.

Posterior augmentation (PoA): Each subset—training, validation, and testing—was augmented after splitting, increasing the dataset size across all partitions.

The most effective augmentation strategy was identified by training a customized network based on EfficientNetV1 variants, followed by evaluation on the independent testing set to determine performance on entirely unseen data.

Next, the data was processed within the HASE framework. HASE combined architectures trained on the training set and validated on the validation set. It incorporated models using three TA configurations, which included soft attention, channel attention, and squeeze-excitation attention: HASE: independent TA, HASE: serial stacked TA, and HASE: parallel stacked TA.

Predictions from each model were then fused using the ML-MWA method, applied across multiple layers to boost performance. This ensemble technique enabled optimal weighting of predictions and improved generalization.

For interpretability, Grad-CAM visualizations were employed, providing insights into model behavior by highlighting critical regions of the input images. A schematic diagram of the sequential steps in this methodology is presented in Figure 3.

Figure 3.

Sequential representation of methodology.

Preprocessing and data augmentation

To prepare the dataset for effective training, images were first grouped according to their lesion IDs. Careful sampling was then conducted to create distinct subsets for training, validation, and testing. Specifically, 15% of the images were allocated to the independent testing set, while the remaining 85% formed the primary training set. The independent testing set was completely preserved as unseen data and used exclusively for final evaluation, ensuring an unbiased assessment of model performance.

Lesion IDs were strictly separated across training, validation, and testing sets prior to augmentation, ensuring that no images derived from the same lesion appear in more than one subset.
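
A lesion-level hold-out of this kind can be sketched as follows. This is a minimal illustration rather than the exact sampling procedure used in the study: the function name, the seed, and the synthetic image and lesion identifiers are all hypothetical, and the hold-out is filled lesion by lesion until it reaches roughly 15% of the images.

```python
import random
from collections import defaultdict


def split_by_lesion(image_to_lesion, holdout_fraction=0.15, seed=0):
    """Split image IDs so that every lesion falls entirely on one side."""
    by_lesion = defaultdict(list)
    for image_id, lesion_id in image_to_lesion.items():
        by_lesion[lesion_id].append(image_id)

    lesions = sorted(by_lesion)            # sort first for reproducibility
    random.Random(seed).shuffle(lesions)

    target = holdout_fraction * len(image_to_lesion)
    main, holdout = [], []
    for lesion_id in lesions:
        # Whole lesions go to the hold-out until it reaches the target size.
        bucket = holdout if len(holdout) < target else main
        bucket.extend(by_lesion[lesion_id])
    return main, holdout


# Hypothetical metadata: 40 lesions with one to three images each.
records = {f"img_{l}_{k}": f"lesion_{l}"
           for l in range(40) for k in range(1 + l % 3)}

main_ids, holdout_ids = split_by_lesion(records)
main_lesions = {records[i] for i in main_ids}
holdout_lesions = {records[i] for i in holdout_ids}
```

Because assignment happens per lesion rather than per image, `main_lesions` and `holdout_lesions` cannot share an element, at the cost of the hold-out size being only approximately 15%.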

Figure 4 depicts the four data augmentation strategies implemented to address class imbalance:

Figure 4.

Illustration of four data augmentation strategies.

No augmentation (NA): The original dataset was used without generating any synthetic images.

Prior augmentation (PrA): Synthetic images were produced before dataset splitting, which could result in overlap where both original and augmented images from the same source appear in training, validation, and testing sets.

Training data augmentation (TDA): Augmentation was applied solely to the training subset, keeping validation and testing sets independent and unchanged. Among the four strategies, this is the only clinically valid augmentation approach, as it preserves the independence of validation and test data and prevents data leakage.

Posterior augmentation (PoA): Each subset—training, validation, and testing—was augmented separately after splitting, increasing the dataset size across all partitions.
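
The difference between PrA and TDA with respect to leakage can be illustrated with a small simulation. This is a sketch under simplifying assumptions: augmentation is replaced by tagging each source image with a variant number, only a train/test split is modelled, and all names are hypothetical.

```python
import random


def augment(source_ids, copies):
    """Stand-in for augmentation: variant 0 is the original image."""
    return [(sid, v) for sid in source_ids for v in range(copies + 1)]


def split(items, test_fraction, rng):
    """Shuffle and cut a list into (train, test)."""
    items = list(items)
    rng.shuffle(items)
    cut = int(len(items) * (1 - test_fraction))
    return items[:cut], items[cut:]


def leaked_sources(train, test):
    """Source images that contribute samples to both subsets."""
    return {sid for sid, _ in train} & {sid for sid, _ in test}


def prior_augmentation(source_ids, copies, test_fraction, seed=0):
    # PrA: augment first, then split the pooled originals and variants.
    return split(augment(source_ids, copies), test_fraction, random.Random(seed))


def training_data_augmentation(source_ids, copies, test_fraction, seed=0):
    # TDA: split the originals first, then augment the training part only.
    train_src, test_src = split(source_ids, test_fraction, random.Random(seed))
    return augment(train_src, copies), augment(test_src, 0)


sources = [f"img_{i}" for i in range(50)]
pra_train, pra_test = prior_augmentation(sources, copies=2, test_fraction=0.15)
tda_train, tda_test = training_data_augmentation(sources, copies=2, test_fraction=0.15)

pra_leak = leaked_sources(pra_train, pra_test)
tda_leak = leaked_sources(tda_train, tda_test)
```

With PrA, variants of the same source image end up on both sides of the split, so `pra_leak` is non-empty; with TDA the test subset contains only untouched originals whose sources never appear in training, so `tda_leak` is empty.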

To address class imbalance, roughly 8000 synthetic images were generated for each class. The primary training dataset was subsequently split into training, validation, and testing subsets in a 70:15:15 ratio, respectively.

Augmentation was carried out using TensorFlow’s ImageDataGenerator, following a comprehensive strategy to increase dataset diversity and enhance model generalization. The process began with contrast enhancement of the original images to improve visual clarity. Various transformations were then applied, including random rotations up to 180°, horizontal and vertical flips, width and height shifts of up to 10%, and zoom adjustments within a 10% range. To fill gaps resulting from these transformations, the nearest neighbor fill mode was utilized, ensuring consistency across generated images. This augmentation strategy simulated a wide variety of variations, effectively improving the robustness of the DL model.³⁶
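
The transformation settings listed above map onto ImageDataGenerator keyword arguments roughly as follows. This is a sketch, not the exact configuration used in the study: the contrast-enhancement step is omitted because its method is not specified here, and the names `AUGMENTATION_KWARGS` and `build_augmenter` are illustrative. Note that ImageDataGenerator is marked as deprecated in recent TensorFlow releases.

```python
# Keyword arguments mirroring the transformations described in the text.
AUGMENTATION_KWARGS = {
    "rotation_range": 180,       # random rotations up to 180 degrees
    "width_shift_range": 0.1,    # horizontal shift up to 10% of the width
    "height_shift_range": 0.1,   # vertical shift up to 10% of the height
    "zoom_range": 0.1,           # zoom within a 10% range
    "horizontal_flip": True,
    "vertical_flip": True,
    "fill_mode": "nearest",      # fill exposed pixels with the nearest value
}


def build_augmenter():
    """Create the generator; TensorFlow is imported only when this is called."""
    from tensorflow.keras.preprocessing.image import ImageDataGenerator
    return ImageDataGenerator(**AUGMENTATION_KWARGS)
```

With TensorFlow installed, `build_augmenter().flow(x_train, y_train, batch_size=32)` would yield augmented batches from in-memory arrays, where `x_train` and `y_train` are placeholder names for the training images and labels.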

Figure 5 presents examples of original, contrast-enhanced, and augmented images, illustrating a sample from the AK class along with its augmented variants.

Figure 5.

Images of the augmented samples: (a) original sample; (b) rotated sample; (c) width shifted; (d) height shifted; (e) zoomed sample; (f) horizontally flipped; and (g) vertically flipped.

Tables 5 and 6 compare all four augmentation strategies on the testing data and the independent testing data, respectively. Accordingly, all primary performance comparisons and conclusions in this study are drawn based on results obtained using the TDA strategy.

Table 5.

Performance evaluation by four augmentation strategies on testing data.

Algorithm	A	P	R	F1	S
NA_ENb0	83.95	83.94	83.95	83.71	89.60
PrA_ENb0	98.11	98.10	98.11	98.10	99.68
TDA_ENb0	86.44	86.00	86.44	86.13	89.24
PoA_ENb0	82.25	83.19	82.25	82.17	96.87
NA_ENb1	81.34	80.30	81.34	80.38	86.93
PrA_ENb1	97.91	97.90	97.91	97.90	99.65
TDA_ENb1	88.50	88.21	88.50	88.03	89.63
PoA_ENb1	81.53	82.01	81.53	81.43	96.78
NA_ENb2	84.49	83.67	84.49	83.75	85.92
PrA_ENb2	97.87	97.86	97.87	97.86	99.64
TDA_ENb2	88.83	88.55	88.83	88.38	88.67
PoA_ENb2	79.68	81.26	79.68	79.39	96.42
NA_ENb3	82.65	81.82	82.65	81.60	82.70
PrA_ENb3	97.55	97.55	97.55	97.54	99.58
TDA_ENb3	86.88	87.06	86.88	86.48	91.32
PoA_ENb3	80.22	81.32	80.22	80.03	96.56
NA_ENb4	80.48	79.98	80.48	79.29	83.42
PrA_ENb4	97.05	97.06	97.05	97.04	99.50
TDA_ENb4	87.09	87.06	87.09	86.92	90.41
PoA_ENb4	78.70	80.19	78.70	78.64	96.25
NA_ENb5	78.74	77.87	78.74	77.66	83.77
PrA_ENb5	96.18	96.20	96.18	96.18	99.35
TDA_ENb5	84.60	84.25	84.60	84.15	87.37
PoA_ENb5	78.64	80.23	78.64	78.54	96.28
NA_ENb6	82.00	80.91	82.00	80.89	84.80
PrA_ENb6	93.60	93.59	93.60	93.56	98.91
TDA_ENb6	83.95	83.02	83.95	83.21	83.42
PoA_ENb6	77.56	79.32	77.56	77.31	96.10
NA_ENb7	82.54	81.61	82.54	81.76	85.05
PrA_ENb7	93.85	93.88	93.85	93.81	98.96
TDA_ENb7	83.73	83.58	83.73	83.43	87.74
PoA_ENb7	78.03	79.08	78.03	78.03	96.20

NA: no augmentation; PrA: prior augmentation; TDA: training data augmentation; PoA: posterior augmentation.

Table 6.

Performance evaluation by four augmentation strategies on independent testing data.

Algorithm	A	P	R	F1	S
NA_ENb0	90.58	90.34	90.58	90.33	87.52
PrA_ENb0	91.43	91.39	91.43	91.06	86.59
TDA_ENb0	90.10	89.97	90.10	89.88	83.70
PoA_ENb0	90.46	89.77	90.46	90.03	86.57
NA_ENb1	88.65	87.87	88.65	88.13	84.07
PrA_ENb1	89.37	88.59	89.37	88.87	79.78
TDA_ENb1	91.06	90.78	91.06	90.70	84.18
PoA_ENb1	91.43	91.13	91.43	91.26	86.63
NA_ENb2	88.53	87.25	88.53	87.78	80.26
PrA_ENb2	90.58	89.93	90.58	90.14	81.79
TDA_ENb2	90.70	89.92	90.70	90.02	78.44
PoA_ENb2	90.34	89.69	90.34	89.79	80.84
NA_ENb3	89.49	88.57	89.49	88.80	80.24
PrA_ENb3	90.94	90.29	90.94	90.38	81.29
TDA_ENb3	90.58	90.51	90.58	90.35	88.44
PoA_ENb3	90.22	90.41	90.22	90.27	88.48
NA_ENb4	88.77	88.21	88.77	88.33	81.66
PrA_ENb4	89.61	90.10	89.61	89.71	87.49
TDA_ENb4	91.06	91.28	91.06	91.10	87.57
PoA_ENb4	91.43	90.82	91.43	90.89	83.75
NA_ENb5	87.68	87.01	87.68	87.25	85.41
PrA_ENb5	90.58	90.29	90.58	90.40	87.53
TDA_ENb5	90.70	90.21	90.70	90.31	83.21
PoA_ENb5	91.18	90.79	91.18	90.86	87.57
NA_ENb6	87.68	87.09	87.68	87.26	82.58
PrA_ENb6	90.94	90.49	90.94	90.52	83.70
TDA_ENb6	90.58	90.09	90.58	90.13	82.79
PoA_ENb6	89.49	88.57	89.49	88.86	81.76
NA_ENb7	86.59	87.11	86.59	86.73	85.43
PrA_ENb7	90.46	90.12	90.46	90.08	82.69
TDA_ENb7	89.13	89.14	89.13	89.06	86.91
PoA_ENb7	89.01	89.34	89.01	88.94	87.38

NA: no augmentation; PrA: prior augmentation; TDA: training data augmentation; PoA: posterior augmentation.
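
One way to read Tables 5 and 6 together is to average the accuracy column over the eight backbones and compare the two test sets. The sketch below does only that arithmetic; the values are copied from the tables, and the averaging is our own summary rather than a statistic reported in the study.

```python
# Accuracy (%) for ENb0..ENb7, copied from Table 5 (internal testing data).
INTERNAL = {
    "NA":  [83.95, 81.34, 84.49, 82.65, 80.48, 78.74, 82.00, 82.54],
    "PrA": [98.11, 97.91, 97.87, 97.55, 97.05, 96.18, 93.60, 93.85],
    "TDA": [86.44, 88.50, 88.83, 86.88, 87.09, 84.60, 83.95, 83.73],
    "PoA": [82.25, 81.53, 79.68, 80.22, 78.70, 78.64, 77.56, 78.03],
}

# Accuracy (%) for ENb0..ENb7, copied from Table 6 (independent testing data).
INDEPENDENT = {
    "NA":  [90.58, 88.65, 88.53, 89.49, 88.77, 87.68, 87.68, 86.59],
    "PrA": [91.43, 89.37, 90.58, 90.94, 89.61, 90.58, 90.94, 90.46],
    "TDA": [90.10, 91.06, 90.70, 90.58, 91.06, 90.70, 90.58, 89.13],
    "PoA": [90.46, 91.43, 90.34, 90.22, 91.43, 91.18, 89.49, 89.01],
}


def mean(values):
    return sum(values) / len(values)


def accuracy_gap(strategy):
    """Mean internal-test accuracy minus mean independent-test accuracy."""
    return mean(INTERNAL[strategy]) - mean(INDEPENDENT[strategy])


gaps = {name: round(accuracy_gap(name), 2) for name in INTERNAL}
```

On these figures, PrA is the only strategy whose internal-test accuracy exceeds its independent-test accuracy (by roughly 6 points on average), which is the pattern one would expect if augmented copies of the same source image cross the internal split.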

Development of HASE architectures

The HASE framework utilized customized EfficientNetV1 models, fully leveraging TL. Specifically, seven pre-trained architectures, including various EfficientNetV1 variants with input dimensions of $299 \times 299 \times 3$ and $224 \times 224 \times 3$, were fine-tuned. Since these models were originally trained on unrelated datasets, fine-tuning allowed adaptation to our dataset, enabling the extraction of both shallow and deep features effectively. To further improve performance, TA was incorporated in three configurations: serial stacked, parallel stacked, and independent attention. A schematic diagram of the complete architecture is illustrated in Figure 6.

Figure 6.

Overview of the hierarchical attention stacked ensemble (HASE) architecture.

The integration process started by importing pre-trained models from the TensorFlow library and adapting them to match our specific input dimensions. Outputs were reshaped into a three-dimensional feature map per sample, giving a tensor of shape (None, height, width, channels) once the batch dimension is included, to align the custom architecture with the pre-trained models for smooth feature extraction.

Three customized CNN architectures incorporating TA were developed:

Soft attention integrated network (SAIN): Targeted fine-grained spatial patterns.

Channel attention integrated network (CAIN): Enhanced feature representation by emphasizing significant channels.

Squeeze-excitation attention integrated network (SEAIN): Calibrated channel-wise responses to capture hierarchical features more effectively.

The TA modules were selectively integrated into these networks. For SAIN and SEAIN, TA modules were inserted after each convolutional block, while CAIN incorporated channel attention after every Conv2D layer. This strategic placement balanced computational efficiency and ensured meaningful feature enhancement.

The convolutional backbone consisted of two convolutional blocks, each containing four Conv2D layers with kernels of varying sizes ($7 \times 7$, $5 \times 5$, $3 \times 3$, and $1 \times 1$). The first block used 128 filters, and the second block used 256 filters, with BatchNormalization and MaxPooling2D layers applied for feature refinement and dimensionality reduction. All convolutional layers utilized ReLU activation to mitigate vanishing gradient issues.

The three HASE configurations—serial stacked, parallel stacked, and independent attention—are described as follows.

HASE: Serial stacked attention network

In the serial configuration, outputs from SAIN, CAIN, and SEAIN were integrated in a sequential manner. Following the reshaping of the pre-trained model’s output tensor, the SAIN processed the features first, followed by CAIN, and finally SEAIN. Each network further refined the features extracted by the preceding one, producing progressively enhanced representations. These features were flattened into a one-dimensional tensor and fed through three fully connected layers with sizes 256, 128, and 7, corresponding to the number of classes. ReLU activation was applied to the first two layers, while the final layer used softmax to produce class probabilities. Dropout layers with rates of 35% and 25% were included after the first two dense layers, respectively, to reduce overfitting.

HASE: Parallel stacked attention network

In the parallel configuration, outputs from SAIN, CAIN, and SEAIN were computed concurrently. Each network independently processed the reshaped pre-trained output, extracting features in parallel. The resulting feature maps were then concatenated to merge complementary information from all attention mechanisms. The combined tensor was flattened and passed through the same fully connected layers and dropout setup as in the serial configuration. This design facilitated the integration of diverse feature representations, enhancing model generalization.

HASE: Independent attention network

In the independent configuration, SAIN, CAIN, and SEAIN functioned completely independently. Each network extracted features separately from the reshaped pre-trained model output. The outputs were flattened into one-dimensional tensors and passed through their respective fully connected layers. Each network generated its own predictions, maintaining independence of the extracted features. This setup allowed each attention mechanism to focus solely on its specialized feature extraction, which could later be combined during ensemble evaluation.

The careful design of these three HASE configurations ensured effective utilization of attention mechanisms, enabling robust feature extraction and improving model performance across varied input scenarios.
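The difference between the three configurations lies only in how the attention networks are wired together. The following is a minimal sketch of that wiring, not the authors' implementation: the three branch functions are hypothetical stand-ins for SAIN, CAIN, and SEAIN that operate on a plain feature list, and the names `serial`, `parallel`, and `independent` are chosen here for illustration.

```python
def serial(x, branches):
    """Serial stacking: each branch refines the output of the previous one."""
    for branch in branches:
        x = branch(x)
    return x


def parallel(x, branches):
    """Parallel stacking: every branch sees the same input; outputs are concatenated."""
    merged = []
    for branch in branches:
        merged.extend(branch(x))
    return merged


def independent(x, branches):
    """Independent: every branch keeps its own output for a later ensemble step."""
    return [branch(x) for branch in branches]


# Hypothetical stand-ins for the three attention-integrated networks.
def sain(features):
    return [v * 2.0 for v in features]


def cain(features):
    return [v + 1.0 for v in features]


def seain(features):
    return [v * 0.5 for v in features]


branches = [sain, cain, seain]
features = [1.0, 2.0]

serial_out = serial(features, branches)            # one refined feature list
parallel_out = parallel(features, branches)        # one concatenated feature list
independent_out = independent(features, branches)  # three separate feature lists
```

In the serial and parallel cases a single classification head follows the merged features, whereas the independent case yields one output per attention network.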

Feature extraction process

In our methodology, the HASE models were employed for effective feature extraction. The top fully connected layers were excluded (include_top=False), and global average pooling was applied (pooling="avg"). The resulting outputs were reshaped to optimized dimensions suitable for further processing with additional convolutional layers. These convolutional layers used a variety of filter sizes ( $7 \times 7, 5 \times 5, 3 \times 3$ , and $1 \times 1$ ) and incorporated ReLU activation along with batch normalization to stabilize and enhance learning. Max pooling layers were then applied to reduce spatial dimensions, sharpening feature focus. Finally, the extracted feature maps were flattened and passed through fully connected layers with ReLU activation, concluding with a dense output layer using softmax activation to generate class probability predictions.

Feature visualization

Figure 7 illustrates the hierarchical feature extraction process within an Optimized EfficientNetV1 architecture. The figure shows activation maps at multiple stages of the TL model, with each row corresponding to a different layer’s activations, providing a detailed view of the progressive transformation of input images:

Figure 7.

Feature extraction process illustrated by activation maps (sample visualization).

Input layer (input_1): Displayed the preprocessed input image, representing raw pixel data.

Zero padding (zero_padding2d): Feature maps after zero padding, preparing tensors for subsequent convolutional operations.

Convolution (conv2d): Activation maps obtained after applying 64 convolutional filters, highlighting learned edges and patterns.

Batch normalization (batch_normalization): Normalized feature maps to enhance convergence and training stability.

ReLU activation (activation): Non-linear activations via the ReLU function, enabling the recognition of complex patterns.

Max pooling (max_pooling2d): Downsampled feature maps to preserve key features while reducing spatial dimensions.

Concatenation (concatenate): Merged feature maps from multiple layers, integrating multi-path information for richer representations.

Dense layer (dense): Converted feature maps into a vector form in preparation for classification.

Output layer (dense_1): Final activations, producing class probabilities through softmax.

The visualization in Figure 7 displays up to five filters per layer using the viridis colormap, ensuring clarity and effective contrast. These activation maps provide a comprehensive view of how the model hierarchically processes input images, capturing critical features at each stage.

Demonstrated on a single sample and selected layers, this process highlights the systematic extraction of thousands of feature representations. These detailed features substantially enhanced overall model performance by providing deeper insights into how hierarchical patterns were captured across the architecture.

Triplet-attention

To improve the model’s ability to focus on important input features while minimizing less relevant information, we employed three complementary attention mechanisms, collectively called Triplet-Attention (TA). This method integrates soft attention integration (SAI), channel attention integration (CAI), and squeeze-excitation attention integration (SEAI) to efficiently capture and emphasize critical patterns within the data.

Soft attention integration

The SAI module emphasizes assigning attention weights to individual elements of the input, enabling the model to prioritize regions according to their importance.³⁷ The attention mechanism can be expressed as follows:

a_{i} = \frac{\exp (e_{i})}{\sum_{j = 1}^{T} \exp (e_{j})}

(1)

Here $a_{i}$ denotes the attention weight for the $i$ -th input element, $T$ is the total number of input elements, and $e_{i}$ represents the relevance score of the $i$ -th element.³⁸

By assigning greater weights to the most important regions, the SAI module directs the model’s focus toward the most relevant portions of the input, thereby improving overall performance.
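Equation (1) is a softmax over relevance scores. The short sketch below computes the weights in plain Python; the function names, the example scores, and the weighted-sum step in `attend` are assumptions made for illustration, since the source does not specify how the scores $e_{i}$ are produced or how the weights are applied.

```python
import math


def soft_attention_weights(scores):
    """Eq. (1): a_i = exp(e_i) / sum_j exp(e_j)."""
    shift = max(scores)  # subtracting the maximum avoids overflow; the ratios are unchanged
    exps = [math.exp(e - shift) for e in scores]
    total = sum(exps)
    return [v / total for v in exps]


def attend(features, scores):
    """Assumed usage: weighted sum of feature vectors using the attention weights."""
    weights = soft_attention_weights(scores)
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) for d in range(dim)]


# Hypothetical relevance scores for three input elements.
weights = soft_attention_weights([2.0, 1.0, 0.1])
context = attend([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [2.0, 1.0, 0.1])
```

The weights are positive and sum to one, so elements with larger relevance scores contribute proportionally more to the attended output.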

Channel attention integration

The CAI module emphasizes the significance of key channels within feature maps by computing attention weights across them. These weights are determined using statistical properties, such as the mean and standard deviation, of the input feature maps and are applied to enhance relevant features.³⁹ The functionality of the CAI module can be expressed mathematically as follows:

w_{c} = σ (W_{2} δ (W_{1} x))

(2)

y_{c} = w_{c} ⊙ x

(3)

Here $x$ denotes the input feature maps of size $C \times H \times W$ , $W_{1}$ and $W_{2}$ are learnable weight matrices, $δ$ represents the ReLU activation function, $σ$ is the sigmoid activation function, $w_{c}$ corresponds to the computed channel attention weight, and $⊙$ indicates element-wise multiplication.⁴⁰

Squeeze-excitation attention integration

The SEAI module emphasizes channel-wise attention, allowing the model to dynamically recalibrate feature maps.⁴¹ The module carries out two main operations: aggregation of global spatial information and recalibration of features across channels. For an input feature map $x$ of size $C \times H \times W$ , the SEAI operations can be expressed as follows:

z = GlobalAvgPooling (x)

(4)

s = sigmoid (W_{2} \cdot ReLU (W_{1} \cdot z))

(5)

y = s ⊙ x

(6)

Here $GlobalAvgPooling$ performs global spatial information aggregation, and $W_{1}$ and $W_{2}$ are learnable weight matrices. This mechanism allows the model to focus on the most informative channel-wise features within the input data.⁴²
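The squeeze, excitation, and rescaling steps can be sketched on nested lists as follows. This is an illustrative sketch under stated assumptions rather than the authors' code: it uses the ordering common in squeeze-and-excitation implementations (a ReLU bottleneck followed by a sigmoid gate, so every channel scale lies in (0, 1)), and the weight matrices `w1` and `w2` and the input values are hypothetical.

```python
import math


def global_avg_pool(x):
    """Squeeze: average each channel's H x W map into a single scalar."""
    return [sum(sum(row) for row in ch) / (len(ch) * len(ch[0])) for ch in x]


def matvec(w, v):
    """Plain matrix-vector product for nested lists."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in w]


def relu(v):
    return [max(0.0, a) for a in v]


def sigmoid(v):
    return [1.0 / (1.0 + math.exp(-a)) for a in v]


def squeeze_excite(x, w1, w2):
    """x has shape C x H x W; w1 is (C/r) x C and w2 is C x (C/r)."""
    z = global_avg_pool(x)                          # channel descriptor
    s = sigmoid(matvec(w2, relu(matvec(w1, z))))    # per-channel gate in (0, 1)
    y = [[[s_c * val for val in row] for row in ch] for s_c, ch in zip(s, x)]
    return y, s


# Hypothetical input with two channels of size 2 x 2 and a bottleneck of width 1.
x = [[[1.0, 1.0], [1.0, 1.0]],
     [[2.0, 2.0], [2.0, 2.0]]]
w1 = [[1.0, 1.0]]
w2 = [[1.0], [-1.0]]
y, s = squeeze_excite(x, w1, w2)
```

Each channel of the output is the corresponding input channel multiplied by one scalar gate, which is the channel-wise recalibration described above.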

Matthews-correlation-coefficient weighted averaging

We introduced a new ensemble learning method called Matthews-correlation-coefficient weighted averaging (MWA), which assigns proportional weights to predictions from multiple classifiers based on their Matthews correlation coefficient (MCC) performance. Unlike accuracy-based or loss-based metrics, MCC provides a balanced evaluation of classifier quality even in cases of class imbalance. By emphasizing classifiers with higher MCC values, the MWA method ensures that the ensemble leverages models that provide stronger overall consistency in predicting both positive and negative classes.

Step 1: Evaluating classifier performance

The first step involves measuring the performance of each classifier using the MCC. MCC is widely regarded as one of the most informative metrics for binary classification, as it takes into account true positives (TP), true negatives (TN), false positives (FP), and false negatives (FN) in a balanced manner. The MCC score is calculated as follows:

MCC = \frac{T P \cdot T N - F P \cdot F N}{\sqrt{(T P + F P) (T P + F N) (T N + F P) (T N + F N)}}

(7)

The resulting MCC value ranges from $- 1$ to $1$ : $+ 1$ indicates perfect agreement between the classifier’s predictions and the true labels, $0$ indicates no better than random prediction, and $- 1$ indicates total disagreement (inverse prediction).

This property makes MCC particularly suitable for imbalanced classification problems, where simple accuracy may be misleading.
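A small worked example makes the contrast with accuracy concrete. The sketch below evaluates Eq. (7) from confusion-matrix counts; the function names, the hypothetical counts, and the convention of returning 0 when the denominator is zero are assumptions made here.

```python
import math


def mcc_binary(tp, tn, fp, fn):
    """Eq. (7); returns 0.0 when the denominator is zero (assumed convention)."""
    numerator = tp * tn - fp * fn
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return numerator / denominator if denominator else 0.0


def accuracy(tp, tn, fp, fn):
    return (tp + tn) / (tp + tn + fp + fn)


# Hypothetical imbalanced test set: 95 negatives and 5 positives.
# A classifier that always predicts "negative" never finds a positive sample.
always_negative = dict(tp=0, tn=95, fp=0, fn=5)
acc_trivial = accuracy(**always_negative)    # 0.95
mcc_trivial = mcc_binary(**always_negative)  # 0.0
```

The trivial classifier reaches 95% accuracy while its MCC is 0, which is the behaviour that motivates using MCC as the weighting signal.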

Step 2: Computing ensemble weights

Once MCC values are obtained for each model, they are shifted to ensure non-negativity so that proportional weights can be assigned. This prevents negative MCC values from adversely affecting ensemble contributions. The normalized ensemble weights are computed as follows:

w_{i} = \frac{{MCC}_{i}^{'}}{\sum_{j = 1}^{N} {MCC}_{j}^{'}}

(8)

where ${MCC}_{i}^{'} = {MCC}_{i} - min (MCC) + ϵ$ , with $ϵ$ representing a very small positive constant to avoid zero weights, ${MCC}_{i}$ is the Matthews correlation coefficient score of classifier $i$ , and $N$ is the total number of classifiers. This proportional weighting scheme ensures that classifiers with stronger correlation to true labels exert more impact on the ensemble inference.
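The shift-and-normalize step of Eq. (8) can be written in a few lines. In this sketch the function name `mwa_weights`, the default value of `eps`, and the example MCC scores are assumptions; the source only states that $ϵ$ is a very small positive constant.

```python
def mwa_weights(mcc_scores, eps=1e-6):
    """Eq. (8): shift MCC scores by their minimum, add eps, and normalize."""
    lowest = min(mcc_scores)
    shifted = [m - lowest + eps for m in mcc_scores]
    total = sum(shifted)
    return [s / total for s in shifted]


# Hypothetical MCC scores for three classifiers, one of them negative.
weights = mwa_weights([0.90, 0.85, -0.10])
```

Because the minimum score is subtracted, the weakest classifier always receives a weight close to zero, and the remaining weights reflect each classifier's margin over it.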

Step 3: Generating ensemble predictions

The ensemble predictions are generated through the weighted averaging of each classifier’s probabilistic outputs. Let $P_{i} = [p_{i 1}, p_{i 2}, \dots, p_{i n}]$ denote the predicted probability distribution of classifier $i$ across $n$ samples. The ensemble prediction for the $j$ -th sample is derived as follows:

E_{j} = \sum_{i = 1}^{N} w_{i} \cdot p_{i j}

(9)

where $E_{j}$ represents the weighted ensemble output for sample $j$ , $w_{i}$ is the MCC-derived weight of classifier $i$ , and $p_{i j}$ corresponds to the probability predicted by classifier $i$ for sample $j$ . The final prediction label is then obtained by taking the class with the maximum probability in $E_{j}$ .

By proportionally emphasizing classifiers that demonstrate a higher correlation between predictions and actual outcomes, the MWA technique strengthens both the robustness and generalization of the ensemble. A graphical overview of this process is provided in Figure 8.

Figure 8.

Logarithmic loss-based weighted ensemble in layer $L$ .

Justification of proposing MWA

The MWA method is introduced to provide a principled and reliable ensemble weighting strategy for class-imbalanced medical image classification tasks. Conventional ensemble techniques, such as majority voting and simple weighted averaging, typically rely on heuristic or manually assigned weights and do not guarantee an optimal or performance-aware weighting of individual models. Similarly, probabilistic or accuracy-based weighting methods, including Bayesian model averaging, are often biased toward the majority classes and may yield misleading importance estimates in imbalanced datasets.⁴³

In contrast, the MCC offers a comprehensive performance measure by simultaneously incorporating true positives, true negatives, false positives, and false negatives into a single statistic. Unlike accuracy or confidence-based metrics, MCC remains robust under severe class imbalance, which is a common characteristic of skin lesion datasets. Therefore, using MCC as the basis for model weighting enables a fair and discriminative assessment of each classifier’s true predictive capability.

The proposed MWA strategy leverages MCC values to automatically assign higher weights to consistently reliable models while suppressing the influence of poorly performing ones. Within the multi-level ensemble architecture, MWA is applied hierarchically, allowing model contributions to be refined progressively across successive layers. This hierarchical weighting mechanism addresses the limitations of single-stage ensemble methods, enhances robustness against class imbalance, and leads to improved generalization and predictive performance.

Multi-level MWA

The multi-level MWA method extended the MWA technique across two distinct layers, enabling a more refined and hierarchical emphasis on the strengths of individual models. This multi-level strategy addressed a critical challenge in single-level ensembling: the difficulty in adequately highlighting superior models due to relatively low individual classifier weights. By adopting a sequential “Layer-by-Layer” ensembling approach, this method progressively prioritized high-performing models at each layer, amplifying their influence in subsequent layers. A generic visual representation of the multi-level MWA framework is provided in Figure 9.

Figure 9.

Structure of the multi-level Matthews-correlation-coefficient weighted averaging (MWA) framework.

MWA in Layer 1

In the first layer, we ensembled the predictions to generate pre-final predictions using three core HASE approaches: HASE: serial stacked attention (SSA), HASE: parallel stacked attention (PSA), and HASE: independent stacked attention (ISA). These approaches were applied to eight customized versions of pre-trained models, resulting in a total of 24 initial predictions. For the ISA approach specifically, a pre-layer combination step was introduced to aggregate the attention-integrated results for each model before proceeding to the ensembling process in Layer 1. This step ensured that the attention mechanisms were effectively integrated into the ensemble.

MWA in Layer 2

The predictions from Layer 1, reduced to three consolidated outputs (SSA, PSA, and ISA), were further ensembled in Layer 2. This final ensembling step combined the strengths of the three HASE approaches, producing the ultimate prediction output, denoted as “HASE.” This hierarchical approach enhanced the robustness and accuracy of the ensemble by iteratively refining the influence of high-performing models across layers.
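The two-layer flow can be sketched as one MWA pass per group followed by an MWA pass over the group outputs. Everything below is an illustrative sketch under assumptions: the helper names, the use of a multi-class MCC with a zero-denominator fallback, the value of `eps`, and the toy probabilities are not taken from the source, and the labels passed to `ml_mwa` stand for whichever split is used to score the classifiers.

```python
import math


def argmax(row):
    return max(range(len(row)), key=row.__getitem__)


def mcc(y_true, y_pred, n_classes):
    """Multi-class MCC; returns 0.0 when the denominator vanishes (assumed convention)."""
    s = len(y_true)
    c = sum(t == p for t, p in zip(y_true, y_pred))
    t_k = [y_true.count(k) for k in range(n_classes)]
    p_k = [y_pred.count(k) for k in range(n_classes)]
    num = c * s - sum(t * p for t, p in zip(t_k, p_k))
    den = math.sqrt((s * s - sum(p * p for p in p_k)) * (s * s - sum(t * t for t in t_k)))
    return num / den if den else 0.0


def mwa(prob_sets, y_true, eps=1e-6):
    """Single-level MWA: MCC-weighted average of several probability matrices."""
    n_classes = len(prob_sets[0][0])
    scores = [mcc(y_true, [argmax(r) for r in probs], n_classes) for probs in prob_sets]
    lowest = min(scores)
    shifted = [m - lowest + eps for m in scores]
    total = sum(shifted)
    weights = [v / total for v in shifted]
    return [[sum(w * probs[j][c] for w, probs in zip(weights, prob_sets))
             for c in range(n_classes)] for j in range(len(y_true))]


def ml_mwa(groups, y_true):
    """Layer 1: one MWA per group. Layer 2: MWA over the Layer 1 outputs."""
    layer1 = {name: mwa(models, y_true) for name, models in groups.items()}
    final = mwa(list(layer1.values()), y_true)
    return layer1, final


# Toy binary example with two hypothetical models per group.
y_true = [0, 1, 1, 0]
groups = {
    "SSA": [[[0.9, 0.1], [0.2, 0.8], [0.3, 0.7], [0.8, 0.2]],
            [[0.6, 0.4], [0.6, 0.4], [0.4, 0.6], [0.7, 0.3]]],
    "PSA": [[[0.7, 0.3], [0.4, 0.6], [0.6, 0.4], [0.9, 0.1]],
            [[0.4, 0.6], [0.3, 0.7], [0.2, 0.8], [0.6, 0.4]]],
}
layer1, final = ml_mwa(groups, y_true)
final_labels = [argmax(row) for row in final]
```

In the full framework each Layer 1 group would hold the eight per-backbone outputs of one HASE approach, and the ISA group would first pass through its pre-layer combination step.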

Pseudocode for MWA

Algorithm 1.

Matthews-correlation-coefficient weighted averaging (MWA)

Input: Prediction matrices $P = [P_{1}, P_{2}, \dots, P_{n}]$ from $n$ classifiers ( $P_{i} \in R^{m \times k}$ ), true labels $y \in R^{m}$

Output: Ensemble predictions $\hat{y} \in R^{m}$ , class probabilities $P_{ens} \in R^{m \times k}$

Convert probability predictions to hard predictions: ${\hat{y}}_{i} = \arg max (P_{i}), i = 1, \dots, n$

Compute MCC for each classifier:

{MCC}_{i} = MatthewsCorrCoef ({\hat{y}}_{i}, y)

Shift MCC values to ensure non-negative weights:

{MCC}_{i}^{'} = {MCC}_{i} - min_{j} {MCC}_{j} + ϵ

Normalize weights:

w_{i} = \frac{{MCC}_{i}^{'}}{\sum_{j = 1}^{n} {MCC}_{j}^{'}}

Compute weighted ensemble probabilities:

P_{ens} = \sum_{i = 1}^{n} w_{i} P_{i}

Obtain final predictions:

\hat{y} = \underset{c}{argmax} (P_{ens})
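The steps of Algorithm 1 map onto plain Python as sketched below. This is a sketch under stated assumptions, not the authors' code: `MatthewsCorrCoef` is taken to be the usual multi-class generalisation of MCC computed from class counts, a zero denominator is mapped to 0, `eps` defaults to a value chosen here, and the three probability matrices are hypothetical.

```python
import math


def argmax(row):
    return max(range(len(row)), key=row.__getitem__)


def multiclass_mcc(y_true, y_pred, n_classes):
    """Multi-class MCC from class counts (assumed reading of MatthewsCorrCoef)."""
    s = len(y_true)                                     # number of samples
    c = sum(t == p for t, p in zip(y_true, y_pred))     # correctly predicted samples
    t_k = [y_true.count(k) for k in range(n_classes)]   # true count per class
    p_k = [y_pred.count(k) for k in range(n_classes)]   # predicted count per class
    num = c * s - sum(t * p for t, p in zip(t_k, p_k))
    den = math.sqrt((s * s - sum(p * p for p in p_k)) * (s * s - sum(t * t for t in t_k)))
    return num / den if den else 0.0


def mwa(prob_sets, y_true, eps=1e-6):
    """Return (y_hat, p_ens, weights) for n probability matrices of shape m x k."""
    n_classes = len(prob_sets[0][0])
    # Hard predictions for every classifier.
    hard = [[argmax(row) for row in probs] for probs in prob_sets]
    # MCC of every classifier against the labels.
    scores = [multiclass_mcc(y_true, h, n_classes) for h in hard]
    # Shift by the minimum and add eps so that all weights are positive.
    lowest = min(scores)
    shifted = [m - lowest + eps for m in scores]
    # Normalize to weights that sum to one.
    total = sum(shifted)
    weights = [v / total for v in shifted]
    # Weighted average of the probability matrices.
    p_ens = [[sum(w * probs[j][c] for w, probs in zip(weights, prob_sets))
              for c in range(n_classes)] for j in range(len(y_true))]
    # Final labels.
    y_hat = [argmax(row) for row in p_ens]
    return y_hat, p_ens, weights


# Hypothetical three-class example with four samples and three classifiers.
y = [0, 1, 2, 1]
model_a = [[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7], [0.3, 0.6, 0.1]]
model_b = [[0.6, 0.3, 0.1], [0.5, 0.4, 0.1], [0.2, 0.2, 0.6], [0.2, 0.7, 0.1]]
model_c = [[0.3, 0.4, 0.3], [0.4, 0.3, 0.3], [0.3, 0.3, 0.4], [0.5, 0.3, 0.2]]
y_hat, p_ens, weights = mwa([model_a, model_b, model_c], y)
```

In this toy case the weakest classifier (`model_c`) receives a weight close to zero, and the ensemble recovers the label that `model_b` alone gets wrong.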

Numerical example of MWA

Consider a binary classification problem with two models and three test samples in Table 2.

Step 1: Predictions from each model

Model A hard predictions = [0, 1, 0]

Model B hard predictions = [0, 0, 0]

Step 2: Compute MCC scores

For Model A: predictions = [0, 1, 0] vs. labels = [0, 1, 0] $\Rightarrow {MCC}_{A} = 1.0$ (perfect)

For Model B: predictions = [0, 0, 0] vs. labels = [0, 1, 0] $\Rightarrow {MCC}_{B} = 0.0$

Step 3: Shift and normalize weights

{MCC}_{A}^{'} = 1.0 - 0.0 + ϵ = 1.0 + ϵ, {MCC}_{B}^{'} = 0.0 - 0.0 + ϵ = ϵ

w_{A} = \frac{1.0 + ϵ}{1.0 + 2 ϵ} \approx 0.999, w_{B} = \frac{ϵ}{1.0 + 2 ϵ} \approx 0.001

Step 4: Ensemble probabilities (dominated by Model A)

Final accuracy:

Accuracy = \frac{3}{3} = 100 %

Experimental results and analysis

This section provides a detailed evaluation of the classification performance of the proposed methodology. The analysis includes both quantitative metrics and visual interpretations to demonstrate the effect of employing MWA on enhancing the predictive capabilities of the HASE architectures. Through a variety of experimental results—covering multiple evaluation metrics, graphical representations, and confusion matrices—we offer a thorough comparison of the different approaches discussed in the previous sections.

Performance evaluation metrics

To comprehensively evaluate the performance of our models, several key metrics were employed: accuracy, precision, recall (sensitivity), F1-score, specificity, and ROC-AUC (receiver operating characteristic area under the curve). These metrics provided essential insights into the classification capabilities of the models. Each metric was calculated based on the confusion matrix, which classified predictions as true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN).⁴⁴ The mathematical definitions of these metrics are as follows:

Accuracy (A) = \frac{T P + T N}{T P + T N + F P + F N}

(10)

Precision (P) = \frac{T P}{T P + F P}

(11)

Recall (R) = \frac{T P}{T P + F N}

(12)

F1-score (F1) = 2 \times \frac{Precision \times Recall}{Precision + Recall}

(13)

Specificity (S) = \frac{T N}{T N + F P}

(14)

True Positive Rate (TPR) = \frac{T P}{T P + F N}

(15)

False Positive Rate (FPR) = \frac{F P}{F P + T N}

(16)

where

TP: Number of positive samples correctly predicted

TN: Number of negative samples correctly predicted

FP: Negative samples incorrectly predicted as positive

FN: Positive samples incorrectly predicted as negative⁴⁵
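Equations (10) to (16) translate directly into code. The sketch below computes them from the four confusion-matrix counts; the function name, the zero-division fallbacks, and the example counts are assumptions, and how the per-class values are averaged over the seven lesion classes is not covered here.

```python
def classification_metrics(tp, tn, fp, fn):
    """Eqs. (10)-(16) from confusion-matrix counts; empty denominators map to 0.0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "specificity": specificity,
        "tpr": recall,                                # Eq. (15) equals recall
        "fpr": fp / (fp + tn) if fp + tn else 0.0,    # Eq. (16) equals 1 - specificity
    }


# Hypothetical counts for a single class treated as "positive".
metrics = classification_metrics(tp=40, tn=45, fp=5, fn=10)
```

Note that the true positive rate coincides with recall and the false positive rate with one minus specificity, so the ROC curve is fully determined by recall and specificity at each threshold.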

Using these performance metrics, we obtained a clearer picture of how well our models generalize across different classification scenarios. This evaluation helped identify both the strengths and limitations of each approach, guiding further improvements for practical deployment.

Experimental setup

The complete framework was executed within a Kaggle notebook environment, utilizing a GPU P100 alongside a dual-core Intel Xeon CPU with a processing speed of 690 ms/step. The dataset consisted of distinct lesion images, resized to $(224, 224, 3)$ for the EfficientNetV1 models. The data was divided into three subsets: 15% for validation, 15% for testing, and the remaining images for training.

Model training was carried out over 100 epochs with a batch size of 16. The Adam optimizer was employed with an initial learning rate of 0.0001, and categorical cross-entropy was used as the loss function to support multi-class classification. To prevent overfitting and improve generalization, early stopping was applied using the Reduce-on-Plateau method, with a patience of 50 epochs.

This section examined both theoretical considerations and empirical results to evaluate classification performance. The main goal was to demonstrate the effect of MWA on enhancing the predictive accuracy of HASE architectures. By presenting experimental outcomes—including a comprehensive set of evaluation metrics, ROC-AUC curves, and confusion matrices—a detailed comparison of the various methodologies introduced earlier was achieved.

Although the proposed framework integrates multiple models and ensembling stages, it does not incur significant computational overhead, as all base models are independent and can be trained in parallel. More precisely, the eight ENv1 variants, each combined with the three attention modules, are treated as base models that can be run in parallel. Consequently, the overall training time is bounded by the most computationally intensive model (SA_ENb7, 997 seconds per epoch), while the remaining models require less time. The ensemble stage operates in constant time due to the fixed number of layers. Furthermore, early stopping and learning rate scheduling were applied to mitigate overfitting and ensure robust generalization.

Trainable parameters

As our ensemble strategy was implemented at the prediction stage, the total number of trainable parameters remained unchanged after ensembling. In contrast, during the stacking phase, where models were integrated prior to training, the overall parameter count increased substantially. Table 4 presents a detailed summary of the trainable parameters for each model.

Hyperparameter selection

Hyperparameter tuning is essential for maximizing model performance, often resulting in gains beyond baseline expectations.⁴⁶,⁴⁷ In this work, hyperparameters were carefully selected through a manual tuning procedure, guided by empirical observations and established DL practices. Key parameters such as learning rate, batch size, kernel sizes, and activation functions were systematically adjusted to improve performance while preventing overfitting. This meticulous tuning, informed by extensive experimentation, ensured a balance between computational efficiency and optimal classification outcomes.

A learning rate of 0.0001 was used with the Adam optimizer, enabling precise weight updates necessary for navigating the complex optimization space. Batch normalization was applied to stabilize training, enhance convergence speed, and reduce overfitting. The “he_normal” kernel initializer was employed to maintain proper gradient flow and support effective weight initialization. Additionally, the ReLU activation function was utilized to capture complex data patterns, further improving the model’s classification accuracy.

Performance analysis of the four augmentation strategies to determine the optimal approach

As previously described, four data augmentation strategies were implemented to address class imbalance:

NA: The dataset remained unmodified, using only the original images without generating any synthetic samples.

PrA: Synthetic images were created before dataset splitting, which could result in overlaps where both original and augmented images from the same source appeared in training, validation, and testing subsets.

TDA: Augmentation was applied solely to the training subset, keeping the validation and testing sets fully independent.

PoA: Augmentation was carried out separately on each subset (training, validation, and testing) after splitting, thereby expanding the dataset size across all partitions.
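The property that separates TDA from PrA is the order of splitting and augmenting. The sketch below shows the TDA order on image identifiers only; the function names, the 15%/15% validation and test fractions reused from the experimental setup, and the `flip_tag` stand-in for a real image transform are assumptions made for illustration.

```python
import random


def split_then_augment(sample_ids, augment, val_frac=0.15, test_frac=0.15, seed=0):
    """TDA order: split first, then augment the training subset only."""
    ids = list(sample_ids)
    random.Random(seed).shuffle(ids)
    n_val = int(len(ids) * val_frac)
    n_test = int(len(ids) * test_frac)
    val = ids[:n_val]
    test = ids[n_val:n_val + n_test]
    train = ids[n_val + n_test:]
    # Augmented copies are derived only from training images, so no augmented
    # variant of a validation or test image can leak into training.
    train_augmented = train + [augment(s) for s in train]
    return train_augmented, val, test


def flip_tag(sample_id):
    """Hypothetical augmentation: tags the identifier instead of transforming pixels."""
    return f"{sample_id}_flip"


image_ids = [f"img{i:03d}" for i in range(100)]
train, val, test = split_then_augment(image_ids, flip_tag)
train_sources = {s.replace("_flip", "") for s in train}
```

Augmenting before the split, as in PrA, would allow an original image and its augmented copy to land in different subsets, which is the overlap described above.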

These augmentation strategies were applied to customized pre-trained EfficientNetV1 models. Tables 5 and 6 report the performance metrics on both the testing dataset and the reserved independent testing set, enabling the identification of the most effective augmentation approach.

Although PrA achieved near-perfect test accuracies (e.g. PrA_ENb0: 98.11% on test data), its performance dropped considerably on independent testing data, exposing limitations in generalization. For example, PrA_ENb0 declined from 98.11% to 91.43% on independent testing, a decrease of 6.68 percentage points, whereas TDA_ENb0 maintained 90.10%, demonstrating better robustness. Similarly, PrA_ENb3 decreased by 6.61 percentage points (97.55% $\to$ 90.94%), while TDA_ENb3 improved from 86.88% to 90.58%, highlighting TDA’s stability on unseen data.

Dominance of TDA across architectures: TDA matched or outperformed the other augmentation strategies on independent data for most architectures, even when its test accuracies appeared lower. For instance, TDA_ENb2 achieved 90.70% independent accuracy compared to PrA_ENb2’s 90.58%, despite PrA showing higher test accuracy (97.87% vs. 88.83%). TDA_ENb4 improved from 87.09% (test) to 91.06% (independent), surpassing PrA_ENb4 (97.05% $\to$ 89.61%). ENb7 was an exception: TDA_ENb7 reached 89.13% independent accuracy versus PrA_ENb7’s 90.46%, although PrA’s test accuracy of 97.05% still overstated its performance on unseen data.

These results emphasize TDA’s ability to prevent overfitting, while PrA’s high test scores diminished when evaluated on independent data. The instances where TDA’s independent accuracy exceeded its test accuracy (e.g. TDA_ENb2: +1.87%) further confirm its superior generalization capability.
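The test-to-independent changes quoted above can be recomputed from the reported accuracies. The snippet below only restates figures given in the text; the dictionary layout and variable names are chosen here for illustration.

```python
# (test accuracy, independent-test accuracy) in percent, as quoted in the text.
reported = {
    "PrA_ENb0": (98.11, 91.43),
    "PrA_ENb3": (97.55, 90.94),
    "PrA_ENb4": (97.05, 89.61),
    "TDA_ENb2": (88.83, 90.70),
    "TDA_ENb3": (86.88, 90.58),
    "TDA_ENb4": (87.09, 91.06),
}

# Positive values mean the model did better on independent data than on the test split.
gaps = {name: round(independent - test, 2) for name, (test, independent) in reported.items()}

pra_gaps = [gap for name, gap in gaps.items() if name.startswith("PrA")]
tda_gaps = [gap for name, gap in gaps.items() if name.startswith("TDA")]
```

For these models every PrA gap is negative and every TDA gap is positive, in line with the interpretation given in the text.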

Failure of PrA and PoA strategies: The shortcomings of PrA became evident through metrics such as specificity. For example, PrA_ENb0 achieved 99.68% specificity on test data but dropped sharply to 86.59% on independent testing, indicating weak performance on minority classes. In contrast, TDA_ENb0 maintained 83.70% specificity, reflecting a more balanced performance across classes. Similarly, PoA strategies (e.g. PoA_ENb0: 82.25% test $\to$ 90.46% independent) exhibited inconsistent outcomes, as augmenting validation and test subsets introduced evaluation bias and hindered reliable model assessment.

While the NA strategy showed some gains over earlier results (e.g. NA_ENb0: 90.58% on independent data), it was generally outperformed by TDA. TDA_ENb0 reached 90.10% on independent testing, marginally below NA_ENb0 (90.58%), but TDA led clearly for models such as ENb4 (TDA_ENb4: 91.06% vs. NA_ENb4: 88.77%). The persistent class imbalance continued to impact NA’s precision and recall, as seen in NA_ENb3 (F1-score: 88.80% compared to TDA_ENb3: 90.35%).

These results confirm TDA as the most effective augmentation strategy. Although PrA’s high test accuracies may appear attractive, the sharp declines on independent data (e.g. PrA_ENb0: $-$ 6.68%) highlight its limitations. By consistently maintaining or improving performance on unseen data, TDA ensures reliable generalization, making it the preferred choice for practical deployment in DL-based skin lesion classification.

Performance analysis of HASE architectures in ML-MWA

The MWA was applied to the outputs of all classifiers at each layer, denoted as $MWA_{L}$ (with $L$ representing the respective layer), to construct the multi-level MWA (ML-MWA). This strategy incrementally refined predictions by utilizing a multi-stage ensemble approach.

HASE architectures in MWA_Layer 1 on testing data

The effectiveness of the SSA architectures at MWA_Layer 1 was assessed across various pre-trained models, as summarized in Table 7. The individual serial attention (SA) architectures exhibited strong classification performance, with accuracies ranging from 86.55% to 90.24%, alongside well-balanced precision, recall, and specificity values. Notably, SA_ENb3 achieved a specificity of 92.65%, demonstrating effective identification of negative classes, while SA_ENb0 and SA_ENb1 achieved accuracies of 90.02% and 89.48%, respectively. These findings indicate that the architectures perform robustly across multiple metrics, with some variations in F1-scores reflecting differences in the precision-recall trade-off.

Table 7.

Performance evaluation of serial stacked attention on testing data in MWA_Layer 1.

Algorithm	A (%)	P (%)	R (%)	F1 (%)	S (%)
SA_ENb0	90.02	89.98	90.02	89.70	91.69
SA_ENb1	89.48	89.19	89.48	89.19	90.90
SA_ENb2	88.29	88.03	88.29	88.02	90.32
SA_ENb3	90.24	90.21	90.24	90.09	92.65
SA_ENb4	87.20	87.19	87.20	87.13	92.17
SA_ENb5	88.39	88.29	88.39	88.14	89.72
SA_ENb6	86.55	86.63	86.55	86.43	91.09
SA_ENb7	87.31	86.70	87.31	86.73	89.66
SSA ($MWA_{1}$)	92.95	92.88	92.95	92.68	92.21

MWA: Matthews-correlation-coefficient weighted averaging; SA: serial attention; SSA: serial stacked attention.

The ensembled SSA architecture at MWA_Layer 1 demonstrated a marked improvement over individual models, achieving an accuracy of 92.95%, precision of 92.88%, and recall of 92.95%. Its F1-score of 92.68% highlighted a well-balanced performance, and a specificity of 92.21% confirmed the model’s effectiveness in correctly identifying negative samples. This enhanced performance illustrated that serial stacked attention allowed the model to better capture contextual relationships, resulting in more robust and accurate classification. The application of the MWA ensemble strategy further amplified these gains, emphasizing SSA’s effectiveness for complex classification tasks.

The performance of the parallel stacked attention (PSA) architectures at MWA_Layer 1 was assessed using the same set of pre-trained models, as summarized in Table 8. Individual parallel attention (PA) architectures exhibited consistent results, with accuracies ranging from 87.09% to 90.24% and balanced precision, recall, and specificity values. Notably, PA_ENb2 achieved the highest accuracy (90.24%) with a specificity of 90.74%, while PA_ENb0 achieved 88.39% accuracy and the highest specificity (90.98%). These findings indicate that the individual PA architectures perform reliably across multiple metrics, with some variability in F1-scores reflecting differences in precision–recall balance.

Table 8.
Performance evaluation of parallel stacked attention on testing data in MWA_Layer 1.

Algorithm	A (%)	P (%)	R (%)	F1 (%)	S (%)
PA_ENb0	88.39	88.13	88.39	88.09	90.98
PA_ENb1	88.72	88.52	88.72	88.44	89.64
PA_ENb2	90.24	90.05	90.24	89.99	90.74
PA_ENb3	88.29	88.16	88.29	87.83	89.20
PA_ENb4	87.09	87.09	87.09	86.96	90.46
PA_ENb5	87.53	87.28	87.53	87.20	88.67
PA_ENb6	88.07	87.72	88.07	87.60	87.38
PA_ENb7	87.64	87.68	87.64	87.35	90.78
PSA ($MWA_{1}$)	92.41	92.34	92.41	92.13	92.13

MWA: Matthews-correlation-coefficient weighted averaging; PA: parallel attention; PSA: parallel stacked attention.

The ensembled PSA architecture at MWA_Layer 1 markedly outperformed the individual architectures, achieving an accuracy of 92.41%, precision of 92.34%, and recall of 92.41%. The F1-score of 92.13% reflected a well-balanced performance, while the specificity of 92.13% indicated the model’s ability to correctly identify negative samples. This improvement demonstrated that parallel stacked attention effectively enhanced the model’s capability to process and leverage contextual information. Furthermore, the MWA ensembling successfully combined the strengths of individual architectures, resulting in overall performance gains. Consequently, the PSA architecture proved highly effective for improving classification performance in complex tasks.

For evaluating the performance of HASE: ISA in Layer 1, the results of the attention-integrated networks were first ensembled for each EfficientNetV1 variant. In the pre-Layer 1 stage, the outputs of the three TA mechanisms were combined into one IA result per variant (IA_ENb0 to IA_ENb7 in Table 9). These were subsequently ensembled into a single ISA outcome. Detailed metrics of this ensembling process are presented in Table 9.

Table 9.
Performance evaluation of independent attention on testing data in MWA_Layer 1.

Algorithm	A (%)	P (%)	R (%)	F1 (%)	S (%)
SAI_ENb0	86.98	86.92	86.98	86.77	89.18
CAI_ENb0	88.61	88.56	88.61	88.33	90.43
SEAI_ENb0	85.79	85.74	85.79	85.51	89.22
IA_ENb0 ($MWA_{0}$)	88.50	88.53	88.50	88.16	90.03
SAI_ENb1	88.83	88.93	88.83	88.50	90.42
CAI_ENb1	87.85	87.58	87.85	87.45	88.76
SEAI_ENb1	85.90	85.72	85.90	85.67	88.28
IA_ENb1 ($MWA_{0}$)	88.72	88.70	88.72	88.42	90.21
SAI_ENb2	89.59	89.41	89.59	89.25	88.89
CAI_ENb2	88.07	87.77	88.07	87.81	87.98
SEAI_ENb2	87.53	87.53	87.53	87.29	89.19
IA_ENb2 ($MWA_{0}$)	89.70	89.53	89.70	89.35	88.72
SAI_ENb3	86.77	86.62	86.77	86.42	88.29
CAI_ENb3	86.55	86.15	86.55	86.20	89.00
SEAI_ENb3	87.20	87.81	87.20	87.25	92.28
IA_ENb3 ($MWA_{0}$)	87.42	88.03	87.42	87.46	92.50
SAI_ENb4	86.55	86.46	86.55	86.27	89.94
CAI_ENb4	86.98	86.80	86.98	86.79	89.86
SEAI_ENb4	86.44	86.30	86.44	86.18	88.24
IA_ENb4 ($MWA_{0}$)	87.74	87.58	87.74	87.55	90.52
SAI_ENb5	86.12	85.62	86.12	85.57	86.38
CAI_ENb5	85.68	85.19	85.68	85.26	86.93
SEAI_ENb5	86.44	86.16	86.44	85.90	86.43
IA_ENb5 ($MWA_{0}$)	86.98	86.66	86.98	86.47	86.48
SAI_ENb6	85.47	85.25	85.47	84.99	86.14
CAI_ENb6	86.12	85.86	86.12	85.66	87.78
SEAI_ENb6	84.49	84.35	84.49	84.01	85.72
IA_ENb6 ($MWA_{0}$)	86.66	86.44	86.66	86.21	87.85
SAI_ENb7	87.09	86.91	87.09	86.55	88.02
CAI_ENb7	86.98	86.44	86.98	86.32	87.08
SEAI_ENb7	86.12	86.08	86.12	85.98	89.27
IA_ENb7 ($MWA_{0}$)	87.85	87.54	87.85	87.29	88.34
ISA ($MWA_{1}$)	91.65	91.50	91.65	91.29	91.29

SAI: soft attention integration; CAI: channel attention integration; SEAI: squeeze-excitation attention integration; IA: independent attention; ISA: independent stacked attention; MWA: Matthews-correlation-coefficient weighted averaging.

The table reports performance metrics for each TA mechanism and for the ensembled independent attention (IA) result of each architecture. For example, IA_ENb2 achieved an accuracy of 89.70%, precision of 89.53%, and recall of 89.70%, with an F1-score of 89.35% and specificity of 88.72%. Similarly, IA_ENb0 and IA_ENb1 recorded accuracies of 88.50% and 88.72%, respectively. The ISA architecture at MWA_Layer 1 consolidated these results, achieving an accuracy of 91.65%, precision of 91.50%, recall of 91.65%, F1-score of 91.29%, and specificity of 91.29%.

This process demonstrated that ensembling the TA mechanisms enhanced model performance: the ISA architecture at MWA_Layer 1 reached 91.65% accuracy, compared with at most 89.70% (IA_ENb2) for any single-backbone IA ensemble. The integration of independent attention mechanisms effectively combined the strengths of individual architectures, leading to improved performance across multiple metrics. These results highlighted the robustness and generalizability of the ISA approach in handling complex classification tasks.

HASE architectures in MWA_Layer 1 with independent testing data

The performance of the HASE architectures in MWA_Layer 1 was evaluated using independent test data, with results presented in Tables 10 to 12. These tables provide a comprehensive comparison of the SSA, PSA, and ISA architectures under MWA ensembling, highlighting their effectiveness on completely unseen data.

Table 10.
Performance evaluation of serial stacked attention on independent test data in MWA_Layer 1.

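| Algorithm | Accuracy (%) | Precision (%) | Recall (%) | F1-score (%) | Specificity (%) |
|---|---|---|---|---|---|
| SA_ENb0 | 90.46 | 89.69 | 90.46 | 89.84 | 81.72 |
| SA_ENb1 | 91.91 | 91.56 | 91.91 | 91.63 | 86.63 |
| SA_ENb2 | 91.18 | 91.23 | 91.18 | 91.16 | 90.43 |
| SA_ENb3 | 92.03 | 91.96 | 92.03 | 91.68 | 88.53 |
| SA_ENb4 | 91.06 | 91.38 | 91.06 | 91.12 | 91.36 |
| SA_ENb5 | 91.91 | 91.73 | 91.91 | 91.72 | 87.11 |
| SA_ENb6 | 91.06 | 91.04 | 91.06 | 90.90 | 89.91 |
| SA_ENb7 | 89.37 | 89.63 | 89.37 | 89.31 | 86.97 |
| SSA ($MWA_{1}$) | 93.48 | 93.31 | 93.48 | 93.28 | 91.02 |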

MWA: Matthews-correlation-coefficient weighted averaging; SA: serial attention; SSA: serial stacked attention.

Table 11.
Performance evaluation of parallel stacked attention on independent test data in MWA_Layer 1.

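| Algorithm | Accuracy (%) | Precision (%) | Recall (%) | F1-score (%) | Specificity (%) |
|---|---|---|---|---|---|
| PA_ENb0 | 90.22 | 89.46 | 90.22 | 89.77 | 84.17 |
| PA_ENb1 | 90.46 | 89.84 | 90.46 | 89.91 | 82.70 |
| PA_ENb2 | 90.94 | 90.19 | 90.94 | 90.30 | 82.73 |
| PA_ENb3 | 91.67 | 91.41 | 91.67 | 91.41 | 88.06 |
| PA_ENb4 | 91.30 | 91.26 | 91.30 | 91.20 | 88.01 |
| PA_ENb5 | 91.55 | 91.10 | 91.55 | 91.23 | 85.67 |
| PA_ENb6 | 91.55 | 90.94 | 91.55 | 91.14 | 85.69 |
| PA_ENb7 | 90.94 | 90.96 | 90.94 | 90.87 | 87.04 |
| PSA ($MWA_{1}$) | 93.12 | 92.88 | 93.12 | 92.78 | 87.17 |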

MWA: Matthews-correlation-coefficient weighted averaging; PA: parallel attention; PSA: parallel stacked attention.

Table 12.
Performance evaluation of independent attention on independent test data in MWA_Layer 1.

| Algorithm | Accuracy (%) | Precision (%) | Recall (%) | F1-score (%) | Specificity (%) |
|---|---|---|---|---|---|
| SAI_ENb0 | 91.18 | 90.92 | 91.18 | 90.91 | 84.20 |
| CAI_ENb0 | 90.22 | 89.64 | 90.22 | 89.86 | 85.11 |
| SEAI_ENb0 | 91.30 | 91.48 | 91.30 | 91.32 | 88.51 |
| IA_ENb0 ($MWA_{0}$) | 92.15 | 91.84 | 92.15 | 91.89 | 87.12 |
| SAI_ENb1 | 89.37 | 89.06 | 89.37 | 89.14 | 83.63 |
| CAI_ENb1 | 91.43 | 91.23 | 91.43 | 91.00 | 84.66 |
| SEAI_ENb1 | 90.10 | 90.34 | 90.10 | 90.06 | 86.08 |
| IA_ENb1 ($MWA_{0}$) | 91.55 | 91.46 | 91.55 | 91.18 | 85.61 |
| SAI_ENb2 | 90.22 | 89.18 | 90.22 | 89.53 | 82.25 |
| CAI_ENb2 | 90.46 | 89.69 | 90.46 | 89.91 | 82.26 |
| SEAI_ENb2 | 89.86 | 89.88 | 89.86 | 89.69 | 84.62 |
| IA_ENb2 ($MWA_{0}$) | 90.70 | 89.90 | 90.70 | 90.16 | 82.75 |
| SAI_ENb3 | 91.30 | 91.05 | 91.30 | 91.14 | 86.61 |
| CAI_ENb3 | 90.46 | 90.24 | 90.46 | 90.19 | 84.65 |
| SEAI_ENb3 | 90.58 | 91.30 | 90.58 | 90.51 | 91.24 |
| IA_ENb3 ($MWA_{0}$) | 91.79 | 91.56 | 91.79 | 91.63 | 88.55 |
| SAI_ENb4 | 90.70 | 90.30 | 90.70 | 90.40 | 84.18 |
| CAI_ENb4 | 91.55 | 91.37 | 91.55 | 91.32 | 86.62 |
| SEAI_ENb4 | 91.06 | 91.00 | 91.06 | 91.02 | 87.54 |
| IA_ENb4 ($MWA_{0}$) | 91.55 | 91.31 | 91.55 | 91.31 | 85.66 |
| SAI_ENb5 | 90.58 | 90.13 | 90.58 | 90.20 | 85.59 |
| CAI_ENb5 | 90.22 | 89.94 | 90.22 | 90.01 | 84.62 |
| SEAI_ENb5 | 90.94 | 90.62 | 90.94 | 90.64 | 85.64 |
| IA_ENb5 ($MWA_{0}$) | 91.43 | 91.14 | 91.43 | 91.14 | 87.10 |
| SAI_ENb6 | 91.18 | 90.73 | 91.18 | 90.71 | 85.14 |
| CAI_ENb6 | 91.67 | 91.22 | 91.67 | 91.25 | 87.59 |
| SEAI_ENb6 | 89.61 | 89.73 | 89.61 | 89.39 | 84.58 |
| IA_ENb6 ($MWA_{0}$) | 91.79 | 91.54 | 91.79 | 91.38 | 86.64 |
| SAI_ENb7 | 90.46 | 90.36 | 90.46 | 90.40 | 86.59 |
| CAI_ENb7 | 90.10 | 89.13 | 90.10 | 89.36 | 79.85 |
| SEAI_ENb7 | 89.73 | 90.03 | 89.73 | 89.78 | 88.39 |
| IA_ENb7 ($MWA_{0}$) | 90.82 | 90.65 | 90.82 | 90.72 | 87.07 |
| ISA ($MWA_{1}$) | 93.84 | 93.72 | 93.84 | 93.56 | 90.07 |

SAI: soft attention integration; CAI: channel attention integration; SEAI: squeeze-excitation attention integration; IA: independent attention; ISA: independent stacked attention; MWA: Matthews-correlation-coefficient weighted averaging.

Table 10 showcases the performance of the SSA architectures across various models. For example, SA_ENb0 achieved an accuracy of 90.46%, with precision and recall values of 89.69% and 90.46%, respectively. Similarly, SA_ENb1 demonstrated strong performance with an accuracy of 91.91%, precision of 91.56%, and recall of 91.91%. The ensembled SSA mechanism in MWA_Layer 1 achieved the highest accuracy in the table, 93.48%, along with precision of 93.31%, recall of 93.48%, and an F1-score of 93.28%. These results indicated that the SSA mechanism effectively leveraged serial stacked attention to enhance model performance on independent test data.

Table 11 presents the results for the PSA architectures. PA_ENb1 and PA_ENb2 achieved accuracies of 90.46% and 90.94%, respectively, with balanced precision and recall values. PA_ENb3 also performed well, achieving an accuracy of 91.67% and an F1-score of 91.41%. The ensembled PSA mechanism in MWA_Layer 1 achieved an accuracy of 93.12%, precision of 92.88%, recall of 93.12%, and an F1-score of 92.78%. These results demonstrated that the PSA mechanism, which processes attention in parallel, delivered robust performance on independent test data.

Table 12 highlights the performance of the ISA architectures, which combined the strengths of each attention mechanism individually. For instance, IA_ENb3 achieved an accuracy of 91.79%, precision of 91.56%, and recall of 91.79%. Similarly, IA_ENb6 demonstrated strong performance with an accuracy of 91.79%, precision of 91.54%, and recall of 91.79%. The ensembled ISA mechanism in MWA_Layer 1 achieved the highest accuracy of the three stacking schemes, 93.84%, along with precision of 93.72%, recall of 93.84%, and an F1-score of 93.56%. These results underscore the effectiveness of the ISA mechanism in integrating multiple attention strategies to enhance model performance.

The evaluation of HASE architectures on independent test data revealed that all three architectures—SSA, PSA, and ISA—delivered strong performance, with the ISA architectures achieving the highest accuracy and F1-score. These results demonstrate the robustness of the MWA framework in effectively combining different attention strategies to enhance model performance on unseen data.

HASE architectures in MWA_Layer 2

The performance of the HASE architecture at MWA_Layer 2 was comprehensively evaluated through Tables 13 and 14, which present the results from the final ensembling layer. These tables demonstrate the effectiveness of combining multiple attention mechanisms in a hierarchical framework to achieve superior classification performance.

Table 13.
Performance evaluation of HASE on test data in MWA_Layer 2.

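| Algorithm | Accuracy (%) | Precision (%) | Recall (%) | F1-score (%) | Specificity (%) |
|---|---|---|---|---|---|
| SSA ($MWA_{1}$) | 92.95 | 92.88 | 92.95 | 92.68 | 92.21 |
| PSA ($MWA_{1}$) | 92.41 | 92.34 | 92.41 | 92.13 | 92.13 |
| ISA ($MWA_{1}$) | 91.65 | 91.50 | 91.65 | 91.29 | 91.29 |
| HASE ($MWA_{2}$) | 93.17 | 93.09 | 93.17 | 92.90 | 93.01 |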

SSA: serial stacked attention; PSA: parallel stacked attention; ISA: independent stacked attention; MWA: Matthews-correlation-coefficient weighted averaging; HASE: hierarchical attention stacked ensemble.

Table 14.
Performance evaluation of HASE on independent test data in MWA_Layer 2.

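| Algorithm | Accuracy (%) | Precision (%) | Recall (%) | F1-score (%) | Specificity (%) |
|---|---|---|---|---|---|
| SSA ($MWA_{1}$) | 93.48 | 93.31 | 93.48 | 93.28 | 91.02 |
| PSA ($MWA_{1}$) | 93.12 | 92.88 | 93.12 | 92.78 | 87.17 |
| ISA ($MWA_{1}$) | 93.84 | 93.72 | 93.84 | 93.56 | 90.07 |
| HASE ($MWA_{2}$) | 93.96 | 93.77 | 93.96 | 93.71 | 88.64 |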

SSA: serial stacked attention; PSA: parallel stacked attention; ISA: independent stacked attention; MWA: Matthews-correlation-coefficient weighted averaging; HASE: hierarchical attention stacked ensemble.

Table 13 reveals the comparative performance of the three pre-final architectures before final ensembling. The SSA achieved an accuracy of 92.95% with an F1-score of 92.68%, while the PSA showed slightly lower performance at 92.41% accuracy. The ISA followed with 91.65% accuracy and an F1-score of 91.29%. These metrics establish a performance baseline for the individual components prior to their integration in the final layer.

The benefit of the HASE ensemble became evident as the combined architecture outperformed all previous-layer architectures. With 93.17% accuracy, 93.09% precision, and a 92.90% F1-score, HASE demonstrated the synergistic effect of integrating multiple attention strategies. The specificity of 93.01%, the highest in Table 13, further confirmed the model’s ability to correctly reject samples that do not belong to a given class, showcasing balanced performance across all evaluation metrics.

Table 14 presents the crucial validation of these architectures on independent test data, providing a rigorous assessment of generalization capability on completely unseen data. The pre-final architectures maintained strong performance, with ISA leading at 93.84% accuracy, followed by SSA at 93.48% and PSA at 93.12%. This consistency between validation and independent test results confirmed the robustness of each attention approach when faced with unseen data.

The most significant result emerged in the final HASE ensemble performance on independent data, which reached 93.96% accuracy and a 93.71% F1-score. This was a modest gain in accuracy, precision, recall, and F1-score over the best pre-final architecture (ISA, with 93.84% accuracy and a 93.56% F1-score), suggesting that hierarchical ensembling captured the complementary strengths of each approach. The model maintained high precision (93.77%) and recall (93.96%), indicating balanced performance without significant trade-offs between these metrics.

These results collectively demonstrated that the MWA framework’s layered approach to integrating attention mechanisms yielded substantial benefits. By progressively combining SSA, PSA, and ISA through hierarchical ensembling, the final HASE architecture achieved superior performance that exceeded what any single attention mechanism or stacking architecture could accomplish independently. The consistent results across both validation and independent test sets provided strong evidence of the model’s robustness and generalization capability in complex classification tasks.

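Notably, accuracy, precision, recall, and F1-score on the unseen independent data matched or slightly surpassed those on the standard testing data (for example, 93.96% versus 93.17% accuracy), although specificity was lower (88.64% versus 93.01%). This indicates that the architecture generalizes well and is a promising candidate for real-world skin lesion identification.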

Results with confidence interval (CI)

The comparative performance of the pre-final and final layers of the proposed architecture on the independent test data is summarized in Table 15, where each metric is reported with its corresponding 95% CI, computed from the size of the test set.

Table 15.
Performance metrics of pre-final and final layers with 95% confidence intervals.

| Algorithm | Accuracy (%) | Precision (%) | Recall (%) | F1-score (%) | Specificity (%) |
|---|---|---|---|---|---|
| $MWA_{1}$ (SSA) | $93.48 \pm 1.47$ | $93.31 \pm 1.51$ | $93.48 \pm 1.47$ | $93.28 \pm 1.51$ | $91.02 \pm 2.10$ |
| $MWA_{1}$ (PSA) | $93.12 \pm 1.36$ | $92.88 \pm 1.39$ | $93.12 \pm 1.36$ | $92.78 \pm 1.39$ | $87.17 \pm 1.77$ |
| $MWA_{1}$ (ISA) | $93.84 \pm 1.39$ | $93.72 \pm 1.42$ | $93.84 \pm 1.39$ | $93.56 \pm 1.43$ | $90.07 \pm 2.13$ |
| $MWA_{2}$ (HASE) | $93.96 \pm 1.34$ | $93.77 \pm 1.39$ | $93.96 \pm 1.34$ | $93.71 \pm 1.39$ | $88.64 \pm 1.88$ |

SSA: serial stacked attention; PSA: parallel stacked attention; ISA: independent stacked attention; MWA: Matthews-correlation-coefficient weighted averaging; HASE: hierarchical attention stacked ensemble.
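As a rough illustration of how a 95% interval depends on the test-set size, the sketch below uses the normal-approximation (Wald) formula for a proportion. The source does not state which interval method or sample size produced Table 15, so the formula choice, the function name `wald_half_width`, and the value of `n` are all assumptions, and the output is not expected to reproduce the table.

```python
import math


def wald_half_width(metric_percent, n, z=1.96):
    """Half-width (in percentage points) of a normal-approximation CI.

    metric_percent: a proportion-type metric expressed in percent.
    n: number of test samples (an assumed value, not given in the text).
    z: normal quantile; 1.96 corresponds to a 95% interval.
    """
    p = metric_percent / 100.0
    return 100.0 * z * math.sqrt(p * (1.0 - p) / n)


# Hypothetical example: a 93.96% accuracy measured on 1000 test images.
half_width = wald_half_width(93.96, n=1000)
print(f"93.96 +/- {half_width:.2f}")
```

The half-width shrinks with the square root of `n`, which is why small test sets produce noticeably wider intervals for the same point estimate.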

Performance analysis by visualization

To streamline the analysis, confusion matrices were not presented for every classifier due to the diversity of models. Instead, we focused on the final layer of the MWA model, with the corresponding confusion matrices shown in Figures 10 and 11, effectively highlighting per-class accuracy and misclassification rates.

Figure 10.
Confusion matrix and ROC-AUC curve obtained by HASE architecture in MWA_Layer 2: (a) confusion matrix and (b) ROC-AUC curve. ROC-AUC: area under the receiver-operating characteristic curve; MWA: Matthews-correlation-coefficient weighted averaging; HASE: hierarchical attention stacked ensemble.

Figure 11.
Confusion matrix and ROC-AUC curve obtained by HASE architecture in MWA_Layer 2 on independent test data: (a) confusion matrix and (b) ROC-AUC curve. ROC-AUC: area under the receiver-operating characteristic curve; MWA: Matthews-correlation-coefficient weighted averaging; HASE: hierarchical attention stacked ensemble.

Similarly, ROC-AUC curves were examined to provide additional insights into model performance. Consistent with the confusion matrices, we presented ROC-AUC curves only for the ML-MWA model in Figures 10 and 11 for uniformity.

The final HASE architecture ($MWA_{2}$) demonstrated strong performance across most classes. For instance, the BCC class was classified with high accuracy, with 46 out of 49 samples correct, while the DF class had eight out of 11 samples correct, with three misclassifications. The NV class performed exceptionally well, with 598 out of 605 samples correctly classified, and the VASC class achieved 13 correct predictions with a single error. The BKL class correctly identified 93 out of 104 samples. The MEL class was more challenging, with 84 correct classifications and 24 errors, and the AK class was the most difficult, with 17 samples correctly classified and 14 misclassified. Overall, the HASE architecture exhibited high accuracy on the majority class and most minority classes, with AK and MEL accounting for most of the remaining errors.

The ROC-AUC scores further highlighted the effectiveness of the HASE ($MWA_{2}$) model. Even the AK class, which recorded the lowest AUC, achieved a strong 0.946, demonstrating reliable classification. The VASC class attained a perfect score of 1, while the BCC class exhibited near-perfect performance with 0.998. The NV and BKL classes also performed exceptionally well, achieving 0.989 and 0.984, respectively. Despite being a challenging category, the MEL class reached an impressive 0.977, and the DF class obtained a solid 0.952. These consistently high ROC-AUC values across all classes underscore the precision, robustness, and stability of the HASE architecture.

The HASE ($MWA_{2}$) architecture demonstrated superior performance on independent test data, outperforming earlier layers across all classes.

The VASC class achieved near-perfect accuracy, correctly classifying nine out of 10 samples with only one error and attaining an AUC of 1. The DF class had four out of six samples correctly identified, with two misclassifications, and achieved an AUC of 0.996.

The AK class showed a more mixed outcome, with 14 correct classifications and nine misclassifications, yet maintained a strong AUC of 0.988, reflecting solid discriminative ability. The NV class excelled, correctly classifying 654 out of 663 samples and achieving an AUC of 0.992, demonstrating robust handling of the majority class.

The BKL class performed strongly, correctly identifying 58 out of 66 samples with a high AUC of 0.986. The BCC class had 18 out of 26 samples correctly classified while attaining a robust AUC of 0.993.

Finally, the MEL class, which presented the greatest classification challenge, still achieved 21 correct classifications out of 34 samples, with an AUC of 0.974, demonstrating an improvement over previous architectures.

Overall, the HASE ($MWA_{2}$) architecture delivered high overall accuracy and consistently high AUC values on the independent test data, although per-class accuracy remained lower for the small DF, AK, BCC, and MEL classes.
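The per-class counts quoted above can be turned into per-class recall and overall accuracy with a few lines of Python. This is only a sketch: it assumes that each "misclassified" count refers to samples whose true label is that class (one row of the confusion matrix), and the dictionary and function names are illustrative rather than taken from the authors' code.

```python
# (correct, total) per true class, copied from the counts quoted in the text.
test_counts = {
    "BCC": (46, 49), "DF": (8, 11), "NV": (598, 605), "AK": (17, 31),
    "VASC": (13, 14), "BKL": (93, 104), "MEL": (84, 108),
}
independent_counts = {
    "VASC": (9, 10), "DF": (4, 6), "AK": (14, 23), "NV": (654, 663),
    "BKL": (58, 66), "BCC": (18, 26), "MEL": (21, 34),
}


def per_class_recall(counts):
    """Recall (%) of each class: correct predictions / samples of that class."""
    return {cls: 100.0 * correct / total for cls, (correct, total) in counts.items()}


def overall_accuracy(counts):
    """Accuracy (%) over all classes: total correct / total samples."""
    correct = sum(c for c, _ in counts.values())
    total = sum(t for _, t in counts.values())
    return 100.0 * correct / total


for name, counts in [("test", test_counts), ("independent", independent_counts)]:
    print(name, f"accuracy = {overall_accuracy(counts):.2f}%")
    for cls, recall in sorted(per_class_recall(counts).items()):
        print(f"  {cls}: {recall:.1f}%")
```

Under that assumption the overall accuracies come out at 93.17% for the test data and 93.96% for the independent test data, which agrees with the HASE rows of Tables 13 and 14, while the per-class view makes the weaker AK and MEL results easy to see.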

GradCAM for interpretability

GradCAM was applied by first computing the gradients of the target class score with respect to the feature maps of a convolutional layer. The obtained gradients were then globally averaged to produce neuron importance weights, which were subsequently multiplied with the corresponding feature maps of the convolutional layer. The resulting weighted combination was passed through a ReLU activation to generate the final class-specific heatmap. This heatmap was then upsampled to match the input image dimensions and overlaid on the original image, visually highlighting the regions that contributed most significantly to the model’s prediction.

Figure 12 presents representative GradCAM visualizations for selected classes in the HAM10000 dataset. These visualizations revealed that the HASE ($MWA_{2}$) architecture consistently focused on clinically relevant regions of skin lesions, such as irregular borders, pigmentation patterns, and texture variations, thereby confirming the model’s interpretability and alignment with domain knowledge.

Figure 12.
Step-by-step implementation of gradient class activation map.

The gradients were spatially pooled by averaging across each feature map channel to determine their relative importance for the target class. These pooled gradients were then used to weight the activation maps of the final convolutional layer, and the resulting weighted activations were aggregated to produce a class-specific activation heatmap. The heatmap was normalized to a [0, 1] range to enhance visualization and overlaid on the original input image using a colormap, clearly highlighting the regions that influenced the model’s classification decision.
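The pooling, weighting, ReLU, and normalization steps described above can be sketched in plain Python. This is a minimal illustration, not the authors' implementation: it assumes the activations and gradients of one convolutional layer have already been extracted for the target class, uses nested lists to stay dependency-free, and leaves out the upsampling and colormap overlay. The function name `grad_cam` and the toy numbers are hypothetical.

```python
def grad_cam(activations, gradients):
    """Return an H x W heatmap in [0, 1].

    activations, gradients: K feature maps, each an H x W nested list,
    taken from the same convolutional layer for one target class.
    """
    height, width = len(activations[0]), len(activations[0][0])
    # 1. Spatially average each channel's gradients to get its importance weight.
    weights = [sum(sum(row) for row in g) / (height * width) for g in gradients]
    # 2. Weight the activation maps, sum over channels, and apply ReLU.
    cam = [[max(0.0, sum(w * a[i][j] for w, a in zip(weights, activations)))
            for j in range(width)] for i in range(height)]
    # 3. Min-max normalize to [0, 1] for visualization.
    lo = min(min(row) for row in cam)
    hi = max(max(row) for row in cam)
    if hi == lo:
        return [[0.0] * width for _ in range(height)]
    return [[(v - lo) / (hi - lo) for v in row] for row in cam]


# Toy example: two 2 x 2 feature maps with channel weights +1 and -1.
activations = [[[1.0, 2.0], [3.0, 4.0]], [[1.0, 1.0], [1.0, 1.0]]]
gradients = [[[1.0, 1.0], [1.0, 1.0]], [[-1.0, -1.0], [-1.0, -1.0]]]
heatmap = grad_cam(activations, gradients)
print(heatmap)
```

In practice the same arithmetic would normally be done with array operations inside a deep learning framework; the list-based form is used here only to keep each step visible.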

To evaluate the model’s attention across different categories, GradCAM visualizations were generated for representative images from each class. The resulting heatmaps demonstrated that the model effectively focused on salient regions, such as lesions in medical images, thereby confirming its ability to extract clinically meaningful features.

However, GradCAM has inherent limitations. Because it relies on the model’s predictions, misclassifications can yield misleading heatmaps. Furthermore, for complex or subtle patterns—such as ambiguous skin lesions—GradCAM may occasionally highlight irrelevant regions, potentially reducing interpretability. These limitations underscore the importance of complementing GradCAM with rigorous quantitative evaluations to ensure reliable and trustworthy model insights.

In Figure 13, GradCAM visualizations are presented for all seven classes, illustrating how the model selectively focused on the most discriminative regions rather than the entire image. This targeted attention contributed to enhanced classification accuracy and highlighted the effectiveness of our approach. The figure displays the original image alongside the corresponding GradCAM and region of interest (ROI), providing clear interpretability of the model’s decision-making process.

Figure 13.
GradCAM visualization for each class: (a, d, g, j, m, p, s) original images for NV, MEL, BKL, BCC, AK, VASC, and DF; (b, e, h, k, n, q, t) GradCAM heatmaps; and (c, f, i, l, o, r, u) ROI. NV: nevus; MEL: melanoma; BKL: benign keratosis; BCC: basal cell carcinoma; AK: actinic keratosis; VASC: vascular lesions; DF: dermatofibroma; GradCAM: gradient class activation map; ROI: region of interest.

A key advantage of GradCAM is its ability to support checks of model reliability. When the heatmap aligns with the relevant region, it typically accompanies an accurate classification, whereas misaligned heatmaps often indicate misclassifications. By integrating multiple models through our multi-level MWA ensemble, the final predictions achieved superior accuracy. The GradCAM visualizations corroborated this improvement, demonstrating that the ensemble effectively compensated for the limitations of individual classifiers. This emphasizes the superiority of the proposed multi-level MWA framework, showcasing its ability to produce precise predictions even in challenging cases, thereby reinforcing the contribution of our methodology.

Answers to the RQs

Answer to RQ1: To address severe class imbalance, this study evaluated four data augmentation strategies: NA, which retained the original imbalanced dataset but risked poor minority-class performance; PrA, which generated synthetic data before splitting but introduced potential data leakage; TDA, which augmented only the training set to preserve the integrity of validation and testing subsets; and PoA, which augmented all subsets but distorted evaluation results. Among these, TDA emerged as the optimal strategy, as it effectively balanced class representation during training by synthesizing minority-class samples while maintaining independent, unaltered validation and testing sets. This ensured robust generalization without contamination. The combination of TDA with HASE architectures further enhanced performance, establishing it as the most effective solution for mitigating class imbalance while upholding rigorous evaluation standards.
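The difference between augmenting before and after the split can be made concrete with a small sketch of the training-only strategy. It is an illustration under stated assumptions, not the study's pipeline: the split is a simple random split, balancing is done by oversampling each class up to the largest class, and `augment` is a hypothetical placeholder for whatever image transformation is used.

```python
import random
from collections import Counter


def split_then_augment(samples, labels, test_fraction=0.2, seed=0,
                       augment=lambda x: x):
    """Split first, then balance classes in the training part only."""
    rng = random.Random(seed)
    indices = list(range(len(samples)))
    rng.shuffle(indices)
    n_test = int(len(indices) * test_fraction)
    # The held-out part is fixed before any synthetic sample exists.
    test = [(samples[i], labels[i]) for i in indices[:n_test]]
    train = [(samples[i], labels[i]) for i in indices[n_test:]]

    counts = Counter(label for _, label in train)
    target = max(counts.values())
    balanced = list(train)
    for cls, n in counts.items():
        pool = [x for x, label in train if label == cls]
        # Synthesize extra samples only from training images of this class.
        for _ in range(target - n):
            balanced.append((augment(rng.choice(pool)), cls))
    return balanced, test


# Toy data: integer ids stand in for images; 90 NV versus 10 MEL.
samples = list(range(100))
labels = ["NV"] * 90 + ["MEL"] * 10
train, test = split_then_augment(samples, labels)
print(Counter(label for _, label in train), len(test))
```

Because every synthetic sample is derived from a training image, no variant of a held-out image can reach the training set, which is the leakage risk attributed to augmenting before the split.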

Answer to RQ2: To optimize TL models for the specific classification tasks, diverse variants of EfficientNetV1 were employed, enabling exploration of a broad architectural design space to identify task-specific strengths. Customized fine-tuning involved replacing and training final layers to align with task requirements, ensuring adaptability without compromising the pre-trained feature extraction capabilities. Parameter quantization improved computational efficiency while maintaining performance, and integrating HASE architectures with TA further enhanced robustness. This dual strategy—leveraging architectural diversity for comprehensive feature representation and targeted fine-tuning for task-specific optimization—ensured high adaptability and efficiency. By balancing pre-trained knowledge retention with domain-specific adjustments, the approach achieved superior effectiveness, demonstrating that optimal TL performance depends on strategic architecture selection, parameter-efficient training, and context-aware adaptation.

Answer to RQ3: The TA mechanism, incorporated into the HASE model, integrates three specialized attention modules (SAI, CAI, and SEAI) to hierarchically identify and prioritize critical features. SAI dynamically highlights salient spatial regions in feature maps, guiding the model to focus on discriminative local patterns. CAI refines channel-wise importance, amplifying informative filters while suppressing redundant ones. SEAI adaptively recalibrates channel responses through squeeze-and-excitation operations, enabling context-aware feature refinement. Together, these modules mitigate the risk of overlooking hierarchical relationships (common in simpler models) or overfitting (in overly complex architectures). By synergistically balancing spatial sensitivity, channel relevance, and cross-layer contextualization, the TA-equipped HASE model outperformed non-attention baselines in ablation studies, demonstrating precise localization of significant regions and robust feature discrimination. This structured approach ensured interpretable and generalizable feature extraction, critical for tasks that rely on identifying contextually significant data regions.

Answer to RQ4: Single algorithms proved insufficient for skin lesion classification due to challenges such as inter-class similarity and intra-class variability. EL was, therefore, essential. In this study, HASE architectures—custom CNNs integrated with SAI, CAI, and SEAI modules, alongside TL models—were incorporated into three ensemble frameworks: serial stacked, parallel stacked, and independent stacked. The final multi-level ensemble synergized these attention-driven feature representations, reducing model bias and enhancing generalization. By combining predictions from heterogeneous architectures, this approach significantly improved accuracy and robustness on unseen data, demonstrating that EL is crucial for complex medical imaging tasks where feature diversity and consensus are vital. A novel ensemble method, MWA, was proposed to optimally weight predictions across multiple layers.
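To make the idea of weighting predictions across multiple layers concrete, the sketch below applies an MCC-weighted average twice. The exact MWA formula is not restated in this section, so the code assumes the simplest reading, in which each model's class probabilities are weighted by its MCC normalized over the ensemble. The function name `mwa`, the MCC values, and the probability vectors are all made-up placeholders.

```python
def mwa(prob_vectors, mcc_scores):
    """Average class-probability vectors with weights proportional to MCC.

    Assumes every MCC is non-negative and that they sum to a positive value.
    """
    total = sum(mcc_scores)
    if total <= 0:
        raise ValueError("MCC scores must sum to a positive value")
    weights = [m / total for m in mcc_scores]
    n_classes = len(prob_vectors[0])
    return [sum(w * p[c] for w, p in zip(weights, prob_vectors))
            for c in range(n_classes)]


# Level 1: fuse the models inside each stacking family (placeholder numbers).
ssa = mwa([[0.7, 0.2, 0.1], [0.6, 0.3, 0.1]], [0.85, 0.80])
psa = mwa([[0.5, 0.4, 0.1], [0.8, 0.1, 0.1]], [0.82, 0.84])
isa = mwa([[0.6, 0.3, 0.1], [0.7, 0.2, 0.1], [0.9, 0.05, 0.05]], [0.80, 0.80, 0.80])

# Level 2: fuse the three family-level outputs into the final prediction.
hase = mwa([ssa, psa, isa], [0.88, 0.87, 0.89])
predicted = max(range(len(hase)), key=hase.__getitem__)
print(hase, predicted)
```

Normalizing the weights keeps each fused vector a valid probability distribution, so the output of one level can be passed directly to the next.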

Answer to RQ5: Relying solely on post-prediction ensembling, such as MWA, carries risks including redundant feature learning, error propagation from uncorrelated base models, and limited synergistic learning, as independently trained models cannot exploit interdependencies. In contrast, pre-prediction stacking (implemented via SSA, PSA, and ISA) integrates attention mechanisms before training, enabling collaborative feature refinement. Serial stacking progressively refines attention-guided features across layers; parallel stacking diversifies input processing through simultaneous attention pathways; and independent stacking isolates specialized attention modules for subsequent fusion. By embedding attention at the architectural level, pre-prediction methods reduce redundancy, enhance feature complementarity, and allow end-to-end optimization of interactions. Empirical results confirmed that pre-prediction stacking outperforms post-prediction MWA, particularly in complex tasks like skin lesion classification, where hierarchical feature refinement and synergistic learning improve discrimination and robustness.

Discussion and extended comparison

Our research demonstrated the advantages of combining HASE architectures with the multi-layer MWA ensemble strategy to improve classification outcomes. Through careful data preprocessing, strategic augmentation, and fine-tuning of pre-trained models, we successfully mitigated class imbalance issues and enhanced the extraction of discriminative features, leading to significant improvements in overall model performance.

To optimize feature representation, we employed a TA mechanism that integrates SAI, CAI, and SEAI. This mechanism was incorporated via three specialized stacking strategies: SSA, PSA, and ISA. By leveraging both shallow and deep features, these approaches enabled the construction of a highly effective and flexible architecture. In addition to pre-prediction stacking, post-prediction ensembling across multiple layers further refined the model’s predictions, enhancing accuracy and robustness.

The final evaluation at MWA_Layer 2 highlighted the effectiveness of this hierarchical ensembling framework. The HASE model achieved a notable accuracy of 93.96%, outperforming prior approaches and validating the efficacy of our multi-layer MWA design. Its specificity of 88.64% indicated that the model correctly rejected most non-target classes, although this value was lower than the specificity reported by the prior studies that provide it (Table 16).

These results affirmed the effectiveness of our methodological framework, highlighting the successful integration of sophisticated attention modules, stacking architectures, and multi-layer ensemble strategies. The observed improvements across performance metrics underscored the model’s capability to produce accurate and generalizable predictions, positioning it as a noteworthy advancement in the field of image classification.

Even in the presence of multiple baseline classifiers, our carefully designed framework achieved the highest accuracy, precision, and F1-score among the compared approaches, although some prior studies reported higher recall or specificity. A comprehensive comparison of our proposed model with prior studies is provided in Table 16, with particular emphasis on research leveraging the HAM10000 dataset.

Table 16.
Comparison of our proposed architecture with existing approaches.

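| Article | Accuracy (%) | Precision (%) | Recall (%) | F1-score (%) | Specificity (%) |
|---|---|---|---|---|---|
| Tajerian et al.¹⁰ | 84.30 | – | – | – | – |
| Wang et al.¹¹ | 91.24 | 83.53 | 95.04 | 88.91 | – |
| Mahbod et al.¹² | 86.20 | 91.30 | – | – | – |
| Popescu et al.¹³ | 86.20 | – | – | – | – |
| Khan and Khan¹⁵ | 91.09 | – | – | – | – |
| Nie et al.¹⁷ | 91.51 | – | – | – | – |
| Nguyen et al.¹⁸ | 90.00 | 86.00 | 81.00 | 86.00 | – |
| Datta et al.¹⁹ | 93.40 | 93.70 | – | – | – |
| Saarela and Geogieva²⁰ | 80.00 | – | – | – | – |
| Singh et al.²¹ | 86.67 | – | – | – | – |
| Khan et al.²² | 90.00 | – | – | – | – |
| Rahman et al.²⁴ | 88.00 | 87.00 | 94.00 | 89.00 | – |
| Gouda et al.²⁸ | 83.20 | – | – | – | – |
| Sun et al.²⁹ | 89.50 | – | 89.50 | – | 98.10 |
| Sevli⁴⁸ | 91.51 | – | – | – | – |
| Hoang et al.⁴⁹ | 86.33 | – | 86.33 | – | 97.48 |
| Harangi et al.⁵⁰ | 93.46 | – | – | – | 92.90 |
| Ours | 93.96 | 93.77 | 93.96 | 93.71 | 88.64 |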

Additionally, we benchmarked our proposed architecture against contemporary state-of-the-art approaches to demonstrate its comparative advantage. As illustrated in Table 17, our customized model consistently surpassed existing techniques, offering compelling evidence of its efficacy and robustness.

Table 17.
Comparison of our proposed architecture with state-of-the-art methods.

| Model | Accuracy (%) | Precision (%) | Recall (%) | F1-score (%) | Specificity (%) |
|---|---|---|---|---|---|
| InceptionRv2 | 89.98 | 90.19 | 89.98 | 90.02 | 87.96 |
| Inceptionv3 | 91.43 | 91.07 | 91.43 | 91.09 | 87.06 |
| Xception | 89.61 | 88.47 | 89.61 | 88.86 | 78.84 |
| MobileNet | 90.10 | 89.53 | 90.10 | 89.70 | 82.69 |
| MobileNetv2 | 90.10 | 89.98 | 90.10 | 89.87 | 84.59 |
| MobileNetv3L | 90.70 | 89.98 | 90.70 | 90.13 | 82.73 |
| MobileNetv3s | 90.22 | 89.38 | 90.22 | 89.63 | 81.77 |
| DenseNet121 | 90.34 | 90.22 | 90.34 | 89.85 | 85.56 |
| DenseNet169 | 92.03 | 91.72 | 92.03 | 91.61 | 86.61 |
| DenseNet201 | 92.03 | 91.96 | 92.03 | 91.67 | 84.24 |
| ResNet50 | 87.80 | 87.68 | 87.80 | 87.35 | 86.02 |
| ResNet101 | 91.30 | 91.23 | 91.30 | 90.98 | 84.67 |
| ResNet152 | 90.70 | 90.07 | 90.70 | 90.01 | 77.97 |
| ML-MWA | 93.96 | 93.77 | 93.96 | 93.71 | 88.64 |

ML-MWA: multi-level Matthews-correlation-coefficient weighted averaging.

Limitations of the study

Although the proposed methodology achieved strong performance in image classification, certain limitations should be acknowledged to inform future improvements:

Computational overhead

The multi-layered MWA framework, while effective, introduces considerable computational demands due to its reliance on multiple HASE architectures and iterative ensemble processing. Both training and inference are impacted by the need to coordinate diverse attention mechanisms and base classifiers, posing challenges for large-scale datasets or environments with limited computational resources.

Dependence on diverse base classifiers

The success of the MWA approach depends on the diversity of its constituent models. Using similar or homogeneous architectures may lead to redundant feature representations, thereby limiting ensemble gains. Although this study leverages diverse HASE variants (SSA, PSA, and ISA) to mitigate this risk, incorporating additional architectural or algorithmic diversity could further improve robustness.

Dataset-specific generalization

The evaluation is based on a single benchmark dataset, which may contain domain-specific biases or distributional characteristics not representative of broader settings. Consequently, the model’s performance might decline when applied to cross-domain data, such as variations in imaging protocols, lesion types, or patient demographics.

Future work and research directions

The limitations highlighted in this study also suggest several promising directions for extending the proposed framework’s applicability, efficiency, and robustness. Three primary avenues for future research are outlined below:

Optimizing computational efficiency

Future studies could aim to reduce the computational demands of the multi-layer MWA framework while preserving its ensemble advantages. Strategies may include knowledge distillation to compress multiple HASE architectures into more compact models, dynamic pruning of redundant classifiers during inference, and hardware-aware parallelization to maximize resource utilization. Additionally, adaptive layer depth—where the number of ensembling layers is determined based on dataset complexity—could provide a balanced trade-off between computational cost and performance, facilitating deployment in real-world scenarios.

Enhancing base classifier diversity

To address reliance on manual architectural selection, automated approaches for promoting classifier heterogeneity could be explored. Potential methods include incorporating adversarial decorrelation losses during base model training to reduce redundant feature learning, and leveraging neural architecture search (NAS) to discover optimal combinations of attention mechanisms and transfer learning models. Furthermore, hybrid ensembles that integrate CNNs with transformer-based models or graph neural networks could enrich feature representations, particularly for rare or morphologically complex lesion categories.

Cross-domain generalization and robustness

To enhance the framework’s applicability beyond a single dataset, future work should focus on validating performance across multi-center datasets with diverse imaging protocols, patient demographics, and lesion distributions. Incorporating domain adaptation techniques and evaluating on heterogeneous datasets will help address distribution shifts and improve generalizability. Establishing collaborative benchmarking with clinical partners can create standardized evaluation protocols, particularly for challenging scenarios such as low-quality images or streaming data with class imbalance. Additionally, integrating uncertainty quantification into the MWA ensemble weighting process could enhance reliability in ambiguous cases, supporting greater trust and adoption in clinical settings.

Conclusion

This work proposed a comprehensive framework for image classification by combining HASE architectures with a multi-level MWA (ML-MWA) strategy. The methodology commenced with thorough data preprocessing and the evaluation of four augmentation strategies, selecting the most effective approach based on performance on independent test data. This step ensured that the HASE models were trained efficiently and generalized well.

To improve feature representation and discrimination, three attention-based stacking mechanisms were incorporated: SSA, PSA, and ISA. Each stacking variant contributed complementary strengths to the model, and their outputs were subsequently fused through the proposed MWA ensemble to optimize prediction performance.

The multi-level MWA framework employed a two-tier sequential refinement process, leveraging the complementary capabilities of individual HASE models. This iterative ensembling approach produced the strongest overall classifier, with the layer-2 ensemble reaching 93.96% accuracy and the highest precision, recall, and F1-score among the evaluated configurations.
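
The two-tier fusion summarized above can be sketched as follows. The sketch assumes that each classifier's weight is its Matthews correlation coefficient (MCC) divided by the sum of all MCCs, with negative scores clipped to zero; the exact weighting rule is defined in the methods section and may differ. The function names and the toy two-class probabilities are hypothetical, and the weights are computed on the same labels only to keep the example short (held-out data would normally be used).

```python
from math import sqrt

def multiclass_mcc(y_true, y_pred, n_classes):
    """Multiclass Matthews correlation coefficient from label lists."""
    n = len(y_true)
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    true_counts = [0] * n_classes
    pred_counts = [0] * n_classes
    for t, p in zip(y_true, y_pred):
        true_counts[t] += 1
        pred_counts[p] += 1
    numerator = correct * n - sum(t * p for t, p in zip(true_counts, pred_counts))
    denominator = sqrt((n * n - sum(p * p for p in pred_counts))
                       * (n * n - sum(t * t for t in true_counts)))
    return numerator / denominator if denominator else 0.0

def argmax_labels(probs):
    """Predicted class index for every row of class probabilities."""
    return [max(range(len(row)), key=row.__getitem__) for row in probs]

def mwa(prob_sets, y_true, n_classes):
    """Fuse several classifiers' probabilities with MCC-proportional weights."""
    scores = [multiclass_mcc(y_true, argmax_labels(p), n_classes) for p in prob_sets]
    clipped = [max(s, 0.0) for s in scores]  # assumption: ignore negative MCC
    total = sum(clipped)
    weights = ([s / total for s in clipped] if total
               else [1.0 / len(scores)] * len(scores))
    fused = [[sum(w * probs[i][k] for w, probs in zip(weights, prob_sets))
              for k in range(n_classes)] for i in range(len(y_true))]
    return fused, weights

# Toy data: four samples, two classes, four hypothetical base classifiers.
y_true = [0, 1, 1, 0]
model_a = [[0.9, 0.1], [0.2, 0.8], [0.4, 0.6], [0.7, 0.3]]
model_b = [[0.6, 0.4], [0.6, 0.4], [0.3, 0.7], [0.8, 0.2]]
model_c = [[0.8, 0.2], [0.3, 0.7], [0.6, 0.4], [0.9, 0.1]]
model_d = [[0.7, 0.3], [0.1, 0.9], [0.2, 0.8], [0.6, 0.4]]

# Layer 1: fuse the classifiers inside each (hypothetical) stacking family.
family_1, w1 = mwa([model_a, model_b], y_true, 2)
family_2, w2 = mwa([model_c, model_d], y_true, 2)

# Layer 2: fuse the layer-1 outputs into the final prediction.
final_probs, w_final = mwa([family_1, family_2], y_true, 2)
final_labels = argmax_labels(final_probs)
```

In this toy case each family contains one classifier that misclassifies a sample; the MCC weighting shifts the average toward the stronger member, so both layer-1 outputs, and therefore the layer-2 output, recover all four labels.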

Detailed Grad-CAM visualizations further highlighted the interpretability of the proposed framework, providing insights into the regions driving the model’s predictions and confirming its practical relevance. With an accuracy of 93.96% on HAM10000, the method outperformed both the standalone transfer learning (TL) models and the prior studies included in the comparison. Notably, its capability to support early and accurate diagnosis of skin lesions underscores its potential to enhance patient care and broaden access to reliable diagnostic tools.

Algorithm	A	P	R	F1	S
PA_ENb0	88.39	88.13	88.39	88.09	90.98
PA_ENb1	88.72	88.52	88.72	88.44	89.64
PA_ENb2	90.24	90.05	90.24	89.99	90.74
PA_ENb3	88.29	88.16	88.29	87.83	89.20
PA_ENb4	87.09	87.09	87.09	86.96	90.46
PA_ENb5	87.53	87.28	87.53	87.20	88.67
PA_ENb6	88.07	87.72	88.07	87.60	87.38
PA_ENb7	87.64	87.68	87.64	87.35	90.78
PSA ($MWA_{1}$)	92.41	92.34	92.41	92.13	92.13

Algorithm	A	P	R	F1	S
SAI_ENb0	86.98	86.92	86.98	86.77	89.18
CAI_ENb0	88.61	88.56	88.61	88.33	90.43
SEAI_ENb0	85.79	85.74	85.79	85.51	89.22
IA_ENb0 ($MWA_{0}$)	88.50	88.53	88.50	88.16	90.03
SAI_ENb1	88.83	88.93	88.83	88.50	90.42
CAI_ENb1	87.85	87.58	87.85	87.45	88.76
SEAI_ENb1	85.90	85.72	85.90	85.67	88.28
IA_ENb1 ($MWA_{0}$)	88.72	88.70	88.72	88.42	90.21
SAI_ENb2	89.59	89.41	89.59	89.25	88.89
CAI_ENb2	88.07	87.77	88.07	87.81	87.98
SEAI_ENb2	87.53	87.53	87.53	87.29	89.19
IA_ENb2 ($MWA_{0}$)	89.70	89.53	89.70	89.35	88.72
SAI_ENb3	86.77	86.62	86.77	86.42	88.29
CAI_ENb3	86.55	86.15	86.55	86.20	89.00
SEAI_ENb3	87.20	87.81	87.20	87.25	92.28
IA_ENb3 ($MWA_{0}$)	87.42	88.03	87.42	87.46	92.50
SAI_ENb4	86.55	86.46	86.55	86.27	89.94
CAI_ENb4	86.98	86.80	86.98	86.79	89.86
SEAI_ENb4	86.44	86.30	86.44	86.18	88.24
IA_ENb4 ($MWA_{0}$)	87.74	87.58	87.74	87.55	90.52
SAI_ENb5	86.12	85.62	86.12	85.57	86.38
CAI_ENb5	85.68	85.19	85.68	85.26	86.93
SEAI_ENb5	86.44	86.16	86.44	85.90	86.43
IA_ENb5 ($MWA_{0}$)	86.98	86.66	86.98	86.47	86.48
SAI_ENb6	85.47	85.25	85.47	84.99	86.14
CAI_ENb6	86.12	85.86	86.12	85.66	87.78
SEAI_ENb6	84.49	84.35	84.49	84.01	85.72
IA_ENb6 ($MWA_{0}$)	86.66	86.44	86.66	86.21	87.85
SAI_ENb7	87.09	86.91	87.09	86.55	88.02
CAI_ENb7	86.98	86.44	86.98	86.32	87.08
SEAI_ENb7	86.12	86.08	86.12	85.98	89.27
IA_ENb7 ($MWA_{0}$)	87.85	87.54	87.85	87.29	88.34
ISA ($MWA_{1}$)	91.65	91.50	91.65	91.29	91.29

Algorithm	A	P	R	F1	S
SA_ENb0	90.46	89.69	90.46	89.84	81.72
SA_ENb1	91.91	91.56	91.91	91.63	86.63
SA_ENb2	91.18	91.23	91.18	91.16	90.43
SA_ENb3	92.03	91.96	92.03	91.68	88.53
SA_ENb4	91.06	91.38	91.06	91.12	91.36
SA_ENb5	91.91	91.73	91.91	91.72	87.11
SA_ENb6	91.06	91.04	91.06	90.90	89.91
SA_ENb7	89.37	89.63	89.37	89.31	86.97
SSA ($MWA_{1}$)	93.48	93.31	93.48	93.28	91.02

Algorithm	A	P	R	F1	S
PA_ENb0	90.22	89.46	90.22	89.77	84.17
PA_ENb1	90.46	89.84	90.46	89.91	82.70
PA_ENb2	90.94	90.19	90.94	90.30	82.73
PA_ENb3	91.67	91.41	91.67	91.41	88.06
PA_ENb4	91.30	91.26	91.30	91.20	88.01
PA_ENb5	91.55	91.10	91.55	91.23	85.67
PA_ENb6	91.55	90.94	91.55	91.14	85.69
PA_ENb7	90.94	90.96	90.94	90.87	87.04
PSA ($MWA_{1}$)	93.12	92.88	93.12	92.78	87.17

Algorithm	A	P	R	F1	S
SAI_ENb0	91.18	90.92	91.18	90.91	84.20
CAI_ENb0	90.22	89.64	90.22	89.86	85.11
SEAI_ENb0	91.30	91.48	91.30	91.32	88.51
IA_ENb0 ($MWA_{0}$)	92.15	91.84	92.15	91.89	87.12
SAI_ENb1	89.37	89.06	89.37	89.14	83.63
CAI_ENb1	91.43	91.23	91.43	91.00	84.66
SEAI_ENb1	90.10	90.34	90.10	90.06	86.08
IA_ENb1 ($MWA_{0}$)	91.55	91.46	91.55	91.18	85.61
SAI_ENb2	90.22	89.18	90.22	89.53	82.25
CAI_ENb2	90.46	89.69	90.46	89.91	82.26
SEAI_ENb2	89.86	89.88	89.86	89.69	84.62
IA_ENb2 ($MWA_{0}$)	90.70	89.90	90.70	90.16	82.75
SAI_ENb3	91.30	91.05	91.30	91.14	86.61
CAI_ENb3	90.46	90.24	90.46	90.19	84.65
SEAI_ENb3	90.58	91.30	90.58	90.51	91.24
IA_ENb3 ($MWA_{0}$)	91.79	91.56	91.79	91.63	88.55
SAI_ENb4	90.70	90.30	90.70	90.40	84.18
CAI_ENb4	91.55	91.37	91.55	91.32	86.62
SEAI_ENb4	91.06	91.00	91.06	91.02	87.54
IA_ENb4 ($MWA_{0}$)	91.55	91.31	91.55	91.31	85.66
SAI_ENb5	90.58	90.13	90.58	90.20	85.59
CAI_ENb5	90.22	89.94	90.22	90.01	84.62
SEAI_ENb5	90.94	90.62	90.94	90.64	85.64
IA_ENb5 ($MWA_{0}$)	91.43	91.14	91.43	91.14	87.10
SAI_ENb6	91.18	90.73	91.18	90.71	85.14
CAI_ENb6	91.67	91.22	91.67	91.25	87.59
SEAI_ENb6	89.61	89.73	89.61	89.39	84.58
IA_ENb6 ($MWA_{0}$)	91.79	91.54	91.79	91.38	86.64
SAI_ENb7	90.46	90.36	90.46	90.40	86.59
CAI_ENb7	90.10	89.13	90.10	89.36	79.85
SEAI_ENb7	89.73	90.03	89.73	89.78	88.39
IA_ENb7 ($MWA_{0}$)	90.82	90.65	90.82	90.72	87.07
ISA ($MWA_{1}$)	93.84	93.72	93.84	93.56	90.07

Algorithm	A	P	R	F1	S
SSA ($MWA_{1}$)	92.95	92.88	92.95	92.68	92.21
PSA ($MWA_{1}$)	92.41	92.34	92.41	92.13	92.13
ISA ($MWA_{1}$)	91.65	91.50	91.65	91.29	91.29
HASE ($MWA_{2}$)	93.17	93.09	93.17	92.90	93.01

Algorithm	A	P	R	F1	S
SSA ($MWA_{1}$)	93.48	93.31	93.48	93.28	91.02
PSA ($MWA_{1}$)	93.12	92.88	93.12	92.78	87.17
ISA ($MWA_{1}$)	93.84	93.72	93.84	93.56	90.07
HASE ($MWA_{2}$)	93.96	93.77	93.96	93.71	88.64

Algorithm	A	P	R	F1	S
$MWA_{1}$ (SSA)	$93.48 \pm 1.47$	$93.31 \pm 1.51$	$93.48 \pm 1.47$	$93.28 \pm 1.51$	$91.02 \pm 2.10$
$MWA_{1}$ (PSA)	$93.12 \pm 1.36$	$92.88 \pm 1.39$	$93.12 \pm 1.36$	$92.78 \pm 1.39$	$87.17 \pm 1.77$
$MWA_{1}$ (ISA)	$93.84 \pm 1.39$	$93.72 \pm 1.42$	$93.84 \pm 1.39$	$93.56 \pm 1.43$	$90.07 \pm 2.13$
$MWA_{2}$ (HASE)	$93.96 \pm 1.34$	$93.77 \pm 1.39$	$93.96 \pm 1.34$	$93.71 \pm 1.39$	$88.64 \pm 1.88$

Article	A	P	R	F1	S
Tajerian et al.¹⁰	84.30	–	–	–	–
Wang et al.¹¹	91.24	83.53	95.04	88.91	–
Mahbod et al.¹²	86.20	91.30	–	–	–
Popescu et al.¹³	86.20	–	–	–	–
Khan and Khan¹⁵	91.09	–	–	–	–
Nie et al.¹⁷	91.51	–	–	–	–
Nguyen et al.¹⁸	90.00	86.00	81.00	86.00	–
Datta et al.¹⁹	93.40	93.70	–	–	–
Saarela and Geogieva²⁰	80.00	–	–	–	–
Singh et al.²¹	86.67	–	–	–	–
Khan et al.²²	90.00	–	–	–	–
Rahman et al.²⁴	88.00	87.00	94.00	89.00	–
Gouda et al.²⁸	83.20	–	–	–	–
Sun et al.²⁹	89.50	–	89.50	–	98.10
Sevli⁴⁸	91.51	–	–	–	–
Hoang et al.⁴⁹	86.33	–	86.33	–	97.48
Harangi et al.⁵⁰	93.46	–	–	–	92.90
Ours	93.96	93.77	93.96	93.71	88.64

Model	A	P	R	F1	S
InceptionRv2	89.98	90.19	89.98	90.02	87.96
Inceptionv3	91.43	91.07	91.43	91.09	87.06
Xception	89.61	88.47	89.61	88.86	78.84
MobileNet	90.10	89.53	90.10	89.70	82.69
MobileNetv2	90.10	89.98	90.10	89.87	84.59
MobileNetv3L	90.70	89.98	90.70	90.13	82.73
MobileNetv3s	90.22	89.38	90.22	89.63	81.77
DenseNet121	90.34	90.22	90.34	89.85	85.56
DenseNet169	92.03	91.72	92.03	91.61	86.61
DenseNet201	92.03	91.96	92.03	91.67	84.24
ResNet50	87.80	87.68	87.80	87.35	86.02
ResNet101	91.30	91.23	91.30	90.98	84.67
ResNet152	90.70	90.07	90.70	90.01	77.97
ML-MWA	93.96	93.77	93.96	93.71	88.64

Footnotes

Acknowledgements

The authors would like to express their sincere gratitude to their parents for their continuous support and encouragement.

ORCID iDs

Jubaer Ahamed Bhuiyan

Anwar Hossain Efat

Md. Shifaul Hasan

Faniyam Maria Mansia

Ethical approval

This study was conducted in compliance with ethical standards, ensuring proper copyright adherence and attribution. The dataset used in this research is publicly available under the CC BY-NC-4.0 license, and it has been utilized with appropriate attribution.

Consent to participants

Informed consent was originally obtained from all individual participants by the creators of the dataset.

Consent to publication

All authors have provided their consent for publication in this journal (Digital Health, Sage). No additional consent is required beyond the authors’ approval.

Author contributions

Jubaer Ahamed Bhuiyan: validation, formal analysis, methodology, software, investigation, and writing–original draft.

Anwar Hossain Efat: conceptualization, supervision, data curation, methodology, and writing–original draft.

Md. Shifaul Hasan: formal analysis, software, investigation, and writing–review and editing.

Faniyam Maria Mansia: investigation and validation.

All authors have read and approved the final manuscript.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Guarantor

All authors accept full responsibility for the integrity of the data and the accuracy of the data analysis, and confirm that they had full access to all data used in the study.

Data availability statement

All data used in this study, including the augmented training dataset, are publicly available in the Kaggle repository: [HAM10000 Dataset].³⁵

The use of the HAM10000 dataset³⁴ complies with the Creative Commons Attribution-NonCommercial 4.0 International License. Proper attribution has been provided, and the recommended citation of the original dataset publication has been included, thereby fulfilling the license requirements. Furthermore, the dataset has been used strictly for non-commercial research purposes in accordance with the license terms.

The source code developed for this study is available from the corresponding author upon reasonable request.

References

1. Bibi, Khan, Shah, et al. Msrnet: multiclass skin lesion recognition using additional residual block based fine-tuned deep models information fusion and best feature selection. Diagnostics 2023; 13: 3063.

2. Dillshad, Khan, Nazir, et al. D2lfs2net: multi-class skin lesion diagnosis using deep learning and variance-controlled marine predator optimisation: an application for precision medicine. CAAI Trans Intell Technol 2025; 10: 207–222.

3. Hussain, Khan, Damaševičius, et al. Skinnet-inio: multiclass skin lesion localization and classification using fusion-assisted deep neural networks and improved nature-inspired optimization algorithm. Diagnostics 2023; 13: 2869.

4. Efat, Hasan, Uddin, et al. A multi-level ensemble approach for skin lesion classification using customized transfer learning with triple attention. PLoS One 2024; 19: e0309430.

5. Ahmad, Shah, Khan, et al. A novel framework of multiclass skin lesion recognition from dermoscopic images using deep learning and explainable ai. Front Oncol 2023; 13: 1151257.

6. Malik, Akram, Awais, et al. An improved skin lesion boundary estimation for enhanced-intensity images using hybrid metaheuristics. Diagnostics 2023; 13: 1285.

7. Efat. Chi2 weighted ensemble: a multi-layer ensemble approach for skin lesion classification using a novel framework-optimized regnet synergy with attention-triplet. PLoS One 2025; 20: e0321803.

8. Efat, Zibran, Eishita. Skin lesion classification breakthrough: leveraging independent-serial–parallel-stacking ensemble architecture with reciprocal cross-entropy averaging. IEEE Access 2026; 14: 1–18.

9. Hosny, Kassem, Foaud. Classification of skin lesions using transfer learning and augmentation with alex-net. PLoS One 2019; 14: e0217293.

10. Tajerian, Kazemian, Tajerian, et al. Design and validation of a new machine-learning-based diagnostic tool for the differentiation of dermatoscopic skin cancer images. PLoS One 2023; 18: e0284437.

11. Wang, Yan, Tang, et al. Multiscale feature fusion for skin lesion classification. BioMed Res Int 2023; 2023: 5146543.

12. Mahbod, Schaefer, Wang, et al. Transfer learning using a multi-scale and multi-network ensemble for skin lesion classification. Comput Methods Programs in Biomed 2020; 193: 105475.

13. Popescu, El-Khatib, Ichim. Skin lesion classification using collective intelligence of multiple neural networks. Sensors 2022; 22: 4399.

14. Howal, Wagh. Ilenet-linknet architecture trained on pattern and color features for skin lesion classification: segmentation with improved attention-based rcnn model. Biomed Eng: Appl Basis Commun 2025; 37: 2550007.

15. Khan. Skinvit: a transformer based method for melanoma and nonmelanoma classification. PLoS One 2023; 18: e0295151.

16. Dong, Wang. Tc-net: Dual coding network of transformer and CNN for skin lesion segmentation. PLoS One 2022; 17: e0277578.

17. Nie, Sommella, Carratù, et al. A deep CNN transformer hybrid model for skin lesion classification of dermoscopic images using focal loss. Diagnostics 2022; 13: 72.

18. Nguyen, Bui. Skin lesion classification on imbalanced data using deep learning with soft attention. Sensors 2022; 22: 7530.

19. Datta, Shaikh, Srihari, et al. Soft attention improves skin cancer classification performance. In: International workshop on interpretability of machine intelligence in medical image computing, 2021, pp.13–23. Springer.

20. Saarela, Geogieva. Robustness, stability, and fidelity of explanations for a deep skin cancer classification model. Appl Sci 2022; 12: 9545.

21. Singh, Gorantla, Allada SGR, et al. Skinet: a deep learning framework for skin lesion diagnosis with uncertainty estimation and explainability. PLoS One 2022; 17: e0276836.

22. Khan, Akram, Zhang, et al. Skinnet-endo: multiclass skin lesion recognition using deep neural network and entropy-normal distribution optimization algorithm with ELM. Int J Imag Syst Technol 2023; 33: 1275–1292.

23. Ajmal, Khan, Akram, et al. Bf2sknet: best deep learning features fusion-assisted framework for multiclass skin lesion classification. Neural Comput Appl 2023; 35: 22115–22131.

24. Rahman, Hossain, Islam, et al. An approach for multiclass skin lesion classification based on ensemble learning. Informat Med Unlocked 2021; 25: 100659.

25. Nidhi, Efat, Hasan, et al. Triple attention mobilenetv3: harnessing integrated attention and transfer learning for next-generation skin lesion detection. In: 2024 IEEE International conference on computing, applications and systems (COMPAS), 2024, pp.1–6. IEEE.

26. Abir MAK, Efat, Hasan, et al. Attention enhanced inception-v3: a multi-scale feature fusion network for skin lesion detection with explainable artificial intelligence. In: 2024 International conference on innovations in science, engineering and technology (ICISET), 2024, pp.1–6. IEEE.

27. Ahmmed, Faruk, Srizon, et al. Shallow tuned densenet: a lightweight convolutional neural network approach for enhanced skin lesion recognition. In: 2024 IEEE International conference on power, electrical, electronics and industrial applications (PEEIACON), 2024, pp.1–6. IEEE.

28. Gouda, Sama, Al-Waakid, et al. Detection of skin cancer based on skin lesion images using deep learning. In: Healthcare, 2022, vol. 10, p.1183. MDPI.

29. Sun, Huang, Chen, et al. Skin lesion classification using additional patient information. BioMed Res Int 2021; 2021: 6673852.

30. Jahan, Efat, Hasan, et al. An explainable deep learning framework for multi-class skin lesion classification while resolving class imbalance. In: 2024 IEEE International conference on power, electrical, electronics and industrial applications (PEEIACON), 2024, pp.473–478. IEEE.

31. Roy, Efat, Hasan, et al. Multi-scale feature fusion framework based on attention integrated customized densenet201 architecture for multi-class skin lesion detection. In: 2024 IEEE International conference on power, electrical, electronics and industrial applications (PEEIACON), 2024, pp.496–501. IEEE.

32. Hasib, Faruk, Hasan, et al. Improved skin lesion detection with double layer concatenated densenet using transfer learning and attention modules. In: 2024 IEEE international conference on power, electrical, electronics and industrial applications (PEEIACON), 2024, pp.1–6. IEEE.

33. Mia, Efat, Hasan, et al. Exploring augmentation strategies for balanced skin lesion classification: an explainable lightly tuned densenet 169 architecture. In: 2024 international conference on innovations in science, engineering and technology (ICISET), 2024, pp.1–6. IEEE.

34. Tschandl, Rosendahl, Kittler. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Sci Data 2018; 5: 1–9.

35. HAM10000: Split and augmented. https://www.kaggle.com/datasets/ahefatresearch/ham10000-split-and-augmented. [Online; accessed 2026-02-12].

36. Bhowmick, Efat, Hasan, et al. Dual concatenated densenet with attention fusion: a framework for skin lesion classification incorporating multiple augmentation techniques and transfer learning. In: 2024 27th international conference on computer and information technology (ICCIT), 2024, pp.1087–1092. IEEE.

37. Sikder, Efat, Hasan, et al. A triple-level ensemble-based brain tumor classification using dense-resnet in association with three attention mechanisms. In: 2023 26th international conference on computer and information technology (ICCIT), 2023, pp.1–6. IEEE.

38. Haque, Efat, Hasan, et al. Revolutionizing pest detection for sustainable agriculture: a transfer learning fusion network with attention-triplet and multi-layer ensemble. In: 2023 26th international conference on computer and information technology (ICCIT), 2023, pp.1–6. IEEE.

39. Joy, Efat, Hasan, et al. Attention trinity net and densenet fusion: revolutionizing american sign language recognition for inclusive communication. In: 2023 26th international conference on computer and information technology (ICCIT), 2023, pp.1–6. IEEE.

40. Montashir Fahim, Efat, Mahedy Hasan, et al. Tri focus net: a CNN-based model with integrated attention modules for pest and insect detection in agriculture. In: International conference on trends in electronics and health informatics, 2023, pp.225–240. Springer.

41. Shafin, Efat, Hasan, et al. Skin lesion classification through sequential triple attention densenet: diverse utilization of the combination of attention modules. In: 2023 26th international conference on computer and information technology (ICCIT), 2023, pp.1–6. IEEE.

42. Amin, Efat, Rahman, et al. Enhanced skin lesion detection using concatenated densenet and multi-attention mechanisms. In: 2024 international conference on innovations in science, engineering and technology (ICISET), 2024, pp.1–6. IEEE.

43. Efat, Hasan, Uddin, et al. Inverse GINI indexed averaging: a multi-leveled ensemble approach for skin lesion classification using attention-integrated customized resnet variants. Digit Health 2025; 11: 20552076241312936.

44. Efat. Pinpointing key success factors in Bangladesh’s public university entrance exams: a feature-optimized SVM architecture with xai. In: 2024 27th international conference on computer and information technology (ICCIT), 2024, pp.429–434. IEEE.

45. Efat, Hasan, Zibran. Greeknet: Handwritten Greek alphabet recognition using explainable parallel CNN with attention mechanisms. In: 2025 IEEE 4th international conference on computing and machine intelligence (ICMI), 2025, pp.1–9. IEEE.

46. Efat, Hasant, Jannat, et al. Inquisition of the support vector machine classifier in association with hyper-parameter tuning: a disease prognostication model. In: 2022 4th international conference on electrical, computer & telecommunication engineering (ICECTE), 2022, pp.131–134. IEEE.

47. Hossain Efat, Faysal Ferdous, Islam Nayem, et al. From data to diagnosis: a journey with machine learning, hyperparameter tuning, and ensemble learning for disease prognostication. In: International conference on trends in electronics and health informatics, 2023, pp.407–420. Springer.

48. Sevli. A deep convolutional neural network-based pigmented skin lesion classification application and experts evaluation. Neural Comput Appl 2021; 33: 12039–12050.

49. Hoang, Lee, et al. Multiclass skin lesion classification using a novel lightweight deep learning framework for smart healthcare. Appl Sci 2022; 12(5): 2677.

50. Harangi, Baran, Hajdu. Assisted deep learning framework for multi-class skin lesion classification considering a binary classification support. Biomed Signal Process Control 2020; 62: 102041.

Hierarchical attention stacked ensemble with Matthews-correlation-coefficient weighted averaging: A novel framework for skin lesion classification
