Abstract
Objectives
To develop and evaluate a hybrid, partially interpretable deep learning (DL) approach for multi-class skin cancer classification that improves robustness under varying acquisition conditions and delivers clinically meaningful explanations.
Methods
The proposed pipeline starts with preprocessing, including hair artefact removal using the Dull Razor method and anisotropic diffusion filtering for noise reduction while preserving lesion boundaries. Data augmentation is limited to the training set to prevent leakage. Class imbalance is addressed using class-weighted cross-entropy loss. EfficientNetB0 serves as the backbone CNN, and global feature embeddings are used to train a Random Forest (RF) classifier. Predictions are made by combining outputs from the deep model and the RF through probability-level fusion. The framework is evaluated on the HAM10000 dataset (7 classes) and a combined ISIC2019+DermNet dataset (8 classes). Performance metrics are compared against strong Vision Transformer (ViT) and transfer learning baselines. A proof-of-concept web application is developed for explainable decision making.
Results
The proposed model achieves 98.61% accuracy and 98.60% F1-score on the combined dataset. It reaches 95.02% accuracy and 95.06% F1-score on HAM10000 using lesion-wise 5-fold cross-validation. For melanoma-specific evaluations, it demonstrates high sensitivity and AUC, indicating strong performance on critical cases. Grad-CAM maps suggest that the network highlights potentially important diagnostic lesion areas.
Conclusion
The results indicate that partially interpretable architectures are a promising direction for robust skin cancer classification. The integration of Grad-CAM explanations and a web-based interface indicates that our framework may serve as a useful exploratory clinical decision-support tool.
1. Introduction
The average skin surface area is about 20 square feet, making it the largest biological tissue in the body. 1 The skin protects against potential damage from heat, injuries, and infections. It is the most superficial layer, forming a water-resistant barrier and containing special cells called melanocytes, which determine our skin tone. Factors such as mycoses, viruses, weakened immunity, bacterial infections, and genetic imbalances can all contribute to the deterioration of skin health.2,3 According to the World Health Organization (WHO), more than 1.8 billion people worldwide suffer from various skin diseases, and cancer diagnoses are expected to double over the next two decades. Furthermore, many skin diseases are infectious, putting others at risk in addition to the infected individual. 4
Skin cancer is a hazardous skin disease, causing nearly 15,000 deaths every year. 5 It is one of the top three deadliest cancers and has been a public health concern for quite some time. Skin cancer rates are considerably higher in Australia compared to many other parts of the world. 6 It first appears on the skin’s surface, visible to the naked eye. When natural sunlight or any other UV source reaches the skin cells, a mutation occurs in the DNA, affecting the cells’ healthy growth, eventually leading to skin cancer. Research has shown that people with lighter skin are more susceptible to skin cancer than those with darker skin, possibly due to better protection offered by pigmentation in the outer layers.7,8 Other risk factors could include smoking, alcohol consumption, a weakened immune system, family history of the disease, allergies, and infections.9,10 It has been observed that populations in regions with higher UV radiation, particularly at lower latitudes, have shown an increased incidence of skin cancer.11,12
The irregular growth of skin cells can be categorized as benign or malignant. Benign growths include Melanocytic Nevus (MN), Benign Keratosis (BK), Dermatofibroma (DF), and Vascular Lesions (VASC). These growths are generally non-problematic but should be monitored for irregular changes that could indicate melanoma. 13 Examples of benign growths are firm nodules like DFs and VASC, which include blood vessel abnormalities such as hemangiomas and birthmarks. Malignant tumors, such as melanoma, basal cell carcinoma (BCC), actinic keratosis (AK), and squamous cell carcinoma (SCC), are more aggressive and can spread to other tissues. SCC and BCC can lead to physical disfigurement if untreated, while AK can develop into SCC if not addressed.14,15 Melanoma, the most dangerous, often appears as irregular dark moles and can recur after removal. In 2012, there were 76,250 new melanoma cases in the USA, with 9,180 recorded deaths, and cases are projected to reach 500,000 by 2040.16,17
Diagnosing skin cancer involves a thorough examination by a dermatologist, who assesses any changes in skin color, size, shape, or texture of lesions or moles. Dermatoscopes aid in detailed lesion examination, and suspicious lesions may require tissue biopsy for pathological analysis.18,19 Dermatologists with 3 to 5 years of experience achieve over 60% accuracy in diagnosis, but expertise may vary, leading to delays in treatment delivery, particularly in developing regions. 20 Biopsies, essential for confirmation, can cause discomfort and delays in diagnosis, impacting patient recovery and causing anxiety. 21 Improved diagnostic procedures are crucial to expedite diagnosis and alleviate patient concerns. Studies have shown that early skin cancer identification can lower mortality rates by up to 90%, 22 underscoring the critical need for early detection. Early-stage melanoma and other skin cancers have a roughly 93% five-year survival rate; this stands in stark contrast to a survival rate of as low as 27% in cases of malignant spread. 23 With an estimated 5.4 million cases of skin cancer detected in the US each year, 24 there is a growing need for early detection technologies. An escalating number of researchers25–28 are adopting computer-aided diagnosis systems to accurately analyze dermatological images by leveraging sophisticated computer vision and DL techniques. These technologies can save lives by improving diagnostic speed and accuracy, thereby accelerating interventions and personalized treatment.
Convolutional neural networks (CNNs) have greatly improved skin cancer classification, matching dermatologist performance on benchmark datasets. Pre-trained CNNs,29–31 such as ResNet, DenseNet, and EfficientNet, effectively extract features from dermoscopic images, enabling precise differentiation between melanoma and benign cases. 32 Attention mechanisms and transformer architectures further enhance these capabilities. However, three critical barriers persist: (i) Class imbalance is a major issue; benign lesions outnumber malignant ones in datasets, creating biased decision boundaries and compromising early melanoma detection; (ii) Models trained on selective datasets often fail to perform well in diverse real-world conditions due to variations in imaging devices, lighting, skin tones, and lesion types; (iii) The opaque nature of DL models hinders clinician trust and interpretability, which are essential for diagnostic decisions.33–35
Many studies have explored solutions like oversampling for class imbalance,36,37 domain adaptation for generalization, and XAI for interpretability.38–41 However, few have developed a comprehensive system that is both computationally efficient and suitable for clinical use. Most research evaluates models on single-source datasets, leaving their real-world performance largely untested. There is a pressing need for DL models that tackle class imbalance, ensure interpretable predictions for dermatology, and maintain reliable performance across multiple datasets, all while being efficient for web deployment. The primary objective of this study is to develop a practical and trustworthy AI system for multi-class skin cancer classification to support healthcare professionals in their decision-making processes. Specifically, this study aims to:
1. Address the imbalance in benchmark datasets to accurately identify critical cases like melanoma, in addition to more common skin conditions.
2. Ensure the system performs reliably across various settings, maintaining accuracy despite differences in equipment, lighting, or patient characteristics.
3. Improve trust in the system by providing clear explanations that help professionals verify and use its predictions confidently.
4. Create a diagnostic tool for clinical environments to facilitate timely skin cancer screenings and support early intervention.
To achieve these goals, this study presents a structured approach comprising four key stages: data preparation, model development, evaluation, and deployment. The proposed methodology is illustrated in Figure 1. Dermoscopic images from the ISIC 2019 and HAM10000 datasets were collected to ensure a diverse representation of lesions. Images were resized and normalized, with the Dull Razor method used to remove hair artifacts and anisotropic diffusion filtering applied to enhance clarity while preserving lesion boundaries. Our approach features a hybrid model that combines EfficientNetB0 for feature extraction with an RF classifier for predictions. EfficientNetB0 captures hierarchical features with low computational demands. The RF component offers partial transparency in the embedding space, while Grad-CAM provides pixel-level interpretability for EfficientNetB0. We benchmarked our system against multiple transfer learning models, including CNNs and SENet. We also incorporated handcrafted methods such as ABCD analysis and the Grey Level Co-occurrence Matrix (GLCM) for enhanced feature representation. We assessed computational efficiency using FLOPs and inference speed to ensure real-time applicability. To build clinician trust, we integrated Grad-CAM-based interpretability that highlights predictive image regions. The system is available as a web application that allows professionals to upload dermoscopic images and receive quick predictions for informed clinical decision-making. However, our observations are preliminary and have not yet been validated by dermatologists. Key contributions of this study are as follows:
• Proposed a hybrid skin cancer classification framework that fine-tunes deep convolutional representations from an EfficientNetB0 backbone and pairs them with an RF classifier to enhance diagnostic accuracy.
• Ensured computational efficiency for clinical use by evaluating inference speed and complexity, enabling real-time integration into healthcare workflows.
• Improved clinical explainability with Grad-CAM visualizations, allowing healthcare professionals to easily understand and confirm model predictions for better decision-making.
• Implemented a remote diagnostic solution with a user-friendly interface for real-time dermoscopic image upload and analysis, supporting timely interventions in clinical settings.
Figure 1: Proposed methodology.

The remainder of this paper is structured as follows. Section 2 reviews related work. Section 3 details the materials, preprocessing pipeline, and hybrid model. Section 4 presents experimental results and statistical analyses. Section 5 discusses findings, limitations, and clinical implications. Finally, Section 6 concludes and outlines future research directions.
2. Related works
2.1. Transfer learning approaches
Skin cancer detection has been extensively studied using traditional ML and DL techniques. Monika et al. 42 and Javaid et al. 43 applied preprocessing and segmentation methods on ISIC datasets, achieving accuracies of 96.25% and 93.89%, respectively. However, both approaches lacked robust generalization to diverse clinical environments. To enhance classification performance, ensemble learning was explored by Kausar et al., 25 who achieved up to 98.6% accuracy using weighted majority voting. However, the approach faced complexity and issues with misclassifying the minority class. Similarly, Bechelli et al. 44 found that VGG16 surpassed ML models but required large datasets to avoid overfitting. Tahir et al. 45 proposed DSCCNet, outperforming baselines with 94.17% accuracy and 99.43% AUC across three datasets. However, it lacked real-world clinical validation. Jain et al. 46 benchmarked six TL models on HAM10000, with XceptionNet leading at 90.48% accuracy, though without fine-tuning optimization.
Explainability and mobile integration were addressed by Gururaj et al. 47 and Mridha et al., 48 who incorporated XAI and mobile apps, respectively. Yet, their models showed moderate performance or lacked large-scale clinical validation. Hybrid models combining ML and CNNs were explored by Thanka et al., 49 achieving 99.1% accuracy with XGBoost and VGG16. However, increased computational complexity limited real-time deployment potential. Recent studies used ViTs. Xin et al. 50 proposed SkinTrans using contrastive learning, achieving over 94% accuracy on HAM10000 and a clinical dataset. ViTfSCD by Yang et al. 51 and a multi-class ViT framework by Arshed et al. 52 reported similar performance (92–94%), outperforming TL baselines. Despite high accuracy, ViT-based methods remain resource-intensive and require balanced datasets, challenging their real-time clinical applicability. In summary, while transfer learning models show high classification potential, limitations such as dataset imbalance, overfitting, and computational demands continue to hinder practical deployment in diverse healthcare settings.
2.2. Hybrid approaches
Hybrid models combining DL and ML techniques have been proposed to enhance skin cancer classification. Keerthana et al. 53 integrated CNN and SVM for automated melanoma detection, achieving accuracies of 88.02% and 87.43%, though limited by moderate performance and lack of dataset diversity. Similarly, Bassel et al. 54 utilized stacked classifiers with ResNet50, Xception, and VGG16, attaining 90.9% accuracy on a small 1,000-image dataset, which restricted generalizability. Similarly, Farea et al. 55 introduced a hybrid framework combining public datasets and an Artificial Bee Colony (ABC) optimization strategy, achieving 93.04% accuracy and 93.12% F1-score. While effective, the model’s complexity and computational overhead raise deployment concerns. Likewise, Sella et al. 56 combined lesion segmentation with transfer learning and SVM, improving performance by 4%, but only targeted binary classification, limiting broader applicability. In another study, Panthakkan et al. 57 proposed X-R50, a fusion of Xception and ResNet50, achieving 97.8% accuracy on HAM10000. However, the sliding window mechanism increases computational cost. Tajjour et al. 58 combined CNN and MLP for seven-class classification, attaining 96% AUC, yet heavily relied on color space conversions, limiting adaptability across datasets. Lastly, Majji et al. 59 presented a Lion Cat Swarm Optimization-based Deep Neuro Fuzzy Network (LCSO-DNFN), yielding 93.10% accuracy but incurring significant complexity. Collectively, while hybrid approaches improve classification accuracy, they often suffer from computational overhead, over-reliance on dataset-specific preprocessing, and limited generalization to real-world clinical scenarios.
2.3. Explainable artificial intelligence (XAI) in skin cancer detection
Recent studies integrating XAI into skin cancer detection pipelines have demonstrated promising performance, but they also highlight challenges related to clinical translation. Grignaffini et al. 60 and Abdulredah et al. 61 achieved accuracies of 98.41% and 99.86% using techniques such as Grad-CAM, SHAP, and LRP alongside CNN and SWNet models. However, their reliance on handcrafted features and the use of homogeneous datasets limit their clinical robustness. Similarly, Halder et al. 62 and Dagnaw et al. 63 explored fuzzy ensembles and CAM-based interpretability with ViTs and CNNs but encountered challenges regarding poor generalization, scalability, and small datasets. Further, Shah et al. 64 and Ieracitano et al. 65 implemented optimization and hybrid XAI approaches (including Grad-CAM, LIME, and SHAP) and achieved up to 98.5% accuracy. However, their reliance on handcrafted features and dependence on annotated masks limited their transparency and scalability.
Abbas et al. 66 employed VGG16 with LRP for interpretability and reached an accuracy of 93.29%, although they faced concerns regarding privacy and dataset diversity. Hamim et al., 67 Cino et al., 68 and Gamage et al. 69 utilized Grad-CAM and test-time augmentation, achieving over 97% accuracy with models such as DenseNet121, EfficientNet-B6, and Xception. However, these models lacked validation across different devices and out-of-distribution scenarios. Munjal et al. 70 achieved 94.3% accuracy using ResNet50 with Grad-CAM and LIME in the SkinSage XAI project, but their study lacked a multi-center evaluation. While XAI enhances interpretability in skin cancer diagnosis, common limitations persist, including an overreliance on handcrafted or segmentation-dependent features, limited dataset diversity, and insufficient validation in real-world clinical settings. These factors restrict broad deployment of these technologies.
2.4. Research gap analysis
Traditional CNN methods often struggle with class imbalance, leading to poor sensitivity for critical minority classes like melanoma. To tackle this, we use class-weighted loss and data augmentation to enhance minority class representation during training, ensuring balanced learning and improved sensitivity without compromising specificity. Our hybrid model also addresses concerns regarding computational efficiency and interpretability. EfficientNetB0 is used for high-quality feature extraction with fewer parameters, making it suitable for real-time deployment while maintaining accuracy. Unlike heavier models like EfficientNetB7, it strikes a balance between accuracy and computational demand. Integrating RF offers transparency in the classifier stage through feature-space importance and tree voting, but it lacks pixel-level lesion localization. For spatial explanations, we use Grad-CAM with the EfficientNetB0 backbone. Previous hybrid methods, such as CNN + SVM, have scalability issues in high-dimensional feature spaces. In contrast, RF effectively manages high-dimensional data and maintains transparent decision boundaries, thereby building trust in model predictions. A major limitation in prior works is the use of single-source datasets, which restrict generalization. Our study addresses this limitation by testing the model on multiple benchmark datasets, including HAM10000, ISIC 2019, and DermNet, thereby ensuring robust performance across diverse conditions, demographics, and lesion types. Furthermore, while many approaches remain theoretical, our study bridges the gap by deploying the hybrid model in a user-friendly web application. This allows healthcare professionals to upload dermoscopic images and receive real-time, interpretable predictions, simplifying early detection and informed clinical decision-making.
3. Methodology
3.1. Data description
We used a total of eight classes of skin diseases, sourced from the ISIC 2019 71–74 and DermNet 75 datasets. These classes include Melanoma (MEL), MN, BCC, AK, BK, DF, VASC, and SCC. The ISIC 2019 dataset consists of 25,331 JPEG images along with relevant metadata such as age, sex, and anatomic site. It includes 4,522 MEL images, 12,875 Melanocytic Nevi images, 867 AK images, 3,323 BCC images, 239 DF images, 253 Vascular Lesion images, 628 SCC images, and 2,624 BK images. In addition, the DermNet dataset contains more than 23,000 dermoscopic images documenting 643 skin diseases, categorized into a two-level taxonomy. This dataset contains a selection of classes overlapping with ISIC 2019, including MEL, AK, BK, DF, and VASC. Combining the datasets resulted in 30,990 images distributed among eight target classes, as shown in Figure 2.
Figure 2: Sample of the combined ISIC 2019 and DermNet dataset.
Figure 3 depicts our dataset's varied distribution across different skin disease classes. The "NV" category dominates with the highest count of 12,875, more than double the second-largest category, "MEL". Following these, "BK" and "BCC" are mid-range categories with 4,338 and 3,323 images, respectively. The smaller categories ("AK", "VASC", "SCC", and "DF") all have fewer than 2,500 images, with "DF" being the smallest at 564. This disparity, particularly the large gap between "NV" and the other categories, indicates a pronounced class imbalance. The dataset was divided into 80% training, 5% validation, and 15% testing sets. Table 1 presents the number of images per set.
Figure 3: Distribution of each skin cancer class.
Table 1: Number of images in the train, validation, and test splits of the combined dataset.
We also used the HAM10000 dataset, which consists of 10,015 images showcasing seven different types of skin lesions: AKIEC, BCC, BKL, DF, MEL, NV, and VASC. With a diverse range of skin cancer subtypes, shown in Figure 4, the dataset allows accurate differentiation between different types of lesions. A significant portion of the lesions in the dataset have been confirmed through histopathological analysis. The dataset includes 327 AKIEC images, 514 BCC images, 1,099 BKL images, 115 DF images, 1,113 MEL images, 6,705 NV images, and 142 VASC images. Both datasets were partitioned into three subsets for experimentation, allocating 80% for training, 5% for validation, and 15% for testing.
Figure 4: Sample of the HAM10000 dataset.
3.2. Data preprocessing
3.2.1. Resize and normalization
Initially, the images are resized to a standardized dimension of 224 × 224 pixels, employing bilinear interpolation, which adjusts pixel values based on the weighted averages of neighboring pixels. This resizing operation ensures uniformity in the input size, facilitating subsequent processing steps. 76 After resizing, min-max normalization is applied to scale the pixel values to a range between 0 and 1. It preserves the relationships between pixel values by maintaining the original distribution's structure, while ensuring no feature dominates due to larger value scales. This is especially important in DL models, where the activation functions are sensitive to input ranges. 77 Normalizing the data also accelerates convergence during training and helps prevent vanishing or exploding gradients. 78
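A minimal sketch of this step, assuming an OpenCV-based implementation (the function name and the small epsilon guard are illustrative):

```python
# Assumed preprocessing sketch: bilinear resize to 224x224 followed by
# min-max normalisation of pixel values to [0, 1].
import cv2
import numpy as np

def preprocess(image: np.ndarray, size: int = 224) -> np.ndarray:
    # Bilinear interpolation weights the four nearest neighbours of each pixel.
    resized = cv2.resize(image, (size, size), interpolation=cv2.INTER_LINEAR)
    resized = resized.astype(np.float32)
    # Min-max normalisation preserves the shape of the pixel distribution.
    lo, hi = resized.min(), resized.max()
    return (resized - lo) / (hi - lo + 1e-8)
```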
3.2.2. Dull razor
This method is implemented for hair removal, as shown in Figure 5. Hair artifacts in dermoscopic images can obstruct essential features of the skin lesion, such as its borders, colour patterns, and textures. The presence of hair can introduce noise and misleading information, resulting in incorrect feature extraction by the model. 79 Therefore, the Dull Razor method was chosen to effectively eliminate these artifacts while preserving the integrity of the lesion's visual features. This method entails a series of operations: a grayscale morphological operation, shape verification, replacement, and an adaptive median filter. 80
The grayscale morphological operation identifies hair pixels using morphological operations such as erosion and dilation. Subsequently, shape verification is performed to discern whether the identified pixels represent thin or long structures. Undesirable hair pixels are then replaced using bilinear interpolation, ensuring a seamless transition between regions. Finally, an adaptive median filter is applied to smooth the replaced hair pixels, reducing irregularities introduced during the replacement process.
Figure 5: Sample images after implementation of the Dull Razor technique.
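A rough OpenCV approximation of this procedure is sketched below; it collapses the shape-verification and adaptive-median-filter stages into thresholding and inpainting, so it illustrates the idea rather than reproducing the exact method.

```python
# Approximate Dull Razor-style hair removal (assumed implementation).
import cv2
import numpy as np

def remove_hair(image_bgr: np.ndarray) -> np.ndarray:
    gray = cv2.cvtColor(image_bgr, cv2.COLOR_BGR2GRAY)
    # Morphological black-hat highlights thin dark structures such as hairs.
    kernel = cv2.getStructuringElement(cv2.MORPH_RECT, (17, 17))
    blackhat = cv2.morphologyEx(gray, cv2.MORPH_BLACKHAT, kernel)
    # Threshold the black-hat response to obtain a binary hair mask.
    _, mask = cv2.threshold(blackhat, 10, 255, cv2.THRESH_BINARY)
    # Replace masked hair pixels by interpolating from their neighbourhood.
    return cv2.inpaint(image_bgr, mask, 3, cv2.INPAINT_TELEA)
```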
3.2.3. Anisotropic diffusion filter
Two anisotropic diffusion filters are employed in the preprocessing pipeline to enhance image quality and reduce noise. They were chosen for their effectiveness in reducing noise in homogeneous regions while preserving essential features, such as the edges of skin lesions. 81 Traditional smoothing techniques often blur the boundaries of skin lesions, which are key features for distinguishing between benign and malignant conditions. 82 The diffusion equation governing anisotropic diffusion filters is represented as Equation (1),

∂I/∂t = ∇ · (c(‖∇I‖) ∇I),    (1)

where I is the image, t is time, ∇ denotes the gradient operator, and c(·) is the diffusion coefficient, chosen to decay with gradient magnitude so that smoothing is suppressed at lesion edges.
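A compact NumPy sketch of Perona–Malik-style diffusion implementing Equation (1) is given below; the iteration count, conductance scale kappa, and step size gamma are illustrative assumptions, not the study's settings.

```python
# Perona-Malik anisotropic diffusion sketch (assumed parameters).
import numpy as np

def anisotropic_diffusion(img: np.ndarray, n_iter: int = 15,
                          kappa: float = 30.0, gamma: float = 0.15) -> np.ndarray:
    img = img.astype(np.float32).copy()
    for _ in range(n_iter):
        # Finite-difference gradients toward the four neighbours.
        dn = np.roll(img, -1, axis=0) - img
        ds = np.roll(img, 1, axis=0) - img
        de = np.roll(img, -1, axis=1) - img
        dw = np.roll(img, 1, axis=1) - img
        # Edge-stopping conductance c: small where gradients are large,
        # so lesion boundaries are preserved while flat regions are smoothed.
        cn, cs = np.exp(-(dn / kappa) ** 2), np.exp(-(ds / kappa) ** 2)
        ce, cw = np.exp(-(de / kappa) ** 2), np.exp(-(dw / kappa) ** 2)
        img += gamma * (cn * dn + cs * ds + ce * de + cw * dw)
    return img
```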
Combining the Dull Razor method and anisotropic diffusion filters ensures that dermoscopic images are free from artifacts such as hair and enhanced for critical features such as edges and textures. This preprocessing pipeline was selected to provide cleaner and more precise images for feature extraction, leading to improved model generalization and classification accuracy.
3.3. Data augmentation and imbalance handling
The augmentation process involved a variety of operations to diversify the dataset. Rotation was applied so that the model can correctly classify lesions regardless of their orientation. Horizontal and vertical shifting and flipping were employed to learn features across various transformations. Random cropping focused training on different regions of interest, encouraging the model to learn important local patterns rather than relying on the whole image context. Brightness and contrast alteration adjusted the images' overall intensity and contrast, ensuring that the lighting and contrast differences present in real-world medical images are well represented during training. Shear operations were used to improve the model's resilience to geometric transformations, and zoom operations magnified or shrank specific regions of the images. Sigmoid corrections adjusted the image intensity distribution, making important lesion details easier for the model to detect. Finally, stretching operations deformed images non-uniformly. A sketch of this policy is given below.
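The following Keras sketch is a hedged rendering of the augmentation policy described above; the parameter values are illustrative assumptions, not the study's exact settings.

```python
# Assumed training-set augmentation policy (illustrative parameters).
from tensorflow.keras.preprocessing.image import ImageDataGenerator

train_augmenter = ImageDataGenerator(
    rotation_range=30,            # random rotations
    width_shift_range=0.1,        # horizontal shifting
    height_shift_range=0.1,       # vertical shifting
    horizontal_flip=True,
    vertical_flip=True,
    brightness_range=(0.8, 1.2),  # brightness alteration
    shear_range=0.1,              # shear transformations
    zoom_range=0.2,               # zoom in/out
)
# Random cropping, sigmoid intensity correction, and non-uniform stretching
# would be supplied via a custom preprocessing_function. The augmenter is
# applied to the training set only, to avoid leakage into validation/test folds.
```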
3.4. Feature extraction and baseline DL models
This study employed two primary methods for feature extraction: ABCD analysis and GLCM. The ABCD method was selected because of its widespread use and effectiveness in dermatological practice. 84 This method evaluates four key diagnostic parameters: Asymmetry, Border irregularity, Color variation, and Diameter. These are well-established indicators critical for differentiating between benign and malignant lesions. Asymmetry captures the irregular shape of malignant lesions, while border irregularity identifies uneven edges often indicative of skin cancer. 85 Color variation is another critical marker, as malignant lesions typically exhibit multiple colors, unlike benign lesions, which tend to have uniform coloration. Finally, the diameter criterion, particularly lesions larger than 6 mm, is a common diagnostic measure for malignancy.
On the other hand, GLCM was selected to capture second-order statistical texture features. Texture is essential in distinguishing between different types of lesions, as malignant lesions may have rougher or more heterogeneous textures than benign ones. By focusing on texture, GLCM provides additional information that improves the model’s ability to differentiate lesions that may look similar in color and shape but differ significantly in texture.86,87 Key measures derived from GLCM include Energy, Entropy, Autocorrelation, Correlation, Homogeneity, and Contrast. Energy reflects the uniformity of texture patterns, with higher values indicating smoother textures typical of benign lesions. Entropy measures the randomness of the texture, which is often higher in malignant lesions. Other features, like Autocorrelation and Correlation, provide further insight into the regularity between neighboring pixel intensities. Homogeneity measures the similarity of pixel intensity distributions, helping to distinguish between benign lesions’ smooth texture and malignant ones’ more irregular texture. Contrast, which measures the intensity difference between a pixel and its neighbor, is beneficial for identifying rough or uneven textures that might indicate malignancy.
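The GLCM features described above can be computed with scikit-image; the sketch below is an assumed implementation in which the offsets and angles are examples, entropy is derived directly from the normalized co-occurrence matrix, and autocorrelation is omitted for brevity.

```python
# Illustrative GLCM texture-feature extraction (assumed implementation).
import numpy as np
from skimage.feature import graycomatrix, graycoprops

def glcm_features(gray_uint8: np.ndarray) -> dict:
    glcm = graycomatrix(gray_uint8, distances=[1],
                        angles=[0, np.pi / 4, np.pi / 2, 3 * np.pi / 4],
                        levels=256, symmetric=True, normed=True)
    # Built-in properties, averaged over the four angles.
    feats = {p: float(graycoprops(glcm, p).mean())
             for p in ("energy", "correlation", "homogeneity", "contrast")}
    # Entropy is not provided by graycoprops; each normalized slice is a
    # probability distribution, so compute it directly and average over angles.
    probs = glcm.astype(np.float64)
    feats["entropy"] = float(
        np.mean(-np.sum(probs * np.log2(probs + 1e-12), axis=(0, 1))))
    return feats
```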
3.5. Baseline configurations
We benchmark our approach using several widely used deep learning backbones, fine-tuned for a single-head multi-class setting. Our baseline CNN combines convolutional and pooling layers to transform local textures into higher-level representations, which are mapped to lesion classes through fully connected layers and a softmax output. To include a resource-efficient design, we also evaluate SqueezeNet, which uses fire modules to efficiently reduce and then restore channel dimensionality with 1 × 1 and 3 × 3 convolutions, achieving competitive performance with fewer parameters. Additionally, we assess attention-augmented and deeper residual architectures. SENet enhances standard CNNs by introducing Squeeze-and-Excitation (SE) blocks, which generate channel-wise reweighting to emphasize informative feature maps. ResNet50 serves as a robust deep baseline, utilizing identity skip connections for stable optimization and effective feature learning for lesion classification. Lastly, EfficientNetB0 is included as an efficient backbone that balances accuracy and computational cost, employing compound scaling and depthwise separable convolutions to achieve high performance with a relatively small parameter count.
3.6. Transformer models
To provide a strong Transformer-based baseline, we adopt state-of-the-art Multi-ViT and Cross-ViT architectures. In all experiments, they are fine-tuned on the training folds using the same data preprocessing, augmentation, and class-weighted loss strategy as the CNN-based models. The resulting performance provides a competitive Transformer-based reference against which the proposed EfficientNetB0+RF hybrid can be critically assessed.
Multi-ViT aggregates multiple ViT branches through learnable attention-based weighting (Figure 6). Each input lesion image is first partitioned into non-overlapping patches and linearly projected into a sequence of patch embeddings. Fixed positional embeddings are then added to preserve spatial layout before the tokens are forwarded to K parallel ViT encoder branches. Every branch follows a standard pre-norm Transformer design and consists of stacked blocks comprising layer normalization, multi-head self-attention, and a feed-forward MLP. From each branch k ∈ {1, …, K}, we obtain a latent representation z_k; the final prediction is produced by aggregating these branch representations with the learnable attention weights.
Figure 6: Multi-ViT architecture.
Cross-ViT explicitly models interactions between local and global lesion patterns (Figure 7). The input dermoscopic image is partitioned into patch tokens at two different scales: a small-patch branch (S branch) with patch size P_s that captures fine-grained details, and a large-patch branch (L branch) with patch size P_ℓ that focuses on global structure. In each branch, patches are linearly projected into a token sequence, augmented with a learnable CLS token and positional embeddings, and then processed by N stacked Transformer encoder blocks comprising multi-head self-attention and feed-forward layers. After intra-branch encoding, Cross-ViT introduces cross-attention modules that enable the CLS token of one branch to attend to the token sequence of the other branch. This bidirectional exchange allows the small-patch representation to be informed by global context and, conversely, the large-patch representation to be refined by high-resolution local cues.
Figure 7: Cross-ViT architecture.
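To make the cross-branch exchange concrete, the following Keras sketch shows one direction of the CLS-token cross-attention; the dimensions and names are illustrative assumptions rather than the exact architecture.

```python
# One direction of Cross-ViT-style CLS cross-attention (illustrative sketch).
import tensorflow as tf

def cls_cross_attention(cls_s, tokens_l, dim=256, heads=4):
    """cls_s: (B, 1, dim) CLS token of the small-patch branch;
    tokens_l: (B, N_l, dim) patch tokens of the large-patch branch."""
    attn = tf.keras.layers.MultiHeadAttention(num_heads=heads, key_dim=dim // heads)
    # The small-branch CLS token queries the large-branch token sequence,
    # injecting global context into the fine-grained representation.
    updated = attn(query=cls_s, value=tokens_l, key=tokens_l)
    # Residual connection plus layer normalization, as in the encoder blocks.
    return tf.keras.layers.LayerNormalization()(cls_s + updated)
```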
3.7. Proposed hybrid model
The proposed model is designed to combine the strong feature representation of a fine-tuned EfficientNetB0 backbone with the complementary decision behaviour of a tree-based classifier. Unlike the baseline configurations, where CNN and Transformer backbones are trained end-to-end with a single fully connected classification head, the hybrid approach decouples feature learning from final decision making. This deep+ML fusion aims to improve robustness under class imbalance and to provide an alternative inductive bias compared to purely deep architectures. Algorithm 1 presents the pseudocode of the proposed hybrid architecture. We first fine-tune EfficientNetB0 on the training folds using the class-weighted cross-entropy loss. Let C denote the number of classes and n_c the number of training samples of class c. We define a class weight w_c based on the inverse class frequency,

w_c = N / (C · n_c),    (2)

where N is the total number of training samples, and use it in the weighted cross-entropy loss

L = −Σ_{c=1}^{C} w_c y_c log p_c,    (3)

where y_c is the one-hot ground-truth indicator and p_c the predicted probability for class c. Equation (2) ensures that minority classes receive larger weights, and Equation (3) penalises their misclassification more heavily, following recent recommendations for imbalance-aware DL. After fine-tuning, we remove the final fully connected layer and use EfficientNetB0 purely as a feature extractor. For each input image x, we obtain a fixed-length embedding f(x) ∈ R^D from the global average pooling layer (D = 1280 for EfficientNetB0). Each training image x_i is mapped to its deep feature vector f_i = f(x_i), and the resulting pairs (f_i, y_i) are used to train the RF classifier.
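Under stated assumptions — a fine-tuned Keras model `finetuned` whose pooled embedding layer carries the default name "avg_pool", arrays x_train/y_train/x_test, and equal fusion weights, all of which are illustrative — the hybrid pipeline can be sketched as follows:

```python
# Hybrid EfficientNetB0 + Random Forest sketch with probability-level fusion.
import numpy as np
import tensorflow as tf
from sklearn.ensemble import RandomForestClassifier

# Feature extractor: everything up to the global-average-pooled embedding.
extractor = tf.keras.Model(finetuned.input,
                           finetuned.get_layer("avg_pool").output)

emb_train = extractor.predict(x_train)           # (n, 1280) embeddings
rf = RandomForestClassifier(n_estimators=500, class_weight="balanced")
rf.fit(emb_train, y_train)

# Probability-level fusion of the CNN softmax and the RF vote distribution;
# the 0.5/0.5 weighting is an assumed example, not a reported setting.
p_cnn = finetuned.predict(x_test)                # (m, C) softmax probabilities
p_rf = rf.predict_proba(extractor.predict(x_test))
y_pred = np.argmax(0.5 * p_cnn + 0.5 * p_rf, axis=1)
```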
3.8. Training parameters
Table: Detailed configuration of training hyperparameters with explored ranges and final selected values.
To mitigate the impact of class imbalance without introducing synthetic images, we adopt a class-weighted cross-entropy loss. Let y ∈ {1, …, C} denote the true class label for an input image and p_y the probability the model assigns to that class; the per-sample loss is then ℓ = −w_y log p_y, with the class weights w_y defined in Equation (2).
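A minimal Keras sketch of this weighting, assuming integer labels in y_train and the Equation (2) weights (the optimizer and epoch count are illustrative):

```python
# Inverse-frequency class weights (Equation (2)) passed to Keras fit().
import numpy as np

counts = np.bincount(y_train)                    # n_c per class
num_classes = len(counts)
class_weight = {c: len(y_train) / (num_classes * n)
                for c, n in enumerate(counts)}

model.compile(optimizer="adam", loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
# Keras multiplies each sample's loss by its class weight, so minority-class
# errors are penalised more heavily without oversampling.
model.fit(x_train, y_train, validation_data=(x_val, y_val),
          epochs=50, class_weight=class_weight)
```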
3.9. XAI integration
To address the interpretability challenges inherent in DL-based skin cancer classification, we integrated Grad-CAM into our proposed hybrid pipeline. It generates attention heatmaps that visually highlight regions within dermoscopic images that contribute most significantly to the model's predictions, providing clinicians with insight into the model's decision-making process. Grad-CAM was applied to the EfficientNetB0 backbone by utilizing the gradients of the predicted class score y^c with respect to the feature maps A^k from the last convolutional layer. Specifically, the importance weights are computed as

α_k^c = (1/Z) Σ_i Σ_j ∂y^c / ∂A_ij^k,

and the class-discriminative localization map is obtained as

L_Grad-CAM^c = ReLU(Σ_k α_k^c A^k),

where Z is the number of spatial locations in each feature map.
The attention map generated by Grad-CAM was upsampled and overlaid on the original dermoscopic image to identify the regions that influence the model's predictions. This method was applied to samples from the ISIC 2019 + DermNet dataset and the HAM10000 dataset to ensure interpretability across various imaging sources and lesion types. Attention maps were created for both correctly and incorrectly classified samples for qualitative assessment. Grad-CAM was explicitly applied to the EfficientNetB0 component of the hybrid model since RF lacks spatial interpretability. Accordingly, the reported maps should be interpreted as qualitative explanations of the CNN backbone features used by the hybrid decision, rather than as direct explanations of the RF component. Heatmaps were generated for the hybrid model's final predicted class, ensuring alignment with the final decision while utilizing EfficientNetB0's feature representations. This approach adds minimal computational overhead, maintaining real-time performance during inference and web deployment. Clinically, the visual explanations help dermatologists understand which lesion regions influenced predictions, aiding in cross-checking predictions before making clinical decisions. Grad-CAM visualizations also enable practitioners to evaluate lesion boundaries, heterogeneity, and color variations, aligning with diagnostic practices. Integrated into the web application, these overlays enable clinicians to interactively view and interpret model explanations within their diagnostic workflows, thereby enhancing the connection between high-performing AI systems and practical clinical applications.
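For illustration, a minimal Keras Grad-CAM routine consistent with the formulation above might look as follows; the layer name "top_conv" is the usual final convolution in Keras EfficientNetB0 and is an assumption here.

```python
# Minimal Grad-CAM sketch for a Keras EfficientNetB0 backbone (assumed names).
import numpy as np
import tensorflow as tf

def grad_cam(model, image, class_idx, conv_layer="top_conv"):
    grad_model = tf.keras.Model(
        model.input, [model.get_layer(conv_layer).output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[None, ...])
        score = preds[:, class_idx]                 # predicted class score y^c
    grads = tape.gradient(score, conv_out)
    alpha = tf.reduce_mean(grads, axis=(1, 2))      # importance weights alpha_k^c
    cam = tf.nn.relu(tf.einsum("bijk,bk->bij", conv_out, alpha))[0]
    cam = cam / (tf.reduce_max(cam) + 1e-8)         # normalise to [0, 1]
    # Upsample to the input resolution for overlaying on the image.
    return tf.image.resize(cam[..., None], image.shape[:2]).numpy().squeeze()
```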
4. Experimental results
4.1. System implementation
The implemented frameworks underwent thorough testing and evaluation under specific software and hardware configurations. The operating system employed was Windows 10 Pro, and Jupyter Notebook served as the primary environment for code execution. The system operated on a 12th Gen Intel(R) Core(TM) i7-12700K processor running at 3.61 GHz, coupled with 32.0 GB of installed RAM. Operating on a 64-bit architecture with an x64-based processor, the system was equipped with an NVIDIA® GeForce RTX 3060 Ti Twin Edge featuring 8 GB of GDDR6X memory for graphics processing.
4.2. Evaluation metrics
Key metrics such as accuracy, precision, recall, and F1-score are computed using a macro-averaged (MA) approach for multiclass classification. This method averages the metric values calculated independently for each class, providing a balanced evaluation regardless of class size or prevalence. Here, N represents the total number of classes, and TP_i, TN_i, FP_i, and FN_i denote the numbers of true positives, true negatives, false positives, and false negatives for class i:

Accuracy = (1/N) Σ_{i=1}^{N} (TP_i + TN_i) / (TP_i + TN_i + FP_i + FN_i)    (4)
Precision = (1/N) Σ_{i=1}^{N} TP_i / (TP_i + FP_i)    (5)
Recall = (1/N) Σ_{i=1}^{N} TP_i / (TP_i + FN_i)    (6)
F1 = (1/N) Σ_{i=1}^{N} 2 · Precision_i · Recall_i / (Precision_i + Recall_i)    (7)

Accuracy assesses the classifier's overall correctness by averaging per-class accuracy (Equation (4)). Precision averages the positive predictive value of each class (Equation (5)), while recall averages the true positive rate (Equation (6)). Finally, the F1-score combines precision and recall through their harmonic mean to offer a balanced evaluation of the classifier's performance (Equation (7)).
For a robust and unbiased evaluation, we adopted a lesion-wise stratified 5-fold cross-validation protocol. All images belonging to the same lesion were assigned to the same fold to prevent information leakage across training and validation sets. The folds were constructed in a stratified manner so that the class distribution in each fold closely matched the overall dataset distribution.88,89 For each cross-validation run, four folds were used for training and the remaining fold was held out exclusively for validation. All geometric and photometric augmentations, as well as the class-weighted cross-entropy loss, were applied only to the training images of each run. The validation folds were neither augmented nor re-weighted and were always evaluated using the original, unmodified images. This protocol ensures that no augmented or re-weighted samples leak into the validation process and that the reported performance faithfully reflects generalization to unseen data.
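This protocol maps naturally onto scikit-learn's StratifiedGroupKFold; the sketch below assumes arrays images, labels, and per-image lesion_ids (e.g., HAM10000's lesion_id column), all of which are illustrative names.

```python
# Lesion-wise stratified 5-fold cross-validation sketch.
from sklearn.model_selection import StratifiedGroupKFold

cv = StratifiedGroupKFold(n_splits=5, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(
        cv.split(X=images, y=labels, groups=lesion_ids)):
    # All images of a lesion share a group ID, so they land in the same fold.
    # Augmentation and class re-weighting are applied to the training indices
    # only; the validation fold stays unmodified.
    train_images, val_images = images[train_idx], images[val_idx]
```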
4.3. Performance comparison
Table: Cross-validation results for experimental classifiers on the combined dataset.
Table: Cross-validation results with standard deviations for each metric on the HAM10000 dataset.
Table: Model performance comparison across evaluation metrics (with std) on the experimental datasets.
Table: Comparison of different optimizers for the proposed hybrid EfficientNetB0+RF model on HAM10000 and the combined dataset. Values are reported as mean ± standard deviation over 5-fold cross-validation.
Table: Paired t-test results for model comparison on the combined dataset.
Table: Paired t-test results for model comparison on the HAM10000 dataset.
Table: Model complexity and training efficiency on the experimental datasets.
EfficientNetB0 balances efficiency and performance well, containing 5.3 million parameters and requiring 0.39 GFLOPs. Figure 9 shows competitive training times (3.1 hours for the combined dataset, 1.3 hours for HAM10000) and low inference latency (3.4 ms/image), confirming its status as a top choice for skin cancer classification applications. Adding an RF head to EfficientNetB0 increases the parameter count slightly to 5.4 million and FLOPs to 0.41G, along with a minor rise in training and inference time. However, the minimal increase (+0.3 ms/img) is justified by the significant performance improvement shown in Table 11.
Figure 9: Radar comparison of model complexity and efficiency across experimental models.
Table: Comparison of different ML classifiers on the EfficientNetB0 feature extractor.
XGBoost outperforms both LR and SVM. It leverages sequential decision trees to correct errors and adaptively manage feature importance, showing consistent classification scores on both datasets. As shown in Figure 10, EffNetB0 + RF delivers the best results, with superior recall (98.78% on the combined dataset, 94.85% on HAM10000). RF trains multiple decision trees on diverse bootstrap samples, modeling complex patterns while reducing overfitting. Its performance is particularly valuable in medical applications where accurately identifying positive cases is crucial.
Figure 10: Overhead component effectiveness analysis of hybrid models with the EfficientNetB0 feature extractor.
Table: Incremental effect of individual preprocessing steps and the RF head on EfficientNetB0.
Table: Effect of preprocessing on the proposed model.
Table: Cross-dataset generalization analysis.
Table: Robustness of the EfficientNetB0+RF model under different noise and transformation conditions.
Table: Per-class performance of MultiViT and CrossViT on HAM10000 and the combined dataset. Values are reported as MA metrics over 5-fold cross-validation.
Note. Lower scores for less frequent classes such as DF and VASC are consistent with their higher intra-class variability and more ambiguous lesion boundaries, which also contribute to noisier and less localised Grad-CAM maps for these categories.
On the combined dataset, both models deliver very strong performance, with MA F1 values of 97.7% and 98.0%, respectively, and weighted F1 values exceeding 98%. Again, the relative advantage alternates across classes: MultiViT slightly outperforms CrossViT for some categories, while CrossViT attains marginally higher F1 for AK, DF, VASC, and SCC. However, small performance drops remain visible for the more ambiguous classes. The melanoma-only metrics provide a more clinically focused perspective. For HAM10000, MultiViT achieves a sensitivity of 93.5% and an F1 of 93.7%, whereas CrossViT improves sensitivity slightly to 94.1% and F1 to 94.3%, with corresponding AUC values of 0.974 and 0.977, respectively. These results confirm that both ViT variants are capable of detecting melanoma with high sensitivity and discrimination, but also highlight that even strong models can misclassify a non-negligible fraction of cases.
4.4. Performance validation
The confusion matrix for the hybrid model (Figure 12) indicates excellent performance on the combined dataset, particularly for conditions such as NV and MEL, with 1,737 and 689 correct predictions, respectively. BCC and BK also demonstrate high accuracy, with 508 and 639 correct classifications. Despite fewer examples, the DF and VASC classes show impressive predictive accuracy, highlighting the model's capability across various skin conditions. However, the model does have some weaknesses, particularly in distinguishing between conditions with similar dermatological features, such as AK and SCC. Misclassifications are relatively infrequent but do occur, such as MEL being confused with NV and SCC, or AK being confused with BCC and SCC.
Figure 12: Confusion matrices of EfficientNetB0+RF on (a) the combined and (b) HAM10000 datasets.
On HAM10000, the model exhibits its highest error rates with AK at 22.03% and DF at 16.67%, indicating significant challenges in correctly classifying these less prevalent types. VASC also shows a relatively high error rate of 9.09%, suggesting difficulties in distinguishing these from other classes. On the other hand, the model performs much better with common skin lesions such as NV, BK, and BCC, with error rates of 3.98%, 3.68%, and 5.19% respectively. These findings indicate that the model is highly effective in identifying lesions that occur more frequently in the dataset.
Figure 13 illustrates the hybrid model's accuracy and loss over 50 epochs. The training and validation accuracy curves gradually increase and then level off, indicating that the model effectively learns the patterns in the training data. The close alignment of the training and validation accuracy curves across all folds suggests that the model generalizes well to new data. Similarly, the loss curves steadily decrease and stabilize, reflecting consistent learning progress. The minimal gap between training and validation loss further supports the absence of overfitting. Slight fluctuations in the validation loss can be attributed to variations in data characteristics across folds rather than model instability.
Figure 13: Learning curve of EfficientNetB0+RF for the combined dataset.
The learning curves for the hybrid model trained on the HAM10000 dataset (Figure 14) demonstrate effective generalization. The training accuracy for all five folds steadily increases from initial values around 0.70–0.75 to surpass 0.96 by the 50th epoch, indicating that the model optimizes well on the training data. Validation accuracy, while more variable, shows a similar upward trend, starting at approximately 0.80 and reaching up to 0.96 in some folds. Correspondingly, training loss decreases sharply in early epochs and levels off below 0.5, while validation loss, despite being more erratic, generally trends downward, stabilizing after the 30th epoch. Occasional spikes in validation loss highlight areas for potential model tuning to enhance stability. Both curves suggest that the model is learning effectively without substantial overfitting, which is promising for real-world applications.
Figure 14: Learning curve of EfficientNetB0+RF for the HAM10000 dataset.
The ROC-AUC analysis of the EfficientNetB0+RF model shows strong classification performance on both datasets. On the combined dataset (Figure 15(a)), the model achieves AUC scores between 0.97 and 0.99 across all classes, with lesion types such as MEL, BCC, and BK reaching an AUC of 0.99. Even the lowest-scoring class, VASC, maintains a high AUC of 0.97, indicating solid prediction quality. On the HAM10000 dataset (Figure 15(b)), the model also performs strongly, with all classes scoring between 0.97 and 0.99: BCC and VASC reach 0.99, while BKL, at 0.97, still exceeds acceptable benchmarks. The ROC curves are steep and concentrated in the top-left area, indicating high sensitivity and specificity for all categories. The model handles the various lesion types and dataset differences with minimal variance in AUC scores, suggesting that the RF classifier on EfficientNetB0 features captures subtle inter-class differences effectively without overfitting. Its reliable performance across larger and more standardized datasets supports its robustness in dermatological classification tasks.
Figure 15: ROC-AUC curves of EfficientNetB0+RF for (a) the combined dataset and (b) the HAM10000 dataset.
4.5. Explainability analysis
We conducted an in-depth qualitative and exploratory analysis using Grad-CAM on the experimental datasets. Figures 16 and 17 present Grad-CAM overlays for each class in the HAM10000 and combined datasets, respectively. For correctly classified samples, the highlighted regions often appear lesion-centric and frequently coincide with visually salient attributes that clinicians consider relevant (e.g., border irregularity, color variation, and asymmetry). For example, in MEL predictions, the heatmaps tend to emphasize darker, irregular margins and heterogeneous pigmentation, which are commonly associated with malignancy. Similarly, in BCC, attention maps sometimes accentuate nodular structures and translucent peripheral zones, consistent with reported clinical appearances. In benign cases such as NV and DF, the overlays often concentrate around uniformly pigmented central regions while placing less emphasis on surrounding skin, suggesting that the model can rely on localized cues for non-malignant lesions. Notably, for AK and SCC, the heatmaps occasionally capture subtle texture changes and localized erythema-like regions, which may reflect fine-grained patterns needed to separate pre-malignant from malignant conditions, although these cues are visually variable.
Figure 16: Grad-CAM visualizations for each lesion type in the HAM10000 dataset.
Figure 17: Grad-CAM visualizations for each lesion type in the combined dataset.

We also examined misclassified samples (Figure 18). In these cases, Grad-CAM suggests that the model still frequently attends to lesion-adjacent regions, but class-level visual overlap may contribute to confusion. For instance, an AK sample misclassified as DF shows diffuse erythema and surface granularity; the heatmap highlights texture-like areas that could plausibly resemble DF-related appearance. A BCC sample misclassified as VASC highlights central hemorrhagic regions, suggesting that vascular-like pigmentation may have influenced the decision. For MEL misclassified as SCC, the heatmap focuses on irregular pigmentation typical of MEL, while hyperkeratotic structures might have biased the prediction toward SCC. NV misclassified as SCC also shows attention near darker peripheral structures that can be interpreted as suspicious cues. For VASC misclassified as MEL, the model appears to prioritize darker-pigment regions, suggesting that color intensity may at times dominate over structural cues. Finally, for BKL misclassified as DF, the highlighted, uniform pigmented zones reflect the ongoing difficulty in distinguishing pigmented benign keratoses from DF in borderline presentations.
Figure 18: Grad-CAM visualizations for incorrectly classified samples of each lesion type.
Table: Class-wise Grad-CAM interpretability summary on the combined dataset. Grad-CAM focus was assessed for correctly classified samples, quantifying the percentage of cases where the attention maps primarily highlighted lesion-centric regions versus non-lesion or artifact regions, and whether the focus aligned with clinically relevant features.
We further observed that Grad-CAM visualizations can be less stable for underrepresented or visually diverse categories such as SCC, BKL, VASC, and DF. In these cases, saliency maps may appear noisier, less spatially concentrated, and harder to interpret clinically. This behavior is plausibly influenced by limited training samples, higher intra-class appearance variation, and ambiguous lesion boundaries, which can hinder the emergence of consistent class-specific activation patterns. Future work will explore whether targeted augmentation, class-aware regularization, higher-resolution feature extraction, or complementary explanation methods can yield more reliable and clinically interpretable visual evidence for these challenging categories.
It is crucial to recognize the known limitations of Grad-CAM, including coarse spatial resolution due to downsampling in convolutional backbones, sensitivity to layer selection, and the fact that saliency maps provide correlational rather than causal explanations. Nevertheless, the frequent overlap between highlighted regions and lesion-relevant structures in many test cases supports a plausible interpretation that the model often relies on meaningful image cues rather than spurious background signals. In practical settings, these overlays are best viewed as assistive visual summaries that can help clinicians quickly assess which regions may have influenced a prediction, without replacing clinical judgment or prospective validation in real workflows.
4.6. Web application
The application is built on the Flask framework and uses the hybrid model for classification. As shown in Figure 19, it has a user-friendly interface that allows users to upload dermoscopic images and receive fast, reliable classifications. The application is designed to integrate seamlessly into clinical workflows, providing professionals with accurate and interpretable classifications of skin lesions. By automating the initial classification process, the application allows clinicians to focus their time on more complex cases, improving overall efficiency in busy environments. Table 15 reports the average time per image over 500 test samples for the end-to-end web application. With Grad-CAM enabled, the average inference time per image rose from 47.3 ms to 54.8 ms in the standalone pipeline and from 65.4 ms to 74.6 ms in the full web application workflow, increases of about 15.9% and 14.0%, respectively. This increase is acceptable given the added benefit of Grad-CAM, which allows clinicians to visualize lesion heatmaps in real time without significantly obstructing decision-making. The system keeps inference times under 100 ms per image, ensuring efficient patient throughput and smooth integration into teledermatology or in-clinic screenings.
Figure 19: Explainable web application interface.
Table 15: Inference time analysis with and without Grad-CAM integration.
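A minimal Flask sketch of the upload-and-predict endpoint is shown below; the route, the classify helper, and CLASS_NAMES are illustrative assumptions rather than the deployed code, and preprocess refers to the resize-and-normalize step sketched in Section 3.2.

```python
# Illustrative Flask endpoint for the web application (assumed names).
import numpy as np
from flask import Flask, request, jsonify
from PIL import Image

app = Flask(__name__)

@app.route("/predict", methods=["POST"])
def predict():
    file = request.files["image"]                  # uploaded dermoscopic image
    img = np.asarray(Image.open(file).convert("RGB"))
    img = preprocess(img)                          # resize + normalise (Sec. 3.2)
    probs, heatmap = classify(img)                 # hybrid model + Grad-CAM
    return jsonify({
        "prediction": CLASS_NAMES[int(np.argmax(probs))],
        "probabilities": probs.tolist(),
        "gradcam": heatmap.tolist(),               # overlay rendered client-side
    })

if __name__ == "__main__":
    app.run()
```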
The application is crucial in improving early detection rates for skin cancer. It provides fast and accurate identification of suspicious lesions, which helps in timely intervention, especially for aggressive cancers like melanoma. The tool’s immediate feedback reduces the time between assessment and treatment, ultimately improving patient outcomes. The application is designed for use in various healthcare settings, from large hospitals to smaller clinics, and can be accessed remotely by patients for preliminary assessments. This broadens access to diagnostic resources, especially in regions with limited availability of dermatology specialists, and promotes proactive healthcare practices. Additionally, the web application serves as an educational and research tool. It can be used by students, researchers, and educators to study and analyze skin cancer classification techniques, making it a versatile resource that supports both clinical practice and academic inquiry.
4.7. SOTA comparison
Table: Comparative analysis with prior research works.
SWNet 61 achieved top performance with 99.86% accuracy and 99.95% F1-score on large-scale datasets (ISIC19/20). SM-ViT 69 followed with 98.37% accuracy and 99.11% F1-score. CNN-PSO-ML 64 and DenseNet121 67 also had high accuracies of 98.5% and 98.0%, respectively, with F1-scores over 97%. EfficientNet variants 68 achieved 97.58% accuracy across eight categories. TIxAI 65 recorded a lower accuracy of 90.1% on ISIC-2017 but contributed to quantitative trustworthiness in XAI evaluation. Fuzzy Rank-Based Ensembles 62 and ResNet50 in SkinSage XAI 70 reached 94.94% and 94.3% accuracy, balancing multi-class classification and interpretability. None of the previous studies have implemented a web-based tool. Our application bridges the gap between research and real-world clinical practice. Its ease of use and the interpretability provided by RF make it especially valuable for healthcare professionals seeking quick and reliable decisions in clinical workflows.
Our hybrid approach has a key strength in handling class imbalance. Models like DenseNet169 and VGG16 often perform well on the majority classes but struggle to generalize to minority classes, such as melanoma. To address this issue, we have included RF in our model, which leverages ensemble learning and bootstrapping to improve the model’s ability to balance predictions across all classes. This is evident in the consistently high F1-scores on both the ISIC 2019 and HAM10000 datasets, indicating that our model performs well on both majority and minority classes, unlike other state-of-the-art methods that show a drop in F1-scores for minority classes. Additionally, RF’s feature importance rankings contribute to better interpretability, providing a distinct advantage over more complex models like InceptionResNetV2 55 and Xception, 46 which, despite their strong performance, operate more like black-box models, making their decisions harder to interpret in a clinical setting.
It is also essential to consider the trade-offs in terms of computational complexity. Models like InceptionResNetV2 55 and Xception 46 have larger parameter sizes and require significantly longer training and inference times compared to our hybrid model. EfficientNetB0’s compound scaling ensures high performance with fewer parameters and reduced computational overhead, making it more efficient for real-world clinical use, where quick and reliable results are needed. For example, InceptionResNetV2 55 achieved 94.65-95.35% accuracy on the ISIC-2019 and PH2 datasets but at the cost of increased computational complexity and longer training times. In contrast, EfficientNetB0 + RF achieves comparable or superior accuracy with fewer resources, making it more suitable for environments with limited computational power.
Recent approaches such as ViTs50–52 have shown promise in image classification tasks. While they have not been extensively tested on the same datasets, they offer potential strengths in scaling and attention mechanisms. However, they come with trade-offs, such as higher computational cost and longer inference times, 90 which may limit their practicality in clinical settings where real-time results are needed. Our model, by contrast, strikes a balance between accuracy, generalization, and computational efficiency, making it a strong choice for both large hospitals and smaller environments.
5. Discussion
This study demonstrates that hybrid models can outperform traditional transfer learning approaches in the task of skin cancer classification. The proposed EfficientNetB0+RF hybrid model achieved superior performance compared to standalone EfficientNetB0 and other transfer learning models across multiple datasets, including HAM10000 and the combined ISIC 2019 + DermNet datasets. The success of this hybrid architecture can be attributed to several key factors. Firstly, EfficientNetB0 effectively balances model size and accuracy, capturing hierarchical feature representations with fewer parameters, thereby reducing computational overhead while maintaining high predictive performance. The integration of the RF component enhances decision stability at the classifier stage, while spatial interpretability is provided via Grad-CAM from the EfficientNetB0 backbone. This combination allows the model to generalize well to unseen and challenging cases, a capability reflected in its robust performance across diverse datasets. In addition, the augmentation and class-weighting strategies mitigated the effects of class imbalance and data scarcity, ensuring that the model could learn from a more representative set of lesion variations.
A major contribution of this study is the integration of XAI to provide lesion-focused visual explanations that illustrate which image regions may be influencing each prediction. Rather than definitively validating the decision process, these overlays suggest that the model often relies on diagnostically relevant cues instead of obvious background regions. The Grad-CAM visualizations also provide an exploratory way to examine failure modes: in some misclassified cases, the highlighted regions appear less well aligned with the lesion, indicating possible sensitivity to artifacts or visually ambiguous patterns. Such observations can help prioritize refinements in preprocessing and data augmentation. It may also support a human-in-the-loop workflow by improving transparency during review. Through the web application, clinicians can inspect heatmaps alongside predictions during inference and use them as supporting context when judging whether the model's output aligns with clinical expectations. This can promote more cautious use and ease responsible deployment. In addition, the model's efficiency and the web-based interface make the system more feasible for smaller clinics and low-resource settings, allowing practical image upload and rapid screening support without requiring advanced computing infrastructure.
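For reference, Grad-CAM overlays of the kind surfaced in the web application can be produced in a few lines of TensorFlow. This is a minimal sketch assuming a Keras EfficientNetB0 model; the layer name `"top_conv"` is an assumption for its final convolutional layer and should be verified with `model.summary()`.

```python
import numpy as np
import tensorflow as tf

def grad_cam(model, image, class_index, conv_layer_name="top_conv"):
    """Return a [0, 1] heatmap for one preprocessed image of shape (H, W, 3)."""
    # Expose both the target feature maps and the final predictions.
    grad_model = tf.keras.models.Model(
        model.inputs, [model.get_layer(conv_layer_name).output, model.output]
    )
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis])
        class_score = preds[:, class_index]
    grads = tape.gradient(class_score, conv_out)     # d(score)/d(feature map)
    weights = tf.reduce_mean(grads, axis=(1, 2))     # global-average-pooled grads
    cam = tf.reduce_sum(weights[:, None, None, :] * conv_out, axis=-1)[0]
    cam = tf.nn.relu(cam)                            # keep positive evidence only
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()
```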
Despite the promising results, several limitations must be acknowledged. One is the absence of systematic hyperparameter optimization; future studies could apply grid search or Bayesian optimization to refine model parameters further. Class imbalance, while partially addressed using class-weighted loss, remains a persistent challenge: the weighting scheme may not fully capture the variability present in real-world lesion presentations, and naive resampling alternatives could introduce oversampling artifacts. Future research could explore cost-sensitive learning, focal loss (sketched below), or more advanced oversampling techniques to further mitigate the impact of class imbalance. Furthermore, while the hybrid model demonstrated high accuracy in classification tasks, its applicability to other computer-vision tasks such as object detection, semantic segmentation, or multimodal medical analysis has not yet been explored; extending the hybrid approach to these domains may reveal similar benefits in performance and interpretability. Finally, the effectiveness of transfer learning diminishes in highly specialized domains where available datasets are limited in diversity and size. To mitigate overfitting risks, future studies should consider domain-specific data augmentation to expand training datasets while maintaining clinical relevance.
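Among the mitigation strategies listed above, focal loss is particularly simple to prototype. The following is a minimal sketch assuming TensorFlow with one-hot labels; the focusing parameter gamma = 2.0 is the commonly cited default, not a value we have tuned.

```python
import tensorflow as tf

def categorical_focal_loss(gamma=2.0):
    """Focal loss down-weights well-classified examples so that hard,
    often minority-class, lesions dominate the gradient signal."""
    def loss(y_true, y_pred):
        y_pred = tf.clip_by_value(y_pred, 1e-7, 1.0 - 1e-7)
        ce = -y_true * tf.math.log(y_pred)       # per-class cross-entropy
        weight = tf.pow(1.0 - y_pred, gamma)     # (1 - p_t)^gamma modulation
        return tf.reduce_sum(weight * ce, axis=-1)
    return loss

# Hypothetical usage with a compiled Keras model:
# model.compile(optimizer="adam", loss=categorical_focal_loss(gamma=2.0))
```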
The datasets used in this study exhibit potential sources of bias. Most images correspond to relatively limited ranges of skin tone and acquisition conditions and are predominantly collected from a small number of centres and devices. As a result, the model may implicitly learn dataset-specific characteristics and might not generalize equally well to underrepresented populations, imaging devices, or geographical regions. Future work will therefore focus on curating and integrating more diverse, multi-ethnic skin lesion datasets, including images acquired with different dermoscopy systems, to systematically assess and mitigate demographic and device-related bias. The present evaluation is restricted to internal cross-validation and does not include external or multi-centre validation: all experiments were conducted on publicly available datasets and a single institutional collection, without testing on independent cohorts from different hospitals or dermoscopy systems.
The reported performance should be interpreted as evidence of promise rather than definitive proof of clinical robustness. As a next step, we plan to conduct external validation on independent multi-centre datasets and, ultimately, design prospective clinical studies that evaluate the system as a decision-support tool within real-world dermatology workflows. The RF head of the hybrid model remains a black box in terms of its internal decision boundaries. Although the Grad-CAM maps provide useful qualitative insights into which regions of the lesion drive the convolutional feature extractor, they do not fully capture the contribution of the RF component. Future work will explore hybrid interpretability strategies that combine saliency-based methods for the convolutional backbone with feature-attribution techniques for tree-based models, with the goal of delivering end-to-end explanations that are both faithful to the hybrid pipeline and verifiable by dermatology experts.
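One candidate feature-attribution technique for the RF head is SHAP's TreeExplainer. The sketch below is illustrative only: it assumes the `shap` package and the fitted `rf` and embedding matrix `X_emb` from earlier, and its per-dimension attributions would still need to be mapped back to image space before they could complement the Grad-CAM maps.

```python
import shap

# rf:    fitted RandomForestClassifier on CNN embeddings
# X_emb: embeddings for the cases to be explained
explainer = shap.TreeExplainer(rf)
shap_values = explainer.shap_values(X_emb)  # per-class attributions
# (exact output shape depends on the shap version)

# An attribution quantifies how much each embedding dimension pushed a
# case's class probability up or down; large-magnitude dimensions
# indicate which learned features drove the RF's vote.
```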
6. Conclusion
Our proposed model shows strong performance and good generalization across the evaluated datasets, aided by effective preprocessing and explicit handling of class imbalance. A web application has been developed for real-time lesion classification with minimal workflow disruption. Grad-CAM visualizations provide a preliminary view of which regions may be influencing predictions and can be used to explore model behavior during review. Key limitations include dataset bias favoring lighter skin tones, which may affect generalizability to diverse populations. Future work will aim to expand and diversify training datasets, including through controlled synthetic data generation. Integrating multimodal information, such as clinical metadata and histopathology, could improve diagnostic accuracy. Overall, this research presents a partially interpretable and scalable solution that merges high diagnostic accuracy with practical applicability in data-driven skin cancer diagnosis.
Footnotes
Acknowledgements
The authors extend their appreciation to the Princess Nourah bint Abdulrahman University Researchers Supporting Project number (PNURSP2026R513), Princess Nourah bint Abdulrahman University, Riyadh, Saudi Arabia.
Ethical considerations
Ethical approval was not required, as this study uses openly available data and ethical review was carried out as part of the original studies that collected it.
Author contributions
Conceptualization, A.A.S., S.M.M.R.S., F.A., A.B.M., M.I.H.B., S.K., R.H., K.G.K.; Methodology, S.M.M.R.S., A.A.S., A.B.M., M.I.H.B., F.A., S.K.; Software, A.A.S., R.H., K.G.K.; Validation, T.J.A., M.I.H.B., S.M.M.R.S., K.G.K.; Formal analysis, T.J.A., F.A., A.B.M., M.I.H.B., S.K., R.H.; Investigation, A.A.S., S.M.M.R.S., F.A., A.B.M., K.G.K.; Resources, A.B.M., M.I.H.B., S.K., S.M.M.R.S., T.J.A.; Data curation, S.K., A.A.S., F.A., A.B.M.; Writing—original draft, A.A.S., S.M.M.R.S., F.A., A.B.M., M.I.H.B., S.K.; Writing—review & editing, A.A.S., S.M.M.R.S., R.H., T.J.A., K.G.K., M.A.M.; Visualization, S.M.M.R.S., F.A., A.B.M., M.I.H.B., S.K.; Supervision, T.J.A. and M.A.M.; Funding acquisition, M.A.M. All authors have read and agreed to the published version of the manuscript.
Funding
This study did not receive any financial support from public, commercial, or not-for-profit funding agencies.
Declaration of conflicting interests
The authors declare that the research was conducted in the absence of any commercial or financial relationships that could be construed as a potential conflict of interest.
Data Availability Statement
The datasets used and/or analysed during the current study are available from the corresponding author on reasonable request.
