Abstract
Objectives:
Early skin cancer detection in primary care settings is crucial for prognosis, yet clinicians often lack relevant training. Machine learning (ML) methods offer a potential solution to this dilemma. This study aimed to develop a neural network for the binary classification of skin lesions into malignant and benign categories using smartphone images and clinical data via a multimodal, transfer learning-based approach.
Methods:
We used the PAD-UFES-20 dataset, which included 2298 sets of lesion images. Three neural network models were developed: (1) a clinical data-based network, (2) an image-based network using a pre-trained DenseNet-121 and (3) a multimodal network combining clinical and image data. Models were tuned using Bayesian Optimisation HyperBand across 5-fold cross-validation. Model performance was evaluated using AUC-ROC, average precision, Brier score, calibration curve metrics, Matthews correlation coefficient (MCC), sensitivity and specificity. Model explainability was explored using permutation importance and Grad-CAM.
Results:
During cross-validation, the multimodal network achieved an AUC-ROC of 0.91 (95% confidence interval [CI] 0.88-0.93) and a Brier score of 0.15 (95% CI 0.11-0.19). During internal validation, it retained an AUC-ROC of 0.91 and a Brier score of 0.12. The multimodal network outperformed the unimodal models on threshold-independent metrics and at the MCC-optimised threshold, but it had classification performance similar to that of the image-only model at high-sensitivity thresholds. Analysis of permutation importance showed that key clinical features influential for the clinical data-based network included bleeding, lesion elevation, patient age and recent lesion growth. Grad-CAM visualisations showed that the image-based network focused on lesioned regions during classification rather than background artefacts.
Conclusions:
A transfer learning-based, multimodal neural network can accurately identify malignant skin lesions from smartphone images and clinical data. External validation with larger, more diverse datasets is needed to assess the model’s generalisability and support clinical adoption.
Introduction
Skin cancers – including melanomas, squamous cell carcinomas (SCCs) and basal cell carcinomas (BCCs) – are among the most commonly diagnosed malignancies worldwide and account for 1 in every 3 cancer diagnoses.1,2 Melanomas, which constitute only around 2% of all skin cancer cases, were estimated to cause 57 000 deaths globally in 2020.3,4 Although non-melanoma skin cancers are typically perceived as less dangerous, they can also pose significant health burdens. For instance, SCCs have been reported to cause mortality rates comparable to melanomas in parts of the southern and central United States.5 While BCCs rarely metastasise, they can cause significant local tissue damage and result in cosmetic disfigurements and functional impairments.6
The prognosis of skin cancers is highly dependent on lesion type, depth and stage – information that can only be obtained through examinations and biopsies by dermatologists. Thus, timely referral of suspicious lesions from primary care to dermatology is essential for early diagnosis and treatment. Yet, studies indicate that primary care physicians are generally inadequately trained in skin cancer detection and management.7,8 While targeted educational programmes have been proposed,9,10 these curricula are often limited in length due to the busy schedules of practising physicians.10 Additionally, the effectiveness of such educational interventions is poorly evaluated in published studies, and it is unknown whether the knowledge gained from these curricula can be retained long-term.9
Machine learning (ML) offers an alternative strategy by enabling systematic triage of suspicious skin lesions using images. Most ML efforts to date have focused on dermoscopic images, which offer consistent, clearly segmented and well-lit lesions for interpretation. Computer vision-based models, such as convolutional neural networks (CNNs), perform well under these conditions. A previous investigation by Angeline et al using a 2-stage CNN reached an accuracy of over 97% when identifying melanoma lesions from dermoscopic images.11 Similarly, Alshahrani et al developed a neural network model for multiclass classification of dermoscopic images into different skin cancer subtypes and achieved an area under the receiver operating characteristic curve (AUC-ROC) of 0.96.12
However, while dermatologists are well-trained in dermoscopy, patients and primary care physicians – especially those in areas with limited access to specialised dermatology services – may not have the equipment and skills necessary to perform such an examination.13 Instead, ML models targeting these non-specialised users must rely on images captured with smartphones and portable cameras, which are prone to variable lighting, distortion, occlusions (eg, hair and tattoos), and poor resolution, all of which can degrade model performance. A recent investigation by Rios-Duarte et al found that the performance difference between non-dermoscopy and dermoscopy skin cancer detection models can be greater than 20%, with dermoscopy-based models yielding an AUC-ROC of 0.87 versus 0.66 from models trained on digital camera photos of the same patient cases.14
Another limitation of existing ML approaches is their sole reliance on imaging data.15-19 In clinical practice, dermatologists supplement their visual assessments with patient history, risk factors, demographics, and symptoms. Image-based unimodal models omit this contextual information. Multimodal approaches, which combine images with structured clinical data, can more closely approximate real-world clinical reasoning and may offer more robust and generalisable screening performance, especially in ambiguous or low-quality imaging scenarios.20
Evidently, there remains a need to develop multimodal ML strategies to improve skin cancer detection using non-dermoscopy images. In this study, we developed and evaluated a novel, transfer learning-based multimodal neural network that integrates structured clinical variables with smartphone-captured lesion images, optimised using an efficient hyperparameter tuning approach known as Bayesian Optimisation HyperBand (BOHB). We describe the model's development and tuning process, compare its performance against unimodal clinical data- and image-based models and examine its calibration, explainability and fairness across skin phototypes.
Definition of terms
Transfer learning
Transfer learning refers to the adaptation of a pre-trained neural network for a new task.21 For example, pre-training a neural network to classify a random assortment of images – such as animals, objects, and landscapes – helps the model learn to extract basic visual features like edges, textures, shapes, and luminosity changes. These learned capabilities can then be repurposed for more specialised tasks, such as classifying skin lesions, even with relatively small datasets of lesion images and limited computational resources.
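As a minimal illustration (a sketch using the keras API employed later in this study), a pre-trained feature extractor can be loaded and frozen in a few lines:

```python
import keras

# Load a DenseNet-121 pre-trained on ImageNet, discarding its original
# 1000-class classifier head so that a new task-specific head can be attached.
base = keras.applications.DenseNet121(
    weights="imagenet", include_top=False, input_shape=(224, 224, 3)
)
base.trainable = False  # freeze pre-trained weights; only the new head will train
```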
Multimodal learning
Multimodal learning involves integrating multiple types of input data into a unified neural network.20,22 In our model, we combined tabular clinical data – including patient demographics, histories, symptoms, and lesion descriptors – with imaging features extracted from lesion photographs. This approach reflects real-world diagnostic processes, where dermatologists use visual inspection in conjunction with clinical context to make diagnostic decisions.
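A conceptual sketch of this fusion, with placeholder embedding sizes chosen purely for illustration:

```python
import keras

# Placeholder embedding sizes, for illustration only
image_embedding = keras.Input(shape=(1024,), name="image_features")
clinical_embedding = keras.Input(shape=(16,), name="clinical_features")

# Fuse the two modalities and predict a single malignancy probability
fused = keras.layers.Concatenate()([image_embedding, clinical_embedding])
probability = keras.layers.Dense(1, activation="sigmoid")(fused)
fusion_head = keras.Model([image_embedding, clinical_embedding], probability)
```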
Bayesian Optimisation HyperBand
BOHB is an advanced method for hyperparameter tuning.23 Hyperparameter tuning is a critical step in ML model development, in which optimal hyperparameters, or 'settings', are identified for ML algorithms.24 Many ML studies opt to manually tune the hyperparameters or simply use default configurations. These approaches are non-systematic and can severely compromise a model's performance and generalisability. Systematic methods like grid-searching, which tests all possible hyperparameter combinations, are exhaustive but computationally expensive. Random searching, which tests random combinations of hyperparameters, is empirically considered more efficient than grid-searching; it can identify top-performing hyperparameter combinations in as few as 60 trials.25
BOHB improves upon random searching by incorporating Bayesian optimisation.26-32 Bayesian optimisation starts with several rounds of random searching. As the search progresses, results from previous hyperparameter trials are used to inform which hyperparameter combinations to test next. In essence, Bayesian optimisation can be thought of as a more intelligent and 'guided' version of random search. On a technical level, Bayesian optimisation involves building a surrogate model to predict how changes in hyperparameter values affect performance. An acquisition function then uses this surrogate model to select promising combinations for further trials, and results from these trials are used to iteratively update the surrogate model. The process repeats until the pre-allocated resource budget is exhausted.33,34
BOHB also incorporates Hyperband.35 Hyperband starts by evaluating hyperparameter combinations using minimal resources, such as running models for only one epoch. Based on initial results, Hyperband retains only the top-performing configurations and terminates the rest. This process repeats, progressively allocating more resources (eg, running for more epochs) to fewer hyperparameter configurations until the overall resource budget is exhausted.36 Bayesian optimisation and Hyperband allow hyperparameters to be tuned efficiently and effectively for large, complex models such as the multimodal model developed in this study.
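The sketch below shows how BOHB might be wired up with Ray Tune, the library used in this study. The `train_cv` trainable, its `run_cross_validation` helper and the search space are illustrative placeholders, not our actual configuration:

```python
from ray import train, tune
from ray.tune.schedulers import HyperBandForBOHB
from ray.tune.search.bohb import TuneBOHB  # requires the ConfigSpace package


def train_cv(config):
    # Placeholder trainable: build and cross-validate a model using the
    # sampled hyperparameters, then report the mean validation loss.
    val_loss = run_cross_validation(config)  # hypothetical helper
    train.report({"val_loss": val_loss})


tuner = tune.Tuner(
    train_cv,
    param_space={  # illustrative search space
        "learning_rate": tune.loguniform(1e-5, 1e-2),
        "dropout": tune.uniform(0.0, 0.5),
        "batch_size": tune.choice([16, 32, 64]),
    },
    tune_config=tune.TuneConfig(
        metric="val_loss",
        mode="min",
        search_alg=TuneBOHB(),                  # Bayesian optimisation over trials
        scheduler=HyperBandForBOHB(max_t=100),  # successive halving of weak trials
        num_samples=250,                        # total trial budget
    ),
)
results = tuner.fit()
```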
Methods
The development of this predictive model was conducted and reported in accordance with the Transparent Reporting of a Multivariable Prediction Model for Individual Prognosis or Diagnosis + AI (TRIPOD + AI) Checklist for Prediction Model Development (Supplemental Table S1)37 and the Checklist for Evaluation of Image-Based Artificial Intelligence Reports in Dermatology (CLEAR-Derm; Supplemental Table S2).38
Data Sources
We retrospectively analysed clinical and imaging data from the PAD-UFES-20 dataset (http://doi.org/10.17632/zr7vgbcyr2.1). PAD-UFES-20 contains 2298 images and associated clinical data from 1641 skin lesions in 1373 Brazilian patients. For model development, each image, together with its associated clinical data, was treated as a separate patient case.
Lesions depicted in the dataset images include 3 categories of benign skin lesions (actinic keratosis, melanocytic nevus, and seborrheic keratosis) and 3 categories of malignant skin lesions (BCC, melanoma and SCC). All malignant diagnoses in the PAD-UFES-20 dataset were validated via biopsies. The images were collected using a diverse range of smartphone devices and present varying resolutions, angles and lighting conditions. The images came straight from the device with no post-processing, except for patient-applied cropping to focus on the region of interest during image upload.39
Because this study involved only secondary analysis of publicly available, anonymised patient data with no identifiable information, it was exempt from ethics review under the Tri-Council Policy Statement: Ethical Conduct for Research Involving Humans (TCPS 2). 40
Target Outcome
The model was developed to predict the probability of malignancy, and by extension, to perform binary classifications of suspicious lesions into malignant and benign categories based on ground truth labels provided in the PAD-UFES-20 metadata. Skin lesions diagnosed as BCC, melanoma or SCC were designated as malignant. Because the model is intended for triage purposes in primary care or non-clinical settings to determine whether specialist referrals or interventions may be required, we did not attempt multiclass classifications to establish specific pathological diagnoses.
Derivation and Internal Validation Split
Typically, train-test splits in ML studies involve an arbitrary 80-20 or 70-30 split, with substantially more samples assigned to the training set.41 In this study, we purposefully selected a near 50-50 split, holding out 1000 cases as an internal validation set and using 1298 cases for tuning and training (ie, the derivation set). This approach was chosen to demonstrate the effectiveness of a transfer learning approach on smaller datasets, as well as to better evaluate the generalisability of the developed model across a larger number of test cases. Patient characteristics in the derivation and internal validation sets are summarised in Table 1.
Case characteristics in PAD-UFES-20 dataset and in the derivation and internal validation subsets.
Summary statistics are presented as mean (standard deviation) for continuous characteristics, and as N (%) for discrete characteristics.
459 and 345 cases had missing sex, Fitzpatrick skin phototype, cancer history and lesion measurement data in the derivation and internal validation sets, respectively.
Maximum lesion diameter is reported in millimetres.
2 cases had missing lesion elevation data in the derivation set.
5 and 1 cases had missing itching and bleeding data in the derivation and internal validation sets, respectively.
225 and 177 cases had missing recent lesion growth data in the derivation and internal validation sets, respectively.
7 and 3 cases had missing painful data in the derivation and internal validation sets, respectively.
Image-Based Neural Network
We started the development of the multimodal model by adapting a pre-trained DenseNet-12142 to perform skin lesion classification using smartphone images (Figure 1A).

Illustration of the 3 neural network architectures used in this study. Feature maps shown are taken from the corresponding locations within DenseNet-121. Numbers and dimensions under the layers and feature maps indicate the dimension and size of the corresponding layer outputs/feature maps. Numbers in brackets next to the dense block labels indicate the number of convolution blocks in each corresponding dense block. The example image originates from the PAD-UFES-20 dataset.39 (A) The image-based neural network, which uses a pre-trained DenseNet-121 for feature extraction. Images were randomly augmented before entering the network during training. The classifier of the image-based network consisted of a global average pooling layer and a ReLU hidden layer, with accompanying dropout and batch normalisation layers, followed by a sigmoid output layer. (B) The clinical data-based neural network, which consists of 2 ReLU hidden layers with accompanying dropout and batch normalisation layers, followed by a sigmoid output layer. (C) The multimodal neural network. Tensors from both the clinical data- and image-based networks are concatenated to form the inputs to the multimodal network. The multimodal classifier consists of a single ReLU hidden layer, with accompanying dropout and batch normalisation layers, followed by a sigmoid output layer.
Image Preprocessing
Image data were first split into individual RGB colour channels and resized to 224 × 224 pixels. Pixel values were then min-max normalised to values between 0 and 1. These preprocessing steps ensured that the image inputs resembled those used during the pretraining of DenseNet-121.42 Because images in the PAD-UFES-20 dataset were already square-cropped by patients at the time of image upload, no additional cropping was performed.
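In keras, these steps might look as follows (a sketch, assuming images arrive as RGB arrays of arbitrary size):

```python
import keras

# Resize to DenseNet-121's expected input size, then min-max normalise to [0, 1]
preprocess = keras.Sequential([
    keras.layers.Resizing(224, 224),
    keras.layers.Rescaling(1.0 / 255),
])
```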
Image Augmentations
Five image augmentation layers were randomly activated as hyperparameters during model training to improve generalisability and reduce the risk of overfitting. These augmentations included random rotation, random horizontal/vertical flips, random cropping/zooming, random contrast and random brightness adjustments.43-45
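A sketch of such an augmentation pipeline using keras preprocessing layers; the factor values are placeholders, and in our study the activation of each layer was itself treated as a hyperparameter:

```python
import keras

# Augmentations apply only during training; factor values are illustrative.
augment = keras.Sequential([
    keras.layers.RandomRotation(0.1),
    keras.layers.RandomFlip("horizontal_and_vertical"),
    keras.layers.RandomZoom(0.1),       # random cropping/zooming
    keras.layers.RandomContrast(0.1),
    keras.layers.RandomBrightness(0.1),
])
```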
Network Architecture
The network architecture is visualised in Figure 1A. CNN-based classification neural networks typically consist of 2 components: a feature extractor and a classifier.46 For the feature extractor, we adopted the DenseNet-121 architecture.47 DenseNet architectures are composed of dense blocks, wherein each layer within a dense block is connected to every other layer in a feed-forward fashion. This means that the input to each layer consists of the feature maps of all preceding layers, which promotes feature reuse and efficient gradient flow (Figure 2). For the classifier, a global average pooling layer was used, followed by a rectified linear unit (ReLU) layer with 16 neurons and He kernel initialisation.48 The pooling and ReLU layers were followed by a dropout layer, which was used for regularisation to prevent overfitting.49 The dropout probability was determined via hyperparameter tuning. A batch normalisation layer followed each dropout layer to improve the speed and stability of model training.50 After the hidden layers, an output layer with a sigmoid activation function and Xavier kernel initialisation was used to output the predicted probabilities. An AdamW optimiser was used.51 Learning rate, weight decay rate, and batch size were determined by hyperparameter tuning.
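This architecture could be assembled roughly as follows; the dropout rate, learning rate and weight decay shown stand in for the tuned hyperparameters:

```python
import keras

base = keras.applications.DenseNet121(
    weights="imagenet", include_top=False, input_shape=(224, 224, 3)
)
base.trainable = False  # frozen during initial training (see fine-tuning below)

inputs = keras.Input(shape=(224, 224, 3))
x = base(inputs, training=False)       # DenseNet-121 feature extractor
x = keras.layers.GlobalAveragePooling2D()(x)
x = keras.layers.Dense(16, activation="relu",
                       kernel_initializer="he_normal")(x)
x = keras.layers.Dropout(0.3)(x)       # rate was a tuned hyperparameter
x = keras.layers.BatchNormalization()(x)
outputs = keras.layers.Dense(1, activation="sigmoid",
                             kernel_initializer="glorot_uniform")(x)  # Xavier

image_model = keras.Model(inputs, outputs)
image_model.compile(
    optimizer=keras.optimizers.AdamW(learning_rate=1e-4, weight_decay=1e-4),
    loss="binary_crossentropy",
)
```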

Illustration of the inner structure of a dense block within DenseNet-121. An example dense block with 6 convolution blocks is shown. Also shown are the structures of each convolution block and the transition block in-between dense blocks. Numbers under each set of feature maps indicate their dimensions. Feature maps are for illustration purposes only. The example image originates from the PAD-UFES-20 dataset.39
Transfer Learning and Fine-Tuning
DenseNet-121 was previously pre-trained using images from a large computer vision research image database called ImageNet.52 The specific subset used for DenseNet-121 pretraining, the ImageNet Large Scale Visual Recognition Challenge (ILSVRC) subset, contained around 1.2 million human-annotated training images spanning over 1000 object categories.53 To apply a transfer learning approach, we loaded the pre-trained weights from ImageNet and froze them to prevent them from being updated during training. After initial model training was complete, the DenseNet-121 weights were unfrozen, and the model was further trained for 10 epochs at a very low learning rate (1 × 10−6) to fine-tune the model. The batch normalisation layers within DenseNet-121 remained frozen during fine-tuning to prevent unexpected performance degradation due to loss of important pre-trained information.54
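Continuing from the sketch above, the unfreeze-and-fine-tune step might look like this (the `train_ds` and `val_ds` datasets are assumed):

```python
import keras

# Unfreeze DenseNet-121, but keep its batch normalisation layers frozen
base.trainable = True
for layer in base.layers:
    if isinstance(layer, keras.layers.BatchNormalization):
        layer.trainable = False

# Recompile with a very low learning rate and fine-tune briefly
image_model.compile(
    optimizer=keras.optimizers.AdamW(learning_rate=1e-6),
    loss="binary_crossentropy",
)
image_model.fit(train_ds, validation_data=val_ds, epochs=10)
```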
Clinical Data-Based Neural Network
We then created a standalone multilayer perceptron neural network classifier for skin lesion classification based on only clinical data (Figure 1B).
Feature Selection
Feature selection was primarily guided by feature availability in the PAD-UFES-20 dataset and expert domain knowledge. The final list of selected clinical features is presented in Supplemental Table S3. Because neural networks automatically downweight redundant or non-contributory features during training, and the overall number of features was low, we did not explicitly adopt additional statistical feature selection or dimensionality reduction techniques.
Data Imputation and Resampling
Missing data were imputed using multiple imputation by chained equations (MICE).55 Given that the number of MICE iterations should be informed by the percentage of missing data,56,57 and that our derivation set had a maximum missing data percentage of around 40%, we opted to use 40 iterations of MICE. The derivation and internal validation datasets were imputed separately to avoid data leakage.58 Although the dataset was imbalanced, we did not apply any resampling techniques, as our objective was to predict malignancy probabilities based on class distributions in the original dataset.
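We performed MICE in R with the mice package (see Software); for readers working in Python, a rough analogue using scikit-learn's experimental IterativeImputer, which is inspired by but not identical to MICE, might look like this (the feature matrices are assumed):

```python
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

# 40 iterations, mirroring the ~40% maximum missingness in the derivation set.
# The derivation and validation sets are imputed separately to avoid leakage.
derivation_imputed = IterativeImputer(
    max_iter=40, sample_posterior=True, random_state=0
).fit_transform(X_derivation)
validation_imputed = IterativeImputer(
    max_iter=40, sample_posterior=True, random_state=0
).fit_transform(X_validation)
```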
Data Preprocessing
Categorical features, such as lesion location, were one-hot-encoded.59 All features were standardised (ie, rescaled to zero mean and unit standard deviation) using a normalisation layer in the network.60
Network Architecture
The network architecture is visualised in Figure 1B. Each hidden ReLU layer within the clinical data network contained 16 neurons. Because the structure of the clinical dataset was relatively simple and low-dimensional, the number of neurons was determined empirically, by selecting the largest power of 2 that is smaller than the number of input nodes (ie, the number of features). The dropout, batch normalisation, and sigmoid layers were configured similarly to those in the classifier component of the image-based neural network. An AdamW optimiser was used.51
We constructed clinical data networks with 2, 3 or 4 hidden ReLU layers and performed hyperparameter tuning on all 3 network types. We found that network performance did not improve beyond 2 hidden layers, and the network with 3 hidden layers exhibited signs of overfitting. Thus, the final network consisted of 2 hidden ReLU layers.
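A sketch of the final clinical data network in keras, covering both the standardisation layer and the 2 hidden ReLU layers (the feature count, dropout rate and optimiser settings are placeholders for the tuned values):

```python
import keras

n_features = 30  # placeholder for the number of one-hot-encoded clinical features

# The Normalization layer standardises each feature to zero mean and unit SD;
# adapt() learns per-feature statistics from the training data (X_train assumed).
norm = keras.layers.Normalization()
norm.adapt(X_train)

inputs = keras.Input(shape=(n_features,))
x = norm(inputs)
for _ in range(2):  # final architecture: 2 hidden ReLU layers
    x = keras.layers.Dense(16, activation="relu",
                           kernel_initializer="he_normal")(x)
    x = keras.layers.Dropout(0.3)(x)  # rate was a tuned hyperparameter
    x = keras.layers.BatchNormalization()(x)
outputs = keras.layers.Dense(1, activation="sigmoid")(x)

clinical_model = keras.Model(inputs, outputs)
clinical_model.compile(
    optimizer=keras.optimizers.AdamW(learning_rate=1e-3, weight_decay=1e-4),
    loss="binary_crossentropy",
)
```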
Multimodal Neural Network
To create a multimodal neural network, tensors from the last hidden layers before the output layers in both the clinical data- and image-based neural networks were concatenated. An additional ReLU layer with 16 neurons processed the concatenated tensors before a sigmoid layer output the predicted probability (Figure 1C). The dropout, batch normalisation, and sigmoid layers were configured similarly to those in the classifier component of the image-based neural network. An AdamW optimiser was used.51
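A sketch of this fusion, assuming the two unimodal models above; the layer indices used to grab each network's last hidden tensor are illustrative:

```python
import keras

# Tensors feeding each unimodal output layer (indices are illustrative)
img_hidden = image_model.layers[-2].output
clin_hidden = clinical_model.layers[-2].output

x = keras.layers.Concatenate()([img_hidden, clin_hidden])
x = keras.layers.Dense(16, activation="relu",
                       kernel_initializer="he_normal")(x)
x = keras.layers.Dropout(0.3)(x)  # rate was a tuned hyperparameter
x = keras.layers.BatchNormalization()(x)
outputs = keras.layers.Dense(1, activation="sigmoid")(x)

multimodal_model = keras.Model(
    [image_model.input, clinical_model.input], outputs
)
multimodal_model.compile(
    optimizer=keras.optimizers.AdamW(learning_rate=1e-4, weight_decay=1e-4),
    loss="binary_crossentropy",
)
```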
Hyperparameter Tuning
Hyperparameter tuning was performed using BOHB23 across 5-fold cross-validation, with the objective of minimising average cross-entropy loss across the 5 folds. Given that random search can typically identify top-performing hyperparameter combinations in around 60 iterations,25 we aimed to set our resource budget well above this target. Due to computational resource limitations, each network underwent 250 iterations of BOHB. The list of tuned hyperparameters and their respective search spaces are presented in Supplemental Tables S4-S6.
The number of epochs was determined empirically. Changes in cross-entropy loss were monitored during model fitting, and training was stopped when average cross-entropy loss stopped decreasing. As previously discussed, the image-based and multimodal neural networks underwent an additional 10 epochs of training at a very low learning rate to fine-tune DenseNet-121.
Threshold Selection
After the models were trained and optimised, threshold tuning was performed to maximise their binary classification performance. 'Balanced' thresholds were first derived to maximise the Matthews correlation coefficient (MCC).61 MCC was calculated for every classification threshold between 0.01 and 0.99 at intervals of 0.01. The threshold with the highest average MCC across 5-fold cross-validation was selected and rounded to the nearest 0.05 as the optimal threshold.
In addition to balanced thresholds, high-sensitivity thresholds were derived to maintain a sensitivity closest to and above 0.90 and 0.95 during 5-fold cross-validation. Due to the importance of early referral and biopsy of suspicious lesions, we anticipate that high-sensitivity thresholds would be more frequently used in clinical settings compared to the balanced thresholds.
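A sketch of how both threshold types might be derived from out-of-fold labels and predicted probabilities:

```python
import numpy as np
from sklearn.metrics import matthews_corrcoef

thresholds = np.arange(0.01, 1.00, 0.01)  # 0.01 to 0.99 at 0.01 intervals


def balanced_threshold(y_true, y_prob):
    """Return the threshold maximising MCC."""
    mccs = [matthews_corrcoef(y_true, (y_prob >= t).astype(int))
            for t in thresholds]
    return thresholds[int(np.argmax(mccs))]


def high_sensitivity_threshold(y_true, y_prob, target=0.90):
    """Return the largest threshold whose sensitivity is still >= target."""
    best = thresholds[0]
    for t in thresholds:
        pred = (y_prob >= t).astype(int)
        tp = np.sum((pred == 1) & (y_true == 1))
        fn = np.sum((pred == 0) & (y_true == 1))
        if tp / (tp + fn) >= target:
            best = t  # sensitivity falls as the threshold rises
    return best
```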
Model Evaluation
The 3 networks were evaluated using a previously published framework proposed for assessing clinical prediction ML models.62 The classification performance of each tuned and calibrated model was assessed using AUC-ROC, average precision (AP), MCC, sensitivity and specificity. Calibration performance was assessed using Brier score, calibration curve slope and calibration curve intercept. Threshold-dependent metrics, including MCC, sensitivity and specificity, were presented for each of the 3 derived classification thresholds. All metrics were first evaluated using 5-fold cross-validation and presented as mean and 95% confidence intervals (CIs). The metrics were then used to evaluate model performance on the held-out internal validation dataset.
In addition to numeric metrics, we presented ROC curves, precision-recall (PR) curves, MCC-threshold plots, and calibration curves for all 3 models during cross-validation and internal validation. For cross-validation, averaged ROC and PR plots were illustrated using vertical averaging.63 Averaged calibration curves were generated by aligning the individual calibration curves from each cross-validation fold to 5 common bin centres using linear interpolation.64
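A sketch of the per-model metric computation with scikit-learn; note that estimating the calibration slope and intercept via logistic recalibration on the logit of the predicted probabilities is one common convention, assumed here:

```python
import numpy as np
from sklearn.calibration import calibration_curve
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (average_precision_score, brier_score_loss,
                             matthews_corrcoef, roc_auc_score)


def evaluate(y_true, y_prob, threshold):
    logit = np.log(y_prob / (1 - y_prob))  # assumes probabilities in (0, 1)
    recal = LogisticRegression(penalty=None).fit(logit.reshape(-1, 1), y_true)
    frac_pos, mean_pred = calibration_curve(y_true, y_prob, n_bins=5)
    return {
        "auc_roc": roc_auc_score(y_true, y_prob),
        "average_precision": average_precision_score(y_true, y_prob),
        "brier": brier_score_loss(y_true, y_prob),
        "mcc": matthews_corrcoef(y_true, (y_prob >= threshold).astype(int)),
        "calibration_slope": recal.coef_[0, 0],
        "calibration_intercept": recal.intercept_[0],
        "calibration_curve": (mean_pred, frac_pos),
    }
```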
Model Explainability
Model explainability for the image-based neural network was assessed via Gradient-weighted Class Activation Mapping (Grad-CAM),65 which produces a heatmap indicating areas of the image most influential in classification decisions. Model explainability for the clinical data-based neural network was assessed using the permutation importance method for assessing feature importance in black-box models.66
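A minimal Grad-CAM sketch for the image network, assuming the DenseNet-121 layers are reachable by name from the full model (conv5_block16_concat is the final feature-map layer of keras' DenseNet-121; a nested sub-model would need an extra get_layer call):

```python
import numpy as np
import tensorflow as tf
import keras


def grad_cam(model, image, conv_layer="conv5_block16_concat"):
    """Heatmap of the regions most influential for the malignancy prediction."""
    grad_model = keras.Model(
        model.inputs, [model.get_layer(conv_layer).output, model.output]
    )
    with tf.GradientTape() as tape:
        conv_maps, preds = grad_model(image[np.newaxis, ...])
        score = preds[:, 0]  # predicted malignancy probability
    grads = tape.gradient(score, conv_maps)
    weights = tf.reduce_mean(grads, axis=(0, 1, 2))  # pooled gradient weights
    cam = tf.reduce_sum(conv_maps[0] * weights, axis=-1)
    cam = tf.nn.relu(cam)  # keep only positively influential regions
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()
```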
Model Fairness and Bias
Because previous research has shown reduced performance of dermatology ML models when applied to darker skin types,67,68 and this inconsistent performance is rarely studied in ML investigations,69 we reported additional internal validation metrics stratified by Fitzpatrick phototypes to assess potential skin type biases. Congruent with previous studies, we categorised the internal validation cases into light skin (Fitzpatrick phototypes I, II or III) and skin of colour (Fitzpatrick phototypes IV, V or VI).67
Software
Clinical data imputation and train-test split were performed using R. The mice package was used to perform MICE. Model development and testing were completed using Python. Neural network models were built using keras 3.3.3 and tensorflow 2.16.1. 70 BOHB tuning was completed using ray 2.30.0. 71 Model evaluation was performed using scikit-learn 1.5.0. 72 Permutation feature importance was assessed using a custom fork of the eli5 package.
Results
Out of 2298 patient cases from the PAD-UFES-20 dataset, 1089 cases (47.4%) were diagnosed with malignant lesions, and 1209 cases (52.6%) were diagnosed with non-malignant lesions. In terms of skin phototypes, 1421 cases (61.8%) were classified as Fitzpatrick phototypes I, II or III, and 73 cases (3.2%) were classified as Fitzpatrick phototypes IV, V or VI. The remaining cases did not have an associated Fitzpatrick phototype. Detailed patient case characteristics are presented in Table 1. The optimal hyperparameters and thresholds identified are tabulated in Supplemental Tables S4-S6. The full codebase and dataset needed to replicate our findings are available at http://doi.org/10.17632/2yv6rv3pzs.1.
Cross-Validation Performance
Table 2 and Figure 3 summarise the average performance of the 3 neural networks over 5-fold cross-validation. The image-based neural network outperformed the clinical data-based neural network on all threshold-independent metrics. In turn, the multimodal network outperformed the image-based network on all threshold-independent metrics. When using the balanced threshold, the multimodal network exhibited a slightly higher average MCC compared to the image-based network (0.61, 95% CI 0.47-0.74 vs 0.57, 95% CI 0.53-0.61), indicating better classification performance.
Average performance of neural networks during cross-validation.
Abbreviations: AUC-ROC, area under the receiver operating characteristic curve; MCC, Matthews correlation coefficient.
The performance metrics are tabulated as mean (95% confidence interval).

Performance plots of the 3 network models during 5-fold cross-validation. (A) Average ROC curves. (B) Average PR curves. (C) Average MCC-threshold curves. (D) Average calibration curves.
At both high-sensitivity thresholds (sensitivity ⩾0.90 and ⩾0.95), the threshold-dependent performance metrics from the image-based network were similar to those of the multimodal network. This highlights the robustness of the image-based approach at high-sensitivity thresholds despite its reliance on a single data modality.
Internal Validation Performance
Table 3 and Figure 4 summarise the average performance of the 3 neural networks on the held-out internal validation subset. Similar to findings during cross-validation, the multimodal network outperformed the other 2 networks on threshold-independent metrics. At the balanced threshold, the multimodal network had higher MCC compared to the image- and clinical data-based networks (0.68 vs 0.60 vs 0.53). Consistent with cross-validation findings, the image-based network had similar threshold-dependent performance metrics as the multimodal network at the high-sensitivity thresholds.
Performance of neural networks during internal validation.
Abbreviations: AUC-ROC, area under the receiver operating characteristic curve; MCC, Matthews correlation coefficient.

Performance plots of the 3 network models during internal validation. (A) ROC curves. (B) PR curves. (C) MCC-threshold curves. (D) Calibration curves.
Performance Stratified by Skin Phototypes
In the held-out internal validation set, 631 cases were classified as light skin (ie, Fitzpatrick phototypes I, II or III), and 24 cases were classified as skin of colour (ie, Fitzpatrick phototypes IV, V or VI). Internal validation performance metrics stratified by phototypes are summarised in Supplemental Table S7. Overall, there was no notable decrease in performance associated with darker Fitzpatrick phototypes; in fact, the model yielded better performance metrics in darker phototypes than in lighter phototypes. However, these results should be interpreted with caution due to the substantial imbalance in sample sizes across phototype subgroups.
Model Explainability
Global explainability for the clinical data-based neural network during internal validation is illustrated in Figure 5. The most influential features identified via permutation importance included lesion bleeding, lesion elevation, patient age and recent lesion growth – clinical factors that align well with established dermatological risk criteria for malignancies.73-75 This suggests that the model relied on clinically meaningful patterns from the structured clinical data inputs.

Feature importance plot showing permutation importance of the top 10 features in the clinical data-based neural network.
For the image-based neural network, Grad-CAM heatmaps were generated across the internal validation set. Figure 6 displays a representative selection of Grad-CAM examples for each quadrant of the confusion matrix based on the balanced threshold. These examples highlight regions within the input image that most strongly influenced the model’s prediction. Notably, the model consistently focused on the lesioned regions rather than background or artefacts, even in misclassified cases. This pattern suggests the image network was not relying on spurious features and had learned to prioritise relevant morphological features. Grad-CAM visualisations for the entire internal validation dataset are available in our code repository.

Representative examples of Grad-CAM heatmaps for each quadrant of the confusion matrix based on the balanced threshold, generated from the image-based neural network. Four random examples are shown for each quadrant. (A) True positives, (B) true negatives, (C) false positives and (D) false negatives.
Discussion
In this study, we developed and internally validated a transfer learning-based, multimodal neural network for identifying malignant skin lesions using smartphone-captured images and structured clinical data. Compared to standalone image- and clinical data-based neural networks, the multimodal model showed superior discriminative ability and calibration in both cross-validation and internal validation. These findings indicate that combining clinical variables with image features may yield more robust classification models.
Notably, at the high-sensitivity thresholds relevant for primary care triage (ie, sensitivity ⩾0.90 or ⩾0.95), the image-only model performed comparably to the multimodal model during cross-validation and internal validation. A plausible explanation is that these thresholds provide generous tolerance for false positives, which allows the networks to classify lesions with even modest malignant probabilities as cancerous. Although the multimodal model can adjust probability estimates based on clinical context, these adjustments are rarely large enough to move cases across such lenient thresholds, so both networks produce nearly identical predictions. When the threshold instead aims to balance sensitivity and specificity, the added clinical context is sufficient to improve classification of borderline cases, allowing the multimodal model to outperform the image-based model. While both models may be adequate for triage purposes in primary care settings, the multimodal network's consistent advantage in threshold-independent metrics such as AUC-ROC and AP, along with better calibration, suggests that integrating image and clinical data enhances the model's robustness and reliability across a wider range of clinical scenarios and decision thresholds.
Our results compare favourably with existing literature. Prior unimodal models typically report AUC-ROC values ranging from 0.87 to 0.96 on curated dermoscopic datasets.12,14 Our multimodal model achieved similar or higher AUC-ROC scores during internal validation on smartphone images, despite the more variable and less standardised imaging conditions. Multimodal integration may therefore help mitigate quality limitations inherent to consumer-grade photographs.
Importantly, our findings also compare favourably with human diagnostic accuracy in primary care. Ojeda and Graells reported that primary care physicians demonstrated a sensitivity of 0.45 and specificity of 0.16 in identifying malignant skin lesions, compared to 0.97 and 0.75 for dermatologists.76 A systematic review by Chen et al reported a sensitivity of 0.70 to 0.88 and specificity of 0.70 to 0.87 among primary care providers, compared to 0.81 to 1.00 and 0.70 to 0.89 for dermatologists.77 These comparisons suggest that, at least in silico, our models may deliver decision support performance that exceeds the average diagnostic ability of primary care clinicians, though this must be confirmed in further head-to-head studies.
From a real-world implementation perspective, even though training deep-learning neural networks such as DenseNet-121 is computationally intensive – even with transfer learning and efficient tuning techniques such as BOHB – inference, the act of making predictions on new images after training, requires orders of magnitude fewer resources. It would be possible to develop a mobile application centred around our multimodal model that captures lesion photographs and presents a questionnaire to collect clinical data to generate ML-powered predictions. This application could be deployed on smartphones, tablets or web platforms, facilitated by lightweight frameworks designed for edge devices such as TensorFlow Lite/LiteRT. Grad-CAM visualisations could provide local explainability on a case-by-case basis, which could help improve model transparency and user trust.
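As a sketch of this deployment path, a trained keras model can be converted to a TensorFlow Lite flatbuffer in a few lines; the model and file names are placeholders:

```python
import tensorflow as tf

# Convert the trained model for on-device inference
converter = tf.lite.TFLiteConverter.from_keras_model(multimodal_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]  # optional quantisation
tflite_bytes = converter.convert()

with open("lesion_triage.tflite", "wb") as f:
    f.write(tflite_bytes)
```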
Limitations
There are several important limitations that must be acknowledged. While the PAD-UFES-20 dataset includes a diverse range of lesion types and patient cases, it was collected from a single country and contains a disproportionately small number of patients with darker Fitzpatrick phototypes. Although our subgroup analyses did not show performance degradation in these groups, the small sample size limits our ability to draw firm conclusions about model generalisability across different skin tones. Moreover, only malignant lesions in the dataset underwent biopsy confirmation, which introduces a risk of verification bias. The clinical dataset also contained missing values, which required imputation that may have reduced the informativeness of certain features.
The lack of external validation represents another critical limitation. Because very few publicly available datasets contain diverse, smartphone-captured skin lesion images linked to structured clinical data, we were unable to assess model generalisability on an independent cohort. The multimodal nature of our model poses practical challenges for future data collection and validation, as it requires not only image data but also detailed clinical information, which may be difficult to acquire and organise at scale. Regardless, proper external validation using an unrelated dataset remains essential before deployment in real-world clinical settings.
Future Directions
As with other ML models, the performance of our models ultimately depends on the quantity and diversity of available training data. To improve predictive performance and ensure generalisability across populations and practice environments, future efforts should prioritise multicentre and international data collection initiatives. Approaches such as decentralised, federated learning may offer a feasible path forward by allowing training on distributed datasets without the need to centralise sensitive patient information.78
Lastly, classification models such as those described in this study are fundamentally application-oriented tools. Their value depends not only on statistical performance but on real-world effectiveness – how they influence clinical decision-making, improve triage accuracy and reduce diagnostic delays. To evaluate these endpoints, future work will involve prospective testing and deployment studies in primary care settings, using methodologies such as shadow-testing or randomised controlled trials. We are actively pursuing these next steps and have open-sourced our codebase to support further development, replication and adaptation by other groups.
Conclusion
This study demonstrates that a transfer learning-based, multimodal neural network can effectively identify malignant skin lesions by integrating structured clinical data with smartphone-captured lesion images. To ensure broad generalisability and clinical applicability, the model must undergo external validation on larger, more diverse datasets. Our findings advocate for collaborative, multicentre efforts and the utilisation of decentralised learning frameworks to advance the development and deployment of ML-driven diagnostic tools in dermatology.
Footnotes
Acknowledgements
We would like to thank the Programa de Assistência Dermatológica e Cirúrgica at the Federal University of Espírito Santo and the Nature Inspired Computing Laboratory for developing the PAD-UFES-20 dataset and making it available for our model development.
Author Note
Preliminary results from this study were presented at the 2025 Canadian Dermatology Association (CDA) Annual Conference in Halifax, Nova Scotia, Canada.
Ethical Considerations
This study only involves secondary analyses of publicly available, anonymous patient data with no identifiable information. Thus, this study is exempt from ethics review as determined by the Unity Health Toronto Research Ethics Board under the guidance of The Tri-Council Policy Statement: Ethical Conduct for Research Involving Humans (TCPS 2).
Author Contributions
Jiawen Deng conceptualised and supervised the study, performed data retrieval and management, led coding, model development and validation efforts, created the illustrations, and drafted the manuscript. Eddie Guo performed code validation and review, drafted the manuscript and made intellectually important manuscript edits. Heather Jianbo Zhao was involved in data retrieval and management, drafted the manuscript and made intellectually important manuscript edits. Kaden Venugopal was involved in data retrieval and management and made intellectually important manuscript edits. Myron Moskalyk performed code validation and review, drafted the manuscript and made intellectually important manuscript edits. All authors agree to be held accountable for all aspects of the work in ensuring that questions related to the accuracy or integrity of any part of the work are appropriately investigated and resolved, and all authors give final approval for the manuscript to be submitted and published in its current state.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
Declaration of Conflicting Interests
The author(s) declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: Jiawen Deng is a member of the OpenAI Researcher Access Program and receives grants from OpenAI in the form of API credits for purposes of research involving large language models. Eddie Guo, Heather Jianbo Zhao, Kaden Venugopal and Myron Moskalyk report no conflict of interest.
Data Availability Statement
The full codebase and dataset needed to replicate our findings are available at http://doi.org/10.17632/2yv6rv3pzs.1.
Supplemental Material
Supplemental material for this article is available online.