Abstract
Introduction
Lung cancer has the highest mortality rate among all cancer types globally, largely due to delayed or ineffective diagnosis and treatment. Radiomics is commonly used to diagnose lung cancer, especially in later stages or during routine screenings. However, frequent radiological imaging poses health risks, and while advanced diagnostic alternatives exist, they are often costly and accessible only to a limited, privileged population. Leveraging clinical data using machine learning (ML) and artificial intelligence (AI) enables a safer, more inclusive, and affordable solution. Due to a lack of interpretability, AI-based models for cancer diagnosis are less adopted by clinicians.
Methods
This study introduces a safe, inclusive, and cost-effective lung cancer diagnostic method using an explainable AI (XAI) model built on routine clinical data. It employs a stacking ensemble of Artificial Neural Network (ANN) and Deep Neural Network (DNN) to match the diagnostic performance of clean-data DNN models. By incorporating rare medical cases through Adaptive Synthetic Sampling (ADASYN), the model reduces the risk of missing challenging, rare-case diagnoses.
Results
The proposed XAI model demonstrates strong performance with an accuracy of 0.8558, AUC of 0.8600, precision of 0.8092, recall of 0.9282, and F1-score of 0.8646, notably improving rare-case detection by over 50%. SHapley Additive exPlanations (SHAP)-based interpretability highlights erythrocyte sedimentation rate (ESR), intoxication-related factors, hemoglobin levels, and neutrophil counts as key features. The model also reveals associations, such as a link between heavy tobacco use and elevated ESR. Counterfactual explanations help identify features contributing to misdiagnoses by exposing sources of confusion in the model's decisions.
Conclusion
Given the limited dataset size and geographic constraints, this research should be viewed as a prototype; in its current form, the model is best suited as a pre-screening tool to support early detection. With training on larger and more diverse datasets, the model has strong potential to evolve into a robust and scalable diagnostic solution.
Introduction
Carcinogenic agents induce uncontrolled cellular proliferation in the pulmonary organs, resulting in the formation of malignant tumours in the lungs, ie, lung cancer. The probability of a successful treatment outcome is positively correlated with the stage of disease detection, with earlier being better.1,2 Upon confirmation, the patient must undergo systematic treatment to enhance the likelihood of survival. According to the Global Cancer Observatory (GLOBOCAN) 2022, the number of new cancer cases diagnosed that year was 20 million. 3 The number of deaths due to cancer in 2022 was about 9.7 million, with lung cancer being the leading cause. GLOBOCAN predicts that the number of cancer cases in 2050 will reach 35 million. Considering cancer-related mortality, lung cancer tops the chart (18.7%), followed by colorectal cancer (9.3%), liver cancer (7.8%), female breast cancer (6.9%), and stomach cancer (6.8%). Men were affected predominantly by lung, prostate, and colorectal cancers, whereas women were affected predominantly by breast, colorectal, and lung cancers.
As per the 2022 report by the American Cancer Society (ACS), there were about 117,910 and 118,830 newly diagnosed cases of lung cancer in men and women, respectively, in the United States (US). Studies by the ACS 1 reveal that deaths resulting from lung cancer numbered approximately 68,820 among males and 61,360 among females. In cases where the cancer is at a localised stage, the five-year survival rate was 55%. The age group of 65 years and older exhibited the highest mortality rates for lung cancer. The incidence of lung cancer diagnosis among individuals under the age of 45 was observed to be significantly lower. According to Surveillance, Epidemiology, and End Results (SEER) statistics, 4 the mean age at which lung cancer was diagnosed was 70 years. The reported rate of early detection of lung cancer stands at approximately 16%. The five-year survival rate in the event of metastasis is reported to be merely 4%. 3 The prognosis of lung cancer is significantly influenced by both the extent of progression and the time between its onset and diagnosis. Thus, successful treatment is more likely when detection occurs at an earlier stage.
Radiation screening is a widely utilised non-invasive diagnostic technique for lung cancer. However, it is typically applied only during the advanced stages of the disease, as the initial symptoms are commonly misattributed to other health conditions.5–7 In addition, radiomics has limitations such as non-standardisation of acquisition parameters, inconsistency in radiomic methods, and limited reproducibility. Scholars are currently researching how to surmount these constraints.8,9 Repeated exposure to computed tomography (CT) imaging and low-dose computed tomography (LDCT) screening may increase the probability of developing solid cancers and leukemia due to cumulative exposure to ionising radiation. This is a matter of serious concern for non-cancerous patients.10–12 LDCT screening has also been found to have a high false positive rate (FPR) in some studies.12,13 Researchers are therefore exploring better alternative methods for the early detection of lung cancer.14,15
AI has the power to recognise patterns in clinical healthcare data and derive insights from them to make predictions, diagnoses, and prognoses more accurately. Currently, AI-based models are being actively researched for diagnostic procedures, medication development, customised medicine, patient monitoring, and treatment protocol formulation. 16 Studies have also reported that different ML and AI techniques serve as models for global practices that aid cancer research by improving clinical workflow and diagnostic accuracy, reducing human resource costs, increasing data efficiency, and enhancing treatment.17–19 Although AI has been quickly integrated into cancer research, AI-based solutions are still in their early stages, and only a few AI-based applications have been authorised for real-world use. 20 To increase these numbers, the AI models developed must be explainable and interpretable to foster trust, ensure safety and effectiveness, and adhere to regulatory and ethical standards.
One commonly used method for AI model interpretability is feature importance analysis, which gives a global perspective by assessing how the model's performance varies when the values of the features are shuffled. Layer-wise relevance propagation techniques, such as Deep Taylor decomposition, attribute the output to input features by backward propagation through the layers of the network.21,22 The drawbacks of these techniques include complexity in the deployment and the increasing difficulty of interpretation with non-linearity. Guidotti et al 23 proposed local rule-based explanations (LORE), a local interpretable predictor on a synthetic neighbourhood generated by a genetic algorithm. It derives from the logic of the local interpretable predictor a meaningful explanation consisting of a decision rule, which explains the reasons for the decision, and a set of counterfactual rules, suggesting the changes in the instance's features that lead to a different outcome. It focuses on local explanations rather than global descriptions of how the overall system works. 23 Local interpretable model-agnostic explanations (LIME) is an interpretation that typically generates an explanation for a single prediction by any ML model by learning a simpler interpretable model (for example, a linear classifier) around the prediction by generating simulated data around the instance by random perturbation and obtaining feature importance through applying some form of feature selection. The random perturbation and feature selection methods have the possibility of instability in the generated explanations for the same prediction. 24 SHAP is based on the concept of Shapley values from cooperative game theory, which provides a fair allocation of contributions among the features for a prediction. It provides additive explanations where the prediction is explained as a sum of contributions from each feature, ensuring consistency and accuracy.25,26 SHAP algorithms can be model-specific (TreeSHAP for tree-based models, DeepSHAP for neural networks). Integrated Gradients 27 is another method that assigns an importance score to each feature by integrating the gradients along the path from a baseline input to the actual input. It provides a comprehensive way to understand how each feature contributes to the model's prediction.
An explainable AI model has traditionally been either a complex model requiring a “detailed understanding” or a simple model offering only a “global perspective of the decisions made”; thus, there exists a trade-off between the simplicity of the explanation model and the level of detail in interpretations. In practice, the majority of the available AI models for cancer diagnosis are not fully explainable, as they are complex and work like a “black box.” This makes it difficult to explain how these models arrive at their conclusions, which is a problem for doctors who need to trust the AI's diagnosis and be able to explain it to patients. Thus, despite gains in accuracy, the AI models’ lack of transparency and accountability limits their adoption into practice, particularly in critical applications such as cancer diagnosis.
The literature reveals numerous studies focused on developing alternatives to radiomic diagnostic models. 28 Despite these novel approaches, their implementation has been limited by high costs and the requirement for advanced technologies to extract and analyse the data. This highlights the need for more cost-effective and accessible solutions for lung cancer diagnosis.
According to an oncologist affiliated with the Cancer Institute at University College London, individuals tend to disregard persistent coughing and subsequently present at the clinic with metastatic disease. Unfortunately, at this point, the possibility of receiving effective treatment for cancer might have already diminished. 2 The preliminary indications of lung cancer are frequently perceived as typical maladies and deemed insignificant. The process of clinical staging is therefore a crucial factor in the determination of both treatment options and the likelihood of survival. The clinical guidelines should be subjected to regular revisions based on the available data to accurately identify predictors and facilitate prompt diagnosis. 8 When it comes to identifying the early signs of lung cancer, not all clinical doctors have the same level of expertise as subject matter experts.
According to research by Huang et al, 28 lymphocytes have been suggested as a potential prognostic indicator for lung cancer. Based on the findings by Wu et al, 29 the haemoglobin to red blood cell (RBC) distribution ratio may serve as a prognostic indicator for small cell lung cancer. In a recent study, Guidotti et al 23 have developed an ML model that utilises non-imaging electronic health record (EHR) data. Based on a dataset comprising 6505 patients diagnosed with lung cancer and 189,597 control subjects, the model exhibited superior accuracy compared to the Prostate, lung, colorectal, and ovarian cancer screening trial (PLCO) criteria in forecasting the occurrence of lung cancer within a one-year timeframe. These alternatives to radiation screening for the early diagnosis of lung cancer are expensive and limited in availability.
The literature demonstrates the significant potential clinical data holds for facilitating lung cancer diagnoses. 30 However, the suggested approaches are mostly restricted from common practice because they are unavailable at major healthcare centres or very expensive. So, there is a need for more research to find viable alternatives to radiation screening for the early diagnosis of lung cancer.
A review of the literature reveals that despite the huge potential clinical data hold for early diagnosis of lung cancer, the existing approaches predominantly utilise clinical data to analyse comorbid conditions 31 associated with lung cancer and predict survival rates.32,33 However, there is a noticeable gap in the development of diagnostic models specifically designed for lung cancer detection.2,5 To the best of our knowledge, no comprehensive diagnostic model leveraging clinical data has been developed or made available for deployment in medical centres to facilitate lung cancer diagnosis.
Research Challenges and Objectives
From the literature, the existing challenges are identified, which can be summarised as:
− Repetitive exposure to conventional radiomic diagnostic methods poses potential health risks of developing cancer.
− Lung cancer being an internal cancer, the initial disease symptoms are often regarded as inconsequential or assumed to be other maladies, leading to a late diagnosis.
− Though novel alternatives to radiomic detection methods have been developed, they are not inclusive, as the respective medical tests are available and affordable only to a privileged segment of the population.
− Despite the huge potential clinical data hold for the early diagnosis of lung cancer, the existing approaches predominantly utilise clinical data to analyse comorbid conditions associated with lung cancer and predict survival rates.
− Though AI-based cancer diagnosis models are growing rapidly, they are hardly adopted into practice, as medical practitioners deem them an incomprehensible black box.
− While developing AI diagnostic models, there is an increasing tendency to discard rare medical cases, which leads to biased learning by the model and failure to detect rare medical cases.
To address some of the above challenges, we propose to use XAI models, utilising clinical data to diagnose lung cancer. The main contributions of this research work include:
− A safe alternative to traditional radiomics-based lung cancer detection methods for the early diagnosis of lung cancer. This is inclusive and non-invasive because of the use of normal clinical data.
− A key advantage of the proposed approach is its seamless integration into existing clinical workflows without the need for additional capital investment in specialised equipment or diagnostic tests. The tool leverages routinely collected clinical data, making it a cost-effective solution. Moreover, because it does not rely on sophisticated technologies or equipment typically limited to privileged healthcare settings, the model is more accessible to a broader population.
− An explainable AI diagnostic model, to address the reliability and transparency concerns of healthcare practitioners.
− The diagnostic model can diagnose rare medical cases as well.
Thus, in this research, we develop an explainable AI model that can interpret the model's diagnosis of lung cancer from clinical data. This can enhance early diagnosis and increase the reliability of the model's operation to facilitate timely medical intervention for lung cancer in clinical practice. The rest of the paper is organised as follows: Section 2 presents the research gap in the existing works utilising clinical data. Section 3 describes the materials and methods used for the research, including the data and models. Section 4 evaluates the performance of the models and presents the results. Section 5 explains the models’ decision-making process. Section 6 discusses the implications of this research for timely diagnosing malignancy in clinical practice. The final section concludes the paper.
Materials and Methods
We present this article in accordance with the Transparent Reporting of a multivariable prediction model for Individual Prognosis Or Diagnosis (TRIPOD) reporting checklist. 34 Figure 1 shows the architecture of the proposed diagnosis model. Each of the processes and components is explained in detail in the subsequent sections.

Architecture of the Proposed Diagnosis Model.
Data Collection
This research is conducted by a collaborative team of technical and medical experts in the field in line with the ethical and data collection standards outlined in. 35 This retrospective study utilised data from the Pulmonology Department of the third-largest hospital in India. It encompasses all recorded clinical observations of patients diagnosed with lung cancer as well as those with non-cancerous lung diseases between the period 2017-2019. Given the retrospective nature of the study, no prospective inclusion or exclusion criteria were applied. Instead, all available patient records meeting the diagnostic coding criteria for lung cancer or benign pulmonary conditions were included. While this approach maximises real-world representativeness, we acknowledge that the study does not adhere to pre-specified standards for data collection, such as standardised inclusion/exclusion criteria or prospective data acquisition protocols. As such, potential variability in clinical documentation and missing data are considered when interpreting the findings.
The cohort comprised 743 patients, including 378 confirmed cases of lung malignancy and 365 cases diagnosed with benign pulmonary conditions. The dataset comprised 74 independent variables and one dependent variable, with features encompassing patient demographics, clinical symptoms, and a full blood count profile. Due to the retrospective nature of data acquisition, completeness varied; only 15 records were fully complete, while 728 contained varying degrees of missing data. A total of 23 records were excluded from the analysis due to having more than 80% missing values. Appropriate data imputation and pre-processing techniques were employed to address the remaining missingness and prepare the data for analysis. 36
Patient Characteristics
Our dataset contains patient details such as gender, age, habitual and intoxication characteristics, critical environmental checks, family history of cancer, clinical signs and symptoms, recording of history of lung diseases, clinical analysis results of blood and urine, and current disease observation. The 74 independent variables (features) include Gender, Age, Type_of_Meal, Diet_Type, Mode_of_Tobacco, Period_of_Consumption, Freqeuncy_of_Consumption, Pattern_of_Consumption, Passive_Smoking, Alcoholic, Modality_of_Cooking, Cancer_in_Family, Clean_water, Expose_to_Chemicals, Vision_Problem, Comorbidity, Previous_CT/Xray, History_TB, History_Pneumonia, History_COPD, History_Respiratory_Tract_Infection, History_Interstetial_Lungdisease, BMI, Gastrointestinal, Muscle_Weakness, Slow_Tissue_Healing, Weight_Loss, Eating_Disorder, Wheezing, Chest_pain, Bone_pain, Shaking_chills, Night_sweats, Coughing_Status, Fever, Lethargic, Hoarseness_in_Sound, Limbs_swelling, Dehydration, Dyspnea, Dysphagia, Hemoptysis, Rheumatoid, Hepatomegaly, Pleural_Effusions, Clubbing, Systolic, Diastolic, Pulse, Pulse_Oximetry, Pulmonary_Hypertension, Sputum_Cytology, Spirometry, WBC, RBC, Hemoglobin, Hematocrit, Platelets, Neutrophils, Lymphs, Monocytes, ESR, Glucose_Fasting, Bacteria_in_Blood, Creatinine, Urea_Nitrogen, Uric_Acid, Urine_Protein, Blood_group, COPD, Pneumonia, TB, Myocardial_Infarct_syndrome, Coronary_Artery_Bypass. The target variable is a binary variable Cancer_class. Table 1 shows the relevant baseline characteristics of the patient data.
Baseline Characteristics of the Patient Data.
Data pre-Processing
The collected data contained mixed variables, both numeric and categorical in nature. A feature encoding technique was applied to transform the categorical data, and features with constant values across all instances were discarded. These transformations resulted in 78 input features. During data exploration, it was observed that a significant number of attributes in the dataset exhibited a non-Gaussian distribution. Therefore, we used semi-parametric multiple imputation by chained equations (MICE) to handle the missing values in the data. 36 MICE utilises multiple regression to examine all conditional distributions and associated regression models. 37 Each missing attribute is then predicted based on a regression equation, so for multiple missing attributes there are multiple chains of values. The missing values are initially imputed with placeholders, and then the missing records are regressed multiple times in a cyclic manner. Subsequently, the missing value is substituted with the predicted value. This approach is recommended by Van Buuren 38 for imputing data when the sample size is greater than 400. The predicted missing values need not lie within the range of observed values.
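As an illustration, the chained-equation imputation described above can be approximated with scikit-learn's IterativeImputer. This is a minimal sketch, assuming `df` is a pandas DataFrame holding the 78 encoded features with missing entries; it is not necessarily the exact configuration used in this study.

```python
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401  (enables IterativeImputer)
from sklearn.impute import IterativeImputer

# Chained-equation (MICE-style) imputation: each feature with missing values is
# regressed on the remaining features in a round-robin fashion over several cycles.
mice = IterativeImputer(max_iter=10, sample_posterior=True, random_state=42)
df_imputed = pd.DataFrame(mice.fit_transform(df), columns=df.columns)
```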
Handling Outliers
In ML, the training data needs to be screened for outliers. Medical data carry a high chance of outliers, and so they must be handled appropriately. Outliers in medical data can arise from two causes: errors and rare medical cases. These need to be properly differentiated with the help of a domain expert or by standard procedures. Using the isolation forest method, 39 we identified 73 outliers in our data. With the guidance of the domain expert, we established that these are not error cases but rare medical cases. Discarding rare cases in medical data is not a recommended practice. Hence, we need to appropriately prepare the data for the AI model by applying scaling and data augmentation. Robust scaling scales the data based on the median and the interquartile range. This approach is more resilient to the presence of outliers, as it is less influenced by extreme values, and is particularly beneficial when working with non-Gaussian data. 40 Data augmentation selectively augments the underrepresented records, helping to create more balanced and representative datasets. The synthetic minority oversampling technique (SMOTE) and adaptive synthetic sampling (ADASYN) are two prominent methods used for data augmentation. 41 SMOTE finds the n nearest neighbours in the minority class for each sample in that class and then interpolates between each sample and its neighbours to generate synthetic data points. ADASYN is an improved version of SMOTE: it adds a random variance to the generated points to scatter them rather than confining them to linear interpolation, and it adaptively generates more synthetic samples for minority instances that are harder to learn. 41 Hence, we apply the ADASYN method to augment the data.
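A minimal sketch of these preparation steps is given below, assuming `X` and `y` are the imputed feature matrix and labels; the exact parameters used in this study may differ.

```python
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import RobustScaler
from imblearn.over_sampling import ADASYN

# Flag candidate outliers (-1) for review with the domain expert, who decides
# whether they are errors or genuine rare medical cases.
outlier_flags = IsolationForest(random_state=42).fit_predict(X)

# Median/IQR scaling is less sensitive to the retained rare-case records.
X_scaled = RobustScaler().fit_transform(X)

# ADASYN adaptively oversamples the minority class to balance the dataset.
X_aug, y_aug = ADASYN(random_state=42).fit_resample(X_scaled, y)
```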
Model Selection
Given the limitations in the volume of data available for this research, we selected the following widely used cancer diagnosis models: Logistic regression, 42 K-Nearest Neighbours (KNN), 43 Random forest, 44 Support vector machine (SVM), 45 ANN 46 and DNN. 47 We trained these classifier models with our dataset, and the performance of these diagnosis models is compared in Table 2.
Preliminary Results of the State-of-the-art Cancer Diagnosis Models on our Clinical Data.
From the preliminary results shown in Table 2, we found that the top-performing diagnosis models are DNN and ANN models. So, we focus more on these two models.
Building the AI Models
ANNs are very promising in cancer diagnosis as they have the potential to find complex patterns in data. 48 This is important in cancer diagnosis as subtle variations in blood tests, imaging scans, or a patient's medical history may hold clues about the presence or absence of cancer. Studies have shown that ANN models achieve high accuracy rates in diagnosing various cancers, such as lung cancer and skin cancer. After exploratory analysis, we understood that our data has a complex structure. Various factors such as non-linearity, high dimensionality, interdependencies, and non-Gaussian data distributions can contribute to this structure. 49 The structure of the data may therefore not be easily captured by simple linear models, so we choose the ANN model, which is capable of analysing data with complex structures.
An ANN classifier consists of an input layer, one or more intermediate hidden layers, and an output layer. Each layer is composed of a number of interlinked neurons. The input layer receives the input features, and the output layer produces the final classifications. Every individual neuron applies an activation function to the weighted summation of its inputs. 50 The choice of initial weights can affect how quickly the neural network converges during training. Well-selected initial weights can lead to faster convergence, reducing the time it takes for the model to learn the underlying patterns in the data. Common techniques for weight initialisation include random initialisation, Glorot initialisation, He initialisation, and so on. If all the neurons in a particular layer start with the same initial weights, they may end up learning the same features during training. This symmetry issue can hinder the model's ability to learn diverse representations and can limit its capacity. 51 Randomly initialising the weights helps the optimisation algorithm escape local minima during training. The activation function introduces non-linearity and allows the network to learn complex patterns. The rectified linear unit (ReLU) activation function outputs max(0, x), ie, it passes positive inputs unchanged and outputs zero for negative inputs. 52
The input data is passed through the network in the forward direction, from the input layer to the output layer. The outputs of the neurons in each layer serve as inputs to the neurons in the subsequent layers. This process is called feedforward propagation, and it transforms the input data through the network to generate predictions. Each connection between neurons is associated with a weight, which represents the strength of the connection. Additionally, each neuron has a bias term, which allows the network to adjust the output independently of the input. These weights and biases are initially random and are updated during the training process. The network is trained by adjusting the weights and biases to minimise a loss function that measures the discrepancy between the predicted labels and the true labels. The backpropagation algorithm is commonly used for the training of ANNs. 50 It calculates the gradients of the loss function with respect to the model parameters and updates the weights and biases accordingly. At the output layer, to enable binary classification, the sigmoid activation function is chosen to ensure that the predicted values are between 0 and 1. Our model uses the Adam learning rate optimisation algorithm that combines the benefits of both momentum and root mean square propagation (RMSprop) for improving the model's performance. It adjusts learning rates for each parameter individually, making it well-suited for neural network optimisation. 53 Model complexity and simplicity of interpretation are two key factors to be considered while designing a diagnosis model. Therefore, we design two AI models: a simple ANN model, which learns from the feature-selected attributes based on relevance, and a DNN model, which learns from the entire data.
ANN Model
We develop a shallow ANN model that has only one hidden layer between the input and output layers. Such neural networks are simpler to train and use fewer computational resources. So, we need to perform feature selection to reduce the number of features fed to the model. Leveraging the Extra trees feature selection and ranking scores, we filtered 15 relevant features as input to the model. 54 The selection of 15 attributes from 78 was finalised through a subset selection approach. This keeps our model simple and explainable.
The Extra trees feature selection method assesses the significance of each feature in an ensemble model built on Extra trees based on Gini impurity scores. Here, the higher the value, the more significant the feature. 55 The benefits of this method include the ability to handle mixed data, improved generalisation, capture of intricate feature-feature interactions, and identification of non-linear correlations between features and the target variable. Figure 2 shows the top fifteen relevant features and their feature importance scores given by the Extra trees model. 56

Feature ranking by Extra trees.
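A minimal sketch of this ranking step, assuming `X` is a DataFrame of the 78 encoded features and `y` is the Cancer_class label; the number of trees shown is illustrative.

```python
import pandas as pd
from sklearn.ensemble import ExtraTreesClassifier

# Fit an Extra Trees ensemble and rank features by Gini-based importance.
forest = ExtraTreesClassifier(n_estimators=200, random_state=42)
forest.fit(X, y)
importances = pd.Series(forest.feature_importances_, index=X.columns)

# Keep the fifteen most important features as input to the shallow ANN model.
top15 = importances.nlargest(15).index.tolist()
X_selected = X[top15]
```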
Our three-layer ANN architecture consists of a first (input) layer, a hidden layer, and an output layer with 8, 4, and 1 neuron(s), respectively, as shown in Figure 3. We use a simple pyramid neural network structure for our model. Pyramid networks are designed to capture features at multiple scales. This is particularly useful for numerical data where patterns might exist at different levels of granularity. Lower layers capture finer details, while higher layers capture more abstract representations. Pyramid networks can form rich and comprehensive representations of the data, which is beneficial for classification tasks that require understanding of complex relationships. This model also captures contextual information effectively, allowing the network to consider a wider context for each decision.57,58 Keras, with TensorFlow as the backend, facilitates designing the first layer with a flexible number of neurons and the creation of dense connections. Hence, for 15 input features, we have designed our first layer with eight neurons, where each neuron in this layer takes all 15 features rather than acting as a placeholder for each input feature. 59 Since our objective is to design a simple model with less complexity, we have tried to meet our objective with fewer neurons and layers. Therefore, each neuron in the first layer learns a property from all the input features. We performed hyperparameter optimisation using both grid search and random search strategies. The search space for grid search included variations in activation functions (ReLU and sigmoid), optimiser choices ('sgd’, ‘adam’, ‘rmsprop’, ‘nadam’), batch sizes (16, 24, 32, 64), learning rates (0.001 to 0.01), and dropout (0.1 to 0.5). During random search, we performed 50 iterations, randomly sampling the number of hidden layers (ranging from 1 to 5), and the range for the number of neurons was given as ([8, 16, 24], [4, 8, 12], [1, 2, 4]). For the number of epochs, we implemented early stopping and selected the optimal epoch based on the best validation performance recorded in the training history. For the ANN, we use ReLU and sigmoid activation functions in our model. We use the random weight initialisation technique in our model. The other hyperparameter values chosen include a learning rate of 0.001, a batch size of 32, the Adam optimiser, and 25 epochs. The features are initially scaled before being fed into the network to avoid the undue influence of attribute value ranges.

Architecture of the ANN Model.
The computation at the jth neuron of the first (input) layer can be written as:

$$I_{1j} = \sigma_1\!\left(\sum_{i=1}^{15} W_{ji}\,x_i + b_{1j}\right)$$

Combining it all into matrix form, we can represent it as:

$$I = \sigma_1\!\left(W^{1}X + B^{1}\right)$$

Similarly, for each layer in the neural network, the computations can be expressed as:

Input layer: $I = \sigma_1\!\left(W^{1}X + B^{1}\right)$; Hidden layer: $H_{1} = \sigma_2\!\left(W^{2}I + B^{2}\right)$; Output: $Y = \sigma_3\!\left(W^{3}H_{1} + B^{3}\right)$,

where $\sigma_1$ and $\sigma_2$ denote the ReLU activation function, $\sigma_3$ denotes the sigmoid function, $X$ is the vector of the 15 input features, and $W^{k}$ and $B^{k}$ are the weight and bias matrices of the $k$th layer.
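A minimal Keras sketch of this shallow pyramid architecture, under the hyperparameters reported above; the commented training call is indicative only.

```python
from tensorflow import keras
from tensorflow.keras import layers

# 15 selected clinical features -> 8 -> 4 -> 1 (sigmoid) pyramid structure.
ann = keras.Sequential([
    layers.Input(shape=(15,)),
    layers.Dense(8, activation="relu"),
    layers.Dense(4, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
ann.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
            loss="binary_crossentropy", metrics=["accuracy"])
# ann.fit(X_train, y_train, batch_size=32, epochs=25, validation_split=0.1)
```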
DNN Model
DNNs are capable of capturing more complex patterns in data due to the increased number of processing steps. The traditional methods require hand-crafted feature extraction, whereas DNNs can automatically learn the most relevant features from the data itself. This helps in reducing human bias and streamlines the analysis process.
Our proposed DNN model architecture has a first layer of input, three hidden layers, and an output layer. These layers consist of 78, 32, 16, 8, and 1 neuron(s), respectively. The choice of pyramidal network design is made as it progressively distils the input features. 58 All the layers except the output layer use the ReLU activation function. The output layer uses the sigmoid activation function. All the layers are fully connected (FC). Dropout rates of 0.5 and 0.3 are applied to the first dense layer and the first hidden layer, respectively. Dropout is a regularisation technique where randomly selected neurons are ignored during the training process in order to prevent overfitting due to interdependence among neurons. Figure 4 shows the five-layer architecture of the DNN model. All the input features are scaled before being directly fed into the DNN as inputs. We performed hyperparameter optimisation using both grid search and random search strategies. The search space included variations in activation functions (ReLU and sigmoid), optimiser choices ('sgd’, ‘adam’, ‘rmsprop’, ‘nadam’), batch sizes (16, 24, 32, 64), learning rates (0.001 to 0.01), and dropout (0.1 to 0.5). During random search, we performed 50 iterations, randomly sampling the number of hidden layers (ranging from 1 to 5), and the range for the number of neurons was given as ([64, 78, 85, 90], [16, 32, 40, 48], [8, 16, 24], [4, 8, 12], [1, 2, 4]). For the number of epochs, we implemented early stopping and selected the optimal epoch based on the best validation performance recorded in the training history. The chosen hyperparameters for the DNN include a learning rate of 0.001, a batch size of 24, the Adam optimiser, and 15 epochs.

Architecture of the DNN model.
Combining all the inputs to the individual neurons of the first layer into matrix form, we can represent it as:

$$I = \sigma_1\!\left(W^{1}X + B^{1}\right)$$

Similarly, for each layer in the neural network, the computations can be expressed as:

Input layer: $I = \sigma_1\!\left(W^{1}X + B^{1}\right)$; First hidden layer: $H_{1} = \sigma_2\!\left(W^{2}I + B^{2}\right)$; Second hidden layer: $H_{2} = \sigma_3\!\left(W^{3}H_{1} + B^{3}\right)$; Third hidden layer: $H_{3} = \sigma_4\!\left(W^{4}H_{2} + B^{4}\right)$; Output: $Y = \sigma_5\!\left(W^{5}H_{3} + B^{5}\right)$.
$\sigma_i$ denotes the ReLU activation function for $i \le 4$, and $\sigma_5$ denotes the sigmoid function. The variable $x_i$ refers to the $i$th input feature, $W_{ji}$ denotes the weight associated with neuron $j$ for the input feature $x_i$, and $W^{k}$ denotes the weight matrix at the $k$th layer. Similarly, $b_{kj}$ denotes the bias term of the $j$th neuron at the $k$th layer, and $B^{k}$ indicates the bias matrix at the $k$th layer. The variable $I_{kj}$ represents the total input at the $j$th neuron of the $k$th layer, and $I$ represents the output of the first layer. $H_{k}$ denotes the output of the $k$th hidden layer. $Y$ represents the final output obtained from the DNN.
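A minimal Keras sketch of the 78-32-16-8-1 pyramid DNN with the dropout and hyperparameters reported above; the exact placement of the dropout layers reflects our reading of the description.

```python
from tensorflow import keras
from tensorflow.keras import layers

dnn = keras.Sequential([
    layers.Input(shape=(78,)),            # all 78 scaled clinical features
    layers.Dense(78, activation="relu"),  # first dense layer
    layers.Dropout(0.5),
    layers.Dense(32, activation="relu"),  # first hidden layer
    layers.Dropout(0.3),
    layers.Dense(16, activation="relu"),
    layers.Dense(8, activation="relu"),
    layers.Dense(1, activation="sigmoid"),
])
dnn.compile(optimizer=keras.optimizers.Adam(learning_rate=0.001),
            loss="binary_crossentropy", metrics=["accuracy"])
# dnn.fit(X_train, y_train, batch_size=24, epochs=15, validation_split=0.1)
```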
Results
In this section, we evaluate the performance of the models developed and present the results. Then, based on the observed results, we explain and interpret the model outcomes. This is followed by the development of the ensemble model and the evaluation of its performance.
Cross Validation
In situations where the available dataset is limited, cross-validation ensures reliable evaluation by using different subsets of the available data for training and testing across multiple folds, making it less likely that the model will perform well solely due to chance. 60 We use 10-fold cross-validation to evaluate the performance of our model. Precision and recall are crucial metrics in the context of medical applications, where false positives (FP) and false negatives (FN) can have significant impacts on patient outcomes. Precision indicates the accuracy of positive classifications. In cancer diagnosis, high precision means that when the model classifies a case as malignant, it is likely to be correct. High recall suggests a lower risk of FN, which is critical in cancer diagnosis, as a missed malignant case can delay treatment. A good F1-score indicates a better balance between precision and recall. The receiver operating characteristic-area under the curve (ROC-AUC) provides a comprehensive summary of a model's performance 61 across various classification thresholds. It considers the trade-off between the true positive rate (TPR) and the false positive rate (FPR). Table 3 compares the performance of the candidate models developed. It can be observed that the DNN model, which used all 78 attributes, scored higher accuracy, recall and F1-score than the simple ANN model. The simple ANN model, which used only 15 relevant attributes, scored higher precision than the DNN model.
Performance Metrics for the ANN and DNN Models.
From Table 3, it can be observed that with clean data (without rare records), the performance of the DNN model is very good, and the ANN model performs slightly inferior. However, in the medical domain, it makes sense to consider rare medical cases while developing the model. A model trained without considering rare cases might become biased towards more common conditions, reducing its generalizability and reliability when deployed in clinical settings. The ADASYN augmented data yielded better results for our models than the data without augmentation. Hence, we choose to develop fair models that offer inclusivity to diagnose rare medical conditions. We then tried to enhance the diagnosis performance of these models through additional steps.
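A minimal sketch of the 10-fold evaluation loop used here, assuming `X` and `y` are numpy arrays and `build_ann()` is a hypothetical factory returning a freshly compiled Keras model for each fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=42)
fold_scores = []
for train_idx, test_idx in skf.split(X, y):
    model = build_ann()                                   # fresh model per fold
    model.fit(X[train_idx], y[train_idx], epochs=25, batch_size=32, verbose=0)
    prob = model.predict(X[test_idx]).ravel()
    pred = (prob >= 0.5).astype(int)
    fold_scores.append({"accuracy": accuracy_score(y[test_idx], pred),
                        "precision": precision_score(y[test_idx], pred),
                        "recall": recall_score(y[test_idx], pred),
                        "f1": f1_score(y[test_idx], pred),
                        "auc": roc_auc_score(y[test_idx], prob)})

# Mean of each metric across the ten folds.
mean_scores = {m: np.mean([s[m] for s in fold_scores]) for m in fold_scores[0]}
```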
Calibration Curve
A calibration graph gives a visual representation of the quality of model fit by plotting the predicted probabilities against the observed outcomes. It is commonly used in ML to assess the correctness and reliability of model classifications. 62 Calibration graphs are important tools in evaluating ANN and DNN models because they provide insight into the reliability and accuracy of the model's predictions. The x-axis of the graph typically represents the predicted probabilities generated by the model, while the y-axis represents the fraction of actually observed positive outcomes.63,64 By comparing these values, we can assess how well the model's classifications align with the actual underlying probabilities. In a well-calibrated model, the points on the calibration graph should fall along a diagonal line, indicating that the predicted probabilities closely match the true probabilities. Deviations from the diagonal line suggest calibration errors, indicating that the model's predicted probabilities may not accurately reflect the true likelihood of events. Platt scaling is a method used to calibrate the output probabilities. 62 It involves fitting a logistic regression model to the predicted probabilities to map them to calibrated probabilities.
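A minimal sketch of Platt scaling and the calibration-curve points, assuming `p_val`, `y_val` are validation-set probabilities and labels from one of the models and `p_test`, `y_test` are held-out probabilities and labels.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.calibration import calibration_curve

# Platt scaling: a logistic regression maps raw probabilities to calibrated ones.
platt = LogisticRegression().fit(p_val.reshape(-1, 1), y_val)
p_calibrated = platt.predict_proba(p_test.reshape(-1, 1))[:, 1]

# Points for the calibration plot: observed fraction of positives per bin
# versus the mean predicted probability in that bin.
prob_true, prob_pred = calibration_curve(y_test, p_calibrated, n_bins=10)
```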
Figure 5 compares the calibration curves, plotted with Platt scaling, for the ANN and DNN models. From the graphs, it can be observed that the ANN aligns more closely with the ideal calibration curve than the DNN model at higher prediction probabilities. This could be because DNNs require large datasets to leverage their capacity effectively, which is not the case here. When the dataset is small or noisy, ANNs, being simpler, may perform better as they do not overfit as easily.

Calibration Curves of ANN and DNN Models.
Error Analysis
We performed an error analysis augmented with counterfactual reasoning to examine how minimal changes to the input features could potentially correct the model's errors. We identified misclassified samples by comparing predicted labels to true labels on the test set. The percentages of true positives, false positives, true negatives, and false negatives are 43.06%, 9.03%, 43.06%, and 4.86%, respectively.
Among the false positives and false negatives, we selected representative samples for deeper interpretability. To explain misclassified predictions, we used Diverse Counterfactual Explanations (DiCE). 65 For each misclassified instance, we generated multiple diverse and plausible counterfactuals and identified the minimal changes in a set of impactful features that flip the decision to the correct class.
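A minimal sketch of this counterfactual step with the dice_ml package, assuming `train_df` contains the encoded features plus the Cancer_class label, `misclassified_df` holds the feature rows of the misclassified cases, and the listed continuous features are an illustrative subset.

```python
import dice_ml

data = dice_ml.Data(dataframe=train_df,
                    continuous_features=["ESR", "Hemoglobin", "Neutrophils"],
                    outcome_name="Cancer_class")
model = dice_ml.Model(model=dnn, backend="TF2")       # wrap the trained Keras model
explainer = dice_ml.Dice(data, model, method="gradient")

# Generate diverse counterfactuals that flip each misclassified prediction.
cfs = explainer.generate_counterfactuals(misclassified_df,
                                         total_CFs=4,
                                         desired_class="opposite")
cfs.visualize_as_dataframe(show_only_changes=True)
```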
Table 4 shows the top features identified by DiCE that, when sufficiently perturbed, change the model's output from a false positive to a true negative diagnosis. The average reductions in these features sufficient for the model to flip the decision from cancer to no cancer are shown.
Counterfactual Explanation for False Positives to True Negatives Using DiCE.
Table 5 shows the top features identified by DiCE that, when sufficiently perturbed, change the model's output from a false negative to a true positive diagnosis. The average changes in these features sufficient for the model to flip the decision from no cancer to cancer are shown. These observations reveal the major factors that confuse the diagnosis model into misclassifying cases.
Counterfactual Explanation for False Negatives to True Positives Using DiCE.
Explaining Model Decision
Explaining and interpreting the results of neural network models is crucial for building trust, providing transparency and accountability, and diagnosing errors. Interpretation of the models presents the details of the methods and techniques used by the models to arrive at their decisions. SHAP and LIME are two popular methods that can explain neural network classifications. LIME focuses only on local, instance-level interpretability, whereas SHAP offers both local and global interpretability with consistent attributions based on Shapley values. Though SHAP is computationally intensive and complex, it provides more accurate and theoretically sound explanations than LIME.24,64 Hence, we use the SHAP tool to interpret our models.
For a binary classification model, the SHAP value for feature $i$ is computed using equation (1):

$$\phi_i = \sum_{S \subseteq F \setminus \{i\}} \frac{|S|!\,\left(|F| - |S| - 1\right)!}{|F|!}\left[f_{S \cup \{i\}}\!\left(x_{S \cup \{i\}}\right) - f_{S}\!\left(x_{S}\right)\right] \qquad (1)$$

where $F$ is the set of all input features, $S$ is a subset of features excluding feature $i$, $f_{S}(x_{S})$ is the model output using only the features in $S$, and $\phi_i$ is the SHAP value (contribution) of feature $i$ for the given prediction.
Positive SHAP values indicate that the feature contributes positively to the decision, while negative SHAP values indicate a negative contribution. Features with larger absolute SHAP values are considered more important in influencing the model's decisions. 64 Figure 6 shows the 3D visualisation of the SHAP values for a subset of medical instances and their corresponding features. The red scale corresponds to a positive diagnosis, whereas the blue scale corresponds to a negative diagnosis. The final decision for an instance can be interpreted as a function of the sum of individual SHAP values for each feature.

The 3D visualisation of SHAP values.
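A minimal sketch of the DeepSHAP computation behind the plots in this section, assuming `dnn` is a trained Keras model, `X_background` is a small sample of training rows, `X_explain` contains the cases to interpret, and `feature_names` lists the input features.

```python
import shap

# DeepSHAP: attribute each prediction to the input features.
explainer = shap.DeepExplainer(dnn, X_background)
shap_values = explainer.shap_values(X_explain)

# Beeswarm-style summary of feature contributions (as in Figures 7 and 10).
shap.summary_plot(shap_values[0], X_explain, feature_names=feature_names)
```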
Explaining the ANN Model Outcome Using SHAP Interpretation
Figure 7 shows the top fifteen features contributing to the model's decision, ranked in descending order. We can identify Hemoglobin, ESR, and Neutrophils as the top features contributing to the lung cancer diagnosis. The left or right orientation of the colour scale indicates the direction of the feature's impact on the model for a change in feature value. For example, the red points of ESR lie on the positive SHAP scale, which indicates that high ESR contributes positively to a lung cancer diagnosis and low ESR contributes negatively. The attribute RBC shows the opposite pattern, as indicated by the reversed colouring. From Figure 7, all features except Pneumonia and RBC contribute positively to the lung cancer diagnosis.

SHAP Summary Plot for the ANN Model.
Figure 8 shows the mean absolute SHAP values of each feature based on the class. This represents the average impact each feature has in labelling an instance to a particular class by the ANN model. For instance, the red bar represents the impact of each feature contributing to labelling a case as malignant. To further understand the associations between features and their contribution to lung cancer diagnosis, we plot dependency graphs. The dependency plot has SHAP values of a feature on the y-axis and standardised values of that feature on the x-axis. The colour gradient represents its association with another feature.

Classwise mean absolute SHAP values for the ANN model.
Figure 9 shows the two dependency plots of ESR against period of consumption (tobacco) and frequency of consumption (tobacco) for the ANN model. Two noteworthy associations were observed while exploring the dependencies of features using SHAP values. Patients with longer periods of tobacco consumption and higher frequencies of consumption tend to have increased ESR values. This insight aligns with the relevant clinical literature.66–68 In addition, a higher ESR value has a higher SHAP value, which indicates increased chances of a lung cancer diagnosis.

SHAP Dependency Plot of the ANN Model.
Explaining the DNN Model Outcome Using SHAP Interpretation
Figure 10 shows the SHAP summary plot for the DNN model. It contains the top 20 features that significantly contribute to lung cancer diagnosis by the DNN model based on their corresponding ranking. The three most significant features of this model are ESR, Mode_of_Tobacco, and Sputum_Cytology. From Figure 10, all the features except History_Respiratory_Tract_Infection, Lymphs, Fever, Platelets, and Spirometry contribute positively to the lung cancer diagnosis.

SHAP summary plot for DNN model.
Figure 11 shows the mean absolute SHAP values of each feature based on the class. This represents the average impact each feature has in labelling an instance to a particular class by the DNN model. For example, the red bar against the feature Sputum_Cytology shows its huge impact in labelling a case as malignant with a mean absolute SHAP value above 0.025. Its impact on labelling a case as non-cancerous is relatively less. Further associations of these primary features are visualised as dependency graphs, as shown in Figure 12. These results are then reviewed by domain experts to ensure their clinical relevance.

Class-Wise Mean Absolute SHAP Values for the DNN Model.

SHAP dependency plot of the DNN model.
Figure 12 shows the dependency graphs plotted for the features of the DNN model. The first two dependency plots align with the findings in the dependency plots for the ANN model. The third dependency plot indicates an association of high Hoarseness_in_Sound with the severity of Coughing_Status, both of which contribute positively to the malignancy diagnosis. The fourth dependency graph shows an increasing association between high ESR values and a positive History_COPD. The fifth dependency graph indicates that the occurrence of COPD has a high association with low Pulse_Oximetry. The sixth dependency graph shows that pleural effusions are associated with lower values of Pulse_Oximetry.
In our ANN model with fewer features, the model relies more heavily on each feature, and so their SHAP values are typically higher. In the DNN model, which has many features, the model might still rely on significant features, but the SHAP values tend to be lower because the contributions are spread out across more features. Hence, the importance is distributed across a larger number of features, leading to smaller individual SHAP values. As a result, the threshold of significance can vary for each case based on the relative ranking. One interesting observation in the interpretation of the diagnosis models developed is that the attribute ESR is a major determinant that favours a malignancy diagnosis in both models. This indicates that an elevated ESR value is a prominent factor in labelling a patient's case as malignant, but it is not a sufficient condition, ie, a high ESR does not necessarily imply the presence of cancer. However, elevated ESR levels were observed in a larger proportion of lung cancer patients.
Developing an Ensemble Model
Now we have two models, one with better accuracy and recall (the DNN model) and the other with better precision and calibration (the ANN model). We utilise the benefits of the individual models and mitigate the effect of their biases by leveraging an ensemble model for diagnosis using two approaches. The weighted voting ensemble approach directly combines the predictions of the individual models using predefined weights. 69 This approach does not involve training a meta-model on top of the base model predictions. The final output is computed using equation (2):

$$\hat{y} = w_{ANN}\,p_{ANN}(x) + w_{DNN}\,p_{DNN}(x) \qquad (2)$$

where $p_{ANN}(x)$ and $p_{DNN}(x)$ are the predicted probabilities of the ANN and DNN models for an instance $x$, and $w_{ANN}$ and $w_{DNN}$ are the predefined weights, with $w_{ANN} + w_{DNN} = 1$.
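A minimal sketch of equation (2), assuming `p_ann` and `p_dnn` are numpy arrays of predicted probabilities from the two base models; the weights shown are those reported below.

```python
import numpy as np

w_ann, w_dnn = 0.4, 0.6                      # predefined model weights
p_weighted = w_ann * p_ann + w_dnn * p_dnn   # combined probability per case
y_pred = (p_weighted >= 0.5).astype(int)     # default 0.5 decision threshold
```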
On the other hand, the stacking ensemble approach involves training the base models (the ANN and DNN models) and then using their predictions as inputs to a meta-model (logistic regression) that makes the final decision. 70 The final output of the model is computed using equation (3):

$$\hat{y} = \sigma\!\left(\beta_0 + \beta_1\,p_{ANN}(x) + \beta_2\,p_{DNN}(x)\right) \qquad (3)$$

where $\sigma$ is the logistic (sigmoid) function and $\beta_0$, $\beta_1$, and $\beta_2$ are the coefficients of the logistic regression meta-model learned from the base models' predicted probabilities.
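A minimal sketch of the stacking approach in equation (3), assuming `ann` and `dnn` are the trained base models; in practice the meta-model should be fitted on out-of-fold base-model predictions to avoid leakage.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Base-model probabilities become the meta-model's two input features.
meta_X_train = np.column_stack([ann.predict(X_train).ravel(),
                                dnn.predict(X_train).ravel()])
meta_model = LogisticRegression().fit(meta_X_train, y_train)

# Final ensemble probability and class for unseen cases.
meta_X_test = np.column_stack([ann.predict(X_test).ravel(),
                               dnn.predict(X_test).ravel()])
ensemble_prob = meta_model.predict_proba(meta_X_test)[:, 1]
ensemble_pred = (ensemble_prob >= 0.5).astype(int)
```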
Figure 13 shows the AUC scores of the Weighted voting ensemble model and the stacking ensemble model. For instance, the weighted voting with ANN model weight 0.4 and DNN model weight 0.6 yielded a mean AUC of 0.85, and the Stacking approach yielded a mean AUC of 0.86.

AUC Scores of the Weighted Voting Ensemble Model and the Stacking Ensemble Model.
It can be observed that the stacking ensemble approach has improved the overall diagnosis capability of the model despite the presence of rare-record data. It achieved a performance comparable to that of the DNN model with clean data. The stacking ensemble model gives a mean accuracy of 0.8558, a mean AUC of 0.8600, a mean precision of 0.8092, a mean recall of 0.9282, and a mean F1-score of 0.8646. These performance scores are satisfactory for developing practical models. The weighted voting ensemble model gives a mean accuracy of 0.8411, a mean AUC of 0.8500, a mean precision of 0.7846, a mean recall of 0.8907, and a mean F1-score of 0.8342. Figure 14 shows the model accuracy comparison with 95% confidence intervals (CI). The error bars do not overlap with those of the ensemble model, indicating its statistically significant superior performance.

Model accuracy comparison with 95% C.I.
Discussion
Given the moderate sample size and high dimensionality, we are aware of the high risk of overfitting for the DNN model; however, we took several rigorous steps to mitigate this by employing cross-validation and dropout. We also kept the network architecture intentionally shallow and limited the number of parameters to suit the small dataset size.
Overfitting would be indicated by a widening gap between the average validation loss and the average training loss. From Figure 15, we do not observe any signs of substantial overfitting. Having addressed that concern, we now turn our attention to two key aspects of this research: the use of ADASYN and the implementation of an ensemble model.

Training Versus Validation Loss for DNN Model.
ADASYN is employed to address the challenge of class imbalance in medical datasets, particularly when rare conditions are underrepresented. This approach is crucial to ensure fairness in medical research, where certain disease conditions may have limited data due to their rarity. The possible biases can be mitigated through careful expert review of the generated synthetic samples and regularisation, thereby ensuring proper management of the potential drawbacks. This inclusivity is vital for developing diagnostic tools that perform equitably across diverse patient populations, preventing the neglect of rare cancer cases.71,72
The benefit of the ensemble model can be leveraged when the base models have diversity in their operation, ie, when the individual models learn differently and make different kinds of errors. Here, our ANN model has higher precision and the DNN model has higher recall; therefore, the ensemble imbibes the merits of both models. Since the data size is limited, adding more base models might result in overfitting. Also, if the errors of the base models are not complementary, adding additional models will not benefit the ensemble performance. Youden's J statistic can be used to find the optimal threshold for classifying the classes to avoid a high disparity between accuracy and AUC values for the model. For our final model, the optimal threshold for classification is 0.3936. The ensemble model with a stacking approach gives better performance with the dataset containing rare outlier data, which is comparable to the performance of the DNN model trained on clean clinical data. We evaluated the statistical significance of the performance differences between the ensemble model and the DNN model (trained on clean data) using McNemar's test. 73 The resulting test statistic was 0.2500 with a p-value of 0.6171, indicating no statistically significant difference between the two models. In contrast, when comparing the ensemble model with the DNN model trained on ADASYN-augmented data, the test yielded a statistic of 107.6412 with a p-value of less than 0.0001, indicating a statistically significant improvement in performance by the ensemble model.
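A minimal sketch of the threshold search and the paired comparison described above, assuming `y_true`, `ensemble_prob`, `ensemble_pred`, and `dnn_pred` are available from the evaluation.

```python
import numpy as np
from sklearn.metrics import roc_curve
from statsmodels.stats.contingency_tables import mcnemar

# Youden's J: the threshold maximising (TPR - FPR) along the ROC curve.
fpr, tpr, thresholds = roc_curve(y_true, ensemble_prob)
optimal_threshold = thresholds[np.argmax(tpr - fpr)]

# McNemar's test on the paired correctness of the two models.
ens_ok = (ensemble_pred == y_true)
dnn_ok = (dnn_pred == y_true)
table = [[np.sum(ens_ok & dnn_ok), np.sum(ens_ok & ~dnn_ok)],
         [np.sum(~ens_ok & dnn_ok), np.sum(~ens_ok & ~dnn_ok)]]
result = mcnemar(table, exact=False, correction=True)
print(result.statistic, result.pvalue)
```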
Comparing the Proposed Model with the Latest Models
In this section, we compare the performance and features of the proposed model to the latest models developed for lung cancer diagnosis, prediction, and prognosis. Table 6 highlights the key advantages of the proposed model. It is explainable, poses no health risks as it relies solely on routine clinical data without the need for repeated radiomic procedures, and is cost-effective since it does not require sophisticated tests or complex processing. Additionally, the model is scalable and can be deployed across a wider range of medical centres, unlike the models that use advanced technologies based on genomic data, which are often accessible only to a privileged segment of the population. Notably, this model has been developed to include rare cases of lung cancer, ensuring fairness and inclusivity in its application.
Comparison of Model Features of the Proposed Lung Cancer Diagnosis Model with the Latest Lung Cancer Models.
Table 7 presents the approximate cost range associated with various diagnostic methods. Among all the services, clinical evaluation emerges as the most cost-effective option on average.76,77
Approximate Cost Range for Different Medical Services.
Several recent studies have raised safety concerns by reporting cases of radiation-induced cancer.78–80 Though LDCT is inexpensive and widely accessible, it may not provide adequate images in obese patients, and heavier patients receive higher radiation doses. Our proposed diagnostic model does not rely on radiomic features or imaging-based data. Instead, it utilises routine clinical data, making it a significantly safer and more cost-effective alternative.
Limitations
However, given the limited scale of the dataset used in this research, we propose this study as a prototypical research cohort for the development of a clinical data-based lung cancer diagnostic system. In its current form, the proposed model is best suited as a pre-screening tool rather than a definitive lung cancer diagnostic solution. With training on larger and more diverse datasets, the model has the potential to achieve improved generalizability and more precise factor identification, at which point it could serve as a robust tool for lung cancer diagnosis.
Implications of Excluding Rare Medical Records from Training Data
The scope of applicability of the model trained without considering rare cases becomes limited, as it fails to recognise and handle the full spectrum of possible conditions, including rare but critical ones. Failing to account for rare cases can lead to inequitable healthcare services. Misdiagnosis or delayed diagnosis of rare conditions can lead to incorrect treatments, which can undermine patients’ health and increase costs. Also, over the years, medical practitioners may become overly reliant on the model, leading to a lack of awareness and knowledge about rare conditions. Hence, the models need to learn from properly handled rare medical records.
Table 8 highlights the clear advantage of using ADASYN augmentation for improving the diagnosis of rare cases. From a set of 35 actual rare case records, our approach achieved the highest number of correct diagnoses, demonstrating its effectiveness in addressing data imbalance in medical applications.
Comparison of Hit Percentage.
Implications of XAI
Explainability in neural network-based models for lung cancer diagnosis has significant implications for various stakeholders such as doctors, patients, researchers, and regulatory bodies. It increases the trust among doctors by providing insights into the decision-making process, which is crucial for adopting AI models in clinical practice. From the SHAP interpretations, we can observe that the significance and impact of each clinical symptom in deciding a particular class has a large variance. These variations and insights are hard to comprehend by human cognitive skills, despite having years of clinical experience. Therefore, the explainable models not only give a proper diagnosis but also serve as a robust quantitative explanation of how the diagnosis model labels a patient case. The clear explanations of the model also allow doctors and researchers to identify potential flaws or biases in the model, leading to the development of more accurate and reliable diagnostic tools. Understanding the model will guide iterative improvements, which can guarantee that the AI model will remain accurate and relevant as and when new data arrives. Patients will be more open to AI-based interventions without anxiety.
Financial Implications
The proposed XAI model for lung cancer diagnosis can be easily deployed into the clinical setup as it does not incur a huge capital investment in buying any hefty diagnostic equipment. The only additional cost associated with implementing this AI model in practice is to provide training for medical practitioners. This involves teaching them how to collect high-quality clinical data, run the model with proper interfaces and understand the model's decisions based on the interpretations.
Conclusion
Alternative techniques for radiation-based screening to diagnose lung cancer are in high demand due to the increased rate of late-stage diagnoses and associated exposure risks. Although several novel alternatives have been developed, their widespread adoption is hindered by high costs and limited access to necessary resources. 5 Leveraging clinical data presents a cost-effective and viable solution to this problem. Often, early clinical indications of lung cancer are overlooked or considered insignificant, contributing to delays in diagnosis. A key strength of the proposed approach lies in its seamless integration with existing clinical workflows, requiring no additional investment in equipment or diagnostic procedures. By utilising routinely collected clinical data, our tool provides a resource-efficient method for identifying patients at risk of lung cancer, significantly reducing operational costs.
This research presents an XAI model for the early diagnosis of lung cancer using clinical data. The model emphasises accurate diagnosis along with interpretability, aiming to build trust among healthcare practitioners and patients. One major challenge in building a diagnostic model from clinical data is handling missing values. Given the prevalence of non-Gaussian data distributions, the semiparametric MICE method was employed to impute missing values effectively. Two XAI models were developed. The ANN model uses 15 selected, clinically relevant features, while the DNN model incorporates 78 features to capture complex patterns in the data. ADASYN augmentation was applied to address the imbalance caused by rare medical records. The DNN model demonstrated superior performance in terms of accuracy, AUC, and recall. In contrast, the ANN model showed better precision and calibration when trained with ADASYN-augmented data.
Another significant barrier to the adoption of AI in clinical settings is the lack of interpretability. Healthcare professionals are often presented with black-box decisions, making them reluctant to trust or endorse such models. This paper addresses this issue using SHAP for global interpretation. Both models’ decisions are explained using SHAP, providing transparency into their diagnostic rationale. To capitalise on the strengths of both models, we developed a stacking ensemble that yielded diagnostic performance comparable to the DNN model trained on clean data, while improving robustness.
In a clinical workflow, patient observations can be input into the model to generate a diagnosis that is either positive or negative. These models can be continuously updated with new clinical data, enhancing their accuracy and effectiveness over time. Medical practitioners require minimal training to operate the model and can easily interpret its outputs. The combination of affordability, accessibility, and interpretability makes our XAI models a promising and safe alternative to traditional radiation-based screening for early lung cancer diagnosis. Additionally, the model must comply with relevant regulatory standards. 81 This includes obtaining regulatory approvals such as those required for Software as a Medical Device or under the European Union Medical Device Regulation, adhering to artificial intelligence-specific guidelines like the Food and Drug Administration's Good Machine Learning Practices and the International Organization for Standardization Technical Report 24028, as well as ensuring compliance with data privacy regulations such as the Health Insurance Portability and Accountability Act and the General Data Protection Regulation.
However, the current XAI models are trained on a relatively limited dataset, which constrains their reliability and generalizability as fully assertive diagnostic tools. At this stage, the model should be regarded as a prototype and a feasibility study. Our findings aim to support healthcare practitioners in adopting a cost-effective, interpretable pre-screening tool for identifying high-risk lung cancer patients.
Future Work Should Focus on Several key Areas:
Dataset expansion and diversity: To enhance model robustness and applicability, future research should involve training and validating the model on larger, multi-institutional datasets that include a more diverse patient population across different age groups, ethnicities, and geographic regions.
Improvement of data augmentation techniques: While ADASYN has proven effective in our research, its performance can be further optimised. Future work could explore hybrid augmentation strategies combining ADASYN with generative models to create more realistic synthetic samples and avoid overfitting.
Model calibration and reliability: Calibration of deep learning models remains a critical step to ensure that predicted probabilities reflect true risks. Techniques such as temperature scaling, isotonic regression, or Bayesian deep learning approaches could be considered to enhance the trustworthiness of predictions, especially in clinical settings.
Clinical validation and deployment: Real-world testing through prospective clinical trials is essential to validate the utility and impact of the model in actual healthcare environments. Integration with existing diagnostic workflows and evaluation of cost-effectiveness, decision impact, and clinician trust will be crucial steps toward clinical deployment.
By addressing these areas, future iterations of our model can move closer to deployment as a reliable, generalizable, and interpretable decision-support tool for early lung cancer detection.
Footnotes
Ethics Statement
This research is conducted by a collaborative team of technical and medical experts of the field in line with the ethical and data collection standards outlined in [ 45 ]. This research was approved by the institutional ethics committee of Government Medical College, Kozhikode, and adhered to their de-identification protocols. Reference no. GMCKKD/RP 2019/IEC/148.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
