Enhancing deep learning models for predicting smoking status using clinical data in patients with chronic obstructive pulmonary disease

Abstract

Objective

This study aimed to develop and evaluate deep learning models to improve the prediction of persistent smoking in patients with chronic obstructive pulmonary disease (COPD) by integrating behavioral and psychosocial variables with clinical data from a structured national dataset.

Methods

Three deep learning models and one machine learning model were developed and assessed using clinical, behavioral, and psychosocial data from 350 patients with COPD, including 51 current smokers. Data preprocessing involved imputation, variable transformation, and class weighting. Hyperparameter optimization was performed using the Optuna framework. Model performance was evaluated with repeated stratified K-fold cross-validation, and the macro F1 score was the primary metric. SHapley Additive exPlanations (SHAP) were applied to assess feature importance and improve interpretability.

Results

The Residual Neural Network achieved the highest performance, with a macro F1 score of .87 (95% confidence interval: .83–.89). SHAP analysis highlighted professional advice to quit, employment status, sputum symptoms lasting more than 3 months, perceived stress level, health check-up experience, and health literacy as key predictors of persistent smoking.

Conclusion

Incorporating behavioral and psychosocial data enabled the models to capture complex smoking patterns while maintaining interpretability. These findings emphasize the value of multidimensional data in identifying high-risk individuals and informing targeted smoking cessation strategies in COPD care. Future research should include synthesized behavioral variables often absent from large external datasets and validate model performance in more diverse populations.

Keywords

Classification, deep learning, machine learning, chronic obstructive pulmonary disease, smoking

Introduction

Recent advances in deep learning have demonstrated strong potential for analyzing medical data with high accuracy.¹ Building on this progress, deep learning-based predictive analytics are increasingly applied in nursing to improve workflow efficiency, enhance patient management and task allocation,² reduce nurse burnout, and improve quality of care.³ Beyond workflow, deep learning models can also support patients directly by assessing adherence to self-management, a critical factor in chronic disease outcomes.⁴ However, accurately predicting complex patient behaviors such as adherence remains a persistent challenge.⁵

Our research team recently developed deep learning models to predict smoking status—a key component of self-management—in patients with chronic obstructive pulmonary disease (COPD) using datasets such as the Korean National Health and Nutrition Examination Survey (KNHANES).⁶ KNHANES is designed to assess national health status and inform public health policy under the Ministry of Health and Welfare and the Korea Disease Control and Prevention Agency.⁷ Despite its strengths, KNHANES lacks behavioral variables that provide deeper insight into health and nutritional conditions. These include psychosocial, clinical, and cognitive factors such as self-efficacy, sputum characteristics, and health literacy in COPD patients. While clinical factors may result from smoking, worsening clinical conditions can also influence smoking status, as perceiving a condition as serious often increases motivation to quit.⁸ Combining clinical variables with behavioral and psychosocial data may therefore improve predictive accuracy by capturing both the consequences of smoking and the patient's ongoing health context.⁹ Our previous work also emphasized the need to include such variables to strengthen model performance.⁶ Since smoking cessation is closely tied to overall adherence to health-related behaviors, predicting persistent smoking after a COPD diagnosis offers valuable insights into patient self-management.¹⁰

Selecting appropriate predictive models is equally important for optimizing performance. In our previous study, the Residual Neural Network (ResNN) outperformed both traditional machine learning and other deep learning models.⁶ It is therefore necessary to compare its performance with other architectures designed to capture non-linear and high-dimensional interactions in tabular and healthcare datasets.¹¹ Including such models can ensure robust prediction of smoking behaviors among patients with COPD.¹²

Although missing data in KNHANES is minimal, its potential impact on model performance should not be overlooked.¹³ While some degree of missingness is inevitable in real-world datasets, it can reduce predictive accuracy and reliability.¹⁴ To address this limitation, a prospective survey targeting patients with a confirmed COPD diagnosis who are currently undergoing treatment is recommended.¹⁵ Such a strategy supports more comprehensive data collection, reduces bias from incomplete records, and improves both predictive performance and model generalizability.¹⁶

Accordingly, this study evaluates whether incorporating key clinical variables can improve the prediction of smoking status in patients with COPD before integrating these variables with KNHANES data in future research. The study explores the use of clinically sourced datasets containing detailed behavioral and psychosocial information to address the limitations of large-scale national survey data and enhance model sensitivity to individual patient contexts. Ultimately, the goal is to develop a robust and generalizable deep learning model to support decision-making in identifying patients with COPD who are vulnerable to poor self-management, thereby contributing to improved patient outcomes.

Method

Study design

This study employed a cross-sectional design using data from a clinical setting.

Participants

Participants were adults aged 40 years and older diagnosed with COPD who visited the outpatient clinic at C University Hospital in Gwangju, Korea, between 26 December 2023 and 28 May 2024. A total of 575 patients were screened for eligibility, of whom 350 consented to participate and were included in the study (Figure 1).

Figure 1.

Flow diagram of participants.

According to Riley et al. (2019),¹⁷ a minimum sample size corresponding to up to five events per variable is recommended when developing a clinical prediction model. Because the structured questionnaire used in this study included 31 categories of variables, the appropriate sample size ranged from 93 (3 events per variable) to 155 (5 events per variable). However, the relatively small number of current smokers increased the risk of overfitting. To mitigate this risk, dropout was applied to enhance model stability and generalizability.¹⁸ Among the four models used (eXtreme Gradient Boosting (XGBoost), ResNN, TabTransformer, and Feature Tokenizer (FT) Transformer), XGBoost inherently incorporates boosting and regularization through weight penalization, whereas the deep learning models required explicit dropout application.¹⁹

Data selection and collection

In our previous study,⁶ 21 variables were selected to develop a smoking prediction model for patients with COPD, based on factors related to COPD self-management.²⁰ Of these, 20 were retained in the present study. The remaining variable—interpretation of lung function based on forced expiratory volume in 1 s (FEV₁) or the FEV₁/forced vital capacity (FVC) ratio—was excluded, as all participants had already been clinically diagnosed with COPD.

In the current dataset, the “household composition” variable was replaced with “family/friend support for quitting.” To expand the dataset with additional behavioral factors, 23 new variables were added (Supplementary Table 1), resulting in a total of 43 variables across 31 categories. Among these, self-efficacy and health literacy were measured using validated instruments. Self-efficacy was assessed using a modified Korean version of the SCES-COPD (Self-Care Self-Efficacy Scale for COPD), validated by Choi and Yun.²¹ This version originally consisted of seven items rated on a 5-point Likert scale (total score 0–100) and was adapted for respondent convenience. COPD-specific health literacy was measured with a 66-point instrument developed by Kim and Choi.²² Multicollinearity among variables was assessed, with correlations ranging from .01 to .67.

The dependent variable was smoking status. Following the methodology of our previous study based on a study using KNHANES data,²³ daily and occasional smokers were grouped as “smokers,” and ex-smokers and nonsmokers were classified as “nonsmokers.” For model development, smoking status was coded as a binary outcome: 1 = smoker, 0 = nonsmoker.
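The grouping described above can be sketched as a simple mapping. The category labels below are hypothetical strings chosen for illustration, since the questionnaire's actual response codes are not given in the text.

```python
# Hypothetical response labels; the study's actual questionnaire codes are not stated.
SMOKER_CATEGORIES = {"daily smoker", "occasional smoker"}
NONSMOKER_CATEGORIES = {"ex-smoker", "nonsmoker"}


def encode_smoking_status(response: str) -> int:
    """Return 1 for smokers and 0 for nonsmokers, following the grouping in the text."""
    label = response.strip().lower()
    if label in SMOKER_CATEGORIES:
        return 1
    if label in NONSMOKER_CATEGORIES:
        return 0
    # Fail loudly rather than silently assigning an unknown answer to a class.
    raise ValueError(f"Unrecognized smoking status: {response!r}")


responses = ["Daily smoker", "ex-smoker", "Occasional smoker", "nonsmoker"]
encoded = [encode_smoking_status(r) for r in responses]
```

Raising an error for unrecognized answers is a design choice of this sketch; it keeps missing or mistyped responses from being counted as nonsmokers.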

To ensure data reliability, two research assistants underwent standardized training before data collection. Training covered study procedures, ethical considerations, and question-and-answer techniques. For instance, when assessing sputum characteristics, participants providing multiple responses were instructed to report their most recent condition. Data were collected in a quiet room using a structured questionnaire administered by the trained assistants. Completed surveys were reviewed by a researcher for completeness. Clinical indices such as the most recent FEV₁ and FEV₁/FVC values were extracted from electronic medical records by the same assistants.

FEV₁/FVC data were unavailable for 29 patients (8.3%) who had been diagnosed at other hospitals and had not undergone repeat testing at the study clinic.

Data preprocessing

To optimize the performance and reliability of the smoking prediction models, preprocessing included missing value imputation, normalization, categorical encoding, class imbalance handling, and splitting of the training and test datasets (Supplementary Figure 1).

A comparative analysis of imputation methods was conducted to address missing values in FEV₁ and FEV₁/FVC. The methods tested were simple imputation (mean/median replacement), Multiple Imputation by Chained Equations (MICE), Iterative Imputer, K-Nearest Neighbors (KNN) Imputer, and MissForest. These were evaluated using quantitative metrics (mean and standard deviation comparisons). For normalizing continuous variables, both Min–Max normalization and standardization (z-score transformation) were tested. For encoding categorical variables, both Label Encoding and One-Hot Encoding were evaluated. However, One-Hot Encoding could not be applied due to internal algorithm constraints in the TabTransformer and FT Transformer.
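The two normalization options and label encoding compared above can be illustrated in a few lines of plain Python. This is a minimal sketch, not the study's pipeline (which presumably relied on library transformers); the FEV₁ values and sputum categories are made up for illustration.

```python
from statistics import mean, pstdev


def standardize(values):
    """z-score transformation: subtract the mean, divide by the population SD."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def min_max(values):
    """Min-Max normalization: rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def label_encode(categories):
    """Map each distinct category to an integer code, assigned in sorted order."""
    mapping = {c: i for i, c in enumerate(sorted(set(categories)))}
    return [mapping[c] for c in categories], mapping


fev1 = [1.2, 1.8, 2.4, 3.0]                    # made-up continuous values
sputum = ["none", "white", "yellow", "white"]  # hypothetical categories

fev1_z = standardize(fev1)    # mean 0, SD 1
fev1_mm = min_max(fev1)       # smallest value -> 0.0, largest -> 1.0
sputum_codes, mapping = label_encode(sputum)
```

Label encoding yields one integer column per categorical variable, whereas One-Hot Encoding would expand each variable into several binary columns.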

Class imbalance was addressed by assigning higher weights to the minority class, consistent with a previous study.²⁴ Other methods, including Synthetic Minority Oversampling Technique, random undersampling, and deep learning-based oversampling, did not result in noticeable improvements compared with class weighting (Supplementary Table 2). Specifically, the positive class weight was defined as the ratio of nonsmoker to smoker labels and applied to the loss function using the pos_weight parameter in BCEWithLogitsLoss. This increased the penalty for misclassifying smokers, thereby improving the influence of the minority class and balancing the learning process. The dataset was randomly split into training and test sets in an 80:20 ratio using the train_test_split function from scikit-learn.²⁵
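To make the weighting concrete, the sketch below computes the nonsmoker-to-smoker ratio and applies it in a hand-written per-sample loss that follows the documented form of binary cross-entropy with logits and a positive-class weight. It is a plain-Python illustration, not the PyTorch implementation. It uses the full-cohort counts reported in the text (350 patients, 51 smokers) as an assumption; the study most likely computed the ratio from the training labels only.

```python
import math

n_total, n_smokers = 350, 51           # cohort counts reported in the text
n_nonsmokers = n_total - n_smokers     # 299
pos_weight = n_nonsmokers / n_smokers  # about 5.86 (illustrative; see lead-in)


def weighted_bce_with_logits(logit: float, target: int, pos_weight: float) -> float:
    """Per-sample loss: -[w * y * log(sigmoid(x)) + (1 - y) * log(1 - sigmoid(x))]."""
    log_sig = -math.log1p(math.exp(-logit))           # log(sigmoid(x))
    log_one_minus_sig = -math.log1p(math.exp(logit))  # log(1 - sigmoid(x))
    return -(pos_weight * target * log_sig + (1 - target) * log_one_minus_sig)


# Two equally confident mistakes: a smoker scored at -2 and a nonsmoker scored at +2.
missed_smoker = weighted_bce_with_logits(-2.0, 1, pos_weight)
missed_nonsmoker = weighted_bce_with_logits(2.0, 0, pos_weight)
```

For these two symmetric errors, the penalty for the missed smoker is exactly `pos_weight` times the penalty for the missed nonsmoker, which is how the weighting shifts the learning signal toward the minority class. The sketch is not numerically hardened for very large logits.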

Model selection and development

Four models were evaluated to predict smoking status in patients with COPD: three deep learning models (ResNN, TabTransformer, and FT Transformer) and one machine learning model (XGBoost). These models were selected for their effectiveness in handling structured clinical data, including high-dimensional features, missing values, and class imbalance.

ResNN was prioritized based on its strong predictive performance in a previous study.⁶ To broaden the range of approaches, TabTransformer and FT Transformer were also included, as both have shown success in capturing complex feature interactions in tabular datasets.²⁶ Transformer-based models are particularly effective for modeling contextual relationships between input variables,²⁷ and they perform well in scenarios with missing data, skewed distributions, and non-linear dependencies.²⁸ These architectures have also been applied successfully in healthcare prediction tasks such as modeling disease progression, estimating treatment response, and forecasting hospital readmissions.²⁹ XGBoost was included for its scalability and strong generalization capability,³⁰ as boosting algorithms were not evaluated in the previous study. It is also well suited for structured medical data with missing values and class imbalance.³¹

Hyperparameter tuning was performed using the Optuna framework,³² with Bayesian optimization across 100 iterations to identify the optimal combination of hyperparameters for each model.

Model validation and evaluation

Model performance was evaluated using fivefold cross-validation. Specifically, five validation runs with Repeated Stratified K-Fold were conducted to preserve the proportion of smoker cases, given their limited number, and to identify the model with the highest performance and the narrowest 95% confidence interval (CI). The macro F1 score was used as the primary evaluation metric. This score calculates the unweighted mean of F1 scores across all classes, treating each class equally regardless of frequency. Because the outcome variable was imbalanced, the macro F1 score was considered the most appropriate metric.³³

$$\text{Precision} = \frac{TP}{TP + FP}$$

$$\text{Recall} = \frac{TP}{TP + FN}$$

$$\text{F1-score} = 2 \times \frac{\text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

$$\text{Macro F1-score} = \frac{1}{N} \sum_{i=1}^{N} \text{F1-score}_{i}$$
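As a worked illustration of these definitions, the following plain-Python sketch computes the per-class F1 scores and their unweighted mean on a toy label vector. The labels are invented, not study data, and TP, FP, and FN are read here as the true positives, false positives, and false negatives for whichever class is treated as positive.

```python
def f1_for_class(y_true, y_pred, positive):
    """F1 score when `positive` is treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Toy labels (1 = smoker, 0 = nonsmoker); not study data.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

f1_smoker = f1_for_class(y_true, y_pred, 1)     # 0.5
f1_nonsmoker = f1_for_class(y_true, y_pred, 0)  # 0.875
score = macro_f1(y_true, y_pred)                # 0.6875
```

In this toy case accuracy is 0.80, yet the macro F1 score is 0.6875 because the weak minority-class F1 of 0.5 counts as much as the majority-class F1 of 0.875.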

Interpretable artificial intelligence

This study applied SHapley Additive exPlanations (SHAP) to interpret the output of the top-performing models and improve transparency.³⁴ SHAP assigns an importance value to each feature by estimating its marginal contribution to the model's prediction while accounting for feature interactions and dependencies. In SHAP summary plots, features are listed on the y-axis in order of importance. The x-axis shows each feature's SHAP value, whose sign and magnitude give the direction and size of its contribution to the prediction, while color encodes the feature's own value: red for higher values and blue for lower values. For example, a cluster of red points on the right side of the plot indicates that higher feature values contribute to predicting smokers.
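The idea of a marginal contribution can be made concrete with an exact Shapley computation on a tiny model. The sketch below is plain Python, not the SHAP library and not the explainer used in the study. The three-feature scoring function, its coefficients, and the baseline patient are hypothetical.

```python
from itertools import combinations
from math import factorial


def shapley_values(predict, x, baseline):
    """Exact Shapley values; features outside a coalition take their baseline value."""
    n = len(x)

    def value(coalition):
        mixed = [x[i] if i in coalition else baseline[i] for i in range(n)]
        return predict(mixed)

    phi = []
    for i in range(n):
        others = [j for j in range(n) if j != i]
        total = 0.0
        for size in range(n):
            # Shapley weight for coalitions of this size: |S|! (n - |S| - 1)! / n!
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = set(subset)
                total += weight * (value(s | {i}) - value(s))
        phi.append(total)
    return phi


# Hypothetical toy score with one interaction term; not the study's ResNN.
def toy_model(features):
    advice, stress, employed = features
    return 0.2 + 0.3 * stress - 0.4 * advice + 0.1 * stress * (1 - employed)


x = [0, 1, 0]         # no cessation advice, high stress, unemployed
baseline = [1, 0, 1]  # reference patient: advice received, low stress, employed
phi = shapley_values(toy_model, x, baseline)
```

The contributions sum to the difference between the prediction for `x` and the prediction for the baseline, and the interaction term is shared equally between stress and employment. Exact enumeration grows exponentially with the number of features, so it is practical only for toy cases like this one.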

Ethical considerations

This study was approved by the Institutional Review Board of C National University Hospital.

Results

Participant characteristics

Table 1 presents the characteristics of the participants. The mean age was 73.3 ± 9.4 years, and 267 (76.3%) were male. Fifty-five participants (15.7%) had a college degree or higher. Most were married (97.7%), and 27 (7.7%) lived with grandchildren. Ninety-two (26.3%) had an occupation, and 100 (28.6%) were engaged in economic activity.

Table 1.

Characteristics of participants (N = 350).

Variable	Total, n (%)	Smoker, n (%)	Nonsmoker, n (%)	χ², t, or ANOVA	p
Age	73.3 ± 9.4	69.9 ± 8.1	73.8 ± 9.5	−2.79	.006
Sex (male)	267 (76.3)	49 (96.1)	218 (72.9)	12.93	<.001
Education level (college degree or higher)	55 (15.7)	9 (17.6)	46 (15.4)	1.58	.664
Marital status (yes)	342 (97.7)	50 (98.0)	292 (97.7)	0.172	.918
Living with grandchildren (yes)	27 (7.7)	2 (3.9)	25 (8.4)	7.01	.030
Occupation (yes)	92 (26.3)	19 (37.3)	73 (24.4)	3.83	.147
Economic engagement activity (yes)	100 (28.6)	20 (39.2)	80 (26.8)	9.62	.022
FEV₁	1.80 ± 0.68	2.03 ± 0.74	1.76 ± 0.66	2.53	.012
FEV₁/FVC	0.63 ± 0.15	0.60 ± 0.16	0.64 ± 0.15	−1.69	.092
COPD duration	9.49 ± 10.56	7.00 ± 8.90	9.91 ± 10.77	−1.83	.068
COPD-related hospitalization (past year, yes)	163 (46.6)	17 (33.3)	146 (48.8)	4.56	.102
Physician-diagnosed hypertension (yes)	143 (40.9)	15 (29.4)	128 (42.8)	3.24	.072
Diabetes prevalence (yes)	76 (21.7)	12 (23.5)	64 (21.4)	0.12	.734
Physician-diagnosed lung cancer (yes)	5 (1.4)	1 (2.0)	4 (1.3)	0.12	.729
Hypertension treatment (yes)	137 (39.1)	15 (29.4)	122 (40.8)	2.66	.265
Antihypertensive medication use (yes)	136 (38.9)	15 (29.4)	121 (40.5)	3.52	.621
Diabetes treatment (yes)	70 (20.0)	12 (23.5)	58 (19.4)	0.45	.503
Diabetes medication use (yes)	67 (19.1)	12 (23.5)	55 (18.4)	0.72	.395
Cough lasting ≥3 months (yes)	166 (47.4)	17 (33.3)	149 (49.8)	4.76	.029
Sputum ≥3 months (yes)	183 (52.3)	19 (37.3)	164 (54.8)	5.41	.020
Sputum characteristics (yellow)	43 (23.2)	2 (3.9)	41 (13.7)	21.46	.003
Severe smoking withdrawal symptoms (yes)	51 (14.6)	19 (37.3)	32 (10.7)	24.82	<.001
Quality of life	0.73 ± 0.30	0.76 ± 0.26	0.73 ± 0.31	0.63	.532
Self-efficacy	6.30 ± 1.16	6.18 ± 1.38	6.32 ± 1.12	−0.84	.401
Health literacy	41.20 ± 9.66	41.47 ± 9.38	41.15 ± 9.72	0.22	.829
Family/friend support for quitting (yes)	179 (51.1)	41 (80.4)	138 (46.2)	20.47	<.001
Professional advice to quit (yes)	88 (25.1)	37 (72.5)	51 (17.1)	97.80	<.001
Perceived stress (little/no stress)	167 (47.7)	18 (35.3)	149 (49.8)	4.14	.387
Depressive symptoms >2 weeks (no)	278 (79.4)	40 (78.4)	238 (79.6)	2.03	.363
Perceived health status (normal)	136 (38.9)	22 (43.1)	114 (38.1)	1.55	.908
Smoking amount	17.88 ± 16.72	17.75 ± 11.14	0.65 ± 2.28	−2.91	.068
Smoking duration	24.29 ± 18.14	40.65 ± 12.67	20.60 ± 17.14	−1.83	.068
Alcohol consumption (bottles/week)	0.72 ± 2.24	1.16 ± 1.95	0.65 ± 2.28	0.22	.829
Average sleep time (hours)	6.45 ± 1.80	7.19 ± 2.05	6.32 ± 1.72	−1.83	.401
Walking days per week (no)	50 (14.3)	8 (15.7)	42 (14.0)	4.64	.704
Strength training days per week (no)	244 (69.7)	38 (74.5)	206 (68.9)	3.95	.786
Activity limitation (no)	92 (26.3)	18 (35.3)	74 (24.7)	2.46	.117
Health check-up (yes)	279 (79.7)	40 (78.4)	239 (79.9)	0.29	.867
Body weight control (no, 1 year)	247 (70.6)	37 (72.5)	210 (70.2)	1.16	.885
Body weight changed (no, 1 year)	241 (68.9)	35 (68.6)	206 (68.9)	1.63	.898
Change in food intake (no change)	273 (78.0)	37 (72.5)	236 (78.9)	2.39	.302
Influenza vaccination (yes)	310 (88.6)	42 (82.4)	268 (89.6)	2.28	.131

Abbreviations: FEV₁: forced expiratory volume in 1 s; FVC: forced vital capacity; QoL: quality of life; EQ-5D: EuroQol-5 Dimension; COPD: chronic obstructive pulmonary disease.

The mean FEV₁ was 1.80 ± 0.68, and the mean FEV₁/FVC ratio was 0.63 ± 0.15. The average duration of COPD was 9.49 ± 10.56 years, and 163 (46.6%) reported COPD-related hospitalization in the past year. A total of 143 (40.9%) had physician-diagnosed hypertension, 76 (21.7%) had diabetes, and 5 (1.4%) had lung cancer. Hypertension treatment was reported by 137 (39.1%), and 136 (38.9%) used antihypertensive medication. Diabetes treatment was reported by 70 (20.0%), and 67 (19.1%) used diabetes medication. Cough lasting ≥3 months was reported by 166 (47.4%), and 183 (52.3%) reported sputum for ≥3 months. Yellow sputum was reported by 43 (23.2%), and 51 (14.6%) experienced severe smoking withdrawal symptoms.

The mean quality of life score was 0.73 ± 0.30; self-efficacy was 6.30 ± 1.16; and health literacy was 41.20 ± 9.66. Support from family or friends to quit smoking was reported by 179 (51.1%), while 88 (25.1%) received professional advice to quit. A total of 167 (47.7%) reported little to no stress, and 278 (79.4%) reported no depressive symptoms lasting >2 consecutive weeks. Normal perceived health status was reported by 136 (38.9%).

The mean smoking amount was 17.88 ± 16.72 cigarettes per day, and the mean smoking duration was 24.29 ± 18.14 years. Alcohol consumption averaged 0.72 ± 2.24 bottles per week, and mean sleep duration was 6.45 ± 1.80 h per day. Fifty participants (14.3%) reported no walking days per week, and 244 (69.7%) reported no strength training. Activity limitation was reported by 92 (26.3%). Most participants (279, 79.7%) had received a health check-up. A total of 247 (70.6%) did not attempt body weight control in the past year, and 241 (68.9%) reported no change in body weight. Similarly, 273 (78.0%) reported no change in food intake. Influenza vaccination in the past year was reported by 310 (88.6%).

Data preprocessing

For missing value imputation, the MICE method was selected due to its strong theoretical foundation and superior performance in quantitative comparisons (Supplementary Table 3). Although MissForest demonstrated robust performance, it was excluded because of its high computational cost and limited clinical interpretability. The KNN Imputer generated the most realistic data distribution but showed lower predictive performance. Similarly, the Iterative Imputer exhibited stable distributional characteristics but was not selected because of its experimental nature and limited empirical support.

For variable transformation, the combination of standard normalization for continuous variables and label encoding for categorical variables yielded the most consistent results across all models. Under this configuration, the FT Transformer achieved the highest performance (macro F1 score .81, 95% CI .73–.88), followed by ResNN (.80, 95% CI .68–.91) and TabTransformer (.76, 95% CI .70–.82). XGBoost performed slightly lower, with a macro F1 score of .75 (95% CI .69–.82). Although One-Hot Encoding produced comparable results to label encoding in XGBoost, it was not applicable to transformer-based architectures. Therefore, standard normalization and label encoding were adopted as the final preprocessing methods, balancing compatibility and predictive performance across model types (Supplementary Table 4).

Model development

Table 2 presents the macro F1 scores of each model before and after hyperparameter tuning using fivefold cross-validation. ResNN showed the most notable improvement, increasing from .80 ± .09 to .85 ± .03. XGBoost, TabTransformer, and FT Transformer also improved slightly, but the gains were relatively modest.

Table 2.

Parameter settings after tuning by model.

Model	Parameters	Macro F1 score before tuning	Macro F1 score after tuning
XGBoost	n_estimators = 124; max_depth = 7; learning_rate = 0.003; gamma = 1.64; alpha = 4.85; lambda = 2.51; min_child_weight = 0.38; max_delta_step = 38; subsample = 0.15; sampling_method = “uniform”; tree_method = “approx”; grow_policy = “lossguide”	.75 ± .05	.76 ± .05
ResNN	hidden_dim = 696	.80 ± .09	.85 ± .03
TabTransformer	dim = 36; depth = 4; heads = 5; dropout = 0.39; mlp_hidden_units = (4, 2); mlp_act = LeakyReLU	.76 ± .05	.77 ± .06
FT Transformer	dim = 89; depth = 3; heads = 8; dropout = 0.4	.81 ± .06	.83 ± .07

ResNN: Residual Neural Network; FT: Feature Tokenizer; XGBoost: eXtreme Gradient Boosting.

Model validation and evaluation

Table 3 presents macro F1 scores with 95% CIs across two validation strategies. Overall, Repeated Stratified K-Fold produced slightly higher or comparable scores compared with baseline fivefold validation. Both ResNN (.87 ± .05, CI .83–.89) and FT Transformer (.87 ± .06, CI .80–.92) achieved strong performance under Repeated Stratified K-Fold. These findings suggest that both validation approaches are robust, with only minor variations depending on the model architecture. Aggregated confusion matrices from fivefold validation further illustrate classification performance (Supplementary Figure 2).

Table 3.

Comparison of macro F1 scores (95% CI) across validation methods and models.

Model	Baseline fivefold, macro F1 score (95% CI)	Repeated stratified K-fold, macro F1 score (95% CI)
XGBoost	.76 ± .05 (.69–.83)	.80 ± .06 (.77–.83)
ResNN	.85 ± .03 (.81–.89)	.87 ± .05 (.83–.89)
TabTransformer	.77 ± .06 (.69–.85)	.78 ± .05 (.74–.81)
FT Transformer	.83 ± .07 (.69–.94)	.87 ± .06 (.80–.92)

ResNN: Residual Neural Network; FT: Feature Tokenizer; XGBoost: eXtreme Gradient Boosting; CI: confidence interval.

Feature importance

Figure 2(a) shows the SHAP plot for ResNN, which highlighted psychosocial and clinical predictors associated with persistent smoking. The most influential feature was professional advice to quit, followed by employment status, sputum symptoms lasting ≥3 months, health check-up experience, and perceived stress level. Patients who had not received cessation advice, were unemployed, had chronic sputum symptoms, or reported high stress levels were more likely to be classified as smokers. Conversely, those who had undergone regular health check-ups were more often predicted as nonsmokers.

Figure 2.

SHAP summary plots of feature importance. (a) ResNN model. (b) FT Transformer model. Abbreviations: AMU: antihypertensive medication use; AST: average sleep time; BWCY: body weight control for a year; BWChY: body weight change over a year; CL3M: cough lasting ≥3 months; COPD: chronic obstructive pulmonary disease; CRH: COPD-related hospitalization; CUFI: change in usual food intake; DMU: diabetes medication use; DS2W: depressive symptoms lasting >2 weeks; EEA: economic engagement activity; FEV₁: forced expiratory volume in 1 s; FVC: forced vital capacity; FFSQ: family/friend support for quitting; LWG: living with grandchildren; PAQ: professional advice to quit; PDH: physician-diagnosed hypertension; PDLC: physician-diagnosed lung cancer; PHS: perceived health status; PSL: perceived stress level; QoL: quality of life; SP3M: sputum for ≥3 months; STDW: strength training days per week; SWS-E: severe withdrawal symptoms (experienced); WDW: walking days per week; SHAP: SHapley Additive exPlanations; ResNN: Residual Neural Network; FT: Feature Tokenizer.

Figure 2(b) shows the SHAP plot for the FT Transformer. Professional advice to quit smoking remained the most impactful feature, followed by smoking duration, family or friend support, hypertension treatment, and COPD-related hospitalization. Compared with ResNN, the FT Transformer tended to classify patients who had received cessation support or social encouragement as smokers. Based on previous studies,³⁵⁻³⁷ ResNN was considered explainable in terms of the causal direction of predictors for persistent smoking, whereas the FT Transformer appeared to overfit the data.

Discussion

This study developed an enhanced deep learning model to predict smoking status in patients with COPD by integrating behavioral and psychosocial variables into a clinical dataset. By expanding input features to include healthcare provider support, COPD-specific symptoms, and health literacy, the model better captured patterns of non-adherence to self-management associated with persistent smoking. In particular, professional advice to quit, sputum production lasting ≥3 months, and health literacy were newly incorporated and identified as influential predictors. SHAP analysis of the top-performing model consistently confirmed the importance of these factors in classifying smoking status. Although biochemical markers such as salivary cotinine provide the most accurate assessment, smoking status in this study was self-reported through a structured questionnaire. Self-reported smoking status shows high agreement with biomarkers such as cotinine levels, captures patients’ perception and disclosure of their smoking habits, and is more feasible in terms of cost and routine data collection.³⁸ Nevertheless, the potential for misclassification inherent in self-reported measures must be acknowledged.

One of the key findings was that ResNN achieved the highest macro F1 score with the narrowest 95% CI, consistent with our previous study,⁶ indicating stable and superior performance in predicting persistent smoking in COPD. This strong performance may be attributed to the residual learning structure, which captures essential non-linear interactions among clinically meaningful variables while minimizing overfitting in relatively small and imbalanced datasets.³⁹ In addition, the present study implemented refined model architectures and context-rich clinical and behavioral data, which enabled more effective learning of complex relationships between predictors. This improved predictive performance and allowed for more clinically meaningful risk stratification.

ResNN's performance and SHAP analysis also indicated that it was more suitable for predicting persistent smoking than the FT Transformer. One of the most influential predictors was whether patients received professional advice to quit smoking. Patients were more likely to continue smoking if they had not received cessation advice, were unemployed, had chronic sputum symptoms, or reported high stress levels. In contrast, the FT Transformer produced contradictory directional interpretations, placing greater predictive weight on features such as family and friend support for cessation, and unexpectedly predicting patients with support as smokers. This contradicts clinical evidence. For example, Cheung et al. (2021) showed that even brief physician advice significantly increased quit rates compared with no advice.⁴⁰ The discrepancy suggests that ResNN captured more clinically reasonable associations, whereas the FT Transformer may have overfit complex but less interpretable patterns. The unexpected prediction patterns in the FT Transformer likely reflect its higher model complexity, which can lead to overfitting in limited datasets despite regularization and repeated stratified K-fold validation.⁴¹ In contrast, ResNN, being structurally simpler, produced more stable and generalizable results. Accordingly, ResNN demonstrated both high predictive performance and clinically meaningful interpretability. Because of this balance, ResNN appears particularly well suited for clinical applications where transparency and efficiency are essential.⁴² However, achieving an optimal balance between model complexity and cost-effectiveness remains challenging, as increasing the number of input variables may limit practical applicability.⁴³

To address potential instability in performance evaluation due to the small and imbalanced dataset, we applied Repeated Stratified K-Fold cross-validation.⁴⁴ This method preserves class balance across training and validation sets and reduces variability from random sampling, yielding more reliable performance estimates.⁴⁵ By repeating the stratified K-Fold procedure multiple times, this approach improves consistency in evaluation outcomes.⁴⁶ However, it also increases computational time due to multiple model iterations. While this method enhances evaluation stability, its computational cost should be carefully considered in future applications involving larger datasets or more complex architectures.
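The splitting scheme described above can be sketched in plain Python. This is an illustrative stand-in for a library routine, not the study's code. It assumes five folds and five repeats, reading "five validation runs" in the Methods as five repeats, and it builds cohort-sized labels from the reported counts of 51 smokers and 299 nonsmokers.

```python
import random


def repeated_stratified_kfold(labels, n_splits=5, n_repeats=5, seed=0):
    """Yield (repeat, fold, train_idx, test_idx), keeping class proportions in each fold."""
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        for cls in sorted(set(labels)):
            idx = [i for i, y in enumerate(labels) if y == cls]
            rng.shuffle(idx)  # a fresh shuffle per repeat gives different partitions
            # Deal each class round-robin so every fold gets a near-equal share.
            for pos, i in enumerate(idx):
                folds[pos % n_splits].append(i)
        for fold, test_idx in enumerate(folds):
            test_set = set(test_idx)
            train_idx = [i for i in range(len(labels)) if i not in test_set]
            yield repeat, fold, train_idx, sorted(test_idx)


# Cohort-sized toy labels: 51 smokers (1) and 299 nonsmokers (0).
labels = [1] * 51 + [0] * 299
splits = list(repeated_stratified_kfold(labels))
smokers_per_fold = [sum(labels[i] for i in test) for _, _, _, test in splits]
```

With these counts every validation fold holds 10 or 11 smokers, so no fold is left without minority cases. The 25 resulting fits per model configuration are the source of the added computational cost.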

Our SHAP analysis further highlighted clinically interpretable predictors of persistent smoking, such as lack of professional cessation advice, chronic sputum symptoms, unemployment, and high stress levels. These findings are biologically and behaviorally plausible.⁴⁷ Patients who do not receive cessation advice from healthcare professionals may underestimate the health risks of smoking, reducing their likelihood of quitting.⁴⁸ Patients with chronic sputum symptoms lasting ≥3 months were more likely to be classified as smokers, consistent with research linking these symptoms to tobacco-induced airway inflammation.⁴⁹ Misinterpreting such symptoms may reduce motivation to quit, underscoring the need for education that links respiratory symptoms to smoking behavior.⁵⁰ Psychosocial stress has also been consistently associated with smoking relapse.⁵¹ Patients reporting frequent or intense stress were more likely to be smokers, reinforcing the role of psychological distress in maintaining tobacco use.⁵² Stress may serve as a barrier to cessation, highlighting the value of integrating stress management into quit strategies for COPD patients.

Another important predictor identified by the model was occupation, with unemployed patients more likely to be classified as persistent smokers. This aligns with research showing that unemployment is associated with higher smoking prevalence, fewer quit attempts, and increased relapse risk.⁵³,⁵⁴ Unemployed patients with COPD may face greater psychosocial stress, reduced access to health information, and limited engagement with preventive healthcare, all of which contribute to sustained smoking behavior. Interestingly, patients who had undergone a recent health check-up were also more likely to be smokers in this study. This may reflect a false sense of reassurance from normal findings, reducing motivation to quit despite continued risk.⁵⁵ These findings underscore the need to integrate smoking cessation interventions into routine clinical care.³⁵ Taken together, the results support comprehensive, personalized cessation strategies targeting modifiable risk factors to improve smoking outcomes and support long-term disease management in COPD.

Limitations

Despite promising results, this study has several limitations. First, although repeated stratified K-fold cross-validation improved the stability of performance estimates, it also imposed a moderate to high computational burden because every model must be refitted once per fold in every repeat. This method is commonly used in small datasets to reduce sampling variability and improve robustness, but it may be less practical for large-scale applications or real-time clinical settings where computational efficiency is critical. Second, the relatively small sample size may limit the generalizability of the findings. Although stratified cross-validation and dropout were applied to mitigate this issue, external validation with independent datasets will be needed to more robustly assess generalizability. Third, while behavioral and psychosocial variables significantly enhanced model performance, the associations identified for these factors should not be interpreted as causal; they are context-dependent and may vary across populations, healthcare systems, and cultural settings. Caution is therefore warranted when applying this model in different contexts.
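The computational cost noted in the first limitation can be illustrated with a small sketch of repeated stratified K-fold splitting, written here in plain Python. The class counts (51 current smokers among 350 patients) come from the study, but `n_splits=5` and `n_repeats=3` are assumed values chosen for illustration, not the study's reported settings, and the splitter is a simplified stand-in for a library implementation.

```python
import random
from collections import defaultdict


def repeated_stratified_kfold(labels, n_splits=5, n_repeats=3, seed=0):
    """Yield (train_idx, test_idx) pairs with class proportions kept per fold."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)

    for _ in range(n_repeats):
        folds = [[] for _ in range(n_splits)]
        for indices in by_class.values():
            shuffled = indices[:]
            rng.shuffle(shuffled)
            # Deal each class round-robin so every fold gets its share.
            for pos, idx in enumerate(shuffled):
                folds[pos % n_splits].append(idx)
        for k in range(n_splits):
            test_idx = sorted(folds[k])
            train_idx = sorted(
                i for j, fold in enumerate(folds) if j != k for i in fold
            )
            yield train_idx, test_idx


# Class counts from the study: 350 patients, 51 current smokers.
labels = [1] * 51 + [0] * 299
# n_splits and n_repeats below are illustrative assumptions.
splits = list(repeated_stratified_kfold(labels, n_splits=5, n_repeats=3))
n_fits = len(splits)  # one model fit per (repeat, fold) pair
smokers_per_test_fold = [sum(labels[i] for i in test) for _, test in splits]
```

Under these assumed settings each candidate model is fitted `n_splits * n_repeats` times, and every test fold holds only about ten smokers, which is why repetition helps stabilize the estimate. If hyperparameter search scores each trial with the same scheme, the number of fits grows in proportion to the number of trials.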

Clinical usability

This study highlights the potential clinical utility of deep learning models in identifying and managing smoking behavior among patients with COPD. By integrating behavioral, psychosocial, and routinely collected clinical data, the ResNN model provided accurate and interpretable predictions. Its reliance on commonly available variables and SHAP-based interpretability supports integration into clinical workflows and electronic health record systems, enabling timely and personalized smoking cessation interventions for high-risk patients.

Conclusions

This study demonstrated the effectiveness of an enhanced deep learning model in predicting smoking status among patients with COPD by integrating behavioral, psychosocial, and clinical variables. The inclusion of diverse features—particularly professional advice to quit, sputum symptoms, and health literacy—contributed to improved model performance and clinical relevance. Among the evaluated models, ResNN showed the most consistent and superior results, achieving the highest average macro F1 score. These findings provide empirical support for applying advanced deep learning models in predictive tasks involving clinical populations.
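As a brief illustration of why the macro F1 score is a suitable primary metric for this class distribution, the sketch below compares accuracy and macro F1 for a trivial classifier that labels every patient a non-smoker. Only the class counts (51 smokers among 350 patients) are taken from the study; the trivial classifier is a hypothetical baseline, and the helper functions are a plain-Python sketch rather than the study's evaluation code.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score treating `cls` as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)
    fp = sum(1 for t, p in pairs if t != cls and p == cls)
    fn = sum(1 for t, p in pairs if t == cls and p != cls)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_f1(y_true, y_pred):
    """Unweighted mean of the per-class F1 scores."""
    classes = sorted(set(y_true))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Class counts from the study: 51 current smokers (1), 299 others (0).
y_true = [1] * 51 + [0] * 299
# Hypothetical baseline that never predicts "smoker".
always_nonsmoker = [0] * 350

acc = accuracy(y_true, always_nonsmoker)
mf1 = macro_f1(y_true, always_nonsmoker)
```

The baseline reaches an accuracy of 299/350 (about .85) while its macro F1 stays below .50, because the F1 score for the smoker class is zero. Macro F1 therefore rewards models only when they also identify the minority class.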

By incorporating behavioral and psychosocial data into a clinically feasible model, this research helps bridge the gap between artificial intelligence development and real-world healthcare applications. The performance and interpretability of the ResNN model represent a promising step toward integrating predictive analytics into personalized COPD care. These results underscore the importance of leveraging multidimensional data and selecting architectures suited to structured clinical datasets. The model's reliance on routinely collected variables also supports its feasibility for outpatient implementation. Future research should focus on external validation in larger and more diverse populations, real-world deployment, and prospective evaluation of clinical outcomes.

Supplemental Material

Supplemental material, sj-docx-1-dhj-10.1177_20552076251393380 for Enhancing deep learning models for predicting smoking Status using clinical data in patients with chronic obstructive pulmonary disease by Sehyun Cho, Hyeonseok Jin, Kyungbaek Kim, Sola Cho and Ja Yun Choi in DIGITAL HEALTH

Supplemental material, sj-docx-2-dhj-10.1177_20552076251393380 for Enhancing deep learning models for predicting smoking Status using clinical data in patients with chronic obstructive pulmonary disease by Sehyun Cho, Hyeonseok Jin, Kyungbaek Kim, Sola Cho and Ja Yun Choi in DIGITAL HEALTH

Footnotes

Acknowledgments

None.

ORCID iDs

Hyeonseok Jin

Kyungbaek Kim

Sola Cho

Ja Yun Choi

Ethical approval

This study was approved by the Institutional Review Board of C National University Hospital.

Consent to participate

Informed consent was obtained from all individual participants included in the study.

Consent for publication

Consent for publication was obtained from all participants.

Author contributions

SC contributed to investigation, data curation, methodology, writing—original draft, review, and editing. HJ contributed to investigation, validation, visualization, writing—original draft, review, and editing. KK contributed to investigation, methodology, writing—original draft, review, and editing. SC contributed to data curation, writing—original draft, review, and editing. JYC contributed to conceptualization and/or methodology, funding acquisition, investigation, project administration and/or supervision, writing—original draft, review, and editing.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Basic Science Research Program through the National Research Foundation of Korea (NRF), funded by the Ministry of Education [grant number NRF-2022R1A2C1010364]; Institute of Information & communications Technology Planning & Evaluation (IITP) under the Artificial Intelligence Convergence Innovation Human Resources Development (IITP-2023-RS-2023-00256629) grant funded by the Korea government (MSIT). This research was supported by the MSIT (Ministry of Science and ICT), Korea, under the ITRC (Information Technology Research Center) support program (IITP-2024-RS-2024-00437718) supervised by the IITP (Institute for Information & Communications Technology Planning & Evaluation).

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data availability

The trained model weights have been uploaded to GitHub () to facilitate reproducibility. The patient-level data used in this study are protected under Institutional Review Board (IRB) approval and cannot be shared publicly.

Supplemental material

Supplemental material for this article is available online.

Disclaimers

Not applicable.

References

1. Yuan. Research on intelligent analysis and recognition system of medical data based on deep learning. Med Insights 2025; 2: 1–10.

2. Kumar. Revolutionizing patient care: artificial intelligence applications in nursing. Asian J Nurs Educ Res 2024; 14: 110–112.

3. Espinosa Méndez. Cooling nurse burnout: a theoretical approach to patient care. Health Econ Manag Rev 2024; 5: 110–120.

4. Ansari, Harris, Hosseinzadeh, et al. Application of artificial intelligence in assessing the self-management practices of patients with type 2 diabetes. Healthcare (Basel) 2023; 11: 903.

5. Ghozali. Predicting patient adherence in healthcare using artificial intelligence and machine learning techniques: a narrative review. In: 2024 International conference on IoT in social, Mobile, analytics and cloud (I-SMAC). Palladam, India: IEEE, 2024, pp.1211–1215. doi:10.1109/i-smac61858.2024.10714843

6. Pant, Yang, Cho, et al. Development of a deep learning model to predict smoking status in patients with chronic obstructive pulmonary disease: a secondary analysis of cross-sectional national survey. Digit Health 2025; 11: 1–12.

7. Korea Disease Control and Prevention Agency (KDCA). Survey method. 2024. Available from: https://knhanes.kdca.go.kr/knhanes/sub02/sub02_02.do [Accessed 2024 Oct 27].

8. Taniguchi, Narisada, Ando, et al. Smoking cessation behavior in patients with a diagnosis of a non-communicable disease: the impact of perceived disease severity of and susceptibility to the disease. Tob Induc Dis 2023; 21: 125.

9. Poudel, Fernando KRM, Schabath, et al. Abstract B019: a machine learning approach to predicting smoking cessation outcomes among Spanish-speaking smokers who completed a culturally targeted intervention. Cancer Epidemiol Biomarkers Prev 2024; 33: B019.

10. Fujii, Nakano, Tanaka, et al. Effects of self-management interventions with behavior-change support on long-term adherence in patients with chronic respiratory diseases: a systematic review. GHM Open 2022; 2: 12–24.

11. Borisov, Leemann, Seßler, et al. Deep neural networks and tabular data: a survey. IEEE Trans Neural Netw Learn Syst 2022; 35: 7499–7519.

12. Lone, Moudgil. Utilizing machine learning to forecast smoking behavior. Int J Res Appl Sci Eng Technol 2024; 12: 3436–3441.

13. Chebli, Daas, Hafs. Evaluating the impact of data imputation on model precision in machine learning. Stud Eng Exact Sci 2024; 5: e8310.

14. Mena, Arenas, Dengel. Missing data as augmentation in the Earth observation domain: a multi-view learning approach. Neurocomputing 2025; 638: 130175.

15. Bottle, Adamson, Zhang, et al. What happens between first symptoms and first acute exacerbation of COPD – observational study of routine data and patient survey. Health Soc Care Deliv Res 2024; 12: 1–80.

16. Hansen, Pedersen, Løkke, et al. BREATHEIN: Better understanding obstructive respiratory airway disease treatment and health—a nationwide investigative survey in Denmark—a study protocol. BMJ Open 2025; 15: e099447.

17. Riley, Snell, Ensor, et al. Minimum sample size for developing a multivariable prediction model: PART II – binary and time-to-event outcomes. Stat Med 2019; 38: 1276–1296.

18. Díez López, Montiel González, Vidaki, et al. Prediction of smoking habits from class-imbalanced saliva microbiome data using data augmentation and machine learning. Front Microbiol 2022; 13: 886201.

19. Sutou, Wang. Influence-balanced XGBoost: improving XGBoost for imbalanced data using influence functions. IEEE Access 2024; 1: 3520159.

20. Choi, Ryu, Yun, et al. Development of a conceptual framework for non-adherence to self-management in patients with chronic obstructive pulmonary disease: an exploratory study. Korean J Adult Nurs 2024; 36: 126–135.

21. Choi, Yun. Validity and reliability of Korean version of self-care chronic obstructive pulmonary disease inventory (SC-COPD) and self-care self-efficacy scale (SCES-COPD). J Korean Acad Nurs 2022; 52: 522–534.

22. Kim, Choi. Relationship between health literacy and self-management adherence in patients with chronic obstructive pulmonary disease. J Korea Converg Soc 2021; 21: 691–698.

23. Kang, Kim, Lim, et al. Characteristics of intermittent smokers in Korean adults: comparison with daily smokers. J Korean Soc Res Nicotine Tob 2019; 8: 58–64.

24. Tripathi, Chakraborty, Kopparapu. A novel adaptive minority oversampling technique for improved classification in data imbalanced scenarios. In: Proceedings of the 2021 international conference on pattern recognition (ICPR). IEEE, 2021, pp.10650–10657. doi:10.1109/ICPR48806.2021.9413002

25. Bichri, Chergui, Hain. Investigating the impact of train/test split ratio on the performance of pre-trained models with custom datasets. Int J Adv Comput Sci Appl 2024; 15: 331–396.

26. Ren, Zhao, Huang, et al. Deep learning within tabular data: foundations, challenges, advances and future directions. arXiv:2501.03540. Preprint posted online 2025 Jan. doi:10.48550/arxiv.2501.03540

27. Islam, Elmekki, Elsebai, et al. A comprehensive survey on applications of transformers for deep learning tasks. Expert Syst Appl 2024; 241: 122666.

28. Hina, Harun. Enhancing missing values imputation through transformer-based predictive modeling. IgMin Res 2024; 2: 25–31.

29. Nerella, Bandyopadhyay, Zhang, et al. Transformers and large language models in healthcare: a review. Artif Intell Med 2024; 154: 102900.

30. Gouveia, Correia. Network intrusion detection with XGBoost. In: Machine learning for cybersecurity. Boca Raton, FL: Chapman and Hall/CRC, 2020, pp.137–166. doi:10.1201/9780429270567-6

31. Fan, Gao, et al. Prediction of outpatient rehabilitation patient preferences and optimization of graded diagnosis and treatment based on XGBoost machine learning algorithm. Front Artif Intell 2025; 7: 1473837.

32. Hassanali, Soltanaghaei, Javdani Gandomani, et al. Software development effort estimation using boosting algorithms and automatic tuning of hyperparameters with Optuna. J Softw Maint Evol 2024; 36: e2665.

33. Lee MCH, Braet, Springael. Performance metrics for multilabel emotion classification: comparing micro, macro, and weighted F1-scores. Appl Sci 2024; 14: 9863.

34. Lundberg, Lee. A unified approach to interpreting model predictions. Adv Neural Inf Process Syst 2017; 30: 4765–4774.

35. Coleman SRM, Menson, Kaminsky, et al. Smoking cessation interventions for patients with chronic obstructive pulmonary disease: a narrative review with implications for pulmonary rehabilitation. J Cardiopulm Rehabil Prev 2023; 43: 259–269.

36. Diukarev, Starukhin. Proposed methods for preventing overfitting in machine learning and deep learning. Asian J Res Comput Sci 2024; 17: 85–94.

37. Zhang, Fan, et al. Research on deep neural network model construction and overfitting. In: Proceedings of the international conference on neural networks, information, and communication engineering (NNICE 2022), Vol. 12258. Qingdao, China: SPIE, 2022, pp.122580G. doi:10.1117/12.2639137

38. McGinnis, Skanderson, Justice, et al. Using the biomarker cotinine and survey self-report to validate smoking data from United States veterans health administration electronic health records. JAMIA Open 2022; 5: ooac040.

39. Gayatri, Aarthy. Reduction of overfitting on the highly imbalanced ISIC-2019 skin dataset using deep learning frameworks. J Xray Sci Technol 2024; 32: 53–68.

40. Cheung YTD, Jiang, et al. Physicians’ very brief (30-sec) intervention for smoking cessation on 13 671 smokers in China: a pragmatic randomized controlled trial. Addiction 2021; 116: 1172–1185.

41. Stern, Yaacoby, Weinshall. On local overfitting and forgetting in deep neural networks. arXiv 2024. doi:10.48550/arxiv.2412.12968

42. Mekov, Miravitlles, Petkov. Artificial intelligence and machine learning in respiratory medicine. Expert Rev Respir Med 2020; 14: 559–564.

43. Asaduzzaman, Uddin, Sibai. Dimensionality reduction by machine learning for cost-effective data analysis. TechRxiv 2024: 1–11. doi:10.36227/techrxiv.171332281.12206851/v1

44. Władziński, Orlicki, Barczak, et al. Small data in model calibration for optical tissue phantom validation. In: Proc SPIE; Timisoara, Romania, 2024, pp.131870J. doi:10.1117/12.3021367

45. Das, Nayak, Sahoo, et al. Evaluating ensemble models on imbalanced data sets: a comparative study across varied minority class ratios. In: Proceedings of the 2024 international conference on emerging smart computing and informatics (ESCI). Hyderabad, India: IEEE, 2024, pp.774–779. doi:10.1109/esic60604.2024.10481583

46. Lumumba, Kiprotich, Mpaine, et al. Comparative analysis of cross-validation techniques: LOOCV, K-folds cross-validation, and repeated K-folds cross-validation in machine learning models. Am J Theor Appl Stat 2024; 13: 127–137.

47. Huimin, Zheng, et al. A scoping review of factors influencing smoking cessation in patients with chronic obstructive pulmonary disease. COPD 2024; 21: 1.

48. Liu, Huang, et al. Health knowledge about smoking, role of doctors, and self-perceived health: a cross-sectional study on smokers’ intentions to quit. Int J Environ Res Public Health 2021; 18: 3629.

49. Guiedem, Pefura-Yone, Ikomey, et al. Cytokine profile in the sputum of subjects with post-tuberculosis airflow obstruction and in those with tobacco related chronic obstructive pulmonary disease. BMC Immunol 2020; 21: 1–11.

50. Gupta, Panchal, Sadatsafavi, et al. A personalized biomedical risk assessment infographic for people who smoke with COPD: a qualitative study. Addict Sci Clin Pract 2022; 17: 1–11.

51. Zohal, Rafiei, Rastgoo, et al. Exposure to stressful life events among patients with chronic obstructive pulmonary disease: a prospective study. Adv Respir Med 2020; 88: 377–382.

52. Zakiyah, Sihombing, Kamaruddin, et al. Stress level and smoking behavior. Sandi Husada: J Ilm Kesehat 2023; 12: 467–473.

53. Michalek, Wong, Brown-Johnson, et al. Smoking and unemployment: a photo elicitation project. Tob Prev Cessat 2020; 13: 49.

54. Brünés, Lindstroem, Ulrik, et al. Opportunistic screening for COPD among socially marginalized patients. BMC Pulm Med 2024; 24: xx.

55. Zeliadt, Heffner, Sayre, et al. Attitudes and perceptions about smoking cessation in the context of lung cancer screening. JAMA Intern Med 2015; 175: 1530–1537.
