Sage Journals: Discover world-class research

Abstract

Background:

Diabetes remains a major public health concern in the United States, particularly in Tennessee, where prevalence rates exceed national averages. Traditional statistical approaches may not fully capture the non-linear interactions among predictors. This study applied both traditional approaches and machine learning (ML) techniques to predict and identify key contributing factors associated with self-reported diabetes using the 2023 Behavioral Risk Factor Surveillance System (BRFSS) dataset.

Methods:

A cross-sectional analysis was conducted on 5634 adults (weighted population 5 614 486) from the Tennessee BRFSS dataset. Sociodemographic, behavioral, and health-related variables were analyzed. Data processing, exploratory analysis, and modeling were performed in Python using Pandas, NumPy, Scikit-learn, and SHAP. Seven algorithms were tested: Logistic Regression, Support Vector Machine, K-Nearest Neighbors, Decision Tree, Random Forest, Gradient Boosting, and XGBoost, with stratified 5-fold cross-validation. Models were evaluated using accuracy, precision, recall, balanced accuracy, F1-score, AUROC, and PR-AUC.

Results:

The Gradient Boosting model demonstrated the best overall performance, achieving an accuracy of 82%, precision of 48%, recall of 32%, F1-score of 37%, AUROC of 0.80, and PR-AUC of 0.45. Key predictors included high blood pressure, high cholesterol, body mass index, comorbidity burden, and physical inactivity. SHAP analysis revealed that both clinical factors and social determinants substantially influenced diabetes risk.

Conclusion:

This study highlights the strong potential of machine learning, particularly Gradient Boosting, in predicting self-reported diabetes. Integrating SHAP analysis enhanced interpretability by revealing how the above factors interact to influence diabetes risk, underscoring the value of explainable AI for precision public health and targeted prevention strategies.

Keywords

diabetes, machine learning, explainable AI, SHAP analysis, Behavioral Risk Factor Surveillance System (BRFSS), public health informatics, risk prediction, population surveillance

Background

Diabetes mellitus (DM) is a chronic metabolic disorder characterized by persistent hyperglycemia due to inadequate insulin secretion, impaired insulin action, or both.^1-4 Insulin is a crucial hormone responsible for transporting glucose from the bloodstream into cells for energy metabolism. When insulin function is disrupted, excess glucose accumulates in the blood, leading to a range of symptoms such as excessive thirst (polydipsia), frequent urination (polyuria), fatigue, unintended weight loss, and blurred vision.^3,5 If left untreated or poorly managed, DM can result in severe complications, including cardiovascular disease, kidney failure, neuropathy, and vision loss, significantly reducing quality of life and life expectancy.^3,6 The progression of DM is often gradual, with many individuals remaining asymptomatic during the early stages, which delays diagnosis and timely intervention.⁴

According to the International Diabetes Federation (IDF), 537 million adults were living with diabetes globally in 2021, a number projected to rise to 643 million by 2030 and 783 million by 2045.⁷ In the United States, over 37 million people have diabetes, with approximately 1 in 5 undiagnosed.⁸ The economic impact is substantial, with an estimated annual cost of $327 billion, including $237 billion in direct medical expenses and $90 billion in lost productivity.⁹ In Tennessee, the burden is particularly high: recent CDC data indicate that approximately 13.8% of adults have been diagnosed with diabetes, ranking the state among those with the highest prevalence nationwide.⁸ Early detection and intervention are critical for preventing or delaying complications; however, many individuals remain undiagnosed for years due to the absence of symptoms in early stages, lack of routine screening, and barriers to care.¹⁰

Public health surveillance systems play a vital role in tracking chronic diseases like diabetes by providing timely, population-level insights that can guide prevention and intervention strategies. Among these systems, the Behavioral Risk Factor Surveillance System (BRFSS) stands out as the world’s largest continuously conducted health survey, collecting high-quality, state-specific data on health behaviors, chronic disease prevalence, and preventive service use from a representative sample of non-institutionalized adults in all U.S. states and territories.¹¹ Since its inception in 1984, the BRFSS has grown into a cornerstone of chronic disease epidemiology, enabling researchers and policymakers to monitor trends over time, identify emerging health threats, and evaluate the effectiveness of public health programs. The survey’s comprehensive scope includes numerous variables relevant to diabetes prediction and prevention, such as sociodemographic factors, lifestyle behaviors, anthropometric measures, and a wide array of comorbid conditions.¹¹ In addition, BRFSS incorporates information on healthcare access, health-related quality of life, and preventive health behaviors like glucose and cholesterol screening. This breadth of information, combined with its large sample size and standardized methodology, makes BRFSS a uniquely valuable resource for developing robust, generalizable risk models for diabetes at both the state and national levels.

Traditional epidemiological analyses, such as logistic regression, are effective for identifying risk factors but may be limited in their ability to capture complex, non-linear relationships and high-order interactions between predictors.^1,12 Machine learning (ML) approaches offer powerful alternatives by automatically identifying intricate patterns in high-dimensional datasets, potentially improving prediction accuracy and uncovering novel associations.^2,13 Recent studies have successfully applied ML algorithms, including Random Forest, Support Vector Machines, Gradient Boosting, and Neural Networks, to predict diabetes using datasets such as the Pima Indians Diabetes Database, BRFSS, and clinical records, often outperforming traditional methods.^2,14,15 One major criticism of ML in healthcare is the “black box” nature of many models, which can hinder trust and adoption by clinicians and policymakers.¹⁶ Explainable artificial intelligence (XAI) techniques, such as SHapley Additive exPlanations (SHAP), address this concern by providing transparent, interpretable insights into how each feature contributes to a model’s predictions, both at the individual and population levels.^17,18 Despite numerous studies using BRFSS data for diabetes surveillance, no prior research has applied machine learning with SHAP-based explainability to the 2023 Tennessee BRFSS dataset, making this the first study to combine predictive modeling and interpretability for understanding state-specific diabetes patterns.

The primary aim of this study is to develop and evaluate explainable machine learning (ML) models for the classification of self-reported diabetes among adults in Tennessee using data from the 2023 Behavioral Risk Factor Surveillance System (BRFSS). Specifically, the study seeks to (1) compare the performance of multiple ML algorithms against traditional logistic regression, (2) assess and report predictive metrics including accuracy, sensitivity, specificity, and the area under the receiver operating characteristic curve (ROC-AUC), and (3) apply SHapley Additive exPlanations (SHAP) to identify and interpret the most influential covariates associated with diabetes status. By integrating predictive performance with model interpretability, this research aims to generate actionable insights that can inform early detection, targeted prevention, and population-level strategies to reduce the burden of diabetes in Tennessee.

Methods

Study Design and Data Source

This study employed a cross-sectional design, utilizing secondary data from the 2023 Behavioral Risk Factor Surveillance System (BRFSS) specific to the state of Tennessee. The BRFSS is a nationally representative, state-based surveillance system conducted annually by the Centers for Disease Control and Prevention (CDC)¹¹ in collaboration with U.S. states and territories. It collects information on health-related behaviors, chronic conditions, and preventive health practices among non-institutionalized adults aged 18 years and older via telephone interviews.¹¹

The 2023 BRFSS dataset was selected for its recency and comprehensive coverage of variables related to diabetes, physical health, sociodemographic characteristics, and health-related behaviors. For this study, we extracted the subset of respondents residing in Tennessee and included only those with complete responses to the outcome variable, self-reported diabetes diagnosis. After applying these inclusion criteria, the final analytical sample comprised 5634 adult individuals (weighted population 5 614 486).

Study Variables

The primary outcome variable in this study was self-reported diabetes diagnosis, based on whether a respondent reported ever being told by a health professional that they had diabetes or high blood sugar. This aligns with the standard BRFSS approach for assessing diabetes prevalence, including both cases diagnosed outside of pregnancy and those diagnosed during pregnancy for female respondents. The outcome variable was recoded into a binary indicator: “Yes” for respondents reporting a diagnosis of diabetes (including pregnancy-related diabetes) and “No” for those reporting no diagnosis or prediabetes/borderline status.

A range of predictor variables was selected based on their theoretical and empirical associations with diabetes, as documented in previous literature.^4,19-21 These predictors spanned across 3 broad domains: sociodemographic characteristics, health status, and health behaviors. Sociodemographic variables included sex (male, female), age group (18-24, 25-34, 35-44, 45-54, 55-64, 65+), race/ethnicity (White, Black, Other, Multiracial, Hispanic), marital status (married vs unmarried), education level (less than high school, high school graduate, some college, and college graduate), income group (<$15 000, $15 000-$25 000, $25 000-$35 000, $35 000-$50 000, $50 000-$100 000, $100 000-$200 000, and $200 000+), employment status (employed vs unemployed), and urbanicity (urban vs rural). Health status indicators included body mass index (BMI) category (underweight, normal, overweight, obese), self-reported general health (good/better vs fair/poor), number of poor physical health days (0, 1-13, 14+), number of poor mental health days (0, 1-13, 14+), comorbidity count (0, 1-2, 3+), disability count (0, 1-2, 3+), and specific chronic condition diagnoses including high blood pressure, high cholesterol, cancer, and arthritis. Health behavior variables included smoking status (every day, some days, former, never), alcohol use in the past 30 days (yes vs no), physical activity in the last 30 days (yes vs no), and current health insurance coverage (insured vs not insured). Additional predictors included veteran status (yes vs no). These variables were included to capture demographic, behavioral, and health-related risk factors that may influence the likelihood of a diabetes diagnosis (Supplemental Table S1).

Data Processing and Analysis

Data management and analysis were conducted across multiple platforms to ensure methodological rigor and reproducibility. Multiple imputation was performed in R version 4.3.2 (“Eye Holes”) using the MICE package to address missing data, which ranged from 0.01% for comorbidity count to 19% for income. The relatively low to moderate levels of missingness warranted imputation to preserve statistical power and minimize potential bias in subsequent analyses.²² The imputed dataset was then imported into Stata/BE 19, where descriptive analyses were conducted using the svyset commands to account for the complex sampling design of the BRFSS. Weighted frequencies and percentages were computed for categorical variables, and chi-square tests were used to assess associations between self-reported diabetes diagnosis and each predictor. Crude odds ratios (CORs) with 95% confidence intervals (CIs) were estimated through bivariate logistic regression. Variables with a P-value ≤ .20 were retained via backward elimination to build the multivariable model, from which adjusted odds ratios (AORs) and 95% CIs were obtained.²³ Statistical significance was defined as P ≤ .05 (highly significant if P ≤ .001). For the machine-learning (ML) component, the final cleaned dataset was exported to Google Colab (Python 3.10), where data transformation, model building, and explainable AI analyses were performed using Pandas, NumPy, Scikit-learn, SHAP, Matplotlib, Seaborn, and SciPy.

Data Pre-processing

The raw BRFSS dataset underwent a multi-stage cleaning and transformation process to ensure analytical integrity before modeling. As all variables were categorical, normalization or standardization was unnecessary. Rare response categories were merged with related groups, and inconsistent codes were corrected to maintain uniformity. Outlier detection was performed for each variable, and infrequent or analytically insignificant categories were consolidated or excluded as appropriate. In addition, composite indicators, including comorbidity count, disability count, and BMI category, were derived to improve interpretability and capture broader health dimensions. In Python, the training data were processed using an automated preprocessing pipeline incorporating one-hot encoding for categorical variables, ensuring consistent variable treatment and compatibility across machine learning algorithms. Collectively, these steps streamlined the dataset, enhanced interpretability, and minimized analytical noise, thereby strengthening the robustness of subsequent model development.

Addressing Class Imbalance Using SMOTE

Exploratory analysis revealed a pronounced class imbalance, with considerably fewer participants reporting diabetes than those without. To prevent model bias toward the majority (non-diabetic) group, the Synthetic Minority Over-sampling Technique (SMOTE) was applied only to the training subset after the train-test split. SMOTE interpolates new synthetic minority-class observations, balancing class representation without duplicating existing cases. The test data remained untouched, ensuring an unbiased evaluation of model generalizability. Post-SMOTE, class proportions in the training set were approximately equal, enhancing model sensitivity to the minority (diabetic) class.
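The interpolation step described above can be illustrated with a minimal standard-library sketch. It is not the implementation used in the study; the helper name `smote_oversample`, the neighbour count `k`, and the toy rows are assumptions made purely for illustration.

```python
import math
import random


def smote_oversample(minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows by interpolating between minority rows.

    Each synthetic row lies on the line segment between a randomly chosen
    minority row and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Rank the other minority rows by Euclidean distance to the base row.
        others = [row for row in minority if row is not base]
        others.sort(key=lambda row: math.dist(base, row))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Hypothetical training subset: 6 majority rows and 3 minority rows.
majority = [[0.0, 0.0], [0.1, 0.2], [0.2, 0.1], [0.3, 0.3], [0.1, 0.0], [0.0, 0.2]]
minority = [[1.0, 1.0], [1.2, 0.9], [0.9, 1.1]]

# Generate just enough synthetic rows to equalise the two classes.
new_rows = smote_oversample(minority, n_new=len(majority) - len(minority), k=2)
balanced_minority = minority + new_rows
print(len(majority), len(balanced_minority))
```

Only the training rows are passed to the function; the held-out test rows never enter it. Plain interpolation also produces fractional values on one-hot columns, which is worth keeping in mind when every predictor is categorical.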

Feature Selection

Predictor selection followed a hybrid statistical and machine-learning approach. The process began with exploratory data analysis (EDA), using descriptive statistics and visualizations to examine variable distributions and explore potential relationships with the outcome. Bivariate analyses were then performed to assess the strength, direction, and statistical significance of associations between each predictor and diabetes status. Thereafter, screening in Stata used backward elimination (P ≤ .20) to retain potentially informative variables.²³ In Python, Cramér’s V statistic was computed to assess pairwise associations among categorical predictors and to flag redundant, strongly associated predictors (the categorical analogue of multicollinearity). Subsequently, Recursive Feature Elimination (RFE) was applied to the training data to identify the top-ranking features contributing most to model performance (Supplemental Table S3). This combination of statistical screening and algorithmic selection was intended to yield an interpretable and parsimonious predictor set.
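As a rough illustration of the pairwise association check, the sketch below computes the uncorrected Cramér's V, V = sqrt(chi² / (n · (min(r, c) − 1))), from two lists of category labels. The function name `cramers_v` and the toy labels are assumptions; the study does not report which implementation or bias correction, if any, was used.

```python
import math
from collections import Counter


def cramers_v(x, y):
    """Uncorrected Cramér's V for two equal-length sequences of labels."""
    n = len(x)
    joint = Counter(zip(x, y))
    row_totals = Counter(x)
    col_totals = Counter(y)
    chi2 = 0.0
    for r, r_count in row_totals.items():
        for c, c_count in col_totals.items():
            expected = r_count * c_count / n
            observed = joint.get((r, c), 0)
            chi2 += (observed - expected) ** 2 / expected
    min_dim = min(len(row_totals), len(col_totals)) - 1
    if min_dim == 0:
        return 0.0  # a constant variable carries no association
    return math.sqrt(chi2 / (n * min_dim))


# Hypothetical labels: the first pair is perfectly associated, the second independent.
bmi = ["obese", "obese", "normal", "normal"]
high_bp = ["yes", "yes", "no", "no"]
insured = ["yes", "no", "yes", "no"]

v_associated = cramers_v(bmi, high_bp)
v_independent = cramers_v(bmi, insured)
print(v_associated, v_independent)
```

Values near 1 suggest that two predictors carry largely overlapping information, while values near 0 suggest little association.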

Feature Importance

The relative contribution of each predictor variable to the classification models was assessed using SHapley Additive exPlanations (SHAP), a model-agnostic interpretability framework. The SHAP values quantify each predictor’s contribution to the model’s output, indicating both the direction and magnitude of its influence on diabetes classification. The analyses were conducted on the best-performing model, and the results were visualized using beeswarm, bar, and box plots. Both encoded and aggregated SHAP analyses were generated, the latter grouping one-hot-encoded categories back into their original variables, to facilitate clearer interpretation. This method provided a consistent and transparent way to interpret complex machine learning algorithms. It also informed potential directions for public health interventions by highlighting the most influential risk and protective factors.
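The aggregated view can be sketched as follows, under the assumption that one-hot columns are named `variable=category` and that per-row SHAP values for the columns of one variable are summed (which the additivity of SHAP values permits). The column names, the SHAP values, and both helper functions are hypothetical and do not come from the study.

```python
from collections import defaultdict


def aggregate_shap(columns, shap_rows, sep="="):
    """Sum per-row SHAP values over the one-hot columns of each original variable."""
    groups = defaultdict(list)
    for idx, name in enumerate(columns):
        groups[name.split(sep, 1)[0]].append(idx)
    return [
        {var: sum(row[i] for i in idxs) for var, idxs in groups.items()}
        for row in shap_rows
    ]


def mean_abs_importance(aggregated):
    """Global importance: mean absolute aggregated SHAP value per variable."""
    n = len(aggregated)
    return {var: sum(abs(row[var]) for row in aggregated) / n for var in aggregated[0]}


# Hypothetical encoded columns and SHAP values for two respondents.
columns = ["bmi=normal", "bmi=obese", "high_bp=yes"]
shap_rows = [
    [-0.10, 0.30, 0.25],
    [0.05, -0.20, -0.15],
]

aggregated = aggregate_shap(columns, shap_rows)
importance = mean_abs_importance(aggregated)
print(importance)
```

Summing before taking absolute values keeps the sign of each variable's net contribution per respondent, so the aggregated values still add up to the same model output as the encoded ones.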

Model Development and Optimization

Given the complex and multifactorial nature of self-reported diabetes, a diverse set of machine learning algorithms was implemented to capture varying patterns in the data. The selected models represented different learning paradigms, including linear (Logistic Regression), kernel-based (Support Vector Machine), tree-based (Decision Tree, Random Forest), distance-based (K-Nearest Neighbors), and ensemble boosting methods (XGBoost and Gradient Boosting). This facilitated a comprehensive comparison of predictive performance across multiple modeling strategies. For model optimization, default hyperparameters were initially applied, and where appropriate, tuning was conducted using grid search with 5-fold cross-validation to refine model performance (Supplemental Table S2). Each model was implemented within a consistent preprocessing framework and trained on the SMOTE-balanced training data. Logistic Regression served as the baseline model, offering a transparent linear approach to classification. Support Vector Machine was included for its ability to handle non-linear decision boundaries through kernel functions. Decision Tree and Random Forest models were used to capture hierarchical decision-making patterns, with Random Forest benefiting from ensemble averaging to improve stability and reduce overfitting. K-Nearest Neighbors classified observations based on their proximity in feature space, while boosting methods (e.g., Gradient Boosting and XGBoost) were incorporated for their iterative approach to minimizing classification errors and improving accuracy.

Model Training and Evaluation

Model performance was evaluated using a structured training and testing framework. The dataset was partitioned into training (80%) and testing (20%) sets using stratified sampling to preserve the outcome distribution. The training data underwent SMOTE oversampling, while the test set remained unaltered. Model performance was assessed using accuracy, precision, recall, F1-score, Area Under the Receiver Operating Characteristic Curve (AUROC), balanced accuracy, and precision-recall AUC (PR-AUC). In addition, 5-fold stratified cross-validation was applied to the training data for robust performance estimation. The cross-validation was performed within the training data only to prevent information leakage into the test set. Confusion matrices were generated to visualize true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). ROC curves were plotted to visualize and compare classification performance across all machine learning models. Performance metrics were averaged across folds to obtain stable estimates and limit optimistic bias. This approach ensured that the final evaluation captured not only raw accuracy but also the balance between sensitivity and specificity across all models tested.
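The stratified 80/20 partition can be sketched with the standard library alone. The helper `stratified_split`, the seed, and the toy label vector are assumptions for illustration; the study's own split was produced in its Python pipeline and is not reproduced here.

```python
import random
from collections import defaultdict


def stratified_split(labels, test_fraction=0.2, seed=42):
    """Return (train_idx, test_idx) with each class split in the same proportion."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for idxs in by_class.values():
        rng.shuffle(idxs)
        n_test = round(len(idxs) * test_fraction)
        test_idx.extend(idxs[:n_test])
        train_idx.extend(idxs[n_test:])
    return sorted(train_idx), sorted(test_idx)


# Hypothetical outcome vector: 20 positives and 80 negatives.
labels = [1] * 20 + [0] * 80
train_idx, test_idx = stratified_split(labels)

train_rate = sum(labels[i] for i in train_idx) / len(train_idx)
test_rate = sum(labels[i] for i in test_idx) / len(test_idx)
print(len(train_idx), len(test_idx), train_rate, test_rate)
```

Any oversampling would then be applied to the rows in `train_idx` only, so the positive rate in the test rows continues to reflect the original outcome distribution.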

Model Selection

The final model was selected through a systematic comparison of performance metrics across all evaluated machine learning algorithms. Accuracy was considered a primary indicator, reflecting the overall proportion of correctly classified cases. Precision measured the share of predicted positive cases that were truly positive, while recall (or sensitivity) assessed the model’s ability to correctly identify actual positive cases. The F1-score, representing the harmonic mean of precision and recall, was particularly valuable in addressing class imbalance by balancing the trade-off between these 2 measures.²⁴ The PR-AUC was also assessed to better evaluate model performance under class imbalance by summarizing the relationship between precision and recall. By evaluating these metrics collectively, the model that achieved the most favorable balance of accuracy, recall, precision, PR-AUC, and F1-score was identified and selected as the optimal algorithm for predicting self-reported diabetes (Figure 1).
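The metric definitions above can be made concrete with a short sketch that derives them from confusion-matrix counts. The counts are hypothetical and chosen only to show how accuracy can remain high under class imbalance while recall stays low; they are not the study's results.

```python
def classification_metrics(tp, fp, tn, fn):
    """Derive common classification metrics from confusion-matrix counts."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)  # also called sensitivity
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + tn + fn)
    f1 = 2 * precision * recall / (precision + recall)  # harmonic mean
    balanced_accuracy = (recall + specificity) / 2
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "specificity": specificity,
        "f1": f1,
        "balanced_accuracy": balanced_accuracy,
    }


# Hypothetical test set: 100 positives and 900 negatives.
metrics = classification_metrics(tp=30, fp=30, tn=870, fn=70)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

In this toy case accuracy is 0.90 even though only 30 of 100 positives are detected, which is why F1-score, balanced accuracy, and PR-AUC are weighed alongside accuracy when classes are imbalanced.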

Figure 1.

Machine learning workflow for predicting self-reported diabetes diagnosis.

Results

Sociodemographic and Health Characteristics by Self-Reported Diabetes Diagnosis

Table 1 presents the weighted sociodemographic and health-related characteristics of adults in Tennessee based on the 2023 BRFSS survey, stratified by self-reported diabetes diagnosis. Overall, approximately 16.2% of adults reported having been diagnosed with diabetes, while 83.8% reported no diagnosis. Women constituted a slightly higher proportion of the sample (51.9%) compared to men (48.1%), and diabetes prevalence was modestly higher among females (17.1%) than males (15.2%). Diabetes prevalence increased markedly with age, from 3.3% among adults aged 18 to 24 years to 27.8% among those aged 65 years and older, reflecting the strong age gradient typical of diabetes burden.

Table 1.

Weighted Descriptive Statistics of Key Variables and Their Distribution by Self-Reported Diabetes Diagnosis, Tennessee, BRFSS 2023.

SN	Variables	Total (N = 5 614 486; 100%)	Self-reported diabetes: Yes	Self-reported diabetes: No
		Weighted frequency (column %)	Weighted frequency (row %)	Weighted frequency (row %)
1	Self-reported diabetes diagnosis
	Yes	908 945 (16.19)
	No	4 705 541 (83.81)
2	Sex
	Male	2 702 326 (48.13)	410 510 (15.19)	2 291 816 (84.81)
	Female	2 912 160 (51.87)	498 435 (17.12)	2 413 725 (82.88)
3	Age-group
	18-24	673 123 (11.99)	22 235 (3.3)	650 888 (96.7)
	25-34	934 420 (16.64)	60 735 (6.5)	873 685 (93.5)
	35-44	893 026 (15.91)	75 733 (8.48)	817 294 (91.52)
	45-54	857 534 (15.27)	156 863 (18.29)	700 671 (81.71)
	55-64	905 505 (16.13)	218 575 (24.14)	686 930 (75.86)
	65+	1 350 878 (24.06)	374 804 (27.75)	976 073 (72.25)
4	Race/ethnicity
	White	4 061 187 (72.33)	629 392 (15.5)	3 431 795 (84.5)
	Black	827 170 (14.73)	195 196 (23.6)	631 975 (76.4)
	Other	215 650 (3.84)	35 919 (16.66)	179 731 (83.34)
	Multiracial	144 152 (2.57)	20 781 (14.42)	123 370 (85.58)
	Hispanic	366 327 (6.53)	27 657 (7.55)	338 670 (92.45)
5	Education level
	<HS	636 309 (11.33)	154 768 (24.32)	481 542 (75.68)
	HS Grad	1 825 137 (32.51)	293 125 (16.06)	1 532 012 (83.94)
	Some college	1 649 708 (29.38)	275 096 (16.68)	1 374 611 (83.32)
	College grad	1 503 332 (26.78)	185 956 (12.37)	1 317 376 (87.63)
6	Income group
	<15k	376 232 (6.7)	81 176 (21.58)	295 056 (78.42)
	15-25k	571 436 (10.18)	130 812 (22.89)	440 624 (77.11)
	25-35k	617 184 (10.99)	122 923 (19.92)	494 261 (80.08)
	35-50k	976 841 (17.4)	166 095 (17)	810 745 (83)
	50-100k	1 678 869 (29.9)	275 136 (16.39)	1 403 733 (83.61)
	100-200k	1 040 393 (18.53)	99 071 (9.52)	941 322 (90.48)
	200k+	353 531 (6.3)	33 732 (9.54)	319 799 (90.46)
7	Marital status
	Married	2 841 347 (50.61)	494 478 (17.4)	2 346 868 (82.6)
	Unmarried	2 773 139 (49.39)	414 467 (14.95)	2 358 673 (85.05)
8	Employment status
	Employed	3 400 830 (60.57)	357 989 (10.53)	3 042 841 (89.47)
	Unemployed	2 213 656 (39.43)	550 956 (24.89)	1 662 700 (75.11)
9	Urbanicity
	Urban	4 956 960 (88.29)	795 541 (16.05)	4 161 419 (83.95)
	Rural	657 526 (11.71)	113 404 (17.25)	544 122 (82.75)
10	Health insurance
	Insured	5 084 567 (90.56)	845 628 (16.63)	4 238 938 (83.37)
	Not insured	529 920 (9.44)	63 317 (11.95)	466 603 (88.05)
11	BMI category
	Underweight	87 896 (1.57)	2 267 (2.58)	85 629 (97.42)
	Normal	1 581 285 (28.16)	156 533 (9.9)	1 424 752 (90.1)
	Overweight	1 826 486 (32.53)	262 418 (14.37)	1 564 068 (85.63)
	Obese	2 118 819 (37.74)	487 727 (23.02)	1 631 092 (76.98)
12	Smoking status
	Every day	656 217 (11.69)	104 063 (15.86)	552 154 (84.14)
	Some days	313 557 (5.59)	53 825 (17.17)	259 732 (82.83)
	Former	1 375 346 (24.5)	240 604 (17.49)	1 134 742 (82.51)
	Never	3 269 366 (58.2)	510 453 (15.61)	2 758 913 (84.39)
13	Alcohol use
	Yes	2 693 387 (47.97)	297 530 (11.05)	2 395 857 (88.95)
	No	2 921 099 (52.03)	611 415 (20.93)	2 309 685 (79.07)
14	Cancer status
	Yes	491 416 (8.75)	133 178 (27.1)	358 237 (72.9)
	No	5 123 071 (91.25)	775 767 (15.14)	4 347 304 (84.86)
15	Veteran status
	Yes	645 041 (11.49)	130 902 (20.29)	514 139 (79.71)
	No	4 969 445 (88.51)	778 043 (15.66)	4 191 402 (84.34)
16	Self-reported wellbeing
	Good/better	4 300 787 (76.6)	490 968 (11.42)	3 809 819 (88.58)
	Fair/poor	1 313 699 (23.4)	417 977 (31.82)	895 722 (68.18)
17	Poor physical health days
	Zero days	3 164 764 (56.37)	376 026 (11.88)	2 788 738 (88.12)
	1-13 days	1 571 998 (28)	275 366 (17.52)	1 296 632 (82.48)
	14+ days	877 725 (15.63)	257 554 (29.34)	620 171 (70.66)
18	Poor mental health days
	Zero days	2 942 662 (52.41)	465 595 (15.82)	2 477 067 (84.18)
	1-13 days	1 572 367 (28.01)	230 342 (14.65)	1 342 025 (85.35)
	14+ days	1 099 457 (19.58)	213 008 (19.37)	886 450 (80.63)
19	Physical activity in the last 30 days
	Yes	4 155 472 (74.01)	540 841 (13.02)	3 614 630 (86.98)
	No	1 459 015 (25.99)	368 104 (25.23)	1 090 911 (74.77)
20	High blood pressure diagnosis
	Yes	2 278 646 (40.59)	658 258 (28.89)	1 620 388 (71.11)
	No	3 335 840 (59.41)	250 687 (7.51)	3 085 153 (92.49)
21	High cholesterol diagnosis
	Yes	2 085 402 (37.14)	611 464 (29.32)	1 473 938 (70.68)
	No	3 529 084 (62.86)	297 481 (8.43)	3 231 603 (91.57)
22	Arthritis diagnosis
	Yes	1 863 546 (33.19)	498 016 (26.72)	1 365 530 (73.28)
	No	3 750 940 (66.81)	410 929 (10.96)	3 340 011 (89.04)
23	Comorbidity count
	0 comorbidities	1 541 800 (27.46)	55 071 (3.57)	1 486 729 (96.43)
	1-2 comorbidities	2 373 124 (42.27)	280 936 (11.84)	2 092 187 (88.16)
	3+ comorbidities	1 699 562 (30.27)	572 938 (33.71)	1 126 624 (66.29)
24	Disabilities count
	0 disabilities	3 569 102 (63.57)	406 845 (11.4)	3 162 257 (88.6)
	1-2 disabilities	1 472 932 (26.23)	308 238 (20.93)	1 164 694 (79.07)
	3+ disabilities	572 452 (10.2)	193 861 (33.87)	378 591 (66.13)

By race and ethnicity, Black adults exhibited the highest diabetes prevalence (23.6%), followed by adults in the Other category (16.7%) and White adults (15.5%), whereas Hispanic respondents had the lowest prevalence (7.6%). Educational attainment and income demonstrated inverse associations with diabetes: adults with less than a high school education had the highest prevalence (24.3%), while college graduates had the lowest (12.4%). Similarly, diabetes prevalence generally declined with increasing household income, from 21.6% among those earning less than $15 000 to 9.5% among those earning $100 000 or more, although it peaked at 22.9% in the $15 000-$25 000 group. Unemployment was also associated with higher diabetes prevalence: 24.9% of unemployed adults reported diabetes, compared to 10.5% of employed adults. Marital status showed smaller differences, with prevalence of 17.4% among married and 15% among unmarried respondents.

Geographically, diabetes prevalence was comparable between urban (16.1%) and rural (17.3%) residents. Uninsured individuals had a lower reported prevalence (12.0%) compared to insured individuals (16.6%). Body mass index (BMI) showed a clear positive gradient with diabetes risk, rising from 2.6% among underweight adults to 23.0% among those classified as obese. Lifestyle factors also displayed strong associations: individuals who reported no physical activity in the past 30 days had nearly double the diabetes prevalence (25.2%) of those who were physically active (13.0%). Non-drinkers had a high diabetes prevalence (20.9%), whereas prevalence among daily smokers (15.9%) was close to the overall estimate (16.2%). Adults reporting 3 or more comorbidities had a prevalence of 33.7%, compared to only 3.6% among those with none. Similarly, those with 3 or more disabilities reported a prevalence of 33.9%, versus 11.4% among those with no disabilities. Prevalence was also elevated among those with diagnoses of high blood pressure (28.9%), high cholesterol (29.3%), and arthritis (26.7%), as well as among respondents reporting fair or poor self-rated health (31.8%) (Table 1).

Factors Associated with Self-Reported Diabetes Diagnosis

Table 2 summarizes the results of unadjusted and adjusted logistic regression analyses identifying factors associated with self-reported diabetes among Tennessee adults. Variables were initially screened using backward elimination at a threshold of P ≤ .20, and twelve predictors were retained in the final multivariable model. In the unadjusted analyses, older age-groups, Black race, lower education and income levels, unemployment, obesity, lack of physical activity, hypertension, hypercholesterolemia, multiple comorbidities or disabilities, and fair/poor self-reported health were all significantly associated with increased odds of self-reported diabetes (P ≤ .05). Specifically, diabetes odds rose sharply with age, from the reference group (18-24 years) to those aged 65 years and older (OR = 11.2, 95% CI: 5.4-23.4). Similarly, obesity was strongly associated with diabetes (OR = 11.3, 95% CI: 4.4-29.2). High blood pressure and high cholesterol were also strongly associated: because the diagnosed group served as the reference category, the reported ratios describe adults without hypertension (OR = 0.20) or without high cholesterol (OR = 0.22), which corresponds to roughly 5-fold and 4.5-fold higher odds of diabetes among those with each diagnosis.
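As a worked check on how the crude odds ratios relate to Table 1, the sketch below recomputes two of them directly from the weighted counts. The helper `odds_ratio` is a hypothetical name; the sketch reproduces point estimates only, since the confidence intervals in Table 2 depend on the survey design and cannot be derived from these counts.

```python
def odds_ratio(group_yes, group_no, ref_yes, ref_no):
    """Odds of diabetes in a group divided by the odds in the reference group."""
    return (group_yes / group_no) / (ref_yes / ref_no)


# Weighted counts (diabetes yes, diabetes no) taken from Table 1.
employed = (357_989, 3_042_841)    # reference category
unemployed = (550_956, 1_662_700)
high_bp_yes = (658_258, 1_620_388)  # reference category in Table 2
high_bp_no = (250_687, 3_085_153)

or_unemployed = odds_ratio(*unemployed, *employed)
or_no_hbp = odds_ratio(*high_bp_no, *high_bp_yes)

print(f"Unemployed vs employed: {or_unemployed:.2f}")
print(f"No high blood pressure vs diagnosed: {or_no_hbp:.2f}")
print(f"Diagnosed vs no high blood pressure: {1 / or_no_hbp:.2f}")
```

Taking the reciprocal of an odds ratio swaps the reference category, which is how a value of 0.20 for adults without hypertension translates into roughly 5-fold higher odds for adults with hypertension.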

Table 2.

Unadjusted and Adjusted Odds Ratios for Factors Associated with Self-Reported Diabetes Diagnosis, Tennessee, BRFSS 2023.

SN	Variable	Unadjusted model		Adjusted model
		Crude OR	95% CI	Adjusted OR	95% CI
1	Sex
	Male	Ref
	Female	1.15	(0.95-1.40)
2	Age-group
	18-24	Ref		Ref
	25-34	2.03	(0.9-4.59)	1.79	(0.81-3.96)
	35-44	2.71	(1.23-6.0)*	1.89	(0.89-3.99)
	45-54	6.55	(3.08-13.96)**	3.56	(1.71-7.41)**
	55-64	9.31	(4.43-19.59)**	3.60	(1.76-7.39)**
	65+	11.24	(5.41-23.36)**	3.32	(1.62-6.79)**
3	Race/ethnicity
	White	Ref		Ref
	Black	1.68	(1.29-2.20)**	1.59	(1.18-2.18)*
	Other	1.0	(0.57-2.10)	1.67	(0.81-3.46)
	Multiracial	0.92	(0.51-1.65)	0.76	(0.40-1.47)
	Hispanic	0.45	(0.21-0.94)*	0.89	(0.39-2.02)
4	Education level
	<HS	Ref
	HS grad	0.60	(0.42-0.85)*
	Some college	0.62	(0.44-0.89)**
	College grad	0.44	(0.31-0.63)**
5	Income Group
	<15k	Ref
	15-25k	1.08	(0.68-1.71)
	25-35k	0.90	(0.57-1.43)
	35-50k	0.74	(0.49-1.14)
	50-100k	0.71	(0.47-1.07)
	100-200k	0.38	(0.24-0.60)**
	200k+	0.38	(0.21-0.68)**
6	Marital status
	Married	Ref
	Unmarried	0.83	(0.69-1.01)
7	Employment status
	Employed	Ref		Ref
	Unemployed	2.82	(2.31-3.43)**	1.37	(1.05-1.79)*
8	Urbanicity
	Urban	Ref
	Rural	1.09	(0.84-1.41)
9	Health insurance
	Insured	Ref
	Not insured	0.68	(0.43-1.08)
10	BMI category
	Underweight	Ref		Ref
	Normal	4.15	(1.57-10.98)*	3.57	(1.28-9.59)*
	Overweight	6.34	(2.49-16.44)**	4.57	(1.69-12.33)*
	Obese	11.30	(4.38-29.15)**	6.90	(2.56-18.56)**
11	Smoking status
	Every day	Ref		Ref
	Some days	1.10	(0.62-1.94)	1.35	(0.74-2.48)
	Former	1.12	(0.80-1.59)	1.29	(0.87-1.92)
	Never	0.98	(0.71-1.36)	1.75	(1.18-2.60)*
12	Alcohol use
	Yes	Ref		Ref
	No	2.13	(1.74-2.62)**	1.38	(1.10-1.73)*
13	Cancer diagnosis
	Yes	Ref		Ref
	No	0.48	(0.37-0.62)**	1.00
14	Veteran status
	Yes	Ref		Ref
	No	0.73	(0.57-0.94)*	0.91
15	Self-reported wellbeing
	Good/better	Ref		Ref
	Fair/poor	3.62	(2.94-4.45)**	1.90	(1.48-2.44)**
16	Poor physical health days
	Zero days	Ref
	1-13 days	1.58	(1.25-1.98)**
	14+ days	3.08	(2.41-3.94)**
17	Poor mental health days
	Zero days	Ref
	1-13 days	0.91	(0.72-1.16)
	14+ days	1.28	(1.0-1.62)*
18	Physical activity in the last 30 days
	Yes	Ref		Ref
	No	2.26	(1.84-2.76)**	1.22	(0.97-1.53)
19	High blood pressure diagnosis
	Yes	Ref		Ref
	No	0.2	(0.16-0.25)**	0.58	(0.44-0.76)**
20	High cholesterol diagnosis
	Yes	Ref		Ref
	No	0.22	(0.18-0.25)**	0.51	(0.40-0.65)**
21	Arthritis diagnosis
	Yes	Ref		Ref
	No	0.34	(0.27-0.41)**	1.20	(0.93-1.55)
22	Comorbidity count
	0 comorbidities	Ref		Ref
	1-2 comorbidities	3.63	(2.37-5.55)**	1.50	(0.93-2.42)
	3+ comorbidities	13.73	(9.07-20.78)**	2.60	(1.46-4.62)*
23	Disabilities count
	0 disabilities	Ref
	1-2 disabilities	2.05	(1.65-2.56)**
	3+ disabilities	3.98	(2.99-5.30)**

Abbreviations: CI, confidence interval; OR, odds ratio.

**P-value ≤ .001. *P-value ≤ .05.

After adjustment for covariates, twelve variables remained significant predictors of diabetes: age group, race/ethnicity, employment status, BMI category, smoking status, alcohol use, self-reported wellbeing, and diagnoses of hypertension and high cholesterol, as well as comorbidity count. The associations for age and BMI remained robust; adults aged 55-64 years (AOR = 3.6, 95% CI: 1.8-7.4) and 65 years or older (AOR = 3.3, 95% CI: 1.6-6.8) were over 3 times as likely to report diabetes compared to those aged 18-24 years. Likewise, obese adults had nearly 7-fold higher adjusted odds (AOR = 6.9, 95% CI: 2.6-18.6) relative to underweight respondents. Black adults had higher adjusted odds of diabetes compared to White adults (AOR = 1.6, 95% CI: 1.2-2.2), while education and income effects were attenuated after adjustment. Unemployed adults had about 40% higher odds of diabetes than employed adults (AOR = 1.4, 95% CI: 1.1-1.8). Participants reporting fair or poor health were nearly twice as likely to have diabetes (AOR = 1.9, 95% CI: 1.5-2.4). Those reporting no diagnosis of hypertension (AOR = 0.6, 95% CI: 0.4-0.8) or hypercholesterolemia (AOR = 0.5, 95% CI: 0.4-0.7) had significantly lower odds, reinforcing the clustering of metabolic risk factors.

Paradoxically, individuals who had never smoked (AOR = 1.75, 95% CI: 1.18-2.60) and those who abstained from alcohol (AOR = 1.38, 95% CI: 1.10-1.73) exhibited higher adjusted odds of diabetes compared to current smokers and drinkers, respectively. Overall, the final multivariable model underscores the combined influence of demographic, behavioral, and metabolic factors on diabetes risk, with obesity, advancing age, poor self-rated health, and comorbidities emerging as the strongest independent predictors (Table 2).
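
As a reading aid for Table 2, the sketch below shows how a reported odds ratio and its 95% CI relate to a logistic regression coefficient on the log-odds scale, using the adjusted estimate for unemployed adults (AOR = 1.37, 95% CI: 1.05-1.79) as input. It assumes a Wald-type interval, which the source does not state, and the function names are illustrative rather than taken from the authors' code.

```python
import math

Z_95 = 1.96  # normal quantile for a two-sided 95% interval


def log_odds_from_or(odds_ratio, ci_low, ci_high):
    """Recover the coefficient and its standard error from a reported OR and CI.

    Assumes the interval is a Wald interval, i.e. symmetric on the log scale.
    """
    beta = math.log(odds_ratio)
    se = (math.log(ci_high) - math.log(ci_low)) / (2 * Z_95)
    return beta, se


def or_with_ci(beta, se):
    """Map a log-odds coefficient back to an odds ratio with a 95% Wald CI."""
    return (math.exp(beta), math.exp(beta - Z_95 * se), math.exp(beta + Z_95 * se))


def percent_change_in_odds(odds_ratio):
    """Express an odds ratio as a percent change in odds versus the reference."""
    return (odds_ratio - 1.0) * 100.0


# Adjusted estimate for unemployed vs employed adults, copied from Table 2.
beta, se = log_odds_from_or(1.37, 1.05, 1.79)
aor, low, high = or_with_ci(beta, se)
print(f"beta={beta:.3f}, SE={se:.3f}")
print(f"AOR={aor:.2f} (95% CI: {low:.2f}-{high:.2f})")
print(f"Change in odds: {percent_change_in_odds(aor):+.0f}%")
```

The round trip reproduces the tabulated interval to two decimals, which is what one would expect if the interval is symmetric around the coefficient on the log scale. The same check can flag rows where an odds ratio falls outside its own interval.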

Model Performance Evaluation and Comparison

Model performance was assessed using accuracy, precision, recall, F1-score, balanced accuracy, AUROC, and the Precision-Recall Area Under the Curve (PR-AUC). Confusion matrices were examined to evaluate each model’s classification behavior in terms of true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). Across all models, accuracy ranged from 0.576 (KNN) to 0.818 (XGBoost). The Gradient Boosting (GB) model achieved consistently strong overall performance, with an accuracy of 0.815, the highest AUROC (0.796), and the highest PR-AUC (0.447), indicating good discrimination across decision thresholds, although its recall at the default threshold was low (0.302). XGBoost performed comparably (accuracy = 0.818, AUROC = 0.776) and had the highest precision (0.500). The Random Forest (RF) model followed closely (accuracy = 0.813, AUROC = 0.775), with low false-positive (67) and high true-negative (855) counts. The Logistic Regression (LR) model provided the second-highest recall (0.746), along with the highest F1-score (0.502) and balanced accuracy (0.737), indicating strong sensitivity in detecting diabetes cases, but at the expense of precision (0.378), a trade-off typical of linear models in imbalanced settings. SVM showed intermediate results (accuracy = 0.791, AUROC = 0.761), while KNN showed the highest recall (0.844) but poor precision (0.279) and the lowest overall accuracy (0.576), reflecting its tendency to over-classify positive cases. Decision Tree (DT) performed moderately (accuracy = 0.736, AUROC = 0.626), capturing non-linear patterns but with greater overfitting risk compared to ensemble models.

Overall, the ensemble-based methods (XGBoost, Gradient Boosting, and Random Forest) achieved the highest accuracy and precision, and Gradient Boosting achieved the highest AUROC and PR-AUC. Their balanced accuracy (0.606-0.615), however, was lower than that of Logistic Regression (0.737), reflecting their low recall at the default threshold. Gradient Boosting was selected as the best-performing model due to its combination of accuracy (0.815), AUROC (0.796), and PR-AUC (0.447), signifying reliable discrimination between diabetic and non-diabetic respondents (Table 3, Figure 2).

Table 3.

Performance Evaluation Metrics and Confusion Matrix for ML Models in Self-reported Diabetes Diagnosis.

Algorithms performance evaluation metrics
Metric	LR	SVM	KNN	DT	RF	GB	XGB
Accuracy	0.730	0.791	0.576	0.736	0.813	0.815	0.818
Precision	0.378	0.422	0.279	0.331	0.477	0.484	0.500
Recall	0.746	0.395	0.844	0.439	0.298	0.302	0.273
F1-score	0.502	0.408	0.420	0.377	0.366	0.372	0.353
Balanced Accuracy	0.737	0.637	0.680	0.621	0.612	0.615	0.606
AUROC	0.785	0.761	0.738	0.626	0.775	0.796	0.776
PR-AUC	0.445	0.392	0.340	0.253	0.409	0.447	0.402
Algorithms confusion matrix
Metric	LR	SVM	KNN	DT	RF	GB	XGB
TP	153	81	173	90	61	62	56
FP	252	111	446	182	67	66	56
FN	52	124	32	115	144	143	149
TN	670	811	476	740	855	856	866

Abbreviations: AUROC, area under the receiver operating characteristic curve; DT, decision tree; FN, false negative; FP, false positive; GB, gradient boosting; KNN, K-nearest neighbors; LR, logistic regression; PR-AUC, Precision-recall area under the curve; RF, random forest; SVM, support vector machine; TN, true negative; TP, true positive; XGB, extreme gradient boosting (XGBoost).

Metrics were computed on the non-resampled test dataset after model training on the SMOTE-balanced training set.
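
The threshold-based metrics in Table 3 follow directly from the confusion-matrix counts. The sketch below recomputes them for three of the models; the function name is illustrative, and the formulas are the standard definitions rather than a reproduction of the authors' code.

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute threshold-based metrics from binary confusion-matrix counts."""
    total = tp + fp + fn + tn
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)            # sensitivity
    specificity = tn / (tn + fp)
    return {
        "accuracy": (tp + tn) / total,
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
        "balanced_accuracy": (recall + specificity) / 2,
    }


# (TP, FP, FN, TN) copied from the confusion-matrix rows of Table 3.
table3_counts = {
    "LR": (153, 252, 52, 670),
    "GB": (62, 66, 143, 856),
    "XGB": (56, 56, 149, 866),
}

for name, counts in table3_counts.items():
    metrics = classification_metrics(*counts)
    print(name, {key: round(value, 3) for key, value in metrics.items()})
```

Each column sums to 1127 test respondents, 205 of whom reported diabetes. Because accuracy is dominated by the 922 negatives, balanced accuracy and F1 give a different ranking of the models than accuracy does.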

Figure 2.

Model evaluation metrics for all models.

AUROC Curve Analysis

The Receiver Operating Characteristic (ROC) curve is a widely used tool to evaluate the diagnostic performance of classification models by plotting the true positive rate (sensitivity) against the false positive rate (1 – specificity) across various decision thresholds. The Area Under the ROC Curve (AUROC) provides a single measure of a model’s ability to discriminate between individuals with and without self-reported diabetes. As shown in Figure 3, the ROC curves for the 7 machine learning algorithms demonstrate varying degrees of discriminatory power. The Gradient Boosting (GB) model achieved the highest AUROC value (0.796), followed closely by Logistic Regression (0.785), XGBoost (0.776), and Random Forest (0.775), all indicating good overall classification performance. Support Vector Machine (SVM) achieved an AUROC of 0.761, while K-Nearest Neighbors (KNN) yielded a moderate value of 0.738. The Decision Tree (DT) model showed the lowest AUROC (0.626), reflecting relatively weaker discrimination between diabetes and non-diabetes cases. Overall, GB provided the best discrimination, with Logistic Regression, XGB, and RF within about 0.02 of its AUROC, whereas the single Decision Tree lagged well behind. These results indicate that ensemble learning approaches provide robust discrimination for identifying self-reported diabetes cases, with logistic regression remaining a competitive baseline.
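
To make the AUROC definition concrete, the sketch below computes it two ways on a small made-up example: as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, and as the trapezoidal area under the ROC points. The labels and scores are invented for illustration and are not study data.

```python
def auroc(y_true, y_score):
    """AUROC as the probability that a random positive outranks a random negative.

    Ties between a positive and a negative score count as one half.
    """
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def roc_points(y_true, y_score):
    """Return (FPR, TPR) pairs obtained by sweeping the threshold over the scores."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    points = [(0.0, 0.0)]
    for threshold in sorted(set(y_score), reverse=True):
        tp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 1)
        fp = sum(1 for y, s in zip(y_true, y_score) if s >= threshold and y == 0)
        points.append((fp / n_neg, tp / n_pos))
    return points


def trapezoid_area(points):
    """Area under a piecewise-linear curve given as (x, y) points sorted by x."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


# Made-up labels and scores, for illustration only (not study data).
y_true = [1, 0, 1, 0, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.4, 0.35, 0.3, 0.2, 0.1]

print(auroc(y_true, y_score))
print(trapezoid_area(roc_points(y_true, y_score)))
```

Both routes give the same number, which is why AUROC can be read as a ranking measure that does not depend on any single decision threshold.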

Figure 3.

ROC curve for all models.

Feature Importance Analysis Using SHAP

To interpret how individual predictors influenced the likelihood of self-reported diabetes, SHapley Additive exPlanations (SHAP) analysis was applied to the best-performing model. SHAP provides a unified approach for interpreting complex machine learning models by quantifying each feature’s contribution to the prediction outcome. As illustrated in Figure 4, the SHAP beeswarm plot displays both the magnitude and direction of each feature’s impact on model predictions. Each dot represents a single observation, with color indicating the feature value (blue = low, red = high). Among all predictors, high blood pressure, high cholesterol, and BMI category exerted the strongest positive influence on diabetes prediction: higher values of these features corresponded with a greater likelihood of being classified as diabetic. Other influential factors included income group, age group, self-reported general health, and comorbidity count, reflecting the interplay of clinical and sociodemographic determinants in diabetes risk.
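
The idea behind a SHAP value can be sketched with an exact Shapley computation on a toy model. The risk function, its coefficients, and the feature names below are hypothetical and are not the study's Gradient Boosting model; the brute-force enumeration over feature orderings is feasible only for a handful of features, whereas the SHAP library relies on faster, model-specific algorithms.

```python
from itertools import permutations


def toy_risk_model(features):
    """Hypothetical risk score with an interaction term (not the study's model)."""
    hbp, chol, obese = features["hbp"], features["chol"], features["obese"]
    return 0.05 + 0.20 * hbp + 0.15 * chol + 0.10 * obese + 0.10 * hbp * obese


def shapley_values(model, instance, baseline):
    """Exact Shapley values: average marginal contribution over all feature orderings.

    Features not yet switched on in an ordering keep their baseline value.
    """
    names = list(instance)
    contributions = {name: 0.0 for name in names}
    orderings = list(permutations(names))
    for ordering in orderings:
        current = dict(baseline)
        previous_output = model(current)
        for name in ordering:
            current[name] = instance[name]
            output = model(current)
            contributions[name] += output - previous_output
            previous_output = output
    return {name: total / len(orderings) for name, total in contributions.items()}


baseline = {"hbp": 0, "chol": 0, "obese": 0}   # reference respondent
instance = {"hbp": 1, "chol": 0, "obese": 1}   # respondent being explained

phi = shapley_values(toy_risk_model, instance, baseline)
print(phi)
# Additivity: baseline output plus all contributions equals the prediction.
print(toy_risk_model(baseline) + sum(phi.values()), toy_risk_model(instance))
```

In this toy case the interaction between high blood pressure and obesity is split evenly between the two features, and a feature that matches the baseline receives a contribution of zero. A beeswarm plot displays one such contribution per feature per respondent.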

Figure 4.

SHAP feature importance Beeswarm plot.

Figure 5 presents the SHAP bar plot ranking features by their average absolute SHAP value, summarizing their overall contribution to the model. Consistent with the beeswarm plot, high blood pressure emerged as the dominant predictor, followed by high cholesterol, BMI category, income group, and age group. These results highlight that both biomedical risk factors (e.g., hypertension, hypercholesterolemia, obesity) and social determinants (e.g., income, education, and general health perception) are critical in predicting self-reported diabetes (Figures 4 and 5).

Figure 5.

SHAP feature importance bar plot.

Discussion

This study applied both traditional regression and explainable machine learning (ML) techniques to identify key predictors of self-reported diabetes among adults in Tennessee using the 2023 BRFSS dataset. By integrating weighted statistical modeling with ML predictive analytics, the study provides a nuanced understanding of how biomedical, behavioral, and sociodemographic factors interact to influence diabetes risk in the state’s adult population. The multifactorial and non-linear nature of the disease’s risk factors poses significant challenges for conventional epidemiological approaches.²⁵ By leveraging machine learning algorithms, this work establishes a robust framework for analyzing population health patterns, particularly in contexts involving numerous interrelated variables. The findings emphasize the practicality and effectiveness of machine learning in identifying diabetes risk factors, especially in situations where traditional statistical models may fail to capture complex, non-linear interactions.^25-28 This study identified significant, non-linear relationships between demographic attributes, lifestyle habits, and clinical measurements, all of which are critical for understanding and predicting diabetes. The capacity of machine learning to reveal hidden patterns supports growing evidence that integrating artificial intelligence into public health research can enhance both predictive accuracy and analytical depth.²⁹

An important contribution of this study is the identification of key predictors of self-reported diabetes among adults in Tennessee. The results highlight the continued importance of established clinical and behavioral risk factors for diabetes. Both the weighted logistic regression and ML models identified high blood pressure, high cholesterol, and body mass index (BMI) as the most significant predictors of self-reported diabetes, consistent with findings from prior U.S. and international studies.^21,30,31 This agreement across traditional and ML frameworks strengthens confidence in the robustness of the identified risk factors. Individuals reporting hypertension or hypercholesterolemia had significantly higher odds of diabetes, reflecting the well-documented metabolic linkages among these conditions.^30,32 Additionally, poorer self-rated health, the presence of multiple comorbidities, and limited physical activity were associated with higher diabetes prevalence, underscoring the multifactorial nature of the disease. Interestingly, the study observed inverse associations between smoking or alcohol use and diabetes, which contrasts with established scientific evidence. This discrepancy is likely attributable to reverse causation: individuals diagnosed with diabetes may quit smoking or drinking, either on medical advice or out of personal health concerns, and subsequently report themselves as non-smokers or abstainers. Similarly, the observed association between high blood pressure and diabetes may partly reflect reverse causality, as the metabolic and cardiovascular consequences of diabetes can contribute to the development of hypertension.

From the ML analysis, the ensemble-based models, Gradient Boosting (GB), XGBoost (XGB), and Random Forest (RF), achieved the highest accuracy and strong discrimination (AUROC range: 0.775-0.796), outperforming K-Nearest Neighbors and Decision Tree, while Logistic Regression remained competitive on AUROC (0.785). This aligns with evidence that ensemble methods excel in capturing nonlinear relationships and complex feature interactions inherent in behavioral health data.^13,29,30 The SHAP analysis further confirmed the dominant role of high blood pressure and cholesterol as the most influential predictors, followed by BMI, income, and age group. This approach addresses the common “black-box” critique of machine learning by producing interpretable results that can inform both clinical and public health interventions.^17,18 These findings highlight the need for integrated prevention strategies that combine medical risk screening with targeted interventions addressing social determinants of health.

While the weighted logistic regression model provided statistically interpretable associations and population-level inference, the machine learning approach, particularly the Gradient Boosting (GB) model interpreted through SHAP values, offered superior predictive flexibility and deeper insight into nonlinear relationships between risk factors. Traditional regression assumes linearity, additivity, and independence among predictors, which can oversimplify the complex interplay of biological and behavioral determinants of diabetes.^26,33,34 In contrast, ensemble tree-based models like GB can capture nonlinear and high-order interactions without prespecified functional forms, making them more effective in heterogeneous population health data.^33,35-37 The SHAP framework further enhances interpretability by quantifying feature-level contributions in a manner consistent with game theory, bridging the transparency gap often criticized in ML models.^17,38-40 In this study, while both approaches consistently identified hypertension, high cholesterol, and obesity as key predictors, the GB-SHAP model provided a more nuanced understanding of how socioeconomic variables such as income, education, and self-rated health interact with clinical factors to shape diabetes risk. This capability is particularly relevant in public health contexts, where understanding complex risk patterns supports targeted screening, resource allocation, and equity-focused intervention design.⁴¹ Thus, integrating explainable machine learning with traditional epidemiologic modeling represents a promising pathway toward more precise, data-driven public health strategies.
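
The way boosted trees fit a nonlinear pattern without a prespecified functional form can be sketched with one-split regression stumps. This is a minimal illustration that assumes squared-error loss and a single made-up predictor; it is not the classifier, loss function, or hyperparameters used in the study, none of which are given in this section.

```python
def fit_stump(xs, residuals):
    """Find the single-split regression stump that minimizes squared error."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        error = (sum((r - left_mean) ** 2 for r in left)
                 + sum((r - right_mean) ** 2 for r in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    return best[1:]


def fit_gradient_boosting(xs, ys, n_rounds=50, learning_rate=0.3):
    """Boosting for squared loss: each stump is fitted to the current residuals."""
    base = sum(ys) / len(ys)
    predictions = [base] * len(ys)
    stumps = []
    for _ in range(n_rounds):
        residuals = [y - p for y, p in zip(ys, predictions)]
        threshold, left_value, right_value = fit_stump(xs, residuals)
        stumps.append((threshold, left_value, right_value))
        predictions = [
            p + learning_rate * (left_value if x <= threshold else right_value)
            for x, p in zip(xs, predictions)
        ]
    return base, learning_rate, stumps


def predict(model, x):
    """Sum the shrunken stump outputs on top of the constant base prediction."""
    base, learning_rate, stumps = model
    return base + learning_rate * sum(
        left if x <= threshold else right for threshold, left, right in stumps
    )


def mean_squared_error(model, xs, ys):
    return sum((predict(model, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Made-up U-shaped data that a single straight line cannot fit (illustration only).
xs = [0, 1, 2, 3, 4, 5, 6, 7, 8]
ys = [(x - 4) ** 2 for x in xs]

weak = fit_gradient_boosting(xs, ys, n_rounds=1)
strong = fit_gradient_boosting(xs, ys, n_rounds=50)
print(mean_squared_error(weak, xs, ys), mean_squared_error(strong, xs, ys))
```

Each round fits a weak learner to what the ensemble still gets wrong, so the training error falls as rounds are added. The same mechanism makes held-out evaluation necessary to detect overfitting.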

The superior performance of the Gradient Boosting (GB) model in this study reflects its ability to iteratively minimize prediction errors by combining multiple weak learners into a robust ensemble, optimizing both bias and variance in complex datasets. This property is particularly advantageous for chronic disease prediction, where the relationships between variables such as hypertension, BMI, and socioeconomic status are rarely linear or additive. Model evaluation in this study used accuracy, precision, recall, F1-score, balanced accuracy, AUROC, and PR-AUC, which together provide a comprehensive assessment of model performance. While accuracy gives an overall measure of correctness, it can be misleading in class-imbalanced data, such as diabetes prevalence studies.²⁴ In this public health context, precision and recall are more informative: precision reflects the model’s ability to avoid false positives, while recall measures how well true diabetes cases are detected.^24,42 The F1-score, as the harmonic mean of precision and recall, balances these competing goals, but it assumes equal importance of both metrics.²⁴ The PR-AUC, however, offers a more discriminating evaluation under imbalance by focusing directly on the trade-off between precision and recall.^42,43 Given the low prevalence of diabetes in the dataset, PR-AUC provides a clearer measure of real-world clinical utility, highlighting the model’s ability to correctly identify high-risk individuals without excessive false alarms. Thus, GB’s PR-AUC and AUROC values, the highest among the tested models, indicate good discrimination across decision thresholds, a quality critical for effective public health screening and early intervention, although its low recall at the default threshold (0.302) means that many cases would be missed without threshold adjustment.
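
One way to read a PR-AUC value is against the no-skill baseline, which is approximately the prevalence of the positive class. The sketch below derives that baseline from the Table 3 counts and shows average precision, one common estimator of PR-AUC, on a made-up ranking. The source does not state which estimator the authors used, so the function is an assumption for illustration.

```python
def average_precision(y_true, y_score):
    """Average precision: mean of the precision values at each true positive,
    scanning cases from the highest score to the lowest (assumes no tied scores
    and at least one positive case)."""
    ranked = sorted(zip(y_score, y_true), reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits


# No-skill baseline: the share of positives in the test set.
# Counts come from the GB column of Table 3 (positives = TP + FN).
tp, fp, fn, tn = 62, 66, 143, 856
prevalence = (tp + fn) / (tp + fp + fn + tn)
print(f"Test-set prevalence (no-skill PR-AUC): {prevalence:.3f}")
print(f"Reported GB PR-AUC relative to baseline: {0.447 / prevalence:.1f}x")

# Made-up ranking for illustration only (not study data).
y_true = [1, 0, 1, 0, 0, 0, 1, 0, 0, 0]
y_score = [0.95, 0.90, 0.80, 0.70, 0.60, 0.50, 0.40, 0.30, 0.20, 0.10]
print(f"Toy average precision: {average_precision(y_true, y_score):.3f}")
```

Unlike AUROC, whose no-skill value is always 0.5, the PR-AUC baseline moves with prevalence, so PR-AUC values are comparable only between models evaluated on the same test set.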

Beyond its empirical findings, this study offers a structured methodological framework for applying artificial intelligence to public health surveillance and chronic disease research, specifically diabetes. The framework includes key stages such as comprehensive data preprocessing, application of weighted logistic regression, careful model selection, comparative performance evaluation, and post-model interpretability through SHAP analysis. This systematic approach provides a practical roadmap for integrating machine learning into large-scale health data analysis, particularly in contexts where comprehensive datasets like the BRFSS are underutilized for predictive modeling. The successful implementation of this pipeline demonstrates the potential of AI to strengthen timely, evidence-driven decision-making in diabetes prevention and management, especially in settings where traditional epidemiological methods may fall short in capturing complex, non-linear risk interactions.

Limitations

While this study provides valuable insights into the application of machine learning for predicting diabetes risk, several limitations should be acknowledged. First, the cross-sectional nature of the BRFSS data prevents the establishment of causal relationships between the identified predictors and diabetes outcomes. The associations observed cannot confirm temporal directionality. Second, the reliance on self-reported information introduces potential reporting biases, including recall error and social desirability bias, particularly for sensitive behaviors such as alcohol consumption and smoking. Third, the analysis was limited to variables included in the BRFSS dataset, which may have excluded other important biological, psychosocial, or environmental determinants of diabetes, such as dietary quality, healthcare access, or neighborhood-level factors. Finally, although the models demonstrated strong performance during internal validation, they were not tested on external datasets, limiting the ability to generalize findings to other populations.

Future Directions

Future research should build upon this work by incorporating longitudinal data, including diverse geographic and demographic groups, integrating electronic health records and additional social determinants data, and testing the predictive stability of machine learning models over time. Moreover, exploring more advanced approaches, such as deep neural networks or ensemble meta-modeling, may further enhance predictive accuracy and broaden the applicability of AI in chronic disease surveillance. External validation with independent datasets will be critical to confirm the robustness and generalizability of these predictive models across diverse populations. Additionally, deploying this framework in real-time diabetes risk prediction dashboards could enhance public health decision support, particularly in high-burden regions such as Tennessee.

Conclusion

This study demonstrated the strong potential of machine learning algorithms alongside traditional statistical methods in predicting self-reported diabetes among Tennessee adults using the 2023 BRFSS dataset. The integration of SHAP analysis enhanced model interpretability, providing case-level insights into how individual predictors influence diabetes risk and highlighting nuanced interactions between lifestyle, metabolic, and demographic factors. These results illustrate the value of combining these approaches to strengthen chronic disease surveillance and inform precision public health interventions such as diabetes prevention and management strategies.

Supplemental Material

sj-docx-1-jpc-10.1177_21501319251400546 – Supplemental material for Exploring Explainable Machine Learning for Predicting and Interpreting Self-Reported Diabetes among Tennessee Adults: Insights from the 2023 Behavioral Risk Factor Surveillance System (BRFSS)

Supplemental material, sj-docx-1-jpc-10.1177_21501319251400546 for Exploring Explainable Machine Learning for Predicting and Interpreting Self-Reported Diabetes among Tennessee Adults: Insights from the 2023 Behavioral Risk Factor Surveillance System (BRFSS) by Mustapha Aliyu Muhammad, Jamilu Sani and Mohamed Mustaf Ahmed in Journal of Primary Care & Community Health

Footnotes

Author Contributions

MAM conceptualized the study, developed the methodology, and performed the formal analysis. JS & MMA assisted with data curation, implemented the software, and supported model evaluation. All authors contributed to the manuscript writing, reviewed the final draft, and approved the submitted version.

ORCID iDs

Mustapha Aliyu Muhammad

Mohamed Mustaf Ahmed

Consent for Publication

Not applicable.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

The dataset analyzed during the current study is publicly available through the Centers for Disease Control and Prevention (CDC) Behavioral Risk Factor Surveillance System (BRFSS) 2023 annual data release:

Supplemental Material

Supplemental material for this article is available online.

References

1. Suryakant Rawat. A classification system for diabetic patients with machine learning techniques. Int J Math Eng Manag Sci. 2019;4(3):729-744. doi:10.33889/IJMEMS.2019.4.3-057

2. Kaur, Kumari. Predictive modelling and analytics for diabetes using a machine learning approach. Appl Comput Inform. 2020;18(1-2):90-100. doi:10.1016/j.aci.2018.12.004

3. American Diabetes Association. Diagnosis and classification of diabetes mellitus. Diabetes Care. 2009;32(Supplement_1):S62-S67. doi:10.2337/dc09-S062

4. Dulyapach, Ngamchaliew, Vichitkunakorn, Sornsenee, Choomalee. Prevalence and associated factors of delayed diagnosis of type 2 diabetes mellitus in a tertiary hospital: a retrospective cohort study. Int J Public Health. 2022;67:1605039. doi:10.3389/ijph.2022.1605039

5. Mohanty, Parida, Nayak, Pati, Panigrahi CR. Study and impact analysis of machine learning approaches for smart healthcare in predicting mellitus diabetes on clinical data. In: Pattnaik, Vaidya, Mohanty, Hol, eds. Smart Healthcare Analytics: State of the Art. Springer; 2022:75-101. doi:10.1007/978-981-16-5304-9_7

6. Xie, Nikolayeva, Luo. Building risk prediction models for type 2 diabetes using machine learning techniques. Prev Chronic Dis. 2019;16:E130. doi:10.5888/pcd16.190109

7. Magliano, Boyko, IDF Diabetes Atlas 10th edition scientific committee. IDF DIABETES ATLAS. 10th ed. International Diabetes Federation; 2021. Accessed August 10, 2025. http://www.ncbi.nlm.nih.gov/books/NBK581934/

8. Centers for Disease Control and Prevention. National Diabetes Statistics Report. Diabetes. 2024. Accessed August 10, 2025. https://www.cdc.gov/diabetes/php/data-research/index.html

9. American Diabetes Association. Economic costs of diabetes in the U.S. in 2017. Diabetes Care. 2018;41(5):917-928. doi:10.2337/dci18-0007

10. Cardozo, Pintarelli, Andreis, Lopes ACW, Marques JLB. Use of machine learning and routine laboratory tests for diabetes mellitus screening. BioMed Res Int. 2022;2022(1):8114049. doi:10.1155/2022/8114049

11. Centers for Disease Control and Prevention. Behavioral Risk Factor Surveillance System. 2025. Accessed July 13, 2025. https://www.cdc.gov/brfss/index.html

12. Saxena, Sharma, Gupta. Analysis of machine learning algorithms in diabetes mellitus prediction. J Phys: Conf Ser. 2021;1921(1):012073. doi:10.1088/1742-6596/1921/1/012073

13. Olisah, Smith. Diabetes mellitus prediction and diagnosis from a data preprocessing and machine learning perspective. Comput Methods Programs Biomed. 2022;220:106773. doi:10.1016/j.cmpb.2022.106773

14. Permana BAC, Ahmad, Bahtiar, Sudianto, Gunawan. Classification of diabetes disease using decision tree algorithm (C4.5). J Phys: Conf Ser. 2021;1869(1):012082. doi:10.1088/1742-6596/1869/1/012082

15. Geetha, Prasad KM. A Hybrid ensemble machine learning approach to predict type 2 diabetes mellitus. Webology. 2021;18(2):311-331. doi:10.14704/WEB/V18SI02/WEB1807

16. Zou, Luo, Yin, Tang. Predicting diabetes mellitus with machine learning techniques. Front Genet. 2018;9:515. doi:10.3389/fgene.2018.00515

17. Lundberg, Lee SI. A unified approach to interpreting model predictions. Paper presented at: Proceedings of the 31st International Conference on Neural Information Processing Systems. NIPS’17. Curran Associates Inc.; December 4-9, 2017; Long Beach, CA, USA:4768-4777. Accessed August 10, 2025. https://proceedings.neurips.cc/paper/2017/file/8a20a8621978632d76c43dfd28b67767-Paper.pdf

18. Nohara, Matsumoto, Soejima, Nakashima. Explanation of machine learning models using shapley additive explanation and application for real data in hospital. Comput Methods Programs Biomed. 2022;214:106584. doi:10.1016/j.cmpb.2021.106584

19. Lugner, Rawshani, Helleryd, Eliasson. Identifying top ten predictors of type 2 diabetes through machine learning analysis of UK Biobank data. Sci Rep. 2024;14(1):2102. doi:10.1038/s41598-024-52023-5

20. Hill-Briggs, Adler, Berkowitz, et al. Social determinants of health and diabetes: a scientific review. Diabetes Care. 2020;44(1):258-279. doi:10.2337/dci20-0053

21. Kyrou, Tsigos, Mavrogianni, et al. Sociodemographic and lifestyle-related risk factors for identifying vulnerable groups for type 2 diabetes: a narrative review with emphasis on data from Europe. BMC Endocr Disord. 2020;20(1):134. doi:10.1186/s12902-019-0463-3

22. Blazek, van Zwieten, Saglimbene, Teixeira-Pinto. A practical guide to multiple imputation of missing data in nephrology. Kidney Int. 2021;99(1):68-74. doi:10.1016/j.kint.2020.07.035

23. Mickey, Greenland. The impact of confounder selection criteria on effect estimation. Am J Epidemiol. 1989;129(1):125-137. doi:10.1093/oxfordjournals.aje.a115101

24. Sadeghi, Khalili, Ramezankhani, Mansournia, Parsaeian. Diabetes mellitus risk prediction in the presence of class imbalance using flexible machine learning methods. BMC Med Inform Decis Mak. 2022;22:36. doi:10.1186/s12911-022-01775-z

25. Rajula HSR, Verlato, Manchia, Antonucci, Fanos. Comparison of conventional statistical methods with machine learning in medicine: diagnosis, drug development, and treatment. Medicina. 2020;56(9):455. doi:10.3390/medicina56090455

26. Deo RC. Machine learning in medicine. Circulation. 2015;132(20):1920-1930. doi:10.1161/CIRCULATIONAHA.115.001593

27. Yland, Wang, Zad, et al. Predictive models of pregnancy based on data from a preconception cohort study. Hum Reprod. 2022;37(3):565-576. doi:10.1093/humrep/deab280

28. Chen, Chamouni, Wang. Integrating machine learning and artificial intelligence in life-course epidemiology: pathways to innovative public health solutions. BMC Med. 2024;22(1):354. doi:10.1186/s12916-024-03566-x

29. Alowais, Alghamdi, Alsuhebany, et al. Revolutionizing healthcare: the role of artificial intelligence in clinical practice. BMC Med Educ. 2023;23(1):689. doi:10.1186/s12909-023-04698-z

30. Farran, AlWotayan, Alkandari, Al-Abdulrazzaq, Channanath, Thanaraj TA. Use of non-invasive parameters and machine-learning algorithms for predicting future risk of type 2 diabetes: a retrospective cohort study of health data from Kuwait. Front Endocrinol. 2019;10:624. doi:10.3389/fendo.2019.00624

31. Deng, Moniruzzaman, Rogers, Jagannathan, Tamura. Unveiling inequalities: racial, ethnic, and socioeconomic disparities in diabetes: findings from the 2007-2020 NHANES data among U.S. adults. Prevent Med Rep. 2025;50:102957. doi:10.1016/j.pmedr.2024.102957

32. Marwa, Mohamed HKM, Said. Diabetic mellitus prediction with BRFSS datasets. J Theor Appl Inf Technol. 2024;102(3):883-897.

33. Nisbet, Miner, Yale. Chapter 9 - classification. In: Nisbet, Miner, Yale, eds. Handbook of Statistical Analysis and Data Mining Applications. 2nd ed. Academic Press; 2018:169-186. doi:10.1016/B978-0-12-416632-5.00009-8

34. Shatte ABR, Hutchinson, Teague. Machine learning in mental health: a scoping review of methods and applications. Psychol Med. 2019;49(9):1426-1448. doi:10.1017/S0033291719000151

35. Mirzaei, Adeli. Machine learning techniques for diagnosis of Alzheimer disease, mild cognitive disorder, and other types of dementia. Biomed Signal Process Control. 2022;72:103293. doi:10.1016/j.bspc.2021.103293

36. Alqahtani SAM, Alobaid, Alshammari, et al. Feature importance and model performance for prediabetes prediction: a comparative study. J King Saud Univ Sci. 2024;36(11):103583. doi:10.1016/j.jksus.2024.103583

37. Christodoulou, Collins, Steyerberg, Verbakel, Van Calster. A systematic review shows no performance benefit of machine learning over logistic regression for clinical prediction models. J Clin Epidemiol. 2019;110:12-22. doi:10.1016/j.jclinepi.2019.02.004

38. GeeksforGeeks. SHAP: A Comprehensive Guide to SHapley Additive exPlanations. GeeksforGeeks. 2025. Accessed August 14, 2025. https://www.geeksforgeeks.org/machine-learning/shap-a-comprehensive-guide-to-shapley-additive-explanations/

39. Linardatos, Papastefanopoulos, Kotsiantis. Explainable AI: a review of machine learning interpretability methods. Entropy. 2020;23(1):18. doi:10.3390/e23010018

40. Orsini, Moore, Wolk. Interaction analysis based on shapley values and extreme gradient boosting: a realistic simulation and application to a large epidemiological prospective study. Front Nutr. 2022;9:871768. doi:10.3389/fnut.2022.871768

41. Weigard, Spencer RJ. Benefits and challenges of using logistic regression to assess neuropsychological performance validity: evidence from a simulation study. Clin Neuropsychol. 2023;37(1):34-59. doi:10.1080/13854046.2021.2023650

42. Saito, Rehmsmeier. The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLoS ONE. 2015;10(3):e0118432. doi:10.1371/journal.pone.0118432

43. Richardson, Trevizani, Greenbaum, Carter, Nielsen, Peters. The receiver operating characteristic curve accurately assesses imbalanced datasets. Patterns. 2024;5(6):100994. doi:10.1016/j.patter.2024.100994
