Abstract
Objective
Rheumatoid arthritis (RA) and diabetes mellitus (DM) frequently coexist, yet the heterogeneity of RA-DM multimorbidity remains unclear. This study aims to develop an interpretable machine learning framework to reveal the phenotypic subgroups of RA-DM multimorbidity, providing a potential direction for precision public health interventions.
Methods
Utilizing data from the National Health and Nutrition Examination Survey (1999–2018), we developed a Bayesian-optimized eXtreme Gradient Boosting (XGBoost) model to classify RA-DM multimorbidity status and compared with other machine learning models. Shapley Additive Explanations (SHAP) was applied to interpret the optimal model and quantify the contributions of different features. A dual-clustering approach combining Self-Organizing Maps and K-means was used to identify RA-DM phenotypic subgroups with different feature contribution patterns based on SHAP profiles.
Results
The optimized XGBoost model achieved the best classification performance, outperforming other models such as K-nearest neighbors, support vector machine and logistic regression. SHAP analysis identified nine key contributing features (homocysteine, age, glucose, etc), and revealed non-linear interactions among the features. The dual-clustering based on SHAP values identified four distinct RA-DM phenotypes—inflammatory, metabolically protective, age-related and non-obese protective—each exhibiting unique clinical and biochemical patterns.
Conclusion
This study established an interpretable machine learning framework for identifying distinct phenotypes of RA-DM multimorbidity. These findings provide a data-driven basis for targeted interventions in precision public health, while offering a transferable paradigm for phenotype discovery in other multimorbid conditions.
Keywords
1. Introduction
Rheumatoid arthritis (RA) and diabetes mellitus (DM) are two of the most prevalent chronic conditions worldwide, both contributing substantially to morbidity, mortality, and health system burden. According to the Global Burden of Disease (GBD) study, their global prevalence continues to rise, with the number of individuals affected projected to reach 31.7 million for RA and 1.3 billion for DM by 2050.1,2 Importantly, growing evidence indicates a bidirectional association between the two diseases—patients with RA are at a markedly higher risk of developing DM compared with the general population. 3 This intersection gives rise to RA-DM multimorbidity, defined as the coexistence of RA and DM in the same individual, which complicates disease management and leads to treatment challenges.4–6
Given the complex interplay between autoimmune and metabolic pathways involved in RA and DM, it is plausible that individuals with concurrent RA-DM multimorbidity represent a heterogeneous population with distinct pathophysiological profiles. However, existing studies have largely investigated RA and DM as separate entities, focusing on single-disease risk factors rather than integrated multimorbidity profiles.7,8 As a result, little is known about the phenotypic diversity within RA-DM multimorbidity or the individualized interventions required for its management. Identifying phenotypic subgroups could illuminate differential biological mechanisms, uncover potential therapeutic targets, and provide appropriate intervention directions for specific populations.
Traditional statistical models—such as linear regression or Cox survival analysis—have been applied to RA or DM prediction, but these approaches are often limited by linearity assumptions, oversimplified interactions, and insufficient predictive accuracy. For example, correlation-based analyses can only capture direct linear relationships, 9 while regression-based models often fail to account for nonlinear dynamics or high-dimensional feature dependencies. 10 Thus, there is an urgent need for personalized and multidimensional analytic frameworks that can capture complex interactions and heterogeneity across populations with multimorbidity.
Recent advances in machine learning (ML) provide powerful tools for modeling such complexity. Algorithms like Random Forests (RF) and Support Vector Machine (SVM) can integrate heterogeneous data sources and automatically learn nonlinear relationships, offering higher classification accuracy and greater flexibility than traditional methods.11–15 Nevertheless, their clinical application remains constrained by the “black-box” nature of most models, which hinders interpretability and limits trust among clinicians. 16 To address this, interpretable ML frameworks—notably those combining modeling with post-hoc explanation methods—have emerged as promising solutions. These frameworks not only quantify the contribution of each feature to an outcome but also provide individual-level explanations, supporting transparent and data-driven decision-making in clinical practice. By enabling the identification of subgroups with distinct feature contribution patterns, such interpretable approaches can overcome the limitations of universal models and facilitate precision interventions tailored to heterogeneous populations.
Against this background, our study aimed to develop an interpretable ML framework to disentangle the heterogeneity of RA-DM multimorbidity. Using an optimized XGBoost model and the SHAP framework, we achieved accurate classification of multimorbidity status and quantified the model-based contribution of each feature. Building on these results, we employed a dual-clustering strategy integrating Self-Organizing Maps (SOM) and K-means to identify distinct multimorbidity phenotypes by clustering the feature contributions. This integrative framework bridges modeling analytics and interpretability, offering novel insights into the phenotypic stratification of RA-DM multimorbidity and providing a scientific basis for precision public health interventions. Moreover, the proposed analytical pipeline serves as a transferable framework that can be adapted for phenotype identification in other multimorbid or complex chronic disease settings, broadening its applicability beyond RA-DM.
2. Methods
2.1. Data source and study population
This was a cross-sectional observational study in the United States between 1999 and 2018. Data were obtained from the National Health and Nutrition Examination Survey (NHANES), a nationally representative program designed to assess the health and nutritional status of the U.S. population through standardized interviews and physical examinations. 17 The NHANES was approved by the Ethics Review Board of the National Center for Health Statistics, and all participants provided written informed consent. This study was a secondary analysis of publicly available NHANES data, and no additional permission was required. Specifically, we analyzed the serial survey data from 1999 to 2018, which included approximately 5,000 participants per survey cycle.
In terms of sample selection, individuals aged below 20 years were excluded, given the low prevalence of rheumatoid arthritis (RA) and diabetes mellitus (DM) in younger populations.
18
Samples with a feature missing rate above 70% were also removed to minimize bias and improve data quality.
19
RA and DM were defined using self-reported physician diagnosis recorded in the NHANES questionnaire, which was widely used as a basis for disease classification in large epidemiological surveys.20,21 These definitions were used for phenotypic pattern discovery in a population-based survey setting rather than for clinical diagnosis or prevalence estimation. The outcome definition in this study was not based on medication use or laboratory thresholds. In addition, the NHANES questionnaire variable used here does not distinguish type 1 from type 2 diabetes; therefore, DM in this study included both types. On this basis, participants with both RA and DM were categorized as the case group (RA-DM multimorbidity), while those without either condition served as the control group. Finally, a total of 3,451 eligible participants were included, comprising 1,001 cases with RA-DM multimorbidity and 2,450 controls without multimorbidity (Figure 1(a)). Flowchart of sample screening and methodology.
2.2. Feature selection and data preprocessing
A total of 45 variables were extracted and classified into four domains (Figure 1(b)): (1) Biochemical indicators (e.g., alanine aminotransferase, serum creatinine); (2) Hematological indicators (e.g., lymphocyte count, mean corpuscular hemoglobin concentration); (3) Anthropometric indicators (e.g., triceps skinfold thickness, Body Mass Index (BMI)); (4) Sociodemographic indicators (e.g., age, sex, education level). Feature selection was informed by clinical relevance, data availability, and prior evidence linking these indicators to RA or DM.22,23 Supplementary Table A.1 provides detailed descriptions of all features and their corresponding abbreviations.
For samples with a feature missing rate below 70%, the XGBoost algorithm’s intrinsic mechanism for handling missingness was applied, while mean imputation or mode imputation was used for other ML models. 24 As shown in Figure 1(b), the Least Absolute Shrinkage and Selection Operator (LASSO) regression was used for feature selection. Compared with traditional methods such as stepwise regression, LASSO reduces noise interference in high-dimensional feature spaces and adjusts model complexity through regularization parameters, thereby enhancing model parsimony and generalization ability. 25 Specifically, we used five-fold cross-validation and grid search to determine the optimal regularization parameter, and variables with nonzero coefficients were retained as the final features.
2.3. Model construction and validation
To achieve precise classification of multimorbidity conditions, we innovatively developed an optimized XGBoost model. Compared with the traditional gradient boosting framework, XGBoost incorporates second-order Taylor expansion of the loss function and regularization terms, thereby enhancing convergence speed and accuracy while preventing overfitting.26,27 To further optimize performance, we employed Bayesian optimization with five-fold cross-validation for hyperparameter tuning (with initial random sampling points set to 5 and iteration count to 30), which iteratively estimated the posterior distribution of the objective function to approach the optimal parameter combination. To validate the performance of the optimized model (hereafter XGBoost Tuned), we developed additional ML classifiers for comparison (Figure 1(b)): K-nearest neighbors (KNN), support vector machine (SVM) and logistic regression (LR).
In this study, the dataset was randomly split into training (70%) and test (30%) sets for all models, which also served as the basis for subsequent feature importance analysis. Specifically, the training set consisted of 2,415 participants (679 cases and 1,736 controls), and the test set contained 1,036 participants (322 cases and 714 controls). Model performance was assessed using two validation strategies. Firstly, the five-fold cross-validation was conducted on the training set for all models. Secondly, all models were assessed on an independent test set to further evaluate the generalization ability of different models. The comprehensive metrics were calculated to evaluate classification performance, which included precision, recall, F1-score and area under receiver operating characteristic curves (AUC).
To further assess model performance and the potential for overfitting, we performed bootstrap-based optimism correction for the AUCs of the ML models. First, the apparent AUC of each model was calculated in the original dataset. Next, 200 bootstrap samples of the same size as the original dataset were generated with replacement. For each bootstrap sample, the model was retrained and evaluated both in the bootstrap sample and in the original dataset. Optimism was defined as the difference between model performance in the bootstrap sample and that in the original dataset, and the optimism adjusted AUC was obtained by subtracting the average optimism from the apparent AUC. 28
Additionally, calibration curves and the Brier score were further used to evaluate the accuracy of output probabilities by ML models. The calibration curve reflected the model’s calibration by comparing the consistency between output probabilities and observed frequencies, while the Brier score quantified the mean squared error between the output probabilities and actual labels, with lower scores indicating greater accuracy of output probabilities. Model development and evaluation were implemented in Python (v3.7) using the Xgboost and sklearn packages.
2.4. Model interpretation and feature contribution
To improve interpretability, the SHAP framework was applied to the XGBoost Tuned model. 29 We calculated the SHAP value for each feature retained after LASSO screening, which quantified the marginal contribution of different features to the multimorbidity outcome. The magnitude of SHAP values reflects the strength of influence, while the sign indicates directionality (positive or negative association).
The SHAP summary plot was generated to visualize overall feature importance, and the interaction heatmaps were constructed to examine potential nonlinear relationships among features. The overall heatmap and force plots for samples were generated to globally display the distribution of key features’ contributions, assisting in identifying underlying patterns of the dataset for in-depth analysis (Figure 1(c)).
Additionally, systematic robustness assessments were conducted for SHAP-based feature ranking patterns. To evaluate cross-method consistency, we compared the leading feature rankings derived from SHAP, permutation importance and XGBoost built-in gain importance. To assess cross-sample stability, we performed bootstrap resampling with 100 iterations, retrained the XGBoost Tuned model in each bootstrap sample, and calculated the coefficient of variation for the ranking of each feature. We further conducted a leave-top1-out analysis, in which the top-ranked feature was removed and the model was retrained to examine the sensitivity of SHAP-based rankings to feature dependency.
Finally, we conducted independent association analyses based on the original data to assess potential biases in model-based interpretations. Specifically, the leading model-identified features were included in univariate logistic regression models to examine their independent associations with the RA-DM outcome. To explore the dose-response relationships between feature levels and the outcome, we stratified each leading feature by quartiles and used the Cochran-Armitage trend test to evaluate whether RA-DM risk showed a significant monotonic trend with increasing feature levels. Meanwhile, the restricted cubic splines (RCS) were applied to fit nonlinear curves and evaluate the gradient associations between leading features and the RA-DM risk.
2.5. Multimorbidity clustering and phenotype identification
Based on individual SHAP profiles, we employed a dual-clustering framework integrating the SOM and K-means algorithms to identify subgroups with similar contribution patterns (Figure 1(d)). The hybrid approach could effectively reduce clustering complexity of SHAP values while maintaining the accuracy of subgroup identification. Specifically, the SOM algorithm first projected high-dimensional SHAP data into a lower-dimensional representation, 30 which was subsequently refined by K-means to determine distinct subgroups. The optimal number of clusters was determined by the within-cluster sum of squares (WCSS), and the average silhouette score and Davies-Bouldin Index (DBI) of clusters were further calculated to quantitatively validate the optimal number of clusters. Additionally, sensitivity analyses were performed by reclustering with different random seeds and adjusted SOM grid sizes to evaluate the robustness and validity of the clustering results. On this basis, SHAP waterfall plots were generated for each cluster to visualize key contributing features, enabling the delineation of heterogeneous RA-DM multimorbidity phenotypes and providing a basis for phenotype-specific intervention strategies.
3. Results
3.1. Clinical characteristics of the participants
Supplementary Table A.2 summarizes the descriptive characteristics of all participants. The median age was 45 years (IQR: 31–63), and 44.2% were male. Significant differences were observed for most features between the two groups, indicating that the selected variables possess substantial discriminatory potential for identifying RA-DM multimorbidity.
3.2. Feature selection and model performance
Initially, 45 features of participants were included in the dataset. The LASSO regression reduced dimensionality by penalizing less informative indicators, retaining 39 key features as model inputs (Figure A.1, a-b). Detailed regression coefficients are listed in Supplementary Table A.3, while Supplementary Figure A.2 presents the Pearson correlation heatmap, showing minimal inter-feature collinearity and confirming the robustness of variable selection.
Performance comparison of different machine learning models.
Note. CV, Cross Validation; KNN, K-Nearest Neighbors; SVM, Support Vector Machine; LR, Logistic Regression.
Supplementary Figure A.3 presents the optimism adjusted performance of the ML models. After bootstrap-based optimism correction, the XGBoost Tuned model continued to show the highest AUC among the compared models. Meanwhile, the optimism adjusted AUCs were close to the corresponding apparent AUCs, suggesting limited evidence of substantial overfitting in the current dataset. Regarding model calibration, the calibration curve showed strong agreement between the output probabilities of the XGBoost Tuned model and the actual observed frequencies (Supplementary Figure A.4). Meanwhile, the Brier score of this model was 0.027, substantially lower than those of other ML models (KNN = 0.106, SVM = 0.080, LR = 0.068). These results confirm the strong generalization capability and stability of the optimized XGBoost model for classifying the status of RA-DM multimorbidity.
3.3. Feature contribution to RA-DM multimorbidity
To characterize feature ranking patterns and individual-level heterogeneity, SHAP analysis was applied to the XGBoost Tuned model. The SHAP summary plot (Figure 2(a)) identified the top 20 leading features, with serum homocysteine (HCY), survey period age (AGE), serum glucose (GLU), serum ferritin (SF), Body Mass Index (BMI), serum urea nitrogen (BUN), serum c-reactive protein (CRP), serum ldl cholesterol (LDLC) and triceps skinfold (TSF) exhibiting the strongest association with RA-DM multimorbidity (absolute SHAP > 0.5). SHAP analysis of the XGBoost Tuned model. (a) SHAP summary plot of the top 20 features with importance ranking. (b) Heatmap of the average SHAP interaction matrix for all features. (c) Interaction effect of AGE and HCY on RA-DM. (d) Interaction effect of AGE and SF on RA-DM. (e) Interaction effect of TSF and HCY on RA-DM. (f) Heatmap of SHAP values for major features in all samples, with red and blue representing positive and negative contributions respectively, and color intensity indicating the degree of contribution. (g) SHAP force plot for the first 30 samples in original order. (h) SHAP force plot for the first 30 samples sorted by similarity. The full names of features are provided in Supplementary Table A.1.
The interaction heatmap (Figure 2(b)) revealed complex nonlinear dependencies among key features and multimorbidity risk. For example, for participants with low HCY levels, increasing AGE was consistently associated with a reduced risk of RA-DM multimorbidity. In contrast, among those with high HCY levels, AGE demonstrated a threshold effect around 50 years, beyond which its impact shifted from negative to positive (Figure 2(c)). A similar interaction pattern was observed between SF and AGE (Figure 2(d)). Furthermore, TSF had the threshold of 20 for influencing the multimorbidity regardless of HCY levels, and the critical shift was more pronounced in the high HCY group than in the low HCY group (Figure 2(e)). The overall heatmap (Figure 2(f)) visualized heterogeneous feature contribution patterns across individuals, suggesting distinct multimorbidity substructures. The force plots of the top 30 samples (Figure 2(g)–(h)) confirmed this variability, indicating potential subgroups characterized by similar feature impact profiles—providing the rationale for subsequent clustering analysis.
To assess the robustness of SHAP-based feature ranking patterns, we conducted a multidimensional set of analyses. As shown in Supplementary Figure A.5, the top 20 features identified by SHAP, permutation importance and gain importance showed substantial overlap and broadly consistent ranking patterns. Bootstrap analysis further showed low coefficients of variation for the rankings of all features across resampled datasets, supporting the cross-sample stability of SHAP-based feature rankings (Supplementary Table A.5). As shown in Supplementary Figure A.6, the leave-top1-out analysis indicated that after removal of the top-ranked feature (HCY), the ranking of the remaining features remained largely stable, with a Spearman’s rank correlation coefficient of 0.8205 (P < 0.0001).
Regarding the independent association analyses, logistic regression results showed that the top nine model-identified features were significantly associated with the RA-DM multimorbidity, with effect directions consistent with their SHAP contribution patterns (Supplementary Table A.6). Supplementary Table A.7 presents the stratified analysis results based on the original data, where the risk levels of leading features exhibited a significant monotonic trend with increasing quartiles (P-value < 0.05), indicating clear dose-response relationships between model-identified features and the RA-DM outcome. As shown in Supplementary Figure A.7, the RCS analysis further visualized the gradient associations between leading features and RA-DM risk, with curve trends generally aligning with the model-based interpretation (Figure 2(a)).
3.4. Phenotype identification of RA-DM multimorbidity
Based on the outputs of SHAP values, a dual-clustering strategy combining SOM and K-means was employed to describe phenotypic subgroups. The 4×4 SOM grid achieved stable sample–neuron mapping after iterative training (Figure 3, (a)–(b)) and produced 16 preliminary clusters based on mean SHAP profiles (Figure 3(c)). Figures 3(d)–(f) present the secondary clustering results based on K-means. When the number of clusters (K) reached 4, the WCSS value was relatively low and gradually stabilized. Meanwhile, the average silhouette coefficient and DBI achieved the best performance under 4 clusters, so all samples were divided into four subgroups for phenotypic analysis. As shown in Supplementary Figure A.8, sensitivity analysis results indicated that the SOM network maintained a consistent sample distribution when reclustering with different random seeds. Additionally, different SOM grid sizes exhibited varied mapping patterns: the 3×3 and 5×5 grids showed imbalanced sample distribution, while the 6×6 grid presented the empty neuron. In contrast, the 4×4 grid used in this study demonstrated a reasonable sample distribution (Figure 3(b)), providing a solid foundation for subsequent multimorbidity phenotype identification. SHAP clustering based on SOM and K-means algorithm. (a) The average Euclidean distance from all data points to their nearest units in the training progress of SOM. (b) The distribution of samples mapped to each neuron in SOM. (c) The initial clustering results of SOM and the distribution of SHAP values for features within each neuron. (d) The WCSS of different cluster numbers in the secondary clustering using K-means. (e) Average silhouette coefficients and DBI of different cluster numbers in the secondary clustering using K-means. (f) The secondary clustering results of K-means, with different types of neurons separated by black lines. SOM, Self-Organizing Map; WCSS, Within-cluster sum of squares; DBI, Davies-Bouldin index.
To further identify phenotypes of RA-DM multimorbidity, SHAP waterfall plots were generated for representative individuals within each subgroup. By matching the original input dataset, it was found that subgroup 1 had higher levels of inflammatory markers (HCY, SF and CRP), which were the main drivers of RA-DM multimorbidity (Figure 4(a)). In Subgroup 2, lower levels of metabolic markers (e.g., GLU, BUN) were associated with a protective effect against RA-DM, emerging as the key negative contributors in the model (Figure 4(b)). Subgroup 3 primarily consisted of the older adults, with AGE being the main contributor to RA-DM multimorbidity (Figure 4(c)). Lower obesity indicators (measured by BMI and TSF) were observed in subgroup 4, which were regarded as the main negative contributors for the group (Figure 4(d)), revealing the potential role of reduced obesity in mitigating RA-DM multimorbidity. SHAP waterfall plots and multimorbidity phenotypes of subgroups. The full names of features are provided in Supplementary Table A.1.
Collectively, these results revealed four phenotypically distinct patterns—inflammatory, metabolically protective, age-related and non-obese protective—highlighting the heterogeneous pathophysiological profiles underlying RA-DM multimorbidity and providing a foundation for personalized intervention strategies.
4. Discussion
This study identified distinct phenotypes of RA-DM multimorbidity using an interpretable machine learning framework. The optimized XGBoost model, enhanced through Bayesian optimization, achieved the best classification performance among all algorithms, demonstrating robust generalization across validation datasets. The subsequent SHAP analysis showed that nine variables—such as HCY, AGE and GLU—emerged as the top-ranking features within the model. Building upon the feature contribution patterns, the dual-clustering approach combining SOM and K-means identified four distinct phenotypes, addressing a critical gap in individualized management of RA-DM multimorbidity and offering a data-driven foundation for precision public health interventions.
Traditional classification models for chronic disease multimorbidity often struggle with nonlinear associations and high-dimensional features, leading to overfitting and reduced interpretability.31,32 In contrast, this study employed LASSO regression for feature selection, and the XGBoost Tuned model integrated with Bayesian optimization was used to capture complex nonlinear feature relationships and improve classification performance. Although the LASSO method might remove weak but interacting features, parsimony is necessary for model interpretability, as it allows for focusing on individualized feature patterns and achieving precise clustering of multimorbidity phenotypes.33,34
In this study, the SHAP framework transformed the ML model from a “black box” into a transparent analytic system, enabling quantitative evaluation of individual-level feature patterns. Our findings align with earlier works integrating interpretability into ML models. For instance, Li et al. applied SHAP to explain gradient boosting predictions for sepsis risk, and Huang et al. used Local Interpretable Model-agnostic Explanations to interpret pneumonia diagnosis models.35,36 Notably, because feature importance lacks an external ground truth, the SHAP-based interpretation should be understood as a model-based explanation rather than a direct validation of true association. 37 In this study, we conducted systematic robustness assessments for model-based feature ranking patterns. Comparisons across different feature evaluation methods showed cross-method consistency, while the bootstrap resampling and leave-top1-out analysis supported the stability of the ranking patterns from the perspectives of sample variability and feature dependency. In addition, we performed independent association analyses outside the XGBoost-SHAP framework to distinguish model-based explanations from true associations. The results showed that model-identified features not only reflected internal model interpretation patterns but also demonstrated consistency and plausible dose-response relationships in the independent analyses. Collectively, the evidence supports the robustness of the identified feature patterns within the current data framework, thereby enhancing the reliability of our findings.
In the SHAP framework, HCY, AGE and GLU were identified as the top-ranking features within the model, each playing dual roles across both RA and DM pathophysiology. Elevated HCY can damage vascular endothelium, promote oxidative stress, and trigger inflammatory cascades, which are implicated in synovial injury in RA and insulin resistance in DM.38–41 Similarly, dysregulated GLU metabolism has been shown to upregulate MARK4 and NLRP3 inflammasome activation, aggravating immune dysfunction in RA. 42 AGE, a non-modifiable but critical determinant, reflects both metabolic aging and cumulative inflammatory exposure, supporting its centrality in multimorbidity risk. 43 These converging mechanisms reinforce the notion that RA-DM multimorbidity arises from intersecting inflammatory and metabolic pathways, rather than additive disease effects. Thus, the identified biomarkers not only validate prior single-disease findings but also highlight the shared immunometabolic axis underpinning multimorbidity.
Using the dual-clustering framework, we identified four distinct subgroups of RA-DM multimorbidity—inflammatory, metabolically protective, age-related and non-obese protective. These subgroups should be interpreted as exploratory phenotypes derived from model-based representation of the data, rather than definitive biological subtypes. However, our findings are consistent with previously reported categories of RA or DM, supporting the validity of the phenotypes identified in this study. For instance, the inflammatory phenotype, characterized by elevated HCY, SF and CRP, resembles the immune-active subtype described in RA studies,44,45 whereas the age-related and non-obese protective phenotypes parallel the mild age-related (MARD) and obesity-linked (MOD/SIDD) diabetes clusters identified by Ahlqvist et al. 46 The metabolically protective phenotype, featuring lower glucose and urea levels, may represent individuals with relatively preserved metabolic function and reduced multimorbidity risk. Collectively, these results demonstrate that RA-DM multimorbidity is a heterogeneous condition encompassing multiple interrelated clinical patterns rather than a uniform disease entity.
Each phenotype may have distinct implications for clinical management in different patient groups. Individuals with the inflammatory phenotype may benefit from regular monitoring of inflammatory markers, and anti-inflammatory or immunomodulatory treatment may be considered when these markers remain elevated. 47 Although the metabolically protective phenotype was associated with a lower risk of multimorbidity, routine metabolic monitoring, a balanced diet and lifestyle optimization may still be important for maintaining this protective profile. 48 The age-related phenotype may require geriatric-oriented care, including multimorbidity management, polypharmacy review and functional assessment. For the non-obese protective phenotype, nutritional counseling and physical activity may help maintain the protective effect associated with appropriate body weight.
These phenotypes may also help inform risk-stratified screening and public health planning. For example, community-based RA-DM screening programs may give greater attention to individuals with the inflammatory phenotype, while the age-related phenotype highlights older adults as a priority group for integrated multimorbidity management. In addition, the metabolically protective and non-obese protective phenotypes may offer useful reference patterns for health promotion strategies focused on maintaining metabolic health and appropriate weight status.
The present study advances the methodological frontier by demonstrating how interpretable ML frameworks can integrate population-level data with individualized feature explanations to uncover phenotypic diversity in multimorbidity. Beyond RA-DM, this analytical pipeline can be adapted to other multimorbid or complex chronic diseases—such as cardiovascular-metabolic syndromes, or COPD-diabetes co-occurrence—providing a generalizable tool for disease clustering and phenotype discovery in large epidemiologic datasets.
However, several limitations merit attention. Firstly, this study was based on the survey data from the U.S. population, which might limit the generalizability of the findings to non-U.S. populations and clinical practice settings. Future research should incorporate clinical cohort data from heterogeneous populations for validation and explore the practical value of different phenotypes in clinical decision-making. Secondly, the definition of RA and DM in this study relied on self-reported physician diagnoses, which might introduce potential misclassification bias. Meanwhile, the questionnaire design limited the in-depth exploration of multimorbidity between different DM subtypes and RA. Thirdly, the classifier in this study was primarily designed to distinguish RA-DM multimorbidity from healthy status, while its performance in differentiating multimorbidity from single-disease conditions remains unclear. Future studies should include RA-only and DM-only comparator groups to evaluate the specificity of the identified patterns. Fourthly, due to the cross-sectional design, the findings of this study should be considered hypothesis-generating rather than indicative of causal relationships. Subsequent prospective cohort studies are warranted to further validate the efficacy of interventions for distinct multimorbidity phenotypes. Fifthly, although the XGBoost Tuned model showed excellent classification performance, this should be interpreted cautiously. One possible explanation is that the control group consisted of participants without either RA or DM, rather than individuals with single-disease states, which might make the classification task easier and allow the model to capture broader disease burden. In addition, because all models were developed and evaluated within the same survey source, external validation in independent datasets is still needed before the generalizability of the findings can be established.
5. Conclusions
This study developed an interpretable machine learning framework to identify distinct phenotypes of RA-DM multimorbidity. The Bayesian-optimized XGBoost model achieved superior classification performance compared with traditional algorithms, and nine features—such as HCY, AGE and GLU—emerged as the top contributors within the model. Clustering of feature contributions revealed four phenotypes of RA-DM (inflammatory, metabolically protective, age-related and non-obese protective), each with unique clinical and pathological profiles. These findings provide a novel paradigm for precise stratification in RA-DM multimorbidity, bridging machine learning with clinical interpretability.
Supplemental material
Supplemental material - Phenotype identification and precise intervention for multimorbidity of rheumatoid arthritis and diabetes mellitus using interpretable machine learning
Supplemental material for Phenotype identification and precise intervention for multimorbidity of rheumatoid arthritis and diabetes mellitus using interpretable machine learning by Zhijun He, Jinglin Han, Xinzhu Qiao, Yiqiang Zhan and Mingming Xu in Digital Health.
Footnotes
Acknowledgments
The authors thank the participants and staff of the National Health and Nutrition Examination Survey 1999–2018 for their valuable contributions.
Ethical considerations
The NHANES was approved by the Ethics Review Board of the National Center for Health Statistics, and all participants provided written informed consent. As this study is a secondary analysis, it is exempt from further ethical review.
Consent to participate
Author contributions
Zhijun He: Conceptualization, Methodology, Software, Data Curation, Writing - Original Draft, Visualization. Jinglin Han: Investigation, Writing - Original Draft. Xinzhu Qiao: Investigation, Writing - Original Draft. Yiqiang Zhan: Supervision, Writing - Review & Editing. Mingming Xu: Conceptualization, Methodology, Writing - Review & Editing, Supervision, Project administration, Funding acquisition.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the GuangDong Basic and Applied Basic Research Foundation (Nos. 2022A1515110721), the Guangdong Philosophy and Social Sciences Planning Project (Nos. GD25YSH15) and the Medical Scientific Research Foundation of Guangdong Province (Nos. A2024051).
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
Declaration of AI use
No artificial intelligence (AI) tools were used in the writing or editing of this article.
Guarantor
Yiqiang Zhan and Mingming Xu.
Supplemental material
Supplemental material for this article is available online.
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
