Abstract
Background:
Glucagon-like peptide-1 receptor agonists (GLP-1 RAs) are highly effective pharmacotherapies for obesity, offering substantial weight loss and improvement in obesity-related complications. However, their high cost and limited insurance coverage necessitate precise, data-driven strategies for patient selection.
Objective:
To develop and validate machine learning (ML) models that predict eligibility for GLP-1 RA therapy using body mass index (BMI) and comorbidity profiles, leveraging both synthetic and real-world patient data.
Methods:
A rule-based algorithm, grounded in clinical guidelines, was used to generate a synthetic training dataset (n = 256), while a real-world test set (n = 287) was derived from electronic health records. Patients were classified as eligible for GLP-1 RA therapy or recommended for generic alternatives based on BMI and major obesity-related complications (e.g., type 2 diabetes, cardiovascular disease, metabolic dysfunction-associated steatohepatitis [MASH], sleep apnea, transplant status). Four models (a baseline classifier, logistic regression, a decision tree, and a random forest) were trained and evaluated using cross-validation and independent testing. The rule-based algorithm was also applied directly to the test set.
Results:
Decision tree, random forest, and logistic regression models achieved near-perfect accuracy (≥99%) on both datasets. Key predictors of eligibility included type 2 diabetes, cardiovascular disease, MASH, sleep apnea, and transplant status. The rule-based model also performed comparably well, demonstrating its clinical validity and interpretability.
Conclusions:
ML and rule-based models can accurately identify patients eligible for GLP-1 RA therapy, supporting scalable, equitable, and cost-conscious obesity treatment strategies. These tools offer a transparent framework for optimizing access to high-cost antiobesity medications in real-world clinical settings.
Background and Rationale
The growing rate of obesity, affecting over 40% of adults in the United States (defined by a body mass index [BMI] ≥30 kg/m2), alongside related health issues, has heightened interest in targeted drug treatments, particularly glucagon-like peptide-1 receptor agonists (GLP-1 RAs).1 These medications imitate natural hormones that control hunger, activating brain pathways that regulate appetite and energy use.2 Semaglutide, approved in 2021, and tirzepatide, approved in 2023, have shown impressive results in clinical trials, helping patients achieve significant weight reductions of over 15%–20% of their initial body weight.3,4 In addition, recent evidence highlights these therapies’ ability to substantially decrease major cardiovascular events5 and improve conditions like obstructive sleep apnea,6 resulting in expanded Food and Drug Administration (FDA)-approved uses.
Despite their clinical advantages, the high cost of GLP-1 RAs, often surpassing $900 per month, or around $500 after manufacturer discounts, has prompted concerns about affordability and fair patient access.7 Both public and private insurers, including Medicaid and Medicare, face increasing pressure to strike a balance between effectiveness and cost, emphasizing the necessity for better strategies to identify eligible patients.8
This research investigates predictive modeling techniques to assess patient eligibility for GLP-1 therapy by analyzing patient profiles that include BMI and critical obesity-related complications such as cardiovascular disease, metabolic dysfunction-associated steatohepatitis (MASH, previously known as nonalcoholic steatohepatitis or NASH), type 2 diabetes, sleep apnea, and transplant status. The study initially employed a rule-based classification; its main goal was to develop and test machine learning (ML) models capable of automating, and potentially enhancing, clinical decision-making. By utilizing both synthetic and real-world patient data, this work illustrates the promise of artificial intelligence (AI)-driven tools in delivering scalable, transparent, and personalized obesity treatment recommendations.9–11
Methods
Dataset and preprocessing
The study included two datasets: (1) a training set (n = 256), used for training and validation, and (2) an independent test set (n = 287), totaling 543 patient records.
The training set was created synthetically using a rule-based algorithm designed to mirror real-world clinical decisions regarding eligibility for GLP-1 RA therapy (Fig. 1). This algorithm integrated current medical guidelines,12,13 published studies demonstrating GLP-1 RA drug effectiveness,3–7 FDA indications as of 2025, and insurance coverage insights based on clinical experience. To broaden generalizability, the criteria included all patients with obesity (BMI ≥ 30 kg/m2). In addition, the algorithm considered secondary, less severe obesity-related conditions, such as hypertension, prediabetes, and insulin resistance, as part of an exploratory classification. While not typically insurance-covered for obesity treatment (as of 2025), these secondary conditions are relevant markers of meta-inflammation and overall disease severity,8 reflecting ongoing research and potential future insurance considerations.

Rule-based clinical decision tree for risk-stratifying glucagon-like peptide-1 receptor agonist (GLP1-RA) drug eligibility. Medical decision-making for triaging GLP1 drug eligibility for the treatment of obesity based on current medical criteria, insurance coverage considerations, and proven efficacy of GLP1 drugs in obesity-related major complications, as defined by the presence of cardiovascular disease (CVD), obstructive sleep apnea, type 2 diabetes, metabolic dysfunction-associated steatohepatitis (MASH), and/or transplant status. Other medical complications of obesity, such as prediabetes and hypertension, not yet covered by insurance for an obesity indication, were added as a secondary, exploratory classification for the model.
Input features were binary-encoded, capturing:
- BMI ≥ 30 kg/m2
- Major obesity complications: cardiovascular disease (CVD), obstructive sleep apnea, type 2 diabetes mellitus (T2DM), MASH, and transplant status
- Secondary conditions: prediabetes or insulin resistance, hypertension
The hierarchical decision-making process was as follows:
- Patients with BMI ≥ 30 and at least one major obesity-related complication were labeled as “GLP1 Drug.”
- Patients with BMI ≥ 30 and only secondary conditions (without any major complications) were labeled as “Generic Alternative Recommended” (a non-GLP1-RA class obesity drug such as generic phentermine, phentermine/topiramate combination, bupropion/naltrexone, or metformin for prediabetes and/or insulin resistance).
All others were excluded from the training set. Synthetic data allowed the creation of diverse patient scenarios with carefully controlled features, balanced labeling, and reliable model training. This approach was particularly valuable given limited access to comprehensive real-world electronic health records (EHRs), highlighting that ML models can still be developed effectively even when real patient data is scarce or incomplete. The independent test set consisted of de-identified EHRs from our institutional obesity center (2021–2025). The dataset and study procedures were reviewed and deemed exempt from full review by the Vanderbilt University Institutional Review Board, in compliance with the Declaration of Helsinki and institutional ethical standards.
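The labeling logic described above can be sketched as a short function; the field names, record schema, and return conventions here are illustrative assumptions, not the study's actual code:

```python
# Illustrative re-implementation of the rule-based labeler (schema is hypothetical)
MAJOR = ("cvd", "sleep_apnea", "t2dm", "mash", "transplant")
SECONDARY = ("prediabetes", "insulin_resistance", "hypertension")

def label_patient(record):
    """Return the eligibility label for a patient record, or None if excluded.

    `record` is a dict with a numeric 'bmi' and binary (0/1) condition flags.
    """
    if record["bmi"] < 30:
        return None                                # excluded: BMI below threshold
    if any(record.get(c, 0) for c in MAJOR):
        return "GLP1 Drug"                         # any major complication qualifies
    if any(record.get(c, 0) for c in SECONDARY):
        return "Generic Alternative Recommended"   # secondary conditions only
    return None                                    # obesity alone: excluded
```

A patient with BMI 36 and T2DM would be labeled "GLP1 Drug", while one with BMI 32 and only hypertension would be routed to the generic alternative.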
Model development and evaluation
Machine training and validation
To automate and improve clinical decisions for GLP-1 RA eligibility, we developed ML models trained on synthetic data, classifying patients into two groups: “GLP1-RA drug” vs. “Generic alternative recommended.”9 We evaluated logistic regression (simple and interpretable), decision trees (intuitive clinical reasoning), and random forests (ensemble learning to reduce overfitting). A majority-class baseline was included for comparison. Features were binary-encoded clinical criteria; no additional feature engineering or longitudinal data was used.
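The four-model setup can be sketched with scikit-learn as follows; the placeholder data, toy labeling rule, and hyperparameters are assumptions for illustration, not the study's actual configuration:

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Placeholder binary feature matrix standing in for the encoded clinical criteria
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(256, 8))
y = X[:, :2].any(axis=1).astype(int)  # toy label: first two features act as "major" flags

models = {
    "baseline": DummyClassifier(strategy="most_frequent"),  # majority-class comparator
    "logistic_regression": LogisticRegression(max_iter=1000),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}
for model in models.values():
    model.fit(X, y)
```

Because the toy label is a deterministic function of the binary inputs, the tree-based models fit it exactly, while the majority-class baseline captures only the class prior.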
Clarifying rule-based versus ML approaches
Although the synthetic training data were generated via a clinically grounded, rule-based algorithm (BMI ≥ 30, major comorbidities), the predictive models employed ML techniques. Unlike fixed rule-based methods, these ML classifiers identified patterns from input features rather than explicit rules. However, to evaluate practical utility, we later applied the original rule-based approach directly to an independent real-world dataset, enabling direct performance comparison with ML models.
Model validation
Validation involved 5-fold stratified cross-validation on the training set to assess model stability and generalizability independently from the final evaluation. We experimented with the Synthetic Minority Over-sampling Technique (SMOTE) to address class imbalance but observed no improvement, likely because the imbalance had minimal impact on performance in this dataset. The ML models achieved near-perfect cross-validation scores (accuracy, precision, recall, F1-score, and receiver operating characteristic-area under curve [ROC-AUC] approaching 1.00), indicating inherent robustness. Thus, cross-validation remained our primary validation method.
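The cross-validation step can be illustrated as below, using synthetic placeholder data; the SMOTE experiment (which relies on the separate imbalanced-learn package) is omitted, and the toy labeling rule is an assumption for the sketch:

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Placeholder binary clinical features and a toy eligibility label
rng = np.random.default_rng(0)
X = rng.integers(0, 2, size=(256, 8))
y = X[:, :2].any(axis=1).astype(int)

# Stratified folds preserve the class ratio in every train/validation split
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y,
                         cv=cv, scoring="accuracy")
print(f"accuracy: {scores.mean():.2f} ± {scores.std():.2f}")
```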
Model performance and metrics
Final model evaluation was performed on the independent real-world test set. Metrics included accuracy (overall correct predictions), precision (true positives among predicted positives), recall (true positives correctly identified), F1-score (balance of precision and recall), and ROC-AUC (ability to distinguish classes across thresholds). To address class imbalance, we reported macro averages (unweighted mean) and weighted averages (mean weighted by class frequency). In addition, support (the number of actual instances of each class) was provided for context.
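This reporting scheme can be reproduced with scikit-learn's metric utilities; the toy labels below are invented purely to show how macro and weighted averages diverge under class imbalance:

```python
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

# Invented predictions for an imbalanced two-class problem (illustrative only)
y_true = ["GLP1", "GLP1", "Generic", "Generic", "Generic", "Generic"]
y_pred = ["GLP1", "Generic", "Generic", "Generic", "Generic", "Generic"]

accuracy = accuracy_score(y_true, y_pred)
# Macro: unweighted mean over classes; weighted: scaled by class support
macro_p, macro_r, macro_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0)
weighted_p, weighted_r, weighted_f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="weighted", zero_division=0)

print(f"accuracy={accuracy:.3f} macro_P={macro_p:.3f} weighted_P={weighted_p:.3f}")
```

Here the macro precision (0.900) weighs the rare "GLP1" class equally with the common "Generic" class, while the weighted precision (0.867) reflects the 2:4 class frequencies.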
Results
Dataset composition
The final dataset (n = 543) included a synthetic training set and a real-world independent test set, each reported separately due to distinct purposes (Table 1). The training set intentionally favored GLP1-RA–eligible cases (n = 256: 248 GLP1 Drug, 8 Generic Alternative Recommended), reflecting clinical prioritization of high-risk patients. In contrast, the real-world test set had fewer GLP1-eligible cases (n = 287: 75 GLP1 Drug, 212 Generic Alternative Recommended).
Feature distribution
Analysis revealed distinct profiles between groups. GLP1-eligible patients commonly had significant comorbidities (>50% had BMI > 35, transplant status, MASH, sleep apnea, or T2DM; nearly 50% had hypertension or prediabetes). Over 80% had at least one major comorbidity. Conversely, the Generic Alternative group mostly had obesity (BMI > 35 in >85%) but few comorbidities (transplant, CVD, MASH, sleep apnea < 1%, T2DM 8.3%), confirming the clinical relevance of obesity-related complications.
Cross-validation results
Model robustness was assessed via 5-fold stratified cross-validation on the training set (Table 2). The decision tree and random forest showed perfect consistency (accuracy = 1.00 ± 0.00), while logistic regression was slightly lower (accuracy = 0.98 ± 0.02). The baseline classifier reached a deceptively high accuracy (0.97) by always predicting the majority class, confirming its limited clinical utility. Class imbalance correction (SMOTE) had no measurable benefit, indicating the models’ inherent robustness.
Cross-Validation Results (5-Fold)
Training set performance
All models performed strongly on the synthetic data. The baseline classifier achieved high overall accuracy (96%) but failed entirely on the minority class (Generic Alternative). Decision tree, random forest, and logistic regression all achieved perfect accuracy (100%), precision, recall, and F1-score, raising possible concerns about overfitting given synthetic data. Visualizations (ROC curves, confusion matrices, feature importance) supported interpretability.
Independent test set performance
Real-world testing showed clear differences (Table 3, Fig. 2). The baseline classifier’s accuracy dropped sharply (26%), with poor class differentiation (F1-score = 0.41 for GLP1 Drug, 0.00 for Generic Alternative). The ML models significantly outperformed the baseline: the decision tree achieved 99% accuracy (F1-score = 0.99), while the random forest and logistic regression reached 100% accuracy with near-perfect precision, recall, and F1-scores. All models demonstrated balanced classification, high ROC-AUC (∼1.0), and strong clinical suitability. The rule-based algorithm, applied directly to the test data, showed comparable performance (accuracy = 99.7%, F1-score = 0.99), indicating its practical clinical potential as a simpler alternative to ML.

Comparison of model performance on the independent test set. The bar chart compares four key performance metrics, including accuracy, precision, recall, and F1-score, across five models evaluated on the independent test set.
Model Performance and Metrics
This table summarizes the performance of four machine learning models, Baseline, Decision Tree, Random Forest, and Logistic Regression, used to classify patients as eligible for GLP-1 therapy or a generic alternative on the independent test set. For each model, performance metrics are reported for two classes: “Generic Alternative Recommended” and “GLP1 Drug.” The metrics include precision, recall, F1-score, and support. Precision refers to the proportion of true positive predictions among all positive predictions made by the model for a given class, indicating how often the model is correct when it predicts a class. Recall measures the proportion of true positives identified out of all actual instances of that class, reflecting the model’s sensitivity. The F1-score is the harmonic mean of precision and recall, providing a balanced measure of model performance, particularly useful in cases of class imbalance. Support indicates the number of actual instances of each class in the dataset. In addition to class-specific metrics, the table includes overall accuracy, which represents the proportion of correct predictions across all classes. Macro average values are also reported, representing the unweighted mean of precision, recall, and F1-score across both classes, treating each class equally regardless of its frequency. Weighted average values, on the other hand, account for the number of instances in each class (support), providing a more representative summary when class distributions are imbalanced. These metrics collectively offer a comprehensive view of model performance and are essential for evaluating the effectiveness and fairness of classification in clinical decision-making.
Comorbidity feature importance
Comorbidities strongly influenced model predictions. Random forest analysis identified T2DM, hypertension, sleep apnea, and MASH as top predictors (Fig. 3). Logistic regression coefficients confirmed these findings (Fig. 4), aligning closely with clinical guidelines.12,13 This emphasizes the importance of detailed comorbidity profiling in predictive modeling for obesity pharmacotherapy.
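Extracting such rankings is straightforward with fitted scikit-learn models; the feature names, toy data, and labeling rule below are illustrative stand-ins for the study's actual inputs:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression

# Hypothetical feature schema mirroring the paper's binary-encoded criteria
features = ["bmi_ge_30", "cvd", "sleep_apnea", "t2dm", "mash",
            "transplant", "prediabetes", "hypertension"]
rng = np.random.default_rng(1)
X = rng.integers(0, 2, size=(256, len(features)))
y = X[:, [1, 2, 3]].any(axis=1).astype(int)  # toy label driven by cvd/sleep_apnea/t2dm

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
lr = LogisticRegression(max_iter=1000).fit(X, y)

# Rank features by impurity-based importance (RF) and by coefficient size (LR)
rf_ranking = sorted(zip(features, rf.feature_importances_),
                    key=lambda pair: pair[1], reverse=True)
lr_ranking = sorted(zip(features, lr.coef_[0]),
                    key=lambda pair: pair[1], reverse=True)
print(rf_ranking[:3])
```

In this toy setup, both rankings recover the comorbidity columns that actually drive the label, mirroring how the study's models surfaced clinically relevant complications.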

Random Forest-feature importances. The bar chart shows which features the Random Forest model found most useful: the existence of any defined obesity-related complication or comorbidity. These features likely have strong clinical relevance in determining GLP1 drug suitability. T2DM, type 2 diabetes mellitus; BMI, body mass index; pre-DM, prediabetes.

Logistic Regression-coefficients. This bar chart summarizes the influence of various clinical features on eligibility for GLP-1 therapy, based on a logistic regression model. Positive coefficients (right of zero) increase the likelihood of being classified as GLP1 Drug. Negative coefficients (left of zero) decrease the likelihood. The most influential factor is having any complication or comorbidity of obesity, with a coefficient around 2.5. Other significant predictors include pre/post-transplant status, cardiovascular disease, and MASH, each with moderate positive effects (∼0.75), followed by sleep apnea (∼0.6) and hypertension (∼0.25). Like Random Forest, T2DM, hypertension, and sleep apnea have strong positive weights. In contrast, BMI, prediabetes, and insulin resistance show minimal impact, with coefficients near zero. This suggests that medical complications play a larger role than BMI alone in determining eligibility in this model.
Discussion
The results of this study demonstrate that ML models, particularly decision tree, random forest, and logistic regression, can accurately and reliably classify patients with obesity as eligible for GLP-1 RA therapy based on BMI and comorbidity profiles. Importantly, the models identified clinically relevant complications such as T2DM, hypertension, sleep apnea, and MASH as key predictors of eligibility. These findings align with current clinical guidelines12,13 and validate the model’s ability to reflect real-world decision-making processes.
The observed imbalance in the training set, favoring patients labeled as “GLP1 Drug”, can be attributed to several factors. The labeling logic was based on clinical eligibility criteria for GLP-1 RAs, which include obesity (particularly BMI > 35) and the presence of comorbidities such as T2DM, MASH, sleep apnea, and transplant status. A large proportion of patients in the dataset met these criteria, resulting in a higher number of GLP1 Drug cases. In addition, the rule-based approach used to assign labels likely contributed to the skew. Patients without qualifying comorbidities were less frequently labeled as “Generic Alternative Recommended,” especially in the training set. This reflects the clinical reality that patients with fewer risk factors are less likely to be considered for GLP-1 therapy. While the training set was imbalanced, model performance on the independent test set remained strong, supporting the robustness of the approach. Notably, the imbalance may also reflect real-world prescribing patterns or intentional oversampling of GLP1 drug cases to ensure adequate representation of positive cases for model training.
The implications of these findings are significant in the context of rising health care costs and the expanding use of GLP-1 RAs. While these medications have demonstrated substantial benefits in reducing obesity-related complications, their high cost, often exceeding $900 per month,7 poses a challenge for insurers, employers, and health care systems. Currently, coverage decisions are often reactive and inconsistent, lacking a standardized, data-driven approach to patient selection.14
This study offers a scalable and interpretable solution to that problem. By deploying this model within large health care organizations, insurers and hospital systems could proactively identify patients who meet evidence-based criteria for GLP-1 therapy. This would enable more timely access to treatment, reduce administrative burden, and support a cost-effective, risk-stratified approach to obesity management. Furthermore, the model could be integrated into EHR systems to provide real-time decision support at the point of care.
Although long-term cost-effectiveness data for GLP-1 RAs are still emerging, the landscape is expected to shift in the coming years. As newer GLP-1 agents and next-generation antiobesity medications with different mechanisms of action enter the market, competition is likely to drive down prices.15 In this evolving environment, predictive models like the one developed in this study will be essential for guiding equitable and efficient resource allocation.
In summary, this work not only demonstrates the technical feasibility of using ML to support obesity pharmacotherapy decisions but also highlights its potential to inform policy and operational strategies in health care delivery. Future research should focus on prospective validation, integration into clinical workflows, and economic modeling to quantify the long-term impact of such tools on health outcomes and system-wide costs.
Limitations
While the results of this study are promising, several limitations should be acknowledged. First, the training dataset was synthetically generated due to the lack of publicly available, comprehensive obesity datasets with detailed comorbidity profiles. Although the synthetic data was designed to reflect real-world clinical patterns, it may not fully capture the complexity and variability of actual patient populations.
Second, the independent test dataset, while based on real-world EHRs, was relatively small (n = 287) and derived from a single academic obesity center. This may limit the generalizability of the findings to broader, more diverse populations, including those in community or rural health care settings. Future studies should aim to validate the model across multiple institutions and larger, more heterogeneous datasets.
Third, although the models demonstrated high accuracy and interpretability, they were trained on binary-encoded features and did not incorporate longitudinal data or medication history, which could further enhance predictive power. In addition, the models do not currently account for socioeconomic status, cultural background, or regional disparities, such as those captured by the area deprivation index. These social determinants of health are known to influence treatment access and outcomes and should be considered in future iterations of the model.14
In addition, in clinical ML studies, performance metrics exceeding 99% are highly unusual and often suggest underlying issues. Potential causes include data leakage, overfitting, or simplistic tasks, especially when using a small number of binary input features that make classification trivial. Models may also benefit from similarities between synthetic training data and real-world test data, particularly if the synthetic data were generated using a rule-based algorithm, leading models to simply rediscover those rules—raising concerns about circularity. Additional factors include nonrepresentative or imbalanced datasets, flawed evaluation setups, preprocessing artifacts, and small or overly similar test sets, all of which can inflate performance and undermine generalizability.
Finally, while the model shows potential for integration into clinical decision support systems, it has not yet been prospectively tested in a live clinical environment. Real-world implementation will require careful evaluation of workflow integration, clinician adoption, and patient outcomes, as well as ongoing monitoring for model drift and fairness. Notably, while the dataset included basic demographic variables such as age and gender, which have been previously shown to influence GLP-1 RA prescribing patterns,15–19 the absence of broader contextual data remains a limitation.
Conclusions
This study demonstrates the feasibility and utility of using both ML and rule-based approaches to support clinical decision-making in obesity pharmacotherapy. By leveraging synthetic data generated from clinical guidelines and validating against real-world patient records, we developed predictive models, particularly decision tree, random forest, and logistic regression, that achieved high accuracy in identifying patients eligible for GLP-1 RA therapy based on BMI and comorbidity profiles. Notably, the rule-based algorithm, when applied directly to the independent test set, also demonstrated strong performance, reinforcing the clinical validity of the underlying decision logic. The identification of key medical complications such as type 2 diabetes, hypertension, sleep apnea, and MASH as top predictors further supports the relevance of these features in treatment eligibility. The rule-based model offers a transparent and interpretable alternative that may be especially valuable in settings with limited access to ML infrastructure. Together, these tools provide a scalable, cost-conscious framework for optimizing resource allocation and improving equitable access to high-cost antiobesity medications. As newer pharmacotherapies enter the market and cost dynamics evolve, such predictive frameworks, whether rule-based or data-driven, will be essential for guiding value-based care strategies in obesity management.
Footnotes
Authors’ Contributions
G.S.: Conceptualization and design of study, original code and implementation, original draft, formal analysis, editing of subsequent drafts. A.V.M.: Dataset curation, writing and editing of subsequent drafts, visualization, formatting, code review. Y.C.: Interpretation, data analysis, methodology, editing and revising. All authors reviewed and approved the final article.
Author Disclosure Statement
G.S. is the Editor-in-Chief of DTOM. G.S. reports advisory fees from Novo Nordisk, Eli Lilly, Rhythm Pharmaceuticals, Quest Diagnostics, Epitomee. Research grant support from Eli Lilly and Recordati; Speaker’s Bureau Novo Nordisk. A.V.M. and Y.C. have no competing interests. Artificial intelligence tools were used for grammar, language refinement, and reference formatting during the final article draft preparation.
Funding Information
This research received no external funding.
