Machine learning to predict adverse drug events based on electronic health records: a systematic review and meta-analysis

Abstract

Objective

This systematic review aimed to provide a comprehensive overview of the application of machine learning (ML) in predicting multiple adverse drug events (ADEs) using electronic health record (EHR) data.

Methods

Systematic searches were conducted using PubMed, Web of Science, Embase, and IEEE Xplore from database inception until 21 November 2023. Studies that developed ML models for predicting multiple ADEs based on EHR data were included.

Results

Ten studies met the inclusion criteria. Twenty ML methods were reported, most commonly random forest (RF, n = 9), followed by AdaBoost (n = 4), eXtreme Gradient Boosting (n = 3), and support vector machine (n = 3). The mean area under the summary receiver operating characteristic curve (AUC) was 0.76 (95% confidence interval [CI] = 0.26–0.95). RF combined with resampling-based approaches achieved high AUCs (0.9448–0.9457). The common risk factors of ADEs included the length of hospital stay, number of prescribed drugs, and admission type. The pooled estimated AUC was 0.72 (95% CI = 0.68–0.75).

Conclusions

Future studies should adhere to more rigorous reporting standards and consider new ML methods to facilitate the application of ML models in clinical practice.

Keywords

Meta-analysis, machine learning, adverse drug event, prediction model, electronic health record, systematic review

Introduction

Drug therapy carries potential risks of medication-related injuries, which are associated with notably prolonged hospital stays, increased economic burdens, and a nearly 2-fold higher risk of mortality.¹ Adverse drug events (ADEs) represent a global public health challenge that affects patients and healthcare systems.² ADEs are drug-related patient injuries caused during the drug use process that could potentially be prevented.³ Therefore, numerous studies have concentrated on predicting ADEs.

Researchers have focused on ADE prediction based on various factors, including drug–drug interactions (DDIs),³ the chemical structures of drugs,⁴ spontaneous reporting systems, and health records.⁵ DDIs can manifest in various ways, including effects on pharmacokinetics and pharmacodynamics, resulting in altered drug concentrations or increased susceptibility to organ toxicity.^6,7 Therefore, DDIs can potentially elicit adverse effects in patients, leading to significant consequences.⁸ The chemical composition of a drug possesses inherent structural characteristics that can influence its propensity to induce adverse effects.⁹ However, these approaches do not provide patient information, such as age, sex, length of stay, total dose, and diagnosis. Spontaneous reporting systems can provide some patient information, whereas other information, such as the length of stay, dose per patient, and the total number of patients using a certain medication, remains unavailable.¹⁰

Health records contain comprehensive data spanning the entire duration of patients’ hospitalization. Using predictive tools for ADE anticipation in hospitalized patients has the capacity to assist clinicians in proactively averting ADEs at the individual patient level.¹¹ Furthermore, the elucidation of patterns underlying the occurrence of ADEs during hospitalization, as derived through this approach, can inform the selection of interventions to enhance medication safety and ultimately reduce the incidence of ADEs.

With the growing adoption of electronic health record (EHR) systems, it has become increasingly feasible to detect and predict drug-related injuries. EHRs can help detect potential ADEs,¹² representing an appealing alternative to the arduous manual review of patient charts. Subsequently, adverse reactions can be predicted through the application of statistical models. Therefore, an increasing number of studies have focused on the prediction of ADEs using EHR data.¹³ In clinical practice, drug-related injuries often involve the compounding effects of multiple medications, and certain medications can induce multiple concurrent or sequential adverse events. Therefore, predictive studies targeting multiple adverse events are more closely aligned with clinical needs, and they carry practical significance. However, the prediction of multiple adverse events involves more risk factors, which can lead to the unsatisfactory performance of traditional statistical methods. For example, logistic regression (LR) is a conventional statistical method commonly used to predict ADEs, but its reported F1 score is only 17% to 36%.^1,14,15

Machine learning (ML), situated within the domain of artificial intelligence, represents an interdisciplinary field encompassing statistics, computer science, and various ancillary domains. ML exhibits proficiency in managing intricate non-linear associations between variables and outcomes, affording heightened generalization capabilities and augmented precision.¹⁶ ML has emerged as a burgeoning focal point in medical applications, demonstrating substantial potential in disease diagnosis,¹⁷ prescription analysis,¹⁸ and complications surveillance.¹⁹ Many studies have applied ML to predict multiple ADEs, but there is a lack of relevant research evaluating these studies systematically. Therefore, we conducted a systematic review to provide a comprehensive overview of the application of ML in predicting multiple ADEs based on EHR data. The corresponding protocol has been posted on the Research Square preprint platform.²⁰

Materials and methods

This systematic review was conducted in accordance with the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA) guidelines.²¹ The details of the PRISMA guidelines are provided in Table S1. The review protocol was formally registered with PROSPERO under the registration number CRD42023464771. PROSPERO is funded by the United Kingdom National Institute for Health Research. The PROSPERO register accepts any systematic review, its registration process is simple, and registered reviews can provide strong evidence for evidence-based decision-making.²²

Search strategy and selection criteria

Four electronic databases, namely PubMed, Web of Science, Embase, and IEEE Xplore, were searched by title and abstract from each database’s date of inception to 21 November 2023. The search strategy involved the use of MeSH and Emtree keywords, as well as their synonyms and keywords obtained from the initially reviewed papers. The search strategy used for each individual database is provided in Table S2.

The inclusion criteria were as follows: the study focused on the prediction of multiple ADEs, applied ML algorithms to EHR data, and provided sufficient explanation of the research findings. Studies were excluded if they met any of the following criteria: no full-text paper was available; the article was a review; the article was written in a language other than English; the study focused on medical safety events; the study concentrated on the identification of ADEs; the study used only conventional algorithms such as logistic regression; or the study was not based on EHRs.

Study selection and data extraction

After duplicate studies were removed, two reviewers (QZ Hu and CQ Li) independently evaluated the titles and abstracts retrieved in the searches. Both reviewers have systematic review experience, and they received training prior to this study to avoid the exclusion of important studies. The full text of potentially eligible studies was assessed to confirm their suitability. Any discordant evaluations were resolved through consensus.

The following study information was extracted: author, year, study setting, demographic characteristics, duration of data collection, ML algorithms used, performance metrics (such as accuracy, sensitivity, specificity, precision, and F1 score), and risk factors.

Quality evaluation

Two assessment instruments were employed to assess the quality of the included articles. Predictive models in healthcare use predictors to estimate the probability of an individual developing a condition or disease in the future. The Prediction Model Risk of Bias Assessment Tool (PROBAST) assesses the risk of bias and the applicability of prediction model studies, and it can be used as a guide to systematically evaluate the quality of papers related to modeling and prediction.²³ PROBAST consists of 20 signaling questions across four domains: participants, predictors, outcomes, and analysis. The applicability of each study within the first three domains was also scrutinized, and concerns were categorized as low, high, or unclear.²³

The quality of artificial intelligence studies in the medical field was assessed using the Checklist for the Assessment of Medical AI (ChAMAI).²⁴ The original purpose of the tool is to help editors and reviewers discriminate between high-quality contributions and manuscripts that should be rejected.²⁴ ChAMAI comprises 30 items distributed across six dimensions, namely problem understanding, data understanding, data preparation, modeling, validation, and deployment.²⁴ These items can be divided into low- and high-priority types, and they were evaluated as OK (adequately addressed), mR (sufficient but improvable), or MR (inadequately addressed).²⁴ For high-priority items, scores of 2, 1, and 0 were assigned for OK, mR, and MR, respectively, whereas for low-priority items, the scores were halved (i.e., 1, 0.5, and 0).²⁵ The maximum score for this assessment tool was 50 points.
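The ChAMAI scoring arithmetic can be sketched as follows. This is a minimal illustration rather than the tool's official implementation: the function name `chamai_score` is hypothetical, the sketch assumes that OK earns the most points so that higher totals indicate better quality, and it assumes a split of 20 high-priority and 10 low-priority items, chosen because that split reproduces the 50-point maximum.

```python
# Hypothetical helper for illustration; not part of the ChAMAI tool itself.
# Assumption: "OK" earns the most points, so a higher total means better quality.
HIGH_PRIORITY_POINTS = {"OK": 2.0, "mR": 1.0, "MR": 0.0}


def chamai_score(ratings):
    """Sum the points for (priority, rating) pairs; priority is "high" or "low"."""
    total = 0.0
    for priority, rating in ratings:
        points = HIGH_PRIORITY_POINTS[rating]
        if priority == "low":
            points /= 2  # low-priority items count for half
        total += points
    return total


# Assumed split of the 30 items: 20 high-priority and 10 low-priority.
perfect_study = [("high", "OK")] * 20 + [("low", "OK")] * 10
print(chamai_score(perfect_study))  # 50.0
```

Under these assumptions, a study rated mR on every item would score 25, which is the midpoint used to interpret the totals.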

Statistical analysis

The effects and 95% confidence intervals (95% CIs) were examined using a random-effects model.²⁶ Pooled sensitivity and specificity and their respective 95% CIs were calculated using contingency tables. The overall performance of ML was evaluated using the summary receiver operating characteristic (SROC) curve and the area under the SROC curve (AUC). Statistical significance was indicated by P < 0.05. Heterogeneity was assessed using the Q and I² statistics, with I² > 50% indicating significant heterogeneity. If heterogeneity was significant, then the random-effects model was used to pool the effect estimates, and potential sources of heterogeneity were assessed. Publication bias was evaluated using the funnel plot and regression test.²⁶ All statistical analyses were conducted using Stata 16.0 software (StataCorp, College Station, TX, USA).²⁷
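The relationship among Q, I², and random-effects pooling can be illustrated with a short sketch. It is a simplified univariate DerSimonian–Laird calculation written for illustration only; it is not the Stata routine used in this review, and the function name and input values are hypothetical.

```python
import math


def random_effects_pool(effects, variances):
    """DerSimonian-Laird pooling of study-level effects.

    Returns (pooled, ci_low, ci_high, q, i2_percent).
    """
    weights = [1.0 / v for v in variances]
    fixed = sum(w * y for w, y in zip(weights, effects)) / sum(weights)
    # Cochran's Q: weighted squared deviations from the fixed-effect estimate.
    q = sum(w * (y - fixed) ** 2 for w, y in zip(weights, effects))
    df = len(effects) - 1
    # I^2: share of total variation attributed to between-study heterogeneity.
    i2 = max(0.0, (q - df) / q) * 100 if q > 0 else 0.0
    c = sum(weights) - sum(w ** 2 for w in weights) / sum(weights)
    tau2 = max(0.0, (q - df) / c) if c > 0 else 0.0
    # Random-effects weights add the between-study variance tau^2.
    re_weights = [1.0 / (v + tau2) for v in variances]
    pooled = sum(w * y for w, y in zip(re_weights, effects)) / sum(re_weights)
    se = math.sqrt(1.0 / sum(re_weights))
    return pooled, pooled - 1.96 * se, pooled + 1.96 * se, q, i2


# Hypothetical effects (for example, logit-transformed sensitivities).
pooled, low, high, q, i2 = random_effects_pool([0.0, 2.0], [0.5, 0.5])
print(round(q, 2), round(i2, 1))  # 4.0 75.0
```

In this toy input, I² exceeds the 50% threshold, so the random-effects estimate would be preferred over a fixed-effect estimate.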

Results

Characteristics of the included studies

The initial database search yielded 5704 relevant studies. After removing duplicates, the titles and abstracts were screened for relevance. The full text of 36 studies was assessed, and 10 studies^5,15,28–35 met the inclusion criteria, as presented in Figure 1.

Figure 1.

PRISMA flow diagram of the citation search and selection strategy. ADE, adverse drug event; EHR, electronic health record; PRISMA, Preferred Reporting Items for Systematic reviews and Meta-Analyses.

These studies included three studies based on self-built databases,^5,30,35 one study based on the Medical Information Mart for Intensive Care (MIMIC)-IV (version 2.1),¹⁵ one study based on a Chinese EMR database,³⁴ and five studies based on the Stockholm Electronic Patient Record (EPR) Corpus.^{28,29,31–33} These studies were conducted in China,^5,30,34 the United States,¹⁵ India,³⁵ and Sweden,^{28,29,31–33} and they covered a range of targeted populations, including older adults,⁵ pediatric patients,³⁰ and the general population.^{15,28,29,31–35}

These studies frequently employed multiple ML algorithms simultaneously to compare the performance of different models on the same dataset. Among the 20 ML methods documented across the included studies, random forest (RF) was most frequently used (n = 9), followed by AdaBoost (n = 4), eXtreme Gradient Boosting (XGBoost, n = 3), and support vector machine (SVM, n = 3). Four studies improved the ML algorithms by adjusting their configurations, learned weights, or tree sizes to optimize model performance.^28,29,31,33 Detailed information is presented in Table 1.

Table 1.

Characteristics of the included studies.

Study ID	No. of included patients	Data period	Data types	Age (years)	Male (%)	No. of patients with ADEs (%)	No. of ADEs	No. of drugs that caused ADEs	Length of stay (days)
Hu, 2022⁵	1800	2015–2017	Structured clinical data	Total: 69.84 ± 8.14; No ADE: 70.12 ± 8.19; ADE: 67.95 ± 7.55	58.56%	234 (13.00%)	296	21	Total: 10.19 ± 8.94; No ADE: 9.44 ± 8.07; ADE: 15.19 ± 12.28
Yu, 2021³⁰	1746	2013–2015	Structured clinical data	Total: 3.84 ± 3.89; No ADE: 3.86 ± 3.85; ADE: 3.72 ± 4.12	Total: 65.00%; No ADE: 64.70%; ADE: 67.40%	221 (12.70%)	247	27	Total: 7.83 ± 5.29; No ADE: 7.48 ± 4.66; ADE: 10.23 ± 8.03
Langenberger, 2023¹⁵	210,181	2008–2019	Unstructured and structured clinical data	No ADE: 59.8 ± 19.7; ADE: 63.3 ± 16.9	No ADE: 49.40%; ADE: 51.50%	10,957 (5.21%)	22,667	N	No ADE: 4.00 ± 6.49; ADE: 9.12 ± 11.8
Karlsson, 2014²⁸	16,287	2009–2010	Unstructured and structured clinical data	N	N	4128 (25.34%)	N	N	N
Ponraj, 2021³⁵	5000	N	Structured clinical data	N	N	N	N	N	N
Karlsson, 2016²⁹	35,711	2007–2014	Unstructured and structured clinical data	N	N	16,062 (44.98%)	N	N	N
Zhao, 2021³⁴	30,703	N	Unstructured and structured clinical data	N	N	N	N	N	N
Zhao, 2015a³²	14,303	2009–2010	Unstructured and structured clinical data	N	N	2807 (19.62%)	N	N	N
Zhao, 2015b³³	14,696	2009–2010	Unstructured and structured clinical data	N	N	2928 (19.92%)	N	N	N
Zhao, 2016³¹	38,709	2009–2015	Unstructured and structured clinical data	N	N	5733 (14.81%)	N	N	N

Study ID	Top five types of drugs that caused ADEs (n)	No. of ADE types	Top five types of ADEs (n)	Important risk factors	Machine learning algorithms	Evaluation and validation	Evaluation indicators	Optimal model performance
Hu, 2022⁵	Antineoplastic (165), antimicrobial (44), anticoagulant (18), antihypertensive (13), analgesic (9)	32	Nausea (64), leukopenia (36), hepatotoxicity/transaminase disorder (25), diarrhea (22), vomiting (22)	Number of true triggers (+), length of stay, doses per patient, age, number of admissions in the previous year, surgery, drugs per patient, number of medical diagnoses, antibacterial use, and sex	Seven algorithms: XGBoost, AdaBoost, CatBoost, GBDT, LightGBM, TPOT, and RF	Training and testing	AUC, accuracy, precision, recall, F1 score, precision–recall curve	AdaBoost had the best predictive performance (accuracy: 88.06%, precision: 68.57%, recall: 48.21%, F1 score: 52.75%, AUC: 0.91)
Yu, 2021³⁰	Antibacterials (86), sedatives (55), immunomodulators (14), antiepileptics (12), potassium chloride, glucose (10)	28	Oversedation/hypotension (47), rash (35), diarrhea (34), candidiasis (25), constipation (21), vomiting (21)	Number of true triggers (+), number of doses, BMI, number of drugs, number of admissions, height, length of hospital stay, weight, age, and number of diagnoses	Seven algorithms: XGBoost, AdaBoost, CatBoost, GBDT, LightGBM, TPOT, and RF	Training and testing	AUC, precision, recall, F1 score, precision–recall curve	GBDT had the best predictive performance (precision: 44%, recall: 25%, F1 score: 31.88%, AUC: 0.81)
Langenberger, 2023¹⁵	N	102	Other secondary thrombocytopenia (5153), other iatrogenic hypotension (4388), fever, unspecified (4010), dermatitis attributable to drugs and medicines taken internally (1871), enterocolitis attributable to Clostridium difficile (1222)	Admission type, temperature, chief complaint, systolic blood pressure, age, serotonin receptor antagonists, heart rate, glucocorticoids, antiviral agents, and rifamycin	Six algorithms: RF, GBM, Ridge, LASSO, EN, and LR	Training and testing	AUC, accuracy, precision, recall, F1 score, G-mean, Youden index, negative predictive value, specificity, precision–recall curve, Brier score	GBM had the best predictive performance (accuracy: 66.67%, precision: 10.10%, recall: 67.50%, F1 score: 17.50%, negative predictive value: 97.50%, specificity: 66.50%, G-mean: 66.30%, Youden index: 35.00%, AUC: 0.75, AUC-PR: 0.13)
Karlsson, 2014²⁸	N	14	N	N	Four algorithms: RF with four tree sizes and four configurations [two (log and sqrt) as baseline and two re-sampling approaches (ren and reinf)]	10-fold cross-validation	AUC, F1 score, Brier score, prediction	RF-ren model had the best predictive performance (F1 score: 84.02%, AUC: 0.95, Brier score: 12.98%). However, no significant relationship was observed between the models.
Ponraj, 2021³⁵	N	N	N	N	Three algorithms: SVM, DT, and MLP	Training and testing	Accuracy	DT had the best performance (fatality rate accuracy: 91.40%, severity rate accuracy: 81.70%)
Karlsson, 2016²⁹	N	27	Allergy, unspecified (5540), unspecified adverse effect of drug or medicament (1967), angioneurotic edema (1146), maternal care for (suspected) damage to fetus by drugs (1096), generalized skin eruption attributable to drugs and medicaments (861)	N	Five algorithms: RPF with four configurations [RPF-ed (1), RPF-ed (2), RPF (3,1), and RPF (3,2)] and RF	10-fold cross-validation	F1 score, Brier score, prediction	There was no significant difference in performance between RPF and RF (RF F1 score: 84.45%, RF Brier score: 11.66%)
Zhao, 2021³⁴	N	N	N	Age, course of treatment, transfusion stash time, infusion speed, solvent dosage, solvent category, dermatitis, Xueshuantong, Naodanbai, hypertension, Sanqi Xiaozhong, capsule, fasudil, chronic obstructive pulmonary disease, loose-jointed pill, enteritis, erysipelas, syringomyelia, pelvic infection, aescin	Six algorithms: SVM, RF, AdaBoost, XGBoost, ANN, and LR	Training and testing	Sensitivity, specificity, recall, F1 score	XGBoost, AdaBoost, and RF exhibited relatively similar performance (XGBoost recall: 75.00%, XGBoost specificity: 92.86%, XGBoost F1 score: 82.98%)
Zhao, 2015a³²	N	27	Allergy, unspecified (737), unspecified adverse effect of drug or medicament (442), other complications following infusion, transfusion, and therapeutic injection (342), generalized skin eruption attributable to drugs and medicaments (156), angioneurotic edema (108)	N	Nine algorithms: DT, SVM Poly, SVM RBF, RF, KNN, AdaBoost, bagging, NB, and LR	Training and testing; 10-fold cross-validation	AUC, accuracy	RF (accuracy: 84.47%, AUC: 0.76)
Zhao, 2015b³³	N	14	Allergy, unspecified (984), unspecified adverse effect of drug or medicament (472), other complications following infusion, transfusion, and therapeutic injection (353), maternal care for (suspected) damage to fetus by drugs (334), generalized skin eruption attributable to drugs and medicaments (174)	N	Three algorithms: RF with three strategies (BE, BBE, and BWE)	Training and testing; 10-fold cross-validation	AUC, accuracy, precision, recall, F1 score, precision–recall curve	RF-BWE had the best predictive performance (accuracy: 88.35%, recall: 43.49%, precision: 52.21%, F1 score: 45.00%)
Zhao, 2016³¹	N	19	Secondary thrombocytopenia (1246), unspecified adverse effect of drug or medicament (1047), drug-induced aplastic anemia (593), allergy (574), other complications following infusion, transfusion, and therapeutic injection (538)	N	Four algorithms: RF with two learned weights and two preassigned weights (LWA, LWS, PWA, and PWS)	Training and testing; 10-fold cross-validation	AUC, accuracy, precision–recall curve	RF-LWS had the best predictive performance (accuracy: 88.85%, AUC: 0.80)

Evaluating the quality of studies

The PROBAST risk-of-bias assessment indicated that the majority of included studies had a high risk of bias but low applicability concerns (Table S3 and Figure S1). Among the 10 included studies, seven were deemed to have a high risk of bias, and three were found to have high applicability concerns. Only three studies^5,15,30 displayed both a low risk of bias and low applicability concerns.

Based on the assessment using ChAMAI, the overall mean score of the included studies was 23.7 (95% CI = 20.00–32.50), which constituted less than 50% of the maximum score (Table S4 and Figure S2). Only three studies achieved a score of at least 25.00. Low mean scores were recorded for the dimensions of data understanding, data preparation, and deployment, each falling below 50% of the maximum score. Conversely, the dimensions of problem understanding, modeling, and validation achieved high mean scores, particularly with respect to modeling, for which a full score was obtained.

ADEs

Eight studies^5,15,28–33 provided information on the types of ADEs identified, with the reported numbers ranging from 14 to 102 (Table S5). Drug allergies, angioneurotic edema, Stevens–Johnson syndrome (SJS), anaphylactic shock, and contact dermatitis were discussed in all eight studies. Additionally, oversedation/hypotension, hypoglycemia, thrombocytopenia, cardiomyopathy attributable to drugs and external agents, nephrotoxicity/creatinine disorder, and drug-induced adrenocortical insufficiency were commonly documented.

Convulsions, hyperglycemia, respiratory depression, bronchospasm, and dyspnea were only mentioned in the study targeting children,³⁰ whereas bradycardia was only mentioned in the study of older patients.⁵ The incidences of the aforementioned ADEs were lower than 1%.^5,30

Predictive performance for different predictive methods

The AUC serves as a critical indicator of model performance. Seven studies reported AUCs ranging from 0.26 to 0.95.^{5,15,28,30–33} ML algorithms such as Light Gradient-Boosting Machine (LightGBM), AdaBoost, CatBoost, XGBoost, and TPOT demonstrated high AUCs exceeding 0.90. RF, with an average AUC of 0.83, exhibited the potential to achieve high AUCs when combined with resampling-based approaches, yielding a range of 0.9448 to 0.9457.
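For readers less familiar with the metric, the AUC can be read as the probability that a randomly chosen patient with an ADE receives a higher predicted risk than a randomly chosen patient without one. The sketch below computes it by direct pairwise comparison; the function name, labels, and scores are hypothetical and are not taken from any included study.

```python
def auc_from_scores(labels, scores):
    """AUC as the fraction of positive-negative pairs ranked correctly (ties count 0.5)."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative case")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Hypothetical predicted risks for four patients (1 = ADE occurred).
print(auc_from_scores([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1]))  # 0.75
```

On this reading, a value of 0.5 corresponds to chance-level ranking, so an AUC below 0.5 indicates a model that ranks patients worse than chance.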

The performance was assessed using metrics such as accuracy, precision, sensitivity, specificity, and the F1 score, as reported in seven, four, five, five, and five studies, respectively. The mean accuracy, precision, sensitivity, specificity, and F1 score were 80.77%, 41.48%, 46.02%, 63.37%, and 51.16%, respectively, as presented in Figure 2.
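These metrics all derive from the same 2 × 2 contingency table, as the following sketch shows. The counts are hypothetical: they were chosen only so that precision and recall match the values reported for GBDT in Table 1 (44% and 25%), and the function name is an illustration rather than code from any included study.

```python
def classification_metrics(tp, fp, fn, tn):
    """Standard metrics from a 2 x 2 contingency table (denominators assumed non-zero)."""
    precision = tp / (tp + fp)
    sensitivity = tp / (tp + fn)  # also called recall
    specificity = tn / (tn + fp)
    accuracy = (tp + tn) / (tp + fp + fn + tn)
    f1 = 2 * precision * sensitivity / (precision + sensitivity)
    return {
        "accuracy": accuracy,
        "precision": precision,
        "sensitivity": sensitivity,
        "specificity": specificity,
        "f1": f1,
    }


# Hypothetical counts giving precision = 11/25 = 0.44 and recall = 11/44 = 0.25.
metrics = classification_metrics(tp=11, fp=14, fn=33, tn=942)
print(round(metrics["f1"], 4))  # 0.3188
```

With a precision of 44% and a recall of 25%, the harmonic mean is 31.88%, which agrees with the F1 score reported for that model. The same toy table also shows how accuracy can remain high when ADEs are rare even though precision and sensitivity are modest.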

Figure 2.

Summary of prediction model performance. XGBoost, eXtreme Gradient Boosting; GBDT, gradient boosting decision tree; LightGBM, Light Gradient-Boosting Machine; GBM, gradient boosting machine; RF, random forest; TPOT, Tree-based Pipeline Optimization Tool; LR, logistic regression; SVM, support vector machine; LASSO, least absolute shrinkage and selection operator; EN, elastic net; DT, decision tree; MLP, multilayer perceptron; KNN, K-nearest neighbor; NB, naïve Bayes; ANN, artificial neural network.

Meta-regression

The contingency tables from three prediction studies^5,15,30 were extracted. The details are presented in Table S6. The pooled estimated AUC was 0.72 (95% CI = 0.68–0.75, Figure 3), whereas the pooled sensitivity and specificity were 0.40 (95% CI = 0.31–0.50, I² = 99.33%) and 0.92 (95% CI = 0.87–0.96, I² = 99.96%), respectively. In comparison with LR, the meta-regression analysis indicated that ML algorithms yielded higher specificity [ML vs. LR: 0.93 (95% CI = 0.88–0.96) vs. 0.65 (95% CI = 0.65–0.66)] but lower sensitivity [ML vs. LR: 0.38 (95% CI = 0.29–0.49) vs. 0.68 (95% CI = 0.66–0.69); Figures 4 and S3].

Figure 3.

Hierarchical summary receiver operating characteristic curves for the three studies included in the meta-analysis of adverse drug event prediction. The 95% confidence interval is a visual representation of between-study heterogeneity.

Figure 4.

Forest plots of sensitivity and specificity.

Heterogeneity

The meta-regression results revealed substantial heterogeneity regarding the pooled estimate (I² > 99%). The results of sensitivity analysis indicated that the within-study heterogeneity was low. Because different ML algorithms were used in the three prediction studies, we could not conduct sensitivity analysis of the different models. The results of sensitivity analysis are presented in Table S7. The funnel plot is depicted in Figure S4.

Discussion

EHRs have been extensively used to document changes in illness, therapeutic regimens, laboratory test results, and radiological images.³⁶ Predicting ADEs based on EHRs meets the clinical need to reduce ADE rates, and it holds promise for enhancing the quality of healthcare. Because EHRs combine structured and unstructured data, conventional statistical methods face difficulties in predicting adverse events based on them. ML algorithms have been widely embraced in the medical domain, particularly for prognostic predictions, and they are well-suited for intricate data landscapes. Previous systematic reviews emphasized the identification and diagnosis of safety events, and ADE prediction was only a part of these studies.^37,38 In addition, although some articles^5,15,30 that met our inclusion criteria were published within the search time frame of these systematic reviews, none of them was included. In the present study, we conducted a systematic review of the use of ML algorithms in forecasting multiple drug-related events based on EHRs and clinical notes.

The Stockholm EPR Corpus, MIMIC-IV, and self-built databases were the most commonly used data sources. The Stockholm EPR Corpus was developed by Karolinska University Hospital in Sweden, and it encompasses comprehensive diagnostic information, drug administrations, clinical measurements, and free-text clinical notes.³³ MIMIC-IV is a product of a collaboration between Beth Israel Deaconess Medical Center and the Massachusetts Institute of Technology, and it provides deidentified data, including patient measurements, diagnoses, procedures, treatments, and free-text clinical notes.³⁹ One study used a Chinese database covering 30,703 patients (demographic data, procedures, and clinical notes).³⁴ Data in the Stockholm EPR Corpus, MIMIC-IV, and the Chinese database were both unstructured and structured,^28,34 consistent with the actual forecast scenario. In the Chinese database, clinical notes included diagnoses and medications in the Chinese language, and thus, researchers introduced natural language processing (NLP) methods to process them.³⁴ Three studies relied on self-built databases derived from the EHRs of hospitals, and they contained structured data.^5,30,35 Compared with models built on the Stockholm EPR Corpus, MIMIC-IV, or the Chinese database, models built on self-built databases allow researchers to review medical records for more information, thus avoiding missing data. In addition, the occurrence of ADEs differs among institutions; hence, a prediction model based on a self-built database is more in line with the actual situation of the institution. International Classification of Diseases, 10th Revision codes were applied as the standard terminology for diagnoses in studies based on the Stockholm EPR Corpus and MIMIC-IV.^{15,28,29,31,32} The use of standard terminology contributes to greater standardization of the database, thereby enhancing the reliability of the results. Based on these results, we recommend applying self-built databases and standard terminology to build ADE prediction models and improve the applicability and accuracy of the models.

ML is the scientific discipline that focuses on how computers learn from data, and it arose at the intersection of statistics and computer science.⁴⁰ This marriage between statistics and computer science is driven by the unique computational challenges of building statistical models from massive datasets.⁴¹ ML is conveniently subclassified into supervised learning and unsupervised learning. Because ADE prediction models are built from existing data in which the outcomes are already recorded, supervised learning was used. Twenty ML algorithms were employed in the included studies, with ensemble learning emerging as the predominant approach. Ensemble learning, which combines the predictions of multiple weak learners, can obtain superior predictions.^42,43 Bagging and boosting algorithms are the representative ensemble learning methods. In boosting algorithms, each weak learner undergoes further training with an updated set including the misclassified instances from the prior training iteration.⁴⁴ Five studies documented the performance of boosting algorithms, with aggregate performance metrics portraying favorable outcomes.^{5,15,30,32,34} The average AUC and precision of boosting algorithms were 0.82 (95% CI = 0.72–0.92) and 0.47 (95% CI = 0.10–0.69), respectively. AdaBoost and XGBoost demonstrated superior performance compared with LightGBM, gradient boosting machine, and gradient boosting decision tree, as evidenced by the AUC, F1 scores, and precision, although meta-regression was infeasible because of the inadequate contingency tables.
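The way boosting shifts attention toward misclassified instances can be illustrated with one round of the classic AdaBoost weight update. This is a minimal sketch of that single algorithm under textbook assumptions, not the implementation used by any included study; the function name and the four-patient example are hypothetical.

```python
import math


def adaboost_reweight(weights, correct):
    """One AdaBoost round: return (learner_weight, updated_sample_weights).

    weights: current sample weights, summing to 1.
    correct: booleans, True where the weak learner classified the sample correctly.
    """
    error = sum(w for w, ok in zip(weights, correct) if not ok)
    if not 0.0 < error < 0.5:
        raise ValueError("weighted error must lie strictly between 0 and 0.5")
    alpha = 0.5 * math.log((1.0 - error) / error)
    # Misclassified samples are up-weighted; correctly classified ones are down-weighted.
    raw = [w * math.exp(-alpha if ok else alpha) for w, ok in zip(weights, correct)]
    total = sum(raw)
    return alpha, [r / total for r in raw]


# Four hypothetical patients with equal initial weights; the last one is misclassified.
alpha, new_weights = adaboost_reweight([0.25] * 4, [True, True, True, False])
print([round(w, 4) for w in new_weights])  # [0.1667, 0.1667, 0.1667, 0.5]
```

After the update, the single misclassified patient carries half of the total weight, so the next weak learner is pushed to classify that patient correctly.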

Bagging algorithms, which aggregate the predictions of multiple decision trees, take a distinct approach from boosting algorithms. This method resamples the training set with replacement to produce sets of the same size as the original set, thereby reducing classifier variance and overfitting.⁴⁵ RF, as the representative algorithm, was the most frequently reported.^{5,15,28–30,32–35} EHR data featuring numerous sparse features can lead to suboptimal predictive performance for RF.⁴⁶ Therefore, the RF algorithms in the included studies were refined through adjustments in configuration, learned weights, tree sizes, and integration with diverse resampling approaches. The results indicated that the mean AUC and precision for unimproved RF were 0.81 (95% CI = 0.74–0.94) and 0.36 (95% CI = 0.10–0.75), respectively, whereas those for improved RF were 0.83 (95% CI = 0.75–0.95) and 0.51 (95% CI = 0.49–0.53), respectively. Among these, improved RF combined with resampling until an informative feature was found or until no more features were left demonstrated superior performance, yielding an AUC of 0.95 and an F1 score of 0.84.²⁸
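The two ingredients of bagging described above, bootstrap resampling and aggregation of the base learners' predictions, can be sketched as follows. The helper names and patient identifiers are hypothetical, and the sketch omits the tree-fitting step that a full RF implementation would include.

```python
import random


def bootstrap_sample(rows, rng):
    """Draw len(rows) items from rows with replacement (same size as the original set)."""
    n = len(rows)
    return [rows[rng.randrange(n)] for _ in range(n)]


def majority_vote(predictions):
    """Aggregate 0/1 predictions from several base learners; ties resolve to 0."""
    return 1 if 2 * sum(predictions) > len(predictions) else 0


rng = random.Random(0)  # fixed seed only so the illustration is repeatable
patients = ["p1", "p2", "p3", "p4", "p5"]  # hypothetical patient identifiers
resampled = bootstrap_sample(patients, rng)
print(len(resampled) == len(patients))  # True
print(majority_vote([1, 0, 1]))  # 1
```

Because sampling is done with replacement, some patients typically appear more than once in a resampled set and others not at all, which is what makes the individual trees differ from one another.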

SVM was also widely reported. Its fundamental principle involves identifying the maximum margin hyperplane within the input space to separate the training dataset.⁴⁷ SVM excels in addressing pattern recognition challenges associated with limited samples, non-linearity, and high dimensionality, particularly in the realm of classification problems.⁴⁸ Nevertheless, the performance of SVM was inferior to that of ensemble learning, with a mean AUC of only 0.63 (95% CI = 0.59–0.67).^29,32,34 According to the aforementioned ML results, ensemble learning might be more popular and more suitable for building ADE prediction models. In addition, we encourage adjustment of the configuration to improve model performance.

Several studies described findings on the performance of LR. It was found to yield comparable results to non-LR methods, such as SVM, artificial neural network, K-nearest neighbor, and naïve Bayes (NB).^15,32,34 Meta-regression analysis indicated that non-LR methods exhibited higher specificity than LR, albeit with lower sensitivity. Therefore, LR might also achieve favorable performance. The selection of appropriate algorithms should be contingent on the specific research issue and application context.

Eight studies provided information on the types of ADEs.^5,15,28–33 Allergies were most frequently mentioned, with an incidence ranging from 1% to 6%. Drug allergies can be linked to any form of medication. Although most allergies are transitory, some can result in severe consequences, such as drug reaction with eosinophilia and systemic symptoms and SJS.⁴⁹ Oversedation/hypotension was also prevalent and associated with blood pressure medications, sedative hypnotics, and anesthetics, and its incidence ranged from 0.3% to 2%. This incidence might be higher in pediatric inpatients³⁰ because of the considerable individual variability in sedative hypnotic or anesthetic dosages among children coupled with significant variations in children’s sensitivity to these drugs.⁵⁰ Furthermore, respiratory depression, bronchospasm, and dyspnea were exclusively observed in children, and they might be linked to the use of sedative hypnotics or anesthetics, underscoring the need for cautious administration of these medications in pediatric patients.

The risk factors associated with ADEs were reported in four studies.^5,15,30,34 Although these risk factors varied, certain common factors consistently emerged, including the length of hospitalization, polypharmacy, patient age, and the use of high-risk medications.^5,15,30,34 Cross-sectional investigations demonstrated that patients requiring prolonged hospitalization and a higher number of medications have a higher risk of drug-related injuries.^13,51,52 Advanced age emerged as a significant risk factor for ADEs given that older patients are more susceptible to drug-related events because of the presence of multiple comorbidities, polypharmacy, challenges in medication monitoring, and age-related changes in pharmacokinetics and pharmacodynamics.⁵³ High-risk medications, including glucocorticoids, anticoagulants, non-steroidal anti-inflammatory drugs, and chemotherapeutic agents, were also among the risk factors.⁵⁴ In contrast, patients undergoing surgery usually receive only intravenous infusion therapy during hospitalization, and they rarely experience ADEs. These findings underscored the importance of reducing hospital stays, simplifying treatment regimens, and avoiding the use of high-risk medications as potential strategies for ADE prevention. Although it is difficult to avoid these risk factors in clinical practice, healthcare providers are advised to exercise heightened vigilance in monitoring patients with risk factors and to promptly address ADEs when they occur.

The heterogeneity among the included studies was significant, and although it was reduced within the same study, it was not eliminated. The high heterogeneity was related to differences in the databases used, predictors, ML algorithms, hyperparameters, and the populations included, which made it difficult to avoid.⁵⁵
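
As an illustration of how between-study heterogeneity is commonly quantified, the following sketch computes Cochran's Q and the I² statistic from study-level effects. It assumes inverse-variance weighting, which is a standard choice but is not confirmed here as the exact procedure used in this review; the effect sizes and variances are hypothetical.

```python
def cochran_q_and_i2(effects, variances):
    """Return (Q, I2 in percent) for study-level effects with known variances."""
    weights = [1.0 / v for v in variances]  # inverse-variance weights
    pooled = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    # Q: weighted squared deviations of each study from the pooled estimate.
    q = sum(w * (e - pooled) ** 2 for w, e in zip(weights, effects))
    df = len(effects) - 1
    # I2: share of total variation attributed to heterogeneity, floored at 0.
    i2 = max(0.0, (q - df) / q) * 100.0 if q > 0 else 0.0
    return q, i2

# Hypothetical logit-transformed effects and their variances for three models.
effects = [0.2, 0.8, 1.4]
variances = [0.04, 0.04, 0.04]

q, i2 = cochran_q_and_i2(effects, variances)
print(f"Q = {q:.2f}, I2 = {i2:.1f}%")
```

With these hypothetical inputs, most of the observed variation is attributed to differences between studies rather than sampling error, which is the situation described above.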

The present study had several limitations. First, the quality of the included studies was not high. Only three studies were deemed to have a low risk of bias and scores higher than 25 (maximum, 50) based on PROBAST and ChAMAI assessments. Most of the included studies lacked proper reporting of data and data preprocessing procedures. Second, contingency tables, which could provide a pooled estimate for comparing predictive performance, were only available for three studies involving 13 models. Third, certain input variables, such as chief complaints, can benefit from advanced preprocessing methods such as NLP. Unfortunately, only one study combined NLP with ML for ADE prediction, and the reliability of this method was not verified. Finally, the application of new ML models, such as convolutional neural networks, recurrent neural networks, and bidirectional long short-term memory with conditional random field algorithms, for predicting drug safety events using EHRs was rare. Hence, researchers should focus on innovative ML algorithms to enhance the predictive capabilities of models and promote the application of these advancements in the future.

Conclusions

This systematic review provided evidence supporting the potential of ML to be incorporated into EHRs for predicting multiple ADEs, improving the quality of patient care, and reducing drug-related harm. Future studies should consider adopting more rigorous reporting standards and newer ML techniques to enhance the effectiveness of ML models in clinical practice.

Supplemental Material

Supplemental material, sj-pdf-1-imr-10.1177_03000605241302304 for Machine learning to predict adverse drug events based on electronic health records: a systematic review and meta-analysis by Qiaozhi Hu, Jiafeng Li, Xiaoqi Li, Dan Zou, Ting Xu and Zhiyao He in Journal of International Medical Research

Footnotes

Author contributions

QZH and TX conceived the study. QZH, DZ, and XQL wrote the initial protocol, performed the literature search, and screened articles. QZH and ZYH wrote the initial manuscript. QZH and XQL performed statistical analysis. JFL and QZH wrote the revised manuscript.

Declaration of conflicting interest

The authors declare that there is no conflict of interest.

Funding

This study was supported by the Sichuan Science and Technology Support Program (grant number: 2023NSFSC1696), the Science and Technology project of Chengdu Health Commission (grant number: 2022020), and the Science and Technology program of Tibet Autonomous Region (grant number: XZ202401RK0005). This research was additionally supported by the National Key Clinical Specialties Construction Program.

ORCID iD

Qiaozhi Hu

Supplementary material

Supplemental material for this article is available online.

References

1. Amelung, Meid, Nafe, et al. Association of preventable adverse drug events with inpatients’ length of stay-A propensity-matched cohort study. Int J Clin Pract 2017; 71.

2. Al Meslamani AZ. Underreporting of Adverse Drug Events: a Look into the Extent, Causes, and Potential Solutions. Expert Opin Drug Saf 2023; 22: 351–354. https://doi.org/10.1080/14740338.2023.2224558.

3. Al Meslamani AZ. Why are outcome-based drug safety research studies scarce? Insights into operational challenges and potential solutions. Expert Opin Drug Saf 2024; 23: 145–148. https://doi.org/10.1080/14740338.2024.2305368.

4. Gong, Teng, Wang, et al. In silico prediction of potential drug-induced nephrotoxicity with machine learning methods. J Appl Toxicol 2022; 42: 1639–1650.

5. et al. Predicting adverse drug events in older inpatients: a machine learning study. Int J Clin Pharm 2022; 44: 1304–1311.

6. Huang, Niu, Green, et al. Systematic prediction of pharmacodynamic drug-drug interactions through protein-protein-interaction network. PLoS Comput Biol 2013; 9: e1002998.

7. Zitnik, Agrawal, Leskovec. Modeling polypharmacy side effects with graph convolutional networks. Bioinformatics 2018; 34: i457–i466.

8. Zhang, Deng, et al. Application of Artificial Intelligence in Drug-Drug Interactions Prediction: A Review. J Chem Inf Model 2024; 64: 2158–2173.

9. Zheng, Peng, Zhang, et al. Predicting adverse drug reactions of combined medication from heterogeneous pharmacologic databases. BMC Bioinformatics 2018; 19: 517.

10. Vallano, Cereza, Pedròs, et al. Obstacles and solutions for spontaneous reporting of adverse drug reactions in the hospital. Br J Clin Pharmacol 2005; 60: 653–658.

11. Shojania, Thomas EJ. Trends in adverse events over time: why are we not improving? BMJ Qual Saf 2013; 22: 273–277.

12. Klopotowska, Wierenga, Stuijt CCM, et al. Adverse drug events in older hospitalized patients: results and reliability of a comprehensive and structured identification strategy. PLoS One 2013; 8: e71045.

13. Qin, Zhan, et al. Validating the Chinese geriatric trigger tool and analyzing adverse drug event associated risk factors in elderly Chinese patients: A retrospective review. PLoS One 2020; 15: e0232095.

14. Song, Xiao, et al. Adverse drug events in Chinese pediatric inpatients and associated risk factors: a retrospective review using the Global Trigger Tool. Sci Rep 2018; 8: 2573.

15. Langenberger. Machine learning as a tool to identify inpatients who are not at risk of adverse drug events in a large dataset of a tertiary care hospital in the USA. Br J Clin Pharmacol 2023; 89: 3523–38.

16. Deo RC. Machine Learning in Medicine. Circulation 2015; 132: 1920–1930.

17. Chen, Wang, et al. Differentiation of Low-Grade Astrocytoma From Anaplastic Astrocytoma Using Radiomics-Based Machine Learning Techniques. Front Oncol 2021; 11: 521313.

18. Tian, Jin, et al. Developing a Warning Model of Potentially Inappropriate Medications in Older Chinese Outpatients in Tertiary Hospitals: A Machine-Learning Study. J Clin Med 2023; 12: 2619.

19. Lei, Wang, Zhang, et al. Using Machine Learning to Predict Acute Kidney Injury After Aortic Arch Surgery. J Cardiothorac Vasc Anesth 2020; 34: 3321–3328.

20. et al. Machine learning to predict adverse drug reaction or event based on electronic health records: a systematic review and meta-analysis. Research Square, https://doi.org/10.21203/rs.3.rs-4081881/v1 (2024, accessed 5 August 2024).

21. Page, McKenzie, Bossuyt, et al. The PRISMA 2020 statement: an updated guideline for reporting systematic reviews. BMJ 2021; 372: n71. https://doi.org/10.1136/bmj.n71.

22. Page, Shamseer, Tricco AC. Registration of systematic reviews in PROSPERO: 30,000 records and counting. Syst Rev 2018; 7: 32. https://doi.org/10.1186/s13643-018-0699-4.

23. Moons KGM, Wolff, Riley, et al. PROBAST: A Tool to Assess Risk of Bias and Applicability of Prediction Model Studies: Explanation and Elaboration. Ann Intern Med 2019; 170: W1–W33.

24. Cabitza, Campagner. The need to separate the wheat from the chaff in medical informatics: Introducing a comprehensive checklist for the (self)-assessment of medical AI studies. Int J Med Inform 2021; 153: 104510.

25. Zhou, Shi, et al. Machine learning predictive models for acute pancreatitis: A systematic review. Int J Med Inform 2022; 157: 104641.

26. Borenstein, Hedges, Higgins JPT, et al. A basic introduction to fixed-effect and random-effects models for meta-analysis. Res Synth Methods 2010; 1: 97–111.

27. Bagos PG. Meta-analysis in Stata using gllamm. Res Synth Methods 2015; 6: 310–332. https://doi.org/10.1002/jrsm.1157.

28. Karlsson, Bostrom. Handling Sparsity with Random Forests when Predicting Adverse Drug Events from Electronic Health Records. 2014 IEEE International Conference on Healthcare Informatics. https://doi.org/10.1109/ICHI.2014.10.

29. Karlsson, Bostrom. Predicting Adverse Drug Events using Heterogeneous Event Sequences. 2016 IEEE International Conference on Healthcare Informatics. https://doi.org/10.1109/ICHI.2016.64.

30. Xiao, et al. Predicting Adverse Drug Events in Chinese Pediatric Inpatients With the Associated Risk Factors: A Machine Learning Study. Front Pharmacol 2021; 12: 659099.

31. Zhao, Henriksson. Learning temporal weights of clinical events using variable importance. BMC Med Inform Decis Mak 2016; 16 Suppl 2: 71.

32. Zhao, Henriksson, Asker, et al. Predictive modeling of structured electronic health records for adverse drug event detection. BMC Med Inform Decis Mak 2015; 15 Suppl 4: S1.

33. Zhao, Henriksson, Kvist, et al. Handling Temporality of Clinical Events for Drug Safety Surveillance. AMIA Annu Symp Proc 2015; 2015: 1371–1380.

34. Zhao, Yuan. Prediction of Adverse Drug Reaction using Machine Learning and Deep Learning Based on an Imbalanced Electronic Medical Records Dataset. 2021 ICMHI. https://doi.org/10.1145/3472813.3472817.

35. Ponraj, Balan RVS, Vignesh. Analysis and Prediction of Adverse Reaction of Drugs with Machine Learning Models for Tracking the Severity. Arab J Sci Eng 2021; (46). https://doi.org/10.1007/s13369-021-05999-5.

36. Alzu’bi, Watzlaf VJM, Sheridan. Electronic Health Record (EHR) Abstraction. Perspect Health Inf Manag 2021; 18: 1g.

37. Deimazar, Sheikhtaheri. Machine learning models to detect and predict patient safety events using electronic health records: A systematic review. Int J Med Inform 2023; 180: 105246. https://doi.org/10.1016/j.ijmedinf.2023.105246.

38. Yasrebi-de Kom IAR, Dongelmans, de Keizer, et al. Electronic health record-based prediction models for in-hospital adverse drug event diagnosis or prognosis: a systematic review. J Am Med Inform Assoc 2023; 30: 978–988. https://doi.org/10.1093/jamia/ocad014.

39. Johnson AEW, Bulgarelli, Shen, et al. MIMIC-IV, a freely accessible electronic health record dataset. Sci Data 2023; 10: 1. https://doi.org/35. Giesa N, Heeren P, Klopfenstein S, et al. MIMIC-IV as a Clinical Data Schema. Stud Health Technol Inform 2022; 294: 559–60.

40. Lip, Nieuwlaat, Pisters, et al. Refining clinical risk stratification for predicting stroke and thromboembolism in atrial fibrillation using a novel risk factor-based approach: the euro heart survey on atrial fibrillation. Chest 2010; 137(2): 263–72. https://doi.org/10.1378/chest.09-1584.

41. Deo RC. Machine Learning in Medicine. Circulation 2015; 132: 1920–30. https://doi.org/10.1161/CIRCULATIONAHA.115.001593.

42. Alotaibi, Ilyas. Ensemble-Learning Framework for Intrusion Detection to Enhance Internet of Things’ Devices Security. Sensors (Basel) 2023; 23: 5568.

43. Ahn, Kim. Ensemble Machine Learning of Gradient Boosting (XGBoost, LightGBM, CatBoost) and Attention-Based CNN-LSTM for Harmful Algal Blooms Forecasting. Toxins (Basel) 2023; 15: 608.

44. González-Recio, Jiménez-Montero, Alenda. The gradient boosting algorithm and random boosting for genome-assisted evaluation in large data sets. J Dairy Sci 2013; 96: 614–624.

45. Sisodia, Verma. Prediction performance of individual and ensemble learners for chronic kidney disease. 2017 ICIC 2017; 1027–1031. https://doi.org/10.1109/ICICI.2017.8365295.

46. Abdel-Fattah, Othman, Goher. Predicting Chronic Kidney Disease Using Hybrid Machine Learning Based on Apache Spark. Comput Intell Neurosci 2022; 2022: 9898831.

47. Wang, Shao, Zhou, et al. Support Vector Machine Classifier via L0/1 Soft-Margin Loss. IEEE Trans Pattern Anal Mach Intell 2022; 44: 7253–7265.

48. Wang, Sun. Construction of a new smooth support vector machine model and its application in heart disease diagnosis. PLoS One 2023; 18: e0280804.

49. Joint Task Force on Practice Parameters, American Academy of Allergy, Asthma and Immunology, et al. Drug allergy: an updated practice parameter. Ann Allergy Asthma Immunol 2010; 105: 259–73.

50. Lyttle, Rainford NEA, Gamble, et al. Levetiracetam versus phenytoin for second-line treatment of paediatric convulsive status epilepticus (EcLiPSE): a multicentre, open-label, randomised trial. Lancet 2019; 393: 2125–2134.

51. Marcum, Arbogast, Behrens, et al. Utility of an adverse drug event trigger tool in Veterans Affairs nursing facilities. Consult Pharm 2013; 28: 99–109.

52. Toscano Guzmán, Banqueri, Otero, et al. Validating a Trigger Tool for Detecting Adverse Drug Events in Elderly Patients With Multimorbidity (TRIGGER-CHRON). J Patient Saf 2021; 17: e976–e982.

53. Gurwitz, Field, Harrold, et al. Incidence and preventability of adverse drug events among older persons in the ambulatory setting. JAMA 2003; 289: 1107–1116.

54. Ridge, Macintyre, Kitsos, et al. Assessing risk of adverse drug reactions in the elderly: a feasibility study. Int J Clin Pharm 2019; 41: 1483–1490.

55. Xie, Wang, Pei, et al. Machine Learning-Based Prediction Models for Delirium: A Systematic Review and Meta-Analysis. J Am Med Dir Assoc 2022; 23: 1655–1668.e6. https://doi.org/10.1016/j.jamda.2022.06.020.
