Abstract
Objective:
To demonstrate the usefulness of applying supervised machine-learning analyses to identify specific groups of patients that experience high levels of mortality post-interhospital transfer.
Methods:
This was a cross-sectional analysis of data from the Healthcare Cost and Utilization Project 2013 National Inpatient Sample. We applied two supervised machine-learning approaches: (1) classification and regression tree analysis to identify mutually exclusive groups of patients experiencing the highest levels of mortality, along with their associated characteristics, and (2) random forest to identify the relative importance of each characteristic’s contribution to post-transfer mortality.
Results:
A total of 21 independent groups of patients were identified, with 13 of those groups exhibiting at least double the national average rate of mortality post-transfer. Patient characteristics identified as influencing post-transfer mortality the most included: diagnosis of a circulatory disorder, comorbidity of coagulopathy, diagnosis of cancer, and age.
Conclusions:
Employing supervised machine-learning analyses made it computationally feasible to assess all potential combinations of available patient characteristics and to identify groups of patients experiencing the highest rates of mortality post-interhospital transfer, providing potentially useful data to support the development of clinical decision support systems in future work.
Approximately 1.6 million patients undergo interhospital transfer annually. 1 Patients undergoing interhospital transfer experience up to three times higher mortality,2,3 use double the resources, and have twice the length of stay of patients not transferred from another hospital. 1 Interhospital transfers consist of two primary patient types: those experiencing an immediately life-threatening condition (e.g. myocardial infarction, trauma) and those who are not experiencing an immediately life-threatening condition. Transfer for patients experiencing an immediately life-threatening condition has been shown to be a life-saving measure, with reductions in mortality for trauma4–8 and heart attack patients, 9 but has yielded conflicting results for stroke10,11 and minimally injured trauma patients.12–14
Decisions to transfer patients from lower to higher levels of care for an immediately life-threatening condition are common and are often supported by referral networks established within local regions, such as trauma and stroke networks. For those patients not experiencing an immediately life-threatening condition, the decision to transfer is complicated and is based on individual provider judgment, family request, or other factors. Currently, no national guidelines 15 exist to guide interhospital transfer; furthermore, there is limited understanding of who does and does not benefit from being transferred and exactly when those transfers should occur.
The overall poor outcomes that interhospital transfer patients experience and mixed outcomes for patients that are immediately transferred for time-sensitive conditions suggest that we do not have a good understanding of immediately life-threatening conditions. Outside of patients that are transferred for intervention that must be performed immediately upon arrival at the receiving hospital (e.g. cardiac catheterization and surgical procedure), our recognition of what constitutes a patient experiencing an immediately life-threatening condition needs to be reconceptualized.
Reconceptualizing the types of transfer patients requires the focus to move beyond the currently used broad categories (e.g. trauma and stroke) to categories built on patient-specific characteristics that identify those who should be considered for transfer. Therefore, to begin moving toward a more patient-centric approach, the purpose of this study was to identify specific groups of patients, and their associated characteristics, that experience high levels of mortality post-transfer.
Methods
Data source
We used the 2013 National Inpatient Sample (NIS). The NIS is part of the Healthcare Cost and Utilization Project (HCUP) from the Agency for Healthcare Research and Quality (AHRQ) and is the largest all-payer inpatient database in the United States, with a nationally representative sample of approximately 8 million inpatient discharges each year. 16 We identified all adult patients aged 19 years or older who were transferred from one acute care hospital to another to compose an interhospital transfer cohort.
Measures
Our main outcome measure is in-hospital mortality, as recorded in the discharge status on the hospital billing record. To keep the analysis clinically meaningful, we incorporated only those covariates available in the data set that are useful in guiding clinical decision-making or practice. Patient-level covariates included the following: age (continuous), gender, payer type, race, comorbidity, and primary diagnosis.
To include the primary diagnosis while keeping the analysis computationally feasible, we accounted for the primary diagnosis via the Clinical Classifications Software (CCS) for the International Classification of Diseases, Ninth Revision, Clinical Modification (ICD-9-CM),17,18 using the multi-level diagnosis category labels, for a total of 17 categories. The multi-level CCS is a standard, established method to collapse over 14 000 diagnosis codes and 3900 procedure codes into clinically meaningful categories. 18 Refer to Supplementary Material Table 1 for a listing of the covariates and CCS categories used in the model.
We measured the presence of comorbid conditions using the Elixhauser comorbidity index list. The Elixhauser index contains 30 comorbid conditions defined through secondary ICD-9-CM diagnosis codes and Diagnosis Related Group (DRG) codes.19-21 We excluded both the arthritis and the fluid and electrolyte disorder comorbidities. Many patients have arthritis, and for the purposes of this study, it was not considered a factor that differentiates patients for transfer. In addition, most patients who are hospitalized and undergo interhospital transfer have some form of abnormal laboratory value, so a fluid and electrolyte disorder is not clinically useful for identifying discrete subgroups of patients that could provide new insight for reconceptualizing patient categories for transfer.
To describe the severity of the patient population and to enable comparison between the data subsets used in the analysis, we used the All Patient Refined Diagnosis Related Groups (APR-DRG) Risk of Mortality covariate provided by HCUP. The APR-DRGs are assigned using proprietary software developed by 3M Health Information Systems and include the base APR-DRG, the severity of illness subclass, and the risk of mortality subclass within each base APR-DRG. 16 We used this variable only to describe the study samples and did not include it in model development or analyses, because it is a combination of other covariates already included in the model (e.g. age, gender, and diagnosis) and also relies on proprietary calculations that are not available within the electronic medical record (EMR) and thus would not be usable in decision-support tools or other patient care activities relying on primary data.
System-level covariates in the analysis included the following: admission month, admission on a weekend, hospital bed size, hospital teaching status, hospital region, hospital control/ownership, and patient location before hospitalization. We also accounted for whether patients received a major operating room procedure that was either diagnostic or therapeutic occurring post-transfer. The University Hospitals Case Medical Center Institutional Review Board determined that this study meets the exemption criteria for human subject research (IRB #em-14-30).
Statistical analysis
Frequency counts and percentages were tabulated for the categorical outcome—mortality. For descriptive analysis, we used discharge-level survey weights provided in the NIS that accounted for complex survey design effects. The final sample for this study is a nationally representative sample generated via the weighting variable provided with the data set. However, the classification and regression tree (CART) analysis does not apply the sample weights, which leads to smaller samples in the terminal nodes. We excluded cases where the mortality variable was missing. We did not exclude any observations with missing values for the independent variables specifically because a robust feature of the CART algorithm is that it handles missing data using the surrogate split method—a method that finds an alternative variable that is highly correlated with the missing variable to determine the split. 22 While there are other methods for handling missing data in CART analysis, 23 the default setting in CART packages is to skip missing variables to streamline the analysis. 24 In this analysis, we employed the surrogate split method that identifies and supplements a surrogate variable.
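The surrogate split idea described above can be sketched in a few lines of Python. This is only a simplified illustration, not the rpart implementation used in the study: the function name `best_surrogate_threshold` and the toy values are hypothetical, and the sketch considers only rules of the form "value below threshold goes left", whereas a full implementation also considers the reversed direction and ranks several surrogates.

```python
from typing import Optional, Sequence, Tuple


def best_surrogate_threshold(
    primary_goes_left: Sequence[bool],
    surrogate_values: Sequence[Optional[float]],
) -> Tuple[Optional[float], float]:
    """Find the cut on a second variable that best mimics the primary split.

    Returns (threshold, agreement), where agreement is the share of rows
    (with an observed surrogate value) that the rule `value < threshold`
    sends in the same direction as the primary split.
    """
    pairs = [
        (value, left)
        for value, left in zip(surrogate_values, primary_goes_left)
        if value is not None
    ]
    cuts = sorted({value for value, _ in pairs})
    best_threshold, best_agreement = None, 0.0
    # Candidate thresholds are midpoints between adjacent observed values.
    for low, high in zip(cuts, cuts[1:]):
        threshold = (low + high) / 2
        agreement = sum((value < threshold) == left for value, left in pairs) / len(pairs)
        if agreement > best_agreement:
            best_threshold, best_agreement = threshold, agreement
    return best_threshold, best_agreement


# Hypothetical toy data: the primary split sends the first three patients left.
primary_goes_left = [True, True, True, False, False, False]
# A second, correlated variable; None marks a missing value.
surrogate_values = [1.0, 2.0, 3.0, 4.0, None, 6.0]

threshold, agreement = best_surrogate_threshold(primary_goes_left, surrogate_values)
print(threshold, agreement)
```

A patient whose primary splitting variable is missing would then be routed left or right by comparing the surrogate variable with the returned threshold.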
Supervised machine-learning approaches
We used CART analysis to identify combinations of predictors associated with post-transfer mortality. CART is a tree-building technique in which the choice of “splitting” variables is based on an exhaustive search of all possibilities, using a recursive partitioning algorithm, resulting in mutually exclusive groups that are the most different with respect to the dependent variable. 25 The tree-building process leads to terminal nodes (or leaves), at which point the nodes cannot be divided further; the tree then needs to be pruned to avoid over-fitting and increase efficiency. 26 First, CART recursively partitions the patients into smaller and smaller homogeneous groups, in this case based on the presence of specific combinations of clinical conditions. The purpose is to reduce variation within each group and to improve the fit as much as possible. Next, CART uses these groups to predict post-transfer mortality. We used the following stopping criteria (based on the model tuning described below): a maximum tree depth of 10 splits, a minimum node size of 50 subjects, a complexity parameter of 0.001 (a split had to improve the overall fit by at least this amount to be attempted), and the information impurity index to determine node splits.
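To make the greedy, information-based split search concrete, the following Python sketch finds the single best threshold on one continuous variable. It is a minimal illustration under stated assumptions, not the rpart code used in the study: the function names, the `min_node` argument, and the eight toy patients are all hypothetical, and a real tree repeats this search over every covariate and then recurses into each child node.

```python
import math
from typing import Sequence, Tuple


def entropy(labels: Sequence[int]) -> float:
    """Shannon entropy (in bits) of a list of 0/1 outcomes."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))


def best_split(
    values: Sequence[float], labels: Sequence[int], min_node: int = 1
) -> Tuple[float, float]:
    """Return (threshold, information gain) of the best single split."""
    parent = entropy(labels)
    n = len(labels)
    cuts = sorted(set(values))
    best_threshold, best_gain = float("nan"), 0.0
    # Try every midpoint between adjacent observed values.
    for low, high in zip(cuts, cuts[1:]):
        threshold = (low + high) / 2
        left = [y for x, y in zip(values, labels) if x < threshold]
        right = [y for x, y in zip(values, labels) if x >= threshold]
        if len(left) < min_node or len(right) < min_node:
            continue  # respect the minimum node size
        child = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        gain = parent - child
        if gain > best_gain:
            best_threshold, best_gain = threshold, gain
    return best_threshold, best_gain


# Hypothetical toy data: age in years and in-hospital death (1 = died).
ages = [25, 30, 38, 45, 60, 72, 80, 85]
died = [0, 0, 0, 0, 1, 0, 1, 1]

split_age, gain = best_split(ages, died)
print(split_age, round(gain, 3))
```

Because the search only looks for the best split at the current node, it is greedy: a variable that would be useful further down the tree can be passed over at this step.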
To build our model, we partitioned the study data into a training data set (70% of the data) and a testing data set (the remaining 30%) using random sampling within each class of the outcome variable. We used 10-fold cross-validation repeated three times on the training data set to build the CART models. Since mortality is highly unbalanced, we weighted the “cost” of a false negative to be higher than that of a false positive to improve sensitivity and produce a more meaningful model. We then tested the accuracy of our models on the testing data set using a confusion matrix and by calculating the area under the curve. We also used the Matthews correlation coefficient, a measure of accuracy that accounts for imbalanced outcomes. 27 We chose our final model for the outcome based on accuracy and interpretability.
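The 70/30 partition with random sampling within each outcome class can be illustrated with a short Python sketch. The study itself used R; the function name `stratified_split`, the seed value, and the toy outcome vector below are assumptions made purely for illustration.

```python
import random
from typing import Dict, List, Sequence, Tuple


def stratified_split(
    labels: Sequence[int], train_fraction: float = 0.7, seed: int = 42
) -> Tuple[List[int], List[int]]:
    """Split row indices into train/test, sampling within each outcome class."""
    rng = random.Random(seed)  # a fixed seed makes the partition reproducible
    by_class: Dict[int, List[int]] = {}
    for index, label in enumerate(labels):
        by_class.setdefault(label, []).append(index)
    train: List[int] = []
    test: List[int] = []
    for indices in by_class.values():
        rng.shuffle(indices)
        cut = round(train_fraction * len(indices))
        train.extend(indices[:cut])
        test.extend(indices[cut:])
    return sorted(train), sorted(test)


# Hypothetical toy outcome: 10 deaths among 100 transfers.
labels = [1] * 10 + [0] * 90
train_idx, test_idx = stratified_split(labels)

print(len(train_idx), len(test_idx))
print(sum(labels[i] for i in train_idx), sum(labels[i] for i in test_idx))
```

Sampling within each class keeps the rare outcome represented in both partitions at roughly the same rate, which matters when mortality is as unbalanced as it is here.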
In addition, we compared our final models with those from a random forest model to see whether they agreed on which variables are the most important predictors. Random forest is a bootstrap aggregation method that creates multiple decision trees using random variable selection. Breiman et al 22 provide a detailed description of random forest. We used SAS software version 9.4 28 for data management; for our statistical analyses, we used R version 3.3.1 and RStudio 1.0.136 29 and the “rpart” (CART), “partykit” (tree graphics), “randomForest” (random forest), and “caret” (model tuning and cross-validation) packages.
Results
In 2013, approximately 1 456 422 adult patients underwent interhospital transfer; 52% were male, 66% White, 11% Black, and 7% Hispanic. The primary payers for the interhospital transfers were Medicare (44%), Medicaid (19%), and private insurance (26%). Further demographic characteristics of the nationally weighted sample are provided in Table 1, and the frequency of the primary diagnosis, categorized by the multi-level diagnosis category of the CCS, is provided in Table 2. As expected, circulatory disease was the most frequent diagnosis in the older age groups (45 and older), whereas mental health was the most frequent in the youngest age group (19-44). The frequency of comorbidities across age cohorts is presented in Table 3. The distribution of patient characteristics across the total study population and between the training and testing data sets is available in Supplementary Material Table 2.
Table 1. Sample characteristics*.
Table 2. Clinical classification frequencies*.
Table 3. Comorbidity frequencies*.
*Total subjects and % represent the data set weighted to reflect national estimates.
The final CART identified 21 discrete subgroups of patients (Figure 1). Trees from the training data set and the testing holdout data set contained the same splits and terminal nodes. Of the 21 subgroups, 12 were for patients with a primary cardiac diagnosis (n = 16 798 patients), eight were for patients with a primary diagnosis of cancer (n = 35 030 patients), and the remaining subgroup had neither a cardiac nor a cancer primary diagnosis (n = 151 464).

Figure 1. Classification and regression tree.
Subgroups with a primary cardiac diagnosis (Figure 1—right side) experiencing the highest rates of post-transfer mortality included (1) patients greater than 40 years old with either coagulopathy (30% mortality) or with metastasis (~35%), (2) patients greater than 52 years old with cardiac arrhythmia and either liver failure (~35%) or pulmonary circulatory comorbidity (30%), and (3) patients greater than 72 years without Medicare (35%). The payer mix of the patients in the subgroup that was greater than 72 years and without Medicare consisted of 10% on Medicaid, 56% private insurance, 10% self-pay, and 24% not specified. Alternatively, patients that were less than 40 years (5% mortality) or greater than 40 years and underwent an operating room procedure (5% mortality) experienced the highest rates of survival.
Subgroups of patients with cancer as the primary diagnosis (Figure 1—left side) that experienced the highest rates of mortality post-transfer included (1) those older than 83 years (35% mortality), (2) those older than 68 years with either hypertension (15% mortality) or Medicare coverage (10% mortality), and (3) those younger than 68 years with coagulopathy and either arrhythmia (25% mortality) or pulmonary circulatory comorbidity (35% mortality).
The results from the random forest analysis are presented in Figure 2. Variables identified as important via random forest but not included in any of the CART pathways include weight loss, congestive heart failure, and genitourinary conditions.

Figure 2. Random forest results.
Model performance
We tested the performance of our model on a holdout data set. The area under the curve was 0.69, and the Matthews correlation coefficient was 0.198. The model had a positive predictive value (PPV) of 0.291 and a negative predictive value (NPV) of 0.960. The sensitivity was 0.18 and the specificity was 0.98. As we further describe below, the aim of this model was to identify clinically meaningful groups of patients rather than to predict post-transfer mortality as accurately as possible.
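For readers less familiar with these measures, the sketch below shows how each one follows from the four cells of a confusion matrix. The counts are hypothetical and were chosen only to resemble an unbalanced outcome; they are not the study's confusion matrix, and the function name `confusion_metrics` is likewise an assumption.

```python
import math
from typing import Dict


def confusion_metrics(tp: int, fp: int, tn: int, fn: int) -> Dict[str, float]:
    """Standard performance measures from confusion-matrix counts."""
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "sensitivity": tp / (tp + fn),  # deaths correctly flagged
        "specificity": tn / (tn + fp),  # survivors correctly cleared
        "ppv": tp / (tp + fp),          # flagged patients who died
        "npv": tn / (tn + fn),          # cleared patients who survived
        # Matthews correlation coefficient; defined as 0 when a margin is empty.
        "mcc": (tp * tn - fp * fn) / denominator if denominator else 0.0,
    }


# Hypothetical counts: 100 deaths and 1900 survivors in a testing data set.
metrics = confusion_metrics(tp=18, fp=40, tn=1860, fn=82)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With a rare outcome, specificity and NPV can be high even when sensitivity is low, which is why the Matthews correlation coefficient is reported alongside them.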
Discussion
This analysis identified 21 distinct groups of patients, 13 of which experienced mortality rates more than double the national average post-transfer mortality, which ranges from 4.7% to 5.2%. 1 In 2013, the national mortality for all-cause hospital admissions was 2%. This analysis included all patients, even patients who underwent transfer for routine procedures such as orthopedic cases or appendectomies, who were accounted for in the far left of the tree in the lowest mortality group (n = 151 464). The other lowest mortality group consisted of those with a circulatory diagnosis who were younger than 40.5 years.
The left side of the tree, or the non-cardiac side, was dominated by patients with cancer, who compose the second largest group of patients undergoing transfer (n = 35 020), with the highest mortality experienced by those with coagulopathy as a comorbid condition. Coagulopathy is also represented on the right side as a significant contributor to increased mortality post-transfer. Of note, comorbid conditions in the AHRQ NIS are not directly related to the primary diagnosis or necessarily the main reason for admission, likely having originated before the current hospitalization, thus representing a pre-existing condition. 16 The finding that coagulopathy is a significant predictor of post-transfer mortality was surprising, but its significance is reinforced by the random forest analysis (Figure 2) and our other work looking at surgical populations. 30 Coagulopathy typically manifests as a secondary physiologic response to a primary disturbance such as cancer or trauma and has been found to be an independent predictor of in-hospital mortality, regardless of transfer status.31,32 This study reinforces including coagulopathy, whether it is a comorbidity or a condition on the active problem list for the current hospitalization, as a covariate in future modeling efforts.
This study identified that patients with a cardiac diagnosis who were either younger than 40 years, or older than 40 years and received an operating room procedure, experienced the highest survival rates post-transfer. While we cannot ascertain the specific operating room procedures performed, the high survival rates for this clinical group receiving a major therapeutic or diagnostic operating room procedure support the role that transfer plays in improving mortality. Likely, these patients, who lack concomitant comorbidity or other significant clinical characteristics, represent those experiencing a myocardial infarction or other time-sensitive condition that benefits from rapid transport and subsequent intervention.
While the primary focus of this study was not to predict patient mortality, the methods employed identified groups of patients that experience mortality at rates two to three times higher than the expected rate of post-transfer mortality of 5% and thus provide specific groups of patients that warrant focused inquiry. Current efforts to leverage EMR data to support developing clinical decision-support systems (e.g. health system transfer command centers) 33 can benefit by initially focusing on high-risk target populations like those identified in this analysis.
The random forest model identified several important variables not included in the individual tree: weight loss, congestive heart failure, and genitourinary conditions. The variable importance results reported for the random forest are the average results of many individual trees; many trees included the three omitted variables while others did not. Given that the CART tree represents an individual tree and sample (in this case, sample 789 out of 10 000), it is possible that variables identified in the random forest analysis are not represented in this specific tree. Omission of these variables from the individual tree can be due to the greedy splitting procedure, which identifies the best split at that particular point in the tree without considering the impact on the full model. Therefore, depending on the random sample chosen to run the CART, the tree for each sample can include different variables and split points.
During the analytic process, we randomly selected the samples and “froze” them; otherwise, we would have obtained a different training and testing sample each time the analysis was performed. The omission of the variables underlines the importance of running complementary or additional analyses when using atheoretical approaches.
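The point that a forest-level importance ranking can include variables absent from any one tree can be shown with a toy Python sketch. The per-tree scores below are invented for illustration, the function name `mean_importance` is hypothetical, and a simple average stands in for the importance measures that random forest software actually computes.

```python
from collections import defaultdict
from typing import Dict, List


def mean_importance(per_tree: List[Dict[str, float]]) -> Dict[str, float]:
    """Average each variable's importance over all trees.

    A variable that a given tree never split on contributes 0 for that tree.
    """
    totals: Dict[str, float] = defaultdict(float)
    for tree in per_tree:
        for name, value in tree.items():
            totals[name] += value
    return {name: total / len(per_tree) for name, total in totals.items()}


# Hypothetical importance scores from three trees grown on different samples.
trees = [
    {"age": 0.5, "coagulopathy": 0.5},
    {"age": 0.5, "weight_loss": 0.5},
    {"coagulopathy": 0.25, "weight_loss": 0.75},
]

forest = mean_importance(trees)
for name, value in sorted(forest.items(), key=lambda item: -item[1]):
    print(f"{name}: {value:.3f}")
```

In this toy example the first tree never uses weight loss, yet weight loss still ranks highly once the three trees are averaged, which mirrors the discrepancy between the single CART tree and the random forest described above.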
Our model had an area under the curve of 0.69, which is reasonable performance for events such as mortality that are rare and difficult to predict. The area under the curve (AUC) is in line with other studies that have used the Elixhauser or Charlson comorbidity indices to predict mortality, which reported values between 0.65 and 0.80.34,35 It is difficult to compare AUC across studies that assess different patient populations, and to our knowledge, this is the first model to predict mortality among transferred patients of all diagnoses.
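As a brief illustration of what the AUC measures, it can be computed as the probability that a randomly chosen patient who died received a higher predicted risk than a randomly chosen survivor, with ties counted as one half. The sketch below uses hypothetical scores and a hypothetical function name `auc`; for a tree, the score would be the mortality rate of the patient's terminal node, so ties are common.

```python
from typing import Sequence


def auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Area under the ROC curve via pairwise comparison of cases and non-cases."""
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    # Each (death, survivor) pair scores 1 if ranked correctly, 0.5 if tied.
    wins = sum((p > n) + 0.5 * (p == n) for p in positives for n in negatives)
    return wins / (len(positives) * len(negatives))


# Hypothetical terminal-node mortality rates and observed outcomes (1 = died).
scores = [0.05, 0.05, 0.30, 0.30, 0.35, 0.10]
labels = [0, 0, 0, 1, 1, 0]

example_auc = auc(scores, labels)
print(example_auc)
```

An AUC of 0.5 corresponds to ranking no better than chance and 1.0 to perfect ranking, which is the scale on which the reported 0.69 should be read.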
Finally, employing supervised machine-learning techniques provides distinct analytical advantages over the traditional modeling techniques that we have used in past analyses. The primary advantage is the ability to assess all available covariates in every possible combination. Rather than identifying the influence of a given covariate while the others are held constant, the supervised machine-learning techniques employed allow us to test every possible combination of the covariates to identify clinically meaningful combinations and report those combinations in mutually exclusive groups that can be easily incorporated into decision-support modeling or other approaches such as developing more precise clinical nomograms. In addition, the mutually exclusive groups provide easily recognizable patient characteristics in specific combinations that are more descriptive than the change in odds associated with one variable while the others are held constant. For example, our past work employing regression identified that the odds of death increased with age, with age being included in the regression via seven categories. 1 Alternatively, in CART, we are able to include age as a continuous variable and let the technique determine what the significant splits in age are for a given combination of characteristics. For example, in Figure 1, age is split five different times in the tree, with each split signifying a significant difference in outcome for those patients above or below that age threshold. Attempting to identify these age categories via other approaches would be burdensome, if achievable at all.
Limitations
Secondary analyses of existing databases present several limitations. First, we were only able to include basic demographic characteristics, the Elixhauser comorbidities, primary diagnosis via the CCS, and basic hospital descriptors. While the data are nationally representative, the lack of rich clinical descriptors limits the depth of the analyses and applicability of the findings. Second, primary diagnosis determination is complex and is influenced by the clinical course of care as well as coding for payment. This well-known limitation has been identified by others. Third, we included all patients that were transferred between hospitals, including groups of patients that on one end would not impact overall transfer mortality rates (e.g. mental health) and, on the other end, patients who exceeded the level of care available at their current hospital (i.e. community hospital) and had to be transferred to a tertiary center. Fourth, a variable such as operating room procedure is only a broad indicator of care and does not provide specificity in differentiating between normal and unexpected rates of mortality. However, the inclusion of operating room procedure across the models highlights the need to conduct further in-depth investigations into specifically which transfers and corresponding procedures impart improved morbidity and mortality, highlighting a strength of this broad approach to focus future inquiry. Finally, we do not know why each patient was transferred or which elements contributed to the decision. This will be addressed in future work.
Conclusions
This study analyzed a nationally representative sample of hospital discharges to identify groups of patients who experience increased mortality after undergoing interhospital transfer. The supervised machine-learning approach implemented identified 13 distinct groups of patients who experience post-transfer mortality more than double the national average mortality of post-transfer patients. Of the 13 groups, 10 experience mortality rates of 20% or greater, identifying specific groups of patients that may benefit from being transferred sooner based on their individual characteristics. The individual characteristics identified do not necessarily fall into the currently used categories of transfer patients, supporting the reconceptualization of which patient groups should be considered for immediate transfer to another hospital.
Supplemental Material
Supplemental material, Supplementary_Material_Table_1_xyz1364584bb4390 for Applying Supervised Machine Learning to Identify Which Patient Characteristics Identify the Highest Rates of Mortality Post-Interhospital Transfer by Andrew P Reimer, Nicholas K Schiltz, Vanessa P Ho, Elizabeth A Madigan and Siran M Koroukian in Biomedical Informatics Insights
Supplemental material, Supplementary_Material_Table_2_xyz13645265ff583 for Applying Supervised Machine Learning to Identify Which Patient Characteristics Identify the Highest Rates of Mortality Post-Interhospital Transfer by Andrew P Reimer, Nicholas K Schiltz, Vanessa P Ho, Elizabeth A Madigan and Siran M Koroukian in Biomedical Informatics Insights
Footnotes
Declaration of conflicting interests:
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding:
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: N.K.S. is supported by the Clinical and Translational Science Collaborative of Cleveland (grant no.: KL2TR000440) from the National Center for Advancing Translational Sciences (NCATS) component of the National Institutes of Health. S.M.K. was supported by the Clinical and Translational Science Collaborative of Cleveland (grant no.: UL1TR000439) from the National Center for Advancing Translational Sciences (NCATS) component of the National Institutes of Health and NIH roadmap for Medical Research. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of the NIH.
Author Contributions
APR, NKS, VPH, EAM, SMK contributed to planning; APR and NKS conducted analyses; and APR, NKS, VPH, EAM and SMK contributed to drafting the manuscript and critical revisions.
