Introduction
Prognostic models are tools developed to predict an outcome. 1 In the critical care setting, several scores were developed to measure disease severity and estimate the probability of hospital mortality.2–4 They can be used to evaluate a unit's results over time and for benchmarking. For individual decision-making, however, their application is not straightforward and should not substitute for the clinician's judgment. Several prognostic models were developed to predict outcomes in trauma victims. Most of them are based on information from early trauma care and hospital admission, including data on the distribution and severity of anatomical injuries and on physiologic derangements measured through laboratory or clinical variables.5–7 Scores based on anatomical injury patterns, derived from the Abbreviated Injury Scale (AIS), 5 are among the most widely used in epidemiological and clinical studies of multiple trauma, such as the Injury Severity Score (ISS) 6 and its later modification, the New Injury Severity Score (NISS). 7 However, these anatomical scores, as well as the sequential organ failure assessment (SOFA) score, were developed from expert opinion without data-driven validation. Furthermore, their predictive performance in critically ill trauma patients has not been thoroughly addressed. In critically ill patients, the simplified acute physiology score (SAPS) 3 3 and SOFA 4 are two of the most frequently used disease severity and organ dysfunction scores, respectively. However, they have shown some miscalibration in trauma patients, and few studies have externally validated them in this population.8,9
Although the development of new prognostic scores is frequently considered, external validation of existing scores is a fundamental step and is more important than developing new models when others exist. 1 We hypothesized that SAPS 3 and SOFA would perform better than anatomical scores in the prediction of hospital mortality in critically ill trauma patients. Therefore, we performed an external validation study comparing SAPS 3, SOFA, and anatomical trauma scores (ISS and NISS) in multiple trauma patients admitted to a specialized trauma and acute care surgery intensive care unit (ICU) in Brazil.
Material and Methods
Study Design, Setting, and Ethics
This is a retrospective analysis of a prospectively collected cohort of critically ill adult trauma patients admitted to a specialized trauma and acute care surgery ICU at a referral hospital in São Paulo, Brazil (Hospital das Clínicas, University of São Paulo Medical School).
At the time of data collection, the hospital comprised 95 ICU beds, of which 17 were dedicated specifically to critically ill trauma patients. The hospital is a trauma center and referral center for a large portion of the São Paulo metropolitan area (serving a total population of nearly 8,000,000 people). It admits trauma patients brought by ground and air retrieval from pre-hospital trauma care and has access to trauma surgery, interventional radiology, neurosurgery, and vascular, orthopedic, and plastic surgery. Patients were admitted either directly from the emergency department or after surgery.
Ethical approval was obtained from the Comissão de ética para análise de projetos de pesquisa under the number CAAE 1.905.974, which waived the need for informed consent given the retrospective nature of the study. This manuscript is reported according to the recommendations from Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis (TRIPOD) for validation studies 10 (Supplemental Material File 1).
Study Population and Outcome
The study included all consecutive first ICU admissions of trauma victims aged ≥15 years from May 2012 to January 2016. We excluded ICU readmissions, patients admitted to hospital with trauma but whose primary reason for ICU admission was not trauma (eg, sepsis), and those with a mechanism of injury associated with exclusive hypoxic-ischemic injury (hanging or drowning) or exclusive burn injury. Trauma patients primarily admitted to other ICUs in the institution and later transferred to our ICU were not included in this analysis.
The outcome used to compare the performance of the predictive scores was hospital mortality. Predictive performance was assessed by comparing discrimination, calibration, and clinical utility between the scores, as detailed below.
Data Collection, Definitions, and Data Cleaning
Demographics, comorbidities, injury mechanism, trauma characteristics, and illness severity variables at baseline and at 24 h after ICU admission were retrieved from the ICU electronic health record (EHR), designed to collect specific data regarding trauma patients.
We cross-checked and, when necessary, corrected all variables used for SAPS 3 calculation, recalculating the score as appropriate. We calculated the SOFA score from the most abnormal values within 24 h of ICU admission, including physiologic, clinical, and laboratory data from admission until the first evaluation on the following morning. 11
We calculated ISS and NISS based on the anatomical injuries reported in the medical records and on analysis of the computed tomography (CT) results, blinded to study outcome. In this trauma center, whole-body CT is performed before ICU admission in victims of high-energy injuries, at the consultant trauma surgeon's discretion. Injuries were graded according to the AIS as 1 (minor), 2 (moderate), 3 (serious), 4 (severe), 5 (critical), or 6 (unsurvivable) in 6 body sections: head and neck, face, thorax, abdomen, pelvis and extremities, and external injuries. 5 According to their original publications, ISS is calculated as the sum of the squares of the highest AIS grades in the 3 most severely injured body sections, 6 and NISS as the sum of the squares of the 3 highest AIS grades, regardless of body section. 7 All calculations were performed by trained evaluators blinded to patients' outcomes.
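The difference between the two anatomical scores can be made concrete with a minimal sketch. It is not the code used in this study; the function names, the region labels, and the example patient are hypothetical. It also assumes the common convention that any AIS 6 injury sets the score to the maximum of 75, which is not stated in the text above.

```python
from collections import defaultdict

MAX_SCORE = 75  # 5^2 + 5^2 + 5^2, also assigned by convention when any AIS is 6


def iss(injuries):
    """Injury Severity Score from a list of (body_section, ais_grade) pairs.

    Uses the single highest AIS grade in each of the 3 most severely
    injured body sections.
    """
    if any(grade == 6 for _, grade in injuries):
        return MAX_SCORE  # assumption: conventional handling of AIS 6
    worst_per_section = defaultdict(int)
    for section, grade in injuries:
        worst_per_section[section] = max(worst_per_section[section], grade)
    top3 = sorted(worst_per_section.values(), reverse=True)[:3]
    return sum(grade * grade for grade in top3)


def niss(injuries):
    """New Injury Severity Score: 3 highest AIS grades, any body section."""
    if any(grade == 6 for _, grade in injuries):
        return MAX_SCORE  # assumption: conventional handling of AIS 6
    top3 = sorted((grade for _, grade in injuries), reverse=True)[:3]
    return sum(grade * grade for grade in top3)


# Hypothetical patient with two head injuries, one thoracic, one extremity.
example = [
    ("head_neck", 4),
    ("head_neck", 3),
    ("thorax", 3),
    ("pelvis_extremities", 2),
]
print(iss(example), niss(example))
```

In this invented example the second head injury counts toward NISS but not toward ISS, which is why NISS can exceed ISS when several severe injuries share one body section.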
Statistical Analysis
We did not perform an a priori sample size calculation, but our sample comprised 1053 critically ill trauma patients with 251 events, which provides adequate statistical power, as suggested by Collins et al. 12
Categorical variables are presented as absolute frequencies and percentages. Continuous variables are presented as means and standard deviations or medians with 25th and 75th percentiles, according to their distribution. We compared categorical variables with the chi-squared test and quantitative variables with t-tests or Mann-Whitney tests, as appropriate. Discrimination was evaluated through the C-statistic, with pairwise comparisons done by the DeLong test. 13 To assess consistency between scores (ie, how closely related the information carried by the models is), we calculated Cronbach's alpha with 95% confidence intervals obtained from 100 bootstrap replications. The Brier score was calculated to evaluate the overall accuracy of each model. We primarily assessed calibration through calibration plots using locally weighted scatterplot smoothing (Lowess) curves. We also assessed expected-to-observed ratios, calibration-in-the-large, and calibration slopes. We assessed clinical utility with decision curves, comparing the different models and equations with each other and against the treat-all and treat-none approaches. 14
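For readers less familiar with these measures, the following sketch shows how the C-statistic and the Brier score can be computed from first principles. It is an illustration only: the analyses in this study were run in Stata, the function names are hypothetical, and the toy data are invented.

```python
def c_statistic(scores, outcomes):
    """Probability that a randomly chosen non-survivor has a higher score
    than a randomly chosen survivor (ties count as one half)."""
    deaths = [s for s, y in zip(scores, outcomes) if y == 1]
    survivors = [s for s, y in zip(scores, outcomes) if y == 0]
    if not deaths or not survivors:
        raise ValueError("both outcome classes are required")
    concordant = 0.0
    for d in deaths:
        for s in survivors:
            if d > s:
                concordant += 1.0
            elif d == s:
                concordant += 0.5
    return concordant / (len(deaths) * len(survivors))


def brier_score(probabilities, outcomes):
    """Mean squared difference between predicted risk and observed outcome."""
    if len(probabilities) != len(outcomes) or not outcomes:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - y) ** 2 for p, y in zip(probabilities, outcomes)]
    return sum(squared) / len(squared)


# Invented toy data: 6 patients, outcome 1 = died in hospital.
toy_scores = [2, 4, 6, 8, 8, 11]
toy_probs = [0.1, 0.2, 0.6, 0.5, 0.7, 0.9]
toy_outcomes = [0, 0, 1, 0, 1, 1]
print(c_statistic(toy_scores, toy_outcomes))
print(brier_score(toy_probs, toy_outcomes))
```

The pairwise loop is quadratic in sample size, which is acceptable for illustration but slower than rank-based implementations on large cohorts.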
For SAPS 3, we calculated the logit and expected death probabilities from the general and Central/South America equations of the original publication 3 and we performed intercept and slope recalibration of the model. For SOFA, ISS, and NISS, since there is no published equation, we derived the equation from this dataset. Therefore, apparent validation for SOFA, ISS, NISS, and recalibrated SAPS 3 is presented, and full external validation is presented for SAPS 3 general and Central/South America equations.
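As a sketch of the steps above, the snippet below converts a SAPS 3 score into a logit and a predicted probability of hospital death, then applies an intercept and slope recalibration. The coefficients are those we understand to be the general equation of the original SAPS 3 publication 3 and should be verified against that source before any use; the recalibration intercept and slope in the example are placeholders, not the values estimated in this study.

```python
import math

# Assumed coefficients of the SAPS 3 general equation (verify against the
# original publication before use).
GENERAL_INTERCEPT = -32.6659
GENERAL_SHIFT = 20.5958
GENERAL_SLOPE = 7.3068


def expit(x):
    """Inverse logit."""
    return 1.0 / (1.0 + math.exp(-x))


def saps3_general_probability(saps3):
    """Return (logit, predicted probability of hospital death)."""
    logit = GENERAL_INTERCEPT + math.log(saps3 + GENERAL_SHIFT) * GENERAL_SLOPE
    return logit, expit(logit)


def recalibrate(logit, intercept, slope):
    """Apply logistic recalibration: new logit = intercept + slope * logit."""
    return expit(intercept + slope * logit)


# Median SAPS 3 in this cohort was 41.
logit_41, p_41 = saps3_general_probability(41)
# Placeholder recalibration parameters, chosen only for illustration.
p_41_recal = recalibrate(logit_41, intercept=0.5, slope=0.9)
print(round(p_41, 3), round(p_41_recal, 3))
```

With an intercept of 0 and a slope of 1 the recalibrated probability equals the original one, which is a convenient sanity check.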
Multiple steps were taken to handle missing data. Missing data, outliers, and disagreements in structured fields in the EHR were completed or corrected after reviewing information from unstructured fields. Given the small proportion of missing data (<5%), 15 the primary analysis was done on complete cases. As a sensitivity analysis, assuming missingness at random, we performed multiple imputation with chained equations, 16 including in the imputation model the outcome, the variables with missing values (SAPS 3 and SOFA), and auxiliary variables (age, Glasgow Coma Scale, ISS, NISS, and Charlson Comorbidity Index). Predictive mean matching was performed with 10 imputed datasets, 16 and the variance was corrected according to Rubin's rules. We performed a post-hoc comparison of precision-recall curves for the scores with the highest C-statistics to allow a more comprehensive comparison of the models. We also performed a subgroup analysis of patients with severe traumatic brain injury (sTBI) at ICU admission.
All analyses were performed in Stata SE 16.0 with the user-contributed packages dca, pmcalplot, brier, and prcurve. The significance level was set at α = 0.05, with no adjustment for multiple comparisons.
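The variance correction by Rubin's rules mentioned above can be sketched as follows. This is a generic illustration of pooling one estimate across imputed datasets, not the Stata routine used in the study; the function name and the example numbers are hypothetical.

```python
from statistics import mean, variance


def pool_rubin(estimates, variances):
    """Pool one parameter across m imputed datasets.

    estimates: point estimate from each imputed dataset
    variances: squared standard error from each imputed dataset
    Returns (pooled estimate, total variance).
    """
    m = len(estimates)
    if m < 2 or m != len(variances):
        raise ValueError("need at least 2 estimates and matching variances")
    pooled = mean(estimates)
    within = mean(variances)        # average within-imputation variance
    between = variance(estimates)   # between-imputation variance (divisor m - 1)
    total = within + (1 + 1 / m) * between
    return pooled, total


# Invented example with 3 imputed datasets.
pooled_estimate, total_variance = pool_rubin(
    [0.80, 0.82, 0.81], [0.0004, 0.0004, 0.0004]
)
print(pooled_estimate, total_variance)
```

The total variance is always at least as large as the average within-imputation variance, reflecting the extra uncertainty introduced by the missing values.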
Results
Sample Characteristics
From May 2012 to January 2016, 1984 patients were admitted to the ICU. Of those, 1053 patients were admitted for trauma and met the inclusion criteria for this study cohort, with a hospital mortality of 23.8% (Supplemental Material File 2, Figure S1). There were 50 patients with non-retrievable missing data for either SAPS 3 or SOFA score within 24 h of ICU admission. Thus, 1003 patients were included in the primary analysis. There were no missing data for the outcome of interest, hospital mortality.
Patients' main characteristics are described in Table 1. Most patients were male (84.2%), with a mean age of 40 (±18) years. Blunt trauma was the most frequent type of injury (90.7%), with road traffic injuries as the most common trauma mechanism (Supplemental Material File 2, Table S1). Traumatic brain injury was present in 67.8%, with sTBI in 43.3%. At the time of ICU admission, 846 patients (80.3%) were on mechanical ventilation and 644 patients (64.3%) were on vasoactive drugs. Median ICU length of stay (LOS) was 8 days, and median hospital LOS was 17 days. Median SAPS 3 was 41, median SOFA within 24 h was 7, median ISS was 29, and median NISS was 41. Hospital mortality was 23.8%.
Table 1. Sample Characteristics.
Continuous variables are presented as mean (SD) or median [P25, P75] as appropriate. Categorical variables are presented as n (%).
Abbreviations: ICU, intensive care unit; TBI, traumatic brain injury; CT, computed tomography; LOS, length-of-stay; SAPS 3, simplified acute physiology score, 3rd version; SOFA, sequential organ failure assessment; ISS, injury severity score; NISS, new injury severity score.
There were missing data for the following variables: age (n = 12), severe TBI (n = 8), SAPS 3 (n = 40), SOFA (n = 16).
Discrimination
The distributions of each score according to the outcome are presented in Supplemental Material File 2, Figures S2-S5. The models' discrimination is presented in Table 2 and Supplemental Material File 2, Figure S6. In pairwise comparisons, the discrimination of ICU admission SAPS 3 and SOFA did not differ, while both discriminated better than the anatomical scores. Although NISS had better discrimination than ISS, internal consistency (measured through Cronbach's alpha) was high between them, which was not observed among the other scores (Table 2).
Table 2. Discrimination and Internal Consistency Among the Four Scores.
Areas under the receiver operating characteristic curves (AUROCs) are presented on the diagonal (bold), with their respective 95% confidence intervals. Asymptotic P-values for between-score comparisons are presented below the diagonal (italic). Cronbach's alpha results for between-score comparisons, with their respective bootstrap 95% confidence intervals (100 replications), are presented above the diagonal.
Abbreviations: SAPS 3, simplified acute physiology score, 3rd version; SOFA, sequential organ failure assessment; ISS, Injury Severity Score; NISS, New Injury Severity Score.
Calibration
The models' equations and calibration measures are presented in Table 3 and the calibration plots in Figure 1. SAPS 3 was miscalibrated with both the general equation and the customized Central/South America equation. For the general equation, the predicted risk of hospital mortality was consistently lower than the observed mortality in this sample, especially in the lower risk stratum (Figure 1). After recalibration, SAPS 3 showed better calibration throughout the entire range of expected risks, similar to that observed with SOFA (Figure 1). Both SAPS 3 and SOFA overestimated mortality in the higher risk stratum. ISS and NISS were miscalibrated and had a narrow range of predicted risk (Figure 1).
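The summary calibration measures can be illustrated with a small sketch under stated assumptions: the expected-to-observed ratio is the sum of predicted risks over the number of deaths, calibration-in-the-large is the intercept of a logistic model with the original logit as an offset, and the calibration slope is the coefficient of the original logit in a logistic regression on the outcome. The Newton-Raphson fitting, the function names, and the toy data are ours and are not the pmcalplot implementation used in the study.

```python
import math


def expit(x):
    return 1.0 / (1.0 + math.exp(-x))


def expected_over_observed(probabilities, outcomes):
    """Sum of predicted risks divided by the number of observed deaths."""
    return sum(probabilities) / sum(outcomes)


def calibration_in_the_large(logits, outcomes, iterations=50):
    """Intercept a of logit(P) = a + 1 * original_logit (slope fixed at 1)."""
    a = 0.0
    for _ in range(iterations):
        p = [expit(a + l) for l in logits]
        gradient = sum(y - pi for y, pi in zip(outcomes, p))
        hessian = sum(pi * (1 - pi) for pi in p)
        a += gradient / hessian
    return a


def calibration_slope(logits, outcomes, iterations=50):
    """Fit logit(P) = a + b * original_logit; return (a, b)."""
    a, b = 0.0, 1.0
    for _ in range(iterations):
        p = [expit(a + b * l) for l in logits]
        w = [pi * (1 - pi) for pi in p]
        g0 = sum(y - pi for y, pi in zip(outcomes, p))
        g1 = sum((y - pi) * l for y, pi, l in zip(outcomes, p, logits))
        h00 = sum(w)
        h01 = sum(wi * l for wi, l in zip(w, logits))
        h11 = sum(wi * l * l for wi, l in zip(w, logits))
        det = h00 * h11 - h01 * h01
        # Newton step: solve the 2 x 2 system H * delta = gradient.
        a += (h11 * g0 - h01 * g1) / det
        b += (h00 * g1 - h01 * g0) / det
    return a, b


# Invented toy data: original logits and outcomes (1 = died in hospital).
toy_logits = [-2, -1, -1, 0, 0, 1, 1, 2]
toy_outcomes = [0, 0, 1, 0, 1, 0, 1, 1]
toy_probs = [expit(l) for l in toy_logits]
print(expected_over_observed(toy_probs, toy_outcomes))
print(calibration_in_the_large(toy_logits, toy_outcomes))
print(calibration_slope(toy_logits, toy_outcomes))
```

A slope below 1 indicates predictions that are too extreme, and a slope above 1 indicates predictions that are not extreme enough. The sketch has no safeguard against perfectly separated data, where the fit would not converge.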

Figure 1. Calibration plots for the prognostic scores.
Table 3. Equations, Brier Score, and Calibration Measures of the Prognostic Scores.
The calibration results for SOFA, ISS, and NISS are all apparent validation measures since no published equation exists to properly perform external validation. Their interpretation is therefore hampered and the calibration plots (Figure 1) provide a better account of calibration than these measures.
Abbreviations: E/O, expected over observed; CITL, calibration in the large; Brier, Brier score; SAPS 3, simplified acute physiology score, third version; SOFA, sequential organ failure assessment; ISS, injury severity score; NISS, new injury severity score; GE, general equation; CSA, Central and South America customized equation; RC, recalibrated equation.
Clinical Utility
Net benefit curves are presented in Figure 2. SOFA and recalibrated SAPS 3 showed a greater net benefit than the other scores, regardless of threshold probability. Both, particularly SOFA, showed a greater positive net benefit in the intermediate range of threshold probabilities, from 10% to 50% hospital mortality. ISS showed the worst net benefit in the decision curve analysis.
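To clarify what the decision curves plot, the sketch below computes the net benefit of a model at one threshold probability, together with the treat-all reference (treat-none is 0 by definition). It uses the standard definition, true positives minus false positives weighted by the odds of the threshold, all divided by sample size. The function names and toy data are hypothetical, and this is not the dca package used in the study.

```python
def net_benefit(probabilities, outcomes, threshold):
    """Net benefit of classifying as high risk when predicted risk >= threshold."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must be strictly between 0 and 1")
    n = len(outcomes)
    true_pos = sum(
        1 for p, y in zip(probabilities, outcomes) if p >= threshold and y == 1
    )
    false_pos = sum(
        1 for p, y in zip(probabilities, outcomes) if p >= threshold and y == 0
    )
    weight = threshold / (1 - threshold)  # harm of a false positive
    return true_pos / n - (false_pos / n) * weight


def net_benefit_treat_all(outcomes, threshold):
    """Net benefit when every patient is classified as high risk."""
    if not 0 < threshold < 1:
        raise ValueError("threshold must be strictly between 0 and 1")
    prevalence = sum(outcomes) / len(outcomes)
    return prevalence - (1 - prevalence) * threshold / (1 - threshold)


# Invented toy data: predicted risks and outcomes (1 = died in hospital).
toy_probs = [0.1, 0.2, 0.6, 0.5, 0.7, 0.9]
toy_outcomes = [0, 0, 1, 0, 1, 1]
nb_model = net_benefit(toy_probs, toy_outcomes, 0.3)
nb_all = net_benefit_treat_all(toy_outcomes, 0.3)
print(nb_model, nb_all)
```

Repeating the calculation over a grid of thresholds and plotting the results against the threshold yields a decision curve.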

Figure 2. Decision curve analysis of the tested prognostic scores.
Secondary Analyses
After multiple imputation of missing SOFA and SAPS 3 scores, there were no significant changes in the results for discrimination, calibration, and overall accuracy (Supplemental Material File 2, Table S2).
The comparison of precision-recall curves (SOFA vs SAPS 3) showed that SOFA had the best trade-off between true positive rate and positive predictive value across different probability thresholds (Supplemental Material File 2, Figure S7).
Subgroup analyses for sTBI patients are presented in Supplemental Material File 2, Table S3 and Figure S8. Overall, the areas under the receiver operating characteristic curves (AUROCs) decreased and the Brier score increased for all models, but SOFA remained the most discriminative model, with no major differences in calibration.
Discussion
In this external validation study of critically ill trauma patients, most of whom were young, with high utilization of organ support and a high prevalence of associated traumatic brain injury, admission SAPS 3 and SOFA within 24 h of ICU admission had good predictive performance for hospital mortality, while the classical anatomical trauma scores (ISS and NISS) performed only fairly. The SAPS 3 general and regional (Central and South America) equations were miscalibrated in this dataset, but calibration improved after intercept and slope recalibration. Compared with SAPS 3, SOFA had similar discrimination and slightly better clinical utility across a wide range of threshold probabilities.
SAPS 3 discrimination was inferior to that of its original publication, probably because trauma patients are a specific population and our cohort comprised young victims of trauma, penalized in the original model. SAPS 3 also presented poor calibration with both the general and regional equations, as occurred in the trauma subgroup of the original cohort. 3 These results are not unexpected, since some deterioration commonly occurs in external validation studies over time and when models are tested in specific subpopulations. 17 Poorly calibrated risk estimates could make this model less useful to evaluate ICU results, although still useful for comparisons. We thus recalibrated the model's intercept and slope, which yielded better performance, similar to that of SOFA.
Although originally described for patients with sepsis, 4 SOFA became the most widely used organ dysfunction score in different subpopulations of critically ill patients,18,19 likely because it is easy to calculate at the bedside. The literature reports its discrimination in small cohorts of trauma patients,8,9,20 but its calibration has not been thoroughly evaluated, and no equation had previously been published. SOFA had the best predictive performance in this cohort, possibly because of the high prevalence of young patients with few comorbidities and high clinical severity. This reinforces the importance of the burden of organ dysfunction as a main driver of hospital mortality in ICU-admitted trauma patients. Furthermore, we now present an equation that can be used for comparison in further research.
The anatomical scores based on AIS were developed over 50 years ago to describe the extent and severity of injuries in trauma patients and were derived from expert opinion, which could contribute to their inferior accuracy compared with data-derived models. 21 ISS and NISS may differ because they allow inclusion of different sets of injuries. While they can be similar in patients with injuries restricted to a single body segment, NISS has a potential advantage in multisystemic blunt trauma, as in our cohort. 22 However, Cronbach's alpha was high between these scores, showing high consistency, probably because most of the injuries had high AIS scores, regardless of containment to one or more body segments. In their original description, the authors demonstrated better accuracy with NISS than with ISS, which was confirmed by later studies,23,24 mostly in blunt trauma, as in our cohort. Nevertheless, both anatomical scores presented a lower AUROC than in previous descriptions, which is likely attributable to the subset of critically ill trauma patients included in this study. Such low discrimination expectedly led to poor calibration curves and net benefit, suggesting these scores are not fit for prognostication of critically ill trauma patients.
Patients with sTBI represent a sizeable subgroup of trauma patients in whom the performance of scoring systems could differ from that in the overall trauma population. In our cohort, discrimination deteriorated for SAPS 3 and SOFA in this subgroup, probably because these models do not include important neurological variables other than the Glasgow Coma Scale, such as pupil size and reactivity. 25 Nevertheless, non-neurological organ dysfunction is frequent and plays an important role in predicting outcomes, both in isolated TBI and in multisystemic trauma with TBI.26–28 Evaluating TBI patients was not our primary objective; further studies in this population should consider not only neurological variables but also the SOFA score as a good surrogate of non-neurological organ dysfunction.
Implications for Practice, Policy, and Research
Clinical utility has emerged as an important component of the assessment of prediction model performance. 14 In our study, the SOFA score had the best clinical utility across a wide range of threshold probabilities if specific decision-making is necessary, unlike ISS or NISS. However, the decision curve results reinforce that these scores are not fit for individual decision-making early in the ICU course, since they do not add net benefit at the upper end of threshold probabilities (>80%). Further research considering the dynamic nature of the SOFA score in the context of full-code admissions or ICU-trial admissions may provide more insights for individual decision-making.
For outcome evaluation and benchmarking, SAPS 3 remains a better tool than the SOFA score, given that it has a published equation. When applying SAPS 3 to calculate standardized mortality ratios, specialized trauma ICUs should be aware that some miscalibration does occur and that a standardized mortality ratio < 1 may not be attainable with the original equation, as has been described for COVID-19. 29 This does not, however, invalidate its use for evaluating ICU performance over time.
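For reference, a standardized mortality ratio is the number of observed deaths divided by the number expected from the model, that is, the sum of the individual predicted probabilities. A minimal sketch with a hypothetical function name and invented numbers:

```python
def standardized_mortality_ratio(observed_deaths, predicted_probabilities):
    """Observed deaths divided by the sum of predicted death probabilities."""
    expected_deaths = sum(predicted_probabilities)
    if expected_deaths <= 0:
        raise ValueError("expected number of deaths must be positive")
    return observed_deaths / expected_deaths


# Invented example: 5 patients, 3 deaths, 1.5 deaths expected by the model.
smr = standardized_mortality_ratio(3, [0.1, 0.2, 0.3, 0.4, 0.5])
print(smr)
```

When a model systematically underestimates risk in a given population, the ratio exceeds 1 even if care is adequate, which is the caveat raised above for the original SAPS 3 equation.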
Strengths and Limitations
This validation study has some strengths. First, the sample size is large, with an adequate number of events to draw valid conclusions. 12 Our results are also representative of a population of ICU-admitted critically ill trauma patients from a middle-income country, where there are no regional or national trauma databases with a reach similar to that of the international initiatives from which most publications on trauma scores originate. However, the results are likely not completely generalizable, considering the high patient severity and the complexity of care provided in a trauma referral hospital. Second, we adhered to the most recent guidance for validation of predictive models, including discrimination, calibration, and clinical utility measures. Current methodological literature highlights that most published research on prognostic models focuses on accuracy, a characteristic of limited applicability at the bedside, and fails to present measures of calibration. 17 Third, we present the equations used to validate the models, which allows further external validation of these models, a necessary step for the reproducibility of research results in this field. Finally, data missingness, a ubiquitous issue in retrospective EHR analysis, can be considered small in this dataset, 15 with consistent results in sensitivity analyses and no missing outcome data.
This study has limitations. First, the retrospective design based on EHR data can introduce measurement bias. However, multiple measures were taken to check the consistency and quality of the data. For admission SAPS 3 and SOFA within 24 h of ICU admission, although the individual component variables were prospectively collected, they were later corrected if necessary. The calculation of anatomical scores was indeed retrospective, based on CT results and on structured and unstructured fields in the medical records, as it would be if done prospectively. To deal with this limitation, anatomical scores were calculated blinded to patients' outcomes. Second, this is a single-center study from a specialized trauma ICU in a public academic hospital that is a referral center for major trauma, but the patients' profile can be considered representative of middle-income country cohorts, given the high prevalence of young victims of multisystemic blunt trauma, mostly from road traffic injuries. Third, the study period from 2012 to 2016 is a potential limitation, but there were no major changes in the structure or processes of care for trauma patients in our institution from 2017 onwards.
Conclusions
In this external validation study of critically ill trauma patients from a specialized ICU in Brazil, SAPS 3 and SOFA within 24 h of ICU admission outperformed the classical anatomical trauma scores (ISS and NISS) for hospital mortality prediction. SOFA presented the best combination of discrimination, calibration, and clinical utility in this cohort.
Supplemental Material
sj-docx-1-jic-10.1177_08850666231188051 - Supplemental material for Predictive Performance for Hospital Mortality of SAPS 3, SOFA, ISS, and New ISS in Critically Ill Trauma Patients: A Validation Cohort Study
Supplemental material, sj-docx-1-jic-10.1177_08850666231188051 for Predictive Performance for Hospital Mortality of SAPS 3, SOFA, ISS, and New ISS in Critically Ill Trauma Patients: A Validation Cohort Study by Roberta Muriel Longo Roepke and Bruno Adler Maccagnan Pinheiro Besen, Renato Daltro-Oliveira, Renata Mello Guazzelli, Estevão Bassi, Jorge Ibrain Figueira Salluh, Sérgio Henrique Bastos Damous, Edivaldo Massazo Utiyama, Luiz Marcelo Sá Malbouisson in Journal of Intensive Care Medicine
Supplemental Material
sj-docx-2-jic-10.1177_08850666231188051 - Supplemental material for Predictive Performance for Hospital Mortality of SAPS 3, SOFA, ISS, and New ISS in Critically Ill Trauma Patients: A Validation Cohort Study
Supplemental material, sj-docx-2-jic-10.1177_08850666231188051 for Predictive Performance for Hospital Mortality of SAPS 3, SOFA, ISS, and New ISS in Critically Ill Trauma Patients: A Validation Cohort Study by Roberta Muriel Longo Roepke and Bruno Adler Maccagnan Pinheiro Besen, Renato Daltro-Oliveira, Renata Mello Guazzelli, Estevão Bassi, Jorge Ibrain Figueira Salluh, Sérgio Henrique Bastos Damous, Edivaldo Massazo Utiyama, Luiz Marcelo Sá Malbouisson in Journal of Intensive Care Medicine
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Supplemental Material
Supplemental material for this article is available online.
References
