Abstract
Background
While machine learning (ML) models are increasingly used to predict outcomes in health care, their practical effect on health care operations, such as bed capacity management, remains underexplored. There is a variety of traditionally used evaluation metrics to analyze ML models; however, decision makers in health care settings require a deeper understanding of their implications for resource management. Traditional performance measures often fail to provide this practical insight.
Methods
In this work, we conduct a simulation study to evaluate the impact of ML-driven length-of-stay (LOS) predictions on intensive care unit (ICU) bed capacity management. Two classification models differing in terms of explainability and interpretability, logistic regression (LR) and extreme gradient boosting (XGB), are applied to predict ICU-LOS. We use the HiRID dataset containing high-frequency data of more than 33,000 patients. The predictions of the ML models are integrated into a simulation framework that replicates real-world ICU bed management, allowing for the assessment of the practical implications of using these algorithms in a clinical setting.
Results
The application of both classification models results in improved capacity control regarding the key performance indicators in the simulation study, with XGB outperforming LR. While LR leads to slight overoccupancy in the ICU, slight underoccupancy can be observed when XGB is applied.
Conclusion
Our study bridges the gap between predictive accuracy and practical application by emphasizing the importance of evaluating ML models within the context of ICU capacity management. The simulation-based approach offers a more relevant assessment for health care practitioners, providing actionable insights that go beyond classical performance measures and directly address the needs of decision makers in clinical practice.
Highlights
We apply multiple classification models for ICU-LOS prediction using time-series data. This approach enables an update of the initial prediction resulting in the possibility of efficiently managing intensive care capacities.
We present a simulation-based approach to evaluate ML algorithms and their impact on bed capacity management in real-world clinical settings.
Our work provides in-depth insights into the impact of using ML techniques as decision support systems in the ICU and can lead to increased acceptance in practice.
Keywords
The German health care system is not crisis resistant and not sufficiently prepared for the consequences of climate change or pandemics: this is the conclusion of a report by the German Council of Experts on the Assessment of Developments in the Healthcare Sector, which cited the inadequate use of digital options as one of the key weaknesses. 1 The vulnerability of the system, especially in the event of crisis-related challenges, has been highlighted in recent years by the COVID-19 pandemic, which has led to an increased burden on the entire German health care system, particularly hospitals. In February 2021, 10% of COVID-19 patients in Germany were hospitalized, of which 33% were estimated to be treated in the intensive care unit (ICU). 2 The enormous increase in the number of patients requiring care presented a major challenge to physicians and highlighted the relevance of forecasting and managing the need for intensive care capacity. In this context, the length of stay (LOS) in the ICU (ICU-LOS), describing the treatment period of a patient, is crucial. A precise prediction can contribute to optimizing the care of critically ill patients, the efficient use of resources, and the improvement of patient care. Furthermore, knowledge of the individual LOS enables enhanced management of intensive care capacities as the treatment duration of a patient can be considered, for example, in high-load situations when transferring patients to other hospitals.
In this context, the application of artificial intelligence and in particular machine learning (ML) methods as a subarea can contribute to an accurate prediction. In the literature, both regression and classification algorithms are used to predict ICU-LOS. When using regression methods, the ICU-LOS is regarded as a continuous variable, and the aim is to make as accurate a prediction as possible for each patient.3–8 When applying classification algorithms, binary classification is often applied to differentiate between “regular” and “prolonged” ICU-LOS.3,5,9–18 However, the definition of prolonged ICU-LOS varies depending on the study, as datasets with different patient characteristics and different use cases are considered. The threshold value is often set at 7 d,9,12,17 although distinctions such as between ≤2 d and >2 d 15 or ≤10 d and >10 d 5 are also defined in the literature. In practice, multiclass classification is often useful for making more concrete predictions of the ICU-LOS for each patient and to be able to plan capacities accordingly. In the existing literature on multiclass classification, differences can also be observed with regard to the number of classes selected as well as their size. Iwase et al. 19 distinguished between 3 classes, short (<1 wk), medium (1–2 wk), and long (>2 wk), whereas Chen et al. 20 selected 4 classes, 4 d, 4 to 7 d, 7 to 10 d, and >10 d, and Alabbad et al. 21 differentiated between 9 classes, whereby intervals of 5 d were chosen. The studies on multiclass classification also highlight distinct variations in sample sizes, ranging from 353 (Chen et al. 20 ) to 895 (Alabbad et al. 21 ) to 12,747 (Iwase et al. 19 ) patients. Overall, there is a strong heterogeneity in the applications of ML models for ICU-LOS prediction in the literature. This reflects the fundamentally different approaches that, although potentially reasonable depending on the context, hinder comparability and may not be suitable in all settings. Moreover, no forecasts specifically supporting a daily planning level for ICUs including resource management perspectives are provided. A subdivision into classes of several days can enable a better performance of the forecast; however, a more accurate prediction of the ICU-LOS, for example, on a daily planning level, can be desirable for the hospital from a management perspective.22–24
In addition, the focus of the existing literature is primarily on the performance of the algorithms measured by state-of-the-art performance indicators. Of particular interest to decision makers in practice is the real-world effect of applying these models. For instance, in the case of ICU-LOS prediction, the insights gained are crucial for optimizing intensive care capacity and effectively managing bed occupancy. By accurately predicting patient stay durations, health care providers can enhance the allocation of resources, streamline patient flow, and ultimately improve overall care delivery in the ICU and the entire hospital. Although state-of-the-art performance indicators enable a comparison of individual algorithms, they provide only a limited indication of the effects of an actual application in the ICU. This aspect is often neglected in the existing literature. However, the literature, particularly in the fields of operations research and management science, frequently features studies that bridge predictive analytics with practical operations. In the context of ICU management, several papers have explored the application of simulation methods25–29 as well as the integration of ML and artificial intelligence with simulation approaches30–33 in intensive care settings. A comprehensive overview of how operations research and management science can inform ICU capacity planning is provided in the review by Bai et al. 34 In general, further analysis with respect to the evaluation of an application of the algorithms in practice is essential due to the importance of the combination of artificial intelligence driven by ML and medical decision making. 35
In our work, we focus on this essential interface and apply 2 classification algorithms for ICU-LOS prediction differing in terms of transparency and explainability, namely, logistic regression (LR) and extreme gradient boosting (XGB), on a real-world intensive care dataset. The ICU-LOS was divided into 4 classes, that is, 1 d, 2 d, 3 d, and ≥4 d, allowing us to focus on day-based capacity planning and operational needs in a clinical setting. At each point in time (i.e., days of treatment), a prediction is made as to whether a patient will continue to be treated in the ICU for 1 more day or longer. Therefore, the ICU-LOS is not only predicted shortly after the beginning of treatment in the ICU but also updated daily. In addition to state-of-the-art performance indicators and decision curve analysis, we use a Monte Carlo simulation for further analysis and evaluation of the classification algorithms. In the simulation study, we model ICU bed capacity management based on the predictions generated by the classification algorithms. This approach offers deeper and more practical insights than traditional key performance indicators do, providing physicians with actionable information that directly addresses the challenges of resource allocation and patient care. Our study offers in-depth insights into the real-world effects of implementing classification algorithms on a daily planning level in the ICU, which is crucial for effective capacity management and informed medical decision making in hospitals.
Methods
Dataset
Our study was based on the high time-resolution ICU dataset (HiRID, version 1.1.1) available on PhysioNet,36,37 including more than 33,000 patients admitted to the ICU in the Bern University Hospital in Switzerland from January 2008 to June 2016. Descriptive statistics and further information about the dataset as well as the preprocessing procedure applied can be found in the original publication by Hyland et al. 38 We used the preprocessed dataset containing 18 meta-variables consisting of observational and pharmaceutical data with a 5-min time grid. In addition, 3 general variables (i.e., age, admission time, and sex) were included, resulting in 21 variables considered for predicting ICU-LOS. The target variable was rounded up to whole days, as from a practical perspective it is useful to forecast on a daily basis to enable appropriate planning and management of capacities in the hospital. The patients were divided into 4 classes, resulting in the following number of patients per class: 18,629 patients with an ICU-LOS of 1 d, 6,722 patients with an ICU-LOS of 2 d, 2,623 patients with an ICU-LOS of 3 d, and 5,931 patients with an ICU-LOS of 4 d or longer. With regard to the simulation study, a treatment period of 4 d was assumed for all patients in the last class.
Classification Algorithms
Two classification algorithms, XGB and LR, were used to predict each patient’s ICU-LOS. To ensure comparability, in addition to the 2 models (XGB and LR[c]), we also included a simplified LR model (LR[s]) based on a selection of basic clinical covariates as well as a naïve forecast. This allows for comparison with clinical practice, since ICU managers typically rely on basic observable patient characteristics for planning while also providing the standard benchmark with a naïve forecast that is commonly used in the literature. In general, the transparency and explainability of the algorithms play a central role in the application of ML models, particularly in medicine. In this context, a distinction can be made between black-box models such as XGB and white-box models such as LR. Black-box models are mathematically complex, which makes it difficult for users to understand them in practice. 39 Due to their high complexity, these models usually provide precise predictions, but the internal decision-making rules and mechanisms are not transparent. In contrast, white-box models are easier to understand by experts in the field of application. 39 However, the predictions are often less precise due to the lower complexity of the models. In general, the consideration of transparency and explainability is particularly relevant in the context of the application of ML algorithms in health care, as neglecting this aspect in clinical decision support systems poses a threat to fundamental ethical values in medicine. 40
Our approach consists of a combination of 3 binary classification algorithms to predict ICU-LOS. This proceeding allowed the time-series data to be adequately considered, as updated patient information can be used for further prognosis. Shortly after a patient was admitted to the ICU, an initial prediction was made as to whether this patient will need to be treated for 1 d or longer than 1 d (
For the predictions at the 2 consecutive points in time, that is,

Graphical representation of the classification procedure (
Due to the 5-min time grid, the data available for each patient are very extensive. Therefore, a linear regression model was fitted for each of the existing meta-variables, and the respective parameters (i.e., intercept and slope) were used as the input for the classification algorithm. For each prediction, that is,
For each model, a 5-fold cross-validation procedure was used in the training process. Usually, this procedure is applied for hyperparameter tuning to select the best model. In our study, we modified the classic procedure with the aim of obtaining a prognosis for each patient in the dataset and thus being able to use the entire dataset in the simulation study. The dataset was randomly divided once in 5 different folds, each consisting of 20% of the patients treated in the ICU. For each submodel, 4 folds (i.e., 80% of the patients) were used as training dataset, and the remaining fold (i.e., 20% of the patients) was used as test dataset. Consequently, 5 submodels, each using a different fold as test dataset, were created for each of the 9 classification models to be able to use an out-of-sample prediction for each individual patient in the subsequent simulation study. Logistic Lasso regression was used for the LR method owing to its advantages in feature selection and model complexity, which have made it a common choice in the statistical literature. For XGB, the hyperparameters were set to basic values traditionally regarded as reasonable to ensure fair comparability across the individual submodels. Detailed specifications of the parameters are provided in the accompanying R code for this article. Our approach enables the possibility of carrying out the extensive simulation study based on all data available, which is particularly relevant for an application in practice, as most of the datasets accessible in the health care sector are comparatively small, and it is crucial to adequately use all the information provided.
State-of-the-Art Performance Evaluation
The classification models were evaluated using state-of-the-art performance indicators, that is, area under the receiver-operating characteristic curve (AUROC), sensitivity, specificity, precision, scaled Brier score, F1 score, and accuracy. These performance measures are used as standard when comparing ML models and are intended to enable good comparability. In our study, this applies both when analyzing the selected ML algorithms and when comparing the performance quality of the individual submodels at the points in time considered. The use of classic evaluation methods enables good comparability with other studies, even if the data basis used can have an enormous influence on the performance. Furthermore, variable importance plots as well as calibration plots are used to assess model performance for the classification algorithms.
In addition, we applied decision curve analysis (DCA) to assess the practical utility of the algorithms. This approach accounts for the clinical impact of decision-making outcomes. 41 The foundation of these considerations is the idea of calculating the potential cost and benefit of a clinical intervention. 42 In our context, this clinical intervention was defined as intensive care treatment extending for more than 1 additional day. The choice of the decision threshold, a probability value ranging from 0 to 1, is of central importance, 43 as the necessity of an intervention is predicted in the model from this point onward. The theoretical relationship between benefit and selected thresholds can be leveraged, enabling model evaluation against the commonly used clinical benchmarks “treat all” and “treat none” without requiring additional data. 41 It is important to note that DCA is not intended to determine the optimal threshold. 44 Instead, it provides a method to evaluate whether the model offers clinical benefit, while the appropriate range of thresholds should always be defined according to practical requirements. 44 In our example, we assumed that a threshold of 0.5 is generally appropriate for management purposes. However, hospital-specific circumstances must be considered; for example, in high-demand situations, lower thresholds may be more practical to facilitate timely management decisions, such as transferring patients to other facilities. In addition, thresholds greater than 0.5 may be relevant in this context and are therefore also examined further. From a practical perspective, the choice of the appropriate threshold depends on whether the risk of an empty bed or an overbooked patient is considered to be higher. Overall, an evaluation of the threshold range from 0.25 to 0.75 appeared practical for a real-world application of the classification models for predicting ICU-LOS.
Simulation Study
In addition to the evaluation by means of the classically used performance measures, a comprehensive simulation study was subsequently carried out to analyze the algorithms in greater depth. We applied a Monte Carlo simulation, as this method can provide valuable insights into the effects of data-based decision support in the ICU. 45 We simulated ICU capacity management by incorporating predictions generated by the classification models, specifically focusing on how these predictions influence hospital bed management. By using ICU-LOS predictions derived from the algorithms, this method enables a thorough and comprehensive evaluation of their impact on ICU bed utilization and highlights the practical implications for resource allocation within the hospital setting.
At the beginning of the simulation study, the input parameters, that is, number of runs (
Then, the number of patients being treated in the ICU was calculated, which is always 50 at the time of initial occupancy

Flowchart of the Monte Carlo simulation setup in R.
Results
State-of-the-Art Performance Evaluation
An initial evaluation of the ML models is carried out using the classic performance measures AUROC, sensitivity, specificity, precision, scaled Brier score, F1 score, and accuracy. For all binary classification models, the longer ICU-LOS is defined as the positive class (class 1). The AUROC, which we use as the primary performance measure in our state-of-the-art evaluation, is highest for the XGB model for all predictions. In addition, the AUROC is higher for the LR(c) model than for the LR(s) model. When using the LR(s) model, the AUROC is 0.63, 0.67, and 0.63 for the predictions
Area under the Receiver-Operating Characteristic Curve (AUROC), Sensitivity (SE), Specificity (SP), Precision (PRE), Scaled Brier Score (BS), F1 Score (F1), and Accuracy (ACC) in Comparison with the No-Information Rate (NIR) (1-Tailed Binomial Test) per the Machine Learning Model and Point in Time of Prediction
Statistically significant values to a significance level of 0.05.
In addition to the presented performance measures, DCA was conducted to assess the benefits of the models, with a particular focus on the predefined threshold range of 0.25 to 0.75. Overall, the benefit demonstrates notable differences when using the classification models compared with the “treat all” and “treat none” strategies. At points in time

Graphical representation of decision curve analysis for logistic regression (LR) models. s, simple: τ = 0, τ = 1, τ = 2.

Graphical representation of decision curve analysis for logistic regression (LR) models. c, complex: τ = 0, τ = 1, τ = 2.

Graphical representation of decision curve analysis for extreme gradient boosting (XGB) models in all time steps: τ = 0, τ = 1, τ = 2.

Simulated occupancy of the intensive care unit per classifier and point in time. Green line, number of beds used for initial occupancy (50); red line, maximal number of beds available (60). XGB, extreme gradient boosting: τ = 0, τ = 1, τ = 2, τ = 3.
In addition, we used variable importance plots (Supplementary Figures 7–15) as well as calibration plots (Supplementary Figures 16–18) to assess the model performance for all time steps, which can be found in the Appendix.
Simulation Study
In the simulation study, the number of patients receiving intensive care treatment as well as the number of correct, overestimating, and underestimating predictions are used as key performance indicators to compare the 4 prediction models (i.e., naïve classifier, simple LR, complex LR, and XGB). At the initial point in time,
Mean Number of Patients in the ICU (
Therefore, the ICU occupancy was measured for each simulation run to provide a detailed overview in addition to the mean values per point in time. Furthermore, we used post hoc tests to test the statistically significant differences in the number of patients receiving treatment in the ICU. In the initial time step, there is no statistically significant difference since the number of initial patients is 50 for each model. In the consecutive time steps, there is a statistically significant difference in most cases, except in
Discussion
The use of state-of-the-art performance measures shows that the XGB classifier outperforms the LR classifiers with regard to the AUROC for the predictions
The simulation study shows that the application of both classification models is advantageous compared with the naïve prediction as well as the LR(s) model. Overall, the number of patients treated in the ICU using the XGB classifier is closest to the specified number of beds to be occupied by means of capacity management. However, the application of the LR(c) classifier also shows promising results, especially in comparison with the naïve classifier. When planning the ICU capacity using the predictions of the LR model, the average number of patients treated is usually slightly greater than 50; when using the XGB model, it is always less than 50. The use of the classification models also demonstrates benefits in terms of the maximum number of patients that can be accommodated in the ICU. This is especially evident with the XGB classifier, as only outliers have values exceeding 60. If the number of patients exceeds the number of free beds available and thus not all patients can no longer receive intensive care treatment, a transfer to other hospitals or, in extreme cases, triage must be carried out. 45 It is therefore particularly relevant in practice to plan sufficient buffers to ensure the ability to treat emergency patients and provide medical care for all patients. As explained in detail in the previous part of the discussion, there is always a tradeoff to consider between free or unused capacity, which generates costs for the hospital, and ensuring patient care, which is central to the health care system. Both in the use of state-of-the-art performance measures and in the simulation study, it is evident that XGB outperforms LR. However, the advantage of white-box models such as LR lies in the fact that decision makers in the health care system can more easily understand how the predictions are made, which can lead to higher acceptance in practice. The use of a simulation study can also contribute to higher acceptance, as it clearly shows the effects of the algorithms in practical application. This is essential for ML models to be actually applied in practice and thus enhance patient care. In summary, both classification algorithms, LR(c) and XGB, perform well and can therefore contribute to improved decision support in the ICU.
Overall, it is important to emphasize that incorporating a simulation study within the context of using ML predictions in hospitals can provide a comprehensive evaluation of their impact on capacity management. This approach enables a nuanced analysis of how these forecasts influence resource allocation and operational efficiency, thereby offering valuable support to decision makers in optimizing hospital operations. In future research, this approach could be further refined and expanded. One potential development is the integration of simulation within the training process of the algorithms, allowing for real-time evaluation based on capacity control metrics. By focusing on the practical impact of these algorithms on resource management, rather than relying solely on traditional evaluation measures, this method could offer a more tailored and effective adaptation for real-world applications. This shift would ensure that ML models are not only accurate but also optimized for the specific operational needs of health care environments. Another approach in this context could be to focus more on the role of explainability of the algorithms. This might involve considering doctors’ perspectives when using such forecasts and the benefits of visualizing parameters in white-box models. For black-box models, incorporating methods that enhance the transparency of complex algorithms can be useful. This is particularly relevant for establishing data-driven decision support systems in practice. Ensuring that the needs and preferences of the decision makers in practice are included is crucial for successful implementation. Furthermore, this work focuses on short-term planning since we use time-series data that allow for daily updates of the prediction. An alternative approach for medium- and long-term forecasting would be to generate a day-based prediction at the outset without subsequent updates. This may be appropriate, for instance, when time-series data are not available or when incorporating up-to-date patient information is not technically feasible. In general, the application of prediction models for medium- and long-term planning in ICUs could be further investigated in future research.
This study is subject to some limitations. First, the ICU-LOS of the patients is divided into 4 different classes, but in practice, the target variable is continuous and not discrete. This dichotomization of the continuous outcome inevitably entails a loss of information. However, from an operational management perspective, planning in discrete time units is practical and meaningful. Capacity planning in hospitals is often done on a daily basis, for which our ML models can provide valuable support. In addition, our approach enables a good comparison of the algorithms within the framework of the simulation study since our dataset contains time-series data allowing for dynamic updates throughout the treatment process. Second, we assume a maximum ICU-LOS of 4 d for the simulation study, which can be exceeded in practice. Third, we used a minimal tuning approach for hyperparameters in the XGB model to ensure comparability across the submodels. However, extensive hyperparameter tuning could improve the performance. Fourth, the characteristics of the patients can differ depending on the ICU considered. For example, differences can result from the proportion of elective and emergency patients as well as the size and medical specialties in the hospital. Therefore, a direct transfer of the results to other ICUs is possible to only a limited extent. However, our work provides a detailed explanation of the advantages of using classification algorithms and their simulation-based analysis for medical decision making in hospitals. Fifth, this study is based on retrospective data, but to actually implement the algorithms in practice, the data must be continuously collected and stored, which requires sufficient technical conditions and adequate software in the hospital.
Conclusion
Our study provides a simulation-based analysis of 2 algorithms differing in terms of transparency and explainability, LR and XGB, for ICU-LOS classification. The predictions are made both shortly after the beginning of treatment and as a daily update. The results show that the application of a Monte Carlo simulation can provide valuable insights into the effects of the classification algorithms in case of application in practice. In further research, the consideration of data-driven mathematical statistical analyses from an interdisciplinary perspective is essential to enable the most efficient management of available resources in the ICU. This approach has the potential to help ensure the adequate preparation for future challenges for the health care system such as pandemics or the consequences of climate change.
Supplemental Material
sj-docx-1-mdm-10.1177_0272989X251406639 – Supplemental material for Machine Learning for Intensive Care Unit Length-of-Stay Prediction: A Simulation-Based Approach to Bed Capacity Management
Supplemental material, sj-docx-1-mdm-10.1177_0272989X251406639 for Machine Learning for Intensive Care Unit Length-of-Stay Prediction: A Simulation-Based Approach to Bed Capacity Management by Sara Garber and Yarema Okhrin in Medical Decision Making
Supplemental Material
sj-docx-2-mdm-10.1177_0272989X251406639 – Supplemental material for Machine Learning for Intensive Care Unit Length-of-Stay Prediction: A Simulation-Based Approach to Bed Capacity Management
Supplemental material, sj-docx-2-mdm-10.1177_0272989X251406639 for Machine Learning for Intensive Care Unit Length-of-Stay Prediction: A Simulation-Based Approach to Bed Capacity Management by Sara Garber and Yarema Okhrin in Medical Decision Making
Footnotes
Acknowledgements
The authors would like to thank Prof. Dr. Christina Barten-Schlager (Nuremberg School of Health) and Prof. Dr. Jens Brunner (Technical University of Denmark) for the valuable discussions in early stages of this work.
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article. The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Sara Garber gratefully acknowledges funding from the University of Augsburg for her presentation at the 50th Annual Meeting of the European Working Group on Operational Research Applied to Health Services.
Authors’ Note
Partial results were presented at the 50th Annual Meeting of the European Working Group on Operational Research Applied to Health Services (July 2024).
Ethical Considerations
The authors used only retrospective and anonymized data available on PhysioNet.
Consent to Participate
Not applicable.
Patient Consent
Not applicable.
Data and Code Availability
The HiRID dataset used in this article is available via PhysioNet (https://physionet.org/). The code accompanying this study isavailable on GitHub (
).
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
