Abstract
Objective
This study aimed to identify the optimal time-series models and training strategies for forecasting daily pediatric asthma visit volumes and explore the impact of varying training set sizes on model performance to provide a data-driven framework for clinical resource allocation.
Methods
A retrospective study was conducted using daily pediatric asthma visit data from July 1, 2015, to June 30, 2019, at a large tertiary children's hospital in Chongqing, China. Four representative time-series models (autoregressive integrated moving average, Prophet, extreme gradient boosting, and bidirectional long short-term memory) were constructed and evaluated under two forecasting strategies (rolling and direct forecasting) with varying training set sizes (3 years, 2 years, 1 year, 6 months, and 1 month). The models were evaluated using the coefficient of determination (R2), mean absolute error, root mean squared error, and days exceeding error thresholds.
Results
The experimental results indicate that extreme gradient boosting and bidirectional long short-term memory are reliable for pediatric asthma visit forecasting, with a 2-year training set and rolling forecasting proving optimal. An ensemble method combining these two models further reduced the number of error days compared with the single models.
Conclusion
This research presents a robust framework for hospitals to implement data-driven forecasting of pediatric asthma visit volumes, integrating machine learning and deep learning models with adaptive training strategies to improve the efficiency of resource management in this clinical domain.
Introduction
Pediatric asthma is a common respiratory disease and a major cause of hospital emergency and outpatient visits.1,2 Accurate forecasting of daily pediatric asthma visits is critical for hospital management and patient care.3–5 For hospitals, it enables optimal allocation of medical resources (staffing, supplies) to ensure service efficiency.6–8 For patients, visit forecasts allow them to schedule appointments so as to avoid crowding.4,9 However, the dynamics of pediatric asthma visits are influenced by seasonal and environmental factors, posing challenges to accurate prediction.10,11
Recent studies have addressed hospital visit volume forecasting, but several limitations remain.
Firstly, some studies proposed overly complex models with poor generalization. For example, Keyimu et al. 12 developed a hybrid model integrating successive variational mode decomposition with a modified cheetah optimizer-optimized Gated Recurrent Unit (GRU) for hospital outpatient volume prediction. Ren et al. 13 proposed an optimized two-stage hybrid grey model: in the non-COVID-19 stage, it uses the Aquila optimizer for parameter tuning and Fourier adjustment for random disturbance correction; in the COVID-19 stage, it incorporates a dummy variable-based impact factor to improve accuracy. However, these models are generally complex in architecture, and their parameter tuning is labor-intensive—both of which increase computational costs and tuning difficulty, thereby limiting their scalability and generalization across different hospitals and data environments.
Secondly, existing studies use a variety of forecasting strategies, namely direct forecasting, 14 rolling forecasting, 15 iterative forecasting, 16 multi-step forecasting, 17 and so on. Direct forecasting adopts a fixed training period, where the model captures historical data patterns in one iteration and generates forecast values directly without updating the training set. Rolling forecasting uses an iterative training set update mechanism, with a sliding window rolling forward to add the latest data. Iterative forecasting, an extension of rolling forecasting, takes the forecast result of the previous step as the input for the next step. By contrast, multi-step forecasting focuses on predicting multiple consecutive future steps, which can be realized via either direct or iterative strategies.
Existing studies mainly focus on direct and rolling forecasting, owing to their advantages in implementation convenience and evaluation ease. Both strategies have simple logical frameworks. Direct forecasting avoids complex iterative updates, while rolling forecasting only requires a fixed window sliding mechanism. Their results are also straightforward to quantify and compare. For instance, Wang et al. 15 used one-step-ahead rolling forecasting to evaluate the performance of LSTM and GRU models for conjunctivitis outpatient volume prediction, and Liu et al. 14 developed a genetic programming-based model using direct forecasting to predict daily outpatient visits by learning historical data once. However, these studies neither justify their strategy selection nor clarify the superiority of either method. A comprehensive comparative analysis of direct and rolling forecasting is therefore necessary.
Thirdly, existing studies often use statistical metrics [e.g., R2, mean absolute error (MAE), root mean squared error (RMSE)] to assess prediction models, quantifying the goodness-of-fit and discrepancies between predicted and actual values.15,18,19 These metrics are fundamental for evaluating overall model performance in an aggregated manner, but relying solely on them is insufficient to directly inform real-world clinical decision-making. For instance, hospitals cannot translate R2 values into actionable staffing adjustments or patient scheduling strategies. Furthermore, while predicted visit volumes provide a foundational reference for hospital resource planning, they fail to quantify the degree of discrepancy between predicted and actual values.
Thus, this study introduces a clinically oriented, scenario-specific indicator: days exceeding error thresholds. This indicator directly quantifies the magnitude and frequency of deviations between predicted and actual values that exceed the hospital's pre-defined acceptable operational threshold; the fewer such days, the more reliable the model is in real-world clinical applications. By capturing these critical operational deviations, it delivers a more refined quantitative measure of prediction gaps, bridging statistical model performance and clinical practicality and making the model more actionable for hospital management tasks.
Fourthly, a common practice in hospital visit prediction is to build models on the training set and validate them on the testing set, yet this practice overlooks the dependence of model performance on training set size and fails to explore how data volume influences accuracy.
To address these gaps, this study adopts a systematic approach: (1) compare four representative time-series models for forecasting daily pediatric asthma visits, emphasizing simplicity and generalization; (2) evaluate rolling vs. direct forecasting strategies per model to identify the superior one; (3) introduce application-oriented metrics alongside the statistical metrics to assess model performance and real-world utility; (4) compare model performance across different training set sizes to explore accuracy dependency on data volume and determine the optimal size.
Methods
Dataset
Daily pediatric asthma visit data (including daily outpatient and emergency volumes) were extracted from the electronic medical records of a large tertiary children's hospital in Chongqing (July 1, 2015–June 30, 2019), a period chosen to exclude the impact of COVID-19 containment measures, with the calendar date as the analysis unit. This hospital features a comprehensive clinical service system, where the outpatient and emergency departments are staffed separately to fulfill specialized clinical demands. Due to the scarcity of outpatient registration resources, some asthma patients who are unable to secure an outpatient appointment may choose to seek emergency care. Therefore, we combined pediatric asthma outpatient and emergency data to capture the overall trend of asthma visit volumes.
Data analysis was conducted from March 17, 2025, to May 23, 2025. The study was approved by the Ethics Committee of Children's Hospital of Chongqing Medical University (File No.338-2023).
Experiment design
This study aims to forecast daily pediatric asthma visits 1 day ahead. To investigate the impact of training set size on model performance, we designed controlled experiments with a fixed 1-year testing set and five training sets of varying time spans. The testing set was defined from July 1, 2018 to June 30, 2019, to provide a statistically robust sample size for performance assessment. For the training sets, we started with a 3-year dataset (July 1, 2015 to June 30, 2018) and then gradually shortened the time span to 2 years, 1 year, 6 months, and 1 month, respectively.
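For illustration, the following is a minimal sketch of how these splits might be constructed in Python, assuming the daily series is held in a pandas Series indexed by calendar date; the start dates of the windows shorter than 3 years are inferred from the shared end date of June 30, 2018, and all names are illustrative rather than the study's actual code.

```python
import pandas as pd

# Fixed 1-year test set shared by all experimental groups.
TEST_START, TEST_END = "2018-07-01", "2019-06-30"
TRAIN_END = "2018-06-30"

# Five training windows, each ending June 30, 2018; the non-3-year start
# dates are inferred by shortening the span backward from that end date.
TRAIN_STARTS = {
    "3 years": "2015-07-01",
    "2 years": "2016-07-01",
    "1 year": "2017-07-01",
    "6 months": "2018-01-01",
    "1 month": "2018-06-01",
}

def make_split(visits: pd.Series, span: str):
    """Return the (train, test) slices for one experimental group."""
    train = visits.loc[TRAIN_STARTS[span]:TRAIN_END]
    test = visits.loc[TEST_START:TEST_END]
    return train, test
```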
The details of experimental design for exploring optimal training set size are shown in Table 1, which outlines five groups of training data with varying time spans and the corresponding test sets.
Experimental design for different training set sizes.
Two forecasting strategies were used: rolling forecasting and direct forecasting. In rolling forecasting, after each prediction, the predicted day's observed visit data were added to the training set while the earliest data were removed, keeping the training set size constant. Direct forecasting involved one-time training on the training set followed by direct prediction for the testing data. Notably, the 6-month and 1-month training sets were excluded from direct forecasting, as at these scales the training data would be smaller than the testing set and insufficient to capture complete patterns.
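As a hedged sketch of the two strategies for 1-day-ahead forecasting (the `fit_predict_next`, `fit`, and `predict_ahead` helpers are hypothetical stand-ins for any of the four models; the actual implementations follow the specifications in Table 2):

```python
import numpy as np

def rolling_forecast(fit_predict_next, train, test):
    """Rolling strategy: after each 1-day-ahead prediction, append the
    observed value for that day and drop the earliest one, so the
    training window keeps a constant size as it slides forward."""
    history = list(train)
    preds = []
    for actual in test:
        preds.append(fit_predict_next(history))  # forecast the next day
        history.append(actual)                   # add the newest observed day
        history.pop(0)                           # drop the earliest day
    return np.array(preds)

def direct_forecast(fit, predict_ahead, train, horizon):
    """Direct strategy: train once on the fixed window, then predict
    every day of the test period without updating the training set."""
    model = fit(train)
    return np.array([predict_ahead(model, h) for h in range(1, horizon + 1)])
```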
Time-series models
Time-series forecasting has evolved into four major methodological categories: classical statistical models, engineering-friendly time-series tools, traditional machine learning models, and deep learning models. Specifically, classical statistical models are represented by autoregressive integrated moving average (ARIMA), seasonal ARIMA (SARIMA), and SARIMA with exogenous variables.20–22 Engineering-friendly tools include Prophet and exponential smoothing. Traditional machine learning models cover extreme gradient boosting (XGBoost) and random forest. Deep learning models are exemplified by long short-term memory (LSTM), bidirectional LSTM (Bi-LSTM), and temporal convolutional networks (TCN). 23
This study selected four representative models for comparative analysis, namely ARIMA, Prophet, XGBoost, and Bi-LSTM. The selection was guided by methodological comprehensiveness and scenario applicability. These models span the mainstream time-series methodologies, enabling a systematic comparison between linear statistical and non-linear approaches. Each model addresses distinct practical challenges in real-world forecasting tasks. ARIMA, as a classic statistical benchmark, provides a baseline for evaluating advanced models. Prophet is designed to handle strong seasonality and missing data with minimal parameter tuning, making it ideal for operational scenarios. XGBoost excels at capturing complex non-linear feature interactions while maintaining high computational efficiency, overcoming the limitations of linear models in heterogeneous datasets. Bi-LSTM extends unidirectional sequence learning to bidirectional processing, allowing it to capture both past and future temporal dependencies.
All specifications of selected models are listed in Table 2.
The specifications of four models.
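As a rough illustration of how the four models could be instantiated in Python, the sketch below uses placeholder hyperparameters for readability; they are not the settings reported in Table 2.

```python
from statsmodels.tsa.arima.model import ARIMA      # classical statistical model
from prophet import Prophet                        # engineering-friendly tool
from xgboost import XGBRegressor                   # traditional machine learning
from tensorflow.keras.models import Sequential     # deep learning
from tensorflow.keras.layers import Bidirectional, LSTM, Dense

def build_arima(train_series, order=(1, 1, 1)):
    # Fits an ARIMA(p, d, q) model directly to the raw daily series.
    return ARIMA(train_series, order=order).fit()

def build_prophet():
    # Prophet expects a DataFrame with columns ds (date) and y (value).
    return Prophet(yearly_seasonality=True, weekly_seasonality=True)

def build_xgboost():
    # Gradient-boosted trees, e.g., over lagged-visit features.
    return XGBRegressor(n_estimators=300, max_depth=4, learning_rate=0.05)

def build_bilstm(lookback=14):
    # Bidirectional LSTM over a sliding window of `lookback` past days.
    model = Sequential([
        Bidirectional(LSTM(64), input_shape=(lookback, 1)),
        Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")
    return model
```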
Based on the performance of the above four single models, an ensemble forecasting method was constructed by combining the two most accurate models. The predicted value of the ensemble method was calculated as the arithmetic mean of the predicted values of the two most accurate models. For example, if the two most accurate models are XGBoost and Bi-LSTM, the predicted value was calculated as

$$\hat{y}_{\text{ensemble}} = \frac{\hat{y}_{\text{XGBoost}} + \hat{y}_{\text{Bi-LSTM}}}{2}$$
Data analysis
All data analyses were performed using Python 3.8 with libraries including NumPy, Pandas, Scikit-learn, and Matplotlib. The performance of time-series models was quantified using three categories of metrics, with specific calculation principles and clinical implications as follows:
(1) Goodness-of-fit metric
Coefficient of determination (R2), calculated as

$$R^2 = 1 - \frac{\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}{\sum_{i=1}^{n}\left(y_i - \bar{y}\right)^2}$$

where $y_i$ and $\hat{y}_i$ denote the actual and predicted visit volumes on day $i$, $\bar{y}$ is the mean of the actual values, and $n$ is the number of days in the test set. R2 ranges over (-∞, 1], with values closer to 1 indicating a better fit.
(2) Prediction accuracy metrics
Mean absolute error (MAE) and root mean squared error (RMSE), calculated as

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|y_i - \hat{y}_i\right|, \qquad \mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{y}_i\right)^2}$$

Smaller values of both metrics indicate higher prediction accuracy.
(3) Clinical application metric
Days exceeding error thresholds, a clinical application-oriented evaluation indicator, is defined as the number of days on which the absolute error of the predicted visit volume exceeds a pre-specified fixed numerical threshold. The proportion of days exceeding error thresholds was calculated as the ratio of these error days to the total number of days in the test set. A smaller number and proportion of such days indicate that the model's predictions deviate less from the actual visit volumes, with fewer extreme prediction errors and thus higher practical reliability and clinical applicability for guiding healthcare resource allocation in real-world settings.
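A minimal sketch of how all three metric categories could be computed with NumPy follows; the 25 and 50 thresholds are those used later in this study, while the 1SD threshold is approximated here from the test-set actuals rather than the full series.

```python
import numpy as np

def evaluate(y_true, y_pred, thresholds=(25, 50)):
    """Compute goodness-of-fit, accuracy, and clinical application metrics."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    metrics = {
        "R2": 1 - np.sum(err ** 2) / np.sum((y_true - y_true.mean()) ** 2),
        "MAE": np.mean(np.abs(err)),
        "RMSE": np.sqrt(np.mean(err ** 2)),
    }
    # Days exceeding each error threshold and their proportion of the test set.
    for t in list(thresholds) + [y_true.std()]:  # the last entry plays the 1SD role
        days = int(np.sum(np.abs(err) > t))
        metrics[f"days > {t:.0f}"] = days
        metrics[f"proportion > {t:.0f}"] = days / len(y_true)
    return metrics
```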
Descriptive statistics (mean ± standard deviation) were used to characterize the daily pediatric asthma visit volume and summarize the model performance metrics across different training set sizes and forecasting strategies for comparative analysis. No inferential statistical tests were applied as the study focused on the comparative analysis of model performance rather than the hypothesis testing of statistical significance.
Ethical considerations
This study was conducted in accordance with the Declaration of Helsinki (World Medical Association, 2022 version), which is the fundamental ethical guideline for medical research involving human health-related data. We used fully anonymized and deidentified pediatric asthma visit data (i.e., all personal identifiers, including patient names, medical record numbers, and contact information were permanently removed) to protect patient privacy.
Though the research did not involve direct interaction with human subjects or access to identifiable personal health information, we proactively obtained formal ethics approval from the Ethics Committee of the Children's Hospital of Chongqing Medical University (File No.: 338-2023) to ensure full adherence to ethical standards for secondary use of medical data.
Results
The characteristics of daily pediatric asthma visits
The daily pediatric asthma visit volumes from July 1, 2015, to June 30, 2019 are illustrated in Figure 1.

Daily asthma clinic visits from July 2015 to June 2019.
In Figure 1, daily pediatric asthma visits over 1452 days averaged 120 [standard deviation (SD) = 58], consisting of 108 outpatient encounters (SD = 57) and 7 emergency visits (SD = 5). At this hospital, the mean daily total outpatient volume was 8012 visits (SD = 1538) and the mean total daily emergency volume was 1065 visits (SD = 331), confirming a large-scale, high-throughput clinical setting. Given that these asthma-related emergency visits accounted for <1% of all emergency visits and <6% of total asthma visits, we modeled daily asthma volume—defined as the sum of same-day outpatient and emergency encounters—without disaggregating by care setting.
Notably, distinct seasonal patterns in pediatric asthma visits were observed from July 2015 to January 2017. However, a marked shift in data variability occurred from January 2017 onward, with the time series exhibiting drastic irregular fluctuations lacking obvious periodicity. This change coincided with the launch of our institution's big data clinical research platform in 2017 (aimed at standardizing clinical data collection), though a definitive causal link cannot be established due to incomplete institutional records. These high-frequency, erratic variations increase the complexity of accurate forecasting for pediatric asthma visit volumes, reflecting the real-world challenges of clinical data dynamics in healthcare settings.
Crucially, our core objective focuses on evaluating model performance for pediatric asthma visit forecasting and identifying the optimal training set size. All models were uniformly trained and evaluated on the complete dataset encompassing both pre-2017 and post-2017 periods, ensuring that comparative performance between models remains valid and comparable despite the observed variability shift. This consistent evaluation framework eliminates biases arising from differential data exposure, preserving the internal validity of our model comparisons.
The performance of four models
We compared the performance of four models in rolling and direct forecast tasks, and analyzed the impact of varying training set size on model performance.
The performance of four models in rolling and direct forecast tasks
We selected ARIMA, Prophet, XGBoost, and Bi-LSTM to conduct rolling and direct forecasting experiments across varying training set sizes (3 years, 2 years, 1 year, 6 months, and 1 month). The R2, MAE, and RMSE results of each model under the two strategies provide a basis for model selection and optimization. The experimental results are shown in Figure 2.

The comparison of the four models in the rolling and direct forecast tasks.
In Figure 2, R2 represents the model's goodness-of-fit to the data patterns, with a range of (-∞, 1]. Values closer to 1 indicate a better fit. MAE and RMSE reflect the magnitude of deviation between predicted and actual values, where smaller values signify higher prediction accuracy. In each subplot, the x-axis represents the size of the training data, and the y-axis corresponds to the respective metric value (R2, MAE, or RMSE). Different colors are used to distinguish the four models in each subplot.
Across both the rolling forecasting scenario (left subplots in Figure 2) and the direct forecasting scenario (right subplots in Figure 2), the Bi-LSTM and XGBoost models demonstrated significantly superior comprehensive predictive performance compared to Prophet and ARIMA. In terms of R2 [Figure 2(1) and Figure 2(2)], Bi-LSTM and XGBoost consistently ranked the highest in both scenarios across different training sets. Specifically, in rolling forecasting, ARIMA even yielded a negative R2 value when the training set was only 1 month, indicating that its predictive accuracy was worse than the naive sample-mean baseline. In direct forecasting, the R2 values of ARIMA and Prophet generally approached zero or became negative.
Consistent with the R2 findings, the results of MAE [Figure 2(3) and Figure 2(4)] and RMSE [Figure 2(5) and Figure 2(6)] further validated the advantages of Bi-LSTM and XGBoost. These two models achieved substantially lower error levels, and their error fluctuations remained minimal even when the training set was reduced from 3 years to 1 year. By contrast, ARIMA and Prophet exhibited high forecasting errors and poor stability as the training set size decreased, resulting in unsatisfactory overall predictive performance.
Overall, Bi-LSTM and XGBoost outperform Prophet and ARIMA in fitting capacity, error control, and stability across both forecasting strategies. Notably, Prophet shows a unique trend in rolling experiments: its R2, MAE, and RMSE improve progressively as the training set size shrinks below 1 year (e.g., 6 months and 1 month). However, even at this optimal window, Prophet's performance remains inferior to Bi-LSTM and XGBoost trained on >1 year of data.
The core significance of Figure 2 can be summarized in three key aspects. First, it visually verifies that Bi-LSTM and XGBoost outperform ARIMA and Prophet under both forecasting strategies, as evidenced by higher R2 values and lower MAE and RMSE metrics. Second, Bi-LSTM and XGBoost exhibit strong robustness, as they maintain stable predictive performance even when the training set size is reduced. Third, it provides direct experimental evidence to substantiate the study's core conclusion that Bi-LSTM and XGBoost are reliable models for forecasting pediatric asthma visits.
Detailed comparison between Bi-LSTM and XGBoost
In both rolling and direct forecasting, Bi-LSTM and XGBoost showed remarkably similar performance in pediatric asthma visit forecasting. To further compare the two models, we analyzed their metrics across all training set sizes, with results shown in Figure 3.

The performance of XGBoost and Bi-LSTM in rolling and direct forecast tasks.
Figure 3 shows that rolling forecasting outperforms direct forecasting overall, evidenced by higher R2 [Figure 3(1) vs. Figure 3(2)] and lower MAE [Figure 3(3) vs. Figure 3(4)] and RMSE [Figure 3(5) vs. Figure 3(6)], demonstrating the positive impact of dynamic updates on prediction stability.
The results in Figure 3 also show that training data size affects model performance. The R2, MAE, and RMSE metrics consistently indicated that both XGBoost and Bi-LSTM achieved optimal performance with 2-year training data under both forecasting strategies, as all three metrics reached their best values at this training size. For example, in Figure 3(1), XGBoost shows R2 values of 0.727 (3-year), 0.729 (2-year), and 0.683 (1-year), with the 2-year data yielding the highest value. Bi-LSTM presents R2 values of 0.717 (3-year), 0.738 (2-year), and 0.721 (1-year), also peaking at the 2-year size.
Comparison of the proportion of days exceeding error thresholds: XGBoost vs. Bi-LSTM
The ultimate goal of this paper is to accurately forecast daily asthma visits. Since rolling forecasting outperformed direct forecasting, we compared XGBoost and Bi-LSTM in rolling forecasting by analyzing the proportion of days with forecast errors exceeding the thresholds of 25, 50, and 1SD, defined as the ratio of days with errors exceeding the threshold to the total length of the test set. This metric helps to evaluate model stability and guide the actual allocation of medical resources. Results are presented in Table 3, where XGBoost and Bi-LSTM are used as standalone models to forecast daily pediatric asthma visits, and the “Ensemble” column reports the mean of the two models’ predictions to demonstrate the performance of their combined approach.
Proportion of days with forecast error exceeding a given threshold (unit: %).
As shown in Table 3, nearly all models performed optimally with 2-year training data. Taking the >1SD error threshold as an example, XGBoost, Bi-LSTM, and the ensemble method achieved error day proportions of 4.1%, 4.9%, and 3.8%, respectively—all lower than those under 1-year or 3-year training—and were consistent with the R2, MAE, and RMSE results in Figure 2.
In Table 3, XGBoost and Bi-LSTM showed no consistent performance differences, with alternating advantages across error thresholds and training data sizes. Specifically, XGBoost had a slight edge under the >50 threshold (3-year: 9.0%; 2-year: 9.3%; 1-year: 11.8%), while Bi-LSTM performed better under the >25 threshold (3-year: 30.9%; 1-year: 34.2%).
The ensemble method showed significant advantages across all scenarios, achieving the lowest proportion of error days for almost all error thresholds (>25, >50, >1SD) and training data sizes (1–3 years). For example, with 3-year training data and the >25 threshold, the ensemble method had an error day proportion of 28.7%, a 3.1 percentage-point reduction compared to XGBoost (31.8%). A mean and variance analysis across the three training sizes revealed that the ensemble method minimized both the proportion of error days and the forecasting variance, improving accuracy and stability simultaneously.
Discussion
Rolling forecasting outperforms direct forecasting in pediatric asthma daily visit forecasting
This study confirmed that rolling forecasting is superior to direct forecasting in predicting daily pediatric asthma visits. As shown in Figure 3, XGBoost and Bi-LSTM achieved higher accuracy with rolling windows than with direct training, as evidenced by the 3-year metrics of XGBoost: R2 improved from 0.715 (direct) to 0.727 (rolling), MAE from 23.25 (direct) to 22.57 (rolling), and RMSE from 34.82 (direct) to 34.07 (rolling). These improvements stem from the dynamic updates of training data in rolling forecasting, which accurately capture periodic fluctuations in pediatric asthma visits.
This finding aligns with Zhang et al., 24 who showed that rolling forecasts using ARIMA and LSTM outperformed direct forecasts, with MAE/RMSE metrics reduced by 8.3%–44.01%. Engoren et al. 25 further validated this in COVID-19 critical care prediction: their rolling logistic regression model significantly improved area under the curve, precision, and recall metrics compared to static models.
XGBoost and Bi-LSTM models outperform ARIMA and Prophet in pediatric asthma daily visit forecasting
In Figure 3, metrics in both forecasting strategies consistently show XGBoost and Bi-LSTM outperforming ARIMA and Prophet. In rolling forecasting, XGBoost and Bi-LSTM maintain R2 values stably around 0.7–0.8. In contrast, ARIMA and Prophet exhibit much lower, more volatile R2—ARIMA even turns negative with 1-month data. This trend is further validated by MAE and RMSE. In direct forecasting, the superiority of XGBoost and Bi-LSTM across all metrics is reaffirmed.
This finding is supported by multiple studies. Lv et al. 26 compared ARIMA and XGBoost for hemorrhagic fever with renal syndrome prediction, showing that XGBoost significantly outperformed ARIMA in MAE, RMSE, and other metrics. Similarly, Fang et al. 27 found that XGBoost outperformed ARIMA in U.S. daily COVID-19 case prediction with lower MAE and RMSE; XGBoost also incorporated external data (e.g., vaccination rates) to enhance performance. Wang et al. 28 compared Prophet, XGBoost, and LSTM for severe fever with thrombocytopenia syndrome forecasting, showing that XGBoost performed best, followed by LSTM, with Prophet lagging behind. Collectively, these studies highlight that XGBoost and LSTM excel in handling nonlinear time series with complex patterns.
The performance bottlenecks of ARIMA and Prophet stem from their heavy reliance on linear assumptions. ARIMA converts non-stationary sequences to stationary via differencing, leading to information loss during drastic data fluctuations.29,30 Though Prophet allows piecewise linear trends, its core additive seasonality assumption (e.g., fixed-period visit peaks) fails to capture nonlinear shifts from irregular events. In our study, pediatric asthma daily visits exhibited marked non-stationarity: a mean of 120 with a SD of 58, featuring regular periodic fluctuations pre-2017 but erratic oscillations afterward. Such characteristics demand strong nonlinear modeling capabilities from the model.
By contrast, XGBoost's gradient boosting framework captures high-order interactions (e.g., seasons, holidays) via iterative tree construction.31,32 Meanwhile, Bi-LSTM uses cell states to store long-term dependencies (e.g., annual seasonality) and gate mechanisms (forget, input, output) to regulate short-term information flow. 33 Their ability to model multi-scale temporal dependencies enables XGBoost and Bi-LSTM to outperform traditional models in both rolling and direct strategies, demonstrating strong applicability for pediatric asthma visit trend modeling.
In rolling forecasting, the Prophet model demonstrates optimal performance when the training data size is 1 month
As shown in Figure 3, Prophet showed significant metric optimization (R2, MAE, RMSE) with a 1-month training set. This stems from its core assumptions—additive seasonality and piecewise linear trends—granting it unique advantages with short training windows. With 1-month data, seasonal fluctuations and linear trends may avoid complex couplings, allowing its piecewise linear fitting to efficiently capture local patterns.
This mechanism was validated by Hasan et al. 34 in municipal solid waste disposal rate prediction across four North American cities (72-sample training data), where Prophet performed optimally in most scenarios, particularly in the R2 metric. Conversely, Wang et al. 28 found Prophet ranked third among four models in monthly severe fever with thrombocytopenia syndrome prediction (84-sample training data). Notably, a JAMA Network Open study on hospital discharge forecasting revealed that when the training data spanned 1 to 5 years, the Prophet model achieved the highest prediction accuracy at two research centers. 3 Collectively, existing evidence indicates that training data size is not the core determinant of Prophet's performance; instead, the seasonal cycle characteristics and variance level of the data itself may be the key influencing factors. The specific impact mechanisms warrant further exploration.
Optimal training data size for XGBoost and Bi-LSTM in pediatric asthma visit prediction
As shown in Figure 3 and Table 3, XGBoost and Bi-LSTM achieved optimal performance with 2-year training data, as evidenced by the R2, MAE, and RMSE metrics and the number of days with errors exceeding the thresholds.
The reason may lie in the impact of training set size: a 1-year training set led to underfitting due to insufficient data, as models failed to capture the annual periodic fluctuations and medium- to long-term trends of pediatric asthma visits. Meanwhile, 3-year training data risked overfitting by introducing redundant information (e.g., outdated patterns, sporadic noise). The 2-year training set balanced underfitting and overfitting, enabling better generalization on the testing data.
This finding aligns with the classical time-series prediction principle: model generalization does not increase monotonically with data volume, but shows an inverted U-shaped relationship with model complexity. 35 Here, the optimal 2-year training data size exemplifies how balancing data volume and model complexity maximizes predictive performance.
The ensemble method outperforms standalone models in pediatric asthma visit prediction
As shown in Table 3, across the 1-, 2-, and 3-year training data sizes, the ensemble model had significantly fewer error days exceeding the thresholds than the standalone XGBoost and Bi-LSTM models. Additionally, across all training sizes, the ensemble model showed the lowest average number of error days and a smaller standard deviation than the standalone models, validating its dual superiority in prediction accuracy and stability.
This phenomenon is underpinned by the core advantages and mechanisms of ensemble learning. Statistically, standalone models may suffer from high variance or bias due to limited data and model complexity. 36 By combining multiple base models, ensemble models effectively balance individual model variance/bias, enhancing overall prediction stability. 37 For instance, in the medical field, Lee et al. 38 built an ensemble model for non-cryptogenic ischemic stroke etiology prediction, achieving higher accuracy, stronger external validation generalization, and smaller performance decline than standalone models. Ensemble models have also exhibited significant advantages in finance, intelligent manufacturing, and other domains. 39 These cross-domain studies validate ensemble models’ effectiveness in reducing errors and enhancing generalization, consistent with their performance in our pediatric asthma visit prediction.
Threshold selection for the analysis of days exceeding error thresholds
The choice of error thresholds is critical to accurately measuring the frequency and magnitude of deviations between predicted and actual values. An overly small threshold may result in all models exceeding it, while an overly large one may lead to no models exceeding it; both scenarios fail to quantify model differences. In practice, each healthcare institution may select appropriate thresholds based on its own context to provide direct references for staffing allocation (e.g., Reference 3 adopts 10, 25, and 1SD as thresholds).
In this study, we initially considered 1SD (SD = 58) and 2SD thresholds but discarded 2SD due to minimal deviations beyond this cutoff, which made it impossible to quantify model differences. Through threshold testing, we found that 25 and 50 effectively distinguished predictive accuracy across models. Furthermore, we used absolute numerical values rather than proportion-based thresholds because numerical values directly quantify errors between predicted and actual values without additional conversion, making it more intuitive for clinical administrators to implement practical management.
Clinical applicability of evaluated models
The applicability of the evaluated time-series models and the optimized strategy is closely related to the daily visit volume scale and data characteristics of clinical institutions. Their applicable and inapplicable scenarios are analyzed below.
The XGBoost and Bi-LSTM models, along with the optimized strategy (2-year training set + rolling forecasting + ensemble method), are applicable to large tertiary children's hospitals with moderate-to-high daily pediatric asthma visit volumes, like our hospital. As demonstrated in the study, our hospital with a mean daily visit volume of 120 (SD = 58) achieved stable performance with these models. Under the 2-year training set, XGBoost and Bi-LSTM maintained R2 values of 0.729 and 0.738 respectively under rolling forecasting, while their ensemble model reduced the proportion of days with forecast errors exceeding 1SD (58 visits) to 3.8%. These results confirm their suitability for institutions with similar visit scales and at least 2 years of continuous historical data.
The models are less applicable to small healthcare institutions with insufficient historical data (i.e., < 1 year). As demonstrated in this study, training sets shorter than 1 year lead to model underfitting. The ARIMA model even yielded a negative R2 when trained on merely 1-month data. Additionally, given that the present study was conducted at a large tertiary children's hospital, the performance of the evaluated models in low daily visit volume scenarios remains uninvestigated. Furthermore, the models’ effectiveness for institutions with highly stable pediatric asthma visit volumes has not been explored either. Collectively, if asthma visit data from small-scale hospitals can be obtained, the generalizability of the models could be systematically verified to address this research gap.
Limitations and further work
This study has two main limitations. First, as a single-center pre-pandemic study, the drastic change in pediatric asthma visit data in January 2017 may reduce the predictive value of the additional 1-year data in the 3-year training set, further limiting the generalization of the results. Second, the framework was only validated for 1-day-ahead daily visit forecasting, while healthcare institutions could benefit from medium- to long-term forecasts (e.g., weekly, monthly, or even annual visit volumes) for long-range resource planning. Notably, the proposed framework inherently supports such extended forecasting tasks, but this potential was not explored here due to the limited dataset size.
Further research is needed to validate the model in multi-center settings, systematically extend the prediction horizon to weekly, monthly, and annual scales, and explore diverse ensemble strategies.
Conclusion
This study systematically evaluated four time-series models for 1-day-ahead forecasting of daily pediatric asthma visits, addressing key gaps in existing research. The key findings include four insights. First, Bi-LSTM and XGBoost outperformed ARIMA and Prophet across all training set sizes, driven by their superior ability to capture nonlinear patterns and temporal dependencies. Second, rolling forecasting consistently improved accuracy by adapting to evolving trends, enhancing R2 and reducing MAE and RMSE compared to direct forecasting. Third, a 2-year training period balanced data sufficiency and noise reduction, optimizing both the statistical metrics (R2, MAE, RMSE) and the application-oriented metric (days exceeding error thresholds); shorter periods (<1 year) risked underfitting, while a 3-year period may introduce redundant noise and increase overfitting risks. Fourth, the days-exceeding-error-threshold indicator added clinical relevance, showing that the ensemble method reduced error days compared to standalone models, directly supporting resource allocation such as staffing adjustments during high-error periods.
This research provides a robust framework for hospitals to implement data-driven resource management, combining machine learning and deep learning models with adaptive training strategies to enhance the efficiency of pediatric asthma care. By identifying optimal models and training strategies, this study establishes a quantitative basis for clinical management. The model accurately predicts the fluctuation range of the next day's visits, and its predicted values and error intervals can be converted into actionable staffing plans. This not only avoids the labor redundancy or shortages caused by visit volume fluctuations under traditional fixed scheduling but also enables advance reservation of mobile resources via the days-exceeding-error-threshold indicator, reducing the frequency of emergency dispatching. Overall, this study builds a bridge between model prediction and clinical decision-making, laying a foundation for future staffing allocation plans.
Acknowledgments
We would like to acknowledge the Department of Respiratory Medicine and the Big Data Engineering Center of Children's Hospital of Chongqing Medical University for their valuable support and assistance in providing the pediatric asthma outpatient visit data used in this study, as well as Mr Qifan Wu for his support with the outpatient and emergency visit data.
Ethics approval
The study was conducted in accordance with the Declaration of Helsinki (World Medical Association, 2022 version), and ethics approval was obtained from the Ethics Committee of the Children's Hospital Affiliated to Chongqing Medical University (File No. 338-2023).
Author contributions
Xin Zhang: study design, predictive model development, and drafting of the manuscript. Ximing Xu: assisted with study design and framework construction. Hongyao Leng: organized and analyzed the XGBoost and Bi-LSTM results. Qiao Shen: organized and analyzed the ARIMA and Prophet results. Yulin Liu: supported data collection. Zhanmei Zhang: generated and organized all figures and tables. Xianlan Zheng: coordinated research resources and finalized the manuscript for submission.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research is supported by the Natural Science Foundation of Chongqing Municipality (CSTB2023NSCQ-BHX0107), the China Postdoctoral Science Foundation (M2023M740440) and Chongqing Postdoctoral Science Foundation (2022CQBSHTB3065).
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data availability
The data that support the findings of this study are available in a deidentified form from Children's Hospital of Chongqing Medical University, but restrictions apply to the availability of these data, which were used under license for this study, and so are not publicly available. Data are, however, available from the authors upon reasonable request and with permission of Children's Hospital of Chongqing Medical University, by sending a request to the corresponding author.
Additional information
This manuscript has not been previously published and is not under consideration by any other journal. All authors have approved the submission of this manuscript.
