Abstract
Discrete-time hazard models are widely used when event times are measured in intervals or are not precisely observed. While these models can be estimated using standard generalized linear model techniques, they rely on extensive data augmentation, making estimation computationally demanding in large-scale, high-dimensional settings. In this article, we demonstrate how the recently proposed batchwise backfitting algorithm, a general framework for scalable estimation and variable selection in distributional regression, can be effectively extended to discrete hazard models. Using both simulated data and a large-scale application on infant mortality in sub-Saharan Africa, we show that the algorithm delivers accurate estimates, automatically selects relevant predictors and scales efficiently to large datasets. The findings underscore the algorithm’s practical utility for analyzing large-scale, complex survival data with high-dimensional covariates.
Keywords
Introduction
Modelling the time until an event occurs is a central task in many areas of applied statistics. This field, commonly referred to as time-to-event or survival analysis, has a wide range of applications, from modelling survival outcomes of children to predicting the duration of marriages or the failure time of mechanical components (e.g., Christodoulou, 2011; Musick and Michelmore, 2015; Burstein, R. et al., 2019). While much of the research in this area focuses on continuous-time models, discrete-time approaches are often more suitable when event times are measured in intervals (Tutz and Schmid, 2016). Foundational work by Allison (1982) established the basis for discrete-time survival models and subsequent developments have introduced extensions to enhance their flexibility (Fahrmeir and Wagenpfeil, 1996). Another related, but somewhat distinct strand of the literature is based on discretizing continuous duration time; see, for example, Efron (1988) as a standard reference and, more recently, Carollo et al. (2025), which adopts a modelling framework very similar to that of the present article.
In parallel, the development of generalized additive models (GAMs) has advanced statistical modelling by allowing for nonlinear, smooth effects of covariates (Hastie and Tibshirani, 1986). Building on this, innovations in smoothing techniques (Eilers and Marx, 1996), efficient estimation algorithms (Wood, 2003, 2004; Wood et al., 2017) and Bayesian extensions (Fahrmeir and Lang, 2001; Brezger and Lang, 2006) have made GAMs highly adaptable. These innovations have also influenced time-to-event analysis, leading to growing interest in flexible additive models for survival data (Tutz and Binder, 2004; Tutz and Schmid, 2016; Berger and Schmid, 2018). More recently, machine learning-inspired techniques based on tree methods have grown in popularity in this field (Schmid et al., 2016; Puth et al., 2020; Spuck et al., 2023).
Modern applications of time-to-event analysis increasingly involve large-scale, high-dimensional datasets, which present significant computational and statistical challenges. In such contexts, efficient estimation methods and automatic variable selection are critical. However, to the best of our knowledge, scalable and interpretable approaches for additive discrete time-to-event models remain underdeveloped and insufficiently tested. This article addresses this gap by extending the recently proposed batchwise backfitting algorithm by Umlauf et al. (2024) to additive discrete time-to-event models. Our contribution lies in introducing a scalable estimation strategy enabling simultaneous model fitting and variable selection for big datasets.
The performance of batchwise backfitting is evaluated through a comprehensive simulation study and application. The simulation study applies the method across a range of settings, including datasets with up to 10 million observations and numerous informative and uninformative covariates. The results demonstrate that the proposed approach achieves comparable or superior performance to established alternatives in terms of estimation accuracy, variable selection and estimation time. This suggests that the method is well-suited for large-scale, high-dimensional time-to-event data. To further demonstrate the practical usefulness of the method, we apply batchwise backfitting to model infant mortality in 10 sub-Saharan African countries with 351 705 children and 26 potential explanatory variables. The results highlight the algorithm’s ability to identify key determinants of child survival while maintaining scalability and interpretability.
The remainder of this article is organized as follows: Section 2 introduces the model and estimation framework. Section 3 presents the simulation study and discusses estimation results. Section 4 applies batchwise backfitting to model infant mortality. Section 5 concludes with a discussion of the results and future research directions.
Flexible discrete hazard models
Model specification
We consider a discrete time-to-event framework in which time is represented as a sequence of discrete periods and each observational unit or individual experiences at most one event. An event is defined as a one-time transition from one state to another (e.g., death, dropout or system failure). The model formulation and notation used in the following build on the frameworks of Tutz and Schmid (2016) and Berger and Schmid (2018).
Although time is generally conceptualized as continuous, in many practical applications it is recorded or analyzed in discrete intervals (e.g., months). Formally, let
Censoring is a common feature in time-to-event data, arising when the event time for an individual is not fully observed. The most frequent case is right censoring, which occurs when the start of observation is known, but the event has not occurred by the end of the observation period. For right-censored data, the observed time is defined as ti := min(Ti, Ci), where Ti is the (possibly unobserved) event time and
The hazard function is the central quantity in modelling time-to-event data. In a discrete-time setting, it is defined as the conditional probability that an event occurs at time t, given that the individual has ‘survived’ up to that time. For a given set of explanatory variables
which represents the probability of transitioning from the initial to the terminal state during interval t, conditional on survival up to t and covariates
where h(·) is a strictly increasing response function, such as the logit function
Each function
The primary focus is on estimating the model specified in Equation (2.2). If uninformative censoring is assumed, the likelihood function simplifies considerably, as censoring does not affect the distribution of the event time. The individual contribution to the likelihood is given by
The log-likelihood can be expressed as a sum of individual contributions, where each contribution corresponds to the log-likelihood of a Bernoulli-distributed random variable. Specifically, for each individual i, we define a binary indicator yir for each discrete time interval
where
To prevent the model terms fj(·) from overfitting, regularization is introduced through a penalized log-likelihood of the form
Estimation of discrete hazard models reduces to the estimation of a binary regression model but requires prior data augmentation. This involves expanding the original dataset by adding pseudo-observations for each individual i and each time point t = 1, …, ti during which the individual is at risk.
The data augmentation step is best illustrated with an example. Imagine the original dataset in Table 1 on the left. The individual identification variable is labelled as id, t is the time variable, y is the event indicator, where one indicates an event and zero otherwise and x1, x2 and x3 are time-invariant explanatory variables. Note that the individuals 1 and 4 are right-censored (y = 0).
Following augmentation, the dataset takes the form shown in Table 1 on the right. The time variable t needs to cover the individual risk period, the event indicator y is zero for the added rows, and the individual explanatory variables x1, x2 and x3 are duplicated (time-invariant covariates).
In many applications, this augmentation step leads to a substantial expansion of the data, from n individuals to
Example for the data augmentation step: Dataset before augmentation on the left and dataset after augmentation on the right.
Example for the data augmentation step: Dataset before augmentation on the left and dataset after augmentation on the right.
Moreover, existing large-sample implementations typically focus on estimation for a fixed model and offer limited support for automatic variable and smoothing parameter selection in high-dimensional settings. Recently, Umlauf et al. (2024) proposed the batchwise backfitting algorithm, an efficient approach for scalable estimation of distributional regression models that, at the same time, performs automatic selection of model terms and smoothing parameters.
The discrete hazard model in Equation (2.2) specifies a Bernoulli likelihood with the conditional event probability as its single distributional parameter. This fits naturally into the distributional regression framework, in which one or more distributional parameters are linked to predictors. Consequently, the Newton–Raphson-type updates used for distributional regression models directly apply here in iterative form. These equations are used to maximize the penalized log-likelihood, see Section 2.2 and estimate the coefficient vectors β
where
Scalable estimation is achieved by using only a randomly selected batch of the data. In discrete hazard models, where each individual contributes ti observations to the augmented dataset, the natural sampling unit is the individual. If an individual is selected for a batch, then all the corresponding observations are included. Formally,
The corresponding response vector
where ν is the step length control parameter specifying the amount of which
The updating function defined in Equation (2.5) can be applied in several ways. For instance, if the step length control parameter is set to ν = 0.1 and only the best-fitting model term fj(·) is updated in each iteration, the algorithm mimics a boosting-type approach. Specifically, the decision to update a model term is based on its log-likelihood contribution, evaluated on another random batch
The algorithm is considered to have converged when the ‘out-of-sample’ log-likelihood evaluated on other batches no longer improves. To better account for uncertainty, a common strategy is to refit the selected model with ν = 1, mimicking a resampling step. Estimates are then based on the last iterations, such as using Markov chain Monte Carlo simulation. This algorithm has proven to have excellent model term selection performance and can be applied to very large datasets, as is often the case in discrete hazard models. For a description of the algorithm in full detail, please refer to Umlauf et al. (2024).
The number of batches and thus the number of iterations L should be chosen sufficiently large such that no (major) further improvements in the out-of-sample log likelihood can be observed. Furthermore, a reasonable batch size M must be defined, which in previous studies was chosen between 10 000 and 50 000 (see, e.g., Umlauf et al., 2024 or Seiler et al., 2025). Depending on the complexity and type of data, the batches can be smaller or must be larger to ensure sufficient coverage of the information in the data.
A comprehensive simulation study is conducted to investigate the performance of batchwise backfitting and suitable benchmark methods in different settings and evaluate estimation accuracy, variable selection performance and estimation time. Each setting is replicated 250 times for each method under study.
(A) Specifications of the baseline hazard function f0(t). a = −3 is used in the basic setting. a = −2 and a = −4 are variations to investigate the influence of different event frequencies. (B–E) Univariate informative effects on ηit. (F) Bivariate spatial effect.
(A) Specifications of the baseline hazard function f0(t). a = −3 is used in the basic setting. a = −2 and a = −4 are variations to investigate the influence of different event frequencies. (B–E) Univariate informative effects on ηit. (F) Bivariate spatial effect.
In our basic setting, data is simulated for n = 5 000 individuals, k = 20 time points and the predictor is defined as
Equidistant design points are generated from the interval [0, 1] for the variable xi4 and from the interval [−3, 3] for the remaining variables. The effects of the informative variables are defined as f1(x1) = 0.5 x1 (referred to as linear), f2(x2) = 1.5 sin(x2) (sinus),
The event indicator yit is derived by comparing
Further settings
The further settings are variations of the basic setting to evaluate the performance of the estimation methods under different circumstances. In each setting, one specific feature is varied, while all other components remain as in the basic setting: (i) Number of individuals n: To investigate the performance for both small and large datasets, simulations are performed with number of individuals n = 1 000, 10 000, 50 000, 100 000, 500 000 and 1 000 000. This results in augmented datasets that range from approximately 10 000 to 10 000 000 rows. (ii) Baseline hazard specification f0(t): Alternative baseline hazard specifications are considered to analyze the influence of different event frequencies. Simulations are performed with a = −2 and a = −4, which corresponds to an event frequency of approximately 20% and 5%, respectively. These two variations are also visualized in Figure 1 in panel (A). Furthermore, we considered two additional functional forms for the baseline hazard: One that increases monotonically and another that first increases and then decreases over time, specified as f0(t) = −5 + 0.7 · log(t) and f0(t) = −5 + 2 · sin((t − 1)/7), respectively. Both specifications were scaled to yield event frequencies comparable to the basic setting with a = −3. The corresponding hazard shapes are shown in the Supplementary Materials A.2. (iii) Effect scaling SD: The scaling of the unique values of the effects is varied to SD = 0.5 and SD = 2 to investigate the influence of different effect sizes. (iv) Spatial effect fspa(lon, lat): In addition to the univariate effects, a bivariate spatial effect fspa(lon, lat) = 2.5 · sin(lon) · sin (0.5 · lat) − 0.3 is added to the predictor; see panel (F) of Figure 1. For lon and lat equidistant design points are sampled from the interval [−3, 3].
Methods and implementation
In the following, we refer to batchwise backfitting as BBFIT. The method is implemented in the R package
A maximum batch size of M = 20 000 is used in all settings. This implies that for the setting with 1 000 individuals, corresponding to an augmented dataset with around 10 000 rows, each batch consists of the entire model dataset including all individuals. For settings with more individuals, only a portion of the dataset is included in each batch until M is reached. More general guidance on choosing an appropriate batch size is provided in Umlauf et al. (2024).
The model specification builds on Wood (2003), using thin-plate regression splines (s()) with the default basis dimension for the baseline hazard and all univariate covariate effects. The bivariate spatial effect is modelled with a full tensor product smooth (te()) in its default form, following Wood (2006).
The performance of BBFIT is compared with two benchmark methods: (i) As a basic benchmark, we use an estimation method for generalized linear models with stepwise model selection based on the AIC. Specifically, the function
Performance measures
The following measures and tools are used: (i) Estimated effects are plotted against true effects, with means and quantiles used to detect systematic biases. The mean squared error (MSE) is also calculated as
Results
Basic setting Figure 2 shows the estimated effects of all informative variables and the first noise variable, while Figure 3 shows the corresponding distribution of the MSEs. Since the other noise variables show very similar patterns, their results are omitted here but can be found in Supplementary Materials A.1. To analyze the estimated effects in more detail, we created a web app that allows replications and effects to be visualized separately or as an overlay in a dynamic plot. The app also offers the possibility to explore the further settings described below. It can be accessed via bmueller5000.github.io/bb4sa-shinylive/ and the corresponding data can be found in doi.org/10.5281/zenodo.19330803.
Estimated effects of all informative variables and the first noise variable in the basic setting (see Section 3.1). Q2.5/Q97.5 indicate the 2.5% and 97.5% quantiles.
Estimated effects of all informative variables and the first noise variable in the basic setting (see Section 3.1). Q2.5/Q97.5 indicate the 2.5% and 97.5% quantiles.
MSEs of all informative variables and the first noise variable in the basic setting (see Section 3.1). Note: MSEs of BBFIT (see e.g., Noise 1 bottom right) are always zero as the noise variables are never selected.
Further results regarding the selection frequencies are discussed in Figure 4. Since the informative variables are always selected, we restrict the presentation to the noise variables. As mentioned above, the selection approach of BAM does not provide a clear selection rule. Therefore, different thresholds for the effective degrees of freedom (EDF) and p-values are tested as selection criteria and visualized in Figure 4 in the left panel.
Selection frequencies of the noise variables for different selection criteria of BAM on the left and selection frequencies of the noise variables (with p-value < 0.01 for BAM) in the basic setting (see Section 3.1) on the right.
From Figures 2–4 and the web app, we draw the following conclusions: (i) Bias: All methods provide largely unbiased estimates for the linear, sinus and squared effects f1, f2, f3 and substantial bias for the complex effect f4. This bias is, as expected, more pronounced for the GLM method, since cubic polynomials are not flexible enough to capture such complex effects. The GLM method also shows a clear bias for the baseline hazard f0. BAM shows a small bias in the baseline hazard f0 at higher t values, which is almost invisible in Figure 2. Surprisingly, this bias increases with the number of individuals; see the more detailed discussion below. (ii) MSE: The main findings for the MSEs are similar to those for the bias. For the linear, sinus and squared effects f1, f2, f3 the MSEs are almost zero for all methods. For the baseline hazard f0, BAM and BBFIT yield lower MSEs than GLM and for the complex effect f4, GLM is clearly outperformed. Regarding the noise variables, BBFIT achieves MSEs of exactly zero in all replications. BAM shows low but non-zero MSE values due to occasional false positive selections, while GLM produces considerably higher MSEs. (iii) Selection frequencies: Compared to the p-values, selection via the EDF consistently leads to higher selection frequencies for the noise variables in BAM (see Figure 4 left panel). The best results are obtained with the rule p-value < 0.01, which is therefore used for BAM in the remainder. The right panel of Figure 4 shows exceptional selection performance of BBFIT as the noise variables are never selected. The selection frequency of BAM (based on the rule p-value < 0.01) varies between 0.8% and 3.2%, and GLM is not competitive with selection frequencies between 9.6% and 14.8%.
Figure 5 displays the average MSEs across all 250 replications for varying numbers of individuals, highlighting the baseline hazard f0, the complex effect f4 and the first noise effect f5, which exhibit the most notable patterns. A plot, including all effects, is provided in Supplementary Materials A.4. Estimation using the GLM method was not feasible for datasets with more than 100 000 individuals and is therefore excluded from the analysis of large datasets. As the number of individuals increases, the estimation accuracy of all methods improves, as indicated by a decreasing average MSE. As discussed in the basic setting, GLM performs worst in estimating the complex effect f4 due to model specification limitations. BAM and BBFIT perform similarly for 50 000 to 1 000 000 individuals, while for smaller sample sizes no method clearly outperforms the others. Especially for n = 1 000, differences may, among other factors, be driven by smoothing parameter selection in low-information regions, rather than reflecting systematic methodological advantages.
Mean MSEs for different numbers of individuals of the baseline hazard f0, the complex effect f4 and the first noise effect f5.
Figure 6 displays the mean estimated baseline hazard together with quantiles based on 500 000 individuals, comparing BAM (with
Mean estimated effects together with the 2.5% and 97.5% quantiles of the baseline hazard f0 with 500 000 individuals for BAM. Q2.5/Q97.5 indicate the 2.5% and 97.5% quantiles.
Regarding variable selection, the left panel of Figure 7 shows the mean selection frequency of the noise variables for different numbers of individuals. Surprisingly, the performance of all methods is relatively unaffected by the number of individuals. The GLM method performs worst with a mean selection frequency of 9.0%–13.1%. BAM performs considerably better with selection rates between 1.0% and 2.1%. The best selection performance is achieved using BBFIT with perfect selection (0%) for 5 000 or more individuals.
Mean selection frequency of the six noise variables for different numbers of individuals on the left. Median estimation time in minutes for different numbers of individuals based on 25 replications run on a ‘standard’ PC (see Supplementary Materials C) on the right.
In terms of computation time, the right panel of Figure 7 shows the median estimation time for different numbers of individuals based on 25 replications on a ‘standard’ PC (see Supplementary Materials C). The median estimation time for individuals up to 100 000 ranges from a few seconds to around 4.5 minutes and is comparable for the three methods. For very large sample sizes of 500 000 or even 1 000 000 individuals, BBFIT clearly outperforms BAM with about half the median estimation time. While the estimation times for GLM and BBFIT are relatively stable, BAM shows some large outliers. For 500 000 individuals, for example, four of 25 replications have estimation times of about 250 minutes, while the remaining replications take about 13 minutes. For these exceptionally long estimation times, a warning is issued that the algorithm did not converge. However, the estimated effects for these replications show no qualitative differences compared to those with normal estimation times and no warnings.
Infant mortality rates, that is, the probability of a newborn child dying before reaching the age of one, remain high in sub-Saharan Africa. In 2023, the rate was 4.4% in this region, corresponding to approximately 1.8 million infant deaths—about 51% of the global total (UNIGME, 2025). Infant mortality is a complex issue influenced by various factors such as limited healthcare access, poor nutrition and environmental conditions. As a result, infant mortality rates are key indicators of broader human development, a priority highlighted in Sustainable Development Goal 3. We model infant mortality in 10 eastern sub-Saharan African countries using a time-to-event model with a structured additive predictor.
Data
The Demographic and Health Surveys (DHS, ICF, 2004-2017) serve as the primary data source and are merged with remotely sensed data on climate, demography and environmental factors. Seiler et al. (2025) provide a detailed overview of the data and details on the preprocessing steps (see Section Data and Supplementary Information). We use a subset of the data on individual children from Burundi, Ethiopia, Kenya, Malawi, Mozambique, Rwanda, the United Republic of Tanzania, Uganda, Zambia and Zimbabwe. Further data cleaning excluded surveys before 2 000 due to limited geo-referenced data availability, missing covariates and children born over five years before the survey to ensure accurate mortality measurement. The resulting dataset contains 351 705 individual children, of which about 4.5% died in the first year of life. Data augmentation substantially increases the dataset up to around 3.7 million rows, which are ultimately used for estimation. All basic information for modelling mortality is recorded for these children, which includes an indicator of whether the child is still alive at the time of observation (
Response and covariates included in the full model of the selection step.
Response and covariates included in the full model of the selection step.
The probability of death in month t = 0, … 11 is modelled with a discrete time-to-event model and the additive predictor defined as
Results
Figure 8 displays the relative updating frequencies of the boosting step in the left panel and the log-likelihood contribution in the right panel. It is the frequency with which a model term yields the best improvement of the out-of-sample log-likelihood by the total number of iterations. We find that the age of the child (
We illustrate our approach for presenting and interpreting results in a survival context with the covariate
Updating frequencies based on the boosting step on the left and log-likelihood contribution plot on the right. Note that variables that have never been updated are omitted. See Table 2 for variable descriptions.
Updating frequencies based on the boosting step on the left and log-likelihood contribution plot on the right. Note that variables that have never been updated are omitted. See Table 2 for variable descriptions.
Figure 9 provides the following interpretations: (i) Mothers age at birth varies between a minimum of 14 and a maximum of 48 years, with a median age of 25. The frequency of observations declines with increasing age, with only 356 observations in the age group 46–48. As already mentioned, this age group is highlighted in red in the plots to prevent overinterpretation of results where few individuals are observed. (ii) The top right panel already suggests that children of older mothers are more likely to die than children of younger mothers. This is confirmed with the estimated effect f8(
Detailed analyses similar to those for
Summary statistics and the distribution of magebirth in the top left panel and relative frequency of events (deaths) in the augmented data in the top right panel. Estimated effect of this covariate centred at zero in the middle panel. Estimated marginal survival probabilities for different child ages across the age of the mother at birth are in the bottom left, and estimated marginal survival probabilities over child age for different maternal ages at birth are in the bottom right panel.
The estimated effect of
Estimated effects of the covariates age (baseline hazard), hhs , and iyear centred at zero on the left. Estimated marginal survival probabilities in dependence of the child age for hhs and iyear on the right.
The estimated survival probabilities in the right panels of Figure 10 offer a clearer understanding of the true size of the effects. For example, children in small households (fewer than 5 members) have a substantially lower first-year survival probability—about 90%–95%–compared with children in larger households (8–10 members), where the probability is around 97%. For households with more than 15 members, the marginal probability of surviving the first year decreases again to around 95%. Similar interpretations are possible for the other explanatory variables.
This article introduces an efficient estimation framework for discrete time-to-event models with additive predictors by extending the recently proposed batchwise backfitting algorithm. Our contribution lies in combining scalable estimation with simultaneous automated variable selection in large-scale, high-dimensional survival settings—addressing key limitations of existing approaches.
Through a comprehensive simulation study, we demonstrate that the proposed method achieves high estimation accuracy, excellent variable selection performance and significantly reduced computation times, even with datasets comprising up to 10 million rows. Compared to benchmark methods such as GLM and BAM, BBFIT shows superior robustness and efficiency, especially in large-scale contexts. In particular, BBFIT consistently avoids the selection of uninformative covariates, while BAM and GLM have a non-negligible false selection rate. The application to modelling infant mortality further illustrates the practical usefulness of our approach. Despite the complexity of the data—both in terms of sample size and number of covariates—BBFIT identifies a small subset of influential factors and provides interpretable smooth effect estimates.
While the proposed framework is a powerful tool for large-scale, high-dimensional discrete time-to-event modelling, several directions for future work remain: For instance, exploring alternative model structures—such as piecewise exponential models or Box–Cox-transformed hazard models—could broaden its applicability beyond the current link and distributional assumptions. From a computational point of view, future work may focus on extending the algorithm towards a fully Bayesian framework to better account for model uncertainty. Final exploiting the specific structure of the likelihood—where all entries are zero except possibly the last—could lead to additional algorithmic efficiencies.
Footnotes
Acknowledgments
The computational results presented have been achieved (in part) using the HPC infrastructure LEO of the University of Innsbruck. We thank ICF International, Inc. and USAID for providing public access to the DHS data (ICF, 2004-2017).
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship and/or publication of this article: This research was funded in whole or in part by the Austrian Science Fund (FWF) (doi:10.55776/P33941). For open access purposes, the author has applied a CC BY public copyright licence to any author-accepted manuscript version arising from this submission.
Supplementary materials
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
