Simultaneous estimation and model choice for big discrete time-to-event data with additive predictors

Abstract

Discrete-time hazard models are widely used when event times are measured in intervals or are not precisely observed. While these models can be estimated using standard generalized linear model techniques, they rely on extensive data augmentation, making estimation computationally demanding in large-scale, high-dimensional settings. In this article, we demonstrate how the recently proposed batchwise backfitting algorithm, a general framework for scalable estimation and variable selection in distributional regression, can be effectively extended to discrete hazard models. Using both simulated data and a large-scale application on infant mortality in sub-Saharan Africa, we show that the algorithm delivers accurate estimates, automatically selects relevant predictors and scales efficiently to large datasets. The findings underscore the algorithm’s practical utility for analyzing large-scale, complex survival data with high-dimensional covariates.

Keywords

Additive predictors, automated variable selection, batchwise backfitting, big data, discrete time-to-event analysis

1 Introduction

Modelling the time until an event occurs is a central task in many areas of applied statistics. This field, commonly referred to as time-to-event or survival analysis, has a wide range of applications, from modelling survival outcomes of children to predicting the duration of marriages or the failure time of mechanical components (e.g., Christodoulou, 2011; Musick and Michelmore, 2015; Burstein et al., 2019). While much of the research in this area focuses on continuous-time models, discrete-time approaches are often more suitable when event times are measured in intervals (Tutz and Schmid, 2016). Foundational work by Allison (1982) established the basis for discrete-time survival models and subsequent developments have introduced extensions to enhance their flexibility (Fahrmeir and Wagenpfeil, 1996). A related but somewhat distinct strand of the literature is based on discretizing continuous duration time; see, for example, Efron (1988) as a standard reference and, more recently, Carollo et al. (2025), which adopts a modelling framework very similar to that of the present article.

In parallel, the development of generalized additive models (GAMs) has advanced statistical modelling by allowing for nonlinear, smooth effects of covariates (Hastie and Tibshirani, 1986). Building on this, innovations in smoothing techniques (Eilers and Marx, 1996), efficient estimation algorithms (Wood, 2003, 2004; Wood et al., 2017) and Bayesian extensions (Fahrmeir and Lang, 2001; Brezger and Lang, 2006) have made GAMs highly adaptable. These innovations have also influenced time-to-event analysis, leading to growing interest in flexible additive models for survival data (Tutz and Binder, 2004; Tutz and Schmid, 2016; Berger and Schmid, 2018). More recently, machine learning-inspired techniques based on tree methods have grown in popularity in this field (Schmid et al., 2016; Puth et al., 2020; Spuck et al., 2023).

Modern applications of time-to-event analysis increasingly involve large-scale, high-dimensional datasets, which present significant computational and statistical challenges. In such contexts, efficient estimation methods and automatic variable selection are critical. However, to the best of our knowledge, scalable and interpretable approaches for additive discrete time-to-event models remain underdeveloped and insufficiently tested. This article addresses this gap by extending the recently proposed batchwise backfitting algorithm by Umlauf et al. (2024) to additive discrete time-to-event models. Our contribution lies in introducing a scalable estimation strategy enabling simultaneous model fitting and variable selection for big datasets.

The performance of batchwise backfitting is evaluated through a comprehensive simulation study and application. The simulation study applies the method across a range of settings, including datasets with up to 10 million observations and numerous informative and uninformative covariates. The results demonstrate that the proposed approach achieves comparable or superior performance to established alternatives in terms of estimation accuracy, variable selection and estimation time. This suggests that the method is well-suited for large-scale, high-dimensional time-to-event data. To further demonstrate the practical usefulness of the method, we apply batchwise backfitting to model infant mortality in 10 sub-Saharan African countries with 351 705 children and 26 potential explanatory variables. The results highlight the algorithm’s ability to identify key determinants of child survival while maintaining scalability and interpretability.

The remainder of this article is organized as follows: Section 2 introduces the model and estimation framework. Section 3 presents the simulation study and discusses estimation results. Section 4 applies batchwise backfitting to model infant mortality. Section 5 concludes with a discussion of the results and future research directions.

2 Flexible discrete hazard models

2.1 Model specification

We consider a discrete time-to-event framework in which time is represented as a sequence of discrete periods and each observational unit or individual experiences at most one event. An event is defined as a one-time transition from one state to another (e.g., death, dropout or system failure). The model formulation and notation used in the following build on the frameworks of Tutz and Schmid (2016) and Berger and Schmid (2018).

Although time is generally conceptualized as continuous, in many practical applications it is recorded or analyzed in discrete intervals (e.g., months). Formally, let $t \in \{1, 2, \dots, k\}$ index the discrete time intervals, which may either reflect naturally discrete time or grouped continuous time, with boundaries $[0, a_{1}), [a_{1}, a_{2}), \dots, [a_{k - 1}, \infty)$. For individuals $i = 1, \dots, n$, let $T_{i} \in \{1, \dots, k\}$ denote the event time, where $T_{i} = t$ means that the event occurred in the interval $[a_{t - 1}, a_{t})$, with $a_{0} = 0$ and $a_{k} = \infty$. The timing of the event is modelled in relation to $p$ explanatory variables $x_{i} = (x_{i1}, \dots, x_{ip})^{\top}$.

Censoring is a common feature in time-to-event data, arising when the event time for an individual is not fully observed. The most frequent case is right censoring, which occurs when the start of observation is known, but the event has not occurred by the end of the observation period. For right-censored data, the observed time is defined as $t_{i} := \min(T_{i}, C_{i})$, where $T_{i}$ is the (possibly unobserved) event time and $C_{i} \in \{1, \dots, k\}$ denotes the censoring time for individual $i$. Other forms of censoring, such as left or interval censoring, are not considered in this work. Left truncation (delayed entry) can be accommodated naturally within the proposed framework by including only risk intervals from the individual-specific entry time onward.

The hazard function is the central quantity in modelling time-to-event data. In a discrete-time setting, it is defined as the conditional probability that an event occurs at time t, given that the individual has ‘survived’ up to that time. For a given set of explanatory variables x_i, i = 1, …, n, the discrete-time hazard function is defined as

$$\lambda(t \mid x_{i}) = P(T_{i} = t \mid T_{i} \geq t, x_{i}), \quad t = 1, \dots, k, \tag{2.1}$$

which represents the probability of transitioning from the initial to the terminal state during interval $t$, conditional on survival up to $t$ and covariates $x_{i}$. The corresponding survival function is given by $S(t \mid x_{i}) = P(T_{i} > t \mid x_{i}) = \prod_{r = 1}^{t} (1 - \lambda(r \mid x_{i}))$ and denotes the probability that the event occurs after time $t$ or, equivalently, that the individual survives the interval $[a_{t - 1}, a_{t})$. A general model for the discrete hazard function in Equation (2.1) can be defined as

$$\lambda(t \mid x_{i}) = h(\eta_{it}), \quad \eta_{it} = f_{0}(t; \beta_{0}) + f_{1}(x_{i}; \beta_{1}) + \dots + f_{p}(x_{i}; \beta_{p}), \tag{2.2}$$

where $h(\cdot)$ is a strictly increasing response function, such as the logistic function $h(x) = \exp(x) / (1 + \exp(x))$, that is, the inverse of the logit link. Here, $\eta_{it}$ is a structured additive predictor (Fahrmeir et al., 2004, 2022) for individual $i$ at time $t$, $f_{0}(\cdot)$ models the baseline hazard over time and $f_{1}(\cdot), \dots, f_{p}(\cdot)$ represent (possibly nonlinear) covariate effects. For notational simplicity, we assume the same number of covariates and nonlinear terms $p$ here.
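As a minimal sketch of how Equations (2.1) and (2.2) translate into numbers, the following Python snippet maps predictor values to hazards with the logistic response function and accumulates the survival function as a running product. The predictor values and the function names are illustrative assumptions and are not taken from the article.

```python
import math


def logistic(eta):
    """Response function h(eta) = exp(eta) / (1 + exp(eta))."""
    return 1.0 / (1.0 + math.exp(-eta))


def hazard_and_survival(eta):
    """Return hazards lambda(t) and survival S(t) for predictors eta_1, ..., eta_k."""
    hazards = [logistic(e) for e in eta]
    survival = []
    surv = 1.0
    for lam in hazards:
        surv *= 1.0 - lam  # S(t) = prod_{r <= t} (1 - lambda(r))
        survival.append(surv)
    return hazards, survival


# Hypothetical predictor values for k = 4 intervals (illustrative only).
eta_example = [-3.0, -3.2, -3.5, -3.9]
hazards, survival = hazard_and_survival(eta_example)
```

Because every factor $1 - \lambda(r \mid x_{i})$ lies strictly between 0 and 1, the resulting survival values decrease monotonically in $t$.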

Each function $f_{j}(\cdot)$ is approximated using basis functions, such as B-splines with an additional smoothness penalty as in penalized splines (e.g., Eilers and Marx, 1996; Wood, 2003). The (potentially nonlinear) effects are expressed as $f_{j}(X_{j}; \beta_{j}) = X_{j} \beta_{j}$, $j = 0, \dots, p$, where $X_{j}$ denotes the design matrix and $\beta_{j}$ the associated coefficients. Stacking all components, the predictor of the additive discrete hazard model can be expressed in matrix form as $\eta = \sum_{j = 0}^{p} X_{j} \beta_{j}$, where $\eta$ is the vector over all individuals and risk intervals.

2.2 Penalized likelihood

The primary focus is on estimating the model specified in Equation (2.2). If uninformative censoring is assumed, the likelihood function simplifies considerably, as censoring does not affect the distribution of the event time. The individual contribution to the likelihood is given by $L_{i} = c_{i} P(T_{i} = t_{i})^{\delta_{i}} P(T_{i} > t_{i})^{1 - \delta_{i}}$, where $\delta_{i} = I(T_{i} \leq C_{i})$ indicates whether the event is observed $(\delta_{i} = 1)$ or the observation is right-censored $(\delta_{i} = 0)$. The factor $c_{i} := P(C_{i} \geq t_{i})^{\delta_{i}} P(C_{i} = t_{i})^{1 - \delta_{i}}$ stems from the uninformative censoring mechanism; since it does not depend on the model parameters, it can be omitted from the likelihood.

The log-likelihood can be expressed as a sum of individual contributions, where each contribution corresponds to the log-likelihood of a Bernoulli-distributed random variable. Specifically, for each individual $i$, we define a binary indicator $y_{ir}$ for each discrete time interval $[a_{r - 1}, a_{r})$ up to the observed or censored time $t_{i}$, with $y_{ir} = 1$ if the event occurs in interval $r$ and $y_{ir} = 0$ otherwise. Since each individual can experience the event at most once, the data can be represented as a sequence of independent Bernoulli trials, each with success probability equal to the discrete hazard $\lambda(r \mid x_{i})$ of the respective interval. The total log-likelihood of the discrete hazard model then becomes

$$l(\beta; y, X) \propto \sum_{i = 1}^{n} \sum_{r = 1}^{t_{i}} \left[ y_{ir} \log \lambda(r \mid x_{i}) + (1 - y_{ir}) \log(1 - \lambda(r \mid x_{i})) \right], \tag{2.3}$$

where $X = (X_{0}, \dots, X_{p})$ is the full design matrix and $β = (β_{0}, \dots, β_{p})$ the full vector of regression coefficients.
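The equivalence between the Bernoulli form in Equation (2.3) and the event-time probability $P(T_{i} = t_{i})$ can be checked numerically. The sketch below does this for a single individual whose event falls in the third interval; the hazard values are hypothetical and the function name is an assumption made for illustration.

```python
import math


def bernoulli_loglik(y, hazards):
    """Sum of y * log(lambda) + (1 - y) * log(1 - lambda) over augmented rows."""
    return sum(
        yi * math.log(lam) + (1 - yi) * math.log(1.0 - lam)
        for yi, lam in zip(y, hazards)
    )


# One individual with the event in interval 3, hence y = (0, 0, 1).
y_rows = [0, 0, 1]
lam_rows = [0.1, 0.2, 0.3]  # hypothetical hazards for intervals 1 to 3
loglik = bernoulli_loglik(y_rows, lam_rows)

# Direct form: P(T = 3) = (1 - lambda(1)) * (1 - lambda(2)) * lambda(3).
direct = math.log((1.0 - 0.1) * (1.0 - 0.2) * 0.3)
```

Both quantities agree up to floating-point error, which is why the model can be fitted with software for binary regression once the data are augmented.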

To prevent the model terms $f_{j}(\cdot)$ from overfitting, regularization is introduced through a penalized log-likelihood of the form $l_{\mathrm{pen}}(\beta, \tau; y, X) \propto l(\beta; y, X) - \sum_{j = 0}^{p} \beta_{j}^{\top} P_{j}(\tau_{j}) \beta_{j}$, where $\tau_{j}$ is a (possibly vector-valued) smoothing parameter that controls the degree of regularization applied to each model component $f_{j}(\cdot)$ and $P_{j}(\tau_{j})$ is a corresponding quadratic penalty matrix determined by the structural assumptions on $f_{j}(\cdot)$. For example, in the case of univariate smooth terms modelled via P-splines, the smoothing parameter reduces to a scalar $\tau_{j}$ and the penalty matrix typically takes the form $P_{j}(\tau_{j}) = \tau_{j} K_{j}$, where $K_{j}$ is based on second-order difference penalties on $\beta_{j}$ (Fahrmeir et al., 2022, Section 8.1.2, pp. 449–464). The smoothing parameters $\tau_{j}$ thus govern the overall functional form of the estimated effects.
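To make the P-spline penalty concrete, the following sketch builds a second-order difference penalty matrix $K = D^{\top} D$ and evaluates the quadratic penalty for two hypothetical coefficient vectors. It assumes the standard construction of $D$ with rows $(1, -2, 1)$; the helper names and the value of the smoothing parameter are illustrative only.

```python
def second_diff_penalty(q):
    """Return K = D^T D, with D the (q - 2) x q second-order difference matrix."""
    D = [[0.0] * q for _ in range(q - 2)]
    for r in range(q - 2):
        D[r][r], D[r][r + 1], D[r][r + 2] = 1.0, -2.0, 1.0
    return [
        [sum(D[r][a] * D[r][b] for r in range(q - 2)) for b in range(q)]
        for a in range(q)
    ]


def penalty_value(beta, K, tau):
    """Quadratic penalty beta^T (tau * K) beta."""
    q = len(beta)
    return tau * sum(beta[a] * K[a][b] * beta[b] for a in range(q) for b in range(q))


K = second_diff_penalty(5)
linear_coefs = [1.0, 2.0, 3.0, 4.0, 5.0]    # second differences are all zero
wiggly_coefs = [1.0, -1.0, 1.0, -1.0, 1.0]  # second differences are 4, -4, 4
pen_linear = penalty_value(linear_coefs, K, tau=2.0)
pen_wiggly = penalty_value(wiggly_coefs, K, tau=2.0)
```

Coefficients that lie on a straight line are not penalized at all, whereas strongly oscillating coefficients receive a large penalty, so increasing $\tau_{j}$ pushes the estimated effect towards a linear function.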

2.3 Estimation

Estimation of discrete hazard models reduces to the estimation of a binary regression model but requires prior data augmentation. This involves expanding the original dataset by adding pseudo-observations for each individual i and each time point t = 1, …, t_i during which the individual is at risk.

The data augmentation step is best illustrated with an example. Consider the original dataset shown on the left of Table 1. The individual identification variable is labelled id, t is the time variable, y is the event indicator (one indicates an event, zero a censored observation) and x₁, x₂ and x₃ are time-invariant explanatory variables. Note that individuals 1 and 4 are right-censored (y = 0).

Following augmentation, the dataset takes the form shown in Table 1 on the right. The time variable t needs to cover the individual risk period, the event indicator y is zero for the added rows, and the individual explanatory variables x₁, x₂ and x₃ are duplicated (time-invariant covariates).

In many applications, this augmentation step leads to a substantial expansion of the data, from n individuals to $N = \sum_{i = 1}^{n} t_{i}$ Bernoulli observations. Even for moderate n, N can easily reach several millions once all risk intervals are represented. Combined with multiple smooth covariate effects and possible interactions, classical iteratively weighted least squares (IWLS)/backfitting procedures must repeatedly process the full augmented data in each iteration and for each smoothing parameter update, which quickly becomes computationally demanding and memory-intensive. Efficient estimation methods for GAMs are therefore essential (see, e.g., Lang et al., 2014; Wood et al., 2017).

Table 1

Example for the data augmentation step: Dataset before augmentation on the left and dataset after augmentation on the right.

id	t	y	x₁	x₂	x₃	id	t	y	x₁	x₂	x₃
1	1	0	x₁₁	x₁₂	x₁₃	1	1	0	x₁₁	x₁₂	x₁₃
2	3	1	x₂₁	x₂₂	x₂₃	2	1	0	x₂₁	x₂₂	x₂₃
3	2	1	x₃₁	x₃₂	x₃₃	2	2	0	x₂₁	x₂₂	x₂₃
4	3	0	x₄₁	x₄₂	x₄₃	2	3	1	x₂₁	x₂₂	x₂₃
						3	1	0	x₃₁	x₃₂	x₃₃
						3	2	1	x₃₁	x₃₂	x₃₃
						4	1	0	x₄₁	x₄₂	x₄₃
						4	2	0	x₄₁	x₄₂	x₄₃
						4	3	0	x₄₁	x₄₂	x₄₃
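The augmentation shown in Table 1 can be sketched in a few lines of Python. The function below is a minimal illustration for time-invariant covariates and right-censored data without delayed entry; the record layout and the placeholder covariate labels are assumptions made for this example.

```python
def augment(records):
    """Expand one row per individual into one row per individual and risk interval.

    records: list of (id, t, y, covariates) with observed time t and event indicator y.
    """
    rows = []
    for ident, t_obs, event, covariates in records:
        for r in range(1, t_obs + 1):
            # The indicator is 1 only in the last interval of an observed event.
            y_r = 1 if (event == 1 and r == t_obs) else 0
            rows.append((ident, r, y_r, covariates))
    return rows


# The four individuals of Table 1; covariate values are placeholders.
original = [
    (1, 1, 0, ("x11", "x12", "x13")),
    (2, 3, 1, ("x21", "x22", "x23")),
    (3, 2, 1, ("x31", "x32", "x33")),
    (4, 3, 0, ("x41", "x42", "x43")),
]
augmented = augment(original)
```

The four original rows expand to nine rows, one for each interval in which an individual is at risk, matching the right-hand side of Table 1.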

Moreover, existing large-sample implementations typically focus on estimation for a fixed model and offer limited support for automatic variable and smoothing parameter selection in high-dimensional settings. Recently, Umlauf et al. (2024) proposed the batchwise backfitting algorithm, an efficient approach for scalable estimation of distributional regression models that, at the same time, performs automatic selection of model terms and smoothing parameters.

The discrete hazard model in Equation (2.2) specifies a Bernoulli likelihood with the conditional event probability as its single distributional parameter. This fits naturally into the distributional regression framework, in which one or more distributional parameters are linked to predictors. Consequently, the Newton–Raphson-type updates used for distributional regression models directly apply here in iterative form. These updates are used to maximize the penalized log-likelihood (see Section 2.2) and to estimate the coefficient vectors $\beta_{j}$. For iteration $l + 1$, the update for component $j$ is given by

$$\beta_{j}^{[l + 1]} = \left(X_{j}^{\top} W X_{j} + P_{j}(\tau_{j})\right)^{-1} X_{j}^{\top} W \left(z - \eta_{-j}^{[l + 1]}\right), \tag{2.4}$$

where $z = \eta^{[l]} + W^{-1} u$ is a vector of working observations and $u = \partial l(\beta; y, X) / \partial \eta$ is the score vector. The matrix $W = -\mathrm{diag}(\partial^{2} l(\beta; y, X) / \partial \eta^{2})$ is a diagonal weight matrix of size $N \times N$, where $N = \sum_{i = 1}^{n} t_{i}$ is the total number of augmented binary observations. Each diagonal entry in $W$ is evaluated at the current state $\beta^{[l]}$. The expression in Equation (2.4) corresponds to a backfitting update for the model term $f_{j}(\cdot)$, with $\eta_{-j}$ denoting the predictor excluding the $j$-th model component. This backfitting loop cycles through model terms $j = 0, \dots, p$ and is iterated until convergence, for example, when the relative change in the coefficients falls below a prespecified threshold. Smoothing parameters $\tau_{j}$ can be estimated using stepwise selection procedures (similar to Belitz and Lang, 2008), optimizing each component sequentially within adaptive search intervals and using an appropriate information criterion (e.g., AIC or BIC). As mentioned above, in many practical cases, $\tau_{j}$ reduces to a scalar. For further algorithmic details, see Umlauf et al. (2018).
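For the logit model, the quantities in Equation (2.4) have simple closed forms: $u = y - \lambda$ and $W = \mathrm{diag}(\lambda (1 - \lambda))$. The sketch below illustrates the backfitting cycle for a deliberately small case with an intercept and one linear term, where each design matrix has a single column and the matrix inverse reduces to a scalar division. It is not the bamlss implementation; the data, the function names and the fixed number of sweeps are assumptions made for illustration.

```python
import math


def logistic(eta):
    return 1.0 / (1.0 + math.exp(-eta))


def working_quantities(y, eta):
    """Score u, diagonal of W and working response z for the logit model."""
    lam = [logistic(e) for e in eta]
    u = [yi - li for yi, li in zip(y, lam)]
    w = [li * (1.0 - li) for li in lam]
    z = [e + ui / wi for e, ui, wi in zip(eta, u, w)]
    return u, w, z


def update_single_column(x, w, z, eta_minus_j, tau):
    """Penalized weighted least-squares update for a term with one basis column."""
    num = sum(wi * xi * (zi - ei) for wi, xi, zi, ei in zip(w, x, z, eta_minus_j))
    den = sum(wi * xi * xi for wi, xi in zip(w, x)) + tau
    return num / den


def backfit(y, x, n_iter=50, tau=0.0):
    """Cycle over an intercept and one linear term for n_iter sweeps."""
    b0, b1 = 0.0, 0.0
    ones = [1.0] * len(y)
    for _ in range(n_iter):
        eta = [b0 + b1 * xi for xi in x]
        _, w, z = working_quantities(y, eta)  # evaluated at the current state
        b0 = update_single_column(ones, w, z, [b1 * xi for xi in x], 0.0)
        b1 = update_single_column(x, w, z, [b0] * len(x), tau)  # uses updated b0
    return b0, b1


# Small hypothetical dataset (not separable, so the estimate is finite).
x = [-2.0, -1.0, 0.0, 1.0, 2.0, -2.0, -1.0, 0.0, 1.0, 2.0]
y = [0, 0, 1, 1, 1, 0, 1, 0, 0, 1]
b0, b1 = backfit(y, x)
```

With `tau=0.0` the fixed point of the cycle satisfies the unpenalized score equations; smooth terms with several basis columns would require solving a linear system instead of the scalar division used here.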

Scalable estimation is achieved by using only a randomly selected batch of the data. In discrete hazard models, where each individual contributes $t_{i}$ observations to the augmented dataset, the natural sampling unit is the individual. If an individual is selected for a batch, then all the corresponding observations are included. Formally, let $i = (1, \dots, N)^{\top} = (i_{1}^{\top}, \dots, i_{n}^{\top})^{\top}$, with $N = \sum_{m = 1}^{n} t_{m}$, denote the vector of stacked row indices in the augmented dataset, where $i_{i} = (\sum_{m = 1}^{i - 1} t_{m} + 1, \dots, \sum_{m = 1}^{i} t_{m})^{\top}$ gives the indices corresponding to individual $i$, for $i = 1, \dots, n$. A batch is then defined by selecting a random subset $s \subseteq \{1, \dots, n\}$ of individuals and using all corresponding observations, that is, $i_{s} = \cup_{i \in s} i_{i}$.
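The index bookkeeping described above can be sketched as follows, using the observed times of the four individuals in Table 1. Sampling individuals without replacement, the seed and the function names are assumptions made for this illustration.

```python
import random


def index_blocks(obs_times):
    """Row-index block (1-based) of each individual in the augmented data."""
    blocks, start = [], 1
    for t_obs in obs_times:
        blocks.append(list(range(start, start + t_obs)))
        start += t_obs
    return blocks


def sample_batch(blocks, n_individuals, rng):
    """Draw individuals without replacement and return all of their rows."""
    chosen = sorted(rng.sample(range(len(blocks)), n_individuals))
    rows = [row for i in chosen for row in blocks[i]]
    return chosen, rows


obs_times = [1, 3, 2, 3]            # observed times t_i from Table 1
blocks = index_blocks(obs_times)    # [[1], [2, 3, 4], [5, 6], [7, 8, 9]]
rng = random.Random(1)              # fixed seed, illustrative only
chosen, batch_rows = sample_batch(blocks, 2, rng)  # chosen holds 0-based positions
```

Sampling at the level of individuals keeps each person's complete risk history inside a batch, so the batch contributes complete individual likelihood terms.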

The corresponding response vector $y_{[i_{s}]}$ and covariate matrix $X_{[i_{s}]}$ are used to compute a stochastic updating step of the form

$$\begin{aligned} \beta_{j}^{[l + 1]} &= (1 - \nu) \cdot \beta_{j}^{[l]} + \nu \cdot \left(X_{[i_{s}], j}^{\top} W_{[i_{s}]} X_{[i_{s}], j} + P_{j}(\tau_{j})\right)^{-1} X_{[i_{s}], j}^{\top} W_{[i_{s}]} \left(z_{[i_{s}]} - \eta_{[i_{s}], -j}^{[l + 1]}\right) \\ &= (1 - \nu) \cdot \beta_{j}^{[l]} + \nu \cdot \beta_{[i_{s}], j}, \end{aligned} \tag{2.5}$$

where $\nu$ is the step length control parameter specifying the amount by which $\beta_{j}^{[l]}$ is updated in the direction of the new estimate $\beta_{[i_{s}], j}$ on batch $[i_{s}]$. Note that the working weights $W_{[i_{s}]}$ and the score vectors $u$ used to compute $z_{[i_{s}]}$ are evaluated at the $l$-th estimate $\beta_{j}^{[l]}$. In each iteration, Equation (2.5) is evaluated on exactly one batch $[i_{s}]$, such that the computational burden can be reduced considerably.
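The second line of Equation (2.5) is a convex combination of the current coefficients and the batch estimate. A minimal sketch, with hypothetical coefficient values and an assumed function name, shows the effect of the step length:

```python
def damped_update(beta_old, beta_batch, nu):
    """Move beta_old toward the batch estimate by step length nu (0 < nu <= 1)."""
    return [(1.0 - nu) * bo + nu * bb for bo, bb in zip(beta_old, beta_batch)]


beta_old = [0.0, 1.0]    # hypothetical current coefficients
beta_batch = [1.0, 3.0]  # hypothetical estimate computed on one batch
small_step = damped_update(beta_old, beta_batch, 0.1)  # moves 10% of the way
full_step = damped_update(beta_old, beta_batch, 1.0)   # jumps to the batch estimate
```

A small step length averages information over many batches, while a step length of one makes each iterate depend on the current batch only.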

The updating function defined in Equation (2.5) can be applied in several ways. For instance, if the step length control parameter is set to $\nu = 0.1$ and only the best-fitting model term $f_{j}(\cdot)$ is updated in each iteration, the algorithm mimics a boosting-type approach. Specifically, the decision to update a model term is based on its log-likelihood contribution, evaluated on another random batch $[\tilde{i}_{s}]$. This design helps mitigate overfitting, as demonstrated by Umlauf et al. (2024). Similarly, smoothing parameters $\tau_{j}$ are selected by minimizing an information criterion (e.g., AIC) on a different independent batch. This simultaneous selection of model terms and smoothing parameters offers a major advantage as it eliminates the need to determine an optimal stopping iteration, which is typically required in classical boosting algorithms (e.g., via computationally intensive cross-validation).

The algorithm is considered to have converged when the ‘out-of-sample’ log-likelihood evaluated on other batches no longer improves. To better account for uncertainty, a common strategy is to refit the selected model with $\nu = 1$, mimicking a resampling step. Estimates are then computed from the final iterations, in the same way that samples are summarized in Markov chain Monte Carlo simulation. This algorithm has proven to have excellent model term selection performance and can be applied to very large datasets, as is often the case in discrete hazard models. For a description of the algorithm in full detail, please refer to Umlauf et al. (2024).

The number of batches, and thus the number of iterations L, should be chosen sufficiently large such that no (major) further improvements in the out-of-sample log-likelihood can be observed. Furthermore, a reasonable batch size M must be defined, which in previous studies was chosen between 10 000 and 50 000 (see, e.g., Umlauf et al., 2024 or Seiler et al., 2025). Depending on the complexity and type of data, smaller batches may suffice or larger batches may be needed to ensure sufficient coverage of the information in the data.

3 Simulation study

A comprehensive simulation study is conducted to investigate the performance of batchwise backfitting and suitable benchmark methods in different settings and evaluate estimation accuracy, variable selection performance and estimation time. Each setting is replicated 250 times for each method under study.

Figure 1

(A) Specifications of the baseline hazard function f₀(t). a = −3 is used in the basic setting. a = −2 and a = −4 are variations to investigate the influence of different event frequencies. (B–E) Univariate informative effects on η_it. (F) Bivariate spatial effect.

3.1 Design and basic setting

In our basic setting, data are simulated for n = 5 000 individuals and k = 20 time points, and the predictor is defined as $\eta_{it} = f_{0}(t) + f_{1}(x_{i1}) + \dots + f_{10}(x_{i10})$, where f₀(t) represents the baseline hazard function, $f_{1}(x_{i1}), \dots, f_{4}(x_{i4})$ are the effects of informative variables $x_{i1}, \dots, x_{i4}$ and $f_{5}(x_{i5}) \equiv 0, \dots, f_{10}(x_{i10}) \equiv 0$ are the effects of the six uninformative/noise variables $x_{i5}, \dots, x_{i10}$. The baseline hazard function f₀(t) is specified to decrease logarithmically over time as $f_{0}(t) = a - 0.5 \cdot \log(t)$. In the basic setting a = −3, which corresponds to an event frequency of approximately 10%, see panel (A) of Figure 1. As specified in Section 2.1, the logistic response function h(·), corresponding to the logit link, is used to obtain the conditional event probabilities.

Equidistant design points are generated from the interval [0, 1] for the variable $x_{i4}$ and from the interval [−3, 3] for the remaining variables. The effects of the informative variables are defined as f₁(x₁) = 0.5 x₁ (referred to as linear), f₂(x₂) = 1.5 sin(x₂) (sinus), $f_{3}(x_{3}) = x_{3}^{1 / 3} - 1.5$ (squared) and $f_{4}(x_{4}) = \sin(2 \cdot (4 \cdot x_{4} - 2)) + 2 \cdot \exp(-16^{2} \cdot (x_{4} - 0.5)^{2})$ (complex), see Figure 1 panels (B–E). The unique values of the effects f₁(x₁), …, f₄(x₄) are scaled to have a specific standard deviation (SD), which is set to SD = 1 in the basic setting.

The event indicator $y_{it}$ is derived by comparing $\lambda(t \mid x_{i})$ with random draws from the uniform distribution $u_{it} \sim U(0, 1)$, that is, $y_{it} = 1$ if $u_{it} \leq \lambda(t \mid x_{i})$ and $y_{it} = 0$ otherwise. For each individual, only the first occurrence of an event (i.e., the first time $y_{it} = 1$) is considered as the event time $T_{i}$ and a random censoring time $C_{i}$ is generated, where $C_{i} \sim U(\{1, 2, \dots, 20\})$. The individual’s observation time is then defined as $t_{i} = \min(T_{i}, C_{i}, 20)$. This results in approximately $(n \cdot k)/2$ rows in the augmented dataset used for estimation. In the basic setting (where n = 5 000), the augmented dataset consists of around 50 000 rows.
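The data-generating mechanism can be sketched as follows. For brevity, this illustration uses only the baseline hazard of the basic setting and omits the covariate effects; the seed, the sample size of 1 000 and the function names are assumptions and do not reproduce the simulation code of the study.

```python
import math
import random


def simulate_individual(eta_fun, k, rng):
    """Simulate the observed time t_i and event indicator delta_i of one individual."""
    event_time = None
    for t in range(1, k + 1):
        lam = 1.0 / (1.0 + math.exp(-eta_fun(t)))
        if rng.random() <= lam:      # y_it = 1 if u_it <= lambda(t | x_i)
            event_time = t           # only the first event counts
            break
    censor_time = rng.randint(1, k)  # C_i uniform on {1, ..., k}
    if event_time is not None and event_time <= censor_time:
        return event_time, 1         # event observed
    return censor_time, 0            # right-censored


def baseline(t, a=-3.0):
    """Baseline of the basic setting: f0(t) = a - 0.5 * log(t)."""
    return a - 0.5 * math.log(t)


rng = random.Random(2024)  # arbitrary seed
sample = [simulate_individual(baseline, 20, rng) for _ in range(1000)]
n_rows = sum(t for t, _ in sample)  # number of rows after augmentation
```

Because the censoring time never exceeds $k = 20$, taking the minimum of event and censoring time already enforces the upper bound of 20 on the observation time.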

3.2 Further settings

The further settings are variations of the basic setting to evaluate the performance of the estimation methods under different circumstances. In each setting, one specific feature is varied, while all other components remain as in the basic setting: (i) Number of individuals n: To investigate the performance for both small and large datasets, simulations are performed with n = 1 000, 10 000, 50 000, 100 000, 500 000 and 1 000 000 individuals. This results in augmented datasets that range from approximately 10 000 to 10 000 000 rows. (ii) Baseline hazard specification f₀(t): Alternative baseline hazard specifications are considered to analyze the influence of different event frequencies. Simulations are performed with a = −2 and a = −4, which correspond to event frequencies of approximately 20% and 5%, respectively. These two variations are also visualized in Figure 1 in panel (A). Furthermore, we consider two additional functional forms for the baseline hazard: one that increases monotonically and another that first increases and then decreases over time, specified as f₀(t) = −5 + 0.7 · log(t) and f₀(t) = −5 + 2 · sin((t − 1)/7), respectively. Both specifications were scaled to yield event frequencies comparable to the basic setting with a = −3. The corresponding hazard shapes are shown in the Supplementary Materials A.2. (iii) Effect scaling SD: The scaling of the unique values of the effects is varied to SD = 0.5 and SD = 2 to investigate the influence of different effect sizes. (iv) Spatial effect f_spa(lon, lat): In addition to the univariate effects, a bivariate spatial effect f_spa(lon, lat) = 2.5 · sin(lon) · sin(0.5 · lat) − 0.3 is added to the predictor; see panel (F) of Figure 1. For lon and lat, equidistant design points are sampled from the interval [−3, 3].

3.3 Methods and implementation

In the following, we refer to batchwise backfitting as BBFIT. The method is implemented in the R package bamlss (Umlauf et al., 2018, 2021) and the batchwise backfitting algorithm (Umlauf et al., 2024) can be used by setting optimizer = opt_bbfit in the main function bamlss(). We use a combination of the boosting and resampling variants (see Section 2.3) of the algorithm in a two-step procedure with model comparison based on out-of-sample AIC. In the first step, the boosting variant is executed for L = 200 iterations including all possible covariates, with select = TRUE specified. The resampling step is also executed for L = 200 iterations, with the first 100 iterations discarded as a burn-in phase. Here, slice sampling (Neal, 2003) is used for the smoothing parameters.

A maximum batch size of M = 20 000 is used in all settings. This implies that for the setting with 1 000 individuals, corresponding to an augmented dataset with around 10 000 rows, each batch consists of the entire model dataset including all individuals. For settings with more individuals, only a portion of the dataset is included in each batch until M is reached. More general guidance on choosing an appropriate batch size is provided in Umlauf et al. (2024).

The model specification builds on Wood (2003), using thin-plate regression splines (s()) with the default basis dimension for the baseline hazard and all univariate covariate effects. The bivariate spatial effect is modelled with a full tensor product smooth (te()) in its default form, following Wood (2006).

The performance of BBFIT is compared with two benchmark methods: (i) As a basic benchmark, we use an estimation method for generalized linear models with stepwise model selection based on the AIC. Specifically, the function stepAIC() from the R package MASS (Venables and Ripley, 2002) with the specification direction = "both" is applied. The baseline hazard function and all effects of the explanatory variables are modelled with cubic polynomials. In the following, we refer to this approach as GLM. (ii) For more flexible specifications, an estimation method for GAMs optimized for (very) large datasets is employed. Here we use the bam() function (Wood et al., 2015, 2017) from the R package mgcv (Wood, 2003, 2004, 2011; Wood et al., 2016, 2017). The smoothing parameter is estimated using the default fast version of the restricted maximum likelihood approach (method = "fREML"). In the following, this approach is referred to as BAM.

3.4 Performance measures

The following measures and tools are used: (i) Estimated effects are plotted against true effects, with means and quantiles used to detect systematic biases. The mean squared error (MSE) is also calculated as ${MSE}_{f} = \frac{1}{m} \sum_{i = 1}^{m} {(\hat{f} (x_{i}) - f (x_{i}))}^{2}$ , where f(·) is the true effect, $\hat{f} (\cdot)$ the estimate and x₁, …, x_m span the full range of x. (ii) Selection performance is evaluated using selection frequencies, with an emphasis on the noise variables. For GLM and BBFIT, selection is based on variable inclusion or exclusion. For BAM, no clear rule exists since uninformative effects are estimated to be near—but never exactly—zero. This issue is discussed further in Section 3.5. (iii) Estimation time refers to the time required to estimate the model. It is evaluated on 25 replications that are run on a ‘standard’ PC (see Supplementary Materials C) to ensure practical relevance.
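The MSE in (i) can be computed on a grid of evaluation points as sketched below. The grid size, the unscaled form of the sinus effect and the function names are assumptions made for this illustration.

```python
import math


def mse(f_hat, f_true, grid):
    """Mean squared error of an estimated effect over grid points x_1, ..., x_m."""
    return sum((f_hat(x) - f_true(x)) ** 2 for x in grid) / len(grid)


grid = [-3.0 + 6.0 * i / 100 for i in range(101)]  # m = 101 points spanning [-3, 3]


def f_sinus(x):
    """Unscaled sinus effect f2(x) = 1.5 * sin(x)."""
    return 1.5 * math.sin(x)


def f_zero(x):
    """Effect of a noise variable, or of a term that was not selected."""
    return 0.0


mse_noise_excluded = mse(f_zero, f_zero, grid)  # noise term correctly dropped
mse_sinus_missed = mse(f_zero, f_sinus, grid)   # informative term wrongly dropped
```

A noise variable that is excluded from the model has an estimated effect of exactly zero and therefore an MSE of exactly zero, whereas any smooth estimate that merely shrinks towards zero yields a small positive MSE.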

3.5 Results

Basic setting. Figure 2 shows the estimated effects of all informative variables and the first noise variable, while Figure 3 shows the corresponding distribution of the MSEs. Since the other noise variables show very similar patterns, their results are omitted here but can be found in Supplementary Materials A.1. To analyze the estimated effects in more detail, we created a web app that allows replications and effects to be visualized separately or as an overlay in a dynamic plot. The app also offers the possibility to explore the further settings described below. It can be accessed via bmueller5000.github.io/bb4sa-shinylive/ and the corresponding data can be found in doi.org/10.5281/zenodo.19330803.

Figure 2

Estimated effects of all informative variables and the first noise variable in the basic setting (see Section 3.1). Q2.5/Q97.5 indicate the 2.5% and 97.5% quantiles.

Figure 3

MSEs of all informative variables and the first noise variable in the basic setting (see Section 3.1). Note: The MSEs of BBFIT for the noise variables (see, e.g., Noise 1, bottom right) are always zero as the noise variables are never selected.

Further results regarding the selection frequencies are shown in Figure 4. Since the informative variables are always selected, we restrict the presentation to the noise variables. As mentioned above, BAM does not provide a clear selection rule. Therefore, different thresholds for the effective degrees of freedom (EDF) and p-values are tested as selection criteria and visualized in the left panel of Figure 4.

Figure 4

Selection frequencies of the noise variables for different selection criteria of BAM on the left and selection frequencies of the noise variables (with p-value < 0.01 for BAM) in the basic setting (see Section 3.1) on the right.

From Figures 2–4 and the web app, we draw the following conclusions: (i) Bias: All methods provide largely unbiased estimates for the linear, sinus and squared effects f₁, f₂, f₃ but show substantial bias for the complex effect f₄. This bias is, as expected, more pronounced for the GLM method, since cubic polynomials are not flexible enough to capture such complex effects. The GLM method also shows a clear bias for the baseline hazard f₀. BAM shows a small bias in the baseline hazard f₀ at higher t values, which is almost invisible in Figure 2. Surprisingly, this bias increases with the number of individuals; see the more detailed discussion below. (ii) MSE: The main findings for the MSEs are similar to those for the bias. For the linear, sinus and squared effects f₁, f₂, f₃ the MSEs are almost zero for all methods. For the baseline hazard f₀, BAM and BBFIT yield lower MSEs than GLM and for the complex effect f₄, GLM is clearly outperformed. Regarding the noise variables, BBFIT achieves MSEs of exactly zero in all replications. BAM shows low but non-zero MSE values due to occasional false positive selections, while GLM produces considerably higher MSEs. (iii) Selection frequencies: Compared to the p-values, selection via the EDF consistently leads to higher selection frequencies for the noise variables in BAM (see Figure 4 left panel). The best results are obtained with the rule p-value < 0.01, which is therefore used for BAM in the remainder. The right panel of Figure 4 shows exceptional selection performance of BBFIT as the noise variables are never selected. The selection frequency of BAM (based on the rule p-value < 0.01) varies between 0.8% and 3.2%, and GLM is not competitive with selection frequencies between 9.6% and 14.8%.

Further settings. Among the additional settings discussed in Section 3.2, only varying the number of individuals yields notable results. The alternative baseline hazard specifications and the inclusion of a spatial effect lead to only minor differences and are therefore deferred to Supplementary Materials A.2 and A.3. The scaling of effects does not provide any meaningful insights and is therefore not discussed further.

Figure 5 displays the average MSEs across all 250 replications for varying numbers of individuals, highlighting the baseline hazard f₀, the complex effect f₄ and the first noise effect f₅, which exhibit the most notable patterns. A plot including all effects is provided in Supplementary Materials A.4. Estimation using the GLM method was not feasible for datasets with more than 100 000 individuals and is therefore excluded from the analysis of large datasets. As the number of individuals increases, the estimation accuracy of all methods improves, as indicated by a decreasing average MSE. As discussed in the basic setting, GLM performs worst in estimating the complex effect f₄ due to model specification limitations. BAM and BBFIT perform similarly for 50 000 to 1 000 000 individuals, while for smaller sample sizes no method clearly outperforms the others. Especially for n = 1 000, differences may, among other factors, be driven by smoothing parameter selection in low-information regions rather than reflecting systematic methodological advantages.

Figure 5

Mean MSEs for different numbers of individuals of the baseline hazard f₀, the complex effect f₄ and the first noise effect f₅.

Figure 6 displays the mean estimated baseline hazard together with quantiles based on 500 000 individuals, comparing BAM (with select = TRUE, see Section 3.3) and the bam() method without variable selection (select = FALSE). As mentioned above, and as is clearly visible with 500 000 individuals, BAM shows a bias in the baseline hazard at higher t values. To investigate this bias further, we tested different methods for estimating the smoothing parameters and varied the number of knots and the basis of the splines. These variations did not resolve the issue; however, we found that the bias is not present when using the method without variable selection. Further research is needed to test for systematic biases with the selection method of bam() in a discrete time-to-event data structure (and possibly in general data structures).

Figure 6

Mean estimated effects together with the 2.5% and 97.5% quantiles of the baseline hazard f₀ with 500 000 individuals for BAM. Q2.5/Q97.5 indicate the 2.5% and 97.5% quantiles.

Regarding variable selection, the left panel of Figure 7 shows the mean selection frequency of the noise variables for different numbers of individuals. Surprisingly, the performance of all methods is relatively unaffected by the number of individuals. The GLM method performs worst with a mean selection frequency of 9.0%–13.1%. BAM performs considerably better with selection rates between 1.0% and 2.1%. The best selection performance is achieved using BBFIT with perfect selection (0%) for 5 000 or more individuals.

Figure 7

Mean selection frequency of the six noise variables for different numbers of individuals on the left. Median estimation time in minutes for different numbers of individuals based on 25 replications run on a ‘standard’ PC (see Supplementary Materials C) on the right.

In terms of computation time, the right panel of Figure 7 shows the median estimation time for different numbers of individuals based on 25 replications on a ‘standard’ PC (see Supplementary Materials C). For up to 100 000 individuals, the median estimation time ranges from a few seconds to around 4.5 minutes and is comparable for the three methods. For very large sample sizes of 500 000 or even 1 000 000 individuals, BBFIT clearly outperforms BAM, with about half the median estimation time. While the estimation times for GLM and BBFIT are relatively stable, BAM shows some large outliers. For 500 000 individuals, for example, four of the 25 replications have estimation times of about 250 minutes, while the remaining replications take about 13 minutes. For these exceptionally long estimation times, a warning is issued that the algorithm did not converge. However, the estimated effects for these replications show no qualitative differences compared to those with normal estimation times and no warnings.

4 Application

Infant mortality rates, that is, the probability of a newborn child dying before reaching the age of one, remain high in sub-Saharan Africa. In 2023, the rate was 4.4% in this region, corresponding to approximately 1.8 million infant deaths—about 51% of the global total (UNIGME, 2025). Infant mortality is a complex issue influenced by various factors such as limited healthcare access, poor nutrition and environmental conditions. As a result, infant mortality rates are key indicators of broader human development, a priority highlighted in Sustainable Development Goal 3. We model infant mortality in 10 eastern sub-Saharan African countries using a time-to-event model with a structured additive predictor.

4.1 Data

The Demographic and Health Surveys (DHS, ICF, 2004–2017) serve as the primary data source and are merged with remotely sensed data on climate, demography and environmental factors. Seiler et al. (2025) provide a detailed overview of the data and details on the preprocessing steps (see Section Data and Supplementary Information). We use a subset of the data on individual children from Burundi, Ethiopia, Kenya, Malawi, Mozambique, Rwanda, the United Republic of Tanzania, Uganda, Zambia and Zimbabwe. Further data cleaning excluded surveys conducted before 2000 because of limited geo-referenced data availability, records with missing covariates, and children born more than five years before the survey; the last restriction ensures accurate mortality measurement. The resulting dataset contains 351 705 individual children, of which about 4.5% died in the first year of life. Data augmentation substantially increases the dataset to around 3.7 million rows, which are ultimately used for estimation. All basic information for modelling mortality is recorded for these children, which includes an indicator of whether the child is still alive at the time of observation (dead), as well as information on the age on a monthly basis (age). In addition, the dataset contains a total of 25 potential explanatory variables, which are listed and briefly described in Table 2. The references and sources of the variables can be found in Supplementary Materials B.1.
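The data augmentation step that turns one record per child into several million rows can be sketched as follows. This is a minimal illustration, assuming one row per child and month at risk, with months counted from 0 and the observation window capped at month 11; the function name `augment`, the dictionary keys and the toy records are hypothetical and are not taken from the authors' code.

```python
def augment(children, max_t=11):
    """Expand one record per child into one row per child and month at risk.

    children: list of dicts with hypothetical keys
      'id'   : child identifier
      'age'  : last observed month (0-based)
      'dead' : 1 if the child died in that month, 0 if censored
    """
    rows = []
    for child in children:
        last = min(child["age"], max_t)
        # an event counts only if it falls inside the observation window
        event = child["dead"] if child["age"] <= max_t else 0
        for t in range(last + 1):
            # the binary response is 0 in every month except possibly the last
            rows.append({"id": child["id"], "age": t,
                         "dead": event if t == last else 0})
    return rows

# Toy records (made up for illustration)
children = [
    {"id": 1, "age": 2, "dead": 1},  # died in month 2 -> 3 rows
    {"id": 2, "age": 4, "dead": 0},  # censored in month 4 -> 5 rows
]
augmented = augment(children)
```

Because each child contributes up to 12 rows, the augmented dataset is roughly an order of magnitude larger than the original one, which is what makes scalable estimation relevant here.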

Table 2

Response and covariates included in the full model of the selection step.

Variable	Unit	Type	Description
Response
dead	1 if ‘dead’;0 if ‘alive’	Binary	Death/living status at the day of the interview
Continuous or quasi-continuous covariates
age	Months	Metric	Age of child
ai	Index	Continuous	Asset index of the household
altitude	Meters	Continuous	Elevation in meters above sea level
bord	Count	Metric	Birth order within household
l_distance	Kilometers	Continuous	Log of distance to closest body of water
hhs	Count	Metric	Household size
higheduyear	Years	Metric	Highest completed year of schooling
magebirth	Years	Metric	Age of mother at birth
minc12	Prevalence	Continuous	Malaria incidence
ndvi12	Index	Continuous	Normalized difference vegetation index
pre12	Meters	Continuous	Precipitation
rgdp	US$	Continuous	Real GDP of the country
t2m12	Kelvin	Continuous	Two meter surface temperature
l_ttcity	Hours	Continuous	Log of travel time (TT) to city
l_ttmotor	Hours	Continuous	Log of TT to healthcare facility by motorized vehicle
l_ttwalk	Hours	Continuous	Log of TT to healthcare facility by foot
lon; lat	Degree	Continuous	Longitude and latitude coordinates
Discrete covariates
gender	‘female’; ‘male’	Binary	Sex of the child
d_conf25; d_conf50; d_conf100	‘yes’ if x ≥ 5; ‘no’ if x < 5	Binary	Indicator of reported conflicts in the past within a buffer of 25/50/100 kilometers
d_nl20_12	‘yes’ if x ≥ 5;‘no’ if x < 5	Binary	Indicator of observed night-time light digital number values
s_lc12	Classification	Categorical	Land-cover classification
s_soil	Classification	Categorical	Soil type classification

4.2 Model

The probability of death in month t = 0, …, 11 is modelled with a discrete time-to-event model and the additive predictor defined as $\eta = f_{0}(\text{age}) + f_{1}(\text{ai}) + f_{2}(\text{altitude}) + \dots + f_{24}(\text{s\_soil})$, where each of f₀, …, f₂₄ can be a spline, a spatial or a random effect (see Table 2 for variable descriptions). As in the simulation study, the logit function is employed as the link h(·). The model is estimated using BBFIT, following the same specification approach as in the simulation study (which includes both boosting and resampling steps) but with an increased number of iterations. Specifically, L = 1 000 iterations are used in the boosting step to select the most important covariates and L = 400 iterations are used in the resampling step, with the first 200 discarded as burn-in.
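For reference, the relation between the predictor and the quantities interpreted in Section 4.3 can be written out. The following uses the standard discrete-hazard formulation with a logit link; the symbols λ and S, and the convention that S(t) denotes survival beyond month t, are assumptions here and may differ from the notation of Section 2.

```latex
\lambda(t \mid x) = P(T = t \mid T \ge t, x) = \frac{\exp(\eta_t)}{1 + \exp(\eta_t)},
\qquad
S(t \mid x) = P(T > t \mid x) = \prod_{s=0}^{t} \bigl(1 - \lambda(s \mid x)\bigr),
\quad t = 0, \dots, 11 .
```

Here η_t denotes the additive predictor evaluated at age = t with the remaining covariates x held fixed.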

4.3 Results

Figure 8 displays the relative updating frequencies of the boosting step in the left panel and the log-likelihood contribution in the right panel. The relative updating frequency of a model term is the number of iterations in which that term yields the best improvement of the out-of-sample log-likelihood, divided by the total number of iterations. We find that the age of the child (age), the household size (hhs), the age of the mother at birth (magebirth) and the interview year (iyear) are the most important explanatory variables in terms of updating frequency. The age of the child appears to have by far the greatest importance based on this metric. While these four variables stand out in terms of importance, the others contribute far less and are therefore not examined further here.
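As a small illustration of this importance metric, the sketch below computes relative updating frequencies from a record of which term was updated in each boosting iteration. It assumes exactly one term is recorded per iteration; the input format, the function name and the toy history are hypothetical and do not reproduce output of the authors' implementation.

```python
from collections import Counter

def updating_frequencies(selected_terms):
    """Relative updating frequency per model term.

    selected_terms: one entry per boosting iteration, naming the term that
    gave the best out-of-sample log-likelihood improvement in that iteration
    (hypothetical input format).
    """
    total = len(selected_terms)
    counts = Counter(selected_terms)
    # terms that were never updated simply do not appear in the result
    return {term: n / total for term, n in counts.most_common()}

# Toy history of 10 iterations (made-up values, not results from the article)
history = ["age", "age", "hhs", "age", "magebirth",
           "age", "iyear", "hhs", "age", "age"]
freq = updating_frequencies(history)
```

Terms that are never updated are absent from the result, which corresponds to their omission from the frequency plot.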

We illustrate our approach for presenting and interpreting results in a survival context with the covariate magebirth, one of the most important predictors in our analysis. Figure 9 shows summary statistics as well as visualizations of the distribution in the top panels. The left panel shows a boxplot and a histogram; the right panel provides relative frequencies of observed deaths. We highlight the sparsely populated age group 46–48, with fewer than 500 observations, in red to indicate that regression results in this range must be interpreted with care. The middle panel in Figure 9 displays the estimated effect f₈(magebirth) centred at zero. The bottom panels show the estimated marginal survival probabilities for magebirth, which also depend on the child’s age. To visualize this two-dimensional function, we plot the survival probabilities as a function of either magebirth or age while fixing the remaining covariates at the mean (continuous variables) or at the mode (discrete variables). The bottom left panel shows marginal survival probabilities for children up to an age of one month (age = 0), half a year (age = 5) and one year (age = 11) as a function of the mother’s age at birth. The bottom right panel shows estimated marginal survival probabilities over the child’s age for three selected maternal ages: 16 (2.5% quantile), 25 (median) and 41 (97.5% quantile) years.
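The construction of such marginal survival curves can be sketched in a few lines. The example assumes the standard product formula for discrete survival with a logit link, and it uses made-up effect functions (`f_age`, `f_magebirth`) and a constant for the remaining covariates; none of these values are estimates from the article.

```python
import numpy as np

def hazard(eta):
    """Inverse logit link: discrete hazard from the additive predictor."""
    return 1.0 / (1.0 + np.exp(-eta))

def survival_curve(eta_by_month):
    """Probability of surviving up to and including each month:
    S(t) = prod_{s <= t} (1 - hazard(s))."""
    eta = np.asarray(eta_by_month, dtype=float)
    return np.cumprod(1.0 - hazard(eta))

# Hypothetical effects, chosen only for illustration
months = np.arange(12)
f_age = -4.0 - 0.8 * np.log1p(months)   # baseline effect, decreasing in age

def f_magebirth(m):
    return 0.03 * (m - 25.0)            # maternal-age effect

other_terms = 0.0                       # remaining covariates fixed at mean/mode

# One survival curve over the child's age per selected maternal age
curves = {m: survival_curve(f_age + f_magebirth(m) + other_terms)
          for m in (16, 25, 41)}
```

Reading `curves[m][11]` across maternal ages gives the first-year survival probability as a function of magebirth, while each full curve shows survival over the child's age for a fixed maternal age.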

Figure 8

Updating frequencies based on the boosting step on the left and log-likelihood contribution plot on the right. Note that variables that have never been updated are omitted. See Table 2 for variable descriptions.

Figure 9 provides the following interpretations: (i) Mother’s age at birth varies between a minimum of 14 and a maximum of 48 years, with a median age of 25. The frequency of observations declines with increasing age, with only 356 observations in the age group 46–48. As already mentioned, this age group is highlighted in red in the plots to prevent overinterpretation of results where few individuals are observed. (ii) The top right panel already suggests that children of older mothers are more likely to die than children of younger mothers. This is confirmed by the estimated effect f₈(magebirth) in the middle panel, which shows a near-linear increase in mortality risk. (iii) The estimated marginal survival probabilities in the bottom panels provide further insight into the actual size of the effect. Survival probabilities are close to 100% for younger maternal ages and decline notably with increasing maternal age, particularly after age 35 (bottom left panel). The downward trend is more pronounced for older children, indicating that the impact of maternal age accumulates over time. Children born to younger mothers consistently show higher marginal survival probabilities across all ages (bottom right panel), while children born to older mothers experience lower survival probabilities, with the gap widening for older children. Note that these insights are not visible from the pure effect plot in the middle panel of Figure 9.

Detailed analyses similar to those for magebirth are provided for all covariates in Supplementary Materials B.2 and B.3. To keep the article at a reasonable length, we continue in Figure 10 with a brief summary and interpretation of the estimated effects (left panels) and the marginal probabilities of survival (right panels) for the other most important variables identified above. For better comparison, the plots are scaled identically to the corresponding plots for magebirth in Figure 9.

Figure 9

Summary statistics and the distribution of magebirth in the top left panel and relative frequency of events (deaths) in the augmented data in the top right panel. Estimated effect of this covariate centred at zero in the middle panel. Estimated marginal survival probabilities for different child ages across the age of the mother at birth are in the bottom left, and estimated marginal survival probabilities over child age for different maternal ages at birth are in the bottom right panel.

The estimated effect of age (baseline hazard) decreases roughly logarithmically over time. This indicates that the probability of dying in the first months of life is higher than in later months. The estimated effect of hhs follows a U-shape. The probability of dying is lower in larger households of about 5–15 members. In contrast, the probability of dying is higher in small households with fewer members and very large households with more than 15 members. A possible interpretation is that smaller households undergo a learning curve or have limited childcare options, while larger households may lack sufficient attention and resources for each individual. The estimated effect of iyear declines roughly linearly, suggesting an encouraging trend toward lower mortality over time.

Figure 10

Estimated effects of the covariates age (baseline hazard), hhs, and iyear centred at zero on the left. Estimated marginal survival probabilities in dependence of the child age for hhs and iyear on the right.

The estimated survival probabilities in the right panels of Figure 10 offer a clearer understanding of the true size of the effects. For example, children in small households (fewer than 5 members) have a substantially lower first-year survival probability, about 90%–95%, compared with children in larger households (8–10 members), where the probability is around 97%. For households with more than 15 members, the marginal probability of surviving the first year decreases again to around 95%. Similar interpretations are possible for the other explanatory variables.

5 Conclusion

This article introduces an efficient estimation framework for discrete time-to-event models with additive predictors by extending the recently proposed batchwise backfitting algorithm. Our contribution lies in combining scalable estimation with simultaneous automated variable selection in large-scale, high-dimensional survival settings—addressing key limitations of existing approaches.

Through a comprehensive simulation study, we demonstrate that the proposed method achieves high estimation accuracy, excellent variable selection performance and significantly reduced computation times, even with datasets comprising up to 10 million rows. Compared to benchmark methods such as GLM and BAM, BBFIT shows superior robustness and efficiency, especially in large-scale contexts. In particular, BBFIT consistently avoids the selection of uninformative covariates, while BAM and GLM have a non-negligible false selection rate. The application to modelling infant mortality further illustrates the practical usefulness of our approach. Despite the complexity of the data—both in terms of sample size and number of covariates—BBFIT identifies a small subset of influential factors and provides interpretable smooth effect estimates.

While the proposed framework is a powerful tool for large-scale, high-dimensional discrete time-to-event modelling, several directions for future work remain. For instance, exploring alternative model structures, such as piecewise exponential models or Box–Cox-transformed hazard models, could broaden its applicability beyond the current link and distributional assumptions. From a computational point of view, future work may focus on extending the algorithm towards a fully Bayesian framework to better account for model uncertainty. Finally, exploiting the specific structure of the likelihood, where all response entries of an individual are zero except possibly the last, could lead to additional algorithmic efficiencies.


Acknowledgments

The computational results presented have been achieved (in part) using the HPC infrastructure LEO of the University of Innsbruck. We thank ICF International, Inc. and USAID for providing public access to the DHS data (ICF, 2004-2017).

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship and/or publication of this article.

Funding

The authors disclosed receipt of the following financial support for the research, authorship and/or publication of this article: This research was funded in whole or in part by the Austrian Science Fund (FWF) (doi:10.55776/P33941). For open access purposes, the author has applied a CC BY public copyright licence to any author-accepted manuscript version arising from this submission.

ORCID iDs

Benjamin Müller

Nikolaus Umlauf

Johannes Seiler

Kenneth Harttgen

Stefan Lang


References

Allison (1982) Discrete-time methods for the analysis of event histories. Sociological Methodology, 13, 61–98. doi:10.2307/270718.

Belitz and Lang (2008) Simultaneous selection of variables and smoothing parameters in structured additive regression models. Computational Statistics and Data Analysis, 53, 61–81. doi:10.1016/j.csda.2008.05.032.

Berger and Schmid (2018) Semiparametric regression for discrete time-to-event data. Statistical Modelling, 18, 322–45. doi:10.1177/1471082X17748084.

Brezger and Lang (2006) Generalized structured additive regression based on Bayesian P-splines. Computational Statistics and Data Analysis, 50, 967–91. doi:10.1016/j.csda.2004.10.011.

Burstein et al. (2019) Mapping 123 million neonatal, infant and child deaths between 2000 and 2017. Nature, 574, 353–58. doi:10.1038/s41586-019-1545-0.

Carollo, Putter, Eilers and Gampe (2025) Competing risks models with two time scales. Statistical Methods in Medical Research, 34, 2145–62. doi:10.1177/09622802251367443.

Christodoulou (2011) Water network assessment and reliability analysis by use of survival analysis. Water Resources Management, 25, 1229–38. doi:10.1007/s11269-010-9679-8.

Efron (1988) Logistic regression, survival analysis, and the Kaplan-Meier curve. Journal of the American Statistical Association, 83, 414–25. doi:10.1080/01621459.1988.10478612.

Eilers PHC and Marx (1996) Flexible smoothing with B-splines and penalties. Statistical Science, 11, 89–121. doi:10.1214/ss/1038425655.

Fahrmeir and Lang (2001) Bayesian inference for generalized additive mixed models based on Markov random field priors. Journal of the Royal Statistical Society Series C: Applied Statistics, 50, 201–20. doi:10.1111/1467-9876.00229.

Fahrmeir and Wagenpfeil (1996) Smoothing hazard functions and time-varying effects in discrete duration and competing risks models. Journal of the American Statistical Association, 91, 1584–94. doi:10.1080/01621459.1996.10476726.

Fahrmeir, Kneib and Lang (2004) Penalized structured additive regression for space-time data: A Bayesian perspective. Statistica Sinica, 14, 731–61. URL https://www.jstor.org/stable/24307414

Fahrmeir, Kneib, Lang and Marx (2022) Regression: Models, Methods and Applications. 2nd edition. Springer. doi:10.1007/978-3-662-63882-8.

Hastie and Tibshirani (1986) Generalized additive models. Statistical Science, 1, 297–310. doi:10.1214/ss/1177013604.

ICF (2004–17) Demographic and Health Surveys (Various) [Datasets]. Rockville, Maryland, USA: ICF [Distributor]. Funded by USAID.

Lang, Umlauf, Wechselberger, Harttgen and Kneib (2014) Multilevel structured additive regression. Statistics and Computing, 24, 223–38. doi:10.1007/s11222-012-9366-0.

Musick and Michelmore (2015) Change in the stability of marital and cohabiting unions following the birth of a child. Demography, 52, 1463–85. doi:10.1007/s13524-015-0425-y.

Neal (2003) Slice sampling. The Annals of Statistics, 31, 705–67. doi:10.1214/aos/1056562461.

Puth M-T, Tutz, Heim, Münster, Schmid and Berger (2020) Tree-based modeling of time-varying coefficients in discrete time-to-event models. Lifetime Data Analysis, 26, 545–72. doi:10.1007/s10985-019-09489-7.

Schmid, Küchenhoff, Hoerauf and Tutz (2016) A survival tree method for the analysis of discrete event times in clinical and epidemiological studies. Statistics in Medicine, 35, 734–51. doi:10.1002/sim.6729.

Seiler, Wetscher, Harttgen, Utzinger and Umlauf (2025) High-resolution spatial prediction of anemia risk among children aged 6 to 59 months in low- and middle-income countries. Communications Medicine, 5, 57. doi:10.1038/s43856-025-00765-2.

Spuck, Schmid, Heim, Klarmann-Schulz, Hörauf and Berger (2023) Flexible tree-structured regression models for discrete event times. Statistics and Computing, 33, 20. doi:10.1007/s11222-022-10196-x.

Tutz and Binder (2004) Flexible modelling of discrete failure time including time-varying smooth effects. Statistics in Medicine, 23, 2445–61. doi:10.1002/sim.1824.

Tutz and Schmid (2016) Modeling Discrete Time-to-Event Data. Springer Series in Statistics. Cham: Springer. doi:10.1007/978-3-319-28158-2.

Umlauf, Klein and Zeileis (2018) BAMLSS: Bayesian additive models for location, scale, and shape (and beyond). Journal of Computational and Graphical Statistics, 27, 612–27. doi:10.1080/10618600.2017.1407325.

Umlauf, Klein, Simon and Zeileis (2021) bamlss: A lego toolbox for flexible Bayesian regression (and beyond). Journal of Statistical Software, 100, 1–53. doi:10.18637/jss.v100.i04.

Umlauf, Seiler, Wetscher, Simon, Lang and Klein (2024) Scalable estimation for structured additive distributional regression. Journal of Computational and Graphical Statistics, 34, 601–17. doi:10.1080/10618600.2024.2388604.

UNIGME (2025) Levels and Trends in Child Mortality: Report 2024 – Estimates Developed by the United Nations Inter-agency Group for Child Mortality Estimation. New York, USA: United Nations Children’s Fund. URL https://data.unicef.org/resources/levels-and-trends-in-child-mortality-2024/

Venables and Ripley (2002) Modern Applied Statistics with S. Statistics and Computing. Fourth edition. New York, NY: Springer. doi:10.1007/978-0-387-21706-2.

Wood (2003) Thin plate regression splines. Journal of the Royal Statistical Society Series B: Statistical Methodology, 65, 95–114. doi:10.1111/1467-9868.00374.

Wood (2004) Stable and efficient multiple smoothing parameter estimation for generalized additive models. Journal of the American Statistical Association, 99, 673–86. doi:10.1198/016214504000000980.

Wood (2006) Low-rank scale-invariant tensor product smooths for generalized additive mixed models. Biometrics, 62, 1025–36. doi:10.1111/j.1541-0420.2006.00574.x.

Wood (2011) Fast stable restricted maximum likelihood and marginal likelihood estimation of semiparametric generalized linear models. Journal of the Royal Statistical Society Series B: Statistical Methodology, 73, 3–36. doi:10.1111/j.1467-9868.2010.00749.x.

Wood, Goude and Shaw (2015) Generalized additive models for large data sets. Journal of the Royal Statistical Society Series C: Applied Statistics, 64, 139–55. doi:10.1111/rssc.12068.

Wood, Pya and Säfken (2016) Smoothing parameter and model selection for general smooth models. Journal of the American Statistical Association, 111, 1548–63. doi:10.1080/01621459.2016.1180986.

Wood, Li, Shaddick and Augustin (2017) Generalized additive models for gigadata: Modelling the UK black smoke network daily data. Journal of the American Statistical Association, 112, 1199–1210. doi:10.1080/01621459.2016.1195744.

