Clinical prediction models are developed to estimate a patient’s risk of a specific outcome, and machine learning is frequently employed to improve prediction accuracy. When the outcome is an event that occurs over time, binary classifiers can predict the risk at specific time points, provided right-censoring is addressed by inverse-probability-of-censoring weighting (IPCW). Assessing prediction uncertainty is crucial for interpreting individual risks, but little is known about how to account for IPCW when estimating this uncertainty. We propose an adjustment of the infinitesimal jackknife estimator for the standard error of predictions that incorporates IPCW. Owing to its nonparametric nature, it is broadly applicable, in particular to machine learning classifiers. For a simple, analytically tractable example, we show that the proposed adjustment yields unbiased standard error estimates. For other situations, we evaluate its performance through simulation studies, covering both parametric models with an IPCW-customized log-likelihood and machine learning with an IPCW-customized loss function. We illustrate the methods by predicting post-transplant survival probabilities using national kidney transplant registry data. Our findings show that the proposed estimator is useful for quantifying the prediction uncertainty of IPCW classifiers. Applications to simulated and real data show that prediction uncertainty increases when binary classifiers are applied to dichotomized data compared to predictions from survival models.
There is significant interest in machine learning for predicting a patient’s long-term outcome after a medical intervention. An example motivating this paper is patient and graft survival after organ transplantation.1–3 Event time data are subject to censoring when patients enter the study cohort at different times, resulting in varying lengths of follow-up, and/or when patients prematurely leave the study cohort and become lost to follow-up. Numerous survival analysis methods address censored event time data, and many of them model the hazard function.
In the field of machine learning, it is more common to predict the risk or survival probability at a specific time point, denoted as . Instead of relying on survival methodology, which involves modeling hazards, binary classifiers are often used for this purpose.4–7 These classifiers dichotomize the event time outcome into a binary variable indicating whether a subject has survived up to time or not. However, the dichotomized outcome remains unknown for subjects that are lost to follow-up both before time and before experiencing the event of interest. Reps et al.8 and Kvamme and Borgan9 have demonstrated that treating these censored observations as missing data in binary classifiers performs poorly in terms of discrimination and calibration. To address this issue, an alternative approach has been proposed:10–12 weighting the observations by their inverse-probability-of-censoring (IPC) and applying the classification algorithm to the IPC-weighted observations.
Predictions obtained from IPC-weighted binary classifiers have been demonstrated to be asymptotically unbiased using concepts of maximum likelihood10,12 or statistical functionals,11,13 and some applications show promising results.11,14 However, existing research has primarily focused on bias, neglecting the investigation of prediction uncertainty under IPC-weighted classification. Prediction uncertainty arises from using a random sample from the target population. Binary classifiers derive their results from dichotomized sample data and thus do not use the full available information on event times. Consequently, IPC-weighted classification may introduce higher uncertainty in predicted survival probabilities compared to survival methodologies, such as (semi-)parametric survival regression, random survival forests15 or hazard-based neural networks.16
However, how to account for the IPC weights when estimating the prediction uncertainty of inverse-probability-of-censoring-weighting (IPCW) classifiers is, to the best of our knowledge, unknown. Available methods are rare and focus on selected parametric models.12 Targeting specific parametric models when estimating the uncertainty of predictions is common17,18 but limits applicability, in particular for machine learning. In this article, we will therefore first adjust the nonparametric infinitesimal jackknife estimator for the standard error of predictions to accommodate the IPC weights. Afterwards, this estimator will be employed to compare the prediction uncertainty of survival probabilities derived from IPC-weighted classifiers with that of hazard-based algorithms.
Our investigation complements existing research on the prediction accuracy of IPC-weighted binary classifiers with censored data by an investigation of prediction uncertainty. It will help to balance their potential disadvantage, namely using less of the information in the data, against their potential benefits, which can be summarized as follows: First, classification tasks are common in machine learning and artificial intelligence, and thus novel algorithms often emerge first for classification (and regression) before adaptations to event time data are developed. For the same reason, widely used machine learning and deep learning libraries, such as scikit-learn and Keras, as well as automated machine learning (AutoML) solutions like H2O AutoML and SageMaker Autopilot/Canvas, integrate models and metrics for classification and regression tasks, but not for event time data. Second, the data structure itself can motivate a binary classifier: for example, when events are recorded only annually (as in some registries or healthcare databases), estimating annual survival probabilities using binary classifiers can be reasonable. Third, binary classifiers can be appealing for their simplicity: when the interest lies in a single time point (e.g. annual survival probabilities), binary classifiers provide results that are easier to communicate than, for example, hazard ratios, which may not always be the most intuitive measure of effect size.12,19
As an example, the proposed methods are used to quantify the prediction uncertainty of machine learning classifiers when predicting the individual patient risks of death or graft failure after kidney transplantation. Kidney transplantation can be considered the treatment of choice for patients with end-stage kidney disease, due to its potential for improved quality of life and longer life expectancy. However, the demand for donor kidneys far exceeds the organs available for donation, leading to important questions, for example, about the patients’ post-transplant prognosis.
The article is structured as follows: In Section 2, the infinitesimal jackknife estimator with an adjustment for IPCW is proposed, followed by an evaluation of its performance through simulation studies in Section 3. In Section 4, we apply the proposed estimator to assess prediction uncertainty when analyzing data from the German Organ Transplant Registry. Finally, Section 5 concludes with a discussion.
Methods
We first introduce some notation needed throughout this article: For individuals, realizations of the random variables are observed that are considered as independent identically distributed copies of . Thereby, denotes the potentially right-censored event time with and defining the time to event and the time to censoring, respectively. The symbol is used to define the minimum throughout the article. The random variable , with defining the indicator function, informs whether defines an event or censoring time. The random baseline covariate vector describes subject characteristics that might affect the survival probabilities over time. Thus, for a single observation , defines the p-dimensional vector of covariate values, defines the time to event or censoring and defines the indicator whether an event has been observed at time . We assume that and are independent conditional on .
We are now interested in predicting the probability to survive a certain time point for subject . This probability is defined as follows:
The random variable represents the dichotomized outcome that a binary classifier could use as the dependent variable for predicting . When there is no censoring, is observable for all subjects and we can use any prediction model for classification tasks to estimate the prognosis . However, in the presence of censoring, for observations where , the information about surviving and thus about is missing. Ad hoc approaches that build a classification model using only observations with available information on will yield biased predictions of .8,9 This bias arises because the distribution of conditional on is obviously not the same as the distribution of , and the same holds for .
Weighting observations by their IPC has been proposed to prevent bias10–12 and is investigated in the present article. The IPC weights are defined as follows:
Assigning the normalized weights to the observations of the learning sample will assign weight zero to observations with unknown . These discarded observations will then be represented by the observations to which greater weights were assigned. The use of IPC weights in classification algorithms can give unbiased predictions as shown in the following section.
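In practice, the IPC weights are typically obtained by estimating the censoring survival function with a Kaplan-Meier estimator in which censoring is treated as the event of interest. The paper’s accompanying implementation is an R package; the following Python sketch is only an illustration of this standard construction (function names and the simplified tie handling are ours, not the authors’):

```python
import numpy as np

def km_censoring_survival(times, events, t):
    """Kaplan-Meier estimate of the censoring survival function G(t-),
    treating censoring (events == 0) as the event of interest.
    Ties are handled crudely; this is an illustrative sketch only."""
    order = np.argsort(times)
    times, events = times[order], events[order]
    n = len(times)
    g = 1.0
    for j in range(n):
        if times[j] >= t:          # left limit: only censorings strictly before t
            break
        if events[j] == 0:         # a censoring "event"
            at_risk = n - j
            g *= (at_risk - 1) / at_risk
    return g

def ipc_weights(times, events, t0):
    """Unnormalized IPC weights for predicting survival past t0.
    Observations censored before t0 (dichotomized status unknown) get weight 0."""
    w = np.zeros(len(times))
    for i, (ti, di) in enumerate(zip(times, events)):
        if ti <= t0 and di == 1:   # event observed by t0: outcome known
            w[i] = 1.0 / km_censoring_survival(times, events, ti)
        elif ti > t0:              # follow-up exceeds t0: outcome known
            w[i] = 1.0 / km_censoring_survival(times, events, t0)
        # censored before t0: weight stays 0
    return w
```

The resulting weights can then be normalized (e.g. to sum to the sample size) before being passed to a classifier, matching the normalized weights described above.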
Accuracy of predictions under IPC-weighted classification
First, we will confirm that any binary classifier that employs the IPC-weighted binary log loss function yields unbiased predictions of . This, for example, refers to generalized linear models with logit link function where maximizing the log-likelihood with IPC-weighted contributions is equivalent to minimizing the weighted binary log loss function. Another example is gradient boosting or neural networks with customized weighted binary log loss function. To demonstrate this, consider a binary classifier that seeks to find a prediction model that minimizes the expected IPC-weighted binary log loss function, . Thereby, the expected value is considered as conditional on the covariates and refers to the distribution of new data that are independent of the training sample. For simplicity, we assume that is a known function in , although in practice it needs to be estimated. The proof builds upon ideas from Kvamme and Borgan9 for calculating the expected value of IPCW-Brier scores. It can complement other theoretical justifications of IPC-weighted classification that rely on maximum likelihood theory10 or on classification considered as functionals in the data distribution.11,13
This is the average cross-entropy across all data samples and thus reaches its minimum at according to Gibbs’ inequality. Consequently, the classifier that minimizes the weighted loss is expected to predict , resulting in unbiased predictions. After having confirmed that IPC-weighted classification can yield accurate predictions, we will now investigate the uncertainty of predictions. An available method for standard error estimation of predictions derived from IPC-weighted binary classifiers refers to generalized linear models,12 which limits its generalizability. Consequently, our next step is to investigate model-free approaches for estimating standard errors in the context of IPC-weighted classification.
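The IPC-weighted binary log loss discussed above is straightforward to write down explicitly. The following Python sketch (our illustration, not the paper’s code) shows the loss in the form that could be passed, for example, as a custom objective to gradient boosting or neural network libraries; in most libraries the same effect is achieved by passing the IPC weights as per-observation sample weights to the standard log loss:

```python
import numpy as np

def ipcw_log_loss(y, p, w, eps=1e-12):
    """IPC-weighted binary cross-entropy.
    y: dichotomized outcomes (1 = survived past t0), p: predicted
    survival probabilities, w: IPC weights (0 for observations
    censored before t0, whose outcome is unknown)."""
    p = np.clip(p, eps, 1 - eps)   # guard against log(0)
    return -np.mean(w * (y * np.log(p) + (1 - y) * np.log(1 - p)))
```

Observations with weight zero drop out of the sum, so the unknown dichotomized outcomes never influence the fitted model.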
Adjusting the jackknife standard error estimate to IPC weights
We start with the infinitesimal jackknife estimator for the standard error of predictions. By its nonparametric nature, jackknife estimators are not restricted to any particular classification algorithm. The infinitesimal jackknife standard error estimator is based on the concept of influence functions.20 To consider the IPCW within this concept, we introduce an adjustment as follows:
To simplify notation, we will omit the subject index when denoting the probability to survive time in the following and we will use the indices to indicate jackknife replications instead. Let be the (unknown) distribution where the observations are sampled from. The probability to survive the time point is assumed to be some functional of the distribution , . The estimator of is derived from the empirical distribution , thus . In general, the influence function of under distribution is defined as follows:
with being the distribution with all its mass on . The influence function can be used for a standard error estimate of .21 With being the influence function for some random observation sampled from , the variance of is then estimated by
In the unweighted situation, the sample weights of the empirical distribution are for each observation . In the weighted situation, observations are differentially weighted by the IPC weights . We will, therefore, adjust the variance estimator (2) by replacing with . Furthermore, in the unweighted situation the assumption is commonly used with being the estimate of when omitting observation from the sample. This results in the jackknife variance estimate . However, this association might no longer be true in the weighted situation, in particular as the influence of an observation might depend on its weight . We will illustrate this for an example in Section 2.3. We instead assume the following weighted association between influence components and jackknife replications for observations with :
We will show for the analytically tractable example in Section 2.3 that this assumption holds true. Equation (3) has been shown also for more complex scenarios.22 Finally, these two adjustments result in the following adjusted infinitesimal jackknife standard error estimate of
Observations with are assigned weight zero in the first sum and, therefore, do not contribute to it. They are excluded from the second sum only so that is well defined; this is not required in the third sum because for observations with weight zero.
Relationship to delete-d jackknife estimators
The proposed estimator can be considered as a generalization of the delete-d jackknife variance estimate23 defined as
with defining a subset of size and defining the estimator of p that results when is left out from the training data. Thus, in contrast to , when estimating , not only 1 but observations are left out. In total, different left-out subsets of size d, , exist, and the resulting estimates are averaged with respect to their squared difference to .
In the weighted situation, we no longer consider but as the amount of information that is left out when estimating . Consequently, the term in (5) can be replaced by . Furthermore, we no longer average over delete-d estimators. Consequently, averaging over the squared differences can be replaced by the weighted mean of squared differences. Both replacements then result in the same estimator as proposed in (4) as follows:
An analytically tractable example
We will now substantiate our adjustment using the weighted mean as a simple, analytically tractable example. The weighted mean of the dichotomized outcomes is the simplest classifier for estimating , ignoring any covariate information during prediction. Consequently, it predicts the same probability for each subject. This prediction will be . In other words, , where is the empirical distribution of putting mass on each observed value . This is used as an estimate of with denoting the true distribution of . The influence components for an observation with will be
with and thus are an example where (3) instead of holds.
Now consider a simplified two-point discrete distribution for both time to event and time to censoring
Consider we are interested in with and want to estimate from an i.i.d. random sample of size using the IPC-weighted mean. Let be the number of observations that are censored at and thus receive a weight of 0. The other weights are estimated from the data and since for the normalized weight is assigned to the remaining m observations. W.l.o.g., let the index set refer to the observations that receive the non-zero weight . The weighted mean is then given by with well-known variance . The adjusted jackknife estimator of prediction uncertainty we have proposed in (4) then is
Thus, in this simplified scenario, the proposed estimator simplifies to the unbiased sample variance estimator of the mean of m i.i.d. random variables.
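The reduction just described can be checked numerically. The sketch below mimics the two-point setting in Python (the concrete time points and probabilities are elided in the text above, so the values here are illustrative choices of ours): censoring can occur only at a single point before the prediction horizon, so every observation with known dichotomized outcome receives the same normalized weight, the IPC-weighted mean becomes a simple mean over the m informative observations, and the adjusted jackknife estimate reduces to the usual variance estimator of a mean of m i.i.d. variables:

```python
import numpy as np

rng = np.random.default_rng(1)
n, t0, c = 400, 2.0, 1.0
# two-point event distribution straddling t0; censoring at c < t0 or far beyond
event_time = rng.choice([1.5, 3.0], size=n, p=[0.4, 0.6])
cens_time = rng.choice([c, 10.0], size=n, p=[0.3, 0.7])
time = np.minimum(event_time, cens_time)
delta = (event_time <= cens_time).astype(int)

# dichotomized status is known iff the event occurred by t0 or follow-up exceeds t0
known = ((delta == 1) & (time <= t0)) | (time > t0)
y = (time > t0).astype(float)
m = int(known.sum())

# equal normalized weights -> IPC-weighted mean = simple mean over known obs.
p_hat = y[known].mean()

# adjusted jackknife SE, which here reduces to the SE of a mean of m i.i.d. vars
se = np.sqrt(y[known].var(ddof=1) / m)
```

With binary outcomes this equals the square root of p̂(1 − p̂)/(m − 1), the unbiased analogue of the well-known variance of a proportion, and p̂ approaches the true survival probability (0.6 in this illustrative setup) as n grows.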
Comparison to the unadjusted jackknife and Greenwood’s standard error estimate
The unadjusted jackknife standard error estimate is a scaled version of the unbiased sample variance with scaling factor and thus approximates the adjusted estimator with increasing and . We can also compare the adjusted jackknife estimator with another well-known variance estimate in this simplified scenario. The IPC-weighted mean corresponds to the Kaplan-Meier estimate at , , which can be written as a weighted sum.24 Therefore, Greenwood’s variance estimate, , would be an alternative estimator that turns out to be a scaled version of the unbiased adjusted jackknife estimate:
with being the number of observed events before .
Using standard error estimates to calculate confidence intervals
Under certain distributional assumptions on the prediction estimate of interest, , standard error estimates can be used to construct an % confidence interval for the survival probability . If can be assumed to be (asymptotically) normally distributed, a Wald-type confidence interval can be derived as with . However, this approach can yield confidence intervals with boundaries outside the range , which is not reasonable for probabilities. To address this issue, an alternative approach transforms the boundaries of a confidence interval calculated on the logit scale,25 which ensures that the resulting interval lies within [0, 1]. This method still relies on the assumption of asymptotic normality of and uses the delta method to derive the confidence interval on the logit scale. The confidence interval is then given by the following equation:
where and are the lower and upper bounds of the % confidence interval for the logit of , respectively, which can be found by applying the delta method on the logit transformation of :
To inspect the asymptotic normality assumption of , a specific analysis model and data-generating process would have to be defined, which was not required for the nonparametric jackknife approach to estimating the standard error. Investigating the asymptotic behavior of is beyond the scope of the present article and has previously been done for the IPC-weighted maximum likelihood solution of a generalized log linear model.12 Instead, we will assess the coverage probability of the confidence interval given in (6) in a simulation study, both with a parametric log linear model and with a nonparametric gradient boosting approach.
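The logit-scale interval described above can be sketched in a few lines. The following Python function (ours, for illustration) uses the standard delta-method form, in which the standard error of the logit of a probability estimate is the standard error of the estimate divided by p̂(1 − p̂), and back-transforms the bounds so the interval always lies inside (0, 1):

```python
import math
from statistics import NormalDist

def logit_ci(p_hat, se, level=0.95):
    """Wald confidence interval for a survival probability computed on the
    logit scale (delta method) and back-transformed to the probability scale."""
    z = NormalDist().inv_cdf(1 - (1 - level) / 2)
    logit = math.log(p_hat / (1 - p_hat))
    # delta method: se of logit(p_hat) is se / (p_hat * (1 - p_hat))
    half = z * se / (p_hat * (1 - p_hat))
    expit = lambda x: 1 / (1 + math.exp(-x))
    return expit(logit - half), expit(logit + half)
```

Unlike the untransformed Wald interval, the back-transformed bounds cannot fall outside [0, 1], even for estimates close to the boundary.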
Simulation study
In situations where more complex classifiers are applied, the adjusted jackknife variance estimator may no longer be analytically tractable, and we therefore explore its properties through a simulation study following the ADEMP structure.26 Applied analysis methods are available as an R package on GitHub (see the Data availability section).
Simulation design
Objective:
We aim to investigate finite-sample properties of the predictions derived from IPC-weighted classification algorithms together with finite-sample properties of the proposed adjusted jackknife standard error estimator and the confidence intervals. In particular, we compare the predictions , their estimated standard error and the coverage probability of the derived % confidence interval for with the true probability to survive time , the empirical standard deviation of predictions and the nominal confidence level %, respectively. In the first step, we use a simple parametric generalized linear classification model (GLM), because for the GLM a model-based standard error estimator exists as a benchmark. In the second step, a gradient boosting classifier will be applied with a customized IPC-weighted binary log-loss function. Here, no benchmark exists for the standard error of predictions and the adjusted jackknife estimator will be compared to the empirical standard deviation of predictions across the simulation runs. For comparison, we will also apply a survival model that uses information on instead of the dichotomized IPC-weighted outcome.
Methods:
We evaluate event time scenarios with sample size . For each sample size, we examine three censoring distributions under which 5%, 25%, and 50% of the observations are expected to have a weight of 0, respectively. For each subject , the values of two binary covariates, and , are drawn, respectively, from independent -distributions. Time to event is generated from a log-logistic distribution with shape parameter and scale parameter , with the true values of the regression coefficients set to . Time to censoring is generated from an exponential distribution with parameter to ensure an average proportion of , , and of observations receiving weight zero, respectively. Applying a log-logistic distribution ensures that the dichotomized outcome of the independent realizations can be considered as a sample from a generalized linear model with logit link function. At the same time, follows a parametric survival model. Consequently, any differences in the prediction performance of the two modeling strategies cannot be attributed to one model violating its modeling assumptions.
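The data-generating process above can be sketched as follows. The paper generates data in R with flexsurv::rllogis; the Python sketch below is only an illustration of the same construction, and all concrete parameter values (shape, coefficients, censoring rate) are hypothetical placeholders, since the true values are not reproduced here. The log-logistic survival function satisfies logit S(t | x) = shape · (log scale(x) − log t), so a log-linear scale in the covariates makes the dichotomized outcome follow a logit-linear model, as stated above:

```python
import numpy as np

rng = np.random.default_rng(7)

def simulate(n, t0=1.0, shape=2.0, coefs=(0.3, 0.5, -0.4), cens_rate=0.5):
    """One simulated dataset: log-logistic event times whose scale depends
    log-linearly on two binary covariates, plus exponential censoring.
    All parameter values are illustrative, not the paper's."""
    x = rng.binomial(1, 0.5, size=(n, 2))
    log_scale = coefs[0] + x @ np.array(coefs[1:])
    # inverse-transform sampling: F(t) = (t/scale)^shape / (1 + (t/scale)^shape)
    u = rng.uniform(size=n)
    event_time = np.exp(log_scale) * (u / (1 - u)) ** (1 / shape)
    cens_time = rng.exponential(1 / cens_rate, size=n)
    time = np.minimum(event_time, cens_time)
    delta = (event_time <= cens_time).astype(int)
    y = (time > t0).astype(int)   # dichotomized outcome (known only if uncensored by t0)
    return x, time, delta, y
```

Varying the exponential censoring rate controls the expected proportion of observations that end up with IPC weight zero.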
For the simulation, we use the rllogis function of the R-package flexsurv (version 2.2.2) to generate replications for each combination of and . Each of these datasets is evaluated by the following three analysis methods:
A parametric log-logistic survival model fitted using the survreg function from the R package survival (version 3.5.5). Predicted probabilities are obtained as follows:
where is the maximum likelihood estimate of the model parameters. To simplify notation, we define the covariate vector to include a leading 1, that is, .
An IPC-weighted generalized linear model with logit link function fitted using the logitIPCW function of the R-package mets (version 1.3.3). Predicted probabilities are obtained as follows:
where minimizes the IPC-weighted loss (which is equivalent to maximizing the IPC-weighted log-likelihood):
In particular, can be considered as a functional of the empirical distribution putting mass on each observed value , as this distribution serves as the basis for the loss function.
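The IPCW-GLM in (b) minimizes the IPC-weighted binary log loss, which is equivalent to maximizing the IPC-weighted log-likelihood. The paper fits this model with mets::logitIPCW in R; as an illustration of the same optimization problem, the following Python sketch (ours, with plain gradient descent rather than the package’s estimation routine) fits a logistic model under per-observation weights:

```python
import numpy as np

def fit_ipcw_logit(X, y, w, lr=0.5, n_iter=2000):
    """Logistic regression minimizing the IPC-weighted binary log loss
    by gradient descent (illustrative, not mets::logitIPCW)."""
    Xd = np.column_stack([np.ones(len(X)), X])   # prepend intercept column
    beta = np.zeros(Xd.shape[1])
    for _ in range(n_iter):
        p = 1 / (1 + np.exp(-Xd @ beta))
        grad = Xd.T @ (w * (p - y)) / w.sum()    # gradient of the weighted loss
        beta -= lr * grad
    return beta

def predict_ipcw_logit(beta, X):
    Xd = np.column_stack([np.ones(len(X)), X])
    return 1 / (1 + np.exp(-Xd @ beta))
```

Observations with weight zero contribute nothing to the gradient, so the fit depends only on observations whose dichotomized outcome is known, with the remaining observations up-weighted through their IPC weights.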
An IPC-weighted gradient boosting approach fitted using the xgb.train function from the R-package xgboost (version 1.7.7). With this approach predictions cannot be expressed in closed form due to the nonparametric nature of the algorithm. However, as in (b), predictions are obtained by minimizing the IPC-weighted loss
without an explicit parametric form of . Nonetheless, can still be considered as a functional of the empirical distribution putting mass on each observed value , as this distribution still serves as the basis of the loss function.
For estimating the IPC weights that are passed to the classifiers in (b) and (c) we use a model-free Kaplan-Meier estimator. For each simulation run , we extract the prediction estimates from each model fit. These estimates are for a subject with and , and they are derived as a function of the regression coefficient estimates in (a) and (b) and as the predicted outcome in (c). Model-based standard error estimates of predictions can be derived in (a) and (b) from the standard error of regression coefficient estimates using the delta method. In addition, for all models the proposed adjusted jackknife standard error estimates are derived. Each standard error estimate is also used to compute a % confidence interval following the approach given in (6). Furthermore, we estimate each model’s IPC-weighted Brier score . Based on these data, we calculate the following measures of evaluation for each model in each scenario:
: Empirical mean of the predictions .
: Empirical standard deviation of .
: Empirical mean of the Brier scores .
: Empirical mean of the proposed model-free standard error estimates .
: Empirical mean of the model-based standard error estimates (available for (a) and (b) only).
: Empirical coverage probability of the CI, calculated by (6) and using the proposed model-free standard error estimates. Empirical coverage probability is defined as the proportion of simulations in which the CI contains the true .
: Empirical coverage probability of the CI, calculated by (6) and using the model-based standard error estimates (if available), defined in the same way.
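The evaluation measures listed above are simple empirical summaries across the simulation runs. A minimal Python sketch (ours, with hypothetical function and argument names) for one subject and one analysis method:

```python
import numpy as np

def evaluate(preds, ses, ci_los, ci_his, p_true):
    """Empirical summaries across simulation runs for one subject:
    mean prediction, empirical SD of predictions, mean SE estimate,
    and empirical CI coverage (share of runs whose CI contains p_true)."""
    preds, ses = np.asarray(preds), np.asarray(ses)
    return {
        "mean_pred": preds.mean(),
        "emp_sd": preds.std(ddof=1),
        "mean_se": ses.mean(),
        "coverage": np.mean((np.asarray(ci_los) <= p_true)
                            & (p_true <= np.asarray(ci_his))),
    }
```

Comparing "emp_sd" with "mean_se" then quantifies the bias of a standard error estimator, and "coverage" is compared against the nominal confidence level.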
Simulation results
As an example, we show the results of the prediction estimates and for a subject with and , respectively. The true probability of surviving time is and , respectively. The mean predictions over the simulation runs are provided in Tables 1 and 2 for the different sample sizes and censoring rates investigated. The parametric survival model, which does not require IPCW, provides unbiased predictions as expected. Unbiased predictions are also achieved by the generalized linear model and the gradient boosting, both of which employ the IPC-weighted log-loss function. This finding confirms our analytical results derived in Section 2.1. However, both appear to need a substantial number of observations with non-zero weights to produce unbiased predictions. While the IPC-weighted GLM stabilizes with observations, up to 50% of which may have zero weight, weighted gradient boosting does not stabilize before (for ) or even (for ), in particular when up to half of the observations have zero weight. Severe bias is observed for XGBoost even with sample sizes up to n=1000 for , and only with large sample sizes of do these predictions become approximately unbiased. One reason might be that machine learning methods often perform better with a higher number of covariates, and more hyperparameter tuning might lead to better predictions. However, this result also underscores that the results presented in Section 2.1 refer to expected bias only, and that finite-sample behavior may still differ, particularly when using data-demanding machine learning methods.
Mean predicted probabilities to survive time point for an individual with and () in simulated data with sample size n.

n      % of obs     Parametric        IPCW-generalized    IPCW-gradient
       with w = 0   survival model    linear model        boosting
200     5           0.353             0.352               0.361
200    25           0.353             0.351               0.366
200    50           0.352             0.346               0.385
500     5           0.351             0.351               0.352
500    25           0.351             0.351               0.350
500    50           0.352             0.349               0.359
1000    5           0.352             0.352               0.351
1000   25           0.352             0.352               0.350
1000   50           0.352             0.351               0.351
5000    5           0.352             0.352               0.352
5000   25           0.352             0.352               0.352
5000   50           0.352             0.352               0.351

The true survival probability is . The expected proportion of observations with assigned IPC weight of 0 is 5%, 25%, and 50% depending on the censoring distribution. IPC: inverse-probability-of-censoring; IPCW: inverse-probability-of-censoring-weighting.
Mean predicted probabilities to survive time point for an individual with and () in simulated data with sample size n.

n      % of obs     Parametric        IPCW-generalized    IPCW-gradient
       with w = 0   survival model    linear model        boosting
200     5           0.248             0.248               0.291
200    25           0.248             0.248               0.305
200    50           0.247             0.244               0.341
500     5           0.248             0.249               0.265
500    25           0.248             0.248               0.269
500    50           0.248             0.246               0.292
1000    5           0.248             0.248               0.256
1000   25           0.248             0.248               0.257
1000   50           0.248             0.248               0.269
5000    5           0.248             0.248               0.250
5000   25           0.248             0.248               0.250
5000   50           0.248             0.248               0.252

The true survival probability is . The expected proportion of observations with assigned IPC weight of 0 is 5%, 25%, and 50% depending on the censoring distribution. IPC: inverse-probability-of-censoring; IPCW: inverse-probability-of-censoring-weighting.
Tables 3 and 4 present the simulation results for the standard errors of and , respectively, estimated using the proposed IPCW-adjusted jackknife estimator (Section 2.2). For comparison, model-based standard error estimates are also shown; these are, however, not available for general machine learning approaches like XGBoost. The empirical standard errors over the simulation runs (SE) are expected to closely resemble the true standard error of predictions derived from the different analysis methods for the different sample sizes and censoring rates. The model-based standard errors (mod-SEE) of the survival model and the IPC-weighted GLM are asymptotically unbiased and, consequently, closely match the empirical standard errors for larger sample sizes and lower censoring rates. The IPC-adjusted jackknife estimates for the IPCW-GLM exhibit a bias of up to 6.2% relative to the empirical standard error estimate in certain scenarios (, with 50% of observations having zero weight). However, in all cases where only 25% of observations have zero weight, the relative bias remains below 4%. For IPC-weighted XGBoost, in scenarios where prediction bias arises due to insufficient sample sizes, the jackknife standard error estimates also exhibit bias, which can be substantial (up to for and up to for ). Since the jackknife variance estimator is based on predictions from leave-one-out subsamples, it is not surprising that biased predictions can lead to biased jackknife standard errors. For sufficiently large sample sizes, when the bias of predictions diminishes, the bias of the weighted jackknife standard error estimates also decreases to below . Irrespective of the applied method, all standard error estimates increase with decreasing sample size and increasing censoring rate, because both lead to a reduced number of observed events.
Standard error of the predicted probability to survive time point for an individual with and () in simulated data with sample size .

                      Parametric           IPCW-generalized                 IPCW-gradient
                      survival model       linear model                     boosting
n      % of obs       SE       mod-SEE     SE       mod-SEE   wJK-SEE       SE       wJK-SEE
       with w = 0
200     5             0.0508   0.0498      0.0618   0.0601    0.0612        0.0581   0.0662
200    25             0.0549   0.0539      0.0716   0.0696    0.0724        0.0715   0.0687
200    50             0.0622   0.0614      0.0997   0.0961    0.1058        0.0944   0.1011
500     5             0.0319   0.0316      0.0384   0.0381    0.0385        0.0371   0.0375
500    25             0.0348   0.0341      0.0446   0.0442    0.0454        0.0438   0.0442
500    50             0.0393   0.0389      0.0623   0.0615    0.0657        0.0613   0.0615
1000    5             0.0223   0.0224      0.0273   0.0270    0.0272        0.0271   0.0268
1000   25             0.0241   0.0242      0.0316   0.0313    0.0321        0.0314   0.0317
1000   50             0.0278   0.0275      0.0444   0.0437    0.0462        0.0431   0.0447
5000    5             0.0100   0.0100      0.0122   0.0121    0.0121        0.0123   0.0121
5000   25             0.0108   0.0108      0.0142   0.0140    0.0143        0.0142   0.0143
5000   50             0.0123   0.0123      0.0194   0.0196    0.0206        0.0194   0.0205

The expected proportion of observations with assigned IPC weight of 0 is 5%, 25%, and 50% depending on the censoring distribution. The empirical standard error over the simulation runs (SE), the mean standard error derived from standard error estimates of regression coefficients using the delta method (mod-SEE) and the mean adjusted jackknife standard error estimates (wJK-SEE) are given for the different prediction models. For gradient boosting no model-based standard error estimate exists; the survival model does not require IPCW and thus no adjusted jackknife estimator exists. IPC: inverse-probability-of-censoring; IPCW: inverse-probability-of-censoring-weighting.
Standard error of the predicted probability to survive time point for an individual with and () in simulated data with sample size .
| n | % of obs with w=0 | Parametric survival model: SE | mod-SEE | IPCW-generalized linear model: SE | mod-SEE | wJK-SEE | IPCW-gradient boosting: SE | wJK-SEE |
|---|---|---|---|---|---|---|---|---|
| 200 | 5 | 0.0421 | 0.0421 | 0.0531 | 0.0524 | 0.0532 | 0.0709 | 0.0651 |
| 200 | 25 | 0.0458 | 0.0453 | 0.0623 | 0.0609 | 0.0630 | 0.0890 | 0.0593 |
| 200 | 50 | 0.0516 | 0.0512 | 0.0875 | 0.0843 | 0.0915 | 0.1160 | 0.1078 |
| 500 | 5 | 0.0264 | 0.0267 | 0.0329 | 0.0333 | 0.0336 | 0.0391 | 0.0314 |
| 500 | 25 | 0.0283 | 0.0287 | 0.0384 | 0.0388 | 0.0398 | 0.0477 | 0.0355 |
| 500 | 50 | 0.0328 | 0.0323 | 0.0550 | 0.0543 | 0.0575 | 0.0772 | 0.0507 |
| 1000 | 5 | 0.0190 | 0.0189 | 0.0238 | 0.0235 | 0.0237 | 0.0264 | 0.0224 |
| 1000 | 25 | 0.0204 | 0.0203 | 0.0279 | 0.0275 | 0.0281 | 0.0310 | 0.0263 |
| 1000 | 50 | 0.0231 | 0.0228 | 0.0392 | 0.0387 | 0.0407 | 0.0488 | 0.0357 |
| 5000 | 5 | 0.0084 | 0.0084 | 0.0105 | 0.0105 | 0.0106 | 0.0106 | 0.0104 |
| 5000 | 25 | 0.0091 | 0.0091 | 0.0122 | 0.0123 | 0.0125 | 0.0124 | 0.0123 |
| 5000 | 50 | 0.0101 | 0.0102 | 0.0173 | 0.0174 | 0.0182 | 0.0180 | 0.0177 |
The expected proportion of observations with an assigned IPC weight of 0 is 5%, 25%, or 50%, depending on the censoring distribution. The empirical standard error over the simulation runs (SE), the mean standard error derived from standard error estimates of regression coefficients using the delta method (mod-SEE), and the mean adjusted jackknife standard error estimates (wJK-SEE) are given for the different prediction models. For gradient boosting, no model-based standard error estimate exists; the survival model does not require IPCW, so no adjusted jackknife estimator applies. IPC: inverse-probability-of-censoring; IPCW: inverse-probability-of-censoring-weighting.
In general, the IPCW-adjusted jackknife estimates tend to slightly overestimate the standard errors. They can therefore be considered conservative, for example, with respect to coverage probabilities when applied to calculate confidence intervals of predictions. This is consistent with the results on coverage probabilities shown in Tables 5 and 6. Except for cases where sample size limitations introduce bias that adversely affects coverage (the GLM with or with and 50% of observations with zero weight, and gradient boosting with () and ()), the coverage probabilities using the IPCW-adjusted jackknife standard error come close to the nominal level of 95%. In general, they slightly exceed the nominal level, whereas those using the model-based standard error sometimes fall slightly below it. Overall, the IPCW-adjusted jackknife standard error seems to provide a useful method for constructing confidence intervals, particularly in situations where model-based solutions are not available. However, coverage probabilities of confidence intervals depend not only on reliable standard error estimates, but also on unbiased predictions. The latter explains the very poor coverage probabilities in certain scenarios with gradient boosting and , as shown in Table 6.
Coverage probabilities of 95% confidence intervals for the probability to survive time point for an individual with and in simulated data with sample size .
| n | % of obs with w=0 | Parametric survival model: mod-CP | IPCW-generalized linear model: mod-CP | IPCW-generalized linear model: wJK-CP | IPCW-gradient boosting: wJK-CP |
|---|---|---|---|---|---|
| 200 | 5 | 0.950 | 0.945 | 0.949 | 0.918 |
| 200 | 25 | 0.951 | 0.949 | 0.957 | 0.888 |
| 200 | 50 | 0.952 | 0.948 | 0.967 | 0.812 |
| 500 | 5 | 0.947 | 0.948 | 0.950 | 0.948 |
| 500 | 25 | 0.945 | 0.950 | 0.955 | 0.950 |
| 500 | 50 | 0.948 | 0.952 | 0.964 | 0.925 |
| 1000 | 5 | 0.948 | 0.952 | 0.953 | 0.951 |
| 1000 | 25 | 0.948 | 0.950 | 0.955 | 0.952 |
| 1000 | 50 | 0.943 | 0.948 | 0.960 | 0.953 |
| 5000 | 5 | 0.955 | 0.945 | 0.946 | 0.945 |
| 5000 | 25 | 0.950 | 0.946 | 0.950 | 0.950 |
| 5000 | 50 | 0.948 | 0.954 | 0.963 | 0.962 |
The expected proportion of observations with an assigned IPC weight of 0 is 5%, 25%, or 50%, depending on the censoring distribution. The confidence intervals are derived from intervals constructed on the logit scale using the adjusted jackknife standard error (wJK-CP). For comparison, the coverage probabilities obtained with model-based standard error estimates from the parametric survival model and the generalized linear model are also given (mod-CP). IPC: inverse-probability-of-censoring; IPCW: inverse-probability-of-censoring-weighting.
Coverage probabilities of 95% confidence intervals for the probability to survive time point for an individual with and in simulated data with sample size .
| n | % of obs with w=0 | Parametric survival model: mod-CP | IPCW-generalized linear model: mod-CP | IPCW-generalized linear model: wJK-CP | IPCW-gradient boosting: wJK-CP |
|---|---|---|---|---|---|
| 200 | 5 | 0.954 | 0.951 | 0.954 | 0.728 |
| 200 | 25 | 0.952 | 0.948 | 0.955 | 0.627 |
| 200 | 50 | 0.954 | 0.941 | 0.957 | 0.542 |
| 500 | 5 | 0.953 | 0.953 | 0.955 | 0.824 |
| 500 | 25 | 0.951 | 0.952 | 0.957 | 0.812 |
| 500 | 50 | 0.944 | 0.949 | 0.962 | 0.674 |
| 1000 | 5 | 0.948 | 0.947 | 0.949 | 0.878 |
| 1000 | 25 | 0.948 | 0.950 | 0.956 | 0.886 |
| 1000 | 50 | 0.950 | 0.948 | 0.959 | 0.804 |
| 5000 | 5 | 0.956 | 0.954 | 0.955 | 0.942 |
| 5000 | 25 | 0.952 | 0.954 | 0.957 | 0.945 |
| 5000 | 50 | 0.951 | 0.953 | 0.964 | 0.935 |
The expected proportion of observations with an assigned IPC weight of 0 is 5%, 25%, or 50%, depending on the censoring distribution. The confidence intervals are derived from intervals constructed on the logit scale using the adjusted jackknife standard error (wJK-CP). For comparison, the coverage probabilities obtained with model-based standard error estimates from the parametric survival model and the generalized linear model are also given (mod-CP). IPC: inverse-probability-of-censoring; IPCW: inverse-probability-of-censoring-weighting.
In general, constructing confidence intervals on the logit scale, as given in (6), is preferable to Wald-type intervals, as it yields coverage probabilities closer to the nominal level of 95% (results for the Wald-type intervals not shown).
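As an illustration of this logit-scale construction, the following is a minimal sketch of the standard delta-method recipe; the exact form of equation (6) is not reproduced in this excerpt and may differ in detail:

```python
import math

def logit_ci(p, se, z=1.96):
    """Confidence interval for a predicted probability p with standard
    error se, built on the logit scale and back-transformed.
    Sketch of the usual delta-method construction, not necessarily
    identical to equation (6) of the article."""
    logit_p = math.log(p / (1 - p))
    se_logit = se / (p * (1 - p))   # delta method: d/dp logit(p) = 1/(p(1-p))
    expit = lambda x: 1 / (1 + math.exp(-x))
    return expit(logit_p - z * se_logit), expit(logit_p + z * se_logit)

# Example: a predicted 2-year survival probability of 0.93 with SE 0.005.
lo, hi = logit_ci(0.93, 0.005)
```

Unlike a Wald interval p ± z·se, the back-transformed interval always stays within (0, 1), which is one reason it behaves better for probabilities near the boundary.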
When only a small proportion (5%) of all subjects receives a weight of zero, weighted approaches may suffer from increased uncertainty in estimating the censoring distribution, which determines the weights. Additionally, small weights can lead to numerical instability of the weighted estimators, as known from other IPCW-based approaches.27 However, our simulation results do not indicate any adverse effects resulting from small censoring proportions. One possible explanation is that a lower censoring rate implies fewer subjects with weight zero, who must be represented by up-weighting the remaining observations. This benefit might outweigh the potential drawbacks of small weights.
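For concreteness, the weighting scheme discussed here can be sketched as follows; `ipc_weights` is a hypothetical helper (not the authors' implementation) that estimates the censoring distribution by a reverse Kaplan-Meier estimator and ignores ties and left-limit conventions:

```python
def ipc_weights(times, events, t):
    """Inverse-probability-of-censoring weights at horizon t.
    times:  observed follow-up times; events: 1 = event, 0 = censored.
    G is the Kaplan-Meier estimate of the censoring survival function
    (events and censorings swap roles). Observations censored before t
    get weight 0; the rest are up-weighted by 1/G.
    Illustrative sketch, not the paper's implementation."""
    n = len(times)
    order = sorted(range(n), key=lambda i: times[i])
    G, at_risk, surv = {}, n, 1.0
    for i in order:
        if events[i] == 0:                       # a censoring "event"
            surv *= (at_risk - 1) / at_risk
        at_risk -= 1
        G[times[i]] = surv
    def G_at(s):                                 # step function G(s)
        vals = [G[u] for u in sorted(G) if u <= s]
        return vals[-1] if vals else 1.0
    weights = []
    for time, event in zip(times, events):
        if event == 1 and time <= t:             # event observed before t
            weights.append(1.0 / G_at(time))
        elif time >= t:                          # still under observation at t
            weights.append(1.0 / G_at(t))
        else:                                    # censored before t
            weights.append(0.0)
    return weights
```

Subjects censored before t receive weight 0 and are represented by up-weighting comparable subjects that remain under observation, which is the mechanism referred to in the paragraph above.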
When comparing prediction uncertainty between the different analysis models, the survival model exhibits the lowest uncertainty when measured as the standard error of prediction. Survival models utilize the full information in the data, whereas classification models dichotomize it, which could explain this result. However, it must be considered that the applied survival model perfectly matches the data-generating process, whereas in real data applications model misspecification might also affect prediction uncertainty. Among the IPC-weighted classifiers, the generalized linear model exhibits slightly higher prediction uncertainty than gradient boosting; however, this is accompanied by higher prediction accuracy, as described before.
Interestingly, the differences in prediction uncertainties seem to be too small to be reflected in the Brier scores presented in Table 7. Here, all methods show comparable results with respect to the Brier scores averaged over the simulation runs.
Mean Brier score of the different prediction models for predicting the probability to survive time point in simulated data with sample size n.
| n | % of obs with w=0 | Parametric survival model | IPCW-generalized linear model | IPCW-gradient boosting |
|---|---|---|---|---|
| 200 | 5 | 0.2244 | 0.2227 | 0.2244 |
| 200 | 25 | 0.2242 | 0.2215 | 0.2239 |
| 200 | 50 | 0.2237 | 0.2170 | 0.2234 |
| 500 | 5 | 0.2255 | 0.2248 | 0.2251 |
| 500 | 25 | 0.2254 | 0.2243 | 0.2247 |
| 500 | 50 | 0.2248 | 0.2223 | 0.2239 |
| 1000 | 5 | 0.2257 | 0.2254 | 0.2254 |
| 1000 | 25 | 0.2257 | 0.2251 | 0.2252 |
| 1000 | 50 | 0.2256 | 0.2243 | 0.2247 |
| 5000 | 5 | 0.2261 | 0.2260 | 0.2260 |
| 5000 | 25 | 0.2261 | 0.2259 | 0.2260 |
| 5000 | 50 | 0.2261 | 0.2258 | 0.2258 |
The expected proportion of observations with assigned IPC weight of 0 is 5%, 25%, and 50% depending on the censoring distribution. IPC: inverse-probability-of-censoring; IPCW: inverse-probability-of-censoring-weighting.
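The Brier scores in the table are averages of squared prediction errors for the dichotomized outcome; under censoring, the score itself must be IPC-weighted so that zero-weight (censored) observations drop out. A minimal sketch, noting that normalization conventions vary (e.g. dividing by n instead of the weight sum):

```python
def ipcw_brier(preds, status, weights):
    """IPC-weighted Brier score at a fixed time horizon.
    preds:   predicted survival probabilities
    status:  1 if the subject survived the horizon, 0 otherwise
    weights: IPC weights (0 for subjects censored before the horizon)
    Illustrative sketch; the article's exact normalization may differ."""
    num = sum(w * (s - p) ** 2 for p, s, w in zip(preds, status, weights))
    return num / sum(weights)
```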
The results are further illustrated in Figures 1 to 4, exemplified for a sample size of . The figures illustrate the findings described before: whereas all methods exhibit unbiased predictions for , the variability of predictions is higher when IPC-weighted classifiers are applied than with the survival model (Figure 1). This can also be observed in Figure 2, which additionally reflects the slightly conservative results of the IPCW-adjusted jackknife estimates of standard error compared to the model-based estimates of the IPC-weighted GLM. For , Figure 3 reflects the bias of the IPC-weighted gradient boosting, which also adversely affects the standard error estimates based on the jackknife replicates of predictions (Figure 4).
Predictions of the probability to survive time point for an individual with and () in simulated data with sample size . The expected proportion of observations with assigned IPC weight of 0 is 5%, 25%, and 50% depending on the censoring distribution. Predictions are derived from a parametric log-logistic survival model (survival), an IPCW-GLM and an IPC-weighted gradient boosting model (IPCW-gradient-boosting) described as (a) to (c) in Section 3.1. IPC: inverse-probability-of-censoring; IPCW-GLM: inverse-probability-of-censoring-weighted generalized linear model; IPCW: inverse-probability-of-censoring-weighting.
Standard error of the predicted probability to survive time point for an individual with and () in simulated data with sample size . The expected proportion of observations with assigned IPC weight of 0 is 5%, 25%, and 50% depending on the censoring distribution. Standard errors of predictions are derived from the standard errors of regression coefficients using the delta method in the parametric log-logistic survival model and the IPC-weighted GLM (survival:mod-SEE, IPCW-GLM:mod-SEE) and by the adjusted jackknife estimator in the IPC-weighted GLM and the IPC-weighted gradient boosting model (IPCW-GLM:wJK-SEE, IPCW-gradient-boosting:wJK-SEE). The analysis methods (a) to (c) are described in Section 3.1. IPC: inverse-probability-of-censoring; IPCW-GLM: inverse-probability-of-censoring-weighted generalized linear model; IPCW: inverse-probability-of-censoring-weighting.
Predictions of the probability to survive time point for an individual with and () in simulated data with sample size . The expected proportion of observations with assigned IPC weight of 0 is 5%, 25%, and 50% depending on the censoring distribution. Predictions are derived from a parametric log-logistic survival model (survival), an IPCW-GLM and an IPC-weighted gradient boosting model (IPCW-gradient-boosting) described as (a) to (c) in Section 3.1. IPC: inverse-probability-of-censoring; IPCW-GLM: inverse-probability-of-censoring-weighted generalized linear model; IPCW: inverse-probability-of-censoring-weighting.
Standard error of the predicted probability to survive time point for an individual with and () in simulated data with sample size . The expected proportion of observations with assigned IPC weight of 0 is 5%, 25%, and 50% depending on the censoring distribution. Standard errors of predictions are derived from the standard errors of regression coefficients using the delta method in the parametric log-logistic survival model and the IPC-weighted GLM (survival:mod-SEE, IPCW-GLM:mod-SEE) and by the adjusted jackknife estimator in the IPC-weighted GLM and the IPC-weighted gradient boosting model (IPCW-GLM:wJK-SEE, IPCW-gradient-boosting:wJK-SEE). The analysis methods (a) to (c) are described in Section 3.1. IPC: inverse-probability-of-censoring; IPCW-GLM: inverse-probability-of-censoring-weighted generalized linear model; IPCW: inverse-probability-of-censoring-weighting.
Application
We illustrate the methods by quantifying the prediction uncertainty of parametric and machine learning classifiers when predicting individual patient risks of death or graft failure after kidney transplantation. For prediction modeling, data from the German Organ Transplant Registry are used. These data include information on donors, recipients, and the transplantation process for the solid organ transplants performed between 2006 and 2016. Additionally, follow-up information on patient survival and organ failure is incorporated, which hospitals are required to report at least annually. Due to this reporting schedule, it may be more reliable to derive only annual predictions. These predictions serve to illustrate the methods developed within this article.
Data were filtered to include first-time, kidney-only recipients of deceased donor organs with available information on graft and patient survival data. Donors and recipients above 65 years of age were excluded due to differences in organ allocation for this age group (Eurotransplant Senior Program). This resulted in a sample size of kidney transplantations. We further removed 102 observations with implausible values (weight <30 kg, height <100 cm, years on dialysis >100 years or negative, creatinine >1000 µmol/L, cold ischemia time >80 hours or 0 minutes, ABO-incompatible transplants).
The outcome of interest was graft and patient survival probabilities at , and years after transplantation. Observations with graft failure or patient death within the first 5 days after transplantation () were excluded, as well as patients with follow-up information <5 days after transplantation () resulting in a total of 9707 observations. The proportion of censored observations before time was , and , respectively.
A Cox proportional hazards model, an IPC-weighted GLM with logit link function and an IPC-weighted gradient boosting classifier were fitted to estimate the graft and patient survival probabilities at time . Hyperparameter selection is described in Appendix A. All models were based on the following donor, transplant and recipient factors: donor age, donor height, donor smoking, recipient age, recipient’s years on dialysis, log cold ischemia time, recipient pretransplant blood transfusion, recipient presenting PRA (panel-reactive antibodies), and recipient sex. The details about the variable selection method and handling of missing values are given in Appendix B. For the Cox model and GLM, model-based standard error estimates of predictions were derived, where methods derived by Blanche et al.12 are used for the weighted GLM. Additionally, the adjusted jackknife estimator for the standard error of predictions is employed for all models.
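Fitting such an IPC-weighted GLM amounts to maximizing an IPC-weighted log-likelihood. The following sketch (a hypothetical minimal implementation by gradient ascent, not the authors' code) shows how zero-weight observations drop out automatically:

```python
import numpy as np

def fit_ipcw_logistic(X, y, w, lr=0.1, steps=5000):
    """Logistic regression maximizing the IPC-weighted log-likelihood
    sum_i w_i * [y_i log p_i + (1 - y_i) log(1 - p_i)] by gradient ascent.
    Observations with weight 0 (censored before the horizon) contribute
    nothing to the gradient. Minimal sketch, not the authors' implementation."""
    X1 = np.hstack([np.ones((len(X), 1)), X])        # add intercept column
    beta = np.zeros(X1.shape[1])
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-X1 @ beta))         # current predictions
        beta += lr * X1.T @ (w * (y - p)) / w.sum()  # weighted score
    return beta
```

In practice one would use a standard routine with case weights (e.g. `glm(..., weights = ...)` in R or `sample_weight` in scikit-learn); the point here is only that the IPC weights enter the likelihood as ordinary case weights.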
Table 8 presents the predictions from the different models for an example patient: a male recipient of mean age, with mean years on dialysis, without pretransplant blood transfusion, not presenting panel-reactive antibodies, receiving a transplant from a non-smoking donor of mean age and height, with the mean log cold ischemia time. The predicted patient and graft survival probability at time is similar, irrespective of whether a Cox regression approach or an IPC-weighted binary classifier is applied. The survival probability at 2 years after transplantation is predicted to be 93% and decreases to around 90% and 87% after 3 and 4 years, respectively. The gradient boosting model seems to exhibit a slight bias when the proportion of observations with zero weight is 22.5% or 53.7%. Regarding prediction uncertainty, the application confirms the finding of the simulation studies that IPC-weighted classifiers have higher prediction uncertainty than the survival analysis approach, which utilizes the full data. The results also indicate that the adjusted jackknife estimate of prediction uncertainty tends to be slightly conservative compared to the model-based approach for the GLM. However, when machine learning classifiers such as XGBoost are employed with right-censored data, no model-based approach for prediction uncertainty exists, and the proposed adjusted jackknife estimator provides a way to quantify prediction uncertainty in these situations as well.
Predicted probability with standard error for the graft and patient survival at time (years) for the specific patient as described in Section 4.
| t (years) | % of obs with w=0 | Method | Prediction | mod-SE | wJK-SE |
|---|---|---|---|---|---|
| 2 | 8.5 | Cox model | 0.931 | 0.0037 | – |
| 2 | 8.5 | IPCW-GLM | 0.935 | 0.0043 | 0.0044 |
| 2 | 8.5 | IPCW-XGBoost | 0.931 | – | 0.0044 |
| 3 | 22.5 | Cox model | 0.905 | 0.0048 | – |
| 3 | 22.5 | IPCW-GLM | 0.910 | 0.0051 | 0.0058 |
| 3 | 22.5 | IPCW-XGBoost | 0.886 | – | 0.0058 |
| 4 | 53.7 | Cox model | 0.871 | 0.0064 | – |
| 4 | 53.7 | IPCW-GLM | 0.867 | 0.0078 | 0.0080 |
| 4 | 53.7 | IPCW-XGBoost | 0.857 | – | 0.0080 |
Predictions () are derived from the different prediction models that have been trained on data of the German Organ Transplant Registry. The percentage of observations assigned an IPC weight of 0 depends on the censoring rate and thus on . For gradient boosting, no model-based standard error estimate (mod-SE) exists; the survival model does not require IPCW, so no adjusted jackknife estimator (wJK-SE) applies. IPC: inverse-probability-of-censoring; IPCW: inverse-probability-of-censoring-weighting.
Variables considered in the backward variable selection process for prediction modeling of the graft and patient survival probability using data of the German Organ Transplant Registry (Section 4).
| Variable | Distinct values |
|---|---|
| Donor | |
| Age (years) | – |
| Creatinine (µmol/L) | – |
| Diabetes | T/F |
| Height (cm) | – |
| Hepatitis C antibodies | T/F |
| Hypertension | T/F |
| Sex | F/M |
| Smoker | T/F |
| Weight (kg) | – |
| Recipient | |
| Age (years) | – |
| Pretransplant blood transfusion | T/F |
| Years on dialysis | – |
| Height (cm) | – |
| Hepatitis C antibodies | T/F |
| Panel reactive antibodies (PRA) | T/F |
| Sex | F/M |
| Weight (kg) | – |
| Transplantation | |
| Cold ischemia time (minutes) | – |
| Enbloc transplantation | T/F |
| Destination | L/A/H/R |
Variables considered by Rao et al.34 that were unavailable or not directly available in the TxReg were:
For the recipient: Race, transplant center, angina pectoris, peripheral vascular disease, drug-treated chronic obstructive, pulmonary disease, and diabetes status.
For the donor: Race, donation after cardiac death, pulsatile perfusion, and year of transplant.
Discussion
IPCW classifiers have been introduced as a flexible tool for applying in particular machine learning models to survival data. We have proposed a method to consider the IPCW when estimating the uncertainty of individual predictions derived from these classifiers.
Whereas we have focused on classifiers that customize the log-likelihood or loss function using IPCW, bagging with IPC-weighted bootstrapping has been proposed as an alternative.11 This approach enables ensemble learning and prediction estimates are expected to be unbiased, following similar arguments as given in Section 2.1. For standard error estimation, the jackknife after bootstrap method28,29 could be adapted to incorporate IPC weights, which will be the topic of future work.
Jackknife methods can be computationally challenging in particular for large sample sizes and time-consuming prediction algorithms as for example random forests or deep neural networks. For this reason, our simulation study investigates only a single machine learning algorithm (gradient boosting), which is a limitation. The simulation study is also restricted to a data generation model, which does not reflect non-linear or non-additive effects or a larger number of predictors as for example proposed by Infante et al.30 Nevertheless, we chose this simulation model, because for the IPCW-GLM a parametric standard error recently has been proposed12 that could serve as a comparator.
Three issues arise with the presented approach, just as with unadjusted jackknife estimates of standard errors. First, our estimator gives slightly conservative results, as is typical for the jackknife method.31 Second, we rely on assumption (3), which is similar to the assumption used in the unadjusted jackknife approach. We demonstrate that assumption (3) holds true in the simplified example given in Section 2.2; however, in more complex scenarios, the influence functions may not reduce to simple formulas, making this assumption difficult to verify. Third, jackknife variance estimators are based on predictions from leave-one-out subsamples. Consequently, when the underlying method produces biased predictions, as we observed for the machine learning approach under insufficient sample sizes, this bias can adversely affect the jackknife variance estimates and, in turn, reduce the coverage probabilities of the confidence intervals.
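As a building block, the unadjusted jackknife standard error of a prediction is assembled from leave-one-out refits; the proposed method rescales this quantity by an IPCW-dependent factor defined in Section 2 (not reproduced in this excerpt). A sketch of the unadjusted estimator:

```python
import math

def jackknife_se(predict, data):
    """Jackknife standard error of a prediction at a fixed query point.
    predict: function mapping a training sample (a list) to a scalar prediction.
    Refits on each leave-one-out subsample and combines the n replicates.
    Unadjusted version; the article's proposal additionally rescales this
    estimate to account for the IPC weights."""
    n = len(data)
    reps = [predict(data[:i] + data[i + 1:]) for i in range(n)]
    mean = sum(reps) / n
    return math.sqrt((n - 1) / n * sum((r - mean) ** 2 for r in reps))
```

For the sample mean as predictor, this reproduces the classical s/sqrt(n) standard error of the mean; for complex learners such as gradient boosting, each replicate requires a full refit, which is the computational burden noted above.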
In situations where IPC weights are close to zero, the proposed adjustment will have less impact on the estimator than in situations with larger weights. This can be observed in the simple tractable example of Section 2.3 where the scaling factor approximates with increasing and thus with decreasing IPC weights.
Prediction is a common task in statistical and machine learning. We consider reporting the uncertainty of individual risk predictions as crucial, especially when these predictions inform medical decisions. However, there is no firm conclusion on the best measure for reporting individual prediction uncertainties.32 The TRIPOD statement33 provides guidance on how to report a prediction model, but it does not address how to report the resulting predictions. Recently, Thomassen et al.17 proposed the effective sample size as a metric to quantify prediction uncertainty, which may be a more intuitive measure than standard errors or confidence intervals. Although the effective sample size is currently limited to certain parametric prediction models and thus out of the scope of this article, the authors discuss extending it to more flexible models.
It should be mentioned that uncertainty in individual predictions does not only arise from sampling data from the target population. There are other sources of uncertainty, which we have not considered here. One potential source is model misspecification, which is of more concern in statistical modeling than in machine learning due to the flexibility of many machine learning algorithms. Another source is sampling bias. This issue is particularly relevant for machine learning algorithms, which often are trained on large datasets that are more susceptible to sampling bias compared to clinical study data with pre-specified inclusion and exclusion criteria.
Acknowledgements
We thank the reviewers for their valuable comments that have improved the article.
ORCID iDs
Antje Jahn-Eimermacher
Lukas Klein
Gunter Grieser
Ethical considerations
This study received ethical approval from the Ethics Committee of the medical school of the Martin-Luther-University Halle-Wittenberg (approval 2024-139) on 6 August 2024. This is an IRB-approved retrospective study, all patient information was de-identified and patient consent was not required. Patient data will not be shared with third parties.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The research project was funded by the Federal Ministry of Education and Research (project 13FH019KX1). The results presented in this article are the responsibility of the authors.
Declaration of conflicting interest
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data availability statement
An implementation of the applied analysis methods is available as an R package on GitHub: https://github.com/IDEN-Project-UAS-Darmstadt/IPCWJK. The data analysis presented in this article uses retrospective and anonymized data of the German Transplant Registry. These data have been supplied by the Transplant Registry Agency, represented by the Gesundheitsforen Leipzig GmbH. Data can be requested from the registry for research purposes.
Appendices
References
1. Gotlieb N, Azhie A, Sharma D, et al. The promise of machine learning applications in solid organ transplantation. npj Digit Med 2022; 5: 89.
2. Truchot A, Raynaud M, Kamar N, et al. Machine learning does not outperform traditional statistical modelling for kidney allograft failure prediction. Kidney Int 2023; 103: 936–948.
3. Szili-Torok T, Tietge UJ, Verbeek MJ, et al. Machine learning: it takes more than select models to draw general conclusions. Kidney Int 2023; 104: 1035–1036.
4. Lisboa PJ, Jayabalan M, Ortega-Martorell S, et al. Enhanced survival prediction using explainable artificial intelligence in heart transplantation. Sci Rep 2022; 12: 1–14.
5. Nitski O, Azhie A, Qazi-Arisar FA, et al. Long-term mortality risk stratification of liver transplant recipients: real-time application of deep learning algorithms on longitudinal data. Lancet Digit Health 2021; 3: e295–e305.
6. Ershoff BD, Lee CK, Wray CL, et al. Training and validation of deep neural networks for the prediction of 90-day post-liver transplant mortality using UNOS registry data. Transplant Proc 2020; 52: 246–258.
7. Kampaktsis PN, Tzani A, Doulamis IP, et al. State-of-the-art machine learning algorithms for the prediction of outcomes after contemporary heart transplantation: results from the UNOS database. Clin Transplant 2021; 35: e14388.
8. Reps JM, Rijnbeek P, Cuthbert A, et al. An empirical analysis of dealing with patients who are lost to follow-up when developing prognostic models using a cohort design. BMC Med Inform Decis Mak 2021; 21: 43.
9. Kvamme H, Borgan Ø. The Brier score under administrative censoring: problems and solutions. J Mach Learn Res 2023; 24: 1–26.
10. Vock DM, Wolfson J, Bandyopadhyay S, et al. Adapting machine learning techniques to censored time-to-event health record data: a general-purpose approach using inverse probability of censoring weighting. J Biomed Inform 2016; 61: 119–131.
11. Ginestet GP, Kotalik A, Vock DM, et al. Stacked inverse probability of censoring weighted bagging: a case study in the InfCareHIV register. J R Stat Soc Ser C: Appl Stat 2021; 70: 51–65.
12. Blanche PF, Holt A, Scheike T. On logistic regression with right censored data, with or without competing risks, and its use for estimating treatment effects. Lifetime Data Anal 2023; 29: 441–482.
13. Tsiatis AA. Semiparametric theory and missing data. New York, NY: Springer, 2006.
14. Srujana B, Verma D, Naqvi S. Machine learning versus survival analysis models: a study on right censored heart failure data. Commun Stat - Simul Comput 2024; 53: 1899–1916.
15. Ishwaran H, Kogalur UB, Blackstone EH, et al. Random survival forests. Ann Appl Stat 2008; 2: 841–860.
16. Katzman JL, Shaham U, Cloninger A, et al. DeepSurv: personalized treatment recommender system using a Cox proportional hazards deep neural network. BMC Med Res Methodol 2018; 18: 24.
17. Thomassen D, le Cessie S, van Houwelingen HC, et al. Effective sample size: a measure of individual uncertainty in predictions. Stat Med 2024; 43: 1384–1396.
18. Shen C, Li X. On the uncertainty of individual prediction because of sampling predictors. Stat Med 2016; 35: 2016–2030.
19. Hernan MA. The hazards of hazard ratios. Epidemiology 2010; 21: 13–15.
20. Efron B, Hastie T. Computer age statistical inference: algorithms, evidence, and data science. Cambridge: Cambridge University Press, 2016.
21. Efron B, Tibshirani R. An introduction to the bootstrap. Philadelphia, PA: Chapman and Hall/CRC Press, 1994.
22. Hinkley DV. Jackknifing in unbalanced situations. Technometrics 1977; 19: 285–292.
23. Shao J, Wu CFJ. A general theory for jackknife variance estimation. Ann Stat 1989; 17: 1176–1197.
24. Stute W. The jackknife estimate of variance of a Kaplan-Meier integral. Ann Stat 1996; 24: 2679–2704.
25. Perme MP, Manevski D. Confidence intervals for the Mann–Whitney test. Stat Methods Med Res 2019; 28: 3755–3768.
26. Morris TP, White IR, Crowther MJ. Using simulation studies to evaluate statistical methods. Stat Med 2019; 38: 2074–2102.
27. Austin PC, Stuart EA. The performance of inverse probability of treatment weighting and full matching on the propensity score in the presence of model misspecification when estimating the effect of treatment on survival outcomes. Stat Methods Med Res 2017; 26: 1654–1670.
28. Efron B. Jackknife-after-bootstrap standard errors and influence functions. J R Stat Soc Ser B: Stat Methodol 1992; 54: 83–111.
29. Wager S, Hastie T, Efron B. Confidence intervals for random forests: the jackknife and the infinitesimal jackknife. J Mach Learn Res 2014; 15: 1625–1651.
30. Infante G, Miceli R, Ambrogi F. Sample size and predictive performance of machine learning methods with survival data: a simulation study. Stat Med 2023; 42: 5657–5675.
31. Efron B, Stein C. The jackknife estimate of variance. Ann Stat 1981; 9: 586–596.
32. Spiegelhalter D. Risk and uncertainty communication. Annu Rev Stat Appl 2017; 4: 31–60.
33. Moons KGM, Altman DG, Reitsma JB, et al. Transparent reporting of a multivariable prediction model for individual prognosis or diagnosis (TRIPOD): explanation and elaboration. Ann Intern Med 2015; 162: W1–W73.
34. Rao PS, Schaubel DE, Guidinger MK, et al. A comprehensive risk quantification score for deceased donor kidneys: the kidney donor risk index. Transplantation 2009; 88: 231–236.