Sage Journals: Discover world-class research

Abstract

Purpose. Individual-level simulation models often require sampling times to events; however, efficient parametric distributions for many processes may often not exist. For example, time to death from life tables cannot be accurately sampled from existing parametric distributions. We propose an efficient nonparametric method to sample times to events that does not require any parametric assumption on the hazards. Methods. We developed a nonparametric sampling (NPS) approach that simultaneously draws multiple time-to-event samples from a categorical distribution. This approach can be applied to univariate and multivariate processes. We discretize the entire period into equal-length time intervals and then derived the interval-specific probabilities. The times to events can then be used directly in individual-level simulation models. We compared the accuracy of our approach in sampling time-to-events from common parametric distributions, including exponential, gamma, and Gompertz. In addition, we evaluated the method’s performance in sampling age to death from US life tables and sampling times to events from parametric baseline hazards with time-dependent covariates. Results. The NPS method estimated similar expected times to events from 1 million draws for the 3 parametric distributions, 100,000 draws for the homogenous cohort, 200,000 draws from the heterogeneous cohort, and 1 million draws for the parametric distributions with time-varying covariates, all in less than 1 second. Conclusion. Our method produces accurate and computationally efficient samples for time to events from hazards without requiring parametric assumptions.

Highlights

The nonparametric sampling method is generic and can sample times to an event from any discrete (or discretizable) hazard without requiring any parametric assumption.

The method is showcased with 5 commonly used distributions in discrete-event simulation models.

The method produced very similar expected times to events, as well as their probability distribution, compared with analytical results.

We provide a multivariate categorical sampling function for R and Python programming languages to sample times to events from processes with different hazards simultaneously.

Keywords

discrete event simulation time to event non-parametric sampling multivariate categorical sampling nonhomogeneous poisson point process (NHPPP)

Introduction

Discrete-event simulation (DES) models simulate processes as discrete sequences of events that occur over time.¹ These models rely on sampling the time of different events. For example, if events have a constant rate or hazard of occurrence, the time of their occurrence can be sampled from an exponential distribution. In DES models, time-to-event data following a nonconstant hazard could be sampled from parametric distributions.² However, some events cannot be easily described by parametric distributions. For example, life tables, or events following hazards that are a function of time-varying covariates, such as smoking histories or tumor size, do not always follow standard parametric distributions. An alternative is to use a nonhomogeneous Poisson point process (NHPPP), which assumes that the rate of events follows a Poisson process that can vary over time.³ There are different implementations of algorithms for sampling from NHPPP, which require either numerical integration or rejection sampling.⁴

In this brief report, we propose a nonparametric sampling (NPS) implementation of NHPPP that is both generalizable and computationally efficient. The method assumes that time to event is drawn from a nonparametric categorical distribution. We illustrate the NPS method using 5 examples highlighting its accuracy, flexibility, and computational efficiency. In addition, we provide an open-source implementation in R and Python to facilitate wider adoption.

Constructing the Categorical Distribution

The steps to implement the NPS method are described in Box 1 and shown in Figure 1. In summary, the approach involves 6 steps. First, obtaining the discrete-time hazard cumulative distribution function (CDF), $F_{t}$ . This can be obtained from nonparametric estimation methods, such as the life table, actuarial, or Kaplan-Meier methods.⁵ Second, obtaining the interval-specific sample probability, which is the probability mass function, $p_{t}$ , derived from $F_{t}$ . Third, sampling the times to events using a categorical distribution, using the interval-specific probabilities, and defining each time interval as a category. Finally, approximating time to event in continuous time. If $F_{t}$ is not readily available, it can be derived from either the discrete-time cumulative hazard, $H_{t}$ (e.g., obtained from a Nelson-Aalen estimator⁶), or the interval-specific discrete-time hazard, $h_{t}$ (e.g., obtained from a life table or actuarial estimation method). If instead, the cumulative hazard is available in continuous time, $H (t)$ , then $h_{t}$ can be derived within a time interval $Δ t$ from $H (t + Δ t) - H (t)$ . Below we provide further details on these steps.

Box 1

Steps to Draw a Time to Event Using a Nonparametric Sampling Approach^a

1. Obtain F_t; if it is not available, then
a. If H_t is available, F_t = 1 –exp (−H_t).
b. If h_t is available, compute H_t = Σ^t_x=0h_x and then compute step la.
c. If H(t) is available, compute h_t = H(t+Δt) −H(t) and then compute step 1b.
d. If h(t) is available, compute H(t) = ∫^t₀h (x) dx and then compute step 1c.
2. Calculate the category-specific sample probabilities p_t₌F_t_+Δt−F_t.
3. Sample the time to event X=x, from a categorical distribution.
a. If 1 or multiple samples are taken from the same process X∼ Cat [p_o, p₁, …, p_z], use the univariate categorical distribution through the sample function in base R or numpy.random.choice in Python.
b. If multiple samples are taken from different processes, X = [X¹, X², …, X^K] ∼Cat_κ [p¹, p², …, p^K], use the multivariate categorical distribution through the nps_nhppp function, implemented for R and Python, provided in the Supplementary Material.
4. Approximate to continuous time by adding a random variable Y ∼U [0, Δt] to all sampled elements.

$F_{t}$ , cumulative distribution function for a discrete-time random variable; $H_{t}$ , discrete-time cumulative hazard function; $h_{t}$ , discrete-time hazard function; H(t), continuous time cumulative hazard function; h(t), continuous time hazard function; $p_{t}$ , probability mass function; Cat, categorical distribution; 〖“Cat_K”〗, multivariate categorical distribution for K random variables.

Figure 1

Steps to sample time to events using a nonparametric sampling approach.

Let $t \in T$ be the time to event following a piecewise constant hazard $h_{t}$ within a time interval $Δ t$ , where $T$ is a random variable, $t = 0, \dots, Z$ , and $Z$ is the last time interval by which the event can occur. Thus, the cumulative hazard function at time $t$ , $H_{t}$ , is obtained from

H_{t} = \sum_{x = 0}^{t} h_{x}

(1)

The CDF of $T$ at time $t$ , $F_{t}$ , is

F_{t} = 1 - \exp (- H_{t}) .

(2)

If the hazard is given in a different scale from the one the analyst is interested in, it can be transformed to the desired scale by multiplying the hazard $h_{t}$ by $Δ t,$ where $Δ t$ represents the ratio of the given scale to the desired scale.⁵ For example, if the hazards are on a yearly scale and we want to sample monthly time-to-event data, we use $Δ t = \frac{1}{12}$ , and when the samples are in years, we use $Δ t = 1$ . This scale transformation assumes that the hazard is constant within the interval.

We derive the probability of an event happening within the $t$ th interval $[t, t + Δ t)$ by the difference in the CDF in equation (2) as

p_{t} = F_{t + Δ t} - F_{t}

(3)

To conduct an NPS of the time interval at which the event can occur, we define $X$ as the time interval at which the event can occur and assume it follows a categorical distribution in which each time interval is considered a category. Thus,

X ~ Cat [p_{0}, p_{1}, \dots, p_{Z}],

with a probability mass function $f (X = x | p) = p_{t}, F_{t}$ , where $p = (p_{0}, p_{1}, \dots, p_{Z})$ , $p_{t} \geq 0$ is the probability of the event occurring at the $t$ -th time interval $[t, t + Δ t)$ and $\sum_{t = 0}^{Z} p_{t} = 1$ . Most statistical software provides built-in functions to sample from a categorical distribution. For example, in R is the sample function, and in Python is the numpy.random.choice function.

Multivariate Categorical Distribution

We expand the previous approach to sample values for multiple random variables simultaneously by defining a multivariate categorical distribution as

X = [X^{1}, X^{2}, \dots, X^{K}] ~ Ca t_{K} [p^{1}, p^{2}, \dots, p^{K}],

where $X^{k} = x^{k}$ is the $k$ th random variable with a vector of probabilities $p^{k}$ of having the events in each of the time intervals defined as

p^{k} = [p_{0}^{k}, p_{1}^{k}, \dots, p_{Z}^{k}]

Common statistical software has no built-in functions to sample from a multivariate categorical distribution. However, we provide the code of the multivariate categorical distribution in R and Python in the Supplementary Material.

Approximating Continuous Time to Event

An approximation error occurs when approximating the continuous time to event by using a discrete-time approach.^7–9 Since the NPS samples for the exact time categories that were initially defined while dividing the time interval, the method does not contemplate the possibility of events happening in between any 2 categories. This generates a systematic bias, which could be reduced by adding a random variable $Y \sim U [0, Δ t]$ to $X$ , assuming that the time to event within each $Δ t$ interval is equally likely to occur within the interval, which is consistent with a piecewise constant hazard model. Adding a random uniform value to each time to event sampled from the categorical NPS method will increase the expected value of the simulated data by $Δ t / 2$ , akin to a half-cycle correction (HCC).⁹ However, unlike the HCC, the sampled values can take any value within the distribution range. Thus, the random variable of the time to event with the correction is $X_{c} = X + Y$ . The steps to sample the time to event using an NPS are shown in Box 1 and illustrated in Figure 1. If there is interest only in the expected value of the time to events, analysts could add a $Δ t / 2$ value to the expected value of the time to event obtained from the categorical NPS method without correction.

Accounting for Covariates

Hazards could be a function of either time-independent covariates, such as sex, race, or birth cohort, or time-dependent covariates, such as smoking histories, exposure to environmental risk factors, or tumor size. In this section, we demonstrate the use of the NPS method to sample times to events from hazards as functions of time-independent and time-dependent covariates.

Time-Independent Covariates

Let the $i$ th individual time to an event $T_{i}$ follow a time-dependent hazard, $h_{i} (t) = f_{i} (x_{i}; β)$ , over a time interval $[0, Z]$ , as a function of a time-independent covariate, $x_{i}$ , that can take any functional form and vary between individuals, and a set of coefficients $β$ . We assume a proportional hazards approach of the effect of the covariates on the hazard to demonstrate how to sample time to events in the NPS method. That is, $h_{i} (t) = f (x_{i}; β) = h_{0} (t) e^{x_{i} β}$ , where $h_{0} (t)$ is the time-dependent baseline hazard, $x_{i}$ is the covariate for the $i$ th individual, and $β$ is the log-hazard ratio of the proportional effect of the covariate $x_{i}$ on $h_{0} (t)$ .

Time-Dependent Covariates

We now consider that the covariate can vary over time $x_{i} (t)$ and can take any functional form, which results in a time-varying hazard $h_{i} (t) = f_{i} (x_{i} (t); β)$ , over a time interval $[0, Z]$ . The time-dependent covariate could be the same across all individuals (e.g., all experiencing the same mean tumor growth over time) or vary by individuals (e.g., everyone having their own smoking history). To use the NPS method to sample from hazards with time-varying covariates, we generate or prespecify the time-dependent covariate and compute the corresponding hazard. For example, Figure 2 shows a time-dependent Weibull hazard. We use the multivariate categorical distribution to sample time to events for multiple individuals with different covariate paths.

Figure 2

(A) Time-dependent hazard, h(t), for different values of a covariate. (B) Example of a covariate path. (C) Corresponding path of the h(t).

Examples

Below, we provide 5 examples to illustrate the implementation of the NPS method for different processes. The R code for these examples and the function of the multivariate categorical distribution is provided in a GitHub repository (https://github.com/DARTH-git/NPS_time_to_event).

Example 1: Time to Event from Parametric Hazards

We used the NPS method for drawing times to events from various commonly used parametric distributions, such as exponential, gamma, and log-normal. We derived the piecewise constant hazard, $h_{t}$ , as described in step 1 in Figure 1 and Box 1 and applied equation (2) and equation (3). We sampled 10,000 times to event, computed the mean across all samples, and repeated this 1,000 times to compute the overall mean across all simulations. We then compared the expected time to event obtained from our method, with and without the approximation to continuous-time interval, to the analytic expected time from the parametric distributions. While the expected times to event coming from the exponential, gamma, and log-normal distributions, using the NPS method and accounting for continuous time, were 10, 39.98, and 33.49, their analytical values were 10, 40, and 33.49, respectively. We also computed the mean execution time and their 95% interquantile range (see Table 1) from 100 iterations using a computer with 2.3-GHz Quad-Core Intel Core i7 with 32 GB memory. The mean execution times for the exponential, gamma, and log-normal distributions were 0.49, 0.61, and 0.54 ms, respectively.

Table 1

Comparison of Expected Time to Events and Mean Sampling Time, in Milliseconds, from 100 Iterations of N Samples Each between the Nonparametric Sampling (NPS) Method and Parametric Distributions or Life Table Estimates

Distribution	NPS-U	NPS-C	Analytic Solution	Parametric Sampling
Exponential (rate = 0.1; N = 10,000)
Expected value	9.51	10.00	10.00	10.00
Mean execution time (95% IQR)	0.35 [0.31, 0.48]	0.49 [0.43, 0.66]		0.45 [0.35, 0.69]
Gamma (rate = 0.1, shape = 4; N = 10,000)
Expected value	39.47	39.98	40.00	40.00
Mean execution time (95% IQR)	0.49 [0.39, 0.68]	0.61 [0.50, 0.85]		0.88 [0.68, 3.12]
Log-normal (µ = 3.5, σ = 0.15; N = 10,000)
Expected value	32.99	33.49	33.49	33.49
Mean execution time (95% IQR)	0.34 [0.30, 0.49]	0.54 [0.42, 0.78]		0.64 [0.54, 0.97]
Life tables—homogeneous cohort; N = 100,000
Expected value	78.04	78.53	78.37	N/A
Mean execution time (95% IQR)	3.63 [3.23, 4.05]	5.15 [4.38, 5.36]
Life tables—heterogeneous cohort; N = 200,000

IQR, interquantile range; N/A, not applicable; NPS-C, nonparametric sampling corrected by adding a uniformly distributed random number; NPS-U, nonparametric sampling uncorrected.

Example 2: Sampling Age to Death from a Homogeneous Cohort

We sampled the age to death for 100,000 individuals in a hypothetical cohort from the US population in 2015.¹⁰ We estimated the life expectancy by taking the average across the 100,000 samples with the continuous-time approximation. The probability mass function (PMF) for the age to death obtained from the NPS methods closely follows the PMF from the life table (Figure 3). The estimated life expectancy from the NPS method is 78.53 years, which is close to the life expectancy obtained from the life tables of 78.37 y. The mean execution time, repeating the sampling process 100 times, is 5.15 milliseconds (Table 1).

Figure 3

Probability mass function of dying within 1 y of age in the total US population in 2015.

Example 3: Drawing Age to Death from a Heterogeneous Cohort

We used the multivariate categorical distribution to simultaneously sample ages to death for 100,000 males and females from sex-specific life tables for the US population in 2015, with the continuous-time approximation defined above. The sex-specific PMF from the NPS method and the exact PMF from life tables are shown in Figure 4. The NPS method estimated a life expectancy of 76.22 and 80.93 y for males and females, respectively. The life expectancy obtained from the life tables was 75.93 and 80.76 y for males and females, respectively. The mean execution time, repeating the sampling process 100 times, is 255.30 ms (Table 1).

Figure 4

Probability mass function of dying within 1 y of age by sex, US population in 2015.

Example 4: Drawing Time to Event from Hazards with Time-Dependent Covariates

We used a proportional hazard setup with a time-dependent covariate that increases linearly over time, $x_{i} (t) = α_{0} + α_{1} t$ , obtaining $h_{i} (t) = h_{0} (t) e^{(x_{i} (t) β)} = h_{0} (t) e^{((α_{0} + α_{1} t) β)}$ . We compared the accuracy of the method in sampling time to events from parametric exponential (rate = 0.1), Gompertz (shape = 0.1, scale = 0.001), and Weibull (shape = 2, scale = 0.01) baseline hazards, considering a linear time-varying covariate ( $α_{0} = 0$ , and $α_{1} = 1$ ) with a log-hazard ratio ( $β = 1.02$ ) against those obtained using direct sampling (DS) from the inverse cumulative density functions obtained analytically.^11,12

The NPS method produced similar expected time to events for the 2 distributions compared with the DS method, from 1 million draws: exponential (8.61 NPS v. 8.52 DS), Gompertz (35.98 NPS v. 35.48 DS), and Weibull (8.79 NPS v. 8.02, DS). Their mean execution times in milliseconds, repeating the sampling process 100 times, were 38.28, 51.92, and 48.44, respectively.

Example 5: Drawing Time to Event from Hazards with Time-Dependent Covariates following Random Paths

We specify a time-varying covariate $x_{i} (t) = α_{0} + α_{1} y_{i} (t)$ assuming $y_{i} (t)$ follows a Gaussian random walking process $y_{i} (t) = y_{i} (t - 1) + ϵ_{i}$ , where $ϵ_{i} ~ Normal (μ = 0, σ = 0.5)$ , and generated 1,000 random paths over 100 y (Figure 5). We assume a Weibull baseline hazard, $h_{0} (t) = Weibull (shape = 1.3, scale = 30.1)$ , obtaining $h_{i} (t) = h_{0} (t) e^{(x_{i} (t) β)} = h_{0} (t) e^{((α_{0} + α_{1} y_{i} (t)) β)}$ . We used the multivariate categorical distribution to sample times to events from the individual-level Gaussian random walk processes and estimated an expected time to event of 27.88 y. Repeating the sampling process 100 times, the average sampling time was 4.27 ms.

Figure 5

(A) Individual-specific trajectories. (B) Individual-specific time-dependent hazards. Sample of 10 individuals.

Discussion

We developed a nonparametric method of sampling times to events with high computation efficiency. The NPS method uses a categorical distribution, which discretizes the hazard of events over a fixed and finite time period, assuming a piecewise hazard. We illustrated the NPS method with 5 examples that show common situations encountered when building DES models and provided their mean execution times. NPS can be used to sample the age of death from age-, sex-, race-, and year-specific life tables and/or times to smoking initiation or cessation from smoking histories.¹³ It can also be used to sample times to events with hazards that are functions of either time-independent or time-dependent covariates.

The proposed NPS method works similarly to previous methods when sampling ages of death from a life table for a specific group (e.g., White females born in 1980 in the United States) using a piecewise-constant exponential distribution.¹⁴ However, a strength of the proposed method is the use of multivariate categorical sampling, which extends the NPS method to simultaneously sample multiple ages of death from multiple life tables for different groups.

The NPS method accurately approximates the expected time to events from parametric distributions and can generate times to events from hazards for which no parametric distributions can be accurately fitted, such as time-varying hazards described by time-varying covariates. Once the probability distributions are derived from the observed hazards, the sampling process is computationally efficient and can be easily repeated multiple times. This approach can be very useful for individual-level models that require sampling times to events following processes that could not be appropriately addressed using parametric distributions.

Our approach does not provide criteria to determine the optimal time interval length and it is up to the user to define it. This may pose a limitation because selecting an excessively wide interval can result in distributions that do not resemble the observed hazard, such as those with extremely swift changes in their levels. However, this is a focus for future research. In addition, since this method uses a nonparametric categorical distribution, a sufficient number of samples must be drawn to obtain unbiased estimates. Our method assumes that the analyst is interested in sampling time to events from the mean process. However, if the analyst is interested in propagating the uncertainty of the estimated time-to-event process and has access to the mechanism generating the uncertainty of the estimation of the average process, the user can sample multiple hazards from this mechanism and apply the NPS method to each sampled hazard. Resampling the hazards and running the method on each sampled hazard accurately propagates the uncertainty in the estimated hazards into probabilistic sensitivity analysis using NPS.

We proposed a method that can efficiently sample times to event from any time-to-event process from its hazard, survival, or CDF over time. Moreover, this method can simultaneously sample from multiple different hazards with the multivariate categorical distribution, which we provide as R and Python functions in the Supplementary Material.

Supplemental Material

sj-pdf-1-mdm-10.1177_0272989X241308768 – Supplemental material for A Fast Nonparametric Sampling Method for Time to Event in Individual-Level Simulation Models

Supplemental material, sj-pdf-1-mdm-10.1177_0272989X241308768 for A Fast Nonparametric Sampling Method for Time to Event in Individual-Level Simulation Models by David U. Garibay-Treviño, Hawre Jalal and Fernando Alarid-Escudero in Medical Decision Making

Footnotes

Acknowledgements

We thank Rowan Iskandar for his valuable contributions to the code for the multivariate categorical sampling. We thank Karen Kuntz, Thomas Trikalinos, and Yuliia Sereda for providing feedback on an earlier version of this article.

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article. The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Financial support for this study was provided in part by grants from the National Cancer Institute as part of the Cancer Intervention and Surveillance Modeling Network. Dr. Alarid-Escudero is supported by grant U01CA253913, Dr. Jalal is supported by a Canada Research Chair, and Drs. Alarid-Escudero and Jalal are supported by grant U01CA265750. The funding agencies had no role in the design of the study, interpretation of results, or writing of the manuscript. The funding agreement ensured the authors’ independence in designing the study, interpreting the data, writing, and publishing the report.

Ethical Considerations

The authors did not carry out any human and/or animal studies for this publication submission. In addition, the authors of this article do not have any ethical considerations to disclose.

Consent to Participate

The authors did not carry out investigations involving humans for this publication submission.

Consent for Publication

Not applicable.

ORCID iDs

David U. Garibay-Treviño

Hawre Jalal

Fernando Alarid-Escudero

Data Availability

The R code to reproduce the examples and the function of the multivariate categorical distribution, present in the article, is provided in a public GitHub repository ().

Supplemental Material

Supplementary material for this article is available online at .

References

Karnon

Stahl

Brennan

Caro

Mar

Möller

Modeling using discrete event simulation: a report of the ISPOR-SMDM modeling good research practices task force–4. Med Decis Making. 2012;32(5):701–11. DOI: 10.1177/0272989X12455462

Arrospide

Ibarrondo

Blasco-Aguado

Larrañaga

Alarid-Escudero

Mar

Using age-specific rates for parametric survival function estimation in simulation models. Med Decis Making. 2024;44(4):359–64. DOI: 10.1177/0272989X241232967

Cinlar

Introduction to Stochastic Processes. Englewood Cliffs (NJ): Prentice-Hall; 1975.

Trikalinos

Sereda

The nhppp package for simulating non-homogeneous Poisson point processes in R. PLoS One. 2024;19(11):e0311311.

Lee

Wang

JW.

Statistical Methods for Survival Data Analysis. Hoboken (NJ): Wiley; 2013.

Klein

Moeschberger

ML.

Survival Analysis: Techniques for Censored and Truncated Data. New York (NY): Springer New York; 2003.

Elbasha

Chhatwal

Theoretical foundations and practical applications of within-cycle correction methods. Med Decis Making. 2016;36(1):115–31.

Elbasha

Chhatwal

Myths and misconceptions of within-cycle correction: a guide for modelers and decision makers. Pharmacoeconomics. 2016;34(1):13–22.

Hunink

MGM

Weinstein

Wittenberg

, et al. Decision Making in Health and Medicine: Integrating Evidence and Values. 2nd ed. Cambridge (UK): Cambridge University Press; 2014.

10.

Max Planck Institute for Demographic Research (Germany), University of California-Berkeley (USA), and French Institute for Demographic Studies (France). HMD. Human mortality database. 2021. Available from: https://www.mortality.org/Country/Country?cntr=USA

11.

Austin

PC.

Generating survival times to simulate cox proportional hazards models with time-varying covariates. Stat Med. 2012;31(29):3946–58. Available from: https://onlinelibrary.wiley.com/doi/abs/10.1002/sim.5452

12.

Ngwa

Cabral

Cheng

Gagnon

LaValley

Cupples

LA.

Generating survival times with time-varying covariates using the Lambert W function. Commun Stat Simul Comput. 2022;51(1):135–53. DOI: 10.1080/03610918.2019.1648822

13.

Jeon

Meza

Krapcho

Clarke

Byrne

Levy

Chapter 5: actual and counterfactual smoking prevalence rates in the U.S. population via microsimulation. Risk Anal. 2012;32(suppl 1):S51–68.

14.

Willekens

Multistate Analysis of Life Histories With R. Springer International Publishing; 2014. Available from: https://books.google.com.mx/books?id=Cd2CBAAAQBAJ

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.

0.00 MB

0.45 MB