Sage Journals: Discover world-class research

Abstract

Information on the age at onset distribution of the asymptomatic stage of a disease can be of paramount importance in early detection and timely management of that disease. However, accurately estimating this distribution is challenging, because the asymptomatic stage is difficult to recognize for the patient and is often detected as an incidental finding or in case of recommended screening; the age at onset is often interval-censored. In this paper, we propose a method for the estimation of the age at onset distribution of the asymptomatic stage of a genetic disease based on ascertained pedigree data that take into account the way the data are ascertained to overcome selection bias. Simulation studies show that the estimates seem to be asymptotically unbiased. Our work is motivated by the analysis of data on facioscapulohumeral muscular dystrophy, a genetic muscle disorder. In our application, carriers of the genetic causal variant are identified through genetic screening of the relatives of symptomatic carriers and their disease status is determined by a medical examination. The estimates reveal an early age at onset of the asymptomatic stage of facioscapulohumeral muscular dystrophy.

Keywords

Age at onset ascertainment current status data family data outcome-dependent sampling

1. Introduction

A reliable and accurate estimate of the age at onset distribution of a disease is of great importance for optimizing follow-up protocols of high-risk patients, aiming at early detection of the disease and timely start of treatment. For most diseases treatment is more effective if it is started in an early stage than at a later moment when the disease has progressed.

Diseases usually progress in stages. The early stage is typically asymptomatic. In this phase of the disease the patient is not aware of having the disease, because no symptoms with which the disease is usually associated are experienced. Diagnosis is most often in the symptomatic stage when symptoms appear. Although the disease is not apparent during the asymptomatic stage, some pathological changes may be detectable with a medical test. For common diseases like breast and colon cancer, population screening programs have been set up in many countries to identify the disease process during this asymptomatic phase so that intervention can be started at an early stage in the disease process. However, population screening is only offered for a limited scope of diseases, because of cost-effectiveness and possible profit for the patient. Small-scale screening is performed in high-risk sub-populations, typically characterized for certain rare genetic variants.^1,2

Since genetic traits aggregate within families, these high-risk sub-populations are often identified via family members carrying the genetic variant and being affected by the disease. Specifically, families who satisfy certain ascertainment criteria (i.e. “at least one affected carrier”) are selected and all family members (up to a certain degree) are invited for a genetic test. Individuals who carry the genetic variant of interest may be then (regularly) screened for the presence of the disease in the asymptomatic stage. The screening scheme and especially the age at which screening starts should depend on the risk the carrier will become asymptomatic. To determine this risk, an estimate of the age at onset distribution for the asymptomatic stage is needed.

Estimation of this age at onset distribution for the asymptomatic stage is challenging, because it should rely on data of selected families (based on the presence of the disease). Moreover, the exact age at onset is never observed for any of the individuals in the data-set. Instead, at every moment of screening it is observed whether an individual has the disease in asymptomatic stage or not; age at onset is interval-censored by the ages at time of screening. Sometimes, if the disease is not lethal or no treatment is available, individuals are screened only once. Then, only the age at time of screening is observed, as well as whether or not the individual is asymptomatic before examination took place. This type of censoring is referred to as interval-censored type I or current status.³

With interval-censored data, the exact age at onset is never observed for any individual; therefore, this kind of censoring differs greatly from right censoring. If the data are obtained from a population of independent, non-selected individuals, for instance with population screening, the distribution of the age at onset of the asymptomatic disease can be estimated with the non-parametric maximum likelihood estimator (NPMLE) (see, e.g. Zhang and Sun,³ Groeneboom and Wellner,⁴ Jewell and van der Laan,⁵ and Witte et al.⁶). In case of selected families, one often tries to correct for ascertainment bias by leaving out the index patients and assuming independence between the relatives. Under this independence assumption, the NPMLE could be used, but the estimator is still biased because of the suboptimal ascertainment correction (leaving out the index): it does not take into account the fact that ascertainment is based on the phenotype of all individuals in the pedigree and not on one particular person. The bias is especially large in data-sets with small pedigrees.⁷ Moreover, leaving out the data of the index patients from the analysis, means that valuable information is discarded, what is especially unfortunate in rare diseases for which data-sets are often small. A more sophisticated analysis has to be performed to properly correct for ascertainment and to use the data optimally. This could be done by considering an adjusted likelihood which conditions on the ascertainment event, for instance by maximizing the retrospective likelihood, the prospective likelihood, or a joint conditional likelihood (e.g. Carayol and Bonaiti-Pellie⁸ and Kraft and Thomas⁹).

In this paper, we propose a maximum likelihood-based method, adjusted for the ascertainment, to estimate the distribution of the age at onset of the asymptomatic stage of a disease. The likelihood function has a complex form as it takes into account all the available information: the interval-censored age at onset of the asymptomatic stage, the (right-censored) age at onset of the symptomatic stage, the ascertainment criteria, and also family characteristics that may affect the age at onset distributions. Simulation studies are performed to study the performance of the proposed estimator.

For many rare disease susceptible genetic variants, the number of detected pedigrees with this variant is small, like for the disorder that motivated this research. This complicates estimation of the age at onset distributions and hence the proposed statistical model cannot be too large (too many unknown parameters) to overcome overfitting of the data. A balance has to be found between the complexity of the model and the information in the data. The proposed model can also be applied if the sample size is small.

This work is motivated by a study on facioscapulohumeral muscular dystrophy (FSHD), a genetic muscle disorder. In order to learn more about the progression and causes of the disease, a cross-sectional observational study of families with at least two symptomatic family members was performed. All family members were offered a genetic test and a physical examination to determine if they had already entered the asymptomatic stage of the disease. No extra data were collected after the moment of examination. The age at onset for the asymptomatic stage is hence type I interval-censored. The date of examination was the same for all the participants in the study (cross-sectional).

The rest of the paper is organized as follows. In the second section, we introduce notation and establish a general framework for the problem. Ascertained-corrected conditional likelihood estimation is proposed in the third section. A simulation study is presented in the fourth section, while in the fifth section the methods are applied to a data-set of familial FSHD in the Netherlands. Main conclusions and a final discussion follow in the final section.

2. Data, notation, and assumptions

In case of genetic diseases, carriers of the disease susceptible variant are often identified via already detected carriers; relatives of affected carriers are invited for genetic testing. For data analysis and estimation of the age at onset distribution of the disease, usually only data of the proven carriers from ascertained families are included in the data-set (and the non-carriers are left out). Below, notation for carriers of the causal variant is introduced.

Let U and T be the ages at onset of the asymptomatic and symptomatic stage, with $U \leq T$ almost surely. The age at examination (genetic test and physical examination) is denoted by C which we assume to be independent of U and T. This is a reasonable assumption in our setting: due to the cross-sectional nature of the study all pedigrees fulfilling the ascertainment criteria are retrospectively selected and subsequently screened at the same chronological time (see also Section 5, the application). Further, define the indicator function $Δ = I (T \leq C)$ as 1 if $T \leq C$ and 0 otherwise. The indicator function $Σ = I (U \leq C)$ is defined similarly.

The three possible configurations of observations are C < U < T, $U \leq C < T$ , and $U \leq T \leq C$ . In the first situation, C < U < T, the individual does not have any symptoms of the disease at the time of examination; U and T might occur after C, but it could also be that they never occur before the death of the individual. In the second situation, $U \leq C < T$ , the disease is diagnosed during the examination, but the individual does not present any symptoms yet; the individual is in the asymptomatic stage. In the third situation, $U \leq T \leq C$ , the individual has the disease and also notices symptoms at the time of examination, the individual is symptomatic.

We assume that all individuals with a positive genetic test for the variant of interest are physically examined and included in the data-set. For a single carrier, the triple $(Σ, T Δ, C)$ is observed. The age at time of onset of symptoms, T, is only observed if $T \leq C$ , so if Δ = 1. This means that the age of appearance of symptoms is known for symptomatic patients only. Further, U is never observed, but the presence or absence of symptoms at examination allows to observe Σ, i.e. we can determine if the asymptomatic stage is entered before or after examination. The age at time of examination, C, is always observed (if an individual is symptomatic at age C, (s)he will not be examined again, but it is known at what time this should have taken place if the individual was not symptomatic). The observed data for each of the three situations are represented in Figure 1.

Figure 1.

Presentation of the observations in the three configurations. There are no observations after C and the exact moment U took place is not indicated, because this is unknown.

We assume that U and T follow (parametric) distributions, $U | x \sim H_{η, x}$ and $T | x \sim F_{θ, x}$ , where x is a vector of covariates and η and θ are unknown parameters. The covariates in $H_{η, x}$ and $F_{θ, x}$ may differ, but for ease of notation we use x for both. The distribution for C, the age at time of examination, is denoted by G and is left unspecified for the moment. An overview of the notation is given in Table 1. The number of pedigrees in the data-set is denoted as r, the number of carriers in these pedigrees as $n_{j}, j = 1, \dots, r$ , and the total number of carriers in the data-set as $n = \sum_{j = 1}^{r} n_{j}$ .

Table 1.

Definition of variables and distributions.

Variable	Meaning	Distribution
T	Age at onset of symptomatic disease for a carrier	$F_{θ, x}, f_{θ, x}$
U	Age at onset of asymptomatic disease for a carrier	$H_{η, x}, h_{η, x}$
C	Age at time of examination/screening for a carrier	G, g
Δ	Indicator function $Δ = I (T \leq C)$
Σ	Indicator function $Σ = I (U \leq C)$

In the paper it is assumed that, conditional on the covariates and the genetic status of the individuals, the ages at onset of the asymptomatic and the symptomatic stages are independent between individuals.

For tested non-carriers, the ages at time of genetic testing are known (from the records). These data can be used for estimating the distribution G. This is under the assumption that the distributions of the age at testing are equal for carriers and non-carriers. This is a reasonable assumption, because carriers and non-carriers do not know their carrier status when they decide to be genetically tested or not.

3. Ascertained-corrected conditional likelihood estimation

Our main goal is to obtain an accurate estimator of the distribution of the age at onset of the asymptomatic stage for carriers of the variant, $H_{η, x}$ . We propose a two-stage approach. The main idea is to first estimate G and θ by maximizing a likelihood function, L^symp, for the symptomatic data $(T Δ, C)$ only. These estimates are inserted in the likelihood function, L^full, for all data $(Σ, T Δ, C)$ and this function is maximized with respect to η.

In the derivation of the likelihood functions, we assume that, conditional on the covariates, the observations of all carriers are independent, notwithstanding the pedigrees they belong to. In the notation, no discrimination with respect to the pedigrees is necessary and all carriers are identified with a single and unique index i: the variables $(U_{i}, T_{i}, C_{i}, Σ_{i}, Δ_{i})$ for individual i.

In Appendix 1, a derivation is given of the ascertainment-corrected prospective likelihood functions L^symp and L^full. The likelihood function based on the symptomatic data $(T_{i} Δ_{i}, C_{i}), i = 1, \dots, n$ only, is given by

L^{symp} = \frac{\prod_{i = 1}^{n} f_{θ, x_{i}} {(T_{i})}^{Δ_{i}} {(1 - F_{θ, x_{i}} (C_{i}))}^{1 - Δ_{i}} g (C_{i})}{\prod_{j = 1}^{r} P (A_{j})}

(1)where A_j is the ascertainment event for pedigree j and

P (A_{j})

the probability this event occurred. The ascertainment-corrected prospective likelihood function, L^full, for the full data

(Σ_{i}, T_{i} Δ_{i}, C_{i}), i = 1, \dots, n

is given by

L^{full} = \frac{\prod_{i = 1}^{n} g (C_{i}) {(1 - H_{η, x_{i}} (C_{i}))}^{(1 - Σ_{i}) (1 - Δ_{i})} {(H_{η, x_{i}} (C_{i}) - F_{θ, x_{i}} (C_{i}))}^{Σ_{i} (1 - Δ_{i})} f_{θ, x_{i}} {(T_{i})}^{Σ_{i} Δ_{i}}}{\prod_{j = 1}^{r} P (A_{j})}

(2)

Since the pedigrees are not randomly sampled from the population, the likelihood functions are conditional on the ascertainment events $A_{j}, j = 1, \dots, r$ . The exact expression of the probability $P (A_{j})$ in the denominator of the likelihood functions depends on the ascertainment rules. These rules may vary over studies, depending on the severity and prevalence of the disease, but are usually based on observations on the symptomatic stage only. As a consequence, the product $\prod_{j = 1}^{r} P (A_{j})$ is a function of the distributions G and $F_{θ, x}$ only. As an example, suppose that a pedigree is ascertained if at least one individual among the carriers is symptomatic at the time of examination. The probability this ascertainment event occurs for family j, $P (A_{j})$ , equals 1 minus the probability that none of the carriers was symptomatic. That means that

P (A_{j}) = 1 - {(\int 1 - F_{θ, x} d G)}^{n_{j}}

(3)

In Section 5, we give an explicit expression of $P (A_{j})$ for our real data application based on the ascertainment of at least two symptomatic carriers at the time of examination.

3.1. Estimation of parameters

From a theoretical perspective the full likelihood, L^full, could be maximized with respect to all unknown parameters to obtain their maximum likelihood estimates. However, in practice we noticed that optimization algorithms often do not converge to the global maximum, probably because the parameter space is too big and the algorithm stops at a local and not the global maximum. Therefore, we propose to perform the estimation of the unknown parameters in the model in two steps. Simulation studies in Section 4 show good performance of this two-step procedure. The two steps are given by:

Distribution G and parameter θ are estimated based on the likelihood function L^symp in equation (1). More details are given below.

After inserting the estimates $\hat{G}$ and $F_{\hat{θ}, x}$ into the likelihood function L^full in equation (2), it is maximized with respect to η. If the ascertainment is based on the symptomatic stage only (what is usually the case), this boils down to maximizing

\prod_{i = 1}^{n} {(1 - H_{η, x_{i}} (C_{i}))}^{(1 - Σ_{i}) (1 - Δ_{i})} {(H_{η, x_{i}} (C_{i}) - F_{\hat{θ}, x_{i}} (C_{i}))}^{Σ_{i} (1 - Δ_{i})}

with respect to η.

In the first step of the estimation algorithm, G and θ are estimated without using data on the asymptomatic status (Σ). In principle, this does not affect the unbiasedness of the estimators, but because not all information is used standard errors might slightly increase. If G is assumed to follow a parametric distribution, G and θ can be estimated by maximizing the likelihood function L^symp in equation (1). Alternatively, G could be estimated by the empirical distribution of the ages at time of examination of the relatives with a negative genetic test (non-carriers). This estimator is asymptotically unbiased. If no data of non-carriers are available, the ages at examination of the carriers could be used instead. This yields an estimator that is asymptotically slightly biased. An extensive discussion on this topic, including simulation studies, is given in Jonker et al.⁷

Our two-step estimation procedure implies that two maximizations are required, but the dimension of the parameters space in each maximization problem is reduced. This lowers the computational complexity of the problem. Of course, the likelihood L^full in equation (2) could also be maximized with respect to all parameters simultaneously, but, as mentioned before, the two-step procedure shows more stable results (this has been checked with simulations).

3.2. Variance estimation

Variance estimates of $\hat{η}$ and $\hat{θ}$ are derived with a parametric bootstrap. A bootstrap resample is obtained as follows:

Randomly select a pedigree from the original data-set. Say pedigree j is selected.

For each of the n_j family members i of the selected pedigree j, draw $U_{i}^{*}$ and $T_{i}^{*}$ from $H_{\hat{η}, x_{i}}$ and $F_{\hat{θ}, x_{i}}$ , respectively as described in Section 4.1.

To guarantee that the resulting bootstrap resample is similar to the original data-set in terms of family size and structure, the previous step is repeated until the simulated phenotypes of the carriers in the pedigree satisfy the ascertainment condition.

These three steps are repeated until the bootstrap resample contains the same number of pedigrees as the original data-set. For each bootstrap resample b, we apply the proposed estimation procedure described in Section 3.1. to obtain bootstrap estimates of the parameters of interest $({\hat{η}}^{b}, {\hat{θ}}^{b})$ . This procedure is repeated a large number of times, B, and the variances of $\hat{η}$ and $\hat{θ}$ are computed empirically from the B bootstrap resamples. Confidence intervals for θ and η can be constructed based on these variance estimates. Further, pointwise confidence intervals for $H_{η} (t)$ and $F_{θ} (t)$ can be constructed by their 2.5 and 97.5% quantiles of the bootstrap estimates of $H_{{\hat{η}}^{b}} (t)$ and $F_{{\hat{θ}}^{b}} (t)$ .

By constructing the confidence intervals for the unknown parameters in this way, the inaccuracy of the estimators for G and θ is reflected in the width of the confidence interval for η.

4. Simulation study

We conduct a simulation study to illustrate the performance of the proposed estimation procedure. The setup and the results are described below.

4.1. Simulation setup

We assume that the variables for age at onset of the asymptomatic and symptomatic stages, U and T, follow gamma distributions. Further, inspired by our real data example, we assume that the variable for the age at time of examination, C, is independent of (U, T). We consider three basic scenarios which resemble relevant situations in practice. In scenario 1, we assume a situation in which the asymptomatic stage is short; the onset of the asymptomatic and symptomatic phases of the disease is close to each other. Namely, we assume that the expected age at onset of the asymptomatic stage, U, has mean $E U = 60$ with standard deviation sd $(U) = 8.5$ , and that the expected age at onset of the symptomatic stage, T, is $E T = 66$ with sd $(T) = 8.9$ . Scenario 2 corresponds to a situation with an average longer time between the onset of the asymptomatic and symptomatic phases. Specifically, we assume that the expected age at onset of the asymptomatic stage, U, has mean $E U = 48$ and sd $(U) = 7.6$ , and that the expected age at onset of the symptomatic stage, T, is $E T = 72$ with sd $(T) = 9.2$ . Scenario 3, which closely resembles our real data application, is characterized by an early expected age at onset of the asymptomatic phase ( $E U = 20$ ), large latency period before the onset of the symptomatic phase ( $E T = 60$ ), and large variation in the age-of-onset of both the asymptomatic and symptomatic phases (sd $(U) = 20$ and sd $(T) = 34.6$ ) (see Figure 2 for the curves). In the steps below, the corresponding shape and scale parameters are given.

Figure 2.

Results of the simulation study with m = 500 families. Top: scenario 1 ( $k_{1} = 50, {\tilde{k}}_{1} = 5, θ_{1} = 1.2$ ). Middle: scenario 2 ( $k_{1} = 40, {\tilde{k}}_{1} = 20, θ_{1} = 1.2$ ). Bottom: scenario 3 ( $k_{1} = 1, {\tilde{k}}_{1} = 2, θ_{1} = 20$ ). Left panels: small families ( $n_{j} \in {1, 2, 3, 4}$ ). Right panels: large families ( $n_{j} \in {3, 6, 9, 12}$ ). Solid black and gray lines are the true distribution functions $H_{(k_{1}, θ_{1})}$ and $F_{(k_{1} + {\tilde{k}}_{1}, θ_{1})}$ , respectively. Dashed lines form a band based on estimated pointwise 2.5 and 97.5% percentiles based on M = 1000 Monte Carlo trials.

We simulate M = 1000 Monte Carlo trials, each consisting of m families (m is taken equal to 100, 500, or 1000). Additionally, in scenario 3 we also considered m = 25 to mimic our real data setting with regard to the reduced number of observed families. m = 25 was not considered in scenarios 1 and 2 since it led to extremely low numbers of observed families so that inference was meaningless.

For each family j, $j = 1, \dots, m$ , we follow the steps given below:

Simulate family size n_j. In order to check the impact of the family size on the performance of our method, two situations are considered: populations composed of “small” families (n_j is sampled from ${1, 2, 3, 4}$ with equal probability) and populations composed of “large” families (n_j is sampled from ${3, 6, 9, 12}$ with equal probability).

For each family member $i, i = 1, \dots, n_{j}$ , simulate U_i from a gamma distribution with shape and scale parameters k₁ ( $k_{1} = 50$ in scenario 1, $k_{1} = 40$ in scenario 2, $k_{1} = 1$ in scenario 3) and θ₁ ( $θ_{1} = 1.2$ in scenarios 1 and 2, $θ_{1} = 20$ in scenario 3).

For each family member i, $i = 1, \dots, n_{j}$ , simulate the time between entering the asymptomatic and the symptomatic stage, ${\tilde{U}}_{i}$ , from a gamma distribution with shape and scale parameters ${\tilde{k}}_{1}$ ( ${\tilde{k}}_{1} = 5$ in scenario 1, ${\tilde{k}}_{1} = 20$ in scenario 2, ${\tilde{k}}_{1} = 2$ in scenario 3) and scale parameter θ₁ (equal to the scale parameter used to generate U_i).

For each family member i, $i = 1, \dots, n_{j}$ , define $T_{i} = U_{i} + {\tilde{U}}_{i}$ as the age at onset of the symptomatic stage of the disease. Since U_i and ${\tilde{U}}_{i}$ are independent, T_i follows a gamma distribution with shape parameter $k_{1} + {\tilde{k}}_{1}$ and scale parameter $θ_{1} = 1.2$ in scenarios 1 and 2 and 20 in scenario 3.

For each family member i, $i = 1, \dots, n_{j}$ , simulate the age at time of examination C_i from a uniform distribution at the interval [20, 70].

If the ascertainment event A_j occurred, the n_j carriers of family j are ascertained. In our simulation setting, families are selected if at least one symptomatic family member is identified (at least one family member with $T_{i} \leq C_{i}$ ).

Independently of the simulations in the previous steps, a sample of 1000 observations from the uniform distribution at [20, 70] is simulated (the same distribution from which was simulated in step 5). The simulated values represent the ages at time of genetic testing of the individuals who got a negative test result. These data are used to estimate G (as explained in Section 3.1).

In the simulation study, families are ascertained if at least one family member is symptomatic at the time of examination, like in the example given earlier. As a result of the ascertainment, the effective sample that is used for estimation is smaller than m families; the effective sample size is denoted by r. To compute confidence intervals of the estimators, we set the number of bootstrap resamples equal to B = 500.

In practice, often some carriers in an ascertained pedigree do not participate in the study. It is likely that those carriers do not show symptoms of the disease. To check the impact of this form of informative missing data, we perform a second simulation study in which every non-symptomatic family member (Δ = 0) is excluded from the sample with a probability of either 0.20 or 0.50.

All computations are performed using the statistical software R (R Core Team, 2018). The function nlminb R function is employed to maximize the log-likelihood functions (1) and (2).

4.2. Simulation results

The main results of the simulation study are shown in Figure 2 and in Table 2. In each graphic in Figure 2, the solid black and gray lines represent the true distributions of $H_{(k_{1}, θ_{1})}$ and $F_{(k_{1} + {\tilde{k}}_{1}, θ_{1})}$ , respectively. The dashed lines represent the 2.5 and 97.5% estimated pointwise percentile curves across the M = 1000 Monte Carlo trials. The left graphics in Figure 2 refer to the “small families” situation ( $n_{j} \in {1, 2, 3, 4}$ ), whereas the right graphics refer to the “large families” situation ( $n_{j} \in {3, 6, 9, 12}$ ). Top panels refer to scenario 1 with $k_{1} = 50, {\tilde{k}}_{1} = 5$ and $θ_{1} = 1.2$ , middle panels refer to scenario 2 with $k_{1} = 40, {\tilde{k}}_{1} = 20$ and $θ_{1} = 1.2$ , and the bottom panels refer to scenario 3 with $k_{1} = 1, {\tilde{k}}_{1} = 2$ and $θ_{1} = 20$ .

Table 2.

reBias, standard deviation (SD), and coverage probabilities of the 95% confidence intervals (Cov) for the location and scale parameters of the age at onset gamma distributions of U (k₁, θ₁) and T ( $k_{1} + {\tilde{k}}_{1}, θ_{2}$ ) along 1000 trials for several sample sizes ( $m \in {100, 500, 1000}$ families) and effective sample sizes, defined as the mean number of ascertained families along the 1000 trials ( $\bar{r}$ ).

		Small families				Large families
Parameter	m	$\bar{r}$	reBias	SD	Cov	$\bar{r}$	reBias	SD	Cov
Scenario 1
	100	26	0.736	–	–	58	0.054	10.594	0.921
	500	132	0.076	16.088	0.802	289	0.014	4.299	0.917
k₁ = 50	1000	264	0.033	9.791	0.792	579	0.002	3.076	0.896
	100	26	0.656	–	–	58	−0.013	0.244	0.900
	500	132	0.005	0.337	0.937	289	−0.006	0.105	0.926
θ₁ = 1.2	1000	264	0.002	0.225	0.956	579	0.002	0.076	0.919
	100	26	0.093	–	–	58	0.025	8.623	0.962
	500	132	0.020	8.000	0.969	289	0.008	3.771	0.958
$k_{1} + {\tilde{k}}_{1}$ = 55	1000	264	0.009	5.656	0.951	579	0.003	2.485	0.960
	100	26	0.076	–	–	58	0.001	0.201	0.940
	500	132	0.003	0.189	0.932	289	−0.002	0.088	0.948
θ₂ = 1.2	1000	264	0.003	0.136	0.942	579	−0.001	0.059	0.960
Scenario 2
	100	13	1.384	–	–	34	0.070	10.464	0.969
	500	65	0.186	20.123	0.941	167	0.013	4.288	0.925
k₁ = 40	1000	130	0.067	10.792	0.922	333	0.008	2.836	0.946
	100	13	−0.050	–	–	34	−0.013	0.268	0.939
	500	65	−0.036	0.402	0.904	167	−0.002	0.125	0.939
θ₁ = 1.2	1000	130	<0.001	0.279	0.945	333	−0.003	0.084	0.961
	100	13	0.389	–	–	34	0.041	15.561	0.975
	500	65	0.046	15.168	0.971	167	0.007	6.391	0.958
$k_{1} + {\tilde{k}}_{1}$ = 60	1000	130	0.014	10.036	0.961	333	0.005	4.483	0.957
	100	13	0.324	–	–	34	0.031	0.341	0.937
	500	65	0.036	0.400	0.928	167	0.006	0.144	0.944
θ₂ = 1.2	1000	130	0.023	0.246	0.946	333	0.001	0.101	0.953
Scenario 3
	100	66	0.309	–	–	92	0.027	0.332	0.930
	500	329	0.046	0.320	0.891	462	0.009	0.139	0.935
k₁ = 1	1000	658	0.015	0.219	0.897	924	0.003	0.097	0.937
	100	66	0.552	–	–	92	0.061	6.241	0.946
	500	329	0.038	5.822	0.958	462	0.009	2.378	0.954
θ₁ = 20	1000	658	0.024	3.888	0.966	924	0.005	1.648	0.951
	100	66	0.021	–	–	92	0.004	0.224	0.953
	500	329	0.006	0.186	0.942	462	0.001	0.101	0.944
$k_{1} + {\tilde{k}}_{1}$ = 3	1000	658	0.002	0.017	0.943	924	<0.001	0.068	0.954
	100	66	0.015	–	–	92	0.005	4.051	0.954
	500	329	<0.001	1.786	0.942	462	0.002	0.904	0.949
θ₂ = 20	1000	658	0.002	1.310	0.937	924	<0.001	0.624	0.954

For small families and m = 100, the SDs are not reliable and left out from the table. reBias: relative bias.

From these graphics, the proposed estimators seem to be unbiased. The median estimated curves based on the median parameter estimate along the M = 1000 Monte Carlo trials cannot be distinguished from the theoretical distributions, and the bands formed by the 2.5 and 97.5% Monte Carlo percentiles nicely cover the theoretical curves in all the studied scenarios. In each Monte Carlo trial, data of m = 1000 families are simulated. Since not all of these m families satisfied the ascertainment criteria, the effective number of families included for analysis in each Monte Carlo trial, denoted as r, is considerably lower (see Table 2 for details). This might be the reason for the wide percentile band in the right upper graphic in Figure 2.

Table 2 complements Figure 2 and provides further results of the simulation study. For each of the studied scenarios, we provide results on mean estimated relative bias (reBias) (defined as the difference between the simulated mean and true parameter value divided by the true value), empirical standard deviation, and coverage probabilities across the 1000 Monte Carlo trials of the scale and shape parameters of the distribution of $H_{(k_{1}, θ_{1})}$ and $F_{(k_{1} + {\tilde{k}}_{1}, θ_{1})}$ . The results regarding bias reinforce some of the findings observed in Figure 2. Within the same scenario, the populations composed of large families provide better results than the populations composed of small families, as is expected. Results also improve by increasing the sample size, showing lower reBias and less variability (SD). With regard to the coverage probabilities, we observe that the coverage probabilities are close to 0.95 in all the studied scenarios. This indicates the good performance of our bootstrap approach to estimate the standard errors.

Special mention deserves scenario 3, large families and m = 25 (corresponding to a mean number of ascertained families along the 1000 trials of $\bar{r} = 17$ ), since it resembles our real data setting and hence gives valuable information about the expected performance of our method in our motivating data-set. In this case, we observe reasonable values of reBias (0.202 for k₁, 0.667 for θ₁, 0.025 for $k_{1} + {\tilde{k}}_{1}$ , and 0.005 for θ₂) and coverage probabilities close to 0.95. In terms of standard deviation (0.753 for k₁, 237 for θ₁, 0.467 for $k_{1} + {\tilde{k}}_{1}$ , and 4.153 for θ₂), the results are also reasonable with exception of k₁, with a very large value of standard deviation due to extreme results in some of the 1000 trials. This should be interpreted as symptoms of instability of our method with very small samples even if the overall performance is reasonably good in this setting.

In the second simulation study, we evaluate the level of introduced bias in our estimates due to informative missing. For scenario 1 ( $k_{1} = 50, {\tilde{k}}_{1} = 5, θ_{1} = 1.2$ ), and scenario 2 ( $k_{1} = 40, {\tilde{k}}_{1} = 20, θ_{1} = 1.2$ ), and for missing probabilities 0.20 and 0.50 of non-symptomatic family members, this bias is visualized in Figure 4 in Appendix 3 (scenario 1 in the graphics on the left and scenario 2 on the right). In scenario 1, the distributions of the asymptomatic and symptomatic stages are both overestimated if non-symptomatic family members do not participate in the study. The bias is smaller for $H_{(k_{1}, θ_{1})}$ than for $F_{(k_{1} + {\tilde{k}}_{1}, θ_{1})}$ , which is interesting since the estimation $H_{(k_{1}, θ_{1})}$ is our main objective. In scenario 2 this phenomenon is even more pronounced for $F_{(k_{1} + {\tilde{k}}_{1}, θ_{1})}$ and the bias in the estimation of $H_{(k_{1}, θ_{1})}$ is negligible. Also, in general, the reBias is larger if the proportion of missing family members increases. Assuming a missing probability of 0.20 in scenario 1, the reBias for $F_{(k_{1} + {\tilde{k}}_{1}, θ_{1})}$ is around 13% at 50 and 60 years old, and reduces to 7% at age 70. The reBias for $H_{(k_{1}, θ_{1})}$ is lower at all ages, around 4% at ages 50 and 60 years, and reduces to 2% at age 70. Assuming a missing probability of 0.50, the reBias for $F_{(k_{1} + {\tilde{k}}_{1}, θ_{1})}$ increases to 40% at the ages 50 and 60 years, and to 20% at age 70. The reBias for $H_{(k_{1}, θ_{1})}$ also increases, reaching 10% at ages 50 and 60 years and to 5% at age 70. Similar results are found in scenario 2 in the estimation of $F_{(k_{1} + {\tilde{k}}_{1}, θ_{1})}$ while the bias in the estimation of $H_{(k_{1}, θ_{1})}$ remains negligible when increasing the proportion of missing individuals. The small bias in the estimation of $H_{(k_{1}, θ_{1})}$ is likely due to the fact that the missing data mechanism relies on the indicator $Δ = I (T > C)$ and hence it is only informative about U given its association with T. As a result, less bias for estimating $H_{(k_{1}, θ_{1})}$ than for $F_{(k_{1} + {\tilde{k}}_{1}, θ_{1})}$ is expected. The association between T and U is larger in scenario 1 than in scenario 2 which explains the lower observed bias in scenario 2. In summary, even if missing members is a potential problem and it introduces systematic bias, we expect its impact will be limited.

5. Motivating example

This work was motivated by a study on FSHD, a genetic muscle disorder. The severity of this disease is associated with a specific form of genetic lesion, the loss of repetitions of the D4Z4 unit.¹⁰ Individuals without a loss of units (they have at least 10 units) are considered to be healthy and are not susceptible to develop the muscle disorder. It is expected that the age at onset of the asymptomatic and the symptomatic stage is also associated with the number of repetitions of this unit.

The data come from a cross-sectional study in which at a fixed and non-informative calendar time, all affected pedigrees in the Netherlands with at least one affected member among the first and second degree of the index patient (the first diagnosed patient in a family) were invited to participate in the study (see Wohlgemuth et al.¹¹ for details). So, the ascertained families have at least two affected individuals with both a loss of repetitions of the D4Z4 unit: the index patient and a relative. Data of 10 pedigrees consisting of in total 155 individuals are available. Of these 155 individuals, 69 present loss of repetitions of the D4Z4 unit at some degree (the so-called carriers) and 86 have no genetic alteration (the non-carriers). All individuals within a pedigree with a loss of repetitions have an equal number of repetitions. An overview of the carrier-data is given in Table 3.

Table 3.

Overview of data: For every pedigree, the number of units, carriers, and symptomatic and asymptomatic carriers are given.

Pedigree no.	1	2	3	4	5	6	7	8	9	10
No. of D4Z4 units among carriers	4	5	5	6	6	6	7	7	9	9
No. of carriers in the pedigree	5	9	8	5	7	13	5	3	8	6
No. of symptomatic carriers	5	9	7	5	2	3	3	2	2	2
No. of asymptomatic carriers	0	0	1	0	5	6	1	1	2	1

A carrier is defined as an individual with a loss of repetitions of the D4Z4 unit.

We consider two parametric models: the Weibull distribution for both $H_{η, x}$ and $F_{θ, x}$ and the gamma distribution for both (see Appendix 2 for details) and covariate x equals an indicator function that indicates whether an individual has less than 7 or at least 7 (i.e. 7, 8, or 9) repetitions of the D4Z4 unit. (We also included the number of repetitions as a continuous variable, but since the linearity assumption does not seem to hold, it is not considered further.)

For n_j the number of individuals in pedigree j in the data-set, the probability the ascertainment event occurs (at least two symptomatic patients at examination), equals 1 minus the probability that none or only one of the n_j individuals is symptomatic

P (A_{j}) = 1 - {(\int 1 - F_{θ, x} d G)}^{n_{j}} - n_{j} {(\int 1 - F_{θ, x} d G)}^{n_{j} - 1} \int F_{θ, x} d G

(4)where the term

{(\int 1 - F_{θ, x} d G)}^{n_{j}}

equals the probability that none of the family members has symptoms at the time of examination and

n_{j} {(\int 1 - F_{θ, x} d G)}^{n_{j} - 1} \int F_{θ, x} d G

equals the probability that exactly one individual has symptoms at the time of examination. This probability is inserted in the denominators of the likelihood functions in equations (1) and (2).

We estimate G by the empirical distribution function of the ages at time of examination C of the individuals with and without a loss of repetitions together. Since the number of observations is low, we chose to combine the data when estimating G. Next, we follow the estimation procedure as described in Section 3.

Based on the value of the likelihood (or AIC), the gamma model fits slightly better than the Weibull model. The actual estimates of $H_{η, x}$ and $F_{θ, x}$ for both models are very similar. Only the estimates of $F_{θ, x}$ for the group with 7–9 units and after the age of 45 seem to diverge; the estimate based on the gamma distribution increases to almost 0.6 at the age of 60, whereas the estimate based on the Weibull distribution reaches 0.4 at this age. This is probably due to the fact that the data-set is relatively small and most individuals in the data-set have not reached this age at time of examination. The form of the estimated parametric distributions (i.e. parameters) is therefore mainly determined by the events before the age of 45 and the curve is extrapolated after the age of 45. As a consequence one should be careful with drawing conclusions at higher ages. The estimates based on the gamma models are given in Figure 3. The estimation procedure was repeated for G estimated by the empirical distribution of data of individuals with no genetic alteration; the results are similar. The pointwise 95% confidence intervals constructed with the parametric bootstrap method are wide, especially for $H_{η, x}$ at younger age. This is possibly because no data of individuals below the age of 20 are available.

Figure 3.

Solid lines: Estimates of $H_{η, x}$ (top panels) and $F_{θ, x}$ (bottom panels) for 4, 5, or 6 repetitions of the D4Z4 unit (left) and for 7, 8, or 9 repetitions (right). Dashed lines: corresponding pointwise 95% confidence intervals.

The estimates of the age at onset distribution of the asymptomatic and the symptomatic stage of FSHD show that these functions depend on the covariate repeat size and both increase until late adulthood. These estimates can be used in counseling and help in understanding progression of the disease over time. However, the number of individuals on which the estimates are based is low and should be interpreted with care.

R-code for maximizing the log-likelihood functions is provided in Appendix 4.

6. Discussion

In this paper, we have proposed a maximum likelihood-based method for estimating the age at onset distribution for the asymptomatic stage of a genetic disease using clinically ascertained pedigree data. Simulation studies showed that as long as the sample is not too small, our estimation method yields accurate results. Estimates of this distribution are of great importance for setting up follow-up programs in high-risk families, for instance families with a genetic variant associated with susceptibility of a disease.

The estimates of the age at onset distribution of the asymptomatic and symptomatic stage of FSHD, found in the application, can be used to learn more about the progression of the disease. Since there is no treatment available and the disease is not life threatening, one could argue whether screening is necessary, but eventually this decision will be made by the family and their medical doctor.

In this paper, we considered the situation of a cross-sectional study which took place at a randomly chosen moment (set by the researcher), without any follow-up afterwards. However, regular screening of patients with a high disposition of a slowly developing disease is quite common.^1,2 To fit these kind of screening data, the expressions of the likelihood functions need to be adjusted, but the underlying principle of estimating the age at onset distributions for the asymptomatic and the symptomatic stage in two steps remains valid.

Measured family characteristics can be included in the model via covariates. To account for unmeasured family characteristics, a frailty term (random family effect) could be added to the model (see, e.g. Gong et al.,¹² Hsu et al.,¹³ Hsu and Gorfine,¹⁴ and Gorfine et al.¹⁵). In a shared frailty model, every family has its own frailty that describes the susceptibility of the family members to develop the disease compared to the population of interest. This model could be further generalized to correlated frailty models in which every individual in the family has its own frailty term, but these terms are correlated within families (and are independent between families). This gives the model more flexibility. However, for fitting these models sufficient data must be available; the number of pedigrees and the number of individuals in the pedigrees must be sufficiently large. This is certainly not the case in our application, but including frailties in the model is an interesting topic for further research.

Since we are considering the distinct stages (healthy, asymptomatic, symptomatic), our data could be modeled with a multi-state model.¹⁶ However, as far as we know, multi-state modeling with data that are ascertained based on the outcome is still an open problem.

In this paper, we have assumed that the medical tests at screening moments are fully sensitive and specific. In population screening programs like breast and colon cancer, the screening tests are often imperfect; the sensitivity and specificity of the test are below 100%.⁶ To preclude unnecessary treatment, a screening test is followed by a confirmative test in case of a positive screening test. The confirmative test is assumed to be 100% sensitive and specific. In family studies, only high-risk individuals (carriers) are screened. For this purpose usually the most accurate medical test is applied and an assumption of full sensitivity and specificity is reasonable. Otherwise, the expression of the likelihood must be adapted, but the methodology will be the same.

To conclude, reliable estimates of the age at onset distribution of the asymptomatic stage are important for setting up personal follow-up screening in high-risk families. The methodology described in this paper is therefore of great relevance.

Footnotes

Acknowledgements

We like to thank M. Wohlgemuth and N. Voermans for collecting and making available the FSHD-data that were analyzed in Section 5. Further, we like to thank the reviewers for their helpful comments.

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

Appendix 1

References

Halk

Potjer

Kukutsch

, et al. Surveillance for familial melanoma: recommendations from a national centre of expertise. Br J Dermatol 2019; 181(3): 594–596.

Kennelly

Gryfe

Winter

DC.

Familial colorectal cancer: patient assessment, surveillance and surgical management.

Eur J Surg Oncol 2017; 43: 294–302.

Zhang

Sun

Interval censoring.

Stat Method Med Res 2010; 19: 53–70.

Groeneboom

Wellner

JA.

Information bounds and nonparametric maximum likelihood estimation. Berlin: Birkhauser Verlag, 1992.

Jewell

van der Laan

Current status data: review, recent developments and open problems. In: Balakrishnan N and Rao CR (eds) Handbook of statistics. Elsevier, Vol. 23. 2003, pp.625–642.

Witte

Berkhof

Jonker

MA.

An EM algorithm for nonparametric estimation of the cumulative incidence function from repeated imperfect test results.

Stat Med 2017; 36(21): 3412–3421.

Jonker

Rijken

Hes

, et al. Estimating the penetrance of pathogenic gene variants in families with missing pedigree information. Stat Method Med Res 2019; 28(10–11): 2924–2936.

Carayol

Bonaiti-Pellie

Estimating penetrance from family data using a retrospective likelihood when ascertainment depends on genotype and age of onset.

Genet Epidemiol 2004; 27: 109–117.

Kraft

Thomas

DC.

Bias and efficiency in family-based gene-characterization studies: conditional, prospective, retrospective, and joint likelihoods.

AJHG 2000; 66: 1119–1131.

10.

Mull

Lemmers

RJLF

Horlings

CGC

, et al. Phenotype-genotype relations in facioscapulohumeral muscular dystrophy type 1. Clin Genet 2018; 94: 521–527.

11.

Wohlgemuth

Lemmers

Jonker

, et al. A family-based study into penetrance in facioscapulohumeral muscular dystrophy type 1. Neurology 2018; 91(5): e444–e454.

12.

Gong

Hannon

Whittemore

AS.

Estimating gene penetrance from family data.

Genet Epidemiol 2010; 34: 373–381.

13.

Hsu

Chen

Gorfine

, et al. Semiparametric estimation of marginal hazard function from case-control family studies. Biometrics 2004; 60: 936–944.

14.

Hsu

Gorfine

Multivariate survival analysis for case-control family data.

Biostatistics 2006; 7: 387–398.

15.

Gorfine

Hsu

Parmigiani

Frailty models for familial risk with application to breast cancer.

J Am Stat Assoc 2013; 108: 1205–1215.

16.

Putter

Fiocco

Geskus

RB.

Tutorial in biostatistics: competing risks and multi-state models.

Stat Med 2006; 26: 2389–2430.

Estimating the age at onset distribution of the asymptomatic stage of a genetic disease based on pedigree data

Abstract

Keywords

1. Introduction

2. Data, notation, and assumptions

3. Ascertained-corrected conditional likelihood estimation

3.1. Estimation of parameters

3.2. Variance estimation

4. Simulation study

4.1. Simulation setup

4.2. Simulation results

5. Motivating example

6. Discussion

Footnotes

Acknowledgements

Declaration of conflicting interests

Funding

Appendix 1

References