heap: A command for fitting discrete outcome variable models in the presence of heaping at known points

Abstract

Self-reported survey data are often plagued by the presence of heaping. Accounting for this measurement error is crucial for the identification and consistent estimation of the underlying model (parameters) from such data. In this article, we introduce two commands. The first command, heapmph, estimates the parameters of a discrete-time mixed proportional hazard model with gamma unobserved heterogeneity, allowing for fixed and individual-specific censoring and different-sized heap points. The second command, heapop, extends the framework to ordered choice outcomes, subject to heaping. We also provide suitable specification tests.

Keywords

st0603, heapmph, heapop, discrete-time duration model, mixed proportional hazards model, ordered choice model, heaping, measurement error

1 Introduction

A problem frequently encountered in survey data is the abnormal concentration of reported observations at certain values of the outcome variable. Examples include reported dates of death in neonatal mortality data (Arulampalam, Corradi, and Gutknecht 2017, ACG from now on), age of starting and quitting cigarette smoking (Forster and Jones 2001), or self-reported consumption expenditure data (Pudney 2008). One of the main reasons for such concentration, often referred to as heap points, is rounding. Correctly identifying and accounting for the rounding behavior is crucial for consistent estimation of and valid inference on the parameters of the underlying model of interest. ACG discusses identification and estimation of popular duration and ordered choice models, in the presence of heaping, using maximum likelihood procedures.

In this article, we introduce the command heapmph to estimate the underlying parameters in the case of a discrete-time mixed proportional hazard (MPH) (Cox 1972) duration model as proposed in ACG. More specifically, this command estimates a semiparametric baseline hazard function in the presence of heaping of observations at certain durations, and gamma-distributed unobserved heterogeneity (frailty). In the accompanying heapop, we extend the framework to an ordered choice model, allowing for the presence of heaping points.

As shown in ACG, when some of the parameters lie on the boundary of the parameter space, the limiting distribution of the estimator is no longer a normal distribution, and more complicated subsampling procedures are required for inference. Hence, we also provide two specification tests. The first one tests for the absence of heaping effects in the model. The second specification test examines whether all heaping parameters lie inside the parameter space, which in turn will allow for inference based on asymptotic normality. We use the so-called M out of N bootstrap method to calculate the standard errors. These tests provide a set of tools that enable applied researchers to verify the validity of different model specifications.

In appendix A, we show how the heapmph command can be used to test for a shift in the heaping probability and baseline parameters because of a policy or regime change, while in appendix B we outline similar examples for heapop. Finally, in appendix C, we formally link the proportional hazard (PH) model to the type I extreme value (EV) ordered choice model (Han and Hausman 1990), outlining the implications for the interpretation of the parameters.

2 MPH model with “heaping”

2.1 Specification

We start with the MPH model for the unobserved true durations in continuous time and parameterize this for individual i as

λ_{i} (τ^{*} | z_{i}, u_{i}) = λ_{0} (τ^{*}) \exp (z_{i}^{'} β + u_{i})

where λ ₀(τ ^∗) is the baseline hazard at time τ ^∗, u_i is the individual unobserved heterogeneity (frailty), and z _i is a set of time-invariant covariates. In most empirical studies, time is observed on a discrete scale. We therefore assume that a continuous duration $τ_{i}^{*}$ ∊ [τ, τ + 1) is recorded as τ, where τ denotes a discrete time period, so that the sample of (discrete) durations is given by τ_i for i = 1,…, N. The discrete-time hazard for our model can then be written as

\begin{array}{l} h_{i} (τ | z_{i}, u_{i}) = \Pr (τ_{i}^{*} < τ + 1 | τ_{i}^{*} \geq τ, z_{i}, u_{i}) \\ = 1 - \exp \{- \int_{τ}^{τ + 1} λ_{i} (s | z_{i}, u_{i}) ds\} \\ = 1 - \exp [- \exp \{z_{i}^{'} β + γ (τ) + u_{i}\}] \end{array}

where $γ (τ) = ln \int_{τ}^{τ + 1} λ_{0} (s) d s$ . However, because of misreporting, the researcher does not observe τ_i directly, but t_i , a potentially mismeasured version of it.
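As a small numerical illustration of the discrete-time hazard above, the following Python sketch evaluates 1 − exp[−exp{z'β + γ(τ) + u}] for one observation. The function name and all parameter values are hypothetical and chosen only for illustration; they are not part of the heap package.

```python
import math

def discrete_hazard(z, beta, gamma_tau, u=0.0):
    """Discrete-time hazard 1 - exp[-exp{z'beta + gamma(tau) + u}]."""
    index = sum(zk * bk for zk, bk in zip(z, beta)) + gamma_tau + u
    return 1.0 - math.exp(-math.exp(index))

# Hypothetical covariates, coefficients, and baseline parameter.
h = discrete_hazard(z=[25.0, 8.0], beta=[0.01, -0.02], gamma_tau=math.log(0.3))
print(round(h, 4))
```

Because the complementary log-log form maps any real index into (0, 1), no further constraint on β or γ(τ) is needed for the hazard to be a valid probability.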

More specifically, the form of misreporting we address is referred to as “heaping” in the literature and describes the phenomenon of observing an over- and underreporting of failures at certain time periods. We briefly list informally the set of assumptions for the derivation of the estimator and its properties here and refer the readers to ACG for further details on the assumptions and identification results.¹ Following the neonatal mortality illustration in ACG, we demonstrate our commands using a simulated dataset.

Assumptions

A1 Excessive concentrations of reported failures occur at time periods that are multiples of a positive integer. This implies equal distance between the heap points. In most of the empirical applications where we see heaping because of rounding, we often see the distance between heaping points to be the same. This is the scenario heapmph uses.² There is no heaping at time zero. This is not an unrealistic assumption, because one would expect survey respondents to know whether the discretized duration was a zero. Following ACG, our illustration also assumes the heaping to be at points that are multiples of 5.

A2 To identify the baseline hazard from possibly misreported observations, we need to impose a structure on the heaping process. In the illustration provided here, we assume that one period to the right and to the left of each heap point are associated with that heap. We denote the maximum number of time periods that a duration can be rounded to as $\overset{―}{r}$ , and in this example $\overset{―}{r}$ = 1. That is, we assume that the duration points 4, 9, and 14, will be rounded up, while 6, 11, and 16 will be rounded down to 5, 10, and 15, respectively.

A3 All heaping is to observed duration points only. In our example, this implies that the heaping is to the points 5, 10, and 15 only, because we assume that the outcome variable is censored at 18 days. The maximum number of heaps is assumed to be $\overset{―}{j}$ , and in our example $\overset{―}{j}$ =3.

A4 The censoring is exogenous, and the censored observations are correctly reported.

A5 Whenever the true duration falls onto one of the heaping points, it will be correctly reported. However, whenever the duration falls onto the nonheaping points, it is assumed to be either correctly reported or rounded (up or down) to the nearest heaping point. Let p ₁, p ₂, etc., denote the corresponding rounding up probabilities when a true duration is lower by one, two, etc., units from the nearest heaping point. Similarly, let q ₁, q ₂, etc., denote the rounding down probabilities when a true duration is higher by one, two, etc., units from the nearest heaping point. In our illustration, a reported duration of, say, 10 days includes true durations of 11 (9) days, which have been rounded down (up) to 10 days (see figure 1). Hence, p ₁ is the probability that a true duration of 9 will be rounded up to 10 days. Analogously, q ₁ is the probability that a true duration of 11 will be rounded down to 10 days.

Figure 1.

Stylized example

A6 There exists a segment in the baseline hazard that is constant from time period $\overset{―}{k}$ and includes a known true value (that is, there is no misreporting at this value). In our example, we assume $\overset{―}{k}$ = 12.

Heuristically, the assumption that the hazard is constant over a set of time periods, which includes (at least) a known true value, enables us to uniquely identify the γ parameter associated with this correctly reported time period as well as the parameters of the heaping process, that is, the ps and the qs, in this region, from the observed data. Subsequently, we can use these identified probability parameters to pin down the rest of the baseline and other hazard parameters. See figure 1.
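The reporting mechanism described in assumptions A1 through A5 can be sketched in a few lines of Python. This is only an illustration under the stated assumptions (equally spaced heap points, rounding over at most r̄ periods); the function name report_duration and its arguments are hypothetical and do not correspond to anything in the heap package.

```python
import random

def report_duration(true_t, p, q, hstar=5, jbar=3, rng=random):
    """Return the reported duration for a true duration under A1-A5.

    p[l-1] is the probability of rounding up from l periods below a heap point,
    q[l-1] the probability of rounding down from l periods above it.
    """
    rbar = len(p)
    heaps = [hstar * j for j in range(1, jbar + 1)]  # for example 5, 10, 15
    for heap in heaps:
        gap = heap - true_t
        if 1 <= gap <= rbar and rng.random() < p[gap - 1]:
            return heap  # rounded up to the heap point
        if 1 <= -gap <= rbar and rng.random() < q[-gap - 1]:
            return heap  # rounded down to the heap point
    return true_t  # correctly reported

# With p1 = q1 = 0.7, a true duration of 9 is reported as 10 about 70% of the time.
draws = [report_duration(9, [0.7], [0.7], rng=random.Random(s)) for s in range(1000)]
print(sum(d == 10 for d in draws) / len(draws))
```

Durations on a heap point, at zero, or more than r̄ periods away from every heap point are always returned unchanged, which mirrors the correctly reported values the identification argument relies on.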

2.2 Maximum likelihood estimation

Before writing down our likelihood function, we first define some notation.

Let $\underline{θ}$ = { β ^′, γ ^′ } ^′ with γ = {γ(0), γ(1),…, γ( $\overset{―}{τ}$ − 1)} ^′ , where $\overset{―}{τ}$ is some finite, positive integer and ( $\overset{―}{τ}$ − 1) represents the maximum uncensored observed time period. Define the probability of survival at least until time period τ < $\overset{―}{τ}$ in the absence of misreporting as

\begin{array}{l} S_{i} (τ | z_{i}, u_{i}, \underline{θ}) = Pr (τ_{i} \geq τ | z_{i}, u_{i}, \underline{θ}) \\ = \prod_{s = 0}^{τ - 1} exp [- exp {z_{i}^{'} β + γ (s) + u_{i}}] \\ = \prod_{s = 0}^{τ - 1} exp [- v_{i} exp {z_{i}^{'} β + γ (s)}] \end{array}

where v_i ≡ exp(u_i ) and u_i is the unobserved heterogeneity.

The probability for an exit event in τ_i < $\overset{―}{τ}$ is

\begin{array}{l} f_{i} (τ | z_{i}, u_{i}, \underline{θ}) = Pr (τ_{i} = τ | z_{i}, u_{i}, \underline{θ}) \\ = S_{i} (τ | z_{i}, u_{i}, \underline{θ}) - S_{i} (τ + 1 | z_{i}, u_{i}, \underline{θ}) \\ = \prod_{s = 0}^{τ - 1} exp [- v_{i} exp {z_{i}^{'} β + γ (s)}] - \prod_{s = 0}^{τ} exp [- v_{i} exp {z_{i}^{'} β + γ (s)}] \end{array}

f_i (τ| z _i, u_i, θ ) in the above equation denotes the probability of a duration equal to τ when there is no misreporting. However, because of the rounding, heaped values are overreported, while nonheaped values are underreported, and this needs to be accounted for when constructing the likelihood function (see below).
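To make the survival and exit probabilities concrete, the following Python sketch computes S(τ) and f(τ) = S(τ) − S(τ + 1) for a single observation without misreporting. The flat baseline and the function names are hypothetical choices for illustration only.

```python
import math

def survival(tau, index, gamma, v=1.0):
    """S(tau) = prod_{s < tau} exp[-v * exp{index + gamma(s)}], with index = z'beta."""
    return math.exp(-v * sum(math.exp(index + gamma[s]) for s in range(tau)))

def exit_prob(tau, index, gamma, v=1.0):
    """f(tau) = S(tau) - S(tau + 1)."""
    return survival(tau, index, gamma, v) - survival(tau + 1, index, gamma, v)

gamma = [math.log(0.05)] * 18  # hypothetical flat baseline with 18 periods
probs = [exit_prob(t, 0.0, gamma) for t in range(18)]
total = sum(probs) + survival(18, 0.0, gamma)  # exits plus survivors
print(total)
```

The exit probabilities telescope, so together with the probability of surviving past the last period they sum to S(0) = 1.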

Henceforth, let

ϕ_{i} (t | z_{i}, v_{i}, \underline{θ}) = \Pr (t_{i} = t | z_{i}, v_{i}, \underline{θ})

with t_i denoting the discrete-reported duration.

The likelihood contributions depend on the following four cases:

(I) For correctly reported durations, ϕ_i (t| z _i, v_i, θ ) = f_i (t| z _i, v_i, θ ). This will include the duration point discussed in assumption A3 earlier. Depending on the application, there might be other points, too.

(II) For reported durations that are l = 1, 2, etc., points below the nearest heaping point, ϕ_i (t| z _i, v_i, θ ) = (1 − p_l )f_i (t| z _i, v_i, θ ), because p_l refers to the probability of rounding up.

(III) Similar to (II), for reported durations that are l = 1, 2, etc., points above the nearest heaping point, ϕ_i (t| z _i, v_i, θ ) = (1 − q_l )f_i (t| z _i, v_i, θ ), because q_l refers to the probability of rounding down.

(IV) Finally, for reported durations on the heaping points,

ϕ_{i} (t | z_{i}, v_{i}, \underline{θ}) = \sum_{l} p_{l} f_{i} (t - l | z_{i}, v_{i}, \underline{θ}) + \sum_{l} q_{l} f_{i} (t + l | z_{i}, v_{i}, \underline{θ}) + f_{i} (t | z_{i}, v_{i}, \underline{θ})

In summary, there are four different probabilities of exit events depending on the nature of the true duration.
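The four cases can be collected in one function that maps the true exit probabilities f(t) into the reported probabilities ϕ(t). The Python sketch below assumes equally spaced heap points at multiples of h* and symmetric windows of r̄ periods; the name reported_probs is hypothetical.

```python
def reported_probs(f, p, q, hstar=5, jbar=3):
    """Map true exit probabilities f[t] into reported probabilities phi[t]."""
    rbar = len(p)
    phi = list(f)  # case (I): correctly reported points keep f[t]
    for j in range(1, jbar + 1):
        heap = hstar * j
        for l in range(1, rbar + 1):
            phi[heap - l] = (1 - p[l - 1]) * f[heap - l]  # case (II)
            phi[heap + l] = (1 - q[l - 1]) * f[heap + l]  # case (III)
            # case (IV): mass moved onto the heap point
            phi[heap] += p[l - 1] * f[heap - l] + q[l - 1] * f[heap + l]
    return phi

f = [1.0 / 18] * 18  # hypothetical true exit probabilities for periods 0..17
phi = reported_probs(f, p=[0.7], q=[0.7])
print(phi[4], phi[5], phi[6])
```

Heaping only moves probability mass between neighboring periods, so the reported probabilities sum to the same total as the true ones.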

We next write down the corresponding unconditional probabilities under a set of assumptions on the unobserved heterogeneity v_i . More specifically, we impose the following assumptions on the properties and the distributional form of v_i , which are standard in the duration literature:

v_i is identically and independently distributed over i and is also independent of z _i ;

v_i follows a Gamma distribution with unit mean and variance σ.³

The unconditional probabilities under the above assumptions, in case (I) above are given by

\begin{array}{l} \int ϕ_{i} (t | z_{i}, v, \underline{θ}) g (v; σ) dv = \int \Pr (τ_{i} = t | z_{i}, v, \underline{θ}) g (v; σ) dv \\ = \int S_{i} (t | z_{i}, v, \underline{θ}) g (v; σ) dv - \int S_{i} (t + 1 | z_{i}, v, \underline{θ}) g (v; σ) dv \\ = {(1 + σ [\sum_{s = 0}^{t - 1} \exp \{z_{i}^{'} β + γ (s)\}])}^{- σ^{- 1}} \\ - {(1 + σ [\sum_{s = 0}^{t} \exp \{z_{i}^{'} β + γ (s)\}])}^{- σ^{- 1}} \end{array}

where the last equality uses the fact that there is a closed-form expression under the Gamma density assumption for v (for example, see Meyer [1990, 770]). Moreover, because the integral is a linear operator, the probabilities for the cases (II) to (IV) can be derived accordingly.
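The closed form used in the last equality can be checked numerically. The Python sketch below compares (1 + σA)^(−1/σ) with a simple midpoint-rule approximation of the integral of exp(−vA) against a gamma density with unit mean and variance σ, where A stands for the summed term in square brackets. The function names, the integration grid, and the values of A and σ are illustrative assumptions.

```python
import math

def mixed_survival(cum_hazard, sigma):
    """Closed form of E[exp(-v * A)] for gamma v with mean 1 and variance sigma."""
    return (1.0 + sigma * cum_hazard) ** (-1.0 / sigma)

def numeric_mixed_survival(cum_hazard, sigma, upper=60.0, steps=60000):
    """Midpoint-rule approximation of the same integral, for checking only."""
    shape, scale = 1.0 / sigma, sigma  # mean = shape * scale = 1, variance = sigma
    norm = math.gamma(shape) * scale ** shape
    width = upper / steps
    total = 0.0
    for k in range(steps):
        v = (k + 0.5) * width
        density = v ** (shape - 1.0) * math.exp(-v / scale) / norm
        total += math.exp(-v * cum_hazard) * density * width
    return total

A, sigma = 0.8, 0.5  # hypothetical cumulative term and frailty variance
print(mixed_survival(A, sigma), numeric_mixed_survival(A, sigma))
```

The two numbers agree up to the quadrature error, which is why the likelihood can be evaluated without numerical integration over v.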

Our next goal is to obtain consistent estimators for $θ = {({\underline{θ}}^{'}, σ, p_{1}, . . ., p_{\bar{r}}, q_{1}, . . ., q_{\bar{r}})}^{'}$ from the possibly misreported durations. Before setting up the likelihood function, we introduce censoring into our setup.

Let c_i be an indicator equal to one if the observation is uncensored and zero otherwise. It is assumed that durations are censored at a fixed time $\overset{―}{τ}$ that exceeds the points that are rounded and is not one of the heaping points. Assuming that censoring is independent of the heaping process and the durations, we have the following unconditional likelihood contributions.⁴

The likelihood function for the observed sample is

L_{N} (θ) = \prod_{i = 1}^{N} \int \{ϕ_{i} {(t | z_{i}, v)}^{c_{i}} S_{i} {(t | z_{i}, v)}^{(1 - c_{i})}\} g (v; σ) dv

and so

l_{N} (θ) = ln L_{N} (θ) = \sum_{i = 1}^{N} ln \int \{ϕ_{i} {(t | z_{i}, v)}^{c_{i}} S_{i} {(t | z_{i}, v)}^{(1 - c_{i})}\} g (v; σ) dv

Given the definition of ϕ_i (t| z _i, v) and cases (I) through (IV), it is clear that the (log) likelihood function downweights the contribution of heaped durations and overweights the contribution of nonheaped durations.

Under the assumptions provided in ACG, we see that the limiting distribution of the estimator depends on whether some heaping probability parameters lie on the boundary of the parameter space, that is, whether one or more of the “true” probability parameters are equal to zero. In this case, the limiting distribution is no longer normal because the information matrix is not block diagonal in general but takes a different form. We use the M out of N bootstrap method to derive the asymptotic standard errors. Details are provided in ACG.

3 Ordered probit model with heaping: Specification and estimation

In general, there are many observed discrete outcomes (other than durations) that can exhibit heaping. For instance, survey data on the number of doctor visits or on cigarette consumption in a given period of time is often subject to this phenomenon. Here we discuss the estimation of an ordered probit model allowing for heaping. In appendix C, we provide a discussion on the link between the discrete duration model derived from the PH specification and the ordered choice model. To keep notational clutter to a minimum, we do not explicitly show the conditioning set in what follows.

Consider the following latent-variable model representation of an ordered choice model,⁵

y_{i}^{*} = z_{i}^{'} β^{†} + ε_{i}

where $y_{i}^{*}$ represents the latent outcome, z _i stands for the vector of regressors, and β ^† is the vector of coefficients. Also, let the cumulative distribution function of the error term ε_i be standard normal, denoted by Φ(·).⁶ Assume we have an ordered discrete outcome variable coded as y_i ∊ {0,…, J}. That is, we have

y_{i} = j if and only if κ_{j} < y_{i}^{*} = z_{i}^{'} β^{†} + ε_{i} < κ_{j + 1}

where κ ₀,…, κ_J are the threshold parameters that divide the real line into a finite number of intervals. Here we have assumed the normalizations κ ₀ = −∞, κ_J ₊₁ = +∞, and κ_j < κ_j ₊₁. In addition, note that we require a scale normalization, so z _i may not contain a constant. For any j ∊ {0,…, J}, the probabilities of interest are given by

\begin{array}{l} \Pr (y_{i} = j) = \Pr (κ_{j} < y_{i}^{*} < κ_{j + 1}) \\ = \Pr (κ_{j} - z_{i}^{'} β^{†} < ε_{i} < κ_{j + 1} - z_{i}^{'} β^{†}) \\ = Φ (κ_{j + 1} - z_{i}^{'} β^{†}) - Φ (κ_{j} - z_{i}^{'} β^{†}) \end{array}
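The ordered probit probabilities without heaping can be computed directly from the thresholds. The Python sketch below uses the error function from the standard library for Φ(·); the threshold values and the index z'β† are hypothetical.

```python
import math

def std_normal_cdf(x):
    """Standard normal CDF, with explicit handling of infinite arguments."""
    if x == math.inf:
        return 1.0
    if x == -math.inf:
        return 0.0
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def ordered_probit_probs(index, cutoffs):
    """Pr(y = j) for j = 0..J, given index = z'beta and interior cutoffs kappa_1..kappa_J."""
    kappa = [-math.inf] + list(cutoffs) + [math.inf]
    return [std_normal_cdf(kappa[j + 1] - index) - std_normal_cdf(kappa[j] - index)
            for j in range(len(kappa) - 1)]

probs = ordered_probit_probs(index=0.3, cutoffs=[-1.0, 0.0, 1.5])
print(probs, sum(probs))
```

With the normalizations κ₀ = −∞ and κ_{J+1} = +∞, the J + 1 probabilities sum to one for any index.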

In the presence of heaping data, the term Pr(y_i = j) depends on the four cases:

(I) For correctly reported outcomes, $\Pr (y_{i} = j) = Φ (κ_{j + 1} - z_{i}^{'} β^{†}) - Φ (κ_{j} - z_{i}^{'} β^{†})$ .

(II) For reported outcomes that are l = 1, 2, etc., points below the nearest heaping point, $\Pr (y_{i} = j) = (1 - p_{l}) \{Φ (κ_{j + 1} - z_{i}^{'} β^{†}) - Φ (κ_{j} - z_{i}^{'} β^{†})\}$ .

(III) Similar to (II), for reported outcomes that are l = 1, 2, etc., points above the nearest heaping point, $\Pr (y_{i} = j) = (1 - q_{l}) \{Φ (κ_{j + 1} - z_{i}^{'} β^{†}) - Φ (κ_{j} - z_{i}^{'} β^{†})\}$ .

(IV) Finally, for reported outcomes on the heaping points,

\begin{array}{l} \Pr (y_{i} = j) = \sum_{l} p_{l} \{Φ (κ_{j - l + 1} - z_{i}^{'} β^{†}) - Φ (κ_{j - l} - z_{i}^{'} β^{†})\} \\ + \sum_{l} q_{l} \{Φ (κ_{j + l + 1} - z_{i}^{'} β^{†}) - Φ (κ_{j + l} - z_{i}^{'} β^{†})\} \\ + \{Φ (κ_{j + 1} - z_{i}^{'} β^{†}) - Φ (κ_{j} - z_{i}^{'} β^{†})\} \end{array}

Note that when the outcome is duration data that are right-censored at y_i = $\bar{τ}$ , the likelihood function can be written as

L_{N} (θ^{†}) = \prod_{i = 1}^{N} \{\prod_{j = 1}^{\bar{τ} - 1} {\Pr (y_{i} = j)}^{d_{i j} \cdot c_{i}}\} {\{1 - Φ (κ_{\bar{τ}} - z_{i}^{'} β^{†})\}}^{(1 - c_{i})}

where $θ^{†} = {(β^{†'}, κ', p_{1}, . . ., p_{\bar{r}}, q_{1}, . . ., q_{\bar{r}})}^{'}$ and d_ij is an indicator equal to one when t_i = j and zero otherwise.

4 Testing for “heaping”

As pointed out in section 2.2, if some of the heaping probability parameters lie on the boundary of the parameter space, the asymptotic distribution of the estimator is no longer normal. In addition, inference becomes more complicated, because subsampling methods are used to derive the asymptotic standard errors. In the following, we discuss two specification tests: first, a test to detect whether heaping matters in a statistical sense $(H^{π_{1}})$ ; second, if heaping matters, a test to discriminate between the general case that allows for probability parameters on the boundary and the special case without parameters on the boundary $(H^{π_{2}})$ . That is, while the first test helps to determine whether the specified heaping model is indeed preferred over a standard model that does not account for heaping, the second test allows one to decide whether inference, in fact, ought to be based on subsampling methods.

Thus, collecting all heaping parameters in the vector π with $π = {(p_{1}, . . ., p_{\bar{r}}, q_{1}, . . ., q_{\bar{r}})}^{'}$ and θ = ( $\underline{θ}$ ^′, σ, π ^′ ) ^′ , the first test examines the existence of heaping effects through $H^{π_{1}}$

H_{0}^{π_{1}} : p_{1} = \cdots = p_{\bar{r}} = q_{1} = \cdots = q_{\bar{r}} = 0

versus

H_{A}^{π_{1}} : p_{l} > 0 or q_{l} > 0 or both

for some l = 1,…, $\overset{―}{r}$ . The above hypothesis $H_{0}^{π_{1}}$ can be tested through a standard likelihood-ratio (LR) test (ACG).
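The mechanics of the LR test can be sketched as follows. The Python snippet assumes that the maximized log likelihoods of the model with heaping (unrestricted) and without heaping (restricted) are already available, and that a critical value has been obtained separately, for instance by the bootstrap. All numbers and function names are hypothetical.

```python
def lr_statistic(loglik_unrestricted, loglik_restricted):
    """Likelihood-ratio statistic 2 * (l_U - l_R)."""
    return 2.0 * (loglik_unrestricted - loglik_restricted)

def reject_no_heaping(lr, critical_value):
    """Reject H0 (all p_l = q_l = 0) when the statistic exceeds the critical value."""
    return lr > critical_value

# Hypothetical log-likelihood values and critical value, for illustration only.
lr = lr_statistic(-512.3, -514.1)
print(lr, reject_no_heaping(lr, critical_value=4.7))
```

Because the null hypothesis places the heaping parameters on the boundary of the parameter space, the critical value should come from the procedure described in ACG rather than from a standard chi-squared table.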

The second specification test examines whether all heaping parameters lie inside the parameter space, which in turn allows inference based on asymptotic normality. That is, the null hypothesis of the test is that at least one rounding parameter is equal to zero versus the alternative that none is zero (and thus no boundary problem exists). Therefore, if we reject this hypothesis, we can make inference based on standard normal critical values, while if we fail to reject, we ought to rely on subsampling methods for inference.

Formally, let $H_{p, 0}^{(j)} : p_{j} = 0, H_{p, A}^{(j)} : p_{j} > 0$ , and let $H_{q, 0}^{(j)}, H_{q, A}^{(j)}$ be defined analogously. Our objective is to test the following hypotheses, $H^{π_{2}}$ :

H_{0}^{π_{2}} = (\cup_{j = 1}^{\bar{r}} H_{p, 0}^{(j)}) \cup (\cup_{j = 1}^{\bar{r}} H_{q, 0}^{(j)})

versus

H_{A}^{π_{2}} = (\cap_{j = 1}^{\bar{r}} H_{p, A}^{(j)}) \cap (\cap_{j = 1}^{\bar{r}} H_{q, A}^{(j)})

so that under $H_{A}^{π_{2}}$ all p’s and q’s are strictly positive. To discriminate between $H_{0}^{π_{2}}$ and $H_{A}^{π_{2}}$ , we apply the intersection-union principle (IUP); see, for example, chapter 5 in Silvapulle and Sen (2005). According to the IUP, we reject $H_{0}^{π_{2}}$ at level α only if all single null hypotheses $H_{p, 0}^{(j)}$ and $H_{q, 0}^{(j)}$ are rejected at level α.

We now introduce a rule to discriminate between $H_{0}^{π_{2}}$ and $H_{A}^{π_{2}}$ .

Rule IUP-PQ: Reject $H_{0}^{π_{2}}$ if $\max_{j = 1, ..., \bar{r}} \{PV_{p, j}, PV_{q, j}\} < α$ , and do not reject otherwise, where $PV_{p, j}$ and $PV_{q, j}$ denote the p-values of the individual tests.

Thus, as pointed out above, if one rejects $H_{0}^{π_{2}}$ , the inference can be based on asymptotic normality, while failure to reject $H_{0}^{π_{2}}$ requires the use of subsampling methods as outlined before.
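Rule IUP-PQ amounts to comparing the largest individual p-value with the chosen level. A minimal Python sketch, with a hypothetical function name and hypothetical p-values, is:

```python
def iup_reject(p_values_p, p_values_q, alpha=0.05):
    """Rule IUP-PQ: reject H0 only if the largest individual p-value is below alpha."""
    return max(list(p_values_p) + list(p_values_q)) < alpha

# Hypothetical p-values for p_1 and q_1.
print(iup_reject([0.010], [0.492]))  # one large p-value blocks rejection
print(iup_reject([0.010], [0.020]))  # all individual nulls rejected at 5%
```

A single insignificant heaping parameter is therefore enough to keep the null of a boundary problem, in which case inference relies on subsampling.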

5 Command implementation

As discussed in the earlier section, if one or more of the probability parameters lie on the boundary of the parameter space, the asymptotic distribution of the estimator is no longer normal. We provide two tests that can be used to detect this. Hence, the output provides the usual asymptotic standard errors along with the standard errors calculated using the M out of N bootstrap method, where M denotes an integer strictly smaller than N (see ACG).

5.1 Data

We illustrate the use of the heapmph and heapop commands using generated data based on 250 observations drawn randomly from the original sample used in ACG. More specifically, we retain two covariates of these observations that were found to be significant: mother’s age at the time of birth (age_m) and mother’s years of schooling (school_m). Our outcome variable duration, which is the time of death of the child measured in days if the child died within the first 17 days, is generated using these two covariates within the ordered choice model framework as detailed next. All observations where the child survived for longer than 18 days are treated as censored.⁷

The latent dependent variable $y_{i}^{*}$ in the ordered choice model framework is generated according to

y_{i}^{*} = 0.1 age_m_{i} - 0.1 school_m_{i} + ε_{i} for i = 1, . . ., 250

We use two different schemes to generate ε_i for demonstrating the heapmph and heapop commands, respectively. Note that, as shown in appendix C, Cox’s PH model is equivalent to the ordered choice model where the underlying error term in the latent-variable model is type I EV distributed. The threshold parameters κ are then generated in terms of the parameters γ (see appendix C).⁸ In detail:

For the heapmph command, we characterize a PH model data example by generating i.i.d. ε_i from a type I EV distribution. The baseline gamma parameters are set as follows: exp{γ(t)} = 0.3 for t = 0, 1, 2, 3, exp{γ(t)} = 0.6 for t = 4,…, 7, exp{γ(t)} = 1.2 for t = 8,…, 11, exp{γ(t)} = 2.5 for t = 12,…, 15, exp{γ(16)} = 8, and exp{γ(17)} = 10. The dataset created according to this scheme is enclosed in the package and named heap_demonstration2.dta.

For the data example used to demonstrate heapop, we draw ε_i from a standard normal distribution. We set the gamma parameters for heapop as follows: exp{γ(t)} = 0.6 for t = 0, 1,…, 11, exp{γ(t)} = 1.5 for t = 12,…, 15, exp{γ(16)} = 1.8, and exp{γ(17)} = 3. In the heap package, the dataset generated following this scheme is named heap_demonstration.dta.
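The two baseline schemes above can be written out explicitly as step functions over the 18 periods t = 0,…, 17. The Python sketch below only restates the values given in the text; the function names are hypothetical.

```python
def baseline_heapmph():
    """exp{gamma(t)} for t = 0..17 under the heapmph scheme."""
    return [0.3] * 4 + [0.6] * 4 + [1.2] * 4 + [2.5] * 4 + [8.0, 10.0]

def baseline_heapop():
    """exp{gamma(t)} for t = 0..17 under the heapop scheme."""
    return [0.6] * 12 + [1.5] * 4 + [1.8, 3.0]

for t, (a, b) in enumerate(zip(baseline_heapmph(), baseline_heapop())):
    print(t, a, b)
```

In both schemes the values for periods 12 through 15 coincide, which is the flat segment required by assumption A6.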

Note that we keep the function flat from period 12 to 15. The discrete duration variable without heaping, for each observation i = 1, 2,…, 250, for these models is then generated using the cutoff points as

where we assume δ ₀ = −∞ and δ ₁₉ = ∞ for the normalization.

Finally, we add the following heaping pattern to the dependent variable: the duration points 4, 9, and 14 are rounded up to 5, 10, and 15, respectively, with probability 0.7, and the duration points 6, 11, and 16 are rounded down to 5, 10, and 15, respectively, with the same probability, 0.7. Hence, the heaping probability parameters are p ₁ = q ₁ = 0.7. Algebraically, the actual observed duration variable duration is generated by

We have not included the unobserved heterogeneity in the generation of the above data. Figure 2 plots the histograms of both the observed duration variable with heaping and the true duration variable without heaping as generated from the ordered probit model.

Figure 2.

Histograms of the duration variable in the example data for demonstrating the heapmph command (see section 5.1)

5.2 The heapmph command

This section describes the implementation of the heapmph command for the MPH model.

Basic syntax

The basic syntax of the heapmph command follows the standard command form:

heapmph depvar varlist [ if ] [ in ] [ , censor( integer ) vcensor( varname ) hstar( integer ) jbar( integer ) kbar( integer ) rbar( integer ) detail rep( integer ) moon( real )]

where depvar stands for the dependent variable and varlist may contain the specified covariates. In this article, we demonstrate the usages of the heap package with examples and then explain a few other available options. We do not provide an exhaustive explanation of all the available options and thus refer the interested user to the help files included in the package.

Model estimation

As discussed in section 5.1, the analysis is restricted to modeling the hazard rate during the first 18 days after birth because the reported number of deaths is smaller after this period (see ACG). We therefore add the censor(18) option to the command to fix the right-censoring period for each observation at 18. By default, the heap command assumes that the right-censoring period is the largest value of the dependent variable in the chosen sample. Instead of using the fixed right-censoring, we can also allow for person-specific censoring points for each observation (see section 5.4). We also provide a command to test for policy effects (see appendix B).

We next detail the values used for the four compulsory options to define the pattern of heaping in our example.

1. Because we have generated the data with heaps at days 5, 10, and 15, we define the starting period (h ^∗) of 5 using the option hstar(5). The assumption is that the heaping occurs at points that are multiples of h ^∗.

2. We set option jbar(3) (that is, $\overset{―}{j}$ = 3) to indicate that there is a maximum of three heaping points prior to the censoring point (see point 1 above).

3. As illustrated in our stylized example (figure 1), the rounding probabilities are p ₁ and q ₁, respectively. Hence, the maximum number of time periods over which a duration can be rounded is $\overset{―}{r}$ = 1. This is set by the option rbar(1) in the command.

4. The constant part of the baseline hazard enables us to identify the parameters of the heaping process. In this example, we set the time period after which the hazard is constant equal to 12 ( $\overset{―}{k}$ ). Also, we assume that the heaping is asymmetric, which suggests that constant baseline hazard parameters are at different levels for periods {12, 13, 14, 15}.⁹ In the command, the starting period of the flat segment can be defined by adding the option kbar(12).

Example

We choose duration as the dependent variable and age_m and school_m as the covariates. We request Stata implement the command using the following code:¹⁰

The command first employs a simulated annealing (SA) algorithm (see section 5.5.3 of ACG) to solve for the point estimates. The M out of N bootstrap procedure is then conducted to yield the standard errors. Note also that the 95% bootstrap confidence interval is constructed using the 2.5% and the 97.5% quantiles of the empirical bootstrap distribution. The output table consists of five panels. The panel exp(gamma) reports the estimates of functions of the baseline hazard parameters (see section 5.1 and appendix C). Note again that we set the baseline hazard parameters γ to be constant over periods {12, 13, 14, 15}. Hence, we estimate 18 − 3 = 15 baseline hazard parameters, indexed from 0 to 14. Specifically, gamma0, gamma1,…, gamma11 in the output table correspond to functions of the baseline hazard in periods 0, 1,…, 11, respectively. gamma12 corresponds to the flat baseline hazard during periods {12, 13, 14, 15}. gamma13 is for period 16, and gamma14 is for period 17.

Panel sigma displays the estimate of σ, which is the variance of the gamma-distributed unobserved heterogeneity variable v_i , and panel beta is for the estimates of the covariate coefficients. In panels prob_left and prob_right, we report the estimated heaping probabilities p ₁ and q ₁. The value of the sigma coefficient can be seen to be very close to zero numerically. This is not unexpected, because the data-generating process does not feature any unobserved heterogeneity.¹¹
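The percentile construction of the 95% bootstrap confidence interval mentioned above can be sketched in Python as follows. The interpolation rule for the quantiles is an assumption made here for illustration; the command may use a different quantile definition, and the function name is hypothetical.

```python
def percentile_interval(draws, lower=0.025, upper=0.975):
    """Percentile interval from bootstrap draws, using linear interpolation."""
    ordered = sorted(draws)

    def quantile(prob):
        pos = prob * (len(ordered) - 1)
        lo = int(pos)
        hi = min(lo + 1, len(ordered) - 1)
        return ordered[lo] + (pos - lo) * (ordered[hi] - ordered[lo])

    return quantile(lower), quantile(upper)

# Hypothetical bootstrap draws of one parameter estimate.
low, high = percentile_interval(list(range(101)))
print(low, high)
```

Unlike an interval based on a normal approximation, this construction does not require the bootstrap distribution to be symmetric, which matters when a heaping probability is close to zero.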

Testing for the presence of heaping effects

This command provides a subroutine to test the null hypothesis of no heaping via the LR test described in remark 4.2 in section 4 of ACG and briefly discussed in section 4 of this article. The test can be implemented by adding an option (testpi1) to the main command; see help heap_tests for details. testpi1 tests the null hypothesis $(H_{0}^{π_{1}})$ that all heaping probability parameters are zero against the alternative $(H_{A}^{π_{1}})$ that at least one heaping probability parameter is greater than zero. Applying the IUP, we can also test the null hypothesis $(H_{0}^{π_{2}})$ that at least one heaping probability parameter is equal to zero against the alternative $(H_{A}^{π_{2}})$ that none is zero.

Example

To test for the presence of heaping effects under the model specification described in the last subsection, we can simply add the testpi1 option to the command:

The output table reports the test statistic along with the corresponding bootstrapped critical values at 10%, 5%, and 1% levels.¹² In this example, we fail to reject the null hypothesis at the 10% significance level, which suggests that there is no clear evidence of heaping.

In addition, we employ the IUP rule to test the null that at least one heaping probability parameter is equal to zero $(H_{0}^{π_{2}})$ . In detail, we sort the p-values of all heaping parameters (p ₁ and q ₁) displayed in the regression output. The largest p-value is 0.492 in our example, so we do not reject the null at any conventional significance level; hence, we have to continue to use the M out of N subsampling scheme. Otherwise, if the null hypothesis was rejected, one could simply do inference based on the standard normal distribution.

5.3 The heapop command for the ordered probit model with heaping

Basic syntax

The syntax and corresponding options of command heapop are identical to those of the heapmph command (see section 5.2).

Model estimation

The heapop command fits an ordered probit model with heaping and can also be used with duration outcome data. Like heapmph, the heapop command requires four compulsory options to define the pattern of heaping, that is, kbar(), jbar(), hstar(), and rbar(), as introduced in section 5.2. In the case of ordered choice or count data, the censor() option can be used to indicate the maximum number of possible choices or counts. If censor() is left unspecified, Stata by default uses the maximum value of the dependent variable as censor().

This section presents example uses of the heapop command under the same specification of the heaping pattern as used in section 5.2.

We first request Stata implement the heapop command to fit the model:

Example

Unlike the table in section 5.2, this table consists of only four panels because no unobserved heterogeneity parameter has been estimated. Standard errors and bootstrap confidence intervals are constructed as in section 5.2. The first panel again contains the estimated baseline parameters (exp(gamma); see section 5.2 for the specification), while panel two provides estimates of the β coefficients. Note that the numerical differences in the β coefficient estimates are likely to stem from the omission of unobserved heterogeneity and the different functional form in this specification. Finally, panels three and four contain the estimated heaping probabilities, which can both be seen to be statistically insignificant.

Testing for the presence of heaping effects

Example

To test for the presence of heaping effects $(H_{0}^{π_{1}})$, we use this code:

As in the previous section, we cannot reject the null $H_{0}^{π_{1}}$ at any conventional level and thus proceed to test $H_{0}^{π_{2}}$ via the IUP rule. More specifically, we again sort the p-values of all heaping parameters (p₁ and q₁) displayed in the regression output. Because the largest p-value is 0.549, we do not reject the null at any conventional significance level and continue to use M out of N subsampling for inference.

5.4 Further options

Here we elaborate on a few additional options, which are available for both commands heapmph and heapop.

Bootstrap options

The rep( integer ) option allows users to specify the number of M out of N bootstrap replications for calculating the standard errors. The default value is 100. In the example shown in section 5.2, it takes 26 minutes to run 100 bootstrap iterations in Stata/SE 15 (64-bit) on a desktop computer with a 4.0 GHz Intel i7 quad-core processor.

To choose M in the M out of N bootstrap, users can set the option moon( real ), which gives the share of observations to be randomly drawn from the sample of size N. Bickel and Sakov (2008) provide an in-depth discussion of the choice of M. The heap package, by default, sets moon() to 0.8 so that in each MooN bootstrap iteration, 80% of the original sample is randomly kept.
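To illustrate what a single draw with moon(0.8) amounts to, the following minimal sketch uses Stata's built-in sample command. It assumes that observations are drawn without replacement, which is how "randomly kept" reads but is not confirmed here; the auto dataset and the seed are stand-ins, and the sketch does not reproduce the package's internal code.

```stata
* Sketch only: one M out of N draw for moon(0.8), assuming sampling without replacement.
* The auto dataset is a stand-in; it is not the data used in this article.
sysuse auto, clear
set seed 1000
local N = _N
preserve
* Keep a random 80% of the observations
sample 80
local M = _N
display "kept M = `M' of N = `N' observations"
* (the model would be refit on the retained observations here)
restore
```

If draws were made with replacement instead, bsample would be the analogous built-in command.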

Optimization

The provided commands use the SA algorithm to maximize the likelihood function of the model. The SA method, proposed by Kirkpatrick, Gelatt, and Vecchi (1983), is a popular local search algorithm for stochastically approximating the global optimum of a given objective function. A review of the algorithm and its technical details can be found in Dowsland and Thompson (2012), for example. The SA algorithm is particularly useful for our model and may be preferable to the conventional Newton algorithm because SA is better at locating the global maximum when the likelihood function is complex, as in our case.
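To make the mechanism concrete, the textbook acceptance step of SA for a maximization problem can be written as follows. The notation is ours: $\ell$ is the log likelihood, $\theta$ the current parameter vector, $\theta'$ a randomly generated candidate, and $T_{k} > 0$ the temperature at iteration $k$. The candidate-generation rule and cooling schedule used by the package may differ from this generic form.

$$\Pr(\text{accept } \theta') = \min\left\{1, \exp\left[\frac{\ell(\theta') - \ell(\theta)}{T_{k}}\right]\right\}$$

Candidates that raise the likelihood are always accepted, whereas candidates that lower it are accepted with a probability that shrinks as $T_{k}$ falls. This is what allows the search to leave a local maximum early on and to settle once the temperature is low.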

The heap package contains a Mata function implementing the SA method of Kirkpatrick, Gelatt, and Vecchi (1983). This function provides 10 options that control the settings of the SA algorithm. For instance, users can set the maximum number of total iterations (the default is 8,000), and the sa_stopTemp( real ) option sets the temperature at which the search stops (the default is 1 × 10⁻⁸). Full details of the settings are listed in help heap_annealing. In addition, the seed for initializing the random-number generator is set to 1,000 by default and can be changed with the seed( real ) option.¹³

Display options

For diagnostic and monitoring purposes, we provide two options to display intermediate output. First, the detail option displays a summary of the heaping model specification and produces a table of point estimates only, before the bootstrap is run. Second, the sa_verbosity( integer ) option can be set to 1 to produce the final report of the SA algorithm or to 2 to additionally display the temperature changes in each iteration. The default value of this option is zero, which suppresses all output.

Different censoring points for each observation

The option for variable censoring is vcensor( varname ), where varname is a dummy variable that equals 1 if the observation is complete and 0 if the observation is right-censored.

Let uncensor_dummy stand for a period-specific censoring indicator variable, where uncensor_dummy=1 if the observation’s spell is complete and uncensor_dummy=0 if the spell is right-censored. For example, we randomly generate uncensor_dummy from a Bernoulli(0.1) distribution and apply the heapmph command:
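The data-generation step can be sketched as follows. This is a minimal illustration rather than the article's listing: the auto dataset stands in for the estimation sample, and the seed is arbitrary.

```stata
* Sketch only: the auto dataset stands in for the estimation sample.
sysuse auto, clear
set seed 1000
* Bernoulli(0.1) draw: 1 = complete spell, 0 = right-censored
generate byte uncensor_dummy = rbinomial(1, 0.1)
tabulate uncensor_dummy
```

The resulting variable would then be supplied as vcensor(uncensor_dummy), together with the four heaping options kbar(), jbar(), hstar(), and rbar() described in section 5.2.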

Note that if neither vcensor( varname ) nor censor( integer ) is specified, the command by default fixes the right-censoring point at the maximum value of the dependent variable in the usable sample.

6 Conclusion

Discrete-time duration models are very popular among researchers. The command heapmph allows the estimation of a discrete-time MPH model when the observed discrete durations exhibit abnormal concentrations at certain duration points. An accompanying command, heapop, allows for heaping in an ordered probit model. The underlying assumptions and the identification strategy used are discussed fully in ACG.

7 Acknowledgments

We are grateful to the British Academy (grant number: SG160731 - Estimation and inference with heaped data - a novel approach) for funding this project. Zizhong Yan acknowledges support from the 111 project of China (project number B18026). We also thank David M. Drukker and a referee for helpful comments and discussions.

8 Programs and supplemental materials

Supplemental Material, st0603 for heap: A command for fitting discrete outcome variable models in the presence of heaping at known points by Zizhong Yan, Wiji Arulampalam, Valentina Corradi and Daniel Gutknecht in The Stata Journal

To install a snapshot of the corresponding software files as they existed at the time of publication of this article, type

References

Abbring, J. H., and G. J. van den Berg. 2007. The unobserved heterogeneity distribution in duration analysis. Biometrika 94: 87–99.

Arulampalam, W., V. Corradi, and D. Gutknecht. 2017. Modeling heaped duration data: An application to neonatal mortality. Journal of Econometrics 200: 363–377. https://doi.org/10.1016/j.jeconom.2017.06.016.

Bickel, P. J., and A. Sakov. 2008. On the choice of m in the m out of n bootstrap and confidence bounds for extrema. Statistica Sinica 18: 967–985.

Cox, D. R. 1972. Regression models and life-tables. Journal of the Royal Statistical Society, Series B 34: 187–220. https://doi.org/10.1111/j.2517-6161.1972.tb00899.x.

Cox, D. R., and D. Oakes. 1984. Analysis of Survival Data. London: Chapman & Hall/CRC.

Dowsland, K. A., and J. M. Thompson. 2012. Simulated annealing. In Handbook of Natural Computing, ed. G. Rozenberg, T. Bäck, and J. N. Kok, vol. 1, 1623–1655. Berlin Heidelberg: Springer.

Forster, M., and A. M. Jones. 2001. The role of tobacco taxes in starting and quitting smoking: Duration analysis of British data. Journal of the Royal Statistical Society, Series A 164: 517–547. https://doi.org/10.1111/1467-985X.00217.

Greene, W. 2014. Models for ordered choices. In Handbook of Choice Modelling, ed. S. Hess and A. Daly, 333–362. Northampton, MA: Edward Elgar Publishing. https://doi.org/10.4337/9781781003152.00023.

Gutierrez, R. G., S. Carter, and D. M. Drukker. 2001. sg160: On boundary-value likelihood-ratio tests. Stata Technical Bulletin 60: 15–18. Reprinted in Stata Technical Bulletin Reprints. Vol. 10, pp. 269–273. College Station, TX: Stata Press.

Han, A., and J. A. Hausman. 1990. Flexible parametric estimation of duration and competing risk models. Journal of Applied Econometrics 5: 1–28. https://doi.org/10.1002/jae.3950050102.

Kirkpatrick, S., C. D. Gelatt, Jr., and M. P. Vecchi. 1983. Optimization by simulated annealing. Science 220: 671–680. https://doi.org/10.1126/science.220.4598.671.

Meyer, B. D. 1990. Unemployment insurance and unemployment spells. Econometrica 58: 757–782.

Pudney, S. 2008. Heaping and leaping: Survey response behaviour and the dynamics of self-reported consumption expenditure. Working Paper Series No. 2008-09, Institute for Social & Economic Research. https://www.iser.essex.ac.uk/research/publications/working-papers/iser/2008-09.

Silvapulle, M. J., and P. K. Sen. 2005. Constrained Statistical Inference: Inequality, Order, and Shape Restrictions. Wiley Series in Probability and Statistics. Hoboken, NJ: Wiley.
