Sage Journals: Discover world-class research

Abstract

Test-retest reliability is often estimated using naturally occurring data from test repeaters. In settings such as admissions testing, test takers choose if and when to retake an assessment. This self-selection can bias estimates of test-retest reliability because individuals who choose to retest are typically unrepresentative of the broader testing population and because differences among test takers in learning or practice effects may increase with time between test administrations. We develop a set of methods, including sample weighting, polynomial regression, and Bayesian model averaging, for estimating test-retest reliability from observational data in a way that can mitigate these sources of bias. We demonstrate the value of using these methods for reducing bias and improving precision of estimated reliability using empirical and simulated data, both of which are based on more than 40,000 repeaters of a high-stakes English language proficiency test. Finally, these methods generalize to settings in which only a single, error-prone measurement is taken repeatedly over time and where self-selection and/or changes to the underlying construct may be at play.

Keywords

test-retest reliability, self-selection bias, entropy balancing, minimum discriminant information adjustment, polynomial regression, Bayesian model averaging

Reliability of a psychological or educational assessment is critical to the valid use and interpretation of the scores it produces (AERA, APA, and NCME, 2014). Reliability generally refers to the consistency of scores across multiple instances of assessment and is typically measured on a scale from 0 to 1. This originates from classical test theory, where the correlation across two parallel forms of assessment, ρ_XX′, is defined as the proportion of true score variance, $σ_{T}^{2}$, to observed score variance, $σ_{X}^{2}$ (Lord & Novick, 1968); mathematically, $ρ_{X X^{'}} = \frac{σ_{T}^{2}}{σ_{X}^{2}} = \frac{σ_{T}^{2}}{σ_{T}^{2} + σ_{E}^{2}}$, where $σ_{E}^{2}$ is measurement error variance. Values of reliability near 1 indicate relative stability of inferences about test-taker proficiency across instances of the assessment, whereas smaller values indicate more sensitivity of those inferences to idiosyncratic factors that impact any given test score. Thus, test scores that are used for consequential decisions must demonstrate high levels of reliability.

Reliability depends on properties of both the assessment itself (i.e., items) and the target population of individuals for that assessment (i.e., persons). For a given target population, reliability tends to increase as the number of items on the assessment increases because there is less opportunity for idiosyncratic variation in a given observed score with many items (i.e., less measurement error), or $\frac{σ_{T}^{2}}{σ_{T}^{2} + σ_{E}^{2} ↓}$ and ρ_XX′ ↑. Conversely, for a given assessment or set of items, reliability also tends to increase as heterogeneity of the true score in the target population increases because larger heterogeneity in true scores reduces the impact that measurement error has on observed scores, or $\frac{↑ σ_{T}^{2}}{↑ σ_{T}^{2} + σ_{E}^{2}}$ and ↑ ρ_XX′. The same assessment may have high reliability in one population of people (e.g., students applying to a highly selective institution), but low reliability in particular subpopulations (e.g., students admitted to a highly selective institution).
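As a quick numerical illustration of the two effects described above, the following minimal sketch evaluates the classical test theory ratio for a few variance values. The function name `reliability` and all variances are hypothetical and chosen only for illustration; none of them come from the DET data.

```python
def reliability(true_var: float, error_var: float) -> float:
    """Classical test theory reliability: true-score variance / observed-score variance."""
    return true_var / (true_var + error_var)

# Hypothetical variances (illustrative assumptions, not estimates from the article).
baseline = reliability(true_var=100.0, error_var=25.0)
# A longer test: same population, smaller measurement error variance.
less_error = reliability(true_var=100.0, error_var=10.0)
# A more homogeneous subpopulation: same test, smaller true-score variance.
more_homogeneous = reliability(true_var=40.0, error_var=25.0)

print(round(baseline, 3), round(less_error, 3), round(more_homogeneous, 3))
# prints: 0.8 0.909 0.615
```

Under these assumed values, shrinking the error variance raises reliability, while restricting the range of true scores lowers it even though the assessment itself is unchanged.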

Although there are more technical definitions and related estimation methods for quantifying reliability (see, e.g., Crocker & Algina, 1986, Chapters 6–7), we focus on test-retest reliability. Test-retest reliability corresponds to a hypothetical experiment in which a large random sample of individuals from a target population is administered an assessment and then, shortly afterward, is administered a second parallel version of the assessment.¹ The sample correlation of the two sets of observed scores for the same people from this hypothetical experiment is a common definition of reliability.^2,3 This definition is useful because it refers to only observable quantities rather than latent variables, aiding communication to broader audiences. It also applies to complex assessments for which analytical standard errors of measurement may be intractable. More generally, test-retest reliability can be used for any error-prone measurement that is taken repeatedly over time.

Unfortunately, the hypothetical experiment described above may not be possible for some assessments. Rather, it may be more common to rely on naturally occurring data from people who self-select whether and when to retake an assessment (Monfils & Manna, 2021; Raymond et al., 2007). We refer to such people as repeaters.

Relying on observational samples of repeaters leads to two key challenges when estimating the reliabilities of scores and subscores for the full population. The first is that repeaters tend to differ in consequential ways from the full population (Zhou & Cao, 2020). For instance, people who repeat a high-stakes test tend to represent a more homogeneous, lower-ability subpopulation than the full population because high-achieving individuals often do not need to take the test more than once (e.g., to meet admission cutoffs). The second is that repeaters self-select not only whether to repeat the assessment, but also when to repeat. For example, the Duolingo English Test (DET; Settles et al., 2020) is an online, high-stakes assessment of English language proficiency that is available on demand; some repeaters may thus take the test twice on the same day, whereas others may wait more than a year to take the test again. The more time that elapses between sessions, the more opportunity there is for heterogeneity across people in true proficiency change (e.g., through differential learning or test practice). Unless these forms of self-selection bias are addressed, estimates of test-retest reliability computed from observational data on repeaters can be biased.

In this article, we develop approaches for mitigating this bias and reducing estimation variance. We account for individual differences in whether and when to repeat using a weighting approach known as entropy balancing (EB; Hainmueller, 2012) or minimum discriminant information adjustment (MDIA; Haberman, 1984; Haberman & Yao, 2015), which weights an observed sample so that moments (e.g., means and variances) of test-taker characteristics match those of a target population. We account for variance inflation due to such weighting, as well as the impacts of heterogeneous change among repeaters, by using a combination of polynomial regression and an approximation to Bayesian model averaging (BMA; Raftery, 1995). We demonstrate the value of these methods for reducing bias and improving precision of estimated reliability using a case study of the DET and a simulated data analysis.

Model Framework

This section provides notation and a description of the inference problem.

Notation for Observed Data

Suppose we have data from a random sample of first-time test takers from the target testing population. Let test takers in the sample be indexed by i = 1, …, N. Let y₁ be an (N × 1) vector of scores for these test takers. Let Z₁ be an (N × P) matrix of auxiliary information (e.g., age, gender, and country of origin) for these test takers. If a given auxiliary variable is categorical with C categories, we assume that this variable is represented in Z₁ by (C − 1) columns of indicator (dummy) variables.

Consider the subset of these N first-time test takers who choose to take the test again (i.e., the repeaters). We assume that the time between the first and second tests is measured in days⁴ denoted by d, where d = 0 means that an individual took the second test on the same day as the first test, d = 1 means that an individual took the second test the day after the first test, and so on. We refer to the repeaters who retook the test d days after their first test as lag-d repeaters, and we let N_d < N be the number of such individuals. The decision about the maximum value of d to consider in any given analysis, denoted by d_max, depends on the specific application.

Let y_1,d be the first-time test scores and y_2,d be the second-time test scores for the lag-d repeaters. Both y_1,d and y_2,d are (N_d × 1), such that y_1,d is a subvector of y₁, the vector of test scores for all first-time test takers in the sample. Analogously, let Z_1,d be the (N_d × P) submatrix of Z₁, the matrix of auxiliary information for all first-time test takers in the sample. In all, the observed data consist of y₁ and Z₁ for a random sample of N first-time test takers, as well as {N_d, Z_1,d, y_1,d, y_2,d} for each lag-d repeater sample for d = 0, 1, …, d_max. Table 1 lists this notation.

Table 1.

Notation for Model Framework.

Symbol	Meaning
y₁	Random sample of first-time test scores, (N × 1)
Z₁	Background variables of first-time test takers, (N × P)
d	Number of days between repeaters’ first- and second-time test scores
y_1,d	First-time test scores of lag-d repeaters, (N_d × 1)
y_2,d	Second-time test scores of lag-d repeaters, (N_d × 1)
Z_1,d	Background variables of lag-d repeaters, (N_d × P)
${\tilde{y}}_{2, d}$	Hypothetical second-time test scores for a random sample of lag-d repeaters, (N × 1)
w_d	EB/MDIA weights that adjust the lag-d repeater sample to match attributes of all first-time test scores and test takers, (N_d × 1)
r_d	Sample correlation between first- and second-time test scores of lag-d repeaters, or cor(y_1,d, y_2,d)
${\tilde{r}}_{d}$	Hypothetical sample correlation between first- and second-time test scores of a random sample of lag-d repeaters, or cor$(y_{1}, {\tilde{y}}_{2, d})$
ρ_d	Population correlation between first- and second-time test scores of lag-d repeaters, or cor$(y_{1}, {\tilde{y}}_{2, d})$ as N → ∞
${\hat{ρ}}_{0}^{wgt}$	Weighted sample correlation between first- and second-time test scores of lag-0 repeaters
${\hat{ρ}}_{0}^{bma}$	Model-averaged weighted sample correlation between first- and second-time test scores of lag-0 repeaters

Inference Problem

We describe the inference problem by first defining r_d = cor(y_1,d, y_2,d), the observed sample correlation between the first- and second-time test scores for the lag-d repeaters. We then define notation for certain counterfactual test scores that correspond to hypothetical measurement scenarios. Specifically, suppose that all individuals in the sample of first-time test takers were retested after d days, and let ${\tilde{y}}_{2, d}$ be the (N × 1) vector of counterfactual (hypothetical) second-time test scores. Then, define ${\tilde{r}}_{d} = cor (y_{1}, {\tilde{y}}_{2, d})$, the unobserved sample correlation between the first-time test scores and the counterfactual second-time test scores at lag-d. Finally, denote the probability limit of ${\tilde{r}}_{d}$ as N → ∞ by ρ_d. In general, ρ_d will vary as a function of d because any processes that impact proficiency and evolve differently across test takers over time can impact the test-retest correlations.

The problem we are trying to address is that the observed test-retest correlations r_d for d = 0, 1, …, d_max may be inconsistent (asymptotically biased) estimators of ρ_d for each d = 0, 1, …, d_max. This bias can occur because not all people choose to retest, and the decision about whether and when to retest may be related to latent proficiency, test scores, background variables, or all three. This can cause sample moments of the observed data from each lag-d repeater sample to differ from those that would be obtained if all first-time test takers retested after d days. Consequently, we expect differences between r_d and ${\tilde{r}}_{d}$ to persist as N increases.
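The persistence of this bias can be illustrated with a small simulation. The sketch below is not the simulation reported later in the article; it assumes standardized scores, a population reliability of 0.9, and a hypothetical logistic rule under which test takers with lower first scores are more likely to retest. All names and parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200_000
rho = 0.9                                   # assumed population test-retest correlation

# True scores with variance rho and errors with variance 1 - rho give
# observed scores with unit variance and cor(y1, y2) = rho in the population.
true = rng.normal(0.0, np.sqrt(rho), n)
err_sd = np.sqrt(1.0 - rho)
y1 = true + rng.normal(0.0, err_sd, n)      # first-time scores
y2 = true + rng.normal(0.0, err_sd, n)      # counterfactual retest scores for everyone

# Hypothetical self-selection: the probability of retesting falls as y1 rises.
p_repeat = 1.0 / (1.0 + np.exp(2.0 * y1 + 1.0))
repeats = rng.random(n) < p_repeat

r_full = np.corrcoef(y1, y2)[0, 1]
r_repeaters = np.corrcoef(y1[repeats], y2[repeats])[0, 1]
print(f"all test takers: {r_full:.3f}, self-selected repeaters: {r_repeaters:.3f}")
```

Because selection on the first score narrows the score distribution among repeaters, the repeater correlation stays below the full-sample correlation no matter how large n becomes.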

We define the target estimand as ρ₀, the vector of test-retest correlations for test scores that would be observed in an arbitrarily large random sample of first-time test takers who were retested on the same day. Although any d could be used to define the target estimand, we choose d = 0 because the risk of carry-over effects in a high-stakes testing environment is minimized if test takers repeat on the same day.

Estimation Methods

In this section, we develop two methods for estimating ρ₀ from the observed data. Both rely on weighting, which we describe first. We then discuss the two estimators.

Weighting

One approach for estimating ρ_d for a given value of d is to weight the test takers in the lag-d repeater sample so they resemble all first-time test takers with respect to initial scores y₁ and auxiliary data Z₁. The intent is that the weighted lag-d repeater sample serves as a proxy for data from the hypothetical experiment in which a random sample of first-time test takers is retested after d days.

We adopt the weighting method called entropy balancing (Hainmueller, 2012) or minimum discriminant information adjustment (Haberman, 1984), referred to as EB/MDIA. This method produces a set of case weights for an observed sample such that (1) the weighted moments (e.g., means and variances) of the observed sample are identical to a given set of target moments, and (2) the weights are as close to uniform (i.e., equal weights) as possible as defined by a distance metric. The latter constraint helps to mitigate variance inflation due to weighting.

For a given value of d, we consider the observed background variables and the observed first- and second-time test scores of lag-d repeaters, {Z_1,d, y_1,d, y_2,d}. We wish to weight this sample such that it more closely resembles the target sample of all first-time test takers with respect to background characteristics and initial test scores, {Z₁, y₁}. Specifically, our application of EB/MDIA computes a vector of non-negative weights w_d for the N_d lag-d repeaters satisfying the following constraints:

$$\begin{aligned}
w_{d}^{T} 1 &= 1, \\
w_{d}^{T} y_{1, d} &= N^{-1} 1^{T} y_{1}, \\
w_{d}^{T} y_{1, d}^{2} &= N^{-1} 1^{T} y_{1}^{2}, \\
w_{d}^{T} Z_{1, d} &= N^{-1} 1^{T} Z_{1}.
\end{aligned} \tag{1}$$

These constraints imply that the weights sum to one and the weighted first and second moments of the first-time scores, as well as the weighted first moments of background variables for the lag-d repeaters, equal the corresponding moments for all first-time test takers. Technical details on how these constraints are implemented to compute w_d are provided in Haberman (1984) and Hainmueller (2012).
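To make the constraints in equation (1) concrete, the following is a minimal sketch of one way such weights can be computed, assuming Kullback-Leibler divergence from uniform weights as the distance metric. In that case the solution has the form w_i ∝ exp(λᵀc_i), and λ can be found by a damped Newton iteration on the convex dual. The function `eb_weights`, the simulated data, and the selection rule are hypothetical; this is not the implementation used in the article or in the cited references.

```python
import numpy as np

def eb_weights(features, targets, max_iter=100, tol=1e-8):
    """Weights w >= 0 with sum(w) = 1 and w @ features = targets,
    as close to uniform as possible in Kullback-Leibler divergence."""
    C = np.asarray(features, dtype=float) - np.asarray(targets, dtype=float)
    lam = np.zeros(C.shape[1])

    def dual(l):                      # log-sum-exp dual objective (convex in l)
        eta = C @ l
        m = eta.max()
        return m + np.log(np.exp(eta - m).sum())

    def weights(l):                   # exponential-tilt weights, normalized to sum to one
        eta = C @ l
        w = np.exp(eta - eta.max())
        return w / w.sum()

    for _ in range(max_iter):
        w = weights(lam)
        grad = C.T @ w                # weighted moments minus target moments
        if np.max(np.abs(grad)) < tol:
            break
        hess = (C * w[:, None]).T @ C - np.outer(grad, grad)
        step = np.linalg.solve(hess, grad)
        t, f0 = 1.0, dual(lam)        # backtracking line search for stability
        while dual(lam - t * step) > f0 - 1e-4 * t * (grad @ step) and t > 1e-8:
            t *= 0.5
        lam = lam - t * step
    return weights(lam)

# Hypothetical data: standardized first scores and one background indicator.
rng = np.random.default_rng(1)
n = 20_000
y1 = rng.normal(size=n)
z1 = (rng.random(n) < 0.4).astype(float)
repeats = rng.random(n) < 1.0 / (1.0 + np.exp(1.5 + y1 - z1))   # assumed selection rule

features = np.column_stack([y1[repeats], y1[repeats] ** 2, z1[repeats]])
targets = np.array([y1.mean(), (y1 ** 2).mean(), z1.mean()])
w_d = eb_weights(features, targets)
print(w_d.sum(), w_d @ features, targets)
```

The sketch works with standardized scores because squaring raw scores on a 10–160 scale can make the Newton system poorly conditioned, and it assumes the balancing variables are not collinear.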

Given the weights w_d, we can estimate ρ_d with a weighted Pearson’s correlation, denoted wCor(⋅). Thus, we define

$${\hat{ρ}}_{d}^{wgt} = \mathrm{wCor}(y_{1, d}, y_{2, d}; w_{d}). \tag{2}$$
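A weighted Pearson correlation of the kind used in equation (2) can be sketched as follows. The function name `wcor` is a hypothetical stand-in for wCor(⋅), and the example data are simulated for illustration only.

```python
import numpy as np

def wcor(x, y, w):
    """Weighted Pearson correlation of x and y with case weights w."""
    x, y, w = (np.asarray(a, dtype=float) for a in (x, y, w))
    w = w / w.sum()                              # normalize so the weights sum to one
    mx, my = w @ x, w @ y                        # weighted means
    cov = w @ ((x - mx) * (y - my))              # weighted covariance
    var_x = w @ (x - mx) ** 2                    # weighted variances
    var_y = w @ (y - my) ** 2
    return cov / np.sqrt(var_x * var_y)

# Simulated scores and arbitrary positive weights (illustrative assumptions).
rng = np.random.default_rng(3)
first = rng.normal(size=500)
second = 0.9 * first + rng.normal(scale=0.4, size=500)
case_weights = rng.random(500)
print(wcor(first, second, case_weights))
```

With equal weights this reduces to the ordinary sample correlation, and rescaling all weights by a constant leaves the result unchanged.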

We now present two estimators of ρ₀ based on this weighting approach.

Method 1: Weighting Lag-0 Repeaters

An estimator of ρ₀ using only the lag-0 repeater sample is ${\hat{ρ}}_{0}^{wgt}$ from equation (2):

$${\hat{ρ}}_{0}^{wgt} = \mathrm{wCor}(y_{1,0}, y_{2,0}; w_{0}). \tag{3}$$

Method 2: Model Averaging

A potential shortcoming of ${\hat{ρ}}_{0}^{wgt}$ is that it may be imprecise. For instance, the lag-0 repeater sample may be small, and any imprecision in ${\hat{ρ}}_{0}^{wgt}$ due to limited sample size may be exacerbated by weighting. We thus consider a second method for estimating ρ₀ that pools data across multiple lag-d repeater samples using Bayesian model averaging (BMA) (Hoeting et al., 1999; Kass & Raftery, 1995).

Specifically, this method assumes that the unobserved sequence $\{ρ_{d}\}_{d = 0}^{d_{max}}$ can be approximated by a polynomial function of d with unknown degree K. We observe the sequence $\{{\hat{ρ}}_{d}^{wgt}\}_{d = 0}^{d_{max}}$ obtained by computing ${\hat{ρ}}_{d}^{wgt}$ from equation (2) for each d = 0, 1, …, d_max.

We use $\{{\hat{ρ}}_{d}^{wgt}\}_{d = 0}^{d_{max}}$ to construct a model-averaged estimator of ρ₀ as follows. For each k ∈ {1, 2, …, K_max}, we regress $\{{\hat{ρ}}_{d}^{wgt}\}_{d = 0}^{d_{max}}$ on polynomial functions of d up to degree k. For k = 1, for example, the regression model includes an intercept and a single regressor equal to d; for k = 2 the model includes an intercept, d, and d², etc. For each value of k we retain two outputs from the regression model: (1) the fitted value of the model at d = 0, denoted by ${\hat{ρ}}_{0, k}$; and (2) the Bayesian Information Criterion (BIC; Kass & Raftery, 1995) for model k, denoted by BIC_k.⁵ Following Kass and Raftery (1995), the posterior probability of model k can be approximated by

$$λ_{k} = \frac{\exp(-\frac{1}{2} \mathrm{BIC}_{k})}{\sum_{k = 1}^{K_{max}} \exp(-\frac{1}{2} \mathrm{BIC}_{k})}.$$

We then compute the model-averaged estimate of ρ₀ by

$${\hat{ρ}}_{0}^{bma} = \sum_{k = 1}^{K_{max}} λ_{k} {\hat{ρ}}_{0, k}. \tag{4}$$
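The model-averaging steps leading to equation (4) can be sketched as below. Two points are assumptions rather than details confirmed by the text: the BIC is computed in the common Gaussian form n·log(RSS/n) + p·log(n), and the regressions use weighted least squares with weights proportional to sample size. The function `bma_estimate_at_zero` and the input values are hypothetical, apart from N₀ = 743.

```python
import numpy as np

def bma_estimate_at_zero(d, rho_wgt, n_d, k_max=4):
    """Approximate BMA of polynomial fits, evaluated at d = 0."""
    d = np.asarray(d, dtype=float)
    rho_wgt = np.asarray(rho_wgt, dtype=float)
    sw = np.sqrt(np.asarray(n_d, dtype=float))   # sqrt of WLS weights
    n = d.size
    fitted0 = np.empty(k_max)
    bic = np.empty(k_max)
    for k in range(1, k_max + 1):
        X = np.vander(d, k + 1, increasing=True)            # columns 1, d, ..., d^k
        beta, *_ = np.linalg.lstsq(X * sw[:, None], rho_wgt * sw, rcond=None)
        rss = np.sum((sw * (rho_wgt - X @ beta)) ** 2)      # weighted residual sum of squares
        fitted0[k - 1] = beta[0]                            # fitted value at d = 0 is the intercept
        bic[k - 1] = n * np.log(rss / n) + (k + 1) * np.log(n)   # assumed BIC form
    lam = np.exp(-0.5 * (bic - bic.min()))                  # subtract the minimum for stability
    lam /= lam.sum()
    return float(lam @ fitted0), lam

# Hypothetical weighted correlations with a slight linear decline over lags 0..15.
rng = np.random.default_rng(2)
d = np.arange(16)
n_d = np.array([743, 1500, 2600, 2900, 2700, 2400, 2100, 1900,
                1600, 1400, 1300, 1200, 1100, 1000, 950, 900])
rho_wgt = 0.92 - 0.001 * d + rng.normal(0.0, 0.002, d.size)

rho_bma, lam = bma_estimate_at_zero(d, rho_wgt, n_d)
print(round(rho_bma, 3), np.round(lam, 3))
```

Subtracting the smallest BIC before exponentiating does not change λ_k, because the common factor cancels in the ratio, but it avoids numerical underflow or overflow.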

The intent of the model-averaged estimate is to improve precision by pooling information across multiple lags, while simultaneously mitigating bias arising from misspecification of the functional form of the regression.

Empirical Example

We demonstrate these procedures for reliability estimation using data from the DET. For the data analyzed here, the assessment consisted of seven sections (five computer adaptive sections and two interview sections). Scores from these seven sections were used to compute an overall score and four subscores (comprehension, conversation, literacy, and production), each reported on a scale of 10–160 in increments of five points (LaFlair, 2020; LaFlair & Settles, 2020). We consider reliabilities of the five scores reported on this scale. Where applicable, we provide standard deviations of these scores to aid interpretation.

Data Details

We use data from N = 293,229 unique test takers who received certified DET scores (i.e., no rule violations or technical difficulties invalidated their test sessions). We define the target population for which we want to estimate DET overall and subscore reliabilities as first-time test takers. We thus treat the first test session from each of these unique test takers as a random sample of sessions from this target population. We refer to this sample as the target sample. Among test takers in the target sample, 40,757 (14%) took the DET a second time within d_max = 30 days of the initial assessment.⁶ We refer to these test takers as the 30-day repeater sample.

Descriptive Statistics

Figure 1 provides counts of repeaters for each lag-d repeater sample (N_d). There is large variability in the number of test takers across lags, with lags of two to seven days being most frequent.

Figure 1.

Number of test takers for each lag-d repeater sample for d = 0, …, 30.

The background characteristics of the 30-day repeater sample as a whole are substantially different from those of the target sample, and within the 30-day repeater sample, there are large differences in background characteristics across the lag-d repeater samples. These differences are demonstrated in Figure 2 for DET scores (e.g., mean and standard deviation of first-time DET overall scores) and Figure 3 for select background variables (e.g., percentage of test takers from China and India). The horizontal dotted gray line in each sub-figure shows the corresponding average for test takers in the target sample, or all first-time test takers.

Figure 2.

Mean and standard deviation (SD) of first-time DET overall score and means of first-time DET subscores in SD units of the target population (all first-time test takers). Gray dotted lines represent target population means/SD.

Figure 3.

Means of select test-taker background variables for each lag-d repeater sample. Gray dotted lines represent target population (all first-time test taker) means. TOEFL and IELTS scores are in SD units of the target population.

In particular, the 30-day repeater sample consists of test takers who have substantially lower DET, self-reported TOEFL, and self-reported IELTS scores compared to the target sample, and thereby lower DET score variance. They are also more likely to come from China and not from India, to take the DET for graduate school admission, and to be young and female. Within the 30-day repeater sample, test takers who repeat rapidly, such as on the same day or within a couple of days, tend to be most different from other repeaters.

Test-Retest Correlations

Among test takers in the 30-day repeater sample, the unadjusted correlation between the first- and second-time overall scores is 0.84. The corresponding unadjusted correlations for the comprehension, conversation, literacy, and production subscores are 0.81, 0.83, 0.80, and 0.81, respectively. Because the 30-day repeater sample is a more homogeneous, lower-achieving subpopulation than the target sample, these correlations are likely to be negatively biased estimates of overall and subscore reliabilities. Specifically, the first-time scores for the 30-day repeater sample have means ranging from 0.34 to 0.46 standard deviation units below the corresponding means in the target sample, and have standard deviations that are between 78% and 83% as large as the corresponding standard deviations in the target sample.

The black curve in Figure 4 provides the unadjusted test-retest correlation for the overall DET score, for each of the lag-d repeater samples. The correlation is highest for people who took the test twice on the same day and then rapidly decays as d increases. Similar patterns occur for all DET subscores (black curves in Figure 5).

Figure 4.

Weighted and unweighted test-retest correlations for each lag-d repeater sample for d = 0, …, 30.

Figure 5.

Unweighted (black) and weighted (gray) test-retest correlations for each lag-d repeater sample, separately by DET subscore.

Reliability Estimation

For each lag-d sample, we computed EB/MDIA weights to make the sample match 38 attributes of the target sample. The variables in Z₁ included indicator (dummy) variables for (a) the ten most frequent native language/country combinations from the target sample (which account for approximately 50% of the target sample); (b) intention to apply to graduate school; (c) intention to apply to undergraduate school; (d) female; (e) use of the Windows operating system for the first DET administration; (f) whether TOEFL scores were unreported by the test taker; and (g) whether IELTS scores were unreported by the test taker. We also included test-taker age, TOEFL overall scores, and IELTS overall scores. Given the inclusion of (f) and (g), TOEFL and IELTS scores were set to zero when they were not reported. Collectively, these background variables account for 19 of the 38 adjustment variables.

The remaining 19 adjustment variables are based on test-takers’ DET scores from the first administration. Although we are interested in the reliability of the overall score and four subscores, each of these reported scores is a linear combination of all or a subset of the seven DET section scores. Weights that match the first moments of these section scores also match the first moments of all reported scores because the reported scores are linear functions of the section scores. The 19 adjustment variables consist of the first and second moments of the section scores (14), and the second moments of the reported scores (5). We note that all adjustment variables were based on their marginal moments.

After computing the EB/MDIA weights for each lag-d sample, we computed ${\hat{ρ}}_{d}^{wgt}$ in equation (2). These weighted correlations for the DET overall score by lag are also displayed in Figure 4 (gray curve). Compared to the unweighted results (black curve), the weighted results suggest that most of the changes in test-retest correlations across different lag-d samples are due to differences in the background characteristics of the test takers at each lag. Specifically, the sharp decline in the unweighted correlations is largely flattened by adjusting for test takers’ background variables. Similar patterns occur for all four subscores, shown in Figure 5.

While weighting mitigates the rapid decline in correlations as the time between tests increases, there is still evidence of a slight decline in the weighted correlations for the DET overall score in Figure 4 and the subscores in Figure 5. Such decline may be consistent with heterogeneous ability changes among test takers, with variability that increases with the time between tests.

A straightforward way to avoid bias due to heterogeneous changes is to focus on lag-0 repeaters by applying the estimator in equation (3). Unfortunately, the lag-0 repeater sample size is relatively small (N₀ = 743), and the MDIA weights required to make this sample more closely resemble the target sample have a relatively large design effect (DEFF; Kish, 1965). A DEFF greater than 1 tends to cause larger sampling variance of the weighted statistic than would exist for a statistic based on a random sample of the same size from the target population. For instance, the DEFF of weighting the lag-0 repeater sample is 2.01 (see Figure 6), such that the sample of N₀ = 743 repeaters at lag-0 has sampling variability of the weighted test-retest correlation that is approximately the same as that of a random sample from the target population of size 743/2.01 ≈ 370 people. This leads to a relatively wide confidence interval for the test-retest correlation in the target population as estimated by the weighted sample of lag-0 repeaters.
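The design effect and the effective sample size quoted above can be sketched as follows. The article does not spell out its DEFF formula, so using Kish's common approximation for unequal weights, n·Σw² / (Σw)², is an assumption here. The function names are hypothetical.

```python
import numpy as np

def kish_deff(w):
    """Kish's approximate design effect for unequal weights: n * sum(w^2) / sum(w)^2."""
    w = np.asarray(w, dtype=float)
    return w.size * np.sum(w ** 2) / np.sum(w) ** 2

def effective_n(w):
    """Effective sample size implied by the design effect."""
    return len(w) / kish_deff(w)

# Equal weights carry no penalty; unequal weights shrink the effective sample.
print(kish_deff(np.ones(10)))            # 1.0
print(kish_deff([1.0, 3.0]))             # 1.25

# Arithmetic from the text: N0 = 743 lag-0 repeaters with a reported DEFF of 2.01.
print(743 / 2.01)                        # roughly 370 effective test takers
```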

Figure 6.

Design effects for MDIA weights. A design effect greater than 1 indicates a smaller effective sample size when using each set of weights. Presumably, larger design effects are observed with more days between tests because of learning/practice effects.

To improve precision, we pool data across lags using the model averaging procedure described previously. Figure 7 depicts the procedure. We focus on the lag-d repeater samples for d = 0, 1, …, 15, using d_max = 15 to avoid the influence of potential learning/practice effects, and then regress the weighted test-retest correlations on d using four different polynomial regression models: linear (degree 1), quadratic (degree 2), cubic (degree 3), and quartic (degree 4). Each of the k = 1, 2, 3, 4 regression models is estimated using weighted least squares, with weights proportional to the repeater sample size at each lag. Each of the four fitted models is then used to estimate the correlation at d = 0, and the fitted values are averaged using BIC-based weights according to equation (4).

Figure 7.

Polynomial regression models fitted to the weighted correlations of the lag-d repeater samples for d = 0, 1, …, 15.

Results

Estimated test-retest reliabilities are summarized in Figure 8. For each of the five DET scores (the overall and four subscores), three estimates are provided: (a) the weighted test-retest correlation for the lag-0 repeater sample; (b) the model-averaged estimate using lags 0, 1, …, 15; and (c) the model-averaged estimate using lags 0, 1, …, 10 (as a sensitivity check). The vertical bars around each estimate are 95% confidence intervals based on standard errors estimated using 2500 independent bootstrap samples of test takers (Efron & Tibshirani, 1993) and applying the entire estimation procedure to each sample.⁷
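The bootstrap logic can be sketched generically as below. This is a simplified illustration, not the article's procedure: it resamples rows of a small simulated data set and uses a plain correlation as a stand-in estimator, whereas the article reapplies the entire weighting and model-averaging pipeline to each bootstrap sample. The function `bootstrap_se` and the data are hypothetical.

```python
import numpy as np

def bootstrap_se(estimator, data, n_boot=2500, seed=0):
    """Standard error of estimator(data) from resampling rows with replacement."""
    rng = np.random.default_rng(seed)
    n = data.shape[0]
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)     # resample test takers, keeping score pairs together
        stats[b] = estimator(data[idx])
    return stats.std(ddof=1)

def score_correlation(arr):
    """Stand-in estimator: correlation between first and second scores."""
    return np.corrcoef(arr[:, 0], arr[:, 1])[0, 1]

# Simulated score pairs with a population correlation of 0.9 (illustrative assumption).
rng = np.random.default_rng(5)
true = rng.normal(0.0, np.sqrt(0.9), 500)
scores = np.column_stack([true + rng.normal(0.0, np.sqrt(0.1), 500),
                          true + rng.normal(0.0, np.sqrt(0.1), 500)])

est = score_correlation(scores)
se = bootstrap_se(score_correlation, scores, n_boot=1000)
ci_95 = (est - 1.96 * se, est + 1.96 * se)   # normal-approximation interval from the bootstrap SE
print(est, se, ci_95)
```

Resampling whole test takers rather than individual scores preserves the pairing of first and second scores, which is what the correlation depends on.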

Figure 8.

Confidence intervals of test-retest coefficients for DET overall and the four subscores (comprehension, conversation, literacy and production). Lag 0 refers to the weighted correlation of the lag-0 repeater sample. Lags 0–15 and Lags 0–10 refer to the model-averaged weighted correlations across the first 15 or 10 lags, respectively.

Consistent with previous discussion, the confidence intervals using only the weighted lag-0 repeater sample are relatively wide. Using regression models to pool data across multiple lag-d repeater samples substantially improves precision. Restricting the data used in these models to d ≤ 15 or d ≤ 10 has little impact on the estimated reliabilities. The fact that the regression-based estimates are higher than the estimates based only on lag-0 data is not necessarily evidence of bias of the former, as all scores are derived from the same seven component scores and thus have correlated estimation errors.

Sensitivity Analysis

The estimated reliabilities were robust to an application of the methods using different time scales to measure the lag between the first and second tests. For example, we could define “lag-0” repeaters to be those who took the second test 0 or 1 days after the first, “lag-1” repeaters to be those who took the second test 2 or 3 days after the first, etc. This measures the time between tests in two-day increments. We considered such aggregation of the data into d-day increments for d = 2, …, 10. We restricted the data to repeaters who took the second test within 60 days of the first for all of these analyses. This implies that the number of lags in the analysis depends on how many days are bundled into a single lag. For example, when defining time in two-day increments, the 60-day repeaters provide 30 lags of data, whereas when defining time in ten-day increments, they provide only 6 lags of data. To simplify the analysis and accommodate such small numbers of lags, we applied the BMA procedure to all conditions using only linear and quadratic models. For the DET overall score, the estimated reliability ranged from 0.92 to 0.93 as lags ranged from two-day increments up to ten-day increments, consistent with the values using one-day increments in Figure 8. Results for the subscores were similarly robust: estimates of 0.90 were obtained for both comprehension and conversation across two-day to ten-day increments, estimates of 0.89 were obtained for literacy, and estimates ranging from 0.87 to 0.88 were obtained for production.
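The lag aggregation described above amounts to integer division of the day count by the increment width. The sketch below assumes that "within 60 days" corresponds to day counts 0 through 59; the function name `aggregate_lag` is hypothetical.

```python
import numpy as np

def aggregate_lag(days, width):
    """Map days between tests to a coarser lag index using width-day increments."""
    return np.asarray(days) // width

days = np.arange(60)                       # assumed range: 0..59 days between tests
print(aggregate_lag([0, 1, 2, 3], 2))      # two-day increments: [0 0 1 1]
for width in (2, 10):
    n_lags = np.unique(aggregate_lag(days, width)).size
    print(width, n_lags)                   # 2 -> 30 lags, 10 -> 6 lags
```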

Simulation

The empirical example suggests that the proposed methods can be useful for reducing bias and imprecision in estimated reliabilities. However, it is impossible to know whether these methods work as intended with empirical data alone. In this section we evaluate performance of the methods with simulated data.

Setup

We based key elements of the simulation on the empirical example to improve authenticity. Using notation from the Model Framework section (see Table 1), the following elements were fixed in the simulation: d (number of days between first- and second-time test scores), d ≤ d_max = 30 (maximum lag-d considered was 30), N (number of first-time test takers), N_d (number of second-time test takers for each lag-d), Z₁ (background variables for all first-time test takers), Z_1,d (subset of background variables for second-time test takers by lag-d), and finally, which test takers repeated at each lag-d. We also set ρ_d = .9 (i.e., the true test-retest correlation) for all lags, d ≤ d_max = 30.

We then simulated N true scores, t, using the following two-step procedure: (1) the empirical first-time test scores, y₁, were regressed on the background variables of all first-time test takers, Z₁; (2) the predicted values, ${\hat{y}}_{1}$, and the estimated residual variance, ${\hat{σ}}_{e}^{2}$, were then used as the case-specific means and overall variance, respectively, in a random number generator for the normal distribution.

Next, we simulated N first-time test scores, s₁, and N second-time test scores, s_2,d, for every lag-d by using the simulated true scores, t, and the empirical variance of the first-time test scores, $σ_{y_{1}}^{2}$ , as the case-specific means and overall variance, respectively, in a random-normal generator. Note that in contrast to the simulated data, we did not have access to N second-time test scores at every lag-d in the empirical example; rather, we only had the subsets of scores depending on who decided to repeat at each lag-d. We confirmed that the “true” test-retest correlations between the simulated first-time test scores and every set of lag-d second-time test scores (denoted ${\tilde{r}}_{d}$ in the empirical example) equaled .9, within rounding error.

The final step before applying the reliability estimation methods to the simulated data was to remove the (simulated) second-time test scores for cases in which the corresponding empirical test scores were absent. That is, the simulated data set included the same number of repeaters, in the same positions, at each lag d.
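This masking step amounts to keeping a simulated second-time score only where an empirical one exists. A minimal sketch, with a hypothetical eight-person sample and an invented retest pattern, might look like this:

```python
import random

random.seed(2)

n, d_max = 8, 2

# Hypothetical simulated second-time scores for every test taker at every lag.
s2 = {
    d: [round(random.gauss(100.0, 15.0), 1) for _ in range(n)]
    for d in range(d_max + 1)
}

# Hypothetical empirical pattern: the lag at which each person actually retested
# (None means the person did not repeat within d_max days).
observed_lag = [0, None, 2, 1, None, 0, 2, None]

# Keep a simulated score only where an empirical second score exists at that lag.
s2_masked = {
    d: [score if observed_lag[i] == d else None for i, score in enumerate(scores)]
    for d, scores in s2.items()
}

# Number of repeaters retained at each lag (the N_d of the simulation).
n_d = {d: sum(v is not None for v in vals) for d, vals in s2_masked.items()}
```

The masked data therefore inherit the empirical selection pattern, while the unmasked data retain the known correlation against which the estimators are judged.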

Results

Figure 9 presents weighted and unweighted test-retest correlations of the simulated data as well as the polynomial-fitted models. Notably, the EB/MDIA-adjusted correlations hover around the true test-retest correlations at each lag-d (ρ_d = .9), whereas the unweighted correlations decline from .88 to .81. At lag-0, the weighted test-retest correlation $({\hat{ρ}}_{0}^{wgt})$ equals .907, compared to .879 for the unweighted correlation, and the BMA weighted correlation $({\hat{ρ}}_{0}^{bma})$ equals .901.
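For readers who want to see what separates the two sets of correlations computationally, the following is a minimal sketch of a weighted Pearson correlation. The scores and weights are toy values; in the analysis the weights would come from the EB/MDIA step, which is not reproduced here.

```python
def weighted_corr(x, y, w):
    """Weighted Pearson correlation with non-negative weights w."""
    total = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / total
    my = sum(wi * yi for wi, yi in zip(w, y)) / total
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    syy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y))
    return sxy / (sxx * syy) ** 0.5

# Toy data (hypothetical): first and second scores for five repeaters.
first = [95.0, 110.0, 120.0, 130.0, 105.0]
second = [100.0, 108.0, 125.0, 128.0, 111.0]

# Equal weights reproduce the ordinary (unweighted) correlation.
r_unweighted = weighted_corr(first, second, [1.0] * 5)

# Unequal weights up-weight repeaters who are under-represented in the sample.
r_weighted = weighted_corr(first, second, [0.4, 0.1, 0.1, 0.3, 0.1])
```

Weights that restore the spread of scores found in the full first-time population counteract the range restriction that self-selection induces, which is one reason an unweighted repeater correlation can understate the population value.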

Figure 9. On the left: unweighted, weighted, and true test-retest correlations for each lag-d simulated sample for d = 0, …, 30. On the right: polynomial regression models fitted to the weighted correlations of the lag-d simulated samples for d = 0, 1, …, 15.
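The BMA estimate combines lag-0 predictions from several fitted polynomial models. One common way to form such weights is the BIC approximation to posterior model probabilities, in which each model's weight is proportional to exp(−BIC/2) (Raftery, 1995). The sketch below assumes that approximation, which may differ in detail from the computation behind the reported estimates. The function name, candidate estimates, and BIC values are invented for illustration and are not read from Figure 9.

```python
import math

def bma_estimate(estimates, bics):
    """Average model-specific estimates using BIC-approximated model weights."""
    best = min(bics)
    # Subtract the smallest BIC before exponentiating for numerical stability.
    raw = [math.exp(-0.5 * (b - best)) for b in bics]
    total = sum(raw)
    weights = [r / total for r in raw]
    averaged = sum(w * e for w, e in zip(weights, estimates))
    return averaged, weights

# Hypothetical lag-0 predictions from three polynomial models and their BICs.
lag0_estimates = [0.905, 0.899, 0.902]
model_bics = [-120.0, -121.4, -118.9]

rho0_bma, model_weights = bma_estimate(lag0_estimates, model_bics)
```

The averaged value always lies between the smallest and largest candidate estimates, and the model with the lowest BIC receives the largest weight.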

Discussion

The empirical analysis showed that test-retest correlations using 30-day repeater samples, with no adjustment for either sample selection or heterogeneous skill changes, appear to be negatively biased estimates of test-retest reliability. The unadjusted correlations for the five reported DET scores all fall below the least-favorable 95% confidence interval produced using our methods. This finding is consistent with the idea that adjusting repeater samples to account for sample selection and learning/practice dynamics can provide more realistic estimates of test-retest reliability. The reliability estimators proposed above provide one straightforward way of accomplishing this adjustment, a claim further supported by the simulated data analysis.

Choosing among the different reliability estimators presented in Figure 8 is less straightforward. Estimates based on the weighted lag-0 repeater samples are compelling, but the estimation errors are large enough to greatly limit their value. For the DET, however, we believe that the model-based estimates, which pool repeater data across multiple day-lags, provide more credible estimates given the available data and evidence.

Although our empirical example and simulation focused on scores from the DET, the proposed framework and reliability estimation methods generalize to any error-prone measure taken repeatedly over time for the same individuals, who choose whether and when to be assessed, and whose true value on the measure may change over time. For instance, these methods could be applied to measures (e.g., blood pressure readings) that are not typically formed by aggregating information across discrete items and thus cannot rely on internal consistency (e.g., Cronbach’s alpha) or item response theory to estimate reliability.

Limitations

The ability of the proposed methods to produce accurate reliability estimates depends on the ability of the weighting step to properly account for selection bias. In our application to DET data, we used 34 test-taker background variables that were plausibly related to a test-taker’s decision to retest and when to retest. These variables were available through a combination of actively and passively collected data. However, other variables which were not collected and thus omitted from the weighting function could have affected our ability to recover the true target moments. Some uncollected variables were (1) test-takers’ perceptions of whether they could achieve a better score if they retook the test, (2) test-takers’ submission deadlines relative to their first test date, (3) average cut scores of test-takers’ target institutions relative to their first test score, and (4) test-takers’ socioeconomic status with respect to their ability to purchase another test credit. Without having access to these variables, among other unknown confounders, it is difficult to determine whether our approach accounted for all possible sources of omitted variable bias.
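To make the dependence on observed variables concrete, the following sketch balances a single background variable using weights of the exponential-tilting form that entropy balancing produces under uniform base weights. It is a one-variable, one-moment special case solved by bisection, not the EB/MDIA procedure applied to the 34 DET variables; the function name, the data, and the selection rule are all hypothetical.

```python
import math
import random

random.seed(3)

def tilt_weights(z, target_mean, lo=-50.0, hi=50.0, tol=1e-10):
    """Weights proportional to exp(lam * z_i) whose weighted mean of z equals target_mean."""
    def weights(lam):
        m = max(lam * zi for zi in z)  # subtract the max for numerical stability
        raw = [math.exp(lam * zi - m) for zi in z]
        total = sum(raw)
        return [r / total for r in raw]

    def wmean(lam):
        return sum(w * zi for w, zi in zip(weights(lam), z))

    # The weighted mean increases monotonically in lam, so bisection applies.
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if wmean(mid) < target_mean:
            lo = mid
        else:
            hi = mid
    return weights(0.5 * (lo + hi))

# Hypothetical data: one standardized background variable for all first-time
# test takers; people with lower values are more likely to retest.
z_all = [random.gauss(0.0, 1.0) for _ in range(2000)]
z_rep = [z for z in z_all if random.random() < 1.0 / (1.0 + math.exp(2.0 * z))]

# Target moment: the mean of the variable among all first-time test takers.
target = sum(z_all) / len(z_all)
w = tilt_weights(z_rep, target)
balanced_mean = sum(wi * zi for wi, zi in zip(w, z_rep))
```

The weights match the repeater sample to the target mean of the variable they are given. A variable that drives retesting but is absent from the weighting function is balanced only to the extent that it is related to the variables that are included.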

Another limitation of the paper is that we focused on one empirical application and a simulation analysis based on the same data. Additional applications and a more thorough simulation study would provide stronger evidence about possible benefits and pitfalls of the methods across settings. Some questions that could be evaluated include (1) What are the effects of omitted background variables on the recovery of suitable weights to match the observed moments to the target moments? (2) Does the magnitude of self-selection bias and learning/practice effects affect the ability of weighting to recover target moments? (3) How robust are inferences to alternative methods for computing weights, such as general calibration weighting from the survey sampling literature (Deville & Särndal, 1992), propensity score methods (Imai & Ratkovic, 2014; Rosenbaum & Rubin, 1983), and DEFF-minimizing weights (Zubizarreta, 2015)?

Conclusion

Given that the “need for precision [of measurement] increases as the consequences of decisions and interpretations grow in importance” (AERA, APA, and NCME, 2014), it is critical for high-stakes assessments to estimate test-retest reliability as accurately as possible. We believe the methods developed here, despite the aforementioned limitations, are a promising way to estimate reliability using observational data from test repeaters.

Footnotes

Declaration of Conflicting Interests

The author(s) declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: Both authors work at Duolingo on the Duolingo English Test.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

William C. M. Belzak

References

AERA, APA, and NCME. (2014). Standards for educational and psychological testing. American Psychological Association.

Crocker & Algina (1986). Introduction to classical and modern test theory. Holt, Rinehart and Winston.

Deville, J.-C., & Särndal, C.-E. (1992). Calibration estimators in survey sampling. Journal of the American Statistical Association, 87(418), 376–382. https://doi.org/10.1080/01621459.1992.10475217

Efron & Tibshirani (1993). An introduction to the bootstrap. Chapman & Hall.

Haberman, S. J. (1984). Adjustment by minimum discriminant information. Annals of Statistics, 12(3), 971–988. https://doi.org/10.1214/aos/1176346715

Haberman, S. J., & Yao (2015). Repeater analysis for combining information from different assessments. Journal of Educational Measurement, 52(2), 223–251. https://doi.org/10.1111/jedm.12075

Hainmueller (2012). Entropy balancing for causal effects: A multivariate reweighting method to produce balanced samples in observational studies. Political Analysis, 20(1), 25–46. https://doi.org/10.1093/pan/mpr025

Hoeting, J. A., Madigan, Raftery, A. E., & Volinsky, C. T. (1999). Bayesian model averaging: A tutorial (with comments by M. Clyde, David Draper and E. George, and a rejoinder by the authors). Statistical Science, 14(4), 382–417.

Imai & Ratkovic (2014). Covariate balancing propensity score. Journal of the Royal Statistical Society: Series B, 76(1), 243–263. https://doi.org/10.1111/rssb.12027

Kass & Raftery (1995). Bayes factors. Journal of the American Statistical Association, 90(430), 773–795. https://doi.org/10.2307/2291091

Kish (1965). Survey sampling. John Wiley & Sons.

LaFlair (2020). Duolingo English test: Subscores [Duolingo Research Report DRR-20-03]. https://go.duolingo.com/subscorewhitepaper

LaFlair & Settles (2020). Duolingo English test: Technical manual [Duolingo Research Report]. https://go.duolingo.com/dettechnicalmanual

Lord, F. M., & Novick, M. R. (1968). Statistical theories of mental test scores. IAP.

Monfils, L. F., & Manna, V. F. (2021). Time to achieving a designated criterion score level: A survival analysis study of test taker performance on the TOEFL iBT® test. Language Testing, 38(1), 154–176. https://doi.org/10.1177/0265532220940709

Raftery, A. E. (1995). Bayesian model selection in social research. Sociological Methodology, 25, 111–163. https://doi.org/10.2307/271063

Raymond, M. R., Neustel, & Anderson (2007). Retest effects on identical and parallel forms in certification and licensure testing. Personnel Psychology, 60(2), 367–396. https://doi.org/10.1111/j.1744-6570.2007.00077.x

Rosenbaum, P. R., & Rubin, D. B. (1983). The central role of the propensity score in observational studies for causal effects. Biometrika, 70(1), 41–55. https://doi.org/10.1093/biomet/70.1.41

Settles, LaFlair, & Hagiwara (2020). Machine learning–driven language assessment. Transactions of the Association for Computational Linguistics, 8, 247–263. https://doi.org/10.1162/tacl_a_00310

Zhou & Cao (2020). Does retest effect impact test performance of repeaters in different subgroups? ETS Research Report Series, 2020(1), 1–15. https://doi.org/10.1002/ets2.12300

Zubizarreta, J. R. (2015). Stable weights that balance covariates for estimation with incomplete outcome data. Journal of the American Statistical Association, 110(511), 910–922. https://doi.org/10.1080/01621459.2015.1023805

Estimating Test-Retest Reliability in the Presence of Self-Selection Bias and Learning/Practice Effects
