Abstract
Item response theory models allow estimation of participant and group-mean trait scores from responses to a set of items, but estimates can be biased when participants vary in their response style. We illustrate models fit in
Introduction
Item response theory (IRT) models allow estimation of participant and group-mean trait scores from responses to a set of items (questions). The possible responses, or response set, are almost always ordered categories. IRT is widely used in education for measuring student attainment (for example, https://www.oecd.org/pisa/data/pisa2018technicalreport/), where the coded response set is typically correct/incorrect. In many other settings, respondents themselves grade the response, often using a Likert scale where, for example, for the question “Do you worry about everyday things?”, the available alternatives could be never, rarely, sometimes, often, or always (Bluett-Duncan et al. 2024). Interpreting differences in questionnaire responses from different respondents usually requires that we assume the respondents share a common understanding of both the questions and the meaning of the set of response options they are offered. Do respondents agree on what constitutes worry rather than mere thought? Do respondents all share the same understanding of how rare “rarely” is? The assumption of a common understanding becomes increasingly difficult to make when respondents differ in their life experiences, for example, when we compare generations or populations from different cultural backgrounds or countries.
Standard IRT modeling is well covered by Stata’s own
AVs were proposed by King et al. (2004) to solve this problem in relation to the assessment of individual political efficacy. These vignettes are commonly presented as text but have also been presented orally, in pictures (Jordans et al. 2020), or through video clips (Young et al. 2020). The AV method is predicated on two measurement assumptions: vignette equivalence (all participants consider the vignettes to represent the same level of the trait) and response consistency (the same scale is used for self and vignette responding) (d’Uva et al. 2011). Rabe-Hesketh and Skrondal (2002) showed how
IRT models for both trait and response style
Factor model for polytomous data
Within the family of the generalized linear latent variable models (Bartholomew, Knott, and Moustaki 2011; Rabe-Hesketh, Skrondal, and Pickles 2004), factor models for polytomous data and IRT models provide a powerful framework for the analysis of questionnaire responses. The various ways in which factor models for polytomous data can be implemented in Stata have been previously described (Raykov and Marcoulides 2018; Grant et al. 2017; Zheng and Rabe-Hesketh 2007). For the context of our empirical work in cross-cultural comparison of self-report scores, we focus here on a graded membership ordinal-response probit model.
Let y_{hij} denote the response to item i (i = 1, …, p) of individual j (j = 1, …, N) from group h (h = 1, …, H). Then the latent trait position θ_{hj} of individual j, from group h, is assumed to follow a conditionally normal distribution G, with a common variance

Figure: Distribution of y* with response category thresholds for a three-category response set
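In a sketch consistent with this description (our notation, not necessarily that of the original equations), with λ_i the loading of item i, κ_{i1} < … < κ_{i,K−1} its thresholds, and μ_h the group mean of the trait, the underlying continuous response is

  y*_{hij} = λ_i θ_{hj} + ε_{hij},  θ_{hj} ~ N(μ_h, σ²),  ε_{hij} ~ N(0, 1),

and the observed category is determined by the thresholds,

  y_{hij} = k  if  κ_{ik} < y*_{hij} ≤ κ_{i,k+1},  k = 0, …, K−1,

with κ_{i0} = −∞ and κ_{iK} = +∞.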
We now consider a setting where, in addition to a self-rating for each item, we have ratings of AVs. For each item, we may have multiple vignettes reflecting various levels of the trait under investigation. We treat variation in the ratings of the same vignette between participants as reflecting variation in response style or bias, which is modeled as another latent trait, η (
For each participant’s item responses, two latent traits now contribute to the unobserved continuous score distribution for each item, the core trait and the response style trait, extending (2) to a two-factor model
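In the same sketch notation, with η_{hj} the response style factor and λ^(η)_i its loading on item i, the two-factor extension is

  y*_{hij} = λ_i θ_{hj} + λ^(η)_i η_{hj} + ε_{hij},

with the observed category again determined by the item thresholds. For a vignette rating, the trait term is replaced by the participant-invariant location of the vignette so that, in our reading of the description above, between-participant variation in vignette ratings reflects only the response style factor and the residual.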

Model for
In the mean-shift (MS) model, differences in bias in group estimates are considered to arise from differences in the mean of the response style factor. The response style is assumed conditionally independent of
While the size of the response style factor and the pattern of its factor loadings across items can be of considerable interest, the principal motivation for interest in AVs has been to account for systematic differences in response style that, in the self-ratings, may be confounded with true differences in the core trait.
On the assumption of common thresholds, (4) and (5) allow us to separate systematic differences in the mean of the response style factor from systematic differences in the self-report factor.
We refer to this model as the mean-shift random (MSR) bias model M4. Restricting σ_η = 0 makes the response bias the same for all individuals with the same values of covariates and group, allowing the bias-adjustment terms to be included in the fixed part of the model. We call this model the MS bias model M2.
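As a sketch of the distinction in our notation, write

  η_{hj} = α_h + ζ_{hj},  ζ_{hj} ~ N(0, σ_η²),

where α_h (or, more generally, a linear predictor in covariates) is the group-specific mean bias. The MSR model M4 estimates both α_h and σ_η; setting σ_η = 0 gives the MS model M2, in which the bias reduces to the fixed shift α_h and can be carried by fixed-effect terms.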
The free-thresholds model
An alternative to restricting rating differences to differences in the bias factor is to allow the thresholds themselves to differ. Rabe-Hesketh and Skrondal (2002), in their implementation of King’s CHOPIT model, allow the spacing between thresholds to be a linear function of covariates via a log-link function. More easily implemented in
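A sketch of that threshold specification, in our notation with covariate vector x_{hj}, is

  κ_{i1,hj} = γ_{i1} + δ_{i1}′ x_{hj},
  κ_{ik,hj} = κ_{i,k−1,hj} + exp(γ_{ik} + δ_{ik}′ x_{hj}),  k = 2, …, K−1,

so the first threshold is linear in the covariates and the log link keeps the spacings between successive thresholds positive.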
To summarize the models:
Model M1 is the standard probit model for ordinal self-report and allows for no reporting bias. Model M2 (MS) allows for a fixed bias (shared by everyone with the same covariates or from the same group) that is uniform across the scale, or thresholds, of each item. Model M3 (FT) allows the bias to differ at different points along the scale of each item. Models M4 (MSR) and M5 (FTR) add individual differences in the strength of this bias (a response style factor) to models M2 and M3, respectively.
Example
Simulated data structures
A primary purpose of this article is to illustrate how different models can or cannot recover unbiased estimates of group differences in trait in the presence of response style variation of different kinds, and how, for some models, this can be achieved with data from novel AV presentation designs that could substantially widen their use. This required us to use simulated data.
Item response generation
The simulated data consisted of p = 4 items, each with K = 3 ordered response categories {0, 1, 2}, obtained by placing thresholds on Gaussian y* variates formed from the sum of contributions from independent Gaussian trait and response style (bias) factors. Two groups of N_h = 500 participants were generated. Each participant provided a self-report and, in most scenarios, additional ratings of up to two AVs for each item. The dataset structure as it appears in Stata (long format) is
The variable id stands for the unique identifier of each of the 1,000 participants. Three dummy variables (
The factor loadings of the response-generating model, relating trait and response style to the underlying response y*, varied across items but were common across groups and were the same for the trait and the response style factors, although this equality was not imposed by the fitted models. The item thresholds (category boundaries) varied across items.
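A minimal Stata sketch of this style of generation for the self-reports is given below; the loadings, thresholds, group shifts, and variable names are illustrative choices of ours, not the values used in the article.

    * Two groups of 500; 1-SD group difference in trait and a 0.5-SD group
    * difference in response style (illustrative values only)
    clear
    set seed 12345
    set obs 1000
    generate id    = _n
    generate grp   = _n > 500
    generate trait = rnormal(grp, 1)
    generate style = rnormal(0.5*grp, 1)
    * Four 3-category items: common loading for trait and style, thresholds 0.5 and 1.5
    forvalues i = 1/4 {
        local lambda = 0.75 + 0.25*`i'
        generate ystar`i' = `lambda'*(trait + style) + rnormal()
        generate y`i' = cond(ystar`i' > 1.5, 2, cond(ystar`i' > 0.5, 1, 0))
    }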
We also simulated data in which, rather than the bias being introduced through the y* distribution, it was introduced as a difference in a threshold, specifically a one-standard-deviation group difference in the first of the two thresholds of each item. This could correspond to a difference in the interpretation of the wording used to define the first and second categories of the response set, perhaps a difference in what is understood by “sometimes”. With only one of the two thresholds differing by group, no MS model would match these data.
We considered four data-generating models:
G1 corresponded to analysis model M1, where there was no bias. G2 corresponded to M2, where the bias took the form of thresholds shifted uniformly. G3 corresponded to M4, where the bias was uniform across thresholds but varied between individuals. G4 corresponded to M5, where bias was present for only one of the two thresholds and the magnitude of that bias varied between individuals.
Vignette presentation design
Properly characterizing the response style of each participant requires that they respond to several vignettes for each item. This can place an excessive burden on participants. If we reduce the number of vignettes, it would seem sensible to present vignettes that are more likely to be close to the thresholds that bound each respondent’s self-report and not to present those that are much milder (lower-trait
Provided vignette selection is based on prior observed responses, the data missing for the unpresented vignettes are missing at random and are ignorable under
Thus, in addition to scenarios with one and two vignettes for each item, we simulated this approach by taking the two-vignette-per-item data with random bias and presenting those with self-report = 0 with the first vignette and those with self-report = 2 with the second vignette. Only those with an intermediate self-report of 1 were presented with both vignettes. This reduced the number of vignette presentations by 39%.
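Under the long-format layout we assume here (one self-report row and up to two vignette rows per participant, flagged by dummies self, vig1, and vig2, with item responses y1–y4), the adaptive rule can be imposed on complete two-vignette data along the following lines; the variable names are ours.

    * Blank out the vignette ratings that would not have been presented:
    * vignette 1 is dropped when the self-report is 2, and vignette 2 is
    * dropped when the self-report is 0; self-report = 1 keeps both.
    forvalues i = 1/4 {
        by id (self), sort: generate byte selfresp`i' = y`i'[_N]  // self row sorted last
        replace y`i' = . if vig1 == 1 & selfresp`i' == 2
        replace y`i' = . if vig2 == 1 & selfresp`i' == 0
    }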
The additional response burden of AVs can be eliminated entirely for the target sample if the AVs are instead presented to an independent sample. To examine such a scenario, we modified our two-vignette data so that, in each group, one half responded only to the self-report and one half responded only to the AVs. No one in these data responded to both the self-report and the AVs.
Thus, we considered five vignette availability scenarios:
D1 corresponded to self-report data only. D2 had self-report and an additional single vignette per item, capable of removing response style bias. D3 had two vignettes per item and was expected to improve precision and bias reduction. D4 selected one or both of the available vignettes adaptively, potentially improving precision further and making the consistency assumption more tenable because vignettes are selected to be closer to the self-report. D5 had self-report and vignettes but not from the same participants, with half the sample providing self-reports and half providing responses to AVs.
Model specification in gsem
The standard bias-naive model
The standard IRT model of (1) allows the mean of the y* distribution to vary by group, reflected in the standard factor model M1 specified in
where
The model fits two threshold parameters for each item, labeled
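A minimal gsem specification along these lines (a sketch, not the article’s exact code), assuming self-report items y1–y4, a group dummy grp, and a self-report indicator self (our variable names), is

    * Sketch only: one-factor ordinal probit model for the self-reports, with
    * the latent trait mean allowed to differ by group.  gsem's default
    * anchoring fixes the first loading (y1 <- TRAIT) at 1, so the group
    * coefficient is expressed on that item's scale.
    gsem (y1 y2 y3 y4 <- TRAIT, oprobit) (TRAIT <- grp) if self == 1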
The MS model
The MS bias model M2 is extended for a single vignette per item by the addition of the fixed effect for dummy variable
The FT model
Model M3 allows nonuniform bias along the scale (that is, at different thresholds) by the use of the multiple group option, with all parameters constrained equal across groups except for thresholds,
The MSR bias model
To allow for random bias variation between participants, we add a second latent variable,
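As a conceptual sketch only (not the article’s specification), suppose a simplified wide layout in which each participant’s ratings of a single vignette per item are held in hypothetical variables v1–v4 alongside the self-reports y1–y4. A mean-shift model with a random response style factor could then be written as

    * Sketch under an assumed wide layout; v1-v4 are hypothetical vignette
    * rating variables.  TRAIT drives the self-reports only, while the response
    * style factor (here called STYLE) drives both self-reports and vignette
    * ratings, with its loadings constrained equal across the two (response
    * consistency).  Both latent means may shift with group; STYLE's scale is
    * set by fixing its item-1 loadings at 1.
    gsem (y1 <- TRAIT STYLE@1,  oprobit) (v1 <- STYLE@1,  oprobit) ///
         (y2 <- TRAIT STYLE@s2, oprobit) (v2 <- STYLE@s2, oprobit) ///
         (y3 <- TRAIT STYLE@s3, oprobit) (v3 <- STYLE@s3, oprobit) ///
         (y4 <- TRAIT STYLE@s4, oprobit) (v4 <- STYLE@s4, oprobit) ///
         (TRAIT <- grp) (STYLE <- grp)

With two vignettes per item, additional vignette rating variables would simply be given STYLE loadings in the same way.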
The FTR bias model
The corresponding FTR bias model M5 can be specified by
Results: Single dataset example and model comparison simulation study
Tips and tricks
The example log file illustrates the use of several options to assist in
Interpretation of parameter estimates—MSR bias model
Displayed below is the output from the first of the datasets simulated under the MSR bias model with a
The four coefficients for
The cut coefficients near the bottom of the output are the unstandardized thresholds for the self-reporting of the four items. These were simulated to have standardized values of 0.5 and 1.5 (two thresholds per item because the response sets have three categories), values typical of those found for symptom presence. These cutpoint coefficient estimates must be standardized by the standard deviation of
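Assuming the relevant quantity is the implied within-group standard deviation of y* for item i, and that the trait and response style factors are conditionally independent, this is (in our notation)

  SD(y*_i) = sqrt( λ_i² σ_θ² + (λ^(η)_i)² σ_η² + 1 ),

where σ_θ² and σ_η² are the within-group variances of the two factors and the residual variance is fixed at 1 by the probit link; dividing each cut estimate by this quantity puts it on the standardized scale of the simulation.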
Interpretation of parameter estimates—FT model
In the FTR bias model fits shown below, there is no difference in the means of the
Model comparison simulation study
For many studies, the primary focus is on estimating covariate or group difference coefficients. In this section, we examine the performance for the estimation of a group difference in trait for five different models (M1 to M5) when fit to data from four data-generating models (G1 to G4) and five different data availability scenarios (D1 to D5). We used Stata’s
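One way to run a small version of such a study in Stata is sketched below, using simulate with a program of our own (onesim); for brevity it generates bias-free (G1-type) self-report data and fits only the naive model M1.

    * rclass program: simulate one bias-free dataset and fit the naive model,
    * returning the estimated group difference in the latent trait
    capture program drop onesim
    program define onesim, rclass
        clear
        set obs 1000
        generate grp = _n > 500
        generate trait = rnormal(grp, 1)            // true group difference = 1
        forvalues i = 1/3 {
            generate ystar`i' = trait + rnormal()   // unit loadings for simplicity
            generate y`i' = cond(ystar`i' > 1.5, 2, cond(ystar`i' > 0.5, 1, 0))
        }
        gsem (y1 y2 y3 <- TRAIT, oprobit) (TRAIT <- grp)
        return scalar diff = _b[TRAIT:grp]
    end

    simulate diff = r(diff), reps(100) seed(2024): onesim
    summarize diff   // mean near 1; SD approximates the standard error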
Standard bias-naive model M1
The first row of table 1 displays estimates from the standard bias-naive IRT model. Only for data-generating model G1, which has no response bias of any kind, is the estimate of the group difference in mean trait unbiased (average estimate 0.996; true value 1.000). Elsewhere, the bias is substantial, several times the standard error
Table 1. Model fits of group difference in TRAIT from 100 simulations (* = excluded improper solution): models M1 naive IRT, M2 MS bias, M3 FT bias, M4 MSR bias, and M5 FTR bias.
The fixed mean-shift bias model M2 can recover an unbiased estimate of the group difference where the response style factor has zero variance (column G2), but the bias adjustment is incomplete where this is not the case (columns G3 and G4). The standard error falls from ∼0.099 with one vignette per item to 0.086 with two, a pattern common to all the models.
The MSR bias model M4 requires fitting both
M3 FT and M5 FTR bias models
The FT model (M3) recovers the true score with the same precision and bias as the MS model (M2) for both the uniform-shift and random-bias data (G2 and G3) and also for the nonuniform-shift setting (G4). The FTR model (M5) recovers an unbiased estimate under all data-generating models and scenarios.
Adaptive AV presentation and AV calibration sample
As is evident from scenario D4 of table 1, adaptive AV selection has not changed the pattern of success in bias adjustment, with both the appropriate MS and FTR bias models providing unbiased estimates of the group difference in trait. A nearly 40% reduction in the number of AVs presented has come with only a small loss of precision, suggesting that adaptive AV selection is a more efficient study design.
We can exploit the missing-at-random assumption further and partition the participant burden by restricting the AV presentation to a subsample of participants. We can also separate the target samples that we wish to compare, whose members complete the self-report, from additional independent “bias calibration” samples who respond only to the AVs. The self-report data from the calibration-sample participants could either be ignored (as in the results we present) or included in the analysis but allowed a different mean.
For scenario D5, model convergence was poorer (85 of the 100 simulated datasets converged, with a further FTR estimate being infeasible), but for those 85, the true group difference, its standard error, and the corresponding empirical standard deviation were all satisfactorily recovered by both the MS and FT models that included random bias (as the simulated data did). With half the number of self-reports (target sample n = 250 per group) and AV presentations (calibration sample n = 250 per group) of scenario D3, we would expect a doubling of the variance of the estimate. However, the reported variance for the two models increases by a factor of about 0.150²/0.091² ≈ 2.7, suggesting some loss of efficiency compared with having self-report and AVs from the same sample.
Conclusion
We have presented
Our work was motivated by wanting to adjust for response style in a questionnaire where the set of response alternatives offered was different for almost every item (Bluett-Duncan et al. 2024), and so all our models allowed for unconstrained response style factor loadings. Response style is often defined a priori as a particular pattern of response bias leading subjects to consistently respond in a certain way (for example, extreme, acquiescent, midpoint) (Paulhus 1991; Van Vaerenbergh and Thomas 2012), suggesting more restricted models might be appropriate, especially where the items share a common response set.
Footnotes
This work was supported by NIHR SI award NF-SI-0617-10120 and the NIHR Maudsley Biomedical Research Centre at South London and Maudsley NHS Foundation Trust. The motivating study was funded jointly by the UK Medical Research Council and the Indian Council of Medical Research (grants MR/N000870/1 and ICMR/MRC-UK/2/M/2015-NCD-1) to Helen Sharp and Andrew Pickles. Matt Bluett-Duncan was funded through a dual PhD scholarship from the University of Liverpool and the National Institute of Mental Health and Neurosciences, Bangalore.
About the authors
Andrew Pickles is Professor of Biostatistics and Psychological Methods at King’s College London. He has contributed to the development of the
Matt Bluett-Duncan is a postdoctoral researcher with expertise in the development of cross-cultural research methods and investigations regarding the impact of environmental exposures on fetal and child neurodevelopment.
Helen Sharp is Professor of Perinatal and Clinical Child Psychology with a research focus on the earliest origins of childhood mental health problems.
Silia Vitoratou is a senior lecturer in psychometrics and leads the Psychometrics and Measurement Lab at King’s College London.
