
Abstract

The use of discrete categorical formats to assess psychological traits has a long-standing tradition that is deeply embedded in item response theory models. The increasing prevalence and endorsement of computer- or web-based testing has led to greater focus on continuous response formats, which offer numerous advantages in both respondent experience and methodological considerations. Response styles, which are frequently observed in self-reported data, reflect a propensity to answer questionnaire items in a consistent manner, regardless of the item content. These response styles have been identified as causes of skewed scale scores and biased trait inferences. In this study, we investigate the impact of response styles on individuals’ responses within a continuous scale context, with a specific emphasis on extreme response style (ERS) and acquiescence response style (ARS). Building upon the established continuous response model (CRM), we propose extensions known as the CRM-ERS and CRM-ARS. These extensions are employed to quantitatively capture individual variations in these distinct response styles. The effectiveness of the proposed models was evaluated through a series of simulation studies. Bayesian methods were employed to effectively calibrate the model parameters. The results demonstrate that both models achieve satisfactory parameter recovery. Neglecting the effects of response styles led to biased estimation, underscoring the importance of accounting for these effects. Moreover, the estimation accuracy improved with increasing test length and sample size. An empirical analysis is presented to elucidate the practical applications and implications of the proposed models.

Keywords

continuous response format, continuous response model, response styles, item response theory

Introduction

When conducting educational assessments or psychological tests, the most frequently encountered data formats are dichotomous item responses (such as true or false) and ordered categorical responses (such as the Likert-type rating scale). Specific types of item response theory (IRT) models must be employed for these distinct categorical item types. IRT models capture the nonlinear relationship between an individual’s trait level and the likelihood of endorsing a specific response to a test item. Numerous IRT models employing either dichotomous or polytomous scoring methods are readily available for researchers to analyze collected data (e.g., Lord, 1980; Samejima, 1969). Although applications of IRT to large-scale international assessments and personality diagnosis have been widely and comprehensively illustrated (Martin et al., 2017; Spence et al., 2012), confining a person’s responses to a few defined categories may obscure the true feelings and propensities that the measurement scale should reflect and may consequently lead to information loss during the model calibration process.

Continuous Response Measurement

In contrast to the ordered categorical responses used in traditional IRT models, there is a growing trend toward adopting continuous item response formats in attitude or personality assessments. This approach permits subjects to express diverse degrees of agreement toward statements, allowing their views to be more comprehensively represented. For instance, the original five-point Likert-type rating scale ranging from strongly disagree (a score of 1) to strongly agree (a score of 5) of the celebrated Big Five Inventory (John & Srivastava, 1999) can be replaced with a line segment with continuous scores between 0 and 1000 (Simms et al., 2019). In this revised setup, individuals are instructed to mark a point on the continuous scale to indicate the extent to which they endorse a given statement. This differs from the conventional practice of providing only a few discrete categorical responses.

The visual analog scale (VAS), which is commonly employed for the collection of continuous responses, uses a scaled and visualized line segment on which respondents are allowed to mark any position along the continuous line to reflect their view on a given statement and has been frequently applied across various social science fields (e.g., Brumfitt & Sheeran, 1999; Christ et al., 2013; Ferrando, 2002). This continuous response format empowers respondents to more finely convey their sentiments or judgments, resulting in more precise propensity estimates than those exhibited by discrete item response formats (such as Likert-type scales; Bejar, 1977; García-Pérez, 2023). In addition, rapid progress has been made in digital technology involving the use of more convenient VASs, such as by integrating slider bars into computer or online surveys, facilitating easier management and precise measurements without manual scoring.

While continuous item response formats are well established in the literature and the corresponding IRT models can be traced back to the work of Samejima (1973) and later Müller (1987), continuous IRT models are utilized for data analysis less often than discrete IRT counterparts designed for Likert-type scales (Zopluoglu, 2020). When handling collected continuous outcome data, factor-analytic models are traditionally favored over continuous IRT models (Mellenbergh, 2016). In our approach, we chose to embrace continuous IRT modeling rather than factor-analytic modeling for fitting VAS data. This is because continuous IRT modeling offers a more theoretically sound foundation than factor-analytic modeling and maintains a strong connection to conventional discrete IRT models. For a more in-depth exploration of this choice, refer to Ferrando (2002) and García-Pérez (2023).

The Continuous Response Model and Its Relative Measurement Issues

A prominent IRT model for analyzing continuous responses in VAS data is the continuous response model (CRM), which was pioneered by Samejima (1973). It constructs a probabilistic framework for continuous responses that accounts for individuals’ latent tendencies and the characteristics of the items. Furthermore, the CRM can be viewed as the limiting case of the graded response model (Samejima, 1969) as the number of response categories approaches infinity. In real testing scenarios, item responses may not be influenced solely by the distinctions between individuals’ and items’ positions on a latent continuum. Instead, extraneous factors unrelated to the intended measurement domain, such as varying cognitive strategies, test-taking behaviors, differential item functioning (DIF), and response styles, could introduce complexities that impact respondents’ scores on the assessed content variables. These factors have been extensively studied in the realm of discrete IRT models (e.g., Falk & Cai, 2016; Holland & Wainer, 1993; Huang, 2020; Meyer, 2010; Mislevy et al., 2008; C. Wang et al., 2018).

With respect to Likert-type data, disregarding potential nuisance factors during tests can lead to significant estimation bias and distort ability inferences for respondents (Liu & Wang, 2016; Merhof & Meiser, 2023). However, the impact of extraneous variables unrelated to the content on estimating individuals’ trait levels and item characteristics garners less attention when the CRM is fit for VAS responses, with a few exceptions. For example, Zopluoglu (2020) extended the CRM using a mixture model to investigate the heterogeneity in item response behavior among respondents. Applying a five-item continuous-scoring personality measurement, Zopluoglu (2020) found that respondents tended to respond to items with different strategies and identified two latent classes that appeared to differ in the use of extreme endpoints. In addition, Ferrando and Navarro-González (2021) introduced a more comprehensive CRM through a dual modeling approach, considering both items and respondents as sources of differential measurement error. Considering the person-trait fluctuations that occur during the response process, Ferrando and Navarro-González asserted that the correlation between the target trait and external criterion could be corrected for attenuation. Recently, Finch (2023) applied the CRM to generate VAS data that incorporate DIF effects and compared the detection efficiencies of various existing DIF detection methods; the MIMIC method was recommended to serve as an efficient approach to detecting DIF in his simulation settings.

Response Styles in Survey Data

Response styles, referring to the inclination to consistently answer self-report inventories in specific ways regardless of item content, have been examined within discrete IRT models and classic analytic approaches for Likert-type data (e.g., De Beuckelaer et al., 2010; De Jong et al., 2008; Falk & Cai, 2016; Plieninger & Meiser, 2014; Weijters et al., 2010). However, limited attention has been given to the measurement challenges posed by response styles in VAS data when the CRM is used for fitting. While response styles encompass a broader spectrum of construct-irrelevant and systematic response behaviors for individuals’ use of rating scales (Baumgartner & Steenkamp, 2001), in our study, we focus on two specific response styles: extreme response style (ERS), characterized by an excessive inclination toward selecting the endpoints of a continuous scale, and acquiescence response style (ARS), characterized by an inclination toward preferring the higher points on a continuous scale regardless of their true attitude and judgment (i.e., a tendency to agree with items). This focus stems from the extensive research available on ERS and ARS effects (e.g., Böckenholt, 2012; Bolt & Newton, 2011; Falk & Cai, 2016; Henninger & Meiser, 2020a, 2020b; Jin & Wang, 2014; Jonas & Markon, 2019; Plieninger, 2017; Thissen-Roe & Thissen, 2013). This existing body of literature, along with the corresponding IRT models designed for discrete responses, offers valuable insights for advancing and expanding the CRM to accommodate these two response styles in VAS data.

Although the concepts of ERS and ARS originate from research on discrete item response formats, the same definition still applies to VAS format measurements. Specifically, respondents with ERS tendencies are more likely to endorse extreme options on a Likert-type scale and give rating points near the extremity on a continuous rating scale, depending on their attitude intention (e.g., positive or negative judgment). Respondents with ARS tendencies tend to choose the highest category on a Likert-type scale or mark the highest points on a continuous scale, regardless of their initial attitudes and judgments.

Existing Measurement Models for Response Styles

A multitude of psychometric models that can deal with the ERS and ARS tendencies have been proposed in terms of various theoretical perspectives and views on the nature of the measurement scale. When viewing the item responses as discrete variables (e.g., Likert-type rating scales), at least four approaches have been developed within the framework of IRT models to control for or quantify the ERS or ARS effect; these approaches deserve additional attention and discussion.

The first approach employs multidimensional IRT models to characterize latent traits of the measured substantive domains alongside distinct latent tendencies for various response styles, including ERS and ARS (Bolt & Johnson, 2009; Bolt & Newton, 2011; Falk & Cai, 2016; Johnson & Bolt, 2010; Plieninger, 2017). The second approach uses mixed IRT models and qualitatively separates the normal and ERS latent classes, in which the ERS latent class is assumed to have smaller intervals between adjacent thresholds than the normal class (Huang, 2016; Morren et al., 2012; von Davier et al., 2007). The third approach applies the perspectives of cognitive psychometrics and assumes that item responses can be deconstructed into content-related and response style–related aspects with an IRTree model (Böckenholt, 2012; De Boeck & Partchev, 2012; Thissen-Roe & Thissen, 2013). Like the multidimensional IRT models applied in the first approach, the fourth approach introduces either a multiplicative person parameter on thresholds to modify the distance between adjacent thresholds for ERS (Jin & Wang, 2014, 2018) or an additive person parameter on item locations to account for shifts in item difficulty for ARS (Jonas & Markon, 2019). Since the middle response style (MRS) works conversely to ERS along the same axis, its effect can also be incorporated and modeled within random-threshold IRT models.

Regardless of the approach employed, previous simulation and empirical studies have consistently indicated that response styles bias parameter estimation, compromise test validity, and distort inferences about respondents (e.g., Henninger & Meiser, 2020a). These IRT extension models, however, are not capable of accounting for ERS and ARS tendencies when item responses are continuous (i.e., VAS data). Intuitively, a traditional factor-analytic model may be directly employed to fit the VAS data, with the substantive domain representing the target trait and the nuisance domains representing the ERS or ARS effect simultaneously influencing item responses (e.g., Billiet & McClendon, 2000; Ferrando & Lorenzo-Seva, 2010). This approach, however, ignores the boundary of the continuous scale (i.e., continuous-limited item responses) and provides approximate estimates only under an ideal testing scenario (Ferrando, 2002).

To consider the boundary effects of VAS responses, Ferrando (2009) modified the factor-analytic model by defining the lower bound and the upper bound for the range of latent trait levels and reparametrized the intercept/location parameters to produce a form similar to that of standard IRT models. Furthermore, Ferrando (2014) extended his previous normative model (Ferrando, 2009) into the so-called differential discrimination model (DDM), which allows individual differences in the item discrimination parameters to quantify different sensitivities to the use of the scale. Although the DDM is promising for investigating potential response extremes, several limitations deserve further investigation. First, the range of trait levels may not perfectly correspond to the constrained range that the DDM has defined because the target latent traits are often assumed to be normally distributed in psychometric models. Second, the estimation algorithm of the DDM depends on the two-stage method with bivariate information that is commonly applied for classical factor-analytic models, which, by ignoring estimation errors, may yield less precise parameter estimates than the one-stage approach with full information (Hung & Huang, 2022). From a statistical perspective, linear homoscedastic models, such as the DDM, are less flexible and less applicable than nonlinear heteroscedastic models (e.g., the CRM) in VAS data analyses (Ferrando, 2002, 2009, 2014; Samejima, 1973; Tutz & Jordan, 2023). Consequently, in this study, we seek to expand the CRM for analyzing VAS data by accommodating the ERS or ARS.

This article is structured as follows: First, a concise overview of the CRM is provided, followed by the development and elaboration of its extensions to incorporate ERS and ARS. Next, a series of simulation studies is conducted to assess the parameter recovery of the proposed models and explore the ramifications of disregarding ERS and ARS effects in VAS data. Afterward, an empirical demonstration is presented that showcases the application of the developed models to VAS data to detect whether ERS or ARS affects respondents’ continuous responses. Finally, the article concludes by summarizing the implications derived from the results of both the simulation and the empirical analyses and by presenting suggestions for future research.

The CRM and Its Extension to Account for ERS and ARS

The CRM was introduced by Samejima (1973) to formulate the probability density function of continuous responses in relation to a person’s trait level and an item’s characteristic, similar to the objectives of conventional discrete IRT models. If we consider a continuous scale ranging from zero to $k_{j}$ for item j and denote person i’s continuous response to item j as $X_{ij}$ , the CRM, while potentially parameterized differently by subsequent research (Ferrando, 2002; Shojima, 2005; T. Wang & Zeng, 1998; Zopluoglu, 2012, 2013), can be conceptualized through the conditional distribution of logit-transformed continuous responses, given by $Z_{ij} = \ln (\frac{X_{ij}}{k_{j} - X_{ij}})$ . This distribution can be expressed as follows:

$$
f (Z_{ij} | θ_{i}, a_{j}, b_{j}, α_{j}) = \frac{a_{j}}{\sqrt{2 π} α_{j}} \times \exp \left\{ \frac{- {\left[ a_{j} \times \left( θ_{i} - b_{j} - \frac{Z_{ij}}{α_{j}} \right) \right]}^{2}}{2} \right\}, \tag{1}
$$

where $θ_{i}$ is the trait level of person i, and $a_{j}$, $b_{j}$, and $α_{j}$ are the discrimination, difficulty, and scaling parameters, respectively, for item j. The scaling parameter $α_{j}$ in the probability density function connects the original continuous score with the latent trait scale (T. Wang & Zeng, 1998). Suppose, without loss of generality, that the response $X_{ij}$ is bounded between 0 and 1; the interpretations of the $θ_{i}$, $a_{j}$, and $b_{j}$ parameters are then consistent with their counterparts in discrete IRT models. Specifically, the expected item score $E (X_{ij} | θ)$ monotonically increases as $θ_{i}$ increases, the slope of the monotonic curve is determined by the combination of the $a_{j}$ discrimination parameter and the $α_{j}$ scaling parameter, and the $b_{j}$ item difficulty parameter is the value of $θ$ at which $E (X_{ij} | θ) = 0.5$, which shows a strong connection between the CRM and the two-parameter IRT model (García-Pérez, 2023). Within the CRM parameterization, the logit-transformed score $Z_{ij}$ is a normally distributed random variable with a conditional mean of $α_{j} \times (θ_{i} - b_{j})$ and a conditional variance of ${(\frac{α_{j}}{a_{j}})}^{2}$.
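The conditional normal form of $Z_{ij}$ suggests a direct way to simulate CRM responses. The following is a minimal sketch, not the authors' code: the function name `simulate_crm`, the seed, and the sample size of 20,000 are illustrative assumptions, and the item parameter values are chosen only for demonstration.

```python
import numpy as np

def simulate_crm(theta, a, b, alpha, k=1.0, rng=None):
    """Draw one continuous response per person from the CRM (hypothetical helper).

    Z | theta ~ Normal(alpha * (theta - b), (alpha / a)**2), and the observed
    score is recovered by inverting Z = ln(X / (k - X)).
    """
    rng = np.random.default_rng() if rng is None else rng
    theta = np.asarray(theta, dtype=float)
    mean = alpha * (theta - b)        # conditional mean of Z
    sd = alpha / a                    # conditional standard deviation of Z
    z = rng.normal(loc=mean, scale=sd)
    return k / (1.0 + np.exp(-z))     # inverse logit, rescaled to (0, k)

rng = np.random.default_rng(2024)     # arbitrary seed, for reproducibility only
x_at_b = simulate_crm(np.zeros(20000), a=1.5, b=0.0, alpha=1.0, rng=rng)
x_above = simulate_crm(np.ones(20000), a=1.5, b=0.0, alpha=1.0, rng=rng)
print(round(float(x_at_b.mean()), 3), round(float(x_above.mean()), 3))
```

With the score bounded between 0 and 1, the simulated mean at $θ = b$ should sit close to 0.5, and it should rise when $θ$ exceeds $b$, in line with the interpretation of the difficulty parameter given above.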

The Extended CRM for Accounting for ERS

Given that ERS entails a consistent inclination toward selecting the endpoints of a categorical or continuous rating scale, it is reasonable to assume that an individual with an ERS tendency would choose higher ratings on the continuum than their actual assessed ratings if $θ_{i} > b_{j}$. Conversely, a person with the same ERS inclination would choose lower ratings than the expected accurate rating if $θ_{i} < b_{j}$. Specifically, a propensity for ERS yields a larger (or smaller) expected value of $Z_{ij}$ than its theoretical expectation within the conditional normal distribution, particularly for respondents with high (or low) latent traits.

To model the relationship between ERS propensity and the deviation of the latent trait from item difficulty, we introduce variability in the scaling parameters across individuals and reparametrize the random-effect scaling parameters to connect with the target latent trait via a linear model of normally distributed residuals. As such, the logit-transformed random variable $Z_{ij}$ in the ERS-enhanced CRM (abbreviated as CRM-ERS) is assumed to follow a conditional normal distribution:

$$
(Z_{ij} | θ_{i}, a_{j}, b_{j}, ξ_{ij}) \sim N \left( ξ_{ij} \times (θ_{i} - b_{j}), {\left( \frac{ξ_{ij}}{a_{j}} \right)}^{2} \right), \tag{2}
$$

$$
\ln (ξ_{ij}) = \ln (α_{j}) + γ_{i}, \tag{3}
$$

and

$$
(θ_{i}, γ_{i}) \sim N_{2} (0, Σ), \tag{4}
$$

where $ξ_{ij}$ represents the reparametrized scaling parameter for person i’s response to item j, assumed to follow a lognormal distribution with a mean $\ln (α_{j})$ , and $γ_{i}$ represents the residual of the linear function that describes the extent to which person i deviates from the distributional mean of $\ln (α_{j})$ . Positive residual values indicate a greater inclination toward ERS, while negative values indicate a greater inclination toward MRS (i.e., $γ_{i}$ can be considered the ERS propensity for person i). Both random-effect variables $θ_{i}$ and $γ_{i}$ are assumed to follow a bivariate normal distribution with a mean vector of zero and a variance–covariance matrix of Σ, as shown in Equation 4. The remaining parameters retain their previous definitions. For model identification, the variance of the target latent trait $θ_{i}$ must be constrained to one. Notably, the b parameter retains the same interpretation as in the traditional CRM, where $E (X_{ij} | θ) = 0.5$ when $θ = b$ . However, in this model, the slope of the monotonic curve is determined not only by the discrimination parameter but also by the person-specific scaling parameter.
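Equations 2 to 4 can be read as a generative recipe: draw the person parameters, form the person-by-item scaling parameter, and then draw the logit-scale response. The sketch below follows that reading. The function name `simulate_crm_ers`, the item parameter values, the zero covariance, and the ERS variance of 0.25 are hypothetical choices made only for illustration.

```python
import numpy as np

def simulate_crm_ers(n_persons, a, b, alpha, sigma, k=1.0, rng=None):
    """Simulate an n_persons x n_items score matrix from the CRM-ERS (sketch)."""
    rng = np.random.default_rng() if rng is None else rng
    a, b, alpha = (np.asarray(v, dtype=float) for v in (a, b, alpha))
    # Equation 4: (theta, gamma) are bivariate normal with a zero mean vector.
    person = rng.multivariate_normal(np.zeros(2), sigma, size=n_persons)
    theta, gamma = person[:, [0]], person[:, [1]]   # column vectors
    # Equation 3: ln(xi_ij) = ln(alpha_j) + gamma_i.
    xi = alpha * np.exp(gamma)                      # shape (persons, items)
    # Equation 2: Z_ij ~ Normal(xi_ij * (theta_i - b_j), (xi_ij / a_j)**2).
    z = rng.normal(loc=xi * (theta - b), scale=xi / a)
    x = k / (1.0 + np.exp(-z))                      # back to the (0, k) scale
    return x, theta.ravel(), gamma.ravel()

sigma = np.array([[1.0, 0.0],     # Var(theta) fixed at 1 for identification
                  [0.0, 0.25]])   # Var(gamma): hypothetical value
x, theta, gamma = simulate_crm_ers(
    500, a=[1.0, 1.5, 2.0], b=[-0.5, 0.0, 0.5], alpha=[1.0, 1.0, 1.0],
    sigma=sigma, rng=np.random.default_rng(1))
# Average distance of each person's scores from the scale midpoint.
extremity = np.abs(x - 0.5).mean(axis=1)
print(np.corrcoef(gamma, extremity)[0, 1])
```

Because $ξ_{ij}$ multiplies both the mean and the standard deviation of $Z_{ij}$, persons with larger $γ_{i}$ should tend to produce scores farther from the midpoint, which is what the final correlation is meant to show.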

To elucidate the influence of random-effect scaling parameters on item responses, we present an illustrative example. We consider three latent trait levels, θ = −1, 0, and 1, for responding to an item with parameters a = 1.5, b = 0, and α = 1. The parameter γ is set at values of 1, 0, and −1, which correspond to the ERS, normal, and MRS classes, respectively. For each combination of latent trait (θ) and ERS propensity (γ), we simulate 2,000 persons’ responses to the given item using Equations 2 and 3. The observed scores are obtained by transforming $Z_{ij}$ with the equation $\frac{k \times \exp (Z_{ij})}{1 + \exp (Z_{ij})}$, with the score range set between zero and 11 (i.e., k = 11). The upper panel of Figure 1 displays the conditional distribution of continuous responses under the various conditions for the CRM-ERS. When individuals possess a low trait level (θ = −1), the distribution of observed scores shifts toward the left end of the continuum with γ = 1, indicating an inclination toward ERS, as shown in Figure 1(a). The distribution for normal responses with γ = 0 remains consistent with the conventional CRM prediction, whereas the MRS class with γ = −1 pulls the distribution toward the middle score.

Figure 1

Conditional Distribution of the Continuous Response at Three Trait Levels for the CRM-ERS (a–c) and CRM-ARS (d–f): (a) CRM-ERS When Theta = −1; (b) CRM-ERS When Theta = 0; (c) CRM-ERS When Theta = 1; (d) CRM-ARS When Theta = −1; (e) CRM-ARS When Theta = 0; (f) CRM-ARS When Theta = 1

When an individual’s trait level aligns with the item difficulty (θ = 0), the three response class distributions exhibit similar means, as shown in Figure 1(b). However, the ERS class (with γ = 1) is more likely than the other classes to select extreme positions on the continuum. In addition, the MRS class (with γ = -1) demonstrates a strong tendency to select the middle rating, in contrast to the normal class (with γ = 0). Finally, when individuals possess a high trait level (θ = 1), as depicted in Figure 1(c), the distribution of observed scores for those with γ = 1 (or γ = −1) is influenced by ERS (or MRS), causing a shift toward the right (or left) end of the continuum. The distribution of the normal class with γ = 0 lies between the two ERS and MRS classes.
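The pattern in panels (a) to (c) can be checked without simulation. Because the observed score is a monotone transformation of $Z_{ij}$ and $Z_{ij}$ is symmetric around its mean, the median observed score equals $k$ times the inverse logit of $ξ_{ij} \times (θ_{i} - b_{j})$. The sketch below uses the item parameters of the illustrative example (a = 1.5, b = 0, α = 1, k = 11); the helper names are hypothetical, and working with medians rather than 2,000 simulated draws is a simplification made here.

```python
import math

def median_score(theta, gamma, b=0.0, alpha=1.0, k=11.0):
    """Median observed score under the CRM-ERS for fixed theta and gamma."""
    xi = alpha * math.exp(gamma)       # Equation 3
    z_median = xi * (theta - b)        # mean (and median) of Z in Equation 2
    return k / (1.0 + math.exp(-z_median))

def logit_sd(gamma, a=1.5, alpha=1.0):
    """Conditional standard deviation of Z, xi / a, from Equation 2."""
    return alpha * math.exp(gamma) / a

for theta in (-1, 0, 1):
    row = [round(median_score(theta, g), 2) for g in (1, 0, -1)]
    print(theta, row)                  # columns: ERS, normal, MRS
```

At θ = 1 the medians are roughly 10.3, 8.0, and 6.5 for γ = 1, 0, and −1, so ERS pushes the center toward the upper endpoint and MRS pulls it toward the midpoint of 5.5. At θ = 0 all three medians equal 5.5, and the classes differ only in spread, since the standard deviation of $Z_{ij}$ grows with γ.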

To clearly illustrate the distributional patterns under the same response style, we present the conditional distribution of continuous responses, demonstrating the variation in the γ parameter for the CRM-ERS. This information is depicted in the upper panel of Figure A1, which is included in Online Supplement A. In Figure A1(a), when respondents lean toward MRS $(γ = - 1)$ , there is a tendency to mark points around the median score, regardless of their trait levels. This is in contrast to the normal responses observed in Figure A1(b). Conversely, when $γ = 1$ , the ERS effect influences the distribution of observed scores, causing them to approach the right and left ends of the continuum for respondents with high and low trait levels, respectively. Moreover, the distribution appears to follow a flat pattern for respondents with a medium trait level.

The Extended CRM for Accounting for ARS

Considering the inclination toward ARS in the VAS data, the CRM can be extended to accommodate respondents who tend to choose higher ratings on continuous scales than expected (i.e., a tendency to agree with item statements). Although the ERS tendency can likewise induce respondents to give higher ratings on a continuous scale, a major difference should be clearly noted. The ARS manifests as a consistent tendency to choose ratings near the high end of the continuous scale regardless of personal or item characteristics, whereas under the ERS the tendency to choose a point near the high or low end of the continuous scale depends on the difference between the person’s trait and the item’s location, as shown in Figure 1(a) to 1(c).

The presence of ARS suggests that the likelihood of person i obtaining a score of x or higher on item j surpasses the probability anticipated by a conventional CRM for the same person and item. Mathematically, this is expressed as $P (X_{ij} \geq x | ARS) > P (X_{ij} \geq x | Normal)$. Given that the observed score $X_{ij}$ can be converted into the logit-scaled score $Z_{ij}$ through the logit transformation, the conditional probability of marking a score equal to or higher than x for person i and item j can be calculated by integrating the corresponding probability density function outlined in Equation 1. This is denoted as

$$
P (X_{ij} \geq x | Normal) = P (Z_{ij} \geq z | Normal) = \int_{z}^{\infty} f (Z_{ij} | θ_{i}, a_{j}, b_{j}, α_{j}) d Z, \tag{5}
$$

representing the probability of normal responses without exhibited response styles.

To increase the conditional probability under the same conditions, we can include a nonnegative person-specific parameter governed by the ARS dimension in Equation 1. This inclusion enforces a higher conditional probability $(P [X_{ij} \geq x])$ for respondents exhibiting ARS tendencies. As the slope (discrimination) parameters are originally defined to capture the varying influence of the latent trait on each item in the original CRM, a distinct set of slope parameters can be introduced for items to account for the impact of the ARS dimension. In this modified version, termed the CRM-ARS, a random-effect parameter $(ω_{i})$ is incorporated for individual i, and an ARS-specific slope parameter $(a_{j}^{(ARS)})$ is incorporated for item j. The reparametrized Equation 1 for the CRM-ARS is as follows:

$$
f {(Z_{ij} | θ_{i}, ω_{i}, a_{j}, a_{j}^{(ARS)}, b_{j}^{*}, α_{j}^{*})}_{ARS} = \frac{a_{j}}{\sqrt{2 π} α_{j}} \times \exp \left\{ \frac{- {\left[ a_{j} \times θ_{i} + \exp (a_{j}^{(ARS)} \times ω_{i}) - b_{j}^{*} - \frac{Z_{ij}}{α_{j}^{*}} \right]}^{2}}{2} \right\}, \tag{6}
$$

where $b_{j}^{*} = a_{j} \times b_{j}$ and $α_{j}^{*} = \frac{α_{j}}{a_{j}}$. Here, $ω_{i}$ represents person i’s inclination toward the ARS and follows a normal distribution with a mean of zero and a variance of $σ_{ω}^{2}$. The other variables retain their previously defined meanings. In the conventional CRM, the $b_{j}$ and $α_{j}$ parameters are expressed on the $θ_{i}$ scale, and the resulting term is multiplied as a whole by the $a_{j}$ parameter. In the CRM-ARS represented by Equation 6, we introduce the ARS propensity and add $\exp (a_{j}^{(ARS)} \times ω_{i})$ on the $a_{j} θ_{i}$ scale. To maintain scaling consistency, the corresponding item difficulty and scaling parameters must be rescaled and reparametrized. In addition, this reparameterization gives the CRM-ARS an interpretable form that is analogous to that of the conventional CRM.
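As a check on the reparameterization, the bracketed term of Equation 6 can be rewritten by substituting $b_{j}^{*} = a_{j} \times b_{j}$ and $α_{j}^{*} = \frac{α_{j}}{a_{j}}$. This short derivation is added here for illustration and uses only the definitions given above:

$$
a_{j} θ_{i} + \exp (a_{j}^{(ARS)} ω_{i}) - b_{j}^{*} - \frac{Z_{ij}}{α_{j}^{*}} = a_{j} \left[ θ_{i} - b_{j} + \frac{\exp (a_{j}^{(ARS)} ω_{i})}{a_{j}} - \frac{Z_{ij}}{α_{j}} \right] = - \frac{a_{j}}{α_{j}} \left\{ Z_{ij} - α_{j} \left[ θ_{i} - b_{j} + \frac{\exp (a_{j}^{(ARS)} ω_{i})}{a_{j}} \right] \right\}.
$$

Squaring this term and dividing by 2 gives the exponent of a normal density in $Z_{ij}$ with mean $α_{j} \times [θ_{i} - b_{j} + \frac{\exp (a_{j}^{(ARS)} \times ω_{i})}{a_{j}}]$ and standard deviation $\frac{α_{j}}{a_{j}}$, and the leading factor $\frac{a_{j}}{\sqrt{2 π} α_{j}}$ is the matching normalizing constant. If the ARS term is removed, the expression reduces to the bracketed term of Equation 1.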

The interpretations of the CRM-ARS parameters differ somewhat from those of the conventional CRM and warrant further attention. The expected item score is equal to 0.5 when $a_{j} \times θ_{i} + \exp (a_{j}^{(ARS)} \times ω_{i}) = b_{j}^{*}$, and the $b_{j}^{*}$ parameter is considered solely an intercept parameter rather than the item difficulty parameter described previously. When appropriate, a multidimensional difficulty index may be computed by considering the impacts of each latent dimension on the item. This index can be used as an indicator to represent the relative difficulty of the item related to the distinct domain (see Reckase, 2009, pp. 89–90).

The identifiability of the CRM-ARS warrants further discussion. First, the slope parameter of the ARS dimension for one item must be fixed at an arbitrary value. Following the tradition of IRT regarding constraints on discrimination parameters (as seen in Huang & Wang, 2014), we arbitrarily choose $a_{1}^{(ARS)} = 1$ for this study. Next, Equation 6 clearly shows that the two types of person parameters are additively combined and that their corresponding slope parameters are constrained to be positive (i.e., items measure the same direction). Because all the items are allowed to be simultaneously governed by two random-effect parameters, a zero-correlation constraint should be imposed, as in an unrestricted factor-analytic model (Ferrando et al., 2003). In this study, we constrained the $θ_{i}$ and $ω_{i}$ parameters to be orthogonal to ensure that the two parameters could be accurately identified; this approach has been commonly employed in previous studies involving multidimensionality (Bolt & Newton, 2011; De Boeck et al., 2011; Ferrando et al., 2011; Ferrando & Lorenzo-Seva, 2010; Holman & Glas, 2005; Jin et al., 2023; Liu et al., 2019).

Although the orthogonal constraint derives from statistical considerations, the correlation between trait levels and ARS propensity has shown different patterns depending on the response-style modeling approach. For example, Plieninger (2017) found a near-zero correlation between a measured trait and ARS dimensions; on the other hand, Liu and Wang’s (2019) study showed a moderate negative correlation between the two person parameters in an analysis of unfolding data. Nevertheless, one may allow correlation estimation in the CRM-ARS when at least one item can be identified to purely measure the target trait and ARS domains in a confirmatory way (Holman & Glas, 2005). In this case, identification of the ARS-free items is necessary. However, flagging ARS-free items is methodologically challenging; it might be achieved with the iterative purification methods employed in searching for DIF-free anchor items (e.g., Huang, 2014), which goes beyond the scope of this study.

Consequently, the conditional probability of obtaining a score of x or higher on item j for person i with an ARS tendency can be derived by integrating Equation 6, expressed as:

$$
P (X_{ij} \geq x | ARS) = P (Z_{ij} \geq z | ARS) = \int_{z}^{\infty} f {(Z_{ij} | θ_{i}, ω_{i}, a_{j}, a_{j}^{(ARS)}, b_{j}^{*}, α_{j}^{*})}_{ARS} d Z, \tag{7}
$$

where the logit-transformed score $Z_{i j}$ is drawn from a normal distribution with a conditional mean of $α_{j} \times [θ_{i} - b_{j} + \frac{\exp (a_{j}^{(ARS)} \times ω_{i})}{a_{j}}]$ and a conditional variance of ${(\frac{α_{j}}{a_{j}})}^{2}$ .
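Because $Z_{ij}$ is conditionally normal under both models, the probabilities in Equations 5 and 7 reduce to upper-tail normal probabilities. The sketch below evaluates them with the complementary error function from the Python standard library. The function name `prob_at_least`, the cutoff x = 0.7, and the parameter values are illustrative assumptions rather than settings taken from the study.

```python
import math

def prob_at_least(x, theta, a, b, alpha, k=1.0, a_ars=1.0, omega=None):
    """P(X >= x) under the CRM, or under the CRM-ARS when omega is given."""
    z = math.log(x / (k - x))                  # logit-transformed cutoff
    shift = 0.0
    if omega is not None:
        shift = math.exp(a_ars * omega) / a    # ARS term in the conditional mean
    mean = alpha * (theta - b + shift)
    sd = alpha / a
    # Upper-tail normal probability: the density integrated from z to infinity.
    return 0.5 * math.erfc((z - mean) / (sd * math.sqrt(2.0)))

common = dict(theta=0.0, a=1.5, b=0.0, alpha=1.0)
p_normal = prob_at_least(0.7, **common)
p_low = prob_at_least(0.7, omega=-1.0, **common)
p_high = prob_at_least(0.7, omega=1.0, **common)
print(p_normal, p_low, p_high)
```

Under these assumed values the three probabilities should be ordered as `p_normal < p_low < p_high`, which matches the inequality $P (X_{ij} \geq x | ARS) > P (X_{ij} \geq x | Normal)$. Letting `omega` become very negative drives the ARS term toward zero, so the CRM-ARS probability approaches the CRM value.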

The same illustrative scenario applied to the CRM-ERS, involving the continuous responses of 2,000 individuals to an item across three trait levels, also illustrates the impact of ARS using the CRM-ARS. However, in this case, the ARS propensity for respondents is set as ω = 1 for high ARS levels and ω = −1 for low ARS levels. In addition, the slope of the ARS dimension is set to one. This is displayed in the bottom panel of Figure 1. In the context of the three trait levels, a higher ω parameter causes a more pronounced shift in the observed continuous response distribution toward the right end of the continuum than a lower ω parameter does. The CRM-ARS allows acquiescent tendencies to be individually captured and quantified.

When examining the conditional distribution of continuous responses with respect to variations in the ω parameter, Figure A1 in Online Supplement A provides a clear depiction of the impact of the ARS on observed scores. In contrast to the standard responses associated with the traditional CRM (where the ω value is set to negative infinity, as depicted in Figure A1[e]), respondents with ω = −1 exhibit a slight shift toward the right end of the continuum, irrespective of their trait levels (refer to Figure A1[d]). Conversely, respondents with ω = 1 demonstrate a pronounced shift toward the right end of the continuum (refer to Figure A1[f]), highlighting that a strong ARS inclination can dominate the response distribution as ω increases.

In certain instances, respondents might lean toward the lower range of the continuous scale, manifesting a disacquiescence response style (DRS). DRS, in which respondents tend to answer items negatively, can be conceptualized as the opposite of ARS. For instance, a DRS propensity may emerge when reverse-keyed items are administered (Ferrando & Lorenzo-Seva, 2010). The CRM-ARS can be adapted to address the DRS effect by constraining the mean of the conditional probability density function so that it decreases as a respondent’s DRS propensity $(υ_{i})$ increases. The conditional mean then becomes $α_{j} \times [θ_{i} - b_{j} - \frac{\exp (a_{j}^{(DRS)} \times υ_{i})}{a_{j}}]$ . The conditional probability of obtaining a score of x or higher can be derived by following the same logic as above. Although CRM extensions could be formulated for other response styles as well, this study focuses on the effects of ERS and ARS. Therefore, the upcoming simulation studies evaluate the estimation quality of the CRM-ERS and CRM-ARS.

Method

Design

Two simulation studies were conducted to assess the performance of the proposed models, the CRM-ERS and CRM-ARS, in comparison to the conventional CRM. The CRM-ERS and CRM-ARS served as the data-generating models. Simulated data were generated from each model and then fitted with the corresponding true model as well as the conventional CRM (which does not account for response styles). This design was used to measure the estimation accuracy of the proposed models and to explore the consequences of neglecting ERS and ARS under different conditions. Fitting both the true and the misspecified model to the same simulated data reveals the extent to which the parameter recovery of the conventional CRM deviates from that of the true CRM-ERS or CRM-ARS. If the deviation is substantial, then the ERS or ARS effect cannot be ignored, and the CRM-ERS or CRM-ARS must be employed. In contrast, if the deviation is small, the CRM is sufficient and offers a more parsimonious model. This is a common procedure in the literature for developing a novel model and investigating its estimation efficiency (e.g., Huang, 2023; Jin & Wang, 2014; Merhof & Meiser, 2023).

In both simulation studies, two sample sizes, 500 and 1,000, were employed, along with two test lengths, 10 and 20 items. These conditions were chosen to align with previous research on CRM (Finch, 2023; Zopluoglu, 2020). For the CRM-ERS simulations, two person parameters, $θ_{i}$ and $γ_{i}$ , were generated from a bivariate standard normal distribution. The correlation between these parameters was set at either −0.40 or 0.40. This variation is consistent with prior literature showing that the relationship between latent traits and ERS tends to be either positive or negative (Liu & Wang, 2019; Plieninger, 2017; Plieninger & Meiser, 2014).
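One simple way to draw such correlated person parameters is to combine two independent standard normal draws. The sketch below is illustrative only; the function names and the seed are hypothetical, and the study itself generated the data by other means not described here.

```python
import math
import random


def draw_person_parameters(n_persons, rho, seed=0):
    """Draw (theta, gamma) pairs from a bivariate standard normal with correlation rho."""
    rng = random.Random(seed)
    thetas, gammas = [], []
    for _ in range(n_persons):
        z1 = rng.gauss(0.0, 1.0)
        z2 = rng.gauss(0.0, 1.0)
        thetas.append(z1)
        # Mixing z1 and an independent z2 gives unit variance and correlation rho.
        gammas.append(rho * z1 + math.sqrt(1.0 - rho ** 2) * z2)
    return thetas, gammas


def pearson(xs, ys):
    """Sample Pearson correlation between two equal-length lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


for rho in (-0.40, 0.40):
    theta, gamma = draw_person_parameters(1000, rho, seed=1)
    print(f"target rho = {rho:+.2f}, sample r = {pearson(theta, gamma):+.3f}")
```

With 1,000 persons, the sample correlation will fluctuate around the target value from one replication to the next.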

For the CRM-ARS simulations, the target trait parameters $θ_{i}$ were sampled from a standard normal distribution. The ARS propensity parameters $ω_{i}$ followed a normal distribution with a mean of zero and a standard deviation of 0.5. The two person parameters were treated as mutually independent. A moderate degree of variation in individual ARS tendencies was chosen because the $ω_{i}$ parameters are transformed into positive values through the natural exponential function to represent the influence of ARS on the conditional distribution. For instance, when the slope parameter of the ARS dimension is set to unity, the random $\exp (ω_{i})$ value falls within the range of $\exp (- 1.5) = 0.223$ to $\exp (1.5) = 4.482$ for approximately 99% of respondents (i.e., for $ω_{i}$ within three standard deviations of the mean).
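The arithmetic behind this range can be checked in a few lines. The sketch below assumes only what is stated above (a standard deviation of 0.5 and a unit ARS slope) and uses the standard normal coverage of a ±3 standard deviation interval.

```python
import math

SD_OMEGA = 0.5  # standard deviation of the ARS propensity, as stated in the text

# Bounds of omega at three standard deviations around the zero mean.
lower_omega = -3.0 * SD_OMEGA
upper_omega = 3.0 * SD_OMEGA

# Corresponding bounds of exp(omega) when the ARS slope equals one.
lower_exp = math.exp(lower_omega)
upper_exp = math.exp(upper_omega)

# Probability that a normal variate falls within three standard deviations.
coverage = math.erf(3.0 / math.sqrt(2.0))

print(f"omega in [{lower_omega}, {upper_omega}]")
print(f"exp(omega) in [{lower_exp:.3f}, {upper_exp:.3f}]")
print(f"coverage = {coverage:.4f}")
```

The interval therefore covers slightly more than 99% of the simulated respondents.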

Continuous item responses were generated on a scale from zero to 11, the highest possible item score. The item parameters were established following settings similar to those designed by Zopluoglu (2022). In both simulations, the item parameters were drawn at random from uniform distributions: −1 to 1 for the item difficulty parameters, 0.8 to 1.2 for the item scaling parameters, and 0.5 to 2.5 for the item discrimination (slope) parameters on θ and ω. Crossing the manipulated factors yielded eight conditions for the CRM-ERS simulation (i.e., two sample sizes × two test lengths × two correlation levels) and four conditions for the CRM-ARS simulation (i.e., two sample sizes × two test lengths). Each condition was replicated 100 times to evaluate the parameter estimation accuracy.
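The design and the data-generation step can be sketched as follows. This is an illustrative outline rather than the authors' code: the function names are hypothetical, and the mapping from the logit-transformed score back to the 0 to 11 scale, x = 11 / (1 + exp(−z)), assumes the usual logit transformation of the CRM.

```python
import itertools
import math
import random

K = 11.0  # highest possible item score

# Fully crossed simulation conditions.
ers_conditions = list(itertools.product((500, 1000), (10, 20), (-0.40, 0.40)))
ars_conditions = list(itertools.product((500, 1000), (10, 20)))


def sample_item_parameters(n_items, rng):
    """Draw item parameters from the uniform ranges stated in the text."""
    return [
        {
            "b": rng.uniform(-1.0, 1.0),      # difficulty
            "alpha": rng.uniform(0.8, 1.2),   # scaling
            "a": rng.uniform(0.5, 2.5),       # slope on theta
            "a_ars": rng.uniform(0.5, 2.5),   # slope on omega
        }
        for _ in range(n_items)
    ]


def generate_crm_ars_responses(n_persons, items, rng):
    """Generate continuous responses on the 0..K scale under the CRM-ARS."""
    data = []
    for _ in range(n_persons):
        theta = rng.gauss(0.0, 1.0)
        omega = rng.gauss(0.0, 0.5)
        row = []
        for item in items:
            shift = math.exp(item["a_ars"] * omega) / item["a"]
            mean = item["alpha"] * (theta - item["b"] + shift)
            sd = item["alpha"] / item["a"]
            z = rng.gauss(mean, sd)
            # Assumed inverse of the logit transformation z = ln(x / (K - x)).
            row.append(K / (1.0 + math.exp(-z)))
        data.append(row)
    return data


rng = random.Random(2024)
items = sample_item_parameters(10, rng)
data = generate_crm_ars_responses(50, items, rng)
print(len(ers_conditions), len(ars_conditions), len(data), len(data[0]))
```

In a full simulation, each of the listed conditions would be replicated 100 times with freshly generated data.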

Note that the ranges used to generate the item parameters were chosen to be consistent with or similar to those observed and used in previous studies (e.g., Finch, 2023; Shojima, 2005; Zopluoglu, 2022). However, we cannot exclude the possibility that item parameters in real data fall outside these ranges. As demonstrated in the empirical analysis below, the item difficulty parameters were estimated to be lower than the simulated values, and the respondents thus tended to mark higher points on the continuous scale. To examine whether the proposed model is robust to the item parameter values obtained from real data, a follow-up simulation mirroring the real-data scenario was conducted and is reported in the section following the empirical analysis.

When VAS data are collected, an important question is which model provides the best fit to the data and therefore which response style, if any, should be accommodated in the CRM. For this purpose, we employed two criteria to evaluate model fit: the deviance information criterion (DIC; Spiegelhalter et al., 2002), which is only partially Bayesian, and leave-one-out cross-validation with Pareto-smoothed importance sampling (PSIS-LOOCV; Vehtari et al., 2017, 2019), which is recognized as a fully Bayesian approach. To assess the efficiency of the two fit criteria, a third simulation study was conducted in which the CRM-ERS, CRM-ARS, and CRM were each used to generate item responses over 100 replications, and all three models were then fit to each simulated dataset to calculate the corresponding DIC and PSIS-LOOCV values. It was expected that the DIC and PSIS-LOOCV would be lower when the fitting model matched the data-generating model than when the two differed, so that the data-generating model would be identified as superior to the competing models. In this study, we fixed the test length to 10 items and the sample size to 500 persons to investigate the effectiveness of the fit indices in a less ideal situation.
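To make the DIC concrete, the sketch below computes it from posterior draws using the standard definition, DIC = mean deviance + p_D, where p_D is the mean deviance minus the deviance at the posterior mean. The example deliberately uses a toy normal-mean model with known variance instead of the CRM, and the draws are simulated directly from the known posterior rather than produced by JAGS; all names are hypothetical.

```python
import math
import random
import statistics


def deviance(data, mu, sd=1.0):
    """-2 * log-likelihood of a normal model with mean mu and known sd."""
    log_lik = sum(
        -0.5 * math.log(2.0 * math.pi * sd ** 2) - (y - mu) ** 2 / (2.0 * sd ** 2)
        for y in data
    )
    return -2.0 * log_lik


def dic_from_draws(data, mu_draws):
    """Return (DIC, p_D) given posterior draws of the mean parameter."""
    mean_deviance = statistics.fmean([deviance(data, mu) for mu in mu_draws])
    deviance_at_mean = deviance(data, statistics.fmean(mu_draws))
    p_d = mean_deviance - deviance_at_mean  # effective number of parameters
    return mean_deviance + p_d, p_d


rng = random.Random(11)
data = [rng.gauss(0.3, 1.0) for _ in range(200)]

# Toy posterior: with a flat prior, mu | data ~ Normal(mean(data), 1 / n).
y_bar = statistics.fmean(data)
mu_draws = [rng.gauss(y_bar, 1.0 / math.sqrt(len(data))) for _ in range(4000)]

dic_value, p_d = dic_from_draws(data, mu_draws)
print(f"DIC = {dic_value:.1f}, p_D = {p_d:.2f}")
```

Because the toy model has a single free parameter, p_D should be close to one. When several candidate models are fit to the same data, the one with the smallest DIC is preferred.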

Analysis

Because existing statistical packages cannot be applied to the newly proposed models, we wrote JAGS (Plummer, 2017) syntax to estimate them with Bayesian methods. This approach was used to calibrate the model parameters for the extended CRMs encompassing response styles. Bayesian estimation requires a prior distribution for each parameter in order to establish the joint posterior distribution of the parameters. Markov chain Monte Carlo (MCMC) methods were employed to sample sequentially from the posterior distribution of each parameter. The priors employed in this study were vague and were similar to those utilized in prior Bayesian IRT studies (e.g., Huang, 2016, 2020). The JAGS syntax for the CRM-ERS and CRM-ARS can be found in Online Supplements B and C, respectively. These resources are provided for readers interested in developing customized models.

A normal prior distribution with a mean of zero and a variance of four was implemented for the item difficulty parameters. A lognormal prior distribution with a mean of zero and a variance of one was implemented for the item discrimination and item scaling parameters. A gamma prior distribution with both hyperparameters set at 0.01 was specified for the standard deviation of the ERS (i.e., $γ_{i}$ ) and ARS (i.e., $ω_{i}$ ) propensity parameters. The correlation between the target latent trait and ERS propensity parameters was assigned a uniform distribution ranging from −0.99 to 0.99. A total of 15,000 iterations were executed, with the initial 5,000 iterations being discarded as the burn-in phase. This step was employed to establish stationarity, as evidenced by the computed values of the multivariate potential scale reduction factor (Brooks & Gelman, 1998), all of which closely approximated unity.
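The convergence check can be illustrated with a simplified diagnostic. The sketch below computes the univariate potential scale reduction factor for a single parameter, which is a simpler stand-in for the multivariate version of Brooks and Gelman (1998) used in the study. The chains are toy normal draws, not JAGS output, and the use of two chains is an assumption made for illustration.

```python
import random
import statistics


def psrf(chains):
    """Univariate potential scale reduction factor for equal-length chains."""
    n = len(chains[0])
    chain_means = [statistics.fmean(chain) for chain in chains]
    within = statistics.fmean([statistics.variance(chain) for chain in chains])
    between = n * statistics.variance(chain_means)
    pooled = (n - 1) / n * within + between / n
    return (pooled / within) ** 0.5


def toy_chain(center, n_iter, rng):
    """Stand-in for MCMC output: independent normal draws around `center`."""
    return [rng.gauss(center, 1.0) for _ in range(n_iter)]


rng = random.Random(7)
N_ITER, BURN_IN = 15_000, 5_000  # iteration counts stated in the text

# Discard the burn-in draws before computing the diagnostic.
mixed = [toy_chain(0.0, N_ITER, rng)[BURN_IN:] for _ in range(2)]
stuck = [toy_chain(center, N_ITER, rng)[BURN_IN:] for center in (0.0, 3.0)]

r_mixed = psrf(mixed)
r_stuck = psrf(stuck)
print(f"chains on the same target: {r_mixed:.3f}")
print(f"chains on different targets: {r_stuck:.3f}")
```

Values close to one, as for the first pair of chains, are consistent with stationarity; values well above one signal that the chains have not converged to a common distribution.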

Furthermore, the bias and root mean square error (RMSE) were calculated for each estimator, and the correlations between the true and estimated values were also calculated. To save space, the correlations between true and estimated item parameters were calculated separately for each parameter and then averaged across estimators. These calculations served two purposes: first, to assess the accuracy of parameter recovery when the data-generating model was fit to its own simulated data, and second, to explore the consequences of fitting the misspecified CRM to VAS data involving ERS or ARS effects.
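These three recovery criteria can be computed as in the following sketch. The numerical values are hypothetical and serve only to show the calculations; they are not results from the study.

```python
import math


def bias(estimates, true_value):
    """Mean signed difference between the estimates and the true value."""
    return sum(e - true_value for e in estimates) / len(estimates)


def rmse(estimates, true_value):
    """Root mean square error of the estimates around the true value."""
    return math.sqrt(sum((e - true_value) ** 2 for e in estimates) / len(estimates))


def pearson(xs, ys):
    """Pearson correlation between true and estimated parameter values."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical estimates of one item parameter (true value 1.0) over five replications.
replicated = [1.10, 0.95, 1.02, 0.90, 1.08]
print(f"bias = {bias(replicated, 1.0):.3f}, RMSE = {rmse(replicated, 1.0):.3f}")

# Hypothetical true and estimated values for four parameters in one replication.
true_values = [-1.0, 0.0, 1.0, 2.0]
estimated = [-0.9, 0.1, 0.8, 2.1]
print(f"correlation = {pearson(true_values, estimated):.3f}")
```

Bias captures systematic over- or underestimation, whereas the RMSE also reflects the variability of the estimates across replications.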

Results

Simulation 1: Parameter Recovery Evaluation for the CRM-ERS

The effectiveness of parameter recovery was assessed by examining biases and RMSEs for both the CRM-ERS and the CRM-ARS. The outcomes are presented using box plots. When the data-generating model was the CRM-ERS, as depicted in Figure 2, the upper and lower bounds of the bias values for the CRM-ERS were closer to zero than those for the CRM. Figure 3 shows the box plots for the RMSEs. The CRM-ERS yielded lower RMSE values than did the CRM across all parameter estimators. Furthermore, increasing the sample size led to reduced bias and RMSE values when calibrating the CRM-ERS parameters. Conversely, the test length and correlation level had relatively minor impacts on the estimations of the item and structural parameters.

Figure 2

Box Plots of the Biases of the Parameter Estimates Calibrated Under Different Conditions When the CRM-ERS (a–d) and CRM (e–h) Are Fit to the Simulated CRM-ERS Data

Figure 3

Box Plots of the RMSEs of the Parameter Estimates Calibrated Under Different Conditions When the CRM-ERS (a–d) and CRM (e–h) Are Fit to the Simulated CRM-ERS Data

Regarding the recovery of latent trait parameters, as depicted in Figure 4(a) and 4(b), the CRM-ERS more accurately estimated the latent trait than the CRM, regardless of whether the correlation was positive or negative. Furthermore, extending the test length further enhanced the precision of latent trait estimation within the context of the CRM-ERS. The CRM-ERS necessitated estimating an additional parameter related to ERS propensity, and Figure 4(c) illustrates that this parameter could be satisfactorily estimated, particularly when 20 items were administered. Furthermore, when inspecting the correlations between the true and estimated parameters, as shown in the upper portion of Table 1, the CRM-ERS had higher correlation values than the CRM for all conditions, and the conclusions described above can be directly applied.

Figure 4

Box Plots of the RMSEs for Assessing the Recovery of the Person Parameters Under Different Conditions When the Data-Generating Models Are CRM-ERS (a–c) and CRM-ARS (d–e)

Table 1

Mean Correlation Between the Generated and Estimated Parameters for the Simulated Data

		Sample size	500			1,000
True model	Fitting model	TL \ Para	Item	Trait	Prop	Item	Trait	Prop
CRM-ERS	CRM-ERS	10	0.995/0.962	0.962/0.970	0.978/0.983	0.996/0.969	0.965/0.966	0.982/0.982
		20	0.995/0.998	0.981/0.980	0.989/0.991	0.999/0.999	0.981/0.981	0.991/0.990
	CRM	10	0.917/0.861	0.551/0.685	—	0.831/0.881	0.572/0.582	—
		20	0.847/0.856	0.497/0.448	—	0.874/0.904	0.619/0.544	—
CRM-ARS	CRM-ARS	10	0.993	0.941	0.782	0.999	0.941	0.778
		20	0.994	0.978	0.877	0.997	0.972	0.871
	CRM	10	0.851	0.527	—	0.883	0.616	—
		20	0.767	0.496	—	0.768	0.497	—

Note. Para = parameter, TL = test length, Item = item and model structural parameters, Trait = persons’ target trait parameters, and Prop = persons’ ERS/ARS propensity parameters. The positive and negative correlation conditions are shown on the left- and right-hand sides, respectively, of the slash symbol, when the CRM-ERS was used as the true model.

Simulation 2: Parameter Recovery Evaluation for the CRM-ARS

Figure 5 displays box plots depicting the RMSE and bias values for item and structural parameter estimations when the CRM-ARS and the CRM were applied to the CRM-ARS-generated data across various manipulation conditions. In terms of both bias and RMSE, the CRM-ARS consistently generated more accurate parameter estimates than the CRM. Notably, the evaluation criteria for the CRM-ARS and CRM exhibited substantial discrepancies, underscoring the significance of accounting for the effects of ARS on item responses. The majority of the CRM-ARS parameters were satisfactorily estimated, although slight variations were observed, particularly in the discrimination parameters pertaining to the ARS dimension (i.e., $a_{j}^{(ARS)}$ ). Note that only the loading of the first item with respect to the ARS dimension was fixed to one (i.e., $a_{1}^{(ARS)} = 1$ ) for model identification, and the remaining $a_{j}^{(ARS)}$ , for $j \neq 1$ , were freely estimated. The slightly larger bias values for the ARS-related discrimination parameters could be attributed to the nonlinear way in which the ARS effect, quantified by $\exp (a_{j}^{(ARS)} \times ω_{i})$ , is combined with the other parameters within the function. This exponentiation step may limit the estimation quality of ARS-related parameters.

Figure 5

Box Plots of the RMSEs and Biases of the Parameter Estimates Calibrated Under Different Conditions When the CRM-ARS (a and c) and CRM (b and d) Are Fit to the Simulated CRM-ARS Data

Regarding the recovery of person parameters, the CRM-ARS yielded more accurate latent trait estimates than the CRM. Furthermore, the CRM-ARS yielded acceptably precise estimates of ARS propensity, as demonstrated in Figures 4(d) and 4(e). The parameter recovery patterns of the CRM-ARS resembled those of the CRM-ERS: longer test lengths were associated with more accurate person parameter estimates, and larger sample sizes contributed to increased precision for item and structural parameter estimates. Again, the patterns of the correlations between the true and estimated parameters were consistent with those obtained from the inspection of the biases and RMSEs, as shown in the lower portion of Table 1. The results suggested that the CRM-ARS provided better parameter estimation than the CRM when individuals’ responses were influenced by the ARS tendency.

Simulation 3: Evaluation of Model-Data Fit Criteria

Table 2 summarizes the efficiency of the DIC and PSIS-LOOCV criteria when VAS data simulated from a given model were fit with the true (data-generating) model and with the other two models. When the true model was the CRM-ERS or the CRM-ARS, fitting the true model to the simulated data yielded the smallest average DIC and PSIS-LOOCV values, and the proportion of replications in which the true model was selected as the best-fitting model approached one, suggesting that the DIC and PSIS-LOOCV criteria work efficiently in detecting different response styles. The two fit criteria, however, failed to flag the CRM as the best-fitting model when the CRM data were fit with the three models. We therefore calibrated the item and person parameters of the three models for the CRM data and examined whether parameter estimation differed substantially between the CRM (i.e., the true model) and the other two models.

Table 2

Efficiency Evaluation of Model-Data Fit Criteria for the Simulated Data

	Average fit value under each fitting model			Proportion of replications favoring each fitting model
True model	CRM-ERS	CRM-ARS	CRM	CRM-ERS	CRM-ARS	CRM
DIC
CRM-ERS	12,258	21,350	22,462	1.00	0.00	0.00
CRM-ARS	13,249	12,327	13,468	0.01	0.99	0.00
CRM	11,599	11,562	11,569	0.29	0.41	0.30
PSIS-LOOCV
CRM-ERS	12,005	21,256	22,539	1.00	0.00	0.00
CRM-ARS	12,886	11,811	13,393	0.01	0.99	0.00
CRM	11,494	11,495	11,505	0.44	0.45	0.11

Note. The simulation conditions were fixed to a sample size of 500 and a test length of 10. The values in bold indicate the overall best-fitting model.
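The proportions in the right-hand half of Table 2 can be obtained by recording, for each replication, which fitting model has the smallest criterion value. A minimal sketch follows; the fit values are hypothetical and are not taken from the study.

```python
from collections import Counter


def selection_proportions(replications):
    """Proportion of replications in which each model has the smallest fit value.

    Each element of `replications` maps a model name to its fit value
    (DIC or PSIS-LOOCV), where lower values indicate better fit.
    """
    winners = Counter(min(rep, key=rep.get) for rep in replications)
    total = len(replications)
    return {model: winners.get(model, 0) / total for model in replications[0]}


# Hypothetical fit values from four replications of one condition.
replications = [
    {"CRM-ERS": 12250.0, "CRM-ARS": 21340.0, "CRM": 22470.0},
    {"CRM-ERS": 12310.0, "CRM-ARS": 21400.0, "CRM": 22390.0},
    {"CRM-ERS": 12190.0, "CRM-ARS": 12180.0, "CRM": 22510.0},
    {"CRM-ERS": 12270.0, "CRM-ARS": 21290.0, "CRM": 22440.0},
]

proportions = selection_proportions(replications)
print(proportions)
```

In this toy example, the CRM-ERS is selected in three of the four replications and the CRM-ARS in one.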

The CRM can be considered a special case of the CRM-ERS when the variance in the ERS propensity (i.e., $σ_{γ}^{2}$ ; see Equations 3 and 4) is equal to zero. It can likewise be considered a special case of the CRM-ARS when the $ω_{i}$ parameters have a normal distribution with an extremely negative mean and a variance of zero and the $a_{j}^{(ARS)}$ parameters are close to unity (i.e., $\exp [a_{j}^{(ARS)} \times ω_{i}] \to 0$ ; see Equation 6). Note that for the CRM-ARS in this simulation, we constrained the mean of the item difficulty parameters to a fixed value (e.g., zero) rather than constraining the mean of the $ω_{i}$ distribution to zero, as was done previously, so that respondents with normal responses could be accurately identified.

Figure 6 shows the box plots of the bias and RMSE values obtained when the three models were fit to the CRM data with a small sample size of 500 persons and a short test length of 10 items. The three fitting models, namely the true CRM and the two extended CRMs considering the ERS and ARS effects, produced comparable and satisfactory estimates of the item and person parameters, as indicated by their small bias and RMSE values. As expected, under the CRM-ERS, the mean estimate across replications of the variance in the ERS propensity was 0.000; under the CRM-ARS, the mean estimates of the distributional mean and variance of the ARS propensity were −5.000 and 0.000, respectively, and the mean $a_{j}^{(ARS)}$ estimates ranged from 1.075 to 1.113 across the 10 items. These results imply that fitting the CRM-ERS or CRM-ARS to CRM data does not harm the item and person estimates, although it sacrifices model parsimony. On the other hand, as shown previously, fitting the simple CRM to CRM-ERS or CRM-ARS data led to seriously biased estimates of both the item and person parameters, underscoring the importance of using the proposed models for VAS data analysis.
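A quick calculation shows why these estimates effectively reduce the CRM-ARS to the CRM. Using only the values reported above (a distributional mean of −5 for the ARS propensity and ARS slopes between 1.075 and 1.113), the sketch below evaluates the exponential term that carries the ARS effect.

```python
import math

OMEGA_MEAN = -5.0               # mean ARS propensity estimate reported in the text
A_ARS_RANGE = (1.075, 1.113)    # smallest and largest mean ARS slope estimates

# Value of exp(a_ars * omega) at both ends of the reported slope range.
ars_terms = [math.exp(a_ars * OMEGA_MEAN) for a_ars in A_ARS_RANGE]

for a_ars, term in zip(A_ARS_RANGE, ars_terms):
    print(f"a_ars = {a_ars}: exp(a_ars * omega) = {term:.5f}")
```

The term stays below 0.005 across the reported range, so the ARS component adds almost nothing to the conditional mean and the fitted model behaves like the conventional CRM.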

Figure 6

Box Plots of the RMSEs and Biases of the Parameter Estimates Calibrated Under Different Models When the Data-Generating Model Was the CRM

Empirical Study

In a study conducted by Kan (2009), a survey was administered to 307 preservice teachers to assess teacher self-efficacy in various teaching activities. The participants were instructed to indicate their judgments on ten measurement items by marking a point on an 11-cm line segment, with the endpoints representing extremes of certainty (cannot do at all to highly certain can do). This dataset can be found in the R package “EstCRM” (Zopluoglu, 2022). To demonstrate the applicability of the extended CRMs for response styles, we applied our proposed models to this continuous dataset. Specifically, three models were fit to the data, namely the CRM-ERS, CRM-ARS, and conventional CRM, and their model-data fit was then compared.

To evaluate the model fit, two criteria, DIC and PSIS-LOOCV, were employed; smaller values of either criterion indicate a more favorable model-data fit. The JAGS program was used to calibrate the model parameters, and the same prior distributions as in the simulation studies were employed. The CRM-ERS demonstrated the most favorable fit: the DIC values were 6,079, 7,148, and 6,725 for the CRM-ERS, CRM-ARS, and conventional CRM, respectively, and the PSIS-LOOCV values were 5,783, 6,653, and 6,684 for the respective models. To explore the potential repercussions of disregarding extreme responses in VAS data, we then focused on the differences in estimates between the best-fitting model and the conventional model.

The upper portion of Table 3 presents the estimates of the model structural and item parameters under the CRM-ERS. These estimates ranged from 1.02 to 2.25 with an average of 1.71 for the item discrimination parameters, from −2.27 to −1.42 with an average of −1.90 for the item difficulty parameters, and from 0.81 to 1.01 with an average of 0.93 for the scaling parameters. In addition, the propensity for ERS exhibited a strong negative correlation with the target trait (−0.72), and the variation in ERS tendency across individuals appeared to be relatively mild (a variance of 0.21).

Table 3

The Estimates of the Model Structural Parameters Under the CRM-ERS (Upper Panel) and the Estimates of the Person Parameters Under the CRM and CRM-ERS Along With the Raw Response Scores for Selected Samples (Lower Panel)

	Item ID
Parameter	1	2	3	4	5	6	7	8	9	10
Discrimination	1.75	1.87	1.86	1.74	1.43	2.25	2.06	1.66	1.02	1.42
Difficulty	−1.90	−2.25	−1.72	−2.20	−1.42	−1.66	−1.85	−1.81	−1.86	−2.27
Scaling	0.93	0.91	1.01	0.91	1.01	1.01	0.97	0.90	0.81	0.86
Correlation $(θ, γ)$						−0.72
Variance $(γ)$						0.21
	Raw response scores												CRM	CRM-ERS
Participant ID	1	2	3	4	5	6	7	8	9	10	Mean	SD	$\hat{θ}$	$\hat{θ}$	$\hat{γ}$
34	5.40	7.80	1.90	7.85	1.60	1.45	4.10	7.05	7.00	5.30	4.95	2.43	−2.67	−2.03	0.57
222	10.00	5.50	8.80	5.85	5.90	5.80	10.60	5.50	5.50	10.20	7.37	2.12	−0.82	−1.32	0.72
130	8.40	5.80	8.95	8.45	7.85	6.70	8.55	5.05	7.40	7.00	7.42	1.21	−1.20	−1.05	0.08
269	7.70	7.75	7.35	7.40	8.55	7.80	7.20	6.80	7.10	7.15	7.48	0.47	−1.22	0.32	−0.92
32	8.65	8.45	8.80	8.45	9.05	8.45	8.90	8.55	8.80	8.40	8.65	0.21	−0.49	1.83	−0.98
279	5.90	9.90	10.00	9.90	5.40	9.95	5.80	9.95	9.75	10.00	8.66	1.94	−0.01	−0.80	0.47
26	7.05	9.60	9.20	9.95	7.55	8.45	8.15	9.55	8.35	9.00	8.69	0.89	−0.25	−0.15	−0.11
268	7.70	10.20	9.35	9.50	10.95	9.75	10.65	10.35	8.20	10.70	9.74	1.03	1.15	−0.45	0.59
108	10.00	10.05	9.60	9.70	10.00	9.70	9.80	9.50	9.35	9.70	9.74	0.22	0.61	1.78	−0.51
153	9.45	9.50	9.00	9.60	10.35	10.30	9.70	9.05	10.20	10.35	9.75	0.50	0.62	0.52	−0.07
133	10.90	10.95	10.30	10.90	10.15	10.15	10.75	10.50	10.70	10.80	10.61	0.30	2.96	0.31	0.55

The literature indicates varying correlation estimates between substantive traits and ERS propensity: some studies have shown a moderate positive correlation (Plieninger & Meiser, 2014; Thissen-Roe & Thissen, 2013), some have exhibited a moderate negative correlation (LaHuis et al., 2019; Liu & Wang, 2019), and others have shown a slight or near-zero correlation (Bolt & Newton, 2011; LaHuis et al., 2019). The diverse findings may be attributed to the use of different modeling approaches and may depend on the survey contents applied for analysis. For example, LaHuis et al. (2019) applied two IRTree models derived from different cognitive process assumptions to fit real data and found that the correlations between the content and ERS dimensions were substantially different. Furthermore, Bolt and Newton (2011) used a multidimensional nominal response model to analyze multiscale data and found that the correlation estimation depended on which content was analyzed (see also Falk & Cai, 2016). Therefore, we cannot clearly determine the reasons behind the negative correlation estimated from our example or compare the results with those of previous studies because the measurement models and assessed tests were entirely different. As suggested by Liu and Wang (2019), a qualitative interview may be designed to identify the underlying cognitive bias triggering preservice teachers’ ERS tendencies and to investigate why people with lower teaching self-efficacy exhibit a greater tendency toward ERS. However, this approach is not accessible for the current analysis due to the nature of the secondary data.

The lower portion of Table 3 displays the raw response scores along with the corresponding θ and γ estimates for the selected samples. Given that the estimates for the item difficulty parameters were relatively low, only three respondents had a mean score below 5.5, the midpoint of the continuum. This implies that most participants were more inclined to mark a point toward the right end of the continuum if their γ estimate was higher. Participant 34 was one of the three individuals whose mean score fell below 5.5. Because of an elevated propensity for ERS $(\hat{γ} = 0.57)$ , Participant 34 tended to select scores from the lower range of the continuum, and the CRM-ERS accordingly adjusted this participant’s θ estimate upward from −2.67 to −2.03.

Participants 222, 130, and 269 had similar item mean scores but different γ estimates. Similar patterns emerged among Participants 32, 279, and 26 and among Participants 268, 108, and 153. When the mean scores were comparable, individuals with higher γ estimates consistently yielded higher θ estimates calibrated by the conventional CRM than those calibrated by the CRM-ERS. This indicates that neglecting the influence of ERS leads to an overestimation of θ. Conversely, individuals with a tendency toward MRS were associated with lower γ estimates. In these cases, the CRM-ERS attributed higher θ estimates to respondents with MRS tendencies. For example, Participant 32, with a $\hat{γ}$ of −0.98, received θ estimates of −0.49 from the CRM and 1.83 from the CRM-ERS. For other participants whose γ estimates were relatively close to zero, such as Participant 130, the disparities between the two models did not exhibit significant or systematic patterns.

Notably, as in discrete multiple-parameter IRT models, respondents’ target trait estimates are no longer linearly related to their raw scores because the items are allowed to have different discriminating powers with respect to the substantive dimension (Embretson & Reise, 2000). For example, the CRM estimated the target trait parameter as 1.55 and 0.61 for Participants 268 and 108, respectively, who had the same mean raw score but different item response patterns. In addition, the CRM-ERS can separate the influence of the target trait from that of the ERS propensity on item responses and thereby provide more precise estimates of the target trait parameters. This implies that the trait estimates adjusted via the CRM-ERS deviate considerably from those calibrated by the conventional CRM, and from the corresponding raw scores, when the ERS effect is strongly involved in the item responses. The deviation in the person-trait estimates between the CRM and CRM-ERS is illustrated by the following analysis.
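The contrast between Participants 268 and 108 can be verified from the raw scores listed in Table 3. The sketch below recomputes their means and standard deviations; the use of the population standard deviation is an assumption, chosen because it reproduces the tabled SD values after rounding.

```python
import statistics

# Raw response scores for the ten items, copied from Table 3.
raw_scores = {
    268: [7.70, 10.20, 9.35, 9.50, 10.95, 9.75, 10.65, 10.35, 8.20, 10.70],
    108: [10.00, 10.05, 9.60, 9.70, 10.00, 9.70, 9.80, 9.50, 9.35, 9.70],
}

summary = {}
for participant, scores in raw_scores.items():
    mean_score = statistics.fmean(scores)
    sd_score = statistics.pstdev(scores)  # population SD (assumed convention)
    summary[participant] = (mean_score, sd_score)
    print(f"Participant {participant}: mean = {mean_score:.3f}, SD = {sd_score:.3f}")
```

Both participants have essentially the same mean score, but the responses of Participant 268 are far more dispersed, so the two response patterns carry different information about the trait and the response style.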

Finally, we generated a scatter plot, as shown in Figure 7, to examine the relationship between the θ estimates obtained from the CRM and those obtained from the CRM-ERS. Notably, the scatter points deviated noticeably from the diagonal identity line. The correlation coefficient for the θ estimates between the two models was 0.58, underscoring the substantial differences in the θ estimates of the CRM-ERS and CRM. Consistent with the simulation results and the literature on response-style-related IRT models (e.g., Jin & Wang, 2014; Liu & Wang, 2019; Plieninger, 2017; Tutz et al., 2018), it can be concluded that persons’ trait scores adjusted by the CRM-ERS are more informative about their true trait levels than are those calibrated by the conventional CRM. To summarize, the empirical analysis of the data revealed that participants’ item responses were influenced by varying tendencies toward ERS or MRS. In addition, the CRM-ERS exhibited greater precision in estimation, positioning it as a valuable tool for quantifying different levels of extreme or middle responses.

Figure 7

Relationship of the θ Estimates Between the CRM-ERS and CRM

A Follow-Up Simulation

Although the results of the empirical study indicated that the CRM-ERS provided a better fit to the real data than the competing models, the estimates of the item parameters did not spread within a range that was perfectly congruous with that of the simulation manipulation, especially for the item difficulty parameters (i.e., the items were much easier for participants). Therefore, a follow-up simulation that mimicked the real-data analysis conditions was conducted to evaluate the robustness of the CRM-ERS to variations in the model parameters. The item and structural parameters were set the same as the estimates from the above analysis, and the sample size was fixed to 307. The data responses were generated according to the CRM-ERS and then fit to the true model to examine the parameter recovery using the same procedure as in the above simulation. The simulation conditions were also replicated 100 times.

Owing to space constraints, the parameter recovery results are only summarized here. The bias (or RMSE) values were between −0.048 (0.018) and −0.014 (0.156) for the item discrimination parameters, between −0.053 (0.089) and −0.012 (0.156) for the item difficulty parameters, between −0.019 (0.053) and −0.010 (0.067) for the scaling parameters, 0.014 (0.035) for the trait correlation parameter, and −0.006 (0.018) for the variance of the ERS propensity. In addition, the mean RMSEs across replications were 0.445 and 0.201 for the target trait and ERS propensity estimates, respectively. The quality of the parameter estimation under the real-data-based conditions was comparable to that obtained under the ideal conditions (as shown in Figures 2 to 4), indicating that the parameter recovery was little affected by the variations in the parameter distribution.

Conclusion

Both discrete and continuous rating scales are frequently employed in survey research, and individuals’ responses can be influenced by various interpretations of these scales. Neglecting to appropriately account for response styles can potentially compromise the validity of tests and hinder accurate inferences about individuals’ abilities (Baumgartner & Steenkamp, 2001; Falk & Cai, 2016; Huang, 2016; Plieninger & Meiser, 2014). In light of the extensive investigation of ERS and ARS within the framework of discrete IRT modeling, we developed the CRM-ERS and CRM-ARS to accommodate these effects within VAS data. To explore the ramifications of the ERS and ARS effects, we simulated continuous response data under various conditions using the CRM-ERS or CRM-ARS. We then employed Bayesian methods to calibrate the model parameters based on the data-generating model and the conventional CRM. This approach allowed us to assess parameter recovery and investigate the implications of neglecting ERS and ARS effects on VAS responses.

The results of the simulations demonstrate that both the CRM-ERS and CRM-ARS recover item and person parameters satisfactorily, whereas the misspecified conventional CRM does not. However, the discrimination parameters associated with the ARS dimension in the CRM-ARS displayed only marginally acceptable estimation precision because they enter the probability density function through an exponential function. As with most psychometric models, increasing the test length enhances the measurement precision of person parameters, while increasing the sample size improves parameter calibration for both item and structural parameters in the CRM-ERS and CRM-ARS.

For ease of application and illustration, we selected an empirical dataset that employed a continuous item response format to measure the self-efficacy of preservice teachers in teaching activities. We then fit this dataset to the extended CRMs we developed, aiming to detect potential ERS and ARS effects. Following model-data fit evaluation, the CRM-ERS emerged as the optimal choice, indicating that the presence of ERS influences item responses and revealing variations in participants’ tendencies to overuse the endpoints or middle points on the response continuum. Due to the relatively low estimates for the item difficulty parameters within the CRM-ERS (all negative estimates), participants tended to select higher scores on the continuum. Most notably, the ERS dimension influenced the rightmost points of the continuous rating scale for most participants. A comparison of the outcomes derived from the best-fitting CRM-ERS with those from the conventional CRM demonstrated that the conventional CRM calibration tended to underestimate the target latent trait parameters when a participant exhibited a tendency toward the MRS (i.e., a negative $\hat{γ}$ ) and, conversely, overestimate when an ERS tendency (i.e., a positive $\hat{γ}$ ) was present. These findings align with our expectations and underscore the biased estimation of individual traits caused by fitting a conventional CRM to VAS data while disregarding possible ERS effects. Furthermore, the ERS dimension was strongly negatively correlated with the target trait dimension. This finding implies that preservice teachers with lower self-efficacy are more inclined toward ERS, while those with higher self-efficacy might prefer using the middle point of the continuum. Similar relationships between these two dimensions that align with our findings can be found in previous studies (Liu & Wang, 2019; Plieninger, 2017).

Under the framework of general linear models, the proposed CRM-ARS can be thought of as a special case of the random intercept item factor analysis model (Maydeu-Olivares & Coffman, 2006) when the random intercept term is assumed to be positive and the observed responses have been logit-transformed. Therefore, the CRM-ARS can be flexibly modified to meet the needs of real testing situations. For example, to reduce the influence of the ARS, a balanced scale (Ray, 1983), in which half of the test items measure the target trait in one direction and the other half measure it in the opposite direction (i.e., reverse-worded items), has frequently been applied in the social sciences. Given the poor measurement properties found for negatively worded items (Barnette, 2000), the CRM-ARS can be used to fit responses to both positively and negatively worded items for the purpose of controlling for the ARS. Following common practice, one can first reverse-score the reverse-worded items and then postulate a positive loading on the ARS dimension for the positively worded items and a negative loading on the ARS dimension for the negatively worded items because, for the negatively worded items, a low reversed score implies the propensity to agree with these items. In addition, to make the parameter estimation more stable and tractable, the weak assumption of balance can be adopted, and the loadings on the ARS dimension can be constrained to sum to zero (see Ferrando et al., 2011; Ferrando & Lorenzo-Seva, 2010).
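The reverse-scoring and sum-to-zero ideas described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the estimation procedure used in the study: the 0 to 100 scoring interval, the function names `reverse_score` and `balanced_ars_loadings`, and the particular way of enforcing the constraint (rescaling the loadings of the reverse-worded items so that the two groups cancel) are all hypothetical choices made for this example.

```python
import numpy as np


def reverse_score(responses, reverse_items, lower=0.0, upper=100.0):
    """Reflect reverse-worded items around the midpoint of the scoring interval.

    The 0-100 interval is an assumed VAS range, not taken from the article.
    """
    scored = np.array(responses, dtype=float)  # work on a copy
    scored[:, reverse_items] = lower + upper - scored[:, reverse_items]
    return scored


def balanced_ars_loadings(magnitudes, reverse_items):
    """Build signed ARS loadings that sum to zero (weak balance).

    Hypothetical parametrization: positively worded items keep their
    positive magnitude; reverse-worded items receive a negative loading,
    rescaled so that the two groups cancel exactly.
    """
    magnitudes = np.asarray(magnitudes, dtype=float)
    if np.any(magnitudes <= 0):
        raise ValueError("magnitudes must be strictly positive")
    is_reversed = np.zeros(magnitudes.size, dtype=bool)
    is_reversed[reverse_items] = True
    if is_reversed.all() or not is_reversed.any():
        raise ValueError("need both positively and reverse-worded items")
    scale = magnitudes[~is_reversed].sum() / magnitudes[is_reversed].sum()
    loadings = magnitudes.copy()
    loadings[is_reversed] = -scale * magnitudes[is_reversed]
    return loadings


# Two respondents, four items; items 1 and 3 (0-based) are reverse-worded.
responses = np.array([[80.0, 30.0, 65.0, 10.0],
                      [55.0, 60.0, 40.0, 45.0]])
reverse_items = [1, 3]

scored = reverse_score(responses, reverse_items)
loadings = balanced_ars_loadings([1.0, 0.5, 0.8, 1.2], reverse_items)
print(scored)
print(loadings, loadings.sum())
```

In a Bayesian calibration, the same constraint would more likely be imposed inside the model specification (for instance, by defining one loading as the negative sum of the others); the rescaling shown here is just one simple way to make the constraint concrete.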

When fitting the proposed CRM-ERS and CRM-ARS to empirical VAS data in real testing situations, we offer the following recommendations. First, the lower and upper boundaries should be inspected, and the observed item responses should be adjusted by removing maximum and minimum scores or enlarging the scoring interval so that the logit transformation remains defined (Finch, 2023; Tutz & Jordan, 2023). Second, one should determine which response style matches the data. In practice, one can fit the CRM-ERS, CRM-ARS, and conventional CRM to the data and compare them in terms of model-data fit criteria, such as the DIC and PSIS-LOOCV used in this study. Third, when the CRM-ERS or CRM-ARS is selected as the best-fitting model, one should examine whether the variance of the ERS or ARS propensity is negligibly small and not statistically significant, or whether the mean of the ARS distribution is extremely low. Although the DIC and PSIS-LOOCV criteria are not sensitive enough to favor the simpler model, fitting the more complex CRM-ERS and CRM-ARS to data that follow the CRM still yields precise and satisfactory parameter estimates. For the sake of parsimony, however, the conventional CRM is recommended when the response styles have no effect. Fourth, the item content should be reviewed, and items that are likely to trigger ERS or ARS should be rewritten. For example, the $a_{j}^{(ARS)}$ parameter of the CRM-ARS can be used as an indicator of the extent to which an item is influenced by the ARS dimension. An item with a higher $a_{j}^{(ARS)}$ estimate may evoke socially desirable responding and may need to be revised. Finally, the potential distortive effect of response bias should be checked. Because the exhibition of a specific response style is context dependent (for a detailed discussion, see Thissen-Roe & Thissen, 2013), follow-up interviews may help provide a fuller picture.
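To make the first recommendation concrete, the following Python sketch shows one way of enlarging the scoring interval before applying the logit transformation, so that responses at the endpoints remain finite. It is a minimal illustration under assumptions: the 0 to 100 range, the padding value of 0.5, and the function name `logit_transform` are hypothetical and should be adapted to the instrument at hand. The alternative mentioned above, removing boundary scores, would simply drop or mask those observations instead.

```python
import numpy as np


def logit_transform(x, lower=0.0, upper=100.0, pad=0.5):
    """Map bounded VAS scores to the real line.

    The scoring interval [lower, upper] is widened by `pad` on each side
    (an assumed, user-chosen amount) so that responses exactly at the
    boundaries do not produce infinite logits.
    """
    if pad <= 0:
        raise ValueError("pad must be positive to keep boundary scores finite")
    x = np.asarray(x, dtype=float)
    if np.any((x < lower) | (x > upper)):
        raise ValueError("responses fall outside the stated scoring interval")
    lo, hi = lower - pad, upper + pad      # enlarged interval
    p = (x - lo) / (hi - lo)               # proportion strictly inside (0, 1)
    return np.log(p / (1.0 - p))


# Example: endpoint responses (0 and 100) stay finite after the adjustment.
vas = np.array([0.0, 12.5, 50.0, 100.0])
z = logit_transform(vas)
print(z)
```

The size of the padding is a modeling choice: a smaller value pushes endpoint responses further into the tails of the transformed scale, so it may be worth checking how sensitive the calibration is to this setting.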

Although this study contributes to the development and extension of the CRM to account for ERS and ARS, further research is needed to address this complex topic comprehensively. First, response styles are not confined to ERS and ARS; multiple response styles could concurrently influence respondents’ strategies when using continuous rating scales. A diverse array of discrete IRT models has been formulated under a comprehensive measurement framework to simultaneously address various response styles and to allow latent subpopulations to be affected by distinct response styles through a mixture modeling approach (Falk & Cai, 2016; Huang, 2016; Liu & Wang, 2019; Plieninger, 2017). Applying this approach to the analysis of VAS data, one might consider merging the CRM-ERS and CRM-ARS to quantify both ERS and ARS tendencies or exploring latent heterogeneity in ERS or ARS tendencies within the population. However, such extensions can present estimation efficiency challenges, given the simultaneous involvement of multiple latent propensities and the need to appropriately address model identifiability. Future research in this domain could explore these complexities and strive to develop models that accommodate a broader spectrum of response styles in VAS data. This approach would contribute to a more comprehensive understanding of the intricacies involved and pave the way for improved measurement practices.

Second, additional information could be leveraged to enhance measurement precision and offer insights into cognitive processes. For instance, the time respondents take to answer individual items could be jointly modeled with the distribution of item responses, enabling the interplay among the target trait, response-style-related tendencies, and speed factors to be explored. This could be achieved within the hierarchical modeling framework (van der Linden, 2007). Furthermore, external covariates such as gender, age, or other relevant background variables could be integrated to explain population heterogeneity by regressing the latent propensities on the covariate set. This objective could be accomplished by employing the explanatory item response modeling framework (De Boeck & Wilson, 2004), which warrants further exploration and investigation.

Recently, Tutz and Jordan (2023) introduced a comprehensive framework of latent trait response models tailored for continuous responses. They developed a threshold model that is remarkably flexible in accommodating a wide array of continuous response types, including positive responses and range restrictions, by specifying diverse response and difficulty functions (see also Tutz, 2022). Various existing models for measuring continuous scales, including the CRM (Samejima, 1973; T. Wang & Zeng, 1998), factor-analytic models (McDonald, 1985), the generalized linear IRT model (Mellenbergh, 1994), and the lognormal response-time model (van der Linden, 2006), can be viewed as special cases of this threshold model. Effectively addressing the influence of response styles on VAS scores within the general threshold model perspective requires substantial effort. Exploring how to extend the threshold model to meet these requirements is an intriguing subject worth further investigation.

Supplemental Material

Supplemental material, sj-docx-1-epm-10.1177_00131644241242789 for Exploring the Influence of Response Styles on Continuous Scale Assessments: Insights From a Novel Modeling Approach by Hung-Yu Huang in Educational and Psychological Measurement

Footnotes

Author’s Note

Hung-Yu Huang is currently a professor at the Institute of Education, National Cheng Kung University, Taiwan. The address is No. 1, University Rd., East District, Tainan City 70101, Taiwan. Correspondence should be addressed to hyhuang1220@gmail.com.

Declaration of Conflicting Interests

The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study was supported by the National Science and Technology Council (No. 112-2410-H-845-036).

ORCID iD

Hung-Yu Huang

Supplemental Material

Supplemental material for this article is available online.

References

1. Barnette, J. J. (2000). Effects of stem and Likert response option reversals on survey internal consistency: If you feel the need, there is a better alternative to using those negatively worded stems. Educational and Psychological Measurement, 60(3), 361–370. https://doi.org/10.1177/00131640021970592

2. Baumgartner, & Steenkamp, J.-B. E. M. (2001). Response styles in marketing research: A cross-national investigation. Journal of Marketing Research, 38(2), 143–156. https://doi.org/10.1509/jmkr.38.2.143.18840

3. Bejar, I. I. (1977). An application of the continuous response level model to personality measurement. Applied Psychological Measurement, 1(4), 509–521. https://doi.org/10.1177/014662167700100407

4. Billiet, J. B., & McClendon, M. J. (2000). Modeling acquiescence in measurement models for two balanced sets of items. Structural Equation Modeling, 7(4), 608–628. https://doi.org/10.1207/S15328007SEM0704_5

5. Böckenholt (2012). Modeling multiple response processes in judgment and choice. Psychological Methods, 17(4), 665–678. https://doi.org/10.1037/a0028111

6. Bolt, D. M., & Johnson, T. R. (2009). Addressing score bias and differential item functioning due to individual differences in response style. Applied Psychological Measurement, 33(5), 335–352. https://doi.org/10.1177/0146621608329891

7. Bolt, D. M., & Newton, J. R. (2011). Multiscale measurement of extreme response style. Educational and Psychological Measurement, 71(5), 814–833. https://doi.org/10.1177/0013164410388411

8. Brooks, S. P., & Gelman (1998). General methods for monitoring convergence of iterative simulations. Journal of Computational and Graphical Statistics, 7(4), 434–455. https://doi.org/10.1080/10618600.1998.10474787

9. Brumfitt, S. M., & Sheeran (1999). The development and validation of the Visual Analogue Self-Esteem Scale (VASES). British Journal of Clinical Psychology, 38(4), 387–400. https://doi.org/10.1348/014466599162980

10. Christ, T. J., Zopluoglu, Monaghen, B. D., & Van Norman, E. R. (2013). Curriculum-based measurement of oral reading: Multi-study evaluation of schedule, duration, and dataset quality on progress monitoring outcomes. Journal of School Psychology, 51(1), 19–57. https://doi.org/10.1016/j.jsp.2012.11.001

11. De Beuckelaer, Weijters, & Rutten (2010). Using ad hoc measures for response styles: A cautionary note. Quality & Quantity: International Journal of Methodology, 44(4), 761–775. https://doi.org/10.1007/s11135-009-9225-z

12. De Boeck, Cho, S.-J., & Wilson (2011). Explanatory secondary dimension modeling of latent differential item functioning. Applied Psychological Measurement, 35(8), 583–603. https://doi.org/10.1177/0146621611428446

13. De Boeck, & Partchev (2012). IRTrees: Tree-based item response models of the GLMM family. Journal of Statistical Software, 48(1), 1–28. https://doi.org/10.18637/jss.v048.c01

14. De Boeck, & Wilson (2004). Explanatory item response models: A generalized linear and nonlinear approach. Springer. https://doi.org/10.1007/978-1-4757-3990-9_1

15. De Jong, M. G., Steenkamp, J.-B. E. M., Fox, J.-P., & Baumgartner (2008). Using item response theory to measure extreme response style in marketing research: A global investigation. Journal of Marketing Research, 45(1), 104–115. https://doi.org/10.1509/jmkr.45.1.104

16. Embretson, S. E., & Reise, S. P. (2000). Item response theory for psychologists. Lawrence Erlbaum Associates Publishers.

17. Falk, C. F., & Cai (2016). A flexible full-information approach to the modeling of response styles. Psychological Methods, 21(3), 328–347. https://doi.org/10.1037/met0000059

18. Ferrando, P. J. (2002). Theoretical and empirical comparisons between two models for continuous item responses. Multivariate Behavioral Research, 37(4), 521–542. https://doi.org/10.1207/S15327906MBR3704_05

19. Ferrando, P. J. (2009). Difficulty, discrimination, and information indices in the linear factor analysis model for continuous item responses. Applied Psychological Measurement, 33(1), 9–24. https://doi.org/10.1177/0146621608314608

20. Ferrando, P. J. (2014). A factor-analytic model for assessing individual differences in response scale usage. Multivariate Behavioral Research, 49(4), 390–405. https://doi.org/10.1080/00273171.2014.911074

21. Ferrando, P. J., Anguiano-Carrasco, & Chico (2011). The impact of acquiescence on forced-choice responses: A model-based analysis. Psicológica, 32(1), 87–105.

22. Ferrando, P. J., & Lorenzo-Seva (2010). Acquiescence as a source of bias and model and person misfit: A theoretical and empirical analysis. British Journal of Mathematical and Statistical Psychology, 63(2), 427–448. https://doi.org/10.1348/000711009X470740

23. Ferrando, P. J., Lorenzo-Seva, & Chico (2003). Unrestricted factor analytic procedures for assessing acquiescent responding in balanced, theoretically unidimensional personality scales. Multivariate Behavioral Research, 38(3), 353–374. https://doi.org/10.1207/S15327906MBR3803_04

24. Ferrando, P. J., & Navarro-González (2021). A multidimensional item response theory model for continuous and graded responses with error in persons and items. Educational and Psychological Measurement, 81(6), 1029–1053. https://doi.org/10.1177/0013164421998412

25. Finch, W. H. (2023). The impact and detection of uniform differential item functioning for continuous item response models. Educational and Psychological Measurement, 83(5), 929–952. https://doi.org/10.1177/00131644221111993

26. García-Pérez, M. A. (2023). Are the steps on Likert scales equidistant? Responses on visual analog scales allow estimating their distances. Educational and Psychological Measurement, 84, 91–122. https://doi.org/10.1177/00131644231164316

27. Henninger, & Meiser (2020a). Different approaches to modeling response styles in divide-by-total item response theory models (part 1): A model integration. Psychological Methods, 25(5), 560–576. https://doi.org/10.1037/met0000249

28. Henninger, & Meiser (2020b). Different approaches to modeling response styles in divide-by-total item response theory models (part 2): Applications and novel extensions. Psychological Methods, 25(5), 577–595. https://doi.org/10.1037/met0000268

29. Holland, P. W., & Wainer (Eds.). (1993). Differential item functioning (1st ed.). Routledge. https://doi.org/10.4324/9780203357811

30. Holman, & Glas, C. A. W. (2005). Modelling non-ignorable missing-data mechanisms with item response theory models. British Journal of Mathematical and Statistical Psychology, 58(1), 1–17. https://doi.org/10.1348/000711005X47168

31. Huang, H.-Y. (2014). Effects of the common scale setting in the assessment of differential item functioning. Psychological Reports, 114(1), 104–125. https://doi.org/10.2466/03.PR0.114k11w0

32. Huang, H.-Y. (2016). Mixture random-effect IRT models for controlling extreme response style on rating scales. Frontiers in Psychology, 7, Article 1706. https://doi.org/10.3389/fpsyg.2016.01706

33. Huang, H.-Y. (2020). A mixture IRTree model for performance decline and nonignorable missing data. Educational and Psychological Measurement, 80(6), 1168–1195. https://doi.org/10.1177/001316442091

34. Huang, H.-Y. (2023). Modeling rating order effects under item response theory models for rater-mediated assessments. Applied Psychological Measurement, 47(4), 312–327. https://doi.org/10.1177/01466216231174566

35. Huang, H.-Y., & Wang, W.-C. (2014). Multilevel higher-order item response theory models. Educational and Psychological Measurement, 74(3), 495–515. https://doi.org/10.1177/0013164413509628

36. Hung, S.-P., & Huang, H.-Y. (2022). Forced-choice ranking models for raters’ ranking data. Journal of Educational and Behavioral Statistics, 47(5), 603–634. https://doi.org/10.3102/10769986221104207

37. Jin, K.-Y., Paulhus, D. L., & Shih, C.-L. (2023). A new approach to desirable responding: Multidimensional item response model of overclaiming data. Applied Psychological Measurement, 47(3), 221–236. https://doi.org/10.1177/01466216231151704

38. Jin, K.-Y., & Wang, W.-C. (2014). Generalized IRT models for extreme response style. Educational and Psychological Measurement, 74(1), 116–138. https://doi.org/10.1177/0013164413498876

39. Jin, K.-Y., & Wang, W.-C. (2018). A new facets model for rater’s centrality/extremity response style. Journal of Educational Measurement, 55(4), 543–563. https://doi.org/10.1111/jedm.12191

40. John, O. P., & Srivastava (1999). The Big Five trait taxonomy: History, measurement, and theoretical perspectives. In L. A. Pervin & O. P. John (Eds.), Handbook of personality: Theory and research (2nd ed., pp. 102–138). Guilford Press.

41. Johnson, T. R., & Bolt, D. M. (2010). On the use of factor-analytic multinomial logit item response models to account for individual differences in response style. Journal of Educational and Behavioral Statistics, 35(1), 92–114. https://doi.org/10.3102/1076998609340529

42. Jonas, K. G., & Markon, K. E. (2019). Modeling response style using vignettes and person-specific item response theory. Applied Psychological Measurement, 43(1), 3–17. https://doi.org/10.1177/0146621618798663

43. Kan (2009). Effect of scale response format on psychometric properties in teaching self-efficacy. Euroasian Journal of Educational Research, 34, 215–228.

44. LaHuis, D. M., Blackmore, C. E., Bryant-Lees, K. B., & Delgado (2019). Applying item response trees to personality data in the selection context. Organizational Research Methods, 22(4), 1007–1018. https://doi.org/10.1177/1094428118780310

45. Liu, C.-W., Qiu, X.-L., & Wang, W.-C. (2019). Item response theory modeling for examinee-selected items with rater effect. Applied Psychological Measurement, 43(6), 435–448. https://doi.org/10.1177/0146621618798667

46. Liu, C.-W., & Wang, W.-C. (2016). Unfolding IRT models for Likert-type items with a don’t know option. Applied Psychological Measurement, 40(7), 517–533. https://doi.org/10.1177/0146621616664047

47. Liu, C.-W., & Wang, W.-C. (2019). A general unfolding IRT model for multiple response styles. Applied Psychological Measurement, 43(3), 195–210. https://doi.org/10.1177/0146621618762743

48. Lord, F. M. (1980). Application of item response theory to practical testing problems. Erlbaum.

49. Martin, M. O., Mullis, I. V. S., & Hooper (Eds.). (2017). Methods and procedures in PIRLS 2016. https://timssandpirls.bc.edu/publications/pirls/2016-methods.html

50. Maydeu-Olivares, & Coffman, D. L. (2006). Random intercept item factor analysis. Psychological Methods, 11(4), 344–362. https://doi.org/10.1037/1082-989X.11.4.344

51. McDonald, R. P. (1985). Factor analysis and related methods. Erlbaum.

52. Mellenbergh, G. J. (1994). Generalized linear item response theory. Psychological Bulletin, 115(2), 300–307. https://doi.org/10.1037/0033-2909.115.2.300

53. Mellenbergh, G. J. (2016). Models for continuous responses. In W. J. van der Linden (Ed.), Handbook of item response theory (pp. 153–163). CRC Press/Taylor & Francis Group.

54. Merhof, & Meiser (2023). Dynamic response strategies: Accounting for response process heterogeneity in IRTree decision nodes. Psychometrika, 88(4), 1354–1380. https://doi.org/10.1007/s11336-023-09901-0

55. Meyer, J. P. (2010). A mixture Rasch model with item response time components. Applied Psychological Measurement, 34(7), 521–538. https://doi.org/10.1177/0146621609355

56. Mislevy, R. J., Levy, Kroopnick, & Rutstein (2008). Evidentiary foundations of mixture item response theory models. In G. R. Hancock & K. M. Samuelsen (Eds.), Advances in latent variable mixture models (pp. 149–176). Information Age.

57. Morren, Gelissen, & Vermunt (2012). The impact of controlling for extreme responding on measurement equivalence in cross-cultural research. Methodology: European Journal of Research Methods for the Behavioral and Social Sciences, 8(4), 159–170. https://doi.org/10.1027/1614-2241/a000048

58. Müller (1987). A Rasch model for continuous ratings. Psychometrika, 52(2), 165–181. https://doi.org/10.1007/BF02294232

59. Plieninger (2017). Mountain or molehill? A simulation study on the impact of response styles. Educational and Psychological Measurement, 77(1), 32–53. https://doi.org/10.1177/0013164416636655

60. Plieninger, & Meiser (2014). Validity of multiprocess IRT models for separating content and response styles. Educational and Psychological Measurement, 74(5), 875–899. https://doi.org/10.1177/0013164413514998

61. Plummer (2017). JAGS version 4.3.0 user manual [Computer software manual]. https://sourceforge.net/projects/mcmc-jags

62. Ray, J. J. (1983). Reviving the problem of acquiescent response bias. The Journal of Social Psychology, 121(1), 81–96. https://doi.org/10.1080/00224545.1983.9924470

63. Reckase, M. D. (2009). Multidimensional item response theory. Springer. https://doi.org/10.1007/978-0-387-89976-3

64. Samejima (1969). Estimation of latent ability using a response pattern of graded scores. Psychometrika Monograph Supplement, 34(4, Pt. 2), 100.

65. Samejima (1973). Homogeneous case of the continuous response model. Psychometrika, 38(2), 203–219. https://doi.org/10.1007/BF02291114

66. Shojima (2005). A noniterative item parameter solution in each EM cycle of the continuous response model. Educational Technology Research, 28, 11–22.

67. Simms, L. J., Zelazny, Williams, T. F., & Bernstein (2019). Does the number of response options matter? Psychometric perspectives using personality questionnaire data. Psychological Assessment, 31(4), 557–566. https://doi.org/10.1037/pas0000648

68. Spence, Owens, & Goodyer (2012). Item response theory and validity of the NEO-FFI in adolescents. Personality and Individual Differences, 53(6–4), 801–807. https://doi.org/10.1016/j.paid.2012.06.002

69. Spiegelhalter, D. J., Best, N. G., Carlin, B. P., & van der Linde (2002). Bayesian measures of model complexity and fit. Journal of the Royal Statistical Society. Series B: Statistical Methodology, 64(4), 583–616. https://doi.org/10.1111/1467-9868.00353

70. Thissen-Roe, & Thissen (2013). A two-decision model for responses to Likert-type items. Journal of Educational and Behavioral Statistics, 38(5), 522–547. https://doi.org/10.3102/1076998613481500

71. Tutz (2022). Item response thresholds models: A general class of models for varying types of items. Psychometrika, 87(4), 1238–1269. https://doi.org/10.1007/s11336-022-09865-7

72. Tutz, & Jordan (2023). Latent trait item response models for continuous responses. Journal of Educational and Behavioral Statistics. https://doi.org/10.3102/10769986231184147

73. Tutz, Schauberger, & Berger (2018). Response styles in the partial credit model. Applied Psychological Measurement, 42(6), 407–427. https://doi.org/10.1177/0146621617748322

74. van der Linden, W. J. (2006). A lognormal model for response times on test items. Journal of Educational and Behavioral Statistics, 31(2), 181–204. https://doi.org/10.3102/10769986031002181

75. van der Linden, W. J. (2007). A hierarchical framework for modeling speed and accuracy on test items. Psychometrika, 72(3), 287–308. https://doi.org/10.1007/s11336-006-1478-z

76. Vehtari, Gelman, & Gabry (2017). Practical Bayesian model evaluation using leave-one-out cross-validation and WAIC. Statistics and Computing, 27(5), 1413–1432. https://doi.org/10.1007/s11222-016-9696-4

77. Vehtari, Simpson, Gelman, Yao, & Gabry (2019). Pareto smoothed importance sampling [Preprint]. arXiv:1507.02646

78. von Davier, Eid, & Zickar, M. J. (2007). Detecting response styles and faking in personality and organizational assessments by mixed Rasch models. In C. H. Carstensen (Ed.), Multivariate and mixture distribution Rasch models (pp. 255–270). Springer Verlag.

79. Wang, Shang, & Kuncel (2018). Detecting aberrant behavior and item preknowledge: A comparison of mixture modeling method and residual method. Journal of Educational and Behavioral Statistics, 43(4), 469–501. https://doi.org/10.3102/1076998618767123

80. Wang, & Zeng (1998). Item parameter estimation for a continuous response model using an EM algorithm. Applied Psychological Measurement, 22(4), 333–344. https://doi.org/10.1177/0146621698022004

81. Weijters, Geuens, & Schillewaert (2010). The stability of individual response styles. Psychological Methods, 15(1), 96–110. https://doi.org/10.1037/a0018721

82. Zopluoglu (2012). EstCRM: An R package for Samejima’s continuous IRT model. Applied Psychological Measurement, 36(2), 149–150. https://doi.org/10.1177/0146621612436599

83. Zopluoglu (2013). A comparison of two estimation algorithms for Samejima’s continuous IRT model. Behavior Research Methods, 45(1), 54–64. https://doi.org/10.3758/s13428-012-0229-6

84. Zopluoglu (2020). A finite mixture item response theory model for continuous measurement outcomes. Educational and Psychological Measurement, 80(2), 346–364. https://doi.org/10.1177/0013164419856663

85. Zopluoglu (2022). EstCRM: Calibrating parameters for the Samejima's continuous IRT Model. R package version 1.6, https://CRAN.R-project.org/package=EstCRM.
