Sources of Error in IRT Trait Estimation

Abstract

In item response theory (IRT), item response probabilities are a function of item characteristics and latent trait scores. Within an IRT framework, trait score misestimation can result from (a) random error, (b) the trait score estimation method, (c) errors in item parameter estimation, and (d) model misspecification. This study investigated the relative effects of these error sources on the bias and confidence interval coverage rates of trait score estimates. Overall, bias was close to 0, and coverage rates were fairly accurate for trait scores near the center of the latent trait continuum and for estimation methods that did not use a strong Bayesian prior. However, certain types of model misspecification produced severely biased trait estimates with poor coverage rates, especially at the extremes of the latent trait continuum. We demonstrate that biased trait estimates result from estimated item response functions (IRFs) that exhibit systematic conditional bias, and that these conditionally biased IRFs may not be detected by model or item fit indices. One consequence is that certain types of model misspecification can lead to estimated trait scores that are nonlinearly related to the data-generating latent trait. Implications for item and trait score estimation and interpretation are discussed.
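Three of the error sources named above (random error, the estimation method, and model misspecification) can be made concrete with a small simulation. The Python/NumPy sketch below is a hypothetical illustration, not the study's design: it assumes a 2PL scoring model with known item parameters, 30 items with discrimination 1.5 and evenly spaced difficulties, expected a posteriori (EAP) scoring under a N(0, 1) prior, and, as one example of misspecification, generating data from an IRF with an upper asymptote of 0.85. The function names and all settings are illustrative assumptions, and item parameter estimation error is not modeled.

```python
import numpy as np


def irf(theta, a, b, d=1.0):
    """Probability of a keyed response. With d = 1 this is the 2PL IRF;
    d < 1 adds an upper asymptote (a lower asymptote of 0 is assumed)."""
    return d / (1.0 + np.exp(-a * (theta - b)))


def eap_estimate(x, a, b, n_quad=61):
    """EAP trait estimate and posterior SD for one response vector x,
    scored with the 2PL under a N(0, 1) prior on a grid over [-4, 4]."""
    nodes = np.linspace(-4.0, 4.0, n_quad)
    prior = np.exp(-0.5 * nodes**2)  # unnormalized N(0, 1) density
    p = irf(nodes[:, None], a[None, :], b[None, :])  # shape (n_quad, n_items)
    loglik = (x * np.log(p) + (1 - x) * np.log(1 - p)).sum(axis=1)
    post = prior * np.exp(loglik - loglik.max())  # rescaled to avoid underflow
    post /= post.sum()
    mean = np.sum(nodes * post)
    sd = np.sqrt(np.sum((nodes - mean) ** 2 * post))
    return mean, sd


def bias_and_coverage(theta_true, a, b, d_true=1.0, n_reps=500, seed=1):
    """Simulate n_reps response vectors at theta_true from the generating
    IRF, score each with the 2PL EAP, and return (bias, 95% coverage)."""
    rng = np.random.default_rng(seed)
    p_true = irf(theta_true, a, b, d_true)
    estimates, covered = [], []
    for _ in range(n_reps):
        x = (rng.random(a.size) < p_true).astype(float)
        est, sd = eap_estimate(x, a, b)
        estimates.append(est)
        covered.append(abs(est - theta_true) <= 1.96 * sd)
    return np.mean(estimates) - theta_true, np.mean(covered)


# Hypothetical test: 30 items, equal discriminations, evenly spaced difficulties.
a = np.full(30, 1.5)
b = np.linspace(-2.0, 2.0, 30)

for theta in (-2.5, 0.0, 2.5):
    ok_bias, ok_cov = bias_and_coverage(theta, a, b)  # generated and scored with the 2PL
    mis_bias, mis_cov = bias_and_coverage(theta, a, b, d_true=0.85)  # upper asymptote ignored
    print(f"theta={theta:+.1f}  correct: bias={ok_bias:+.3f} cover={ok_cov:.2f}"
          f"  misspecified: bias={mis_bias:+.3f} cover={mis_cov:.2f}")
```

Because both conditions reuse the same seed, each misspecified response vector can only contain fewer keyed responses than its correctly specified counterpart, which isolates the effect of the ignored upper asymptote. Under these assumptions the misspecified condition shows downward bias and poor coverage at high trait levels, while the correctly specified condition stays close to unbiased near the center; EAP shrinkage toward the prior mean also introduces some bias at the extremes.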

Keywords

item response theory, score interpretation, estimation



Supplementary Material

Supplemental material for this article is available online.