Investigation of IRT-Based Equating Methods in the Presence of Outlier Common Items

Abstract

Common items with inconsistent b-parameter estimates may have a serious impact on item response theory (IRT)-based equating results. To find a better way to deal with the outlier common items with inconsistent b-parameters, the current study investigated the comparability of 10 variations of four IRT-based equating methods (i.e., concurrent calibration, separate calibration with test characteristic curve [TCC] and mean/sigma [M/S] transformations, and calibration with fixed common item parameters [FCIP]) when outliers were either ignored or considered. Simulated data were generated for the common-item nonequivalent groups matrix design to reflect the manipulated factors: group ability differences and nonequivalent groups, number/score points of outliers, and types of outliers. When no outliers were present, the TCC and M/S transformations performed the best. When there were outliers, overall, the methods that considered them (except the M/S transformation with outliers weighted) resulted in a vast improvement compared to the methods that ignored them.

Keywords

item response theory, equating, outliers, calibration, transformation
