Abstract
In equating practice, outliers among the anchor items can degrade equating accuracy and threaten the validity of test scores. The stability of anchor item performance should therefore be evaluated before equating is conducted. This simulation study investigated the performance of the t-test method in detecting outliers and compared it with other outlier detection methods: the logit difference method with cutoff values of 0.5 and 0.3, and the robust z statistic with a cutoff value of 2.7. The investigated factors included sample size, proportion of outliers, direction of item difficulty drift, and group difference. Across all simulated conditions, the t-test method outperformed the other methods in sensitivity for flagging true outliers, bias of the estimated translation constant, and root mean square error of examinee ability estimates.
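Two of the comparison methods named in the abstract can be sketched directly from their cutoff rules. The following is a minimal illustration, not the study's implementation: it assumes the inputs are separately calibrated anchor item difficulty estimates on the old and new forms (the abstract does not specify the calibration details), and the `flag_anchor_outliers` function name is hypothetical.

```python
import numpy as np

def flag_anchor_outliers(b_old, b_new, logit_cut=0.5, robust_z_cut=2.7):
    """Flag drifting anchor items by two rules the abstract compares.

    b_old, b_new: anchor item difficulty estimates (in logits) from the
    old and new test forms. Returns boolean flags from each rule.
    """
    d = np.asarray(b_new, dtype=float) - np.asarray(b_old, dtype=float)

    # Logit difference method: flag items whose difficulty shifts by
    # more than the cutoff (0.5 or 0.3 logits in the study).
    logit_flags = np.abs(d) > logit_cut

    # Robust z statistic: standardize the differences with the median
    # and interquartile range (0.74 * IQR approximates the SD for
    # normal data), then flag |z| > 2.7.
    q1, q3 = np.percentile(d, [25, 75])
    robust_z = (d - np.median(d)) / (0.74 * (q3 - q1))
    z_flags = np.abs(robust_z) > robust_z_cut

    return logit_flags, z_flags
```

For example, an anchor set in which one item drifts by a full logit while the rest move only by small amounts would be flagged by both rules, since the 1.0-logit shift exceeds the 0.5 cutoff and stands far outside the robustly standardized distribution of differences.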
