Two Statistics for Measuring the Score Comparability of Computerized Adaptive Tests

Abstract

This study introduces two new statistics for measuring the score comparability of computerized adaptive tests (CATs). Both are based on comparing conditional standard errors of measurement (CSEMs) for examinees who achieved the same scale score. One statistic evaluates the score comparability of alternate CAT forms at individual scale scores, while the other evaluates the overall score comparability of alternate CAT forms. The effectiveness of the new statistics is illustrated using data from Grade 3 through 8 reading and math CATs. Results suggest that both CATs demonstrated reasonably high levels of score comparability, that score comparability was lower at very high or very low scores, where few students score, and that using random samples with fewer students per grade had little impact on score comparability. Results also suggest that overall score comparability was sometimes higher when calculated using only the bottom 20% of scorers than when calculated using all students. Additional discussion of applying the statistics in different contexts is provided.

Keywords

computerized adaptive testing, score comparability, conditional standard error of measurement, equity, vertical scales
