
Abstract

Dorans and Holland (2000) and von Davier, Holland, and Thayer (2003) introduced measures of the degree to which an observed-score equating function is sensitive to the population on which it is computed. This article extends the findings of Dorans and Holland and of von Davier et al. to item response theory (IRT) true-score equating methods that are commonly used in the nonequivalent-groups anchor test (NEAT) design. Using data from the Advanced Placement Program Calculus AB exam, which contains multiple-choice (MC) and free-response (FR) sections, the authors investigate the population sensitivity of the IRT equating functions computed for the MC section only and for the MC and FR sections together. The degree of population sensitivity is also compared across three equating methods: the IRT true-score equating method and two observed-score equating methods, chained equipercentile and Tucker linear equating.
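
As a rough illustration of the kind of population-sensitivity measure the abstract refers to, the sketch below computes a root-mean-square-difference index in the general form associated with Dorans and Holland (2000). At each raw score, the weighted squared differences between the subgroup equating functions and the total-group equating function are summed, square-rooted, and divided by the standard deviation of scores on the reference form. The function names, argument layout, and all numbers are illustrative assumptions, not the authors' code or data, and the exact form of the index used for the IRT true-score case is not given on this page.

```python
import math


def rmsd(subgroup_equated, weights, total_equated, sd_y_total):
    """Pointwise index: one value per raw score x (hypothetical helper).

    subgroup_equated: list of lists, equated scores e_j(x) for each subgroup j
    weights:          subgroup proportions w_j (assumed to sum to 1)
    total_equated:    equated scores e(x) computed on the total group
    sd_y_total:       standard deviation of reference-form scores, total group
    """
    values = []
    for x in range(len(total_equated)):
        weighted_sq = sum(
            w * (e_j[x] - total_equated[x]) ** 2
            for w, e_j in zip(weights, subgroup_equated)
        )
        values.append(math.sqrt(weighted_sq) / sd_y_total)
    return values


def remsd(subgroup_equated, weights, total_equated, sd_y_total, score_probs):
    """Summary index: averages the squared differences over the raw-score
    distribution (score_probs, assumed to sum to 1) before taking the root."""
    expected_sq = 0.0
    for x in range(len(total_equated)):
        weighted_sq = sum(
            w * (e_j[x] - total_equated[x]) ** 2
            for w, e_j in zip(weights, subgroup_equated)
        )
        expected_sq += score_probs[x] * weighted_sq
    return math.sqrt(expected_sq) / sd_y_total


# Toy input (made-up numbers): two equally weighted subgroups, two raw scores.
toy_subgroups = [[11.0, 21.0], [9.0, 19.0]]
toy_total = [10.0, 20.0]
print(rmsd(toy_subgroups, [0.5, 0.5], toy_total, 2.0))                # [0.5, 0.5]
print(remsd(toy_subgroups, [0.5, 0.5], toy_total, 2.0, [0.5, 0.5]))   # 0.5
```

Values near zero indicate that the subgroup equating functions nearly coincide with the total-group function. Dividing by the standard deviation expresses the differences in standard-deviation units, so they can be compared across tests.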

Keywords

Index terms: population sensitivity, observed-score equating, IRT true-score equating, nonequivalent-groups anchor test (NEAT)

References

Braun, H. I., & Holland, P. W. (1982). Observed score test equating: A mathematical analysis of some ETS equating procedures. In P. W. Holland & D. B. Rubin (Eds.), Test equating (pp. 9-49). New York: Academic Press.

Brennan, R. L., & Kolen, M. J. (1987). Some practical issues in equating. Applied Psychological Measurement, 11, 279-290.

Cook, L. L., Dorans, N. J., Eignor, D. R., & Petersen, N. S. (1985). An assessment of the relationship between the assumption of unidimensionality and the quality of IRT true-score equating (ETS Research Rep. No. RR-85-30). Princeton, NJ: Educational Testing Service.

Cook, L. L., & Eignor, D. R. (1991). An NCME instructional module on IRT equating methods. Educational Measurement: Issues and Practice, 10, 37-45.

Cook, L. L., & Petersen, N. S. (1987). Problems related to the use of conventional and item response theory equating methods in less than optimal circumstances. Applied Psychological Measurement, 11, 225-244.

Dorans, N. J., & Holland, P. W. (2000). Population invariance and equatability of tests: Basic theory and the linear case. Journal of Educational Measurement, 37, 281-306.

Dorans, N. J., Holland, P. W., Thayer, D. T., & Tateneni, K. (2003, April). Invariance of score linking across gender groups for three Advanced Placement Program exams. Paper presented at the annual meeting of the National Council on Measurement in Education, New Orleans, LA.

Haebara, T. (1980). Equating logistic ability scales by a weighted least squares method. Japanese Psychological Research, 22, 144-149.

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory. Newbury Park, CA: Sage.

Harris, D. J., & Crouse, J. D. (1993). A study of criteria used in equating. Applied Measurement in Education, 6, 195-240.

Harris, D. J., & Kolen, M. J. (1986). Effect of examinee group on equating relationships. Applied Psychological Measurement, 10, 35-43.

Hattie, J. (1985). Methodology review: Assessing unidimensionality of tests and items. Applied Psychological Measurement, 9, 139-164.

Jodoin, M. G., & Davey, T. (2003, April). A multidimensional simulation approach to investigate the robustness of IRT common item equating. Paper presented at the annual meeting of the American Educational Research Association, Chicago, IL.

Kolen, M. J., & Brennan, R. L. (2004). Test equating, scaling, and linking: Methods and practices (2nd ed.). New York: Springer-Verlag.

Lord, F. M. (1980). Application of item response theory to practical testing problems. Hillsdale, NJ: Lawrence Erlbaum.

Muraki, E. (1992). A generalized partial credit model: Application of an EM algorithm. Applied Psychological Measurement, 16, 159-176.

Petersen, N. S. (2008). A discussion of population invariance of equating. Applied Psychological Measurement, 32, 98-101.

Petersen, N. S., Cook, L. L., & Stocking, M. L. (1983). IRT versus conventional equating methods: A comparative study of scale stability. Journal of Educational Statistics, 8, 137-156.

Petersen, N. S., Kolen, M. J., & Hoover, H. D. (1989). Scaling, norming, and equating. In R. L. Linn (Ed.), Educational measurement (3rd ed., pp. 221-262). New York: Macmillan.

Stocking, M. L., & Lord, F. M. (1983). Developing a common metric in item response theory. Applied Psychological Measurement, 7, 201-210.

Thissen, D., Wainer, H., & Wang, X.-B. (1994). Are tests comprising both multiple-choice and free-response items necessarily less unidimensional than multiple-choice tests? An analysis of two tests. Journal of Educational Measurement, 31, 113-123.

von Davier, A. A. (2003). Notes on linear equating methods for the non-equivalent groups design (ETS Research Rep. No. RR-03-24). Princeton, NJ: Educational Testing Service.

von Davier, A. A., Holland, P. W., & Thayer, D. T. (2003). Population invariance and chain versus post-stratification equating methods. In N. J. Dorans (Ed.), Population invariance of score linking: Theory and applications to Advanced Placement Program examinations (ETS Research Rep. No. RR-03-27, pp. 19-36). Princeton, NJ: Educational Testing Service.

von Davier, A. A., Holland, P. W., & Thayer, D. T. (2004a). The chain and post-stratification methods for observed-score equating and their relationship to population invariance. Journal of Educational Measurement, 41, 15-32.

von Davier, A. A., Holland, P. W., & Thayer, D. T. (2004b). The kernel method of test equating. New York: Springer-Verlag.

von Davier, A. A., & Wilson, C. (2005). A didactic approach to the use of IRT true score equating (ETS Research Rep. No. RR-05-26). Princeton, NJ: Educational Testing Service.

Investigating the Population Sensitivity Assumption of Item Response Theory True-Score Equating Across Two Subgroups of Examinees and Two Test Formats
