Multidimensional DIF Analyses: The Effects of Matching on Unidimensional Subtest Scores

Abstract

Popular techniques for assessing differential item functioning (DIF) assume that the test under study is unidimensional. When this assumption is tenable, the number-correct score is a reasonable matching criterion. When a test is intentionally multidimensional, matching on a single test score does not ensure comparability and may result in inflated error rates. An alternative approach is to match on all relevant traits simultaneously, using a procedure such as logistic regression. In this study, data were generated to simulate two-dimensional tests. The dimensional structure of the tests, the discrimination levels of the items, and the correlation between the traits measured by the test were varied. Standard DIF analyses were conducted using total test score as the matching variable. High false-positive error rates were found. Items were divided into subtests using nonlinear factor analysis, and DIF analyses were repeated with subtest scores as the matching criteria. False-positive error rates were reduced for most datasets. The dimensional structure of the test and the discrimination level of the items influenced false-positive rates for both sets of DIF analyses. The findings suggest that assessing the dimensional structure of a test can be an important first step in DIF analysis. If a dataset is intentionally multidimensional, conditioning on scores reflecting each dimension can enhance the validity of the analyses.
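
The contrast between matching on total score and matching on a subtest score can be illustrated with a small simulation. The sketch below is not the study's design; it is a minimal illustration under assumptions that the abstract does not state: the Mantel-Haenszel common odds ratio stands in for the "standard DIF analysis", items have simple structure with a known assignment to dimensions (rather than one recovered by nonlinear factor analysis), all discriminations equal 1, and the groups differ only in their mean on the second trait. The sample size, trait correlation, and difficulty range are likewise hypothetical. No item contains DIF, so any log odds ratio far from zero reflects the matching criterion rather than the item.

```python
import numpy as np

def mh_log_odds_ratio(item, group, match):
    """Mantel-Haenszel common log odds ratio (reference vs. focal),
    stratified on the matching score `match`."""
    num = den = 0.0
    for s in np.unique(match):
        stratum = match == s
        ref = stratum & (group == 0)
        foc = stratum & (group == 1)
        a = np.sum(item[ref] == 1)  # reference, correct
        b = np.sum(item[ref] == 0)  # reference, incorrect
        c = np.sum(item[foc] == 1)  # focal, correct
        d = np.sum(item[foc] == 0)  # focal, incorrect
        n = a + b + c + d
        num += a * d / n
        den += b * c / n
    return float(np.log(num / den))

# --- Hypothetical simulation settings (assumptions, not the study's values) ---
rng = np.random.default_rng(0)
n_per_group, n_items_per_dim, rho = 5000, 20, 0.3
cov = np.array([[1.0, rho], [rho, 1.0]])

# Groups have equal means on trait 1 and differ by one SD on trait 2.
theta_ref = rng.multivariate_normal([0.0, 0.0], cov, size=n_per_group)
theta_foc = rng.multivariate_normal([0.0, -1.0], cov, size=n_per_group)
theta = np.vstack([theta_ref, theta_foc])
group = np.repeat([0, 1], n_per_group)  # 0 = reference, 1 = focal

# Simple structure: first 20 items measure trait 1, last 20 measure trait 2.
discrimination = np.zeros((2 * n_items_per_dim, 2))
discrimination[:n_items_per_dim, 0] = 1.0
discrimination[n_items_per_dim:, 1] = 1.0
difficulty = rng.uniform(-1.5, 1.5, size=2 * n_items_per_dim)

# Compensatory logistic model; item parameters are identical across groups.
logit = theta @ discrimination.T - difficulty
prob = 1.0 / (1.0 + np.exp(-logit))
responses = (rng.random(prob.shape) < prob).astype(int)

studied_item = responses[:, -1]                          # measures trait 2
total_score = responses.sum(axis=1)                      # mixes both traits
subtest_score = responses[:, n_items_per_dim:].sum(axis=1)  # trait 2 only

lor_total = mh_log_odds_ratio(studied_item, group, total_score)
lor_subtest = mh_log_odds_ratio(studied_item, group, subtest_score)
print(f"MH log odds ratio, total-score matching:   {lor_total:.3f}")
print(f"MH log odds ratio, subtest-score matching: {lor_subtest:.3f}")
```

Under these assumptions, examinees with the same total score are not matched on the second trait, so the DIF-free item appears to favor the reference group; conditioning on the subtest that measures that trait should bring the estimate much closer to zero.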
