How Item Residual Heterogeneity Affects Tests for Differential Item Functioning

Abstract

Differential item functioning (DIF) occurs when people with the same proficiency have different probabilities of giving a certain response to an item. The present study focused on item residual homogeneity, an assumption that is implicit in popular methods for DIF testing and has received little attention in the published literature. The assumption is explained, a strategy for detecting violations of it (i.e., item residual heterogeneity) is illustrated with empirical data, and simulations are carried out to evaluate the performance of binary logistic regression, two-group item response theory (IRT), and the Mantel–Haenszel (MH) test in the presence of item residual heterogeneity. Results indicated that heterogeneity inflated Type I error and attenuated power for logistic regression, and attenuated power and produced biased estimates of the latent focal group mean and standard deviation for two-group IRT. The MH test was robust to item residual heterogeneity, probably because it does not use the logistic function.
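To make the assumption concrete, the following is a minimal sketch, not the authors' simulation design. It assumes a latent-response formulation in which an underlying response y* = slope × θ + intercept + scale × e (with e standard logistic) is dichotomized at zero, so that the response probability is the logistic function of (slope × θ + intercept) / scale. The function names and all numeric values are hypothetical and chosen only for illustration. The two groups share the same slope and intercept and differ only in residual scale.

```python
import math


def response_probability(theta, slope, intercept, residual_scale):
    """P(y = 1 | theta) under y* = slope*theta + intercept + residual_scale*e,
    where e is standard logistic and y = 1 when y* > 0 (assumed formulation)."""
    z = (slope * theta + intercept) / residual_scale
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical values: identical structural parameters in both groups,
# with only the residual scale differing (item residual heterogeneity).
SLOPE, INTERCEPT = 1.5, -0.5
RESIDUAL_SCALE = {"reference": 1.0, "focal": 1.5}


def probability_gap(theta):
    """Focal-minus-reference response probability at the same proficiency."""
    p_ref = response_probability(theta, SLOPE, INTERCEPT, RESIDUAL_SCALE["reference"])
    p_foc = response_probability(theta, SLOPE, INTERCEPT, RESIDUAL_SCALE["focal"])
    return p_foc - p_ref


if __name__ == "__main__":
    # Print the gap at several proficiency levels.
    for theta in (-2.0, -1.0, 0.0, 1.0, 2.0):
        print(f"theta={theta:+.1f}  gap={probability_gap(theta):+.4f}")
```

Under these assumptions the gap is nonzero at most proficiency levels and changes sign across the range, even though the slope and intercept are identical in both groups. A method that models the response with a single logistic function per item could therefore register a group difference that stems from the residual scale alone.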

Keywords

differential item functioning, DIF, logistic regression, heterogeneity
