Abstract

Differential item functioning (DIF) analysis is one of the most important applications of item response theory (IRT) in psychological assessment. This study examined the performance of two Bayesian DIF methods, Bayes factor (BF) and deviance information criterion (DIC), with the generalized graded unfolding model (GGUM). Type I error and power were investigated in a Monte Carlo simulation that manipulated sample size, DIF source, DIF size, DIF location, subpopulation trait distribution, and type of baseline model. For comparison with past DIF research, we also examined the performance of two likelihood-based methods, the likelihood ratio (LR) test and Akaike information criterion (AIC), using marginal maximum likelihood (MML) estimation. The results indicated that the proposed BF and DIC methods provided well-controlled Type I error and high power using a free-baseline model implementation; their performance was superior to LR and AIC in terms of Type I error rates when the reference and focal group trait distributions differed. The implications and recommendations for applied research are discussed.
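For readers unfamiliar with the four criteria named above, the following is a minimal sketch of their textbook definitions, not the estimation code used in the study. All function names and the numeric inputs in the usage note are hypothetical. The sketch assumes the log-likelihoods, deviance draws, and log marginal likelihoods have already been obtained from fitted models; how the study computed these quantities for the GGUM is not stated in the abstract.

```python
import math
from typing import Sequence


def lr_statistic(log_lik_constrained: float, log_lik_free: float) -> float:
    """Likelihood ratio statistic: -2 * (log L_constrained - log L_free).

    Under standard regularity conditions it is compared against a chi-square
    distribution whose degrees of freedom equal the number of freed parameters.
    """
    return -2.0 * (log_lik_constrained - log_lik_free)


def aic(log_lik: float, n_params: int) -> float:
    """Akaike information criterion: -2 * log L + 2 * k (lower is better)."""
    return -2.0 * log_lik + 2.0 * n_params


def dic(deviance_draws: Sequence[float], deviance_at_posterior_mean: float) -> float:
    """Deviance information criterion: D_bar + p_D (lower is better).

    D_bar is the posterior mean deviance over MCMC draws, and
    p_D = D_bar - D(theta_bar) is the effective number of parameters.
    """
    d_bar = sum(deviance_draws) / len(deviance_draws)
    p_d = d_bar - deviance_at_posterior_mean
    return d_bar + p_d


def log_bayes_factor(log_marginal_lik_1: float, log_marginal_lik_0: float) -> float:
    """Log Bayes factor of model 1 over model 0 from log marginal likelihoods."""
    return log_marginal_lik_1 - log_marginal_lik_0


def bayes_factor(log_marginal_lik_1: float, log_marginal_lik_0: float) -> float:
    """Bayes factor on the original scale (may overflow for large differences)."""
    return math.exp(log_bayes_factor(log_marginal_lik_1, log_marginal_lik_0))
```

As a usage illustration with made-up numbers: if a model constraining a studied item's parameters to be equal across groups has log-likelihood -5210.4 with 40 parameters, and the model freeing them has -5204.1 with 43 parameters, then `lr_statistic(-5210.4, -5204.1)` is about 12.6 on 3 degrees of freedom, and the freed model also has the lower AIC, so both criteria would flag the item for DIF.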

Keywords

differential item functioning, Bayes factor, deviance information criterion, ideal point, item response theory

Bayesian Approaches for Detecting Differential Item Functioning Using the Generalized Graded Unfolding Model
