Detecting DIF in Multidimensional Forced Choice Measures Using the Thurstonian Item Response Theory Model

Abstract

Although modern item response theory (IRT) methods of test construction and scoring have overcome the ipsativity problems historically associated with multidimensional forced choice (MFC) formats, there has been little research on detecting differential item functioning (DIF) in MFC measures, where "item" refers to a block, or group, of statements presented for an examinee’s consideration. This research investigated DIF detection with three-alternative MFC items based on the Thurstonian IRT (TIRT) model, using omnibus Wald tests on loadings and thresholds. In an extensive Monte Carlo study, we examined constrained and free baseline model comparison strategies across different types and magnitudes of DIF, latent trait correlations, sample sizes, and levels of impact. Results indicated that the free baseline strategy was highly effective in detecting DIF, with power approaching 1.0 in the large sample size and large DIF magnitude conditions, and similar effectiveness in the impact and no-impact conditions. This research also included an empirical example to demonstrate the viability of the best performing method with real examinees and showed how a DIF and a differential test functioning (DTF) effect size measure can be used to assess the practical significance of MFC DIF findings.
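As a rough illustration of the kind of omnibus Wald test mentioned above, the following Python sketch computes the statistic W = d'Σ⁻¹d for a vector d of between-group parameter differences and refers it to a chi-square distribution with as many degrees of freedom as d has elements. This is a generic textbook form, not the authors' procedure or software. The function names (`solve_linear`, `chi2_sf`, `wald_test`), the six-element difference vector (three loadings and three thresholds for one block), and the covariance values are all hypothetical and chosen only for demonstration.

```python
import math


def solve_linear(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [list(row) + [b[i]] for i, row in enumerate(a)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def chi2_sf(x, df):
    """Upper-tail chi-square probability for a positive integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 0:
        # Even df: exp(-x/2) * sum_{i < df/2} (x/2)^i / i!
        term, total = 1.0, 1.0
        for i in range(1, df // 2):
            term *= half / i
            total += term
        return math.exp(-half) * total
    # Odd df: erfc term plus a finite series in odd half-powers of x
    term, total = math.sqrt(x), 0.0
    for r in range(1, (df - 1) // 2 + 1):
        if r > 1:
            term *= x / (2 * r - 1)
        total += term
    return math.erfc(math.sqrt(half)) + math.sqrt(2.0 / math.pi) * math.exp(-half) * total


def wald_test(diff, cov):
    """Return (W, df, p) for H0: all differences in `diff` are zero.

    diff: parameter differences between groups (e.g., focal minus reference).
    cov:  covariance matrix of those differences (assumed known/estimated).
    """
    z = solve_linear(cov, diff)                 # z = cov^{-1} diff
    w = sum(d * zi for d, zi in zip(diff, z))   # W = diff' cov^{-1} diff
    df = len(diff)
    return w, df, chi2_sf(w, df)


# Hypothetical differences for one block: three loadings, then three thresholds.
diff = [0.30, -0.05, 0.10, 0.25, -0.20, 0.02]
# Hypothetical covariance matrix: independent estimates, each with variance 0.01.
cov = [[0.01 if i == j else 0.0 for j in range(6)] for i in range(6)]

w, df, p = wald_test(diff, cov)
print(f"W = {w:.2f}, df = {df}, p = {p:.4f}")
```

In practice the covariance matrix would come from the fitted multiple-group model and would generally not be diagonal; the diagonal matrix here only keeps the arithmetic easy to follow.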

Keywords

noncognitive testing, multidimensional forced choice measures, measurement invariance, differential item functioning, Thurstonian item response theory, Monte Carlo simulation


