Reliability Estimates for IRT-Based Forced-Choice Assessment Scores

Abstract

Forced-choice (FC) assessments of noncognitive psychological constructs (e.g., personality, behavioral tendencies) are popular in high-stakes organizational testing scenarios (e.g., informing hiring decisions) because they are more resistant to response distortions (e.g., faking good, impression management). The measurement precision of FC assessment scores used to inform personnel decisions is of paramount importance in practice. Current publications report different types of reliability estimates for FC assessment scores, and consensus on best practices appears to be lacking. To bring understanding and structure to the reporting of FC reliability, this study systematically examined different types of reliability estimation methods for Thurstonian IRT-based FC assessment scores: their theoretical differences were discussed, and their numerical differences were illustrated through a series of simulations and empirical studies. In doing so, this study provides a practical guide for appraising different reliability estimation methods for IRT-based FC assessment scores.

Keywords

Thurstonian IRT, forced choice, reliability, test-retest
