Abstract
Psychological test scores are commonly used in high-stakes settings to classify individuals. While measurement invariance across groups is necessary for valid and meaningful inferences about group differences, full measurement invariance rarely holds in practice. The classification accuracy analysis framework aims to quantify the degree and practical impact of noninvariance. However, how best to navigate the next steps remains unclear, and methods devised to account for noninvariance at the group level may be insufficient when the goal is classification. Furthermore, deleting a biased item may improve fairness but degrade test performance, and replacing the test can be costly. We propose item-level effect size indices that quantify the impact of deleting (or retaining) an item on test performance and fairness, allowing test users to make more informed decisions. We provide an illustrative example and introduce unbiasr, an R package implementing the proposed methods.
