Abstract
As algorithmic decision-making is increasingly deployed in every walk of life, many researchers have raised concerns about fairness-related bias in such algorithms. Yet there is little research on harnessing psychometric methods to uncover potential discriminatory bias inside decision-making algorithms. The main goal of this article is to propose a new framework for algorithmic fairness based on differential item functioning (DIF), which has been commonly used to measure item fairness in psychometrics. Our fairness notion, which we call differential algorithmic functioning (DAF), is defined based on three pieces of information: a decision variable, a "fair" variable, and a protected variable such as race or gender. Under the DAF framework, an algorithm can exhibit uniform DAF, nonuniform DAF, or neither (i.e., non-DAF). For detecting DAF, we provide modifications of well-established DIF methods: the Mantel–Haenszel test, logistic regression, and residual-based DIF. We demonstrate our framework through a real dataset concerning decision-making algorithms for grade retention in K–12 education in the United States.
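To make the logistic-regression approach concrete, the sketch below illustrates how DIF-style likelihood-ratio tests could be adapted to the DAF setting: a decision variable is regressed on the "fair" variable, the protected-group indicator, and their interaction, and nested models are compared. A significant group main effect suggests uniform DAF; a significant group-by-fair interaction suggests nonuniform DAF. This is an illustrative reimplementation under assumed variable names (`fair`, `group`, `decision`), not the authors' code or exact specification.

```python
import numpy as np

def fit_logistic(X, y, iters=50):
    """Fit a logistic regression by Newton-Raphson.

    X: (n, k) predictor matrix (intercept is added internally).
    Returns (coefficients, maximized log-likelihood).
    """
    X = np.column_stack([np.ones(len(y)), X])
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        z = np.clip(X @ b, -30, 30)          # guard against overflow
        p = 1.0 / (1.0 + np.exp(-z))
        W = p * (1.0 - p)
        H = X.T @ (X * W[:, None]) + 1e-8 * np.eye(X.shape[1])
        b = b + np.linalg.solve(H, X.T @ (y - p))
    z = np.clip(X @ b, -30, 30)
    p = 1.0 / (1.0 + np.exp(-z))
    ll = np.sum(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    return b, ll

def daf_logistic(fair, group, decision):
    """Likelihood-ratio chi-square statistics (1 df each) for
    uniform DAF (group main effect) and nonuniform DAF
    (group-by-fair interaction), in the spirit of logistic-regression DIF."""
    _, ll0 = fit_logistic(np.column_stack([fair]), decision)
    _, ll1 = fit_logistic(np.column_stack([fair, group]), decision)
    _, ll2 = fit_logistic(np.column_stack([fair, group, fair * group]), decision)
    return 2.0 * (ll1 - ll0), 2.0 * (ll2 - ll1)

# Simulated example: the decision depends on the fair variable but is
# also shifted down for one group by a constant amount -- uniform DAF.
rng = np.random.default_rng(0)
n = 2000
fair = rng.normal(size=n)
group = rng.integers(0, 2, size=n).astype(float)
logit = 0.8 * fair - 0.7 * group
decision = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(float)

chi_uniform, chi_nonuniform = daf_logistic(fair, group, decision)
print(chi_uniform, chi_nonuniform)  # compare each to the chi-square(1) cutoff 3.84
```

Comparing nested models by likelihood ratio, rather than inspecting a single fitted coefficient, mirrors the standard Swaminathan–Rogers logistic-regression DIF procedure that the abstract's method builds on.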
