Abstract
This study proposes an empirical Bayesian approach with loss functions for testing the fairness of an automated scoring algorithm. The proposed method outperforms traditional approaches by (1) being robust to small samples, (2) incorporating parameter uncertainty, and (3) balancing the loss of keeping versus flagging items. The impact of flagging potentially unfair items on the classification of examinees is investigated. The effectiveness of the proposed method is illustrated through simulations and an application to language assessment data.
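The abstract's core decision rule, balancing the loss of keeping versus flagging an item, can be sketched as a simple Bayes decision. This is an illustrative sketch only, not the paper's exact method: the function name, the loss values, and the posterior probability input are all assumptions introduced here for exposition.

```python
def flag_item(p_unfair, loss_keep_unfair=5.0, loss_flag_fair=1.0):
    """Flag an item when the posterior expected loss of keeping it
    exceeds the expected loss of flagging it.

    p_unfair         -- posterior probability the item is unfair
                        (e.g., from an empirical Bayes model)
    loss_keep_unfair -- loss from keeping an unfair item (assumed value)
    loss_flag_fair   -- loss from wrongly flagging a fair item (assumed value)
    """
    expected_loss_keep = p_unfair * loss_keep_unfair
    expected_loss_flag = (1 - p_unfair) * loss_flag_fair
    return expected_loss_keep > expected_loss_flag
```

Under these assumed losses, an item is flagged once its posterior probability of unfairness passes the break-even point `loss_flag_fair / (loss_keep_unfair + loss_flag_fair)`; raising the cost of a false flag shifts that threshold upward.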
