How Do We Demonstrate AI Responsibility: The Devil Is in the Details

Abstract

This commentary examines the Duolingo English Test Responsible AI standards and provides some thoughts on specific ways we can evaluate the use of AI for automated scoring.

Keywords

differential item functioning, algorithmic bias, error-in-variables regression, fairness

