This commentary examines the Duolingo English Test Responsible AI standards and suggests specific ways to evaluate the use of AI for automated scoring.