Abstract
Today’s language models can produce syntactically accurate and semantically coherent text. This capability presents new opportunities for generating content for language assessments, which has traditionally required intensive expert resources. However, these models are also known to generate biased text, leading to representational harms. Therefore, to use language models for language assessment content generation, it is crucial to address this bias so that all test takers have a fair and beneficial assessment experience. This paper proposes a novel method for generating language assessment content free from representational harms. Specifically, the method eliminates any systematic relationship between demographic groups and their attributes through a two-step process. Two case studies were conducted to illustrate and evaluate the method’s effectiveness. In both studies, the method produced language assessment content comparable to the respective targets and systematically prevented representational harms.