Abstract
With the rapid development of generative artificial intelligence (AI) frameworks (e.g., the generative pre-trained transformer [GPT]), a growing number of researchers have started to explore their potential as automated essay scoring (AES) systems. While previous studies have investigated the alignment between human ratings and GPT ratings, few have examined potential biases in the ratings produced by GPT. Addressing this critical aspect of GPT's quality as an AES tool, the present study explored the extent to which GPT can provide fair ratings across writers who belong to different gender, race/ethnicity, and socioeconomic status groups. The study capitalized on the English Language Learner Insight, Proficiency and Skills Evaluation (ELLIPSE) corpus, which contains 6,482 essays rated by 27 human raters. Additional ratings were collected by asking GPT-4o to rate these essays. The data were analyzed using a many-facet Rasch measurement approach. Results indicated that GPT-4o exhibited no substantial bias regarding gender or socioeconomic status. However, GPT-4o demonstrated significant bias regarding race/ethnicity, assigning unexpectedly higher scores to essays written by the Asian/Pacific Islander group and lower scores to essays written by the Hispanic/Latino group. These findings underscore the need for cautious and critical use of GPT as an AES tool in view of fairness.