Abstract
With the rapid development of generative artificial intelligence (AI) frameworks (e.g., the generative pre-trained transformer [GPT]), a growing number of researchers have started to explore their potential as automated essay scoring (AES) systems. While previous studies have investigated the alignment between human ratings and GPT ratings, few have examined potential biases in the ratings produced by GPT. Addressing this critical aspect of GPT's quality as an AES tool, the present study explored the extent to which GPT can provide fair ratings across writers who belong to different gender, race/ethnicity, and socioeconomic status groups. The study capitalized on the English Language Learner Insight, Proficiency and Skills Evaluation (ELLIPSE) corpus, which contains 6,482 essays rated by 27 human raters. Additional ratings were collected by asking GPT-4o to rate these essays. The data were analyzed using a many-facet Rasch measurement approach. Results indicated that GPT-4o exhibited no substantial bias regarding gender or socioeconomic status. However, GPT-4o demonstrated significant bias regarding race/ethnicity, assigning unexpectedly higher scores to essays written by the Asian/Pacific Islander group and lower scores to essays written by the Hispanic/Latino group. These findings underscore the need for cautious and critical use of GPT as an AES tool in view of fairness.