Abstract
Large language models (LLMs) have recently gained attention in automated writing evaluation (AWE) due to their flexibility, ease of use, and free accessibility. However, most existing studies have relied on standardized rubrics and detailed scoring guidelines to guide model outputs. Recent evidence suggests that LLMs can adapt their scoring behavior through example-based calibration. Building on this insight, the present study examines whether ChatGPT-4o can mirror individual instructors’ evaluative tendencies. The data consisted of 100 previously graded final exam writing samples from Saudi students of English as a second language (ESL), provided by five instructors in a Saudi university’s Bachelor of Arts program. GPT (generative pre-trained transformer) was calibrated on a subset of the instructor-graded writing samples to enhance its alignment with human grading criteria. Subsequent analysis involved the 82 samples not used in calibration. Results revealed a strong, positive, and statistically significant correlation (
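The abstract does not reproduce the calibration procedure itself, but a minimal sketch of example-based calibration of this kind, assuming the OpenAI Python SDK, might look like the following. The model name "gpt-4o", the prompt wording, and the helper names are illustrative assumptions, not the authors' published protocol.

# Hypothetical sketch of example-based calibration: instructor-graded essays
# are embedded as few-shot examples, then held-out essays are scored.
# Model name, prompt wording, and helper names are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def calibration_messages(graded_examples):
    """Turn (essay, score) pairs into few-shot chat messages."""
    messages = [{
        "role": "system",
        "content": ("You are an ESL writing rater. Score each essay as this "
                    "instructor would, on the instructor's own scale. "
                    "Reply with the numeric score only."),
    }]
    for essay, score in graded_examples:
        messages.append({"role": "user", "content": f"Essay:\n{essay}"})
        messages.append({"role": "assistant", "content": str(score)})
    return messages

def score_essay(graded_examples, essay):
    """Score one held-out essay under the few-shot calibration."""
    messages = calibration_messages(graded_examples)
    messages.append({"role": "user", "content": f"Essay:\n{essay}"})
    reply = client.chat.completions.create(model="gpt-4o", messages=messages)
    # Assumes the model complies with the numeric-only instruction.
    return float(reply.choices[0].message.content)

Agreement between the model's scores for the held-out essays and the instructors' original grades could then be quantified with a Pearson correlation, e.g. scipy.stats.pearsonr(instructor_scores, gpt_scores).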
