Abstract
This study investigates ChatGPT’s performance as an Automated Writing Evaluation (AWE) system by comparing its scoring with that of human raters and examining learners’ perceptions of its feedback. Six ChatGPT models were developed using different prompt configurations. Sixty English writing samples produced by Korean university English as a Foreign Language (EFL) learners were evaluated by two human raters and the six models. A multifaceted Rasch model, Spearman’s correlation, and intraclass correlation were used to examine reliability, severity, and bias. Learners’ perspectives on the models’ feedback were collected through open-ended surveys and analyzed thematically. The results indicate that prompt design plays a central role in shaping ChatGPT’s scoring behavior. Prompts combining Chain-of-Thought reasoning with Fill-in-the-blank scaffolding were associated with higher scoring consistency, while predefined personas and few-shot exemplars tended to moderate scoring severity. However, no stable patterns were observed for either bias or rating scale use, suggesting that prompt design alone cannot fully control domain-level bias. In particular, reasoning-intensive writing domains showed substantial divergence from human judgment, highlighting the need for human oversight. In parallel, learners generally viewed ChatGPT’s feedback positively, while also noting areas for improvement. Overall, the study demonstrates the potential of prompt-calibrated ChatGPT-based AWE as a supplementary tool for writing assessment and instruction.
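Among the agreement measures named above, Spearman's correlation and intraclass correlation are straightforward to reproduce. Purely as an illustration (this is not the study's analysis code, and the score matrix below is invented), the sketch shows how agreement between human raters and one prompt-configured ChatGPT model might be computed, assuming scores are arranged in an essays-by-raters matrix.

```python
# Illustrative sketch only: Spearman's rank correlation and ICC(2,1)
# between human and model scores. The score matrix is hypothetical.
import numpy as np
from scipy.stats import spearmanr

# rows = writing samples, columns = raters (human_1, human_2, one ChatGPT model)
scores = np.array([
    [4, 5, 4],
    [3, 3, 2],
    [5, 5, 5],
    [2, 3, 3],
    [4, 4, 5],
], dtype=float)

# Spearman's rho between the first human rater and the model column
rho, p = spearmanr(scores[:, 0], scores[:, 2])
print(f"Spearman rho (human 1 vs. model): {rho:.2f} (p = {p:.3f})")

def icc_2_1(x: np.ndarray) -> float:
    """Two-way random, single-measure ICC(2,1) for an (n subjects x k raters) matrix."""
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)
    col_means = x.mean(axis=0)
    ss_total = ((x - grand) ** 2).sum()
    ss_rows = k * ((row_means - grand) ** 2).sum()   # between-essay variance
    ss_cols = n * ((col_means - grand) ** 2).sum()   # between-rater variance (severity)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

print(f"ICC(2,1) across all three raters: {icc_2_1(scores):.2f}")
```

The many-facet Rasch analysis itself requires dedicated software (e.g., FACETS or an R package) and is not sketched here.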
