Abstract
This study investigates ChatGPT’s performance as an Automated Writing Evaluation (AWE) system by comparing its scoring with that of human raters and examining learners’ perceptions of its feedback. Six ChatGPT models were developed using different prompt configurations. Sixty English writing samples produced by Korean university English as a Foreign Language (EFL) learners were evaluated by two human raters and the six models. A multifaceted Rasch model, Spearman’s correlation, and intraclass correlation were used to examine reliability, severity, and bias. Learners’ perspectives on the models’ feedback were collected through open-ended surveys and analyzed thematically. The results indicate that prompt design plays a central role in shaping ChatGPT’s scoring behavior. Prompts combining Chain-of-Thought reasoning with Fill-in-the-blank scaffolding were associated with higher scoring consistency, while predefined personas and few-shot exemplars tended to moderate scoring severity. However, no stable patterns were observed for either bias or rating scale use, suggesting that prompt design alone cannot fully control domain-level bias. In particular, reasoning-intensive writing domains showed substantial divergence from human judgment, highlighting the need for human oversight. In parallel, learners generally viewed ChatGPT’s feedback positively, while also noting areas for improvement. Overall, the study demonstrates the potential of prompt-calibrated ChatGPT-based AWE as a supplementary tool for writing assessment and instruction.
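Among the agreement measures named above, Spearman's correlation and intraclass correlation are straightforward to reproduce. Purely as an illustration (this is not the study's analysis code, and the score matrix below is invented), the sketch shows how agreement between human raters and one prompt-configured ChatGPT model might be computed, assuming scores are arranged in an essays-by-raters matrix.

```python
# Illustrative sketch only: Spearman's rank correlation and ICC(2,1)
# between human and model scores. The score matrix is hypothetical.
import numpy as np
from scipy.stats import spearmanr

# rows = writing samples, columns = raters (human_1, human_2, one ChatGPT model)
scores = np.array([
    [4, 5, 4],
    [3, 3, 2],
    [5, 5, 5],
    [2, 3, 3],
    [4, 4, 5],
], dtype=float)

# Spearman's rho between the first human rater and the model column
rho, p = spearmanr(scores[:, 0], scores[:, 2])
print(f"Spearman rho (human 1 vs. model): {rho:.2f} (p = {p:.3f})")

def icc_2_1(x: np.ndarray) -> float:
    """Two-way random, single-measure ICC(2,1) for an (n subjects x k raters) matrix."""
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)
    col_means = x.mean(axis=0)
    ss_total = ((x - grand) ** 2).sum()
    ss_rows = k * ((row_means - grand) ** 2).sum()   # between-essay variance
    ss_cols = n * ((col_means - grand) ** 2).sum()   # between-rater variance (severity)
    ss_error = ss_total - ss_rows - ss_cols
    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))
    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )

print(f"ICC(2,1) across all three raters: {icc_2_1(scores):.2f}")
```

The many-facet Rasch analysis itself requires dedicated software (e.g., FACETS or an R package) and is not sketched here.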
