Abstract
Large language models (LLMs) have recently gained attention in automated writing evaluation (AWE) due to their flexibility, ease of use, and free accessibility. However, most existing studies have relied on standardized rubrics and detailed scoring guidelines to guide model outputs. Recent evidence suggests that LLMs can adapt their scoring behavior through example-based calibration. Building on this insight, the present study examines whether ChatGPT-4o can mirror individual instructors’ evaluative tendencies. The data consisted of 100 previously graded final exam writing samples from Saudi students of English as a second language (ESL), provided by five instructors in a Saudi university’s Bachelor of Arts program. GPT (generative pre-trained transformer) was calibrated on a subset of the instructor-graded writing samples to enhance its alignment with human grading criteria. Subsequent analysis involved the 82 samples not used in calibration. Results revealed a strong, positive, and statistically significant correlation (
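The abstract does not reproduce the calibration procedure itself, but a minimal sketch of example-based calibration of this kind, assuming the OpenAI Python SDK, might look like the following. The model name "gpt-4o", the prompt wording, and the helper names are illustrative assumptions, not the authors' published protocol.

# Hypothetical sketch of example-based calibration: instructor-graded essays
# are embedded as few-shot examples, then held-out essays are scored.
# Model name, prompt wording, and helper names are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def calibration_messages(graded_examples):
    """Turn (essay, score) pairs into few-shot chat messages."""
    messages = [{
        "role": "system",
        "content": ("You are an ESL writing rater. Score each essay as this "
                    "instructor would, on the instructor's own scale. "
                    "Reply with the numeric score only."),
    }]
    for essay, score in graded_examples:
        messages.append({"role": "user", "content": f"Essay:\n{essay}"})
        messages.append({"role": "assistant", "content": str(score)})
    return messages

def score_essay(graded_examples, essay):
    """Score one held-out essay under the few-shot calibration."""
    messages = calibration_messages(graded_examples)
    messages.append({"role": "user", "content": f"Essay:\n{essay}"})
    reply = client.chat.completions.create(model="gpt-4o", messages=messages)
    # Assumes the model complies with the numeric-only instruction.
    return float(reply.choices[0].message.content)

Agreement between the model's scores for the held-out essays and the instructors' original grades could then be quantified with a Pearson correlation, e.g. scipy.stats.pearsonr(instructor_scores, gpt_scores).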
