Abstract
Technological innovations have promised a great deal to language teachers over the years, raising expectations but not always delivering the benefits we hope for. The potential of automated writing evaluation systems and generative artificial intelligence tools such as ChatGPT, however, might change this. Their ability to relieve teachers of hours of marking while providing instant local and global written corrective feedback across multiple drafts targeted to student needs and in greater quantities seems too good to ignore. In this short ‘viewpoint’ paper, I explore the main pros and cons of these developments and ask if generative artificial intelligence (GenAI) tools are just robotic marking machines or whether they actually help improve our feedback and the writing skills of our students.
Introduction
Technological innovations have often been greeted with caution by language teachers suspicious of both the hype which tends to surround them and the threats they can pose to the integrity of student work. However, with ever-increasing class sizes and constant admonishments to provide ever more – and more useful – feedback on students’ assignments, teachers might welcome the arrival of generative artificial intelligence (GenAI) applications in classrooms. Their potential to relieve teachers of hours of marking while providing instant local and global written corrective feedback across multiple drafts targeted to student needs and in greater quantities seems too good to ignore. In this short ‘viewpoint’ paper, I want to explore the main pros and cons of these developments, particularly in higher education contexts, and ask if GenAI tools are just robotic marking machines or whether they actually help improve our feedback and the writing skills of our students, whether L1 or L2 learners.
GenAI, Feedback, and Writing Instruction
The benefits of feedback on student written work are now well attested in the literature and need no cheerleading from me. Teachers are encouraged to provide feedback that is timely, personalized, and detailed (Hattie and Timperley, 2007), that encourages student engagement (Carless, 2016), and that contains doable recommendations for improvement (Ferris and Kurzer, 2019). Delivering on these strictures, however, is less straightforward and imposes heavy demands on teachers already burdened with substantial workloads. A recent survey of US teachers, for example, found they spent 9.9 hours per week grading, and that 32% had seriously considered leaving the profession in the past year because of this (Learnosity, n.d.). In the UK, a National Audit Office report highlighted excessive marking workloads as a key reason teachers leave (Hudson, 2025). Nor is delivering such high-quality feedback always practical, particularly in the context of large-scale assessments such as the Test of English as a Foreign Language (TOEFL), massive open online courses (MOOCs), or large class sizes.
Automation has the potential to change all this and, in the last few years, we have seen new digital resources riding to the rescue, promising a new dawn of support for teachers suffering from feedback burnout. There are huge potential benefits to AI's ability to correct and explain language use, offer example sentences and translations (Kohnke, 2024), and scaffold and review students’ argumentative writing (Su et al., 2023). Recent developments have produced tools which can create automatic translations, error corrections, and automated scoring systems. It is now relatively straightforward for teachers to use these programmes to give feedback on student texts through text exemplars for specific content and linguistic needs, to generate test items, and to foster learner autonomy through user inquiries, content creation, and feedback with metalinguistic explanations (Godwin-Jones, 2024). More generally, in a recent review of 24 studies, for example, Khalifa and Albadawy (2025) identify six domains where AI helps academic writing and research: (1) facilitating idea generation and research design, (2) improving content and structuring, (3) supporting literature review and synthesis, (4) enhancing data management and analysis, (5) supporting editing, review, and publishing, and (6) assisting in communication, outreach, and ethical compliance.
So far, so good, but there is, as usual, vinegar in the salad oil. Attracting most controversy, of course, is GenAI's apparent ability to instantly produce text in an appropriate register across any genre or discipline through a simple natural language prompt. No teacher is unaware of the risks here. Feedback occurs in a context of instruction, and AI feedback is intimately related to text generation itself. The worry is that students might submit AI-generated texts as their own – and so far it has proved impossible to identify these texts with any certainty (Gao et al., 2023). A recent survey of 3017 high school and college students in the US, for example, found that almost one-third confessed to using ChatGPT for assistance with their homework (Pudasaini et al., 2024). The rise of large language models (LLMs) such as GPT-4, Claude 3.5 Sonnet, and Gemini has therefore led to a surge in academic misconduct and ‘an ongoing technical arms race between detection technologies and evasion tactics’ (Pudasaini et al., 2024).
While studies show that GenAI use is difficult to automatically detect, even with specialist tools such as DetectGPT, RADAR, and GPT-Sentinel, automatically generated texts might not always deliver what is hoped for. Accompanying citations, references, and even content may be factually incorrect (or ‘hallucinated’), and AI writing can seem awkward, impersonal, and shallow. Work I’ve been doing with Kevin Jiang (Jiang & Hyland, 2025a, 2025b, 2025c), for example, shows that ChatGPT produces impressively coherent academic texts. However, to do so it uses a narrower, more repetitive range of lexical bundles, significantly fewer epistemic and attitudinal stance markers – particularly questions and personal asides – and exhibits far less authorial presence in its essays compared with student writers. Research also points to the negative impact of AI use on critical thinking, authorship, and academic integrity (Crompton et al., 2024), which may deprive students of learning opportunities (Barrot, 2023).
So not everything in the AI garden is rosy. GenAI tools, of course, are not human beings. They have only limited topic comprehension, a restricted contextual awareness, an inability to critically assess information, and a deficiency in higher-order thinking skills. But while they are still far from capturing the subtleties of human writing, LLMs have considerable potential to deliver personalized feedback at scale.
Automating Writing Evaluation
We arrive at this point after several years of seeing improvements in automated writing evaluation (AWE), a feature which emerged at the turn of the century to provide students with instant scoring and corrective feedback (Warschauer and Ware, 2006). While early versions were criticized for their over-reliance on surface-level corrections, neglect of rhetorical features, and limited feedback specificity (e.g., Ranalli et al., 2017), recent renderings show much more promise.
Overall, studies show that AWE has a positive effect on writing development, although with some reservations (see Zhai and Ma, 2023, for a meta-analysis). AWE can encourage L2 students to improve the quality of their L2 drafts, with reduced errors, longer texts, and higher scores (e.g. Zhang and Hyland, 2018). Learners see the process as empowering as they have control over their revising and gain the confidence of submitting error-free (or reduced error) work to a teacher. Zhang and Hyland (2018) found that students saw the opportunity to revise an essay multiple times at their own pace without the need to wait as a major advantage of AWE feedback, while other studies have shown that it can promote multi-drafting and learner autonomy (Chen and Cheng, 2008). However, Stevenson and Phakiti (2019) discovered that while L2 students using AWE reduced their errors across drafts of the same assignment, this learning did not transfer well across tasks.
Teachers who encourage their students to use these tools will know that they offer some relief from the drudgery of mundane grammar correction. Principally this is because students can submit an assignment to the programme as many times as they like to improve their score before the teacher sees it. Once a student has reached a threshold score set by the teacher, the teacher can then read the corrected draft without struggling through mechanical errors, perhaps augmenting this with both teacher and peer feedback. Students sometimes receive AWE feedback together with teacher comments, either separately (Zhang and Hyland, 2018) or inserted through the AWE system (Grimes and Warschauer, 2010). When students use AWE first in this way it allows teachers to spend more time on organization, content, and critical thinking issues (Wilson and Czik, 2016). Overall, it seems that positive student and teacher perceptions are greater where the software is used regularly for pre-writing and drafting. In fact, the success of AWE may well depend on how it is integrated into L2 classrooms.
Teachers, then, are not completely off the hook when it comes to marking student writing. AWE systems seem to be particularly effective when they are used in conjunction with teacher and peer feedback. One line of research has dichotomized teacher and peer feedback (e.g., Murillo-Zamorano and Montanero, 2018), or opposed teacher feedback with computer-generated feedback (e.g., Dikli and Bleyle, 2014). But these dichotomies fail to reflect the realities of student learning and ignore the fact that in real classrooms students often have access to more than one type of feedback. Combining AWE with teacher feedback seems to yield greater improvements in writing performance (Han and Sari, 2024), with teachers offering more substantive, higher-order feedback (Wilson and Czik, 2016). Han and Li (2024), for example, asked over 100 students to complete two writing tasks with corrective and holistic feedback provided by ChatGPT and later modified by teachers; the students incorporated more of this co-produced feedback into their subsequent revisions.
A key aspect of the process is
GenAI: The New Kid on the Block
Feedback-capable GenAI systems have only been around since late 2022 with the emergence of more flexible, general-purpose LLMs such as ChatGPT and Gemini. GenAI represents a paradigm shift in both capability and pedagogical potential with the ability to identify infelicities ranging from spelling and punctuation (e.g., Fokides and Peristeraki, 2024) to language and content (e.g., Meyer et al., 2024). Based on different technological principles, involving transformers and neural nets rather than statistical models, GenAI promises to go beyond feedback on spelling and grammar to include coherence, tone, content, and critical thinking. It creates more personalized and interactive responses than AWE programmes and can mimic Socratic feedback which can guide revision, offer models, and explain reasoning. Ideally, then, GenAI programmes’ ability to create text and grasp context offers a promising basis for feedback adapted to different tasks. The fact that they can follow simple prompts enables teachers to integrate rules in the feedback and so bridge students’ need for personalized support and teachers’ involvement in the process.
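To make the point concrete, integrating teachers’ rules into the feedback process can be as simple as embedding them in the prompt sent to the model. The sketch below is purely illustrative – the function, rubric items, and sample essay are hypothetical, not drawn from any study cited here – and shows only the prompt-construction step, not any particular provider’s API:

```python
def build_feedback_prompt(student_text: str, teacher_rules: list[str]) -> str:
    """Compose an LLM prompt that embeds a teacher's feedback preferences.

    The rules are injected as explicit instructions so the model's
    feedback reflects the teacher's priorities rather than generic advice.
    """
    rules = "\n".join(f"- {rule}" for rule in teacher_rules)
    return (
        "You are a writing tutor. Give formative feedback on the essay below.\n"
        "Follow these teacher-set rules:\n"
        f"{rules}\n\n"
        f"Essay:\n{student_text}"
    )

# Hypothetical usage: a teacher prioritizes higher-order issues and
# caps the amount of feedback to avoid overwhelming the student.
prompt = build_feedback_prompt(
    "Climate change are a big problem for all country...",
    [
        "Comment on argument structure before grammar",
        "Suggest improvements; do not rewrite the student's sentences",
        "Limit feedback to three points",
    ],
)
```

The resulting string would then be passed to whichever LLM the institution uses; the design choice is simply that the teacher, not the tool, sets the feedback agenda.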
These are early days, but research on using LLMs for automated feedback has so far been encouraging. Guo and Wang (2024), for example, found that ChatGPT generated a greater quantity of feedback than EFL teachers and that this was more balanced across content, organization, and language. Banihashem et al. (2024) reported that ChatGPT provided more detailed feedback on argument structure than peers, and Wang et al. (2024) found that the tool offered more comprehensive feedback on students’ argumentative texts. Users may also respond more positively to LLM-generated feedback than to that given by a teacher (Wan & Chen, 2024) or peers (Zeevy-Solovey, 2024). Chinese students in Li et al.'s (2024) study, for example, rated ChatGPT-4's written feedback as more relevant to their specific needs than teachers’ more general comments, and as comprehensive, addressing content, organization, and language-related issues.
Before launching a celebratory fireworks display, though, we need to add a note of caution. The ability of digital tools to deliver useful feedback obviously depends on their effectiveness in analysing texts, but this is far from assured. In a recent paper, for example, Curry et al. (2024) found that ChatGPT-4 did a poor job of categorizing keywords in specialized texts, made false inferences about concordance lines, and failed to identify and analyse direct and indirect questions, making function-to-form analysis problematic. Worse, Yoon et al.'s (2023) study discovered that most feedback sentences generated by ChatGPT were highly abstract and generic, failing to provide concrete suggestions for improvement. Moreover, its accuracy in detecting major problems, such as repetitive ideas and the inaccurate use of cohesive devices, depended on superficial linguistic features and was often poor.
There also seem to be significant differences between the scores given by human raters and those generated by GenAI when grading texts, with ChatGPT being a significantly tougher, although more consistent, marker than teachers (Topuz et al., 2025). Another problem is that much of the research has been conducted outside of realistic learning contexts in experimental situations, raising questions about its usefulness to classroom teachers (e.g., Steiss et al., 2024). We might also want to rethink the reliance on large proprietary LLMs like OpenAI's GPT at the expense of smaller open-source models.
So Where Do We Go from Here?
Overall, then, it is now becoming clear that automation is not (yet?) the answer to our prayers and that we cannot just assume that automated marking means automatic improvements in student writing. We have to consider a range of complex factors such as learner engagement, digital literacy, and the role of teachers in the process. Students generally express a desire for richer feedback that includes different modes (Henderson et al., 2021), and there are question marks over the ability of general-purpose LLMs to generate reliable feedback in fields where knowledge of discipline-specific rhetorical conventions is needed (Capellini et al., 2024). In fact, GPT's ready-to-use out-of-the-box chat version, which relies on prompting to shape feedback, may not be the most effective use of AI at all. Research is starting to show that fine-tuning the model may lead to better results (Mazzula and Bullet, 2024). The fact that AI models are able to follow instructions means that they can be programmed to integrate teachers’ preferences in the feedback and thus connect students’ needs for individual support with teachers’ involvement in the process. This kind of fine-tuning, however, requires technical expertise which teachers often lack, running the risk of sidelining our direct involvement as teachers and ceding it to specialist techies.
More problematic, I think, is the key assumption underlying a lot of this triumphant cheerleading for GenAI. It seems to me that much current work sees feedback as a somewhat mechanical process aimed simply at improving student texts rather than encouraging the human activity of learning. What seems crucial, at this stage of tech-assisted feedback, is the need to move away from what has been something of an obsession with improving texts to how we can improve writers. While considerable attention has been devoted to the quality of feedback generated by AWE and GenAI, their future capacity to reform feedback processes is also significant. Technology-enabled feedback potentially allows students, in collaboration with teachers, to become more actively involved in their feedback and how they use it while gaining greater feedback awareness and digital literacy skills. How, in other words, can GenAI be leveraged in the service of developing writers rather than drafts?
There is a tendency in all of this to focus attention on the tools rather than the learners themselves and the skills they need to engage with automated feedback effectively. Navigating AI tools requires an additional skill set that may not be intuitive, especially for students with limited digital literacy. This poses a growing challenge, aggravated by language barriers and the widening digital divide (Warschauer et al., 2023). Recent research suggests that many students miss the ‘human touch’ that teacher-provided feedback offers and stresses the important role of people in the process (Han and Li, 2024; Teng, 2024). At its most effective, formative feedback is more than information transmission: simply providing students with advice on their texts. Teachers recognize that feedback is a dialogue between students and teachers designed to encourage reflection and growth. This understanding of feedback as a social practice requires a greater role for teachers in the process, making it crucial that we leverage AWE and GenAI to create more interactive and collaborative feedback loops rather than static, one-way advice.
I leave the final word to GenAI itself: ‘ChatGPT should be used as a starting point, not a final arbiter of writing quality’ (ChatGPT, 2025).
