Abstract
Artificial intelligence (AI) tools are increasingly used in second language (L2) writing instruction, yet empirical research comparing their longitudinal effects against human feedback remains limited. This longitudinal, mixed-methods quasi-experimental study investigated the impact of AI-based feedback versus teacher-mediated feedback on the grammatical accuracy, syntactic complexity, and writing quality of 60 first-year English Language Teaching students. Over an 8-week study period, developmental trajectories were analyzed using Linear Mixed-Effects Models. Results revealed a significant Group × Time interaction for grammatical accuracy (p < .001), with the AI-Feedback group showing a steeper rate of error reduction. Conversely, the Teacher-Feedback group demonstrated superior growth in discourse-level writing quality (p = .036). Syntactic complexity developed comparably across both conditions (p = .41), suggesting that gains in linguistic maturity were driven by task repetition and cognitive engagement rather than the specific feedback modality. Qualitative findings highlighted a critical tension: while AI tools reduced writing anxiety and facilitated immediate iterative revision, participants expressed skepticism regarding the contextual reliability of automated suggestions, retaining a strong preference for human mediation to address higher-order concerns. These findings advocate for a blended pedagogical approach in which L2 learners leverage AI for surface-level precision while relying on human feedback to scaffold rhetorical and conceptual development.
Plain Language Summary
This study investigated how artificial intelligence (AI) tools, such as Grammarly and ChatGPT, compare to human teachers in helping students improve their English writing. The researcher followed 60 university students in Türkiye over an eight-week period. Half of the students used AI tools for feedback, while the other half received traditional written feedback from their instructors. The results revealed that both methods offer unique benefits. Students in the AI group improved their grammatical accuracy much faster, showing a steeper decline in errors. Because AI provides instant feedback, these learners were able to perform multiple rounds of revisions within a single session. However, the group receiving teacher feedback showed significantly better growth in the overall quality of their writing, specifically in areas like coherence and argumentation. Regarding the complexity of the students’ sentences, both groups improved at a similar rate. This suggests that “linguistic maturity” may be driven more by repeated writing practice than by the specific type of feedback received. The study also explored student perceptions. While AI tools reduced writing anxiety and allowed for immediate corrections, students remained skeptical of the software’s ability to understand context. Consequently, they still preferred human teachers for “higher-order” concerns like rhetorical flow and conceptual development. The researcher concludes that a “blended” approach is most effective and advocates using AI tools to handle surface-level precision (like grammar) while relying on human expertise to guide the deeper, more complex aspects of communication.
Introduction
Over the past decade, AI has increasingly influenced L2 education, supplementing teacher-led instruction with tools capable of responding within seconds (Alharbi, 2023; Chaudhry & Kazim, 2022; X. Chen et al., 2022). These platforms allow learners to interact with writing in ways difficult to replicate in conventional classrooms (Gayed et al., 2022; Link et al., 2022). This trend echoes debates in second language acquisition (SLA) emphasizing how digital tools support personalized, self-paced learning (X. Chen et al., 2021).
L2 writing is cognitively demanding, requiring simultaneous attention to accuracy, coherence, and logic (Ellis, 2008). When feedback is delayed, learners struggle to notice patterns; conversely, AI tools instantly highlight errors and offer alternatives (Fan & Ma, 2022; Q. Wang, 2024). This immediacy aligns with established SLA theories (D’Mello et al., 2014). For instance, as elaborated further in the upcoming sections, AI enhances feature visibility as posited by the Noticing Hypothesis (Schmidt, 1990), supports the repeated practice central to Skill Acquisition Theory (DeKeyser, 2015), and facilitates mediated interaction emphasized by sociocultural perspectives (Jia et al., 2022; Vygotsky, 1978).
Despite accelerated interest in AI-assisted writing, the empirical research base remains uneven. Existing studies frequently report on short-term accuracy gains while neglecting syntactic complexity, or they focus heavily on learner perceptions without linking those attitudes to actual writing performance. Furthermore, current literature predominantly centers on general English as a Foreign Language (EFL) populations rather than specialized cohorts like English Language Education (ELE) students, whose advanced metacognitive skills and emerging professional identities likely shape their interaction with AI differently. To address these gaps, this study employs a longitudinal quasi-experimental design with a 6-week intervention to simultaneously evaluate the impact of AI versus teacher feedback on grammatical accuracy, syntactic complexity, and global writing quality. By situating this investigation within a teacher education program, the research links learner perceptions directly to performance outcomes, elucidating how future educators navigate and negotiate digital tools that are becoming central to their professional environments.
Literature Review
This review of the literature is divided into two parts. The following two sections establish the theoretical framing by situating AI-assisted writing tools within established Instructed Second Language Acquisition (ISLA) frameworks. Building on this foundation, the remaining sections provide an empirical synthesis of recent research, examining how these theoretical affordances translate into measurable writing outcomes and learner perceptions in practice.
Theoretical Perspectives on AI-Assisted Grammar and Writing Development
Recent work situates AI-supported L2 learning within established SLA theories (Hockly, 2024). Skill Acquisition Theory posits that learners move from rule recall to automaticity through practice (DeKeyser, 2015), a process facilitated by the iterative revision cycles AI tools provide (Link et al., 2022). Studies indicate that repeated AI-assisted revision produces texts with greater accuracy and structural complexity (Escalante et al., 2023; Fan & Ma, 2022). Additionally, AI tools can be seen as operationalizing the Noticing Hypothesis (Schmidt, 1990): by drawing visual attention to errors (Alharbi, 2023), they encourage learners to attend to forms they might otherwise ignore (Gayed et al., 2022). Research suggests this visually enhanced feedback improves awareness, retention, and uptake (Ahmed Ali et al., 2025; Konyrova, 2024; Shirinova, 2025).
From a Sociocultural perspective (Vygotsky, 1978), AI serves as a mediational tool within the Zone of Proximal Development, offering scaffolding that mirrors teacher support (Jia et al., 2022; Molenaar, 2022). Learners using AI have been shown to attempt structures beyond their independent ability, with reliance on scaffolding decreasing over time (Annamalai & Bervell, 2025; Jerusha & Rajakumari, 2024). However, the effectiveness of this support depends on clarity (S. Li, 2010) and metacognitive engagement (Lai et al., 2023), as meaningful uptake requires learners to actively interpret information (Jin et al., 2023). Thus, AI tools contribute to cognitive and affective development when pedagogically guided (Annamalai, 2024).
AI, Grammar Instruction, and Attention to Form
Research on grammar instruction within ISLA offers an important foundation for understanding how AI might contribute to L2 learning (Ellis, 2006). Work on form-focused instruction (FFI) has consistently shown that directing learners’ attention to form—whether through explicit explanation or more incidental, task-based interventions—can support both explicit and implicit dimensions of grammatical knowledge (Ellis, 2008). A substantial body of meta-analytic evidence has reinforced these claims (Goo et al., 2015; Kang et al., 2019; S. Li, 2010; F. Li & Sun, 2024; Norris & Ortega, 2000), illustrating that FFI operates effectively across a variety of pedagogical formats rather than belonging to a single instructional method.
A central mechanism often invoked to explain how FFI works is noticing. Following Schmidt’s (1990) proposal, noticing involves learners becoming aware of particular linguistic features in the input, a prerequisite for the restructuring of interlanguage. In this respect, the affordances of AI map neatly onto long-standing theoretical arguments. Many AI tools are designed to highlight non-target-like forms, provide short explanations, or offer modeled corrections, all of which increase the perceptual salience of grammatical features (Alharbi, 2023; Yalçın & Yıldız, 2025). These tools effectively act as input-enhancement devices, drawing attention to linguistic patterns in ways that static materials often cannot. Moreover, AI can supply abundant, high-quality input on demand. Early work in input theory (Krashen, 1985) emphasized the importance of comprehensibility, while later ISLA research has underscored the role of enriched or enhanced input in accelerating form learning. AI-generated examples (varied, contextualized, and adjustable to learner proficiency) offer precisely this type of enriched linguistic environment (Jia et al., 2022). In this sense, AI acts as an accessible form of input enhancement that makes FFI more workable for independent revision.
At the same time, researchers caution against assuming that AI can support all dimensions of writing equally well (Luan et al., 2020). Its strengths are most visible at the sentence level—grammar, lexis, and structural accuracy—areas that are computationally tractable and relatively easy to evaluate with finite rules or probabilistic models (Elkhatat et al., 2023). Higher-order features of writing, however—coherence, cohesion, rhetorical flow, and argumentation—require contextual and pragmatic judgments that AI systems are still limited in providing. As Yalçın and Yıldız (2025) argue, this distinction matters greatly for L2 writing, where learners benefit not only from accurate sentence-level production but also from coherent organization and clear discourse structure. For these reasons, AI is best understood not as a comprehensive corrective system but as a tool whose strengths and limitations must be carefully aligned with established principles of ISLA (Mohamed & Lamia, 2018).
AI-Assisted Feedback in L2 Writing
Shifting from theoretical foundations to empirical synthesis, automated feedback has become a central focus of L2 writing research, particularly as AI tools have grown more accessible and sophisticated (Fan & Ma, 2022). A substantial body of empirical work indicates that AI-assisted corrective feedback can lead to improvements in grammatical accuracy by enabling immediate, repeated engagement with learner errors (Link et al., 2022; Ranalli, 2021). Because feedback is delivered instantaneously, learners often complete multiple revision cycles within a single writing session—an intensity of practice rarely feasible with teacher feedback alone.
However, the literature also documents important limitations. Even advanced tools tend to prioritize sentence-level corrections, offering comparatively limited guidance on discourse-level features such as coherence, paragraph development, and rhetorical flow (Kohnke et al., 2023; Lo et al., 2024; Yalçın & Yıldız, 2025). These discourse-level features require contextual judgment and sensitivity to communicative intent that current AI systems cannot reliably replicate (Ji et al., 2022).
Learner perceptions add another layer of complexity. Many students value the immediacy and neutral tone of AI, noting increased comfort in experimenting with language when feedback is not tied to evaluation (Aydın Yıldız, 2023; Zhai, 2023). Others, however, express doubts about the accuracy of suggestions, particularly when AI overlooks context or nuance (Waziana et al., 2024). These affective reactions—whether trust or uncertainty—influence whether learners ultimately adopt proposed revisions. Thus, researchers increasingly argue that studies should consider both performance outcomes and learner perceptions to capture a full picture of pedagogical impact (Altamimi, 2025; Darwin et al., 2023).
To sum up, the literature suggests that automated feedback is especially effective for grammatical accuracy and iterative revision (Ng et al., 2023), while teacher feedback remains more influential for higher-order writing. This complementary pattern underscores the need for comparative research examining how learners respond to both feedback sources across different dimensions of performance.
Empirical Research on AI-Supported Grammar and Writing Development
Empirical inquiry into AI-mediated writing has expanded rapidly, with a growing consensus that these tools can significantly scaffold L2 writing performance. Recent classroom-based studies show that automated feedback leads to noticeable improvements in learner writing (Ahmed Ali et al., 2025; Ozfidan et al., 2024). For instance, Tang et al. (2024) observed that university learners utilizing ChatGPT suggestions not only reduced error rates but also demonstrated gains in syntactic complexity—improvements that notably transferred to new tasks rather than remaining tied to a single prompt (Annamalai & Bervell, 2025).
However, the literature suggests that access alone does not guarantee learning; impact is heavily mediated by learner engagement. A systematic review by H. Chen and Anyanwu (2025) concluded that significant accuracy gains occurred only when learners used AI comments for meaningful revision rather than passive acceptance. This is reinforced by Lai et al. (2023), who found that metalinguistic prompts encouraged students to actively monitor errors, fostering essential self-regulatory skills (Jin et al., 2023). Similarly, Evenddy (2024) documented a developmental shift where students moved from treating AI suggestions as final answers to using them for strategic, reflective editing. Despite promising results, limitations warrant caution. X. Wang et al. (2024) documented inconsistencies in feedback accuracy across platforms, while Idowu et al. (2024) observed that learners often struggled to discern if AI advice suited their communicative intentions. Furthermore, a recent global study by Ravšelj et al. (2025) highlights that while higher education students widely recognize the utility of tools like ChatGPT, their early reactions are heavily nuanced by concerns regarding reliability, academic integrity, and contextual limitations. Ultimately, while AI supports gains in accuracy and complexity, gaps remain. Existing work largely relies on short-term interventions or general EFL populations. Far fewer studies have examined AI feedback over extended periods with upper-intermediate teacher-education students—a demographic with distinct metacognitive capacities (Kramar et al., 2024).
To better position the present investigation within this growing body of literature, Table 1 provides a critical synthesis of key empirical studies investigating AI-assisted L2 writing and learner perceptions.
Table 1. Synthesis of Key Empirical Studies on AI-Assisted L2 Writing.
Research Questions
Grounded in the theoretical claims surrounding attention to form, noticing, and mediated revision—and informed by empirical evidence demonstrating differential strengths of AI-assisted and teacher-mediated feedback in L2 writing—the present study sought to examine how upper-intermediate English Language Education majors develop in accuracy, complexity, and discourse-level quality when exposed to distinct feedback modalities. Given the growing pedagogical relevance of AI tools in teacher-education programs and the need to understand how such tools shape revision behaviors, linguistic development, and learner perceptions, the study addressed the following research questions:
1. To what extent does AI-assisted feedback lead to improvements in learners’ grammatical accuracy compared to teacher-mediated written corrective feedback?
2. How does AI-assisted feedback influence learners’ syntactic complexity and global writing quality relative to teacher feedback?
3. How do learners perceive AI-based and teacher-based feedback, and in what ways do these perceptions shape their revision choices, confidence, and engagement during the writing process?
Based on the theoretical frameworks outlined above, the study proposed the following hypotheses: First, grounded in the Noticing Hypothesis and Skill Acquisition Theory, learners receiving AI-assisted feedback would demonstrate greater gains in grammatical accuracy due to immediate, visually enhanced error noticing and iterative revision cycles. Second, informed by sociocultural perspectives on pedagogical mediation, those receiving teacher feedback would show stronger improvements in global writing quality—operationalized here as higher-order discourse features including coherence, organization, and argument development. Third, no substantial between-group differences were expected for syntactic complexity, as both conditions provide comparable opportunities for structural elaboration during revision. Finally, it was hypothesized that learners would hold distinct perceptual profiles for each feedback modality, with AI perceived as more immediate and confidence-building, and teacher feedback viewed as more reliable and pedagogically informative.
Methodology
Research Design
This study employed a 6-week quasi-experimental pre-test–post-test design to investigate the effects of AI-assisted feedback on L2 learners’ grammatical accuracy, syntactic complexity, and overall writing development. Two intact classes from an English Language Education program were randomly assigned to one of two conditions: an AI-Feedback Group, which received automated feedback from Grammarly and ChatGPT, and a Teacher-Feedback Group, which received traditional written corrective feedback from course instructors. Because true randomization at the individual student level was not feasible, equivalency between the intact classes was ensured beyond baseline pre-test scores by verifying that both groups shared the same curriculum, followed identical syllabi, and were taught by the same instructor. While these two tools offer distinct affordances—with Grammarly primarily providing rule-based corrective feedback on morphosyntactic accuracy, and ChatGPT offering generative suggestions for sentence reformulation and lexical variation—they were treated as a single “AI-assisted” condition. This grouping was intentionally chosen to preserve the ecological validity of modern digital writing environments, where learners typically employ a constellation of automated tools simultaneously rather than in isolation. Together, they represent a unified paradigm of instantaneous, machine-generated feedback that contrasts directly with delayed, human mediation. This design enabled a controlled comparison of writing development across feedback modalities while preserving the ecological validity of authentic instructional settings.
Both groups completed the same writing tasks and followed the same instructional sequence. Participants in the AI-assisted group received feedback from the AI-based writing support tools (Grammarly and ChatGPT). These tools provided automated feedback primarily at the sentence and clause level, including grammatical corrections, lexical suggestions, and reformulations. Learners were encouraged to review the feedback critically and revise their drafts accordingly. No explicit restrictions were imposed on the number of revisions. Participants in the teacher-mediated group, by contrast, received written feedback from the course instructor. This feedback focused on grammatical accuracy, syntactic development, and discourse-level quality, including coherence and argumentation. Unlike the AI condition, teacher feedback included explanatory comments and task-specific guidance grounded in pedagogical judgment. Quantitative data were collected through pre- and post-intervention writing assessments and analytic measures of grammatical accuracy and syntactic complexity, while qualitative data derived from learner surveys and interviews provided insight into learners’ perceptions and revision behaviors. Although the study does not adopt a person-centered methodology per se, the longitudinal design allows for sensitivity to individual developmental trajectories, an important consideration in AI-supported learning environments where learners may engage with feedback in heterogeneous ways.
Participants
Participants were 60 undergraduate English Language Education students (aged 18–23) enrolled at a public university in Türkiye. Having completed an intensive English preparatory program, all participants demonstrated upper-intermediate proficiency (CEFR B2–B2+) in receptive skills, although their written production revealed ongoing developmental needs in grammar, syntax, and discourse organization. None of the participants reported prior structured experience with AI-based writing tools. Participation was voluntary, and all students provided informed consent prior to data collection. Table 2 summarizes the participants’ demographic and proficiency characteristics.
Instruments
Data were collected using multiple instruments designed to capture both linguistic development and learner perceptions. The primary tools for assessing learning gains were a writing pre-test, administered during the first week, and a parallel post-test, administered during the final week of the intervention. In both assessments, students were required to produce a 250 to 300-word argumentative essay within a 40-min time limit.
Essays were evaluated using analytic measures of grammatical accuracy, syntactic complexity, and writing quality. Accuracy was assessed through Error-Free Clause Ratio (EFCR) and grammatical errors per 100 words, measures widely established as robust indicators of L2 linguistic accuracy (Polio, 1997; Wigglesworth, 2008). Syntactic complexity was evaluated through Mean Length of T-unit (MLT) and the Subordination Index, consistent with standard complexity metrics in SLA research (Norris & Ortega, 2009; Ortega, 2003). Finally, writing quality was rated through a ten-point analytic rubric addressing task fulfillment, organization, cohesion, lexical appropriateness, accuracy, and complexity, adapted from the ESL Composition Profile (Jacobs et al., 1981) to align with specific course objectives. All essays were scored independently by two trained raters, whose evaluations demonstrated strong inter-rater reliability. Table 3 outlines the instruments used in the study and specifies their purposes, target skills, and administration schedule.
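To make the four analytic measures concrete, the following minimal sketch computes them from hand-coded counts for a single essay. The `EssayCounts` container, the function names, and the sample counts are hypothetical illustrations rather than the study's scoring procedure. The Subordination Index is assumed here to mean clauses per T-unit; other operationalizations exist, and the text does not specify which one was applied.

```python
from dataclasses import dataclass


@dataclass
class EssayCounts:
    """Hand-coded counts for one essay (hypothetical input structure)."""
    words: int
    clauses: int
    error_free_clauses: int
    errors: int
    t_units: int


def error_free_clause_ratio(c: EssayCounts) -> float:
    """EFCR: proportion of clauses containing no grammatical error."""
    return c.error_free_clauses / c.clauses


def errors_per_100_words(c: EssayCounts) -> float:
    """Grammatical errors normalized to a 100-word window."""
    return c.errors / c.words * 100


def mean_length_of_t_unit(c: EssayCounts) -> float:
    """MLT: average number of words per T-unit."""
    return c.words / c.t_units


def subordination_index(c: EssayCounts) -> float:
    """Clauses per T-unit (assumed definition, not confirmed by the text)."""
    return c.clauses / c.t_units


# Illustrative counts only; these are not data from the study.
essay = EssayCounts(words=280, clauses=40, error_free_clauses=30,
                    errors=14, t_units=20)
print(round(error_free_clause_ratio(essay), 2))  # 0.75
print(round(errors_per_100_words(essay), 2))     # 5.0
print(round(mean_length_of_t_unit(essay), 2))    # 14.0
print(round(subordination_index(essay), 2))      # 2.0
```

Normalizing by clauses, words, or T-units keeps the measures comparable across essays of different lengths, which matters here because the test essays (250 to 300 words) and the weekly paragraphs (150 to 180 words) differ in size.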
Table 2. Participant Characteristics by Group.
During the 6-week intervention, students also completed weekly writing assignments in which they produced short argumentative paragraphs of approximately 150 to 180 words. These assignments provided repeated opportunities for feedback and revision within each feedback condition. The AI-Feedback Group revised drafts using Grammarly Premium and ChatGPT, while the Teacher-Feedback Group revised drafts based on instructor comments.
In addition, a 20-item Likert-scale perception survey was administered in the final week to capture learners’ attitudes toward the feedback they received, their motivational responses, and their metacognitive awareness during revision. Prior to administration, content validity was established through an expert review process involving two experienced colleagues in the field of applied linguistics, who evaluated the items for clarity, relevance, and lack of ambiguity (Dörnyei, 2003). The survey demonstrated high internal reliability, yielding a Cronbach’s alpha coefficient of .83, well above the acceptable threshold for social science research (Mackey & Gass, 2016). Semi-structured interviews with a subset of participants further explored learners’ experiences, challenges, and perceptions of AI-supported or teacher-mediated feedback. Interviews were transcribed verbatim and served as the primary qualitative data source.
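For readers unfamiliar with the reliability coefficient reported above, the sketch below shows how Cronbach's alpha is conventionally computed from a respondents-by-items score matrix. The function name and the toy responses are hypothetical; the study's own survey data are not reproduced here, and the software used for the reliability analysis is not stated in the text.

```python
from statistics import variance


def cronbach_alpha(responses: list[list[float]]) -> float:
    """Cronbach's alpha for a respondents-by-items score matrix.

    alpha = k / (k - 1) * (1 - sum(item variances) / variance(total scores))
    """
    k = len(responses[0])                  # number of items
    items = list(zip(*responses))          # transpose: one tuple per item
    item_var_sum = sum(variance(col) for col in items)
    total_var = variance([sum(row) for row in responses])
    return k / (k - 1) * (1 - item_var_sum / total_var)


# Hypothetical toy data: 5 respondents x 4 Likert items (not the study's data).
toy = [
    [4, 5, 4, 5],
    [3, 3, 4, 3],
    [2, 2, 3, 2],
    [5, 4, 5, 5],
    [3, 4, 3, 4],
]
alpha = cronbach_alpha(toy)
print(round(alpha, 2))
```

Alpha rises when items covary strongly relative to their individual variances, which is why it is read as an index of internal consistency.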
Procedure
The study spanned 8 weeks. Following the Week 1 pre-test, the AI-Feedback Group received a 30-min orientation on the ethical and critical functions of Grammarly and ChatGPT. To ensure consistency across the experimental condition, Grammarly Premium was standardized to the “Academic” domain and “Formal” tone settings. For ChatGPT, learners were instructed to use a standardized base prompt for their weekly revisions (e.g., “Act as an expert English writing tutor. Review the following paragraph for grammatical accuracy and lexical variety. Provide specific corrections and brief explanations.”) to prevent wide variability in the AI’s generative output (see Appendix A for comparative examples of AI-generated versus teacher-mediated feedback). During the intervention (Weeks 2–7), participants completed weekly writing tasks and revised drafts according to their assigned condition, ensuring parallel rhetorical demands. In the final week, students completed the post-test and perception survey, followed by semi-structured interviews with eight volunteers.
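The standardization step for ChatGPT can be pictured as a fixed template to which each learner paragraph is attached. The sketch below is only an illustration: the base prompt is the example quoted above, while the helper name `build_revision_prompt` and the "Paragraph:" separator are assumptions, since the text does not describe how learners combined the prompt with their drafts.

```python
# Base prompt as quoted in the Procedure section (given there as an example).
BASE_PROMPT = (
    "Act as an expert English writing tutor. Review the following paragraph "
    "for grammatical accuracy and lexical variety. Provide specific "
    "corrections and brief explanations."
)


def build_revision_prompt(paragraph: str) -> str:
    """Attach one learner paragraph to the fixed base prompt (hypothetical helper)."""
    return f"{BASE_PROMPT}\n\nParagraph:\n{paragraph.strip()}"


# Invented learner sentence, used only to show the assembled prompt.
draft = "  Many people thinks that technology make life more easy.  "
print(build_revision_prompt(draft))
```

Keeping the instruction text constant means that differences in the feedback received are more likely to reflect the learner's draft than variation in how the request was worded.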
Data Analysis
Quantitative Analysis
Quantitative data from the pre-tests, post-tests, and weekly writing samples were analyzed using the R statistical environment (Version 4.3.1). To account for the longitudinal nature of the design and the hierarchical structure of the data, Linear Mixed-Effects Models (LMMs) were fitted using the lme4 package (Bates et al., 2015).
For each outcome variable, the models included Group (AI vs. Teacher), Time (Weeks 1–8), and the Group × Time interaction as fixed effects, with Participant entered as a random intercept to control for individual baseline differences. Time was modeled as a continuous, linear predictor based on the theoretical expectation of steady, incremental development over a relatively short intervention period, and to maintain model parsimony. Model assumptions (linearity, homoscedasticity, and normality of residuals) were assessed through visual inspection of Q–Q plots and residual scatterplots. Instead of Cohen’s d, effect sizes for the mixed models were calculated using the MuMIn package, reporting marginal R2 (R2m; variance explained by fixed effects) and conditional R2 (R2c; variance explained by both fixed and random effects). All statistical tests were two-tailed with a significance threshold of α = .05.
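The model specification described above can be written out as follows. This is a sketch under stated assumptions, not the author's own notation: y_ij denotes the outcome for participant i at week j, Group is assumed to be dummy-coded (0 = Teacher-Feedback, 1 = AI-Feedback), u_i is the participant-level random intercept, and σ_f² denotes the variance of the fixed-effect predictions. The two R² expressions follow the usual variance-partitioning definitions of marginal and conditional R² for random-intercept models. In lme4 syntax, a model of this shape would presumably be written as outcome ~ Group * Time + (1 | Participant), although the exact call is not given in the text.

```latex
\begin{align*}
y_{ij} &= \beta_0 + \beta_1\,\mathrm{Group}_i + \beta_2\,\mathrm{Time}_{ij}
        + \beta_3\,(\mathrm{Group}_i \times \mathrm{Time}_{ij}) + u_i + \varepsilon_{ij},\\
u_i &\sim \mathcal{N}(0, \sigma_u^2), \qquad
\varepsilon_{ij} \sim \mathcal{N}(0, \sigma_\varepsilon^2),\\
R^2_m &= \frac{\sigma_f^2}{\sigma_f^2 + \sigma_u^2 + \sigma_\varepsilon^2}, \qquad
R^2_c = \frac{\sigma_f^2 + \sigma_u^2}{\sigma_f^2 + \sigma_u^2 + \sigma_\varepsilon^2}.
\end{align*}
```

Under this coding, β₂ is the weekly slope for the reference group and β₃ is the difference in weekly slope between the two groups, so the Group × Time term is the parameter that carries the comparison of developmental trajectories.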
Qualitative Analysis
For the qualitative phase, interview recordings were transcribed verbatim and checked for accuracy. The transcripts and open-ended survey responses were then imported into NVivo 14 (QSR International) for coding. Data were analyzed using reflexive thematic analysis (Braun & Clarke, 2006, 2019). In alignment with the reflexive nature of this methodology, it is necessary to acknowledge my positionality as a language educator and researcher familiar with the ELE context. This background inherently informed the interpretive lens through which the data were analyzed, particularly in recognizing the nuances of pedagogical mediation and pre-service teacher anxiety. To ensure analytical rigor and mitigate undue bias, codes were continually questioned and refined through sustained, critical engagement with the data. An initial coding scheme was developed both deductively, based on a priori areas of interest (confidence, reliability, human mediation, and immediacy), and inductively, allowing unexpected themes to emerge. Codes were iteratively refined, merged, or split, and related codes were grouped into the four overarching themes reported in the Results section.
Ethical Considerations
Ethical approval for the study was obtained prior to data collection. Participation was entirely voluntary, and all students provided written informed consent prior to any data collection or intervention. To limit the risk of harm to study participants, the research design ensured that no student was deprived of instructional support; specifically, a no-feedback control group was avoided because withholding feedback would be pedagogically and ethically problematic in a credit-bearing course. Furthermore, to protect participant privacy, all data were anonymized, stored securely on encrypted drives, and only the core research team had access to the raw files. The potential benefits of this research to society and the participants—namely, providing evidence-based guidelines for integrating AI into writing instruction and equipping future educators with critical digital literacy—substantially outweigh the minimal risks associated with the study, which did not exceed those encountered in standard educational settings.
Data Availability Statement
The datasets generated and analyzed during the current study are not publicly available due to privacy and ethical restrictions regarding human subjects. However, de-identified quantitative data and synthesized qualitative data are available from the corresponding author upon reasonable request.
Results
To examine the differential effects of feedback modality on L2 writing development, Linear Mixed-Effects Models (LMMs) were fitted for each outcome measure: grammatical accuracy (EFCR), syntactic complexity (MLT), and global writing quality. LMMs were selected as the primary analytical framework to robustly handle the longitudinal design (repeated measures nested within participants) and to control for individual baseline variability.
All models specified Group (AI-Feedback vs. Teacher-Feedback), Time (Weeks 2–7), and the Group × Time interaction as fixed effects, with Participant entered as a random intercept. Visual inspection of residual plots confirmed that assumptions of homoscedasticity and normality were met. Table 4 summarizes the fixed effects estimates, standard errors, and significance levels for all three models.
Table 3. Summary of Data Collection Instruments and Their Purposes.
As detailed in Table 4, no significant main effects for Group were found at baseline, confirming that the two groups began with comparable proficiency levels across all measures (p > .05). However, developmental trajectories differed significantly by condition. For grammatical accuracy, the model yielded a significant Group × Time interaction (β = .03, p < .001), indicating that the AI-Feedback group improved at a significantly steeper rate than the Teacher-Feedback group.
Table 4. Fixed Effects Estimates for Predictors of Grammatical Accuracy, Syntactic Complexity, and Writing Quality.
Note. N = 60. Reference group for “Group” is Teacher-Feedback. R2m = Marginal R2 (variance explained by fixed effects); R2c = Conditional R2 (variance explained by fixed + random effects).
Conversely, the analysis of writing quality revealed a significant negative interaction term for the AI condition (β = −.05, p = .036), suggesting that the Teacher-Feedback group demonstrated superior growth in discourse-level performance over time. For syntactic complexity, the LMM yielded a significant main effect of Time (β = .21, SE = 0.04, 95% CI [0.13, 0.29], p < .001); however, unlike grammatical accuracy, the Group × Time interaction was not statistically significant (β = .04, SE = 0.05, 95% CI [−0.06, 0.14], p = .41).
Grammatical Accuracy
The first research question investigated the impact of feedback type on the development of grammatical accuracy, operationalized as the EFCR. The analysis revealed a significant main effect of Time (β = .02, SE = 0.003, p < .001), indicating that the writing practice itself contributed to accuracy gains across the sample. More importantly, the model yielded a statistically significant Group × Time interaction (β = .03, SE = 0.004, p < .001, 95% CI [0.02, 0.04]).
As illustrated in Figure 1, while the Teacher-Feedback Group demonstrated steady linear growth, the AI-Feedback Group exhibited a significantly steeper trajectory of improvement. The model explained a substantial proportion of the variance (R2m = .34), and the interaction term indicates that automated feedback accelerated error reduction more than teacher mediation did.
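The confidence intervals reported in this section can be checked against their accompanying estimates and standard errors. The sketch below assumes normal-approximation (Wald) intervals with a critical value of 1.96; the study does not state how its intervals were computed, and the dictionary labels and function name are illustrative rather than taken from the analysis.

```python
# Fixed-effect estimates and standard errors as reported in the text.
# The keys are illustrative labels, not names used in the original analysis.
REPORTED = {
    "accuracy: Group x Time": (0.03, 0.004),
    "complexity: Time": (0.21, 0.04),
    "complexity: Group x Time": (0.04, 0.05),
    "quality: Time": (0.15, 0.02),
}

Z_95 = 1.96  # two-sided 95% critical value of the standard normal (assumed)


def wald_ci(estimate, se, z=Z_95):
    """Return the (lower, upper) normal-approximation interval, rounded to 2 dp."""
    return (round(estimate - z * se, 2), round(estimate + z * se, 2))


for label, (beta, se) in REPORTED.items():
    print(label, wald_ci(beta, se))
```

Under that assumption, the four intervals round to [0.02, 0.04], [0.13, 0.29], [−0.06, 0.14], and [0.11, 0.19], which agree with the reported values.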

Weekly development of grammatical accuracy for the AI-Feedback and Teacher-Feedback groups across the 8-week intervention period.
Syntactic Complexity
Subsequent analyses examined whether feedback modality differentially influenced the development of syntactic complexity (MLT). The LMM yielded a significant main effect of Time (β = .21, SE = .04, p < .001), confirming that the 6-week instructional sequence promoted syntactic elaboration across the entire cohort.
However, unlike grammatical accuracy, the Group × Time interaction was not statistically significant (β = .04, SE = 0.05, p = .41). This lack of interaction indicates that the rate of syntactic development was comparable between the AI-Feedback and Teacher-Feedback conditions. As illustrated in Figure 2, both groups followed parallel developmental trajectories. The model explained a moderate proportion of the variance (R2m = .12; R2c = .45), with the substantial difference between marginal and conditional R2 values suggesting that individual learner variability played a larger role in complexity development than the specific type of feedback received.
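The gap between the two R2 values can be made concrete. The decomposition below assumes the commonly used variance-partitioning definitions of marginal and conditional R2 for a random-intercept model; the study does not name the estimator it used, and the variance symbols are introduced here for illustration.

```latex
% sigma_f^2:   variance of the fixed-effect predictions
% sigma_u^2:   between-participant (random-intercept) variance
% sigma_eps^2: residual variance
R^2_m = \frac{\sigma_f^2}{\sigma_f^2 + \sigma_u^2 + \sigma_\varepsilon^2},
\qquad
R^2_c = \frac{\sigma_f^2 + \sigma_u^2}{\sigma_f^2 + \sigma_u^2 + \sigma_\varepsilon^2}

% Share of total variance attributable to stable between-learner differences:
R^2_c - R^2_m = \frac{\sigma_u^2}{\sigma_f^2 + \sigma_u^2 + \sigma_\varepsilon^2}
             = .45 - .12 = .33
```

On this reading, roughly a third of the variance in syntactic complexity reflects stable differences between learners, compared with about 12% for the fixed effects of Group, Time, and their interaction combined.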

Weekly development of syntactic complexity across the 8-week intervention for the AI-Feedback and Teacher-Feedback groups.
These results suggest that gains in syntactic complexity were likely driven by task effects—specifically, the cognitive demands of the argumentative writing prompts and repeated practice—rather than the specific corrective modality. In other words, both feedback types were equally compatible with, though not uniquely predictive of, growth in syntactic maturity.
Writing Quality
The analysis of overall writing quality revealed a more differentiated pattern of development between the two groups. Figure 3 provides a visual comparison of pre- and post-test writing quality scores across groups, highlighting the slightly stronger gains observed for the Teacher-Feedback Group.

Pre-test and post-test writing quality scores for the AI-Feedback and Teacher-Feedback groups.
Consistent with the other measures, a significant main effect of Time was found (β = .15, SE = 0.02, 95% CI [0.11, 0.19], t = 7.30, p < .001). In contrast to the accuracy results, the Group × Time interaction significantly favored the Teacher-Feedback Group (β = −.05, SE = 0.02, t = −2.15, p = .036). The negative β coefficient for the AI condition (relative to the Teacher reference group) indicates that the rate of improvement in global writing quality was slower for students receiving automated feedback. The Teacher-Feedback Group demonstrated stronger gains in discourse-level features such as coherence and argumentation by the post-test (R2m = .18, R2c = .52).
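Because Teacher-Feedback is the reference group, the Time coefficient is that group's slope and the AI-Feedback slope is the sum of the Time and interaction coefficients. The sketch below applies this arithmetic to the three reported models. It assumes dummy coding of Group (Teacher = 0, AI = 1) and that slopes are expressed per unit of Time in whatever scale the reported β values use; the function and variable names are illustrative.

```python
# Reported fixed-effect coefficients per outcome: (Time, Group x Time).
# Teacher-Feedback is the reference group, so Time is its slope.
COEFS = {
    "grammatical accuracy": (0.02, 0.03),
    "syntactic complexity": (0.21, 0.04),  # interaction not significant (p = .41)
    "writing quality": (0.15, -0.05),
}


def group_slopes(time_beta, interaction_beta):
    """Implied slopes per unit of Time under dummy coding (Teacher = 0, AI = 1)."""
    teacher = time_beta
    ai = round(time_beta + interaction_beta, 2)
    return teacher, ai


for outcome, (b_time, b_inter) in COEFS.items():
    teacher, ai = group_slopes(b_time, b_inter)
    print(f"{outcome}: teacher = {teacher:.2f}, AI = {ai:.2f}")
```

The implied slopes are .02 versus .05 for grammatical accuracy and .15 versus .10 for writing quality (Teacher-Feedback versus AI-Feedback). For syntactic complexity the point estimates are .21 versus .25, but since the interaction was not significant, this difference should not be interpreted as a real divergence.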
Rater notes indicated that teacher feedback frequently highlighted issues related to idea development (aligning with the rubric’s task fulfillment dimension), paragraph unity (aligning with organization), and transitions (aligning with cohesion), whereas automated feedback tended to focus more heavily on sentence-level issues. Taken together, these results indicate that teacher feedback was somewhat more effective in promoting higher-order writing quality. These findings align with prior research suggesting that AI tools excel at micro-level feedback but may provide less support for discourse-level features.
Learner Perceptions and Qualitative Insights
Qualitative findings from the perception survey (N = 60) and semi-structured interviews revealed distinct patterns in learner perceptions across the two feedback conditions. Thematic analysis identified four predominant themes: increased confidence, reliability concerns, value of human mediation, and feedback immediacy.
Increased Confidence With AI Feedback
Participants in the AI-Feedback Group consistently associated automated tools with psychological safety. Analysis of the survey responses indicated that 24 out of 30 (80%) participants in this condition agreed that the feedback was supportive and non-judgmental. In the open-ended responses, 18 students (60%) explicitly noted that the neutral tone of AI suggestions reduced the anxiety typically associated with error correction.
As one participant explained, “When I used AI, I felt more relaxed because it didn’t make me feel bad about my mistakes” (P7). Other students corroborated this, noting that the tool’s safety net encouraged them to take risks and experiment with more complex syntax (P3). Furthermore, nearly half of the group (n = 14) reported that the AI’s repetitive highlighting of similar errors heightened their metalinguistic awareness of persistent grammatical patterns in their writing (P12).
Reliability Concerns About Automated Suggestions
Despite the generally positive reception, 11 of the 30 (37%) AI-group participants expressed reservations regarding the semantic accuracy of AI suggestions. These learners reported instances where corrections were grammatically valid but contextually inappropriate. For instance, P4 remarked, “Sometimes AI suggested words that didn’t fit what I wanted to say, so I wasn’t sure if I should accept the change.” This skepticism often led to selective uptake; 9 participants (30%) explicitly mentioned independently verifying AI feedback against external sources to confirm its contextual accuracy before accepting the changes.
Preference for Human Mediation in Teacher Feedback
In the Teacher-Feedback Group, the dominant theme was the pedagogical depth of the commentary. The vast majority of participants (n = 26; 87%) in this group cited “clarity” and “contextualization” as the primary benefits of human feedback. Unlike the AI group, who focused on surface corrections, these learners highlighted that teacher explanations fostered conceptual understanding.
P22 illustrated this distinction: “My teacher told me why the sentence didn’t work, not just that it was wrong, and that really helped me understand the logic.” Similarly, other learners emphasized the personalized nature of the guidance, appreciating how teachers tailored their support to individual writing styles rather than providing generic corrections.
Feedback Immediacy and Revision Cycles
The biggest divergence between groups concerned the logistics of the revision process. Almost all respondents in the AI condition (n = 28; 93%) cited immediacy as a critical advantage, enabling multiple revision cycles within a single session. P14 noted, “AI let me revise many times in one sitting, which I couldn’t do with teacher feedback.”
Conversely, 12 students (40%) in the Teacher-Feedback Group noted that the turnaround time for human feedback acted as a bottleneck, temporarily stalling their writing progress until the corrections were returned. However, it is notable that despite this logistical constraint, these learners still prioritized the reliability of teacher feedback over speed.
Discussion
The purpose of this study was to examine how AI-assisted and teacher-mediated feedback influence key dimensions of L2 writing development—grammatical accuracy, syntactic complexity, and overall writing quality—while also exploring learners’ perceptions of each feedback modality. In direct response to the research questions and proposed hypotheses, the findings revealed three distinct patterns. First, AI-assisted feedback led to significantly steeper gains in grammatical accuracy, supporting the hypothesis that immediate, visually salient automated corrections accelerate rule-based learning. Second, while syntactic complexity grew comparably across both conditions, teacher mediation proved significantly more effective at improving global, discourse-level writing quality. Finally, qualitative findings confirmed the hypothesized divergent perceptual profiles: learners utilized AI as a low-anxiety, immediate tool for iterative drafting, but relied on the epistemic authority of human teachers for reliable, contextualized pedagogical guidance. By integrating these longitudinal quantitative findings with qualitative insights, the study offers a nuanced account of how automated and human feedback support different aspects of the writing process.
Differential Contributions to Grammatical Accuracy
The most substantial gains associated with AI-assisted feedback were observed in the domain of grammatical accuracy. Learners in the AI-Feedback Group exhibited significantly greater increases in EFCR and larger reductions in grammatical errors per 100 words than those in the Teacher-Feedback Group. These findings align with previous research reporting strong accuracy gains following exposure to automated writing tools (Ahmed Ali et al., 2025; Tang et al., 2024).
Importantly, the distributional patterns presented in Figure 4 indicate that these gains were not limited to specific proficiency levels. Rather, the AI-Feedback Group exhibited a uniform upward shift in accuracy scores accompanied by reduced variance at post-test. This pattern suggests that automated feedback functioned as a form of systematic scaffolding, enabling lower-performing learners to address recurring surface-level errors and narrow the gap with higher-performing peers (Fan & Ma, 2022).

Distributional shifts in grammatical accuracy (EFCR) from pre-test to post-test.
From a theoretical standpoint, these results can be interpreted through the lens of the Noticing Hypothesis (Schmidt, 1990). AI tools operationalize noticing by providing immediate visual enhancement and explicit prompts that direct learners’ attention to morphosyntactic forms that might otherwise escape awareness (Alharbi, 2023). In addition, the immediacy of AI feedback facilitates multiple revision cycles within a single writing session, allowing learners to repeatedly engage with error patterns, test hypotheses, and refine output. Such iterative practice is central to Skill Acquisition Theory, which emphasizes repeated feedback and rehearsal as mechanisms for the proceduralization of grammatical knowledge (DeKeyser, 2015). Taken together, these mechanisms offer a compelling explanation for the accelerated accuracy gains observed in the AI condition.
Convergent Growth in Syntactic Complexity
Despite the clear advantage of AI feedback in accuracy, syntactic complexity developed similarly across both groups, with no significant interaction effect observed. This parallelism suggests that gains in complexity were driven by the cognitive demands of the writing tasks and repeated practice effects rather than the specific feedback modality. From a cognitive load perspective, argumentative writing requires simultaneous attention to idea generation, organization, and linguistic formulation. When faced with this high cognitive burden, learners in both groups likely utilized their respective feedback modalities to reduce local error loads rather than spontaneously risking structural expansion. Therefore, the 6-week sequence of argumentative writing tasks likely provided sustained opportunities for syntactic elaboration, encouraging learners to experiment with clause combining, subordination, and extended sentence structures only when compelled by the task itself. Consequently, this suggests that syntactic complexity is inherently less feedback-sensitive; to achieve differentiated gains, learners may require explicit, targeted elaboration prompts (e.g., “combine these two sentences”) rather than general corrective feedback. This finding is consistent with prior research indicating that while AI tools can facilitate revision efficiency, they do not inherently promote syntactic expansion unless learners are explicitly prompted to restructure or elaborate their output (Annamalai & Bervell, 2025; Zhang et al., 2025). For upper-intermediate ELE students, therefore, the act of writing and revising complex texts appears to be a stronger determinant of syntactic development than whether feedback is delivered by AI or by a teacher.
Notably, the substantial discrepancy between marginal and conditional R2 values in the complexity model suggests that individual learner variability played a considerable role in shaping developmental outcomes. This pattern underscores the importance of longitudinal and person-sensitive analytic approaches when examining higher-order dimensions of L2 development, which may unfold unevenly across learners despite shared instructional conditions.
Furthermore, these findings must be interpreted in light of measurement sensitivity. The metrics employed in this study to evaluate syntactic complexity—MLT and the Subordination Index—are robust and standard indicators of sentence-level structural elaboration. However, they are not inherently designed to capture discourse-level syntactic development or rhetorical appropriateness. It is entirely plausible that while both groups produced comparably long and subordinate structures, the Teacher-Feedback Group may have deployed these structures more effectively to serve cohesive and argumentative goals. This limitation in measurement sensitivity helps explain why the significant advantages observed for the Teacher-Feedback Group in global writing quality were not mirrored in the sentence-bound complexity indices.
Teacher Advantages in Global Writing Quality
A different pattern emerged for global writing quality. The Teacher-Feedback Group demonstrated significantly stronger gains in discourse-level performance, particularly in coherence, organization, and argument development. This finding reinforces a recurring conclusion in the literature: while AI tools excel at micro-level feedback, they remain limited in addressing higher-order aspects of writing that require contextual interpretation, rhetorical judgment, and pedagogical sensitivity (Celik et al., 2022; Kohnke et al., 2023; Lo et al., 2024).
Rater observations indicated that teacher feedback frequently addressed idea development, paragraph unity, and transitions—areas that automated systems tended to overlook or treat superficially (H. Chen & Anyanwu, 2025). From a sociocultural perspective, such feedback constitutes dialogic and contextualized mediation that extends beyond error correction and supports learners’ movement through the Zone of Proximal Development (Vygotsky, 1978). Teachers are able to situate comments within broader instructional goals, task expectations, and learners’ emerging writing identities, thereby facilitating meaning-making processes that contribute to global writing development (Aljabr & Al-Ahdal, 2024).
These findings suggest that discourse-level writing quality remains an area where human expertise is particularly difficult to replace. Rather than viewing AI and teacher feedback as competing alternatives, the results point to their complementary pedagogical roles.
Learner Perceptions: Complementary Strengths and Divergent Expectations
The qualitative findings provide further insight into these differential effects. Learners in the AI-Feedback Group consistently described automated tools as immediate, supportive, and non-threatening, highlighting the affective benefits of receiving feedback in a low-stakes environment (Çakmak, 2022; Zhai, 2023). Reduced anxiety and increased willingness to experiment with language were recurrent themes, echoing prior research linking AI use to enhanced confidence and self-regulation (Evenddy, 2024; Jin et al., 2023; Lai et al., 2023).
At the same time, a critical tension emerged regarding trust. A notable proportion of learners expressed skepticism about the contextual appropriateness of AI-generated suggestions, particularly when feedback conflicted with their communicative intentions. This skepticism often led to selective uptake, with some learners verifying AI suggestions through external sources before incorporating them. In contrast, teacher feedback was widely perceived as reliable, pedagogically grounded, and sensitive to individual awareness and needs, even though its delivery was less immediate (Qin & Zhang, 2025).
These divergent perceptions illuminate why AI feedback may be particularly effective for surface-level revision while remaining insufficient for higher-order concerns. AI appears to function as an affective and logistical scaffold, whereas teachers provide epistemic authority and contextual clarity (Alharbi, 2023). For ELE majors, who are simultaneously developing as language users and future educators, this distinction is especially salient, as it shapes not only learning outcomes but also emerging beliefs about responsible feedback practices (Kong et al., 2021; Lan, 2024).
When considered together, the quantitative and qualitative results suggest that AI-assisted and teacher-mediated feedback exert complementary but asymmetrical influences on L2 writing development. AI feedback accelerates grammatical accuracy through immediacy, repetition, and reduced affective barriers, whereas teacher feedback more effectively supports discourse-level quality by providing interpretive and pedagogical mediation. Syntactic complexity, meanwhile, appears to be influenced more strongly by task demands and sustained writing practice than by feedback modality alone. This integrative pattern helps reconcile mixed findings in the existing literature. Crucially, these findings refine existing models of feedback within ISLA by establishing distinct boundary conditions for AI effectiveness. While traditional ISLA models often evaluate form-focused feedback as a generalized construct, this study posits a bifurcated model of digital mediation: automated tools are highly effective for proceduralizing constrained, rule-based morphosyntactic knowledge, which is aligned with Skill Acquisition Theory, but they reach a hard boundary condition when addressing unconstrained, context-dependent discourse features. At this boundary, the epistemic authority and dialogic scaffolding of human mediation rooted in sociocultural theory become theoretically indispensable. Rather than asking whether AI “works” for writing development, the present study demonstrates that what AI supports, how it supports it, and for whom it supports it depend on the linguistic dimension under consideration and on learners’ engagement with feedback.
Pedagogical Implications
The findings indicate that AI and teacher feedback exert complementary influences, offering distinct pedagogical implications. First, the strong gains in grammatical accuracy within the AI-Feedback Group suggest AI tools effectively serve as supplementary resources for sentence-level refinement (Ahmed Ali et al., 2025). By offering immediate, repeated feedback, AI facilitates revision cycles often impractical for teachers, thereby alleviating instructor workload and increasing learner autonomy (Chaudhry & Kazim, 2022).
Second, the Teacher-Feedback Group’s improvement in global writing quality underscores the necessity of human mediation for higher-order skills like argumentation and cohesion. Consequently, AI should not replace teachers but be integrated into a model where instructors retain responsibility for discourse-level concerns and contextual objectives (Aljabr & Al-Ahdal, 2024; Holmes et al., 2022). Furthermore, establishing this balance is crucial to mitigate significant ethical and pedagogical risks, such as the potential “deskilling” of both learners, who may lose autonomous self-editing capabilities, and educators, who might increasingly offload critical pedagogical judgments to algorithms.
Third, divergent learner perceptions highlight an urgent need for explicit training in feedback literacy and critical AI literacy, particularly within ELE programs. Because these students are future educators, ELE curricula must go beyond teaching the mere operational use of AI tools. Teacher training programs should incorporate reflective practicum tasks where pre-service teachers evaluate the pedagogical appropriateness of AI-generated feedback against established SLA principles, actively diagnosing where automated tools succeed and where human mediation remains essential. By cultivating this critical pedagogical lens, future educators will be better equipped to model responsible, effective, and critically evaluated AI integration for their own future students (Hockly, 2024; Lan, 2024). Specifically, mediating AI pedagogically requires three core competencies: technological-pedagogical evaluation (the ability to assess AI suggestions against discourse-level goals and learner intent), critical digital literacy (understanding the limitations, biases, and reliability constraints of automated tools), and affective mediation (scaffolding learners’ trust, confidence, and anxiety when engaging with machine feedback).
Students benefited most when approaching suggestions critically, indicating a need for instruction on evaluating automated reliability (Hockly, 2024).
Finally, the emotional and motivational benefits associated with AI-assisted feedback—particularly the reduction in writing anxiety—suggest that AI tools may serve as useful scaffolds for learners who lack confidence or who fear negative evaluation (Aydın Yıldız, 2023). Teachers might therefore consider deploying AI-supported revision tasks at early stages of the writing process, allowing students to experiment freely before receiving more rigorous human feedback. In this way, AI tools can provide affective scaffolding that supports engagement, risk-taking, and the development of self-regulated writing habits (Tai, 2024).
Limitations and Future Research Directions
Although the present study provides valuable insights into the comparative effects of AI and teacher feedback, several limitations must be acknowledged. First, the study employed a quasi-experimental design with intact classes (N = 60) rather than a large-scale randomized controlled trial. While this approach preserved ecological validity within an authentic instructional setting, the relatively small sample size and the specific demographic (upper-intermediate ELE majors in Türkiye) limit the generalizability of the findings to wider populations. Specifically, ELE students possess higher metalinguistic awareness and pedagogical training than general EFL populations, which likely influenced their critical engagement with AI tools. Furthermore, the findings are situated within the Turkish higher education context; learners from different educational cultures with varying exposure to digital literacy or differing attitudes toward teacher authority might interact with automated feedback differently. Consequently, caution must be exercised when extrapolating these results to lower-proficiency learners, general EFL cohorts, or secondary education settings in other sociocultural contexts. Additionally, the research design did not include a pure control group (i.e., a no-feedback condition). While withholding feedback entirely would be ethically and pedagogically problematic in a credit-bearing ELE writing course, the absence of this baseline means the study can only evaluate the relative efficacy of AI versus teacher mediation, rather than isolating the absolute effect of feedback against task repetition alone.
Second, while assigning intact classes to distinct feedback conditions helped mitigate cross-contamination during formal instructional time, the possibility of informal interaction between groups outside the classroom cannot be entirely ruled out. Students may have shared revision strategies or discussed their respective feedback modalities, which is an inherent challenge in classroom-based research.
Third, the intervention spanned only 8 weeks without a delayed post-test. Consequently, the long-term sustainability of the observed gains remains unclear. While AI feedback significantly accelerated grammatical accuracy during the treatment, it is unknown whether these effects persist once the scaffolding is removed. Longitudinal studies incorporating delayed post-tests are needed to investigate whether AI-supported practice leads to durable L2 acquisition or merely temporary performance improvements.
Further, the investigation was restricted exclusively to written production. As AI technologies increasingly offer multimodal capabilities, including voice recognition and conversation-based interaction, their potential to support oral grammatical accuracy and pragmatic appropriateness remains a fertile ground for inquiry. Future studies should explore how emerging AI speech analysis tools might influence learners’ spoken fluency and interactional competence.
Finally, the role of individual learner differences warrants closer attention. Variables such as metacognitive awareness, digital literacy, and language anxiety likely mediate the effectiveness of AI feedback. Mixed-methods research incorporating detailed learner profiles could illuminate how different types of students benefit from—or struggle with—automated revision processes. Furthermore, ethical considerations, including algorithmic bias and the risk of overdependence, deserve sustained inquiry to ensure that the integration of AI into language education remains pedagogically sound and socially responsible.
Conclusion
This study examined the differential effects of AI-assisted versus teacher-mediated feedback on upper-intermediate L2 learners. Findings reveal a coherent pattern where AI excels in supporting sentence-level accuracy, while teacher feedback provides stronger guidance for discourse-level quality. Syntactic complexity improved comparably across groups, indicating that such development is influenced more by repeated practice than by a specific feedback modality. Qualitative data reinforce that both sources offer distinct advantages: learners appreciated AI’s immediacy and non-judgmental nature, which fostered confidence, whereas teacher feedback was valued for contextual sensitivity and personalized support. These results advocate for an integrated model combining AI tools with teacher mediation to provide comprehensive support.
Footnotes
Appendix A
Ethical Considerations
This study was approved by the Ethics Committee at Bartın University, under approval number SB-7548-0496 in June 2025.
Consent to Participate
All participants provided written informed consent prior to enrollment in the study.
Funding
The author received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
The datasets generated and/or analyzed during the current study are not publicly available due to the large size of the dataset, but are available from the corresponding author on reasonable request.
