Abstract
Despite the growing potential of Generative Artificial Intelligence (GenAI) to enhance learning, particularly in transforming traditional English as a Foreign Language (EFL) teaching and learning practices, there is still limited research available to guide educators and practitioners in understanding its role in pedagogical contexts. This systematic literature review (SLR) explores GenAI’s roles in EFL instruction by examining application contexts, research methods employed, and key issues identified. Following the Preferred Reporting Items for Systematic Reviews and Meta-Analyses (PRISMA), 51 articles from an initial pool of 284 studies published between 2020 and 2024 (based on early access year) were selected from Web of Science (WoS) and Scopus. Findings on application contexts revealed a marked preference for higher education settings, particularly in East Asia and the Middle East, with an overwhelming focus on writing instruction. Methodologically, mixed and qualitative methods, large sample sizes, and subjective data prevailed. Furthermore, research issues demonstrated GenAI’s versatile yet double-edged role in specific courses, including writing, reading, speaking, grammar, and general EFL instruction, particularly its role as an assessor. Notably, despite divergent findings on the effectiveness of GenAI in writing instruction, students consistently preferred teacher feedback over GenAI feedback. Moreover, a teacher-centered perspective remained dominant in studies of general EFL instruction. Therefore, future research is encouraged to broaden application contexts, strengthen quantitative approaches with varied sample sizes and objective data, and deepen exploration of GenAI’s roles in the full teaching cycle, addressing existing divergences and incorporating more perspectives.
Introduction
Generative Artificial Intelligence (GenAI, used herein as an umbrella term including tools like ChatGPT and DALL-E), defined as algorithms capable of producing audio, visual, and written content (McKinsey & Company, 2023), is reshaping how English is taught and learned (Yeh, 2025). While traditional English as a foreign language (EFL) instruction is confronted with persistent challenges regarding personalization, student engagement, and assessment burdens (Kawinkoonlasate, 2020; Q. Li et al., 2023; Michel-Villarreal et al., 2023; Yavuz et al., 2025), GenAI offers potential solutions by enabling dynamic teaching, interactive learning, and automated assessment to enhance learner engagement and improve outcomes (Hieu & Thao, 2024; Mohamed, 2024; Sayed et al., 2024; Topsakal & Topsakal, 2022).
Despite its potential, GenAI has yet to be fully embraced by educators, and its implementation in EFL classrooms remains limited (Gao et al., 2024; X. Liu & Xiao, 2025). This hesitation is rooted in teachers’ lack of professional training, ethical concerns regarding academic integrity, and the scarcity of practical guidance for effective integration (Arefian et al., 2024; Kohnke et al., 2023b; Zaiarna et al., 2024). Furthermore, the existing literature presents conflicting perspectives. While some researchers criticized GenAI for producing shallow or generic content (X. Liu & Xiao, 2025), others highlighted its ability to generate structured and useful teaching materials (Kusuma et al., 2024). Similarly, Zou et al. (2025) found that teacher feedback led to better engagement than GenAI feedback, whereas Guo and Wang (2024) suggested that GenAI’s feedback might better boost students’ engagement in revision.
In view of the aforementioned challenges and contradictions, a systematic literature review (SLR) is needed to further understand GenAI’s roles in EFL instruction. Existing literature reviews either explore GenAI in broader settings, specifically higher education or general education contexts (Batista et al., 2024; Hwang & Chang, 2023; L. Yan et al., 2024), or focus only on AI or are restricted to ChatGPT (Liang et al., 2023; Lo et al., 2024; Meniado, 2023), with limited emphasis on the EFL context. To address this gap, this study explores the current literature by analyzing: (a) application contexts, (b) research methods employed, and (c) key issues identified.
Literature Review
GenAI is transforming education amid rapid technological advancements (Bahroun et al., 2023). Bahroun et al. (2023) asserted that research on GenAI in education has surged since 2018, partly due to the innovation and popularity of GenAI technologies. Supported by Large Language Models (LLMs), many GenAI applications are adept at text processing and generation (R. Lee, 2025). Consequently, recent studies have demonstrated GenAI’s significant potential in various educational fields, spanning computer science, engineering education, higher education, medical education, nursing education, and academia (Abdulai & Hung, 2023; Cain et al., 2023; Crompton & Burke, 2023; Denny et al., 2023; Dergaa et al., 2023; Relmasira et al., 2023).
In the context of language instruction, GenAI serves as a resourceful, accessible counselor for both students and teachers, drawing increasing attention (Fryer et al., 2020; Kohnke et al., 2023a). Notably, it holds considerable promise for enhancing the language teaching cycle: preparation, implementation, and evaluation (Kovačević, 2023). In preparation, GenAI can help predict students’ performance, find effective teaching methods based on learning data, create discussion questions, adjust materials for students with different skill levels, provide example writing for reference, and make teaching handouts with explanations or practice tasks (Bahroun et al., 2023; Chaudhry & Kazim, 2021; Pack & Maloney, 2023). During implementation, GenAI tools can function as intelligent tutoring systems and support learning management (Bahroun et al., 2023). Particularly, it helps non-native speakers with their writing, making it more effective and easier to manage, while also increasing students’ interest, learning skills, and English abilities (Chan & Lee, 2023; Tang & Deng, 2022; D. Yan, 2023). For evaluation, GenAI can automatically generate assessments, including grades and feedback (Bucol & Sangkawong, 2025; J. Li et al., 2024; Shin & Lee, 2024).
Methodology
This research utilized a systematic literature review process based on the PRISMA 2020 paradigm (Page et al., 2021) with the Modified Technology-based Learning Model proposed by Lo et al. (2024) as its theoretical framework. A literature search was performed in two primary academic databases, Scopus and Web of Science (WoS), concentrating on the incorporation of GenAI in EFL education. The review procedure adhered to the four major steps of PRISMA: identification, screening, eligibility, and inclusion, following the steps highlighted by Page et al. (2021).
Search Strategy
In the initial phase (identification), two major databases (Scopus and WoS) were searched using the following search string: TITLE-ABS-KEY ((“Generative Artificial Intelligence” OR “GenAI” OR “GenAI” OR “GAI” OR “LLMs” OR “Large Language Models” OR “ChatGPT”) AND (“English as a Foreign Language” OR “EFL” OR “Foreign Language Teaching”)).
Inclusion and Exclusion Criteria
During this process, the search was limited to articles written in English and published from 2020 to 2024 (based on early access year). The year 2020 was selected as the starting date as it marked a critical turning point. In 2020, GPT-3 was introduced as the first GenAI model with near-human coherence (with a human detection accuracy of only 52%; Brown et al., 2020). Unlike pre-2020 models requiring fine-tuning, GPT-3 lowered barriers for educators by enabling operation via natural language prompts (Brown et al., 2020). Consequently, since 2020, GenAI has been capable of generating high-quality outputs comparable to human performance (Brown et al., 2020; Trigka & Dritsas, 2025). Furthermore, research on GenAI in education was already being published in 2020 (Mittal et al., 2024; Zhu et al., 2020).
In addition, during the screening stage, titles, abstracts, and research questions were examined to identify studies specifically focused on the use of GenAI in EFL instruction. Importantly, studies addressing GenAI-assisted evaluation were also included, as evaluation is considered a core component of instructional design (Tyler, 1975).
To refine the dataset further, only empirical studies were selected. Following the definition used by Batista et al. (2024), empirical studies were those that involved the collection and analysis of data to generate objective, evidence-based findings—excluding theoretical, opinion-based, or speculative works. Table 1 provides a summary of the inclusion and exclusion criteria used in this stage.
Inclusion and Exclusion Criteria.
Study Selection
As depicted in Figure 1, a total of 284 articles were initially identified from Scopus (n = 159) and WoS (n = 125). After removing duplicate articles (n = 94) and one article without an English version, 189 records remained for screening. Subsequently, articles unrelated to the use of GenAI (n = 5) and not themed on EFL instruction (n = 127) were excluded. Finally, after removing six non-empirical studies, 51 studies were retained.

A PRISMA 2020 flow diagram illustrating the study selection process. Adapted from Page et al. (2021).
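The selection arithmetic reported above can be checked with a short script. This is only an illustrative sketch: the counts are taken from the study selection paragraph, while the function name `prisma_flow` and the step labels are hypothetical and not part of the original review protocol.

```python
# Counts reported in the study selection paragraph.
identified = {"Scopus": 159, "WoS": 125}

# Exclusions in the order they are reported (labels are illustrative).
removed = [
    ("duplicates", 94),
    ("no English version", 1),
    ("unrelated to GenAI", 5),
    ("not themed on EFL instruction", 127),
    ("non-empirical", 6),
]


def prisma_flow(identified, removed):
    """Return the running record count after each exclusion step."""
    remaining = sum(identified.values())
    trail = [("identified", remaining)]
    for reason, n in removed:
        remaining -= n
        trail.append((reason, remaining))
    return trail


trail = prisma_flow(identified, removed)
for step, remaining in trail:
    print(f"{step:32s}{remaining:>5d}")
```

Under these counts, the running totals pass through 284 records identified, 189 records screened, and 51 studies retained, which agrees with the figures reported in the text.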
Data Extraction and Analysis
In line with Lo et al.’s (2024) Modified Technology-based Learning Model, the data extraction and analysis focused on: (1) application contexts, (2) research methods employed, and (3) key issues identified. For application contexts, the following information was collected: (1a) study locations, (1b) educational contexts, and (1c) learning domains. For research methods, the analysis categorized the articles based on (2a) research approaches (i.e., quantitative, qualitative, or mixed methods), (2b) research sample sizes (i.e., large, medium, or small), and (2c) research topics and data sources (e.g., surveys and interviews). Additionally, thematic analysis was conducted on the selected articles to further examine the key issues.
Findings and Discussion
Findings
Findings of the systematic review were structured around the application contexts, research methods, and research issues concerning the roles of GenAI in EFL instruction.
(1) Application Contexts.
For application contexts, three dimensions were examined, including (1a) study locations, (1b) educational contexts, and (1c) learning domains. Table 2 summarizes the major findings of application contexts.
Major Findings of Application Contexts.
Overall, GenAI in EFL instruction was most extensively explored in East Asia and the Middle East, with limited research in Europe, Africa, and South Asia. Similarly, higher education was the most frequently studied educational context, while other levels such as elementary, secondary, and special education remained underrepresented. In terms of learning domains, there was a strong focus on writing, whereas speaking, grammar, and reading received comparatively little attention and listening received none.
(1a) Study locations.
Figure 2 presents the distribution of the study locations across all 51 selected articles.

Distribution of study locations.
As depicted in Figure 2, a total of 51 studies on the role of GenAI in EFL instruction were conducted across 22 countries and administrative regions, categorized into six geographical areas, including East Asia (n = 19), the Middle East (n = 17), Southeast Asia (n = 8), Europe (n = 5), Africa (n = 1), and South Asia (n = 1). Research was predominantly conducted in East Asia (n = 19, 37.3%) and the Middle East (n = 17, 33.3%), whereas studies in Africa and South Asia remained scarce, with only one conducted by Sayed et al. (2024) in Ethiopia and one by Almashy et al. (2024) in India. Of all the 22 countries and administrative regions, the mainland of China (n = 11, 22%) was the primary contributor. Interestingly, there were two studies under special circumstances. One study was conducted by ElEbyary and Shabara (2024) in the UK, where English is a mother tongue, but this study involved participants who are native Arabic speakers. Another study by Zaiarna et al. (2024) was conducted in Ukraine, but involved participants from Ukraine, the EU, and the USA.
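The regional shares quoted above follow directly from the reported counts. The snippet below is a minimal sketch of that calculation; the variable names are illustrative, and only the counts themselves come from the text.

```python
# Study counts per geographical area, as reported in the text.
region_counts = {
    "East Asia": 19,
    "Middle East": 17,
    "Southeast Asia": 8,
    "Europe": 5,
    "Africa": 1,
    "South Asia": 1,
}

total = sum(region_counts.values())

# Percentage share of each region, rounded to one decimal place.
shares = {
    region: round(100 * n / total, 1)
    for region, n in region_counts.items()
}

for region, share in shares.items():
    print(f"{region:16s}{region_counts[region]:>3d}  {share:>5.1f}%")
```

The six counts sum to the 51 reviewed studies, and the computed shares for East Asia and the Middle East match the reported 37.3% and 33.3%.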
(1b) Educational contexts.
To examine the educational contexts represented in the literature, all 51 articles were analyzed based on their focal teaching settings, as depicted in Figure 3.

Distribution of educational contexts.
Figure 3 shows the distribution of studies on GenAI in EFL instruction across six educational contexts: higher education, secondary education, adult education, elementary education, special education, and primary to upper secondary education. Additionally, 6% of the studies did not specify a fixed educational context, including those by Mena Octavio et al. (2024), Stewart and Zheng (2024), and Parviz (2024). Higher education (66%) was the primary focus, followed by secondary education (16%). Studies covering primary to upper secondary education (6%) were more frequent than those focused solely on elementary education (2%). Research on adult education, elementary education, and special education was the least prevalent, each making up 2%. Notably, most studies on GenAI in EFL instruction focused on a general EFL context, with one exception: Fan et al. (2024) specifically concentrated on an English for Academic Purposes (EAP) context in higher education.
(1c) Learning domains.
As presented in Figure 4, there were five primary language learning domains where GenAI was applied in EFL instruction: writing (n = 28, 55%), speaking (n = 2, 4%), grammar (n = 1, 2%), reading (n = 1, 2%), and unspecified domains (n = 19, 37%). Writing dominated the field, among which one study by Sapan and Uzun (2024) investigated the role of GenAI in writing and vocabulary instruction. A substantial proportion of studies (37%) did not focus on a specific language skill but examined GenAI’s role in lesson preparation (e.g., Milad & Fayez, 2024; Yeh, 2025), instructional implementation (e.g., Mena Octavio et al., 2024), and teacher competence development (e.g., Kartal, 2024; Korucu-Kış, 2024). Only a few studies examined GenAI in speaking, grammar, and reading. No studies explored GenAI’s role in listening skills, despite its importance as a core EFL competency as emphasized by Lo et al. (2024).

Distribution of learning domains.
(2) Research methods.
Regarding research methods, the review examines (2a) the research approaches employed, (2b) the sample sizes involved, and (2c) the research topics and data sources used. Table 3 presents a summary of the key findings.
Key Findings and Implications of Research Methods.
Methodologically, mixed-methods approaches were most common, followed by qualitative approaches, while quantitative studies were comparatively limited. Moreover, most studies involved large sample sizes, with fewer small and medium sample sizes. In terms of topics and data collection, the selected studies tended to rely on subjective data from perceptions and self-report instruments, such as questionnaires and interviews to examine educational outcomes, pedagogical benefits, teacher perceptions, and assessment-related issues. Topics and data sources based on objective performance were underrepresented.
(2a) Research approaches.
As shown in Figure 5, research approaches were categorized into three types: quantitative methods, qualitative methods, and mixed methods. Mixed methods were the most common approach, comprising 43% of the total, followed by qualitative research, representing 37%, while quantitative research was the least common, at 20%.

Distribution of research approaches.
Table 4 details the distribution of research approaches by their early access year. The findings revealed a distinct turning point for the field: no relevant studies (n = 0) were found prior to 2023. The field emerged with a small number of publications in 2023 (n = 4) and then surged in 2024 (n = 47).
Research Approaches by Early Access Year (n = 51).
(2b) Research sample sizes.
In line with the criteria of Liang et al. (2023), a sample size of no more than 10 was classified as small, a sample size between 11 and 30 as medium, and a sample size of more than 30 as large. As shown in Figure 6, large sample sizes were most frequently used (n = 26; 51%), followed by medium sample sizes (n = 14; 27%), while small sample sizes were the least common (n = 11; 22%).
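The classification rule above can be written as a simple function. This is a sketch under one stated assumption: a sample of exactly 30 is counted as medium, so that the medium and large bands do not overlap. The function name and the example sample sizes are hypothetical and are not data from the reviewed studies.

```python
from collections import Counter


def classify_sample_size(n: int) -> str:
    """Classify a study's sample size as small, medium, or large.

    Assumption: exactly 30 participants counts as medium, so the
    bands are 1-10 (small), 11-30 (medium), and 31+ (large).
    """
    if n < 1:
        raise ValueError("sample size must be a positive integer")
    if n <= 10:
        return "small"
    if n <= 30:
        return "medium"
    return "large"


# Hypothetical sample sizes, for illustration only.
example_sizes = [3, 10, 11, 24, 30, 31, 120]
tally = Counter(classify_sample_size(n) for n in example_sizes)
print(dict(tally))
```

Applying such a function to the extracted sample size of each study and tallying the labels would yield the kind of distribution summarized in the figure.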

Distribution of research sample sizes.
(2c) Topics and data sources.
Figure 7 presents the methodological preferences in the research on the role of GenAI in EFL instruction from 2020 to 2024. Research on GenAI in EFL instruction was generally categorized into four main topics, including educational effects, pedagogical strengths, teacher perspectives, and evaluation, utilizing 11 data sources, including questionnaires/surveys, interviews, tests/assessments, reflective records, artifacts, technology-based records, observation records, lesson plan templates, feedback/revision records, scales, and rating criteria/rubrics.

Topics and data sources.
Notably, most studies incorporated multiple data sources. Interviews (n = 25; 25.8%) and questionnaires (n = 20; 20.6%) were the primary data collection methods for studying the roles of GenAI in EFL instruction and were widely applied across all four research topics. This indicated that subjective data played a significant role in investigating the role of GenAI in EFL instruction. Studies on EFL assessment tended to collect data from multiple dimensions and perspectives, employing the widest range of data sources (all except lesson plan templates), with questionnaires/surveys (n = 11; 22.9%) being the most frequently used. Studies on teacher perspectives mainly relied on interviews (n = 7; 53.8%), followed by questionnaires/surveys (n = 4; 30.8%), whereas artifacts and technology-based records were the least utilized. Studies on pedagogical strengths utilized seven data sources, with interviews (n = 6; 33.3%) being the most common, followed by observations (n = 5; 27.8%). However, technology-based records, feedback/revision records, scales, and rating criteria/rubrics were not employed. Studies on educational effects utilized scales, technology-based records, reflective records, tests/assessments, interviews, and questionnaires/surveys as data sources. Among these, interviews were the most frequently used (n = 5; 27.8%), followed by scales and questionnaires/surveys (22.2% each).
(3) Research issues.
The key findings related to these research issues are summarized in Table 5.
Key Findings Related to Research Issues.
As illustrated in Table 5, four prominent findings are summarized. Most notably, the reviewed studies demonstrate a versatile yet double-edged role of GenAI in the instruction of writing, reading, speaking, grammar, and general EFL courses. In particular, it mainly functions as a feedback provider, automatic rater, teaching assistant, study buddy, psychological and cognitive facilitator, and instructional designer. It shows potential in assisting lesson planning, implementation, and assessment, and in facilitating the psychological and cognitive development of teachers and learners. However, challenges remain, such as rigid lesson planning, overreliance, and concerns about feedback accuracy and academic integrity. Secondly, research issues are markedly uneven, focusing primarily on GenAI’s role as an assessor (feedback provider and rater) across writing, speaking, grammar, and general EFL instruction, while other roles are underrepresented. Thirdly, while findings are inconsistent on how GenAI feedback and ratings compare with those of teachers, as well as on GenAI’s effects on motivation, researchers have commonly found that students prefer teacher feedback (Zeevy-Solovey, 2024; Zou et al., 2025). Finally, studies on general EFL instruction predominantly focus on teachers’ perspectives but pay limited attention to other perspectives.
(3a) Research issues in writing instruction (n = 28).
Writing instruction constituted the primary research focus within the literature (n = 28; 54.9%), centering on three key issues concerning the efficacy of feedback (n = 14; 50%), grading (n = 4; 14.3%), and broader pedagogical application and effects (n = 10; 35.7%), demonstrating the positive and negative roles of GenAI.
GenAI’s Efficacy as a Feedback Provider
As a feedback provider, GenAI played a double-edged role. Positively, it delivered immediate, direct, comprehensive, abundant feedback to students and corrected language errors (Almashy et al., 2024; Gozali et al., 2024; Guo et al., 2024; Guo & Wang, 2024; Polakova & Ivenz, 2024; Teng, 2024). Ultimately, it enhanced feedback quality and promoted students’ motivation, collaboration, and writing skills (Gozali et al., 2024; Guo et al., 2024; Polakova & Ivenz, 2024; Teng, 2024). Compared to teacher feedback, GenAI feedback was generally more detailed, direct, and structured (Guo & Wang, 2024). Moreover, GenAI’s comments on essay content contained more praise and were therefore more favored by students (Guo & Wang, 2024; Zou et al., 2025). Negatively, GenAI feedback was found to include irrelevant comments, excessive length, inaccessibility, incorrect and confusing feedback, and ethical and contextual insensitivity (Bucol & Sangkawong, 2025; Gozali et al., 2024; Guo & Wang, 2024; H. S. Long, 2024). Compared to teacher feedback, GenAI feedback focused on more superficial aspects and demonstrated lower quality (Fan et al., 2024; Zou et al., 2025), resulting in lower student acceptance and preference and less progress (H. S. Long, 2024; D. Yan, 2024; Zeevy-Solovey, 2024; Zou et al., 2025).
Notably, findings were divided. Some studies suggested that GenAI outperforms teachers by offering more praise in feedback, which is supposed to encourage students’ engagement in revision (Guo & Wang, 2024), whereas others argued that GenAI underperforms teachers both in accuracy and in motivating students’ engagement in revision (Fan et al., 2024; Zou et al., 2025). Still others found GenAI feedback as effective as teacher feedback (Alsofyani & Barzanji, 2025).
GenAI’s Efficacy as an Automated Rater
As a rater, GenAI demonstrated human-like grading with greater consistency, validity, objectivity, and systematicity than human raters (Bucol & Sangkawong, 2025; J. Li et al., 2024; Shin & Lee, 2024; Yavuz et al., 2025). However, it tended to be more lenient, assigning higher scores and sometimes overlooking specific criteria or exhibiting comprehension biases when evaluating complex or creative writing (Bucol & Sangkawong, 2025; Shin & Lee, 2024).
Nevertheless, there was also divergence. While Yavuz et al. (2025) observed similar scores between GenAI and human teachers on content and organization, Shin and Lee (2024) reported more lenient and higher scores from GenAI, whereas J. Li et al. (2024) found it overall outperformed human raters.
GenAI’s Efficacy as a Teaching and Learning Assistant
Studies examining the broader pedagogical application and effects revealed the roles of GenAI in assisting teaching and learning. For learners, it enhanced learning experience (Hieu & Thao, 2024; Huang & Mizumoto, 2025), engagement (Hieu & Thao, 2024; Polakova & Ivenz, 2024; Teng, 2024; Woo et al., 2024), motivation and self-efficacy (Huang & Mizumoto, 2025; Z.-M. Liu et al., 2024; Song & Song, 2023; Teng, 2024), and feedback literacy (Gozali et al., 2024), and thereby promoted writing performance and ability (Ghafouri et al., 2024; Guo et al., 2024; Z.-M. Liu et al., 2024; Polakova & Ivenz, 2024; Woo et al., 2024). For teachers, it supported professional development, self-efficacy, workload reduction, and broader technological adoption while fostering critical thinking and creativity (Ghafouri et al., 2024; Hieu & Thao, 2024). However, limitations in contextual alignment, integration with existing methods, language quality, and technical resources may lead to over-reliance, ethical risks, and reduced student creativity (Hieu & Thao, 2024; Song & Song, 2023; Stewart & Zheng, 2024), resulting in smaller gains in vocabulary, writing, and learner satisfaction compared to teacher-led instruction (Ahmed, 2023; Sapan & Uzun, 2024; Sawangwan, 2024).
Findings on motivation also diverged: some studies reported that GenAI enhanced student motivation (Huang & Mizumoto, 2025; Z.-M. Liu et al., 2024; Song & Song, 2023), whereas Woo et al. (2024) observed only a marginal, non-significant increase in motivation.
(3b) Research issues in oral instruction (n = 2).
Only two studies investigated the role of GenAI in EFL oral instruction, focusing on the psychological and emotional impacts of GenAI in communicative practice and speaking assessment within higher education.
Yıldız (2024) investigated the role of GenAI in communicative activities as a study buddy, revealing that GenAI could boost students’ speaking self-efficacy, confidence, and enjoyment while reducing stress by creating a supportive and non-judgmental environment for oral practice. Nevertheless, with a majority of feedback focusing on grammar and vocabulary, there was limited feedback on pronunciation, intonation, and stress patterns.
Sayed et al. (2024) delved into GenAI’s role as a rater and feedback provider in oral tests and reported that GenAI enhanced oral skills, autonomy, and academic resilience by providing instant, personalized feedback in a non-judgmental setting, reducing anxiety and increasing motivation. Additionally, GenAI eased teachers’ workloads and made curricula more dynamic, suggesting that curriculum designers might benefit from incorporating GenAI to promote mental health, autonomy, and GenAI-assisted testing (Sayed et al., 2024).
(3c) Research issues in grammar instruction (n = 1).
A single experimental study by Kucuk (2024) focused on the role of GenAI as a learning assistant and feedback provider in grammar instruction by comparing GenAI-led instruction with teacher-led instruction.
The findings again revealed a double-edged role for GenAI. Positively, this study found that GenAI-assisted grammar instruction resulted in more significant improvement in university students’ grammar proficiency than teacher-led grammar instruction. Moreover, GenAI obtained high satisfaction due to its interactive, personalized, and constant support. However, concerns were raised about ambiguous responses and insufficient feedback, which may reduce students’ critical thinking skills, increase their dependence on technology, and lead to irrational use. Despite these challenges, Kucuk’s (2024) study ultimately concluded that the benefits of GenAI-assisted grammar instruction outweighed its disadvantages.
(3d) Research issues in reading instruction (n = 1).
A single qualitative study conducted by Xin (2024) focused on the role of GenAI as a teaching assistant for teaching material development in reading instruction.
Based on three EFL teachers’ usage of GenAI in EFL reading instruction, Xin’s (2024) study identified both the advantages and limitations of using GenAI in developing instructional materials, particularly for text modification, task design, and acquiring instructional suggestions. Results revealed that GenAI was conducive to increasing efficiency and the ability to generate novel and learner-centered tasks. Nevertheless, teachers found that the tool could not interpret multimodal elements (like posters or charts) within a PDF and sometimes provided unreliable answers or suggestions that were not pedagogically sound.
Therefore, Xin (2024) also suggested that teachers must rely on their pedagogical expertise, understanding of students, and linguistic awareness to make informed judgments when using GenAI for instructional materials development. The study also proposed a D-R-E-A-M model (Determine, Render, Evaluate, Adjust, Make decision) to guide teachers in developing instructional materials with GenAI.
(3e) Research issues in general EFL instruction (n = 19).
Studies related to the role of GenAI in general EFL instruction mainly focused on three issues: teachers’ perspectives on GenAI roles (n = 11), psychological and cognitive impacts (n = 4), and GenAI’s efficacy in lesson planning and implementation (n = 4).
Teachers’ Perspectives on GenAI as a Teaching Assistant (n = 11)
Eleven studies examined perspectives on GenAI in EFL instruction from general teachers (n = 6), preservice teachers (n = 4), and special education teachers (n = 1).
Overall, EFL teachers generally held a positive yet cautious view of GenAI (Zaiarna et al., 2024). They recognized its value particularly in teaching preparation and assessment (Derakhshan & Ghiasvand, 2024; Mohamed, 2024; Parviz, 2024; Ulla et al., 2023; Zaiarna et al., 2024). However, significant concerns were raised, primarily regarding over-dependence, trustworthiness, academic integrity and teacher-student interaction (Derakhshan & Ghiasvand, 2024; Gao et al., 2024; Ulla et al., 2023; Zaiarna et al., 2024).
Preservice teachers acknowledged GenAI’s role in professional growth despite its limitations (Kartal, 2024; Kusuma et al., 2024; Mustroph & Steinbock, 2024; Wulandari & Purnamaningwulan, 2024). However, similar concerns were also mentioned, including information quality, accuracy, over-reliance, ethical risks, and misinformation (Kusuma et al., 2024; Wulandari & Purnamaningwulan, 2024). Therefore, they proposed that effective integration required human-GenAI collaboration, critical analysis, and creativity (Kartal, 2024; Mustroph & Steinbock, 2024).
In special education, Alenezi et al. (2023) found that teachers’ attitudes toward GenAI were moderate, with female teachers showing greater willingness for future use.
GenAI’s Efficacy as a Psychological and Cognitive Facilitator (n = 4)
Existing literature demonstrated that GenAI exerted multiple psychological and cognitive effects in EFL instruction. Y. J. Lee and Davis (2024) highlighted its role in boosting learners’ motivation, interest, and confidence. Ghafouri et al. (2024) found that structured GenAI teaching models helped build emotionally supportive learning environments and students’ psychological grit. Korucu-Kış (2024) noted its potential to enhance teacher creativity, though effectiveness depended on expertise and faced challenges like input accuracy and content repetition. Hınız (2024) argued that while GenAI provided diverse materials and promoted inclusiveness, it also raised concerns about plagiarism and cognitive skill development.
GenAI’s Efficacy as an Instructional Designer and Facilitator (n = 4)
Studies also indicated GenAI’s multifaceted role in EFL lesson planning and implementation (Mena Octavio et al., 2024; Williyan et al., 2024; Yeh, 2025). Williyan et al. (2024) revealed that GenAI could assist teachers in lesson design, classroom introduction, content presentation, practice activities, immediate feedback, and assessment, fostering adaptability and creativity. Mena Octavio et al. (2024) confirmed its positive impact on planning, implementation, and assessment. Yeh (2025) noted its role in personalizing instruction, increasing interactivity, and making classrooms more student-centered. However, Milad and Fayez (2024) compared a GenAI lesson plan with that of student teachers and found that GenAI-generated lesson plans followed a rigid, linear structure, relied heavily on vague teacher instructions, and lacked detailed interaction design.
Discussion
Imbalanced Distribution of Application Contexts
The imbalanced distribution of study locations, with East Asia and the Middle East most common and limited research in Europe, Africa, and South Asia, may be interpreted through the Technology Acceptance Model (TAM), which holds that perceived usefulness drives the willingness to adopt technology (Davis, 1989). In East Asia, the Middle East, and Southeast Asia, AI is seen as more beneficial than risky, but perceived risks dominate in Europe and South Asia, and infrastructural limitations constrain access in Africa (Maslej, 2025; Neudert et al., 2020).
The dominance of higher education is associated with classical cognitive development theories which suggest university students embrace more advanced abstract thinking and critical reasoning abilities (Piaget & Duckworth, 1970). Consequently, their understanding of artificial intelligence becomes more comprehensive and in-depth than at earlier stages (Staikova et al., 2024), thereby enhancing the feasibility of applying GenAI in EFL instruction.
The strong focus on writing, with little or no attention to other skills, aligns closely with second language acquisition (SLA) theories. According to the Output Hypothesis, language internalization occurs through active language production (Swain, 1995). Powered by LLMs, many GenAI applications excel in text processing and generation (R. Lee, 2025), thereby reinforcing the output–feedback–revision cycle that underpins writing development (Gozali et al., 2024; Song & Song, 2023). In contrast, speaking depends on authentic, two-way interaction, as emphasized in M. H. Long’s (1996) Interaction Hypothesis, which remains challenging for GenAI to replicate (Michel-Villarreal et al., 2023). For receptive skills such as reading and listening, GenAI can generate abundant comprehensible input consistent with Krashen’s (1985) Input Hypothesis. However, its role remains largely supportive rather than essential. Similarly, grammatical acquisition, which relies on contextualized language production, is more effectively fostered through writing than through isolated grammar exercises.
Preference for Research Methods
The analysis of research approaches indicated the prevalence of mixed-methods approaches, followed by qualitative and quantitative approaches. This finding contrasts with previous reviews on similar but broader topics, which predominantly featured quantitative studies, such as Liang et al.’s (2023) study on the role of AI in language education, Hwang and Chang’s (2023) study on chatbots in education, and Batista et al.’s (2024) study on GenAI in education. This indicates the need for further support from quantitative empirical research. Moreover, research began in 2023 and surged within a year, suggesting that the field may still be in the exploratory stage. This phenomenon may be related to the public release of ChatGPT on November 30, 2022 (Lo, 2023).
The high number of large-sample studies seems contradictory to the lack of quantitative research. However, this can be explained by the prevalence of mixed-methods research, which tends to involve larger sample sizes (e.g., Ghafouri, 2024; Sawangwan, 2024; D. Yan, 2024). This suggests that current research prioritizes statistical validity and may overlook the in-depth analysis that small-sample studies afford.
In terms of focus and data collection, the selected studies tend to center on educational outcomes, pedagogical benefits, teacher perceptions, and assessment-related issues, predominantly using questionnaires and interviews as data sources. This indicates a focus on subjective perceptions, lacking objective quantification. This finding aligns with S. Lee et al. (2025), who noted that research on GenAI in language classrooms is still at an early stage and largely relies on subjective data.
Discussion of Research Issues Identified
The double-edged role of GenAI in EFL instruction accords with the constructivist view of learning as a process of active knowledge construction (Vygotsky et al., 1978). On the one hand, it can function as a powerful scaffold, accelerating skill acquisition in diverse ways, such as quickly correcting errors and enhancing feedback quality (Gozali et al., 2024; Polakova & Ivenz, 2024). On the other hand, it enables students to evade necessary cognitive challenges through shortcuts such as plagiarism (Bucol & Sangkawong, 2025). This leads to significant risks of over-reliance and a decline in creativity, as noted in several reviewed studies (Hieu & Thao, 2024; Song & Song, 2023).
Findings suggest a predominant concentration on applying GenAI for assessment, including feedback and grading, especially in writing instruction, whereas research on its role in teaching implementation and preparation remains limited. This trend can be interpreted through SLA theories. Assessment aligns closely with the Output Hypothesis (Swain, 1995), which posits that learners internalize language knowledge through a cyclical output–feedback–revision process. Language output generally encompasses both speaking and writing (Zhang et al., 2024). GenAI facilitates this process by providing instant, personalized, and non-judgmental feedback, which enables learners to identify and correct linguistic errors, thereby promoting rapid internalization of language knowledge (Almashy et al., 2024; Sayed et al., 2024). The directness and measurability of this feedback render such studies easier to design and quantify, thereby contributing to their prevalence in the literature. Conversely, teaching implementation and preparation pertain to the Input Hypothesis (Krashen, 1985) and the Interaction Hypothesis (M. H. Long, 1996), both of which emphasize comprehensible input and meaning negotiation. Research in these domains is more complex owing to contextual factors, technological constraints, learner variability, and instructional design (Zhang & Dong, 2024), leading to a comparatively smaller body of empirical studies that focus on authentic classroom applications (S. Lee et al., 2025).
The divergence in findings regarding feedback, scoring, and effectiveness in writing instruction may stem from both technical variations and differences in research participants. For instance, J. Li et al. (2024) employed ChatGPT-4, whereas Yavuz et al. (2025) compared ChatGPT and Google’s Bard. In terms of participants, Woo et al. (2024) examined secondary school students, while Huang and Mizumoto (2025) focused on university students.
Learners consistently demonstrate a lower preference for GenAI feedback than for teacher feedback (Zeevy-Solovey, 2024; Zou et al., 2025). This result can be interpreted through Vygotsky et al.’s (1978) Sociocultural Theory, in which learning is socially and emotionally mediated. Despite its multiple strengths, GenAI lacks the empathetic and emotional dimensions inherent in human instruction, prompting learners to show a clear preference for teacher-led guidance (Michel-Villarreal et al., 2023).
The predominance of the teacher perspective in studies of general EFL instruction can also be explained from a constructivist standpoint, in which teachers play a crucial role in designing learning environments and guiding the learning process (Vygotsky et al., 1978). Therefore, they serve as “gatekeepers” in facilitating the effective integration of GenAI into education (Yue et al., 2025). However, since learning involves learners’ active construction of knowledge (Piaget & Duckworth, 1970), the student perspective is equally essential and warrants greater attention.
Significantly, the findings are not merely transient technological effects but emerge from the interaction of relatively stable factors, including technology acceptance, students’ cognitive maturity, the inherent characteristics of GenAI, socio-cultural contexts, and teachers’ mediating roles. Given the stability of these factors, the conclusions of this study carry enduring pedagogical implications.
Conclusion and Limitations
To conclude, this systematic review analyzed 51 studies related to the role of GenAI in EFL instruction, guided by Page et al.’s (2021) PRISMA protocols. To capture the most recent publications, the timespan (2020–2024) was applied to the early access year. The findings reveal three key features of the current research landscape: (1) contextual imbalance, with a heavy focus on higher education, East Asia and the Middle East, and writing instruction; (2) methodological preferences for mixed and qualitative methods, large sample sizes, and subjective data; and (3) research issues exhibiting GenAI’s versatile but double-edged role, an emphasis on assessment, divergent results on the effectiveness of GenAI in writing instruction, students’ preference for teacher feedback, and teacher-centeredness in studies of general EFL instruction.
This study provides significant implications for both teaching and policy. For teachers, given the double-edged role of GenAI, it is recommended to critically integrate the technology rather than simply adopt it (Xin, 2024). Teachers should employ GenAI as a scaffolding tool (Guo & Wang, 2024; Zou et al., 2025), while guiding students on its limitations, including bias, overreliance, misinformation, and risks to academic integrity (Bucol & Sangkawong, 2025; Kusuma et al., 2024; Wulandari & Purnamaningwulan, 2024). For policymakers, in view of potential cultural biases and ethical concerns (Song & Song, 2023), institutions should establish clear ethical guidelines and safeguards and provide professional development programs that emphasize the pedagogical use of GenAI, rather than treating it as a mere technological tool.
Future research on GenAI in EFL instruction should focus on three priorities: first, expanding contextual diversity by including underrepresented regions (e.g., Europe, Africa, South Asia), diverse educational levels (e.g., K-12, adult, and special education), and multiple language skills beyond writing (e.g., speaking, reading, and listening); second, optimizing research methods by increasing quantitative studies, incorporating small- and medium-sized samples, and integrating objective performance data rather than relying solely on self-reports; and third, deepening research issues by examining GenAI’s multifaceted roles throughout the full instructional cycle (S. Lee et al., 2025), conducting replication studies to address inconsistencies in feedback and grading, and exploring perspectives beyond teachers, such as those of students and administrators.
This study has two main limitations. First, in view of the rapid advancement of GenAI, the specific publication period (2020–2024) may limit the inclusion of the most recent technological developments. To address this limitation, this review incorporated early access articles from early 2025 and integrated the latest literature from 2025 into the research background, research issues, and Discussion sections to ensure the currency of the study. Moreover, because the findings rest on stable sociocultural contexts, intrinsic technological mechanisms, and teachers’ pivotal roles rather than on specific software versions, they possess lasting explanatory significance. Second, restricting literature selection to two databases (Scopus and WoS) may result in imbalance and bias. For instance, the focus on higher education limits generalizability to K-12 and other contexts, and the reliance on subjective data sources like questionnaires and interviews may introduce bias. Future studies should broaden the publication range and databases to include diverse educational settings and adopt robust quantitative methods to better understand GenAI’s role in EFL instruction.
Footnotes
Acknowledgements
We thank the anonymous reviewers for their constructive feedback and insightful comments, which significantly improved the quality of this manuscript.
Ethical Considerations
This study is a systematic literature review of previously published studies. No human participants were directly recruited or involved in this research. Therefore, ethical approval was not required.
Consent to Participate
Since this study relies exclusively on secondary data from publicly available academic publications, informed consent was not applicable.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
The data supporting the findings of this study are available from the corresponding author upon reasonable request.
