Abstract
This study investigated the role of ChatGPT4o as an AI peer assessor in English-as-a-foreign-language (EFL) speaking classrooms, with a focus on its scoring reliability and the effectiveness of its feedback. The research involved 40 first-year English major students from two parallel classes at a Chinese university. Twenty students from one class served as speech sample providers, and the other 20 served as human peer assessors. In addition, ChatGPT4o served as an AI peer assessor. The study employed univariate and multivariate generalizability (G-) theory to compare the consistency and reliability of holistic and analytic scoring between ChatGPT4o and human peer assessors. The results demonstrated that ChatGPT4o provided significantly more consistent and reliable scores across domains such as accuracy, fluency, and complexity. Moreover, ChatGPT4o delivered more comprehensive and effective feedback, offering clear guidance for improvement. However, interviews with human peer assessors revealed concerns about ChatGPT4o’s limitations in capturing the subtle aspects of spoken language, such as emotion and cultural context, and about potential over-reliance on ChatGPT4o in EFL assessments. The findings suggested that while ChatGPT4o as an AI peer assessor can enhance the reliability and quality of peer assessments, its adoption should be carefully managed to complement, rather than replace, human judgment, ensuring a balanced approach in EFL speaking classrooms.
Introduction
In recent years, the incorporation of artificial intelligence (AI) into language education has attracted considerable attention as both educators and researchers investigate its potential to improve teaching, learning, and assessment processes (Guo & Wang, 2023; Link, Mehrzad, & Rahimi, 2022; Shermis & Hamner, 2013). AI has brought about new methods that are transforming instructional practices and assessment approaches, with particular emphasis on its ability to deliver efficient and personalized educational experiences (Ansari et al., 2023; Barrot, 2023; Creely, 2024; Farazouli et al., 2024; Tate et al., 2024). Among these AI advancements, OpenAI’s ChatGPT has gained recognition as a significant tool, especially for its role in evaluating English-as-a-foreign-language (EFL) tasks (Kasneci et al., 2023; Lu et al., 2024).
In the context of EFL writing, assessments demand considerable expertise and resources to ensure they are both accurate and fair (Huang, 2012; Li & Huang, 2022). As a result, there has been increasing interest in utilizing AI to enhance both peer and teacher assessments in EFL writing classrooms (Li et al., 2024; Zhang, 2023a, 2023b). Research indicates that AI tools like ChatGPT can automate evaluation processes, deliver immediate feedback, and customize learning experiences to meet the needs of individual students (Lu et al., 2024; Praphan & Praphan, 2023; Zou & Huang, 2023a, 2023b). For instance, Lu et al. (2024) found that ChatGPT could effectively complement traditional teacher evaluations in undergraduate academic writing by providing consistent and prompt feedback.
Although much research has concentrated on EFL writing assessments, assessing EFL speaking skills presents similar challenges, requiring extensive expertise and resources to produce reliable results (Belmamoune, 2022; de Jong, 2023; Huang et al., 2020; Khusnia, 2015; Wu et al., 2022). Human raters often face difficulties in maintaining consistency and objectivity when evaluating speaking proficiency (de Jong, 2023; Wu et al., 2022). Reflecting the similarities between writing and speaking assessments, raters of speaking must evaluate accuracy, fluency, and complexity, much as they assess language, content, and organization in writing (Li et al., 2024; Wu et al., 2022). These parallel demands create challenges in ensuring consistent and objective evaluations, whether in speaking or writing tasks. AI tools like ChatGPT4o could help mitigate these challenges by offering consistent evaluations across both modalities. However, the application of AI technologies like ChatGPT4o in EFL speaking assessment has been underexplored. This lack of research is particularly significant in countries like China, where the population of EFL learners is the largest globally (Christou et al., 2024).
This study aimed to fill the research gap by exploring the potential of ChatGPT4o, an advanced iteration of ChatGPT, as an AI peer assessor in Chinese EFL speaking classrooms (Murad et al., 2024; Shahriar et al., 2024). The decision to focus on AI as a peer assessor stems from the growing interest in peer assessment in educational contexts, which allows students to engage actively in the assessment process, receive feedback from multiple perspectives, and contribute to a more personalized learning environment (Li et al., 2022). In large classrooms, where individual feedback from teachers may be limited (Li et al., 2022; Wu et al., 2022), peer assessment supported by AI tools like ChatGPT4o can significantly enhance the efficiency and effectiveness of evaluations. By integrating AI, this study attempted to explore how peer assessment, traditionally a time-intensive process, can be refined through AI’s consistent scoring capabilities, while still benefiting from human judgment in key aspects such as emotional tone and cultural context.
Therefore, this study explored the reliability of ChatGPT4o’s scoring of EFL speech samples and the effectiveness of its feedback. It provided crucial insights into the integration of AI in peer assessment in EFL speaking classrooms to help deliver consistent and effective feedback to EFL students.
Literature Review
Unlike the assessments of EFL receptive skills (i.e., listening and reading), the assessments of EFL productive skills (i.e., writing and speaking) require subjective judgments from raters on different aspects of writing and speaking samples (Li et al., 2024; Li et al., 2022; Wu et al., 2022). Much research on the assessment of EFL productive skills has concentrated on writing, while the speaking domain has received comparatively less attention; moreover, EFL writing and speaking assessments share many similarities. Therefore, this section reviews literature on both EFL writing and speaking assessments.
Reliability and Feedback in EFL Writing and Speaking Assessments
Reliability is a fundamental principle in effective EFL writing and speaking assessments, ensuring consistent measurement outcomes across different instances of the assessment process. According to the Standards for Educational and Psychological Testing (AERA, APA, & NCME, 2014), reliability is defined as the extent to which scores are free from random error, highlighting the need for consistent results across raters and occasions. Both EFL writing and speaking assessments pose challenges in this regard: just as writing assessments often struggle with biases in scoring, speaking assessments also encounter issues of inconsistency, especially when evaluating the speech samples of EFL speakers (Li et al., 2022).
Inter-rater reliability, which evaluates the degree of agreement among different raters scoring the same responses, has been extensively studied in EFL writing and speaking assessments (Han & Huang, 2017; de Jong, 2023; Zhang et al., 2023b). For example, research by Han and Huang (2017) and de Jong (2023) has explored strategies to improve inter-rater reliability in EFL writing and speaking assessments, respectively, emphasizing the importance of rigorous training and calibration for assessors.
Beyond reliability, providing qualitative feedback is crucial in EFL writing and speaking assessments, as it helps students identify their strengths and areas for improvement (Carless et al., 2011; Hattie & Timperley, 2007; Hu & Zhang, 2014). Effective feedback should be clear, understandable, and constructive to support students’ progress in language proficiency (Li et al., 2022; Wu et al., 2022; Hu & Zhang, 2014). Recent research has emphasized the importance of delivering feedback that addresses language accuracy, fluency, and complexity in speaking assessments, calling for more systematic approaches to feedback delivery (Li et al., 2022).
Applying Univariate and Multivariate G-Theory in Scoring EFL Writing and Speaking
Generalizability (G-) theory, developed by Cronbach et al. (1972), goes beyond classical test theory (CTT) by allowing for the analysis of multiple sources of error simultaneously, providing a more powerful framework than CTT for evaluating the reliability of scoring in EFL writing and speaking assessments, where subjective judgments are common (Brennan, 2001a; Shavelson & Webb, 1991). Unlike other frameworks, such as many-facet Rasch measurement, G-theory decomposes observed score variance into its separate sources (e.g., persons, raters, and their interactions), offering a more comprehensive picture of assessment reliability (Brennan, 2001a; Shavelson & Webb, 1991).
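To make this decomposition concrete, the following equations sketch the single-facet, fully crossed person-by-rater (p × r) random-effects design that underlies the analyses reported later, together with the phi-coefficient used for criterion-referenced interpretations; the notation follows standard G-theory conventions (Brennan, 2001a) rather than any formula reproduced in this article:

$$X_{pr} = \mu + \nu_p + \nu_r + \nu_{pr,e}, \qquad \sigma^2(X_{pr}) = \sigma^2_p + \sigma^2_r + \sigma^2_{pr,e}$$

$$\Phi = \frac{\sigma^2_p}{\sigma^2_p + \dfrac{\sigma^2_r + \sigma^2_{pr,e}}{n'_r}}$$

Here σ²p is the desirable variance attributable to examinees, σ²r and σ²pr,e are the undesirable rater and residual variances, and n′r is the number of raters assumed in a D-study.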
Many studies have adopted the G-theory approach in EFL writing assessments (e.g., Zhao & Huang, 2020a; Li & Huang, 2022; Li et al., 2024; Han, 2017; Whipple, 2016). For example, Whipple (2016) and Zhao and Huang (2020a) employed G-theory to examine how a single-rater system impacts the score reliability of EFL writing assessment. The findings of both studies indicated that the current single-rater system cannot achieve an acceptable reliability coefficient. Later, Han (2017) and Li and Huang (2022) used G-theory to investigate the impact of essay quality on the holistic score reliability of EFL essays. Both studies reported that the scores assigned to high-quality essays demonstrated higher reliability than those assigned to low-quality essays. The rater interview results indicated that most raters considered only language or content while marking essays of low overall quality but considered language, content, and organization while marking essays of high overall quality. Recently, Li et al. (2024) employed G-theory to evaluate the role of ChatGPT4 in enhancing EFL writing assessment in classroom settings and reported that ChatGPT4 achieved consistently higher scoring reliability than the teacher raters.
While G-theory has been widely used in EFL writing assessments, it is equally beneficial for examining the reliability of EFL speaking assessments, which often involve more subjectivity. It allows for the simultaneous analysis of various sources of error, such as raters, tasks, and linguistic complexity, and thus provides a more comprehensive approach to assessing the consistency of scoring, which is especially crucial in EFL speaking assessments where evaluations can vary widely (Li et al., 2022).
Software programs like GENOVA (Crick & Brennan, 1983) and its multivariate version, mGENOVA (Brennan, 2001b), have been critical in conducting G-studies that assess the impact of various sources of variance, such as raters and tasks, on overall assessment outcomes. These tools have been used to ensure both holistic and analytic scoring reliability in EFL writing and speaking assessments, thereby improving the quality and fairness of these evaluations (Gao & Brennan, 2001; Li et al., 2022; Wu et al., 2022). The present study used both univariate and multivariate G-theory frameworks to assess the reliability of ChatGPT4o’s scoring in EFL speaking assessments, aiming to provide a thorough analysis of its potential as an AI peer assessor.
Peer Assessments in EFL Writing and Speaking
Extensive research has demonstrated the benefits of peer assessments in EFL writing and speaking, showing their effectiveness in improving students’ writing and speaking skills and fostering learner autonomy (Fathi et al., 2019; Li et al., 2020; Lee & Evans, 2019; Liu & Carless, 2006; Matsuno, 2009; Shen et al., 2020; Su et al., 2023; Wu et al., 2022). Shen et al. (2020) found that peer assessments significantly enhanced learner autonomy, reduced dependence on teachers, and increased students’ confidence. Matsuno (2009) used multifaceted Rasch modeling to compare peer and self-assessments with teacher assessments in writing classes, revealing the beneficial role of peer assessments. Similarly, Sun et al. (2023) employed multifaceted Rasch modeling to examine the impact of peer assessment in college English writing classes, noting its effectiveness despite some limitations compared to teacher feedback, particularly in terms of depth and solution-oriented comments.
Research comparing the reliability of peer assessments and teacher assessments has shown that, when peers are adequately trained, their reliability can match that of teachers. Wu et al. (2022) used G-theory to compare the reliability of holistic scores from peer and teacher assessors, finding that the reliability of scores from three peer assessors matched that of one teacher assessor. Peers were comparable to teachers in evaluating the content and organization of EFL essays. Similarly, Li et al. (2022) applied G-theory to compare the reliability of holistic scores from peer and teacher assessors in EFL speaking classrooms, reporting that the reliability of up to two peer assessors’ holistic scoring is equivalent to that of one teacher assessor’s scoring.
Effectiveness of AI-Generated Feedback
AI tools like ChatGPT have transformed the delivery of feedback in educational settings by providing immediate, detailed, and personalized feedback that aids students in improving their language skills and increasing their confidence (Cao & Zhong, 2023; Guo et al., 2022; Yan, 2023). Research has shown that such feedback can significantly enhance students’ learning outcomes by promoting self-directed learning and complementing classroom interactions (Lu et al., 2024; Pang et al., 2024). Moreover, AI-driven feedback systems improve the efficiency and reliability of assessments, offering secure testing and a personalized learning experience that is often more effective than traditional methods (Song & Song, 2023; Su et al., 2023; Zhai, 2023).
In EFL settings, automated scoring systems, such as those used in TOEFL or IELTS, have long been used to assess language proficiency. These systems, however, are generally limited in the depth and adaptability of the feedback they provide, as they rely on fixed, rule-based algorithms (Shadiev & Feng, 2024; Taskıran & Goksel, 2022; Xi, 2010). In contrast, AI-based feedback, such as that from ChatGPT, can deliver more context-sensitive, detailed, and individualized suggestions that cater to the learner’s specific areas of improvement (Evenddy, 2024; Khasawneh, 2024; Li et al., 2024). These differences are particularly significant in language learning, where personalized feedback is crucial to the development of writing, speaking, and listening skills. Studies have shown that AI-generated feedback can supplement or even surpass traditional methods in some areas, providing more immediate and comprehensive support to learners (Evenddy, 2024; Li et al., 2024).
Comparative studies have examined the effectiveness of AI-generated feedback compared to traditional feedback. Traditional feedback typically refers to teacher-provided assessments, such as written comments, verbal feedback, and corrections (Ishchenko & Verkhovtsova, 2019; Mak, 2019). While this form of feedback is personalized, it is often limited by the time constraints teachers face and may lack the immediacy or consistency required for optimal learning. Comparative studies have shown that AI-generated feedback, such as that provided by ChatGPT, offers distinct advantages in areas like vocabulary development and coherence in writing. For example, Cao and Zhong (2023) found that AI-driven feedback significantly improved students’ vocabulary and inter-sentence cohesion. Similarly, research by Song and Song (2023) suggested that EFL learners who received AI-generated feedback outperformed their peers in writing skills and motivation. These results highlight how AI feedback not only compares favorably with traditional feedback but also enhances students’ engagement and performance in assessments.
Research Gaps and Questions
While substantial research has explored the benefits of ChatGPT in EFL writing assessments, the use of ChatGPT4o in assessing EFL speaking proficiency remains largely unexplored. Existing studies on AI-driven EFL speaking assessments are limited, and the potential of ChatGPT4o in peer assessment contexts has not been fully examined. This study aimed to bridge this gap by investigating the reliability and effectiveness of ChatGPT4o in EFL speaking classrooms, focusing on its ability to provide consistent and effective feedback compared to human peer assessors.
The following key research questions guided this study: (a) How does the reliability of ChatGPT4o’s scoring of EFL speech samples compare with that of human peer assessors when both apply the same course-validated rubric? (b) How does the qualitative feedback produced by ChatGPT4o differ from that of human peers in terms of clarity, accuracy, and actionable usefulness? (c) What are human peer assessors’ perceptions of, and willingness to use, ChatGPT4o as an AI peer assessment tool in their EFL speaking classrooms?
Methodology
Participants
This study involved 40 first-year English major undergraduate students from a university in China. The sample consisted of two groups: 20 students who provided speech samples and 20 students who acted as human peer assessors. The speech sample providers participated in a final English speaking examination as part of their regular coursework, and the peer assessors were recruited from a parallel class. To further understand human raters’ perspectives on AI feedback, four of the human peer assessors were selected for follow-up interviews.
The participants were informed about the voluntary nature of their participation and data privacy concerns. Informed consent was obtained from all students prior to data collection. Both the speech sample providers and peer assessors were selected from the same cohort of first-year English majors, which ensured that they shared similar levels of language proficiency.
Materials
EFL Speech Samples
The EFL speech samples used in this study were drawn from the students’ final English speaking examination, where each student was required to give a one-minute oral presentation on the topic “Describe a person whom you like most.” The presentations were recorded individually using audio recording devices in a controlled classroom environment. The speech samples selected for this study were diverse, reflecting a range of proficiency levels.
ChatGPT4o
For the AI peer assessor, ChatGPT4o was employed to evaluate the EFL speech samples. ChatGPT4o was selected for its advanced capabilities in processing oral language, offering a more sophisticated analysis than previous versions of ChatGPT (Murad et al., 2024; Shahriar et al., 2024). The AI’s scoring was based on the same assessment criteria used by human raters, ensuring that it evaluated accuracy, fluency, and complexity in the same way that human assessors would. ChatGPT4o was trained to refine its scoring algorithms to match the assessment framework, ensuring alignment with the human peer assessment criteria.
Data Collection Procedures
Speech Sample Collection
The speech samples were collected during a final English speaking examination, where each student was required to provide a one-minute oral presentation on a specific topic. The recordings were collected using audio devices during a controlled exam session, ensuring that the assessment was consistent for all participants.
Human Peer Assessors
The human peer assessors underwent training to ensure they understood the scoring criteria and could apply them consistently. The training was delivered through a series of online workshops, where assessors were introduced to the scoring criteria (course-embedded and validated by the program’s testing committee). The workshops included a discussion of the key aspects of fluency, accuracy, and complexity, followed by practice scoring of three sample speech recordings. The peer assessors were asked to assign holistic scores first and then proceed with analytic scoring for accuracy, fluency, and complexity after a one-week interval between scoring phases to minimize potential bias.
To ensure consistency, discussion sessions followed each practice session to allow assessors to compare their decisions and reconcile any discrepancies. The one-week interval between the holistic scoring and analytic scoring ensured that human raters could focus on different aspects of the speech samples independently, without the influence of prior assessments.
ChatGPT4o’s Training
ChatGPT4o underwent a similar process for training. The model was fine-tuned using a set of sample speech data and assessment criteria, which allowed it to align closely with human-based ratings. This process involved refining its scoring algorithms to focus on specific aspects of accuracy, fluency, and complexity in oral production. This procedure was based on existing language assessment frameworks and aligned with the criteria used by human raters. ChatGPT4o was trained on the same speech samples provided to the human peer assessors, ensuring it evaluated them according to the same standards.
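The article does not reproduce the exact instructions given to ChatGPT4o. As a purely illustrative sketch, a rubric-aligned scoring request might be framed along the following lines; the template text, the 1-5 scale, and the build_prompt helper are hypothetical stand-ins rather than the study’s actual materials:

```python
# Hypothetical prompt template illustrating how rubric-aligned scoring
# instructions for ChatGPT4o might be framed; this is NOT the study's
# actual prompt, and the 1-5 scale is assumed for illustration only.
PROMPT_TEMPLATE = """You are serving as a peer assessor in an EFL speaking class.
Using the course rubric, evaluate the following one-minute speech
on three domains: accuracy, fluency, and complexity.

1. Assign a holistic score (1-5).
2. Assign an analytic score (1-5) for each of the three domains.
3. For each domain, identify specific issues and give clear,
   constructive suggestions for improvement.

Speech transcript:
{transcript}
"""

def build_prompt(transcript: str) -> str:
    """Fill the hypothetical template with one speech transcript."""
    return PROMPT_TEMPLATE.format(transcript=transcript)

print(build_prompt("Today I would like to describe the person I like most..."))
```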
Follow-up Interviews
After the assessment phase, four human peer assessors were selected for follow-up semi-structured interviews. The interviews aimed to understand the peer assessors’ experiences using ChatGPT4o and their perceptions of the AI’s effectiveness as a peer assessor. Before the interviews, the peer assessors were provided with the AI-generated scores and feedback for the speech samples. They were asked to review the feedback, and the interviews explored how ChatGPT4o compared to their own assessments and feedback in terms of clarity, accuracy, and actionability. Sample interview questions included: “What is your current stance on the use of ChatGPT4o for assessing English major students’ English speaking in classroom settings?” “Do you believe ChatGPT4o-generated feedback is effective? Why or why not?” “What are your major concerns regarding the acceptance of ChatGPT4o-generated scores and feedback for classroom English speaking assessments?” and “What factors would increase your trust in ChatGPT4o-generated scores and feedback for classroom English speaking assessments?”
Data Collection Timeline
Data collection took place over several weeks. Human assessors worked in two phases, each lasting about one week. ChatGPT4o assessed the same set of 20 speech samples in four separate sessions, spaced three days apart, to minimize bias. The entire process was designed to allow for comprehensive analysis of both human and AI-generated feedback.
Data Analysis
Scoring Reliability
To address the first research question, the scoring reliability of ChatGPT4o and human peer assessors was analyzed using univariate and multivariate G-theory frameworks. The software programs GENOVA and mGENOVA (Brennan, 2001b; Crick & Brennan, 1983) were used to estimate variance components and assess their impact on scoring reliability within balanced designs. This involved performing random effects univariate and multivariate G-studies and decision (D)-studies for both ChatGPT4o and human peer assessors, focusing on the variability and reliability of holistic and analytic scoring, respectively. These analyses provided insights into the consistency and robustness of the scoring systems utilized.
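For illustration, the following minimal Python sketch shows the kind of computation these analyses involve: it estimates variance components for a fully crossed person-by-rater (p × r) random-effects design from a complete score matrix and projects D-study phi-coefficients for varying numbers of raters. The data here are synthetic stand-ins; GENOVA and mGENOVA performed the actual analyses.

```python
import numpy as np

def g_study_p_by_r(scores: np.ndarray):
    """Estimate variance components for a fully crossed person-by-rater
    (p x r) random-effects G-study with one observation per cell.

    scores: (n_p, n_r) matrix of ratings (persons x raters).
    Returns (var_p, var_r, var_pr_e) via the usual expected-mean-square
    equations (Brennan, 2001a)."""
    n_p, n_r = scores.shape
    grand = scores.mean()
    p_means = scores.mean(axis=1)   # person means
    r_means = scores.mean(axis=0)   # rater means

    ms_p = n_r * np.sum((p_means - grand) ** 2) / (n_p - 1)
    ms_r = n_p * np.sum((r_means - grand) ** 2) / (n_r - 1)
    resid = scores - p_means[:, None] - r_means[None, :] + grand
    ms_pr_e = np.sum(resid ** 2) / ((n_p - 1) * (n_r - 1))

    var_pr_e = ms_pr_e
    var_p = max((ms_p - ms_pr_e) / n_r, 0.0)  # negative estimates set to 0
    var_r = max((ms_r - ms_pr_e) / n_p, 0.0)
    return var_p, var_r, var_pr_e

def phi(var_p, var_r, var_pr_e, n_raters):
    """D-study phi-coefficient for criterion-referenced interpretations."""
    return var_p / (var_p + (var_r + var_pr_e) / n_raters)

# Illustrative usage with random data in place of the actual ratings:
rng = np.random.default_rng(0)
demo = rng.integers(1, 6, size=(20, 4)).astype(float)  # 20 persons x 4 raters
vp, vr, vpr = g_study_p_by_r(demo)
for n in (1, 2, 3):
    print(f"phi with {n} rater(s): {phi(vp, vr, vpr, n):.2f}")
```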
Feedback Effectiveness
The effectiveness of the qualitative feedback provided by both ChatGPT4o and human peer assessors was analyzed using a combination of quantitative and qualitative methods. The feedback was categorized into the domains of accuracy, fluency, and complexity using a color-coding system. The effectiveness of feedback in each domain was then rated by the first two researchers using a 0-3 holistic rubric covering clarity, accuracy, and actionability; inter-coder agreement was high (Cohen’s κ = .82). Descriptive statistics (e.g., mean and standard deviation) were used to compare the quantity of effective feedback provided by each assessor group.
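For reference, inter-coder agreement of the kind reported above can be computed as in the brief sketch below; the 0-3 ratings and feedback counts shown are hypothetical stand-ins for the actual data.

```python
import numpy as np
from sklearn.metrics import cohen_kappa_score

# Hypothetical 0-3 effectiveness ratings assigned by the two coders
# to the same set of feedback comments (stand-ins for the real data).
coder_1 = np.array([3, 2, 2, 1, 3, 0, 2, 3, 1, 2])
coder_2 = np.array([3, 2, 1, 1, 3, 0, 2, 3, 2, 2])

kappa = cohen_kappa_score(coder_1, coder_2)
print(f"Cohen's kappa = {kappa:.2f}")

# Descriptive statistics of effective-feedback counts, e.g., one array
# per assessor group and domain (hypothetical per-session counts):
counts = np.array([58.0, 61.0, 55.0, 60.0])
print(f"M = {counts.mean():.2f}, SD = {counts.std(ddof=1):.2f}")
```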
Follow-up Interview Data
The interview data from the human peer assessors were analyzed thematically, focusing on how ChatGPT4o’s feedback compared to human-generated feedback and its integration into their assessments. The first two researchers independently coded the interview transcripts and discussed discrepancies to ensure consistency in theme identification (Creswell & Creswell, 2023). Key themes were identified and used to explore human peer assessors’ perceptions of and willingness to use ChatGPT4o as an AI peer assessment tool in the EFL speaking classes.
Results
Findings for the First Research Question
Reliability of Holistic Scoring
To evaluate the consistency and reliability of the holistic scoring for the 20 EFL speech samples, univariate G-theory was applied, incorporating both G-studies and D-studies. The G-studies focused on identifying the proportion of total score variance attributable to each variance component, while the D-studies further analyzed these components to assess the reliability of the scoring process (Brennan, 2001a). The results of these analyses are displayed in Tables 1 and 2.
Table 1. Results for Univariate Person-by-Rater Design G-Studies.
Table 2. Results for Univariate Person-by-Rater Design D-Studies.
Table 1 presents the G-study results for both ChatGPT4o and human peer assessors. For ChatGPT4o, the variance due to the object of measurement (i.e., person, p), which reflects the students’ English speaking abilities, accounted for 59.40% of the total variance. The residual variance, which includes variability from interactions between facets and other unmeasured errors, comprised 39.03% of the total variance. The variance component related to rater differences, indicating variations in scoring leniency among raters, contributed only 1.57% of the total variance. In the G-theory framework, variance associated with the object of measurement is considered desirable variance, whereas variance related to the rater and residual components is viewed as undesirable variance (Brennan, 2001a; Shavelson & Webb, 1991). In contrast, the holistic scoring by peer assessors showed significantly less desirable variance and more undesirable variance, with the object of measurement accounting for only 35.63% of the total variance, the rater variance component representing 36.54% of the total variance, and the residual variance comprising 27.83%.
The D-studies further explored these variance components to calculate the reliability (i.e., phi-coefficients for criterion-referenced interpretations in classroom assessment settings) of holistic scoring by ChatGPT4o compared to human peer assessors. Reliability coefficients were calculated under various conditions, ranging from one to ten assessors, with a sample size of 20. The results are shown in Table 2 and Figure 1.

Figure 1. Comparison of holistic scoring reliability by ChatGPT4o and human peer assessors.
As seen in Table 2, ChatGPT4o demonstrated significantly higher reliability compared to human peer assessors. For example, for a single rating, the reliability coefficient was .59 for ChatGPT4o, whereas it was just .36 for human peer assessors. Increasing the number of ratings to two would substantially improve reliability, with coefficients rising to .75 for ChatGPT4o and .53 for human peer assessors.
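These coefficients follow directly from the Table 1 variance proportions when those proportions are treated as relative variance components; for ChatGPT4o, for example:

$$\Phi_{1} = \frac{.5940}{.5940 + \dfrac{.0157 + .3903}{1}} = .59, \qquad \Phi_{2} = \frac{.5940}{.5940 + \dfrac{.0157 + .3903}{2}} \approx .75$$

The human peer assessor coefficients (.36 and .53) follow from the same formula applied to their variance proportions.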
In summary, the G- and D-studies revealed that holistic scoring by ChatGPT4o exhibited considerably less variability and higher reliability compared to the scoring by human peer assessors. These findings suggested that ChatGPT4o could serve as a reasonably reliable AI peer assessor to complement holistic scoring conducted by human peer assessors in English speaking classrooms.
Reliability of Analytic Scoring
To assess the variability and reliability of analytic scoring for the 20 EFL speech samples, multivariate G-theory was utilized, involving both G-studies and D-studies (Brennan, 2001a). The results of these multivariate G- and D-studies are presented in Tables 3 and 4.
Table 3. Results for Multivariate Person-by-Rater Design G-Studies.
Table 4. Results for Multivariate Person-by-Rater Design D-Studies.
Table 3 details the outcomes of the multivariate G-studies, which examined the variability in analytic scoring by ChatGPT4o and human peer assessors across the three domains of accuracy, fluency, and complexity of the speech samples. For ChatGPT4o, the largest variance component was the person variance (59.40% for accuracy, 61.99% for fluency, and 57.87% for complexity), followed by the residual variance (39.03% for accuracy, 36.76% for fluency, and 41.22% for complexity), with the rater variance contributing minimally (less than 2% of the total variance) across all three domains.
Conversely, for human peer assessors, the person, residual, and rater variance components explained relatively similar amounts of total variance across the domains of accuracy, fluency, and complexity. Specifically, the object of measurement accounted for 31.81%, 39.91%, and 30.30% of the total variance in the accuracy, fluency, and complexity domains, respectively. The residual variance accounted for 33.17%, 32.93%, and 29.56% of the total variance across the three domains, respectively. The rater variance component contributed 35.02%, 27.16%, and 40.14% of the total variance in the three domains, respectively.
These variance components were further analyzed through multivariate D-studies to calculate reliability coefficients (i.e., phi-coefficients for criterion-referenced interpretations in classroom assessment settings) for the analytic scoring by ChatGPT4o compared to human peer assessors across the domains of accuracy, fluency, and complexity. Reliability coefficients for all three domains were calculated under various conditions, ranging from one to ten assessors, with a sample size of 20. The results are shown in Table 4 and Figure 2.

Figure 2. Comparison of analytic scoring reliability by ChatGPT4o and human peer assessors.
For a single rating by ChatGPT4o, the reliability coefficients were .59 for accuracy, .62 for fluency, and .58 for complexity. Increasing the number of assessors to two would significantly improve the reliability across all three domains, with coefficients rising to .75 for accuracy, .77 for fluency, and .73 for complexity.
For a single human peer assessor, the reliability coefficients were only .32 for accuracy, .40 for fluency, and .30 for complexity. When the number of assessors was increased to two, reliability increased to just .48 for accuracy, .57 for fluency, and .47 for complexity.
In conclusion, the multivariate G- and D-studies indicated that the analytic scoring conducted by ChatGPT4o exhibited considerably less variability and higher reliability across the three domains of accuracy, fluency, and complexity than human peer assessors. These results suggested that ChatGPT4o could also be a reasonably reliable AI peer assessor to assist EFL students with analytic scoring in English speaking classrooms.
Findings for the Second Research Question
Usefulness of Qualitative Feedback
The qualitative feedback provided by both ChatGPT4o and human peer assessors on the 20 speech samples was analyzed using a combination of quantitative methods, such as descriptive statistics, and qualitative techniques, including coding, categorization, and thematic analysis, to determine its usefulness. In this study, effective feedback is defined as feedback that clearly identifies issues in any of the three domains—accuracy, fluency, or complexity—of an EFL speech sample and offers understandable and constructive suggestions for improvement in the specified area (Carless et al., 2011; Hattie & Timperley, 2007). The results of these quantitative and qualitative analyses are presented in Tables 5 and 6, respectively.
Table 5. Quantitative Findings of Feedback.
Note. The number 4 indicates that ChatGPT4o assessed the same set of 20 speech samples in four separate sessions, spaced three days apart, to minimize bias.
Table 6. Qualitative Findings of Feedback.
Table 5 summarizes the descriptive statistics regarding the effectiveness of feedback comments provided by ChatGPT4o compared to those given by human peer assessors. The findings indicated that ChatGPT4o consistently delivered more useful feedback than human peer assessors, offering more than three times the feedback on accuracy (mean for ChatGPT4o = 58.5, mean for human peer assessors = 17.5) and fluency (mean for ChatGPT4o = 55.25, mean for human peer assessors = 15.75), and approximately six times the feedback on complexity (mean for ChatGPT4o = 53.75, mean for human peer assessors = 9.25). Additionally, the standard deviation values for ChatGPT4o were significantly lower than those for human peer assessors, indicating that ChatGPT4o provided more consistent feedback across the domains of accuracy, fluency, and complexity, whereas human peer assessors showed greater variability in their feedback.
These descriptive statistical results suggested considerable differences in the quantity and consistency of effective feedback on accuracy, fluency, and complexity between ChatGPT4o and human peer assessors. The findings implied that ChatGPT4o could be a valuable AI peer assessor to support human peer assessment by providing comprehensive qualitative feedback in EFL speaking classrooms.
To better understand and classify the effective feedback provided on EFL speech samples by ChatGPT4o and human peer assessors, the feedback was analyzed to identify and consolidate major themes within the domains of accuracy, fluency, and complexity (see Table 6). This thematic analysis clarified the specific areas where students’ English speaking showed weaknesses and needed improvement.
The qualitative analysis of feedback revealed eight major themes across the domains of accuracy, fluency, and complexity. In the accuracy domain, the feedback primarily highlighted issues related to (a) pronunciation accuracy, (b) vocabulary accuracy, and (c) grammar accuracy. In the fluency domain, the feedback concentrated on (a) speed fluency, (b) breakdown fluency, and (c) repair fluency. In the complexity domain, the feedback focused on (a) lexical complexity and (b) grammatical complexity.
However, the emphasis placed on these themes varied significantly between the two assessor groups. For instance, ChatGPT4o was able to provide comprehensive feedback that addressed nearly all the identified themes within a single speech sample. In contrast, the feedback from human peer assessors often included numerous positive comments, particularly related to accuracy or fluency themes, which may not have directly contributed to improving speaking performance. To illustrate this, specific feedback provided by ChatGPT4o and one human peer assessor on the accuracy, fluency, and complexity domains of EFL speech sample #12 is discussed below.
(Feedback provided by ChatGPT4o)
(Feedback provided by one human peer assessor)
In summary, the feedback themes identified recurring issues in accuracy, fluency, and complexity across the 20 EFL speech samples assessed by ChatGPT4o and human peer assessors. The feedback stressed the need for targeted instruction and practice, suggesting that addressing these areas with specific strategies could significantly enhance the quality of students’ English speaking.
Findings for the Third Research Question
Human Peer Assessors’ Acceptance of ChatGPT4o as an AI Peer Assessor in EFL Speaking
The analysis of interview data with four human peer assessors uncovered distinct themes and differences in their acceptance of ChatGPT4o as an AI peer assessor in English speaking classrooms, particularly reflected in their responses to the five interview questions.
Theme 1: Perceived Reliability of ChatGPT4o Scores
Regarding the reliability of scores generated by ChatGPT4o (Interview Question 2), most participants recognized the system’s potential, though some had reservations. Participant A expressed confidence in ChatGPT4o’s reliability, stating, “The reliability of ChatGPT4o’s scoring is consistently higher than that of students’ scoring across various domains,” suggesting a strong belief in the AI’s capability to deliver accurate and fair evaluations. Similarly, Participant C shared this view, noting that the AI evaluates “from a relatively professional angle,” which makes it useful for reliable assessments. However, Participant B voiced doubt, pointing out that “it may sometimes be inaccurate” due to its reliance on pattern recognition and the need for further research, particularly in the context of Chinese students. Participant D offered a balanced perspective, acknowledging the reliability of the scores “to some extent” but also expressing concerns about the AI peer assessor’s ability to fully grasp the subjective subtleties of speaking, such as emotions and intonation.
Theme 2: Effectiveness of ChatGPT4o Feedback
On the subject of the effectiveness of feedback generated by ChatGPT4o (Interview Question 3), participants generally found the feedback useful and effective, though their enthusiasm varied. Participant A was particularly positive, citing how students in a high school EFL program “responded positively to the specific, practical steps provided by the AI tool,” which led to measurable improvements in their speaking skills. Participant D also found the feedback useful and effective but stressed that its effectiveness depends on the quality of the input provided to the AI, stating, “If our input is not clear and specific, the AI might not provide sufficiently effective feedback.” In contrast, Participant B raised concerns about the AI’s limitations in understanding “language fluency, intonation, cultural appropriateness, or emotional subtlety,” indicating that while the feedback might be helpful for technical aspects like grammar, it may fall short in more subtle areas. Participant C offered a unique viewpoint, comparing the feedback’s utility to technological advancements, suggesting that while it may seem limited now, it could become more practical as technology progresses.
Theme 3: Concerns Over Acceptance and Trust in ChatGPT4o
Concerns about the acceptance and trust in ChatGPT4o-generated scores and feedback (Interview Questions 4 and 5) were prominent among most participants. Participant B expressed concern that teachers might become overly dependent on AI, potentially diminishing “the importance of real human interaction in language learning.” This concern was echoed by Participant D, who feared that ChatGPT4o’s assessments could become standardized, leading to a “one-size-fits-all approach” that fails to account for individual student differences. On the other hand, Participant C worried about the impact of lower-than-expected scores on student motivation, noting that “the constant feedback might lead to student fatigue and increased pressure.” To build trust, Participant A recommended a collaborative approach, involving “ongoing input from students, educators, and AI specialists,” to ensure that the system remains responsive to classroom needs. Participant D suggested that authoritative teacher evaluations be compared with AI-generated feedback to validate the system’s reliability and increase student trust.
Theme 4: Stance on the Use of ChatGPT4o in Classroom Settings
Participants expressed varying opinions on using ChatGPT4o for assessing English speaking (Interview Question 1). Participant A was strongly supportive, noting that integrating ChatGPT4o into classroom assessments could provide “valuable benefits,” particularly in offering targeted support for improving speaking skills. In contrast, Participant B saw the AI as a supplementary tool rather than a replacement for teachers, emphasizing that “only human teachers can offer a more personalized understanding and guidance.” Participant C appreciated the professional perspective that ChatGPT4o offered, viewing it as a tool that could “better support student development” by helping identify weaknesses. Participant D emphasized the practicality of implementing ChatGPT4o, citing its low cost and ability to assist teachers in providing “targeted guidance” based on organized data.
In summary, while there is broad recognition of the potential benefits of ChatGPT4o as a peer assessment aid, the students’ responses reflect cautious optimism, tempered by concerns about reliability, effectiveness, and the essential role of human involvement in language learning. The findings suggested that while ChatGPT4o could be a valuable AI peer assessor in classroom assessments, its integration should be carefully managed to address these concerns and build trust among students.
Discussion and Conclusions
The findings of this study provide important insights into the potential of ChatGPT4o as an AI peer assessor in EFL speaking classrooms, highlighting both its strengths and limitations. These results contribute significantly to the existing body of research on AI-assisted EFL assessments, particularly in the unexplored area of using ChatGPT4o in assessing speaking proficiency.
Scoring Reliability
In terms of scoring reliability, this study demonstrated that ChatGPT4o offers a higher degree of consistency and reliability in both holistic and analytic scoring compared to human peer assessors. This finding is consistent with previous studies suggesting that AI models, when optimized for specific tasks, can significantly improve scoring accuracy and consistency (Li et al., 2024). The lower variability in ChatGPT4o’s scoring, as indicated by the G- and D-studies, supports the notion that AI tools like ChatGPT4o can serve as reliable assessment aids, especially in settings where consistency is critical (Brennan, 2001a).
However, it is important to address a potential concern: whether the AI’s consistency might stem from leniency. Some researchers have noted that AI might provide uniformly high scores, similar to human teachers who might be overly lenient in their grading (Tate et al., 2024). While ChatGPT4o’s consistent scoring can be seen as a strength, it could also indicate that the system might have built-in biases or algorithms that tend to favor higher scores, regardless of actual proficiency levels. Thus, while ChatGPT4o’s consistency is advantageous, further investigation into whether this reflects genuine scoring accuracy or an inherent leniency bias is essential. This adds an important layer to understanding the underlying mechanisms driving ChatGPT4o’s performance and highlights the need to assess whether consistent results translate into truly valid assessments.
Moreover, while ChatGPT4o demonstrated reliability in scoring, its inability to fully capture the subtleties of spoken language—such as emotional expression, tone, and intonation—suggests that AI cannot fully replace human judgment in these areas. These limitations reinforce the notion that while AI can complement human assessors, it should not replace them entirely. The nuanced elements of speech, including tone, emotional inflections, and cultural contexts, remain better evaluated by human assessors, who possess the ability to interpret these factors.
Feedback Effectiveness
The study also found that ChatGPT4o provides significantly more useful and consistent feedback across the domains of accuracy, fluency, and complexity compared to human peer assessors. This finding aligns with the growing body of research suggesting that AI-driven feedback systems can deliver detailed, personalized, and effective feedback, enhancing student learning more effectively than traditional methods (Cao & Zhong, 2023; Li et al., 2024). ChatGPT4o demonstrated the ability to cover a broad range of feedback themes within a single speech sample, which is often difficult for human peer assessors to achieve consistently due to human limitations, such as cognitive load and bias.
However, while ChatGPT4o excelled at providing consistent and comprehensive feedback, the feedback it offered was not without its limitations. Students raised concerns about the AI’s inability to capture the cultural nuances and contextual subtleties embedded in language use (Song & Song, 2023). For example, ChatGPT4o might miss cultural references or fail to interpret emotionally charged language effectively. This finding reinforces the idea that AI feedback, while valuable, should not replace human insights, especially when it comes to language subtleties that require deeper cultural understanding or emotional awareness.
AI Acceptance and Trust
A significant part of the study also focused on student acceptance and trust in ChatGPT4o as an AI peer assessor. The analysis of interview data revealed varying levels of acceptance among students, with some expressing concerns about over-reliance on AI and its potential to reduce human interaction in language learning. This is consistent with broader concerns in educational research about the integration of AI into teaching and assessment (Creely, 2024). While some students expressed trust in ChatGPT4o’s reliability and found its feedback to be useful, others were hesitant, fearing that AI might standardize assessments in ways that overlook individual student needs.
Interestingly, the interview data highlighted that students felt more comfortable with ChatGPT4o if it was seen as a complementary tool rather than a replacement for human assessors. For example, some students suggested that a collaborative approach—involving input from both AI and human assessors—might help address trust issues and ensure that the system responds appropriately to classroom needs. These insights align with previous studies recommending transparency in AI implementation and ongoing collaboration between students, educators, and AI developers (Creely, 2024). This collaborative approach would help build trust in AI tools and ensure that their role in education is optimized.
Conclusions
In conclusion, the findings of this study suggest that ChatGPT4o can be a reliable and effective AI peer assessor in EFL speaking classrooms. The AI demonstrated significant advantages in terms of scoring consistency and providing feedback, making it a potentially valuable tool for complementing human assessments. However, the study also highlighted several challenges, particularly the limitations of AI in understanding the emotional subtleties and cultural nuances of spoken language. These findings suggest that while AI tools like ChatGPT4o have the potential to significantly enhance language assessments, their use should be seen as complementary to, rather than a replacement for, human judgment.
Limitations and Future Research
Despite these advantages, the study’s limitations should be considered. First, the small sample size (20 EFL speech samples and 20 peer assessors) limits the generalizability of the findings. Further research with larger and more diverse populations will be necessary to confirm these results and explore the full potential of ChatGPT4o across different contexts and student groups.
Second, the study focused on first-year English majors at a single institution, which limits the applicability of the findings to other educational settings. Future research should expand the study to include different educational levels and institutions, as well as consider the impact of cultural differences on AI assessment.
Third, while this study compared ChatGPT4o’s assessments with those of human peer assessors, it did not include a comparison with professional teacher assessments, which are often considered the gold standard. Future research should compare AI-based assessments with teacher assessments to better contextualize the performance of AI tools in educational settings.
Lastly, ChatGPT4o may not be fully accessible to all learners and instructors in China, particularly in comparison to more widely used AI tools such as Doubao. Nevertheless, ChatGPT4o offers unique advantages in natural language processing and real-time assessment that make it an ideal candidate for exploring the future integration of AI in EFL speaking assessment (Murad et al., 2024; Shahriar et al., 2024). This study therefore focused on ChatGPT4o’s potential to push the boundaries of what AI can achieve in EFL speaking assessment. The researchers acknowledge, however, that accessibility remains a significant challenge, and future research could explore how other AI tools like Doubao could be used to support peer assessments in more practical contexts.
Pedagogical Implications
This study offers several pedagogical implications for EFL instructors and students. First, by improving the reliability of human peer assessments, ChatGPT4o can help ensure fairer and more consistent evaluations of students’ speaking performance. This enhanced reliability could help students trust the assessment process more and focus on improving their speaking skills.
Second, the ability of ChatGPT4o to provide detailed and effective feedback across domains such as accuracy, fluency, and complexity means that students can receive more specific and actionable suggestions for improvement. This targeted feedback can guide their language learning more effectively, particularly in self-directed learning environments.
Finally, integrating ChatGPT4o into classroom assessments could promote a more collaborative and autonomous learning environment. Students could use AI-generated feedback to reflect on their own performance and that of their peers, fostering deeper engagement with the learning material and promoting learner autonomy.
In summary, while ChatGPT4o can be a valuable tool for enhancing EFL speaking assessments, its use should be carefully managed to balance the strengths of AI with the irreplaceable contributions of human assessors. With ongoing research and collaboration, AI tools like ChatGPT4o have the potential to transform language learning and assessment practices, offering greater consistency, reliability, and personalized feedback.
Footnotes
Ethical Considerations
This study involving human participants was reviewed and approved by the Evidence-based Research Center for Educational Assessment (ERCEA) Research Ethical Review Board at Jiangsu University (Ethical Approval Number: ERCEA2404).
Consent to Participate
The participants provided their written informed consent to participate in this study.
Author Contributions
Jinyan Huang and Junfei Li made equal contributions to this article and share first authorship. Junfei Li: conceptualization, literature, methodology, data acquisition, data analysis, revision, and funding. Jinyan Huang: conceptualization, literature, methodology, data analysis, writing, and editing for submission. Thomas Sheeran: conceptualization, methodology, data acquisition, and revision.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study was supported by the Shanghai Educational Sciences Research Program (C2024114) granted to Junfei Li.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
The datasets generated and/or analyzed during the current study are available from the corresponding author on reasonable request.
