Abstract
The current study proposes a new approach to weakness identification in diagnostic language assessment (DLA) for speaking skills. We also propose to design actionable and contextualised diagnostic feedback through the systematic integration of feedback and remedial learning activities. Focusing on lexical use in second language speaking, we developed a DLA programme based on this approach and validated it in terms of actual learning gains, using an experimental design. A total of 59 beginner-to-intermediate-level Japanese learners of English were randomly assigned to a control or an experimental group. While both groups engaged in task repetition with a conversational artificial intelligence (AI) agent on six occasions, only the experimental group received diagnostic feedback on lexical use, including paraphrased versions of their original utterances. The results showed that the control group (task repetition only) demonstrated significant improvement during the task repetition sessions but failed to transfer and retain the learning gains. In contrast, despite the lack of practice effects, the experimental group (task repetition with diagnostic feedback) outperformed the control group at the posttest with a near-medium effect size. A qualitative investigation into learners’ perceptions further confirmed that the proposed contextualised diagnostic feedback succeeded in heightening their awareness of weaknesses.
Keywords
Introduction
Given the integral role of assessment in language learning and teaching, language testing research has explored various ways to conjoin different types of assessment with educational contexts. Among these, diagnostic language assessment (DLA) has attracted increasing attention from language testers. Diagnostic assessment is a type of assessment that aims to facilitate learners’ subsequent learning through the identification of their strengths and weaknesses (Alderson, 2005; Alderson et al., 2015). Strengths refer to what the learner has learned, whereas weaknesses concern what the learner has not yet acquired, with a particular focus on what hinders successful performance. DLA research has tended to focus on weaknesses, which can offer useful information about learners’ states of target knowledge and thus help learners and teachers decide what to learn next and how (Harding et al., 2015). However, given the importance of learners’ psychological variables in learning outcomes (Jang et al., 2015; Xie & Andrews, 2013), feedback on both strengths and weaknesses should be assumed to facilitate learning. For instance, feedback on strengths can foster a sense of achievement in previous learning (i.e., self-efficacy; see Kormos & Wilby, 2019), subsequently contributing to sustained motivation and offering insights into learners’ self-regulatory strategies (Xie & Lei, 2021).
Alderson and colleagues (2015) have claimed that DLA tests should be specifically designed for a diagnostic purpose and that effective DLA may require consistency across its different components, that is, diagnosis, feedback, and remedial learning (Lee, 2015). Moreover, possibly due to the challenges of weakness identification, DLA research has seen few applications to speaking skills (for a rare exception, see Isbell, 2021), compared with receptive skills (Harding et al., 2015; Jang et al., 2015) and writing skills (Y.-H. Kim, 2011; Sawaki et al., 2013). In addition, Lee (2015) pointed out that, to secure validity evidence for the effectiveness of DLA, a causal connection should be established between particular components of DLA and changes in learners’ subsequent learning. Some recent studies have longitudinally tracked learners’ development of target skills and gained useful insights into the potential of diagnostic assessment (e.g., Isbell, 2021). However, despite persistent calls for this type of validity evidence of DLA (i.e., learning gains; Isbell, 2021; Lee, 2015), to the best of our knowledge, no studies have examined the effectiveness of DLA using an experimental design (e.g., random group assignment, comparison groups; Phakiti, 2015) that would allow causal claims about the relationship between a target DLA programme and learning gains.
To address the aforementioned challenges in DLA research, the current study proposes a new approach to weakness identification in DLA for speaking skills with the assistance of artificial intelligence (AI) technologies, as well as the realisation of actionable and contextualised diagnostic feedback by systematically integrating the components of DLA (Lee, 2015). To this end, using an experimental design, the current study examined the effects of the DLA programme based on our proposed approach on the development of lexical use in speaking performance.
Background
Challenges in diagnostic assessment for second language speaking
Motivated by the potential of diagnostic assessment for second language (L2) learning, language testers have attempted to develop theories and principles to systematically design and evaluate DLA programmes. Lee (2015) proposes three major components of diagnostic assessment: diagnosis, feedback, and remedial learning. The primary goal of diagnosis is to identify learners’ strengths and weaknesses in the target skills, which in turn informs the feedback delivered to learners and the remedial learning activities that follow.
Another line of research is the identification of characteristics of effective diagnostic assessments and tests. Alderson and colleagues have established a list of distinctive features of diagnostic tests (Alderson et al., 2015; see also Alderson & Huhta, 2011). Alderson et al. (2015) claim that test items in DLA are “more likely to be discrete-point than integrative, or more focused on specific elements than on global abilities” and are “more likely to focus on ‘low-level’ language skills than higher-order skills which are more integrated” (p. 238). These statements highlight the transparent correspondence between test items and target constructs for diagnosis. A primary reason for this principle is the interpretability of test performance (Alderson, 2005). Speech production is realised through the orchestration of various cognitive and linguistic resources and processes, and thus it is not always straightforward to identify the cause of challenges at the level of underlying competence and knowledge (i.e., weaknesses) through spontaneous speaking performance (cf. S. Suzuki & Kormos, 2023). DLA practices that conform to the multi-componential nature of oral proficiency and speaking performance include the DIALANG project (Alderson, 2005) and a seminal study by Isbell (2021) on pronunciation, both of which commonly employ discrete-point elements for high interpretability of learners’ current states of target knowledge and skills. While this approach can help learners develop certain aspects of linguistic competence, care should be taken regarding the transferability of the attained knowledge to communicative situations. Drawing on the usefulness of learners’ global skills information for DLA (Alderson, 2005), a diagnostic assessment of speaking performance and skills might be expected to identify learners’ unsatisfactory approaches to language production through spontaneous speaking tasks (e.g., wrong word choice). However, this approach needs to consider the incremental and multi-faceted processes underlying spoken language production (Kormos, 2006), and thus some innovative method might be needed to identify weaknesses through actual language production.
To maximise the potential learning gains through diagnostic assessment, care should be taken regarding the consistency between the components of DLA. With respect to feedback, Lee (2015) claims that diagnostic feedback should be actionable and contextualised, that is, closely aligned with the remedial learning activities that learners subsequently engage in.
Integrating Instructed Second Language Acquisition (ISLA) research and cognitive psychology in diagnostic assessment
As mentioned earlier, learning gains through diagnostic assessment can be maximised by actionable and contextualised diagnostic feedback, which should be achieved through the systematic integration of feedback and remedial learning activities (Lee, 2015). Hence, the content of diagnostic feedback could arguably be optimised in accordance with the learning processes that the test designers intend to facilitate in the remedial learning activities. In this regard, Alderson and colleagues (2015) argued that a DLA programme could be “based on a specific theory of language development, preferably detailed rather than a global theory” (p. 238). To obtain insights into the learning processes embedded in DLA, theories and findings from ISLA and cognitive psychology can be useful. ISLA research suggests that both meaning-focused language use (e.g., spontaneous speech, conversation) and form-focused activities (e.g., grammar instruction, corrective feedback) are essential to developing L2 speaking skills (e.g., Rossiter et al., 2010; Sato & Lyster, 2012). Accordingly, spontaneous speaking activities as remedial learning may play a central role in learning gains through diagnostic assessment, while the role of diagnostic feedback, which typically offers form-focused information about language use, should be indispensable. L2 learning is facilitated when such meaning-focused and form-focused activities are systematically aligned (e.g., Long, 2015), which supports the idea of realising contextualised diagnostic feedback by connecting the three DLA components: diagnosis, feedback, and remedial learning.
To enhance the actionability of diagnostic feedback, a linkage between diagnostic assessment tasks and remedial learning tasks should be established. With regard to this principle, the pedagogical technique of task repetition (i.e., repeating the same or similar communicative task) might be particularly relevant. ISLA research has suggested that task repetition can facilitate the consolidation of linguistic knowledge (Kakitani & Kormos, 2024; Lambert et al., 2017; Y. Suzuki & Hanzawa, 2022). There are several variations of task repetition in terms of what to repeat: exact task repetition (same content, same procedure) and procedural task repetition (different content, same procedure). Exact task repetition generally yields greater learning gains (N. de Jong & Perfetti, 2011; Y. Kim & Tracy-Ventura, 2013). Furthermore, form-focused pedagogical activities and techniques, such as corrective feedback and consciousness-raising tasks, can be combined to direct learners’ attention to certain linguistic forms so that multiple aspects of speaking performance can be enhanced simultaneously (Tran & Saito, 2021; van de Guchte et al., 2016).
Recently, research on task repetition has been reconceptualised from the perspective of cognitive psychology, highlighting the importance of the schedule of task repetition (see Wiseheart et al., 2019). The implementation of task repetition can be viewed as retrieval practice, where learners attempt to retrieve newly acquired or partially consolidated knowledge (Ellis, 1995). Repeating such retrieval practice leads to the modification of learners’ memory system so that they can retain the target knowledge in their long-term memory. Key variables that can be manipulated to enhance learning gains through repetitive retrieval include the time interval between practice sessions (i.e., intersession interval [ISI]) and the interval between the final practice session and the testing session (i.e., retention interval [RI]). Although longer ISIs are preferable in general (i.e., spacing effects), the learning outcome tends to depend on the ratio of the ISI to the RI (i.e., ISI/RI ratio), which is ideally between 10% and 30% (Cepeda et al., 2008; Kakitani & Kormos, 2024; Rohrer & Pashler, 2007).
Another relevant issue affecting learning gains through retrieval practice is the complexity of the target practice. According to the desirable difficulty framework (Bjork, 1994), optimal learning gains can be achieved when learners engage in effortful retrieval of target items, meaning that the difficulty level of activities and target forms should be matched to learners’ developmental readiness, that is, pitched at a point where learners can benefit from pedagogical activities given their current L2 system (see Y. Suzuki et al., 2019). To set achievable learning goals, it might thus be ideal in the DLA context to focus on linguistic features for which learners are developmentally ready. Effective diagnostic feedback on lexical use in speaking, for instance, may incorporate usages characteristic of speakers at a slightly higher proficiency level than that of the target learner.
Potential of AI technologies in diagnostic assessment
In various speaking assessment contexts, language testers have acknowledged the potential of technologies such as AI to enhance the practicality and authenticity of assessment practices. Regarding speaking performance elicitation, the introduction of computer-based test formats, such as that of the Test of English as a Foreign Language Internet-based Test (TOEFL iBT), has reduced the cost of delivering speaking tests. However, the mode of speaking has been limited to monologic tasks, while dialogic speaking tasks have, as yet, been administered mostly by trained human examiners (e.g., Cambridge Assessment English exams). Recently, in response to the need to assess interactional speaking skills, language testers have developed and validated spoken dialog systems, which can orally pose questions to and generate responses to questions from test-takers, for eliciting test-takers’ interactive performance (Gokturk & Chukharev-Hudilainen, 2023; Ockey & Chukharev-Hudilainen, 2021). To further enhance the authenticity of assessment tasks, the potential of a multimodal spoken dialog system (hereafter, conversational AI agent), which is equipped with an avatar and thus can use non-verbal gestures, has also drawn increasing attention from language testers. Our precursor study confirmed that such a multimodal conversational AI agent can elicit ratable speech samples reflecting test-takers’ upper limit of oral proficiency (e.g., linguistic breakdown) in an oral proficiency interview (OPI) format (Saeki et al., 2024).
The application of technologies to automated speech evaluation has also been advanced by the development of neural network-based machine learning (ML) algorithms, which can recognise non-linear, dynamic patterns between speech characteristics and assigned scores. L2 research has conventionally employed rule-based feature extraction and a linear regression approach to score prediction for the sake of interpretability of predicted scores, targeting constructs such as fluency (S. Suzuki et al., 2021), comprehensibility (Saito, 2021) and, more holistically, oral proficiency (N. H. de Jong et al., 2012; Révész et al., 2016). However, one notable advantage of a neural network-based ML approach over traditional SLA feature extraction is that it can exhaustively capture various speech characteristics and even their interrelationships with outcome variables, typically in the form of feature vectors (e.g., transformer, attention mechanism; see Devlin et al., 2019; Vaswani et al., 2017). ML-based models have been reported to achieve high accuracy in predicting both holistic and analytic scores of L2 speaking performance (e.g., Chen et al., 2018; Ramanarayanan et al., 2017). In the context of an OPI task with a conversational AI agent, an automated scoring model incorporating neural network-based multimodal feature extraction and an ML-based algorithm achieved high consistency with trained human raters in predicting Common European Framework of Reference for Languages (CEFR) levels on the scales of Overall Oral Interaction, General Linguistic Range, Grammatical Accuracy, Overall Phonological Control, Fluency, and Coherence and Cohesion, as defined by the Council of Europe (2020) (Quadratic Weighted Kappa [QWK] = .934–.965; for details, see Takatsu et al., 2026). Despite this high prediction accuracy, the ML approach has been criticised in various domains for its inherent “black box” nature, that is, the low interpretability of score prediction (for discussion in automated speaking assessment, see Khabbazbashi et al., 2021). Researchers have thus extended the framework of eXplainable Artificial Intelligence (XAI), which aims to offer insights into the decision-making processes of AI-based predictions, to avoid unwanted consequences of AI use, such as adopting unacceptable decisions, and to empower users in understanding the decisions made by AI-based systems (Saeed & Omlin, 2023).
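To make the consistency metric reported above concrete, the following minimal sketch shows how a quadratic weighted kappa between human and automated CEFR levels can be computed; the ratings and the ordinal encoding of levels are our own illustrative assumptions, not the study’s data.

```python
# Minimal sketch: quadratic weighted kappa (QWK) between human and
# automated CEFR ratings. Ratings are invented for illustration;
# levels are encoded ordinally (A1=0, A2=1, B1=2, B2=3, C1=4, C2=5).
from sklearn.metrics import cohen_kappa_score

human_levels = [1, 2, 2, 3, 1, 4, 2, 3]
model_levels = [1, 2, 3, 3, 1, 4, 2, 2]

qwk = cohen_kappa_score(human_levels, model_levels, weights="quadratic")
print(f"QWK = {qwk:.3f}")
```

Because the weighting is quadratic, disagreements of two or more levels are penalised far more heavily than adjacent-level disagreements, which suits ordinal scales such as the CEFR.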
Recently, with the assistance of ML and natural language processing (NLP) techniques, it has become possible to identify the usage that should be modified to make performance more successful in relation to a certain global skill or construct (e.g., Yoon et al., 2019), which provides insights into learners’ weaknesses at the level of linguistic knowledge. Several automated feedback systems for speaking performance have been developed and evaluated. In light of the diagnosis component, by using the
Taken together, AI-powered technologies have enhanced the authenticity of speaking performance elicitation and the accuracy of target skill score prediction. However, the black-box nature of ML-based diagnosis undermines a key criterion of DLA, namely the interpretability of learner profiles for decisions about subsequent learning activities. To maintain both the accuracy of ML-driven diagnosis and the interpretability of learners’ weaknesses, the framework of XAI may offer a promising solution for automated DLA systems. Given that the XAI framework aims to avoid unwanted consequences of AI use and to assist users in better understanding the decisions made by AI-based systems, it can be considered compatible with the fundamental principle of language assessment, namely, the meaningful interpretation of assessment records (cf. Bachman & Palmer, 2010). In addition, in the context of speaking performance assessment, XAI might be used to demystify predicted performance scores by identifying the linguistic features that influence the AI model’s score prediction. One useful XAI technique is the SHapley Additive exPlanations (SHAP) framework (Lundberg & Lee, 2017), which derives from the concept of Shapley values in cooperative game theory (Shapley, 1953). SHAP quantifies the relative contribution of individual features to the predicted value of an outcome variable so that the relative importance of those features can be presented in an interpretable manner.
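As a hedged, concrete illustration of the SHAP idea, the toy sketch below fits a regressor on synthetic “speech features” and extracts each feature’s additive contribution to one prediction; the model and features are stand-ins, not the study’s scoring system.

```python
# Toy illustration of SHAP: additive per-feature contributions to a
# single model prediction. Features and model are synthetic stand-ins.
import numpy as np
import shap
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))          # e.g., speech-derived features
y = 2.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = RandomForestRegressor(random_state=0).fit(X, y)
explainer = shap.Explainer(model, X)   # auto-selects a tree explainer here
contributions = explainer(X[:1])       # explain one performance

# Each value is that feature's additive contribution to this prediction,
# relative to the model's average output (the base value).
print(contributions.values[0], contributions.base_values[0])
```

The same additivity is what allows word-level contributions to a CEFR level probability, as described later in the Method section, to be read off and visualised per utterance.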
The current study
In response to the challenges of weakness identification in diagnostic assessment for speaking skills, the current study proposes a diagnostic approach that integrates skill-level assessment and discrete item-level linguistic features, orchestrating ML, NLP, and XAI techniques. To achieve actionable and contextualised diagnostic feedback, we also propose to closely align the design of diagnosis, feedback, and remedial learning with one another (see Lee, 2015). To this end, we incorporate one of the well-researched ISLA pedagogies—task repetition—into speaking tasks for diagnosis and remedial learning phases, using a conversational AI agent as an interlocutor. Moreover, the feedback contrasts learners’ unsatisfactory utterances identified as weaknesses with suggested corrections in relation to the actual questions from the AI agent.
Given that L2 speaking performance is multidimensional (Kormos, 2006; Segalowitz, 2010) and that the quality of learning efforts during remedial learning activities can affect learning gains (Isbell, 2021), learners’ attention should arguably be directed to the relevant aspects of speech production when testing the effects of diagnostic feedback on a certain target skill. Given the lexically driven nature of speech production (Kormos, 2006), lexical use in speech is relatively less complex than other aspects of speech such as fluency (cf. S. Suzuki & Kormos, 2023). Therefore, we decided to test the effects of DLA based on the proposed approach on changes in lexical use. Specifically, the current study operationalised the intended lexical learning as the retention of new lexical items in spontaneous speech production. Holistic changes in lexical use were measured on the CEFR General Linguistic Range scale (Council of Europe, 2020).
In response to the call for robust validity evidence of DLA in terms of learning gains (Lee, 2015; see also Isbell, 2021), the current study aims to examine the effects of the DLA programme on lexical use, using an experimental design. To capture the learning gains comprehensively, the current study followed recent L2 speaking task repetition research (Kakitani & Kormos, 2024; Y. Suzuki & Hanzawa, 2022), assessing (a) improvement during the practice sessions, (b) retention of gains after the sessions, and (c) transfer of gains to a new task. The following research questions (RQs) were thus examined:
RQ1. What are the effects of contextualised diagnostic feedback on lexical use during the DLA programme?
RQ2. To what extent are the effects of contextualised diagnostic feedback on lexical use durable after 1 week?
RQ3. To what extent can contextualised diagnostic feedback improve lexical use in a new task?
Method
Overall design
The current study adopted an experimental design with a pretest, a posttest, and a delayed posttest to examine the effects of our DLA programme, which was conducted over six consecutive days (see Figure 1). In the control group, learners engaged in exact task repetition of an OPI with a conversational AI agent. Prior to the task repetition session on each day, they received the estimated CEFR Overall level based on their performance on the previous day as minimal feedback. In the experimental group, learners engaged in the same task repetition programme as a remedial learning activity but additionally received the results of the diagnostic assessment alongside the estimated CEFR level. To ensure that any difference in learning gains between the groups could be attributed to the inclusion of contextualised diagnostic feedback, we controlled for learning opportunity during the remedial learning phase by offering an identical pedagogical activity (i.e., task repetition).

Figure 1. Overview of the experimental design for diagnostic language assessment. The study employed an experimental design with a pretest (Day 0), posttest (Day 10), and delayed posttest (Days 13–16) to examine the effects of contextualised diagnostic feedback. Both groups engaged in six consecutive days of task repetition with 1-day intersession intervals. Each session involved an oral proficiency interview (OPI) task with a conversational AI agent, covering three fixed interview topics. Pretest and posttest scenarios were counterbalanced across participants, while practice sessions and the delayed posttest used a separate fixed set. The control group (left) received only estimated CEFR levels as minimal feedback, whereas the experimental group (right) received diagnostic feedback.
In both groups, all sessions—including a pretest, practice sessions, a posttest, and a delayed posttest—were individually conducted outside the lab. To this end, three different sets of OPI questions were prepared, and two of these sets were adopted for a pretest and a posttest and were counterbalanced across participants. The remaining set was used for the remedial learning activity. To maximise the potential learning gains, the time periods between the final practice session and posttest (ISI/RI ratio = 25%) and between the final practice session and the delayed posttest session (ISI/RI ratio = 10–14%) were set within the optimal range proposed in previous studies.
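The schedule arithmetic can be verified directly; the check below is our own calculation from the design in Figure 1, not the authors’ code. With practice on Days 1–6, the posttest on Day 10, and the delayed posttest on Days 13–16, the 1-day ISI yields exactly the ratios reported above.

```python
# Verifying the ISI/RI ratios implied by the design (our arithmetic).
ISI = 1                          # days between practice sessions
ri_posttest = 10 - 6             # final practice (Day 6) to posttest (Day 10)
ri_delayed_min, ri_delayed_max = 13 - 6, 16 - 6  # delayed posttest window

print(ISI / ri_posttest)                            # 0.25  -> 25%
print(ISI / ri_delayed_max, ISI / ri_delayed_min)   # 0.10 ... ~0.14 -> 10-14%
```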
Participants
We recruited a total of 80 Japanese learners of English at a private university in Japan via online advertisement.2 To control for the effects of learners’ proficiency levels, we randomly assigned participants to either the control or the experimental group using a group-matching technique based on their self-reported proficiency test scores. Twelve students did not take the pretest; the remaining 68 completed all the experimental sessions, adhering to the experimental schedule (e.g., 1-day ISI). However, after excluding participants whose recordings were not of sufficient quality, only the pretest, practice, posttest, and delayed posttest data from 59 students were included in the current study (see the Procedures section). Before the treatment sessions, we delivered a fully automated speaking test with a conversational AI agent (Saeki et al., 2024) as the pretest, which assessed overall oral proficiency on the CEFR scale of Overall Oral Interaction (hereafter, CEFR Overall; for the descriptors, see Council of Europe, 2020).3 The resulting CEFR level was used to decide the difficulty level of interview topics in the treatment sessions. The distribution of participants’ CEFR levels is summarised in Table 1.
Table 1. Distribution of overall CEFR levels based on a pretest across groups.
DLA programme
Diagnosis through an OPI task
During the DLA programme, participants repeated an OPI task with a conversational AI agent (Saeki et al., 2024), and each task repetition session served as both an assessment task and a remedial learning activity. As illustrated in Figure 1, the whole DLA programme spanned six days, with participants engaging in one practice session per day (i.e., six sessions in total). Each session consisted of three interview topics, ranging from familiar topics (e.g., breakfast, favourite season) to more abstract themes (e.g., community building, future trends) according to the target CEFR levels. Each topic comprised four to five questions. Building on the desirable difficulty framework (Bjork, 1994), the target difficulty level was set at one level higher than participants’ pretest CEFR Overall level (e.g., B1-level topics for A2-level learners). Although an adaptive test format was adopted in the pretest and posttest (for details, see the “Pretest and posttest tasks” section), the order and content of topics were fixed throughout the practice sessions to maximise the opportunity to retrieve the same target expressions. Each task repetition session lasted approximately 10 minutes.
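A minimal rendering of the “one level higher” targeting rule mentioned above is given below; this is our own sketch for illustration, and the cap at C2 is our assumption rather than the system’s documented behaviour.

```python
# Sketch of the desirable-difficulty targeting rule: pitch topics one
# CEFR level above the learner's pretest level (capped at C2 here).
CEFR_ORDER = ["A1", "A2", "B1", "B2", "C1", "C2"]

def target_topic_level(pretest_level: str) -> str:
    i = CEFR_ORDER.index(pretest_level)
    return CEFR_ORDER[min(i + 1, len(CEFR_ORDER) - 1)]

print(target_topic_level("A2"))  # B1, as in the example above
```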
Estimation and feedback
Participants’ oral performance at each task repetition session was submitted to an ML-based automated scoring system (Takatsu et al., 2026), which returned their levels of proficiency on the CEFR scale of General Linguistic Range (hereafter, Range; Council of Europe, 2020). This scale primarily taps into the breadth and depth of lexical repertoires and grammatical structures while partially referring to the smoothness of lexical retrieval and search (see Appendix 1 for the scale). In our scoring system, the probabilities for the CEFR levels are estimated via neural networks from multimodal features and then converted to a continuous value score. SHAP values are then computed to quantify how much each word in the transcribed dialogue contributes to the probability of the next CEFR level, and utterances that lower this probability are labelled as weaknesses (see Figure 2).

Figure 2. Example of visualisation of the contribution of each word of a transcribed dialogue to the B1 probability of a learner estimated to be at the A2 level of Range.
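The paper does not spell out the probability-to-score conversion; one plausible sketch, offered here as an assumption rather than the authors’ documented method, treats the continuous score as the probability-weighted expectation over ordinally encoded levels.

```python
# Hypothetical conversion of CEFR level probabilities to a continuous
# score: the expected level under the predicted distribution.
LEVELS = {"A1": 1, "A2": 2, "B1": 3, "B2": 4, "C1": 5, "C2": 6}

def continuous_cefr_score(probs: dict[str, float]) -> float:
    return sum(LEVELS[level] * p for level, p in probs.items())

# A learner straddling A2 and B1:
print(continuous_cefr_score({"A2": 0.55, "B1": 0.40, "B2": 0.05}))  # 2.5
```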
The utterances labelled as weaknesses were then submitted to a large language model to generate a list of paraphrased sentences; in this study, we employed GPT-4 (OpenAI et al., 2023). To preserve the participant’s original intention as much as possible and to provide target lexical items in a contextualised manner, we generated a list of ten candidate sentences for each target utterance. The generated sentences were then filtered based on a set of criteria: (a) semantic similarity to the student’s original utterance, measured by Word Rotator’s Distance (Yokoi et al., 2020); (b) appropriateness in spoken discourse, assessed with the assistance of Styleformer (Etinger & Black, 2019); and (c) the number of expressions at one level higher than the current CEFR level, based on the English Vocabulary Profile (Capel, 2010, 2012; http://vocabulary.englishprofile.org; for details, see Takatsu et al., forthcoming). The target expressions at one level higher (i.e., weaknesses) and those useful for maintaining the current CEFR level (i.e., strengths) are highlighted in red and blue, respectively. A simplified example of the diagnostic assessment record is available in Appendix 2. The complete version of the feedback sheet is available on the Open Science Framework (OSF; Suzuki, 2025).
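The shape of this filtering step can be sketched as follows. The scoring functions below are crude, runnable stand-ins (token overlap and a toy wordlist) for the actual tools named above (Word Rotator’s Distance and the English Vocabulary Profile); the spoken-style check via Styleformer is omitted for brevity, and all data are invented.

```python
# Schematic sketch of the paraphrase filter: keep candidates that stay
# close to the original meaning (criterion a) and contain at least one
# expression one CEFR level up (criterion c). All helpers are stand-ins.
NEXT_LEVEL_DEMO_WORDS = {"wonderful", "relaxing", "explore"}  # toy EVP-style list

def token_overlap(a: str, b: str) -> float:            # stand-in for WRD
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb)

def next_level_word_count(sentence: str) -> int:       # stand-in for EVP lookup
    return sum(w in NEXT_LEVEL_DEMO_WORDS for w in sentence.lower().split())

def filter_paraphrases(original: str, candidates: list[str],
                       sim_threshold: float = 0.3) -> list[str]:
    return [c for c in candidates
            if token_overlap(original, c) >= sim_threshold   # criterion (a)
            and next_level_word_count(c) > 0]                # criterion (c)

original = "my holiday was very good and I liked the beach"
candidates = [
    "my holiday was wonderful and the beach was relaxing",   # kept
    "the weather is nice today",                             # dropped: meaning drifts
    "my holiday was very good and I liked the beach a lot",  # dropped: nothing new
]
print(filter_paraphrases(original, candidates))
```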
Remedial learning
Our remedial learning activity consists of a score report activity and task repetition. The score report activity is designed to raise learners’ awareness of their learning through the subsequent task repetition and to direct their attention to expressions that they expect to be able to use by themselves (cf. Isbell, 2021). From a methodological perspective, this activity also helped the researchers ensure that participants actively engaged with their diagnostic feedback, which could enhance the internal validity of the experimental condition. Participants were asked to review the diagnostic assessment record and then to rate each of the paraphrased sentences on a 5-point scale in terms of how likely they would be able to use the suggested expressions by themselves.
Pretest and posttest tasks
Both pretests and posttests were conducted in the same OPI format with the conversational AI agent as the practice sessions; however, the difficulty level of topics was adaptively adjusted during the OPI based on the incremental automated assessment of the overall CEFR level, to give learners the opportunity to demonstrate the upper limit of their performance (for details, see Saeki et al., 2024). The pretests and posttests included three main topics in addition to a warm-up phase and a closing phase, following the ACTFL OPI (Salaberry, 2000). The format of the pretest and posttest was identical, while the pool of topics presented was counterbalanced so that we could test the transfer of learning effects to a new interview task. Only the recordings of responses to the main topics were submitted to the aforementioned automated scoring system.
Procedures
Pretest and posttest sessions
All the materials for the pretest and posttest sessions were delivered online. Participants were first asked to respond to a set of questionnaire items on individual difference factors using Qualtrics (www.qualtrics.com); these data are beyond the scope of the current study and thus are not reported in this paper. After completing the questionnaire, participants accessed another website specifically designed for this experiment to take the OPI with a conversational AI agent (see Figure 3). At the end of the posttest, participants also responded, on a 6-point scale, to a debriefing questionnaire about their opportunities to refer to our feedback sheet and to use English outside the current DLA programme.

Figure 3. Screenshot of the oral proficiency interview task with a conversational AI agent.
Practice sessions and a delayed posttest
The procedure for the practice sessions differed between the first session (Day 1) and the remaining sessions (Days 2–6). On Day 1, students only completed the OPI with the conversational AI agent. On the subsequent days, they received feedback on their performance on the previous day from the researchers via email: only the overall CEFR level for the control group, and the diagnostic feedback for the experimental group. Participants in both groups completed the task repetition activity at the same time of day as much as possible so that the ISI approximated 24 hours. To ensure that participants in the experimental group processed the diagnostic feedback, they were required to complete the score report activity in Qualtrics prior to the task repetition activity. To preserve the internal validity of task repetition as retrieval practice, participants were instructed not to view the feedback report during the task repetition activity. A delayed posttest was conducted using the same set of topics as the practice sessions. The time interval between the final practice session and the delayed posttest was initially set at 1 week (ISI/RI ratio = 14%). However, to reduce participant dropout due to difficulties in adhering to the experimental schedule, we allowed the delayed posttest to be completed up to ten days after the final practice session. This adjustment kept the ISI/RI ratio (10%) within the ideal range suggested by previous studies (i.e., 10%–30%; see Kakitani & Kormos, 2024; Rohrer & Pashler, 2007).
Analysis
Given the informal nature of the current practice activity, it is plausible that some participants might not have engaged with the remedial learning activity. For the sake of the internal validity of the treatment, such participants should be excluded from the analyses. Following the literature on learner engagement (Hiver et al., 2021), we analysed participants’ speech data in terms of the number of words produced and the mean length of turns in words as proxies for behavioural engagement. With reference to these objective indices, we manually checked the video recordings of the practice sessions. One student in the experimental group continuously provided irrelevant responses to the AI agent, and eight participants (four from each group) failed to complete the interview task appropriately due to automatic speech recognition problems primarily caused by noise in the learners’ audio input. After excluding these cases, the speaking performance data of 59 participants were submitted to the subsequent analyses.
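The two behavioural proxies are straightforward to compute from turn-level transcripts; the sketch below is our own construction for illustration, not the authors’ screening code, and the turns are invented.

```python
# Behavioural engagement proxies from a learner's transcribed turns:
# total words produced and mean turn length in words.
def engagement_indices(turns: list[str]) -> tuple[int, float]:
    words_per_turn = [len(turn.split()) for turn in turns]
    total_words = sum(words_per_turn)
    return total_words, total_words / len(words_per_turn)

learner_turns = [
    "I usually eat bread and coffee",
    "because it is quick before work",
    "yes I sometimes cook on weekends",
]
print(engagement_indices(learner_turns))  # (18, 6.0)
```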
To compare lexical performance within the practice sessions (RQ1) and at the delayed posttest (RQ2) between groups, we constructed a linear mixed-effects model. The automatically estimated numerical CEFR Range score was used as the outcome variable, and the fixed-effect predictor variables of Group (control vs. experimental) and Time were included as between-subject and within-subject variables, respectively. Regarding the predictor variable of Time, we narrowed our scope to three crucial time points to minimise the rate of Type II errors: Day 2 as learners’ first performance after receiving feedback, Day 6 as the final performance during the task repetition sessions, and the delayed posttest. The difference between Day 2 and Day 6 reflects potential learning gains within the DLA programme, whereas the comparison between Day 6 and the delayed posttest gives insights into the durability of learning through the current DLA programme. The interaction term between Group and Time was included to detect the differential effects of the treatment programmes (with vs. without diagnostic feedback) across time points. Following previous experimental studies on task repetition (Kakitani & Kormos, 2024; Y. Suzuki & Hanzawa, 2022), we also included participants’ CEFR Range score at Day 1 as a covariate to account for potential baseline differences between the two groups. The model also included a random intercept for individual participants. To identify the location of statistically significant between-group and within-group differences, post-hoc comparisons were conducted with the Tukey adjustment method to account for multiple comparisons. To test the transfer of learning gains to a new context (RQ3), we built another linear mixed-effects model predicting the CEFR Range score from the fixed-effect predictor variables of Group (control vs. experimental), Time (pretest vs. posttest), and their interaction term, with random intercepts for individual participants and interview scenarios. All statistical analyses were conducted in R version 4.0.2 (R Core Team, 2020).
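The analyses were run in R; purely for illustration, an analogous random-intercept specification of the RQ1/RQ2 model can be written in Python with statsmodels. The file and column names here are assumptions of this sketch, not the study’s materials.

```python
# Illustrative respecification of the RQ1/RQ2 model: Range score predicted
# by the Day 1 covariate plus the Group x Time interaction, with a random
# intercept per participant. A long-format data layout is assumed.
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("practice_scores.csv")  # hypothetical file with columns:
# participant, group (control/experimental), time (day2/day6/delayed),
# range_score, day1_score

model = smf.mixedlm(
    "range_score ~ day1_score + group * time",
    data=df,
    groups="participant",        # random intercept for each learner
).fit()
print(model.summary())
```

Tukey-adjusted post-hoc contrasts, as reported in the paper, would then be obtained from the fitted model’s estimated marginal means (in R, typically via a dedicated package such as emmeans).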
In addition to the statistical analyses, to capture participants’ perceptions of the current diagnostic assessment record, the responses from the experimental group to the open-ended questions about its strengths and challenges were coded using inductive content analysis (Selvi, 2019). The first author initially open-coded the responses to establish the coding scheme. The third author then blind-coded all the responses; the initial agreement rates were 65.4% for strengths and 95.0% for challenges. All disagreements were resolved through discussion between the two coders.
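The initial inter-coder agreement rates reported above amount to simple percent agreement over the double-coded responses; a minimal sketch with invented codes:

```python
# Percent agreement between two coders over the same responses
# (codes invented for illustration, not the study's data).
coder1 = ["contextualisation", "actionability", "awareness", "actionability"]
coder2 = ["contextualisation", "awareness", "awareness", "actionability"]

agreement = sum(a == b for a, b in zip(coder1, coder2)) / len(coder1)
print(f"{agreement:.1%}")  # 75.0%
```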
Results
Lexical changes within practice sessions and 1-week retention
Figure 4 illustrates participants’ CEFR Range scores during the DLA programme and at the delayed posttest for both the experimental and control groups (for descriptive statistics, see Table 2). To test the statistical differences in the pattern of lexical performance changes between the groups across the three time points (Day 2, Day 6, delayed posttest), we constructed a linear mixed-effects model predicting the CEFR Range scores from the fixed-effect predictor variables of Time, Group, and their interaction term, with scores at Day 1 as the covariate (see Table 3). The results indicate a significant positive effect of participants’ Range score at Day 1, confirming the importance of considering the baseline score for performance changes. The regression model also showed a significant interaction effect between Time [Day 6] and Group (β = –0.938). Post-hoc comparisons (see Tables 4 and 5) indicated that the control group significantly improved its Range scores from Day 2 to Day 6 but did not retain this improvement at the delayed posttest, whereas the experimental group showed no significant changes across the three time points.

Figure 4. Range scores during practice sessions and at the delayed posttest.
Table 2. Descriptive statistics of CEFR Range scores.
Table 3. Model summary for practice data.
Table 4. Post-hoc comparisons for practice data.
Table 5. Estimated marginal means and 95% confidence intervals of Range scores.
Pretest–posttest changes
Figure 5 illustrates the Range scores at the pretest and posttest for both groups. To compare participants’ Range scores between the two groups at the posttest, we constructed another linear mixed-effects model predicting Range scores from the fixed-effect variables of Time (pretest vs. posttest) and Group (control vs. experimental), with random intercepts for individual participants and interview scenarios (see Table 6). The results indicated that the interaction effect between Time [Posttest] and Group approached statistical significance (β = –0.440). Post-hoc comparisons (see Table 7) showed that the experimental group outperformed the control group at the posttest with a near-medium effect size.

Figure 5. Range scores at the pretest and posttest.
Table 6. Model summary for pretest–posttest comparison.
Table 7. Post-hoc comparisons for pretest and posttest data.
Learners’ perceptions of diagnostic feedback
To evaluate the current diagnostic feedback from the learners’ perspective, we summarised participants’ responses to the two open-ended questions in the debriefing questionnaire, which addressed the strengths and challenges of our feedback, respectively (see Table 8). Regarding strengths, seven participants pointed out the benefits of contextualised feedback in enhancing their awareness of alternative expressions, while four noted the high actionability of the feedback for their subsequent learning. As for challenges, two participants stated that they would have preferred multiple paraphrase options so that they could choose their preferred expressions by themselves. Although this learning preference might not be predominant among the current participants, it suggests that diagnostic feedback should be designed to facilitate learner autonomy (cf. Isbell, 2021). Finally, technological challenges related to the quality of the paraphrases, such as automatic speech recognition accuracy, were pointed out.
Table 8. Feedback categories relating to the diagnostic assessment feedback.
Discussion
Learning gains during the practice sessions and their retention
To examine the learning gains in lexical performance within the current DLA programme (RQ1) and their retention (RQ2), we conducted linear mixed-effects modelling. The results indicated that participants in the control group, who only engaged in the exact task repetition activity, continually improved their lexical performance within the practice sessions, whereas the improvement was not sustained over the 7- to 10-day RI. Given that the current measure of lexical performance (i.e., the CEFR Range score) is theoretically reflective of the breadth and depth of lexicogrammatical resources (Council of Europe, 2020), the score gains during the practice sessions might indicate an expansion of linguistic repertoires, such as the use of advanced lexical items and complex sentence structures. However, given the failure to retain the improvement at the delayed posttest, it might be more reasonable to regard the within-practice improvement as practice effects. From the perspective of speech production, one possible scenario is that the task repetition activities might have increased the resting activation level of relevant lemmas in learners’ mental lexicon, which could subsequently enhance the accessibility of other less accessible but advanced lemmas (Kormos, 2006). Yet the enhanced use of those linguistic items was likely temporary, as the higher resting activation level alone may not have been sufficient for their consolidation through six repetitions of identical communicative tasks.
In contrast, the experimental group, who engaged with diagnostic feedback through the score report activity, exhibited neither practice effects during the practice sessions nor retention effects. The lack of practice effects may indicate that, despite the same interview topics, participants in the experimental group engaged with the interview task differently from session to session. Given that the only difference between the groups was the inclusion of diagnostic feedback, it is plausible that, despite the lack of significant improvement in the target variable, the current diagnostic feedback at least differentiated their approach to the same OPI task. From the perspective of L2 development, such a differentiated approach to speech production may include the elaboration of new linguistic items and/or the modification of the lexical organisation of learners’ mental lexicon (Housen et al., 2012). These learning processes are theoretically assumed to be reflected in enhanced lexical complexity of speaking performance (Housen et al., 2012). By contrast, the consolidation of these learning processes (i.e., automatization) can be observed as enhanced fluency in a later phase of development. Accordingly, the lack of practice effects in the experimental group might indicate that participants’ L2 systems were in the middle of elaborating and restructuring their mental lexicon and had not reached sufficient consolidation of the target linguistic items. This may also explain the lack of retention effects in the experimental group, despite the research-informed ISI/RI ratio of the current task repetition schedule. One possible reason for the emerging status of consolidation of target L2 items might be related to the way in which the current system selects target items across practice sessions. The DLA system identifies the utterances that lower the probability of being estimated at the level higher than the current CEFR level (i.e., weaknesses). The current DLA simply ran this system on each task repetition performance, with no mechanism to keep the same target items in the diagnostic feedback until learners are able to use them. Although the exact task repetition may have created opportunities to use the expressions suggested in previous sessions, it is worth exploring algorithms that select suggested expressions repetitively to optimise learning efficiency longitudinally (see Maie et al., 2025).
Effects of diagnostic assessment on lexical development
The third research question concerned the transfer effects of our DLA programme on lexical performance with novel speaking topics, in terms of pretest–posttest change. The results demonstrated that the experimental group outperformed the control group at the posttest with a near-medium effect size. Given the experimental design of the current study, it is plausible to regard this result as positive evidence of the effectiveness of the current DLA programme for vocabulary learning. In other words, the current approach to weakness identification through speaking performance using XAI techniques, together with the contextualised diagnostic feedback, can be considered to meaningfully facilitate the restructuring of learners’ mental lexicon. In terms of effect sizes, the current findings are comparable with previous studies reporting the effects of task repetition combined with form-focused activities on item-based learning (Tran & Saito, 2021).
One may argue that the transfer effect in the experimental group appears to contradict the lack of meaningful improvement in lexical performance during the practice sessions and at the delayed posttest. One possible interpretation of these seemingly inconsistent findings lies in the interplay between the time interval separating the final practice session from the posttest and the difference in test format between the practice sessions and the posttest. From the perspective of cognitive psychology, longer spacing commonly results in better learning gains than shorter spacing or a massed condition (Cepeda et al., 2008; Rohrer & Pashler, 2007; see also Kakitani & Kormos, 2024). Accordingly, the 1-day ISI during the practice sessions may not have been long enough to integrate the items retrieved in previous sessions into learners’ long-term memory. By contrast, the 4-day interval between the final practice session and the posttest, which yields an optimal ISI/RI ratio (25%), might have been long enough to assist learners in utilising the expressions suggested by the diagnostic feedback (for a similar discussion, see Y. Suzuki & Hanzawa, 2022). Additionally, it should be noted that the posttest followed the adaptive test format, as opposed to the fixed set of topics used during the practice sessions and at the delayed posttest. In other words, the posttest may have been more likely to elicit learners’ best performance and to capture the emerging developmental changes in their L2 systems, compared with the delayed posttest.
Limitations and directions for future research
The current study highlights the potential of weakness identification through speaking performance by means of ML, NLP, and XAI techniques, as well as the effectiveness of actionable and contextualised diagnostic feedback achieved by closely integrating all three components of diagnostic assessment. To evaluate the learning gains through diagnostic assessment, the study adopted an experimental design in response to the call for robust validity evidence (Isbell, 2021; Lee, 2015). The results showed that the control group (task repetition only) demonstrated significant improvement during the remedial learning sessions but failed to transfer and retain the learning gains. In contrast, despite the lack of practice effects, the experimental group (task repetition with diagnostic feedback) outperformed the control group at the posttest with a near-medium effect size. However, we interpret the current findings cautiously as positive evidence for our proposed approach to DLA for spontaneous speaking skills, in light of the methodological limitations of the current study.
First and foremost, the duration of the current DLA programme was shorter than that of previous intervention studies on speaking skills, which typically detect significant improvement after one academic semester (e.g., Saito et al., 2021) or even longer periods (e.g., Taguchi et al., 2022). To gain more insights into the short-term and long-term learning gains through the current DLA programme, more extended longitudinal studies with multiple time points and different ISIs are necessary.
Second, we adopted the score report activity to ensure that participants processed their diagnostic feedback, for the sake of the internal validity of our DLA programme. However, this simultaneously implies that the current findings are conditional on the inclusion of that activity, meaning that the effectiveness of our approach in other contexts may depend on the depth with which learners process diagnostic feedback. Future studies should explore the optimal design and format of feedback, such as the number of suggested expressions (cf. Maie et al., 2025).
Third, following previous studies on task repetition (Kakitani & Kormos, 2024; Y. Suzuki & Hanzawa, 2022), we adopted the same interview topics for the practice sessions and the delayed posttest to test the retention effects under comparable conditions. Accordingly, we decided not to include a delayed posttest with another new set of topics, because it could have made the research design unmanageably complex and also could have made it difficult to maintain the ISI-RI ratio within the optimal range. However, future research should examine how durable the transfer effects of DLA are, using a more longitudinal design.
Fourth, as the suggested expressions of the current feedback system are based only on learners’ actual utterances, the system cannot provide feedback on what learners cannot express or what they avoid using. Towards more sophisticated diagnostic feedback, a variety of AI-based technologies could be combined to detect learners’ linguistic breakdowns (including avoidance) and to identify challenging lexical items based on predictions of what learners intend to express.
Finally, although the current study tested improvement in lexical performance during the DLA programme, retention after the sessions, and transfer to a new task, further insights could be obtained through careful investigation of the lexical items that learners actually used. For instance, by combining the rate of uptake of the suggested expressions with learners’ responses in the score report activity, it might be possible to characterise actionable learning items. However, some methodological advances might be needed to distinguish items newly learnt through diagnostic assessment from those known by the learner beforehand.
Footnotes
Appendix 1
CEFR descriptor of general linguistic range.
| Level | Descriptor |
|---|---|
| C2 | Can exploit a comprehensive and reliable mastery of a very wide range of language to formulate thoughts precisely, give emphasis, differentiate and eliminate ambiguity. No signs of having to restrict what they want to say. |
| C1 | Can use a broad range of complex grammatical structures appropriately and with considerable flexibility. |
| B2 | Can express themselves clearly without much sign of having to restrict what they want to say.<br>Has a sufficient range of language to be able to give clear descriptions, express viewpoints and develop arguments without much conspicuous searching for words/signs, using some complex sentence forms to do so. |
| B1 | Has a sufficient range of language to describe unpredictable situations, explain the main points in an idea or problem with reasonable precision and express thoughts on abstract or cultural topics such as music and film.<br>Has enough language to get by, with sufficient vocabulary to express themselves with some hesitation and circumlocutions on topics such as family, hobbies and interests, work, travel and current events, but lexical limitations cause repetition and even difficulty with formulation at times. |
| A2 | Has a repertoire of basic language which enables them to deal with everyday situations with predictable content, though they will generally have to compromise the message and search for words/signs.<br>Can produce brief, everyday expressions in order to satisfy simple needs of a concrete type (e.g. personal details, daily routines, wants and needs, requests for information). |
| A1 | Has a very basic range of simple expressions about personal details and needs of a concrete type. Can use some basic structures in one-clause sentences with some omission or reduction of elements. |
Appendix 2
A simplified example of the Diagnostic Assessment Feedback.
Sentences paraphrased into higher level expressions.
| Dialog | Paraphrase | Word Sense (C2) | Word Sense (C1) |
|---|---|---|---|
Acknowledgements
We are grateful to
Author contributions
Declaration of conflicting interests
The authors declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: The conversational AI agent and the automated scoring system used in the present study were originally developed for research purposes as part of a research project at Waseda University, Japan. These systems were subsequently adapted and are currently maintained by Equmenopolis, Inc. for commercial use. The second, fifth, and sixth authors are affiliated with Equmenopolis, Inc.: Mao Saeki and Hiroaki Takatsu are employees, and Yoichi Matsuyama serves as CEO/CTO of the company. None of the other authors disclosed any conflicts of interest.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study was supported by JPNP20006 (“Online Language Learning AI Assistant that Grows with People”) from the New Energy and Industrial Technology Development Organization (NEDO).
Supplemental material
Supplemental material for this article is available on OSF (Suzuki, 2025).
