Sage Journals: Discover world-class research

Abstract

This study aimed to develop and validate an analytical rating scale specifically designed to assess the lexical proficiency of Chinese college students in Academic English speaking tasks. A multi-layer construct of lexical proficiency was first developed and operationalized into an eight-dimension rating scale, including word diversity, word sophistication, word accuracy, and word fluency, phraseological diversity, phraseological sophistication, phraseological accuracy, and phraseological fluency. The reliability and validity of the scale were then examined by using the many-facet Rasch model and confirmatory factor analysis with 600 academic speech samples. The results showed that the rating scale could successfully and reliably differentiate levels of lexical proficiency among participants. High construct validity was confirmed with excellent model fit and support for the proposed dimensions. The proposed analytical rating scale provides a practical approach to comprehensively assessing lexical proficiency in academic speaking contexts. The study has implications for L2 vocabulary instruction, assessment, and speaking assessment in L2 academic contexts.

Plain Language Summary

Developing and validating an analytical rating scale to evaluate the lexical proficiency of Chinese college students in their academic English speeches

A large body of research on lexical vocabulary has been conducted by using a quantitative approach, with few attempts to evaluate it through human rating. Even involved as an essential component in grading the speaking or writing proficiency in testing practice by human raters, lexical proficiency fails to be assessed independently and comprehensively in this approach. Besides, automated scoring needs a human rating as a reference. Taking these issues into consideration, we attempted to explore the potential of the human assessment of lexical proficiency in academic English speech. In this study, we developed and validated an analytical rating scale for lexical proficiency among Chinese learners of English with varying levels of English proficiency and from diversified university backgrounds. The study process included three steps. First, we established a multi-dimensional construct of lexical proficiency targeted to this research based on the existing literature and testing practice. Second, we operationalized the construct into an analytical rating scale. The preliminary scale was tested through a pilot study in which 200 samples of academic English speeches were rated by trained raters and analyzed statistically. The scale was further revised based on the rater interview, reflection and statistical data. Finally, the raters received a new round of training to use the revised scale to evaluate another 400 samples. The reliability and construct validity of the rating scale were tested. The research findings indicate that the constructed analytical rating scale is capable of effectively evaluating the lexical proficiency of Chinese college students in their academic English speeches.

Keywords

lexical proficiency analytical rating scale reliability validity academic English speech

Introduction

Developing proficiency in academic English oral communication is a crucial objective for college students learning English as a second language. A persistent obstacle to achieving this goal is students’ inadequate L2 lexical proficiency, which hinders their development of academic speaking abilities (Dang et al., 2017). As an indispensable part of communicative competence (Bachman & Palmer, 2010), lexical proficiency plays a vital role in academic oral contexts, where a sophisticated and precise command of vocabulary is essential for effective communication (Crossley et al., 2013). The critical significance of lexical competence in L2 speaking performance has been well supported by empirical evidence. For example, vocabulary measures can explain substantial variance in speaking performance, ranging from 63% for vocabulary size alone to over 71% when combined with vocabulary depth measures (Enayat & Derakhshan, 2021; Koizumi & In’nami, 2013). Moreover, various vocabulary types make differential contributions to speaking performance. High-frequency productive vocabulary (the 3,000 most frequent word families) demonstrates predictive power of speaking competence (Alharthi, 2020). High-frequency receptive vocabulary predicts overall oral proficiency while mid-frequency receptive vocabulary serves as a key predictor of lexical resources (Derakhshan & Enayat, 2020). Given this well-established relationship between lexical knowledge and speaking performance, effective vocabulary instruction must be built on accurate diagnosis of learners’ particular vocabulary strengths and weaknesses. Such a diagnosis requires comprehensive and reliable assessment tools that can provide insights into students’ lexical competence in academic speaking contexts.

However, current assessments of lexical proficiency in academic English speech remain problematic due to limitations in existing approaches. Traditional vocabulary assessments mainly focus on discrete-point, decontextualized testing formats such as form-meaning matching, multiple-choice questions, or word association tasks (Read, 2021). While this type of assessment proves valuable, it is unable to fully reflect the multi-dimensional nature of lexical proficiency in real language contexts.

Recent performance-based lexical assessment studies have adopted computational approaches to addressing dimensions such as lexical diversity and lexical sophistication (e.g., Kyle & Crossley, 2015; Read, 2000). Among the body of quantitative research on productive lexical competence assessment, focus is either on traditional word-level analysis (Lu, 2012) or on formulaic sequence analysis (Xu, 2018). Comprehensive approaches that examine both single-word and multi-word dimensions have emerged only recently (Kim et al., 2018; Kyle & Crossley, 2015; Leńko-Szymańska, 2020). In addition, although academic vocabulary represents a distinct feature of Academic English oral discourse (Smith et al., 2020), research on assessing academic lexical competence in learners’ academic oral English is pretty limited (Kyle & Crossley, 2015).

Apart from limitations in quantitative research, significant gaps exist in the qualitative evaluation of lexical competence as a construct. So far, few studies have attempted qualitative, human-rated lexical assessments that serve as a baseline for automated scoring of vocabulary use in an academic context (Crossley et al., 2011, 2015). Even in testing practice where human rating is required, lexical assessments in high-stakes academic English tests remain inadequate. The lexical proficiency in TOEFL and IELTS speaking tests is often scored as a component of rubrics of speaking proficiency rather than as an independent and comprehensive measure of a construct (TOEFL iBT^® Independent Speaking Rubric; ielts-speaking-band-descriptors.pdf). Such blending fails to provide diagnostic feedback for students on their lexical strengths and weaknesses.

Additionally, existing lexical rating scales require further investigation. One significant limitation of the analytical rating scale developed by Crossley et al. (2015) is that it lacks validation analysis. The latest research methodology in applied linguistics emphasizes that the development of a reliable and valid rating scale for lexical proficiency as a complex multi-dimensional construct needs to be undertaken based on theoretical foundations and empirical validation measures, such as Rasch modeling, exploratory factor analysis (EFA), confirmatory factor analysis (CFA) or exploratory structural equation modeling (ESEM; Cong-Lem, 2025). To bridge these gaps, this study aims to develop an independent multi-dimensional analytical rating scale based on validation analysis to evaluate L2 learners’ lexical proficiency in academic English oral discourse.

Theoretical Background

Models of Lexical Proficiency

In the field of L2 acquisition research, lexical proficiency typically refers to the knowledge and ability to use vocabulary in a language (Council of Europe, 2018). Research on lexical proficiency can be primarily explored from cognitive and communicative perspectives. Lexical proficiency in cognition is closely related to lexical knowledge, including declarative and procedural knowledge of vocabulary (Bulté et al., 2008). Descriptions of lexical proficiency from a communicative perspective, on the other hand, integrate vocabulary knowledge and processing with the context and strategic use (Chapelle, 1994; Council of Europe, 2018). These theories have laid a solid theoretical foundation for lexical ability assessment (Z. Wu & Luo, 2020). However, the lexical knowledge they described represent inherent and unobservable implicit knowledge (Chapelle, 1998).

When lexical proficiency is defined as a behavioral construct, which can be observable and measurable, to reflect the ability of L2 learners to use vocabulary for communication (Leńko-Szymańska, 2020), this line of research primarily focuses on constructs such as lexical diversity, lexical sophistication, lexical accuracy, and lexical fluency (Bulté & Housen, 2012; Bulté et al., 2008; Wolfe-Quintero et al., 1998). It is these behavioral constructs that provide the theoretical background for this research.

Wolfe-Quintero et al. (1998) classified lexical proficiency into lexical complexity, lexical accuracy, and lexical fluency, and classified lexical complexity into lexical density, lexical diversity, and lexical sophistication. Bulté et al. (2008) proposed a hierarchical three-level (cognitive, behavioral, and statistical) construct model of lexical proficiency. The behavioral level consists of five dimensions: lexical diversity, lexical sophistication, lexical complexity, lexical productivity, and lexical fluency. Both models focused on the single-word level. Following the three dimensions (complexity, accuracy, fluency, CAF) of language proficiency framework (Housen & Kuiken, 2009; Skehan, 1998), Bulté and Housen (2012) focused on complexity alone and developed a lexical complexity model, which included density, diversity, compositionality, and sophistication. Notably, this model paralleled single words and collocations with the same weight, which marked the researchers’ theoretical concern about collocations as an integral part of lexical proficiency.

Lexical proficiency at the phraseological level has also been under scrutiny. Xu (2018) proposed a spoken collocational competence model to measure collocations by complexity, accuracy, and fluency. Paquot (2019) and Paquot et al. (2022) measured phraseological complexity through dimensions of phraseological diversity and phraseological sophistication in both learners’ written as well as spoken discourses. This phraseological line of research reflects the growing recognition of collocations in the field of vocabulary assessment (Eguchi & Kyle, 2023; Kyle & Crossley, 2015; Read, 2000). Leńko-Szymańska (2020) proposed a model of lexical proficiency with both single words and collocations after the author had comprehensively measured lexical complexity, accuracy, and fluency in the written texts of L2 university students.

Crossley et al. (2013, 2015) categorized lexical knowledge as collocation accuracy, lexical diversity, and several psycholinguistic dimensions, based on which they developed an analytical rating scale to validate computational lexical measures and explore the relationships between holistic and analytical assessments of L2 lexical proficiency.

Limitations of Existing Lexical Models

These lexical proficiency constructs all emphasize the central role of lexical knowledge and provide valuable theoretical framework. However, these constructs show limitations for assessing L2 lexical proficiency in academic contexts. For example, the use of academic vocabulary in L2 academic contexts is insufficiently explored in existing models. The use of academic vocabulary is not only a distinctive feature of academic English but also a key factor that can manifest L2 students’ the academic English communicative competence of L2 students (Gardner & Davies, 2014). However, most existing lexical proficiency models do not address this issue. Although Chapelle (1994) addressed contextual factors, her framework lacks an adequate description of specific vocabulary use in academic English.

Second, the significant value of formulaic sequences has been underestimated. Formulaic sequences play a pivotal role in L2 lexical proficiency assessment (Read, 2000; Wood, 2010), and fluent use of formulaic sequences indicates high language proficiency (Kim et al., 2018; Kyle & Crossley, 2015. However, many models fail to take phraseological dimensions into theoretical frameworks. For example, Wolfe-Quintero et al.’s (1998) lexical proficiency model is comprehensive in its breadth plus accuracy, but it lacks its depth without probing into the use of formulaic sequences. While Bulté and Housen (2012) absorbed formulaic sequences in the theoretical construct, they didn’t provide feasible observational dimensions for formulaic expressions. Moreover, some research categorizes formulaic sequences as a subdivision of sophistication (Kim et al., 2018; Kyle & Crossley, 2015). Such treatment overlooks the multi-dimensional characteristics of formulaic sequences, which, like single words, can be examined in terms of diversity, sophistication, accuracy, and fluency (Paquot, 2019; Xu, 2018), leading to conceptual and practical confusion.

Additionally, current L2 lexical proficiency research is primarily carried out through computational approaches, while human-judgment-based exploration is quite insufficient. Although Leńko-Szymańska (2020) proposed a lexical proficiency framework concerning lexical complexity, lexical accuracy, and lexical fluency at both single-word and phraseological levels, the model was grounded in a quantitative approach.

Research Questions

Based on a systematic analysis of theoretical and methodological limitations, this research aims to develop an analytical rating scale to evaluate lexical proficiency in L2 academic English speaking, by integrating both single-word and formulaic sequences, and the academic context with a human-judgment-based assessment approach. It attempts to answer the following questions:

(1) What dimensions constitute the construct of lexical proficiency in the context of Academic English speaking, and how are they operationalized into an analytical rating scale?

(2) To what extent does the analytical rating scale show adequate reliability and construct validity when evaluating Chinese college students’ lexical proficiency in their Academic English oral performances?

In the sections that follow, we will develop a theoretical analytical framework for the construct of lexical proficiency, operationalize its components, and then establish an analytical rating scale based on the construct operationalization and testing practice. The reliability and construct validity of the rating scale proposed in the second question will be confirmed by empirical evidence.

Construct Definition and Operationalization

This section will answer the first research question of the constituents of the lexical proficiency construct and its operationalization.

Construct Definition

The present study focuses on assessing the lexical proficiency of Chinese college students in general academic English oral performances. To define an appropriate construct applicable to this study, we carefully examined the behavioral models of vocabulary ability at both the single and phraseological levels, then chose Wolfe-Quintero et al.’s (1998) triad model as a primary single-word framework for several reasons: (1) The model was most comprehensive in its inclusion of complexity, accuracy and fluency despite its lack of phraseological concern.

(2) We integrated the dimensions presented in Bulté et al.’s (2008) and Bulté and Housen’s (2012) models into Wolfe-Quintero et al.’s (1998) model. For example, among Bulté et al.’s (2008) five dimensions at the behavioral level: lexical diversity, lexical sophistication, lexical complexity, lexical productivity, and lexical fluency, lexical complexity was used to evaluate a learner’s ability to understand or use the deep knowledge of a specific word through forms of cloze or gap-filling tests rather than learners’ oral productions targeted in this study. Therefore, this dimension was left out. On the other hand, the dimensions of lexical diversity and lexical productivity in Bulté et al.’s (2008) were merged into lexical diversity in Bulté and Housen’s (2012), suggesting that lexical productivity (the number of words a speaker uses to complete a given task, Bulté et al., 2008) be an indicator of lexical diversity assessment, as has been involved in many studies (Jarvis, 2017; Lu, 2012). Besides, lexical compositionality proposed by Bulté and Housen (2012) referred to the number of lexical forms or semantic elements, which may be applicable to L2 learners with low language abilities or at young age (Zareva, 2019), while less relevant for the college students with relatively high English abilities in this study.

To fill the gap in Wolfe-Quintero et al.’s (1998) model, we add Xu’s (2018) and Paquot’s (2019) phraseological construct to align with the single-word construct, both constituting the construct of lexical proficiency in this study. It is defined as Chinese college students’ knowledge of and ability to use English vocabulary in the context of general academic oral performance. It consists of three primary dimensions: lexical complexity, lexical accuracy, and lexical fluency. In this case, lexical complexity is classified into word complexity and phraseological complexity. Word complexity is categorized into word diversity and word sophistication, and phraseological complexity into phraseological diversity and phraseological sophistication. Lexical accuracy is classified into word accuracy and phraseological accuracy. Lexical fluency consists of word fluency and phraseological fluency (Figure 1).

Figure 1.

Construct of lexical proficiency in academic English oral performance.

Construct Operationalization

The instrument for this study was a rating scale of lexical proficiency. Scale development concerns the operationalization of rubric dimensions, the establishment of the rating scale, and its optimization. Based on the construct definition, there are eight lexical variables as indicators of EFL leaners’ lexical proficiency in academic English oral productions: word diversity, word sophistication, word accuracy, word fluency, phraseological diversity, phraseological sophistication, phraseological accuracy and phraseological fluency. These variables were operationally defined as follows:

Word diversity refers to the richness and variation of word use, achieved through synonyms, hyponyms, paraphrases, and so on, to avoid repetition (Jarvis & Daller, 2013; Read, 2000). If a participant’s spoken output is rich and diverse with low repetition, the sample will receive a high score in the word diversity dimension.

Word sophistication characterizes the size or proportion of high-frequency academic words, uncommon words, and advanced words (Kim et al., 2018; Kyle & Crossley, 2015). If academic words, uncommon words, or other advanced words that appear in formal contexts are used more frequently, it indicates a stronger level of word sophistication used by a learner.

Word accuracy is the extent to which learners present error-free word usage, considering errors in word form, word formation, semantics, and morphology within the discourse context (Leńko-Szymańska, 2020; Saito et al., 2016). Fewer such errors in the learner’s sample indicate a higher level of word accuracy.

Word fluency is defined as the speed with which the learner produces or encodes words (Bulté et al., 2008), considering pauses, speed, and repairs (Bosker et al., 2013; Skehan, 2003). Shorter duration and lower frequency of pauses, lower frequency of self-repairs and repetitions perceived in a sample reflect a higher level of word fluency of the participant.

Phraseological diversity refers to the size and variety of formulaic sequences used (e.g., lexical and grammatical collocations, phrasal verbs, idioms; Stengers et al., 2011; Wood, 2010; Wray, 2002). A greater number of different formulaic expressions indicates a higher level of phraseological diversity.

Phraseological sophistication refers to the size of high-frequency academic or other uncommon, advanced formulaic sequences (Paquot, 2019). More academic, uncommon, or formal collocations, phrases, and idioms used suggest a greater proficiency in the use of sophisticated phrases by the learner.

Phraseological accuracy refers to the extent to which L2 learners present error-free phraseological usage in their academic speaking performance, including precise meaning, appropriate form, and morphological use (Huang, 2015; Xu, 2018). Fewer errors in form, usage, and meaning represent a higher level of phraseological accuracy.

Phraseological fluency is the degree of automation in phrase production, measured by the proportion of non-fluent formulaic sequences (Xu, 2018). Fewer noticeable pauses, repetitions, or self-repairs within phrases indicate a stronger level of phraseological fluency (Bosker et al., 2013).

Methodology

Participants

The study involved 600 participants whose academic English speech data were randomly selected from the MCAESCL (A Multimodal Corpus of Academic English Speech by Chinese Learners) database, which was designed to investigate the differences in students’ critical thinking abilities and their use of multimodality in academic speaking discourse (Chen, 2022). The corpus comprises over a thousand prepared speeches (still expanding) by sophomore students (about 19 years old, with nearly even gender distribution) from various levels of universities in China, from top national universities to general provincial universities. A total of 78 majors from humanities, social sciences, and STEM fields were involved. The participants had just completed 1 year of college English instruction. Most of them have passed the College English Test Band 4 (CET-4), reaching intermediate to upper-intermediate proficiency level.

According to the literature, when construct validity analysis includes confirmatory factor analysis (CFA), a minimum sample size of 200 is generally recommended, and an average sample size is approximately 375 (Kline, 2023). Many-facet Rasch Model (MFRM) analysis requires a minimum of 30 observations per facet (Linacre, 2023). With eight dimensions and three raters, this study chose a minimum sample size of 200 (100 males and 100 females) in the pilot study and 400 (219 males and 181 females, nearly evenly distributed) in the main study.

Tasks

Since the MCAESCL corpus serves as a data set for multimodal automated scoring, participants had to face the camera so that their complete features of verbal and nonverbal language could be captured. Accordingly, the project designers chose academic speech as a task type rather than group discussion. Neither the question-and-answer format was considered due to its limitations in terms of discourse length and completeness (Chen, 2022).

The participants were required to prepare and deliver approximately 2-minute speeches on one of the two prompts: “How to prepare for an oral presentation?” (an informative speech) and “To read selectively or extensively? That is a question. What’s your understanding? Support your opinion with some details” (a persuasive speech). These tasks were selected to align with the participants’ academic and educational backgrounds, eliciting language relevant to academic oral discourse.

The participants had 1 week to prepare and were told in advance that they would deliver the speech in front of a camera. Before the speaking task, they were informed about the purpose and procedures of the research, including the video recording requirement, data usage for research purposes only, and their right to withdraw at any time without consequence. Verbal consent was obtained and their personal information was documented before proceeding with data collection. All participants voluntarily agreed to participate after receiving the detailed explanation. In this research, audio recordings and transcribed texts were used to assess L2 lexical proficiency.

Raters

Three Chinese PhD students specializing in foreign language linguistics and applied linguistics participated as raters in the scoring process. They have been engaged in teaching college English for years and possess rich theoretical knowledge and practical experience in the field of vocabulary acquisition research. Two of them have engaged themselves in compiling English dictionaries. In addition, all have participated in scoring oral and written sections of College English Tests (China’s national standard English tests for college students) multiple times, thus having rich experience in human rating.

Scale Development

Establishment of the Preliminary Rating Scale

The lexical proficiency rating scale included eight 5-level dimensions, with descriptors developed based on lexical proficiency descriptions in large-scale speaking tests (e.g., IELTS, TOEFL, CET-4, CET-6) and the CEFR. The preliminary scale was used to assess 200 samples in the pilot study to examine its appropriateness and rater consistency.

The raters received at least four training sessions, each lasting over 2 hr. The first training session helped them understand the construct definition of lexical proficiency, the operationalization of each dimension into measurable criteria, and the application of the analytical rating scale in the samples of different levels. They were trained to score one dimension at a time using benchmark samples of various levels to avoid interference among dimensions. After the training, the raters assessed each lexical dimension for a few samples and discussed those scored over two levels of discrepancy until an agreement was finally reached. This rating session continued for every 30 to 50 samples till the end.

SPSS and many facet Rasch analyses demonstrated the following problems: (1) reliability wasn’t satisfactory enough. While overall rater reliability (Cronbach’s α) for eight dimensions ranged from .868 to .917, the rater reliability between two raters of some dimensions, such as word diversity and lexical accuracy, was between 0.6 and 0.7. (2) The percentages of high and low scores among the samples were very low. It was partly due to the raters’ central tendency and severity effects (Knoch, 2009), resulting in difficulty distinguishing between proficiency levels, particularly between advanced and intermediate learners and between intermediate and low-level learners.

The interview conducted after the rating process focused on rater reliability, the clarity and discriminative power of the descriptors, and other issues encountered during the rating process. It reflected that (1) definitions and rating scale for dimensions were not operationally clear and appropriate enough. Take word diversity as an example, the raters complained about the tricky treatment of those samples with short lengths. “While the words expressed seemed quite varied, there were only four sentences in this discourse. What score should I give it?” (2) Those descriptors, with general modifiers such as “very” and “limited,” were not instructional enough to discriminate between levels, confusing the raters. (3) Raters need more adequate training. For example, another reason for the central tendency of student scores was disagreement in rater severity. The raters tended to be more severe when assessing a sample of higher lexical proficiency. A rater commented, “Be conservative when giving a high score, as students always make mistakes.” However, they tended to be more lenient over a weak sample. “As long as he/she says something, I’ll give him/her some points.”

Optimization of the Rating Scale

Based on rater interviews, inter-rater consistency checks, and many-facet Rasch analysis of the preliminary rating results, the researchers revised the rating scale, including the improvement of the descriptors and the addition of quantitative assistance. Building on the experience from the pilot study and the practices of Knoch (2009) and Zhao (2013), we established quantitative thresholds for each level of each dimension for raters. For example, an oral discourse shorter than 80 words was assigned to the lowest level of word diversity. Those with 6 to 8 academic, formal, or less frequently-used words belonged to the mid-level of word sophistication (see Table A1 in Appendix A). Besides, frequency density (the occurrences of target feature per line or every two or three lines) was an effective means that could helped raters better perceive lexical features distributed in a long discourse with a lower extent of subjectivity. Based on the optimized rating scale, the raters received a new round of training and then used the scale to assess another 400 samples. The final inter-rater reliability across all dimensions increased to 0.9.

Data Analysis

Many-Facet Rasch Model (MFRM)

The MFRM offers a framework for assessing and analyzing multiple facets of data. In this study, a four-facet Rash model was adopted including participant ability, rater severity, task difficulty, and item type. It provided insights into the discrimination of lexical proficiency levels among participants, the consistency and potential biases of raters, the relative difficulty of tasks and scoring criteria, and the overall fit of the rating scale to the Rasch model, shown as follows:

Log (P_{n l j k} / P_{n l j k - 1}) = B_{n} - C_{l} - D_{j} - F_{k j}

P_nljk = probability of participant n receiving a score of k on task l when assessed by rater j;

P_nljk−1 = probability of participant n receiving a score of k−1 on task l when assessed by rater j;

B_n = ability of participant n;

C_l = severity of rater j;

D_j = difficulty of task j;

F_kj = difficulty of achieving a score of k on item j relative to the adjacent score of k−1.

Key statistics examined in the MFRM analysis included: (1) Infit and Outfit mean square (MnSq) values, indicators of data-model fit, with values between 0.5 and 1.5 considered acceptable (Linacre, 2021). (2) Separation indices and reliability indices, measures of the ability to reliably distinguish different levels of proficiency, tasks, and criteria, with values greater than 2 and 0.8, respectively, considered adequate (Bond & Fox, 2015). (3) Fixed chi-square tests, assessments of the statistical significance of differences among facet elements.

Confirmatory Factor Analysis (CFA)

CFA was employed to evaluate the construct validity of the proposed lexical proficiency model and its underlying dimensions. The analysis examined the extent to which the observed data fit the hypothesized model, as well as the convergent and discriminant validity of the proposed dimensions using the data of the main study.

Indicators used to assess model fit included: the chi-square test to degree-of-freedom ratio (CMIN/DF), root mean square error of approximation (RMSEA), goodness-of-fit index (GFI), adjusted goodness-of-fit index (AGFI), comparative fit index (CFI), normed fit index (NFI), and parsimonious normed fit index (PNFI). If CMIN/DF < 3, RMSEA < 0.05, while GFI, AGFI, CFI, and NIFI > 0.9, the model shows an ideal fit. If 3 < CMIN/DF < 5, 0.05 < RMSEA < 0.08, the model fits well (M. Wu, 2020).

Convergent and Discriminant Validity

Convergent validity implies that the measurement indicators for the same trait concept should fall under the same factor, emphasizing the correlation between various scoring items within the factor structure (Sawaki, 2007). The key indicators for testing convergent validity are the average variance extracted (AVE) and composite reliability (CR). If the AVE for each factor is >0.5, CR > 0.7, and the factor loadings for each measurement item are greater than 0.5, high convergent validity is proved (Cheung et al., 2023).

Discriminant validity examines whether each factor can reflect different aspects of the target ability concept (Sawaki, 2007). It is assessed by comparing the square root of the AVE for each factor with the correlations between that factor and other factors. If the square root of AVE exceeds the inter-factor correlations, discriminant validity is established (Cheung et al., 2023).

Results

This section aims to answer the second research question concerning the reliability and construct validity of the established rating scale for the lexical proficiency assessment of Chinese college students in their academic English oral performances.

Reliability of the Analytical Ratings

The reliability of the analytical scales was assessed through the MFRM analysis, examining the facets of participant ability, rater severity, task difficulty, and criterion difficulty.

Participant Ability

The primary purpose of the analytical rating scale is to differentiate the lexical proficiency of Chinese college students in academic English speeches. Figure 2. presents a Wright map illustrating the distribution of participant lexical ability, rater severity, task difficulty, and criterion difficulty measures on a shared logit scale. This visual representation provides a comprehensive overview of how these facets interact within the assessment framework.

Figure 2.

The Wright map.

The second column of the Wright map displayed a wide distribution of points representing the varying lexical proficiency levels of the student participants. The participants’ lexical abilities followed an approximately normal distribution, with measures ranging from approximately −5.4 to +6.9 logits. The value of strata for participant ability was 6.76, indicating that participants could at least be divided into six levels.

As shown in Table 1, for the participant facet, the Infit Mean Square (MnSq) of 1.00 indicated an excellent fit to the Rasch model expectations. The high separation index (4.82) and reliability (0.96), combined with the significant fixed chi-square (χ² = 9,751.5, df = 399, p < .001), demonstrated that the rating scale effectively differentiated students’ lexical proficiency.

Table 1.

General Model Fit of Four Layers.

Facets	Obsvd Av. (SD)	N	Measure (SD)	MnSq		Separation	Reliability	Fixed χ²	df
Facets	Obsvd Av. (SD)	N	Measure (SD)	Infit	Outfit	Separation	Reliability	Fixed χ²	df
Participants	3.44 (0.66)	400	0.84 (1.82)	1.00	1.01	4.82	0.96	9,751.5**	399
Raters	3.44 (0.03)	3	0.00 (0.09)	1.01	1.01	2.83	0.89	27**	2
Tasks	3.45 (0.1)	2	0.00 (0.12)	1.00	1.00	4.89	0.96	49.8**	1
Criteria	3.44 (0.16)	8	0.00 (0.38)	0.99	1.01	7.89	0.98	543.6**	7

p < .001.

Rater Severity

The data of rater severity in Table 1 showed statistically significant differences among raters (χ² = 27, df = 2, p < .001). The overall Infit MnSq of 1.01 represented a good model fit. The separation index was 2.83, and the reliability index was 0.89, both higher than expected values, confirming the severity difference. From Table 2, which presented more detailed information about facets, rater severity measures ranged from −0.10 to 0.11 logits, with Rater 3 being the strictest and Rater 2 the most lenient. Despite the statistical significance, the range of rater severity (−0.10 to 0.11 logits) is considerably smaller than that of participant ability. This suggests that differences in rater severity are unlikely to have a substantial impact on overall student scores.

Table 2.

Model Fit Statistics of Individual Raters, Tasks, and Rating Items.

Facets	Type	Obsvd Av.	Fair (M) Av.	Measure	Model SE	Infit MnSq	Outfit MnSq
	3	3.40	3.44	0.11	0.03	1.01	1.05
Rater	1	3.44	3.49	−0.01	0.03	1.03	0.97
	2	3.48	3.52	−0.10	0.03	0.97	1.01
Task	IT	3.55	3.53	−0.12	0.03	0.94	0.96
Task	PT	3.34	3.44	0.12	0.02	1.06	1.06
Criterion	PS	3.26	3.30	0.53	0.05	0.93	0.94
	WA	3.38	3.50	0.34	0.05	0.98	0.97
	WS	3.27	3.27	0.22	0.05	0.90	0.90
	WF	3.44	3.52	0.01	0.04	1.18	1.28
	PD	3.41	3.46	0.00	0.05	0.90	0.91
	WD	3.40	3.43	−0.08	0.05	0.89	0.89
	PA	3.56	3.67	−0.20	0.05	1.05	1.03
	PF	3.80	3.92	−0.82	0.04	1.11	1.19

Note. PS = phraseological sophistication; WA = word accuracy; WS = word sophistication; WF = word fluency; PD = phraseological diversity; WD = word diversity; PA = phraseological accuracy; PF = phraseological fluency.

Task Difficulty

Two task types were included in the analysis: the informative task (IT) and the persuasive task (PT). The separation index for tasks was 4.89, with a reliability of 0.96. A significant difference in difficulty was found between the two task types (χ² = 49.8, df = 1, p < .001; Table 1). The PT (0.12 logits) was more difficult than the IT (−0.12 logits; Table 2), indicating that it is more challenging for students to achieve as high scores in persuasive speech tasks as in informative speech tasks.

Criterion Difficulty

The eight criteria demonstrated significant differences in difficulty levels (χ² = 543.6, df = 7,p < .001), with a high separation index of 7.89 and reliability of 0.98 (Table 1), indicating that the criteria effectively distinguish different aspects of lexical proficiency. Criterion difficulty ranged from 0.53 logits for phraseological sophistication (PS) to −0.82 logits for phraseological fluency (PF; Table 2). This hierarchy demonstrated the relative challenges students face in different aspects of lexical proficiency, among which phraseological sophistication seemed to be the most challenging one for them.

The Infit MnSq values for all criteria (ranging from 0.89 to 1.18; Table 2) fell within the acceptable range, indicating good model fit. This suggests that all criteria contribute meaningfully to the assessment of lexical proficiency without redundancy or inconsistency.

Category probability curves (Figure 3) were examined to assess criterion category discrimination. Six criteria (WD, WS, WF, PD, PS, PF; criteria 1–4, 7–8) displayed distinct peaks for all categories, with monotonically advancing thresholds and appropriate spacing (1.4–5 logits) between curve intersections (Eckes, 2015), indicating that for these criteria, the 5-point rating scale functioned well, with each category representing a distinct level of performance.

Figure 3.

Category probability curves.

However, word accuracy and phraseological accuracy (WA, PA; criteria 5–6) exhibited less distinction between lower score (1–3) categories, with spacing less than 1.4 between the curve intersections on the horizontal axis. The result showed that the raters may have found it difficult to distinguish between the lower performance levels for these two criteria.

Bias Analysis

While the overall rater consistency was high, bias analysis revealed some specific interactions between raters and criteria. As shown in Table 3, no significant bias was found between raters and participants’ ability or between raters and tasks, indicating that raters applied the scale consistently across different ability levels and task types. However, 5 out of 24 (20.8%) rater-criterion interactions showed significant bias (χ² = 57.4, df = 24, p < .001).

Table 3.

Bias Effect Between Rater and Criterion.

Rater	Criterion	Bias size	SE	t-Value	Data fit
1	Word sophistication	−0.18	0.09	−2.05	0.9
1	Phraseological fluency	0.16	0.08	2.10	1.2
2	Phraseological sophistication	−0.20	0.09	−2.23	0.9
2	Word fluency	0.25	0.07	3.28	1.1
3	Word fluency	−0.27	0.07	−3.81	1.1

Rater 1 showed a tendency to be lenient in scoring word sophistication (bias = −0.18, t = −2.05) but strict in scoring phraseological fluency (bias = 0.16, t = 2.10). Rater 2 was lenient in scoring phraseological sophistication (bias = −0.20, t = −2.23) but strict in scoring word fluency (bias = 0.25, t = 3.28). Rater 3 showed leniency in scoring word fluency (bias = −0.27, t = −3.81). These biases, while not severely impacting the overall assessment, indicated the need for potential rater training or scale refinement.

Construct Validity of the Rating Scale

Data Fit

The statistical analysis showed that the skewness of vocabulary proficiency scores for the 400 samples ranged from −0.19 to −0.89, and the kurtosis ranged from −0.14 to 1.11, falling within the range of −2 to 2, indicating a normal distribution and allowing entry into the structural equation model. The CFA results demonstrated an excellent fit between the proposed lexical proficiency model and the observed data. The CMIN/DF ratio (2.617), the RMSEA (0.064), GFI (0.976), AGFI (0.942), CFI (0.99) and NIFI (0.985) values all met the recommended criteria for a well-fitting model (M. Wu, 2020) (Table 4).

Table 4.

Model Fit.

Fit indices	CMIN	DF	CMIN/DF	RMSEA	GFI	AGFI	CFI	NIFI
Reference values			<3	<0.08	>0.90	>0.90	>0.90	>0.90
Results	39.255	15	2.617	0.064	0.976	0.942	0.99	0.985

Convergent Validity

The factor loadings for the measurement indicators ranged from 0.83 to 0.95 (Table 5), exceeding the recommended threshold of 0.5 and providing evidence of convergent validity. Furthermore, the average variance extracted (AVE) values for each factor (0.749 to 0.901) and the composite reliability (CR) values (0.857 to 0.948) were all above the respective cut-offs of 0.5 and 0.7, further supporting the convergent validity of the proposed model.

Table 5.

Convergent Validity.

Factor	Indicators	Factor loadings	AVE	CR
Word complexity	A1	0.866	0.749	0.857
Word complexity	A2	0.865	0.749	0.857
Phraseological complexity	B1	0.878	0.752	0.858
Phraseological complexity	B2	0.856	0.752	0.858
Lexical complexity	Word complexity	0.944	0.890	0.942
Lexical complexity	Phraseological complexity	0.943	0.890	0.942
Lexical accuracy	C1	0.944	0.829	0.907
Lexical accuracy	C2	0.876	0.829	0.907
Lexical fluency	D1	0.935	0.901	0.948
Lexical fluency	D2	0.963	0.901	0.948
Reference values		>0.5	>0.5	>0.7

Discriminant Validity

As shown in Table 6, the square root of the AVE for each factor (values on the diagonal) exceeded the inter-factor correlations, indicating that the proposed dimensions of lexical proficiency exhibited discriminant validity. This finding suggests that the dimensions correspond to the distinct aspects of the lexical proficiency construct (Sawaki, 2007).

Table 6.

Pearson Correlation and the Square Root of the AVE.

Dimensions	Lexical complexity	Lexical accuracy	Lexical fluency
Lexical complexity	.943
Lexical accuracy	.723***	.910
Lexical fluency	.605***	.489***	.949

Note. The numbers on the diagonal represent the square root of the AVE for each factor.

***

p = .00.

Discussion

This study aimed to develop and validate an analytic rating scale for assessing lexical proficiency in L2 Academic English speech. The results provide strong evidence for the reliability and validity of the proposed scale, offering several important insights into the nature of the construct itself and the assessment of lexical proficiency.

Construct Dimensionality and Scale Development

The proposed construct showed good model fit and high convergent validity and discriminant validity, indicating that the dimensions within the construct are well-defined and empirically supported. Unlike Wolfe-Quintero et al.’s (1998) focus on single words, Xu’s (2018) focus on collocations, and Leńko-Szymańska’s (2020) focus on quantitative study, the construct developed takes into account of both single words and formulaic sequences from the perspective of human rating, broadening the scope of traditional assessment models of lexical proficiency. More importantly, the dimensionality was validated by empirical evidence from the assessment of Chinese college students’ academic English speech, confirming that these aspects are interrelated and collectively form the construct of lexical proficiency.

Within the construct, diversity, sophistication, accuracy, and fluency of words and formulaic sequences showed strong explanatory power for the lexical construct, indicating that the more varied and sophisticated the words and formulaic sequences used by learners in academic speech, and the more accurate and fluent their use, the stronger the learners’ lexical proficiency. This result is consistent with previous research findings (Kim et al., 2018; Kyle & Crossley, 2015; Saito et al., 2016; Xu, 2018).

Besides, the relationships between the internal dimensions of the construct model are clarified. Lexical complexity is categorized into word complexity and phraseological complexity, with word complexity distinguishing between word diversity and word sophistication, and phraseological complexity distinguishing between phraseological diversity and phraseological sophistication. The classification is consistent with Wolfe-Quintero et al.’s (1998) word complexity classification and Paquot’s (2019) phraseological complexity classification. This categorization reveals a closer relationship between diversity and sophistication at the word or phraseological level than the relationship between words and phrases at the diversity or sophistication level. Empirical evidence reveals that learners with higher lexical diversity proficiency tend to use more academic or low-frequency words and formulaic sequences. However, learners with lower lexical proficiency tend to repetitively use familiar high-frequency words, resulting in lower diversity and sophistication (Ansarin et al., 2021; Bayazidi et al., 2019; Siskova, 2012).

Reliability and Rater Performance

The MFRM analysis revealed that participants could be differentiated into at least six levels, indicating high reliability in discriminating participants’ lexical proficiency levels. This finding aligns with previous research, suggesting that lexical proficiency is a multifaceted construct that can be reliably evaluated (Crossley et al., 2015; Kyle & Crossley, 2015).

Raters achieved satisfactory performance despite statistically significant differences observed in rater severity and rater-criteria bias. Especially in the assessment of word accuracy and phonological accuracy, raters showed less satisfactory perceptions in both low-level performances, probably due to their difficulty distinguishing between the lower performance levels. As Yan (2014) stated, raters tend to agree more on higher score levels than on lower score levels. While the impact of rater effects on student overall scores is minor, improvement in rater training or more detailed descriptor elaboration for a certain criterion is still required.

Task and Criterion Difficulty

Divergence in difficulty between informative and persuasive tasks leads to differential outcome of L2 lexical performance, proving supportive evidence to the body of research that have revealed the influence of task types on language performance in L2 speech (De Jong et al., 2012; Skehan, 2009). Compared to an informative task, a persuasive task is more challenging for students, which requires more logical reasoning, persuasive abilities, and better lexical resources (Lucas & Stob, 2020). Observed task effects highlight the necessity for varied assessment tasks when evaluating academic lexical proficiency.

The hierarchy of criterion difficulty provides valuable insights into the challenges L2 students face in vocabulary learning. First, phraseological sophistication turned out to be the most challenging criterion. The finding aligns with previous studies (Paquot, 2019; Siyanova-Chanturia & Omidian, 2019) that sophisticated phrases or formulaic sequences are not easy for L2 students as they require innate native intuition (Siyanova-Chanturia & Omidian, 2019). Moreover, formulaic structures (e.g., Verb + direct objects collocations) can be used to discriminate the most advanced level of proficiency (Paquot, 2019). This result emphasizes the importance of explicit instruction and the assessment of phraseological competence in academic L2 programs.

The relative difficulty of word accuracy and sophistication over diversity observed in this study could was similar to that reported in Polat and Kim’s (2014) longitudinal case study, where a learner of English exhibited progress in lexical diversity while accuracy remained unchanged, suggesting slow development and acquisition of language accuracy.

Conversely, the relative ease of phraseological fluency suggests that learners may have acquired some common, familiar phrases before achieving sophisticated ones. Characterized by being stored and retrieved holistically, L2 students’ use of common formulaic expressions experiences faster processing speed and higher accuracy than that of non-formulaic expressions (Carrol & Conklin, 2020; Jiang & Nekrasova, 2007).

Implications

The validated rating scale offers several implications for L2 pedagogy and assessment in academic contexts. First, the multi-dimensional construct of lexical proficiency provides a framework for more targeted instruction and feedback to students. Teachers can use the scale to identify specific lexical strengths and weaknesses in L2 academic speeches and thereafter provide personalized pedagogical interventions. Secondly, lexical proficiency assessment should extend beyond single-word vocabulary knowledge to emphasize the appropriate use of formulaic expressions, particularly when distinguishing between intermediate and advanced learners in academic contexts. The development of phraseological competence for L2 learners should play a crucial part in L2 academic language instruction.

Limitations

Despite the robust evidence offered for the reliability and validity of the proposed rating scale, several limitations should be acknowledged. First, sample size and sample background are limited and should be expanded to ensure greater generalizability of the rating scale. In this research, the rating scale was developed to assess the lexical proficiency of Chinese college learners of English only. Chinese language and Chinese academic cultural communication style (e.g., indirectness in Chinese rhetoric) may have a distinctive influence on their lexical proficiency. Therefore, further study is necessary to verify whether the scale should be adjusted to better function in a different language or a different academic cultural setting.

Another limitation is that this study only examines two task types. A broader range of academic speaking tasks, such as debate and question-and-answer tasks, should be explored to assess the robustness of the scale. Additionally, the slight inconsistencies observed in the lower categories of the word accuracy and phraseological accuracy criteria need further investigation through larger sample sizes, refined criteria, or more comprehensive rater training for these specific aspects of lexical proficiency.

Conclusion

This study has developed and validated an analytic rating scale for assessing lexical proficiency in L2 academic English speech. The findings presented strong psychometric properties of the scale, supporting its structural multidimensionality, sensitivity to task effects and ability to differentiate multiple proficiency levels in academic English contexts. The multidimensional approach provides several theoretical contributions to our understanding of lexical proficiency. First, it confirms the integrated contributions of single words, formulaic sequences, and lexical academic specificity to the assessment of lexical proficiency in academic speaking contexts. Second, varied difficulty patterns across dimensions suggest different trajectories of lexical development for L2 academic English learners. Furthermore, differentiated assessment outcomes between tasks highlight the significance of context-specific assessment in L2 academic English speaking.

This 8-dimensional scale is a nuanced tool for pedagogical applications in lexical proficiency assessment in L2 academic speaking. Teachers can use it as a diagnostic assessment tool to identify learners’ lexical strengths and weaknesses and provide targeted feedback. Through this feedback, L2 learners can be aware of the specific areas that need to improve, such as phraseological variety, academic vocabulary, or advanced word usage.

Further research should establish a simplified scale for routine classroom use by identifying the dimensions most predictive of overall lexical proficiency. Additionally, the scale’s generalizability beyond Chinese linguistic and academic cultural contexts, its performance across broader tasks such as discussion and debate, and the potential of automatic speech analysis tools to address rater subjectivity need further investigation.

Footnotes

Appendix A

Table A1.

The Analytical Rating Scale of Lexical Proficiency in Academic English Speech.

Score	Word diversity	Phraseological diversity	Word sophistication	Phraseological sophistication
5	Uses a wide range of words, adept at expressing similar meanings through paraphrasing or synonyms	Uses rich and various formulaic expressions such as collocations, phrasal verbs and idioms (20+)	Regularly uses less-common, formal, or academic words. (12+)	Uses quite a few less-common, formal, or academic collocations, phrases and idioms (8+)
4	Uses a fairly wide range of words and paraphrase effectively with some repetition	Uses adequate, non-repetitive collocations, phrasal verbs, and idioms (16–20)	Uses quite a few uncommon, formal or academic words (9–11)	Uses adequate less-common, formal, or academic collocations, phrases and idioms (5–7)
3	Lacks an adequate range of words with noticeable repetition in expression	Uses a moderate number of non-repetitive collocations, phrasal verbs and idioms (11–15)	Mainly uses common words, with limited less-common, formal or academic words (6–8)	Uses a small number of less-common, formal, or academic collocations, phrases and idioms (3–4)
2	Uses limited words, short in length with frequent repetition (<80)	Uses limited non-repetitive collocations, phrasal verbs and idioms (6–10)	Mostly uses common, basic words, with quite limited less-common or academic words (3–5)	Mostly uses common, basic collocations, with occasional uncommon or academic collocations (1–2)
1	Uses a very limited range of words, severely short in length (<60)	Uses very limited non-repetitive collocations, phrasal verbs and idioms(0–5)	Uses very common, basic words, barely with academic words (0–2)	Uses very common basic collocations, with no use of less-common or academic collocations (0)
Score	Word accuracy	Phraseological accuracy	Word fluency	Phraseological fluency
5	Uses words appropriately and accurately, with occasional minor errors(0–1)	Has accurate and appropriate usage of phrases with occasional minor errors (0–1)	Speaks fluently with few repetitions or self-repairs(0–1)	Speaks holistically and fluently with rare unnatural pauses, repetitions or self-repairs (0–1)
4	Uses words appropriately with several errors barely affecting communication.(2–4)	Has fairly accurate and appropriate usage of phrases with a few errors insignificantly affecting communication (2)	Speaks fairly fluently with several unnatural pauses, brief repetitions or self-repairs (1–2)	Speaks fairly completely and fluently with a couple of unnatural pauses, repetitions, or self-repairs within phrases (1–2)
3	Uses words with noticeable errors somewhat affecting communication. (5–6)	Has some noticeable serious errors in phrasal usage affecting communication (3)	Speaks with longer pauses, noticeable repetitions, or self-repairs (3–5)	Speaks with several longer pauses, repetitions, or self-repairs within phrases (3)
2	Has frequent serious errors significantly affecting communication (7–8)	Has frequent serious errors in phrasal usage significantly affecting communication (4)	Has many longer pauses, repetitions, or repairs, resulting in incompleteness of sentence perception (6–9)	Has a few noticeably longer pauses, repetitions, or self-repairs within phrases (4–5)
1	Has numerous serious errors leading to almost impossible communication (9+)	Has too many serious errors severely affecting communication (5+)	Has frequent long pauses, repetitions, or repairs, or frequent repetitions in the same places, resulting in fragmented sentences (10+)	Has frequent serious breakdowns, repetitions, or self-repairs within phrases (6+) or fails to score successfully

Acknowledgements

We express our sincere gratitude to all the young participants who contributed to the sample data, the responsible university staff who collected the data and the dedicated PhD student raters who participated in the whole rating process.

ORCID iDs

Yuyu Jiang

Hua Chen

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: this study was supported by the Key Project of National Social Science Fund of China (20AYY013).

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

Data will be made available on request.

References

Alharthi

(2020). Investigating the relationship between vocabulary knowledge and FL speaking performance. International Journal of English Linguistics, 10(1), 37–46.

Ansarin

A. A.

Karafkan

M. A.

Hadidi

(2021). The effects of task type on Iranian EFL learners’ use of lexical diversity and sophistication. Applied Research on English Language, 10(4), 39–70.

Bachman

L. F.

Palmer

A. S.

(2010). Language assessment in practice. Oxford University Press.

Bayazidi

Ansarin

A. A.

Mohammadnia

(2019). The relationship between syntactic and lexical complexity in speech monologues of EFL learners. Applied Research on English Language, 8(4), 473–488.

Bond

T. G.

Fox

C. M.

(2015). Applying the Rasch model: Fundamental measurement in the human sciences. Psychology Press.

Bosker

H. R.

Pinget

A. F.

Quené

Sanders

De Jong

N. H.

(2013). What makes speech sound fluent? The contributions of pauses, speed and repairs. Language Testing, 30(2), 159–175.

Bulté

Housen

(2012). Deﬁning and operationalising L2 complexity. In Housen

Kuiken

Vedder

(Eds.), Dimensions of L2 performance and proficiency: Complexity, accuracy and fluency in SLA (pp. 23–46). John Benjamins Publishing.

Bulté

Housen

Pierrard

Van Daele

(2008). Investigating lexical proficiency development over time – The case of Dutch-speaking learners of French in Brussels. Journal of French Language Studies, 18(3), 277–298.

Carrol

Conklin

(2020). Is all formulaic language created equal? Unpacking the processing advantage for different types of formulaic sequences. Language and Speech, 63(1), 95–122.

10.

Chapelle

C. A.

(1994). Are C-tests valid measures for L2 vocabulary research? Second Language Research, 10(2), 157–187.

11.

Chapelle

C. A.

(1998). Construct definition and validity inquiry in SLA research. In Bachman

L. F.

Cohen

A. D.

(Eds.), Interfaces between second language acquisition and language testing research (pp. 32–70). Cambridge University Press.

12.

Chen

(2022). Constructing a new learner multimodal academic speech corpus – MAESCCL. Technology Enhanced Foreign Language Education, (3), 45–51.

13.

Cheung

G. W.

Cooper-Thomas

H. D.

Lau

R. S.

Wang

L. C.

(2023). Reporting reliability, convergent and discriminant validity with structural equation modeling: A review and best-practice recommendations. Asia Pacific Journal of Management, 41(2), 745–783.

14.

Cong-Lem

(2025). Scale validation in applied linguistics: Methodological trends and challenges. Research Methods in Applied Linguistics, (4), 100217.

15.

Council of Europe. (2018). Common European framework of reference for languages: Learning, teaching, assessment. Companion Volume with New Descriptors. Retrieved March 18, 2021, from https://rm.coe.int/cefr-compaion-volume-with-new-discriptors-2018/1680787989

16.

Crossley

S. A.

Salsbury

McNamara

D. S.

(2013). Validating lexical measures using human scores of lexical proficiency. In Jarvis

Daller

(Eds.), Vocabulary knowledge: Human ratings and automated measures (pp. 105–134). John Benjamins.

17.

Crossley

S. A.

Salsbury

McNamara

D. S.

(2015). Assessing lexical proficiency using analytic ratings: A case for collocation accuracy. Applied Linguistics, 36(5), 350–370.

18.

Crossley

S. A.

Salsbury

McNamara

D. S.

Jarvis

(2011). What is lexical proficiency? Some answers from computational models of speech data. TESOL Quarterly, 45, 182–193.

19.

Dang

T. N. Y.

Coxhead

Webb

(2017). The academic spoken word list. Language Learning, 67(4), 959–997.

20.

De Jong

N. H.

Steinel

M. P.

Florijn

Schoonen

Hulstijn

J. H.

(2012). The effect of task complexity on functional adequacy, fluency and lexical diversity in speaking performances of native and non-native speakers. In Housen

Kuiken

Vedder

(Eds.), Dimensions of L2 performance and proficiency: Complexity, accuracy and fluency in SLA (pp. 121–142). John Benjamins Publishing.

21.

Derakhshan

Enayat

M. J.

(2020). High-and mid-frequency vocabulary size as predictors of Iranian university EFL students’ speaking performance. Iranian Journal of English for Academic Purposes, 9(3), 1–13.

22.

Eckes

(2015). Introduction to many-facet Rasch measurement: Analyzing and evaluation rater-mediated assessments. Peter Lang.

23.

Eguchi

Kyle

(2023). L2 collocation profiles and their relationship with vocabulary proficiency: A learner corpus approach. Journal of Second Language Writing, 60, 100975.

24.

Enayat

M. J.

Derakhshan

(2021). Vocabulary size and depth as predictors of second language speaking ability. System, 99, 102521.

25.

Gardner

Davies

(2014). A new academic vocabulary list. Applied Linguistics, 35(3), 305–327.

26.

Housen

Kuiken

(2009). Complexity, accuracy, and fluency in second language acquisition. Applied Linguistics, 30(4), 461–473.

27.

Huang

(2015). More does not mean better: Frequency and accuracy analysis of lexical bundles in Chinese EFL learners’ essay writing. System, 53, 13–23.

28.

Jarvis

(2017). Grounding lexical diversity in human judgments. Language Testing, 34(4), 537–553.

29.

Jarvis

Daller

(Eds.). (2013). Vocabulary knowledge: Human ratings and automated measures (Vol. 47). John Benjamins Publishing.

30.

Jiang

N. A.

Nekrasova

T. M.

(2007). The processing of formulaic sequences by second language speakers. The Modern Language Journal, 91(3), 433–445.

31.

Kim

Crossley

S. A.

Kyle

(2018). Lexical sophistication as a multidimensional phenomenon: Relations to second language lexical proficiency, development, and writing quality. The Modern Language Journal, 102(1), 120–141.

32.

Kline

R. B.

(2023). Principles and practice of structural equation modeling. Guilford Publications.

33.

Knoch

(2009). Diagnostic writing assessment: The development and validation of a rating scale. Peter Lang GmbH, Internationaler Verlag Der Wissenschaften.

34.

Koizumi

In’nami

(2013). Vocabulary knowledge and speaking proficiency among second language learners from novice to intermediate levels. Journal of Language Teaching and Research, 4(5), 900.

35.

Kyle

Crossley

S. A.

(2015). Automatically assessing lexical sophistication: Indices, tools, findings, and application. Tesol Quarterly, 49(4), 757–786.

36.

Leńko-Szymańska

(2020). Defining and assessing lexical proficiency. Routledge/Taylor & Francis Group.

37.

Linacre

J. M.

(2021). A user’s guide to FACETS Rasch-model computer programs. https://www.winsteps.com/manuals.htm.

38.

Linacre

J. M.

(2023). A user's guide to WINSTEPS and Facets. https://winsteps.com

39.

(2012). The relationship of lexical richness to the quality of ESL learners’ oral narratives. The Modern Language Journal, 96(2), 190–208.

40.

Lucas

S. E.

Stob

(2020). The art of public speaking. McGraw-Hill.

41.

Paquot

(2019). The phraseological dimension in interlanguage complexity research. Second Language Research, 35(1), 121–145.

42.

Paquot

Gablasova

Brezina

Naets

(2022). Phraseological complexity in EFL learners’ spoken production across proficiency levels. In Götz

Leńko-Szymańska

(Eds.), Complexity, accuracy and fluency in learner corpus research (pp. 115–136). John Benjamins Publishing Company.

43.

Polat

Kim

(2014). Dynamics of complexity and accuracy: A longitudinal case study of advanced untutored development. Applied Linguistics, 35(2), 184–207.

44.

Read

(2000). Assessing vocabulary. Cambridge University Press.

45.

Read

(2021). Introductory remarks at symposium on development and use of rating scales in language testing [Symposium]. 14th Language Testing Research Colloquium, Vancouver, BC, Canada, February 27–March 1.

46.

Saito

Webb

Trofimovich

Isaacs

(2016). Lexical profiles of comprehensible second language speech: The role of appropriateness, fluency, variation, sophistication, abstractness, and sense relations. Studies in Second Language Acquisition, 38(4), 677–701.

47.

Sawaki

(2007). Construct validation of analytic rating scales in a speaking assessment: Reporting a score profile and a composite. Language Testing, 24(3), 355–390.

48.

Siskova

(2012). Lexical richness in EFL students’ narratives. Language Studies Working Papers, 4, 26–36.

49.

Skehan

(1998). A cognitive approach to language learning. Oxford University Press.

50.

Skehan

(2003). Task-based instruction. Language Teaching, 36(1), 1–14.

51.

Skehan

(2009). Lexical performance by native and non-native speakers on language-learning tasks. In Richards

Malvern

D. D.

Meara

Milton

Treffers-Daller

(Eds.), Vocabulary studies in first and second language acquisition: The interface between theory and application (pp. 107–124). Palgrave Macmillan.

52.

Smith

G. F.

Kyle

Crossley

S. A.

(2020). Word lists and the role of academic vocabulary use in high stakes speaking assessments. International Journal of Learner Corpus Research, 6(2), 193–219.

53.

Stengers

Boers

Housen

Eyckmans

(2011). Formulaic sequences and L2 oral proficiency: Does the type of target language influence the association? IRAL—International Review of Applied Linguistics in Language Teaching, 49(4), 1–23.

54.

Siyanova-Chanturia

Omidian

(2019). Key issues in researching multiword items. In The Routledge handbook of vocabulary studies (pp. 529–544). Routledge.

55.

Wolfe-Quintero

Inagaki

Kim

H.-Y.

(1998). Second language development in writing: Measures of ﬂuency, accuracy, and complexity. Second Language Teaching & Curriculum Center, University of Hawaii.

56.

Wood

(2010). Formulaic language and second language speech fluency: Background, evidence and classroom applications. Bloomsbury Publishing.

57.

Wray

(2002). Formulaic language and the lexicon. Cambridge University Press.

58.

(2020). Structural equation modeling: The operation and application of AMOS. Chongqing University Press.

59.

Luo

(2020). China standards of English: Research on organizational proficiency scales. Higher Education Press.

60.

(2018). Measuring “spoken collocational competence” in communicative speaking assessment. Language Assessment Quarterly, 15(3), 255–272.

61.

Yan

(2014). An examination of rater performance on a local oral English proficiency test: A mixed-methods approach. Language Testing, 31(4), 501–527.

62.

Zareva

(2019). Lexical complexity of academic presentations: Similarities despite situational differences. Journal of Second Language Studies, 2(1), 71–92.

63.

Zhao

C. G.

(2013). Measuring authorial voice strength in L2 argumentative writing: The development and validation of an analytic rubric. Language Testing, 30(2), 201–230.

Validating a Multi-Dimensional Rating Scale for Lexical Proficiency in L2 Academic Speech: A Mixed-Methods Approach

Abstract

Plain Language Summary

Keywords

Introduction

Theoretical Background

Models of Lexical Proficiency

Limitations of Existing Lexical Models

Research Questions

Construct Definition and Operationalization

Construct Definition

Construct Operationalization

Methodology

Participants

Tasks

Raters

Scale Development

Establishment of the Preliminary Rating Scale

Optimization of the Rating Scale

Data Analysis

Many-Facet Rasch Model (MFRM)

Confirmatory Factor Analysis (CFA)

Convergent and Discriminant Validity

Results

Reliability of the Analytical Ratings

Participant Ability

Rater Severity

Task Difficulty

Criterion Difficulty

Bias Analysis

Construct Validity of the Rating Scale

Data Fit

Convergent Validity

Discriminant Validity

Discussion

Construct Dimensionality and Scale Development

Reliability and Rater Performance

Task and Criterion Difficulty

Implications

Limitations

Conclusion

Footnotes

Appendix A

Acknowledgements

ORCID iDs

Funding

Declaration of Conflicting Interests

Data Availability Statement

References