Abstract
A body of research has examined the relationship between vocabulary knowledge (VK) and English as a Foreign Language (EFL) writing proficiency. However, it is still unclear how the vocabulary-writing relationship develops over time. This longitudinal study followed 154 Chinese university learners over an 8-month period using the same battery of vocabulary and writing tests. Bootstrapped Pearson correlations and linear mixed-effects models (LMMs) were used for data analysis. The results showed that learners made substantial gains in receptive size, while productive size remained at the 2,000 to 3,000 word level, despite a moderate increase. Productive derivative knowledge also showed moderate improvement. Further analyses revealed a developmental shift: receptive size was more strongly related to writing performance at the earlier stage, whereas productive size and derivative knowledge became stronger predictors later. In contrast, vocabulary depth (collocations and synonyms) showed little improvement and did not predict writing performance. These findings enhance our understanding of the dynamic relationship between vocabulary and writing and highlight the importance of fostering productive vocabulary and depth of knowledge through both formal instruction and informal exposure.
Plain Language Summary
This study followed Chinese university students over eight months to explore how different types of vocabulary knowledge support their writing. The students made significant progress in understanding words, while their ability to use words actively and produce derived forms improved more slowly. Over time, passive word knowledge became less important, and the ability to use words productively and apply word parts became more crucial for writing. In contrast, knowledge of how words relate to one another (collocations and synonyms) showed little improvement and did not contribute much to writing performance. These findings deepen our understanding of how the vocabulary-writing relationship develops and highlight the importance of moving from simple word recognition to active use in meaningful contexts both inside and outside the classroom.
Keywords
Introduction
Vocabulary knowledge (VK) has been widely acknowledged as the cornerstone of English as a Foreign Language (EFL) writing proficiency (Crossley, 2020; Qian & Lin, 2020; Stæhr, 2008). For second language (L2) learners, writing can be one of the most challenging productive skills to acquire. It requires not only cognitive resources and grammatical accuracy (Castillo & Tolchinsky, 2017) but also a sophisticated command of various lexical aspects: word form, meaning, collocations, word parts and associations, among others (Coxhead, 2018; Nation, 2022). Therefore, investigating the role of multifaceted vocabulary in L2 writing is essential for teachers and researchers to achieve insights into lexical and writing competence and development.
Of all the lexical dimensions, vocabulary size (quantity of words) and depth (quality of words) have been among the most widely taken up in language proficiency (Schmitt, 2014). The two types of VK have been demonstrated as critical predictors of writing performance in previous research (Crossley et al., 2015; Lin, 2023; Stæhr, 2008; Sukying, 2023). However, previous studies were primarily limited to cross-sectional assessment; the longitudinal observation of the size and depth distinction remains under-researched in EFL writing development (Nation, 2022; Schmitt, 2019). This is particularly true for university students in Mainland China, as they represent a special group of EFL learners with limited opportunities to use English in communication. The unique context also subjects students to the pressure of high-stakes tests, which leads to their reliance on rote vocabulary learning (decontextualized memorization of word-pair translations; Lin, 2023). Shaped by the learning environment, the patterns of word learning and the vocabulary-writing relationship may display different characteristics. As learners advance in their studies, the influence of vocabulary size and depth knowledge on their writing proficiency may also change accordingly (Crossley, 2020; Wu et al., 2019). Therefore, the present study aimed to examine how the relationship between VK and writing performance develops over an 8-month period of regular classroom instruction.
Literature Review
Vocabulary Size and Depth
The size of VK relates to the quantity of words learners know or how many words can be recognized (Read, 2023). It typically assesses the capability to map the word form to its meaning, making form-meaning links the single dimension in this construct (Schmitt, 2014). Laufer and Goldstein (2004) subdivided this mapping into active (productive) and passive (receptive) modalities: passive recognition, active recognition, passive recall and active recall. Schmitt (2010) adapted and clarified Laufer and Goldstein’s categorization using clearer terms: form recall (provide the word form), form recognition (choose the word form), meaning recall (provide the definition) and meaning recognition (choose the definition). Among these four types of knowledge, meaning recognition and form recall are frequently measured in language performance as proxies for receptive and productive size, respectively (Zhang & Zhang, 2022).
It is noted that VK encompasses more than form-meaning mappings (i.e., vocabulary size); it also requires a deep and qualitative understanding of multiple shades of meaning, semantic associations and grammatical functions (i.e., vocabulary depth; Read, 2023). However, when it comes to vocabulary depth, researchers are coping with a construct that is “inherently ill-defined, multidimensional, variable and thus resistant to neat classification” (Read, 2004, p. 224). Various conceptualizations of this construct have been developed from different perspectives. For example, Read (2004) encompassed three dimensions: (a) Precision of meaning, which refers to the gradual refinement of word meaning from a vague and limited understanding to a more extensive and accurate mastery; (b) Comprehensive word knowledge, which is defined as various aspects of knowledge beyond semantic features, including spelling, morphology, collocations, and syntactic and pragmatic features, also known as the component approach (Nation, 2022); and (c) Network knowledge, which refers to the integration of words into learners’ mental lexicon, where they build associations between words (e.g., paradigmatic and syntagmatic relations). Among these dimensions, the network associative knowledge and morphological forms have been the most widely measured (Naismith & Juffs, 2021; Read, 2023). Lexical networks are often operationalized as synonymy and collocations in research, while derivative knowledge, as a key indicator of vocabulary depth (Read, 2023), refers to the derived forms in word families formulated by adding affixes to the base units.
The relationship between vocabulary size and depth has not been definitively established. On one hand, the two constructs have shown strong correlations in empirical research, making it difficult to distinguish between the various size and depth components (González-Fernández, 2022; Vermeer, 2001). For example, Naismith and Juffs (2021) observed a dynamic relationship between learners’ vocabulary size and depth in productive word use, indicating that the growth of depth naturally leads to an increase in size, and vice versa. On the other hand, vocabulary depth provides additional explanatory power beyond vocabulary size in regression models (Schmitt, 2014). This suggests that the two dimensions may represent separate yet related constructs, in which different lexical components may develop unevenly, with receptive recognition generally preceding productive recall (González-Fernández & Schmitt, 2020).
Dóczi and Kormos (2016) proposed a theoretical pyramid model illustrating a hierarchical acquisition of word knowledge types, from basic receptive aspects (e.g., written form, word meaning) to more complex productive components (e.g., collocation use and other forms and meaning senses). Beyond the component approach, vocabulary acquisition has been conceptualized as a multidimensional continuum (Henriksen, 1999). Learners typically progress from a foundational recognition of word form and meaning (size), to building interconnected semantic and morphological networks (depth), and ultimately to the productive use of words in context. These models provide a structured view of how different dimensions of VK evolve in L2 acquisition. It seems reasonable to hypothesize that learners first accumulate vocabulary size before extending to vocabulary depth, and move from receptive understanding toward productive use. However, longitudinal research on the development of multiple lexical dimensions is still needed (Schmitt, 2019). It has been suggested that multi-aspect measures at both receptive and productive levels should be employed to give a full account of VK (González-Fernández & Schmitt, 2020; Henriksen, 1999). The present study, therefore, devised a range of vocabulary tests in relation to the assessment of writing development.
The Impact of Vocabulary Size and Depth on L2 Writing
The central role of vocabulary size and depth has been widely emphasized in language proficiency. Vocabulary is not only a key prerequisite for successful writing (Qian & Lin, 2020), but also “promotes syntactic flexibility and creates a foundation for further learning” (Baba, 2009, p. 192). This role becomes even more critical in EFL writing, as L2 writers often have more limited working memory resources than their L1 counterparts (Lin, 2023). From the perspective of the Simple View of Writing (Wu et al., 2019), lower-order skills, such as precise lexical selection, alleviate working memory load, thereby freeing cognitive resources for higher-order tasks, such as content development, process monitoring and text revision.
In these circumstances, it is assumed that if learners have a larger vocabulary, as measured by vocabulary size, they are likely to use more diverse words in their writing (Nation, 2022). Indeed, knowledge of form and meaning has been demonstrated as a major contributor to writing quality. It has been found that the correlation between receptive size (assessed by the Vocabulary Levels Test, VLT) and writing was surprisingly high (r = .73; Stæhr, 2008), indicating that vocabulary size accounts for about half of the variance in L2 writing. Likewise, Wu et al. (2019) found that word-pair translation contributed more to writing than other depth aspects, such as synonyms and morphological forms. Because the participants in Stæhr (2008) and Wu et al. (2019) were low-proficiency EFL learners with small vocabulary sizes, it seems reasonable that they relied mainly on passive meaning recognition in writing. Nevertheless, similar findings in recent studies (Dabbagh & Janebi Enayat, 2019; Lin, 2023; Yang et al., 2019) revealed that higher-proficiency university learners with a larger vocabulary depended on receptive size in their writing to a similar extent. However, only receptive vocabulary (Stæhr, 2008; Yang et al., 2019) or receptive measures (Dabbagh & Janebi Enayat, 2019; Lin, 2023) were touched on in previous studies, largely ignoring the productive dimensions. The validity of the L1 to L2 word translation format in size tests (Wu et al., 2019) may also warrant further consideration (Janebi Enayat & Derakhshan, 2021). Overall, vocabulary size represented by form-meaning links has been demonstrated as a strong predictor of writing performance.
While the size of VK is pivotal in L2 writing, some studies suggest that depth measures may provide additional or even stronger explanatory power. For example, Baba (2009) and Sukying (2023) examined the roles of vocabulary size and depth in L2 writing performance. Their findings suggest that a mere increase in size does not necessarily lead to immediate improvements in writing quality, while vocabulary depth makes a unique contribution to writing proficiency. However, one limitation in previous studies is that the measures used may not fully capture the depth construct. For instance, Sukying (2023) employed the Productive Vocabulary Levels Test (PVLT), which has been regarded as a form-recall (or productive size) test rather than a proper depth measure (Schmitt, 2010). Other studies have focused on learner corpora and suggested that certain aspects of vocabulary depth play a crucial role in L2 writing. Crossley et al. (2015) analyzed L2 texts and found that the correct use of collocations can be the best predictor of writing ability (accounting for 84% of the variance). Similarly, Naismith and Juffs (2021) found that derivative knowledge is central to writing proficiency; however, learners relied on only one or two derived forms in a word family to reduce lexical errors. Despite the fruitful insights, previous studies have primarily highlighted the importance of a certain type of depth knowledge in L2 writing, and studies that concurrently measure a wide range of vocabulary aspects are still needed.
The Development of Vocabulary and L2 Writing
A growing body of longitudinal and cross-sectional research has attempted to trace how vocabulary development aligns with improvements in L2 writing. One general finding is that certain dimensions of VK can be strongly indicative of EFL learners’ writing development (Crossley, 2020). For example, L2 writers who use more lower-frequency words tend to produce higher-scoring written output (Castillo & Tolchinsky, 2017; Crossley et al., 2011). Crossley et al. (2011) scrutinized compositions produced by 9th grade, 11th grade, and college level learners and found that the use of more infrequent lexis was the main lexical discriminator across grade levels. Likewise, Castillo and Tolchinsky (2017) reported that both vocabulary depth and fluency increase with grade level, and that productive semantic associations (orally producing synonyms and antonyms) contributed most to written production. Focusing specifically on derivative knowledge in L2 writing, Leontjev et al. (2016) observed a stable growth pattern across proficiency levels, although certain morphological forms developed more rapidly than others. This developmental sensitivity to derivative knowledge was further supported by Asaad (2024), who demonstrated that explicit instruction in morphological awareness led to more gains in both VK and academic writing.
It should be noted that research has also identified a discrepancy between VK and L2 writing development. The size and depth of VK may increase significantly, but L2 writers still find it difficult to move VK into productive use (Nation, 2022). Because learners may not have fully mastered the words, they tend to avoid using more difficult items to minimize lexical errors. Consequently, L2 writers often refrain from using low-frequency lexis and collocations. For Chinese EFL learners, the gap between lexical expansion and writing improvement may be even more pronounced. Over a 4-month interval, Zhong (2014) assessed the development of VK and sentence writing ability and found that receptive size was the stable predictor, contributing 74% to productive word use. Similarly, Wu et al. (2019) investigated the role of vocabulary size and depth in young writers’ compositions across two levels and found that productive size significantly predicts L2 writing proficiency. Given the differences in grade and proficiency levels across studies, one plausible hypothesis is that vocabulary size plays a fundamental role in writing at lower proficiency levels, while vocabulary depth begins to assume a more prominent role as learners advance.
Moreover, specific word learning strategies may largely influence the vocabulary that EFL learners use in their writing. From the focus group interviews conducted over one semester, rote memorization of word lists was reported as the dominant learning method among EFL students (Lin, 2023). However, Teng and Xu’s (2025) 12-week intervention study showed that productive learning methods (i.e., sentence translation and writing tasks) yielded the strongest effects in transferring receptive vocabulary to productive mastery, regardless of repetition frequency. Beyond classroom-based learning, extramural exposure may also shape how VK develops in language performance. For instance, general informal exposure (e.g., songs and videos) has been shown to facilitate vocabulary gains mainly at the 2,000-word level (Tsang & Lo, 2025), which has been identified as the most crucial frequency band for EFL writing (Stæhr, 2008; Yang et al., 2019). This suggests that increased out-of-class input at high-frequency levels may promote lexical manipulation in L2 writing.
The Present Study
Although previous studies have attempted to measure the vocabulary-writing relationship among Chinese learners, most have employed receptive VK measures (Lin, 2023; Yang et al., 2019; Zhong, 2014) or used word translation equivalents in vocabulary size tests (Wu et al., 2019). Some studies have also focused on VK in sentence translation and sentence writing tasks (Teng & Xu, 2025; Zhong, 2014) without extending to a higher level of literacy. To the authors’ knowledge, there are few studies employing a range of receptive and productive vocabulary measures associated with L2 writing in a pretest and post-test design. As Nation (2022) has pointed out, “of all the four skills, writing is the one where we know the least about the relationship between the skill and vocabulary knowledge” (p. 226). This study longitudinally explored the individual roles of vocabulary size and depth in the development of EFL writing ability among Chinese university learners. The study was guided by two research questions:
What are the levels of Chinese learners’ vocabulary size and depth knowledge, and how do these knowledge components naturally develop over an 8-month period?
To what extent can Chinese learners’ vocabulary size and depth contribute to their writing proficiency? Does this contribution vary as the learners advance to higher levels?
Methods
Participants
This study involved 154 L1 Chinese undergraduate students in four intact classes from a university in Mainland China. The participants (123 females, 45 males) were aged between 18 and 21 years (M = 19.62, SD = 0.62), and most were training to become middle school English teachers (which may explain the gender imbalance). By the time of the pretests, they had an English learning history of 10 to 13 years (M = 11.7, SD = 0.78). Although they had completed a credit-bearing course on argumentative writing, most of their learning focus was on English reading and grammar. The updated Vocabulary Levels Test (uVLT) revealed that 64.6% of the participants had reached the 2,000-frequency level and 39.4% had mastered the 3,000-level, suggesting a relatively large receptive vocabulary. Over two semesters, from the second semester of Year 3 to the first semester of Year 4, the participants took 14 English courses altogether, amounting to approximately 13.5 hours of intensive study per week. In Year 3, they had little exposure to language use outside the classroom, while in Year 4, they devoted more time to written productive skills (e.g., translation and writing practices) for examination purposes.
Vocabulary Size Tests
The receptive vocabulary size was assessed using the updated Vocabulary Levels Test (uVLT; Webb et al., 2017). The uVLT is a revised version of the traditional VLT in terms of lexical selection and frequency coverage. This test was developed and validated to measure VK at the 1,000 to 5,000 frequency levels. Each frequency level includes 30 target words in 10 clusters. Each cluster includes three target words and three distractors in separate columns (see Figure 1). The participants were required to select the right words out of the six options to match the given meanings. Webb et al. (2017) reported high internal consistency across frequency levels and provided validity evidence across five dimensions: content, substantive, structural, generalizability and external validity. This test has been validated in EFL learner populations comparable to the present sample (e.g., Iwaizumi & Webb, 2022), supporting its suitability for use in this context.

Figure 1. Example items of the uVLT.
The productive vocabulary size was captured using the Productive Vocabulary Levels Test (PVLT, Laufer & Nation, 1999). The PVLT measures the productive form recall ability in sentences with the first letter(s) of the target words being provided as cues (see Figure 2). It mainly targets the vocabulary at the 2,000, 3,000, 5,000, and 10,000 word-frequency levels. There are 18 target words in each frequency band, making a total of 72 items. The participants had to complete the word form according to the letter prompts and sentence contexts. The PVLT showed good internal reliability (KR-21 > 0.80) and clear discriminative validity across proficiency levels. Laufer and Nation (1999) argued that the PVLT is a valid, reliable and practical measure of controlled productive VK, which is especially useful for language teachers to test how many words learners can readily use in EFL writing.

Figure 2. An example item of the PVLT.
Vocabulary Depth Tests
The Word Association Test (WAT, Read, 2004) mainly examines the semantic networks of synonymous and collocational relations between words. The WAT is one of the most widely used measures for assessing vocabulary depth in language performance (Yanagisawa & Webb, 2020). This test has demonstrated high reliability with reported coefficients of 0.89 (Qian & Schedl, 2004), and its validity has been well established through item discrimination and factor analysis (Schmitt et al., 2011). In the original format, there are 40 items in the test, and participants choose two adjective synonyms and two noun collocations for each item. However, wild guessing has remained a concern. To address this issue, this study adopted the version refined by Qian and Schedl (2004), in which one to three keys were randomly assigned to each box (see Figure 3). Moreover, synonymy and collocations were regarded as separate variables in data analysis. Participants received one point for each correct choice, while incorrect selections were not credited.
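The scoring rule described above (one point per correct choice, no credit for incorrect selections, synonymy and collocations kept as separate variables) can be sketched as follows. This is only an illustration: the function name and the option words are hypothetical placeholders, not content from the actual test.

```python
def score_wat_item(selected, synonym_keys, collocation_keys):
    """Return (synonym_points, collocation_points) for one WAT item."""
    chosen = set(selected)
    # One point per correct choice; incorrect selections earn no credit.
    synonym_points = len(chosen & set(synonym_keys))
    collocation_points = len(chosen & set(collocation_keys))
    return synonym_points, collocation_points


# Hypothetical item: placeholder words, not taken from the real test.
points = score_wat_item(
    selected=["quick", "car", "blue"],
    synonym_keys=["quick", "rapid"],
    collocation_keys=["car"],
)
print(points)  # (1, 1)
```

Summing the two tuple positions separately across all items would give the synonymy and collocation scores used as distinct variables.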
Derivative knowledge has been advocated as an essential dimension of depth knowledge to assess the relationships between word families (Read, 2023). For the derivative test, this study used González-Fernández’s (2022) method, in which 20 target words were sampled from the 1,000 to 9,000 word-frequency levels to ensure broad lexical coverage. For each of the 20 target words, four sentences were provided with a different derivative left blank in each sentence (see Figure 4). The participants completed the sentences with the appropriate derivative forms. They could write an X in the blanks if they believed the derivative did not exist or use the target word without change. The four sentences for each word were semantically similar with repeated high-frequency words (within the 2,000 frequency levels). Each sentence was designed to limit the possible derived forms to only one word class. This measure has demonstrated satisfactory reliability with high internal consistency reported across learner groups. Its validity has been supported by factor-analytic evidence, indicating that all test items reflect a single underlying construct of VK.
Figure 3. An example item of the WAT (stimulus word shown: Peak).
Figure 4. An example item of the derivative test (target word shown: Development).
Argumentative Writing Test
The argumentative writing question was selected from the writing items of the International English Language Testing System (IELTS), Task Two in the academic module. The writing task required the participants to write an essay (250–300 words) on the topic of the necessity to travel abroad and learn about other countries. This question was chosen because it was a common topic about traveling and learning, which might be pertinent to the participants. It is noted that using the same writing prompt at both testing points might introduce familiarity or practice effects (Kim et al., 2022). However, this decision was made to ensure comparability across time points, as different writing items may vary in difficulty and topical knowledge demands.
Procedure
Test validity was checked in two ways: (a) three experienced EFL professors were invited to examine the measures, ensuring that all items and sample sentences matched our participants’ proficiency levels; and (b) 10 students with comparable proficiency levels who were not part of the main sample participated in a pilot study. The pretests (Time 1) were administered near the beginning of the sixth semester in Year 3, after all participants had agreed to sign the written consent form. The tests were administered in pen-and-paper format, with the writing task assigned first. The participants were allowed as much time as they needed (more than 90 min) to finish the writing, so that time pressure was minimized. The vocabulary tests were then conducted in separate sessions in the following sequence: receptive size (uVLT) → productive size (PVLT) → receptive depth (WAT) → productive depth (derivative recall). Each test was allotted 30 min with a 5-min break in between to reduce test fatigue. To reduce random guessing, explicit instructions were provided, especially for the WAT, in which each box contained a varying number of correct responses. Two teachers organized and proctored the tests, ensuring that participants did not consult dictionaries or other materials. After an 8-month interval (across two semesters, in which students had regular classroom instruction), the participants took the post-tests (Time 2) near the end of the seventh semester in Year 4, with the same tests and the same procedure.
Scoring
The scoring rubrics were the same for Time 1 and Time 2. As the vocabulary tests are established measures, the present study followed the existing scoring methods for these tests. As for the writing test, this study adopted the IELTS Writing Task 2 rating criteria, which consist of four dimensions: task response, coherence and cohesion, lexical resource, and grammatical range and accuracy. This scale has undergone several phases of validation, and has sound validity, reliability, impact and practicality (Shaw & Falvey, 2008). Before scoring, two raters carefully reviewed the official IELTS writing band descriptors and discussed any unclear points in the criteria to ensure a shared understanding. Because the linguistic components in writing were the main research focus, this study combined task response and coherence and cohesion into a single variable, namely, the higher-order skills. The Intraclass Correlation Coefficients (ICC) were used to estimate the inter-rater reliability as this measure may “reflect both degree of correlation and agreement between measurements” (Koo & Li, 2016, p. 156). According to Koo and Li (2016), ICC values above 0.75 indicate good reliability, while values above 0.90 are considered excellent. The coefficients in Table 1 were all higher than 0.75, suggesting good inter-rater reliability. Moreover, the Cronbach’s alpha values of the vocabulary tests are also shown in Table 1, indicating acceptable to high internal consistency.
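As an illustration of how an ICC can be obtained from two raters' scores, the sketch below computes one common form, ICC(2,1) (two-way random effects, absolute agreement, single rater), from the two-way ANOVA mean squares. The source does not state which ICC form was used, so this choice is an assumption, and the essay scores in the usage example are invented.

```python
def icc_2_1(ratings):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    `ratings` is a list of rows, one per essay, each holding one score per rater.
    """
    n = len(ratings)       # number of essays
    k = len(ratings[0])    # number of raters
    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares for essays (rows), raters (columns) and residual error.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in ratings for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    return (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )


# Invented band scores from two raters for five essays (illustration only).
scores = [[6.0, 6.5], [5.5, 5.5], [7.0, 6.5], [5.0, 5.5], [6.5, 6.5]]
print(round(icc_2_1(scores), 2))
```

The resulting value would then be read against the Koo and Li (2016) thresholds quoted above (above 0.75 good, above 0.90 excellent).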
Table 1. Estimates of the Internal Reliability and Inter-Rater Consistency.
Note. n/a = not applicable. Cronbach's alpha was not calculated for writing scores as they were based on rating scales and estimated by ICCs for interrater reliability.
Data Analysis
To address the first question, this study examined the descriptive statistics (in percentages) and compared scores at Time 1 and Time 2 using Wilcoxon signed-rank tests, because the data were not normally distributed, as indicated by significant Shapiro-Wilk tests of normality (p < .05). The rank-biserial correlation coefficient r was calculated as the effect size to indicate the magnitude of change between the two Times, with benchmarks of small (r = .25), medium (r = .40) and large (r = .60; Field, 2018). The 95% confidence intervals were also reported. To control Type I error, the Bonferroni method with an adjusted p-value (p < .006; 0.05/9 = 0.006) was used for multiple comparisons.
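The comparison described above can be sketched as follows. This is a minimal illustration, assuming the normal approximation of the Wilcoxon signed-rank statistic without tie or continuity corrections, so statistical packages may return slightly different Z values. The function names and the paired scores are hypothetical.

```python
import math


def wilcoxon_z(pre, post):
    """Z statistic of the Wilcoxon signed-rank test (normal approximation)."""
    # Zero differences are dropped, as in the classical procedure.
    diffs = [b - a for a, b in zip(pre, post) if b != a]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")

    # Rank absolute differences, giving tied values their average rank.
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        average_rank = (i + j) / 2 + 1
        for pos in range(i, j + 1):
            ranks[order[pos]] = average_rank
        i = j + 1

    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean_w = n * (n + 1) / 4
    sd_w = math.sqrt(n * (n + 1) * (2 * n + 1) / 24)
    return (w_plus - mean_w) / sd_w


def bonferroni_alpha(alpha, n_comparisons):
    """Per-comparison significance threshold."""
    return alpha / n_comparisons


# Hypothetical percentage scores for eight learners at Time 1 and Time 2.
time1 = [52, 60, 47, 71, 65, 58, 49, 63]
time2 = [58, 66, 50, 70, 72, 61, 55, 69]
print(round(wilcoxon_z(time1, time2), 2))
print(round(bonferroni_alpha(0.05, 9), 3))  # 0.006
```

A positive Z here means that scores were higher at Time 2; each p-value would be compared against the adjusted threshold rather than .05.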
In response to the second question, this study conducted Pearson correlations with bootstrapping between vocabulary and writing variables, as this method is suitable for non-normally distributed data (Yang et al., 2019). Moreover, a linear mixed-effects model (LMM) was fitted to examine the dynamic predictive effects of vocabulary size and depth on writing ability. Compared to analysis of variance, LMMs avoid the information loss caused by data aggregation and account for random variance, thereby yielding more accurate estimates (Nicklin et al., 2025). Scores of vocabulary size and depth and their interactions with time were entered as fixed effects, while the overall writing score was treated as the dependent variable. Participants were included as random effects with random intercepts and a random slope for time. Degrees of freedom for fixed effects were estimated using the Satterthwaite method, and all continuous predictors were mean-centered before forming interactions. The model diagnostics indicated approximately normal residuals (checked via residual-fitted plots). The variance inflation factors (VIFs) for the predictors were all below 3, indicating no collinearity issues. Outlier checks using boxplots revealed no extreme values. Moreover, effect sizes for the LMM were estimated based on the Conditional R² (variance accounted for by fixed and random effects together) and Marginal R² (variance explained by fixed effects alone). The values range from small (R² < .20) to large (R² > .50; Field, 2018).
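To make the bootstrapped correlation concrete, the sketch below resamples participants with replacement and takes a percentile confidence interval. The source does not specify the bootstrap variant or the number of resamples, so the percentile method, the 2,000 resamples and the simulated scores are all assumptions made for illustration.

```python
import math
import random


def pearson_r(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def bootstrap_r(x, y, n_boot=2000, level=0.95, seed=0):
    """Return (r, lower, upper) using a percentile bootstrap interval."""
    rng = random.Random(seed)
    n = len(x)
    estimates = []
    for _ in range(n_boot):
        # Resample participants (pairs), not the two variables separately.
        idx = [rng.randrange(n) for _ in range(n)]
        xs = [x[i] for i in idx]
        ys = [y[i] for i in idx]
        if len(set(xs)) < 2 or len(set(ys)) < 2:
            continue  # skip degenerate resamples with zero variance
        estimates.append(pearson_r(xs, ys))
    estimates.sort()
    lower = estimates[int((1 - level) / 2 * len(estimates))]
    upper = estimates[int((1 + level) / 2 * len(estimates)) - 1]
    return pearson_r(x, y), lower, upper


# Simulated vocabulary and writing scores (not the study's data).
rng = random.Random(1)
vocab = [rng.gauss(60, 10) for _ in range(154)]
writing = [0.4 * v + rng.gauss(0, 10) for v in vocab]
r, lower, upper = bootstrap_r(vocab, writing)
print(f"r = {r:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```

Because the interval comes from the empirical distribution of resampled correlations, it does not rely on the bivariate normality assumption behind the usual analytic interval.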
Results
Vocabulary and Writing Performances
Table 2 presents the descriptive statistics (in percentages) for vocabulary and writing tests at Time 1 and Time 2. The Wilcoxon signed-rank tests indicated that students performed better at Time 2 in most of the vocabulary and writing tests. Notably, receptive size was larger at Time 2 with a large effect size (Z = 8.97, r = .72, p < .001, [0.78–0.89]). Productive size (Z = 5.69, r = .46, p < .001, [0.39–0.65]) and derivative knowledge (Z = 4.53, r = .37, p < .001, [0.27–0.57]) also increased with moderate effect sizes. However, vocabulary depth aspects, including synonymy and collocations, revealed no significant differences. For the writing measures, the overall writing score increased with medium gains, especially evident in higher-order skills. Grammatical knowledge and lexical resource scores also increased, but the effect sizes were relatively small (see Figure 5).
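As a quick arithmetic check on the effect sizes above, the sketch below applies the common conversion r = Z / √N to the reported Z values with N = 154. The conversion is an assumption on our part (the Data Analysis section names the rank-biserial coefficient), adopted here only because it reproduces the reported r values to two decimals.

```python
import math

N = 154  # sample size reported in the study


def r_from_z(z, n=N):
    """Effect size r from a Wilcoxon Z statistic: r = Z / sqrt(N)."""
    return z / math.sqrt(n)


# Z values as reported in the text.
reported_z = {
    "receptive size": 8.97,
    "productive size": 5.69,
    "derivative knowledge": 4.53,
}
for name, z in reported_z.items():
    print(f"{name}: r = {r_from_z(z):.2f}")
```

The printed values can be compared directly with the r values given in the paragraph above.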
Table 2. Descriptive Statistics (in Percentages) for Vocabulary and Writing Tests (n = 154).

Figure 5. Temporal changes in test performances at Time 1 and Time 2 (n = 154).
Correlations Between Vocabulary Knowledge and L2 Writing Ability
Tables 3 and 4 present the bootstrapped Pearson correlations between vocabulary components and the overall writing score at Time 1 and Time 2, respectively. It was found that vocabulary variables at both Times had small to moderate relationships with writing performance.
Table 3. Results of Pearson Correlation on Bootstrapping for Time 1.
Table 4. Results of Pearson Correlation on Bootstrapping for Time 2.
Specifically, receptive size revealed a stable but decreasing relation with writing ability across the two Times (from r = .44 to r = .34). In contrast, productive size had strengthened correlations with writing as students progressed to higher levels, with the 2,000-word band (r = .41, p < .001, [0.25–0.55]) and the 3,000-word band (r = .42, p < .001, [0.29–0.54]) showing the strongest associations. The derivation-writing correlation also increased from a small effect size at Time 1 to a moderate one at Time 2. However, synonyms and collocations only had small correlations with writing at Time 1 and did not improve much at Time 2.
Size and Depth as Predictors of L2 Writing Development
An LMM was performed to examine the effects of vocabulary size and depth variables on overall writing ability (see Table 5). The full model accounted for 52% of the variance in writing scores, a large effect size (conditional R² = .52); the fixed effects alone explained 38% of the variance, a moderate effect size (marginal R² = .38; Field, 2018).
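The difference between the two R² values can be illustrated with the standard variance decomposition for mixed models, in which marginal R² reflects fixed effects only and conditional R² reflects fixed plus random effects. The sketch below assumes this decomposition applies here; the variance components are hypothetical values chosen only to reproduce the reported figures, since the text does not report them.

```python
def r_squared_lmm(var_fixed: float, var_random: float, var_residual: float):
    """Return (marginal R2, conditional R2) from variance components.

    marginal    = fixed / (fixed + random + residual)
    conditional = (fixed + random) / (fixed + random + residual)
    """
    total = var_fixed + var_random + var_residual
    marginal = var_fixed / total
    conditional = (var_fixed + var_random) / total
    return marginal, conditional


# Hypothetical variance components (assumed, not taken from the study),
# scaled so that they reproduce the reported R2 values
marginal, conditional = r_squared_lmm(
    var_fixed=0.38, var_random=0.14, var_residual=0.48
)

# Share of total variance attributable to the random effects
random_share = conditional - marginal

print(f"marginal R2 = {marginal:.2f}")
print(f"conditional R2 = {conditional:.2f}")
print(f"random-effect share = {random_share:.2f}")
```

On this reading, roughly 14% of the total variance would be attributable to the random effects (for example, differences between individual learners), and the remaining 48% would be unexplained.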
Table 5. Results of Mixed-Effects Model for Overall Writing Score.
The model showed a significant main effect of time, indicating that writing performance significantly improved at Time 2. Among the vocabulary predictors, receptive size (Estimate = 0.78, SE = 0.16, p < .001), productive size (Estimate = 0.71, SE = 0.16, p < .001) and derivative knowledge (Estimate = 0.58, SE = 0.18, p < .001) were significant, positive predictors of the writing score. Moreover, significant interaction effects were found between productive size and time (Estimate = 0.58, SE = 0.25, p = .003) and between derivation and time (Estimate = 0.45, SE = 0.21, p = .005), indicating that the predictive strength of these vocabulary aspects increased over time. However, the interaction effect between receptive size and time was negative (Estimate = −0.31, SE = 0.15, p = .014), suggesting reduced predictive power at Time 2. No significant interaction effects were found between time and either synonym or collocation knowledge.
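One way to read these interaction terms is to compute the implied slope of each predictor at each time point. The sketch below does so under an assumption that the text does not confirm: that time was dummy-coded (0 = Time 1, 1 = Time 2), so that each main effect is the slope at Time 1 and the slope at Time 2 is the main effect plus the interaction estimate. The helper `slope_at` is illustrative only.

```python
# Fixed-effect estimates reported in the text:
# (main effect, interaction with time)
estimates = {
    "receptive size": (0.78, -0.31),
    "productive size": (0.71, 0.58),
    "derivative knowledge": (0.58, 0.45),
}


def slope_at(main: float, interaction: float, time: int) -> float:
    """Implied slope of a predictor at a given time point.

    Assumption: time is dummy-coded (0 = Time 1, 1 = Time 2).
    """
    return main + interaction * time


# Implied slopes at Time 1 and Time 2, rounded to two decimals
slopes = {
    name: (round(slope_at(m, i, 0), 2), round(slope_at(m, i, 1), 2))
    for name, (m, i) in estimates.items()
}

for name, (t1, t2) in slopes.items():
    print(f"{name}: Time 1 slope = {t1:.2f}, Time 2 slope = {t2:.2f}")
```

Under this coding, the implied slope for receptive size falls from 0.78 to 0.47, whereas the slopes for productive size and derivative knowledge rise from 0.71 to 1.29 and from 0.58 to 1.03, respectively. A different coding scheme (for example, effect coding) would change these values but not the direction of the shift.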
Discussion
The present study evaluated the evolving relationship between VK and Chinese EFL writing proficiency, employing receptive and productive measures of vocabulary size and depth to examine their associations with writing development over time.
In terms of the first question regarding students’ lexical performance, improvement was most pronounced in receptive size, implying that our students largely expanded their superficial form-meaning understanding as they advanced to higher levels. While productive size also increased, the gap between receptive and productive vocabulary remained wide. It is evident that students’ productive vocabulary size was substantially lower than their receptive size, and its growth rate lagged behind the latter, which is consistent with previous literature (Dóczi & Kormos, 2016; Laufer & Goldstein, 2004). The finding empirically echoes Schmitt’s (2019) hypothesis indicating that moving from no knowledge to receptive mastery may be easy, but progressing from receptive to productive mastery can be the real challenge. However, our longitudinal observation is not consistent with previous intervention studies (Jia et al., 2024; Zhong & Hirsh, 2009). This discrepancy may be because previous studies primarily adopted production-oriented vocabulary tasks and learning strategies, whereas the present study involved no intervention and was conducted over an 8-month period of regular classroom instruction. Traditional pedagogy in this study mainly emphasized passive recognition through textbook and lecture learning rather than active word use. The difference in learning outcomes suggests that specific learning needs and instruction types may be the driving forces for acquisition of various vocabulary aspects, especially when students’ lexical proficiency reaches a certain level (e.g., the 2,000-level word families; Yang et al., 2019).
As for the development of vocabulary depth, the network knowledge of synonyms and collocations showed little increase from Time 1 to Time 2, although derivative knowledge improved moderately. This growth pattern is surprising since the synonym and collocation measure in this study was receptive and context-free, while the derivative test was in a productive recall format within sentence contexts. Contrary to the receptive-to-productive progression posited in Dóczi and Kormos’s (2016) pyramid model, our findings suggest that productive recall may not always lag behind receptive recognition. This result also contrasts with González-Fernández and Schmitt’s (2020) cross-sectional assessment of four word components, in which recognition consistently preceded recall in the acquisition sequence. However, as Schmitt (2019) speculated, recognition may not invariably precede recall across all lexical components, and our findings provide empirical support for this assumption. One possible explanation is that the sentence contexts in the derivative test may have provided cues that directly supported recall of the derived forms, while the decontextualized, randomly arranged synonym and collocation responses offered little support for form and meaning recognition. Another reason may lie in the rule-based nature of derivative knowledge (e.g., adverbs often end with -ly and nouns with -ion and -ment; Laufer, 2024). Such principled affix knowledge may help learners recall appropriate word family members more easily. In contrast, knowledge of synonyms and collocations typically develops through extensive exposure and lacks clear rules, making it difficult for learners to map form and meaning without contextual prompts.
Another finding of particular interest is that the overall size-depth relationship generally became weaker (from r = .65 to r = .56) as students progressed from Year 3 to Year 4. This result is inconsistent with the findings reported by Wu et al. (2019), where the opposite pattern was observed. This inconsistency may be attributed to the different proficiency levels of EFL learners. The participants in their study were young high-school learners with small vocabularies, whereas those in this study were more advanced university learners with larger vocabulary sizes. Schmitt (2014) and Read (2023) argued that vocabulary size and depth are highly correlated and often overlap for learners with limited vocabulary and high-frequency words. However, for learners with more extensive VK and low-frequency words, this correlation weakens, and the gap between size and depth knowledge becomes more pronounced. Our students’ vocabulary size may have increased significantly as they progressed through academic levels, but their depth knowledge may not have risen at the same rate, making the distinction between the two increasingly evident.
As for the second question regarding the contribution made by VK to L2 writing, the bootstrapped correlations and LMM results showed that vocabulary size and depth, taken together, had moderate relationships with the overall writing score. This portion of explained variance (38%), although lower than the 52% reported for young EFL learners’ writing (Stæhr, 2008), is comparable to the 25% and 35% variance explained in Lin (2023) and Sukying (2023), respectively. Taken together, this study provides additional empirical support for the positive relationship between VK and L2 writing proficiency.
Specifically, receptive size, as measured by the VLT, was found to be a strong predictor of writing ability at both testing times, especially when learners were at a relatively low proficiency level, which supports previous findings (Kim et al., 2022; Lin, 2023; Stæhr, 2008). However, the LMM showed that the interaction effect for receptive size and time was negative, while productive size and derivative knowledge positively and significantly interacted with time. This may indicate a gradual shift from relying on the most fundamental word meanings to using more varied and morphologically complex word forms in writing. At the advanced stage, students may have larger vocabularies and are not confined to superficial meaning making; rather, they actively produce a diverse range of words and manipulate derived forms to construct sentences.
This lexical growth in writing aligns with Crossley et al.’s (2011) finding that learners tend to produce more advanced word forms as their proficiency increases. The progression may be interpreted through the lexical threshold hypothesis in EFL writing (Yang et al., 2019), which suggests that word families at the 2,000-level represent a turning point for achieving average L2 writing performance. As our students may have accumulated greater exposure to high-frequency input, their lexical processing fluency and access to common word families likely improved (e.g., exceeding the 2,000-level threshold). This in turn may elevate the predictive power of productive size and derivative knowledge for writing. An in-depth analysis further revealed that productive size at the 2,000- and 3,000-word levels had the strongest correlations with writing ability at Time 2. This pattern partly supports Tsang and Lo’s (2025) findings that young EFL learners’ informal English exposure is most strongly associated with gains around the 2,000-word band. However, it should be noted that students may have retained memory of the pretest writing item over the course of two semesters. The familiarity with the topic may reduce cognitive load in the post-test, allowing them to focus more on generating ideas and employing complex word forms. Test familiarity indeed has been demonstrated as a significant contributor to L2 writing ability (Kim et al., 2022).
Notably, derivative recall emerged as a growing predictor of writing performance with a moderate effect size. This finding corresponds to previous research showing that derivative knowledge significantly predicts young EFL learners’ writing (Leontjev et al., 2016; McCutchen & Stull, 2015) and postgraduates’ academic writing (Asaad, 2024). This type of contribution generally increased with students’ vocabulary and proficiency levels, which is also supported by our finding. It has been found that grammatical recognition of derived forms is closely linked to L2 writing development (Leontjev et al., 2016). In other words, as students progress through grade levels, they may develop a stronger ability to identify parts of speech, which in turn supports accurate spelling and word formation, leading to a larger vocabulary. Consequently, this expanded vocabulary may mediate the relationship between morphological awareness and writing performance (Asaad, 2024). This process may offer a plausible explanation for the limited role of derivative knowledge in L2 writing at lower levels and its growing contribution at higher stages.
In contrast, synonyms and collocations did not contribute to L2 writing at either testing time. This suggests that Chinese learners’ VK may primarily stand alone and stop at the item learning level; their word use may rarely extend to the network building dimension in context (Henriksen, 1999). This finding contrasts with a number of studies (Castillo & Tolchinsky, 2017; Crossley et al., 2015; Sukying, 2023), although it is consistent with some others. This contradiction may stem from two reasons. Firstly, previous studies involved EFL postgraduates or L1 learners, whose higher lexical proficiency likely enabled more natural use of collocations and synonyms in writing than was the case for our participants. Secondly, the different assessment methods of vocabulary depth may also account for the discrepancy. For example, Crossley et al. (2015, p. 578) calculated the “acceptable and expected multi-word units” in written samples, while Castillo and Tolchinsky (2017) used a researcher-designed vocabulary depth measure (orally establishing semantic relations). Interestingly, studies employing the most widely used depth measure (WAT), as was done in this study, tended to yield weak relationships between vocabulary depth and L2 writing (Dabbagh & Janebi Enayat, 2019; Lin, 2023). Indeed, a recent meta-analysis conducted by Zhang and Zhang (2022) reported that the format of VK measures shapes the observed relationship between VK and listening and reading ability. This effect may also extend to L2 writing performance, which warrants further research attention. These findings suggest that studies examining the relationship between vocabulary and writing should not only involve participants of different proficiency levels, but also carefully select vocabulary measures based on specific research goals, as findings may vary considerably due to differences in measurement.
Pedagogical Implications
Based on the above discussion, the study provides several pedagogical implications for vocabulary and writing instruction. The receptive form and meaning knowledge (receptive size) lays the groundwork for later use of productive and morphological forms in context. For writing purposes, therefore, an ongoing development of such knowledge should be the first priority for teachers and students, especially the mid-frequency (e.g., 3,000–9,000-word levels; Naismith & Juffs, 2021) and academic words (Coxhead, 2000).
However, students showed a marked shortfall in productive vocabulary size, and only the high-frequency word levels (the 2,000- and 3,000-word levels) contributed to their writing performance. Hence, production-based vocabulary instruction should become a key focus for teachers, given its direct impact on lexical production beyond simple receptive recognition (Jia et al., 2024; Zhong & Hirsh, 2009). For example, words can be taught and practiced through immediate writing tasks since VK acquired through writing is more likely to be retained in learners’ long-term memory (Coxhead, 2018). In addition, teachers may assign sentence writing or translation tasks when introducing new words, as these activities have been shown to yield better recall ability than traditional sentence reading exercises (Teng & Xu, 2025). Explicit instruction should also be extended to derivative knowledge since it was the only depth-related factor that significantly contributed to writing ability in this study. Teachers may design long-term and structured programs to promote students’ morphological awareness and word formation ability. The noun forms within word families should be given greater instructional attention, as nominal structures are essential to modern academic writing; however, EFL learners tend to overuse verb forms but under-use nouns in writing (Naismith & Juffs, 2021).
As for other aspects of vocabulary depth (collocations and synonyms), special pedagogical attention should be paid to productive collocations of mid- and low-frequency words, since most collocations of frequent words tend to be encountered frequently, and learners likely acquire them incidentally (Nizonkiza & Vdpoel, 2019). However, collocations of lower-frequency words (e.g., 3,000 to 9,000 levels), which may appear less often in input and combine with infrequent items, are less likely to be learned without explicit instruction (Naismith & Juffs, 2021). Synonym knowledge may be strengthened by embedding more class activities, such as meaning association brainstorms or summaries of topic-centered words and phrases (Janebi Enayat & Derakhshan, 2021), which may help students build effective lexical networks.
Moreover, research has shown that spending more time on formal tutorials and assignments does not necessarily lead to vocabulary gains, whereas informal lexical exposure outside the classroom does (Tsang & Lo, 2025). The limited growth and weak predictive power of associative depth may thus stem from insufficient exposure to multi-word and semantic networks. Therefore, in addition to explicit instruction in vocabulary size and depth, more structured opportunities for informal input, such as extensive viewing and graded reading with word-family mapping, may be integrated to complement classroom learning.
Conclusion
This longitudinal study used a wide range of validated measures to investigate the dynamic relationship between VK and writing in the Chinese EFL context. The findings substantiate the critical importance of VK and its development in shaping L2 writing performance at different learning stages. The multiple size and depth components revealed varying growth trends and strengths of relationship with writing ability as academic years advanced. Receptive word recognition, as a constant and robust predictor of writing performance, played a crucial role in supporting writing development. Productive vocabulary and derivative knowledge figured more prominently in contributing to higher-proficiency learner writing. However, this lexical development did not extend to deeper, qualitative network knowledge, and such knowledge did not influence writing. These findings emphasize the necessity of explicit instruction with a focus on production-oriented lexical acquisition and teaching. Such instruction may include writing with target vocabulary, sentence-construction and translation tasks, as well as structured informal exposure beyond the classroom. Moreover, this study calls for special pedagogical attention to productive derivative knowledge, collocations and synonyms in writing development.
Limitations and Recommendations
Nevertheless, the conclusions drawn from this study should be interpreted in light of its limitations. Firstly, only one particular cohort of EFL learners was sampled, and their similar proficiency levels and backgrounds may have influenced the longitudinal outcomes, such that the students may have produced homogeneous performances at Time 1 and Time 2. Future studies are recommended to involve participants from varying contexts with different L1s or proficiency levels. Secondly, the receptive tests (uVLT and WAT) in this study did not control for random guessing, and the repeated use of the same tests may have introduced practice effects, both of which could have inflated the scores. Future longitudinal studies may consider implementing item-order variations across testing points for measures involving repeated prompts or forms. Thirdly, this study did not collect data on learners’ general informal exposure or socioeconomic status (SES), which may partly explain individual differences in vocabulary growth (Tsang & Lo, 2025). Future research may incorporate exposure and SES measures to test their mediating or moderating roles between vocabulary growth and writing development, and further examine whether the 2,000 to 3,000 frequency range remains a key threshold for productive writing gains.
Footnotes
Ethical Considerations
This study involved human participants. The vocabulary and writing tests in this study were non-invasive and part of regular classroom activities, minimizing any potential risks. The benefits of the research outweigh the minimal risks.
Consent to Participate
Informed written consent was obtained from all participants before the tests, and their personal data were kept confidential.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
The data can be accessed as required from the corresponding author.
