Abstract

Word frequency has a long history of being considered the most important predictor of word difficulty and has served as a guideline for several aspects of second language vocabulary teaching, learning, and assessment. However, recent empirical research has challenged the supremacy of frequency as a predictor of word difficulty. Accordingly, applied linguists have questioned the use of frequency as the principal criterion in the development of wordlists and vocabulary tests. Despite being informative, previous studies on the topic have been limited in the way the researchers measured word difficulty and the statistical techniques they employed for exploratory data analysis. In the current study, meaning recall was used as a measure of word difficulty, and random forest was employed to examine the importance of various lexical sophistication metrics in predicting word difficulty. The results showed that frequency was not the most important predictor of word difficulty. Due to the limited scope, research findings are only generalizable to Vietnamese learners of English.

Keywords

Data mining, lexical sophistication, random forest, word difficulty, word frequency

Introduction

For decades, word frequency has been operationalized as the main predictor of word difficulty (Hashimoto, 2021). This frequency-difficulty assumption has been widely adopted by vocabulary test creators (McLean & Kramer, 2015; Webb et al., 2017). In recent years, this assumption has been the topic of debate, and several attempts have been made to re-examine the predictive relationship between numerous variables of lexical sophistication and word difficulty (Hashimoto, 2021; Hashimoto & Egbert, 2019; Robles-García et al., 2023; Stewart et al., 2022; Vitta et al., 2023). While these studies offer useful insights into this relationship, all of them share certain limitations that warrant further research.

The first limitation concerns how word difficulty has been estimated. In all the mentioned studies, the researchers obtained word difficulty values by measuring the form-meaning knowledge of a group of learners and then computing Item Facility (IF) or Rasch item measures. This operationalization of word difficulty is appropriate, except that in all cases a Yes/No vocabulary test was used. The problem with Yes/No tests is that a “yes” response may correspond to a wide range of word knowledge, from understanding meaning to a bare awareness of form (Nation & Coxhead, 2021). We would argue that although form recognition is an important step in learning new words, it is insufficient for communication, and so tests of the form-meaning link are clearer measures of word difficulty.

The second limitation lies with the statistical methods employed to determine the best predictor of word difficulty. In previous studies, variable importance values were determined and compared using statistical techniques based on linear regression modeling. Statistics based on general linear modeling (GLM), while suitable for confirmatory analysis, can be inappropriate for exploratory research (Fife & D’Onofrio, 2023; Mizumoto, 2023). One reason is that they rely on many assumptions (independence, normality, homogeneity, and linearity) and protections (sample size and corrections for multiple tests) to yield reliable probability estimates (Fife & D’Onofrio, 2023; Hastie et al., 2009). These conditions are unlikely to be met in exploratory data analysis (Fife & D’Onofrio, 2023). Another reason is that, as the standardized beta coefficients obtained from multiple regression analyses cannot fully resolve the correlations between predictor variables, these statistical tools sometimes confuse which predictor to give credit to (Fife & D’Onofrio, 2023; Mizumoto, 2023). In addition, due to the lack of a built-in cross-validation mechanism, GLM-based statistics are vulnerable to overfitting (Fife & D’Onofrio, 2023). Overfitting refers to the situation where a model appears to perform better than it actually does, usually caused by the model accidentally fitting its error terms, or residuals. As these residuals are not reproducible, a model may perform well on the dataset used to build it but not with new data. The only way to detect overfitting is cross-validation, that is, fitting the built model to another dataset, which is usually not feasible for GLM-based methods.

To address these limitations, the current study employed meaning-recall testing, the format that offers perhaps the deepest and most accurate measure of the form-meaning link (Nation & Coxhead, 2021). In particular, a meaning-recall version of McLean and Kramer’s (2015) New Vocabulary Levels Test (NVLT) was given to a cohort of 304 Vietnamese students. In the modified version, all the response options were removed, leaving only the item stems, and learners provided translations of the target words. In addition, random forest (RF, Breiman, 2001), a statistical method that addresses the limitations of regression for exploratory data analysis or data mining (Fife & D’Onofrio, 2023; Hastie et al., 2009; Mizumoto, 2023), was employed to compare the importance of word frequency and other lexical sophistication measures.

Machine learning and RF

Machine learning (ML) algorithms analyze data with the ability to learn and improve their predictions from experience without being explicitly programmed. ML can be categorized as supervised, unsupervised, semi-supervised, and reinforcement (Hastie et al., 2009). The key difference between the first three is the use of labeled or unlabeled data, with supervised ML using labeled data, unsupervised ML analyzing unlabeled data, and semi-supervised ML being able to analyze both (Sarker, 2021). Reinforcement ML uses a reward–penalty system to receive feedback from the environment and find the best solution (Sarker, 2021). Reinforcement ML is usually applied in the development of artificial intelligence (AI) in games such as chess. Supervised ML algorithms such as Decision Tree (DT), RF, artificial neural networks (ANN), support vector machines (SVM), and K-nearest neighbor (KNN) are most frequently used in exploratory data analysis where the importance of several independent variables is examined and compared. Recent studies by Brandić et al. (2023) and Moghadam et al. (2023) found that RF and ANN outperform other ML models in terms of predictive performance. This paper is another attempt to extend the method showcase of Mizumoto (2023) and to advocate the use of RF in applied linguistics.

As its name suggests, RF, a non-parametric ML model, is built on DTs, usually hundreds or even thousands in number (Fife & D’Onofrio, 2023; Hastie et al., 2009). In RF, the prediction is made based on the results of DTs by majority voting or averaging. A DT, in non-technical terms, is an ML algorithm that employs a tree-like structure to represent classifications and their possible outcomes. Figure 1 shows an example of a DT that predicts the BNC/COCA frequency band of 123 words in this study’s dataset based on their values of contextual distinctiveness (McD_CD), age of exposure (aoe_inverse_average, AoE), semantic distinctiveness (Sem_D), and lexical concreteness (Brysbaert_Concreteness_Combined_AW). For the DT used in this illustration, 20% (24 words) of the dataset is held out for cross-validation, leaving 99 (80%) words in the model.

Figure 1.

An example of a decision tree.

The top square, or the root node, separates words according to whether the co-occurrence probability value, an indicator of contextual distinctiveness, is −0.345 or greater. If the answer is no, then a second node, referred to as a branching or internal node, further separates words based on inverse average AoE. Words with values below −1.07 are predicted to belong to the 1 K band, and those above to the 2 K band. Returning to the root node, words with McD_CD values of −0.345 or greater are directed to a different branching node that separates words based on a semantic distinctiveness threshold of −0.325. This process continues until all words are assigned a prediction in the lowest nodes of the tree, which are called terminal or leaf nodes. Nodes that are in the middle of the tree, regardless of specific level, are called branching nodes or branches. The leaves in this DT are presented as square boxes. Classification accuracy can be identified at each node level using a metric called node purity, or sum of squared residuals (Fife & D’Onofrio, 2023).
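
The splits described above can be written as nested conditions. The following Python sketch is illustrative only: it encodes just the thresholds named in the text (−0.345 for McD_CD, −1.07 for inverse average AoE, and −0.325 for Sem_D), assumes the predictor values are on the same scale as those thresholds, and uses a hypothetical function name, `predict_band`. The nodes below the Sem_D split appear only in Figure 1, so the sketch stops there rather than guessing at them.

```python
def predict_band(mcd_cd, aoe_inverse_average, sem_d):
    """Follow the splits described in the text for one word (hypothetical helper)."""
    # Root node: is the co-occurrence probability value -0.345 or greater?
    if mcd_cd < -0.345:
        # "No" branch: split on inverse average age of exposure.
        return "1K" if aoe_inverse_average < -1.07 else "2K"
    # "Yes" branch: the next split uses semantic distinctiveness at -0.325.
    # The deeper nodes are not described in the text, so only the side is reported.
    side = "below" if sem_d < -0.325 else "at or above"
    return f"undetermined (Sem_D {side} -0.325; see deeper nodes in Figure 1)"


print(predict_band(mcd_cd=-0.80, aoe_inverse_average=-1.50, sem_d=0.10))
print(predict_band(mcd_cd=-0.80, aoe_inverse_average=0.20, sem_d=0.10))
print(predict_band(mcd_cd=0.40, aoe_inverse_average=0.20, sem_d=-0.60))
```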

In RF, each DT randomly samples with replacement, a process called bootstrapping (Breiman, 2001). Normally, the bootstrapped sample size is set at 67% of the original dataset, leaving the remaining 33% for cross-validation (Fife & D’Onofrio, 2023). The reserved sample for cross-validation is called the out-of-bag (OOB) sample (Breiman, 2001). After DTs are built using the bootstrapped dataset, the OOB sample is applied on each DT to generate another round of predictions. The degree of agreement between the OOB and bootstrapped data reflects the prediction accuracy of the RF model. In this study, the original dataset of 123 target words was split into (1) a training dataset, (2) a validation dataset, and (3) a test dataset. The training dataset was used to build the initial RF model, and then the test and validation datasets were used for cross-validation. OOB errors, or degree of agreement between the OOB and bootstrapped samples, were calculated for the training and validation datasets to evaluate model prediction accuracy.
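
The bootstrapping step can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the procedure implemented in any particular package: the function name `bootstrap_split` is hypothetical, and each tree is assumed to draw as many items as the dataset contains, with replacement, which leaves a little over a third of the items out of bag on average.

```python
import random


def bootstrap_split(n_items, seed):
    """Return (in-bag indices, out-of-bag indices) for one tree."""
    rng = random.Random(seed)
    # Sampling with replacement: the same word can be drawn more than once.
    in_bag = [rng.randrange(n_items) for _ in range(n_items)]
    # Words that were never drawn form the out-of-bag (OOB) sample.
    oob = sorted(set(range(n_items)) - set(in_bag))
    return in_bag, oob


n_words = 123  # number of target words in the final dataset
in_bag, oob = bootstrap_split(n_words, seed=1)
print(f"distinct in-bag words: {len(set(in_bag))}, OOB words: {len(oob)}")

# Averaged over many trees, the OOB share settles at a little over one third.
shares = [len(bootstrap_split(n_words, seed=s)[1]) / n_words for s in range(500)]
mean_oob_share = sum(shares) / len(shares)
print(f"mean OOB share over 500 trees: {mean_oob_share:.3f}")
```

Because every tree has its own OOB sample, each word is eventually predicted by trees that never saw it, which is what allows the forest to estimate its error without a separate dataset.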

In addition to OOB cross-validation, there is another built-in cross-validation mechanism in RF called the hold-out sample (Breiman, 2001). That is, RF compares mean-squared error (MSE) between the three models constructed using the training, validation, and test datasets mentioned in the previous paragraph. Low MSE values for training and test models indicate a good model fit, while high training and test MSEs may suggest underfitting, when the model does not perform well either in training or test data. Low training MSE but high test MSE may suggest overfitting. This means that the model fitted noise or random patterns in training data that do not generalize well in test data.

In addition to sampling participants, RF also samples variables. Normally, the minimum sample size is $\sqrt{m}$ or $m/3$ depending on whether the outcome variable is categorical or numeric, respectively, with $m$ being the number of predictor variables (Fife & D’Onofrio, 2023). As RF constructs DTs, it generates a measure called variable importance or predictor importance for each predictor in the model. The two variable importance metrics used in this study are total increase in node purity and mean decrease in accuracy. To estimate the increase in node purity, the algorithm compares the prediction accuracy of DTs before and after including a particular predictor. This metric works best with categorical variables and tends to inflate variable importance of continuous predictors. The calculation of mean decrease in accuracy is more computationally demanding but does not suffer from such a bias. First, the algorithm randomly shuffles OOB scores of the subjects, or target words in this case, for each node in DTs. This shuffling/permutation eliminates any existing correlations between the predictors. Then the algorithm evaluates the prediction accuracy of the model before and after permutation. A large change in OOB errors after permutation for a specific predictor means that the predictor has a strong relationship with the outcome variable and therefore is very important. See Breiman (2001), Hastie et al. (2009), and Fife and D’Onofrio (2023) for detailed discussions on these metrics.
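
The logic behind mean decrease in accuracy can be illustrated with a small permutation experiment. The sketch below is a simplification under stated assumptions, not the implementation of any particular package: it assumes the common formulation in which the values of one predictor at a time are shuffled across items, it uses a fixed toy prediction rule in place of a forest, and it evaluates on the full simulated sample rather than on per-tree OOB samples. All names and data are hypothetical.

```python
import random
from statistics import fmean


def mse(observed, predicted):
    """Mean-squared error between observed and predicted values."""
    return fmean((o - p) ** 2 for o, p in zip(observed, predicted))


def toy_model(row):
    """Stand-in for a fitted forest: uses predictor 0 and ignores predictor 1."""
    return 0.5 * row[0]


def permutation_importance(rows, outcome, model, column, n_repeats=20, seed=1):
    """Mean increase in MSE after shuffling one predictor column."""
    rng = random.Random(seed)
    baseline = mse(outcome, [model(r) for r in rows])
    increases = []
    for _ in range(n_repeats):
        shuffled = [r[column] for r in rows]
        rng.shuffle(shuffled)
        permuted_rows = [
            r[:column] + (value,) + r[column + 1:] for r, value in zip(rows, shuffled)
        ]
        permuted = mse(outcome, [model(r) for r in permuted_rows])
        increases.append(permuted - baseline)
    return fmean(increases)


# Simulated data: the outcome depends on predictor 0 only.
rng = random.Random(7)
rows = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(100)]
outcome = [0.5 * a + rng.gauss(0, 0.1) for a, _ in rows]

relevant = permutation_importance(rows, outcome, toy_model, column=0)
irrelevant = permutation_importance(rows, outcome, toy_model, column=1)
print(f"predictor 0: {relevant:.3f}, predictor 1: {irrelevant:.3f}")
```

Shuffling the predictor that the model relies on raises the error, while shuffling the ignored predictor leaves it unchanged, which is the sense in which a large change after permutation signals an important predictor.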

RF is believed to outperform traditional regression models because it (1) is less vulnerable to overfitting (Breiman, 2001; Fife & D’Onofrio, 2023), (2) has fewer statistical assumptions as a non-parametric model (Fife & D’Onofrio, 2023; Mizumoto, 2023), (3) has built-in cross-validation mechanisms (Breiman, 2001; Fife & D’Onofrio, 2023; Mizumoto, 2023), and (4) can be used when there are more variables than participants (Breiman, 2001; Fife & D’Onofrio, 2023). To the best of our knowledge, no research has used RF to examine the predictive relationship between various metrics of lexical sophistication and word difficulty.

The general objective of the present research was to determine the strongest predictor of word difficulty under an RF approach. Therefore, the study is driven by the following research question:

Amongst several lexical sophistication metrics, what is the strongest predictor of word difficulty?

Methods

Participants

The study involved 304 second-year Vietnamese university students. All completed 9 years of compulsory English education at elementary, middle and high schools, plus three mandatory modules of Business English at the tertiary level, suggesting an average Common European Framework of Reference for Languages (CEFR) proficiency level of B1 or higher.

Target words

Out of 150 NVLT target words, 149 were included in our analysis. These included the 120 words sampled from the first five 1,000-word bands of the BNC/COCA lists, and 29 of the 30 items from the Academic Word List that were also from the first 5 K of the BNC/COCA. The AWL word “notwithstanding” was a 6 K word and therefore excluded as an outlier. The frequency variable used in our analysis was BNC/COCA band, with values from 1000 to 5000.

Word difficulty

A meaning-recall test was utilized as a word difficulty measure. This test consisted of the NVLT item stems, and participants were instructed to demonstrate target-word knowledge by providing written L1 definitions or explanations. Responses were marked by two Vietnamese applied linguists who are highly proficient in English. Any dictionary definition of a target word, regardless of part of speech or meaning sense, was marked as correct. Inter-rater reliability was “Almost Perfect” (κ = .95; Landis & Koch, 1977, p. 165). The Rasch item and person reliability indices were .98 and .97, respectively.

Word difficulty was operationalized as IF, or the proportion of correct responses to an item, with possible values ranging from 0.00 to 1.00.
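
As a worked illustration of this definition, the Python sketch below computes IF from a small scored response matrix. The matrix is invented for the example (rows are learners, columns are items, and 1 marks a correct meaning-recall response), and the function name `item_facility` is hypothetical.

```python
def item_facility(scored_responses):
    """Proportion of correct (1) responses per item; rows are learners."""
    n_learners = len(scored_responses)
    n_items = len(scored_responses[0])
    return [
        sum(row[item] for row in scored_responses) / n_learners
        for item in range(n_items)
    ]


# Invented data: 4 learners x 3 items.
scored = [
    [1, 1, 0],
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 1],
]
print(item_facility(scored))  # [1.0, 0.5, 0.25]
```

Higher IF values therefore indicate easier words, which is why predictors such as frequency band correlate negatively with IF.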

Lexical sophistication

TAALES 2.2 (Kyle et al., 2018) was used to compute the lexical sophistication metrics for the 149 target words. As BNC/COCA frequency was operationalized as the main measure of frequency, all TAALES range and frequency measures were excluded. All options for isolating content words and function words were excluded, and only all-word measures were included in the analysis. This initially resulted in 42 lexical sophistication measures, including BNC/COCA frequency, for each target word.

However, TAALES is an incomplete dataset, with some metrics of lexical sophistication absent for some words. In this study, missing data were detected using the index coverage data offered by TAALES. Lexical sophistication metrics were classified as having good coverage if the missing data rate was less than 10% (fewer than 12 words) and moderate coverage if the missing data rate was between 10% and 20% (12–25 words). Metrics that covered less than 80% of the data (missing from 26 or more words) were classified as having bad coverage and were excluded from further analyses; 15 lexical sophistication metrics were excluded on this basis. Words that were not covered by the remaining metrics were then removed, which resulted in the deletion of 26 words.
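
The coverage rule can be expressed as a short function. The sketch below is only an illustration: the function name and the example rates are hypothetical, and treating a rate of exactly 20% as moderate is an assumption about the boundary case.

```python
def classify_coverage(missing_rate):
    """Classify a metric by the share of target words it fails to cover."""
    if missing_rate < 0.10:
        return "good"
    if missing_rate <= 0.20:  # boundary handling is an assumption
        return "moderate"
    return "bad"  # covers less than 80% of the words; excluded from analysis


# Hypothetical missing-data rates for three metrics.
example_rates = {"metric_a": 0.02, "metric_b": 0.15, "metric_c": 0.45}
decisions = {name: classify_coverage(rate) for name, rate in example_rates.items()}
print(decisions)
```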

The retained 27 features were checked for multicollinearity. RF is relatively robust to multicollinearity as each DT trains on a random subset of features which may not always highly correlate. However, multicollinearity may affect feature importance of highly correlated predictors, even in RF. Therefore, predictors with Pearson correlations of .80 or higher were flagged for multicollinearity, and those with the strongest correlations with IF were retained while the others were excluded from further analyses until all inter-predictor correlations were below .80. In this process, eight metrics were removed, and 19 remained in the final RF model. Table 1 shows the inclusion and exclusion of metrics. See TAALES 2.2’s index description spreadsheet for details on each metric (Kyle et al., 2018; https://www.linguisticanalysistools.org/taales.html).

Table 1.

Lexical sophistication metrics.

Lexical sophistication metrics	In-text name	Decision
Age of exposure
• aoe_index_above_threshold_40	LDA Age of Exposure (.40 cosine threshold)	Excluded for multicollinearity
• aoe_inflection_point_polynomial	LDA Age of Exposure (inflection point)	Included
• aoe_inverse_average	LDA Age of Exposure (inverse average)	Excluded for multicollinearity
• aoe_inverse_linear_regression_slope	LDA Age of Exposure (inverse slope)	Excluded for multicollinearity
Age of acquisition
• Kuperman_AoA_AW	Age of Acquisition	Included
Character bigram frequency
• BG_Mean	Character Bigram Frequency	Included
Contextual distinctiveness
• eat_tokens	Free Association Response Tokens	Excluded for missing data
• eat_types	Free Association Response Types	Excluded for missing data
• lsa_average_all_cosine	LSA Contextual Distinctiveness (all cosine)	Excluded for missing data
• lsa_average_top_three_cosine	LSA Contextual Distinctiveness (top 3 cosine)	Excluded for missing data
• lsa_max_similarity_cosine	LSA Contextual Distinctiveness (maximum cosine)	Excluded for missing data
• McD_CD	McDonald Co-occurrence Probability	Included
• Sem_D	Hoffman et al. Semantic Distinctiveness	Included
• USF	Free Association Stimuli Elicited	Included
Frequency
• BNC/COCA Freq	BNC/COCA Frequency Ranking	Included
Lexical concreteness
• Brysbaert_Concreteness_Combined_AW	Brysbaert Concreteness Combined	Included
• MRC_Concreteness_AW	MRC Concreteness	Excluded for missing data
Lexical familiarity
• MRC_Familiarity_AW	MRC Familiarity	Excluded for missing data
Lexical imageability
• MRC_Imageability_AW	MRC Imageability	Excluded for missing data
Lexical meaningfulness
• MRC_Meaningfulness_AW	MRC Meaningfulness	Excluded for missing data
Word neighbor
• Freq_N	Orthographic Neighborhood Frequency	Excluded for missing data
• Freq_N_OG	Phonographic Neighborhood Frequency Logarithm (homophones included)	Excluded for missing data
• Freq_N_OGH	Phonographic Neighborhood Frequency Logarithm (homophones excluded)	Excluded for missing data
• Freq_N_P	Phonological Neighborhood Frequency (homophones included)	Excluded for missing data
• Freq_N_PH	Phonological Neighborhood Frequency (homophones excluded)	Excluded for missing data
• OG_N	Phonographic Neighbors (homophones excluded)	Excluded for multicollinearity
• OG_N_H	Phonographic Neighbors (homophones included)	Excluded for missing data
• OLD	Average Levenshtein Distance of closest orthographic neighbors	Excluded for multicollinearity
• OLDF	Average log HAL frequency of closest orthographic neighbors	Included
• Ortho_N	Orthographic Neighbors	Included
• Phono_N	Phonological Neighbors (homonyms excluded)	Included
• Phono_N_H	Phonological Neighbors (homonyms included)	Excluded for multicollinearity
• PLD	Average Levenshtein Distance of closest phonological neighbors	Included
• PLDF	Average log HAL frequency of closest phonological neighbors	Included
Word recognition
• LD_Mean_Accuracy	Lexical Decision Accuracy	Included
• LD_Mean_RT	Lexical Decision Time	Excluded for multicollinearity
• LD_Mean_RT_SD	Lexical Decision Time (standard deviation)	Included
• LD_Mean_RT_Zscore	Lexical Decision Time (z-score)	Included
• WN_Mean_Accuracy	Word Naming Response Accuracy	Included
• WN_Mean_RT	Word Naming Response Time	Excluded for multicollinearity
• WN_SD	Word Naming Response Time (standard deviation)	Included
• WN_Zscore	Word Naming Response Time (z-score)	Included

Data analysis

The open-source statistical package JASP 0.18.3.0 (JASP Team, 2024) was used for statistical analysis. RF (Breiman, 2001) was selected as the primary statistical technique to compute variable importance. For each DT, the ratio of training to OOB sample size was set at 67% and 33%, respectively. For the RF model, 70% of the data was used for training, and the remaining 30% was split in half into test and validation datasets. The sample sizes (i.e., number of target words) for the training, validation, and test datasets were 87 (70.7%), 18 (14.6%), and 18 (14.6%), respectively. It is worth noting that JASP holds out test and validation data consecutively, not simultaneously. This means that it first takes out a proportion of the data for the test model and later removes data for the validation model from the remaining data. As a result, the 70:15:15 ratio was obtained by setting test data at 15% of the original dataset, and setting validation data at 17% of the remaining data. The random seed was fixed at 1. The number of features per split was set at m/3, with m being the number of predictors (Fife & D’Onofrio, 2023). As a result, the RF model with 19 predictor variables had 19/3 ≈ 6 features per split.
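
The consecutive hold-out procedure explains why the validation share is entered as 17% rather than 15%: once 15% of the words are set aside for testing, a further 15% of the original dataset corresponds to 0.15/0.85 ≈ 17.6% of what remains, which is close to the 17% setting. The arithmetic can be checked with the sketch below, which assumes that counts are rounded to the nearest whole word (how the software rounds internally is not stated here) and that m/3 is rounded down.

```python
n_words = 123       # target words in the final dataset
n_predictors = 19   # predictors in the final RF model

# Step 1: hold out 15% of the full dataset for testing.
n_test = round(0.15 * n_words)              # 18.45 -> 18
n_remaining = n_words - n_test              # 105

# Step 2: hold out 17% of the remaining data for validation.
n_validation = round(0.17 * n_remaining)    # 17.85 -> 18
n_train = n_remaining - n_validation        # 87

for label, n in [("training", n_train), ("validation", n_validation), ("test", n_test)]:
    print(f"{label}: {n} words ({n / n_words:.1%})")

# Features tried at each split for a numeric outcome: m/3, rounded down.
features_per_split = n_predictors // 3      # 6
print(f"features per split: {features_per_split}")
```

Under these assumptions the sketch reproduces the reported split of 87, 18, and 18 words (70.7%, 14.6%, and 14.6%).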

We applied feature scaling, a statistical method used to normalize the range of several independent variables or features. This variable standardization technique ensures that features measured on different scales are brought onto a similar scale and therefore offers better stability (Hastie et al., 2009). The current version of JASP uses Z-score standardization, with a mean of 0 and an SD of 1, as the default option for feature scaling. JASP determined the optimal number of trees for the RF model to be 786 (with the maximum set at 500,000).
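
Z-score standardization itself is a short transformation, sketched below. The sample values are invented, and the use of the sample standard deviation (n − 1 in the denominator) is an assumption; the software documentation should be consulted for the exact formula it applies.

```python
from statistics import fmean, stdev


def z_scores(values):
    """Rescale values to mean 0 and (sample) standard deviation 1."""
    mean = fmean(values)
    sd = stdev(values)  # sample SD; an assumption about the scaling used
    return [(v - mean) / sd for v in values]


# Invented values on two very different scales.
frequency_band = [1000, 2000, 3000, 4000, 5000]
concreteness = [1.8, 2.4, 3.1, 4.0, 4.7]

for name, values in [("frequency_band", frequency_band), ("concreteness", concreteness)]:
    scaled = z_scores(values)
    print(name, [round(z, 2) for z in scaled])
```

After scaling, a one-unit difference means one standard deviation for every feature, so a predictor expressed in thousands (frequency band) and one expressed on a 1 to 5 scale (concreteness) become directly comparable.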

Correlation coefficients between the metrics of lexical sophistication and IF were computed for a rough comparison in predictor importance between RF and a GLM-based technique. As RF is non-parametric, a non-parametric method for estimating correlation coefficients, Spearman, was selected.

Data generated and used in this study are available on the Open Science Framework (Ha et al., 2024).

Results

Table 2 offers descriptive statistics for each metric. Skewness and kurtosis values indicate departure from normality in some cases. However, as RF does not assume normality, this was not a cause for concern.

Table 2.

Descriptive statistics.

Lexical sophistication metric	Mean (95% CI)	SD	Skewness	Kurtosis
Item facility	0.443 (0.383–0.502)	0.334	0.23	−1.402
BNC/COCA Freq	2926.829 (2702.709–3150.95)	1255.615	0.14	−0.898
Kuperman_AoA_AW	8.199 (7.725–8.674)	2.658	0.005	−0.674
Brysbaert_Concreteness_Combined_AW	3.388 (3.207–3.569)	1.015	0.157	−1.26
USF	12.683 (8.406–16.96)	23.962	4.372	24.324
McD_CD	1.529 (1.403–1.656)	0.71	0.284	−0.911
Sem_D	1.771 (1.721–1.82)	0.279	−0.632	1.144
Ortho_N	3.797 (2.83–4.764)	5.418	1.552	1.426
Phono_N	9.211 (6.759–11.664)	13.74	1.861	2.81
OLDF	7.457 (7.318–7.596)	0.78	−0.161	−0.463
PLD	2.075 (1.913–2.237)	0.908	0.917	0.693
PLDF	7.655 (7.474–7.836)	1.014	0.753	1.444
BG_Mean	3786.534 (3502.163–4070.906)	1593.165	0.598	0.588
LD_Mean_RT_Zscore	−0.452 (−0.496 to −0.408)	0.246	0.813	0.985
LD_Mean_RT_SD	232.199 (219.226–245.17)	72.679	0.616	0.312
LD_Mean_Accuracy	0.967 (0.96–0.973)	0.035	−1.808	5.335
WN_Zscore	−0.369 (−0.418 to −0.32)	0.273	0.974	2.315
WN_SD	145.885 (136.594–155.176)	52.052	1.143	1.084
WN_Mean_Accuracy	0.987 (0.981–0.993)	0.033	−5.023	35.735
aoe_inflection_point_polynomial	6.115 (5.662–6.568)	2.539	−0.257	−0.038

Table 3 shows general information on the RF model and its performance. The similarly low MSE values for the training, test, and validation models indicate good and consistent model fit. The OOB error of 0.069 corresponds to a high degree of agreement (1 − 0.069 = 93.1%). Figure 2 graphically depicts the strong agreement between the training and validation models. The R² value of .581 indicates that the RF model explained a substantial portion of the total variance of word difficulty. Figure 3 provides a visual presentation of the model’s predictive performance on the test data.

Table 3.

General information of the model.

Model performance
Training MSE	0.059
Test MSE	0.059
Validation MSE	0.047
R²	0.581
OOB error	0.069

Figure 2.

Out-of-bag mean-squared error plot.

Figure 3.

Predictive performance plot of the test model.

Table 4 lists the importance of the 19 predictors and their correlations with IF, ordered by total increase in node purity. The results showed a strong degree of agreement between RF’s two metrics of variable importance, namely total increase in node purity and mean decrease in accuracy, up to the eighth predictor (Brysbaert_Concreteness_Combined_AW). AoE (aoe_inflection_point_polynomial) was clearly the strongest predictor of word difficulty, with contextual distinctiveness (McD_CD) and word frequency (BNC/COCA Freq) ranking second and third. Fourth was USF, another metric of contextual distinctiveness, highlighting the importance of this subcategory to word difficulty. Age of acquisition (AoA) came fifth, and the next two places belonged to the standardized values of lexical decision time (LD_Mean_RT_Zscore) and word naming response time (WN_Zscore), the two metrics of word recognition. Although not shown in Table 4, USF and McD_CD correlated at −0.575, and LD_Mean_RT_Zscore and WN_Zscore correlated at 0.567, suggesting that the contribution of each to word difficulty was relatively unique. The presence of several unique predictors suggested that the word difficulty construct is multidimensional. Figures 4 and 5 offer graphic illustrations of the two metrics of predictor importance in RF.

Table 4.

Feature importance metrics and Spearman correlation with item facility.

Lexical sophistication metric	Mean decrease in accuracy	Total increase in node purity	Spearman correlation
aoe_inflection_point_polynomial	0.014	0.900	−0.532
McD_CD	0.012	0.747	−0.573
BNC/COCA Freq	0.011	0.680	−0.594
USF	0.009	0.576	0.597
Kuperman_AoA_AW	0.007	0.509	−0.513
LD_Mean_RT_Zscore	0.004	0.430	−0.493
WN_Zscore	0.004	0.386	−0.372
Brysbaert_Concreteness_Combined_AW	0.001	0.309	0.060
WN_SD	3.625 × 10⁻⁴	0.193	−0.174
BG_Mean	−6.167 × 10⁻⁴	0.187	−0.084
Sem_D	−1.566 × 10⁻⁴	0.176	0.112
LD_Mean_RT_SD	−8.470 × 10⁻⁶	0.157	−0.157
PLD	−2.946 × 10⁻⁴	0.156	−0.218
OLDF	−1.643 × 10⁻⁴	0.146	0.222
PLDF	−5.565 × 10⁻⁴	0.144	0.217
Phono_N	2.772 × 10⁻⁴	0.124	0.210
LD_Mean_Accuracy	9.389 × 10⁻⁴	0.119	0.223
Ortho_N	−5.761 × 10⁻⁴	0.091	0.175
WN_Mean_Accuracy	−1.534 × 10⁻⁴	0.039	0.225

Figure 4.

Mean decrease in accuracy plot.

Figure 5.

Total increase in node purity plot.

Table 4 also shows the correlations between the 19 predictors and IF. There is some agreement between the correlation coefficients and RF’s two metrics of predictor importance. However, the predictors with the strongest correlations with IF, such as BNC/COCA frequency ranking and USF, were not ranked first in the RF model. Similarly, Brysbaert’s concreteness value had a substantially lower correlation than LD_Mean_Accuracy and WN_Mean_Accuracy but was ranked higher than those predictors in the RF model.
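
The divergence between the two rankings can be verified directly from Table 4. The Python sketch below re-ranks the 19 predictors by the absolute value of their Spearman correlation with IF and sets that rank beside their rank by total increase in node purity. The values are copied from Table 4; the variable and function names are hypothetical.

```python
# (metric, total increase in node purity, Spearman correlation with IF), from Table 4
table4 = [
    ("aoe_inflection_point_polynomial", 0.900, -0.532),
    ("McD_CD", 0.747, -0.573),
    ("BNC/COCA Freq", 0.680, -0.594),
    ("USF", 0.576, 0.597),
    ("Kuperman_AoA_AW", 0.509, -0.513),
    ("LD_Mean_RT_Zscore", 0.430, -0.493),
    ("WN_Zscore", 0.386, -0.372),
    ("Brysbaert_Concreteness_Combined_AW", 0.309, 0.060),
    ("WN_SD", 0.193, -0.174),
    ("BG_Mean", 0.187, -0.084),
    ("Sem_D", 0.176, 0.112),
    ("LD_Mean_RT_SD", 0.157, -0.157),
    ("PLD", 0.156, -0.218),
    ("OLDF", 0.146, 0.222),
    ("PLDF", 0.144, 0.217),
    ("Phono_N", 0.124, 0.210),
    ("LD_Mean_Accuracy", 0.119, 0.223),
    ("Ortho_N", 0.091, 0.175),
    ("WN_Mean_Accuracy", 0.039, 0.225),
]


def rank_by(rows, key):
    """Map each metric name to its rank (1 = largest) under the given key."""
    ordered = sorted(rows, key=key, reverse=True)
    return {row[0]: position for position, row in enumerate(ordered, start=1)}


purity_rank = rank_by(table4, key=lambda row: row[1])
correlation_rank = rank_by(table4, key=lambda row: abs(row[2]))

print(f"{'metric':<36}{'purity rank':>12}{'|rho| rank':>12}")
for name, _, _ in table4:
    print(f"{name:<36}{purity_rank[name]:>12}{correlation_rank[name]:>12}")
```

In this re-ranking, USF and BNC/COCA Freq take the first two places by absolute correlation, while aoe_inflection_point_polynomial drops to fourth and Brysbaert_Concreteness_Combined_AW falls from eighth to last.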

Discussion and conclusion

By using a more rigorous measure of word difficulty and arguably better data mining procedures, this brief report represents another attempt to examine which lexical characteristics best predict word difficulty. The results support the findings of Vitta et al. (2023) and Hashimoto and Egbert (2019) that word difficulty is predicted by multiple aspects of lexical sophistication in addition to frequency. This study found AoE (aoe_inflection_point_polynomial) to be the most important predictor of word difficulty. This is a new and interesting finding, as AoE was not included in Hashimoto and Egbert’s or Vitta et al.’s models. Vitta et al. excluded AoE due to potential distributional violations, which are not a concern for RF. In fact, Dascalu et al. (2016) introduced AoE as an improved model of the word learning process that would overcome the limitations of AoA and Word Maturity (Landauer et al., 2011). The second most important predictor, McD_CD, was also ranked as one of the strongest predictors of word difficulty in Hashimoto and Egbert’s model.

Unlike other studies that based their frequency measures on metrics provided by TAALES, this study used BNC/COCA bands so that the findings can be directly related to BNC/COCA-based vocabulary tests. The results indicate that word difficulty is predicted by several unique variables, the most important of which is not BNC/COCA frequency. This could explain why Webb et al. (2017, p. 47) could only partially model word difficulty according to BNC/COCA frequency in their vocabulary test validation study.

These findings provide further evidence that frequency, particularly when it is based on corpora comprising mostly written British and American English, offers only a rough estimation of word difficulty. This may be especially true in EFL contexts where the exposure of learners to English words is heavily influenced by textbooks and teachers’ lessons, which may not always reflect natural frequency (Webb et al., 2017, pp. 46–48). In this study, the measure of word difficulty was based on a sample of Vietnamese learners, whose learning experiences may not align well with British- and American-based frequency rankings. This misalignment between the subjects and the variables, that is, between the measures of word difficulty and lexical sophistication, is a limitation of this study and arguably of all previous studies. This was why Hashimoto and Egbert (2019) excluded AoA from their analysis, stating that the metric pertained “only to learners from specific L1 backgrounds” (p. 850). In addition, the fact that the dataset included responses from only Vietnamese learners limits generalizability. That is, the results may differ if word difficulty included data from learners of different backgrounds. This is something that future research should examine.

As a tool for exploratory data analysis, RF’s estimation of variable importance does not involve significance testing, which is a feature of confirmatory data analysis (Fife & D’Onofrio, 2023). As a result, certain concerns may be raised regarding the statistical significance of the differences between values of predictor importance as well as the ranking of predictors. While we acknowledge this limitation of the method, it is worth noting that RF has been shown to make robust inferences beyond the data (Fox et al., 2017). Moreover, the results of an RF model are cross-validated, which warrants reproducibility. In other words, the cross-validation procedures make sure that the results do not occur by mere chance. In addition, as the estimation of mean decrease in accuracy involves random permutation of subjects’ OOB scores, it greatly reduces the effects of correlations between predictors and therefore reflects their unique importance to the outcome variable. Together, these warrant that the ranking of predictors is based on their unique importance, which is robust and reproducible.

Findings from the current study have implications for the sampling process in vocabulary test development, especially tests that sample from the BNC/COCA lists. That is, when selecting target words from a particular frequency level, perhaps attention should be paid to other word difficulty metrics so that the average difficulty of the sample better reflects that of the whole frequency level.

Footnotes

Acknowledgements

The authors would like to express their sincere thanks to the four anonymous reviewers who invested their valuable time in reviewing our paper. Their expertise and thoughtful suggestions have significantly improved the quality of our manuscript. They also wish to thank editors Ruslan Suvorov and Benjamin Kremmel for their efforts in facilitating a smooth and timely review process.

Author contributions

Hung Tan Ha: Conceptualization; Data curation; Formal analysis; Investigation; Methodology; Project administration; Resources; Software; Supervision; Validation; Visualization; Writing—original draft; Writing—review & editing.

Duyen Thi Bich Nguyen: Data curation; Formal analysis; Investigation; Methodology; Resources; Writing—review & editing.

Tim Stoeckel: Formal analysis; Investigation; Visualization; Writing—original draft; Writing—review & editing.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Ethics approval

We obtained ethics approval from the University of Economics Ho Chi Minh City (UEH). All participants provided written informed consent.

Availability of data and materials

The data that support the findings of this study are openly available on our Open Science Framework project page (Ha et al., 2024).

References

Brandić, Pezo, Bilandžija, Peter, Šurić, & Voća (2023). Comparison of different machine learning models for modelling the higher heating value of biomass. Mathematics, 11, 2098. https://doi.org/10.3390/math11092098

Breiman (2001). Random forests. Machine Learning, 45(1), 5–32.

Dascalu, McNamara, D. S., Crossley, S. A., & Trausan-Matu (2016). Age of exposure: A model of word learning. Proceedings of the AAAI Conference on Artificial Intelligence, 30(1). https://doi.org/10.1609/aaai.v30i1.10372

Fife, D. A., & D’Onofrio (2023). Common, uncommon, and novel applications of random forest in psychological research. Behavior Research Methods, 55, 2447–2466. https://doi.org/10.3758/s13428-022-01901-9

Fox, E. W., Hill, R. A., Leibowitz, S. G., Olsen, A. R., Thornbrugh, D. J., & Weber, M. H. (2017). Assessing the accuracy and stability of variable selection methods for random forest modeling in ecology. Environmental Monitoring and Assessment, 189, 316. https://doi.org/10.1007/s10661-017-6025-0

Ha, H. T., Nguyen, D. T. B., & Stoeckel, T. (2024, June 3). What is the best predictor of word difficulty? A case of data mining using random forest. https://doi.org/10.17605/OSF.IO/JC5KA

Hashimoto, B. J. (2021). Is frequency enough? The frequency model in vocabulary size testing. Language Assessment Quarterly, 18(2), 171–187. https://doi.org/10.1080/15434303.2020.1860058

Hashimoto, B. J., & Egbert (2019). More than frequency? Exploring predictors of word difficulty for second language learners. Language Learning, 69(4), 839–872. https://doi.org/10.1111/lang.12353

Hastie, Tibshirani, & Friedman (2009). The elements of statistical learning: Data mining, inference, and prediction (2nd ed.). Springer.

JASP Team. (2024). JASP (Version 0.18.3) [Computer software]. https://jasp-stats.org/

Kyle, Crossley, S. A., & Berger (2018). The tool for the analysis of lexical sophistication (TAALES): Version 2.0. Behavior Research Methods, 50(3), 1030–1046. https://doi.org/10.3758/s13428-017-0924-4

Landauer, T. K., Kireyev, & Panaccione (2011). Word maturity: A new metric for word knowledge. Scientific Studies of Reading, 15(1), 92–108. https://doi.org/10.1080/10888438.2011.536130

Landis, J. R., & Koch, G. G. (1977). The measurement of observer agreement for categorical data. Biometrics, 33(1), 159–174. https://doi.org/10.2307/2529310

McLean & Kramer (2015). The creation of a new vocabulary levels test. Shiken, 19(2), 1–11. https://teval.jalt.org/sites/default/files/19-02-1_McLean_Kramer.pdf

Mizumoto (2023). Calculating the relative importance of multiple regression predictor variables using dominance analysis and random forests. Language Learning, 73(1), 161–196. https://doi.org/10.1111/lang.12518

Moghadam, S. M., Yeung, & Choisne (2023). A comparison of machine learning models’ accuracy in predicting lower-limb joints’ kinematics, kinetics, and muscle forces from wearable sensors. Scientific Reports, 13, 5046. https://doi.org/10.1038/s41598-023-31906-z

Nation, I. S. P., & Coxhead (2021). Measuring native-speaker vocabulary size. John Benjamins Publishing Company.

Robles-García, Stewart, Nicklin, Vitta, J. P., McLean, & Kramer (2023). ‘The wisdom of crowds’: When teacher judgments outperform word-frequency as a predictor of students’ vocabulary knowledge. Language Teaching Research. Advance online publication. https://doi.org/10.1177/13621688231176067

Sarker, I. H. (2021). Machine learning: Algorithms, real-world applications and research directions. SN Computer Science, 2, Article 160. https://doi.org/10.1007/s42979-021-00592-x

Stewart, Vitta, J. P., Nicklin, McLean, Pinchbeck, G. G., & Kramer (2022). The relationship between word difficulty and frequency: A response to Hashimoto (2021). Language Assessment Quarterly, 19(1), 90–101. https://doi.org/10.1080/15434303.2021.1992629

Vitta, J. P., Nicklin, & Albright, S. W. (2023). Academic word difficulty and multidimensional lexical sophistication: An English-for-academic-purposes-focused conceptual replication of Hashimoto and Egbert (2019). The Modern Language Journal, 107(1), 373–397. https://doi.org/10.1111/modl.12835

Webb, Sasao, & Ballance (2017). The updated vocabulary levels test: Developing and validating two new forms of the VLT. ITL - International Journal of Applied Linguistics, 168(1), 33–69. https://doi.org/10.1075/itl.168.1.02web