Abstract
We analysed aggregated TOEFL scores from over 15 million non-immersion learners across 127 countries and 65 first languages (L1s), spanning a 16-year period (2005–20). This extensive dataset enabled us to investigate not only the effects of various country-level characteristics – including linguistic distances – but also the dynamic nature of these effects over time. To capture complex, non-linear interactions, we applied a generalized additive mixed regression model (GAMM), with particular focus on three linguistic distance measures: morphological, lexical, and phonological. This approach allowed us to map the evolving role of morphological distance in relation to the other two dimensions. All three linguistic distance measures significantly contributed to explaining variation in L2 English proficiency at the country level. Greater linguistic distance was consistently associated with lower TOEFL scores. Morphological distance provided a distinct and competitive contribution alongside lexical and phonological distances. Importantly, all three measures interacted significantly with the year of testing, indicating that the impact of linguistic distance is not static but evolves over time. These temporal changes likely reflect the increasing global presence and dominance of English as a second language. Specifically, morphological and phonological properties gained importance at higher levels of L2 proficiency, while the influence of lexical distance gradually diminished.
Keywords
I Introduction
English has become a dominant second language (L2) worldwide. Around one billion people have learned English as a second language, far more than the 370 million people who use English as their first language (Eberhard et al., 2022). The L2 English testing industry has become a billion-dollar enterprise (Spherical Insights, 2021). Language testing companies aggregate the mean scores of their examinees by country or by native (first) language (L1). Each year, Educational Testing Service (ETS) publishes long lists of TOEFL country and L1 scores. Students with Danish, Dutch, or German as their native language have the highest TOEFL scores year after year, while students with African native languages such as Bambara, Hausa, and Wolof had the lowest scores throughout the period 2005–20. South Africa is among the highest-scoring countries. Many factors may contribute to TOEFL country scores, including socioeconomic, educational, cultural, and historical ones, but in this study we attempt to isolate the impact of linguistic factors, namely the influence of the L1.
Remarkably, TOEFL adds, year after year, the following disclaimer: ETS, creator of the TOEFL® test, does not endorse the practice of ranking countries on the basis of TOEFL scores, as this is a misuse of data. The TOEFL test provides accurate scores at the individual level; it is not appropriate for comparing countries. The differences in the number of students taking the test in each country, how early English is introduced into the curriculum, how many hours per week are devoted to learning English, and the fact that those taking the test are not representative of all English speakers in each country or any defined population make ranking by test score meaningless. (ETS, 2021: 19)
Nevertheless, we have analysed these scores, focusing on the effect of linguistic distance in general and morphological distance in particular. The concept of linguistic distance is based on the degree of dissimilarity between two languages. The less similar they are the larger the distance. We explored the country scores in great detail on the basis of the linguistic distance of their native language to English, the second or additional language. Although this approach may be qualified as unusual, the outcomes may prove its potential by bringing insight into matters that we cannot investigate otherwise. We begin by considering the arguments of ETS against using their country scores for further analyses.
The arguments of ETS are less evident than they seem. It is obvious that the TOEFL test is not representative of all English speakers in a country, yet it should be equally evident that the test is assumed to be representative for the target population as defined by ETS, i.e. pre-academic students. TOEFL defines its target population in the following way: The TOEFL iBT [internet-based test] test was introduced in the United States in September 2005 and was gradually introduced worldwide during 2005 and 2006. The test was developed in response to a request by institutions to provide a test that would measure non-native speakers’ ability to communicate in English in an academic setting. (ETS, 2021: 4)
It includes pre-academic students who aim to enroll at universities with English-medium instruction (EMI). This restricted target population represents the upper levels of English proficiency within a country, but the country scores make clear that there are significant proficiency differences between countries. We agree that various educational factors (e.g. how early English is introduced in the curriculum and the number of hours devoted to learning English) can be impactful, but the relevant data are unfortunately not available for all the countries involved. We have chosen to explain differences between countries on the basis of linguistic distances and economic indices (which include information on the quality of the educational curricula of countries). What is especially relevant, though, is that examinees are well informed about the criteria by which their test results will be evaluated and that mock exams are widely available. Examinees know what they are expected to deliver. In addition, TOEFL is an expensive test, implying that examinees can be expected to do their utmost to pass the exam. Lastly, it is not entirely clear what ETS means by ‘differences in the number of students’. If it means that differences in the number of students between countries are not visible in the aggregated TOEFL L1 scores for L1s like Arabic, French, or Spanish, which are spoken in a large number of countries, then we agree. A number bias does not apply to the TOEFL country scores, as no numbers of examinees are involved in comparing country scores, only their aggregated scores.
The consistency of the TOEFL outcomes over the years suggests that TOEFL scores can be used to explore their relationship with the L1s involved and with (socio)economic variables. 1 Our overall question is whether linguistic distance, and morphological distance in particular, correlates with the aggregated TOEFL scores of the countries. In earlier research (Schepens, 2015; Schepens et al., 2013, 2020, 2023; Van der Slik, 2010; Van der Slik et al., 2015, 2019), we computed dissimilarity or distance measures between Dutch (L2) and other languages (L1s) following an algorithmic linguistic approach, comparing words (lexicon), constructions (morphosyntax), and sounds (phonology). The lexical distance measure was based on the proportion of cognates two Indo-European languages share. This measure is obviously only appropriate for languages of the Indo-European language family. The morphological and phonological distance measures can be applied to all kinds of language pairs, from any language family. The morphological measure scales the degree of morphological differences between two languages, making a distinction between less and more complex features. The phonological measure determines the number of phonological features not present in a specific source language compared to a target language (an extensive explanation is given in Schepens et al., 2023). All three distance measures substantially contributed to predicting L2 Dutch (the target language) proficiency on the basis of the L1s of adult learners (the source languages), using a large database containing test proficiency scores of 56,000 L2 adult learners from 117 countries with 50 different L1s. We found a distinctive and consistent effect for all three distance measures: the larger the linguistic distance, the lower the L2 Dutch proficiency.
At this point a comparison with another test might shed more light on the potential variation of samples across countries. We earlier analysed the aggregated scores of Education First’s Standard English Test (henceforth EF scores) from 2 million learners of English in 110 countries worldwide with 61 different L1s, and found substantial linguistic distance effects, but these scores covered only one specific year (Van Hout and Van der Slik, 2024). We deem it relevant to validate this outcome with other data, preferably with a wider time span, to investigate how stable linguistic distance effects are and to track developments over time. The ETS TOEFL iBT (internet-based test) reports provide aggregated L2 proficiency scores at country and L1 level over a longer time span. The database we used comprises aggregated TOEFL iBT total scores (henceforth TOEFL scores) of more than 15 million learners of English in 127 countries worldwide with 65 different L1s, covering the period 2005–20. 2
The time depth of the TOEFL data opens up the opportunity to investigate trends in the measured levels of English proficiency. The time window from 2005 to 2020 seems wide enough to observe changes in the educational and economic situation of countries over the last two decades. Moreover, the position of English as a dominant L2 has strengthened over the last two decades and its dominance is expected to increase even further (Zeng and Yang, 2024). The number of learners of English has shown an upward trend over the last decades. In the period from July 1993 to June 1995, for example, the number of TOEFL examinees was around 1.4 million (ETS, 1996); this number has risen to an estimated 2.4 million in 2024. 3
We presume that both global (worldwide) and local (country specific) developments may affect linguistic distance effects. Distance can be seen as a proxy to the effort to be made by a learner to acquire an additional target language, a distance that is maximal when the learner starts from scratch. When English becomes an additional language, strongly present in a society and its educational system, the familiarity with English and starting proficiency levels of learners change. That means that the effects of linguistic distance change on the individual level, but also on aggregated levels such as the country level. The time depth of the TOEFL data offers the opportunity to test the changing influence of linguistic distances. Prior linguistic knowledge, captured through distance, interacts with other aspects of L2 acquisition (input, exposure, quality of teaching, education system, language dominance, etc.). Such changes are observable in the cross-linguistic influences between a dominant and non-dominant language in bilingual speakers. Their strength might reduce when the dominance of the stronger language becomes weaker (see Van Dijk et al., 2022).
Our approach implies that economic, cultural and educational variables need to be included as much as possible in the analysis of the proficiency data, to ascertain a valid analysis of the effects of the three linguistic distance measures in which we disentangle socio-economic and linguistic factors. However, having only aggregated data has the consequence that we are unable to examine the impact of predictors at the individual level, such as education, gender, age of onset, or length of experience. This might be considered a setback, but on the other hand the explained variance of individual level predictors in large-scale studies of second language acquisition (SLA) is rather small and in fact much smaller than the explained variance of linguistic distance and country level predictors. From the figures presented in Schepens et al. (2013: 219), for example, it can be deduced that the sum of the country and L1 level explained variances of 16.4% (8.5% and 7.9%, respectively) is almost four times as large as the contribution of the individual level predictors.
We formulated three hypotheses to test to what extent linguistic distance may affect the aggregated levels of proficiency in the countries involved, with their many L1s. The hypotheses go from a global level of impact of linguistic distance (the ‘a’ variants of the three hypotheses, including lexical, morphological, and phonological distance) to a more specific spell-out of the distance effects, in particular in relation to morphological distance (the ‘b’ variants). We want to investigate the distinctive contribution of morphosyntactic parameters in relation to other linguistic distance effects, given the focus of this special issue, but also because of the remarkable role of morphology in second language acquisition by adults. Bentz and Winter (2013) concluded not only that many studies on second language acquisition attest to the difficulty of acquiring morphological case, but also that morphosyntactic complexity is reduced by higher degrees of language contact involving adult learners. We are going to test three hypotheses:
• Hypothesis 1: (a) Linguistic distance predicts L2 English country scores in a statistically significant way. (b) Morphological distance contributes in a distinctive way, alongside lexical and phonological distance effects.
• Hypothesis 2: (a) The larger the linguistic distance between English and the dominant language of a country the lower the aggregated L2 English score of a country. Linguistic distance effects monotonically decrease. (b) Morphological distance has its own pattern of monotonic decrease, distinct from other measures of linguistic distance.
• Hypothesis 3: (a) The effect of linguistic distance is dynamic and changes over time. (b) Morphological distance has its own dynamic pattern.
In testing these hypotheses, we opted for a generalized additive mixed regression model (GAMM). This statistical model has the advantage that non-linear effects can be estimated in a fairly consistent way, but the obvious disadvantage is that the number of potential regression models grows rapidly and that identifying the best solution is less straightforward than in linear mixed modeling. Baayen and Linke (2020) give a basic introduction to GAMM with useful references to applications in linguistics (including sociolinguistics and psycholinguistics). They explicitly mention its excellent properties for analysing naturalistic data with non-linear properties. In this contribution we will focus on the outcomes. The data and the R code for the many analyses are freely available at the Open Science Framework (https://osf.io/x8mw5) (Van der Slik and Van Hout, 2025).
II Method
1 Country and first language scores
In the present study, we make use of the TOEFL iBT data that have been reported for 2005, 2008, 2010, 2013, 2014, 2015, 2016, 2017, 2018, 2019, and 2020 (ETS, 2007, 2009, 2011, 2014, 2015, 2016, 2017, 2018, 2019, 2020c, 2021). In these reports, aggregated country of origin data and aggregated data on test takers’ native or first language (L1) are reported in separate tables. In the Appendix in supplemental material, we include a table with an overview of the relevant variables, which also gives insight into the reasons for including and excluding countries in the final analyses. We selected year of testing 2016 for illustrative purposes.
Our first step was to store the country level data and to add for each country the most likely L1. For instance, we assumed that the Dutch country scores mainly represent Dutch L1 scores. Conversely, Dutch L1 scores cannot be taken to represent Dutch country scores, as Dutch is spoken in the Netherlands, Belgium, and Suriname. For most European, American, and Asian countries, this procedure seems unproblematic as many of these countries are mainly monolingual. 4 For the African continent, however, this procedure was less straightforward for at least two reasons. The first reason is that many African countries are multilingual, making it difficult, if possible at all, to determine the most obvious L1. For that reason, we used information from the CIA World Factbook, Wikipedia, Ethnologue (Grimes and Grimes, 2000), and WALS (Dryer and Haspelmath, 2011) to make an educated guess. Our prime criteria for selecting a specific L1 were that the L1 had to be available in TOEFL’s L1 list, that the L1 is at least one of the most widespread L1s in that particular country, and finally that it is used as a lingua franca. In Cameroon, for example – partly a former German, French, and British colony – 277 living languages have been recorded (www.ethnologue.com/country/CM). Both French and English are official languages, of which French is most widely spoken and used as a lingua franca by 57% of the 29.3 million Cameroonians (https://en.wikipedia.org/wiki/Languages_of_Cameroon). We therefore assumed French as representing Cameroon’s L1. In Nigeria – a former British colony – 520 languages are currently spoken, according to Ethnologue, of which English is an official language (used by 60 million). So we might assume Nigeria’s country score to represent English as its L1. As we shall see shortly, Nigeria was excluded from the analysis.
Exceptions, however, apply to South Africa, Ethiopia, Indonesia, the Philippines, Sri Lanka, Turkey, Morocco, and Ukraine. For South Africa, we used the Zulu and Afrikaans L1 scores as substitutes for South Africa’s country scores, implying that South Africa was represented twice in the database. More South African L1s, such as Xhosa, might have been chosen. However, the number of Xhosa speakers was lower than 30 across all years, implying that TOEFL did not report test scores for Xhosa. In the following additional cases, we used two L1 scores as substitutes for a country score: Amharic and Oromo for Ethiopia; Indonesian and Javanese for Indonesia; Tagalog and Cebuano for the Philippines; and Tamil and Sinhalese for Sri Lanka. For Ukraine, we used the L1 Ukrainian score. In the case of Morocco and Iraq, we added Berber L1 scores to Morocco’s country score and Kurdish L1 scores to Iraq’s country score, in addition to Arabic. 5
The second reason is that many African countries as well as Asian countries, have a colonial past which complicates first language determination. For several countries, the former colonial language is still (one of) the official or dominant language(s). This applies to English in the following 14 countries: Australia, Bahamas, India, Ireland, Liberia, Mauritius, Namibia, New Zealand, Nigeria, South Sudan, Swaziland, Trinidad and Tobago, and the USA. These countries, and Great Britain and Ireland, too, were excluded from further analyses. French is still (one of) the official or dominant language(s) in: Cameroon, Canada, Congo, French Polynesia, Gabon, Martinique, and Réunion; this applies to Portuguese in Angola, Brazil, and Mozambique, and to Spanish in the following 22 countries: Argentina, Bolivia, Chile, Colombia, Costa Rica, Cuba, Dominican Republic, Ecuador, El Salvador, Guatemala, Honduras, Jamaica, Mexico, Nicaragua, Panama, Paraguay, Peru, Puerto Rico, Sierra Leone, Uruguay, and Venezuela. In a number of other countries, the former colonial ties with Great Britain are less formalized yet still influential. We will return to this issue in the next section.
2 Data
Initially, we had 1,852 observations at our disposal from 167 countries or regions with 88 different associated L1 languages. It is not entirely clear on how many students these aggregated data are based since ETS does not disclose the number of examinees from 2005 onwards. A conservative estimate is at least 15 million over the period 2005–20. Currently, 2.3 million examinees take the test yearly (https://en.wikipedia.org/wiki/Test_of_English_as_a_Foreign_Language).
Excluding English-speaking countries reduced the number of observations to 1,728 from 153 countries (and 87 L1s). As we shall see below we had to exclude 22 L1s having no score on the morphological distance measure (10 from Africa, 3 from Europe, 1 from America, 8 from Asia).
This reduced the number of observations to 1,486 for 65 L1s from 133 countries or regions. Finally, excluding countries with no enrollment scores resulted in 1,425 observations, from 127 countries and 65 native languages. The analyses are based on these last figures.
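The successive exclusions can be tallied to check that the reported sample sizes are mutually consistent. The following is a minimal sketch in Python (used here only for illustration; the analyses themselves were run in R). The figures are those reported in the text; the variable and function names are hypothetical.

```python
# Sample sizes after each exclusion step, as reported in the text:
# (label, observations, countries or regions, L1s)
STEPS = [
    ("initial data", 1852, 167, 88),
    ("English-speaking countries excluded", 1728, 153, 87),
    ("L1s without a morphological distance score excluded", 1486, 133, 65),
    ("countries without enrollment scores excluded", 1425, 127, 65),
]


def losses(steps):
    """Return (label, observations, countries, L1s) dropped at each step."""
    dropped = []
    for previous, current in zip(steps, steps[1:]):
        dropped.append((
            current[0],
            previous[1] - current[1],
            previous[2] - current[2],
            previous[3] - current[3],
        ))
    return dropped


for label, d_obs, d_countries, d_l1s in losses(STEPS):
    print(f"{label}: -{d_obs} observations, -{d_countries} countries, -{d_l1s} L1s")
```

The second step drops 22 L1s (87 minus 65), which matches the 22 L1s without a morphological distance score mentioned above.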
3 Variables
a TOEFL iBT test (TOEFL)
Total score (Mean = 81.65; SD = 9.62) on the Test of English as a Foreign Language. This test has a history going back to the 1960s. In 2005, the internet-based version (iBT) was introduced, covering four language skills (reading, listening, speaking, and writing) that are considered important for effective communication in an academic setting. The test is aimed at measuring the English proficiency of potential students whose native language is not English and who want to study at colleges and universities with English as a medium of instruction (ETS, 2007).
Both the reliability and comparability (ETS, 2020a) as well as validity (ETS, 2020b) are well established. Scores can range from 0 to 30 for each section, adding up to a maximum score of 120. A typical exam takes between 200 and 390 minutes to complete. For detailed information on the administration and operationalization of each skill, see ETS (2009).
b Year
TOEFL measurements were reported for each of the following 11 years: 2005, 2008, 2010, 2013, 2014, 2015, 2016, 2017, 2018, 2019, and 2020. The data for 2006, 2007, 2009, 2011, and 2012 are missing; we were unable to trace them, apparently because ETS did not report the data for those years. The 11 available years allow us to examine the impact of year of testing.
c Status of English (SE)
Coding whether (1) or not (0) a country uses English as one of the official languages, next to an indigenous language, or as a medium of instruction in secondary schooling, or as commonly used as a lingua franca in business and public life. 6 This information was obtained from the World Atlas (2022). We identified the following 23 countries and regions to which this qualification applies: Bahrain, Cameroon, Canada, Cyprus, Ghana, Hong Kong, Israel, Jordan, Kenya, Kuwait, Malaysia, Myanmar, Pakistan, Panama, Philippines, Qatar, Singapore, South Africa, Sri Lanka, Sudan, Tanzania, UAE, and Uganda.
d Gross enrollment ratio of secondary schooling 1990–2020 (enroll)
The World Bank (https://databank.worldbank.org) publishes a wealth of socioeconomic indicators on country and regional level, covering several decades. In previous studies, we have found that country level scores on gross enrollment in secondary schooling are strongly associated with test takers’ scores on second language proficiency in Dutch (e.g. Schepens et al., 2023; Van der Slik, 2010). In the present study, we will extend that line of research by performing a time lag analysis over the years 1990–2020. Missing scores per country were imputed by nearest enrollment scores for that country if present. Countries with no enrollment scores at all were deleted from further analyses. We explain the imputation protocol in supplemental material.
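Nearest-value imputation of this kind can be sketched as follows. This is a minimal illustration in Python, not the protocol documented in the supplemental material: the function name, the data layout (one dictionary of year to enrollment ratio per country), and the tie rule (prefer the earlier year when two observed years are equally near) are all assumptions.

```python
def impute_nearest(series, years):
    """Fill each requested year with the value of the nearest observed year.

    series: dict mapping year -> observed enrollment ratio for ONE country.
    years:  iterable of years for which a value is needed.
    Ties between an earlier and a later observed year go to the earlier
    year (an assumption of this sketch).
    Returns None when the country has no observed value at all, mirroring
    the exclusion of such countries from the analyses.
    """
    if not series:
        return None
    observed = sorted(series)
    filled = {}
    for year in years:
        nearest = min(observed, key=lambda y: (abs(y - year), y))
        filled[year] = series[nearest]
    return filled


# Hypothetical country with only two observed enrollment scores.
toy = {1990: 50.0, 2000: 70.0}
print(impute_nearest(toy, range(1990, 2001)))
```

Observed years keep their own value, because the nearest observed year to an observed year is that year itself.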
Given the delayed effect of socio-economic indices in relation to schooling, we needed to establish the optimal time lag, if present, between earlier gross enrollment and later TOEFL scores. 7 It may be assumed that a country’s investment in increasing their citizens’ human capital, operationalized here by secondary schooling participation, will not have an instant effect but will take some time to materialize. Differently formulated, how many years will it take to see the emerging effect of a country’s educational investments on increasing proficiency in English? Our approach can be linked to Ogburn’s (1922) cultural lag theory in which cultural innovations lag behind the development of material opportunities.
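To make the notion of a time lag concrete: for a lag of k years, each TOEFL country score is paired with that country's enrollment score from k years earlier. The sketch below, in Python with hypothetical names and invented toy values, shows this pairing for a single country. With enrollment data starting in 1990 and the first test year being 2005, lags of up to 15 years are possible.

```python
def lagged_pairs(toefl, enroll, lag):
    """Pair each TOEFL score with the enrollment score `lag` years earlier.

    toefl:  dict mapping test year -> TOEFL country score (one country).
    enroll: dict mapping year -> gross enrollment ratio (same country).
    Returns a list of (test_year, lagged_enrollment, toefl_score) tuples;
    test years without an enrollment score at year - lag are skipped.
    """
    pairs = []
    for year, score in sorted(toefl.items()):
        value = enroll.get(year - lag)
        if value is not None:
            pairs.append((year, value, score))
    return pairs


# Invented toy values for one hypothetical country.
toy_toefl = {2005: 80.0, 2008: 82.0}
toy_enroll = {1998: 60.0, 2001: 65.0}
print(lagged_pairs(toy_toefl, toy_enroll, lag=7))
```

Repeating this for every candidate lag yields one predictor per lag, each of which can then be entered into a separate model for comparison.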
e First language (L1)
This is the country’s designated indigenous language. There are 65 languages for which morphological distance scores could be computed and which could be connected to a country of origin. These include 31 Indo-European languages, seven Afro-Asiatic, five Niger-Congo, five Austronesian, four Altaic, three Uralic, two Sino-Tibetan, two Tai-Kadai, two Austro-Asiatic, and one Dravidian, Japonic, Kartvelian, and Korean language. Two L1s occur by far the most frequently: Arabic (n = 19) and Spanish (n = 22).
f Country of origin (C)
This is the examinee’s native country. There are 127 native countries. These include 33 European, 41 Asian, 24 African, 16 South-American, 12 North-American, and one Oceanic country.
g Lexical distance (LEX)
This is a symmetric measure that represents the sum of branch lengths that connect two languages in a phylogenetic language tree of the Indo-European language family based on lexical items (for the Dutch version of this measure, see Schepens et al., 2013; for the original Indo-European tree, see Gray and Atkinson, 2003). This measure is assumed to be particularly sensitive to distances between English and other Indo-European languages. Because these lexical items do not occur in non-Indo-European language families, the consequence is a maximum distance between English and non-Indo-European languages of 19.03, which is also the score of Urdu, the lexically most distant Indo-European language to English in this study (M = 15.14, SD = 4.08).
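The idea of summing branch lengths along the path that connects two languages can be illustrated with a small sketch. The tree below is a toy example with invented node names and branch lengths, not the Gray and Atkinson (2003) tree; the function names are hypothetical. Languages outside the tree receive the maximum distance of 19.03 reported above.

```python
MAX_LEX = 19.03  # maximum lexical distance reported in the text

# Toy tree: node -> (parent, branch length). All lengths are invented.
TOY_TREE = {
    "English": ("WestGermanic", 2.0),
    "Dutch": ("WestGermanic", 1.5),
    "WestGermanic": ("Germanic", 1.0),
    "Swedish": ("Germanic", 2.5),
    "Germanic": ("Root", 3.0),
    "French": ("Root", 6.0),
}


def path_to_root(tree, node):
    """Map each ancestor of `node` (and the node itself) to the summed
    branch length between `node` and that ancestor."""
    distances, total = {node: 0.0}, 0.0
    while node in tree:
        parent, length = tree[node]
        total += length
        distances[parent] = total
        node = parent
    return distances


def lexical_distance(tree, a, b, cap=MAX_LEX):
    """Sum of branch lengths connecting languages a and b; languages that
    are not in the tree get the maximum distance `cap`."""
    if a not in tree or b not in tree:
        return cap
    up_a, up_b = path_to_root(tree, a), path_to_root(tree, b)
    shared = set(up_a) & set(up_b)
    # The lowest common ancestor gives the shortest connecting path.
    return min(up_a[node] + up_b[node] for node in shared)


print(lexical_distance(TOY_TREE, "English", "Dutch"))
print(lexical_distance(TOY_TREE, "English", "Japanese"))
```

Because the path between two leaves is the same in both directions, the measure is symmetric, in contrast to the morphological and phonological measures.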
h Morphological distance (MORF)
This is an asymmetric measure that compares the morphological features between languages according to differences in complexity (for the Dutch version of this measure, see Schepens, 2015). We used an existing list with rank orderings for the feature values of 23 morphological features (Lupyan and Dale, 2010; see their Table 1 and Text S6). We computed distances for the 69 languages that have at least five available values in WALS (Dryer and Haspelmath, 2011) (M = 0.25, SD = 0.13). This measure is assumed to be particularly sensitive for distances to non-Indo-European languages. We redefined one of the morphological features, feature 77 which stands for ‘Semantic distinctions of evidentiality’ with feature values: (1) no grammatical evidentials, (2) indirect only, and (3) direct and indirect. In previous research on morphological distance to Dutch, we made a distinction between 1 vs. 2 and 3. This turned out to be a bad decision for English as target language, as English is the only Germanic language with feature value 1. We therefore decided to make a distinction between 1 and 2 vs. 3, making feature value 3 the classification that differs from English. The Appendix in supplemental material gives a full description of the morphological features used and also the scores per language in relation to English.
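The regrouping of feature 77 can be made explicit with a small sketch. It only illustrates the binary regrouping described above, not the full morphological distance computation, which is documented in the supplemental material. The group labels and function name are hypothetical; the value 1 for English is taken from the text.

```python
# WALS feature 77 values: 1 = no grammatical evidentials,
# 2 = indirect only, 3 = direct and indirect.
OLD_GROUPS = {1: "A", 2: "B", 3: "B"}  # earlier work on Dutch: 1 vs. 2 and 3
NEW_GROUPS = {1: "A", 2: "A", 3: "B"}  # present study: 1 and 2 vs. 3
ENGLISH_VALUE = 1                      # English has feature value 1


def differs_from_english(value, groups):
    """True if a language with this feature value falls into a different
    group than English under the given grouping."""
    return groups[value] != groups[ENGLISH_VALUE]


for value in (1, 2, 3):
    print(value,
          differs_from_english(value, OLD_GROUPS),
          differs_from_english(value, NEW_GROUPS))
```

Under the old grouping, a language with value 2 counts as different from English on this feature; under the new grouping, only languages with value 3 do.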
Table 1. Means and standard deviations (range) of the TOEFL internet-based test (iBT) over 11 years in the period 2005–20.
Notes. n = 1,425, 127 countries, 65 native languages for the variables in our study.
i Phonological distance (PHON)
This is an asymmetric measure counting the number of new phonological features in a target language based on complete sound and feature inventories (Schepens et al., 2020: 4–7). The phonological sound and feature inventories from PHOIBLE (Moran and McCloy, 2019) were used. We computed distances to English for the 69 languages for which PHOIBLE lists a phoneme inventory (M = 13.87, SD = 3.90).
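Counting the features that are new to a learner amounts to a set difference between the target and the source feature inventories. The sketch below uses invented toy inventories rather than PHOIBLE data, and the function name is hypothetical.

```python
def phonological_distance(source_features, target_features):
    """Number of target-language features absent from the source language."""
    return len(set(target_features) - set(source_features))


# Invented toy inventories, for illustration only.
toy_source = {"voiced", "nasal", "labial"}
toy_target = {"voiced", "nasal", "dental", "lateral"}

print(phonological_distance(toy_source, toy_target))  # source -> target
print(phonological_distance(toy_target, toy_source))  # target -> source
```

The two calls generally give different counts, which is what makes the measure asymmetric: the number of features a learner has to add depends on the direction of learning.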
4 Description of the data
Since the introduction of the TOEFL iBT test in 2005, ETS has regrettably abstained from reporting numerical information on gender, age, and educational degree. (Almost) annually, mean and SD scores for male and female students as well as for undergraduate, graduate, and other students are reported. However, without actual information on the number of students involved, it is impossible to decide if reported scores differ in a statistically significant way. Despite the scarce information provided by ETS, we can provide in Table 1 a general picture of the longitudinal database being used in relation to the variables that are part of our study.
Figure 1 contains the 11 box plots of the country scores for the years 2005–20. As noted above, the data for 2006, 2007, 2009, 2011, and 2012 are missing. The general picture that emerges from Figure 1 is that the mean scores fluctuate but show an overall increase over the years.

Figure 1. Box plots of the TOEFL country scores over the years 2005–20.
5 Statistical approach
To answer our research questions, we applied GAMM by using the ‘mgcv’ package (Wood, 2017) in R (R Core Team, 2018). A major reason for preferring GAMM over, for example, linear mixed models fitted with ‘lmer’ (from the ‘lme4’ package) is that GAMM can model complex non-linear relationships by applying penalized splines. As relationships are seldom linear, the advantage of using GAMM is self-evident. For a detailed description of the mathematical background of GAMM we refer to Wood (2017) and to the tutorials of, for example, Baayen and Linke (2020), Sóskuthy (2017), Van Rij (2015), Van Rij et al. (2022), and Wieling (2018).
6 Model selection strategy
A detailed description on the model selection strategy we used to arrive at the final model, m4.7AR, can be found in supplemental material, including the R code.
We started with a base model, model m0, containing only random effects for first language (L1) and country of origin. Adding year of testing in model m1 yielded a significant effect. Next, additional covariates and their interactions with year of testing were included in subsequent models m2.1 to m3.6. After that, we included for each year of testing the earlier enrollment scores per country. We traced the enrollment scores back over 16 years, so that time lags ranging from 0 to 15 years were tested. Based on comparisons of the Akaike information criterion (AIC) and the Bayesian information criterion (BIC) by means of the relative likelihood test (Burnham and Anderson, 2002), we concluded that model m4.7, representing a time lag of 7 years, was to be preferred. Model m4.7 was 18 (AIC) or 30 (BIC) times more likely than the next best fitting model, m4.6, representing a time lag of 6 years. Since we use a repeated measures design (11 years of testing), we tested whether autocorrelation was present. This was indeed the case (rho = .273). We therefore accounted for autocorrelation in model m4.7AR, after which autocorrelation was virtually absent. As a final step, we reran the analyses after excluding observations with absolute standardized residuals larger than 3.0. The results of this final step did not differ substantially from the outcomes of model m4.7AR, so we decided to keep model m4.7AR as our final model.
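Two of the quantities used in this selection procedure can be illustrated with a short sketch in Python. The first function computes the relative likelihood (evidence ratio) of two models from their information criterion values, exp((IC_other - IC_best) / 2), following Burnham and Anderson (2002). The second computes a simple lag-1 autocorrelation of a residual series. Both function names are hypothetical, the input values are invented, and the second function assumes a single, equally spaced series, which is a simplification of the actual data (11 unevenly spaced test years per country).

```python
import math


def relative_likelihood(ic_best, ic_other):
    """How many times more likely the model with the lower information
    criterion (AIC or BIC) is than the other model."""
    return math.exp((ic_other - ic_best) / 2.0)


def lag1_autocorrelation(residuals):
    """Lag-1 autocorrelation of one equally spaced residual series."""
    mean = sum(residuals) / len(residuals)
    deviations = [r - mean for r in residuals]
    denominator = sum(d * d for d in deviations)
    numerator = sum(a * b for a, b in zip(deviations, deviations[1:]))
    return numerator / denominator


# An AIC difference of 2 * ln(18), about 5.78 points, corresponds to a
# model being 18 times more likely than its competitor.
delta = 2 * math.log(18)
print(round(delta, 2), round(relative_likelihood(0.0, delta), 1))

# Invented residual series with strongly alternating signs.
print(lag1_autocorrelation([1.0, -1.0, 1.0, -1.0]))
```

Read in the other direction, the reported ratios of 18 (AIC) and 30 (BIC) imply differences of roughly 5.8 and 6.8 points between models m4.7 and m4.6 on the respective criteria.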
7 Model criticism
Several checks were conducted to assess the robustness of our model m4.7AR. We used gam.check(), which provides various diagnostics such as residual plots, deviance residuals, and QQ-plots (see Figure S2 in supplemental material). The residuals are approximately normally distributed, except for some extremes in the upper and lower tails. As noted, we reran the final analysis with the most extreme residuals excluded, with no substantially different outcomes. The basis dimension (k) appeared to be too low in some cases relative to the edf, since the associated p-values were significant. However, increasing the number of knots did not result in non-significant p-values. Because the differences between the edfs and ks were quite large, we kept the number of knots as it was.
As a final step, we assessed concurvity and found the values to be within the range of statistically acceptable values, except for the pairs formed by the three linguistic distance measures with the random effects L1 and country of origin, which were all 1.000 (see Figure S3). Such high concurvity values between random effects and covariates are found quite regularly and appear not to be well understood. On https://stats.stackexchange.com, for example, several researchers reported this issue and asked how to deal with it; so far, their threads have not received an answer. One reason might be that the random effects (L1 and Country) and the three linguistic covariates refer to the same phenomena: each L1 has a single value on each distance measure, so the distance covariates are fully determined by the L1 factor. According to Clark (2022), referring to Wood (2008), GAMM estimation procedures have been developed in such a way that they can handle such issues and that one can have confidence in the results even in the presence of concurvity. We were reassured by the stability of the outcomes given that model m4.7AR – a rather complex model in our opinion – converged in 14 iterations.
III Results
In Table 2, the outcomes of final model m4.7AR that accommodates for autocorrelation are presented. The intercept gives the overall average, 83.76, taking all other parameters into account, and is quite near to the original raw average in Table 1, 81.65.
Generalized additive mixed regression model (GAMM) parameter estimations of the TOEFL internet-based test (iBT) for the final model, m4.7AR, per first language.
Notes. n = 1,425, 127 countries, 65 native languages for the variables in our study. *p < .05. **p < .01. ***p < .001.
All parameters together are successful in predicting the country scores, given the R2 value of .93. This impressive amount of explained variance is partly accounted for by the random intercepts of L1 and country, as reflected by the outcomes of model m0 (R2 = .88). The contribution of the fixed variables can be evaluated by the outcomes of model m4.7AR_L1C, which is identical to model m4.7AR but without the random intercepts of L1 and country. This model has an explained variance of 65.5%.
The parametric coefficient relates to Status of English and is significant. The other main effects are continuous variables and are listed under the smooth terms. Most of these main effects are significant, and all of them are significant in interaction with Year. An important conclusion is that all three linguistic distance measures contribute in their own way to explaining the countries’ proficiency scores.
The effective degrees of freedom (edf) give an indication of the non-linearity of the smooth term. If an edf is (close to) one, it indicates that the parameter of interest is (almost) linear. This applies to lexical and phonological distance, and to enrollment. The larger the edf is, the more wiggly the parameter is, and this applies to the majority of the terms. As the numerical outcomes of GAMs are difficult to interpret due to this non-linearity, we will visualize the outcomes in figures.
The last non-linguistic explanatory variable is enrollment, with a time lag of seven years. Both its main effect and its interaction with year of testing are significant. The effects are visualized in Figure 2. The tensor in the right part of Figure 2 reflects the values of three variables in three dimensions. The vertical dimension is the TOEFL score. Higher values mean higher scores. The plane is defined by crossing year of testing and the enrollment score of a country. The main effect of enrollment is linear and a higher enrollment means a higher TOEFL score. The 3D plot connects the TOEFL scores per year and shows some shifting in the color pattern. The blue area indicates lower scores and yellow the area with higher scores. The slight shifts in the color pattern mark a significant interaction, but they are not systematic enough for a clear interpretation. The above non-linguistic effects (enrollment, year of testing, status of English) were incorporated in the statistical model to control for their effects in estimating the three linguistic distance effects. The lexical and morphological distances have a significant main effect, but all three distance measures have a significant interaction with year of testing. We visualized these three distance effects in the same way as enrollment.

Main effect of enrollment (left panel) and the 3D plot of the tensor product interaction of enrollment with year of testing (right panel).
Figure 3 visualizes the effects for lexical distance. The left panel exemplifies a straightforward linear effect. A higher distance returns a lower TOEFL score. The dots make clear that there is variation between the countries, variation that may be related to other explanatory variables. The 3D plot in the right panel shows an interesting pattern. Year of testing runs from 2005 in the front to 2020 in the back. Two shifts are visible. The range in scores shrinks (the vertical axis) over time and the blue area is much larger in the initial years. The conclusion is that the influence of lexical distance reduces in the course of time.

Main effect of lexical distance (left panel) and the 3D plot of the tensor product interaction of lexical distance with year of testing (right panel).
The effects of morphological distance are visualized in Figure 4 and exhibit a pattern different from lexical distance. The main effect in the left panel is non-linear, with lower scores particularly for languages with a higher morphological distance, but the scores increase with a lower morphological distance. The influence of morphological distance increases over time, with a wider range in predicted TOEFL scores in the middle years indicated by the size of the blue area.

Main effect of morphological distance (left panel) and the 3D plot of the tensor product interaction of morphological distance with year of testing (right panel).
Finally, Figure 5 visualizes the effects of phonological distance. In the initial years the blue area dominates the outcomes, with a rather irregular pattern, in fact indicating that phonological distance is irrelevant. The pattern becomes more regular over time and in the later years of testing a more obvious relationship evolves between distance and TOEFL proficiency. It implies that the effect of phonological distance increases over time, as for morphological distance. However, in the beginning any phonological distance has a similar negative effect on the TOEFL scores.

Main effect of phonological distance (left panel) and the 3D plot of the tensor product interaction of phonological distance with year of testing (right panel).
IV Discussion
In the present study, we analysed aggregated TOEFL scores from over 15 million non-immersion learners representing 127 countries and 65 first languages (L1s), spanning a 16-year period (2005–20). This extensive dataset enabled a comprehensive investigation of the influence of country-level characteristics – including linguistic distance measures – on English language proficiency. To capture complex, non-linear relationships and temporal dynamics, we employed a Generalized Additive Mixed Model (GAMM). This approach allowed us to examine intricate interaction effects, with particular emphasis on three linguistic distance metrics, and to trace the evolving impact of morphological distance over time. To disentangle linguistic from non-linguistic influences, we incorporated two key country-level variables into the model: gross secondary school enrollment and the sociolinguistic status of English. These controls facilitated a more nuanced understanding of how structural and educational factors interact with linguistic distance in shaping second language acquisition outcomes.
Although the sample of TOEFL examinees does not represent the entire population of L2 English speakers within each country, the test’s design – targeting a pre-academic population – ensures a degree of comparability across national samples. This comparability is further supported by the observed stability of country-level scores over the 16-year period, suggesting minimal fluctuation in sample composition. The most plausible interpretation is that, across countries, individuals with similar educational backgrounds and academic aspirations consistently apply for the TOEFL exam. Consequently, while the findings may not generalize to all L2 English speakers, they are robust and valid for evaluating hypotheses concerning the effects of linguistic distance within this specific, globally comparable population.
As for hypothesis 1, we can conclude that our outcomes are in line with it. All three distances contribute to explaining L2 English country scores. Morphological distance delivers its own, distinctive contribution, next to and in competition with the lexical and phonological distances. Although the overall relationship between linguistic distance and TOEFL scores is non-linear for morphology and phonology, the three distances share a monotonically decreasing effect (hypothesis 2): larger linguistic distances correlate with lower L2 language proficiency scores. Morphological distance has its own decreasing pattern, which decreases more rapidly for larger distance values. Phonological distance shows this decreasing pattern only in the later years of testing (2013–20). Changing patterns reveal the relevance of including the time dimension (year of testing).
All three distances interact with year of testing (hypothesis 3). The effect range of lexical distance decreases, whereas the effect range of morphological distance increases and phonological distance becomes more regular over time. These changes may reflect the increasing presence and dominance of L2 English worldwide, but it shows, most of all, that the effect of distance is dynamic. More structural properties become more important the higher the L2 proficiency is, while the importance of lexical distance gradually decreases. Phonological distance showed a wiggly pattern until 2013, with blue scores, indicating overall lower proficiency scores, compared to later years. This requires further investigation.
We have carried out additional analyses to substantiate our interpretation. We used the TOEFL’s L1 Total scores instead of TOEFL’s Country Total scores as the dependent or criterion variable. Such a choice has the disadvantage that we can no longer use country variables. Another disadvantage is that the number of data rows reduces from 127 countries to 65 L1s. One might, given the differences in the number of students per country mentioned in the introduction, speculate that taking the TOEFL L1 scores as the criterion variable would produce deviant outcomes (see Van der Slik and Van Hout, 2025). However, distance is still a significant predictor, including morphological distance (see Van der Slik and Van Hout, 2025, more specifically ‘2_Models_L1.html’). It is reassuring, moreover, that the correlation between the TOEFL L1 scores and the L1 scores we obtained, aggregated over time, is almost perfect (r = .97, p < .001) (see Figure S4 in supplemental material). We can safely conclude that our use of the TOEFL country L1 scores did not bias the analysis with respect to the TOEFL L1 scores.
To support the validity of using TOEFL country-level scores for cross-national comparisons, we examined their correlation with Education First (EF) Standard English Test (SET) scores (Education First, 2021). Specifically, we compared scores from 2020 and 2021 for countries represented in both datasets. Despite differences in the target populations – EF SET participants are typically self-selected individuals who voluntarily take a low-stakes, free test, whereas TOEFL examinees are often high-stakes test takers seeking admission to English-medium instruction (EMI) universities – a strong positive correlation was observed (r = .809, p < .001). TOEFL is explicitly designed to test communicative competence in an academic setting, and the EF SET test was specifically designed as an inexpensive alternative to TOEFL. This high correlation, in light of the tests’ parallel aims (i.e. assessing communicative competence in English), supports the congruent validity of both instruments. Bolton and Bacon-Shone (2020) performed a similar analysis on EF and TOEFL country scores for a restricted subset of 20 Asian countries. We found the two rank orders of this limited subset to have a significant correlation of .57 (p < .01). This correlation is substantial and it casts serious doubts on the authors’ conclusion that the EF test is not valid at all in evaluating differences between countries. Had they used the wide range of all available mean country scores, the correlation would have been higher.
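The last sentence can be read as a range-restriction argument: when only a narrow band of scores is compared, the same underlying relationship yields a lower correlation. The Python sketch below illustrates this with purely synthetic, deterministic values. It does not use TOEFL or EF data, and it uses the Pearson coefficient rather than a rank correlation. The function name, grid size, disturbance of 0.2, and band limits are all hypothetical choices.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson product-moment correlation of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Synthetic "test A" scores on an even grid from 0 to 1.
n = 201
test_a = [i / (n - 1) for i in range(n)]
# Synthetic "test B" scores: same signal plus an alternating disturbance.
test_b = [x + (0.2 if i % 2 == 0 else -0.2) for i, x in enumerate(test_a)]

r_full = pearson(test_a, test_b)

# Keep only cases in a narrow middle band of test A (restricted range).
band = [(x, y) for x, y in zip(test_a, test_b) if 0.4 <= x <= 0.6]
r_restricted = pearson([x for x, _ in band], [y for _, y in band])
```

With the disturbance held constant, narrowing the range of the first score lowers the correlation. Whether the 20-country subset is restricted in this sense is an empirical matter that the sketch does not address.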
Another finding that underlines the usefulness of the TOEFL data is the size of the time lag we found for the enrollment variable. A time lag of seven years means that we use the enrollment data of seven years earlier to predict the TOEFL country scores. We would have preferred to use the Human Capital Index (Corral et al., 2021; Kraay, 2019) as it measures the expected future human capital of a child born today, but this measure has become available just recently for the years 2010, 2017, 2018, and 2020 only (World Bank, 2025). We used ‘gross enrollment in secondary schooling’ (World Bank, 2022). A lag of seven years makes sense as it closely matches the recurrent educational stages of six years. UNICEF (see the website of unicef.org) concludes that standards vary, but that primary education is typically designed for the ages 6 to 11 years and secondary education for 12 to 17 years of age. The pre-school period is six years as well.
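Constructing a lagged covariate of this kind amounts to pairing each test year with the enrollment value recorded a fixed number of years earlier. The Python sketch below shows one way to do this, assuming enrollment is stored per country and calendar year. It is not the authors' code; the function name, country labels, and enrollment values are hypothetical.

```python
from typing import Dict, Iterable, Optional, Tuple


def lag_enrollment(
    enrollment: Dict[str, Dict[int, float]],
    test_years: Iterable[int],
    lag: int,
) -> Dict[Tuple[str, int], Optional[float]]:
    """Map (country, test year) to the enrollment value `lag` years earlier.

    Missing source years yield None, so gaps stay visible instead of
    being filled in silently.
    """
    years = list(test_years)
    lagged: Dict[Tuple[str, int], Optional[float]] = {}
    for country, by_year in enrollment.items():
        for year in years:
            lagged[(country, year)] = by_year.get(year - lag)
    return lagged


# Hypothetical gross secondary enrollment percentages for two countries.
enrollment = {
    "X": {1998: 70.0, 1999: 72.0, 2000: 75.0},
    "Y": {1998: 40.0, 2000: 45.0},  # 1999 is missing on purpose
}

lagged = lag_enrollment(enrollment, range(2005, 2008), lag=7)
```

With a lag of seven, the 2005 test year is paired with the 1998 enrollment value, and a test year whose source year is missing remains empty. Repeating the call for lags 0 to 15 would produce the candidate covariates that were compared in the model selection.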
Our study can be qualified as exploratory. It would be interesting to have more groups of L2 English learners within countries to investigate their L2 English proficiency and to investigate controlled samples of learners both within and between countries. We presume that at all proficiency levels linguistic distance effects will play a role, but we need to test these effects, of course. Hopefully, more big data samples will become available, with data on individual learners. If so, they certainly can contribute to testing the many variables that play a role in second language learning. More controlled studies are necessary to overcome the exploratory level of big data studies. For our own study it was crucial that the learner samples between countries were comparable and not what these samples say about other groups of L2 English learners within countries.
In Section II we explained how we handled the L1s of the countries involved. It is a limitation that many languages were not included in our analysis. This is a consequence of the way ETS reports TOEFL test results, which aims at providing information to test-takers rather than at supporting empirical investigations of specific research questions. We inspected our data for country outliers but did not observe any. Nevertheless, the analyses would have benefited from more detailed data on the language background of the language learners.
A potential parallel can be drawn between our findings on linguistic distance and research on borrowing and code-mixing. In these domains, the interaction between language systems is shaped by their structural properties and relative dominance. A key insight from the literature is the borrowability hierarchy, which posits that nouns are more readily borrowed than verbs, and that content words are more frequently borrowed than function words (Matras, 2007; Muysken, 2000). This hierarchy reflects a continuum of word classes – from open classes such as nouns and verbs to closed classes like prepositions and pronouns – each occupying a distinct position within the structural framework of a language.
Borrowing structural elements requires a context in which both languages integrate their grammatical and phonological systems (Verheijen and Van Hout, 2022). Haspelmath (2009) further distinguishes between lexical borrowing and imposition, based on whether structural features are adopted by native speakers from a dominant language or retained by non-native speakers from their L1 during language shift. These processes underscore the role of structural distance and language dominance in shaping linguistic outcomes. While we do not claim a direct equivalence between borrowing and SLA, the relevant connection lies in the differential embedding of language components. Certain linguistic features – such as morphology and phonology – may be more structurally entrenched, leading to distinct patterns of acquisition and transfer. These patterns are crucial for understanding the nuanced role of linguistic distance in SLA. A dynamic manifestation of this phenomenon is observed in cross-linguistic influence in bilingual speech, where the extent of influence correlates with the dominance of one language over the other (Van Dijk et al., 2022). This influence operates at the individual learner level but also reflects broader shifts in linguistic competence as learning progresses.
The dominance of English has increased worldwide over the last two decades, as has the position of English in the educational tracks of the countries involved. Unfortunately, there is no detailed evidence on how vocabulary, morphology and phonology might be involved in cross-linguistic processes on an overall level, despite the availability of many studies on cross-linguistic influence examining specific language pairs and phenomena. We need more big data to explore these language learning processes. The most impressive change in our data was the role of morphological distance, whose weight increased, pointing to the prominent role of structural features in later stages of language learning. To advance this line of inquiry, further research is needed to explore how specific linguistic components interact during SLA. Another relevant research track is to refine and expand linguistic distance measures. For this contribution, we successfully defined distances for L2 English, based on the approach we developed for L2 Dutch. Future improvements will draw on emerging typological databases, which offer more granular insights into language structure and diversity.
Ginsburgh and Weber (2020) outline an economics of language model that incorporates the consequences of linguistic diversity in general. One of their conclusions is that we need linguistic distance measures. We made global linguistic distances concrete between a number of languages (L1s) and English (L2). Our measures work, but we are sure that they can be improved. That applies especially to our morphological and phonological distances.
A critical question arises regarding the underlying mechanisms driving the temporal effects we observed. Specifically, it remains unclear whether these changes are also attributable to evolving pedagogical practices, structural modifications to the TOEFL examination itself, or broader societal transformations – such as the proliferation of English-language media and resources. For instance, platforms like YouTube now provide learners of English with unprecedented access to authentic oral input, potentially enhancing listening and pronunciation skills more effectively than traditional classroom instruction. Contemporary learners often have access to mobile applications and digital tools that support pronunciation, vocabulary acquisition, and grammar practice. These resources offer flexible, individualized learning opportunities that were largely unavailable to previous generations, who typically relied on limited classroom instruction constrained by work schedules.
V Conclusions
The longitudinal database of country-level TOEFL scores (2005–20) enabled a detailed investigation into the effects of various country-level characteristics, including linguistic distances, and their evolution over time. We employed a Generalized Additive Mixed Model (GAMM) to uncover complex, non-linear interaction effects, with particular attention to three linguistic distance measures: morphological, lexical, and phonological. This modeling approach allowed us to trace the changing influence of morphological distance relative to the other two measures.
All three linguistic distance measures significantly contributed to explaining variation in L2 English proficiency across countries. Higher linguistic distances were consistently associated with lower TOEFL scores. Morphological distance emerged as a distinct and competitive predictor, providing explanatory power beyond that of lexical and phonological distances. Notably, the influence of morphological and phonological properties increased with higher levels of L2 proficiency, whereas the impact of lexical distance diminished over time.
To substantiate the validity of using TOEFL country scores for examining linguistic distance effects, we provided multiple lines of argumentation. While individual-level data would allow for more granular and definitive analyses – particularly regarding the unique role of morphological complexity – our findings demonstrate that country-level TOEFL scores offer a robust and informative proxy for investigating cross-linguistic influences on second language acquisition.
Supplemental Material
Supplemental material, sj-docx-1-slr-10.1177_02676583251389484 for Linguistic distance effects on country TOEFL scores between 2005 and 2020 and the effect of morphological distance in particular by Frans van der Slik and Roeland van Hout in Second Language Research
Footnotes
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Supplemental material
Supplemental material for this article is available online.
1.
The last time that ETS gave more detailed information about the examinees was in 2004, a year before the introduction of TOEFL iBT (internet-based test). A total of 521,082 examinees took the test, including 71,857 examinees who did not report their motivation to take the test. The remaining 449,225 were distributed as follows: (1) graduate 234,453 (52%); (2) undergraduate 168,670 (38%); (3) professional 28,929 (6%); (4) secondary school 17,173 (4%). That means that 90% of the examinees took the TOEFL test as pre-academic students. There is no reason to assume that these figures have drastically changed over the past two decades.
2.
Our use of aggregated data is far from unique as economists often rely on aggregated data in doing their macroeconomic analyses.
3.
See https://en.wikipedia.org/wiki/Test_of_English_as_a_Foreign_Language. The Global ‘English Proficiency Test Market’ was valued at USD 2.7 billion in 2021 and is expected to reach USD 15.26 Billion by 2030, growing at a CAGR of 8.90% during 2021–30 (
).
4.
This, of course, is a simplification. Not all test takers of a specific country will have the same L1. We think, however, that some noise, caused for example by students who have already migrated to an English-speaking country in anticipation of acceptance by a university or college with English as medium of instruction, is manageable and does not affect the outcomes in a systematic way. We would like to emphasize that we had to make our own decisions for a country’s L1 for the simple reason that ETS does not provide country data broken down by test takers’ L1s for external researchers (personal communication,
, 27-06-2014).
5.
The vast majority of Berber speakers are of Moroccan descent. Kurdish is spoken in several countries in the Middle East, such as Iran, Iraq, Syria, and Turkey, yet predominantly in Iraq and Turkey. We decided to choose Iraq for the following reasons. First, in Turkey, Kurdish is not recognized as an official language and is not taught at school. Second, Iraqi Kurdistan harbors two English-medium instructed universities, both requiring freshmen to have sufficient command of academic English, for example by means of a pre-decided criterion score on the TOEFL test.
6.
Except for Canada, other countries with English as an official language were excluded from further analyses.
7.
We aimed to use historical Human Capital Index (HCI) scores, but unfortunately such data are still largely unavailable.
References
