
Abstract

We present a corpus investigation of the influence of first language – second language (L1–L2) typological similarity on the acquisition of the L2 English article. We consider item-level typological similarity in terms of the availability of an article in the L1, but also broader typological similarity in terms of the linguistic distance between L1 and L2 as captured through a variety of lexical, morphosyntactic and phonological measures of linguistic distance. We analyse the accuracy of the use of the definite and indefinite English articles in around 0.5 million writings from learners with 11 typologically diverse L1s. The data are sampled from an open access English as a foreign language (EFL) corpus, EFCAMDAT. Our results indicate that L1 influence arises from a combination of item-level L1–L2 differences, that is, the availability of an article in the L1, as well as broader properties of the L1 grammar, as captured by linguistic distance measures. The results indicate that it is the availability of the definite article in the L1 that predicts article omission in L2 English, for both the definite and indefinite articles. This finding supports the generative typological distinction between determiner phrase (DP) and noun phrase (NP) languages, indicating that the availability of a definite article and a DP predicts the use of bare nominals in the L1 and consequently, article omission in L2 English.

Keywords

articles, crosslinguistic influence, linguistic distance

I Introduction

The similarity between learners’ first (i.e. native) language (L1) and second language (L2) is commonly seen as a central factor modulating the influence of L1 on L2. This basic observation goes back to contrastive analysis but has survived in all current theoretical approaches to crosslinguistic influence (Abbas et al., 2021; Ellis, 2006; Ionin and Montrul, 2010; Jarvis, 2000; Lago et al., 2021; Odlin, 1989; Schwartz and Sprouse, 1996; Westergaard, 2021). Importantly, similarity is crucial for demonstrating ‘crosslinguistic performance congruity’ (Jarvis and Pavlenko, 2008). This means that the source of L1 influence is related to identifiable properties of the L1 rather than other aspects of L2 learning or even cultural and educational factors that might correlate with an L1.

A common manifestation of crosslinguistic influence is transfer of an item (feature, form, structure) from L1 to L2. The more similar L1 and L2 are, the more likely transfer is (Jarvis and Pavlenko, 2008; Kellerman, 1983). Transfer can facilitate learning: the more L1–L2 similarities exist (e.g. in shared cognates, functional morphemes etc.), the more comprehensible the L2 input is for the learner (Kellerman, 1983), leading to faster learning. By contrast, when L1 and L2 have few similarities, learning will proceed more slowly, with learners producing errors and potentially avoiding challenging structures in their production (Schachter, 1974). L1–L2 similarity can involve specific structures or morphemes (e.g. articles, tense–aspect markers, question particles) or more abstract features (e.g. definiteness). Within the generative literature, similarity relates to the settings of linguistic parameters, which are related to abstract features capturing crosslinguistic variation. For example, the availability of the person (formal) feature broadly distinguishes languages with systematic agreement (Russian, German, Italian, Arabic etc.) from languages without agreement (Chinese, Japanese, Korean etc.) (Roberts, 2019).

There is substantial empirical evidence showing how the availability of similar or congruent items in L1 facilitates L2 acquisition (Chrabaszcz and Jiang, 2014; Jarvis and Pavlenko, 2008; Murakami and Alexopoulou, 2016). Recently, the increasing availability of big learner data from assessment and teaching institutions has enabled a focus on typological effects beyond individual L1s. Thus, Murakami and Alexopoulou (2016) considered the accuracy of six L2 English morphemes in the written component of Cambridge Assessment exams from the Cambridge Learner Corpus (Nicholls, 2003). They calculated accuracy in supplying the relevant morpheme in obligatory contexts in 11,893 exam scripts across proficiency levels from learners of seven L1s (German, French, Spanish, Korean, Japanese, Turkish, Russian). They showed that the availability of a congruent morpheme in L1 leads to higher accuracy in the use of the corresponding morpheme in L2 English, across several L1s and across proficiency levels.

Shifting the focus beyond individual features and items, Schepens et al. (2015, 2020) demonstrated compellingly that the linguistic distance between L1 and L2/3 Dutch predicts spoken proficiency scores in state examinations for proficiency in Dutch as an additional language (Ln). Using a variety of measures for linguistic distance (lexical, morphological, phonological), based on measures of similarities between L1 and Dutch regarding cognates, morphemes, and sounds, they showed that the larger the linguistic distance between L1 and Ln Dutch, the lower the learners’ proficiency scores. Importantly, this effect is significant even when other variables are controlled (e.g. years of education, quality of education in country of origin, age of arrival, length of residence, and gender). Their combined measures of linguistic distance capture 28%–69% of variance in Ln attainment scores.

Not only is linguistic distance predictive of scores in Dutch proficiency tests but van der Slik et al. (2017) found that it can also predict learning progress over time. Thus, test scores improve over time (i.e. as length of residence increases) for learners with L1s of equal or higher morphological complexity than Dutch. Strikingly, the scores of learners with L1s of lower morphological complexity worsen over time. Crucially, they used big data, specifically test scores obtained by thousands of immigrants taking the official test of Dutch as a second language (known as ‘STEX’). Such databases enable the investigation of a large number of typologically diverse languages. For example, Schepens et al. (2015) examined 56 different L1s so that the effect of linguistic distance can be reliably evaluated.

In this article, we build on the work of Murakami and Alexopoulou (2016) and Schepens et al. (2015, 2020) with the aim of understanding the potential interplay between similarity involving individual items/features and similarity involving broader typological features (as measured by linguistic distance measures). Schepens and colleagues established the effect of linguistic distance on broad acquisition outcomes like speaking proficiency, while Murakami and Alexopoulou (2016) showed that the availability of a congruent element in the L1 influences the accuracy of the corresponding L2 morpheme across a range of typologically diverse L1s (e.g. Turkish, Russian, Korean). The question that arises is whether the effect of linguistic distance on broad outcomes arises as the aggregate of individual L1–Ln similarities/differences, with broader typological features of the L1 playing no role in the acquisition of individual items. However, it is possible that broader typological properties of the L1 influence the acquisition of individual items, like L2 English morphemes, over and above the availability of a congruent item in the L1. A broader range of typological L1–L2 similarities (e.g. word order, agreement patterns etc.) will make input more comprehensible overall and, thus, (indirectly) facilitate the acquisition of individual features. Crucially, it has been known since Greenberg (1967) that the distribution of typological features across languages is not random. Rather, there are correlations and implicational relations in their distribution. For example, we generally find definite articles in systems with a count/mass distinction rather than in systems based on classifiers (Chierchia, 1999). Thus, we hypothesize that broader typological similarities between L1 and L2 will impact on the acquisition of individual items.

Within the generative second language acquisition (SLA) literature, this hypothesis is consistent with the full transfer hypothesis of L1 parameter settings to the initial state of L2 (Schwartz and Sprouse, 1996). To the extent that such broader typological correlations and implicational relations are captured by linguistic distance measures, it is reasonable to expect an effect of linguistic distance on the acquisition of individual features, beyond the availability of the individual relevant feature in L1.

We focus on the L2 acquisition of the English article as a case study to investigate if L1–L2 linguistic distance influences the acquisition of this individual morpheme, beyond the (known) facilitative effect of an article in the L1. Following up on Murakami and Alexopoulou (2016), we examine the accuracy learners show in their use of the definite and indefinite articles, specifically their accuracy of use in obligatory contexts for articles. We have sampled writings from learners from 11 L1 backgrounds from the EF Cambridge Open Language Database (EFCAMDAT) and look at the effect of L1 type (whether an article is available in the L1) and L1–L2 linguistic distance, using a variety of lexical, phonological and syntactic distance scores.

II Background

In the typological literature, linguistic distance is used to establish phylogenetic relations between languages (see reviews in Longobardi and Guardiano, 2009; Ruhlen, 1991; Trask, 2000). Here, we use linguistic distance as a measure of L1–L2 similarity and a way to operationalize L1–L2 congruency. The dominant approach to measuring linguistic distance relies on the number of cognates shared between a pair of languages and the degree of phonological and orthographic differences between word translation equivalents, as captured, for example, by Levenshtein distances (Schepens et al., 2011). Such lexico-statistical approaches have been successfully exploited for automated measurement of linguistic distance and have been complemented by approaches involving word distributions or n-grams (see Gamallo et al., 2017). There is a variety of measures available for lexical (e.g. Gray and Atkinson, 2003), morphosyntactic (e.g. Dunn et al., 2011), and phonological distance (e.g. Atkinson, 2011) as well as syntactic distance based on generative syntax parameters (Ceolin et al., 2020). These measures can support treelike models of language family relations (e.g. Cysouw, 2013; Schepens et al., 2013, 2015).

Adopted for SLA research, linguistic distance measures allow us to investigate L1 typological effects on broad outcomes, like L2 proficiency, and appreciate the relative contribution of different subdomains and features (lexical, phonological, morphological) to such outcomes. For instance, Schepens et al. (2020) show that when considering phonological distance, sub-categorical properties (i.e. phonological features) are stronger predictors of Ln proficiency scores than sound categories. Additionally, they showed that a combination of measures of lexical, morphological, and phonological distance is necessary to explain the variance in the proficiency scores. It is an open empirical question if similar effects of linguistic distance can be obtained for the acquisition of individual features or whether linguistic distance effects can only be detected on broad outcomes.

In this study we investigated the L2 English article. Articles are amongst the more difficult morphemes for L2 acquisition (DeKeyser, 2007) and show strong crosslinguistic influence (Ionin and Montrul, 2010; Murakami and Alexopoulou, 2016) in that the availability of an article in the L1 facilitates article acquisition in L2. However, while the effect of L1 type in terms of presence/absence of an article is clearly established, it is not possible at present to evaluate variation within each type of language; for example, among the article-less L1s, it is not known if Korean or Japanese learners might find L2 English articles more challenging than Russian learners; similarly, within L1s which have articles, it is not clear, for example, if Arabic-speaking learners might be less accurate than French or German learners of English. The descriptive data in Murakami (2016) presented in Figure 3 suggest some differences within each type of L1: for example, Japanese and Korean learners appear to have lower accuracy than Russian and Turkish learners at lower levels of proficiency; similarly, Spanish learners appear to have lower accuracy than German and French learners. However, it is not clear in Murakami and Alexopoulou (2016) what the effect of linguistic distance might be, if any.

Another aspect motivating the choice of the definite article is that, in comparison to other L2 morphemes, it is relatively straightforward to ascertain if a language (L1) has a definite article or not. By contrast, verbal morphemes tend to conflate agreement, tense, and aspect features in English making the correspondence between congruent items often less straightforward crosslinguistically. Nevertheless, languages lacking a definite article may have an indefinite article or a numeral which can be considered as congruent with the indefinite article (Dryer, 2013). The classification of languages could, therefore, be based on the definite article only or on both articles. Current syntactic accounts suggest that only the definite article is relevant for language classification, since it is the definite article that is associated with a distinct functional projection, the Determiner Phrase (DP) (Alexiadou et al., 2007). Lack of a DP predicts the use of bare nouns as arguments, for both definite and indefinite nominals and irrespective of whether a language has an indefinite article or numeral. Syntactic accounts, therefore, predict that learners from L1s lacking a DP will show high omission rates for both articles, definite and indefinite, even when the L1 has an indefinite article or numeral, because it is the absence of a DP that predicts bare nouns in argument positions. This prediction has been empirically confirmed (Ionin and Montrul, 2010). However, it is important to examine whether the absence of a definite article influences the acquisition of the indefinite article for a larger sample of languages and compare with possible linguistic distance effects.

III The current study

The goal of the present study is to examine further the role of L1–L2 typological similarity in crosslinguistic influence focusing on the acquisition of the articles in L2 English. Our research questions are summarized below:

Research question 1: Is learner accuracy in the use of the L2 English definite and indefinite articles linked to:

(a) the availability of a congruent item in the L1 (i.e. definite and indefinite articles)?

(b) solely the availability of a definite article in the L1?

(c) the linguistic distance between the L1 and L2 English?

(d) the learner’s proficiency?

Research question 2: Does the L1 impact on the acquisition of the definite and indefinite article similarly, i.e. does the availability of a definite/indefinite article and linguistic distance have a similar impact on both articles?

We follow current syntactic assumptions¹ and hypothesize that the availability of a DP is the crucial L1 aspect that influences the acquisition of both articles, irrespective of the availability of an indefinite article/numeral; we therefore analyse both definite and indefinite articles to investigate this hypothesis. Specifically, in light of previous empirical studies (e.g. Murakami and Alexopoulou, 2016) we predict that learners whose L1 has a definite article will show higher accuracy in their use of both definite and indefinite articles, in comparison to learners whose L1s do not. We further expect an additional effect of linguistic distance, specifically that L1–L2 similarity will correlate with higher accuracy scores. We also expect that all learners will improve their accuracy with proficiency; in other words, proficiency will be a strong predictor of accuracy.

We examined the accuracy in the use of articles in the writings of learners of English as a foreign language (EFL) from a teaching environment. We sampled writings from 11 L1s: German, French, Italian, Brazilian Portuguese, Spanish, Arabic, Russian, Turkish, Chinese, Japanese and Korean. This sample allowed us to consider a typologically diverse set of L1s, five of which lack a definite article (Russian, Turkish, Japanese, Korean, Chinese), while six do have a definite article (German, French, Italian, Brazilian Portuguese, Spanish, Arabic).

IV Method

1 The EFCAMDAT corpus

We sampled the EF Cambridge Open Language Database (EFCAMDAT) (Alexopoulou et al., 2017), an open access corpus of English as a foreign language which is available at https://ef-lab.mmll.cam.ac.uk/EFCAMDAT.html. EFCAMDAT consists of learner writings submitted to Englishtown, the online school of EF (Education First), an international school of English as a foreign language. The data collection was completed in 2013. The Englishtown curriculum offered in 2013 was organized in 16 teaching levels, each with eight teaching units or lessons. At the end of each lesson there was an open-ended writing task, which learners needed to complete and submit to Englishtown in order to progress to the next unit. Each writing was corrected and graded by an EF teacher. Learners could move to the next unit if they received a satisfactory grade, otherwise they would repeat the writing task. While writing their answer, learners may have consulted the preceding lesson as well as a model answer that accompanies each writing prompt (or any other external resource).

There is a total of 128 writing tasks across the 16 EF teaching levels. The curriculum is strongly communicative, and the writing prompts include a variety of descriptive, narrative and argumentative tasks, such as writing a review for a restaurant, completing a story, writing an email to a colleague, contributing to a forum discussion, giving a set of instructions on how to play a game etc. Each writing task specifies its length, ranging from 20–40 words at Level 1 to 150–180 at Level 16. The curriculum is standardized, so all learners across different countries complete the same writing tasks.

Around 66% of the EFCAMDAT writings contain EF teachers’ corrections for spelling, grammatical errors, lexical choices etc. Each writing is associated with a randomly generated identification number for the learner submitting the writing, the EF teaching level, a topic-id code corresponding to each one of the 128 writing prompts, national language (NL) and date of submission. National language (NL) captures the nationality of the learner and country of residence. Thus, a Japanese national accessing Englishtown from Brazil is not included in the corpus. NL has been used as a proxy of L1 (Murakami and Alexopoulou, 2016). As an L1 proxy, it is imperfect as it does not capture learners’ multilingualism (e.g. a German learner who is also a heritage speaker of Turkish). Nevertheless, NL has been shown to be a reliable proxy for L1 in previous research (Murakami and Alexopoulou, 2016).

In this study, we used a subset of EFCAMDAT consisting of 527,758 writings (henceforth, scripts) written by 104,541 learners, totalling around 34 million word tokens (for the total number of learners and writings in each teaching level and NL group, see Appendix 1 in supplemental material). To construct this subcorpus we selected only the scripts that had teacher corrections, so that, for each script, we have both the student’s original writing and the teacher-corrected version.

Our teacher-corrected subcorpus was then cleaned following the process described in Shatz (2020). Appendix 2 in supplemental material details the cleaning process. Following the data cleaning, we annotated the learners’ original and teacher-corrected writings with part-of-speech (POS) tags using the Penn Treebank tagset (Marcus et al., 1993). The part-of-speech tags were used to identify the noun phrases and definite and indefinite articles in learners’ original and teacher-corrected writings. To identify the obligatory contexts for the articles we used the parsed version of the error-corrected texts and targeted article-related part-of-speech tags. To target part-of-speech tags, we developed R scripts using the dplyr package version 1.0.5 (Wickham et al., 2022) and the stringr package (Wickham, 2022).

2 Target L1 groups and proficiency

We included 11 NL groups: Korean, Chinese, German, Brazilian Portuguese, Russian, (Saudi) Arabic, Spanish, Turkish, French, Italian, and Japanese. As mentioned earlier, NL is a proxy for L1, crossing nationality with country of residence. Thus, NL Chinese learners included those living in Mainland China and Taiwan; for these learners we assume Mandarin Chinese as the relevant L1. NL Arabic is the national language of learners from Saudi Arabia while NL Spanish is the L1 of learners from Mexico and Spain. These NL groups were selected to enable comparisons of typologically diverse languages, while ensuring sufficient numbers of teacher-corrected writings. Table 1 presents the general properties of the cleaned subcorpus, which is available at https://ef-lab.mmll.cam.ac.uk/EFCAMDAT.html.

Table 1.

Main properties of the subcorpus.

National language groups (nationalities)	Korean, Mandarin Chinese, German, Brazilian Portuguese, Russian, (Saudi) Arabic, Spanish, Turkish, French, Italian, Japanese
Target language	English
CEFR levels	A1–B2
Total number of writings	527,758
Total number of learners	104,541
Mean number of writings per learner	8.33 (11.7)
Number of words (tokens) in the corpus	34,037,354
Mean number of words per writing	64.49 (33.37)

Note. Standard deviations are in parentheses.

Regarding proficiency, we sampled from the Englishtown levels 1–3, 4–6, 7–9, and 10–12, which correspond to the Common European Framework of Reference (CEFR) levels A1, A2, B1, and B2 respectively (we excluded higher levels corresponding to C1 because of a significant decrease in the number of writings). Appendix 1 in supplemental material provides further information on the distribution of learners and error-corrected writings across NL groups and CEFR levels.
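The level banding described above can be written as a small lookup. The following Python sketch is purely illustrative: the function name `cefr_band` is hypothetical and is not part of the EFCAMDAT tooling or of the R scripts used in the study.

```python
from typing import Optional


def cefr_band(ef_level: int) -> Optional[str]:
    """Map an Englishtown teaching level (1-16) to the CEFR band used here.

    Levels above 12 were excluded from the sample, so they map to None.
    """
    if not 1 <= ef_level <= 16:
        raise ValueError(f"EF level must be between 1 and 16, got {ef_level}")
    bands = ["A1", "A2", "B1", "B2"]
    index = (ef_level - 1) // 3  # levels 1-3 -> 0, 4-6 -> 1, 7-9 -> 2, 10-12 -> 3
    return bands[index] if index < len(bands) else None
```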

3 Accuracy measure and data extraction

To measure learners’ article accuracy in obligatory contexts, we calculated the ratio between correct uses and obligatory contexts (see also Murakami, 2016). This accuracy measure is conceptually equivalent to suppliance in obligatory contexts scores (Pica, 1983). To obtain the number of correct uses, we exploited the EFCAMDAT teacher-corrected writings. As mentioned earlier, in our cleaned subcorpus, we have two versions of each learner writing: the raw text submitted by the learner to Englishtown and a version corrected by the teacher. We operationalized as obligatory contexts all the contexts where a definite or indefinite article was used in the teacher-corrected version of a script. We then compared the teacher-corrected scripts with the original scripts submitted by the learners in order to identify omission and substitution errors. Whenever the original and corrected scripts matched, an instance of correct article use was recorded. In case of a mismatch, two types of discrepancies were noted: (1) a missing article in a learner’s text where the teacher used an article, and (2) an incorrect article in a learner’s text where the teacher used a different article (‘a’ instead of ‘the’ or vice versa). Each discrepancy was coded as an error, and the error type in each case was coded as omission (1) or substitution (2). The examples in Table 2 illustrate this process. (For an example of a full original annotated script, see Appendix 3 in supplemental material.)

Table 2.

Examples of obligatory context and error identification.

Original	Teacher-corrected	Number of obligatory contexts	Errors
In my house there is swimming pool. There is a tennis court	In my house there is a swimming pool. There is a tennis court.	2	1 omission
Sometimes I go on the business trip.	Sometimes I go on a business trip.	1	1 substitution
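The comparison illustrated in Table 2 can be approximated with a minimal Python sketch. It is not the R implementation used in the study: instead of POS tags it aligns the two versions with a token-level diff from the standard library (`difflib`), and it assumes that ‘a’ and ‘an’ count as the same (indefinite) article. The names `ARTICLE_TYPE`, `tokenize` and `score_articles` are hypothetical.

```python
import re
from difflib import SequenceMatcher

# Assumption: 'a' and 'an' are treated as the same (indefinite) article.
ARTICLE_TYPE = {"a": "indefinite", "an": "indefinite", "the": "definite"}


def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def score_articles(original, corrected):
    """Count obligatory contexts, correct uses, omissions and substitutions."""
    orig, corr = tokenize(original), tokenize(corrected)
    counts = {"contexts": 0, "correct": 0, "omission": 0, "substitution": 0}
    matcher = SequenceMatcher(a=orig, b=corr, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Articles the learner wrote in the span aligned with this corrected span.
        learner_types = [ARTICLE_TYPE[t] for t in orig[i1:i2] if t in ARTICLE_TYPE]
        for token in corr[j1:j2]:
            if token not in ARTICLE_TYPE:
                continue
            counts["contexts"] += 1  # every article in the corrected text
            if tag == "equal":
                counts["correct"] += 1
            elif not learner_types:
                counts["omission"] += 1  # teacher added an article
            elif learner_types.pop(0) == ARTICLE_TYPE[token]:
                counts["correct"] += 1  # same article type, e.g. a/an
            else:
                counts["substitution"] += 1  # 'a' for 'the' or vice versa
    return counts
```

Applied to the two rows of Table 2, this sketch yields two obligatory contexts with one omission for the first row, and one obligatory context with one substitution for the second. The accuracy score for a script is then the number of correct uses divided by the number of obligatory contexts.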

Regrettably, we could not include article overuse errors, such as using ‘a’ or ‘the’ where no article is needed (e.g. ‘drink a milk’). We excluded such errors from our analysis mainly due to the technical difficulty in the automated identification of obligatory contexts for no article. To identify obligatory contexts in the teacher-corrected scripts we used the article part-of-speech labels; bare nominals are not associated with a distinct part-of-speech label, which means that their identification using part-of-speech labels is less straightforward. Thus, our accuracy score captures suppliance in obligatory contexts (SOC) and not target-like use (TLU).

To identify the obligatory contexts for the articles, we targeted article-related POS labels in the error-corrected scripts. We developed R scripts to define obligatory contexts in the error-corrected text, compare corrected and original writings to identify errors and error types and count each type of error in error-corrected writings.

Our measure of accuracy relies on the accuracy of the teacher corrections as well as the accuracy of the R scripts we developed to retrieve learner errors and measure SOC scores. We therefore evaluated the accuracy of the teacher corrections and our scripts against a manually annotated gold standard. Appendix 4 in supplemental material explains the manual error annotation procedure in detail. The results of the error annotation indicate high reliability. For the manual annotation, first, a trained annotator and the third author analysed the same 55 scripts and reached an inter-annotator agreement of over 97%. Then they analysed 55 scripts each, identifying a total of 370 errors in the use of articles. Overall, they reached an agreement of over 88% with original teachers’ error annotation.

4 Data analysis and variables

Considering that our focus – learners’ accuracy in article use – was based on the proportion of correct article use in each writing sample in relation to the number of obligatory article contexts, we adopted generalized linear mixed models with a binomial distribution and logit link function. The outcome variable was accuracy in each obligatory context (1 if article is used correctly; 0 if it is omitted or substituted). This modelling approach allowed us to quantify the proportion of correct article use in each writing sample in relation to the number of obligatory article contexts. We constructed two models: first, we investigated whether the availability of definite and indefinite articles in learners’ L1s, along with proficiency, predicts their accuracy in article use. Second, we examined whether L1–L2 English linguistic distance impacts article accuracy.
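For orientation, a model of this general form can be written out as follows. This is a sketch under stated assumptions rather than the exact specification fitted in the study: the random intercept per learner and the particular fixed effects shown are illustrative choices, and interactions are omitted. Here $y_{ij}$ is the outcome (1 = correct, 0 = omitted or substituted) for obligatory context $i$ produced by learner $j$.

```latex
\begin{aligned}
y_{ij} &\sim \mathrm{Bernoulli}(p_{ij}), \\
\operatorname{logit}(p_{ij}) &= \log\frac{p_{ij}}{1 - p_{ij}}
  = \beta_0 + \beta_1\,\mathrm{L1type}_{j} + \beta_2\,\mathrm{Level}_{ij}
  + \beta_3\,\mathrm{ArticleType}_{ij} + u_j, \\
u_j &\sim \mathcal{N}(0, \sigma_u^2)
\end{aligned}
```

Under this reading, a positive $\beta_1$ would correspond to higher odds of correct article use for learners whose L1 has a definite article, holding the other predictors constant.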

We, therefore, considered four predictors of learner accuracy in article use: L1 type, linguistic distance, EF teaching levels, and article type.

a L1 type

A two-level categorical variable indicating whether the L1 associated with each of our NL groups has a definite article (absent/present). The absent group (i.e. the reference group) included L1s which lack a definite article, namely Japanese, Turkish, Korean, Russian, and Chinese. The present group included German, French, Italian, Spanish, Brazilian Portuguese, and Arabic. As mentioned earlier, the availability of an indefinite article is often independent of the availability of a definite article. For example, most varieties of Arabic, including Gulf Arabic, lack an indefinite article, though they do have a definite article. By contrast, Turkish lacks a definite article but has a numeral which functions as an indefinite article. Table 3 shows the availability of definite and indefinite articles in the L1s associated with our NL groups. The information is based on the World Atlas of Language Structures (specifically Dryer, 2013).² We should highlight that what counts as a language with an indefinite article is not unambiguous. A less permissive approach would only include languages where the indefinite article obligatorily marks all singular count nouns. Under such an approach, Turkish would be classified as a language without an indefinite article since the relevant item, bir, is not obligatory for the realization of count singular items and indeed bare nouns are widely licensed in Turkish. For the purposes of the empirical exploration of the source of crosslinguistic influence, we opted for the more permissive approach because it allowed us to distinguish between Arabic and Turkish (the less permissive definition would classify both Turkish and Arabic as no-indefinite article languages).

Table 3.

Grouping of national languages based on first language (L1) type.

Language	Definite article	Indefinite article	L1 type
German	Yes	Yes	present
French	Yes	Yes	present
Mexican Spanish	Yes	Yes	present
Brazilian Portuguese	Yes	Yes	present
Italian	Yes	Yes	present
Arabic	Yes	No	present
Turkish	No	Yes	absent
Japanese	No	No	absent
Chinese	No	No	absent
Korean	No	No	absent
Russian	No	No	absent

b Linguistic distance

These are continuous variables measured for phonological, lexical, and morphosyntactic distances. We calculated crosslinguistic lexical distance using two measures. First, we used Levenshtein Distance, a commonly used measure of lexical distance. The Levenshtein Distance scores correlate with expert cognacy judgements (e.g. Schepens et al., 2013), with linguistic distance measures based on morphological features (e.g. Schepens et al., 2020), and with different psycholinguistic measures including lexical knowledge (e.g. De Wilde et al., 2010). We calculated the Levenshtein Distance scores by taking L1–L2 word pairs and calculating the minimum number of character substitutions, additions, and deletions that are needed to transform one string into the other. Then we divided this number by the length of the longer word (in terms of number of letters). For example, for the English–German pair /blue/ and /blau/, the Levenshtein Distance score is 0.50 because there are two character transformations and the word length is 4. The smaller the scores, the closer the two languages are. The measures were based on an extended Swadesh list of 38 concepts that are shared by all the national languages in our sample.
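The normalization described above can be illustrated with a short Python sketch. It implements the standard dynamic-programming edit distance and divides by the length of the longer word; the function names are hypothetical, and the sketch is not the code used to produce the scores reported in this study.

```python
def levenshtein(source: str, target: str) -> int:
    """Minimum number of substitutions, additions and deletions
    needed to turn `source` into `target`."""
    previous = list(range(len(target) + 1))
    for i, s_char in enumerate(source, start=1):
        current = [i]
        for j, t_char in enumerate(target, start=1):
            cost = 0 if s_char == t_char else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # addition
                previous[j - 1] + cost,  # substitution (or match)
            ))
        previous = current
    return previous[-1]


def normalized_distance(word_a: str, word_b: str) -> float:
    """Edit distance divided by the length of the longer word."""
    longest = max(len(word_a), len(word_b))
    return levenshtein(word_a, word_b) / longest if longest else 0.0


print(normalized_distance("blue", "blau"))  # 0.5
```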

Second, we used the lexical similarity scores provided by Bella et al. (2021). These scores capture the number of similar words, based on cognacy, in the contemporary lexicons of pairs of languages. The similarity score is a single number between 0 and 100. Larger scores indicate a higher number of shared cognates between two languages. The wordlists used for comparisons were obtained from large, general, contemporary lexicons. The score is accompanied by a confidence rating depending on the size of the lexicons over which the similarity score was calculated. All of the 11 L1 groups involved in this study had high confidence ratings.

We also used the measures of phonological similarity developed by Schepens et al. (2020), capturing similarity between languages in terms of how many sound categories they share. Specifically, this measure considers the effects of three aspects: new sounds (sounds present in English but not in learners’ L1), missing sounds (sounds present in L1 but not in English), and different sounds (different sound categories between L1 and English) (Schepens et al., 2020).

Finally, for morphosyntactic distance, we used the syntactic distance scores provided by Baker and Roberts (see Baker, 2021) and Ceolin et al. (2020). Both studies used the parametric comparison method proposed in Longobardi and Guardiano (2009). The unit of measurement is the generative syntactic parameter which may be set positively or negatively for a given feature, e.g. a positive feature setting for the word-order parameter in case of head-final order, or a negative setting for head-initial orders. Baker (2021) used 88 generative morphosyntactic parameters and provided distance scores for 33 languages; they considered the positive/negative setting of features such as person, tense, evidentiality, passive voice, the locus of feature realization and word order and movement parameters (e.g. verb–object ordering, wh-movement). Ceolin et al. (2020) focused on syntactic distance in the nominal structure. Extending Longobardi and Guardiano (2009), they used 94 parameters capturing crosslinguistic variation in nominal syntax and provided scores for 69 Eurasian languages.³

Table 4 summarizes the linguistic distance measures between English and the 11 L1s considered in this study. As can be seen, lexico-semantic scores (Bella et al., 2021) capture very well the distance between Romance languages and English but do less well in capturing the closeness of English with German, and the closeness of Russian, as an Indo-European language, to English. Levenshtein distance scores seem to distinguish very well between Indo-European languages, but do not show any distance differences for Turkish, Korean, Chinese, and Japanese. The syntax distance scores provided more variability across all languages in our sample, including non-Indo-European languages. According to nominal syntax measures, Chinese is the most distant language from English (0.71) followed by Korean and Japanese (0.6), and then Turkish (0.47). Unfortunately, we do not have a nominal syntax distance score for Arabic. The Baker (2021) syntax scores also showed variability in non-Indo-European languages. One notable difference with all other linguistic distance scores is that Mandarin Chinese had a low distance score from English (28), comparable to German (26) and Romance languages. This is probably because Chinese shares basic word order with English (except in the nominal domain) and lacks agreement, which is comparable to the impoverished agreement of English.

Table 4.

Linguistic distance scores.

L1 groups	German	Chinese (Mandarin)	Italian	French	Mexican Spanish	Brazilian Portuguese	Arabic	Russian	Japanese	Turkish	Korean
L1 type	Article +	Article –	Article +	Article +	Article +	Article +	Article –	Article –	Article –	Article –	Article –
Syntactic distance scores (Baker, 2021)	28	26	32	33	22	29	29	27	44	39	32
Syntactic distance scores (Ceolin et al., 2020)	0.16	0.71	0.26	0.27	0.25	0.25		0.26	0.6	0.47	0.6
Levenshtein distance scores (Shatz, 2020)	0.66	0.92	0.82	0.85	0.86	0.87	0.91	0.88	0.90	0.90	0.90
Lexical similarity (Bella et al., 2021)	4.72	0.05	6.75	9.66	7.9	7.94	0.18	2.02	1.78	1.39	0.53
Phonological distance scores (Schepens et al., 2020)	9	14	14	10	18	14	10	17	19	7	14

c EF teaching levels

We based proficiency on Englishtown EF teaching levels 1–12 (mean 4.38, SD 2.84), corresponding to CEFR levels A1–B2. The value indicates the proficiency level of the lesson for which a writing is submitted by the learner; indirectly, this value captures the learners’ proficiency levels at the time of writing, which could be at the start of their learning programme or moving on from a preceding level.

d Article type

Article type is a categorical variable with two levels: definite (reference level) and indefinite. We extracted a total of 2,774,915 obligatory contexts for both article types; 37.78% were indefinite and 62.22% were definite. We included article type in our models to specifically examine whether the effects of L1 type and linguistic distance scores on accuracy vary as a function of article type.

V Results

1 Distribution of errors in obligatory contexts

Table 5 shows that the obligatory contexts for the definite article outnumber the contexts for the indefinite article by approximately 1.65 to 1, so learners had substantially more opportunities to use the definite article. In both contexts, omission errors represent the vast majority of errors, with substitution errors just over 6% in indefinite contexts, roughly double the 3.34% of substitution errors found in definite contexts. The unequal distribution of the obligatory contexts for the two articles means that we have a much higher number of errors for the definite article than the indefinite article.

Table 5.

Number of obligatory contexts and errors for definite and indefinite articles (percentages in parentheses).

Article type	Obligatory contexts	Omission errors	Substitution errors
Definite	1,726,529	245,253 (96.7)	8,482 (3.3)
Indefinite	1,048,386	137,568 (93.3)	8,936 (6.1)

Notes. For indefinite articles, we also detected phonological variant errors referring to errors regarding a/an distinction (0.64%). These were included as accurate in the analysis.
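The proportions discussed above can be recomputed from the counts in Table 5. The sketch below is illustrative only: the variable and function names are ours, and the pooled accuracy it reports is computed over all contexts, which need not coincide with the mean per-writing scores reported later. For the indefinite article the shares differ slightly from the table, presumably because the a/an errors mentioned in the note are not part of the denominator used here.

```python
# Counts copied from Table 5.
table5 = {
    "definite": {"contexts": 1_726_529, "omission": 245_253, "substitution": 8_482},
    "indefinite": {"contexts": 1_048_386, "omission": 137_568, "substitution": 8_936},
}


def summarize(row):
    """Error-type shares and pooled accuracy for one article type."""
    errors = row["omission"] + row["substitution"]
    return {
        "omission_share": row["omission"] / errors,
        "substitution_share": row["substitution"] / errors,
        "pooled_accuracy": 1 - errors / row["contexts"],
    }


for article, row in table5.items():
    stats = summarize(row)
    print(article, {name: round(value, 4) for name, value in stats.items()})

# Ratio of definite to indefinite obligatory contexts.
ratio = table5["definite"]["contexts"] / table5["indefinite"]["contexts"]
print("definite : indefinite contexts =", round(ratio, 2))
```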

2 The effects of national languages on accuracy

Table 6 shows the mean accuracy scores of suppliance of the indefinite and definite articles in obligatory contexts across all EF teaching levels for each NL group, ranked from lowest to highest accuracy score. We generally observe high accuracy, ranging from 78% (SD = 34) to 90% (SD = 23). Looking at the rightmost column, we can observe that the lowest scores belong to NL groups without definite articles in the corresponding L1s (absent), namely Korean, Turkish, Japanese, and Russian. Chinese learners are an exception with the second highest accuracy score despite the unavailability of a definite article in Mandarin Chinese.

Table 6.

Accuracy scores for national languages.

National language group	SOC scores definite article	SOC scores indefinite article	SOC scores both article types
Korean	0.76 (0.36)	0.79 (0.33)	0.78 (0.34)
Turkish	0.76 (0.36)	0.80 (0.32)	0.78 (0.35)
Japanese	0.79 (0.34)	0.80 (0.33)	0.79 (0.33)
Russian	0.78 (0.35)	0.79 (0.33)	0.79 (0.34)
Arabic	0.81 (0.34)	0.81 (0.33)	0.81 (0.33)
Brazilian Portuguese	0.84 (0.31)	0.84 (0.31)	0.84 (0.30)
Mexican Spanish	0.86 (0.29)	0.85 (0.29)	0.86 (0.27)
Italian	0.86 (0.28)	0.87 (0.27)	0.87 (0.27)
French	0.87 (0.28)	0.87 (0.27)	0.87 (0.27)
Chinese	0.87 (0.27)	0.87 (0.28)	0.87 (0.28)
German	0.90 (0.24)	0.91 (0.22)	0.90 (0.23)

Note. SOC = suppliance in obligatory context.

Turning to the influence of the indefinite article, we observe that the availability of the indefinite article in the L1 does not seem to impact on article accuracy in L2. For example, Turkish learners whose L1 has an indefinite article do not seem to do better than, e.g. Japanese, or Korean learners, who lack an indefinite article. At the same time, the unavailability of an indefinite article in Arabic does not lead to lower accuracy for Arab learners with the indefinite article. Crucially, Arab learners show higher accuracy with the indefinite article than, e.g. Turkish learners (in fact, all learners in the absent NL groups). At the same time, within the absent NL group – Korean, Turkish, Japanese, and Russian – the indefinite article generally seems to have higher accuracy than the definite article. The exception to this is again Chinese, the only absent language showing equal accuracy for both articles, and, in this way, patterning with the present NL groups, in which the definite and indefinite article generally have the same accuracy (except for German learners). Overall, it seems that the indefinite article has higher accuracy than the definite, but this is predominately true for the absent group. Finally, within the present NL group, German learners have the highest accuracy score and Arab learners the lowest, with learners from Romance languages in between, suggesting some language family effects.

To facilitate comparisons between NL groups, Figures 1 and 2 show the average accuracy scores for both definite and indefinite articles across EF teaching levels for all NL groups. As expected, accuracy improves with proficiency. At the same time, the NL differences that are present at beginner EF teaching levels are also present at the end of intermediate teaching levels (the highest proficiency points in our data), suggesting that proficiency cannot overturn NL effects, at least up to intermediate levels. Figure 1 shows the highest and lowest SOC scores in each NL group across EF teaching levels. Moreover, it shows that the present national language groups cluster at the higher end of accuracy and the absent national language groups cluster at the lower end, with the Arab national language group in the middle (note that the y-axis starts at 0.7). Chinese learners show an overall flatter trajectory than other groups. Notably, Chinese learners achieved high accuracy in the use of articles, which makes them an outlier in the absent group.

Figure 1.

Developmental trajectories of accuracy scores across national language groups.

Figure 2.

Developmental trajectories of accuracy scores across national language groups for definite articles (left) and indefinite articles (right).

Figure 2 shows accuracy across EF teaching levels per national language group separately for the definite and indefinite article. The patterns are similar for the two articles, though Arab learners seem to cluster with the absent NL groups for the indefinite article. We also see a large drop in the accuracy of Korean learners. We closely inspected the data to understand the source of this drop. We considered the possibility of a technical error in the database, such as learners from a lower EF teaching level being assigned to a higher level. We confirmed that the vast majority of the learners moved on from preceding levels. The drop in accuracy seems to be due to low numbers of scripts, as there are considerably fewer Korean learners in the dataset than learners from other NL groups.

3 Model development

We built two generalized linear mixed models with a binomial distribution, using the lme4 package in R to address our research questions (Bates et al., 2015; Version 4.3.2; R Core Team, 2023). We calculated the p-values using the lmerTest package (Version 3.1-3; Kuznetsova et al., 2017).⁴ We constructed multiple models and compared them to identify the most plausible model, using Akaike Information Criterion (AIC) values. For model selection, we opted for a forward-selection approach (see James et al., 2013). We began with the simplest model and added predictors incrementally, retaining them only if they improved model goodness-of-fit (see also Murakami, 2016). For both models, we first built an unconditional model, including only random intercepts.
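The forward-selection logic described above can be sketched generically as follows. This is an illustration under stated assumptions, not the authors' script: the function `forward_select` and the lookup `toy_aic` are hypothetical, and in a real analysis the AIC would come from refitting the mixed model with each candidate set of predictors.

```python
from typing import Callable, FrozenSet, Iterable, List, Tuple


def forward_select(
    candidates: Iterable[str],
    aic_of: Callable[[FrozenSet[str]], float],
) -> Tuple[List[str], float]:
    """Greedy forward selection: add the predictor that lowers AIC most;
    stop when no remaining predictor lowers it."""
    selected: List[str] = []
    remaining = list(candidates)
    # Start from the unconditional model (no fixed-effect predictors).
    best_aic = aic_of(frozenset())
    while remaining:
        scored = [(aic_of(frozenset(selected + [c])), c) for c in remaining]
        aic, choice = min(scored)
        if aic >= best_aic:
            break  # no candidate improves goodness-of-fit
        selected.append(choice)
        remaining.remove(choice)
        best_aic = aic
    return selected, best_aic


def toy_aic(terms: FrozenSet[str]) -> float:
    """Hypothetical AIC values standing in for refitted models."""
    gains = {"level": 300.0, "l1_type": 120.0, "art_type": 40.0, "noise": -2.0}
    return 1000.0 - sum(gains[t] for t in terms)


chosen, final_aic = forward_select(
    ["level", "l1_type", "art_type", "noise"], toy_aic
)
print(chosen, final_aic)
```

In this toy run the predictor labelled `noise` is never added, because including it would raise the AIC.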

To control for individual variation, we included by-learner and by-national language random intercepts. The by-learner intercepts accounted for variation in article accuracy across individual learners, while the by-national language intercepts captured overall accuracy differences among native language groups. Additionally, we included a by-national language random slope for article type in both models, representing variation in accuracy differences between article types across national language groups. This implies that some learners, depending on the characteristics of their native languages, may learn to use articles more quickly than others.

In Model 1, we investigated whether L1 type, alongside EF teaching levels, predicts learners’ accuracy in their use of definite and indefinite articles. As fixed effects, Model 1 included L1 type (absent vs. present vs. chinese), EF teaching levels (as a continuous variable), and article type (Indefinite vs. Definite), allowing us to test interactions between L1 type, EF teaching levels, and article type. We coded the categorical variable L1 type using treatment coding, with the absent group as the reference level against which the present and chinese groups were each compared (absent = 0; present = 1 and chinese = 1 on their respective contrasts). We treated chinese as a separate group because the descriptive data suggested a pattern distinct from both absent and present groups. To account for potential non-linearity, we included both linear (i.e. EF Teaching levels¹) and quadratic (i.e. squared, EF Teaching levels²) effects of EF teaching levels, which were centred and standardized.

In the second model, we examined whether linguistic distance scores, alongside EF teaching levels, predict learners’ accuracy in their use of definite and indefinite articles. The full analysis code is available at https://osf.io/r638b/.
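The predictor coding described above can be illustrated with a small sketch. It is not taken from the authors' analysis code: the helper names and the five toy rows are hypothetical, and squaring the already standardized level is only one possible reading of how the quadratic term was built.

```python
from statistics import mean, stdev

# The first entry is the reference level for treatment coding.
L1_LEVELS = ["absent", "present", "chinese"]


def treatment_code(l1_type):
    """Return one 0/1 dummy variable per non-reference level."""
    if l1_type not in L1_LEVELS:
        raise ValueError(f"unknown L1 type: {l1_type}")
    return {f"l1_{lvl}": int(l1_type == lvl) for lvl in L1_LEVELS[1:]}


def scale(values):
    """Centre on the mean and divide by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


# Toy rows (hypothetical): EF teaching level and L1 type for five writings.
levels = [1, 2, 4, 7, 11]
l1_types = ["absent", "present", "chinese", "absent", "present"]

level_scaled = scale(levels)
# Assumption: the quadratic term is the square of the scaled level.
level_scaled_sq = [z ** 2 for z in level_scaled]

for lvl, z, z2, l1 in zip(levels, level_scaled, level_scaled_sq, l1_types):
    print(lvl, round(z, 3), round(z2, 3), treatment_code(l1))
```

Under this coding, a learner in the absent group has zeros on both dummy variables, so the model intercept describes that group.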

4 The effects of L1 type on article use accuracy

The results are presented in Table 7. The VIF scores of L1 type, EF teaching levels (including linear and quadratic effects) and article type did not indicate any problems with multicollinearity (Fox and Weisberg, 2019; all VIF scores < 1.40). As predicted, the main effect of L1 type was significant. Learners in both the present and chinese groups achieved overall higher accuracy in article use than those in the absent group, whose L1s lack articles. To directly compare the learners in the present and chinese groups, we re-levelled the model. There was no significant difference in the accuracy of article use between the present and chinese groups, b = −0.175, SE = 0.18, z = −0.965, p = .334. The main effect of article type was significant, indicating that learners were more accurate when using indefinite articles than when using definite articles. Additionally, both the linear and quadratic effects of EF teaching levels were significant, showing that the learners’ accuracy in article use improved as they progressed through proficiency levels (see also Figures 1 and 2).

Table 7.

Results of the binomial generalized linear mixed model, Model 1, investigating the effects of Education First (EF) teaching levels, first language (L1) type and Article type.

Fixed effects	Estimate	SE	95% CI	z-value	p
(Intercept)	1.253	0.085	[1.085, 1.421]	14.639	< .001
EF Teaching levels¹	0.040	0.009	[0.021, 0.059]	4.216	< .001
EF Teaching levels²	0.053	0.006	[0.040, 0.065]	8.156	< .001
L1 type: chinese vs. absent	0.853	0.189	[0.483, 1.224]	4.515	< .001
L1 type: present vs. absent	0.678	0.109	[0.462, 0.893]	6.166	< .001
Article type: Indefinite vs. Definite	0.213	0.037	[0.140, 0.286]	5.734	< .001
EF Teaching levels¹ × L1 type: chinese vs. (absent)	0.024	0.013	[−0.002, 0.051]	1.801	.071
EF Teaching levels¹ × L1 type: present vs. (absent)	0.139	0.010	[0.119, 0.159]	13.806	< .001
EF Teaching levels² × L1 type: chinese vs. (absent)	−0.060	0.009	[−0.079, −0.040]	−6.096	< .001
EF Teaching levels² × L1 type: present vs. (absent)	−0.016	0.006	[−0.029, −0.002]	−2.353	.018
L1 type (chinese vs. absent) ×Article type (Indefinite vs. Definite)	−0.312	0.078	[−0.465, −0.158]	−3.985	< .001
L1 type (present vs. absent) ×Article type (Indefinite vs. Definite)	−0.218	0.046	[−0.310, −0.126]	−4.670	< .001
EF teaching levels¹ × Article type: Indefinite vs. (Definite)	−0.071	0.004	[−0.080, −0.062]	−15.354	< .001
EF teaching levels² × Article type: Indefinite vs. (Definite)	0.034	0.003	[0.027, 0.041]	9.635	< .001

Notes. EF teaching levels¹ = linear effects of EF teaching levels; EF teaching levels² = quadratic effects of EF teaching levels. The baseline level for L1 type is absent, and for article type it is definite. We coded the categorical variable L1 type using treatment coding (absent = 0, present = 1, chinese = 1). We coded the categorical variable article type as definite = 0, indefinite = 1. The full model specification was as follows: (correct, incorrect) ~ (1|learnerID) + (1+art_type|nat_lang) + level_scaled + level_scaled_sq + L1_type + art_type + level_scaled*L1_type + level_scaled_sq*L1_type + art_type*L1_type + art_type*level_scaled + art_type*level_scaled_sq.
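Because the estimates in Table 7 are on the log-odds scale, it can help to convert them to probabilities. The sketch below is an illustration only: it uses the fixed-effect estimates from Table 7, holds both EF teaching level terms at zero, and sets all random effects to zero, so the results are conditional predictions rather than the observed group means. The function names are ours.

```python
import math


def inv_logit(eta):
    """Map a log-odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-eta))


# Fixed-effect estimates copied from Table 7 (log-odds scale).
intercept = 1.253
l1_effect = {"absent": 0.0, "present": 0.678, "chinese": 0.853}
indefinite_effect = 0.213
l1_by_indefinite = {"absent": 0.0, "present": -0.218, "chinese": -0.312}


def predicted_accuracy(l1_type, article):
    """Fixed-effects prediction with both EF-level terms held at zero."""
    eta = intercept + l1_effect[l1_type]
    if article == "indefinite":
        eta += indefinite_effect + l1_by_indefinite[l1_type]
    return inv_logit(eta)


for l1 in l1_effect:
    for art in ("definite", "indefinite"):
        print(l1, art, round(predicted_accuracy(l1, art), 3))
```

Read this way, the negative interaction terms offset the indefinite advantage for the present and chinese groups, which is consistent with the pattern described in the text.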

To decompose the significant interaction between L1 type and EF teaching levels, we employed the emtrends function in the emmeans package in R (Lenth, 2024) to obtain the simple slopes for EF teaching levels, including both linear and quadratic effects, for each level of L1 type. The linear effect of EF teaching levels on article accuracy was significantly lower for learners in the absent group than for those in the present group, Estimate = −0.1394, SE = 0.010, z = −13.806, p < .001 (0.004 − 0.144 = −0.1394). In other words, learners in the present group showed a greater positive effect of EF teaching levels on article accuracy; Figure 3 illustrates the interaction effect. There was no significant difference between learners in the absent group and those in the chinese group regarding the linear effect of EF teaching levels on article accuracy, Estimate = −0.024, SE = 0.013, z = −1.801, p = .169 (0.004 − 0.029 = −0.024). The linear effect of EF teaching levels on article accuracy was significantly lower for learners in the chinese group than for those in the present group, Estimate = −0.1148, SE = 0.010, z = −10.906, p < .001 (0.029 − 0.144).

Figure 3.

Interaction between first language (L1) type (absent, present, chinese) and Education First (EF) teaching levels.¹

The quadratic effects of EF teaching levels on article accuracy were significantly higher for learners in the absent group than for those who speak chinese, indicating that the improvement in article accuracy as EF teaching levels increase accelerates more for learners in the absent group compared to those in the chinese group (see Figure 4), Estimate = 0.060, SE = 0.009, z = 6.096, p < .001 (0.070 – 0.010). There was also a significant difference between learners in the absent group and those who were in the present group, regarding the quadratic effect of EF Teaching levels on article accuracy, Estimate = 0.016, SE = 0.006, z = 2.353, p < .04 (0.070 – 0.054), showing that as EF Teaching levels increase, article accuracy improves at a slightly faster rate for the learners in the absent group than for those in the present group. Lastly, the quadratic effect of EF Teaching levels on article accuracy for chinese speaking learners was significantly lower compared to those in the present group, Estimate = −0.043, SE = 0.007, z = −5.511, p < .001 (0.010 – 0.054).
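The contrasts reported in the two preceding paragraphs are differences between simple slopes. As a hedged illustration, the sketch below recomputes those differences from the rounded slopes quoted in the text (0.004, 0.029 and 0.144 for the linear trend; 0.070, 0.010 and 0.054 for the quadratic trend). Because the inputs are rounded to three decimals, the results can differ from the reported estimates in the last digit. The dictionary and function names are ours.

```python
from itertools import combinations

# Simple slopes of EF teaching level by L1 type, as quoted in the text.
linear = {"absent": 0.004, "chinese": 0.029, "present": 0.144}
quadratic = {"absent": 0.070, "chinese": 0.010, "present": 0.054}


def pairwise_differences(trends):
    """Difference in trend for every pair of L1 types (first minus second)."""
    return {
        f"{a} - {b}": round(trends[a] - trends[b], 3)
        for a, b in combinations(trends, 2)
    }


print("linear:", pairwise_differences(linear))
print("quadratic:", pairwise_differences(quadratic))
```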

Figure 4.

Predicted accuracy by Education First (EF) teaching levels (linear and squared) and article type for each first language (L1) type.

We deconstructed the significant interaction between article type and EF teaching levels. The linear effect of EF teaching levels on article accuracy was higher for definite articles compared to indefinite articles, Estimate = 0.072, SE = 0.004, z = 15.418, p < .001 (0.1523–0.0802 = 0.072). However, the quadratic effect of EF teaching levels showed a different pattern. As EF teaching levels increased, the improvement in article accuracy accelerated more for indefinite articles, indicating a non-linear relationship in which learners’ accuracy with indefinite articles improved at an increasing rate with higher EF teaching levels, Estimate = −0.034, SE = 0.003, z = −9.635, p < .001 (0.027−0.062). Figure 4 illustrates the interaction effect, incorporating both linear and quadratic effects of teaching levels.

In order to gain a clearer understanding of the interaction between L1 type and article type, we ran a series of post-hoc tests using least-square means employing the emmeans package in R (Lenth, 2024), with Tukey adjustments for multiple comparisons. The results indicated that the learners in both the present and chinese groups used both types of articles significantly more accurately than the learners in the absent group, as can be seen in Table 8. The learners in the absent group used indefinite articles more accurately than definite articles, Estimate = 0.248, SE = 0.037, z = 6.693, p < .001.

Table 8.

Post-hoc comparison of accuracy by first language (L1) type for definite and indefinite articles.

L1 type contrasts	Article type	Estimate	SE	z-ratio	p
chinese – absent	Definite	0.793	0.189	4.201	<.001
present – absent	Definite	0.661	0.110	6.028	<.001
present – chinese	Definite	−0.131	0.182	−0.725	.748
chinese – absent	Indefinite	0.481	0.182	2.642	.022
present – absent	Indefinite	0.443	0.106	4.178	<.001
present – chinese	Indefinite	−0.038	0.176	−0.217	.974
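The contrasts in Table 8 are differences in log-odds, so exponentiating them gives odds ratios. The sketch below is illustrative only: the intervals are plain Wald intervals (estimate ± 1.96 × SE) and are not adjusted for multiple comparisons, whereas the p-values in the table use a Tukey adjustment. The function name is ours.

```python
import math

# (contrast, article type, estimate, SE) copied from Table 8.
table8 = [
    ("chinese - absent", "definite", 0.793, 0.189),
    ("present - absent", "definite", 0.661, 0.110),
    ("present - chinese", "definite", -0.131, 0.182),
    ("chinese - absent", "indefinite", 0.481, 0.182),
    ("present - absent", "indefinite", 0.443, 0.106),
    ("present - chinese", "indefinite", -0.038, 0.176),
]


def odds_ratio_with_ci(estimate, se, z=1.96):
    """Exponentiate a log-odds contrast and its unadjusted Wald interval."""
    return (
        math.exp(estimate),
        math.exp(estimate - z * se),
        math.exp(estimate + z * se),
    )


for contrast, article, est, se in table8:
    ratio, lower, upper = odds_ratio_with_ci(est, se)
    print(f"{contrast:18s} {article:10s} OR = {ratio:.2f} [{lower:.2f}, {upper:.2f}]")
```

An odds ratio above 1 means that the first group in the contrast has higher odds of supplying the article correctly than the second.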

5 The effects of linguistic distance on article use accuracy

Table 9 shows correlations between linguistic distance scores and accuracy. As can be seen, there is a medium-to-strong correlation between Levenshtein distance and accuracy, medium size correlations for the syntactic distance scores, and essentially no correlation with phonological scores. We constructed generalized linear mixed models with a binomial distribution to investigate the effects of linguistic distance scores on the accuracy of article use. In this model (Model 2), national languages are represented by their linguistic distance scores, as shown in Table 4. The VIF scores of EF teaching levels, article type, Levenshtein distance scores, and the Baker (2021) syntactic distance scores did not indicate any problems with multicollinearity (VIF scores < 1.40). As we aimed to analyse the effect of linguistic distance, we decided not to include L1 type in the second model. As in Model 1, the resulting final model included a by-learner random intercept, a by-L1 random intercept, and a by-L1 random slope for article type. For fixed effects, Model 2 included EF teaching levels with linear (i.e. EF Teaching levels¹) and quadratic (i.e. squared, EF Teaching levels²) effects, article type (INDEFINITE or DEFINITE), Levenshtein distance scores, and the Baker (2021) syntactic distance scores. Phonological and Ceolin et al.’s (2020) nominal syntactic distance⁵ scores did not enter into the final version of the model, as these predictors did not improve the model’s goodness-of-fit. The resulting final model also included two interaction terms: EF teaching levels by article type, and Baker (2021) syntactic distance scores by article type.

Table 9.

Correlation coefficients between linguistic distance and suppliance in obligatory context (SOC) scores in the use of articles.

Linguistic distance scores	Pearson correlations
Phonological distance scores (Schepens et al., 2020)	p = 0.15, r = .16
Levenshtein distance scores	p = .64, r = .58
Lexical similarity ratings	p = .59, r = .39
Syntactic distance scores (Baker, 2021)	p = .50, r = .39
Syntactic distance scores (Ceolin et al., 2020)	p = .46, r = .46
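For readers who wish to see how a correlation of this kind is computed, the sketch below calculates a Pearson coefficient between the Levenshtein distance scores in Table 4 and the overall SOC scores in Table 6, using one data point per L1 group. This is an assumption about the unit of analysis: the coefficients in Table 9 may have been computed on different units, so this group-level value is not expected to reproduce them exactly. The function name is ours.

```python
import math

# Levenshtein distance scores (Table 4) and overall SOC scores (Table 6).
levenshtein = {
    "German": 0.66, "Chinese": 0.92, "Italian": 0.82, "French": 0.85,
    "Mexican Spanish": 0.86, "Brazilian Portuguese": 0.87, "Arabic": 0.91,
    "Russian": 0.88, "Japanese": 0.90, "Turkish": 0.90, "Korean": 0.90,
}
soc = {
    "German": 0.90, "Chinese": 0.87, "Italian": 0.87, "French": 0.87,
    "Mexican Spanish": 0.86, "Brazilian Portuguese": 0.84, "Arabic": 0.81,
    "Russian": 0.79, "Japanese": 0.79, "Turkish": 0.78, "Korean": 0.78,
}


def pearson_r(xs, ys):
    """Pearson product-moment correlation of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


groups = list(levenshtein)
r = pearson_r([levenshtein[g] for g in groups], [soc[g] for g in groups])
print(round(r, 2))
```

At the group level the coefficient is negative, in line with the expectation that greater distance from English goes with lower accuracy.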

The results are presented in Table 10. Both the linear and quadratic effects of EF teaching levels were significant, showing that the accuracy of article use was higher in learners of higher proficiency. Turning to linguistic distance scores, the main effects of both Levenshtein distance and Baker (2021) syntactic distance scores were significant, as they predicted article accuracy. In other words, learners in the L1 groups with lower Levenshtein and Baker (2021) syntactic distance scores, such as German, used articles more accurately than learners in the L1 groups with higher distance scores, such as Turkish and Korean. The main effect of article type was not significant.

Table 10.

Results of the binomial generalized linear mixed model, Model 2, investigating the effects of Education First (EF) teaching levels, linguistic distance scores and article type.

Fixed effects	Estimate	SE	95% CI	z-value	p
(Intercept)	1.762	0.085	[1.595, 1.929]	20.644	<.001
EF Teaching levels¹	0.152	0.003	[0.144, 0.159]	40.115	<.001
EF Teaching levels²	0.034	0.002	[0.029, 0.039]	12.893	<.001
Article type: Indefinite vs. definite	0.035	0.037	[−0.037, 0.108]	0.953	.340
Levenshtein distance scores	−0.135	0.032	[−0.198, −0.072]	−4.202	<.001
Syntactic distance scores (Baker, 2021)	−0.103	0.048	[−0.198, −0.008]	−2.136	.032
EF Teaching levels¹ × Article type: Indefinite vs. (Definite)	−0.072	0.004	[−0.081, −0.062]	−15.418	<.001
EF Teaching levels² × Article type: Indefinite vs. (Definite)	0.034	0.003	[0.027, 0.041]	9.598	<.001
Syntactic distance scores (Baker, 2021) × Article type: Indefinite vs. (Definite)	0.048	0.020	[0.006, 0.090]	2.291	.021

Notes. EF teaching levels¹ = linear effects of EF teaching levels; EF teaching levels² = quadratic effects of EF teaching levels. We coded the categorical variable article type (definite = 0, indefinite = 1).

We deconstructed the significant interaction between article type and EF teaching levels. The linear effect of EF teaching levels on article accuracy was higher for definite articles than for indefinite articles, Estimate = 0.072, SE = 0.004, z = 15.418, p < .001 (0.152 − 0.080 = 0.072). However, indefinite articles showed a stronger quadratic trend: as EF teaching levels increased, the improvement in article accuracy accelerated more for indefinite articles, Estimate = −0.034, SE = 0.003, z = −9.598, p < .001 (0.034 (definite) − 0.069 (indefinite) = −0.035).

We also deconstructed the significant interaction between article type and Baker (2021) syntactic distance scores. As the scores increase, accuracy for both article types appears to decrease slightly. The difference in trends between the two types of articles is significant, with definite articles showing a larger decrease in accuracy than indefinite articles, Estimate = −0.048, SE = 0.021, z = −2.291, p = .02 (−0.103 − (−0.055) = −0.048).
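The simple slopes behind this interaction follow directly from the Model 2 estimates in Table 10. As a minimal sketch, the code below adds the interaction term to the main effect to obtain the slope for indefinite articles, and converts each slope to an odds multiplier per one-unit increase in the distance predictor as entered in the model (the scaling of that predictor is not restated here, so the unit is an assumption). The variable names are ours.

```python
import math

# Estimates copied from Table 10 (log-odds scale).
baker_main_effect = -0.103   # slope for definite articles (reference level)
baker_by_indefinite = 0.048  # additional slope for indefinite articles

slope = {
    "definite": baker_main_effect,
    "indefinite": baker_main_effect + baker_by_indefinite,
}

for article, value in slope.items():
    # exp(slope) is the multiplicative change in the odds of correct
    # suppliance for a one-unit increase in the distance predictor.
    print(article, round(value, 3), round(math.exp(value), 3))

difference = slope["definite"] - slope["indefinite"]
print("definite minus indefinite:", round(difference, 3))
```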

VI Discussion

Below we summarize our main findings:

A higher number of definite obligatory contexts (approximately 1.65 definite contexts for each indefinite one).

Overall high accuracy in suppliance of the articles, ranging from 75% to 90% of correct suppliance in obligatory contexts.

Errors predominately involve omission, with substitution errors below 6% of all errors.

The presence of a definite article in learners’ L1 predicts both their accuracy in L2 article usage and the extent of their improvement with proficiency. Specifically, learners in the present group demonstrate significantly higher accuracy and greater improvement across proficiency levels compared to learners in the absent group as shown in Figure 4. In addition, Chinese learners are distinct from both present and absent groups. Surprisingly, they show higher overall accuracy than learners without a definite article (absent group) but their improvement across proficiency is lower than the present group and more similar to the absent group.

When teaching levels are considered in quadratic terms, a more nuanced picture emerges regarding the progress learners make with proficiency. Thus, as can be seen in Figure 5, learners from the present group show strong linear improvement in their accuracy with proficiency, contrasting with Chinese learners, who show less pronounced improvement across proficiency. By contrast, learners from the absent group show a U-shaped curve. For the definite article they show slow progress at the lower levels and a steeper increase in accuracy at the higher proficiency levels. The U-shaped curve is clearer for the indefinite article.

Learners from the absent group show higher accuracy in the use of the indefinite article, in comparison with the definite article. In this respect, they contrast with present and chinese groups who show no difference between the two articles.

Lexical and syntactic distance scores show medium size correlations with accuracy. However, only Levenshtein and Baker (2021) scores were significant in the regression model.

Figure 5.

Interaction between article type (definite, indefinite) and first language (L1) type (absent, present, chinese).

We begin with the observation that overall, our learners show relatively high levels of accuracy, ranging between 75% and 90%. By comparison, accuracy in Murakami and Alexopoulou (2016) ranged between 65% and 93% for beginner and intermediate levels of proficiency. The higher accuracy of learners in our study can be explained by the fact that Murakami and Alexopoulou (2016) reported results from the Cambridge Learner Corpus, a corpus of high-stakes exams. By contrast, the learners of our study produced writings as part of their learning in an open setting with access to resources. Their higher accuracy is, therefore, not as surprising. It is worth noting that even a 15%–20% error rate is highly noticeable for articles since each sentence may contain a couple of article obligatory contexts. As we mentioned in our methodology section, our manual evaluation indicated that the teacher corrections were reliable, so the higher accuracy is not due to teachers missing learner errors.

Our results confirm earlier findings that article omission is the main error type, that accuracy improves with proficiency, and that the availability of an article in the L1 leads to higher accuracy in the use of the L2 English articles. One important limitation of the current study is that we were not able to consider overgeneralization errors. However, a comparison with Murakami and Alexopoulou (2016) and Derkach and Alexopoulou (2023), who include overgeneralization errors and calculate TLU scores, shows that omission is the predominant type of error, while the comparison between TLU and SOC scores does not change the overall picture of the L1 influence on L2 accuracy (Murakami and Alexopoulou, 2016).

Higher proficiency does not appear to close the accuracy gap between the present and absent groups. Interestingly, the gap increases with proficiency since learners from the present group increase their accuracy more than learners from the absent group (Figure 4). Of course, we do not know if this trajectory changes in advanced proficiency levels, since our higher proficiency learners are at late intermediate levels. However, we should note that in Murakami and Alexopoulou (2016), the difference between present and absent L1 groups remained significant until advanced levels. This result is reminiscent of van der Slik et al. (2017), who found that proficiency test scores of learners of Dutch with an L1 of equal or higher morphological complexity than Dutch improved over time, while test scores of learners with an L1 of lower morphological complexity than Dutch worsened over time.

The frequency distribution of the two target articles, definite vs. indefinite, did not seem to play any role in learner accuracy. If it did, we should see related differences in the accuracy of the two articles, with the definite article potentially showing higher accuracy given its higher frequency. The analysis showed that within the present and chinese NL groups there was no difference in the accuracy of definite and indefinite articles. Within the absent NL groups, it was the less frequent indefinite article which was more accurate and showed more gains over proficiency. Overall, though, there was no correlation between the frequency of the article contexts and their accuracy.

Regarding L1 influence, L1 article type (present/absent) was the strongest predictor of L2 accuracy. Specifically, the availability of the definite article was predictive of accuracy. At least informally, the availability of an indefinite article did not appear to predict accuracy. For example, Turkish learners, whose L1 has an indefinite article, were indistinguishable from Japanese learners, whose L1 lacks an indefinite article. Similarly, Arab learners, belonging to the present group, showed higher accuracy in the use of the indefinite article than all absent NL groups, even though they lack an indefinite article in their L1. In addition, they showed no difference in the accuracy between the definite and indefinite articles, on a par with the present NL group. These results are consistent with the generative view that the availability of a definite article indicates a deeper structural difference between languages, namely the involvement of a DP, which plays a crucial role in the syntactic manifestation of arguments (Bošković, 2008; Chierchia, 1999). Thus, the variation concerns whether bare NPs are allowed as arguments and predicts that the absence of a DP will result in licensing of bare NPs as arguments and, thus, lead to high rates of article omission in L2. The availability of an indefinite article in a non-DP language like Turkish does not change the fact that bare NPs are grammatical as arguments in Turkish. Similarly, despite the absence of an indefinite article in Arabic, the distribution of bare nominals is restricted in Arabic⁶ as in other DP languages (Ferri, 2007). One might argue that the availability of the indefinite article strongly correlates with the availability of the definite article, as can be seen in Table 4, making it hard to tease apart the influence of each article. This, though, is the essence of the distinction between DP and NP languages; the former will systematically require an overt argumentizer in the form of an article or some other D element (e.g. demonstrative), while in the latter, a bare NP can be an argument.

We should acknowledge that our distinction between NP and DP languages (see Bošković, 2008) oversimplifies and abstracts away from theoretical debates regarding the DP/NP distinction and the role of the definite article in assuming a DP (see, among others, Franks and Pereltsvaig, 2004; Gillon and Armoskaite, 2015; Crisma et al., 2020; Kornfilt, 2017, 2018; Köylü, 2021; Lyutikova and Pereltsvaig, 2015; Progovac, 1998; Salzmann, 2020). Further investigation of finer grained distinctions in the use of the articles in different contexts (e.g. regarding specific or generic nouns) might well need to draw from more nuanced typological distinctions. However, our current findings do lend support to a broad typological distinction between NP and DP languages.

We should also note that, while our results are consistent with the generative hypothesis that languages vary in allowing DPs or bare NPs as arguments, alternative hypotheses cannot be excluded. For example, it could be argued that the definite article has a prominent discourse function and that the acquisition of the indefinite article is somehow dependent semantically on the acquisition of the definite article, making the definite article a better predictor of L2 accuracy. Note, however, that the learners from the absent group show higher accuracy with the indefinite article. To understand which aspects of the learning challenge relate to interpretation and the discourse and semantic properties of each article and which aspects relate to structural and morphosyntactic aspects varying between languages, qualitative data are needed (e.g. the use of the indefinite article with singular nouns in predicative contexts, like ‘she is a teacher/they are teachers’ where the indefinite article makes no contribution to meaning and its presence is only required for the singular).

Chinese learners did not confirm the prediction that the absence of an article in the L1 leads to lower accuracy in the L2. In fact, Chinese learners showed accuracy that was higher than the average accuracy in the present NL group. It is possible that, despite the absence of an English-type article, Chinese does have a DP and an abstract definiteness feature that facilitates the acquisition of the English article. Indeed, a number of theoretical proposals assume a DP layer for Chinese nominal structure (Chen, 2004; Cheng et al., 2017; Huang, 1999). If these proposals are correct, then the high accuracy of Chinese learners would suggest that the availability of abstract features in L1 is facilitative for L2 acquisition even in the absence of an overt item lexicalizing such features. However, unlike learners in the present NL groups, the Chinese learners do not seem to improve their accuracy with proficiency. More qualitative research is required to better understand what underpins the surprisingly high accuracy of Chinese learners.

Our data also indicated broader typological effects on the accuracy of L2 English article use. Within the present NL groups, German learners showed the highest accuracy, followed by Romance NL groups and then Arab learners. No similar typological effects could be seen within the absent group. In fact, Russian learners, with an Indo-European L1, patterned together with learners from non-Indo-European languages and showed lower accuracy than Arab learners (Semitic L1). We aimed to capture the typological effects by using existing measures of linguistic distance, following the paradigm established by Schepens et al. (2020). We found medium size correlations with lexical and syntactic measures of linguistic distance, while the phonological measures we used were not predictive. Only the lexical-phonological measures based on Levenshtein distance and the syntactic measures of Baker (2021) were significant in the regression model.
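To make the notion of a Levenshtein-based lexical measure concrete, the following is a minimal sketch in Python and not the procedure used to compute the scores in this study. It assumes that lexical distance between two languages is approximated by averaging the length-normalized edit distance over translation pairs. The function names and the three-item word list are hypothetical and serve only as an illustration.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions and
    substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_distance(a: str, b: str) -> float:
    """Edit distance divided by the length of the longer word (range 0 to 1)."""
    longest = max(len(a), len(b))
    return levenshtein(a, b) / longest if longest else 0.0


def lexical_distance(pairs) -> float:
    """Mean normalized distance over a list of (word_l1, word_l2) pairs."""
    return sum(normalized_distance(x, y) for x, y in pairs) / len(pairs)


# Hypothetical English-German translation pairs, for illustration only.
toy_pairs = [("water", "wasser"), ("night", "nacht"), ("house", "haus")]
print(round(lexical_distance(toy_pairs), 3))
```

Normalizing by word length keeps long words from dominating the average. Published measures typically operate on phonetic transcriptions of standardized word lists rather than on orthography, so a sketch like this only conveys the general idea.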

The correlation between linguistic distance scores and accuracy demonstrates that the effect of L1 on L2 articles goes well beyond the availability of an article/DP in the L1 and suggests influence from a broader range of typological properties. Thus, our study answers our main research question: L2 accuracy is influenced not solely by the availability of an article in the L1 but also by broader L1 properties.
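As an illustration of what a correlation between distance scores and accuracy amounts to, here is a minimal Python sketch. It assumes a Pearson coefficient computed over one (distance, accuracy) pair per L1 group; the text does not specify the coefficient used, and the numbers below are invented placeholders, not the values reported in this study.

```python
from math import sqrt


def pearson_r(xs, ys) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Invented placeholder values, one per hypothetical L1 group.
# These are NOT the study's data.
distance = [0.2, 0.4, 0.5, 0.7, 0.9]       # L1-English distance score
accuracy = [0.93, 0.90, 0.88, 0.80, 0.78]  # mean article accuracy

r = pearson_r(distance, accuracy)
print(f"r = {r:.2f}")
```

A negative coefficient in such a setup would indicate that groups whose L1 is more distant from English tend to show lower article accuracy. With only a handful of L1 groups, any such coefficient should be read with caution.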

Linguistic distance scores had a small effect in the regression model, with the Ceolin et al. (2020) measure not reaching significance. This is not surprising in a model that has proficiency as a predictor: all learners improve with proficiency, whereas L1 influence captures variation among learners of the same proficiency. Proficiency is, therefore, a stronger predictor of accuracy than linguistic distance in the regression model.

While linguistic distance scores capture the broader typological effects, they do not tell us which L1 properties and features are implicated and how they might affect the acquisition process. Since the acquisition of articles involves the syntax–semantics interface as well as discourse and information structure, it is reasonable to expect that a range of other properties might influence acquisition and underpin the typological effects: for example, other properties of the nominal system (e.g. whether the L1 has grammatical number and a count/mass distinction, encoding of specificity and case marking), as well as broader properties (e.g. word order, tense and aspect marking, information structure). Further qualitative empirical research is needed to reveal what underpins these broader typological effects. However, the present study has demonstrated that such broader typological effects are indeed part of the L1 influence on L2. These effects might also partially explain why many advanced learners have difficulty surpassing the 90% accuracy threshold in their use of articles (see Lardiere, 2008; Murakami and Alexopoulou, 2016), as they suggest that the acquisition of one morpheme depends on a broader range of features.

VII Conclusions

Our study provides evidence that L1 influence arises from the combination of item-level L1–L2 differences, e.g. in the availability of an article, and broader properties of the L1 grammar. It shows that L1 influences overall accuracy as well as the rate of accuracy improvement at different proficiency levels. It also provides support for the generative typological distinction between DP and NP languages (Bošković, 2008), indicating that it is the availability of a definite article and a DP that predicts the use of bare nominals in the L1 and, accordingly, article omission in L2 English. Finally, our study highlights the use of big data for SLA research, which has enabled the study of 11 typologically diverse languages, a much broader sample than is available to typical lab-based or fieldwork SLA studies. In this way, corpus studies based on data from educational institutions can complement the standard experimental developmental SLA studies.

At the same time, our study underscores the need for further empirical research, particularly qualitative studies, to gain a more comprehensive understanding of how linguistic distance affects the acquisition of individual L2 items. Future research should pinpoint which specific properties influence acquisition and, crucially, how similarity might lead to negative transfer, hindering rather than facilitating acquisition. An example of this is the transfer of generic interpretations of the definite article from Romance languages to L2 English (see Ionin and Montrul, 2010). To fully grasp the nature of L1 influence in relation to the typological properties of L1 and L2, it is crucial to synthesize insights from targeted developmental studies with broader patterns revealed by large-scale corpus analyses.

Supplemental Material

Supplemental material, sj-docx-1-slr-10.1177_02676583251395876 for The influence of L1 typology on the acquisition of the L2 English article: A large-scale corpus study by Doğuş Öksüz, Theodora Alexopoulou, Kateryna Derkach and Ianthi Maria Tsimpli in Second Language Research

Footnotes

Acknowledgements

The authors would like to thank Thomas Hammond for his support in manual error annotation.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This project is funded by the Leverhulme Trust Research Project Grant ‘Linguistic Typology and Learnability in Second Language’ (RPG-2018-123) and a grant by EF Education First supporting the EF Lab on Applied Language Learning.

ORCID iDs

Doğuş Öksüz

Theodora Alexopoulou

Open Badges Statement

The anonymized data and R scripts are available on our project site on the Open Science Framework (OSF) platform.

Data Availability Statement

No data were collected from participants for this article; we analysed pre-existing data from the EF Cambridge Open Language Database.

Supplemental material

Supplemental material for this article is available online.

Notes

References

Abbas, Degani and Prior (2021) Equal opportunity interference: Both L1 and L2 influence L3 morpho-syntactic processing. Frontiers in Psychology 12: Article 673535.

Alexiadou, Haegeman and Stavrou (2007) Noun phrase in the generative perspective. Berlin: De Gruyter Mouton.

Alexopoulou, Michel, Murakami and Meurers (2017) Task effects on linguistic complexity and accuracy: A large-scale learner corpus analysis employing natural language processing techniques. Language Learning, Special Issue. https://doi.org/10.1111/lang.12232

Atkinson (2011) Phonemic diversity supports a serial founder effect model of language expansion from Africa. Science 332(6027): 346–349. https://doi.org/10.1126/science.1199295

Baker (2021) Extending parametric comparison. Manuscript available via Cambridge Open Engage.

Bates, Maechler, Bolker and Walker (2015) Fitting linear mixed-effects models using lme4. Journal of Statistical Software 67: 1–48.

Bella, Batsuren and Giunchiglia (2021) A database and visualisation of the similarity of contemporary lexicons. In: Ekštein, Pártl and Konopík (eds) Text, Speech and Dialogue (TSD 2021): Lecture notes in computer science: Volume 12848. Cham: Springer, pp. 91–103.

Bošković (2008) What will you have, DP or NP? In: Elfner and Walkow (eds) Proceedings of the Northeast Linguistic Society 37. Amherst, MA: GLSA, University of Massachusetts, pp. 101–14.

Ceolin, Guardiano, Irimia, et al. (2020) Formal syntax and deep history. Frontiers in Psychology 11: Article 488871.

Chrabaszcz and Jiang (2014) The role of the native language in the use of the English nongeneric definite article by L2 learners: A cross-linguistic comparison. Second Language Research 30: 351–79.

Chen (2004) Identifiability and definiteness in Chinese. Linguistics 42: 1129–84.

Cheng LL-S, Heycock and Zamparelli (2017) Two levels for definiteness. In: Erlewine (ed.) Proceedings of GLOW in Asia XI: MIT Working Papers in Linguistics 84. Cambridge, MA: MIT, pp. 79–93.

Chierchia (1999) Linguistics and language. In: Wilson and Keil (eds) The MIT encyclopedia of the cognitive sciences. MIT Press, pp. 454–456.

Crisma, Guardiano and Longobardi (2020) Syntactic diversity and language learnability. Studi e saggi linguistici 58(2): 99–130.

Cysouw (2013) Predicting language learning difficulty. In: Borin and Saxena (eds) Approaches to measuring linguistic differences. Berlin: De Gruyter Mouton, pp. 57–82.

De Wilde, Brysbaert and Eyckmans (2020) Learning English through out-of-school exposure: How do word-related variables and proficiency influence receptive vocabulary learning? Language Learning 70: 349–81.

DeKeyser (2007) Introduction: Situating the concept of practice. In: DeKeyser (ed.) Practice in a second language: Perspectives from applied linguistics and cognitive psychology. Cambridge University Press, pp. 1–18.

Derkach and Alexopoulou (2023) Definite and indefinite article accuracy in learner English: A multifactorial analysis. Studies in Second Language Acquisition 46: 710–40.

Dryer and Haspelmath (eds) (2013) The world atlas of language structures online. Leipzig: Max Planck Institute for Evolutionary Anthropology. Available at: https://wals.info (accessed November 2025).

Dunn, Greenhill, Levinson and Gray (2011) Evolved structure of language shows lineage-specific trends in word-order universals. Nature 473: 79–82.

Ellis (2006) Language acquisition as rational contingency learning. Applied Linguistics 27: 1–24.

Ferri (2007) Bare, generic mass and referential Arabic DPs. In: Karimi, Samiian and Wilkins (eds) Phrasal and clausal architecture: Syntactic derivation and interpretation. Amsterdam: John Benjamins, pp. 40–65.

Franks and Pereltsvaig (2004) Functional categories in the nominal domain. In: Arnaudova, Browne, Rivero and Stojanović (eds) Formal approaches to Slavic linguistics 12: The Ottawa Meeting 2003. Ann Arbor, MI: Michigan Slavic Publications, pp. 109–28.

Gamallo, Pichel and Alegria (2017) From language identification to language distance. Physica A: Statistical Mechanics and its Applications 484: 152–62.

Gillon and Armoskaite (2015) The illusion of the NP/DP divide: Evidence from Lithuanian. Linguistic Variation 15: 69–115.

Gray and Atkinson (2003) Language tree divergence supports the Anatolian theory of Indo-European origin. Nature 426: 435–39.

Huang (1999) The emergence of a grammatical category: Definite article in spoken Chinese. Journal of Pragmatics 31: 77–94.

Ionin and Montrul (2010) The role of L1 transfer in the interpretation of articles with definite plurals in L2 English. Language Learning 60: 877–925.

James, Witten, Hastie and Tibshirani (2013) An introduction to statistical learning: With applications in R. New York: Springer.

Jarvis (2000) Methodological rigor in the study of transfer: Identifying L1 influence in the interlanguage lexicon. Language Learning 50: 245–309.

Jarvis and Pavlenko (2008) Crosslinguistic influence in language and cognition. New York: Routledge.

Kellerman (1983) Now you see it, now you don’t. In: Gass and Selinker (eds) Language transfer in language learning. Newbury House, pp. 112–134.

Kornfilt (2017) DP versus NP: A cross-linguistic typology? In: McClure and Vovin (eds) Studies in Japanese and Korean historical and theoretical linguistics and beyond. Leiden: Brill, pp. 138–58.

Kornfilt (2018) NP versus DP: Which one fits Turkish nominal phrases better? Turkic Languages 22: 155–66.

Köylü (2021) An overview of the NP versus DP debate. Language and Linguistics Compass 15: Article e12406.

Kuznetsova, Brockhoff and Christensen RHB (2017) lmerTest package: Tests in linear mixed effects models. Journal of Statistical Software 82: 1–26.

Lago, Mosca and Garcia (2021) The role of crosslinguistic influence in multilingual processing: Lexicon versus syntax. Language Learning 71: 163–92.

Lardiere (2008) Feature assembly in second language acquisition. In: Liceras, Zobl and Goodluck (eds) The role of formal features in second language acquisition. Lawrence Erlbaum Associates, pp. 106–140.

Longobardi and Guardiano (2009) Evidence for syntax as a signal of historical relatedness. Lingua 119: 1679–1706.

Lyutikova and Pereltsvaig (2015) The Tatar DP. Canadian Journal of Linguistics / Revue Canadienne de Linguistique 60: 289–325.

Marcus, Santorini and Marcinkiewicz (1993) Building a large annotated corpus of English: The Penn Treebank. Computational Linguistics 19: 313–30.

Murakami (2016) Modelling systematicity and individuality in nonlinear second language development: The case of English grammatical morphemes. Language Learning 66: 834–71.

Murakami and Alexopoulou (2016) L1 influence on the acquisition order of English grammatical morphemes: A learner corpus study. Studies in Second Language Acquisition 38: 365–401.

Nichols (2003) Diversity and stability in language. In: Joseph and Janda (eds) The handbook of historical linguistics. Blackwell Publishing, pp. 283–310. https://doi.org/10.1002/9780470756393.ch5

Odlin (1989) Language transfer: Crosslinguistic influence in language learning. Cambridge: Cambridge University Press.

Ooms (2020) Cld2: Google’s compact language detector 2: R package version 1.2.1 [software]. Available at: http://cran.r-project.org/package=cld2 (accessed November 2025).

Pica (1983) Adult acquisition of English as a second language under different conditions of exposure. Language Learning 33: 465–97.

Progovac (1998) Determiner phrase in a language without determiners (with apologies to Jim Huang 1982). Journal of Linguistics 34: 165–79.

R Core Team (2023) R: A language and environment for statistical computing [software]. Vienna: R Foundation for Statistical Computing. Available at: http://www.R-project.org (accessed November 2025).

Roberts (2019) Parameter hierarchies and universal grammar. Oxford University Press.

Ruhlen (1991) A guide to the world’s languages: Classification. Stanford, CA: Stanford University Press.

Salzmann (2020) The NP vs. DP debate: Why previous arguments are inconclusive and what a good argument could look like: Evidence from agreement with hybrid nouns. Glossa: A Journal of General Linguistics 5: Article 83.

Schachter (1974) An error in error analysis. Language Learning 24: 205–214. https://doi.org/10.1111/j.1467-1770.1974.tb00502.x

Schepens, Dijkstra and Grootjen (2011) Distribution of cognates in Europe as based on Levenshtein distance. Bilingualism: Language and Cognition 15: 157–66.

Schepens, Van der Slik and Van Hout (2013) The effect of linguistic distance across Indo-European mother tongues on learning Dutch as a second language. In: Borin and Saxena (eds) Approaches to measuring linguistic differences. Berlin: De Gruyter Mouton, pp. 199–230.

Schepens, Van der Slik and Van Hout (2015) L1 and L2 distance effects in learning L3 Dutch. Language Learning 66: 224–56.

Schepens, Van Hout and Jaeger (2020) Big data suggest strong constraints of linguistic similarity on adult language learning. Cognition 194: Article 104056.

Schwartz and Sprouse (1996) L2 cognitive states and the full transfer/full access model. Second Language Research 12: 40–72.

Shatz (2020) Refining and modifying the EFCAMDAT: Lessons from creating a new corpus from an existing large-scale corpus English language learner database. International Journal of Learner Corpus Research 6: 220–36.

Trask (2000) Time depth in historical linguistics. Cambridge: McDonald Institute for Archaeological Research.

Westergaard (2021) Microvariation in multilingual situations: The importance of property-by-property acquisition. Second Language Research 37: 379–407.

Wickham (2022) stringr: Simple, Consistent Wrappers for Common String Operations: R package version 1.4.0 [software]. Available at: http://cran.r-project.org/package=stringr (accessed November 2025).

Wickham, François, Henry and Müller (2022) dplyr: A grammar of data manipulation: R package version 1.0.9 [software]. Available at: http://cran.r-project.org/package=dplyr (accessed November 2025).

van der Slik, van Hout and Schepens (2017) The role of morphological complexity in predicting the learnability of an additional language: The case of La (additional language) Dutch. Second Language Research 35(1): 53–76. https://doi.org/10.1177/0267658317691322
