To Collocate or not to Collocate: Exploring Verb-Noun Collocations of Turkish EFL Learners

Abstract

A growing body of learner corpus research underscores the pivotal role of collocational competence in achieving advanced L2 proficiency. Verb–noun collocations have been identified as particularly challenging for L2 learners, largely due to their semantic opacity and combinatorial constraints. However, despite this importance and challenge, relatively little is known about how Turkish learners of English acquire and use verb–noun collocations across different proficiency levels. This study aims to investigate the production of verb-noun collocations by Turkish learners of English across three Common European Framework of Reference for Languages proficiency bands (A1-A2; B1-B2; C1-C2), focusing on their distribution, appropriateness, frequency in the British National Corpus. Also, collocational error types and patterns that persist across proficiency levels have also been examined. The analysis draws on data from the Cambridge Learner Corpus. Salient collocations were identified through frequency-based metrics (t-score and Mutual Information) and cross-referenced with the British National Corpus to assess native-likeness. In addition, collocational errors were manually annotated and classified by type. Findings suggest that Turkish learners of English as a foreign language often struggle with verb-noun collocations and produce incorrect collocations even at advanced level. The frequency of collocations decreases as the proficiency level increases, reflecting Turkish learners’ repetitive use of a limited set of verb–noun combinations at lower proficiency levels. In addition, learners mostly make verb-related (replacement, tense, form) and determiner errors that persist from A1 to C2 level. These results highlight the need for pedagogical interventions that explicitly target collocational development and underscore the value of learner corpus data in informing such approaches.

Plain Language Summary

Why and How Turkish Learners of English Struggle with Verb–Noun Word Combinations: Insights From Learner Language Data and What This Means for Teaching Vocabulary

In English, certain verbs and nouns are commonly used together, such as “make a decision” or “take a risk.” These combinations are called collocations. For people learning English as a foreign language, collocations are important because they make speech and writing sound more natural. However, many learners find them difficult to master. This study looked at how Turkish learners of English use verb–noun collocations, focusing on how their usage changes as they become more advanced. The Turkish subcorpus of the Cambridge Learner Corpus which contains exam scripts written by Turkish EFL learners at different proficiency levels, ranging from beginner (A1–A2) to advanced (C1–C2) was used as data. Learners’ collocation uses were compared to those found in a large database of native English (the British National Corpus) to see how close and distinct learners’ choices were to native usage. The study also examined the kinds of mistakes learners made while producing collocations. The results showed that Turkish learners often use collocations incorrectly, even at higher proficiency levels. While beginners and intermediate learners produced more collocations overall, advanced learners used fewer, perhaps because they avoided using structures they were less confident about. The most common problems involved choosing the wrong verb, using the wrong verb form or tense, or making errors with determiners (e.g., using “do a mistake” instead of “make a mistake”). These errors appeared across all levels. Learners showed little progress with collocations as their overall English improved, partly because they translated word-for-word from Turkish. These mistakes arise both from the complexity and unpredictability of collocations in English and from external factors such as limited exposure to authentic language or differences in educational background. These findings suggest that simply advancing in general language skills does not automatically lead to mastery of collocations.

Keywords

learner corpus research,verb-noun collocations Cambridge Learner Corpus Turkish EFL learners collocational competence A1-C2 proficiency levels

Introduction

The pivotal role of multiword units, including collocations, idioms, and other formulaic sequences, has long been firmly established in both theoretical and empirical accounts of second language acquisition. Scholars have also highlighted the challenges involved in learning these expressions, particularly regarding their frequency, transparency, and native-like use. Since Brown’s (1974) foundational work, this area has continued to attract scholarly attention, as reflected in Granger’s (1998) research on learner corpora, Sinclair’s (1991) idiom principle, and Wray’s (2002) exploration of formulaic language.

Multiword units constitute a core component of lexical competence in L2 learners, serving both a psycholinguistic and communicative function. They facilitate automatic language processing by allowing high-frequency word combinations to be retrieved as unified wholes (Howarth, 1998; Nattinger & DeCarrico, 1992; Wray, 2002), thereby contributing to increased fluency, idiomaticity, and pragmalinguistic appropriateness. As such, they are indispensable not only for efficient real-time processing but also for achieving stylistically appropriate, context-sensitive language production.

The mastery of the idiomatic patterns is frequently cited as a critical threshold distinguishing intermediate learners from advanced users (Thornbury, 2002). Research has demonstrated that the ability to produce native-like multiword combinations, particularly collocations, marks a shift toward more proficient language use, enhancing both accuracy and communicative efficacy (Boers et al., 2006; Wray, 2002).

Among these patterns, verb–noun collocations are of particular interest due to their inherent semantic constraints and limited substitutability, which make them especially difficult for L2 learners to acquire (Nesselhauf, 2005). Verb–noun collocations have been identified as particularly challenging for L2 learners, largely due to their semantic opacity and combinatorial constraints (Peng, 2016). For native speakers, collocations such as break the law, violate the law, and violate someone’s privacy are deeply entrenched in the mental lexicon and accessed holistically (Pawley & Syder, 1983). In contrast, L2 learners, particularly in EFL contexts, often fall back on literal or L1-influenced constructions (e.g., break someone’s privacy), resulting in semantically deviant or stylistically marked output. The acquisition of collocational knowledge not only reduces cognitive load during production but also enables more accurate alignment with native speaker norms in terms of register and stylistic naturalness (Conklin & Schmitt, 2008; Nattinger & DeCarrico, 1992; Wray, 2002).

As Lewis (2000, p. 15) succinctly puts it, “Fluency relies on the internalization of a substantial repository of fixed or semi-fixed prefabricated items, which serve as the basis for any subsequent linguistic innovation or creativity.” Numerous studies have corroborated this view, underscoring collocational competence as one of the primary distinguishing features between native and non-native speaker performance (Aston, 1995; Fillmore, 1979; Kjellmer, 1991). Errors in collocational usage remain one of the most salient markers of non-nativeness, often persisting well beyond the intermediate stages of language development. As Korosadowicz-Strużyńska (1980, p. 115) aptly observes, “errors in the use of word collocations surely add to the foreign flavor in the learner’s speech and writing and, along with faulty pronunciation, they are the strongest markers of “an accent.”

The use of corpora and corpus analysis techniques enable researchers to gather evidence of both accurate and faulty collocations, as well as instances of underuse or overuse by contrasting learner production with established native-speaker norms. Such comparative analyses allow for the identification of overuse, underuse, and misuse patterns, thereby offering a more nuanced understanding of the lexical behaviors that distinguish non-native from native writers.

Verb–noun collocations are widely acknowledged as one of the most challenging aspects of second language acquisition, mainly because of their semantic opacity and strict combinatorial constraints. Mastery of such collocations is crucial for achieving natural and native-like proficiency, yet learners often struggle with their accurate use. Despite this importance and difficulty, research on Turkish learners of English has not sufficiently explored how verb–noun collocations are acquired and used across different proficiency levels. This lack of evidence limits the understanding of the developmental trajectory of collocational competence in this learner group. Addressing this gap, the present study investigates the production of verb–noun collocations by Turkish learners of English at three Common European Framework of Reference for Languages (CEFR) proficiency bands (A1–A2; B1–B2; C1–C2). Specifically, it examines their distribution, appropriateness (native-likeness), frequency in comparison with the British National Corpus (BNC). The most common collocational errors and error patterns that persist across levels have also been investigated. The analysis is based on Turkish subcomponent of the Cambridge Learner Corpus (CLC), which provides a large-scale collection of learner writing.

By comparing the three learner groups, A1-A2; B1-B2; C1-C2, according to Common European Framework of Reference for Languages (CEFR) to one another, the study also aims to uncover developmental patterns in collocational competence. In this regard, the research is guided by the following questions:

What is the frequency distribution of verb-noun collocations across CEFR levels A1-C2?

To what extent do Turkish EFL learners produce appropriate verb-noun collocations in their writing?

How frequent are these collocations in British National Corpus (BNC) across proficiency levels?

What are the most frequent error types associated with these constructions?

Which collocation errors, if any, persist across proficiency levels from A1 to C2??

Review of Literature

Definition of Collocations

The term collocation has been variously defined within the literature, reflecting a continuum between phraseological and statistical traditions. Aisenstadt (1981, p. 53) characterizes collocations as “combinations of two or more words whose components are used in a non-idiomatic way, following certain structural patterns, and restricted in their commutability by grammatical and semantic valency and usage.” This definition underscores the semi-fixed nature of collocations: they are neither entirely free nor wholly fixed, exhibiting a degree of syntagmatic constraint. Similarly, Cowie (1988) distinguish collocations from other multiword units such as idioms, emphasizing their partial transparency and restricted substitutability.

Sinclair (1991) offers a more distributional perspective, defining collocations as “the occurrence of two or more words within a short space of each other in a text” (p. 170). This spatial parameter, or span, typically comprises four words to the left and right of a central lexical item or node, allowing for the identification of statistically salient co-occurrences.

From a methodological standpoint, two broad approaches have predominated. The first, grounded in phraseology, focuses on the semantic and structural properties of multiword combinations, typically categorizing them along a continuum ranging from free combinations to fixed idioms (Nesselhauf, 2005). The second adopts a frequency-driven, statistical framework, wherein collocations are identified based on the strength of association between constituent items, as evidenced through large corpus data (Lee & Shin, 2021; Liu & Afzaal, 2021; Nguyen & Webb, 2016). While the phraseological approach allows for fine-grained semantic analysis, it suffers from subjectivity and labor-intensive validation. In contrast, statistical methods enable large-scale analysis but risk extracting function word sequences with little pedagogical value (e.g., and the, of the) if not filtered appropriately (Nizonkiza, 2017; Szudarski & Carter, 2016).

An increasing number of studies now seek to reconcile these traditions by integrating both frequency-based association measures such as t-score, z-score, and Mutual Information (MI) and native-speaker judgment (Durrant & Schmitt, 2009; Gablasova et al., 2017; Granger & Bestgen, 2014). Despite such methodological advances, the identification and validation of native-like collocations in learner corpora remains a formidable challenge. Phraseological evaluations require subjective and context-sensitive interpretation, while frequency-based assessments demand computationally intensive comparisons against large native-speaker reference corpora.

Moreover, existing literature has disproportionately focused on intermediate and advanced learners, with limited attention paid to beginning-level language users (Siyanova-Chanturia, 2015). Similarly, large-scale, proficiency-stratified learner corpora remain underutilized, despite their potential to shed light on developmental trajectories in collocational competence across CEFR bands.

Present study adopts a phraseological approach to collocations, drawing upon the foundational work of Aisenstadt (1981), Cowie (1981), and Mel’čuk (1998), as well as subsequent scholars operating within this framework (e.g., Gläser, 1998; Howarth, 1998; Nesselhauf, 2005). Within this tradition, collocations are conceptualized as recurrent lexical combinations that are semantically transparent yet characterized by restricted lexical co-occurrence. As Nesselhauf (2005, p. 24) asserts, “collocation is considered the co-occurrence of words at a certain distance,” wherein the relationship between constituents is governed by lexico-grammatical constraints rather than idiomatic opacity.

This study adopts Nesselhauf’s (2005) operational definition of verb–noun collocations, focusing on combinations in which the verb selects a limited range of nouns either due to its semantically specific sense (Criterion 1) or because it fails to combine with all syntactically acceptable noun partners (Criterion 2). Hence, while fully compositional expressions such as read a newspaper do not qualify, combinations like take a picture, where the verb take displays restricted lexical selectivity, are treated as prototypical collocations. Applying these criteria to learner data from the Cambridge Learner Corpus (CLC) ensures the identification of multiword units that exhibit the kinds of substitutability constraints typical of entrenched native-speaker usage. By drawing on Nesselhauf’s (2005) taxonomy, the methodology is anchored in a solid theoretical framework of multiword sequences and is equipped with precise operational criteria for identifying learner errors in collocation choice. This study aims to investigate verb + noun collocations of Turkish EFL learners. In the analysis, all verb–noun combinations were analyzed regardless of the noun’s syntactic role (e.g., object, complement, or adverbial) and irrespective of the presence of any additional clause elements. Also, combinations (not necessarily collocations) such as wage war (VO) and cope with a problem (VPO), combinations such as look out of the window (VA), call sb. a genius (VOC), take sth. into consideration (VOPO) or force teachers to+inf (VO+ to +inf) have also been considered. So, the combinations that have been taken into account in the present study is as follows:

Verb + Object (VO) (e.g., wage war)

Verb + Prepositional Object (VPO) (e.g., cope with a problem)

Verb + Adverbial (VA) (e.g., look out of the window)

Verb + Object + Complement (VOC) (e.g., call someone a genius)

Verb + Object + Prepositional Object (VOPO) (e.g., take something into consideration)

Verb + Prepositional Phrase (VPP) (e.g., go out of control)

(Nesselhauf, 2005, p. 47-48)

Previous Studies on Collocations

In one of the earliest cross-linguistic investigations, Kaszubski (2000) analyzed frequently occurring verbs such as be, have, make, take, do, and get and their collocational patterns across multiple L2 English corpora, including those of Polish, Spanish, and French learners, in addition to a native-speaker corpus. His findings revealed systematic overuse of certain collocations by learners and a corresponding underuse of more idiomatic, native-like combinations, thereby illustrating the tendency of learners to rely on a limited repertoire of collocational structures.

Granger (1998), in her seminal study on the use of -ly intensifier + adjective combinations in academic writing, compared advanced French EFL learners to native English writers. She distinguished between two functional types of intensifiers: maximizers (e.g., absolutely, totally) and boosters (e.g., deeply, highly). While learners exhibited usage rates of maximizers comparable to those of native speakers, their use of boosters was significantly lower. Moreover, when these combinations were subjected to acceptability judgments by both native and non-native informants, the latter group demonstrated a marked insensitivity to collocational salience. Granger concluded that, although advanced learners can produce collocations, they tend to avoid or misuse native-like expressions, resorting instead to atypical or marginal combinations.

In a similar vein, Altenberg and Granger (2001) examined the lexical behavior of the verb make across three corpora: French learners, Swedish learners, and native English speakers. They aimed to identify cross-linguistic differences and native-like norms in collocational usage. While learners overused some collocations constructed with core verbs like be, have, make, etc. or particular amplifiers such as very, completely, highly, strongly, they frequently failed to employ native-like combinations. Learners also produced a notable number of collocational errors even though their grammar and individual word choices were typically accurate.

Nesselhauf (2005), working with the German component of the International Corpus of Learner English (ICLE), conducted one of the most comprehensive studies on verb–noun collocations. Analyzing approximately 2,000 collocational instances in a 154,191-word learner corpus, she reported that nearly one quarter were erroneous, with an additional third considered questionable by expert raters. Importantly, around 50% of the errors were attributed to L1 influence. Intriguingly, the error rate remained unaffected by dictionary consultation or time constraints, suggesting either that learners failed to recognize collocational errors or that reference tools did not provide adequate support. Nesselhauf concludes that the acquisition of collocations remains one of the most persistent challenges for L2 learners, irrespective of their general proficiency.

In another study, Siyanova and Schmitt (2008) first extracted every adjective–noun combination from Russian learners of English and native speakers, then assessed learners’ productive use by counting occurrences of frequent collocations. They also administered psycholinguistic judgment tasks, requiring participants to classify presented collocations as nativelike or atypical and measuring response times. They found that advanced English learners produced a comparable number of adjective–noun collocations to native speakers, with roughly half of these combinations qualifying as frequent collocations in both groups. However, when asked to judge whether presented combinations were nativelike or atypical, learners performed significantly worse than natives. In other words, despite matching native speakers in productive output, advanced learners still struggled to recognize and evaluate collocational patterns accurately, often accepting or producing atypical combinations.

Durrant and Schmitt (2009) compared collocational usage in academic essays written by native and non-native English speakers by analysing premodifier–noun word pairs extracted from a corpus of longer texts from both groups. Using frequency-based statistical measures, specifically t-scores to gauge high-frequency pair strength and Mutual Information (MI) to identify semantically tight but less frequent combinations, the researchers categorized each word pairing according to collocational strength. Results indicated that native writers employed a larger proportion of low-frequency, high-MI collocations, reflecting a broader and more nuanced lexical repertoire, whereas non-native writers relied predominantly on high-frequency collocations with elevated t-scores and underused high-MI pairs. Moreover, non-native writers demonstrated a tendency to repeat a narrow set of favored collocations, suggesting a conservative approach to lexical selection.

Laufer and Waldman (2011), analyzing 759 argumentative essays from the Israeli Learner Corpus of Written English (ILCoWE), undertook a large-scale comparison with the Louvain Corpus of Native English Essays (LOCNESS). Their investigation centered on verb–noun collocations across proficiency levels (basic, intermediate, and advanced), with attention to both quantitative usage and qualitative accuracy. Findings demonstrated a marked underuse of verb–noun collocations among all learner groups relative to the native-speaker baseline. While advanced learners attempted a greater number of collocations, this increase did not correspond to improved accuracy. On the contrary, roughly one-third of collocational attempts were erroneous, and error rates persisted across proficiency levels. Interestingly, advanced learners exhibited higher error rates not because of weaker competence but because of their increased willingness to attempt more collocations, suggesting that the acquisition of collocational accuracy does not necessarily develop in parallel with increased output.

In another study, Gao et al. (2019) investigated the use of verb–noun collocations in the written productions of Chinese EFL learners, drawing on data from the TECCL (Ten-thousand English Composition of Chinese Learners) corpus. The study included compositions from both middle school and university-level students. Collocational instances were manually extracted and subjected to a rigorous validation process incorporating dictionary references, native-speaker judgments, and corpus-based benchmarks. Each collocation was then categorized according to its structural and functional properties. Results revealed an accuracy rate of 79% among middle school learners and 83% among university students, indicating modest gains with increased educational exposure. The most recurrent errors stemmed from the inappropriate selection of either the verb or the noun constituent, with L1 transfer and lexical insufficiency in the L2 identified as primary explanatory factors.

Furhermore, Boone et al. (2023) tested 50 Dutch native speakers learning German on their productive knowledge of 35 target collocations at the end of each academic year using form-recall tasks and mixed-effects logistic regression; predictors included L1–L2 congruency, corpus frequency, association strength (MI), imageability, productive vocabulary size, and a five-month study-abroad stint. Participants’ accuracy rose from about 36% in year 1 to 55% in year 3, with congruent collocations initially almost 12 times more likely to be recalled correctly but showing a shallower learning curve over time, whereas incongruent items, though harder at first, were learned more steeply. Productive vocabulary size modestly boosted acquisition odds, while frequency, MI, imageability, and short-term study abroad had no significant effects.

Likewise, Du et al. (2022) worked on 18,000 EFL essays from the The EF-Cambridge Open Language Database (EFCAMDAT) at three language proficiency. They extracted ‘make/take + noun’ sequences, of which 2,501 were identified as collocations; each collocation’s noun was then annotated for semantic field (USAS), difficulty level (EVP), and orthographic length (LEN). Results of the analyses showed that beginners predominantly used concrete, everyday collocations, while advanced learners favored abstract and social/psychological fields. The findings also indicated that as learners progress, they combine common verbs with semantically richer, more challenging, and longer noun collocates.

In a more recent study, Akbaş and Yükselir (2024) compared adjective–noun combinations in a learner corpus of 341 Turkish EFL essays (TUWE, ~270 K words) against a native-speaker corpus of 325 K words (LOCNESS), focusing on seven high-frequency adjectives (good/bad, positive/negative, important/significant/crucial) and their immediate noun collocates. By comparing these collocations, they found that L2 writers produced a much wider range of adjective–noun strings (742 vs. 191 in L1) but only eight of 57 L2 strings matched conventional L1 usage; the vast majority were unconventional, often reflecting direct L1 transfers.

The reviewed studies converge on the conclusion that collocations represent a persistent area of difficulty for L2 learners, irrespective of their native language, length of instruction, or proficiency level. While learners may demonstrate receptive familiarity with certain collocations, their productive command often remains limited. Research consistently shows that collocational competence develops at a slower pace than vocabulary breadth and that the principal challenge lies not in recognition, but in accurate deployment. Given their centrality to idiomaticity and fluency, collocations merit greater pedagogical focus within the broader trajectory of lexical acquisition in second language learning.

Methodology

Corpus

The present study draws on the Cambridge Learner Corpus (CLC), which constitutes the principal data source for the analysis. The CLC is currently the most extensive annotated corpus of learner English, comprising over 55 million words extracted from more than 200,000 examination scripts compiled by Cambridge University Press and Cambridge English Language Assessment since 1993. These scripts represent learners from 217 countries or territories, covering 148 distinct first-language backgrounds. All essays have been fully transcribed, preserving every learner error, and a substantial proportion of the corpus has been manually annotated using a fine-grained tagging scheme comprising 88 error categories specifically devised for the CLC.

For the purposes of this research, a sub-corpus of texts produced by Turkish learners of English was extracted, thereby enabling quantitative analyses of verb–noun collocations within a richly annotated, large-scale learner corpus. The breakdown of Turkish EFL learner sub-corpora may be seen in Table 1 below.

Table 1.

Breakdown of Turkish EFL Learner Sub-corpora.

Mother tongue	Corpus size	Exam	Years	Exam format
		CPE		Note / Email / Memo
		CAE		Story
	A1–A2: 34,812	FCE		Report
Turkish	B1–B2: 145,873	FCEfs	1993–2013	Informative / Instructional Text
	C1–C2: 87,421	PET		Article
		PETfs		Proposal
	Total: 268,106	KET		Informative / News
		KETfs

The Turkish sub-corpus of the Cambridge Learner Corpus (CLC) comprises a total of 268,106 words, stratified according to CEFR proficiency levels (Council of Europe, 2001). 34,812 words at A1–A2, 145,873 words at B1–B2, and 87,421 words at C1–C2. All texts were collected between 1993 and 2013 from examination scripts of Turkish EFL learners, encompassing eight different Cambridge English assessments (CPE, CAE, FCE, FCE First Series, PET, PET First Series, KET, and KET First Series). Each script was elicited under standardized test conditions with prompts corresponding to a variety of written task types including notes, emails, memos, stories, reports, informative or instructional texts, articles, proposals, and news-style compositions. This comprehensive dataset allows for a granular examination of verb–noun collocation use across proficiency bands, while controlling for task type and exam context.

SketchEngine: Corpus Management and Analysis Tool

The corpus used in this study, the Cambridge Learner Corpus (CLC), is available in Sketch Engine under a special agreement with Cambridge University Press. All corpus queries and data extraction procedures in the present study were conducted using Sketch Engine, a state-of-the-art corpus management and analysis platform developed by Lexical Computing Ltd. Since its inception in 2003, Sketch Engine has served as a robust tool for the interrogation of large-scale corpora through linguistically-informed search algorithms and customizable query protocols. The system integrates three core components: Manatee, a back-end corpus indexing and database engine; Bonito, a web-based query interface facilitating the execution of complex search strings; and Corpus Architect, a utility for corpus compilation and metadata management.

The Cambridge Learner Corpus (CLC), the primary data source for this study, is accessible via Sketch Engine under a bespoke licensing agreement with Cambridge University Press. All linguistic searches and retrieval operations were performed using Corpus Query Language (CQL), a regular-expression-based syntax enabling highly granular pattern identification within annotated corpora.

Specifically, the Wordlist function was employed to extract the most frequently occurring noun types in the Turkish sub-corpus. Subsequently, the Keywords function was used to isolate verb–noun collocations, based on predefined co-occurrence criteria, which were then subjected to further statistical analysis using association measures described below.

MI Score and T-score

Collocations extraction involved computing association measures to quantify collocational strength. These metrics identify word pairs that occur together with a probability exceeding chance, as described by Manning and Schütze (1999). Each measure compares the observed co-occurrence frequency of a candidate collocation with its expected frequency and distinguish genuine collocations from random co-occurrences. The two most prevalent association measures in lexicography are the t-score and Mutual Information (MI) (Evert, 2004). However, these measures privilege different types of word pairs: t-scores tend to surface very frequent collocations, whereas MI highlights less common combinations whose constituent words rarely appear independently (Stubbs, 1995). For example, high-frequency pairs such as good example, long way, and hard work receive elevated t-scores but modest MI values, while semantically tighter yet infrequent pairs like ultimate arbiter, immortal souls, and tectonic plates exhibit the inverse pattern. To capture both high-frequency and semantically salient collocations, both t-score and MI have been employed in the present study.

In line with established conventions, threshold values of a t-score ≥ 2 and/or an MI score ≥ 3 have been proposed as minimum criteria for identifying collocational relationships (Stubbs, 1995). According to Hunston (2002), an MI score of 3 or above is often taken to indicate a significant collocation threshold. These criteria serve as the baseline for identifying verb–noun pairings as statistically meaningful collocations within the Turkish learner sub-corpus.

Identification of Errors in the Collocations

Error identification in the present study draws upon the error-annotation schema developed for the Cambridge Learner Corpus (CLC), a system designed to capture the full spectrum of interlanguage deviations in learner production (Cambridge University Press, 2014). All verb–noun collocational errors were retrieved through corpus queries conducted in Sketch Engine and subsequently classified according to this standardized taxonomy of error codes.

Table 2 provides a comprehensive overview of the error types observed in the dataset. These include both morphosyntactic and lexico-grammatical deviations, encompassing errors in verb selection, tense, agreement, word formation, and determiner/preposition usage, among others. A total of 26 distinct error categories were identified in Turkish EFL learners’ use of verb–noun collocations.

Table 2.

Types of Errors in Collocations.

Error code	Error
UD	Unnecessary determiner
UT	Unnecessary preposition
RV	Replace verb
S	Spelling error
TV	Incorrect tense of verb
AGV	Verb agreement error
RN	Replace noun
FV	Wrong verb form
RJ	Replace adjective
AGN	Noun agreement error
MV	Missing verb
MD	Missing determiner
IV	Incorrect verb inflection
RT	Replace preposition
MY	Missing adverb
FN	Wrong noun form
MQ	Missing quantifier
MN	Missing noun
MA	Missing anaphor
MT	Missing preposition
RD	Replace determiner
UV	Unnecessary verb
SX	Spelling confusion
UN	Unnecessary noun
DJ	Derivation of adjective error
DN	Derivation of noun error

To illustrate the application of this classification framework, Table 3 presents examples of selected error types extracted from learner scripts. These annotated instances demonstrate the nature of the deviation and the corrected form, highlighting the most salient error categories.

Table 3.

Type of Errors and Examples.

Type of errors	Examples
RV (Replace Verb)	2652417_9 My dad can <#RV> get \| bring </#RV> you to my house.
UD (Unnecessary Determiner)	570821_2 I spend most of my free time <#UD> my </#UD> using it.
TV (Incorrect Tense of Verb)	568724_1 As you <#TV> had asked \| asked </#TV> me in your letter.

Note. Bold words indicate the errors and corrections.

This typology enables the systematic identification of error patterns in learner collocational usage and facilitates the fine-grained analysis of error persistence across proficiency levels. The annotated corpus thus serves not only as an empirical foundation for learner corpus research but also as a pedagogical resource for understanding recurrent lexico-grammatical difficulties in verb–noun collocations.

Procedure

The analytical procedure consisted of several successive stages aimed at identifying, validating, and categorizing verb–noun collocations produced by Turkish EFL learners across three CEFR-aligned proficiency bands (A1–A2, B1–B2, C1–C2).

As a preliminary step, the Wordlist function of Sketch Engine was employed to extract the most frequently used nouns across the Turkish subcorpus of the Cambridge Learner Corpus (CLC). The top 20 nouns, ranked by relative frequency, are presented in Table 4. These nouns were selected on the premise that their high frequency in learner writing suggests familiarity not only with their meanings but also with their collocational behavior.

Table 4.

The Most Frequent Nouns in Turkish Learner Corpus.

Item	Frequency	Relative frequency
Friend	2,960	3,729.95621
People	1,861	2,345.08396
Time	1,847	2,327.44227
Day	1,751	2,206.47072
House	1,449	1,825.91437
School	1,318	1,660.83861
Concert	1,196	1,507.10393
Letter	1,165	1,468.0402
Thing	1,163	1,465.51996
Sport	1,145	1,442.83779
Love	1,107	1,394.95322
Phone	1,088	1,371.01093
Year	1,007	1,268.94118
Computer	989	1,246.25902
Club	953	1,200.89469
Life	907	1,142.92915
Game	864	1,088.74398
Class	834	1,050.94036
Hour	804	1,013.13675
Place	803	1,011.87663

The list of words received consisted of high-frequency words (e.g., friend, people, time, day, house…etc.). Selecting the high frequency nouns was based on the assumption that Turkish EFL learners know not only the meaning of these words but also their usage, including their collocations.

After the noun forms were extracted, the concordances of these nouns were created in the Keywords function of SkethcEngine so that verb-noun combinations could be identified and extracted. Threshold values of a t-score ≥ 2 and/or an MI score ≥ 3 have been utilized as minimum criteria for identifying collocations. In addition, collocations that have a minimum frequency of 5 and occur at least 3 different texts have been selected to enhance validity and mitigate text-dependent idiosyncrasies. Manual post-processing was conducted to eliminate redundant, anomalous, or grammatically implausible pairings.

In the next step, two types of sources were used to determine the degree of acceptability of the combinations that had been extracted from the corpus: dictionaries and British National Corpus (BNC). Two reference dictionaries were consulted to assess the acceptability of the extracted combinations:

The BBI Dictionary of English Word Combinations (Benson et al., 2009)

Longman Collocations Dictionary and Thesaurus (6th Edition).

A similar procedure of verification of collocations was used by Barfield (2007), Nesselhauf (2005), Wang and Shaw (2008) and Laufer and Waldman (2011). After checking the dictionaries, the structures were also consulted to the BNC which contains 100 million words, to determine the frequency band of each collocation. Using the BNC frequency information, we split the collocations into five frequency bands:

Band 0: Not attested in BNC

Band 1–5: Very low frequency

Band 6–20: Low frequency

Band 21–100: Moderate frequency

Band >100: High frequency

All the combinations that did not occur in any of the dictionaries or BNC were categorised as non-attested. Once the entire learner corpus had been examined, each proficiency-level subcorpus was independently analyzed to permit comparison of collocations produced by learners. And lastly, erroneous collocations have been identified, classified and traced according to CEFR levels. Persistent errors from A1 to C2 proficiency level have also been identified.

Findings

RQ1. What Is the Frequency Distribution of Verb–Noun Collocations Across CEFR Levels A1-C2?

The distribution of collocations produced by Turkish EFL learners at three proficiency bands has been provided in the table below. To allow meaningful cross-level comparisons, raw frequencies have been normalized per 100,000 words (Table 5).

Table 5.

The Frequency of Collocations According to Proficiency Bands.

CEFR level	Corpus size	Total number of collocations	f/100,000
A1– A2	34,812	244	7,009
B1 – B2	145,873	674	4,620
C1 – C2	87,421	208	2,379

As the normalized frequencies indicate, collocation use diminishes progressively with increasing proficiency. A1–A2 learners produced the highest normalized frequency of verb–noun collocations (7,009 per 100,000 words), followed by B1–B2 learners (4,620), while C1–C2 learners produced the fewest (2,379). This downward trend is notable, as it challenges the commonly held assumption that collocational density increases with proficiency, potentially reflecting greater caution or lexical selectivity among more advanced learners. The reduction observed at higher levels does not indicate weaker collocational competence; rather, advanced learners shift toward more accurate, discourse-level phraseology.

To further illuminate learner performance, the study examined the proportion of correct versus erroneous collocation uses at each level. The results are displayed in Table 6.

Table 6.

Correct and Erroneous Uses of Collocations Across Levels.

CEFR level	Correct use of collocations	Erroneous collocations	Total number	f/100
A1-A2	189	55	244	%22.5
B1-B2	534	140	674	%20.7
C1-C2	175	33	208	%15.8
Total	898	228	1126	%20.2

The data reveal a gradual decline in collocational error rates with increasing proficiency. While 22.5% of the collocations at A1–A2 were erroneous, this proportion decreased to 20.7% at B1–B2 and further to 15.8% at C1–C2. Despite the relatively small size of the C1–C2 subcorpus, the downward trajectory in error rates appears consistent, supporting the view that collocational accuracy improves over time. Nevertheless, across the entire corpus, approximately one in five collocations (20.2%) was used inaccurately, underscoring the persistent challenge collocations pose for L2 learners.

RQ2. To What Extent Do Turkish EFL Learners Produce Appropriate Verb–Noun Collocations in Their Writing?

Following the identification of correct and erroneous verb–noun collocations, all correct instances were subjected to a validation process using two reference sources: (1) The BBI Dictionary of English Word Combinations and The Longman Collocations Dictionary and Thesaurus, and (2) the British National Corpus. Table 7 summarizes the number of collocations confirmed by these sources.

Table 7.

The Number of Collocations Found in Either One of the Dictionaries or the BNC.

Total Number of Collocations	Non-existent in dictionaries	Non-existent in BNC	Total Number of Non-existent	Total Number of Appropriate Collocations	Percentage
898	54	60	114	784	%87.3

As shown in Table 7, of the 898 collocations deemed correct, 114 (12.7%) could not be found in either the dictionaries or the BNC. The remaining 784 collocations (87.3%) were thus validated as appropriate according to native-like norms. The 114 unconfirmed combinations were further analyzed and are listed in Table 8, categorized by Common European Framework of Reference for Languages level.

Table 8.

Collocations That Do Not Appear in Either of the Dictionaries.

A1-A2	Miss the time, learn the time, have got the time, choose break time, love things, save things, wear things, help with sports, love phone (4), chat with the phone, bring phone (2), give a computer (3), bring the computer (2), being games (8), help with the concert,
B1-B2	Create people, forget that day, film a day, film our school, go back to house, walk through house, cycle to school, think about a school, deliver to school, choose the school (31), study for school, finish letter, answer the letter, consider my letter, learn from the letter, display the things, afford a computer, work with people (2), know about computer, indulge in computers, have a club (2), run away from life, tell about life, see places, give place, call for the concert,
C1-C2	Seek free time, tell about my house, obtain from school, come to school, value things (2), think of the things, cherish the things, benefit life, see places, arrange place, think about the concert, forget this concert, go back to the concert,

Note. The common phrases across levels have been indicated in bold in the table.

According to the table, the recurrent combination see places appear across all three proficiency bands. Most non-attested combinations reflect literal translations from Turkish (e.g., tell about my house), semantically vague or unidiomatic pairings (e.g., create people), or misuse of otherwise frequent verbs in non-canonical syntactic environments (e.g., film a day). Examples drawn directly from the corpus illustrate these phenomena:

(1) 583052_2 <s> It has been a while since we last saw each other and I have really missed the times we were together in England. </s>

(2) 2791133_2 <s>I love designing the houses and creating the people.</s>

These examples underscore the learners’ tendency to generate semantically comprehensible but pragmatically marked or culturally unnatural combinations, which may not trigger correction in traditional grammar instruction but signal clear collocational deviance to native speakers.

To further evaluate the extent to which Turkish EFL learners use native-like collocations, Table 9 presents verb–noun combinations produced by learners that were not found in the British National Corpus, thus indicating a lack of attestation in standard native-speaker usage.

Table 9.

Collocations That Do Not Appear in the BNC.

A1-A2	bring sb. to house, study for school (2), film our school (2), help with sports, love phone (4), like phone (4), ask about phone, chat with phone, bring phone, bring the computer (2), İnvite club, bring games (8), help with the concert, ask about the concert,
B1-B2	Film a day, think about days, park near house, stay at house, bring sb. to house, pick sb up from house, go back to house (3), tell you about house, finish the school (2), study for school, film our school (2), graduate from school (2), deliver to school, inform about things, display the things, cherish the things, finish letter, thanks for the letter, learn from the letter, appreciate your letter, look forward to receiving letter, look forward for your letter, take your letter (2), afford a computer, indulge in computers, tell about life, control life, run away from life, benefit life, buy games (4), call for the concert, participate concert,
C1-C2	give people advice/opportunities, discuss with people, have respect for people, seek free time, come back to house, accommodate in house, study for school, film our school, cherish the things, benefit life, mention about games, arrange place, go back to the concert,

Note. The common phrases across levels have been indicated in bold in the table.

Bolded items in the original table reflect collocations recurring across multiple Common European Framework of Reference for Languages levels, suggesting long-term persistence of certain erroneous or non-native-like expressions. Learner data at A1–A2 reveal overgeneralized, literal translations of Turkish phrasal patterns (e.g., bring sb. to house, study for school), while also reflecting dependence on concrete, everyday topics (e.g., love phone, help with sports). Representative corpus examples include:

(3) 2606135_3 <s>You can say that you have to study for school.</s>

(4) 2652417_9 <s>My dad can bring you to my house.</s>

In both B1–B2 and C1–C2, learners continue to employ non-attested collocations (e.g., afford a computer, cherish the things, benefit life), many of which signal increased lexical ambition but persistent deviation from conventional usage. These are often unidiomatic extensions of otherwise legitimate words and grammatical patterns:

(4) 2606574_2 <s>A TV company came and filmed our school yesterday!</s>

(5) 2660091_9 <s>You can bring your games.</s>

The list of collocations that failed to appear in both the dictionaries and British National Corpus, thus being doubly unattested, is provided in Table 10.

Table 10.

Collocations Absent from Both Dictionaries and BNC.

A1-A2	Bring phone, Bring the computer, Help with sports, Help with the concert, Love phone
B1-B2	Afford a computer, Call for the concert, Deliver to school, Display the things, Film a day, Film our school, Finish letter, Go back to house, Indulge in computers, Learn from the letter, Run away from life, Study for school, Tell about life
C1-C2	Arrange place, Benefit life, Cherish the things, Go back to the concert, Seek free time

As can be seen from the table, unique collocations of Turkish EFL learners are observed such as study for school, tell about life, seek free time, learn from the letter, etc. At A1–A2, only five literal transfers (e.g., “Bring phone,”“Help with sports”) appear, reflecting basic L1 interference. In B1–B2, thirteen non-attested sequences (e.g., “Afford a computer,”“Run away from life,”“Learn from the letter”) emerge, indicating that intermediate learners still rely on unidiomatic formulas despite expanding their verb–noun repertoire. By C1–C2, only five items remain (e.g., “Arrange place,”“seek free time”), suggesting some consolidation but persistent non-native patterns at higher proficiency. Some examples from the corpus may be seen below:

(7) 1803337_1 <s>Therefore, I would agree with young people who seek free time when describing an ideal.</s>

(8) 760342_3 but I couldn't find it than I decided to go back to the concert.</s>

In the literature, an effective method for assessing native-like collocations in addition to appropriateness involves checking if identical word combinations appear with high frequency of >6 in a native-speaker corpus, BNC; when they do, it is assumed that they constitute native-like collocations. That’s why, instances that appeared with a frequency of 6 and above in the BNC have been considered as native-like collocations, the numbers of which may be seen below (Table 11):

Table 11.

Learner Collocations with a BNC Frequency of >6 and MI >3.

	MI > 3	Total learner
	FR > 6	Collocations
N	579	784
%	73	100

Accordingly, 579 out of 784 validated collocations were found to be both statistically significant and attested at native-like frequency levels in the BNC. This indicates that 73% of the verb–noun combinations used by Turkish EFL learners in this dataset demonstrate native-like qualities, reflecting a considerable degree of lexical competence. Nevertheless, the remaining 27% represent an area of concern in terms of collocational accuracy and idiomaticity.

RQ3. How Frequent Are These Collocations in British National Corpus (BNC) Across Proficiency Levels?

The collocations were checked in the BNC to reveal the frequencies according to five frequency bands: 0 (failed to appear in the BNC), 1–5, 6–20, 21–100, and <100 occurrences. The distribution of these collocations on the basis of their BNC frequencies have been provided below.

As shown in Figure 1, 4.2% of the learner collocations (n = 33) were completely absent from the BNC (Band 0), suggesting non-native or highly idiosyncratic usage. Examples of these collocations include create people, study part-time, learn the time, rest for a week. Some of the examples from the corpus are as follows:

(9) 573337_2 <s>I want to learn the exact time.

(10) 766128_3 <s>I told her that I wanted to rest for a week, and she said that it was o.k.

Figure 1.

Distribution of learner collocations on the basis of their BNC frequency band.

A further 22.1% of collocations fell into the 1–5 frequency range, and 9.1% appeared between 6–20 times in the BNC—both ranges typically considered low-frequency or marginally idiomatic. Examples from this tier include think of people, influence people, kill time, and break into house:

(11) 763548_1 <s> All over the world there are various kinds of music that influence people and that sometimes change the way they live.</s>

Approximately 29.5% of collocations belonged to the 21–100 frequency band, while the largest portion, 35%, consisted of collocations with more than 100 occurrences in the BNC (e.g., go to school, receive a letter, fall in love). These combinations are considered highly entrenched in native-speaker usage and thus represent idiomatic and fluent L2 production.

To gain further insight into developmental progression, the frequency band distribution was also analysed across proficiency levels, as depicted in Figure 2.

Figure 2.

Distribution of collocations on the basis of their BNC frequency bands CEFR levels.

Analysis across levels reveals distinct developmental patterns. A1–A2 learners produce relatively few collocations overall, and most are confined to the lowest frequency bands (0–20). Only 37% of their collocations occur in the higher-frequency range (21–100 and >100), while 20% are atypical or absent from native use. Also, B1–B2 learners display greater lexical range and collocational diversity, with substantial representation across all frequency bands. Notably, 45% of their collocations fall into the highest frequency tiers, although 16% remain low-frequency or unattested. As for C1–C2 learners, they exhibit greater reliance on high-frequency collocations (52%), indicating increased convergence with native-like usage. However, a relatively high 23% of their combinations still belong to the lower frequency bands or are unattested, underscoring residual phraseological inaccuracies at advanced levels.

Taken together, these results suggest that collocational appropriacy and idiomaticity improve with proficiency, but learners at all levels, including advanced, continue to produce combinations that fall outside the range of native-speaker norms, meaning that %23 of their uses seems to be infrequent or atypical collocations.

RQ4. What Are the Most Frequent Error Types Assossiated with These Constructions?

Error types in collocational constructions were analyzed according to the Cambridge Learner Corpus error-annotation scheme (Cambridge University Press, 2014). Table 12 presents the most frequently observed error categories at each CEFR proficiency level, with percentage distributions calculated within the total number of erroneous collocations per level.

Table 12.

The Most Frequent Collocational Error Types Across Proficiency Levels.

Proficiency levels	Error types	Frequency (%)
A1-A2	Incorrect Tense of Verb (TV)Replace Verb (RV)missing preposition (MT)missing determiner (MD)wrong verb form (FV)	32%14%14%11%6%
B1-B2	Incorrect Tense of Verb (TV)Replace Verb (RV)missing determiner (MD)missing preposition (MT)wrong verb form (FV)	20%18%7%7%6%
C1-C2	Replace Verb (RV)Incorrect Tense of Verb (TV)missing determiner (MD)Verb Agreement Error (AGV)derivation of adjective error (DJ)	12%9%9%9%9%

Across all levels, verb-related errors dominate the learner data. The most prevalent error overall is Incorrect Tense of Verb (TV), followed closely by Replace Verb (RV). These two categories reflect the ongoing challenges learners face in tense selection and verb appropriacy within collocational contexts.

(12) 1094956_8 <s>Dear, I<#TV>have|have had</#TV>a computer for 3 years.</s>

(13) 765190_1 <s>It has been<#RV>taking part|</#RV>in our school for 3 years.</s>

Missing preposition (MT) follows as the third error in A1-A2, and missing determiner (MD) follows it at B1-B2 and C1-C2 levels. In addition, Wrong Verb Form (FV) is observed both at A1-A2 and B1-B2 levels. Samples for these errors from the corpus is as follows:

(14) 2609322_4 <s>I was waiting<#MT>|for</#MT>this letter.</s> (B1-B2 level)

(15) 2663836_9 <s>I was wondering if you have time to come<#MT>|to</#MT><#MD>|a</#MD> concert with me.</s>

In addition, by the intermediate B1–B2 level, FV (Wrong Verb Form) remains common, indicating lingering difficulties with non-finite verb morphology:

(16) 567578_4<s>I do have a few ideas that we can work on<#RP>.|,</#RP>e.g.<#FV>start| starting</#FV>a club, <#UV>start|</#UV> raising money

At C1–C2, after the two verb-related errors, Verb Agreement Error (AGV) and Derivation of Adjective Error (DJ) become notably frequent. This progression indicates that lower-level learners struggle primarily with basic syntactic elements, while advanced learners begin to make more nuanced morphological and agreement errors within collocations. Some errors from the corpus is provided below.

(17) 581647_5 <s>Nevertheless, it<#AGV>take |takes</#AGV>time to put everything together and.</s>

(18) 573511_1 <s> young people are more keen on playing on the computer or watching TV rather than taking up a <#DJ>sport|sports</#DJ> activity.</s>

Such patterns suggest that although advanced learners achieve a higher degree of lexical and syntactic control, residual interlanguage traces still surface in subtle collocational environments. Additional, though less frequent, error types include replace determiner (RD), unneccessary determiner (UD) and replace noun (RN), the examples of which are given below:

(19) 765071_1 <s>I am<#IV>writting|writing</#IV>you<#RD>the|this</#RD>letter

(20) 1095003_3 <s>I hope I'll learn more<#UN> thing |</#UN>about her.</s>

(21) 2785813_3b <s>You</#MP>would<#UV> be|</#UV>haveflexible<#RN>times|hours </#RN>with it but

These instances reinforce the view that collocational errors are multifaceted, influenced not only by lexical choice but also by grammatical concord, derivational morphology, and article/preposition usage. The progression from syntactic to morphological sophistication with rising proficiency is observed from these errors.

RQ5. Which Collocation Errors, If Any, Persist Across Proficiency Levels from A1 to C2??

To explore whether certain collocational errors persist across the developmental continuum of CEFR levels, an error trajectory analysis was conducted. Table 13 presents the error types that occurred consistently across A1, B1, and C1 levels in the Turkish sub-corpus of the Cambridge Learner Corpus.

Table 13.

Persistent errors Across Levels.

Error Code	Error
RV	Replace verb
RT	Replace preposition
TV	Incorrect tense of verb
S	Spelling error
FV	Wrong verb form
MD	Missing determiner
UD	Unnecessary determiner

According to the Table 13, the data reveal that seven error categories were present across all CEFR levels examined. Among these, verb-related errors, particularly RV (Replace Verb), TV (Incorrect Tense of Verb), and FV (Wrong Verb Form), emerged as systematically persistent problems, regardless of learner proficiency. This pattern suggests a deep-rooted difficulty in verb selection and morphological manipulation, which may be linked to the collocational constraints of English and the transfer of L1 verb semantics.

Additionally, determiner misuse, as reflected in missing determiner and unnecessary determiner, persisted across the corpus. This indicates a continued challenge in article usage and noun phrase construction, commonly observed among speakers of article-less L1s such as Turkish (Ionin et al., 2004). Similarly, Replace Preposition errors reflect ongoing problems in mastering prepositional collocations, where L1 transfer and semantic overgeneralization are likely at play.

Even at advanced proficiency (C1–C2), these error types are not fully resolved, echoing findings from previous learner corpus research (e.g., Granger & Paquot, 2008; Nesselhauf, 2005) that suggest collocational competence develops more slowly than general vocabulary knowledge. This persistence underscores the importance of targeted pedagogical interventions focusing on collocational awareness, especially in the verb and determiner domains.

Discussion

This study examined English verb–noun collocation use in Turkish EFL learners’ written production across CEFR proficiency levels (A1–A2, B1–B2, C1–C2), aiming to identify developmental patterns and recurrent error types. Collocational knowledge is widely acknowledged as a key component of second language proficiency (Wray, 2002). Yet, it remains a persistent challenge for EFL learners, even those at the advanced level learners who have difficulties approximating native speakers’ collocation use, regardless of their language and cultural backgrounds (Granger & Paquot, 2009; Nesselhauf, 2003).

The examination of collocations across A1-C2 level revealed that the number of collocations decreases as the proficiency level increases. This does not concur with Laufer and Waldman’s (2011) and Li et al.’s (2023) studies which both revealed that learners at higher CEFR bands use more verb–noun collocations than lower-level peers. This drop in collocational use may be due to the repetitive use of a limited set of collocations at lower proficiency bands by Turkish EFL learners. The apparent decrease in collocation usage at higher proficiency levels is not an indication of reduced formulaicity, but rather a developmental shift from fixed, textbook-like bi-grams (typical of A1–A2 learners) towards more flexible, structurally complex multi-word units characteristic of C1–C2 production. In other words, advanced learners rely less on memorised two- three-word chunks and more on discourse-level phraseological patterns.

As for the number of the collocational errors across levels, a decrease is observed in the number of errors with a growth in proficiency level. However, this trend may also reflect the overall reduction in collocation use at higher levels. In total, the percentage of errenous collocations is observed to be 20.2%. This seems to be aligned with Nesselhauf’s (2005) study on the German learner corpus, which revealed that a quarter of the collocations were erroneous.This finding also accords with Laufer and Waldman’s (2011) study which revealed that approximately one-third of learners’ collocation attempts were incorrect. Nonetheless, error rates did not diminish with proficiency in their study and advanced learners actually made more errors because they used collocations more frequently. Since the use of collocations by Turkish EFL learners decreases as proficiency level increases, their collocational errors also decline accordingly.

The analysis indicates that a substantial proportion of verb–noun collocations produced by Turkish EFL learners demonstrate a degree of native-likeness. Specifically, 87.3% of the collocations were attested in at least one of the reference dictionaries, and frequency analysis using the BNC showed that 35% of learner collocations occurred more than 100 times, while a further 29.5% appeared between 21 and 100 times. These results suggest that learners are generally capable of producing collocational patterns that align with conventional usage, as evidenced by the finding that 73% of all collocations could be classified as native-like. At the same time, the presence of non-target-like patterns remains notable: 4.2% of learner collocations were entirely absent from the BNC, and 22.1% were extremely infrequent (occurring only 1–5 times). Taken together, these figures indicate that nearly one quarter (27%) of learner collocations diverge from typical native usage. This tendency points to learners’ reliance on unconventional or low-frequency pairings, which may reflect the influence of first language transfer, restricted exposure to authentic input, or an incomplete awareness of the fixed nature of collocational patterns in English.

Additionally, learners’ verb–noun combinations that are not attested in either reference dictionaries or with a frequency threshold >6 in the BNC clearly reflect the cross-linguistic influence from Turkish. Examples such as miss the time, answer the letter, study for school, bring sb. to house, accommodate in house illustrate instances of direct translation or literal transfer from learners' L1. And, most of them may be considered as unique uses such as bring phone, learn from the letter, arrange place, seek free time…etc. The L1 influence is consistent with findings from other learner corpora across diverse language backgrounds including Nesselhauf (2005) who revealed the influence of the learners' native language on the use of collocations in German learners of English corpus. Likewise, Laufer and Waldman (2011) also conclude that there is a persistent L1 interference and high error rates found in their corpus of Hebrew-speaking learners. Similarly, Lee (2016) also reported substantial L1 effects among Korean learners, and Gao et al. (2019) found similar tendencies in a corpus of Chinese EFL learners. Together with the present study, these findings reinforce the view that L1 interference is a cross-linguistic phenomenon affecting collocational competence in second language acquisition, regardless of the learners' first language.

The analysis of frequency bands in the BNC further supports this developmental trajectory, indicating that lower-level learners rely more on infrequent collocations, whereas advanced learners, though producing fewer collocations overall, tend to use a higher proportion of frequent and conventional verb–noun combinations. These trends support Laufer and Waldman’s (2011) finding that intermediate learners often rely on moderately frequent collocations, while advanced learners tend to favor highly frequent, native-like combinations. The results also further support Granger’s (1998) observation that even advanced learners’ underuse idiomatic, high-frequency collocations and sometimes compensate with atypical constructions.

As for the error types across proficiency levels, incorrect tense of verb, and replace verb are the two most frequent error types in collocations produced by Turkish EFL learners. This finding parallels with the findings of Gao et al. (2019), who observed that the most prevalent errors in their Chinese learner corpus stemmed from inappropriate verb selection. This finding is also in line with Can’s (2017) study on verb errors of Turkish EFL learners, revealing the most common verb error categories, which are incorrect tense of verb (TV), wrong verb choice (RV), wrong verb form (FV), missing verb (MV), and verb agreement (AGV) errors.

Following verb errors, missing preposition and missing determiner errors remain most prevalent across levels. A1-A2 and B1-B2 learners exhibit missing prepositions and determiners or wrong verb forms, while C1-C2 learners shift toward missing determiners, verb agreement, and adjective derivation errors, reflecting a move from basic syntactic challenges to more nuanced morphological issues. This pattern underscores that, as learners advance, they transition from fundamental syntactic omissions toward subtler morphological and agreement errors.

In addition to this, verb-related errors (replacement, tense, form) and determiner omissions persist across all CEFR levels, from A1 through C2, indicating that learners continue to struggle with verb usage and article selection in their collocational usage even for advanced learners. This observation aligns with the conclusion of Laufer and Waldman (2011) that collocational errors remain prevalent irrespective of proficiency level. It also echoes Granger and Paquot’s (2009) findings that advanced learners still struggle to approximate native-like collocational usage. This finding is also in line with Akbaş and Yükselir’s (2022) study which also revealed persistent difficulties with collocational awareness of Turkish EFL learners even at intermediate proficiency level. Furthermore, apart from the collocational errors, these verb-related and determiner errors are also among the most frequent error types in the writings of Turkish EFL learners as revealed in Aybek and Can’s (2023) study of errors Turkish EFL learners make in Cambridge ESOL (English for Speakers of Other Languages) exams.

Overall, Turkish EFL learners, like many other L2 learners, often struggle with verb-noun collocations and produce incorrect collocations. The one-fifth of their collocations used is erroneous, a figure consistent with prior corpus-based research. While Turkish learners of English do not show much progress in acquiring collocations with increasing proficiency, their development is hindered by L1 interference and a lack of attention to the fixed nature of collocational phrases in English. The developmental trajectory is hindered by L1 transfer, as learners directly translate Turkish collocational patterns into English, resulting in common mistakes such as using bring smt. instead of give smt. examples. These collocational errors arise from both intralinguistic factors (the complexity, opacity, and unpredictability of verb–noun collocations) and extralinguistic factors (such as the learner's educational background, the genre and context of written production, exposure to authentic input).

Conclusion

This study investigated the developmental trajectory of English verb–noun collocation use among Turkish EFL learners across CEFR levels and revealed that collocation frequency decreases with proficiency. Unlike previous corpus-based studies which reported a steady increase in collocational production at higher levels (Laufer & Waldman, 2011; Li et al., 2023), the present findings suggest a qualitative rather than quantitative development. Lower-level learners relied heavily on a narrow range of repetitive collocations, leading to a high frequency of collocation use, whereas advanced learners shifted toward more flexible and structurally complex multi-word units and discourse-level phraseology. Although overall collocation production was lower at C1–C2, the proportion of native-like combinations was substantially higher, and error rates decreased accordingly. The data also confirm the strong cross-linguistic influence of Turkish on verb selection and argument structure, particularly in low-frequency or unconventional pairings that indicate literal translation from L1.

Despite the reduced quantity of verb–noun collocations at higher proficiency bands, the quality of collocational competence improves, reflecting a gradual move from surface formulaicity toward deeper phraseological competence. Persistent verb-related and determiner-related errors across all levels suggest that collocational mastery remains challenging even at advanced stages. These findings reinforce the view that collocational development is not purely linear and cannot be captured through frequency alone; rather, it involves a shift from memorisation to productive deployment within broader syntactic and discourse frames.

The analysis of error types shows that Turkish learners most frequently struggle with verb-related errors (verb replacement, tense, and form), followed by missing-preposition and missing-determiner errors, which persist across all proficiency bands. At lower levels, errors are predominantly syntactic (e.g., omission of prepositions or determiners), whereas at higher levels they shift toward subtler morphological and lexical-choice issues (e.g., inappropriate verb selection or derivational inaccuracy). This pattern indicates that L1 transfer plays a persistent role in collocational deviance, especially in verb selection and argument structure, and confirms that collocational acquisition develops qualitatively rather than linearly. Pedagogically, the results point to the need for explicit, form–function focused collocation instruction which extends beyond lexical substitution exercises and draws learners’ attention to pattern grammar, phraseological constraints, and cross-linguistic transfer.

Implications

The findings of this study carry significant pedagogical implications for the teaching of English collocations, particularly verb–noun combinations, to Turkish EFL learners. Given the high rate of erroneous and non-native-like collocations across all proficiency levels, explicit instruction in collocations should become a core component of vocabulary and writing pedagogy. Instruction should move beyond incidental exposure and focus systematically on the structure, meaning, and contextual use of frequent verb–noun collocations. In line with previous research (Kjellmer, 1991; Wray, 2002), the results reaffirm that learners often rely on compositional word-by-word strategies rooted in L1, rather than accessing stored multiword units, which impedes fluency and idiomaticity.

To mitigate this, instructional approaches should integrate form-focused activities that evolve in complexity with learner proficiency. At lower levels, learners benefit from guided exposure to formulaic sequences and collocational chunks; whereas at advanced levels, they require contrastive, analytical tasks that raise awareness of subtle semantic and syntactic constraints on collocational usage. Persistent L1 transfer observed in this study further suggests the need for contrastive analysis, especially between Turkish and English collocational patterns, to help learners recognize and avoid direct translation strategies.

Furthermore, the study underscores the importance of Data-Driven Learning (DDL) and corpus-based materials in developing collocational competence. Turkish EFL learners showed difficulty in distinguishing between low- and high-frequency collocations, indicating a lack of exposure to natural language input and insufficient sensitivity to frequency effects. Teachers should therefore incorporate authentic corpus data—such as concordance lines and frequency lists—into classroom activities to promote noticing (Schmidt, 1994) and encourage learners to internalize collocations through contextualized exposure and repetition. Materials should highlight high-frequency verb–noun pairs, illustrate their usage in real-life contexts, and provide ample opportunities for controlled and communicative practice.

Additionally, pedagogical strategies like collocation error analysis, guided rewriting, and native-speaker comparison tasks can foster metalinguistic awareness and help learners self-correct deviant patterns. These activities also serve to reinforce the idea that lexical accuracy in English depends not only on word meaning but also on knowing which words typically co-occur. In conclusion, improving learners’ collocational competence requires intentional, scaffolded, and frequency-informed instruction, supported by tools and resources from corpus linguistics. Such interventions are crucial not only to enhancing learners’ fluency and accuracy but also to fostering the lexical sophistication expected in academic and professional communication.

Limitations and Suggestions

The study relied on the Cambridge Learner Corpus (CLC), which is exam-based data. As such, learners’ performance may reflect test-taking strategies rather than their everyday language use. Also, the CLC includes test-takers of different ages and educational backgrounds, but this information was not always consistently available, which limits the generalizability of the findings.

Furthermore, the study’s methodological decision to focus on high-frequency nouns may have influenced the nature and range of extracted collocations. The analysis was restricted to verb–noun collocations, leaving out other types of collocations (e.g., adjective–noun, verb–preposition) that may also reveal important aspects of learner language. A further limitation concerns the reliance on frequency data from the British National Corpus (BNC), as frequent usage does not necessarily guarantee universal acceptability, and less frequent yet correct collocational patterns may therefore have been disregarded.

Although corpus data provide valuable insights into recurrent errors and usage patterns, they offer limited explanation of the cognitive or conceptual mechanisms underlying learner difficulties. Such challenges may stem from first-language transfer or instructional practices, and therefore often require complementary qualitative or experimental evidence to be fully understood.

Future research could diversify the noun base and include spoken corpora to explore modality-based variation in collocational usage. Future research could also complement exam-based corpora with learner production from classrooms, writing portfolios, or spoken data to capture a fuller picture of collocational use.

In addition, tracking the same group of learners over time would provide clearer insights into developmental patterns and the persistence of collocational errors. Classroom-based studies could test targeted instructional techniques (e.g., data-driven learning, explicit teaching of frequent collocations) to evaluate their effectiveness in reducing persistent errors.

Footnotes

Acknowledgements

The author gratefully acknowledges Prof. Dr. Cem Can for his invaluable feedback and guidance throughout the completion of this paper.

ORCID iD

Sibel Aybek

Ethical Considerations

Corpus-based studies rely on pre-existing datasets that were compiled long ago with the necessary legal permissions and informed consent at the time of their creation. As researchers, we analyze these publicly available or previously approved corpora without directly involving human participants in new data collection processes. Therefore, ethical approval is generally not sought for such studies, as the original data compilation had already undergone the necessary ethical considerations.

Funding

The author received no financial support for the research, authorship, and/or publication of this article.

Declaration of Conflicting Interests

The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

Raw data, analysis files and categorizations that support the findings of this study are available from the author upon reasonable request.

References

Aisenstadt

(1981). Restricted collocations in English lexicology and lexicography. ITL Review of Applied Linguistics, 53, 53–61.

Akbaş

Yükselir

(2024). A frequency-based exploratory study of word combinations by EFL learners towards adjective–noun collocations. Círculo de Lingüística Aplicada a la Comunicación, 98, 193–208. https://doi.org/10.5209/clac.81885

Altenberg

Granger

(2001). The grammatical and lexical patterning of make in native and non-native student writing. Applied Linguistics, 22(2), 173–195. https://doi.org/10.1093/applin/22.2.173

Aston

(1995). Corpora in language pedagogy: Matching theory and practice. In Cook

Seidlhofer

(Eds.), Principle & practice in applied linguistics: Studies in honour of H. G. Widdowson (pp. 257–270). Oxford University Press.

Aybek

Can

(2023). Yabancı Dil Olarak İngilizce Öğrenen Türk Öğrencilerin Cambridge ESOL Sınavlarında Yaptıkları Yanlışlara Genel Bir Bakış: Derlem Tabanlı Bir YanlışÇözümlemesi. Dil Dergisi, 174(1), 31–64. https://doi.org/10.33690/dilder.1217990

Barfield

(2007). An exploration of second language collocation knowledge and development [Unpublished doctoral dissertation]. University of Swansea.

Benson

Ilson

R. F.

(2009). The BBI combinatory dictionary of English: Your guide to collocations and grammar (3rd ed., Rev. by R. Ilson). John Benjamins Publishing Company.

Boers

Eyckmans

Kappel

Stengers

Demecheleer

(2006). Formulaic sequences and perceived oral proficiency: Putting a lexical approach to the test. Language Teaching Research, 10(3), 245–261. https://doi.org/10.1191/1362168806lr195oa

Boone

De Wilde

Eyckmans

(2023). A longitudinal study into learners’ productive collocation knowledge in L2 German and factors affecting the learning. Studies in Second Language Acquisition, 45(2), 503–525. https://doi.org/10.1017/S0272263122000377

10.

Brown

(1974). A first language: The early stages. Harvard University Press.

11.

Can

(2017). A learner corpus-based study on verb errors of Turkish EFL learners. Journal of Education and Training Studies, 5(9), 167–175. https://doi.org/10.11114/jets.v5i9.2612

12.

Conklin

Schmitt

(2008). Formulaic sequences: Are they processed more quickly than nonformulaic language by native and nonnative speakers? Applied Linguistics, 29(1), 72–89. https://doi.org/10.1093/applin/amm022

13.

Council of Europe. (2001). Common European framework of reference for languages: Learning, teaching, assessment. Cambridge University Press.

14.

Cambridge University Press. (2014). Cambridge error coding – User guide (Cambridge Learner Corpus). Cambridge University Press.

15.

Cowie

(1981). The treatment of collocations and idioms in learner’s dictionaries. Applied Linguistics, 2(3), 223–235. https://doi.org/10.1093/applin/2.3.223

16.

Cowie

A. P.

(1988). Stable and creative aspects of vocabulary use. In Carter

McCarthy

(Eds.), Vocabulary and language teaching (pp. 126–139). Longman.

17.

Afzaal

Al Fadda

(2022). Collocation use in EFL learners’ writing across multiple language proficiencies: A corpus-driven study. Frontiers in Psychology, 13, Article 752134. https://doi.org/10.3389/fpsyg.2022.752134

18.

Durrant

Schmitt

(2009). To what extent do native and non-native writers make use of collocations? International Review of Applied Linguistics in Language Teaching, 47(2), 157–177. https://doi.org/10.1515/iral.2009.007

19.

Evert

(2004). Computational approaches to collocations. http://www.collocations.de

20.

Fillmore

C. J.

(1979). On fluency. In Fillmore

C. J.

Kempler

Wang

W. S.-Y.

(Eds.), Individual differences in language ability and language behaviour (pp. 85–101). Academic Press.

21.

Gablasova

Brezina

McEnery

(2017). Collocations in corpus-based language learning research: Identifying, comparing, and interpreting the evidence. Language Learning, 67(S1), 155–179. https://doi.org/10.1111/lang.12225

22.

Gao

Zhu

Luo

(2019). A corpus-based study of collocation production of Chinese EFL learners. International Journal of Education, 11(4), 1–9. https://doi.org/10.5296/ije.v11i4.15634

23.

Gläser

(1998). The stylistic potential of phraseological units in the light of genre analysis. In Cowie

A. P.

(Ed.), Phraseology: Theory, analysis and applications (pp. 125–143). Oxford University Press.

24.

Granger

(1998). Prefabricated patterns in advanced EFL writing: Collocations and formulae. In Cowie

A. P.

(Ed.), Phraseology: Theory, analysis and applications (pp. 145–160). Clarendon Press.

25.

Granger

Paquot

(2008). Disentangling the phraseological web. In Granger

Meunier

(Eds.), Phraseology: An interdisciplinary perspective. John Benjamins. https://doi.org/10.1075/z.139.07gra

26.

Granger

Paquot

(2009). Lexical verbs in academic discourse: A corpus-driven study of learner use. Bloomsbury.

27.

Granger

Bestgen

(2014). The use of collocations by intermediate vs. advanced non-native writers: A bigram-based study. International Review of Applied Linguistics in Language Teaching, 52(3), 229–252. https://doi.org/10.1515/iral-2014-0011

28.

Howarth

(1998). The phraseology of learners’ academic writing. In Cowie

A. P.

(Ed.), Phraseology: Theory, analysis, and applications (pp. 161–186). Oxford University Press.

29.

Hunston

(2002). Corpora in applied linguistics. Cambridge University Press.

30.

Ionin

Wexler

(2004). Article semantics in L2 acquisition: The role of specificity. Language Acquisition, 12, 3–70.

31.

Kaszubski

(2000). Selected aspects of lexicon, phraseology and style in the writing of Polish advanced learners of English: A contrastive, corpus-based approach. http://main.amu.edu.pl/∼przemka/research.html

32.

Kjellmer

(1991). A mint of phrases. In Aijmer

Altenberg

(Eds.), English corpus linguistics: Studies in honour of Jan Svartvik (pp. 111–127). Longman.

33.

Korosadowicz-Strużyńska

(1980). Word collocations in foreign language vocabulary instruction. Studia Anglica Posnaniensia, 12, 109–120.

34.

Laufer

Waldman

(2011). Verb-noun collocations in second language writing: A corpus analysis of learners’ English. Language Learning, 61(2), 647–672. https://doi.org/10.1111/j.1467-9922.2010.00621.x

35.

Lee

(2016). L1 influence on the processing of L2 collocation: An experimental study of Korean EFL learners. Linguistic Research, 33, 137–163. https://doi.org/10.17250/khisli.33..201609.006

36.

Lee

Shin

(2021). Towards improved assessment of L2 collocation knowledge. Language Assessment Quarterly, 18(4), 419–445. https://doi.org/10.1080/15434303.2021.1908295

37.

Lewis

(Ed.). (2000). Teaching collocation. Language Teaching Publications.

38.

J.-Y.

Mallart

Gaillat

Richard

(2023, November). Exploring verb-noun collocations in learner English. Paper presented at the Deep Learning for Language Assessment (DLLA) Closing Event, Paris, France. https://hal.science/hal-04321727

39.

Liu

Afzaal

(2021). Syntactic complexity in translated and nontranslated texts: A corpus-based study of simplification. PLOS ONE, 16(1), Article e0253454. https://doi.org/10.1371/journal.pone.0262050

40.

Manning

C. D.

Schütze

(1999). Foundations of statistical natural language processing. MIT Press.

41.

Mel’čuk

(1998). Collocations and lexical functions. In Cowie

A. P.

(Ed.), Phraseology: Theory, analysis and applications (pp. 23–53). Clarendon Press.

42.

Nattinger

J. R.

DeCarrico

J. S.

(1992). Lexical phrases and language teaching. Oxford University Press.

43.

Nesselhauf

(2003). The use of collocations by advanced learners of English and some implications for teaching. Applied Linguistics, 24(2), 223–242. https://doi.org/10.1093/applin/24.2.223

44.

Nesselhauf

(2005). Collocations in a Learner Corpus. John Benjamins Publishing Company. https://doi.org/10.1017/S0272263107270068

45.

Nguyen

Webb

(2016). Examining second language receptive knowledge of collocation and factors that affect learning. Language Teaching Research, 21(3), 298–320. https://doi.org/10.1177/1362168816645959

46.

Nizonkiza

(2017). Predictive power of controlled productive knowledge of collocations over L2 proficiency. In Gómez-González

M. L.

González-Álvarez

E. M.

(Eds.), Usage-based approaches to language acquisition and language teaching (Vol. 11, pp. 263–286). De Gruyter Mouton.

47.

Pawley

Syder

F. H.

(1983). Two puzzles for linguistic theory: Nativelike selection and nativelike fluency. In Richards

J. C.

Schmidt

R. W.

(Eds.), Language and communication (pp. 191–226). Longman.

48.

Peng

X. L.

(2016). Use of verb-noun collocations by advanced learners of Chinese (Publication No. 10158244) [Doctoral dissertation, University of Pennsylvania]. ProQuest Dissertations Publishing.

49.

Schmidt

(1994). Deconstructing consciousness in search of useful definitions for applied linguistics. AILA Review, 11, 11–26.

50.

Sinclair

J. M.

(1991). Corpus concordance collocation. Oxford University Press.

51.

Siyanova

Schmitt

(2008). L2 learner production and processing of collocation: A multi-study perspective. The Canadian Modern Language Review, 64(3), 429–458. https://doi.org/10.3138/cmlr.64.3.429

52.

Siyanova-Chanturia

(2015). Collocation in beginner learner writing: A longitudinal study. System, 53, 148–160. https://doi.org/10.1016/j.system.2015.07.003

53.

Stubbs

(1995). Collocations and semantic profiles: On the cause of the trouble with quantitative methods. Functions of Language, 2(1), 1–33.

54.

Szudarski

Carter

(2016). The role of input flood and input enhancement in EFL learners’ acquisition of collocations. International Journal of Applied Linguistics, 26(2), 245–265. https://doi.org/10.1111/ijal.12092

55.

Thornbury

(2002). How to teach vocabulary. Longman.

56.

Wang

Shaw

(2008). Transfer and universality: Collocation use in advanced Chinese and Swedish learner English. ICAME Journal, 32, 201–232.

57.

Wray

(2002). Formulaic language and the lexicon. Cambridge University Press.