Abstract

Complexity, accuracy, and fluency (CAF) as measures of language proficiency have been used extensively in assessing second language production in English language studies related to task-based language teaching. The present review adds the following critical elements to the existing body of literature: (i) it provides a synthesis and critical analysis of the various measures of CAF used in empirical studies and reveals the variation with which the three constructs have been used; (ii) it lays down the most appropriate CAF measures for beginner, intermediate, and advanced learners for a valid and robust assessment of oral production in English as a second and foreign language.

Keywords

task-based language teaching, complexity, fluency, accuracy, review, second language

Introduction

The conceptualization of second language proficiency and its assessment is a pivotal topic in the field of language testing, developing extensively over the years with diverse contributions from researchers and practitioners (Alderson, 1991; Amirian et al., 2017; Bachman, 1990, 2000; Bachman & Palmer, 1996; Canale & Swain, 1980; Davies, 1990; McNamara, 1996; Spolsky, 1985). Extensive research has produced myriad models of language proficiency with varied interpretations, resulting in the ambiguous nature of the measurement of language proficiency (Piggin, 2012). It was with the advent of the task-based approach that language performance was conceptualized in terms of complexity, accuracy, and fluency (Bui & Skehan, 2018), and it was Skehan (1995, 1996) who brought the three dimensions together (Wartan, 2019). These three dimensions fit well to explain the “Trade-off hypothesis”¹ given by Skehan (1998). The “trade-off hypothesis” asserts that when learners are engaged in communicative tasks, their mental resources are unable to give equal attention to all three dimensions, and therefore, improvement in all three dimensions may not occur simultaneously. CAF measures are also used to explain and test Robinson’s (1995, 2001) “Cognition Hypothesis,” which focuses on task complexity and sequencing of pedagogic tasks while retaining task complexity. According to this hypothesis, task complexity augments accuracy and complexity and promotes interaction.

Complexity, accuracy, and fluency have been utilized widely in the realms of Second Language Acquisition (SLA) and English as a Foreign Language (EFL) for assessing skill in oral and written production as “major research variables” (Housen & Kuiken, 2009; Rausch, 2017). Over the years, various constructs and sub-constructs of CAF have been utilized in studies, and these have been critically scrutinized in previous studies to generate a plausible rationale behind their usage. The lack of consensus among researchers regarding the definition of the manifold measures of CAF limits the “comparability, reliability, and validity” of the constructs (Housen et al., 2012). Furthermore, as the standardization of every metric of CAF is not feasible (Michel, 2017), employing CAF measurements in research investigations becomes difficult.

With a broad spectrum of CAF sub-constructs for assessing second language proficiency, valid measures have to be selected judiciously as the use of “all the measures every time” is arduous (Pallotti, 2009). In this regard, a synthesis of the measures used by previous researchers can aid in a better understanding of the metrics to be used. Some researchers have accomplished synthesis of CAF measures in previous studies with a focus on planning conditions (Suzuki, 2017) and task types with task conditions (Skehan & Foster, 2008). The present study focuses on synthesizing the CAF measures used in the past two decades, particularly studies on oral proficiency. This synthesis aims to extract the metrics that were used more than others and understand the researchers’ perspectives regarding the choice of particular metrics. Secondly, the study intends to explore the most suitable measures according to the proficiency level of the learners as it has been indicated time and again that second language speakers may face difficulty in attending to both form and meaning (Skehan, 1998, 2009), particularly the low proficiency speakers (Lambert et al., 2017).

CAF and Their Sub-Constructs

There are many sub-constructs used in previous studies to assess accuracy, fluency, and complexity.

Complexity

Generally speaking, complexity is “the extent to which learners produce elaborated language” (Ellis & Barkhuizen, 2005, p. 139). In measuring complexity, both syntactical complexity and lexical variety are considered the most germane sub-components of language complexity (Skehan, 2009). Both these sub-components have multiple metrics, which contribute to the “complexity,” and notably, in second language acquisition research, syntactic complexity is an “intensively measured component whereas lexical, morphological and phonological forms of complexity have been investigated much less” (Kuiken et al., 2019, p. 162–163).

Measures of lexical complexity

Second language research studies have developed different tests for measuring lexical richness (Read, 2000), and it is usually measured with “diversity, sophistication, and density” (Read, 2000; Wolfe-Quintero et al., 1998). Lexical complexity is measured with different metrics (type token, Guiraud’s index, D index). Furthermore, Bulté and Housen (2012) proposed “to add compositionality (i.e., the number of formal and semantic components of lexical items)” while Jarvis (2013) “identified six sub-components of lexical diversity: rarity, volume, variability, evenness, disparity, and dispersion.” These sub-constructs of lexical complexity make it challenging to present “an all-encompassing picture” of second language data (Michel, 2017), which also applies to accuracy and fluency.

Lexical diversity

It is a measure of complexity that assesses the variety of words produced. The most common way of assessing lexical variety is by finding the “type-token ratio (TTR), which is the number of word types divided by all word tokens” (Vercellotti, 2012). This measure is influenced by text length. As it has been considered unstable for short texts (Gregori-Signes & Clavel-Arroitia, 2015), other measures of lexical complexity were used in subsequent studies such as D-measure and Guiraud’s Index.
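The sensitivity of TTR to text length can be illustrated with a minimal Python sketch. The tokenizer and the sample sentences are hypothetical, and Guiraud’s index is computed here under its commonly cited definition (types divided by the square root of tokens), which is an assumption about how the reviewed studies operationalized it.

```python
import math

def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation.
    A simplification: real transcripts need a more careful tokenizer."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return [w for w in words if w]

def type_token_ratio(tokens):
    """TTR: number of distinct word types divided by number of tokens."""
    return len(set(tokens)) / len(tokens)

def guiraud_index(tokens):
    """Guiraud's index (assumed definition): types / sqrt(tokens)."""
    return len(set(tokens)) / math.sqrt(len(tokens))

# Hypothetical samples: the longer one extends the shorter one.
shorter = tokenize("the boy kicked the ball")
longer = tokenize("the boy kicked the ball and then the boy ran after the ball")

for name, sample in (("shorter", shorter), ("longer", longer)):
    print(name, len(sample),
          round(type_token_ratio(sample), 3),
          round(guiraud_index(sample), 3))
```

In this toy case the TTR falls as the sample grows, simply because common words recur, whereas the square-root denominator dampens that effect.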

Lexical density and lexical sophistication

Another index of measuring lexical variety is lexical density. It is the “lexical items or content words packed into the grammatical structure” (Halliday & Martin, 1993). Lexical sophistication is another index of lexical variety (Read, 2000), indexed by the value of Lambda using the tool P-Lex (Meara & Bell, 2001; Bell, 2003). The Lambda value reflects the degree to which learners use rare and “advanced words.” Skehan and Foster (2008) proposed the use of lexical density (to be measured as D) and lexical sophistication because, in their view, the two measures encapsulate different aspects of lexical performance. P-Lex is text external and works best if “task specific word lists” are created (De Jong & Vercellotti, 2016; Meara & Bell, 2001). It reports the “number of hard words in each 10-word segment” (Vercellotti, 2012).
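As a rough illustration, lexical density can be sketched as the proportion of content words among all tokens. The function-word list below is a deliberately tiny, hypothetical stand-in; an actual analysis would more likely rely on part-of-speech tagging or a fuller word list.

```python
# Hypothetical, very small function-word list used only for illustration.
FUNCTION_WORDS = {
    "the", "a", "an", "and", "or", "but", "of", "in", "into",
    "on", "to", "is", "was", "it", "he", "she", "they",
}

def tokenize(text):
    """Lowercase, split on whitespace, and strip surrounding punctuation."""
    words = (w.strip(".,;:!?\"'()").lower() for w in text.split())
    return [w for w in words if w]

def lexical_density(tokens):
    """Content words divided by total words (assumed operationalization)."""
    content_words = [t for t in tokens if t not in FUNCTION_WORDS]
    return len(content_words) / len(tokens)

sample = tokenize("The boy kicked the ball into the garden.")
print(lexical_density(sample))  # 4 content words out of 8 tokens
```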

General measures of syntactic complexity

Syntactic complexity as a multidimensional measure includes “sentential, clausal and phrasal complexity” (Lahmann et al., 2019). There are generalized and specialized measures of complexity, and researchers believe that general measures are better for measuring the complexity of language since they allow for comparisons between studies (Tonkyn, 2013, as cited in Inoue, 2016).

Clausal subordination

Foster and Skehan (1996) preferred generalized measures as more “sensitive indices of task performance.” According to Foster and Skehan (1999, p. 228), the amount of clausal subordination is “a reliable measure in various experimental contexts and are correlated with other complexity measures,” and therefore, complexity should be measured as the amount of subordination per communication unit. Dembovskaya (2009) reported on the reliability of the clausal subordination measure. Many studies have calculated subordination, which is defined as the “percentage of subordinate clauses in the total number of clauses” (Wigglesworth, 1997, as cited in Inoue, 2016).
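The two subordination measures mentioned above can be sketched as follows, assuming the transcript has already been segmented and coded by hand. The data structure (a list of AS-units, each a list of clause flags) and the example utterances are hypothetical.

```python
# Hypothetical hand-coded transcript: each AS-unit is a list of clauses,
# and each clause is flagged True when it is a subordinate clause.
as_units = [
    [False],              # "I went home"
    [False, True],        # "I left | because it was late"
    [False, True, True],  # "She said | that he knew | what happened"
]

def clauses_per_as_unit(units):
    """Total number of clauses divided by the number of AS-units."""
    total_clauses = sum(len(unit) for unit in units)
    return total_clauses / len(units)

def subordination_percentage(units):
    """Subordinate clauses as a percentage of all clauses."""
    flags = [is_subordinate for unit in units for is_subordinate in unit]
    return 100 * sum(flags) / len(flags)

print(clauses_per_as_unit(as_units))       # 6 clauses over 3 AS-units
print(subordination_percentage(as_units))  # 3 of 6 clauses are subordinate
```

Both figures depend entirely on the segmentation and coding decisions made beforehand, which this sketch takes as given.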

Other measures of syntactic complexity

Norris and Ortega (2009) suggested that studies should incorporate the following measures of complexity: (a) length-based variables (such as words per chosen unit) as an overall measure of syntactic complexity, (b) subordination-based variables, (c) variables of phrasal complexity (i.e., clause length), and (d) coordination-based variables (i.e., amount of coordination).

Accuracy

Accuracy as “freedom from error” (Skehan, 1996) includes sub-constructs such as percentage of error-free clauses, errors per AS unit, errors per 100 words, and target-like use of verbs, plurals, and tenses. Accuracy is a “straightforward and internally consistent construct” (Housen & Kuiken, 2009; Pallotti, 2009), and yet it is not as simple as it seems conceptually. As accuracy is assessed by how much a second language performance “deviates from the norm” (Housen et al., 2012; Pallotti, 2009; Wolfe-Quintero et al., 1998), it is difficult to define the “nature” of the norm. It is debatable whether accuracy norms or criteria should be based solely on native target language norms or also on certain non-native norms that are “acceptable in some social contexts” (Ellis, 2008; James, 1998; Polio, 1997). This ambiguity within accuracy makes it difficult to apply its definition strictly to second language performance. When it comes to accuracy, “more is not better”: Sanell (2007, as cited in Pallotti, 2009) discovered that advanced second-language French learners employed a phrase that was regarded as erroneous by standard French norms but was commonly used for “sociolinguistic comfort.” This creates an ambiguous situation in which determining an appropriate definition of accuracy will be challenging.

Fluency

Fluency is “the production of language in real time without undue pausing or hesitation” (Ellis & Barkhuizen, 2005). Fluency can be analyzed by measures of (a) silence (breakdown fluency), (b) reformulation, replacement, false starts, and repetition (repair fluency), (c) speech rate (e.g., words/syllables per minute), and (d) automatization, through measures of length of run (Koponen & Riggenbach, 2000). Skehan (2003) and Tavakoli and Skehan (2005) categorized fluency sub-components into speed fluency, breakdown fluency, and repair fluency.

The varying sub-constructs of CAF make the three dimensions multidimensional and multi-componential (Housen & Kuiken, 2009). Furthermore, CAF measures are interrelated with one another in non-linear ways, as Skehan (2009) characterized the interplay between the three measures using the Trade-off Hypothesis and Levelt’s (1989) speech production model. According to Housen et al. (2012), CAF does not develop “collinearly,” but the “distinct dimensions” may interact and interrelate in a variety of ways. As a result, utilizing CAF to assess and measure language is an intricate and challenging task (Michel, 2017). It is critical to investigate all three variables in a study rather than individual measures to grasp the intricacy and interconnectedness of CAF (Larsen-Freeman, 2009).

Research Questions

With the co-existence of different “definitions and interpretations” of the measures of CAF (Housen & Kuiken, 2009), it becomes essential to understand the interaction among the measuring variables. Therefore, the present review has been undertaken to explore the perspectives of researchers in terms of the suitability of certain measures and also put forward certain measures apt for assessing the oral production of second language learners. Thus, the study seeks to provide insights into the following research questions:

(a) What are the sub-dimensions of Complexity, Accuracy, and Fluency applied in previous studies?

(b) To what extent are the varied sub-dimensions of CAF suitable for assessing second language learners?

(c) Which measures of CAF will be appropriate for analyzing the oral production of learners, specifically for beginners, intermediate, and advanced language learners?

Methodology Implemented for the Present Review

The included studies were selected as per the inclusion and exclusion criteria.

Search Strategy of Studies

The journal databases of Scopus, Web of Science, SAGE Journals, Science Direct, Springer links, Taylor and Francis Online, ERIC, and ProQuest were searched for research articles and theses, using relevant keywords. In the databases, the following search term combinations were entered: “complexity, accuracy, and fluency as measures of assessing the English language,” or “CAF for assessing tasks,” or “assessing oral performance in task-based studies,” and “accuracy as a measure,” or “components of accuracy;” “fluency as a measure,” or “components of fluency” and “complexity measures” or “components of complexity.” More similar articles were found by searching the reference lists of all the related articles. Finally, 36 studies were selected (31 research papers and 3 doctoral dissertations, 1 Master’s dissertation, and 1 doctoral thesis) from 1999 to 2019 (see Figure 1). As CAF has gained prominence since the mid-nineties, the review included studies from the 1990s. The selected empirical studies focused on assessing task-based lessons with complexity, accuracy, and fluency measures.

Figure 1.

Flow diagram illustrating the review selection process.

Inclusion and exclusion criteria

The included studies dealt with task-based language teaching and the assessment of oral language output with the three constructs: complexity, accuracy, and fluency. Studies that dealt with all three components (CAF) or even just one or two of them were included as well. Articles published in journals from 1999 to 2019 and published/unpublished theses and dissertations were included. Only quantitative studies dealing with speaking skills were included, where the measures of CAF were empirically tested. Furthermore, only those articles and theses that were public and accessible through open access portals and the authors’ university access were included; as a result, not all studies could be included. Studies based on the teaching of languages other than English were excluded, as were studies based on language skills other than speaking. Qualitative studies and meta-analyses were not included in the present review.

The studies were coded and selected by the two authors. Their interrater agreement at the initial screening based on title and abstract was strong. At the time of abstract screening, disagreements over including other languages for CAF were resolved. The authors evaluated each article independently with a series of yes/no questions based on some predetermined elements such as study design, use of language assessment measures, and outcomes (Appendix 1). The interrater reliability coefficient was strong at each stage. The included empirical studies were categorized and analyzed under the following headings:

a) Name of author and year of study; b) Sub-components of Complexity; c) Sub-components of Accuracy; d) Sub-components of Fluency.

Results

A rigorous review laid forth the variation with which CAF and its sub-constructs were defined, and the multiple sub-constructs used in previous studies have been arranged in a tabular form (Table 1, Table 2, and Table 3).

Table 1.

Sub Components of Complexity and the Respective Studies.

Lexical Complexity	Studies
Type token ratio	Fujita, 2006; Thurman, 2008; Yuan, 2001; Yuan & Ellis, 2003
Guiraud’s index	Gilabert, 2007a; Levkina & Gilabert, 2012; Malicka, 2018; Sample & Michel, 2014
D-score	Bui, 2019; De Jong & Vercellotti, 2016; Malicka, 2018; Qian, 2010; Révész et al., 2016; Sample & Michel, 2014; Vercellotti, 2012, 2015
MTLD (Measure of Textual Lexical Diversity)	Nitta & Nakatsuhara, 2014
Lexical density (content words/total words)	Bui, 2019; Révész et al., 2016
Lexical sophistication (no. of low frequency words)	Bui, 2019; Qian, 2010
Lexical frequency	Révész et al., 2016
Syntactic complexity	Studies
Syntactic variety	Ahmadian, 2011, 2012; Ahmadian & Tavakoli, 2010, 2014; Khaerudin, 2014; Révész et al., 2016; Yuan, 2001; Yuan & Ellis, 2003
Mean length of AS unit or words per AS-unit	Ahmadian, 2011, 2012; Ahmadian & Tavakoli, 2014; Bamanger & Gashan, 2015; De Jong & Vercellotti, 2016; Inoue, 2016; Lou et al., 2016; Malicka, 2018; Qian, 2010; Révész et al., 2016; Sample & Michel, 2014; Tavakoli et al., 2016; Vercellotti, 2012, 2015
Subordination (No. or ratio of clauses per AS unit)	Ahmadian, 2011, 2012; Ahmadian & Tavakoli, 2010, 2014; Bamanger & Gashan, 2015; BavaHarji et al., 2014; Bui, 2019; De Jong & Vercellotti, 2016; Foster & Skehan, 2013; Geng & Ferguson, 2013; Inoue, 2016; Levkina & Gilabert, 2012; Malicka, 2018; Nitta & Nakatsuhara, 2014; Qian, 2010; Révész et al., 2016; Saeedi, 2015; Sample & Michel, 2014; Tavakoli et al., 2016; Vercellotti, 2012
Phrasal complexity (words per clause)	Bui, 2019; De Jong & Vercellotti, 2016; Inoue, 2016; Malicka, 2018; Révész et al., 2016; Vercellotti, 2012
Frequency of prepositions	BavaHarji et al., 2014
Frequency of conjunctions	BavaHarji et al., 2014
Number of coordination	Inoue, 2016
Clauses per T unit	Fujita, 2006; Gilabert, 2007a; Yuan & Ellis, 2003
Clauses per C unit	Foster & Skehan, 1999

Table 2.

Subcomponents of Accuracy and the Respective Studies.

Sub-component of accuracy	Studies
Percentage of error free clauses	Abdi et al., 2012; Ahmadian, 2011; Ahmadian & Tavakoli, 2010, 2014; Bamanger & Gashan, 2015; BavaHarji et al., 2014; Bui, 2019; De Jong & Vercellotti, 2016; Foster & Skehan, 1999, 2013; Fujita, 2006; Inoue, 2016; Khaerudin, 2014; Khoram & Zhang, 2019; Qian, 2010; Rafie et al., 2015; Saeedi, 2015; Shafaei et al., 2013; Tavakoli et al., 2016; Thurman, 2008; Vercellotti, 2012, 2015; Yuan, 2001; Yuan & Ellis, 2003
Errors per 100 words	Bamanger & Gashan, 2015; Bui, 2019; Geng & Ferguson, 2013; Inoue, 2016; Lou et al., 2016; Malicka, 2018; Nitta & Nakatsuhara, 2014; Qian, 2010; Révész et al., 2016
Errors per AS unit	Gilabert, 2007b; Inoue, 2016; Levkina & Gilabert, 2012; Malicka, 2018; Sample & Michel, 2014
Target verbs/tenses	Ahmadian, 2011; Ahmadian & Tavakoli, 2010, 2014; BavaHarji et al., 2014; Khoram & Zhang, 2019; Tavakoli et al., 2016; Thurman, 2008; Yuan, 2001; Yuan & Ellis, 2003
Target like use of plurals	BavaHarji et al., 2014
Target like use of preposition	Malicka, 2018
Target like use of tenses, modal verbs, and connectors	Révész et al., 2016
Use of articles	Ahmadian, 2012
Error-free AS unit	De Jong & Vercellotti, 2016; Sample & Michel, 2014; Vercellotti, 2012
Article error	Sample & Michel, 2014
Percentage of self-repairs	Ahmadian & Tavakoli, 2014; Gilabert, 2007a; 2007b
Number of corrections per 100 words	De Jong & Vercellotti, 2016; Gilabert, 2007b
Errors per T-unit	Ahangari & Abdi, 2011

Table 3.

Sub Components of Fluency and the Respective Studies.

Sub-components of fluency	Studies
Replacement	Abdi et al., 2012; Bamanger & Gashan, 2015; Bui, 2019; Bui & Huang, 2016; Foster & Skehan, 1999; Lou et al., 2016; Qian, 2010
Reformulations	Abdi et al., 2012; Bamanger & Gashan, 2015; Bui, 2019; Bui & Huang, 2016; De Jong & Vercellotti, 2016; Foster & Skehan, 1999, 2013; Lou et al., 2016; Malicka, 2018; Nitta & Nakatsuhara, 2014; Qian, 2010; Tavakoli et al., 2016
Number of false starts	Abdi et al., 2012; Bamanger & Gashan, 2015; Bui, 2019; Bui & Huang, 2016; Foster & Skehan, 1999; Malicka, 2018; Qian, 2010; Révész et al., 2016; Tavakoli et al., 2016
Repetition	Abdi et al., 2012; Bamanger & Gashan, 2015; Bui, 2019; Bui & Huang, 2016; De Jong & Vercellotti, 2016; Foster & Skehan, 1999; Lou et al., 2016; Malicka, 2018; Nitta & Nakatsuhara, 2014; Qian, 2010; Révész et al., 2016; Tavakoli et al., 2016; Thurman, 2008
Hesitation	Tavakoli et al., 2016
Self-repairs	Révész et al., 2016
Self-corrections	Lambert et al., 2017; Nitta & Nakatsuhara, 2014
No. of pauses	BavaHarji et al., 2014; Foster & Skehan, 1999; Lou et al., 2016; Qian, 2010; Révész et al., 2016
Mid clause pauses	Bui, 2019; Bui & Huang, 2016; Foster & Skehan, 2013; Lambert et al., 2017; Tavakoli et al., 2016
End clause pauses	Bui & Huang, 2016; Lambert et al., 2017; Tavakoli et al., 2016
Mean length of pause	Bui & Huang, 2016; De Jong & Vercellotti, 2016; Nitta & Nakatsuhara, 2014; Tavakoli et al., 2016; Vercellotti, 2012
Dependent clause pauses with length	Bui & Huang, 2016
Filled pauses	Bui & Huang, 2016; Sample & Michel, 2014; Tavakoli et al., 2016
Total word count (per minute)	Bui & Huang, 2016; Lou et al., 2016; Nitta & Nakatsuhara, 2014; Sample & Michel, 2014; Qian, 2010; Thurman, 2008
No. of words per 90 seconds	BavaHarji et al., 2014
Words per AS-unit	Lou et al., 2016
Word count for 2 minutes	Thurman, 2008
Mean length of fluent run (average number of syllables)	Bui & Huang, 2016; De Jong & Vercellotti, 2016; Tavakoli et al., 2016; Vercellotti, 2012; 2015
Speed fluency (speaking time/syllables)	Révész et al., 2016
Pruned speech rate B	Ahmadian, 2011, 2012; Ahmadian & Tavakoli, 2010, 2014; Bui, 2019; Bui & Huang, 2016; Fujita, 2006; Geng & Ferguson, 2013; Gilabert, 2007a; Lambert et al., 2017; Levkina & Gilabert, 2012; Malicka, 2018; Saeedi, 2015; Tavakoli et al., 2016; Yuan, 2001; Yuan & Ellis, 2003
Unpruned speech rate A	Ahmadian & Tavakoli, 2010, 2014; Ahmadian, 2011, 2012; Fujita, 2006; Malicka, 2018; Tavakoli et al., 2016; Yuan, 2001; Yuan & Ellis, 2003
Phonation time ratio	Bui & Huang, 2016; De Jong & Vercellotti, 2016; Tavakoli et al., 2016; Vercellotti, 2012
Task completion time	Sample & Michel, 2014

Measures of Lexical and Syntactic Complexity

The subcomponents of lexical and syntactic complexity as used in previous research works are summarized (Table 1).

Lexical complexity

Lexical diversity

The review found that the type-token ratio was used by some researchers (Fujita, 2006; Thurman, 2008; Yuan, 2001; Yuan & Ellis, 2003). While Thurman (2008) used the measure on lower proficiency students, Fujita (2006) used the same measure for assessing advanced learners. The type-token ratio was considered unstable; hence, other measures were employed. Révész et al. (2016) used the D-formula developed by Malvern and Richards (1997) for assessing lexical diversity (McKee et al., 2000). This program works only if the text contains more than 50 tokens and is a “text internal measure” that considers only words used in the text. Although the D score or VocD has been criticized for being sensitive to text length (McCarthy & Jarvis, 2007), some researchers have successfully used it (De Jong & Vercellotti, 2016; Vercellotti, 2012, 2015). Even McCarthy and Jarvis (2007) stated that the D measure is better than most other measures. Researchers have also used Guiraud’s index (Gilabert, 2007a; Malicka, 2018; Sample & Michel, 2014), which divides the number of types by the square root of the number of tokens (and is therefore also known as root TTR). Guiraud’s index compensates for the decrease in TTR as text length increases (Daller, 2010).

Lexical density and lexical sophistication

Out of the studies reviewed, very few had used lexical density (e.g., Bui, 2019; Révész et al., 2016) and lexical sophistication (e.g., Bui, 2019; Qian, 2010). Lexical frequency was used by Révész et al. (2016). P-Lex was used for assessing lexical sophistication in the studies. Vercellotti (2012) indicated that P-Lex lacks robustness as it is topic-specific, and not all topics elicit difficult words. Moreover, a single difficult word repeated several times is counted as difficult on each occurrence, even in the absence of a wide variety of words in the text.

Out of the lexical measures of density, diversity, and sophistication, Daller et al. (2007, as cited in Qian, 2010) pointed out that lexical sophistication and lexical diversity are highly related to language development and assessment. Likewise, Bui (2019) discovered that the two lexical sub-constructs mentioned above are unrelated to each other and are thus appropriate as separate metrics. The D measure and Guiraud’s index have been used for lexical diversity more than other measures (e.g., Malicka, 2018; Sample & Michel, 2014), suggesting that researchers consider them effective.

Syntactic complexity

General measures

Subordination appears to have been used in studies where the majority of the learners were in the lower intermediate, upper-intermediate, and advanced levels (Ahmadian & Tavakoli, 2010; 2014; Ahmadian, 2011; 2012; De Jong & Vercellotti, 2016; Fujita, 2006; Foster & Skehan, 2013; Geng & Ferguson, 2013; Inoue, 2016; Révész et al., 2016; Vercellotti, 2012) suggesting that this measure may be more appropriate for intermediate and advanced learners. Khaerudin (2014) employed another grammatical measure, syntactical variety, to find the number of verbs, as proposed by Yuan and Ellis (2003) because subordination as a measure failed to yield results for beginner-level learners.

Inoue (2016) used the four measures for measuring complexity following Norris and Ortega (2009)—words per unit for syntactic complexity, subordination-based variables, phrasal complexity, coordination-based variables. Inoue (2016) found that for measuring complexity, appropriate tasks need to be selected after piloting as his study reported that subordination was unexpectedly more in a less complex task which went against the established norm that complex tasks elicit complex language. Hence, Inoue (2016) suggested proper selection of tasks for measuring subordination.

The choice of syntactic measures may determine the research outcomes in terms of complexity concerning other variables like accuracy (Bui, 2019). Vercellotti (2012) suggested that grammatical complexity may be best measured using the three sub-constructs: length of AS unit, length of clauses, and the ratio of finite clauses to AS units. The present review showed that many studies employed clauses per AS unit (Ahmadian, 2011; 2012; Ahmadian & Tavakoli, 2010; 2014; Bamanger & Gashan, 2015; De Jong & Vercellotti, 2016; Foster & Skehan, 2013; Geng & Ferguson, 2013; Inoue, 2016; Levkina & Gilabert, 2012; Malicka, 2018) and words per AS unit (e.g., Ahmadian, 2011, 2012; Ahmadian & Tavakoli, 2014; Bamanger & Gashan, 2015; De Jong & Vercellotti, 2016; Inoue, 2016; Lou et al., 2016; Malicka, 2018; Qian, 2010; Révész et al., 2016; Sample & Michel, 2014; Tavakoli et al., 2016; Vercellotti, 2012, 2015).

Specific measures

The frequencies of conjunctions and prepositions are specific complexity measures that have been used in a few studies (BavaHarji et al., 2014; Yuan & Ellis, 2003). In the study by BavaHarji et al. (2014), the learners were primarily beginners and used simple short sentences before the experiment, so the number of conjunctions and prepositions was appropriate to understand the development level of learners’ oral production in terms of complexity.

Measures of Accuracy

The sub-components of accuracy are mentioned in Table 2.

General measures of accuracy

Previous researchers have dealt with the accuracy of spoken discourse in varying ways. General measures include errors per 100 words, errors in total AS units, percentage of error-free clauses. Researchers have used several general accuracy measures, particularly the percentage of error-free clauses (Ellis & Barkhuizen, 2005; Skehan & Foster, 1999).

Errors per 100 words

Errors per 100 words has also been used as a measure of accuracy (Inoue, 2016; Malicka, 2018; Nitta & Nakatsuhara, 2014; Révész et al., 2016). Vercellotti (2012) pointed out that using errors per 100 words spares the researcher from segmenting discourse into T-units or AS-units. However, Vercellotti (2012) also emphasized that the “100 words segment have no psycholinguistic reality” while idea units and AS units do have some reality. It may be pointed out that while accuracy may be assessed with errors or error-free clauses, the time limit for producing 100 words remains unclear. Beginners may find it difficult to produce 100 words, and a task requiring 100 words may be performed in fewer than 100 words. In such a circumstance, the appropriate remedy for assessment is uncertain.

Error-free clauses vs. Errors per unit

Researchers have also employed total errors per AS unit (e.g., Gilabert, 2007b; Inoue, 2016; Levkina & Gilabert, 2012; Malicka, 2018; Sample & Michel, 2014) as using errors as a benchmark reduces the probability of being subjective. If errors per AS unit or per clause are being used, then the definition of erroneous sentence and errors in syntax, morphology, tenses, lexical, etc., have to be decided beforehand. According to Skehan and Foster (1999), error-free clauses are appropriate for experimental design. As a result, many studies have used the percentage of error-free clauses as a measure of accuracy (Dembovskaya, 2009; Yuan & Ellis, 2003), where error-free clauses are divided by the total number of clauses and multiplied by 100 (Abdi et al., 2012; Ahmadian & Tavakoli, 2014; Ahmadian, 2011; Bamanger & Gashan, 2015; BavaHarji et al., 2014; De Jong & Vercellotti, 2016; Fujita, 2006; Shafaei et al., 2013; Thurman, 2008; Vercellotti, 2012, 2015). According to Skehan and Foster (2008), the disadvantage of this measure is that if the speakers use many short sentences, the scores may be inflated; to avoid this inflation, they proposed that all clauses be ranked in length, with a set criterion (usually 70% correct use), and the maximum length that meets the criteria be considered as “clause length accuracy score.”

Vercellotti (2012), in her work, used both error-free clauses as well as error-free AS units, and both resulted in high correlation, which led to the conclusion that only clause level accuracy would be adequate for assessment. Another problem in using an error-free AS unit, which Vercellotti (2012) faced, was that the learners were not proficient in producing lengthy error-free AS units. So, she concluded that error-free clauses would be sufficient to measure accuracy, and only if learners are of good proficiency level then error-free AS units can be used. Another variation of this measure is “errors per AS unit” as Bygate (2001) has emphasized that the use of errors per unit gives a clear picture of accuracy; hence, some studies used this measure (Gilabert, 2007b; Inoue, 2016; Malicka, 2018). Inoue (2016) used three measures: errors per AS unit, errors per 100 words, and percentage of error-free clauses, and concluded that errors per 100 words was the best fit for his study, where the learners’ levels ranged from basic to intermediate. Inoue (2016) pointed out that the differences in the denominator with clauses and words caused variation in the results when different measures were used.
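The effect of the denominator that Inoue (2016) describes can be made concrete with a small Python sketch of the three general accuracy measures. The `Clause` record, the error counts, and the sample are hypothetical; what counts as an error is assumed to have been decided and coded beforehand, and errors per 100 words is treated here as a rate normalized to 100 words, which is one possible reading.

```python
from dataclasses import dataclass

@dataclass
class Clause:
    words: int   # number of words in the clause
    errors: int  # number of coded errors in the clause

# Hypothetical coded sample: a list of AS-units, each a list of clauses.
sample = [
    [Clause(words=4, errors=0)],
    [Clause(words=5, errors=1), Clause(words=6, errors=0)],
    [Clause(words=5, errors=2)],
]

def all_clauses(units):
    """Flatten the AS-units into a single list of clauses."""
    return [clause for unit in units for clause in unit]

def percent_error_free_clauses(units):
    """Error-free clauses divided by total clauses, multiplied by 100."""
    clauses = all_clauses(units)
    error_free = sum(1 for c in clauses if c.errors == 0)
    return 100 * error_free / len(clauses)

def errors_per_as_unit(units):
    """Total errors divided by the number of AS-units."""
    return sum(c.errors for c in all_clauses(units)) / len(units)

def errors_per_100_words(units):
    """Total errors divided by total words, scaled to 100 words."""
    clauses = all_clauses(units)
    total_errors = sum(c.errors for c in clauses)
    total_words = sum(c.words for c in clauses)
    return 100 * total_errors / total_words

print(percent_error_free_clauses(sample))  # 2 of 4 clauses are error-free
print(errors_per_as_unit(sample))          # 3 errors over 3 AS-units
print(errors_per_100_words(sample))        # 3 errors over 20 words
```

The same three errors yield quite different figures depending on whether clauses, AS-units, or words form the denominator, and a clause with two errors weighs no more than a clause with one under the error-free-clause measure.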

Self-Repair

Though self-repair is a measure of fluency in terms of reformulations, it has been employed in a few studies as a measure of accuracy (Ahmadian & Tavakoli, 2014; Gilabert, 2007a; 2007b) since it demonstrates students’ awareness of forms and their efforts to be accurate (Kormos, 1999). Cognitively demanding tasks lead to more self-repairs (Gilabert, 2007b), so this measure may be used to see how much the learners are aware of their errors which will be more applicable for intermediate and advanced learners.

Specific measures of accuracy

Specific measures include correct verbs, articles, tenses, etc. Specific measures like verbs (Ahmadian, 2011; Ahmadian & Tavakoli, 2010; 2014; BavaHarji et al., 2014; Khoram & Zhang, 2019) or articles (Ahmadian, 2012) cannot capture the overall accuracy of learners, and quite likely, lexical flaws may go unnoticed in this process. A specific measure of accuracy is suitable in works where focused tasks and focused forms are used and tasks are designed carefully to elicit the target forms. However, in loosely structured and unfocused tasks, general measures are more suitable, as Ellis and Barkhuizen (2005, p. 151) recommend a general measure of accuracy, such as “percentage of error-free clauses or number of errors per 100 words.” In addition, general measures seemed to be more appropriate for assessing the accuracy of learners from different backgrounds (Vercellotti, 2012).

Measures of Fluency

Speech rate

Previous studies vary in how they have used word counts and syllable counts to measure fluency. Most of the studies used word count in 1 minute or 60 seconds (Geng & Ferguson, 2013; Lou et al., 2016; Qian, 2010; Sample & Michel, 2014), and some used syllables per 60 seconds in both pruned and unpruned speech (Ahmadian, 2011; Ahmadian & Tavakoli, 2010, 2014; Malicka, 2018; Tavakoli et al., 2016); a few used only pruned speech (Gilabert, 2007a; Levkina & Gilabert, 2012; Saeedi, 2015). BavaHarji et al. (2014) measured fluency and other constructs for “90 seconds,” explaining that most learners were beginners and were nervous initially, resulting in many pauses and repetitions. Therefore, they adopted the measure for 90 seconds for reliable results. Thurman (2008) measured 2 minutes of oral production, which shows that there is no particular reason for selecting 1 minute or more than 1 minute as the time limit for testing the fluency of the learners. Nonetheless, it was found that most of the measurement was done with a time length of 1 minute. Bui and Huang (2016) preferred tallying frequency-based measures with the standardized textual length approach (occurrences per 100 words/syllables) rather than the temporal length approach.
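The difference between the temporal length approach and the standardized textual length approach can be illustrated with a short Python sketch. It assumes that pruning means removing words belonging to repetitions, false starts, and reformulations before counting (a common convention that this review does not spell out); the function names and the sample figures are hypothetical.

```python
def rate_per_minute(unit_count: int, duration_seconds: float) -> float:
    """Words or syllables produced per 60 seconds of the speech sample."""
    return unit_count * 60 / duration_seconds


def per_100(occurrences: int, total_units: int) -> float:
    """Occurrences of a feature per 100 words or syllables."""
    return 100 * occurrences / total_units


# Hypothetical 90-second sample: 150 words in total, 18 of which belong to
# repetitions, false starts, or reformulations and are dropped when pruning.
total_words, repaired_words, seconds = 150, 18, 90.0

unpruned_rate = rate_per_minute(total_words, seconds)
pruned_rate = rate_per_minute(total_words - repaired_words, seconds)

# Textual length approach: 12 pauses expressed per 100 words, not per minute.
pauses_per_100_words = per_100(12, total_words)
```

Because the rate is normalized to 60 seconds, samples of 60, 90, or 120 seconds can be compared on the same scale, whereas the per-100-words figure is independent of how long the learner spoke.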

Instead of silence measures, De Jong et al. (2012) recommend utilizing the phonation time ratio, which is defined as “the percentage of time spent speaking as a percentage proportion of the time taken to produce the speech sample.” Likewise, Vercellotti (2012) and De Jong and Vercellotti (2016) used the phonation time ratio. Vercellotti (2012) considered speech rate and articulation rate irrelevant for her study because neither metric captured the length and number of pauses, which are supposedly among the essential markers of dysfluency. Pruned speech is recommended by Bui and Huang (2016) because it provides accurate speech rates and is sensitive to subtle changes in task characteristics and task conditions. Thus, most researchers prefer pruned speech (Ahmadian, 2011, 2012; Ahmadian & Tavakoli, 2010, 2014; Bui & Huang, 2016; Fujita, 2006; Lambert et al., 2017).
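The phonation time ratio quoted above reduces to simple arithmetic once the pauses in a sample have been timed. A minimal Python sketch, assuming the pause durations have already been extracted from the recording (the function name and the sample values are hypothetical):

```python
def phonation_time_ratio(total_seconds: float, pause_durations: list[float]) -> float:
    """Percentage of the total sample time that was spent speaking."""
    speaking_seconds = total_seconds - sum(pause_durations)
    return 100 * speaking_seconds / total_seconds


# Hypothetical 60-second sample containing five pauses (6 seconds in total).
ratio = phonation_time_ratio(60.0, [0.5, 1.5, 2.0, 0.75, 1.25])
```

Unlike a words-per-minute rate, this figure changes directly with the length and number of pauses, which is the property Vercellotti (2012) found missing in speech rate and articulation rate.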

Pauses and silence as breakdown fluency

According to Ellis (2003), fluency can be measured in terms of the “number of pauses” (i.e., a break of 2.0 seconds or longer either within a turn or between turns). The previous studies vary in their use of the duration of pauses and the type of pauses (filled and unfilled pauses) that they took into their calculations. Some studies defined pause as a silence that is over 0.2 seconds long (De Jong & Perfetti, 2011; Nitta & Nakatsuhara, 2014; Vercellotti, 2012, 2015), some as 250 milliseconds (Révész et al., 2016) within an utterance, while some set 0.4 seconds or 1 second as their criterion (e.g., Skehan & Foster, 1997). A few researchers utilized pauses that were shorter in duration (e.g., Tavakoli & Skehan, 2005; Wigglesworth & Elder, 2010). However, pauses in speech should not be attributed simply to a lack of fluency (Freed, 2000, p. 256). Because a little pausing is deemed normal in the first language, occasional pauses in second language speech can be considered normal as well. Huensch and Tracy-Ventura (2016) investigated the fluency of first-language Spanish, French, and English speakers. They discovered that the first language English speakers paused a lot because English as a language has an expansive syllable inventory, which means there are more phonemes within each syllable, resulting in more pauses while speaking. They also found that first language fluency usually influences second language fluency, so individual differences are bound to be there.
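The effect of the pause criterion on the resulting pause count can be made concrete with a small Python sketch. It assumes that silent intervals have already been measured in seconds and that a silence counts as a pause when it is at or above the threshold (studies differ on whether the boundary is inclusive); the silence durations are invented for illustration.

```python
def count_pauses(silences: list[float], threshold: float) -> int:
    """Count silent intervals (in seconds) at or above the threshold."""
    return sum(1 for silence in silences if silence >= threshold)


# Hypothetical silent intervals from one speech sample, in seconds.
silences = [0.15, 0.22, 0.3, 0.45, 0.9, 1.2, 2.4]

# Thresholds reported in the studies reviewed above.
thresholds = [0.2, 0.25, 0.4, 1.0, 2.0]
counts = {threshold: count_pauses(silences, threshold) for threshold in thresholds}
# counts == {0.2: 6, 0.25: 5, 0.4: 4, 1.0: 2, 2.0: 1}
```

The same recording yields anywhere from six pauses to one depending on the criterion, so pause counts from studies that use different thresholds are not directly comparable.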

While some studies have measured the number of pauses, few have calculated the length of pauses, such as Vercellotti (2012). Some have calculated filled pauses and unfilled pauses or either of them. Some studies have emphasized mid-clause pauses as more appropriate to assess fluency for non-native speakers (Foster & Skehan, 2013; Skehan & Foster, 2008) as not much difference has been noticed in the end clause pauses in both first and second language speakers (De Jong, 2016; Tavakoli, 2011). However, mid-clause pausing is more common in less proficient second-language speakers. Skehan and Foster (2008) further pointed out that there is a difference between native and non-native speakers in pausing and in the pattern of pause location. Non-native speakers’ pauses become more prominent in unpredictable places, whereas native speakers pause as well, but it is less noticeable. De Jong et al. (2012) also reiterated that individual differences hold true for native and non-native speakers with regard to fluency. Thus, pauses have been interpreted in various ways, and the length of pauses helps to understand the fluency attained by learners over time, whereby long pauses may reduce with adequate oral practice.

Repair fluency measures

Repair fluency reflects “adjustments and improvements” which learners make in their performance (Foster & Skehan, 1999). This comprises reformulation whereby the learner rephrases syntax, morphology, or word order; replacement is a change of vocabulary while speaking; repetition is repeating of words or string of words, and false start is abandoning the utterance and starting afresh (Foster et al., 2000). Nitta and Nakatsuhara (2014) view that, unlike breakdown fluency, repair fluency may not necessarily indicate dysfluency because learners repairing their language can be a sign of improvement. Repair fluency has been used in many studies with good effect (Abdi et al., 2012; Bamanger & Gashan, 2015; Bui, 2019; Bui & Huang, 2016; De Jong & Vercellotti, 2016; Foster & Skehan, 1999, 2013; Lou et al., 2016; Malicka, 2018; Qian, 2010).

Discussion

Most Appropriate Measures of Complexity, Accuracy, and Fluency

The review of studies as presented in the results section indicated the varied sub-constructs used by researchers in the studies, and researchers discovered that some measures were more effective than others. As a consequence of the findings, which revealed the most commonly used and relevant CAF measures, the current study proposes measures that may be utilized to assess beginners, intermediates, and advanced learners.

Most appropriate measures of complexity

Based on the measures found in previous studies, general measures appear to be the most appropriate because they allow for comparison between studies (Foster & Skehan, 1996). In contrast, Norris and Pfeiffer (2003) claimed that specific measures such as target grammatical items could measure complexity more robustly than a general measure. It seems, however, for assessing beginners, complexity by coordination may be more suitable (Norris & Ortega, 2009), while phrasal complexity may work well with more advanced second language learners (Ortega, 2003). Some increasingly used measures may be selected for intermediate and advanced learners, such as words per AS unit, clauses per AS unit, phrasal complexity (Norris & Ortega, 2009; Vercellotti, 2012). For assessing lexical variety, D score and Guiraud’s index (Révész et al., 2016) may be used as the type-token ratio has limitations of being affected by text length.
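The text-length limitation of the type-token ratio can be illustrated by comparing it with Guiraud’s index, which divides the number of types by the square root of the number of tokens. The Python sketch below is a simplified illustration that assumes whitespace tokenization, lowercasing, and no lemmatization; the D score is not implemented because it requires a curve-fitting procedure. The function names and the sample sentence are hypothetical.

```python
import math


def type_token_ratio(tokens: list[str]) -> float:
    """Number of distinct word forms divided by the number of tokens."""
    return len(set(tokens)) / len(tokens)


def guiraud_index(tokens: list[str]) -> float:
    """Number of distinct word forms divided by the square root of tokens."""
    return len(set(tokens)) / math.sqrt(len(tokens))


# Hypothetical sample: the longer text reuses the same four word types.
short_text = "the boy saw the dog".lower().split()   # 5 tokens, 4 types
long_text = short_text * 4                           # 20 tokens, 4 types

ttr_short, ttr_long = type_token_ratio(short_text), type_token_ratio(long_text)
g_short, g_long = guiraud_index(short_text), guiraud_index(long_text)
```

In this invented case the type-token ratio drops from 0.8 to 0.2 as the text grows fourfold, while Guiraud’s index only halves, so the square-root correction dampens the length effect without removing it entirely.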

Lexical complexity may not be used as a measure for assessing beginners, as Skehan (2009) stressed that native speakers’ complexity might be “unidimensional,” with lexical and structural complexity being hand in hand. However, for the second language, non-native learners, the two areas may not be integrated, and focusing on lexis may lead to poor performance in other areas. Therefore, depending on the objective of the study and the learners’ proficiency level, researchers ought to use their discretion while selecting the measures for oral complexity.

Most appropriate measures of accuracy

For measuring accuracy, a general measure of “percentage of error-free clauses” is appropriate if the tasks are loosely structured, and unfocused tasks are employed. Depending on the learners’ competence level and the tasks they are assigned, “errors per 100 words” might be used as a general measure of accuracy. Specific measures, such as the number of accurately used tenses, verbs, etc., may be used for focused tasks. General measures such as “percentage of error-free clauses” should be used along with specific measures. For beginners, identifying their specific areas of accuracy concern and working solely on those would be preferable, and specific metrics could be quite useful. At the same time, both specific and general measures may provide a comprehensive view of the accuracy level for intermediate and advanced learners. As Michel (2017) said, global accuracy measures may assist in assessing accuracy across “different languages, populations, and tasks.” However, they may not be apt for capturing slight differences in high proficiency learners. Michel (2017) said that specific measures might capture the slight changes. This indicates that combining both general and specific measures can yield an effective result.

Attaining accuracy in a foreign language is no mean feat; hence, while assessing beginners, researchers need to be less stringent and use their discretion. As Pallotti (2009) noted, weighing learners’ errors is not an easy task, and Foster and Wigglesworth (2016, p. 112) stated that “anyone who has worked on assessing accuracy in L2 data will know this only too well; some degree of personal judgment has to be invoked occasionally.”

Most appropriate measures of fluency

De Jong et al. (2012) emphasized that for beginners, articulation rate was the best predictor of fluency, and for intermediate learners, mean length of run was the best predictor. Therefore, based on the critical analysis of the studies measuring fluency, the indices that may be appropriate for beginners and low-intermediate language learners are: articulation rate (Tavakoli et al., 2016), number of mid-clause pauses per 100 words (Foster & Skehan, 2013; Skehan & Foster, 2008), and reformulations (repair fluency) per 100 words (Foster & Skehan, 2013).

For upper-intermediate and advanced language learners, the appropriate indices are: mean length of run (De Jong et al., 2012); pruned speech rate; repair fluency, including false starts, reformulation, replacement, and repetition; and number of mid-clause pauses (Bui, 2019; Foster & Skehan, 2013).
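Mean length of run is not defined formally in this review. The Python sketch below assumes a commonly used definition, the mean number of syllables produced between pauses at or above a chosen threshold, and assumes that the transcript has been segmented into stretches of speech with the silence that follows each one. The input format, the default threshold of 0.25 seconds, and the sample values are hypothetical.

```python
from __future__ import annotations


def mean_length_of_run(
    segments: list[tuple[int, float]], pause_threshold: float = 0.25
) -> float:
    """Mean syllables per run, where a run ends at a pause >= pause_threshold.

    Each segment is (syllable_count, silence_after_in_seconds).
    """
    runs: list[int] = []
    current = 0
    for syllables, silence_after in segments:
        current += syllables
        if silence_after >= pause_threshold:
            runs.append(current)  # the silence is long enough to end the run
            current = 0
    if current:
        runs.append(current)      # final run with no qualifying pause after it
    if not runs:
        raise ValueError("no syllables in sample")
    return sum(runs) / len(runs)


# Hypothetical sample: runs of 7, 8, and 5 syllables at the 0.25 s threshold.
segments = [(4, 0.1), (3, 0.4), (6, 0.05), (2, 0.3), (5, 0.0)]
mlr = mean_length_of_run(segments)
```

With a stricter threshold of 1.0 second, none of the silences in this sample counts as a pause and the whole sample becomes a single run of 20 syllables, so the measure inherits the threshold sensitivity of pause counts.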

The duration of pauses for beginners and intermediate learners may be different. Beginners may be assessed for longer pauses as short pauses may be found more in their speech. Thurman (2008) avoided this measure altogether in his study as Japanese students learning English as a second language were beginners, and they were used to pausing between words. Harumi (2002, as cited in Thurman, 2008) stated that Japanese students are silent in English language classes as they lack confidence and have time management issues. This is true for most second language learners. Therefore, mid-clause pauses may be an appropriate measure for capturing the breakdown fluency of non-native learners (Skehan & Foster, 2008).

Skehan and Foster (2008) pointed out that repetition is found more in non-natives, and even planning time does not seem to reduce repetition, which is otherwise beneficial to native speakers. Therefore, instead of using repetition for assessing the fluency of beginner and intermediate learners, it may be used effectively for advanced learners. With studies assessing fluency with different repair measures and breakdown fluency measures, it seems that fluency may be assessed by using the measures of pauses and number of utterances in a given time; however, it should not be assessed at the cost of accuracy. Incomprehensible words spoken fluently may not serve the purpose of learning a language. Therefore, the amount of leniency in assessing fluency should rest with the researcher’s discretion.

Educational Implications

The findings of this study have ramifications for both English language teachers and institutions where English is taught as a second language including the teacher education courses/institutions. The goal of teaching a second language should be to make students feel at ease in the language. Language assessment is critical for teachers to identify the effectiveness of the course they are teaching as well as the language areas in which the students need to improve. Teachers cannot employ CAF constructs in classrooms because there is a traditional testing system in place. However, after an interactive session, the teacher may utilize distinctive fluency, accuracy, and complexity subconstructs to assess the students to see if the tasks and lessons helped them improve their language fluency, accuracy, and complexity. Teachers will be able to design better tasks and activities by using these evaluations, and they will be able to focus on the specific subconstruct that requires attention. Thus, if teachers and educational institutions are aware of the constructs of language proficiency, they can use them to monitor the students’ language proficiency and adapt their language curriculum for the desired improvements in students.

Summary and Conclusion

A limitation of this review that needs to be acknowledged is that it includes only those studies that were accessible in the database. Yet, the empirical studies selected and reviewed were of a wide range, which has provided an understanding of the manifold nature of CAF and its sub-components. The present study critically presented the different perspectives of researchers on the effectiveness of certain measures and concluded that CAF measures must be finalized based on learners’ proficiency levels. For instance, a beginner’s achievement in the second language initially should be based on words or syllables being uttered, with repetition and pauses measured subsequently. The present review further deduced that there is no fixed way of analyzing second language production using CAF, though undeniably, it remains a “scientifically valid and informative” (Pallotti, 2009) way of measuring language production.

Another finding of this review was that researchers preferred using more than one sub-component of each construct as they indicated that more than one sub-component would give an impartial result. Emphasis was on using both general and specific measures to obtain impartial results. The wide array of sub-constructs for assessing accuracy, fluency, and complexity gives flexibility to the researchers to select the sub-construct most suitable for their settings and participants, and this is one of the advantages of employing CAF measures. Another advantage is that it is one of the best ways that language practitioners have found to quantify language production. This quantification aids in understanding the extent of development in the language areas of accuracy, fluency, and complexity. CAF measures are used as dependent variables in relation to planning and tasks, and they help researchers decide on the type of tasks and planning that may aid in developing CAF with the least trade-off effects. The limitation of using CAF for assessing oral language production is that it requires expertise and proficiency for assessing the students’ oral proficiency. Assessing with CAF may be “fast and reliable” with “computerized tools” (Michel, 2017) and the use of technology. However, in different countries and regions where second language accents and pronunciation are unlike those of native speakers, the computerized tools may not be the best methods, and manual calculations for a large sample can be tedious.

The present review was limited to studies with oral proficiency; therefore, future reviews on the sub-constructs of CAF used in written proficiency will be an additional insight into the most appropriate measures of language performance according to task modality. Further studies with a distinct focus on the language performance of learners with different proficiency levels would aid in gaining insight into the most suitable measures for assessing second language learners.

Footnotes

Acknowledgments

The first author wishes to thank her guide and mentor, Professor Santoshi Halder (Second Author) for the step-by-step guidance and her valuable inputs.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iDs

Shazia Hasnain

Santoshi Halder

References

Abdi

Eslami

Zahedi

(2012). The impact of pre-task planning on the fluency and accuracy of Iranian EFL learners’ oral performance. Procedia-Social and Behavioral Sciences, 69, 2281−2288. https://doi.org/10.1016/j.sbspro.2012.12.199

Ahangari

Abdi

(2011). The effect of pre-task planning on the accuracy and complexity of Iranian EFL learners’ oral performance. Procedia - Social and Behavioral Sciences, 29(2), 1950−1959. https://doi.org/10.1016/j.sbspro.2011.11.445

Ahmadian

(2012). The effects of guided careful online planning on complexity, accuracy and fluency in intermediate EFL learners’ oral production: The case of English articles. Language Teaching Research, 16(1), 129−149. https://doi.org/10.1177/1362168811425433

Ahmadian

M. J.

(2011). The effect of ‘massed’ task repetitions on complexity, accuracy and fluency: Does it transfer to a new task? Language Learning Journal, 39(3), 269−280. https://doi.org/10.1080/09571736.2010.545239

Ahmadian

M. J.

Tavakoli

(2010). The effects of simultaneous use of careful online planning and task repetition on accuracy, complexity, and fluency in EFL learners’ oral production. Language Teaching Research, 15(1), 35−59. https://doi.org/10.1177/1362168810383329

Ahmadian

M. J.

Tavakoli

(2014). Investigating what second language learners do and monitor under careful online planning conditions. Canadian Modern Language Review, 70(1), 50−75. https://doi.org/10.3138/cmlr.1769

Alderson

J. C.

(1991). Language testing in the 90s: How far have we come? How much further have we to go? In Anivan

(Ed.), Current development in language testing. Seameo regional language center.

Amirian

S. M. R.

Moqaddam

H. H.

Moqaddam

Q. J.

(2017). Critical analysis of the models of language proficiency with a focus on communicative models. Theory and Practice in Language Studies, 7(5), 400−407. https://doi.org/10.17507/tpls.0705.11

Bachman

L. F.

(1990). Fundamental considerations in language testing. OUP.

10.

Bachman

L. F.

(2000). Modern language testing at the turn of the century: Assuring that what we count counts. Language Testing, 17(1), 1−42. https://doi.org/10.1191/026553200675041464

11.

Bachman

L. F.

Palmer

A. S.

(1996). Language testing in practice. Oxford University Press.

12.

Bamanger

E. M.

Gashan

A. K.

(2015). The effect of planning time on the fluency, accuracy, and complexity of EFL learners’ oral production. Journal of Educational Sciences, 27(1), 1−15. https://doi.org/10.33948/1158-027-001-008

13.

BavaHarji

Gheitanchian

Letchumanan

(2014). The effects of multimedia task-based language teaching on EFL learners’ oral EFL production. English Language Teaching, 7(4), 11−24. https://doi.org/10.5539/elt.v7n4p11

14.

Bell

(2003). Using frequency lists to assess L2 texts. (Unpublished PhD thesis), University of Swansea.

15.

Bui

(2019). Task-readiness and L2 task performance across proficiency levels. In Researching L2 task performance and pedagogy: In honour of Peter Skehan (pp. 253−277). John Benjamins.

16.

Bui

Huang

(2016). L2 fluency as influenced by content familiarity and planning: Performance, measurement, and pedagogy. Language Teaching Research, 22(1), 94−114. https://doi.org/10.1177/1362168816656650

17.

Bui

Skehan

(2018). Complexity, fluency and accuracy. In Liontas

(Ed.), TESOL encyclopedia of English language teaching (pp. 1−7). John Wiley & Sons, Inc. https://doi.org/10.1002/9781118784235.eelt0046

18.

Bulté

Housen

(2012). Dening and operationalising L2 complexity. In Housen

Kuiken

Vedder

(Eds.), Dimensions of L2 performance and proficiency: Investigating complexity, accuracy and fluency in SLA (pp. 21−46). Benjamins.

19.

Bygate

(2001). Effects of task repetition on the structure and control of oral language. In Bygate

Skehan

Swain

(Eds.), Researching pedagogic tasks: Second language learning, teaching and testing (pp. 23−48). Pearson Education Limited.

20.

Canale

Swain

(1980). Theoretical bases of communicative approaches to second language teaching and testing. Applied linguistics, 1(1), 1−47. https://doi.org/10.1093/applin/1.1.1

21.

Daller

Milton

Treffers-Daller

(Eds.), (2007). Modelling and assessing vocabulary knowledge. Cambridge University Press.

22.

Daller

(2010). Guiraud’s index of lexical richness. Presented at British Association of Applied Linguistics.

23.

Davies

(1990). Principles of language testing. Basil Blackwell.

24.

De Jong

(2016). Predicting pauses in L1 and L2 speech: The effects of utterance boundaries and word frequency. International Review of Applied Linguistics in Language Teaching, 54(2), 113–132. https//doi.org/10.1515/iral-2016-9993

25.

De Jong

Perfetti

C. A.

(2011). Fluency training in the ESL classroom: An experimental study of fluency development and proceduralization. Language learning, 61(2), 533–568. https://doi.org/10.1111/j.1467-9922.2010.00620.x

26.

De Jong

Vercellotti

M.L.

(2016). Similar prompts may not be similar in the performance they elicit: Examining fluency, complexity, accuracy, and lexis in narratives from five picture prompts. Language Teaching Research, 20(3), 387–404. https://doi.org/10.1177/1362168815606161

27.

De Jong

N. H.

Steinel

M. P.

Florijn

Schoonen

Hulstijn

J. H.

(2012). The effect of task complexity on functional adequacy, fluency and lexical diversity in speaking performances of native and non-native speakers. In Housen

Kuiken

Vedder

(Eds.), Dimensions of L2 performance and proficiency. Investigating complexity, accuracy and fluency in SLA (pp. 121-142). John Benjamins.

28.

Dembovskaya

S. B.

(2009). Task-based instruction: The effect of motivational and cognitive pre-tasks on Second Language oral French production. (Doctoral thesis), University of Iowa. https://doi.org/10.17077/etd.scwb9fdh

29.

Ellis

(2003). Task-based language learning and teaching. Oxford University Press.

30.

Ellis

(2008). The study of second language acquisition (2nd ed.). Oxford University Press.

31.

Ellis

Barkhuizen

(2005). Analysing learner language. Oxford University Press

32.

Foster

Skehan

(1996). The influence of planning and task type on second language performance. Studies in Second Language Acquisition, 18(3), 299–323. https://doi.org/10.1017/s0272263100015047

33.

Foster

Skehan

(1999). The influence of source of planning and focus of planning on task-based performance. Language Teaching Research, 3(3), 215–247. https://doi.org/10.1191/136216899672186140

34.

Foster

Skehan

(2013). Anticipating a post-task activity: The effects on accuracy, complexity, and fluency of second language performance. Canadian Modern Language Review, 69(3), 249–273. https://doi.org/10.3138/cmlr.69.3.249

35.

Foster

Tonkyn

Wigglesworth

(2000). Measuring spoken language: A unit for all reasons. Applied Linguistics, 23(3), 354–375. https://doi.org/10.1093/applin/21.3.354

36.

Foster

Wigglesworth

(2016). Capturing accuracy in second language performance: The case for a weighted clause ratio. Annual Review of Applied Linguistics, 36, 98−116. https://doi.org/10.1017/s0267190515000082

37.

Freed

B. F.

(2000). Is fluency, like beauty, in the eyes (and ears) of the beholder? In Riggenbach

(Ed.), Perspectives on fluency (pp. 243–265). University of Michigan Press

38.

Fujita

(2006). The influence of pre-task planning, on-line planning and their combination on fluency, complexity and accuracy in foreign language performance (Unpublished Master’s dissertation), University of Essex.

39.

Geng

Ferguson

(2013). Strategic planning in task-based language teaching: The effects of participatory structure and task type. System, 41(4), 982–993. https://doi.org/10.1016/j.system.2013.09.005

40.

Gilabert

(2007b). Effects of manipulating task complexity on self-repairs during L2 oral production. IRAL: International Review of Applied Linguistics in Language Teaching, 45(3), 215–240. https://doi.org/10.1515/iral.2007.010

41.

Gilabert

(2007a). The simultaneous manipulation of task complexity along planning time and (+/- Here-and-Now): Effects on L2 oral production. In Garcia-Mayo

(Ed.), Investigating tasks in formal language learning (pp. 44–68): Multilingual Matters.

42.

Gregori-Signes

Clavel-Arroitia

(2015). Analysing lexical density and lexical diversity in university students’ written discourse. Procedia-Social and Behavioral Sciences, 198, 546−556. https://doi.org/10.1016/j.sbspro.2015.07.477

43.

Halliday

M. A. K.

Martin

J. R.

(1993). Writing science: Literacy and discursive power. London, England: Palmer.

44.

Harumi

(2002). The use of silence by Japanese EFL learners. In Swanson

McMurray

(Eds.), On PAC3 at jalt 2001: a language odyssey (pp. 27–34). Japan Association for Language Teaching.

45.

Housen

Kuiken

(2009). Complexity, accuracy, and fluency in second language acquisition. Applied linguistics, 30(4), 461–473. https://doi.org/10.1093/applin/amp048

46.

Housen

Kuiken

Vedder

(Eds.), (2012). Dimensions of L2 performance and proficiency: Complexity, accuracy and fluency in SLA. John Benjamins

47.

Huensch

Tracy-Ventura

(2016). Understanding second language fluency behavior: The effects of individual differences in first language fluency, cross-linguistic differences, and proficiency over time (pp. 1–31). Applied Psycholinguistics. https://doi.org/10.1017/S0142716416000424

48.

Inoue

(2016). A comparative study of the variables used to measure syntactic complexity and accuracy in task-based research. The Language Learning Journal, 44(4), 487–505. https://doi.org/10.1080/09571736.2015.1130079

49.

James

(1998). Errors in language learning and use: Exploring error analysis. Longman.

50.

Jarvis

(2013). Capturing the diversity in lexical diversity. Language Learning, 63(s1), 87–106. https://doi.org/10.1111/j.1467-9922.2012.00739.x

51.

Khaerudin

(2014). Measuring accuracy and complexity of an L2 learner’s oral production. IJEE (Indonesian Journal of English Education), 1(2), 189–198. https://doi.org/10.15408/ijee.v1i2.1344

52.

Khoram

Zhang

(2019). The impact of task type and pre-task planning condition on the accuracy of intermediate EFL learners’ oral performance. Cogent Education, 6(1), 1–13. https://doi.org/10.1080/2331186x.2019.1675466

53.

Koponen

Riggenbach

(2000). Overview: Varying perspectives on fluency. In Riggenbach

(Ed.), Perspectives on fluency (pp. 5–24). University of Michigan Press.

54.

Kormos

(1999). Monitoring and self-repair in L2. Language Learning, 49(2), 303–342. https://doi.org/10.1111/0023-8333.00090

55.

Kuiken

Vedder

Housen

De Clercq

(2019). Variation in syntactic complexity: Introduction. International Journal of Applied Linguistics, 29(2), 161–170. https://doi.org/10.1111/ijal.12255

56.

Lahmann

Steinkrauss

Schmid

M. S.

(2019). Measuring linguistic complexity in long‐term L2 speakers of English and L1 attriters of German. International Journal of Applied Linguistics, 29(2), 173–191. https://doi.org/10.1111/ijal.12259

57.

Lambert

Kormos

Minn

(2017). Task repetition and second language speech processing. Studies in Second Language Acquisition, 39(1), 167–196. https://doi.org/10.1017/s0272263116000085

58.

Larsen-Freeman

(2009). Adjusting expectations: The study of complexity, accuracy, and fluency in second language acquisition. Applied linguistics, 30(4), 579–589. https://doi.org/10.1093/applin/amp043

59.

Levelt

W. J. M.

(1989). Speaking. From intention to articulation. ACL-MIT Press.

60.

Levkina

Gilabert

(2012). The effects of cognitive task complexity on L2 oral production. In Housen

Kuikuen

Vedder

(Eds.), Dimensions of L2 performance and proficiency investigating complexity, accuracy, and fluency in SLA (pp. 171–197). John Benjamins.

61.

Lou

Y. G.

Chen

L. Y.

(2016). Effects of a task-based approach to non-English-majored graduates’ oral English performance. Creative Education, 7(04), 660–668. http://dx.doi.org/10.4236/ce.2016.74069

62.

Malicka

(2018). The role of task sequencing in fluency, accuracy, and complexity: Investigating the SSARC model of pedagogic task sequencing. Language Teaching Research, 24(5), 642–665. https://doi.org/10.1177/1362168818813668

63.

Malvern

Richards

(1997). A new measure of lexical diversity. In Ryan

Wray

(Eds.), Evolving Models of Language. Multilingual Matters.

64.

McCarthy

P. M.

Jarvis

(2007). vocd: A theoretical and empirical evaluation. Language Testing, 24(4), 459–488. https://doi.org/10.1177/0265532207080767

65.

McKee

Malvern

Richards

(2000). Measuring vocabulary diversity using dedicated software. Literary and Linguistic Computing, 15(3), 323−337. https://doi.org/10.1093/llc/15.3.323

66.

McNamara

(1996). Measuring second Language performance. Longman.

67.

Meara

Bell

(2001). P lex: A simple and effective way of describing the lexical characteristics of short L2 texts. Prospect, 16(3), 5–19.

68.

Michel

(2017). Complexity, accuracy, and fluency in L2 production (pp. 50–68). The Routledge Handbook of Instructed Second Language Acquisition. https://doi.org/10.4324/9781315676968-4

69.

Nitta

Nakatsuhara

(2014). A multifaceted approach to investigating pre-task planning effects on paired oral test performance. Language Testing, 31(2), 147–175. https://doi.org/10.1177/0265532213514401

70.

Norris

J. M.

Ortega

(2009). Towards an organic approach to investigating CAF in instructed SLA: The case of complexity. Applied Linguistics, 30(4), 555–578. https://doi.org/10.1093/applin/amp044

71.

Norris

J. M.

Pfeiffer

(2003). Exploring the use and usefulness of ACTFL guidelines oral proficiency ratings in college foreign language departments. Foreign Language Annals, 36(4), 572–581. https://doi.org/10.1111/j.1944-9720.2003.tb02147.x

72.

Ortega

(2003). Syntactic complexity measures and their relationship to L2 proficiency: A research synthesis of college‐level L2 writing. Applied linguistics, 24(4), 492–518. https://doi.org/10.1093/applin/24.4.492

73.

Pallotti

(2009). Caf: Defining, refining and differentiating constructs. Applied Linguistics, 30(4), 590–601. https://doi.org/10.1093/applin/amp045

74.

Piggin

(2012). What are our tools really made out of? A critical assessment of recent models of language proficiency. Polyglossia: The Asia-Pacific’s Voice in Language and Language Teaching, 22, 79−87.

75.

Polio

(1997). Measures of linguistic accuracy in second language writing research. Language Learning, 47(1), 101–143. https://doi.org/10.1111/0023-8333.31997003

76.

Qian

(2010). Focus on form in Task based language teaching- Exploring the effects of post task activities and task practice on Learners’ Oral performance. (published Doctoral thesis), The Chinese University of Hong Kong.

77.

Rafie

Z. F.

Rahmany

Sadeqi

(2015). The differential effects of three types of task planning on the accuracy of L2 oral production. Journal of Language Teaching and Research, 6(6), 1297–1304. https://doi.org/10.17507/jltr.0606.17

78.

Rausch

(2017). Complexity, accuracy, fluency as a communication paradigm: From theory to instructional curriculum. Japanese Journal of Communication Studies, 45(2), 115−127.

79.

Read

(2000). Assessing vocabulary. Cambridge University Press.

80.

Révész

Ekiert

Torgersen

E. N.

(2016). The effects of complexity, accuracy, and fluency on communicative adequacy in oral task performance. Applied Linguistics, 37(6), 828–848. https://doi.org/10.1093/applin/amu069

81. Robinson (1995). Attention, memory, and the “noticing” hypothesis. Language Learning, 45(2), 283–331. https://doi.org/10.1111/j.1467-1770.1995.tb00441.x

82. Robinson (2001). Task complexity, cognitive resources, and syllabus design: A triadic framework for examining task influences on SLA. In Robinson (Ed.), Cognition and second language instruction (pp. 287–318). Cambridge University Press.

83. Saeedi (2015). Unguided strategic planning, task structure, and L2 performance: Focusing on complexity, accuracy, and fluency. Journal of Applied Linguistics and Language Research, 2(4), 263–274.

84. Sample & Michel (2014). An exploratory study into trade-off effects of complexity, accuracy, and fluency on young learners’ oral task repetition. TESL Canada Journal, 31(8), 23–46. https://doi.org/10.18806/tesl.v31i0.1185

85. Sanell (2007). An acquisitional study of negation and some focus particles in French L2 (English). Stockholm University.

86. Shafaei, A. R., Salimi, & Talebi (2013). The impact of gender and strategic pre-task planning time on EFL learners’ oral performance in terms of accuracy. Journal of Language Teaching and Research, 4(4), 746–753. https://doi.org/10.4304/jltr.4.4.746-753

87. Skehan (1995). Analysability, accessibility, and ability for use. In Cook & Seidlhofer (Eds.), Principle and practice in applied linguistics. Oxford University Press.

88. Skehan (1996). A framework for the implementation of task-based instruction. Applied Linguistics, 17(1), 38–62. https://doi.org/10.1093/applin/17.1.38

89. Skehan (1998). A cognitive approach to language learning. Oxford University Press.

90. Skehan (2003). Task-based instruction. Language Teaching, 36(1), 1–14. https://doi.org/10.1017/s026144480200188x

91. Skehan (2009). Modelling second language performance: Integrating complexity, accuracy, fluency, lexis. Applied Linguistics, 30(4), 510–532. https://doi.org/10.1093/applin/amp047

92. Skehan & Foster (1997). Task type and task processing conditions as influences on foreign language performance. Language Teaching Research, 1(3), 185–211.

93. Skehan & Foster (1999). The influence of task structure and processing conditions on narrative retellings. Language Learning, 49(1), 93–120. https://doi.org/10.1111/1467-9922.00071

94. Skehan & Foster (2008). Complexity, accuracy, fluency and lexis in task-based performance: A meta-analysis of the Ealing research. In Van Daele, Housen, Kuiken, Pierrard, & Vedder (Eds.), Complexity, accuracy, and fluency in second language use, learning, and teaching. University of Brussels Press.

95. Spolsky (1985). What does it mean to know how to use a language? An essay on the theoretical basis of language testing. Language Testing, 2(2), 180–191. https://doi.org/10.1177/026553228500200206

96. Suzuki (2017). Complexity, accuracy, and fluency measures in oral pre-task planning: A synthesis.

97. Tavakoli (2011). Pausing patterns: Differences between L2 learners and native speakers. ELT Journal, 65(1), 71–79. https://doi.org/10.1093/elt/ccq020

98. Tavakoli, Campbell, & McCormack (2016). Development of speech fluency over a short period of time: Effects of pedagogic intervention. TESOL Quarterly, 50(2), 447–471. https://doi.org/10.1002/tesq.244

99. Tavakoli & Skehan (2005). Strategic planning, task structure, and performance testing. In Ellis (Ed.), Planning and task performance (pp. 239–273). John Benjamins Publishing Company.

100. Thurman (2008). The interaction of topic choice and task-type in the EFL classroom (Doctoral dissertation). Temple University.

101. Tonkyn (2013). Measuring and perceiving changes in oral complexity, accuracy and fluency: Examining instructed learners’ short term gains. In Housen, Kuiken, & Vedder (Eds.), Dimensions of L2 performance and proficiency (pp. 221–245). John Benjamins.

102. Vercellotti, M. L. (2012). Complexity, accuracy, and fluency as properties of language performance: The development of the multiple subsystems over time and in relation to each other (Doctoral dissertation). University of Pittsburgh.

103. Vercellotti, M. L. (2015). The development of complexity, accuracy, and fluency in second language performance: A longitudinal study. Applied Linguistics, 38(1), 90–111. https://doi.org/10.1093/applin/amv002

104. Wartan (2019). Complexity, accuracy and fluency in second language acquisition: Speaking style or language proficiency? (Doctoral dissertation). Leiden University.

105. Wigglesworth (1997). An investigation of planning time and proficiency level on oral test discourse. Language Testing, 14(1), 101–122. https://doi.org/10.1177/026553229701400105

106. Wigglesworth & Elder (2010). An investigation of the effectiveness and validity of planning time in speaking test tasks. Language Assessment Quarterly, 7(1), 1–24. https://doi.org/10.1080/15434300903031779

107. Wolfe-Quintero, Inagaki, & Kim, H. Y. (1998). Second language development in writing: Measures of fluency, accuracy, & complexity. University of Hawaii Press.

108. Yuan (2001). The effects of planning on language production in task-based language teaching (Doctoral dissertation). Temple University.

109. Yuan & Ellis (2003). The effects of pre-task planning and on-line planning on fluency, complexity and accuracy in L2 monologic oral production. Applied Linguistics, 24(1), 1–27. https://doi.org/10.1093/applin/24.1.1

Intricacies of the Multifaceted Triad-Complexity, Accuracy, and Fluency: A Review of Studies on Measures of Oral Production
