
Abstract

Researchers have used different measures to examine participants’ second-language (L2) proficiency, yet it remains unclear how these different measures influence research results on the effect of L2 proficiency on the target variable. The present research explored the modulation effect of four different measures of L2 Chinese proficiency on Chinese phonological awareness (PA). Eighty-four L2 learners of Chinese who spoke English or Arabic as their first language completed four different measures of Chinese proficiency: a self-report measure of years of instruction in Chinese, a simplified Hanyu Shuiping Kaoshi (HSK; a standardized Chinese language proficiency test), a reading comprehension test, and a Chinese character recognition test. The overall results of t tests and multiple regression analysis revealed that different measures of L2 proficiency led to various findings on the effect of L2 proficiency on Chinese PA. Compared with the other three measures, the HSK, a comprehensive examination of L2 Chinese proficiency, yielded the most consistent and reliable results with the largest effect size. Our findings contribute to the current understanding of L2 proficiency measures and to future efforts for the improvement of L2 proficiency tests for research purposes.

Keywords

phonological awareness, different L2 proficiency measures, Chinese proficiency test, HSK test

Introduction

With the growing influence of psychology and neuroscience on second-language acquisition (SLA), the demand for the rigorous control of research methods in SLA studies, particularly those examining participants’ second-language (L2) proficiency, has been highlighted (Grosjean, 1998; Tremblay, 2011a). Identifying valid and efficient methods to measure L2 proficiency is one of the most critical issues for both language education practitioners and SLA researchers. In the current study, language proficiency represents “an index of the comprehension and production abilities that L2 learners develop across linguistic domains (e.g., lexical competence, grammatical competence, discourse competence) and modalities to communicate” (Tremblay, 2011a, p. 340).

Studies on L2 proficiency assessment can generally be categorized into three types. The first group focuses on standardized tests (see journals such as Language Testing and Language Assessment Quarterly), such as the Test of English as a Foreign Language (TOEFL), International English Language Testing System (IELTS), and Hanyu Shuiping Kaoshi (HSK), a Chinese language proficiency test. These standardized tests are usually administered on a large scale and are well supported by specific theoretical frameworks. The second group of studies investigates placement tests, which are carried out by teaching units to evaluate a student’s proficiency level in the target language to determine which classes the student should take (e.g., Blais & Laurier, 1995; Fulcher, 1997, 1999; Harrington & Carey, 2009; Jamieson et al., 2013; Lee & Greene, 2007). The last group is interested in L2 proficiency assessment tools for research purposes (Hulstijn, 2012; Thomas, 1994, 2006; Tremblay, 2011a; H. Zhang, 2018). These studies examine participants’ L2 proficiency to provide evidence for academic research to control for participants’ language proficiency or to investigate the effect of language proficiency on the dependent variable(s); however, relevant research is comparatively scarce. The third type is the focus of the present study.

In general, the L2 proficiency level of L2 learners is used in three ways in empirical studies. First, L2 proficiency is used as a dependent variable, and the factors, such as the prediction of early L1 skills and L2 aptitude in L2 proficiency, that contribute to or affect the development of L2 proficiency are explored (Sparks et al., 2009). Second, according to participants’ performance on L2 proficiency measures, participants are categorized into groups with different proficiencies, and then cross-group differences in the target variable, such as the differences in lexical knowledge between native speakers of English, L2 advanced learners, and L2 intermediate learners, are examined using t tests or analyses of variance (ANOVAs; Zareva et al., 2005). Finally, participants’ scores on L2 proficiency measures are entered into regression models to examine their predictive power for the target variable, such as the role of working memory, L2 proficiency, and age at the processing of anaphoric sentences (Nowbakht, 2019). The present study focuses on the modulation effect of L2 proficiency measures on the investigation of the influence of L2 proficiency on the target variable.

Some researchers have summarized how different L2 proficiency assessment methods have been used in published works (Hulstijn, 2012; Thomas, 1994, 2006; Tremblay, 2011a; H. Zhang, 2018), yet few studies have explored how these different L2 proficiency assessment methods influence research results on the effect of L2 proficiency on the target variable. Therefore, the present study explored how different L2 assessment methods could yield different results and findings on the influence of L2 proficiency on the target variable. This study is intended to contribute to the field of SLA methodology.

Literature Review

The sparse investigation of L2 proficiency for research purposes may be explained by several possible reasons. First, researchers do not agree on which test is considered the best method to assess L2 proficiency for research purposes, but such a need has been underlined by various studies (Hulstijn, 2012; Thomas, 1994, 2006; Tremblay, 2011a; H. Zhang, 2018). Because of the differences in theoretical frameworks, research questions, participants’ background information, and L2 learning contexts, SLA researchers have adopted different L2 proficiency measures.

Another reason may concern the shortage of replication studies. Makel et al. (2012) found that in the field of psychology, most replication studies are conceptual and involve, as Marsden et al. (2018) stated, “intentional adaptation of the initial study to investigate generalizability to new conditions, contexts, or study characteristics” (p. 325), while the few direct replication studies have had “no intentional or significant alternations of the initial study” (pp. 325–326). Marsden et al. (2018) further noted that few replication studies have been carried out or published in the field of SLA, and one possible reason might relate to the poor transparency of the initial studies, such as in the measurement of the participants’ proficiency. Therefore, for the nonreplication studies that dominate the SLA field, whether the L2 proficiency measures used in novel studies should be consistent with those used in previous studies does not appear to be a significant issue, and a uniform L2 proficiency test in a particular language does not seem necessary because the operationalized definition of L2 proficiency is widely accepted by researchers and journals.

In recent years, researchers have realized the need for and called for more studies on L2 proficiency tests for research purposes. Reporting and controlling for measures of L2 proficiency for research purposes is argued to be important for at least three reasons (Marsden et al., 2018; Norris & Ortega, 2012; Tremblay, 2011a). First, it could provide evidence for categorizing participants into different groups. Second, it could help readers determine the extent to which the findings may be generalized to other populations or samples. Finally, it could facilitate comparisons between studies and thus contribute to the accumulation of knowledge.

Researchers have surveyed different L2 proficiency measures used in published articles in different languages, such as English, French, and Chinese (Hulstijn, 2012; Thomas, 1994, 2006; Tremblay, 2011a; H. Zhang, 2018). The main findings are as follows.

First, studies have found that a variety of methods have been administered to measure L2 proficiency across studies published in different languages. Thomas (1994, 2006), Hulstijn (2012), and Tremblay (2011a) investigated publications in several journals, mostly in English and some in French, and found that the L2 proficiency measures utilized by researchers included institutional status, standardized tests, impressionistic judgment, in-house assessment instruments, translation tests, language test certificates, and others. Tremblay (2011a) further categorized these measures into independent tests (such as standardized tests, cloze tests, and oral tests) and dependent tests (such as previous language test scores, the length of residence in an L2-speaking environment, and self-assessment). An investigation of publications in the field of Chinese as a second language (CSL) reported similar findings (H. Zhang, 2018). The L2 proficiency measures employed by CSL researchers included standardized tests such as the standardized HSK, reading comprehension tests such as cloze tests, and Chinese character recognition tests.

The second finding is that researchers have tended to favor fast and convenient estimates of L2 proficiency, such as classroom level, years of instruction in the target language, program enrollment, and institutional status. Thomas (1994) and Thomas (2006) found that 40.1% and 33.2%, respectively, of studies used institutional status as an L2 proficiency measure. Tremblay (2011a) reported that 38.2% of English and French SLA studies measured L2 proficiency based on the classroom level or years of instruction. The percentage of CSL studies that used these two methods was as high as 52.38% (H. Zhang, 2018). Although methods such as years of instruction or classroom level are commonly used, it remains unclear whether they are as powerful as other measures, such as standardized tests.

Third, standardized language tests have not been widely used to estimate L2 proficiency in SLA studies. The investigations by Thomas (1994), Thomas (2006), Tremblay (2011a), and H. Zhang (2018) demonstrated that only 22.3%, 23.2%, 14.59%, and 6.35% of studies, respectively, administered a standardized language test to measure L2 proficiency. The main reasons could be the following: (a) standardized tests are difficult for researchers to access and (b) standardized tests usually consume a great deal of time and effort (Tremblay, 2011a). Although standardized tests may be better able to reveal the actual state of learners’ proficiency, it is still unknown whether this method is more robust than other methods in terms of contributing to research findings concerning the effect of L2 proficiency on the target variable.

The above investigations show a lack of uniformity in L2 proficiency assessment methods. The possible consequences of using different L2 proficiency measures have been noted by some researchers, and the main consequence lies in the weakening of the generalizability of research findings (LeClercq et al., 2014; Tremblay, 2011a). However, the influence of different L2 proficiency measures on research results on the effect of L2 proficiency on the target variable has been explored less often in empirical studies. Currently, there may be only one available study, that by H. Zhang (2018), that has examined this issue; the study used data on phonological awareness (PA) and phonetic radical awareness (PRA) from 40 adult English CSL learners who studied in the United Kingdom. Five measures of L2 Chinese proficiency were used: years of instruction in Chinese, a simplified HSK test (including listening and reading comprehension), a listening comprehension test, a reading comprehension test, and a Chinese character recognition test. The participants were divided into preintermediate and intermediate groups according to the operationalized criteria based on each measure. When years of instruction in Chinese was used as the criterion, participants with 1 year of instruction and those with 2 years of instruction were assigned to the preintermediate and intermediate groups, respectively. When the HSK test, listening comprehension test, or reading comprehension test was used as the criterion, participants who scored below and above the midpoint of the maximum score on each test were assigned to the preintermediate and intermediate groups, respectively. The participants who scored below and above .33 on the character reading test were assigned to the preintermediate and intermediate groups, respectively, because the number of characters from beginner, intermediate, and advanced levels on the test was balanced.

The results of Zhang’s study indicated that using the HSK as an L2 proficiency measure led to significant differences in PA and PRA between the preintermediate and intermediate groups and that the effect sizes were large (PA, Cohen’s d = .93; PRA, Cohen’s d = .93). However, when years of instruction or a reading comprehension test were used as the L2 proficiency measure, the two groups differed only in PRA, and the effect size was medium (years of instruction, Cohen’s d = .68; reading comprehension, Cohen’s d = .89). In addition, the two groups did not differ significantly in either PA or PRA, and the effect size was small if a listening comprehension test (PA, Cohen’s d = .37; PRA, Cohen’s d = .39) or a character recognition test (PA, Cohen’s d = .47; PRA, Cohen’s d = .50) was used to assess L2 proficiency. The overall results demonstrated that different L2 proficiency measures yielded different results and that using a comprehensive test might lead to results with a large effect size.

This issue, however, still needs to be further investigated. H. Zhang (2018) explored this topic using only an independent-samples t test among a small group of 40 CSL participants and did not control for variables related to the participants’ background, such as age and gender. More importantly, Zhang did not conduct item analysis of the different L2 proficiency measures; thus, the item difficulty and item discrimination indexes were unknown, even though these two index values are critical for evaluating test quality (Aiken, 2003; Tremblay, 2011a). These issues will be addressed in the current study.

The Current Study

The current study extended the study by H. Zhang (2018) to explore the same research question by comparing four measures of L2 Chinese proficiency: years of instruction in Chinese, a simplified HSK, a reading comprehension test, and a character recognition test. It is impossible to explore this issue in relation to all aspects of SLA, and PA was used as an example for the following reasons.

First, PA is a type of metalinguistic awareness and refers to the ability to understand and manipulate the phonological structure of a language. Chinese PA includes awareness of syllables, onset, rime, and tone (Li et al., 2002). PA closely correlates with language proficiency and can indicate L2 ability to some extent (H. Zhang, 2018; Ziegler & Goswami, 2005). Chinese PA is not included in Chinese language proficiency tests, so it is easy to separate participants’ PA performance from their language proficiency. Topics such as grammatical structures or vocabulary are not suitable for the present study because they could overlap with some Chinese proficiency measures in terms of the content examined. Moreover, the number of studies on phonology in the area of SLA is much smaller than the number of studies on grammar and vocabulary.¹ Thus, researching PA could enrich previous studies on phonology and contribute to the development of the area of SLA.

Second, PA is a fundamental and essential skill for the acquisition of different linguistic components (such as vocabulary) and the development of various linguistic skills (such as listening, reading, and spelling) across different languages for L1 (Bradley & Bryant, 1983; Goswami & Bryant, 2016; Hulme & Snowling, 2013; Song et al., 2015; Swanson et al., 2003; Ziegler & Goswami, 2005) and L2 (Keung & Ho, 2009; McBride & Kail, 2002; Uchikoshi & Marinova-Todd, 2012; Yeung & Chan, 2013; H. Zhang & Roberts, 2019). The significant predictive power of PA for Chinese reading lies in the fact that phonological skills are universally required to read in any language because correct mapping between phonology and orthography is essential for the successful decoding of printed words (Perfetti et al., 1992). Therefore, the development of PA is one of the core and fundamental issues for language research.

Third, adult CSL learners’ PA could arguably develop well within a relatively short period because of its limited repertoire of phonemes and tones, as well as adult learners’ mature cognitive abilities. Chinese has 21 onsets, 39 rimes, and four tones. Studies have found that CSL learners can achieve above-chance PA performance within several weeks at the initial stage of Chinese learning (J. Zhang & Wu, 2007) and ceiling performance within 1 or 2 years (H. Zhang & Roberts, 2019). In addition, compared with the hundreds or thousands of items in the pool of grammatical structures or vocabulary, the limited repertoire of Chinese phonetics makes it practical to examine the influence of L2 proficiency on the whole range of the development path of Chinese PA instead of a portion of grammatical structures or vocabulary items.

Finally, this topic has been explored by H. Zhang (2018), who drew on the data of CSL learners’ PA. Thus, to make the results of the current study comparable with the findings reported in Zhang’s study, the current study also depended on Chinese PA to answer the research question.

The current study aimed to explore how different L2 proficiency measures influenced the research results on the effect of CSL proficiency on the target variable, Chinese PA. Specifically, the current study included two questions.

Research Question 1 (RQ1): Does the effect of L2 proficiency on PA between two groups of CSL learners with different proficiencies vary across different measures, that is, years of instruction in Chinese, a simplified HSK, a reading comprehension test, and a character recognition test?

Hypothesis 1: The effect of L2 proficiency on Chinese PA between two groups of CSL learners with different proficiencies will be more significant when L2 proficiency is measured via a simplified HSK test than when it is measured via other tests.

Research Question 2 (RQ2): Does the predictive power of L2 proficiency for Chinese PA vary across different measures, that is, years of instruction in Chinese, a simplified HSK, a reading comprehension test, and a character recognition test?

Hypothesis 2: The predictive power of L2 proficiency for Chinese PA will be stronger when CSL proficiency is measured via a simplified HSK test than when it is measured via other tests.

Method

Participants

The participants were 40 English-speaking and 44 Arabic-speaking CSL learners who studied Chinese as a primary subject at universities in the United Kingdom or Egypt. The English group included 20 second-year and 20 third-year learners, and the Arabic group included 23 second-year and 21 third-year learners. The data were collected at the beginning of the autumn term; thus, the second-year and third-year groups learned Chinese for approximately 1 and 2 years, respectively.

Measures

Simplified HSK

The HSK is a standardized L2 Chinese proficiency examination recognized worldwide. It includes six levels, with Level 1 being the beginner level and Level 6 being the advanced level. The HSK is integrative and includes subtests for listening, reading, speaking, and writing. The HSK is rarely used to measure L2 Chinese proficiency in research because the typical duration of the test is approximately 2 hours (H. Zhang, 2018). Drawing on Tremblay’s (2011a) recommendation, we used a simplified HSK that included listening and reading comprehension to make it relatively comprehensive and practical for administration. According to our teaching experience and the instructors’ reports, the participants’ L2 Chinese proficiency ranged from preintermediate to intermediate. Thus, the test was composed of eight questions from the preintermediate level and eight questions from the intermediate level, including four listening and four reading comprehension questions for each level. All the questions were multiple choice, and the participants were required to choose the one correct answer from four alternatives. A score of 1 was assigned to correct answers, and a score of 0 was assigned to unanswered items and incorrect answers, for a maximum score of 16. The test was administered in paper-and-pencil form.

Reading comprehension test

Cloze tests have been widely used to examine reading comprehension skills and to explore L2 proficiency (Brown, 2002; Tremblay, 2011), yet they have not been used on the HSK test. In addition, cloze tests are not widely used in CSL research, probably due to the lack of a standardized cloze test (H. Zhang, 2018). The reading comprehension test in this study included four questions from the preintermediate level and four questions from the intermediate level. The materials were the same as those in the reading section of the simplified HSK test. The maximum score was 8. The test was administered in paper-and-pencil form.

Chinese character recognition test

Chinese characters constitute the dominant writing system in Chinese. The number of Chinese characters that a CSL learner can recognize can be considered the basis for his or her reading ability. Chinese character recognition tests have been used as valid placement measures in classroom instruction (Wu et al., 2017) and CSL research (Gao, 2017). Such tests take far less time than other L2 proficiency tests, typically taking no more than 10 minutes. However, a standardized Chinese character recognition test has not been developed. In the present study, the Chinese character recognition test included 72 characters, with 36 characters each from the beginner and intermediate levels of The Graded Chinese Syllables, Characters and Words for the Application of Teaching Chinese to the Speakers of Other Languages (Guojia yuwei, 2010). The characters at each level did not differ significantly in terms of frequency or stroke numbers. The characters were listed on one piece of A4 paper from high to low frequency. The participants were required to read the characters aloud, and the test was stopped if they made five consecutive errors. The accuracy rate was calculated by dividing the number of correct answers by 72.

Chinese PA test

The odd-man-out test is a common task to examine PA in English and Chinese. However, because no standardized odd-man-out test is available for a Chinese PA test, an odd-man-out task was developed for the present study that included eight questions each for syllable awareness, onset awareness, rhyme awareness, and tone awareness. For each question, the participants heard three items and were then required to identify the odd item. For instance, among “dàgē, xiàngpí, gēmí,” the odd item was “xiàngpí” because the other two words shared the syllable “gē.” The test was administered in paper-and-pencil form, and the accuracy rate was calculated by dividing the number of correct answers by 32.

Procedure

All participants completed the tasks individually in a quiet place. The tasks were administered in the following order: PA test (approximately 10 minutes), Chinese character recognition test (approximately 5 minutes), HSK test (including the reading comprehension test, approximately 15 minutes), and background questionnaire (approximately 5 minutes). Most participants completed these tasks within 35 minutes and were paid a small amount of money after completing the tasks.

Data Analysis

First, item analysis was carried out to examine the item difficulty and item discrimination indexes of the HSK, reading comprehension test, and Chinese character recognition test. Second, because the preintermediate and intermediate groups were formed on the basis of L2 proficiency and were not balanced in other aspects such as age and gender, independent-samples t tests were used to explore how the two groups differed in their PA performance under the grouping criterion of each measure of L2 Chinese proficiency. Third, to explore the power of different L2 proficiency measures in predicting PA when participants’ background variables, such as age and gender, were controlled, standard multiple regression analysis was used: each proficiency measure was first entered into the model individually with the background variables, and then all the proficiency measures were entered simultaneously with the background variables.

Results

According to DeVellis (1991), a test with a Cronbach’s alpha coefficient between .70 and .80 is acceptable, and one with a coefficient between .80 and .90 is very good. The Cronbach’s alpha reliability values of the HSK test (.80), reading comprehension test (.70), Chinese character recognition test (.96), and Chinese PA test (.72) were all at or above the acceptable threshold, indicating that these measures were reliable.
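As an illustration of the reliability coefficient discussed above, the following is a minimal sketch of the standard Cronbach's alpha formula, alpha = k/(k − 1) × (1 − sum of item variances / variance of total scores). The function name and the toy score matrix are hypothetical; they are not the study's data, and the original analysis was not necessarily computed this way.

```python
from statistics import variance  # sample variance (n - 1 denominator)


def cronbach_alpha(scores):
    """Cronbach's alpha for a participants-by-items score matrix.

    scores: list of rows, one row per participant, one column per item.
    """
    k = len(scores[0])  # number of items
    item_variances = [variance([row[i] for row in scores]) for i in range(k)]
    total_variance = variance([sum(row) for row in scores])
    return k / (k - 1) * (1 - sum(item_variances) / total_variance)


# Hypothetical 0/1 item scores for four participants on three items.
toy_scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
print(round(cronbach_alpha(toy_scores), 2))
```

Under the DeVellis (1991) bands quoted above, a coefficient computed this way would be read as acceptable from .70 and very good from .80.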

The participants’ performance for the measured variables is summarized in Table 1. The Arabic and English groups did not differ significantly for most of the measured variables, including the HSK, reading comprehension test, and character recognition test (Table 2). These results indicate that the two CSL groups were matched in L2 Chinese proficiency; thus, pooling the two groups for analysis was reasonable. In terms of PA performance, the English group outperformed the Arabic group (Table 2), yet each group’s performance was significantly above the chance level: Arabic, t(43) = 14.47, p < .0001; English, t(39) = 29.25, p < .0001.
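The above-chance comparisons can be approximated from the summary statistics in Table 1 with a one-sample t test. The sketch below is an assumption-laden check rather than the authors' procedure: the text does not state the chance level, and the value of .50 used here is an inference, chosen because it reproduces the reported t values from the rounded means and standard deviations.

```python
import math


def one_sample_t(mean, sd, n, mu0):
    """One-sample t statistic computed from summary statistics."""
    standard_error = sd / math.sqrt(n)
    return (mean - mu0) / standard_error


# Assumed chance level; it is not stated in the text (see lead-in).
CHANCE_LEVEL = 0.50

# PA means, SDs, and group sizes taken from Table 1.
t_arabic = one_sample_t(mean=0.74, sd=0.11, n=44, mu0=CHANCE_LEVEL)
t_english = one_sample_t(mean=0.87, sd=0.08, n=40, mu0=CHANCE_LEVEL)

print(f"Arabic:  t(43) = {t_arabic:.2f}")
print(f"English: t(39) = {t_english:.2f}")
```

With these inputs the statistics come out at about 14.47 and 29.25, matching the values reported above.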

Table 1.

Participants’ Background Information and Performance for the Measured Variables.

Group	Male	Female	Age	Years of instruction	HSK	Reading comprehension	Character recognition	Phonological awareness
Arabic	4	40	19.59(.79)	1.48(.51)	8.82(3.74)	5.34(1.94)	.21(.12)	.74(.11)
English	17	23	20.55(1.32)	1.50(.51)	9.98(3.43)	5.35(1.85)	.27(.16)	.87(.08)
Total	21	63	20.05(1.17)	1.49(.50)	9.37(3.62)	5.35(1.89)	.23(.14)	.80(.12)

Note. The numbers in parentheses are the standard deviations. HSK = Hanyu Shuiping Kaoshi.

Table 2.

Summary of the t-Test Results of the Differences in the Measured Variables Between the Arabic and English Groups.

Measure	t test	Cohen’s d	Hedges’ g
HSK	t(82) = −1.47, p = .14	.32	.32
Reading comprehension test	t(82) = −.02, p = .98	.005	.005
Character recognition test	t(82) = −1.72, p = .09	.37	.37
Phonological awareness	t(82) = −6.18, p < .001	1.35	1.34

Note. The numbers in parentheses are the degrees of freedom. HSK = Hanyu Shuiping Kaoshi.
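The two effect sizes in Table 2 can be sketched from group summary statistics. The example below uses the conventional pooled-SD form of Cohen's d and the common small-sample correction for Hedges' g, g = d × (1 − 3/(4·df − 1)); whether the authors used exactly these formulas is an assumption. The inputs are the rounded HSK values from Table 1, so small rounding differences from the published figures are possible.

```python
import math


def cohens_d(m1, sd1, n1, m2, sd2, n2):
    """Cohen's d using the pooled standard deviation of two groups."""
    pooled_var = ((n1 - 1) * sd1**2 + (n2 - 1) * sd2**2) / (n1 + n2 - 2)
    return (m2 - m1) / math.sqrt(pooled_var)


def hedges_g(d, n1, n2):
    """Hedges' g: Cohen's d with the usual small-sample bias correction."""
    df = n1 + n2 - 2
    return d * (1 - 3 / (4 * df - 1))


# HSK scores from Table 1: Arabic group first, English group second.
d_hsk = cohens_d(8.82, 3.74, 44, 9.98, 3.43, 40)
g_hsk = hedges_g(d_hsk, 44, 40)

print(f"Cohen's d = {d_hsk:.2f}, Hedges' g = {g_hsk:.2f}")
```

Both values round to .32, in line with the HSK row of Table 2.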

Item Analysis

The results of the item analysis are summarized in Table 3. The mean item difficulty for HSK was close to .50, and its item discrimination index was excellent. The item difficulty for the reading comprehension test was acceptable, and the item discrimination index was good. The Chinese character recognition test had a good item discrimination index but had a relatively low value for item difficulty.

Table 3.

Summary of the Results of the Item Analysis.

	Item difficulty				Item discrimination
Test	M	SD	Minimum	Maximum	M	SD	Minimum	Maximum
HSK	.59	.20	.23	.92	.40	.11	.19	.62
Reading	.67	.19	.42	.92	.35	.07	.24	.45
Character recognition	.37	.30	0	.99	.49	.19	0	.78

Note. A higher item difficulty index indicates that the item was easier for the participants. All results were computed using the sjPlot package in the R environment. HSK = Hanyu Shuiping Kaoshi.
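For readers unfamiliar with the two indexes in Table 3, the sketch below shows one common way to compute them: item difficulty as the proportion of participants answering an item correctly, and item discrimination as the corrected item-total correlation (the correlation between an item and the sum of the remaining items). This is an assumed definition; the note above names the sjPlot package but not the exact function or options used. The toy data and function names are hypothetical.

```python
import math


def pearson_r(x, y):
    """Pearson correlation; returns 0.0 when either variable is constant
    (a convention chosen for this sketch)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return sxy / math.sqrt(sxx * syy)


def item_difficulty(scores):
    """Proportion correct per item (higher means easier)."""
    n = len(scores)
    return [sum(row[i] for row in scores) / n for i in range(len(scores[0]))]


def item_discrimination(scores):
    """Corrected item-total correlation per item."""
    result = []
    for i in range(len(scores[0])):
        item = [row[i] for row in scores]
        rest = [sum(row) - row[i] for row in scores]  # total without item i
        result.append(pearson_r(item, rest))
    return result


# Hypothetical 0/1 responses: four participants, three items.
toy_scores = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(item_difficulty(toy_scores))
print([round(r, 2) for r in item_discrimination(toy_scores)])
```

Under this definition, an item that every participant answers the same way has no variance, which is one way a discrimination value of 0 could arise.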

RQ1: Does the effect of L2 proficiency on PA between two groups of CSL learners with different proficiencies vary across different L2 proficiency measures?

The participants were sorted into preintermediate and intermediate groups based on the criteria of the four different measures (Table 4), and independent-samples t tests were then carried out accordingly (Table 5). The criterion for classifying the preintermediate and intermediate levels was the midpoint of the maximum score on each test because the materials from the preintermediate and intermediate levels were balanced. This operationalized approach might appear somewhat arbitrary, yet currently, there is no absolute consensus about the definition of the L2 Chinese proficiency level. The results of the four t tests differed in effect size (Cohen, 1988). The HSK yielded the largest effect size, followed by the reading comprehension test and years of instruction; the Chinese character recognition test had the smallest effect size.

Table 4.

Grouping Information Based on Each of the Four Measures.

Years of instruction	Group	Criterion	N	Age	Years of instruction	PA
	Preintermediate	1 year	43	19.63 (1.29)	1 year	.77 (.13)
	Intermediate	2 years	41	20.49 (.84)	2 years	.83 (.09)
HSK	Group	Criteria	N	Age	HSK score	PA
	Preintermediate	<8	27	19.70 (.99)	5 (1.75)	.75 (.10)
	Intermediate	≥8	57	20.21 (1.22)	11.44 (2.11)	.82 (.11)
Reading comprehension	Group	Criteria	N	Age	Reading comprehension	PA
	Preintermediate	<4	16	19.88 (1.02)	2.38 (.80)	.75 (.10)
	Intermediate	≥4	68	20.09 (1.21)	6.4 (1.29)	.81 (.12)
Character recognition	Group	Criteria	N	Age	Character recognition	PA
	Preintermediate	<.50	60	20.02 (1.30)	.27 (.14)	.79 (.12)
	Intermediate	≥.50	24	20.13 (.80)	.62 (.10)	.83 (.09)

Note. The numbers in parentheses are the standard deviations. PA = phonological awareness; HSK = Hanyu Shuiping Kaoshi.
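The midpoint criterion used for the three test-based measures in Table 4 can be written as a simple rule. The sketch below is illustrative only; the function name and group labels are hypothetical, and the years-of-instruction grouping (1 year vs. 2 years) is not covered because it does not use a score midpoint.

```python
def assign_group(score, max_score):
    """Classify a participant by the midpoint of the maximum score.

    Scores at or above the midpoint are treated as intermediate,
    matching the "<" and "≥" criteria listed in Table 4.
    """
    midpoint = max_score / 2
    return "intermediate" if score >= midpoint else "preintermediate"


# Maximum scores: HSK = 16, reading comprehension = 8,
# character recognition = 1.0 (accuracy rate).
print(assign_group(7, 16))      # HSK score of 7
print(assign_group(4, 8))       # reading comprehension score of 4
print(assign_group(0.27, 1.0))  # character recognition accuracy of .27
```

A rule of this kind makes the cut-offs explicit, but it also shows why group sizes differ so much across measures: the same participants fall on different sides of each test's midpoint.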

Table 5.

Summary of the t-Test Results for PA According to Different L2 Proficiency Measures.

Measure	t test	Cohen’s d	Hedges’ g
Years of instruction	t(76.47) = −2.07, p = .04	.45	.44
HSK	t(82) = −3.04, p = .003	.71	.70
Reading comprehension	t(82) = −1.78, p = .08	.49	.49
Character recognition	t(82) = −1.65, p = .10	.40	.40

Note. The numbers in parentheses are the degrees of freedom. The preintermediate and intermediate groups differed significantly in the variances when the participants were grouped according to years of instruction. HSK = Hanyu Shuiping Kaoshi.
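The fractional degrees of freedom in the years-of-instruction row are consistent with an unequal-variance (Welch) t test, although the text does not name the procedure. The sketch below shows the Welch statistic and the Welch-Satterthwaite degrees of freedom computed from summary statistics. Because it uses the rounded means and SDs from Table 4, its output only approximates the values in Table 5.

```python
import math


def welch_t(m1, sd1, n1, m2, sd2, n2):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    v1 = sd1**2 / n1  # squared standard error, group 1
    v2 = sd2**2 / n2  # squared standard error, group 2
    t = (m1 - m2) / math.sqrt(v1 + v2)
    df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))
    return t, df


# PA by years of instruction (Table 4): preintermediate vs. intermediate.
t_years, df_years = welch_t(0.77, 0.13, 43, 0.83, 0.09, 41)
print(f"t({df_years:.2f}) = {t_years:.2f}")
```

The Welch degrees of freedom always fall at or below n1 + n2 − 2, which is why this row reports fewer than the 82 degrees of freedom shown for the other measures.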

RQ2: Does the predictive power of L2 proficiency in Chinese PA vary across different L2 proficiency measures?

The raw data collected using the four measures of L2 Chinese proficiency were used in the multiple regression analysis to explore how the four measures influenced the power of CSL proficiency in predicting Chinese PA.

First, each proficiency measure was modeled in isolation with the background variables. Most previous CSL studies have employed only one measure to indicate L2 Chinese proficiency. Therefore, the approach employed in this study is akin to comparing the results of four independent studies that explored the same issue. Because of debates on methods of outlier detection (Hodge & Austin, 2004), we carried out two series of regression analyses: one in which the outliers were kept (Table 6) and another in which the outliers were removed (Table 7). The regression diagnostic plots for the four models in Table 6 are presented in the Appendix (Figures 1–4). The results of the normality, heteroscedasticity, and multicollinearity tests are also presented in the tables.

Table 6.

Summary of the Results of the Multiple Regression Analysis (Each L2 Proficiency Measure + Background Variables, Outliers Kept).

No.	Predictor	VIF	Skewness/Kurtosis	Heteroscedasticity	F	p	R²	Adjusted R²	b	SE	t	p	β	η²	ω²
1	Model	1.17	10.15**	8.23**	4.27	.01*	.14	.11
	Gender	1.12							.08	.03	2.68	.01*	.29	.08	.07
	Age	1.24							.003	.01	.31	.76	.04	.001	0
	Years of instruction	1.17							.03	.03	1.31	.19	.15	.02	.01
2	Model	1.11	5.79	2.77	7.41	.0002***	.22	.19
	Gender	1.17							.06	.03	2.16	.03*	.23	.05	.04
	Age	1.11							.01	.01	0.72	.47	.07	.01	0
	HSK score	1.07							.01	.003	3.17	.002**	.32	.11	.10
3	Model	1.09	5.00	4.26*	5.52	.002**	.17	.14
	Gender	1.13							.07	.03	2.53	.01*	.27	.07	.06
	Age	1.10							.01	.01	.78	.44	.08	.007	0
	Reading comprehension	1.03							.01	.01	2.24	.03*	.23	.06	.05
4	Model	1.09	10.92**	6.97**	5.38	.002**	.17	.14
	Gender	1.14							.07	.03	2.39	.02*	.26	.07	.05
	Age	1.11							.01	.01	0.62	.54	.07	.005	0
	Character recognition	1.03							.18	.08	2.16	.03*	.23	.06	.04

Note. VIF = variance inflation factor; SE = standard error; HSK = Hanyu Shuiping Kaoshi.

*p < .05. **p < .01. ***p < .001.
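
As a quick consistency check on the model-level columns of Table 6, the following sketch recomputes adjusted R² and the overall F statistic from R² using the standard ordinary least squares formulas. It assumes n = 84 (the full sample reported in the abstract) and k = 3 predictors per model; the function names are illustrative and are not taken from the authors' analysis. Because the tabled R² values are rounded, the recomputed F agrees with the reported value only approximately.

```python
def adjusted_r2(r2: float, n: int, k: int) -> float:
    """Adjusted R^2 for a model with k predictors fitted to n observations."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)


def f_from_r2(r2: float, n: int, k: int) -> float:
    """Overall F statistic (df = k, n - k - 1) implied by R^2."""
    return (r2 / k) / ((1.0 - r2) / (n - k - 1))


N = 84  # assumption: all 84 participants entered the regression
K = 3   # gender, age, and one proficiency measure

# Model 2 in Table 6 (HSK score), reported R^2 = .22
print(round(adjusted_r2(0.22, N, K), 2))  # 0.19, the tabled adjusted R^2
print(round(f_from_r2(0.22, N, K), 2))    # 7.52, close to the tabled 7.41
```

The small gap between 7.52 and the reported 7.41 is what one would expect when R² is carried at two decimals only.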

Table 7.

Summary of the Results of the Multiple Regression Analysis (Each L2 Proficiency Measure + Background Variables, Outliers Removed).

No.	Predictor	VIF	Skewness/Kurtosis	Heteroscedasticity	F	p	R²	Adjusted R²	b	SE	t	p	β	η²	ω²
1	Model	1.17	9.54**	7.50**	5.11	.003**	.166	.134
	Gender	1.12							.08	.03	2.91	.005**	.32	.10	.09
	Age	1.24							.002	.01	.16	.87	.02	0	0
	Years of instruction	1.17							.04	.03	1.65	.10	.19	.03	.02
2	Model	1.13	9.96**	2.79	7.97	.0001***	.237	.207
	Gender	1.18							.05	.03	1.86	.07	.20	.04	.03
	Age	1.13							.01	.01	1.49	.14	.16	.03	.02
	HSK score	1.07							.01	.003	3.16	.002**	.32	.11	.10
3	Model	1.08	3.72	4.35*	5.54	.002**	.178	.146
	Gender	1.13							.07	.03	2.58	.01*	.28	.08	.07
	Age	1.11							.008	.01	.75	.46	.08	.007	0
	Reading comprehension	1.02							.01	.007	2.25	.03*	.24	.06	.05
4	Model	1.09	11.17**	7.07**	3.31	.02*	.114	.080
	Gender	1.13							.07	.03	2.45	.02*	.28	.07	.06
	Age	1.12							.01	.01	.88	.38	.10	.01	0
	Character recognition	1.02							.04	.06	.60	.55	.06	.005	0

Note. VIF = variance inflation factor; SE = standard error; HSK = Hanyu Shuiping Kaoshi.

*p < .05. **p < .01. ***p < .001.

Each of the four models significantly explained a certain percentage of the variance in PA, yet the effect sizes differed (Cohen, 1988). The model including the HSK explained the highest percentage of the variance in PA, followed by the model with reading comprehension and that with character recognition; the model with years of instruction accounted for the lowest percentage. Similarly, the four measures differed in their predictive power for PA. Independent of the outliers, the effect sizes for the HSK, reading comprehension, and years of instruction were large, medium, and small, respectively. The effect size for character recognition was medium when the outliers were kept and small when the outliers were removed.
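
The predictor-level effect sizes in Tables 6 and 7 can be approximately reproduced from the reported t statistics. The sketch below assumes that η² is the partial eta squared, t²/(t² + residual df), and that ω² is the corresponding partial omega squared with negative estimates set to 0. These formulas are inferred from the tabled values rather than stated in the text, and the sample size (N = 84) and residual df (80 for a three-predictor model) are likewise assumptions.

```python
def partial_eta_squared(t: float, df_resid: int) -> float:
    """Partial eta squared for a single regression coefficient (F = t^2)."""
    f = t * t
    return f / (f + df_resid)


def partial_omega_squared(t: float, n: int) -> float:
    """Partial omega squared for a 1-df effect; negative estimates are set to 0."""
    f = t * t
    return max(0.0, (f - 1.0) / (f - 1.0 + n))


N = 84        # assumption: full sample
DF_RESID = 80  # assumption: N - 3 predictors - 1

# HSK score in Table 6, Model 2: reported t = 3.17, eta^2 = .11, omega^2 = .10
print(round(partial_eta_squared(3.17, DF_RESID), 2))  # 0.11
print(round(partial_omega_squared(3.17, N), 2))       # 0.1

# Age in the same model: reported t = 0.72, omega^2 = 0
print(partial_omega_squared(0.72, N))                 # 0.0
```

Because the tabled t values are themselves rounded, a recomputed effect size can differ from the reported one in the second decimal place.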

Second, we constructed a model that included all four L2 proficiency measures together with the background variables. The correlation matrix between the measured variables is displayed in Table 8. Two multiple regression analyses were carried out, one with the outliers kept and one with the outliers removed (Table 9). The regression diagnostic plots for the model with outliers kept in Table 9 are presented in the Appendix (Figure 5). In both cases, the HSK was the only significant predictor of PA, with a medium effect size; none of the other three measures significantly predicted PA, and their effect sizes were very small.

Table 8.

Correlation Matrix of the Measured Variables.

Measured variables	1	2	3	4	5
1. Years of instruction in Chinese	1.00
2. HSK	.49*	1.00
3. Reading comprehension	.44*	.87*	1.00
4. Character recognition	.68*	.63*	.54*	1.00
5. Phonological awareness	.22*	.39*	.27*	.25*	1.00

Note. HSK = Hanyu Shuiping Kaoshi.

*p < .05.

Table 9.

Summary of the Results of the Multiple Regression Analysis (Four L2 Chinese Proficiency Measures + Background Variables).

Predictor	VIF	Skewness/Kurtosis	Heteroscedasticity	F	p	R²	Adjusted R²	b	SE	t	p	β	η²	ω²
Outliers kept
Model	2.41	5.79	3.02	3.76	.003**	.227	.167
Age	1.29							.008	.01	.69	.49	.08	.006	0
Gender	1.20							.06	.03	1.96	.054	.22	.05	.04
Years of instruction	1.53							−.003	.03	−.11	.92	−.02	0	0
HSK score	4.87							.02	.007	2.21	.03*	.49	.06	.05
Reading comprehension	4.45							−.01	.01	−.86	.39	−.18	.01	0
Character recognition	1.10							.01	.06	.24	.81	.02	.001	0
Outliers removed
Model	2.33	5.65	3.05	3.62	.003**	.227	.164
Age	1.29							.006	.01	.55	.59	.06	.004	0
Gender	1.21							.06	.03	1.9	.06	.21	.05	.03
Years of instruction	1.50							−.001	.03	−.05	.96	−.01	0	0
HSK score	4.65							.02	.01	2.20	.03*	.49	.06	.05
Reading comprehension	4.23							−.01	.01	−.85	.40	−.18	.01	0
Character recognition	1.10							.02	.06	.27	.79	.03	.001	0

Note. VIF = variance inflation factor; SE = standard error; HSK = Hanyu Shuiping Kaoshi.

*p < .05. **p < .01.
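
The VIF column in Table 9 reflects the overlap between the HSK and reading comprehension scores. As a rough illustration, the sketch below uses the textbook definition VIF = 1/(1 − R²ⱼ), where R²ⱼ is the share of variance in predictor j explained by the other predictors. In the special case of only two predictors, R²ⱼ is simply their squared correlation, so the r = .87 in Table 8 alone implies a VIF of about 4.1. The tabled values (4.87 and 4.45) come from the full six-predictor model and are therefore not reproduced exactly. The function names are illustrative.

```python
def vif_from_r2(r2_j: float) -> float:
    """Variance inflation factor given R^2 of predictor j on the other predictors."""
    if not 0.0 <= r2_j < 1.0:
        raise ValueError("r2_j must lie in [0, 1)")
    return 1.0 / (1.0 - r2_j)


def r2_from_vif(vif: float) -> float:
    """Inverse relation: R^2 of predictor j on the others implied by a VIF."""
    if vif < 1.0:
        raise ValueError("a VIF cannot be smaller than 1")
    return 1.0 - 1.0 / vif


r_hsk_reading = 0.87  # HSK-reading comprehension correlation from Table 8

# Two-predictor special case: R^2_j equals the squared correlation
print(round(vif_from_r2(r_hsk_reading ** 2), 2))  # 4.11

# Share of HSK variance explained by the other five predictors implied by VIF = 4.87
print(round(r2_from_vif(4.87), 2))                # 0.79
```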

Discussion

The present study aimed to explore the influence of different L2 proficiency measures on research results on the effect of L2 proficiency on the target variable. We studied this issue by drawing on PA data collected from Arabic and English CSL learners and using four different measures of L2 Chinese proficiency. Consistent with our hypothesis, the overall results for RQ1 and RQ2 show that different L2 proficiency assessment methods yielded different results and that the comprehensive HSK test produced the most substantial effect, followed by reading comprehension and then Chinese character recognition, with years of instruction having the weakest effect. These results suggest that research findings do differ depending on the L2 proficiency assessment methods used by researchers and that the use of a comprehensive L2 proficiency measure could yield significant results and larger effect sizes than the use of a single aspect of L2 proficiency. These results are consistent with the findings reported by H. Zhang (2018).

The next question to answer is why these different L2 proficiency measures generated different research findings. These findings could be explained from two perspectives.

From the perspective of language testing, the first reason may relate to the comprehensiveness of the four different measures. The HSK test is the most comprehensive, followed by the reading comprehension test, the character recognition test, and years of instruction, which roughly indicates the length of CSL learning. This explanation is consistent with the suggestion that a more comprehensive test might be better to detect the differences in language proficiency among L2 learners at different levels and might make within-group L2 proficiency more homogeneous (Tremblay, 2011a). This explanation is also consistent with the recommended use of an integrative test instead of a discrete-point test to assess language proficiency (Hidri, 2018).

The second reason might relate to the psychometric indexes of the measures. Because years of instruction is not a psychometric measure, item analysis could not be conducted for this measure; thus, the discussion here focuses on the other three measures. Table 3 reveals that the simplified HSK demonstrated the best overall quality for item difficulty and item discrimination, followed by reading comprehension and character recognition. It is still unclear whether the consistency between the results of the item analyses of the three measures and the final results was coincidental or causal, yet this finding cannot be ignored; more studies are needed to explore this issue. Our results further confirm the importance of carrying out item analysis of proficiency measures for research purposes (Tremblay, 2011a).

From the perspective of SLA, our results could be explained by the input-intake hypothesis. It is generally believed that input refers to physical stimuli, such as acoustic–phonetic events and graphic objects, that learners encounter in meaningful contexts and that intake is the mental representation or comprehended version of the physical stimuli (S. Carroll, 2001; Chaudron, 1985; Gass, 1997; Krashen, 1981, 1982). First, the four proficiency measures differ in how closely the input in their materials matches that of the PA test. The materials of the HSK test include both listening and reading input, whereas the reading comprehension test and the Chinese character recognition test include only visual input, and years of instruction in Chinese cannot directly indicate anything about the nature and amount of input. However, successful performance on the PA task mainly depends on participants’ intake from auditory input. Although the orthographic representations of the acoustic–phonetic stimuli might be unconsciously activated (Oakhill & Kyle, 2000), in the case of the Chinese PA test, the activation of Pinyin (the romanization system used to transcribe the pronunciation of Chinese characters) might be stronger than that of Chinese characters. This is because the phonological structure is made explicit in Pinyin. For instance, the Pinyin form of the Chinese character 三 (three) is sān (/san/), which represents its onset (/s/), rime (/an/), and tone (the diacritic above a). Thus, the PA task is consistent only with the listening section of the HSK test in terms of input. The HSK test may be better able to capture participants’ overall L2 proficiency and, in particular, their listening skills; consistent with this, Table 8 shows that the HSK had the strongest correlation with PA (r = .39) among the four measures. However, we should be very cautious about the generalizability of our findings to other topics that might be based on different inputs, such as orthographic awareness.

More importantly, another reason could relate to the role of the input in L2 acquisition. Listening is the most essential and primary impetus for the initiation of first- and second-language learning and is the critical medium for maintaining communication and sustaining language development (Bozorgian, 2012; Feyten, 1991; Oxford, 1993). Listening accounts for approximately 50% of the total time devoted to communication, a share that might reach 90% in high school and college environments (Gilbert, 2005), and it is the primary channel for accumulating knowledge and interpreting experience (Linebarger, 2001). In the case of L2 learning, according to Krashen (1981), input skills such as listening and reading play essential and primary roles in the development of language proficiency, and output skills such as speaking and writing emerge on the basis of a sufficient amount of comprehensible language input. In particular, listening is the most fundamental of the four language skills and is the basis for the development of language proficiency; a strong relationship between listening ability and overall language proficiency has been reported by Feyten (1991) for learners of French and Spanish and by Bozorgian (2012) for Iranian learners of English. These findings suggest that language proficiency tests should not neglect listening skills. In fact, it has been found that the structure of advanced CSL learners’ language abilities could be best modeled as listening and reading + speaking + writing (X. Yuan & Wang, 2012). This finding provides further empirical support for the validity of including both aural and visual input in the simplified HSK test and may help explain why the test was more sensitive to L2 Chinese proficiency than the other measures.

Drawing on the overall findings, we are not able to conclude which measure could best reveal CSL proficiency. This is because only four different measures of L2 proficiency were administered in the present study, and three measures focus only on input comprehension skills, indicating that none of them is an optimum test for CSL proficiency. The question of the best measure to assess CSL proficiency relates to theoretical accounts of language proficiency/skills. The components of L2 proficiency are very complex and have been discussed by numerous researchers (Bachman, 1990; J. B. Carroll, 1961; Davies, 2014; Hulstijn, 2012; Lado, 1961; Oller, 1979; Spolsky, 1977). Thus, discussing the best measure to represent CSL proficiency is beyond the aim of the present study.

Based on the results, we can tentatively recommend how each test could be used as an L2 proficiency assessment method in SLA research, particularly in studies involving CSL learners. A comprehensive test is recommended when global L2 proficiency must be measured for research purposes. In this study, the HSK included both listening and reading comprehension, in line with Tremblay’s (2011a) recommendation to use a cloze test alongside an aural task to measure global L2 proficiency. Because there is a consensus that language proficiency is multidimensional, such a combination can enable researchers to investigate global L2 proficiency better than the use of one measure alone. Administering an integrative test is not always practical because of constraints such as time limits. However, the current study indicates that a multimode integrative test might be more potent than a single-mode test in evaluating L2 proficiency; hence, a comprehensive test, even a simplified version, is recommended as an essential assessment tool for global L2 proficiency for research purposes.

Despite the importance and power of standardized tests for SLA research, they are not commonly used to estimate L2 proficiency for various reasons (Thomas, 1994, 2006; Tremblay, 2011a; H. Zhang, 2018). In this case, a reading comprehension test could be an alternative. Reading comprehension tests have been widely administered to measure L2 proficiency, and the significant relationship between the two variables has been well documented (Heilenman, 1983; Tremblay, 2011a). The results of the present research provide further support for the reliability of reading comprehension tests as a method of assessing L2 Chinese proficiency (B. Yuan, 1995). For instance, B. Yuan (1995) categorized participants into five groups based on their scores on two cloze tests. The differences in acceptability judgment test scores in three conditions between elementary and advanced groups were significant, with effect sizes (Cohen’s d) of .58, 1.05, and .67. The results indicated that reading comprehension tests, such as cloze tests, could be used to discriminate among L2 learners at different proficiency levels. Nevertheless, researchers should be cautious about the results generated by using reading comprehension tests because this method does not measure other L2 skills, such as listening.

In the case of CSL research, Chinese character recognition might be used as an L2 Chinese proficiency measure. Character recognition tests have been administered as reliable measures to evaluate CSL learners’ proficiency in placement tests (Wu et al., 2017) and eye-tracking research (Gao, 2017). The present study found that the results of t tests based on character recognition did not achieve statistical significance, whereas character recognition was a significant predictor of PA in the regression analysis when the outliers were kept (Table 6). However, the effect size of Chinese character recognition for PA was not stable and was smaller than those for the reading comprehension test and the HSK. These results indicate that character recognition could be used to measure L2 Chinese proficiency to some extent, yet its limitations cannot be ignored. The Chinese character recognition test concentrates only on characters, and it might be suitable for research involving Chinese character cognition, acquisition, and reading ability. For instance, Gao (2017) categorized CSL learners into intermediate and advanced levels according to their performance in recognizing characters, and the effect sizes of L2 proficiency levels on eye-tracking performance in Chinese reading were larger than 4. Within such character-focused research, character recognition can therefore serve as an indicator of L2 Chinese proficiency up to a point.

The present study found that years of instruction was not a reliable L2 proficiency measure, even though it is commonly used as an indicator of L2 proficiency (Hulstijn, 2012; Thomas, 1994, 2006; Tremblay, 2011a; H. Zhang, 2018). First, although the results of t tests based on years of instruction in Chinese were significant, the effect size was small. Second, the regression model including years of instruction explained a significant amount of the variance in PA, yet years of instruction was not a significant predictor of PA. As B. Yuan (1995) and Tremblay (2011a) suggested, years of instruction as an L2 measure indicates the amount of L2 input, which varies widely in terms of both quality and quantity among participants. In addition, the extent to which years of instruction could predict L2 proficiency depends on the context in which the L2 is learned (Tremblay, 2011a). More importantly, years of instruction is not a psycholinguistic measure with reliability or validity. Therefore, years of instruction and classroom level, which are individual background variables, may not be reliable or valid for measuring L2 proficiency.

Implications

Regarding the use of L2 proficiency measures for research, we offer three suggestions.

First, details regarding L2 proficiency measures should be reported. Our study found that research findings can differ depending on the L2 proficiency measures used, and this finding confirms the importance of reporting the details of the L2 proficiency measures employed (Hulstijn, 2012; Marsden et al., 2018; Norris & Ortega, 2012). Not reporting how participants’ L2 proficiency was measured or how the researcher justified the use of a specific test might jeopardize the generalizability of the research findings (Thomas, 1994, 2006; H. Zhang, 2018). Therefore, we confirm Hulstijn’s (2012) recommendations to “describe the test’s target group (age, literacy and educational level), skills measured, task(s), and materials in sufficient detail, and report its validity (if known) and its psychometric characteristics (e.g., internal consistency) for the target group” (p. 430). Including a detailed description of L2 proficiency measures in future studies could support the generalizability of the findings and provide more information for meta-analysis research and replication studies (Marsden et al., 2018).

Second, a comprehensive test of L2 proficiency is recommended for research purposes to yield more desirable outcomes. Considering that publication bias favors research with significant results (Norris & Ortega, 2000; Rothstein et al., 2006; Sutton, 2009), it is understandable that researchers seek to obtain findings with statistical significance and large effect sizes. Because L2 proficiency is commonly measured or manipulated in SLA research, comprehensive measures of L2 proficiency, which could yield more significant findings, are recommended for researchers. However, one comprehensive test might not be powerful enough to assess an individual’s L2 proficiency; therefore, if time and available resources allow, measures of L2 proficiency from different perspectives, such as standardized tests, language learning history, and self-evaluation, could be utilized (Sundara et al., 2006; Tremblay, 2011b).

Third, item analysis of L2 proficiency measures is recommended for future research. The quality of L2 proficiency measures is crucial for successfully assessing participants’ abilities in the target language, and evaluating that quality involves examining item difficulty and item discrimination. However, this practice has rarely been observed in SLA studies (Tremblay, 2011a). Admittedly, it is unrealistic to expect every SLA researcher to carry out detailed item analysis and to design comprehensive tests, given the significant differences between standardized language tests and L2 proficiency tests for research purposes. Nevertheless, we believe that a brief item analysis could benefit the L2 proficiency measures administered in SLA research in terms of meeting both testing and research standards. Such analysis would, in turn, further contribute to the interpretation and generalizability of the results.
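
A brief item analysis of this kind can be sketched in a few lines. The example below assumes dichotomously scored items (1 = correct, 0 = incorrect), takes item difficulty as the proportion of correct answers, and takes item discrimination as the difference in that proportion between the top and bottom 27% of test takers ranked by total score. These are common classical-test-theory conventions and not necessarily the procedure behind Table 3; the response matrix and the function name are invented purely for illustration.

```python
from typing import List, Tuple


def item_analysis(responses: List[List[int]], tail: float = 0.27) -> List[Tuple[float, float]]:
    """Return (difficulty, discrimination) for each item.

    responses[i][j] is 1 if test taker i answered item j correctly, else 0.
    """
    n = len(responses)
    n_items = len(responses[0])
    totals = [sum(row) for row in responses]
    # Rank test takers by total score, highest first (ties keep input order).
    order = sorted(range(n), key=lambda i: totals[i], reverse=True)
    k = max(1, round(n * tail))          # size of the upper and lower groups
    upper, lower = order[:k], order[-k:]
    results = []
    for j in range(n_items):
        difficulty = sum(row[j] for row in responses) / n
        p_upper = sum(responses[i][j] for i in upper) / k
        p_lower = sum(responses[i][j] for i in lower) / k
        results.append((difficulty, p_upper - p_lower))
    return results


# Hypothetical data: 10 test takers x 3 items.
data = [
    [1, 1, 1], [1, 1, 1], [1, 1, 0], [1, 1, 0], [1, 0, 1],
    [1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 0, 0], [0, 0, 0],
]
for item, (p, d) in enumerate(item_analysis(data), start=1):
    print(f"item {item}: difficulty = {p:.2f}, discrimination = {d:.2f}")
```

In this toy data set the first item is the easiest (difficulty .70) and separates the upper and lower groups perfectly, whereas the other two items discriminate less sharply.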

Attempts should also be made to develop a standard L2 proficiency test for research in specific languages. Although the application of meta-analysis could facilitate the synthesis of research findings (Glass et al., 1981; Schmidt, 1992; Thompson & Higgins, 2002), the use of different measures of L2 proficiency might still be a problem for researchers trying to generalize and accumulate evidence. Thus, efforts should be made to develop standard language tests for research, which would strengthen the generalizability of research findings and support the accumulation of knowledge. We propose several recommendations below.

First, the design of a standard L2 proficiency test for research purposes must be valid, reliable, and practical (Thomas, 1994). Practicality is more critical for L2 proficiency measures for research than it is for standardized tests and placement tests (Tremblay, 2011a). Practicality refers to the ratio between available resources and required resources, such as human resources, material resources, time, and related costs (Bachman & Palmer, 1996). Given the limited time and funding for most studies, it is optimal for researchers to evaluate participants’ L2 proficiency within a short time, generally less than half an hour (Tremblay, 2011a). Therefore, controlling for reliability and validity, researchers designing L2 proficiency measures for studies should pay more attention to practicality, aiming to achieve the goal of evaluating L2 proficiency in less time and with less cost yet with high efficiency.

Second, more studies are needed to explore the interface between language tests and language acquisition (Bachman, 1988; Bachman & Cohen, 2002; Gu, 2014; Shohamy, 2000). At the level of practice, standardized tests are not commonly used to estimate L2 proficiency; thus, it is crucial to investigate how other L2 proficiency measures map onto standardized tests. For instance, if the correspondence between participants’ performance on a cloze test and their scores or levels on a standardized test is known, evaluating L2 participants’ proficiency based on a cloze test could be more scientific, and the research findings could be more powerful.

Third, an assessment use argument (AUA) or operational models should be developed for L2 proficiency measures for research. An AUA is “essentially a local or operational theory that guides both the design and development of a specific language assessment and the collection of evidence in support of a specific intended use” (Bachman, 2007, p. 70). Researchers studying language testing draw upon a hybrid of theoretical frameworks, and they have defined language proficiency from different perspectives (Bachman, 2007), yet some of the frameworks are difficult for SLA researchers to employ. Therefore, an AUA or operational model of language proficiency is essential for SLA researchers to carry out studies.

Limitations

This study has at least three limitations. First, the sample size was small. Thus, the extent to which the results could be generalized to other samples and research questions is still unclear. In addition, the small sample size might have been responsible for the unsatisfactory regression diagnostics. Second, this study used PA as the dependent variable; thus, whether the findings can be applied to research involving other types of metalinguistic awareness is still uncertain. Third, the HSK test used in this study included only listening and reading sections and excluded speaking and writing tests; the simplified HSK is therefore a relatively comprehensive indicator of only the input comprehension skills involved in Chinese proficiency. Thus, studies that rely on a more comprehensive test that includes listening, speaking, reading, and writing skills might yield different results.

Although the present study has several limitations, this research is important for our understanding of how L2 proficiency measures affect research results on the effect of L2 proficiency on the target variable, and it has important implications for the use and design of standard L2 proficiency measures for research. We conclude the present research with the following statement from Tremblay (2011a):

Providing adequate proficiency estimates that meet both testing and research standards can not only enhance the interpretation of experimental results but also decrease the disparity between the proficiency assessment methods used in SLA research and thus facilitate comparisons between studies. (p. 364).

Appendix

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This paper was supported by the Ministry of Education of China (20YJC740088), the Innovative School Project in Higher Education of Guangdong, China (GWTP-GC-2017-01), and the Social Science Key Research Grant of Universities in Guangdong Province (2018WZDXM005).

ORCID iD

Haiwei Zhang

References

Aiken

L. R.

(2003). Psychological testing and assessment (11th ed.). Boston: Allyn & Bacon.

Bachman

L. F.

(1988). Language testing-SLA research interfaces. Annual Review of Applied Linguistics, 9, 193–209. https://doi.org/10.1017/S0267190500000891

Bachman

L. F.

(1990). Fundamental considerations in language testing. Oxford University Press.

Bachman

L. F.

(2007). What is the construct? The dialectic of abilities and contexts in defining constructs in language assessment. In Fox

Wesche

Bayliss

Cheng

Turner

C. E.

Doe

(Eds.), Language testing reconsidered (pp. 41–71). University of Ottawa Press.

Bachman

L. F.

Cohen

A. D.

(Eds.) (2002). Interface between second language acquisition and language testing research. Foreign Language Teaching and Research Press/Cambridge University Press.

Bachman

L. F.

Palmer

A. S.

(1996). Language testing in practice. Oxford University Press.

Blais

Laurier

M. D.

(1995). The dimensionality of a placement test from several analytical perspectives. Language Testing, 12(1), 72–98. https://doi.org/10.1177/026553229501200105

Bozorgian

(2012). The relationship between listening and other language skills in International English Language Testing System. Theory and Practice in Language Studies, 2(4), 657–663. https://doi.org/10.4304/tpls.2.4.657-663

Bradley

Bryant

(1983). Categorizing sounds and learning to read: A causal connection. Nature, 301(3), 419–421. https://doi.org/10.1038/301419a0

10.

Brown

J. D.

(2002). Do cloze tests work? Or is it just an illusion? Second Language Studies, 21(1), 79–125.

11.

Carroll

J. B.

(1961). Fundamental considerations in testing for English language proficiency of foreign students. In Allen

H. B.

Campbell

R. N.

(Eds.), Teaching English as a second language (pp. 313–321). McGraw-Hill.

12.

Carroll

(2001). Input and evidence: The raw material of second language acquisition. John Benjamins.

13.

Chaudron

(1985). Intake: On methods and models for discovering learners’ processing of input. Studies in Second Language Acquisition, 7, 1–14. https://doi.org/10.1017/S027226310000512X

14.

Cohen

(1988). Statistical power analysis for the behavioral sciences. Lawrence Erlbaum.

15.

Davies

(2014). Fifty years of language assessment. In Kunnan

A. J.

(Ed.), The companion to language assessment (pp. 1–19). John Wiley & Sons.

16.

DeVellis

R. F.

(1991). Scale development. Sage.

17.

Feyten

C. M.

(1991). The power of listening ability: An overlooked dimension in language acquisition. The Modern Language Journal, 75(2), 173–180. https://doi.org/10.1111/j.1540-4781.1991.tb05348.x

18.

Fulcher

(1997). An English language placement test: Issues in reliability and validity. Language Testing, 14(2), 113–139. https://doi.org/10.1177/026553229701400201

19.

Fulcher

(1999). Computerizing an English language placement test. ELT Journal, 53(4), 289–299. https://doi.org/10.1093/elt/53.4.289

20.

Gao

(2017). 母语者和第二语言学习者汉语阅读中语块加工优势的眼动研究 [Processing advantage of formulaic sequences in Chinese reading by native and second language speakers: An eye-tracking study]. Shijie Hanyu Jiaoxue, 4, 560–575.

21.

Gass

S. M.

(1997). Input, interaction, and the second language learner. Lawrence Erlbaum.

22.

Gilbert

M. B.

(2005). An examination of listening effectiveness of educators: Performance and preference. Professional Educator, 27, 1–16.

23.

Glass

G. V.

McGaw

Smith

M. L.

(1981). Meta-analysis in social research. Sage.

24.

Goswami

Bryant

(2016). Phonological skills and learning to read. Routledge.

25.

Grosjean

(1998). Studying bilinguals: Methodological and conceptual issues. Bilingualism: Language and Cognition, 1(2), 131–149. https://doi.org/10.1017/S136672899800025X

26.

(2014). At the interface between language testing and second language acquisition: Language ability and context of learning. Language Testing, 31(1), 111–133. https://doi.org/10.1177/0265532212469177

27.

Guojia yuwei. (2010). 汉语国际教育用音节汉字词汇等级划分 [The graded Chinese syllables, characters, and words for the application of teaching Chinese to the speakers of other languages]. Beijing Language and Culture University Press.

28.

Harrington

Carey

(2009). The on-line Yes/No test as a placement tool. System, 37(4), 614–626. https://doi.org/10.1016/j.system.2009.09.006

29.

Heilenman

L. K.

(1983). The use of a cloze procedure in foreign language placement. The Modern Language Journal, 67(2), 121–126. https://doi.org/10.2307/328284

30.

Hidri

(2018). Discrete point and integrative testing. In Liontas

J. I.

(Ed.), The TESOL encyclopedia of English language teaching. John Wiley & Sons. https://doi.org/10.1002/9781118784235.eelt0375

31.

Hodge

Austin

(2004). A survey of outlier detection methodologies. Artificial Intelligence Review, 22(2), 85–126. https://doi.org/10.1023/B:AIRE.0000045502.10941.a9

32.

Hulme

Snowling

M. J.

(2013). Learning to read: What we know and what we need to understand better. Child Development Perspectives, 7(1), 1–5. https://doi.org/10.1111/cdep.12005

33.

Hulstijn

J. H.

(2012). The construct of language proficiency in the study of bilingualism from a cognitive perspective. Bilingualism: Language and Cognition, 15(2), 422–433. https://doi.org/10.1017/S1366728911000678

34.

Jamieson

Wang

Church

(2013). In-house or commercial speaking tests: Evaluating strengths for EAP placement. Journal of English for Academic Purposes, 12(4), 288–298. https://doi.org/10.1016/j.jeap.2013.09.003

35.

Keung

Y.-C.

C. S.-H.

(2009). Transfer of reading-related cognitive skills in learning to read Chinese (L1) and English (L2) among Chinese elementary school children. Contemporary Educational Psychology, 34(2), 103–112. https://doi.org/10.1016/j.cedpsych.2008.11.001

36.

Krashen

S. D.

(1981). Second language acquisition and second language learning. Pergamon.

37.

Krashen

S. D.

(1982). Principles and practice in second language acquisition. Pergamon.

38.

Lado

(1961). Language testing: The construction and use of foreign language tests. McGraw-Hill.

39.

LeClercq

Edmonds

Hilton

(Eds.). (2014). Measuring L2 proficiency: Perspectives from SLA. Multilingual Matters.

40.

Lee

Greene

(2007). The predictive validity of an ESL placement test: A mixed methods approach. Journal of Mixed Methods Research, 1(4), 366–389. https://doi.org/10.1177/1558689807306148

41.

Anderson

R. C.

Nagy

W. E.

Zhang

(2002). Facets of metalinguistic awareness that contributes to Chinese literacy. In Li

Gaffney

J. S.

Packard

J. L.

(Eds.), Chinese children’s reading acquisition: Theoretical and pedagogical issues (pp. 87–106). Kluwer Academic.

42.

Linebarger

D. L.

(2001). Beginning literacy with language: Young children learning at home and school. Topics in Early Childhood Special Education, 21, 188–192.

43.

Makel

Plucker

Hegarty

(2012). Replications in psychology research: How often do they really occur? Perspectives in Psychological Science, 7, 537–542. https://doi.org/10.1177/1745691612460688

44.

Marsden

Morgan-Short

Thompson

Abugraber

(2018). Replication in second language research: Narrative and systematic reviews and recommendations for the field. Language Learning, 68(2), 321–391. https://doi.org/10.1111/lang.12286

45.

McBride

Kail

R. V.

(2002). Cross-cultural similarities in the predictors of reading acquisition. Child Development, 73(5), 1392–1407. https://doi.org/10.1111/1467-8624.00479

46. Norris, J. M., & Ortega (2000). Effectiveness of L2 instruction: A research synthesis and quantitative meta-analysis. Language Learning, 50(3), 417–528. https://doi.org/10.1111/0023-8333.00136

47. Norris, J. M., & Ortega (2012). Assessing learner knowledge. In Gass, S. M., & Mackey (Eds.), The Routledge handbook of second language acquisition (pp. 573–589). Routledge.

48. Nowbakht (2019). The role of working memory, language proficiency, and learners’ age in second language English learners’ processing and comprehension of anaphoric sentences. Journal of Psycholinguistic Research, 48(2), 353–370. https://doi.org/10.1007/s10936-018-9607-2

49. Oakhill & Kyle (2000). The relation between phonological awareness and working memory. Journal of Experimental Child Psychology, 75(2), 152–164. https://doi.org/10.1006/jecp.1999.2529

50. Oller, J. W. (1979). Language tests at school. Longman.

51. Oxford, R. L. (1993). Research update on teaching L2 listening. System, 21(2), 205–211. https://doi.org/10.1016/0346-251X(93)90042-F

52. Perfetti, C. A., Zhang, & Berent (1992). Reading in English and Chinese: Evidence for a “universal” phonological principle. In Frost & Katz (Eds.), Orthography, phonology, morphology, and meaning (pp. 227–248). North-Holland.

53. Rothstein, H. R., Sutton, A. J., & Borenstein (2006). Publication bias in meta-analysis: Prevention, assessment and adjustments. John Wiley & Sons.

54. Schmidt, F. L. (1992). What do data really mean? Research findings, meta-analysis, and cumulative knowledge in psychology. American Psychologist, 47(10), 1173–1181. https://doi.org/10.1037/0003-066X.47.10.1173

55. Shohamy (2000). The relationship between language testing and second language acquisition, revisited. System, 28(4), 541–553. https://doi.org/10.1016/S0346-251X(00)00037-3

56. Song, Georgiou, G. K., & Shu (2015). How well do phonological awareness and rapid automatized naming correlate with Chinese reading accuracy and fluency? A meta-analysis. Scientific Studies of Reading, 20(2), 99–123. https://doi.org/10.1080/10888438.2015.1088543

57. Sparks, R. L., Patton, J., Ganschow, & Humbach (2009). Long-term relationships among early first language skills, second language aptitude, second language affect, and later second language proficiency. Applied Psycholinguistics, 30(4), 725–755.

58. Spolsky (1977). Language testing: Art or science? In Nickel (Ed.), Proceedings of the Fourth International Congress of Applied Linguistics (Vol. 3, pp. 7–28). Hochschulverlag.

59. Sundara, Polka, & Baum (2006). Production of coronal stops by simultaneous bilingual adults. Bilingualism: Language and Cognition, 9(1), 97–114. https://doi.org/10.1017/S1366728905002403

60. Sutton, A. J. (2009). Publication bias. In Cooper, Hedges, L. V., & Valentine, J. C. (Eds.), The handbook of research synthesis and meta-analysis (2nd ed., pp. 435–452). Russell Sage Foundation.

61. Swanson, H. L., Trainin, Necoechea, D. M., & Hammill, D. D. (2003). Rapid naming, phonological awareness, and reading: A meta-analysis of the correlation evidence. Review of Educational Research, 73(4), 407–440. https://doi.org/10.2307/3515998

62. Thomas (1994). Assessment of L2 proficiency in second language acquisition research. Language Learning, 44(2), 307–336. https://doi.org/10.1111/j.1467-1770.1994.tb01104.x

63. Thomas (2006). Research synthesis and historiography: The case of assessment of second language proficiency. In Norris, J. M., & Ortega (Eds.), Synthesizing research on language learning and teaching (pp. 279–298). John Benjamins.

64. Thompson, S. G., & Higgins, J. P. T. (2002). How should meta-regression analyses be undertaken and interpreted? Statistics in Medicine, 21(11), 1559–1573. https://doi.org/10.1002/sim.1187

65. Tremblay (2011a). Proficiency assessment standards in second language acquisition research: “Clozing” the gap. Studies in Second Language Acquisition, 33(3), 339–372. https://doi.org/10.1017/S0272263111000015

66. Tremblay (2011b). Learning to parse liaison-initial words: An eye-tracking study. Bilingualism: Language and Cognition, 14(3), 257–279. https://doi.org/10.1017/S1366728910000271

67. Uchikoshi & Marinova-Todd (2012). Language proficiency and early literacy skills of Cantonese-speaking English language learners in the U.S. and Canada. Reading and Writing, 25(9), 2107–2129. https://doi.org/10.1007/s11145-011-9347-2

68. Hong & Deng (2017). 汉字认读在汉语二语者入学分班测试中的应用—建构简易汉语能力鉴别指标的实证研究 [Application of Chinese character identification in placement tests for CSL learners: An empirical study of constructing simple Chinese proficiency indicators]. Shijie Hanyu Jiaoxue, 3, 395–411.

69. Yeung, S. S., & Chan, C. K. K. (2013). Phonological awareness and oral language proficiency in learning to read English among Chinese kindergarten children in Hong Kong. British Journal of Educational Psychology, 83(4), 550–568. https://doi.org/10.1111/j.2044-8279.2012.02082.x

70. Yuan (1995). Acquisition of base-generated topics by English-speaking learners of Chinese. Language Learning, 45(4), 567–603. https://doi.org/10.1111/j.1467-1770.1995.tb00455.x

71. Yuan & Wang (2012). 基于结构方程模型的高级汉语学习者语言技能关系研究 [A study of the relationship among the four skills of advanced Chinese learners: Based on structural equation modeling]. Huawen Jiaoxue yu Yanjiu, 4, 24–32.

72. Zareva, Schwanenflugel, & Nikolova (2005). Relationship between lexical competence and language proficiency: Variable sensitivity. Studies in Second Language Acquisition, 27(4), 567–595. https://doi.org/10.1017/S0272263105050254

73. Zhang (2018). 研究用汉语水平分级测试方法对研究结果的影响探索 [The influence of different L2 proficiency measures on research results]. Yuyan Jiaoxue yu Yanjiu, 6, 14–23.

74. Zhang & Roberts (2019). The role of phonological awareness and phonetic radical awareness in acquiring Chinese literacy skills in learners of Chinese as a second language. System, 81, 163–178. https://doi.org/10.1016/j.system.2019.02.007

75. Zhang (2007). 外国留学生汉语语音意识的发展 [The development of foreign students’ Chinese phonological awareness]. 暨南大学华文学院学报 / Journal of College of Chinese Language and Culture of Jinan University, 1, 105–108.

76. Ziegler, J. C., & Goswami (2005). Reading acquisition, developmental dyslexia, and skilled reading across languages: A psycholinguistic grain size theory. Psychological Bulletin, 131(1), 3–29. https://doi.org/10.1037/0033-2909.131.1.3
