Abstract
Researchers have used different measures to examine participants’ second-language (L2) proficiency, yet it remains unclear how these different measures influence research results on the effect of L2 proficiency on the target variable. The present research explored the modulation effect of four different measures of L2 Chinese proficiency on Chinese phonological awareness (PA). Eighty-four L2 learners of Chinese who spoke English or Arabic as their first language completed four different measures of Chinese proficiency: a self-report measure of years of instruction in Chinese, a simplified Hanyu Shuiping Kaoshi (HSK; a standardized Chinese language proficiency test), a reading comprehension test, and a Chinese character recognition test. The overall results of
Introduction
With the growing influence of psychology and neuroscience on second-language acquisition (SLA), the demand for the rigorous control of research methods in SLA studies, particularly those examining participants’ second-language (L2) proficiency, has been highlighted (Grosjean, 1998; Tremblay, 2011a). Identifying valid and efficient methods to measure L2 proficiency is one of the most critical issues for both language education practitioners and SLA researchers. In the current study, language proficiency represents “an index of the comprehension and production abilities that L2 learners develop across linguistic domains (e.g., lexical competence, grammatical competence, discourse competence) and modalities to communicate” (Tremblay, 2011a, p. 340).
Studies on L2 proficiency assessment can generally be categorized into three types. The first group focuses on standardized tests (see journals such as
In general, the L2 proficiency level of L2 learners is used in three ways in empirical studies. First, L2 proficiency is used as a dependent variable, and the factors, such as the prediction of early L1 skills and L2 aptitude in L2 proficiency, that contribute to or affect the development of L2 proficiency are explored (Sparks et al., 2009). Second, according to participants’ performance on L2 proficiency measures, participants are categorized into groups with different proficiencies, and then cross-group differences in the target variable, such as the differences in lexical knowledge between native speakers of English, L2 advanced learners, and L2 intermediate learners, are examined using
Some researchers have summarized how different L2 proficiency assessment methods have been used in published works (Hulstijn, 2012; Thomas, 1994, 2006; Tremblay, 2011a; H. Zhang, 2018), yet few studies have explored how these different L2 proficiency assessment methods influence research results on the effect of L2 proficiency on the target variable. Therefore, the present study explored how different L2 assessment methods could yield different results and findings on the influence of L2 proficiency on the target variable. This study is intended to contribute to the field of SLA methodology.
Literature Review
The sparse investigation of L2 proficiency for research purposes may be explained by several possible reasons. First, researchers do not agree on which test is considered the best method to assess L2 proficiency for research purposes, but such a need has been underlined by various studies (Hulstijn, 2012; Thomas, 1994, 2006; Tremblay, 2011a; H. Zhang, 2018). Because of the differences in theoretical frameworks, research questions, participants’ background information, and L2 learning contexts, SLA researchers have adopted different L2 proficiency measures.
Another reason may concern the shortage of replication studies. Makel et al. (2012) found that in the field of psychology, most replication studies are conceptual and involve, as Marsden et al. (2018) stated, “intentional adaptation of the initial study to investigate generalizability to new conditions, contexts, or study characteristics” (p. 325), while the few direct replication studies have had “no intentional or significant alternations of the initial study” (pp. 325–326). Marsden et al. (2018) further noted that few replication studies have been carried out or published in the field of SLA, and one possible reason might relate to the poor transparency of the initial studies, such as in the measurement of the participants’ proficiency. Therefore, for the nonreplication studies that dominate the SLA field, whether the L2 proficiency measures used in novel studies should be consistent with those used in previous studies does not appear to be a significant issue, and a uniform L2 proficiency test in a particular language does not seem necessary because the operationalized definition of L2 proficiency is widely accepted by researchers and journals.
In recent years, researchers have realized the need for and called for more studies on L2 proficiency tests for research purposes. Reporting and controlling for measures of L2 proficiency for research purposes is argued to be important for at least three reasons (Marsden et al., 2018; Norris & Ortega, 2012; Tremblay, 2011a). First, it could provide evidence for categorizing participants into different groups. Second, it could help readers determine the extent to which the findings may be generalized to other populations or samples. Finally, it could facilitate comparisons between studies and thus contribute to the accumulation of knowledge.
Researchers have surveyed different L2 proficiency measures used in published articles in different languages, such as English, French, and Chinese (Hulstijn, 2012; Thomas, 1994, 2006; Tremblay, 2011a; H. Zhang, 2018). The main findings are as follows.
First, studies have found that a variety of methods have been administered to measure L2 proficiency across studies published in different languages. Thomas (1994, 2006), Hulstijn (2012), and Tremblay (2011a) investigated publications in several journals, mostly in English and some in French, and found that the L2 proficiency measures utilized by researchers included institutional status, standardized tests, impressionistic judgment, in-house assessment instruments, translation tests, language test certificates, and others. Tremblay (2011a) further categorized these measures into independent tests (such as standardized tests, cloze tests, and oral tests) and dependent tests (such as previous language test scores, the length of residence in an L2-speaking environment, and self-assessment). An investigation of publications in the field of Chinese as a second language (CSL) reported similar findings (H. Zhang, 2018). The L2 proficiency measures employed by CSL researchers included standardized tests such as the standardized HSK, reading comprehension tests such as cloze tests, and Chinese character recognition tests.
The second finding is that researchers have tended to favor fast and convenient estimates of L2 proficiency, such as classroom level, years of instruction in the target language, program enrollment, and institutional status. Thomas (1994) and Thomas (2006) found that 40.1% and 33.2%, respectively, of studies used institutional status as an L2 proficiency measure. Tremblay (2011a) reported that 38.2% of English and French SLA studies measured L2 proficiency based on the classroom level or years of instruction. The percentage of CSL studies that used these two methods was as high as 52.38% (H. Zhang, 2018). Although methods such as years of instruction or classroom level are commonly used, it remains unclear whether they are as powerful as other measures, such as standardized tests.
Third, standardized language tests have not been widely used to estimate L2 proficiency in SLA studies. The investigations by Thomas (1994), Thomas (2006), Tremblay (2011a), and H. Zhang (2018) demonstrated that only 22.3%, 23.2%, 14.59%, and 6.35% of studies, respectively, administered a standardized language test to measure L2 proficiency. The main reasons could be the following: (a) standardized tests are difficult for researchers to access and (b) standardized tests usually consume a great deal of time and effort (Tremblay, 2011a). Although standardized tests may be better able to reveal the actual state of learners’ proficiency, it is still unknown whether this method is more robust than other methods in terms of contributing to research findings concerning the effect of L2 proficiency on the target variable.
The above investigations show a lack of uniformity in L2 proficiency assessment methods. The possible consequences of using different L2 proficiency measures have been noted by some researchers, and the main consequence lies in the weakening of the generalizability of research findings (LeClercq et al., 2014; Tremblay, 2011a). However, the influence of different L2 proficiency measures on research results on the effect of L2 proficiency on the target variable has been explored less often in empirical studies. Currently, there may be only one available study, that by H. Zhang (2018), that has examined this issue; the study used data on phonological awareness (PA) and phonetic radical awareness (PRA) from 40 adult English CSL learners who studied in the United Kingdom. Five measures of L2 Chinese proficiency were used: years of instruction in Chinese, a simplified HSK test (including listening and reading comprehension), a listening comprehension test, a reading comprehension test, and a Chinese character recognition test. The participants were divided into preintermediate and intermediate groups according to the operationalized criteria based on each measure. When years of instruction in Chinese was used as the criterion, participants with 1 year of instruction and those with 2 years of instruction were assigned to the preintermediate and intermediate groups, respectively. When the HSK test, listening comprehension test, or reading comprehension test was used as the criterion, participants who scored below and above the midpoint of the maximum score on each test were assigned to the preintermediate and intermediate groups, respectively. The participants who scored below and above .33 on the character reading test were assigned to the preintermediate and intermediate groups, respectively, because the number of characters from beginner, intermediate, and advanced levels on the test was balanced.
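The grouping rules described above can be sketched as follows. This is only an illustration under stated assumptions: the function names and the example learner are hypothetical, scores are expressed as proportions correct, and a score falling exactly on the cutoff is placed in the intermediate group because the description specifies only "below and above."

```python
PREINTERMEDIATE = "preintermediate"
INTERMEDIATE = "intermediate"


def group_by_years(years_of_instruction):
    """Assign a group from years of instruction (1 year vs. 2 years)."""
    if years_of_instruction == 1:
        return PREINTERMEDIATE
    if years_of_instruction == 2:
        return INTERMEDIATE
    raise ValueError("this rule only covers 1 or 2 years of instruction")


def group_by_proportion(proportion_correct, cutoff):
    """Assign a group by comparing a proportion-correct score with a cutoff.

    Assumption: a score exactly equal to the cutoff counts as intermediate.
    """
    if not 0.0 <= proportion_correct <= 1.0:
        raise ValueError("proportion_correct must lie between 0 and 1")
    return INTERMEDIATE if proportion_correct >= cutoff else PREINTERMEDIATE


# Hypothetical learner: 2 years of instruction, 44% on the simplified HSK,
# and 40% on the character recognition test.
learner = {"years": 2, "hsk": 0.44, "characters": 0.40}

groups = {
    "years": group_by_years(learner["years"]),
    # Midpoint of the maximum score corresponds to a cutoff of .50.
    "hsk": group_by_proportion(learner["hsk"], 0.50),
    # The character test used a cutoff of .33.
    "characters": group_by_proportion(learner["characters"], 0.33),
}
print(groups)
```

In this hypothetical case, the same learner is classified as intermediate by years of instruction and character recognition but as preintermediate by the HSK, which illustrates how the choice of measure can change group membership.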
The results of Zhang’s study indicated that using the HSK as an L2 proficiency measure led to significant differences in PA and PRA between the preintermediate and intermediate groups and that the effect sizes were large (PA, Cohen’s
This issue, however, still needs to be further investigated. H. Zhang (2018) explored this topic using only an independent-samples
The Current Study
The current study extended the study by H. Zhang (2018) to explore the same research question by comparing four measures of L2 Chinese proficiency: years of instruction in Chinese, a simplified HSK, a reading comprehension test, and a character recognition test. It is impossible to explore this issue in relation to all aspects of SLA, and PA was used as an example for the following reasons.
First, PA is a type of metalinguistic awareness and refers to the ability to understand and manipulate the phonological structure of a language. Chinese PA includes awareness of syllables, onset, rime, and tone (Li et al., 2002). PA closely correlates with language proficiency and can indicate L2 ability to some extent (H. Zhang, 2018; Ziegler & Goswami, 2005). Chinese PA is not included in Chinese language proficiency tests, so it is easy to separate participants’ PA performance from their language proficiency. Topics such as grammatical structures or vocabulary are not suitable for the present study because they could overlap with some Chinese proficiency measures in terms of the content examined. Moreover, the number of studies on phonology in the area of SLA is much smaller than the number of studies on grammar and vocabulary. Thus, researching PA could enrich previous studies on phonology and contribute to the development of the area of SLA.
Second, PA is a fundamental and essential skill for the acquisition of different linguistic components (such as vocabulary) and the development of various linguistic skills (such as listening, reading, and spelling) across different languages for L1 (Bradley & Bryant, 1983; Goswami & Bryant, 2016; Hulme & Snowling, 2013; Song et al., 2015; Swanson et al., 2003; Ziegler & Goswami, 2005) and L2 (Keung & Ho, 2009; McBride & Kail, 2002; Uchikoshi & Marinova-Todd, 2012; Yeung & Chan, 2013; H. Zhang & Roberts, 2019). The significant predictive power of PA for Chinese reading lies in the fact that phonological skills are universally required to read in any language because correct mapping between phonology and orthography is essential for the successful decoding of printed words (Perfetti et al., 1992). Therefore, the development of PA is one of the core and fundamental issues for language research.
Third, adult CSL learners’ PA could arguably develop well within a relatively short period because of the limited repertoire of phonemes and tones in Chinese, as well as adult learners’ mature cognitive abilities. Chinese has 21 onsets, 39 rimes, and four tones. Studies have found that CSL learners can achieve above-chance PA performance within several weeks at the initial stage of Chinese learning (J. Zhang & Wu, 2007) and ceiling performance within 1 or 2 years (H. Zhang & Roberts, 2019). In addition, compared with the hundreds or thousands of items in the pool of grammatical structures or vocabulary, the limited repertoire of Chinese phonetics makes it practical to examine the influence of L2 proficiency on the whole range of the development path of Chinese PA instead of a portion of grammatical structures or vocabulary items.
Finally, this topic has been explored by H. Zhang (2018), who drew on the data of CSL learners’ PA. Thus, to make the results of the current study comparable with the findings reported in Zhang’s study, the current study also depended on Chinese PA to answer the research question.
The current study aimed to explore how different L2 proficiency measures influenced the research results on the effect of CSL proficiency on the target variable, Chinese PA. Specifically, the current study included two questions.
Method
Participants
The participants were 40 English-speaking and 44 Arabic-speaking CSL learners who studied Chinese as a primary subject at universities in the United Kingdom or Egypt. The English group included 20 second-year and 20 third-year learners, and the Arabic group included 23 second-year and 21 third-year learners. The data were collected at the beginning of the autumn term; thus, the second-year and third-year groups had studied Chinese for approximately 1 and 2 years, respectively.
Measures
Simplified HSK
The HSK is a standardized L2 Chinese proficiency examination recognized worldwide. It includes six levels, with Level 1 being the beginner level and Level 6 being the advanced level. The HSK is integrative and includes subtests for listening, reading, speaking, and writing. The HSK is rarely used to measure L2 Chinese proficiency in research because the typical duration of the test is approximately 2 hours (H. Zhang, 2018). Drawing on Tremblay’s (2011a) recommendation, we used a simplified HSK that included listening and reading comprehension to make it relatively comprehensive and practical for administration. According to our teaching experience and the instructors’ reports, the participants’ L2 Chinese proficiency ranged from preintermediate to intermediate. Thus, the test was composed of eight questions from the preintermediate level and eight questions from the intermediate level, including four listening and four reading comprehension questions for each level. All the questions were multiple choice, and the participants were required to choose the one correct answer from four alternatives. A score of 1 was assigned to correct answers, and a score of 0 was assigned to unanswered items and incorrect answers, for a maximum score of 16. The test was administered in paper-and-pencil form.
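The scoring rule for the simplified HSK can be sketched as below. The answer key, the response sheet, and the function name are hypothetical and serve only to illustrate the rule of 1 point per correct answer and 0 points for incorrect or unanswered items.

```python
def score_test(responses, answer_key):
    """Return the total score for one participant.

    Each item scores 1 if the response matches the key and 0 otherwise.
    An unanswered item is represented by None and therefore scores 0.
    """
    if len(responses) != len(answer_key):
        raise ValueError("responses and answer_key must have the same length")
    return sum(1 for given, correct in zip(responses, answer_key)
               if given == correct)


# Hypothetical 16-item key (options A-D), not the actual test key.
answer_key = ["A", "C", "B", "D"] * 4

# Hypothetical response sheet: one item left blank, one answered wrongly.
responses = list(answer_key)
responses[0] = None   # unanswered item
responses[5] = "A"    # incorrect answer (the key for this item is "C")

total = score_test(responses, answer_key)
print(total)  # 14 out of a maximum of 16
```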
Reading comprehension test
Cloze tests have been widely used to examine reading comprehension skills and to explore L2 proficiency (Brown, 2002; Tremblay, 2011), yet they have not been used on the HSK test. In addition, cloze tests are not widely used in CSL research, probably due to the lack of a standardized cloze test (H. Zhang, 2018). The reading comprehension test in this study included four questions from the preintermediate level and four questions from the intermediate level. The materials were the same as those in the reading section of the simplified HSK test. The maximum score was 8. The test was administered in paper-and-pencil form.
Chinese character recognition test
Chinese characters constitute the dominant writing system in Chinese. The number of Chinese characters that a CSL learner can recognize can be considered the basis for his or her reading ability. Chinese character recognition tests have been used as valid placement measures in classroom instruction (Wu et al., 2017) and CSL research (Gao, 2017). Such tests take far less time than other L2 proficiency tests, typically taking no more than 10 minutes. However, a standardized Chinese character recognition test has not been developed. In the present study, the Chinese character recognition test included 72 characters, with 36 characters each from the beginner and intermediate levels of
Chinese PA test
The odd-man-out test is a common task to examine PA in English and Chinese. However, because no standardized odd-man-out test is available for a Chinese PA test, an odd-man-out task was developed for the present study that included eight questions each for syllable awareness, onset awareness, rime awareness, and tone awareness. For each question, the participants heard three items and were then required to identify the odd item. For instance, among “dàgē, xiàngpí, gēmí,” the odd item was “xiàngpí” because the other two words had the same syllable “gē.” The test was administered in paper-and-pencil form, and the accuracy rate was calculated by dividing the number of correct answers by 32.
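The accuracy calculation for the PA test can be sketched as follows. The subscale counts are hypothetical, and the chance level of one third is our assumption for a task in which one of three items must be chosen under uniform guessing.

```python
SUBSCALES = ("syllable", "onset", "rime", "tone")
ITEMS_PER_SUBSCALE = 8

# Assumption: guessing uniformly among three items gives a chance level of 1/3.
CHANCE_LEVEL = 1 / 3


def pa_accuracy(correct_counts):
    """Return the overall accuracy rate: correct answers divided by 32."""
    for name in SUBSCALES:
        if not 0 <= correct_counts[name] <= ITEMS_PER_SUBSCALE:
            raise ValueError(f"invalid number of correct answers for {name}")
    total_correct = sum(correct_counts[name] for name in SUBSCALES)
    return total_correct / (ITEMS_PER_SUBSCALE * len(SUBSCALES))


# Hypothetical participant, not actual data.
example = {"syllable": 7, "onset": 6, "rime": 5, "tone": 6}
accuracy = pa_accuracy(example)
print(accuracy, accuracy > CHANCE_LEVEL)
```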
Procedure
All participants completed the tasks individually in a quiet place. The tasks were administered in the following order: PA test (approximately 10 minutes), Chinese character recognition test (approximately 5 minutes), HSK test (including the reading comprehension test, approximately 15 minutes), and background questionnaire (approximately 5 minutes). Most participants completed these tasks within 35 minutes and were paid a small amount of money after completing the tasks.
Data Analysis
First, item analysis was carried out to examine the item difficulty and item discrimination indexes of the HSK, reading comprehension test, and Chinese character recognition test. Second, the preintermediate and intermediate groups differed in L2 proficiency and were not balanced in other aspects such as age and gender, so an independent-samples
Results
According to DeVellis (1991), a test with a Cronbach’s alpha coefficient between .70 and .80 is acceptable, and that with a coefficient between .80 and .90 is very good. The Cronbach’s alpha reliability values of the HSK test (.80), reading comprehension test (.70), Chinese character recognition test (.96), and Chinese PA test (.72) all met or exceeded the .70 threshold, indicating that these measures were reliable.
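For readers who wish to check reliability on their own data, Cronbach’s alpha can be computed from an item-by-participant score matrix as sketched below. The function name and the toy data are hypothetical; the formula is the standard one, alpha = k/(k − 1) × (1 − sum of item variances / variance of total scores).

```python
from statistics import variance


def cronbach_alpha(score_matrix):
    """Compute Cronbach's alpha.

    score_matrix: one row per participant, one column per item.
    Sample variances are used for both the items and the total scores.
    """
    n_items = len(score_matrix[0])
    if n_items < 2:
        raise ValueError("at least two items are required")
    item_columns = list(zip(*score_matrix))
    sum_item_variances = sum(variance(column) for column in item_columns)
    total_variance = variance([sum(row) for row in score_matrix])
    return n_items / (n_items - 1) * (1 - sum_item_variances / total_variance)


# Toy data (hypothetical): four participants, three dichotomous items.
toy_scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
alpha = cronbach_alpha(toy_scores)
print(round(alpha, 2))  # 0.75
```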
The participants’ performance for the measured variables is summarized in Table 1. The Arabic and English groups did not differ significantly for most of the measured variables, including the HSK, reading comprehension test, and character recognition test (Table 2). These results indicate that the two CSL groups were matched in L2 Chinese proficiency; thus, pooling the two groups for analysis was reasonable. In terms of PA performance, the English group outperformed the Arabic group (Table 2), yet each group’s performance was significantly above the chance level: Arabic,
Table 1. Participants’ Background Information and Performance for the Measured Variables.
Summary of the
Item Analysis
The results of the item analysis are summarized in Table 3. The mean item difficulty for HSK was close to .50, and its item discrimination index was excellent. The item difficulty for the reading comprehension test was acceptable, and the item discrimination index was good. The Chinese character recognition test had a good item discrimination index but had a relatively low value for item difficulty.
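The two indexes discussed above can be sketched as follows. The source does not state how the indexes were computed, so this sketch assumes the common classical definitions: item difficulty as the proportion of correct responses and item discrimination as the difference in that proportion between the top and bottom 27% of participants ranked by total score. The function names and data are hypothetical.

```python
def item_difficulty(item_responses):
    """Proportion of participants answering the item correctly (0 to 1)."""
    return sum(item_responses) / len(item_responses)


def item_discrimination(score_matrix, item_index, fraction=0.27):
    """Upper-lower discrimination index for one item.

    Assumption: groups are the top and bottom `fraction` of participants
    ranked by total score (27% is a common convention).
    """
    ranked = sorted(score_matrix, key=sum, reverse=True)
    group_size = max(1, round(len(ranked) * fraction))
    upper = ranked[:group_size]
    lower = ranked[-group_size:]
    p_upper = sum(row[item_index] for row in upper) / group_size
    p_lower = sum(row[item_index] for row in lower) / group_size
    return p_upper - p_lower


# Hypothetical data: ten participants, three dichotomous items.
scores = [
    [1, 1, 1], [1, 1, 1], [1, 1, 0], [1, 1, 0], [1, 0, 1],
    [1, 0, 0], [0, 1, 0], [1, 0, 0], [0, 0, 0], [0, 0, 0],
]

difficulty_item0 = item_difficulty([row[0] for row in scores])
discrimination_item1 = item_discrimination(scores, item_index=1)
print(difficulty_item0, discrimination_item1)
```

Under these definitions, a difficulty value near .50 leaves the most room for an item to separate stronger from weaker participants, which is consistent with the favorable profile reported for the HSK.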
Table 3. Summary of the Results of the Item Analysis.
The participants were sorted into preintermediate and intermediate groups based on the criteria of the four different measures (Table 4), and then, independent-samples
Table 4. Grouping Information Based on Each of the Four Measures.
Summary of the
The raw data collected using the four measures of L2 Chinese proficiency were used in the multiple regression analysis to explore how the four measures influenced the power of CSL proficiency in predicting Chinese PA.
First, each proficiency measure was modeled in isolation with the background variables. Most previous CSL studies have employed only one measure to indicate L2 Chinese proficiency. Therefore, the approach employed in this study is akin to comparing the results of four independent studies that explored the same issue. Because of debates on methods of outlier detection (Hodge & Austin, 2004), we carried out two series of regression analyses: one in which the outliers were kept (Table 6) and another in which the outliers were removed (Table 7). The regression diagnostic plots for the four models in Table 6 are presented in the Appendix (Figures 1–4). The results of the normality, heteroscedasticity, and multicollinearity tests are also presented in the tables.
Table 6. Summary of the Results of the Multiple Regression Analysis (Each L2 Proficiency Measure + Background Variables, Outliers Kept).
Table 7. Summary of the Results of the Multiple Regression Analysis (Each L2 Proficiency Measure + Background Variables, Outliers Removed).
Each of the four models significantly explained a certain percentage of the variance in PA, yet the effect sizes differed (Cohen, 1988). The model including the HSK explained the highest percentage of the variance in PA, followed by the model with reading comprehension and that with character recognition; the model with years of instruction accounted for the lowest percentage. Similarly, the four measures differed in their predictive power for PA. Regardless of whether the outliers were kept or removed, the effect sizes for the HSK, reading comprehension, and years of instruction were large, medium, and small, respectively. The effect size for character recognition was medium when the outliers were kept and small when the outliers were removed.
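One way to express such effect sizes is Cohen’s f², sketched below. The source cites Cohen (1988) but does not state which index was used, so the choice of f², the benchmark values of .02, .15, and .35 for small, medium, and large effects, and the R² values in the example are all assumptions made for illustration.

```python
def cohens_f2(r2_full, r2_reduced=0.0):
    """Cohen's f-squared.

    With r2_reduced = 0, this is the global effect size of a model.
    With r2_reduced set to the R-squared of a model that omits one
    predictor, it is the local effect size of that predictor.
    """
    if not 0.0 <= r2_reduced <= r2_full < 1.0:
        raise ValueError("expected 0 <= r2_reduced <= r2_full < 1")
    return (r2_full - r2_reduced) / (1.0 - r2_full)


def label_effect(f2):
    """Label an f-squared value using Cohen's (1988) conventional benchmarks."""
    if f2 >= 0.35:
        return "large"
    if f2 >= 0.15:
        return "medium"
    if f2 >= 0.02:
        return "small"
    return "negligible"


# Hypothetical R-squared values, not the values reported in the tables.
global_f2 = cohens_f2(0.50)        # whole model
local_f2 = cohens_f2(0.30, 0.20)   # one predictor added to a reduced model
print(label_effect(global_f2), label_effect(local_f2))
```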
Second, we further constructed a model with the four L2 proficiency measures and background variables. The correlation matrix between the measured variables is displayed in Table 8. Two multiple regression analyses were carried out with the outliers kept and removed (Table 9). The regression diagnostic plots for the model with outliers kept in Table 9 are presented in the Appendix (Figure 5). In both cases, the HSK was the only significant predictor of PA, with a medium effect size; none of the other three measures significantly predicted PA, and their effect sizes were very small.
Table 8. Correlation Matrix of the Measured Variables.
Table 9. Summary of the Results of the Multiple Regression Analysis (Four L2 Chinese Proficiency Measures + Background Variables).
Discussion
The present study aimed to explore the influence of different L2 proficiency measures on research results on the effect of L2 proficiency on the target variable. We studied this issue by drawing on PA data collected from Arabic and English CSL learners and using four different measures of L2 Chinese proficiency. Consistent with our hypothesis, the overall results for RQ1 and RQ2 show that different L2 proficiency assessment methods yielded different results and that the comprehensive HSK test produced the most substantial effect, followed by reading comprehension and then Chinese character recognition, with years of instruction having the weakest effect. These results suggest that research findings do differ depending on the L2 proficiency assessment methods used by researchers and that the use of a comprehensive L2 proficiency measure could yield significant results and larger effect sizes than the use of a single aspect of L2 proficiency. These results are consistent with the findings reported by H. Zhang (2018).
The next question to answer is why these different L2 proficiency measures generated different research findings. These findings could be explained from two perspectives.
From the perspective of language testing, the first reason may relate to the comprehensiveness of the four different measures. The HSK test is the most comprehensive, followed by the reading comprehension test, the character recognition test, and years of instruction, which roughly indicates the length of CSL learning. This explanation is consistent with the suggestion that a more comprehensive test might be better to detect the differences in language proficiency among L2 learners at different levels and might make within-group L2 proficiency more homogeneous (Tremblay, 2011a). This explanation is also consistent with the recommended use of an integrative test instead of a discrete-point test to assess language proficiency (Hidri, 2018).
The second reason might relate to the psychometric indexes of the measures. Because years of instruction is not a psychometric measure, item analysis could not be conducted for this measure; thus, the discussion here focuses on the other three measures. Table 3 reveals that the simplified HSK demonstrated the best overall quality for item difficulty and item discrimination, followed by reading comprehension and character recognition. It is still unclear whether the consistency between the results of the item analyses of the three measures and the final results was coincidental or causal, yet this finding cannot be ignored; more studies are needed to explore this issue. Our results further confirm the importance of carrying out item analysis of proficiency measures for research purposes (Tremblay, 2011a).
From the perspective of SLA, our results could be explained by the input-intake hypothesis. It is generally believed that input refers to physical stimuli, such as acoustic–phonetic events and graphic objects, that learners encounter in meaningful contexts and that intake is the mental representation or comprehended version of the physical stimuli (S. Carroll, 2001; Chaudron, 1985; Gass, 1997; Krashen, 1981, 1982). First, the degree to which the nature of the input of the materials of the L2 Chinese proficiency measures is consistent with that of the PA test differs. The materials of the HSK test include both listening and reading input, yet the reading comprehension test and the Chinese character recognition test include only visual input, while years of instruction in Chinese cannot directly indicate anything about the nature and amount of input. However, successful performance on the PA task mainly depends on participants’ intake from auditory input. Although the orthographic representations of the acoustic–phonetic stimuli might be unconsciously activated (Oakhill & Kyle, 2000), in the case of the Chinese PA test, the activation of Pinyin (the Latin alphabet used to transcribe Chinese characters) might be stronger than that of Chinese characters. This is because the phonological structure could be manifested in Pinyin. For instance, the Pinyin form of the Chinese character 三 (three) is sān/san/, which represents its onset (/s/), rime (/an/), and tone (the diacritic above
More importantly, another reason could relate to the role of the input in L2 acquisition. Listening skill is the most essential and primary impetus for the initiation of first- and second-language learning and is the critical medium to maintain communication and the development of language learning (Bozorgian, 2012; Feyten, 1991; Oxford, 1993). Listening accounts for approximately 50% of the total time devoted to communication, a share that might reach 90% in high school and college environments (Gilbert, 2005), and it is the primary channel for accumulating knowledge and interpreting experience (Linebarger, 2001). In the case of L2 learning, according to Krashen (1981), input skills such as listening and reading play essential and primary roles in the development of language proficiency, and output skills such as speaking and writing emerge on the basis of a sufficient amount of comprehensible language input. In particular, listening is the most fundamental of the four components of language skills and is the basis for the development of language proficiency; the strong relationship between listening abilities and overall language proficiency has been reported in studies by Feyten (1991) in learners of French and Spanish as well as by Bozorgian (2012) in Iranian learners of English. These findings suggest that language proficiency tests should not neglect listening skills. In fact, it has been found that the structure of advanced CSL learners’ language abilities could be best modeled as
Drawing on the overall findings, we are not able to conclude which measure could best reveal CSL proficiency. This is because only four different measures of L2 proficiency were administered in the present study, and three measures focus only on input comprehension skills, indicating that none of them is an optimum test for CSL proficiency. The question of the best measure to assess CSL proficiency relates to theoretical accounts of language proficiency/skills. The components of L2 proficiency are very complex and have been discussed by numerous researchers (Bachman, 1990; J. B. Carroll, 1961; Davies, 2014; Hulstijn, 2012; Lado, 1961; Oller, 1979; Spolsky, 1977). Thus, discussing the best measure to represent CSL proficiency is beyond the aim of the present study.
Based on the results, we can tentatively recommend how each test could be used as an L2 proficiency assessment method in SLA research, particularly in studies involving CSL learners. A comprehensive test is recommended when global L2 proficiency must be measured for research purposes. In this study, listening and reading comprehension were included in the HSK, similar to Tremblay’s (2011a) recommendation to use a cloze test alongside an aural task to measure global L2 proficiency. Because there is a consensus that language proficiency is multidimensional, such a combination can enable researchers to investigate global L2 proficiency better than the use of one measure alone. Administering a full integrative test is often not practical because of constraints such as time limits. However, the current study indicates that a multimode integrative test might be more potent than a single-mode test in evaluating L2 proficiency; hence, a comprehensive test, even a simplified version, is recommended as an essential assessment tool for global L2 proficiency for research purposes.
Despite the importance and power of standardized tests for SLA research, they are not commonly used to estimate L2 proficiency for various reasons (Thomas, 1994, 2006; Tremblay, 2011a; H. Zhang, 2018). In this case, a reading comprehension test could be an alternative. Reading comprehension tests have been widely administered to measure L2 proficiency, and the significant relationship between the two variables has been well documented (Heilenman, 1983; Tremblay, 2011a). The results of the present research provide further support for the reliability of reading comprehension tests as a method of assessing L2 Chinese proficiency (B. Yuan, 1995). For instance, B. Yuan (1995) categorized participants into five groups based on their scores on two cloze tests. The differences in acceptability judgment test scores in three conditions between elementary and advanced groups were significant, with effect sizes (Cohen’s
In the case of CSL research, Chinese character recognition might be used as an L2 Chinese proficiency measure. Character recognition tests have been administered as reliable measures to evaluate CSL learners’ proficiency in placement tests (Wu et al., 2017) and eye-tracking research (Gao, 2017). The present study found that the results of
The present study found that years of instruction was not a reliable L2 proficiency measure, even though it is commonly used as an indicator of L2 proficiency (Hulstijn, 2012; Thomas, 1994, 2006; Tremblay, 2011a; H. Zhang, 2018). First, although the results of
Implications
Regarding the use of L2 proficiency measures for research, we offer three suggestions.
First, details regarding L2 proficiency measures should be reported. Our study found that research findings can differ depending on the L2 proficiency measures used, which underscores the importance of reporting the details of the L2 proficiency measures employed (Hulstijn, 2012; Marsden et al., 2018; Norris & Ortega, 2012). Not reporting how participants’ L2 proficiency was measured or how the researcher justified the use of a specific test might jeopardize the generalizability of the research findings (Thomas, 1994, 2006; H. Zhang, 2018). Therefore, we endorse Hulstijn’s (2012) recommendation to “describe the test’s target group (age, literacy and educational level), skills measured, task(s), and materials in sufficient detail, and report its validity (if known) and its psychometric characteristics (e.g., internal consistency) for the target group” (p. 430). Including a detailed description of L2 proficiency measures in future studies could support the generalizability of the findings and provide more information for meta-analysis research and replication studies (Marsden et al., 2018).
Second, a comprehensive test of L2 proficiency is recommended for research purposes to yield more desirable outcomes. Considering that publication bias favors research with significant results (Norris & Ortega, 2000; Rothstein et al., 2006; Sutton, 2009), it is understandable that researchers seek to obtain findings with statistical significance and large effect sizes. Because L2 proficiency is commonly measured or manipulated in SLA research, comprehensive measures of L2 proficiency, which could yield more significant findings, are recommended for researchers. However, one comprehensive test might not be powerful enough to assess an individual’s L2 proficiency; therefore, if time and available resources allow, measures of L2 proficiency from different perspectives, such as standardized tests, language learning history, and self-evaluation, could be utilized (Sundara et al., 2006; Tremblay, 2011b).
Third, item analysis of L2 proficiency measures is recommended for future research. The quality of L2 proficiency measures is crucial to successfully assessing participants’ abilities in the target language, and item analysis, which involves examining item difficulty and item discrimination, is one way to evaluate that quality. However, this practice has rarely been observed in SLA studies (Tremblay, 2011a). Admittedly, it is not feasible for every SLA researcher to carry out detailed item analysis and to design comprehensive tests because standardized language tests and L2 proficiency tests for research purposes differ substantially. However, we believe that a brief item analysis could benefit the L2 proficiency measures administered in SLA research in terms of meeting both testing and research standards. Such analysis would, in turn, further contribute to the interpretation and generalizability of the results.
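As an illustration of what a brief item analysis might involve, the following minimal sketch computes the two indices named above under classical test theory. It assumes dichotomously scored items (1 = correct, 0 = incorrect), takes item difficulty to be the proportion of correct responses, and takes item discrimination to be the correlation between an item and the total score on the remaining items. The function names and the response matrix are hypothetical and are not drawn from the data of the present study.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length numeric lists."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        # Undefined when an item (or the rest score) has no variance.
        return float("nan")
    return sxy / sqrt(sxx * syy)


def item_analysis(responses):
    """Return a (difficulty, discrimination) pair for each item.

    responses: one row per test taker, each row a list of 0/1 item scores.
    difficulty: proportion of test takers answering the item correctly.
    discrimination: correlation between the item score and the total
    score on all other items (corrected item-total correlation).
    """
    n_items = len(responses[0])
    results = []
    for j in range(n_items):
        item = [row[j] for row in responses]
        rest = [sum(row) - row[j] for row in responses]
        difficulty = sum(item) / len(item)
        discrimination = pearson(item, rest)
        results.append((difficulty, discrimination))
    return results


# Hypothetical toy data: six test takers, four items.
responses = [
    [1, 1, 1, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 0],
    [1, 1, 1, 1],
    [0, 0, 0, 0],
    [1, 1, 1, 0],
]

results = item_analysis(responses)
for index, (difficulty, discrimination) in enumerate(results, start=1):
    print(f"Item {index}: difficulty={difficulty:.2f}, "
          f"discrimination={discrimination:.2f}")
```

Items that nearly all participants answer correctly or incorrectly, or whose discrimination is close to zero or negative, contribute little to separating learners by proficiency and could be flagged for revision. The thresholds used for such decisions vary across testing traditions and are not specified here.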
Attempts should also be made to develop a standard L2 proficiency test for research in specific languages. Although the application of meta-analysis could facilitate the synthesis of research findings (Glass et al., 1981; Schmidt, 1992; Thompson & Higgins, 2002), the use of different measures of L2 proficiency might still be a problem for researchers trying to generalize and accumulate evidence. Thus, efforts should be made to develop standard language tests for research, which would strengthen the generalizability of research findings and support the accumulation of knowledge. We propose several recommendations below.
First, the design of a standard L2 proficiency test for research purposes must be valid, reliable, and practical (Thomas, 1994). Practicality is more critical for L2 proficiency measures for research than it is for standardized tests and placement tests (Tremblay, 2011a). Practicality refers to the ratio between available resources and required resources, such as human resources, material resources, time, and related costs (Bachman & Palmer, 1996). Given the limited time and funding for most studies, it is optimal for researchers to evaluate participants’ L2 proficiency within a short time, generally less than half an hour (Tremblay, 2011a). Therefore, while maintaining reliability and validity, researchers designing L2 proficiency measures for studies should pay more attention to practicality, aiming to evaluate L2 proficiency in less time and at lower cost.
Second, more studies are needed to explore the interface between language tests and language acquisition (Bachman, 1988; Bachman & Cohen, 2002; Gu, 2014; Shohamy, 2000). At the level of practice, standardized tests are not commonly used to estimate L2 proficiency; thus, it is crucial to investigate how other L2 proficiency measures map onto standardized tests. For instance, if the correspondence between participants’ performance on a cloze test and their scores or levels on a standardized test is known, evaluating L2 participants’ proficiency based on a cloze test would be better justified, and the research findings would be more convincing.
Third, an assessment use argument (AUA) or operational models should be developed for L2 proficiency measures for research. An AUA is “essentially a local or operational theory that guides both the design and development of a specific language assessment and the collection of evidence in support of a specific intended use” (Bachman, 2007, p. 70). Researchers studying language testing draw upon a hybrid of theoretical frameworks, and they have defined language proficiency from different perspectives (Bachman, 2007), yet some of the frameworks are difficult for SLA researchers to employ. Therefore, an AUA or operational model of language proficiency is essential for SLA researchers to carry out studies.
Limitations
This study has at least three limitations. First, the sample size was small. Thus, the extent to which the results could be generalized to other samples and research questions is still unclear. In addition, the small sample size might have been responsible for the unsatisfactory regression diagnostics. Second, this study used PA as the dependent variable; thus, whether the findings can be applied to research involving other types of metalinguistic awareness is still uncertain. Third, the HSK test used in this study included only listening and reading sections and excluded speaking and writing tests; the simplified HSK is therefore a relatively comprehensive indicator of only the input comprehension skills involved in Chinese proficiency. Thus, studies that rely on a more comprehensive test that includes listening, speaking, reading, and writing skills might yield different results.
Although the present study has several limitations, this research is important for our understanding of how L2 proficiency measures affect research results on the effect of L2 proficiency on the target variable, and it has important implications for the use and design of standard L2 proficiency measures for research. We conclude the present research with the following statement from Tremblay (2011a): “Providing adequate proficiency estimates that meet both testing and research standards can not only enhance the interpretation of experimental results but also decrease the disparity between the proficiency assessment methods used in SLA research and thus facilitate comparisons between studies” (p. 364).
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This paper was supported by the Ministry of Education of China (20YJC740088), the Innovative School Project in Higher Education of Guangdong, China (GWTP-GC-2017-01), and the Social Science Key Research Grant of Universities in Guangdong Province (2018WZDXM005).
