A Systematic Review of Chinese Character Size Tests From 1930 to 2021

Abstract

Chinese character size is the number of characters that a person can recognize and has been well documented as a critical measure of Chinese literacy. A variety of Chinese character size tests have been developed since the 1930s. However, systematic reviews have not yet been conducted on Chinese character size tests. The purpose of this article is to provide a comprehensive review of Chinese character size tests conducted between 1930 and 2021 among native Chinese-speaking children and Chinese language learners. There are three main findings. First, most character size tests were constructed using a frequency-based stratified-sampling method to select target characters, a mixed method focusing on both pronunciation and meaning to test target characters, and a holistic method to score the test-takers’ responses. Second, the majority of tests used Classical Testing Theory (CTT) for checking the item quality, reliability, and validity, and only two tests employed both CTT and Item Response Theory. Third, most tests estimated character sizes using CTT, while only three tests constructed character size norms. It is suggested that future studies address cross-group investigation, determine the most robust construction and estimation methods, develop computer-assisted tests, and apply character size tests to classroom settings.

Keywords

Chinese character recognition, Chinese character size test, Chinese reading, Chinese as a second language

Introduction

The growth of vocabulary is fundamental to the development of literacy skills (Beglar, 2010; Cameron, 2002; Ishii & Schmitt, 2009; Karami, 2012; Nguyen & Nation, 2011; Zhao & Ji, 2018). For the development of Chinese literacy skills, it is crucial to master Chinese characters, which serve as the basis for Chinese words. As opposed to sound-based languages, Chinese is morpho-syllabic or meaning-based (DeFrancis, 1986). For instance, an equivalence of the word computer in Chinese is 电脑 (diànnǎo), which is composed of two Chinese characters: 电 (diàn, electricity) and 脑 (nǎo, brain), each representing a morpheme and syllable. As characters and words are closely associated in Chinese, testing one’s proficiency in the Chinese language may be assessed by testing one’s ability to recognize characters, or the number of characters one knows (Zhang, 2018).

Researchers have attempted to develop Chinese character size tests for assessing the number of characters acquired by native and non-native speakers of Chinese. Some tests were developed to differentiate between children with normal cognitive abilities and those with reading difficulties (Gai & Yang, 2006; Leung et al., 2008; Peng et al., 2017). Several tests have been developed to examine the developmental path of character recognition skills in Chinese children (e.g., Ai, 1949; Hung et al., 2008; Wang & Tao, 1996; Wen et al., 2015; Wu, 2012; Zhang, 1931) as well as learners of Chinese as a second language (CSL) (Li, 2003; Tseng et al., 2016; Zhang et al., 2021). In other studies, Chinese character tests were developed to assess the role of orthographic awareness in character recognition (Li et al., 2012; Ruan et al., 2018; Song et al., 2015; Su & Kim, 2014; Yang et al., 2019; Zhang, 2017; Zhang & Roberts, 2019).

Although a number of Chinese character size tests have been developed since the 1930s, it remains unclear how these tests differ in terms of their methods for constructing the tests, checking their reliability and validity, and estimating character size. For instance, some tests used more than 200 target characters (Wang & Tao, 1996), while others included only 31 characters (Hung et al., 2008). Thus, an analysis of these tests will enable us to better understand the differences and similarities among them. Additionally, researchers may be able to improve their test development by referring to the existing tests’ achievements and limitations. Lastly, such a review could help teachers select an appropriate character size test and tailor their Chinese character instruction accordingly. Taken together, the present review aims to address the following questions:

RQ1: What methods have been used to construct the character size tests in terms of selecting, testing and scoring the target characters?

RQ2: What methods have been used to check the item quality, reliability and validity of the character size tests?

RQ3: What methods have been used to estimate character size?

Method

To minimize publication bias and provide a comprehensive picture of current research, the relevant literature was searched exhaustively. The steps are outlined in Table 1. Considering that studies on the development of Chinese character size tests were mainly published in Chinese, three major Chinese databases were used to search for relevant studies, namely CNKI from mainland China (www.cnki.net), Airiti Library from Taiwan (http://www.airitilibrary.cn/), and the Hong Kong Education Bibliographic Database (https://bibliography.lib.eduhk.hk/). Search terms included 识字/識字 (Chinese character recognition), 识字量表/識字量表 (Chinese character size scale), and 识字测试/識字測試 (Chinese character recognition test) in both simplified and traditional Chinese characters. The literature was retrieved from these websites on 15 August 2021. References in these studies were also consulted to ensure a complete selection of research.

Table 1.

A Brief Summary of the Literature Search and Inclusion Criteria.

Literature search	Electronic databases: CNKI in mainland China (2,887 journal articles, 444 theses, and 1,973 conference papers); Airiti Library in Taiwan (249 journal articles and 77 theses); Hong Kong Education Bibliographic Database (17 conference papers, 15 journal articles, 10 book chapters, 7 theses, 2 books, and 1 report). Reference lists: 4 (2 book chapters, 1 journal article, and 1 book).
Inclusion criteria	Studies focusing on the methods to construct and to check the reliability and validity of character size tests. Target population focusing on children with normal cognitive abilities or CSL learners. Full information about the character size test could be accessed.
Included	Research included in the present study (n = 11)

This study adopted the following inclusion criteria in order to accommodate the focus of the review. First, the study should aim to construct a character size test for a specific population. Second, the character size test should primarily target Chinese-speaking children with normal cognitive abilities or CSL learners. Third, the information concerning the construction, reliability, and validity of the character size test should be fully accessible.

Based on the inclusion criteria listed above, 11 studies were selected (Table 2), including eight studies involving native Chinese speakers (one from Hong Kong, two from Taiwan, and five from mainland China) and three studies involving CSL learners (one from Taiwan and two from mainland China). As for publication type, there were nine journal articles, one book chapter, and one doctoral thesis. In terms of publication language, three studies were published in English (two journal articles and one doctoral thesis) and eight studies were published in Chinese.

Table 2.

Summary of the Selected Studies Included for the Present Review.

Researcher	Year	Language	Type	Area	Target population
1. Zhang	1931	Chinese	Journal article	Beijing	Students from kindergartens to universities
2. Ai	1949	Chinese	Book chapter	Beijing, Nanjing	Students from primary, middle and high schools
3. Wang and Tao	1996	Chinese	Journal article	Shanghai	Students from primary school
4. Li	1999	Chinese	Journal article	Beijing, Hong Kong, Singapore	Students from kindergartens and primary schools
5. Hue	2003	English	Journal article	Taiwan	University students
6. Li	2003	Chinese	Journal article	Beijing	Adult CSL learners
7. Hung et al.	2008	Chinese	Journal article	Taiwan	Students from primary and middle schools
8. Wu	2012	English	Doctoral thesis	Hong Kong	Students from primary and middle schools
9. Wen et al.	2015	Chinese	Journal article	Beijing	Students from primary and middle schools
10. Tseng et al.	2016	English	Journal article	Taiwan	Adult CSL learners
11. Zhang et al.	2021	Chinese	Journal article	Mainland China	Adult CSL learners

Results

RQ1 Methods to Select, Test, and Score Target Characters

Table 3 summarizes the methods for selecting and testing the target characters.

Table 3.

Summary of the Construction of Different Chinese Character Size Tests.

Researcher	Year	Participants	Pool source	Pool volume	Number of tested characters	sample/pool ratio	Selection method	Testing method	Testing form
1. Zhang	1931	305 (from kindergartners to university students)	Dictionary for students	13,469	100	0.007	Interval sampling	Reading out pronunciation and explaining meaning	Interview
2. Ai	1949	1,050 5th and 6th graders; 1,382 middle school; 1,183 high school	Dictionary for students	13,469	95 in Test A 98 in Test B	0.007	Adopted from two of Zhang’s tests	Quadruple choice for pronunciation and meaning	Paper-and-pencil
3. Wang and Tao	1996	5,102 children (grade 1–5)	List of commonly used characters	3,500	1st grade–142 2nd grade–173 3rd grade–206 4th grade–194 5th grade–210	1st grade–0.04 2nd grade–0.05 3rd grade–0.059 4th grade–0.055 5th grade–0.06	Neyman optimum allocation based on the accuracy rates in a pilot test	Reading out pronunciation for 1st–2nd graders; Composing word using target character for 3rd–5th graders	Interview for 1st graders; Paper-and-pencil for 2nd–5th graders
4. Li	1999	480 children (aged from 2.5 to 5.5 years)	Curriculum syllabus	2,600	200	0.08	Random selection after categorizing the pool into 13 groups based on learning difficulty	Matching character and picture Matching character and pronunciation Reading out the pronunciation of the displayed characters Using the target characters to say a sentence	Paper-and-pencil
5. Hue	2003	120 university students	Dictionary	12,000 (pool1) 16,796 (pool2)	70 (test 1) 67 (test 2)	0.006 (test 1) 0.004 (test 2)	Random selection after categorizing the pool into 4 levels based on frequency	Write out pronunciation using Zhuyin-Fuhao Writing out meaning by defining the target character, forming a word or phrase with target character	Paper-and-pencil
6. Li	2003	42 CSL learners (24 intermediate and 18 advanced)	CSL proficiency standard	2,900	60	0.021	15 randomly selected from each of the four levels	Writing out pronunciation Composing words using target characters Writing characters according to Pinyin	Paper-and-pencil
7. Hung et al.	2008	962 children (grade 1–9)	List of commonly used characters	5,021	31 for 1st–2nd graders 40 for 3rd–9th graders	0.006–0.008	Random selection after categorizing the pool into 17 groups based on frequency	Writing out pronunciation Composing word using target characters	Paper-and-pencil
8. Wu	2012	2,056 children (grade 1–6)	Textbooks and list of commonly used characters	3,033	150	0.049	Systematic selection based on the accuracy rates in the pilot test	Reading out the pronunciation of the displayed characters	Interview
9. Wen et al.	2015	1,338 children (grade 1–8)	Curriculum syllabus and list of commonly used characters	3,800	36 for 1st–2nd graders 45 for 3rd–8th graders	0.009–0.012	Random selection after categorizing the pool into 13 groups based on frequency	Writing out pronunciation Composing word using target characters	Paper-and-pencil
10. Tseng et al.	2016	106 adult CSL learners (50 males, 56 females)	List of commonly used characters	2,400	120	0.05	Random selection after categorizing the pool into 4 groups based on frequency	Writing out pronunciation using Pinyin or Zhuyin-Fuhao	Web-based test
11. Zhang et al.	2021	318 CSL learners (152 HSK4, 115 HSK5, 51 HSK6)	CSL proficiency standard	3,000	100	0.033	Systematic selection from each of the ten levels based on frequency and orthographic features	Writing out pronunciation Composing word using target characters/translating to L1	Paper-and-pencil

Target characters selection

Target character selection involves selecting both the character pool and the sample of target characters.

Character pool selection

As shown in Table 3, the character pools have been constructed according to two different approaches. One approach referred to a single source as the character pool, such as a dictionary, a list of commonly used characters, or a set of character lists in a curriculum syllabus or CSL proficiency standards. Another approach was to combine different sources of information, such as a list of commonly used characters, textbooks, and curriculum syllabuses. According to Table 4, the character pool is dominated by commonly used characters.

Table 4.

Summary of the Character Pool Source.

Character pool source	Research
Dictionary	Ai (1949), Hue (2003), Zhang (1931)
List of commonly used characters	Hung et al. (2008), Tseng et al. (2016), Wang and Tao (1996)
Character list in curriculum syllabus	Li (1999)
Character list in CSL proficiency standards	Li (2003), Zhang et al. (2021)
List of commonly used characters and characters in textbooks or curriculum syllabus	Wen et al. (2015), Wu (2012)

In terms of the number of characters in the character pools, the mean was 6,832.33 (SD = 5,162.31). Additionally, the pool volume varied with the source and target population. The average number of characters in the pools derived from dictionary and non-dictionary sources was 13,933.51 (SD = 1,758.11) and 3,281.75 (SD = 780.26), respectively. For character size tests targeting Chinese-speaking children based on non-dictionary sources, Taiwan had the largest character pool (Hung et al., 2008, n = 5,021), followed by mainland China (Wang & Tao, 1996, n = 3,500; Wen et al., 2015, n = 3,800), and Hong Kong had the smallest pool (Wu, 2012, n = 3,033). Conversely, for CSL learners, the character pools in mainland China (Li, 2003, n = 2,900; Zhang et al., 2021, n = 3,000) were larger than that in Taiwan (Tseng et al., 2016, n = 2,400). Finally, the average character pool volume for native Chinese-speaking children (M = 3,550.8, SD = 873.91) appeared to be larger than that for CSL learners (M = 2,766.67, SD = 262.47).

Target characters sampling

Two methods were commonly used to sample target characters: spaced sampling and stratified sampling. The spaced sampling method was used in two early studies (i.e., Ai, 1949; Zhang, 1931). A spaced sampling method involves selecting specific characters from a specific dictionary, such as the first character on every 10th page, to form a pool of target characters. Stratified sampling was used in the other nine studies. This type of sampling method involves dividing the pool into multiple levels based on one or more criteria and then randomly selecting a suitable number of target items from each level. As shown in Table 3, five studies used character frequency as the sole criterion (Hue, 2003; Hung et al., 2008; Li, 2003; Tseng et al., 2016; Wen et al., 2015), while four studies selected the target characters based on both character frequency and other factors, such as difficulty (Li, 1999; Wang & Tao, 1996; Wu, 2012) and the orthographic features (e.g., structure, orthographic regularity) of characters (Zhang et al., 2021).
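The two sampling procedures described above can be illustrated with a minimal Python sketch. It does not reproduce the procedure of any reviewed study: the pool is a hypothetical list of integers standing in for characters ranked by frequency, the strata are assumed to be of equal width, and the same number of characters is assumed to be drawn from each stratum. The sizes (a pool of 2,400 characters, four strata, 120 target characters) mirror those reported for Tseng et al. (2016) in Table 3, whereas the function names `spaced_sample` and `stratified_sample` are introduced here for illustration only.

```python
import random


def spaced_sample(pool, step):
    """Spaced sampling: take every `step`-th entry of the pool."""
    return pool[::step]


def stratified_sample(ranked_pool, n_strata, per_stratum, seed=0):
    """Frequency-based stratified sampling (illustrative assumptions only).

    ranked_pool is assumed to be sorted from most to least frequent.
    It is cut into n_strata contiguous bands of (nearly) equal width,
    and per_stratum items are drawn at random from each band.
    """
    rng = random.Random(seed)
    n = len(ranked_pool)
    sample = []
    for i in range(n_strata):
        start = i * n // n_strata
        end = (i + 1) * n // n_strata
        stratum = ranked_pool[start:end]
        sample.extend(rng.sample(stratum, per_stratum))
    return sample


# Hypothetical pool: integers stand in for frequency-ranked characters.
pool = list(range(2400))
targets = stratified_sample(pool, n_strata=4, per_stratum=30)
spaced = spaced_sample(pool, step=24)
print(len(targets), len(spaced))
```

Unlike spaced sampling, the stratified version guarantees that each frequency band contributes the same number of target characters, which is the property the reviewed studies relied on to improve representativeness.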

The average number of target characters in previous character size tests was 112.47 (SD = 59.96), with a range from 31 characters (Hung et al., 2008) to 210 characters (Wang & Tao, 1996). The average ratio of target characters to the pool was 0.031 (SD = 0.024), ranging from 0.004 (Hue, 2003) to 0.08 (Li, 1999). In addition, character size tests for native Chinese-speaking children (M = 116.06, SD = 63.80) had more target characters than those for CSL learners (M = 93.33, SD = 24.94).

Target characters testing

Methods of testing for target characters can be examined from two perspectives: the measured content and the subjective versus objective nature of the measures.

Regarding the measured content, three typical methods are available (Zhang et al., 2022): semantic methods (n = 1), phonological methods (n = 3), and mixed methods (n = 8). Specifically, Chinese character size tests utilizing semantic methods required the participants to form words and phrases by using target characters (Wang & Tao, 1996). Tests adopting phonological methods asked the participants to read aloud or write down the pronunciations of target characters (e.g., Tseng et al., 2016; Wang & Tao, 1996; Wu, 2012). Tests that employed mixed methods measured the participants’ Chinese character size by their knowledge of pronunciations and meanings of target characters (Ai, 1949; Hue, 2003; Hung et al., 2008; Li, 1999; Wen et al., 2015; Zhang, 1931; Zhang et al., 2021), or their knowledge of pronunciations, meanings, and orthography of target characters (Li, 2003). Ten studies used only one testing method, whereas Wang and Tao (1996) used different methods for children of different ages: phonological methods for first and second graders, and semantic methods for third to fifth graders.

As for the perspective of subjective versus objective measurement, subjective tasks required participants to produce the pronunciations and meanings of target characters using a paper-and-pencil method (Hue, 2003; Hung et al., 2008; Li, 1999; Li, 2003; Tseng et al., 2016; Wang & Tao, 1996; Wen et al., 2015) or interview form (Li, 1999; Wang & Tao, 1996; Wen et al., 2015; Zhang, 1931). In contrast, the objective tasks asked the participants to select the correct answer from a set of items (Ai, 1949; Li, 1999). Although the majority of studies used either subjective or objective measures, the tests designed by Li (1999) included both subjective and objective measures.

Target characters scoring

In terms of character pronunciation scoring, it is common for researchers to assess participants’ global performance in syllables in the subjective tasks such as reading aloud or writing down the character pronunciation. However, researchers used different methods to score tonal errors. Two studies used a non-tone method without considering tonal errors (Wen et al., 2015; Zhang et al., 2021), one study used two scoring methods (tone scoring and non-tone scoring) among CSL learners (Tseng et al., 2016), and the other studies did not mention how tonal errors were addressed.
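The difference between tone scoring and non-tone scoring can be made concrete with a short Python sketch. It assumes, purely for illustration, that responses are typed in Pinyin with a trailing tone digit (e.g., `dian4` for 电); this input format and the function name `score_pronunciation` are hypothetical and are not taken from any of the reviewed tests.

```python
def score_pronunciation(response, target, count_tone=True):
    """Score one pronunciation response as 1 (correct) or 0 (incorrect).

    Both strings are assumed to be Pinyin syllables with a trailing tone
    digit, e.g. 'dian4'. With count_tone=False the tone digit is dropped
    before comparison, so tonal errors are not penalized.
    """
    response = response.strip().lower()
    target = target.strip().lower()
    if not count_tone:
        response = response.rstrip("012345")
        target = target.rstrip("012345")
    return int(response == target)


# A response with the right syllable but the wrong tone:
tone_score = score_pronunciation("dian1", "dian4", count_tone=True)
non_tone_score = score_pronunciation("dian1", "dian4", count_tone=False)
print(tone_score, non_tone_score)
```

Under tone scoring this response counts as an error, while under non-tone scoring it counts as correct, so the two methods can yield different totals for the same test-taker.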

As for character meaning scoring, participants’ holistic performance in meaning tasks was generally taken into account, ignoring their orthographic errors in producing Chinese characters. However, two studies (Hung et al., 2008; Wen et al., 2015) considered characters with more than two stroke errors as incorrect. Furthermore, three studies allowed participants to use Pinyin or Zhuyin-Fuhao as part of the meaning task (Hung et al., 2008; Wen et al., 2015; Zhang et al., 2021). Pinyin and Zhuyin-Fuhao are phonetic systems for representing the pronunciation of Chinese characters, and they are commonly used in mainland China and Taiwan region, respectively.

RQ2 Methods to Check the Reliability and Validity of Character Size Tests

As seen in Table 5, three studies did not mention how the reliability and validity of character size tests were checked (i.e., Hue, 2003; Li, 2003; Zhang, 1931). The other eight studies employed Classical Testing Theory (CTT), two of which also utilized Item Response Theory (IRT) (Wu, 2012; Zhang et al., 2021).

Table 5.

Summary of the Reliability and Validity Analysis in Different Chinese Character Size Tests.

Researcher	Year	Testing theory	Item analysis	Reliability	Validity
1. Zhang	1931	NA	NA	NA	NA
2. Ai	1949	NA	NA	Parallel-forms reliability	Criterion validity
3. Wang & Tao	1996	CTT	NA	Parallel-forms reliability	Criterion validity
4. Li	1999	CTT	NA	Internal reliability, test-retest reliability	Discriminant validity, convergent validity
5. Hue	2003	NA	NA	NA	NA
6. Li	2003	NA	NA	NA	NA
7. Hung et al.	2008	CTT	Difficulty	Cronbach alpha, split-half reliability	Construct validity, criterion validity
8. Wu	2012	CTT, IRT	Difficulty, discrimination	Cronbach alpha, split-half reliability, parallel-forms reliability	Empirical validity, criterion validity
9. Wen et al.	2015	CTT	Difficulty, discrimination	Cronbach alpha, split-half reliability	Content validity, construct validity, empirical validity
10. Tseng et al.	2016	CTT	NA	Test-retest reliability, Cronbach alpha, split-half reliability	Criterion validity
11. Zhang et al.	2021	CTT, IRT	Difficulty, discrimination	Cronbach alpha	CTT: Criterion validity, construct validity, empirical validity IRT: Content aspect, substantive aspect, structural aspect, responsiveness, generalizability

Note. NA means relevant information was not reported. CTT = Classical Testing Theory; IRT = Item Response Theory.

In terms of item analysis, four studies reported statistics about item difficulty (Hung et al., 2008; Wen et al., 2015; Wu, 2012; Zhang et al., 2021) and three studies reported statistics about item discrimination (Wen et al., 2015; Wu, 2012; Zhang et al., 2021).
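As a reminder of what these two item statistics capture under CTT, the following Python sketch computes item difficulty as the proportion of correct responses and item discrimination as the difference in proportion correct between high- and low-scoring groups. The upper-lower index and the 27% grouping fraction are common textbook choices assumed here for illustration; the reviewed studies may have used other indices (e.g., item-total correlations). The response matrix is hypothetical toy data.

```python
def item_difficulty(responses):
    """Proportion of test-takers answering each item correctly.

    responses: list of per-person lists of 0/1 item scores.
    """
    n_people = len(responses)
    n_items = len(responses[0])
    return [sum(p[j] for p in responses) / n_people for j in range(n_items)]


def item_discrimination(responses, fraction=0.27):
    """Upper-lower discrimination index per item (assumed index).

    People are ranked by total score; the index is the proportion correct
    in the top group minus the proportion correct in the bottom group.
    """
    ranked = sorted(responses, key=sum, reverse=True)
    k = max(1, round(len(ranked) * fraction))
    upper, lower = ranked[:k], ranked[-k:]
    n_items = len(responses[0])
    return [
        (sum(p[j] for p in upper) - sum(p[j] for p in lower)) / k
        for j in range(n_items)
    ]


# Hypothetical data: 4 test-takers, 3 target characters.
toy = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
difficulty = item_difficulty(toy)
discrimination = item_discrimination(toy, fraction=0.5)
print(difficulty, discrimination)
```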

In terms of reliability, six studies reported internal reliability such as Cronbach alpha and split-half coefficient (Hung et al., 2008; Li, 1999; Tseng et al., 2016; Wen et al., 2015; Wu, 2012; Zhang et al., 2021), three studies reported parallel-forms reliability (Ai, 1949; Wang & Tao, 1996; Wu, 2012), and two studies reported test-retest reliability (Li, 1999; Tseng et al., 2016).
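For readers less familiar with internal reliability, Cronbach alpha can be computed from a matrix of item scores with the standard formula alpha = k / (k - 1) × (1 - sum of item variances / variance of total scores), where k is the number of items. The Python sketch below applies this formula to hypothetical toy data; the function name `cronbach_alpha` is introduced here for illustration and does not correspond to the software used in any reviewed study.

```python
from statistics import pvariance


def cronbach_alpha(responses):
    """Cronbach alpha for a list of per-person lists of item scores."""
    n_items = len(responses[0])
    # Variance of each item across test-takers.
    item_vars = [pvariance([p[j] for p in responses]) for j in range(n_items)]
    # Variance of the total scores across test-takers.
    total_var = pvariance([sum(p) for p in responses])
    return n_items / (n_items - 1) * (1 - sum(item_vars) / total_var)


# Hypothetical data: 4 test-takers, 3 target characters scored 0/1.
toy_scores = [
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 0],
]
alpha = cronbach_alpha(toy_scores)
print(round(alpha, 2))
```

Because the same variance estimator is used in the numerator and the denominator, choosing population or sample variance does not change the resulting coefficient.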

In terms of validity, six studies reported criterion validity (Ai, 1949; Hung et al., 2008; Tseng et al., 2016; Wang & Tao, 1996; Wu, 2012; Zhang et al., 2021), three studies reported construct validity (Hung et al., 2008; Wen et al., 2015; Zhang et al., 2021), three studies reported empirical validity (Wen et al., 2015; Wu, 2012; Zhang et al., 2021), two studies reported content validity (Wen et al., 2015; Zhang et al., 2021), one study reported discriminant validity and convergent validity (Li, 1999), and one study reported IRT-based validity such as responsiveness and generalizability (Zhang et al., 2021).

RQ3 Methods to Estimate Character Size

As shown in Table 6, seven studies used CTT-based methods to estimate the participants’ character sizes. The participants’ accuracy rate of the test was first calculated, and then the character size was estimated using certain mathematical equations, such as the accuracy rate × the number of characters in the pool (Ai, 1949; Hue, 2003; Hung et al., 2008; Zhang, 1931; Zhang et al., 2021), or by assigning weighted scores to the target characters (Li, 2003; Wang & Tao, 1996). Four studies did not report the participants’ character sizes (Li, 1999; Tseng et al., 2016; Wen et al., 2015; Wu, 2012). Furthermore, three studies constructed character size norms for children in mainland China (Ai, 1949), Hong Kong (Wu, 2012), and Taiwan (Hung et al., 2008).
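The simple CTT-based estimate described above (accuracy rate × number of characters in the pool) amounts to a one-line calculation, sketched below in Python. The second function shows one possible weighted variant in which each frequency level contributes its own accuracy rate multiplied by the number of pool characters at that level; this is an assumed illustration of weighting in general, not the specific formula of Li (2003) or Wang and Tao (1996). All scores and level sizes in the example are hypothetical.

```python
def estimate_size(n_correct, n_tested, pool_size):
    """Character size = accuracy rate x number of characters in the pool."""
    return n_correct * pool_size / n_tested


def estimate_size_by_level(correct, tested, level_sizes):
    """Weighted variant (assumed): sum the per-level estimates.

    correct[i] and tested[i] give the items answered correctly and the
    items tested at level i; level_sizes[i] is the number of pool
    characters at that level.
    """
    return sum(c * s / t for c, t, s in zip(correct, tested, level_sizes))


# Hypothetical test-taker: 34 of 100 items correct, pool of 3,000 characters.
simple = estimate_size(34, 100, 3000)

# Hypothetical four-level test with 15 items per level and a 2,900-character pool.
weighted = estimate_size_by_level(
    correct=[14, 10, 6, 4],
    tested=[15, 15, 15, 15],
    level_sizes=[800, 800, 700, 600],
)
print(simple, round(weighted))
```

The weighted variant matters when levels differ in size or when accuracy drops sharply across levels, since a single overall accuracy rate would then treat every tested character as representing the same number of pool characters.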

Table 6.

Summary of the Character Sizes in Different Grades/Levels in Previous Studies.

Participants	Research
Participants	Zhang (1931)	Ai (1949)			Wang and Tao (1996)	Hung et al. (2008)	Wu (2012)			Wen et al. (2015)
Chinese children	University 8,249–8,762 High and middle schools 7,547–8,289 Primary school 3,402–6,210		Test A	Test B	Grade 1: 618	Grade 1: 712.37		Version 1	Version 2		IRT	CTT
		Grade 5-autumn	2,245	1,975	Grade 2: 1,341	Grade 2: 1,248.57	Grade 1	37.35 (1.67)	38.86 (1.96)	Grade 3	2,844	2,891
		Grade 5-spring	2,684	2,298	Grade 3: 2,063	Grade 3: 2,108.04	Grade 2	52.25 (19.96)	53.20 (20.97)	Grade 4	2,946	3,011
		Grade 6-autumn	3,237	2,739	Grade 4: 2,571	Grade 4: 2,660.52	Grade 3	69.86 (20.45)	70.27 (21.14)	Grade 5	3,129	3,260
		Grade 6-spring	3,640	3,151	Grade 5: 2,815	Grade 5: 3,142.08	Grade 4	82.72 (22.70)	82.86 (23.42)	Grade 6	3,197	3,387
		Grade 7-spring	4,509	4,014	Grade 6: 2,834	Grade 6: 3,340.02	Grade 5	99.49 (17.37)	99.88 (19.29)	Grade 7	3,206	3,421
		Grade 8-spring	5,161	4,612		Grade 7: 3,547.97	Grade 6	103.72 (21.03)	104.05 (21.63)	Grade 8	3,227	3,454
		Grade 9-spring	5,424	4,803		Grade 8: 3,521.06
		Grade 10-spring	5,801	5,145		Grade 9: 3,747.34
		Grade 11-spring	5,970	5,293
		Grade 12-spring	6,187	5,546
	Li (2003)	Zhang et al. (2021)
CSL learners	Intermediate level: 1,000; Advanced level: 1,616	HSK 4: 1,020
		HSK 5: 1,470
		HSK 6: 1,980

Note. Wu (2012) only reported raw scores and the maximum score of the test was 150.

Discussion

As shown above, Chinese character size tests have been developed over a long period of time, which demonstrates their significance for Chinese language learning. This section discusses how previous studies constructed character size tests, checked their reliability and validity, and estimated character size.

Construction of Character Size Tests

Target characters selection

As for the sources of character pools, dictionaries were primarily used in early studies (Ai, 1949; Zhang, 1931). However, dictionaries were later replaced by the lists of frequent characters due to the following two main reasons. First, different dictionaries have different numbers of characters, which range from thousands to millions, and are generally far more than any individual knows. Thus, using a dictionary as the character pool might overestimate the test-takers’ character size. Second, the use of dictionaries as character pools generally involves spaced sampling in selecting target characters, yet the main disadvantage of spaced sampling is that character frequency is ignored, resulting in an inaccurate representation of the entire dictionary. For instance, a highly frequent character (e.g., 打, dǎ, to hit) normally has multiple meanings and can form dozens of words in a dictionary, and therefore occupies greater space than a character with a low frequency (e.g., 妲, dá, a surname). Thus, characters with a high frequency are more likely to be selected as the test items than those with a low frequency, and such tests may also distort the estimated character sizes of the participants.

Although the lists of frequent characters have been the mainstream source for building character pools for different groups of test-takers from mainland China, Hong Kong, and Taiwan, they were used differently for Chinese-speaking children and CSL learners. For Chinese-speaking children, the lists of frequent characters often serve as the single source of a character pool for measuring character size (Wen et al., 2015; Wu, 2012). In contrast, the lists of frequent characters are only partly used as the source of a character pool for CSL learners (Li, 2003; Tseng et al., 2016; Zhang et al., 2021). This difference is mainly due to two reasons. One reason is that CSL learners at different colleges/universities are exposed to a variety of textbooks (Li, 2002; Liang, 2020). Therefore, using part of the lists of frequent characters may better capture CSL learners’ character size. Another reason is that CSL learners tend to have a smaller character size than native Chinese-speaking children (Zhang et al., 2021). Thus, using the full lists of frequent characters may weaken the accuracy and representativeness of the character pools for CSL learners.

Although the character pools for Chinese-speaking children and CSL learners differ in the total number of characters, the use of frequently used characters as the main source of character pools makes the estimated character sizes of the two groups, particularly those from the same region, comparable. However, it should be noted that the lists of frequent characters are derived from native speakers’ data, whereas CSL learners acquire Chinese characters primarily in classroom settings. Therefore, it is debatable whether the lists of frequent characters should be used directly to build character pools for CSL learners. Future research should examine whether L1-based lists of frequent characters are appropriate for measuring CSL learners’ character sizes, particularly for learners with low levels of Chinese proficiency.

The number of characters in the character pools varies across different areas and different populations. The wide variation in character pools targeting children in mainland China, Hong Kong, and Taiwan relates mainly to local language policies. Although both Hong Kong and Taiwan use traditional Chinese characters, Hong Kong promotes a “biliterate” (Chinese and English) and “trilingual” society (Bolton, 2011), and thus less attention is paid to Chinese literacy education there than in monolingual Taiwan. Although mainland China is also monolingual, the number of simplified characters commonly used in mainland China is smaller than the number of traditional characters commonly used in Taiwan (Du, 2018). With respect to differences in character pool volumes for CSL learners between mainland China (Li, 2003; Zhang et al., 2021) and Taiwan (Tseng et al., 2016), the main explanation concerns the participants’ Chinese language proficiency. The participants in mainland China were intermediate and advanced learners, whereas their counterparts in Taiwan were beginning and intermediate learners. To conclude, the use of different character pools for different groups of test-takers indicates the importance of ensuring that character size tests are appropriate for the target participants. It remains unclear, however, whether these different character pools produce comparable estimates of character sizes.

Target character selection is primarily conducted using stratified sampling, which divides the pool into different levels based on character frequency. While frequency-based stratified sampling can enhance the representativeness of sampled characters to a great extent over spaced sampling, it still has some limitations because character recognition can be influenced by a series of factors including frequency, orthographic, phonological, and semantic information (Hao, 2018; Tong & McBride-Chang, 2010; Tong et al., 2017; Wang et al., 2003). The limitation of frequency-based stratified sampling in selecting target items also exists for research on vocabulary size tests (de Groot, 2006; Hashimoto, 2021). The limitations of frequency-based stratified sampling imply that character frequency along with other factors should be taken into account, such as learners’ accuracy in pilot studies, orthographic, phonological, and semantic information of the characters (Wang & Tao, 1996; Wu, 2012; Zhang et al., 2021).

There was also a difference in the number of characters selected in previous tests, indicating a lack of consensus about the optimum number of characters for measuring character size. Wen et al. (2016), for example, explored the selection of the number of target characters using the Generalizability Theory and found that tests containing 20 or 30 characters could reach the .80 threshold of Dependability Coefficient (PHI), demonstrating a high degree of reliability. This finding is similar to Nation’s (2022) recommendation that at least 30 words should be tested to estimate vocabulary size. A recent study, however, found that at least 30 words should be used per 1,000-word frequency band (Gyllstad et al., 2020). These findings imply that it is reasonable to recommend that at least 30 characters be tested to estimate character size, and more characters may be tested if the conditions allow.

Target characters testing

Although Chinese character recognition skill has been found to be unidimensional (Wen et al., 2016), researchers have not yet reached a consensus on its specific components. Completing phonological, semantic, and mixed tasks may depend on different cognitive abilities and different brain areas (Wu et al., 2012), with phonological methods being the easiest and mixed methods being the most challenging (Zhang et al., 2022). In spite of the significant correlations among the three methods of assessing character recognition competence (Everson, 1998; Jiang, 2003; Myers et al., 2007; Perfetti & Tan, 1998; Perfetti & Zhang, 1995; Zhang et al., 2022; Zhou et al., 1999), mixed methods and semantic methods may provide better item quality than phonological methods (Zhang et al., 2022). For example, mixed methods and semantic methods can generate higher item separation values than phonological methods, even though the three methods demonstrate similar item difficulty and item discrimination (Zhang et al., 2022). As a result, mixed methods or semantic methods are recommended for measuring Chinese character recognition.

Both subjective and objective measures have advantages and disadvantages, but subjective measures of character size are recommended for the following reasons. First, although subjective tasks can be time- and energy-consuming to administer and score, their main advantage is that they allow researchers to determine an individual’s real character size without providing cues, thus minimizing the effects of guessing and test-taking strategies that affect objective measures, as has been demonstrated with vocabulary size tests (Gyllstad et al., 2015; McLean, Kramer, & Stewart, 2015). For instance, in vocabulary size testing, test-takers’ scores on multiple-choice tasks might be twice as high as their scores on an L2-L1 translation task (Lemhöfer & Broersma, 2012; Nakata et al., 2020). Second, given that character sizes estimated by phonological, semantic, and mixed methods are comparable to some degree (Zhang et al., 2022), web-based phonological methods that ask test-takers to provide character pronunciation using Pinyin or Zhuyin-Fuhao may be a viable alternative to mixed or semantic methods (Tseng et al., 2016). This is because Pinyin and Zhuyin-Fuhao responses can be scored automatically, which might make such tasks as time-saving and efficient as objective tasks.

Target characters scoring

How to treat tonal errors when scoring character pronunciation has been controversial, since Chinese is a tonal language and different tones indicate different meanings. Tonal errors, however, should be ignored when scoring character pronunciation, for the following reasons. First, in visual reading, retrieving the semantic information of characters is more important than retrieving their tonal information. Second, tone acquisition is the most difficult part for CSL learners compared with initials, finals, and meanings of characters, and CSL learners’ performance in tone acquisition may not improve with their Chinese proficiency (Wu et al., 2006). Third, the non-tone scoring method may enhance the discrimination of the items as well as the validity of character size tests, and may be more sensitive in measuring CSL learners’ proficiency in character recognition (Tseng et al., 2016).

Regarding the scoring of character meaning, it is recommended that the participants’ orthographic errors in character writing be ignored, although some studies have deemed characters with more than two stroke errors incorrect (Hung et al., 2008; Wen et al., 2015). This recommendation is based on two reasons. First, the purpose of a character size test is primarily to determine how many characters an individual knows, so it is reasonable to focus on a participant’s overall performance in the meaning task as long as the orthographic errors do not obscure the meaning the participant intends to convey. Second, orthographic errors only affect the estimated character size in tasks that require participants to write characters. Therefore, ignoring orthographic errors could enhance the comparability of estimated character sizes across different tasks, such as forming words using target characters and translating.

The above-mentioned aspects of the construction of character size tests are closely related to their practicality, item quality, reliability, and validity. Selection of the character pool and sampling method may have an impact on the content validity and item quality of the test. Specifically, the more characters that are tested, the more reliable and valid the test is likely to be. However, adding more items will result in a longer test, which may reduce the practicality of the test to some extent. When it comes to target character testing, the measured content could affect the construct validity of the test, since it reflects the perceptions of researchers about character recognition skills. The choice between a subjective and an objective format could also considerably affect practicality, since the format of the test and the time needed for test preparation, completion, and scoring could differ greatly between these two types of tasks. Additionally, the scoring methods for target characters may have some influence on the practicality of the character size test, because the convenience of scoring answers depends largely on the scoring criteria.

Checking the Reliability and Validity of Character Size Tests

Character size tests have been updated with the development of testing theories, such as the shift from CTT (Li, 1999; Wang & Tao, 1996; Wen et al., 2015) to IRT (Wu, 2012; Zhang et al., 2021). While most of these tests have been validated using CTT, interrater reliability and face validity, which are also important indicators of reliability and validity (Kline, 2005), have not been reported. CTT has benefits in terms of administration and interpretation, but it also has disadvantages (DeVellis, 2006; Jabrayilov et al., 2016; Magno, 2009). For example, CTT-based tests do not adequately address measurement errors and may produce biased results. Additionally, CTT-based results are sample-dependent, making cross-group comparisons difficult.

To address the limitations of CTT, both CTT and IRT should be used to investigate the reliability and validity of character size tests (Wu, 2012; Zhang et al., 2021). This would yield more comprehensive and robust results than those provided by CTT alone. In addition to being sample-independent, IRT can also minimize measurement errors and provide comprehensive information about various aspects of the assessment process, including items, raters, and test takers (Jabrayilov et al., 2016; Kline, 2005; Thomas, 2011). As a result, a combination of CTT and IRT is strongly recommended for a comprehensive analysis of the reliability and validity of character size tests.
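The sample-independence property mentioned above can be illustrated with the one-parameter (Rasch) model, in which the probability of a correct response depends only on the difference between a test-taker’s ability and an item’s difficulty. The sketch below is illustrative only: the Rasch model is assumed for simplicity (the reviewed tests may have used other IRT models), and the function names, ability values, and difficulty values are hypothetical.

```python
import math


def rasch_probability(ability, difficulty):
    """Probability of a correct response under the Rasch model (logit scale)."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


def log_odds(p):
    """Convert a probability to log-odds."""
    return math.log(p / (1.0 - p))


def difficulty_gap(ability, easy_item, hard_item):
    """Log-odds difference between two items for one test-taker."""
    return (log_odds(rasch_probability(ability, easy_item))
            - log_odds(rasch_probability(ability, hard_item)))


# Hypothetical characters with difficulties 0.0 and 1.0 logits,
# answered by test-takers of low, medium, and high ability.
for ability in (-1.0, 0.0, 2.0):
    print(ability, round(difficulty_gap(ability, 0.0, 1.0), 6))
```

Under the model’s assumptions, the gap between the two items is the same for every ability level, so item comparisons do not depend on which group of test-takers was sampled. In CTT, by contrast, item difficulty is expressed as the proportion correct in a particular sample.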

Although grade effect and gender effect have both been commonly used to assess the empirical validity of character size tests, there has been more consensus on grade effect than on gender effect. The grade effect means that Chinese character size generally increases along with the participants’ grades. The grade effect might be universal across different literacy skills, as it has been widely observed in both character size tests (e.g., Ai, 1949; Hung et al., 2008; Wang & Tao, 1996; Wen et al., 2015) and vocabulary size tests (Beglar, 2010; Cameron, 2002; Segbers & Schroeder, 2017; Zhao & Ji, 2018). A similar effect has also been observed in CSL learners (Tseng et al., 2016; Zhang et al., 2021). These consistent findings suggest that grade effect can discriminate between test-takers with different levels of Chinese character recognition skills and detect longitudinal changes in learners’ character size. It can therefore be a reliable means of examining the empirical validity of character size tests targeting Chinese children and CSL learners.

In contrast to the consistent findings about grade effect, findings regarding gender effect in character size tests are inconsistent. The gender effect refers to the hypothesis that female participants outperform their male counterparts in Chinese character size (Hung et al., 2008; Wen et al., 2015). Some studies, however, have not found any significant differences between female and male participants (Li, 1999), or have even found that male students outperformed their female counterparts in less developed areas (Chen & Tan, 2012). These results are in accordance with the mixed findings regarding the role of gender in vocabulary size in alphabetic languages (Agustín Llach & Terrazas, 2012; Fernández-Fontecha, 2014). The main reason may be the interaction between gender and other factors (e.g., gender equality and socio-economic status) in character recognition. For example, male students from low socio-economic regions may outperform their female counterparts in character recognition in China. However, male students may not necessarily perform better than female students in more developed areas of China, where girls have equal access to education. These conflicting findings suggest that the role of gender in the development of linguistic (Eriksson et al., 2012; Frank et al., In Press; Hyde & Linn, 1988; Robinson & Lubienski, 2011) and cognitive abilities (Miller & Halpern, 2014) needs further investigation. They also indicate that gender effect should be used with caution when examining the empirical validity of character size tests.

In conclusion, the use of various testing theories and validation techniques has greatly contributed to the reliability and validity of character size tests for Chinese children and CSL learners. However, the application of IRT in character size tests is still limited, and some advanced testing theories such as the cognitive diagnosis model (Templin & Henson, 2006) have not been utilized. Furthermore, the empirical validity of character size tests may be examined from a variety of perspectives. For example, the frequency effect in character recognition, which has been widely demonstrated, may be used to examine the validity of character size tests empirically (Zhang et al., 2021).

Character Size Estimation

Character size is primarily estimated using CTT. A CTT-based method has the advantage that it is easy to understand and administer, and it can be calculated even by those who have no expertise in language testing. The CTT-based methods may, however, overestimate the participants’ character size. Based on simulated data, Wen et al. (2020) compared the effectiveness of CTT-based and IRT-based approaches for estimating character size. According to the results, the IRT-based approach reduced the risk of overestimating participants’ performance in low-frequency characters and produced a smaller estimated character size than the CTT approach. It remains unclear, however, whether this conclusion can be applied to actual data. The CTT-based approach is likely to remain the mainstream approach for estimating character size because it is easy to compute and explain; however, more research needs to be conducted on IRT-based estimation methods.
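A CTT-based estimate of this kind can be sketched as follows, assuming a frequency-stratified test in which the proportion of sampled characters answered correctly in each band is scaled up to the size of that band. The function name, band sizes, and scores are hypothetical and serve only to illustrate the arithmetic; they do not reproduce the procedure of any specific test reviewed here.

```python
def estimate_character_size(bands):
    """Sum, over frequency bands, band size times proportion correct.

    'bands' is a list of (band_size, n_tested, n_correct) tuples.
    """
    total = 0.0
    for band_size, n_tested, n_correct in bands:
        if n_tested <= 0 or not 0 <= n_correct <= n_tested:
            raise ValueError("need n_tested > 0 and 0 <= n_correct <= n_tested")
        total += band_size * n_correct / n_tested
    return round(total)


# Hypothetical test: four frequency bands, 30 sampled characters per band.
example_bands = [
    (500, 30, 29),   # highest-frequency band
    (1000, 30, 24),
    (1500, 30, 12),
    (2000, 30, 3),   # lowest-frequency band
]
print(estimate_character_size(example_bands))
```

In this hypothetical layout, each correct response in the lowest-frequency band adds about 67 characters (2,000 / 30) to the estimate, which shows how a few lucky or partially known low-frequency items could inflate a CTT-based estimate.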

Character size norms have been established in only a few studies. A norm refers to “a standard or range of values that represents the typical performance of a group or of an individual (of a certain age, for instance) against which comparisons can be made” (Vandenbos, 2015, p. 715). Character size norms have served as a reference for teachers to assess Chinese children’s performance in character recognition and to inform character teaching (Peng et al., 2017). However, they are still limited in the following respects. First, only three character size norms are available for Chinese children, and there are no similar norms in the field of CSL learning. The primary reason may be that a large sample size is necessary for the establishment of a norm. For example, previous norms had an average of 3,226 test-takers (SD = 1,712) (Ai, 1949; Hung et al., 2008; Wang & Tao, 1996), while previous tests targeting CSL learners had much smaller samples, such as the 318 participants in Zhang et al.’s (2021) study. Second, some norms are city-specific and cannot be generalized. For example, the norm by Wang and Tao (1996) was Shanghai-specific; applying this norm to other areas would therefore be inappropriate, given the differences in education and economic development between this megacity and other areas. Third, some character size norms, such as those developed by Wang and Tao (1996) and Hung et al. (2008), need to be updated due to changes in educational development.

The methods used to estimate character size and the establishment of character size norms could also influence the practicality of the test. The purpose of measuring an individual’s character size is to identify his or her position within a particular group and to tailor instruction accordingly. Therefore, the use of an appropriate estimation method and the creation of a reliable norm would facilitate the administration of the test and make results more understandable.

Directions for Future Research

First, future research should conduct more cross-group comparisons of character size between different populations, since such comparative studies are scarce (Li, 1999; Wu et al., 2015) and are important for answering questions about the acquisition of Chinese literacy. For example, do Chinese children, ethnic minority CSL children, and adult CSL learners exhibit similar growth patterns in their acquisition of Chinese characters? Is it possible for adult CSL learners to attain the same character size as native Chinese speakers? Answering these questions may contribute to theoretical discussions, such as the ultimate level of L2 proficiency attained, age effects and the role of learning context (Dąbrowska, 2018; Granena & Long, 2013; Lardiere, 2006), individual differences (Dörnyei, 2005) in L2 learning, the Matthew effect in learning (Perc, 2014; Pfost et al., 2014; Protopapas et al., 2016), and how phonological capabilities affect reading skills (Perfetti, 2003; Ziegler & Goswami, 2005).

Second, future research could examine more robust approaches to testing character size. A number of methods have been employed in developing character size tests, from choosing target characters to estimating character size. However, it remains unclear whether these character size tests are similarly powerful or can yield comparable results. Therefore, more studies are required to compare the relative effectiveness of different methods in determining an individual’s actual character size. Future research may examine whether character size tests for Chinese children are suitable for CSL learners, whether CTT and IRT estimation methods yield different results in non-simulated contexts, and whether results produced by interview and by paper-and-pencil administration are comparable.

Third, future research may also focus on developing computer- or mobile-assisted character size tests. A variety of computer- or mobile-assisted applications have been developed for Chinese character instruction, including apps such as MagiChinese, Skritter Chinese, and TOFU learn, and for language testing, such as computer-adaptive testing (Chalhoub–Deville & Deville, 2003; Young et al., 1996). However, only one web-based character size test has been developed to date (Tseng et al., 2016). A computer- or mobile-assisted test has several advantages, such as the automatic rating of participants’ answers and the ability to record the process of their responses, which could facilitate statistical analysis and enhance our understanding of Chinese character acquisition from a dynamic perspective. Technological advances in speech recognition, artificial intelligence, and deep learning have made such tests easier to develop and administer.

Finally, further research is required regarding the application of character size tests in the classroom. To assist teachers in identifying children who require specific assistance, researchers have investigated the use of character size tests to screen Chinese children with reading deficits (Peng et al., 2017; Wu, 2012). Nevertheless, studies tracking the development of character size are relatively few. In order to explore the longitudinal development of character recognition skills within a specific population, parallel forms of character size tests should be developed and administered. Research in this area can provide empirical evidence relating to the teaching of Chinese characters, particularly when selecting the characters to be used in Chinese language textbooks at various grade levels. Furthermore, exploring the impact of character size on children’s academic performance in other disciplines could assist in the development of the entire curriculum.

Limitations and Conclusion

The present review reveals that most of the existing character size tests were valid and reliable in testing character recognition skills; thus, the estimated character sizes across different areas and different populations were roughly comparable. However, there are some limitations to the present study. First, because of limited space, the present study concentrated only on the character size tests developed for children with normal cognitive development and adult CSL learners. Including tests designed for children with reading disorders would have provided a more comprehensive picture of the development of Chinese character size tests. Second, this study only reviewed how researchers developed and validated Chinese character size tests rather than conducting a meta-analysis, which may have provided more conclusive results.

Although there are several limitations to the present review, it is the first of its kind to our knowledge and provides a significant contribution to future studies on Chinese character size. The present review documents the noteworthy efforts of Chinese researchers to develop reliable and valid character size tests, in keeping with the increasing interest in Chinese character learning and research around the world (Gong et al., 2018; Li, 2020). Additionally, the review discusses future directions to facilitate the development of character size tests for different populations as well as vocabulary size tests in Chinese and other languages.

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This paper was supported by The Ministry of Education of China [20YJC740088] and Peking University [7101303321].

ORCID iD

Peijian Paul Sun

References

Agustín Llach, M. P., & Terrazas (2012). Vocabulary knowledge development and gender differences in a second language. Elia, 12, 45–75.

Ai (1949). Hanzi wenti [Research topics on Chinese characters research]. The Commercial Press.

Beglar (2010). A Rasch-based validation of the vocabulary size test. Language Testing, 27(1), 101–118. https://doi.org/10.1177/0265532209340194

Bolton (2011). Language policy and planning in Hong Kong: Colonial and post-colonial perspectives. Applied Linguistics Review, 2, 51–74. https://doi.org/10.1515/9783110239331.51

Cameron (2002). Measuring vocabulary size in English as an additional language. Language Teaching Research, 6(2), 145–173. https://doi.org/10.1191/1362168802lr103oa

Chalhoub–Deville & Deville (2003). Computer adaptive testing in second language contexts. Annual Review of Applied Linguistics, 19, 273–299. https://doi.org/10.1017/S0267190599190147

Dąbrowska (2018). Experience, aptitude and individual differences in native language ultimate attainment. Cognition, 178, 222–235. https://doi.org/10.1016/j.cognition.2018.05.018

de Groot, A. M. B. (2006). Effects of stimulus characteristics and background music on foreign language vocabulary learning and forgetting. Language Learning, 56(3), 463–506. https://doi.org/10.1111/j.1467-9922.2006.00374.x

DeFrancis (1986). The Chinese language: Fact and fantasy. University of Hawaii Press.

DeVellis, R. F. (2006). Classical test theory. Medical Care, 44(11), S50–S59. http://www.jstor.org/stable/41219505

Dörnyei (2005). The psychology of the language learner: Individual differences in second language acquisition. Lawrence Erlbaum Associates.

(2018). Liangan hanzi guifan biaozhun duibi yanjiu [A comparative study on the Chinese character standards between the mainland and Taiwan]. Yuyan wenzi yingyong, 3, 11–20. https://doi.org/10.16499/j.cnki.1003-5397.2018.03.001

Eriksson, Marschik, P. B., Tulviste, Almgren, Pérez Pereira, Wehberg, Marjanovič-Umek, Gayraud, Kovacevic, & Gallego (2012). Differences between girls and boys in emerging language skills: Evidence from 10 language communities. British Journal of Developmental Psychology, 30(2), 326–343. https://doi.org/10.1111/j.2044-835X.2011.02042.x

Everson, M. E. (1998). Word recognition among learners of Chinese as a foreign language: Investigating the relationship between naming and knowing. The Modern Language Journal, 82(2), 194–204. https://doi.org/10.2307/329208

Fernández-Fontecha (2014). Motivation and gender effect in receptive vocabulary learning: An exploratory analysis in CLIL primary education. Latin American Journal of Content and Language Integrated Learning, 7, 27–49. https://doi.org/10.5294/laclil.2014.7.2.2

Frank, M. C., Braginsky, Marchman, V. A., & Yurovsky (In Press). Variability and consistency in early language learning: The wordbank project. MIT Press.

Gai & Yang (2006). Hanyu yuedu zhangai ertong shizi zhuangkuang ceyan de bianzhi [The compilation of Chinese character learning test for children with Chinese reading disorder]. Zhongguo teshu jiaoyu, 11, 58–63.

Gong, Lyu, & Gao (2018). Research on teaching Chinese as a second or foreign language in and outside mainland China: A bibliometric analysis. The Asia-Pacific Education Researcher, 27(4), 277–289. https://doi.org/10.1007/s40299-018-0385-2

Granena & Long, M. H. (2013). Age of onset, length of residence, language aptitude, and ultimate L2 attainment in three linguistic domains. Second Language Research, 29(3), 311–343. https://doi.org/10.1177/0267658312461497

Gyllstad, McLean, & Stewart (2020). Using confidence intervals to determine adequate item sample sizes for vocabulary tests: An essential but overlooked practice. Language Testing, 38(4), 558–579. https://doi.org/10.1177/0265532220979562

Gyllstad, Vilkaitė-Lozdienė, & Schmitt (2015). Assessing vocabulary size through multiple-choice formats: Issues with guessing and sampling rates. ITL - International Journal of Applied Linguistics, 166(2), 278–306. https://doi.org/10.1075/itl.166.2.04gyl

Hao (2018). Gaoji hanyu shuiping liuxuesheng hanzi rendu yingxiang yinsu yanjiu [Predictors of Chinese character reading: Evidence from proficient L2 learners]. Yuyan jiaoxue yu yanjiu, (5), 1–12.

Hashimoto, B. J. (2021). Is frequency enough?: The frequency model in vocabulary size testing. Language Assessment Quarterly, 18(2), 171–187. https://doi.org/10.1080/15434303.2020.1860058

Hue, C. W. (2003). Number of characters a college student knows. Journal of Chinese Linguistics, 31, 300–339. https://ci.nii.ac.jp/naid/10026248027/en/

Hung, L.-Y., Wang, C.-C., Chang, Y.-W., & Chen, H.-F. (2008). Xuetong shiziliang pinggu ceyan zhi bianzhi baogao [Development of Assessment of Chinese Character Lists for Graders]. Psychological Testing, 55(3), 489–508.

Hyde, J. S., & Linn, M. C. (1988). Gender differences in verbal ability: A meta-analysis. Psychological Bulletin, 104(1), 53–69. https://doi.org/10.1037/0033-2909.104.1.53

Ishii & Schmitt (2009). Developing an integrated diagnostic test of vocabulary size and depth. RELC Journal, 40(1), 5–22. https://doi.org/10.1177/0033688208101452

Jabrayilov, Emons, W. H. M., & Sijtsma (2016). Comparison of classical test theory and item response theory in individual change assessment. Applied Psychological Measurement, 40(8), 559–572. https://doi.org/10.1177/0146621616664046

Jiang (2003). Butong muyu beijing de waiguo xuesheng hanzi zhiyin he zhiyi zhijian guanxi de yanjiu [The relationship between knowing pronunciation and knowing meaning of Chinese characters among CSL learners]. Yuyan jiaoxue yu yanjiu, (6), 51–57.

Karami (2012). The development and validation of a bilingual version of the vocabulary size test. RELC Journal, 43(1), 53–67. https://doi.org/10.1177/0033688212439359

Kline, J. B. T. (2005). Psychological testing: A practical approach to design and evaluation. SAGE Publications, Inc.

Lardiere (2006). Ultimate attainment in second language acquisition: A case study. Routledge.

Lemhöfer & Broersma (2012). Introducing LexTALE: A quick and valid lexical test for advanced learners of English. Behavior Research Methods, 44(2), 325–343. https://doi.org/10.3758/s13428-011-0146-0

Leung, M. T., Cheng-Lai, & Kwan (2008). The Hong Kong graded character naming test for primary school children. Center of Communication Disorder.

(2003). Zhonggaoji liuxuesheng shiziliang chouyang ceshi baogao [Report of a sampled test on the volume of lexical acquisition by intermediate and advanced learners of Chinese as a second language]. Jinan daxue huawen xueyuan xuebao, (2). https://kns.cnki.net/kcms/detail/detail.aspx?doi=10.16131/j.cnki.cn44-1669/g4.2003.02.004

Li (1999). Xueqian ji chuxiao ertong zhongwen shizi liangbiao de bianzhi yu chubu xiaoying jianyan [The creation and validation of preschool and primary Chinese literacy scale]. Xinli fazhan yu jiaoyu, 15(3), 18–24.

Shu, McBride-Chang, Liu, & Peng (2012). Chinese children’s character recognition: Visuo-orthographic, phonological processing and morphological skills. Journal of Research in Reading, 35(3), 287–307. https://doi.org/10.1111/j.1467-9817.2010.01460.x

Li (2020). A systematic review of the research on Chinese character teaching and learning. Frontiers of Education in China, 15(1), 39–72. https://doi.org/10.1007/s11516-020-0003-y

(2002). Jin ershinian duiwai hanyu jiaocai bianxie he yanjiu de jiben qingkuang shuping [A review of the complication and research on CSL textbooks in recent two decades]. Yuyan wenzi yingyong, (3), 100–106.

Liang (2020). Duiwai hanyu jiaocai gongqiu zhuangkuang de diaocha yu fenxi [A survey of the supply-demand situation of CSL textbooks]. Liaoning jiaoyu xingzheng xueyuan xuebao, 37(1), 85–90.

Magno (2009). Demonstrating the difference between classical test theory and item response theory using derived test data. The International Journal of Educational and Psychological Assessment, 1(1), 1–11.

McLean, Kramer, & Stewart (2015). An empirical examination of the effect of guessing on vocabulary size test scores. Vocabulary Learning and Instruction, 4(1), 26–35.

Miller, D. I., & Halpern, D. F. (2014). The new science of cognitive sex differences. Trends in Cognitive Sciences, 18(1), 37–45. https://doi.org/10.1016/j.tics.2013.10.011

Myers, Taft, & Chou (2007). Character recognition without sound or meaning. Journal of Chinese Linguistics, 35(1), 1–57.

Nakata, Tamura, & Scott (2020). Examining the validity of the LexTALE test for Japanese college students. Journal of Asia TEFL, 17(2), 335.

Nation, I. S. P. (2022). Learning vocabulary in another language (3rd ed.). Cambridge University Press. https://doi.org/10.1017/9781009093873

Nguyen, L. T. C., & Nation (2011). A bilingual vocabulary size test of English for Vietnamese learners. RELC Journal, 42(1), 86–99. https://doi.org/10.1177/0033688210390264

Peng, Wang, Tao, & Sun (2017). The deficit profiles of Chinese children with reading difficulties: A meta-analysis. Educational Psychology Review, 29(3), 513–564. https://doi.org/10.1007/s10648-016-9366-2

Perc (2014). The Matthew effect in empirical data. Journal of The Royal Society Interface, 11(98), 20140378. https://doi.org/10.1098/rsif.2014.0378

Perfetti, C. A. (2003). The universal grammar of reading. Scientific Studies of Reading, 7(1), 3–24. https://doi.org/10.1207/S1532799XSSR0701_02

Perfetti, C. A., & Tan, L. H. (1998). The time course of graphic, phonological, and semantic activation in Chinese character identification. Journal of Experimental Psychology, 24(1), 101–117.

Perfetti, C. A., & Zhang (1995). The universal word identification reflex. In D. L. Medin (Ed.), The psychology of learning and motivation: Advances in research and theory (pp. 159–189). Academic Press.

Pfost, Hattie, Dorfler, & Arfelt (2014). Individual differences in reading development: A review of 25 years of empirical research on Matthew effects in reading. Review of Educational Research, 84(2), 203–244. https://doi.org/10.3102/0034654313509492

Protopapas, Parrila, & Simos (2016). In search of Matthew effects in reading. Journal of Learning Disabilities, 49(5), 499–514. https://doi.org/10.1177/0022219414559974

Robinson, J. P., & Lubienski, S. T. (2011). The development of gender achievement gaps in mathematics and reading during elementary and middle school: Examining direct cognitive assessments and teacher ratings. American Educational Research Journal, 48(2), 268–302.

Ruan, Georgiou, G. K., Song, & Shu (2018). Does writing system influence the associations between phonological awareness, morphological awareness, and reading? A meta-analysis. Journal of Educational Psychology, 110(2), 180–202. https://doi.org/10.1037/edu0000216

Segbers & Schroeder (2017). How many words do children know? A corpus-based estimation of children’s total vocabulary size. Language Testing, 34(3), 297–320. https://doi.org/10.1177/0265532216641152

Song, Georgiou, G. K., & Shu (2015). How well do phonological awareness and rapid automatized naming correlate with Chinese reading accuracy and fluency? A meta-analysis. Scientific Studies of Reading, 20(2), 99–123. https://doi.org/10.1080/10888438.2015.1088543

Kim, Y.-S. (2014). Semantic radical knowledge and word recognition in Chinese for Chinese as foreign language learners. Reading in a Foreign Language, 26(1), 131–152.

Templin, J. L., & Henson, R. A. (2006). Measurement of psychological disorders using cognitive diagnosis models. Psychological Methods, 11(3), 287–305. https://doi.org/10.1037/1082-989X.11.3.287

Thomas, M. L. (2011). The value of item response theory in clinical assessment: A review. Assessment, 18(3), 291–307. https://doi.org/10.1177/1073191110374797

Tong & McBride-Chang (2010). Developmental models of learning to read Chinese words. Developmental Psychology, 46(6), 1662–1676. https://doi.org/10.1037/a0020611

Tong & McBride (2017). Radical sensitivity is the key to understanding Chinese character acquisition in children. Reading and Writing, 30(6), 1251–1265. https://doi.org/10.1007/s11145-017-9722-8

Tseng, C.-C., Chang, L.-Y., Chang, Y.-l., & Chen, H.-C. (2016). Web-based Chinese character recognition assessment and its application on distance education of Chinese. Psychological Testing, 63(3), 179–202. http://dx.doi.org/10.6384/CIQ.200110.0143

Vandenbos, G. R. (Ed.). (2015). APA dictionary of psychology (2nd ed.). American Psychological Association.

Wang, Perfetti, C. A., & Liu (2003). Alphabetic readers quickly acquire orthographic structure in learning to read Chinese. Scientific Studies of Reading, 7(2), 183–208. https://doi.org/10.1207/S1532799XSSR0702_4

Wang & Tao (1996). Xiaoxuesheng shiziliang ceshi tiku ji pingjia liangbiao [Hanzi recognition test for primary school students and norm]. Shanghai Education Press.

68.

Wen

Yang

(2020). Jiyu IRT tuiduan shiziliang de fangfa yanjiu [On the IRT-based literacy quantity inference method]. Yuyan wenzi yingyong, (1), 112–120.

69.

Wen

Tang

Liu

(2015). Yiwu jiaoyu jieduan xuesheng shiziliang ceyan de bianzhi yanjiu [Design of a test on quantity of literacy for students in the stage of compulsory education]. Yuyan wenzi yingyong, (3), 88–100.

70.

Wen

Tang

Liu

(2016). Shizi nengli de danweixing jianyan yanjiu [Unidimensional assessment for ability of literacy]. Xinli fazhan yu jiaoyu, 32(1), 73–80.

71. C.-Y., M.-H. R., Chen, S.-H. A. (2012). A meta-analysis of fMRI studies on Chinese orthographic, phonological, and semantic processing. NeuroImage, 63(1), 381–391. https://doi.org/10.1016/j.neuroimage.2012.06.047

72. Gao, Xiao, Zhang (2006). Oumei hanri xuesheng hanzi rendu yu shuxie xide yanjiu [A study of learning reading and writing Chinese characters by CSL learners from Korea, Japan and Western countries]. Yuyan jiaoxue yu yanjiu, (6), 64–71.

73. (2012). Xianggang xiaoxue zhongwen shizi dengji liangbiao de goujian yu yanzheng [The construction and validation of a Hong Kong graded Chinese character identification test (HKGCCIT) for primary school children] [Unpublished doctoral dissertation, The Chinese University of Hong Kong].

74. Jin, Guo (2015). Xianggang xiaoxuesheng hanzi rendu nengli de shizheng yanjiu [The Chinese character identification of primary school students in Hong Kong: An empirical examination]. Yuyan wenzi yingyong, (1), 104–111.

75. Yang, Peng, Meng (2019). How do metalinguistic awareness, working memory, reasoning, and inhibition contribute to Chinese character reading of kindergarten children? Infant and Child Development, 28(3), e2122. https://doi.org/10.1002/icd.2122

76. Young, Shermis, M. D., Brutten, S. R., Perkins (1996). From conventional to computer-adaptive testing of ESL reading comprehension. System, 24(1), 23–40. https://doi.org/10.1016/0346-251X(95)00051-K

77. Zhang (2017). The influence of L1 background and other meta-linguistic and background variables on the learning of Pinyin and Hanzi by Arabic and English learners of Chinese as a second language [Doctoral thesis, University of York]. http://etheses.whiterose.ac.uk/16332/

78. Zhang (2018). Yanjiuyong hanyu shuiping fenji ceshi fangfa dui yanjiu jieguo de yingxiang [The influence of different L2 Chinese proficiency measurements on the results of CSL research]. Yuyan jiaoxue yu yanjiu, (6), 14–23.

79. Zhang, Kim, S.-A., Zhang (2022). A comparative study of three measurement methods of Chinese character recognition for L2 Chinese learners. Frontiers in Psychology, 13, 753913. https://doi.org/10.3389/fpsyg.2022.753913

80. Zhang, Roberts (2019). The role of phonological awareness and phonetic radical awareness in acquiring Chinese literacy skills in learners of Chinese as a second language. System, 81, 163–178. https://doi.org/10.1016/j.system.2019.02.007

81. Zhang, Wang (2021). Liuxuesheng shizi liangbiao bianzhi yanjiu [The creation and validation of a Hanzi recognition size test for learners of Chinese as a second language]. Shijie Hanyu Jiaoxue, 35(1), 126–142.

82. Zhang (1931). Shizi ceyan [Chinese character recognition test]. Xinli zazhi xuancun, (44), 694–715.

83. Zhao, Ji (2018). Validation of the Mandarin version of the vocabulary size test. RELC Journal, 49(3), 308–321. https://doi.org/10.1177/0033688216639761

84. Ziegler, J. C., Goswami (2005). Reading acquisition, developmental dyslexia, and skilled reading across languages: A psycholinguistic grain size theory. Psychological Bulletin, 131(1), 3–29. http://dx.doi.org/10.1037/0033-2909.131.1.3

85. Zhou, Marslen-Wilson, Taft, Shu (1999). Morphology, orthography, and phonology in reading Chinese compound words. Language and Cognitive Processes, 14(5/6), 525–565. https://doi.org/10.1080/016909699386185