Abstract
Each year, all Australian students in grades 3, 5, 7 and 9 sit nationwide large-scale tests in literacy and numeracy whose validity is frequently questioned. We compared the performance of grade 3 twins on these large-scale reading tests with their performance on three individually administered literacy tests of comprehension, word reading and vocabulary within a genetically sensitive design. Comprehension, word reading and vocabulary accounted for a substantial amount of the variance in school reading tests. Performance on large-scale reading tests and individually administered tests was moderately to substantially heritable, and the same genes contributed to performance on both types of test. These results confirm that large-scale school reading tests measure, at least in part, the literacy skills assessed by individual tests that are frequently considered the ‘gold standard’ in testing. Also, as would be expected, the individually administered literacy tests were more closely related to performance on large-scale reading tests than to performance on large-scale school numeracy tests.
Introduction
In this paper, we intend to contribute to the debate about the validity of so-called high-stakes testing in Australian schools by reporting on how achievement in those tests relates to performance on alternative measures of literacy. We have a database of the test results of 250 grade 3 students in New South Wales (NSW) who completed the literacy assessments of the Basic Skills Test (BST) or its successor, the National Assessment Program: Literacy and Numeracy (NAPLAN). In addition to these assessment results, we have results from a number of well-accepted individually administered (IA) tests of literacy skills for each child that can serve as a validity check for the large-scale, group-administered school results. Further, the children are all members of twin pairs, which allows us to estimate the relative influences of genes and aspects of the environment on individual differences in performance on both individual and large-scale assessments. In brief, we address issues of the validity of large-scale tests and their ‘behaviour genetics’ – that is, the relative influence of genes and the environment on variation in performance – as is explained further below.
These twins were part of a large international study on literacy, for which they were administered the labour-intensive individual literacy tests. Although there are sufficient participants from each of the BST and NAPLAN years for phenotypic analyses – that is, analyses of the physically expressed trait – the number of participants within each test type is small for genetic analyses. Hence, for some of the analyses we combined both tests into a single measure of large-scale test performance (see the Method section for further details regarding the combining of these scores). We acknowledge that it would have been more desirable to use data from just one type of large-scale test. However, the availability of these twin-study data represents a unique opportunity to explore the relationship between large-scale reading tests and individual literacy tests within a genetically sensitive design.
In 2008, nationwide assessment in literacy and numeracy began in Australia (Australian Curriculum, Assessment and Reporting Authority [ACARA], 2008). The NAPLAN includes standardised tests in reading, writing, language conventions (spelling, grammar and punctuation) and numeracy. Each year, students in grades 3, 5, 7 and 9 from government and nongovernment schools take these tests. For each achievement domain, performance is measured on a common scale from grade 3 to grade 9; this common scale allows for comparisons between cohorts and provides a measure of growth within a cohort over time. The tests are developed with reference to the National Statements of Learning in English and Mathematics, and state and territory curricula (ACARA, 2010a).
Prior to the NAPLAN, statewide testing in literacy and numeracy was implemented in NSW in 1989 in the form of the BST (Masters et al., 1990). Though initially the BST was administered to grade 6 students, by 1996 students in grades 3 and 5 were tested, and performance at each grade level was measured on a common scale (Wasson, 2009). Many hallmarks of the current NAPLAN were already present in the BST. These included comparing the performance of a school with that of similar schools, as measured by remoteness, socioeconomic status and percentage of Aboriginal and Torres Strait Islander student enrolments, and providing schools with data and software to link performance and test items with the curriculum and teaching strategies (Wasson, 2009). These characteristics of the assessment and feedback process were intended to identify individuals, classes or schools that were performing below a minimum standard and to assist with teaching resources and support (Masters et al., 1990; Smith, 2005; Wasson, 2009).
Even though NSW now has a 25-year history of large-scale assessment in literacy and numeracy, the NAPLAN remains a controversial topic, particularly in relation to the interpretation of results. These tests have relatively few items, and though reliable information can be obtained when the cohort is sufficiently large, results from individuals and small groups, such as classrooms, are easily misinterpreted or overinterpreted (Wu, 2010, 2011). In addition, concerns have been raised about undue pressure on students, teachers and schools, perhaps fuelled by poorly informed responses to the results, which are publicly available online (Athanasou, 2010; Wu, 2011). Criticisms have been made of some test items, and questions have been raised about whether the items actually test the skills intended. For example, the NAPLAN language conventions test requires students to identify and correct spelling, punctuation and tense errors, but Williams (2009) suggests that this does not assess the ability to apply grammatical knowledge in any depth. Willett and Gardiner (2009) analysed students’ spelling errors on both NAPLAN and dictation tests and found that students scored far better on the dictation tests. In NAPLAN, some spelling items present a misspelt word and students are required to write the word correctly. In contrast to the general pattern of scoring better on dictation, Willett and Gardiner found that these constructed errors sometimes facilitated the correct spelling of a word. This occurred when the constructed error was not in the part of the word that students were most likely to misspell.
Similarly, question format can assist in solving numeracy problems. In a critique of the grade 9 numeracy tests in 2008, Norton (2009) noted that 57% of Queensland students correctly answered an algebraic problem in multiple-choice format, almost double the 29% who answered correctly when a similar algebraic problem required a written solution. Continuing to identify and investigate such validity problems and misinterpretations of results is important for reporting performance appropriately and improving future tests. These item-level flaws, assuming that is what they are, raise the question of what these large-scale tests measure. Our research does not address issues of individual items, but it does explore whether the results from the large-scale tests converge with well-accepted, individual forms of assessment.
The National Statements of Learning in English describe the reading skill level of grade 3 students as follows:

“When students read and view texts, they identify the main topic or key information, some directly-stated supporting information, and the order of events. They can draw inferences from directly-stated descriptions and actions (e.g. infer a character’s feelings) and talk about how people, characters and events could have been portrayed differently (e.g. more fairly). They relate their interpretations to their own experiences.” (Curriculum Corporation, 2005, p. 5)
Mastery of this level of reading comprehension requires the execution of a range of literacy skills. A proficient reader must have acquired the alphabetic principle, essentially that words are composed of letters that systematically represent phonemes (Byrne, 1998). The ability of a reader to decode regular words must be expanded to include the acquisition of orthographic rules and recognition of irregular words (Ehri, 2005). Skilled readers will execute these basic skills with accuracy and, eventually, fluency. Beyond this, the text must be related to meaning, a process that relies on higher level language skills such as vocabulary and syntax. Deficiencies at any of these levels will undermine reading comprehension, as captured by the Simple View of reading (Hoover & Gough, 1990). Therefore, the validity of the large-scale literacy tests could be approached by assessing the relationship of performance on the large-scale tests with performance on well-established tests of literacy skills that are administered on an individual basis. As further evidence of validity, we would expect the relationship between literacy skills and school reading to be stronger than the relationship between literacy skills and school numeracy. This is the approach we adopt here.
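As commonly stated, the Simple View expresses this as a single multiplicative relation (following the formulation in Hoover & Gough, 1990):

```latex
% The Simple View of reading (Hoover & Gough, 1990): reading
% comprehension (R) is the product of decoding (D) and linguistic
% comprehension (L), each ranging from 0 to 1, so a severe deficit in
% either component drives comprehension towards zero.
\[
  R = D \times L
\]
```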
The data in the current analyses are part of the International Longitudinal Twin Study (ILTS), a large study that followed twins from preschool to grade 2 in four countries (Australia, Norway, Sweden and the United States). In the year prior to the start of formal schooling, the twins were tested on preliteracy skills and several aspects of language and cognition including phonological awareness, print knowledge, naming fluency, vocabulary, grammar, morphology, verbal learning and memory and nonverbal IQ (Byrne et al., 2002; Samuelsson et al., 2005). The twins were tested in subsequent years on measures including phonological awareness, word and nonword reading, reading comprehension, spelling, vocabulary and orthographic learning (Byrne et al., 2005, 2006, 2007, 2008, 2009). Along with these literacy measures, the BST data were collected from the ILTS students in NSW from 2003 to 2007 followed by the collection of NAPLAN data from 2008 to 2010, when data collection for the project ceased in Australia.
The BST and NAPLAN each include tests of reading, language conventions, writing and numeracy. We have chosen to focus on the reading and numeracy measures.
These performance data span several years, and data for the language conventions and writing tests were not combined across years for a number of reasons. First, from 2003 to 2010, the number of questions in the language conventions tests ranged from 26 to 50, and the number of marks possible in the writing tests ranged from 47 to 70. Second, the marking criteria for the writing tests changed considerably from the BST to the NAPLAN (see ACARA, 2010b; NSW Department of Education, 2006). Hence, it was inappropriate to combine the language conventions and writing tests into measures of large-scale school testing spanning 2003 to 2010. The appropriateness of combining the other measures is explained in the Method section.
One aim of the ILTS was to examine the genetic and environmental influences on the development of literacy from before formal instruction through the early years of school (Byrne et al., 2002). The inclusion of several countries allowed the pattern of genetic and environmental influence on literacy skills to be compared across countries (Byrne et al., 2009; Samuelsson et al., 2007, 2008). However, for the purpose of comparing achievement on large-scale tests with IA tests of literacy, only the Australian students are reported on in this article. Although much information about the relationship of school assessment and other tests of literacy skills can be obtained from a sample of unrelated children, an advantage of analysing data from twins is the ability to estimate how much of the phenotypic relationship would be due to shared genes or shared and unique environments across skills and over time.
The classical twin design compares the relative similarity of identical (monozygotic; MZ) and fraternal (dizygotic; DZ) twins on a trait. The known genetic and environmental relationships between MZ and DZ twins can be used to estimate the amount of variation among individuals that is due to genes (A), the shared or common environment (C) and the unique environment (E). In brief, MZ twins share all their genes and DZ twins share, on average, half of their segregating genes. Within each twin pair raised in the same family, some environmental influences are shared, such as family income, common schools, teachers and friends, and some are unique to each twin, such as individual illnesses and accidents, different teachers (when twins are separated at school) and separate friends. By comparing the similarity of MZ and DZ twins, it is possible to derive quantitative estimates of the influence of genes, the shared environment and the unique environment on traits. The correlation of MZ twins on a trait is due to both their shared genes and their shared environment, while the correlation of DZ twins is due to a genetic influence that is half that of MZ twins and a shared environmental influence equivalent to that of MZ twins. The genetic influence on a trait can therefore be calculated as double the difference between the MZ and DZ correlations (Falconer & Mackay, 1996). The degree to which the DZ correlation exceeds half of the MZ correlation reflects the shared environment, and the unique environment is the difference between the MZ correlation and unity (see Plomin, DeFries, Knopik, & Neiderhiser, 2013, for an introduction to classical twin design methodology).
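To make these calculations concrete, the following is a minimal sketch of the Falconer estimates in Python; the correlations in the example are hypothetical and are not drawn from our data.

```python
# A minimal sketch of Falconer's formulas for the classical twin design.
# The example correlations are hypothetical, for illustration only.

def falconer_ace(r_mz: float, r_dz: float) -> dict:
    """Estimate the proportions of variance due to additive genes (A),
    shared environment (C) and unique environment (E) from MZ and DZ
    twin correlations."""
    a = 2 * (r_mz - r_dz)  # genes: double the MZ-DZ difference
    c = 2 * r_dz - r_mz    # shared environment: DZ similarity beyond A/2
    e = 1 - r_mz           # unique environment (includes measurement error)
    return {"A": a, "C": c, "E": e}

# Hypothetical correlations resembling a highly heritable trait:
print(falconer_ace(r_mz=0.80, r_dz=0.45))
# -> approximately {'A': 0.70, 'C': 0.10, 'E': 0.20}
```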
The genetic analyses from the ILTS have shown that by the time of testing in grade 2, genes explain 67–82% of the variation in performance for most of the literacy measures (Byrne et al., 2009). For the most part, the unique environment (which also includes any error in measurement) explains the remaining variation in performance, with very little influence from sources of shared environment (Byrne et al., 2009). The exception is vocabulary, where genes explain 44% of the variation in performance, with 36% due to the shared environment (Byrne et al., 2009). Given the substantial heritability of these literacy skills that contribute to mastery of reading, a substantial influence of genes on test performance in the school reading tests can be hypothesised.
Using multivariate extensions to the classic twin design, phenotypic correlations can be broken down into how much of the covariation between two traits is due to genes that affect both traits and how much is due to environmental factors (shared and nonshared) that affect both traits. The extent of genetic and environmental overlap is expressed as genetic and environmental correlations. Genetic correlations are independent of univariate heritability; two traits that are each highly heritable can be influenced by different genes and thus would show no genetic correlation. However, when related skills are measured, an overlap in the genetic and environmental factors would be expected. For example, the genetic correlation between word reading and comprehension in the grade 2 ILTS sample is .88, and the genetic correlations of those skills with vocabulary are .36 and .46, respectively (Byrne et al., 2009). These moderate-to-high correlations suggest that much of the genetic influence on literacy is shared among the component skills. That these correlations are lower than 1 also suggests some independence in the genes that influence specific literacy skills (Byrne et al., 2009). In this example, the higher genetic correlation of word reading with comprehension than with vocabulary means that if a specific gene were identified as influencing word reading, that gene would be highly likely also to influence comprehension, but somewhat less likely to influence vocabulary. As such, if the large-scale reading tests require students to employ the same literacy skills as are tested by the IA tests, then we would expect substantial overlapping genetic influences among the literacy measures and school reading tests. The analyses reported in this article test this hypothesis.
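For reference, in the standard bivariate twin model the phenotypic correlation between two traits X and Y decomposes as shown below; the notation is generic rather than specific to our measures.

```latex
% Decomposition of a phenotypic correlation in the bivariate twin model:
% r_A, r_C and r_E are the genetic, shared environmental and unique
% environmental correlations; a, c and e are the square roots of the
% corresponding standardised variance components of each trait
% (e.g. a_X = \sqrt{h^2_X}).
\[
  r_P(X, Y) = r_A\, a_X a_Y + r_C\, c_X c_Y + r_E\, e_X e_Y
\]
```

A high genetic correlation therefore raises the phenotypic correlation only in proportion to the heritabilities of the two traits.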
To summarise, the first purpose of this article is to explore whether the large-scale reading tests of the BST and NAPLAN show convergent validity with well-established IA tests of literacy skills. We approach this first by analysing whether literacy skills as measured by the IA tests are substantial predictors of performance on the large-scale reading tests, and we compare these results with the relationship between the IA literacy test results and results on the large-scale numeracy tests. The second purpose is to examine whether the genes that substantially influence performance on the IA tests of literacy skills also influence performance on the large-scale tests, and whether this genetic overlap is greater for literacy than for numeracy performance.
Method
Participants
Table 1. Means and standard deviations of the individually administered (IA) tests for participants who did and did not return results of large-scale tests.
Note: IA: individually administered.
The mean age (SD) at the time of the IA literacy tests was 95.5 (4.2) months. Although the BST and NAPLAN tests were administered in different months of the year, the ages of participants who sat the BST and those who sat the NAPLAN did not differ significantly (mean age (SD) of 105.5 (3.8) and 104.1 (4.0) months, respectively). Zygosity was determined by DNA analysis of cheek swab collections or, in a minority of cases, by selected items from the Nichols and Bilbro (1966) questionnaire.
Materials
IA test of word reading
The Test of Word Reading Efficiency (TOWRE; Torgesen, Wagner, & Rashotte, 1999) was used to test word reading efficiency and phonemic decoding. A list of words and a list of nonwords constitute two subtests. The score on each subtest is the number of correctly read items in 45 s. Each subtest has two versions, and both versions of both subtests were administered to each child. Scores were standardised (M = 100, SD = 15). The test manual reports test–retest reliability for children aged 6–9 years of .97 for word and .90 for nonword standardised scores. The average from these four lists was used to obtain a single score for word reading.
IA test of vocabulary
The Boston Naming Test (Kaplan, Goodglass, & Weintraub, 2001) was used to assess vocabulary. The test presents 60 pictures of concrete objects ranging from common (bed) to rare (abacus); the score is the number of items correctly named, with a maximum possible score of 60. Internal reliability was .84 in the whole ILTS sample.
IA test of comprehension
The Woodcock Passage Comprehension subtest from the Woodcock Reading Mastery Test–Revised (Woodcock, 1989) was used to assess reading comprehension. The test is a modified cloze procedure: the child reads a short passage with a missing word and supplies a single word that suitably fills the blank. The test includes 43 items and scores were standardised (M = 100, SD = 15). The test manual reports a split-half reliability of .94 for first grade students.
Large-scale reading tests
For both the BST and NAPLAN, students were required to read a range of specially prepared texts and answer questions that asked them to identify or interpret information contained in the texts. Responses were predominantly multiple choice with the occasional short answer. The reported internal reliability of the grade 3 NAPLAN tests in reading from 2008 to 2010 was .88–.89 (ACARA, 2013). For the BST and NAPLAN, the number of questions and maximum raw scores varied from 35 to 38 over the years that these results were obtained. The scaled scores for the NAPLAN were not used, as they are not comparable with the BST scores.
Large-scale numeracy tests
The numeracy tests of the BST and NAPLAN include questions assessing aspects of number, measurement and space. Responses were predominantly multiple choice with the occasional short answer. The internal reliability of the grade 3 NAPLAN tests in numeracy from 2008 to 2010 was reported to be .87–.92 (ACARA, 2013). The number of questions was the same each year, with a maximum raw score of 35.
Procedure
The individual literacy skills tests were administered in the school or the home of the twins in a session that lasted about an hour. Two test administrators assessed each twin pair at the same time, one test administrator per child (Byrne et al., 2009).
The large-scale tests were administered in class in line with prescribed, standardised procedures. The BST tests were sat during the first week of August in grade 3, and the NAPLAN tests were sat during the second week of May in grade 3. The test results for each student were forwarded to the research team by parents of the participants.
Data analysis
Our data comprise eight cohorts of participants, and although the IA literacy tests do not vary from year to year, the large-scale tests do. For those participants who returned large-scale test results, we standardised the scores on both the IA literacy tests and the large-scale tests within each year. This was done to control for any cohort effects in the IA literacy tests and for any cohort effects or inadvertent differences in test difficulty in the large-scale assessments. This creates a cohort-specific, relative score in standard deviation units for each test to be used in the correlational and behaviour–genetic analyses, where it is the relationship between variables for individuals or for twin pairs that is of interest, not the mean scores. Scores were adjusted for age and sex, and outliers were truncated at ±3 SD.
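As an illustration, the following is a hedged sketch of this preprocessing in Python with pandas and statsmodels; the column names (cohort, age, sex and the score column) are hypothetical, and the authors' exact procedure may have differed in detail.

```python
# A sketch of the preprocessing described above: standardise within each
# testing year, adjust for age and sex, then truncate outliers at +/-3 SD.
# Column names are hypothetical.
import pandas as pd
import statsmodels.formula.api as smf

def preprocess(df: pd.DataFrame, score: str) -> pd.Series:
    # Within-cohort z-scores remove cohort effects and any inadvertent
    # differences in test difficulty across years.
    z = df.groupby("cohort")[score].transform(lambda s: (s - s.mean()) / s.std())
    # Regress the z-scores on age and sex and keep the residuals.
    resid = smf.ols("z ~ age + sex", data=df.assign(z=z)).fit().resid
    # Truncate outliers at +/-3 standard deviation units.
    return resid.clip(lower=-3, upper=3)
```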
To account for nonindependence within twin pairs in the phenotypic correlations, Griffin and Gonzalez's (1995) formula was used to adjust the degrees of freedom when evaluating significance. The regression analyses involved multilevel modelling with restricted maximum likelihood to derive the estimates. Multilevel models with dyads have the individual at level 1 and the dyad at level 2 (Kenny, Kashy, & Cook, 2006). With only two units at level 1, the slopes are constrained to be equal across dyads, while the intercepts can vary; the nonindependence within dyads is modelled in the variation of these intercepts, and the variation of the slopes is part of the error variance. This leaves two random effects to be estimated: the dyad covariance (σ²d) and the error variance (σ²e).
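A dyadic multilevel model of this kind can be sketched with statsmodels as below; the variable names (ls_reading, ia_comp, ia_word, ia_vocab, pair_id) are illustrative, and the authors' software and exact specification are not stated in the text.

```python
# A sketch of the dyadic multilevel model: fixed slopes for the IA
# predictors, a random intercept per twin pair to capture the
# nonindependence within dyads, estimated with REML.
import statsmodels.formula.api as smf

model = smf.mixedlm(
    "ls_reading ~ ia_comp + ia_word + ia_vocab",  # level-1 fixed effects
    data=df,           # a DataFrame with one row per individual twin
    groups="pair_id",  # level-2 grouping: the twin pair (dyad)
)
result = model.fit(reml=True)  # restricted maximum likelihood
print(result.summary())
```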
Table 2. Pearson’s correlations between BST, NAPLAN and the individually administered literacy (IA) tests.
Note: BST, Basic Skills Test; IA, individually administered test; LS: large-scale test; NAPLAN, National Assessment Program: Literacy and Numeracy; Zdif, Z-difference score.
As can be seen, the correlations between the BST reading scores and the IA literacy scores were within .07 of the corresponding NAPLAN reading correlations, and the correlations between the BST numeracy scores and the IA literacy scores were within .09 of the corresponding NAPLAN numeracy correlations. The differences between the BST and NAPLAN correlations were tested using Z-difference scores (as per Field, 2013) with the degrees of freedom adjusted for nonindependence between twins (see Table 2). None of the correlations differed significantly at an alpha of .05. In addition to the correlations, we used multilevel models to check for potential differences in the amount of variation in BST and NAPLAN reading explained by the three IA literacy tests. These three IA literacy tests predicted 50% and 46% of the variance in BST and NAPLAN reading, respectively. For numeracy, the three IA literacy tests predicted 26% and 30% of the variance in the BST and NAPLAN, respectively. As such, the relationship between the IA literacy tests and the BST closely resembled that of those same literacy tests with the NAPLAN. Therefore, the scores (standardised within year) from the BST and NAPLAN were combined into a single score for each of the domains of reading and numeracy. At this point, any variance in scores that is systematically due to the type of test (i.e. BST or NAPLAN) would contribute to error variance in our genetic analyses. To remove this error variance from the genetic analyses, test type was covaried out.
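For completeness, the Z-difference test for two correlations from independent samples takes the form sketched below, using Fisher's r-to-z transformation; in our analyses, the degrees of freedom (and hence the effective sample sizes entering the formula) were additionally adjusted for nonindependence between twins.

```python
# A sketch of the Z-difference test for two correlations from
# independent samples; the effective n's used in our analyses were
# adjusted for twin nonindependence before applying this formula.
import math

def z_difference(r1: float, n1: int, r2: float, n2: int) -> float:
    z1, z2 = math.atanh(r1), math.atanh(r2)      # Fisher's r-to-z
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))  # SE of the difference
    return (z1 - z2) / se  # refer to the standard normal distribution
```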
Results
Phenotypic analyses
Table 3. Phenotypic correlations among the individually administered literacy (IA) tests and the large-scale (LS) tests.
Note: p < .001 for all correlations.
Table 4. Multilevel regression model predicting performance on large-scale (LS) tests from individually administered (IA) literacy tests.
Note: Pseudo R2 was calculated as the proportion of variance explained by the predictors in the model.
The IA literacy tests accounted for considerably less variation in numeracy as assessed by the large-scale tests, although the amount explained, 28%, is still substantial. Interestingly, the unique contributions of IA word reading and IA vocabulary to predicting differences in large-scale numeracy performance were as substantial as their contributions to predicting performance in reading on the large-scale test.
Behaviour–genetic analyses
Table 5. Intraclass twin correlations and proportion of variance due to genes, shared environment and unique environment for individually administered (IA) literacy tests and large-scale (LS) reading and numeracy tests.
Note: MZ, monozygotic; DZ, dizygotic; A, genetic variance; C, common environmental variance; E, unique environmental variance.
95% confidence intervals are in brackets.
Table 6. Genetic and unique environmental correlations between individually administered (IA) literacy tests and large-scale (LS) reading and numeracy tests.
Note: 95% confidence intervals are in brackets.
The genetic correlations between performance on the large-scale reading tests and on the IA tests of literacy skills are high. The genetic correlation between large-scale reading and IA comprehension is estimated here at unity, essentially meaning that the genes that influence comprehension as assessed by the IA test also influence reading as assessed by the large-scale tests. The remaining genetic correlations are high but below unity, indicating that some genes contribute to performance on large-scale reading tests without also contributing to the individually assessed literacy skills of word reading or vocabulary.
The unique environmental correlations between the IA literacy skills and large-scale reading performance are considerably smaller than the genetic correlations. The unique environment variance estimates include any measurement error, which, by definition, does not correlate across measures. Therefore, unique environment correlations are evidence that the environmental variance is not solely due to measurement error. That is, the unique environmental correlations indicate specific factors in the unique environment that contribute to both performance on IA literacy tests and large-scale reading tests. However, the data analysed in this article do not allow the identification of those factors.
Similar to performance on large-scale reading tests, performance on large-scale numeracy tests also showed moderate-to-high genetic correlations with the IA literacy tests, indicating substantial overlap in the genes that influence performance on large-scale numeracy tests and on these IA tests of specific literacy skills. Again, there was some overlap in the environmental factors contributing to performance on the large-scale numeracy tests and IA word reading, and likewise for IA vocabulary.
Discussion
Our principal aim in this article was to contribute to the dialogue on the validity of large-scale school assessment tests in Australia. We assessed whether student performance on reading tests administered on a large scale converged with student performance on IA tests of literacy skills. We explored this relationship with both phenotypic and behaviour–genetic analyses.
Performance on the IA literacy skills tests accounted for 49% of the variation in performance on large-scale reading tests. Although the remaining variation in performance on large-scale reading tests remains to be explained, it is worth noting that our estimates are comparable in size to the amount of variance explained by IA word reading and listening comprehension in several widely used reading comprehension tests (Keenan, Betjemann, & Olson, 2008). Furthermore, the correlations among different measures of reading comprehension are often quite modest (Cain, Oakhill, & Bryant, 2004; Keenan et al., 2008). As such, our findings are consistent with this literature in showing that reading as assessed in large-scale tests taps some of the same literacy skills that are measured by the IA tests. Note also that this relationship between performance on large-scale reading tests and the IA literacy skills tests is not driven exclusively by any one skill. Each of the literacy skills measured (i.e. comprehension, word reading and vocabulary) contributes uniquely to the variation in performance on the large-scale reading tests, suggesting that performance on these reading tests is best predicted by a composite of the literacy skills as measured by the IA tests.
In contrast with school reading, the IA literacy tests account for less variation in school numeracy, although the amount they do account for, 28%, is still substantial. A certain amount of overlap between literacy skills and numeracy is a realistic finding, given that the numeracy test requires reading of the test items and that there is extensive support in the literature for covariation in performance across academic domains (Hart, Petrill, Thompson, & Plomin, 2009; Helwig, Rozek-Tedesco, Tindal, Heath, & Almond, 1999; Kovas, Harlaar, Petrill, & Plomin, 2005). Taken together, these phenotypic results provide some evidence that large-scale reading tests preferentially assess the literacy skills that are thought to underpin reading comprehension.
The heritability estimates of performance on large-scale tests of reading and numeracy are moderate and consistent with the estimates obtained from studies conducted in the United Kingdom and the United States (Harlaar, Dale, & Plomin, 2007; Harlaar, Hayiou-Thomas, & Plomin, 2005; Hart et al., 2013; Oliver et al., 2004; Petrill et al., 2012). The heritability estimates of the literacy skills are also in line with larger studies, with a lower estimate for vocabulary than for comprehension and word reading (Harlaar et al., 2010). Moreover, the estimates of this subsample of Australian students who provided information on their large-scale test results were very close to the heritability estimates from the full ILTS sample at grade 2 for word reading (.82), vocabulary (.44) and comprehension (.67; Byrne et al., 2009).
Given that our purpose is to explore whether there is convergence in performance on the IA and the large-scale tests, we need to break down the heritability and environmental variance estimates further. Specifically, if the large-scale reading test taps and assesses skills that are also measured by the IA literacy tests, then this would be evidenced, in part, by common influences of genes and environmental factors. The high genetic correlations between performance on the large-scale reading test and the IA literacy tests indicate that common genes influence performance on both types of test. This particularly applies to the correlation between the large-scale reading and IA comprehension tests. However, common genes can affect multiple academic domains. There is evidence for substantial overlap in genes across academic domains (Kovas et al., 2005; Plomin & Kovas, 2005; Plomin, Kovas, & Haworth, 2007), which was supported by the moderate-to-high genetic correlations between performance on the large-scale numeracy test and the IA literacy skills that emerged in our analyses.
Of particular interest in our data are the unique environmental correlations among the IA literacy tests and school reading. If the unique environment genuinely influences performance on the IA literacy tests and those same literacy skills are tapped by school reading, then we would expect some positive unique environmental correlations. The unique environmental correlations between reading and the literacy skills indicate that there are environmental factors that contribute both to school reading and to the underlying literacy skills. The substantial genetic and unique environmental correlations between school reading and the literacy skills support the phenotypic finding that performance in school reading in part captures those component literacy skills. Furthermore, the relationship between the component literacy skills and school test performance is considerably stronger for reading than for numeracy.
Limitations
The greatest limitation of our study is a consequence of using a subsample of a larger study. The sample size for our genetic analyses is small, as is evident in the large confidence intervals around many of the parameter estimates. However, the parameter estimates are very similar to those found in much larger studies using similar measures, such as the Twins Early Development Study in the United Kingdom (Harlaar et al., 2007; Haworth, Kovas, Petrill, & Plomin, 2007), which gives us more confidence in the estimates from this small sample. The sample is also limited in its representativeness for a number of reasons. First, the students were all sampled from the Sydney metropolitan area and are from families who registered to participate in twin studies. As such, they are unlikely to be representative of Australian school students. Indeed, when compared with the national reports on the NAPLAN, at least three-quarters of our NAPLAN students were performing above the NSW average. Second, only those participants in the full ILTS study who sent in results of large-scale assessments were included in this paper. Those from the grade 2 ILTS assessment who returned this information and are included in this paper performed, on average, slightly better on the IA tests than those who did not return results from large-scale tests, although given the small differences between groups, it seems unlikely that nonparticipation was driven by very poor results in the large-scale tests. Moreover, the heritability estimates at grade 2 in the full ILTS sample were very close to the estimates in this subsample of Australian students who returned results of the large-scale tests. This lends support to the subsample in these analyses being representative of the ILTS sample as a whole.
Conclusions
One key finding was that a few IA tests of component literacy skills accounted for a reasonable amount of variation in performance on large-scale reading tests. The effectiveness of these literacy skills at predicting performance on large-scale tests was greater for reading than for numeracy. These results indicate that large-scale reading tests are preferentially testing, to some extent, a student’s reading skills as assessed by well-accepted IA tests of component skills. Contrary to some reports in newspapers, these large-scale tests are not simply measures of student anxiety about exams, or of their ability to colour in multiple-choice bubbles (Coulson, 2011).
Still, the results reported in this article are not evidence that the results from large-scale tests are sufficiently accurate for all the purposes for which parents, schools or politicians attempt to use them. As is the case with any test, the validity of large-scale tests does not rest solely on their measurement characteristics but also on an accurate and appropriate interpretation of their results.
The other key finding is that performance on these Australian large-scale reading and numeracy tests is heritable to about the same degree as other measures of literacy and numeracy, whether IA (e.g. Byrne et al., 2009) or based on teacher assessment (e.g. Harlaar et al., 2007), from a variety of studies and countries. This evidence further supports the validity of the BST and NAPLAN.
Acknowledgements
The Australian Twin Registry is supported by an enabling grant from the National Health and Medical Research Council. We thank the Australian Twin Registry, our testers, and the twins and parents involved.
Declaration of conflicting interests
None declared.
Funding
This work was supported by the Australian Research Council [grant numbers DP0663498, DP0770805].
