Abstract
Teachers use assessment to ascertain and enhance student learning; hence the importance of assessment literacy. One of the instruments that has been used to examine teachers’ assessment literacy is the Assessment Literacy Inventory developed by Mertler and Campbell. The Assessment Literacy Inventory has previously been validated with pre-service teachers using traditional statistical techniques. This study reports on an evaluation of the Assessment Literacy Inventory’s utility with 582 in-service teachers, employing the Rasch model and confirmatory factor analysis. The results indicate that the Assessment Literacy Inventory works well at the item level. However, its seven-factor structure, based on the Standards for Teacher Competence in Educational Assessment of Students, is not well supported when examined with these newer psychometric techniques. Recommendations are therefore presented. The article concludes with relevant implications for instrument development, educational assessment research, policy and practice, and teachers’ professional development.
Introduction
Teachers have the potential to influence student development in the classroom. As primary agents in the educational process, teachers make instruction and learning beneficial for students. However, their success in making the teaching-learning process effective depends on several factors, including sound knowledge and application of educational concepts that involve assessment (Churchill, Ferguson, Godinho, Johnson, Keddie, Letts, Mackay, McGill, Moss, Nagel, Nicholson, & Vick, 2011). The assessment components of teacher professional standards such as the Australian Professional Standards for Teachers (Australian Institute for Teaching and School Leadership, 2013) and the Philippine National Competency-Based Teacher Standards (Philippine Department of Education, 2006) further highlight this requirement. Hence, assessment literacy is pertinent to teachers’ professional competence (Popham, 2009; Stiggins, 1991a). Assessment literacy is defined as teachers’ ‘knowledge of and abilities to apply assessment concepts and techniques to inform decision making and guide practice’ (Mertler & Campbell, 2005, p. 16).
The need for sound assessment literacy stems from the fact that assessment is an essential part of teachers’ professional responsibilities (Mertler, 2003, 2004). Teachers are accountable for establishing and improving student learning through assessment, and it is incumbent upon them to use appropriate means and to provide evidence of that learning (Phye, 1997). Using a variety of assessment methods and strategies to ascertain student learning requires knowledge and skill in employing assessment; moreover, continuous and meaningful use of assessment requires teachers to possess assessment literacy. It has been estimated that about 30% to 50% of teachers’ instructional time is spent on assessment-related activities (Stiggins & Conklin, 1992), which include designing, developing, selecting, administering, scoring, recording, reporting, evaluating, and revising assessment methods (Stiggins, 1988). These assessment activities can be made more effective where teachers possess the relevant knowledge and skills of and for assessment (Popham, 2009, 2011; Stiggins, 1988, 1991a, 1991b; Stiggins, Arter, Chappuis & Chappuis, 2007; Stiggins & Conklin, 1992).
As well as recording what has been learnt, assessment is also meant to support and improve teaching and learning in the classroom (Boone, Staver & Yale, 2014a, 2014b; Brookhart, 1999; Pellegrino, Chudowsky, & Glaser, 2001; Popham, 2011; Stiggins, 1991b, 2010; Stiggins & Conklin, 1992). Here, assessment results guide teachers’ decisions regarding a range of pedagogical elements, which include:
- learning aims and objectives;
- subject contents and activities that need to be given emphasis;
- methods and strategies to deliver the targets and the contents effectively;
- the ‘what’ and ‘how’ of assessing student learning;
- judgement of whether or not the learning aims have been achieved; and
- identification of improvement needs (Black & Wiliam, 1998a, 1998b; Kellaghan & Greany, 2001; Popham, 2011; Stiggins, 2010; Stiggins & Conklin, 1992).
Subscribing to the view that sound teaching requires sound student assessment, and recognising the need for teachers to possess knowledge and skills in student assessment, the American Federation of Teachers (AFT), National Council on Measurement in Education (NCME), and National Education Association (NEA) (1990) developed the Standards for Teacher Competence in Educational Assessment of Students that cover seven broad areas. These include: (1) Choosing assessment methods appropriate for instructional decisions; (2) Developing assessment methods appropriate for instructional decisions; (3) Administering, scoring and interpreting the results of both externally-produced and teacher-produced assessment methods; (4) Using assessment results when making decisions about individual students, planning teaching, developing curriculum, and school improvement; (5) Developing valid pupil grading procedures which use pupil assessments; (6) Communicating assessment results to students, parents, other lay audiences, and other educators; and (7) Recognizing unethical, illegal, and otherwise inappropriate assessment methods and uses of assessment information (pp. 2–5).
Likewise, Berry and Adamson (2011) and Masters (2013) highlight the importance of assessment, its principles, and associated challenges. Internationally, the OECD (2013, p.17) acknowledges the ‘widespread recognition that evaluation and assessment arrangements are key to both improvement and accountability in school systems.’ This comprehensive review examines the various components of assessment and evaluation frameworks that countries use with the objective of improving student outcomes (OECD, 2013).
Consistent in all these reports and reviews on ‘assessment and evaluation’ is the centrality of teachers’ assessment literacy and its influence on student learning outcomes. Gardner, Harlen, Hayward, and Stobart (2011, p. 106) reiterate that teachers need to be empowered to undertake meaningful assessment. For example, Chick and Pierce (2008) and Pierce and Chick (2010) highlight that teachers have difficulty interpreting assessment results. Thus, assessment is a vital element of education, and an understanding of associated processes is fundamental to teacher professionalism (OECD, 2013; Santiago, Donaldson, Herman, & Shewbridge, 2011).
With the aim of examining whether or not teachers in the field have the required assessment literacy, and of identifying areas in which teachers need further training, the aforementioned standards have become the basis of assessment literacy research. In turn, the need to examine teachers’ assessment literacy against these standards has necessitated the development of various scales and instruments. One such instrument is the Assessment Literacy Inventory (ALI) developed by Mertler and Campbell (2005). This article reports on the utility of the ALI using data from 582 in-service teachers selected from schools on the list provided by the DepEd Tawi-Tawi Division Office in Bongao, Tawi-Tawi, Philippines. This sample, which differed from that of Mertler and Campbell (2005), provided the opportunity to test the portability of the scale. The data are examined using the Rasch model (Rasch, 1960, 1966; Wright, 1988) and confirmatory factor analysis (CFA) to provide information regarding item/scale fit, factor structure and structural invariance of the scale. This paper aims to:
- add to the ALI’s previous validation findings, which were based on samples from the United States (Mertler & Campbell, 2005);
- ascertain its measurement properties and utility; and
- importantly, gauge its portability to other education systems, such as those in the Asia-Pacific region.
The Assessment Literacy Inventory (ALI)
Mertler and Campbell (2005) developed the ALI to measure teachers’ assessment literacy as a result of findings on the psychometric utility of their previous instruments. For instance, they reported that, in 1991, the first scale, the ‘Teacher Assessment Literacy Questionnaire (TALQ)’ developed by Plake (1993, cited in Mertler & Campbell, 2005), was employed in a national survey both to establish its psychometric properties and to measure teachers’ assessment literacy. Using a sample of 555 in-service teachers from across the US, the reliability for the whole test employing KR20 was 0.54 (Plake, Impara, & Fager, 1993). This was below the acceptable threshold of at least 0.65 (Chase, 1999, as cited in Mertler & Campbell, 2005). Campbell et al. (2002, cited in Mertler & Campbell, 2005) applied an identical scale, which they called the ‘Assessment Literacy Inventory (ALI)’, to 220 students undertaking a pre-service education program. This study yielded a reliability of 0.74 using the same statistical procedures.
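To make these reliability figures concrete, the sketch below computes KR20 (the dichotomous-item special case of Cronbach’s alpha) for a 0/1 response matrix. It is a minimal illustration only: the data are simulated from a simple logistic model, not drawn from the studies cited above.

```python
import numpy as np

def kr20(responses: np.ndarray) -> float:
    """Kuder-Richardson Formula 20 for dichotomous (0/1) items.

    responses: persons x items matrix of 0/1 scores.
    """
    k = responses.shape[1]                         # number of items
    p = responses.mean(axis=0)                     # proportion correct per item
    q = 1 - p                                      # proportion incorrect per item
    total_var = responses.sum(axis=1).var(ddof=1)  # variance of total scores
    return (k / (k - 1)) * (1 - (p * q).sum() / total_var)

# Simulated responses: 555 persons x 35 items (illustrative, not real data).
rng = np.random.default_rng(0)
ability = rng.normal(size=(555, 1))
difficulty = rng.normal(size=(1, 35))
prob = 1 / (1 + np.exp(-(ability - difficulty)))
sim = (rng.random(prob.shape) < prob).astype(int)
print(round(kr20(sim), 2))  # internal-consistency reliability of the simulated test
```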
Mertler (2003) compared the assessment literacy levels of in-service and pre-service teachers. Like Campbell et al. (2002, as cited in Mertler & Campbell, 2005), Mertler (2003) used a slightly modified version of the TALQ and called the instrument the ‘Classroom Assessment Literacy Inventory (CALI)’. He noted that his study yielded results similar to those of Plake et al. (1993) and Campbell et al. (2002). Using KR20, Mertler (2003) obtained a reliability of 0.57 for the in-service teachers (Plake et al. study, KR20 = 0.54) and 0.74 for the pre-service teachers (Campbell et al. study, KR20 = 0.74). Mertler and Campbell (2005) emphasised that the assessment literacy scales used previously showed low reliability for in-service teachers but exhibited much greater reliability for pre-service teachers.
Having employed instruments that were identical to the TALQ and having obtained consistently low reliability results for in-service teachers, both Campbell et al. (2002) and Mertler (2003) concluded that the original instrument possessed poor psychometric qualities. The original scale was considered ‘difficult to read, extremely lengthy, and contained items that were presented in a decontextualized way’ (pp. 8–9), which led to a complete redevelopment of the assessment literacy scale. Hence, the new ALI, which differs in items and structure from the earlier instruments, was developed in 2003 (Mertler & Campbell, 2005).
The ALI consists of 35 multiple-choice items embedded in five classroom-based scenarios. Each scenario reflects a classroom situation featuring a teacher undertaking assessment-related activities and making assessment-related decisions. The situation in each scenario is followed by seven items aligned to the seven Standards for Teacher Competence in the Educational Assessment of Students (STCEAS) (AFT, NCME, & NEA, 1990). Each ALI item consists of a stem with four options: one correct answer and three distractors (Mertler & Campbell, 2005).
Previous ALI Validation
After the development of the ALI, the authors reviewed the items to ensure alignment with the STCEAS and to check for item clarity, readability, and the accuracy of the correct answers. Problematic items were reviewed ‘until consensus was reached regarding the item appropriateness and quality’ (Mertler & Campbell, 2005, p. 10).
After this initial face validity check, the ALI was trialed twice to examine its psychometric properties. The ALI was first administered in 2003 to 152 undergraduate pre-service teaching students who took the introductory assessment courses that were aligned with the STCEAS. The ALI was analysed using the Test Analysis Program (TAP) of Brooks and Johanson (2003, cited in Mertler & Campbell, 2005) to conduct test-level analysis, item-level analysis, reliability analysis, and options/distracters analysis. Some items of the ALI scale were revised and four items were completely removed based on the results of these analyses. In its second trial in 2004, the revised ALI was administered to 250 undergraduate pre-service teaching students after they had completed a course in testing and measurement. Results of the analyses, with an overall reliability (KR20) of 0.74, mean item difficulty of 0.68, and mean item discrimination of 0.31, showed the ALI to have acceptable psychometric properties (Mertler & Campbell, 2005). The ALI developed in this way consisted of five scenarios, each followed by seven questions, giving a total of 35 items. Mertler and Campbell (2005) describe the items as:

Related to the seven Standards for Teacher Competence in the Educational Assessment of Students (STCEAS). Some of the items are intended to measure general concepts related to testing and assessment, including the use of assessment activities for assigning student grades and communicating the results of assessments to students and parents; other items are related to knowledge of standardized testing, and the remaining items are related to classroom assessment. (p. 26)
Validation techniques
The central validation technique that was used in the current validation of the ALI was the Rasch model developed by Georg Rasch (Rasch, 1960, 1966), a Danish mathematician, in the 1960s (Baker, 2001). It is a popular one-parameter item response model (Ben, Hailaya, & Alagumalai, 2012) that can be utilised to judge items at the pilot or validation stage (Wu & Adams, 2007) and to review the psychometric properties of the existing scale (Tennant & Conaghan, 2007). The Rasch model defines the probability of a specified response in relation to the test takers’ ability and the item difficulty. The probability of success in answering an item correctly is modelled as a logistic function of the difference between the person ability and the item difficulty (Van Alphen, Halfens, Hasman, & Imbos, 1994).
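In the conventional notation for the Rasch model (standard notation, rather than drawn from the sources above), the probability that person n answers item i correctly is:

```latex
P(X_{ni} = 1 \mid \beta_n, \delta_i) = \frac{\exp(\beta_n - \delta_i)}{1 + \exp(\beta_n - \delta_i)}
```

where β_n is the person’s ability and δ_i the item’s difficulty, both expressed in logits. When ability equals difficulty, the probability of success is exactly 0.5, and the log-odds of success reduce to the simple difference β_n − δ_i.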
The model has the special properties of specific objectivity and unidimensionality. The property of specific objectivity emphasises that the estimates of item and person parameters are independent of each other (Bond & Fox, 2007; Ewing, Salzberger, & Sinkovics, 2005). The model positions person and item parameters on the same scale, and both sets of parameters are sample independent (Hambleton & Jones, 1993; Tinsley & Dawis, 1975; Van Alphen, Halfens, Hasman, & Imbos, 1994). Unidimensionality, in turn, requires the measurement of one underlying or dominant factor, construct or attribute at a time (Bond & Fox, 2007). Thus, items that fit the Rasch model are expected to follow a structure with a single or dominant dimension. Alagumalai and Curtis (2005, p. 2) highlight that the Rasch model has a ‘unique property that embodies measurement’: it provides probabilistic insight into how the data operate within a unidimensional model when seeking to understand how a construct operates. The advantage of using the Rasch model lies in its objectivity and its fulfilment of measurement requirements, thus enhancing the measurement capacity of an instrument (Cavanagh & Romanoski, 2006). It offers a fresh perspective for overcoming challenges associated with traditional sample-dependent reliabilities and test statistics (Alagumalai & Curtis, 2005).
The Rasch model is employed to estimate measures of individuals and item characteristics on a common scale, and to determine whether the responses conform to the requirements of a measurement model. In judging the responses, the fit indicators that the model provides are used: items that conform to the measurement requirements are retained, while those that fail to satisfy the requirements are removed (Ben, Hailaya, & Alagumalai, 2012; Curtis & Boman, 2007).
As another validation technique, confirmatory factor analysis (CFA) was used to examine the factor structure of the ALI.
CFA is used to verify the factor structure of a scale (Schreiber, Nora, Stage, Barlow, & King, 2006) and to provide evidence of construct validity (Probst, 2003). This statistical procedure can be considered a ‘macro-level’ analytic practice, as it examines whether or not a hypothesised relationship between the observed variables and their underlying latent constructs exists. CFA assumes that the researcher has some knowledge of the underlying factor structure of a set of measures (Byrne, 2001) and is therefore used as ‘a test whether an a priori dimensional structure is consistent with the structure obtained in a particular set of measures’ (Stewart, 2001, p. 76). In other words, CFA is a theory-testing model in which a hypothesis is put forward by the researcher before proceeding to the analysis (Stapleton, 1997); planning is driven by the theoretical relationships between the observed and latent variables, which are then empirically tested against a set of data (Schreiber et al., 2006).
In the present study, the correlated seven-factor structure was examined based on the STCEAS upon which the ALI was developed. The seven factors and their corresponding items are as follows (a specification sketch follows the list):
1. Choosing assessment methods appropriate for instructional decisions (items 1, 8, 15, 22 and 29);
2. Developing assessment methods appropriate for instructional decisions (items 2, 9, 16, 23 and 30);
3. Administering, scoring, and interpreting assessment results (items 3, 10, 17, 24 and 31);
4. Using assessment results when making decisions about individual students, planning teaching, developing curriculum, and school improvement (items 4, 11, 18, 25 and 32);
5. Developing valid grading procedures (items 5, 12, 19, 26 and 33);
6. Communicating assessment results to students, parents, lay audiences, and other educators (items 6, 13, 20, 27 and 34); and
7. Recognising unethical, illegal, and otherwise inappropriate use of assessment information (items 7, 14, 21, 28 and 35).
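By way of illustration, such a correlated seven-factor model can be written in lavaan-style syntax. The sketch below uses the Python semopy package rather than LISREL 8.80 (which was used in this study), and the data file name and column labels i1–i35 are hypothetical; treating the dichotomous items as continuous indicators is a simplifying assumption of this example.

```python
import pandas as pd
import semopy

# Correlated seven-factor CFA following the STCEAS item mapping above.
ALI_MODEL = """
Std1 =~ i1 + i8 + i15 + i22 + i29
Std2 =~ i2 + i9 + i16 + i23 + i30
Std3 =~ i3 + i10 + i17 + i24 + i31
Std4 =~ i4 + i11 + i18 + i25 + i32
Std5 =~ i5 + i12 + i19 + i26 + i33
Std6 =~ i6 + i13 + i20 + i27 + i34
Std7 =~ i7 + i14 + i21 + i28 + i35
"""

data = pd.read_csv("ali_responses.csv")  # hypothetical file of 0/1 item scores
model = semopy.Model(ALI_MODEL)          # factor covariances estimated by default
model.fit(data)
print(model.inspect())                   # factor loadings and covariances
print(semopy.calc_stats(model))          # chi-square, df, RMSEA, CFI, GFI, AGFI
```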
CFA was applied using structural equation modelling (SEM) (Schreiber et al., 2006).
Wright (1996) highlighted the use of the Rasch model and CFA as complementary techniques to identify model fit and confirm underlying factorial structures. In addition, Rasch and CFA are used to assess measurement invariance (Randall & Engelhard, 2010). Salzberger (2011, p. 2) argued that ‘it is pivotal to outline the requirements of measurement and to ensure that the Rasch philosophy and the theory of the construct guide the scale development and formation.’ It is also important to note the point made by Riemer and Kearns (2010, p. 263): ‘since confirmatory analysis can only demonstrate that the current model fits the current data reasonably well, but not whether it is the model that would best explain the variance-covariance structure in the data, additional tests like Rasch measurement’s dimensionality investigation [are needed] to provide further evidence for the proposed factorial structure.’
Methods
Modification of the ALI
The ALI items were adapted to suit the Tawi-Tawi/Philippine context in which the research was conducted. Adaptations were made mainly to teacher names and to topics in the scenarios depicted. While items were adapted, the ALI’s original structure of scenarios and items was preserved to maintain the integrity of the instrument. Face validation was undertaken by the authors of this article and by two university lecturers in the Philippines who were knowledgeable in the field (i.e. measurement, assessment and evaluation, and teacher education) and familiar with the context. The modified ALI was pilot tested with 45 elementary and secondary school teachers of the Mindanao State University-Tawi-Tawi to check its reliability. A Cronbach’s alpha of 0.75 was obtained, indicating acceptable reliability. The ALI was then administered to the intended respondents.
Administration of the ALI
Prior to data collection, ethics clearance for the study was sought from the University’s Human Research Ethics Committee. In addition, permission to administer the ALI was secured from the Philippine Department of Education (DepEd) at the national, regional, and local levels and from all other schools outside DepEd jurisdiction.
After approval was obtained, the ALI was administered to all Grade 6 (elementary level), Second Year and Fourth Year high school (secondary level) teachers in the province of Tawi-Tawi, Philippines. The schools involved in the study were taken from the list provided by the DepEd’s Tawi-Tawi Division Office in Bongao, Tawi-Tawi, Philippines. All public and private elementary and secondary schools were initially identified. However, as the schools are located on different islands throughout the province, only those that could be reached and accessed safely by the researcher were finally selected. A total of 128 schools (elementary: 91; secondary: 37) participated in the study. After selection of the schools, teachers and students were identified with the support of the DepEd Tawi-Tawi Division; owing to the DepEd’s engagement, all of the identified teachers responded to the ALI (a 100% response rate). A total of 582 teachers (321 elementary school teachers and 261 high school teachers) completed the instrument.
Validation analysis of the ALI
The ConQuest 2.0 software (Wu, Adams, Wilson, & Haldane, 2007) was used to undertake the Rasch analysis. To judge the acceptability of the items, residual-based fit statistics were employed. The infit weighted mean square (IWMS) and the t-statistic (t) were used to examine whether or not items conformed to the Rasch model. Values of 0.80 to 1.20 for the IWMS (Linacre, 2002) and −2 to +2 for t (Wu & Adams, 2007) were considered to indicate acceptable item fit. Items outside these ranges violated the measurement requirements and were removed from the analysis one at a time.
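As an illustration of how such a fit statistic is formed, the sketch below computes the infit weighted mean square for simulated dichotomous data and flags items outside the 0.80–1.20 range. It is a minimal sketch under the dichotomous Rasch model with known parameters, not a reproduction of ConQuest’s estimation routine; the t-statistic reported by such software is a normalised transformation of this mean square.

```python
import numpy as np

rng = np.random.default_rng(1)
beta = rng.normal(size=582)    # simulated person abilities (logits)
delta = rng.normal(size=35)    # simulated item difficulties (logits)

# Model probabilities and simulated 0/1 responses.
p = 1 / (1 + np.exp(-(beta[:, None] - delta[None, :])))
x = (rng.random(p.shape) < p).astype(int)

def infit_mnsq(x: np.ndarray, p: np.ndarray) -> np.ndarray:
    """Infit weighted mean square per item: squared residuals weighted by
    the binomial variance w = p(1 - p), aggregated over persons."""
    w = p * (1 - p)
    return ((x - p) ** 2).sum(axis=0) / w.sum(axis=0)

mnsq = infit_mnsq(x, p)
flagged = np.flatnonzero((mnsq < 0.80) | (mnsq > 1.20)) + 1  # 1-based item numbers
print(flagged)  # items outside the acceptable IWMS range (likely none here)
```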
As the ALI was developed based on STCEAS, the proposed seven-factor structure was re-tested for structural invariance. CFA was performed using LISREL version 8.80 (Jöreskog & Sörbom, 2006). How well the proposed seven-factor structure of the ALI fits the data was assessed using several fit indices. These indices include the chi-square (χ2) statistic, ratio of chi-square to the degrees of freedom, root mean square error of approximation (RMSEA), standardised root mean square residual (SRMR), goodness-of-fit index (GFI), adjusted goodness-of-fit index (AGFI), and comparative fit index (CFI).
The χ2 is described as an index of ‘exact fit’ as it evaluates the perfect fit of a model to empirical data (Matsunaga, 2010). However, although often used, it is sensitive to sample size and, with large samples, is almost always indicative of poor model fit; thus, there is a need to divide the χ2 by the number of degrees of freedom (df) to further assess the model (Probst, 2003). The RMSEA is an index of ‘approximate fit’ (Schermelleh-Engel, Moosbrugger, & Müller, 2003) that determines how closely the model fits the data (Matsunaga, 2010). Considered one of the most informative fit indices, representing error due to approximation (Diamantopoulos & Siguaw, 2000), it shows ‘how well would the model, with unknown but optimally chosen parameter values, fit the population covariance matrix if it were available?’ (Byrne, 2001, p. 82). The SRMR is a residual-based index that shows the average value of the standardised residuals between observed and predicted covariances (Matsunaga, 2010); it is a summary measure of standardised residuals (Diamantopoulos & Siguaw, 2000). The GFI and AGFI are absolute fit indices that estimate the extent to which the sample variances and covariances are reproduced by the hypothesised model (Bollen & Long, 1993). The AGFI differs from the GFI in that it adjusts for the number of degrees of freedom in the specified model. However, caution was taken with the use of these fit indices as they can be overly influenced by sample size (Fan, Thompson, & Wang, 1999, as cited in Byrne, 2001). The CFI is one of the major incremental fit indices that ‘measure the proportionate improvement in fit by comparing a target model with a more restricted, nested baseline model’ (Diamantopoulos & Siguaw, 2000, p. 87).
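Two of these indices can be recovered directly from the χ2 statistic. The sketch below computes the χ2/df ratio and the point estimate of the RMSEA from a model’s χ2, degrees of freedom, and sample size, using the standard formula RMSEA = sqrt(max(χ2 − df, 0) / (df(N − 1))); the numeric values are illustrative only, not the fit results of this study.

```python
import math

def rmsea(chi2: float, df: int, n: int) -> float:
    """Point estimate of the root mean square error of approximation."""
    return math.sqrt(max(chi2 - df, 0.0) / (df * (n - 1)))

# Illustrative values only (not this study's reported fit results).
chi2, df, n = 1200.0, 500, 582
print(round(chi2 / df, 2))           # ratio of chi-square to degrees of freedom
print(round(rmsea(chi2, df, n), 3))  # approximate-fit index
```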
Summary of fit indices and their corresponding permissible values.
χ2 = chi square; df = degrees of freedom; RMSEA = root mean square error of approximation; SRMR = standardized root mean square residual; GFI = goodness-of-fit index; AGFI = adjusted goodness-of-fit index; and CFI = comparative fit index.
Results
Item analysis of the ALI using the Rasch model
Results of the initial analysis of ALI items.
IWMS = infit weighted mean square; t = t-statistic.
aValue outside the acceptable t range.
Results of the final analysis of ALI items.
IWMS = infit weighted mean square; t = t-statistic.
Analysis of the ALI structure using CFA
The ALI items were tested in terms of the seven-factor structure based on the STCEAS (Standard 1 to Standard 7) (AFT, NCME, & NEA, 1990).
In this structure, four items (items 1, 8, 15, and 29) served as observed variables for the first latent factor (Standard 1), item 22 having been removed during the Rasch analysis. The remaining latent factors (Standards) were each represented by five items as follows: items 2, 9, 16, 23, and 30 for the second latent factor (Standard 2); items 3, 10, 17, 24, and 31 for the third latent factor (Standard 3); items 4, 11, 18, 25, and 32 for the fourth latent factor (Standard 4); items 5, 12, 19, 26, and 33 for the fifth latent factor (Standard 5); items 6, 13, 20, 27, and 34 for the sixth latent factor (Standard 6); and items 7, 14, 21, 28, and 35 for the seventh latent factor (Standard 7). A conventional cut-off of 0.40 (Matsunaga, 2010) was used to evaluate the factorial coherence of the ALI items: items with a factor loading of at least 0.40 were to be retained, while those with a factor loading below 0.40 were to be discarded. The structure of the seven-factor model is presented in Figure 1.
Seven-factor structure of the ALI.
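As a small illustration of this screening rule, the filter below applies the 0.40 cut-off to a table of loadings; the loading values shown are hypothetical, not the study’s CFA estimates.

```python
import pandas as pd

# Hypothetical standardised loadings (illustrative values only).
loadings = pd.DataFrame({
    "item": ["i1", "i8", "i15", "i29"],
    "factor": ["Std1"] * 4,
    "loading": [0.52, 0.38, 0.45, 0.29],
})

CUTOFF = 0.40
retained = loadings[loadings["loading"] >= CUTOFF]["item"].tolist()
discarded = loadings[loadings["loading"] < CUTOFF]["item"].tolist()
print(retained, discarded)  # ['i1', 'i15'] ['i8', 'i29']
```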
Model fit indices
Summary of fit indices for the 7-factor ALI structure.
CFA of the ALI hypothesised measurement model
In addition to checking the overall model fit, factor loadings of the ALI items were examined to gauge whether or not the items reflected the factors as intended.
Factor loadings of ALI items under the seven-factor model.
Rasch analyses plus CFA: Item-test fit and structural coherence
To understand better the links between item-test fits and the scale’s structural coherence, the Rasch Item Map of Latent Distribution and Response Model Parameter Estimates was positioned alongside the CFA of ALI (Figure 2).
Item-test fit and structural coherence of the ALI.
It is evident from the item plots and the ALI’s CFA structure in Figure 2 (see the right-hand side of the item map on the left) that no clear hierarchy exists between the standards. Overlapping items between standards make it difficult to specify clearly the ‘steps’ between standards, and thus the articulation of the denoted competency.
In a comprehensive study, the Education Department of Western Australia (1994) reported information regarding the articulation of steps or stages in the Standards Framework (for example, in working scientifically) and the associated establishment of the levels of achievement or competencies. This supports the need for an axiomatic application of measurement, i.e., linking a conceptual or theoretical framework to how items in a scale or test are distributed. There are varying bandwidths for selected standards, and it may be necessary to adopt Ingvarson’s (2002, p. 8) Content – Evidential – Performance tripartite standards model to guide the development of educational aims and standards.
Discussion
Results of the Rasch analysis provide evidence of acceptable measurement properties of the ALI. This means that the ALI is a reliable scale that can be employed to measure the assessment literacy of in-service teachers. Hence, the ALI can be considered an appropriate tool to gauge teachers’ assessment literacy and their readiness to undertake classroom assessments. It can likewise be employed to examine teachers’ weaknesses or misconceptions in assessment, as suggested by the ALI authors, and to devise relevant interventions in the form of professional development. However, while the results of the micro-level analysis of the ALI items are consistent with the previous ALI validation findings, the results of the macro-level analysis do not agree with the ALI’s development framework. Thus, there is a need to review or re-examine the seven-factor structure of the ALI.
The problematic results for the ALI’s seven-factor structure, in terms of poor model fit and low factor loadings, can possibly be explained by three reasons. The first appears to be the absence of hierarchy among the items and factors (standards), as indicated by the overlapping of items across the factors (standards) shown in Figure 2. This absence of hierarchy can pose problems: in any instrument that takes the form of a test, it is recommended that the items be of increasing difficulty (Brizuela & Montero-Rojas, 2013; Ludlow & Haley, 1995), both to motivate and to challenge the examinees as they progress through the test. Having ‘difficult’ or challenging items at the start of a test can adversely affect examinees’ interest and performance. Hence, the model proposed by Mertler and Campbell (2005) can be challenged. It is useful to note Stiggins’ (1999, as cited in Mertler & Campbell, 2005, p. 6) assertion that the STCEAS are not sufficiently comprehensive to prepare teachers for the realities of the classroom. It will be useful for practitioners and researchers if constructs, standards and competencies are clearly defined and not confounded by complex nesting and hierarchies; explicit structures will enable targeted professional learning towards areas of need. If constructs, standards and competencies are nested within each other in a complex manner, interpretation of the underlying dimensions will be difficult, which in turn makes it even more challenging to target and direct professional learning towards areas of need.
In discussing standards for Australian teachers, and those engaged in preparing teachers and providing for professional development, Ingvarson (2002, p. 3) highlighted that:

As measures, standards will not only describe what teachers need to know and be able to do to put these values into practice; they will describe how attainment of that knowledge will be assessed, and what counts as meeting the standard. A standard, in the latter sense, is the level of performance on the criterion being assessed that is considered satisfactory in terms of the purpose of the evaluation.
The second possible reason relates to the ambiguities and interpretation challenges associated with the ALI. To obtain a more meaningful interpretation of the validation results, a common understanding of the key terms used in the ALI instrument is essential. One might be tempted to consider the term ‘standard’ to be synonymous with the term ‘factor’ that is often used in scale validation. The ambiguity in the use of these important terms has to be clarified before proceeding to further analyses and drawing meaningful inferences.
As the ALI is essentially a test, the ‘standard’ as a ‘principle’ was well defined by the AFT, NCME, and NEA (1990) when developing the STCEAS. However, a ‘standard’ is not equivalent to a ‘factor’. A factor refers to an element, circumstance or influence that contributes to producing a particular result or situation; in other words, a factor is anything that contributes to a result. According to Royce (1963, p. 522), factors are ‘dimensions, determinants, functional unities, parameters, and taxonomic categories.’ Furthermore, in the context of this paper, the term ‘standard’ means ‘standards for teacher competence in educational assessment of students’ (Brookhart, 2011, p. 4). If we adhere to this line of argument, the standards (or at least the items in each standard) need to reflect different levels of cognitive processing with increasing levels of difficulty (Brady & Kennedy, 2012), which can be examined using the Rasch model. Hence, there is a need to clarify the terms ‘factor’ and ‘standard’ before appropriate structural analysis of the ALI can be undertaken.
Finally, the third possible reason could be that the ALI has a structure different from the seven-factor model hypothesised by Mertler and Campbell (2005). The poor model fit and the low factor loadings indicate either that the ALI does not follow a seven-factor structure or that there exists factorial or structural variance across countries and cultures. This further suggests that the ALI may have an underlying factorial structure other than the one currently hypothesised. One possible structure is a one-factor model, as indicated by the results of the Rasch analysis. As highlighted earlier, the Rasch model strictly adheres to the requirements of unidimensionality. As 34 of the 35 items fit the Rasch model, it could be concluded that the ALI has a one-factor structure pertaining to assessment literacy. These items will contribute to the development of modules for professional learning programs for teachers.
Conclusion
In any educational context, teachers’ assessment literacy is of prime importance in ascertaining learning. Results reported in this article have shown that the ALI has psychometric qualities that make it useful for measuring teachers’ assessment literacy. At the item level, the ALI is a potential instrument for examining teachers’ knowledge of classroom assessment concepts and their application, and can be used with in-service teachers.
However, further examination of the ALI with in-service teachers in other contexts (cultural contexts beyond the Mertler and Campbell (2005) study), and perhaps employing other validation procedures, is warranted. Thus, the ALI needs further review, validation and clarification to establish a meaningful structure for the instrument, particularly with a view to the portability of the test across contexts.
Still, the analyses reported in this article demonstrate how two psychometric techniques, namely Rasch analysis and confirmatory factor analysis, can complement each other in examining the appropriateness of a scale in terms of its measurement properties and underlying theoretical foundations.
Footnotes
Funding
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
Declaration of conflicting interests
None declared.
Acknowledgments
The first author wishes to thank Dr. Mertler and Dr. Campbell for the permission to use the ALI.
