Abstract
International large-scale student assessments are the most prominent example of internationalization processes in education. A number of studies have analysed the policy reactions to such studies, particularly to the Organisation for Economic Co-operation and Development (OECD) Programme for International Student Assessment (PISA) study. PISA and comparable projects have also raised concerns about a feared increase in assessments and accountability procedures. So far, systematic empirical evidence that could corroborate such concerns has been scarce. In this paper, we examine the prevalence of assessment and accountability practices at secondary education level as well as changes in these practices over time. We explicitly focus on changes over time by drawing on data from PISA 2000 to PISA 2015. Analyses over time are not straightforward with PISA, as the questionnaires change between survey rounds, leading to different coverage of specific indicators over time. We present descriptive analyses for 20 OECD countries. The results show an increasing trend for the vast majority of the generated indicators, indicating that assessments and the use of assessments for purposes of accountability increased within the larger part of the OECD during the last 15 years. Likewise, more horizontal, peer-oriented evaluation procedures focusing on organizational learning gained importance. A cluster analysis based on selected indicators of 2015 revealed four distinct groups of countries, which are mainly distinguished by different overall levels of assessment, accountability and evaluation practices.
Introduction
In contemporary knowledge-based economies, the production of human capital through schooling is a crucial growth factor and a tool for adapting to external changes. The increasing importance of education as a means of productivity, along with the global norm of securing equity in education, creates an urgent demand for effectiveness and efficiency in national education policies (Kamens, 2013: 120). In the search for best practices in schooling, international large-scale student assessments (ILSAs) have become increasingly popular over recent decades. According to Heyneman and Lee (2014), the number of international testing projects in education increased from 34 surveys between 1960 and 1989 to 152 surveys between 2000 and 2008.
Even though the first international studies of the International Association for the Evaluation of Educational Achievement were already implemented in the 1960s, currently the most prominent example of these testing projects is the Organisation for Economic Co-operation and Development (OECD) Programme for International Student Assessment (PISA). PISA has become a symbol of the rise of a new testing and ranking culture and is deemed to be the ‘main engine in the global accountability juggernaut’ (Meyer and Benavot, 2013: 9).
It is plausible to assume that the application and promotion of one uniform assessment across different countries and the dissemination of universal policy recommendations foster the emergence of a relatively homogenous worldwide model of education (Meyer and Ramirez, 2000). In other words, it is likely that education production processes converge and become more similar (Drezner, 2001; Jakobi and Teltemann, 2011). In fact, a growing body of literature is devoted to analysing policy reactions to PISA and other ILSAs, often with a critical stance. A number of studies show how the results of international comparisons are used during the implementation of educational reform (Breakspear, 2012; Dobbins and Martens, 2012; Egelund, 2008; Ertl, 2006; Grek, 2009; Takayama, 2008). A main focus of recent educational reform projects that is particularly promoted by PISA is the targeted use of regular assessment and test-based accountability measures in schools and education systems (Meyer and Benavot, 2013; Scheerens, 2007). The OECD’s and PISA’s focus on regular assessments as a means of quality assurance has also raised serious concerns. It is feared that an increase in testing at school has negative consequences for learning (through teaching to the test, a streamlining of curricula, etc. (Lingard et al., 2013)) and might create an unproductive self-sustaining cycle: the increasing societal importance of education triggers educational reform, which in turn increases the need for testing – which again leads to reform (Baker and LeTendre, 2005). To some researchers and stakeholders, this dynamic appears as an ‘epidemic of education policy’ (Levin, 1998), a ‘tyranny of testing’ (Mansell, 2007) or a ‘testing juggernaut’ that has to be slowed down (Andrews, 2014). Some scholars identify a ‘global panopticism’ (Lingard et al., 2013) fostered by an ‘infrastructure of accountability’ (Anagnostopoulos, 2013). Yet, systematic empirical evidence that could substantiate this discourse is scarce. Further, the debate lacks conceptual clarity: which kinds of testing and assessments are (potentially) ineffective, and what does (the new) accountability actually involve?
In this paper, we seek to examine the prevalence of assessment and accountability practices at secondary education level as well as changes in these practices during the last 15 years. For both research questions, we explicitly focus on changes over time by drawing on data from PISA 2000 to PISA 2015. Further, we take a closer look at changes in how PISA collects data on assessment and accountability practices at school. We are interested in the various aspects of school evaluation which are covered in the questionnaires and in the correspondence between theoretical models of accountability and the items surveyed in PISA.
The remainder of this paper is structured as follows: in the next section we provide an overview of concepts and definitions of assessments and accountability in education and summarize current trends in this area. The third section illustrates possibilities of analysing assessments and accountability practices across countries and time with data from the OECD PISA surveys. In section four, we delineate the data collection of the PISA surveys in the thematic fields of assessments and accountability. Before presenting our results based on descriptive analyses at the country level and on cluster analyses, we briefly describe the applied methods. The paper ends with a discussion and an outlook for further research.
Assessments and accountability in education – concepts and trends
The monitoring of students’ learning progress (through testing) and its evaluation through grades is an inherent feature of modern schooling. Although the terms assessment and evaluation are often used interchangeably, assessment generally refers to the collection of evidence of students’ knowledge, skills and aptitudes, whereas evaluation often refers to the collection of evidence about systems, programmes, procedures and processes (Rosenkvist, 2010). Assessments can take various forms (e.g. standardized or unstandardized) and serve different purposes. A common distinction is made between summative and formative assessments. Whereas the latter refers to more interactive forms of assessing students in order to determine learning progress and learning needs and to adapt teaching strategies, the former often has ‘higher stakes’ and may serve to promote, select or certify students. Another aspect of the purpose of assessment and evaluation is their accountability function. Accountability describes a reciprocal relationship in which one party is expected to produce a desired outcome, thereby gaining some benefit, and the other party has some capacity to affect the benefits of the performing party (Stecher and Hanser, 1992). Such a relationship requires a monitoring of the desired outcome. Ideally, this monitoring provides useful information for the parties involved on how best to act in order to produce the desired outcome (Perie et al., 2007). Traditionally, the level and unit of accountability in schools is the student, mainly for the purpose of accreditation, in order to meet the selection function of modern education systems (Parsons, 1959).
Summative assessments, however, can also be used to hold teachers and schools (and even education systems) accountable for meeting educational goals (Rosenkvist, 2010). This purpose of assessments in education has become more important during the last decade, which has seen wide transformations of educational governance. Several authors have described how decision-making processes in education have shifted from ‘idea based’, deliberative and proactive procedures to evidence-based, reactive governance (Grek, 2013; Marttila, 2014). This shift involves the transition from ‘input oriented’ to ‘output oriented’ steering processes. Output orientation, however, requires detailed information about the output (of education systems), for example, from assessment data. The change in educational governance implies more than a change in political decision-making processes in many countries. Sahlberg and Hargreaves (2011) further enumerate standardized teaching and learning with common criteria for measurement, a focus on literacy and numeracy, and market-oriented reforms as elements of the ‘Global Education Reform Movement’ (GERM). With this thematic alignment, the GERM corresponds to what is generally referred to as the ‘New Public Management Paradigm’ (Mons, 2009). Market orientation in education involves decentralization and increased school autonomy, as well as an expansion of school choice (Teltemann and Windzio, 2018). An increased scope for decision-making at the level of schools, however, requires ensuring that standards are still met; thus, test-based accountability procedures are likely to become more frequent (Hannan and Freeman, 1989; Wößmann, 2004). Further, choosing a school on a market requires information as a basis for decision-making, which means that performance indicators of schools have to be accessible (‘monitoring’; Figlio and Loeb, 2011).
The different levels and purposes of accountability have been systematized in models of accountability (Rosenkvist, 2010; Stufflebeam, 2001). For example, Lindberg (2013) presents a synthesis of different typologies of accountability and distinguishes between internal and external control, high and low strength of control, as well as vertical upward, vertical downward and horizontal directions of accountability, thereby generating a typology of 12 different models of accountability. Taking a more specific view of education, Scheerens et al. (2003: 6) present an overview of types of monitoring and evaluation in education. They distinguish data sources (e.g. assessment data, administrative data), functions (accountability, improvement, accreditation) and objects (e.g. system, school, student) of monitoring and evaluation in education.
Measured against these models, the ‘new accountability’ in education involves a shift in the levels of accountability (from students to schools and education systems) and new consequences (e.g. in terms of resource allocation or salaries) (O’Day, 2002). Further, as the new accountability has a clear focus on standards-based assessments, it can be described as a form of external (bureaucratic or hierarchical) accountability. In contrast to external, top-down accountability, internal accountability is enacted via peer review and professional standards and focuses on improving learning (Wyatt-Smith et al., 2014). However, it is an open empirical question whether both forms of accountability have become more prevalent during recent educational reform projects or whether they compete with each other. This is one of the questions we analyse in this paper.
Analysing assessments and accountability across countries and time
The aim of this paper is to examine changes in assessment, accountability and evaluation practices in secondary education over time and in different countries. So far, no international longitudinal survey on school assessment practices exists. However, ILSAs like the OECD PISA surveys do not only assess students’ literacy but also collect data on student background characteristics and on teaching and learning contexts. While studies like TIMSS and PIRLS are implemented every four and five years, respectively, PISA provides indicators on a three-yearly basis. The currently available public datasets cover six cycles spanning 15 years (2000 to 2015). For our analyses, we accordingly draw on PISA as the most frequent assessment. First and foremost, PISA surveys the reading, mathematics and science literacy of 15-year-old students in OECD countries and a number of partner countries. One of the aims of PISA is to relate variations in literacy to the learning context of students. Therefore, additional information is collected from students, parents and school heads (Kuger et al., 2016). The school level questionnaires, for example, survey the school’s general situation and cover aspects of assessments and evaluation in schools. The questionnaires are based on a theoretical framework developed by expert groups. The framework guides the selection of indicators in the respective rounds of assessment. Even though the frameworks of different survey rounds build on one another to a certain degree, they focus on different policy and research topics defined anew in each round by the expert groups, the participating countries and the OECD secretariat (Jude, 2016).
Whereas the comparison of literacy measures over time has been a goal of PISA from the beginning, less systematic effort has been put into the comparability of background or context information collected in PISA. Consequently, the comparison of data from PISA context questionnaires is hampered by the fact that the questions, as well as the wording and scaling of answer categories for similar questions, often change between survey rounds (Jude, 2016). Nevertheless, the political and scholarly interest in comparing potential determinants of literacy over time has increased during recent years. For example, after four rounds of PISA, the OECD (2009: 150) stated that ‘PISA can not only compare the trends in achievement over almost a decade for many countries, but also attempt to relate those trends to changes in policies, practices, and student populations.’ Without a detailed collection of context data and the possibility to compare these data over time, PISA would certainly not have become such a stable part of accountability for educational policy makers in participating countries. However, trend analyses of PISA data have to be handled with caution (Jerrim et al., 2017; Rutkowski and Rutkowski, 2016). This holds especially for context indicators, which are not implemented systematically over time, as shown in Table 1.
Table 1. Questions and items on assessments and accountability in the Programme for International Student Assessment.
Notes to Table 1:
Differentiated into non-mandatory and mandatory in 2015.
Answering categories differ.
Further items added in 2015; no trend available so far.
Domain: 2003: mathematics; 2009: reading; 2012: mathematics; 2015: teachers in general.
Differentiated into two different items in 2015.
Conceptualizing assessments and accountability with indicators from PISA
In PISA, assessment and accountability practices of schools are addressed in the school questionnaires, which are administered to school leaders. A comparison of the six school level questionnaires of the studies between 2000 and 2015 reveals that the questions and items focusing on assessments and accountability practices do not appear consistently in all questionnaires (see Table 1). While some items have been surveyed in five of the six rounds of PISA between 2000 and 2015, others have been included only once.
Table 1 illustrates that the number of items focusing on assessment and accountability increased over time, already indicating a growing significance of these aspects of schooling. While the forms and purposes of standardized assessment have been addressed since the beginning of PISA in the year 2000, other topics, such as teacher evaluation, have been included in later cycles, with the topic of quality assurance in education systems being surveyed only since 2012. The items allow for a distinction between the different levels of education systems, that is, the student, the school and the state. The functions of accountability, however, can only be differentiated since PISA 2012. The framework for context questionnaires in PISA 2012 distinguishes assessment and feedback at the classroom level, evaluation at the level of schools, and accountability and evaluation policies at the system level (OECD, 2017: 175). Consequently, additional indicators were introduced in the 2012 assessment, focusing on quality assurance measures. The context questionnaires in PISA 2015 explicitly addressed the topics of assessment, evaluation and accountability, because this area was rated as being of high importance to policy makers of the participating countries (Bayer et al., 2017). Accordingly, the theoretical framework distinguishes standardized and less standardized practices, internal and external evaluations, as well as means of quality management in schools, such as teacher evaluation (Klieme and Kuger, 2016). Furthermore, across these measures, the purposes, practices and the use and consequences of assessment, evaluation and accountability are described. In order to report trends, the PISA 2015 database includes measures on general assessment practices and results, external evaluation and teacher evaluation which have been used in previous cycles. In addition, indicators on internal school evaluation and formative, classroom-based assessment have been included.
Given the change in topics and foci over time, it is difficult to assign the PISA items to existing theoretical models or conceptualizations of assessments and accountability as presented in the second section above. Following the classification of data sources, functions and objects of monitoring and evaluation (Scheerens et al., 2003), it is mostly possible to identify the respective object of monitoring and evaluation addressed in the questions. Further, with regard to data sources, PISA provides information on data collected through tests and assessments and through systematic inquiry and review, but does not cover the utilization of administrative data and statistics. Most significantly, Table 1 shows that the majority of items cannot be distinguished according to the function of monitoring and evaluation. Scheerens et al. (2003) differentiated three purposes: accountability, improvement and accreditation. Only for items 6, 7, 8, 12, 18, 19, 20 and 23 listed in Table 1 is it possible to determine a purpose; however, often there is no clear distinction between the purpose of accountability and the purpose of improvement.
Although it is not possible to categorize the PISA items on assessments and accountability according to existing theoretical concepts, we grouped the items into six topical areas. We distinguish between items focusing on standardization and benchmarking policies, items reflecting processes of marketization, items describing accountability at the school level, items measuring (vertical) teacher accountability, and items focusing on internal and external evaluation, respectively. Table 3 shows the classification of the items.
Table 2. Results of international trend analyses.
Note: RSD: relative standard deviation; Min.: minimum; Max.: maximum.
For our subsequent analyses of changes in assessment and accountability practices, we excluded items 2 to 8, as they represent rather traditional features of schooling and student accreditation which are not relevant for assessing the ‘new’ levels and purposes of assessments and accountability in education. Items 24 and 30 likewise do not relate to evaluation and accountability in our sense and are left out in the following.
Methodological approaches
For the following analyses, we draw on indicators derived from the PISA school questionnaires of the years 2000, 2003, 2006, 2009, 2012 and 2015. For the calculation of the indicators, the school level datasets have been merged with the student level data into a hierarchically structured dataset (in which students are nested in schools, which are nested in countries). Just as the PISA competence scores are aggregated at the country level in order to build league tables, information from the context questionnaires can be aggregated at the country level in order to generate indicators of national education systems. For example, students have been asked whether they attended preschool. Building on this variable, it is possible to compute the proportion of students in a country that attended preschool – as an indicator of the prevalence of pre-primary education. To give a second example: school heads have been asked whether their school is public or private. From this variable we can compute the percentage of students attending a private school in a country – as an indicator of institutional privatization.
Since PISA’s sampling strategy results in a representative sample of 15-year-old students in a country, the aggregation has to be done with the student level data, applying the student weights. Further information on sampling can be found in the technical reports for each cycle (OECD, 2014).
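To make the aggregation step concrete, the following minimal sketch shows how such a country-level prevalence could be computed with pandas. It is an illustration under assumptions rather than the authors’ actual code: the column names cnt and private are hypothetical, while w_fstuwt stands for the final student weight shipped with the public PISA data files.

```python
import pandas as pd

# Minimal sketch: compute the weighted share of students per country who
# attend schools with a given (0/1) practice in place. Column names are
# illustrative; 'w_fstuwt' stands for the final student weight included
# in the public PISA data files.
def country_prevalence(students: pd.DataFrame, indicator: str,
                       weight: str = "w_fstuwt") -> pd.Series:
    """Weighted country-level prevalence of a school practice, in percent."""
    grouped = students.groupby("cnt")
    share = grouped.apply(
        lambda g: (g[indicator] * g[weight]).sum() / g[weight].sum()
    )
    return share * 100  # interpret as percentage of students

# e.g. share of students in private schools per country (hypothetical column):
# private_share = country_prevalence(students, indicator="private")
```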
Two steps have to be carried out before performing the aggregation. First, in order to enable comparisons over time, some of the items have to be recoded. For example, in 2000 the answering categories for item number 1 on the use of standardized assessments were: Never; Yearly; 2 times a year; 3 times a year; 4 or more times a year. In 2003, 2009, 2012 and 2015 the categories were: Never; 1 or 2 times a year; 3 to 5 times a year; Monthly; More than once a month. After recoding, three categories remain: Never; 1 or 2 times a year; 3 or more times a year. In a second step, this recoded variable with three values has to be transformed into three dummy variables in order to calculate the (weighted) average of each dummy variable per country. These average values indicate the proportion of students in a country attending schools which have implemented a respective practice (e.g. conducting standardized assessments one or two times a year). In order to interpret them as percentages, we multiplied these values by 100.
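The recoding and dummy-coding steps for item 1 could be sketched as follows, assuming the 2003 and later response labels; the 2000 wording (‘Yearly’, ‘2 times a year’, etc.) would need its own mapping, and all labels and column names are illustrative stand-ins rather than the exact codes used in the PISA files.

```python
import pandas as pd

# Sketch of the recoding step for item 1 (frequency of standardized
# assessments), assuming the 2003+ response labels. Labels and column
# names are illustrative, not the exact codes in the PISA files.
HARMONIZED = {
    "Never": "never",
    "1-2 times a year": "1-2 times a year",
    "3-5 times a year": "3 or more times a year",
    "Monthly": "3 or more times a year",
    "More than once a month": "3 or more times a year",
}

def recode_item(students: pd.DataFrame, item: str) -> pd.DataFrame:
    """Collapse answer categories and expand them into 0/1 dummy columns."""
    freq = students[item].map(HARMONIZED)
    dummies = pd.get_dummies(freq, prefix="std_assess").astype(int)
    # The dummy columns can then be aggregated with country_prevalence()
    # from the previous sketch, which already scales them to percentages.
    return pd.concat([students, dummies], axis=1)
```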
The aggregated variables derived from the school level questionnaires represent meaningful indicators of education systems and their characteristics because they reflect actual practices, whereas formal policy statements might not ‘arrive’ in schools. Depending on how widely a certain practice has proliferated across schools in a country (e.g. school autonomy), our indicators can be interpreted as implemented national policy.
While we needed the student level data for aggregating the country level indicators, we ignored the student and school level information for our comparative analyses, for which we only used the indicators measuring the percentage of students in a country attending schools with a specific practice in place. Based on the country level data, we computed mean values across 20 OECD countries with valid measures for every year in which a respective indicator was surveyed. We also report standard deviations, relative standard deviations, and minimum and maximum values in order to assess the variation of assessment and accountability practices across our sample. For more in-depth analyses, we selected one indicator from each of our six thematic areas and plotted its values for every country. The plots contain the country level measurements for 2015 as well as the difference between the value of 2015 and the respective first measurement for each indicator (e.g. 2000). This difference can be interpreted as a change score, indicating the magnitude of increases or decreases of a certain assessment or accountability practice in a country. Further, we conducted a hierarchical cluster analysis with the six indicators of 2015 and their respective change scores. The cluster analysis helped to identify groups of countries exhibiting a similar pattern of (change in) assessment and accountability practices.
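Given a resulting country-by-year panel for one indicator, the descriptive statistics and change scores reported in the next section could be derived along the following lines. This is again a hedged sketch with assumed names and data structures, not the original analysis code.

```python
import pandas as pd

# Sketch: summary statistics and change scores, assuming 'panel' is a
# country-by-year DataFrame for one indicator (rows: countries, columns:
# survey years), e.g. as built from country_prevalence() per cycle.
def describe_indicator(panel: pd.DataFrame) -> pd.DataFrame:
    stats = pd.DataFrame({
        "mean": panel.mean(),   # per-year mean across countries
        "sd": panel.std(),      # per-year standard deviation
        "min": panel.min(),
        "max": panel.max(),
    })
    stats["rsd"] = stats["sd"] / stats["mean"] * 100  # relative SD, in %
    return stats

def change_score(panel: pd.DataFrame, last: int = 2015) -> pd.Series:
    """Value in 2015 minus the first available measurement, per country."""
    first_year = min(panel.columns)
    return panel[last] - panel[first_year]
```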
Results
Table 2 shows the mean values and standard deviations per year. A change in the mean and standard deviation can be interpreted in such a way that there has been an increase or decrease in the prevalence of a respective practice (mean value) and a trend towards more or less similarity between countries (standard deviation). For example, the mean values for the use of standardized assessments show that the average proportion of students attending schools using standardized assessments one or two times a year increased across the 20 OECD countries between 2000 and 2015 (from 46% to 63%). At the same time, the standard deviation decreased, indicating a trend of convergence among the countries in our sample with regard to this indicator. The relative standard deviation gives the extent of variability in relation to the mean, with larger values indicating larger variability.
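Expressed as a formula, and assuming the standard definition is the one underlying Table 2, the relative standard deviation is simply the standard deviation expressed as a percentage of the mean:

```latex
\mathrm{RSD} = \frac{\mathrm{SD}}{\bar{x}} \times 100
```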
Table 2 reveals that, of 22 indicators, 21 show a trend towards more homogeneity across countries, as the (relative) standard deviations decreased. With regard to the average values, only one indicator (number 20, teacher accountability, measured in 2006 and 2009) shows a stable trend for the mean value; all other indicators are increasing. The minimum values likewise show increasing trends over time for all but three indicators. The largest increases in terms of average prevalence can be observed for indicators 9 and 13, that is, the use of assessment data for comparisons with other schools or with regional/national performance. Likewise, some indicators of teacher evaluation show substantially increasing trends.
Figures 1(a) to (c) illustrate the change in average prevalence (mean values) for all indicators graphically. For these graphs, the indicators have been sorted by size, starting with the least prevalent practices (Figure 1(a)). Figure 1(a) shows that in 2015 the least prevalent practice was the ‘use of standardized assessments 3 times or more often a year’; this practice has increased from 11% to 24% since the year 2000. All lines rise steadily, except for the indicator ‘Achievement data are posted publicly’, which had a slightly higher value in 2012 than in 2015. The figure also shows the stable trend for teacher accountability by achievement data.

Figure 1. Change of assessment and accountability practices in the OECD.
Figure 1(b) illustrates the trend of the moderately prevalent practices. All lines increase steadily, with a rather stable trend since 2012 for the indicator ‘Achievement data are tracked by an authority’. Quality assurance in terms of a standardized policy for mathematics shows only a small change between its two measurement points.
Figure 1(c), finally, depicts the trend of the practices that were most prevalent in 2000. All lines show a steady increase; only ‘use of standardized assessments 1 or 2 times a year’ and ‘use of assessment data to monitor the practice of teachers’ had lower values in 2009 than in 2003, but both have increased since then.
In the following paragraphs we take a closer look at selected indicators in order to assess variability across countries. We chose one typical indicator for every domain, as shown in Table 3.
Table 3. Topical areas of items on assessment and accountability practices in Programme for International Student Assessment school questionnaires.
Standardization
Data on educational outcomes are often measured in relation to national educational standards, which represent educational goals in a country and define the knowledge that students are expected to have acquired at a specific stage of their education (Koeppen et al., 2008; Shepard, 2006). Figure 2 illustrates the practice of using school level assessment data for comparisons with regional or national performance. We interpret this variable as an indicator of standardization in a country. The figure shows the absolute values for PISA 2015 for every country in our sample (dark bars), as well as the difference between the respective value in 2015 and the first measurement of this indicator in PISA 2000 (light bars).

Figure 2. Assessments used for comparisons with regional or national performance, 2015 and change score.
The graph in Figure 2 shows that all 20 countries in our sample had higher values for using assessment data to compare with national performance in 2015 than in 2000. Table 2 has already displayed the substantial variation of the absolute values across countries; Figure 2 now also illustrates the variation in change rates. Some countries with low values in 2000 show high increases (e.g. Luxembourg, Denmark), and some countries with high values in 2000 show only small increases (e.g. the UK, the USA), but there are also countries with still rather low values in 2015 and only a moderate increase (e.g. Greece). In Luxembourg, almost every school uses assessments for comparisons; in Greece, this is the case for only slightly more than every fifth school.
Marketization
The public posting of achievement data of schools reflects the degree of marketization and transparency of school performance in a country. For this indicator, the observed time span is only nine years. The pattern of change across countries is more diverse for this indicator, as we found some countries which scaled back marketization practices in this respect, although the average across all countries increased over time. Comparing change rates and absolute values in 2015 reveals that lower starting values do not correspond with higher increases: Belgium and Finland, for example, had relatively low values in 2006 and do not show substantial change over time. Also, Switzerland had comparably low values in 2006 and exhibits only very small increases. There is no country in which all schools report posting their achievement data; the highest prevalence of this practice is measured in the USA (92%), the lowest in Belgium (3%).
School accountability
As an indicator of school accountability, we assessed the monitoring of school achievement data by educational authorities. Here, about half of the countries show only small change rates, and the relation between declines and rises is more balanced. Reflecting the overall increase, some countries with already relatively high values in 2006 experienced rather substantial increases (e.g. Portugal and Korea). In the USA, every school is monitored by educational authorities, whereas schools in Italy are seldom monitored in this respect.
Teacher accountability
With regard to teacher accountability, Italy is an interesting case, with substantially lower values in 2015 than in 2000; however, it is not the only country which has reduced this form of accountability. Luxembourg, by contrast, exhibits the largest increase relative to its values in 2000. In the UK, this form of accountability is most prevalent; Germany is the country with the lowest figure. Overall, teacher accountability is among the less prevalent accountability practices.
External evaluation
In contrast to teacher accountability, the use of external evaluation as a means of quality assurance at schools is a comparatively widespread phenomenon. Almost all countries show increasing rates between 2012 and 2015, the only two years for which measurements are available. Greece is the country with the lowest prevalence of external school evaluation (21%), while Portugal has the highest rate in this respect (97%). Again, Luxembourg shows the largest increase.
Internal evaluation
Internal evaluation is a less control-driven form of accountability, which is more related to improvement and organizational learning. The graph in Figure 3 shows the country values for the prevalence of teacher peer review. Iceland has the lowest values in 2015 and has experienced a decreasing trend since 2000. Finland, the country with the highest PISA scores in 2000 but with declining student outcomes since then, also shows a decreasing trend.

Figure 3. Teacher peer review.
In comparison, teacher peer review is fully established in Korea, where this practice was already very common in 2003. The biggest increase is observable for Sweden.
Cluster analyses
In a last step, we conducted a hierarchical cluster analysis using the six selected indicators from Figures 2 to 7 and their respective values from 2015 to examine similarities among the 20 countries in our sample. We applied Ward’s linkage procedure and the Duda–Hart Je(2)/Je(1) stopping rule (Duda and Hart, 1973) to determine the number of clusters. The cluster analysis revealed four clusters of countries with similar patterns of assessment and accountability practices. Table 4 gives an overview of the classification of countries across the four clusters. Figures 2 to 7 help describe the characteristics of the clusters.
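The clustering step could be reproduced roughly as follows. This is a sketch under stated assumptions: X is a random placeholder for the 20 × 6 matrix of the selected 2015 indicator values, the column standardization is our addition (the paper does not state whether the indicators were standardized), and since SciPy offers no built-in Duda–Hart stopping rule (it is available via Stata’s cluster stop), the tree is simply cut at the four-cluster solution that the rule selected.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Placeholder for the real 20 x 6 matrix of 2015 indicator values
# (rows: countries, columns: the six selected indicators).
X = np.random.default_rng(0).normal(size=(20, 6))

# Standardize columns so no indicator dominates the Euclidean distances
# (an assumption; the paper does not spell this step out).
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

Z = linkage(X_std, method="ward")  # Ward's minimum-variance linkage

# Cut the tree directly at the four-cluster solution that the
# Duda-Hart Je(2)/Je(1) rule selected in the paper.
labels = fcluster(Z, t=4, criterion="maxclust")
```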

Figure 4. Public posting of school achievement data, 2015 and change score.
Figure 5. Monitoring of achievement data.
Figure 6. Teacher accountability.
Figure 7. External evaluation as a means of quality assurance.
Table 4. Clusters of assessment and accountability practices.
Cluster 1 is the largest cluster, with seven rather diverse countries. It is characterized by comparably high average values for all indicators, except for standardization and external evaluation. This cluster thus seems to have established comprehensive evaluation and accountability measures, but less strict external standards and control mechanisms. Cluster 2 consists of two liberal welfare states, the UK and the USA. Taken together, these two countries exhibit the highest values on all selected indicators. This is not surprising, given that liberal welfare states have a long tradition of marketization and decentralization, and accordingly have well-established strategies of exerting control and implementing standards. Several works have assessed accountability policies in these two countries (see, for example, Cumming, 2012; Scoppio, 2012). In both countries, educational reforms implemented in the 1980s and 1990s focused on effectiveness and accountability measures, which may be reflected in the current results: the current level of assessment is high, while the change in these policies over time is comparably small.
Cluster 3 mainly includes the continental welfare states (Ebbinghaus, 2012) but also Finland, Italy and Greece. The average values of this cluster are the lowest for all but one indicator (internal evaluation). It is again not surprising that continental welfare states show lower values on our indicators of assessment and accountability practices, as the ideas of marketization, choice and autonomy are less prevalent in public institutions in these states. However, our analyses showed that these countries also experienced increases in assessment and accountability practices. In Austria, Switzerland and Germany, the publication of the TIMSS 1997 and PISA 2000 results was followed by thorough educational reforms which included standards-based accountability measures at different levels (Bieber, 2014; OECD, 2017). Accordingly, the increase in accountability measures over time is rather high in these countries.
The fourth cluster consists of Spain, Ireland, Iceland and Luxembourg. It is characterized by rather low values for all indicators, but differs from Cluster 1 in terms of higher values for standardization and external evaluation. Thus, countries in this cluster experience stricter control in terms of external standards.
Overall, the cluster analysis does not suggest that the clusters are distinguished by specific constellations of our indicators (e.g. high marketization combined with low internal evaluation). Rather, differences in the overall levels of assessment and accountability practices seem to distinguish the four clusters. With the exception of standardization and external evaluation, the areas (marketization, school accountability, teacher accountability and internal evaluation) seem to be interrelated.
Results beyond OECD countries
We limited our country sample to those countries which had valid measures on all indicators and years of measurement. These are all OECD countries, as only these participated in all PISA survey rounds. However, one could ask whether our findings generalize beyond OECD countries. In order to approach this question, we selected six further non-OECD countries and economies which participated in PISA 2000, 2009, 2012 and 2015: Bulgaria, Brazil, Hong Kong, Indonesia, Latvia and Thailand. We repeated our analyses with this smaller sample and time span and found comparable results. Of the 19 indicators available for these four years, 17 showed an increasing trend of the average prevalence and a decreasing trend of heterogeneity between countries.
Discussion
In this paper, we have presented an approach to analysing trends in assessment and accountability practices in education across countries and time with data from the OECD PISA study. It has to be noted that analyses over time are not straightforward with PISA, as the questionnaires change between PISA cycles, leading to different coverage of specific indicators across the survey rounds. However, our approach of harmonizing PISA items from different questionnaires still provides the most comprehensive time-series indicators on assessment and accountability practices at secondary school level available so far. Our descriptive analyses for 20 OECD countries showed an increasing trend for the vast majority of the generated indicators, indicating that assessments and the use of assessments for purposes of accountability, but also the implementation of more horizontal, qualitative evaluation procedures, increased within the larger part of the OECD during the last 15 years.
This finding supports the hypothesis that the significant public attention raised by ILSA results at the beginning of the 21st century might have increased the legitimacy of evidence-based policy-making and accountability in education. The presentation of PISA results in the form of league tables evokes a competition for the ‘best education system’. In the search for examples of best practices and efficient policies, elements of the New Public Management Paradigm infiltrated education policy-making (Sahlberg and Hargreaves, 2011). When countries adopted educational reform projects in this direction, they often referred to the results of PISA and comparable projects in order to legitimize the reforms (Feniger and Lefstein, 2014; Heyneman and Lee, 2014; Sellar and Lingard, 2013). These reforms called for an evaluation of their success, thereby inducing pressure for further assessments. Further, new technical solutions for data processing and data management facilitated more sophisticated assessment and monitoring procedures (Anagnostopoulos, 2013). Taken together, these developments resulted in an increase of assessment, accountability and evaluation at secondary school level, as our results have shown. PISA, the OECD or other testing projects alone are certainly not the primary causes of these trends. Rather, PISA is a symptom of a new international governance – or even a world polity (Meyer and Ramirez, 2000) – in education. The results for non-OECD countries also support this notion, as they exhibit the same trends.
Further, our cluster analyses revealed that more vertical, control-driven forms of assessments and accountability do not seem to crowd out more peer-oriented forms of evaluation for improvement. Rather, both kinds of practices increased over time, reflecting an increased pressure on education systems to raise quality and efficiency. Further research should thus examine which forms of assessments and evaluation are related to educational outcomes. The discourse on the effects of ILSAs and a ‘new accountability regime’ could be substantiated if results showed that certain assessment and accountability practices are ineffective or actually harmful in terms of learning outcomes. Longitudinal analyses of PISA data as suggested in this paper also bear the potential to overcome problems of conventional, cross-sectional analyses of educational outcomes and their determinants (Teltemann and Schunck, 2016).
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
