Abstract
The Australian government spends millions of dollars funding new programs every year. Taxpayers, policy makers, school leaders, teachers, and students need to know whether these programs are good. Legislation ensures they are evaluated, but do those evaluations report what good looks like, how good these programs are, and for whom? This research sought an answer to that question by analysing publicly available educational evaluations using a new conceptual framework that integrated the logic of evaluation and evaluative reasoning. Both are essential to making a credible, valid, and defensible claim about how good something is: the logic of evaluation makes the judgement legitimate, and evaluative reasoning justifies it. We used our framework to examine 37 reports via an adapted systematic quantitative analysis method. Only four provided a legitimate and justified evaluative judgement; the rest we categorised as research, not evaluation. Based on our findings, we propose an updated conceptual framework, which we call the 'stairways to heaven', that clarifies the steps for evaluation in comparison with research. The evaluation stairway clarifies the logic, justifications, and their relationship, integrating current resources for evaluation practice. It can be used by evaluators, evaluation commissioners, and users to clarify when evaluation is needed and to commission genuinely evaluative evaluations that connect values and data to decision-making to drive positive social change.
What We Already Know
The difference between evaluation and research is centred on understanding context and ascribing value, merit, or worth to an evaluand or aspects of it. Theoretically, evaluations should work through a process of the logic of evaluation to make legitimate and justified arguments. Evaluative reasoning is necessary to surface arguments and ensure that evaluations are credible and defensible.
The Original Contribution the Article Makes to Theory and/or Practice
Surfaces the absence of evaluative reasoning in publicly available evaluations of Australian educational sector programs and problematises its impact. Uses a new integrated logic of evaluation and evaluative reasoning framework to visualise how they interact to build a credible, valid, and defensible argument. Proposes the ‘stairways to heaven’ conceptual model to support future evaluation practice.
Introduction
Education is a fundamental human right, essential for equitable participation in society, and a lever for reducing inequality (Mathison, 2010; United Nations, 1948, 2020). In Australia, the Federal Government spends billions annually on educational interventions like the Better and Fairer Schools Initiative. The 2025–2026 budget provided $32.2 billion through it to fund efforts across school sectors targeted at vulnerable students, including the Consent and Respectful Relationships Program ($20.4 M) and the National Student Wellbeing Program ($61.4 M) (Australian Government, 2025). Taxpayers, communities, students, teachers, and schools all have a right to know what good ‘looks like’ for education interventions like these, whether good is happening, and to what extent. Theoretically, evaluation can provide both learning and accountability to address this need (Mathison, 2010; Ni, 2010; OECD, 2013). The Australian government has both a stewardship responsibility and a legislative requirement (Australian Government, 2013; Australian Government The Treasury, n.d.) to provide evaluation of these interventions.
Our study started within this context, wondering whether evaluation in Australia is answering the question ‘What does good education look like?’ In situations like education, where stakes are high due to budget, legislation, and stewardship, stakeholders need to be sure that judgements about goodness are credible, valid, and defensible. Within evaluation, following the process of evaluative reasoning results in judgements with those qualities. To answer our question, we needed a framework to recognise evaluative reasoning and a sample of evaluations to study. In this article, we describe and operationalise the concept of evaluative reasoning based on the literature and empirical research, report our findings from using it to analyse publicly available evaluation reports from 2014 to 2024 in the context of education in Australia, discuss the implications for public policy and practice, and propose a revised framework to facilitate evaluative reasoning in future practice.
Conceptual Framework for Evaluative Reasoning
Evaluation has often been defined as the process of generating judgements of merit, worth, and significance (Fitzpatrick et al., 2023; Owen, 2020; Patton, 2018a, 2020; Schwandt, 2008; Scriven, 1995; Stufflebeam & Coryn, 2014). In plain language, this means answering the question ‘what does good look like?’ for the item being evaluated (Gullickson, 2020). Evaluative reasoning (Davidson, 2014a; Gullickson, 2020; House, 1977; Hurteau et al., 2009; Meldrum, 2022; Nunns, 2016; Nunns et al., 2015) is necessary to make a valid claim about the goodness of something. It resides in the literature alongside two other concepts: evaluative attitude and evaluative thinking. Evaluative reasoning, attitude, and thinking all interact in the development of an evaluation and should be reflected in its ‘output’ (usually a report). We begin by clarifying the relationship among them to establish the space for our framework.
Davidson (2005) presented evaluative attitude as the personal orientation of the evaluator: ‘all serious criticism’ is valuable because it allows you to correct and/or clarify. Knowing this, and actively seeking out such criticism, is central to being a good evaluator; after all, useful criticism is what we sell, so seeking it out ourselves is ‘walking the talk’ (p. 35). This disposition is an essential first step to critical thinking in general, which is important for evaluators, commissioners, program staff, and others (Abrami et al., 2008). Evaluative attitude is particularly important for leaders, as their orientation to criticism influences the culture and practices throughout the organisation – regardless of official policy (Friedman, 2007; Gullickson, 2010; Schein, 2010).
Several authors have defined evaluative thinking as the application of critical and other types of thinking in an evaluative context (Archibald, 2024; Buckley et al., 2015; Cole, 2023; Paproth et al., 2023; Patton, 2018b; Vo et al., 2018). An evaluative context includes an organisation, team, or program; the needs and strengths of the contexts in which they are operating; the way they have framed the needs or problem they want to address in that context; the goals, policies, and programs that have been designed and created in response; and how those solutions are implemented and the results they produce. Across all these areas, answering the question of ‘what good looks like’ implies adding data and critical thinking at the individual, team, and/or organisational level to the prerequisite evaluative attitude. It also requires robust arguments about what good looks like, which in turn requires evaluative reasoning.
Evaluative reasoning is at the centre of Clinton and Hattie’s (2021) discussion of evaluative thinking: ‘a higher order, cognitively complex notion, grounded in procedural logic combining relevant values with nonevaluative data to achieve the task of making an evaluative judgement’ (p. 102006). Evaluative reasoning establishes the logic of evaluation as the procedural logic necessary to generate a value judgement (Fournier, 1995b; Scriven, 1991, 1994, 2007). Doing evaluative reasoning requires the knowledge and skill to bring together relevant values with evidence in a series of connected arguments that together make a legitimate and defensible claim about the goodness of something. Evaluative reasoning is the focus of this article and thus the basis of our theoretical framework.
The definitions above have implications for skills, knowledge, and action related to how taxpayer dollars are spent in education. Those doing evaluation need to have an evaluative attitude and be able to recognise, do, and deliver quality evaluative thinking and evaluative reasoning. Potentially, those commissioning and using evaluation also need to have an evaluative attitude, recognise and do evaluative thinking, and recognise the presence or absence of evaluative reasoning and, ideally, its quality. In all cases, evaluative reasoning is fundamental. In the next sections, we further discuss evaluative reasoning: what it is, its history and development, and existing research on it.
What is Evaluative Reasoning?
Evaluative reasoning involves using the logic of evaluation and supporting arguments to make a credible, defensible, and valid judgement about the goodness of something. Credible, defensible, and valid are terms from the literature (Table 1), and the text below provides plain language summaries of each:
• Credible – inspiring belief; believable in context, with conflicts and limitations declared; and clarity about how factual evidence is gathered, with limitations stated (Davidson, 2005; Fournier, 1995a; House, 1977, 1980; House & Howe, 1999; Hurteau et al., 2010; Hurteau & Williams, 2014; Nkwake, 2015; Nunns et al., 2015; Scriven, 1981).
• Defensible – supported by argument; well-reasoned (robust, sound, persuasive, legitimate, and justified) (Davidson, 2005; Fournier, 1995a; Fournier & Smith, 1993; Nunns et al., 2015; Scriven, 1981).
• Valid – extent to which appropriate conclusions are derived (data + argument) (Davidson, 2005, 2014b; House & Howe, 1999; Macklin & Gullickson, 2022; Nkwake, 2015).
Table 1. Definitions from the literature.
Credible and valid have both been discussed extensively in the literature, as evidenced by the citations accompanying the definitions above and in Table 1 below. Defensibility has received less attention, so we describe it in more depth.
Defensibility
Defensibility is defined as ‘able to be supported by argument’ (https://dictionary.cambridge.org/dictionary/english/defensible). In the case of evaluation, the arguments are practical not theoretical (House, 1980), and well-reasoned and well-evidenced (Davidson, 2014a). The evaluation literature describes five characteristics of defensibility.
• Robust (data + argument). Definition: strongly formed (https://www.merriam-webster.com/dictionary/robust). Authors: Davidson (2014b); Nunns et al. (2015).
• Sound (data + argument). Definition: free from error, fallacy, or misapprehension (https://www.merriam-webster.com/dictionary/sound). Authors: Arens (2005); Davidson (2014a, 2014b); Fournier and Smith (1993); Valovirta (2002).
• Persuasive (data + argument). Definition: making you want to do or believe a particular thing (https://dictionary.cambridge.org/dictionary/english/persuasive). Authors: House and Howe (1999); House (1977, 1979); Valovirta (2002).
• Legitimate (logic). Definition: able to be defended with logic or justification; valid (https://languages.oup.com/google-dictionary-en/). Authors: Fournier (1995a, 1995b); Fournier and Smith (1993); House and Howe (1999); Hurteau et al. (2009).
• Justified (argument). Definition: having a good reason for something (https://dictionary.cambridge.org/dictionary/english/justified). Authors: Davidson (2014a); Fournier and Smith (1993); Gullickson (2020); Hurteau et al. (2009); Smith (1987).
We observed that often (i) these characteristics have been used without an associated definition or rationale and (ii) the definitions are not commensurate with each other. Consequently, we offer the following systematised and clarified language. To be defensible, the quality of the evidence and analysis, and the quality of the argument, need to be robust, persuasive, and sound. Defensible evidence and analysis must meet standards for validity (research) and credibility (context). A defensible argument must be legitimate and justified:
• To be legitimately categorised as evaluation, the logic used in the argument must be the whole logic of evaluation.
• To be justified, the steps in the logic must be warranted with good reasons.
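To make the two tests concrete, the sketch below is an illustration of our distinction only, not part of the study’s method; the element names and sets are hypothetical. It shows how ‘legitimate’ (the whole logic is present) and ‘justified’ (each logic step is warranted) could be checked for a single report.

```python
# Illustrative sketch only: element names are hypothetical and the checks are
# simplified versions of the two tests described above.

LOGIC_OF_EVALUATION = {"criteria", "standards", "performance_measurement",
                       "synthesis", "judgement"}

def is_legitimate(elements_present: set) -> bool:
    # Legitimate: the whole logic of evaluation appears in the argument.
    return LOGIC_OF_EVALUATION <= elements_present

def is_justified(elements_present: set, warranted_elements: set) -> bool:
    # Justified: every logic step that appears is backed by a warrant (good reasons).
    return (elements_present & LOGIC_OF_EVALUATION) <= warranted_elements

report = {"criteria", "standards", "performance_measurement", "synthesis", "judgement"}
warrants = {"criteria", "standards", "synthesis"}
print(is_legitimate(report))           # True: all logic steps are present
print(is_justified(report, warrants))  # False: measurement and judgement lack warrants
```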
As alluded to above, the concepts defensible, valid, and credible relate to each other. Credible and defensible overlap because context influences what counts as evidence and what logic must be applied to that evidence to make a justified argument. Defensible and valid overlap in relation to argument – its quality (defensibility) and its claims (validity). Valid and credible overlap because context influences what counts as a warrant (i.e. this is true because…). Compared with these concepts which have been frequently discussed in the evaluation literature, evaluative reasoning is comparatively thinly theorised. In the following section, we explore how the concept of evaluative reasoning has developed and operationalise it for research.
Operationalising Evaluative Reasoning
There has been significant expansion of Scriven’s (1981) original conceptualisation of the (general) logic of evaluation, with additional steps that now constitute evaluative reasoning. Fournier (1995b) was the first to integrate the general logic with what she termed ‘working’ logic. This was also the first time that synthesis was identified as part of what needed to be done to reach an evaluative judgement. Subsequently, Davidson (2005, 2014a) substantially operationalised synthesis by identifying three different levels: micro, meso, and macro. She also proposed an approach to conduct synthesis at the different levels and their relative contribution to an evaluative judgement. Finally, Nunns (2016), Nunns et al. (2015), and Gullickson (2020) developed significant argument around the importance of warranting in an evaluation. Nunns et al. (2015) proposed different types of warrant (literature, cultural, methodological, expert, and authority) that could be used to substantiate the evidence in an evaluation, while Gullickson (2020) expanded warranting to include its role in fully describing and justifying all aspects of an evaluation.
Figure 1 displays how evaluative reasoning interacts with the logic of evaluation in the literature. To differentiate between the logic and evaluative reasoning in Figure 1, the four-step logic, as it is most often cited, is highlighted in purple boxes. The theoretical development of synthesis is highlighted in orange boxes and evaluative reasoning elements in green. The blue boxes highlight the addition of evaluative evaluation questions or key evaluative questions, which are currently located outside both the commonly cited logic and evaluative reasoning. A single directional arrow indicates that the elements build upon each other; however, several authors (Davidson, 2014a; Gullickson, 2020; Hurteau & Williams, 2014) have indicated that evaluative reasoning is an iterative process.
Literature support for the elements in the authors’ coding framework.
All these elements are necessary in an evaluation because, when evaluators use the logic of evaluation, they provide a legitimate evaluative judgement. When they justify the logic, by providing warrants and their associated backings, they provide a defensible, credible, and valid argument. In doing so, evaluators justify both the criteria and standards and conduct a synthesis integrating data about evaluand performance with standards. Fundamentally, however, synthesis is where the logic of evaluation and evaluative reasoning come together. This is because the logical process required for evaluation asks for the facts about evaluand performance to be combined with the standards (Davidson, 2014a; Fournier, 1995a; Gullickson, 2020; Scriven, 1981, 1991).
Existing Research on Evaluative Reasoning
Summary of the Empirical Evidence on Evaluative Reasoning.
The population of empirical studies has contextual and theoretical limitations. Contextually, they cover a small sample of the global evaluator/evaluation community. Nunns and colleagues (2015) examined reports from Aotearoa New Zealand; Ozeki and colleagues (2019) had respondents from only one of the 21 global professional evaluation associations (Better Evaluation, 2018). Hurteau and colleagues (2009) drew their sample from the ERIC database, which could mean the reports came from anywhere in the world; they provided no demographic information. Davidson (2014b) did not divulge the context of the six reports that she reviewed. On the theory side, three of the studies had important limitations related to our inquiry, although they built on each other. Hurteau and colleagues (2009) did not identify warrants in their model. Nunns et al. (2015) included warrants but not evaluative synthesis. Ozeki et al. (2019) covered the logic of evaluation by asking respondents to quantify their knowledge and practice through self-assessment, which has known flaws, including a lack of evidence to support the claims made (Wildschut et al., 2024).
Our Framework
Our study addresses gaps in previous evaluative reasoning research by providing a complete sampling frame (date range, sector, and context). The framework developed to analyse the data also addresses previous gaps in evaluative reasoning steps, specifically micro-, meso-, and macro-synthesis and warranting. The following paragraphs outline how we developed our framework.
We adapted Nunns and colleagues’ (2015) conceptual framework because it covered the most aspects of evaluative reasoning. We explicitly included the elements of the expanded logic of evaluation and evaluative reasoning. The expanded logic included evaluation questions or objectives, and additional steps for synthesis: micro-, meso-, and macro-synthesis, initially proposed by Scriven (1991) and expanded by Davidson (2014a). Warranting was added at several key places in the process, where steps in the logic need to be justified. The resulting conceptual framework supporting the provision of legitimate and justified evaluative judgement in evaluation reports is illustrated in Figure 2 below. The following discussion outlines the conceptual framework and provides definitions for each element of it.
Figure 2. The integrated logic and evaluative reasoning conceptual framework.
The integrated logic and evaluative reasoning conceptual framework, a combination of Scriven’s (1981) logic of evaluation and evaluative reasoning steps, consists of nine steps. Elements constituting the logic of evaluation (elements 1, 2, 4, and 6–9) are outlined in purple. Micro- and meso-synthesis have been integrated into the framework as elements seven and eight. Macro-synthesis is included in element nine, although, as Davidson (2014a) indicated, macro-synthesis rarely occurs. However, it is essential that all evaluations reach evaluative conclusion(s) and/or a judgement about evaluand performance. Elements constituting evaluative reasoning are outlined in orange. To justify is a synonym for warranting, so in the context of this discussion, to justify an argument, the evaluator needs to provide an appropriate warrant for it. Warranting an argument is represented as a t-shape (elements 3 and 5) and an oval underneath the micro- and meso-synthesis elements. Warranting synthesis has not been given a number under the micro- and meso-synthesis elements because it needs to occur for both, and allocating a number to it would prioritise the evaluative reasoning that needs to occur at this point, where inferences are being made.
The Authors’ Coding Framework.
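As a reading aid only, the sketch below encodes the nine elements just described; the element labels are paraphrases of Figure 2 rather than its exact wording, and the data structure is illustrative, not part of the coding framework itself.

```python
# Paraphrased encoding of the Figure 2 elements (labels are approximate, for
# illustration only). Elements 1, 2, 4, and 6-9 constitute the logic of
# evaluation; elements 3 and 5 (and the unnumbered warrant beneath the
# synthesis elements) constitute evaluative reasoning.

FRAMEWORK = [
    (1, "evaluation objectives / key evaluation questions", "logic"),
    (2, "criteria",                                          "logic"),
    (3, "warrant the criteria",                              "reasoning"),
    (4, "standards",                                         "logic"),
    (5, "warrant the standards",                             "reasoning"),
    (6, "measure evaluand performance",                      "logic"),
    (7, "micro-synthesis (warranted)",                       "logic"),
    (8, "meso-synthesis (warranted)",                        "logic"),
    (9, "evaluative conclusion(s) / judgement",              "logic"),
]

logic_steps = [label for _, label, kind in FRAMEWORK if kind == "logic"]
reasoning_steps = [label for _, label, kind in FRAMEWORK if kind == "reasoning"]
print(logic_steps)
print(reasoning_steps)
```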
Context
We applied our coding framework to publicly available evaluation reports published about education-associated programs conducted in the Australian primary, secondary, and tertiary sectors. Our aim was to answer the research question: What quantitative evidence of evaluative reasoning can be found in education evaluation reports conducted in Australia between 2014 and 2024?
The need for this study evolved from a series of questions we posed that could justifiably be asked by key stakeholders in the evaluation of educational interventions in Australia. The first of these are the taxpayers (including parents/caregivers of children and young people participating in the interventions) who contribute to the $32.2 billion in the Australian Government’s 2025–2026 budget. Taxpayers have a right to ask: (1) What is the value of educational programs to society? (2) How justified is the spending?
The second group of key stakeholders in the evaluation of educational interventions in Australia are educators working in, or conducting research in, the primary, secondary, and tertiary sectors. Educators could ask: (1) What educational interventions work for whom and why? (2) So what? (3) What makes one new innovative program better than another?
Finally, members of the discipline of evaluation could ask: (1) How can we improve evaluation practice? (2) How does the logic of evaluation integrate with evaluative reasoning, and why is it important?
Understanding the presence of evaluative reasoning in evaluation reports from the education sector is fundamental to answering these questions from stakeholders.
Methods
We adapted Pickering and Byrne’s (2014) systematic quantitative analysis method to identify elements of reasoning evident in publicly available education evaluations. Pickering and Byrne (2014) proposed their systematic quantitative analysis method as a means of supporting postgraduate and early career researchers to write and publish literature reviews. The authors outlined a 15-step method for undertaking systematic and quantitative analysis of available literature to map the number of publications and associated findings related to a topic under investigation and its associated research questions.
Pickering and Byrne (2014) indicated that there are two strengths of the systematic quantitative analysis method. The first is that its systematic and quantitative approach may reduce bias in how a researcher selects literature for inclusion in a review. In this study, we relied on publicly available (published) evaluation reports, like the previous two empirical studies (Hurteau et al., 2009; Nunns, 2016). An evaluation was included in the data set if it met the inclusion criteria discussed below.
The second strength of the systematic quantitative analysis method is that quantifying the literature produces an easy and comparatively fast map of a research topic when compared with the narrative approach to literature reviews. This strength applied in the same way to this study: the findings clearly identified the presence of evaluative reasoning elements in the evaluations, and comparison between evaluations was fast and easy. This method was appropriate because it supported quantitative summing from text to identify the elements of evaluative reasoning that we were interested in investigating.
The systematic quantitative analysis method calls for 15 steps to analyse literature. We changed the object of analysis to an evaluation report and removed the steps associated with a literature review, reducing the steps to 10. In the sections below, we discuss the search terms, inclusion criteria, data collection, and analysis.
Search Terms and Searches
This study replicated and expanded Nunns et al.’s (2015) study in a different context: Nunns et al. examined the Aotearoa New Zealand public service; this study examined the Australian primary, secondary, and tertiary education sectors. Consequently, Nunns et al.’s (2015) search strategy was replicated. Only Google was used to identify publicly available evaluations that could also be accessed by members of the public (taxpayers) without specific knowledge of, or access to, databases. Synonyms for the term evaluation were not used because we were only interested in reviewing evaluation reports.
Search terms for database searches of Google and Department of Education (or similar) websites.
Inclusion Criteria
To answer our research question in our chosen context, we chose inclusion criteria that enabled us to identify publicly available evaluation reports from the Australian primary, secondary, and tertiary education contexts:
(1) Evaluation reports conducted on an education program or project in the primary, secondary, or tertiary sector published between June 2014 and November 2024.
(2) Report was commissioned by a central, state, territory, or program/project funding agency.
(3) Report was written by Australian-based agencies or authors, whether they are external or internal to the agency.
(4) No author appeared more than twice in the sample.
(5) No agency appeared in the sample more than twice.
(6) The evaluation report was a complete report not a summary.
(Adapted from Nunns et al., 2015, p. 147).
We used the first inclusion criterion to screen the records; we excluded those that did not meet it. Subsequently, the first author (KM) read each evaluation report and assessed it using the rest of the inclusion criteria. This resulted in a dataset of 37 evaluations.
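To illustrate the screening logic only, the sketch below applies the six criteria to hypothetical candidate records; the field names are our own and, in the study itself, screening was done by reading each report rather than by code.

```python
# Hypothetical sketch of the screening logic; record field names are illustrative.
from collections import Counter

def screen(records):
    kept = []
    author_uses, agency_uses = Counter(), Counter()
    for r in records:
        in_scope = (r["education_program"]
                    and "2014-06" <= r["published"] <= "2024-11")            # criterion 1
        if not in_scope or not r["commissioned_by_agency"]:                   # criterion 2
            continue
        if not r["australian_based_authors_or_agency"]:                       # criterion 3
            continue
        if author_uses[r["author"]] >= 2 or agency_uses[r["agency"]] >= 2:    # criteria 4-5
            continue
        if not r["complete_report"]:                                          # criterion 6
            continue
        author_uses[r["author"]] += 1
        agency_uses[r["agency"]] += 1
        kept.append(r)
    return kept
```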
Data Collection and Analysis
We structured the database in Microsoft Excel™ using elements and their definitions from the study’s conceptual framework (Figure 2 above) as column headings and used it to systematically review each evaluation report. Following the systematic quantitative analysis method, if an element was present in a report, we entered a 1 in its associated column. If an element was absent or only partially present, a 1 was recorded in the ‘none’ column. We did not consider evidence quality or appropriateness, as this was outside the bounds of the inquiry. We developed the database iteratively; as reports were reviewed, some columns were added. For example, we added columns for demographic information provided by the report, such as the education sector, year, and whether state/territory or national datasets were included in the evaluation.
We reviewed reports independently. Author one (KM) reviewed 33 evaluation reports; Author two (AG) reviewed 7. We double-coded three reports and discussed discrepancies until consensus was reached. Our initial interrater reliability was 98%.
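One common way of computing a percent-agreement figure of this kind over double-coded element codes is sketched below; the values shown are hypothetical and the sketch is illustrative rather than the exact calculation used.

```python
# Illustrative only: percent agreement = matching cells / total cells compared.
def percent_agreement(coder_a, coder_b):
    assert len(coder_a) == len(coder_b)
    matches = sum(a == b for a, b in zip(coder_a, coder_b))
    return 100 * matches / len(coder_a)

# Hypothetical element-presence codes (1 = present, 0 = absent) for a double-coded report
print(round(percent_agreement([1, 0, 1, 1, 0, 0, 1], [1, 0, 1, 1, 0, 1, 1]), 1))  # 85.7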
At the conclusion of data entry, we checked that each column only had one number (or no numbers) in it for each report. The database was also checked to make sure that each element was recorded for every report. Where there was a discrepancy, the report was retrieved and reviewed, and the error corrected. Subsequently, we created summary tables as different pages in the Excel spreadsheet and summed the elements within and between reports. We calculated descriptive statistics to quantify the presence of elements within and across reports. The full data set is available via the Open Science Framework (https://osf.io/8fbpw/?view_only=fb5de0c81eaf4d65824ccc12a6a86624).
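The sketch below illustrates the kind of element-presence matrix and within/between-report sums described in this section; the column names paraphrase the framework elements and the values are hypothetical, not the study data.

```python
# Illustrative element-presence matrix; column names paraphrase the framework
# elements and the values are hypothetical, not the study data.
import pandas as pd

elements = ["eo_eq", "criteria_or_comparator", "standards", "measurement",
            "warranted_argument", "synthesis", "judgement"]

df = pd.DataFrame(
    [[1, 1, 1, 1, 1, 1, 1],   # contains all seven elements
     [1, 1, 0, 1, 1, 0, 0],   # four elements
     [0, 1, 0, 1, 0, 0, 0]],  # two elements
    columns=elements, index=["report_01", "report_02", "report_03"],
)

df["score"] = df[elements].sum(axis=1)            # elements present within each report
element_counts = df[elements].sum(axis=0)         # each element's presence across reports
element_pct = (element_counts / len(df) * 100).round(1)
evaluations = df[df["score"] == len(elements)]    # reports containing all seven elements
print(df["score"], element_pct, len(evaluations), sep="\n")
```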
Results
Below we present the findings of the analysis of 37 Australian education sector evaluation reports published between June 2014 and November 2024. We begin with the demographic characteristics of the data set. The systematic quantitative analysis findings follow.
Demographics
Of the 37 reports reviewed, agencies authored 62%, and individual consultancies or academics based at Australian tertiary institutions wrote 35%. The years 2017 and 2018 had the most reports published (6), and 2024 had the fewest (0), followed by 2023 (1). Most reports (12, 32%) gathered data from evaluands located across the whole of Australia. The states of Victoria and New South Wales (NSW) were the next highest, with data gathered from eight evaluands. Neither the Northern Territory nor Tasmania had any published reports. Data analysis focussed on evaluands in the primary, secondary, and tertiary sectors. The largest group (16, 43%) focussed on evaluands located in both the primary and secondary school sectors. These reports were distinct from those conducted in either the primary or secondary sector alone, as the evaluand that was the focus of the reports was active in both primary and secondary schools. Evaluands located only in the primary sector were the fewest (6, 16%).
Systematic Quantitative Analysis of Evaluative Reasoning Elements
The aim of this study was to use systematic quantitative analysis to identify the presence of elements of the integrated logic of evaluation and evaluative reasoning conceptual framework (Figure 2) in the data set. The following sections focus on an analysis of the findings of the systematic quantitative analysis. Initially, each report was scored based on the number of elements present. This enabled a synthesis of the overarching findings and the identification of different trends between reports that included four or more elements and those that contained three or fewer.
Overarching Findings
Summary of number of elements at each scoring level from seven to one.
Notes. EO = evaluation objectives; EQ = evaluation questions; RQ = research questions; Crit = criteria; Comp = comparator. *RQ are not included in count of EO/EQ. They are included in summary table for information only. Pres# = element present.
Report analysis provided four overarching findings. Firstly, the most significant finding was that only four reports (11%) of the 37 reviewed contained all seven elements. This means that only four reports constituted an evaluation according to the definition adopted for this study. This finding is significant because it indicates that, although report titles included the word ‘evaluation’, most reports were not evaluations, primarily because the evaluators did not provide an evaluative judgement about the evaluand’s merit, worth, or significance. The reports that did not provide an evaluative judgement were categorised as research and are identified in Table 5 below and Supplemental File 3 with red shading.
Secondly, a general observation from the whole data set was that only one report (report 24) contained any identifiable criteria. These criteria (sustainability, effectiveness, and efficiency [economic benefit]) were embedded in evaluation questions (report 24, p. 16). However, no specific definitions were provided for the criteria. In contrast, 17 reports contained comparators rather than criteria. Most comparators (12, 32%) were scholarly literature, including other evaluations. This finding reflects that of Nunns et al. (2015), who also identified that more evaluations in their data set used literature as a comparator. While literature provides a broad context to situate the evaluand and its performance, it is an indirect comparator (Nunns et al., 2015). ‘Evaluative criteria specify the values that will be used in an evaluation’ (Peersman, 2014, p. 149) and provide the most explicit approach for comparison (Nunns et al., 2015). Consequently, an absence of explicit evaluative criteria undermines what it is to evaluate and fails to consider stakeholders’ values (Hurteau, 2008).
Thirdly, most of the data set (24, 65%) contained three elements or fewer; only 13 reports (35%) scored 4 or more. A focus on the characteristics of the reports containing between four and six evaluative reasoning elements, which are closer to being evaluations than those with three elements or fewer, yielded the following observations. Eight of the nine reports in this grouping did not provide a judgement about the evaluand. Report 28 provided a judgement, ‘all three programs have been successfully designed and implemented’ (p. 49), but made no explicit reference to any criteria in the report. The absence of this element meant that it was not classified as an evaluation according to the definition applied in this study.
In addition to a judgement, synthesis was the next element missing from reports that scored five or four. Synthesis is one of the elements critical to evaluations because it draws together performance evidence gathered about the evaluand to support judgements against criteria/comparators. The role of micro-synthesis in an evaluation is to integrate the criteria or comparator with evidence of evaluand performance to answer evaluation sub-questions (Davidson, 2014a). Evaluation sub-questions relate to one dimension of the evaluation. For example, if the evaluator wants to determine the effectiveness of a program, effectiveness constitutes one dimension. If the evaluation is only measuring one dimension, then an evaluator will not need to conduct meso-synthesis because micro-synthesis will have answered the evaluation question. Subsequently, the evaluator should proceed to the next step in the logic of evaluation and provide an evaluative judgement.
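The decision described here, whether micro-synthesis alone answers the question or meso-synthesis is also needed, can be sketched as below; this is our illustration only, and the rating scale and combination rule are hypothetical rather than a prescribed synthesis method.

```python
# Illustrative only: the rating scale and the rule for combining micro-level
# judgements into a meso-level judgement are hypothetical.
RATING_ORDER = ["poor", "adequate", "good", "excellent"]

def synthesise(micro_judgements: dict) -> str:
    """micro_judgements: one micro-synthesis judgement per dimension,
    e.g. {'effectiveness': 'good'}."""
    if len(micro_judgements) == 1:
        # A single dimension: micro-synthesis already answers the evaluation
        # question, so proceed straight to the evaluative judgement.
        return next(iter(micro_judgements.values()))
    # Multiple dimensions: meso-synthesis combines the micro-level judgements
    # (here, naively, the weakest rating; a real synthesis would warrant its rubric).
    return min(micro_judgements.values(), key=RATING_ORDER.index)

print(synthesise({"effectiveness": "good"}))                            # good
print(synthesise({"effectiveness": "good", "efficiency": "adequate"}))  # adequate
```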
Finally, standards and synthesis were missing from reports that scored four (4, 11%). In evaluations, standards identify differentiated levels (for example, on a continuum from excellent to poor) against which an evaluand’s performance can be compared (Davidson, 2005). This finding is significant because standards are also critical to evaluation. They provide a comparator that enables value to be ascribed to the evaluand’s performance. The absence of standards indicates that the evaluator has not identified how they will determine whether the evaluand’s performance was acceptable, according to pre-determined standards, in the subsequent synthesis step.
Reports Containing Three Elements or Less
Twenty-four reports (65%) in the data set contained three elements or fewer. The element consistently included across all reports was measurement of some aspect of the evaluand’s performance. Aligned with measurement, most reports scoring three (7 out of 8) also included research questions. However, there was less consistency in the other elements included by reports scoring three. For example, when analysing reports containing three elements individually, report number 22 included research questions (‘can XX program be applied in government schools?’, p. 5), literature comparators, and results compared with state achievement standards (standards) (pp. 35–38), but did not provide a warranted argument, synthesis, or a judgement. In contrast, reports 16 and 23 contained only evaluation objectives/evaluation questions and research questions and a warranted argument, in addition to measuring performance. Reports scoring two were mostly missing standards, warrants, synthesis, and a judgement, and there was inconsistent inclusion of objectives/evaluation questions, research questions, and a comparator.
Discussion
In this study, we answered the research question: What quantitative evidence of evaluative reasoning can be found in education evaluation reports conducted in Australia between 2014 and 2024? Our analysis found that only four (11%) of the 37 reports constituted an evaluation as defined by our conceptual framework (Figure 2). In the discussion, we compare previous empirical findings with those of this study, propose a new model for tracking and understanding evaluative reasoning, and outline implications for practice.
Comparing the Findings of this Study with Other Empirical Studies
A Comparison of Previous Empirical Findings with those of this Study.
a. Nunns (2016) and this study did not separate the justification of standards or synthesis but reported the warranted argument as a pattern of reasoning across the report instead.
b. Ozeki et al. (2019) reported self-assessments of the importance of logic of evaluation elements.
Overall, there were fewer evaluative reasoning elements identified in this study when compared with all previously published studies. Percentage differences are more disparate for the presence of standards, synthesis and a judgement. They are less disparate for the presence of evaluative objectives and/or questions and a warranted argument.
Limitations of Previous Studies that our Study Addresses
Previous studies did not report evidence of all the elements of evaluative reasoning as we have done according to our integrated logic and evaluative reasoning conceptual framework. Two previous meta-analyses (Hurteau et al., 2009; Nunns et al., 2015) did not report the presence of warrants (Hurteau et al., 2009) or synthesis (Nunns et al., 2015). Our theoretical framework development highlighted the importance of both elements for a legitimate and justified evaluative judgement. The use of an integrated model to analyse evaluation reports presents an opportunity to support future evaluation practice.
Implications of our Findings for the Education Sector
This is the first study that we are aware of that has specifically focussed on evaluation in the education sector. While the practice of evaluation emerged from the US education sector, it appears, consistent with previous empirical findings, that its practice is variable and that it has evolved into end-of-term feedback sheets (Guenther & Arnott, 2011). Evaluation was originally about amelioration but then became about accountability (Mathison, 2008; Shadish, 1991). However, based on our research, the use of evaluation reports for accountability may not be warranted, for reasons related to both research and evaluation.
Related to research, firstly, while all reports measured at least one aspect of the evaluand, only 20 reports (54%) identified associated research question(s) that guided data collection and analysis. A further 10 reports (27%) did not identify objectives or research questions to direct their inquiry into the evaluand. This finding indicates that all reports are doing the research component necessary to gather facts about the evaluand, but a quarter of them are doing so without a specific focus. Secondly, warrants are missing from 16 reports (43%). The evaluative reasoning elements present in these reports (in the main) are evaluation objectives/evaluation questions/research questions and measurement. This means research is being conducted into evaluands, but findings are not being justified by report authors.
Related to evaluation, 89% of reports analysed in this study failed to provide the information required for a legitimate and justified evaluative judgement. Thus, the most significant finding of this study is that the value of educational programs reported in the study sample is largely unknown. While all the reports provided a perspective of the participant experience of the evaluand, which Mathison (2010) identified as amelioration, a lack of a judgement about the value of the evaluand means that there is little understanding about whether the program was good (Ni, 2010; Schwandt, 2008) and why it was good (Ni, 2010). Additionally, and potentially more importantly, a lack of understanding about what a good educational program ‘looks like’ and what attributes make it good, places limitations on improving educational practices and supporting learning (OECD, 2013) in the Australian education sectors where the programs in the sample took place.
Limitations of our Study
Our study had three limitations. Firstly, it was limited to publicly available evaluation reports that could be located using the search strategy. Thus, the sample may not include the total number of education program evaluation reports generated between June 2014 and November 2024. Secondly, a desk-based analysis meant we analysed only the information available in the report; we were blind to the contextual nature of the evaluand and the evaluation. Time, budget, and/or decisions made by the commissioner will affect what was or was not included in these reports. Finally, the method used necessitates counting elements present in the conceptual framework. Therefore, this study does not make any judgement about the quality of evaluative reasoning in these evaluations, nor of the evaluations themselves; we only made a judgement about the presence of reasoning elements in the data set based on our definitions.
Within these limitations, our study provides a ‘snapshot’ of evaluative practice in the Australian education sector between 2014 and 2024. Overall, the findings suggest variable practice of evaluative reasoning, aligning with previous empirical research in other countries and contexts (Hurteau et al., 2009; Nunns et al., 2015). In this study, it is evidenced by the finding that only four of the 37 evaluation reports analysed contained all seven reasoning elements. Further, in 24 reports (65%), three or fewer evaluative reasoning elements were able to be identified. As previously suggested, evaluative reasoning is a ‘building block of evaluation’ (Davidson, 2014a). Moreover, as ‘evaluation is an argument’ (Schwandt, 2008, p. 146) and the elements of evaluative reasoning provide a sound foundation for building credible, valid, and defensible arguments (Fournier, 1995a; House & Howe, 1999; Scriven, 1981), the variable practice found in this sample of reports is concerning for the Australian education sector, in particular.
A lack of understanding of what ‘to evaluate’ means has potentially impacted on this study because, although all the reports sourced contained the word ‘evaluation’ in the title, 33 reports exhibited the characteristics of research: research questions (in most), literature comparators, measurement, and warranted arguments (in some). The research frame is reinforced by the absence of explicitly evaluative elements such as evaluative evaluation objectives and/or questions, standards, synthesis at any level, and judgements about the value of the evaluand.
What Does This Mean for Practice?
The absence of evaluative reasoning in evaluation practice is not surprising. Only one evaluation textbook covers it explicitly (Davidson, 2005). Recent research shows it is not taught in introduction-to-evaluation subjects in university evaluation programs (LaVelle & Davies, 2021; LaVelle et al., 2023). While evaluation is often described as including determinations of merit, worth, and significance (Patton, 2008), the ‘how to’ aspect of this, and research on it, has been dramatically underdeveloped in comparison to research and practice related to research methods, utilisation, and stakeholder engagement (e.g. participatory methods).
The evaluation field is like body builders who miss a leg day – well-developed on top (e.g. methods, measurement, and causality), but underdeveloped in its foundation (e.g. values and valuing; logic and reasoning). Recent research (Roorda et al., 2020; Roorda & Gullickson, 2019; Teasdale, 2022; Teasdale and Pitts et al., 2023; Teasdale and Strasser et al., 2023) has made contributions to identifying criteria in evaluations in various contexts, but to make evaluative reasoning visible and actionable the field needs a clearer and simpler way of understanding it and differentiating it from research.
To that end, we propose Figure 3, which shows evaluation and research as separate stairways that start in the same place – describing the phenomenon – but lead to different destinations. Evaluation provides judgements about how good something is based on facts and values in context. Research provides facts to inform the knowledge base and recommendations. The stairway metaphor shows how describing, judging, and warranting interact for both research and evaluation. Each step of describing and/or judging is a tread on the stairway; descriptive steps are lighter coloured. The risers are the warrants, showing how each step needs to be supported to justify the choices made. In either case, missing a step, like criteria or standards for evaluation, or measures for both, means the argument has a leap in logic. Failing to provide a warrant means the argument has a step with no support. Missing any steps or supports undermines the validity, defensibility, and credibility of the claims. While following either stairway can lead to a helpful result, knowing which stairway you are on leads to clarity (heavenly!) about the inquiry process and the desired result.
The two stairways can and should connect. Building on existing research can surface values, criteria, standards, measures, and insights, to ensure past mistakes are not repeated and the burden on systemically underserved and over-evaluated populations is not increased with evaluation. Evaluative activity starts with explicit attention to the values of all stakeholders – not just evaluators, funders, or researchers – to ensure judgements are just. Identification of stakeholders, mapping of criteria, and participatory and reciprocal practices in evaluation can inform research efforts.
Figure 3. The stairway of evaluation logic and reasoning © Authors CC-BY-4.0.
The evaluation stairway has some distinctive features. Firstly, it is surrounded by context, which is essential to credibility. Choices made in the evaluation must be warranted based on context, which includes the physical, cultural, legal, theoretical, and organisational context of the evaluand, and the relevant ethical, cultural safety, and professional evaluation standards. Secondly, the final two steps of synthesis and judgement are a series of smaller steps and risers. Micro-synthesis generates judgements on data that provide the evidence for meso-synthesis; meso-level synthesis generates judgements on criteria or questions that are the evidence for macro-synthesis, which generates an overall judgement. Davidson (2014a) argued that only the micro and meso steps are always needed. Thirdly, while we have depicted this as a linear pathway, it is certainly more like M.C. Escher’s Ascending and Descending – a continuous process. Decisions made throughout will be informed and updated by all the other steps and supports.
Being explicit about evaluative reasoning can:
• Provide direction for the sources, research and reasoning needed to support the various steps (Al-Bayati et al., 2024).
• Reduce the amount of data required for collection (Vogels, 2025).
• Prioritise and integrate the voices of the most important stakeholders (Hall et al., 2012; House, 2006; House & Howe, 2003).
• Connect judgements about goodness directly to decision-making (Davidson et al., 2025).
These advantages demonstrate how evaluative reasoning can increase the quality, feasibility, and utility of evaluations. So how can this be achieved? The stairway provides some insights about actions we can take as a field, as practitioners, and as commissioners and users of evaluation.
Firstly, we can consider evaluative attitude as part of the context. If the first step of describing the phenomenon is a landing, then evaluative attitude is a doorway on it, leading to the evaluation stairway. While standing on the landing, consider: Are those commissioning the evaluation open to critical feedback about the evaluand? If not, then the way is shut for explicit evaluative reasoning. A focus on good research (grey steps) can support future evaluative reasoning by (i) documenting the needs the evaluand is intended to address and how the problem has been framed, (ii) looking for relevant theories, (iii) being explicit about constructs (potential values), and (iv) summarising research and report findings that could be benchmarked against to set standards. If the attitude door is ajar, then it may be possible to use data to make the case for evaluation and increase evaluative attitude (Cousins et al., 2006). The steps explicitly related to valuing (purple) can help to identify whose voices are most important (Roorda & Gullickson, 2019) and integrate them in the ‘meaning making’ process across all the steps.
Using the stairway can help us think about warranting as an ongoing process throughout research and evaluation, which can be underpinned by cultural knowledge bases, existing theories, the peer reviewed and grey literature, and ongoing engagement with communities. Key resources from stakeholders and the knowledge bases can be used to warrant multiple steps. For instance, in a First Nations evaluation, the description of the evaluand may need to be done with systems mapping, the criteria based first on kinship and Country, and the ‘meaning making’ steps done in community, rather than by the evaluators or the commissioners. For educational evaluations, theories of learning, behaviour change, and implementation (Funnell & Rogers, 2011) can direct evaluators to relevant knowledge bases which hold potential criteria, standards, and measures. Realist evaluations and realist reviews will help warrant causal arguments, and developmental evaluations can provide principles that serve as criteria. Evaluation approaches can be used in response to the various warrants, as ways to action the choices that are appropriate for evaluating this evaluand in this context (Montrosse-Moorhead et al., 2024).
Conclusion
Pawson and Tilley (1997) suggested that: Evaluation research has not exactly lived up to its promise: Its stock is not high. The initial expectation was clearly of a great society made greater by dint of research-driven policy making. Since all policies were capable of evaluation, those which failed could be weeded out, and those which worked could be further refined as part of an ongoing progressive research program. This brave scenario which (to put it kindly) has failed to come to pass (p. 13).
We propose that the promise of evaluation cannot be realised until evaluation practice consistently and effectively engages with evaluative reasoning. Taxpayers, policy makers, teachers, students, and all those who develop programs to influence education (and other sectors) need to understand whether their investments of time, energy, and money are worth it. Without explicit evaluative reasoning – logic and warrants – evaluation loses its power to help decision makers stop doing things that are harmful or wasteful, because the judgements are not made, or if made, are not valid, defensible, and credible.
Our research indicates there is work to be done: only four of the 37 evaluations analysed provided a legitimate and justified evaluative judgement about program value. The remaining 33 ‘evaluations’ are on the research staircase, providing descriptive facts about the evaluand without ascribing value to it (and, in some cases, not covering all the required steps). Using the stairways metaphor, and the variety of resources available to support both the logic and warrants, can help commissioners and users get clear about whether they need evaluation or research and what is required for each. The steps on the evaluation stairway can assist evaluators to produce credible, valid, and defensible evaluations of educational programs that attend to values in context. The ‘stairways to heaven’ can help all of us work together to provide an evidence base for decision-making and for ensuring that quality education is available to all members of society.
Acknowledgements
The authors acknowledge Dr. Ghislain Arbour’s contributions as a co-supervisor on the Master of Philosophy study which was the first round of this research. Amy is grateful for the conversations with Dr. Jane Davidson, Dr. John Gargani, and the New South Wales Department of Education and Australian Evaluation Society Chapter about this research and its implications for practice.
Ethical Considerations
Evaluation reports constituting the data set for this study were retrieved from publicly available internet sources. Consequently, ethical approval was not necessary.
Funding
The authors received no funding to conduct the research.
Declaration of Conflicting Interests
The authors declare no potential conflict of interest with respect to the research, authorship, and/or publication of this article.
Supplemental Material
Supplemental material for this article is available online.
