Abstract
Purpose:
This article gives an overview of the successes and lessons learned to date of the Education Endowment Foundation (EEF), one of the leading organizations of the What Works movement.
Design/Approach/Methods:
Starting with its history, this article covers salient components of the EEF’s unique journey including lessons learned and challenges in evidence generation.
Findings:
The EEF has demonstrated that it is feasible to rapidly expand the use of school-based randomized controlled trials (RCTs) in a country context, set high standards for research independence, transparency, and design, and generate new evidence on what works. Challenges include the need to consider alternative designs to RCTs to answer a range of practice-relevant questions, how to best test interventions at scale, and how study findings are reported and interpreted.
Originality/Value:
This article addresses some of the key components required for the success of What Works organizations globally.
Introduction
The articles in this issue focus on a general introduction to evidence-based educational reform in different countries and the effectiveness of evidence-based education reforms. The evidence revolution in education has unfolded in several identifiable waves, with the 2010s characterized by attempts to institutionalize the use of evidence through the emergence of knowledge brokering agencies, most notably the What Works movement in the U.S. and the UK (White, 2019). This article focuses on the case of the Education Endowment Foundation (EEF) in England and presents an internal perspective on the work of the EEF in generating, documenting, and promoting the use of high-quality evidence and evaluation to inform teaching and other school practices, for transfer and possible adaptation in other contexts. The article gives a brief overview of the EEF’s history, discusses the successes and lessons learned so far, and concludes with a discussion of the challenges faced, primarily from the perspective of evidence generation.
A brief history of the EEF
Inspired by the Obama administration’s Race to the Top initiative in the U.S. (U.S. Department of Education, 2009), the UK Secretary of State for Education, Michael Gove, announced in late 2010 plans to establish an education endowment fund to help raise standards in challenging schools in England (Department for Education and The RT Hon Michael Gove MP, 2010). The EEF was founded in 2011 by a lead charity, The Sutton Trust, in partnership with Impetus, with a £125 million founding grant from the Department for Education. The two charities were selected following an open competition which attracted interest from 14 organizations. The EEF was envisaged to have a 15-year life span. In addition to receiving the founding grant, the EEF has set itself the goal of securing additional investment to enable it to award over £200 million in supporting the development, delivery, and rigorous evaluation of programs over its life span (The Education Endowment Foundation [EEF], 2012a).
The EEF is governed by an independent Board of Trustees, nominated by the founding partners and chaired by Sir Peter Lampl. The Board and the executive team are guided by two key advisory bodies: an Advisory Board of experts from education, public policy, and business; and an Evaluation Advisory Group (EAG). The EAG provides critical guidance on evaluation methodologies and best practice in evidence generation. The charity is also supported by a number of legal and professional services firms, offering pro bono advice. Importantly, the EEF is independent of government, but maintains strong and collaborative working relationships with a number of Ministries, principally the Department for Education.
In March 2013, the EEF and the Sutton Trust were jointly designated by the Government as the What Works Centre for Education. The What Works network is made up of nine independent What Works Centres, three affiliate members, and two associate members (Cabinet Office, 2019). Together these centers cover policy areas which account for more than £250 billion of public spending. What sets What Works Centres apart from standard research institutions is that the centers are committed to increasing both the supply of and demand for evidence in their policy area, and their output is tailored to the needs of their primary audiences (Cabinet Office, 2013; Cabinet Office and HM Treasury, 2018).
The Government’s What Works network represents the political ascendancy of the What Works movement, which is largely built on the rise of impact evaluations (particularly randomized controlled trials [RCTs]) since the early 2000s and the increased production of systematic reviews over the last 10 years (White, 2019). These developments have been paralleled in the UK by the emphasis placed on evidence-based policy and practice by the incoming New Labour Government in 1997, taken forward through successive Labour administrations, the Conservative/Liberal Democrat Coalition Government of 2010–2015, and the Conservative Government thereafter (see also Connolly et al., 2017).
Mission, aims, and key strands of work of the EEF
The EEF is an independent charity dedicated to breaking the link between family income and educational achievement in England. It aims to raise the attainment of 3- to 18-year-olds, particularly those facing disadvantage, to develop their essential life skills, and to prepare young people for the world of work and further study.
The EEF pursues these aims in three ways. First, it synthesizes the best available evidence in user-friendly language for senior leaders and teachers in schools, translating it into resources, including summaries and practical tools, designed to improve practice and boost learning (e.g., Quigley & Coleman, 2019; van Poortvliet et al., 2018; The Teaching and Learning Toolkit, available at https://educationendowmentfoundation.org.uk/evidence-summaries/teaching-learning-toolkit). Second, it generates evidence of what works to improve teaching and learning by funding high-quality, independent evaluations of promising programs and interventions, predominantly using RCT designs. Third, it supports educational institutions, ranging from early years’ settings to post-16 settings across England, by promoting the use of evidence to inform practice and maximize the benefits for children and young people. To do so, the EEF works in partnership with a network of 32 Research Schools and 8 Associate Research Schools across the country to support the use of evidence to improve teaching practice (see www.researchschool.org.uk/ for further information).
Currently, there are 20,217 state-funded schools, attended by just over 8 million pupils, in England (Department for Education, 2019a) with 453,400 full-time equivalent teachers working in state schools (Department for Education, 2019b). As of March 2020, just over 14,000 schools and over 1.58 million children and young people were involved in EEF-funded studies (of which over 150 are RCTs). A recent systematic review found that there have been 1,017 unique RCTs in education since 1980 and of these, 799 have been produced in the last 10 years (Connolly et al., 2018). Only 25% of education trials identified in this review included more than 1,000 participants (Connolly et al., 2018), whereas over 70% of EEF RCTs include more than 1,000 participants, with the average size being over 8,000 participants. Therefore, the EEF is one of the leading funders of RCTs in education globally and has commissioned approximately 19% of all known trials in the last 10 years, and some of the largest.
While the EEF’s work has a predominantly domestic focus, its approach to generating and using evidence to improve teaching and learning is internationally relevant. Furthermore, more countries becoming involved in this endeavor will support the EEF’s core mission to boost attainment for disadvantaged children and young people in England. The EEF has quickly established itself as a world-leading organization in evidence generation and supporting teachers to put research to good use, which has triggered the development of a number of global partnerships since 2014 spanning Australasia to Latin America. In 2018, the EEF launched a 5-year project “Building a Global Evidence Ecosystem for Teaching” in partnership with the BHP Foundation as part of its Global Education Equity Program. This project will enable the EEF to take its work to scale, supporting more partners in more countries to generate evidence that senior leaders and teachers could use to make evidence-informed decisions to support school improvement.
Evidence generation: Key successes and lessons learned
Feasibility of large-scale commissioning of sizable school-based education evaluations in a single country context
The EEF has demonstrated that it is possible to commission many relatively large education evaluations, mostly RCTs, over a short period of time (8 years). Prior to the EEF being set up in 2011, few large-scale pragmatic RCTs had been conducted in English schools (Styles & Torgerson, 2019). There was some resistance to the use of RCTs in the UK among parts of the education research community (Oakley, 2006), and common objections included the argument that randomization itself is unethical and the perception that participants will think the randomization is unethical and refuse to participate (Hutchinson & Styles, 2010). Indeed, it was believed that one of the main challenges the EEF would face would be persuading schools to take part, yet by 2019, it had successfully recruited more than half the schools in England to its evaluations (Nevill, 2019a).
Early EEF trials suffered from relatively high attrition, with an average of 24% of pupils dropping out between 2011 and 2012 (Dawson et al., 2017). Since then, the EEF has introduced a number of strategies to recruit and retain schools. For example, the EEF has learned the value of communicating the benefits of RCTs to schools through recruitment events and documents clearly explaining the evaluation design and the schools’ responsibilities (Dawson et al., 2017), strategies which are supported by the literature on intrinsic motivation (Tirole & Benabou, 2006). Extrinsic rewards can also have a role in increasing intrinsic motivation (Muralidharan & Sundararaman, 2011). The EEF offers financial incentives, particularly to control schools, and where the data collection burden is onerous. The EEF has also recognized schools’ contribution through letters of thanks and certificates designating them “EEF research partner schools.”
These efforts have culminated in the EEF developing and publishing guidance on recruitment and retention for delivery partners (The EEF, 2019a). Recruitment has become easier over time, as schools have become familiar with EEF’s work, and the education system has recognized the value of high-quality research (Cullinane, 2018).
The EEF has set high standards for research independence, transparency, and design
With £125 million available at the outset, the EEF was able to set the agenda regarding the type of research and evaluation it would fund. It began with the intention of only commissioning RCTs as “gold standard” evaluation designs with respect to minimizing selection bias. For example, the U.S. Government’s What Works Clearinghouse (WWC, 2017) gives its highest rating only to well-implemented RCTs. Yet, not all RCT designs provide useful evidence for schools and there are many aspects of both design and delivery that are debated (Ginsburg & Smith, 2016).
This section discusses some salient aspects of the EEF’s journey in developing its quality standards and expectations regarding gold standard evidence generation. The EEF has taken a collaborative approach to developing these, via consultation with UK and international research organizations, relevant experts, and its Evaluation Advisory Group (see above on governance structure). For example, the EEF hosts an annual conference and analysis workshop for the members of its Panel of Evaluators to discuss challenges and solutions in education evaluation.
Independent evaluation
Conflict of interest in summative evaluation is a recognized problem across many fields. In medicine, drug trials are often funded by drug companies, leading to accusations of reporting bias and negative findings being withheld (Goldacre & Heneghan, 2014). Similarly, in education, there is a risk of bias from developers who exert a strong influence over the study and have a personal or financial interest in a positive result. In particular, “as one reaches the stage of summative evaluation, there are clear concerns about bias when an evaluator is too closely affiliated with the design team” (National Research Council, 2004, p. 61). Yet historically such developer-led evaluations have been common (Ginsburg & Smith, 2016).
The EEF recognized the benefits of setting a precedent for commissioning “independent” evaluations from the outset and needed to quickly clarify what it meant by that. The approach the EEF took was to appoint, through a competitive tendering process, a Panel of Evaluators, namely research organizations with expertise in education evaluation, who would compete to evaluate programs, and subsequently be partnered with developers. The EEF has separate grant agreements with the evaluator and developer and acts as a mediator in discussions about evaluation design, implementation, and analysis. The EEF’s standards for independent evaluation clarify that it expects design decisions to be made collaboratively, but that randomization, primary outcome data collection, analysis, and reporting should always be conducted by the evaluator (The EEF, 2017a).
The approach chosen by the EEF is unusual and relatively extreme in the way that it separates evaluators’ and developers’ financial and personal interests. For example, a comparable organization to the EEF is the U.S. Department of Education’s Institute of Education Sciences (IES) and its evaluation arm, the National Center for Education Evaluation and Regional Assistance (NCEE), which conducts and supports large-scale evaluations of education programs. The IES was established by the Education Sciences Reform Act (ESRA) of 2002.
ESRA requires that the NCEE conduct independent evaluations by “awarding evaluation contracts competitively to experts external to the Department who are free from conflicts of interest” (IES, 2017). Yet, to meet this requirement, grantees are expected to select the evaluator themselves, name them in their grant application, and administer the grant funds. In a report of 65 Investing in Innovation (i3) education evaluations supported by the IES NCEE, it was found that 97% were independent, as defined by having at least one outcome that was collected, analyzed, and reported by the independent evaluator (Boulay et al., 2018). Independent evaluators are usually from a different organization to the grantee or developer, but there appear to be no checks on evaluators’ conflicts of interest, unlike in the EEF commissioning model.
One consequence of this approach may be that fewer RCTs and more quasi-experimental designs (QEDs) are commissioned. A QED approach is often preferred by developers for practical reasons, and evaluators may feel compelled to agree if they are paid by the developers. Of the 19 i3 impact evaluations supported by the NCEE, 13 (just over two thirds) were reported to meet the WWC standards without reservations (Boulay et al., 2018), meaning a randomized design without high attrition, defined as less than 55% overall (WWC, 2017). It is not possible to directly compare the WWC standards with the EEF’s padlock rating, which is used to assess the security of the primary impact result (see the section on challenges below). However, of the 95 impact evaluation reports published to date, 89% were based on an RCT design and 85% achieved three or more padlocks, meaning that they had an overall attrition rate of less than 30%.
The approach taken by the EEF was controversial: there has been resistance in some parts of the academic community, and several projects have fallen through because the developers would not agree to the design proposed by the independent evaluator. Nevertheless, it has been successful in achieving high-quality results and minimizing bias.
Reporting transparency
Publication bias is a recognized problem whereby authors and publishers favor positive findings. An analysis of social science research found that studies with strong results are 40 percentage points more likely to be published than those with null results (Franco et al., 2014). For this reason, and to minimize selective reporting (i.e., the bias that derives from the exclusion of negative or undesirable results), the EEF requires a prespecified protocol and statistical analysis plan for every trial to be published on its website and the trial to be registered on the ISRCTN registry, a primary clinical trial registry. The first EEF protocol and reporting templates were published in 2013, based on CONSORT standards (Schulz et al., 2010), and have been updated since to reflect changing standards (Montgomery et al., 2018). All EEF findings are published, whatever the result (Nevill, 2016).
Despite these measures being taken by the EEF, the risk of publication bias still exists. There have been several examples where the developer has chosen to separately publish journal articles that present a more positive picture, in response to EEF’s report of a null result (e.g., Burgess et al., 2019). The EEF attempts to minimize this risk by encouraging evaluators and developers to enter into a publication agreement and requesting to see publications before they are submitted. But in practice, once the evaluation is complete, it has little influence.
Data archiving and reproducibility
There have been concerns over the last decade regarding the replicability of scientific findings (Open Science Collaboration, 2015). The EEF knew it would be generating large amounts of powerful RCT data, so in 2012 it moved quickly to set up a data archive to enable it to check the reproducibility of evaluator estimates, track long-term outcomes, and support secondary research and reanalysis across trials. Education researchers in the UK benefit from having access to high-quality census data on pupils and schools, including outcomes at the end of primary and secondary school, compiled by the UK Department for Education (FFT Education Datalab, 2018). This is called the National Pupil Database (NPD). The EEF data archive is the first of its kind to collect pupil-level data from many large-scale education RCTs in one place, link this data set to longitudinal outcomes (in the NPD), and make it accessible for further research.
It has led to important methodological innovations. For example, the reanalysis of 17 early EEF trials using four analytical models enabled EEF research partners at Durham University to reveal the extent of variation in effect size estimates that occurs as a result of analysis choice (Xiao et al., 2016) and led to the EEF publishing the first version of its statistical analysis guidance to increase the comparability of trial results, which has been updated three times since (The EEF, 2018). It has also been used to examine the theoretical and empirical implications of accounting for clustering at the class level (Demack, 2019) and to explore using standard deviation (SD) as an outcome of an intervention (Tymms & Kasim, 2018).
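The sensitivity of effect sizes to analysis choice can be illustrated with a toy simulation. This is a hypothetical sketch, not the Durham reanalysis: all data and model choices below are invented for illustration. The same randomized data yield different standardized effect sizes depending on whether the model adjusts for prior attainment and, in particular, on which standard deviation is used as the denominator.

```python
# Toy simulation (hypothetical; not the Xiao et al. reanalysis) showing how
# analysis choices change the reported effect size for the same trial data.
import numpy as np

rng = np.random.default_rng(42)
n = 2000
prior = rng.normal(0.0, 1.0, n)        # baseline (pre-test) attainment
treat = rng.integers(0, 2, n)          # randomized treatment indicator
post = 0.8 * prior + 0.10 * treat + rng.normal(0.0, 0.6, n)  # true effect 0.10

# Choice 1: unadjusted difference in means, scaled by the raw outcome SD
es_unadjusted = (post[treat == 1].mean() - post[treat == 0].mean()) / post.std(ddof=1)

# Choice 2: regression coefficient adjusting for prior attainment
X = np.column_stack([np.ones(n), treat, prior])
beta, *_ = np.linalg.lstsq(X, post, rcond=None)
resid_sd = (post - X @ beta).std(ddof=1)

es_adjusted_total = beta[1] / post.std(ddof=1)  # same coefficient, total-SD denominator
es_adjusted_resid = beta[1] / resid_sd          # residual-SD denominator inflates the ES

print(round(es_unadjusted, 3), round(es_adjusted_total, 3), round(es_adjusted_resid, 3))
```

Because the residual SD is mechanically smaller than the raw outcome SD, the last estimate is systematically larger for the same underlying data, which is one reason standardizing such choices through common analysis guidance improves the comparability of trial results.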
Currently, the archive holds data for 105 completed studies that the EEF has commissioned. As the number of trials hosted within the EEF archive grows, this powerful data set provides increasing potential for understanding what works for different types of schools and pupils.
Evaluation design
There is much that could be written about EEF’s journey with respect to the designs that it commissions, but this article will focus on two aspects: implementation and process evaluation (IPE) and measurement of outcomes.
IPE
Some early EEF trials suffered because they lacked high-quality IPE, which meant that they were unable to explain the causal processes underlying the results or describe implementation (Morris et al., 2016). This is not unusual in education: only 38% of the 1,017 education trials identified between 1980 and 2016 included a process evaluation component (Connolly et al., 2018). But the EEF’s early approach to design was limiting, because education programs are complex and there is much to be learned regarding implementation and causal mechanisms. The British Medical Research Council’s evaluation recommendations for complex interventions specify that “a good theoretical understanding is needed of how the intervention causes change” and that “lack of effect may reflect implementation failure (or teething problems) rather than genuine ineffectiveness” (Craig et al., 2008, p. 980). For this reason, in 2014, the EEF commissioned a literature review from Manchester University of IPE for education interventions (Humphrey et al., 2016), which informed guidance highlighting the importance of a detailed intervention description (Hoffman et al., 2014) and high-quality data on implementation (Durlak & Dupre, 2008), compliance, control group activity, causal mechanisms, and cost (Dawson et al., 2017). Since then the EEF has commissioned a review of methods for evaluating complex education programs (Anders et al., 2017) and published revised guidance with the aim of improving theory-testing, integration of impact and IPE, prespecification, and the measurement of compliance (The EEF, 2019b).
For commissioning bodies like the EEF, there is a difficult balance to strike between the desire to evidence many hypothesized elements of an intervention’s theory of change and the practical risks of overburdening participants and increasing costs. There are several examples of the EEF commissioning multiarmed trials (e.g., Lord et al., 2017; McNally et al., 2018), and 90% of EEF trials include the measurement of at least one secondary outcome or mechanism of change (Nevill, 2019a). However, there is still some distance to travel before the EEF is regularly commissioning all aspects of “realist” trials, which, for example, “examine the effects of intervention components separately and in combination, using multi-arm studies and factorial trials; explore mechanisms of change, for example analysing how pathway variables mediate intervention effects; [and] use multiple trials across contexts to test how intervention effects vary with contexts” (Bonell et al., 2012, p. 2299).
Measurement of outcomes
A study may have the most precise analysis, but without psychometrically reliable and valid measurement instruments, RCT results have little meaning. At the outset, the EEF recognized the risk that using outcomes too closely aligned to the treatment (Ginsburg & Smith, 2016; WWC, 2017) results in greatly inflated effect sizes (Cheung & Slavin, 2016). Early in 2012, the EEF published strict guidance on test selection for the primary outcome in its evaluations, the main criteria being that the test must have broad external validity and be highly correlated with performance in national high-stakes assessments (The EEF, 2012b). The EEF also uses outcomes from the NPD where possible. Further discussion of the approach the EEF has taken to address measurement, attrition, and timing can be found in Dawson et al. (2017).
Many commercial standardized tests are available in the UK, some of which are widely used by schools and researchers. Yet the psychometric properties of these tests are not well reported. Some companies report measures such as internal test-item consistency (e.g., Cronbach’s α), but studies of validity and reliability are less frequent. A recent analysis of archived data has shown that many of these assessments offer only moderate predictive validity (Allen et al., 2018), and some evaluations have suffered from floor and ceiling effects (e.g., Hodgen et al., 2019). In 2014, the EEF expanded its remit to include the early years and non-attainment outcomes, such as self-control and resilience, and commissioned literature reviews to inform databases of available measures in these areas (e.g., Wigelsworth, 2017). The EEF is now commissioning a systematic review of available attainment measures to fill this gap but would have benefited from doing so earlier. This is a useful lesson for similar organizations wanting to fund large numbers of evaluations with similar outcomes.
The EEF has generated evidence about what does and does not work
As a result of the efforts described above, the EEF has generated a large body of evidence that helps to identify what does and does not work in English schools. For example, EEF research has clarified the need to deploy teaching assistants better and the importance of early years’ approaches (The EEF, 2017b). It has also generated important knowledge about what does not work, showing that schools should be wary of expecting large returns from popular approaches such as “lesson study” and “growth mindset” (Foliano et al., 2019; Murphy et al., 2017). Much of what has been learned has come not from the headline findings alone; schools also find information on implementation useful (Quigley, 2019).
Synthesis is essential in order to establish the external validity of findings (Shadish et al., 2002). For this reason, EEF trials are part of a rich tapestry of evidence generated by the EEF, including the Teaching and Learning Toolkit (Higgins et al., 2015) and the Early Years Toolkit, which are currently based on many meta-analyses (Higgins et al., 2013), and the EEF’s guidance reports, which provide practical recommendations for teachers on a range of high-priority issues, based on the best available evidence. The EEF Toolkits present over 30 approaches to improving teaching and learning, each summarized in terms of its average impact on attainment, its cost, and the strength of the evidence supporting it (Higgins et al., 2015). Recent guidance reports include recommendations on improving social and emotional learning in primary schools (van Poortvliet et al., 2019) and improving literacy in secondary schools.
The meta-analyses in the Teaching and Learning Toolkit often combine studies of varying quality, covering different ages, subjects, and even countries. This is why the EEF has commissioned the EEF Education Database, a major initiative involving scores of coders coding the estimated 10,000 individual studies within the Toolkit. The ultimate aspiration is to create a “live” database where all education studies carried out globally can be included as their results become available.
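The basic operation behind such meta-analytic summaries is inverse-variance weighting: each study's effect size is weighted by the precision of its estimate. A minimal fixed-effect sketch follows, with invented effect sizes and standard errors; the Toolkit's actual methodology is more elaborate (e.g., random-effects models that account for between-study heterogeneity).

```python
# Minimal fixed-effect meta-analysis by inverse-variance weighting.
# The effect sizes and standard errors below are invented for illustration.
def pooled_effect(effects, std_errors):
    weights = [1.0 / se**2 for se in std_errors]                  # precision weights
    estimate = sum(w * e for w, e in zip(weights, effects)) / sum(weights)
    pooled_se = (1.0 / sum(weights)) ** 0.5                       # SE of pooled estimate
    return estimate, pooled_se

est, se = pooled_effect([0.10, 0.25, 0.05], [0.05, 0.10, 0.08])
print(f"pooled ES = {est:.3f} (SE {se:.3f})")  # → pooled ES = 0.111 (SE 0.039)
```

Note how the most precise study (SE 0.05) dominates the pooled estimate; when the combined studies differ in quality, age group, subject, and country, as in the Toolkit, modeling that heterogeneity rather than simply pooling becomes essential.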
Key challenges faced and opportunities for future
While the EEF has had the opportunity to celebrate many milestones on its evidence synthesis, generation, and mobilization journey, it has also faced several challenges, which have pushed it to adapt and innovate.
Common RCT designs are not always suited to answering some kinds of questions of importance to schools and teachers
The EEF typically commissions relatively large-scale trials using school-level randomization. However, the EEF has learned that it can sometimes be hard to determine in advance whether an RCT is the most feasible way to evaluate an intervention. Uncertainties can emerge in the design and implementation stages regarding various trial aspects such as the number of participants, likely intervention effects, and costs (Edovald & Firpo, 2016). Furthermore, sometimes an RCT design is not acceptable to participants (e.g., Sutherland et al., 2017). Also, some interventions or questions lend themselves more readily to RCTs than others. This section describes two new strands of work that the EEF has recently introduced in response to the challenges it has faced in evaluating certain relevant questions in school settings.
School choices
Successful RCT designs rely on participants that are willing to be randomized. The EEF has managed to recruit more than half the schools in England to participate in its RCTs, but there are some choices that schools are not willing to be randomized to. For example, an EEF-funded RCT of mixed ability grouping versus setting and streaming based on ability failed to recruit schools (Roy et al., 2014), as did a trial that involved changing secondary school start times (Robinson, 2016). To address these challenges and to understand the impact of school-level decisions and policies that do not necessarily involve the introduction of a new intervention, the EEF in 2019 introduced a new funding stream titled “Researching school choices.” The studies funded as part of this stream look at how the different choices schools make lead to different outcomes by examining natural variation in the system and using QEDs to estimate the impact of different approaches. As is the case with EEF trials, there is also a strong emphasis on best practice, research transparency, and lack of conflicts of interest when undertaking these studies.
Teacher choices
In 2018, the EEF undertook a review of its grant-making process with the aim of understanding how to make its projects timelier and more relevant for schools. The review identified that head teachers and teachers are keen for the EEF to answer research questions that can more directly feed into the existing teaching practice. As with the “School choices” strand of work, these questions are often not related to the impact of manualized programs, which require schools to purchase particular resources or training. Instead, they are about the everyday choices that teachers make when planning their lessons and supporting their pupils such as: Does phoning home improve student behavior? Does marking books lead to more learning than whole-class feedback? and What are the most effective ways to read with a class? The EEF has recently launched a pilot through which to explore innovative evaluation designs to investigate such questions, including approaches (e.g., within-participant designs and more proximal outcomes) that mean trials can run over shorter time frames and with smaller numbers of schools than in typical EEF trials.
Few EEF-funded trials have shown interventions to work better than standard practice
Lortie-Forgues and Inglis (2019) recently reanalyzed RCTs commissioned by the EEF and the NCEE to generate insights into the effectiveness and informativeness of using large-scale educational trials to generate evidence. They assessed the magnitude and precision of effects found in 141 RCTs involving 1,222,024 students, commissioned by the two organizations. They found that many of these RCTs have produced small effects (mean effect size .06 SDs) with wide confidence intervals (mean width .30 SDs). The authors concluded that many of these trials are uninformative, a conclusion that takes a narrow view of the use of RCTs and the What Works movement, given its specific focus on the precision of the headline finding (Nevill, 2019a). The quality of trials has progressively improved based on lessons from the EEF and NCEE experience. A fundamental premise of the What Works movement is to keep learning not just about what works but also about how best to research what works. A good example is that the effective sample size of EEF trials has risen over time (almost doubling from 2014 onward compared with earlier trials), meaning that EEF trials have become progressively more informative (Sanders, 2019).
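The link between trial size and informativeness can be made concrete. Under the simplifying assumption of individual randomization with two equal arms, the 95% confidence interval for a standardized mean difference narrows roughly with the square root of the sample size; school-level (cluster) randomization, as in most EEF trials, gives wider intervals still.

```python
# Simplified sketch: 95% CI width for a standardized mean difference
# (Cohen's d) as a function of sample size, assuming two equal arms and
# individual randomization. Cluster randomization widens these intervals.
import math

def ci_width_smd(n_per_arm: int, d: float = 0.06) -> float:
    """Approximate 95% CI width for Cohen's d with two arms of n_per_arm each."""
    n1 = n2 = n_per_arm
    se = math.sqrt(1.0 / n1 + 1.0 / n2 + d**2 / (2.0 * (n1 + n2)))
    return 2 * 1.96 * se

for n in (100, 500, 4000):
    print(n, round(ci_width_smd(n), 3))
```

On these assumptions, roughly 500 pupils per arm are needed before the CI width falls below the .30 SDs that Lortie-Forgues and Inglis report as the mean, and reliably detecting effects near .06 SDs requires samples in the thousands, consistent with the EEF's shift toward larger effective sample sizes.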
In addition, it is essential that the What Works movement is able to say not only what does work but also what does not, as resources spent on ineffective practices could be better used elsewhere. The message that few popular programs available to schools are better than what schools are already doing (business as usual) is useful. What is even more valuable is to reflect on why only a few programs generate positive effects and explore how best to refine the EEF’s research questions and designs to be more fit for purpose. This exploration should involve developing a more in-depth understanding of business as usual, as the lack of intervention impact may reflect high-quality existing teaching practice, and generating better data on costs. It is essential to investigate what works for whom, where, and under what circumstances, which presents additional methodological and logistical challenges (e.g., the sample sizes required to test intervention impacts on subgroups).
So far, the EEF has been mostly responsive to what is available in the education system when choosing which interventions to fund and evaluate. The “best picks” tend to be those that have a clearer underlying theory of change and are driven by some prior evidence. The EEF is increasingly funding more piloting and development work to improve intervention designs. In terms of its future direction, the EEF may be more likely to have a greater impact on pupil outcomes if it considers testing fewer, more intensive, and more theory-driven interventions, using richer, more “realist” RCT designs.
The EEF has also learned that the larger the scale of implementation, and the size of the RCT, the harder it is to find educationally interesting effects compared to business as usual (Nevill, 2019b). There are examples of EEF trials that have shown positive effects at the efficacy testing stage (Hanley et al., 2015) which have not been replicated at the effectiveness testing stage (Kitmitto et al., 2018). There is often a direct tension between high-quality implementation and the need for greater statistical power. A large study sample may require intervention providers to scale up faster than is appropriate, putting the quality of delivery at risk. Smaller studies allow for more intensive support, monitoring, and engagement, which keeps the processes of implementation more manageable. When increasing the scale of an intervention, providers will often need to recruit and train new staff and adapt training models (World Health Organization and ExpandNet, 2009). The EEF has recognized that scaling up is hard, whether within a large-scale study or post-experimentation. It now needs to rise to the challenge of investigating the key features that facilitate large-scale, successful delivery of education interventions in different settings and contexts.
It is difficult to work out what works
Interpreting the results of RCTs can be a challenging task. A measure of what is “informative” as defined by the precision of the estimate misses many other important elements of security that arguably tell us more about the reliability of the evaluation and its conclusions. It is for that reason that the EEF developed its padlock security rating designed to summarize, in a single scale, a number of possible sources of bias that could threaten the security of a result (The EEF, 2019c). While it might be controversial to summarize the quality of a headline finding in a single scale, it represents an important attempt to make evidence as accessible as possible to time-poor practitioners, so that they can better use it to inform their practice.
There are underlying problems with current analysis and reporting practices that exacerbate the challenges of communicating evidence and influence the interpretation of what works. The research community continues to misinterpret p values and significance tests (Wasserstein et al., 2019): conclusions are based solely on whether an effect is found to be “statistically significant,” arbitrary thresholds such as p < .05 are used to assign value to results, and practical importance is inferred from statistical significance or its absence. Furthermore, there remains a challenge regarding how we embrace uncertainty and report confidence intervals (Amrhein et al., 2019). An ongoing challenge for the EEF is to balance the adequate reporting of uncertainty with the need to communicate actionable results to end users.
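One way to embrace uncertainty in reporting is to present the interval rather than a bare significant/not-significant verdict. The sketch below uses hypothetical numbers (the effect and standard error are illustrative assumptions, not results from any EEF trial) to show why a non-significant result is not evidence of no effect.

```python
def summarize(effect: float, se: float, z: float = 1.96) -> str:
    """Report an effect with its 95% CI instead of only a
    significant / not-significant verdict."""
    lo, hi = effect - z * se, effect + z * se
    sig = "p < .05" if abs(effect / se) > z else "p > .05"
    return f"{effect:+.2f} SDs, 95% CI [{lo:+.2f}, {hi:+.2f}] ({sig})"

# Hypothetical trial: the point estimate is 'not significant', yet the
# interval is consistent with effects from roughly zero up to +0.14 SDs,
# so absence of significance is not the same as absence of an effect.
print(summarize(0.06, 0.04))
```

Reporting in this form gives practitioners a sense of both the plausible size of an effect and how precisely it has been estimated.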
Conclusion
The EEF has been influential in building capacity and capability to conduct high-quality education evaluations in the UK and is now recognized as a world leader in education evaluation (The Economist, 2018). It has strived for high standards of evaluation independence, pre-specification, research transparency, analysis, and design. The lessons learned from the EEF’s journey can be valuable not only to other organizations but also to governments and to those more widely trying to promote the What Works movement. For example, the EEF is now working collaboratively with other, newly formed, UK WWCs in areas including early intervention, crime and violence, employment, and social care, to share its learning, to develop best practice, and to ensure a consistent approach to evidence generation. Furthermore, the EEF actively shares its lessons learned with new partner organizations globally. However, the EEF’s journey has not been without its challenges, and no doubt more will emerge as the journey continues. The EEF is a learning organization and will strive to continue building on its experience to ensure it generates and shares the best possible evidence to improve teaching and learning and address educational disadvantage.
Footnotes
Authors’ note
This article represents authors’ views based on their experience of working at the EEF.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Notes
Contributorship
Triin Edovald and Camilla Nevill conceived of the presented idea and were in charge of overall direction and planning. Both Edovald and Nevill contributed to the writing of the manuscript and provided critical feedback.
