Abstract
The realist evaluation approach has become firmly established within the field of evaluation. Reflecting its sustained and increasing uptake across diverse fields, a growing number of reviews have, over the years, examined practical applications of realist evaluations. Drawing on an umbrella review of 23 published reviews of realist evaluations, this article takes stock of key challenges in realist evaluation and proposes practical principles for addressing them. The proposed principles are designed to promote greater methodological congruence, coherence and transparency in the design and implementation of future realist evaluations.
Introduction
Since the publication of Ray Pawson and Nick Tilley’s (1997) seminal book Realistic Evaluation, the realist evaluation (RE) approach has become firmly established within the field of evaluation. Its growing influence is evident in the exponential proliferation of publications on RE (Lemire et al., 2020; Renmans and Pleguezuelo, 2023), the publication of dedicated books (Emmel et al., 2018; Manzano and Williams, 2025), the convening of international conferences (e.g. the International Realist Conferences in 2021 and 2025), and the development of quality standards for reporting on REs (Wong et al., 2017) and realist syntheses (Wong et al., 2014). As evaluation practice grapples with rising complexity and calls for theory-driven explanation, RE appears well positioned to play an increasingly central role in future evaluation research, policy learning and practice.
At its core, RE seeks to answer the question of how programmes work, for whom, and under what conditions (Pawson and Tilley, 1997). It rests on the assumption that interventions are “theories incarnate” (Pawson and Tilley, 1997), meaning that programmes embody underlying assumptions about causal processes. The conceptual structure guiding RE is expressed through context–mechanism–outcome configurations (CMOCs) or variants thereof (De Weger et al., 2020), which articulate how mechanisms operating under particular contextual conditions generate outcomes. For realist evaluators, context is an irreducible and dynamic component of explanation (Craig et al., 2008; Greenhalgh and Manzano, 2021). Accordingly, CMOCs are developed through retroduction, a form of reasoning that moves from observed patterns to hypothesising about the underlying causal mechanisms that account for them.
The explanatory ambitions of RE are grounded in scientific realism, yet the precise nature of this grounding has been the focus of considerable theoretical debate. Early debates centred on whether RE required adherence to critical realism (Porter, 2015). In his rebuttal, Pawson (2016) resisted Bhaskarian social theory as overly normative and argued that RE does not aim to offer a full theory of social structure, agency or emancipation. Instead, RE’s purpose is to develop testable, context-sensitive causal explanations of programmes in action (Pawson, 2016). More recently, Mukumbang et al. (2023) reframe this divide by arguing that RE draws on an amalgam of scientific and critical realist principles. This account positions realist programme theories as retroductively developed, fallible and context-sensitive explanatory propositions. Functionally, they are the equivalent of middle-range theories that mediate between abstract ontology and empirical findings.
In unpacking CMOCs, REs are “methodologically promiscuous,” potentially using quantitative and qualitative data to test theories (Van Belle et al., 2016: 313). However, the compatibility of RE with certain research designs, such as experimental designs, remains contested (Blackwood et al., 2010; Jamal et al., 2015; Van Belle et al., 2016). To this day, the RE community remains divided between advocates of combining RE with experimental designs (e.g. Bonell et al., 2024) and those who are strongly opposed and argue that such realist trials are ontologically and epistemologically incongruent with realist principles (Van Belle et al., 2016). Such divisions illustrate how methodological debate continues to play a constitutive role in the development of RE.
Whereas rigour in (post) positivist experimental impact evaluation is grounded in randomisation as a means of isolating causal effects and minimising bias, realists emphasise rigour as explicit and reasoned theorising about programmes, the articulation of their underlying mechanisms and the iterative testing of these theories to build plausible explanatory accounts. To support such practices, realist scholars have developed the Realist And Meta-narrative Evidence Syntheses: Evolving Standards (RAMESES II) for design and reporting (Wong et al., 2017). Yet, given the expansive and flexible nature of RE, these standards are necessarily less prescriptive than those found in the experimental social and health sciences. As with other emerging methodologies, the application of RE remains open to interpretation and variation.
The maturation of RE as an evaluation approach is reflected in a growing number of reviews examining its application across diverse fields (e.g. Malengreaux et al., 2024; Taylor et al., 2024). These past reviews underscore both the methodological promise of RE and the challenges that accompany its practical implementation. Considered collectively, these reviews offer a comprehensive overview of the RE landscape.
Nearly three decades after Pawson and Tilley’s seminal publication, it is timely to take stock of what has been learned about the practice of RE. To this end, we conducted an umbrella review of published reviews of REs. An umbrella review synthesises evidence across multiple reviews on a given topic, offering a comprehensive overview of existing findings, identifying patterns and generating higher-level insights to guide future research and practice (Belbasis et al., 2022). This umbrella review addresses two related questions. First, what do existing reviews of RE studies reveal about how RE is practised, particularly with respect to conceptual, methodological and analytical issues across the RE cycle? Second, building on these insights, what overarching yet practically actionable principles can be articulated to address these challenges and strengthen rigour in future REs?
This article is structured in three parts. The review methodology is outlined first, followed by a presentation of the findings, which highlight key conceptual, analytical and methodological challenges related to RE. The article concludes by proposing and discussing a set of principles to inform and strengthen future RE practice.
Methodology
In this section, we describe the search, screening, coding and analysis procedures performed in the review.
Search strategy
To inform our umbrella review, we first conducted broad electronic searches in PsycINFO, PubMed, Web of Science, ERIC, Campbell Collaboration and Cochrane Libraries, supplemented by more targeted manual searches using Google Scholar, selected institutional websites and citation chasing. Search terms for the electronic searches combined “realist evaluation,” “realistic evaluation” and “review,” covering the period from 1997 to May 2025. Publications in English, French, German and the Scandinavian languages were considered.
All types of reviews—systematic, scoping, integrative, narrative, meta-reviews and meta-syntheses—that explicitly aimed to identify, appraise and/or synthesise findings from multiple RE studies were included. Reviews including both RE studies and realist syntheses were eligible. We included reviews published in academic journals and grey literature. We excluded review protocols, book reviews, opinion pieces, reviews of realist syntheses exclusively (with no individual RE studies), individual RE studies and topical realist reviews/syntheses of other types of studies (e.g. realist syntheses of non-RE studies on a given topic).
Our database searches initially identified 1133 publications. After abstract screening and removal of duplicates, 19 publications remained for full-text screening. Manual searching and citation chasing yielded an additional nine publications, resulting in 28 publications screened for eligibility. Following full-text screening, five articles were excluded as they did not meet the inclusion criteria. In total, 23 reviews of RE were included in the analysis. The selection process is depicted in the PRISMA diagram (Figure 1).

Figure 1. PRISMA diagram of literature search.
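The screening counts reported in the search strategy can be checked with a few lines of arithmetic. The following is a minimal sketch in Python; the function and key names are illustrative and are not part of the review protocol.

```python
def screening_flow(identified, kept_after_abstracts, manual_additions, excluded_full_text):
    """Recompute the counts of a simple two-stage screening flow.

    All parameter names are hypothetical labels chosen for this sketch.
    """
    # Records dropped during abstract screening and duplicate removal
    removed_early = identified - kept_after_abstracts
    # Database hits plus records found by manual searching and citation chasing
    full_text_screened = kept_after_abstracts + manual_additions
    # Records remaining after full-text exclusions
    included = full_text_screened - excluded_full_text
    return {
        "removed_early": removed_early,
        "full_text_screened": full_text_screened,
        "included": included,
    }


# Counts as reported in the text
flow = screening_flow(
    identified=1133,
    kept_after_abstracts=19,
    manual_additions=9,
    excluded_full_text=5,
)
print(flow)
```

With the reported inputs, the sketch yields 28 publications at full-text screening and 23 included reviews, matching the figures stated in the text.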
Coding and analysis
All publications were read by the authors. Coding and data extraction in NVivo were carried out by author 1, thus avoiding interrater reliability issues. Data were extracted using a two-pronged strategy. We used NVivo for detailed coding of the reviews and developed a summary table in Microsoft Excel to capture their key characteristics. The NVivo codebook, informed by prior knowledge of the literature, included the following overarching themes (parent nodes): (1) scope and objectives, (2) review type, (3) methodology, (4) data sources and search strategies, (5) domain, (6) methodological challenges, (7) analytical challenges, (8) conceptual challenges, (9) recommendations and (10) RAMESES standards. Data for the summary table were extracted by both authors and subsequently consolidated. The summary table was used to identify overall patterns across the reviews. While discussing and consolidating our own analysis, we also compared these findings against AI-generated output.
We applied two Generative Artificial Intelligence (GenAI) models, ChatGPT 5.1 and NotebookLM, for two purposes. First, we used them as a tool for researcher triangulation, with the models adopting the perspectives of a realist theorist and evaluator. We prompted the models to generate summary tables and draft text, which were considered alongside our own analyses. These AI-generated outputs helped to challenge and deepen our interpretations in areas we initially found underexplored, such as stakeholder involvement and the philosophical underpinnings of RE. When cross-validating the AI-generated summary tables against those produced using NVivo, we found the AI outputs to be of lower quality than our human data extraction, illustrating the current frontier of GenAI in research support (Dell’Acqua et al., 2023). Second, we employed the models as proxy persona-based article reviewers (Bougie and Watanabe, 2024), simulating the perspectives of a realist theorist, a review methodologist and an evaluation practitioner. These AI-generated reviews informed revisions to the article, including the addition of references to concrete applications and refinements to the proposed principles in the discussion section.
Limitations
This umbrella review presents several limitations that warrant consideration. First, as an umbrella review, it synthesises findings from existing reviews of REs rather than directly analysing primary RE studies. This second-order analysis may obscure important methodological nuances present in individual REs. Second, the gross number of REs cited across the reviews includes substantial duplication (about 50%), and the exact number of unique REs remains undetermined. This compromises the precision of claims about the breadth of empirical evidence underpinning the review. Third, past reviews are heavily weighted towards evaluations in the health sector and authored predominantly by researchers in Anglo-European contexts. This overrepresentation may limit the transferability of findings to other sectors or regions with different evaluation traditions. Moreover, while peer-reviewed literature provides a degree of rigour, the exclusion of unpublished evaluations, dissertations and grey literature in many past reviews introduces the risk of publication bias, particularly in a field where much evaluation practice occurs outside academic settings. Finally, while using a single coder in NVivo alleviates concerns about interrater reliability, it introduces the possibility of coder bias and limits the breadth of interpretation. To address this, we used summary tables to discuss and consolidate data extraction and employed two GenAI models to interrogate and cross-validate our interpretations. All findings were subsequently reviewed and discussed jointly by the authors to ensure consistency and analytical rigour.
Findings
In this section, we first describe the key characteristics of the 23 reviews included in this umbrella review: their publication year (Figure 2) and their country of origin, main purpose, domain, coverage and application of quality standards (Table 1). Informed by this initial description, we identify different review orientations and discuss three methodological challenges identified across the reviews.
Table 1. Key characteristics of the reviews.
Note. *First author.
Key characteristics of existing reviews
Publication year
The publication of reviews on RE has grown steadily over time, reflecting the broader expansion of the approach. Although Pawson and Tilley’s book (1997) marks the starting point, the first reviews did not appear until 2012 (Marchal et al., 2012; Ridde et al., 2012). Between 1997 and 2016, only six reviews were published, compared with 17 published between 2017 and 2025. This increase mirrors the wider proliferation of RE studies in recent years (Nielsen et al., 2022; Renmans and Pleguezuelo, 2023). The increasing trend in reviews of RE studies is illustrated in Figure 2.

Figure 2. Reviews on realist evaluation published per year (2012–2025).
Country of origin
The geographical distribution of published reviews of REs is uneven, with most reviews stemming from the Global North (21 reviews). Two reviews come from (leading economies in) the Global South (Brazil and South Africa). Most reviews originate from Anglo-Saxon contexts (Australia, Canada, South Africa and the United States) and European countries (Belgium, Denmark, Germany and the United Kingdom), with only a single review published outside these linguistic regions—in Brazil (Quintans et al., 2020). This pattern broadly reflects the distribution of REs more generally, though researchers in the United Kingdom appear less represented among review authors. Nielsen and Lemire (2025) document that 13.5 per cent of published RE studies were conducted in Africa, Asia or South America.
Purpose
The 23 reviews varied in their primary purposes, reflecting, in large part, the diverse ways RE is studied and applied. Some reviews are primarily descriptive and aimed to map the RE literature and identify key issues (e.g. Haunberger and Baumgartner, 2017; Lemire et al., 2020; Nielsen et al., 2022, 2023; Renmans and Pleguezuelo, 2023; Ridde et al., 2012). Other reviews are predominantly prescriptive and serve as a foundation for methodological development (e.g. Manzano, 2016; Quintans et al., 2020; Wong et al., 2017), while a third group of reviews are primarily normative and aimed to advance RE within specific sector domains, such as health or social work (e.g. Jenkins et al., 2021; Lam et al., 2021; Taylor et al., 2024).
Domain
The domains covered by the reviews varied, with health (encompassing public health, clinical settings and health systems) being the dominant and primary focus of 13 reviews. This trend reflects a similar domain concentration observed in individual RE studies (Nielsen and Lemire, 2025). Eight reviews were cross-sectoral, while the remaining two focused on other health-adjacent domains, specifically social work (Haunberger and Baumgartner, 2017) and food security (Lam et al., 2021).
Coverage
The empirical coverage of the 23 reviews varied considerably, reflecting differences in purpose, domain and, to some extent, publication year. The largest reviews addressed broad, cross-domain topics and were published more recently: Lemire et al. (2020) with 195 publications, Renmans and Pleguezuelo (2023) with 166 studies, and the two reviews by Nielsen et al. (2022) and Nielsen and Lemire (2025) with 126 publications. In contrast, other reviews included fewer than 20 publications, either due to a narrow purpose (e.g. review by Nielsen et al., 2023, on realist trials) or domain focus (e.g. review by Lam et al., 2021, on RE in food security) or because they were published in earlier years when fewer RE studies had been published (Marchal et al., 2012; Ridde et al., 2012; Salter and Kothari, 2014).
Collectively, the 23 reviews drew on 1049 publications (some studies represented multiple affiliated papers). Of these, we were able to retrieve 881 articles (84%). After removing duplicates, 446 unique RE studies remained. Extrapolating to the full set of reviews, we estimate that roughly 535 unique studies constitute the empirical basis of the 23 reviews.
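The figures in this paragraph can be reproduced approximately with a simple proportional extrapolation. The sketch below assumes that the share of unique studies among the retrieved articles also holds for the articles that could not be retrieved. This is an assumption made for illustration, since the exact estimation procedure is not detailed here, and the variable names are hypothetical.

```python
# Counts as reported in the text
cited_publications = 1049   # publications cited across the 23 reviews
retrieved_articles = 881    # articles that could be retrieved
unique_retrieved = 446      # unique RE studies after removing duplicates

# Share of cited publications that could be retrieved
retrieval_rate = retrieved_articles / cited_publications

# Share of retrieved articles that are unique, and its complement (duplication)
unique_share = unique_retrieved / retrieved_articles
duplicate_share = 1 - unique_share

# Assumption: the same unique share applies to the non-retrieved publications
estimated_unique_total = cited_publications * unique_share

print(f"Retrieval rate: {retrieval_rate:.1%}")
print(f"Duplication among retrieved articles: {duplicate_share:.1%}")
print(f"Estimated unique studies: {estimated_unique_total:.0f}")
```

Under this proportional assumption the estimate comes out slightly above 530, the same order of magnitude as the rounded figure of roughly 535 given above, and the duplication share is close to the 50 per cent noted in the limitations.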
Quality standards
Early reviews use key characteristics of RE to structure their analysis and to assess and identify challenges in the design and reporting of REs, but fall short of creating rubrics or standards to assess study quality (cf. Lacouture et al., 2015; Marchal et al., 2012; Salter and Kothari, 2014).
Wong et al. (2017) use their review as the basis, along with expert consultations and Delphi surveys, to develop the RAMESES II standards for designing and reporting RE studies. This effort marks an important juncture in codifying adherence to realist principles.
After the introduction of the RAMESES II standards, we found only two reviews that applied these standards in reviewing RE studies. Nielsen et al. (2023) applied the RAMESES II standards to assess whether a trade-off between RE and experimental design standards exists when conducting realist trials. Rees et al. (2024) partially applied RAMESES II standards, as they derived five criteria to assess the use of realist interviewing in health professions education research (HPER).
Only two of the remaining reviews explicitly addressed the RAMESES II standards as part of their review procedures. Malengreaux et al. (2022) and noted that they refrained from using the standards to assess study quality but did not provide a rationale. Several other reviews incorporated RAMESES reporting either as a screening criterion or as a coding variable (Dalkin et al., 2021; Lemire et al., 2020; Malengreaux et al., 2024; Nielsen and Lemire, 2025; Nielsen et al., 2022).
In a similar vein, none of the reviews set out to explicitly examine the philosophical underpinnings (epistemological and ontological) of RE studies and how these align with basic realist principles. Greenhalgh and Manzano’s review (2021) comes close, as they note that such assumptions are pivotal for the operationalisation of context. However, they refrain from presenting a systematic elicitation from the cases included in their review. Nielsen et al. (2023) also describe how differences anchored in critical or scientific realism, or their interpretation, may inform what research designs and techniques can be considered in adherence with RE principles in the context of realist trials. However, they fall short of exploring the issue in more depth.
Review orientations
Informed by the purpose and domain of the reviews, the main orientation (or focus) of each review can be identified. In Figure 3, we map the 23 reviews against three main orientations: methodological, conceptual and domain.

Figure 3. Diagram of review orientations.
Methodological reviews
Some reviews concentrate on methodological issues, such as how to design REs, collect and analyse data, and report findings. For example, Renmans and Pleguezuelo (2023) map the use and combination of different data collection methods across REs. Manzano (2016) develops principles for conducting realist interviews, while Nielsen and Lemire (2025) map analytical strategies and techniques employed in REs.
Conceptual reviews
Other reviews emphasise conceptual issues, such as how key RE constructs are defined and operationalised. Lacouture et al. (2015), Lemire et al. (2020) and Nielsen et al. (2022) focus on definitions and operationalisation of mechanism and context in RE studies. Greenhalgh and Manzano (2021) analyse the context construct with particular attention to its epistemological and ontological underpinnings. Similarly, Malengreaux et al. (2024) document how stakeholder involvement has implications for research and methods.
Domain reviews
A third group of reviews primarily adopts a domain focus, exploring the application of RE in specific domains, such as health or social work. Some reviews illustrate how RE can provide new heuristics in domains dominated by positivist paradigms (e.g. Lam et al., 2021; Taylor et al., 2024), while others map the use and implications of RE within social work (e.g. Haunberger and Baumgartner, 2017) or the health domain (e.g. Palm and Hochmuth, 2020). Some reviews combine domain focus with methodological issues, using the empirical domain as a backdrop for examining adherence to RE principles (Marchal et al., 2012) or developing procedural steps in RE (Quintans et al., 2020).
Some studies integrate all three orientations, echoing calls in the broader theory-based evaluation literature to connect substantive and implementation theory to programme and CMOC development (Lemire et al., 2020). For example, Dalkin et al. (2021) apply Normalisation Process Theory (NPT) to elucidate mechanisms in a health setting. Hitchcock et al. (2022) investigate how systems thinking informs programme theory and implementation in health systems, and Salter and Kothari (2014) use the PARiHS framework to identify mechanisms and contextual factors.
Methodological challenges
The authors of the 23 reviews identify several methodological challenges related to the design and implementation of REs. Across the reviews, we identify three main challenges: methodological congruence, methodological convergence and methodological transparency.
Methodological congruence
Methodological congruence refers to the extent to which a study’s design is logically and philosophically coherent (Creswell, 2013). Methodological congruence is particularly salient in RE, where the research design, data collection and analysis methods in unison should enable the testing of the programme theory.
While realist scholars broadly agree on methodological pluralism—allowing flexibility in tailoring research designs and methods to specific needs—RE is nonetheless grounded in epistemological and ontological foundations rooted in scientific and/or critical realism. This raises an important question: Are some designs inherently incompatible with the philosophical underpinnings of RE?
Several reviews have addressed this issue of philosophical alignment. Nielsen et al. (2023) summarise the debate on realist trials (see also Bonell et al., 2024; Van Belle et al., 2016), showing that while integration of RE with designs grounded in other ontological traditions is feasible in practice, it can present significant challenges in adhering to both established quality standards for randomised controlled trials and RAMESES II standards at the same time. Often, the impact study (involving the randomised controlled trial design) and the implementation study (using an RE approach) are reported separately. In their review, Nielsen et al. (2023) identified only two realist trials (out of 16 studies) that adhered to quality standards for both RE and randomised controlled trials and successfully integrated both of these aspects of their design.
Early reviews (Marchal et al., 2012; Ridde et al., 2012) noted that REs often lacked explicit CMOC logic, exhibited weak theorising and provided limited explanations of generative mechanisms. More recent reviews report persistent variation in how key realist constructs are defined, as well as analytical challenges in distinguishing mechanisms from both context and programme components (Lemire et al., 2020; Nielsen et al., 2022).
Greenhalgh and Manzano’s (2021) review of context in RE illustrates how study design and method selection shape the way realist constructs are conceptualised, operationalised and analysed. Greenhalgh and Manzano identify marked ontological and epistemological variation across realist studies, with some conceptualising context in a largely positivist or actualist manner (as static, observable features that trigger mechanisms) while others adopt a more scientific realist stance, treating context as relational, dynamic and constitutive of causal processes. These differences shape whether studies aim primarily at identifying transferable conditions for implementation or at developing explanatory, middle-range theories of how context–mechanism interactions evolve over time. Consequently, they argue that context in RE should not be treated as a statistical variable but rather as an irreducible and dynamic component of explanation, operating through interwoven micro-, meso- and macro-level processes.
Such differences in epistemological and ontological assumptions were also observed in other reviews. For example, the reviews by Palm and Hochmuth (2020) and Salter and Kothari (2014) reveal varied philosophical orientations among realist evaluators—some drawing on critical realism, others on scientific realism, and still others adopting a more pragmatic stance—though these positions are often left implicit. Defining rigour in terms of a realist logic of inquiry, therefore, requires explicit reflection on epistemological and ontological assumptions in study design, a point also emphasised by Renmans and Pleguezuelo (2023).
In sum, the uneven alignment between designs and realist principles identified across reviews highlights the need for greater attention to aligning study design and analytic procedures with the realist logic of inquiry in a coherent methodology. Methodological congruence may be supported by adherence to guiding principles and standards for RE. Although the introduction of the RAMESES II standards for design and reporting RE (Wong et al., 2017) has provided a framework for such alignment, to date, only one review (Nielsen et al., 2023) has applied these standards systematically to assess adherence to realist principles. More consistent application of the RAMESES II standards, combined with methodological innovations, such as realist interviewing and focus group techniques, may help strengthen methodological congruence in future studies.
Methodological convergence
Methodological convergence refers to strengthening the rigour of findings by demonstrating consistency across different methods or sources (Sánchez-Gómez and García, 2018). This can be achieved by triangulation of different data sources, methods, analytical techniques and theoretical perspectives (Yin, 2018). Methodological convergence is particularly important in RE because adequately explaining outcomes and the context–mechanism interactions that generate them often requires the integration of multiple data sources and methods.
Several reviews show that RE studies commonly employ multiple strands of qualitative or mixed-methods data collection (Nielsen and Lemire, 2025; Renmans and Pleguezuelo, 2023), providing opportunities to design methodologically convergent lines of inquiry. Renmans and Pleguezuelo (2023) report a particularly high reliance on interviews (97%), followed by observations (participant, video and document/monitoring data) (55%), surveys (26%) and what they term “innovative methods” such as vignettes, diaries and photographs (8%). They further note that realist interviews specifically are used in only 18 per cent of studies, and they call for a broader range of data collection techniques to better elucidate mechanisms, as well as greater attention to sampling strategies and realist-informed survey methods.
When examining data sources, Renmans and Pleguezuelo (2023), however, observe that half of the REs are based on data solely from the users/beneficiaries or the key informants (policymakers, implementers, service deliverers, etc.). Although specific evaluation circumstances may justify this focus on one group of respondents, the influence of interests and social position . . . may give a biased and incomplete understanding of the intervention. (p. 6)
In a similar vein, Malengreaux et al. (2024) found that REs rarely provide a theoretically grounded account of who stakeholders are, when and how they should be involved, and why their perspectives matter. In most cases, stakeholder involvement was limited to knowledge validation, with decision-making authority concentrated among evaluators rather than shared with participants or communities. This potentially raises questions about the extent to which perspectives across stakeholder groups are fully integrated into the REs.
Considered collectively, these findings suggest that a more purposeful pursuit of triangulation across not only data collection methods but also data sources is called for.
Methodological transparency
Methodological transparency refers to conceptual transparency (clarity in defining and operationalising realist constructs) and procedural transparency (clarity in documenting analytical processes and methodological choices). A recurring concern in the reviews is the lack of transparency in how realist evaluators and researchers define, apply and report key concepts and analytical procedures.
Reviews consistently highlight that REs often lack clear and consistent definitions of core constructs (Lacouture et al., 2015; Nielsen et al., 2022; Ridde et al., 2012; Salter and Kothari, 2014). Conceptual confusion is evident in the varied or absent definitions of foundational constructs, with risk of conflation or misapplication. Greenhalgh and Manzano (2021) found that only 45 per cent of studies included an explicit definition of context, while Nielsen et al. (2022) reported a similar figure (48%). Lemire et al. (2020) observed that nearly half of the studies did not define mechanisms, and when definitions were provided, they varied widely across studies. They also found that these differing conceptualisations of mechanisms can be traced to Pawson and Tilley’s work (1997) and that of Astbury and Leeuw (2010). In a similar vein, Greenhalgh and Manzano (2021) argue that different conceptualisations of context can be traced to different philosophical and methodological underpinnings.
Comparable inconsistencies were also found in how outcomes were defined and assessed (Salter and Kothari, 2014). This lack of definitional clarity contributes to persistent confusion between mechanisms and context (Lacouture et al., 2015; Nielsen et al., 2022; Ridde et al., 2012; Salter and Kothari, 2014). Taken together, these observations raise questions about how core realist principles are being operationalised in practice (Marchal et al., 2012).
RE explicitly encourages the development and testing of multiple CMOCs (De Weger et al., 2020). Programmes are often complex and likely to operate through different mechanisms depending on context. Studies that elicit or test too few CMOCs risk oversimplifying these dynamics and narrowing the scope of theory building. By contrast, working with several CMOCs within the same study allows evaluators to capture the heterogeneity of contexts, explore alternative mechanisms and generate more robust and practically useful explanations of what works, for whom, how and under which circumstances. However, this comes at the risk of losing sight of key explanatory drivers. This calls for transparent and reasoned identification and selection of CMOCs in the different phases of the realist cycle (De Weger et al., 2020).
Nielsen and Lemire (2025) reported a range from 1 to 23 CMOCs per evaluation, with an average of 4.1 CMOCs per evaluation, and noted that 77 per cent of evaluations contained five or fewer CMOCs. They further observed that a limited explanation was provided for why certain CMOCs were prioritised for testing while others were abandoned when moving from the initial programme theory to the ones being tested.
Additional concerns arise in relation to the use of data collection methods. Several reviews emphasise that the procedures for selecting methods and data sources (and analytical strategies) are often underdescribed or missing altogether (Haunberger and Baumgartner, 2017; Salter and Kothari, 2014). As just one example, Rees et al. (2024) found that studies often failed to specify how realist interview questions aligned with initial CMOs and how empirical data contributed to refining the final CMOs.
When testing the CMOCs, reviews also identified limited procedural transparency in reporting analytical steps, techniques and methodological decisions. Nielsen and Lemire (2025) found that only half of the reviewed studies explicitly reported the analytical techniques employed. While evaluators frequently employ thematic analysis, Nielsen and Lemire (2025) caution that such approaches are not always congruent with the analytical needs in a realist logic of inquiry.
Jenkins et al. (2021) similarly observed inconsistent documentation of coding strategies or techniques for developing and refining CMOCs in the field of nutrition and dietetics, while Hitchcock et al. (2022) noted that many health system evaluations used CMOCs only partially or failed to link them transparently to their findings.
Taken together, these findings suggest that real-world RE remains a maturing approach, characterised by variability in both conceptual clarity and procedural rigour. This requires more explicit definitions of key constructs, clearer documentation of analytical processes and transparent reporting of methodological choices. Adopting Morse et al.’s (1996) criteria for concept maturity (clarity of definition, delineated boundaries, specified preconditions and observable outcomes), RE appears best characterised as still developing rather than fully consolidated. This has important implications. Persistent variation in the definition and operationalisation of core constructs may constrain cumulative middle-range theory building, underscoring the need for greater conceptual clarity to support methodological congruence and analytical rigour. Moving forward, developing codified procedures and tools, alongside consistent application of frameworks such as the RAMESES II standards, may help strengthen both conceptual and procedural transparency in RE.
Recommendations for future realist evaluation
Motivated by the challenges identified, many of the reviews provide recommendations on how to improve future REs. To organise the most salient recommendations, we adopt Salter and Kothari’s (2014) framework of four phases of RE: (1) formulation of an initial programme theory articulated as CMOCs, (2) data collection informed by the CMOCs, (3) data analysis and testing of the CMOCs and (4) refinement of CMOCs based on the findings. These recommendations are summarised in Table 2. Taken together, they underscore the need for clearer definitions, codified procedures and explicit reasoning throughout the RE cycle.
Table 2. Recommendations across four phases of an RE.
Source. Adapted from Nielsen and Lemire (2025), originally adapted from Salter and Kothari (2014).
Discussion
This umbrella review of 23 published reviews on RE reveals a maturing, yet heterogeneous, field, which is still grappling with persistent methodological challenges. These challenges largely revolve around ensuring congruence between epistemological assumptions, study design and analytical strategies; convergence of findings through triangulation; and issues related to lack of conceptual and procedural transparency. Nonetheless, authors express cautious optimism, with several reviews proposing frameworks and recommendations to scaffold future rigorous realist practice.
In advancing quality in RE, different instruments serve different purposes. Standards such as RAMESES II are valuable for quality assessment and peer review but offer limited guidance for real-world decisions about designing and implementing an RE study. Nevertheless, the RAMESES II standards have played an important role in supporting the maturation of RE as an evaluation approach. Building on this work, our aim is not only to provide recommendations drawn from existing reviews but also to go a step further by proposing a set of principles to guide realist practice. Principles serve as heuristics for thinking and decision-making, helping evaluators navigate trade-offs, design choices and philosophical tensions in real-world contexts—for example, aligning evaluations with commissioner requirements such as the MAGENTA guidelines (HM Treasury (British government), 2020).
Drawing on the work by Patton (2017), the proposed principles are developed using the GUIDE criteria: Guiding, Useful, Inspiring, Developmental, Evaluable. Using these criteria makes each proposed principle explicit, justified and actionable. The first five principles (Principles 1–5) focus on laying the conceptual and methodological foundations. Together, they create the conditions for rigour, ensuring that from the outset an RE has conceptual clarity, methodological coherence and a theoretically informed framework. As documented above, shortcomings in these areas are commonly raised across past reviews.
The second set of five principles (Principles 6–10) shifts focus from foundational issues to execution. These principles aim to help evaluators ensure that the evaluation is conducted and reported in a way that delivers transparent, reasoned and plausible explanations. They address the weaknesses in procedural transparency, methodological execution and stakeholder engagement that the review identified.
We readily admit that a structured and inclusive developmental process akin to the RAMESES I and II projects (Wong et al., 2014, 2017) would undoubtedly have strengthened the legitimacy and quality of the proposed principles. Nevertheless, we believe the proposed principles can be practically useful for guiding the selection of study design, data sources and collection methods, and analytical techniques. Informed by the current umbrella review, and drawing particularly on the work by Nielsen and Lemire (2025), we posit that the principles may help realist evaluators in at least two ways: practically and heuristically. By informing RE practice, the principles may help evaluation practitioners avoid a number of the practical challenges identified in the current review. The principles address both foundational and procedural issues in RE. As such, they may help evaluators better adhere to RAMESES II standards, inform the development of protocols and the training of RE practitioners, and guide future reviews of RE studies.
The principles may also serve to support future RE studies heuristically. Collectively, these principles are oriented towards strengthening RE’s contribution to middle-range theory building. By promoting conceptual clarity, methodological congruence and transparent reasoning, they seek to create the conditions under which explanatory propositions can accumulate, be refined across contexts, and travel beyond single case studies. Advancing RE in this direction is essential if it is to fulfil its ambition of generating transferable, yet context-sensitive, causal explanations.
Conclusion
This umbrella review synthesises key challenges and advances in RE, drawing on 23 published reviews. While RE offers strong potential for generating explanatory insights, the published studies also document conceptual ambiguity, methodological inconsistency and analytical opacity. To address these gaps, we proposed a set of guiding principles aligned with Patton’s GUIDE criteria to promote greater methodological congruence, convergence and clarity in future RE studies. The proposed principles aim to strengthen realist practice and support the continued development of standards, tools, evaluator training and rigour in thinking in future RE studies.
Acknowledgements
The authors would like to thank Ray Pawson and Stine Øien Dandanell Garn for constructive critique of earlier versions of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study was funded by institutional resources.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
AI use declaration
The authors used ChatGPT (GPT-5.1, OpenAI) for data extraction and writing support, including language refinement and summary drafting during article development. AI was also used as a persona-based proxy reviewer of the article. All AI-assisted text was edited, verified and approved by the authors, who take full responsibility for the content and its interpretation. No confidential or proprietary information was provided to AI systems, and no AI tool is credited as an author.
