Abstract
This Method Note presents an integrative review of 117 Research on Evaluation (RoE) studies published in the American Journal of Evaluation (2014–2024). We combined thematic analysis with a structured quality appraisal using established critical appraisal tools. Findings show that the RoE literature focused primarily on evaluation activities and professional issues, with far fewer studies examining evaluation outcomes or contexts. Qualitative descriptive designs were most common, whereas experimental designs were rare. Overall methodological quality was moderate to high, but with recurring weaknesses: limited theoretical grounding and reflexivity in qualitative studies, limited attention to confounding and sampling issues in quantitative studies, uneven quality between the qualitative and quantitative strands of mixed-methods studies, and limited application of systematic review techniques. We discuss implications for improving RoE research, such as broadening study foci and adopting more rigorous design and reporting standards to strengthen the evidence base for evaluation practice.
Defining Research on Evaluation
Research on Evaluation (RoE) is broadly concerned with studying evaluation itself—its theories, methods, practice, and professional work—with the aim of strengthening evaluation knowledge and, in turn, improving evaluation practice (Henry & Mark, 2003; Mark, 2008). At the same time, the field has repeatedly noted that we still lack a strong, cumulative evidence base about evaluation processes and what evaluations accomplish in real-world settings, which is why calls for more RoE have been so consistent (Alkin, 2003; Henry & Mark, 2003; Mark, 2008).
A persistent challenge, however, remains definitional. Recent work shows that RoE has been defined in multiple, partially overlapping ways, and that these differences matter because they shape what gets counted, synthesized, and learned from (Arbour, 2025; Aston et al., 2025; Linnell & Stachowski, 2025). For example, Aston et al. (2025) summarize several widely used definitions over the past decade and note that the field is not unified around a single definition. Arbour (2025) similarly cautions against treating current RoE definitions as definitive and argues that excluding nonempirical work may be unnecessarily limiting depending on the purpose of the inquiry. We recognize that an exclusively empirical definition can omit conceptual and methodological contributions that shape how evaluation is studied and practiced. We nonetheless made a narrower choice here because our aim was to characterize observable research designs, methods, and appraisal indicators in published studies.
For this review, we adopted Coryn et al.'s (2016) RoE definition because it is explicitly empirical and therefore aligns with our goals: describing methodological approaches and appraising methodological quality using formal critical appraisal tools. We are therefore not attempting to resolve the broader question of what RoE should include. We acknowledge the ongoing definitional debate and return to its implications when interpreting our findings (e.g., Arbour, 2025).
Background
Although RoE scholarship has expanded, prior syntheses show an uneven distribution of topics and methodological approaches. Vallin et al. (2015) cataloged American Journal of Evaluation (AJE) RoE studies from 1998 to 2014 and noted considerable imbalance in subject focus. Coryn et al. (2017) reviewed 14 journals (2005–2014) using Henry and Mark's (2003) agenda and Mark's (2008) taxonomy and similarly found that most RoE studies were descriptive, with comparatively fewer examining evaluation outcomes or evaluation contexts. Galport and Galport (2015), analyzing a subset of Vallin et al.’s sample, documented methodological shifts but did not assess study rigor. Collectively, these reviews suggest RoE has often leaned heavily toward descriptive approaches and has given less attention to evaluation consequences and contexts relative to evaluation activities. A recent meta-RoE update (Linnell & Stachowski, 2025) continues this line of work and shows growth in RoE volume.
What has been missing across most of this mapping work is systematic attention to how well RoE studies are executed and reported. Coryn et al. (2017), for example, explicitly excluded quality assessment and expressed concern about the strength of conclusions drawn from a largely descriptive and unappraised evidence base. Brandon and Singh (2009) offered one of the few early attempts to examine methodological quality in evaluation-use studies; however, their assessment relied on a narrow set of content-validity-focused criteria and drew on a broad body of sources (including narrative reflections and case accounts), limiting its applicability to appraising the rigor of empirical RoE studies. As a result, this remains a critical gap: without systematic attention to rigor, the field risks drawing conclusions from uneven evidence (Brandon & Singh, 2009; Donaldson, 2022, 2025; Villalobos et al., 2025).
Other disciplines have moved toward clearer standards for transparent reporting and structured appraisal of bias (e.g., Preferred Reporting Items for Systematic Reviews and Meta-Analyses [PRISMA]-style guidance and design-appropriate risk-of-bias tools). Similar attention is needed in RoE. This need has also been reiterated in recent RoE-focused scholarship, including a 2025 New Directions for Evaluation special issue that highlights ongoing definitional challenges and calls for clearer standards to support higher-quality RoE and more accurate meta-RoE (Aston et al., 2025).
This integrative review addresses these needs by examining what topics recent RoE studies have explored and evaluating how well those studies were conducted. We focus on RoE articles published in AJE from 2014 to 2024, asking: (1) What themes and subjects do they investigate, and how have these trends shifted over time? (2) What research methods and designs are most commonly used? (3) What is the methodological quality of these studies, as judged by established appraisal criteria?
Methods
Sample Selection and Inclusion Criteria
AJE was selected because of its central role in RoE scholarship and to maintain continuity with prior reviews (e.g., Coryn et al., 2017; Galport & Galport, 2015; Vallin et al., 2015). We began with a census of all items published in AJE between 2014 and 2024 (n = 504). Records categorized by the journal as non-article content (e.g., editorials, book reviews, commentaries, letters, and corrections) were removed before screening, leaving 389 abstracts for review (Figure 1).

Figure 1. Selection process for identifying RoE studies in AJE. Note. aRecords categorized by AJE as editorials, book reviews, commentaries, letters, or corrections were excluded prior to screening. bExcluded studies were those that did not meet the operational definition of empirical RoE; narrative reflections and case studies without an explicit methods section or systematic data collection were also excluded. cMeasurement and tool validation studies and simulations were excluded from quality appraisal because suitable critical appraisal tools were not available for those study types. AJE = American Journal of Evaluation; RoE = Research on Evaluation.
Given ongoing definitional variation in the field, we adopted Coryn et al.'s (2016) definition of RoE as “any purposeful, systematic, empirical inquiry intended to test existing knowledge, contribute to existing knowledge, or generate new knowledge related to some aspect of evaluation processes or products, or evaluation theories, methods, or practices” (p. 161). This definition aligned with our focus on methodological quality appraisal and guided our screening decisions.
Articles were included if they (1) were published in AJE between January 1, 2014, and December 31, 2024; (2) met this definition of RoE; and (3) reported a defined dataset and a systematic method for data collection and analysis. This process yielded 117 included studies. Of these, 105 were eligible for methodological quality appraisal; 12 were retained in the review but excluded from appraisal because no suitable checklist was available for those designs (e.g., simulations and tool validation studies). Figure 1 summarizes the selection process, and the full list of included studies is provided in the Supplemental material.
To apply the definition consistently, we used decision rules refined through pilot coding. Studies were classified as empirical RoE when they presented a clearly defined qualitative, quantitative, or mixed dataset and described a systematic analytic or data collection procedure. Narrative reflections and illustrative case accounts without those features were excluded. Case studies were included when authors specified case selection, data sources, and an analytic approach. Embedded studies were included only when they were framed as producing transferable knowledge about evaluation theory or practice rather than assessing program outcomes alone. Tool development and validation studies were included in the review but not in the quality appraisal.
Coding Framework and Thematic Classification
Lessons learned from a pilot study on a subset of this sample (2014–2021; 74 studies) informed the coding framework and decision rules. In the pilot, the lead author and a senior RoE researcher independently coded small design-diverse sets of articles to operationalize key variables and refine decision rules, resolving discrepancies through consensus. All 117 studies were thematically classified using Mark's (2008) subjects-of-inquiry taxonomy. Because our aim was not only to describe study focus but also to examine research design and quality, we relied on the subjects-of-inquiry taxonomy and did not use Mark's (2008) modes-of-inquiry taxonomy, which did not align with the study design distinctions needed for the critical appraisal tools. Using the subjects-of-inquiry taxonomy, we assigned each article to one primary theme and subtheme (Table 1). Mark (2008) provides illustrative subthemes; we adopted these and added several additional subthemes to better capture recurrent topics in our sample and reduce reliance on “other” categories noted by Coryn et al. (2017). Although theme categories were not mutually exclusive, each study was coded to a single theme and subtheme that best captured its primary objective or main research question; overlaps were addressed during pilot consensus discussions. We acknowledge that assigning a single category may underrepresent overlap among subjects of inquiry in the RoE literature, an issue also highlighted by Coryn et al. (2017).
Disciplinary domains (e.g., education and healthcare) were coded using adapted categories from Vallin et al. (2015) and Brandon and Singh (2009). Each study was also classified by methodological approach (qualitative, quantitative, mixed-methods, or review), and we noted whether primary or secondary data were used. For analyses of publication trends, publication year was coded as the year an article was first published online by the journal. As a result, this analytic year may differ from the final issue year shown in the reference list for a small number of articles.
Study Design Classification and Methodological Quality Appraisal
We used a decision tree (Figure 2) to guide classification of study designs and selection of appraisal checklists. Studies were classified using typologies adapted from the Joanna Briggs Institute (JBI) levels of evidence framework because JBI offers design-specific guidance and appraisal tools that align with the wide range of qualitative, quantitative, and review designs represented in this RoE sample (Aromataris et al., 2024). Studies were coded as qualitative, quantitative, mixed-methods, or review based on methodology and stated intent. Qualitative studies without a specific label were coded as qualitative descriptive when they reported qualitative data without an explicit theoretical or methodological framework. Quantitative studies were grouped as experimental (e.g., randomized controlled trials and quasi-experiments) or nonexperimental; nonexperimental studies were further classified as analytical (e.g., regression or correlation models) or descriptive (e.g., distributions or prevalence without modeling). Review and synthesis studies were assigned based on the review's primary purpose and reported procedures, given inconsistent and evolving terminology for reviews (Grant & Booth, 2009). Using JBI guidance, some articles described by their authors as “systematic reviews” were reclassified as scoping reviews because their main aim was to map a topic area (e.g., describe trends, summarize what has been studied, and identify gaps) rather than to synthesize findings in a way that supports conclusions about effectiveness or outcomes. For scoping reviews, JBI does not require critical appraisal of included studies. For that reason, any checklist items related to appraising included evidence were marked N/A for scoping reviews and were not counted in the scoring denominator. This was done so that scoping reviews were not penalized for omitting a step that is not expected for that design. In contrast, systematic reviews that synthesized findings were expected to report critical appraisal of included studies.

Figure 2. Flowchart for Study Design Classification and Critical Appraisal Checklist Assignment. Note. Developed by the authors based on Hong et al.'s (2018) study-design selection algorithm. Checklist items noted as not applicable (N/A) to a given design were excluded from scoring denominators. A link to each critical appraisal checklist is available in Appendix A.
To assess methodological quality across the range of designs represented in this RoE literature, we used applicable JBI critical appraisal checklists and the Mixed Methods Appraisal Tool (MMAT) integration criteria. We used the MMAT because, unlike design-specific checklists alone, it explicitly appraises the integration of qualitative and quantitative components in mixed-methods studies. Mixed-methods studies were appraised using the relevant JBI checklist(s) for the qualitative strand, the relevant JBI checklist for the quantitative strand, and the MMAT integration criteria. We calculated a composite mixed-methods score as the average of the three component scores (equal weights) and then applied a cap based on the MMAT principle that overall quality cannot exceed the weakest strand (Hong et al., 2018). Used together, these tools provided a consistent framework for appraising the RoE studies in this review.
The 105 eligible studies were appraised using the applicable JBI checklist (8–13 items; “yes” = 1, “no/unclear” = 0; “not applicable” excluded) and the MMAT integration criteria (5 items; “yes” = 1, “no/can’t tell” = 0). For each study, we calculated a total score as the percentage of criteria rated “yes.” Because JBI does not prescribe quality cutoffs, we drew on prior systematic reviews that used cutoffs ranging from ≥70% to 80% to indicate high quality (e.g., Akl et al., 2021; Varmaghani et al., 2024). We therefore classified overall quality as high (≥80% of criteria met), moderate (50–79%), or low (<50%).
Although the JBI and MMAT checklists provided a clear structure, several criteria still required judgment (e.g., whether components were “adequately described” or whether there was “congruity” across the question, methodology, and analytic approach). To support consistency, we relied on an appraisal guide that required a brief written rationale tied to text evidence and included design-specific examples. For example, for the qualitative checklist item assessing congruity between the research methodology and the research question or objectives, we coded “Yes” when the question and approach were clearly aligned (e.g., an ethnographic question paired with observation or interviews); “No” when the stated methodology was misaligned with the question or methods (e.g., a “grounded theory” claim without theory-building procedures); and “Unclear” when methodological detail was insufficient to determine alignment. For review studies, we judged search strategies as “appropriate” only when databases, search terms, and inclusion criteria were reported in sufficient detail to be reproducible. Consistent with our decision rules, we used N/A only when a criterion was not applicable.
Data Synthesis and Consistency Checks
The lead author conducted data extraction and scoring for the full sample using the finalized decision rules and design-specific appraisal guidance. To support intra-rater consistency over the course of coding, the lead author periodically rechecked earlier coded articles at multiple points, with emphasis on borderline classifications and outliers, and verified decisions against the written rules and checklist instructions. Unresolved questions were discussed with the senior RoE researcher, and resulting clarifications were applied consistently. Findings were summarized using descriptive statistics. Frequencies and percentages were used to describe thematic categories, study designs, and data sources, and means and standard deviations were used to summarize critical appraisal scores. Appraisal findings were interpreted descriptively, with attention to patterns that recurred within study designs. Figures were created using Microsoft Excel and Python, with Python figures produced using matplotlib and seaborn (Hunter, 2007; Waskom, 2021).
Results
Thematic Trends
From 2014 to 2024, RoE publication counts fluctuated across years rather than showing a uniform upward trend, although RoE remained a visible and recurring area of inquiry in AJE (Figure 3). The majority of the 117 studies focused on Evaluation Activities (procedures and methods of evaluation; about 57% of studies) or Professional Issues (evaluators’ roles, competencies, or the profession itself; about 26%). Within Evaluation Activities, most studies examined practices (31%) and overarching approaches (18%), with relatively few tool validation studies (9%). Within Professional Issues, studies centered on evaluator dynamics (11%) and identity and training (8%), with fewer examining competencies or standards or mapping the trends and scope of RoE. In contrast, only a small fraction examined Evaluation Consequences (e.g., use or influence of evaluation findings; 9%) or Evaluation Contexts (how organizational or social context affects evaluation; 8%) (Figure 4). These findings indicate that RoE in the past decade has continued to emphasize how evaluations are done and the evaluator community, while questions about the outcomes and broader impact of evaluation remain underexplored (see Tables 2 and 3 for the distribution of themes and subthemes and illustrative examples).

Figure 3. Publication trend over time (N = 117).

Figure 4. Evaluation themes over time (N = 117).
Table 1. Operationalization of Mark's (2008) Subject of Inquiry Taxonomy.
Note. aThe definition and scope reflect how we interpreted Mark's (2008) subject of inquiry subthemes. While Mark's taxonomy provides definitions for the broad themes with examples of studies, definitions for subthemes were not specified.
bSubthemes adapted or added and not part of Mark's (2008) taxonomy.
Table 2. Study Themes, Disciplinary Domains, and Data Sources in RoE Studies (2014–2024).
Note. All percentages are based on the total number of studies (N = 117). Percentages are rounded to the nearest whole number and may not total 100% due to rounding. Two subthemes, Context (under Evaluation Consequences) and Evaluation-specific (under Evaluation Contexts), had no studies. See Table 1 for the full coding framework.
Secondary (Review/Synthesis) refers to scoping, systematic, or other review studies that analyzed existing published literature or evaluation reports rather than collecting original primary data. RoE = Research on Evaluation.
Table 3. Illustrative Examples of Themes and Subthemes Published in AJE, 2014–2024.
Note. aSubthemes adapted or added and not part of Mark's (2008) taxonomy. AJE = American Journal of Evaluation.
Disciplinary Domains
More than half of the RoE studies fell under the disciplinary domain of evaluation (57%). Studies were coded into this category when they examined evaluation in general rather than in a specific sector. Education (16%) and healthcare (9%) were the most common applied settings, while social services, public policy, nonprofit or philanthropy, and international development appeared less frequently (3%–5% each).
Methodological Approaches
Among the 117 RoE studies, qualitative methods were the most prevalent. Purely qualitative studies accounted for 38% of the sample (with simple qualitative descriptive designs being especially common). Quantitative studies comprised about 27%, most of which were nonexperimental surveys or correlational studies (analytical cross-sectional designs); only a handful of experimental or quasi-experimental studies (6% combined) were published in the entire period. Mixed-methods studies represented 16% of the sample, typically integrating a qualitative component with a quantitative survey. Review studies made up 19% and were usually scoping or mapping reviews; very few performed formal meta-analysis or meta-evaluation (see Figure 5). Fewer than half of the RoE studies relied on newly collected primary data (47%); most instead drew mainly on secondary sources (53%). Secondary use took three main forms: reanalysis of existing datasets such as surveys or administrative records (21%), use of secondary synthesis data from scoping or systematic reviews (19%), and mixed designs that combined primary and secondary data (13%). Qualitative and mixed-methods studies were more likely to collect primary data, whereas quantitative studies more often analyzed existing datasets. Taken together, this pattern suggests a reliance on available information rather than generating new data to study evaluation practice (see Tables 2 and 4).

Figure 5. Methodological approaches over time (N = 117).
Table 4. Methodological Approach and Study Design Characteristics of RoE Studies (N = 117).
Note. Categories reflect coder classification, not necessarily author labels. Percentages reflect proportions of study types across the entire sample (N = 117) and are rounded to the nearest whole number; they may not total 100% due to rounding. RoE = Research on Evaluation.
Methodological Quality
Overall, the quality of RoE studies was moderate to high. Of the 105 appraised studies, 60% met at least 80% of the appraisal criteria (high quality), and 38% were in the 50% to 79% range (moderate quality). Only 2% fell below 50%. Descriptively, appraisal scores were higher for studies published from 2018 to 2024 than for those published from 2014 to 2017, suggesting methodological quality in RoE studies may have improved somewhat over time (see Table 5 and Figure 6). However, our appraisal identified several recurring methodological weaknesses across studies. Many qualitative studies did not explicitly state their theoretical or epistemological framework and lacked reflexivity (e.g., no discussion of the researcher's perspective or bias). Common issues in quantitative studies included insufficient information on how confounding variables were addressed and a lack of justification for sample size or discussion of response bias. Mixed-methods studies often showed imbalance in quality between the qualitative and quantitative components (the qualitative strand was frequently weaker), and integration of findings was sometimes superficial. Most of the 22 review studies were classified as scoping reviews and other interpretive reviews, for which formal critical appraisal of included studies is not expected. Of the six studies classified as systematic reviews, where critical appraisal is a standard expectation, only two reported appraising the quality of included studies. Notably, two scoping reviews conducted appraisal, although not required to do so. Overall, formal appraisal of included evidence remained uncommon across RoE syntheses.

Figure 6. Quality rating over time (N = 105).
Table 5. Mean Quality Scores by Methodological Approach (N = 105).
Note. Mean scores reflect the proportion of applicable JBI/MMAT criteria met. High = ≥80% of applicable criteria met, moderate = 50%–79%, and low = <50%. JBI = Joanna Briggs Institute; MMAT = Mixed Methods Appraisal Tool.
Discussion
Our review provides an updated picture of RoE in AJE over the past decade and reveals both areas of progress and areas for improvement. On the one hand, RoE scholarship maintained a visible presence in AJE over the review period, reflecting continued attention to building a more evidence-based understanding of evaluation. This pattern aligns with recent meta-RoE work showing a marked increase in RoE across evaluation journals (Linnell & Stachowski, 2025). At the same time, recent RoE scholarship is a useful reminder that these counts depend on how RoE is defined and how borderline cases are handled (Arbour, 2025; Aston et al., 2025). For this review, we used Coryn et al.'s (2016) empirically anchored definition because our goals included classifying study designs and appraising methodological quality, which required included studies to present a defined dataset and a systematic method. This choice likely made our screening more conservative and may help explain why our AJE-only count for overlapping years is lower than that reported in a recent meta-RoE synthesis (Linnell & Stachowski, 2025). Linnell and Stachowski (2025) themselves noted that their operationalization may have differed from that of Coryn et al. (2017), whose approach they applied to a different body of studies; this further underscores how definitional and screening decisions can affect what gets counted over time. Against that backdrop, differences between our AJE-only RoE counts and those reported in broader meta-RoE reviews are not surprising, especially for borderline articles in which methods are not reported in sufficient detail to support design classification or appraisal.
We found that RoE studies continue to concentrate on how evaluations are conducted and on the evaluation profession itself, mirroring earlier findings (Coryn et al., 2017). This pattern is also consistent with the recent meta-RoE update by Linnell and Stachowski (2025), in which most published RoE remained descriptive and centered on evaluation activities. The persistent underrepresentation of studies on evaluation outcomes and context suggests that fundamental questions about the effects of evaluation or the influence of context remain insufficiently addressed. This imbalance echoes long-standing concerns that the evaluation field lacks evidence of its impact (Henry & Mark, 2003), leaving gaps in our collective understanding of what evaluations accomplish and under what conditions.
On the other hand, we noted some positive developments. A greater proportion of recent RoE studies are “evaluation-general” (not tied to specific sectors), and some have begun to tackle contemporary issues such as racial equity, social justice, and power dynamics in evaluation (e.g., Boyce et al., 2023), whereas others have proposed principles for centering equity in evaluation criteria (Teasdale et al., 2025). These contributions suggest that RoE is beginning to engage with the broader sociopolitical contexts that shape evaluation work. Although these studies remain a minority, they reflect an important evolution in the field's priorities.
Methodologically, our analysis confirms that RoE has been dominated by descriptive and nonexperimental research designs. Qualitative and cross-sectional studies are prevalent, whereas experimental approaches remain rare. This may reflect the complex and context-dependent nature of evaluation practice, but it also points to missed opportunities for methodological innovation. There has been a slight shift toward more sophisticated quantitative analyses (e.g., regression-based studies) compared to earlier decades, but experimental and longitudinal designs are still scarce. The heavy reliance on secondary data is a practical choice but underscores the importance of improving the quality and accessibility of evaluation datasets for research purposes.
Importantly, by systematically appraising study quality with established critical appraisal tools, this review offers an empirical assessment of the rigor of RoE publications in AJE as a focal sample and, to our knowledge, represents one of the first applications of formal critical appraisal tools to this body of RoE studies. The generally moderate-to-high appraisal ratings are encouraging, suggesting that many RoE studies met a substantial proportion of the applicable methodological criteria. Furthermore, the improvement in average appraisal scores in recent years is an encouraging sign that methodological standards may be improving.
At the same time, checklist-based appraisal does not eliminate judgment calls. Several JBI and MMAT criteria require interpretation (e.g., whether key elements are “adequately described” or whether there is “congruity” between the research question, methods, and analysis), so we relied on explicit decision rules and coded conservatively when reporting was incomplete. This means some recurring “weaknesses” likely reflect reporting gaps as much as methodological shortcomings, reinforcing the importance of transparent write-ups and fuller use of supplemental materials when appropriate.
However, the recurring shortcomings we identified point to concrete areas for enhancement. In qualitative RoE studies, researchers should more transparently report their methodological framework and incorporate reflexivity (Patton, 2015); doing so will add credibility and depth to qualitative findings. In quantitative studies, future authors need to pay closer attention to design transparency, clearly reporting how they handled potential biases (such as confounds or sampling issues) and the validity of their measures. For mixed-methods studies, it is critical to ensure that both qualitative and quantitative components are executed with rigor and that they are truly integrated (Greene, 2008); mixed-methods designs in RoE would benefit from following established frameworks and explaining how the two strands inform each other. Finally, evidence synthesis in RoE would benefit from closer alignment between review purpose and review methods. Because most review studies in our sample were scoping or mapping reviews, their primary aim was to describe the landscape of RoE rather than to evaluate the strength of evidence. Even so, formal appraisal of included studies was uncommon, even among systematic reviews where it is a standard expectation. Future syntheses should move beyond simple narrative reviews by applying systematic review techniques (protocols, dual coding, quality appraisal of sources, and similar practices), which would strengthen the reliability of conclusions drawn from multiple evaluation studies. Requiring, or at least encouraging, such practices (e.g., via journal guidelines akin to PRISMA) could improve the quality of RoE reviews. Together, these are specific areas where training, mentorship, and editorial expectations could raise the bar.
Implications
The findings of this review have several implications for the evaluation community. First, RoE scholars should broaden the scope of inquiry to address the evident gaps in our knowledge. There is a need for studies that investigate what difference evaluations make (evaluation consequences) and how contextual factors shape evaluation practice and use. Such research may involve more complex designs, potentially including experiments, longitudinal case studies, or embedded research approaches where researchers collaborate with evaluation teams in real time (Jackson et al., 2021). By expanding into these areas, we can ensure that evidence about both processes and outcomes informs evaluation theory and practice. Second, there is a clear mandate to enhance methodological rigor in RoE studies. Graduate programs and professional development for evaluators might emphasize qualitative methodology (e.g., study design coherence, reflexive practice) and mixed-methods integration skills more strongly (Azzam & Jones, 2025), since our review indicates these are common weak points (Christie et al., 2014; LaVelle & Donaldson, 2010). Strengthening training in these areas will prepare future RoE researchers to conduct more rigorous studies. It may also be valuable to develop evaluation-specific research standards or appraisal tools that account for the unique aspects of studying evaluation and evaluation practices. Third, funders and institutions can support RoE by integrating research components into evaluation projects, allocating resources to systematically study how evaluations are conducted and used in various settings (Donaldson, 2022, 2025; Donaldson et al., 2015). This could generate more primary data for RoE and promote a culture of self-reflection in evaluation practice.
Fourth, because our sample is drawn entirely from AJE, we focus these implications on AJE authors and editorial practices specifically. Authors should be encouraged to fully document their methods, even when space is limited. Although AJE already allows and encourages online appendices, many studies in our sample still omitted key details that could have been provided via supplemental files or repositories. In line with open science practices, we recommend that authors submit detailed coding frameworks, extended results, and appraisal decisions via platforms like the Open Science Framework (Linnell & Tilton, 2024). AJE could strengthen and normalize the use of these online supplements. This may be particularly relevant given that mixed-methods studies had the lowest composite mean appraisal scores in our sample, and several appraisal criteria depend heavily on what the authors report (“no/unclear” may sometimes reflect incomplete reporting rather than absence of the practice). One practical, low-burden step AJE could consider is prompting authors to name their study design in the title or abstract (e.g., “case study,” “quasi-experiment,” “systematic review,” or “scoping review”). In our sample, study designs were not always explicitly labeled. Clear design labeling would make studies easier to find, interpret, and synthesize, and it aligns with widely used reporting guidance (e.g., Strengthening the Reporting of Observational Studies in Epidemiology for observational studies and PRISMA for systematic reviews) that asks authors to identify study design in the title or abstract (von Elm et al., 2007; Page et al., 2021). Finally, the field would benefit from more formal reporting standards tailored to RoE synthesis and design studies. For example, review studies could be asked to follow adapted PRISMA-style checklists, including a transparent search protocol, inclusion criteria, and, where appropriate, critical appraisal of included sources. Establishing such expectations, perhaps as an author resource page or checklist for RoE submissions, could further advance transparency and cumulative learning in evaluation science. More broadly, recent RoE scholarship has also emphasized the value of adopting RoE reporting standards to strengthen quality and the accuracy of RoE syntheses, an area where editorial guidance and clearer expectations can help (Aston et al., 2025). By encouraging these practices, publication outlets like AJE can improve the transparency and quality of RoE reporting and, in turn, the usefulness of RoE findings for advancing the field. High-quality RoE is one pathway to more exemplary evaluations (Donaldson, 2022, 2025; Villalobos et al., 2025).
Limitations
This review has limitations. We examined RoE articles from a single journal over a defined time span, which may limit generalizability to other outlets or recent developments beyond 2024. Because our inclusion criteria followed an explicitly empirical definition of RoE, conceptual and methodological contributions that inform RoE but do not report empirical findings were not captured. Additionally, some types of RoE studies (e.g., simulation-based research and tool development) could not be appraised with existing checklists and therefore were not formally appraised here. The appraisal tools used (JBI and MMAT) may not capture all nuances of RoE studies, and some judgment was required in applying their criteria. In interpreting appraisal results, it is important to distinguish methodological shortcomings from reporting omissions; our ratings reflect what was reported in the published articles and may underestimate methodological quality when procedures were conducted but not described. The cut points used to summarize appraisal ratings and the composite scoring approach for mixed-methods studies were pragmatic choices intended to support descriptive comparison across designs and should not be interpreted as definitive thresholds. Because the JBI and MMAT tools differ in content, emphasis, and number of applicable criteria across study designs, cross-design comparisons of percentage scores should be interpreted as approximate descriptive summaries rather than directly equivalent measures of methodological quality. In particular, because mixed-methods composite scores were capped by the weakest strand, mean appraisal differences across design types should be interpreted cautiously. Finally, while the codebook and decision rules were pilot-tested through independent coding and consensus on a subset of articles, full-sample extraction was conducted by a single coder. Formal interrater reliability assessment was not feasible, and some classification error may remain.
Conclusion
Over the past decade, RoE has gained momentum, shedding light on many aspects of evaluation practice and the evaluation profession (Donaldson, 2022, 2025; Villalobos et al., 2025). Our integrative review of AJE publications from 2014 to 2024 shows that RoE has predominantly explored “how we evaluate” (methods, processes, and people), with comparatively little attention to “what evaluation accomplishes” in terms of outcomes and influence. By introducing a quality appraisal perspective, we found that while RoE studies are generally conducted with moderate-to-high rigor, there are consistent deficiencies in reporting and methodology that need to be addressed. Moving forward, broadening the focus of RoE and improving the methodological transparency and rigor of studies will help build a more credible, actionable knowledge base for the evaluation field. In line with Mark's (2008) vision, investing in rigorous RoE ultimately strengthens evaluation practice itself, ensuring that as evaluators, we continually evaluate and improve our own work through evidence.
Footnotes
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Supplemental Material
Supplemental material for this article is available online.
Appendix A