Artificial Intelligence (AI) Readiness to Support Evidence Synthesis by Workflow: Findings From a Review of Reviews

Abstract

Background

Evidence synthesis is crucial for informing evidence-based practice across various fields. However, the traditional methodology is resource-intensive, and its findings can be outdated before publication. There is a growing trend toward integrating automation and artificial intelligence (AI) approaches into evidence synthesis to enhance efficiency, but standardized adoption is still pending.

Objective

The goal of this study is to identify peer-reviewed evidence documenting AI readiness for evidence synthesis.

Methods

We searched MEDLINE, Embase, and Global Index Medicus in May 2025 to identify review articles that evaluated evidence synthesis tools. Relevant reviews of studies and reviews of tools published in English between January 2020 and May 2025 were included in our review of reviews. Tool features and performance metrics were extracted according to stages of the evidence synthesis workflow, including search, screening, appraisal, extraction, and synthesis.

Results

We included 21 studies in our review of reviews and identified 46 evidence synthesis tools. Nine tools supported all five stages of the evidence synthesis workflow, among which DistillerSR covered the most workflow-supporting features (19 out of 21). Ten of the identified tools reported sensitivity rates for AI-powered title/abstract screening, all of which achieved ≥95% sensitivity in at least one configuration. Reported sensitivity rates of EPPI-Reviewer, Research Screener and SWIFT-Active Screener consistently reached the 95% threshold with varying degrees of automation.

Conclusion

This review found peer-reviewed evidence supporting AI readiness for human-supervised automation of title/abstract screening. However, evidence documenting AI readiness for other evidence synthesis tasks remains limited. DistillerSR and EPPI-Reviewer demonstrated the broadest feature support and strong evidence for AI-powered title/abstract screening. Our study highlights the potential of AI to improve efficiency while maintaining high sensitivity in the screening stage. AI-powered screening may serve as a critical first step toward scaling rapid reviews into living evidence syntheses.

Plain Language Summary

Why was the study done: Reviewing large numbers of studies is essential for making informed decisions in healthcare and other fields. This process, called evidence synthesis, brings together findings from many studies to answer important questions. However, it can take a long time and requires a lot of effort from researchers. By the time a review is completed, new studies may already have been published, making the results less up to date. Artificial intelligence (AI) has been suggested as a way to speed up this process. AI tools can help with tasks such as searching for studies, screening which studies are relevant, extracting data from studies, and combining results. Despite growing interest, it is not yet clear how peer-reviewed literature documents the readiness of these tools.

What did the researchers do: In this study, we looked at existing research to understand how well AI is currently supporting evidence synthesis. We searched major health databases for published review articles from 2020 to 2025 that evaluated tools used in evidence synthesis. We then collected evidence on what these tools can do and how well they perform at different stages of the review process. These stages include searching for studies, screening them for relevance, assessing their quality, extracting key information, and combining findings.

What did the researchers find: We included 21 review articles and identified 46 different tools designed to support evidence synthesis. A few of these tools could support all stages of the review process. Among them, some tools, such as DistillerSR and EPPI-Reviewer, offered the widest range of features. We found the strongest evidence for the use of AI in screening studies by title and abstract. In this task, AI systems help researchers quickly decide which studies are likely to be relevant. Several tools reported high sensitivity, meaning they were able to correctly identify at least 95% of relevant studies in some settings. However, for other stages of evidence synthesis, such as assessing study quality or combining results, there is still limited evidence on how well AI performs. While AI shows promise, its use beyond screening is not yet fully supported by strong research.

What do the findings mean: Overall, the evidence we found suggests that AI is ready to assist with some parts of evidence synthesis, especially the early screening stage. Using AI in this way could make reviews faster while still maintaining quality. This may also help support “living” reviews, which are updated regularly as new evidence becomes available. More research is needed to understand how AI can reliably support the full evidence synthesis process.

Keywords

evidence synthesis, artificial intelligence (AI), readiness, workflow, sensitivity, randomized controlled trials (RCTs)

Introduction

Evidence synthesis is a crucial evidence-based practice, providing a structured approach for aggregating and evaluating research findings to inform decision-making across various fields, including medicine, public health, public policy and the social sciences (Cierco Jimenez et al., 2022; Jin et al., 2024; Legate et al., 2024). The principle of evidence synthesis is to rigorously combine available research findings to summarize current knowledge about a specific topic, forming the foundation for evidence-based medicine and improving healthcare decision-making, diagnosis, and treatment (Cierco Jimenez et al., 2022; Legate et al., 2024). Among the range of evidence synthesis methodologies, systematic reviews are considered the gold standard. The classic five-stage systematic review workflow involves searching, screening, critically appraising, extracting, and synthesizing “the best available evidence within pre-specified eligibility criteria” (Cierco Jimenez et al., 2022; Coiera & Liu, 2022; Legate et al., 2024). Other evidence synthesis approaches share this core workflow but differ in the emphasis placed on individual stages. However, this workflow is resource-demanding and may produce findings that are outdated before publication. The median time to complete a systematic review is approximately 17 months, with some taking over 2 years, which limits their applicability in rapidly evolving policy environments (Cierco Jimenez et al., 2022; Guo et al., 2024; Jin et al., 2024; Legate et al., 2024; Marshall & Wallace, 2019). Living systematic reviews were introduced to address this lag by enabling the continuous updating of findings as new data emerge; however, they still require substantial human and financial resources, making large-scale implementation difficult (Bastian et al., 2010; Elliott et al., 2014).

The importance of evidence synthesis methods, particularly systematic reviews, has grown significantly over time. This growth has been driven globally by two key factors. First, the development and spread of Health Technology Assessment, initiated by the U.S. Office of Technology Assessment in 1976, established systematic reviews as a key method for rigorously evaluating the efficacy, safety, and cost-effectiveness of health technologies (Banta & Jonsson, 2009). Second, global health crises, such as the COVID-19 pandemic, have underscored the need for rapid synthesis of research to support evidence-informed public health decision-making. By mid-2022, the NIH’s LitCOVID Hub catalogued approximately 270,000 COVID-19-related research articles (Coiera & Liu, 2022). However, the large volume of evidence made it difficult to quickly identify, synthesize, and apply critical findings, leading to delays that had real-world impacts on pandemic response efforts (Coiera & Liu, 2022).

To address the need for faster evidence synthesis, numerous rapid review approaches have been developed. Although these reviews provide more timely findings by simplifying the synthesis workflow, they are often associated with lower reporting quality (Tricco et al., 2015). Transitioning rapid reviews into living evidence syntheses could offer a solution that improves their methodological rigor without compromising timeliness, though the process would require new methodological standards.

Automation and artificial intelligence (AI) have been increasingly integrated into various stages of systematic reviews, including literature search, screening, critical appraisal and data extraction, to address these challenges (Cierco Jimenez et al., 2022; Guo et al., 2024; Johnson et al., 2022). The Systematic Review Toolbox, a widely used repository tracking systematic review tools, catalogued 235 software tools as of December 2022 (Johnson et al., 2022), many of which incorporate AI to varying extents. AI methodologies, such as machine learning (ML), have demonstrated potential in handling unstructured biomedical literature, streamlining screening and data extraction processes (Amin et al., 2023; Cierco Jimenez et al., 2022). Despite these advancements, AI-driven approaches to systematic reviews and other evidence synthesis methods have yet to fully integrate into mainstream methodologies. Recognizing this gap, the International Collaboration for the Automation of Systematic Reviews (ICASR) has emphasized the need for standardized evaluation metrics, interoperability among AI-powered evidence synthesis tools, and transparent validation frameworks to ensure their reliability and widespread adoption (O’Connor et al., 2018).

This review of reviews aims to identify peer-reviewed evidence on the capabilities of evidence synthesis tools and the performance of their AI-powered features. In this review, we will use the following working definitions:

• Evidence synthesis tools: software tools that assist reviewers in completing one or more stages of the evidence synthesis workflow (e.g., Covidence), including general-purpose tools with relevant capabilities (e.g., ChatGPT)

• Workflow-supporting features: components of evidence synthesis tools that assist with a specific task in the evidence synthesis workflow (e.g., a module that supports manual title/abstract screening)

• AI-powered features: workflow-supporting features that use ML, natural language processing, generative AI or other data-driven methods to support or automate a specific task (e.g., an ML-based text classification model that automates title/abstract screening)

The review will search, screen, appraise, extract and synthesize evidence from published reviews that evaluate the application of AI in the evidence synthesis workflow. It aims to identify evidence that addresses the following research questions:

• What types of evidence synthesis tools are available, and which stages of the evidence synthesis workflow do their features support?

• What metrics and methods are used to evaluate AI-powered features of evidence synthesis tools? What does evaluative evidence suggest about their performance?

• Which stages of evidence synthesis appear most ready for AI-powered scaling of rapid reviews into living evidence syntheses?

Our study synthesizes existing peer-reviewed literature on evidence synthesis tools rather than directly assessing current AI readiness for evidence synthesis. As a result, the evidence captured in our review reflects publication timelines. Established tools well represented in the literature (e.g., RobotReviewer, developed over a decade ago) have a larger body of published evidence, whereas more recent tools (e.g., Claude) have comparatively limited evidence on their performance in evidence synthesis workflows, which shapes their representation in our review.

Methods

The study design of this review is included in our protocol (Ngongoma et al., 2025), which was registered in PROSPERO on June 6, 2025 (CRD420251054446).

Literature Search

We searched MEDLINE, Embase, and Global Index Medicus for articles published from January 1, 2020 to May 7, 2025. To formulate the search strategy, we used search terms for the following concepts:

• Artificial Intelligence

• Evidence Synthesis as a Subject

• Reviews as Publication Types

• Tools and Task-Specific Terminology

The initial search strategy was developed with a librarian in MEDLINE on the EBSCO platform. Comparable search strategies were then developed for Embase and Global Index Medicus. The detailed search strategies are listed in Supplemental Material 1.

Eligibility Criteria

We imported all retrieved articles into Covidence software, which conducted deduplication automatically. To select peer-reviewed review articles that evaluated AI-powered features of evidence synthesis tools, we developed the following eligibility criteria:

• Study Type: Reviews of studies (e.g., systematic reviews and other evidence synthesis articles) and reviews of tools (e.g., overviews of evidence synthesis tools) focused on evidence synthesis in the health sciences and policy fields.

• Topic: Eligible reviews must evaluate AI-powered or automated features of tools applied to any stage of the evidence synthesis workflow, namely search, screening, appraisal, extraction and synthesis.

• Outcomes: Reviews must report performance outcomes such as accuracy, time savings, efficiency, or inter-rater agreement. Reviews may also report on implementation outcomes, including usability, trust, transparency, reproducibility, or barriers to adoption.

Preprints and conference abstracts were excluded. We also excluded studies that were published prior to 2020, were not in English, or were found during quality appraisal to lack methodological clarity or to present a high risk of bias. The timeframe was selected to capture recent advancements in AI development and use in systematic review workflows.

Study Screening

Screening was conducted in Covidence software. Prior to full title/abstract screening, a screening pilot was conducted to ensure consistency among reviewers. Two reviewers independently screened titles and abstracts to identify potentially eligible reviews based on our inclusion criteria. Full-text articles of potentially relevant reviews were then retrieved and screened independently by two reviewers. Discrepancies arising during either the title/abstract or full-text screening stages were discussed between the two reviewers, and a third reviewer adjudicated if agreement could not be achieved. Screening decisions and reasons for exclusion at the full-text stage were documented in Covidence and summarized in Figure 1.

Figure 1

PRISMA Flow Diagram of the Review

Quality Assessment

Included reviews were critically appraised for risk of bias and study quality using AMSTAR 2, with assessments conducted independently by two reviewers. Although AMSTAR 2 is designed for systematic reviews, we found its criteria useful for appraising other types of reviews included in our study. A third reviewer adjudicated if consensus could not be achieved.

Data Extraction

We performed data extraction using a standardized extraction form that was pilot tested to ensure consistency and calibration. Two reviewers extracted data independently, and any discrepancies were resolved by a third reviewer. We extracted information on evidence synthesis tools identified in each review, including their features and performance metrics if available. We contacted authors to clarify any missing or unclear data. The detailed extraction form is provided in Supplemental Material 2.

Data Synthesis

Due to the heterogeneity of included studies, a meta-analysis was not possible for this review. Instead, we conducted a qualitative synthesis, visualizing temporal ranges of sources cited in the included reviews, the distribution of identified tools across supported stages of the evidence synthesis workflow, and the coverage of workflow-supporting features among identified tools.

We derived a list of workflow-supporting features from the feature analysis by Cowie et al. (2022), due to its comprehensiveness and the relevant data it provided. From the original list of 30 features, we selected 20 features that could be mapped to the five stages of the evidence synthesis workflow. We added one additional workflow-supporting feature to the list: automatic ranking of references during screening, which can accelerate the screening stage by helping reviewers prioritize relevant studies.

We synthesized all identified evidence synthesis tools into a single list. Using the tool and feature lists, we constructed a two-dimensional grid heat map with three coding categories. Each tool-feature pair was coded as “covered” if any included review reported that the tool supports the feature, “not covered” if a review reported that the tool does not support the feature, and “unknown” if the feature was not mentioned in relation to the tool. Tools that did not cover any of the listed workflow-supporting features were excluded from the heat map.
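The coding rule described above can be sketched in Python. This is an illustrative sketch, not the code used in the review: the names `Coding`, `code_pair`, and `build_grid`, the input layout, and the tool names "Tool A" and "Tool B" are hypothetical. It also assumes that a single positive report takes precedence when reviews disagree, which is one reading of "if any included review reported that the tool supports the feature".

```python
from enum import Enum


class Coding(Enum):
    COVERED = "covered"
    NOT_COVERED = "not covered"
    UNKNOWN = "unknown"


def code_pair(reports):
    """Code one tool-feature pair.

    `reports` holds one boolean per review that mentions the pair:
    True if the review says the tool supports the feature, False otherwise.
    Assumption: any positive report outweighs negative ones.
    """
    if any(reports):
        return Coding.COVERED
    if reports:  # mentioned at least once, but never as supported
        return Coding.NOT_COVERED
    return Coding.UNKNOWN  # never mentioned in relation to the tool


def build_grid(tools, features, reports_by_pair):
    """Build the tool-by-feature grid, dropping tools with no covered feature."""
    grid = {}
    for tool in tools:
        row = {f: code_pair(reports_by_pair.get((tool, f), [])) for f in features}
        if any(code is Coding.COVERED for code in row.values()):
            grid[tool] = row
    return grid


# Hypothetical toy input; tool names and reports are not from the review.
features = ["Database Search", "Dual Extraction", "Citation Management"]
reports_by_pair = {
    ("Tool A", "Database Search"): [False, True],
    ("Tool A", "Dual Extraction"): [False],
    ("Tool B", "Database Search"): [False],
}
grid = build_grid(["Tool A", "Tool B"], features, reports_by_pair)
for tool, row in grid.items():
    print(tool, {feature: code.value for feature, code in row.items()})
```

In this toy input, "Tool B" is dropped because none of its features is coded as covered, mirroring the exclusion rule for the heat map.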

In addition to the heat map, we compiled a list of tools reported to facilitate a specific type of evidence synthesis: systematic reviews of RCTs, given their crucial role in evidence-based medicine. For each tool in this list, we documented the features designed for systematic reviews of RCTs and the stages of the workflow they support.

After extracting performance metrics specifically for AI-powered features across evidence synthesis tools, we found that, overall, tools rarely reported the same metric for a given feature. Aside from sensitivity of AI-powered title/abstract screening, which was reported by 10 tools, at most three tools reported the same type of metric. The limited data pose challenges for interpretation and comparison, so we defer addressing these sparse metrics in the present work and will develop synthesis methods for them as we scale up this study. For sensitivity of AI-powered title/abstract screening, however, we had sufficient data to enable meaningful comparison. We therefore prioritize sensitivity as the primary measure for two reasons.

First, from a reporting standpoint, most peer-reviewed studies evaluating AI-powered title/abstract screening report performance in terms of sensitivity. The sparseness of other metrics limits comparability across tools and constrains our ability to synthesize them systematically.

Second, from a practical standpoint, maximizing the inclusion of true positives during title/abstract screening is critical. False negatives represent irreversibly missed evidence, whereas false positives can be addressed during subsequent full-text screening. Therefore, sensitivity is particularly crucial. This is reinforced by the empirical characteristics of evidence synthesis, where screening typically involves identifying a relatively small subset of eligible studies (often less than 10%) from a large pool of references (typically thousands). In this highly imbalanced context, sensitivity is the most informative metric for capturing meaningful differences in tool performance.
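To make the imbalance argument concrete, the following Python sketch compares sensitivity with overall accuracy for two screeners. All counts are hypothetical and chosen only for illustration (5,000 references, 250 of them eligible, i.e., 5%); they are not drawn from any included review. Sensitivity is computed in the standard way as true positives divided by the sum of true positives and false negatives.

```python
def sensitivity(tp, fn):
    """Share of truly eligible references that the screener retains."""
    return tp / (tp + fn)


def accuracy(tp, fp, tn, fn):
    """Share of all references that the screener classifies correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


# Hypothetical pool: 5,000 references, 250 eligible (5%).
# Screener 1 excludes every reference: no false positives, but all 250 missed.
exclude_all = {"tp": 0, "fp": 0, "tn": 4750, "fn": 250}
# Screener 2 retains 240 eligible references and 1,000 ineligible ones.
cautious = {"tp": 240, "fp": 1000, "tn": 3750, "fn": 10}

for name, counts in [("exclude_all", exclude_all), ("cautious", cautious)]:
    sens = sensitivity(counts["tp"], counts["fn"])
    acc = accuracy(**counts)
    print(f"{name}: sensitivity={sens:.3f}, accuracy={acc:.3f}")
```

Under these assumed counts, the screener that excludes everything has the higher accuracy (0.950 versus 0.798) while missing every eligible study, whereas the second screener reaches a sensitivity of 0.960. This illustrates why accuracy alone can mislead in a highly imbalanced screening task.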

To illustrate reported sensitivity, we used a bubble chart in which bubble size represents the percentage of references saved from manual screening. Percentages of manual screening saved were commonly reported alongside sensitivity in included studies and may provide a practical reference for researchers.

After extracting the full conclusions of each included review, we developed five categories to classify overall concluding remarks on AI readiness: recommended for use, cautious recommendation, further research needed, not recommended for use, and not applicable. Each review’s conclusion was assigned to one of these categories.

We also synthesized the affiliations of lead authors from included reviews to examine their geographic and disciplinary distributions.

Results

Our review of reviews yielded a final list of 21 articles published between 2021 and 2025, including 15 reviews of studies and 6 reviews of tools (Abdelkader et al., 2021; Abogunrin et al., 2025; Affengruber et al., 2024a, 2024b; Aletaha et al., 2023; Blaizot et al., 2022; Cierco Jimenez et al., 2022; Cowie et al., 2022; Dos Santos et al., 2023; Feng et al., 2022; Hanegraaf et al., 2024; Khalil et al., 2022; Legate et al., 2024; Lieberum et al., 2025; Roth & Wermer-Colan, 2023; Sallam, 2023; Sandner et al., 2024; Schmidt et al., 2023, 2025; Shorey et al., 2024; Yao et al., 2024). No eligible reviews were published in 2020. The included reviews encompass various types of evidence synthesis, such as systematic, scoping, and narrative reviews. We identified 46 distinct tools from these reviews. Additionally, 21 studies that met the title/abstract screening criteria were excluded following full-text review. Details of these excluded studies are provided in Supplemental Material 3.

The included reviews were predominantly authored by researchers based in economically developed, English-speaking countries (e.g., the United States and Canada), with comparatively fewer authors from the Global South. This distribution could be influenced by our eligibility criteria, as only English-language studies were included in this review. In terms of disciplinary background, most authors were affiliated with health-related fields, while those from a computer science background constituted a sizeable minority. Details of the distributions of the lead authors’ affiliations are provided in Supplemental Materials 4 and 5.

We analyzed the publication year range of sources cited in each included review to examine their temporal distribution. As illustrated in Figure 2, the years 2018 to 2020 are covered by the highest number of reviews (n = 19). The reduced coverage from 2021 to 2024 likely reflects the inherent delay between the publication of primary research and its incorporation into reviews.

Figure 2

Temporal Coverage of Sources Cited in Each Review. A Bar at a Given Year Indicates the Number of Reviews Whose Sources’ Publication Year Range Covers that Year
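The counting rule in the caption of Figure 2 can be expressed as a short Python sketch. The function name `coverage_by_year` and the year ranges below are hypothetical; the actual ranges extracted from the included reviews are not reproduced here.

```python
from collections import Counter


def coverage_by_year(year_ranges):
    """Count, for each year, how many reviews' cited-source ranges cover it.

    `year_ranges` holds one (earliest_year, latest_year) pair per review.
    """
    counts = Counter()
    for first, last in year_ranges:
        for year in range(first, last + 1):  # both endpoints are covered
            counts[year] += 1
    return dict(sorted(counts.items()))


# Hypothetical ranges for three reviews.
example_ranges = [(2015, 2020), (2018, 2022), (2019, 2019)]
print(coverage_by_year(example_ranges))
```

With these toy ranges, 2019 is covered by all three reviews, while 2021 and 2022 are covered by only one, which is the same shape of drop-off that a publication lag would produce.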

The availability of tools across the stages of evidence synthesis appears uneven, based on those identified in this review. As shown in Figure 3, tools are more commonly available for screening, literature search, and data extraction. Comparatively fewer tools support more complex interpretive tasks such as critical appraisal and synthesis.

Figure 3

Distribution of Identified Tools Across Supported Stages of the Evidence Synthesis Workflow

Across 46 identified tools and 21 workflow-supporting features, feature coverage is also uneven (Figure 4). Coverage breadth differs substantially by tool: while the median number of covered features is only 3, five tools cover ≥15 features (DistillerSR, EPPI-Reviewer, Covidence, NestedKnowledge, and Giotto Compliance).

Figure 4

Coverage of 21 Workflow-Supporting Features Across 46 Evidence Synthesis Tools. Blue Indicates Covered, Red Indicates Not Covered, and Grey Indicates Unknown. Abbreviations: SWIFT: Sciome Workbench for Interactive Computer-Facilitated Text-Mining; ITSS: Interactive Text Summarization System for Scientific Documents; LLaMA: Large Language Model Meta AI; OATS: Ontology-Based and User-Focused Automatic Text Summarization; DASyR: Document Analysis System for Systematic Reviews; BERT: Bidirectional Encoder Representations From Transformers; LaMDA: Language Model for Dialogue Applications; PaLM: Pathways Language Model; SDES: Semi-Automatic Data Extraction System for Heterogeneous Data Sources; MeSH: Medical Subject Headings

The most commonly supported features are Data Extraction (n = 25), Title/Abstract Screening (n = 24), and Database Search (n = 24). In contrast, Citation Management (n = 2), Automated Full-Text Retrieval (n = 4), and Dual Extraction (n = 5) are supported by the fewest tools.

Two points should be considered when interpreting our heat map of feature coverage. First, the map reflects documented evidence of AI readiness in peer-reviewed literature. The actual tools could be updated or become unavailable without being reported in the literature. Second, feature coverage does not indicate whether a feature is AI-powered. For example, the “Data Extraction” feature could be marked as covered for a tool that provides storage for extracted data or offers AI-powered suggestions during extraction (e.g., Covidence).

For AI-powered title/abstract screening, ten tools achieved ≥95% sensitivity in at least one configuration, of which seven (70%) additionally reported the proportion of citations screened manually during evaluation (Figure 5). For these seven tools, sensitivity was evaluated after training the tool on manually screened citations, reflecting the level of sensitivity users might attain after manually screening a similar percentage of studies. For example, all reported sensitivity rates of EPPI-Reviewer were above 95%, and the lowest associated percentage of manually screened citations reported was 38%, suggesting that up to 62% of manual screening could be saved at 95% sensitivity. These reported sensitivity values and savings in manual screening should be interpreted with caution, as they depend on the representativeness of the studies included in the manually screened training set and on other evaluation parameters that were not specified in the reviews.

Figure 5

Sensitivity of AI-Powered Title/Abstract Screening by Tool. A Bubble Represents a Sensitivity Value Reported With the Percentage of Saved Manual Screening (Indicated by Bubble Size). A Cross Represents a Single Reported Value of Sensitivity. The Dashed Line Marks Sensitivity = 0.95
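The relationship between the proportion of citations screened manually and the proportion of manual screening saved can be sketched as follows. The function names and the list `hypothetical_results` are illustrative assumptions; only the 38% figure and the 0.95 threshold come from the text above.

```python
SENSITIVITY_THRESHOLD = 0.95  # dashed line in Figure 5


def manual_screening_saved(pct_screened_manually):
    """Percentage of references that did not need manual screening."""
    if not 0 <= pct_screened_manually <= 100:
        raise ValueError("percentage must lie between 0 and 100")
    return 100 - pct_screened_manually


def best_saving_at_threshold(results, threshold=SENSITIVITY_THRESHOLD):
    """Largest saving among results whose sensitivity meets the threshold.

    `results` holds (sensitivity, pct_screened_manually) pairs for one tool.
    Returns None if no result meets the threshold.
    """
    savings = [manual_screening_saved(pct) for sens, pct in results if sens >= threshold]
    return max(savings) if savings else None


# Figure from the text: 38% screened manually implies up to 62% saved.
print(manual_screening_saved(38))  # 62

# Hypothetical results for one tool; the third falls below the threshold.
hypothetical_results = [(0.96, 38), (0.98, 55), (0.93, 20)]
print(best_saving_at_threshold(hypothetical_results))  # 62
```

In the toy data, the result with the smallest manual effort (20%) is ignored because its sensitivity is below 0.95, so the largest saving that still meets the threshold is 62%.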

Tools that specifically support systematic reviews of RCTs are summarized in Table 1. The synthesized evidence indicates that evidence synthesis tools are available for every stage of the RCT systematic review workflow, except for synthesis.

Table 1

Tools Reported to Support Systematic Reviews of RCTs

Stage of review	Tool	Details
Search	RCT tagger	Uses ML algorithms to identify RCTs based on titles, abstracts and other metadata during search
Search	TrialStreamer	Uses fully automated PICO labels and classifications of study types to search for RCTs from a fully data-mined and regularly updated version of the PubMed dataset
Screening	Covidence	Uses the Cochrane RCT classifier to tag RCTs during screening
Appraisal	ExaCT	Evaluates risk of bias of RCTs
Appraisal	RobotReviewer	Assigns low, high or unclear risk of bias to RCTs
Extraction	ExaCT	Locates and extracts key trial characteristics of RCTs from clinicaltrials.gov
Extraction	RobotReviewer	Extracts and consolidates RCT data

In terms of overall concluding remarks on AI readiness for evidence synthesis (Figure 6), most reviews either called for further research (42.9%, n = 9) or offered cautious recommendation (33.3%, n = 7). Three reviews (14.3%) firmly recommended using AI to assist evidence synthesis, one review (4.8%) advised against it, and one review (4.8%) did not provide a conclusion on AI readiness. This distribution aligns with the concentration of reported tools and performance metrics at the screening stage and their relative absence at other stages of the evidence synthesis workflow. It also aligns with current Responsible use of AI in Evidence SynthEsis (RAISE) recommendations, which emphasize that although AI has the potential to enhance efficiency, the limited evidence base makes recommending broad adoption difficult and indicates the need for clear guidance on when and how to apply specific AI tools appropriately (Thomas et al., 2026).

Figure 6

Distribution of Overall Concluding Remarks on AI Readiness for Evidence Synthesis Across Included Reviews

The authors’ original conclusions and the categories assigned to their remarks on AI readiness are provided in Supplemental Table 6.
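The percentages reported for the concluding remarks follow directly from the category counts over the 21 included reviews, as the short Python check below shows. The dictionary keys follow the five categories defined in the Methods; mapping the review without a conclusion on AI readiness to "not applicable" is our reading of the text.

```python
# Counts per category as reported in the Results.
conclusion_counts = {
    "further research needed": 9,
    "cautious recommendation": 7,
    "recommended for use": 3,
    "not recommended for use": 1,
    "not applicable": 1,
}

total_reviews = sum(conclusion_counts.values())

# Share of reviews per category, rounded to one decimal place.
shares = {
    category: round(100 * count / total_reviews, 1)
    for category, count in conclusion_counts.items()
}

print(total_reviews)
for category, share in shares.items():
    print(f"{category}: {share}%")
```

Because each share is rounded independently, the rounded percentages sum to 100.1 rather than exactly 100.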

Discussion

This review synthesizes findings from previous reviews to present an overview of peer-reviewed evidence on evidence synthesis tools, their workflow-supporting features, and the performance of their AI-powered features. In 21 included studies, we found a broad range of methods and metrics to evaluate AI-powered features of evidence synthesis tools. Due to the heterogeneity of evaluation methods used by different studies, combining their evaluative evidence presents a challenge. Our review attempted to extract all relevant features and performance metrics reported for each evidence synthesis tool in included studies and the authors’ overall opinions on AI readiness for evidence synthesis. The key findings of our review were synthesized through cataloging features and aggregating feature-specific performance metrics for each tool. Our decision to use these synthesis methods was based on considerations of scientific rigor and practical use.

The included studies also varied in their scope with respect to the evidence synthesis workflow. Several studies focused on one stage within the workflow or one specific workflow-supporting feature, while others addressed the workflow in its entirety. Our review synthesized evidence across these studies to capture the full picture of AI readiness for evidence synthesis. This broader perspective helps reveal patterns that are not evident in feature- or stage-specific studies, such as the concentration of quantitative evidence on AI-powered title/abstract screening and the relative lack of evidence for other AI-powered features.

Through data synthesis, we developed a heat map that maps 46 evidence synthesis tools to 21 workflow-supporting features. Cowie et al. (2022) presented a heat map that served as a foundational data source for our analysis. By incorporating evidence from multiple sources and adding newly identified tools, our heat map represents an integrative synthesis that may serve as the basis for a publicly accessible and updatable evidence and gap map (EGM). This could enable future researchers to populate areas for which evidence was unavailable within the scope of this review.

Lessons Learned

According to the RAISE recommendations, reviewers who use AI to assist evidence synthesis should provide evidence that AI use “will not undermine the trustworthiness or reliability of the synthesis or its conclusions” (Flemyng et al., 2025). Our findings illustrate the types of empirical evidence that may be sufficient to justify AI-assisted approaches. Consistently reported sensitivity rates above 95% suggest that tools such as EPPI-Reviewer can be considered reliable for title/abstract screening. The proportion of records screened manually, when reported alongside sensitivity, offers a practical reference point: reviewers can conduct small-scale pilot tests, manually screening a comparable proportion of records and adjusting this proportion based on observed AI performance, thereby maintaining appropriate human oversight and risk mitigation.

We recommend that future evaluation studies explicitly report supporting procedural steps (e.g., the extent of manual screening) alongside the associated performance metrics, as these elements are inseparable for assessing the trustworthiness of reported results. Despite software tools such as DistillerSR and EPPI-Reviewer supporting all stages of evidence synthesis, substantially more evidence exists for AI readiness in title/abstract screening than for other workflow-supporting features. Therefore, we encourage further evaluation studies of comparable rigor for AI-powered features beyond title/abstract screening.

By mapping the current evidence on AI readiness across the evidence synthesis workflow and identifying both areas of strength and research gaps, this review contributes to ongoing efforts to accelerate, scale and sustain living evidence synthesis. We hope that strategic integration of AI tools into living evidence synthesis approaches will facilitate the delivery of timely, continuously updated evidence for decision-makers.

Limitations

Several limitations should be considered when interpreting the findings of this review. When an evidence synthesis tool was reported to support a given feature, this designation encompassed heterogeneous meanings. In some studies, it referred to the availability of software functionality for users to complete the task manually, whereas in other studies it denoted the presence of a partially or fully automated process. As a result, we were unable to systematically characterize the degree of automation provided by each tool for each task.

Moreover, details regarding the level of human intervention required for AI-powered features were frequently lacking. Performance metrics were often synthesized in included reviews without adequate description of workflows, limiting our ability to derive clear guidance for tool use. Importantly, AI performance data were heavily concentrated on title/abstract screening. For other AI-powered workflow-supporting features, the lack of standardized and comprehensive evaluation resulted in sparse and heterogeneous metrics, which constrained our synthesis of findings. Further investigation into tool functionalities and primary studies would be required to address these gaps.

Based on the affiliations of lead authors of included reviews, the evidence base we identified may be shaped by a concentration of researchers with health and medical backgrounds from economically developed, English-speaking countries. We hope to capture broader international representation and interdisciplinary cooperation in future iterations of this work.

The scope of this review was limited by language restrictions and the selected time frame. Studies published in languages other than English, prior to January 2020, or after May 2025, as well as gray literature, were excluded from the review. Given the rapid pace of AI development, our findings may not fully reflect the most recent advances in this area.

Nevertheless, this review provides a structured approach for documenting evidence that assesses AI readiness in evidence synthesis. As a next step, we plan to transition this work into a living evidence synthesis and expand our search and inclusion criteria to incorporate newly published studies and mitigate the scope limitation. By incorporating gray literature, such as technical reports from tool providers, we aim to produce an EGM that more accurately reflects the availability of tools and the degree of automation they provide.

Conclusion

This review identified peer-reviewed evidence supporting AI readiness for human-supervised title/abstract screening, whereas evidence for AI readiness in other evidence synthesis tasks is limited and requires further evaluation. Among the tools assessed, DistillerSR and EPPI-Reviewer supported the largest number of features and showed strong performance in AI-powered title/abstract screening. Overall, our study highlights the potential of AI in improving efficiency while maintaining high sensitivity during screening, but also identifies gaps in evidence for AI use in other stages of the evidence synthesis workflow. Applying AI in the screening stage may therefore serve as a foundational step toward scaling rapid reviews into living evidence syntheses.

Supplemental Material

Supplemental Material for Artificial Intelligence (AI) Readiness to Support Evidence Synthesis by Workflow: Findings From a Review of Reviews by Zijing Wei, Luyanda Ngongoma, Jose Cols, Arina L. Bogdan, Ariel Lin, Claire Zhang, Yue Su, Nuno de Jesus Ximenes, Chloe Zhu, Yoav Ackerman, Heather L. Bullock, Juhua Hu and Yanfang Su in Campbell Systematic Reviews.

Footnotes

Acknowledgements

We thank Diana Louden for assistance developing the search strategies and running the searches. We thank Cara Evans for providing critiques throughout the study. We thank Youyi Li for assistance screening articles for inclusion. We thank Serena Chu and Ensheng Dong for valuable comments on the manuscript.

Author Contributions

YanfangS led the study design and supervised title screening, data extraction, data analysis and manuscript writing. HB and LN participated in the study design. LN led protocol writing, registration, title screening and data extraction. ZW led data analysis and manuscript writing. LN, ZW, ClaireZ, AL, YueS, and NJX conducted title screening and data extraction. ZW, JC, AL, and ChloeZ performed data analysis. ZW, LN and JC created the figures and tables. ZW, JC, AL and YA wrote the first draft of the manuscript. JH and AB provided critical feedback throughout the study. All authors reviewed the manuscript and approved the final version.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

References

Abdelkader, Navarro, Parrish, et al. (2021). Machine learning approaches to retrieve high-quality, clinically relevant evidence from the biomedical literature: Systematic review. JMIR Medical Informatics, 9(9), Article e30401. https://doi.org/10.2196/30401

Abogunrin, Muir, J. M., Zerbini, Sarri (2025). How much can we save by applying artificial intelligence in evidence synthesis? Results from a pragmatic review to quantify workload efficiencies and cost savings. Frontiers in Pharmacology, 16, Article 1454245. https://doi.org/10.3389/fphar.2025.1454245

Affengruber, Nussbaumer-Streit, Hamel, Van der Maten, Thomas, Mavergames, Spijker, Gartlehner (2024a). Rapid review methods series: Guidance on the use of supportive software. BMJ Evidence-Based Medicine, 29(4), 264–271. https://doi.org/10.1136/bmjebm-2023-112530

Affengruber, Van der Maten, Spiero, Nussbaumer-Streit, Mahmić-Kaknjo, Ellen, M. E., Goossen, Kantorova, Hooft, Riva, Poulentzas, Lalagkas, P. N., Silva, A. G., Sassano, Sfetcu, Marqués, M. E., Friessova, Baladia, Pezzullo, A. M., Spijker (2024b). An exploration of available methods and tools to improve the efficiency of systematic review production: A scoping review. BMC Medical Research Methodology, 24(1), Article 210. https://doi.org/10.1186/s12874-024-02320-4

Aletaha, Nemati-Anaraki, Keshtkar, Sedghi, Keramatfar, Korolyova (2023). A scoping review of adopted information extraction methods for RCTs. Medical Journal of the Islamic Republic of Iran, 37, Article 95. https://doi.org/10.47176/mjiri.37.95

Amin, Khosla, Doshi, Chheang, Forman, H. P. (2023). Artificial intelligence to improve patient understanding of radiology reports. Yale Journal of Biology & Medicine, 96(3), 407–417. https://doi.org/10.59249/nkoy5498

Banta, Jonsson (2009). History of HTA: Introduction. International Journal of Technology Assessment in Health Care, 25(S1), 1–6. https://doi.org/10.1017/S0266462309090321

Bastian, Glasziou, Chalmers (2010). Seventy-five trials and eleven systematic reviews a day: How will we ever keep up? PLoS Medicine, 7(9), Article e1000326. https://doi.org/10.1371/journal.pmed.1000326

Blaizot, Veettil, S. K., Saidoung, Moreno-Garcia, C. F., Wiratunga, Aceves-Martins, Lai, N. M., Chaiyakunapruk (2022). Using artificial intelligence methods for systematic review in health sciences: A systematic review. Research Synthesis Methods, 13(3), 353–362. https://doi.org/10.1002/jrsm.1553

Cierco Jimenez, Lee, Rosillo, Cordova, Cree, I. A., Gonzalez, Indave Ruiz, B. I. (2022). Machine learning computational tools to assist the performance of systematic reviews: A mapping review. BMC Medical Research Methodology, 22(1), Article 322. https://doi.org/10.1186/s12874-022-01805-4

Coiera, Liu (2022). Evidence synthesis, digital scribes, and translational challenges for artificial intelligence in healthcare. Cell Reports Medicine, 3(12), Article 100860. https://doi.org/10.1016/j.xcrm.2022.100860

Cowie, Rahmatullah, Hardy, Holub, Kallmes (2022). Web-based software tools for systematic literature review in medicine: Systematic search and feature analysis. JMIR Medical Informatics, 10(5), Article e33219. https://doi.org/10.2196/33219

Dos Santos, A. O., Da Silva, E. S., Couto, L. M., Reis, G. V. L., Belo, V. S. (2023). The use of artificial intelligence for automating or semi-automating biomedical literature analyses: A scoping review. Journal of Biomedical Informatics, 142, Article 104389. https://doi.org/10.1016/j.jbi.2023.104389

Elliott, J. H., Turner, Clavisi, Thomas, Higgins, J. P. T., Mavergames, Gruen, R. L. (2014). Living systematic reviews: An emerging opportunity to narrow the evidence-practice gap. PLoS Medicine, 11(2), Article e1001603. https://doi.org/10.1371/journal.pmed.1001603

Feng, Liang, Zhang, Chen, Wang, Huang, Sun, Liu, Zhu, Pan (2022). Automated medical literature screening using artificial intelligence: A systematic review and meta-analysis. Journal of the American Medical Informatics Association, 29(8), 1425–1432. https://doi.org/10.1093/jamia/ocac066

Flemyng, Noel-Storr, Macura, Gartlehner, Thomas, Meerpohl, J. J., Jordan, Minx, Eisele-Metzger, Hamel, Jemioło, Porritt, Grainger (2025). Position statement on artificial intelligence (AI) use in evidence synthesis across Cochrane, the Campbell Collaboration, JBI, and the Collaboration for Environmental Evidence 2025. Campbell Systematic Reviews, 21(4), Article e70074. https://doi.org/10.1002/cl2.70074

Guo, Gupta, Deng, Park, Y. J., Paget, Naugler (2024). Automated paper screening for clinical reviews using large language models: Data analysis study. Journal of Medical Internet Research, 26, Article e48996. https://doi.org/10.2196/48996

Hanegraaf, Wondimu, Mosselman, J. J., de Jong, Abogunrin, Queiros, Lane, Postma, M. J., Boersma, van der Schans (2024). Inter-reviewer reliability of human literature reviewing and implications for the introduction of machine-assisted systematic reviews: A mixed-methods review. BMJ Open, 14(3), Article e076912. https://doi.org/10.1136/bmjopen-2023-076912

Jin, Leaman (2024). PubMed and beyond: Biomedical literature search in the age of artificial intelligence. EBioMedicine, 100, Article 104988. https://doi.org/10.1016/j.ebiom.2024.104988

Johnson, E. E., O’Keefe, Sutton, Marshall (2022). The systematic review toolbox: Keeping up to date with tools to support evidence synthesis. Systematic Reviews, 11(1), Article 258. https://doi.org/10.1186/s13643-022-02122-z

Khalil, Ameen, Zarnegar (2022). Tools to support the automation of systematic reviews: A scoping review. Journal of Clinical Epidemiology, 144, 22–42. https://doi.org/10.1016/j.jclinepi.2021.12.005

Legate, Nimon, Noblin (2024). (Semi) automated approaches to data extraction for systematic reviews and meta-analyses in social sciences: A living review. F1000Research, 13, Article 664. https://doi.org/10.12688/f1000research.151493.1

Lieberum, J. L., Töws, Metzendorf, M. I., Heilmeyer, Siemens, Haverkamp, Böhringer, Meerpohl, J. J., Eisele-Metzger (2025). Large language models for conducting systematic reviews: On the rise, but not yet ready for use - a scoping review. Journal of Clinical Epidemiology, 181, Article 111746. https://doi.org/10.1016/j.jclinepi.2025.111746

Marshall, I. J., Wallace, B. C. (2019). Toward systematic review automation: A practical guide to using machine learning tools in research synthesis. Systematic Reviews, 8(1), Article 163. https://doi.org/10.1186/s13643-019-1074-9

Ngongoma, Wei, Zhang, et al. (2025). Artificial intelligence readiness for evidence synthesis: An umbrella review. PROSPERO. CRD420251054446.

O’Connor, A. M., Tsafnat, Gilbert, S. B., Thayer, K. A., Wolfe, M. S. (2018). Moving toward the automation of the systematic review process: A summary of discussions at the second meeting of International Collaboration for the Automation of Systematic Reviews (ICASR). Systematic Reviews, 7(1), Article 3. https://doi.org/10.1186/s13643-017-0667-4

Roth, Wermer-Colan (2023). Machine learning methods for systematic reviews: A rapid scoping review. Delaware Journal of Public Health, 9(4), 40–47. https://doi.org/10.32481/djph.2023.11.008

Sallam (2023). ChatGPT utility in healthcare education, research, and practice: Systematic review on the promising perspectives and valid concerns. Healthcare, 11(6), Article 887. https://doi.org/10.3390/healthcare11060887

Sandner, Gütl, Jakovljevic, Wagner (2024). Screening automation in systematic reviews: Analysis of tools and their machine learning capabilities. Studies in Health Technology and Informatics, 313, 179–185. https://doi.org/10.3233/SHTI240034

Schmidt, Finnerty Mutlu, Elmore, Olorisade, Thomas, Higgins (2025). Data extraction methods for systematic review (semi)automation: Update of a living systematic review. F1000Research, 10, Article 401. https://doi.org/10.12688/f1000research.51117.3

Schmidt, Sinyor, Webb, R. T., Marshall, Knipe, Eyles, E. C., John, Gunnell, Higgins, J. P. (2023). A narrative review of recent tools and innovations toward automating living systematic reviews and evidence syntheses. Zeitschrift für Evidenz, Fortbildung und Qualität im Gesundheitswesen, 181, 65–75. https://doi.org/10.1016/j.zefq.2023.06.007

Shorey, Mattar, Pereira, T. L. B., Choolani (2024). A scoping review of ChatGPT’s role in healthcare education and research. Nurse Education Today, 135, Article 106121. https://doi.org/10.1016/j.nedt.2024.106121

Thomas, Hair, Noel-Storr, et al. (2026). Responsible use of AI in evidence SynthEsis (RAISE): Recommendations for practice (version 3; updated 13 March 2026). In Open Science Framework. Center for Open Science. https://osf.io/

Tricco, Antony, Zarin, Strifler, Ghassemi, Ivory, Perrier, Hutton, Moher, Straus, S. E. (2015). A scoping review of rapid review methods. BMC Medicine, 13(1), Article 224. https://doi.org/10.1186/s12916-015-0465-6

Yao, Kumar, M. V., Flores Miranda, Saha, Sussman (2024). Evaluating the efficacy of artificial intelligence tools for the automation of systematic reviews in cancer research: A systematic review. Cancer Epidemiology, 88, Article 102511. https://doi.org/10.1016/j.canep.2023.102511
