Abstract
Background
Evidence synthesis is crucial for informing evidence-based practice across various fields. However, the traditional methodology is resource-intensive, and its findings can be outdated before publication. There is a growing trend toward integrating automation and artificial intelligence (AI) approaches into evidence synthesis to enhance efficiency, but standardized adoption is still pending.
Objective
The goal of this study is to identify peer-reviewed evidence documenting AI readiness for evidence synthesis.
Methods
We searched MEDLINE, Embase, and Global Index Medicus in May 2025 to identify review articles that evaluated evidence synthesis tools. Relevant study reviews and tool reviews published in English between January 2020 and May 2025 were included in our review of reviews. Tool features and performance metrics were extracted according to stages of the evidence synthesis workflow, including search, screening, appraisal, extraction, and synthesis.
Results
We included 21 studies in our review of reviews and identified 46 evidence synthesis tools. Nine tools supported all five stages of the evidence synthesis workflow, among which DistillerSR covered the most workflow-supporting features (19 out of 21). Sensitivity rates for AI-powered title/abstract screening were reported for ten of the identified tools, all of which achieved 95% sensitivity.
Conclusion
This review found peer-reviewed evidence supporting AI readiness for human-supervised automation of title/abstract screening. However, evidence documenting AI readiness for other evidence synthesis tasks remains limited. DistillerSR and EPPI-Reviewer demonstrated the broadest feature support and strong evidence for AI-powered title/abstract screening. Our study highlights the potential of AI to improve efficiency while maintaining high sensitivity in the screening stage. AI-powered screening may serve as a critical first step toward scaling rapid reviews into living evidence syntheses.
Plain Language Summary
Why was the study done?
Reviewing large numbers of studies is essential for making informed decisions in healthcare and other fields. This process, called evidence synthesis, brings together findings from many studies to answer important questions. However, it can take a long time and requires a lot of effort from researchers. By the time a review is completed, new studies may already have been published, making the results less up to date. Artificial intelligence (AI) has been suggested as a way to speed up this process. AI tools can help with tasks such as searching for studies, screening which studies are relevant, extracting data from studies, and combining results. Despite growing interest, it is not yet clear how peer-reviewed literature documents the readiness of these tools.
What did the researchers do?
In this study, we looked at existing research to understand how well AI is currently supporting evidence synthesis. We searched major health databases for published review articles from 2020 to 2025 that evaluated tools used in evidence synthesis. We then collected evidence on what these tools can do and how well they perform at different stages of the review process. These stages include searching for studies, screening them for relevance, assessing their quality, extracting key information, and combining findings.
What did the researchers find?
We included 21 review articles and identified 46 different tools designed to support evidence synthesis. A few of these tools could support all stages of the review process. Among them, some tools, such as DistillerSR and EPPI-Reviewer, offered the widest range of features. We found the strongest evidence for the use of AI in screening studies by title and abstract. In this task, AI systems help researchers quickly decide which studies are likely to be relevant. Several tools reported high sensitivity, meaning they were able to correctly identify at least 95% of relevant studies in some settings. However, for other stages of evidence synthesis, such as assessing study quality or combining results, there is still limited evidence on how well AI performs. While AI shows promise, its use beyond screening is not yet fully supported by strong research.
What do the findings mean?
Overall, the evidence we found suggests that AI is ready to assist with some parts of evidence synthesis, especially the early screening stage. Using AI in this way could make reviews faster while still maintaining quality. This may also help support “living” reviews, which are updated regularly as new evidence becomes available. More research is needed to understand how AI can reliably support the full evidence synthesis process.
Keywords
Introduction
Evidence synthesis is a crucial evidence-based practice, providing a structured approach for aggregating and evaluating research findings to inform decision-making across various fields, including medicine, public health, public policy and the social sciences (Cierco Jimenez et al., 2022; Jin et al., 2024; Legate et al., 2024). The principle of evidence synthesis is to rigorously combine available research findings to summarize current knowledge about a specific topic, forming the foundation for evidence-based medicine and improving healthcare decision-making, diagnosis, and treatment (Cierco Jimenez et al., 2022; Legate et al., 2024). Among the range of evidence synthesis methodologies, systematic reviews are considered the gold standard. The classic five-stage systematic review workflow involves searching, screening, critically appraising, extracting, and synthesizing “the best available evidence within pre-specified eligibility criteria” (Cierco Jimenez et al., 2022; Coiera & Liu, 2022; Legate et al., 2024). Other evidence synthesis approaches share this core workflow but differ in the emphasis placed on individual stages. However, this workflow is resource-demanding and may produce findings that are outdated before publication. The median time to complete a systematic review is approximately 17 months, with some taking over 2 years, which limits their applicability in rapidly evolving policy environments (Cierco Jimenez et al., 2022; Guo et al., 2024; Jin et al., 2024; Legate et al., 2024; Marshall & Wallace, 2019). Living systematic reviews were introduced to address this lag by enabling the continuous updating of findings as new data emerge; however, they still require substantial human and financial resources, making large-scale implementation difficult (Bastian et al., 2010; Elliott et al., 2014).
The importance of evidence synthesis methods, particularly systematic reviews, has grown significantly over time. This growth has been driven globally by two key factors. First, the development and spread of Health Technology Assessment, initiated by the U.S. Office of Technology Assessment in 1976, established systematic reviews as a key method for rigorously evaluating the efficacy, safety, and cost-effectiveness of health technologies (Banta & Jonsson, 2009). Second, global health crises, such as the COVID-19 pandemic, have underscored the need for rapid synthesis of research to support evidence-informed public health decision-making. By mid-2022, the NIH’s LitCOVID Hub catalogued approximately 270,000 COVID-19-related research articles (Coiera & Liu, 2022). However, the large volume of evidence made it difficult to quickly identify, synthesize, and apply critical findings, leading to delays that had real-world impacts on pandemic response efforts (Coiera & Liu, 2022).
To address the need for faster evidence synthesis, numerous rapid review approaches have been developed. Although these reviews provide more timely findings by simplifying the synthesis workflow, they are often associated with lower reporting quality (Tricco et al., 2015). Transitioning rapid reviews into living evidence syntheses could offer a solution that improves their methodological rigor without compromising timeliness, though the process would require new methodological standards.
Automation and artificial intelligence (AI) have been increasingly integrated into various stages of systematic reviews, including literature search, screening, critical appraisal and data extraction, to address these challenges (Cierco Jimenez et al., 2022; Guo et al., 2024; Johnson et al., 2022). The Systematic Review Toolbox, a widely used repository tracking systematic review tools, catalogued 235 software tools as of December 2022 (Johnson et al., 2022), many of which incorporate AI to varying extents. AI methodologies, such as machine learning (ML), have demonstrated potential in handling unstructured biomedical literature, streamlining screening and data extraction processes (Amin et al., 2023; Cierco Jimenez et al., 2022). Despite these advancements, AI-driven approaches to systematic reviews and other evidence synthesis methods have yet to be fully integrated into mainstream methodologies. Recognizing this gap, the International Collaboration for the Automation of Systematic Reviews (ICASR) has emphasized the need for standardized evaluation metrics, interoperability among AI-powered evidence synthesis tools, and transparent validation frameworks to ensure their reliability and widespread adoption (O’Connor et al., 2018).
This review of reviews aims to identify peer-reviewed evidence on the capabilities of evidence synthesis tools and the performance of their AI-powered features. In this review, we will use the following working definitions:
• Evidence synthesis tools: software tools that assist reviewers in completing one or more stages of the evidence synthesis workflow (e.g., Covidence), including general-purpose tools with relevant capabilities (e.g., ChatGPT)
• Workflow-supporting features: components of evidence synthesis tools that assist with a specific task in the evidence synthesis workflow (e.g., a module that supports manual title/abstract screening)
• AI-powered features: workflow-supporting features that use ML, natural language processing, generative AI or other data-driven methods to support or automate a specific task (e.g., an ML-based text classification model that automates title/abstract screening)
The review will search, screen, appraise, extract and synthesize evidence from published reviews that evaluate the application of AI in the evidence synthesis workflow. It aims to identify evidence that addresses the following research questions:
• What types of evidence synthesis tools are available, and which stages of the evidence synthesis workflow do their features support?
• What metrics and methods are used to evaluate AI-powered features of evidence synthesis tools? What does evaluative evidence suggest about their performance?
• Which stages of evidence synthesis appear most ready for AI-powered scaling of rapid reviews into living evidence syntheses?
Our study synthesizes existing peer-reviewed literature on evidence synthesis tools rather than directly assessing current AI readiness for evidence synthesis. As a result, the evidence captured in our review reflects publication timelines. Established tools well represented in the literature (e.g., RobotReviewer, developed over a decade ago) have a larger body of published evidence, whereas more recent tools (e.g., Claude) have comparatively limited evidence on their performance in evidence synthesis workflows, which shapes their representation in our review.
Methods
The study design of this review is included in our protocol (Ngongoma et al., 2025), which was registered in PROSPERO on June 6, 2025 (CRD420251054446).
Literature Search
We searched MEDLINE, Embase, and Global Index Medicus for articles published from January 1, 2020 to May 7, 2025. To formulate the search strategy, we used search terms for the following concepts:
• Artificial Intelligence
• Evidence Synthesis as a Subject
• Reviews as Publication Types
• Tools and Task-Specific Terminology
The initial search strategy was developed with a librarian in MEDLINE on the EBSCO platform. Comparable search strategies were then developed for Embase and Global Index Medicus. The detailed search strategies are listed in Supplemental Material 1.
Eligibility Criteria
We imported all retrieved articles into Covidence software, which conducted deduplication automatically. To select peer-reviewed review articles that evaluated AI-powered features of evidence synthesis tools, we developed the following eligibility criteria:
• Study Type: Reviews of studies (e.g., systematic reviews and other evidence synthesis articles) and reviews of tools (e.g., overviews of evidence synthesis tools) focused on evidence synthesis in the health sciences and policy fields.
• Topic: Eligible reviews must evaluate AI-powered or automated features of tools applied to any stage of the evidence synthesis workflow, namely search, screening, appraisal, extraction and synthesis.
• Outcomes: Reviews must report performance outcomes such as accuracy, time savings, efficiency, or inter-rater agreement. Reviews may also report on implementation outcomes, including usability, trust, transparency, reproducibility, or barriers to adoption.
Preprints and conference abstracts were excluded. We also excluded studies that were published prior to 2020, were not in English, or were found during quality appraisal to lack methodological clarity or to present a high risk of bias. The timeframe was selected to capture recent advancements in AI development and use in systematic review workflows.
Study Screening
Screening was conducted in Covidence software. Prior to full title/abstract screening, a screening pilot was conducted to ensure consistency among reviewers. Two reviewers independently screened titles and abstracts to identify potentially eligible reviews based on our inclusion criteria. Full-text articles of potentially relevant reviews were then retrieved and screened independently by two reviewers. Discrepancies arising during either the title/abstract or full-text screening stages were discussed between the two reviewers, and a third reviewer adjudicated if agreement could not be achieved. Screening decisions and reasons for exclusion at the full-text stage were documented in Covidence and summarized in Figure 1.
Figure 1. PRISMA Flow Diagram of the Review
Quality Assessment
Included reviews were critically appraised for risk of bias and study quality using AMSTAR 2, with assessments conducted independently by two reviewers. Although AMSTAR 2 is designed for systematic reviews, we found its criteria useful for appraising other types of reviews included in our study. A third reviewer adjudicated if consensus could not be achieved.
Data Extraction
We performed data extraction using a standardized extraction form that was pilot tested to ensure consistency and calibration. Two reviewers extracted data independently, and any discrepancies were resolved by a third reviewer. We extracted information on evidence synthesis tools identified in each review, including their features and performance metrics if available. We contacted authors to clarify any missing or unclear data. The detailed extraction form is provided in Supplemental Material 2.
Data Synthesis
Due to the heterogeneity of included studies, a meta-analysis was not possible for this review. Instead, we conducted a qualitative synthesis, visualizing temporal ranges of sources cited in the included reviews, the distribution of identified tools across supported stages of the evidence synthesis workflow, and the coverage of workflow-supporting features among identified tools.
We derived a list of workflow-supporting features from the feature analysis by Cowie et al. (2022), due to its comprehensiveness and the relevant data it provided. From the original list of 30 features, we selected 20 features that could be mapped to the five stages of the evidence synthesis workflow. We added one additional workflow-supporting feature to the list: automatic ranking of references during screening, which can accelerate the screening stage by helping reviewers prioritize relevant studies.
We synthesized all identified evidence synthesis tools into a single list. Using the tool and feature lists, we constructed a two-dimensional grid heat map with three coding categories. Each tool-feature pair was coded as “covered” if any included review reported that the tool supports the feature, “not covered” if a review reported that the tool does not support the feature, and “unknown” if the feature was not mentioned in relation to the tool. Tools that did not cover any of the listed workflow-supporting features were excluded from the heat map.
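The three-category coding rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the procedure used to build the published heat map: the tool names, the short feature list, and the `(tool, feature, supported)` input format are hypothetical, and the sketch assumes that a single report of support outweighs any report of non-support, as the "covered if any included review reported" wording suggests.

```python
from collections import defaultdict

# Hypothetical extraction records: (tool, feature, supported?) triples,
# one per statement found in an included review.
reports = [
    ("ToolA", "Title/Abstract Screening", True),
    ("ToolA", "Dual Extraction", False),
    ("ToolA", "Dual Extraction", True),   # a second review reports support
    ("ToolB", "Database Search", False),
]
features = ["Database Search", "Title/Abstract Screening", "Dual Extraction"]


def code_grid(reports, features):
    """Code each tool-feature pair as covered / not covered / unknown."""
    seen = defaultdict(set)
    for tool, feature, supported in reports:
        seen[(tool, feature)].add(supported)

    grid = {}
    for tool in sorted({tool for tool, _, _ in reports}):
        row = {}
        for feature in features:
            flags = seen.get((tool, feature), set())
            if True in flags:        # any review reports support
                row[feature] = "covered"
            elif False in flags:     # only reports of non-support
                row[feature] = "not covered"
            else:                    # feature never mentioned for this tool
                row[feature] = "unknown"
        # Tools with no covered feature are left out of the heat map.
        if "covered" in row.values():
            grid[tool] = row
    return grid


grid = code_grid(reports, features)
```

In this toy input, `ToolA` is kept with "Dual Extraction" coded as covered despite one negative report, while `ToolB` is dropped because none of its features are covered.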
In addition to the heat map, we compiled a list of tools reported to facilitate a specific type of evidence synthesis: systematic reviews of RCTs, given their crucial role in evidence-based medicine. For each tool in this list, we documented the features designed for systematic reviews of RCTs and the stages of the workflow they support.
After extracting performance metrics specifically for AI-powered features across evidence synthesis tools, we found that, overall, tools rarely reported the same metric for a given feature. Aside from sensitivity of AI-powered title/abstract screening, which was reported by 10 tools, at most three tools reported the same type of metric. The limited data pose challenges for interpretation and comparison, so we defer addressing these sparse metrics in the present work and will develop synthesis methods for them as we scale up this study. For sensitivity of AI-powered title/abstract screening, however, we had sufficient data to enable meaningful comparison. We therefore prioritize sensitivity as the primary measure for two reasons.
First, from a reporting standpoint, most peer-reviewed studies evaluating AI-powered title/abstract screening report performance in terms of sensitivity. The sparseness of other metrics limits comparability across tools and constrains our ability to synthesize them systematically.
Second, from a practical standpoint, maximizing the inclusion of true positives during title/abstract screening is critical. False negatives represent irreversibly missed evidence, whereas false positives can be addressed during subsequent full-text screening. Therefore, sensitivity is particularly crucial. This is reinforced by the empirical characteristics of evidence synthesis, where screening typically involves identifying a relatively small subset of eligible studies (often less than 10%) from a large pool of references (typically thousands). In this highly imbalanced context, sensitivity is the most informative metric for capturing meaningful differences in tool performance.
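A small worked example may help show why sensitivity is informative when eligible studies are rare. The sketch below uses invented counts (a pool of 5,000 references with 250 eligible, i.e. 5%), not data from any included review, and the function name `screening_metrics` is an illustrative assumption.

```python
def screening_metrics(tp, fn, fp, tn):
    """Return basic screening metrics from confusion-matrix counts."""
    total = tp + fn + fp + tn
    return {
        "sensitivity": tp / (tp + fn),   # share of eligible studies found
        "specificity": tn / (tn + fp),   # share of ineligible studies excluded
        "accuracy": (tp + tn) / total,   # share of all decisions that are correct
    }


# Hypothetical pool: 5,000 references, 250 of them eligible (5%).
# A useless classifier that excludes every reference:
exclude_all = screening_metrics(tp=0, fn=250, fp=0, tn=4750)

# A hypothetical tool that misses 10 eligible studies and
# passes 1,000 ineligible ones on to full-text screening:
tool = screening_metrics(tp=240, fn=10, fp=1000, tn=3750)

print(exclude_all)
print(tool)
```

Under these assumed counts, the exclude-everything classifier scores higher on accuracy than the hypothetical tool even though it finds no eligible studies, whereas sensitivity separates the two clearly.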
To illustrate reported sensitivity, we used a bubble chart in which bubble size represents the percentage of references saved from manual screening. Percentages of manual screening saved were commonly reported alongside sensitivity in included studies and may provide a practical reference for researchers.
After extracting the full conclusions of each included review, we developed five categories to classify overall concluding remarks on AI readiness: recommended for use, cautious recommendation, further research needed, not recommended for use, and not applicable. Each review’s conclusion was assigned to one of these categories.
We also synthesized the affiliations of lead authors from included reviews to examine their geographic and disciplinary distributions.
Results
Our review of reviews yielded a final list of 21 articles published between 2021 and 2025, including 15 reviews of studies and 6 reviews of tools (Abdelkader et al., 2021; Abogunrin et al., 2025; Affengruber et al., 2024a, 2024b; Aletaha et al., 2023; Blaizot et al., 2022; Cierco Jimenez et al., 2022; Cowie et al., 2022; Dos Santos et al., 2023; Feng et al., 2022; Hanegraaf et al., 2024; Khalil et al., 2022; Legate et al., 2024; Lieberum et al., 2025; Roth & Wermer-Colan, 2023; Sallam, 2023; Sandner et al., 2024; Schmidt et al., 2023, 2025; Shorey et al., 2024; Yao et al., 2024). No eligible reviews were published in 2020. The included reviews encompass various types of evidence synthesis, such as systematic, scoping, and narrative reviews. We identified 46 distinct tools from these reviews. Additionally, 21 studies that met the title/abstract screening criteria were excluded following full-text review. Details of these excluded studies are provided in Supplemental Material 3.
The included reviews were predominantly authored by researchers based in economically developed, English-speaking countries (e.g., the United States and Canada), with comparatively fewer authors from the Global South. This distribution could be influenced by our eligibility criteria, as only English-language studies were included in this review. In terms of disciplinary background, most authors were affiliated with health-related fields, while those from a computer science background constituted a sizeable minority. Details of the distributions of the lead authors’ affiliations are provided in Supplemental Materials 4 and 5.
We analyzed the publication year range of sources cited in each included review to examine their temporal distribution. As illustrated in Figure 2, the years 2018 to 2020 are covered by the highest number of reviews (n = 19). The reduced coverage from 2021 to 2024 likely reflects the inherent delay between the publication of primary research and its incorporation into reviews.
Figure 2. Temporal Coverage of Sources Cited in Each Review. A Bar at a Given Year Indicates the Number of Reviews Whose Sources’ Publication Year Range Covers that Year
The availability of tools across the stages of evidence synthesis appears uneven, based on those identified in this review. As shown in Figure 3, tools are more commonly available for screening, literature search, and data extraction. Comparatively fewer tools support more complex interpretive tasks such as critical appraisal and synthesis.
Figure 3. Distribution of Identified Tools Across Supported Stages of the Evidence Synthesis Workflow
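The counting behind a temporal coverage chart of this kind can be illustrated with a short Python sketch. The review labels and year ranges below are hypothetical placeholders rather than extracted data; the only assumption carried over from the text is that a review "covers" every year between the earliest and latest publication year of its cited sources.

```python
from collections import Counter

# Hypothetical (earliest, latest) publication years of the sources
# cited by each review; real values would come from the extraction form.
cited_year_ranges = {
    "Review A": (2005, 2020),
    "Review B": (2015, 2022),
    "Review C": (2018, 2024),
}


def coverage_by_year(ranges):
    """Count, for each year, how many reviews' source ranges include it."""
    counts = Counter()
    for first, last in ranges.values():
        for year in range(first, last + 1):  # inclusive of both endpoints
            counts[year] += 1
    return dict(sorted(counts.items()))


coverage = coverage_by_year(cited_year_ranges)
```

Each value in `coverage` corresponds to the height of one bar; with these placeholder ranges, the years 2018 to 2020 are covered by all three reviews and later years by fewer.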
Across 46 identified tools and 21 workflow-supporting features, feature coverage is also uneven (Figure 4). Coverage breadth differs substantially by tool: while the median number of covered features is only 3, five tools cover ≥15 features (DistillerSR, EPPI-Reviewer, Covidence, NestedKnowledge, and Giotto Compliance).
Figure 4. Coverage of 21 Workflow-Supporting Features Across 46 Evidence Synthesis Tools. Blue Indicates Covered, Red Indicates Not Covered, and Grey Indicates Unknown. Abbreviations: SWIFT: Sciome Workbench for Interactive Computer-Facilitated Text-Mining; ITSS: Interactive Text Summarization System for Scientific Documents; LLaMA: Large Language Model Meta AI; OATS: Ontology-Based and User-Focused Automatic Text Summarization; DASyR: Document Analysis System for Systematic Reviews; BERT: Bidirectional Encoder Representations From Transformers; LaMDA: Language Model for Dialogue Applications; PaLM: Pathways Language Model; SDES: Semi-Automatic Data Extraction System for Heterogeneous Data Sources; MeSH: Medical Subject Headings
The most commonly supported features are Data Extraction (n = 25), Title/Abstract Screening (n = 24), and Database Search (n = 24). In contrast, Citation Management (n = 2), Automated Full-Text Retrieval (n = 4), and Dual Extraction (n = 5) are supported by the fewest tools.
Two points should be considered when interpreting our heat map of feature coverage. First, the map reflects documented evidence of AI readiness in peer-reviewed literature. The actual tools could be updated or become unavailable without being reported in the literature. Second, feature coverage does not indicate whether a feature is AI-powered. For example, the “Data Extraction” feature could be marked as covered for a tool that provides storage for extracted data or offers AI-powered suggestions during extraction (e.g., Covidence).
For AI-powered title/abstract screening, ten tools achieved 95% sensitivity, of which seven (70%) additionally reported the proportion of citations screened manually during evaluation (Figure 5). For these seven tools, sensitivity was evaluated after training the tool on manually screened citations, reflecting the level of sensitivity users might attain after manually screening a similar percentage of studies. For example, all reported sensitivity rates of EPPI-Reviewer were above 95%, and the lowest associated percentage of manually screened citations reported was 38%, suggesting that up to 62% of manual screening could be saved at 95% sensitivity. These reported sensitivity values and savings in manual screening should be interpreted with caution, as they depend on the representativeness of the studies included in the manually screened training set and on other evaluation parameters that were not specified in the reviews.
Figure 5. Sensitivity of AI-Powered Title/Abstract Screening by Tool. A Bubble Represents a Sensitivity Value Reported With the Percentage of Saved Manual Screening (Indicated by Bubble Size). A Cross Represents a Single Reported Value of Sensitivity. The Dashed Line Marks Sensitivity = 0.95
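The arithmetic linking the manually screened proportion to the saved proportion can be made explicit with a brief sketch. Only the 38% figure and the 0.95 threshold come from the text; the other evaluation pairs and the function name `max_saving_at_threshold` are hypothetical, and the sketch assumes that saved screening is simply one minus the manually screened proportion.

```python
def max_saving_at_threshold(evaluations, threshold=0.95):
    """Largest share of manual screening saved among evaluations
    whose reported sensitivity meets the threshold.

    `evaluations` holds (sensitivity, proportion_screened_manually) pairs.
    Returns None when no evaluation meets the threshold.
    """
    eligible = [manual for sens, manual in evaluations if sens >= threshold]
    if not eligible:
        return None
    # Saved screening is the complement of the manually screened share.
    return 1.0 - min(eligible)


# The 0.38 proportion mirrors the EPPI-Reviewer example in the text;
# the sensitivities and other proportions are invented for illustration.
evaluations = [(0.97, 0.38), (0.99, 0.55), (0.96, 0.70)]
saving = max_saving_at_threshold(evaluations)
print(f"Up to {saving:.0%} of manual screening saved")
```

A reviewer piloting a tool could apply the same calculation to their own pilot results, keeping in mind the caveats about training-set representativeness noted above.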
Tools Reported to Support Systematic Reviews of RCTs
In terms of overall concluding remarks on AI readiness for evidence synthesis (Figure 6), most reviews either called for further research (42.9%, n = 9) or offered cautious recommendation (33.3%, n = 7). Three reviews (14.3%) firmly recommended using AI to assist evidence synthesis, one review (4.8%) advised against it, and one review (4.8%) did not provide a conclusion on AI readiness. This distribution aligns with the concentration of reported tools and performance metrics at the screening stage and their relative absence at other stages of the evidence synthesis workflow. It also aligns with current Responsible use of AI in Evidence SynthEsis (RAISE) recommendations, which emphasize that although AI has the potential to enhance efficiency, the limited evidence base makes recommending broad adoption difficult and indicates the need for clear guidance on when and how to apply specific AI tools appropriately (Thomas et al., 2026).
Figure 6. Distribution of Overall Concluding Remarks on AI Readiness for Evidence Synthesis Across Included Reviews
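The reported percentages follow directly from the category counts, as the short check below illustrates. The counts are those stated in the text; mapping the review without a conclusion on AI readiness to the "not applicable" category is an assumption based on the five categories defined in the Methods.

```python
# Counts of included reviews per concluding-remark category (n = 21).
category_counts = {
    "further research needed": 9,
    "cautious recommendation": 7,
    "recommended for use": 3,
    "not recommended for use": 1,
    "not applicable": 1,  # assumed label for the review with no conclusion
}


def category_shares(counts):
    """Convert category counts to percentages rounded to one decimal."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


shares = category_shares(category_counts)
for name, pct in shares.items():
    print(f"{name}: {pct}% (n = {category_counts[name]})")
```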
The authors’ original conclusions and the categories assigned to their remarks on AI readiness are provided in Supplemental Table 6.
Discussion
This review synthesizes findings from previous reviews to present an overview of peer-reviewed evidence on evidence synthesis tools, their workflow-supporting features, and the performance of their AI-powered features. In 21 included studies, we found a broad range of methods and metrics to evaluate AI-powered features of evidence synthesis tools. Due to the heterogeneity of evaluation methods used by different studies, combining their evaluative evidence presents a challenge. Our review attempted to extract all relevant features and performance metrics reported for each evidence synthesis tool in included studies and the authors’ overall opinions on AI readiness for evidence synthesis. The key findings of our review were synthesized through cataloging features and aggregating feature-specific performance metrics for each tool. Our decision to use these synthesis methods was based on considerations of scientific rigor and practical use.
The included studies also varied in their scope with respect to the evidence synthesis workflow. Several studies focused on one stage within the workflow or one specific workflow-supporting feature, while others addressed the workflow in its entirety. Our review synthesized evidence across these studies to capture the full picture of AI readiness for evidence synthesis. This broader perspective helps reveal patterns that are not evident in feature- or stage-specific studies, such as the concentration of quantitative evidence on AI-powered title/abstract screening and the relative lack of evidence for other AI-powered features.
Through data synthesis, we developed a heat map that maps 46 evidence synthesis tools to 21 workflow-supporting features. Cowie et al. (2022) presented a heat map that served as a foundational data source for our analysis. By incorporating evidence from multiple sources and adding newly identified tools, our heat map represents an integrative synthesis that may serve as the basis for a publicly accessible and updatable evidence and gap map (EGM). This could enable future researchers to populate areas for which evidence was unavailable within the scope of this review.
Lessons Learned
According to the RAISE recommendations, reviewers who use AI to assist evidence synthesis should provide evidence that AI use “will not undermine the trustworthiness or reliability of the synthesis or its conclusions” (Flemyng et al., 2025). Our findings illustrate the types of empirical evidence that may be sufficient to justify AI-assisted approaches. Achieving consistent sensitivity rates above 95% suggests that tools such as EPPI-Reviewer can be considered reliable for title/abstract screening. The reported proportion of records screened manually alongside sensitivity offers a practical reference point: reviewers can conduct small-scale pilot tests, manually screening a comparable proportion of records and adjusting this proportion based on observed AI performance, thereby maintaining appropriate human oversight and risk mitigation.
We recommend that future evaluation studies explicitly report supporting procedural steps (e.g., the extent of manual screening) alongside the associated performance metrics, as these elements are inseparable for assessing the trustworthiness of reported results. Despite software tools such as DistillerSR and EPPI-Reviewer supporting all stages of evidence synthesis, substantially more evidence exists for AI readiness in title/abstract screening than for other workflow-supporting features. Therefore, we encourage further evaluation studies of comparable rigor for AI-powered features beyond title/abstract screening.
By mapping the current evidence on AI readiness across the evidence synthesis workflow and identifying both areas of strength and research gaps, this review contributes to ongoing efforts to accelerate, scale and sustain living evidence synthesis. We hope that strategic integration of AI tools into living evidence synthesis approaches will facilitate the delivery of timely, continuously updated evidence for decision-makers.
Limitations
Several limitations should be considered when interpreting the findings of this review. When an evidence synthesis tool was reported to support a given feature, this designation encompassed heterogeneous meanings. In some studies, it referred to the availability of software functionality for users to complete the task manually, whereas in other studies it denoted the presence of a partially or fully automated process. As a result, we were unable to systematically characterize the degree of automation provided by each tool for each task.
Moreover, details regarding the level of human intervention required for AI-powered features were frequently lacking. Performance metrics were often synthesized in included reviews without adequate description of workflows, limiting our ability to derive clear guidance for tool use. Importantly, AI performance data were heavily concentrated on title/abstract screening, while the lack of standardized and comprehensive evaluation of other AI-powered workflow-supporting features resulted in sparse and heterogeneous metrics, limiting our ability to synthesize findings. Further investigation into tool functionalities and primary studies would be required to address these gaps.
Based on the affiliations of lead authors of included reviews, the evidence base we identified may be shaped by a concentration of researchers with health and medical backgrounds from economically developed, English-speaking countries. We hope to capture broader international representation and interdisciplinary cooperation in future iterations of this work.
The scope of this review was limited by language restrictions and the selected time frame. Studies published in languages other than English, prior to January 2020, or after May 2025, as well as gray literature, were excluded from the review. Given the rapid pace of AI development, our findings may not fully reflect the most recent advances in this area.
Nevertheless, this review provides a structured approach for documenting evidence that assesses AI readiness in evidence synthesis. As a next step, we plan to transition this work into a living evidence synthesis and expand our search and inclusion criteria to incorporate newly published studies and mitigate the scope limitation. By incorporating gray literature, such as technical reports from tool providers, we aim to produce an EGM that more accurately reflects the availability of tools and the degree of automation they provide.
Conclusion
This review identified peer-reviewed evidence supporting AI readiness for human-supervised title/abstract screening, whereas evidence for AI readiness in other evidence synthesis tasks is limited and requires further evaluation. Among the tools assessed, DistillerSR and EPPI-Reviewer supported the largest number of features and showed strong performance in AI-powered title/abstract screening. Overall, our study highlights the potential of AI to improve efficiency while maintaining high sensitivity during screening, and it identifies gaps in evidence for AI use in other stages of the evidence synthesis workflow. Applying AI in the screening stage may therefore serve as a foundational step toward scaling rapid reviews into living evidence syntheses.
Supplemental Material
Supplemental Material for Artificial Intelligence (AI) Readiness to Support Evidence Synthesis by Workflow: Findings From a Review of Reviews by Zijing Wei, Luyanda Ngongoma, Jose Cols, Arina L. Bogdan, Ariel Lin, Claire Zhang, Yue Su, Nuno de Jesus Ximenes, Chloe Zhu, Yoav Ackerman, Heather L. Bullock, Juhua Hu and Yanfang Su in Campbell Systematic Reviews.
Footnotes
Acknowledgements
We thank Diana Louden for assistance developing the search strategies and running the searches. We thank Cara Evans for providing critiques throughout the study. We thank Youyi Li for assistance screening articles for inclusion. We thank Serena Chu and Ensheng Dong for valuable comments on the manuscript.
Author Contributions
YanfangS led the study design and supervised title screening, data extraction, data analysis and manuscript writing. HB and LN participated in the study design. LN led protocol writing, registration, title screening and data extraction. ZW led data analysis and manuscript writing. LN, ZW, ClaireZ, AL, YueS, and NJX conducted title screening and data extraction. ZW, JC, AL, and ChloeZ performed data analysis. ZW, LN and JC created the figures and tables. ZW, JC, AL and YA wrote the first draft of the manuscript. JH and AB provided critical feedback throughout the study. All authors reviewed the manuscript and approved the final version.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
