Abstract
Systematic reviews are essential for evidence synthesis but often require extensive time and resources, especially during data extraction. This proof-of-concept study evaluates the performance of Elicit, an AI tool specifically developed to support systematic reviews, in the context of a systematic review on psychological factors in dermatological conditions. We compared Elicit’s automated data extraction with manually extracted data across 43 studies and 602 data points. Both were assessed against a consensus-based ground truth. Elicit achieved an overall accuracy of 81.4%, compared to 86.7% for human reviewers—a difference that was not statistically significant. In cases where Elicit and the human reviewer extracted the same information, this information was correct in 100% of instances, suggesting that agreement between human and machine may serve as a reliable proxy for validity. Based on these results, we propose a semi-automated workflow in which Elicit functions as a second reviewer, reducing workload while maintaining high data quality. Our results demonstrate that domain-specific AI tools can effectively augment data extraction in systematic reviews, especially in settings with limited time or personnel.
Introduction
Systematic reviews are often considered the “gold standard” of evidence synthesis. They consolidate existing findings, identify knowledge gaps, and inform both policy and practice in a transparent and structured manner (e.g., Alshami et al., 2023; Gue et al., 2024; Guo et al., 2024; Li et al., 2024). By adhering to a rigorous methodological framework—including clearly defined research questions, comprehensive literature searches, systematic screening, critical appraisal, data extraction, and synthesis—they aim to provide an unbiased and comprehensive summary of the current evidence (Egger et al., 2022; Mathes et al., 2017; Wollscheid & Tripney, 2021).
The importance of systematic reviews continues to grow due to the exponential increase in scientific publications. Between 2016 and 2022, the number of published articles rose by 47% (Hanson et al., 2024), a trend that is expected to persist, with projected annual growth rates between 4% and 8% (Bornmann et al., 2021; Bornmann & Mutz, 2015). As the volume of scientific output expands, efficiently navigating and identifying relevant knowledge becomes increasingly challenging.
However, conducting systematic reviews remains inherently labor-intensive and time-consuming (Alshami et al., 2023; Borah et al., 2017; Wang et al., 2020). Estimates suggest that completing a review takes from 6 months to well over a year (Borah et al., 2017; Haddaway & Westgate, 2019) and costs in excess of $100,000 (Lau, 2019; Michelson & Reuter, 2019). Within systematic reviews, study selection and data extraction are particularly resource-intensive steps (Nussbaumer-Streit et al., 2021).
In response, a variety of technical (and nontechnical) tools have been introduced to assist with aspects of the systematic-review process (e.g., Brignardello-Petersen et al., 2024; Scott et al., 2021). While these tools certainly streamline the process, they facilitate rather than replace the substantial manual work that remains essential, and they still require significant human involvement (Blaizot et al., 2022; Gates et al., 2018; Jonnalagadda et al., 2015; Schmidt et al., 2023).
However, the advent of Large Language Models (LLMs) using generative pre-trained transformers (GPTs) has introduced transformative potential for truly automating these aspects of the systematic-review process (e.g., Alshami et al., 2023; Gue et al., 2024; Mahuli et al., 2023; National Institute for Health and Care Excellence, 2024; van Dis et al., 2023). So far, research on LLM applications in systematic reviews has primarily focused on study selection (e.g., Scherbakov et al., 2025). In contrast, significantly less evidence exists regarding the use of LLMs for full-text data extraction (Hu & Geng, 2024), despite this step being both time-consuming (Gue et al., 2024; Polanin et al., 2019) and prone to cognitive strain and errors (Garousi & Felderer, 2017; Mathes et al., 2017). Reported error rates in human data extraction range from approximately 15% (Buscemi et al., 2006) to 30% (Horton et al., 2010; Zhu et al., 2023) and up to 45% (Tang et al., 2023). These findings highlight the need for alternative procedures in data extraction, a task that LLMs should theoretically be capable of performing with high accuracy (Schmidt et al., 2025).
However, evidence from recent studies (Clark et al., 2025; Gue et al., 2024; Khraisha et al., 2024; Mahmoudi et al., 2024; Reason et al., 2024; Sun et al., 2024; Tao et al., 2024; Yun et al., 2024) suggests that while LLMs can extract relatively simple, binary, or explicitly stated information with sufficient precision, they struggle with more complex, nuanced, or context-dependent data. As Sun et al. (2024, p. 2) summarize their results: “Whilst promising, the percentage of correct responses is still unsatisfactory and therefore substantial improvements are needed for current AI tools to be adopted in research practice.”
Importantly, all of the studies discussed above used a general-purpose LLM such as GPT-3.5 (Gue et al., 2024; Mahmoudi et al., 2024; Sun et al., 2024), GPT-4 (Khraisha et al., 2024; Reason et al., 2024; Tao et al., 2024), or Claude 2 (Sun et al., 2024), or relied on open-source models based on general LLMs (Yun et al., 2024). They therefore also face the problem of “hallucinating” information, that is, the tendency of LLMs to generate incorrect, misleading, or unrelated information (Susnjak, 2023). For instance, LLMs extracted incorrect numbers of participants, hallucinated a control group when none was present in the study, or fabricated a numeric answer when the actual answer was missing (Gartlehner et al., 2024; Kartchner et al., 2023; Schmidt et al., 2025; Yun et al., 2024). This presents a significant challenge for the integration of LLMs in systematic reviews, particularly due to the rigorous demands for accuracy and reliability in this field. Even small inaccuracies can have far-reaching consequences, potentially undermining the credibility of the review process and distorting its conclusions (Susnjak et al., 2024).
To address this challenge, the current study focuses on the proprietary AI tool Elicit, specifically designed to facilitate the (systematic) literature review process. Elicit, created within a non-profit research organization and now an independent public benefit corporation, allows for automatic data extraction from hundreds of papers at once, creating a data-extraction table in minutes. All of the information presented in this customizable data-extraction table is backed by supporting quotes from the underlying papers (elicit.com). Combined with further mechanisms to reduce hallucinations (George & Stuhlmueller, 2023), this should—at least in theory—lead to data-extraction results similar to human reviewers, not only for simple but also for more complex data.
Furthermore, the current study does not test the AI tool retrospectively against a well-established “ground truth” but uses the tool “in vivo” while developing a systematic review manuscript. Therefore, we were able to not only test the tool’s overall accuracy, that is, its ability to autonomously extract information, but also test whether it could serve as an independent checker for the inevitable human errors, that is, as an alternative to a second human reviewer. The current study thus follows Cochrane’s call to not only rigorously evaluate tools for automated data extraction but also investigate how those tools might fit into existing workflows (Higgins et al., 2024, Chapter 5.5.9).
Method
The current research on the effectiveness of Elicit in data extraction was integrated into a systematic review which examines psychological factors such as shame and disgust in the onset and progression of skin diseases, as well as the effectiveness of mindfulness-based and compassion-based therapies in this context (Fink-Lamotte et al., 2025). The host review follows the PRISMA guidelines to ensure methodological rigor and transparency. While conceptually akin to a “study within a review” (SWAR), our evaluation was not pre-specified or prospectively registered (see Devane et al., 2022). The final dataset comprises 46 studies that met the eligibility criteria and form the empirical basis for our analysis.
Data Extraction Process
A structured data-extraction table was created for the 46 included studies. It comprised the following seven predefined columns: Study Design, Population, Country/Region, Participant Count, Constructs of Interest, Measures, and Main Results.
Three of the authors of the systematic review independently extracted data from approximately one-third of the studies, yielding a manually curated data-extraction table with 322 data points (46 studies × 7 columns). No specialized systematic review software was used; instead, data were organized and extracted using Excel, following a predefined coding scheme. Prior to extraction, the three authors discussed and agreed upon the coding rules and extraction criteria to ensure consistency, although no formal piloting phase was conducted.
Automated Extraction Using Elicit
To evaluate the performance of the AI tool Elicit, we first randomly selected 3 of the 46 studies and uploaded their full-text PDF files via Elicit’s “upload and extract” feature. This function allows users to privately upload their own papers (which remain inaccessible to others) and specify which information should be extracted. Using Elicit’s “high accuracy mode” (available during Autumn 2024), we then prompted the tool to generate a data-extraction table matching the human reviewers’ column structure. At that time, the “high accuracy mode” was an optional feature designed to enhance extraction precision by allocating additional computational resources. It has since become Elicit’s default setting for all users. During this initial trial, we observed that Elicit’s “Main Results” column contained less information than its manually extracted counterpart. Therefore, an additional column, “Interpretation,” was added to capture authors’ explanatory statements. Together, these two columns were later treated as a single “Main Results” entry for comparison purposes.
Prompts Used for Each Elicit Column
Note. Elicit columns “Main Results” and “Interpretation” were merged before comparison with the manually extracted column “Main Results.”
Comparison and Validation Procedure
This setup mirrors the dual-reviewer model commonly used in systematic reviews (e.g., Büchter et al., 2021; Mathes et al., 2017), with the human-generated table acting as Reviewer 1 and Elicit functioning as Reviewer 2. The three initial pilot studies served as a training phase for refining prompts, analogous to a calibration exercise between reviewers. Therefore, all further analyses were carried out on the remaining 43 studies (see Thomas et al., 2025).
Subsequently, all 602 data points (43 studies × 7 columns × 2 extraction tables) were independently validated by three authors of the systematic review, using the 43 full papers (the 46 full papers used in the systematic review minus the 3 papers used in the calibration phase).
Elicit’s extractions were exported and compared to the manually created Reviewer 1 tables in a structured, paper-by-paper format (see Supplemental Material, Table Sheet Elicit_Table). As the three evaluators were also co-authors of the original systematic review and had contributed to the initial data extraction, a blinding procedure was not applied; it was neither feasible nor methodologically meaningful in this context. The three authors independently determined whether the corresponding data points in the two extraction tables (one created manually by the human “reviewer 1” and one created automatically by Elicit as “reviewer 2”) conveyed the same information, and each data point was then labeled as either “correct” or “incorrect.” Afterwards, a consensus decision among the three authors of the original review was reached for all data points. These consensus decisions, based on three independent reviewers, are considered the “gold standard” or “ground truth” against which the human “reviewer 1” and the artificial “reviewer 2” are compared.
For instance, for the “Measures” column, for one study, the human rater extracted “DLQI,” whereas Elicit extracted “Dermatology Life Quality Index (DLQI),” which was rated as conveying equivalent information. In the same column, for another study, the human rater extracted “Original,” whereas Elicit extracted “Original questionnaire with 9 half-open tasks, scale from 1 to 10 for categorizing diseases by level of shame, and percentage rating of shame for own disease compared to the most embarrassing disease,” which was rated as diverging in informational content. However, in both examples, both the human and the automated extraction were labeled as “correct.” For a third study, the human rater extracted “QASD,” whereas Elicit extracted “Questionnaire for the Assessment of Self-Disgust (QASD), Perceived Stigmatization Questionnaire (PSQ), Brief Symptom Inventory (BSI),” which was again rated as diverging in informational content. In this example, Elicit’s extraction was labeled as “correct,” whereas the human rater’s extraction was labeled as “incorrect,” since it was incomplete and did not provide all relevant measures used in this study.
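To illustrate how such consensus judgements translate into the accuracy and agreement figures reported below, the following minimal Python sketch tabulates them from a hypothetical wide-format spreadsheet; the file and column names are illustrative only and do not correspond to the structure of our supplemental table.

```python
import pandas as pd

# Hypothetical consensus table: one row per study x extraction column, with
# binary labels for whether the human entry and the Elicit entry were judged
# correct, and whether both conveyed the same information.
labels = pd.read_excel("consensus_labels.xlsx")
# expected columns: study_id, column, human_correct, elicit_correct, same_info

# Per-column accuracy of each "reviewer" against the consensus ground truth.
accuracy = labels.groupby("column")[["human_correct", "elicit_correct"]].mean()

# Share of data points where both reviewers extracted the same information,
# and accuracy within that agreeing subset.
agreement_rate = labels["same_info"].mean()
accuracy_given_agreement = labels.loc[labels["same_info"] == 1, "elicit_correct"].mean()

print(accuracy.round(3))
print(round(agreement_rate, 3), round(accuracy_given_agreement, 3))
```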
Results
Extraction of the Same Information
Comparison of Extraction Results
Note. “Same information extracted” indicates whether the three human reviewers judged the information extracted by the human rater and by Elicit to convey equivalent content. “Accuracy” indicates whether the reviewers rated the extracted information as correct. “McNemar test” presents the uncorrected test results comparing the accuracy of human and Elicit extractions. “Description of Elicit’s extraction errors” summarizes the main types of errors made by Elicit in each category.
Comparison of Human and Elicit Accuracy
Table 2 also shows the data-extraction accuracy of both the human “reviewer 1” and Elicit as “reviewer 2,” each compared against the ground truth described above. On average, the human reviewer achieved an accuracy of 86.7%, whereas Elicit achieved 81.4%; McNemar tests indicate that these accuracies do not differ significantly. Although one comparison (“Measures”) yielded a nominally significant result (p = .016), this effect did not survive correction for multiple comparisons. Specifically, after applying the Holm–Bonferroni method (Holm, 1979) to control the family-wise error rate, no p-values remained significant. Similarly, applying the Benjamini–Hochberg procedure (Benjamini & Hochberg, 1995) to control the false discovery rate also yielded no statistically significant findings. The observed effect should therefore be interpreted with caution, as it may reflect a chance finding due to multiple testing. Furthermore, both a paired t-test (t(6) = 1.42, p = .206) and a sign test (S = 4, p = .688) indicate no difference between the accuracies of the human reviewer and Elicit.
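A minimal sketch of these comparisons is given below, assuming per-data-point correctness labels in the same hypothetical spreadsheet format as above; the choice of the exact (rather than asymptotic) McNemar test is an implementation detail shown only for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats
from statsmodels.stats.contingency_tables import mcnemar
from statsmodels.stats.multitest import multipletests

labels = pd.read_excel("consensus_labels.xlsx")  # hypothetical file, as above

pvals, human_acc, elicit_acc = [], [], []
for col, grp in labels.groupby("column"):
    h = grp["human_correct"].astype(bool).to_numpy()
    e = grp["elicit_correct"].astype(bool).to_numpy()
    # 2x2 table of paired outcomes: both correct, only human, only Elicit, neither
    table = [[np.sum(h & e), np.sum(h & ~e)],
             [np.sum(~h & e), np.sum(~h & ~e)]]
    pvals.append(mcnemar(table, exact=True).pvalue)
    human_acc.append(h.mean())
    elicit_acc.append(e.mean())

# Family-wise error and false-discovery-rate corrections over the seven tests
reject_holm, p_holm, _, _ = multipletests(pvals, method="holm")
reject_bh, p_bh, _, _ = multipletests(pvals, method="fdr_bh")

# Column-level comparison of mean accuracies: paired t-test and sign test
t_stat, t_p = stats.ttest_rel(human_acc, elicit_acc)
diff = np.array(human_acc) - np.array(elicit_acc)
wins, n_nonzero = int(np.sum(diff > 0)), int(np.sum(diff != 0))
sign_p = stats.binomtest(wins, n=n_nonzero, p=0.5).pvalue  # ties excluded
```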
Differences in Accuracy Between Columns
To examine whether Elicit’s accuracy varied across different columns, we conducted Cochran’s Q test on the binary accuracy data across all seven columns. The test was statistically significant, Q(6) = 46.94, p < .001, indicating that accuracy differed for at least some columns. To identify where these differences occurred, we performed pairwise McNemar tests between all column combinations, with Holm–Bonferroni correction for multiple comparisons. Consistent with the descriptive statistics shown in Table 2, accuracy for Constructs of Interest was significantly lower than accuracy for Study Design (padj = .001), Population (padj = .001), Participant Count (padj = .044), and Country/Region (padj = .001). In addition, accuracy in Population significantly differed from accuracy in Measures (padj = .044).
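The column-wise analysis can be sketched analogously, again under the assumption of the hypothetical label spreadsheet introduced above.

```python
import numpy as np
import pandas as pd
from itertools import combinations
from statsmodels.stats.contingency_tables import cochrans_q, mcnemar
from statsmodels.stats.multitest import multipletests

labels = pd.read_excel("consensus_labels.xlsx")  # hypothetical file, as above

# Reshape Elicit's binary correctness labels into a studies x columns matrix
wide = labels.pivot(index="study_id", columns="column", values="elicit_correct")

# Cochran's Q test: do the seven columns differ in accuracy?
q_res = cochrans_q(wide.to_numpy())
print(q_res.statistic, q_res.pvalue)

# Pairwise McNemar tests between columns, Holm-corrected
pairs = list(combinations(wide.columns, 2))
pvals = []
for a, b in pairs:
    x, y = wide[a].astype(bool), wide[b].astype(bool)
    table = [[np.sum(x & y), np.sum(x & ~y)],
             [np.sum(~x & y), np.sum(~x & ~y)]]
    pvals.append(mcnemar(table, exact=True).pvalue)
reject, p_adj, _, _ = multipletests(pvals, method="holm")
```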
Differences in Accuracy Between Empirical Primary Studies and Other Article Types
As described in column “Description of Elicit’s extraction errors” of Table 2, Elicit occasionally encountered difficulties with articles that are not empirical primary studies (i.e., theoretical articles and [systematic] reviews). To substantiate this observation, an independent-samples t-test was conducted to examine whether articles classified as empirical primary studies (n = 30) differed from non-empirical articles (n = 13) in their total score (the sum of correct extractions over all seven columns). Results revealed a significant difference between the groups, t(20) = 3.86, p < .001, with empirical articles (M = 6.07, SD = 0.87) scoring higher on average than non-empirical articles (M = 4.85, SD = 0.99). The effect size was large, d = 1.35, 95% CI [0.62, 2.08], indicating a substantial difference between the two types of articles.
To verify the robustness of this finding, a nonparametric Wilcoxon rank-sum test was also performed, yielding a similar result (W = 314, p = .001), with a large effect size of r = .50.
Taken together, both analyses indicate a clear and practically meaningful difference: on average, empirical primary studies exhibited approximately one additional correct extraction (about six versus five out of seven possible) compared with non-empirical articles.
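The following sketch shows one way to reproduce this type of comparison. The reported degrees of freedom suggest an unequal-variances (Welch) t-test, but this, like the file names, is an assumption made for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

labels = pd.read_excel("consensus_labels.xlsx")   # hypothetical file, as above
meta = pd.read_excel("article_types.xlsx")        # hypothetical: study_id, is_empirical (0/1)

# Total score per article: number of correct Elicit extractions across the seven columns
totals = labels.groupby("study_id")["elicit_correct"].sum().rename("total")
totals = totals.to_frame().join(meta.set_index("study_id"))

empirical = totals.loc[totals["is_empirical"] == 1, "total"]
other = totals.loc[totals["is_empirical"] == 0, "total"]

# Welch's t-test (unequal variances) and Wilcoxon rank-sum test
t_stat, t_p = stats.ttest_ind(empirical, other, equal_var=False)
w_stat, w_p = stats.ranksums(empirical, other)

# Cohen's d with pooled standard deviation
n1, n2 = len(empirical), len(other)
sp = np.sqrt(((n1 - 1) * empirical.var(ddof=1) + (n2 - 1) * other.var(ddof=1)) / (n1 + n2 - 2))
d = (empirical.mean() - other.mean()) / sp
```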
Discussion
Key Findings and Their Implications
This proof-of-concept study evaluated the performance of Elicit, an AI-powered data-extraction tool, within the context of a systematic review in the field of psychodermatology. Our results demonstrate that Elicit achieved an overall extraction accuracy of 81.4%, closely approaching that of human reviewers (86.7%). Crucially, in cases where both Elicit and a human reviewer extracted the same information, this information was correct in 100% of instances—suggesting that agreement between human and machine may serve as a reliable proxy for validity.
These findings suggest that Elicit is capable of supporting, and in many cases approximating, human-level performance in data-extraction tasks (e.g., Tang et al., 2023)—particularly in factual categories such as Study Design and Participant Count. In contrast, its performance declined in more interpretive categories, such as Constructs of Interest, mirroring limitations observed in other evaluations of large language models (Khraisha et al., 2024; Sun et al., 2024). Interestingly, human reviewers faced similar difficulties in extracting information for the category Constructs of Interest, achieving an accuracy of only 55.8% compared to Elicit’s 51.2%. This nonsignificant difference suggests that the challenge may not lie solely in the tool’s capabilities but also in the inherent complexity of the task itself. In the social sciences, research frequently addresses nuanced, multidimensional, and context-dependent constructs that are often less precisely defined or inconsistently labeled across studies (see Belur et al., 2021; Curran et al., 2007 for related arguments). Such heterogeneity complicates the extraction of consistent information, even for trained human coders. Consequently, systematic reviews in these fields face the dual challenge of synthesizing diverse and fragmented literature while maintaining reliability and comprehensiveness (Curran et al., 2007; Davis et al., 2014)—with or without support by technical tools.
Towards a Hybrid Model of Data Extraction
Rather than replacing human reviewers, Elicit appears best suited as a complementary tool within systematic review workflows. Its greatest potential lies not in full automation but in strategic augmentation: reducing the manual workload, flagging inconsistencies, and enabling scalable quality control. This hybrid model aligns with current best practices that recommend dual independent data extraction (Higgins et al., 2024; Institute of Medicine, 2011), while simultaneously addressing the resource constraints that many research teams face (Bennett et al., 2015; Oliver et al., 2015).
A Refined Workflow Proposal
Building on our findings, we propose a pragmatic workflow that integrates Elicit as a semi-autonomous second reviewer:
1. Initial manual extraction: A single human reviewer manually extracts the data from all eligible full texts and constructs a data-extraction table.
2. Prompt calibration: The human reviewer randomly selects a small sample of these full texts (3–5, depending on the overall number of full texts), uploads them to Elicit, extracts the same columns as in the manual data-extraction table, compares the results of the two tables for this small sample, and modifies the prompts until the reviewer is satisfied with Elicit’s data-extraction table.
3. Automated extraction: The human reviewer uploads all remaining full texts to Elicit and applies the identical prompts to obtain Elicit’s full data-extraction table.
4. Discrepancy detection: After completion of both the manual and the AI-based data extraction, the two resulting tables are compared by either the initial human reviewer or an independent human reviewer who was not involved in the initial extraction. All data points in which the human reviewer and Elicit did not extract the same information are flagged for further review (a sketch of this comparison follows the list).
5. Targeted reconciliation: The flagged discrepancies are then reviewed and adjudicated by the person conducting the discrepancy detection. This process mirrors established arbitration procedures commonly used to resolve disagreements in systematic reviews (e.g., Higgins et al., 2024). The finalized data-extraction table incorporates these adjudicated decisions.
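To make step 4 concrete, the sketch below compares two hypothetical extraction exports cell by cell and writes flagged discrepancies to a reconciliation file. File names, the export layout, and the simple string-matching equivalence check are placeholders; in the proposed workflow, the judgement of whether two entries convey the same information is made by a human reviewer.

```python
import pandas as pd

# Hypothetical exports: one row per study, one column per extraction field,
# with a shared "study_id" key (layout assumed, not Elicit's actual export format).
human = pd.read_excel("human_extraction.xlsx").set_index("study_id")
elicit = pd.read_excel("elicit_extraction.xlsx").set_index("study_id")

fields = ["Study Design", "Population", "Country/Region", "Participant Count",
          "Constructs of Interest", "Measures", "Main Results"]

def same_information(a, b) -> bool:
    # Placeholder equivalence check: exact match after trivial normalization.
    # In practice this judgement is made by a human reviewer.
    return str(a).strip().lower() == str(b).strip().lower()

# Flag every cell where the two extractions do not convey the same information.
flags = []
for study_id in human.index.intersection(elicit.index):
    for field in fields:
        if not same_information(human.at[study_id, field], elicit.at[study_id, field]):
            flags.append({"study_id": study_id, "field": field,
                          "human": human.at[study_id, field],
                          "elicit": elicit.at[study_id, field]})

pd.DataFrame(flags).to_excel("flagged_discrepancies.xlsx", index=False)
```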
This workflow simulates a “dual-extraction” model while requiring fewer resources than the “single extraction with verification” strategy (e.g., Buscemi et al., 2006; Li et al., 2019). In our case, only 48% of data points required manual verification. The remaining 52%—where human and AI agreed—could be accepted with high confidence, thereby conserving resources without compromising data integrity. Such efficiency gains are especially valuable in underfunded research settings or time-sensitive projects such as policy reviews or public health interventions.
Strengths and Limitations
A major strength of our study lies in its ecological validity. By embedding the evaluation of Elicit into an active systematic review, we assessed the tool under authentic review conditions. Furthermore, the use of consensus-based ground truth ratings by three independent reviewers adds robustness to our accuracy estimates.
Importantly, the proposed workflow leverages the efficiency of automation while maintaining methodological rigor; at the same time, it prioritizes human oversight, ensuring verification and accountability throughout the process (Alshami et al., 2023; Huang & Tan, 2023; Kohandel Gargari et al., 2023; Scherbakov et al., 2025; van Dis et al., 2023).
Nonetheless, several limitations should be noted. First, our findings are restricted to a single domain—psychological mechanisms in dermatological conditions—and may not generalize to other fields with different data structures or conceptual frameworks. Second, while Elicit provides source references for extracted data, its performance remains limited in tasks that require nuanced interpretation, abstraction, or inferential reasoning—particularly in relation to complex psychological constructs. This mirrors broader concerns in the literature about the interpretive limits of current LLM-based tools (Gue et al., 2024; Tao et al., 2024).
Our results are also consistent with the cautious stance taken by Clark et al. (2025), who argue that LLMs are not yet suitable for autonomous use in systematic reviews. They emphasize the need for further empirical research—especially in relation to specific review tasks such as data extraction. Our findings contribute to this emerging field and underscore the value of domain-specific tools like Elicit to complement—but not replace—human expertise.
Future Directions
Future work should extend this line of inquiry in several ways. First, evaluating Elicit across multiple disciplines and review types (e.g., intervention reviews, qualitative syntheses, and scoping reviews) would help clarify its generalizability. Second, head-to-head comparisons with other AI-based tools, including proprietary and open-source alternatives, could contextualize Elicit’s relative performance. Third, studies should systematically assess not only extraction accuracy but also usability, time efficiency, and trustworthiness from the perspective of end-users.
In addition, future research should further examine whether extraction performance varies systematically by study type. In the present data, Elicit appeared to perform better on primary empirical studies than on theoretical papers or reviews—a post-hoc observation that warrants cautious interpretation. This difference could stem from the more standardized reporting structure of primary studies, from limitations in our prompting design, or from a combination of both. Systematically evaluating the interaction between study type, prompt formulation, and extraction performance would provide valuable insights into how LLM-based tools can be optimized for different forms of scientific writing.
Building on these findings, future research should also focus on improving extraction accuracy for conceptually complex or overlapping constructs. In our study, this was the category that posed the greatest challenge for both human and AI reviewers. The difficulty likely reflects not only technical limitations but also conceptual ambiguity: constructs in social sciences often overlap or are nested within each other. For instance, shame and self-stigma were both defined as distinct constructs of interest in the underlying systematic review, yet individual studies sometimes treat them as conceptually intertwined. In Ginsburg and Link (1993), for example, shame is described as one of several subdimensions of stigma-related feelings rather than as an independent construct. Such definitional overlap creates genuine ambiguity—not only for AI models attempting automated extraction but also for trained human reviewers applying predefined coding schemes. To address this, future work could experiment with adding an explicit layer of conceptual structure to the extraction process—for example, by using controlled vocabularies (predefined lists of accepted constructs and their synonyms), construct hierarchies (taxonomies clarifying how constructs relate to one another, such as whether shame is a subdimension of self-stigma), or mapping schemes (tables linking constructs to the specific measures or operationalizations used across studies; see also Kartchner et al., 2023; Gartlehner et al., 2024; Susnjak, 2023). Likewise, prompt templates that ask the model to differentiate between constructs that are measured versus merely mentioned could reduce overinclusive classifications (see Table 2). Taken together, such refinements might help AI-assisted and human coders alike handle complex conceptual categories more consistently.
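As a simple illustration of what such a conceptual layer might look like, the sketch below encodes a small controlled vocabulary with synonyms and an optional parent construct, and maps freely extracted labels onto it. The constructs and synonyms shown are examples only and do not reproduce the coding scheme of our review.

```python
# Illustrative controlled vocabulary: accepted constructs, their synonyms, and an
# optional parent construct (e.g., to record that a label is treated as a
# subdimension of another construct). Entries are examples only.
CONSTRUCTS = {
    "shame": {"synonyms": ["body shame", "shame proneness"], "parent": None},
    "self-stigma": {"synonyms": ["internalized stigma", "stigma-related feelings"], "parent": None},
    "disgust": {"synonyms": ["self-disgust"], "parent": None},
}

def normalize_construct(extracted: str):
    """Map a freely extracted construct label onto the controlled vocabulary."""
    label = extracted.strip().lower()
    for canonical, entry in CONSTRUCTS.items():
        if label == canonical or label in entry["synonyms"]:
            return canonical
    return None  # unmapped labels are flagged for manual review

print(normalize_construct("Internalized stigma"))  # -> "self-stigma"
```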
Conclusion
Elicit demonstrates promising accuracy and consistency as a tool for semi-automated data extraction in systematic reviews. Although it does not yet outperform human reviewers, it substantially reduces human workload while maintaining high data quality—especially when integrated into a structured hybrid workflow. As scientific output continues to accelerate and systematic reviews grow more resource-intensive, tools like Elicit can play a pivotal role in ensuring that rigorous review methodologies remain scalable, inclusive, and timely.
Author Note
The authors maintain a paid subscription to Elicit, which they fund personally. During the period in which the study was conducted, data-extraction tables generated by Elicit were obtained using a credit-based system (“tokens”). These tokens were provided free of charge to the authors by Elicit. No further interactions or exchanges took place between Elicit and the authors. In particular, Elicit did not influence the study design, nor was the company given access to any draft or preprint of this manuscript.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
The data that support the findings of this study are available as electronic supplementary material alongside this article.
Supplemental Material
Supplemental material for this article is available online.
