Abstract
Aim:
Systematic reviewing is a time-consuming process that can be aided by artificial intelligence (AI). Several AI options exist to assist with title/abstract screening; however, options for full text screening are limited. The objective of this study was to evaluate the performance of a custom generative pretrained transformer (cGPT) for full text screening.
Methods:
A cGPT powered by OpenAI’s ChatGPT4o was tested with subsets of articles assessed in duplicate by human reviewers. Outputs from the testing subset were coded to simulate cGPT as an autonomous and an assistant reviewer. Cohen’s kappa was used to assess interrater agreement.
Results:
For the inclusion/exclusion decision, the human–human kappa scores ranged from 0.87 to 0.96, exceeding the ranges of kappa scores for autonomous cGPT–human pairings (0.59 to 0.67) and assistant cGPT–human pairings (0.62 to 0.72). For exclusion reason classification, the human–human kappa scores ranged from 0.71 to 0.78, exceeding the ranges of kappa scores for autonomous cGPT–human pairings (0.47 to 0.53) and assistant cGPT–human pairings (0.52 to 0.63).
Conclusions:
Agreement between cGPT and human reviewers was lower than human–human agreement for both the inclusion/exclusion decision and exclusion reason classification. cGPT performed best as an assistant to the second reviewer and showed time-saving potential, but further validation is needed before cGPTs can be relied upon for full text screening.
Background
Systematic reviews (SRs) are critical for synthesizing evidence that should form the basis for up-to-date best practice and policy [1]. This type of secondary research starts by broadly identifying all potentially relevant articles before screening for title/abstract relevancy and full text inclusion. The process of independent screening in duplicate and resolution of discrepancies through consensus may translate to hundreds, if not thousands, of human work hours [2]. With the current explosion in artificial intelligence (AI) tools, there is now a high expectation for improved cost-efficiency.
Various AI tools are already present in SR software, developed to assist in title/abstract and full text screening [3]. Commonly, these tools provide predictions of article relevancy; however, they do not determine inclusion status. Nevertheless, their training parameters may serve as a guide when creating conceptually novel AI models. The training material required to initiate relevancy predictions in these models ranges from 50 included and 50 excluded articles [4] to a single article [5]. In the case of Nested Knowledge, determinations of inclusion or exclusion at the full text stage can be made after human reviewers have categorized 40 excluded and 10 included articles [6].
It is generally agreed that the current capabilities of AI tools and models cannot yet fully replace human reviewers for either title/abstract or full text screening [7–10]. It has been reported that the reliability of text mining AI tools is better for intervention studies than for complex health service topics [7,9,11,12].
Typical performance metrics of AI in screening tasks are reported in Table I [7,8,10–14]. Additionally, interrater reliability between human and AI decisions can be evaluated using Cohen’s kappa [15]. The benefit of kappa is its adjustment for random agreement [16]. AI tools used in screening for reviews tend to have high specificity but lower (<90%) sensitivity [11,14,15,17,18].
Test performance metrics with mathematical definitions for binary outcomes.
TP (true positive): number of articles correctly included; TN (true negative): number of articles correctly excluded; FP (false positive): number of articles incorrectly included; FN (false negative): number of articles incorrectly excluded.
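To make these definitions concrete, the following sketch computes the classification metrics reported later (sensitivity, specificity, false negative rate, precision, and F1) from the four counts above; the function name and counts are hypothetical and given for illustration only.

```r
# Minimal sketch (hypothetical counts): metrics for a binary include/exclude decision,
# following the definitions in Table I
classification_metrics <- function(TP, TN, FP, FN) {
  list(
    sensitivity         = TP / (TP + FN),   # correctly included / all truly includable
    specificity         = TN / (TN + FP),   # correctly excluded / all truly excludable
    false_negative_rate = FN / (TP + FN),   # 1 - sensitivity
    precision           = TP / (TP + FP),   # correctly included / all flagged for inclusion
    f1                  = 2 * TP / (2 * TP + FP + FN)  # harmonic mean of precision and sensitivity
  )
}

classification_metrics(TP = 50, TN = 150, FP = 20, FN = 10)  # invented counts
```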
One promising AI model to aid in systematic reviewing is ChatGPT—a large language model (LLM) with the ability to analyze and generate human-like text [19]. A retrospective study of several different LLMs’ ability to identify inclusion criteria in title/abstract screening reported ChatGPT4o to be the model with the best combined sensitivity and specificity [20]. The major limitations identified for ChatGPT (versions 3.5 and 4o) are errors in data extraction, limited recall, and incomplete responses [8,13]. Insufficient training may have led to poor observed agreement [15].
One must consider these limitations within the context of how ChatGPT is used. As of November 2023, it has been possible to create a custom generative pretrained transformer (cGPT) that tailors responses based on input instructions and uploaded files. These cGPTs will not “forget” instructions in the same way as a non-custom ChatGPT model, which begins each new conversation without background information on the question’s context [21].
Aims
The purpose of this study was to evaluate the performance and time-saving potential of a tailor-made cGPT in full text screening within an SR. The work was motivated by the need to assess more than a thousand articles in full text.
Methods
The articles used in this study were selected from an ongoing SR that began in 2023. Briefly, the SR aims to describe the global meta-mean and dispersion of its outcomes (population-level habitual 24-h urine volume and population-level habitual 24-h creatinine excretion) identified in primary studies of adult populations. The outcome variables mainly appear as intermediary outputs in the includable articles. This requires the cGPT to identify defined concepts rather than just the primary studies’ main outcome(s) [22].
The SR’s corpus is screened in duplicate at the title/abstract and full text stages. The articles selected were published between 2020 and 2024 and consisted of a prompt engineering set and a testing set. We use the phrase “prompt engineering” throughout this article since the AI model has already been “trained” by OpenAI and the process of prompt engineering is meant to align the pretrained model’s outputs with the study’s outcomes of interest. Articles were coded into their inclusion status (“included” as 1, “excluded” as 0) and exclusion reason (coded as 1 to 9). In the event of an included article, the exclusion reason was coded “10” (Table II).
Distribution of the different categories of exclusion in the testing set from the human–human consensus.
Reviews were investigated for additional studies in full text before exclusion.
Exclusion reason No. 2 was effectively filtered out during title/abstract screening and was not present at the full text stage.
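As a minimal illustration of the status and exclusion reason coding described above, the following sketch codes a few hypothetical articles; the variable names and values are invented for illustration and are not part of the SR dataset.

```r
# Hypothetical coding of three screening decisions: status is binary and exclusion
# reason is a 10-level category (reasons 1-9 for exclusions; 10 is reserved for inclusions)
decisions <- data.frame(
  article_id       = c(101, 102, 103),
  status           = c(1, 0, 0),                        # 1 = included, 0 = excluded
  exclusion_reason = factor(c(10, 3, 6), levels = 1:10)
)
decisions
```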
The researcher responsible for uploading, recording, and interpreting responses from cGPT is called the cGPT operator. cGPT’s performance was evaluated in two screening roles: (1) autonomous reviewer, where cGPT’s analysis of inputted articles was coded without any interpretation by the operator; and (2) assistant reviewer, where the cGPT analysis was interpreted by the operator who decided to code the article in agreement with cGPT or not. cGPT’s ability as a reviewer in these roles was quantified and described as interrater agreement (Cohen’s kappa), observed agreement, classification performance metrics, and time savings. All screening took place within the Covidence environment, a platform for conducting screening and data extraction in reviews, between June and September 2024.
Aggregation of prompt engineering and testing data
Beginning in the full text screening stage, four human reviewers (EH, RD, SL, and KG) kept a log of article ID number, status, exclusion reason, study design, and field of study. Once a total of 250 articles were classified and cataloged in duplicate by reviewers, a subset of 40 excluded and 10 included articles was selected as the prompt engineering set. The articles purposefully encompassed all possible study designs, fields, and exclusion reasons to expose cGPT to the greatest variety of scenarios. The 200 test articles were augmented with an additional 38, representing all articles screened in duplicate by human reviewers at that point in time. The human–human consensus status of the testing set was 63 included and 175 excluded. Exclusion reasons for the testing set can be found in Table II.
Prompt engineering
The first version of our cGPT was created on August 23, 2024, using the paid feature on OpenAI’s ChatGPT platform (San Francisco, CA, USA). The cGPT was powered by ChatGPT4o. Initially, cGPT received abridged versions of the protocol and relevant supplements, with a short description included in its “Instructions” text field in the “Configuration” tab of the general cGPT environment. cGPT provided incomplete and inconsistent responses when assessing the articles from the prompt engineering set. A more structured prompt–response configuration produced more accurate and complete responses, and an evolving flowchart was developed to aid cGPT in navigating each article to a final decision (Supplemental file 1).
cGPT was instructed to “Navigate this article through the exclusion flowchart.” Its responses provided explanations for its reasoning for excluding or proceeding at each step of the flowchart. Errors in these responses were addressed by creating “patches” in the instructions, explicitly advising cGPT on how to proceed when faced with a certain scenario.
The team trialed differing levels of complexity with the flowchart. A persistent issue was cGPT’s reluctance to identify includable subpopulations within a study when following the flowchart, resulting in the inappropriate exclusion of a study altogether. Therefore, cGPT’s primary prompt was changed to “Navigate each population in this article through the exclusion flowchart individually.” The complete configuration of cGPT can be found in Supplemental file 2.
Testing
Testing ran from September 25 to 30, 2024. The cGPT operator was blind to the consensus status and exclusion reason decisions throughout testing. A new “conversation” with cGPT was opened and the 238 articles were uploaded one at a time using the final prompt, described previously. Conversations are separate windows that initially have no history of dialogue, like sending a new email instead of continuing in a thread. No feedback was provided to cGPT and no other prompts were used for the duration of testing. The cGPT operator coded both cGPT’s raw evaluation of the articles as well as the operator’s interpretation of status and exclusion reason based on cGPT’s evaluation. The latter was supported by notes and tags left by the first human reviewer in the Covidence environment. Notes and tags could include pertinent information about the population, the context of the study, or the study outcomes. In this way, the roles of cGPT as an autonomous reviewer and cGPT as an assistant reviewer were coded in parallel. The testing set was assessed two separate times by cGPT so that kappa statistics could be calculated between cGPT Test 1 and Test 2. Test 1 was completed before beginning Test 2.
Upon testing completion, the operator was unblinded and the decisions of the operator and cGPT were pooled from Test 1 to identify the “best response.” This served as the testing results of cGPT in the assistant role. Best response was defined as selecting which response, either cGPT or the operator, agreed most closely with the consensus decision. The best response was meant to simulate what would happen in a live environment should a discrepancy occur between the operator and cGPT. In the case of such a discrepancy, the operator would select the decision believed to be most accurate and leave a note that the article could be classified in one of two ways. If the selected classification was in disagreement with the other human reviewer, the consensus decision could be expedited by referring to the second possible classification described in the note.
Statistical analysis
Cohen’s kappa and observed agreement were calculated across cGPT in its two roles and the human reviewers. The kappa score assumes no “true” classification in the interrater comparison. Therefore, we calculated additional classification performance metrics using the consensus as the “truth,” in accordance with the definitions found in Table I. These metrics were calculated from the binary variable status.
When performing kappa statistics for exclusion reason classification, both excluded articles with exclusion reasons coded 1 through 9 and included articles with reason coded 10 were included in the analysis. Included articles were kept in the exclusion reason analysis due to the potential for discrepancy in the status decision between cGPT and the human reviewer.
When assessing and interpreting Cohen’s kappa values, the following conventional designations were used: <0.00 signifies poor strength in agreement, 0.00–0.20 slight agreement, 0.21–0.40 fair agreement, 0.41–0.60 moderate agreement, 0.61–0.80 substantial agreement, and 0.81–1.00 almost perfect agreement [16].
R version 4.3.3 for Windows was used for all statistical calculations. Packages used in the analysis included “psych” and “irr.” The observed agreement, kappa statistic confidence intervals (CIs), and classification performance metrics were calculated directly from equations in the R environment. Findings are reported in-text as point estimate (95% CI).
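A minimal sketch of how these quantities can be obtained with the named packages is given below; the two-rater data frame and its values are hypothetical stand-ins for one reviewer pairing, not our actual screening data.

```r
library(irr)    # kappa2(): Cohen's kappa for two raters
library(psych)  # cohen.kappa(): kappa with confidence intervals

# Hypothetical status decisions (1 = include, 0 = exclude) from one reviewer pairing
ratings <- data.frame(
  rater1 = c(1, 0, 0, 1, 0, 1, 0, 0, 1, 0),
  rater2 = c(1, 0, 0, 0, 0, 1, 0, 1, 1, 0)
)

observed_agreement <- mean(ratings$rater1 == ratings$rater2)  # proportion of identical decisions
kappa_point <- kappa2(ratings)        # point estimate, z statistic and p-value
kappa_ci    <- cohen.kappa(ratings)   # point estimate with 95% confidence interval

# Conventional Landis and Koch label for the point estimate
cut(kappa_point$value,
    breaks = c(-Inf, 0, 0.20, 0.40, 0.60, 0.80, 1.00),
    labels = c("poor", "slight", "fair", "moderate", "substantial", "almost perfect"))
```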
Results
Experiential observations
An important observation was that cGPT tended to produce obvious output errors, either truncated responses or failure to respond to the prompt altogether. When this occurred, the operator would reissue the prompt with the article attached. cGPT was more likely to produce an output error after it had already produced one and as conversations grew longer; there did not, however, appear to be a set conversation length that precipitated output errors. Longer conversations also appeared to lengthen response times. After observing these trends, it was determined that upon the first instance of an output error, the most time-efficient response was to begin a new conversation. Test 1 required reinitiating the conversation four times to complete the testing set, while Test 2 required five.
cGPT was able to reliably divide populations within an article when prompted to do so before navigating each population through the exclusion flowchart. Engineering its prompt in this way, coupled with detailed instructions, allowed for simplification of the flowchart. This combination produced responses delivered in a prescribed format, with explanations for its decisions that allowed the cGPT operator to identify errors in its logic. cGPT proceeded down the exclusion reason list only until it identified a reason that rendered the article ineligible. In observational studies, cGPT most often divided populations based on how the study reported its outcomes, and in intervention studies by intervention and control group(s). Supplemental file 3 contains an example of an interaction between the operator and cGPT concerning one article.
cGPT had a tendency to drift in its exclusion reason assessment, randomly favoring one reason over another. The most favored exclusion reasons were No. 3 (non-free-living population) and No. 5 (condition affecting outcome; see Table II). This random favoritism did not carry over when a new conversation was initiated.
cGPT at times exhibited logical fallacies within its responses. The three most common logical errors observed were as follows: a population comprising 30 or more individuals was classified as fewer than 30, resulting in an inappropriate exclusion for reason No. 6 (Table II); identifying that one of the outcomes was reported but excluding for reason No. 8 because the study did not report both outcomes (Table II); and including an article based on its reporting of urine analytes other than creatinine (calcium, iodine, etc.) in the absence of 24-h urine volume reporting, rather than appropriately excluding it for reason No. 8 (Table II).
cGPT reproducibility
The autonomous cGPT’s status decisions between Test 1 and Test 2 had an observed agreement (95% CI) of 0.777 (0.719, 0.829) and a Cohen’s kappa of 0.493 (0.368, 0.618). The autonomous cGPT’s exclusion reason classifications between Test 1 and Test 2 had an observed agreement of 0.622 (0.56, 0.684) and a Cohen’s kappa of 0.495 (0.429, 0.562).
Results of testing
Classification performance metrics are presented using the binary variable status (Table III). This provides a comprehensive overview of how cGPT compares to human reviewers when both are compared against the consensus decision.
Classification performance metrics of status decisions.
A: Consensus (CON) between human–human reviewers; B: cGPT as an autonomous reviewer (Auto), cGPT as an assistant reviewer (Assistant); SENS: sensitivity; SPEC: specificity; FNR: false negative rate; PM: proportion missed; Pr: precision.
All the articles were initially screened by EH. Together, RD, SL, and KG shared the work of the second screening.
Human–human kappa scores are presented, both with their status decision and exclusion reason classification (Table IV). Table IV also contains kappa scores (both status and exclusion reason) of human reviewers paired with cGPT in the role of autonomous reviewer, followed by cGPT as an assistant reviewer.
Observed agreement (OBS AG) and kappa scores of status and exclusion reason decisions between human–human pairings, autonomous cGPT–human pairings, and assistant cGPT–human pairings.
CON: consensus; Auto: cGPT as an autonomous reviewer; Assistant: cGPT as an assistant reviewer.
Time and workload savings
The time required to screen a fixed number of articles was determined by the total number of articles screened divided by the number of articles screened per hour. Time savings was calculated as the difference in the time required to screen the testing set between human reviewers and cGPT. For cGPT, internet quality, type of article, number of populations within the article, and variability in generative speed of responses were unforeseen variables that impacted the number of articles screened per hour. For human reviewers, type of article (including whether the full text could be located), internet quality, and other environmental factors influenced how many articles were screened per hour. Due to this high variability, time savings was described as minimum and maximum. Out of six total observations, the minimum documented number of articles completed by a human reviewer in 1 h was 9 while the maximum was 31. Out of four total observations, the minimum number of articles screened by cGPT was 32 and the maximum was 44. The 2020–2024 time-strata corpus of the present SR contains 1057 full text articles for screening. Considering our minimum and maximum values, this translates to between 34.1 and 117.4 effective human hours and 24 to 33 cGPT hours to conduct a single screening on this time strata corpus. The time savings then can be described as between 10.1 and 84.4 h saved when using cGPT instead of a human reviewer. While the human reviewers can only keep their top screening speed for a few hours at a time, cGPT, by its nature, has only the duration limitations of the operator and the number of prompts permitted hourly by the server.
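The arithmetic behind these estimates can be reproduced directly from the observed screening rates; the sketch below uses only the figures reported above.

```r
# Reproducing the time-savings estimate for the 2020-2024 corpus (1057 full texts)
n_articles     <- 1057
human_per_hour <- c(slowest = 9,  fastest = 31)   # observed human screening rates
cgpt_per_hour  <- c(slowest = 32, fastest = 44)   # observed cGPT/operator screening rates

human_hours <- n_articles / human_per_hour   # 117.4 h (slowest) to 34.1 h (fastest)
cgpt_hours  <- n_articles / cgpt_per_hour    # 33.0 h (slowest) to 24.0 h (fastest)

# Savings as reported: fastest human vs fastest cGPT, and slowest human vs slowest cGPT
min_saving <- human_hours["fastest"] - cgpt_hours["fastest"]  # about 10.1 h
max_saving <- human_hours["slowest"] - cgpt_hours["slowest"]  # about 84.4 h
round(c(min_saving, max_saving), 1)
```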
Discussion
The cGPT in this study scored below 90% sensitivity, with specificity outperforming sensitivity in both reviewer roles, consistent with what has been observed in another study of LLMs [20]. The opposite was observed for human reviewers, who in three out of four instances had higher sensitivity than specificity, meaning, when in doubt, they included the article. Specificity was identical for both cGPT as an autonomous reviewer and as an assistant reviewer. Sensitivity, however, was higher in the assistant role, so the true strength of using cGPT as an assistant rather than as an autonomous reviewer lies in its ability to include more relevant articles, rather than to discard more irrelevant articles. This phenomenon was also reflected by cGPT’s higher F1 score as an assistant than as an autonomous reviewer. The risk of inaccurate screening from cGPT’s lower sensitivity compared to other reviewers was partially mitigated by requiring the screening to be done in duplicate, with one reviewer mandatorily being an unassisted human.
While Nested Knowledge gives determinations on article inclusion or exclusion during full text screening, its validation is reported almost exclusively with binary statistics (recall, precision, and F1), the exception being observed agreement [13,23]. Furthermore, looking at cGPT’s roles as autonomous and assistant reviewer in our study, observed agreement with the consensus status decision, used as the “truth,” was 86.6% and 89.1%, respectively, compared to a human reviewer–consensus agreement range of 95.2% to 99.2%. This means that, compared with the best human–consensus agreement, cGPT would misclassify roughly one additional article for every 10 articles screened. For exclusion reason, observed agreement between cGPT and consensus was 65.1% (autonomous cGPT) and 71.9% (assistant cGPT), whereas human reviewer–consensus exclusion reason agreement ranged from 86.7% to 94.7%. While status agreement was quite similar across all pairings, the same cannot be said for exclusion reason agreement.
A total of 12 possible combinations existed for comparisons between the three paired human reviewers and the four cGPT–human pairings. Regarding the status decision, the Cohen’s kappa 95% CIs overlapped in all 12 combinations for both autonomous and assistant cGPT, suggesting that a difference in status decision performance between human pairings and cGPT pairings could not be established with certainty. Regarding exclusion reason decisions, assistant cGPT 95% CIs overlapped with those of paired human reviewers in 9 of 12 possible Cohen’s kappa combinations, while autonomous cGPT CIs overlapped in 4 of 12 combinations. This creates uncertainty about trends in performance between human pairings and cGPT pairings for the more nuanced exclusion reason decisions.
According to Cohen’s kappa classifications, the strength of human–human agreement was “almost perfect” (0.81–1.00) for status decisions and “substantial” (0.61–0.80) for exclusion reason decisions. The strength of agreement for status decision in autonomous cGPT and human pairings was “moderate” (0.41–0.60) to “substantial,” and assistant cGPT and human pairings was “substantial.” The strength of agreement for exclusion reason was “moderate” for pairings with both autonomous and assistant cGPT. Although these designations are arbitrary, they can provide useful performance benchmarks. In this case, they signify that human–human pairings had greater strength in agreement in all areas compared to cGPT–human pairings.
When comparing interrater agreement across 10 alternative response categories, Cohen’s kappa is necessary because it takes random chance agreement into account. There will always be a risk of bias when using a human decision as the “truth” against which a screening tool is compared. While this bias is partially mitigated by using the consensus decision, there nevertheless remain borderline decisions that could be interpreted in one of two ways. The kappa score captures this uncertainty within its metrics, while observed agreement does not. For this reason, a threshold for acceptable use of a cGPT assistant may be most appropriately determined using Cohen’s kappa rather than observed agreement.
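As a toy illustration of why chance correction matters when ten response categories are unevenly distributed, the sketch below generates two independent sets of ratings dominated by a single category: observed agreement is substantial by chance alone, while Cohen’s kappa remains near zero. The category probabilities are invented for illustration only.

```r
library(irr)
set.seed(1)

# Hypothetical distribution over the 10 exclusion reason categories, dominated by one reason
p <- c(0.55, rep(0.05, 9))
rater1 <- sample(1:10, size = 500, replace = TRUE, prob = p)
rater2 <- sample(1:10, size = 500, replace = TRUE, prob = p)  # generated independently of rater1

mean(rater1 == rater2)                    # observed agreement is ~0.3 purely by chance
kappa2(data.frame(rater1, rater2))$value  # Cohen's kappa stays near zero, as it should
```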
The study of LLMs like ChatGPT in full text screening remains largely uncharted territory, with few comparable studies available outside of preprints. Further validation studies need to be conducted so that results such as those presented here can be compared and a threshold for acceptable use established.
Limitations and future directions
cGPT was observed to randomly favor one exclusion reason over others. This bias is addressed in part by using cGPT as an assistant rather than as an autonomous reviewer; however, since cGPT responses terminate when an exclusion reason is reached, the operator would be required to ask follow-up questions to discern the true exclusion reason. Since starting a new conversation with cGPT makes it “forget” this favoritism, a potential solution may be to initiate a new conversation at fixed intervals.
The logical fallacies occasionally reported by cGPT most likely contributed to the performance gap between cGPT as an autonomous reviewer and cGPT as an assistant reviewer. It is entirely possible that the newly released ChatGPT5.1, with its stronger complex reasoning, will not make such mistakes. There is value in rerunning the test set with this novel model, using the present study as a benchmark.
Despite efforts to encourage reproducibility in its configuration, cGPT had a poorer kappa score with itself between Test 1 and Test 2 of the testing set than it did with consensus in either the assistant reviewer or autonomous reviewer role. Care must be taken by researchers who choose to employ a cGPT for any purpose due to this inconsistency in its responses. It has been put forward that perhaps using a “majority vote” with five different LLMs evaluating articles together may produce improved sensitivity scores and could address the issue of inconsistent responses [20]. Alternatively, it is possible that this issue may be abated by prompting cGPT with the same article five or more times and accepting the majority vote for classification. The present study was not streamlined to such a degree that requesting five iterations of analysis on a single article was feasible, nor were there sufficient resources to produce five different custom LLMs catered to the project. These may be opportunities to explore in the future.
cGPT’s testing as an assistant reviewer occurred after all 238 articles had been screened for the first time by human reviewers. The operator had the opportunity to read the notes and tags left behind by the first human screener to confirm the scoring decisions guided by cGPT. Access to this information certainly influenced the operator’s decision to agree with cGPT or not, and therefore the performance of cGPT as an assistant reported here is dependent on this order of operations. Hence, the present study can only provide assistant cGPT benchmarks when the operator is placed in the second role. Due to the suboptimal performance of the assistant cGPT compared to human reviewers even when first reviewer information was available, a team looking to employ a cGPT-assisted reviewer would likely have the most advantageous results by also placing this individual in the role of the second screener.
Since screening is done in duplicate, the potential time savings cover only half of the workload of full text screening. Even when the fastest examples of human reviewing were compared to the slowest example of cGPT/operator reviewing, cGPT still outperformed human reviewers’ screening speed. Furthermore, resolution of cGPT-generated discrepancies was first investigated by the cGPT operator. If the operator was in agreement with the unassisted human reviewer, resolution was reached by the operator without consultation. In practice, this saves time for the human reviewer, who is not obliged to perform consensus on any but the most nuanced cases of disagreement; however, this was not evaluated quantitatively.
Even with the potential to save human work hours, time savings in vivo will depend on several factors including internet speed, availability and content of articles, and the unpredictability of cGPT-generated responses as described throughout this article. It is now possible to use an application programming interface with cGPTs that allows one to pull in articles from a cloud server, thus negating the need for an operator to manually upload one article at a time into the cGPT conversation and allowing for greater time-saving potential [20,24]. The purpose of the present study was both to evaluate the legitimacy of cGPT as a full text screener as well as describe the time-saving potential of the method used. While its performance is evaluated in detail here, further exploration into optimizing cGPT’s time-saving potential is needed.
The outcomes of interest in our SR are mostly intermediary steps in the studies being examined. If the SR’s and the studies’ outcomes were aligned, as is often the case, cGPT may have achieved a kappa score closer to those of the human–human reviewer pairs.
The method used to develop the cGPT was designed to be one that researchers who do not have a significant computer science background could reproduce or mimic. Indeed, creating a cGPT using a flow diagram similar to that employed in the present article has recently been explored in one health education study [25]. The use of a flow diagram can be valuable in the screening stage of any SR that ranks their exclusion reasons, as has been done here. Further research is needed to determine how best to configure cGPTs for the purpose of full text screening. Additionally, flow diagrams create an opportunity to methodologically constrain cGPT responses, opening the possibility for these to be used in other research domains.
The present SR is a considerably large study, with data from 1990 to 2024 and thousands of articles requiring full text screening. It was feasible to set aside several hundred articles for creating a custom AI model that could then be employed on the remaining articles. Many SRs will not have screening needs as demanding as those presented here, so there is likely a minimum SR size above which the benefits of using a cGPT outweigh the initial time investment.
Using LLMs like ChatGPT in this way does not come without cost. The environmental impact of operating LLMs is not transparent; however, water consumption, carbon emissions, and strains on the power grid are all known consequences of powering large models [26]. Furthermore, there are financial costs, social consequences, and the potential for disruption in what is prioritized in this type of research [27,28]. While the production of energy required to power the servers that support LLMs is becoming greener, it is unclear whether the costs outweigh the benefits from a larger systems perspective. This conversation is increasingly relevant as LLMs are used in novel ways for optimizing research [8,29].
The cGPT operator configured their data controls to deny OpenAI use of their content for training new models. Additionally, all conversations containing uploaded files were deleted, placing them on a 30-day countdown to be erased from OpenAI’s servers [30]. Despite these measures, providing full text article access to LLMs presents an ethical dilemma of data sharing and intellectual property rights. In the present study, it was realized after the fact that several articles uploaded to ChatGPT were paywalled (i.e., purchase or subscription required for access). The authors hope that transparency about this error may encourage researchers to critically evaluate the appropriateness of uploading articles to LLM environments. There appears to be an urgent need to strengthen education on the ethical concerns of granting companies like OpenAI access to paywalled content.
Conclusion
The improved speed of systematic reviewing with the use of LLMs like ChatGPT has implications for directing timely public health policy decisions. cGPT powered by ChatGPT4o performed best when employed as a second reviewer’s assistant. More research is needed to establish standardized performance guidelines for the practical use of cGPTs, possibly in the form of a Cohen’s kappa threshold. There is evidence of time-saving potential in utilizing a cGPT. A more sophisticated system for inputting articles and coding cGPT’s responses is needed to maximize the potential time-saving benefits. Investigating cGPT’s role as either an assistant or autonomous reviewer in a less complex SR may provide improved reliability and time-saving potential that outpace the results of the current study. Repeating the testing with the most updated ChatGPT model holds exciting promise.
Footnotes
Acknowledgements
Haakon Meyer provided useful comments on the manuscript.
Contributions
All members of the team contributed to the study’s conceptualization, data curation, reviewing, and editing. Rachel Davis was responsible for methodology, software, formal analysis, investigation, resources, visualization, original draft writing, and project administration. Espen Heen was instrumental in developing the methodology, visualization, validation, project administration, editing, and supervision.
Data availability statement
All data produced in the present study are available upon reasonable request to the authors.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Use of artificial intelligence tools
A custom GPT (Pro-version), powered by ChatGPT4o from OpenAI, was used as described throughout this article, but was not used in the generation of text, tables, or in any other content of the article.
Supplemental material
Supplemental material for this article is available online.
