Abstract
Generative artificial intelligence (genAI) tools are transforming workflows, with growing interest in their potential applications in qualitative research. While the use of genAI to facilitate the systematic review process has been explored, its application in the quality appraisal of qualitative research remains poorly understood. This pilot study aims to evaluate how closely ChatGPT's appraisals of qualitative research, conducted with popular appraisal tools, align with human assessments. Two reviewers applied the Critical Appraisal Skills Programme (CASP) and Joanna Briggs Institute (JBI) checklists for qualitative research to studies identified through a previously published review (n = 21). Next, iteratively developed prompts, along with a copy of each study, were uploaded to ChatGPT to instruct it to appraise each article. Interrater reliability measures and crude percent agreement were calculated to estimate the level of agreement between human and genAI assessments. Interrater reliability between human and ChatGPT (GPT-5) assessments ranged from no agreement to moderate agreement for CASP checklist items (kappa: <.00-.46; crude agreement: 23.8%-100%) and from none to substantial agreement for JBI items (kappa: <.00-.83; crude agreement: 4.8%-95.2%). Agreement was highest for reporting-based elements such as study aims, ethics approval, and value of research (CASP), and participant voices and conclusions (JBI). Disagreements were greatest for interpretive and context-dependent items such as research design, researcher–participant relationships, and worldview–methodology congruity. Findings demonstrate that ChatGPT (GPT-5) can reliably identify objective components yet performs inconsistently when assessing items requiring nuance and contextual understanding across both checklists. At present, any adoption of genAI for the quality appraisal of qualitative research should occur only alongside human assessment and must uphold principles of transparency and data privacy.
Highlights
Evaluates ChatGPT’s capacity to conduct qualitative quality appraisal by comparing AI-generated CASP and JBI assessments with human consensus ratings across 21 studies.
Finds strong alignment on clearly reported, descriptive items (e.g., study aims, ethics, participant voices, value of research), demonstrating AI’s reliability with objective checklist criteria.
Identifies substantial inconsistencies on interpretive and context-dependent items, such as research design, researcher–participant relationship, and methodological congruity, where human judgment remains essential.
Reveals wide variability in agreement metrics, with agreement ranging from none to moderate for CASP and from none to substantial for JBI, underscoring tool-based differences in AI performance.
Concludes that genAI cannot replace human appraisal in qualitative research.
Introduction
The integration of generative artificial intelligence (genAI) tools continues to reshape workflows across various domains, including psychology, 1 healthcare, 2 education, 3 and research. 4 While there are concerns regarding the potential risks associated with reliance on genAI technologies, such as inaccuracies and hallucinations,5,6 their broad applicability gives rise to notable advantages, such as enhanced productivity and cost-effectiveness. 7 GenAI tools employ advanced computational techniques such as machine learning (ML), deep learning (DL), and natural language processing to analyze large volumes of data and generate outputs such as text, images, videos, and music in response to user prompts. 8 For instance, OpenAI’s ChatGPT, a variant of the GPT language model, is a tool trained to produce human-like responses in conversational contexts. 8
GenAI offers significant potential for streamlining research processes 9 and has been explored for its application in qualitative research synthesis, including evidence review 10 and thematic analysis.11-13 A qualitative study assessing the ability of 6 large language models (LLMs) to comprehend medical research papers, using the STROBE checklist, found that, while the models varied in their ability to understand medical papers and summarize their sections, LLMs have the potential to enhance the efficiency of qualitative synthesis and facilitate evidence-based decision-making. 14
Central to assessing research quality is quality appraisal. While its transferability to qualitative research remains debated, owing to a lack of consensus on how criteria for assessing quality in qualitative research should be developed,15,16 some studies suggest that systematic quality appraisal can help ensure the rigor of qualitative studies.17,18 In addition, an important element of a systematic review is the appraisal of the methodological quality of the primary studies, to decide whether their results are trustworthy and the extent to which they should contribute to the overall evidence synthesis. 19 This is particularly important in fields such as public health, where the validity and reliability of findings inform decision-making. 20 A variety of tools exist to appraise the quality of a study, with some using reporting guidelines as a proxy for quality appraisal. Quality appraisal tools vary in which aspects of research quality they assess, including rigor, credibility, trustworthiness, and relevance of results, offering guidance for identifying a study's strengths and weaknesses and informing the credibility of its findings. 18 The most widely adopted are the Joanna Briggs Institute (JBI) Checklist 21 and the Critical Appraisal Skills Programme (CASP) Checklist, 22 which contain items on design, recruitment strategy, data collection, reflexivity, analysis, ethics, reporting, and value of findings. The main purpose of these tools is to guide appraisal within qualitative syntheses and help ensure that findings are based on trustworthy evidence.
Preliminary searches found no existing studies exploring the use of genAI for conducting quality appraisals of qualitative studies using established quality appraisal tools. Notably, Nguyen-Trung et al 10 explored the application of genAI in rapid reviews and acknowledged the potential of AI for quality appraisal as an area warranting future exploration. Addressing this gap will provide insight for guiding the appropriate use and development of AI-assisted tools in quality appraisal in qualitative research.
Therefore, this pilot study aims to evaluate ChatGPT's appraisals of qualitative research by assessing areas of alignment and divergence with human reviewers and by examining the potential applications and limitations of genAI in qualitative quality appraisal. The objectives of this research are to:
Develop a systematic process for using ChatGPT to conduct quality appraisal using the CASP and JBI qualitative checklists; and
Assess the level of agreement between AI and human-conducted quality appraisals.
Methods
Twenty-one peer-reviewed qualitative studies were identified through a previously published review by Fazakarley et al, 23 which synthesized qualitative research on stakeholder beliefs around the use of AI in healthcare. The qualitative synthesis comprised 19 open-access articles and 2 articles with restricted access. This review was selected as the basis for our quality appraisals because the studies it contained varied in geography, participants, data collection methods, and analytical approaches. Next, the CASP and JBI qualitative checklists were applied to all studies, and the following information was noted: presence of each item (ie, yes, no, or unclear), a brief justification, and the location where the item is addressed in the study.
The CASP checklist, a quality appraisal tool endorsed by the Cochrane Qualitative and Implementation Methods Group, comprises 10 questions guiding the assessment of the validity, results, and local relevance of qualitative research. 17 The JBI checklist also includes 10 questions, addressing the congruity of elements within the study (e.g., congruity between methods and results), reflexivity, participant voices, ethics, and coherence of conclusions. 15 We selected these 2 checklists because they are the most widely adopted quality appraisal tools in qualitative syntheses and are known for their ease of use.15,17 More information about these tools is available on their respective websites.21,22
ChatGPT was selected for this pilot study given its widespread availability, integration into academic and professional workflows, and prior evidence of use in tasks such as literature review and evidence synthesis.24,25 Its conversational interface also facilitates transparent prompt construction, making it suitable for a feasibility assessment of quality appraisal.
Prior to data analysis, a pretest was conducted to: (i) critically appraise 3 studies via a human reviewer (HS) and (ii) assess genAI's ability to perform comparable quality appraisals on the same studies. This pretesting guided the iterative development of the prompts provided to ChatGPT 4o in June 2025 (Supplemental File 1). Next, 2 independent reviewers (drawn from HS, MM, and AT) assessed each study by applying the CASP and JBI qualitative checklists. Following independent review, the reviewers discussed any conflicts and generated a consensus across the 21 studies (Supplemental Files 3 and 4). These consensus ratings served as the reference standard for comparison with genAI's appraisals on the 2 qualitative checklists across the 21 studies.
For the CASP checklist, the following prompt and an attached copy of the article were provided to ChatGPT (GPT-5) in a new ‘chat’ for each study:
I would like you to critically appraise the attached qualitative study using the CASP Qualitative Research Checklist.
Carefully review the full content of the study. Then, assess each of the ten questions in the CASP checklist, and provide your assessment in a table format with the following 4 columns:
1. Title of the study
2. A brief answer to the question in the checklist (Yes, No, Can’t tell)
3. A 1 sentence justification based on explicit evidence from the text
4. The location in the study where the relevant information is found
Use the following criteria to guide your judgement:
Answer ‘Yes’ if the criterion is clearly addressed in the text, with adequate detail and justification.
Answer ‘No’ if criterion is completely absent.
Answer ‘Can’t tell’ if the criterion is partially met, unclear, or lacks sufficient detail to judge confidently.
When forming a response, take note of the following considerations for each checklist item (the exact descriptions of each item from the tool were included at this point; see Supplemental File 1 for the full prompt, including the item descriptions from the checklist).
For the JBI checklist, the following prompt and an attached copy of the article were provided to ChatGPT (GPT-5) in a new ‘chat’ for each study:
I would like you to critically appraise the attached qualitative study using the JBI Critical Appraisal Checklist for Qualitative Research. Carefully review the full content of the study. Then, assess each of the ten questions in the JBI critical appraisal tool for qualitative research, and provide your assessment in a table format with the following 4 columns:
1. Title of the study
2. A brief answer to the question in the checklist (Yes, No, Unsure)
3. A 1 sentence justification based on explicit evidence from the text
4. The location in the study where the relevant information is found
Use the following criteria to guide your judgement:
Answer ‘Yes’ if the criterion is clearly addressed in the text, with adequate detail and justification.
Answer ‘No’ if criterion is completely absent.
Answer ‘Unsure’ if the criterion is partially met, unclear, or lacks sufficient detail to judge confidently.
When forming a response, take note of the following considerations for each checklist item (the exact descriptions of each item from the tool were included at this point; see Supplemental File 2 for the full prompt, including the item descriptions from the checklist).
The raw appraisal outputs from ChatGPT (GPT-5) are provided in an open database: https://osf.io/cyfsm/. Subsequently, interrater reliability between the human and ChatGPT (GPT-5) assessments was assessed using the 'irr' package in RStudio (version 2025.05.0, running R 4.4.0). Crude percent agreement, although a less rigorous metric, 26 is also presented because Cohen's kappa could not be calculated for some checklist items due to homogeneous responses. This crude measure reflects the proportion of ratings on which the human reviewers and ChatGPT (GPT-5) agree but does not account for agreement due to chance; despite this limitation, we report it alongside Cohen's kappa to provide additional context where kappa could not be calculated.
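To illustrate this analysis, the following minimal sketch shows how Cohen's kappa and crude percent agreement could be computed in R with the 'irr' package. Cohen's kappa is defined as kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The ratings vectors below are hypothetical examples for a single checklist item, not data from this study.

# Minimal sketch of the agreement analysis; the ratings are hypothetical
library(irr)

# Human consensus and ChatGPT ratings for 1 checklist item across 21 studies
human <- c("Yes", "Yes", "Unclear", "Yes", "No", "Yes", "Yes", "Unclear", "Yes", "Yes",
           "Yes", "No", "Yes", "Yes", "Unclear", "Yes", "Yes", "Yes", "No", "Yes", "Yes")
gpt <- c("Yes", "Yes", "Yes", "Yes", "No", "Yes", "Unclear", "Unclear", "Yes", "Yes",
         "Yes", "Unclear", "Yes", "Yes", "Yes", "Yes", "Yes", "Yes", "No", "Yes", "Yes")

# Cohen's kappa for 2 raters: kappa = (p_o - p_e) / (1 - p_e)
kappa2(cbind(human, gpt))

# Crude percent agreement: proportion of identical ratings, uncorrected for chance
mean(human == gpt) * 100

# When responses are homogeneous (e.g., both raters answer "Yes" for every study),
# p_o = p_e = 1 and kappa is undefined (0/0), which is why crude agreement is
# reported for those items.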
Results
Summary of Included Studies
The 21 articles included in this pilot study, drawn from a published review of qualitative research, were published between 2020 and 2022 and had a median of 28 participants (range: 12-475). Participants included students, researchers, industry experts, healthcare providers, industry stakeholders, and regulators. Most studies used semi-structured interviews and thematic analysis and were conducted in the United Kingdom (Table 1).
Table 1. Summary of Qualitative Studies From the Previously Published Review.
Source. Adapted from Fazakarley et al. 23
This category included 1 study from each of the following: Africa, Denmark, France, Global, Italy, Jordan, and Switzerland.
This category included 1 study from each of the following: iterative analysis (unspecified), inductive analysis (unspecified), and framework analysis.
Studies could employ more than 1 method; thus, totals may exceed 100%.
Human and ChatGPT (GPT-5) Responses to CASP Checklist Items
When comparing human and ChatGPT (GPT-5) responses for each CASP checklist item, the highest alignment in ‘Yes’ responses occurred when assessing the value of research, statement of aims, and ethics (Table 2). In contrast, the greatest difference was observed in items assessing the appropriateness of research design and the researcher–participant relationship. Responses selected as ‘No’ showed limited alignment overall, with notable divergence for the researcher–participant relationship. For ‘Unclear’ responses, the strongest alignment was observed for rigorous data analysis, whereas the lowest alignment again appeared for the assessment of appropriate research design. See Supplemental File 3 for the consolidated CASP quality appraisal results for all 21 included qualitative studies, showing side-by-side human and ChatGPT (GPT-5) responses for each checklist item (Q1–Q10).
Table 2. Comparing the Responses Using the CASP Checklist Between Human and ChatGPT (GPT-5) Assessments of 21 Qualitative Studies.
Human and ChatGPT (GPT-5) Responses to JBI Checklist Items
When comparing results for JBI checklist items, the greatest alignment in ‘Yes’ responses was observed for participant voices represented, ethical approval, and the relationship of conclusions to analysis (Table 3). Objective–method congruity and design–data collection congruity also showed high agreement. In contrast, the greatest differences occurred for worldview–methodology congruity and design–analysis congruity, where ChatGPT (GPT-5) was more likely to classify items as ‘Yes’. For ‘No’ responses, partial alignment was observed for researcher orientation and the influence of the researcher on the research, although ChatGPT (GPT-5) more often shifted these assessments to ‘Unclear’. See Supplemental File 4 for the consolidated JBI quality appraisal results for all 21 included qualitative studies, showing side-by-side human and ChatGPT (GPT-5) responses for each checklist item (Q1–Q10).
Table 3. Comparing the Responses Using the JBI Checklist Between Human and ChatGPT (GPT-5) Assessments of 21 Qualitative Studies.
Agreement Between Human and ChatGPT (GPT-5) Assessments
Cohen’s kappa coefficients indicated levels of agreement between ChatGPT (GPT-5) and the human reviewers ranging from none to moderate (<0.00-0.46) for CASP items and from none to substantial (<0.00-0.83) for JBI items (Table 4). For checklist items where homogeneous responses prevented kappa calculation, crude percent agreement was also reported, ranging from 23.8% to 100% across CASP items and from 4.8% to 95.2% across JBI items. The highest crude agreement was observed for CASP item 10 (value of research, 100%) and JBI items 3 (design–data collection congruity) and 8 (participant voices represented, 95.2%), while the lowest was seen for CASP item 3 (appropriate research design, 23.8%) and JBI item 1 (worldview–methodology congruity, 4.8%).
Table 4. Level of Agreement Between Human and Generative AI Assessments, Demonstrated Through Cohen’s Kappa Coefficients and Crude Percent Agreement (%).
Some interrater reliability measures could not be calculated due to homogeneous responses across studies.
Discussion
The potential applications of genAI are rapidly expanding, with diverse sectors integrating these tools to enhance workflow efficiency. In research, tools such as RobotReviewer have been explored for their ability to automate risk of bias assessment in randomized controlled trials, 27 while software like NVivo has embedded auto-coding features for theme and sentiment identification. 28 The rapid integration of genAI tools into the systematic review process has further highlighted their success in assisting research workflows.29,30 Notably, systematic review platforms such as DistillerSR and Nested Knowledge have incorporated AI-based tools to assist with early phases of reviews, such as data extraction, searching, and screening. 31 DistillerSR, a reference management software, integrates AI into its screening and data extraction capabilities, 32 with promise shown for screening efficiency. 33 However, the application of genAI in more interpretative processes, such as quality appraisal, remains uncertain. In this pilot study, we evaluated how well ChatGPT (GPT-5) performed quality appraisal of qualitative studies by comparing its assessments with those of human reviewers. Future research should compare ChatGPT with other large language models to assess performance patterns.
Our findings demonstrate that ChatGPT (GPT-5) makes sound decisions when identifying simpler, descriptive components. For instance, there was favorable crude agreement for study aims and value of research (CASP), and for participant voices, ethical approval, and conclusions (JBI). These findings suggest genAI tools may best support researchers in identifying key study components that are primarily reporting-based and require less subjectivity. Notably, the efficiency with which these assessments are performed may also offer significant advantages for streamlining the reporting checklist process and ensuring researchers report key elements of qualitative research. A scoping review exploring resource use during systematic reviews identified quality appraisal as a highly resource-intensive task. 34 Studies examining the use of quality assessments in systematic reviews have found increasing adoption of such assessments in qualitative and quantitative evidence syntheses, despite the number and subjectivity of the appraisal tools available.35-37 However, concerns remain about whether quality assessments are sufficiently integrated into the overall synthesis, especially given the inherent challenges of more interpretative methods such as meta-ethnography.35,37 Thus, there is a need to revisit the use of genAI for quality appraisal and to further explore its use and effectiveness in applying reporting checklists to qualitative studies.
Our findings also highlighted differences between the 2 appraisal tools. CASP items more often targeted clear reporting elements (e.g., study aims, ethics, value of research), where ChatGPT (GPT-5) achieved higher alignment with human reviewers. In contrast, the JBI checklist places greater emphasis on methodological congruity (e.g., worldview–methodology alignment, design–analysis coherence), dimensions that are inherently more interpretive. This distinction is consistent with prior comparative analyses of appraisal tools, such as Hannes et al, 38 who observed that CASP is less sensitive to deeper aspects of validity, while JBI’s emphasis on congruity provides greater coherence in assessing interpretive and theoretical dimensions. These distinctions may help explain why agreement ranged from none to moderate for CASP but from none to substantial for JBI. Appraisal tools differ not only in structure but also in the types of validity they emphasize, 38 and our findings suggest that genAI tools are more effective in supporting reporting-based assessments but remain limited in making judgments that require contextual or nuanced interpretation. Human reviewers remain essential for addressing the more interpretive dimensions of qualitative appraisal.
Recognizing that quality appraisals often require nuanced judgment and contextual understanding, we identified widespread limitations in the application of genAI for quality appraisal using both the CASP and JBI qualitative checklists. Even when provided with the full explanatory notes from the CASP and JBI checklists, ChatGPT (GPT-5) remained inconsistent, with only minor changes observed in item-level outputs. This suggests that the limitations are not solely prompt-dependent but instead reflect challenges with applying nuanced, interpretive judgment. When confronted with interpretative and context-dependent components, most notably appropriate research design and the researcher–participant relationship (CASP), and worldview–methodology congruity and design–analysis congruity (JBI), ChatGPT (GPT-5) showed poor agreement with human reviewers. In these areas, ChatGPT (GPT-5) often shifted responses toward ‘Yes’ or ‘Unclear’, diverging from human reviewers. The tool may have struggled with such elements for several reasons. First, the subjective nature of the CASP and JBI checklists 17 may have presented challenges for the tool. For instance, the fixed response options of the CASP tool often lack the flexibility needed to accommodate nuanced interpretation, 17 which may have made it difficult for genAI to exercise contextual judgment when appraising the qualitative studies. Second, limitations in the reasoning depth of genAI may also pose challenges for interpreting nuanced items. 39 A study comparing the concordance and speed of information extraction by ChatGPT-3.5 Turbo against human extraction methods found that the model was effective at extracting simple information, but researcher involvement was required for complex tasks. 40 Similarly, a study evaluating the comprehension of 6 LLMs in summarizing medical papers found that ChatGPT-3.5 excelled at identifying funding information, which the authors note requires a nuanced understanding of elements embedded at the end of papers; however, it also observed weaker LLM performance on items requiring broader methodological interpretation. 14 We did not analyze the AI-generated justifications, which may provide further insights into model reasoning and its limitations in nuanced judgment.
Based on our findings, the research team developed the following recommendations through consensus discussions, drawing on our expertise in evidence synthesis and in the governance and application of genAI in public health. These recommendations are intended to inform future work on the use of genAI in the quality appraisal of qualitative research:
Recommendation 1: Human Oversight and Additional Evaluation Are Necessary
Human oversight remains critical to the application of genAI in quality appraisal. Studies have repeatedly demonstrated the importance of human oversight in research processes, including systematic reviews 30 and in the creation of plain language summaries 41 to ensure the validity of genAI outputs. Our study adds to this knowledge base and stresses the importance of human oversight when adopting genAI tools, particularly in qualitative research, where interpretative and context-dependent components are highly prevalent. In particular, our analysis highlights the limitations of genAI when faced with subjective context. Additional testing and evaluation are needed to examine how such tools perform across a wider variety of review topics and contexts. Based on this pilot study, genAI-assisted assessments are too inconsistent to be reliable for adoption in qualitative reviews. If used, they should be subjected to human oversight to ensure accuracy and contextual appropriateness. Human oversight should be undertaken by researchers with the appropriate expertise to critically assess the output and its relevance to the intended use. At this stage, the greatest utility of genAI quality assessment would likely be in rapid review contexts where resources and timelines might normally allow only 1 human reviewer. While our analysis focused on agreement across checklist items, we did not conduct a qualitative coding of the 1-sentence justifications generated by ChatGPT (GPT-5). Future research should systematically analyze these justifications for accuracy, relevance, and depth, as this may provide richer insights into model reasoning. Ultimately, genAI tools should complement the researcher’s efforts.
Recommendation 2: Transparent Reporting Remains Essential for Rigor and Trust in Qualitative Research
Transparent reporting of where and how genAI is applied in the qualitative synthesis process helps ensure the rigor of qualitative research. This is particularly important given the skepticism surrounding genAI adoption, largely due to unseen algorithmic activity that produces outcomes without clear explanation.42,43 Maintaining such transparency supports rigor and trust, aligning with best practices in qualitative research. 44
Recommendation 3: Caution Should Be Exercised With Regard to Data Privacy
Caution is warranted when adopting genAI tools in qualitative research, as uploading non-open-access or unpublished full-text studies may compromise privacy and pose ethical risks. AI platforms often process and store user data, and using copyrighted or subscription-based material may breach ethical or legal guidelines. 45 To mitigate these concerns, researchers should avoid inputting restricted content into public AI platforms and should consult AI data policies and institutional guidelines.
Recommendation 4: The Development of a GenAI-Compatible Quality Appraisal Tool May Enhance Integration of GenAI in Qualitative Quality Appraisal
Current quality appraisal frameworks were not designed to accommodate the application of genAI. Indeed, this study demonstrates the constraints of applying genAI to quality appraisal, which requires nuance and contextual understanding. The protocol guiding the development of QUADAS-AI, an AI-specific extension of QUADAS, acknowledges that addressing the concerns of AI is central to developing tools that can effectively perform the desired task. 46 Therefore, as a first step, we recommend exploring the development of validated, AI-compatible quality appraisal tools for rapid reviews. These tools should address the limitations of current AI systems and incorporate features that reduce ambiguity in interpretation. Addressing these limitations can enhance the efficiency, accuracy, and reliability of AI-assisted quality appraisal.
Conclusions
This pilot study suggests that genAI may not be an appropriate standalone substitute for human quality appraisal of qualitative research using existing qualitative checklists. However, its integration can be considered alongside human oversight. Our study demonstrated low human–genAI agreement when applying the checklists to nuanced and context-dependent items. In contrast, ChatGPT (GPT-5) showed promising results when applying checklist items that were reporting-based and required little interpretive depth, such as study aims. Future research is needed to explore the capacity of genAI tools to apply reporting checklists and to evaluate the capacity of existing genAI tools to critically appraise qualitative studies across broader contexts and settings. This will further our understanding of the strengths and limitations of genAI in the quality appraisal process, which will be critical for developing and refining tools that can effectively support evidence synthesis.
Footnotes
Ethical Considerations
There are no human participants in this article and informed consent is not required.
Author Contributions
Conceptualization (MM and AT), Methodology (HS, AT, IY, and MM), Formal analysis (AT), Investigation (HS, AT, and MM), Writing – original draft (HS), Writing – review & editing (HS, AT, IY, and MM), Resources (MM), Supervision (AT and MM), and Validation (AT, IY, and MM). All authors have read and approved the final manuscript.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article. One author is employed at the Public Health Agency of Canada (PHAC). The work was not undertaken under the auspices of PHAC as part of their employment responsibilities. Any views expressed therein are not those of PHAC.
Data Availability Statement
The raw ChatGPT (GPT-5) appraisal outputs are available in an open repository: https://osf.io/cyfsm/.
Supplemental Material
Supplemental material for this article is available online.
References
