Abstract
Background
Large language models (LLMs) have had a substantial positive impact across disciplines, including healthcare. As family caregivers are an essential part of the healthcare system, they need support and can benefit from this technology. However, there is no consensus on reliable and valid measures for evaluating large language models.
Objective
This study aims to review the literature on the evaluation measures of large language models for caregivers.
Methods
We conducted a scoping review guided by Arksey and O’Malley methodology and the PRISMA-ScR checklist. A literature search on PubMed, EMBASE, CINAHL, and PsycINFO, from 2018 through July 2024, was carried out. An additional rapid review was conducted for the recent literature update from July 2024 through November 2025.
Results
Of the 1812 publications identified, 10 met the inclusion criteria. All 10 focused on ChatGPT, and three also addressed other large language models, such as Google Bard and Bing AI. The most commonly assessed core conceptual components of the evaluation measures were accuracy, reliability, readability, and comprehensiveness. Overall, the included studies reported that large language models’ responses were somewhat accurate and reliable, with mixed results for readability and comprehensiveness. The 14 publications identified in an additional rapid review offered further evidence of ChatGPT-centrism.
Conclusions
This review provides a comprehensive overview of the measures for evaluating large language models and highlights the need for their improvement using reliable and valid measures. The findings guide the direction of future research and practice to maximize the benefits through continuous quality improvement.
Introduction
Artificial intelligence (AI) technology has been a breakthrough that has reshaped how we live and work. Within this field, generative AI systems, capable of autonomously creating new content such as text, images, or music, have evolved dramatically since their introduction decades ago. 1
Large language models (LLMs) are a notable example of how generative AI has evolved and specialized in text-based generative tasks. Language models have advanced over the several decades since their inception. 2 Eliza, an early natural language processing program created in 1966 by Massachusetts Institute of Technology researcher Joseph Weizenbaum, is typically considered the first chatbot.2–4 Eliza, however, was not an LLM; it was a rule-based chatbot driven by a fixed script that mimicked human–computer conversation by emulating a psychotherapist, and it served as a stepping stone toward more advanced systems. As language models have evolved with advances in AI, LLMs have taken the world by storm. 5 The four main aspects of LLMs, the most popular and widely used AI language models at present, are pretraining (training on massive text data from various sources), adaptation (adapting to tasks), utilization (applying them in the real world), and evaluation (assessing performance). 5
Large language models have shown the potential to interpret vast amounts of information, such as providing health-related information and supplementing diagnostic processes, but their use warrants caution.6,7 Concerns include disseminating misinformation, producing AI hallucinations, and amplifying existing biases, and careful consideration of their potential effects and ethical implications has been emphasized.8,9 Nevertheless, LLMs are highly useful and hold enormous potential because they operate round-the-clock, promptly provide real-time information that is easy to understand, and are available to many users free or at low cost. 10 Because of these benefits, LLMs are increasingly being adopted across diverse fields, including finance, business, and cybersecurity, and their use is rapidly expanding within healthcare as well. 11
Representative examples of LLMs at present are Chat Generative Pre-trained Transformer (ChatGPT) by OpenAI, Gemini by Google, and LLM Meta AI (LLaMA) by Meta. OpenAI's first generative language model, GPT-1, was introduced in 2018; although not a public product, it established the foundation for subsequent models such as GPT-2 and GPT-3, 12 and ChatGPT, based on GPT-3.5, was launched to the world by OpenAI in November 2022. LLaMA by Meta AI and Gemini by Google DeepMind followed in 2023. ChatGPT, LLaMA, and Gemini quickly went viral and have captured millions of users around the world. Since their debut, the practicalities, applications, impacts, and concerns of LLMs in healthcare have been explored and investigated.8,10 However, no consensus has yet been reached on the most reliable and valid measure for evaluating the use of LLMs in healthcare. Among the various users of LLMs, understanding the implications of LLMs for family caregivers with reliable and validated evaluation measures is crucial because caregivers are an essential part of the healthcare system. 13
More than one in five Americans are family caregivers, 14 and globally, hundreds of millions of people take on caregiving roles. Family caregivers, also known as informal caregivers, experience a high burden while caring for their loved ones. 15 Caregiving burden has been a long-standing issue because it leads to numerous negative impacts on both caregivers and care recipients.13,15 While a large number of interventions to support caregivers have been developed,16,17 the use of LLMs for caregivers has not been extensively explored, despite its substantial potential benefits, such as providing educational opportunities, increasing available information, and reducing language barriers. 18 Additionally, concerns about the reliability of LLM outputs (e.g., the risk of AI “hallucinations” producing incorrect information) and the lack of established evaluation standards raise questions about how such tools should be assessed. Reviews of LLM use have recently been published,19–21 but two were limited to specific care settings, either emergency medicine or oncology,19,20 and the third did not specify care settings or users. 21 To date, no comprehensive review has examined how LLMs are being used to support caregivers across multiple care contexts, what benefits or limitations have been observed, and what gaps exist in this nascent field. Therefore, our research question for this scoping review is: What measures have been used to evaluate the use of LLMs among family caregivers in healthcare research? This review thus aims to explore the existing literature on the use of LLMs, such as ChatGPT, LLaMA, and Gemini, and to reveal gaps in the literature regarding the measures used to evaluate LLMs and the findings that emerge from their application for family caregivers.
Methods
Guided by the methodology of Arksey and O’Malley22,23 and the PRISMA-ScR checklist (Supplemental material 1), 24 this scoping review was undertaken to provide an overview of the existing literature on the use of LLMs for family caregivers and to guide future directions for improving LLM evaluation measures. The search was conducted in the PubMed, EMBASE, CINAHL, and PsycINFO databases on 13 July 2024. The protocol for this review was registered on the Open Science Framework Registry on 8 January 2025 (http://osf.io/a8b9y). To ensure transparency regarding the literature search prior to registration, we established predefined eligibility criteria, conducted independent screening, and documented all decisions before synthesis. Publications from peer-reviewed journals written in English were included. Given the recent emergence of LLMs, we limited publication dates to between 2018 and July 2024. We excluded certain types of publications, such as conference presentations, editorials, commentaries, reviews, and unpublished works, as well as publications that did not explicitly state the type of technology (i.e., whether it was an LLM) or did not focus on its use for caregivers or patient–caregiver dyads. The search keywords included LLMs and caregivers; the details of the search terms are presented in Table 1. Titles, abstracts, and full-text publications were reviewed independently by two reviewers (SH and HC) using Rayyan, a free web and mobile application. 25 Although Rayyan has AI-assisted screening features, our screening followed the traditional two-independent-reviewer approach without using these features. Interrater reliability, measured using the percent agreement method, 26 was 96% for the title/abstract screening and 83% for the full-text screening. All disagreements between the two independent reviewers were resolved through discussion with a third person, who has expertise in health informatics and computer science (YC). A narrative synthesis was used to summarize findings regarding the evaluation measures used across studies and the extent of caregiver involvement. In addition, a table of evidence was established, including the language model type, evaluation dataset, study aim, target users (caregiver type), key evaluation measures and scores, and relevant findings (Table 2).
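Percent agreement is the simplest interrater reliability statistic: the share of screened records on which both reviewers made the same decision. As a minimal sketch of the calculation (not the authors' analysis code; the decision lists below are hypothetical):

```python
def percent_agreement(reviewer_a: list[str], reviewer_b: list[str]) -> float:
    """Share of records where two reviewers made the same
    include/exclude decision, expressed as a percentage."""
    if len(reviewer_a) != len(reviewer_b):
        raise ValueError("Both reviewers must rate the same records")
    matches = sum(a == b for a, b in zip(reviewer_a, reviewer_b))
    return matches / len(reviewer_a) * 100

# Hypothetical example: 48 identical decisions out of 50 records -> 96.0
decisions_a = ["include"] * 48 + ["exclude", "include"]
decisions_b = ["include"] * 48 + ["include", "exclude"]
print(percent_agreement(decisions_a, decisions_b))  # 96.0
```

Because percent agreement does not correct for chance agreement, some reviews report a chance-corrected statistic such as Cohen's kappa alongside it.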
An overview of the search terms used (n = 1812).
NIH: National Institutes of Health; NLM: National Library of Medicine.
Characteristics of included studies in this scoping review (n = 10).
LLM: large language model; HCC: hepatocellular carcinoma.
Given the fast-evolving nature of LLM advancement, after completing the review of literature published from 2018 through July 2024, we posited that an additional rapid review was necessary to capture newly published literature from 1 July 2024 through 6 November 2025. This rapid review was intended to reflect the most current evidence rather than to be a comprehensive review involving a rigorous review process, such as dual independent screening.
Results
Study characteristics
From the initial search, we identified 1812 publications across the four databases: PubMed, EMBASE, CINAHL, and PsycINFO. After duplicates were removed, 1290 titles and abstracts were screened, and 46 full-text publications were then reviewed. Thirty-six publications were removed because of ineligible technology (e.g., not using LLMs), publication type (e.g., commentary), or population (e.g., no family caregivers). Ultimately, 10 publications that met the inclusion criteria were assessed and analyzed (Figure 1). Characteristics of all included studies, such as specific models, evaluation datasets, study aims, target users, evaluation measures and scores, and main relevant findings, are summarized in Table 2. The details of key evaluation measures are presented in Table 3. While all 10 studies used ChatGPT, three also addressed other LLMs, such as Google Bard and Bing AI. In terms of the target health conditions of patients (care recipients), two studies addressed caregivers of individuals with dementia,27,35 two addressed those with vision disorders,29,33 two addressed those with liver diseases,10,34 one addressed those with stroke, 32 one addressed those with autism, 31 one addressed those with epilepsy, 28 and one addressed the importance of vaccines for caregivers of children aged 0–24 months. 30 Most studies assessed questions or posts commonly made by caregivers,10,27–29,31–33,35 whereas one study prompted an LLM to create educational materials for caregivers, 34 and another prompted an LLM to edit content for caregivers. 30

PRISMA flow diagram for the scoping review process (publications from 2018 to 2024).
Evaluation measures of included studies in this scoping review (n = 10).
PEMAT-P: Patient Education Materials Assessment Tool for Printable Materials; SMOG: Simple Measure of Gobbledygook.
Measures for evaluating LLMs
The studies included in this review employed a variety of measures for evaluating LLMs, with some overlap across the literature. Based on the core conceptual focus of each measure's components, four key components emerged as the most commonly addressed: accuracy, reliability, readability, and comprehensiveness. In more detail, we identified all LLM evaluation measures from the included publications, as presented in Table 3, counted how often each measure appeared across the publications, and from these counts determined the most commonly used measures.
Accuracy
Nine publications addressed accuracy, which refers to whether the information is precise, factually valid, and free of errors.10,27–32,34,35 Pradhan et al. 34 evaluated accuracy with eight transplant hepatologists using a 5-point scoring system adapted from Dy et al. 36 and Storino et al., 37 in which a score of 1 indicated <25% accurate, 2 indicated 26%–50% accurate, 3 indicated 51%–75% accurate, 4 indicated 76%–99% accurate, and 5 indicated 100% accurate. Aguirre et al. 27 assessed whether the responses were free of inaccurate or false information using a 5-point rating scale adapted from the levels of cognitive complexity in clinical decision-making described by Hurtz et al. 38 Kim et al. 28 assessed accuracy using a 4-point rating scale (sufficient educational value; correct but inadequate; mixed with correct, incorrect, or outdated information; incorrect). Yeo et al. 10 used a 4-point rating scale (comprehensive; correct but inadequate; mixed with correct and incorrect/outdated data; completely incorrect). McFayden et al. 31 also used a 4-point rating scale (completely correct/clear/concise, almost correct/clear/concise, partially correct/clear/concise, completely incorrect/unclear/unconcise) across three domains: correctness, clarity, and conciseness. Lim et al. 29 assessed accuracy using a 3-point scale (poor, borderline, good), and Neo et al. 32 used a 3-point Likert-like rubric (unsatisfactory, borderline, satisfactory). Loughran et al. 30 convened a panel of subject matter experts and linguistic experts to verify the accuracy of ChatGPT responses, and Saeidnia et al. 35 used both quantitative (questionnaire) and qualitative (interview) approaches to evaluate the correctness of ChatGPT responses, but neither study provided details of a scale or rubric.
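To make the Pradhan et al. 5-point rubric concrete, the mapping from an expert-judged percentage of accurate content to a score can be expressed as a threshold function. This is an illustrative sketch only; the handling of the 25%–26% boundary is our assumption, since the published bands (<25% and 26%–50%) leave it unspecified.

```python
def accuracy_score(pct_accurate: float) -> int:
    """Map a judged percentage of accurate content (0-100) onto the
    5-point rubric described by Pradhan et al.; illustrative sketch only,
    with handling of the 25-26% boundary assumed."""
    if pct_accurate < 26:    # rubric band 1: <25% accurate
        return 1
    if pct_accurate <= 50:   # band 2: 26%-50% accurate
        return 2
    if pct_accurate <= 75:   # band 3: 51%-75% accurate
        return 3
    if pct_accurate < 100:   # band 4: 76%-99% accurate
        return 4
    return 5                 # band 5: 100% accurate

print(accuracy_score(80))  # 4
```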
Reliability
Based on three publications in this review, reliability is defined as consistency between multiple evaluators or across similar conditions (i.e., reproducibility, consistency) or whether the information is based on reliable sources.10,28,33 Yeo et al. 10 evaluated reproducibility by entering each question into ChatGPT twice and having two independent hepatologist reviewers, with a senior hepatologist as a third reviewer, assess the similarity between the two responses. The two responses were graded separately and classified as significantly different if they fell into opposing groups (grades 1–2 vs. 3–4). Nikdel et al. 33 assessed the rate of agreement between two ophthalmologists on ratings of acceptable, incomplete, or unacceptable. Kim et al. 28 assessed the internal consistency (reliability) of ChatGPT-4 by comparing it with ChatGPT-3.5 and the official guide, “Epilepsy Patient and Caregiver Guide,” published annually by the Korean Epilepsy Society.
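The opposing-groups rule reported by Yeo et al. 10 reduces to a simple membership check. A minimal sketch, assuming the grades are the integers 1–4 from their 4-point scale:

```python
def significantly_different(grade_1: int, grade_2: int) -> bool:
    """Yeo et al. classified a pair of repeated responses as significantly
    different when the two grades fell into opposing groups (1-2 vs. 3-4)."""
    low_group = {1, 2}
    return (grade_1 in low_group) != (grade_2 in low_group)

print(significantly_different(2, 3))  # True: opposing groups
print(significantly_different(1, 2))  # False: same group
```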
Readability
Based on three publications, we define readability as how easy the information is to read or comprehend.30,32,34 Pradhan et al. 34 used three validated scoring systems: the Flesch Reading Ease Score (FRS), 39 the Flesch–Kincaid Grade Level Score (FKGL), 39 and the Simple Measure of Gobbledygook (SMOG). 40 Neo et al. 32 used a 3-point Likert-like rubric, and Loughran et al. 30 assessed readability but did not provide a detailed explanation of the measure(s).
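For reference, all three validated scores are computed from counts of sentences, words, and syllables. The sketch below implements the standard published formulas; the syllable counter is our own rough vowel-group heuristic, so a validated implementation (e.g., the textstat Python package) would be preferable in practice:

```python
import math
import re

def count_syllables(word: str) -> int:
    # Rough heuristic: count vowel groups, dropping a common silent final "e".
    groups = re.findall(r"[aeiouy]+", word.lower())
    n = len(groups)
    if word.lower().endswith("e") and n > 1:
        n -= 1
    return max(n, 1)

def readability_scores(text: str) -> dict:
    sentences = max(len(re.findall(r"[.!?]+", text)), 1)
    words = re.findall(r"[A-Za-z']+", text)
    n_words = max(len(words), 1)
    syllables = sum(count_syllables(w) for w in words)
    polysyllables = sum(1 for w in words if count_syllables(w) >= 3)
    return {
        # Flesch Reading Ease: higher = easier (60-70 ~ plain English)
        "FRS": 206.835 - 1.015 * (n_words / sentences) - 84.6 * (syllables / n_words),
        # Flesch-Kincaid Grade Level: approximate US school grade
        "FKGL": 0.39 * (n_words / sentences) + 11.8 * (syllables / n_words) - 15.59,
        # SMOG: grade level estimated from polysyllabic word density
        "SMOG": 1.0430 * math.sqrt(polysyllables * (30 / sentences)) + 3.1291,
    }

print(readability_scores("Take the tablet with food. Call your doctor if pain persists."))
```

On the FKGL, a score above 6 indicates text harder than the sixth-grade reading level that Pradhan et al. 34 treated as the desired threshold for patient-facing materials.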
Comprehensiveness
Based on three publications included in our literature review, comprehensiveness is defined as whether the information covers all necessary aspects.27,29,35 Lim et al. 29 assessed comprehensiveness using a 5-point scale, and Aguirre et al. 27 used a 4-point rating scale assessing whether ChatGPT's response was thorough and complete. Saeidnia et al. 35 evaluated the comprehensibility and completeness of ChatGPT-4 responses but provided no detailed information about a scale or rubric.
Findings derived from the application of LLM evaluation measures
Accuracy
Most publications reported that LLMs provided accurate information 70%–99% of the time.10,27–32,34,35 To elaborate on the accuracy rates across publications, Aguirre et al. 27 reported that most responses (93%) of ChatGPT-3.5 contained factual information. Pradhan et al. 34 demonstrated that most education materials generated by LLMs, including ChatGPT-4, DocsGPT, Google Bard, and Bing Chat, contained accurate medical information more than 76% of the time. Kim et al. 28 reported that 40 of 57 responses (70%) from ChatGPT-4 were classified as having sufficient educational value, with no entirely incorrect responses, and that ChatGPT-4 outperformed ChatGPT-3.5 (10 responses were better, 45 were similar, and 2 were worse; none was rated much better).
Yeo et al. 10 and McFayden et al., 31 using ChatGPT-3.5 and ChatGPT-4, respectively, demonstrated that ChatGPT achieved high accuracy in responding to questions related to cirrhosis/hepatocellular carcinoma and autism. However, both studies noted areas where ChatGPT responded incorrectly or provided outdated information. Specifically, in Yeo et al., 10 ChatGPT could not provide correct information regarding cutoffs for certain situations, such as liver stiffness measurements indicating the need for an upper endoscopy, the maximum recommended time window for performing an upper endoscopy, and the minimum antibiotic course duration for empiric gram-negative coverage, while McFayden et al. 31 noted the use of outdated information.
In terms of specific contents/domains, Yeo et al. 10 showed that ChatGPT-3.5 performed better in basic knowledge, lifestyle, and treatment than in diagnosis and preventive medicine. In the Lim et al. study, 29 ChatGPT-4 demonstrated better accuracy than ChatGPT-3.5 and Google Bard. That study showed that all three LLMs performed well in the “clinical presentation” and “prognosis” domains; in the “pathogenesis,” “risk factors,” and “diagnosis” domains, ChatGPT-3.5 and ChatGPT-4 performed well but Google Bard did not; and in the “treatment and prevention” domain, all LLMs performed poorly. In addition, all three LLMs showed substantial self-correction capabilities through further prompts. In contrast, Neo et al. 32 found that ChatGPT and Google Bard both provided readable responses with some general accuracy and were tied in accuracy.
Loughran et al. 30 convened a panel of subject matter experts and linguistic experts to verify the accuracy of ChatGPT responses and identified no accuracy issues in the edits ChatGPT made, indicating that using ChatGPT as an editorial tool to edit human-written content appears to be a safer way to avoid ChatGPT-introduced inaccuracies.
Saeidnia et al. 35 reported that both informal and formal caregivers expressed the need for further development through an interdisciplinary approach involving computer scientists, healthcare providers, and caregivers to improve the accuracy and reliability of ChatGPT responses.
Reliability
Yeo et al. 10 demonstrated high reproducibility, with 90% of ChatGPT-3.5 responses remaining consistent across repeated questions. The Nikdel et al. 33 study likewise presented high agreement between two ophthalmologists and high reproducibility of responses across different days. In the Kim et al. study, 28 no assessment measure specifically for reliability, distinct from the accuracy assessment, was explicitly stated.
Readability
Pradhan et al. 34 demonstrated that most materials generated by ChatGPT-4, DocsGPT, Google Bard, and Bing Chat showed similar readability but were above the desired sixth-grade reading level. However, Neo et al. 32 found the opposite, reporting that most responses from both ChatGPT-3.5 and Google Bard were relatively easy to understand, with ChatGPT earning slightly more satisfactory grades. The most significant difference between these two studies is that they used different measures: Neo et al. 32 used a 3-point Likert-like rubric, while Pradhan et al. 34 used the FRS, FKGL, and SMOG. On the other hand, Loughran et al. 30 valued ChatGPT highly as an editorial tool because it excelled at editing content to be easier to read and understand.
Comprehensiveness
In the Aguirre et al. study, 27 ChatGPT-3.5 received its lowest ratings in comprehensiveness, as assessed by three clinicians with more than 15 years of experience with patients with dementia and their caregivers. However, Lim et al. 29 concluded the opposite: all three LLM chatbots (ChatGPT-4, ChatGPT-3.5, and Google Bard) demonstrated high mean comprehensiveness scores when three ophthalmologists evaluated their responses to common myopia-related queries. Furthermore, informal caregivers appeared to hold more positive opinions of ChatGPT's responses than formal caregivers did. 35
A figure was constructed to visually summarize the LLM evaluation measures commonly addressed in this review (Figure 2). Since the scales of the measures varied across publications, we normalized all scales to a 0–100 scale for display. We selected the measures that were most commonly used and aligned with the focus of our review for visual representation rather than detailed analysis. Supplemental material 2 provides the underlying data and normalization process in three sheets: Sheet 1, “Raw Data,” presents the numbers reported in all publications; Sheet 2, “Normalization Process,” presents the 0–100 normalization of all publications; and Sheet 3, “Final Data,” presents the normalized values.
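Supplemental material 2 documents the exact procedure; assuming a standard linear min–max rescaling, the 0–100 normalization reduces to the following sketch:

```python
def to_0_100(score: float, scale_min: float, scale_max: float) -> float:
    """Linearly rescale a raw rubric score onto a 0-100 display scale
    (assumed min-max normalization; see Supplemental material 2)."""
    return (score - scale_min) / (scale_max - scale_min) * 100

# e.g., a grade of 4 on a 1-5 accuracy rubric maps to 75.0,
# and a 2 on a 1-3 (poor/borderline/good) scale maps to 50.0
print(to_0_100(4, 1, 5))  # 75.0
print(to_0_100(2, 1, 3))  # 50.0
```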

Mean score of each measure component across large language models.
Recent literature update from 2024 through 2025
Because the field of LLMs is rapidly evolving, a rapid review was conducted as of 6 November 2025. Of the 1557 publications extracted from the same four databases (PubMed, EMBASE, CINAHL, and PsycINFO), duplicates were removed, leaving 1157 publications. One author (HC) screened these based on titles/abstracts, and 43 publications were selected. Another author (SH) then screened the full-text publications, resulting in 14 final publications (Figure 3). This rapid review was not intended to be exhaustive or to identify the most commonly used evaluation measures, but rather to examine whether the recent literature continues to reflect ChatGPT-centrism or exhibits a more diverse range of models.

PRISMA flow diagram for the rapid review process (publications from 2024 to 2025).
Overall, the updated literature continued to demonstrate a strong emphasis on ChatGPT-based evaluations (Table 4).41–54 While a small number of studies referenced alternative AI-driven tools or emerging systems, these varied substantially in design and purpose and were not always directly comparable to general-purpose conversational LLMs. Examples included AI-assisted search (e.g., Google SGE, Microsoft Bing Chat), 50 medical literature synthesis platforms (e.g., OpenEvidence),44,47 and a caregiving domain-specific small language model prototype (e.g., Parmanto et al. 49 ), which differ conceptually from publicly deployed LLMs such as ChatGPT or DeepSeek. This pattern demonstrates that ChatGPT remains the dominant reference model in applied research, likely reflecting its accessibility and familiarity among researchers.
Recent literature update results (n = 14).
AI: artificial intelligence; LLM: large language model; AD: Alzheimer's disease; RAG: retrieval-augmented generation.
Discussion
Generative AI, especially LLMs, has driven a great deal of change in healthcare and is still advancing. 55 Our review contributes to the ongoing efforts to guide the meaningful use and development of LLMs by providing a comprehensive overview of the measures used to evaluate LLMs for caregivers. It identifies the four key components of evaluation measures most commonly addressed in the included studies: accuracy, reliability, readability, and comprehensiveness.
One thing to consider is that the components of evaluation measures can vary depending on the users. One study thoroughly assessed the relevance, evidence-based quality, and actionability of LLMs when used by healthcare professionals rather than family caregivers. 56 In that case, LLM quality was assessed in terms of whether clinical practice was justified or should be changed and whether the content was grounded in the medical literature, criteria that are prioritized by healthcare professionals rather than family caregivers. 56 Therefore, to establish a comprehensive LLM evaluation tool, the characteristics of the users should be considered.
The findings of this review inform future directions for advancing LLMs through conceptually essential measures for continuous quality improvement and for their use by family caregivers, such as answering caregivers’ questions, developing educational materials tailored to their needs, or editing content to enhance readability. One of the publications included in this review emphasized the benefits of LLMs in providing emotional support to patients and caregivers, 10 which has also been addressed in the broader literature. 57 Besides emotional support, providing detailed information, including medical, financial, and legal information, is of utmost importance for family caregivers.58,59 Given that healthcare professionals experience burden and difficulty in empathic communication and providing quality care due to lack of time and resource-limited settings,60–62 utilizing LLMs that support both patients and their family caregivers in independently finding information and receiving emotional support would ultimately also benefit healthcare professionals.
Because ChatGPT has been the most widely adopted LLM, it is also the most used in research. In terms of accuracy and reliability, the literature included in this review indicates that LLMs provided somewhat accurate information, but performance varies by domain. Another recently published study corroborates these findings, showing that ChatGPT-3.5 performs relatively accurately on basic medical knowledge but relatively poorly in specific areas, such as pharmacology, social welfare, law and regulations, endocrinology/metabolism, and dermatology. 63 Therefore, future research evaluating the accuracy of LLM responses should raise awareness of the need for caution in using LLMs for certain medical decision-making, such as diagnosis, treatment planning, medication management, and legal/financial decisions, and highlight the possible risks and harms that LLM use can exacerbate.
In addition, our review revealed that LLMs’ responses often require high educational and literacy levels, especially when the FRS, FKGL, and SMOG tools are used for readability evaluation. However, no research has confirmed whether these measures are an appropriate way to evaluate the readability of LLM responses. In the literature, readability for patients has been evaluated by recruiting and interviewing patients directly, underscoring the importance of assessing patients’ actual ability rather than retrospectively applying a simple readability assessment tool such as the FRS. 64 This highlights the need for further research to determine the ideal and most appropriate measures for assessing readability for family caregivers using LLMs; that is, researchers should carefully consider and choose the most appropriate evaluation methods and tools. In addition, as reliability and validity are critical when obtaining medical information, an ongoing commitment to improving key measures for evaluating LLMs is required. Future studies also need to consider whether these methods and tools should be modified and subjected to rigorous reliability and validity assessments, depending on the specific LLM. Furthermore, other studies highlight the need for clear criteria for the required number of prompts (i.e., iterations with effective prompts) and the ideal word count per prompt (i.e., ideal length) to yield better, more accurate output from LLMs.65,66 That is, family caregivers may need education on how to use LLMs more effectively (e.g., clearer prompts, the need for follow-up prompts, shorter question lengths). Ultimately, it is crucial to develop evaluation measures that focus on the most frequently identified components from this review (e.g., accuracy, reliability, readability, comprehensiveness) while incorporating considerations for use by family caregivers.
Importance of digital literacy among caregivers
Another important consideration is that effective support for family caregivers is often hindered by challenges with health literacy. Recent studies report low e-health literacy among patients and their family caregivers,67,68 potentially exacerbated by geographical disparities affecting rural populations. 67 This indicates that equitable digital interventions for patients and their family caregivers should evaluate users' literacy levels as a means to address geographical disparities. Although the National Institutes of Health and the American Medical Association both recommend that patient education materials be written at no higher than an eighth-grade reading level, the majority of patient education materials published in high-impact medical journals are written significantly above the recommended grade level. 69 Our review findings resonate strongly with this issue in the context of AI. While LLMs hold promise, several studies included in this review demonstrated their limitations, including outdated information and inappropriate educational or literacy levels. Nevertheless, LLMs have the potential to support family caregivers as a main source of valuable information because they can provide information at a specific literacy level on request through multiturn conversations (e.g., asking ChatGPT to rephrase at another literacy level). As demonstrated by Siddiqi et al., 68 caregivers expressed a desire for a chatbot that can promptly provide reliable, complete, and coherent information in multiple languages; the findings of this review may thus support the development of LLMs that consider readability and literacy to support caregivers. One exemplary case of practically utilizing LLMs in the real world is the use of LLM-based chatbots that can be utilized by healthcare professionals, patients, and their families. 70
Research and practice implications
Our findings indicate that LLM performance on the most commonly addressed measure components, including accuracy, reliability, readability, and comprehensiveness, still requires human oversight and conservative use among family caregivers. While LLMs can provide general health information, clarify basic medical concepts, and offer emotional support, they remain limited in their ability to guide medical diagnosis, interpret test results, or recommend treatments. Healthcare providers should emphasize that patients and their families contact them when they encounter immediate health problems to avoid delays in care. These limitations, however, do not imply that caregivers should be discouraged from using LLMs. Instead, LLMs should be positioned as supplementary, low-risk educational tools that support general education, lifestyle recommendations, caregiving resource navigation, and emotional support. They should not be used for high-risk areas involving clinical decision-making, diagnosis, test result interpretation, treatment recommendations, urgent care management, or medication management; such clinical and medical decisions should be made with healthcare providers.
To promote safer and more effective use, healthcare providers and researchers can incorporate LLM guidance into caregiver education. This includes teaching that LLM-generated information should not be treated as definitive clinical advice and that LLMs are most appropriate for exploring resources or receiving general explanations and emotional reassurance. Providers can also help family caregivers adopt a simple verification approach to safely use LLM-generated information: (1) reviewing the initial response for clarity, source references, and any red-flag warnings; (2) comparing the output with a second, independent LLM to reduce single-model omissions and to evaluate the consistency of key recommendations across models; and (3) confirming essential information through authoritative clinical resources or direct communication with healthcare professionals (e.g., messaging providers through a patient portal). Nonetheless, when LLMs provide conflicting answers, when anything remains unclear, or in urgent or high-risk situations regardless of model agreement, caregivers should contact healthcare providers immediately. Although this process cannot eliminate all LLM-generated errors, it reduces the risk of single-model omissions or hallucinations and makes system limitations more visible to the user.
A risk-based perspective is also essential. For low-risk informational tasks, such as understanding a diagnosis or learning lifestyle recommendations, LLMs can offer helpful, readable guidance. For moderate-risk situations, such as seeking advice about minor symptoms or routine daily care, caregivers should follow stricter safeguards, including clear warnings, consistent messages across repeated prompts, and explicit instructions on when to seek care. For high-risk scenarios, such as medication adjustments, clinical thresholds, or new severe symptoms, caregivers should not rely on LLMs. In these cases, direct contact with healthcare providers is required, and systems should be designed to gently redirect users to professional help.
Large language models cannot replace healthcare professionals. As the application of LLMs is still at an early stage, we recommend that researchers report findings on LLM use transparently and prioritize caregiver safety, not just efficiency and accuracy, which will guide the process of improving LLMs for family caregiver use. Relying on a single “accuracy” measure is also insufficient for evaluating LLM performance in caregiving contexts. A global accuracy score averages responses across questions with very different levels of clinical risk and can mask serious errors in high-stakes scenarios. An LLM may perform well on general, low-complexity questions while producing unsafe or misleading recommendations on clinically sensitive topics, and for caregivers, even a small number of errors in these areas can meaningfully increase risk. Evaluations must therefore move beyond aggregated accuracy metrics and incorporate scenario-specific assessments that examine safety, actionability, and variability across risk levels.
Limitations
This review has limitations. First, since our rapid review was intended to confirm ChatGPT-centrism, the main analysis identifying the most commonly used evaluation measures may not reflect literature published after July 2024. Because of extremely rapid change in language models driven by AI and automation, evidence is changing rapidly, and timeliness is key. For example, ChatGPT versions have changed, Google Bard changed to Gemini, and Bing AI changed to Copilot as of March 2025, which makes comparisons between LLMs challenging in this review. With all the competition among LLMs, applications of LLMs for caregivers and accompanying research projects need a rapid turnaround of evidence to stay abreast of evolving advancements. Second, this review included only English-language publications; we may have missed language models introduced in other languages from other tech-leading countries. Third, although we sought to capture all LLMs, as presented in Table 2 and Table 4, ChatGPT predominated across publications. We speculate that many healthcare researchers are more familiar with ChatGPT, or find it more accessible, than other LLMs. Even our additional rapid review of recent literature offered further evidence of ChatGPT-centrism. We believe that the findings of this review can make a valuable contribution by opening new directions: future studies would benefit from systematic cross-model evaluations using comparable benchmarks and clearly defined model categories, bringing more diverse LLMs into healthcare research and applications as LLMs continue to evolve. In addition, one of the included studies used the term accuracy interchangeably with comprehensiveness. In the Yeo et al. 10 study, when accuracy was evaluated using a 4-point rating scale, the response options were “comprehensive,” “correct but inadequate,” “mixed with correct and incorrect/outdated data,” and “completely incorrect.” That study treated “comprehensiveness” as implying some level of accuracy without providing detailed information, although comprehensiveness is distinct from accuracy by dictionary definition. Any areas of ambiguity, such as the meaning of each measure, were discussed by three authors (SH, HC, and YC) to ensure the quality and rigor of this review. We determined that Yeo et al. evaluated accuracy rather than comprehensiveness for two reasons: (1) their explicitly stated aim was to examine accuracy, and (2) three of the four response options are accuracy items (“correct but inadequate,” “mixed with correct and incorrect/outdated data,” “completely incorrect”). This suggests that future survey development should ensure construct consistency and highlights the importance of clearly defining each measure.
Conclusions
Large language models represent a significant breakthrough in health informatics and can swiftly provide human-like, practical, and informative responses to inquiries from family caregivers. The findings of this review provide insights into how to better evaluate and develop LLMs for caregivers. To support continuous quality improvement of LLMs through future research, we suggest these findings serve as a foundation for developing LLM evaluation measures that incorporate the following components: accuracy, reliability, readability, and comprehensiveness.
Supplemental Material
sj-docx-1-dhj-10.1177_20552076261425343 - Supplemental material for Exploring evaluation measures of large language models for family caregiver use: A scoping review
Supplemental Material
sj-xlsx-2-dhj-10.1177_20552076261425343 - Supplemental material for Exploring evaluation measures of large language models for family caregiver use: A scoping review
Acknowledgments
The authors would like to express gratitude for the support of librarians at Columbia University Health Sciences Library.
Ethical approval
Not applicable, because this study is a review of the literature and did not require research ethics approval or patient consent.
Contributorship
SH initiated the review study, established a search strategy, designed the study, and wrote the first draft of the manuscript. SH, HC, and YC contributed to data analysis. SH and GLA provided scientific oversight and supervision. All authors contributed to conceptualization, manuscript writing, and editing.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: Soojeong Han is a postdoctoral research fellow, supported by the National Institutes of Health, National Institute of Nursing Research – Reducing Health Disparities through Informatics training grant (T32 NR007969).
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Guarantor
SH.
Supplemental material
Supplemental material for this article is available online.
References
