Introduction
Large language models (LLMs), advanced artificial intelligence (AI) systems pretrained on vast text datasets and fine-tuned with human feedback to process natural language, can generate accurate text responses to medical inquiries, offering a convenient way for patients and physicians to access health information.1,2 OpenAI’s ChatGPT became the fastest-growing consumer application in history, amassing more than 400 million weekly active users by February 2025,3 allowing it to cater to a broad audience and handle diverse linguistic needs. Google’s Gemini and Moonshot AI’s Kimi are also gaining popularity for their strong information retrieval capabilities. In contrast to ChatGPT, Gemini is optimized primarily for English, with a corpus focused on English-language data for greater contextual precision, which limits its ability to handle other languages such as Chinese. Kimi, by contrast, is designed for Chinese-speaking users, with a specialized dataset tailored to Chinese discourse and cultural nuances; this monolingual focus yields better performance for Chinese-speaking populations. As people increasingly turn to LLMs to seek information, generate ideas, and enhance productivity,4,5 our research focused on assessing these three models—ChatGPT, Gemini, and Kimi.
Mild cognitive impairment (MCI) is characterized by cognitive decline that surpasses normal aging but is not severe enough for a dementia diagnosis, representing an intermediate stage between normal aging and dementia.6 The prevalence of MCI increases with age, affecting 15%–20% of individuals aged 65 or older.7 Although not all MCI cases progress to dementia, MCI increases the risk of developing Alzheimer’s disease.8 Early identification and management are crucial for delaying progression and improving quality of life.9 However, many older adults—particularly those in community or primary care settings—encounter barriers to timely diagnosis and intervention. Nonspecialist healthcare professionals (HPs) and care partners (CPs), often the first point of contact, frequently face challenges due to insufficient training, fragmented resources, and time constraints, contributing to delayed diagnoses and increased CP burden.10,11 In China, these challenges are compounded by a lack of routine cognitive screening, limited access to formal care, and inadequate CP education.12 Most individuals with MCI or dementia rely on informal home-based care, with formal services concentrated among higher-income groups or those with milder symptoms.13–15 The dementia care continuum framework advocates early detection, targeted intervention, and sustained support tailored to patients’ evolving cognitive, emotional, and social needs.16 In this context, LLMs present a promising, scalable solution. By providing accessible, evidence-based, and linguistically appropriate information, LLMs can enhance CP education,17 support nonspecialist healthcare providers, and reduce system burden—particularly in resource-constrained environments such as community-based aging care in China.
LLMs have rapidly expanded in medicine, enhancing insights into MCI and dementia care. Recent research has predominantly concentrated on leveraging spontaneous speech and clinical notes for diagnosis and symptom monitoring.18,19 MCI research has emphasized early detection and progression prediction to prevent or delay dementia onset.20–22 However, concerns about hallucination and reliability remain, emphasizing the need for careful validation in clinical contexts.23 Although early studies highlighted the potential of LLMs in areas like diagnosis and speech-based detection,24,25 most research has focused on narrow tasks or single-user evaluations,20,26–28 limiting understanding of their broader applicability across diverse patient populations. Furthermore, the current evidence is largely confined to English-language settings, with limited assessments in Chinese, Dutch, or Korean.29–32 This underscores the need for multilingual and multiperspective evaluations to ensure equitable and effective deployment in global healthcare.
Therefore, we had three aims for this study: (a) to evaluate the potential of LLMs to respond to questions related to MCI management for older adults; (b) to explore how the needs of nonspecialist HPs and CPs might be more effectively met by LLMs; and (c) to compare and assess differences between English and Chinese responses to MCI care queries. Based on this systematic and comprehensive evaluation, our study can catalyze further research into this underexplored yet critical area at the intersection of generative AI and MCI management (see Figure 1).
Figure 1. Overview of the LLM pipeline.
Methods
Ethical statement
This study received approval from the Tsinghua University Institutional Review Board (Protocol: THU01KS2025035). All participants provided informed consent through procedures reviewed and approved by the ethics committee. Data collection was strictly anonymous, with no personally identifiable information retained during analysis or storage.
Participant selection
The recruitment of nonspecialist HPs and CPs occurred between November 28 and December 3, 2024. All participants voluntarily participated and provided informed consent.
The recruitment of nonspecialist HPs was designed to ensure diversity in hospital types (tertiary hospitals and primary care facilities), clinical expertise, and work experience. Inclusion criteria were: (a) at least 2 years of clinical experience providing care or performing related duties in a relevant linguistic environment, specifically involving patients with cognitive decline or dementia symptoms; (b) current employment at a tertiary hospital or primary care facility; and (c) voluntary participation. This approach aimed to capture a wide range of perspectives from practitioners with direct clinical exposure to MCI or dementia, ensuring the findings reflect real-world challenges in managing cognitive impairments.
CPs were recruited to ensure diversity in age, gender, cultural background, and educational level, with all participants having direct caregiving experience. Inclusion criteria were: (a) at least 1 year of hands-on experience providing daily care for an individual with MCI; (b) sufficient language proficiency to understand and contribute to the study; (c) 18 years old or older, representing diverse caregiving contexts; and (d) voluntary participation. This approach aimed to capture a wide range of caregiving experiences relevant to MCI management. All characteristics are listed in Multimedia Appendix 2.
Question pool development
In this study, four primary domains related to MCI were identified for evaluation: (a) symptoms and diagnosis, (b) treatment and management strategies, (c) CP support and resources, and (d) nursing and rehabilitation. For each domain, 18 open-ended questions were carefully developed, spanning three levels of difficulty to ensure a comprehensive assessment—from basic knowledge to complex scenarios (see Multimedia Appendix 1 for the complete question set). During the initial design phase, extensive reference was made to authoritative publications from esteemed professional associations and organizations and established medical guidelines from China, the United Kingdom, and the United States, ensuring the relevance and accuracy of the question pool.33–39 These questions did not involve real case data.
LLMs and response generation
This study evaluated the effectiveness of three LLMs in responding to MCI-related questions: ChatGPT-4o (OpenAI), Gemini 2.0 Flash (Google DeepMind), and Kimi 1.18.1 (Moonshot AI). These LLMs were selected based on their prominence and availability in November 2024.
To ensure fair evaluation aligned with each model’s primary use case, ChatGPT-4o was tested in both Chinese and English, reflecting its multilingual adaptability. In contrast, Kimi and Gemini were tested in their specialized languages—Chinese and English, respectively—due to their focus on excelling in specific linguistic domains. All responses are listed in Multimedia Appendix 3.
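The text does not specify how responses were collected, so the following is an illustrative sketch only, assuming programmatic access to OpenAI’s public chat completions endpoint; the helper ask_llm, the OPENAI_API_KEY environment variable, and the single-question list are hypothetical, and the study may equally have used the models’ web interfaces. The sketch also anticipates the duplicate-submission consistency check described in the next subsection.

```r
# Illustrative sketch only: assumes API access to OpenAI's public chat
# completions endpoint; the study may have used web interfaces instead.
library(httr)
library(jsonlite)

ask_llm <- function(question, model = "gpt-4o",
                    api_key = Sys.getenv("OPENAI_API_KEY")) {
  resp <- POST(
    url = "https://api.openai.com/v1/chat/completions",
    add_headers(Authorization = paste("Bearer", api_key)),
    content_type_json(),
    body = toJSON(list(
      model = model,
      messages = list(list(role = "user", content = question))
    ), auto_unbox = TRUE)
  )
  content(resp)$choices[[1]]$message$content
}

# Submit each question twice consecutively, mirroring the consistency check.
questions <- c("What are the causes of mild cognitive impairment?")
responses <- lapply(questions, function(q) {
  list(first = ask_llm(q), second = ask_llm(q))
})
```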
Response evaluation
To minimize evaluator bias in assessing responses from LLMs, a partially repeated evaluation design was implemented. For each language, three HPs and two CPs independently evaluated the responses. These HPs and CPs had clinical and care experience in regions where the respective languages are predominantly spoken, ensuring familiarity with the cultural and linguistic nuances of the responses. Prior to the evaluation, the evaluation criteria and requirements were emphasized on the assessment homepage.
Using a random number generator, unique identifiers were assigned to 144 responses to maintain a double-blind design, ensuring that evaluators were unaware of the source LLM for each response. All evaluations were completed through an online scoring system. To ensure consistency, each question was submitted twice consecutively. If the LLM did not answer either submission, the question was excluded from further evaluation. The coded responses were scored based on a 5-point Likert scale (1 = strongly disagree to 5 = strongly agree) across four criteria: accuracy,40,41 comprehensibility,28 specificity,42 and actionability.43 Accuracy reflected the factual correctness of the response, ensuring that the information aligned with established clinical guidelines, evidence-based practices, or authoritative sources. Comprehensibility assessed how clear and understandable the response was for the target audience, including both HPs and lay CPs. Specificity evaluated the degree of detail and context specificity in the response, avoiding vague or generalized answers. Finally, actionability determined whether the response provided practical, clear, and implementable recommendations or guidance (Figure 2).
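A minimal sketch of this blinding step in R is shown below. The data frame layout, the even 36-per-condition split, and the fixed seed are illustrative assumptions; only the mechanism—random unique identifiers, with the linking key kept apart from the scoring sheet—reflects the procedure reported here.

```r
# Hypothetical layout: one row per response; the 36-per-condition split is an
# assumption made so the total matches the 144 responses reported above.
set.seed(2024)  # fixed seed for reproducibility of this example only

responses <- data.frame(
  condition = rep(c("ChatGPT-EN", "ChatGPT-CN", "Gemini", "Kimi"), each = 36),
  text = character(144),
  stringsAsFactors = FALSE
)

# Assign unique random identifiers so evaluators cannot infer the source LLM.
responses$blind_id <- sample(1:144, size = 144, replace = FALSE)

# The key linking blind IDs to conditions stays separate from the scoring
# sheet given to evaluators, preserving the double-blind design.
blinding_key  <- responses[, c("blind_id", "condition")]
scoring_sheet <- responses[order(responses$blind_id), c("blind_id", "text")]
```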
Before starting, all evaluators received a briefing and an opportunity for clarification from a single research team member (Y.X.), and the scoring criteria were reiterated on the evaluation homepage to ensure standardization.
Figure 2. Methodological framework for evaluating LLMs in MCI research.
Statistical analysis
Descriptive statistics, including means and standard deviations, were calculated for Likert scale ratings across all response categories. The reliability of evaluators was assessed using intraclass correlation coefficients (ICCs), calculated with a two-way mixed-effects model based on the mean of k raters. ICC values and 95% confidence intervals (CIs) were interpreted as follows: less than .50 indicated low reliability, .50–.74 moderate reliability, .75–.90 good reliability, and more than .90 excellent reliability.44
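For reference, a reliability analysis of this kind can be run with the irr package in R. Because the text does not state whether a consistency or an agreement definition was used, the type = "agreement" setting below is an assumption, and the ratings matrix is synthetic.

```r
# Hedged sketch: two-way mixed-effects ICC for the mean of k raters.
library(irr)

set.seed(1)
# Rows = responses, columns = raters (e.g., the three HPs for one language);
# synthetic Likert ratings stand in for the real evaluation data.
ratings <- matrix(sample(1:5, 60, replace = TRUE), ncol = 3)

# "agreement" vs. "consistency" is an assumption; unit = "average" gives the
# mean-of-k-raters ICC together with its 95% CI.
icc_result <- icc(ratings, model = "twoway", type = "agreement",
                  unit = "average")
print(icc_result)

# Interpretation thresholds (reference 44): < .50 low; .50-.74 moderate;
# .75-.90 good; > .90 excellent.
```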
Because a Shapiro–Wilk test confirmed that the data did not follow a normal distribution, nonparametric tests were used. The Kruskal–Wallis H test was used to analyze the performance of LLMs in the four domains of MCI management, and Mann–Whitney U tests were used for bilingual comparisons and for comparisons between HPs and CPs. A significance level of p < .05 was applied to all statistical tests, with Bonferroni correction to account for multiple comparisons. Statistical analysis was performed using R version 4.3.1.
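Under an assumed long-format data frame (all column names and values below are hypothetical), these tests map directly onto base R functions, as sketched here.

```r
# Synthetic long-format data standing in for the evaluation scores; the
# column names (rating, llm, language, rater_group) are assumptions.
set.seed(1)
scores <- data.frame(
  rating      = sample(1:5, 120, replace = TRUE),
  llm         = rep(c("ChatGPT-EN", "ChatGPT-CN", "Gemini", "Kimi"), 30),
  language    = rep(c("English", "Chinese"), 60),
  rater_group = rep(c("HP", "CP"), each = 60)
)

shapiro.test(scores$rating)                       # normality check

kruskal.test(rating ~ llm, data = scores)         # across LLM conditions

wilcox.test(rating ~ language, data = scores)     # Chinese vs. English
wilcox.test(rating ~ rater_group, data = scores)  # HPs vs. CPs

# Bonferroni correction applied across a family of comparisons
p_raw <- c(0.012, 0.034, 0.002, 0.047)  # placeholder p-values
p.adjust(p_raw, method = "bonferroni")
```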
Results
General findings
Figure 3 displays the response consistency rates of the four LLM conditions when responding to 72 questions, assessed through duplicate submissions. All conditions demonstrated high consistency, with response consistency rates exceeding 80%. To ensure the accuracy of the analysis, inconsistent responses were excluded from the data, as indicated in the Figure 3 note. The ICC values for HPs and CPs both exceeded .75, indicating good consistency in their ratings (see Table 1 and Table 2).
Figure 3. Evaluation of LLMs in MCI management. (a) LLM performance across categories; (b) consistency across the four LLM conditions; (c) comparison of Chinese versus English responses by evaluation criteria; and (d) comparison of HP and CP evaluations based on evaluation criteria. Abbreviations: ChatGPT-CN, Chinese version of ChatGPT; ChatGPT-EN, English version of ChatGPT; CP, care partner; HP, healthcare professional.
ICCs for HPs.
ICCs for CPs.
Descriptive statistics of evaluation scores across domains.
P-values were obtained using Kruskal–Wallis H tests to analyze the performance of LLMs across the four domains of MCI management. Abbreviations: CP, care partner.
Overview of LLM performance across key evaluation criteria.
P-values were obtained using Kruskal–Wallis H tests for comparisons across LLMs. Mann–Whitney U tests were used for bilingual comparisons and HP–CP group differences. Abbreviations: ChatGPT-CN, Chinese version of ChatGPT; ChatGPT-EN, English version of ChatGPT; CP, care partner; HP, healthcare professional.
Symptoms and diagnosis
LLM performance across key evaluation criteria in symptoms and diagnosis.
P-values were obtained using Kruskal–Wallis H tests for comparisons across LLMs. Mann–Whitney U tests were used for bilingual comparisons and HP–CP group differences. Abbreviations: ChatGPT-CN, Chinese version of ChatGPT; ChatGPT-EN, English version of ChatGPT; CP, care partner; HP, healthcare professional.
For example, responses to the question “What are the causes of mild cognitive impairment?” varied across LLMs. ChatGPT-EN provided a detailed and structured answer, covering neurological, vascular, psychiatric, lifestyle, and genetic factors. Gemini focused on treatable causes and risk factors but lacked detail on neurological and vascular factors. Kimi emphasized Alzheimer’s disease, vascular issues, and psychiatric conditions but missed lifestyle and genetic factors. ChatGPT-CN mirrored the English response in Chinese, offering a similar level of detail. Overall, ChatGPT-EN performed best, providing the most comprehensive and multifactorial explanation of MCI causes.
Treatment options and management strategies
LLM performance across key evaluation criteria in treatment options and management strategies.
P-values were obtained using Kruskal–Wallis H tests for comparisons across LLMs. Mann–Whitney U tests were used for bilingual comparisons and HP–CP group differences. Abbreviations: ChatGPT-CN, Chinese version of ChatGPT; ChatGPT-EN, English version of ChatGPT; CP, care partner; HP, healthcare professional.
In response to a patient reporting severe headaches on donanemab, a treatment for early-stage Alzheimer’s disease, ChatGPT-EN provided the most comprehensive guidance, detailing symptom evaluation, timing relative to treatment, management strategies, and the importance of consulting the clinical trial team. ChatGPT-CN offered a similar but less detailed response, particularly lacking nuanced recommendations on imaging and follow-up. Kimi provided a more basic symptom assessment, omitting references to trial protocols and complications associated with donanemab. Gemini also suggested general evaluations such as medical history review and imaging but did not address critical concerns like amyloid-related imaging abnormalities or infusion-related reactions.
CP support and resources
The lower scores in this domain (Table 1) can be attributed to several key factors. Responses were often too general, lacking depth and specificity in addressing the unique needs of CPs. Many suggestions were broad and did not provide clear, actionable steps. Additionally, the emotional and psychological challenges of caregiving were not fully acknowledged, leading to responses that may have felt impersonal or insufficient. More personalized, practical advice and greater empathy would improve the relevance and effectiveness of these responses.
LLM performance across key evaluation criteria in CP support and resources.
P-values were obtained using Kruskal–Wallis H tests for comparisons across LLMs. Mann–Whitney U tests were used for bilingual comparisons and HP–CP group differences. Abbreviations: ChatGPT-CN, Chinese version of ChatGPT; ChatGPT-EN, English version of ChatGPT; CP, care partner; HP, healthcare professional.
Nursing and rehabilitation
LLM performance across key evaluation criteria in nursing and rehabilitation.
P-values were obtained using Kruskal–Wallis H tests for comparisons across LLMs. Mann–Whitney U tests were used for bilingual comparisons and HP–CP group differences. Abbreviations: ChatGPT-CN, Chinese version of ChatGPT; ChatGPT-EN, English version of ChatGPT; CP, care partner; HP, healthcare professional.
Discussion
Principal results
We evaluated the performance of three LLMs (ChatGPT, Gemini, and Kimi) by analyzing their responses to MCI-related questions in both Chinese and English, as assessed by nonspecialist HPs and CPs. A rigorous study design incorporating appropriate masking, randomization, and dual independent reviews helped ensure the robustness and integrity of our evaluation process. Our findings suggest that LLMs—particularly ChatGPT-4o—have strong potential to generate accurate and comprehensive responses to MCI-related queries.
Notably, the inclusion of CPs as evaluators provided a distinctive contribution. As individuals who engage extensively with patients on a daily basis, CPs offer unique, experience-based insights that complement those of HPs. Their close observation and firsthand knowledge of patients’ behavioral and emotional changes bring an essential real-world perspective to the assessment of LLMs in chronic disease management. By incorporating CPs into our evaluation design, we aimed to not only reflect the real-world complexity of MCI care but also inform the development of vertical, user-sensitive LLMs tailored to the diverse needs of stakeholders in long-term, community-based care settings. English responses outperformed Chinese in comprehensibility and specificity, highlighting the need for language-tailored optimization as an important direction for LLM development.
LLM performance in MCI healthcare pathways
LLMs showed significant potential, particularly in the symptoms and diagnosis domain of the MCI healthcare pathway, due to their comprehensive and specialized training in clinical and diagnostic content, which aligns with the current medical emphasis on clinical care.45 Our study found that LLMs provided accurate, detailed, and actionable information about MCI symptoms, causes, and diagnostic processes, effectively supporting HPs. However, their performance was weaker in the CP support and resources domain, because CPs need not only factual information but also emotional support and practical guidance.46 This supports prior research that highlighted the lack of focus on CP support.47 LLM responses tended to be too general,48 lacking the personalized depth necessary to address the unique challenges of CPs. As noted in previous studies, lack of comprehensiveness was prevalent in nearly 90% of the research analyzed, with LLM outputs often being incomplete or overly general, particularly regarding medical tasks.49 This gap in LLM performance may inadvertently amplify issues in healthcare—such as an overemphasis on treatment at the expense of prevention and CP support50—by providing users with seemingly accurate but ultimately imprecise or generalized insights. Therefore, caution is needed when relying on LLMs in these domains.
This study highlights how LLMs can strengthen the dementia care continuum, particularly in managing MCI among older adults. By delivering timely, accessible, and accurate information, LLMs can support early detection and intervention—key pillars often challenged by low diagnosis rates and limited professional training among nonspecialist HPs.51 Beyond initial diagnosis, LLMs also can offer ongoing guidance on cognitive monitoring, lifestyle adjustments, and CP education, helping sustain the well-being of aging individuals.52 By improving the efficiency and reach of care, LLMs could ease healthcare burdens and make the continuum more inclusive and sustainable for an aging population.53 Our findings reinforce the value of integrating LLMs into dementia care to enhance quality and optimize processes for older adults with MCI (Figure 4).
In addition to their clinical strengths, LLMs may face limitations in nonmedical domains related to MCI care, particularly in providing emotional support for caregivers. The complexity of human emotions and caregiving dynamics makes it difficult for LLMs to offer truly empathetic and personalized responses.54
Our study found that CPs perceived LLMs as less effective in providing actionable and empathetic support, which is crucial in managing the emotional and practical challenges of caregiving. This highlights a critical gap in LLM performance, because CPs need not only factual information but also emotional support and tailored guidance. Development of LLMs could benefit from incorporating personalization features, such as adapting responses to the unique caregiving situations and emotional needs of individuals, ensuring more empathetic and contextually appropriate interactions.17
Figure 4. Pathways for patients in the MCI healthcare system empowered by LLMs.
Diverging perceptions between HPs and CPs
An interesting pattern emerged in this study: CPs—who typically lack formal medical training—assigned slightly higher scores for accuracy than HPs. However, HPs rated responses significantly higher than CPs in comprehensibility and actionability. In contrast, specificity ratings were higher for CPs, though this difference was not statistically significant. These findings suggest that CPs prioritize accuracy but may face challenges in understanding and applying detailed information, whereas HPs, drawing on their professional experience and training, place greater value on comprehensibility and actionability.
This suggests that CPs, who are generally focused on practical relevance and everyday caregiving tasks, prioritize accuracy based on how well the information aligns with their experience, rather than technical precision. Meanwhile, due to their clinical background, HPs emphasize the clarity, detail, and applicability of the information provided. These findings indicate that users evaluate LLM-generated content through the lens of their knowledge and caregiving role. This is consistent with previous research showing that CPs may prioritize relevance over technical accuracy, particularly in nonmedical domains.55,56
These differences underscore the necessity for LLMs to be designed with audience-specific customization. CPs and patients require content that is not only factually accurate but also accessible and relevant to their caregiving needs. In contrast, HPs would benefit from LLM outputs that emphasize clinical precision, specificity, and actionability. Therefore, when integrating LLMs into healthcare contexts, it is crucial that model development involves tailored outputs by user type. Moreover, the use of LLMs in medical decision-making must be closely supervised by healthcare providers to ensure ethical standards and patient safety. LLMs should serve as supplementary tools, not replacements for professional judgment. Healthcare providers remain responsible for patient care, because LLM hallucination may generate incorrect or misleading information, risking misdiagnosis or inappropriate treatment. Therefore, robust monitoring frameworks are essential to detect and correct inaccuracies, especially in high-stakes settings.
Therefore, HPs must be trained to recognize the limitations of AI-generated content to prevent overreliance. Clear guidelines are needed for integrating AI into clinical workflows without compromising patient safety. 57 Research should focus on best practices for LLM integration, addressing both benefits and ethical concerns to minimize errors and improve human–AI collaboration in healthcare.
Impact of linguistic and cultural contexts on LLM performance
This study’s innovation lies in its comparative evaluation of LLM performance in English and Chinese, revealing the significant influence of linguistic and cultural contexts on the effectiveness of AI tools in healthcare. Notably, the most pronounced differences between the two languages occurred in comprehensibility and specificity, reflecting not only technical discrepancies but also deeper cultural and communicative distinctions, which aligns with the latest research.58 Previous research has primarily used language-specific knowledge bases, which may have overlooked the nuanced contextual processing capabilities of LLMs across linguistic environments.30,32
High-quality medical literature is predominantly published in English, contributing to more structured, standardized, and clinically annotated training corpora. In contrast, Chinese medical corpora are often less formalized and lower in domain-specific consistency, which may adversely affect the performance of LLMs in clinical reasoning and content generation. Furthermore, the Chinese language often favors indirect or general expressions, especially in the context of stigmatized conditions such as MCI and dementia, for which cultural norms may discourage open discussion.
These findings highlight the need for linguistic and cultural alignment in LLM development, particularly in languages other than English. Linguistic structure, semantic conventions, and cultural communication norms shape not only how models interpret and present information but also how users perceive its accuracy and relevance.59 To realize the potential of LLMs in global healthcare delivery, particularly for aging populations, developers must integrate culturally appropriate language patterns, clinical norms, and user expectations into model design. Building on these insights, we propose that integrating LLMs with specialized knowledge bases, region-specific clinical guidelines, and population-level patient profiles can substantially improve performance in both specificity and actionability. Such integration would enable LLMs to deliver more precise, context-aware content to HPs while also offering clear, accessible, and culturally sensitive guidance to CPs. Addressing language-based disparities could not only enhance clinical applicability but also bridge knowledge gaps, reduce inequities, and advance equitable, evidence-based MCI care across diverse healthcare systems.
Limitations
The study’s findings should be interpreted with caution for several reasons. First, the sample size of CPs and HPs was relatively small, which may limit the robustness and generalizability of the results. In addition, the question pool used for evaluation was not pretested or validated prior to deployment, which may have affected the reliability and interpretability of the scoring outcomes. Second, the evaluation relied on single-turn, standardized prompts rather than dynamic, multiturn clinical dialogues, potentially overstating model performance in real-world use. Third, filtering out incomplete or inconsistent responses, although improving methodological rigor, may have introduced selection bias by omitting prompts that could expose model limitations. Fourth, language coverage was asymmetric: ChatGPT-4o was evaluated in both English and Chinese, whereas Gemini and Kimi were tested in one language each, constraining direct model-to-model comparisons. Fifth, subjective 5-point Likert ratings may be prone to evaluator bias; incorporating objective endpoints in future assessments would improve validity. Sixth, model performance may evolve over time, and our evaluation reflects only the version status as of November and December 2024. Finally, only three general-purpose LLMs were tested using a single MCI-related question set, limiting the applicability of findings to other models or specialized clinical scenarios in MCI care. These limitations suggest that researchers should prioritize larger and more diverse samples, validated question sets, real-world dialogue formats, balanced multilingual designs, and objective clinical outcomes.
These considerations underscore the need for research in broader and more varied contexts, incorporating patient-facing dialogues and real-time clinical workflows. Studies should include larger samples and more sociodemographically diverse groups of CPs and HPs to ensure greater representativeness. In addition, evaluating LLM performance across a wider range of input languages would better simulate real-world scenarios and provide a more comprehensive understanding of model robustness in multilingual healthcare environments.
Conclusions
This study provided an early role-sensitive evaluation of LLMs in the context of MCI management. By assessing ChatGPT, Gemini, and Kimi across four core domains—symptoms and diagnosis, treatment strategies, CP support, and rehabilitation—we demonstrated the potential of LLMs to deliver accurate, actionable, and comprehensible information, particularly in diagnostic areas. However, notable differences emerged based on language, user group, and model, with English responses generally outperforming Chinese responses in specificity and comprehensibility and HPs and CPs showing distinct evaluative patterns shaped by their expertise and experience.
These findings highlight the promise of LLMs in supporting dementia care pathways while also emphasizing the need for culturally and linguistically tailored development. Future research should include expanded evaluation populations, incorporate more realistic interaction settings, and ensure model outputs align with the practical needs of both professional and nonprofessional users. With thoughtful refinement, LLMs can be positioned as accessible, scalable tools to support early detection and ongoing management of cognitive impairment in aging populations.
Supplemental Material
Supplemental Material - Evaluating large language models for mild cognitive impairment among older adults: A bilingual comparison of ChatGPT, Gemini, and Kimi
Supplemental Material for Evaluating large language models for mild cognitive impairment among older adults: A bilingual comparison of ChatGPT, Gemini, and Kimi by Yexuan Xiao, Qianhui Pan, Haoyuan Liu, Yilin He, Yuhe Zhang, Nan Jiang in Health Informatics Journal
Authors’ contributions
Y.X. and Q.H. contributed to the conceptualization, literature review, data analysis, and wrote the original manuscript. N.J. is the corresponding author, contributed to the conception and suggestions of this study, and supervised the entire research process. H.Y. contributed to the literature review, LLM evaluation, and data analysis. Y.L. participated in the LLM evaluation and data analysis. Y.H. contributed to the creation of figures and tables. All authors read and approved the final manuscript.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Natural Science Foundation of China (72495135), the Ministry of Education, PRC (171-BHZX), and the Tsinghua University Seed Grant (53331100125). The funders had no role in study design, data collection and analysis, preparation of the manuscript, or the decision to submit for publication.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
The datasets used and/or analyzed during the current study are available from the corresponding author on reasonable request.