Abstract

Objective

To compare the performance of ChatGPT-5.0, DeepSeek-R1, and Gemini-2.5 Pro in real-world outpatient prescription counseling and evaluate their applicability across clinical contexts.

Methods

Fifty authentic prescriptions from four departments were submitted to the three models using standardized Chinese prompts. Responses were independently rated by three associate chief pharmacists across five dimensions—accuracy, relevance, clarity, practicality, and completeness—on a 5-point Likert scale. Rank-based non-parametric tests were applied for overall and subgroup analyses.

Results

Significant inter-model differences were observed in all five dimensions in the overall analysis (P < 0.05). DeepSeek had the highest mean ranks in clarity and practicality, ChatGPT had the highest mean rank in accuracy, and the two models were comparable in relevance and completeness; Gemini ranked lowest in every dimension. Department-specific analyses revealed distinct contextual advantages. All models exhibited high response stability.

Conclusions

LLMs demonstrate promising yet heterogeneous performance in outpatient medication counseling. DeepSeek and ChatGPT showed superior overall quality, supporting their potential as assistive “AI pharmacists” under professional supervision. However, several limitations should be acknowledged, including a modest sample size, reliance on expert evaluation rather than patient feedback, and context-specific findings that may limit generalizability.

Keywords

large language models, outpatient prescriptions, medication counseling, pharmacy services, artificial intelligence

1. Introduction

Outpatient medication counseling represents a critical component of clinical pharmacy practice, directly influencing patients’ adherence, understanding of therapy, and medication safety.^1,2 In real-world outpatient settings, pharmacists are frequently required to perform prescription review, dispensing, and patient education under strict time constraints. With the growing complexity of therapeutic regimens and the increasing prevalence of chronic diseases and polypharmacy, providing individualized, clear, and complete medication guidance for every patient has become increasingly difficult.³ This tension between workload and service quality has prompted exploration of intelligent tools that can extend pharmacists’ capacity without compromising safety or professionalism.

In parallel, large language models (LLMs) have rapidly advanced in their ability to process natural language and to generate contextually coherent responses. Among currently available large language models, ChatGPT, DeepSeek, and Gemini represent three prominent systems with distinct development paradigms and application focuses. ChatGPT, developed by OpenAI, is based on the Generative Pre-trained Transformer architecture and has demonstrated strong performance in medical question answering, clinical reasoning, and patient-oriented communication due to its extensive multilingual training corpus and alignment optimization.⁴ DeepSeek, developed in China, has shown advantages in structured reasoning and Chinese-language semantic processing, with particular strength in generating coherent and context-aware responses in clinical and technical domains.⁵ Gemini, developed by Google DeepMind, integrates multimodal capabilities and real-time information retrieval, enabling dynamic interaction with up-to-date knowledge sources⁶; however, its performance in domain-specific medical tasks has shown greater variability across studies. Their applications in medicine—ranging from clinical decision support and patient education to information retrieval—have attracted widespread attention.^7,8 Compared with traditional rule-based or template-driven systems, LLMs demonstrate greater flexibility in understanding colloquial expressions, integrating background context, and producing fluent human-like explanations.⁹ These features position them as potential assistants in patient counseling tasks, particularly in environments where pharmacist resources are limited. 
Recent studies have further demonstrated that large language models can assist in generating patient-specific clinical guidance and treatment plans with moderate agreement with expert recommendations, particularly in rehabilitation and patient communication contexts.¹⁰ However, despite their promising capabilities, these models may exhibit limitations in detailed clinical reasoning and require professional oversight to ensure safety and accuracy.⁶

Nevertheless, the clinical use of LLM-generated information presents both opportunities and risks. Although their linguistic fluency is impressive, their factual accuracy and contextual appropriateness are not guaranteed. Differences in training corpora, alignment strategies, and update cycles lead to substantial variation across models.¹¹ Inconsistent performance raises concerns regarding reliability, safety, and trust, especially in medication-related communication, where errors may directly endanger patient outcomes.^12–14 Therefore, systematic and quantitative evaluations of different LLMs in authentic pharmaceutical service scenarios are essential to clarify their strengths, limitations, and boundaries of use.¹⁵

Previous studies have attempted to assess the utility of individual models for drug counseling or medical question answering. However, most existing work is confined to English-language datasets or standardized test questions, and few have incorporated real Chinese outpatient prescription contexts.^16–18 Moreover, prior research has largely emphasized technical correctness—such as accuracy or medical validity—while neglecting patient-centered aspects including clarity, practicality, and informational completeness.¹⁹ Comparative studies across different departments or disease types are also rare, and little evidence exists on the stability and reproducibility of model outputs, an issue that is crucial for clinical reliability and patient trust.

To address these gaps, the present study conducted a multidimensional comparison of three representative LLMs—ChatGPT-5.0, DeepSeek-R1, and Gemini-2.5 Pro—using real outpatient prescriptions from diverse clinical departments. By assessing model-generated counseling responses across five key dimensions (accuracy, relevance, clarity, practicality, and completeness), this study aims to elucidate performance variability among models and clinical contexts, providing empirical evidence for their potential integration into human–AI collaborative pharmaceutical services.

2. Methods

2.1. Study design

This study employed a cross-sectional comparative design to evaluate the performance of three large language models—ChatGPT-5.0, DeepSeek-R1, and Gemini-2.5 Pro—in outpatient medication counseling. The research process included four key stages: (1) prescription selection, (2) standardized model prompting and response collection, (3) expert evaluation of LLM outputs, and (4) statistical comparison across models and clinical contexts. The overall workflow is illustrated in Figure 1.

Figure 1.

Flowchart of the study design.

Fifty authentic outpatient prescriptions were randomly selected from the hospital’s information system, representing four departments: chronic disease (n = 17), emergency (n = 14), obstetrics, gynecology, and pediatrics (n = 10), and oncology (n = 9). Only prescriptions containing clear medication instructions and commonly used therapeutic categories were included. Table 1 lists the prescribing department and medication categories for each prescription; detailed drug categories are provided in Appendix A.

Table 1.

Distribution of prescribing departments and categories of medications.

Number	Prescription category	Medication category 1	Medication category 2	Medication category 3
1	Obstetrics, Gynecology, and Pediatrics	Central Nervous System Stimulants
2	Obstetrics, Gynecology, and Pediatrics	Glucocorticoids	Antihistamines
3	Obstetrics, Gynecology, and Pediatrics	Antiviral Drugs
4	Obstetrics, Gynecology, and Pediatrics	Adsorbents	Probiotic Preparations
5	Obstetrics, Gynecology, and Pediatrics	Antibiotics	Antibiotics
6	Obstetrics, Gynecology, and Pediatrics	Iron Supplements	Vitamins
7	Obstetrics, Gynecology, and Pediatrics	Estrogens
8	Obstetrics, Gynecology, and Pediatrics	Gonadotropin-Releasing Hormone
9	Obstetrics, Gynecology, and Pediatrics	Progestins	Estrogens
10	Obstetrics, Gynecology, and Pediatrics	Estrogen Receptor Modulators
11	Emergency	Glucocorticoids	Antibiotics
12	Emergency	Expectorants	Antibiotics
13	Emergency	Antihistamines	Antipyretics and Analgesics
14	Emergency	Antibiotics	Cough Suppressants
15	Emergency	Antifibrotic Drugs
16	Emergency	Antibiotics
17	Emergency	Antibiotics
18	Emergency	Antiepileptic Drugs
19	Emergency	Antibiotics	Immunosuppressants
20	Emergency	Antibiotics	Probiotics	Proton Pump Inhibitors
21	Emergency	Osmotic Laxatives
22	Emergency	Hepatoprotective Drugs	Hepatoprotective Drugs
23	Emergency	Antibiotics	Electrolyte Supplements
24	Emergency	H2 Receptor Antagonists	Gastric Mucosal Protectants
25	Chronic Disease	Antidiabetic Drugs	Bisphosphonates
26	Chronic Disease	Antidiabetic Drugs
27	Chronic Disease	Antithyroid Drugs	Beta Blockers
28	Chronic Disease	Thyroid Hormones	Vitamins
29	Chronic Disease	Lipase Inhibitors	Uric Acid Excretion Promoters
30	Chronic Disease	Dopamine Agonists
31	Chronic Disease	Antiepileptic Drugs
32	Chronic Disease	Acetylcholinesterase Inhibitors	NMDA Receptor Antagonists
33	Chronic Disease	Antidepressants	Benzodiazepines	Calcium Channel Blockers
34	Chronic Disease	Alpha-Keto Analogues	Iron Supplements
35	Chronic Disease	Xanthine Oxidase Inhibitors
36	Chronic Disease	Angiotensin II Receptor Antagonists	Glucocorticoids
37	Chronic Disease	Angiotensin II Receptor Antagonists	Electrolyte Supplements
38	Chronic Disease	Alpha-1 Receptor Blockers	Calcium Channel Blockers
39	Chronic Disease	Nitrate Drugs
40	Chronic Disease	Statins	Antioxidants
41	Chronic Disease	Anticoagulants
42	Oncology	Anticancer Drugs	Anticancer Drugs
43	Oncology	Anticancer Drugs
44	Oncology	Anticancer Drugs
45	Oncology	Glucocorticoids	Vitamins
46	Oncology	Anticancer Drugs	Gastrointestinal Motility Agents
47	Oncology	Anticancer Drugs
48	Oncology	Anticancer Drugs	Antidiabetic Drugs
49	Oncology	Opioid Analgesics
50	Oncology	Anticancer Drugs

2.2. Model query and response collection

From September 1 to 3, 2025, each prescription was submitted independently to three large language models via their official web interfaces: ChatGPT-5.0 (OpenAI, USA), DeepSeek-R1 (DeepSeek Inc., China), and Gemini-2.5 Pro (Google DeepMind, USA). All interactions were conducted through publicly accessible web-based platforms rather than application programming interfaces (APIs). To minimize bias and carryover effects, a new conversation was initiated for every input. The standardized Chinese prompt was as follows:

“You are a senior clinical pharmacist at a tertiary hospital with ten years of experience in patient counseling. Please provide medication instructions for this outpatient prescription.”

Each prescription was queried once per day for three consecutive days under identical conditions, yielding 150 response sets (450 individual responses in total). The first-day responses were used for the primary analysis, and the responses from all three days were used to assess response stability. Model settings, including temperature and conversation reset, followed the default configuration of each platform, and no additional parameter tuning was applied.

Because large language models are continuously updated and do not provide fixed version identifiers or static model snapshots, exact reproducibility cannot be fully guaranteed. Although all interactions were conducted within a defined time window using standardized prompts and independent sessions, several sources of variability may still affect the outputs. First, inherent stochasticity in model generation may lead to response variability even under identical inputs. Second, the use of publicly accessible web interfaces, rather than controlled API environments, may introduce additional variability due to backend updates and undisclosed system-level optimizations. Third, model performance may change over time as providers continuously update training data and alignment strategies. To mitigate these limitations, we implemented strict standardization procedures, including identical prompts, independent sessions, and repeated queries across three consecutive days. Nevertheless, the findings should be interpreted as representative of model performance within the specific evaluation window rather than as fully reproducible results across different time points or system versions.

2.3. Expert evaluation and scoring framework

Three associate chief pharmacists, each with more than 10 years of clinical pharmacy experience, independently evaluated the LLM-generated counseling responses. All evaluators were affiliated with the Department of Pharmacy, The First Affiliated Hospital of Guangxi Medical University. They were not involved in the study design, data collection, statistical analysis, or manuscript preparation, and were invited solely for independent evaluation.

To minimize potential evaluation bias, several safeguards were implemented. First, all model outputs were anonymized, and any identifiers related to model origin (e.g., model name, interface characteristics) were removed before scoring, ensuring that evaluators were blinded to model identity. Second, the order of responses was randomized prior to assessment to prevent sequence effects. Third, evaluators conducted scoring independently without communication with each other during the evaluation process.

All evaluators declared no financial or non-financial conflicts of interest related to this study.

A five-dimension, five-point Likert scale was used, encompassing:

Accuracy – consistency with current medical evidence and pharmaceutical guidelines;

Relevance – alignment with the patient’s prescription question;

Clarity – linguistic comprehensibility and logical organization;

Practicality – usability and applicability to patient education;

Completeness – inclusion of all essential medication details.

Scores ranged from 1 (seriously inadequate) to 5 (fully compliant). Before scoring, evaluators participated in a calibration session to standardize judgment criteria. For each prescription and evaluation dimension, the final score was calculated as the mean of the three raters’ scores. Inter-rater reliability was assessed using the intraclass correlation coefficient (ICC) based on a two-way random-effects model with absolute agreement.
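The scoring arithmetic described above can be sketched in a few lines of Python. This is an illustrative sketch, not the authors' analysis script: the function name `icc_two_way_random` and the example ratings are hypothetical, and because the text does not state whether the single-measures or average-measures coefficient was reported, the sketch returns both ICC(2,1) and ICC(2,k) for a two-way random-effects model with absolute agreement, together with the per-response mean of the raters' scores.

```python
from statistics import mean

def icc_two_way_random(scores):
    """Return (ICC(2,1), ICC(2,k), per-response rater means).

    `scores` is a list of rows, one per rated response; each row holds
    one score per rater. Two-way random effects, absolute agreement.
    """
    n = len(scores)          # number of rated responses
    k = len(scores[0])       # number of raters
    grand = mean(v for row in scores for v in row)
    row_means = [mean(row) for row in scores]
    col_means = [mean(row[j] for row in scores) for j in range(k)]

    # Sums of squares from the two-way ANOVA decomposition.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((v - grand) ** 2 for row in scores for v in row)
    ss_error = ss_total - ss_rows - ss_cols

    ms_rows = ss_rows / (n - 1)
    ms_cols = ss_cols / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    single = (ms_rows - ms_error) / (
        ms_rows + (k - 1) * ms_error + k * (ms_cols - ms_error) / n
    )
    average = (ms_rows - ms_error) / (ms_rows + (ms_cols - ms_error) / n)
    return single, average, row_means

# Hypothetical 1-5 Likert scores (not study data):
# rows are responses, columns are the three raters.
ratings = [
    [5, 4, 5],
    [3, 3, 4],
    [4, 4, 4],
    [2, 3, 2],
    [5, 5, 4],
]
single, average, final_scores = icc_two_way_random(ratings)
print(f"ICC(2,1) = {single:.2f}, ICC(2,k) = {average:.2f}")
print("Final scores (rater means):", [round(s, 2) for s in final_scores])
```

The third return value corresponds to the final score per prescription and dimension, that is, the mean of the three raters' scores.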

For stability analysis, responses generated on three different days were compared for textual similarity. Outputs with ≥75% overlapping content were classified as “essentially identical,” 50–74% as “partially consistent,” and <50% as “inconsistent.”²⁰
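A minimal sketch of the three-level stability classification is given below. It only encodes the thresholds stated above; how the overlap percentage itself was obtained is not specified in enough detail to reproduce, so the input values, the function names `classify_stability` and `summarize`, and the handling of values between 74% and 75% (treated here as "partially consistent") are assumptions made for illustration.

```python
from collections import Counter

def classify_stability(overlap_pct):
    """Map a percentage of overlapping content to one of the three labels."""
    if overlap_pct >= 75:
        return "essentially identical"
    if overlap_pct >= 50:
        return "partially consistent"
    return "inconsistent"

def summarize(overlaps):
    """Return {label: (count, percentage of all prescriptions)}."""
    counts = Counter(classify_stability(p) for p in overlaps)
    total = len(overlaps)
    return {label: (n, round(100 * n / total, 1)) for label, n in counts.items()}

# Hypothetical overlap percentages for five prescriptions (not study data).
example = [92, 81, 77, 63, 40]
print(summarize(example))
```

Applied per model to all 50 prescriptions, a summary of this kind yields counts and percentages in the layout used for the stability table.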

2.4. Statistical analysis

All quantitative data were managed in Excel 2019 and analyzed using SPSS 26.0. As Likert scores are ordinal variables and most dimensions did not meet normality assumptions (Shapiro–Wilk test), non-parametric tests were adopted. The Kruskal–Wallis H test was used to compare score distributions among the three models, followed by Bonferroni-adjusted Mann–Whitney U tests for pairwise comparisons. The significance threshold was set at α = 0.05 (two-tailed).
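For readers who want to see the rank-based procedure spelled out, the sketch below computes the tie-corrected Kruskal–Wallis H statistic, the mean rank of each group, and a Bonferroni adjustment for pairwise P values. It is a plain-Python illustration under stated assumptions, not the SPSS procedure used in the study: the function names are hypothetical, the P value uses the closed-form chi-square survival function that holds only for three groups (df = 2), and the pairwise Mann–Whitney U tests themselves are not implemented here.

```python
import math

def average_ranks(values):
    """Rank values from 1..N, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # mean of 1-based positions i+1 .. j+1
        for pos in range(i, j + 1):
            ranks[order[pos]] = avg
        i = j + 1
    return ranks

def kruskal_wallis_three(a, b, c):
    """Return (tie-corrected H, P value for df = 2, mean rank per group)."""
    groups = [a, b, c]
    pooled = [v for g in groups for v in g]
    n_total = len(pooled)
    ranks = average_ranks(pooled)

    mean_ranks, h, start = [], 0.0, 0
    for g in groups:
        r = ranks[start:start + len(g)]
        start += len(g)
        h += sum(r) ** 2 / len(g)
        mean_ranks.append(sum(r) / len(g))
    h = 12 / (n_total * (n_total + 1)) * h - 3 * (n_total + 1)

    # Tie correction: divide by 1 - sum(t^3 - t) / (N^3 - N).
    # (Undefined if every pooled value is identical.)
    tie_sum = sum(t ** 3 - t for t in (pooled.count(v) for v in set(pooled)))
    h /= 1 - tie_sum / (n_total ** 3 - n_total)

    p = math.exp(-h / 2)   # chi-square survival function, exact for df = 2
    return h, p, mean_ranks

def bonferroni(p_values):
    """Multiply each P value by the number of comparisons, capped at 1."""
    return [min(1.0, p * len(p_values)) for p in p_values]

# Hypothetical mean-of-three-rater scores for three models (not study data).
model_a = [4.67, 4.33, 5.00, 4.00]
model_b = [4.33, 4.00, 4.67, 3.67]
model_c = [3.33, 3.67, 3.00, 4.00]
h, p, mean_ranks = kruskal_wallis_three(model_a, model_b, model_c)
print(f"H = {h:.2f}, P = {p:.3f}, mean ranks = {mean_ranks}")
```

With three models there are three pairwise comparisons, so the Bonferroni step multiplies each pairwise P value by 3 before comparing it with α = 0.05.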

2.5. Data security and ethics

All prescription data were fully anonymized prior to analysis. Specifically, all direct and indirect identifiers were removed, including patient name, identification number, contact information, address, visit date, and any other information that could potentially enable re-identification.

The study was conducted in accordance with the Declaration of Helsinki and approved by the Ethics Committee of the First Affiliated Hospital of Guangxi Medical University (Approval No. 2025-E0542). Data handling complied with institutional policies on patient data protection and privacy.

When interacting with large language models, only de-identified prescription content was entered. No personally identifiable information was provided to any external platform. In addition, potential data retention risks associated with third-party LLM interfaces were considered. To mitigate such risks, all inputs were strictly limited to anonymized clinical information, and no sensitive personal data were included at any stage of the study.

This study was reported in accordance with the STROBE guidelines for cross-sectional studies.

3. Results

Inter-rater reliability analysis demonstrated moderate agreement among the three pharmacists (ICC = 0.58, 95% CI: 0.53–0.63), indicating an acceptable level of consistency in scoring.

3.1. Overall comparison among the three models

The overall analysis using the Kruskal–Wallis H test revealed statistically significant differences among ChatGPT-5.0, DeepSeek-R1, and Gemini-2.5 Pro across all five evaluation dimensions (P < 0.05). As summarized in Table 2 and Figure 2(a), ChatGPT and DeepSeek exhibited significantly higher mean ranks than Gemini across most dimensions. ChatGPT achieved the highest mean rank in accuracy, whereas DeepSeek demonstrated the highest mean rank in clarity, practicality, and completeness. No significant difference was observed between ChatGPT and DeepSeek in relevance or completeness (P > 0.05).

Table 2.

Overall comparison of three large language models in outpatient prescription counseling.

Dimension	Model	Number of cases	Mean rank	χ² (df)	P
Accuracy	ChatGPT	50	90.93	14.44 (2)	<0.01
	DeepSeek	50	76.23
	Gemini	50	59.34
Relevance	ChatGPT	50	80.96	6.63 (2)	<0.05
	DeepSeek	50	82.34
	Gemini	50	63.20
Clarity	ChatGPT	50	69.97	36.17 (2)	<0.01
	DeepSeek	50	102.72
	Gemini	50	53.81
Practicality	ChatGPT	50	81.86	28.27 (2)	<0.01
	DeepSeek	50	93.92
	Gemini	50	50.72
Completeness	ChatGPT	50	87.56	28.40 (2)	<0.01
	DeepSeek	50	89.42
	Gemini	50	49.52

χ²: chi-square statistic; df: degrees of freedom.
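As a reader-side sanity check (not part of the authors' analysis), the mean ranks in Table 2 can be tested for internal consistency. With N pooled observations, the size-weighted average of the group mean ranks must equal (N + 1) / 2, and the H statistic computed from mean ranks alone, without tie correction, can never exceed the tie-corrected value. The sketch below applies both checks to the Table 2 figures; the function names are hypothetical, and the assumption that the reported χ² values are tie-corrected is mine.

```python
def check_mean_ranks(mean_ranks, group_sizes, tol=0.02):
    """Return (weighted mean of mean ranks, expected (N + 1) / 2, ok flag)."""
    n_total = sum(group_sizes)
    weighted = sum(r * n for r, n in zip(mean_ranks, group_sizes)) / n_total
    expected = (n_total + 1) / 2
    return weighted, expected, abs(weighted - expected) <= tol

def h_without_tie_correction(mean_ranks, group_sizes):
    """Kruskal-Wallis H from mean ranks only (no tie correction)."""
    n_total = sum(group_sizes)
    centre = (n_total + 1) / 2
    spread = sum(n * (r - centre) ** 2 for r, n in zip(mean_ranks, group_sizes))
    return 12 / (n_total * (n_total + 1)) * spread

# Mean ranks from Table 2, in the order ChatGPT, DeepSeek, Gemini.
table2 = {
    "Accuracy": [90.93, 76.23, 59.34],
    "Relevance": [80.96, 82.34, 63.20],
    "Clarity": [69.97, 102.72, 53.81],
    "Practicality": [81.86, 93.92, 50.72],
    "Completeness": [87.56, 89.42, 49.52],
}
# Reported chi-square statistics from Table 2.
reported_h = {
    "Accuracy": 14.44,
    "Relevance": 6.63,
    "Clarity": 36.17,
    "Practicality": 28.27,
    "Completeness": 28.40,
}
sizes = [50, 50, 50]

for name, ranks in table2.items():
    weighted, expected, ok = check_mean_ranks(ranks, sizes)
    h0 = h_without_tie_correction(ranks, sizes)
    print(f"{name}: mean of ranks {weighted:.2f} (expected {expected}), "
          f"consistent={ok}, uncorrected H={h0:.2f}, reported={reported_h[name]}")
```

The gap between the uncorrected and reported statistics reflects the many tied scores that arise when ratings sit on a 5-point scale.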

Figure 2.

Scoring analysis of the three LLMs in medication counseling across different scenarios: (a) overall; (b) obstetrics, gynecology, and pediatrics; (c) emergency department; (d) chronic disease; (e) oncology (*P < 0.01, **P < 0.05).

These results suggest that both ChatGPT and DeepSeek provide more precise and patient-friendly medication explanations than Gemini, with DeepSeek’s responses generally being more structured and fluent.

3.2. Departmental subgroup analyses

3.2.1. Obstetrics, gynecology, and pediatrics

As shown in Table 3 and Figure 2(b), ChatGPT and DeepSeek achieved significantly higher mean ranks for accuracy and practicality than Gemini (P < 0.05). ChatGPT achieved the highest mean rank in relevance and completeness, whereas DeepSeek had a marginally higher mean rank than ChatGPT in practicality. Differences in clarity among the three models were not statistically significant (P > 0.05). Overall, ChatGPT provided more thorough medication explanations for special populations such as pregnant women and children, while DeepSeek generated concise and context-aware expressions.

Table 3.

Scoring analysis of the three LLMs in medication counseling for obstetrics, gynecology, and pediatrics prescriptions.

Dimension	Model	Number of cases	Mean rank	χ² (df)	P
Accuracy	ChatGPT	10	19.05	9.98 (2)	<0.01
	DeepSeek	10	18.85
	Gemini	10	8.60
Relevance	ChatGPT	10	19.90	6.21 (2)	<0.05
	DeepSeek	10	10.55
	Gemini	10	16.05
Clarity	ChatGPT	10	16.95	1.19 (2)	>0.05
	DeepSeek	10	16.35
	Gemini	10	13.20
Practicality	ChatGPT	10	18.05	7.29 (2)	<0.05
	DeepSeek	10	18.85
	Gemini	10	9.60
Completeness	ChatGPT	10	22.30	11.62 (2)	<0.01
	DeepSeek	10	14.90
	Gemini	10	9.30

χ²: chi-square statistic; df: degrees of freedom.

3.2.2. Emergency department

In emergency prescriptions, significant differences were found across four of the five dimensions (Table 4, Figure 2(c)). DeepSeek achieved the best clarity and completeness (P < 0.01), followed by ChatGPT. Both models performed better than Gemini in accuracy and practicality (P < 0.05). Relevance showed no significant difference (P > 0.05). These findings suggest that DeepSeek’s structured language and coherent logic may offer advantages in complex, high-pressure scenarios typical of emergency care.

Table 4.

Scoring analysis of the three LLMs in medication counseling for emergency department prescriptions.

Dimension	Model	Number of cases	Mean rank	χ² (df)	P
Accuracy	ChatGPT	14	26.43	13.14 (2)	<0.01
	DeepSeek	14	25.93
	Gemini	14	12.14
Relevance	ChatGPT	14	22.79	2.43 (2)	>0.05
	DeepSeek	14	24.14
	Gemini	14	17.57
Clarity	ChatGPT	14	21.18	17.91 (2)	<0.01
	DeepSeek	14	31.14
	Gemini	14	12.18
Practicality	ChatGPT	14	23.61	8.19 (2)	<0.05
	DeepSeek	14	26.61
	Gemini	14	14.29
Completeness	ChatGPT	14	22.21	13.86 (2)	<0.01
	DeepSeek	14	29.54
	Gemini	14	12.75

χ²: chi-square statistic; df: degrees of freedom.

3.2.3. Chronic disease management

As presented in Table 5 and Figure 2(d), DeepSeek had the highest mean rank in relevance, clarity, practicality, and completeness (P < 0.05), while accuracy did not differ significantly among models (P > 0.05). ChatGPT showed intermediate mean ranks in relevance, practicality, and completeness but the lowest mean rank in clarity, whereas Gemini had the lowest mean ranks in relevance, practicality, and completeness. This pattern suggests that DeepSeek and ChatGPT are better suited for long-term patient counseling tasks requiring continuity and clear follow-up guidance.

Table 5.

Scoring analysis of the three LLMs in medication counseling for chronic disease prescriptions.

Dimension	Model	Number of cases	Mean rank	χ² (df)	P
Accuracy	ChatGPT	17	29.50	1.69 (2)	>0.05
	DeepSeek	17	23.38
	Gemini	17	25.12
Relevance	ChatGPT	17	25.71	6.39 (2)	<0.05
	DeepSeek	17	32.18
	Gemini	17	20.12
Clarity	ChatGPT	17	20.38	10.79 (2)	<0.01
	DeepSeek	17	34.76
	Gemini	17	22.85
Practicality	ChatGPT	17	27.79	13.24 (2)	<0.01
	DeepSeek	17	33.85
	Gemini	17	16.35
Completeness	ChatGPT	17	28.74	7.49 (2)	<0.05
	DeepSeek	17	30.94
	Gemini	17	18.32

χ²: chi-square statistic; df: degrees of freedom.

3.2.4. Oncology prescriptions

Among oncology prescriptions (Table 6, Figure 2(e)), statistically significant differences were observed only in clarity (P < 0.01). DeepSeek had the highest mean rank, while no notable difference was found in other dimensions. The limited sample size and heterogeneity of oncologic treatments may have contributed to nonsignificant results.

Table 6.

Scoring analysis of the three LLMs in medication counseling for oncology department prescriptions.

Dimension	Model	Number of cases	Mean rank	χ² (df)	P
Accuracy	ChatGPT	9	17.50	4.83 (2)	>0.05
	DeepSeek	9	9.94
	Gemini	9	14.56
Relevance	ChatGPT	9	14.22	4.25 (2)	>0.05
	DeepSeek	9	17.56
	Gemini	9	10.22
Clarity	ChatGPT	9	12.72	13.43 (2)	<0.01
	DeepSeek	9	21.11
	Gemini	9	8.17
Practicality	ChatGPT	9	13.83	1.64 (2)	>0.05
	DeepSeek	9	16.39
	Gemini	9	11.78
Completeness	ChatGPT	9	15.72	2.36 (2)	>0.05
	DeepSeek	9	15.50
	Gemini	9	10.78

χ²: chi-square statistic; df: degrees of freedom.

3.3. Stability evaluation

To assess consistency across repeated queries, the textual similarity of model responses over three consecutive days was analyzed by expert comparison. As summarized in Table 7, ChatGPT and DeepSeek demonstrated the highest response stability (both 98.0% “essentially identical”), while Gemini showed slightly lower repeatability (94.0%). No responses were categorized as “inconsistent,” suggesting that all three models maintained acceptable reproducibility within short-term repeated interactions.

Table 7.

Stability of responses generated by three large language models.

Stability	DeepSeek	ChatGPT	Gemini
Essentially identical (≥75% overlap)	49 (98.0%)	49 (98.0%)	47 (94.0%)
Partially consistent (50–74%)	1 (2.0%)	1 (2.0%)	3 (6.0%)
Inconsistent (<50%)	0	0	0

4. Discussion

4.1. Overall performance and key findings

This study provided a real-world, multidimensional comparison of three large language models—ChatGPT-5.0, DeepSeek-R1, and Gemini-2.5 Pro—applied to outpatient medication counseling. All models generated medication explanations that met minimum professional standards, yet their performance varied considerably across evaluation dimensions. DeepSeek and ChatGPT consistently achieved higher overall scores, while Gemini showed weaker and less consistent output quality. These findings confirm that LLMs differ significantly in reasoning depth, linguistic organization, and factual reliability, indicating that they cannot be regarded as equivalent in clinical pharmacy applications.^21,22

Across the five evaluated dimensions, accuracy and relevance were generally high, but greater divergence occurred in clarity, practicality, and completeness. DeepSeek performed best in clarity and practicality, likely due to its emphasis on Chinese semantic optimization and contextual generation. ChatGPT achieved the highest mean rank in accuracy and performed comparably to DeepSeek in completeness, which may reflect its extensive multilingual medical corpus and alignment optimization. Gemini’s weaker performance, particularly in completeness, may be related to its limited domain-specific training data and less mature contextual reasoning mechanisms.²³ Together, these results suggest that corpus diversity and alignment strategies influence LLM behavior in pharmacy-related tasks.

4.2. Model variation across prescription types

Significant context-dependent differences were observed among departments. In obstetrics, gynecology, and pediatrics, ChatGPT and DeepSeek outperformed Gemini in accuracy and practicality, suggesting stronger competence in identifying high-risk drug warnings for special populations such as children and pregnant women. In the emergency department, DeepSeek achieved the best clarity and completeness, implying that structured reasoning facilitates accurate communication in urgent and complex prescriptions. In chronic disease management, DeepSeek again led in clarity, relevance, and practicality, while ChatGPT had the numerically highest mean rank in accuracy, although accuracy did not differ significantly among models in this subgroup. These results align with prior findings that conversational coherence improves comprehension and adherence among long-term patients. For oncology prescriptions, only clarity showed a significant difference, with DeepSeek ranking highest; the absence of significant differences in the other dimensions may reflect the small sample, the heterogeneity of oncologic treatments, and the complexity of the regimens.

Overall, these outcomes emphasize that model applicability is context-specific. Rather than a uniform approach, selection should consider department characteristics, medication complexity, and patient literacy.²⁴ Such differentiation ensures safer and more efficient use of LLMs in clinical counseling workflows.

4.3. Implications for pharmacy practice

The findings suggest that LLMs can serve as assistive tools for outpatient counseling but cannot replace pharmacists’ professional judgment. When incorporated into structured service models, LLMs may generate preliminary drafts of counseling texts based on standardized prompts. Pharmacists can then verify and refine these drafts, reducing repetitive workload while maintaining professional oversight.²⁵ Models demonstrating superior clarity and practicality—such as DeepSeek—may be particularly useful in chronic disease management, where written summaries can enhance patients’ medication understanding and recall.²⁶

However, potential risks should be carefully considered. Some generated content omitted clinically critical safety information, including contraindications and drug–drug interaction (DDI) details.¹² For example, several responses failed to provide warnings for pregnancy or lactation when teratogenic medications were involved, did not adequately address dose adjustments in patients with renal or hepatic impairment, and occasionally overlooked common DDIs such as interactions between anticoagulants and antibiotics or between antihypertensive agents and nonsteroidal anti-inflammatory drugs.²⁷

These omissions are clinically significant, as they may directly compromise patient safety, lead to inappropriate medication use, and reduce adherence due to insufficient risk communication. In real-world pharmacy practice, such missing information could result in preventable adverse drug events, particularly among vulnerable populations such as elderly patients or those with polypharmacy.

These findings are consistent with previous research demonstrating that large language models show variable and often incomplete performance in identifying drug–drug interactions, with limited reliability for standalone clinical decision-making. The inconsistency in detecting DDIs highlights a fundamental limitation of current LLMs in handling complex, context-dependent pharmacological safety information.²⁸

Therefore, pharmacist oversight remains essential when integrating LLMs into medication counseling workflows. To mitigate these risks, several strategies can be considered. First, structured prompting approaches that explicitly request safety-related information (e.g., contraindications, DDIs, dose adjustments, and special population considerations) may improve output completeness. Second, integrating LLMs with validated drug interaction databases or clinical decision support systems could enhance accuracy and reliability. Third, implementing standardized safety checklists during pharmacist review may help identify and correct omissions before patient-facing use.

4.4. Strengths, limitations, and future directions

Several limitations should be acknowledged. First, the relatively small sample size, particularly in subgroup analyses (e.g., oncology prescriptions, n = 9), may have reduced statistical power and increased the risk of Type II error, potentially leading to false-negative findings and obscuring true differences among models. Second, this study was conducted using data from a single institution, which may limit the generalizability of the findings to other clinical settings, healthcare systems, or patient populations. Third, although the study focused on “patient counseling,” all evaluations were performed by clinical pharmacists rather than patients. Therefore, the results primarily reflect professional judgment of response quality and may not fully capture patient comprehension, readability, or real-world usability. Finally, this study did not specifically examine high-risk clinical scenarios, such as polypharmacy, major drug–drug interactions, or high-alert medications. These situations may present additional challenges for large language models and should be addressed in future research.

Future research should extend evaluation beyond expert scoring to include patient-reported outcomes, such as comprehension, satisfaction, and adherence. Studies focusing on high-risk scenarios—including polypharmacy, major drug–drug interactions, and high-alert medications—are also needed to assess safety under more complex conditions. In addition, multicenter studies across diverse clinical settings would improve generalizability. Integrating LLMs with clinical decision support systems may further enhance reliability in practice.

5. Conclusion

This study demonstrates that current mainstream large language models show promising yet heterogeneous performance in outpatient medication counseling. DeepSeek-R1 and ChatGPT-5.0 achieved superior overall performance, while Gemini-2.5 Pro showed relatively lower completeness and stability. These findings support the potential role of LLMs as assistive tools in clinical pharmacy practice under professional supervision. Future integration should emphasize structured prompting, pharmacist oversight, and system-level safeguards to ensure safety and reliability.

Supplemental material

Supplemental Material for Comparative evaluation of ChatGPT-5.0, DeepSeek-R1, and Gemini-2.5 pro in real-world outpatient prescription counseling: A multidimensional analysis by Quanyuan Huang, Wuchang Zhu, Huizhen Mo, Bingxiu Huang, Zuming Liao, Xiao Lu and Hongliang Zhang in Digital Health.

Footnotes

Acknowledgments

The authors would like to thank the clinical pharmacists from the Department of Pharmacy, the First Affiliated Hospital of Guangxi Medical University, who served as independent evaluators in this study. They were not involved in study design, data analysis, or manuscript preparation. We also appreciate the constructive feedback from anonymous reviewers, which greatly improved the quality of this work.

ORCID iD

Quanyuan Huang

Ethical considerations

This study was conducted in accordance with the ethical principles of the Declaration of Helsinki. The use of anonymized outpatient prescription data was approved by the Ethics Committee of the First Affiliated Hospital of Guangxi Medical University (Approval No. 2025-E0542).

Author contributions

Quanyuan Huang and Hongliang Zhang conceived and designed the study, supervised the research progress, and provided critical revisions to the manuscript. Wuchang Zhu, Huizhen Mo, Bingxiu Huang, Zuming Liao and Xiao Lu collected and anonymized the real-world outpatient prescriptions, conducted the model evaluation, and performed the statistical analysis. Quanyuan Huang drafted the initial version of the manuscript. Quanyuan Huang, Wuchang Zhu, Huizhen Mo, Bingxiu Huang, Zuming Liao and Xiao Lu contributed to data interpretation and visualization. All authors participated in reviewing and refining the paper, approved the final version for publication, and agreed to be accountable for all aspects of the work.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This research was supported by the Medical Quality (Evidence-based) Management Research Project of the National Health Commission Hospital Management Research Institute (No. YLZLXZ23K004) and the Self-funded Scientific Research Project of the Guangxi Zhuang Autonomous Region Administration of Traditional Chinese Medicine (No. GXZYA20250316).

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Supplemental material

Supplemental material for this article is available online.

