
Abstract

Introduction

Large language models (LLMs) offer potential as clinical decision support systems (CDSS) for detecting drug-related problems (DRPs), yet their real-world performance compared to clinical pharmacists (CPs) remains unclear, especially in complex hematology care. We aimed to evaluate the concordance between a clinical pharmacist and three LLMs in identifying DRPs within a Bone Marrow Transplantation unit.

Methods

This prospective observational study evaluated the concordance between a CP and three LLMs (ChatGPT-4o, Grok-3, DeepSeek-v3) in a Bone Marrow Transplantation unit. Eighty-three anonymized patient cases encompassing 210 CP-identified DRPs, classified via the PCNE v9.1 system, were presented using a standardized CDSS-simulating prompt. Performance was assessed based on direct detection, prompted detection after structured follow-up, and the clinical relevance of AI-generated therapeutic recommendations against the CP's gold-standard assessments.
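The two detection measures described above can be illustrated with a minimal sketch. It assumes each CP-identified DRP is labeled with one of three outcomes for a given model: found unprompted, found only after structured follow-up, or never found. The labels `direct`, `prompted`, and `missed`, the function name `detection_rates`, and the ten toy outcomes are all hypothetical; they are not the study's data or the authors' analysis code.

```python
from collections import Counter

# Hypothetical per-DRP outcomes for one model (toy data, not study results).
outcomes = ["direct", "direct", "prompted", "direct", "missed",
            "prompted", "direct", "prompted", "direct", "prompted"]


def detection_rates(outcomes):
    """Return (direct rate, cumulative rate after prompting) in percent."""
    if not outcomes:
        raise ValueError("at least one DRP outcome is required")
    counts = Counter(outcomes)
    total = len(outcomes)
    # Direct detection: DRPs the model identified without follow-up.
    direct = 100 * counts["direct"] / total
    # Overall detection: direct plus those found after guided prompting.
    overall = 100 * (counts["direct"] + counts["prompted"]) / total
    return round(direct, 1), round(overall, 1)


direct_rate, overall_rate = detection_rates(outcomes)
print(f"direct: {direct_rate}%, after prompting: {overall_rate}%")
```

Under this framing the overall rate is cumulative, so it can never fall below the direct rate; the gap between the two is the share of DRPs that surfaced only with expert guidance.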

Results

Direct detection of intervention-requiring DRPs was limited (51.4%–60.5% across models), with nearly half missed initially. Guided prompting significantly improved overall detection rates to 93.8%–98.1%, with ChatGPT achieving the highest accuracy. All models produced hallucinations. Recommendation concordance with the CP exceeded 70% in most DRP categories. DeepSeek and ChatGPT showed more consistent performance in context-dependent evaluations, whereas Grok demonstrated higher direct detection but lower recommendation alignment.

Conclusion

LLMs demonstrate meaningful potential to assist in DRP detection but are not sufficiently reliable as standalone tools. Expert-guided interaction substantially enhanced their performance, underscoring the critical value of hybrid pharmacist-AI workflows. Future research should validate these findings across broader populations with multiple expert evaluators and integrate next-generation AI architectures for safer CDSS implementation.


Real-world evaluation of large language models in detecting drug-related problems: A clinical pharmacist–AI concordance study in hematology care