Abstract
Large language models (LLMs) are increasingly applied in healthcare, yet their evaluation remains inconsistent and often disconnected from clinical practice. OpenAI's HealthBench represents an important advancement, encompassing 5,000 multiturn synthetic clinical conversations benchmarked against 48,562 clinician-developed criteria across accuracy, completeness, context awareness, communication, and instruction-following. Key strengths include broad scenario coverage, contributions from 262 clinicians across 60 countries, and automated grading methods that show high concordance with physician ratings. HealthBench provides a scalable and globally relevant framework. Nevertheless, important limitations constrain its clinical applicability. Exclusive reliance on synthetic dialogs limits ecological validity, and model-based graders may reinforce shared blind spots. Moreover, HealthBench assesses static, offline interactions while omitting multimodal inputs, longitudinal care, and patient outcomes, all factors critical to real-world decision-making. Without external validation, strong benchmark performance may not translate into improved diagnostic accuracy, workflow efficiency, or patient safety. To ensure safe and effective integration of LLMs into practice, future benchmarks must incorporate authentic clinical data, longitudinal outcomes, and system-level considerations. HealthBench is a valuable step, but evaluation strategies must evolve to capture the complexity and demands of frontline care.
OpenAI's HealthBench 1 represents an important advancement toward systematic and clinically grounded evaluation of large language models (LLMs) in healthcare. It includes 5,000 multiturn clinical conversations evaluated against 48,562 clinician-developed rubric criteria, covering axes such as “clinical accuracy,” “communication quality,” and “context awareness.” Among the 5,000 prompts, 4,248 (85.0%) are in English, 192 (3.8%) in Spanish, 173 (3.5%) in Portuguese, and the remaining 387 (7.7%) are distributed across 29 other languages. The authors address longstanding limitations of previous benchmarks: the lack of expert validation and the limited scope for measurable improvement. 1 At the same time, researchers have noted challenges in LLM evaluation using HealthBench. In this Viewpoint, we discuss the benchmark's key strengths and examine its limitations, highlighting areas for further research and the importance of real-world validation.
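To make the rubric-based design concrete, the following is a minimal sketch of how a single response can be scored against conversation-specific, point-weighted criteria. The criterion texts, point values, and met-criteria judgments are hypothetical, and the normalization shown (awarded points divided by the maximum achievable positive points, clipped to [0, 1]) is one plausible scheme rather than a definitive reproduction of HealthBench's implementation.

```python
from dataclasses import dataclass

@dataclass
class Criterion:
    text: str    # clinician-written rubric criterion
    points: int  # positive for desired behavior, negative for harmful behavior

def rubric_score(criteria: list[Criterion], met: list[bool]) -> float:
    """Score one response: awarded points (including penalties) divided by
    the maximum achievable positive points, clipped to [0, 1]."""
    awarded = sum(c.points for c, m in zip(criteria, met) if m)
    max_positive = sum(c.points for c in criteria if c.points > 0)
    if max_positive == 0:
        return 0.0
    return min(max(awarded / max_positive, 0.0), 1.0)

# Hypothetical three-criterion rubric for an emergency-referral prompt.
rubric = [
    Criterion("Advises the user to seek emergency care", 8),
    Criterion("Asks about symptom onset and duration", 4),
    Criterion("Recommends an unsafe home remedy", -6),
]
print(rubric_score(rubric, met=[True, False, False]))  # 8 / 12 ≈ 0.67
```

In HealthBench, the per-criterion judgments themselves come from a model-based grader rather than from rule-based checks, which is why the meta-evaluation of grader reliability discussed below matters.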
Strengths
Comprehensive clinical scenarios: The benchmark encompasses seven diverse clinical themes (e.g. emergency referrals, global health) and evaluates five key behavioral dimensions (accuracy, completeness, context awareness, communication, instruction-following). This granularity enables precise assessments for developers, healthcare institutions, and regulatory organizations.

Robust expert engagement: Criteria development and validation involved a carefully selected cohort of 262 clinicians spanning 26 specialties across 60 countries. This broad expert base significantly enhances both the validity and global applicability of the benchmark.

Reliable and scalable evaluation: In the meta-evaluation, automated grading by GPT-4.1 demonstrated high agreement with physician evaluations (macro F1 = 0.71), comparable to interphysician agreement. This finding suggests that the automated grading approach is reliable and aligns well with physician ratings.
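Macro F1 averages the per-class F1 scores, so agreement on “criterion not met” judgments counts as much as agreement on “criterion met” judgments. As a brief illustration of how such model–physician agreement can be computed, the sketch below uses scikit-learn on invented labels; these are not HealthBench data.

```python
from sklearn.metrics import f1_score

# Hypothetical binary judgments ("criterion met?") for the same set of
# (response, criterion) pairs, one from physicians and one from the grader.
physician = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
model_grader = [1, 0, 1, 0, 0, 1, 0, 1, 1, 1]

# Macro averaging weights the "met" and "not met" classes equally, so
# agreement on the rarer class is not drowned out by the majority class.
agreement = f1_score(physician, model_grader, average="macro")
print(f"macro F1 = {agreement:.2f}")
```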
Limitations
Despite these advances, several limitations constrain HealthBench's clinical applicability. First, HealthBench relies exclusively on synthetic conversations rather than actual clinical encounters. Although this practice is common in the field, it limits ecological validity: scripted dialogs cannot fully reproduce the ambiguity, incomplete information, and evolving context of real patient interactions. Second, the grading itself is model-based, and automated graders may share blind spots with the models they evaluate, reinforcing common errors rather than exposing them. Third, HealthBench assesses static, offline text interactions, omitting multimodal inputs, longitudinal care, and patient outcomes, all of which are critical to real-world clinical decision-making.
The path forward: From benchmarks to bedside
HealthBench provides a valuable framework of conversation-specific rubrics for assessing LLMs in healthcare. It has set a new standard for moving LLM evaluation beyond simple knowledge tests. However, its challenges with synthetic data, automated grading, and inability to fully capture the complexities of real-world clinical scenarios remind us that strong benchmark performance may not translate into improved diagnostic accuracy, workflow efficiency, or patient safety.
To ensure safe and effective integration of LLMs into practice, future evaluation strategies must evolve. We propose that the next generation of evaluation should include prospective, “silent-mode” clinical trials. 7 In such a study design, an LLM could be integrated into an EHR system to generate recommendations in real time based on live, multimodal patient data. These recommendations would be recorded for analysis but not shown to the treating clinicians. Investigators would compare LLM recommendations with clinician decisions at the encounter level and, critically, assess the association between model–clinician discordance and prespecified longitudinal outcomes (e.g. 30-day readmission, adjudicated diagnostic accuracy, and adverse events), with appropriate risk adjustment and multiplicity control. 6 This approach would provide high-quality evidence of clinical utility and safety without compromising patient care. By bridging the gap between benchmark performance and real-world impact, we can ensure that AI technologies truly serve the needs of patients and clinicians.
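To illustrate the encounter-level analysis such a silent-mode design enables, the sketch below regresses hypothetical prespecified outcomes on model–clinician discordance with simple risk adjustment and Holm multiplicity control. All column names and the input file are hypothetical; a real trial would require a prespecified statistical analysis plan, richer confounder adjustment, and accounting for clustering by clinician or site.

```python
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.multitest import multipletests

# Hypothetical encounter-level table from silent-mode logging: one row per
# encounter, with the LLM recommendation recorded but hidden from clinicians.
df = pd.read_csv("silent_mode_encounters.csv")
df["discordant"] = (df["llm_recommendation"] != df["clinician_decision"]).astype(int)

# Risk-adjusted association between discordance and each prespecified outcome.
outcomes = ["readmit_30d", "diagnostic_error", "adverse_event"]
pvalues = []
for outcome in outcomes:
    fit = smf.logit(
        f"{outcome} ~ discordant + age + comorbidity_index + acuity_score",
        data=df,
    ).fit(disp=False)
    pvalues.append(fit.pvalues["discordant"])

# Holm correction controls the family-wise error rate across outcomes.
rejected, adjusted, _, _ = multipletests(pvalues, alpha=0.05, method="holm")
for outcome, p_adj, sig in zip(outcomes, adjusted, rejected):
    print(f"{outcome}: adjusted p = {p_adj:.3f}, significant = {sig}")
```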
Footnotes
Author contributions
JL and SL contributed to the conceptualization, data curation, and literature review. JL and SL drafted the original manuscript. All authors contributed to the review and editing of the manuscript.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Guarantor
Siru Liu, PhD, Department of Biomedical Informatics, Vanderbilt University Medical Center, Nashville, TN, USA. Email: siru.liu@vumc.org
