Evaluating Large Language Models for Patient Information: What Is Worth Publishing?

Abstract

In recent months, our journals have seen a surge in submissions evaluating large language models (LLMs), such as ChatGPT, Claude, or Perplexity, and their ability to answer common patient questions about orthopaedic conditions. Most studies follow a familiar structure: the authors compile a list of common questions on a topic, ask the chatbot (usually only once) to respond to each question, record the responses, and rate the answers based on expert opinion. Although these studies are timely, their scientific and clinical value is uneven, and in many cases they are yielding diminishing returns.

These studies may be helpful when they highlight blind spots in AI-generated information, flag risks of patient confusion, or aid clinicians in anticipating what their patients may encounter online. But editors and reviewers are increasingly encountering common limitations:

Reproducibility lacking: Responses vary across time, model updates, and even query phrasing. Without reporting the date and version of the model, results cannot be reproduced. Even identical questions may yield different outputs across sessions.

Audience mismatch: Responses intended for patients are often evaluated solely by clinicians, with limited consideration given to patient readability and comprehension.

No clear evaluation reference standard: Response quality is typically assessed by expert opinion rather than against validated clinical guidelines or structured rubrics.

Accuracy and content of the response: Accuracy (absence of factual error) and completeness (depth of content) are distinct attributes, yet they are often conflated. A response may be error-free yet still lack meaningful substance. Accuracy remains a persistent issue with LLMs, particularly when citations or supporting evidence are required. Models may generate plausible-sounding references that are incorrect, fabricated, or misleading.

Scoring subjective: Many studies rely on unvalidated grading systems, often without interrater reliability data.

Novelty fading: Applying identical methods to a new diagnosis adds little new insight.

Evidence levels: According to the JBJS Levels of Evidence, these studies should be classified as Level V (expert opinion).
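As one illustration of the interrater reliability data noted above under subjective scoring, agreement between 2 expert raters can be summarized with a chance-corrected statistic such as Cohen's kappa. The following is a minimal sketch in Python, not a prescribed method: the function name, the 2-category rating scale, and the example ratings are hypothetical and chosen only for illustration.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters scoring the same responses categorically."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of responses")
    n = len(rater_a)
    # Observed agreement: share of responses given the same rating by both raters.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected chance agreement, from each rater's marginal rating frequencies.
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of four chatbot responses (illustrative only, not study data).
rater_a = ["accurate", "accurate", "inaccurate", "inaccurate"]
rater_b = ["accurate", "inaccurate", "inaccurate", "inaccurate"]
print(cohen_kappa(rater_a, rater_b))  # 0.5 for these illustrative ratings
```

Ordinal grading scales or panels of more than 2 raters would call for a different statistic (for example, a weighted kappa or an intraclass correlation coefficient); the sketch covers only the simplest 2-rater categorical case.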

Beyond these design issues lies a more fundamental concern: LLMs change constantly through both upgrades and new versions. A response generated today may differ markedly from one generated next week. By the time a study is published, its findings may already be dated or even obsolete.

So, What Should Be Published?

These papers are most valuable when they go beyond documenting chatbot outputs. Manuscripts should focus on the following:

Uncovering structural flaws in AI tools (eg, hallucinations, inconsistencies)

Evaluating patient perceptions, trust, and comprehension of AI-generated information

Developing and testing reproducible methods for evaluating AI-generated health content

Comparing AI-generated advice with clinician guidance in ways that inform practice
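As a hedged illustration of what a reproducible method might record, the sketch below stores each question together with the model name, model version, query date, and every repeated response, and then counts how many distinct answers the same question produced. The class name, field names, and example values are hypothetical assumptions; no real chatbot or vendor interface is called, and the responses are placeholders that an investigator would paste in.

```python
import json
from dataclasses import dataclass, asdict, field


@dataclass
class QueryRecord:
    """One patient question plus the context needed to interpret its answers."""
    model_name: str       # hypothetical field: product name as reported by the vendor
    model_version: str    # hypothetical field: version string shown at query time
    query_date: str       # ISO 8601 date on which the question was asked
    prompt: str           # exact wording of the question
    responses: list = field(default_factory=list)  # one entry per repeated query

    def distinct_response_count(self):
        # Ignore case and spacing so that trivial formatting changes are not counted.
        return len({" ".join(r.lower().split()) for r in self.responses})


# Placeholder values for illustration only; these are not real model outputs.
record = QueryRecord(
    model_name="example-chatbot",
    model_version="hypothetical-version-1",
    query_date="2025-01-15",
    prompt="What is the recovery time after ankle fusion?",
    responses=["Rest and ice.", "rest  and ice.", "See a surgeon."],
)
print(record.distinct_response_count())
print(json.dumps(asdict(record), indent=2))  # archive alongside the manuscript data
```

Exact-match counting is a deliberately crude consistency measure; clinically meaningful differences between responses would still need expert or rubric-based review.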

What Should Editors and Reviewers Ask?

Does the study add meaningful insight beyond what has already been published or generally known?

Are chatbot outputs evaluated with clear standards and transparency?

Is there value for clinicians or patients beyond a static content snapshot?

Editorial Responsibility

As editors, we must guide this emerging area with discernment. Not every chatbot response audit warrants publication. However, well-designed, methodologically sound studies that help define how AI tools can be used appropriately with patients do deserve our attention. At this time, such submissions will be considered for publication in Foot & Ankle Orthopaedics (FAO), rather than Foot & Ankle International (FAI), which prioritizes hypothesis-driven clinical research.

Authors considering such submissions should consult the new AI/Chatbot Evaluation Checklist now available in the FAO “Instructions for Authors,” which outlines minimum criteria for transparency, validity, and relevance.

Only those studies that advance our understanding of AI’s practical, patient-centered value in foot and ankle care will merit peer review and consideration for publication.

Charles L. Saltzman, MD; Robert B. Anderson, MD; Brad D. Blankenhorn, MD; John T. Campbell, MD; Christopher P. Chiodo, MD; Timothy R. Daniels, MD, FRCSC; George B. Holmes Jr, MD; Ellie Pinsker, PhD; Stefan Rammelt, MD, PhD; Robert A. Vander Griend, MD

Footnotes

This editorial has been copublished in Foot & Ankle International.