Abstract
Generative artificial intelligence (AI) comprises a class of AI models that generate synthetic outputs based on patterns learned from the dataset used to train the model. This means that they can create entirely new outputs that resemble real-world data despite not being explicitly instructed to do so during training. Advances in technological capabilities, computing power, and data availability have given rise to more advanced and versatile generative AI models, including diffusion models and large language models, that hold promise in healthcare. In musculoskeletal healthcare, generative AI applications may involve the enhancement of images, generation of audio and video, automation of clinical documentation and administrative tasks, use of surgical planning aids, augmentation of treatment decisions, and personalization of patient communication. Limitations of the use of generative AI in healthcare include hallucinations, model bias, ethical considerations during clinical use, knowledge gaps, and lack of transparency. This review introduces critical concepts of generative AI, presents clinical applications relevant to musculoskeletal healthcare that are in development, and highlights limitations preventing deployment in clinical settings.
Introduction
Artificial intelligence (AI) comprises computer algorithms that mimic human behavior requiring intelligence, such as advanced decision-making, language processing, and image identification [10,45]. The rapid progress in AI has been driven by technological advancements in computing power and model architectures, as well as by considerable financial investments from public and private stakeholders [5]. Many investments have focused on generative AI models that produce synthetic outputs representing patterns learned during training [36]. Generative AI, well known as the foundation of interactive chatbots such as ChatGPT [26], has been called transformative due to its versatility and broad capabilities in tasks relevant to multiple sectors such as finance and healthcare.
Musculoskeletal healthcare, including orthopedics and rheumatology, is poised to benefit from advances in generative AI, as treatment decisions rely on large amounts of unstructured data and visual information such as radiographs or advanced imaging modalities [27], and interventions may lead to rapid changes in patients’ health. Given that human lives are directly affected, innovations in AI should be embraced, yet meticulously challenged [33]. In a field where treatment is grounded in regulatory approvals, randomized trials, and vetted clinical guidelines, healthcare providers may find it difficult to adapt to this rapidly changing technology. Yet because generative AI has potential benefits for musculoskeletal healthcare, providers should become familiar with this technology and related digital healthcare solutions [16].
Musculoskeletal providers’ opinions are divided on the value and necessity of AI solutions [17]. However, with the rapid evolution of generative AI and the development of unprecedented use cases, its potential has become more evident [27]. Applied toward musculoskeletal healthcare tasks, generative AI may unlock unprecedented efficiencies in clinical settings and enhance patient care practices, including surgical interventions. This review presents the central concepts of generative AI, introduces current and developing applications relevant to musculoskeletal care, and discusses critical limitations surrounding the clinical use of generative AI.
What Is Generative AI?
In defining generative AI, it is useful to first differentiate it from discriminative (non-generative) AI. In machine learning (ML; subset of AI that enables computers to learn from data and improve performance for predicting an outcome) and deep learning (DL; specialized branch of ML that uses neural networks to model complex datasets), discriminative AI is defined as technology that predicts or classifies an outcome based on its training with data that have known or labeled outcomes [10,44]. For example, in an institutional registry of patients who underwent rotator cuff repair—which contains demographic and perioperative variables—rotator cuff re-tears may be routinely recorded. If the goal of an investigation was to apply ML algorithms to this dataset to predict the probability of re-tear based on a patient’s risk profile, this would be a discriminative task because the outcome is already known. Discriminative algorithms can further be described as supervised learning methods, as they are trained on data in which the outcome of interest is already labeled.
Unlike discriminative models, generative models learn the underlying statistical distribution of their training data and use it to produce novel, synthetic outputs that resemble that data, rather than simply predicting or classifying a known outcome.
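As a concrete contrast, the discriminative re-tear prediction task described above can be sketched as a simple logistic risk model. This is a minimal illustration; the coefficients and the `retear_probability` function are invented for demonstration and are not clinically derived.

```python
import math

# Toy discriminative model: map patient features to a re-tear
# probability via logistic regression. Coefficients are illustrative
# placeholders, not estimates from any real registry.
def retear_probability(age, tear_size_cm, smoker):
    logit = -4.0 + 0.03 * age + 0.8 * tear_size_cm + 0.6 * int(smoker)
    return 1.0 / (1.0 + math.exp(-logit))

p = retear_probability(age=65, tear_size_cm=3.0, smoker=False)
print(f"Predicted re-tear probability: {p:.2f}")
```

A generative model, by contrast, would not output such a probability; it would synthesize new records or images that resemble the registry data.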
Contemporary Generative AI Models
At the time of this review, large language models (LLMs) and diffusion models for computer vision tasks are the predominant models of generative AI being used [21,29]. This section defines them and introduces how they can be applied to clinical tasks in musculoskeletal health.
Diffusion Models
Diffusion models are a family of generative AI models also known as “reconstruction algorithms,” as they leverage the statistical probability distributions of images acquired in training to generate new visual data [22]. These models use neural networks called “U-nets” that specialize in capturing complex spatial details through encoding processes [57]. Diffusion models have become the predominant deep generative model for creating new visual data, producing higher-fidelity outputs than other families of generative models. They function through an iterative process involving 2 complementary arms: a fixed forward (noising) process and a learned backward (denoising) process. In the forward process, the model progressively adds “noise,” so that images become increasingly dissimilar from those used in training [57]. For example, a diffusion model created to generate synthetic shoulder radiographs would utilize millions of authentic shoulder radiographs during its training phase; in the forward process, these images are incrementally degraded until, eventually, they resemble static noise without any semblance of an image. In the backward process, images are “denoised” step-by-step until high-quality images resembling realistic shoulder radiographs are generated from pure randomness. Another layer of complexity is the distinction between conditioned and unconditioned diffusion models: in a conditioned diffusion model, inputs from a user help guide the model output, whereas unconditioned models generate images solely based on learned data patterns [57]. Diffusion models can also be applied to generating synthetic video sequences.
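The fixed forward (noising) arm described above can be illustrated with a toy numerical sketch, assuming a 1-dimensional signal in place of a radiograph and an invented noise schedule; the learned denoising U-net itself is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "image": a 1-D signal standing in for a shoulder radiograph.
image = np.sin(np.linspace(0, 2 * np.pi, 64))

# Forward (noising) process: a fixed schedule blends the image toward
# Gaussian noise over T steps. alphas_bar[t] is the fraction of the
# original signal variance remaining at step t.
T = 100
betas = np.linspace(1e-4, 0.05, T)   # noise schedule (assumed values)
alphas_bar = np.cumprod(1.0 - betas)

def noisy_sample(x0, t):
    """Sample the noised image x_t directly from x_0 (closed form)."""
    eps = rng.standard_normal(x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps

x_end = noisy_sample(image, T - 1)   # nearly pure static noise

# The learned backward process (omitted here, as it requires a trained
# U-net) would start from noise like x_end and denoise it step-by-step
# into a new, realistic-looking image.
print(f"signal variance remaining at final step: {alphas_bar[-1]:.2f}")
```

The key property the sketch shows is that the forward process is deterministic in schedule and needs no learning; all of the model's capacity goes into reversing it.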
Large Language Models
Contemporary LLMs include those that specialize in generating text outputs that resemble human speech, such as Claude (Anthropic), Gemini (Google), Mistral (MistralAI), and Chat Generative Pre-trained Transformer (ChatGPT; OpenAI) [39]. LLMs are either proprietary (closed-source), with model weights and training data withheld by the developer, or open-source, with model weights publicly available for inspection, modification, and local deployment.
At the core of LLMs, transformer architectures are used through a mechanism called self-attention, in which the model weighs the relevance of every token (a word or word fragment) in an input sequence against every other token. This allows the model to capture long-range context and generate the most probable next token when producing an output.
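For illustration, the scaled dot-product self-attention at the heart of a transformer can be sketched in a few lines of NumPy; the weight matrices here are random stand-ins for learned parameters, and the dimensions are arbitrary.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(X, Wq, Wk, Wv):
    """Scaled dot-product self-attention over a token sequence X."""
    Q, K, V = X @ Wq, X @ Wk, X @ Wv
    d_k = Q.shape[-1]
    # weights[i, j] = how much token i attends to token j;
    # each row is a probability distribution over the input tokens.
    weights = softmax(Q @ K.T / np.sqrt(d_k))
    return weights @ V, weights

rng = np.random.default_rng(0)
n_tokens, d_model, d_head = 4, 8, 8
X = rng.standard_normal((n_tokens, d_model))   # embeddings for 4 tokens
Wq, Wk, Wv = (rng.standard_normal((d_model, d_head)) for _ in range(3))

out, weights = self_attention(X, Wq, Wk, Wv)
print(weights.shape)  # one attention row per input token
```

In a full transformer, many such attention heads are stacked in layers and their parameters are learned from trillions of tokens of text.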
Although often thought of as text-input, text-output models (such as a chatbot), many contemporary LLMs are now considered multimodal or foundation models [1,3]. In other words, their capabilities have expanded to handle inputs beyond text, including images, audio, and video. Outputs are likewise multimodal. Some multimodal models incorporate components of diffusion models to enhance the generation of visual outputs, while text-image models may use transformer architectures to allow for conditioned diffusion by letting users encode text prompts [4,56]. Training such models is resource-intensive and requires vast amounts of data, but the reward is efficient and versatile model performance with potential for deployment in clinical settings for new healthcare use cases.
Modifying Generative Models May Unlock New Healthcare Use Cases
To address concerns about suboptimal performance and training data bias in LLMs, several modifications have been investigated including (1) fine-tuning, (2) prompt engineering, (3) retrieval augmented generation (RAG), and (4) multi-agent frameworks [27]. A potential problem with incorporating proprietary LLMs into healthcare is that the output is generated from the unregulated domain of the Internet and thus may be outdated or incorrect. Such modifications can increase confidence in the validity of responses and transparency regarding the retrieved data, allowing for novel use cases of LLMs.
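As a minimal sketch of the RAG idea, a system retrieves the most relevant vetted document and prepends it to the model's prompt so that the answer is grounded in trusted text. The knowledge-base snippets and keyword-overlap scoring below are illustrative placeholders, not a production retriever.

```python
# Toy retrieval augmented generation (RAG) pipeline.
def score(query, doc):
    """Crude relevance score: number of shared lowercase words."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d)

# Stand-in for a curated, institution-vetted knowledge base.
knowledge_base = [
    "Rotator cuff repair rehabilitation typically begins with passive motion.",
    "Total hip arthroplasty implant templating uses preoperative radiographs.",
]

def build_prompt(query):
    """Retrieve the best-matching document and ground the prompt in it."""
    best = max(knowledge_base, key=lambda doc: score(query, doc))
    return (
        "Answer using ONLY the context below. If the context is "
        "insufficient, say so.\n"
        f"Context: {best}\n"
        f"Question: {query}"
    )

prompt = build_prompt("When does rehabilitation start after rotator cuff repair?")
print(prompt)
```

Real systems replace the word-overlap score with embedding similarity over a vector database, but the structure—retrieve, then generate from the retrieved context—is the same.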
Multi-agent frameworks, also known as agentic augmentation, can be thought of as a simultaneous collaboration among multiple LLMs, each with a specialized function [8,9]. By implementing a modular approach to tasks, multi-agent frameworks can simplify seemingly complex and demanding workflows, allowing for more efficient function. In these frameworks, each LLM has a responsibility based on its strengths. For example, a framework may include an LLM that can break down a large amount of complex input data into smaller samples, recruit multiple LLMs to interpret sections of an input, evaluate a proposed response for appropriateness, and even incorporate human input into the process. Some multi-agent frameworks allow access to Internet resources in real-time to coordinate various processing steps within the workflow and incorporate contemporary information or data points. A surgical multi-agent framework could in theory include LLMs dedicated to tasks including assisting with anesthesia and monitoring patient vital signs (with alerts for deviations from normal limits), monitoring real-time implant and tray availability, and tracking surgical procedure duration and turnover times simultaneously.
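The modular division of labor described above can be sketched with stub "agents": each is a function with one narrow responsibility, and an orchestrator routes work between them. In practice each function would wrap a call to an LLM; the agent names and logic here are hypothetical stand-ins.

```python
def splitter_agent(document):
    """Break a long input into smaller chunks for parallel handling."""
    return [document[i:i + 40] for i in range(0, len(document), 40)]

def summarizer_agent(chunk):
    """Stand-in for an LLM that summarizes one chunk of input."""
    return chunk.split(".")[0]  # keep only the first sentence fragment

def reviewer_agent(draft):
    """Stand-in for an LLM that evaluates a proposed response."""
    return len(draft.strip()) > 0  # trivially "approve" non-empty drafts

def orchestrator(document):
    """Coordinate the agents; escalate to a human if review fails."""
    chunks = splitter_agent(document)
    summaries = [summarizer_agent(c) for c in chunks]
    draft = " ".join(summaries)
    if not reviewer_agent(draft):
        raise ValueError("Reviewer rejected the draft; escalate to a human.")
    return draft

note = "Patient reports shoulder pain. Exam shows limited abduction. Imaging pending."
print(orchestrator(note))
```

The value of the pattern is that each agent can be tested, swapped, or supervised independently, which is harder with a single monolithic model call.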
Applications of Generative AI in Musculoskeletal Health
Most generative AI models have not been fully vetted through the regulatory frameworks required for responsible and safe deployment in a clinical setting due to challenges surrounding the complexity of AI systems, the rapidly evolving and dynamic nature of technology and products, medicolegal considerations of patient privacy and data security, and challenges in harmonization across international borders [35]. However, several proprietary models have been explored at smaller scales for feasibility and efficacy after obtaining preliminary consent and device clearances. The primary use cases applicable to musculoskeletal clinical contexts include automating clinical intake and scheduling, generating documentation, processing billing and prior authorizations, and monitoring patients [54]. Others include communications, diagnostic and prognostic treatment insights, and surgical training and planning.
Administrative Tasks and Medical Document Generation
Medical document generation, clinical intake, and scheduling are notoriously time- and work-intensive processes [19]. A broad outline of a typical musculoskeletal clinical workflow makes clear the areas in which generative AI can be applied: patient scheduling, completion of clinical intake prior to or upon arrival, acquisition of potentially relevant imaging of an affected joint or extremity, clinical evaluation by a specialist integrating patient history and imaging, clinical documentation with a proposed treatment plan, and post-encounter communication. An indicated treatment often requires prior authorization from the payer, introducing several “pain points” that impose a cognitive or physical burden: extensive effort is required for clinical chart review, analysis of patient imaging, documenting a medical note for the chart and billing purposes, and responding to messages or calls from patients [19,52]. These critically important yet time-consuming processes place an even greater demand on providers already experiencing time constraints due to high patient volumes and workload demands.
Several healthcare technology start-ups have been developing and deploying generative AI solutions for such challenges. For example, Veradigm and Thoughtful AI are both leveraging generative models to improve the patient experience, optimize resource utilization, and decrease administrative burden and associated costs through automating patient scheduling. Another advantage of this automated process is its 24-hour availability compared with the defined work hours of human staff.
The automation of clinical intake may also enhance the patient experience through decreasing pre-visit wait times while assisting providers by generating medical documents, templates, or treatment plans based on a patient’s medical intake. AllaiHealth, Inc., a healthcare technology start-up, provides an AI-driven intake platform that performs smart screening to triage patients to the correct providers, eliminates the need for pre-charting, automatically generates history-of-present-illness summaries, presents differential diagnoses, and provides patient education and potential treatment plans based on possible diagnoses and evidence-based guidelines.
One can imagine a foundation model pipeline in which information from clinical intake and a physical exam are integrated with imaging from a picture archiving and communication system to provide real-time risk prediction and treatment prognosis. ML has been used for real-time probability generation and risk stratification in musculoskeletal research for several years [18], but the rapid evolution of capabilities that are now inherent in foundation models may further improve the opportunity to personalize treatment. Currently, ambient scribes can be implemented for medical document generation alone. Integration of ambient scribing systems that utilize AI in real time to record and document office-based conversations as well as translate physical exam findings and pertinent information into existing risk probability models may further expand their utility in patient forecasting [7,53]. Regardless, AI solutions that perform documentation may not only standardize quality and minimize errors but also reduce the cognitive burden on providers.
Patient Communication
Responding to patient inquiries often overflows outside of work hours into a clinician’s “pajama time.” Several investigations have sought to use generative AI to address this burden on providers. Two studies published in
Payer Solutions and Prior Authorization
Prior authorization for treatment consumes significant time and may require frustrating peer-to-peer phone calls. This process has been consistently recognized as a burden for physicians, requiring considerable time for costly and inefficient treatment discussions [48]. Almost one-third of physicians reported prior authorization leading to serious adverse events for patients, while more than 75% of physicians reported that patients may abandon treatment due to prior authorization decisions and wait time [40]. Several healthcare start-ups have attempted to address this ongoing challenge. Cohere and Availity utilize AI to facilitate this process by matching payer utilization management criteria with prerequisites obtained from medical documentation. The use of AI can make the prior authorization process more efficient, transparent, and accessible.
Medical Imaging and Surgical Planning
Computer vision tasks leveraging generative AI also have relevant applications in musculoskeletal healthcare. Diffusion models can enable image-to-image transformation—for example, converting a T1 MRI sequence into a fluid-sensitive MRI sequence—as well as image conversion by producing 3-dimensional imaging (ie, computed tomography) from 2-dimensional imaging (ie, radiographs) [13]. This may decrease the cost and hazardous exposure associated with some advanced imaging. Researchers have explored utilizing diffusion models to generate synthetic pelvis and hip radiographs to create larger and more robust de-identified image research repositories that avoid issues of patient privacy [24]. Others have targeted surgical planning. For example, Rouzrokh et al [47] created a novel DL inpainting algorithm called THA-NET to demonstrate how a total hip arthroplasty (THA) implant would appear postoperatively using only a preoperative pelvis radiograph as an input. This application also allowed users to change the implant type and examine clinically important metrics, with each image retaining implant-specific features important for planning. Such use cases may enhance preoperative planning, allow for more patient-specific planning, and eventually integrate with intraoperative robotics and advanced surgical platforms.
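The inpainting concept behind such tools can be illustrated with a toy sketch in which only a masked region of an image is replaced while the rest is preserved exactly. A real system would generate the fill with a trained diffusion or DL model; here the "fill" is simply a mean value, and the array dimensions are arbitrary.

```python
import numpy as np

# Stand-in "radiograph": an 8x8 gradient image.
image = np.arange(64, dtype=float).reshape(8, 8)

# Mask marking the region to replace (e.g., the hip joint on a pelvis
# radiograph where an implant rendering would be generated).
mask = np.zeros_like(image, dtype=bool)
mask[2:5, 2:5] = True

# Inpainting: fill only the masked pixels; a generative model would
# synthesize realistic content here instead of a constant.
inpainted = image.copy()
inpainted[mask] = image[~mask].mean()

# Pixels outside the mask are preserved exactly -- the defining
# property of inpainting versus whole-image generation.
print(np.array_equal(inpainted[~mask], image[~mask]))
```

Keeping the unmasked anatomy untouched is what makes inpainting attractive for planning: the generated implant must remain consistent with the patient's real bone landmarks.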
Limitations of Generative AI Models
Major challenges concerning the use of generative AI in healthcare include the propensity for LLMs to hallucinate, knowledge cutoffs, propagation of bias, lack of transparency in decision-making, and ethical considerations [15]. These limitations must be considered when performing research and development in the realm of generative AI. More importantly for practitioners, these challenges must be addressed to ensure responsible clinical use of generative AI and to establish trust in augmenting clinical practice with AI-based solutions.
Model Output Hallucinations and Response Inaccuracies
Hallucinations are a well-described limitation of generative AI models, particularly LLMs [25,28,50]. The primary cause of a hallucination is a mismatch between the model’s programming (which requires it to respond to an input) and the depth of subject knowledge the model possesses. Unfortunately, hallucinations can be difficult to detect for users who are not experts in a topic. Such users may presume the output is credible—not surprising, perhaps, given that LLMs make use of a highly confident and proficient tone that mimics human language. An example follows:
Although the output may include a list of references that appear authentic in formatting and content, upon exploration the user would find that the studies do not exist. Without a human confirming its credibility, this hallucination could be further propagated and may mislead healthcare providers who interact with patients. Previously discussed modifications of LLMs such as fine-tuning and RAG may help overcome this limitation and mitigate the incidence of hallucinations.
Knowledge Cutoffs
While each successive LLM generally reflects training on a larger or more contemporary body of information, these models have limited ability to incorporate real-time updates as guidelines change or new information accumulates [6]. As a result, each LLM has a knowledge cutoff: the date through which its training data were gathered. Information created after the knowledge cutoff will not be included. Thus, depending on the version, clinical guidelines and medical knowledge may be outdated or information may be incomplete. This poses the risk of harm and medicolegal liability in the treatment of patients. Furthermore, when providing outputs, LLMs often fail to disclose uncertainty or gaps in information; this may further propagate incorrect knowledge. Static training data also contribute to hallucinations; for example, a query might be made on a musculoskeletal topic that is more recent or nuanced than the data available when the model was trained. This is akin to a user performing a query on robotic THA with an LLM developed prior to this technology, and the LLM subsequently providing a seemingly plausible output on this topic without any true knowledge.
Propagation of Bias
As is well known, disparities exist in medical data, and if these data are used to train LLMs, they may give rise to biases that affect patient health [32]. Omiye et al [43] demonstrated that outputs of all 4 of the leading LLMs showed examples of perpetuating race-based medicine. As generative AI produces novel output by mimicking training data, disparities and biases that are present in an original dataset would theoretically be propagated in model outputs. For example, bias in the form of restricting training data to a specific geographic region or patient demographic would make the model less generalizable. This is especially concerning for large generalized LLMs that are trained on vast amounts of unregulated Internet resources. In addition, human biases may influence methods of model design, evaluation, and performance. In the realm of computer vision, the propagation of bias is also a concern in diffusion models, as synthetic images may differ in meaningful ways if influenced by inherent biases in training data [23]. Generative models must therefore be rigorously evaluated prior to deployment in clinical settings, and those responsible for model development must proactively train models with inclusive and diverse datasets.
Transparency and Ethical Considerations
The inherent complexity of generative model architectures, neural networks, and the processes through which a model makes predictions based on its training data remains an area of poor transparency; this lack of transparency is often referred to as a “black box” [14]. It creates uncertainty and doubt for providers. Trustworthiness and accountability are essential components of healthcare, yet providers using AI solutions must take a leap of faith that the outputs are grounded in evidence-based reasoning. Therefore, developing methods for increasing model transparency and gaining insight into model decision-making is an important and ongoing area of research.
Important biomedical and ethical considerations are related to lack of transparency. Without clarity on how an AI model makes decisions, using the model in clinical settings will continue to present an ethical dilemma. For example, consider a provider who utilizes an AI-assisted tool in patient care without first validating its accuracy or fully understanding the decision-making process of the model. If the outcome is suboptimal, there is a question of where medicolegal fault is assigned. Is the responsibility of this poor outcome attributed solely to the provider opting to use an AI-assisted tool, the healthcare system that employs the provider and subscribes to the tool, or is it shared between both? Does the company that provides the AI tool bear any accountability, especially if the model was trained using datasets that may possess bias or disparities? This complex medicolegal circumstance requires a comprehensive and thoughtful regulatory framework and may deter providers and healthcare systems from adopting such technology [20,38]. The rapidly evolving nature of both AI technology and the tools already created introduces further challenges for regulatory bodies to monitor the safety and efficacy of generative AI solutions [12]. Significant effort and collaboration between AI companies, AI users such as providers and healthcare systems, and political entities and regulatory bodies will be necessary to ensure the responsible use of AI solutions in healthcare; all parties will need to adapt to a new environment in which the use of AI solutions is routine [20].
Conclusion
Generative AI encompasses ML and DL models developed on unlabeled training sets that create novel outputs based on acquired knowledge. Contemporary generative AI predominantly involves the use of LLMs and diffusion models. The breadth of relevant healthcare applications and use cases for generative AI solutions continues to expand as considerable investments in this technology are made by financial institutions. Indeed, generative AI has given rise to the vision of a digital ecosystem involving all aspects of an episode of care, from scheduling a patient visit and clinical decision-making to postoperative patient communication and surveillance. The ultimate integration of generative AI into healthcare is complex and will require caution due to the current lack of standardized regulations and a comprehensive bioethical framework that considers the risks of bias and patient harm.
Supplemental Material
Supplemental material, sj-pdf-1-hss-10.1177_15563316251335334 for Generative Artificial Intelligence and Musculoskeletal Health Care by Kyle N. Kunze in HSS Journal®
Footnotes
CME Credit
Declaration of Conflicting Interests
The author declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article. Kyle N. Kunze, MD, reports a relationship with AllaiHealth, Inc.
Funding
The author received no financial support for the research, authorship, and/or publication of this article.
Human/Animal Rights
All procedures followed were in accordance with the ethical standards of the responsible committee on human experimentation (institutional and national) and with the Helsinki Declaration.
Informed Consent
Informed consent was not required for this review article.
Required Author Forms
Disclosure forms provided by the author are available with the online version of this article as supplemental material.
References
