Sage Journals: Discover world-class research

Abstract

Objective

To provide a comprehensive overview of the current use of large language models in clinical medicine and surgery, with emphasis on model characteristics, clinical applications, and readiness for adoption.

Methods

A scoping review of studies on the use of large language models in clinical medicine and surgery was conducted in accordance with the Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA)-scoping review and JBI methodology (protocol registration: 10.37766/inplasy2025.3.0102). A comprehensive search of EMBASE, PubMed, CINAHL, and IEEE Xplore identified 3313 articles published between 2018 and 2023. After screening of articles and full-text review, 156 studies were included. Data were extracted for study type, sample size, clinical specialty, model architecture, training methods, application purpose, and performance metrics. Descriptive analyses were performed.

Results

Most studies were proof-of-concept studies (55.8%) or clinical trials (21.2%), with a steady rise in publications since 2022. Large language models were most frequently used for data extraction (69.9%), followed by clinical recommendations (11.5%), report generation (9.0%), and patient-facing chatbots (7.1%). Proprietary models were used in 57.7% of the studies, whereas 39.7% used open-source models. ChatGPT-3.5, ChatGPT-4, and Bidirectional Encoder Representations from Transformers (BERT) were the most commonly reported models. Only 25.0% of the studies reported models as ready for clinical use, whereas 67.9% stated that the models required further validation. F-score (30.8%) and area under the curve (15.4%) were the most common performance metrics; 10.9% of the studies used expert opinion for validation.

Conclusions

Large language models are increasingly being used in clinical medicine. Although most applications focus on data extraction and summarization, emerging studies are beginning to explore higher-level tasks such as clinical decision-making and multidisciplinary simulation. Significant heterogeneity continues to exist in model architecture, evaluation methods, and reporting standards. Further standardization is needed to develop transparent evaluation frameworks and ensure safe, reliable integration of large language models into complex clinical workflows.

Keywords

Large language models natural language processing generative pre-trained transformer machine learning artificial intelligence

Introduction

Natural language processing (NLP) and large language models (LLMs) represent a frontier in the field of artificial intelligence (AI), leveraging vast amounts of data to comprehend and generate human-like text.¹ Most modern LLMs are built on transformer architecture, a neural network design that allows for efficient parallel processing and long-range dependency tracking within text sequences. These models are typically pre-trained on large corpora using self-supervised learning objectives, which enable them to develop broad, generalizable language understanding capabilities that are currently being adapted for various clinical tasks.

Despite this shared foundation, LLMs differ in their architectural configuration, which in turn influences their function. Early LLMS such as Bidirectional Encoder Representations from Transformers (BERT) utilize an encoder-only architecture optimized for tasks that require text comprehension, such as classification and entity recognition (Table 1). Conversely, newer models such as generative pre-trained transformer (GPT) adopt a decoder-only architecture that enables them to generate coherent text sequences and engage in conversational tasks. However, hybrid models such as T5 and LLaMA employ an encoder–decoder structure, which allows them to understand and generate language within an identical framework.

Table 1.

Large language model architecture and differences.

Model	Year	Training	Clinical applications
Encoder-only
BERT	2018	- Pre-trained with masked language modeling to predict masked tokens in the text- Captures bidirectional context and is optimized for text comprehension⁴	- Token-level tasks including named entity recognition and electronic health record data extraction
Encoder–decoder
T5	2020	- Trained using a text-to-text framework with span corruption, where segments of text are masked.- Flexible for various NLP tasks, converting all tasks into text generation⁵	- Multi-step processing for summarization and guideline extraction- Flexibility in summarization, translation, and patient communication
Decoder-only
GPT-2	2019	- Autoregressive training to predict the next token in a sequence, enabling coherent text generation- Trained using a decoder-only architecture for free-form text generation⁶	- Free-form generation: clinical note drafting and discharge summaries- Patient-facing narratives and report templates
GPT-3	2020	- Uses in-context learning to infer task logic from input prompts without explicit training- Capable of performing various tasks with minimal fine-tuning⁷	- Dynamic clinical reasoning - Differential diagnosis generation
ChatGPT-4	2023	- Fine-tuned using reinforcement learning from human feedback to improve task-following and reasoning- Supports more advanced decision-making and complex instructions⁸	- Decision support- Patient chatbots- Documentation aid
LLaMA	2023	- Parameter-efficient autoregressive transformer trained on large public datasets- Trained on public datasets and optimized for fine-tuning⁹	- Local deployment for privacy- Customization for niche specialty-specific tasks
Gemini	2023	- Uses retrieval-augmented generation to integrate external data for contextually accurate responses- Optimized for interactive dialogue¹⁰	- Dynamic information delivery for public health - Evidence-grounded patient communication and education

BERT: Bidirectional Encoder Representations from Transformers; NLP: natural language processing; GPT: generative pre-trained transformer.

Model architecture in LLMs has a direct impact on their real-world applications and potential for improving clinical medicine and surgery. Encoder-only models such as BERT are well-suited for structured tasks such as extracting electronic health records. Decoder-only models such as GPT are better suited for generative applications such as medical note transcription. Encoder–decoder models can combine these computational properties and assist with complicated tasks such as clinical decision support. Additionally, retrieval-augmented generation has further extended model capabilities by integrating external databases, enabling the generation of contextually grounded, evidence-based responses in real time.^2,3

Validated models such as ClinicalBERT and BioBERT are available for medical applications, but they are typically used for medical chart extraction and medical exam administration.^11–13 These tasks do not take advantage of the computational potential of modern LLMs, and preliminary studies have shown that these models can be applied to complicated tasks such as diagnostics and clinical decision support.^14–16 At present, however, LLM use in medical diagnostics and clinical practice is challenged by the lack of standardized protocols and reporting guidelines for LLM development, training, and performance evaluation. To address this knowledge gap, we conducted a scoping review to (1) describe the adoption of LLMs in clinical medicine and surgery, (2) evaluate the role of LLMs in advanced medical application, and (3) detail modern LLM training and reporting practices in medicine.

Materials and methods

Protocol and registration

We performed a scoping review of studies on the use of LLMs in clinical medicine and surgery. We used the Preferred Reporting Items for Systematic reviews and Meta-Analyses (PRISMA)-scoping review guidelines to aid the reporting of the study.¹⁷ This proposed scoping review was conducted in accordance with JBI methodology for scoping reviews and was registered retrospectively at INPLASY (registration DOI: 10.37766/inplasy2025.3.0102).

Eligibility criteria

We included all study types that directly measured the validity of utilizing LLMs in a clinical context, except for narrative reviews, systematic reviews, and other scoping reviews.

For the purposes of this review, LLMs refer broadly to transformer-based models that have been pre-trained on large text corpora and exhibit NLP capabilities applicable to clinical tasks. This broad definition encompasses open-source models based on transformer architecture (e.g. GPT, BERT, and LLaMA) as well as proprietary models designed for specific purposes. Although not all encoder-based models are generative in nature, they are included in this analysis because of their frequent deployment in clinical NLP applications under the broader umbrella of transformer-based LLMs.

We included studies that evaluated the performance of LLMs in diagnosis, treatment, and/or management in a clinical or simulated clinical setting, including nonhuman synthetic clinical datasets and simulated clinical workflows. We included the following medicine and surgical specialties: cardiology, emergency medicine, endocrinology, gastroenterology, general surgery, genetics, geriatrics, hematology, intensive care, infection disease, internal medicine, neurology, obstetrics and gynecology, oncology, ophthalmology, orthopedic surgery, otolaryngology, pathology, pediatrics, primary care, psychiatry, radiology, respirology, and urology.

We excluded animal studies and biological simulations on the basis that the results of such studies could not be reliably translated into clinical practice. Studies were also excluded if the authors used LLMs outside of clinical contexts or if they were used to aid exam-taking, medical education, or research planning. Although these studies offer valuable insight into LLM performance, they do not offer direct insight into how LLMs could impact patient care. Additionally, studies were excluded if the performance of LLMs was not the outcome measure, as these studies typically do not provide a quantifiable measure of performance for comparison. Next, studies were excluded if they were conducted in the context of allied health services, dentistry, pharmacy, and/or public health settings. Although we believe that these are important domains, we choose to maintain a narrow focus for our review of studies on only clinical medicine and surgery.

Database search

The search strategy aimed to retrieve both published and unpublished studies. A three-step search strategy was utilized in this review. First, an initial limited search of EMBASE (Ovid) and CINAHL was undertaken to identify articles on the topic. The text of the titles, abstracts of relevant articles, and the index terms used to describe the articles were used to develop a full search strategy with help from an information specialist. The search strategy, including all identified keywords and index terms, was adapted for each included database. The search strategy is presented in Appendix 1. PubMed and IEEE Xplore were utilized to include gray literature and conference proceedings, given the timely nature of the topic. The reference lists of all included studies were also screened for additional studies. Only studies published in English were included. Studies published between January 2018 and October 2023 were included. We opted to exclude any publications prior to 2018, given the introduction of the BERT model in this year and the fact that prior models would not have utilized the same architecture as that of the current language models.²

Article selection and data abstraction

Following the search, all identified citations were collated and uploaded into Covidence, and duplicates were removed. Following a pilot test, titles and abstracts were screened by two independent reviewers (EL & SP) for assessment against the inclusion criteria for the review. Potentially relevant sources were retrieved in full, and their citation details were imported into Covidence. The full text of selected citations was assessed in detail against the inclusion criteria by two independent reviewers. Reasons for the exclusion of sources of evidence in the full text that do not meet the inclusion criteria were recorded and reported in the scoping review. Any disagreements that arose between the reviewers at each stage of the selection process were resolved through consensus or discussion with an additional reviewer (PS). The results of the search and the study inclusion process were reported in full in the final scoping review and presented in a PRISMA flow diagram.

Data were extracted from the studies included in the scoping review by two independent reviewers using a data extraction tool developed by the reviewers. The data extracted included specific details about the participants, concept, context, study methods, and key findings relevant to the review question.

A draft extraction form is provided (Appendix 2). The draft data extraction tool was modified and revised as necessary during the process of extracting data from each included evidence source. Included studies were assessed for the type of medical specialty, type of study, purpose of LLM use, name of LLM, open-source vs. proprietary, trained vs. untrained, training tool, workflow, sample size, outcome measures and value, readiness for clinical application, and conclusions drawn by the authors. When required, the authors of papers were contacted to request missing or additional data.

Statistical analysis

Given the exploratory nature of this scoping review, we conducted a descriptive statistical analysis to characterize the included studies. The primary aim was to provide an overview of the landscape of LLM use in clinical medicine. No data manipulation other than a grouping of data fields was performed. For instance, sample sizes were grouped into four categories: small (≤10), intermediate (11–100), large (101–1000), and extra-large (>1000). LLM application types, model accessibility, training methods, and readiness for clinical use were all coded based on predefined ordinal or categorical schemes. Free-text responses for the type of training, workflow, and conclusions were categorized based on conceptual similarities. Given the heterogeneity of the study design, there were no standardized definitions of clinical readiness. Hence, we structured our protocols for determining readiness based on prior systematic reviews in AI.¹⁸ We developed categories of model readiness, whereby models deemed as not requiring further validation were interpreted as “ready.” Models requiring further training or validation were interpreted as “requires further improvement,” whereas those deemed to have no role in future clinical application were deemed “failed.”

Results

Database search results

Of the 3313 identified articles, 156 were eligible and included in this scoping review (Figure 1).^19–172 A significant year-on-year increase in the number of studies published was observed from 2018 to 2023 (Figure 2). The characteristics of the studies are reported in Table 2 and Appendix 3. Most studies were proof-of-concept or simulated prospective cohort studies (n = 87, 55.8%), followed by nonrandomized clinical trials (n = 33, 21.2%), cohort studies (n = 28, 17.9%), and case studies (n = 8, 5.1%). Study sample sizes varied from individual case studies to large datasets containing up to 50,000,000 patients. When grouped by sample size, 14 (9.0%) of the included studies had a sample size less than 10, 53 (34.0%) had sample sizes between 11 and 100, 29 (18.6%) had sample sizes between 101 and 1000, and 60 (38.5%) had sample sizes greater than 1000. The most common specialties were radiology (12.2%), internal medicine (9.6%), psychiatry (7.7%), and neurology (7.1%).

Figure 1.

PRISMA flowchart. PRISMA: Preferred Reporting Items for Systematic reviews and Meta-Analyses.

Figure 2.

Number of studies published by year.

Table 2.

Baseline characteristics of the included studies.

	Number of studies	Percentage of studies (%)
Type of study
Clinical trials	33	21.2
Proof-of-concept studies	87	55.8
Cohort studies	28	17.9
Case studies	8	5.1
Sample size
Small (n ≤ 10)	19	12.2
Intermediate (10 < n ≤ 100)	53	34.0
Large (100 < n ≤ 1000)	29	18.6
Extra-large (n > 1000)	60	38.5
Specialty of clinical medicine
Cardiology	7	4.5
Emergency	5	3.2
Endocrinology	1	0.6
Gastroenterology	8	5.1
General surgery	5	3.2
Genetics	2	1.3
Geriatrics	2	1.3
Hematology	2	1.3
Intensive care	3	1.9
Infectious disease	1	0.6
Internal medicine	14	9.0
Neurology	10	6.4
Neurosurgery	4	2.6
Obstetrics and gynecology	4	2.6
Oncology	6	3.8
Ophthalmology	3	1.9
Orthopedic surgery	4	2.6
Otolaryngology	7	4.5
Pathology	4	2.6
Pediatrics	5	3.2
Plastic surgery	2	1.3
Primary care	12	7.7
Psychiatry	14	9.0
Radiology	19	12.2
Respirology	8	5.1
Urology	3	1.9
Vascular surgery	1	0.6

Characteristics of LLM studies and tools

All studies reported on the function and performance of LLMs in various clinical scenarios. The most common use for LLMs was data extraction (n = 109, 69.9%), followed by clinical recommendations (n = 18, 11.5%), report generation (n = 14, 9.0%), patient chatbot (n = 11, 7.1%), and tumor board simulation (n = 2, 1.3%). Table 3 shows the characteristics of the LLMs utilized in the included studies. The most common type of LLMs used were proprietary models (n = 90, 57.7%), followed by open-source models (n = 62, 39.7%). Four studies did not report the source of their LLMs. Many studies failed to mention the names of the models used (n = 52, 33.3%). The most commonly reported models included ChatGPT-3.5 (n = 18, 11.5%), ChatGPT-4.0 (n = 18, 11.5%), and BERT (n = 8, 5.1%). Most studies reported on the training status of the tested LLM, with details available in Table 4. Most of the included studies utilized trained models (n = 88, 56.4%). Existing patient data were the most popular resource for model training (n = 67, 42.9%), followed by published clinical guidelines (n = 19, 12.2%) and a combination of both (n = 7, 4.5%). Several studies utilized untrained models (n = 63, 40.4%). Five of the included studies did not mention the training status of their LLMs.

Table 3.

Summary of large language models used in clinical settings.

	Number of studies	Percentage of studies (%)
Purpose of LLM use
Data extraction	109	69.9
Patient chatbot	11	7.1
Clinical recommendation	18	11.5
Tumor board	2	1.3
Report generation	14	9.0
Accessibility
Not available	4	2.6
Open-source	62	39.7
Proprietary	90	57.7
Mixed	0	0

Table 4.

Frequency of large language model training methods and readiness for clinical application.

	Number of studies	Percentage of studies (%)
Training
Unknown	5	3.2
Trained	88	56.4
Untrained	57	36.5
Mixed	6	3.8
Training data
Untrained	63	40.4
Guidelines	19	12.2
Patient data	67	42.9
Mixed	7	4.5
Readiness for application
Ready	39	25.0
Needs improvement	106	67.9
Failed	11	7.1
Measures of accuracy
Accuracy of data extraction	6	3.8
Agreement with expert opinion	17	10.9
Area under the curve	24	15.4
Comparison with guideline recommendations	6	3.8
Diagnostic accuracy	25	16.0
F-score	48	30.8
Improvement in the accuracy of performance measures	2	1.3
Positive predictive value	3	1.9
Sensitivity, specificity	11	7.1
Others	14	9.0
AUC improvement	1
BLEU-1	1
C-index	1
C-statistics	1
DISCERN score	1
Gleason score	1
Macro-averaged mean	1
Mean and standard deviation of correct diagnosis positions	1
Patient-reported adherence	1
Precision, recall	1
ROUGE-L improvement	1
Safety score	1
Sensitivity	1
Usefulness for clinical documentation	1

Description of clinical workflow

Despite the varying specific parameters and instructions, the studies examined can be largely grouped into two distinct workflow regimes. Most studies (n = 111, 71.2%) followed a Structured Information Extraction and Data Processing workflow, whereby models systematically parsed and extracted data from clinical texts. Techniques such as sequence labeling, conditional random fields, and custom tokenization schemes were utilized to identify medical entities including symptoms, medications, and diagnoses. Several studies also employed rule-based systems and feature-engineering approaches to enhance precision, enabling the extraction of structured data that could be mapped to databases or electronic health records. These workflows focused on high-fidelity data extraction to support downstream tasks such as automated coding, predictive analytics, and clinical documentation management.

A smaller group of studies (n = 45, 28.8%) leveraged contextual understanding through deep learning architectures, particularly transformer-based models, allowing for greater flexibility in interpreting unstructured clinical data. These models utilized attention mechanisms and positional encodings to capture complex relationships between tokens across long sequences of text. This feature facilitated capabilities such as summarization, semantic understanding of patient histories, and even potential storage in vector databases for similarity-based retrieval of clinical reports. Most of these studies performed pre-training on large corpora or incorporated advanced techniques such as masked language modeling, encoder–decoder architectures, and self-supervised learning. These methods fine-tuned the models for specific clinical tasks and enabled them to generate coherent, contextually appropriate outputs based on dynamic input.

Measures of accuracy

A wide range of outcome measures were used to evaluate the performance and accuracy of LLMs across the included studies. The most frequently reported metric was the F-score, which was cited in 48 studies (30.8%), followed by the area under the curve (AUC), which was reported in 24 studies (15.4%). Diagnostic accuracy was reported in 25 studies (16.0%), whereas agreement with expert opinion was reported in 17 studies (10.9%). Measures such as sensitivity and specificity were reported together in 11 studies (7.1%) and positive predictive value in 3 studies (1.9%). A smaller number of studies utilized alternative or task-specific measures, including accuracy of data extraction (n = 6, 3.8%), comparison with guideline recommendations (n = 6, 3.8%), and improvement in the accuracy of performance measures (n = 2, 1.3%). Additionally, a subset of studies (n = 14, 9.0%) reported other specialized metrics, such as C-index, BLEU-1, DISCERN score, Gleason score, macro-averaged mean, precision and recall, and ROUGE-L improvement.

Analysis of LLM use by medical specialties

The type of LLM utilized varied across medical and surgical specialties, reflecting differing clinical needs, access to data, and stages of adoption. A detailed report of LLMs used by different medical specialties is provided in Appendix 4.

Radiology (n = 19, 12.2%) reported the greatest diversity in model use, including ChatGPT-3.5 (n = 4), ChatGPT-4 (n = 1), BERT variants (n = 5), and specialized tools such as Empolis, PubMed BERT, and CHARTextract. These were primarily used for data extraction (n = 16) and report generation (n = 3), with only five studies (26.3%) reporting readiness for immediate clinical use.

Psychiatry (n = 14, 9.0%) similarly demonstrated a wide range of models, including BioBERT (n = 1), XLM-RoBERTa (n = 1), PsyBERTpt (n = 1), MotiVAte (n = 1), and LLaMA-7B (n = 1), with a focus on data extraction (n = 9) and patient chatbot tasks (n = 2). Six studies (42.9%) reported their models as ready for immediate clinical use.

In internal medicine (n = 14, 9.0%), a combination of ChatGPT (n = 2), Bi-LSTM, GatorTron, and other custom models was used. Data extraction (n = 9) remained the dominant function, with only three studies (21.4%) reporting readiness for real-world integration.

Primary care (n = 12, 7.7%) utilized a wide range of models, including ClinicalBERT, Hi-BEHRT, GatorTron, and ChatGPT variants. These were applied for data extraction (n = 5), clinical recommendation (n = 4), and report generation (n = 2) tasks, with five studies (41.7%) reporting readiness for clinical use.

Specialties such as hematology, intensive care, neurology, and genetics often relied on niche or purpose-built models, including EHRead, NLP-Dx-BD, ASA, and Bio_ClinicalBERT. These were mostly employed for structured data parsing and clinician support, with no studies in these fields reporting models as ready for clinical use.

Clinical utility of LLMs

The threshold for the adoption of LLMs into clinical workflows varied widely across the included studies. Summarized results are reported in Table 4. Eleven studies (7.1%) reported outright failure of LLMs to achieve meaningful progress toward their intended outcomes. These failures were predominantly associated with untrained, open-source models that produced inaccurate or unreliable responses when applied to clinical scenarios. Only one of the failed models had been trained using patient data; this proprietary LLM in the field of otolaryngology became overly sensitive to the outcome of interest, resulting in an unacceptably high false-positive rate.

The majority of the studies (n = 106, 67.9%) suggested that their models showed potential to enhance clinical workflows but required further validation before they could be considered for real-world implementation.

A subset of studies (n = 39, 25.0%) reported that their LLMs were ready for immediate clinical use. These included both trained models (n = 32, 82.1%) and untrained models (n = 7, 17.9%), applied across specialties such as cardiology, emergency medicine, endocrinology, gastroenterology, internal medicine, and otolaryngology. Almost all of these ready-to-deploy models were used for data extraction (n = 36, 92.3%) and report generation (n = 3, 7.7%), which are tasks that typically present lower risk and complexity, making them more suitable for early integration into clinical environments compared with models designed for diagnosis or treatment recommendations.

Discussion

Herein, we presented a scoping review of the clinical applications of LLMs in medicine and surgery. The upward trend in the number of studies being published each year demonstrates the growing importance and interest in language models associated with the medical community. Although most studies employed LLMs for foundational tasks such as data extraction, relatively few studies explored their use in higher-function clinical applications, including simulating tumor boards or providing clinical recommendations. These higher-order tasks require not only accurate information retrieval but also complex contextual reasoning, patient-specific tailoring, and integration of evolving clinical guidelines. The use of models such as PsyBERTpt and ChatGPT-4 reflects the growing interest in leveraging LLMs in these areas, although further research is needed to ensure consistency and reliability in outputs.

We revealed that 41.2% of the LLMs were open-source, although a substantial number of studies failed to detail the exact model. Open-source LLMs carry the advantage of being low-cost, often with better technological support, but sometimes do not offer the quality of proprietary models.¹⁷³ Proprietary models are designed for precision and can offer superior privacy protection for sensitive patient data when run using local storage.¹⁷⁴ They, however, require a longer time to produce and incur a greater financial burden.¹⁷⁵

Studies exploring the use of ChatGPT and BERT variants continue to dominate the literature. There remains a notable underrepresentation of studies utilizing hybrid or domain-adaptive models, which are developed for use in resource-constrained environments such as mobile health platforms, rural clinics, or offline decision support tools.^9,176 Despite the potential of these models, a few studies prior to our review have evaluated the clinical performance of these models beyond structured information extraction tasks. This gap in the literature highlights a critical opportunity for future research to assess the real-world impact of deploying domain-adaptive and locally fine-tuned LLMs in diverse clinical settings.

Most studies also reported on the training status of their models and type of datasets used but did not specify the intensity of their fine-tuning regime or the specificity of the datasets. Existing literature suggests that the intensity of fine-tuning and the quality of data used for training have a direct impact on the accuracy of the output model irrespective of using open-source or proprietary models.¹⁷⁷ As interest in LLMs continues to increase, there is a need for better transparency in reporting as well as standardization of model selection and training regimes.¹⁷⁸

Additionally, there was substantial heterogeneity in the methods used to assess the accuracy and performance of LLMs across studies. A total of 14 different outcome metrics were identified, with wide variation in their frequency and application. The most commonly reported measure was the F-score, used in 48 studies, followed by the AUC in 24 studies. Moreover, there were task-specific or less commonly used measures such as C-index, BLEU-1, and DISCERN scores. The use of expert opinion as a reference standard was common in early-phase studies, particularly for evaluating models in subjective or interpretive tasks. Although this approach provides clinically grounded validation, it is inherently limited by inter-rater variability and potential bias.¹⁷⁹ The absence of a standardized benchmark across studies relying on expert evaluation makes it difficult to compare performance across different models or specialties. Furthermore, expert agreement may reflect consistency rather than correctness, which poses challenges in domains where clinical guidelines are evolving or where expert consensus is lacking.¹⁸⁰

Comparison of metric selection revealed important considerations for interpreting model performance. The F-score, which represents the mean of precision and recall, is frequently used in information retrieval and classification tasks where imbalanced datasets are common. Its utility lies in penalizing models that over-prioritize sensitivity or specificity, offering a more balanced view of performance.¹⁸¹ However, the F-score does not provide information about the model’s ability to discriminate between classes across thresholds, limiting its interpretability in probabilistic outputs.¹⁸² In contrast, AUC is commonly used in classification problems and reflects the model’s ability to distinguish positive cases from negative ones across all thresholds.¹⁸³ This metric is threshold-independent and intuitive to interpret but may be less informative in datasets with class imbalance, where high AUC values can be achieved despite poor real-world performance.¹⁸⁴ Furthermore, AUC does not penalize misclassification severity, which may be clinically relevant in high-risk or safety-critical applications.¹⁸⁵ Given the diversity of tasks and model architectures, the choice of evaluation metric should be tailored to the specific use case. Studies evaluating diagnostic classification may benefit from reporting both AUC and F-score in tandem, whereas those focusing on entity recognition or summarization should prioritize token-level precision, recall, and agreement with structured reference standards.

The current variability in performance assessment highlights the need for standardized, task-specific evaluation frameworks to ensure comparability and reproducibility across future LLM studies. Recent publications in LLM development have proposed frameworks for fine-tuning models to specific tasks as well as objective scoring systems for measuring output accuracy.¹⁸⁶ These were not adopted in the training of the models reviewed in this study but are an important area for future studies.¹⁸⁷ Additionally, few studies tend to align model evaluation with clinical impact, such as assessing how language model outputs affect patient outcomes. To address this gap, future research should move toward more comprehensive evaluation frameworks that capture both technical and real-world performance. One such approach is the Holistic Evaluation of Language Models framework, which emphasizes multidimensional benchmarking across criteria such as accuracy, robustness, fairness, and efficiency.¹⁸⁸ When applied to clinical settings, such a framework could incorporate patient-centered outcome measures, such as changes in treatment decisions, patient satisfaction, or care quality metrics, to evaluate whether models meaningfully improve clinical care. Broader adoption of such evaluation strategies will be critical for ensuring that LLMs deployed in healthcare are not only technically performant but also clinically impactful.

The current literature on LLMs in clinical contexts is limited by methodological and practical challenges. Many studies are exploratory, rely on retrospective or synthetic data, and lack external validation, reducing generalizability. Models are often tested in narrow or idealized settings, making real-world applicability uncertain. Inconsistent outcome metrics and the absence of standardized benchmarks further hinder meaningful comparisons across studies. LLMs themselves present technical limitations. Generative models such as ChatGPT are prone to hallucinations and operate as black boxes, limiting transparency in clinical decision-making. Most of these models lack real-time access to clinical data, and many embed training data biases that risk perpetuating disparities. Additionally, the computational demands of LLMs, dependence on proprietary platforms, and limited fine-tuning options for domain-specific use pose barriers to adoption. To improve reproducibility and rigor, future studies should clearly describe model selection, training protocols, and performance metrics and justify assessments of clinical readiness. Addressing these issues will enhance methodological transparency and facilitate more accurate cross-study comparisons.

Conclusion

Our study is among the first to provide a comprehensive overview of the current landscape of LLMs and their clinical utility in medicine and surgery. Radiology and medical specialties were the most active areas of study, with ChatGPT and BERT being the most commonly used models. Although most studies focused on low-risk tasks such as data extraction and documentation, we also identified emerging efforts to explore higher-order applications, such as clinical decision-making and simulated multidisciplinary discussions. These advanced uses represent a novel and important frontier for LLM integration in healthcare, requiring models that are both contextually aware and clinically reliable. However, significant heterogeneity in model training, evaluation, and reporting standards persists. Standardization in model validation, alongside focused development of interpretable tools for complex clinical reasoning, is essential to ensure safe, transparent, and impactful deployment of LLMs in patient care.

Footnotes

Acknowledgements

None.

Author contributions

Eric Liang: Research question formulation, search strategy development, data extraction, data analysis, manuscript drafting and editing.

Sophia Pei: Search strategy development, data extraction, data analysis, manuscript drafting.

Phillip Staibano: Research question formulation, search strategy development, manuscript editing.

Benjamin van der Woerd: Research question formulation, manuscript editing.

Data availability statement

All data generated or analyzed during this study are included in this published article and its supplementary information files.

Declaration of conflicting interests

None.

Ethics approval

Not applicable.

Funding

Not applicable.

ORCID iD

Eric Nan Liang

References

Vaswani

Shazeer

Parmar

, et al. Attention is all you need. In: Advances in neural information processing systems 30: annual conference on neural information processing systems, Long Beach, California, USA, 4–9 December 2017, pp. 5998–6008.

Zhang

Sun

Chen

, et al. Gpt4roi: instruction tuning large language model on region-of-interest. In: European conference on computer vision. Cham: Springer, 2025, pp. 52–70.

Chen

Practical techniques to constraint LLM output in JSON format. Medium, https://mychen76.medium.com/practical-techniques-to-constraint-llm-output-in-json-format-e3e72396c670 (2023, accessed 26 March 2025).

Devlin

Chang

Lee

, et al. Bert: pre-training of deep bidirectional transformers for language understanding. In: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, 2019, volume 1, pp. 4171–4186.

Raffel

Shazeer

Roberts

, et al. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research 2020; 21: 1–67.

Radford

Child

, et al. Language models are unsupervised multitask learners. OpenAI Blog 2019; 1: 9.

Brown

Mann

Ryder

, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems 2020; 33: 1877–1901.

Achiam

Adler

Agarwal

, et al. Gpt-4 technical report. arXiv preprint arXiv Epub ahead of print 15 March 2023. DOI: 10.48550/arXiv.2303.08774.

Touvron

Lavril

Izacard

, et al. LLaMA: open and efficient foundation language models. arXiv preprint arXiv Epub ahead of print 27 Feb 2023. DOI: 10.48550/arXiv.2302.13971.

10.

Team

Anil

Borgeaud

, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv Epub ahead of print 19 December 2023. DOI: 10.48550/arXiv.2312.11805.

11.

Huang

Altosaar

Ranganath

Clinicalbert: modeling clinical notes and predicting hospital readmission. arXiv preprint arXiv Epub ahead of print 10 April 2019. DOI: 10.48550/arXiv.1904.05342.

12.

Lee

Yoon

Kim

, et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020; 36: 1234–1240.

13.

Lai

Hsu

, et al. Evaluating the performance of ChatGPT-4 on the United Kingdom medical licensing assessment. Front Med (Lausanne) 2023; 10: 1240915.

14.

Kreimeyer

Foster

Pandey

, et al. Natural language processing systems for capturing and standardizing unstructured clinical information: a systematic review. J Biomed Inform 2017; 73: 14–29.

15.

Athavale

Baier

Ross

, et al. The potential of chatbots in chronic venous disease patient management. JVS Vasc Insights 2023; 1: 100019.

16.

Omar

Brin

Glicksberg

, et al. Utilizing natural language processing and large language models in the diagnosis and prediction of infectious diseases: a systematic review. Am J Infect Control 2024; 52: 992–1001.

17.

Tricco

Lillie

Zarin

, et al. PRISMA extension for scoping reviews (PRISMA-ScR): checklist and explanation. Ann Intern Med 2018; 169: 467–473.

18.

Rangasamy

Artificial intelligence in cancer diagnosis: a systematic review. Research Square Epub ahead of print 6 June 2023. DOI: 10.21203/rs.3.rs-3017637/v1.

19.

Decker

Trang

Ramirez

, et al. Large language model−based chatbot vs surgeon-generated informed consent documentation for common procedures. JAMA Netw Open 2023; 6: e2336997.

20.

Reynolds

Bunch

Steinberg

, et al. Novel methodology for the evaluation of symptoms reported by patients with newly diagnosed atrial fibrillation: application of natural language processing to electronic medical records data. J Cardiovasc Electrophysiol 2023; 34: 790–799.

21.

Manolitsis

Feretzakis

Tzelves

, et al. Training ChatGPT models in assisting urologists in daily practice. Stud Health Technol Inform 2023; 305: 576–579.

22.

Goldhoff

Fuller

, et al. Using natural language processing to improve discrete data capture from interpretive cervical biopsy diagnoses at a large health care organization. Arch Pathol Lab Med 2023; 147: 222–226.

23.

Diaz-Asper

Chandler

Turner

, et al. Increasing access to cognitive screening in the elderly: applying natural language processing methods to speech collected over the telephone. Cortex 2022; 156: 26–38.

24.

Clapp

Kim

James

, et al. Comparison of natural language processing of clinical notes with a validated risk-stratification tool to predict severe maternal morbidity. JAMA Netw Open 2022; 5: e2234924.

25.

Rahman

Nowakowski

Agrawal

, et al. Validation of a natural language processing algorithm for the extraction of the sleep parameters from the polysomnography reports. Healthcare (Basel) 2022; 10: 1837.

26.

Karhade

Oosterhoff

Groot

, et al. Can we geographically validate a natural language processing algorithm for automated detection of incidental durotomy across three independent cohorts from two continents? Clin Orthop Relat Res 2022; 480: 1766–1775.

27.

Goodman-Meza

Shover

Medina

, et al. Development and validation of machine models using natural language processing to classify substances involved in overdose deaths. JAMA Netw Open 2022; 5: e2225593.

28.

Chillakuru

Munjal

Laguna

, et al. Development and web deployment of an automated neuroradiology MRI protocoling tool with natural language processing. BMC Med Inform Decis Mak 2021; 21: 213.

29.

Wyles

Osmon

, et al. Automated detection of periprosthetic joint infections and data elements using natural language processing. J Arthroplasty 2021; 36: 688–692.

30.

Lee

Brumback

Lober

, et al. Identifying goals of care conversations in the electronic health record using natural language processing and machine learning. J Pain Symptom Manage 2021; 61: 136–142.e2.

31.

Ross

Zheng

Zhu

, et al. Accuracy of asthma computable phenotypes to identify pediatric asthma at an academic institution. Methods Inf Med 2020; 59: 219–226.

32.

Zand

Sharma

Stokes

, et al. An exploration into the use of a chatbot for patients with inflammatory bowel diseases: retrospective cohort study. J Med Internet Res 2020; 22: e15589.

33.

Karhade

Bongers

Groot

, et al. Natural language processing for automated detection of incidental durotomy. Spine J 2020; 20: 695–700.

34.

Vaci

Liu

Kormilitzin

, et al. Natural language processing for structuring clinical text data on depression using UK-CRIS. Evid Based Ment Health 2020; 23: 21–26.

35.

Steinkamp

Bala

Sharma

, et al. Task definition, annotated dataset, and supervised natural language processing models for symptom extraction from unstructured clinical notes. J Biomed Inform 2020; 102: 103354.

36.

Kim

Meystre

SM.

Ensemble method–based extraction of medication and related information from clinical texts. J Am Med Inform Assoc 2020; 27: 31–38.

37.

Hong

Wen

Stone

, et al. Developing a FHIR-based EHR phenotyping framework: a case study for identification of patients with obesity and multiple comorbidities from discharge summaries. J Biomed Inform 2019; 99: 103310.

38.

Zhou

Wang

Sohn

, et al. Automatic extraction and assessment of lifestyle exposures for Alzheimer’s disease using natural language processing. Int J Med Inform 2019; 130: 103943.

39.

Filice

RW.

Deep-learning language-modeling approach for automated, personalized, and iterative radiology-pathology correlation. J Am Coll Radiol 2019; 16: 1286–1291.

40.

Van Vleck

Chan

Coca

, et al. Augmented intelligence with natural language processing applied to electronic health records for identifying patients with non-alcoholic fatty liver disease at risk for disease progression. Int J Med Inform 2019; 129: 334–341.

41.

Tang

Yang

San Ang

, et al. Detecting adverse drug reactions in discharge summaries of electronic medical records using Readpeer. Int J Med Inform 2019; 128: 62–70.

42.

Carrodeguas

Lacson

Swanson

, et al. Use of machine learning to identify follow-up recommendations in radiology reports. J Am Coll Radiol 2019; 16: 336–343.

43.

Lee

Jensen

Levin

, et al. Accurate identification of colonoscopy quality and polyp findings using natural language processing. J Clin Gastroenterol 2019; 53: e25–e30.

44.

Afzal

Mallipeddi

Sohn

, et al. Natural language processing of clinical notes for identification of critical limb ischemia. Int J Med Inform 2018; 111: 83–89.

45.

Hai

Song

, et al. Multimodal data hybrid fusion and natural language processing for clinical prediction models. AMIA Jt Summits Transl Sci Proc 2024; 2024: 191.

46.

Reese

Danis

Caufield

, et al. On the limitations of large language models in clinical diagnosis. medRxiv Epub ahead of print 26 February 2024. DOI: 10.1101/2023.07.13.23292613.

47.

Perlis

RH.

Application of GPT-4 to select next-step antidepressant treatment in major depression. MedRxiv Epub ahead of print 18 April 2023. DOI: 10.1101/2023.04.14.23288595.

48.

Rau

Zöller

, et al. A context-based chatbot surpasses trained radiologists and generic ChatGPT in following the ACR appropriateness guidelines. Radiology 2023; 308: e230970.

49.

Nastasi

Courtright

Halpern

, et al. Does ChatGPT provide appropriate and equitable medical advice?: A vignette-based, clinical evaluation across care contexts. MedRxiv Epub ahead of print 1 March 2023. DOI: 10.1101/2023.02.25.23286451.

50.

Rao

Pang

Kim

, et al. Assessing the utility of ChatGPT throughout the entire clinical workflow: development and usability study. J Med Internet Res 2023; 25: e48659.

51.

Sekler

Kämpgen

Reinert

, et al. Identifying secondary findings in PET/CT reports in oncological cases: a quantifying study using automated natural language processing. medRxiv Epub ahead of print 5 December 2022. DOI: 10.1101/2022.12.02.22283043.

52.

Wang

Xie

, et al. Development and validation of a deep learning system for the diagnosis of pediatric diseases: a large-scale real-world data study in Shanghai. medRxiv Epub ahead of print 11 October 2022. DOI: 10.1101/2022.10.07.22280541.

53.

Yang

PourNejatian

Shin

, et al. GatorTron: a large language model for clinical natural language processing. MedRxiv Epub ahead of print 18 March 2022. DOI: 10.1101/2022.02.27.22271257.

54.

Maillard

Micheli

Lefevre

, et al. Can chatbot artificial intelligence replace infectious diseases physicians in the management of bloodstream infections? A prospective cohort study. Clin Infect Dis 2024; 78: 825–832.

55.

Al-Ashwal

Zawiah

Gharaibeh

, et al. Evaluating the sensitivity, specificity, and accuracy of ChatGPT-3.5, ChatGPT-4, Bing AI, and Bard against conventional drug-drug interactions clinical tools. Drug Healthc Patient Saf 2023: 15: 137–47.

56.

Jeblick

Schachtner

Dexl

, et al. ChatGPT makes medicine easy to swallow: an exploratory case study on simplified radiology reports. Eur Radiol 2024; 34: 2817–2825.

57.

Mithun

Jha

Sherkhane

, et al. Development and validation of deep learning and BERT models for classification of lung cancer radiology reports. Informatics in Medicine Unlocked 2023; 40: 101294.

58.

Hurley

Schroeder

Hess

AS.

Would doctors dream of electric blood bankers? Large language model‐based artificial intelligence performs well in many aspects of transfusion medicine. Transfusion 2023; 63: 1833–1840.

59.

Mika

Martin

Engstrom

, et al. Assessing ChatGPT responses to common patient questions regarding total hip arthroplasty. J Bone Joint Surg Am 2023; 105: 1519–1526.

60.

Nicolson

Dowling

Koopman

Improving chest X-ray report generation by leveraging warm starting. Artificial Intelligence in Medicine 2023; 144: 102633.

61.

Zhou

Moon

Szatkowski

, et al. Evaluating ChatGPT responses in the context of a 53-year-old male with a femoral neck fracture: a qualitative analysis. Eur J Orthop Surg Traumatol 2024; 34: 927–955.

62.

Pugliese

Maccari

Felisati

, et al. Are artificial intelligence large language models a reliable tool for difficult differential diagnosis? An a posteriori analysis of a peculiar case of necrotizing otitis externa. Clin Case Rep 2023; 11: e7933.

63.

Moazzam

Cloyd

Lima

, et al. Quality of ChatGPT responses to questions related to pancreatic cancer and its surgical care. Ann Surg Oncol 2023; 30: 6284–6286.

64.

Jeong

Lee

Eyre

, et al. Exploring the use of natural language processing for objective assessment of disorganized speech in schizophrenia. Psychiatr Res Clin Pract 2023; 5: 84–92.

65.

Msosa

Grauslys

Zhou

, et al. Trustworthy data and AI environments for clinical prediction: application to crisis-risk in people with depression. IEEE J Biomed Health Inform 2023; 27: 5588–5598.

66.

Gorelik

Ghersin

Maza

, et al. Harnessing language models for streamlined postcolonoscopy patient management: a novel approach. Gastrointest Endosc 2023; 98: 639–641.

67.

Schneider

Zhang

, et al. Large-scale identification of undiagnosed hepatic steatosis using natural language processing. EClinicalMedicine 2023; 62: 102149.

68.

Zhou

Wang

, et al. A cross-institutional evaluation on breast cancer phenotyping NLP algorithms on electronic health records. Comput Struct Biotechnol J 2023; 22: 32–40.

69.

Crook

Park

Hurley

, et al. Evaluation of online artificial intelligence-generated information on common hand procedures. J Hand Surg Am 2023; 48: 1122–1127.

70.

Pan

Zhang

Peters

, et al. Cerebrovascular disease case identification in inpatient electronic medical record data using natural language processing. Brain Inform 2023; 10: 22.

71.

Michalski

Lis

Stankiewicz

, et al. Supporting the diagnosis of Fabry disease using a natural language processing-based approach. J Clin Med 2023; 12: 3599.

72.

Duey

Nietsch

Zaidat

, et al. Thromboembolic prophylaxis in spine surgery: an analysis of ChatGPT recommendations. Spine J 2023; 23: 1684–1691.

73.

Ali

Dobbs

Jovic

, et al. Validating a novel natural language processing pathway for automated quality assurance in surgical oncology: incomplete excision rates of 34 955 basal cell carcinomas. Br J Surg 2023; 110: 1072–1075.

74.

Yeo

Samaan

, et al. Assessing the performance of ChatGPT in answering questions regarding cirrhosis and hepatocellular carcinoma. Clin Mol Hepatol 2023; 29: 721.

75.

Wang

Zhuang

, et al. A novel staging system derived from natural language processing of pathology reports to predict prognostic outcomes of pancreatic cancer: a retrospective cohort study. Int J Surg 2023; 109: 3476–3489.

76.

Koga

Martin

Dickson

DW.

Evaluating the performance of large language models: ChatGPT and Google Bard in generating differential diagnoses in clinicopathological conferences of neurodegenerative disorders. Brain Pathol 2024; 34: e13207.

77.

Muñoz

Souto

Lecumberri

, et al. Development of a predictive model of venous thromboembolism recurrence in anticoagulated cancer patients using machine learning. Thromb Res 2023; 228: 181–188.

78.

Cai

Chen

Guo

, et al. RegEMR: a natural language processing system to automatically identify premature ovarian decline from Chinese electronic medical records. BMC Med Inform Decis Mak 2023; 23: 126.

79.

Seng

Mehdipour

Simpson

, et al. Tracking persistent postoperative opioid use: a proof-of-concept study demonstrating a use case for natural language processing. Reg Anesth Pain Med 2024; 49: 241–247.

80.

Kao

Chien

Wang

, et al. Assessing ChatGPT’s capacity for clinical decision support in pediatrics: a comparative study with pediatricians using KIDMAP of Rasch analysis. Medicine (Baltimore) 2023; 102: e34068.

81.

O’Hern

Yang

Vidal

NY.

ChatGPT underperforms in triaging appropriate use of Mohs surgery for cutaneous neoplasms. JAAD Int 2023; 12: 168–170.

82.

Lahat

Shachar

Avidan

, et al. Evaluating the utility of a large language model in answering common patients’ gastrointestinal health-related questions: are we there yet? Diagnostics (Basel) 2023; 13: 1950.

83.

Lian

Hsiao

Hwang

, et al. Predicting health-related quality of life change using natural language processing in thyroid cancer. Intell Based Med 2023; 7: 100097.

84.

Samaan

Yeo

Rajeev

, et al. Assessing the accuracy of responses by the language model ChatGPT to questions regarding bariatric surgery. Obes Surg 2023; 33: 1790–1796.

85.

Schroeder

Bigolin Lanfredi

, et al . Prediction of obstructive lung disease from chest radiographs via deep learning trained on pulmonary function data. Int J Chron Obstruct Pulmon Dis 2020; 15: 3455–3466.

86.

Sim

Wang

, et al. Natural language processing and machine learning methods to characterize unstructured patient-reported outcomes: validation study. J Med Internet Res 2021; 23: e26777.

87.

Feng

, et al. Using deeply time-series semantics to assess depressive symptoms based on clinical interview speech. Front Psychiatry 2023; 14: 1104190.

88.

Krebs

Nataraj

McCabe

, et al. Developing a triage predictive model for access to a spinal surgeon using clinical variables and natural language processing of radiology reports. Eur Spine J 2023: 1–4.

89.

Yang

Chen

PourNejatian

, et al. A large language model for electronic health records. NPJ Digit Med 2022; 5: 194.

90.

Gunter

Puac-Polanco

Miguel

, et al. Rule-based natural language processing for automation of stroke data extraction: a validation study. Neuroradiology 2022; 64: 2357–2362.

91.

Sung

Pan

, et al. Automated risk assessment of newly detected atrial fibrillation poststroke from electronic health record data using machine learning and natural language processing. Front Cardiovasc Med 2022; 9: 941237.

92.

Fink

Kades

Bischoff

, et al. Deep learning–based assessment of oncologic outcomes from natural language processing of structured radiology reports. Radiol Artif Intell 2022; 4: e220055.

93.

Decker

Turco

, et al. Development of a natural language processing algorithm to extract seizure types and frequencies from the electronic health record. Seizure 2022; 101: 48–51.

94.

Bali

Weaver

Turzhitsky

, et al. Development of a natural language processing algorithm to detect chronic cough in electronic health records. BMC Pulm Med 2022; 22: 256.

95.

Zhao

Ren

, et al. Construction of an assisted model based on natural language processing for automatic early diagnosis of autoimmune encephalitis. Neurol Ther 2022; 11: 1117–1134.

96.

Xie

Gallagher

Conrad

, et al. Extracting seizure frequency from epilepsy clinic notes: a machine reading approach to natural language processing. J Am Med Inform Assoc 2022; 29: 873–881.

97.

Young

Holmes

Kishore

, et al. Natural language processing diagnosed behavioral disturbance vs confusion assessment method for the intensive care unit: prevalence, patient characteristics, overlap, and association with treatment and outcome. Intensive Care Med 2022; 48: 559–569.

98.

Lindvall

Deng

Moseley

; ACP-PEACE Investigatorset al. Natural language processing to identify advance care planning documentation in a multisite pragmatic clinical trial. J Pain Symptom Manage 2022; 63: e29–e36.

99.

Tohira

Finn

Ball

, et al. Machine learning and natural language processing to identify falls in electronic patient care records from ambulance attendances. Inform Health Soc Care 2022; 47: 403–413.

100.

Lau-Min

Marini

Shah

, et al. Pilot study of a mobile phone chatbot for medication adherence and toxicity management among patients with GI cancers on capecitabine. JCO Oncol Pract 2024; 20: 483–490.

101.

Muñoz Martín

Huerga Domínguez

Souto

, et al. Predicting recurrence of venous thromboembolism in anticoagulated cancer patients using real-world data and machine learning. Journal of Clinical Oncology 2022; 40: number 16 suppl.

102.

Yang

Zhu

Howard

, et al. Context-based identification of muscle invasion status in patients with bladder cancer using natural language processing. JCO Clin Cancer Inform 2022; 6: e2100097.

103.

Lin

Chen

Kaluzny

, et al. Extraction of active medications and adherence using natural language processing for glaucoma patients. AMIA Annu Symp Proc 2022; 2021: 773–782.

104.

Niehues

Adams

Gaudin

, et al. Deep-learning-based diagnosis of bedside chest X-ray in intensive care and emergency medicine. Invest Radiol 2021; 56: 525–534.

105.

Hussain

Sezgin

Krivchenia

, et al. A natural language processing pipeline to synthesize patient-generated notes toward improving remote care and chronic disease management: a cystic fibrosis case study. JAMIA open 2021; 4: ooab084.

106.

Macchia

Ferrandina

Patarnello

, et al. Multidisciplinary tumor board smart virtual assistant in locally advanced cervical cancer: a proof of concept. Front Oncol 2022; 11: 797454.

107.

Cheung

, et al. Detecting suicide risk using knowledge-aware natural language processing and counseling service data. Soc Sci Med 2021; 283: 114176.

108.

Yeung

Iaboni

Rochon

, et al. Correlating natural language processing and automated speech analysis with clinician assessment to quantify speech-language changes in mild cognitive impairment and Alzheimer’s dementia. Alzheimers Res Ther 2021; 13: 109.

109.

McLenon

Okuhn

Lancaster

, et al. Validation of natural language processing to determine the presence and size of abdominal aortic aneurysms in a large integrated health system. J Vasc Surg 2021; 74: 459–466.

110.

Mithun

Jha

Sherkhane

, et al. Development and validation of deep learning and BERT models for classification of lung cancer radiology reports. Informatics in Medicine Unlocked 2023; 40: 101294.

111.

Ortiz

Portoles

Pino-Pino

, et al. Clinical characteristics and management of patients with secondary hyperparathyroidism undergoing hemodialysis: a feasibility analysis of electronic health records using natural language processing. Kidney Dis (Basel) 2023; 9: 187–196.

112.

Lin

Kaluzny

Chen

, et al. Medication list extraction using natural language processing for glaucoma patients. Investigative Ophthalmology & Visual Science 2021; 62: 1001.

113.

Weissler

Zhang

Lippmann

, et al. Use of natural language processing to improve identification of patients with peripheral artery disease. Circ Cardiovasc Interv 2020; 13: e009447.

114.

Bala

Steinkamp

Feeney

, et al. A web application for adrenal incidentaloma identification, tracking, and management using machine learning. Appl Clin Inform 2020; 11: 606–616.

115.

Omoregbe

Ndaman

Misra

, et al. Text messaging‐based medical diagnosis using natural language processing and fuzzy logic. Journal of Healthcare Engineering 2020; 2020: 8839524.

116.

Katsuki

Narita

Matsumori

, et al. Preliminary development of a deep learning-based automated primary headache diagnosis model using Japanese natural language processing of medical questionnaire. Surg Neurol Int 2020; 11: 475.

117.

Almog

Rai

Zhang

, et al. Deep learning with electronic health records for short-term fracture risk identification: crystal bone algorithm development and validation. J Med Internet Res 2020; 22: e22550.

118.

Liu

Zhong

Liu

, et al. REDBot: natural language process methods for clinical copy number variation reporting in prenatal and products of conception diagnosis. Mol Genet Genomic Med 2020; 8: e1488.

119.

Agurto

Cecchi

Norel

, et al. Detection of acute 3,4-methylenedioxymethamphetamine (MDMA) effects across protocols using automated natural language processing. Neuropsychopharmacology 2020; 45: 823–832.

120.

Negi

Pavuri

Patel

, et al. A novel method for drug-adverse event extraction using machine learning. Informatics in Medicine Unlocked 2019; 17: 100190.

121.

Coquet

Bozkurt

Kan

, et al. Comparison of orthogonal NLP methods for clinical phenotyping and assessment of bone scan utilization among prostate cancer patients. J Biomed Inform 2019; 94: 103184.

122.

Marafino

Park

Davies

, et al. Validation of prediction models for critical care outcomes using natural language processing of electronic health record data. JAMA Netw Open 2018; 1: e185097.

123.

Sohn

Ali

, et al. Natural language processing for asthma ascertainment in different practice settings. J Allergy Clin Immunol Pract 2018; 6: 126–131.

124.

Tan

Hassanpour

Heagerty

, et al. Comparison of natural language processing rules-based and machine-learning systems to identify lumbar spine imaging findings related to low back pain. Acad Radiol 2018; 25: 1422–1432.

125.

Miao

, et al. Extraction of BI-RADS findings from breast ultrasound reports in Chinese using deep learning approaches. Int J Med Inform 2018; 119: 17–21.

126.

Tang

Ouyang

, et al. Machine learning to parse breast pathology reports in Chinese. Breast Cancer Res Treat 2018; 169: 243–250.

127.

Weiss

Zhou

Walker

, et al. A case study of the incremental utility for disease identification of natural language processing in electronic medical records. Pharmaceutical Medicine 2018; 32: 31–37.

128.

Filannino

Stubbs

Uzuner

Ö.

Symptom severity prediction from neuropsychiatric clinical records: overview of 2016 CEGS N-GRID Shared Tasks Track 2. J Biomed Inform 2017; 75: S62–S70.

129.

Fraser

Crossland

Bacher

, et al. Comparison of diagnostic and triage accuracy of Ada health and WebMD symptom checkers, ChatGPT, and physicians for patients in an emergency department: clinical data analysis study. JMIR Mhealth Uhealth 2023; 11: e49995.

130.

Henson

Brown

Lee

, et al. Evaluation of the potential utility of an artificial intelligence chatbot in gastroesophageal reflux disease management. Am J Gastroenterol 2023; 118: 2276–2279.

131.

Galido

Butala

Chakerian

, et al. A case study demonstrating applications of ChatGPT in the clinical management of treatment-resistant schizophrenia. Cureus 2023; 15: e38166.

132.

Zhou

Evaluation of ChatGPT's capabilities in medical report generation. Cureus 2023; 15: e37589.

133.

Naik

Prather

Gurda

GT.

Synchronous bilateral breast cancer: a case report piloting and evaluating the implementation of the AI-powered large language model (LLM) ChatGPT. Cureus 2023; 15: e37587.

134.

Kim

HY.

A case report on ground-level alternobaric vertigo due to eustachian tube dysfunction with the assistance of conversational generative pre-trained transformer (ChatGPT). Cureus 2023; 15: e36830.

135.

Sezgin

Hussain

Rust

, et al. Extracting medical information from free-text and unstructured patient-generated health data using natural language processing methods: feasibility study with real-world data. JMIR Form Res 2023; 7: e43014.

136.

Gaviria-Valencia

Murphy

Kaggal

, et al. Near real-time natural language processing for the extraction of abdominal aortic aneurysm diagnoses from radiology reports: algorithm development and validation study. JMIR Med Inform 2023; 11: e40964.

137.

Lin

Hsu

Liang

, et al. Accurately identifying cerebroarterial stenosis from angiography reports using natural language processing approaches. Diagnostics (Basel) 2022; 12: 1882.

138.

Shi

Morgan

Bradshaw

, et al. Identifying patients who meet criteria for genetic testing of hereditary cancers based on structured and unstructured family health history data in the electronic health record: natural language processing approach. JMIR Med Inform 2022; 10: e37842.

139.

Sanaeifar

Eslami

Ahadi

, et al. DxGenerator: an improved differential diagnosis generator for primary care based on metamap and semantic reasoning. Methods Inf Med 2022; 61: 174–184.

140.

Tinmouth

Swain

Chorneyko

, et al. Validation of a natural language processing algorithm to identify adenomas and measure adenoma detection rates across a health system: a population-level study. Gastrointest Endosc 2023; 97: 121–129.

141.

Shin

Kam

Jeon

, et al. Automatic classification of thyroid findings using static and contextualized ensemble natural language processing systems: development study. JMIR Med Inform 2021; 9: e30223.

142.

Berman

Biery

Ginder

, et al. Natural language processing for the assessment of cardiovascular disease comorbidities: the cardio‐Canary comorbidity project. Clin Cardiol 2021; 44: 1296–1304.

143.

Razjouyan

Freytag

Dindo

, et al. Measuring adoption of patient priorities–aligned care using natural language processing of electronic health records: development and validation of the model. JMIR Med Inform 2021; 9: e18756.

144.

Chen

Dedhia

Imbus

, et al. Thyroid ultrasound reports: will the thyroid imaging, reporting, and data system improve natural language processing capture of critical thyroid nodule features? J Surg Res 2020; 256: 557–563.

145.

Jin

Vimalananda

, et al. Automatic detection of hypoglycemic events from the electronic health record notes of diabetes patients: empirical study. JMIR Med Inform 2019; 7: e14340.

146.

Odisho

Bridge

Webb

, et al. Automating the capture of structured pathology data for prostate cancer clinical care and research. JCO Clin Cancer Inform 2019; 3: 1–8.

147.

Johnson

Adekkanattu

Campion

Jr , et al. From sour grapes to low-hanging fruit: a case study demonstrating a practical strategy for natural language processing portability. AMIA Jt Summits Transl Sci Proc 2018; 2017: 104–112.

148.

Nam

Kim

Kho

HS.

Differential diagnosis of jaw pain using informatics technology. J Oral Rehabil 2018; 45: 581–588.

149.

Beltrami

Gagliardi

Rossini Favretti

, et al. Speech analysis by natural language processing techniques: a possible tool for very early detection of cognitive decline? Front Aging Neurosci 2018; 10: 369.

150.

Balakrishna

Yadav

Singh

, et al. Smart drug delivery systems using large language models for real-time treatment personalization. In: 2024 2nd World Conference on Communication & Computing (WCONF). IEEE, 2024.

151.

Zhou

Dong

, et al. TCM-FTP: fine-tuning large language models for herbal prescription prediction. In: 2024 IEEE International Conference on Bioinformatics and Biomedicine (BIBM). IEEE, 2024, pp. 4092–4097.

152.

Mendoza-Pittí

Gómez-Pulido

Vargas-Lombardo

, et al. Machine-learning model to predict the intradialytic hypotension based on clinical-analytical data. IEEE Access 2022; 10: 72065–72079.

153.

Pathak

Marshall

Davis

, et al. RespBERT: a multi-site validation of a natural language processing algorithm, of radiology notes to identify acute respiratory distress syndrome (ARDS). IEEE J Biomed Health Inform 2024; 29: 1455–1463.

154.

Abdulnazar

Roller

Schulz

, et al. Large language models for clinical text cleansing enhance medical concept normalization. IEEE Access 2024; 12: 147981–147990.

155.

Wang

Wong

, et al. Frozen large-scale pretrained vision-language models are the effective foundational backbone for multimodal breast cancer prediction. IEEE J Biomed Health Inform 2025; 29: 3234–3246.

156.

Kim

Park

, et al. Identifying alcohol-related information from unstructured bilingual clinical notes with multilingual transformers. IEEE Access 2023; 11: 16066–16075.

157.

Iwai

Shimazaki

Fukuda

, et al. Development and preliminary evaluation of remote pacemaker monitoring system using large language model. In: 2024 IEEE 12th International Conference on Healthcare Informatics (ICHI). IEEE, 2024, pp.561–562.

158.

Lin

Wang

Lee

, et al. Interpretability of deep learning analysis result of intradialytic hypotension prediction model with recommendation reports utilizing large language model. In: 2024 IEEE 6th Eurasia Conference on Biomedical Engineering, Healthcare and Sustainability (ECBIOS). IEEE, 2024, pp. 91–95.

159.

Zhao

Wang

, et al. ChatCAD+: towards a universal and reliable interactive CAD using LLMs. IEEE Trans Med Imaging 2024; 43: 3755–3766.

160.

Lin

Chia

, et al. Leveraging large language models for generating personalized care recommendations in dementia. In: 2024 IEEE International Workshop on Electromagnetics: Applications and Student Innovation Competition (iWEM). IEEE, 2024, pp.1–4.

161.

Saha

Chopra

Saha

, et al. A large-scale dataset for motivational dialogue system: an application of natural language generation to mental health. In: 2021 International Joint Conference on Neural Networks (IJCNN). IEEE, 2021, pp.1–8.

162.

Santos

Ulbrich

, et al. Evaluating LLMs for diagnosis summarization. Annu Int Conf IEEE Eng Med Biol Soc 2024; 2024: 1–7.

163.

Lentzen

Linden

Veeranki

, et al. A transformer-based model trained on large scale claims data for prediction of severe COVID-19 disease progression. IEEE J Biomed Health Inform 2023; 27: 4548–4558.

164.

Gundogdu

Pamuksuz

Chung

, et al. Customized impression prediction from radiology reports using BERT and LSTMs. IEEE Transactions on Artificial Intelligence 2021; 4: 744–753.

165.

Vallabhaneni

Rahul

Kumari

KS.

Improved lung cancer detection through use of large language systems with graphical attributes. In: 2024 IEEE International Conference on Computing, Power and Communication Technologies (IC2PCT). IEEE, 2024, pp.1849–1854.

166.

Moqurrab

Ayub

Anjum

, et al. An accurate deep learning model for clinical entity recognition from clinical notes. IEEE J Biomed Health Inform 2021; 25: 3804–3811.

167.

Arnaud

Elbattah

Gignon

, et al. NLP-based prediction of medical specialties at hospital admission using triage notes. In: 2021 IEEE 9th International Conference on Healthcare Informatics (ICHI). IEEE, 2021, pp.548–553.

168.

Zeinali

AlBashayreh

Fan

, et al. Comparison of BERT implementations for enhanced cancer symptoms extraction from electronic health records. In: 2024 IEEE First International Conference on Artificial Intelligence for Medicine, Health and Care (AIMHC). IEEE, 2024, pp. 18–19.

169.

Wang

Liu

Wang

Knowledge-enhanced pre-training large language model for depression diagnosis and treatment. In: 2023 IEEE 9th International Conference on Cloud Computing and Intelligent Systems (CCIS). IEEE, 2023, pp. 532–536.

170.

Niero

Guilherme

e Oliveira

, et al. PsyBERTpt: a clinical entity recognition model for psychiatric narratives. In: 2023 IEEE 36th International Symposium on Computer-Based Medical Systems (CBMS). IEEE, 2023, pp. 672–677.

171.

Mamouei

Salimi-Khorshidi

, et al. Hi-BEHRT: hierarchical transformer-based model for accurate prediction of clinical events using multimodal longitudinal electronic health records. IEEE J Biomed Health Inform 2022; 27: 1106–1117.

172.

Finkelstein

Cui

Morgan

, et al. Reducing diagnostic uncertainty using large language models. In: 2024 IEEE First International Conference on Artificial Intelligence for Medicine, Health and Care (AIMHC). IEEE, 2024, pp.236–242.

173.

Thirunavukarasu

Ting

Elangovan

, et al. Large language models in medicine. Nat Med 2023; 29: 1930–1940.

174.

Dennstädt

Hastings

Putora

, et al. Implementing large language models in healthcare while balancing control, collaboration, costs and security. NPJ Digit Med 2025; 8: 143.

175.

Wong

Comparative analysis of open source and proprietary large language models: performance and accessibility. Advances in Computer Sciences 2024; 7: 1–7.

176.

Qin

Tong

Opportunities and challenges for large language models in primary health care. J Prim Care Community Health 2025; 16: 21501319241312571.

177.

Liu

Han

, et al. Understanding llms: a comprehensive overview from training to inference. Neurocomputing 2024: 129190.

178.

Fehr

Citro

Malpani

, et al. A trustworthy AI reality-check: the lack of transparency of artificial intelligence products in healthcare. Front Digit Health 2024; 6: 1267290.

179.

Hripcsak

Rothschild

AS.

Agreement, the f-measure, and reliability in information retrieval. J Am Med Inform Assoc 2005; 12: 296–298.

180.

Kelly

Karthikesalingam

Suleyman

, et al. Key challenges for delivering clinical impact with artificial intelligence. BMC Med 2019; 17: 195.

181.

Sokolova

Lapalme

A systematic analysis of performance measures for classification tasks. Information processing & management 2009; 45: 427–437.

182.

Lipton

Elkan

Narayanaswamy

Thresholding classifiers to maximize F1 score. arXiv preprint arXiv Epub ahead of print 8 February 2014. DOI: 10.48550/arXiv.1402.1892.

183.

Fawcett

An introduction to ROC analysis. Pattern Recognition Letters 2006; 27: 861–874.

184.

Saito

Rehmsmeier

The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PloS One 2015; 10: e0118432.

185.

Ling

Huang

Zhang

AUC: a better measure than accuracy in comparing learning algorithms. In: Advances in artificial intelligence (eds Xiang Y and Chaib-draa B). Canadian AI 2003. Lecture Notes in Computer Science, vol 2671. 11–13 June 2003, pp. 329–341. Berlin Heidelberg: Springer.

186.

Reddy

Evaluating large language models for use in healthcare: a framework for translational value assessment. Informatics in Medicine Unlocked 2023; 41: 101304.

187.

Zhang

Zheng

Tang

, et al. Balancing specialized and general skills in llms: the impact of modern tuning and data strategy. arXiv preprint arXiv Epub ahead of print 7 October 2023. DOI: 10.48550/arXiv.2310.04945.

188.

Liang

Bommasani

Lee

, et al. Holistic evaluation of language models. arXiv preprint arXiv Epub ahead of print 16 November 2022. DOI: 10.48550/arXiv.2211.09110.

Clinical applications of large language models in medicine and surgery: A scoping review

Abstract

Objective

Methods

Results

Conclusions

Keywords

Introduction

Materials and methods

Protocol and registration

Eligibility criteria

Database search

Article selection and data abstraction

Statistical analysis

Results

Database search results

Characteristics of LLM studies and tools

Description of clinical workflow

Measures of accuracy

Analysis of LLM use by medical specialties

Clinical utility of LLMs

Discussion

Conclusion

Footnotes

Acknowledgements

Author contributions

Data availability statement

Declaration of conflicting interests

Ethics approval

Funding

ORCID iD

References