Utilizing large language models for gastroenterology research: a conceptual framework

Abstract

Large language models (LLMs) transform healthcare by assisting clinicians with decision-making, research, and patient management. In gastroenterology, LLMs have shown potential in clinical decision support, data extraction, and patient education. However, challenges such as bias, hallucinations, integration with clinical workflows, and regulatory compliance must be addressed for safe and effective implementation. This manuscript presents a structured framework for integrating LLMs into gastroenterology, using Hepatitis C treatment as a real-world application. The framework outlines key steps to ensure accuracy, safety, and clinical relevance while mitigating risks associated with artificial intelligence (AI)-driven healthcare tools. The framework includes defining clinical goals, assembling a multidisciplinary team, data collection and preparation, model selection, fine-tuning, calibration, hallucination mitigation, user interface development, integration with electronic health records, real-world validation, and continuous improvement. Retrieval-augmented generation and fine-tuning approaches are evaluated for optimizing model adaptability. Bias detection, reinforcement learning from human feedback, and structured prompt engineering are incorporated to enhance reliability. Ethical and regulatory considerations, including the Health Insurance Portability and Accountability Act, General Data Protection Regulation, and AI-specific guidelines (DECIDE-AI, SPIRIT-AI, CONSORT-AI), are addressed to ensure responsible AI deployment. LLMs have the potential to enhance decision-making, research efficiency, and patient care in gastroenterology, but responsible deployment requires bias mitigation, transparency, and ongoing validation. Future research should focus on multi-institutional validation and AI-assisted clinical trials to establish LLMs as reliable tools in gastroenterology.

Plain language summary

How large language models could transform gastroenterology: a framework for future research and care

Artificial intelligence (AI) is transforming healthcare by helping doctors make better decisions, analyze research faster, and improve patient care. Large language models (LLMs) are a type of AI that process and generate human-like text, making them useful in gastroenterology. This paper presents a structured framework for safely using LLMs in clinical practice, using Hepatitis C treatment as an example. The framework begins by setting clear goals, such as improving Hepatitis C treatment recommendations or making patient education easier to understand. A team of doctors, AI specialists, and data experts is assembled to ensure the model is medically accurate and practical. Next, relevant medical data from electronic health records (EHRs), clinical guidelines, and research studies is gathered and prepared to improve AI, ensuring it provides useful and fair recommendations. The right AI model is then chosen and improved to specialize in gastroenterology. To make sure the model is reliable and makes correct suggestions, its performance is checked and adjusted before use. A user-friendly interface is created so doctors can access AI-generated recommendations directly in EHRs and decision-support tools, making it easy to integrate into daily practice. Before full use, the AI is tested in real-world settings, where gastroenterologists review its recommendations for safety and accuracy. Once in use, ongoing updates based on doctor feedback help improve its performance. Ethical and legal safeguards, such as protecting patient privacy and ensuring fairness, guide its responsible use. Findings are then shared with the medical community, allowing for further testing and broader adoption. By following this framework, LLMs can help doctors make better decisions, personalize treatments, and improve efficiency, ultimately leading to better patient outcomes in gastroenterology.

Keywords

artificial intelligence, framework, generative artificial intelligence, healthcare

Introduction

Artificial intelligence (AI) enables machines to simulate complex human cognitive functions.¹ A key subset of AI, machine learning (ML), enables computers to learn from data and improve performance on specific tasks without explicit programming. Within ML, supervised learning maps inputs to outputs using labeled data, while unsupervised learning identifies hidden patterns within unlabeled data.^2,3 Deep learning, a further subset of ML, employs neural networks inspired by human brain function to analyze vast datasets and extract meaningful patterns without manual feature extraction.⁴ Generative AI, an advanced form of deep learning, can create new data (such as text, images, or audio) by learning patterns from large datasets and generating samples that align with the same underlying distribution.

Large language models (LLMs), a class of generative AI, have transformed natural language processing (NLP) in medicine, demonstrating success in clinical decision support, patient education, drug discovery, and biomedical research (Table 1).^5–8

Table 1.

Applications of large language models in Gastroenterology.

Application area	Use case	Model	Key insights
Patient education	IBD patient questions⁹	ChatGPT-4	LLMs can provide general, accurate information for IBD patients, answering common questions and aiding preliminary understanding. However, they lack specificity, up-to-date information, and empathetic engagement, which are critical in chronic disease management.
	Nutrition questions for IBD¹⁰	GPT-4	GPT-4 accurately answers 83% of IBD nutrition questions with 92% reproducibility but has a 17% error rate, needing refinement for clinical use. It excels in certain areas, like tube feeding guidance, and can support patient education, especially where access to care is limited. Future improvements in prompt design and updates could enhance its reliability for healthcare.
	Colonoscopy procedure information¹¹	ChatGPT	ChatGPT provided accurate, easy-to-understand colonoscopy answers similar to hospital websites, with low similarity to online content, making its responses unique. However, it generated answers above the recommended reading level, and doctors could only identify AI answers 48% of the time, highlighting its realistic tone. This suggests potential for patient education, though readability improvements and oversight are needed for clinical use.
	General gastrointestinal symptom management¹²	ChatGPT	ChatGPT performs moderately well on general GI questions, scoring around 3.9/5 in accuracy but lacks depth, especially in diagnostic-related queries where it struggles with medical terminology. It frequently suggests consulting a doctor, which underscores its current limitations. To be clinically reliable, ChatGPT requires improved accuracy and specificity in handling complex GI information.
Clinical decision support	Diagnosis and treatment recommendations for digestive diseases¹³	ChatGPT-4	LLMs like ChatGPT show promise in patient education for digestive diseases but vary in accuracy (6.4%–91.4%) and pose safety risks if used clinically without oversight. Their use needs standardization and accuracy improvements before reliable clinical application.
	Extracting PROs in IBD¹⁴	GPT-4	LLMs like GPT-4 outperform traditional NLP in extracting PROs for IBD. GPT-4 achieved over 90% accuracy and generalizability across institutions, unlike traditional models which showed poor external validity.
	Clinical support for gastroenterologists¹⁵	GastroGPT	The study on GastroGPT, a gastroenterology-specific AI model, found that it significantly outperformed general models like GPT-4, Bard, and Claude in simulated gastroenterology tasks.
	CPGs for COVID-19 in GI patients¹⁶	Custom LLM with CPG integration	LLMs with CPGs significantly improve their performance in CDS, particularly for COVID-19 outpatient treatment. Three methods—BDT, PAGC, and CoT-FSP—were tested, with BDT achieving the highest automatic evaluation scores.
Medical education and training	Diagnostic case study training for medical learners¹⁷	ChatGPT (Legacy 3.5)	The study on ChatGPT’s diagnostic utility found a 49% case accuracy with an overall accuracy of 74%, indicating it can effectively rule out incorrect diagnoses but struggles with precision and sensitivity. While useful for general guidance, limitations in interpreting lab values and clinical nuances mean ChatGPT should be used as a supplementary educational tool rather than a standalone diagnostic aid.
Literature review and summarization	Keyword-based retrieval for systematic reviews¹⁸	LitLLM (GPT-4)	The LitLLM toolkit uses retrieval-augmented generation to automate scientific literature reviews by retrieving, ranking, and summarizing relevant papers based on user-provided abstracts. This system reduces effort in literature reviews, minimizes LLM hallucinations, and enhances control through sentence planning, showing promise as a reliable academic assistant.
Drug discovery and repurposing	Drug repurposing for IBD¹⁹	DrugReAlign with multi-source prompts	The DrugReAlign framework uses LLMs with multi-source prompts for drug repurposing, enhancing accuracy and interpretability in predicting drug–target interactions. Through validation with molecular docking, it demonstrated robust predictive performance.

AI, artificial intelligence; BDT, binary decision tree; CDS, clinical decision support; CoT-FSP, chain-of-thought-few-shot prompting; CPGs, clinical practice guidelines; GI, gastrointestinal; IBD, inflammatory bowel disease; LLM, large language model; NLP, natural language processing; PAGC, program-aided graph construction; PROs, patient-reported outcomes.

Notably, MedPaLM 2 outperforms GPT-4 in USMLE-style medical questions and is currently being evaluated in real-world clinical interactions.²⁰ Similarly, BioGPT excels in biomedical literature mining, while UCSF-BERT has demonstrated >88% accuracy in detecting treatment-emergent adverse events from electronic health records (EHRs).^21,22

In gastroenterology, LLMs have been leveraged for diverse applications, including generating patient education materials for bariatric surgery candidates, enhancing doctor–patient interactions in hepatocellular carcinoma and cirrhosis, responding to clinical queries, grading non-alcoholic fatty liver disease histology, and extracting data.^9,13,20,23–30 An NIH-funded pilot study is currently evaluating LLMs for automating data extraction in hepatocellular carcinoma.³¹ In addition, these models hold promise for improving telehealth scalability through patient triaging and serving as best practice tools for identifying patients eligible for colorectal cancer screening.^32,33 LLMs also assist with research and data analysis: they can streamline academic writing and accelerate labor-intensive tasks such as conducting literature reviews and meta-analyses.^34,35

Despite their potential, LLM integration in research and clinical practice presents significant challenges. Concerns regarding data privacy, model bias, and the need for rigorous validation and transparency must be addressed. Overconfident errors or “hallucinations” pose safety risks, while explainability remains a key issue. Furthermore, real-world adoption is complicated by shifts in medical terminology, evolving clinical practices, and unpredictable factors such as disease outbreaks.^36–38

To overcome these challenges, we propose a structured framework, illustrated in Figure 1, that integrates LLMs into research and clinical decision-making, using a personalized Hepatitis C treatment model as an example. This framework compares fine-tuning with retrieval-augmented generation (RAG) and incorporates prompt optimization for structured outputs, active learning for continuous improvement, and methods to mitigate hallucinations and standardize outputs.

Figure 1.

This figure illustrates a step-by-step framework for developing an LLM-powered clinical decision support system for Hepatitis C treatment. It outlines key stages, including ethical approvals, data collection, model selection, calibration, user interface development, integration with EHRs, and continuous improvement.

Framework

Define system objectives

Establishing clear goals for using an LLM is essential. In our clinical scenario, the primary objective is to develop an AI-driven treatment recommendation system that generates personalized plans by incorporating genotype, viral load, fibrosis score, and prior treatment history. The model will integrate real-time clinical guidelines to ensure up-to-date recommendations while enhancing patient safety through risk stratification and optimized decision-making.

Assemble a multidisciplinary team

Developing an effective LLM for gastroenterology requires collaboration among AI specialists, data scientists, gastroenterologists, hepatologists, infectious disease experts, clinical informaticists, quality assurance teams, bioethicists, and regulatory policymakers.³⁹ This team ensures the model is clinically accurate, integrates seamlessly into EHR systems, and complies with regulatory standards.⁴⁰

Data collection and preparation

Relevant data sources include EHR data (previous Hepatitis C treatments, patient demographics, alcohol use, diagnostic testing results, imaging, genotype data), clinical guidelines (e.g., American Association for the Study of Liver Diseases (AASLD)), medical literature, patient-reported outcomes, and regulatory and safety information (U.S. Food and Drug Administration approvals and adverse event reporting). Data must be adequate and representative of the intended population, as models fine-tuned on specific demographic groups may underperform in others.³⁸

Key preprocessing steps involve integrating structured data (e.g., HCV RNA levels, fibrosis scores) with unstructured data (e.g., clinical notes describing symptoms or treatment history) while ensuring Health Insurance Portability and Accountability Act (HIPAA)-compliant de-identification. Data cleaning is crucial to handle missing values, outliers, and inconsistencies. During preprocessing, structured data are converted to numerical form, and unstructured data are processed using NLP techniques such as tokenization and entity recognition to extract key medical concepts and map them to structured fields, allowing the model to consider both empirical results and contextual patient history.

Following preprocessing, the dataset is divided into a training set (70%–80%; used for fine-tuning), a validation set (used to detect and limit overfitting, which occurs when the model becomes overly specialized to the fine-tuning dataset and learns patterns that do not generalize well to new, unseen data), and a test set (10%–15%).
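The split described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the 70/15/15 proportions are one choice within the stated ranges (the validation share is not specified in the text), the record fields are hypothetical, and splitting by patient rather than by record is an added assumption intended to keep one patient's notes out of more than one subset.

```python
import random

def split_by_patient(records, train_frac=0.70, test_frac=0.15, seed=0):
    """Split records into train/validation/test, keeping each patient in one subset."""
    patient_ids = sorted({r["patient_id"] for r in records})
    rng = random.Random(seed)          # fixed seed so the split is reproducible
    rng.shuffle(patient_ids)

    n_train = round(len(patient_ids) * train_frac)
    n_test = round(len(patient_ids) * test_frac)
    train_ids = set(patient_ids[:n_train])
    test_ids = set(patient_ids[n_train:n_train + n_test])
    # Patients in neither set fall into the validation subset.

    splits = {"train": [], "validation": [], "test": []}
    for record in records:
        if record["patient_id"] in train_ids:
            splits["train"].append(record)
        elif record["patient_id"] in test_ids:
            splits["test"].append(record)
        else:
            splits["validation"].append(record)
    return splits

# Hypothetical toy data: 20 patients with 2 notes each.
records = [{"patient_id": pid, "note": f"note {k}"} for pid in range(20) for k in range(2)]
splits = split_by_patient(records)
```

The test subset should be held back until model development is complete, so that it gives an unbiased estimate of performance on unseen patients.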

Model selection

LLMs can be general-purpose (e.g., GPT-4 and GPT-3.5), which are versatile but require domain adaptation, or fine-tuned for a specific domain (e.g., BioGPT, UCSF-BERT, BioBERT, PubMedBERT), which are tailored to specific applications^41–44 (Table 2). In choosing between smaller models like Bidirectional Encoder Representations from Transformers (BERT) and large models like GPT-4, BERT should be preferred for real-time performance scenarios like bedside diagnostics due to faster inference times and lower communication costs.^42,45 GPT-4 should be preferred for generating human-like text. For Hepatitis C, BERT can classify patient records and extract structured clinical information, while GPT-4 can generate personalized treatment recommendations based on multiple factors such as genotype, comorbidities, and prior treatments.⁴⁶ In our use case, which requires generated treatment recommendations, a large model is more suitable.

Table 2.

Pre-trained large language models.

Model name	Full form	Key use cases	Gastroenterology applications	Use case example	Access type	Graphics processing units and time required^a	Dataset utilized	Dataset size
BERT⁴⁷	Bidirectional encoder representations from transformers	NLP tasks (e.g., QA, sentiment analysis)	Strong at information retrieval and data extraction from clinical text and research documents.	Can identify patient symptoms related to IBS from unstructured EHRs and recommend further investigation steps.	Open source	Few hours on a single GPU	BooksCorpus, English Wikipedia	Approximately 16 GB (3.3 billion words)
GPT-4⁴⁸	Generative pre-trained transformer 4	Text generation and conversation	Excellent for text generation, creating patient education materials, summarization, and clinical decision support.	Can generate tailored patient information leaflets on managing GERD or summarize complex clinical guidelines for healthcare providers.	Proprietary	Estimated 16+ high-performance GPUs (e.g., A100), several days to over a week for fine-tuning	WebText2 + diverse proprietary datasets (proprietary subset)	~570 GB (~300 billion tokens)
RoBERTa⁴⁹	Robustly optimized BERT approach	NLP tasks (e.g., text classification)	Useful for diagnosing IBD from patient records or research articles.	Classifies clinical notes and predicts IBD flare-ups based on prior visits or lab results.	Open source	1–2 GPUs, 1–2 days	BooksCorpus, Wikipedia, CC-News, OpenWebText, Stories	~160 GB (~40 billion tokens)
T5⁵⁰	Text-to-text transfer transformer	Text-to-text transformation	Transforms medical tasks into a text format for various applications.	Generates concise summaries of clinical guidelines for managing hepatitis C.	Open source	16 TPUs, 3 days	C4 (Colossal Clean Crawled Corpus)	~750 GB (~365 billion tokens)
XLNET⁵¹	Extra-long network	NLP tasks (e.g., text classification, QA)	Captures complex dependencies in clinical text for improved predictions.	Analyzes unstructured clinical notes to predict patient outcomes for GERD.	Open source	4–8 GPUs for several hours to a few days.	BooksCorpus, Wikipedia, Giga5, ClueWeb09, Common Crawl	~170 GB (for pretraining)
ALBERT⁵²	A Lite BERT	NLP tasks (e.g., QA, text classification)	Efficiently processes medical literature for relevant information extraction.	Analyzes clinical notes to identify patients with IBD based on symptoms and treatment histories.	Open source	4 GPUs, ~2 days for fine-tuning	BooksCorpus, Wikipedia (reduced dataset (10%))	~16 GB (compressed) reduced
BART⁵³	Bidirectional and auto-regressive transformers	Text generation and understanding	Generates comprehensive reports and educational materials for patient–provider communication.	Summarizes clinical trial results on IBS treatments and generates personalized patient leaflets	Open source	4–8 GPUs, 1–2 days for fine-tuning	BooksCorpus, Wikipedia	~160 GB (pretraining on denoising tasks)
ELECTRA⁵⁴	Efficiently learning an encoder that classifies token replacements accurately	Text classification and token classification	Analyzes clinical notes to identify relevant information about gastrointestinal disorders.	Classifies EHRs to predict colorectal cancer risk based on symptoms and family history.	Open source	2–4 GPUs, less than a day for fine-tuning	BooksCorpus, Wikipedia	~16 GB (corpus for replaced token prediction)
BioBERT⁴²	Biomedical BERT	Biomedical text mining	Effective for biomedical literature search and research analysis related to gastroenterology.	Extracts research data trends on FMT from PubMed.	Open source	1–2 GPUs, 1 day for fine-tuning	PubMed abstracts, PMC articles	~21 GB (~18 billion tokens)
ClinicalBERT⁵⁵	Clinical BERT	Clinical text analysis	Useful for clinical data extraction (e.g., patient history, diagnoses) from EHRs.	Extracts relevant procedures (e.g., colonoscopies) and flags patients at risk of colorectal cancer.	Open source	2 GPUs, 1 day for fine-tuning	MIMIC-III clinical notes	~500 MB (~0.5 billion tokens)
BlueBERT⁵⁶	Biomedical Language Understanding Evaluation (BLUE) BERT	Healthcare record analysis		Analyzes patient data for trends in gastrointestinal disorders.	Open source	2 GPUs, 1 day for fine-tuning	PubMed abstracts, MIMIC-III database	~5 GB (text data)
PubMedBERT⁴³	PubMed BERT	Biomedical literature search	Optimized for searching through PubMed literature for gastroenterology research or treatment updates.	Retrieves the latest research on Helicobacter pylori treatments	Open source	1–2 GPUs, several hours for fine-tuning	PubMed abstracts	~14 GB (~4.5 billion tokens)
MedBERT⁴¹	Medical BERT	Clinical prediction tasks	Excellent for prediction tasks and summarization in medicine.	Predicts outcomes for acute pancreatitis by analyzing clinical notes and lab results.	Proprietary	1–2 GPUs, fine-tuning within a day	MIMIC-III EHR dataset	~60 GB (structured data)
GatorTron⁵⁷	GatorTron	Medical NLP	Processes clinical data and EHRs for gastroenterology support.	Analyzes patient charts for abnormal liver enzyme trends.	Open source	8–16 GPUs, 1–2 days for fine-tuning	GatorTron EHR corpus	~80–100 GB (~40–50 billion tokens)
BioGPT²¹	Biomedical generative pre-trained transformer	Biomedical text generation	Generates biomedical summaries useful for research in gastroenterology.	Summarizes clinical trial outcomes for probiotics in IBS.	Open source	8 GPUs, ~5 days for training	PubMedGPT dataset	~14 GB (~4.5 billion tokens)

^aWe provide approximate estimates of the resources required for fine-tuning specific models. Actual requirements may vary with the complexity of the task.

EHRs, electronic health records; FMT, fecal microbiota transplantation; GERD, gastroesophageal reflux disease; IBD, inflammatory bowel disease; IBS, irritable bowel syndrome; NLP, natural language processing; QA, question answering.

Open-source models like BERT offer free access and customizability, allowing fine-tuning with medical data for tailored applications while keeping data private when run locally, which supports HIPAA compliance. However, they require significant computational resources and specialized expertise for setup and maintenance. By contrast, proprietary models like GPT-4 deliver state-of-the-art performance with minimal fine-tuning and seamless cloud-based scalability, making them attractive for healthcare deployment.⁵⁸ They benefit from continuous updates and easy API integration but come with high costs, limited control over model internals, and potential data privacy concerns when using cloud services.⁵⁹

Model optimization

Mitigating hallucinations and output variability

Hallucination occurs when an LLM generates incorrect or misleading information due to incomplete training data, overfitting, or lack of fact-checking. In clinical settings, this can lead to unsafe recommendations, such as incorrect Hepatitis C treatment guidance.⁶⁰ Cross-referencing outputs against trusted guidelines and developing internal layers that detect unsupported outputs can reduce hallucinations. Below, we discuss several strategies for mitigating hallucinations and output variability.

Fine-tuning

Pretrained models learn from large, general datasets, making them broadly knowledgeable but lacking medical specificity. Fine-tuning adapts these models to clinical needs by training them on specialized datasets.⁶¹ For Hepatitis C, fine-tuning ensures the model correctly interprets genotypes, fibrosis scores, and viral load and recommends AASLD-based treatment plans.

Fine-tuned models improve over time through active learning, where clinician feedback is used to enhance accuracy. This involves reviewing incorrect outputs, reinforcing correct responses, and periodically retraining the model with updated guidelines. Clinicians can actively flag incorrect recommendations, compare LLM outputs with expert decisions, and integrate the latest antiviral studies into the model’s learning process.

Model calibration and out-of-distribution detection

For gastroenterologists using LLMs in Hepatitis C treatment, ensuring that the model’s confidence aligns with real-world accuracy is critical. Model calibration adjusts confidence scores to prevent overconfidence in uncertain cases, helping clinicians gauge trust in recommendations. If an LLM suggests glecaprevir/pibrentasvir with 85% confidence, a well-calibrated model ensures this reflects an actual 85% likelihood of correctness.⁶² Temperature scaling is a key calibration method that refines confidence levels by correcting systematic overconfidence.⁶³
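Temperature scaling can be illustrated with a small sketch. It assumes the model exposes raw scores (logits) over a fixed set of candidate regimens, which may not hold for every LLM deployment. A single scalar temperature is fitted on validation data by minimizing the negative log-likelihood; a simple grid search is used here in place of gradient-based optimization, and all logits and labels are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities after dividing by the temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def mean_nll(logit_rows, labels, temperature):
    """Average negative log-likelihood of the true labels."""
    losses = [-math.log(softmax(row, temperature)[y])
              for row, y in zip(logit_rows, labels)]
    return sum(losses) / len(losses)

def fit_temperature(logit_rows, labels):
    """Pick the temperature in [0.5, 5.0] that minimizes validation NLL."""
    grid = [0.5 + 0.1 * i for i in range(46)]
    return min(grid, key=lambda t: mean_nll(logit_rows, labels, t))

# Hypothetical validation logits over three candidate regimens.
# The model is equally confident in all four cases but wrong in the second.
val_logits = [[4.0, 0.0, 0.0], [4.0, 0.0, 0.0], [0.0, 4.0, 0.0], [0.0, 0.0, 4.0]]
val_labels = [0, 1, 1, 2]

temperature = fit_temperature(val_logits, val_labels)
raw_confidence = max(softmax(val_logits[0]))
calibrated_confidence = max(softmax(val_logits[0], temperature))
```

On this toy set the fitted temperature is above 1, which pulls the top probability down from about 0.96 toward the observed accuracy of 75%. Temperature scaling changes confidence values only; the ranking of regimens for each case is unchanged.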

Out-of-distribution (OOD) detection ensures the model recognizes when a case falls outside its training data, such as rare genotypic resistance mutations or Hepatitis C co-infections.⁶⁴ If the model flags a recommendation as uncertain or detects an OOD scenario, clinicians should seek additional expert input, refer to updated guidelines, or consider alternative antiviral options. By integrating calibrated confidence scores and OOD detection, LLMs become safer, more reliable tools for Hepatitis C decision-making.
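A very simple form of OOD flagging is sketched below. It is a heuristic stand-in rather than the method of the cited work: a case is flagged when its genotype never appeared in the training data or when its viral load lies far from the training distribution. The field names, the toy values, and the z-score threshold of 3 are all assumptions made for illustration.

```python
from statistics import mean, stdev

def fit_reference(train_cases):
    """Summarize the training data: genotypes seen and viral-load mean/SD (log10 IU/mL)."""
    loads = [c["log10_viral_load"] for c in train_cases]
    return {
        "genotypes": {c["genotype"] for c in train_cases},
        "load_mean": mean(loads),
        "load_sd": stdev(loads),
    }

def is_out_of_distribution(case, reference, z_threshold=3.0):
    """Flag cases with an unseen genotype or an extreme viral load."""
    if case["genotype"] not in reference["genotypes"]:
        return True
    z = abs(case["log10_viral_load"] - reference["load_mean"]) / reference["load_sd"]
    return z > z_threshold

# Hypothetical training cases.
train_cases = [
    {"genotype": "1a", "log10_viral_load": 5.8},
    {"genotype": "1b", "log10_viral_load": 6.1},
    {"genotype": "3", "log10_viral_load": 6.4},
    {"genotype": "1a", "log10_viral_load": 6.0},
    {"genotype": "3", "log10_viral_load": 5.9},
]
reference = fit_reference(train_cases)

typical = is_out_of_distribution({"genotype": "1a", "log10_viral_load": 6.2}, reference)
unseen_genotype = is_out_of_distribution({"genotype": "6", "log10_viral_load": 6.0}, reference)
extreme_load = is_out_of_distribution({"genotype": "1a", "log10_viral_load": 9.5}, reference)
```

In practice, a flagged case would be routed to expert review instead of receiving an automated recommendation.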

Retrieval-augmented generation

Unlike fine-tuning, which relies on static training data, RAG allows models to pull in real-time medical guidelines and research. This ensures treatment recommendations are based on the latest evidence without requiring constant retraining.⁶⁵ In Hepatitis C management, RAG is preferable to fine-tuning when real-time adaptability is needed, such as incorporating the latest AASLD/EASL guidelines, drug availability, insurance policies, and patient-specific factors from external sources. It allows the model to dynamically retrieve updated recommendations, making it well suited to rapidly evolving treatment landscapes and multi-center adaptability.²⁴ Fine-tuning is better suited for static, rule-based decision-making, such as structured genotype-based treatment selection.⁶⁶
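The retrieval step of RAG can be sketched as follows. This is a toy illustration under several assumptions: passages are ranked by bag-of-words cosine similarity (production systems typically use learned embeddings), the guideline excerpts are placeholders rather than real guideline text, and the final call to an LLM is omitted.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, passages, k=2):
    """Return the k passages most similar to the query."""
    q = Counter(tokenize(query))
    return sorted(passages,
                  key=lambda p: cosine(q, Counter(tokenize(p["text"]))),
                  reverse=True)[:k]

def build_rag_prompt(query, passages):
    """Place the retrieved excerpts ahead of the question."""
    context = "\n".join(f"[{p['id']}] {p['text']}" for p in passages)
    return ("Answer using only the guideline excerpts below and cite their ids.\n"
            f"Excerpts:\n{context}\n\nQuestion: {query}")

# Placeholder excerpts; real text would come from a maintained guideline store.
guideline_store = [
    {"id": "G1", "text": "Placeholder excerpt on treatment-naive genotype 1 patients without cirrhosis."},
    {"id": "G2", "text": "Placeholder excerpt on genotype 3 patients with compensated cirrhosis."},
    {"id": "G3", "text": "Placeholder excerpt on drug interactions with acid-suppressing medication."},
]
query = "Which regimen suits a genotype 3 patient with compensated cirrhosis?"
top_passages = retrieve(query, guideline_store)
prompt = build_rag_prompt(query, top_passages)
```

Because the excerpts are fetched at query time, updating the guideline store changes the model's context without any retraining.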

Prompt optimization

Prompt optimization plays a crucial role in improving LLM responses without additional training.⁶⁷ Zero-shot prompts rely on the model’s general knowledge, such as asking, “What is the standard treatment for Hepatitis C genotype 1?” One-shot prompts provide an example before the query, improving contextual accuracy. Few-shot prompts offer multiple examples, refining model responses by demonstrating structured reasoning. Structuring prompts effectively reduces variability and enhances precision; for instance, specifying “List first-line and second-line treatments for Hepatitis C based on fibrosis stage” is preferable to asking an open-ended question. To maintain consistency, three techniques help reduce prompt sensitivity and support reliable clinical decision support: controlled vocabulary use, which ensures standardized medical terminology (e.g., “F3 fibrosis stage” instead of the ambiguous “moderate liver scarring”); response formatting constraints, which structure outputs predictably (e.g., listing “Genotype → Preferred regimen → Treatment duration” for Hepatitis C therapy); and iterative prompt refinement, which adjusts prompts to improve the accuracy and completeness of responses.⁶⁸
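The difference between zero-shot, one-shot, and few-shot prompts, together with a response formatting constraint, can be sketched with a small helper. The function name, the system instruction, and the worked examples are hypothetical; the example answers are placeholders and carry no clinical content.

```python
def make_prompt(question, examples=(), output_format=None):
    """Build a zero-shot (no examples), one-shot, or few-shot prompt."""
    parts = ["You are assisting a gastroenterologist. "
             "Use standardized terminology (e.g., 'F3 fibrosis stage')."]
    if output_format:
        # Formatting constraint so that outputs have a predictable structure.
        parts.append(f"Respond in the format: {output_format}")
    for example_question, example_answer in examples:
        parts.append(f"Q: {example_question}\nA: {example_answer}")
    parts.append(f"Q: {question}\nA:")
    return "\n\n".join(parts)

OUTPUT_FORMAT = "Genotype -> Preferred regimen -> Treatment duration"
question = "List first-line and second-line treatments for Hepatitis C based on fibrosis stage."

# Placeholder worked examples; real ones would be written and reviewed by clinicians.
examples = [
    ("<example question 1>", "<genotype> -> <regimen> -> <duration>"),
    ("<example question 2>", "<genotype> -> <regimen> -> <duration>"),
]

zero_shot = make_prompt(question, output_format=OUTPUT_FORMAT)
one_shot = make_prompt(question, examples[:1], OUTPUT_FORMAT)
few_shot = make_prompt(question, examples, OUTPUT_FORMAT)
```

Keeping prompt construction in one function makes iterative refinement easier to track, since each revision of the instruction or examples can be versioned and re-evaluated.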

Validation

LLM variability should be assessed through inter-clinician agreement studies and comparisons with clinical decision support systems (CDSS) to confirm that outputs are consistent with expert decisions and established protocols, supporting stable and reliable clinical use.⁶⁹
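Agreement between model outputs and clinician decisions can be quantified with a chance-corrected statistic. The sketch below uses Cohen's kappa, which is our assumption of a suitable measure, as the text does not name one. The regimen labels and ratings are hypothetical.

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters on the same cases."""
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Agreement expected if both raters chose labels independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters always used the same single label
    return (observed - expected) / (1 - expected)

# Hypothetical regimen choices (A, B, C) for eight cases.
llm_choices = ["A", "A", "B", "B", "A", "C", "B", "A"]
clinician_choices = ["A", "A", "B", "A", "A", "C", "B", "B"]

kappa = cohens_kappa(llm_choices, clinician_choices)
```

Here the raw agreement is 6 of 8 cases, but kappa is lower (about 0.58) because some agreement is expected by chance. Comparing the model-clinician kappa with the kappa between two clinicians gives a reference point for what level of agreement is realistic.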

Developing user interface/integrating with existing system

Designing an LLM interface for gastroenterology requires tailoring it to gastroenterologists, nurses, and administrative staff to ensure the interface aligns with clinical workflows and enhances usability.^70,71

A structured input system with fields for patient ID, date of birth, and prior procedures (e.g., endoscopies, colonoscopies) ensures seamless data entry and accurate patient identification.

Pre-populating fields with recent lab results, imaging, and treatment history streamlines workflows, reducing administrative burden and minimizing errors. Key features for Hepatitis C decision support are outlined in Supplemental Table 1, with a wireframe of the interface in Figure 2.

Figure 2.

This figure presents a wireframe of the LLM-driven clinical decision support interface for Hepatitis C management. It demonstrates structured data entry fields for patient ID, genotype, fibrosis score, co-morbidities, and HCV RNA levels. The interface pre-populates critical fields, integrates biopsy/imaging uploads, and generates personalized treatment recommendations based on clinical guidelines. A feedback loop allows clinicians to refine model outputs, enhancing reliability and usability.

Integrating an LLM into existing clinical systems enhances efficiency and patient care by embedding AI-driven recommendations within EHRs, CDSS, and hospital management software.⁷² Real-time EHR integration allows the model to analyze genotype, viral load, fibrosis scores, and treatment history without redundant data entry. When a clinician accesses a Hepatitis C patient’s record, the model provides tailored treatment suggestions directly within the EHR interface, minimizing disruptions and centralizing decision-making.

When embedded in a CDSS, the LLM serves as an advanced support layer, refining alerts and recommendations with context-aware insights.

Ethical and regulatory considerations

Approval from the ethics committee or institutional review board must be secured to ensure compliance with ethical standards. Adherence to HIPAA in the United States and the General Data Protection Regulation (GDPR) in Europe is essential.⁷³ Patients are not typically asked for individual consent when their de-identified data are used in model development.

Implementation and testing

After training the LLM on refined data and developing the user interface/integration with EHR, pilot testing with a small patient cohort under clinical oversight should be initiated. This stage is critical for gathering feedback from both patients and healthcare professionals, helping identify any challenges in usability, accuracy, and clinical applicability. Continuous feedback during pilot testing will help refine the model for real-world implementation.

Traditional evaluation metrics like accuracy, recall, precision, F1 score, and AUC-ROC are commonly used for predictive models; however, for LLMs generating personalized treatment plans, these metrics may not fully capture the model’s performance. Instead, assessment should focus on relevance (alignment with clinical expectations), coherence (logical consistency in recommendations), and safety (adherence to clinical guidelines). In addition, clinical performance metrics such as cure rates, treatment adherence, and incidence of side effects should be monitored to ensure the model provides actionable insights that improve patient outcomes.^73,74

Interpretability and explainability analysis

Ensuring equitable recommendations is critical in gastroenterology, as Hepatitis C outcomes vary based on patient demographics such as age, sex, race, and socioeconomic status. Clinicians should be aware that LLMs can inherit biases from training data, leading to disparities in treatment recommendations.^75–79 Bias detection tools, such as demographic parity analysis, can identify if certain patient groups receive different treatment suggestions. To mitigate bias, strategies like re-weighting underrepresented patient populations in training data or supplementing the model with diverse clinical trial data can help ensure fairer recommendations. For example, if a model consistently under-recommends direct-acting antivirals for certain racial groups due to historical underrepresentation in trials, re-weighting the data can correct this imbalance.^80–82 As highlighted in recent GI literature on algorithmic bias, disparities in clinical trial participation and the unavailability of representative data can exacerbate these biases.⁸³
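Demographic parity analysis and re-weighting can be illustrated with a short sketch. The group labels and counts are hypothetical, and inverse-frequency weighting is only one of several possible re-weighting schemes.

```python
from collections import Counter

def recommendation_rates(cases):
    """Share of patients in each group for whom the model recommended treatment."""
    totals, positives = Counter(), Counter()
    for case in cases:
        totals[case["group"]] += 1
        positives[case["group"]] += case["recommended"]
    return {group: positives[group] / totals[group] for group in totals}

def demographic_parity_gap(cases):
    """Difference between the highest and lowest group recommendation rates."""
    rates = recommendation_rates(cases)
    return max(rates.values()) - min(rates.values())

def inverse_frequency_weights(cases):
    """Per-case training weights so that each group carries equal total weight."""
    totals = Counter(case["group"] for case in cases)
    return [len(cases) / (len(totals) * totals[case["group"]]) for case in cases]

# Hypothetical audit set: group A is well represented, group B is not.
cases = (
    [{"group": "A", "recommended": 1}] * 6 + [{"group": "A", "recommended": 0}] * 2
    + [{"group": "B", "recommended": 1}] * 1 + [{"group": "B", "recommended": 0}] * 1
)

rates = recommendation_rates(cases)
gap = demographic_parity_gap(cases)
weights = inverse_frequency_weights(cases)
```

A non-zero gap is a prompt for clinical review, not proof of bias, because recommendation rates can differ between groups for legitimate clinical reasons.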

Transparency and explainability remain significant challenges for LLMs in healthcare, as LLMs do not provide traditional, step-by-step reasoning like clinical algorithms.⁸⁴ Feature attribution techniques can help clarify why the model suggested a particular regimen by highlighting key factors such as genotype, fibrosis score, or past treatment failures.⁸⁵ Saliency maps visually emphasize influential data points, such as “genotype 1a” or “prior failure with sofosbuvir,” making the rationale behind recommendations more intuitive.⁸⁶

Counterfactual explanations, on the other hand, enable clinicians to see how slight changes in input data, like assuming the patient was treatment-naïve, would alter the model’s output.⁸⁷ This helps clinicians understand the model’s decision boundaries and evaluate its relevance to the patient’s specific circumstances.
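A counterfactual check can be sketched as follows: change one input field at a time, rerun the model, and report whether the recommendation flips. The `toy_recommend` function is a hypothetical placeholder for the deployed model, and the regimen labels and patient fields are invented for the example.

```python
def toy_recommend(patient):
    """Hypothetical stand-in for the model's recommendation; not clinical guidance."""
    if patient["treatment_experienced"]:
        return "regimen_B"
    return "regimen_A"

def counterfactuals(recommend_fn, patient, changes):
    """For each single-field change, report the new output and whether it differs."""
    original = recommend_fn(patient)
    results = []
    for field, new_value in changes.items():
        altered = {**patient, field: new_value}
        new_output = recommend_fn(altered)
        results.append((field, new_value, new_output, new_output != original))
    return original, results

patient = {"genotype": "1a", "treatment_experienced": True, "cirrhosis": False}
original, results = counterfactuals(
    toy_recommend,
    patient,
    {"treatment_experienced": False, "cirrhosis": True},
)
```

In this toy case, assuming the patient was treatment-naïve flips the output while changing cirrhosis status does not, which is the kind of boundary information the paragraph above describes.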

Understanding how the model was trained is equally important. If most recommendations favor sofosbuvir-based regimens, it may be because training data were heavily drawn from clinical trials emphasizing these therapies.

In cases where interpretability remains a challenge, simplified surrogate models can approximate the decision-making process of complex LLMs, offering clinicians a more intuitive pathway to understanding AI-driven recommendations.⁸⁸ For example, a decision tree could outline how viral load and past treatment history influenced the suggested therapy, providing a straightforward map of the model’s logic.
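To illustrate the surrogate idea, the sketch below fits the simplest possible decision tree, a single split (a decision stump), to the outputs of an opaque model and reports its fidelity, meaning the fraction of cases where the surrogate reproduces the model's output. The `black_box` function, its risk formula, the feature names, and the synthetic cohort are all assumptions for illustration; a practical surrogate would typically be a deeper tree fitted with an established library.

```python
def black_box(patient):
    """Hypothetical opaque model returning 'extended' or 'standard' (illustrative only)."""
    risk = 0.5 * patient["prior_failures"] + patient["log10_viral_load"] / 10
    return "extended" if risk > 0.95 else "standard"

def fit_stump(patients, labels, features):
    """Find the single feature/threshold split that best reproduces the labels."""
    best = None
    for feature in features:
        for threshold in sorted({p[feature] for p in patients}):
            for above, below in (("extended", "standard"), ("standard", "extended")):
                preds = [above if p[feature] > threshold else below for p in patients]
                fidelity = sum(a == b for a, b in zip(preds, labels)) / len(labels)
                if best is None or fidelity > best["fidelity"]:
                    best = {"feature": feature, "threshold": threshold,
                            "above": above, "below": below, "fidelity": fidelity}
    return best

# Synthetic cohort covering a small grid of inputs.
cohort = [
    {"prior_failures": pf, "log10_viral_load": vl}
    for pf in (0, 1, 2)
    for vl in (4.0, 5.0, 6.0, 7.0)
]
labels = [black_box(p) for p in cohort]
stump = fit_stump(cohort, labels, ["prior_failures", "log10_viral_load"])
```

Fidelity should always be reported alongside a surrogate, because a rule that is easy to read but disagrees with the underlying model on many patients explains little.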

Continuous improvement

Reinforcement learning from human feedback (RLHF) helps fine-tune responses by incorporating clinician inputs and real-world patient outcomes, helping the model remain relevant.⁸⁹ Continuous monitoring of the model's performance in real-world settings is essential. Regular audits, including external audits by independent bodies, help maintain compliance with regulatory requirements after deployment. Periodic reviews of the model's impact on patient outcomes and healthcare practices help confirm that it continues to provide benefits without unintended negative consequences.
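Continuous monitoring can be as simple as tracking how often clinicians accept the model's recommendations over a sliding window and flagging a drop for audit. The sketch below shows one way to do this. The `AgreementMonitor` class, the window size, and the alert threshold are illustrative assumptions, and the example covers only the monitoring step, not RLHF itself.

```python
from collections import deque

class AgreementMonitor:
    """Track clinician acceptance of model recommendations over a sliding window."""

    def __init__(self, window=50, alert_below=0.8):
        self.outcomes = deque(maxlen=window)  # oldest entries drop off automatically
        self.alert_below = alert_below

    def record(self, clinician_accepted):
        self.outcomes.append(bool(clinician_accepted))

    def agreement_rate(self):
        if not self.outcomes:
            return None
        return sum(self.outcomes) / len(self.outcomes)

    def needs_review(self):
        # Only alert once the window is full, to avoid noisy early alarms.
        window_full = len(self.outcomes) == self.outcomes.maxlen
        return window_full and self.agreement_rate() < self.alert_below

monitor = AgreementMonitor(window=5, alert_below=0.8)
for accepted in [True, True, False, True, False]:
    monitor.record(accepted)

rate = monitor.agreement_rate()
flag = monitor.needs_review()
```

A flagged window would trigger the kind of audit described above; the appropriate window length and threshold would need to be set locally.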

Training gastroenterologists on LLMs ensures safe and effective integration into clinical workflows. Awareness of limitations is crucial, including potential hallucinations, outdated or biased training data, lack of clinical nuance, and response variability. Best practices include cross-checking AI recommendations with guidelines, assessing confidence scores, using structured prompts, and engaging in feedback loops to refine accuracy. Transparent patient communication about AI’s role reinforces that LLMs support, rather than replace, clinical judgment.

Dissemination of findings

Disseminating AI model findings through peer-reviewed publications, conferences, and preprint servers fosters collaboration and knowledge sharing. Hosting code and data on GitHub ensures transparency, while partnerships with universities and research institutions support clinical validation and refinement. This approach promotes reproducibility, adoption, and continuous improvement in real-world settings.

Discussion

Because AI models in healthcare are mostly used as CDSS rather than as autonomous interventions, the framework used to deploy them plays a key role in patient outcomes.^28,90 Although some in silico studies have shown that AI algorithms can match clinician performance, convincing evidence of a positive impact on patient outcomes and clinical efficacy is still lacking.^91–93 Bridging this gap requires optimizing the translation from in silico to real-world settings, with a focus on feedback loops, continuous learning, human factors, and safety.^94,95 Compliance with HIPAA and GDPR, together with adherence to guidelines such as DECIDE-AI, SPIRIT-AI, and CONSORT-AI, is essential for responsible deployment and monitoring.^94,96–98

LLMs are subject to bias, ethical concerns, susceptibility to cyberattacks, hallucinations, and output variability, and they often struggle with edge cases, evolving guidelines, and patient variability.⁹⁹ Adoption barriers include limited clinician training, workflow integration challenges, and concerns about automation in decision-making. Without continuous updates, LLMs risk becoming outdated, which reduces their long-term utility. Our framework addresses these gaps through fine-tuning with diverse datasets, RAG for real-time adaptability, confidence scoring, calibration, human-in-the-loop oversight, and structured response formats to mitigate errors. Training gastroenterologists to interpret LLM outputs rather than to treat them as automated decisions supports responsible use, while regular audits, RLHF, and iterative updates support long-term accuracy.

In our framework, clinicians are involved from model optimization to implementation, ensuring safe and effective LLM integration in gastroenterology. During model optimization, gastroenterologists contribute by validating fine-tuned outputs, refining prompts for accuracy, and cross-referencing AI-generated recommendations with clinical guidelines. In calibration and hallucination mitigation, they help assess confidence scores, detect errors, and provide real-world feedback to improve model reliability. During implementation, clinicians facilitate EHR integration, ensuring AI recommendations align with workflow needs while participating in pilot testing to evaluate usability and clinical relevance. In continuous improvement, they engage in RLHF refinements, audit LLM performance, and advocate for necessary updates based on evolving Hepatitis C treatment protocols. By actively participating at each stage, gastroenterologists play a crucial role in ensuring LLMs function as reliable, evidence-based decision-support tools in clinical practice.¹⁰⁰

Despite these measures, real-world validation of LLMs remains in its early stages, and regulatory pathways for AI-based clinical decision tools are still evolving.¹⁰¹ Scaling multi-institutional validation, clinician-AI collaboration models, and AI-assisted clinical trials will be essential in establishing LLMs as reliable tools in gastroenterology.¹⁰²

Supplemental Material

Supplemental material, sj-docx-1-tag-10.1177_17562848251328577 for Utilizing large language models for gastroenterology research: a conceptual framework by Parul Berry, Rohan Raju Dhanakshirur and Sahil Khanna in Therapeutic Advances in Gastroenterology

Footnotes

Acknowledgements

None.

Declarations

ORCID iD

Sahil Khanna

Supplemental material

Supplemental material for this article is available online.

References

1. Sheikh, Prins, Schrijvers. Artificial intelligence: definition and background. In: Sheikh, Prins, Schrijvers (eds) Mission AI: the new system technology. Cham: Springer International Publishing, 2023, pp. 15–41.

2. Mitchell. Machine learning. New York: McGraw-Hill, 1997.

3. Shai Shalev-Shwartz, Ben-David. Understanding machine learning. Cambridge: Cambridge University Press, 2014.

4. Sarker. Deep learning: a comprehensive overview on techniques, taxonomy, applications and research directions. SN Comput Sci 2021; 2: 420.

5. Kim, Thiessen, Bolton, et al. PubChem substance and compound databases. Nucleic Acids Res 2016; 44: D1202–D1213.

6. Wishart, Feunang, Guo, et al. DrugBank 5.0: a major update to the DrugBank database for 2018. Nucleic Acids Res 2018; 46: D1074–D1082.

7. Raiaan MAK, Mukta MSH, Fatema, et al. A review on large language models: architectures, applications, taxonomies, open issues and challenges. IEEE Access 2024; 12: 26839–26874.

8. Khurana, Koli, Khatter, et al. Natural language processing: state of the art, current trends and challenges. Multimed Tools Appl 2023; 82: 3713–3744.

9. Gravina, Pellegrino, Cipullo, et al. May ChatGPT be a tool producing medical information for common inflammatory bowel disease patients’ questions? An evidence-controlled analysis. World J Gastroenterol 2024; 30: 17–33.

10. Samaan, Issokson, Feldman, et al. Artificial intelligence and patient education: examining the accuracy and reproducibility of responses to nutrition questions related to inflammatory bowel disease by GPT-4. medRxiv, 2023. DOI: 10.1101/2023.10.28.23297723.

11. Lee, Staller, Botoman, et al. ChatGPT answers common patient questions about colonoscopy. Gastroenterology 2023; 165: 509–511.e7.

12. Lahat, Shachar, Avidan, et al. Evaluating the utility of a large language model in answering common patients’ gastrointestinal health-related questions: are we there yet? Diagnostics (Basel) 2023; 13: 1950.

13. Giuffrè, Kresevic, You, et al. Systematic review: the use of large language models as medical chatbots in digestive diseases. Aliment Pharmacol Ther 2024; 60: 144–166.

14. Patel, Davis, Ralbovsky, et al. Large language models outperform traditional natural language processing methods in extracting patient-reported outcomes in IBD. Gastro Hep Adv 2025; 4: 100563.

15. Simsek. GastroGPT: successful proof-of-concept study of gastroenterology-specific large language model. UEG Week 2023, 2023.

16. Oniani, Visweswaran, et al. Enhancing large language models for clinical decision support by incorporating clinical practice guidelines. In: Proceedings of the 2024 IEEE 12th International Conference on Healthcare Informatics (ICHI). Piscataway, New Jersey: Institute of Electrical and Electronics Engineers, 2024, pp. 694–702.

17. Hadi, Tran, Nagarajan, et al. Evaluation of ChatGPT as a diagnostic tool for medical learners and clinicians. PLoS One 2024; 19: e0307383.

18. Agarwal, Laradji, Charlin, et al. LitLLM: a toolkit for scientific literature review. arXiv:2402.01788, 2024.

19. Wei, Zhuo, et al. DrugReAlign: a multisource prompt framework for drug repurposing based on large language models. BMC Biol 2024; 22: 226.

20. Singhal, Azizi, et al. Large language models encode clinical knowledge. Nature 2023; 620: 172–180.

21. Luo, Sun, Xia, et al. BioGPT: generative pre-trained transformer for biomedical text generation and mining. Brief Bioinform 2022; 23: bbac409.

22. Silverman, Sushil, Bhasuran, et al. Algorithmic identification of treatment-emergent adverse events from clinical notes using large language models: a pilot study in inflammatory bowel disease. Clin Pharmacol Ther 2024; 115: 1391–1399.

23. Sun, Owens, et al. Development of a liver disease-specific large language model chat interface using retrieval-augmented generation. Hepatology 2024; 80(5): 1158–1168.

24. Zhang, Liu, Sheng, et al. Preliminary fatty liver disease grading using general-purpose online large language models: ChatGPT-4 or Bard? J Hepatol 2024; 80: e279–e281.

25. Milne-Ives, de Cock, Lim, et al. The effectiveness of artificial intelligence conversational agents in health care: systematic review. J Med Internet Res 2020; 22: e20346.

26. Srinivasan, Samaan, Rajeev, et al. Large language models and bariatric surgery patient education: a comparative readability analysis of GPT-3.5, GPT-4, Bard, and online institutional resources. Surg Endosc 2024; 38: 2522–2532.

27. Yeo, Samaan, et al. Assessing the performance of ChatGPT in answering questions regarding cirrhosis and hepatocellular carcinoma. Clin Mol Hepatol 2023; 29: 721–732.

28. Liu, Wright, Patterson, et al. Using AI-generated suggestions from ChatGPT to optimize clinical decision support. J Am Med Inform Assoc 2023; 30: 1237–1245.

29. Bitan, Patterson. Unique challenges in user interface design for medical devices that use predictive algorithms. Proc Int Symp Hum Factors Ergon Healthc 2020; 9: 265–266.

30. Sciberras, Farrugia, Gordon, et al. Accuracy of information given by ChatGPT for patients with inflammatory bowel disease in relation to ECCO guidelines. J Crohns Colitis 2024; 18: 1215–1221.

31. Delk, et al. A comparison of large language model versus manual chart review for extraction of data elements from the electronic health record. Gastroenterology 2024; 166(4): 707–709.e3.

32. Savage, Wang, Shieh. A large language model screening tool to target patients for best practice alerts: development and validation. JMIR Med Inform 2023; 11: e49886.

33. Lahat, Klang. Can advanced technologies help address the global increase in demand for specialized medical care and improve telehealth services? J Telemed Telecare 2024; 30(9): 1516–1517.

34. Şendur, Cerit. ChatGPT from radiologists’ perspective. Br J Radiol 2023; 96: 20230203.

35. Huespe, Echeverri, Khalid, et al. Clinical research with large language models generated writing-clinical research with AI-assisted writing (CRAW) study. Crit Care Explor 2023; 5: e0975.

36. Andaur Navarro, Damen JAA, Takada, et al. Risk of bias in studies on prediction models developed using supervised machine learning techniques: systematic review. BMJ 2021; 375: n2281.

37. Subbaswamy, Saria. From development to deployment: dataset shift, causality, and shift-stable models in health AI. Biostatistics 2020; 21: 345–352.

38. Finlayson, Subbaswamy, Singh, et al. The clinician and dataset shift in artificial intelligence. N Engl J Med 2021; 385: 283–286.

39. Helman, Terry, Pellathy, et al. Engaging multidisciplinary clinical users in the design of an artificial intelligence-powered graphical user interface for intensive care unit instability decision support. Appl Clin Inform 2023; 14: 789–802.

40. Kusters, Misevic, Berry, et al. Interdisciplinary research in artificial intelligence: challenges and opportunities. Front Big Data 2020; 3: 577974.

41. Rasmy, Xiang, Xie, et al. Med-BERT: pretrained contextualized embeddings on large-scale structured electronic health records for disease prediction. NPJ Digit Med 2021; 4: 86.

42. Lee, Yoon, Kim, et al. BioBERT: a pre-trained biomedical language representation model for biomedical text mining. Bioinformatics 2020; 36: 1234–1240.

43. Tinn, Cheng, et al. Domain-specific language model pretraining for biomedical natural language processing. ACM Trans Comput Healthc 2022; 3: 1–23.

44. Laparra, Mascio, Velupillai, et al. A review of recent work in transfer learning and domain adaptation for natural language processing of electronic health records. Yearb Med Inform 2021; 30: 239–244.

45. Moor, Banerjee, Abad ZSH, et al. Foundation models for generalist medical artificial intelligence. Nature 2023; 616: 259–265.

46. Shen, Schutte, et al. Classifying the lifestyle status for Alzheimer’s disease from clinical notes using deep learning with weak supervision. BMC Med Inform Decis Mak 2022; 22: 88.

47. Devlin, Chang M-W, Lee, et al. BERT: pre-training of deep bidirectional transformers for language understanding. arXiv:1810.04805, 2019.

48. Waisberg, Ong, Masalkhi, et al. GPT-4: a new era of artificial intelligence in medicine. Ir J Med Sci 2023; 192: 3197–3200.

49. Tan, Lee, Anbananthen KSM, et al. RoBERTa-LSTM: a hybrid model for sentiment analysis with transformer and recurrent neural network. IEEE Access 2022; 10: 21517–21525.

50. Mastropaolo, Scalabrino, Cooper, et al. Studying the usage of text-to-text transfer transformer to support code-related tasks. In: Proceedings of the 2021 IEEE/ACM 43rd International Conference on Software Engineering (ICSE). Piscataway, NJ: Institute of Electrical and Electronics Engineers, 2021, pp. 336–347.

51. Zheng, et al. Deep learning for knowledge graph completion with XLNET. In: Proceedings of the 2021 5th international conference on deep learning technologies, 2021, pp. 13–19, Qingdao, China: Association for Computing Machinery.

52. Lan, Chen, Goodman, et al. ALBERT: a lite BERT for self-supervised learning of language representations. arXiv:1909.11942, 2020.

53. Zhou, Qin, Lan, et al. News text generation method integrating pointer-generator network with bidirectional auto-regressive transformer. In: Proceedings of the 2023 2nd international conference on Artificial Intelligence and Intelligent Information Processing (AIIIP). Piscataway, NJ: Institute of Electrical and Electronics Engineers, 2023, pp. 114–118.

54. Clark, Luong M-T, et al. ELECTRA: pre-training text encoders as discriminators rather than generators. arXiv:2003.10555, 2020.

55. Alsentzer, Murphy, Boag, et al. Publicly available clinical BERT embeddings. In: Proceedings of the 2nd clinical natural language processing workshop, 2019, pp. 72–78. Minneapolis, MN: Association for Computational Linguistics.

56. Peng, Chen. An empirical study of multi-task learning on BERT for biomedical text mining. arXiv:2005.02799, 2020.

57. Yang, Chen, PourNejatian, et al. GatorTron: a large language model for clinical natural language processing. medRxiv, 2022. DOI: 10.1101/2022.02.27.22271257.

58. Wang. Selecting between BERT and GPT for text classification in political science research. arXiv:2411.05050, 2024.

59. Nagarajan, Kondo, Salas, et al. Economics and equity of large language models: health care perspective. J Med Internet Res 2024; 26: e64226.

60. Azamfirei, Kudchadkar, Fackler. Large language models and the perils of their hallucinations. Crit Care 2023; 27: 120.

61. Tinn, Cheng, et al. Fine-tuning large neural language models for biomedical natural language processing. Patterns 2023; 4: 100729.

62. Jiang, Araki, Ding, et al. How can we know when language models know? On the calibration of language models for question answering. Trans Assoc Comput Linguist 2021; 9: 962–977.

63. Xie, Chen, Lee, et al. Calibrating language models with adaptive temperature scaling. arXiv:2409.19817, 2024.

64. Ding. Large language models for anomaly and out-of-distribution detection: a survey. arXiv:2409.01980, 2024.

65. Gao, Xiong, Gao, et al. Retrieval-augmented generation for large language models: a survey. arXiv:2312.10997, 2023.

66. Giuffrè, Kresevic, Pugliese, et al. Optimizing large language models in digestive disease: strategies and challenges to improve clinical outcomes. Liver Int 2024; 44(9): 2114–2124.

67. Sabbatella, Ponti, Giordani, et al. Prompt optimization in large language models. Mathematics 2024; 12: 929.

68. Chen, Zhang, Langrené, et al. Unleashing the potential of prompt engineering in large language models: a comprehensive review. arXiv:2310.14735, 2023.

69. Huang, Ruan, Huang, et al. A survey of safety and trustworthiness of large language models through the lens of verification and validation. Artif Intell Rev 2024; 57: 175.

70. Ambacher. Designing a user-friendly and optimized version of a user interface for a large language model (LLM). PhD diss., Technische Hochschule Ingolstadt, 2024.

71. Ghosh, Huang, Yan, et al. Enhancing healthcare user interfaces through large language models within the adaptive user interface framework. In: Proceedings of the International Congress on Information and Communication Technology. Singapore: Springer Nature Singapore, 2024, pp. 527–540.

72. Khan. Enhancing electronic health records systems and diagnostic decision support systems with large language models. PhD diss., Purdue University Graduate School, 2024.

73. Moore, Frye. Review of HIPAA, Part 1: history, protected health information, and privacy and security rules. J Nucl Med Technol 2019; 47: 269–272.

74. Bedi, Liu, Orr-Ewing, et al. Testing and evaluation of health care applications of large language models: a systematic review. JAMA 2025; 333(4): 319–328.

75. Strickland. IBM Watson, heal thyself: how IBM overpromised and underdelivered on AI health care. IEEE Spectrum 2019; 56: 24–31.

76. Wang, Kosinski. Deep neural networks are more accurate than humans at detecting sexual orientation from facial images. J Pers Soc Psychol 2018; 114: 246.

77. Qiu, Liu, Zhou, et al. Review of artificial intelligence adversarial attack and defense technologies. Appl Sci 2019; 9: 909.

78. Berghoff, Neu, von Twickel. Vulnerabilities of connectionist AI applications: evaluation and defense. Front Big Data 2020; 3: 23.

79. Dressel, Farid. The accuracy, fairness, and limits of predicting recidivism. Sci Adv 2018; 4: eaao5580.

80. Gavrilova. Responsible artificial intelligence and bias mitigation in deep learning systems. In: Proceedings of the 27th International Conference on Information Visualisation (IV). Los Alamitos, California: Institute of Electrical and Electronics Engineers Computer Society, 2023, pp. 329–333.

81. Wang, Mukhopadhyay, Xiao, et al. An interactive approach to bias mitigation in machine learning. In: Proceedings of the 2021 IEEE 20th International Conference on Cognitive Informatics & Cognitive Computing (ICCI*CC). Piscataway, NJ: Institute of Electrical and Electronics Engineers, 2021, pp. 199–205.

82. de Manuel, Delgado, Parra Jounou, et al. Ethical assessments and mitigation strategies for biases in AI-systems used during the COVID-19 pandemic. Big Data Soc 2023; 10: 20539517231179199.

83. Uche-Anya, Anyane-Yeboa, Berzin, et al. Artificial intelligence in gastroenterology and hepatology: how to advance clinical practice while ensuring health equity. Gut 2022; 71: 1909–1915.

84. Merrill, Peng, et al. Transparency helps reveal when language models learn meaning. Trans Assoc Comput Linguist 2023; 11: 617–634.

85. Zhou, Adel, Schuff, et al. Explaining pre-trained language models with attribution scores: an analysis in low-resource settings. arXiv:2403.05338, 2024.

86. Ding, Koehn. Evaluating saliency methods for neural language models. arXiv:2104.05824, 2021.

87. Penicig, Chen, Wilson, et al. Assessing explainability in large language models through soft counterfactual analysis: a comparative study of Google Gemini and OpenAI ChatGPT, https://www.researchsquare.com/article/rs-5011294/v1 (2024).

88. Egami, Hinck, Stewart, et al. Using imperfect surrogates for downstream inference: design-based supervised learning for social science applications of large language models. Adv Neural Inf Process Syst 2024; 36.

89. Lai, Van Nguyen, Ngo, et al. Okapi: instruction-tuned large language models in multiple languages with reinforcement learning from human feedback. arXiv:2307.16039, 2023.

90. Skivington, Matthews, Simpson, et al. A new framework for developing and evaluating complex interventions: update of Medical Research Council guidance. BMJ 2021; 374: n2061.

91. Liu, Faes, Kale, et al. A comparison of deep learning performance against health-care professionals in detecting diseases from medical imaging: a systematic review and meta-analysis. Lancet Digit Health 2019; 1: e271–e297.

92. Vasey, Ursprung, Beddoe, et al. Association of clinician diagnostic performance with machine learning-based decision support systems: a systematic review. JAMA Netw Open 2021; 4: e211276.

93. Freeman, Geppert, Stinton, et al. Use of artificial intelligence for image analysis in breast cancer screening programmes: systematic review of test accuracy. BMJ 2021; 374: n1872.

94. DECIDE-AI Steering Group. DECIDE-AI: new reporting guidelines to bridge the development-to-implementation gap in clinical artificial intelligence. Nat Med 2021; 27: 186–187.

95. McCradden, Stephenson, Anderson. Clinical research underlies ethical integration of healthcare artificial intelligence. Nat Med 2020; 26: 1325–1326.

96. Cruz Rivera, Liu, Chan, et al. Guidelines for clinical trial protocols for interventions involving artificial intelligence: the SPIRIT-AI extension. Nat Med 2020; 26: 1351–1363.

97. Liu, Cruz Rivera, Moher, et al. Reporting guidelines for clinical trial reports for interventions involving artificial intelligence: the CONSORT-AI extension. Nat Med 2020; 26: 1364–1374.

98. Vasey, Nagendran, Campbell, et al. Reporting guideline for the early-stage clinical evaluation of decision support systems driven by artificial intelligence: DECIDE-AI. Nat Med 2022; 28: 924–933.

99. Tamkin, Brundage, Clark, et al. Understanding the capabilities, limitations, and societal impact of large language models. arXiv:2102.02503, 2021.

100. Hager, Jungmann, Holland, et al. Evaluation and mitigation of the limitations of large language models in clinical decision-making. Nat Med 2024; 30: 2613–2622.

101. Meskó, Topol. The imperative for regulatory oversight of large language models (or generative AI) in healthcare. NPJ Digit Med 2023; 6: 120.

102. Goh, Gallo, Hom, et al. Large language model influence on diagnostic reasoning: a randomized clinical trial. JAMA Netw Open 2024; 7: e2440969.
