Sage Journals: Discover world-class research

Abstract

French

Objectives: Determine if a large language model (LLM, GPT-4) can label and consolidate and analyze interventional radiology (IR) microwave ablation device safety event data into meaningful summaries similar to humans. Methods: Microwave ablation safety data from January 1, 2011 to October 31, 2023 were collected and type of failure was categorized by human readers. Using GPT-4 and iterative prompt development, the data were classified. Iterative summarization of the reports was performed using GPT-4 to generate a final summary of the large text corpus. Results: Training (n = 25), validation (n = 639), and test (n = 79) data were split to reflect real-world deployment of an LLM for this task. GPT-4 demonstrated high accuracy in the multiclass classification problem of microwave ablation device data (accuracy [95% CI]: training data 96.0% [79.7, 99.9], validation 86.4% [83.5, 89.0], test 87.3% [78.0, 93.8]). The text content was distilled through GPT-4 and iterative summarization prompts. A final summary was created which reflected the clinically relevant insights from the microwave ablation data relative to human interpretation but had inaccurate event class counts. Conclusion: The LLM emulated the human analysis, suggesting feasibility of using LLMs to process large volumes of IR safety data as a tool for clinicians. It accurately labelled microwave ablation device event data by type of malfunction through few-shot learning. Content distillation was used to analyze a large text corpus (>650 reports) and generate an insightful summary which was like the human interpretation.

Visual Abstract

This is a visual representation of the abstract.

Keywords

large language model artificial intelligence safety microwave ablation interventional radiology

Introduction

Large volumes of medical device adverse event data accumulate daily. For example, in 2022 nearly 3 million medical device adverse events were recorded in the United States Food and Drug Administration’s Manufacture and User Facility Device Experience (MAUDE) database.¹ Multiple studies have evaluated events related to angiographic equipment, venous stents, and inferior vena cava filters, highlighting MAUDE’s value in interventional radiology (IR).^2-4 However, MAUDE reports are heterogeneous with most data captured in a free-text data field. When attempting to generate meaningful insights from such databases, human analysis is limited by multiple factors like expertise, time required, lack of uniformity, and fatigue.⁵ Consequently, databases that house critical safety information may be underutilized.

Recent advances in artificial intelligence (AI), including foundational models such as the large language model (LLM) GPT-4, continue to fuel excitement about the future of AI in radiology.^6-8 Prior studies demonstrated that GPT-4 performs well on radiology-specific tasks, including board exam-style questions, study protocolling, and CT report mining.^9-13 The ability of these models to perform well in natural language processing (NLP) tasks with relatively little technical expertise required by human users is a leap toward overcoming tasks that may otherwise be too costly from a human resource perspective.¹⁴ However, one challenge with LLMs is that they are limited by the number of context tokens (a discrete unit of text, eg, 4000 tokens may equal 2500 words) and text data, like safety event reports, may be in the thousands or millions of tokens.^1,15 Nevertheless, LLMs may present a possible solution to the problem of identifying, classifying, and analyzing the immense volume of safety data in radiology, enabling faster, easier, and more meaningful human analysis.

Microwave ablation (MWA) is a commonly performed IR procedure and awareness of reported device malfunctions is important for safe use.¹⁶ Prior studies have mostly examined types of patient-related complications with MWA, such as infection, vascular injuries, tumour seeding.¹⁷ However there is little research on modes of device malfunctions.¹⁸ This study aims to assess the feasibility of applying artificial intelligence to analyze device malfunction data, specifically through an example using MWA devices. This is performed through generation of a human-labelled dataset of MWA device malfunctions that were captured in the MAUDE database. In particular, the performance of the LLM GPT-4 in 2 tasks is assessed: the classification and summarization of MWA data. We hypothesize that a state-of-the-art LLM, GPT-4, will be able to accurately label the type of malfunction through in-context learning (ie, by providing examples in the prompt). Additionally, we hypothesize that through a process of content distillation, the LLM context window limitations can be overcome to generate an overall summary with insights from the data similar in quality to the human-guided analysis. This proposed AI implementation is potentially high-value and low-risk, as it could reduce the cost to analyze, annotate, and gain insights from safety data.

Materials and Methods

Study Design

This was a feasibility study using retrospective data from the open MAUDE database.¹ Therefore, research ethics board approval was waived. The study follows the applicable Checklist for Artificial Intelligence in Medical Imaging criteria.¹⁹

Data

Summarization and Classification Training/Validation Data

A dataset from MAUDE (accessed March 31, 2022) examining microwave ablation devices (n = 1189 events; “System, Ablation, Microwave, and Accessories”) from January 1, 2011 to December 31, 2021 was used for human-guided analysis of MWA device malfunctions and subsequent LLM-powered analysis prompt development training and validation (Figure 1). The data was cleaned and analyzed by 3 radiology residents and the final data was analyzed by an interventional radiology fellow. Cleaning involved removal of duplicate and irrelevant data (eg, non-ablation or literature related entries). The final dataset after cleaning contained 664 events. The data were then randomly split into prompt development/training data (n = 25) and validation data (n = 639).

Figure 1.

MAUDE data used in the study.

Classification Testing Data

A second smaller test dataset (n = 98 November 1, 2022-October 31, 2023) was retrieved from MAUDE (accessed Nov 6, 2023) after the training and validation data had been analyzed, to mitigate data leakage in prompt development and evaluate the classification performance of the LLM. This was preprocessed and analyzed by the IR fellow using the same classification system and cleaning process (n = 79 post cleaning).

Human Analysis

Initial human data analysis of the training/validation dataset involved review of the first 50 cases by all 3 diagnostic radiology residents, to achieve consensus on a classification system. The classification system for device malfunctions was generated in conjunction with the staff interventional radiologists and co-authors. The remaining data were then divided and analyzed individually by the residents. One resident then merged and reviewed all data, ensuring consistency in classification. A final review was then performed by an interventional radiology fellow. The test data were reviewed and classified by the IR fellow, but not included in the summary generation as this would/could have resulted in data leakage in the prompt development.

Artificial Intelligence Large Language Model (AI LLM) Analysis

OpenAI’s GPT-4 (model = “gpt-4,” accessed Oct.-Nov., 2023) was used as the LLM in this study via API access using Python (OpenAI, Delaware; Python Version 3.11, Python Software Foundation, Delaware). GPT-4’s stochasticity “temperature” parameter was set to 0.1 (ie, more deterministic) but was not tuned to select this number.⁷ Hyperparameter tuning was not performed as the prompt was considered the main varied hyperparameter in this study, to emulate a real-world deployment by a non-AI expert. As the event descriptions are free-form text reports, the corpus of reports would exceed the maximum tokens of the LLM (8192 tokens). Therefore, content distillation through iterative summarization was required (Figure 2). First, a small portion of data (n = 25) was used to iteratively develop a prompt to both (1) classify the event into one of the 8 classes and (2) summarize the event into a free-form text summary to reduce token count (prompt shown in Figure 3). The small sample size (n = 25) for prompt development was chosen to emulate real-world deployment, where it would not be possible to have a large training dataset and limited examples of each class may be seen. Indeed, it was chosen to be half of the size of the initial human analysis where 50 events were reviewed. The entire event summary text field was submitted with no other pre-processing other than extracting it from the correct column and formatting it into a single list of events for GPT-4 to analyze. Upon satisfactory performance of classification and summarization, the summary prompt was applied to the whole cohort of data on an individual basis, that is, each event was submitted with the prompt via the API, one-by-one (Figures 2 and 3). The summarized events were then batched based on the context limitation (batches of 7000 tokens to allow over 1000 tokens for a response; GPT-4’s context limit = 8192) and a summarization instruction prompt was provided, leading to a mid-level (2nd) summarization. Finally, the last summarization/conclusion was created using the mid-level summarizations. Classification labels were extracted from the summaries. Second and third summarizations were performed (1) with the classification + summary and (2) with only the summary text using a modified mid-level prompt without classification (ie, classification stripped from the text and no classification framework provided), to examine the impact of a provided classification on the summary.

Figure 2.

Classification and content distillation workflow and study protocol. Development/Training data is emphasized as this is not a traditional machine learning/deep learning workflow but rather based on prompt development. The prompt development workflow can be applied at any phase of prompt development, that is, in a summarization workflow, at each summarization step modification may be required to create outputs that meet the desired outcomes. In this study, the test data were reserved for only the classification task, as the content distillation was based on previously explored data known to the researchers.

Figure 3.

Final summary prompt used to instruct the LLM to summarize each safety event. Initial background information on the role and task are provided first. Next, the classification system is outlined with minimal required description, this is where modifications may be required with iterative prompt development to ensure accuracy in the early phases. Next, a formal example is provided, furthering the “in-context” learning, the process of providing examples in the prompt to guide the task. Lastly, the commencement of the task is clearly provided.

Statistical Analysis

The assessment of content distillation and summarization is subjective, and the final reports were subjectively reviewed for accuracy regarding the summarization content. Classification performance was assessed using Cohen’s kappa, to reflect the interobserver agreement for classification, as GPT-4 would be considered a “second reviewer” in this workflow. Additionally, accuracy (95% CI) and micro-averaged F1 Score, selected for class imbalance are reported. Statistical analysis performed using R (version 4.3.0, R Foundation for Statistical Computing). A P value of .05 was used for significance.

Results

Data

Prompt development training/validation and test data were collected from the MAUDE website with 1189 events in the training/validation data and 98 in the test data. Following data cleaning, 663 unique events included in the final human analysis/summarization (the training and validation dataset; Figure 1). Test data included 79 events in the final analysis. The data were class imbalanced, as shown in Table 1 and Figure 4.

Table 1.

Classification of Microwave Ablation Device Malfunctions and Count From the Ensemble of Training, Validation, and Test Data.

Classification of failure	Count observed in MAUDE, total (n test)
Fracture	330 (42)
Temperature problem	139 (4)
Failure to fire	69 (6)
Other	63 (21)
Bent	62 (2)
Software	33 (2)
Leak	32 (2)
User error	15 (0)

Note. Total = training, validation, and test data.

Figure 4.

Heatmaps demonstrating the classification performance of the LLM in the training (n = 25), For Peer Review validation (n = 639), and test (n = 79) data. Human labels on the Y axis and LLM (GPT-4) predicted labels on the X axis. Proportion of the true (human-standard) labels for each data in a row-wise fashion.

Human Analysis

Review of the MWA data demonstrated patterns of device malfunctions which resulted in the following classification system for MWA device malfunction: (1) Fracture: fracture of the microwave ablation device, (2) Temperature Problem: temperature-related issues, (3) Bent: bending of the device, (4) Leak: fluid leaks from the device, (5) User Error: user error leading to malfunction but not including fractures, (6) Software: software failures, (7) Failure to Fire: when the device would not start despite no other failure, (8) Other: all other malfunctions. The results are presented in Table 1. Mechanical malfunctions (fracture or bending) were the most common reported malfunction accounting for over half of the malfunctions (392/744; 52.7%). Specifically, fracture of the MWA probe was the most frequently reported malfunction accounting for 44.4% of the events (330/743). In 42.7% (141/330) of the fractures, a portion of the fractured device was retained in situ. Temperature problems were also common, mostly related to overheating of the ablation probe preventing further use (139/743; 18.7%). User error was uncommon (15/743; 2.0%).

AI LLM Classification

Iterative prompt engineering was performed using the training data for the event classification and summarization step (Figure 3 demonstrates the third and final prompt).²⁰ It manually reviewed and the prompt revised until accurate summarization of the events was observed (Table 2 and Figures 3 and 4, 1/25 incorrect, 96% accuracy [95% CI 79.7, 99.9]). The prompt was then applied to the entire validation dataset and subsequently the test data (Table 2 and Figure 4). The results demonstrated high agreement to the human labels with a Cohen’s kappa score of 0.95 in the prompt training data with the final prompt, 0.82 with the validation data, and 0.81 with the out of sample test data. All instances of misclassification were reviewed, with examples of misclassification shown in Table 2. High accuracy was seen for physical problems throughout all data, such as Fracture (96%-100%), Temperature problems (87%-100%), and Bending (98%-100%) as demonstrated in the Figure 4 heatmap. Software, User Errors, Failure to Fire, and Other showed lower accuracy; with the lowest at 50% in the software category for the validation data. Compared to the no information rate, GPT-4 classification was statistically more accurate (Table 2, accuracy range 86.4% [validation] to 96.0% [training], P < .001 for all data). Additionally, micro-averaged F1 scores and Cohen’s Kappa scores were all above 0.8, suggestive of strong model performance despite minimal prompting.

Table 2.

Results From GPT-4 Classification.

GPT classification results	Training data (n = 25)	Validation data (n = 639)	Test data (n = 79)
Accuracy (%, 95% CI)	96.0 (79.7, 99.9)	86.4 (83.5, 89.0)	87.3 (78.0, 93.8)
No information rate (P Acc > NIR)	0.36 (<.001)	0.44 (<.001)	0.53 (<.001)
Micro-averaged F1 score	0.96	0.86	0.87
Cohen’s Kappa	0.95	0.82	0.81

Note. Acc = accuracy; NIR = no information rate.

Content Distillation and Final Summary

The training and validation database was reviewed with overall insights known to the authors and therefore conclusions generated by GPT-4 were able to be subjectively assessed (Figure 5). Content distillation was trialed with and without the classification system to observe the effect on the final report. Indeed, the classification system appeared to guide the structure of the LLM’s final summary (Figure 5). When provided with the classification, the summary followed that structure. However, in the absence of the classification system, it was able to create its own classification that closely mirrored the human classifications. However, other important conclusions were also created including the flag of failure to return devices to the manufacturer and procedural delays. The count data was inaccurate relative to the classification in both instances: without the provided classification it reported only 214 fractures, and with the classification it reported 314, but the real number was 288 (Figures 4 and 5). As a result of the final summary, post-hoc analysis of the frequency of return of the devices and unnecessary sedation was performed. Unnecessary sedation with delayed or aborted procedures was seen in 7% (52/743). A total of 27.9% (207/743) devices were documented as returned to manufacturer. Supplementary text files are available for review containing both summaries, with and without the human classification.

Figure 5.

Context distillation workflow with outputs. In the no classes final summary, the classes from the classification system were stripped from the input to the LLM, to limit influence on output. Note the outputs of all but the “FINAL SUMMARY NO CLASSES” are truncated to fit in the figure (denoted by “. . .”).

Discussion

This study demonstrates the feasibility and ease of use of AI implementation in safety data analysis for IR devices. The initial human analysis identified types of device related malfunctions that can occur with MWA and derived a classification system. Subsequently, with minimal prompting, the AI LLM GPT-4 was able to accurately classify and summarize the data to generate meaningful conclusions.

Microwave Ablation Malfunctions

Mechanical failures (bending or fracture) were among the most observed malfunctions, with tip fracture being the most common. This is a known complication in the MWA literature.^17,18 Fracture of the tip of the probe can be related to improper or excessive force applied through the probe. In nearly half of the fracture events (42.7%, 141/330), the probe tip was left in situ, however, due to the limited nature of the MAUDE database, it is not possible to know what the long-term outcomes are.

Classification

This study demonstrates how in-context learning with an LLM, the process by which an LLM may “learn” based off a prompt and examples encapsulated within it, may be used to accurately classify adverse event data related to MWA devices with accuracy remaining greater than 85% in training, validation, and test data.²⁰ This is similar to other NLP tasks in radiology that have demonstrated high accuracy, including a recent GPT-4 study showing 96% accuracy in labelling free text CT reports.^12,21 In our study, there was strong agreement between the human rater and the LLM generated classification (Cohen’s kappa >0.80 in all data), reflecting the strength of the LLM despite minimal prompting and only in-context learning. GPT-4 struggled in labelling non-mechanical malfunctions (eg, Temperature, Failure to Fire, Software). One explanation for this could be that these often have overlapping terminology, whereas mechanical issues such as device fracture are more clear cut. Another instance where the LLM struggled was in classifying an event as “failure to fire” when a temperature error code was also reported was likely due to the code descriptions being unknown to the LLM (eg, event 339 in Table 3). It is suspected that with further prompt development this could be mitigated. Despite these limitations, the overall classification showed high agreement with human generated labels. With a very small set of training data of only 25 cases, we were able to develop a prompt instructing the LLM to label microwave ablation safety data with good performance. This is a promising starting point for using LLMs to analyze safety data.

Table 3.

Example GPT-4 Classification Errors. The Event Text Excerpts Are Taken From the Event Descriptions From the Data, Chosen to Select the Most Relevant Portion. Note That the Event Number is a Random Internal Number Attributed by the Research Team to Index the Events and Not a MAUDE Event ID.

Event text excerpt	GPT classification	Human classification	Comments
Event 525: “THE TEMP ALERT SIGN IS ON BECAUSE THE WATER FLOW WAS DISRUPTED. THE WATER FLOW INSIDE THE TUBE GETS BLOCKED”	Leak	Temperature problem	The water coolant system is used to cool the probe, but there was no report of leakage in the event. This was a cooling and therefore temperature related event.
Event 549: “BOTH PROBES GAVE ‘TEMP TOO HIGH’ ERRORS . . . BOTH PROBES WERE EVENTUALLY REMOVED. THE TIPS OF BOTH PROBES WERE RETAINED IN THE LESION.”	Temperature problem	Fracture (ie, detached tips)	Challenging event text that may have been flagged by a human. A hierarchy of events was created, with fracture as most important, therefore the temperature issue should have been discarded.
Event 339: “WHILE TESTING THE FIRST SOLERO PROBE 5-6 TIMES PRIOR TO THE PROCEDURE IN CHILLED SALINE, AN ERROR 209 MESSAGE DISPLAYED.”	Failure to fire	Temperature problem	Error code 209 is a temperature error for the Solero device, which was not provided to GPT-4. In the absence of this data it labelled the event as failure to fire, which is a more general failure to start functioning for unknown reasons.
Event 13: “STARTED THE BURN AND A MINUTE AND A HALF INTO THE ABLATION, THE UNIT GAVE A REFLECTIVE POWER ERROR”	Failure to fire	Temperature problem	The high reflective power error is a temperature related problem with the device. Additionally, the device was able to start (fire) but failed to continue to function.

Content Distillation

LLMs are memory constrained which results in context token limits.⁷ This limitation could be a hinderance to their utility in analysis of large volumes of text data. However, in our study, we demonstrate that content distillation can be used to generate informative summaries highlighting patterns of adverse events related to microwave ablation devices. While it is challenging to assess summarization given its subjective nature, the LLM-generated final report closely reflected the human analysis. The LLM also highlighted potentially valuable information and insights not initially recognized in the human analysis. Two key insights from the LLM were: (1) 7% (52/743) of ablation procedures were aborted after induction of anaesthesia/sedation without proceeding with the procedure or with significant delays, that is, unnecessary sedation (2) Only a quarter (27.9%, 207/743) of the devices were documented as returned to the manufacturer for evaluation. On the other hand, the LLM in our study was prone to inaccurate count data, which has also been shown in prior studies.²⁰ Separate research has demonstrated that GPT-4 has faltering math performance on certain tasks. Thus, any count data or mathematical insights from a task such as this should be interpreted with caution and may in part be a consequence of content distillation whereby count data are carried forward between summaries.²² Fortunately, in this setting where classification and summarization are performed on an case-by-case basis, it is easy to generate count data programmatically rather than relying on the LLM to generate the counts. It is important that any user of an AI system critically analyze the output of the system to understand where it may have failed and where additional human analysis may be required. We are proposing herein that an AI may be used to augment rather than replace the human in this workflow.²³

Limitations

The MAUDE database is limited in that it is not a research database and ultimately provides only a snapshot into a clinical scenario with no long-term follow-up or additional clinical details.²⁴ Additionally, MAUDE does not reflect the volume of use of a device, and therefore no conclusions can be drawn regarding the incidence of device malfunctions. With respect to the AI in this study, use of in-context learning rather than fine-tuning to the task by training the LLM on additional data may limit maximal performance.²⁰ Additionally, more extensive prompt development, trial of different LLMs, and varying hyperparameters such as the temperature could be performed. However, the goal of the study was to assess performance in a constrained fashion to emulate real-world deployment, and satisfactory performance was seen with limited data and prompting. As LLM performance continues to improve and fine-tuned models become available, future studies will be able to examine the impact of larger context on outputs, which will likely improve performance; however data also continues to grow which will likely continue to challenge the models.

Conclusions

Patient safety data continue to grow and event reports are a valuable source of information to improve clinical practice and understanding of medical devices.^1,3,4 This study demonstrates the feasibility of using AI to generate reports on data that might otherwise be under evaluated. Importantly, automated analysis of these data could be created by non-AI experts and the resulting LLMs could act as early detectors to identify important insights that may otherwise have not been explored. This proposed AI-human collaborative implementation would be low-risk, given the data most often already exist and the AI would be used to augment human analysis, with continued expert human oversight as the final safeguard.

Supplemental Material

sj-txt-1-caj-10.1177_08465371241269436 – Supplemental material for Feasibility of Artificial Intelligence Powered Adverse Event Analysis: Using a Large Language Model to Analyze Microwave Ablation Malfunction Data

Supplemental material, sj-txt-1-caj-10.1177_08465371241269436 for Feasibility of Artificial Intelligence Powered Adverse Event Analysis: Using a Large Language Model to Analyze Microwave Ablation Malfunction Data by Blair E. Warren, Fahd Alkhalifah, Aida Ahrari, Adam Min, Aly Fawzy, Ganesan Annamalai, Arash Jaberi, Robert Beecroft, John R. Kachura and Sebastian C. Mafeld in Canadian Association of Radiologists Journal

Supplemental Material

sj-txt-2-caj-10.1177_08465371241269436 – Supplemental material for Feasibility of Artificial Intelligence Powered Adverse Event Analysis: Using a Large Language Model to Analyze Microwave Ablation Malfunction Data

Supplemental material, sj-txt-2-caj-10.1177_08465371241269436 for Feasibility of Artificial Intelligence Powered Adverse Event Analysis: Using a Large Language Model to Analyze Microwave Ablation Malfunction Data by Blair E. Warren, Fahd Alkhalifah, Aida Ahrari, Adam Min, Aly Fawzy, Ganesan Annamalai, Arash Jaberi, Robert Beecroft, John R. Kachura and Sebastian C. Mafeld in Canadian Association of Radiologists Journal

Footnotes

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Blair E. Warren

Supplemental Material

Supplemental material for this article is available online.

References

United States Food and Drug Administration. MAUDE - manufacturer and user facility device experience. Accessed August 9, 2023. https://www.accessdata.fda.gov/scripts/cdrh/cfdocs/cfmaude/search.cfm

Yan

Datta

Kesselman

Malhotra

Scherer

Wadhwa

. Adverse events associated with new nitinol venous-labeled stents: a review of the FDA MAUDE database. J Vasc Interv Radiol. 2022;33(10):1261-1263. doi:10.1016/j.jvir.2022.06.018

Bikdeli

Kirtane

Jimenez

, et al. Hemopericardium and cardiac tamponade as a complication of vena caval filters: systematic review of the published literature and the MAUDE database. Clin Appl Thromb Hemost. 2019;25:1076029619849111. doi:10.1177/1076029619849111

Healy

Ahrari

Alkhalifah

, et al. Typology, severity, and outcomes of adverse events related to angiographic equipment-a ten-year analysis of the FDA MAUDE database. Can Assoc Radiol J. 2023;74(4):737-744. doi:10.1177/08465371231167990

Kazai

Kamps

Milic-Frayling

. An analysis of human factors and label accuracy in crowdsourcing relevance judgments. Inf Retr. 2013;16(2):138-178. doi:10.1007/s10791-012-9205-0

Touvron

Martin

Stone

, et al. Llama 2: open foundation and fine-tuned chat models. Published online July 19, 2023. doi:10.48550/arXiv.2307.09288

OpenAI. GPT-4 technical report. Published online March 27, 2023. doi:10.48550/arXiv.2303.08774

Bommasani

Hudson

Adeli

, et al. On the opportunities and risks of foundation models. Published online July 12, 2022. Accessed November 7, 2023. http://arxiv.org/abs/2108.07258

Bhayana

Bleakney

Krishna

. GPT-4 in radiology: improvements in advanced reasoning. Radiology. 2023;307(5):e230987. doi:10.1148/radiol.230987

10.

Bhayana

Krishna

Bleakney

. Performance of ChatGPT on a radiology board-style examination: insights into current strengths and limitations. Radiology. 2023;307(5):e230582. doi:10.1148/radiol.230582

11.

Gertz

Bunck

Lennartz

, et al. GPT-4 for automated determination of radiological study and protocol based on radiology request forms: a feasibility study. Radiology. 2023;307(5):e230877. doi:10.1148/radiol.230877

12.

Fink

Bischoff

Fink

, et al. Potential of ChatGPT and GPT-4 for data mining of free-text CT reports on lung cancer. Radiology. 2023;308(3):e231362. doi:10.1148/radiol.231362

13.

Barat

Soyer

Dohan

. Appropriateness of recommendations provided by ChatGPT to interventional radiologists. Can Assoc Radiol J. 2023;74(4):758-763. doi:10.1177/08465371231170133

14.

Mozayan

Fabbri

Maneevese

Tocino

Chheang

. Practical guide to natural language processing for radiology. Radiographics. 2021;41:1446-1453. doi:10.1148/rg.2021200113

15.

Liu

Lin

Hewitt

, et al. Lost in the middle: how language models use long contexts. Published online July 31, 2023. Accessed November 1, 2023. http://arxiv.org/abs/2307.03172

16.

Simon

Dupuy

Mayo-Smith

. Microwave ablation: principles and applications. Radiographics. 2005;25(suppl_1):S69-S83. doi:10.1148/rg.25si055501

17.

Fang

Cortis

Yusuf

, et al. Complications from percutaneous microwave ablation of liver tumours: a pictorial review. Br J Radiol. 2019;92(1099):20180864. doi:10.1259/bjr.20180864

18.

Lahat

Eshkenazy

Zendel

, et al. Complications after percutaneous ablation of liver tumors: a systematic review. Hepatobiliary Surg Nutr. 2014;3(5):317-323. doi:10.3978/j.issn.2304-3881.2014.09.07

19.

Mongan

Moy

Kahn

. Checklist for artificial intelligence in medical imaging (CLAIM): a guide for authors and reviewers. Radiol Artif Intell. 2020;2(2):e200029. doi:10.1148/ryai.2020200029

20.

Bhayana

Biswas

Cook

, et al. From bench to bedside with large language models: AJR expert panel narrative review. AJR Am J Roentgenol. Published online April 10, 2024. doi:10.2214/AJR.24.30928

21.

Chen

Ball

Yang

, et al. Deep learning to classify radiology free-text reports. Radiology. 2018;286(3):845-852. doi:10.1148/radiol.2017171115

22.

Chen

Zaharia

Zou

. How is ChatGPT’s behavior changing over time? Published online October 31, 2023. doi:10.48550/arXiv.2307.09009

23.

Warren

Bilbily

Gichoya

, et al. An introductory guide to artificial intelligence in interventional radiology: part 2: implementation considerations and harms. Can Assoc Radiol J. Published online March 6, 2024. doi:10.1177/08465371241236377

24.

Sawaya

Champlain

Cohen

Avram

. Barriers to reporting: limitations of the Maude database. Dermatol Surg. 2021;47(3):424-425. doi:10.1097/DSS.0000000000002832

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.

0.00 MB