Abstract
Complex medical terminology used in clinical documentation can present barriers to patients’ understanding of their medical findings. We aimed to generate easy-to-understand summaries of clinical radiology reports using large language models (LLMs) and to evaluate their safety and quality. Eight board-certified physician reviewers evaluated 1982 LLM-generated radiology report summaries (computed tomography, magnetic resonance imaging, ultrasound, and x-ray) for safety and quality, using predefined rating criteria and the corresponding original radiology reports for reference. Physician reviewers determined 99.2% (1967 of 1982) of the LLM-generated summaries to be safe. The reviewers rated the quality of the LLM-generated summaries as “5—Very Good” (80.6%), “4—Good” (11.1%), “3—Acceptable” (5.7%), “2—Poor” (1.7%), and “1—Very Poor” (0.9%). Safety varied significantly across imaging modalities (P = .002). Large language models can be used to generate safe and high-quality summaries of clinical radiology reports. Further investigation is warranted to determine the impact of LLM-generated summaries on patient perception of understanding, knowledge of their medical conditions, and overall experience.
Introduction
The 21st Century Cures Act has provided patients and caregivers with unprecedented access to their medical documentation and results. With the aim of increasing patient autonomy, access to such medical documentation is an important factor in facilitating informed patient decision-making.1 Despite increased access to medical records, barriers persist, including the complex and specific language that characterizes clinical documentation, a barrier that becomes even more pronounced for patients with limited English proficiency. Radiology reports and clinical notes, for example, are available electronically to patients yet contain highly specialized language that is not easily understandable by a general audience, posing unique challenges to the patient experience. These barriers add layers of opacity that can hinder patients’ abilities to fully understand their medical information.2–4 Moreover, some evidence suggests that medical jargon has the potential to place patients at risk of harm when misunderstood.5 “Translation” of highly technical clinical documentation via large language models (LLMs) offers novel avenues to convert specialized language (ie, technical medical language) into simple and easy-to-understand narratives for patients and their families and caregivers.
Recent advancements in natural language processing, including LLMs, have raised the possibility that artificial intelligence (AI) can be utilized to improve communication between healthcare teams and patients. Several previous studies have demonstrated that LLMs can generate accurate responses to medical scenarios6–8 and other language-based clinical queries.9 A recent study, for example, demonstrated that ChatGPT achieved high performance on the United States Medical Licensing Exam.10 This suggests that the machine learning representations of medical knowledge within some LLMs may be sufficient to produce plain language summaries of clinical reports. To our knowledge, however, no study has examined the safety or quality of using LLMs to produce AI-based summaries of medical documents.
Radiology reports, which often contain complex anatomical and pathological terminology, offer an ideal category of clinical notes for applying machine learning techniques aimed at generating easy-to-understand summaries. We aimed to (1) generate LLM-based summaries of clinical radiology reports that were understandable at a basic reading level and (2) characterize the safety and quality of the LLM-generated summaries using physician raters following predefined safety and quality rating criteria.
Methods
To assess the quality and safety of LLM-generated radiology report summaries, a subsample of deidentified radiology reports (n = 2001) was randomly selected, in proportion to relative emergency department order frequency,11 from a larger random sample of deidentified radiology reports (n = 6000) utilized for quality review. Radiology reports are structured clinical notes written by radiologists, medical doctors specifically trained to diagnose injuries and illnesses present in medical imaging studies. Radiology reports typically include information on the imaging test performed, including the reason the test was performed (ie, indication), the technique used, any comparisons (ie, prior imaging studies used for comparison), findings, and the radiologist’s clinical impression of the test. The medical imaging studies included in this investigation were the 4 modalities most commonly used in emergency departments (with relative frequency of occurrence in the medical record shown): computed tomography (CT, 20.03%), magnetic resonance imaging (MRI, 6.96%), ultrasound (29.52%), and x-ray (43.49%).
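As an illustration, the proportional sampling step could be carried out as in the following sketch, which assumes the 6000-report pool is held in a pandas DataFrame with a modality column; the file name, column name, and rounding scheme are hypothetical and not taken from the study.

```python
import pandas as pd

# Hypothetical pool of 6000 deidentified reports with a 'modality' column.
reports = pd.read_csv("deidentified_reports.csv")

# ED order frequencies reported in the text, used as sampling proportions.
proportions = {"xray": 0.4349, "ultrasound": 0.2952, "ct": 0.2003, "mri": 0.0696}

TARGET_N = 2001  # size of the subsample selected for review

# Draw from each modality stratum in proportion to its ED order frequency.
subsample = pd.concat(
    [
        reports[reports["modality"] == m].sample(n=round(p * TARGET_N), random_state=0)
        for m, p in proportions.items()
    ]
)
```

With these proportions, the per-modality draws (870 x-ray, 591 ultrasound, 401 CT, and 139 MRI) sum to exactly 2001, closely matching the final evaluated counts reported in the Results.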
We constructed a language prompt instructing the model to summarize radiology results at a basic (fifth-grade) reading level, avoid making assumptions or evaluations beyond what is explicitly stated in the report, translate any medical jargon into simple language, define necessary medical terms, and describe the imaging modality. The prompt was further expanded to include several examples of ideal LLM-generated summaries (ie, “few shot” prompting) and was concatenated with the deidentified radiology reports. The concatenated prompt and radiology reports were processed using GPT 3.5 Turbo (OpenAI gpt-3.5-turbo).
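A minimal sketch of this summarization step is shown below, using the OpenAI Python client; the system prompt wording and the few-shot examples are illustrative placeholders rather than the study’s actual prompt.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative instructions; the study's actual prompt wording is not published here.
SYSTEM_PROMPT = (
    "Summarize the following radiology report at a fifth-grade reading level. "
    "Do not make assumptions or evaluations beyond what is explicitly stated "
    "in the report. Translate medical jargon into simple language, define "
    "necessary medical terms, and describe the imaging modality."
)

# Hypothetical few-shot examples pairing reports with ideal summaries.
FEW_SHOT = [
    {"role": "user", "content": "<example radiology report>"},
    {"role": "assistant", "content": "<example ideal summary>"},
]

def summarize(report_text: str) -> str:
    """Generate a patient-readable summary for one deidentified report."""
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "system", "content": SYSTEM_PROMPT}]
        + FEW_SHOT
        + [{"role": "user", "content": report_text}],
    )
    return response.choices[0].message.content
```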
The LLM-generated summaries were evenly distributed to a team of 8 actively practicing board-certified physician raters (6 emergency medicine and 2 family medicine), who evaluated them for safety and quality, using the original radiology reports for reference. The physician raters evaluated safety and quality based on the criteria described in Supplement 1. The physician raters were instructed to categorize the summaries as “safe—will not lead to patient harm” (safe) or “unsafe—contains inaccurate information or is missing key information that has potential for patient harm” (unsafe). Criteria for safety included inclusion of all radiology findings in the summary, avoidance of overly reassuring or overly concerning language, inclusion of follow-up recommendations, accurate communication of indeterminate findings, and presence of context, patient history, reason for exam, and limitations of the study in the translation. Exploratory safety comparisons between imaging modalities were conducted using the χ2 test (overall) and the Fisher exact test (group comparisons). The physician raters also evaluated the quality of the summaries using a Likert scale (1-5, corresponding to “Very Poor” to “Very Good”). Domains for translation quality assessment were adapted from the American Translators Association “Framework for Standardized Error Marking.”12 Criteria for quality assessment included accuracy, completeness of information, understandability at a basic reading level, clarity, and objectivity. The present study was determined to be exempt by the Institutional Review Board.
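For illustration, the exploratory comparisons could be computed with standard scipy routines, as in the sketch below; the per-modality safe/unsafe counts are hypothetical placeholders consistent only with the reported totals (the actual breakdown appears in Table 1).

```python
from scipy.stats import chi2_contingency, fisher_exact

# Rows: modality; columns: [safe, unsafe]. Counts are hypothetical
# placeholders that sum to the reported totals (1967 safe, 15 unsafe).
table = [
    [390, 7],   # CT
    [136, 2],   # MR
    [582, 3],   # ultrasound
    [859, 3],   # x-ray
]

# Overall association between imaging modality and safety rating.
chi2, p_overall, dof, _ = chi2_contingency(table)

# Pairwise subgroup comparison (eg, CT vs x-ray) via the Fisher exact test.
_, p_ct_vs_xray = fisher_exact([table[0], table[3]])

print(p_overall, p_ct_vs_xray)
```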
Results
Of the 2001 original radiology reports, 1982 summaries were generated by the LLM and evaluated by the physician raters; of the remaining 19 reports, the LLM failed to produce output for 13, and 6 had incomplete ratings by the physician reviewers. The final sample included 397 CT, 138 MR, 585 ultrasound, and 862 x-ray reports. Radiology report lengths varied from approximately 400 to 6300 characters. The radiology findings of the studies included a wide array of clinical pathologies, such as acute and chronic cardiac, pulmonary, vascular, hepatobiliary, musculoskeletal, intracranial, maxillofacial, neurological, gastrointestinal, gynecological, oncologic, infectious, medical device, and trauma-related etiologies, in addition to normal findings.
Of the 1982 LLM-generated summaries, 1967 (99.2%) were rated as safe and 15 (0.8%) were rated as unsafe. Safety was found to vary by imaging modality (P = .002). Subgroup comparisons by imaging modality revealed a higher frequency of “safe” ratings among ultrasound and x-ray summaries compared with CT summaries (P < .01, uncorrected; Table 1). Most summaries were rated as “5—Very Good” (n = 1598, 80.6%), “4—Good” (n = 220, 11.1%), or “3—Acceptable” (n = 113, 5.7%), whereas fewer (2.6% in total) were rated as “2—Poor” (n = 33, 1.7%) or “1—Very Poor” (n = 18, 0.9%) (Table 2).
Table 1. Physician Reviewer Safety Ratings of LLM-Generated Radiology Report Summaries by Imaging Modality.
Abbreviations: LLM, large language model; CT, computed tomography; MR, magnetic resonance.
Table 2. Physician Reviewer Quality Ratings of LLM-Generated Radiology Report Summaries.
Abbreviation: LLM, large language model.
Discussion
As a result of the legislation in the 21st Century Cures Act,1 patients now have unprecedented access to their medical records, including clinical notes and results. Although readily available access to medical records is an important step toward increasing patient autonomy, the complex language contained within clinical documents remains a significant barrier to most patients’ full understanding of their medical findings. The present law that ensures patients’ access to their clinical information does not require that the information be presented in a manner that is easily understandable. In the present study, we investigated the potential of LLMs to generate easy-to-understand, safe, and high-quality summaries of clinical radiology reports. The LLM-generated radiology report summaries were reviewed and evaluated by board-certified physicians, and over 99% and 80% of the summaries were found to be safe and high-quality, respectively. Importantly, based on the criteria for quality in this study, the findings suggest that it is feasible to generate plain-language AI-based summaries that maximally retain the clinically relevant findings of the source report. Taken together, the present study provides evidence that AI-based language tools hold promise to enhance provider–patient communication and overcome barriers posed by the highly specialized language of medicine. Since informed decision-making is a cornerstone of patient autonomy, AI-based language tools aimed at increasing patients’ understanding of their health conditions offer avenues to augment the patient experience, including greater ability to choose between treatment options and to self-manage care.
Although some past studies have investigated the utility of LLMs in various clinical applications, there is a paucity of studies investigating the ability of LLMs to generate summaries for patients. The present study, to our knowledge, is the first to investigate the potential of AI to generate summaries of clinical radiology reports. Although some past studies have made significant contributions by showing that GPT may be able to generate discharge summaries and discharge instructions, these studies were exploratory and did not assess safety and quality systematically.13,14 The present study utilized a team of board-certified physicians to perform safety and quality assessments for nearly 2000 LLM-generated summaries. We also explored the possibility that the safety of LLM-generated summaries could depend upon imaging modality. We found that safety was indeed associated with imaging modality and that x-ray and ultrasound summaries were rated as safe more frequently. It seems reasonable that reports from more sophisticated imaging, such as CT, would inherently contain more complexity that could impact the accuracy of LLM-generated summaries; such reports also tend to be longer than x-ray or ultrasound reports. It is interesting, however, that no significant difference in safety was found when other modalities were compared with MRI. This may have been due to the relatively low frequency of MRI reports in the dataset.
In summary, the present study provides substantial evidence that natural language processing can produce safe and high-quality summaries of radiology reports. Accordingly, AI-based tools to enhance patient communication hold significant promise to augment the patient experience and address some of the challenges posed by the highly specialized language of medicine. Investigation using additional clinical document types (eg, clinical notes), larger sample sizes, and future iterations of LLMs will be required to elucidate best practices for generating safe and high-quality AI-based summaries. Moreover, follow-up studies will be required to build upon the groundwork of the present study and directly measure the impact of LLM-generated summaries on patient perception of understanding, knowledge of their medical conditions, and overall experience.
Limitations
Although the present study had significant strengths, including a large sample size, reports from a variety of imaging modalities, and physician review, it also had several limitations. First, the frequency of summaries determined to be unsafe was low, limiting the precision of modality-level safety estimates; larger studies will be needed to characterize the true frequency and distribution of safety by imaging modality. Second, we did not include a measure of physician inter-rater reliability in the present study. Future studies may benefit from a broader range of imaging types and rater overlap. Third, in this retrospective study, it was not feasible to address efficacy directly (ie, via patient surveys). Further studies will be required to build upon the groundwork of the present study by (1) directly measuring the effect of LLM summaries on patient perception of understanding, education, and overall experience and (2) determining the scope of LLM capabilities, such as extension to various clinical document types.
Footnotes
Declaration of Conflicting Interests
The author(s) declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article. F.B. is the Vice President of Data Science at Vital Software, Inc., a company engaged in developing artificial intelligence clinical decision support products for the ED. N.W.S. is Director of Clinical Innovation & Research at Vital Software, Inc. S.O.F. is Director of Nursing at Vital Software, Inc. J.D.S. is a Co-Founder and Chief Medical Officer of Vital Software, Inc.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Supplemental Material
Supplemental material for this article is available online.
References
