Abstract

Objective

To investigate the performance (accuracy, comprehensiveness, consistency, and the necessary information ratio) of large language models (LLMs) in providing knowledge related to respiratory aspiration, and to explore the potential of using LLMs as training tools.

Methods

This was a non-human-subject evaluative study. Two LLMs (GPT-3.5 and GPT-4) were asked 36 questions (32 objective questions and four subjective questions) about respiratory aspiration in English and Chinese. Responses were scored by two experts against gold standards derived from authoritative books. The accuracy of the two LLMs’ responses to objective questions was compared using the chi-square test or Fisher's exact test. For subjective questions, the t-test or Mann–Whitney U test was used to compare the two LLMs.

Results

There was no significant difference in the ratings provided by the two experts. Both LLMs achieved high accuracy on objective questions. The LLMs also performed well on subjective questions, showing high levels of accuracy, comprehensiveness, consistency, and necessary information ratio. No significant differences were found between the two LLMs in the accuracy of the English and Chinese responses to subjective questions (z = 0.331, p = 0.886; z = 1.703, p = 0.114) or in the comprehensiveness of the English and Chinese responses (t = 0.787, p = 0.461; t = 1.175, p = 0.285).

Conclusions

LLMs demonstrated promising performance in delivering respiratory aspiration-related knowledge and showed promise as supportive tools in training, particularly when their limitations were well understood.

Keywords

Dysphagia, aspiration, large language models, artificial intelligence, accuracy, comprehensiveness

Introduction

Respiratory aspiration is a condition in which food, oral secretions, or stomach contents cannot be effectively swallowed or spat out and instead enter the airway or lungs.¹ The incidence of pulmonary aspiration is approximately 1 in 8,325, rising to 1 in 8,202 in emergency situations,² and the incidence of respiratory aspiration reaches as high as 88% among patients undergoing endotracheal intubation.³ Respiratory aspiration is associated with impaired laryngeal function, esophageal diseases, alcohol consumption, reflux, tube feeding, dementia, and impaired consciousness.⁴ Respiratory aspiration can lead to aspiration pneumonia, which is life-threatening.⁵ The incidence of aspiration pneumonia reaches 17.5% among hospitalized patients,⁶ accounting for 5–66.8% of all pneumonia cases.⁷ Furthermore, the incidence of aspiration pneumonia is increasing every year.⁸ Aspiration pneumonia is estimated to incur annual healthcare costs exceeding 10 billion dollars.⁶

The occurrence of respiratory aspiration is associated with nurses’, patients’, and caregivers’ knowledge of respiratory aspiration prevention and treatment.⁹ Training nurses, patients, and caregivers can improve awareness of respiratory aspiration and thereby reduce its incidence. Currently, the main forms of training are face-to-face teaching, written materials, videos, and mobile phone applications.^10,11 However, current training lacks personalized guidance¹² and fails to meet clinical needs due to a shortage of respiratory aspiration specialists.¹³ Therefore, exploring technology-driven approaches to respiratory aspiration training is essential.

Large language models (LLMs) have potential for respiratory aspiration training. LLMs are built on deep learning to understand and generate human language.¹⁴ They can interpret text and produce appropriate responses, which allows for multi-turn conversations. LLMs have also been used in healthcare and have great potential to improve patient health and enhance patient quality of life.¹⁴ LLMs can provide a wealth of health knowledge and information to help people access health-related advice.^15–17 Additionally, the ready availability of LLMs allows for efficient, real-time information retrieval in clinical settings. Therefore, it seems feasible to apply LLMs to respiratory aspiration training.

Generative Pre-trained Transformers (GPT-3.5/4) are the most widely used LLMs in healthcare, offering disease-related information.¹⁸ However, their performance in addressing respiratory aspiration remains unclear. This study aims to assess GPT-3.5/4's responses to respiratory aspiration-related questions in terms of accuracy, comprehensiveness, consistency, and the necessary information ratio, across English and Chinese contexts. We hypothesize that both LLMs will perform well, with no significant differences between versions or language settings. The results may support the integration of LLMs into respiratory aspiration training to help reduce the incidence of respiratory aspiration and aspiration pneumonia.

Methods

This study was non-human subject research, as no human subject was enrolled in this study. However, the testing questions used in this study were derived from clinical scenarios commonly encountered by patients with respiratory aspiration. To ensure ethical compliance, the study was reviewed and approved by the medical ethics committee of Capital Medical University (approval # Z2022SY027).

Design of test questions

Two researchers (SF and ZX) with backgrounds in respiratory aspiration and artificial intelligence created 36 questions based on clinical cases of respiratory aspiration from two widely recognized books on the subject.^19,20 The gold-standard answers were derived from the same books at the same time. The questions were divided into two categories: basic knowledge (objective questions) and comprehensive analysis (subjective questions). The objective questions were organized into five themes: (1) overview of respiratory aspiration (mechanism, definition, risk factors, and outcomes); (2) clinical symptoms of respiratory aspiration; (3) screening and diagnosis (screening scales and laboratory examinations); (4) management of adverse outcomes (emergency of choking and treatment of aspiration pneumonia); and (5) prevention strategies for different risk factors (dysphagia, weakened cough reflex, gastroesophageal reflux, oral problems, poor eating behaviors, and treatment-related factors). The subjective questions covered the identification of aspiration risks, corresponding countermeasures, and first aid treatment strategies. The objective questions were multiple-choice and were worded informally and simply to simulate the questions that caregivers or patients might ask LLMs. The subjective questions were specialized, challenging cases that simulated the professional inquiries of nurses. The questions and gold standards were then reviewed by a third expert (QX) with extensive experience in the management of respiratory aspiration. The final list of questions and gold standards was established after repeated revision (Supplementary Table S1).

Data collection

Two LLMs, GPT-3.5 (https://openai.com/blog/openai-api; accessed on 17 September 2023) and GPT-4 (https://openai.com/blog/openai-api; accessed on 20 September 2023), were asked the same questions in both English and Chinese. Before the objective questions, the LLMs were given the following instruction: “You will be asked 32 multiple-choice questions next. Each question has four options, but there is only one correct answer. Please give the correct answer and analysis based on the latest knowledge.” Before the subjective questions, the LLMs were given the following instruction: “Next, you will be asked questions about case analysis. Please give the best answer based on the latest knowledge.” All responses were recorded verbatim by the researchers (SF and ZX), and there were no missing data, as all answers were automatically generated by the LLMs (Supplementary Tables 2 and 3).
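The questioning protocol above can be laid out as a simple grid of runs. The following is a minimal sketch, not the authors' procedure: the names `PREAMBLES`, `QUESTION_COUNTS`, and `build_runs` are hypothetical, and the Chinese wording of the two instructions is not given in the text, so only the English instructions are stored.

```python
from itertools import product

# Instructions quoted from the study (English versions only; the Chinese
# wording is not reported, so it is not reproduced here).
PREAMBLES = {
    "objective": (
        "You will be asked 32 multiple-choice questions next. Each question "
        "has four options, but there is only one correct answer. Please give "
        "the correct answer and analysis based on the latest knowledge."
    ),
    "subjective": (
        "Next, you will be asked questions about case analysis. Please give "
        "the best answer based on the latest knowledge."
    ),
}
MODELS = ("GPT-3.5", "GPT-4")
LANGUAGES = ("English", "Chinese")
QUESTION_COUNTS = {"objective": 32, "subjective": 4}


def build_runs():
    """List every (model, language, question) combination to be asked."""
    runs = []
    for model, language in product(MODELS, LANGUAGES):
        for kind, count in QUESTION_COUNTS.items():
            for number in range(1, count + 1):
                runs.append(
                    {
                        "model": model,
                        "language": language,
                        "kind": kind,
                        "question_number": number,
                        "preamble": PREAMBLES[kind],
                    }
                )
    return runs


runs = build_runs()
print(len(runs))  # 2 models x 2 languages x 36 questions = 144
```

Enumerating the grid in this way makes it easy to confirm that every model and language combination received all 36 questions.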

Outcomes

The main outcomes were the accuracy and comprehensiveness of responses. Accuracy refers to the correctness and reliability of the responses, as compared to the gold standard. Comprehensiveness refers to the extent of detail and coverage in the responses, as compared to the gold standard. For accuracy and comprehensiveness, the following comparisons were made: 1) the accuracy and comprehensiveness of the responses in English between the two LLMs; 2) the accuracy and comprehensiveness of the responses in Chinese between the two LLMs; and 3) the accuracy and comprehensiveness of the responses in different languages from the same model. The secondary outcomes were consistency and necessary information ratio. In this study, consistency refers to the degree of similarity in responses to the same question asked in different languages, regardless of their accuracy. The necessary information ratio is defined as the percentage of relevant content within the total output generated by the LLM in response to a given question.

Evaluation of the outcomes

After fully understanding the questions and the responses, two experienced experts in the field of respiratory aspiration scored the accuracy, comprehensiveness, consistency, and necessary information ratio of the responses. The responses to objective questions were assessed as correct/consistent or false/inconsistent. Subjective questions were scored out of five, with the scores calculated as follows. The accuracy score was calculated by dividing the number of correct items in the LLM's response by the total number of items it provided, and then multiplying the result by five. The comprehensiveness score was determined by dividing the number of items in the LLM's response that matched the gold standard by the total number of items in the gold standard, also multiplied by five. The consistency score was calculated by multiplying the proportion of items with the same answers in both English and Chinese by five. In addition, the necessary information ratio was calculated by dividing the number of items in the LLM's response that matched the gold standard by the total number of items provided by the LLM. If the two experts’ scores for a subjective question differed by less than one point, the final score was the average of the two scores. If they differed by one point or more, a third expert was asked to evaluate the response, and the final score was the average of the three experts’ scores.
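The scoring rules above reduce to a few ratios. The sketch below restates them in Python for illustration. The function names and the item counts in the example are hypothetical and are not taken from the study data.

```python
def accuracy_score(correct_items, provided_items):
    """Correct items / items the LLM provided, scaled to five points."""
    return 5 * correct_items / provided_items


def comprehensiveness_score(matched_items, gold_items):
    """Items matching the gold standard / items in the gold standard, scaled to five."""
    return 5 * matched_items / gold_items


def consistency_score(same_items, total_items):
    """Proportion of items answered the same in English and Chinese, scaled to five."""
    return 5 * same_items / total_items


def necessary_information_ratio(matched_items, provided_items):
    """Items matching the gold standard / items the LLM provided (a proportion)."""
    return matched_items / provided_items


def final_score(expert_a, expert_b, expert_c=None):
    """Average two experts if they differ by less than one point, else average three."""
    if abs(expert_a - expert_b) < 1:
        return (expert_a + expert_b) / 2
    if expert_c is None:
        raise ValueError("scores differ by one point or more: a third expert is needed")
    return (expert_a + expert_b + expert_c) / 3


# Hypothetical response: 10 items provided, 9 correct, 6 matching a gold
# standard that contains 8 items.
print(accuracy_score(9, 10))               # 4.5
print(comprehensiveness_score(6, 8))       # 3.75
print(necessary_information_ratio(6, 10))  # 0.6
print(final_score(4.0, 4.5))               # 4.25
print(final_score(3.0, 4.5, 4.5))          # 4.0
```

Note that accuracy and the necessary information ratio share a denominator (the items the LLM provided), whereas comprehensiveness is normalized by the size of the gold standard.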

Statistical analysis

The accuracy and consistency of responses to objective questions were described using frequencies and percentages, and differences between the two LLMs were compared using the chi-square test or Fisher's exact test. In addition, the scores for accuracy, comprehensiveness, and consistency of responses were described as means ± standard deviations or medians (first quartile–third quartile). Differences between the two LLMs were compared using the t-test or Mann–Whitney U test. In addition, to evaluate the consistency between the two experts’ assessments of model performance, based on the normality test results for the paired differences, the paired t-test was applied to dimensions with normally distributed differences, while the Wilcoxon signed-rank test was used for those violating the normality assumption. SPSS 28.0 statistical software was used for all statistical analyses. All statistical tests were two-sided, and a p-value of less than 0.05 was considered statistically significant.
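As an illustration of the chi-square comparison, the sketch below computes the Pearson chi-square statistic for a 2 × 2 table using only the Python standard library. It is not the SPSS procedure used in the study. The function name is hypothetical, and the absence of a continuity correction is an assumption. The example uses the Chinese-response counts reported in Table 1 (GPT-3.5: 27 correct of 32; GPT-4: 26 correct of 32).

```python
import math


def chi_square_2x2(a, b, c, d):
    """Pearson chi-square (no continuity correction) for the table [[a, b], [c, d]].

    Returns the statistic and the p-value for one degree of freedom.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    if 0 in (row1, row2, col1, col2):
        raise ValueError("every row and column total must be positive")
    chi2 = n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2)
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Rows: GPT-3.5, GPT-4; columns: correct, incorrect (Chinese responses).
chi2, p_value = chi_square_2x2(27, 5, 26, 6)
print(round(chi2, 3), round(p_value, 3))  # approximately 0.110 and 0.740
```

Under these assumptions the uncorrected statistic agrees with the reported χ² = 0.110 and p = 0.740 for the Chinese responses.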

Results

To evaluate the consistency between the two experts’ assessments of LLMs’ performance, paired analyses were conducted across five dimensions: accuracy in Chinese, accuracy in English, comprehensiveness in Chinese, comprehensiveness in English, and the consistency between Chinese and English responses. The results revealed no statistically significant differences between the two experts’ scores across all five dimensions (see Supplementary Table 4), indicating a high level of inter-rater consistency. The specific performance of LLMs was detailed as follows.

Accuracy

The accuracy of two large language models

As shown in Table 1, when the questions were asked in English, GPT-3.5 and GPT-4 answered the objective questions with equal accuracy (90.63%, 29/32). When the questions were asked in Chinese, both LLMs also demonstrated high accuracy: GPT-3.5 with 84.38% (27/32) and GPT-4 with 81.25% (26/32). There was no statistical difference in the accuracy of the Chinese responses to objective questions between the two LLMs (χ² = 0.110, p = 0.740).

Table 1.

Comparison of the two LLMs’ responses to objective questions.

Variable	N	Accuracy/consistency N (%)	χ²/ Fisher	p-value
The accuracy of English responses	32		0.000	1.000
GPT-3.5		29(90.63)
GPT-4		29(90.63)
The accuracy of Chinese responses	32		0.110	0.740
GPT-3.5		27(84.38)
GPT-4		26(81.25)
The accuracy of Chinese and English responses of GPT-3.5	32		0.571^a	0.708
English		29(90.63)
Chinese		27(84.38)
The accuracy of the Chinese and English responses of GPT-4	32		1.164^a	0.747
English		29(90.63)
Chinese		26(81.25)
The consistency between Chinese and English responses of the two LLMs	32		0.217^a	1.000
GPT-3.5		30(93.75)
GPT-4		29(90.63)

Note: ^aFisher's exact test.

LLM: large language model.
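For the rows of Table 1 marked with Fisher's exact test, the two-sided p-value can be sketched with the standard library as follows. This is an illustrative implementation and not the SPSS routine used in the study. The function name is hypothetical, and the two-sided rule (summing all tables no more probable than the observed one) is an assumption. The example uses the GPT-3.5 counts from Table 1.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more probable than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Probability that the top-left cell equals x given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # A small relative tolerance guards against floating-point ties.
    return sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )


# GPT-3.5 accuracy: English 29 correct / 3 incorrect, Chinese 27 / 5.
print(round(fisher_exact_two_sided(29, 3, 27, 5), 3))  # 0.708
# Consistency: GPT-3.5 30 consistent / 2 inconsistent, GPT-4 29 / 3.
print(round(fisher_exact_two_sided(30, 2, 29, 3), 3))  # 1.0
```

Under these assumptions the sketch returns values that agree with the p-values of 0.708 and 1.000 reported for these two comparisons.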

The scores of the LLMs’ responses to subjective questions are shown in Figure 1. The accuracy scores of the two LLMs’ English responses to subjective questions were as follows: GPT-3.5: 5.00 (4.20, 5.00); GPT-4: 4.84 (4.17, 5.00). The accuracy of GPT-4's Chinese responses was the highest (GPT-3.5: 4.50 (4.41, 4.88); GPT-4: 5.00 (4.79, 5.00)). No significant differences were found in the accuracy of the English and Chinese responses to subjective questions between the two LLMs (z = 0.331, p = 0.886; z = 1.703, p = 0.114).

Figure 1.

The accuracy scores of two LLMs’ responses to the subjective questions. LLM: large language model.

The accuracy of Chinese and English responses for the same model

There were no significant differences in the accuracy of Chinese and English responses to both objective and subjective questions for each model (Table 1, Figure 2).

Figure 2.

The scores of two LLMs’ responses to the subjective questions in different languages. LLM: large language model.

Comprehensiveness

The comprehensiveness of two large language models

The comprehensiveness scores of the LLMs’ responses are shown in Figure 3. The comprehensiveness scores of the two LLMs’ responses to subjective questions were as follows: for English responses, GPT-3.5 scored 3.77 ± 1.01, and GPT-4 scored 3.33 ± 0.48; for Chinese responses, GPT-3.5 scored 2.97 ± 1.06, and GPT-4 scored 3.66 ± 0.51. There was no significant difference in the comprehensiveness of the English and Chinese responses between the two LLMs (t = 0.787, p = 0.461; t = 1.175, p = 0.285).

Figure 3.

The comprehensiveness and consistency scores of two LLMs’ responses to the subjective questions. LLM: large language model.

The comprehensiveness of Chinese and English responses for the same model

As shown in Figure 4, there were no significant differences in the comprehensiveness of Chinese and English responses to both objective and subjective questions for each model (GPT-3.5: t = 1.094, p = 0.316; GPT-4: t = 0.945, p = 0.381).

Figure 4.

The comprehensiveness scores of two LLMs’ responses to the subjective questions in different languages. LLM: large language model.

Consistency

The consistency scores of the LLMs’ responses to objective questions are shown in Table 1. The consistency ratios for objective questions were high for both models (GPT-4: 29/32, GPT-3.5: 30/32). No statistically significant differences were found in the consistency of responses to objective and subjective questions between the two LLMs (χ² = 0.217, p = 1.000; t = 0.608, p = 0.566). The consistency scores of the two LLMs’ responses to subjective questions were as follows: GPT-3.5 (3.11 ± 0.65), GPT-4 (3.57 ± 0.39) (Figure 3).

Necessary information ratio

The necessary information ratio of English responses was 84.50% ± 8.99% for GPT-3.5 and 78.75% ± 3.54% for GPT-4. For Chinese responses, the necessary information ratio was 95.00% (69.75%, 100%) for GPT-3.5 and 100% (75.25%, 100%) for GPT-4.

Discussion

This study aimed to investigate the performance of LLMs in providing knowledge related to respiratory aspiration, and to explore the potential of using LLMs as tools for training in respiratory aspiration knowledge. The results showed that the two LLMs scored high for the accuracy and comprehensiveness of responses regarding respiratory aspiration. There was no difference in the accuracy, comprehensiveness, or consistency of the two LLMs’ responses. Moreover, there was no difference in the accuracy or comprehensiveness of the LLMs’ responses when the same model was asked in English or Chinese. The necessary information ratios of the two LLMs were also high, suggesting that most of the information provided was indeed necessary. Based on these results, LLMs could meet users’ needs for knowledge related to respiratory aspiration, and it is possible to apply LLMs in respiratory aspiration knowledge training. Consistent with the findings of this study, previous studies have also reported that LLMs performed well in answering questions related to chronic respiratory diseases,²¹ oncology,²² and myopia,²³ further supporting their value in domain-specific clinical support.

LLMs have great potential for training nurses, patients, and caregivers on respiratory aspiration. The prerequisites for using LLMs in respiratory aspiration training are high accuracy and comprehensiveness in their responses to respiratory aspiration questions. This study confirmed that LLMs could provide users with correct and comprehensive knowledge about respiratory aspiration. Previous studies have demonstrated the potential of LLMs to deliver medical knowledge in various domains, such as rheumatism,²⁴ rhinosinusitis,²⁵ and epilepsy.²⁶ These findings collectively support the feasibility of applying LLMs in the healthcare field. In addition, LLMs support anthropomorphism, which is the basis for human interaction with non-human entities.²⁷ For users, talking to an LLM can feel like talking to a medical professional. Moreover, LLMs can provide intensive or large-scale training in respiratory aspiration knowledge for nurses, patients, and caregivers without the constraints of time, location, or cost.²⁸ Overall, using LLMs to train knowledge related to respiratory aspiration has clear benefits.

The application of LLMs is limited by several factors, such as language, culture, and ethics. In this study, there was no difference in the accuracy and comprehensiveness of the responses when LLMs were asked in two different languages, showing that users could communicate with LLMs in a language they were comfortable with. However, culture might influence LLMs’ responses. LLMs have shown human-like content biases.²⁹ Because of the different programming and constraints imposed by different creators, LLMs carry a kind of cultural perspective, in both the evaluative and the value sense.³⁰ Users could choose LLMs developed within their own cultural context to prevent cultural misunderstandings. Ethics also has a significant impact on the application of LLMs in the medical field.^31,32 It is unclear who would be responsible for patients’ injuries: the medical professionals, the LLMs, or their developers.³³ Additionally, LLMs collect and store patients’ personal information when interacting with them, which is a potential violation of patients’ privacy.^34,35 Even more concerning is the possibility that patients would choose to believe LLMs rather than medical professionals even when the LLMs provide wrong responses; this could affect the rapport between patients and medical professionals.³¹ For LLMs to follow fundamental human ethical standards, their creators must uphold intellectual virtue.

LLMs still have some shortcomings. Firstly, LLMs might provide users with incorrect or incomplete information,³⁰ which might lead to delays in treatment and to disease progression. This limitation stems from the nature of their training data. LLMs are trained on large text datasets, which are usually crawled from the Internet. While such datasets contain a wealth of accurate medical knowledge, they also include misinformation, outdated content, and non-authoritative sources.³⁶ Consequently, LLMs may produce responses that are factually incorrect, overly general, or insufficiently comprehensive. In healthcare, this issue is particularly critical, as over-reliance on LLMs might result in clinicians receiving outdated or incomplete knowledge.³⁷ Therefore, it is important to acknowledge that LLMs might not fully replace human-led instruction or training. Their value lies in serving as a powerful supplement, provided that clinicians remain fully aware of their limitations.

Secondly, it should also be noted that LLMs might provide some unnecessary information, which is difficult to avoid.³⁸ Although the LLMs in this study achieved a high necessary information ratio, instances of off-topic or tangential details were still observed. Unnecessary information could distract users, increase cognitive load, and obscure key points. Therefore, caution should be exercised when using LLMs, particularly in healthcare. Thirdly, the accuracy of LLMs’ responses is related to the questioning strategy.³⁹ In this study, the main reason for the LLMs’ wrong responses was incorrect understanding of the question, especially when the questions involved anatomical terms, physiological mechanisms, or numbers. Asking questions correctly is therefore a prerequisite for using LLMs. However, users have not necessarily mastered the strategies for questioning LLMs. Although one study has suggested guidelines for asking questions of LLMs, this is not enough.⁴⁰ Lastly, LLMs lack creativity.⁴¹ The answers of LLMs come from existing knowledge, and they do not create new information. If patients with respiratory aspiration experienced a previously unreported dangerous situation, they would be unable to get suggestions from LLMs. Crucially, these shortcomings are pointed out not from preconceived pessimism toward LLMs but with the aim of ultimately making full use of them.

The following ideas could be considered in order to better employ LLMs in the training of respiratory aspiration knowledge. Firstly, what role will LLMs play in the training of respiratory aspiration knowledge? This is a question that requires ongoing investigation. Currently, LLMs can be used as a means of online training. Secondly, although LLMs might make wise decisions,⁴² health professionals’ critical judgment should also be considered when making decisions.⁴³ At present, LLMs might be more suitable as supplementary tools. Thirdly, strategies should be developed to improve LLMs’ understanding of conversations, particularly when it comes to complex medical terms. Prompt engineering refers to the practice of designing, refining, and implementing prompts or instructions that guide the output of LLMs in various tasks, which helps communication between LLMs and users.⁴⁴ Taking respiratory aspiration as an example, the user can state the theme of “respiratory aspiration” at the beginning of the conversation so that the focus is clear to the LLM, and then progressively refine the inquiry through step-by-step questioning. If specific background details need to be considered, they can be added to the prompt. Finally, the user may specify the desired response style and format to further tailor the output. In the future, the application of prompt engineering in LLMs should be enhanced. Lastly, to improve the accuracy of LLMs’ responses, LLMs could be combined with knowledge graphs (KGs).⁴⁵ KGs are structured knowledge models that explicitly store rich factual knowledge.⁴⁶ Combining respiratory aspiration-focused KGs with LLMs could give users more accurate and comprehensive responses. Recommendations for clinicians, technologists, and healthcare organizations on the use of LLMs in medicine have already been released and can be used as a reference.⁴⁰
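The prompting advice above (state the theme first, add background where relevant, ask the question, then specify the response format) can be sketched as a small helper. This is a hypothetical illustration only: the function name, the field labels, and the example scenario are assumptions and were not used or tested in this study.

```python
def build_prompt(topic, question, background=None, response_format=None):
    """Assemble a prompt: theme first, then background, question, and format."""
    parts = [f"Topic: {topic}."]
    if background:
        parts.append(f"Background: {background}")
    parts.append(f"Question: {question}")
    if response_format:
        parts.append(f"Response format: {response_format}")
    return "\n".join(parts)


# Hypothetical scenario for illustration; not a case from the study.
prompt = build_prompt(
    topic="respiratory aspiration",
    background="An older adult with dysphagia is fed at home by a family caregiver.",
    question="Which feeding position reduces the risk of aspiration?",
    response_format="A short numbered list in plain language.",
)
print(prompt)
```

Step-by-step refinement then consists of follow-up questions within the same conversation, each narrowing the previous answer.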

Limitations

There were still some limitations in this study. Firstly, only the topic of respiratory aspiration was tested. Although this was an informed choice based on the characteristics of this particular health topic, it would be necessary to expand these findings to other domains. Further research is needed to explore the effects of using LLMs in other contexts of healthcare. Secondly, since LLMs were not yet widely available, we did not apply LLMs to train nurses, patients, and caregivers about respiratory aspiration. We will complete the follow-up study in the future. Thirdly, the number of questions used to test LLMs was limited. Further studies, which take these limitations into account, will need to be undertaken.

Conclusion

LLMs performed well in providing knowledge related to respiratory aspiration, highlighting their potential as valuable supplementary tools for training in respiratory aspiration knowledge. Users can select the most appropriate LLM based on their individual needs, language preferences, and cultural context. While LLMs have great potential as clinical training tools, they cannot fully replace traditional training methods. When their limitations are clearly understood, LLMs can be powerful supplementary tools. Future work may investigate the effects of LLMs in other medical specialties to advance digital health. To enhance their utility, LLMs should continue to improve the accuracy and comprehensiveness of their outputs while adhering to ethical standards.

Supplemental Material

Supplemental material, sj-docx-1-dhj-10.1177_20552076251349616 for To take a different approach: Can large language models provide knowledge related to respiratory aspiration? by Yirou Niu, Shuojin Fu, Zehui Xuan, Ruifu Kang, Zhifang Ren, Shuai Jin, Yanling Wang and Qian Xiao in DIGITAL HEALTH

Footnotes

Ethical considerations

This study was reviewed and approved by the medical ethics committee of Capital Medical University (approval # Z2022SY027).

Author contributions

YN contributed to conceptualization, methodology, supervision, and writing—original draft. SF contributed to conceptualization, investigation, and writing—original draft. ZX contributed to investigation. RK contributed to conceptualization. ZR contributed to visualization. SJ contributed to methodology. YW contributed to writing—review and editing. QX contributed to conceptualization, methodology, supervision, and writing—review and editing.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the National Natural Science Foundation of China, (grant number 72174130).

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data availability statement

The datasets generated and/or analyzed during the current study are available from the corresponding author on reasonable request.

Supplemental material

Supplemental material for this article is available online.

References

1. Ueha, Magdayao, Koyama, et al. Aspiration prevention surgeries: a review. Respir Res 2023; 24: 43. 2023/02/08.

2. Baettig, Filipovic, Giroud, et al. SAFE anaesthesia for patients at risk of pulmonary aspiration: a nationwide survey. Eur J Anaesthesiol Intensive Care 2025; 4: e0070. 20250310.

3. Jaillette, Martin-Loeches, Artigas, et al. Optimal care and design of the tracheal cuff in the critically ill patient. Ann Intensive Care 2014; 4: 7. 20140227.

4. Masuda, Ueha, Sato, et al. Risk factors for aspiration pneumonia after receiving liquid-thickening recommendations. Otolaryngol Head Neck Surg 2022; 167: 125–132. 2021/09/29.

5. Yoshimatsu, Melgaard, Westergren, et al. The diagnosis of aspiration pneumonia in older persons: a systematic review. Eur Geriatr Med 2022; 13: 1071–1080. 2022/08/26.

6. Lindenauer, Strait, Grady, et al. Variation in the diagnosis of aspiration pneumonia and association with hospital pneumonia outcomes. Ann Am Thorac Soc 2018; 15: 562–569.

7. Yıldızeli ŞO. Determinants of hospital stay, mortality, and readmission in aspiration pneumonia patients. Eur J Geriatric Gerontol 2024; 6: 171–177.

8. Teramoto. The current definition, epidemiology, animal models and a novel therapeutic strategy for aspiration pneumonia. Respir Investig 2022; 60: 45–55. 2021/11/17.

9. Barnard, Jones, Cruice. Managing ongoing swallow safety through information-sharing: an ethnography of speech and language therapists and nurses at work on stroke units. Int J Lang Commun Disord 2022; 57: 852–864. 2022/04/10.

10. Hittle, Hils, Fendinger, et al. A scoping review of sleep education and training for nurses. Int J Nurs Stud 2023; 142: 104468. 2023/04/21.

11. Huang, Xie, et al. Effects of home-based chronic wound care training for patients and caregivers: a systematic review. Int Wound J 2023; 20: 3802–3820. 2023/06/06.

12. Sperber, Boucher, Hughes, et al. Mandated caregiver training in the veterans health administration: caregiver inquiry informs national dissemination. Gerontologist 2023; 63: 534–544. 2022/11/04.

13. Jauk, Kramer, Veeranki SPK, et al. Evaluation of a machine learning-based dysphagia prediction tool in clinical routine: a prospective observational cohort study. Dysphagia 2023; 38: 1238–1246. 2023/01/11.

14. Will ChatGPT transform healthcare? Nat Med 2023; 29: 505–506. 2023/03/16.

15. Sng GGR, Tung JYM, Lim DYZ, et al. Potential and pitfalls of ChatGPT and natural-language artificial intelligence models for diabetes education. Diabetes Care 2023; 46: e103–e105. 2023/03/16.

16. Yeo, Samaan, et al. Assessing the performance of ChatGPT in answering questions regarding cirrhosis and hepatocellular carcinoma. Clin Mol Hepatol 2023; 29: 721–732. 2023/03/23.

17. Rahsepar, Tavakoli, Kim GHJ, et al. How AI responds to common lung cancer questions: ChatGPT vs Google Bard. Radiology 2023; 307: e230922. 2023/06/13.

18. García-Rudolph, Sanchez-Pinsach, Opisso, et al. Exploring new educational approaches in neuropathic pain: assessing accuracy and consistency of AI responses from GPT-3.5 and GPT-4. Pain Med 2024; 1: 48–50. 2024/10/10.

19.

Suiter

Gosa

. Assessing and treating dysphagia: a lifespan perspective. New York: Georg Thieme Verlag, 2019.

20.

Zulin

. Assessment and treatment of dysphagia. Beijing, China: People's Medical Publishing House, 2017.

21.

Graña-Castro

Izquierdo

Piñas-Mesa

, et al. Assessing the impact of new technologies on managing chronic respiratory diseases. J Clin Med 2024; 13: 20241116.

22.

Ganjavi

Melamed

Biedermann

, et al. Generative artificial intelligence in oncology. Curr Opin Urol 2025; 35: 205–213. 20250228.

23.

Jiang

Lim

, et al. Comparative performance analysis of global and Chinese-domain large language models for myopia. Eye (Lond) 2025 20250413.

24.

Mannstadt

Mehta

. Large language models and the future of rheumatology: assessing impact and emerging opportunities. Curr Opin Rheumatol 2024; 36: 46–51. 2023/09/20.

25. Workman, Rathi, Lerner, et al. Utility of a LangChain and OpenAI GPT-powered chatbot based on the international consensus statement on allergy and rhinology: Rhinosinusitis. Int Forum Allergy Rhinol 2023; 14: 1101–1109. 2023/12/18.

26. Kim, Shin, Kim, et al. Assessing the performance of ChatGPT's responses to questions related to epilepsy: a cross-sectional study on natural language processing and medical information retrieval. Seizure 2023; 114: 1–8. 2023/11/27.

27. Shanahan, McDonell, Reynolds. Role play with large language models. Nature 2023; 623: 493–498. 2023/11/08.

28. Lee, Park, et al. Deep learning-based natural language processing for detecting medical symptoms and histories in emergency patient triage. Am J Emerg Med 2023; 77: 29–38. 2023/12/15.

29. Acerbi, Stubbersfield. Large language models show human-like content biases in transmission chain experiments. Proc Natl Acad Sci U S A 2023; 120: e2313790120. 2023/10/26.

30. Farina, Lavazza. ChatGPT in society: emerging issues. Front Artif Intell 2023; 6: 1130913. 2023/07/03.

31. Wang, Liu, Yang, et al. Ethical considerations of using ChatGPT in health care. J Med Internet Res 2023; 25: e48009. 2023/08/11.

32. Porsdam Mann, Earp, Møller, et al. AUTOGEN: a personalized large language model for academic enhancement-ethics and proof of principle. Am J Bioeth 2023; 23: 28–41. 2023/07/24.

33. Nashwan, Abujaber. Harnessing large language models in nursing care planning: opportunities, challenges, and ethical considerations. Cureus 2023; 15: e40542. 2023/07/19.

34. Naik, Hameed BMZ, Shetty, et al. Legal and ethical consideration in artificial intelligence in healthcare: who takes responsibility? Front Surg 2022; 9: 862322. 2022/04/02.

35. Cohen. What should ChatGPT mean for bioethics? Am J Bioeth 2023; 23: 8–16. 2023/07/13.

36. Mittelstadt, Wachter, Russell. To protect science, we must use LLMs as zero-shot translators. Nat Hum Behav 2023; 7: 1830–1832. 2023/11/21.

37. Ray. Can LLMs improve existing scenario of healthcare? J Hepatol 2023; 80: e28–e29. 2023/08/19.

38. Truhn, Loeffler, Müller-Franzes, et al. Extracting structured information from unstructured histopathology reports using generative pre-trained transformer 4 (GPT-4). J Pathol 2023; 262: 310–319. 2023/12/15.

39. Liu, Yuan, et al. Pre-train, prompt, and predict: a systematic survey of prompting methods in natural language processing. ACM Comput Surv 2023; 55: 1–35. 2023/03/01.

40. Indran, Paramanathan, Gupta, et al. Twelve tips to leverage AI for efficient and effective medical question generation: a guide for educators using Chat GPT. Med Teach 2023; 46: 1–6. 2023/12/26.

41. Lee. A mathematical investigation of hallucination and creativity in GPT models. Mathematics 2023; 11: 2320. 2023/06/10.

42. Chen, Liu, Shan, et al. The emergence of economic rationality of GPT. Proc Natl Acad Sci U S A 2023; 120: e2316205120. 2023/12/12.

43. Humphries, Wright, Hoffman, et al. What's the best chatbot for me? Researchers put LLMs through their paces. Nature 2023 2023/09/28.

44. Meskó. Prompt engineering as an important emerging skill for medical professionals: tutorial. J Med Internet Res 2023; 25: e50638. 2023/10/04.

45. Razzaq, Maqbool, Ilyas, et al. Evorecipes: a generative approach for evolving context-aware recipes. IEEE Access 2023; 11: 74148–74164.

46. Tomaszuk, Szeremeta, Korniłowicz. MMLKG: knowledge graph for mathematical definitions, statements and proofs. Sci Data 2023; 10: 791. 2023/11/11.
