Sage Journals: Discover world-class research

Abstract

Background

ChatGPT is rapidly becoming a source of medical knowledge for patients. This study aims to assess the completeness and accuracy of ChatGPT's answers to the most frequently asked patients’ questions about shoulder pathology.

Methods

ChatGPT (version 3.5) was queried to produce the five most common shoulder pathologies: biceps tendonitis, rotator cuff tears, shoulder arthritis, shoulder dislocation and adhesive capsulitis. Subsequently, it generated the five most common patient questions regarding these pathologies and was queried to respond. Responses were evaluated by three shoulder and elbow fellowship-trained orthopedic surgeons with a mean of 9 years of independent practice, on Likert scales for accuracy (1–6) and completeness (rated 1–3).

Results

For all questions, responses were deemed acceptable, rated at least “nearly all correct,” indicated by a score of 5 or greater for accuracy, and “adequately complete,” indicated by a minimum of 2 for completeness. The mean scores for accuracy and completeness, respectively, were 5.5 and 2.6 for rotator cuff tears, 5.8 and 2.7 for shoulder arthritis, 5.5 and 2.3 for shoulder dislocations, 5.1 and 2.4 for adhesive capsulitis, 5.8 and 2.9 for biceps tendonitis.

Conclusion

ChatGPT provides both accurate and complete responses to the most common patients’ questions about shoulder pathology. These findings suggest that Large Language Models might play a role as a patient resource; however, patients should always verify online information with their physician.

Level of Evidence

Level V Expert Opinion.

Keywords

Artificial intelligence machine learning ChatGPT shoulder large language model

Introduction

The information age has brought many resources for patients to learn about their illnesses. Shoulder pathology is no exception. Many patients turn to the Internet prior to consulting their physician; however, evidence suggests that, in general, online information is often suboptimal.^1,2 Somerson et al.³ evaluated this for the shoulder, concluding that resources on the Internet were of mixed accuracy, quality and completeness. Further, they suggested that surgeons aim to direct patients towards higher quality resources.

Patients frequently query search engines regarding their shoulder pathology.⁴ However, newer technologies such as Large language models (LLMs), including ChatGPT (Open AI), are rapidly becoming sources of health information. LLMs have demonstrated some utility in patient explanation in many orthopedic subspecialties and other fields.^5–9 With the ability to respond to patient queries in a human-like fashion, with vast quantities of information at its disposal, ChatGPT will probably play a large role in patient explanation and public health.

ChatGPT has shown promising results in medical education. Early iterations passed the United States Medical Licensing Examination battery of examinations.¹⁰ Additional testing demonstrated that ChatGPT (version 3.5) passed the Orthopaedic In-Training Examination (OITE) at the level of a postgraduate year 1 resident. The latest version, GPT-4, performed at the level of a passing fifth year resident, highlighting a rapidly improved level of competence between versions.¹¹ Still, in its current form, Jain et al.¹² concluded that ChatGPT's performance relied largely on outputting rote facts, and that its utility would be limited in orthopedic surgery. Further work showed that chatbot performance declined with question complexity, indicating deficiency in applying knowledge versus simply regurgitating facts.¹³ In the field of shoulder and elbow surgery specifically, Fiedler et al.¹⁴ noted that average performance by humans on the American Shoulder and Elbow Surgeons Maintenance of Certification Exam was superior to both ChatGPT 3.5 and 4, demonstrating that there remains a substantial knowledge gap between fellowship trained orthopedic surgeons and the chatbot. Additionally, the potential remains for the chatbot to produce misinformation or “hallucinate,” when the chatbot will fabricate information, presenting false information. This might represent a potential danger to patients, create pre-emptive false beliefs prior to the actual physician-patient encounter, and might be this resource's major limitation.^15–17

Despite well-documented abilities on standardized tests, there is a dearth of literature assessing ChatGPT's ability to provide patients with specific information relating to shoulder pathology. This study aims to assess ChatGPT's ability to answer common patients’ queries regarding common shoulder pathologies and to evaluate the response for accuracy and completeness. We hypothesize that ChatGPT will produce both accurate and complete responses to common patient questions regarding shoulder pathology. These data might educate surgeons about the reliability of LLMs as patients’ information resources.

Methods

ChatGPT (version 3.5) was queried in October of 2023 to identify the five most common shoulder pathologies. ChatGPT generated the following list of shoulder pathologies: biceps tendinitis, rotator cuff tears, shoulder arthritis, shoulder dislocation and adhesive capsulitis. It was then asked to generate the five most common patient questions regarding each of the above conditions. This produced 25 questions across all pathologies.

These 25 questions were then individually posed to ChatGPT and answers evaluated on Likert scales for both accuracy (1–6) and completeness (rated 1–3) as previously used for other assessments of the chatbot.^18,19 For accuracy, a score of 1 indicated completely incorrect, 2 indicated more incorrect than correct, 3 indicated equal parts correct and incorrect, 4 indicated more correct than not, 5 indicated nearly all correct, and 6 indicated a completely correct answer. For completeness, a score of 1 indicated an incomplete response, 2 indicated an adequate response, with the minimum information required to be considered complete, 3 indicated a comprehensive response beyond what was expected. Responses rated at least “nearly all correct,” indicated by a score of 5 or greater for accuracy, and “adequately complete,” indicated by minimum of 2 for completeness, were considered to be sufficiently accurate and complete for patient education purposes.

The panel of three independent raters, consisting of fellowship-trained shoulder and elbow orthopedic surgeons, with a mean of nine years of independent practice, reviewed and scored all responses. Mean scores, for both accuracy and completeness, were calculated for each of the 25 questions. Additionally, mean scores for both metrics were calculated for all questions regarding each pathology. Inter-rater reliability was assessed for both metrics via intraclass correlation coefficient.

Results

For all questions across all five conditions, ChatGPT's answers were deemed acceptable, producing answers rated at least “nearly all correct,” indicated by a score of 5 or greater for accuracy and “adequately complete,” indicated by a minimum of 2 for completeness. The average intraclass coefficient for accuracy scoring was 0.477 (95% CI: 0.050–0.742), suggesting subpar agreement among raters, while completeness scoring was 0.769 (95% CI: 0.470–0.899), demonstrating good reliability. As our scoring system was inherently subjective, some disagreement was anticipated.

The mean scores for accuracy and completeness, respectively, were 5.5 and 2.6 for rotator cuff tears, 5.8 and 2.7 for shoulder arthritis, 5.5 and 2.3 for shoulder dislocations, 5.1 and 2.4 for adhesive capsulitis, 5.8 and 2.9 for biceps tendonitis. Full results for accuracy and completeness can be seen in Figures 1 and 2, respectively. ChatGPT generated questions and answers can be seen in full in Supplemental Material A to E.

Figure 1.

Plot representing the mean rating for accuracy of ChatGPT's responses by question. Mean accuracy rating per pathology is shown by the trendline.

Figure 2.

Plot representing the mean rating for completeness of ChatGPT's responses by question. Mean accuracy rating per pathology is shown by the trendline.

Discussion

This study assessed ChatGPT's ability to provide complete and accurate answers to patients’ questions regarding common shoulder pathology. For all five pathologies, the panel of three shoulder and elbow fellowship-trained orthopedic surgeons rated ChatGPT's answers at least “adequately complete” and “nearly all correct,” indicating sufficient accuracy and completeness, demonstrating acceptable performance in both domains. While peer-reviewed published literature and medical textbooks might provide higher quality information, these sources are often inaccessible to laypersons. The Internet and search engines provide available information but are challenging to parse through and assess for quality and reliability. ChatGPT might provide a solution to this dilemma, as its natural language processing abilities allow for human-like conversation, and vast reserve of training data enables it to synthesize evidence-based answers and deliver them in an accessible manner. While patients should still consult with physicians for questions about their health, our results show that LLMs might have a role in patient education.

These findings are corroborated by related literature. Multiple studies have reported that ChatGPT was able to provide patient information in shoulder surgery subspecialty specific areas, but noted that the reading level required was higher than American Medical Association and National Institute of Health recommended sixth grade reading level for patient materials.^20–22 Furthermore, these studies noted a conspicuous absence of reliable citations, calling into question the reliability of these answers. Our own study reflected a similar phenomenon. While our queries were quite basic, the responses were generally lengthy and, while frequently technically correct, were perhaps overly detailed for simple patient questions. For example, ChatGPT proposed X-ray, ultrasound, magnetic resonance imaging, arthrogram, and computed tomography imaging for diagnosis of a rotator cuff tear. While all of these imaging modalities may be utilized at some stage of workup, and are valid options, it is unlikely a patient would require all of these for diagnosis.

These mixed findings are shared in other fields of orthopedic surgery. In the setting of elbow ulnar collateral ligament reconstruction, Johns et al.²³ reported that ChatGPT was capable of producing satisfactory answers to common patient questions; however, they noted a large proportion of unsatisfactory responses, particularly as question complexity increased. In hand surgery, ChatGPT was able to provide medical information, at a layperson's level, on carpal tunnel syndrome management. However, it was noted the chatbot was prone to producing fabricated citations.⁸ In a study assessing responses to a variety of hand pathologies, Jagiella-Lodise et al.²⁴ noted that the chatbot's responses were largely correct, but found them frequently incomplete. In the context of preoperative education prior to total hip arthroplasty, Mika et al.⁷ found that ChatGPT was able to provide largely satisfactory, largely evidence-based answers to most patients’ questions and was able to clarify answers when needed. They also noted that, when asked questions on controversial topics, the chatbot was seemingly unbiased. Similarly, in total knee arthroplasty, Magruder et al.⁶ found that ChatGPT was able to answer questions regarding clinical practice guidelines, albeit with lower scores in the domain of “consistency,” reflecting poor reliability. Vaira et al.²⁵ drew similar conclusions in the context of head and neck surgery, reporting the chatbot's producing largely accurate information, with some incomplete answers, particularly as questions moved from closed-ended to more open-ended clinical scenarios.

Despite these promising reports in the literature, as well as our own, there are still a variety of issues with ChatGPT in healthcare. Goodman et al.¹⁸ noted that, while ChatGPT was able to provide highly accurate and complete answers to questions pertaining to a variety of medical specialties, there were notable outliers, and the chatbot would occasionally produce highly erroneous answers. They also noted that it was particularly concerning that these incorrect answers were delivered in the same conversational confident manner in which correct answers were given, suggesting a high potential for deception and misinformation. In our own study, the chatbot posited that rotator cuff tears may heal on their own. While not all tears require operative intervention, and many patients achieve functional and symptomatic improvement with conservative treatment, rotator cuff tears generally do not heal spontaneously. Many studies have reported on the “hallucination” phenomenon, which appears to be particularly common in citations for scientific articles.^26–28 False statements, presented with fabricated, but very real-appearing references, present a problem for physicians, patients, and researchers. These “hallucinations” might result in misinformation, threaten patient safety by falsely conveying pre-conceived beliefs, and could cause problems in scientific work. Furthermore, it emphasizes the notion that patients should verify information from ChatGPT, and from any other sources, with their physician.

In a related study, Kuroiwa et al.²⁹ noted a number of other concerns, including subpar reliability when assessing the chatbot's ability to aid patients in self-diagnosis. More disconcertingly, they reported that the chatbot did not sufficiently prompt the patient to seek medical attention, a potential risk to patient safety. The potential for this type of misuse, combined with the relatively early phase of the technology, is a challenge, and might, for now, preclude physicians’ recommendation of ChatGPT to patients. However, this problem is not unique to chatbots; the dangers of misdiagnosis with online tools predates ChatGPT.^30,31 Simply stated, the adage to “consult with your physician” cannot be over-emphasized.

Limitations

Although this study provided valuable and novel insights into the utility of an emerging technology, LLMs such as ChatGPT are not without limitation. The Likert-style questions, while commonly used in research, are ultimately a subjective measure of individual opinion. Our own analysis was somewhat limited by suboptimal interrater reliability, which is likely a reflection of this inherent subjectivity. Further work might attempt to create a validated objective scoring assessment of ChatGPT's responses to patient questions. Additionally, our analysis was limited to common questions regarding common pathologies. It is unclear how the chatbot would fare with less common pathologies and with more complex questions, as there is likely to be less training data on these topics.³² Our own study was limited to fairly basic questions, and no further lines of questioning were initiated to allow the chatbot to expound upon its response. Lastly, this study evaluated a specific iteration of the chatbot, ChatGPT version 3.5. More advanced versions, for example, the paid-access GPT-4, exist. These might produce different results. Furthermore, there are a variety of other online resources which patients might use for information, including other LLMs. Research comparing different LLM tools is warranted to avoid conflating a single company's technology with the larger field of LLMs across the board.

Conclusion

ChatGPT was found to provide largely accurate and complete information to patients’ frequent questions about shoulder pathology. Mean response scores for all questions and pathologies were rated, at a minimum, nearly all correct and adequately complete, meeting our threshold for acceptable performance. However, there was subpar concordance among raters for accuracy ratings, but good agreement for completeness ratings. Initial results and the ease of interaction with ChatGPT demonstrate some possible utility in patient education. Still, despite their reliability, there remain many concerns regarding its use as a ubiquitous patient education tool. Surgeons should counsel patients on the importance of verifying, with their physician, information from ChatGPT or other online sources.

Supplemental Material

sj-docx-1-sel-10.1177_17585732241283971 - Supplemental material for ChatGPT provides acceptable responses to patient questions regarding common shoulder pathology

Supplemental material, sj-docx-1-sel-10.1177_17585732241283971 for ChatGPT provides acceptable responses to patient questions regarding common shoulder pathology by Umar Ghilzai, Benjamin Fiedler, Abdullah Ghali, Aaron Singh, Benjamin Cass, Allan Young and Adil Shahzad Ahmed in Shoulder & Elbow

Footnotes

Declaration of conflicting interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) received no financial support for the research, authorship, and/or publication of this article.

ORCID iD

Aaron Singh

Supplemental material

Supplemental material for this article is available online.

References

Daraz

Morrow

Ponce

, et al. Can patients trust online health information? A meta-narrative systematic review addressing the quality of health information on the internet. J Gen Intern Med 2019; 34: 1884–1891.

Hesse

Nelson

Kreps

, et al. Trust and sources of health information: the impact of the internet and its implications for health care providers: findings from the first Health Information National Trends Survey. Arch Intern Med 2005; 165: 2618–2624.

Somerson

Bois

Jeng

, et al.

Quality of internet-based decision aids for shoulder arthritis: what are patients reading?

BMC Musculoskelet Disord 2018; 19: 12.

Khalil

Castle

Akioyamen

, et al. What are patients asking and Reading online? An analysis of online patient searches for rotator cuff repair. J Shoulder Elbow Surg 2023; 32: 2245–2255.

Dubin

Bains

Chen

, et al. Using a Google web search analysis to assess the utility of ChatGPT in total joint arthroplasty. J Arthroplasty 2023; 38: 1195–1202.

Magruder

Rodriguez

Wong

CHJ

, et al. Assessing ability for ChatGPT to answer total knee arthroplasty-related questions. J Arthroplasty 2024; 39: 2022–2027.

Mika

Martin

Engstrom

, et al. Assessing ChatGPT responses to common patient questions regarding total hip arthroplasty. J Bone Joint Surg Am 2023; 105: 1519–1526.

Seth

Xie

Rodwell

, et al. Exploring the role of a large language model on carpal tunnel syndrome management: an observation study of ChatGPT. J Hand Surg Am 2023; 48: 1025–1033.

Zandi

Fahey

Drakopoulos

, et al. Exploring diagnostic precision and triage proficiency: a comparative study of GPT-4 and Bard in addressing common ophthalmic complaints. Bioengineering 2024; 11: 20.

10.

Kung

Cheatham

Medenilla

, et al. Performance of ChatGPT on USMLE: potential for AI-assisted medical education using large language models. PLOS Digit Health 2023; 2: e0000198.

11.

Kung

Marshall

Gauthier

, et al. Evaluating ChatGPT performance on the orthopaedic in-training examination. JBJS Open Access 2023; 8: e23.00056.

12.

Jain

Gottlich

Fisher

, et al. Assessing ChatGPT’s orthopedic in-service training exam performance and applicability in the field. J Orthop Surg Res 2024; 19: 27.

13.

Lum

. Can artificial intelligence pass the American Board of Orthopaedic Surgery Examination? Orthopaedic Residents versus ChatGPT. Clin Orthop Relat Res 2023; 481: 1623–1630.

14.

Fiedler

Azua

Phillips

, et al. ChatGPT performance on the American Shoulder and Elbow Surgeons Maintenance of Certification Exam. J Shoulder Elbow Surg 2024; 33: 1888–1893.

15.

Alkaissi

McFarlane

. Artificial hallucinations in ChatGPT: implications in scientific writing. Cureus 2023; 15: e35179.

16.

De Angelis

Baglivo

Arzilli

, et al. ChatGPT and the rise of large language models: the new AI-driven infodemic threat in public health. Front Public Health 2023; 11: 1166120.

17.

Hatem

Simmons

Thornton

. A call to address AI “Hallucinations” and how healthcare professionals can mitigate their risks. Cureus 2023; 15: e44720.

18.

Goodman

Patrinely

Stone

, et al. Accuracy and reliability of chatbot responses to physician questions. JAMA Netw Open 2023; 6: e2336483.

19.

Johnson

Goodman

Patrinely

, et al. Assessing the accuracy and reliability of AI-generated medical responses: an evaluation of the chat-GPT model. Res Sq. Published Online February 2023; 28: rs.3.rs-2566942.2023.

20.

Warren

Hurley

Park

, et al. Evaluation of information from artificial intelligence on rotator cuff repair surgery. JSES Int 2024; 8: 53–57.

21.

Hurley

Crook

Lorentz

, et al. Evaluation high-quality of information from ChatGPT (artificial intelligence-Large language model) artificial intelligence on shoulder stabilization surgery. Arthroscopy 2024; 40: 726–731.e6.

22.

Yeramosu

Johns

Onor

, et al. ChatGPT is capable of providing satisfactory responses to frequently asked questions regarding total shoulder arthroplasty. Shoulder Elbow. Published online April 17, 2024. DOI: https://doi.org/10.1177/17585732241246560.

23.

Johns

Kellish

Farronato

, et al. ChatGPT can offer satisfactory responses to common patient questions regarding elbow ulnar collateral ligament reconstruction. Arthrosc Sports Med Rehabil 2024; 6: 100893.

24.

Jagiella-Lodise

Suh

Zelenski

. Can patients rely on ChatGPT to answer hand pathology-related medical questions? Hand (NY) Published online April 23, 2024: 15589447241247246. DOI: https://doi.org/10.1177/15589447241247246.

25.

Vaira

Lechien

Abbate

, et al. Accuracy of ChatGPT-generated information on head and neck and oromaxillofacial surgery: a multicenter collaborative analysis. Otolaryngol Head Neck Surg 2024; 170: 1492–1503.

26.

Emsley

. ChatGPT: these are not hallucinations – they’re fabrications and falsifications. Schizophrenia 2023; 9: 52.

27.

Goddard

. Hallucinations in ChatGPT: a cautionary tale for biomedical researchers. Am J Med 2023; 136: 1059–1060.

28.

McGowan

Gui

Dobbs

, et al. ChatGPT and Bard exhibit spontaneous citation fabrication during psychiatry literature search. Psychiatry Res 2023; 326: 115334.

29.

Kuroiwa

Sarcon

Ibara

, et al. The potential of ChatGPT as a self-diagnostic tool in common orthopedic diseases: exploratory study. J Med Internet Res 2023; 25: e47621.

30.

Fraser

Crossland

Bacher

, et al. Comparison of diagnostic and triage accuracy of Ada health and WebMD symptom checkers, ChatGPT, and physicians for patients in an emergency department: clinical data analysis study. JMIR Mhealth Uhealth 2023; 11: e49995.

31.

Patel

Swanton

Gross

. Online symptom checkers are poor tools for diagnosing men’s health conditions. Urology 2022; 170: 124–131.

32.

Shen

Pratap

, et al. How does ChatGPT use source information compared with google? A text network analysis of online health information. Clin Orthop Relat Res 2024; 482: 578–588.

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.

0.00 MB

0.03 MB