Abstract
Research Type:
Level 4 – Case series
Introduction/Purpose:
Given the latest advancements in artificial intelligence, platforms such as Chat Generative Pre-trained Transformer (ChatGPT) have gained popularity among members of the public seeking health advice. As of May 2024, ChatGPT had garnered over 180.5 million monthly users. There is a paucity of literature examining the accuracy of ChatGPT in foot and ankle surgery, particularly for fractures at the metadiaphysis of the 5th metatarsal (Jones fractures). Evaluating patient education tools for these fractures is worthwhile, as they occur at an incidence of approximately 6.7 per 10,000 people. The present study aims to investigate the quality of responses generated by ChatGPT in answering common questions about the diagnosis and treatment of Jones fractures.
Methods:
Nine frequently asked questions regarding Jones fractures were posed to ChatGPT 3.5 and ChatGPT 4o. Because the platform is continuously evolving, data were collected in a single sitting on June 28, 2024. Two senior authors scored each response as "excellent, not requiring clarification," "satisfactory, requiring minimal clarification," "satisfactory, requiring moderate clarification," or "unsatisfactory, requiring substantial clarification," corresponding to scores of 1, 2, 3, and 4, respectively. Flesch Reading Ease score and Flesch-Kincaid Grade Level (corresponding to US grade level) were used to assess the readability of the AI-generated answers.
Results:
The mean rating for ChatGPT 3.5 responses was 2.89 ± 0.78, compared with 1.78 ± 0.83 for ChatGPT 4o (p=0.0133), with lower scores indicating better-quality responses. In cases of grader disagreement, consensus was reached. Mean Flesch Reading Ease was 43.7 for ChatGPT 3.5 and 49.06 for ChatGPT 4o (p=0.0918); both scores correspond to a college reading level. Mean Flesch-Kincaid Grade Level was 11.19 for ChatGPT 3.5 and 9.27 for ChatGPT 4o (p=0.0214).
Conclusion:
ChatGPT 4o scored better in accuracy than ChatGPT 3.5. Flesch Reading Ease and Flesch-Kincaid Grade Level both improved from ChatGPT 3.5 to 4o, but readability still falls short given that the average American reads at an 8th-grade level. To our knowledge, this is the first study within foot and ankle surgery to compare different versions of ChatGPT; previous studies either did not specify the version or used an older chatbot version such as 3.0. Users should always consult a surgeon and recognize ChatGPT's shortcomings, including a lack of citations and inherent biases.
