Abstract
Background:
Carpal tunnel syndrome (CTS) is a prevalent neuropathy in hand surgery that significantly affects patients' quality of life. Patients frequently research their condition online before seeking medical care. Large language models (LLMs) such as ChatGPT are increasingly used for health information, yet concerns remain about the accuracy, readability, and complexity of their responses. Previous studies have assessed older ChatGPT models but have not comprehensively compared newer versions. The purpose of this study was to compare answers to common CTS-related patient questions generated by ChatGPT-4, ChatGPT-4o, and ChatGPT-o1.
Methods:
Six frequently asked questions about CTS were posed to each LLM. Responses were independently graded by 2 board-certified hand surgeons using evidence-based guidelines. Lexical diversity was assessed with the Measure of Textual Lexical Diversity, and readability was evaluated with the Flesch-Kincaid Grade Level, Flesch Reading Ease Score, and Simple Measure of Gobbledygook. Analysis of variance or Kruskal-Wallis tests with post hoc comparisons were used to compare models and questions.
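To illustrate how readability scoring and a between-model comparison of this kind can be carried out, the sketch below uses the open-source textstat and scipy packages. This is a minimal illustration under assumed inputs, not the study's actual pipeline; the response texts are hypothetical placeholders.

```python
# Minimal sketch of readability scoring and model comparison, assuming the
# open-source `textstat` and `scipy` packages. Response texts are
# hypothetical; the study's actual pipeline is not specified in the abstract.
import textstat
from scipy.stats import kruskal

# One list of answer strings per model (illustrative only).
responses = {
    "ChatGPT-4": [
        "Carpal tunnel syndrome occurs when the median nerve is compressed.",
        "Symptoms often include numbness and tingling in the fingers.",
    ],
    "ChatGPT-4o": [
        "The carpal tunnel is a narrow passage in your wrist.",
        "Resting the hand and wearing a splint can ease mild symptoms.",
    ],
    "ChatGPT-o1": [
        "Compression of the median nerve within the carpal tunnel produces paresthesias.",
        "Electrodiagnostic studies can corroborate the clinical diagnosis.",
    ],
}

# Score each response on the three readability metrics named in the Methods.
grade_levels = {
    model: [textstat.flesch_kincaid_grade(t) for t in answers]
    for model, answers in responses.items()
}
for model, answers in responses.items():
    for t in answers:
        print(model,
              textstat.flesch_kincaid_grade(t),   # Flesch-Kincaid Grade Level
              textstat.flesch_reading_ease(t),    # Flesch Reading Ease Score
              textstat.smog_index(t))             # Simple Measure of Gobbledygook

# Nonparametric comparison of grade-level scores across the three models.
stat, p = kruskal(*grade_levels.values())
print(f"Kruskal-Wallis H = {stat:.2f}, p = {p:.4f}")
```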
Results:
All 3 ChatGPT models averaged 93% accuracy, with no significant differences between them, although accuracy differed significantly between questions 3 and 5. Readability scores differed significantly across models: ChatGPT-4o generated the most readable responses, and ChatGPT-o1 produced the most complex answers.
Conclusions:
While the LLMs achieved similar accuracy, ChatGPT-4o offered the most patient-friendly content. Nevertheless, the readability of all models' responses remained above the recommended reading level for the general population. Future work should explore whether fine-tuning or advances in model design can enhance accessibility for a broader audience.
