Abstract
Background:
The integration of artificial intelligence (AI) into health care has seen significant advancements, particularly with AI-driven tools like ChatGPT. Initial evaluations indicated that ChatGPT 3.5 did not perform as well as humans on specialized hand surgery self-assessment examinations. The purpose of this study was to evaluate the performance of ChatGPT 4o on American Society for Surgery of the Hand (ASSH) self-assessment questions and to determine whether enhanced techniques, such as better prompts and file search, improve accuracy.
Methods:
Using data from the ASSH self-assessment examinations (2008-2013), we explored the impact of ChatGPT model version, prompt, and file search on the accuracy of AI-generated responses. We used OpenAI’s application programming interface to automate question input and response scoring. Statistical analysis was conducted using one-way analysis of variance. The Kuder-Richardson Formula 20 (KR-20) was used to assess the reliability of the test.
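For readers unfamiliar with KR-20, the following is a minimal sketch of how the statistic is commonly computed from dichotomously scored (correct/incorrect) items. It is not the authors' code: the function name `kr20`, the small response matrix, and the use of population variance for the total scores are assumptions made for illustration only.

```python
from statistics import pvariance


def kr20(responses):
    """Kuder-Richardson Formula 20 for a matrix of 0/1 item scores.

    responses: list of rows, one per examinee; each row holds one
    0/1 score per item (hypothetical data layout, for illustration).
    """
    n_examinees = len(responses)
    n_items = len(responses[0])
    if n_examinees < 2 or n_items < 2:
        raise ValueError("need at least 2 examinees and 2 items")

    # Variance of each examinee's total score (population variance assumed).
    totals = [sum(row) for row in responses]
    total_var = pvariance(totals)
    if total_var == 0:
        raise ValueError("total scores have zero variance")

    # Sum of p * q over items, where p is the proportion answering correctly.
    pq_sum = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in responses) / n_examinees
        pq_sum += p * (1 - p)

    return (n_items / (n_items - 1)) * (1 - pq_sum / total_var)


# Toy example: 4 examinees, 3 items (made-up data, not from the study).
toy = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(round(kr20(toy), 3))
```

Values closer to 1 indicate greater internal consistency. Some implementations use the sample variance rather than the population variance, which changes the result slightly on small samples.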
Results:
The latest AI models, particularly ChatGPT 4o with enhanced prompting and access to peer-reviewed literature, achieved performance levels comparable to those of human examinees, especially on text-based questions. ChatGPT 4o performed significantly better than ChatGPT 3.5 and showed marked improvement with better prompts and file search capabilities. The KR-20 for the 2013 examination was 0.946, indicating a very reliable test.
Conclusions:
These findings highlight AI’s potential to support medical education and practice, demonstrating that ChatGPT can perform at a human-equivalent level on hand surgery self-assessment examinations. Our results suggest potential utility as a supplementary tool in educational settings and as a supportive resource in clinical practice.
