Matching Human Expertise: ChatGPT’s Performance on Hand Surgery Examinations

Abstract

Background:

The integration of artificial intelligence (AI) into health care has witnessed significant advancements, particularly with AI-driven tools like ChatGPT. Initial evaluations indicated that ChatGPT 3.5 did not perform as well as humans on specialized hand surgery self-assessment examinations. The purpose of this study is to evaluate the performance of ChatGPT 4o on American Society for Surgery of the Hand (ASSH) self-assessment questions and to determine whether enhanced techniques such as better prompts and file search improve accuracy.

Methods:

Using data from the ASSH self-assessment examinations (2008-2013), we explored the impact of ChatGPT model version, prompt, and file search on the accuracy of AI-generated responses. We used OpenAI’s application programming interface to automate question input and response scoring. Statistical analysis was conducted using one-way analysis of variance. KR-20 was used to assess the reliability of the test.
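As an illustration of the reliability statistic named above, the following is a minimal sketch of the standard Kuder-Richardson Formula 20 (KR-20) calculation. It is not the study's own code: the function name `kr20`, the 0/1 response-matrix layout, and the use of population variance for the total scores are assumptions made for this example.

```python
from statistics import pvariance


def kr20(responses):
    """Compute KR-20 for a matrix of dichotomous item scores.

    `responses` is a list of examinee rows; each row is a list of
    0/1 scores, one per item (assumed layout, hypothetical data).
    """
    n_examinees = len(responses)
    n_items = len(responses[0])
    if n_examinees < 2 or n_items < 2:
        raise ValueError("need at least 2 examinees and 2 items")

    # Variance of each examinee's total score (population variance assumed).
    totals = [sum(row) for row in responses]
    total_var = pvariance(totals)
    if total_var == 0:
        raise ValueError("total scores have zero variance")

    # Sum of p * q over items, where p is the proportion answering correctly.
    pq_sum = 0.0
    for j in range(n_items):
        p = sum(row[j] for row in responses) / n_examinees
        pq_sum += p * (1 - p)

    return (n_items / (n_items - 1)) * (1 - pq_sum / total_var)


# Toy data: 4 examinees, 3 items (illustrative only, not study data).
toy = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
print(kr20(toy))
```

Values closer to 1 indicate that the items rank examinees more consistently. Some references use the sample variance instead of the population variance, which changes the result slightly for small groups.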

Results:

The latest AI models, particularly ChatGPT 4o with enhanced prompting and access to peer-reviewed literature, achieved performance levels comparable to those of human examinees, especially on text-based questions. ChatGPT 4o performed significantly better than ChatGPT 3.5 and showed marked improvement with better prompts and file search capabilities. The KR-20 for the 2013 examination was 0.946, indicating a very reliable test.

Conclusions:

These findings highlight AI’s potential to support medical education and practice, demonstrating that ChatGPT can perform at a human-equivalent level on hand surgery self-assessment examinations. Our results suggest potential utility as a supplementary tool in educational settings and as a supportive resource in clinical practice.

Keywords

ChatGPT, AI, education, certification, self-assessment

