Abstract
Background
Large language models (LLMs) have many clinical applications. However, the comparative performance of different LLMs on orthopedic board-style questions remains largely unknown.
Methods
Three LLMs, OpenAI’s GPT-4 and GPT-3.5, and Google Bard, were tested on 189 official 2022 Orthopedic In-Training Examination (OITE) questions. Comparative analyses were conducted to assess their performance against orthopedic resident scores and on higher-order, image-associated, and subject category-specific questions.
Results
GPT-4 surpassed the passing threshold for the 2022 OITE, performing at the level of PGY-3 to PGY-5 (p = .149, p = .502, and p = .818, respectively) and outperforming GPT-3.5 and Bard (p < .001 and p = .001, respectively). While GPT-3.5 and Bard did not meet the passing threshold for the exam, GPT-3.5 performed at the level of PGY-1 to PGY-2 (p = .368 and p = .019, respectively) and Bard performed at the level of PGY-1 to PGY-3 (p = .440, p = .498, and p = .036, respectively). GPT-4 outperformed both Bard and GPT-3.5 on image-associated (p = .003 and p < .001, respectively) and higher-order questions (p < .001). Among the 11 subject categories, all models performed similarly regardless of the subject matter. When individual LLM performance on higher-order questions was assessed, no significant differences were found compared to performance on first-order questions (GPT-4 p = .139, GPT-3.5 p = .124, Bard p = .319). Finally, when individual model performance was assessed on image-associated questions, only GPT-3.5 performed significantly worse compared to performance on non-image-associated questions (p = .045).
Conclusion
The AI-based LLM GPT-4 exhibits a robust ability to correctly answer a diverse range of OITE questions, exceeding the minimum passing score for the 2022 OITE and outperforming its predecessor GPT-3.5 and Google Bard.
Introduction
Artificial intelligence (AI)-based large language models (LLMs) have promising applications and have garnered considerable public interest. The utility of AI-based LLMs in the medical field, particularly with differential diagnoses and clinical decision-making, has been subject to much research.1,2 The most well-known LLM has been OpenAI’s ChatGPT (Generative Pretrained Transformer), which was launched to the public in November 2022. The first publicly available version of ChatGPT utilizes the GPT-3.5 language model, trained through a blend of supervised and unsupervised learning methodologies on large banks of textual data extending up to September 2021. GPT-3.5 was subsequently fine-tuned using reinforcement learning techniques grounded in human input.3 The base model has since been updated to GPT-4.
Several studies have assessed the performance of GPT-3.5 and GPT-4 on examinations used to evaluate medical students and resident physicians for medical licensing and board certification.4,5 Kung et al. recently examined GPT-3.5’s performance on the United States Medical Licensing Examination (USMLE) Step 1, 2, and 3, standardized examinations taken during medical school and residency that require approximately 400 hours of study each for the average student, and demonstrated that GPT-3.5 achieved a passing score on all three exams.4 The newly updated GPT-4 model exhibited even greater accuracy, demonstrating a 20% improvement across all three exams.6
Given the established proficiency of GPT-based LLMs on clinical subjects covered during early medical education, recent studies have explored their performance on more advanced board examinations administered to trainees in orthopedics, plastic surgery, radiology, and neurosurgery.7–10 Of these studies, only Ali et al. compared GPT-based LLM performance to another independently developed LLM, Google Bard. Google Bard, released in May 2023, utilizes the LaMDA language model trained using Infiniset, a dataset primarily focused on dialogues and conversations from public forums.11 Unlike GPT-3.5 and GPT-4, Google Bard is able to search the internet in real time when generating responses to user queries.
Given this unique capability of Google Bard, the present study aims to assess and compare the performance of LLMs, specifically GPT-3.5, GPT-4, and Bard, on the Orthopedic In-Training Examination (OITE) from the American Academy of Orthopaedic Surgeons (AAOS), a standardized multiple-choice test administered to orthopedic residents each year to gauge their knowledge of the field.12 A recent study by Lum explored this topic with GPT-3.5 only, using publicly available practice OITE questions from Orthobullets.13 Building upon this previous work, we aim to provide a more comprehensive comparative analysis of multiple language models on official 2022 OITE questions.
Materials and methods
Large language models
The three large language models tested in the present study, GPT-3.5, GPT-4, and Bard, are designed to generate human-like text in response to user-generated prompts. These models utilize an underlying transformer architecture that breaks a question down into chunks of text, called input tokens, which are processed by the model’s many layers to ultimately generate the output tokens that form the answer. To generate the output tokens, the models do not use a retrieval-based system in which pre-existing answers are pulled from a database. Instead, responses are generated on the fly based on token patterns seen during training and do not represent new, real-time understanding or analysis.14
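To make this loop concrete, the minimal sketch below mimics autoregressive generation in miniature. A toy bigram lookup table stands in for the billions of learned transformer weights, and every token and probability in it is invented for illustration; no real LLM works from a table like this, but the token-by-token shape of the generation loop is the same.

```python
# Illustrative sketch of autoregressive, token-by-token generation.
# Hypothetical next-token distributions standing in for learned weights.
NEXT_TOKEN_PROBS = {
    "the":      {"patient": 0.6, "fracture": 0.4},
    "patient":  {"presents": 0.7, "reports": 0.3},
    "presents": {"with": 0.9, "to": 0.1},
    "with":     {"pain": 0.5, "swelling": 0.5},
}

def tokenize(text: str) -> list[str]:
    # Real models use subword tokenizers (e.g., byte-pair encoding);
    # whitespace splitting is the simplest possible stand-in.
    return text.lower().split()

def generate(prompt: str, max_new_tokens: int = 3) -> str:
    tokens = tokenize(prompt)
    for _ in range(max_new_tokens):
        context = tokens[-1]              # toy model: last token only
        dist = NEXT_TOKEN_PROBS.get(context)
        if dist is None:                  # no pattern seen in training
            break
        # Append the highest-probability continuation (greedy decoding).
        tokens.append(max(dist, key=dist.get))
    return " ".join(tokens)

print(generate("The patient"))  # -> "the patient presents with pain"
```

Real models differ in scale, use far longer contexts, and often sample from the distribution rather than always taking the single most probable token, but the output remains a product of learned statistical patterns.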
While they share a common architecture and methodology, there are substantial differences in their training data and model complexity. GPT-3.5 and GPT-4, developed by OpenAI, use a proprietary language model with approximately 175 billion parameters, though the exact parameter count and training data for GPT-4 have not been disclosed. It is worth noting that when GPT-4’s performance was tested against older models, the older models were evaluated as scaled-down versions utilizing roughly 1/1000th of GPT-4’s computing power.15 Both models can only access information up to September 2021.1,13 Bard, on the other hand, is built upon the LaMDA transformer-based neural language model, which has approximately 137 billion parameters and is able to access the internet to generate responses.16
Orthopedic in-training examination
The OITE is a comprehensive examination in orthopedic surgery designed to assess an orthopedic surgeon’s knowledge during residency.17 Official 2022 OITE questions and compiled resident score reports are available from the AAOS, making the exam an ideal tool for assessing the knowledge and problem-solving abilities of these LLMs.18 While the AAOS reports resident performance on all 264 test items, the 2022 OITE examination available from the AAOS contained only 207 test items. Of the 207 questions available for this study, 18 questions that exclusively provided image data (radiograph, image, or video) without associated text were excluded due to the current constraints of these LLMs, which only accept text inputs. The remaining 189 questions, including 64 questions containing image references, were evaluated based solely on their text content. Each question was presented verbatim with the corresponding answer choices labeled alphabetically. The response from each LLM was recorded verbatim, and the answer choice corresponding to the response was identified as the chosen answer for each question. For comparison, ACGME resident scores on the 2022 OITE were extrapolated from the technical report published by the AAOS.
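A minimal Python sketch of this scoring procedure is shown below. The ask_model helper is a hypothetical stand-in for however a given LLM is queried; in the study itself, questions were entered and responses matched to answer choices manually.

```python
import re

def ask_model(prompt: str) -> str:
    # Hypothetical stand-in for querying an LLM (web interface or API).
    raise NotImplementedError

def score_exam(questions: list[dict]) -> float:
    """questions: [{'stem': str, 'options': {'A': str, ...}, 'answer': 'A'}, ...]"""
    correct = 0
    for q in questions:
        # Present the question verbatim with alphabetically labeled choices.
        choices = "\n".join(f"{letter}. {text}"
                            for letter, text in sorted(q["options"].items()))
        reply = ask_model(f"{q['stem']}\n{choices}")
        # Identify which labeled choice the recorded response corresponds to.
        match = re.search(r"\b([A-E])\b", reply)
        if match and match.group(1) == q["answer"]:
            correct += 1
    return correct / len(questions)  # proportion correct
```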
Categorization of study questions
Two fellowship-trained orthopedic surgeons independently classified each question into one of 11 general orthopedic categories: Basic Science, Foot & Ankle, Hand & Wrist, Hip & Knee, Oncology, Pediatrics, Shoulder & Elbow, Spine, Sports Medicine, Trauma, and Practice Management. Questions were separately classified by their association with images into image-associated and non-image-associated. Finally, questions were stratified using a two-level taxonomy into first-order and higher-order, with first-order questions examining recall and higher-order questions examining complex reasoning (comprehension, interpretation, inference) and application of knowledge. All classification was performed in a double-blinded manner with no discrepancies between the two reviewers.
Statistical analyses
All statistical analyses were performed using STATA version 16.0 software (Stata Corporation, College Station, TX). OITE scores were compared between LLMs and against ACGME residents, overall and across post-graduate years (PGY), with Pearson’s chi-square testing and post-hoc analysis. Performance on image-associated and higher-order questions was similarly assessed between LLMs and within each LLM (using performance on non-image-associated and first-order questions as controls, respectively) with a Pearson’s chi-square test and post-hoc analysis. Performance by subject category between LLMs was assessed with a Pearson’s chi-square test, or Fisher’s exact test when indicated by sample size. Statistical significance was set at p < .05. Bonferroni correction was used for post-hoc analysis to account for multiple comparisons, maintaining the family-wise alpha threshold at 0.05; statistical significance was accordingly set at p < .008 and p < .006 in post-hoc analyses when six or eight tests were performed, respectively. Study findings are reported in accordance with the STROBE guidelines.
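For illustration only, the core comparison could be reproduced in Python with SciPy as sketched below; the counts are hypothetical placeholders rather than the study’s data, and the published analysis was performed in STATA.

```python
from scipy.stats import chi2_contingency, fisher_exact

# Hypothetical [correct, incorrect] counts on 189 questions for two models.
gpt4 = [140, 49]
gpt35 = [104, 85]

# Pearson's chi-square test on the 2x2 contingency table.
chi2, p, dof, expected = chi2_contingency([gpt4, gpt35])

# Bonferroni correction: a family-wise alpha of .05 split across six
# pairwise post-hoc tests gives a per-test threshold of ~.008.
alpha = 0.05 / 6
print(f"chi2 = {chi2:.2f}, p = {p:.4f}, significant = {p < alpha}")

# Fisher's exact test for sparse tables (small expected cell counts).
odds_ratio, p_exact = fisher_exact([gpt4, gpt35])
```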
Results
Overall performance
Table 1. LLM and resident performance on the 2022 OITE.
Abbreviations: LLM, large language model; OITE, Orthopedic In-Training Examination; GPT, Generative Pretrained Transformer. Bold represents statistical significance.
aMean ACGME resident scores on the 264 items of the 2022 OITE were extrapolated from the technical report published by the American Academy of Orthopaedic Surgeons.16
bLLM performance was tested on the 189 eligible official 2022 OITE questions.
After disaggregation of the resident scores by residency year, GPT-4 scored at the same level as a PGY-3 (p = .149), PGY-4 (p = .502), and PGY-5 (p = .818); GPT-3.5 scored at the same level as a PGY-1 (p = .368) and PGY-2 (p = .019); and Bard scored at the same level as a PGY-1 (p = .440), PGY-2 (p = .498), and PGY-3 (p = .036) (Table 1). Per these results, only GPT-4, scoring 74% correct, would pass the OITE (passing score: 68.6% correct).15
Performance by categories
Table 2. LLM performance on questions by category type.
Abbreviations: LLM, large language model; GPT, Generative Pretrained Transformer.
aQuestion count associated with each category.
bCount (%) correct.
Table 3. LLM performance by subject category, reported as N (%) correct.
Abbreviations: LLM, large language model; GPT, Generative Pretrained Transformer.
aQuestion count associated with each category.
When each LLM’s performance was analyzed by question category, the taxonomy of the question had no influence on performance (GPT-4 p = .139, GPT-3.5 p = .124, Bard p = .319). However, when questions were stratified by association with images, GPT-3.5 answered significantly more image-associated questions incorrectly (p = .045), whereas the performance of GPT-4 and Bard was not significantly affected by image association (GPT-4 p = .976, Bard p = .407).
Discussion
AI-based large language models are of growing interest due to their utility in various fields, including healthcare. However, their ability to accurately and reliably address highly complex user-generated questions that may require years of clinical education and well-developed clinical reasoning and medical management skills has been heavily debated.19,20 The present study found that, in addition to GPT-3.5, GPT-4 and Bard scored similarly to actual residents on the OITE, with GPT-4 scoring at the level of PGY-3 to PGY-5 and Bard at the level of PGY-1 to PGY-3. In addition, this investigation showed that GPT-4 is better at answering higher-order and image-associated questions than Bard and GPT-3.5. However, with the exception of GPT-3.5’s performance on image-associated versus non-image-associated questions, no LLM performed significantly differently on higher-order versus first-order questions or on image-associated versus non-image-associated questions. Finally, no single LLM was better at answering specific categories of questions. These findings demonstrate that the newer LLMs may potentially be used in orthopedic education and may even serve as an adjunct tool in the clinical management of patients.
Multiple studies have previously assessed the performance of LLMs on board examinations.8,9,13 Ali et al. recently demonstrated that LLMs, particularly GPT-4, can outperform GPT-3.5 and human test-takers on a neurosurgery written board examination, excelling in the neuroanatomy, functional neurosurgery, and peripheral nerve sections.8 GPT-4 could even correctly answer higher-order questions that may pose a particular clinical challenge to residents. Bhayana et al. further illustrated the ability of GPT-3.5 to accurately answer radiology board-style questions without images, especially ones focused on clinical management.9 However, in our study, GPT-3.5 and Bard, unlike GPT-4, did not meet the passing threshold and did not perform strongly on higher-order questions focused on application of concepts. More recently, Lum et al. placed GPT-3.5 at the 40th percentile for first-year orthopedic residents and within the first percentile for third-year orthopedic residents, noting decreased performance with increased question complexity.13 The current study expands on Lum et al.’s findings, noting the ability of GPT-4 to outperform its predecessor GPT-3.5 and pass the OITE. This is consistent with the performance of the two GPT models on other examinations, highlighting the superiority of GPT-4.8,21 However, contrary to Lum et al.’s findings, GPT-3.5 and GPT-4 did not perform significantly differently on higher-order questions. This discrepancy is possibly explained by Lum et al.’s use of a three-tier taxonomy as opposed to our study’s two-tier system.
Within the field of orthopedics, this study is the first to compare the performance of different LLMs, with the addition of Bard. In this study, Bard performed inferiorly to GPT-4 despite Bard’s live access to the internet. This finding may be explained by differences in the training datasets of the LLMs. Since Bard’s training set consists largely of conversational forums, Bard may have had insufficient training on patterns related to clinical inputs. In addition, Bard’s ability to access the internet may have served as a crutch rather than an advantage. Although Bard can access the internet, it still processes information based on recognition of patterns rather than actively searching specific resources like Orthobullets. Furthermore, many correct answers on the OITE rely on the findings of a handful of studies, and these findings may conflict with results from less rigorous studies. Since Bard is unable to assess the validity and limitations of the data it accesses, it may utilize patterns learned from the less rigorous studies to generate its answer.
AI has already garnered significant attention in orthopedic surgery, and its current applications in the field are manifold, ranging from three-dimensional digital surgical planning to computer-assisted navigation of mechanical construct placement.22–24 Now, with GPT-4’s ability to perform similarly to a PGY-5 on the OITE, there is a question of whether AI-based models could supplement surgeon care and independently treat patients. However, the practice of surgery extends beyond the ability to succeed on multiple-choice questions and requires focused physical examination, manual dexterity, and surgical precision that AI cannot currently replace.25,26
Furthermore, although this study found that some AI models performed at a level similar to that of trained orthopedic residents, the individual LLMs performed similarly on all questions despite differences in taxonomy and association with images, suggesting their ability to answer correctly may be due to chance. This is supported by the science behind how transformer-based language models generate text. LLMs examine the context of the input, estimate the probability of different text coming next, and deploy the text with the highest probability as their response. Since statistical patterns drive response generation, we cannot know whether LLMs have true understanding of the text they process or generate. As such, LLMs do not have the knowledge, beliefs, desires, or intentions of humans. They only identify patterns in the data they are trained on and use those patterns to generate plausible-sounding text. Despite their advanced capabilities, LLMs can therefore create dangerous situations in which misinformation sounds plausible.
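The numerical core of that claim can be sketched in a few lines, assuming invented logit values: the model scores every candidate token, softmax turns those scores into a probability distribution, and decoding deploys the most probable token without any notion of truth.

```python
import numpy as np

# Hypothetical per-token scores (logits) a model might output for the
# next word after some clinical prompt; the values are invented.
vocab = ["fracture", "sprain", "tumor", "infection"]
logits = np.array([2.1, 1.3, -0.5, 0.2])

# Numerically stable softmax: scores -> probability distribution.
probs = np.exp(logits - logits.max())
probs /= probs.sum()

# Greedy decoding deploys the highest-probability token; nothing in
# this step checks whether the answer is factually correct.
best = vocab[int(np.argmax(probs))]
print(dict(zip(vocab, probs.round(3))), "->", best)
```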
Despite the limitations of AI-based LLMs, their use is becoming widespread, with ChatGPT being integrated across many platforms, including email. Thus, it is important that orthopedic surgeons recognize the advantages and limitations of the different models. Knowing that Bard is primarily trained on conversational forums, surgeons and interested parties may want to further develop and train Bard to interface with patients, answer questions, and write post-operative care plans. However, if the surgeon’s goal were to further develop a model to assist in triaging patients and in orthopedic resident education, utilizing Bard would be suboptimal compared to the GPT platform.
GPT-4’s performance on the OITE holds promise for its use within orthopedics. Since GPT-4’s likelihood of answering correctly was not dependent on question taxonomy, GPT-4 may be advanced enough to assist in tasks that require interpretation of clinical data, such as triaging patients to determine which need specialized care. However, future work is required to clarify whether GPT-4 can truly interpret information or whether this finding is the result of random chance. Additionally, since GPT-4’s baseline orthopedic knowledge is sufficient to pass the OITE, it may provide a conversational format for trainees to discuss topics they find difficult and generate new practice questions. While these tasks may seem far off, GPT-4 already performs almost 25% better on the OITE than its predecessor GPT-3.5, suggesting that future iterations of GPT may become increasingly useful adjuncts in clinical care.
The present study has several potential limitations. To begin with, it examines only two of the most common language model families (one across multiple versions), even though other advanced, medically focused models may be available or in development. Secondly, the study excludes exclusively image-based questions due to current limitations of AI language models. Additionally, the study utilizes only about 70% of the official 2022 OITE questions answered by orthopedic residents in the same year to compare LLM performance between models and residents. However, the proportion of questions utilized should capture the depth and breadth of orthopedics tested by the OITE. The study also does not examine the hallucination rates of the LLMs in the context of the OITE. LLM-generated hallucinations are instances in which the output has no relation to the input. Previous studies have shown that GPT-4 experiences 19%–29% fewer hallucinations than GPT-3.5.15 Bard, likewise, has been shown to have a 57% hallucination rate.21 These factors may ultimately limit the generalizability of our findings and highlight the need for continued research to assess AI use in orthopedic surgery.27,28 Further studies that leverage LLM APIs (application programming interfaces) and image support may be able to address these limitations.
Conclusion
The AI-based LLM GPT-4 exhibits a robust ability to correctly answer a diverse range of OITE questions, exceeding the minimum passing score for the 2022 OITE, and outperforming its predecessor GPT-3.5 and Google Bard.
Footnotes
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Supplemental Material
Supplemental material for this article is available online.
References
