Abstract
Objective:
To evaluate the performance of three large language models in answering questions regarding pediatric developmental dysplasia of the hip.
Methods:
We formulated 18 open-ended clinical questions in both Chinese and English and established a gold standard set of answers to benchmark the responses of the large language models. These questions were presented to ChatGPT-4o, Gemini, and Claude 3.5 Sonnet. The responses were evaluated by two independent reviewers using a 5-point scale. The average score, rounded to the nearest whole number, was taken as the final score. A final score of 4 or 5 indicated an accurate response, whereas a final score of 1, 2, or 3 indicated an inaccurate response.
Results:
The raters demonstrated a high level of agreement in scoring the answers, with weighted Kappa coefficients of 0.865 for Chinese responses (p < 0.001) and 0.875 for English responses (p < 0.001). No significant differences were observed among the three large language models in terms of accuracy when answering questions, with rates of 83.3%, 77.8%, and 77.8% for Claude 3.5 Sonnet, ChatGPT-4o, and Gemini in the Chinese responses (p = 1), and 83.3%, 83.3%, and 72.2% for ChatGPT-4o, Claude 3.5 Sonnet, and Gemini in the English responses (p = 0.761). In addition, there was no significant difference in the performance of the same large language model between the Chinese and English settings.
Conclusions:
Large language models demonstrate high accuracy in delivering information on developmental dysplasia of the hip, maintaining consistent performance across both Chinese and English, which suggests their potential utility as medical support tools.
Level of evidence:
Level II.
Introduction
Large language models (LLMs) represent a major technological advancement in the fields of artificial intelligence and natural language processing. These sophisticated models employ advanced deep-learning algorithms and complex architectural frameworks to understand and process textual data. Trained on extensive and diverse datasets, LLMs develop the capability to comprehend and generate human-like language with impressive precision and fluency.1,2 They are considered the models closest to achieving true mastery of human language.3
Since the launch of OpenAI’s ChatGPT, LLMs have attracted considerable attention in the medical field. These models are being explored for their potential to assist in critical areas such as medical decision-making, where they can provide rapid and accurate insights.4 In medical education, LLMs can function as interactive learning tools, offering personalized feedback and support to students.5 Their utility extends to research, where they can aid in navigating the vast medical literature, streamlining research processes, and drafting impactful articles.6 In documentation, LLMs can generate comprehensive patient summaries and articulate follow-up recommendations, producing discharge letters with both high quality and efficiency.7
Another promising application of LLMs in medicine is intelligent question-answering. Research has demonstrated their efficacy in providing accurate responses to complex medical queries, as evidenced by their performance on questions related to cardiovascular disease prevention,8 bariatric surgery,9 and evidence-based dentistry.10 Despite these successes, there remains a notable gap in exploring their effectiveness within the specialized domain of pediatric orthopedics.
Developmental dysplasia of the hip (DDH) is a prevalent pediatric orthopedic condition characterized by abnormalities in the morphology, position, and function of the femoral head and acetabulum.11,12 If not addressed early, DDH can lead to significant long-term mobility issues. It is a primary cause of total hip arthroplasty,13 with studies indicating that 26% of patients under 40 years of age undergoing this procedure have underlying DDH.14 There is a clear need for accurate and reliable information to guide clinical decisions. However, the performance of LLMs in the context of DDH has not yet been evaluated. Furthermore, previous research has predominantly been conducted in English-language settings, leaving performance in other languages largely unexamined.
This study aimed to compare the performance of three LLMs (ChatGPT-4o, Gemini, and Claude 3.5 Sonnet) in answering questions related to pediatric DDH in both Chinese- and English-language contexts. Our hypothesis was that ChatGPT-4o would answer questions more accurately than the other models and that the accuracy rate of the English answers would be higher than that of the Chinese answers.
Methods
Formulating questions and establishing gold standard answers
To evaluate the performance of LLMs in addressing pediatric DDH-related questions, we developed 18 open-ended clinical questions in both Chinese and English. These questions were formulated based on relevant clinical guidelines, expert consensus, and high-quality review articles indexed in PubMed, encompassing various aspects of DDH, including etiology, risk factors, pathological changes, clinical manifestations, diagnostic methods, differential diagnosis, classification, and treatment. To provide a benchmark for evaluating the responses generated by the LLMs, we established a gold standard set of answers. The answers to each question were strictly derived from the aforementioned clinical guidelines, expert consensus, and high-quality review articles, serving as the reference point for assessing the accuracy and quality of the LLMs’ responses.
Querying the models
From August 15 to 16, 2024, one author posed the 18 questions to the three LLMs in Chinese. From August 17 to 18, 2024, the same author posed the same questions to the models in English. The models tested were OpenAI’s ChatGPT-4o, Google’s Gemini, and Anthropic’s Claude 3.5 Sonnet. Each question was posed in a new chat window without additional explanations or prompts, ensuring that the models were not influenced by previous interactions or any other information. The responses provided by the LLMs were recorded in two separate Excel spreadsheets: one for the Chinese answers and another for the English answers. Each model’s responses were assigned to a distinct column to facilitate scoring by the reviewers. The two reviewers held senior titles in pediatric orthopedics and each had over 13 years of experience with DDH.
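For illustration only, the sketch below shows how such a response log could be organized. This is a hypothetical reconstruction in Python (pandas assumed), not the authors’ actual workflow, and all file and column names are illustrative.

```python
# Hypothetical sketch of the response log described above (not the authors'
# actual workflow); pandas is assumed, and all names are illustrative.
import pandas as pd

questions = [f"Q{i:02d}" for i in range(1, 19)]        # the 18 clinical questions
models = ["ChatGPT-4o", "Gemini", "Claude 3.5 Sonnet"]

# One spreadsheet per language: rows are questions, one column per model.
log_zh = pd.DataFrame(index=questions, columns=models, dtype="object")
log_en = pd.DataFrame(index=questions, columns=models, dtype="object")

# Each response, copied from a fresh chat window, fills one cell.
log_zh.loc["Q01", "ChatGPT-4o"] = "<pasted response text>"

log_zh.to_excel("chinese_answers.xlsx")                 # requires openpyxl
log_en.to_excel("english_answers.xlsx")
```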
Scoring the LLMs
Duplicate spreadsheets were created with the LLM names removed, ensuring that the models remained anonymous to the reviewers. The reviewers, who had previously familiarized themselves with the gold standard answers, scored the responses on a scale of 1 to 5. A score of 1 indicated an inappropriate answer that was severely incorrect, harmful, or clearly contradictory to the gold standard. A score of 2 was assigned to answers that were mostly inappropriate due to significant inaccuracies, omissions of key details, or inconsistencies with the gold standard. A score of 3 indicated answers that were partially appropriate, providing reasonably accurate information but requiring improvement in completeness, depth, or consistency with the gold standard. A score of 4 was assigned to answers that were mostly appropriate: essentially accurate, covering the most critical information, and aligned with the gold standard, albeit with minor omissions or a need for clarification. A score of 5 indicated answers that were fully appropriate, comprehensive, aligned with the gold standard, and addressed all key aspects of the clinical question.9 The average score from the two reviewers was rounded to the nearest whole number to determine the final score for each answer, with scores of 4 and 5 considered accurate and scores of 1, 2, or 3 deemed inaccurate.
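As a worked illustration of this scoring rule, the sketch below computes a final score from two reviewer ratings in Python. The paper states only that averages were rounded to the nearest whole number, so rounding half-scores (x.5) upward is our assumption.

```python
# Sketch of the final-score rule described above. The paper says only
# "rounded to the nearest whole number"; rounding half-scores (x.5) up
# is our assumption (Python's built-in round() would round half to even).
import math

def final_score(r1: int, r2: int) -> int:
    """Average two 1-5 reviewer ratings and round half up."""
    return math.floor((r1 + r2) / 2 + 0.5)

def is_accurate(score: int) -> bool:
    """Final scores of 4 or 5 count as accurate; 1-3 as inaccurate."""
    return score >= 4

assert final_score(4, 5) == 5 and is_accurate(final_score(4, 5))      # 4.5 -> 5
assert final_score(3, 4) == 4 and is_accurate(final_score(3, 4))      # 3.5 -> 4
assert final_score(2, 3) == 3 and not is_accurate(final_score(2, 3))  # 2.5 -> 3
```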
Statistical analysis
The consistency of the scores assigned by the two reviewers was assessed using the weighted Kappa test. The score distribution for each model, as rated by each reviewer, was described using the lowest, highest, and median scores. The accuracy rate of each LLM was calculated and expressed as a percentage, and categorical data were compared using Fisher’s exact test. All statistical tests were two-tailed, with a significance threshold of p < 0.05. Analyses were performed using SPSS 27.0 (IBM Corporation, Armonk, NY, USA).
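To make the analysis concrete, the sketch below reproduces the two tests in Python (sklearn and scipy assumed; the authors used SPSS 27.0, so results may differ in implementation details). Note that scipy’s fisher_exact handles only 2×2 tables, so it is shown on a two-group comparison; the 3×2 three-model comparisons require the Fisher-Freeman-Halton extension available in SPSS or R’s fisher.test. The reviewer ratings are invented placeholders; the 2×2 counts match ChatGPT-4o’s reported accuracy rates of 77.8% (14/18, Chinese) and 83.3% (15/18, English).

```python
# Sketch of the two main tests (sklearn/scipy assumed; authors used SPSS 27.0).
from sklearn.metrics import cohen_kappa_score
from scipy.stats import fisher_exact

# Invented placeholder ratings for 18 answers on the 1-5 scale.
reviewer1 = [5, 4, 4, 3, 5, 2, 4, 5, 3, 4, 5, 4, 1, 5, 4, 3, 5, 4]
reviewer2 = [5, 4, 5, 3, 5, 2, 4, 4, 3, 4, 5, 4, 1, 5, 4, 4, 5, 4]

# Weighted Kappa; the paper does not state the weighting scheme
# (linear vs quadratic), so "linear" here is an assumption.
kappa = cohen_kappa_score(reviewer1, reviewer2, weights="linear")

# 2x2 Fisher's exact test: ChatGPT-4o, Chinese vs English accuracy
# (14/18 = 77.8% and 15/18 = 83.3% accurate answers).
table = [[14, 4],   # Chinese: accurate, inaccurate
         [15, 3]]   # English: accurate, inaccurate
_, p_value = fisher_exact(table, alternative="two-sided")
print(f"weighted Kappa = {kappa:.3f}, Fisher p = {p_value:.3f}")
```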
Results
The weighted Kappa coefficient for the scores given by the two reviewers was 0.865 (p < 0.001) for the Chinese answers and 0.875 (p < 0.001) for the English answers, indicating a high level of inter-rater reliability in both cases. The queried questions are listed in Table 1. Table 2 presents the distribution of scores for the three LLMs in responding to both the Chinese and English questions. Irrespective of language, all models received the same minimum score of 1 and the same maximum score of 5. In the Chinese responses, Claude 3.5 Sonnet achieved the highest median score of 5, whereas ChatGPT-4o and Gemini shared a median score of 4. In the English responses, both ChatGPT-4o and Claude 3.5 Sonnet achieved a median score of 5, while Gemini had a median score of 4.
Table 1. List of queried English and Chinese questions.
Table 2. Statistical description of the scores given by the two evaluators for the answers provided by the three LLMs.
In the Chinese responses, Claude 3.5 Sonnet achieved the highest accuracy rate of 83.3%, whereas ChatGPT-4o and Gemini both had accuracy rates of 77.8%, with no statistically significant differences among the groups (p = 1). In the English responses, both ChatGPT-4o and Claude 3.5 Sonnet achieved accuracy rates of 83.3%, while Gemini had an accuracy rate of 72.2%, again with no statistically significant differences among the groups (p = 0.761). The accuracy rate of ChatGPT-4o was higher in English than in Chinese, and Gemini’s accuracy rate was higher in Chinese than in English; however, these differences were not statistically significant (p = 1). Regardless of language, Claude 3.5 Sonnet maintained an accuracy rate of 83.3% (Table 3).
Table 3. Number and proportion of accurate answers from the three LLMs.
Discussion
Following extensive training on medical literature, LLMs have demonstrated the ability to understand and generate medically relevant information. This is particularly evident with OpenAI’s ChatGPT, which has been the subject of comparative studies with other LLMs in the context of medical intelligent question-answering. ChatGPT-4 outperformed Bard in answering questions related to hematopoietic stem cell transplantation,15 and it outperformed Bing Chat and Bard in answering questions related to evidence-based dentistry and bariatric surgery.9,10 In pediatric orthopedics, research on the application of LLMs remains limited. A recent study evaluated ChatGPT-3.5’s performance in responding to questions about pediatric in-toeing, finding that 46% of the responses were rated as excellent and 44% as satisfactory.16
Our study provides the first comparative evaluation of ChatGPT-4o, an advanced version of the ChatGPT-4 model,17 alongside Gemini and Claude 3.5 Sonnet, in answering questions related to pediatric DDH. This represents our preliminary exploration of applying LLMs to pediatric orthopedics, with the potential to spark interest and stimulate discussion within the pediatric orthopedic community. It may also encourage broader application and more in-depth research on LLMs in pediatric orthopedics in the future.
The questions used to assess the models’ performance, as well as the gold standard answers, were developed based on guidelines, expert consensus, and high-quality review articles, ensuring clinical relevance and alignment with current best clinical practices. The establishment of a gold standard was crucial for producing reliable and valid results, providing an objective measure of the models’ performance. It ensured that the evaluation was grounded in established medical knowledge and practices, rather than being influenced by subjective interpretations or individual biases.
The results of our study demonstrated that the LLMs achieved a high accuracy rate in answering questions in the specific field of DDH, indicating their potential utility as auxiliary tools for healthcare professionals. In contrast to findings from other domains, this study found that the performance of the other LLMs was comparable to that of ChatGPT-4o, with all models achieving high accuracy rates. This outcome may be attributable to the intense competition within the LLM domain, which prompts continuous optimization and enhancement of the various models, and to the similar quality of their training data, a critical factor affecting model performance.
In the domain of natural language processing, the availability and richness of language resources significantly affect model performance. English, classified as a high-resource language, is often contrasted with other languages that are considered low resource.18 A study comparing ChatGPT’s responses to questions regarding total hip arthroplasty reported a satisfaction rate of 90% for English responses but only 10% for responses in Urdu.19 In this study, we not only compared the performance of different LLMs in Chinese question-answering but also evaluated the same models in both Chinese and English contexts. The results indicated that the three tested LLMs exhibited comparable performance in Chinese question-answering, and each model showed similar performance across the English and Chinese contexts, with high accuracy rates.
The Chinese answers were not simply translations of the English answers; for specific questions, the Chinese answers differed completely from the English ones. For example, regarding the Tönnis classification of DDH, the English answer provided by Claude 3.5 Sonnet was fully appropriate, and both evaluators awarded it 5 points. However, the Chinese answer provided by Claude 3.5 Sonnet was inappropriate: instead of connecting the superior margins of the bilateral acetabula, it connected the centers of the bilateral acetabular Y-shaped cartilages, so both evaluators awarded it only 1 point. Moreover, even when the scores for the Chinese and English answers were the same, their logic, ordering, details, and reasons for point deductions differed. For instance, in the differential diagnosis of DDH, both evaluators gave 4 points to Gemini’s answers in both languages, yet the Chinese answer failed to differentiate DDH from spastic hip dislocation, while the English answer failed to differentiate it from pathological hip dislocation.
To prevent simple translation and learning effects during the question-and-answer process, we implemented control measures in advance. First, the Chinese and English questions were asked at different times: the Chinese questions between August 15 and 16, 2024, and the English questions between August 17 and 18, 2024. Second, a new chat window was created for each question. Although the accuracy of the Chinese and English answers from the same model was similar, we found that the Chinese answers were not simple translations of the English answers. This indicates that Chinese-language resources are already quite rich and reliable and that models trained on Chinese resources can achieve performance equivalent to those trained on English resources. These findings suggest that these LLMs have the potential to serve as valuable auxiliary tools in medical consultations within Chinese-speaking regions, offering informational support to healthcare professionals in China.
Prior research has emphasized the inherent risks of employing LLMs in medical consultations, particularly hallucination, whereby models produce responses that conflict with accepted medical knowledge and could mislead or harm patients.9,20 In this study, while the models largely provided accurate information, there were cases in which they offered incorrect responses to certain queries, including DDH grading. Accurate DDH grading is crucial for guiding clinical treatment decisions; inaccurate grading could lead to misguided treatment plans, thereby negatively affecting patient outcomes. Consequently, the use of LLMs in medical intelligent question-answering systems must be approached with caution, viewing them as adjuncts rather than replacements for professional medical decision-making. All responses generated by LLMs must be subject to rigorous review and validation by qualified medical professionals. Furthermore, to enhance LLM accuracy in the medical domain, future research should concentrate on refining algorithms, improving the quality of training datasets, and adopting more nuanced fine-tuning strategies.21,22
Our study has several limitations. First, the sample was limited to 18 questions, which may not comprehensively reflect the LLMs’ performance. Second, the use of only two reviewers may have affected the objectivity of our quality assessment of the answers. Third, the study captures the performance of the LLMs at a single point in time; given the rapid pace of technological advancement, their performance is likely to evolve, necessitating continuous reassessment.
This study provides a preliminary assessment of the potential applications of LLMs within the DDH domain. Despite its limitations, the study demonstrated that LLMs have a high accuracy rate in delivering medical information, and their performance remained consistent across different languages, indicating their potential to serve as valuable adjuncts in the medical field.
Supplemental Material
Supplemental material, sj-pdf-1-cho-10.1177_18632521251331772, for “Preliminary assessment of large language models’ performance in answering questions on developmental dysplasia of the hip” by Shiwei Li, Jun Jiang, and Xiaodong Yang, Journal of Children’s Orthopaedics.
Acknowledgements
We express our gratitude to statistical expert Ting Wang from the Clinical Research Management Department of West China Hospital, Sichuan University, for providing us with professional statistical advice.
Author contributions
Shiwei Li: Investigation, data curation, formal analysis, and writing—original draft; Jun Jiang: Investigation and data curation; Xiaodong Yang: Investigation, conceptualization, methodology, writing—review & editing, and supervision.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
References
