A Comparative Analysis of GPT-3.5,GPT-4,GPT–4 Omni,Gemini Advanced,and Gemini 1.5 in Answering Frequently Asked Questions Regarding High Tibial Osteotomy

Abstract

Background:

Large language model (LLM)–based chatbots, such as ChatGPT and Gemini, have become widely used sources of medical information. No study has assessed the performance of LLM chatbots in providing clinically reliable information on high tibial osteotomy (HTO).

Purpose:

To evaluate the accuracy and relevance of different LLM chatbots in responding to frequently asked questions (FAQs) about HTO.

Study Design:

Cross-sectional study.

Methods:

A total of 35 FAQs about HTO were curated from online sources and categorized into 6 categories: general/procedure related, indications for surgery and outcomes, risks and complications of surgery, pain and postoperative recovery, specific activities after surgery, and alternatives to and variations of HTO. These questions were used as input to 5 different LLM chatbots: ChatGPT-3.5, ChatGPT-4, ChatGPT–4 Omni, Gemini Advanced and Gemini 1.5. Responses were collected from July 12 to 14, 2024 (ChatGPT-3.5, ChatGPT-4, ChatGPT–4 Omni, and Gemini Advanced) and on September 26, 2024 (Gemini 1.5). Two independent orthopaedic surgeons assessed the responses using a 5-point Likert scale (1 = very incorrect/very irrelevant, 5 = very accurate/very relevant). Responses were anonymized to blind evaluators to chatbot identities. Differences in accuracy among chatbots were assessed using analysis of variance, and differences in relevance using the Kruskal-Wallis test.

Results:

LLM chatbots demonstrated the following mean accuracy scores: GPT-3.5 (4.66 ± 0.64), GPT-4 (4.66 ± 0.54), GPT–4 Omni (4.94 ± 0.24), and Gemini 1.5 (4.86 ± 0.36), while Gemini Advanced showed significantly lower scores (3.83 ± 1.40) (P < .001) in answering HTO-related FAQs. Particularly, Gemini Advanced exhibited lower accuracy scores in the categories of indications and outcomes (P = .002) and alternatives and variations (P = .015). There were no significant differences among the models regarding general/procedure related (P = .12), risks and complications (P = .50), pain and postoperative recovery (P = .53), and specific activities after surgery (P = .09). All models provided relevant answers to all questions (35/35; 100%), except for Gemini Advanced (30/35; 85.7%).

Conclusion:

This study showed that ChatGPT-3.5, ChatGPT-4, ChatGPT–4 Omni, and Gemini 1.5 provided accurate and relevant responses on HTO, whereas Gemini Advanced exhibited limitations and underperformed in comparison with the other models.

Keywords

large language model chatbot ChatGPT Google Gemini high tibial osteotomy

In recent years, large language models (LLMs) have been increasingly explored for their potential to support health care services and improve patient experience.³³ Their applications span various medical fields, including patient education, imaging analysis, diagnostic support, drug discovery, communication between doctors and patients, medical content development, and enhancement of clinical workflows.^2,8 At a rapid pace, various types of LLM services are now being developed and provided.⁴ Among the most widely used LLM-based chatbots, ChatGPT (OpenAI) and Gemini (Google LLC; Alphabet Inc) have emerged as prominent tools in health care information delivery.² ChatGPT, introduced in 2022, has evolved through several versions, with GPT-4 and GPT–4 Omni offering enhanced reasoning and multimodal capabilities.^9,13 Google's Gemini, initially released in early 2024, has also undergone rapid development, transitioning from the Gemini 1.0 Ultra–based Gemini Advanced to the more advanced Gemini 1.5.^1,12 ChatGPT and Google Gemini are LLM-based chatbots trained on a data set covering various subjects, including medical publications.⁷ When asked questions, they provide well-structured, interactive, and easy-to-understand answers. Furthermore, chatbots constantly learn from user interaction, adapting themselves to become more responsive to user needs.²⁹ In recent years, LLM chatbots have become an increasingly popular source of information in the medical field, with many patients turning to these tools for guidance on health-related questions.¹⁰

Despite their potential, LLM chatbots have also generated controversy. These models produce responses by recognizing patterns and relationships in their training data rather than through actual understanding or logical reasoning.²⁵ The variable quality of online health information used to train these models, combined with the risk of misinformation, can result in incorrect or irrational answers.^19,25 Therefore, the outputs of LLM chatbots should be carefully evaluated for accuracy and scientific validity. In response to these concerns, researchers have begun to assess their accuracy and consistency in medical contexts.^{21,22,28,34,36} Several studies show that LLMs such as ChatGPT and Google Gemini provide satisfying answers for certain subjects, but their reliability across all medical topics remains uncertain.^15,23,28

Over the past few decades, high tibial osteotomy (HTO) has gained popularity for the treatment of medial compartment osteoarthritis of the knee, particularly in young and active patients with varus deformity.^17,20 Recent studies reported that HTO using a locking plate led to satisfactory clinical outcomes, proper correction maintenance, and successful osteotomy-site healing in most patients.^18,26

Its clinical importance continues to increase, particularly as the demand for postponement of the need for total knee arthroplasty (TKA) rises.²⁶ While several studies have evaluated the performance of LLM chatbots in answering questions related to common orthopaedic surgeries such as TKA and anterior cruciate ligament reconstruction,^6,16,36 no previous study has examined their performance in the context of HTO. Therefore, we aimed to determine whether current LLMs can accurately provide information regarding this increasingly common but less frequently studied orthopaedic procedure. We hypothesized that ChatGPT (GPT-3.5, GPT-4, and GPT–4 Omni) and Google Gemini (Gemini Advanced and Gemini 1.5) would provide accurate and relevant responses to frequently asked questions (FAQs) about HTO, with no significant differences between models across evaluation criteria.

Methods

Question Curation

We asked 5 different chatbot models 35 common questions about HTO and evaluated their answers. Typing “high tibial osteotomy” into Google brought up the "people also ask" box, which generated a list of FAQs. Google “people also ask” features reflects frequently searched, real-world patient inquiries in a reproducible manner.^11,30 The use of the “people also ask” function has been adopted in previous LLM evaluation studies for similar purposes.^5,19,36 To ensure academic transparency and clinical relevance, all FAQs were reviewed and finalized by 2 board-certified orthopaedic surgeons with >10 years of clinical experience in knee surgery. Inclusion criteria were questions directly related to HTO that patients might commonly ask. Exclusion criteria included questions unrelated to HTO and duplicate or highly similar questions, to avoid redundancy. In this study, questions were categorized into 6 domains based on their primary intent, following frameworks used in previous chatbot evaluation studies. For example, questions about surgical definitions, procedures, and types were included in the “general and procedure related” category; questions about the appropriateness of HTO for specific patients and expected outcomes were categorized as “indications and outcomes”; questions about potential disadvantages, complications, and perioperative risks were classified under “risks and complications”; questions about healing time or recovery were categorized under “pain and postoperative recovery”; questions about postoperative timelines for activities such as walking were included in “specific activities after surgery”; and questions about other treatment options, different surgical techniques, and new technologies were categorized under “alternatives and variations” (Table 1).

Table 1

Categories and Frequently Asked Questions

Categories	Frequently Asked Questions
General and procedure related	What is a high tibial osteotomy?
	Is high tibial osteotomy considered a major surgery?
	What are the benefits of high tibial osteotomy?
	What can I do to prepare for a high tibial osteotomy?
	What are the different types of high tibial osteotomy and their respective advantages and disadvantages?
Indications and outcomes	What are the criteria for high tibial osteotomy?
	What is the success rate of high tibial osteotomy?
	Should I consider having a high tibial osteotomy?
	Who is a candidate for a high tibial osteotomy?
	What are the contraindications for high tibial osteotomy?
	How long does a high tibial osteotomy last?
	What is the appropriate age for a high tibial osteotomy?
Risks and complications	What is the most common complication of a high tibial osteotomy?
	What are the disadvantages of a high tibial osteotomy?
	Can a high tibial osteotomy fail?
	What is the risk of severe complications or death from a high tibial osteotomy?
Pain and postoperative recovery	What can I expect after a high tibial osteotomy?
	How painful is a high tibial osteotomy?
	When can I return to normal activities after a high tibial osteotomy?
	How long does the recovery process take after a high tibial osteotomy?
	How long do I need to stay in the hospital after a high tibial osteotomy?
	Will I need rehabilitation after a high tibial osteotomy?
Specific activities after surgery	Can I bend my knee after a high tibial osteotomy?
	Can I run after a high tibial osteotomy?
	How long after a high tibial osteotomy can I walk?
	When can I resume playing sports after a high tibial osteotomy?
	What exercises can I do after a high tibial osteotomy?
	Will I be able to squat or kneel after a high tibial osteotomy?
Alternatives and variations	What is the difference between total knee replacement and high tibial osteotomy?
	What are the differences between high tibial osteotomy and unicompartmental knee replacement?
	Which is better, total knee replacement or high tibial osteotomy?
	How can I strengthen my knees to avoid a total knee replacement?
	What is the alternative to high tibial osteotomy?
	How long does it take for the high tibial osteotomy bone to heal?
	Do I need hardware removal surgery after a high tibial osteotomy?

Response Generation by LLM Chatbots

The FAQs were input into both ChatGPT (GPT-3.5, GPT-4, and GPT–4 Omni) and Google Gemini (Gemini Advanced and Gemini 1.5) to obtain the results for comparison. The questions were asked and answered from July 12 to 14, 2024 (ChatGPT-3.5, ChatGPT-4, ChatGPT–4 Omni, and Gemini Advanced) and on September 26, 2024 (Gemini 1.5). Because artificial intelligence (AI)–based chatbots adapt to user feedback from previous responses and provide different answers that reflect the user when generating subsequent answers, only the initial responses were considered in the evaluation. After completing a response to a question, the “new chat” button was pressed to open a fresh chat window and submit a new question.

Expert Evaluation

The responses of the chatbot models—GPT-3.5, GPT-4, GPT–4 Omni, Gemini Advanced, and Gemini 1.5—were assessed for both accuracy and relevance. Factual accuracy was defined as how scientifically accurate the responses were and whether they reflected the most up-to-date information. Relevance indicated how useful and effective the responses were in addressing the questions. Each response was rated on a 5-point Likert scale.³¹ Each evaluation scale definition was clearly listed at the top of the scoring sheet to ensure raters could refer to them during scoring (Tables 2 and 3). When the chatbot avoided answering, it was assigned a score of 1. Scoring was performed by the same 2 independent orthopaedic surgeons, possessing >10 years of clinical orthopaedic experience. Before scoring, they participated in a briefing session to review the evaluation scales and practiced with sample answers to ensure consistent understanding of the criteria.

Table 2

Likert Scale for Grading Factual Accuracy

Likert Scale	Factual Accuracy Levels
1	Very inaccurate: seriously misleading, providing information that is contrary to current medical evidence and could potentially cause harm
2	Inaccurate: not entirely accurate, with a few factual errors and some information that is no longer current
3	Somewhat accurate: mostly accurate, but lacks nuance and precision
4	Accurate: generally adheres to the latest guidelines, but may require some minor adjustments or additional context
5	Very accurate: highly accurate and up-to-date with current medical knowledge, needing only slight adjustments for complete accuracy

Table 3

Likert Scale for Grading Relevance

Likert Scale	Relevance Levels
1	Very irrelevant: fails to answer the question directly and strays significantly from the topic
2	Irrelevant: includes a small amount of relevant content but is largely off-topic
3	Somewhat relevant: addresses the question superficially but fails to provide complete or sufficient information
4	Relevant: mostly on point, but contains some extraneous information that makes the response a little less clear
5	Very relevant: completely relevant and helpful, giving the patient exactly the information needed without any superfluous detail

To ensure the raters were unable to distinguish between the different chatbots, all generated responses were formatted into plain text. Identifying features including hyperlinks, images, or citation markers were removed or reformatted to prevent any model-specific clues. This ensured that the assessment focused solely on the accuracy and relevance of the textual content itself. The appropriateness of images or the correctness of cited references, which are model-specific presentation elements, were not included in the evaluation criteria. Furthermore, after anonymization, the plain text responses were randomly shuffled before being presented to the 2 independent orthopaedic surgeons for scoring.

Statistical Analysis

Each question was independently scored using a 5-point Likert scale, and scores were averaged. Ratings from the Likert scale were dichotomized. A score of ≥4 indicated an accurate or relevant response, while a score <4 was categorized as inaccurate or irrelevant. For accuracy, Likert scores were analyzed using 1-way analysis of variance with Tukey post hoc test to compare chatbot performance, as data followed a normal distribution. The percentage of relevant responses was calculated for each category. Because relevance data did not meet normality assumptions, the Kruskal-Wallis H test with Dunn post hoc test was used for group comparisons. Data analysis was conducted using R software Version 4.3.3 (R Foundation for Statistical Computing). A P value of <.05 was the threshold for statistical significance. To assess the consistency of scoring patterns between raters, the Pearson correlation coefficient and Cronbach alpha were calculated.

Results

Overall Performance

The mean accuracy of the ChatGPT models was 4.66 ± 0.64 for GPT-3.5, 4.66 ± 0.54 for GPT-4, and 4.94 ± 0.24 for GPT–4 Omni. For Google Gemini models, Gemini 1.5 had a mean accuracy of 4.86 ± 0.36, while Gemini Advanced had a significantly lower accuracy of 3.83 ± 1.40; P < .001) (Figure 1). The relevance analysis revealed a significant difference (P < .001) between Gemini Advanced (85.7%) and the other chatbots (100%). Gemini Advanced avoided answering 5 out of 35 questions (Figure 2). Consistency between raters was excellent, with a Pearson correlation coefficient of 0.93 and a Cronbach alpha of .96.

Figure. 1.

Bar chart with overall accuracy of chatbot responses to common high tibial osteotomy questions for the 5 different chatbots tested (GPT-3.5, GPT-4, GPT–4 Omni, Gemini Advanced, and Gemini 1.5). The y-axis represents the mean score over various categories of questions, while the x-axis shows the different chatbot models. Error bars indicate SD for each chatbot's score, illustrating the variability in the data. Asterisk indicates statistical significance of P < .001.

Figure. 2.

Evasive responses of Gemini Advanced. Gemini Advanced responded to certain questions by recommending consulting a specialist or by stating that the question was beyond its capabilities.

General/Procedure Related

This category included 5 FAQs. All models performed reasonably well, with no statistically significant differences (P = .12) (Table 4). GPT–4 Omni achieved the highest score of 5.00 ± 0.00, while Gemini Advanced had the lowest score of 3.40 ± 1.67. As shown in Table 5, the relevance rate was 80.00% for Gemini Advanced, whereas all other chatbots maintained a rate of 100%.

Table 4

Results of Model Performance Analysis by Category for Accuracy ^a

Category	Gemini Advanced	Gemini 1.5	GPT-3.5	GPT-4	GPT–4 Omni	P
General	3.40 ± 1.67	4.60 ± 0.55	4.00 ± 1.00	4.40 ± 0.55	5.00 ± 0.00	.12
Indications	3.29 ± 1.89 ^b	5.00 ± 0.00	5.00 ± 0.00	4.86 ± 0.38	5.00 ± 0.00	.002
Risks	4.50 ± 0.58	5.00 ± 0.00	4.25 ± 0.96	4.25 ± 0.96	4.75 ± 0.50	.50
Recovery	4.33 ± 0.82	4.83 ± 0.41	4.67 ± 0.52	4.67 ± 0.52	4.83 ± 0.41	.53
Activities	4.17 ± 0.75	4.83 ± 0.41	4.67 ± 0.52	4.67 ± 0.52	5.00 ± 0.00	.09
Variation	3.57 ± 1.81 ^b	4.86 ± 0.38	5.00 ± 0.00	4.86 ± 0.38	5.00 ± 0.00	.015

Data are presented as mean ± SD.

Indicates statistically significant difference.

Table 5

Results of Model Performance Analysis by Category for Relevance ^a

Category	No. of Questions	Relevance
Category	No. of Questions	Gemini Advanced	Gemini 1.5	GPT-3.5	GPT-4	GPT–4 Omni
General	5	4 (80)	5 (100)	5 (100)	5 (100)	5 (100)
Indication	7	5 (71.4)	7 (100)	7 (100)	7 (100)	7 (100)
Risks	4	4 (100)	4 (100)	4 (100)	4 (100)	4 (100)
Recovery	6	6 (100)	6 (100)	6 (100)	6 (100)	6 (100)
Activities	6	6 (100)	6 (100)	6 (100)	6 (100)	6 (100)
Variation	7	5 (71.4)	7 (100)	7 (100)	7 (100)	7 (100)
Total	35	30 (85.7)	35 (100)	35 (100)	35 (100)	35 (100)

Data are presented as n of relevant responses (%).

Indications and Outcomes

This category included 7 FAQs. As shown in Table 4, GPT-3.5, GPT–4 Omni, and Gemini 1.5 achieved the highest accuracy scores (5.00 ± 0.00), followed by GPT-4 (4.86 ± 0.38). In contrast, Gemini Advanced recorded a significantly lower score (3.29 ± 1.89) (P = .002). The relevance rate was 71.4% for Gemini Advanced, whereas all other chatbots demonstrated full relevance across all questions (Table 5).

Risks and Complications

Four FAQs about the risks and complications of HTO surgery were included in this category. No significant differences were shown between the scores of all models, indicating similar performance (P = .50) (Table 4). Gemini 1.5 achieved the highest accuracy score at 5.00 ± 0.00. Both GPT-3.5 and GPT-4 recorded the lowest score of 4.25 ± 0.96.

Pain and Postoperative Recovery

Six FAQs were included in this category. All models provided accurate responses with scores >4, showing no significant performance differences among them (P = .53) (Table 4). Scores ranged from 4.33 ± 0.82 (Gemini Advanced) to 4.83 ± 0.41 (Gemini 1.5 and GPT–4 Omni).

Specific Activities After Surgery

Six FAQs were included in this category. All models achieved accuracy scores >4, with no significant differences in performance among them (P = .09) (Table 4). GPT–4 Omni recorded the highest score of 5.00 ± 0.00, while Gemini Advanced had the lowest score of 4.17 ± 0.75.

Alternatives and Variations

This category included 7 FAQs. The accuracy scores in this category differed significantly between chatbots, with GPT-3.5 and GPT–4 Omni scoring 5.00 ± 0.00, GPT-4 and Gemini 1.5 scoring 4.86 ± 0.38, and Gemini Advanced scoring 3.57 ± 1.81 (P = .015) (Table 4). This large difference was primarily driven by the markedly lower performance of Gemini Advanced compared with the other models. A relevance rate of 80.00% was achieved by Gemini Advanced, whereas all other chatbots achieved a rate of 100% (Table 5).

Discussion

The major findings of our study demonstrated that ChatGPT models (GPT-3.5, GPT-4, and GPT–4 Omni) and Gemini 1.5 provided highly accurate responses to HTO-related FAQs, with mean accuracy scores ranging from 4.66 to 4.94 and 100% relevance across all categories. In contrast, Gemini Advanced exhibited significantly lower accuracy (3.83 ± 1.40; P < .001) and reduced relevance (85.7%), with the greatest variability among its responses. These results indicate that the reliability of LLM chatbots is not uniform, underscoring the need for careful selection and validation before clinical application.

In terms of accuracy, category-specific analysis reveals that all chatbots provided accurate answers without significant differences in the categories of "general/procedure related," "risks and complications," "pain and postoperative recovery," and "specific activities after surgery". Gemini Advanced showed noticeable underperformance in the "indications and outcomes" and "alternatives and variations" categories (Table 4). In the "indications and outcomes" category, all models except for Gemini Advanced achieved highly accurate results when answering questions related to medical indications, suggesting that these models possess a strong understanding of when a particular treatment or procedure is appropriate. Similar to the results for the "indications and outcomes" category, most models demonstrate strong performance in the “alternatives and variations" category, suggesting a good understanding of the variations in treatment approaches or disease presentation. It is typically assumed that newer and more advanced chatbots will yield better performance.^24,32 In this study, we found that the latest model of ChatGPT (4 Omni) consistently achieved the highest accuracy scores in most categories, highlighting its ability to handle general medical inquiries and provide accurate information related to recovery and activity recommendations. However, the differences in accuracy scores among the various ChatGPT versions tested were not statistically significant. Gemini 1.5 significantly outperformed Gemini Advanced, indicating an improvement in accuracy with the newer generation of this chatbot.

In terms of relevance, consistent with the findings on accuracy, only Gemini Advanced recorded less than perfect relevance (85.7%). Notably, Gemini Advanced showed low relevance in the categories of "indications and outcomes" and "alternatives and variations." In contrast, all other chatbots achieved 100% relevance across all responses. These results indicate that all chatbots, except for Gemini Advanced, were able to consistently demonstrate both accuracy and relevance in their responses.

The relatively low accuracy and relevance of Google Gemini Advanced were influenced both by its lower accuracy in provided answers compared with other chatbots and by its tendency to avoid answering certain questions, which were assigned the lowest score. Representative responses to the same question from Gemini Advanced (low rated) and GPT–4 Omni (high rated) is provided in Appendix Table A1. Regarding avoidance, Gemini Advanced either recommended consulting with a specialist or stated that the question was beyond its capabilities for 5 out of the total of 35 questions posed (Figure 2). The questions that Gemini avoided answering are characterized by their potential answers influencing decisions directly related to medical treatment. This may reflect a more conservative and constrained design philosophy for the Gemini Advanced model, in comparison with other models tested. While this approach seems like it should help minimize risks from misinformation, it may also lead users to look elsewhere for more direct guidance. In contrast, less restrictive models such as the various versions of ChatGPT provided direct answers, albeit with disclaimers encouraging users to consult an expert (eg, "These guidelines are general and may vary depending on individual circumstances. Always follow your surgeon's specific recommendations").

In addition to these findings, a direct comparison of Google's Gemini and ChatGPT revealed distinct approaches to information presentation. Specifically, Gemini enhanced the user experience by meticulously citing sources and incorporating relevant images, which improved both the reliability and the readability of the information provided (Appendix Table A2). This attention to detail could make Gemini a valuable tool for users who value transparency and concise information in their AI interactions. Conversely, we found that ChatGPT provided more detailed and lengthy answers. The latest version of ChatGPT exhibited a tendency to generate more voluminous output (Appendix Table A3). This voluminous output sometimes decreased the proportion of relevant information, potentially confusing patients. Also, other research has shown that LLM chatbots can sometimes give incorrect or misleading information, even creating believable but false statements. This issue has been called "artificial hallucination."^3,14,35 However, in our study, no hallucinations were detected. While these findings highlight the reliability of chatbot-generated responses within the scope of HTO-related questions, it is difficult to conclude that this applies to all medical topics.

These findings suggest that chatbots could be integrated into clinical workflows to enhance patient care. For example, they could provide previsit education about HTO, support shared decisionmaking during consultations by explaining surgical options in simple language, and offer postoperative guidance to reinforce rehabilitation. Such integration may improve patient understanding, streamline clinic visits, reduce clinician workload, and enhance overall patient experience. However, ongoing advancements in AI models are needed to enhance their clinical relevance and ensure the safe and accurate delivery of medical information.

Limitations

There are some limitations to our study. First, chatbot models are constantly being updated, improved, and refined. In the case of Open AI, since the introduction of GPT-1 in June 2018, development has continued at a pace with the intervals between new models and updates becoming increasingly shorter. Google, on the other hand, first introduced its LLM in the form of Bard in 2023, followed by Gemini in February 2024. Since then, Google has implemented >20 updates with the frequency of their updates also increasing. As changes to these chatbots occur rapidly, continuous monitoring and evaluation are necessary. Therefore, our results should be interpreted with caution while we also acknowledge that the current performance of the various chatbots may be different from our results. Second, because this study only dealt with specialized medical questions, it is difficult to conclude that the same result would be obtained for general questions. We exclusively evaluated the various chatbots’ ability to provide high-quality answers to FAQs specifically related to HTO surgery. As such, our results do not provide a definitive answer on the overall performance of each model in addressing general questions. Third, this study exclusively evaluated chatbot responses in English. Because the performance of AI chatbots can significantly vary across different languages,²⁷ our findings might have limitations when being generalized to non–English speaking populations. Finally, this study assessed chatbot performance only based on expert evaluation, without evaluating patient comprehension or satisfaction. In addition, the accuracy assessments were conducted by only 2 orthopaedic surgeons, and intrarater reliability was not evaluated, which may limit the robustness of the findings. Future studies involving direct interactions between chatbots and patients are needed to evaluate the usability, reliability, and effectiveness of chatbots in real-world clinical settings.

Conclusion

This study evaluated the accuracy and relevance of AI-based chatbots in response to common questions related to HTO. GPT-3.5, GPT-4, GPT–4 Omni, and Gemini 1.5 successfully provided highly accurate and relevant responses. However, Gemini Advanced exhibited lower accuracy and relevance than the other chatbots, highlighting a clear performance gap. These findings suggest that highly proficient LLM chatbots hold substantial promise for enhancing patient education and clinical information delivery. To translate these findings into clinical workflows, future studies will need to evaluate patient comprehension and satisfaction with chatbot-provided information, as well as to examine their effect on clinical efficiency and shared decisionmaking.

Footnotes

Appendix

Table A3

Examples of Questions With More Voluminous Output by ChatGPT–4 Omni Than Gemini 1.5

Question: What are the benefits of HTO?
Question: What is an HTO?
Gemini 1.5	An HTO is a surgical procedure that realigns the knee joint. It involves cutting and reshaping the tibia (shinbone) to shift the weightbearing load away from the damaged portion of the knee. This procedure is primarily used to treat unicompartmental osteoarthritis of the knee, which is arthritis that affects only 1 side of the knee, typically the inner (medial) side. By realigning the knee, an HTO can help to • Relieve pain • Improve joint stability • Slow the progression of arthritis • Delay or prevent the need for a total knee replacement HTO is typically considered for patients who are • Younger and more active • Have good range of motion in the knee • Have no significant ligamentous instability • Have arthritis primarily on 1 side of the knee It is important to note that HTO is not suitable for everyone, and your doctor will carefully evaluate your individual condition to determine if it is the right treatment option for you.
GPT–4 Omni	An HTO is a surgical procedure aimed at treating knee arthritis, specifically in the medial (inner) compartment of the knee. This procedure is commonly performed on younger, active patients who have osteoarthritis localized to 1 side of the knee and wish to delay total knee replacement. Here's a detailed overview: • Purpose: - Redistribution of load: HTO is used to realign the knee joint and redistribute weight away from the damaged, arthritic part of the knee to the healthier part. - Pain relief: by shifting the weightbearing axis, it helps reduce pain and slows the progression of arthritis. • Indications: - Medial knee osteoarthritis: typically for patients with medial compartment osteoarthritis. - Varus deformity: often performed on patients with a varus deformity (bow-leggedness) to correct the alignment. - Active lifestyle: suitable for younger, active patients who wish to maintain an active lifestyle and avoid or delay total knee replacement. • Procedure: 1. Incision and exposure: an incision is made on the upper part of the tibia (shinbone), just below the knee. 2. Bone cut: the surgeon makes a precise cut in the tibia. 3. Realignment: the cut bone is then carefully realigned to shift the weightbearing axis of the leg. This can be done by removing a wedge of bone (closing-wedge osteotomy) or adding a wedge (opening-wedge osteotomy). 4. Fixation: the realigned bone is stabilized using metal plates, screws, or other fixation devices to allow it to heal in the new position. • Recovery: - Rehabilitation: after surgery, a rehabilitation program involving physical therapy is crucial to regain strength and mobility. - Weightbearing: initially, weightbearing is limited, and patients may use crutches. Gradual return to full weightbearing is guided by the surgeon. - Healing time: complete recovery can take several months, with the bone typically healing in about 3 to 6 months. • Benefits: - Pain reduction: significant pain relief in the affected knee compartment. - Delay of knee replacement: can delay the need for total knee replacement by several years. • Risks and considerations: - Infection: as with any surgery, there is a risk of infection. - Nonunion or malunion: risks related to bone healing, such as the bone's not healing properly or in the wrong position. - Nerve or blood vessel injury: there is potential for damage to surrounding nerves or blood vessels. HTO is a valuable option for selected patients, allowing for improved function and pain relief while preserving the knee joint for as long as possible.
Gemini 1.5	Here are the key benefits of HTO: • Pain relief: HTO effectively reduces pain caused by unicompartmental osteoarthritis (arthritis affecting only 1 side of the knee). By realigning the leg, pressure is shifted off the damaged portion of the knee, alleviating pain. • Improved function: with reduced pain, patients experience improved knee function and mobility. Activities such as walking, climbing stairs, and participating in certain sports become easier. • Delaying knee replacement: HTO can significantly postpone the need for total knee replacement surgery. This is particularly beneficial for younger, active patients who want to preserve their natural joint for as long as possible. • Faster recovery: compared with knee replacement, HTO typically involves a faster recovery time. Patients can often return to normal activities sooner. • Bone preservation: HTO preserves the natural knee joint and avoids the need for artificial implants. Additional considerations: • Ideal candidates: HTO is most successful in patients with - Unicompartmental osteoarthritis (typically on the inside of the knee) - A varus deformity (bow-leggedness) - Relatively healthy cartilage in the unaffected parts of the knee - A reasonable level of activity and fitness Not suitable for everyone: HTO may not be appropriate for patients with - Advanced, widespread arthritis - Inflammatory arthritis - Significant ligamentous damage - Obesity Overall, HTO offers a valuable treatment option for individuals with specific knee conditions. It can provide pain relief, improve function, and delay the need for more invasive procedures.
GPT–4 Omni	HTO is a surgical procedure commonly used to treat knee osteoarthritis, particularly in younger, active patients who have malalignment of the knee. The procedure involves cutting and reshaping the tibia (shinbone) to relieve pressure on the knee joint. Here are the primary benefits of HTO: • Pain relief: one of the most significant benefits of HTO is the reduction of pain associated with knee osteoarthritis. By redistributing the weight and relieving pressure on the damaged part of the knee, patients often experience substantial pain relief. • Delays the need for knee replacement: HTO can postpone the need for total knee replacement surgery, which is particularly beneficial for younger patients. This is important because knee replacements have a limited lifespan, and delaying this surgery can reduce the likelihood of needing a revision surgery later. • Improved knee function: patients often experience improved knee function and stability after HTO. This can lead to a more active lifestyle and the ability to participate in activities that were previously limited because of pain and instability. • Preservation of natural knee joint: unlike total knee replacement, HTO preserves the patient's natural knee joint. This can be advantageous in maintaining a more natural range of motion and knee biomechanics. • Correction of knee alignment: HTO corrects the alignment of the knee joint, which can improve the overall mechanics of the leg. This realignment can reduce stress on the knee and other joints, potentially preventing the progression of arthritis. • Improved quality of life: by reducing pain and improving knee function, patients often report an improved quality of life. They can engage in daily activities with greater ease and comfort. • Less invasive than total knee replacement: HTO is less invasive than a total knee replacement, which typically means a shorter recovery time and less risk of complications associated with more extensive surgery. • Potential for cartilage regeneration: in some cases, HTO can create an environment that encourages cartilage regeneration. By offloading the damaged part of the knee, the healthier cartilage can bear the load, potentially slowing the progression of osteoarthritis. HTO is a valuable option for patients with specific types of knee problems, particularly those who are younger, active, and have knee malalignment contributing to their arthritis. The decision to undergo HTO should be made in consultation with an orthopaedic surgeon who can assess the individual patient's condition and determine the most appropriate treatment plan.

HTO, high tibial osteotomy.

Final revision submitted July 3, 2025; accepted August 11, 2025.

One or more of the authors has declared the following potential conflict of interest or source of funding: This research was supported by Hallym University Research Fund 2024 (HURF-2024-46). AOSSM checks author disclosures against the Open Payments Database (OPD). AOSSM has not conducted an independent investigation on the OPD and disclaims any liability or responsibility relating thereto.

Ethical approval was not required for this study, as it involved analysis of publicly available chatbot outputs and did not include any human subjects.

References

Akhtar

ZB.

From bard to Gemini: an investigative exploration journey through Google's evolution in conversational AI and generative AI. Comput Artif Intell. 2024;2(1):1378.

Alhur

Redefining healthcare with artificial intelligence (AI): the contributions of ChatGPT, Gemini, and Co-pilot. Cureus. 2024;16(4):e57795.

Alkaissi

McFarlane

SI.

Artificial hallucinations in ChatGPT: implications in scientific writing. Cureus. 2023;15(2):e35179.

Arora

The promise of large language models in health care. Lancet. 2023;401(10377):641.

Aydilek

Karadamar

OL.

How reliable are ChatGPT and Google's answers to frequently asked questions about unicondylar knee arthroplasty from a scientific perspective?

J Orthop Surg (Hong Kong). 2025;33(2):10225536251350411.

Bains

Dubin

Hameed

, et al. Use and application of large language models for patient questions following total knee arthroplasty. J Arthroplasty. 2024;39(9):2289-2294.

Biswas

SS.

Role of Chat GPT in public health. Ann Biomed Eng. 2023;51(5):868-869.

Chatterjee

Bhattacharya

Pal

Lee

Chakraborty

ChatGPT and large language models in orthopedics: from education and surgery to research. J Exp Orthop. 2023;10(1):128.

Claburn

OpenAI Says Natively Multimodal GPT-4o Eats Text, Visuals, Sound—and Emits the Same: the o Is Not Short for “Oh Sh…” for Google and Co. Updated May 13, 2024. Accessed July 4, 2025. https://www.theregister.com/2024/05/13/openai_gpt4o/

10.

Clark

Bailey

. Chatbots in Health Care: Connecting Patients to Information: Emerging Health Technologies. Canadian Agency for Drugs and Technologies in Health; 2024. Accessed March 20, 2025 CADTH Horizon Scans report No. EH0122. https://www.ncbi.nlm.nih.gov/books/NBK602381/

11.

Dosso

Chevalier

Tamine

Paubel

Salmeron

People also ask: how does this tool affect exploration-exploitation strategies with regard to prior domain knowledge and search context? An eye-tracking study. Appl Ergon. 2024;121:104367.

12.

Gemini Team Google: Georgiev

Lei

Burnell

, et al. Gemini 1.5: Unlocking Multimodal Understanding Across Millions of Tokens of Context. ArXiv. Revised December 16, 2024. Accessed July 4, 2025 https://arxiv.org/abs/2403.05530

13.

Goldman

ChatGPT Launched Six Months Ago. Its Impact—and Fallout Is Just Beginning. VentureBeat. Updated May 30, 2023. Accessed July 4, 2025. https://venturebeat.com/ai/chatgpt-launched-six-months-ago-its-impact-and-fallout-is-just-beginning-the-ai-beat/

14.

Hao

Pan

Morello

Quantifying the uncertainty of LLM hallucination spreading in complex adaptive social networks. Sci Rep. 2024;14(1):16375.

15.

Johnson

Goodman

Patrinely

, et al. Assessing the accuracy and reliability of AI-generated medical responses: an evaluation of the Chat-GPT model. Res Sq. Preprint posted online February 28, 2023. doi:10.21203/rs.3.rs-2566942/v1

16.

Kaarre

Feldt

Keeling

, et al. Exploring the potential of ChatGPT as a supplementary tool for providing orthopaedic information. Knee Surg Sports Traumatol Arthrosc. 2023;31(11):5190-5198.

17.

Kim

H-J

Shin

J-Y

Lee

H-J

, et al. Does preoperative bone marrow edema affect clinical outcomes after medial open-wedge high tibial osteotomy? Knee Surg Relat Res. 2024;36(1):9.

18.

Kim

Nguyen

Seo

Y-J

, et al. Bone healing on serial plain radiographs occurs slowly but adequately after medial opening wedge high tibial osteotomy without bone graft. Knee Surg Sports Traumatol Arthrosc. 2022;30(3):993-1000.

19.

Kolac

Karademir

Ayik

Kaymakoglu

Familiari

Huri

Can popular AI large language models provide reliable answers to frequently asked questions about rotator cuff tears?

JSES Int. 2025;9(2):390-397.

20.

Lee

Shin

J-S

Hong

S-Y

, et al. Implant removal after medial opening wedge high tibial osteotomy provides implant-related pain relief and functional improvement. Arthroscopy. 2024;40(6):1837-1844.

21.

Lee

Campbell

Patel

, et al. Unlocking health literacy: the ultimate guide to hypertension education from ChatGPT versus Google Gemini. Cureus. 2024;16(5):e59898.

22.

Meral

Ates

Gunay

Ozturk

Kusdogan

Comparative analysis of ChatGPT, Gemini and emergency medicine specialist in ESI triage assessment. Am J Emerg Med. 2024;81:146-150.

23.

Mika

Martin

Engstrom

Polkowski

Wilson

JM.

Assessing ChatGPT responses to common patient questions regarding total hip arthroplasty. J Bone Joint Surg Am. 2023;105(19):1519-1526.

24.

Moshirfar

Altaf

Stoakes

Tuttle

Hoopes

PC.

Artificial intelligence in ophthalmology: a comparative analysis of GPT-3.5, GPT-4, and human expertise in answering StatPearls questions. Cureus. 2023;15(6):e40822.

25.

Ollivier

Pareek

Dahmen

, et al. A deeper dive into ChatGPT: history, use and future perspectives for orthopaedic research. Knee Surg Sports Traumatol Arthrosc. 2023;31(4):1190-1192.

26.

Park

Kim

Cho

Kim

Choi

Clinical and radiological results of high tibial of osteotomy over the age of 65 are comparable to that of under 55 at minimum 2-year follow-up: a propensity score matched analysis. Knee Surg Relat Res. 2024;36(1):10.

27.

Reusens

Borchert

De Weerdt

Baesens

. Native Design Bias: Studying the Impact of English Nativeness on Language Model Performance. arXiv. Revised October 7, 2024. Accessed March 20, 2025 https://arxiv.org/abs/2406.17385

28.

Rossettini

Rodeghiero

Corradi

, et al. Comparative accuracy of ChatGPT-4, Microsoft Copilot and Google Gemini in the Italian entrance test for healthcare sciences degrees: a cross-sectional study. BMC Med Educ. 2024;24(1):694.

29.

Salvagno

Taccone

Gerli

AG.

Can artificial intelligence help for scientific writing?

Crit Care. 2023;27(1):75.

30.

Shah

. Rank for People Also Ask: Ultimate Infinite Questions Google PAA Handbook. Outranking. Accessed July 4, 2025. https://www.outranking.io/people-also-ask-handbook/

31.

Sullivan

Artino

Jr.

Analyzing and interpreting data from Likert-type scales. J Grad Med Educ. 2013;5(4):541-542.

32.

Tassoker

ChatGPT-4 Omni's superiority in answering multiple-choice oral radiology questions. BMC Oral Health. 2025;25(1):173.

33.

Taylor

4th Cheng

Weinblatt

Bergstein

Long

WJ.

An artificial intelligence chatbot is an accurate and useful online patient resource prior to total knee arthroplasty. J Arthroplasty. 2024;39(8)(suppl 1):S358-S362.

34.

Vaishya

Iyengar

Patralekh

, et al. Effectiveness of AI-powered chatbots in responding to orthopaedic postgraduate exam questions—an observational study. Int Orthop. 2024;48(8):1963-1969.

35.

Whiles

Bird

Canales

DiBianco

Terry

RS.

Caution! AI bot has entered the patient chat: ChatGPT has limitations in providing accurate urologic healthcare advice. Urology. 2023;180:278-284.

36.

Zhang

Liau

ZQG

Tan

KLM

Chua

WL.

Evaluating the accuracy and relevance of ChatGPT responses to frequently asked questions regarding total knee replacement. Knee Surg Relat Res. 2024;36(1):15.