Abstract
Background:
Artificial intelligence (AI) chatbots are increasingly used for medical information provision. However, systematic evaluations of their accuracy and reliability in orthopaedic surgery, particularly in total knee replacement (TKR), remain limited.
Purpose:
To systematically compare and evaluate the performance of various AI chatbots, focusing on their ability to provide accurate and reliable information related to TKR.
Study Design:
Cohort study; Level of evidence, 2.
Methods:
A total of 43 clinically relevant TKR-related frequently asked questions (FAQs) were selected based on Google search trends and expert consultation. Questions were categorized into 6 key domains: (1) general/procedure-related information, (2) indications and outcomes, (3) risks and complications, (4) pain and postoperative recovery, (5) specific activities after surgery, and (6) alternatives and variations. Each question was submitted to 5 different chatbot models (GPT-3.5, GPT-4, GPT-4 Omni, Gemini Advanced, and Gemini 1.5) for response generation. Two independent orthopaedic surgeons assessed the chatbots' responses for both accuracy and relevance using a 5-point Likert scale. Responses were anonymized, blinding evaluators to the chatbot identities to prevent bias. Accuracy differences among the chatbot models were analyzed by analysis of variance, and relevance was compared using the Kruskal-Wallis test.
Results:
GPT-3.5 (4.8 ± 0.5), GPT-4 (4.9 ± 0.4), GPT-4 Omni (4.9 ± 0.3), and Gemini 1.5 (4.8 ± 0.4) demonstrated high accuracy, whereas Gemini Advanced scored significantly lower (4.1 ± 1.4) (P < .001). However, accuracy in the general/procedure-related information, risks and complications, pain and recovery, and postoperative activities domains did not differ significantly among chatbots. Gemini Advanced underperformed in indications and outcomes (P = .04) and alternatives and variations (P = .002). Regarding relevance, all chatbots except Gemini Advanced (36/43; 83.7%) achieved a 100% relevance rate (P < .001).
Conclusion:
This study demonstrates that GPT-3.5, GPT-4, GPT-4 Omni, and Gemini 1.5 can provide highly accurate and relevant responses to TKR-related queries, while Gemini Advanced underperforms.
In recent years, the rapid development of artificial intelligence (AI) technology, particularly in the field of chatbots, has marked a significant milestone in modern digital innovation. These AI-driven chatbots, developed using large language models (LLMs) trained on extensive and diverse textual data ranging from news articles and novels to peer-reviewed journals and web content, have become increasingly prevalent across numerous sectors.5 Their ability to generate contextually relevant and linguistically coherent responses has opened new possibilities in areas such as customer service, education, and health care.5,16 AI chatbots are increasingly being studied for their potential application in orthopaedics, including rehabilitation planning, surgical decision support, and patient communication.22 Recent studies have shown that ChatGPT is a promising tool for addressing clinical questions related to procedures such as anterior cruciate ligament reconstruction and total knee replacement (TKR), particularly in terms of accuracy and clinical utility.9,23 Additionally, ChatGPT has been proposed as a virtual assistant to enhance patient education and engagement by providing tailored information about orthopaedic conditions, treatment options, and postoperative care.7
Although AI chatbots are gaining traction, concerns regarding the accuracy and reliability of the information they provide have emerged.1 Data sources, collection periods, and training methodologies vary significantly across different chatbot platforms, leading to potential discrepancies in the information they disseminate. In the health care sector, where accurate information is critical, these discrepancies could have serious implications for patient care.
Given these concerns, conducting a comprehensive comparative analysis of chatbots is necessary to assess the reliability and accuracy of the information they provide. This research is particularly crucial in the field of orthopaedics, where patients frequently seek information about procedures such as TKR. Because TKR is one of the most common and significant surgeries in orthopaedics, it is a focal point for patients searching for medical guidance online.3,4,14
Previous studies have evaluated the performance of a single AI model, such as ChatGPT-3.5 or ChatGPT-4, in answering questions related to TKR.2,19,23 This study advanced previous work by evaluating multiple chatbot models, thereby allowing a broader investigation into performance differences across LLMs. The purpose of this study was to assess the accuracy and relevance of responses provided by different versions of OpenAI's ChatGPT (GPT-3.5, GPT-4, and GPT-4 Omni) and Google's Gemini (Gemini Advanced and Gemini 1.5) to frequently asked questions (FAQs) about TKR. We hypothesized that the ChatGPT and Gemini chatbot models would perform similarly in providing accurate and relevant responses to TKR-related FAQs, with no significant differences.
Methods
Selection and Categorization of FAQs Related to Total Knee Replacement
This study did not involve human participants and therefore did not require approval from an institutional review board. The authors used ChatGPT and DeepL to correct English grammar and improve sentence clarity and academic tone. To identify FAQs related to TKR, S.H.K. created a new Google account with no prior search history and entered the search term "total knee replacement" into Google web search (Google; Alphabet Inc). Initial questions were extracted from the "People also ask" section of the main search page. These were reviewed by 2 board-certified orthopaedic surgeons (H.L. and S.Y.S.), who consolidated duplicate and semantically similar entries, resulting in a refined list of 43 unique and clinically relevant questions. To facilitate structured evaluation and domain-specific comparison, the FAQs were categorized into 6 clinically relevant domains reflecting key aspects of the TKR perioperative process: (1) general/procedure-related information, covering basic information about the procedure; (2) indications for surgery and outcomes, addressing reasons for surgery and expected outcomes; (3) risks and complications, discussing potential risks associated with TKR; (4) pain and postoperative recovery, focusing on pain management and recovery time frames; (5) specific activities after surgery, related to activities that patients can resume after surgery; and (6) alternatives and variations, including alternatives such as partial knee replacement and robotic TKR (Table 1).
Table 1. The 6 Categories and Frequently Asked Questions (FAQs)
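As a minimal illustrative sketch (not the authors' actual workflow), the categorized FAQ list can be represented as a simple R data frame for downstream analysis; the domain labels follow Table 1, and the example questions are hypothetical stand-ins rather than items from the study's FAQ list:

```r
# Illustrative sketch only: organizing FAQs by clinical domain.
# Domain labels follow Table 1; the example questions are hypothetical.
faq <- data.frame(
  id = 1:4,
  domain = c(
    "General/procedure-related information",
    "Indications and outcomes",
    "Risks and complications",
    "Pain and postoperative recovery"
  ),
  question = c(
    "What is a total knee replacement?",       # hypothetical example
    "When is TKR indicated?",                  # hypothetical example
    "What are the main risks of TKR?",         # hypothetical example
    "How long does recovery take after TKR?"   # hypothetical example
  )
)

# Count questions per domain (the study's actual counts were 7, 6, 4, 12, 7, and 7).
table(faq$domain)
```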
Evaluation of Chatbot Responses
The FAQs were input into the chatbot models (GPT-3.5, GPT-4, GPT-4 Omni, Gemini Advanced, and Gemini 1.5) for comparison, with responses collected in July 2024. To minimize potential bias from previous interactions, only initial responses were evaluated. Each question was submitted in a new chat session by selecting the “new chat” option before proceeding with the next inquiry. Chatbot models were assessed for both accuracy and relevance of their responses. Factual accuracy refers to the degree to which the chatbot’s responses align with the most up-to-date medical guidelines regarding TKR as of August 2024. Relevance indicated how useful and effective the responses were in addressing the questions. Each response was rated on a 5-point Likert scale.17 To ensure clear evaluation, a description of the level of response represented by each scale point was also provided (Tables 2 and 3). Two independent orthopaedic surgeons (H.L. and S.Y.S.) evaluated the chatbots’ responses based on predefined criteria. To minimize bias, evaluators were blinded to chatbot identities. All responses were anonymized and presented in a randomized order, ensuring that scoring was based solely on content rather than the source of the response. Each rating was guided by a structured rubric based on the descriptive criteria outlined in Tables 2 and 3, ensuring standardized and reproducible scoring. This evaluation is essential to ensure that the information provided by the AI is both scientifically sound and relevant to contemporary clinical practice. For each question, the scores from the 2 raters were averaged to obtain a final score.
Table 2. Likert Scale for Grading Factual Accuracy
Table 3. Likert Scale for Grading Relevance
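A minimal sketch of the scoring step described above, assuming a hypothetical long-format ratings table (question, model, rater, accuracy, relevance) with randomly generated placeholder scores; per-question final scores are the mean of the 2 raters' ratings:

```r
# Illustrative sketch: averaging the 2 raters' Likert scores per question.
# 'ratings' is hypothetical; the real scores came from the blinded expert review.
set.seed(1)
ratings <- expand.grid(
  question_id = 1:43,
  model = c("GPT-3.5", "GPT-4", "GPT-4 Omni", "Gemini Advanced", "Gemini 1.5"),
  rater = c("R1", "R2")
)
ratings$accuracy  <- sample(3:5, nrow(ratings), replace = TRUE)  # placeholder
ratings$relevance <- sample(3:5, nrow(ratings), replace = TRUE)  # placeholder

# Final score per question and model = mean of the 2 raters' scores.
scores <- aggregate(cbind(accuracy, relevance) ~ question_id + model,
                    data = ratings, FUN = mean)
head(scores)
```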
Statistical Analysis
For accuracy, chatbot responses were evaluated using the mean and standard deviation of Likert scores. In contrast, relevance was dichotomized: responses with a score of ≥4 were classified as relevant, while those with a score of <4 were considered irrelevant. Statistical analysis was performed using R Version 4.0.3 (R Foundation for Statistical Computing). Because the accuracy data followed a normal distribution, differences in accuracy among chatbots were assessed using analysis of variance (ANOVA) with the Tukey honest significant difference post hoc test. Because the relevance data did not meet normality assumptions, relevance was compared using the Kruskal-Wallis H test with the Dunn post hoc test. A P value of <.05 was considered statistically significant. Interrater reliability was measured using the Gwet AC2 coefficient. Additionally, the Pearson correlation coefficient (r) and Cronbach alpha (α) were used to evaluate the internal consistency of scoring patterns across questions.
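A hedged sketch of these analyses in R, continuing the hypothetical ratings and scores objects from the sketch above; the dichotomization threshold of ≥4 and the named tests follow the text, while dunn.test, irrCAC, and psych are assumed add-on packages and are left commented out:

```r
# Accuracy: one-way ANOVA across models, with Tukey HSD post hoc test.
fit <- aov(accuracy ~ model, data = scores)
summary(fit)
TukeyHSD(fit)

# Relevance: dichotomize (score >= 4 = relevant) for relevance rates,
# and compare relevance among models with the Kruskal-Wallis H test.
scores$relevant <- scores$relevance >= 4
tapply(scores$relevant, scores$model, mean)            # relevance rate per model
kruskal.test(relevance ~ model, data = scores)
# dunn.test::dunn.test(scores$relevance, scores$model)  # Dunn post hoc test

# Interrater reliability: Gwet AC2 on a questions-x-raters matrix
# (in the irrCAC package, ordinal weights yield the weighted AC2 variant).
acc_wide <- reshape(
  ratings[ratings$model == "GPT-4", c("question_id", "rater", "accuracy")],
  idvar = "question_id", timevar = "rater", direction = "wide"
)
# irrCAC::gwet.ac1.raw(acc_wide[, -1], weights = "ordinal")

# Internal consistency between the 2 raters: Pearson r and Cronbach alpha.
cor(acc_wide$accuracy.R1, acc_wide$accuracy.R2)              # Pearson r
# psych::alpha(acc_wide[, c("accuracy.R1", "accuracy.R2")])  # Cronbach alpha
```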
Results
Overall Performance
Overall accuracy was 4.81 ± 0.45 for GPT-3.5, 4.86 ± 0.35 for GPT-4, 4.91 ± 0.29 for GPT-4 Omni, 4.07 ± 1.44 for Gemini Advanced, and 4.84 ± 0.43 for Gemini 1.5 (Table 4; Supplemental Material Figure A1, available separately). Gemini Advanced achieved a significantly (P < .001) lower accuracy score than the other chatbots, highlighting a clear performance gap. The relevance rate was 83.72% (36/43) for Gemini Advanced, while the other chatbots achieved a rate of 100% (43/43); this difference was statistically significant (P < .001). Gemini Advanced avoided answering 7 questions, either by recommending consultation with a specialist or by stating that the question was beyond its capabilities (Table 5). Interrater reliability showed substantial to excellent agreement: Gwet AC2 values were 0.85 for factual accuracy and 0.94 for relevance. In addition, the Pearson correlation coefficient (0.93) and Cronbach alpha (0.96) indicated excellent internal consistency between raters.
Table 4. Accuracy of Various Chatbots According to Subcategories
Data are presented as mean ± SD. One-way analysis of variance was used to compare different models; the group causing the difference is marked in the table.
Table 5. Instances of Gemini Advanced Avoidance of Responses to Specific Questions
The model frequently recommended consulting a specialist or stated that the query exceeded its capabilities, highlighting its limitations in providing accurate and relevant answers.
General/Procedure-Related Information
The analysis included 7 FAQs. Regarding the accuracy for each chatbot, the score was 4.86 ± 0.38 for GPT-3.5, 4.71 ± 0.49 for GPT-4, 4.86 ± 0.38 for GPT-4 Omni, 4.29 ± 1.50 for Gemini Advanced, and 5.00 ± 0.00 for Gemini 1.5. These results showed no statistically significant differences among chatbots (P = .45). The relevance analysis revealed that Gemini Advanced showed relevance in 6 (85.71%) out of the 7 FAQs, while all other chatbots demonstrated relevance in all FAQs.
Indications for Surgery and Outcomes
There were 6 FAQs regarding the indications for TKR and outcomes from surgery. The accuracy for each chatbot was 5.00 ± 0.00 for GPT-3.5, 5.00 ± 0.00 for GPT-4, 5.00 ± 0.00 for GPT-4 Omni, 3.50 ± 1.97 for Gemini Advanced, and 4.67 ± 0.52 for Gemini 1.5. The difference in accuracy among chatbots was statistically significant (P = .04), with Gemini Advanced showing notably lower accuracy than the other chatbots. Gemini Advanced demonstrated relevance in 4 (66.7%) of the 6 FAQs, while all other chatbots achieved a relevance rate of 100%.
Risks and Complications
In the category of risks and complications comprising 4 FAQs, only Gemini Advanced demonstrated an accuracy of 4.75 ± 0.50, while all other chatbots achieved an accuracy of 5.00 ± 0.00 (P = .44). All chatbots demonstrated a relevance rate of 100%.
Pain and Postoperative Recovery
A total of 12 FAQs were analyzed. Regarding the accuracy for each chatbot, the score was 4.83 ± 0.39 for GPT-3.5, 4.83 ± 0.39 for GPT-4, 4.92 ± 0.29 for GPT-4 Omni, 4.50 ± 1.17 for Gemini Advanced, and 4.83 ± 0.39 for Gemini 1.5 (P = .51). Notably, Gemini Advanced demonstrated relevance in all but 1 question (11/12; 91.67%), while all other chatbots demonstrated relevance in all questions.
Specific Activities After Surgery
The analysis included a total of 7 FAQs. Regarding the accuracy for each chatbot, the score was 4.57 ± 0.79 for GPT-3.5, 4.86 ± 0.38 for GPT-4, 5.00 ± 0.00 for GPT-4 Omni, 4.00 ± 1.41 for Gemini Advanced, and 4.57 ± 0.70 for Gemini 1.5 (P = .22). Relevance rates were 100% for all chatbots except Gemini Advanced (6/7; 85.71%).
Alternatives and Variations
There were 7 FAQs regarding alternatives and variations. Regarding the accuracy of each chatbot's responses, the score was 4.71 ± 0.49 for GPT-3.5, 4.86 ± 0.38 for GPT-4, 4.71 ± 0.49 for GPT-4 Omni, 3.29 ± 1.60 for Gemini Advanced, and 5.00 ± 0.00 for Gemini 1.5. Statistically significant differences in chatbot accuracy were observed (P = .002), driven by the lower accuracy of Gemini Advanced compared with the other models. The relevance rate for Gemini Advanced was 71.43% (5/7), while the other chatbots achieved a relevance rate of 100% (7/7).
Discussion
The major finding of this study was that AI chatbots vary in their ability to provide accurate and relevant responses to TKR-related FAQs. Among the models tested, GPT-3.5, GPT-4, GPT-4 Omni, and Gemini 1.5 demonstrated consistently high performance across all evaluated categories, with mean accuracy scores of ≥4.8 and 100% relevance, while Gemini Advanced showed significantly lower accuracy (4.1 ± 1.4; P < .001) and relevance (83.72%). These findings underscore the variability in chatbot quality and the importance of selecting and validating models before clinical application.
Various studies have demonstrated that chatbots can effectively respond to medical inquiries, showcasing their potential as supportive tools in the health care sector.6,10,21 These studies have illustrated that chatbots can provide accurate and contextually appropriate answers to a wide range of medical questions, from general health advice to specialized clinical knowledge. Although previous studies have primarily assessed the general capabilities of chatbots in medicine, this study specifically evaluated their accuracy within the domain of orthopaedic surgery, where precise clinical guidance is crucial for surgical decision-making and postoperative management. In fact, as chatbots become increasingly specialized, some have achieved a level of expertise sufficient to pass professional examinations such as the United States Medical Licensing Examination (USMLE).5,11 Furthermore, their integration into health care settings has garnered positive feedback by enhancing patient engagement, simplifying access to information, and supporting decision-making processes. This growing body of evidence highlights the role of chatbots as valuable assets in complementing traditional medical practices. The results of this study further support these findings.
All chatbot models, except for Gemini Advanced, demonstrated strong overall performance, achieving a mean score of ≥4.5 in response to questions related to TKR. When analyzed by category, all chatbots provided accurate answers without statistically significant differences in the categories of general/procedure-related information, risks and complications, pain and postoperative recovery, and specific activities after surgery. In contrast, in the indications and outcomes and the alternatives and variations categories, Gemini Advanced provided less accurate responses. Notably, Gemini Advanced frequently exhibited a tendency to avoid answering specific questions, which can lead to decreased accuracy and relevance. Unlike other models, which provided informative answers along with disclaimers (eg, “Always prioritize your surgeon’s and therapist’s advice”), Gemini Advanced tended to avoid answering entirely, instead stating its limitations (eg, “I’m only a language model”). This avoidance behavior likely reflects model-specific safety filters designed to minimize the risk of misinformation. While such a cautious approach may be appropriate in high-stakes contexts, it also limits the model's utility for delivering practical guidance in clinical decision-making or patient education scenarios.
In terms of relevance, consistent with findings on accuracy, only Gemini Advanced exhibited lower performance on certain questions, whereas all other chatbots demonstrated 100% relevance across all responses. Similar to findings on accuracy, a lack of relevance was observed exclusively in 7 instances where Gemini Advanced avoided providing a response. These results suggested that, aside from these exceptions, all chatbots consistently exhibited both accuracy and relevance in their responses.
Zhang et al23 reported that ChatGPT-3.5 demonstrated an accuracy of 88% and a relevance of 100% in responses to 50 questions related to TKR. A study comparing ChatGPT-4 with arthroplasty-trained nurses regarding responses to TKR-related questions reported that both groups demonstrated an accuracy of 44/60 (73.3%).2 These findings are consistent with the results of our study. One study reported that ChatGPT can effectively respond not only to general questions about TKR but also to more specialized questions regarding periprosthetic joint infection after total hip and knee arthroplasty.8 Among the 12 questions analyzed, none was rated as “unsatisfactory”: 1 response was deemed accurate enough to require no clarification, while 7 responses required low clarification and 4 required moderate clarification.
Comparative analyses of chatbot models have generally demonstrated that newer, updated versions exhibit superior performance.12,18 However, our study found that GPT-3.5, GPT-4, and GPT-4 Omni showed similar performance levels, with minor differences that were not statistically significant. In contrast, for the Gemini models, responses generated by Gemini 1.5, a subsequent iteration of Gemini Advanced, exhibited significantly improved accuracy and overall performance. Notable differences were also observed in the information delivery methods of the 2 chatbot platforms. Specifically, Gemini can enhance the reliability and readability of the information it provides by meticulously citing sources and incorporating relevant images, thereby improving the overall user experience. In contrast, ChatGPT tends to generate lengthy textual responses and a higher volume of output. In the case of ChatGPT, the latest versions were observed to provide more detailed information; however, they occasionally included content with slightly reduced relevance. While this can increase the overall output volume, it may decrease the proportion of relevant information, potentially causing confusion for patients. Some studies have reported that AI chatbots may occasionally provide inaccurate or misleading information and, in some instances, generate plausible but false statements, a phenomenon termed “artificial hallucination” that requires careful consideration and scrutiny.1,20 However, in our study, no instance of hallucination was observed, as all chatbot responses remained consistent with established medical knowledge and guidelines. This finding suggests that the evaluated chatbot models can provide reliable information within the scope of TKR-related queries, although the absence of hallucination may be context dependent; further evaluation across broader clinical topics is warranted.
Given the rapid evolution of LLMs, the accuracy and reliability of chatbot-generated medical information may vary over time. Because this study presents a comparative evaluation of 5 AI models as of August 2024, the findings should be interpreted as a temporal snapshot rather than a generalizable conclusion. In particular, some models may undergo frequent updates in their knowledge base or response strategies, potentially altering their factual accuracy, relevance, or even clinical safety. Therefore, ongoing monitoring and reevaluation are essential to maintain up-to-date benchmarks and ensure that these tools remain reliable for patient-facing health care communication.
Strengths and Limitations
This study provides valuable insights into the accuracy and reliability of AI chatbots in delivering medical information. Findings of this study highlight both the potential utility and current challenges of chatbot-assisted patient education, emphasizing the need for continuous improvement in AI models to enhance their clinical applicability. Future investigations should focus on addressing these limitations to refine AI-based health care communication tools further.
The following limitations should be considered when interpreting this study's findings. First, chatbot responses were evaluated at a single time point (August 2024), and given the rapid evolution of AI models, their performance may change over time.15,24 Therefore, the results of this study should be interpreted as a temporal benchmark rather than a permanent evaluation, and ongoing reassessment of these models is necessary to ensure their sustained clinical relevance and safety. Second, this study evaluated chatbot responses exclusively in English, which might limit the generalizability of our findings to non-English-speaking populations.13 Third, the evaluation of the chatbot responses was conducted exclusively by medical experts; the comprehensibility and readability of the responses from the perspective of patients were not assessed directly. Finally, the evaluation focused solely on factual accuracy and relevance. Other important dimensions of chatbot performance, such as clarity, completeness, consistency, adherence to evidence-based guidelines, and the potential for harmful or misleading content, were not assessed. Although no overtly unsafe responses were observed in this study, future research could address these limitations by incorporating longitudinal assessments, multilingual evaluations, and multidimensional performance metrics, which may provide a more comprehensive understanding of chatbot performance and its clinical effect.
Conclusion
This study demonstrates that GPT-3.5, GPT-4, GPT-4 Omni, and Gemini 1.5 can provide highly accurate and relevant responses to TKR-related queries, while Gemini Advanced underperforms. These findings highlight the variability in chatbot performance and emphasize the need for ongoing assessments to ensure their reliability and maintain their value as trusted sources of medical information.
Footnotes
Final revision submitted September 25, 2025; accepted October 19, 2025.
One or more of the authors has declared the following potential conflict of interest or source of funding: This work was supported by the National Research Foundation of Korea grant funded by the Korea government (RS-2024-00344750). AOSSM checks author disclosures against the Open Payments Database (OPD). AOSSM has not conducted an independent investigation on the OPD and disclaims any liability or responsibility relating thereto.
Ethical approval was not sought for the present study.
