Abstract

Background:

With the rapid advancement of artificial intelligence in health care, large language models (LLMs) demonstrate increasing potential in medical applications. However, their performance in specialized oncology remains limited. This study evaluates the performance of multiple leading LLMs in addressing clinical inquiries related to bladder cancer (BLCA) and demonstrates how strategic optimization can overcome these limitations.

Methods:

We developed a comprehensive set of 100 clinical questions based on established guidelines. These questions encompassed epidemiology, diagnosis, treatment, prognosis, and follow-up aspects of BLCA management. Six LLMs (Claude-3.5-Sonnet, ChatGPT-4.0, Grok-beta, Gemini-1.5-Pro, Mistral-Large-2, and GPT-3.5-Turbo) were tested through three independent trials. The responses were validated against current clinical guidelines and expert consensus. We implemented a two-phase training optimization process specifically for GPT-3.5-Turbo to enhance its performance.

Results:

In the initial evaluation, Claude-3.5-Sonnet demonstrated the highest accuracy (89.33% ± 1.53%), followed by ChatGPT-4.0 (85.67% ± 1.15%). Grok-beta achieved 84.33% ± 1.53% accuracy, whereas Gemini-1.5-Pro and Mistral-Large-2 showed similar performance (82.00% ± 1.00% and 81.00% ± 1.00%, respectively). GPT-3.5-Turbo demonstrated the lowest accuracy (74.33% ± 3.06%). After the first phase of training, GPT-3.5-Turbo’s accuracy improved to 86.67% ± 1.89%. Following the second phase of optimization, the model achieved 100% accuracy on the same question set.
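
The accuracy figures above are reported as a mean and a spread across the three independent trials. The abstract does not state how the spread was computed, so the following is only a minimal sketch, assuming each trial is scored as the percentage of the 100 questions answered correctly and that the "±" value is a sample standard deviation. The function name `summarize_trials` and the per-trial counts are hypothetical and are not the study's data.

```python
import statistics


def summarize_trials(correct_counts, n_questions=100):
    """Return (mean accuracy %, sample SD %) across independent trials.

    correct_counts: number of correctly answered questions in each trial.
    Assumption: the reported spread is a sample standard deviation (n - 1).
    """
    accuracies = [100.0 * c / n_questions for c in correct_counts]
    mean_acc = statistics.mean(accuracies)
    sd_acc = statistics.stdev(accuracies)  # sample SD, not population SD
    return mean_acc, sd_acc


# Hypothetical counts for three trials of one model (illustration only).
hypothetical_counts = [88, 89, 91]
mean_acc, sd_acc = summarize_trials(hypothetical_counts)
print(f"{mean_acc:.2f}% ± {sd_acc:.2f}%")
```

With only three trials, the choice between the sample and population standard deviation changes the reported spread noticeably, so it may be worth stating which one is used whenever such figures are compared across models.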

Conclusion:

This study not only establishes the comparative performance of various LLMs in BLCA-related queries but also validates the potential for significant improvement through targeted training optimization. The successful enhancement of GPT-3.5-Turbo’s performance suggests that strategic model refinement can overcome initial limitations and achieve optimal accuracy in specialized medical applications.
