Abstract
Background
Large language models (LLMs) have demonstrated promising capabilities in medical diagnostic reasoning, yet their performance in specialized clinical domains such as rheumatology remains incompletely characterized. While diagnostic accuracy has been evaluated, critical dimensions including calibration, reasoning quality, and temporal stability have not been systematically assessed across contemporary models.
Objectives
This study aimed to comprehensively evaluate and compare the diagnostic accuracy, certainty expression, reasoning quality, and hallucination rates of four state-of-the-art LLMs (ChatGPT-4, Claude 3.5, DeepSeek-V3, and Gemini 1.5 Pro) in complex rheumatologic case scenarios.
Design
A cross-sectional, analytical, and comparative study was conducted following STARD and TRIPOD guidelines, adapted for LLM evaluation. Nine complex rheumatologic cases from published case reports were evaluated at three time points (Days 1, 5, and 10) between July 1 and September 18, 2025.
Methods
Standardized clinical vignettes were submitted to each LLM under controlled experimental conditions. Two blinded senior rheumatologists independently assessed diagnostic accuracy, reasoning quality across five analytical dimensions using Likert scales, and hallucination frequency. Certainty expression and temporal stability were quantified using intraclass correlation coefficients. Correlation analyses examined relationships between reasoning quality and confidence expression.
Results
All models achieved near-perfect diagnostic accuracy, with ChatGPT, Claude, and Gemini correctly identifying the primary diagnosis in 100% of cases and DeepSeek in 88.9%. However, Spearman correlation analysis revealed uniformly weak and non-significant associations between reasoning quality and expressed certainty across all models (ρ range: -0.156 to 0.215, all p>0.05), indicating fundamental miscalibration. ChatGPT demonstrated the highest reasoning score (3.89±0.23) and lowest hallucination rate (7.4%), while Gemini showed the highest hallucination frequency (18.5%). Temporal stability of expressed certainty was good for both ChatGPT (ICC=0.84) and DeepSeek (ICC=0.79).
Conclusion
Despite exceptional diagnostic accuracy, current LLMs exhibit critical limitations in confidence calibration and variable hallucination rates, representing significant barriers to safe clinical deployment in rheumatology.
Keywords
Introduction
Large language models (LLMs), a rapidly evolving subclass of artificial intelligence (AI), are increasingly being integrated into healthcare, encompassing applications in medical education, clinical research, and patient care.1,2 Recent investigations have shown that even general-purpose LLMs without domain-specific training can accurately propose diagnostic hypotheses for hospitalized patients using real clinical, imaging, and laboratory data. 3 Beyond diagnostic reasoning, these models have also demonstrated the ability to generate responses that are empathetic, coherent, and linguistically refined, often comparable to those produced by human clinicians. 4
Diagnostic reasoning remains one of the most intricate aspects of clinical medicine, particularly in patients with complex or overlapping disease presentations. It requires the synthesis of heterogeneous data and the ability to navigate uncertainty through structured and iterative reasoning. The recent emergence of reasoning-augmented LLMs marks a pivotal shift from pattern recognition to deliberate, stepwise analytical processes designed to emulate human clinical reasoning. 5 These models, supported by reinforcement learning and advanced algorithmic architectures, hold substantial promise for improving interpretability, reliability, and clinical applicability. 5
In rheumatology, a field characterized by heterogeneous, multisystem disorders, LLMs may play a transformative role. Their potential extends from diagnostic support and therapeutic guidance to adverse event prediction and the personalization of treatment strategies. 3 By enhancing data interpretation and supporting differential diagnosis, LLMs could improve both diagnostic accuracy and efficiency in musculoskeletal and autoimmune disease management.
Recent comparative studies have reported encouraging diagnostic performance of models such as ChatGPT-4, Claude 3.5, and Gemini in rheumatologic contexts. 3 In parallel, the introduction of DeepSeek-R1, an open-source reasoning model released in early 2025 and trained using reinforcement learning techniques, represents a novel generation of interpretable and adaptive LLMs. 6 Krusche et al. found that ChatGPT-4 listed the correct diagnosis as top diagnosis in 35% versus 39% for rheumatologists (p=0.30), and among top 3 diagnoses in 60% versus 55% (p=0.38), with notably higher performance in inflammatory rheumatic disease cases but lower accuracy in non-IRD cases. 7 Coskun et al. reported that GPT-4 achieved 100% accuracy in providing comprehensive patient information about methotrexate use compared to 86.96% for GPT-3.5, 60.87% for BARD, and 60.87% for Bing. 8 However, Nakaphan et al. highlighted important limitations, showing that while LLMs achieved high accuracy on text-based rheumatology multiple-choice questions (90-96%), their performance on image-based questions was significantly lower and more variable (16-56%), with all models showing significantly reduced odds of correct responses to image questions compared to MCQs (p<0.01). 9 This rapid evolution underscores the need for rigorous, domain-specific evaluations to determine their true clinical utility and reasoning coherence.
However, several critical gaps remain in the current literature. First, no study has comprehensively compared diagnostic accuracy across multiple major LLMs specifically for complex rheumatological cases. Second, existing studies have not systematically examined how LLMs explain their diagnostic reasoning through structured differential diagnoses with certainty estimates and supporting arguments. Third, the temporal stability of LLM performance remains unexplored, with no studies evaluating the same cases across multiple time points to assess model consistency. Fourth, there is a lack of systematic analysis on how LLMs perform diagnostically in rare rheumatological conditions, limiting our understanding of their potential clinical integration.
Therefore, this study aimed to comprehensively assess and compare the diagnostic accuracy, certainty expression, and reasoning quality of four contemporary LLMs (ChatGPT-4, Claude 3.5 Sonnet/Opus, DeepSeek-V3, and Gemini 1.5 Pro) in the context of complex rheumatologic case scenarios derived from published real-world case reports.
Materials and methods
Study design
This was a cross-sectional, analytical, and comparative study conducted in accordance with the CHART (Chatbot Assessment Reporting Tool) recommendations, adapted to the evaluation of large language models (LLMs). 10 The study was conducted between July 1 and September 18, 2025.
Clinical case selection
Clinical cases were extracted from Therapeutic Advances in Musculoskeletal Disease, a journal published by SAGE, covering the period from 2009 to 2025. To be eligible, an article had to be a case report written in English and provide a comprehensive clinical presentation, including detailed history, physical examination, laboratory and imaging findings, and a confirmed final diagnosis. Incomplete cases or those lacking essential diagnostic information were excluded. No formal sample size calculation was performed a priori for this exploratory diagnostic accuracy study.
Preparation of clinical vignettes
Each clinical case was reformulated into a standardized vignette including:
• Patient demographics (age, sex)
• Chief complaint and history of present illness
• Relevant medical, surgical, and family history
• Detailed physical examination
• Results of complementary investigations (laboratory, imaging, immunological tests)
• Clinical course when applicable
All identifiable patient data were removed in accordance with confidentiality principles.
A standardized and optimized medical prompt was designed using principles of clinical prompt engineering, as follows:
« You are a highly experienced rheumatologist. Carefully analyze the following clinical case and provide a structured response including the following elements:
Primary diagnosis
• State the most likely diagnosis.
• Provide an estimated certainty percentage for this diagnosis.
• List the supporting arguments in favor of this diagnosis.
• List the arguments against this diagnosis.
Top 3 Differential Diagnoses: List the three most probable alternative diagnoses, ranked in descending order of likelihood.
Top 5 Differential Diagnoses: List five possible diagnoses, ranked in descending order of likelihood.
Clinical case: [ ]
Please structure your response clearly, hierarchically, and concisely. »
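As an illustration only, the fixed prompt above could be combined with each vignette as sketched below. The names `PROMPT_TEMPLATE` and `build_prompt` are hypothetical and do not come from the study; the sketch simply assumes that the vignette text replaces the empty `[ ]` placeholder.

```python
# Illustrative sketch: PROMPT_TEMPLATE and build_prompt are hypothetical names,
# not part of the study's materials. The wording follows the prompt quoted above.
PROMPT_TEMPLATE = (
    "You are a highly experienced rheumatologist. Carefully analyze the "
    "following clinical case and provide a structured response including "
    "the following elements:\n"
    "Primary diagnosis\n"
    "• State the most likely diagnosis.\n"
    "• Provide an estimated certainty percentage for this diagnosis.\n"
    "• List the supporting arguments in favor of this diagnosis.\n"
    "• List the arguments against this diagnosis.\n"
    "Top 3 Differential Diagnoses: List the three most probable alternative "
    "diagnoses, ranked in descending order of likelihood.\n"
    "Top 5 Differential Diagnoses: List five possible diagnoses, ranked in "
    "descending order of likelihood.\n"
    "Clinical case: {vignette}\n"
    "Please structure your response clearly, hierarchically, and concisely."
)


def build_prompt(vignette: str) -> str:
    """Insert one de-identified vignette into the fixed prompt text."""
    text = vignette.strip()
    if not text:
        raise ValueError("vignette must not be empty")
    return PROMPT_TEMPLATE.format(vignette=text)
```

Keeping the prompt in a single constant means every model and every time point receives identical instructions, so only the vignette varies between submissions.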
Evaluated LLMs
Characteristics of the LLMs evaluated in this study.
Standardized experimental conditions
• Temperature: 0.0, to ensure consistency and coherence in the model’s responses
• Maximum tokens: Unlimited, allowing exhaustive responses
• Language: English
• Independent sessions: Each case was submitted in a new chat session to avoid contextual bias
• Reproducibility: Each case was submitted three times to each LLM at five-day intervals, and the median consensus response was retained
• Randomization: The order of case presentation was randomized for each LLM. Human evaluators were blinded to the origin of the responses (single-blind design).
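The submission plan implied by these conditions can be sketched as follows. This is a minimal illustration, not the study's procedure: the function name `build_schedule`, the fixed seed, and the choice to reuse one shuffled case order per model across all three days are assumptions.

```python
import random

# Model labels and time points as named in the text; everything else is assumed.
MODELS = ["ChatGPT-4", "Claude 3.5", "DeepSeek-V3", "Gemini 1.5 Pro"]
DAYS = [1, 5, 10]
N_CASES = 9


def build_schedule(seed: int = 0):
    """Return one submission record per (model, day, case)."""
    rng = random.Random(seed)  # seeded so the plan can be regenerated
    schedule = []
    for model in MODELS:
        order = list(range(1, N_CASES + 1))
        rng.shuffle(order)  # assumption: one random case order per model
        for day in DAYS:
            for case_id in order:
                # Each record stands for one submission in a fresh chat session.
                schedule.append({"model": model, "day": day, "case": case_id})
    return schedule


schedule = build_schedule()
print(len(schedule))  # 4 models x 3 days x 9 cases
```

With four models, three time points, and nine cases, the plan contains 108 submissions.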
Evaluation criteria and performance metrics
The primary endpoint was diagnostic accuracy, assessed as:
• Top-1: Correct main diagnosis
• Top-3: Correct diagnosis among the top three suggestions
• Top-5: Correct diagnosis among the top five suggestions
Secondary endpoints evaluated the quality of diagnostic reasoning using a five-dimension Likert scale (0–5 points each):
• Clinical information integration
• Physical examination coherence
• Paraclinical exam coherence
• Analytical reasoning and differential structuring
• Diagnostic synthesis and justification
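A top-k accuracy calculation of this kind can be sketched as below. The function names and the example data are hypothetical, and matching by exact string is a simplifying assumption; in the study, correctness was judged by the expert evaluators.

```python
def top_k_hit(ranked_diagnoses, reference, k):
    """True if the reference diagnosis appears among the first k suggestions."""
    return reference in ranked_diagnoses[:k]


def top_k_accuracy(cases, k):
    """Proportion of cases whose reference diagnosis is within the top k."""
    hits = sum(top_k_hit(c["ranked"], c["reference"], k) for c in cases)
    return hits / len(cases)


# Hypothetical toy data, for illustration only (not study data).
example_cases = [
    {"reference": "A", "ranked": ["A", "B", "C", "D", "E"]},
    {"reference": "B", "ranked": ["C", "B", "A", "D", "E"]},
    {"reference": "E", "ranked": ["A", "B", "C", "D", "E"]},
]

for k in (1, 3, 5):
    print(k, round(top_k_accuracy(example_cases, k), 3))
```

Because the top-k sets are nested, top-5 accuracy can never fall below top-3 accuracy, which in turn can never fall below top-1 accuracy.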
The sum of these dimensions yielded a total reasoning score (maximum = 25 points).
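The total score described above is a simple sum, sketched here with range checks. The dictionary keys and the function name `total_reasoning_score` are illustrative assumptions; only the five dimensions and the 0–5 range come from the text.

```python
# The five dimensions named in the text; the identifier spellings are assumed.
DIMENSIONS = (
    "clinical_information_integration",
    "physical_examination_coherence",
    "paraclinical_exam_coherence",
    "analytical_reasoning_and_differential_structuring",
    "diagnostic_synthesis_and_justification",
)


def total_reasoning_score(ratings):
    """Sum five 0-5 ratings into a total out of 25."""
    missing = [d for d in DIMENSIONS if d not in ratings]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    for d in DIMENSIONS:
        if not 0 <= ratings[d] <= 5:
            raise ValueError(f"rating out of range for {d}: {ratings[d]}")
    return sum(ratings[d] for d in DIMENSIONS)
```

Dividing the total by five gives a per-dimension mean on the same 0–5 scale.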
Additionally, medical hallucinations were assessed as the number of incorrect or fabricated statements produced by each LLM.
Evaluation process
The expert panel consisted of two senior rheumatologists, each with at least five years of clinical experience. Both independently evaluated all LLM responses in a blinded fashion. A calibration session using two external pilot cases was conducted beforehand to standardize scoring criteria.
Scoring procedure
For each clinical case, responses from the four LLMs were anonymized and randomized. Evaluations were collected using a standardized scoring grid built in Google Forms. When discrepancies greater than two points occurred in the reasoning score, a consensus discussion was held to reach agreement. When responses differed across the three submissions, the most clinically coherent response was retained through expert consensus.
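The discrepancy rule can be expressed as a short check. This sketch assumes the rule is applied to whichever reasoning score the two raters each assigned to a response; the function names and the tuple layout are hypothetical.

```python
def needs_consensus(score_a, score_b, threshold=2):
    """True when two raters differ by more than `threshold` points."""
    return abs(score_a - score_b) > threshold


def flag_for_consensus(ratings):
    """ratings: iterable of (response_id, rater_1_score, rater_2_score)."""
    return [rid for rid, a, b in ratings if needs_consensus(a, b)]


# Hypothetical scores for illustration only.
example = [("resp-1", 20, 23), ("resp-2", 18, 20), ("resp-3", 15, 15)]
print(flag_for_consensus(example))
```

A difference of exactly two points is not flagged, since the rule refers to discrepancies greater than two points.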
Statistical analysis
All statistical analyses were conducted using R software (version 4.5.1; R Foundation for Statistical Computing, Vienna, Austria). Descriptive statistics were used to summarize model performance, certainty levels, reasoning quality, and hallucination rates. Continuous variables are presented as means with standard deviations (SD) or medians with interquartile ranges (IQR), as appropriate. Categorical data are expressed as frequencies and percentages.
For diagnostic performance, proportions of correct top-1, top-3, and top-5 diagnoses were calculated for each LLM. Certainty probabilities were analyzed longitudinally across Day 1, Day 5, and Day 10, and temporal stability was assessed using mean differences (Δ) between time points. The intraclass correlation coefficient (ICC) was computed to evaluate test–retest reliability of certainty ratings over time, both overall and per model, and interpreted according to conventional thresholds (poor <0.5, fair 0.5–0.75, good 0.75–0.9, excellent >0.9).
Reasoning quality was assessed based on evaluator ratings across five analytical dimensions. Mean reasoning scores and standard deviations were computed for each LLM, and inter-rater reliability was quantified using ICC analysis. Agreement among raters was further evaluated by calculating the perfect agreement rate (%) and mean absolute difference across evaluations. Hallucination frequency was expressed as the percentage and absolute count (n) of cases per model.
Within-model correlations between reasoning quality and certainty expression were examined using Spearman’s rank correlation coefficients (ρ), while inter-model relationships were assessed using Pearson’s correlation coefficient (r), both reported with 95% confidence intervals and p-values. Correlation matrices were generated to visualize pairwise associations between diagnostic performance, certainty, and reasoning quality across all LLMs.
Correlation strength was interpreted as negligible (<0.3), low (0.3–0.5), moderate (0.5–0.7), or strong (>0.7). Statistical significance was defined as a two-sided p-value < 0.05.
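For readers unfamiliar with the ICC, the following sketch computes one common form and maps a value to the thresholds stated above. The text does not specify which ICC form was used; ICC(2,1) (two-way random effects, absolute agreement, single measurement) is an assumption here, as are the function names and the handling of values that fall exactly on a threshold. The analyses themselves were run in R, so this Python code is purely illustrative.

```python
def icc_2_1(data):
    """ICC(2,1) for a table with one row per subject and one column per
    rater or time point. Assumed form; the source does not name one."""
    n, k = len(data), len(data[0])
    grand = sum(sum(row) for row in data) / (n * k)
    row_means = [sum(row) / k for row in data]
    col_means = [sum(row[j] for row in data) / n for j in range(k)]
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in data for x in row)
    ss_error = ss_total - ss_rows - ss_cols
    msr = ss_rows / (n - 1)               # between-subject mean square
    msc = ss_cols / (k - 1)               # between-rater mean square
    mse = ss_error / ((n - 1) * (k - 1))  # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def interpret_icc(value):
    """Thresholds from the text; boundary assignment is an assumption."""
    if value < 0.5:
        return "poor"
    if value < 0.75:
        return "fair"
    if value <= 0.9:
        return "good"
    return "excellent"
```

For test-retest reliability, the columns would hold the certainty values at Day 1, Day 5, and Day 10; for inter-rater reliability, they would hold the two evaluators' scores.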
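The rank correlation and the strength labels above can be illustrated with a dependency-free sketch. The function names are hypothetical, tie handling by average ranks is the standard convention, and applying the thresholds to the absolute value of the coefficient (with boundary values assigned as shown) is an assumption. Confidence intervals and p-values are not computed here.

```python
import math


def average_ranks(values):
    """Ranks starting at 1, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return sxy / math.sqrt(sxx * syy)


def spearman(x, y):
    """Spearman's rho: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def correlation_strength(coefficient):
    """Labels from the text, applied to the absolute value (assumption)."""
    size = abs(coefficient)
    if size < 0.3:
        return "negligible"
    if size < 0.5:
        return "low"
    if size <= 0.7:
        return "moderate"
    return "strong"
```

Rank-based correlation suits certainty percentages that cluster near the top of the scale, because it depends only on ordering and not on the spacing between values.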
Ethical considerations
This study did not involve real patient data; only published and anonymized clinical cases from the scientific literature were used, in compliance with ethical and confidentiality standards. Written informed consent was obtained from both participating senior rheumatologists in accordance with institutional ethical standards and data protection regulations.
Results
Case description
The characteristics of the included studies.
Global diagnostic performance
The primary responses of each LLM at day 1, day 5, and day 10.
Certainty probability
At Day 1, Claude exhibited the highest mean certainty at 96.4% (SD=2.3%), while DeepSeek demonstrated the lowest at 92.2% (SD=3.6%). By Day 5, Gemini achieved the highest certainty at 96.0% (SD=2.3%), whereas DeepSeek remained the least confident at 93.3% (SD=2.0%). At Day 10, Claude reached the highest certainty across all time points at 97.8% (SD=2.0%), with DeepSeek again showing the lowest confidence at 92.2% (SD=4.4%). All four models demonstrated temporal stability, with minimal changes in mean certainty between Day 1 and Day 10: ChatGPT (Δ=-0.1%), Claude (Δ=+1.3%), DeepSeek (Δ=0.0%), and Gemini (Δ=-0.2%). The intraclass correlation coefficient (ICC) analysis indicated fair overall test-retest reliability across time points (ICC=0.73, 95% CI [0.58–0.84]). Individual model reliability rankings revealed that ChatGPT demonstrated good reliability (ICC=0.84), followed by DeepSeek (ICC=0.79), while both Gemini (ICC=0.58) and Claude (ICC=0.56) showed fair reliability. Figure 1 illustrates the violin plots with individual data points depicting the distribution of certainty percentages for each LLM across Day 1, Day 5, and Day 10.
Figure 1. Violin and jitter plot of total scores by model, day, and LLM.
Quality of reasoning
Evaluation of reasoning quality across the five analytical dimensions revealed consistent yet model-dependent variations. ChatGPT achieved the highest overall mean score (3.89 ± 0.23), followed by Claude (3.82 ± 0.16), DeepSeek (3.54 ± 0.24), and Gemini (3.52 ± 0.08). Figure 2 illustrates the line plot of mean evaluator scores across these dimensions, showing the progression and variability for each LLM. Inter-rater agreement, assessed using the intraclass correlation coefficient (ICC), demonstrated poor concordance with an overall ICC of 0.36 (95% CI [0.27–0.42]), a perfect agreement rate of 59.8%, and a mean absolute difference of 0.532 across the three days. Hallucination rates varied across models, with ChatGPT exhibiting the lowest frequency at 7.4% (2/27), followed by Claude at 11.1% (3/27), DeepSeek at 14.8% (4/27), and Gemini showing the highest rate at 18.5% (5/27).
Figure 2. Line plot of the evolution of clinical dimensions across time points by LLM. Five reasoning dimensions: analytical reasoning and differential structuring, clinical information integration, diagnostic synthesis and justification, paraclinical exam coherence, and physical examination coherence.
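The hallucination percentages reported in this section follow directly from the counts, as this short check illustrates. The counts and the denominator of 27 responses per model (nine cases at three time points) are taken from the text; the variable and function names are illustrative.

```python
# Counts as reported in the text: responses with at least one hallucination.
HALLUCINATION_COUNTS = {"ChatGPT": 2, "Claude": 3, "DeepSeek": 4, "Gemini": 5}
N_RESPONSES = 27  # 9 cases x 3 time points per model


def rate_percent(count, total):
    """Percentage rounded to one decimal place."""
    return round(100 * count / total, 1)


rates = {
    model: rate_percent(count, N_RESPONSES)
    for model, count in HALLUCINATION_COUNTS.items()
}
print(rates)
```

With only 27 responses per model, a single additional hallucinated response shifts a model's rate by about 3.7 percentage points, so differences between adjacent models rest on one response each.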
Correlation between clinical reasoning quality and certainty expression
Global clinical scores and certainty by LLM.

Corrplot correlation matrix heatmaps: (a) circular; (b) hierarchical clustering.

Individual scatter plots with regression lines for each LLM showing the relationship between global clinical scores and certainty percentages.
Inter-Model correlation matrix between clinical scores and certainty.
Spearman’s rank correlation coefficients. * p < 0.05, ** p < 0.01. Diagonal = 1.000 (self-correlation). n = 27 observations per model.
Discussion
This study provides a comprehensive multidimensional evaluation of four state-of-the-art large language models in the context of complex rheumatologic diagnoses, revealing both the remarkable capabilities and critical limitations of current AI systems in specialized clinical reasoning. Our findings demonstrate near-perfect diagnostic accuracy across all models while simultaneously exposing fundamental challenges in internal calibration, reasoning quality assessment, and the relationship between expressed confidence and actual clinical performance.
Diagnostic performance: Beyond surface-level accuracy
The diagnostic accuracy observed in our study (100% for ChatGPT-4, Claude 3.5, and Gemini 1.5 Pro, and 88.9% for DeepSeek-V3 on top-1 predictions) substantially exceeds previously reported benchmarks for general medical reasoning tasks. These results align with recent investigations showing that advanced LLMs can achieve expert-level diagnostic performance on complex clinical cases.4,20 However, this apparent excellence warrants careful interpretation. The near-ceiling performance across all models, including the open-source DeepSeek-V3, suggests that current reasoning-augmented LLMs have surpassed a critical threshold in pattern recognition and knowledge retrieval for well-documented clinical presentations.1,6,21 Yet, this success may reflect the models’ ability to leverage extensive training data from medical literature rather than genuine clinical reasoning capacity. The cases used in our study, while complex, were derived from published case reports, a genre characterized by clear diagnostic narratives and complete information sets. This inherent publication bias may have inadvertently favored models trained on similar structured medical texts, raising questions about generalizability to real-world clinical scenarios characterized by ambiguity, incomplete data, and diagnostic uncertainty.
Notably, DeepSeek-V3’s single diagnostic error occurred in a case requiring integration of subtle temporal patterns and multi-system involvement, suggesting that open-source reasoning models, despite impressive overall performance, may still lag behind proprietary systems in handling diagnostic complexity. This finding is particularly significant given DeepSeek’s recent emergence as a competitive alternative to commercial LLMs, highlighting that architectural advances in reinforcement learning-based reasoning have not entirely eliminated performance gaps in specialized medical domains.22,23
The calibration crisis: Confidence without correspondence
Perhaps the most clinically concerning finding of our study is the complete absence of significant correlation between reasoning quality and expressed certainty across all four models. This fundamental miscalibration, in which confidence scores fail to predict diagnostic accuracy or reasoning coherence, represents a critical barrier to safe clinical deployment. ChatGPT exhibited a weak negative correlation (ρ = -0.083), while DeepSeek showed the strongest, albeit non-significant, positive trend (ρ = 0.215). These findings contrast sharply with human physician behavior, where calibration between confidence and correctness, though imperfect, typically shows moderate to strong positive associations.24–26
The temporal stability of certainty expressions, as evidenced by good ICCs for ChatGPT (0.84) and DeepSeek (0.79), paradoxically compounds this problem. Models consistently express similar confidence levels over time regardless of whether this confidence accurately reflects diagnostic accuracy or reasoning quality. This consistency without validity suggests that certainty percentages generated by current LLMs may represent learned linguistic patterns rather than genuine epistemic uncertainty quantification. 25 The weak inter-model correlations in certainty measures further indicate that confidence expression strategies are model-specific artifacts rather than meaningful assessments of diagnostic reliability.
This calibration failure has profound implications for human-AI collaborative decision-making. Clinicians relying on AI-generated confidence scores may be systematically misled, potentially over-trusting incorrect diagnoses presented with high certainty or under-valuing correct diagnoses expressed with lower confidence. 27 The unexpected moderate negative correlation between Claude’s certainty and Gemini’s clinical scores (r = -0.528, p < 0.01) suggests that some models may even exhibit inverse relationships between confidence and performance, a phenomenon that could amplify clinical risks. These findings underscore the urgent need for developing calibration-aware training objectives and post-hoc uncertainty quantification methods specifically designed for medical AI systems. 27
Reasoning quality: Uniformity masking variability
While ChatGPT achieved the highest overall reasoning score (3.89 ± 0.23), followed closely by Claude (3.82 ± 0.16), the modest inter-rater ICC of 0.36 reveals substantial evaluator disagreement regarding reasoning quality assessment. This poor concordance, despite our rigorous calibration protocol, highlights an inherent challenge in quantifying the multidimensional construct of clinical reasoning. The five analytical dimensions we evaluated (clinical information integration, physical examination coherence, paraclinical exam interpretation, analytical reasoning, and diagnostic synthesis), while theoretically distinct, may exhibit complex interdependencies that resist linear decomposition.
The narrow score range across models (3.52 to 3.89 on a 5-point scale) suggests that current LLMs have converged on a “good enough” level of reasoning quality that satisfies basic coherence requirements but may lack the exceptional analytical depth characteristic of expert human clinicians. This plateau effect may reflect limitations in current evaluation methodologies rather than true performance ceilings. Traditional Likert-scale assessments, even when structured around specific dimensions, may fail to capture subtle yet clinically significant differences in reasoning sophistication, such as the ability to recognize diagnostic uncertainty, weigh conflicting evidence, or integrate rare but critical clinical features.28,29
The moderate to strong positive correlations between reasoning scores across different models indicate that evaluators perceived consistent patterns of reasoning quality independent of model identity. This consistency suggests that our assessment framework captured genuine, reproducible aspects of clinical reasoning. However, it also raises the possibility that all models, despite architectural differences, employ similar reasoning strategies, potentially reflecting shared training data sources or convergent optimization toward common linguistic patterns in medical literature.
Hallucinations
The observed hallucination rates, ranging from 7.4% with ChatGPT to 18.5% with Gemini, while apparently modest, represent a critical safety concern that may be underestimated by aggregate statistics. In our complex rheumatologic cases, hallucinations primarily manifested as fabricated laboratory values, non-existent imaging findings, or spurious treatment recommendations. Crucially, these errors were embedded within otherwise coherent and sophisticated clinical narratives, making them difficult to detect without expert scrutiny and complete access to source medical records.
The absence of a consistent relationship between hallucination rates and overall diagnostic accuracy across models is noteworthy. Gemini, despite achieving 100% top-1 accuracy, exhibited the highest hallucination rate, while ChatGPT demonstrated both optimal accuracy and minimal hallucinations. This dissociation indicates that factual fabrication and diagnostic reasoning failures are partially independent phenomena, potentially governed by distinct mechanisms within LLM architectures. Hallucinations may arise from the models’ tendency to maintain narrative coherence and fulfill implicit expectations of completeness in clinical documentation, even when specific data points are absent or ambiguous in the source prompt. 30
From a clinical safety perspective, the presence of any hallucinations in high-stakes diagnostic scenarios is unacceptable. A single fabricated critical test result could lead to catastrophic clinical decisions, regardless of the correctness of the final diagnostic label. The temporal stability of hallucination patterns across repeated evaluations suggests these errors are not random but may reflect systematic blind spots or knowledge gaps in model training. This consistency paradoxically makes hallucinations more dangerous, as clinicians may incorrectly develop trust in specific models based on initial experiences, failing to maintain appropriate vigilance over time.
Methodological innovations and limitations
Our study design introduced several methodological advances over prior LLM evaluations in medicine. The multi-timepoint assessment with blinded randomization enhanced reliability and minimized confounding from evaluator learning effects or case-order biases. The use of real-world, published case reports from a peer-reviewed rheumatology journal ensured clinical authenticity while maintaining ethical compliance. The dual-evaluator design with standardized scoring rubrics across five reasoning dimensions provided granular assessment beyond binary accuracy metrics.
However, important limitations warrant acknowledgment. First, no formal sample size calculation was performed, and only nine cases were selected based on pragmatic availability rather than statistical power considerations. This modest sample size, while sufficient for an exploratory comparative study generating 108 diagnostic evaluations across four models and three time points, limits the generalizability of our findings and the statistical power to detect meaningful differences between LLMs. The high diagnostic accuracy observed may not extend to unpublished cases, atypical presentations, or scenarios requiring longitudinal data integration. Second, all cases were derived from successful diagnostic narratives published in the literature, introducing potential selection bias toward “solvable” presentations and potentially inflating performance estimates. Third, our evaluation focused exclusively on diagnostic reasoning, excluding therapeutic decision-making, prognostication, and patient communication, domains where LLM capabilities may differ substantially. Fourth, the five-day interval between assessments, while designed to simulate realistic repeated consultation scenarios, may not fully capture the models’ stability over longer timeframes or across major version updates. Fifth, the case reports were drawn from an open-access journal and may have been included in the LLMs’ training data, potentially inflating diagnostic accuracy. This could explain why models achieved high diagnostic performance while showing less robust reasoning, as published case reports typically emphasize final diagnoses rather than explicit reasoning processes. Validation using unpublished cases would better assess true diagnostic capabilities independent of potential training exposure.
Implications for clinical integration
Our findings yield critical insights for the responsible integration of LLMs into rheumatologic practice. First, diagnostic accuracy alone is insufficient for clinical deployment: calibration, reasoning transparency, and hallucination mitigation must be prioritized as co-equal performance metrics. Second, LLM-generated confidence scores should not be interpreted as reliable uncertainty quantification without external validation. Third, human oversight remains essential, with particular attention to detecting plausible but fabricated clinical details embedded within coherent narratives. Fourth, model selection should consider not only accuracy but also hallucination rates, reasoning quality, and temporal stability.
Cross-specialty comparisons reveal important performance variations, with Hirosawa et al. demonstrating that ChatGPT-4 achieved 54.6% top-1 accuracy across 392 internal medicine cases, substantially lower than the near-perfect rheumatologic performance observed in our study. 31 While this disparity could suggest that rheumatology’s well-established classification criteria and distinctive clinical phenotypes facilitate more accurate LLM diagnosis, it may also indicate training data contamination from open-access case reports, as Hirosawa et al. found no significant accuracy variation based on publication date or access status. 20 Future investigations must employ unpublished or recently published cases across multiple specialties to determine whether rheumatology genuinely represents an optimal domain for LLM application or whether current benchmarks artificially inflate performance metrics.
Future research must also address the calibration crisis through novel training approaches incorporating proper scoring rules, conformal prediction frameworks, or Bayesian uncertainty estimation. Developing domain-specific evaluation benchmarks that reflect the true complexity, ambiguity, and information gaps characteristic of real-world rheumatologic practice is essential. Investigating the mechanisms underlying hallucinations and developing reliable detection algorithms represents an urgent safety priority. Finally, longitudinal studies examining how LLM performance evolves with model updates, user expertise, and iterative human-AI interaction patterns will be crucial for establishing sustainable clinical integration strategies.
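As one concrete instance of a proper scoring rule, the Brier score penalizes the squared gap between a stated probability and the observed outcome. The sketch below is illustrative only: it was not part of this study's analysis, the function name is hypothetical, and the numbers are invented to show the calculation.

```python
def brier_score(confidences, outcomes):
    """Mean squared difference between stated probability (0-1) and outcome
    (1 = diagnosis correct, 0 = incorrect). Lower is better; 0 is perfect."""
    if len(confidences) != len(outcomes) or not confidences:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - o) ** 2 for p, o in zip(confidences, outcomes)) / len(outcomes)


# Hypothetical example: the same 95% certainty stated for three answers,
# one of which turns out to be wrong.
print(brier_score([0.95, 0.95, 0.95], [1, 1, 0]))
```

Because a confident wrong answer contributes far more to the score than a confident correct answer saves, uniformly high certainty is only rewarded when accuracy is correspondingly high.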
Conclusion
This comprehensive evaluation reveals that contemporary LLMs have achieved remarkable diagnostic accuracy in complex rheumatologic reasoning tasks, yet fundamental limitations in calibration, hallucination control, and reasoning transparency persist. The dissociation between confidence expression and actual performance represents the most significant barrier to safe clinical deployment. As these powerful tools continue to evolve, the rheumatology community must insist on rigorous, multidimensional evaluation standards that prioritize not only what AI systems can do, but how reliably and transparently they perform in service of patient care.
Supplemental material
Supplemental material for Assessing the clinical reasoning of large language models on complex rheumatology cases: A multidimensional evaluation of four artificial intelligence by Yannick Laurent Tchenadoyo Bayala, Fulgence Kaboré, Charles Sougué, Aboubakar Ouedraogo, Aboubakar Ouedraogo, Yamyellé Enselme Zongo, Wendlassida Joelle Stéphanie Zabsonré/Tiendrebeogo and Dieu-Donné Ouedraogo in Health Informatics Journal.
Footnotes
ORCID iDs
Ethical considerations
This study did not involve real patient data; only published and anonymized clinical cases from the scientific literature were used, in compliance with ethical and confidentiality standards.
Consent to participate
Written informed consent was obtained from both participating senior rheumatologists in accordance with institutional ethical standards and data protection regulations.
Author contributions
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Data Availability Statement
The datasets used and/or analyzed in this study are available from the corresponding author upon reasonable request.
Supplemental material
Supplemental material for this article is available online.
References
