Assessing the clinical reasoning of large language models on complex rheumatology cases: A multidimensional evaluation of four artificial intelligence

Abstract

Background

Large language models (LLMs) have demonstrated promising capabilities in medical diagnostic reasoning, yet their performance in specialized clinical domains such as rheumatology remains incompletely characterized. While diagnostic accuracy has been evaluated, critical dimensions including calibration, reasoning quality, and temporal stability have not been systematically assessed across contemporary models.

Objectives

This study aimed to comprehensively evaluate and compare the diagnostic accuracy, certainty expression, reasoning quality, and hallucination rates of four state-of-the-art LLMs (ChatGPT-4, Claude 3.5, DeepSeek-V3, and Gemini 1.5 Pro) in complex rheumatologic case scenarios.

Design

A cross-sectional, analytical, and comparative study was conducted following STARD and TRIPOD guidelines, adapted for LLM evaluation. Nine complex rheumatologic cases from published case reports were evaluated at three time points (Days 1, 5, and 10) between July 1 and September 18, 2025.

Methods

Standardized clinical vignettes were submitted to each LLM under controlled experimental conditions. Two blinded senior rheumatologists independently assessed diagnostic accuracy, reasoning quality across five analytical dimensions using Likert scales, and hallucination frequency. Certainty expression and temporal stability were quantified using intraclass correlation coefficients. Correlation analyses examined relationships between reasoning quality and confidence expression.

Results

All models achieved near-perfect diagnostic accuracy, with ChatGPT, Claude, and Gemini correctly identifying the primary diagnosis in 100% of cases and DeepSeek in 88.9%. However, Spearman correlation analysis revealed uniformly weak and non-significant associations between reasoning quality and expressed certainty across all models (ρ range: -0.156 to 0.215, all p>0.05), indicating fundamental miscalibration. ChatGPT demonstrated the highest reasoning score (3.89±0.23) and lowest hallucination rate (7.4%), while Gemini showed the highest hallucination frequency (18.5%). Temporal stability of expressed certainty was good for ChatGPT (ICC=0.84) and DeepSeek (ICC=0.79).

Conclusion

Despite exceptional diagnostic accuracy, current LLMs exhibit critical limitations in confidence calibration and variable hallucination rates, representing significant barriers to safe clinical deployment in rheumatology.

Keywords

large language models, artificial intelligence, diagnostic reasoning, rheumatology, clinical decision support, calibration

Introduction

Large language models (LLMs), a rapidly evolving subclass of artificial intelligence (AI), are increasingly being integrated into healthcare, encompassing applications in medical education, clinical research, and patient care.^1,2 Recent investigations have shown that even general-purpose LLMs without domain-specific training can accurately propose diagnostic hypotheses for hospitalized patients using real clinical, imaging, and laboratory data.³ Beyond diagnostic reasoning, these models have also demonstrated the ability to generate responses that are empathetic, coherent, and linguistically refined, often comparable to those produced by human clinicians.⁴

Diagnostic reasoning remains one of the most intricate aspects of clinical medicine, particularly in patients with complex or overlapping disease presentations. It requires the synthesis of heterogeneous data and the ability to navigate uncertainty through structured and iterative reasoning. The recent emergence of reasoning-augmented LLMs marks a pivotal shift from pattern recognition to deliberate, stepwise analytical processes designed to emulate human clinical reasoning.⁵ These models, supported by reinforcement learning and advanced algorithmic architectures, hold substantial promise for improving interpretability, reliability, and clinical applicability.⁵

In rheumatology, a field characterized by heterogeneous, multisystem disorders, LLMs may play a transformative role. Their potential extends from diagnostic support and therapeutic guidance to adverse event prediction and the personalization of treatment strategies.³ By enhancing data interpretation and supporting differential diagnosis, LLMs could improve both diagnostic accuracy and efficiency in musculoskeletal and autoimmune disease management.

Recent comparative studies have reported encouraging diagnostic performance of models such as ChatGPT-4, Claude 3.5, and Gemini in rheumatologic contexts.³ In parallel, the introduction of DeepSeek-R1, an open-source reasoning model released in early 2025 and trained using reinforcement learning techniques, represents a novel generation of interpretable and adaptive LLMs.⁶ Krusche et al. found that ChatGPT-4 listed the correct diagnosis as top diagnosis in 35% versus 39% for rheumatologists (p=0.30), and among top 3 diagnoses in 60% versus 55% (p=0.38), with notably higher performance in inflammatory rheumatic disease cases but lower accuracy in non-IRD cases.⁷ Coskun et al. reported that GPT-4 achieved 100% accuracy in providing comprehensive patient information about methotrexate use compared to 86.96% for GPT-3.5, 60.87% for BARD, and 60.87% for Bing.⁸ However, Nakaphan et al. highlighted important limitations, showing that while LLMs achieved high accuracy on text-based rheumatology multiple-choice questions (90-96%), their performance on image-based questions was significantly lower and more variable (16-56%), with all models showing significantly reduced odds of correct responses to image questions compared to MCQs (p<0.01).⁹ This rapid evolution underscores the need for rigorous, domain-specific evaluations to determine their true clinical utility and reasoning coherence.

However, several critical gaps remain in the current literature. First, no study has comprehensively compared diagnostic accuracy across multiple major LLMs specifically for complex rheumatological cases. Second, existing studies have not systematically examined how LLMs explain their diagnostic reasoning through structured differential diagnoses with certainty estimates and supporting arguments. Third, the temporal stability of LLM performance remains unexplored, with no studies evaluating the same cases across multiple time points to assess model consistency. Fourth, there is a lack of systematic analysis on how LLMs perform diagnostically in rare rheumatological conditions, limiting our understanding of their potential clinical integration.

Therefore, this study aimed to comprehensively assess and compare the diagnostic accuracy, certainty expression, and reasoning quality of four contemporary LLMs (ChatGPT-4, Claude 3.5 Sonnet/Opus, DeepSeek-V3, and Gemini 1.5 Pro) in the context of complex rheumatologic case scenarios derived from published real-world case reports.

Materials and methods

Study design

This was a cross-sectional, analytical, and comparative study conducted in accordance with the CHART (Chatbot Assessment Reporting Tool) recommendations, adapted to the evaluation of large language models (LLMs).¹⁰ The study was conducted between July 1 and September 18, 2025.

Clinical case selection

Clinical cases were extracted from Therapeutic Advances in Musculoskeletal Disease, a journal published by SAGE, covering the period from 2009 to 2025. To be eligible, an article had to be a case report written in English and provide a comprehensive clinical presentation, including detailed history, physical examination, laboratory and imaging findings, and a confirmed final diagnosis. Incomplete cases or those lacking essential diagnostic information were excluded. No formal sample size calculation was performed a priori for this exploratory diagnostic accuracy study.

Preparation of clinical vignettes

Each clinical case was reformulated into a standardized vignette including:

• Patient demographics (age, sex)

• Chief complaint and history of present illness

• Relevant medical, surgical, and family history

• Detailed physical examination

• Results of complementary investigations (laboratory, imaging, immunological tests)

• Clinical course when applicable

All identifiable patient data were removed in accordance with confidentiality principles.

A standardized and optimized medical prompt was designed using principles of clinical prompt engineering, as follows:

« You are a highly experienced rheumatologist. Carefully analyze the following clinical case and provide a structured response including the following elements:

Primary diagnosis

• State the most likely diagnosis.

• Provide an estimated certainty percentage for this diagnosis.

• List the supporting arguments in favor of this diagnosis.

• List the arguments against this diagnosis.

Top 3 Differential Diagnoses: List the three most probable alternative diagnoses, ranked in descending order of likelihood.

Top 5 Differential Diagnoses: List five possible diagnoses, ranked in descending order of likelihood.

Clinical case: [ ]

Please structure your response clearly, hierarchically, and concisely. »
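For illustration, the prompt above can be assembled programmatically so that every vignette is wrapped in identical wording. The following is a minimal sketch, not the authors' tooling: the names `PROMPT_TEMPLATE`, `build_prompt`, and `parse_certainty` are hypothetical, and the regular expression assumes the model states its certainty as a number followed by a percent sign.

```python
import re
from typing import Optional

# Wording copied from the study prompt; {vignette} marks the case insertion point.
PROMPT_TEMPLATE = """You are a highly experienced rheumatologist. Carefully analyze the following clinical case and provide a structured response including the following elements:

Primary diagnosis
• State the most likely diagnosis.
• Provide an estimated certainty percentage for this diagnosis.
• List the supporting arguments in favor of this diagnosis.
• List the arguments against this diagnosis.

Top 3 Differential Diagnoses: List the three most probable alternative diagnoses, ranked in descending order of likelihood.

Top 5 Differential Diagnoses: List five possible diagnoses, ranked in descending order of likelihood.

Clinical case: {vignette}

Please structure your response clearly, hierarchically, and concisely."""


def build_prompt(vignette: str) -> str:
    """Insert one standardized vignette into the fixed prompt (hypothetical helper)."""
    return PROMPT_TEMPLATE.format(vignette=vignette.strip())


def parse_certainty(response: str) -> Optional[float]:
    """Return the first percentage found in a response, or None (assumed output format)."""
    match = re.search(r"(\d{1,3}(?:\.\d+)?)\s*%", response)
    return float(match.group(1)) if match else None


if __name__ == "__main__":
    prompt = build_prompt("  A 34-year-old woman presents with fever and arthralgia.  ")
    print(prompt.splitlines()[-3])
    print(parse_certainty("Estimated certainty: 95%"))
```

Keeping the template in a single constant makes it easy to confirm that only the vignette differs between submissions.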

Evaluated LLMs

Four LLMs were compared in this study. These models represented the most advanced and accessible systems at the time of the analysis, each recognized in recent literature for its medical reasoning capabilities (Table 1).

Table 1.

Characteristics of the LLMs evaluated in this study.

Model	Version	Developer	Day 1	Day 5	Day 10
ChatGPT	GPT-4	OpenAI	05/07/25	10/07/25	15/07/25
Claude	Claude 3.5 Sonnet/Opus	Anthropic	05/07/25	10/07/25	15/07/25
DeepSeek	DeepSeek-V3	DeepSeek	05/07/25	10/07/25	15/07/25
Gemini	Gemini 1.5 Pro	Google	05/07/25	10/07/25	15/07/25

Standardized experimental conditions

• Temperature: 0.0 to ensure consistency and coherence in the model’s responses.

• Maximum tokens: Unlimited, allowing exhaustive responses

• Language: English

• Independent sessions: Each case was submitted in a new chat session to avoid contextual bias

• Reproducibility: Each case was submitted three times to each LLM at five-day intervals, and the median consensus response was retained

• Randomization: The order of case presentation was randomized for each LLM. Human evaluators were blinded to the origin of the responses (single-blind design).

Evaluation criteria and performance metrics

The primary endpoint was diagnostic accuracy, assessed as:

• Top-1: Correct main diagnosis

• Top-3: Correct diagnosis among the top three suggestions

• Top-5: Correct diagnosis among the top five suggestions
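As a sketch of how the three accuracy endpoints relate to one another, the snippet below computes top-k accuracy from ranked diagnosis lists. It is illustrative only: in this study correctness was judged by the expert evaluators, whereas the sketch assumes a simple case-insensitive string match, and the function names and example diagnoses are hypothetical.

```python
from typing import List, Sequence, Tuple


def top_k_hit(ranked: Sequence[str], reference: str, k: int) -> bool:
    """True if the reference diagnosis appears among the first k ranked diagnoses.

    Assumption: exact (case-insensitive) matching stands in for expert judgment.
    """
    ref = reference.strip().lower()
    return any(d.strip().lower() == ref for d in ranked[:k])


def top_k_accuracy(cases: List[Tuple[Sequence[str], str]], k: int) -> float:
    """Proportion of cases whose reference diagnosis is within the top k."""
    hits = sum(top_k_hit(ranked, ref, k) for ranked, ref in cases)
    return hits / len(cases)


if __name__ == "__main__":
    # Hypothetical data: nine cases, eight correct at rank 1 and one at rank 2.
    cases = [(["giant cell arteritis", "polymyalgia rheumatica"], "giant cell arteritis")] * 8
    cases.append((["sarcoidosis", "Blau syndrome", "juvenile arthritis"], "Blau syndrome"))
    for k in (1, 3, 5):
        print(f"top-{k}: {100 * top_k_accuracy(cases, k):.1f}%")
```

With nine cases, a single top-1 miss corresponds to 8/9, or 88.9%, while the same case still counts as correct for top-3 and top-5.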

Secondary endpoints evaluated the quality of diagnostic reasoning using a five-dimension Likert-scale (0–5 points each):

• Clinical information integration

• Physical examination coherence

• Paraclinical exam coherence

• Analytical reasoning and differential structuring

• Diagnostic synthesis and justification

The sum of these dimensions yielded a total reasoning score (maximum = 25 points).

Additionally, medical hallucinations were assessed as the number of incorrect or fabricated statements produced by each LLM.

Evaluation process

The expert panel consisted of two senior rheumatologists, each with at least five years of clinical experience. Both independently evaluated all LLM responses in a blinded fashion. A calibration session using two external pilot cases was conducted beforehand to standardize scoring criteria.

Scoring procedure

For each clinical case, responses from the four LLMs were anonymized and randomized. Evaluations were collected using a standardized scoring grid built in Google Forms. When discrepancies greater than two points occurred in the reasoning score, a consensus discussion was held to reach agreement. When responses differed across the three submissions, the most clinically coherent response was retained through expert consensus.
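The scoring rule described above can be sketched as follows. This is a hedged illustration rather than the study's actual scoring form: the dimension keys and function names are hypothetical, and the sketch assumes that the "greater than two points" rule applies to the difference between the two raters' total reasoning scores.

```python
from typing import Dict

# Hypothetical keys for the five reasoning dimensions (each rated 0-5).
DIMENSIONS = (
    "clinical_information_integration",
    "physical_examination_coherence",
    "paraclinical_exam_coherence",
    "analytical_reasoning_and_differential_structuring",
    "diagnostic_synthesis_and_justification",
)


def total_reasoning_score(ratings: Dict[str, int]) -> int:
    """Sum the five dimension ratings (maximum 25) after validating the 0-5 range."""
    for dim in DIMENSIONS:
        if dim not in ratings:
            raise ValueError(f"missing rating for {dim}")
        if not 0 <= ratings[dim] <= 5:
            raise ValueError(f"rating for {dim} must be between 0 and 5")
    return sum(ratings[dim] for dim in DIMENSIONS)


def needs_consensus(rater_a: Dict[str, int], rater_b: Dict[str, int], threshold: int = 2) -> bool:
    """Flag a response for consensus discussion when totals differ by more than the threshold.

    Assumption: the discrepancy is measured on the total score, not per dimension.
    """
    return abs(total_reasoning_score(rater_a) - total_reasoning_score(rater_b)) > threshold


if __name__ == "__main__":
    a = {dim: 4 for dim in DIMENSIONS}
    b = {dim: 3 for dim in DIMENSIONS}
    print(total_reasoning_score(a), total_reasoning_score(b), needs_consensus(a, b))
```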

Statistical analysis

All statistical analyses were conducted using R software (version 4.5.1; R Foundation for Statistical Computing, Vienna, Austria). Descriptive statistics were used to summarize model performance, certainty levels, reasoning quality, and hallucination rates. Continuous variables are presented as means with standard deviations (SD) or medians with interquartile ranges (IQR), as appropriate. Categorical data are expressed as frequencies and percentages. For diagnostic performance, proportions of correct top-1, top-3, and top-5 diagnoses were calculated for each LLM. Certainty probabilities were analyzed longitudinally across Day 1, Day 5, and Day 10, and temporal stability was assessed using mean differences (Δ) between time points. The intraclass correlation coefficient (ICC) was computed to evaluate test–retest reliability of certainty ratings over time, both overall and per model, and interpreted according to conventional thresholds (poor <0.5, fair 0.5–0.75, good 0.75–0.9, excellent >0.9). Reasoning quality was assessed based on evaluator ratings across five analytical dimensions. Mean reasoning scores and standard deviations were computed for each LLM, and inter-rater reliability was quantified using ICC analysis. Agreement among raters was further evaluated by calculating the perfect agreement rate (%) and mean absolute difference across evaluations. Hallucination frequency was expressed as the percentage and absolute count (n) of cases per model. Within-model correlations between reasoning quality and certainty expression were examined using Spearman’s rank correlation coefficients (ρ), while inter-model relationships were assessed using Pearson’s correlation coefficient (r), both reported with 95% confidence intervals and p-values. Correlation matrices were generated to visualize pairwise associations between diagnostic performance, certainty, and reasoning quality across all LLMs. Correlation strength was interpreted as negligible (<0.3), low (0.3–0.5), moderate (0.5–0.7), or strong (>0.7). Statistical significance was defined as a two-sided p-value < 0.05.
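To make the reliability analysis concrete, the sketch below computes an intraclass correlation coefficient from a subjects-by-measurements table and maps it onto the interpretation thresholds stated above. The ICC form used in the study is not specified, so the two-way random-effects, absolute-agreement, single-measurement form, often written ICC(2,1), is an assumption here. The function names and the example ratings are hypothetical, and values falling exactly on a boundary (0.75) are assigned to the higher category by assumption.

```python
from typing import List, Sequence


def icc_2_1(table: Sequence[Sequence[float]]) -> float:
    """ICC(2,1): two-way random effects, absolute agreement, single measurement.

    Rows are subjects (e.g. cases); columns are raters or time points.
    """
    n, k = len(table), len(table[0])
    grand = sum(sum(row) for row in table) / (n * k)
    row_means = [sum(row) / k for row in table]
    col_means = [sum(row[j] for row in table) / n for j in range(k)]

    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in table for x in row)
    ss_error = ss_total - ss_rows - ss_cols

    msr = ss_rows / (n - 1)                # between-subject mean square
    msc = ss_cols / (k - 1)                # between-rater (or time point) mean square
    mse = ss_error / ((n - 1) * (k - 1))   # residual mean square
    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)


def interpret_icc(icc: float) -> str:
    """Apply the thresholds stated in the text: poor <0.5, fair 0.5-0.75, good 0.75-0.9, excellent >0.9."""
    if icc < 0.5:
        return "poor"
    if icc < 0.75:
        return "fair"
    if icc <= 0.9:
        return "good"
    return "excellent"


if __name__ == "__main__":
    # Illustrative ratings only (six subjects, four raters); not study data.
    ratings: List[List[float]] = [
        [9, 2, 5, 8],
        [6, 1, 3, 2],
        [8, 4, 6, 8],
        [7, 1, 2, 6],
        [10, 5, 6, 9],
        [6, 2, 4, 7],
    ]
    value = icc_2_1(ratings)
    print(f"ICC(2,1) = {value:.3f} ({interpret_icc(value)})")
```

Under these thresholds, an ICC of 0.84 or 0.79 falls in the good band, 0.73 in the fair band, and 0.36 in the poor band.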

Ethical considerations

This study did not involve real patient data; only published and anonymized clinical cases from the scientific literature were used, in compliance with ethical and confidentiality standards. Written informed consent was obtained from both participating senior rheumatologists in accordance with institutional ethical standards and data protection regulations.

Results

Case description

A total of fourteen clinical case reports published in Therapeutic Advances in Musculoskeletal Disease were initially identified for inclusion in this study. Five articles were excluded due to incomplete or insufficiently detailed case descriptions, leaving nine reports that met the inclusion criteria. The selected cases, published between 2019 and 2024, represent a diverse spectrum of rheumatologic and systemic inflammatory disorders. These include pregnancy- and lactation-associated osteoporosis, systemic lupus erythematosus with class III lupus nephritis and secondary antiphospholipid syndrome, multisystem inflammatory syndrome in children temporally associated with SARS-CoV-2 infection, rheumatoid arthritis-associated peripheral ulcerative keratitis, Blau syndrome, hereditary hypophosphatemic rickets with hypercalciuria, seropositive rheumatoid arthritis with multiple biologic hypersensitivity reactions managed with a JAK inhibitor, adult-onset Still’s disease complicated by secondary hemophagocytic lymphohistiocytosis, and giant cell arteritis.^11–19 The characteristics of the included studies are summarized in Table 2.

Table 2.

The characteristics of the included studies.

Author	Year	Country	Diagnosis
Ota et al.¹¹	2024	Japan	Severe multiple vertebral fractures due to pregnancy and lactation-associated osteoporosis
Zhang et al.¹²	2022	China	Refractory lupus nephritis and anti-phospholipid antibody syndrome
Scarcella et al.¹³	2022	Italy	Neuro-paediatric inflammatory multisystem syndrome temporally associated with COVID-19
Calvo-Rio et al.¹⁴	2022	Spain	Rheumatoid arthritis and secondary Sjögren’s associated peripheral ulcerative keratitis
Alvarez Reguera et al.¹⁵	2022	Spain	Blau syndrome
Dreimane et al.¹⁶	2020	USA	Hereditary hypophosphatemic rickets with hypercalciuria
Costanzo et al.¹⁷	2020	Italy	Seropositive rheumatoid arthritis with multiple biologic hypersensitivity reactions
Ajeganova et al.¹⁸	2020	Sweden	Adult-onset Still’s disease complicated by secondary hemophagocytic lymphohistiocytosis
Del Giorno et al.¹⁹	2019	Switzerland	Giant cell temporal arteritis

Global diagnostic performance

Overall performance analysis showed that all top-1 predictions were deemed correct by the evaluators, except for a single case in which DeepSeek provided an incorrect diagnosis. Table 3 summarizes the primary responses of each LLM at Day 1, Day 5, and Day 10. Regarding the top-3 and top-5 diagnoses, all LLMs successfully identified the correct diagnosis. The mean top-1 accuracy was 100% for ChatGPT, Claude 3.5, and Gemini, while DeepSeek achieved 88.9%. For both the top-3 and top-5 predictions, all models reached 100% accuracy. Responses from the LLMs that included differential diagnoses are presented in Supplement 1.

Table 3.

The primary responses of each LLM at day 1, day 5, and day 10.

Certainty probability

At Day 1, Claude exhibited the highest mean certainty at 96.4% (SD=2.3%), while DeepSeek demonstrated the lowest at 92.2% (SD=3.6%). By Day 5, Gemini achieved the highest certainty at 96.0% (SD=2.3%), whereas DeepSeek remained the least confident at 93.3% (SD=2.0%). At Day 10, Claude reached the highest certainty across all time points at 97.8% (SD=2.0%), with DeepSeek again showing the lowest confidence at 92.2% (SD=4.4%). All four models demonstrated temporal stability, with minimal changes in mean certainty between Day 1 and Day 10: ChatGPT (Δ=-0.1%), Claude (Δ=+1.3%), DeepSeek (Δ=0.0%), and Gemini (Δ=-0.2%). The intraclass correlation coefficient (ICC) analysis indicated fair overall test-retest reliability across time points (ICC=0.73, 95% CI [0.58–0.84]), just below the 0.75 threshold for good reliability. Individual model rankings showed good reliability for ChatGPT (ICC=0.84) and DeepSeek (ICC=0.79), while both Gemini (ICC=0.58) and Claude (ICC=0.56) showed fair reliability. Figure 1 illustrates the violin plots with individual data points depicting the distribution of certainty percentages for each LLM across Day 1, Day 5, and Day 10.
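The Δ values above are mean differences in expressed certainty between two time points. A minimal sketch of that calculation follows; the per-case certainty values are hypothetical (the case-level data are not reproduced in the text), and `mean_delta` is an illustrative name.

```python
from statistics import fmean
from typing import Sequence


def mean_delta(day_start: Sequence[float], day_end: Sequence[float]) -> float:
    """Mean certainty at the later time point minus mean certainty at the earlier one."""
    if len(day_start) != len(day_end):
        raise ValueError("both time points must cover the same cases")
    return fmean(day_end) - fmean(day_start)


if __name__ == "__main__":
    # Hypothetical certainty percentages for three cases at Day 1 and Day 10.
    day1 = [90.0, 95.0, 100.0]
    day10 = [92.0, 95.0, 98.0]
    print(f"delta = {mean_delta(day1, day10):+.1f} percentage points")
```

A Δ near zero indicates that average certainty did not drift, but it does not by itself show that individual cases received the same certainty on each day, which is why the ICC is reported alongside it.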

Figure 1.

Violin and jitter plots of certainty percentages by LLM and day.

Quality of reasoning

Evaluation of reasoning quality across the five analytical dimensions revealed consistent yet model-dependent variations. ChatGPT achieved the highest overall mean score (3.89 ± 0.23), followed by Claude (3.82 ± 0.16), DeepSeek (3.54 ± 0.24), and Gemini (3.52 ± 0.08). Figure 2 illustrates the line plot of mean evaluator scores across these dimensions, showing the progression and variability for each LLM. Inter-rater agreement, assessed using the intraclass correlation coefficient (ICC), was poor according to the thresholds defined in the statistical analysis, with an overall ICC of 0.36 (95% CI [0.27–0.42]), a perfect agreement rate of 59.8%, and a mean absolute difference of 0.532 across the three days. Hallucination rates varied across models, with ChatGPT exhibiting the lowest frequency at 7.4% (2/27), followed by Claude at 11.1% (3/27), DeepSeek at 14.8% (4/27), and Gemini showing the highest rate at 18.5% (5/27).
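The hallucination percentages follow directly from the reported counts out of 27 responses per model (nine cases at three time points). As a quick arithmetic check, using only the counts given above and a hypothetical helper name:

```python
def hallucination_rate(n_flagged: int, n_responses: int) -> float:
    """Percentage of responses containing at least one hallucination."""
    return 100.0 * n_flagged / n_responses


# Counts reported in the text; 27 responses per model.
counts = {"ChatGPT": 2, "Claude": 3, "DeepSeek": 4, "Gemini": 5}
rates = {model: round(hallucination_rate(n, 27), 1) for model, n in counts.items()}

if __name__ == "__main__":
    for model, rate in rates.items():
        print(f"{model}: {rate}%")
```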

Figure 2.

Line plot of the evolution of clinical dimensions across time points by LLM. Five reasoning dimensions: analytical reasoning and differential structuring, clinical information integration, diagnostic synthesis and justification, paraclinical exam coherence, and physical examination coherence.

Correlation between clinical reasoning quality and certainty expression

Spearman’s rank correlation analysis revealed uniformly weak and non-significant correlations across all LLMs. DeepSeek demonstrated a weak, non-significant positive correlation (ρ = 0.215, 95% CI [-0.180 to 0.550], p = 0.281), while ChatGPT showed a weak negative correlation (ρ = -0.083, 95% CI [-0.449 to 0.307], p = 0.681). Claude exhibited a weak positive correlation (ρ = 0.129, 95% CI [-0.264 to 0.485], p = 0.522), and Gemini displayed a weak negative correlation (ρ = -0.156, 95% CI [-0.506 to 0.238], p = 0.438) (Table 4). The full correlation matrix revealed low to moderate positive correlations between clinical scores across different LLMs. Figure 3 presents the correlation matrix visualizations (circular and hierarchical clustering), and Figure 4 displays individual scatter plots for each LLM with regression lines.

Table 4.

Correlation between global clinical scores and certainty by LLM.

LLM	Spearman’s ρ	95% CI (ρ)	p-value (ρ)	Pearson’s r	p-value (r)
ChatGPT	-0.083	[-0.449, 0.307]	0.681	-0.104	0.606
Claude	0.129	[-0.264, 0.485]	0.522	0.159	0.429
DeepSeek	0.215	[-0.180, 0.550]	0.281	0.193	0.334
Gemini	-0.156	[-0.506, 0.238]	0.438	-0.212	0.289
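The confidence intervals in Table 4 can be approximately reproduced from the reported coefficients and n = 27 using a Fisher z-transformation with standard error 1/sqrt(n - 3). The text does not state how the intervals were obtained, so this is an inference from the reported numbers rather than a description of the authors' procedure; the function name `fisher_ci` is hypothetical.

```python
import math
from statistics import NormalDist
from typing import Tuple


def fisher_ci(rho: float, n: int, level: float = 0.95) -> Tuple[float, float]:
    """Approximate confidence interval for a correlation via the Fisher z-transformation.

    Assumption: standard error 1/sqrt(n - 3), applied here to Spearman's rho.
    """
    z = math.atanh(rho)
    se = 1.0 / math.sqrt(n - 3)
    crit = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for a 95% interval
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


if __name__ == "__main__":
    # Coefficients as reported in Table 4 (rounded to three decimals), n = 27 per model.
    reported = {"ChatGPT": -0.083, "Claude": 0.129, "DeepSeek": 0.215, "Gemini": -0.156}
    for model, rho in reported.items():
        low, high = fisher_ci(rho, 27)
        print(f"{model}: rho = {rho:+.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```

Because every interval spans zero, none of the within-model correlations is distinguishable from no association at this sample size.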

Figure 3.

Correlation matrix heatmaps (corrplot): (a) circular; (b) hierarchical clustering.

Figure 4.

Individual scatter plots with regression lines for each LLM showing the relationship between global clinical scores and certainty percentages.

The complete correlation matrix analysis revealed distinct patterns in the relationships between clinical performance and certainty expression both within and across LLMs. Clinical reasoning scores demonstrated low to moderate positive inter-model correlations, with the strongest associations observed between DeepSeek and Gemini scores (r = 0.502, p < 0.01) and between ChatGPT and DeepSeek scores (r = 0.418, p < 0.05). However, certainty measures showed weaker inter-model correlations, with the highest being between ChatGPT and Claude certainty (r = 0.334, p = 0.09). Cross-model relationships revealed unexpected negative correlations, such as Claude’s clinical scores negatively correlating with Gemini’s certainty (r = -0.528, p < 0.01). Table 5 shows the inter-model correlation matrix between clinical scores and certainty.

Table 5.

Inter-Model correlation matrix between clinical scores and certainty.

Variable	ChatGPT score	ChatGPT certainty	Claude score	Claude certainty	DeepSeek score	DeepSeek certainty	Gemini score	Gemini certainty
ChatGPT Score	1.000	-0.083	0.388*	-0.031	0.418*	0.486**	0.266	-0.190
ChatGPT Certainty	-0.083	1.000	0.175	0.334	0.000	0.141	0.016	0.193
Claude Score	0.388*	0.175	1.000	0.129	0.403*	-0.019	0.300	-0.528**
Claude Certainty	-0.031	0.334	0.129	1.000	-0.122	0.071	0.161	0.222
DeepSeek Score	0.418*	0.000	0.403*	-0.122	1.000	0.215	0.502**	-0.284
DeepSeek Certainty	0.486**	0.141	-0.019	0.071	0.215	1.000	0.297	0.072
Gemini Score	0.266	0.016	0.300	0.161	0.502**	0.297	1.000	-0.156
Gemini Certainty	-0.190	0.193	-0.528**	0.222	-0.284	0.072	-0.156	1.000

Spearman’s rank correlation coefficients. * p < 0.05, ** p < 0.01. Diagonal = 1.000 (self-correlation). n = 27 observations per model.
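To read Table 5 against the strength categories defined in the statistical analysis (negligible <0.3, low 0.3–0.5, moderate 0.5–0.7, strong >0.7), the coefficients can be labelled mechanically. The sketch below does this for the six score-to-score pairs in the table; `correlation_strength` is a hypothetical helper, and values falling exactly on the 0.5 boundary are assigned to the higher category by assumption.

```python
def correlation_strength(r: float) -> str:
    """Label a coefficient using the thresholds stated in the statistical analysis."""
    a = abs(r)
    if a < 0.3:
        return "negligible"
    if a < 0.5:
        return "low"
    if a <= 0.7:
        return "moderate"
    return "strong"


# Score-to-score coefficients copied from Table 5.
score_pairs = {
    ("ChatGPT", "Claude"): 0.388,
    ("ChatGPT", "DeepSeek"): 0.418,
    ("ChatGPT", "Gemini"): 0.266,
    ("Claude", "DeepSeek"): 0.403,
    ("Claude", "Gemini"): 0.300,
    ("DeepSeek", "Gemini"): 0.502,
}

if __name__ == "__main__":
    for (a, b), r in score_pairs.items():
        print(f"{a} vs {b}: {r:.3f} ({correlation_strength(r)})")
```

By these thresholds, the score-to-score correlations range from negligible to moderate, and the -0.528 coefficient between Claude's score and Gemini's certainty is moderate in magnitude.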

Discussion

This study provides a comprehensive multidimensional evaluation of four state-of-the-art large language models in the context of complex rheumatologic diagnoses, revealing both the remarkable capabilities and critical limitations of current AI systems in specialized clinical reasoning. Our findings demonstrate near-perfect diagnostic accuracy across all models while simultaneously exposing fundamental challenges in internal calibration, reasoning quality assessment, and the relationship between expressed confidence and actual clinical performance.

Diagnostic performance: Beyond surface-level accuracy

The diagnostic accuracy observed in our study (100% for ChatGPT-4, Claude 3.5, and Gemini 1.5 Pro, and 88.9% for DeepSeek-V3 on top-1 predictions) substantially exceeds previously reported benchmarks for general medical reasoning tasks. These results align with recent investigations showing that advanced LLMs can achieve expert-level diagnostic performance on complex clinical cases.^4,20 However, this apparent excellence warrants careful interpretation. The near-ceiling performance across all models, including the open-source DeepSeek-V3, suggests that current reasoning-augmented LLMs have surpassed a critical threshold in pattern recognition and knowledge retrieval for well-documented clinical presentations.^1,6,21 Yet, this success may reflect the models’ ability to leverage extensive training data from medical literature rather than genuine clinical reasoning capacity. The cases used in our study, while complex, were derived from published case reports, a genre characterized by clear diagnostic narratives and complete information sets. This inherent publication bias may have inadvertently favored models trained on similar structured medical texts, raising questions about generalizability to real-world clinical scenarios characterized by ambiguity, incomplete data, and diagnostic uncertainty.

Notably, DeepSeek-V3’s single diagnostic error occurred in a case requiring integration of subtle temporal patterns and multi-system involvement, suggesting that open-source reasoning models, despite impressive overall performance, may still lag behind proprietary systems in handling diagnostic complexity. This finding is particularly significant given DeepSeek’s recent emergence as a competitive alternative to commercial LLMs, highlighting that architectural advances in reinforcement learning-based reasoning have not entirely eliminated performance gaps in specialized medical domains.^22,23

The calibration crisis: Confidence without correspondence

Perhaps the most clinically concerning finding of our study is the complete absence of significant correlation between reasoning quality and expressed certainty across all four models. This fundamental miscalibration, in which confidence scores fail to predict diagnostic accuracy or reasoning coherence, represents a critical barrier to safe clinical deployment. ChatGPT exhibited a weak negative correlation (ρ = -0.083), while DeepSeek showed the strongest, albeit non-significant, positive trend (ρ = 0.215). These findings contrast sharply with human physician behavior, where calibration between confidence and correctness, though imperfect, typically shows moderate to strong positive associations.^24–26

The temporal stability of certainty expressions, as evidenced by good ICCs for ChatGPT (0.84) and DeepSeek (0.79), paradoxically compounds this problem. Models consistently express similar confidence levels over time regardless of whether this confidence accurately reflects diagnostic accuracy or reasoning quality. This consistency without validity suggests that certainty percentages generated by current LLMs may represent learned linguistic patterns rather than genuine epistemic uncertainty quantification.²⁵ The weak inter-model correlations in certainty measures further indicate that confidence expression strategies are model-specific artifacts rather than meaningful assessments of diagnostic reliability.

This calibration failure has profound implications for human-AI collaborative decision-making. Clinicians relying on AI-generated confidence scores may be systematically misled, potentially over-trusting incorrect diagnoses presented with high certainty or under-valuing correct diagnoses expressed with lower confidence.²⁷ The unexpected moderate negative correlation between Claude’s clinical scores and Gemini’s certainty (r = -0.528, p < 0.01) suggests that confidence expressed by one model can even run inversely to reasoning performance measured on the same cases, a phenomenon that could amplify clinical risks. These findings underscore the urgent need for developing calibration-aware training objectives and post-hoc uncertainty quantification methods specifically designed for medical AI systems.²⁷

Reasoning quality: Uniformity masking variability

While ChatGPT achieved the highest overall reasoning score (3.89 ± 0.23), followed closely by Claude (3.82 ± 0.16), the low inter-rater ICC of 0.36 reveals substantial evaluator disagreement regarding reasoning quality assessment. This poor concordance, despite our rigorous calibration protocol, highlights an inherent challenge in quantifying the multidimensional construct of clinical reasoning. The five analytical dimensions we evaluated (clinical information integration, physical examination coherence, paraclinical exam interpretation, analytical reasoning, and diagnostic synthesis), while theoretically distinct, may exhibit complex interdependencies that resist linear decomposition.

The narrow score range across models (3.52 to 3.89 on a 5-point scale) suggests that current LLMs have converged on a “good enough” level of reasoning quality that satisfies basic coherence requirements but may lack the exceptional analytical depth characteristic of expert human clinicians. This plateau effect may reflect limitations in current evaluation methodologies rather than true performance ceilings. Traditional Likert-scale assessments, even when structured around specific dimensions, may fail to capture subtle yet clinically significant differences in reasoning sophistication, such as the ability to recognize diagnostic uncertainty, weigh conflicting evidence, or integrate rare but critical clinical features.^28,29

The low to moderate positive correlations between reasoning scores across different models indicate that evaluators perceived consistent patterns of reasoning quality independent of model identity. This consistency suggests that our assessment framework captured genuine, reproducible aspects of clinical reasoning. However, it also raises the possibility that all models, despite architectural differences, employ similar reasoning strategies, potentially reflecting shared training data sources or convergent optimization toward common linguistic patterns in medical literature.

Hallucinations

The observed hallucination rates, ranging from 7.4% with ChatGPT to 18.5% with Gemini, while apparently modest, represent a critical safety concern that may be underestimated by aggregate statistics. In our complex rheumatologic cases, hallucinations primarily manifested as fabricated laboratory values, non-existent imaging findings, or spurious treatment recommendations. Crucially, these errors were embedded within otherwise coherent and sophisticated clinical narratives, making them difficult to detect without expert scrutiny and complete access to source medical records.

The lack of a consistent relationship between hallucination rates and overall diagnostic accuracy across models is noteworthy. Gemini, despite achieving 100% top-1 accuracy, exhibited the highest hallucination rate, while ChatGPT demonstrated both optimal accuracy and minimal hallucinations. This dissociation indicates that factual fabrication and diagnostic reasoning failures are partially independent phenomena, potentially governed by distinct mechanisms within LLM architectures. Hallucinations may arise from the models’ tendency to maintain narrative coherence and fulfill implicit expectations of completeness in clinical documentation, even when specific data points are absent or ambiguous in the source prompt.³⁰

From a clinical safety perspective, the presence of any hallucinations in high-stakes diagnostic scenarios is unacceptable. A single fabricated critical test result could lead to catastrophic clinical decisions, regardless of the correctness of the final diagnostic label. The temporal stability of hallucination patterns across repeated evaluations suggests these errors are not random but may reflect systematic blind spots or knowledge gaps in model training. This consistency paradoxically makes hallucinations more dangerous, as clinicians may incorrectly develop trust in specific models based on initial experiences, failing to maintain appropriate vigilance over time.

Methodological innovations and limitations

Our study design introduced several methodological advances over prior LLM evaluations in medicine. The multi-timepoint assessment with blinded randomization enhanced reliability and minimized confounding from evaluator learning effects or case-order biases. The use of real-world, published case reports from a peer-reviewed rheumatology journal ensured clinical authenticity while maintaining ethical compliance. The dual-evaluator design with standardized scoring rubrics across five reasoning dimensions provided granular assessment beyond binary accuracy metrics.

However, important limitations warrant acknowledgment. First, no formal sample size calculation was performed; the nine cases were selected based on pragmatic availability rather than statistical power considerations. This modest sample, while sufficient for an exploratory comparative study generating 108 diagnostic evaluations (nine cases, four models, three time points), limits the generalizability of our findings and the statistical power to detect meaningful differences between LLMs. The high diagnostic accuracy observed may not extend to unpublished cases, atypical presentations, or scenarios requiring longitudinal data integration. Second, all cases were derived from successful diagnostic narratives published in the literature, introducing potential selection bias toward “solvable” presentations and potentially inflating performance estimates. Third, our evaluation focused exclusively on diagnostic reasoning, excluding therapeutic decision-making, prognostication, and patient communication, domains where LLM capabilities may differ substantially. Fourth, the five-day interval between assessments, while designed to simulate realistic repeated consultation scenarios, may not fully capture the models’ stability over longer timeframes or across major version updates. Fifth, the case reports were drawn from an open-access journal and may have been included in the LLMs’ training data, potentially inflating diagnostic accuracy. This could explain why models achieved high diagnostic performance while showing less robust reasoning, as published case reports typically emphasize final diagnoses rather than explicit reasoning processes. Validation using unpublished cases would better assess true diagnostic capabilities independent of potential training exposure.
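To make the sample size limitation concrete, the following minimal sketch computes Wilson score intervals for accuracy proportions of the size reported here. It is an illustration, not part of the study's analysis: it assumes that the reported 100% and 88.9% correspond to 9 of 9 and 8 of 9 cases, and the function name `wilson_interval` is hypothetical. Repeated assessments of the same case are not independent, so pooling all three time points would not simply narrow these intervals.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives ~95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Study design counts taken from the text: 9 cases, 4 models, 3 time points.
n_cases, n_models, n_timepoints = 9, 4, 3
total_evaluations = n_cases * n_models * n_timepoints  # 108

# Assumption: accuracy proportions are expressed over the nine cases.
for label, correct in [("9/9 correct", 9), ("8/9 correct", 8)]:
    low, high = wilson_interval(correct, n_cases)
    print(f"{label}: 95% CI {low:.3f} to {high:.3f}")
```

Under these assumptions the two intervals are wide (roughly 0.70 to 1.00 and 0.56 to 0.98) and overlap substantially, which is consistent with the caution above about detecting differences between models.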

Implications for clinical integration

Our findings yield critical insights for the responsible integration of LLMs into rheumatologic practice. First, diagnostic accuracy alone is insufficient for clinical deployment; calibration, reasoning transparency, and hallucination mitigation must be prioritized as co-equal performance metrics. Second, LLM-generated confidence scores should not be interpreted as reliable uncertainty quantification without external validation. Third, human oversight remains essential, with particular attention to detecting plausible but fabricated clinical details embedded within coherent narratives. Fourth, model selection should consider not only accuracy but also hallucination rates, reasoning quality, and temporal stability.

Cross-specialty comparisons reveal important performance variations. Hirosawa et al. reported that ChatGPT-4 achieved 54.6% top-1 accuracy across 392 internal medicine cases, substantially lower than the near-perfect rheumatologic performance observed in our study.³¹ While this disparity could suggest that rheumatology’s well-established classification criteria and distinctive clinical phenotypes facilitate more accurate LLM diagnosis, it may also indicate training data contamination from open-access case reports, as Hirosawa et al. found no significant accuracy variation based on publication date or access status.²⁰ Future investigations must employ unpublished or recently published cases across multiple specialties to determine whether rheumatology genuinely represents an optimal domain for LLM application or whether current benchmarks artificially inflate performance metrics.

Future research must also address the calibration crisis through novel training approaches incorporating proper scoring rules, conformal prediction frameworks, or Bayesian uncertainty estimation. Developing domain-specific evaluation benchmarks that reflect the true complexity, ambiguity, and information gaps characteristic of real-world rheumatologic practice is essential. Investigating the mechanisms underlying hallucinations and developing reliable detection algorithms represents an urgent safety priority. Finally, longitudinal studies examining how LLM performance evolves with model updates, user expertise, and iterative human-AI interaction patterns will be crucial for establishing sustainable clinical integration strategies.
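As an illustration of what a proper scoring rule measures, the sketch below computes the Brier score, the mean squared difference between a stated confidence and the binary outcome (1 for a correct diagnosis, 0 otherwise). This is not the method used in this study; the confidence values and outcomes are hypothetical and chosen only to show how overconfidence is penalized.

```python
from typing import Sequence


def brier_score(confidences: Sequence[float], outcomes: Sequence[int]) -> float:
    """Mean squared error between stated confidence and the 0/1 outcome.

    Lower is better; 0.0 means perfectly confident and always correct.
    """
    if len(confidences) != len(outcomes) or len(confidences) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((c - o) ** 2 for c, o in zip(confidences, outcomes)) / len(confidences)


# Hypothetical data: nine diagnoses, eight correct and one incorrect.
outcomes = [1] * 8 + [0]

# A model that states 95% confidence on every case, including the miss.
overconfident = [0.95] * 9

# A model that states the observed base rate (8/9) on every case.
base_rate = [8 / 9] * 9

print(f"Overconfident: {brier_score(overconfident, outcomes):.4f}")
print(f"Base rate:     {brier_score(base_rate, outcomes):.4f}")
```

In this toy example the uniformly overconfident model scores worse than the one stating the base rate, even though both made the same single error. That is the property a proper scoring rule is meant to capture: stated confidence is rewarded only insofar as it matches observed frequencies.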

Conclusion

This comprehensive evaluation reveals that contemporary LLMs have achieved remarkable diagnostic accuracy in complex rheumatologic reasoning tasks, yet fundamental limitations in calibration, hallucination control, and reasoning transparency persist. The dissociation between confidence expression and actual performance represents the most significant barrier to safe clinical deployment. As these powerful tools continue to evolve, the rheumatology community must insist on rigorous, multidimensional evaluation standards that prioritize not only what AI systems can do, but how reliably and transparently they perform in service of patient care.

Supplemental material

Supplemental material for Assessing the clinical reasoning of large language models on complex rheumatology cases: A multidimensional evaluation of four artificial intelligence by Yannick Laurent Tchenadoyo Bayala, Fulgence Kaboré, Charles Sougué, Aboubakar Ouedraogo, Yamyellé Enselme Zongo, Wendlassida Joelle Stéphanie Zabsonré/Tiendrebeogo and Dieu-Donné Ouedraogo in Health Informatics Journal.

Footnotes

Ethical considerations

This study did not involve real patient data; only published and anonymized clinical cases from the scientific literature were used, in compliance with ethical and confidentiality standards.

Consent to participate

Written informed consent was obtained from both participating senior rheumatologists in accordance with institutional ethical standards and data protection regulations.

Author contributions

Yannick Laurent Tchenadoyo Bayala: Conceptualization; Methodology; Statistical analysis; Formal analysis; Investigation; Data curation; Writing – original draft; Writing – review & editing; Visualization; Supervision. Fulgence Kaboré: Methodology; Investigation; Clinical validation; Writing – review & editing. Charles Sougué: Formal analysis; Writing – review & editing. Aboubakar Ouedraogo: Data curation; Investigation; Writing – review & editing. Yamyellé Enselme Zongo: Clinical evaluation; Investigation; Writing – review & editing. Wendlassida Joelle Stéphanie Zabsonré/Tiendrebeogo: Investigation; Data validation; Writing – review & editing. Dieu-Donné Ouedraogo: Supervision; Conceptualization; Writing – review & editing; Project administration.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

The datasets used and/or analyzed in this study are available from the corresponding author upon reasonable request.

Supplemental material

Supplemental material for this article is available online.

References

1. Hussain, Delsoz, Elahi, et al. Performance of DeepSeek, Qwen 2.5 MAX, and ChatGPT Assisting in Diagnosis of Corneal Eye Diseases, Glaucoma, and Neuro-Ophthalmology Diseases Based on Clinical Case Reports. medRxiv 2025. https://doi.org/10.1101/2025.03.14.25323836

2. Jiao, Rosas, Asadigandomani, et al. Diagnostic Performance of Publicly Available Large Language Models in Corneal Diseases: A Comparison with Human Specialists. Diagnostics (Basel) 13 May 2025; 15(10): 1221. https://doi.org/10.3390/diagnostics15101221

3. Bayala YLT, Zabsonré/Tiendrebeogo WJS, Ouedraogo, et al. Performance of the Large Language Models in African rheumatology: a diagnostic test accuracy study of ChatGPT-4, Gemini, Copilot, and Claude artificial intelligence. BMC Rheumatology 2025; 9(1): 54. https://doi.org/10.1186/s41927-025-00512-z

4. Huang. A large language model improves clinicians’ diagnostic performance in complex critical illness cases. Crit Care 2025; 29(1): 230. https://doi.org/10.1186/s13054-025-05468-7

5. Wang, Wan, et al. A Systematic Review of ChatGPT and Other Conversational Large Language Models in Healthcare. medRxiv 2024; 2024.04.26.24306390. https://doi.org/10.1101/2024.04.26.24306390

6. Agarwal, Sharma, Wani. Evaluating the Accuracy and Reliability of Large Language Models (ChatGPT, Claude, DeepSeek, Gemini, Grok, and Le Chat) in Answering Item-Analyzed Multiple-Choice Questions on Blood Physiology. Cureus 2025; 17(4): e81871. https://doi.org/10.7759/cureus.81871

7. Krusche, Callhoff, Knitza, et al. Diagnostic accuracy of a large language model in rheumatology: comparison of physician and ChatGPT-4. Rheumatol Int 2024; 44(2): 303–306. https://doi.org/10.1007/s00296-023-05464-6

8. Coskun, Yagiz, Ocakoglu, et al. Assessing the accuracy and completeness of artificial intelligence language models in providing information on methotrexate use. Rheumatol Int 2024; 44(3): 509–515. https://doi.org/10.1007/s00296-023-05473-5

9. Nakaphan, Damara, Lerttiendamrong, et al. Comparative evaluation of large language models on multiple-choice and image-based rheumatology questions. Rheumatol Int 2025; 46(1): 9. https://doi.org/10.1007/s00296-025-06053-5

10. CHART Collaborative. Reporting guideline for chatbot health advice studies: the Chatbot Assessment Reporting Tool (CHART) statement. BMJ Med 2025; 4(1): e001632. https://doi.org/10.1136/bmjmed-2025-001632

11. Ota, Asanuma, Hirasawa, et al. Minodronate for severe multiple vertebral fractures due to pregnancy- and lactation-associated osteoporosis: a case report and literature review. Therapeutic Advances in Musculoskeletal 2024; 16: 1759720X241259897. https://doi.org/10.1177/1759720X241259897

12. Zhang, Sun. Successful treatment of sirolimus in a Chinese patient with refractory LN and APS: a case report. Therapeutic Advances in Musculoskeletal 2022; 14: 1759720X221079253. https://doi.org/10.1177/1759720X221079253

13. Scarcella, Mastrolia, Marrani, et al. Neuro-PIMS-TS: a single case report and review of the literature. Therapeutic Advances in Musculoskeletal 2022; 14: 1759720X221139627. https://doi.org/10.1177/1759720X221139627

14. Calvo-Río, Sánchez-Bilbao, Álvarez-Reguera, et al. Baricitinib in severe and refractory peripheral ulcerative keratitis: a case report and literature review. Therapeutic Advances in Musculoskeletal 2022; 14: 1759720X221137126. https://doi.org/10.1177/1759720X221137126

15. Álvarez-Reguera, Prieto-Peña, Herrero-Morant, et al. Clinical and immunological study of Tofacitinib and Baricitinib in refractory Blau syndrome: case report and literature review. Therapeutic Advances in Musculoskeletal 2022; 14: 1759720X221093211. https://doi.org/10.1177/1759720X221093211

16. Dreimane, Chen, Bergwitz. Description of a novel SLC34A3.c.671delT mutation causing hereditary hypophosphatemic rickets with hypercalciuria in two adolescent boys and response to recombinant human growth hormone. Therapeutic Advances in Musculoskeletal 2020; 12: 1759720X20912862. https://doi.org/10.1177/1759720X20912862

17. Costanzo, Firinu, Losa, et al. Baricitinib exposure during pregnancy in rheumatoid arthritis. Therapeutic Advances in Musculoskeletal 2020; 12: 1759720X19899296. https://doi.org/10.1177/1759720X19899296

18. Ajeganova, De Becker, Schots. Efficacy of high-dose anakinra in refractory macrophage activation syndrome in adult-onset Still’s disease: when dosage matters in overcoming secondary therapy resistance. Therapeutic Advances in Musculoskeletal 2020; 12: 1759720X20974858. https://doi.org/10.1177/1759720X20974858

19. Del Giorno, Iodice, Mangas, et al. New-onset cutaneous sarcoidosis under tocilizumab treatment for giant cell arteritis: a quasi-paradoxical adverse drug reaction. Case report and literature review. Therapeutic Advances in Musculoskeletal 2019; 11: 1759720X19841796. https://doi.org/10.1177/1759720X19841796

20. Hirosawa, Kawamura, Harada, et al. ChatGPT-Generated Differential Diagnosis Lists for Complex Case–Derived Clinical Vignettes: Diagnostic Accuracy Evaluation. JMIR Med Inform 2023; 11: e48808. https://doi.org/10.2196/48808

21. Chan. DeepSeek-R1 and GPT-4 are comparable in a complex diagnostic challenge: a historical control study. Int J Surg 2025; 111(6): 4056–4059. https://doi.org/10.1097/JS9.0000000000002386

22. Deng, Qiu, Dong, et al. Evaluating ChatGPT and DeepSeek in postdural puncture headache management: a comparative study with international consensus guidelines. BMC Neurol 2025; 25(1): 264. https://doi.org/10.1186/s12883-025-04280-8

23. Kaygisiz ÖF, Teke. Can deepseek and ChatGPT be used in the diagnosis of oral pathologies? BMC Oral Health 2025; 25(1): 638. https://doi.org/10.1186/s12903-025-06034-x

24. Asker, Recai, Genc, et al. Chatbots in urology: accuracy, calibration, and comprehensibility; is DeepSeek taking over the throne? BJU Int 2025; 136(5): 937–945. https://doi.org/10.1111/bju.16873

25. de Oliveira, Garber, Gwinnutt, et al. A study of calibration as a measurement of trustworthiness of large language models in biomedical natural language processing. JAMIA Open August 2025; 8(4): ooaf058. https://doi.org/10.1093/jamiaopen/ooaf058

26. Bedi, Liu, Orr-Ewing, et al. Testing and Evaluation of Health Care Applications of Large Language Models: A Systematic Review. JAMA 2025; 333(4): 319–328. https://doi.org/10.1001/jama.2024.21700

27. Savage, Wang, Gallo, et al. Large language model uncertainty proxies: discrimination and calibration for medical diagnosis and treatment. J Am Med Inform Assoc 2025; 32(1): 139–149. https://doi.org/10.1093/jamia/ocae254

28. Halawani, Almehmadi, Alhubaishy, et al. Empowering patients: how accurate and readable are large language models in renal cancer education. Front Oncol 2024; 14: 1457516. https://doi.org/10.3389/fonc.2024.1457516

29. Umer, Batool, Naved. Innovation and application of Large Language Models (LLMs) in dentistry - a scoping review. BDJ Open 2024; 10(1): 90. https://doi.org/10.1038/s41405-024-00277-6

30. Sallam. ChatGPT Utility in Healthcare Education, Research, and Practice: Systematic Review on the Promising Perspectives and Valid Concerns. Healthcare (Basel) 2023; 11(6): 887. https://doi.org/10.3390/healthcare11060887

31. Hirosawa, Harada, Mizuta, et al. Diagnostic performance of generative artificial intelligences for a series of complex case reports. Digit Health 2024; 10: 20552076241265215. https://doi.org/10.1177/20552076241265215

Supplementary Material

Case report (first author)	ChatGPT-4	Claude 3.5	DeepSeek-V3	Gemini 1.5 Pro
Day 1
Ota et al.	Pregnancy- and Lactation-Associated Osteoporosis (PLO/Lactation-Associated Osteoporosis)	Pregnancy and Lactation-Associated Osteoporosis (PLO)	Pregnancy and Lactation-Associated Osteoporosis (PLO)	Pregnancy- and Lactation-Associated Osteoporosis (PLO) (also known as Transient Postpartum Osteoporosis (TPO))
Zhang et al.	Systemic Lupus Erythematosus (SLE) with Class III (A) Lupus Nephritis and Secondary Antiphospholipid Syndrome (APS)	Systemic Lupus Erythematosus (SLE) with Lupus Nephritis Class III (A) and Secondary Antiphospholipid Syndrome (APS)	Systemic Lupus Erythematosus (SLE) with Class III Lupus Nephritis and Secondary Antiphospholipid Syndrome (APS)	Systemic Lupus Erythematosus (SLE) with Proliferative Lupus Nephritis (Class III) and Antiphospholipid Syndrome (APS)
Scarcella et al.	Multisystem Inflammatory Syndrome in Children (MIS-C) temporally associated with SARS-CoV-2 infection	Multisystem Inflammatory Syndrome in Children (MIS-C) associated with SARS-CoV-2	Kawasaki Disease (KD)/Kawasaki Disease Shock Syndrome (KDSS)	Multisystem Inflammatory Syndrome in Children (MIS-C) associated with SARS-CoV-2 (COVID-19)
Calvo-Rio et al.	Rheumatoid arthritis–associated peripheral ulcerative keratitis (PUK)	Peripheral Ulcerative Keratitis (PUK) secondary to Rheumatoid Arthritis with inadequate disease control	Peripheral Ulcerative Keratitis (PUK) associated with active Rheumatoid Arthritis	Rheumatoid Arthritis (RA)-Associated Peripheral Ulcerative Keratitis (PUK) with Corneal Perforation
Alvarez Reguera et al.	Blau Syndrome (Early-onset Sarcoidosis, NOD2-associated autoinflammatory disease)	Blau Syndrome (Pediatric Sarcoidosis) due to NOD2/CARD15 mutation (R334Q)	Blau Syndrome/Early-Onset Sarcoidosis (EOS)	Early-Onset Sarcoidosis (EOS) (also known as Blau Syndrome/Blau-like Syndrome)
Dreimane et al.	Hereditary Hypophosphatemic Rickets with Hypercalciuria (HHRH)	Hereditary Hypophosphatemic Rickets with Hypercalciuria (HHRH) - SLC34A3 gene mutation	Hereditary Hypophosphatemic Rickets with Hypercalciuria (HHRH)	Hereditary Hypophosphatemic Rickets with Hypercalciuria (HHRH)
Costanzo et al.	Seropositive Rheumatoid Arthritis (RA), with multiple biologic hypersensitivity reactions, currently managed with a JAK inhibitor	Seropositive Rheumatoid Arthritis (RA) with Multiple Anti-TNF Hypersensitivity Reactions, Successfully Treated with JAK Inhibitor	Rheumatoid Arthritis (RA) treated with a JAK Inhibitor (e.g., Baricitinib, Tofacitinib, Upadacitinib)	Seropositive Rheumatoid Arthritis (RA) with Biologic Failure and Drug Hypersensitivity
Ajeganova et al.	Adult-Onset Still’s Disease (AOSD) complicated by Secondary Hemophagocytic Lymphohistiocytosis (sHLH/Macrophage Activation Syndrome, MAS)	Adult-Onset Still’s Disease (AOSD) complicated by Macrophage Activation Syndrome (MAS)/Secondary Hemophagocytic Lymphohistiocytosis (HLH)	Adult-Onset Still’s Disease (AOSD) with Macrophage Activation Syndrome (MAS)	Macrophage Activation Syndrome (MAS) secondary to Adult-Onset Still’s Disease (AOSD)
Del Giorno et al.	Giant cell (temporal) arteritis (GCA)	Giant Cell Arteritis (GCA)/Temporal Arteritis with Steroid-Induced Osteoporosis and Paradoxical Sarcoid-Like Granulomatous Reaction to TNF Inhibitor Therapy	Giant Cell Arteritis (GCA) with a Drug-Induced Granulomatous Reaction (likely to Tocilizumab)	Giant Cell Arteritis (GCA)
Day 5
Ota et al.	Pregnancy- and Lactation-Associated Osteoporosis (PLO)	Pregnancy and Lactation-Associated Osteoporosis (PLO)	Pregnancy and Lactation-Associated Osteoporosis (PLAO)	Pregnancy- and Lactation-Associated Osteoporosis (PLO) (also known as Transient Osteoporosis of Pregnancy and Lactation, or Postpartum Osteoporosis)
Zhang et al.	Systemic Lupus Erythematosus (SLE) with Class III (A) Lupus Nephritis and Secondary Antiphospholipid Syndrome (APS)	Systemic Lupus Erythematosus (SLE) with Lupus Nephritis Class III(A) and Secondary Antiphospholipid Syndrome (APS)	Systemic Lupus Erythematosus (SLE) with Class III Lupus Nephritis and Secondary Antiphospholipid Syndrome (APS)	Systemic Lupus Erythematosus (SLE) with Lupus Nephritis Class III (A) and Antiphospholipid Syndrome (APS)
Scarcella et al.	Multisystem Inflammatory Syndrome in Children (MIS-C) associated with SARS-CoV-2	Multisystem Inflammatory Syndrome in Children (MIS-C) associated with SARS-CoV-2	Kawasaki Disease (KD), specifically Atypical or Neurologically Prominent Kawasaki Disease	Multisystem Inflammatory Syndrome in Children (MIS-C) associated with SARS-CoV-2
Calvo-Rio et al.	Rheumatoid arthritis–associated peripheral ulcerative keratitis (PUK) secondary to active seropositive rheumatoid arthritis with secondary Sjögren’s syndrome	Peripheral Ulcerative Keratitis (PUK) secondary to Seropositive Rheumatoid Arthritis with Secondary Sjögren’s Syndrome	Rheumatoid Arthritis (RA)-associated Peripheral Ulcerative Keratitis (PUK)	Rheumatoid Arthritis-Associated Peripheral Ulcerative Keratitis (PUK) with Ocular Perforation
Alvarez Reguera et al.	Blau Syndrome (NOD2-associated autoinflammatory disease)	Blau Syndrome (Pediatric Sarcoidosis/Early-Onset Sarcoidosis)	Blau Syndrome/Early-Onset Sarcoidosis	Blau Syndrome (BS)/Early-Onset Sarcoidosis (EOS)
Dreimane et al.	Hereditary Hypophosphatemic Rickets with Hypercalciuria (HHRH) — due to a loss-of-function mutation in SLC34A3 (NaPi-IIc gene)	Hereditary Hypophosphatemic Rickets with Hypercalciuria (HHRH) - also known as Autosomal Recessive Hypophosphatemic Rickets Type 2 (ARHR2)	Hereditary Hypophosphatemic Rickets with Hypercalciuria (HHRH)	Hereditary Hypophosphatemic Rickets with Hypercalciuria (HHRH)
Costanzo et al.	Seropositive Rheumatoid Arthritis (RA) — with systemic allergic reactions to multiple TNF inhibitors (adalimumab, certolizumab) and subsequent treatment with a JAK inhibitor	Seropositive Rheumatoid Arthritis (RA) Treated with JAK Inhibitor (Most Likely Baricitinib)	Active Seropositive Rheumatoid Arthritis (RA) with prior exposure to a JAK inhibitor during early pregnancy	Seropositive Rheumatoid Arthritis (RA) - Biologic Failure
Ajeganova et al.	Adult-Onset Still’s Disease (AOSD) complicated by secondary Hemophagocytic Lymphohistiocytosis (HLH)/Macrophage Activation Syndrome (MAS)	Adult-Onset Still’s Disease (AOSD) with Secondary Hemophagocytic Lymphohistiocytosis (HLH)/Macrophage Activation Syndrome (MAS)	Adult-Onset Still’s Disease (AOSD) complicated by Macrophage Activation Syndrome (MAS)	Adult-Onset Still’s Disease (AOSD) complicated by Secondary Hemophagocytic Lymphohistiocytosis (sHLH)/Macrophage Activation Syndrome (MAS)
Del Giorno et al.	Giant cell (temporal) arteritis (GCA)	Giant Cell Arteritis (GCA)/Temporal Arteritis with Paradoxical Granulomatous Skin Reaction to Tocilizumab	Giant Cell Arteritis (GCA) with a subsequent Drug-Induced Granulomatous Reaction (likely to Tocilizumab)	Giant Cell Arteritis (GCA)/Temporal Arteritis
Day 10
Ota et al.	Pregnancy- and Lactation-Associated Osteoporosis (PLO/Postpartum Osteoporosis)	Pregnancy and Lactation-Associated Osteoporosis (PLO)	Severe Osteoporosis due to Malabsorption and Malnutrition	Primary Diagnosis: Pregnancy and Lactation-Associated Osteoporosis (PLO)
Zhang et al.	Systemic Lupus Erythematosus (SLE) with Class III (A) Lupus Nephritis and Secondary Antiphospholipid Syndrome (APS)	Systemic Lupus Erythematosus (SLE) with Lupus Nephritis Class III(A) and Secondary Antiphospholipid Syndrome (APS)	Systemic Lupus Erythematosus (SLE) with Class III Lupus Nephritis and Secondary Antiphospholipid Syndrome (APS)	Primary Diagnosis: Systemic Lupus Erythematosus (SLE) with Lupus Nephritis Class III (A)
Scarcella et al.	Multisystem Inflammatory Syndrome in Children (MIS-C) temporally associated with SARS-CoV-2 infection (Kawasaki-like presentation)	Multisystem Inflammatory Syndrome in Children (MIS-C) Associated with SARS-CoV-2	Kawasaki Disease (KD), complicated by Kawasaki Disease Shock Syndrome (KDSS) and severe neurological involvement.	Multisystem Inflammatory Syndrome in Children (MIS-C)/Pediatric Inflammatory Multisystem Syndrome Temporally Associated with SARS-CoV-2 PIMS-TS
Calvo-Rio et al.	Peripheral Ulcerative Keratitis (PUK) secondary to active seropositive Rheumatoid Arthritis (RA) with associated secondary Sjögren’s syndrome	Peripheral Ulcerative Keratitis (PUK) Secondary to Seropositive Rheumatoid Arthritis with Secondary Sjögren’s Syndrome, Successfully Treated with Rituximab	Peripheral Ulcerative Keratitis (PUK) secondary to Active Rheumatoid Arthritis (RA)	Rheumatoid Arthritis-Associated Peripheral Ulcerative Keratitis (PUK) with Corneoscleral Perforation
Alvarez Reguera et al.	Blau Syndrome (Early-onset Sarcoidosis) due to a de novo NOD2/CARD15 R334Q mutation	Blau Syndrome (Pediatric Sarcoidosis/NOD2-Associated Autoinflammatory Disease)	Blau Syndrome/Early-Onset Sarcoidosis	Blau Syndrome (BS)
Dreimane et al.	Hereditary Hypophosphatemic Rickets with Hypercalciuria (HHRH)	Hereditary Hypophosphatemic Rickets with Hypercalciuria (HHRH)/Autosomal Recessive Hypophosphatemic Rickets Type 2 (ARHR2)	Hereditary Hypophosphatemic Rickets with Hypercalciuria (HHRH)	Hereditary Hypophosphatemic Rickets with Hypercalciuria (HHRH)
Costanzo et al.	Seropositive Rheumatoid Arthritis (RA)	Seropositive Rheumatoid Arthritis (RA) Treated with JAK Inhibitor (Most Likely Baricitinib) with Unintended First-Trimester Pregnancy Exposure	Rheumatoid Arthritis (RA) treated with a JAK Inhibitor (e.g., Baricitinib, Tofacitinib)	Seropositive Rheumatoid Arthritis (RA) with Multiple Drug Intolerance/Refractory Disease
Ajeganova et al.	Adult-Onset Still’s Disease (AOSD) complicated by Secondary Hemophagocytic Lymphohistiocytosis (sHLH/Macrophage Activation Syndrome, MAS)	Adult-Onset Still’s Disease (AOSD) Complicated by Macrophage Activation Syndrome (MAS)/Secondary Hemophagocytic Lymphohistiocytosis (HLH)	Adult-Onset Still’s Disease (AOSD) complicated by Macrophage Activation Syndrome (MAS)	Macrophage Activation Syndrome (MAS) secondary to Adult-Onset Still’s Disease (AOSD)
Del Giorno et al.	Giant cell (temporal) arteritis (GCA) — cranial large-vessel granulomatous vasculitis	Giant Cell Arteritis (GCA)/Temporal Arteritis with Paradoxical Granulomatous Skin Reaction to Tocilizumab	Giant Cell Arteritis (GCA) with a subsequent Drug-Induced Sarcoid-like Reaction (likely to an IL-6 Receptor Antagonist such as Tocilizumab).	Giant Cell Arteritis (GCA)