Sage Journals: Discover world-class research

Abstract

Background

Therapeutic inertia (TI) remains a critical barrier to optimizing outcomes in multiple sclerosis (MS) and neuromyelitis optica spectrum disorders (NMOSDs).

Objective

We evaluated the proficiency of ChatGPT-4o in addressing complex neuro-immunological management challenges compared to practicing neurologists.

Methods

We conducted a comparative analysis using 21 clinical vignettes derived from a multicenter research framework. Responses from 290 neurologists were benchmarked against ChatGPT-4o, both with and without Retrieval-Augmented Generation (RAG). The primary endpoint was guideline-adherent decision-making at the item level, with the prevalence of TI as a secondary clinical outcome. Scenarios included MS therapy escalation, aquaporin-4-IgG positive NMOSD management, and serum neurofilament light chain integration.

Results

ChatGPT-4o with RAG achieved significantly higher guideline adherence in decision-making than neurologists (80.5% vs. 66.5%; p = 0.001). Multivariable generalized estimating equation models identified ChatGPT-4o as an independent predictor of evidence-based decision-making (Odds ratio 3.17; 95% confidence interval: 2.05-4.88; p < 0.0001). While the model demonstrated a lower propensity for TI overall, performance parity occurred in emerging biomarker scenarios where clinical consensus is still evolving.

Conclusions

ChatGPT-4o demonstrated superior guideline adherence and reduced TI compared to neurologists. Integrating Large Language Models as clinical decision-support tools may enhance the standardization of neuro-immunological care and serve as a valuable adjunct to mitigate human cognitive biases.

Keywords

Artificial intelligence large language models multiple sclerosis neuromyelitis optica spectrum disorders decision-making therapeutic inertia

Introduction

The management of multiple sclerosis (MS) and neuromyelitis optica spectrum disorders (NMOSDs) has seen significant therapeutic advances with the development of highly effective, mechanism-specific therapies.^1,2 In MS, the prompt use of high-efficacy treatments is critical to reduce relapse rates, limit new lesion formation, and prevent disability progression.^1,3 Similarly, in NMOSD, preventing relapses is essential to avoid permanent disability and long-term accumulation of damage.^4,5 Furthermore, new tools like testing for serum neurofilament light chain (sNfL) have emerged to detect subclinical neuroaxonal damage, positioning them as powerful aids for prognosis and timely therapeutic adjustments.^6,7

Despite these pharmacological and monitoring advances, therapeutic inertia (TI), the failure to initiate or escalate treatment despite uncontrolled disease activity, persists as a primary obstacle to optimal care in MS and NMOSD.^8,9 Factors such as aversion to ambiguity, low tolerance for uncertainty, and limited use of structured decision-support strategies have been identified as relevant contributors to TI.^8–11 In this context, artificial intelligence (AI), specifically large language models (LLMs) such as ChatGPT, represents a potential adjunct for clinical decision support.¹² LLMs can potentially integrate complex information and align recommendations with guidelines, serving as a supportive framework to mitigate human cognitive biases and help reduce unintended variability in physician decisions while respecting clinical autonomy.^13–20 The objective of this study was to evaluate whether ChatGPT-based recommendations demonstrate greater adherence to evidence-based guidelines and a lower prevalence of TI compared to practicing neurologists managing MS and NMOSD.

Methods

Study design

This study employed a secondary pooled analysis of data derived from 290 neurologists who participated in three multicenter, observational, cross-sectional studies in Spain: DIScUTIR MS, NewFeeLs-MS, and PREFERENCES-NMOSD.^9–11 We recruited clinician participants through a collaboration with the Spanish Society of Neurology (SEN). The SEN invited members actively managing demyelinating diseases to participate via targeted electronic correspondence.

Clinical vignettes and scenario selection

We selected 21 clinical vignettes designed to emulate high-stakes, representative management challenges in MS and NMOSD care. These scenarios focused on the most frequent drivers of TI across three distinct contexts:

DIScUTIR MS study (n = 9): These cases evaluated the influence of neurologists’ behavioral risk preferences on disease-modifying therapy escalation in patients with relapsing-remitting MS exhibiting clinical or radiological activity.¹⁰

NewFeeLs-MS study (n = 7): This module investigated the impact of sNfL biomarker integration on the propensity to initiate or intensify treatment.¹¹

PREFERENCES-NMOSD study (n = 5): Scenarios addressed the management of aquaporin-4-IgG positive NMOSD within an expended landscape of high-efficacy targeted therapies.⁹

ChatGPT configuration and contextual retrieval

We utilized the OpenAI gpt-4o-2024-08-06 model for response generation and the text-embedding-3-large model for embedding generation. All prompts were processed via the official Application Programming Interface (API) with the temperature set to 0.7. This parameter was selected to balance determinism with creative synthesis, mirroring the default configuration of the standard ChatGPT web interface. To ensure data security, the API key was managed exclusively through encrypted environment variables and remained unexposed within the study code or documentation. Furthermore, employing a controlled API environment with a fixed model version served as a proactive strategy to mitigate data leakage. Given that the clinical vignettes originated from previous research, this methodology minimizes the risk of the model retrieving verbatim study content or pre-existing answer keys from its underlying training corpus.

Each vignette was evaluated under two conditions:

Without context: The model was provided only with the case scenario and questions.

With contextual retrieval: The model was augmented using retrieval-augmented generation (RAG). Domain-specific clinical documents related to MS and NMOSD management and sNfL testing indications were converted to a vector database (ChromaDB).^1,5,6,21 Relevant document excerpts, prioritized by semantic similarity and filtered by a score above 0.80, were retrieved and appended to the model's prompt for contextual guidance. Each case was processed multiple times (n = 10 repetitions) to assess variability in responses.

Outcome measures and definitions

The primary endpoint was guideline-adherent decision-making, defined at the item level as the provision of a correct response consistent with evidence-based guidelines. Specifically, a response was classified as correct if it mandated the escalation of disease-modifying therapy for patients exhibiting breakthrough disease activity, characterized by recurrent clinical relapses, new radiological activity on magnetic resonance imaging, or elevated sNfL levels. Secondary outcomes included the prevalence of TI and the identification of predictors for the association between participant's and case-based characteristics with correct responses. For the pooled analysis, we maintained the study-specific definitions of TI derived from the original frameworks:

NewFeeLs-MS and DIScUTIR MS studies: TI was considered present if there were at least three incorrect responses.

PREFERENCES-NMOSD study: TI was considered present if there was at least one incorrect response. A more stringent definition was used in this study due to the high severity of NMOSD and the risk of irreversible disability accumulation following a single relapse.^2,5

Statistical analyses

Demographic and professional characteristics were reported using mean (standard deviation [SD]), medians (interquartile range [IQR]), or number (percent), as applicable. A descriptive analysis summarized the percentage of correct answers provided by both ChatGPT and neurologists for each clinical scenario. Comparative analyses were then conducted to evaluate differences in performance between the model and clinicians. The proportion of correct responses was compared using either the Student's t-test or the Mann-Whitney U test, depending on the distribution of the data, with normality assessed prior to selection. Only cases evaluated by both participants (ChatGPT and neurologists) were included in this analysis. A generalized estimating equation (GEE) model with a logit link function was fitted to assess factors associated with the likelihood of providing a correct response (absence of TI).

Dependent variable: Correct answer (Yes/No).

Independent variables: Type of responder (Participant: Neurologist or ChatGPT), and clinical case characteristics including patient demographics (age, time since diagnosis, Expanded Disability Status Scale [EDSS]), and disease-related characteristics (relapse, lesions, sNfL, and affectation), and participant characteristics (age, sex, years of professional expertise, and practice setting).

Clustering: ChatGPT runs were treated as independent observations, but clinician responses were clustered by individual clinician to account for the correlation arising from each clinician evaluating multiple cases. This within-clinician correlation was explicitly accounted for in the GEE model.

Model assessment: Model fit was evaluated using the quasi-likelihood under the independence model criterion (QIC), and the proportion of variance explained by the fixed effects was estimated using the marginal coefficient of determination (R² marginal).

Reporting: Odds ratios (ORs) with 95% confidence intervals (CIs) were estimated.

• All statistical analyses were performed using IBM SPSS Statistics and data visualizations were generated using RStudio. A two-sided p-value < 0.05 was considered statistically significant for all analyses.

Results

Study population

This analysis included the responses of 290 neurologists pooled from DIScUTIR MS (N = 96), NewFeeLs-MS (N = 116), and PREFERENCES-NMOSD (N = 78) studies. Overall, the cohort comprised clinicians with a mean (SD) age of 40.2 (9.9) years, of whom 51.7% were male. These neurologists had a mean of 13.9 (8.8) years of professional experience, 74.1% worked in academic hospitals, and saw a median (IQR) of 15.0 (8.0–25.0) patients per week. Additionally, 70.7% attended congresses of the European Committee for Treatment and Research in Multiple Sclerosis, and 62.8% were authors of peer-reviewed publications. When analyzed by study, TI was observed in 66.0% of neurologists in the DIScUTIR MS, 91.8% in the NewFeeLs-MS, and 38.5% in the PREFERENCES-NMOSD. Further details are summarized in Table 1.

Table 1.

Characteristics of the study population.

	Preferences-NMOSD (N = 78)	NewFeeLs-MS (N = 116)	DIScUTIR MS (N = 96)	All studies (N = 290)
Age (years), mean (SD)	38.4 (10.8)	41.9 (10.1)	39.6 (8.5)	40.2 (9.9)
Sex, n (%)
Male	43 (55.1)	62 (53.4)	45 (46.9)	150 (51.7)
Type of hospital, n (%)
Academic	34 (43.6)	112 (96.6)	69 (71.9)	215 (74.1)
Years of experience as a neurologist, mean (SD)	11.3 (8.7)	16.0 (9.2)	13.4 (7.9)	13.9 (8.8)
Number of patients seen per week, median (IQR)	15.0 (4.0–25.0)	16.0 (10.0–25.0)	15.0 (9.5–25.0)	15.0 (8.0–25.0)
ECTRIMS attendance, n (%)	45 (57.7)	104 (89.7)	56 (58.3)	205 (70.7)
Author of scientific publications, n (%)	39 (50.0)	64 (55.2)	79 (82.3)	182 (62.8)
Percentage of incorrect answers, mean (SD)	11.8 (18.4)	46.2 (11.6)	36.1 (15.1)	33.3 (20.4)
Therapeutic inertia, n (%)
Yes	30 (38.5)	101 (91.8)	62 (66.0)	193 (68.4)

ECTRIMS: European committee for treatment and research in multiple sclerosis; IQR: interquartile range; SD: standard deviation; NMOSD: neuromyelitis optica spectrum disorder; MS: multiple sclerosis.

ChatGPT responses

In general, ChatGPT's responses were highly consistent across the 10 repetitions per case, with the model producing identical answers in most scenarios or showing minimal variability, typically limited to two different response patterns. The percentage of correct responses generated by ChatGPT without context is presented in Table 2, showing that the model provided 100% correct answers in all included cases for the PREFERENCES-NMOSD study, 100% correct responses for cases 1 and 5 and 0% for the remaining cases in the NewFeeLs-MS study, and 100% correct responses in 6 out of 9 case scenarios and between 60% and 90% in the remaining ones for the DIScUTIR MS study.

Table 2.

Descriptive analysis of correct answers: ChatGPT-4o (without context) versus neurologists.

Cases	ChatGPT % of correct answers	Neurologists % of correct answers
PREFERENCES-NMOSD
Case 1	100.0	85.9
Case 2	100.0	97.4
Case 3	100.0	74.3
Case 4	100.0	88.4
Case 5	100.0	94.8
NewFeeLs-MS
Case 1	100.0	92.3
Case 2	0.0	88.8
Case 3	0.0	8.5
Case 4	0.0	5.1
Case 5	100.0	99.1
Case 6	0.0	5.2
Case 7	0.0	80.3
DIScUTIR MS
Case 1	100.0	47.9
Case 4	100.0	91.6
Case 7	100.0	93.7
Case 8	100.0	86.4
Case 9	80.0	21.8
Case 11	90.0	38.5
Case 13i	100.0	81.2
Case 13ii	60.0	16.6
Case 15	100.0	97.8

NMOSD: neuromyelitis optica spectrum disorder; MS: multiple sclerosis.

When ChatGPT was provided with context, a slight improvement in performance was observed, resulting in a 100% correct response rate across all evaluated cases for the PREFERENCES-NMOSD and DIScUTIR MS studies, while in NewFeeLs-MS, correct responses were observed in 100% of the iterations for cases 1 and 5, in 90% for case 2, and 0% in the remaining cases (Table 3).

Table 3.

Descriptive analysis of correct answers: ChatGPT-4o (with context) vs neurologists.

Clinical cases	ChatGPT % of correct answers	Neurologists % of correct answers
PREFERENCES-NMOSD
Case 1	100.0	85.9
Case 2	100.0	97.4
Case 3	100.0	74.3
Case 4	100.0	88.4
Case 5	100.0	94.8
NewFeeLs-MS
Case 1	100.0	92.3
Case 2	90.0	88.8
Case 3	0.0	8.5
Case 4	0.0	5.1
Case 5	100.0	99.1
Case 6	0.0	5.2
Case 7	0.0	80.3
DIScUTIR MS
Case 1	100.0	47.9
Case 4	100.0	91.6
Case 7	100.0	93.7
Case 8	100.0	86.4
Case 9	100.0	21.8
Case 11	100.0	38.5
Case 13i	100.0	81.2
Case 13ii	100.0	16.6
Case 15	100.0	97.8

NMOSD: neuromyelitis optica spectrum disorder; MS: multiple sclerosis.

Performance comparison between ChatGPT-4o and neurologists

When analyzing results across all studies combined, ChatGPT-4o demonstrated a significantly higher percentage of correct responses than neurologists in both conditions: Without context (mean: 72.9% vs. 66.5%; p = 0.038) and with context (mean: 80.5% vs. 66.5%; p = 0.001) (Figure 1A).

Figure 1.

Comparative analysis of guideline-adherent responses between ChatGPT and neurologists. Results are presented for (A) the combined study dataset, (B) the PREFERENCES-NMOSD study, (C) the DIScUTIR MS study, and (D) the NewFeeLs-MS study. Individual dots represent the percentage of correct responses for each clinical vignette, while bars and vertical lines indicate the mean and standard deviation, respectively. Inter-group means were compared using the Mann–Whitney U test for the combined dataset, the NewFeeLs-MS study, and the DIScUTIR MS study (with clinical context); a Student's t-test was utilized for all other analyses. NMOSD: neuromyelitis optica spectrum disorder; MS: multiple sclerosis. Statistical significance is denoted as: *p < 0.05; **p < 0.01.

This superior performance was consistent across the PREFERENCES-NMOSD study and the DIScUTIR MS study. In both of these studies, the model showed a significantly higher percentage of correct responses compared to clinicians under both conditions:

PREFERENCES-NMOSD: Without context/with context (mean: 100.0% vs. 88.2%; p = 0.019) (Figure 1B).

DIScUTIR MS: Without context (mean: 92.2% vs. 64.0%; p = 0.015) and with context (mean: 100.0% vs. 64.0%; p = 0.004) (Figure 1C).

Conversely, for the NewFeeLs-MS study, no significant differences were observed in the percentage of correct responses between ChatGPT-4o and neurologists, either without context (mean: 28.6% vs. 54.2%; p = 0.191) or with context (mean: 41.4% vs. 54.2%; p = 0.518) (Figure 1D). Based on these findings, ChatGPT-4o demonstrated a lower tendency to exhibit TI in treatment decision-making compared to neurologists.

Factors associated with correct responses

We performed multivariable GEE analyses to determine whether the type of responder and patient-specific demographic or clinical characteristics were independently associated with guideline-adherent decision-making. To ensure model stability and convergence, the sex variable was excluded from all models due to a highly unbalanced distribution (male sex represented in only 2 of 21 clinical vignettes). Two separate models were analyzed: One considering responses from ChatGPT-4o without context and the other considering responses from ChatGPT-4o with context.

The multivariable GEE model showed that ChatGPT-4o without context was an independent predictor and more likely to provide correct responses compared with clinicians, indicating a positive association between ChatGPT-4o participation and correct responses (p = 0.042) (Table 4). The presence of both lesions and relapses were positively associated with correct responses, suggesting that TI was less likely to be observed in scenarios reflecting active disease (p < 0.0001 and p = 0.011, respectively). The EDSS score was negatively associated with correct responses in the model (p < 0.0001). Similarly, the second model confirmed that ChatGPT-4o with context was an independent predictor and more likely to provide correct responses compared with clinicians (p < 0.0001) (Table 5). The presence of lesions and relapses were associated with correct responses (P < 0.0001 and P = 0.006, respectively). Conversely, the EDSS score was negatively associated with correct responses (p < 0.0001).

Table 4.

Factors associated with correct responses in the multivariable GEE model (ChatGPT-4o without context).

Independent variables	Dependent variable: correct responses
	OR (95% CI)	P-value
Participant: ChatGPT (ref: neurologist)	1.55 (1.02–2.38)	0.042
Age (years)	1.05 (0.95–1.17)	0.361
Time since diagnosis (years)	0.98 (0.96–1.00)	0.030
EDSS score	0.48 (0.39–0.60)	< 0.0001
Relapse: Yes (ref: No)	1.86 (1.15–3.00)	0.011
Lesions: Yes (ref: No)	7.64 (4.31–13.53)	<0.0001
sNfL: Yes (ref: No)	1.06 (0.69–1.62)	0.794
Topography: Not specified (ref: No)	0.02 (0.01–0.04)	<0.0001
Topography: Spinal cord (ref: No)	0.72 (0.41–1.27)	0.257
Topography: Brain (ref: No)	0.13 (0.05–0.36)	<0.0001

A multivariable GEE model with a logit link function was fitted, considering correct response (yes/no) as the dependent variable, and participant (ChatGPT without clinical context/neurologist), along with patients’ demographic and clinical characteristics from each case, as independent variables. Population-averaged ORs with 95% CIs and P-values are presented. QIC: 1879.43. R²-marginal: 0.461. CI: confidence interval; EDSS: Expanded Disability Status Scale; GEE: generalized estimating equation; OR: odds ratio; sNfL: serum neurofilament light chain; QIC: quasi-likelihood under the independence model criterion.

Table 5.

Factors associated with correct responses in the multivariable GEE model (ChatGPT-4o with context).

Independent variables	Dependent variable: correct responses
	OR (95% CI)	P-value
Participant: ChatGPT (ref: neurologist)	3.17 (2.05–4.88)	<0.0001
Age (years)	1.06 (0.95–1.18)	0 .271
Time since diagnosis (years)	0.98 (0.96–1.00)	0.017
EDSS score	0.50 (0.40–0.62)	<0.0001
Relapse: Yes (ref: No)	2.09 (1.23–3.57)	0.006
Lesions: Yes (ref: No)	8.69 (4.61–16.39)	<0.0001
sNfL: Yes (ref: No)	1.41 (0.90–2.21)	0.131
Topography: Not specified (ref: No)	0.02 (0.01–0.04)	<0.0001
Topography: Spinal cord (ref: No)	0.74 (0.42–1.31)	0.298
Topography: Brain (ref: No)	0.16 (0.06–0.42)	0.0002

A multivariable GEE model with a logit link function was fitted, considering correct response (yes/no) as the dependent variable, and participant (ChatGPT with clinical context/clinician), along with patients’ demographic and clinical characteristics from each case, as independent variables. Population-averaged ORs with 95% CIs and P-values are presented. QIC: 1834.13. R2-marginal: 0.477. CI: confidence interval; EDSS: Expanded Disability Status Scale; GEE: generalized estimating equation; OR: odds ratio; sNfL: serum neurofilament light chain; QIC: quasi-likelihood under the independence model criterion.

Discussion

Generative AI, particularly LLMs such as ChatGPT, represents a transformative advancement in clinical decision-support.^12,22–24 These systems are redefining the acquisition of evidence-based information and the interpretation of complex therapeutic data. While the utility of LLMs is enhancing triage, diagnostic accuracy, and patient communication across multiple specialties, their application is especially salient in neurology, where therapeutic landscapes for complex pathologies are evolving rapidly.

Our study demonstrates that ChatGPT-4o provided a higher proportion of guideline-adherent recommendations and exhibited lower TI than practicing neurologists. These findings align with recent performance benchmarks. Notably, Schubert et al.¹³ reported that ChatGPT outperformed human users on neurology board-style examinations, achieving an 85.0% accuracy rate against a mean human score of 73.8%. While that study established the model's proficiency in addressing theoretical, standardized questions, our analysis validates the model's capacity to navigate nuanced, simulated clinical vignettes, including those involving emerging biomarkers such as sNfL. By moving beyond rote recall to evaluate therapeutic decisions in MS and NMOSD, this study provides evidence that ChatGPT-4o can function as a reliable clinical decision-support tool to mitigate TI and enhance care consistency. Beyond diagnostic and therapeutic accuracy, preliminary evidence supports the broader clinical utility of LLMs. In NMOSD, ChatGPT-3.5 has demonstrated proficiency in early disease recognition.²⁵ Similarly, in MS care, several AI platforms have performed comparably to specialist neurologists in knowledge assessments.¹⁹ Furthermore, ChatGPT has generated management explanations that patients with MS perceived as more empathetic than those provided by clinicians, suggesting a potential role in enhancing patient-provider communication.²⁰ Collectively, these findings indicate that when provided with structured clinical information, LLMs possess the capacity to supplement human reasoning and address unmet needs in well-defined neuro-immunological settings.

The optimization of performance through RAG underscores the critical importance of structured, context-specific input in mitigating baseline LLM limitations, such as factual inaccuracies or reliance on outdated data.^16,26 However, the observed parity between the model and neurologists in the NewFeeLs-MS scenarios warrants consideration.¹¹ This component of the study evaluated sNfL testing, a context where clinical consensus and universally accepted protocols are still emerging.^27–29 The resultant ambiguity appears to challenge both human clinicians and LLMs alike, illustrating that while LLMs are highly reliable within established parameters, they remain sensitive to dynamic and heterogeneous clinical evidence.^13,30

Multivariable GEE analysis further suggests that AI integration can enhance decision-making uniformity. Consistent with clinical expectations, guideline-adherent responses were more frequent in scenarios featuring overt disease activity, such as clinical relapses or new radiological lesions.^31–33 Beyond these traditional indicators, LLMs may eventually support personalized treatment strategies by integrating multimodal datasets. From a practical standpoint, the lower TI observed in ChatGPT-4o suggests its potential to support timely treatment optimization, thereby preventing the accumulation of irreversible disability.¹²

The demographic profile of the participating neurologists represents a high-performance benchmark. With 74.1% of participants practicing in academic centers and 62.8% involved in peer-reviewed research, this cohort likely exhibits high baseline knowledge and adherence to specialized management guidelines. It is essential to clarify that the lower adherence observed among these neurologists does not suggest a deficit in medical knowledge or a failure to comprehend evidence-based recommendations. On the contrary, the observed TI likely reflects the inherent difficulty of applying standardized protocols to complex, real-world cases where clinicians must navigate ambiguity and balance individual patient risks.²⁹ Understanding these behavioral and systemic drivers is critical; it suggests that AI tools should not serve to “educate” expert physicians, but rather to provide a consistent, objective framework to mitigate cognitive biases, such as ambiguity aversion and low tolerance for uncertainty, that contribute to treatment delays and medical errors.^8,10,34,35

We acknowledge that clinical guidelines are not infallible and may be perceived as too rigid to capture the granular complexities of individual care. Furthermore, formal guidelines often exhibit a significant temporal lag between the emergence of scientific evidence and its official publication. Consequently, expert neurologists frequently make therapeutic decisions aligned with the latest evidence before it is codified into standard protocols. While clinical intuition and nuanced, shared decision-making remain fundamental to high-quality medicine, guidelines provide a necessary framework to harmonize care. Within this context, AI tools function as a standardized measure to reduce therapeutic variability without replacing the essential judgment of the treating neurologist.

The integration of LLMs into clinical workflows should aim to provide concise, context-specific recommendations that complement expert judgment. Crucially, the ultimate clinical responsibility for patient care remains with the neurologist; thus, human supervision of all AI-generated recommendations is mandatory.³⁶ To ensure these systems are utilized as rigorous adjuncts to clinical reasoning, AI literacy must be prioritized within medical education.

This study has several limitations. First, fundamental differences exist between human clinical reasoning and the algorithmic generation of AI responses. While the simulated vignettes were designed to be clinically relevant, they may not fully encapsulate the multi-faceted complexities of real-world practice. Second, LLM performance is strictly contingent upon the quality and currency of its source material; any temporal lag in the integration of emerging guidelines could result in suboptimal recommendations.³⁶ Third, although we utilized current API versions and RAG to anchor responses to verified evidence, the risk of data leakage cannot be definitively excluded, as the specific training corpora for proprietary LLMs remain undisclosed. Finally, the inclusion of a predominantly academic cohort of Spanish neurologists may restrict the external validity of these findings regarding other healthcare systems or non-academic clinical environments. Future research should focus on validating LLM performance within more heterogeneous, global cohorts to ensure generalizability across diverse socioeconomic and institutional frameworks.

Conclusion

This study demonstrates that ChatGPT-4o can effectively assist therapeutic decision-making in neuro-immunological diseases, achieving accuracy levels comparable to or exceeding those of practicing neurologists. Specifically, the model exhibited lower rates of TI and significantly greater adherence to evidence-based recommendations, particularly when clinical context was provided through RAG. These findings contribute to the expanding body of evidence suggesting that LLMs can enhance clinical consistency, mitigate cognitive bias in complex neurological reasoning, and potentially reduce medical errors.

However, the ethical and responsible integration of this technology necessitates mandatory human supervision of all AI-generated recommendations and the formal incorporation of AI literacy into medical education. While these systems represent powerful tools for decision-making, they must be viewed as a supportive resource designed to complement, rather than replace, the nuanced and essential expert clinical judgment of the treating neurologist.

Footnotes

Data availability statement

The datasets generated during the analysis of the study are available from the corresponding author upon reasonable request.

Declaration of conflicting interest

The authors declared the following potential conflicts of interest with respect to the research, authorship, and/or publication of this article: Gustavo Saposnik received consulting fees from Roche Farma Spain and is supported by the University of Toronto Scientific Merit award. Enric Monreal received research grants, travel support, or honoraria for speaking engagements from Almirall, Merck, Roche, Sanofi, BMS, Biogen, Janssen, and Novartis. Eduardo Agüera received speaking honoraria from Roche, Novartis, Merck, Sanofi, and Biogen. María Sepúlveda received speaking honoraria from Roche, Biogen, and UCB Pharma, and travel reimbursement from Biogen, Sanofi, Merck and Roche for national and international meetings. Gary Álvarez-Bravo received compensation for consulting services and speaking fees from Biogen, Novartis, Merck, Sanofi, Amgen, Roche, and BMS. Miguel A Hernández has served as a speaker/moderator in meetings and/or symposia organized by Biogen, Merck, Sanofi, Roche, Novartis, and BMS. He has received funding for research projects from Biogen, Novartis, Merck, Teva, Sanofi, Roche, and BMS. Javier Riancho received speaking, consulting fees and travel funding from Merck, Sanofi, Roche, Biogen, Novartis, BMS, Jannsen, Neuraxpharm, and Teva. Juan P Cuello received consulting fees, support for travel, fees honoraria for participation on data monitoring boards, speaking honoraria, and expert testimony from Novartis, Biogen, Merck, Sanofi, and Roche. Ángel Pérez-Sempere has received consulting and speaking fees from Merck, Novartis, Teva, and Roche. Rocío Gómez-Ballesteros and Jorge Maurino are employees of Roche Farma Spain. Aleix Solanes declare that he has no competing interests.

Ethical approval and informed consent statement

All component studies adhered to the principles of the Declaration of Helsinki and received formal approval from the Institutional Review Board of the Hospital Universitario Clínico San Carlos in Madrid, Spain.

Informed consent

All participants provided written informed consent.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by the Roche España (grant number NA).

ORCID iDs

Gustavo Saposnik

Enric Monreal

Gary Álvarez-Bravo

Eduardo Agüera-Morales

Javier Riancho

Jorge Maurino

Rocío Gómez-Ballesteros

References

Filippi

Amato

Centonze

, et al. The use of high-efficacy disease-modifying therapies in multiple sclerosis: recommendations from an expert Delphi consensus. J Neurol 2025; 272: 65.

Wang

. Advances in maintenance therapies for neuromyelitis optica spectrum disorders: a new era of targeted drugs. Mult Scler Relat Disord 2025; 96: 106351.

Singer

Feng

Chiong-Rivero

. Early use of high-efficacy therapies in multiple sclerosis in the United States: benefits, barriers, and strategies for encouraging adoption. J Neurol 2024; 271: 3116–3130.

Paul

Marignier

Palace

, et al. International delphi consensus on the management of AQP4-IgG+ NMOSD: recommendations for eculizumab, inebilizumab, and satralizumab. Neuro. Neuroimmuno. Neuroinflamm 2023; 10: 200124.

Kümpfel

Giglhuber

Aktas

, et al. Update on the diagnosis and treatment of neuromyelitis optica spectrum disorders (NMOSD) - revised recommendations of the neuromyelitis Optica study group (NEMOS). Part II: Attack Therapy and Long-Term Management. J Neurol 2024; 271: 141–176.

Freedman

Abdelhak

Bhutani

, et al. The role of serum neurofilament light (sNfL) as a biomarker in multiple sclerosis: insights from a systematic review. J Neurol 2025; 272: 00.

Monreal

Ruiz

San Román

, et al. Value contribution of blood-based neurofilament light chain as a biomarker in multiple sclerosis using multi-criteria decision analysis. Front Public Health 2024; 12: 1397845.

Saposnik

Montalban

. Therapeutic inertia in the new landscape of multiple sclerosis care. Front Neurol 2018; 9: 74.

Cobo-Calvo

Gómez-Ballesteros

Orviz

, et al. Therapeutic inertia in the management of neuromyelitis optica spectrum disorder. Front Neurol 2024; 15: 1341473.

10.

Saposnik

Sempere

Prefasi

, et al. Decision-making in multiple sclerosis: the role of aversion to ambiguity for therapeutic inertia among neurologists (DIScUTIR MS). Front Neurol 2017; 8: 65.

11.

Saposnik

Monreal

Medrano

, et al.

Does serum neurofilament light chain measurement influence therapeutic decisions in multiple sclerosis?

Mult Scler Relat Disord 2024; 90: 105838.

12.

Miranda

Pereira-Silva

Guichard

, et al. Artificial intelligence outperforms physicians in general medical knowledge, except in the paediatrics domain: a cross-sectional study. Bioengineering (Basel) 2025; 12: 53.

13.

Schubert

Wick

Venkataramani

. Performance of large language models on a neurology board-style examination. JAMA Netw Open 2023; 6: e2346721.

14.

Amin

Nakamura

Ontaneda

. Artificial intelligence and multiple sclerosis: past, present, and future. Semin Neurol 2026; 46: 96–104.

15.

Demuth

Ed-Driouch

Dumas

, et al. Scoping review of clinical decision support systems for multiple sclerosis management: leveraging information technology and massive health data. Eur J Neurol 2025; 32: e16363.

16.

Inojosa

Voigt

Wenk

, et al. Integrating large language models in care, research, and education in multiple sclerosis management. Mult Scler 2024; 30: 1392–1401.

17.

Joseph

. A pilot evaluation of the diagnostic accuracy of ChatGPT-3.5 for multiple sclerosis from case reports. Transl Neurosci 2024; 15: 20220361.

18.

Patel

Villalobos

Shan

, et al.

Generative artificial intelligence versus clinicians: who diagnoses multiple sclerosis faster and with greater accuracy?

Mult Scler Relat Disord 2024; 90: 105791.

19.

Yaman Kula

Durmaz Çeli

Özben

, et al. Artificial intelligence versus neurologists: a comparative study on multiple sclerosis expertise. Clin Neurol Neurosurg 2025; 250: 108785.

20.

Maida

Moccia

Palladino

, et al. CHATGPT vs. Neurologists: a cross-sectional study investigating preference, satisfaction ratings and perceived empathy in responses among people living with multiple sclerosis. J Neurol 2024; 271: 4057–4066.

21.

Freedman

Gnanapavan

Booth

, et al. Guidance for use of neurofilament light chain as a cerebrospinal fluid and blood biomarker in multiple sclerosis management. EBioMedicine 2024; 101: 104970.

22.

Ros-Arlanzón

Perez-Sempere

. Evaluating AI competence in specialized medicine: comparative analysis of CHATGPT and neurologists in a neurology specialist examination in Spain. JMIR Med Educ 2024; 10: e56762.

23.

Komasawa

Yokohira

. Generative artificial intelligence (AI) in medical education: a narrative review of the challenges and possibilities for future professionalism. Cureus 2025; 17: e86316.

24.

Rincón

EHH

Jimenez

Aguilar

LAC

, et al. Mapping the use of artificial intelligence in medical education: a scoping review. BMC Med Educ 2025; 25: 26.

25.

Shan

Patel

McCreary

, et al. Faster and better than a physician? Assessing diagnostic proficiency of ChatGPT in misdiagnosed individuals with neuromyelitis optica spectrum disorder. J Neurol Sci 2024; 468: 123360.

26.

Zakka

Shad

Chaurasia

, et al. Almanac - retrieval-augmented language models for clinical medicine. NEJM AI 2024; 1: 10.1056/aioa2300068.

27.

Lycke

. Using serum neurofilament-light in clinical practice: growing enthusiasm that may need bridling. Mult Scler 2024; 30: 1575–1577.

28.

Yaldizli

Benkert

Achtnichts

, et al. Personalized treatment decision algorithms for the clinical application of serum neurofilament light chain in multiple sclerosis: a modified Delphi study. Mult Scler 2025; 31: 932–943.

29.

Weld-Blundell

Learmonth

Klaic

, et al. The application of treatment guidelines in multiple sclerosis care: a qualitative analysis of barriers and facilitators. Int J MS Care 2026; 28: 1537.

30.

Meltzer

. Solving gaps in clinical reasoning is the cure to Neurophobia in artificial intelligence. J Neurol Sci 2025; 479: 125672.

31.

Bsteh

Aicher

Walde

, et al. Association of disease-modifying treatment with outcome in patients with relapsing multiple sclerosis and isolated MRI activity. Neurology 2024; 103: e209752.

32.

Gavoille

Rollot

Casey

, et al. Acute clinical events identified as relapses with stable magnetic resonance imaging in multiple sclerosis. JAMA Neurol 2024; 81: 814–823.

33.

Guerra

Copetti

Achille

, et al. Refining prognostic factors in adult-onset multiple sclerosis: a narrative review of current insights. Int J Mol Sci 2025; 26: 7756.

34.

Saposnik

Maurino

Sempere

, et al. Herding: a new phenomenon affecting medical decision-making in multiple sclerosis care? Lessons learned from DISCUTIR MS. Patient Prefer Adherence 2017; 11: 175–180.

35.

Monreal

Gómez-Ballesteros

Meca-Lallana

, et al. Neurologists’ openness to evidence-based innovation in multiple sclerosis care: individual and structural determinants. Neuropsychiatr Dis Treat 2025; 21: 1523–1531.

36.

Smith

Weathers

. An overview of clinical machine learning applications in neurology. J Neurol Sci 2023; 455: 122799.

Large language models as clinical decision-support tools in multiple sclerosis and neuromyelitis optica spectrum disorders: A comparative study of ChatGPT-4o and neurologists

Abstract

Background

Objective

Methods

Results

Conclusions

Keywords

Introduction

Methods

Study design

Clinical vignettes and scenario selection

ChatGPT configuration and contextual retrieval

Outcome measures and definitions

Statistical analyses

Results

Study population

ChatGPT responses

Performance comparison between ChatGPT-4o and neurologists

Factors associated with correct responses

Discussion

Conclusion

Footnotes

Data availability statement

Declaration of conflicting interest

Ethical approval and informed consent statement

Informed consent

Funding

ORCID iDs

References