Abstract
Background
The integration of artificial intelligence (AI) into medical education and clinical decision-making is rapidly expanding. ChatGPT-4o, a multimodal AI model, offers real-time access to a vast corpus of biomedical knowledge. Nonetheless, concerns persist regarding the scientific accuracy and interpretive reliability of its responses when applied to clinical subjects such as migraine.
Aim
To assess the reliability and factual accuracy of ChatGPT-4o when addressing key clinical questions regarding migraine.
Methods
Eight clinically relevant questions were submitted to ChatGPT-4o, covering migraine pathophysiology, diagnosis and treatment. Each response was compared with current evidence from high-impact medical literature and rated as satisfactory, partially satisfactory or unsatisfactory. Classifications were based on conceptual accuracy, reference validity and clinical coherence.
Results
Of the eight responses analyzed, 62.5% were classified as satisfactory, while 37.5% were deemed partially satisfactory. No response was considered entirely unsatisfactory. The most common limitations included reference-related AI hallucinations and insufficient technical depth in selected answers.
Conclusions
ChatGPT-4o demonstrates potential as a support tool in the dissemination of structured medical information about migraine. However, its clinical use must remain supervised by professionals, given its limitations in bibliographic precision and interpretive nuance.
Introduction
Technological advances in computing have significantly shortened the gap between the generation of knowledge and its practical application in medicine. At the forefront of this transformation are large language models (LLMs), capable of producing coherent text, synthesizing evidence and interacting contextually with human users (1). These tools mark a new phase of artificial intelligence (AI), with the potential to profoundly affect how biomedical knowledge is produced, accessed and utilized.
ChatGPT is one of the most representative examples of this transition. Developed by OpenAI and first released in November 2022, the platform was upgraded to a multimodal architecture in May 2024 with the launch of GPT-4o, enabling text, image and audio processing.
Despite its promise, the use of AI in medicine requires methodological rigor and critical evaluation. Enthusiasm about its potential must be accompanied by systematic assessments of its reliability, accuracy and limitations within the broader framework of scientific knowledge production. Furthermore, it is essential to recognize that no technology replaces clinical judgment, and any implementation must adhere to established scientific and ethical standards.
Accordingly, this study aimed to evaluate the accuracy and consistency of GPT-4o responses regarding migraine by comparing its outputs with current evidence-based medical literature. In doing so, we sought to discuss the role of AI as a support tool for learning, scientific updating and the democratization of knowledge, particularly within a complex and sensitive field such as neurology.
Despite the potential benefits of using artificial intelligence in educational settings, it is essential to recognize that this technology presents important limitations. Among these is the occurrence of “AI hallucinations”, a phenomenon in which the model generates factually incorrect, fabricated or non-existent information, often presented with misleading confidence.
Methods
As a language model, ChatGPT requires detailed prompts and contextual inputs to generate responses with higher precision and relevant bibliographic support. This operational premise is acknowledged in OpenAI's official documentation.
For this study, a tailored instruction protocol was applied, embedding the professional background of the querying neurologist and specifying the depth required in the responses. The prompt used can be found in Appendix 1.
Eight questions were designed and submitted, addressing migraine pathophysiology, diagnosis and treatment. Selection was based on their relevance to clinical reasoning and educational value. No AI tool was used to generate the questions. All responses were generated using the GPT-4o model (version o1-preview was not used due to beta-stage constraints).
All queries were submitted once, with no prior conversation history or contextual memory enabled, apart from the instruction protocol discussed above. The model's behaviour was evaluated in a clean session for each prompt to avoid influence from previous interactions. Data collection was conducted in May 2024.
Two neurologists (PAH and MAH) independently evaluated the responses, classifying them based on conceptual consistency, reference validity and overall coherence. One reviewer has 39 years of clinical and academic experience in headache medicine and is a former president of the Brazilian Headache Society; the second evaluator has five years of experience and is a member of the Education Committee of the Brazilian Headache Society. Although the process was not blinded, assessments were conducted independently and discussed by consensus.
No statistical analysis was performed; only qualitative evaluation was conducted by clinical experts.
Response quality classification
Response accuracy was assessed by comparing the content of each answer with the most current evidence-based sources available as of May 2024. Benchmarks included the International Classification of Headache Disorders, 3rd edition (ICHD-3), the UpToDate clinical database and the latest American Headache Society (AHS) guidelines and peer-reviewed publications (2,3).
The responses were classified into three main categories based on clear and objective criteria related to conceptual accuracy, reference quality and adequacy to the given request:
Satisfactory Response: A response was considered satisfactory when:
- No conceptual errors: the information provided aligns with the current scientific consensus and medical literature.
- No reference errors: all cited references exist and are consistent with the presented data.
- Meets the request: the response fully addresses the question without significant omissions.
Partially Satisfactory Response: Classified as partially satisfactory if:
- It contains one reference error or one conceptual error.
- It presents a well-structured response with mostly correct data but includes minor flaws that partially compromise its quality.
Unsatisfactory Response: Defined as unsatisfactory if:
- The response was superficial, resembling a lay-level explanation rather than a medically rigorous one.
- It did not adequately address the request.
- It contained two or more conceptual or reference errors, indicating serious flaws in accuracy or information reliability.
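The three-tier rubric above amounts to a simple decision rule on error counts and answer quality. A minimal sketch, assuming the criteria can be reduced to an error tally plus two qualitative flags (the function and parameter names are illustrative, not part of the study protocol):

```python
def classify_response(conceptual_errors: int, reference_errors: int,
                      superficial: bool, addresses_request: bool) -> str:
    """Illustrative encoding of the three-tier rubric described above.

    Zero errors and a complete, rigorous answer -> satisfactory;
    a single conceptual or reference error in an otherwise sound
    answer -> partially satisfactory; two or more errors, a superficial
    answer, or a missed request -> unsatisfactory.
    """
    total_errors = conceptual_errors + reference_errors
    if total_errors >= 2 or superficial or not addresses_request:
        return "unsatisfactory"
    if total_errors == 1:
        return "partially satisfactory"
    return "satisfactory"
```

For example, an answer with one fabricated reference but no conceptual flaws would fall into the middle tier: `classify_response(0, 1, False, True)` returns `"partially satisfactory"`.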
Results
The results indicated satisfactory performance in most responses. Comparative analysis revealed that 62.5% (five of eight questions) were classified as satisfactory, demonstrating coherence with the medical literature and an absence of significant conceptual errors. The remaining 37.5% (three of eight questions) were deemed partially satisfactory, containing reference-related AI hallucinations and minor limitations in informational depth that did not significantly compromise accuracy. No response was unsatisfactory (Table 1).
Table 1. Summary of evaluated questions and classifications
Abbreviations: ICHD = International Classification of Headache Disorders.
Question 1: what is migraine?
Comparative Result: Satisfactory Response (3–6).
Details: As an introduction to headaches, the AI covered the typical symptomatology of migraine, its treatment, pathophysiology and diagnosis according to current medical literature. However, its discussion of each topic was somewhat general, lacking in-depth explanations of specific terms. Nevertheless, considering the subsequent responses generated by the AI, this initial response can be understood as an introductory overview of the topic rather than an inability of ChatGPT to delve deeper into the subject.
Question 2: what is the cause of migraine?
Comparative Result: Satisfactory Response (3–7).
Details: the AI did not present any incorrect information according to current medical literature and was able to filter out already debunked theories. Overall, all the information provided was accurate and at an appropriate level of complexity, aligning with the request.
Question 3: what are the non-pharmacological treatment options for migraine?
Comparative Result: Satisfactory Response (6,8–12).
Details: The tool proved to be adequate in providing information on available non-pharmacological treatment options for migraine. It serves both as a patient education resource and as a professional support tool, allowing for a more targeted understanding of treatment strategies. Notably, the AI did not present these options as miraculous solutions or give patients false hope about their effectiveness, ensuring a balanced and evidence-based approach.
Question 4: what are the pharmacological treatment options for migraine?
Comparative Result: Partially Satisfactory Response (6,13–16).
Details: the AI did not present any incorrect information according to current medical literature. Additionally, it correctly emphasized the cautious use of medications to prevent medication-overuse headache and mentioned new therapies.
Regarding the references, the AI generated a hallucination, mistakenly identifying one person as the author of a paper when, in fact, he was a co-author of a different paper, a sub-analysis of the same sample.
Overall, all the information provided was accurate and at an appropriate level of complexity, aligning with the request.
Question 5: how is migraine diagnosed?
Comparative Result: Partially Satisfactory Response (3–6).
Details: The response generated by ChatGPT was highly accurate and well-structured, in accordance with current medical literature. Again, there was an inaccuracy in the references: the AI swapped the publication dates and journals of two similarly titled articles by the same author.
Question 6: define episodic and chronic migraine
Comparative Result: Satisfactory Response (3,4,6).
Details: the AI did not present any incorrect information according to current medical literature and was able to produce a concise and well-written response with an appropriate level of complexity for the request.
Question 7: what is medication-overuse headache?
Comparative Result: Partially Satisfactory Response (3,17,18).
Details: the AI largely provided a response that aligns with current medical literature. However, it incorrectly included Criterion D: “The headache reverts to its previous pattern – usually improves – within two months after discontinuation of the overused medication.”
Furthermore, it misidentified one author and publication date.
Question 8: what are the recommendations of international headache societies regarding surgical procedures for migraine treatment?
Comparative Result: Satisfactory Response (19–21).
Details: the AI did not present any incorrect information according to current medical literature and was able to produce a concise and well-written response with an appropriate level of complexity for the request.
Discussion
The evolution of generative AI models, such as ChatGPT-4o, has introduced new perspectives for medical research and knowledge dissemination. In this study, we evaluated the accuracy and reliability of ChatGPT-4o's responses on migraine by comparing them with current scientific literature and clinical guidelines.
One relevant issue was the presence of reference-related AI hallucinations. These inaccuracies were observed in the diagnostic criteria for medication-overuse headache, according to the ICHD-3, as well as in some cited references. AI hallucinations pose a significant risk because they may lead to diagnostic errors, inappropriate prescriptions and the dissemination of medical misinformation.
In the context of large language models, an AI hallucination refers to output that is factually incorrect, non-existent or fabricated (22,23). These errors can occur in citation generation, clinical data interpretation, or logical construction, posing significant risks when applied to medical decision-making (24).
During this study, a pattern in AI hallucinations was observed. A significant portion of them occurred within the references provided by the AI. Authors with major contributions to the field were erroneously included in certain citations. It is possible that, recognizing the substantial participation of these authors in migraine-related scientific publications, the AI used their names as a mechanism of authority and credibility. Another issue worth highlighting is that, in articles with similar titles or authors, the AI often confused information, swapping publication dates and journal names.
Despite these limitations, strategies to mitigate AI hallucinations are being developed, including algorithm improvements, continuous human feedback and cross-checking with reliable medical databases. The use of hybrid approaches, in which AI complements but does not replace human analysis, can substantially reduce the risks associated with inaccurate information.
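The cross-checking strategy mentioned above can be illustrated with a minimal sketch: each AI-generated citation is compared against a trusted bibliographic record and mismatches are flagged. Here the trusted source is a local dictionary keyed by DOI; in practice a database such as PubMed or Crossref would be queried. All identifiers and record values below are hypothetical:

```python
# Hypothetical trusted bibliography, keyed by DOI. In a real pipeline this
# would come from a curated database rather than a hard-coded dict.
TRUSTED = {
    "10.0000/example.001": {"author": "Silva", "year": 2023,
                            "journal": "Cephalalgia"},
}

def check_citation(doi: str, author: str, year: int, journal: str) -> list:
    """Return a list of discrepancies between a cited reference and the
    trusted record; an unrecognized DOI is itself flagged as a possible
    hallucination."""
    record = TRUSTED.get(doi)
    if record is None:
        return ["unknown reference (possible hallucination)"]
    issues = []
    if record["author"] != author:
        issues.append("author mismatch")
    if record["year"] != year:
        issues.append("year mismatch")
    if record["journal"] != journal:
        issues.append("journal mismatch")
    return issues
```

This captures the error pattern observed in the study: a citation with the right article but a swapped publication year would yield `["year mismatch"]`, while a wholly fabricated reference would be flagged as unknown.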
Nevertheless, the adoption of AI in medicine should proceed under strict regulation and continuous monitoring. Transparency regarding the sources of information provided by AI models, as well as the incorporation of mechanisms that flag potential inaccuracies, is essential to ensure the safety and reliability of these technologies. Future studies should assess the effectiveness of AI hallucination mitigation strategies and investigate the impact of these tools on healthcare quality and medical education.
Clinically, ChatGPT may support education and preliminary medical guidance, but its limitations require human oversight. Future studies should investigate use in real-world environments, compare LLMs and evaluate integration in medical training.
Although ChatGPT may demonstrate apparent consistency when answering objective, structured medical questions, it still falls short when applied to the more complex dimensions of clinical care. Tasks such as conducting a nuanced, patient-centered history or formulating a therapeutic plan that accounts for comorbidities, disease trajectory, psychosocial context and individual preferences remain outside the scope of these systems.
Artificial intelligence lacks the ability to interpret non-verbal cues, recognize subtle emotional undertones or adapt its reasoning based on patient values and real-world constraints. As a result, it may produce responses that sound coherent yet are clinically inappropriate or even misleading. The danger lies precisely in this illusion of understanding: AI can simulate clarity, but it does not truly comprehend – and in medicine, that distinction matters deeply.
Moreover, therapeutic decision-making is inherently contextual. It involves balancing risks, understanding longitudinal disease patterns and applying professional judgment that AI cannot replicate. This can lead to false reassurance, overmedicalization or premature self-diagnosis, especially when patients use these tools without professional guidance.
This study had important limitations. First, it relied on single-response instances from the AI model, without multiple sampling or independent replication, which may have affected the consistency and reproducibility of results. Second, there was no blinded review or independent adjudication of outputs, increasing the risk of interpretive bias. Third, because large language models operate stochastically, variability in responses is expected, and the outputs may differ depending on prompt phrasing or internal model state.
Conclusions
This study evaluated the factual accuracy and conceptual consistency of ChatGPT-4o's responses to eight clinically relevant questions about migraine. The responses were assessed by two neurologists based on reference validity, alignment with current medical literature and overall coherence. Of the eight questions analyzed, five responses (62.5%) were classified as satisfactory and three (37.5%) as partially satisfactory. No response was considered entirely unsatisfactory. The results suggest that ChatGPT-4o is capable of generating structured and generally accurate content when prompted with direct conceptual questions.
However, it is essential to emphasize that the model demonstrated significant limitations, including reference-related AI hallucinations and improper application of formal diagnostic criteria. Such shortcomings may compromise the scientific reliability of its responses. This makes it clear that ChatGPT-4o should not be regarded as a reliable source for more complex educational guidance in unsupervised settings. While it may assist in the understanding of basic migraine-related concepts, its outputs must be interpreted critically and always verified against validated guidelines and authoritative medical sources. Its use should remain under professional supervision.
In summary, although ChatGPT-4o may not yet be considered reliable for medical education or professional development, the present study suggests that it may serve as a useful resource for lay users or for retrieving basic information in fields where the user lacks expertise.
Key findings
Of the eight evaluated responses, 62.5% were categorized as satisfactory, demonstrating high concordance with current medical knowledge.
The remaining 37.5% were classified as partially satisfactory due to minor inaccuracies, limited depth and hallucinations in referencing.
No responses were considered entirely unsatisfactory. However, some hallucinations, particularly in reference citation and diagnostic criteria, were identified.
Supplemental Material
Supplemental material for “What does ChatGPT know about Migraine? A comparative-descriptive analysis” by Lucas Bernardi Garcia, Ana Júlia Ferreira, Mohamad Ali Hussein and Pedro André Kowacs (Cephalalgia) is available online (sj-docx-1-cep-10.1177_03331024251387684).
Footnotes
Data availability
Data are available from the authors at reasonable request.
Declaration of conflicting interests
The authors of this article have no conflicts of interest to declare.
References
