Abstract
Francesco Brigo, Serena Broggi, Gionata Strigaro, Sasha Olivo, Valentina Tommasini, Magdalena Massar, Gianni Turcato, Arian Zaboli. Epilepsy Behav. 2025 May;166:110364. doi: 10.1016/j.yebeh.2025.110364. Epub 2025 Mar 12.
Commentary
“Alexa, does my patient have epilepsy?” “Siri, what type of seizure is this?”
High-performing artificial intelligence (AI) could, in theory, improve medicine in many ways. Automating processes could help time-strapped clinicians make faster decisions. Even where local expertise exists, wait times to see an epilepsy specialist can be long and diagnostic delays considerable.1 Another allure is that AI might not only speed up assessments but also render them more accurate than humans alone can achieve, regardless of expertise. Distinguishing epileptic from psychogenic seizures, acute provoked seizures from epilepsy, and drug refractoriness from pseudorefractoriness are all classification tasks fraught with misdiagnosis under usual care. Humans, unlike machines, are prone to many cognitive errors.2,3
Recent work by Brigo et al investigated the performance of ChatGPT in diagnosing and classifying seizures and epilepsy.4 Readers may already know ChatGPT as a large language model permeating many aspects of society, placing information at one's fingertips. Clinicians could feed in clinical information and instantly receive the model's educated guess regarding the patient's diagnosis.
In their study, ChatGPT was first trained by uploading the most trustworthy information we have: International League Against Epilepsy guideline documents.5–7 The model was then tested and tuned on numerous real-world clinical cases, with epileptologists providing feedback whenever its answers appeared incorrect. After training and tuning, an experienced epileptologist searched their hospital's electronic health record system to identify 37 unseen clinical cases for validation. These were adult cases emblematic of clinical practice, involving epilepsy but also nonepileptic events (eg, syncope, hypertensive crisis, and vertigo).
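For readers curious what such a workflow might look like in code, a minimal sketch follows. It is not the authors' actual setup: it assumes the OpenAI Python client, and the model name, system prompt, and vignette are purely illustrative.

```python
# Minimal sketch of sending a clinical vignette to a large language model for
# classification. Assumes the OpenAI Python client and an OPENAI_API_KEY
# environment variable; prompt wording and vignette are illustrative only.
from openai import OpenAI

client = OpenAI()

system_prompt = (
    "You are assisting with seizure classification. Using ILAE definitions, "
    "state whether the episode is epileptic or a mimic, the presumed onset "
    "(focal vs generalized), and whether criteria for epilepsy are met."
)
vignette = (
    "A 68-year-old woman with brief loss of consciousness while standing, "
    "preceded by lightheadedness and pallor, with rapid recovery."
)

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative model choice, not the study's configuration
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": vignette},
    ],
)
print(response.choices[0].message.content)
```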
How did the model perform? The Cohen's kappa (agreement beyond chance) was "almost perfect" when comparing ChatGPT versus the senior epileptologist for distinguishing epileptic seizures from mimics, determining seizure onset (eg, focal vs generalized), and identifying the presence of an epilepsy syndrome. However, it performed less well for etiology (eg, structural, metabolic, infectious), seizure type (eg, acute symptomatic vs unprovoked), and epilepsy diagnosis (yes/no). In the least accurate task (epilepsy diagnosis), ChatGPT detected all epilepsy cases (100% sensitivity). However, it overcalled 11 of 15 nonepilepsy cases (specificity 27%). This resulted in a 100% negative predictive value (a negative prediction successfully ruled out epilepsy) but only a 59% positive predictive value (a positive prediction did not guarantee epilepsy). Similarly for seizure type, the model identified all unprovoked seizures but overcalled many acute symptomatic seizures as unprovoked.
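For readers who wish to see how these metrics relate to one another, a minimal sketch follows; the 2x2 counts below are purely illustrative (every true epilepsy case flagged, most mimics overcalled) and do not reproduce the study's exact tally.

```python
def diagnostic_metrics(tp, fp, fn, tn):
    """Sensitivity, specificity, PPV, NPV, and Cohen's kappa from a 2x2 table
    comparing model predictions against the expert reference standard."""
    n = tp + fp + fn + tn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    ppv = tp / (tp + fp)
    npv = tn / (tn + fn)
    # Cohen's kappa: observed agreement corrected for chance agreement
    p_observed = (tp + tn) / n
    p_chance = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n**2
    kappa = (p_observed - p_chance) / (1 - p_chance)
    return {"sensitivity": sensitivity, "specificity": specificity,
            "PPV": ppv, "NPV": npv, "kappa": kappa}

# Hypothetical counts (NOT the study's data): no missed epilepsy cases (fn=0),
# but 11 of 15 nonepileptic mimics labeled as epilepsy (fp=11, tn=4).
print(diagnostic_metrics(tp=20, fp=11, fn=0, tn=4))
```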
With so many tasks and metrics, we must ask: which classification task is most critical, and what would be the most important metric of success? Diagnosing epilepsy would seem to be the most critical task, given its treatment implications. Furthermore, many antiseizure medications are broad-spectrum, and other aspects of counseling, safety, and tracking apply widely across seizure types and etiologies. It is therefore disappointing that ChatGPT showed such poor agreement with human experts at diagnosing epilepsy. The tradeoff between sensitivity and specificity is a common problem when designing diagnostic tests. For example, we are often taught during EEG training to prioritize not overcalling epileptiform discharges, as doing so may "commit" an "innocent" patient to antiseizure medications, whereas a borderline finding can still declare itself over time. Still, if one knows that an algorithm is highly sensitive, a negative prediction could be valuable: it rules out epilepsy, providing reassurance and avoiding overtreatment. One could even argue that tuning an algorithm to be maximally sensitive for epilepsy is desirable here, because a positive prediction could simply justify further scrutiny rather than represent a definitive diagnosis for treatment purposes. The high interrater agreement between human epileptologists in this study across all evaluated metrics could justify using such a tool as a referral decision-point, if this pattern were confirmed in future, larger studies.
Additionally, future work could investigate why those 11 nonepilepsy cases fooled the algorithm into overdiagnosing epilepsy and whether further tuning can correct this. It was interesting that the algorithm was particularly good at tasks the authors described as requiring less "nuanced clinical judgment." Presumably, descriptions of lateralizing semiology can be more unambiguous than the degree to which a seizure is provoked, or than a multimodal epilepsy diagnosis integrating history and diagnostic testing. To make AI maximally "intelligent," further development will be needed to capture such "nuances." Or perhaps the error lies simply in asking the wrong question: forcing the model to give a black-and-white answer to clinical scenarios that humans recognize inevitably involve shades of gray or posttest probabilities.
This work's goal resembles past efforts to automate the diagnosis and classification of seizures, with variations on a theme. As one example, investigators previously programmed an algorithm (EpiPick.org), built upon expert opinion, to distinguish between seizure types with therapeutic implications (eg, absences, myoclonic, and generalized tonic-clonic), incorporating user inputs about the patient's history (eg, gray matter lesion, lip smacking, jerks).8 While its specificity was likewise imperfect (64%), thus potentially overcalling epilepsy despite the inclusion of "red flag" historical questions, the investigators found fairly strong agreement between algorithm and expert seizure classification. Their primary intent was to create an online algorithm that could assist with classifying seizure types to guide optimal antiseizure medication selection, particularly in settings where epileptologist availability might be limited. This illustrates a fundamentally different approach from that of Brigo et al. Whereas Brigo et al fed ChatGPT official source documents and then tuned it, the EpiPick algorithm was programmed by hand according to expert consensus. Clearly other approaches are possible, such as multivariable predictive modeling that takes as its "gold standard" whether a subsequent unprovoked seizure actually recurred (rather than expert consensus). These many research approaches capture the current moment in biomedicine, in which innovation is proceeding in many directions in a race to solve diagnostic problems with technology.
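For illustration, a minimal sketch of that last, data-driven alternative follows: a multivariable model fit against seizure recurrence rather than expert opinion. The predictors, data, and library choice (scikit-learn) are entirely hypothetical and synthetic.

```python
# Sketch of a multivariable predictive model whose outcome is whether an
# unprovoked seizure recurred, not whether experts agreed on the diagnosis.
# Feature names and data are synthetic and purely illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 200
# Hypothetical binary predictors: abnormal EEG, structural MRI lesion, nocturnal event
X = rng.integers(0, 2, size=(n, 3))
# Hypothetical outcome: subsequent unprovoked seizure recurrence (the "gold standard")
y = rng.integers(0, 2, size=n)

model = LogisticRegression().fit(X, y)
new_patient = np.array([[1, 0, 1]])            # abnormal EEG, no lesion, nocturnal event
print(model.predict_proba(new_patient)[0, 1])  # predicted probability of recurrence
```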
Asking the computer for the patient's diagnosis could provide a rapid, objective way to pull in expert guidance, making care more efficient or determinations more accurate. While the current work relied on a small, selected sample of cases, it still provides an important contribution, likely foreshadowing much research to come as the epilepsy community deliberates exactly what problems we need technology to solve, and for whom.
Footnotes
Declaration of Conflicting Interests
The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author received no financial support for the research, authorship, and/or publication of this article.
