
Abstract

Background

Misdiagnoses of headache disorders are a serious issue. Therefore, we developed an artificial intelligence-based headache diagnosis model using a large questionnaire database in a specialized headache hospital.

Methods

Phase 1: We developed an artificial intelligence model based on a retrospective investigation of 4000 patients (2800 training and 1200 test dataset) diagnosed by headache specialists. Phase 2: The model’s efficacy and accuracy were validated. Five non-headache specialists first diagnosed headaches in 50 patients, who were then re-diagnosed using AI. The ground truth was the diagnosis by headache specialists. The diagnostic performance and concordance rates between headache specialists and non-specialists with or without artificial intelligence were evaluated.

Results

Phase 1: The model’s macro-average accuracy, sensitivity (recall), specificity, precision, and F values were 76.25%, 56.26%, 92.16%, 61.24%, and 56.88%, respectively, for the test dataset. Phase 2: Five non-specialists diagnosed headaches without artificial intelligence with 46% overall accuracy and 0.212 kappa for the ground truth. The statistically improved values with artificial intelligence were 83.20% and 0.678, respectively. Other diagnostic indexes were also improved.

Conclusions

Artificial intelligence improved the non-specialist diagnostic performance. Given the model’s limitations based on the data from a single center and the low diagnostic accuracy for secondary headaches, further data collection and validation are needed.

Keywords

Coronavirus disease 2019 (COVID-19), machine learning, migraine, smartphone application, telemedicine

Introduction

Headaches are a widespread public health problem (1). The International Classification of Headache Disorders, 3rd edition (ICHD-3) (2) includes migraine, tension-type headache (TTH), and trigeminal autonomic cephalalgias (TACs) as representative primary headaches. In Japan, migraine prevalence is 4.3–8.4% (3 –5), and 29.8–74.2% of patients with migraine headaches reported significant impairment in daily activities (1). Additionally, approximately 15–20% of the Japanese population have TTH, with 22.4–29.2% reporting that it affects their performance (6). Furthermore, approximately 3.5% of patients who visited headache clinics have TACs (7). The prophylactic and acute treatment of headache disorders depends on the type of primary headache disorder and requires a proper diagnosis based on ICHD-3 (8). However, in 2022, there were only 899 headache specialists in Japan for more than 10 million people experiencing migraine, or, one headache specialist for 11,000 individuals with migraine. This lack of headache specialists may lead to diagnostic delays, which can lead to increased risk of chronic migraines, treatment refractoriness, comorbidities, and medication overuse (9 –11).

Headache awareness campaigns can increase patients’ awareness regarding the condition (12,13), including the need for doctor consultation. Patients are gradually seeking medical attention; however, there remain only a few specialized headache clinics, and primary care doctors are often responsible for the initial consultation regarding headache disorders. Hence, the burden on primary care doctors is enormous. As a result, they are unable to allocate sufficient time to ask detailed questions regarding the patient’s history, and their diagnosis and knowledge of headache treatment are sometimes unsatisfactory, leading to misdiagnosis and patient dissatisfaction. Moreover, since March 2020, the COVID-19 pandemic has increased the need for telemedicine to avoid in-person consultations (14). Online clinics for most diseases officially started in Japan in April 2022 (15). As it is time efficient, telemedicine will continue to grow, particularly among patients with headaches because they are unable to take time off from work to consult doctors during the day on weekdays (12). As telemedicine becomes widespread, it will become even more critical to diagnose headache disorders based solely on medical interviews.

Automated headache diagnosis systems using artificial intelligence (AI) can solve this problem because AI-based diagnosis models can save the time usually taken for a medical interview while improving diagnostic accuracy (16,17). However, reportedly, there are only five models (17 –21) that can diagnose primary headache subtypes and have been evaluated in both training and test datasets (22). Furthermore, the efficacy of AI diagnosis models for non-specialist diagnosis performance has not been investigated and must be validated before social implementation. Specifically, no reliable and robust AI-based headache diagnosis models have been proven to improve the diagnostic performance of non-headache specialists.

Therefore, we aimed to develop an AI-based headache diagnosis model using a large questionnaire database in a specialized headache hospital. The primary objective was to prove the model’s efficacy for non-specialist diagnosis performance on headache disorders. Our AI diagnosis model can potentially solve the problem of undertreated and underdiagnosed headache disorders in daily clinical practice by non-specialists.

Materials and methods

Study design

This study consisted of two phases. Phase 1 involved developing an AI diagnosis model based on a retrospective investigation of 4000 patients (2800 training and 1200 test datasets) diagnosed by four headache specialists at Tominaga Hospital Headache Center. The model was developed using the 2800 patient questionnaire datasets with pre-processing, hyperparameter tuning, and cross-validation. We then tested its performance using the 1200 patients’ test dataset. In this phase, the model production was only based on the 2800 training dataset, and the performance was tested using the 1200 dataset, which was not treated in the model production. We then validated the model’s efficacy for accuracy in the 50 new patients in phase 2, separate from the 4000 patients in phase 1. Five non-headache specialists with 1–11 years of experience first diagnosed the headaches of 50 patients and then re-diagnosed them using AI. The ground truth was the diagnosis made by the four headache specialists, and the concordance rates between the headache specialists and non-specialists’ diagnoses with or without the AI model were evaluated (Figure 1).

Figure 1.

Study design.

Phase 1: Developing an AI diagnosis model

We retrospectively collected the questionnaire sheets and diagnoses of 4000 consecutive patients on their first visit to the Tominaga Hospital Headache Center from March 2013 to December 2021. The inclusion criterion was age ≥15 years; children <15 years were excluded from this study. From the questionnaire sheet, we retrieved 17 variables: age, sex, height, weight, age at headache onset, headache frequency, headache duration, site of headache, headache characteristics, headache severity, presence of aggravation by exercise, concomitant symptoms, presence of aura, times when the headache is most likely to occur, inducement of headache, use of acute medication, and family history. The patients completed the questionnaire themselves. The items in the questionnaire are listed in Table 1. Four headache specialists in Tominaga Hospital Headache Center made the final headache diagnosis after medical examination, appropriate radiological examination, and sufficient discussion. Headache diagnosis was based on the ICHD-3 (2) and was grouped into five categories:

Class 1; migraine and medication-overuse headache (MOH) (ICHD-3 code 1 and 8.2)

Class 2; TTH (code 2)

Class 3; TACs (code 3)

Class 4; other primary headaches (code 4)

Class 5; secondary headaches (other codes).

Table 1.

Headache questionnaire sheet.

Questions	Answers
01. Age	( ) y.o.
02. Sex	Male/Female
03. Height	( ) cm
04. Weight	( ) kg
05. Headache onset age	( ) y.o., ( ) days/months/years ago
06. Headache frequency	( ) times per month/year, Every day, For the first time
07. Headache duration	Always, Several days, One day, Half a day, 1–3 hours, Moment
08. Site of headache	Unilateral (right/left), Bilateral, Center, Different site, Around the eye, Front, Back, Side, Top, Craniocervical transitional
09. Headache characteristics	Pulsating, Constricting, Stabbing, Tingling, Grasped, Gouged out, Burning, Racking, Dull
10. Headache severity	Needs rest, Disturbing daily life without rest, Not disturbing
11. Presence of aggravation or improvement by exercise	Aggravation, Improvement, No change
12. Concomitant symptoms	Nausea or vomiting, Photophobia, Phonophobia, Osmophobia, Red eye, Lacrimation, Runny nose, Dizziness, Fatigue, Stiff shoulders, Neck pain, Numbness in the extremities
13. Presence of aura	Absent, Scintillating scotoma, Numbness in the extremities
14. Times when headaches are most likely to occur	Wake up, Morning, Afternoon, Evening, Sleeping, Holidays, Menstrual period
15. Inducement of headache	None, Lack of sleep, Too much sleep, Tired, Stressed, Tense, Hungry, Exercise, Drinking
16. Use of acute medication	Drug’s name: ( ); Frequency: ( ) times per day/month/year; Effectiveness: Very effective, Mild effective, Not effective
17. Does anybody in your family have a headache?	Yes/No

Patients were asked to check or fill out each item on the questionnaire.

Class 1 included codes 1.1, 1.2, and 1.1 + 1.3 but did not include 1.4–1.6. MOH (code 8.2) was not included in Class 5 but in Class 1. Most of class 2 consisted of codes 2.2 and 2.3, with a few 2.1. Class 2 did not contain code 2.4.

The AI-based model used the 17 questionnaire variables and predicted one of five diagnoses. We used PyCaret (https://pycaret.readthedocs.io/en/latest/index.html) to create the AI-based diagnosis model because it easily performs pre-processing, comparison of algorithms, and hyperparameter tuning. After the 4000-patient dataset was loaded into PyCaret in a Python notebook, PyCaret randomly divided the 4000 patients into training data of 2800 patients and test data of 1200 patients (7:3 ratio) using the “get_data” and “setup” commands. The model was built only on the 2800-patient training dataset, and its performance was tested using the 1200-patient test dataset. For pre-processing, z-normalization of numerical variables was performed. Using the 2800-patient training dataset, PyCaret built several predictive models with 10-fold internal cross-validation. The algorithm with the largest c-statistic (area under the curve of the receiver operating characteristic curve; AUC of ROC) was chosen after model comparison by the “compare_models” command among Light Gradient Boosting Machine, Random Forest, Linear Discriminant Analysis, Ridge Classifier, Extra Trees, Gradient Boosting Classifier, Logistic Regression, Ada Boost Classifier, Decision Tree, K Neighbors, Naive Bayes, Dummy Classifier, Support Vector Machine, and Quadratic Discriminant Analysis. The c-statistics were the averages of the 10-fold cross-validation. After the best model was chosen as described above, hyperparameter tuning was performed to maximize sensitivity using the “create_model,” “tune_model,” and “finalize_model” commands. Randomized search cross-validation with 10 iterations was applied during hyperparameter tuning. Finally, the 1200-patient test dataset, which was still untouched, was predicted using the final tuned model. The accuracy, sensitivity (recall), specificity, precision, F-values (Definition A; macro-average of F values for each class, Definition B; harmonic mean of average recall and precision), and macro c-statistics were used to evaluate the model’s performance. We used SHapley Additive exPlanations (SHAP) values (23) to understand why the AI output each patient’s diagnosis. The Python source code is available in online Supplementary File 1.
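The workflow described above can be sketched roughly as follows. This is an illustrative outline only and is not the authors' code (which is in online Supplementary File 1): the function name, the column name `diagnosis_class`, and the seed are assumptions, and the PyCaret import is placed inside the function so that the snippet can be read and loaded without PyCaret installed.

```python
def build_headache_classifier(df, target="diagnosis_class", seed=0):
    """Hypothetical sketch of the Phase 1 workflow using PyCaret 3.x.

    `df` is assumed to be a pandas DataFrame holding the 17 questionnaire
    variables plus a `target` column with the class label (1 to 5).
    """
    # Imported lazily so that defining this function does not require PyCaret.
    from pycaret.classification import (
        setup,
        compare_models,
        tune_model,
        predict_model,
    )

    # 7:3 train/hold-out split, z-score normalization, 10-fold cross-validation.
    setup(
        data=df,
        target=target,
        train_size=0.7,
        normalize=True,
        normalize_method="zscore",
        fold=10,
        session_id=seed,  # assumed seed, only for reproducibility
    )

    # Rank the candidate algorithms by cross-validated AUC and keep the best one.
    best = compare_models(sort="AUC")

    # Random-search tuning with 10 iterations, optimizing recall (sensitivity).
    tuned = tune_model(best, optimize="Recall", n_iter=10)

    # With no `data` argument, predict_model scores the hold-out split.
    holdout_predictions = predict_model(tuned)
    return tuned, holdout_predictions
```

In this sketch the hold-out set is scored before any call to `finalize_model`, because `finalize_model` refits the pipeline on the full dataset, hold-out included.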

Phase 2: Testing the efficacy of the AI diagnosis model

We then validated the efficacy of the AI model for non-specialists in 50 new patients, separate from the 4000 patients, as external validation. The external validation dataset consisted of 50 consecutive patients aged 15 years and older who visited, in February 2022, the same hospital as the 4000 patients in phase 1. We prospectively collected data from the 50 patients with headaches, including questionnaires and diagnoses by the specialists. These data were routinely collected during daily clinical practice. In addition to clinical practice, five non-headache specialists were recruited, and they diagnosed headache in the 50 patients solely on the basis of questionnaire sheets, without actually seeing the patients. The details of the five non-headache specialists are described in Table 2. Subsequently, the non-specialists re-diagnosed the same patients with AI support after a rest period of one hour to one week. The ground truth was the diagnosis made by the four headache specialists in clinical practice, and the concordance rates between headache specialists’ and non-specialists’ diagnoses with or without the AI model were evaluated. The accuracy, sensitivity (recall), specificity, precision, F-values (definitions A and B), and kappa were evaluated as performance indices. This process, phases 1 and 2, was determined based on the guidelines for developing and reporting machine-learning predictive models (22).

Table 2.

Detail of the five non-headache specialists in phase 2.

	Years of physician experience	Department	Main locations for usual headache clinical practice
Rater 1	4 years	Psychiatry	Psychiatric outpatient
Rater 2	11 years	General practice	Internal medicine outpatient and emergency room
Rater 3	1 year	Internal Medicine	Internal medicine outpatient and emergency room
Rater 4	3 years (First year of neurology- specialized training)	Neurology	Neurological outpatient and emergency room
Rater 5	3 years	Plastic surgery	Emergency room

Assuming a migraine prevalence of 57.6% in the Tominaga Hospital Headache Center, the permissible kappa was 0.41, and the expected kappa was 0.81. The required sample size was 20. We adjusted the final sample size by accounting for an estimated 2% of patients diagnosed with class 4, other primary headaches (code 4). Hence, we enrolled 50 patients to ensure that they included patients with diagnoses of other primary headaches. Sample size calculation (24) was performed online (https://rdrr.io/cran/irr/man/N.cohen.kappa.html).

Statistical analysis

Variables with a normal distribution are shown as the mean (standard deviation) (25), and those without a normal distribution are shown as the median (interquartile range). Concordance in diagnosis between the model and raters was measured using unweighted Cohen’s kappa statistics; unweighted kappa was selected because the outcomes were nominal variables. Kappa values were interpreted using Cohen’s recommendations (26) as “no agreement” for kappa ≤0, “none to slight agreement” for kappa = 0.01–0.20, “fair agreement” for kappa = 0.21–0.40, “moderate agreement” for kappa = 0.41–0.60, “substantial agreement” for kappa = 0.61–0.80, and “almost perfect agreement” for kappa = 0.81–1.00. The evaluation indices of the AI model and raters were assessed using specialists’ diagnoses as the reference standard (ground truth). The accuracy, sensitivity (recall), specificity, precision, and F-values (definitions A and B) were evaluated to examine whether the AI diagnosis model contributed to the improvement of the diagnostic ability of non-specialists, using the paired t-test for variables with normal distribution and Mann-Whitney U test for those without normal distribution. Statistical significance was defined as one-tailed p < 0.05. We used SPSS 28.0.0 (IBM Corp., Armonk, New York, USA), Python 3.9.0, PyCaret 3.0.0, SHAP 45b85c18, and Matplotlib 3.5.1.
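The agreement measure described above can be illustrated with a minimal sketch of unweighted Cohen’s kappa together with the interpretation bands quoted in the text. The function names are hypothetical and the code is not taken from the study’s supplementary material; it only restates the standard formula, kappa = (observed agreement − chance agreement) / (1 − chance agreement).

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Unweighted Cohen's kappa for two equal-length lists of nominal labels."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(labels_a)
    # Observed proportion of cases on which the two raters agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Agreement expected by chance from each rater's marginal label counts.
    counts_a, counts_b = Counter(labels_a), Counter(labels_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


def interpret_kappa(kappa):
    """Map a kappa value onto the agreement bands listed in the text."""
    if kappa <= 0:
        return "no agreement"
    bands = [
        (0.20, "none to slight agreement"),
        (0.40, "fair agreement"),
        (0.60, "moderate agreement"),
        (0.80, "substantial agreement"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "almost perfect agreement"
```

For example, `interpret_kappa(0.212)` falls in the "fair agreement" band and `interpret_kappa(0.678)` in the "substantial agreement" band.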

Ethical aspects

The Tominaga Hospital Ethics Committee approved this study (approval number: 120120). The requirement for written informed consent from 4000 patients was waived because of the study’s retrospective nature. Opt-out consent documents were presented on the Tominaga Hospital website (https://www.tominaga.or.jp/about/registration_and_trial/) for patients who did not wish to participate. Written informed consent was obtained from the 50 patients enrolled in the external validation, phase 2. All methods were performed in accordance with the relevant guidelines and regulations of the Declaration of Helsinki. Phase 1, developing an AI-based diagnosis model, was performed under the Strengthening the Reporting of Observational Studies in Epidemiology guidelines (27). Phase 2, external validation of the AI-based model’s efficacy, was performed under the Transparent Reporting of a multivariable prediction model for Individual Prognosis or Diagnosis guidelines (28).

Results

Patient characteristics

The diagnoses of the 2800 training and 1200 test data in phase 1 and the 50 test data in phase 2 are summarized in Table 3. There were no missing values. Diagnoses were made by headache specialists in Tominaga Hospital Headache Center; approximately 60% of the patients had migraine or MOH, 16% had TTH, 8% had TACs, 2% had other primary headaches, and 14% had secondary headaches. The mean patient age was approximately 41 years, and 64% of the patients were women. Baseline characteristics were not statistically different.

Table 3.

Patient characteristics.

	Phase 1		Phase 2
	Training data	Test data	Validation data
Number of cases	2800	1200	50
Mean age (standard deviation)	41.27 (17.69)	41.38 (18.27)	40.48 (19.14)
Sex (%Women)	64.80%	63.20%	66.0%
Class 1; Migraine or MOH	1597 (57.03%)	708 (59.00%)	30 (60.00%)
Class 2; TTH	522 (18.64%)	197 (16.42%)	8 (16.00%)
Class 3; TACs	244 (8.71%)	101 (8.42%)	4 (8.00%)
Class 4; Other primary headaches	55 (1.96%)	30 (2.50%)	1 (2.00%)
Class 5; Secondary headaches	382 (13.64%)	164 (13.67%)	7 (14.00%)

Abbreviations: TACs, trigeminal autonomic cephalalgias; TTH, tension-type headache; MOH, medication-overuse headache.

Phase 1: Developing an AI diagnosis model and its performance

Using the 2800 training dataset, PyCaret revealed that the light gradient boosting machine classifier (https://lightgbm.readthedocs.io/en/v3.3.2/) had the largest c-statistics of 0.9203 among the several algorithms (online Supplementary File 2). We performed hyperparameter tuning to optimize sensitivity (recall); the results are shown in online Supplementary File 3. The AI model’s performance for the 1200 test data in phase 1 is shown in the confusion matrix (Table 4) and illustrated in Figure 2A. The diagnostic performances for class 1 (migraine and MOH) and class 3 (TACs) were high. The accuracy, sensitivity (recall), specificity, precision, and F value for class 1 were 90.96%, 90.96%, 80.00%, 87.98%, and 89.44%, respectively, and for class 3 were 84.15%, 84.16%, 98.82%, 86.74%, and 85.42%, respectively. Overall, the average accuracy, sensitivity (recall), specificity, precision, and F values (definitions A and B) were 76.25%, 56.26%, 92.16%, 61.24%, 56.88%, and 73.58%, respectively. The ROC for each class and its AUC is shown in Figure 2B. The AUCs for classes 1 and 3 were 0.94 and 0.99. The micro-average AUC of the ROC was 0.95. The SHAP values are shown in Figure 2C. Monthly headache days, age, onset age, nausea, phonophobia, duration; approximately 3 hrs., sex, monthly acute medication intake days, photophobia, neck pain, aggravation by exercise, characteristics; constricting, site; unilateral, numbness in the extremities, family history, onset; morning, site; craniocervical transitional, characteristics; pulsating, characteristics; gouged out, and fatigue were important in that order.

Table 4.

Confusion matrix for Phase 1.

	Prediction by AI						Performance index
	Class 1; Migraine or MOH	Class 2; TTH	Class 3; TACs	Class 4; Other primary headaches	Class 5; Secondary headaches	Total	Accuracy	Sensitivity (Recall)	Precision	Specificity	F value
Ground truth by headache specialists
Class 1; Migraine or MOH	644	29	7	1	27	708	90.96%	90.96%	87.98%	80.00%	89.44%
Class 2; TTH	30	124	1	1	41	197	62.94%	62.94%	55.61%	90.13%	59.05%
Class 3; TACs	8	2	85	1	5	101	84.15%	84.16%	86.74%	98.82%	85.42%
Class 4; Other primary headaches	6	13	1	2	8	30	6.67%	6.67%	33.33%	99.66%	11.11%
Class 5; Secondary headaches	44	55	4	1	60	164	36.58%	36.58%	42.56%	92.18%	39.34%
Total	732	223	98	6	141	1200	76.25%	56.26%	61.24%	92.16%	56.88% and 73.58%†

Overall, the average accuracy, sensitivity (recall), specificity, precision, and F values (definitions A and B) were 76.25%, 56.26%, 92.16%, 61.24%, 56.88%, and 73.58%, respectively. Abbreviations: TACs, trigeminal autonomic cephalalgias; TTH, tension-type headache; MOH, medication-overuse headache.

†; the former is Definition A; macro-average of F values for each class, and the latter is Definition B; harmonic mean of average recall and average precision.
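As a worked illustration of how the indices in Table 4 are defined, the sketch below derives overall accuracy and the macro-averaged indices from the 5 × 5 confusion matrix using the textbook one-vs-rest definitions. The function and variable names are assumptions, and the code is not the authors' implementation; values recomputed this way are not guaranteed to match every published figure, for instance if a different averaging convention was used for a given index.

```python
# Rows: ground truth (classes 1 to 5); columns: AI prediction (classes 1 to 5).
CONFUSION = [
    [644, 29, 7, 1, 27],
    [30, 124, 1, 1, 41],
    [8, 2, 85, 1, 5],
    [6, 13, 1, 2, 8],
    [44, 55, 4, 1, 60],
]


def safe_div(numerator, denominator):
    """Return 0.0 instead of raising when a class has no relevant cases."""
    return numerator / denominator if denominator else 0.0


def macro_metrics(matrix):
    """Overall accuracy and macro-averaged one-vs-rest indices."""
    k = len(matrix)
    total = sum(sum(row) for row in matrix)
    recalls, precisions, specificities, f_values = [], [], [], []
    for i in range(k):
        tp = matrix[i][i]
        fn = sum(matrix[i]) - tp                       # missed cases of class i
        fp = sum(matrix[r][i] for r in range(k)) - tp  # other classes called i
        tn = total - tp - fn - fp
        recall = safe_div(tp, tp + fn)
        precision = safe_div(tp, tp + fp)
        recalls.append(recall)
        precisions.append(precision)
        specificities.append(safe_div(tn, tn + fp))
        f_values.append(safe_div(2 * precision * recall, precision + recall))
    avg_recall = sum(recalls) / k
    avg_precision = sum(precisions) / k
    return {
        "accuracy": sum(matrix[i][i] for i in range(k)) / total,
        "recall": avg_recall,
        "precision": avg_precision,
        "specificity": sum(specificities) / k,
        # Definition A: macro-average of the per-class F values.
        "f_definition_a": sum(f_values) / k,
        # Definition B: harmonic mean of average recall and average precision.
        "f_definition_b": safe_div(
            2 * avg_recall * avg_precision, avg_recall + avg_precision
        ),
    }


metrics = macro_metrics(CONFUSION)
```

Here `metrics["accuracy"]` is 915/1200 = 0.7625, the proportion of the 1200 test cases on the diagonal of the matrix.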

Figure 2.

AI model performance.

Phase 2: Efficacy of the AI diagnosis model

Separately from phase 1 (2800 training and 1200 test cases), five non-specialists interpreted the questionnaire sheets of 50 patients and assigned each patient one of the five diagnostic classes. Without AI, the overall accuracy and kappa for the ground truth were 46.00% and 0.212, respectively. With AI, these values improved significantly to 83.20% and 0.678, respectively. Other indices, such as sensitivity, specificity, precision, and F values, also significantly improved. Regarding each class, the sensitivity for class 1, sensitivity and specificity for classes 2 and 3, and specificity for classes 4 and 5 increased (Table 5). The diagnostic agreement using Cohen’s kappa coefficient between the five non-specialists without AI ranged from 0.023 to 0.362, indicating slight to fair agreement. With AI, however, the values increased to 0.606–0.874, suggesting substantial to almost perfect agreement. The kappa between the AI and the ground truth by specialists was 0.815 (Table 6).

Table 5.

Indices for diagnostic ability without AI and with AI of five non-specialists in phase 2.

	Rater 1	Rater 2	Rater 3	Rater 4	Rater 5	Mean of the 5 raters	AI	Rater 1 with AI	Rater 2 with AI	Rater 3 with AI	Rater 4 with AI	Rater 5 with AI	Mean of the 5 raters with AI	p value
Sensitivity for 1 migraine	56.67%	60.00%	56.67%	70.00%	43.33%	57.33%	100.00%	100.00%	100.00%	100.00%	100.00%	100.00%	100.00%	<0.001*
Specificity for 1 migraine	80.00%	70.00%	80.00%	80.00%	90.00%	80.00%	75.00%	75.00%	75.00%	70.00%	50.00%	75.00%	69.00%	0.065
Sensitivity for 2 TTH	75.00%	62.50%	37.50%	37.50%	37.50%	50.00%	100.00%	87.50%	87.50%	75.00%	75.00%	100.00%	85.00%	0.042*†
Specificity for 2 TTH	64.29%	90.48%	76.19%	85.71%	73.81%	78.10%	100.00%	92.86%	97.62%	100.00%	95.24%	100.00%	97.14%	0.006*
Sensitivity for 3 TACs	25.00%	0.00%	50.00%	0.00%	0.00%	15.00%	100.00%	100.00%	100.00%	75.00%	25.00%	100.00%	80.00%	0.041*†
Specificity for 3 TACs	91.30%	95.65%	80.43%	89.13%	89.13%	89.13%	100.00%	100.00%	100.00%	100.00%	100.00%	100.00%	100.00%	0.006*
Sensitivity for 4	0.00%	100.00%	0.00%	0.00%	0.00%	20.00%	0.00%	0.00%	0.00%	0.00%	0.00%	0.00%	0.00%	0.317†
Specificity for 4	100.00%	75.51%	95.92%	87.76%	71.43%	86.12%	100.00%	100.00%	97.96%	100.00%	100.00%	100.00%	99.59%	0.033*
Sensitivity for 5	14.29%	14.29%	0.00%	28.57%	14.29%	14.29%	42.85%	14.29%	14.29%	42.86%	0.00%	42.85%	22.86%	0.265
Specificity for 5	95.35%	97.67%	93.02%	93.02%	97.67%	95.35%	100.00%	100.00%	97.67%	95.35%	97.67%	100.00%	98.14%	0.016*
Accuracy	50.00%	50.00%	44.00%	52.00%	34.00%	46.00%	90.00%	84.00%	84.00%	84.00%	74.00%	90.00%	83.20%	0.001*
Average recall (sensitivity)	34.19%	47.36%	28.83%	27.21%	19.02%	31.32%	68.57%	60.36%	60.36%	58.57%	40.00%	68.57%	57.57%	0.009*
Average precision	40.71%	37.65%	24.44%	31.47%	31.62%	33.18%	96.43%	88.93%	64.64%	85.83%	62.50%	96.43%	79.67%	0.002*
Average specificity	86.19%	85.86%	85.11%	87.12%	84.41%	85.74%	95.00%	93.57%	93.65%	93.07%	88.58%	95.00%	92.77%	0.005*
Average F value (definition A)	37.57%	40.50%	40.63%	48.33%	35.76%	40.56%	88.08%	73.77%	75.51%	78.08%	66.90%	88.08%	76.47%	0.001*
Average F value (definition B)	37.16%	41.94%	26.45%	29.19%	23.75%	31.70%	80.14%	71.91%	62.43%	69.63%	48.78%	80.14%	66.58%	0.001*
kappa for ground truth	0.247	0.253	0.176	0.258	0.127	0.212	0.815	0.703	0.705	0.698	0.469	0.815	0.678	0.002*
Time (seconds)	1340	680	1592	2670	1243	1505	ー	600	643	855	1596	751	889	0.043*†

Abbreviations: AI, artificial intelligence; TACs, trigeminal autonomic cephalalgias; TTH, tension-type headache; MOH, medication-overuse headache.

*p < 0.05; †, variable with non-normal distribution.
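To make the paired comparison in Table 5 concrete, the following sketch computes the paired t statistic for the accuracy row (five raters, without versus with AI). It is a hedged illustration using only the standard library, not the SPSS procedure used in the study; the function name is hypothetical, and the critical value 2.132 is the standard one-tailed 5% point of the t distribution with 4 degrees of freedom.

```python
from math import sqrt

# Accuracy (%) of raters 1 to 5, taken from the "Accuracy" row of Table 5.
ACCURACY_WITHOUT_AI = [50.0, 50.0, 44.0, 52.0, 34.0]
ACCURACY_WITH_AI = [84.0, 84.0, 84.0, 74.0, 90.0]


def paired_t_statistic(before, after):
    """Return (t, degrees of freedom) for paired samples."""
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [a - b for b, a in zip(before, after)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    # Sample variance of the within-rater differences (n - 1 denominator).
    variance = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    if variance == 0:
        raise ValueError("all differences are identical; t is undefined")
    return mean_diff / sqrt(variance / n), n - 1


t_value, dof = paired_t_statistic(ACCURACY_WITHOUT_AI, ACCURACY_WITH_AI)

# One-tailed critical value at alpha = 0.05 for 4 degrees of freedom.
T_CRITICAL_ONE_TAILED_05_DF4 = 2.132
improved = t_value > T_CRITICAL_ONE_TAILED_05_DF4
```

The mean within-rater gain is 37.2 percentage points, and the resulting t statistic (about 6.72 with 4 degrees of freedom) exceeds the critical value, in line with the significant improvement reported in the table.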

Table 6.

Diagnostic agreement using Cohen’s kappa coefficient between five non-specialists, AI, and ground truth in Phase 2.

	Rater 2	Rater 3	Rater 4	Rater 5	Rater 1 with AI	Rater 2 with AI	Rater 3 with AI	Rater 4 with AI	Rater 5 with AI	AI	Specialist (ground truth)
Rater 1	0.023	0.129	0.255	0.032	0.249	0.204	0.15	0.259	0.203	0.203	0.247
Rater 2		0.096	0.245	0.092	0.231	0.328	0.298	0.249	0.269	0.269	0.253
Rater 3			0.071	0.216	0.181	0.254	0.197	0.222	0.192	0.192	0.176
Rater 4				0.362	0.272	0.341	0.303	0.260	0.275	0.275	0.258
Rater 5					0.035	0.151	0.163	0.045	0.129	0.129	0.127
Rater 1 with AI						0.788	0.699	0.606	0.830	0.830	0.703
Rater 2 with AI							0.829	0.612	0.874	0.874	0.705
Rater 3 with AI								0.651	0.871	0.871	0.698
Rater 4 with AI									0.611	0.611	0.469
Rater 5 with AI										1.000	0.815
AI											0.815

Abbreviations: AI, artificial intelligence.

There was disagreement between AI diagnosis and ground truth by headache specialists in six patients, including TACs (class 3), new daily persistent headache (NDPH) (class 4), sinusitis, meningitis, trigeminal neuralgia, and Chiari malformation (class 5). Among those, AI misdiagnosed NDPH as TTH (class 2) and others as migraine (class 1).

Discussion

In phase 1, we developed an AI-based headache diagnosis model using big data of 4000 patients (2800 training and 1200 test datasets). The overall diagnostic performance for the 1200 test dataset was tested; the diagnostic performance for class 1 (migraine and MOH) and class 3 (TACs) were high. Subsequently, in phase 2, we externally validated the efficacy of the AI-based model for non-specialists and found that the diagnostic performance and agreement were significantly improved when supported by AI. Our AI model missed secondary headaches and classified them as migraine or TTH. To our knowledge, this is the first study to report on the effectiveness of a robust AI-based headache diagnostic model.

Use of the AI diagnosis model in clinical settings

Diagnosing headache disorders requires thorough interviews with patients and a comprehensive decision algorithm. Therefore, we tested whether AI could substitute for clinical interviews. The diagnostic performance of our model for class 1 (migraine or MOH) and class 3 (TACs) was high as compared to class 2 (TTH), class 4 (other primary headache disorders), and class 5 (secondary headaches). This may be attributed to the small sample size. In addition, diagnosing secondary headaches requires more detailed information (29), such as neurological examination, radiological imaging, and laboratory tests. Furthermore, the headache specialists had access to the supplementary information regarding diagnoses for the 4000 patients, whereas the AI model only had access to their questionnaire data. The fact that the AI model achieved high diagnostic performance only by learning the medical questionnaire without radiological imaging or laboratory test results suggests the importance of medical interviews in headache clinical practice. On the other hand, magnetic resonance images (30) or electroencephalography (31) alone can distinguish patients with migraine from healthy volunteers, suggesting that such information is useful in identifying not only secondary headaches but also primary headaches. The diagnostic performance might have improved if we had used such additional information. Indeed, secondary headaches should be diagnosed based on their clinical course and causative diseases rather than headache features. However, our AI model missed the secondary headaches and classified them as migraine or TTH because the model did not learn any information other than the headache questionnaires.

Therefore, our approach could not completely replace physician-based diagnosis. Nonetheless, this study demonstrated the feasibility of developing an AI-based automated diagnosis method for headache disorders. In addition, our results might be used to inform or assist physicians by pre-screening or to increase the diagnostic accuracy of less specialized care providers. In particular, 100% sensitivity for class 1 (migraine or MOH) diagnosis and 100% sensitivity and specificity for class 3 (TACs) in phase 2 can support primary care doctors in deciding to consult headache specialists regarding probable patients with migraine or TACs. Furthermore, this model can be a powerful tool in clinical practice if secondary headaches are ruled out. On the other hand, one should consider that in settings where secondary headaches cannot be ruled out (e.g., lack of radiological and laboratory testing equipment availability), misdiagnosis is a possibility.

Previous reports on AI-based diagnostic models

Many computer- and broadly AI-based headache diagnosis models have been reported (16) and these diagnostic models should be evaluated in the training and test datasets to avoid overfitting (22). However, only five models (17 –21) have met this requirement. Furthermore, external validation using a different cohort should be performed (22) and only two studies (17,20) have performed external validation. In addition, the efficacy of AI diagnosis models for non-specialist diagnosis has not been investigated. Therefore, it remains unclear whether AI-based headache diagnosis models improve the diagnostic performance of non-headache specialists.

The previous five reports (17 –21) which met the criteria for developing AI diagnostic models (22) are summarized in Table 7. Yin et al. (19) developed a 2-class classifier using 81 variables to distinguish between migraine and TTH. The accuracy and sensitivity were greater than 90%. They used a case-based reasoning method and a genetic algorithm to calculate the distance between a new case and each case in the case library. The model required a long computing time, and the sample size was less than 1000 patients. Walters and Smitherman (18) developed a 2-class classifier using only four variables to diagnose migraines or other headaches from primary and secondary headaches. They used classical logistic regression with only four variables, holdout testing, and achieved over 90% sensitivity and specificity. However, the model was not a multiple-class classifier because it was a logistic regression model. Vandewiele et al. (20) reported a decision-tree-based 3-class classifier distinguishing migraine, TTH, and TACs from the primary headache disorder cohort. They did not perform tests during the model development. In addition, the validation accuracy was 98%, although only 32 cases were recruited. The 32 cases used for the validation could be simple and were not described in detail. Therefore, the diagnostic performance of this index is questionable. Kwon et al. (21) described a 5-class extreme gradient boosting classifier for diagnosing migraine, TTH, TACs, thunderclap, and epicranial headaches. They used 75 variables and discovered that the diagnostic performance for migraine alone was high. However, the macro-average diagnostic performance was not significantly high, making the 5-class classifier ineffective. Cowan et al. (17) used a decision tree with 135 variables for a 2-class classifier for migraine or other headache disorders. The development process was not described in detail; however, the validation performance was high, with approximately 90% accuracy, sensitivity, and specificity.

Table 7.

Previous reports on AI diagnosis for headache disorders.

Author	Year	Output by the model	Methods	Variables	Training sample number	Test sample number	Validation sample number	%Migraine	Accuracy	Sensitivity (recall)	Specificity	Precision	F value
Yin (19)	2015	2 class; Migraine or TTH	Case-based reasoning +Genetic algorithm	81	676	222	Not performed	76.10%	93.00%	97.02%	79.20%	93.14%	95.04%
Walters (18)	2016	2 class; Migraine or Other headache disorders	Logistic regression	4	887	942	Not performed	9.40%	92%	94%	92%	64%	93%
Vandewiele (20)	2018	3 class; Migraine, TTH, TACs	Decision tree	Not described	849	–	32	Not described	98%	98%	98%	Not described	Not described
Kwon (21)	2020	5 class; Migraine, TTH, TACs, Thunderclap headache, Epicranial headache	XGB	75	1286	876	Not performed	68.49%	58.60%†	58.70%†	85.64%†	65.28%†	58.64%†‡
Cowan (17)	2022	2 class; Migraine or Other headache disorders	Decision tree	135	–	–	212	62%	92%	89%	97%	98%	93%
Ours	2022	5 class; Migraine or MOH, TTH, TACs, Other primary headaches, Secondary headaches	LGBM	17	2800	1200	50	60.00%	90.00%	68.57%	95.00%	96.43%	88.08%

Abbreviations: AI, artificial intelligence; LGBM, light gradient boosting machine; TACs, trigeminal autonomic cephalalgias; TTH, tension-type headache; MOH, medication-overuse headache; XGB, eXtreme Gradient Boosting.

†, calculated in macro-average; ‡, F value by Definition A.
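As a hedged illustration of how macro-averaged indexes such as those marked with † in Table 7 can be derived, the following Python sketch computes one-vs-rest sensitivity, specificity, precision, and accuracy for each class from a confusion matrix and then averages them over the classes. The confusion matrix `toy_cm`, the function names, and the one-vs-rest definition of per-class accuracy are illustrative assumptions, not the data or code of any study in the table. Two common ways of forming a macro F value are shown; which of them corresponds to Definition A is not restated in this section.

```python
from statistics import mean
from typing import Dict, List


def safe_div(numerator: float, denominator: float) -> float:
    """Return numerator / denominator, or 0.0 when the denominator is zero."""
    return numerator / denominator if denominator else 0.0


def macro_metrics(cm: List[List[int]]) -> Dict[str, float]:
    """Macro-average one-vs-rest metrics; cm[i][j] = true class i predicted as j."""
    n_classes = len(cm)
    total = sum(sum(row) for row in cm)
    sens, spec, prec, acc, f_per_class = [], [], [], [], []
    for k in range(n_classes):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                              # class k missed
        fp = sum(cm[i][k] for i in range(n_classes)) - tp  # others called k
        tn = total - tp - fn - fp
        recall = safe_div(tp, tp + fn)
        precision = safe_div(tp, tp + fp)
        sens.append(recall)
        spec.append(safe_div(tn, tn + fp))
        prec.append(precision)
        acc.append(safe_div(tp + tn, total))  # assumption: one-vs-rest accuracy
        f_per_class.append(safe_div(2 * precision * recall, precision + recall))
    macro_p, macro_r = mean(prec), mean(sens)
    return {
        "accuracy": mean(acc),
        "sensitivity": macro_r,
        "specificity": mean(spec),
        "precision": macro_p,
        # Variant 1: average the per-class F values.
        "f_mean_of_class_f": mean(f_per_class),
        # Variant 2: harmonic mean of macro precision and macro recall.
        "f_from_macro_pr": safe_div(2 * macro_p * macro_r, macro_p + macro_r),
    }


# Hypothetical 3-class confusion matrix (rows: true class, columns: predicted).
toy_cm = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 2, 8],
]
m = macro_metrics(toy_cm)
for name, value in m.items():
    print(f"{name}: {value:.4f}")
```

Because the two F variants generally give different numbers, comparisons across studies are only meaningful when the same definition is applied to each.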

Our model has four distinct advantages. 1) The sample size during AI development was as large as 4000, resulting in high accuracy and robustness. 2) Our model was a 5-class classifier and achieved high diagnostic performance, although multiple-class classification is more difficult than 2-class classification. 3) Migraine and TACs were diagnosed with high sensitivity even though the cohort included many class 4 (other primary headaches) and class 5 (secondary headaches) patients, who may act as noise in the computing process. 4) We externally validated the model's usefulness for the diagnostic accuracy of non-specialists. Based on these strengths, our AI diagnosis model can potentially resolve the underdiagnosis and undertreatment of headache disorders in daily clinical practice by non-specialists.
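The agreement between non-specialists and the specialist ground truth in phase 2 was expressed as Cohen's kappa (26). A minimal sketch of that statistic, written for illustration only, is given below. The function name `cohens_kappa` and the ten example diagnoses are hypothetical and are not taken from the study data.

```python
from collections import Counter
from typing import Sequence


def cohens_kappa(rater_a: Sequence[str], rater_b: Sequence[str]) -> float:
    """Cohen's kappa for two raters who labelled the same cases."""
    if len(rater_a) != len(rater_b) or len(rater_a) == 0:
        raise ValueError("Both raters must label the same non-empty set of cases.")
    n = len(rater_a)
    # Observed agreement: fraction of cases with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of the raters' marginal label frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    labels = set(count_a) | set(count_b)
    expected = sum(count_a[c] * count_b[c] for c in labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one and the same label for every case
    return (observed - expected) / (1 - expected)


# Hypothetical example: specialist ground truth vs. one non-specialist.
truth = ["migraine"] * 5 + ["TTH"] * 3 + ["TACs"] * 2
non_specialist = ["migraine", "migraine", "migraine", "TTH", "TTH",
                  "TTH", "TTH", "migraine", "TACs", "migraine"]
kappa = cohens_kappa(truth, non_specialist)
print(f"kappa = {kappa:.3f}")
```

Unlike raw accuracy, kappa discounts the agreement expected by chance, which matters when one class such as migraine dominates the cohort.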

Smartphone applications and AI diagnosis

Recently, smartphone applications for recording headaches have become widespread. Migraine Buddy (Healint, Singapore; https://migrainebuddy.com/) keeps records of headache attacks that are as detailed as headache questionnaire sheets. M-sense (Newsenselab GmbH, Berlin, Germany; https://www.m-sense.de/en/), a digital migraine treatment program, can diagnose a single headache attack as migraine or TTH (32). Zutool (BellSystem24, Inc., Tokyo, Japan; https://zutool.jp/) and Tenkitsu Yohou (Weather News, Tokyo, Japan; https://weathernews.jp/s/pain/) provide weather-based headache attack forecasting (33). These data loggers, used as headache diaries, are beneficial for headache self-control. Furthermore, if our AI diagnosis model were installed in these applications, users could self-diagnose their headache disorders, thereby facilitating access to appropriate treatment (17). Smartphone applications can also collect large amounts of real-world data. Therefore, combining smartphone applications with our AI model could aid the development of more accurate diagnostic models in the future.

Limitations

This study has some limitations. The first is the lack of generalizability. The results were derived from data obtained from a single outpatient specialized headache hospital. The model primarily learns the diagnostic habits of the specialized headache hospital. It is unknown whether the same diagnostic performance can be achieved in general clinics (general practitioners or family doctors) that perform initial headache examinations and clinical practice. Thus, to show generalizability, our results need to be validated in independent cohort studies in different settings, such as general clinics and general neurological/neurosurgical, gynecological, and pediatric outpatient departments. In Japan, only 57.4% of people with migraine consult doctors. The most common settings for lifetime consultation were internist/primary care (34.4%), neurosurgeons (19.9%), headache specialists (7.9%), general neurologists (7.1%), obstetrician/gynecologist (3.0%), and occupational physician (1.8%) (34). Therefore, our model used data of 4000 patients who visited the specialized headache hospital, a setting that accounts for only 15.0% of migraine patients. The external validation of our models in non-headache specialized hospitals, which people with headache usually visit, should be examined in the future. Furthermore, the burden of headaches is even more serious in foreign countries, especially in developing countries (35), Europe (36) and the United States (37), where the prevalence rates are high. Validation in such countries, with different headache prevalence, headache medical resources, and clinical settings, is also required. Second, because we created this model only from the questionnaire sheets, we lacked information on detailed medication use, comorbidities, or dietary habits, which are factors that can trigger headaches. Apart from headache disorders, we did not acquire information on neurological symptoms, vital signs, or other medical history.
Third, we applied conventional machine learning approaches as the AI method in this study. Other recent approaches include Bayesian statistics, Monte Carlo simulations, and semi-mechanistic modeling. In addition, combining our approach with deep learning techniques, such as natural language processing (14), may yield a superior model. Fourth, in phase 2, there could be a bias from diagnosing the same patients twice within a relatively short time frame. No measures were taken to mitigate this, so the diagnostic performance in phase 2 may have appeared too high. Although reports of the same subject being tested twice by the same examiner exist (38), this bias cannot be completely eliminated because the same examiner makes both diagnoses. We will confirm the superiority of our AI model in future external validations to be conducted at other hospitals and clinics. Finally, a diagnostic tool is required to have high specificity for secondary headaches to avoid misdiagnosis of potentially life-threatening disorders. In order to rule out secondary headaches, it is desirable to develop an AI-based diagnostic tool that also takes into account information on the patient’s history, neurological findings, and radiological and laboratory test results.

Conclusions

We developed an AI-based headache diagnosis model using big data of 4000 patients. The diagnostic performance was high, especially for migraine and TACs. Subsequently, we validated the efficacy of the AI-based model for non-specialists. AI significantly improved diagnostic performance and agreement. This AI headache diagnosis model can potentially solve the problem of undertreated and underdiagnosed headache disorders among non-headache specialists. Given the model’s limitations based on the data obtained from a single center and the low accuracy for secondary headaches, further data collection and validation are needed.

Article Highlights

We developed an artificial intelligence-based headache diagnosis model with high performance based on big data of 4000 patients.

The model improved non-specialists’ diagnostic performance and inter-rater agreement.

The model can potentially resolve undertreated and underdiagnosed headache disorders in daily clinical practice by non-specialists.

Supplemental Material

Supplemental material, sj-pdf-1-cep-10.1177_03331024231156925 for Developing an artificial intelligence-based headache diagnostic model and its utility for non-specialists’ diagnostic accuracy by Masahito Katsuki, Tomokazu Shimazu, Shoji Kikui, Daisuke Danno, Junichi Miyahara, Ryusaku Takeshima, Eriko Takeshima, Yuki Shimazu, Takahiro Nakashima, Mitsuhiro Matsuo and Takao Takeshima in Cephalalgia

Supplemental material, sj-pdf-2-cep-10.1177_03331024231156925 for Developing an artificial intelligence-based headache diagnostic model and its utility for non-specialists’ diagnostic accuracy by Masahito Katsuki, Tomokazu Shimazu, Shoji Kikui, Daisuke Danno, Junichi Miyahara, Ryusaku Takeshima, Eriko Takeshima, Yuki Shimazu, Takahiro Nakashima, Mitsuhiro Matsuo and Takao Takeshima in Cephalalgia

Supplemental material, sj-xlsx-3-cep-10.1177_03331024231156925 for Developing an artificial intelligence-based headache diagnostic model and its utility for non-specialists’ diagnostic accuracy by Masahito Katsuki, Tomokazu Shimazu, Shoji Kikui, Daisuke Danno, Junichi Miyahara, Ryusaku Takeshima, Eriko Takeshima, Yuki Shimazu, Takahiro Nakashima, Mitsuhiro Matsuo and Takao Takeshima in Cephalalgia

Footnotes

Acknowledgments

We are thankful to the staff for supporting our work and data acquisition.

Author contributions

MK: drafting the article, statistical analysis, and artificial intelligence. DD, SK, JM, TS, and TT: data acquisition and critical advice about clinical headache practice. TN, YS, MM, RT, and ET: Data acquisition. TS and TT: supervision.

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

References

1. Matsumori, Ueda, Komori, et al. Burden of migraine in Japan: Results of the ObserVational Survey of the Epidemiology, tReatment, and Care Of MigrainE (OVERCOME [Japan]) Study. Neurol Ther 2022; 11: 205–222.

2. Headache Classification Committee of the International Headache Society (IHS). The International Classification of Headache Disorders, 3rd edition. Cephalalgia 2018; 38: 1–211.

3. Sakai, Igarashi. Prevalence of migraine in Japan: A nationwide survey. Cephalalgia 1997; 17: 15–22.

4. Katsuki, Yamagishi, Matsumori, et al. Questionnaire-based survey on the prevalence of medication-overuse headache in Japanese one city-Itoigawa study. Neurol Sci 2022; 43: 3811–3822.

5. Takeshima, Ishizaki, Fukuhara, et al. Population-based door-to-door survey of migraine in Japan: The Daisen Study. Headache 2004; 44: 8–19.

6. Suzuki, Ishikawa, Gomi, et al. Prevalence and characteristics of headaches in a socially active population working in the Tokyo metropolitan area -surveillance by an industrial health consortium. Intern Med 2014; 53: 683–689.

7. Imai, Yagi, Kuroda, et al. Clinical profile of cluster headaches in Japan: Low prevalence of chronic cluster headache, and uncoupling of sense and behaviour of restlessness. Cephalalgia 2011; 31: 628–633.

8. Headache Clinical Practice Guideline Development Committee. Clinical Practice Guideline for Headache Disorders 2021 (Japanese). Tokyo: Igaku-Shoin, 2021.

9. Buse, Armand, Charleston, et al. Barriers to care in episodic and chronic migraine: Results from the chronic migraine epidemiology and outcomes study. Headache 2021; 61: 628–641.

10. Katsuki, Kashiwagi, Kawamura, et al. The efficacy of Japanese herbal kampo medicine as an acute and prophylactic medication to treat chronic daily headache and medication overuse headache: Single arm retrospective study. Cureus 2022; 14: e25419.

11. Katsuki, Kawahara, Matsumori, et al. Questionnaire-based survey during COVID-19 vaccination on the prevalence of elderly’s migraine, chronic daily headache, and medication-overuse headache in one Japanese city—Itoigawa Hisui Study. J Clin Med 2022; 11: 4707.

12. Katsuki, Matsumori, Kawahara, et al. Headache education by leaflets distribution during COVID-19 vaccination and school-based on-demand E-learning -Itoigawa Geopark Headache Awareness Campaign-. Headache 2023; in press. doi: 10.1111/head.14472.

13. Carlsen, Westergaard, Bisgaard, et al. National awareness campaign to prevent medication-overuse headache in Denmark. Cephalalgia 2018; 38: 1316–1325.

14. Katsuki, Narita, Matsumori, et al. Preliminary development of a deep learning-based automated primary headache diagnosis model using Japanese natural language processing of medical questionnaire. Surg Neurol Int 2020; 11: 475.

15. Katsuki. The first case series from Japan of primary headache patients treated by completely online telemedicine. Cureus 2022; 14: e31068.

16. Woldeamanuel, Cowan RP. Computerized migraine diagnostic tools: a systematic review. Ther Adv Chronic Dis 2022; 13: 20406223211065236.

17. Cowan, Rapoport, Blythe, et al. Diagnostic accuracy of an artificial intelligence online engine in migraine: A multi-center study. Headache 2022; 62: 870–882.

18. Walters, Smitherman TA. Development and validation of a four-item migraine screening algorithm among a nonclinical sample: The Migraine-4. Headache 2016; 56: 86–94.

19. Yin, Dong, et al. A clinical decision support system for the diagnosis of probable migraine and probable tension-type headache based on case-based reasoning. J Headache Pain 2015; 16: 29.

20. Vandewiele, De Backere, Lannoye, et al. A decision support system to follow up and diagnose primary headache patients using semantically enriched data. BMC Med Inform Decis Mak 2018; 18: 98.

21. Kwon, Lee, Cho, et al. Machine learning-based automated classification of headache disorders using patient-reported questionnaires. Sci Rep 2020; 10: 14062.

22. Luo, Phung, Tran, et al. Guidelines for developing and reporting machine learning predictive models in biomedical research: A multidisciplinary view. J Med Internet Res 2016; 18: e323.

23. Lundberg, Lee S-I. A unified approach to interpreting model predictions. Adv Neural Inf Process Syst 2017; 30: 4768–4777.

24. Cantor AB. Sample-size calculation for Cohen’s kappa. Psychol Methods 1996; 1: 150–153.

25. Jaykaran, Yadav, Chavda, et al. Some issues related to the reporting of statistics in clinical trials published in Indian medical journals: A survey. Int J Pharmacol 2010; 6: 354–359.

26. Cohen. A coefficient of agreement for nominal scales. Educ Psychol Meas 1960; 20: 37–46.

27. von Elm, Altman, Egger, et al. The strengthening the reporting of observational studies in epidemiology (STROBE) statement: guidelines for reporting observational studies. J Clin Epidemiol 2008; 61: 344–349.

28. Zamanipoor Najafabadi, Ramspek, Dekker, et al. TRIPOD statement: a preliminary pre-post analysis of reporting and methods of prediction models. BMJ Open 2020; 10: e041537.

29. Aygun, Bildik. Clinical warning criteria in evaluation by computed tomography the secondary neurological headaches in adults. Eur J Neurol 2003; 10: 437–442.

30. Chong, Gaw, et al. Migraine classification using magnetic resonance imaging resting-state functional connectivity data. Cephalalgia 2017; 37: 828–844.

31. Aslan. Migraine detection from EEG signals using tunable Q-factor wavelet transform and ensemble learning techniques. Phys Eng Sci Med 2021; 44: 1201–1212.

32. Roesch, Dahlem, Neeb, et al. Validation of an algorithm for automated classification of migraine and tension-type headache attacks in an electronic headache diary. J Headache Pain 2020; 21: 75.

33. Sato, Ueyama, Morita, et al. [The epidemiological and clinical features of weather–related pain (TENKITSU) and development of prediction information service for the onset of pain] (Japanese). Pain Res 2021; 36: 75–80.

34. Hirata, Ueda, Komori, et al. Comprehensive population-based survey of migraine in Japan: results of the ObserVational Survey of the Epidemiology, tReatment, and Care Of MigrainE (OVERCOME [Japan]) study. Curr Med Res Opin 2021; 37: 1945–1955.

35. Steiner, Birbeck, Jensen, et al. The Global Campaign turns 18: a brief review of its activities and achievements. J Headache Pain 2022; 23: 49.

36. Stovner, Andree. Prevalence of headache in Europe: a review for the Eurolight project. J Headache Pain 2010; 11: 289–299.

37. Lipton, Scher, Kolodner, et al. Migraine in the United States: epidemiology and patterns of health care use. Neurology 2002; 58: 885–94.

38. Nishi, Yamashiro, Okumura, et al. Artificial intelligence trained by deep learning can improve computed tomography diagnosis of nontraumatic subarachnoid hemorrhage by nonspecialists. Neurol Med Chir (Tokyo) 2021; 61: 652–660.
