Abstract
Objective
Mental health self-report and clinician-rating scales with diagnoses defined by sum-score cut-offs are often used for depression screening. This study investigates whether machine learning (ML) can detect major depressive episodes (MDE) based on screening scales with higher accuracy than best-practice clinical sum-score approaches.
Methods
Primary data were obtained from two RCTs on the treatment of depression. The ground truth was DSM-5 MDE diagnoses based on structured clinical interviews (SCID); the self-report PHQ-9 and the clinician-rated QIDS-16 and HAM-D-17 served as predictors. ML models were trained using 10-fold cross-validation. Performance was compared against best-practice sum-score cut-offs. The primary outcome was the Area Under the Curve (AUC) of the Receiver Operating Characteristic (ROC) curve. DeLong's test with bootstrapping was used to test for differences in AUC. Secondary outcomes were balanced accuracy, precision, recall, F1-score, and the number needed to diagnose (NND).
Results
A total of k = 1030 diagnoses (no diagnosis: k = 775; MDE: k = 255) were included. In the testing set, the ML models achieved an AUC of 0.94 for the QIDS-16, 0.88 for the HAM-D-17, and 0.83 for the PHQ-9. The ML AUC was significantly higher than the AUC of the sum-score cut-off approach for the QIDS-16 and the PHQ-9.
Conclusions
ML-augmented depression screenings could make a major contribution to improving MDE diagnosis, depending on the questionnaire used (e.g., QIDS-16). Confirmatory studies are needed before ML-enhanced screening can be implemented in routine care practice.
Introduction
In mental health care, accurate, psychometrically sound, and timely assessment and diagnosis are key to treatment and optimized health care pathways.1–5 To achieve this goal, different assessment and diagnosis approaches exist, ranging from structured clinical interviews (e.g. the SCID)6 to self-report instruments (e.g. the nine-item Patient Health Questionnaire (PHQ-9))7,8 and clinician rating scales (e.g. the 16-item version of the Quick Inventory of Depressive Symptomatology (QIDS-16)).4,9 Given the economic pressure on many health care systems,10–12 time-consuming diagnostic assessments like the SCID are often difficult to implement, which may contribute to the issue of under-recognition.13–16 A feasible, time-saving alternative to the SCID is the use of self-report or clinician-rated screening instruments.1,15,16
For such screening instruments, a cut-off value for the sum score of all items is traditionally determined by balancing sensitivity and specificity (e.g. via the optimal Youden Index, J = sensitivity + specificity − 1, on a Receiver Operating Characteristic (ROC) curve).7,17,18 An individual participant data analysis with
In recent years, machine learning (ML) has been applied in many areas, accompanied by a paradigm shift away from a priori human-defined solutions for problems like prediction tasks towards algorithms that automatically determine their parameters to reach an optimal solution.21,22 In short, such ML algorithms try to find an optimal function mapping the observed data and features to an output while balancing bias (i.e. prediction error) and variance (i.e. the performance difference between known training data and unknown testing data).22–24 To achieve this, various ML models (e.g. random forest, neural nets) and their hyperparameters (e.g. learning rate) are usually tested and tuned. ML has thereby been shown to achieve high accuracy in many prediction tasks21,22 and may also offer a way towards optimized diagnostic procedures.
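As a minimal illustration of this tuning logic, consider the following sketch using the tidymodels framework employed in this study; the object `train_data` and the outcome `mde` are hypothetical placeholders, and the authors' actual pipeline is available at https://osf.io/3hnvz/:

```r
# Minimal sketch (not the authors' exact pipeline): tuning a penalized
# logistic regression with 10-fold cross-validation in tidymodels.
# "train_data" and the binary outcome "mde" are hypothetical placeholders.
library(tidymodels)

spec <- logistic_reg(penalty = tune(), mixture = tune()) |>
  set_engine("glmnet")

folds <- vfold_cv(train_data, v = 10, strata = mde)

wf <- workflow() |>
  add_formula(mde ~ .) |>
  add_model(spec)

# Evaluate candidate hyperparameter values on the resamples by ROC AUC
res <- tune_grid(wf, resamples = folds, grid = 20,
                 metrics = metric_set(roc_auc))

# Refit the best configuration on the full training data
final_wf  <- finalize_workflow(wf, select_best(res, metric = "roc_auc"))
final_fit <- fit(final_wf, data = train_data)
```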
Given its high prevalence2,25 and impact on global health,26–29 ML-based diagnosis has been studied in particular for depression. However, while studies based on neuroimaging,30 microRNA,31 RNA,32 peripheral blood markers,33 and other biomarkers34 show promising results in some cases, the data for such biomarkers are often obtained with resource-intensive and costly methods. Hence, developing ML prediction models based on easily and cost-efficiently obtainable data may offer a more promising way to improve diagnosis. As such, widely available screening instruments like self-report questionnaires or clinician-rated instruments could be used in ML classifiers. However, it is currently unclear to what extent their performance can be optimized by ML compared to existing best-practice clinical cut-offs.
Therefore, the present study evaluates the potential of ML classifiers to detect a diagnosis of a major depressive episode (MDE) based on the items of commonly used screening instruments and compares their performance to established best-practice clinical sum-score approaches, in which a diagnosis is given if a score threshold is exceeded. Given the impact of the chosen threshold on accuracy (e.g. lower thresholds increase sensitivity and decrease specificity), both performance metrics across the spectrum of all possible thresholds and performance metrics for a single best-practice threshold (i.e. the optimal balance between sensitivity and specificity) need to be addressed. Accordingly, the following research questions will be answered:
Primary research question: Is there a significant difference between the Area Under the Curve (AUC) of the ROC across all possible class prediction thresholds in ML models and sum-score thresholds of the questionnaires? Secondary research question: Is there a difference in balanced accuracy, precision, recall, F1-score, and number needed to diagnose (NND) between ML models and sum-score approaches when the respective best threshold is applied (e.g. the best class probability threshold compared to the best-practice sum-score cut-off)?
Methods
Study design and sample
The present study is a secondary analysis of the large-scale pragmatic, observer-blinded randomized controlled trials WARD-BP and PROD-BP.35–38
All procedures have been approved by the ethics committee of the Albert-Ludwigs-University of Freiburg, Germany, and the data security committee of the German Pension Insurance. All participants provided written informed consent for participation in the studies. The studies were registered in the German Register for Clinical Trials (DRKS: DRKS00009272, registered on 14 September 2015; DRKS00007960, registered on 12 August 2015). Further details on the studies are provided in their study protocols and main result publications.35–38 The present study followed the STARD 39 and TRIPOD 40 reporting guidelines (see checklists in supplement 1).
Measures and outcomes
Analysis
We trained a range of candidate models to find the best mapping from the included features (e.g. questionnaire items) to the binary outcome of MDE status. Candidate models varied in: (a) model specification (e.g. random forest, logistic regression); (b) feature set (e.g. original items, extended features, demographic variables); (c) dimension reduction steps (e.g. principal component analysis); and (d) data imbalance handling (e.g. down-sampling). Please see below for a detailed description. A stepwise combination of (a)–(d) resulted in a total of 261 candidate models for each questionnaire. The preprocessing and analysis code are published on the Open Science Framework (https://osf.io/3hnvz/).
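For illustration, a single candidate combination from this grid could be specified roughly as follows; this is a sketch under assumed, hypothetical column names, not the published pipeline, with steps (c) and (d) mapped to `step_pca` and `step_downsample`:

```r
# Minimal sketch of one candidate combination from the (a)-(d) grid:
# random forest (a) on the original items (b) with PCA (c) and
# down-sampling of the majority class (d). Column names are hypothetical.
library(tidymodels)
library(themis)  # provides step_downsample() for class imbalance handling

rec <- recipe(mde ~ ., data = train_data) |>
  step_downsample(mde) |>                                  # (d) imbalance
  step_normalize(all_numeric_predictors()) |>
  step_pca(all_numeric_predictors(), num_comp = tune())    # (c) reduction

candidate <- workflow() |>
  add_recipe(rec) |>
  add_model(rand_forest(mtry = tune(), trees = 500) |>     # (a) model
              set_engine("ranger") |>
              set_mode("classification"))
```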
The ROC AUC of the best ML models after training was compared in the testing set against the AUC of varying sum scores for each of the PHQ-9, HAM-D-17, and QIDS-16 instruments according to their respective scoring procedures.7,9,42 DeLong's test for two correlated ROC curves with bootstrapping was used to test for differences in AUC.
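A minimal sketch of such a comparison with the pROC package (object names are hypothetical; the published analysis code is at https://osf.io/3hnvz/):

```r
# Minimal sketch: DeLong's test for two correlated ROC curves with pROC.
# "test_data", "mde", "ml_prob", and "sum_score" are hypothetical names.
library(pROC)

roc_ml  <- roc(test_data$mde, test_data$ml_prob)    # ML class probabilities
roc_sum <- roc(test_data$mde, test_data$sum_score)  # traditional sum score

# Paired comparison of the two AUCs on the same cases;
# a bootstrap variant is available via method = "bootstrap"
roc.test(roc_ml, roc_sum, method = "delong", paired = TRUE)
```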
Firstly, all three questionnaires have established clinical cut-offs.51,52 However, there is an ongoing debate about the best sum-score cut-offs for these instruments.17,53,54 For the present secondary analysis, we chose the following cut-offs: PHQ-9 ≥ 10,7,19 HAM-D-17 ≥ 8,54 and QIDS-16 ≥ 6.9
Secondly, since the optimal cut-offs for the PHQ-9, HAM-D-17, and QIDS-16 might differ in the present sample, we additionally determined the secondary outcome metrics for the best sample-specific cut-offs of the questionnaires in the training sample. Cut-offs were determined using the Youden Index.18,55
Thirdly, we calculated the optimal ML prediction threshold for the best ML candidate in the training set using the Youden Index18,55 and calculated the secondary outcomes in the testing set based on this threshold.
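For illustration, the Youden-optimal threshold (J = sensitivity + specificity − 1) could be derived in the training set roughly as follows; this is a sketch with hypothetical object names, applicable analogously to sum scores and to ML class probabilities:

```r
# Minimal sketch: Youden-optimal cut-off in the training set with pROC.
# "train_data", "mde", and "score" are hypothetical names.
library(pROC)

roc_train <- roc(train_data$mde, train_data$score)

# Returns the threshold maximizing J, with its sensitivity and specificity
coords(roc_train, x = "best", best.method = "youden",
       ret = c("threshold", "sensitivity", "specificity"))
```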
Software
All analyses were conducted in R. 56 The tidymodels framework was used for ML training and analysis. 46 For detailed session information for all R packages and versions see online supplement 5. Analysis code is available at: https://osf.io/3hnvz/.
Results
The total number of included diagnoses was k = 1030 (no diagnosis: k = 775; MDE: k = 255).

Figure: Included cases from WARD-BP and PROD-BP.
Primary outcome
Clinician-rated instruments
The trained QIDS-16 classifier with only the QIDS-16 items as predictors yielded an AUC = 0.935 in the testing set. In comparison, the traditional QIDS-16 sum score achieved an AUC = 0.901. DeLong's test showed a significant difference between the AUCs favoring the ML approach (ΔAUC = 0.04, 95%-CI: 0.02 to 0.05).
For the HAM-D-17, a random forest model with the original HAM-D-17 items achieved the best performance of all candidate models. While the ROC AUC of the HAM-D-17 ML model was also higher (AUC = 0.876) in the testing set compared to the sum-score approach (AUC = 0.873), the difference was not significant (ΔAUC = 0.00, 95%-CI: −0.02 to 0.03).
Self-report instrument
For the PHQ-9, a Naïve Bayes model based on the PHQ-9 items and the extended feature set significantly outperformed the traditional cut-off approach: the Naïve Bayes model achieved an AUC = 0.83 compared to an AUC = 0.82 for sum scores (ΔAUC = 0.01, 95%-CI: 0.00 to 0.02).
See Table 1 for an overview of the primary outcome across all three questionnaires.
Table 1. Primary performance comparison: AUC of the ROC curve.
Two-sided bootstrap test for ROC curves within the testing set, indicating the probability that the true difference in AUC is different from 0.
Logistic regression model containing only the QIDS-16 items as features.
Logistic regression model containing QIDS-16 items, the extended feature set (e.g. mean of items), age, and gender.
Secondary outcomes
For further comparison, balanced accuracy, precision, recall, F1-score, and the NND were determined for the performance in the testing set following the rationale of best clinical thresholds, best sample-specific thresholds (QIDS-16 ≥ 9, HAM-D-17 ≥ 11, PHQ-9 ≥ 9), and best class probability thresholds in the training set (QIDS-16: 0.28, HAM-D-17: 0.32, PHQ-9: 0.01).
Overall, the ML classifiers outperformed established clinical cut-offs in balanced accuracy. While precision or recall as a stand-alone metric favored established cut-offs in some cases, the harmonic mean of precision and recall (the F1-score) and balanced accuracy always favored the ML classifiers. Additionally, the NND favored the ML classifiers. For a summary of all secondary outcomes, see Table 2; a worked sketch of these metrics follows the table notes below. For the calculation of additional performance metrics, the confusion matrices can be found in supplement 6.
Table 2. Secondary performance metrics.
Logistic regression model containing only the QIDS-16 items as features.
Logistic regression model containing QIDS-16 items, the extended feature set (e.g. mean of items), age, and gender.
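All secondary metrics reduce to simple functions of the 2 × 2 confusion matrix; the following worked sketch uses hypothetical counts (the actual matrices are reported in supplement 6):

```r
# Minimal sketch: secondary metrics from a 2x2 confusion matrix.
# The counts below are hypothetical; see supplement 6 for the real matrices.
tp <- 40; fp <- 10; fn <- 11; tn <- 145

sens <- tp / (tp + fn)                  # recall / sensitivity
spec <- tn / (tn + fp)                  # specificity
prec <- tp / (tp + fp)                  # precision
bacc <- (sens + spec) / 2               # balanced accuracy
f1   <- 2 * prec * sens / (prec + sens) # harmonic mean of precision & recall
nnd  <- 1 / (sens + spec - 1)           # number needed to diagnose = 1 / Youden's J
```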
Discussion
The present study evaluated the potential of ML to enhance the performance of clinical assessment tools to detect MDE based on clinician-rated (QIDS-16, HAM-D-17) and self-report (PHQ-9) questionnaire data obtained in a pragmatic health care setting. The present results demonstrated that ML models like penalized logistic regression or naïve Bayes can significantly increase the AUC of the ROC curve for the clinician-rated QIDS-16 and the self-report PHQ-9 compared to traditional sum-score approaches.
A key issue in the field that limits ML in achieving high accuracy, and potentially explains the differences in findings across the questionnaires evaluated here, is the noise and measurement error of the underlying questionnaires and items (e.g. caused by social desirability bias, recall bias, systematic and random measurement error, or confirmation bias).57–59 In particular, patient-reported outcome measures are prone to some biases (e.g. social desirability) compared to instruments rated by trained clinicians (e.g. the QIDS-16), which may contribute to the difference in ML-based improvement between the QIDS-16 and the PHQ-9. However, eliminating some sources of bias through a clinician rating does not guarantee a reliable assessment. For instance, the reliability of the HAM-D-17 has been shown to be questionable, in particular for some items.44,60,61 Contributing to this, the HAM-D-17 is characterized by less precise wording regarding the frequency and duration of symptoms compared to instruments like the PHQ-9 or QIDS-16 (e.g. the HAM-D-17 mood item only asks whether feeling states of sadness are reported, without categorizing frequency or duration, whereas the PHQ-9 and QIDS-16 offer specified responses on whether sadness occurs "at all", "several times", "more than half the times", "nearly all the time").9,42 In addition, core symptoms of depression according to the DSM-5 6 or ICD-11 62 (e.g. hypersomnia or hyperphagia) are missing in the HAM-D-17, potentially further explaining an increase in noise and measurement error compared to the PHQ-9 and QIDS-16.4,9,42 Applying more precise and highly reliable assessments in clinical practice, such as computer-adaptive tests, may provide an opportunity to increase the measurement quality of questionnaires and thereby the potential of ML in the future.63–66
Furthermore, the present study could be expanded by including other data sources such as electronic health records, molecular biosignatures (e.g. epigenomics), environmental data (e.g. lifestyle), physiological data, and smart-sensing and digital phenotyping data, which provide objective behavioral data from sensors in omnipresent smart devices in our daily life (e.g. smartphones, smart watches, and other wearables). Such sources might further augment the predictive accuracy of ML algorithms and maximize the potential of complex data and ML.34,67–75 However, this goes hand in hand with higher effort and costs of data collection. In particular, the QIDS-16 might be suited to provide ML-optimized MDE predictions even without such an extension. That said, for health care providers (e.g. physicians) who have access to various sources of health records, the combination of different sources in an ML-augmented expert system assisting them in the diagnosis of mental (and other) disorders could be feasible and promising.75
In the future, the application of ML in depression screening, providing an instant prediction of depression status based on questionnaire data alone (e.g. the QIDS-16) or further augmented by objective sensing data, health record data, or biological parameters, may offer a way towards an accurate, time-efficient, and thus feasible depression screening, leading to earlier diagnosis and treatment initiation in resource-limited mental health care.10,11,13,14 If proven to be effective, this could have a major impact on highly prevalent mental disorders like depression and on public health.
However, some limitations of the present study must be considered and targeted in future studies before ML-optimized depression screening can be implemented in clinical settings. Most importantly, the present study is an exploratory secondary analysis. We used 10-fold cross-validation,76,77 so the algorithms were trained on a subset of the data (exploratory) and the final performance of the trained models was evaluated on previously unseen data (confirmatory). While this procedure mirrors the logic of a confirmatory study to some extent, it cannot replace confirmatory experimental studies that eliminate other potential sources of bias. Clinical randomized controlled trials (e.g. comparing ML-enhanced routine care screening of MDE against screening as usual) are of utmost importance before ML-augmented screening tools or expert systems reach routine care settings. Also, it must be ensured that any resulting ML-enhanced medical application is fair and unbiased.78–82 While the exploration of age and gender as additional features in the present analysis provided no meaningful improvements, which speaks for generalizability, further evidence is needed that the benefit of ML in screening accuracy and performance holds across genders, ages, and ethnic and cultural populations.

Speaking of generalizability, the setting of the included RCTs also needs to be considered. While the diagnoses and data of the present study were obtained in a pragmatic health care setting, and the exploration of ML-enhanced screening procedures based on real-world health care data is a strength of this study, the setting at hand is also very specific: all participants were recruited in orthopedic rehabilitation centers.35–38 Replication of the present findings in different settings is highly needed. Nonetheless, by highlighting the potential of ML-enhanced screening tools for questionnaires widely used in clinical practice (i.e. the QIDS-16), the present study makes an important first step toward clinical trials and lays the foundation for future clinical implementation.

Building on this, future studies could also move away from binary classification (e.g. depression-free vs. MDE) towards a multi-class approach to also support differential diagnosis (e.g. distinguishing between various mental disorders with similar symptoms). However, it also needs to be highlighted that the current nosology in clinical practice only insufficiently reflects the heterogeneous nature of mental disorders.83–85 For instance, over 1000 unique symptom profiles have been reported in 3703 outpatients, with the most common profile occurring in only 1.8% of cases.84 Furthermore, the poor reliability of diagnoses86 and high comorbidity, as a potential indicator of unitary disorders being split into various diagnoses,83 pose major limitations for research and practice. Investigating how ML can be applied within different systems (e.g. the Hierarchical Taxonomy of Psychopathology83 or the Extended Evolutionary Meta-Model87,88) to predict clinically relevant symptomatology and inform treatment processes would be a valuable addition to this study.
Conclusions
The present study evaluated the potential of ML to increase the performance of screening instruments for MDE. ML improved the performance of clinician-rated and self-report instruments (QIDS-16, PHQ-9). In particular, the optimization of the QIDS-16 indicated high clinical relevance. If proven effective in confirmatory studies, implementing ML-enhanced screening tools in clinical practice could significantly improve diagnostic procedures. Given that a timely and efficient diagnosis is key in health care, ML applications in mental health care may lay the foundation for optimized mental health care in the future. However, further research on how the potential of ML can be exploited (e.g. by including more data sources like socio-demographics, health records, biomarkers, and smart-sensing data) and on its implementation into routine care is highly needed.
Acknowledgement
Contributorship
Conceptualization: YT and HB; Methodology: YT; Software: YT; Validation: YT, HB; Formal Analysis: YT; Investigation: YT, HB, LBS, DDE; Resources: HB, DDE; Data Curation: YT, LBS, HB, DDE; Writing—Original Draft: YT; Writing—Review and Editing: YT, LBS, DDE, HB; Visualization: YT; Supervision: HB; Project Administration: YT, HB; Funding Acquisition: HB, DDE, see funding statement for details.
Ethical approval
The present study is a secondary analysis of the large-scale pragmatic, observer-blinded randomized controlled trials WARD-BP and PROD-BP.35–38 All procedures have been approved by the ethics committee of the Albert-Ludwigs-University of Freiburg, Germany, and the data security committee of the German Pension Insurance. All participants provided written informed consent for participation in the studies.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The present study was self-funded by the authors. The included RCTs providing the primary data were funded by the German Federal Ministry of Education and Research (grant number: 01GY1330A).35–38
Data availability statement
The primary data obtained from the RCTs can be provided by the principal investigator of these trials (HB) on reasonable request. Data-sharing agreements (including the researchers and institutions from the two RCTs) may have to be signed depending on the request. Data and analysis script are available under a CC-BY Attribution 4.0 International license at https://osf.io/3hnvz/. Requests regarding the analysis should be directed to the corresponding author (YT). Support from HB and the corresponding author depends on available resources.
Supplemental material
Supplemental material for this article is available online.
References