Abstract

A clinician is seeing a patient named James for the first time in years. James has bipolar disorder type 2, yet due to side effects from his first medication, sodium valproate, he has chosen to self-manage his symptoms with the help of a psychologist. However, James’ partner has recently fallen ill, and their first child is due in 2 months. He has been sleeping less and working more, while the dynamic range of his highs and lows has continued to widen. For the first time since diagnosis, the only form of respite presents itself in ego-dystonic thoughts of harm. Given his first and only experience with psychiatric medication, the hope of reprieve from a new class of drug seems like an unlikely but necessary panacea. While apprehensive, James decides to reach back out to his psychiatrist and discuss trying a new medication.
Scenarios such as these are all too common in clinical care: high-risk situations in which a patient needs fast and effective treatment. Unfortunately, this need often goes unmet, and patient adherence to first treatments often wavers. If attrition rates in the second and third treatment levels of the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) study are any indication, patient hopes for successful treatment are quick to fade (Rush et al., 2006). In the face of this dilemma, two solutions are commonly proposed. The first is the discovery of new pharmacological, biological or psychological interventions with novel mechanisms of action. The second is personalizing treatment selection among the interventions that we already have through the use of machine learning (ML) methodologies (Chekroud et al., 2016). In recent years, the utility of this second approach has been demonstrated, and a greater breadth of prognostic outcomes, from suicide prediction to rehospitalization, has been explored.
One area of psychiatry that has been quick to adopt ML methodologies is the field of neuroimaging (Kambeitz et al., 2017). Of interest, support vector machines (SVM) have been chosen in approximately 60–73% of current neuroimaging works of major depressive disorder (MDD) with theoretical or clinical justifications for model selection lacking (Kambeitz et al., 2017). This is a strange phenomenon as the purpose of moving to a statistical learning framework is to build translational models for clinical use while demonstrating the potential prognostic value of certain predictor modalities.
This observation raises the question: if ML models ever make it to clinical care, how might we actually use them? And how should this use inform the models we choose and how we build them in our current research efforts? Given the interpersonal nature of clinical practice, we deem it highly unlikely that ML applications alone will supersede the need for human practitioners in the foreseeable future. What is more likely is that ML models will exist in the form of clinical decision support systems that complement the professional judgement of clinicians. We therefore see the clinician/ML collaboration as a synergistic relationship, with clinicians calling on ML systems in high-risk scenarios when a second opinion is required. In this sense, we can think of this human/ML collaboration as forming its own meta-classifier, where a clinician’s original probabilistic prediction is combined with that of an ML model, subsequently updated, and used to make a binary decision derived from this new meta-probability estimate (treat with drug A? Yes/No). Robust probability calibration, defined as the degree of agreement between a model’s predicted probability of an event and the actual probability of that event, is therefore essential for this collaborative workflow (Niculescu-Mizil and Caruana, 2005).
An everyday example of probability calibration can be found when we check the weather. Here, we may see a statement about how likely it is that it will rain. If our weather app tells us that there is a 70% chance of rain, and on 70 out of 100 days that it makes this statement, it does in fact rain, we would say that our weather app is well calibrated. While predicting the weather and prognosticating patient outcomes are fundamentally different problems, they both share the same underlying properties of uncertainty. Therefore, a clinician should not just know how accurately an ML model will classify a prognostic or diagnostic endpoint, but the underlying probability of the respective endpoint itself. This probability can then provide a measure of confidence in a model’s binary prediction in the face of this uncertainty.
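The weather example can be checked mechanically. The following is a minimal sketch, assuming a hypothetical record of forecasts and outcomes (the numbers and the function name `reliability_table` are ours, for illustration only): forecasts are grouped into probability bins, and the mean forecast in each bin is compared with how often the event actually occurred.

```python
from collections import defaultdict


def reliability_table(forecasts, outcomes, n_bins=10):
    """Group forecasts into equal-width probability bins and compare the
    mean forecast in each bin with the observed event frequency."""
    bins = defaultdict(list)
    for p, y in zip(forecasts, outcomes):
        # Clamp the index so that a forecast of exactly 1.0 lands in the last bin.
        idx = min(int(p * n_bins), n_bins - 1)
        bins[idx].append((p, y))
    table = []
    for idx in sorted(bins):
        pairs = bins[idx]
        mean_forecast = sum(p for p, _ in pairs) / len(pairs)
        observed = sum(y for _, y in pairs) / len(pairs)
        table.append((mean_forecast, observed, len(pairs)))
    return table


# Hypothetical record (illustrative numbers only): the app said "70% chance
# of rain" on 100 days and it rained on 70 of them; it said "20%" on 50 days
# and it rained on 10 of them.
forecasts = [0.7] * 100 + [0.2] * 50
outcomes = [1] * 70 + [0] * 30 + [1] * 10 + [0] * 40

for mean_forecast, observed, n in reliability_table(forecasts, outcomes):
    print(f"forecast {mean_forecast:.2f}  observed {observed:.2f}  (n={n})")
```

When the two columns agree in every bin, the forecaster is well calibrated; plotting observed frequency against mean forecast gives a reliability curve. Libraries such as scikit-learn offer an equivalent helper (`sklearn.calibration.calibration_curve`).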
To reiterate this point, we can think about two levels of accuracy in an ML model (levels A and B). At the group level (A), a model that is 70% accurate makes correct predictions 70% of the time. At the individual level (B), a probability of 70% for ‘yes’ means that we are 70% sure that this individual prediction of group-level membership (A) is correct. If a model is poorly calibrated, the confidence we can have in this probabilistic estimate will be low (it may give a probability of 90% ‘yes’ although the true probability of ‘yes’ is only 60%). While it might still be true that the patient belongs to the ‘yes’ class, the model will be considerably overconfident in this group-level (A) classification. At present, many works in psychiatry only test whether (A) is true. Case in point, the ‘go to’ SVM (both linear and non-linear) chosen in the majority of applied neuroimaging ML works has some of the worst calibrated probabilities of any classical ML model, with its reliability curve tracing a sigmoid shape across the predicted probabilities on the X axis rather than following the diagonal (Figure 1, Linear: orange curve, Non-linear: purple curve). This statistical property is commonly observed in maximum-margin methods, which focus on hard samples that are close to a model’s decision boundary (Niculescu-Mizil and Caruana, 2005).

Figure 1. Reliability curves for commonly used classifiers in psychiatric machine learning studies. The closer a curve is to the line of equality (the straight dashed line), the better calibrated the ML model’s probabilities. All classifiers were trained and tested with default parameters on a simulated binary data set of 100,000 observations. Without post-processing calibration, logistic regression returns well-calibrated probabilities by default as it is optimized for log-loss, while other commonly used classifiers show varying non-linear reliability curves. SVM: support vector machine, RBF: radial basis function.
To make clear the clinical ramifications of this behaviour, imagine consulting James from the introductory example. A decision support system trained with a linear SVM predicts that he will not respond to lithium, the clinician’s next choice for treatment. In addition, the underlying probability of response behind this binary classification is 10%, suggesting that the model is highly confident that James will not respond (in comparison to giving James a 49% probability of response, for example). The clinician starts to second-guess their decision, especially given the promising accuracy of the model demonstrated in clinical trials. Imagine now that we calibrate the model. The model still classifies James as someone who will not respond to lithium; however, the calibrated probability of response is 47%, so the model is now barely more confident in non-response than in response. As the model is uncertain, the clinician now relies solely on their clinical expertise and prescribes lithium, leading to a successful treatment response for James.
Given this dissonance between clinical need and the probabilistic behaviour of commonly selected models, what can be done to mitigate this discrepancy? First, as logistic regression (LR) models directly optimize for log-loss, their probabilities are well calibrated by virtue of their underlying mathematics (Figure 1). In addition, adding L1 or L2 regularization terms to the objective function (known as LASSO and Ridge, respectively) constrains model fitting, and the L1 penalty, alone or in combination with L2 (known as the Elastic Net), embeds feature selection in the fitting procedure. Given that modern psychiatric data sets commonly contain more features than observations (known as large P small N problems), such a constraint is well suited to both feature selection and subsequent model fitting. In addition, as both LR and linear SVMs fit a linear decision boundary (a hyperplane in feature space) between data points, differing mainly in the loss they minimize (log-loss versus hinge loss), variation in accuracy between the two models should theoretically be small. Therefore, we might start with the simpler, well-calibrated LR model and only then move to a linear or radial basis function SVM (or other non-linear models) if a major improvement in classification accuracy is observed.
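For reference, one common way to write the penalized LR objective is given below. The notation is ours rather than taken from the cited works: binary labels y_i in {0, 1}, feature vectors x_i, intercept \beta_0, coefficients \beta, overall penalty strength \lambda and mixing weight \alpha. Other parameterizations exist and software packages differ in scaling conventions.

```latex
\min_{\beta_0,\,\beta}\;
-\frac{1}{N}\sum_{i=1}^{N}\Big[\,y_i \log p_i + (1-y_i)\log(1-p_i)\Big]
\;+\; \lambda\Big[\alpha\,\lVert\beta\rVert_1 + \tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^2\Big],
\qquad
p_i = \frac{1}{1+e^{-(\beta_0 + x_i^{\top}\beta)}}
```

The first term is the log-loss, which is why the fitted probabilities p_i tend to be well calibrated. Setting \alpha = 1 gives the LASSO (L1) penalty, \alpha = 0 gives Ridge (L2), and intermediate values give the Elastic Net.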
In situations where a clinically meaningful increase in accuracy is achieved compared to LR, attaining well-calibrated probabilities is still possible. First, a practitioner can directly optimize hyperparameters for log-loss rather than for metrics such as area under the receiver operating characteristic curve or balanced accuracy. Second, post-processing probability calibration methods are available and have demonstrated success when samples are large enough to include this procedure (Nixon et al., 2019). Commonly available procedures include sigmoid (Platt) scaling and isotonic regression. However, these require samples that are separate from those used in model training, imposing further sample size requirements on data sets that may already be limited. Alternatively, post-processing calibration can be nested; however, optimistic bias may theoretically arise due to similarity between the train and test set samples used in a nesting procedure. The balancing of these constraints needs to be considered when calibrating an ML model’s probabilities post hoc. Given these restrictions, a clinically meaningful increase in model accuracy may need to be seen to justify post-processing calibration. One solution to this problem would be to start with a regularized LR model, forming a baseline against which changes in subsequent calibration/classification accuracy trade-offs can be quantified (in the case of multiclass classification, a one-vs-rest or multinomial LR can be used). In addition, the statistical significance of the univariate coefficients forming the underlying multivariate pattern can be reported, providing insight to clinicians hoping to complement their own clinical judgement with a nuance that extends beyond continuous probability estimates and binary outputs.
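As an illustration of sigmoid post-processing calibration, the sketch below fits a two-parameter sigmoid that maps raw classifier scores (for example, SVM decision values) to probabilities, using a held-out calibration set. It is a simplified sketch and not a reference implementation: the scores and labels are hypothetical, the function names are ours, and plain gradient descent is used for brevity. In practice a library routine such as scikit-learn's `CalibratedClassifierCV` with `method="sigmoid"` or `method="isotonic"` would normally be preferred.

```python
import math


def fit_sigmoid_calibrator(scores, labels, lr=0.1, n_iter=5000):
    """Fit p = 1 / (1 + exp(-(a * score + b))) by gradient descent on the
    mean log-loss, using held-out scores not seen during model training."""
    a, b = 1.0, 0.0
    n = len(scores)
    for _ in range(n_iter):
        grad_a = grad_b = 0.0
        for s, y in zip(scores, labels):
            p = 1.0 / (1.0 + math.exp(-(a * s + b)))
            # Gradient of the log-loss with respect to a and b.
            grad_a += (p - y) * s
            grad_b += p - y
        a -= lr * grad_a / n
        b -= lr * grad_b / n
    return a, b


def calibrate(score, a, b):
    """Map a raw classifier score to a calibrated probability."""
    return 1.0 / (1.0 + math.exp(-(a * score + b)))


# Hypothetical held-out calibration set: raw decision scores and true labels.
held_out_scores = [-2.0, -1.5, -1.0, -0.5, -0.2, 0.1, 0.4, 0.8, 1.2, 2.0]
held_out_labels = [0, 0, 0, 1, 0, 1, 0, 1, 1, 1]

a, b = fit_sigmoid_calibrator(held_out_scores, held_out_labels)
for s in (-2.0, 0.0, 2.0):
    print(f"score {s:+.1f} -> calibrated probability {calibrate(s, a, b):.2f}")
```

Because the calibrator is fitted on samples that the classifier never saw, it consumes data that could otherwise have been used for training, which is the sample size cost discussed above.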
Finally, as a clinician hoping to form an understanding of a model’s calibration, what metrics should be of concern? The first commonly used scoring rule is log-loss, which compares a model’s predicted probabilities to its true binary class labels. Subsequently, a score is calculated that penalizes each probability estimate based on the distance from its expected class label (Niculescu-Mizil and Caruana, 2005). As the penalty is logarithmic, it offers a small score for small differences and a large score for large differences. These scores are averaged across samples, providing an aggregate score for model calibration. Another common scoring rule is known as the Brier score. This score measures the mean squared difference between the predicted probability assigned to a possible outcome and its true class labels (Niculescu-Mizil and Caruana, 2005). Therefore, the Brier score is akin to a mean squared error measurement of a model’s probability prediction. Consequently, for both log-loss and the Brier score, lower scores equate to better calibrated probabilities. Finally, plotting a reliability curve (Figure 1) provides an intuitive visualization of a model’s calibration, allowing a clinician or ML practitioner to derive a visual understanding of discrepancies between actual and predicted probability values from a model.
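Both scoring rules are simple to compute by hand. The sketch below uses a hypothetical cohort that mirrors the earlier example of a model reporting 90% ‘yes’ when the true rate is 60%; the data and helper names are ours, and probabilities are clipped away from 0 and 1 (an assumption made here to keep the logarithm finite).

```python
import math


def log_loss(y_true, p_pred, eps=1e-15):
    """Mean negative log-likelihood of the true labels (lower is better)."""
    total = 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1 - p)
    return -total / len(y_true)


def brier_score(y_true, p_pred):
    """Mean squared difference between predicted probability and label."""
    return sum((p - y) ** 2 for y, p in zip(y_true, p_pred)) / len(y_true)


def accuracy(y_true, p_pred, threshold=0.5):
    """Fraction of correct binary predictions at the given threshold."""
    hits = sum((p >= threshold) == bool(y) for y, p in zip(y_true, p_pred))
    return hits / len(y_true)


# Hypothetical cohort: 10 patients, 6 of whom respond ('yes' = 1).
y_true = [1] * 6 + [0] * 4
overconfident = [0.9] * 10  # reports 90% 'yes' for every patient
calibrated = [0.6] * 10     # reports 60% 'yes' for every patient

for name, probs in (("overconfident", overconfident), ("calibrated", calibrated)):
    print(name, accuracy(y_true, probs), log_loss(y_true, probs), brier_score(y_true, probs))
```

Both sets of predictions classify every patient as ‘yes’ and so have identical group-level accuracy, yet the overconfident probabilities receive worse (higher) log-loss and Brier scores. This is the distinction between levels (A) and (B) described earlier. scikit-learn exposes the same metrics as `sklearn.metrics.log_loss` and `sklearn.metrics.brier_score_loss`.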
In conclusion, we suggest that ML model selection needs to be driven and justified by future best practice clinical workflows. As we deem it plausible that these workflows will entail human/ML collaboration with the purpose of engaging in high-risk clinical decision-making, we recommend that a greater emphasis be placed on model calibration in future works. As a trade-off between calibration and rudimentary measures of accuracy may ensue, active thought surrounding clinical context should always form the basis of model selection. This context should be made explicit, drive the model selection process and be informed by clinical expertise. To form a basis to capture calibration/accuracy trade-offs, we suggest starting with basic penalized LR models that are well calibrated by virtue of their mathematics. If marginal trade-offs in calibration/accuracy are observed that are justifiable from the required clinical context, the progression to higher complexity models with poorer default calibration could be made. Hyperparameters could then be directly optimized for negative log-loss or calibrated with post-processing methods if required. Finally, while the go to linear SVM has established itself as a useful model in psychiatric classification studies, it is important to be aware of its poor default calibration and the implications this has for clinical care. We urge ML practitioners to apply thought to clinical context when selecting models in future works.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
