
To the Editor,
We read with great interest the study by Veeramani et al on artificial intelligence approaches to predict unplanned intubation after anterior cervical discectomy and fusion. 1 We agree that respiratory compromise can be devastating and that identification of high-risk patients is of paramount importance. We commend the authors for their efforts and wish to offer our insights. In our opinion, the main flaw of the presented work is its reliance on accuracy, area under the curve (AUC), and Brier score to show that these models are useful for predicting unplanned intubation. Given the highly imbalanced classes, these metrics are inappropriate and insufficient, and they could mislead readers.
Veeramani et al 1 are dealing with an imbalanced classification problem with only 283 real positive cases (.51%) and 54,219 real negative cases (99.49%). Thus, the focus during model development should be on the positive class, as models will be inherently biased towards patients who did not require reintubation. In such cases, the most informative metrics are recall (sensitivity) and precision (positive predictive value), and these are unfortunately missing from the paper. 2
Relying on accuracy as a measure of good performance is inappropriate in this dataset because of the accuracy paradox. 3-5 A model that predicts that no patient will undergo unplanned intubation will have 99.49% accuracy even though it misses every true positive. Obviously, such a model would not provide useful information and would be clinically unsound. Similarly, an AUC of .73 only tells us that there is a 73% probability that a randomly chosen reintubated patient was assigned a higher predicted risk than a randomly chosen non-reintubated patient. Note that this is not the probability that a patient requiring unplanned intubation is correctly classified (i.e., recall), or that a patient predicted to require reintubation will actually undergo reintubation (i.e., precision). 5 Lastly, the Brier score is also not very useful in this scenario. If the model estimates the unplanned intubation risk at .51% for a given patient and that patient does not require reintubation (the most likely outcome in this dataset), the patient's contribution to the Brier score will be almost perfect at (.0051 - 0)^2, or approximately .000026.
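To make the accuracy paradox concrete, the following minimal Python sketch scores two hypothetical no-skill models on the class counts quoted above: one that predicts "no intubation" for every patient, and one that assigns every patient the same base-rate risk. The function and variable names are ours, chosen for illustration only, and neither baseline is a model from the original study.

```python
# Class counts quoted in this letter; everything else is illustrative.
N_POS = 283      # patients who required unplanned intubation
N_NEG = 54_219   # patients who did not


def always_negative_metrics(n_pos, n_neg):
    """Score a hypothetical model that predicts 'no intubation' for everyone."""
    tp, fp = 0, 0            # it never predicts the positive class
    fn, tn = n_pos, n_neg    # so every true positive is missed
    accuracy = (tp + tn) / (n_pos + n_neg)
    recall = tp / (tp + fn)
    return accuracy, recall


def constant_risk_brier(n_pos, n_neg):
    """Brier score of a hypothetical model that assigns the base rate to everyone."""
    n = n_pos + n_neg
    p = n_pos / n
    # Positives contribute (p - 1)^2 each, negatives contribute (p - 0)^2 each.
    return (n_pos * (p - 1) ** 2 + n_neg * p ** 2) / n


accuracy, recall = always_negative_metrics(N_POS, N_NEG)
brier = constant_risk_brier(N_POS, N_NEG)
print(f"accuracy={accuracy:.4f} recall={recall:.4f} brier={brier:.5f}")
```

Under these assumptions, accuracy exceeds 99% while recall is zero, and the Brier score of a model with no discriminative ability is already close to zero. Neither number, taken alone, demonstrates clinical usefulness.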
For these reasons, the absence of data on the models’ recall and precision is a significant limitation. Predicting rare events with an imbalanced dataset is challenging, and advanced machine learning methods such as the Synthetic Minority Over-Sampling Technique can be applied, but these are beyond the scope of this letter. 6 We encourage the authors to construct a confusion matrix for each of the models, reporting the true positives, true negatives, false positives, and false negatives. Given that a false negative prediction can have severe consequences for the patient, the models need to be optimized towards a higher recall value, but without ignoring precision. We again commend the investigators for choosing to study such an important topic and hope these comments provide readers with additional perspective on the limitations of these algorithms before deciding to implement them in their practice.
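As an illustration of the reporting we are suggesting, the sketch below builds a confusion matrix from predicted risks at two decision thresholds and derives recall and precision from it. The ten labels and risks are invented for demonstration and do not come from the study; the helper names and thresholds are likewise hypothetical.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, fn, tn) for binary labels coded as 0 or 1."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp, fp, fn, tn


def recall_precision(tp, fp, fn):
    """Recall (sensitivity) and precision (positive predictive value)."""
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return recall, precision


def classify(risks, threshold):
    """Flag a patient as high risk when the predicted risk meets the threshold."""
    return [1 if r >= threshold else 0 for r in risks]


# Invented example data: 1 = unplanned intubation, 0 = none.
y_true = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]
risks = [0.30, 0.02, 0.10, 0.04, 0.01, 0.20, 0.03, 0.15, 0.02, 0.01]

results = {}
for threshold in (0.25, 0.05):
    tp, fp, fn, tn = confusion_counts(y_true, classify(risks, threshold))
    recall, precision = recall_precision(tp, fp, fn)
    results[threshold] = (tp, fp, fn, tn, recall, precision)
    print(f"threshold={threshold}: TP={tp} FP={fp} FN={fn} TN={tn} "
          f"recall={recall:.2f} precision={precision:.2f}")
```

In this toy example, lowering the threshold increases recall at the expense of precision, which is the trade-off that a confusion matrix makes visible and that a single accuracy figure hides.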
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
