Dear Editor,
We read with great interest the study by Ge et al. comparing long short-term memory (LSTM), bidirectional long short-term memory, convolutional neural networks (CNNs), hybrid CNN–LSTM, bidirectional encoder representations from transformers (BERT), and a hard-voting ensemble for three-class sentiment (low/medium/high) classification in 3325 doctor–patient consultations.1 We commend the authors for the breadth of models compared, and their attempt to pair performance with interpretability (attention maps and feature attributions) is notable. However, certain aspects limit the direct clinical use of the reported 75.5% accuracy.
The three sentiment severities are adopted directly from a publicly available, text-only repository not originally curated as a clinical affect or distress scale. In the absence of a clinical gold standard—such as expert communication raters or psycho-oncology scoring—the models risk learning lexical salience (e.g. “cancer,” “scan,” “blood test”) rather than clinically meaningful escalation signals.2 This reduces confidence in the high reported accuracy as evidence of clinical validity.
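One simple way to probe for such lexical shortcuts is to inspect the highest-weighted n-grams of a linear baseline trained on the same labels. The sketch below is illustrative only; the variables `texts` (consultation strings) and `labels` (low/medium/high) are hypothetical placeholders, and it assumes scikit-learn.

# Probe for lexical-salience shortcuts by listing the strongest n-grams per class.
# `texts` and `labels` are hypothetical placeholders for the corpus and its labels.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=5)
X = vectorizer.fit_transform(texts)
clf = LogisticRegression(max_iter=1000).fit(X, labels)

terms = np.array(vectorizer.get_feature_names_out())
for cls, coefs in zip(clf.classes_, clf.coef_):
    top = terms[np.argsort(coefs)[-10:][::-1]]
    # Clinical nouns such as "scan" dominating a severity class would suggest
    # the labels track topic vocabulary rather than escalation signals.
    print(cls, top)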
The authors oversampled only in the training folds, leaving validation and test folds imbalanced. This is good practice for preventing information leakage, but it makes the ensemble–transformer comparison partly a comparison between models that benefit from balanced synthetic data and models (such as BERT) that typically exploit natural, skewed distributions.3 This may limit the generalizability of the claim that the ensemble is “strongest,” because deployment settings will rarely match this fold structure. Reporting performance by consultation topic or subdomain would clarify whether the gain reflects true robustness or corpus homogeneity.
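For reference, the fold-internal oversampling pattern the authors describe can be expressed compactly; the minimal sketch below assumes scikit-learn and imbalanced-learn, whose pipeline applies samplers only when each training fold is fitted, so held-out folds keep their natural skew.

# Oversample inside each training fold only; validation folds stay imbalanced.
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline  # applies samplers at fit time only
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("oversample", RandomOverSampler(random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
# `texts` and `labels` are the same hypothetical placeholders as above.
scores = cross_val_score(pipe, texts, labels, cv=cv, scoring="f1_macro")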
The study shows a clinically relevant divergence: BERT achieved the best precision for low-severity interactions, whereas the ensemble improved recall. In real services, priorities differ—live safety monitoring needs high recall for high-severity cases, whereas retrospective quality auditing may prefer high precision.4 A single macro-averaged score therefore conceals use-case variability. It may be helpful to report operating points tailored to triage, telemedicine follow-up, and patient-experience mining.
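To make such operating points concrete, they can be derived from predicted probabilities for the high-severity class in a one-vs-rest fashion. The sketch below assumes scikit-learn; `proba_high`, `y_high`, and the 0.95 recall floor are illustrative assumptions, not values from the study.

# Choose use-case-specific operating points for the high-severity class.
# `proba_high` = predicted P(high) and `y_high` = binary high-severity labels
# on a held-out set (both hypothetical).
import numpy as np
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_high, proba_high)
# Live safety monitoring: the highest threshold still achieving recall >= 0.95
# (thresholds has one fewer entry than the precision/recall arrays).
ok = recall[:-1] >= 0.95
triage_threshold = thresholds[ok][-1] if ok.any() else thresholds[0]
# Retrospective quality auditing: the threshold that maximizes precision.
audit_threshold = thresholds[np.argmax(precision[:-1])]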
In summary, this study helpfully shows that conventional ensemble learning can rival transformer models for multiclass sentiment in clinical dialogue, but its clinical translation will depend on clinically grounded labeling, task-specific thresholding, and validation on consultation subsets that differ in linguistic complexity and emotional load.
Author contributions
Shyam Sundar Sah: conceptualization, methodology, writing—original draft, and writing—review and editing.
Abhishek Kumbhalwar
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Generative AI use statement
Generative AI tools, including Paperpal and ChatGPT 5, were utilized solely for language, grammar, and stylistic refinement. These tools had no role in the conceptualization, data analysis, interpretation of results, or substantive content development of this manuscript. All intellectual contributions, data analysis, and scientific interpretations remain the sole work of the authors. The final content was critically reviewed and edited to ensure accuracy and originality. The authors take full responsibility for the accuracy, originality, and integrity of the work presented.
