Dear Editor,
We read with great interest the study by Ge et al. comparing long short-term memory (LSTM), bidirectional long short-term memory, convolutional neural networks (CNNs), hybrid CNN–LSTM, bidirectional encoder representations from transformers (BERT), and a hard-voting ensemble for three-class sentiment (low/medium/high) classification in 3325 doctor–patient consultations.1 We commend the authors for the breadth of models compared, and their attempt to pair performance with interpretability (attention maps and feature attributions) is notable. However, certain aspects limit the direct clinical use of the reported 75.5% accuracy.
The three sentiment severities are adopted directly from a publicly available, text-only repository not originally curated as a clinical affect or distress scale. In the absence of a clinical gold standard—such as expert communication raters or psycho-oncology scoring—the models risk learning lexical salience (e.g. “cancer,” “scan,” “blood test”) rather than clinically meaningful escalation signals.2 This reduces confidence in the high reported accuracy as evidence of clinical validity.
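One simple way to probe for such lexical shortcuts is to inspect the highest-weighted n-grams of a linear baseline trained on the same labels. The sketch below is illustrative only; the variables `texts` (consultation strings) and `labels` (low/medium/high) are hypothetical placeholders, and it assumes scikit-learn.

# Probe for lexical-salience shortcuts by listing the strongest n-grams per class.
# `texts` and `labels` are hypothetical placeholders for the corpus and its labels.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

vectorizer = TfidfVectorizer(ngram_range=(1, 2), min_df=5)
X = vectorizer.fit_transform(texts)
clf = LogisticRegression(max_iter=1000).fit(X, labels)

terms = np.array(vectorizer.get_feature_names_out())
for cls, coefs in zip(clf.classes_, clf.coef_):
    top = terms[np.argsort(coefs)[-10:][::-1]]
    # Clinical nouns such as "scan" dominating a severity class would suggest
    # the labels track topic vocabulary rather than escalation signals.
    print(cls, top)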
The authors oversampled only in the training folds, leaving validation and test folds imbalanced. This is good practice for preventing information leakage, but it makes the ensemble–transformer comparison partly a comparison between models that benefit from balanced synthetic data and models (such as BERT) that typically exploit natural, skewed distributions.3 This may limit the generalizability of the claim that the ensemble is “strongest,” because deployment settings will rarely match this fold structure. Reporting performance by consultation topic or subdomain would clarify whether the gain reflects true robustness or corpus homogeneity.
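For reference, the fold-internal oversampling pattern the authors describe can be expressed compactly; the minimal sketch below assumes scikit-learn and imbalanced-learn, whose pipeline applies samplers only when each training fold is fitted, so held-out folds keep their natural skew.

# Oversample inside each training fold only; validation folds stay imbalanced.
from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline  # applies samplers at fit time only
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("oversample", RandomOverSampler(random_state=0)),
    ("clf", LogisticRegression(max_iter=1000)),
])
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
# `texts` and `labels` are the same hypothetical placeholders as above.
scores = cross_val_score(pipe, texts, labels, cv=cv, scoring="f1_macro")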
The study shows a clinically relevant divergence: BERT achieved the best precision for low-severity interactions, whereas the ensemble improved recall. In real services, priorities differ—live safety monitoring needs high recall for high-severity cases, whereas retrospective quality auditing may prefer high precision.4 A single macro-averaged score therefore conceals use-case variability. It may be helpful to report operating points tailored to triage, telemedicine follow-up, and patient-experience mining.
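To make such operating points concrete, they can be derived from predicted probabilities for the high-severity class in a one-vs-rest fashion. The sketch below assumes scikit-learn; `proba_high`, `y_high`, and the 0.95 recall floor are illustrative assumptions, not values from the study.

# Choose use-case-specific operating points for the high-severity class.
# `proba_high` = predicted P(high) and `y_high` = binary high-severity labels
# on a held-out set (both hypothetical).
import numpy as np
from sklearn.metrics import precision_recall_curve

precision, recall, thresholds = precision_recall_curve(y_high, proba_high)
# Live safety monitoring: the highest threshold still achieving recall >= 0.95
# (thresholds has one fewer entry than the precision/recall arrays).
ok = recall[:-1] >= 0.95
triage_threshold = thresholds[ok][-1] if ok.any() else thresholds[0]
# Retrospective quality auditing: the threshold that maximizes precision.
audit_threshold = thresholds[np.argmax(precision[:-1])]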
In summary, this study helpfully shows that conventional ensemble learning can rival transformer models for multiclass sentiment in clinical dialogue, but its clinical translation will depend on clinically grounded labeling, task-specific thresholding, and validation on consultation subsets that differ in linguistic complexity and emotional load.
Author contributions
Shyam Sundar Sah: conceptualization, methodology, writing—original draft, and writing—review and editing.
Abhishek Kumbhalwar
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of conflicting interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Generative AI use statement
Generative AI tools, including Paperpal and ChatGPT 5, were utilized solely for language, grammar, and stylistic refinement. These tools had no role in the conceptualization, data analysis, interpretation of results, or substantive content development of this manuscript. All intellectual contributions, data analysis, and scientific interpretations remain the sole work of the authors. The final content was critically reviewed and edited to ensure accuracy and originality. The authors take full responsibility for the accuracy, originality, and integrity of the work presented.
