Abstract
Sentiment analysis is a fundamental task in Natural Language Processing (NLP) because it allows systems to group texts by the subjective opinions they express. Although progress has been made in multilingual sentiment classification, the Kashmiri language remains underrepresented because labeled datasets and pre-trained models do not exist for it. This study presents EnsembleSenti-Kash, a stacked ensemble learning system for Kashmiri sentiment analysis. To address the lack of resources, a labeled Kashmiri sentiment dataset was manually developed for this study, along with a custom stopword file to improve text preprocessing. TF-IDF vectorization was used for feature extraction. The proposed model combines Support Vector Machine (SVM), Random Forest (RF), XGBoost, and Logistic Regression (LR) as base learners, with a Logistic Regression classifier as the meta-classifier. In addition to the ensemble model, individual classifiers (SVM, RF, LR) were trained and evaluated, achieving accuracies of 93.1%, 91.5%, and 92.75%, respectively. The ensemble model outperformed all individual classifiers, with an overall accuracy of 93.35%. To strengthen the analysis, deep learning models, including Long Short-Term Memory (LSTM) and multilingual BERT (mBERT), were also evaluated as baselines. A real-time usability study reporting model size, number of parameters, and inference time showed that the proposed model offers a practical balance between accuracy and efficiency. The performance of all models under study was also confirmed through AUC-ROC evaluation. This research lays important groundwork for Kashmiri NLP by providing a new dataset and demonstrating an effective sentiment classification approach.
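The architecture described in the abstract (TF-IDF features, several base learners, and a Logistic Regression meta-classifier) can be illustrated with a minimal scikit-learn sketch. This is not the authors' implementation. The function name `build_stacked_sentiment_model`, all hyperparameters, the two-class label set, and the toy English sentences are assumptions made for illustration. `LinearSVC` is used as the SVM, and `GradientBoostingClassifier` stands in for XGBoost so that the sketch depends only on scikit-learn.

```python
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC


def build_stacked_sentiment_model(stop_words=None, cv=5, seed=0):
    """Hypothetical helper: TF-IDF features feeding a stacked ensemble.

    stop_words may be a list of words (for example, loaded from a custom
    stopword file) or None to keep every token.
    """
    base_learners = [
        ("svm", LinearSVC(random_state=seed)),
        ("rf", RandomForestClassifier(n_estimators=100, random_state=seed)),
        # Assumption: stand-in for XGBoost to avoid an extra dependency.
        ("gb", GradientBoostingClassifier(random_state=seed)),
        ("lr", LogisticRegression(max_iter=1000)),
    ]
    stack = StackingClassifier(
        estimators=base_learners,
        # The meta-classifier is trained on cross-validated outputs
        # of the base learners.
        final_estimator=LogisticRegression(max_iter=1000),
        cv=cv,
    )
    return make_pipeline(TfidfVectorizer(stop_words=stop_words), stack)


# Toy English placeholders, NOT the Kashmiri dataset from the study.
texts = [
    "great film loved it", "wonderful happy day",
    "excellent service very good", "loved the food great taste",
    "happy with good result", "wonderful excellent experience",
    "terrible film hated it", "awful sad day",
    "poor service very bad", "hated the food awful taste",
    "sad with bad result", "terrible poor experience",
]
labels = ["positive"] * 6 + ["negative"] * 6

model = build_stacked_sentiment_model(cv=2)
model.fit(texts, labels)
predictions = model.predict(["good wonderful film", "bad awful service"])
```

With a real corpus, the custom stopword list would be passed as `stop_words`, and accuracy or AUC-ROC would be measured on a held-out split rather than on the training sentences.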
