Abstract
Social media has given voice to people around the globe. However, all voices are not counted due to the scarcity of lexical computational resources. Such resources could harness the torrent of social media text data. Computational resources for rich languages such as English are available. More are being developed, meanwhile strengthening and enhancing the current ones. However, Roman Sindhi, a resource-poor writing style, is a phonetically rich language lacking computational resources, creating a working space for researchers. This work attempts to develop lexical sentiment resources that will help calculate the public opinion expressed in Roman Sindhi and bring their point of view into the limelight. This work is one of the initial efforts to develop lexical Roman Sindhi sentiment dictionary resources to help detect sentiment orientation in a text. Furthermore, it also developed two interfaces to leverage the lexical resources—a Roman Sindhi to English translator (RoSET) that translates a Roman Sindhi feature into an equivalent English word and a Roman Sindhi rule-based sentiment scorer (RBRS3) that assigns sentiment score to a Roman Sindhi script features. The results obtained from the developed system accommodated the bilingual dataset (Roman Sindhi + English) more adequately. An increase of 20.8% was recorded for positive sentence detection, and a 16% increase was obtained for negative sentences, whereas neutral sentences were marginalized to a lower number (59.31% decrease). The resultant system makes those public voices expressed in the Roman Sindhi script get counted, which otherwise are in vain.
Keywords
Introduction
Sentiment analysis, an application of natural language processing (NLP), is a field of knowledge that aims at finding public mode and orientation toward a social or ethical issue, political event, company products, or services using free text data (Ali et al., 2021). The field of sentiment analysis got widespread attention with the emergence of computer-mediated social networks. Today, social networks are the centers for the public to cast their opinion with maximum freedom. People often use English as a medium to speak their minds on social networks. Researchers have developed resources to calculate sentiments expressed using English script (Baccianella et al., 2010; Hutto & Gilbert, 2014; Loria, 2020; Nielsen, 2011). More research for the English language is in progress worldwide (Honnibal & Montani, 2017). However, there are an estimated 74.1% (as of 2020) (Statista, 2022) users on the Internet who use other languages besides English. Amongst them, many people use resource-poor languages (estimated 33%). One such medium of expression is Sindhi Language, written from the right side (Dootio & Wagan, 2021). Figure 1 shows examples of such sentences. But due to the lack of technology (hardware and software support) and personal expertise, people rarely use the native Sindhi script (Devanagari, Gurmukhi, Khudabadi, Shikarpuri, or Perso-Arabic) on social media. They usually use Roman Sindhi script to express their point of view (Baccianella et al., 2010). For example, “intrnet wqt jo zyaan aahe.” Roman Sindhi script borrows phonetics from Sindhi Language (52 sounds) and the letters from English (Latin script). It is a form of writing (left-to-right) that has become the choice of expression on social media by more than 47,886,051 people who live in Sindh (a province/state in Pakistan growing at the rate of 2.41% (Department, 2022) and many more around the globe.

Sindhi language sentences in Perso-Arabic.
Roman Sindhi lacks almost entirely, and there is no composed sentiment lexical resource out there which may assist in counting their voice expressed in Roman Sindhi. The Roman Sindhi script (reviews, statements, tweets, and posts) is not counted for whatever is written (posted) due to the unavailability of resources to calculate the expressed sentiments. This scenario has created a massive gap for the researchers to fill and let the voice of millions of people get counted and heard.
The Roman Sindhi lexicon and sentiment resources are built from a topical dataset of 4,500 sentences. The dataset was collected using a semi-automated technique, utilizing automated systems provided through social networks APIs and manual work effort. The semi-automated process used Twitter Tweepy API, HTTP-based Facebook Graph API, Whatsapp group chat, and random manual search methods. This work suggests two parallel approaches to determine the sentiment expressed in a Roman Sindhi text. The RoSET uses the Roman Sindhi lexicon to translate Roman Sindhi words into English, whereas the
Development of lexical sentiment Roman Sindhi resources (first of their kind to the best of our knowledge).
Roman Sindhi to English translator (RoSET).
Rule-based Roman Sindhi sentiment scorer
RoSET and

Bilingual dataset sentence distribution.
The rest of the paper is organized as follows. Section (2) is a survey of past literature, elaborating on key features and achievements of related work. Section (3) details the undertaken work from data acquisition to sentence PN-polarity detection; finally, Section (4) presents the results, whereas Section (5) concludes the work, describing some avenues for future work. The manuscript ends with the cited research work.
Literature Review
Sentiment analysis has attracted researchers due to its applicability in numerous domains. With the emergence and popularity of social media, this field of knowledge has got more attention because social media has become a de-facto public data accumulation repository. Cambria et al. (2017) have comprehensively presented working of sentiment analysis and natural language processing (NLP) methods. A classic sentiment analysis process includes data extraction (Khan et al., 2021), data cleaning, data preprocessing (Jianqiang & Xiaolin, 2017; Symeonidis et al., 2018), feature extraction/selection, feature transformation, modeling (Qureshi et al., 2022), and the result visualization processes. Much research in the domain of sentiment analysis and classification is available, and more is in progress for rich languages, such as English. However, datasets and related sentiment computational resource(s) are rare or unavailable for resource-poor languages. Roman Sindhi is an example of a resource-poor writing method with more than 45 million vibrant social media users for which no sentiment analysis resources are available.
Sodhar et al. (2020), have tried to produce an online system for Roman Sindhi sentiment analysis. To our knowledge, this is the only closely related work available. The authors developed a sentence-level sentiment classifier for Roman Sindhi that encompasses static 100 sentences with 500 words in the dataset. Out of 100 sentences, the online sentiment classifier classified 97 sentences as neutral and 3 samples as negative. Not a single sentence was classified as positive. In our opinion, the sentences numbered 16, 23, 43, 46, 47, 61, 68, 69, 70, 83, and 94, given in the dataset, may be classified as positive (at least 11 sentences). On the other side of “PN-polarity”, sentences numbered 5, 9, 12, 18, 22, 44, and 52 may be computed as negative (seven sentences). Furthermore, the (subject + verb + object) sentence structure approach is considered for Roman Sindhi script, whereas, in actual, it should be (subject + object + verb). For example, sentence numbered 68 may be written as “manho chanwar pasand kan tha.” instead of “manho pasand Kan tha chanwar.” In fact, Roman Sindhi script only borrows alphabets from Latin script. The grammatical sentence structure remains under standard Sindhi language grammar rules (Board, S, 2022). The above description shows a huge research gap in the area of determining the “PN-polarity” using Roman Sindhi writings. There is a need for scalable computational resources that may help analyze Roman Sindhi sentences from closed domains with a capacity for extension.
Another paper related to Roman Sindhi script processing issues and complexities is described in Leghari and Rahman (2015). The authors have suggested equivalent Latin letter(s) for Sindhi script letters. They developed a system that could translate Sindhi script to Devanagari script and vice versa to bring semantic understanding between two types of scripts for the same root language (Sindhi language).
Mehmood et al. (2018) described their work on a phonetically similar language (Roman Urdu). They used a dataset of 779 samples extracted from multiple domains and implemented Nave Bayes and logistic regression (LR) algorithm using lexical features. They reported Nave Bayes (NB) to perform better with unigram features than other methods.
In another work, Mehmood et al. developed a sentiment analysis system for Roman Urdu using machine learning and deep learning algorithms with unigram, bigram, and uni-bigram hybrid features on 11,000 reviews from six domains. They reported having achieved an improved accuracy up to 12% increase in comparison to the baseline model (Mehmood et al., 2020).
Rana et al. (2022) proposed an unsupervised sentiment analysis system for Roman Urdu on short text classification without suffering domain dependency. The authors used a rule-based method to classify the short texts in Roman Urdu script with sentiment labels. The authors claimed that their proposed approach is effective in sentiment analysis on social media short text classification for Roman Urdu.
Mukhtar and Khan (2020) proposed a lexicon-based Urdu sentiment analysis system. The authors developed a wide-coverage Urdu sentiment lexicon that included adjectives, nouns, and verbs. They claimed that their developed lexicon-based system for Urdu sentiment analysis achieved high accuracy by effectively dealing with negations, intensifiers, and context-dependent words.
Rauf and Pad (2019) presented a trilingual semantic relationship for building Urdu, Roman Urdu, and English lexicons. Despite having noise and different syntax, the pair Roman Urdu-Urdu obtained an accuracy of 85%, and the English-Urdu pair achieved 45% accuracy.
Sadia et al. (2020) presented a Boolean rules-based opinion mining parser to find polarity in the Roman Urdu text. The set of Boolean rules classified a user posted/written review as positive, negative, or neutral. The authors evaluated their method on a dataset of Roman Urdu public reviews and found that it achieved an accuracy of 92.4%.
The most relevant work is performed in Sodhar et al. (2020). However, it is a study of 100 static Roman Sindhi sentences, lacking scalability features in the system. The scalability can only be incorporated using a Roman Sindhi sentiment lexicon. Moreover, sentiment understating and scoring ability may be reliably attained through sentiment lexical resources that make the sentiment analysis system scalable and robust. This discussion infers a need to develop and leverage lexical computational resources to calculate sentiment orientation expressed in the Roman Sindhi script.
The Proposed Method—Roman Sindhi Sentiment Analyzer (RSSA)
This section describes the research method adopted to include the sentiments expressed in Roman Sindhi script into a bilingual sentiment analysis system. Figure 3 reflects the functional diagram of Roman Sindhi Sentiment Analyzer (RSSA).

Roman Sindhi Sentiment Analyzer (RSSA) framework.
The impetus behind this work is to develop lexical sentiment resource(s) and their interfacing modules to enable the resultant bilingual sentiment classification system to count sentiments expressed in the Roman Sindhi script. An alternate approach is the usage of a supervised machine learning approach. A supervised learning method requires labeled dataset(s). However, such labeled datasets are often unavailable and require high cost and human effort. Therefore, there is a need to develop resources that may assist in auto-labeling expressions expressed in Roman Sindhi script.
The following subsections describe the purpose, functionality, and possible outcome(s) of each component presented in Figure 3.
Input Text
A sentiment analysis system extracts “PN-polarity” using the text data. The input text data is usually unavailable. Such a data availability problem for Roman Sindhi makes it double-fold. The input text data for this work is accumulated through social media (short expressions) using a semi-automated technique. The semi-automated technique used Twitter Tweepy API, HTTP-based Facebook Graph API, Whatsapp group chat, and a random manual search method. Twitter Tweepy API adopted a keyword search approach to extract tweets under #SocialMediaIsCurse, #SocialMediaIsBlessing, and #SocialMediaACurseOrBlessing. The Graph API targeted particular users to obtain their public posts on the topic of “whether the Internet/Social Media is a blessing or a curse.” A Whatsapp group was also created to get public opinion, consisting of 254 users from different parts of Sindh, Pakistan. The collected dataset consists of 4,500 bilingual sentences, having English (3,779 sentences) and Roman Sindhi expressions. Roman Sindhi sentences comprised 721 Roman Sindhi sentences with 2,120 unique Roman Sindhi words. Table 1 contains statistics about the collected dataset. It provides the details about the number of Roman Sindhi words (positive, negative, neutral, and unique). Figure 4 shows the word cloud showing Roman Sindhi lexical features.
Collected Dataset Statistics.

Word cloud showing Roman Sindhi unigram lexical features.
The sample for positive and negative sentences in Roman Sindhi are given in Tables 2 and 3, respectively. Since no sentiment computational resource is available, such expressions go uncounted when calculating public sentiment about an entity.
Roman Sindhi Sentences (Positively Oriented).
Roman Sindhi Sentences (Negatively Oriented).
Text Preprocessing
Text data is highly prone to noise. Therefore, text data should be preprocessed before feeding it to the succeeding module in the sentiment analysis system. The preprocessing step cleanses the data, mitigates feature vector space, and decreases the models computational cost. The last column in Table 4 shows the impact of text preprocessing on feature vector space. Text preprocessing effectively performs Roman Sindhi sentence vector space mitigation, decreasing feature vector space size from 13, 9, 7 to 5, 6, and 4, respectively. Therefore, it can be concluded that text preprocessing provides an implicit way to curb the curse of dimensionality. Text preprocessing steps include handling indistinct terms, typos, punctuation, digits, stop-words (English + Roman Sindhi), contractions, character case normalization, and segmentation (Alvi et al., 2018). However, the order in which text data should be cleaned and prepared is significant and application dependent. Since, for this work, the data is insensitive toward acronyms and short forms, case-normalization is performed in the beginning. Afterward, contracted forms of the words and phrases are separated, and punctuation and digits are evicted. The Roman Sindhi stop words are filtered in amalgamation with English language stop words. Since the dataset is extracted from social media, misspelled terms are handled by checking for each term occurrence in at least two sentences. The word absence, at least from two sentences in the raw dataset, disqualifies it from inclusion in the processed dataset. The same criteria are applied for typos and misspelled terms. Eventually, such indistinct words are filtered out effectively. The impact of text preprocessing is shown in the second column of Table 4 in expression form.
Impact of Text Preprocessing on Text Data.
Lexical Feature Extraction
Lexical features for text data include adjectives, verbs, adverbs, nouns, and prepositions. Each instance from them forms a unigram in the text. Unigrams are selected as potential features for this study due to two reasons. Firstly, it is the most popular text feature, and secondly, the work being undertaken is the first of its kind. Tables 5 and 6 represent unique unigram Roman Sindhi features. Table 5 lists Roman Sindhi unigrams with rare or no sentiment orientation. But such terms should be identified and filtered out to prepare the final feature set. Table 6 displays the Roman Sindhi unigrams, their equivalent English word, and the sentiment score. The lexical sentiment score is human-annotated by three persons. The scores were accepted using a standard deviation up to 0.1. A linguistic expert verified the final scores. The sentiment score ranges from −1 (negative) to +1 (positive). A 0 (zero) score is considered neutral.
List of Roman Sindhi Stop Words.
Lexical Roman Sindhi Sentiment Resource—A Sample.
Roman Sindhi to English Translator (RoSET)
The next component in the RSSA framework is RoSet, as shown in Figure 3. The RoSET is one of the significant contributions of this research work, constituting the core of the Roman Sindhi sentiment analyzer. The working principle of RoSET is based on the “search-Match-Replace” method. RoSET parses the input text (sentence), detects the Roman Sindhi word, finds its matching English word, and replaces it. RoSET is built on the top of “textblob”, which consists of a sentiment lexicon and sentiment analyzer. It utilizes the “textblob” lexicon to assign auto-labels to the words after performing the Roman Sindhi to English translation and finally calculates the combined sentiment score for the whole sentence. However, the real power behind RoSET comes from the developed lexical unigram resource. A source sample is given in Table 6.
The original Roman Sindhi text follows
An algorithm for RoSET.
Rule-Based Roman Sindhi Sentiment Scorer
Another core contribution encompasses the Rule-based Roman Sindhi Sentiment Scorer. The
The
An algorithm for RBRSSS.
The impact of
Impact of
Auto Labeling the Lexical Features
The study aimed to develop a system that may help label the Roman Sindhi words with suitable sentiment scores. The goal is achieved through RoSET,
Sentence Level Sentiment Score
A Roman Sindhi (RS) sentence is assigned a cumulative sentiment score, considering each RS unigram by using the Equation 1, where
Category-Wise Sentiment Calculation (CSC)
“PN-polarity” wise is the last step before the RSSA result representation as shown in Figure 5. A sentence is declared to be positive if

Comparison: Showing the impact of Roman Sindhi sentiment analyzer.
An algorithm for CSC.
RSSA focuses on three-way sentiment polarity detection
Results and Discussion
This section presents the obtained results for Roman Sindhi Sentiment Analyzer. In total, 4,500 bilingual subjective samples were collected from heterogeneous resources for sentiment analysis using a semi-automated method. The topic of discussion was whether the people of Sindh (Pakistan) consider the Internet and social media a blessing or a curse. Out of 4,500 total sentences, 3,779 were written in English, whereas 721 were in Roman Sindhi script 1. With available lexical resources, 721 data samples would have been excluded from the overall sentiment analysis and remained unattended (calculated as neutral). Roman Sindhi sentiment analyzer got it counted and provided a positive, negative, or neutral score to each Roman Sindhi expression. Figure 5a shows that 1,841 sentences have been classified as positive, 1,585 as negative, and 1,074 neutral without applying RSSA. With RSSA (as shown in Figure 5b), the classification statistics changed toward the achievement of the study goal. Now positive sentences number reached 2,224 (20.80% increase), negative sentences elevated to 1,839 (16% increase), and the neutral sentences were marginalized to 437 (59.31% decrease).
The secondary results showed that people of Sindh (Pakistan) showed mixed behavior toward the Internet and social media usage. 49.42% people favored the Internet and social media, 40.86% were against them, and 9.71% abstained from giving a clear point of view.
Rules for Writing Roman Sindhi
In addition to the 15 basic Roman Sindhi writing rules described in Sodhar et al. (2021), the following rules are also proposed. These rules will normalize the informal writing style used on social media networks. Writing rules will improve the chances for a text message, tweet, or post to be counted. Following these rules will also help the developers to strengthen the sentiment resources more conveniently. The rules encompass Sindhi language letters that help to connect letters to form a meaningful word. These letters are similar to vowels in the English language, as shown in Figure 6.

Most common Sindhi language words similar to English vowels.
These Roman Sindhi writing rules are important to follow to build a robust and scalable Roman Sindhi sentiment analyzer because Roman Sindhi provides a larger space to the users in writing style. For example, a word “barbaad” (numbered 6 in Table 6) can be written in multiple ways, such as “brbad”, “brbaad”, and “berbaad”. To cope with the issues of word variations, either the users may follow the Roman Sindhi writing rules as provided above, or a separate dictionary resource is required. The word variation principle applies to all other Roman Sindhi words and their variants. Figure 7 incorporates the most common Roman Sindhi unigrams used in the collected dataset, whereas Figure 8 presents unique Roman Sindhi unigrams. These lists are largely domain specific. However, Roman Sindhi stop word list (Figure 6) is common to all the data samples.

Most frequent Roman Sindhi unigrams.

Most unique Roman Sindhi unigrams.
Conclusion and Future Work
Sentiment analysis aims to determine “PN-polarity” in an opinionated sentence. Researchers adopt either of the two methods to accomplish the task of sentiment analysis: supervised machine learning techniques or a lexicon-based approach. Machine learning techniques need labeled dataset(s), whereas lexicon-based systems require sentiment computational resources. There are sufficient labeled datasets and computational resources for resource-rich languages such as English, but resource-poor languages such as Roman Sindhi writing lack such facilities.
This work is one of the initial (the first of its kind to the best of our knowledge) attempts to develop sentiment computational resources and their allied modules for Roman Sindhi Sentiment Analysis (RSSA). The resources provide the capability in the system to calculate sentiment orientation for a sentence expressed in Roman Sindhi. Roman Sindhi resources will help get people’s voices counted, which otherwise used to go uncounted.
For this work, the collected bilingual dataset consists of 4,500 samples, of which 721 sentences are exclusively written in Roman Sindhi script. It suggests that 16.02% samples of the dataset are in Roman Sindhi writing solely. Without the lexical sentiment resources, Roman Sindhi sentences would have been lost (counted as neutral in a “PN-polarity detection system). The given Roman Sindhi expression percentage is only related to the dataset collected for this study. The ratio of English-Roman Sindhi bilingual sentences may differ in other topical datasets. Nonetheless, after incorporating these lexical resources and processing modules (RoSET and
This study resulted in
Sentiment Computation Resource(s)
RoSET
The experiments reveal that after applying RSSA, the sentiment detection rate of opinionated sentences increased significantly. An increase of the 20.8% was experienced for positive sentences; negative expressions were enhanced by 16%, whereas neutral sentences were marginalized. Neutral sentences saw a decrease of 59.31%, a visible improvement in a bilingual sentiment analysis system.
It is recommended that the users should follow the rules (discussed in the Results and Discussion section) for writing the Roman Sindhi script. User sentences written in standard form will enable the system to detect the expressed sentiments more effectively and efficiently determine their polarity. As a future work, it is suggested that the developed lexical sentiment resources may be strengthened by adding more vocabulary (multi-domain), developing a resource for Roman Sindhi word variations, and a module for negation handling.
Footnotes
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The authors are thankful to the Deanship of Scientific Research and under the supervision of the Research Centre Funding program at Najran University for funding this work under the grant code (NU/RCP/SERC/12/15).
