Count Me Too: Sentiment Analysis of Roman Sindhi Script

Abstract

Social media has given voice to people around the globe. However, not all voices are counted, owing to the scarcity of lexical computational resources that could harness the torrent of social media text data. Computational resources for resource-rich languages such as English are available, and more are being developed while existing ones are strengthened and enhanced. Roman Sindhi, however, is a resource-poor writing style for a phonetically rich language, and its lack of computational resources leaves ample room for research. This work is one of the initial efforts to develop lexical Roman Sindhi sentiment dictionary resources that help detect the sentiment orientation of a text, so that public opinion expressed in Roman Sindhi can be measured and brought into the limelight. Furthermore, it develops two interfaces to leverage the lexical resources: a Roman Sindhi to English translator (RoSET) that translates a Roman Sindhi feature into an equivalent English word, and a rule-based Roman Sindhi sentiment scorer (RBRS³) that assigns a sentiment score to Roman Sindhi script features. The developed system accommodated the bilingual dataset (Roman Sindhi + English) more adequately: positive sentence detection increased by 20.8%, negative sentence detection increased by 16%, and the number of neutral sentences fell by 59.31%. The resultant system ensures that public voices expressed in the Roman Sindhi script, which would otherwise go unheard, are counted.

Keywords

Lexical computational resources; Roman Sindhi; sentiment analysis; Roman Sindhi to English translator (RoSET)

Introduction

Sentiment analysis, an application of natural language processing (NLP), is a field of knowledge that aims at finding public mood and orientation toward a social or ethical issue, political event, company products, or services using free text data (Ali et al., 2021). The field gained widespread attention with the emergence of computer-mediated social networks. Today, social networks are the places where the public voices its opinions with maximum freedom. People often use English as a medium to speak their minds on social networks, and researchers have developed resources to calculate sentiments expressed in English (Baccianella et al., 2010; Hutto & Gilbert, 2014; Loria, 2020; Nielsen, 2011). More research for the English language is in progress worldwide (Honnibal & Montani, 2017). However, an estimated 74.1% of Internet users (as of 2020) use languages other than English (Statista, 2022). Among them, many (an estimated 33%) use resource-poor languages. One such medium of expression is the Sindhi language, which is written from right to left (Dootio & Wagan, 2021). Figure 1 shows examples of such sentences. However, due to the lack of technology (hardware and software support) and personal expertise, people rarely use the native Sindhi scripts (Devanagari, Gurmukhi, Khudabadi, Shikarpuri, or Perso-Arabic) on social media. They usually use the Roman Sindhi script to express their point of view (Baccianella et al., 2010), for example, “intrnet wqt jo zyaan aahe.” Roman Sindhi script borrows its phonetics from the Sindhi language (52 sounds) and its letters from English (Latin script). It is a left-to-right form of writing that has become the choice of expression on social media for more than 47,886,051 people who live in Sindh, a province of Pakistan whose population is growing at a rate of 2.41% (Department, 2022), and for many more around the globe.

Figure 1.

Sindhi language sentences in Perso-Arabic.

Roman Sindhi lacks computational resources almost entirely, and no sentiment lexical resource exists that could help count the voices expressed in it. Whatever is written in Roman Sindhi script (reviews, statements, tweets, and posts) goes uncounted because no resources are available to calculate the expressed sentiments. This scenario has created a massive gap for researchers to fill so that the voices of millions of people are counted and heard.

The Roman Sindhi lexicon and sentiment resources are built from a topical dataset of 4,500 sentences. The dataset was collected using a semi-automated technique that combined automated collection through social network APIs with manual effort. The semi-automated process used the Twitter API (via Tweepy), the HTTP-based Facebook Graph API, a WhatsApp group chat, and random manual search. This work suggests two parallel approaches to determine the sentiment expressed in a Roman Sindhi text. RoSET uses the Roman Sindhi lexicon to translate Roman Sindhi words into English, whereas $RBRS^{3}$ leverages the sentiment lexical dictionary to assign a score to each selected feature. The contributions of this work include:

Development of lexical sentiment Roman Sindhi resources (first of their kind to the best of our knowledge).

Roman Sindhi to English translator (RoSET).

Rule-based Roman Sindhi sentiment scorer ($RBRS^{3}$).

The RoSET and $RBRS^{3}$ modules are based on resources containing 2,120 unique Roman Sindhi words, of which 200 are filtered and listed as Roman Sindhi stop words having little or no PN-polarity. The dataset used for the experimental work consists of 4,500 bilingual sentences describing Pakistani public opinion on whether the Internet and social media are a blessing or a curse. As shown in Figure 2, 16.02% of the dataset sentences are exclusively in Roman Sindhi. Without the Roman Sindhi Sentiment Analyzer (RSSA), this 16.02% would have gone uncounted. RSSA increased positive and negative sentence detection by 20.8% and 16%, respectively, counting the voice of common people expressed in Roman Sindhi script.

Figure 2.

Bilingual dataset sentence distribution.

The rest of the paper is organized as follows. Section (2) is a survey of past literature, elaborating on key features and achievements of related work. Section (3) details the undertaken work from data acquisition to sentence PN-polarity detection; finally, Section (4) presents the results, whereas Section (5) concludes the work, describing some avenues for future work. The manuscript ends with the cited research work.

Literature Review

Sentiment analysis has attracted researchers due to its applicability in numerous domains. With the emergence and popularity of social media, this field of knowledge has got more attention because social media has become a de-facto public data accumulation repository. Cambria et al. (2017) have comprehensively presented working of sentiment analysis and natural language processing (NLP) methods. A classic sentiment analysis process includes data extraction (Khan et al., 2021), data cleaning, data preprocessing (Jianqiang & Xiaolin, 2017; Symeonidis et al., 2018), feature extraction/selection, feature transformation, modeling (Qureshi et al., 2022), and the result visualization processes. Much research in the domain of sentiment analysis and classification is available, and more is in progress for rich languages, such as English. However, datasets and related sentiment computational resource(s) are rare or unavailable for resource-poor languages. Roman Sindhi is an example of a resource-poor writing method with more than 45 million vibrant social media users for which no sentiment analysis resources are available.

Sodhar et al. (2020) tried to produce an online system for Roman Sindhi sentiment analysis. To our knowledge, this is the only closely related work available. The authors developed a sentence-level sentiment classifier for Roman Sindhi based on a static dataset of 100 sentences with 500 words. Out of 100 sentences, the online sentiment classifier classified 97 as neutral and 3 as negative. Not a single sentence was classified as positive. In our opinion, the sentences numbered 16, 23, 43, 46, 47, 61, 68, 69, 70, 83, and 94 in the dataset may be classified as positive (at least 11 sentences). On the other side of “PN-polarity”, sentences numbered 5, 9, 12, 18, 22, 44, and 52 may be classified as negative (seven sentences). Furthermore, that work assumes a (subject + verb + object) sentence structure for Roman Sindhi script, whereas it should actually be (subject + object + verb). For example, sentence number 68 should be written as “manho chanwar pasand kan tha.” instead of “manho pasand Kan tha chanwar.” In fact, Roman Sindhi script borrows only its alphabet from Latin script; the grammatical sentence structure continues to follow standard Sindhi grammar rules (Board, S, 2022). The above description shows a huge research gap in determining “PN-polarity” from Roman Sindhi writings. There is a need for scalable computational resources that can help analyze Roman Sindhi sentences from closed domains and that have the capacity for extension.

Another paper related to Roman Sindhi script processing issues and complexities is described in Leghari and Rahman (2015). The authors have suggested equivalent Latin letter(s) for Sindhi script letters. They developed a system that could translate Sindhi script to Devanagari script and vice versa to bring semantic understanding between two types of scripts for the same root language (Sindhi language).

Mehmood et al. (2018) described their work on a phonetically similar language (Roman Urdu). They used a dataset of 779 samples extracted from multiple domains and implemented Naive Bayes (NB) and logistic regression (LR) algorithms using lexical features. They reported that Naive Bayes performed better with unigram features than the other methods.

In another work, Mehmood et al. developed a sentiment analysis system for Roman Urdu using machine learning and deep learning algorithms with unigram, bigram, and uni-bigram hybrid features on 11,000 reviews from six domains. They reported having achieved an improved accuracy up to 12% increase in comparison to the baseline model (Mehmood et al., 2020).

Rana et al. (2022) proposed an unsupervised sentiment analysis system for Roman Urdu on short text classification without suffering domain dependency. The authors used a rule-based method to classify the short texts in Roman Urdu script with sentiment labels. The authors claimed that their proposed approach is effective in sentiment analysis on social media short text classification for Roman Urdu.

Mukhtar and Khan (2020) proposed a lexicon-based Urdu sentiment analysis system. The authors developed a wide-coverage Urdu sentiment lexicon that included adjectives, nouns, and verbs. They claimed that their developed lexicon-based system for Urdu sentiment analysis achieved high accuracy by effectively dealing with negations, intensifiers, and context-dependent words.

Rauf and Pad (2019) presented a trilingual semantic relationship for building Urdu, Roman Urdu, and English lexicons. Despite having noise and different syntax, the pair Roman Urdu-Urdu obtained an accuracy of 85%, and the English-Urdu pair achieved 45% accuracy.

Sadia et al. (2020) presented a Boolean rules-based opinion mining parser to find polarity in the Roman Urdu text. The set of Boolean rules classified a user posted/written review as positive, negative, or neutral. The authors evaluated their method on a dataset of Roman Urdu public reviews and found that it achieved an accuracy of 92.4%.

The most relevant work is that of Sodhar et al. (2020). However, it is a study of 100 static Roman Sindhi sentences, and the system lacks scalability features. Scalability can only be incorporated using a Roman Sindhi sentiment lexicon. Moreover, sentiment understanding and scoring ability may be reliably attained through sentiment lexical resources that make the sentiment analysis system scalable and robust. This discussion points to a need to develop and leverage lexical computational resources to calculate the sentiment orientation expressed in the Roman Sindhi script.

The Proposed Method—Roman Sindhi Sentiment Analyzer (RSSA)

This section describes the research method adopted to include the sentiments expressed in Roman Sindhi script into a bilingual sentiment analysis system. Figure 3 reflects the functional diagram of Roman Sindhi Sentiment Analyzer (RSSA).

Figure 3.

Roman Sindhi Sentiment Analyzer (RSSA) framework.

The impetus behind this work is to develop lexical sentiment resource(s) and their interfacing modules to enable the resultant bilingual sentiment classification system to count sentiments expressed in the Roman Sindhi script. An alternate approach is the usage of a supervised machine learning approach. A supervised learning method requires labeled dataset(s). However, such labeled datasets are often unavailable and require high cost and human effort. Therefore, there is a need to develop resources that may assist in auto-labeling expressions expressed in Roman Sindhi script.

The following subsections describe the purpose, functionality, and possible outcome(s) of each component presented in Figure 3.

Input Text

A sentiment analysis system extracts “PN-polarity” from text data. Such input data is often hard to obtain, and the availability problem is even more severe for Roman Sindhi. The input text data for this work was accumulated from social media (short expressions) using a semi-automated technique that combined the Twitter API (via Tweepy), the HTTP-based Facebook Graph API, a WhatsApp group chat, and random manual search. The Twitter collection used a keyword search to extract tweets under #SocialMediaIsCurse, #SocialMediaIsBlessing, and #SocialMediaACurseOrBlessing. The Graph API targeted particular users to obtain their public posts on the topic of “whether the Internet/Social Media is a blessing or a curse.” A WhatsApp group of 254 users from different parts of Sindh, Pakistan, was also created to gather public opinion. The collected dataset consists of 4,500 bilingual sentences: 3,779 in English and 721 in Roman Sindhi, the latter containing 2,120 unique Roman Sindhi words. Table 1 contains statistics about the collected dataset, including the number of Roman Sindhi words (positive, negative, neutral, and unique). Figure 4 shows a word cloud of the Roman Sindhi lexical features.

Table 1.

Collected Dataset Statistics.

Language	Category	Word count
English	Positive sentence words	22,092
	Negative sentence words	9,510
	Neutral sentence words	3,177
	Total sentence words	34,011
	Total unique words	4,121
Roman Sindhi	Positive sentence words	3,830
	Negative sentence words	2,032
	Neutral sentence words	504
	Total sentence words	6,366
	Total unique words	2,120

Figure 4.

Word cloud showing Roman Sindhi unigram lexical features.

Samples of positive and negative Roman Sindhi sentences are given in Tables 2 and 3, respectively. Since no sentiment computational resource is available, such expressions go uncounted when calculating public sentiment about an entity.

Table 2.

Roman Sindhi Sentences (Positively Oriented).

No.	Sentences
1	Ma pehje porhee kmzor maa sa aein nandhan baran sa Saudi ma galhae sukoon hasil kndo aahya.
2	Soshal meedya aen intrnet pehje galh puhchaen jo intehaee taqtwar zareeo thee wayo aa.
3	Soshal media zindageee tabdeel kre chhadee aahe
4	Medya wapaar khe b faedo dino aahe.
5	Facebook marhun khe mst kre chadyo aahe
6	Soshl media marhun khe maggan kre chadyo aahe.
7	Internet awam khe qreeb kayo aahe.
8	Soshal media insaf hasil krn jo aasan zareeo thee wayo aahe.
9	Soshal networks (Facebook aein bya) ahtejaj record karaen jo aasan aen munasib forum aahe.
10	Jadeed teknologi mulk ghandhe chhadyo.
11	Linkenin aein fesbk tey bhala maloomati sufha aein group b aahin.
12	medya aein internet faasla ghatae chhadya ahn.
13	utube te shagrd jaded taleem hasil kre sghn tha.
14	online cors krn sawalo thee wayo aahe.
15	Hunr sikkhe saghje tho aein hnr wadhae b saghje tho.
16	wadyun wadyun universities ja corses ghar wethey kya pya wjn.
17	covid warey zamaney ta internet aein social medya je ahmyat safa chitte kare chadi aahe.
18	Ajoko aein endr wqt computr, intrnet aein soshal networking jo aahe.
19	samaje km kaar b jald, aasani, aen kamyabe sa thyo wjn.
20	Khat likhn, document puhchaen, ya galh bolh kr dadho aasan thee wayo aahe.
21	fasla na rahya aein aasani thee waee.
22	intrnet kamaae jo zareeo the wayo aahe.

Table 3.

Roman Sindhi Sentences (Negatively Oriented).

No.	Sentences
1	Internet ootaqoun weeran kre chhadyou ahin.
2	fesbook buraee khe wdhao dino aahe.
3	Social medya khandanee zndagee taaraaj kae aahe.
4	Walden facebook kha nakhush ahn.
5	Fecbuk j istemal jo insani sehat te naakari asr pyo aahe.
6	soshal networks j istemal marhun ma sahp brdaasht khtm kre chhade aahe.
7	Social media je istemal sbhn khe beemar aein sust kre chhadyo aahe.
8	Soshal medya afwah saz factory aa.
9	soshal media ja femlee te hanjekaar asr thya aahin.
10	Soshal medya g nigrani thyan ghurjey.
11	Kachehryoun khtm thee wayoun aahin.
12	Sochal medya ikhlaqi pastee jo sabab banjee wae aahe.
13	Waldein baran lae fkrmand thyan tha jadhahein baar internet istemal kn tha.
14	Social media marhun khe deewano kre chhadyo aahe.
15	Nojawan internet lae junooni thee waya aahin.
16	Twitter, fesbok, internet nuksaan deh aahin.
17	Inhan nawan tarekan bekhabr marhun khe galt raste te lagae chhadyo aahe.
18	Nao nsl pehjo qeemati wqt zaya kre rehya aahin.
19	Bagher tarbyat, in teknolje jo istemal bholey j hath me chhurey j barabar aahe.
20	Skool, kolej, aein officoun: hr jae te marhoon soshal media te aahin.
21	frz adaegi khe - khuda hafiz aahe.
22	fahashee wdhe waee aahe intrnet kre.

Text Preprocessing

Text data is highly prone to noise. Therefore, text data should be preprocessed before it is fed to the succeeding module in the sentiment analysis system. The preprocessing step cleanses the data, shrinks the feature vector space, and decreases the model's computational cost. The last column in Table 4 shows the impact of text preprocessing on the feature vector space: for the three example sentences, the vector size drops from 13, 9, and 7 to 5, 6, and 4, respectively. Text preprocessing therefore provides an implicit way to curb the curse of dimensionality. The preprocessing steps include handling indistinct terms, typos, punctuation, digits, stop words (English + Roman Sindhi), contractions, character case normalization, and segmentation (Alvi et al., 2018). However, the order in which text data is cleaned and prepared is significant and application dependent. Since the data in this work is insensitive to acronyms and short forms, case normalization is performed first. Afterward, contracted forms of words and phrases are separated, and punctuation and digits are removed. The Roman Sindhi stop words are filtered together with English stop words. Since the dataset is extracted from social media, typos and misspelled terms are handled by checking whether each term occurs in at least two sentences of the raw dataset; a term that does not is excluded from the processed dataset. In this way, such indistinct words are filtered out effectively. The second column of Table 4 shows the impact of text preprocessing on example expressions.
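The ordering described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the tiny stop-word subsets, and the handling of apostrophes are hypothetical, and contraction expansion is omitted.

```python
import re
from collections import Counter

# Illustrative subsets only (assumption); the full Roman Sindhi list is in Table 5.
RS_STOP_WORDS = {"je", "kre", "sa", "tho", "pyo", "pya", "aahin"}
EN_STOP_WORDS = {"the", "is", "a", "of", "and"}


def tokenize(sentence):
    """Lower-case a sentence, strip punctuation and digits, and split into words."""
    text = sentence.lower()                      # case normalization comes first
    text = re.sub(r"['’]", "", text)             # join words around apostrophes (assumption)
    text = re.sub(r"[^\w\s]|[\d_]", " ", text)   # drop punctuation, digits, underscores
    return text.split()


def preprocess(sentences, min_sentences=2):
    """Remove stop words and terms that occur in fewer than `min_sentences` sentences."""
    tokenized = [tokenize(s) for s in sentences]
    # Sentence frequency: in how many sentences does each term appear?
    sentence_freq = Counter(term for tokens in tokenized for term in set(tokens))
    stop_words = RS_STOP_WORDS | EN_STOP_WORDS
    return [
        [t for t in tokens if t not in stop_words and sentence_freq[t] >= min_sentences]
        for tokens in tokenized
    ]
```

Filtering by sentence frequency is a cheap proxy for spell checking: a spelling that appears in only one sentence is treated as noise, at the cost of also discarding rare but valid words.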

Table 4.

Impact of Text Preprocessing on Text Data.

	Roman Sindhi expressions	Vector size
Raw Roman Sindhi text	Watsap je kre ma’an roze Saudi ma’an ammar sa gaalhae sagha’an tho.	13
	miryaee km pyo hale, sabh laga pya aahin	9
	Soshl medya ootakoo tabah kre chhadyoun aahin!	7
Preprocessed Roman Sindhi text	watsap roze saudi ammar gaalhae	5
	miryaee km hale, sabh laga	6
	soshl medya ootakoo tabah	4

Lexical Feature Extraction

Lexical features for text data include adjectives, verbs, adverbs, nouns, and prepositions. Each instance of these forms a unigram in the text. Unigrams are selected as features for this study for two reasons: first, they are the most popular text feature, and second, this work is the first of its kind. Tables 5 and 6 present unique unigram Roman Sindhi features. Table 5 lists Roman Sindhi unigrams with little or no sentiment orientation; such terms should be identified and filtered out to prepare the final feature set. Table 6 displays Roman Sindhi unigrams, their equivalent English (EE) words, and their sentiment scores. The lexical sentiment scores were annotated by three human annotators, and a score was accepted when the standard deviation across annotators was at most 0.1. A linguistic expert verified the final scores. The sentiment score ranges from −1 (negative) to +1 (positive). A score of 0 (zero) is considered neutral.
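The acceptance rule for annotated scores can be illustrated with a short Python sketch. The text does not state whether the population or sample standard deviation was used, nor how the three annotations were combined into one score; the sketch assumes the population standard deviation and the mean, and the function name is hypothetical.

```python
from statistics import mean, pstdev


def accept_score(annotations, max_std=0.1):
    """Return the mean annotation if the annotators agree closely enough, else None."""
    if pstdev(annotations) <= max_std:   # population standard deviation (assumption)
        return round(mean(annotations), 2)
    return None                          # disagreement: the word needs re-annotation
```

For example, three annotations of 0.4, 0.4, and 0.5 would be accepted, whereas 0.8, 0.2, and −0.4 would be rejected.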

Table 5.

List of Roman Sindhi Stop Words.

aa	aahe	Aahein	aahin	aahya	aahyou	aayal
aayus	aein	Aen	ag	ahe	aj	ajoko
andr	asa	Asaan	asanjo	awha	awhaan	b
baad	baabat	Baad	baahr	bae	bannjee	bannji
barsa	bdra	Beehr	bna	bnd	bya	chhadee
dafo	deh	Dno	doraan	doran	ehra	enda
endr	g	Gad	ghurje	ghurjey	hale	hali
halya	halyo	Hea	hee	heth	hin	hitey
hk	ho	Hoa	hoo	hoyus	hr	Hrr
hua	hue	Hun	hunan	hutey	huyus	ihe
ilaawa	ilawa	In	inhan	istemaal	itaan	ja
jadahn	jadhahein	Jaree	jari	je	jee	jein
jekr	jenh	Jestaein	jitaan	jiyaan	jn	jo
kadahn	kae	Kalh	kaya	kayo	kayo	kea
kehri	kehro	Kenh	kenh	ker	ketra	kha
khe	khud	Kithe	kn	Kndo	ko	koe
kre	kujh	Kya	lae	lagae	maa	marhoo
marhun	matha	Me	mein	milando	milo	milya
milyo	milyou	Mokhe	moun	muhjo	mukhe	na
ohe	paan	Pehja	poe	pr	pya	rahan
rahe	rahya	Rahyo	sa	saan	sabh	saghan
saghe	saghein	Saghje	saghjey	sagya	samho	sandn
sands	sbhn	Sghn	so	srf	ta	taan
te	tenh	Tey	tha	thee	tho	thyan
tn	tou	Touhjo	trfa	tuhjo	twha	uhe
un	unhan	Unhn	utaan	varto	vathe	vathn
vathno	vato	Vayun	vijhan	wadheek	wae	waee
wagherah	wara	Waree	waro	warto	wat	wathan
wato	wich	Ya	yaan

Table 6.

Lexical Roman Sindhi Sentiment Resource—A Sample.

No.	RS word	EE word	$Senti_{score}$	No.	RS word	EE word	$Senti_{score}$
1	Aadi	Addicted	−0.4	48	laanat	scold	−0.7
2	Aajzi waaro	Submissive	0.3	49	Madad	Support	0.4
3	Aazad khyaal	Liberal	0.0	50	Madadgaar	Helper	0.4
4	Achraj	Amazing	0.8	51	Malaamat	Scold	−0.4
5	Afsos	Alas	−0.3	52	Maloomati	Informative	0.5
6	Agraee	Aggressive	−0.4	53	masa’ala	problems	−0.4
7	Agwaan	Leader	0.6	54	Mazboot	Strong	0.43
8	Akelo	Alone	−0.4	55	Mehnti	Industrious	0.7
9	Azaab	Curse	−0.2	56	munkasir	modest	0.2
10	Baa-ikhlaaq	Polite	0.5	57	Nasl	generation	0.0
11	Bahkn	Overjoyed	0.8	58	nemat	blessing	0.6
12	Baley baley	Wow	0.8	59	Nojawaan	Young people	0.0
13	Barbaad	Destroyed	−0.4	60	Nojawaan nasl	Youth	0.0
14	Bdnaam	Notorious	−0.5	61	Ootaak	Sitting room	0.0
15	Beemaar	Unhealthy	−0.5	62	Pako sathee	Steadfast	0.6
16	Bethak	Drawing room	0.0	63	Mayousi pasand	Pessimist	−0.6
17	Beywaah	Helpless	−0.4	64	preshani	anxiety	−0.4
18	Bhalae	Welfare	0.6	65	pur kashish	attractive	0.4
19	Bigaarr	Distort	−0.4	66	Pur-josh	Zealous	0.37
20	Chaalak	Cunning	−0.3	67	qaboolyat	acceptance	0.2
21	Changaee	Prosperity	0.6	68	Qadeem	Oldest	0.1
22	Dhokhe baaz	Fraudster	−0.8	69	Ruthal	Dissatisfied	−0.5
23	Dhokho	Fraud	−0.8	70	Sahap waaro	Tolerant	0.5
24	Dhoko	Deceive	−0.6	71	Sakhee	Generous	0.6
25	faedemand	Benefiting	0.4	72	Sakhi	Generous	0.6
26	Faedo	Benefit	0.4	73	Sakht	Strict	0.0
27	fahash	Abusive	−0.6	74	Sarshaar	Dedicated	0.4
28	Freb	Deception	−0.6	75	Shaeq	Keen	0.3
29	frmaanbardaar	Submissive	0.3	76	Sharmnaak	Shameful	−0.5
30	Gadlo	Dirty	−0.4	76	Sharmnaak	Shameful	−0.5
31	Gair mutmain	Dissatisfied	−0.5	78	Sutho	Good	0.6
32	Gehro	Severe	−0.4	79	Taawun knder	Cooperative	0.4
33	Gumm	Lost	−0.6	80	tafreeh	amusement	0.6
34	hamlo	Attack	−0.2	80	tafreeh	amusement	0.6
35	Hathee	Help	0.4	82	Tanz	Criticize	−0.4
36	Himat deendar	Encouraging	0.4	83	Tarakee	Progress	0.5
37	Himthaen	Encourage	0.4	84	tareef	praise	0.6
38	Ihtjaaj	Protest	0.0	85	Tbaah	Destroy	−0.6
39	Ikhlaak	Moral	0.0	86	Tez raftaar	Speedy	0.0
40	ilzaam	Accusation	−0.4	87	Tez taraar	Cunning	−0.3
41	Istemaal	Use	0.0	88	tuhmat	blame	−0.4
42	Izafo krn	Increase	0.2	89	Vaadh	Increment	0.2
43	Jaakhoree	Hardworking	0.7	90	Waddo	big	0.2
44	Jaan nisaar	Devoted	0.6	91	Wadhaaro	Increase	0.2
45	Josheelo	Zealous	0.37	92	Wadhao	Gain	0.4
46	kaawar	Resentment	−1	92	Wadhao	Gain	0.4
47	Khoobsoorat	Beautiful	0.85	94	Waqf	Dedicated	0.4

Roman Sindhi to English Translator (RoSET)

The next component in the RSSA framework is RoSET, as shown in Figure 3. RoSET is one of the significant contributions of this research work and constitutes the core of the Roman Sindhi sentiment analyzer. Its working principle is the “Search-Match-Replace” method: RoSET parses the input text (sentence), detects a Roman Sindhi word, finds its matching English word, and replaces it. RoSET is built on top of “textblob”, which provides a sentiment lexicon and a sentiment analyzer. After the Roman Sindhi to English translation, it uses the “textblob” lexicon to auto-label the words and finally calculates the combined sentiment score for the whole sentence. However, the real power behind RoSET comes from the developed lexical unigram resource, a sample of which is given in Table 6.

The original Roman Sindhi text follows the pattern $subject + [qualifier] + object + [qualifier] + verb$, per the grammar rules of the original Perso-Arabic Sindhi script. An example is “Soshl medya khaandaani zindagee khe taaraaj kre chhadyo aahe.” In this sentence, “Soshl medya” is the subject, “khaandaani” is the object qualifier, “zindagee” is the object, “khe” is a preposition, “taaraaj” is the verb qualifier, and “krn/kre chhadan/chhadyo” is/are the verb(s). After translation, the word order may violate standard English grammar rules, but it remains adequate and effective for Roman Sindhi sentiment analysis. Algorithm 1 shows the working principle of RoSET.

Algorithm 1.

An algorithm for RoSET.

Input: bilingual text $input\_text$
Parse $input\_text$ containing words $w$
Search for Roman Sindhi word $RS_w$
if $w = RS_w$ then
    pause the text parsing process
    using the lexical resource dictionary, look up $w \leftarrow RS_w$
    resume parsing the input text
end if
Repeat the process till the end of the document
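The Search-Match-Replace loop of Algorithm 1 can be sketched in Python as follows. This is a minimal sketch, not the authors' code: the function name is hypothetical, the lexicon holds only a handful of entries taken from Table 6, and the subsequent scoring with the “textblob” analyzer is left out.

```python
# Tiny excerpt of the Roman Sindhi -> English lexicon (entries from Table 6).
RS_TO_EN = {
    "faedo": "benefit",
    "tbaah": "destroy",
    "madad": "support",
    "sutho": "good",
    "beemaar": "unhealthy",
}


def roset_translate(sentence, lexicon=RS_TO_EN):
    """Replace each known Roman Sindhi word with its English equivalent."""
    translated = []
    for word in sentence.split():                   # parse the input word by word
        key = word.lower().strip(".,!?")            # normalize for dictionary lookup
        translated.append(lexicon.get(key, word))   # replace on match, else keep the word
    return " ".join(translated)
```

Words that are not in the lexicon, including English words in a bilingual sentence, pass through unchanged, so the output can be handed to an English sentiment analyzer as is.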

Rule-Based Roman Sindhi Sentiment Scorer ($RBRS^{3}$)

Another core contribution is the rule-based Roman Sindhi sentiment scorer. $RBRS^{3}$ may be employed in parallel with RoSET or separately. After using RoSET, a few sentences may still receive a neutral score, contrary to their actual sentiment, because the “textblob” sentiment dictionary lacks the relevant words. This problem is addressed by developing a lexical sentiment resource that handles such words.

$RBRS^{3}$ works on top of “textblob”, with a newly created lexicon holding a sentiment score for each word. Each unattended Roman Sindhi word is assigned an appropriate rule-based sentiment score. The working of the scorer is further described in Algorithm 2. $RBRS^{3}$ parses the input bilingual sentence, searches for Roman Sindhi words, and assigns each an appropriate sentiment score. Score assignment continues until the end of the sentence (a carriage-control character) is encountered.

Algorithm 2.

An algorithm for $RBRS^{3}$.

Input: bilingual text $input\_text$
Parse $input\_text$, searching for Roman Sindhi word $RS_w$
if $RS_w$ found then
    pause the text parsing process
    using the lexical sentiment resource dictionary, look up $RS_w \leftarrow senti_{score}$
    resume parsing the input text
end if
Repeat the process till the end of the document
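A dictionary lookup in the spirit of Algorithm 2 might look like the Python sketch below. The function name is hypothetical, the lexicon is a small excerpt of the scores in Table 6, and the integration with “textblob” is not shown.

```python
# Excerpt of the Roman Sindhi sentiment lexicon (scores from Table 6).
RS_SENTIMENT = {
    "faedo": 0.4,
    "tbaah": -0.6,
    "sutho": 0.6,
    "beemaar": -0.5,
    "madad": 0.4,
}


def rbrs3_scores(sentence, lexicon=RS_SENTIMENT):
    """Return (word, score) pairs for every word of the sentence found in the lexicon."""
    scored = []
    for word in sentence.split():          # parse the sentence word by word
        key = word.lower().strip(".,!?")   # normalize for dictionary lookup
        if key in lexicon:                 # Roman Sindhi word found
            scored.append((key, lexicon[key]))
    return scored
```

The returned per-word scores are the inputs that a sentence-level aggregation step would combine.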

The impact of $RBRS^{3}$ is shown in Table 7. The input sentences are preprocessed using the text preprocessing pipeline, which reduces the feature vector space. The preprocessed Roman Sindhi sentences are then assigned sentiment scores by the $RBRS^{3}$ module.

Table 7.

Impact of $RBRS^{3}$ on Roman Sindhi Sentences.

No.	Cat.	Roman Sindhi sentences	Sentiment score	Polarity
1	Actual	social media time jo waste aahe	−0.083	Negative
	Transformed	social media time waste		
2	Actual	manrhoń sust thee waya aahin	−0.250	Negative
	Transformed	people lazy		
3	Actual	intrnet maán wado faedo aahe	0.200	Positive
	Transformed	Internet very benefit		
4	Actual	dunya milee waee aahe, rabta mazboot thya aahin	0.433	Positive
	Transformed	world connected, interaction strong		
5	Actual	sabh manrhoń soshal medya je puthyaán paejee wayaa aahin.	0.0	Neutral
	Transformed	all people social media search		

Auto Labeling the Lexical Features

The study aimed to develop a system that helps label Roman Sindhi words with suitable sentiment scores. The goal is achieved through RoSET, $RBRS^{3}$, and the sentiment lexical resources. RSSA, powered by these three components, parses each dataset sample and labels each Roman Sindhi word. The auto-labeling process continues until the input sentence ends.

Sentence Level Sentiment Score

A Roman Sindhi (RS) sentence is assigned a cumulative sentiment score that considers each RS unigram, using Equation 1, where $x_{i}$ is the individual lexical score and $w_{i}$ is the weight given in Equation 2.

$$Senti_{score} = \frac{\sum_{i = 1}^{n} x_{i} w_{i}}{\sum_{i = 1}^{n} w_{i}} \qquad (1)$$

$$w_{i} = \frac{1}{1 + 2 x_{i}} \qquad (2)$$
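Equations 1 and 2 translate directly into a few lines of Python. The sketch below follows the formulas as printed; the function names are hypothetical. Taken literally, Equation 2 is undefined for $x_{i} = -0.5$ and yields negative weights for lower scores, so the sketch raises an error in the undefined case and returns zero when the weights sum to zero. Both guards are assumptions, not part of the described method.

```python
def weight(x):
    """Weight of one lexical score, per Equation 2: w = 1 / (1 + 2x)."""
    denominator = 1 + 2 * x
    if abs(denominator) < 1e-12:
        # Equation 2 has no value here; how to handle it is an assumption.
        raise ValueError("weight is undefined for a score of -0.5")
    return 1 / denominator


def sentence_score(scores):
    """Weighted mean of lexical scores, per Equation 1."""
    weights = [weight(x) for x in scores]
    total_weight = sum(weights)
    if abs(total_weight) < 1e-12:   # empty input or cancelling weights (assumption)
        return 0.0
    return sum(x * w for x, w in zip(scores, weights)) / total_weight
```

As a worked case, scores of 0.4 and 0.6 give weights $1/1.8$ and $1/2.2$, and the weighted mean is $(49/99) / (100/99) = 0.49$.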

Category-Wise Sentiment Calculation (CSC)

Categorization by “PN-polarity” is the last step before the RSSA results are presented, as shown in Figure 5. A sentence is declared positive if $Senti_{score} > 0$ and negative if $Senti_{score} < 0$. If the sum of the scores of all the lexical features is zero, the sentence is declared neutral, as shown in Algorithm 3.

Figure 5.

Comparison: Showing the impact of Roman Sindhi sentiment analyzer.

Algorithm 3.

An algorithm for CSC.

$count_{positive} \leftarrow 0$
$count_{negative} \leftarrow 0$
$count_{neutral} \leftarrow 0$
for each item $i$ in $Senti_{score}$ do
    $i \leftarrow Senti_{score}$
    if $i > 0$ then
        $count_{positive} \leftarrow count_{positive} + 1$
    else if $i < 0$ then
        $count_{negative} \leftarrow count_{negative} + 1$
    else
        $count_{neutral} \leftarrow count_{neutral} + 1$
    end if
end for
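The counting loop of Algorithm 3 can be written compactly in Python. The function name and the dictionary-based return value are illustrative choices rather than part of the described system.

```python
def count_by_polarity(sentence_scores):
    """Count sentences as positive (> 0), negative (< 0), or neutral (== 0)."""
    counts = {"positive": 0, "negative": 0, "neutral": 0}
    for score in sentence_scores:
        if score > 0:
            counts["positive"] += 1
        elif score < 0:
            counts["negative"] += 1
        else:
            counts["neutral"] += 1
    return counts
```

Applied to the five sentence scores listed in Table 7 (−0.083, −0.250, 0.200, 0.433, and 0.0), this yields two positive, two negative, and one neutral sentence.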

RSSA focuses on three-way sentiment polarity detection (positive, negative, neutral). It could be extended to a multilevel analysis by incorporating “SO-polarity” (Subjective-Objective polarity) detection and bifurcation criteria.

Results and Discussion

This section presents the results obtained for the Roman Sindhi Sentiment Analyzer. In total, 4,500 bilingual subjective samples were collected from heterogeneous sources using a semi-automated method. The topic of discussion was whether the people of Sindh (Pakistan) consider the Internet and social media a blessing or a curse. Of the 4,500 sentences, 3,779 were written in English and 721 in Roman Sindhi script. With only the previously available lexical resources, these 721 samples would have been left unattended in the overall sentiment analysis (calculated as neutral). The Roman Sindhi sentiment analyzer counts them, assigning a positive, negative, or neutral score to each Roman Sindhi expression. Figure 5a shows that, without RSSA, 1,841 sentences were classified as positive, 1,585 as negative, and 1,074 as neutral. With RSSA (Figure 5b), the classification statistics shifted in line with the study goal: positive sentences rose to 2,224 (a 20.80% increase), negative sentences rose to 1,839 (a 16% increase), and neutral sentences fell to 437 (a 59.31% decrease).
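The reported percentage changes follow from the sentence counts by simple arithmetic, which the Python sketch below reproduces. The variable and function names are illustrative; the counts are those given for Figure 5a and Figure 5b.

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100


without_rssa = {"positive": 1841, "negative": 1585, "neutral": 1074}
with_rssa = {"positive": 2224, "negative": 1839, "neutral": 437}

# Percentage change per polarity class, rounded to two decimals.
changes = {
    label: round(percent_change(without_rssa[label], with_rssa[label]), 2)
    for label in without_rssa
}

# Sentences that left the neutral class must reappear as positive or negative.
moved_out_of_neutral = without_rssa["neutral"] - with_rssa["neutral"]
```

The 637 sentences that leave the neutral class match the combined gain of the positive (383) and negative (254) classes, so the total stays at 4,500.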

The secondary results show that the people of Sindh (Pakistan) hold mixed views on Internet and social media usage: 49.42% favored the Internet and social media, 40.86% were against them, and 9.71% did not give a clear point of view.

Rules for Writing Roman Sindhi

In addition to the 15 basic Roman Sindhi writing rules described in Sodhar et al. (2021), the following rules are also proposed. These rules will normalize the informal writing style used on social media networks. Writing rules will improve the chances for a text message, tweet, or post to be counted. Following these rules will also help the developers to strengthen the sentiment resources more conveniently. The rules encompass Sindhi language letters that help to connect letters to form a meaningful word. These letters are similar to vowels in the English language, as shown in Figure 6.

Rule 01 “aa” may be used for a high voice note, for example, “aadi”.

Rule 02 “aa” represents a high sound. If “aa” produces a high sound in a word, then the succeeding “a” has a similar pronunciation, as in “naakari” and “taaraj”.

Rule 03 If two or more characters succeed “aa”, then “aa” should be used for high sounds, as in “Aabdaar”, which means “honorable”.

Rule 04 If the character numbered 2 in Figure 6 is used at the end of a word with a deep sound, users may write it as “ee” or “i”, as in “mehnti” or “agraee”.

Rule 05 If the character of Rule 04 is used at the end or in the middle of a word with a soft sound, users may write it as “e” or “ey”, as in “weho” or “baley”, respectively.

Rule 06 If the character numbered 3 in Figure 6 is used at the end of a word with a deep sound, users may write it as “oo” or “u”, for example, “Aabroo” or “Aabru”.

Rule 07 Similarly, if the character of Rule 06 is used between other characters or at the end of a word with a light sound, “o” may be used, for example, “sutho”, “bhalo”, and “suhno”.

Rule 08 Finally, “a” or “aa” can be used for the first character, as in “achha” or “baar”.

Figure 6.

Most common Sindhi language words similar to English vowels.

Following these Roman Sindhi writing rules is important for building a robust and scalable Roman Sindhi sentiment analyzer because Roman Sindhi leaves users considerable freedom in writing style. For example, the word “barbaad” (numbered 6 in Table 6) can be written in multiple ways, such as “brbad”, “brbaad”, and “berbaad”. To cope with such word variations, either users follow the Roman Sindhi writing rules provided above, or a separate dictionary resource is required. The same variation issue applies to all other Roman Sindhi words and their variants. Figure 7 shows the most common Roman Sindhi unigrams in the collected dataset, whereas Figure 8 presents unique Roman Sindhi unigrams. These lists are largely domain specific. However, the Roman Sindhi stop word list (Figure 6) is common to all the data samples.
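One way to picture the separate dictionary resource for word variations is a lookup table that maps informal spellings to a single canonical form before sentiment scoring. The Python sketch below is only an assumption about how such a resource might look; the names `VARIANTS`, `normalize_token`, and `normalize_tokens` are hypothetical, and only the “barbaad” variants are taken from the text.

```python
# Hypothetical variant resource: informal spelling -> canonical spelling.
# Only the "barbaad" entries come from the article; the structure is assumed.
VARIANTS = {
    "brbad": "barbaad",
    "brbaad": "barbaad",
    "berbaad": "barbaad",
}


def normalize_token(token, variants=VARIANTS):
    """Return the canonical spelling of a token, or the lowercased token if unknown."""
    key = token.lower()
    return variants.get(key, key)


def normalize_tokens(tokens, variants=VARIANTS):
    """Normalize every token in a list of already tokenized words."""
    return [normalize_token(token, variants) for token in tokens]


print(normalize_tokens(["brbad", "Brbaad", "berbaad", "barbaad"]))
```

With such a table, the sentiment lexicon needs only one entry per word, and new spellings can be added to the variant resource without touching the scores.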

Figure 7.

Most frequent Roman Sindhi unigrams.

Figure 8.

Most unique Roman Sindhi unigrams.

Conclusion and Future Work

Sentiment analysis aims to determine “PN-polarity” in an opinionated sentence. Researchers adopt either of the two methods to accomplish the task of sentiment analysis: supervised machine learning techniques or a lexicon-based approach. Machine learning techniques need labeled dataset(s), whereas lexicon-based systems require sentiment computational resources. There are sufficient labeled datasets and computational resources for resource-rich languages such as English, but resource-poor languages such as Roman Sindhi writing lack such facilities.

This work is one of the initial (the first of its kind to the best of our knowledge) attempts to develop sentiment computational resources and their allied modules for Roman Sindhi Sentiment Analysis (RSSA). The resources provide the capability in the system to calculate sentiment orientation for a sentence expressed in Roman Sindhi. Roman Sindhi resources will help get people’s voices counted, which otherwise used to go uncounted.

For this work, the collected bilingual dataset consists of 4,500 samples, of which 721 sentences are written exclusively in Roman Sindhi script; that is, 16.02% of the samples in the dataset are solely in Roman Sindhi. Without the lexical sentiment resources, these Roman Sindhi sentences would have been lost (counted as neutral in a “PN-polarity” detection system). This percentage relates only to the dataset collected for this study, and the ratio of English to Roman Sindhi sentences may differ in other topical datasets. Nonetheless, after incorporating the lexical resources and processing modules (RoSET and $RBRS^{3}$), the Roman Sindhi sentences were counted, which was the purpose of the work.

This study resulted in three outputs: sentiment computation resource(s), RoSET, and $RBRS^{3}$.

The experiments reveal that, after applying RSSA, the sentiment detection rate of opinionated sentences increased significantly. Positive sentences increased by 20.8% and negative sentences by 16%, whereas neutral sentences decreased by 59.31%, a visible improvement for a bilingual sentiment analysis system.

It is recommended that users follow the rules discussed in the Rules for Writing Roman Sindhi section when writing the Roman Sindhi script. Sentences written in a standard form will enable the system to detect the expressed sentiments more effectively and to determine their polarity more efficiently. As future work, the developed lexical sentiment resources may be strengthened by adding more (multi-domain) vocabulary, developing a resource for Roman Sindhi word variations, and adding a module for negation handling.

Footnotes

Declaration of Conflicting Interests

The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The authors are thankful to the Deanship of Scientific Research at Najran University for funding this work, under the supervision of the Research Centre Funding program, under the grant code (NU/RCP/SERC/12/15).

ORCID iD

Asadullah Shaikh

References

Ali, Razzaq, Ali, Qadri, & Zia (2021). Improving sentiment analysis efficacy through feature synchronization. Multimedia Tools and Applications, 80, 13325–13338.

Alvi, Mahoto, Alvi, Unar, & Shaikh (2018). Hybrid classification model for Twitter data-a recursive preprocessing approach [Conference session]. 2018 5th International Multi-Topic ICT Conference (IMTIC).

Baccianella, Esuli, & Sebastiani (2010). Sentiwordnet 3.0: An enhanced lexical resource for sentiment analysis and opinion mining [Conference session]. Proceedings of The Seventh International Conference on Language Resources and Evaluation (LREC’10).

Board, S. T. (2022). eBooks sindh textbook board jamshoro [Online]. Retrieved December 29, 2021, from https://ebooks.stbb.edu.pk/

Cambria, Poria, Gelbukh, & Thelwall (2017). Sentiment analysis is a big suitcase. IEEE Intelligent Systems, 32, 74–80.

Department, P. W. (2022). Sindh demographic indicators [Online]. Retrieved June 29, 2023, from https://pwd.sindh.gov.pk/sindh

Dootio, M. A., & Wagan, A. I. (2021). Development of Sindhi text corpus. Journal of King Saud University - Computer and Information Sciences, 33, 468–475.

Honnibal, & Montani (2017). spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing [Online]. https://spacy.io

Hutto, & Gilbert (2014). Vader: A parsimonious rule-based model for sentiment analysis of social media text. Proceedings of the International AAAI Conference on Web and Social Media, 8, 216–225.

Jianqiang, & Xiaolin (2017). Comparison research on text pre-processing methods on twitter sentiment analysis. IEEE Access, 5, 2870–2879.

Khan, M. M., Shahzad, & Malik, M. K. (2021). Hate speech detection in Roman Urdu. ACM Transactions on Asian and Low-Resource Language Information Processing, 20, 1–19.

Leghari, & Rahman, M. U. (2015). Towards transliteration between Sindhi scripts using Roman script. Linguistics and Literature Review, 1, 101–110.

Loria (2020). TextBlob: Simplified text processing [Online]. Retrieved November 10, 2021, from textblob.readthedocs.io/en/dev/index.html/

Mehmood, Essam, & Shafi (2018). Sentiment analysis system for Roman Urdu (pp. 29–42). Science And Information Conference.

Mehmood, Essam, Shafi, & Malik, M. K. (2020). Sentiment analysis for a resource poor Language—Roman Urdu. ACM Transactions on Asian and Low-Resource Language Information Processing, 19, 1–15.

Mukhtar, & Khan, M. A. (2020). Effective lexicon-based approach for Urdu sentiment analysis. Artificial Intelligence Review, 53, 2521–2548.

Nielsen (2011). A new ANEW: Evaluation of a word list for sentiment analysis in microblogs. ArXiv Preprint ArXiv:1103.2903.

Qureshi, M. A., Asif, Hassan, M. F., Abid, Kamal, Safdar, & Akbar (2022). Sentiment analysis of reviews in natural language: Roman Urdu as a case study. IEEE Access, 10, 24945–24954.

Rana, T. A., Shahzadi, Rana, Arshad, & Tubishat (2022). An unsupervised approach for sentiment analysis on social media short text classification in roman Urdu. ACM Transactions on Asian and Low-Resource Language Information Processing, 21, 1–16.

Rauf, & Pad (2019). Learning trilingual dictionaries for Urdu-Roman Urdu-English [Conference session]. WNLP@ ACL.

Sadia, Ullah, Hussain, Gul, Hussain, M. F., Ul Haq, & Bakar (2020). An efficient way of finding polarity of roman Urdu reviews by using Boolean rules. Scalable Computing Practice and Experience, 21, 277–289.

Sodhar, I. N., Jalbani, A. H., Buller, A. H., Channa, M. I., & Hakro, D. N. (2020). Sentiment analysis of Romanized Sindhi text. Journal of Intelligent & Fuzzy Systems, 38, 5877–5883.

Sodhar, I. N., Jalbani, A. H., Channa, M. I., & Hakro, D. N. (2021). Romanized Sindhi rules for text communication. Mehran University Research Journal of Engineering and Technology, 40(2), 298–304.

Statista. (2022). Demographics & use [Online]. Retrieved January 13, 2022, from https://www.statista.com/statistics/262946/most-common-languages-on-the-internet/

Symeonidis, Effrosynidis, & Arampatzis (2018). A comparative evaluation of pre-processing techniques and their interactions for twitter sentiment analysis. Expert Systems with Applications, 110, 298–310.
