Sage Journals: Discover world-class research

Abstract

Semantic word similarity is a quantitative method of determining how much two terms are contextually identical, which is a considerable challenge for computational linguistics. The research community has examined a range of approaches to address this issue. However, most of these approaches are for a comparatively limited set of languages, especially English. Research on semantic word similarity for South Asian languages, particularly Urdu, is immature. In recent years, transformer-based approaches have proved extremely successful for a range of language processing tasks. The primary aim of this study is to develop and compare a variety of transformer-based approaches to the cross-lingual English–Urdu semantic word similarity task. This study evaluated a publicly available benchmark USWS-19 corpus that comprises 518 word pairs. This study mainly explored four types of transformer-based approaches: (a) cross-lingual sentence transformer-based approaches using the original dataset, (b) cross-lingual sentence transformer-based approaches using the translated dataset (translation plus monolingual analysis [T+MA] approach), (c) the feature fusion approach (mixture of features), and large language models. In addition, this study also explores the word embedding-based approach using the translated dataset (T+MA approach). In total, this study developed 29 transformer-based models, with the highest results (Pearson correlation = 0.788) achieved using a feature fusion approach, that is, Best-Two-SBERT (where SBERT stands for sentence-bidirectional encoder representations from transformers; using T+MA) + BEST Baseline (with Bing translator) + Best cross-lingual SBERT. This approach improved by 7% over previously reported results on the same corpus.

Keywords

semantic word similarity cross-lingual semantic word similarity English–Urdu word pairs

1. Introduction

In the field of natural language processing (NLP), semantic word similarity (SWS) is a task that quantifies the relationship between pairs of words using a standard metric. Worth’s (2023) SWS is an important task as it has the potential to improve various research areas such as semantic web (Chahal & Singh, 2021), ontology generation (De Maio et al., 2009), word sense disambiguation (Karov & Edelman, 1998), semantic relatedness (Gabrilovich & Markovitch, 2007), information retrieval (Hliaoutakis et al., 2006), entity disambiguation (Zhu & Iglesias, 2018), spam filtering (Li & Huang, 2012), video retrieval (Aytar et al., 2008), and language translation (Duyck & Brysbaert, 2008; Tan et al., 2024).

The researcher explored two types of SWS (Camacho-Collados et al., 2017). These are: (a) monolingual SWS (ML-SWS) and (b) cross-lingual SWS (CL-SWS). ML-SWS deals with determining the semantic relatedness between two words that belong to the same language. In contrast, CL-SWS evaluates the degree of similarity between word pairs of two different languages. The focus of this research work is on CL-SWS for English–Urdu word pairs.

In recent years, significant studies have been conducted on CL-SWS for various language pairs, such as English–Persian (Jokar, 2024), English–Spanish (Tetzloff, 2024), English–Italian (Camacho-Collados et al., 2017), English–Romanian, Spanish–Romanian (Niewiarowski & Plichta, 2023), Spanish–Arabic (Ech-Charfi, 2023; Finkelstein et al., 2001), Chinese–English (Dai & Huang, 2011), and Tibetan–Chinese (Ma et al., 2020). Considering the CL-SWS task for the English–Urdu language pair, we found only one study conducted by Fatima et al. (2021a). In their study, they developed a benchmark corpus called the USWS-19 corpus for CL-SWS for the English–Urdu language pair and applied the translation plus monolingual analysis (T+MA) approach. The best results were obtained using the Bing translator approach with a Pearson correlation score of 0.716. As noted, there is still room for improvement in both results and approaches. To address this gap, the main aim of this study is to carry out a detailed and in-depth comparison of sentence transformer-based approaches with T+MA and without T+MA for CL-SWS for the English–Urdu language pair.

Urdu is the national language of Pakistan and is an important international language with more than 300 million speakers around the world (Ashraf & Khaleel, 2024; Daud et al., 2017). The vocabulary and grammar of Urdu are inherited from Persian, Arabic, and South Asian languages (Rahman, 2004). It is difficult to automatically process Urdu because it is a morphologically rich language. Some of its words may have up to 40 forms (Mehmood, 2021; Naseer & Hussain, 2009). In recent years, a lot of digital data has been generated for the Urdu language through social media websites, news articles, blogs, etc. Despite its wide usage, Urdu is still a poorly resourced language for NLP. Efforts are being made to create Urdu computational resources (Khattak et al., 2021).

This study presents a range of 29 approaches for CL-SWS in the English–Urdu language pair. These approaches include T+MA with sentence transformer methods, cross-lingual sentence transformer-based methods, and five-word embedding-based methods. Each approach was applied with different combinations as feature fusion approaches. This was done to identify the best approach for CL-SWS. Our proposed feature fusion-based model outperformed the baseline and other methods. It achieved a Pearson correlation score of 0.788.

The rest of the paper is organized as follows: Section 2 presents the related work. Section 3 presents the approaches used for the English–Urdu SWS task. Section 4 presents the experimental setup. Section 5 shows the results and their analysis. Finally, Section 6 concludes the study.

2. Related Work

In the literature, various efforts have been made to develop novel approaches for the CL-SWS task. The following section details some prominent attempts for this task.

Brans and Bloem (2024) explored the CL-SWS task for the Dutch language. They used the SimLex-999 corpus dataset. Similarity judgments were collected from 235 native Dutch speakers. The study tested two transformer-based Dutch language models, Bertje and RobBERT. Bertje showed better alignment (0.42) with human judgments than RobBERT (0.24).

Researchers conducted a useful study on the Turkish–English CL-SWS task (Arslan et al., 2023). The study used three Google translated datasets: SemEval2017 (Camacho-Collados et al., 2017), MC-30 (Miller & Charles, 1991), and SimVerb-3500 (Gerz et al., 2016), to measure word similarity. Researchers utilized three word embedding models. The top scores were 0.62 for SemEval2017, 0.69 for MC-30, and 0.22 for SimVerb-3500, all obtained using a word embedding approach.

In a study by Hassan and Mihalcea (2009), the CL-SWS task was explored for four languages: English, Spanish, Arabic, and Romanian. The experiments used four corpora translated from two English corpora, MC-30 (Miller & Charles, 1991) and WordSim-353 (Finkelstein et al., 2001). Two approaches, Lesk and Cosine, were applied. The best results (in terms of Pearson correlation) achieved were: English–Spanish (0.43), English–Arabic (0.32), English–Romanian (0.50), Spanish–Arabic (0.20), Spanish–Romanian (0.38), and Arabic–Romanian (0.32).

The SemEval-2017 Task 2 (Camacho-Collados et al., 2017) organized an event on the CL-SWS task for four languages: English, Persian (Farsi), Spanish, and Italian. Twelve teams participated in the CL-SWS task (Camacho-Collados et al., 2017). The highest results were achieved by the Luminoso run2 team.¹ This team achieved the highest Pearson correlation values: 0.60 for English–Farsi, 0.77 for English–Italian, 0.62 for Spanish–Farsi, 0.74 for Spanish–Italian, and 0.60 for Italian–Farsi.

In another investigation (Vulic & Moens, 2013), the CL-SWS task is studied for four languages: English, Spanish, Dutch, and Italian. The experiments use three parallel language corpora: the IT-EN-W dataset, the Spanish–English (ES-EN-W) dataset, and the Dutch–English (NL-EN) dataset. The best results are attained by employing multilingual probabilistic topic models. The results are as follows: 0.89 for the IT-EN-W dataset, 0.63 for the ES-EN-W dataset, and 0.51 for the NL-EN dataset.

Researchers also conducted a useful study for the Chinese–English CL-SWS task (Xia et al., 2011). In this study, the experiments use HowNet² and a parallel corpus. The top correlation score of 0.80 is obtained using a corpus-based approach.

Similar to the previous one, another study on the Chinese–English CL-SWS task is Dai and Huang (2011). The experiments use attribute and relation-based measures. The best results are obtained using the corpus-based approach, with a top correlation score of 0.90.

In a study (Ma et al., 2020), Tibetan and Chinese CL-SWS are explored. The experiments use the Tibetan–Chinese Wikipedia corpus³ for training and the MUSE library.⁴ The best results are obtained using word embeddings, achieving a 0.67 Spearman’s correlation ratio.

Regarding Urdu, Fatima et al. (2021a) conducted a study on the English–Urdu SWS task. The experiments used the CLSW-19 corpus, which contains 518 word pairs. The best results were obtained using the path length measure, with a Pearson correlation of 0.71.

To summarize, in the context of the English–Urdu CL-SWS task, only one contribution was found (Fatima et al., 2021b). This contribution used path length measures but ignored state-of-the-art approaches such as transformer-based methods. To address this gap, this study conducted a series of experiments using transformers and achieved state-of-the-art results. Sentence transformer-based approaches were not previously explored for CL-SWS for the English–Urdu language pair. We also propose feature fusion approaches with various combinations in this study. All these feature fusion approaches outperformed baseline results. Our proposed model improved the results by 7%. This combination has not been previously reported.

3. Approaches for CL-SWS

3.1. Cross-Lingual Sentence Transformer-Based Approaches

Several language representation models have been introduced in recent years. These include bidirectional encoder representations from transformers (BERT; Devlin et al., 2019; Peters et al., 2018), Universal Sentence Encoder (Cer et al., 2018), InferSent (Conneau et al., 2017), language-agnostic sentence representations (LASER; Feng et al., 2020), and RoBERTa (Liu et al., 2019). Recently, Reimers and Gurevych (2019) introduced the Sentence-BERT (SBERT) model. SBERT is a variant of the BERT neural network. It employs a triplet neural network and a Siamese neural network design to create semantically significant sentence embeddings. The effectiveness of SBERT models has been demonstrated in various NLP applications. These include mining bitext (Feng et al., 2020), paraphrase identification (Thakur et al., 2021; Fenogenova, 2021), document dating (Massidda, 2020), objective-based hierarchical clustering (Naumov et al., 2020; Darányi et al., 2023), text classification (Minaee et al., 2021), generating a missing part for story completion (Mori et al., 2020), identifying similar patent documents (Navrozidis & Jansson, 2020), semantic textual similarity (Guo et al., 2020), sentiment analysis (Ke et al., 2020), text ranking (Yates et al., 2021), question answering (He et al., 2020), and machine translation (Rei et al., 2020), among others.

For this study, we used a Python framework called Sentence Transformer.⁵ This framework is based on the SBERT model (Reimers & Gurevych, 2019). The sentence transformer architecture can determine sentence embeddings for more than 100 languages. It also allows for independent training of sentence transformers for general and specific tasks. Pretrained general sentence transformers include BERT-base, BERT-large, BERT-base-Wikipedia-mean-token, RoBERTa-base, and RoBERTa-large. Special sentence transformers include paraphrase-distilroberta-base-v1 (for paraphrase detection), stsb-roberta-base-v2, stsb-distilroberta-base-v2 (for semantic textual similarity), quora-distilbert-base (for duplicate question detection), nq-distilbert-base-v1 (for information retrieval), and language-agnostic BERT sentence embedding (LaBSE; for bitext mining, Reimers & Gurevych, 2020).

The general sentence transformers are trained on a combination of the Stanford Natural Language Inference (SNLI) corpus⁶ (Bowman et al., 2015) and the MultiGenre NLI (MultiNLI) corpus⁷ (Williams et al., 2018). The specific sentence transformer for paraphrase detection is trained on 50 million English paraphrase pairs. The semantic textual similarity models are trained on SNLI + MultiNLI and fine-tuned on the semantic textual similarity benchmark dataset.⁸ The duplicate questions detection model is fine-tuned on NLI+ semantic textual similarity benchmark dataset and then fine-tuned for quora duplicate questions.⁹ The information retrieval model is trained on information retrieval from Google’s Natural Questions dataset¹⁰ and a dataset with 100k real queries from Google search. It also uses translated pairs¹¹ (Feng et al., 2020) datasets.

For this study, we used four general sentence transformers: (a) bert-base-nli, (b) bert-large-nli,¹² (c) roberta-base-nli-mean-tokens, and (d) roberta-large-nli-mean-tokens.¹³ These general sentence transformers were chosen because they have all been utilized and evaluated for SWS tasks in previous work using STS12-STS16, STSb, and SICK-R datasets (Reimers & Gurevych, 2019). The roberta-large-nli model has been proven to be the most successful for STS12, STS15, STS16, and on average across all datasets (Reimers & Gurevych, 2019). The performances of these datasets (Reimers & Gurevych, 2019) with Pearson were (74.53, 81.85, 76.82, and 76.68) on STS12, STS15, STS16, and on average, respectively.¹⁴ There is only a minor difference between BERT and RoBERTa’s performance. Bert-large-nli has also shown the best performance for STS13, STS14, and STSb with Pearson scores of 78.46, 74.90, and 79.23, respectively. The performances of roberta-base-nli and bert-base-nli are slightly lower than roberta-large-nli and bert-large-nli. We evaluated the performance of all these models for CL-SWS similar to the STS task in Reimers and Gurevych (2019).

To calculate the similarity between cross-lingual sentence pairs, we need a multilingual sentence embedding model. For this study, we used the pretrained LaBSE. LaBSE extracts 768-dimensional averaged vectors for sentence data from both English and Urdu texts. The corpus includes 17 billion monolingual sentences and 6 billion bilingual translation pairs. We chose this model because it works well for translated pairs in several languages and supports 109 different languages (Feng et al., 2020).

For this experiment, we used the pretrained SBERT (distiluse-base-multilingual-cased-v2). This is a multilingual, distilled version of the Universal Sentence Encoder. Unlike the v1 model, which supports 15 languages, this version covers 50+ languages. It extracts 768-dimensional averaged vectors for both English and Urdu texts (Reimers & Gurevych, 2020). We chose this model because it performs well for information distillation across multiple languages and supports 50 different languages.

We also used another specific model for evaluation. The pretrained SBERT (quora-distilbert-multilingual) was selected. This model is a multilingual version of distilbert-base-nli-stsb-quora-ranking. It is fine-tuned with parallel data for 50+ languages and extracts 768-dimensional averaged vectors of sentences. We chose this model because it supports 50 languages and performs well for pair ranking in multiple languages (Reimers & Gurevych, 2020).

To test performance, we selected a different special model. For this experiment, we used the pretrained SBERT (quora-distilbert-multilingual). This model is a multilingual variant of quora-distilbert-base-multilingual, designed for English and Urdu texts. It was fine-tuned with parallel data for over 50 languages and extracts 768-dimensional averaged vectors of sentences. We chose this model because it effectively detects duplicate questions across multiple languages and supports 50 different languages (Reimers & Gurevych, 2020).

We selected additional special models developed for STS. We used the pretrained SBERT (stsb-xlm-r-multilingual).¹⁵ This model was chosen because it supports 50 languages and works well for cross-lingual semantic textual similarity.

We also selected other models: bert-large-nli-stsb-mean-tokens, bert-base-nli-stsb-mean-tokens, xlm-r-100langs-bert-base-nli-mean-tokens, and xlm-r-100langs-bert-base-nli-stsb-mean-tokens. These models were selected to evaluate all models developed for cross-lingual semantic textual similarity that support multiple languages. Each model has different dimensions and is trained and fine-tuned on various datasets for STS tasks.

The process for determining the similarity between English–Urdu text pairs involves two steps. First, the English word was converted into a sentence embedding vector using an SBERT model. Second, the Urdu word was also transformed into a sentence embedding vector. In the second step, the cosine similarity metric was used to compare the embedding vectors of English–Urdu word pairs (see equation (1)). For scores between 0 and 4, the results were multiplied by 4.

Cosine ({Word}_{Eng}, {Word}_{Ur}) = \frac{\vec{{W o r d}_{Eng}} . \vec{{W o r d}_{Ur}}}{| \vec{{W o r d}_{Eng}} | \times | \vec{{W o r d}_{Ur}} |}

(1)

where

| (\vec{{W o r d}_{Eng}}) |

and

| (\vec{{W o r d}_{Ur}}) |

represent the length of English word and Urdu word vector, respectively. The cosine similarity metric permits partial matching, enabling a more accurate similarity assessment.

3.2. Large Language Models

Large language models (LLMs) represent the current state-of-the-art in text generation as well as various downstream NLP tasks (Yadav et al., 2024). Through the application of prompt engineering techniques, LLMs can be optimized to yield improved outcomes, mitigating the necessity for costly fine-tuning processes (Sun et al., 2024). Although LLMs are predominantly designed for text generation, they have also been employed in tasks related to semantic similarity and paraphrasing.

3.3. Translation Plus Monolingual Analysis (T+MA)

3.3.1. Sentence Transformer Using T+MA-Based Approach

To carry out the detailed analysis, we used some monolingual and multilingual sentence transformers with a T+MA-based approach with Bing translator. For this study, we used some STS models and other special transformers.

We used all multilingual models with T+MA, which have been explored above on cross-lingual corpus including LaBSE (Feng et al., 2020), distiluse-base-multilingual-cased-v2 (Reimers & Gurevych, 2020), and especially developed for the STS task (stsb-bert-base, stsb-bert-large, stsb-roberta-base, stsb-distilroberta-base-v2, bert-large-nli-stsb-mean-tokens, and xlm-r-large-en-ko-nli-ststb). All these transformers are trained on STSb benchmark datasets and showed reasonable performance (Reimers & Gurevych, 2019). The selection of these models was made to evaluate all models created for both monolingual and multilingual semantic text similarity tasks. All of these models have distinct dimensions and were developed for STS tasks using various datasets.

We selected another special model to assess the performance. For this experiment, we used the pretrained SBERT (allenai-specter¹⁶; Reimers & Gurevych, 2020 for English and translated-Urdu word text. The reason for selecting this model is that it works well for finding similar papers in a monolingual setting.

For all transformers, the following steps were used to determine the similarity between English–Urdu word pairings: The Urdu text was first translated into English using Bing translator. In the second step, English and translated-Urdu (in English) word pairs were converted to sentence embedding vectors using the SBERT model. The sentence embedding vectors of English and translated-Urdu word pairs were compared using the cosine similarity measure (see equation 2). For scores between 0 and 4, these results were multiplied by 4.

Cosine ({Word}_{Eng}, {Word}_{Ur-translation}) = \frac{\vec{{W o r d}_{Eng}} . \vec{{W o r d}_{Ur-translation}}}{| \vec{{W o r d}_{Eng}} | \times | \vec{{W o r d}_{Ur-translation}} |}

(2)

where

| (\vec{{W o r d}_{Eng}}) |

and

| (\vec{{W o r d}_{Ur-translation}}) |

represent the length of the English word and the translated-Urdu (English) word vector, respectively. The cosine similarity metric permits partial matching, enabling a more accurate similarity assessment.

3.3.2. Word Embedding Using T+MA-Based Approach

Word embedding approaches have been used in a range of applications including word sense disambiguation (Pelevina et al., 2016), short text similarity (Kenter & De Rijke, 2015), plagiarism detection (Ferrero et al., 2017; Khorsi et al., 2018), analyzing survey responses, and verbatim comments (Healy, 2019). For this study, we used the Google pretrained word embedding model (Ghannay et al., 2016). The word embedding approach was utilized to calculate the similarity between English and translated-Urdu word pairs.

The pretrained word embedding model was given English and translated-Urdu (in English) words in the first step. A pretrained Google word embedding model that was trained on a dataset of 100 billion words from Google News was used to identify 300 nearest neighbors for the English and translated-Urdu words. The following phase involved computing the degree of similarity between word embedding vectors from the English and translated-Urdu word pairs that had been reused using two different methods: (a) the sum of word embedding vectors approach and (b) the average of word embedding vectors approach.

The word embedding vectors of the English word were added together in the sum of the word embedding vectors approach to produce a single English word embedding vector. A single translated-Urdu word embedding vector was created by adding the vectors of translated-Urdu words. Following that, the similarity between the (summed) English and translated-Urdu word embedding vectors was calculated using the cosine similarity measure (2) and the Euclidean distance measure (3) (Vijaymeena & Kavitha, 2016), respectively.

The word embedding vectors of the English word were averaged to create a single English word embedding vector for the average word embedding vectors approach. To create a single translated-Urdu word embedding vector, the vectors of translated-Urdu were averaged. Following that, the similarity between the (averaged) English and translated-Urdu word embedding vectors was calculated using the cosine similarity measure (2) (Lahitani et al., 2016) and the Euclidean distance measure (3) (Vijaymeena & Kavitha, 2016), respectively.

\begin{aligned} Euclidean ({Word}_{Eng}, {Word}_{Ur-translation}) \\ = \sqrt{(\vec{{W o r d}_{Eng}} - \vec{{W o r d}_{Ur-translation}}) . (\vec{{W o r d}_{Eng}} - \vec{{W o r d}_{Ur-translation}})} \end{aligned}

(3)

where

(\vec{{W o r d}_{Eng}})

and

(\vec{{W o r d}_{Ur-translation}})

represents the English and translated-Urdu words, respectively. The shortest distance between the English and translated-Urdu word pairs is measured using the Euclidean distance.

3.4. Feature Fusion Approaches

To carry out the detailed analysis, we used different combinations to obtain good performance.

For feature fusion approach 1, we combined the best-performing crosslingual SBERT transformer (LaBSE) with the best baseline approach using Bing translator (Fatima et al., 2021b). We averaged both scores.

For feature fusion approach 2, we combined the best-performing crosslingual SBERT transformer (LaBSE) with the best baseline approach using Google translator (Fatima et al., 2021b). We averaged both scores.

For feature fusion approach 3, we combined the best-performing crosslingual SBERT transformer (LaBSE) with the best baseline approach using both Google translator and Bing translator (Fatima et al., 2021b). We averaged all scores.

For feature fusion approach 4, we combined the best-performing crosslingual SBERT transformer (LaBSE) with the best baseline approach using Bing translator (Fatima et al., 2021b). We also combined the best-performing two SBERT transformers using the T+MA approach (LaBSE and distiluse-base-multilingual-cased-v2) and averaged all scores.

For feature fusion approach 5, we combined the best-performing crosslingual SBERT transformer (LaBSE) with the best baseline approach using Google translator (Fatima et al., 2021b). We also combined the best-performing two SBERT transformers using the T+MA approach (LaBSE and distiluse-base-multilingual-cased-v2) and averaged all scores.

For feature fusion approach 6, we combined the best-performing crosslingual SBERT transformer (LaBSE) with the best baseline approach using both Bing and Google translators (Fatima et al., 2021b). We also combined the best-performing two SBERT transformers using the T+MA approach (LaBSE and distiluse-base-multilingual-cased-v2) and averaged all scores.

4. Experimental Setup

This section presents the experimental setup, including the corpus, applied approaches used to measure cross in the English–Urdu language pair, and evaluation measures.

4.1. Corpus

This study used the USWS-19 corpus constructed by Fatima et al. (2021a) to evaluate the proposed approaches. The corpus contains 518 word pairs. The corpus is created for the crosslingual English–Urdu SWS task. To construct the USWS-19 corpus, source data was obtained from the SemEval-500 corpus.¹⁷ Two popular translators (Google and Bing) were used to translate the English word pairs of the source corpus (SemEval-500 corpus) into Urdu word pairs (USWS-19 corpus).

Following the standard practice of research, each word pair was annotated by 12 native Urdu speakers on a scale of 0–4. A score of 0 represents dissimilar and 4 represents very similar. The annotators’ qualifications varied from graduate students to doctorate professors. Furthermore, the corpus contains 676 nouns, 11 verbs, 35 adjectives, one adverb, 185 multiword expressions, and 128 named entities. The corpus is freely and publicly available for research purposes under the Creative Commons License.¹⁸

4.2. Approaches

A range of transformer-based approaches was applied to the corpus. Table 1 shows all approaches that were applied to the CL-SWS corpus. Figure 1 shows the proposed model on which we obtained the highest results. This is the feature fusion approach. The proposed model is based on the combination of different features coming from different approaches. It can be noted from the figure (Figure 1) that to achieve the top results, the model combined the features from five types of approaches. These approaches include (a) cross-lingual LaBSE transformer, (b) LaBSE transformer with T+MA, (c) best baseline approach with Bing translation, (d) Universal Sentence Encoder transformer with T+MA, and (e) LLMs).

Table 1.
Applied Approaches.

Experiment Approaches

Approach No. 1: Cross-lingual sentence transformer-based approaches

Exp 1.1 SBERT using “distilbert-multilingual-nli-stsb-quora-ranking”

Exp 1.2 SBERT using “LaBSE”

Exp 1.3 SBERT using “quora-distilbert-multilingual”

Exp 1.4 SBERT using “bert-large-nli-mean-tokens”

Exp 1.5 SBERT using “roberta-large-nli-mean-tokens”

Exp 1.6 SBERT using “stsb-xlm-r-multilingual”

Exp 1.7 SBERT using “distiluse-base-multilingual-cased-v2”

Exp 1.8 SBERT using “bert-base-nli-stsb-mean-tokens”

Exp 1.9 SBERT using “bert-large-nli-stsb-mean-tokens”

Exp 1.10 SBERT using ‘xlm-r-100langs-bert-base-nli-mean-tokens”

Exp 1.11 SBERT using “xlm-r-100langs-bert-base-nli-stsb-mean-tokens”

Exp 1.12 SBERT using “average-of-cross-lingual sentence transformers”

Approach No. 2: Sentence transformers with T+MA-based approaches using (Bing translator)

Exp 2.1 SBERT using “allenai-specter”

Exp 2.2 SBERT using “bert-large-nli-stsb-mean-tokens”

Exp 2.3 SBERT using “distiluse-base-multilingual-cased-v2”

Exp 2.4 SBERT using “stsb-bert-base”

Exp 2.5 SBERT using ‘stsb-bert-large”

Exp 2.6 SBERT using “stsb-distilroberta-base-v2”

Exp 2.7 SBERT using “stsb-roberta-base”

Exp 2.8 SBERT using “LaBSE”

Exp 2.9 SBERT using ‘stsb-roberta-large”

Exp 2.10 SBERT using‘ ‘xlm-r-large-en-ko-nli-ststb”

Exp 2.11 SBERT using “average-of-all”

Approach No. 3: Feature fusion approach

Exp 3.1 Best SBERT + BEST Baseline (with Bing translator)

Exp 3.2 Best SBERT + BEST Baseline (with Google translator)

Exp 3.3 Best SBERT + BEST Baseline (both translators)

Exp 3.4 Best-Two-SBERT (using T+MA) + BEST Baseline (with Bing translator) + Best cross-lingual SBERT

Exp 3.5 Best-Two-SBERT (using T+MA) + BEST Baseline (with Google translator) + Best cross-lingual SBERT

Exp 3.6 Best-Two-SBERT (using T+MA) + BEST Baseline (with Both translators) + Best cross-lingual SBERT

Approach No. 4: Word Embedding with T+MA-based approaches (Bing translator)

Exp 4.1 Word embedding (mean vectors) using Cosine

Exp 4.2 Word embedding (summed vectors) using Cosine

Exp 4.3 Word embedding (mean vectors) using Euclidean

Exp 4.4 Word embedding (summed vectors) using Euclidean

Exp 4.5 Taking average of word embedding-based features

Approach No. 5: Large language model

Exp 5.1 XLM-Roberta on (Bing Translated) corpus

Exp 5.2 Llama on (Bing Translated) corpus

Exp 5.3 XLM-Roberta on (Google Translated) corpus

Exp 5.4 Llama on (Google Translated) corpus

Exp 5.5 XLM-Roberta on (Bing Translated) corpus

Exp 5.6 Llama on Cross-lingual corpus

Experiment	Approaches
Approach No. 1: Cross-lingual sentence transformer-based approaches
Exp 1.1	SBERT using “distilbert-multilingual-nli-stsb-quora-ranking”
Exp 1.2	SBERT using “LaBSE”
Exp 1.3	SBERT using “quora-distilbert-multilingual”
Exp 1.4	SBERT using “bert-large-nli-mean-tokens”
Exp 1.5	SBERT using “roberta-large-nli-mean-tokens”
Exp 1.6	SBERT using “stsb-xlm-r-multilingual”
Exp 1.7	SBERT using “distiluse-base-multilingual-cased-v2”
Exp 1.8	SBERT using “bert-base-nli-stsb-mean-tokens”
Exp 1.9	SBERT using “bert-large-nli-stsb-mean-tokens”
Exp 1.10	SBERT using ‘xlm-r-100langs-bert-base-nli-mean-tokens”
Exp 1.11	SBERT using “xlm-r-100langs-bert-base-nli-stsb-mean-tokens”
Exp 1.12	SBERT using “average-of-cross-lingual sentence transformers”
Approach No. 2: Sentence transformers with T+MA-based approaches using (Bing translator)
Exp 2.1	SBERT using “allenai-specter”
Exp 2.2	SBERT using “bert-large-nli-stsb-mean-tokens”
Exp 2.3	SBERT using “distiluse-base-multilingual-cased-v2”
Exp 2.4	SBERT using “stsb-bert-base”
Exp 2.5	SBERT using ‘stsb-bert-large”
Exp 2.6	SBERT using “stsb-distilroberta-base-v2”
Exp 2.7	SBERT using “stsb-roberta-base”
Exp 2.8	SBERT using “LaBSE”
Exp 2.9	SBERT using ‘stsb-roberta-large”
Exp 2.10	SBERT using‘ ‘xlm-r-large-en-ko-nli-ststb”
Exp 2.11	SBERT using “average-of-all”
Approach No. 3: Feature fusion approach
Exp 3.1	Best SBERT + BEST Baseline (with Bing translator)
Exp 3.2	Best SBERT + BEST Baseline (with Google translator)
Exp 3.3	Best SBERT + BEST Baseline (both translators)
Exp 3.4	Best-Two-SBERT (using T+MA) + BEST Baseline (with Bing translator) + Best cross-lingual SBERT
Exp 3.5	Best-Two-SBERT (using T+MA) + BEST Baseline (with Google translator) + Best cross-lingual SBERT
Exp 3.6	Best-Two-SBERT (using T+MA) + BEST Baseline (with Both translators) + Best cross-lingual SBERT
Approach No. 4: Word Embedding with T+MA-based approaches (Bing translator)
Exp 4.1	Word embedding (mean vectors) using Cosine
Exp 4.2	Word embedding (summed vectors) using Cosine
Exp 4.3	Word embedding (mean vectors) using Euclidean
Exp 4.4	Word embedding (summed vectors) using Euclidean
Exp 4.5	Taking average of word embedding-based features
Approach No. 5: Large language model
Exp 5.1	XLM-Roberta on (Bing Translated) corpus
Exp 5.2	Llama on (Bing Translated) corpus
Exp 5.3	XLM-Roberta on (Google Translated) corpus
Exp 5.4	Llama on (Google Translated) corpus
Exp 5.5	XLM-Roberta on (Bing Translated) corpus
Exp 5.6	Llama on Cross-lingual corpus

Note. SBERT = sentence-bidirectional encoder representations from transformers; T+MA = translation plus monolingual analysis.

Figure 1.

Proposed model.

4.3. Evaluation Measure

The evaluation of proposed approaches was carried out using Pearson correlation (Camacho-Collados et al., 2017). The formula to compute Pearson correlation is given in equation (4).

r = \frac{Σ (x_{i} - \bar{x}) (y_{i} - \bar{y})}{\sqrt{(x_{i} - \bar{x})^{2} (y_{i} - \bar{y})^{2}}}

(4)

In the given equation

r

represents Pearson correlation,

x_{i}

represents the values of the

x

-variable in a sample, and

\bar{x}

represents the mean of the values of the

x

-variable. The variable

y_{i}

represents the values of the

y

-variable in a sample

\bar{y}

represents the mean of the values of the

y

-variable.

5. Results and Analysis

Table 2 shows Pearson scores obtained on the USWS-19 corpus using various transformer-based approaches for CL-SWS for the English–Urdu pair. In this table, the first two rows represent the baseline results achieved using two translators, that is, Google and Bing (Fatima et al., 2021a). The results show that Bing translator exhibits more promising results.

Table 2.
Comparison of Proposed Results With Baseline Results.

Baseline results Pearson correlation

Bing translator (baseline) 0.719

Google translator (baseline) 0.706

Approaches used in this study

Approach No. 1: Cross-lingual sentence transformer-based approaches

Exp 1.1: SBERT using “distilbert-multilingual-nli-stsb-quora-ranking” 0.582

Exp 1.2: SBERT using “LaBSE” 0.736

Exp 1.3: SBERT using “quora-distilbert-multilingual” 0.582

Exp 1.4: SBERT using “bert-large-nli-mean-tokens” 0.024

Exp 1.5: SBERT using “roberta-large-nli-mean-tokens” 0.002

Exp 1.6: SBERT using “stsb-xlm-r-multilingual” 0.516

Exp 1.7: SBERT using “distiluse-base-multilingual-cased-v2” 0.571

Exp 1.8: SBERT using “bert-base-nli-stsb-mean-tokens” 0.052

Exp 1.9: SBERT using “bert-large-nli-stsb-mean-tokens” 0.069

Exp 1.10: SBERT using “xlm-r-100langs-bert-base-nli-mean-tokens” 0.461

Exp 1.11: SBERT using “xlm-r-100langs-bert-base-nli-stsb-mean-tokens” 0.516

Exp 1.12: SBERT using “average-of-cross-lingual sentence transformers” 0.623

Approach No. 2: Sentence transformers with T+MA-based approaches (Bing translator)

Exp 2.1: SBERT using “allenai-specter” 0.615

Exp 2.2: SBERT using “bert-large-nli-stsb-mean-tokens” 0.694

Exp 2.3: SBERT using “distiluse-base-multilingual-cased-v2” 0.717

Exp 2.4: SBERT using “stsb-bert-base” 0.682

Exp 2.5: SBERT using “stsb-bert-large” 0.694

Exp 2.6: SBERT using “stsb-distilroberta-base-v2” 0.669

Exp 2.7: SBERT using “stsb-roberta-base” 0.685

Exp 2.8: SBERT using “LaBSE” 0.721

Exp 2.9: SBERT using “stsb-roberta-large” 0.696

Exp 2.10: SBERT using “xlm-r-large-en-ko-nli-ststb” 0.702

Exp 2.11: SBERT using “average-of-all” 0.725

Approach No. 3: Feature Fusion approach

Exp 3.1: Best SBERT + BEST Baseline (with Bing translator) 0.770

Exp 3.2: Best SBERT + BEST Baseline (with Google translator) 0.761

Exp 3.3: Best SBERT + BEST Baseline (both translators) 0.761

Exp 3.4: Best-Two-SBERT (using T+MA) + BEST Baseline (with Bing translator) + Best cross-lingual SBERT 0.788

Exp 3.5: Best-Two-SBERT (using T+MA) + BEST Baseline (with Google translator) + Best cross-lingual SBERT 0.770

Exp 3.6: Best-Two-SBERT (using T+MA) + BEST Baseline (with Both translator) + Best cross-lingual SBERT 0.777

Approach No. 4: Word Embedding with T+MA-based approaches (Bing translator)

Exp 4.1: WE-Cosine-Avg 0.477

Exp 4.2: WE-Cosine-Sum 0.025

Exp 4.3: WE-Euclidean-Avg 0.414

Exp 4.4: WE-Euclidean-Sum 0.414

Exp 4.5: Average-of-all-WE features 0.452

Approach No. 5: Large language model

Exp 5.1: XLM-Roberta on (Bing Translated) corpus 0.288

Exp 5.2: Llama on (Bing Translated) corpus 0.255

Exp 5.3: XLM-Roberta on (Google Translated) corpus 0.319

Exp 5.4: Llama on (Google Translated) corpus 0.290

Exp 5.5: XLM-Roberta on (Bing Translated) corpus 0.134

Exp 5.6: Llama on Cross-lingual corpus 0.132

Baseline results	Pearson correlation
Bing translator (baseline)	0.719
Google translator (baseline)	0.706
Approaches used in this study
Approach No. 1: Cross-lingual sentence transformer-based approaches
Exp 1.1: SBERT using “distilbert-multilingual-nli-stsb-quora-ranking”	0.582
Exp 1.2: SBERT using “LaBSE”	0.736
Exp 1.3: SBERT using “quora-distilbert-multilingual”	0.582
Exp 1.4: SBERT using “bert-large-nli-mean-tokens”	0.024
Exp 1.5: SBERT using “roberta-large-nli-mean-tokens”	0.002
Exp 1.6: SBERT using “stsb-xlm-r-multilingual”	0.516
Exp 1.7: SBERT using “distiluse-base-multilingual-cased-v2”	0.571
Exp 1.8: SBERT using “bert-base-nli-stsb-mean-tokens”	0.052
Exp 1.9: SBERT using “bert-large-nli-stsb-mean-tokens”	0.069
Exp 1.10: SBERT using “xlm-r-100langs-bert-base-nli-mean-tokens”	0.461
Exp 1.11: SBERT using “xlm-r-100langs-bert-base-nli-stsb-mean-tokens”	0.516
Exp 1.12: SBERT using “average-of-cross-lingual sentence transformers”	0.623
Approach No. 2: Sentence transformers with T+MA-based approaches (Bing translator)
Exp 2.1: SBERT using “allenai-specter”	0.615
Exp 2.2: SBERT using “bert-large-nli-stsb-mean-tokens”	0.694
Exp 2.3: SBERT using “distiluse-base-multilingual-cased-v2”	0.717
Exp 2.4: SBERT using “stsb-bert-base”	0.682
Exp 2.5: SBERT using “stsb-bert-large”	0.694
Exp 2.6: SBERT using “stsb-distilroberta-base-v2”	0.669
Exp 2.7: SBERT using “stsb-roberta-base”	0.685
Exp 2.8: SBERT using “LaBSE”	0.721
Exp 2.9: SBERT using “stsb-roberta-large”	0.696
Exp 2.10: SBERT using “xlm-r-large-en-ko-nli-ststb”	0.702
Exp 2.11: SBERT using “average-of-all”	0.725
Approach No. 3: Feature Fusion approach
Exp 3.1: Best SBERT + BEST Baseline (with Bing translator)	0.770
Exp 3.2: Best SBERT + BEST Baseline (with Google translator)	0.761
Exp 3.3: Best SBERT + BEST Baseline (both translators)	0.761
Exp 3.4: Best-Two-SBERT (using T+MA) + BEST Baseline (with Bing translator) + Best cross-lingual SBERT	0.788
Exp 3.5: Best-Two-SBERT (using T+MA) + BEST Baseline (with Google translator) + Best cross-lingual SBERT	0.770
Exp 3.6: Best-Two-SBERT (using T+MA) + BEST Baseline (with Both translator) + Best cross-lingual SBERT	0.777
Approach No. 4: Word Embedding with T+MA-based approaches (Bing translator)
Exp 4.1: WE-Cosine-Avg	0.477
Exp 4.2: WE-Cosine-Sum	0.025
Exp 4.3: WE-Euclidean-Avg	0.414
Exp 4.4: WE-Euclidean-Sum	0.414
Exp 4.5: Average-of-all-WE features	0.452
Approach No. 5: Large language model
Exp 5.1: XLM-Roberta on (Bing Translated) corpus	0.288
Exp 5.2: Llama on (Bing Translated) corpus	0.255
Exp 5.3: XLM-Roberta on (Google Translated) corpus	0.319
Exp 5.4: Llama on (Google Translated) corpus	0.290
Exp 5.5: XLM-Roberta on (Bing Translated) corpus	0.134
Exp 5.6: Llama on Cross-lingual corpus	0.132

Note. SBERT = sentence-bidirectional encoder representations from transformers; T+MA = translation plus monolingual analysis.

The highest performance was achieved in Experiment 3.4, yielding a Pearson correlation of 0.788. This experiment employed the “Best-Two-SBERT (using T+MA) + BEST Baseline (with Bing translator) + Best Cross-lingual SBERT” methodology. These results surpass the previously reported state-of-the-art baseline on this corpus, which had a Pearson correlation of 0.71. This significant improvement demonstrates that our proposed approach outperforms the current state-of-the-art, making it more effective for CL-SWS tasks involving the English–Urdu language pair on this corpus. The findings suggest that this combination is particularly advantageous for detecting CL-SWS in the English–Urdu language pair.

Among the cross-lingual sentence transformer-based approaches, the best performance was achieved in Experiment 1.2, which utilized the “LaBSE” transformer approach, resulting in a Pearson correlation of 0.736. This outcome surpasses the baseline results. The superior performance of the LaBSE model can be attributed to its support for 109 languages, enabling it to effectively handle translated pairs across multiple languages. Consequently, it extracts high-quality feature vectors for cross-lingual pairs. These findings suggest that this approach is more effective for detecting CL-SWS compared to individual SBERT models and baseline methods.

In word embedding-based approaches, the highest performance was achieved in Experiment 4.5, which utilized the “Average-of-all-WE features” method, yielding a Pearson correlation of 0.452. However, this result is significantly lower than the baseline performance. The suboptimal outcome is attributed to the extraction of low-quality feature vectors for cross-lingual word pairs from the USWS-19 corpus, which hinders effectiveness in cross-lingual semantic similarity tasks. These findings indicate that word embedding-based methods are less effective for detecting CL-SWS compared to other approaches.

XLM-RoBERTa and Llama exhibit lower performance in SWS tasks compared to SBERT transformers and T+MA approaches due to their general-purpose design and training. XLM-RoBERTa, while broad in multilingual coverage, is not specifically fine-tuned for semantic similarity, which limits its ability to capture nuanced relationships. In contrast, SBERT transformers are explicitly optimized for generating high-quality embeddings for semantic similarity, and T+MA methods leverage precise translation and monolingual embeddings for enhanced semantic matching. Similarly, Llama, optimized for text generation, does not align with the specific requirements of semantic similarity tasks. Its embeddings are less effective for this purpose compared to those produced by models that are fine-tuned or adapted specifically for similarity detection.

Within the T+MA-based approaches using sentence transformers, LaBSE demonstrated strong performance with translated data, achieving a Pearson correlation of 0.721. This performance surpasses that of other transformers, including monolingual models specifically designed for English. While these results are marginally better than the baseline, they remain slightly lower than the best-performing crosslingual sentence transformer-based approach.

In the feature fusion approaches, all combinations outperformed the baseline results: -

“Experiment 3.1” combined the best-performing cross-lingual sentence transformer with the baseline feature using Bing translator, resulting in a Pearson correlation of 0.770. This is higher than both the baseline and individual transformer results but lower than the best-performing results. This indicates that this feature fusion approach is effective for CL-SWS detection.

“Experiment 3.2” replaced Bing translator with Google translator in the same fusion approach, yielding a Pearson correlation of 0.761. Although this still outperforms the baseline and individual transformers, it falls short of the best results. The performance decline may be due to the lower translation quality of Google translator.

“Experiment 3.3” added Google translator to the best-performing cross-lingual sentence transformer and Bing translator combination, but the Pearson correlation remained at 0.761. This suggests that adding Google translator did not enhance performance, indicating that this combination is less effective for CL-SWS detection.

“Experiment 3.5” combined the best cross-lingual sentence transformer, Google translator, and the two best-performing T+MA-based sentence transformer features, achieving a Pearson correlation of 0.770. While this surpasses the baseline and individual approaches, it is still lower than the best results. The slightly lower performance compared to Experiment 3.4 may be due to using Google translator instead of Bing translator, highlighting the impact of translation quality. However, incorporating the two top T+MA-based features was beneficial.

“Experiment 3.6” combined the best cross-lingual sentence transformer, both Bing and Google translators, and the two best T+MA-based features, achieving a Pearson correlation of 0.777. Although this result exceeds the baseline and other fusion approaches, it remains slightly below the best performance. This combination demonstrates that feature fusion approaches can effectively enhance CL-SWS detection.

The selection of sentence transformer-based approaches is motivated by their promising performance in related tasks, such as text reuse detection (akin to CL-SWS; Muneer & Nawab, 2021, 2022; Muneer et al., 2023) and paraphrase detection (Reimers & Gurevych, 2019, 2020). Top-performing models such as LaBSE and the Universal Sentence Transformer (distiluse-base-multilingual-cased-v2) have demonstrated their capability to generate high-quality feature vectors for crosslingual pairs. To optimize performance, we averaged the outputs of these top-performing transformers with the best baseline feature (using Bing translator), which has proven effective in enhancing results.

To conclude, the main findings obtained after our extensive experimentation are as follows: First, the best performance is shown by the combination of two best-performing transformers with T+MA. It is also shown by one best-performing cross-lingual sentence transformer and the best baseline feature (using Bing translator). Second, this shows that different SBERT frameworks affect the performance of CL-SWS. Third, the LaBSE framework extracts more qualitative feature vectors for CL-SWS for the English–Urdu language pair. Fourth, the low performance of the T+MA-based transformer compared to the crosslingual sentence transformer-based approach shows that the translation quality is still low.

6. Conclusion

This paper presents a range of transformer-based approaches for CL-SWS for the English–Urdu language pair. This study reports a comparison of transformer-based approaches and feature fusion approaches to CL-SWS for the English–Urdu word pairs. For the SWS task, the USWS-19 corpus was used. Results demonstrate that our proposed model based on feature fusion approaches outperforms previously reported approaches. In the future, we plan to explore word embeddings. We also plan to explore knowledge graphs and stacked auto-encoder approaches for the CL-SWS task to further improve the results.

Footnotes

ORCID iD

Iqra Muneer

Author Contributions

Rao Muhammad Adeel Nawab contributed to Conceptualization. Methodology: Iqra Muneer contributed to the methodology. Iqra Muneer contributed to writing—original draft preparation. Ali Saeed contributed to writing—review and editing. Iqra Muneer and Ali Saeed contributed to the visualization. Rao Muhammad Adeel Nawab contributed to supervision.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Notes

References

Arslan

O. A.

Duman

Erdem

Günyel

Sönmez

Arslan

(2023). Towards Turkish word embeddings: An intrinsic evaluation. In 2023 8th International conference on computer science and engineering (UBMK) (pp. 564–568). IEEE.

Ashraf

Khaleel

(2024). Exploring the absence of Urdu language identity in Pakistani undergraduate education. International Research Journal of Social Sciences and Humanities, 3(1), 1–14. https://irjssh.com/index.php/irjssh/article/view/61/54

Aytar

Shah

Luo

(2008). Utilizing semantic word similarity measures for video retrieval. In 2008 IEEE conference on computer vision and pattern recognition (pp. 1–8). IEEE. https://ieeexplore.ieee.org/document/4587822.

Bowman

S. R.

Angeli

Potts

Manning

C. D.

(2015). A large annotated corpus for learning natural language inference. In Proceedings of the 2015 conference on empirical methods in natural language processing (pp. 632–642). Association for Computational Linguistics. https://doi.org/10.18653/v1/D15-1075; https://www.aclweb.org/anthology/D15-1075

Brans

Bloem

(2024). SimLex-999 for Dutch. In Proceedings of the 2024 joint international conference on computational linguistics, language resources and evaluation (LREC-COLING 2024) (pp. 14832–14845). Language Resources and Evaluation (LREC-COLING 2024). https://aclanthology.org/2024.lrec-main.1292.pdf

Camacho-Collados

Pilehvar

M. T.

Collier

Navigli

(2017). Semeval-2017 task 2: Multilingual and cross-lingual semantic word similarity, association for computational linguistics. In Proceedings of the 11th international workshop on Semantic Evaluations (SemEval-2017) (pp. 15–26). Association for Computational Linguistics. https://aclanthology.org/S17-2002.pdf

Cer

Yang

Kong

S.-y.

Hua

Limtiaco

John

R. S.

Constant

Guajardo-Cespedes

Yuan

Tar

Sung

Y.-H.

Strope

Kurzweil

(2018). Universal sentence encoder for English." In Proceedings of the 2018 conference on empirical methods in natural language processing: System demonstrations (pp. 169–174). Association for Computational Linguistics. https://aclanthology.org/D18-2029.pdf

Chahal

Singh

(2021). An efficient approach for ranking of semantic web documents by computing semantic similarity and using HCS clustering. International Journal of Semiotics and Visual Rhetoric (IJSVR), 5(1), 45–56. https://www.igi-global.com/gateway/article/272968

Conneau

Kiela

Schwenk

Barrault

Bordes

(2017). Supervised learning of universal sentence representations from natural language inference data. In Proceedings of the 2017 conference on empirical methods in natural language processing (pp. 670–680). Association for Computational Linguistics. https://doi.org/10.18653/v1/D17-1070. https://www.aclweb.org/anthology/D17-1070

10.

Dai

Huang

H.-Y.

(2011). An English-Chinese cross-lingual word semantic similarity measure exploring attributes and relations. In Proceedings of the 25th Pacific Asia conference on language, information and computation (pp. 467–476). https://www.jsoftware.us/vol10/03-jsw140201-1104.pdf

11.

Darányi

Czvetkó

Kummer

Ruppert

Abonyi

(2023). Multi-objective hierarchical clustering for tool assignment. CIRP Journal of Manufacturing Science and Technology, 42, 47–54. https://doi.org/10.1016/j.cirpj.2023.02.002

12.

Daud

Khan

Che

(2017). Urdu language processing: A survey. Artificial Intelligence Review, 47(3), 279–311. https://doi.org/10.1007/s10462-016-9482-x

13.

De Maio

Fenza

Loia

Senatore

(2009). Towards an automatic fuzzy ontology generation. In 2009 IEEE international conference on fuzzy systems (pp. 1044–1049). IEEE. https://ieeexplore.ieee.org/document/5277226

14.

Devlin

Chang

M.-W.

Lee

Toutanova

(2019). BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of NAACL-HLT 2019 (pp. 4171–4186). Association for Computational Linguistics. https://aclanthology.org/N19-1423.pdf

15.

Duyck

Brysbaert

(2008). Semantic access in number word translation: The role of crosslingual lexical similarity. Experimental Psychology, 55(2), 102–112. https://doi.org/10.1027/1618-3169.55.2.102

16.

Ech-Charfi

(2023). Word frequency and lexical coverage in English and Arabic. Journal of Applied Language and Culture Studies, 6(3), 1–19. https://api.semanticscholar.org/CorpusID:266562604

17.

Fatima

Nawab

R. M. A.

Khan

M. S.

Saeed

(2021a). Developing a cross-lingual semantic word similarity corpus for English–Urdu language pair (to appear). ACM Transactions on Asian and Low-Resource Language Information Processing (TALLIP), 21, 1–14.

18.

Fatima

Nawab

R. M. A.

Khan

M. S.

Saeed

(2021b). Developing a cross-lingual semantic word similarity corpus for English–Urdu language pair. Transactions on Asian and Low-Resource Language Information Processing, 21(2), 1–16. https://doi.org/10.1145/3472618

19.

Feng

Yang

Cer

Arivazhagan

Wang

(2020). Language-agnostic BERT sentence embedding. In Proceedings of the 60th annual meeting of the association for computational linguistics (pp. 878–891). Association for Computational Linguistics. https://aclanthology.org/2022.acl-long.62/

20.

Fenogenova

(2021). Russian paraphrasers: Paraphrase with transformers. In Proceedings of the 8th workshop on Balto-Slavic natural language processing (pp. 11–19). Association for Computational Linguistics. https://aclanthology.org/2021.bsnlp-1

21.

Ferrero

Besacier

Schwab

Agnes

(2017). Using word embedding for cross-language plagiarism detection. In Proceedings of the 15th conference of the European chapter of the association for computational linguistics: Volume 2, short papers (pp. 415–421). Association for Computational Linguistics. https://www.aclweb.org/anthology/E17-2066

22.

Finkelstein

Gabrilovich

Matias

Rivlin

Solan

Wolfman

Ruppin

(2001). Placing search in context: The concept revisited. In Proceedings of the 10th international conference on World Wide Web (pp. 406–414). ACM. https://doi.org/10.1145/371920.372094

23.

Gabrilovich

Markovitch

(2007). Computing semantic relatedness using Wikipedia-based explicit semantic analysis. In IJcAI (Vol. 7, pp. 1606–1611). https://doi.org/10.5555/2770959.2771060

24.

Gerz

Vulić

Hill

Reichart

Korhonen

(2016). Simverb-3500: A large-scale evaluation set of verb similarity. arXiv preprint arXiv:1608.00869.

25.

Ghannay

Favre

Esteve

Camelin

(2016), Word embedding evaluation and combination. In Proceedings of the tenth international conference on language resources and evaluation (LREC’16) (pp. 300–305). European Language Resources Association (ELRA). https://aclanthology.org/L16-1046/

26.

Guo

Mirzaalian

Sabir

Jaiswal

Abd-Almageed

. (2020). CORD19STS: COVID-19 semantic textual similarity dataset. https://arxiv.org/abs/2007.02461

27.

Hassan

Mihalcea

(2009), Cross-lingual semantic relatedness using encyclopedic knowledge. In Proceedings of the 2009 conference on empirical methods in natural language processing (pp. 1192–1201). Association for Computational Linguistics. https://dl.acm.org/doi/pdf/10.5555/1699648.1699665

28.

Ning

Roth

. (2020). QuASE: Question-answer driven sentence encoding. In Proceedings of the 2009 conference on empirical methods in natural language processing. Association for Computational Linguistics. https://aclanthology.org/D09-1124

29.

Healy

(2019). Corpora in word embedding training and application.

30.

Hliaoutakis

Varelas

Voutsakis

Petrakis

E. G.

Milios

(2006). Information retrieval by semantic similarity. International Journal on Semantic Web and Information Systems (IJSWIS), 2(3), 55–73. https://www.intelligence.tuc.gr/~angelos/ijswis.pdf

31.

Jokar

(2024). The acquisition of functional morphology, semantics-morphology and syntax by Persian learners of English the bottleneck hypothesis in L2 acquisition [Master’s thesis]. UiT Norges arktiske universitet.

32.

Karov

Edelman

(1998). Similarity-based word sense disambiguation. Computational Linguistics, 24(1), 41–59. https://aclanthology.org/P97-1008.pdf

33.

Liu

Zhu

Huang

(2020). Sentilare: Linguistic knowledge enhanced language representation for sentiment analysis. In Proceedings of the 2020 conference on empirical methods in natural language processing (EMNLP) (pp. 6975–6988). Association for Computational Linguistics. https://doi.org/10.18653/v1/2020.emnlp-main.567

34.

Kenter

De Rijke

(2015). Short text similarity with word embeddings. In Proceedings of the 24th ACM international on conference on information and knowledge management (pp. 1411–1420). Association for Computing Machinery. https://doi.org/10.1145/2806416.2806475

35.

Khattak

Asghar

M. Z.

Saeed

Hameed

I. A.

Hassan

S. A.

Ahmad

(2021). A survey on sentiment analysis in Urdu: A resource-poor language. Egyptian Informatics Journal, 22(1), 53–74. https://doi.org/10.1016/j.eij.2020.04.003

36.

Khorsi

Cherroun

Schwab

et al. (2018). 2L-APD: A two-level plagiarism detection system for Arabic documents. Cybernetics and Information Technologies, 18(1), 124–138. https://doi.org/10.2478/cait-2018-0011

37.

Lahitani

A. R.

Permanasari

A. E.

Setiawan

N. A.

(2016). Cosine similarity to determine similarity measure: Study case in online essay assessment. In 2016 4th international conference on cyber and IT service management (pp. 1–6). IEEE. https://ieeexplore.ieee.org/document/7577578

38.

C. H.

Huang

J. X.

(2012). Spam filtering using semantic similarity approach and adaptive BPNN. Neurocomputing, 92, 88–97. https://doi.org/10.1016/j.neucom.2011.09.036

39.

Liu

Ott

Goyal

Joshi

Chen

Levy

Lewis

Zettlemoyer

Stoyanov

(2019). RoBERTa: A robustly optimized BERT pretraining approach. https://doi.org/10.48550/arXiv.1907.11692

40.

Zhao

Yang

(2020). Tibetan–Chinese cross-lingual word embeddings based on MUSE. In Journal of physics: Conference series (Vol. 1453, p. 012043). IOP Publishing. https://doi.org/10.1088/1742-6596/1453/1/012043

41.

Massidda

(2020). rmassidda@ DaDoEval: Document dating using sentence embeddings at EVALITA 2020. In Proceedings of seventh evaluation campaign of natural language processing and speech tools for Italian. Final Workshop (EVALITA 2020), Online. CEUR. org. https://ceur-ws.org/Vol-2765/paper106.pdf

42.

Mehmood

(2021). On multi-domain sentence level sentiment analysis for Roman Urdu [PhD thesis]. University of New South Wales Canberra Australia. https://doi.org/10.26190/unsworks/22403

43.

Miller

G. A.

Charles

W. G.

(1991). Contextual correlates of semantic similarity. Language and Cognitive Processes, 6(1), 1–28. https://doi.org/10.1080/01690969108406936

44.

Minaee

Kalchbrenner

Cambria

Nikzad

Chenaghlu

Gao

(2021). Deep learning based text classification: A comprehensive review. ACM Computing Surveys (CSUR), 54(3), 1–40. https://doi.org/10.1145/3439726

45.

Mori

Yamane

Mukuta

Harada

(2020). Finding and generating a missing part for story completion. In Proceedings of the 4th joint SIGHUM workshop on computational linguistics for cultural heritage, social sciences, humanities and literature (pp. 156–166). International Committee on Computational Linguistics. https://aclanthology.org/2020.latechclfl-1.19

46.

Muneer

Fatima

Khan

M. S.

Adeel Nawab

R. M.

Saeed

(2023). Developing a large benchmark corpus for Urdu semantic word similarity. ACM Transactions on Asian and Low-Resource Language Information Processing, 22(3), 1–19.

47.

Muneer

Nawab

R. M. A.

(2021). Cross-lingual text reuse detection using translation plus monolingual analysis for English–Urdu language pair. Transactions on Asian and Low-Resource Language Information Processing, 21(2). https://doi.org/10.1145/3473331

48.

Muneer

Nawab

R. M. A.

(2022). Cross-lingual text reuse detection at sentence level for English–Urdu language pair. Computer Speech & Language, 101381. https://doi.org/10.1016/j.csl.2022.101381; https://www.sciencedirect.com/science/article/pii/S0885230822000225

49.

Naseer

Hussain

(2009). Supervised word sense disambiguation for Urdu using Bayesian classification. Center for Research in Urdu Language Processing. https://cle.org.pk/clt10/papers/Supervised%20Word%20Sense%20Disambiguation%20for%20Urdu%20Using%20Bayesian.pdf

50.

Naumov

Yaroslavtsev

Avdiukhin

. (2020). Objective-Based hierarchical clustering of deep embedding vectors. https://doi.org/10.48550/arXiv.2012.08466.

51.

Navrozidis

Jansson

(2020). Using natural language processing to identify similar patent documents. LU-CS-EX. https://doi.org/10.1016/j.aei.2019.101027

52.

Niewiarowski

Plichta

(2023). Matrix similarity analysis of texts written in Romanian and Spanish. In ECMS (pp. 507–512). https://www.scs-europe.net/dlib/2023/2023-0507.html

53.

Pelevina

Arefiev

Biemann

Panchenko

(2016). Making sense of word embeddings. In Proceedings of the 1st workshop on representation learning for NLP (pp. 174–183). Association for Computational Linguistics. https://doi.org/10.18653/v1/W16-1620; https://www.aclweb.org/anthology/W16-1620

54.

Peters

Neumann

Iyyer

Gardner

Clark

Lee

Zettlemoyer

(2018). Deep contextualized word representations. In Proceedings of the 2018 conference of the North American chapter of the association for computational linguistics: Human language technologies, volume 1 (Long Papers) (pp. 2227–2237). Association for Computational Linguistics. https://doi.org/10.18653/v1/N18-1202

55.

Rahman

(2004). Language policy and localization in Pakistan: Proposal for a paradigmatic shift. In SCALLA conference on computational linguistics (Vol. 99, pp. 1–19). https://apnaorg.com/research-papers-pdf/rahman-1.pdf

56.

Rei

Stewart

Farinha

A. C.

Lavie

(2020). COMET: A neural framework for MT evaluation. https://doi.org/10.18653/v1/2020.emnlp-main.213

57.

Reimers

Gurevych

(2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks. https://arxiv.org/pdf/1908.10084

58.

Reimers

Gurevych

(2020). Making monolingual sentence embeddings multilingual using knowledge distillation. https://doi.org/10.18653/v1/2020.emnlp-main.365

59.

Sun

Shaib

Wallace

B. C

. (2024). Evaluating the zero-shot robustness of instruction-tuned language models. In The twelfth international conference on learning representations. https://doi.org/10.48550/arXiv.2306.11270

60.

Tan

A. W.

Marchman

V. A.

Frank

M. C.

(2024). The role of translation equivalents in bilingual word learning. Developmental Science, 27(4), e13476. https://doi.org/10.1111/desc.13476

61.

Tetzloff

(2024). Variability in English–Spanish bilingual phonotactics: A look at sC clusters. Journal of Monolingual & Bilingual Speech (JMBS), 6(1), 1–20. https://utppublishing.com/journal/JMBS/article/view/24641

62.

Thakur

Reimers

Daxenberger

Gurevych

. (2021). Augmented SBERT: Data augmentation method for improving biencoders for pairwise sentence scoring tasks. In Proceedings of the 2021 conference of the North American chapter of the association for computational linguistics: Human language technologies (pp. 296–310). Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.naacl-main.28

63.

Vijaymeena

Kavitha

(2016). A survey on similarity measures in text mining. Machine Learning and Applications: An International Journal, 3(2), 19–28. https://doi.org/10.5121/mlaij.2016.3103

64.

Vulic

Moens

M.-F.

(2013). Cross-lingual semantic similarity of words as the similarity of their semantic word responses. In Proceedings of the 2013 conference of the North American chapter of the association for computational linguistics: Human language technologies (NAACL-HLT 2013) (pp. 106–116). ACL. https://aclanthology.org/N13-1011

65.

Williams

Nangia

Bowman

(2018). A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 conference of the North American chapter of the association for computational linguistics: Human language technologies, volume 1 (Long Papers) (pp. 1112–1122). Association for Computational Linguistics. https://doi.org/10.18653/v1/N18-1101

66.

Worth

P. J.

(2023). Word embeddings and semantic spaces in natural language processing. International Journal of Intelligence Science, 13(1), 1–21. https://doi.org/10.4236/ijis.2023.131001

67.

Xia

Zhao

Yao

Jin

. (2011). Measuring Chinese-English cross-lingual word similarity with HowNet and parallel corpus. In International conference on intelligent text processing and computational linguistics (pp. 221–233). Springer. https://doi.org/10.1007/978-3-642-19437-5_18

68.

Yadav

Tang

Srinivasan

(2024). PAG-LLM: Paraphrase and aggregate with large language models for minimizing intent classification errors. In Proceedings of the 47th international ACM SIGIR conference on research and development in information retrieval (pp. 2569–2573). Association for Computing Machinery. https://doi.org/10.1145/3626772.3657959

69.

Yates

Nogueira

Lin

(2021). Pretrained transformers for text ranking: BERT and beyond. In Proceedings of the 14th ACM international conference on web search and data mining (pp. 1154–1156). arXiv. https://doi.org/10.48550/arXiv.2010.06467

70.

Zhu

Iglesias

C. A.

(2018). Exploiting semantic similarity for named entity disambiguation in knowledge graphs. Expert Systems with Applications, 101, 8–24. https://doi.org/10.1016/j.eswa.2018.02.011

Cross-Lingual English–Urdu Semantic Word Similarity Using Sentence Transformers

Abstract

Keywords

1. Introduction

2. Related Work

3. Approaches for CL-SWS

3.1. Cross-Lingual Sentence Transformer-Based Approaches

3.3. Translation Plus Monolingual Analysis (T+MA)

3.3.1. Sentence Transformer Using T+MA-Based Approach

4. Experimental Setup

4.1. Corpus

4.2. Approaches

Footnotes

ORCID iD

Author Contributions

Funding

Declaration of Conflicting Interests

Notes

References