An effective cybernated word embedding system for analysis and language identification in code-mixed social media text

Abstract

Social media platforms are now widely used to express opinions and interests, and much of the text posted on them is code-mixed, i.e., it mixes two or more languages. This paper applies the code-mixed index to Indian social media text and compares the complexity of identifying language at the word level using a bidirectional Long Short-Term Memory (BiLSTM) model. The major contribution of the work is a technique for identifying the language of Hindi-English code-mixed data from three social media platforms: Facebook, Twitter, and WhatsApp. We propose a deep learning framework based on the continuous bag-of-words (cBoW) and skip-gram models that predicts the language of origin of each word from the words that precede it in the sequence. With the system's context capture module, the word embedding model achieves better accuracy than the character embedding model.
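The code-mixed index mentioned in the abstract is commonly called the Code-Mixing Index (CMI). The abstract does not give its formula, so the sketch below assumes the widely cited utterance-level definition attributed to Das and Gambäck: CMI = 100 × (1 − max(w_i) / (n − u)), where n is the number of tokens, u the number of language-independent tokens, and w_i the number of tokens tagged with language i. The function name and the tag labels ("hi", "en", "univ") are hypothetical and chosen only for illustration.

```python
from collections import Counter


def code_mixing_index(tags, neutral_tags=("univ",)):
    """Return the utterance-level CMI (0 to 100) from word-level language tags.

    Assumed definition: 100 * (1 - max(w_i) / (n - u)), or 0 when every
    token is language-independent (n == u).
    """
    n = len(tags)
    # Count only tokens that carry a language label (hypothetical tag set).
    lang_counts = Counter(t for t in tags if t not in neutral_tags)
    u = n - sum(lang_counts.values())
    if n == u:
        # No language-specific tokens, so there is nothing to mix.
        return 0.0
    return 100.0 * (1.0 - max(lang_counts.values()) / (n - u))


# 3 Hindi, 2 English, 1 neutral token: 100 * (1 - 3/5) = 40
example_tags = ["hi", "hi", "en", "univ", "en", "hi"]
print(code_mixing_index(example_tags))
```

A monolingual utterance scores 0 under this definition, and the score grows as the dominant language accounts for a smaller share of the language-specific tokens.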

Keywords

Language identification, transliteration, character embedding, word embedding, Natural Language Processing, cBoW, skip-gram
