Co-occurrence based word representation for extracting named entities in Tamil tweets

Abstract

Social media is a vibrant space where millions of individuals interact and share their views. Processing social media text in Indian languages is challenging because these languages are morphologically rich. Once such unstructured text has been converted into a consistent format, features are extracted from it. In large corpora, information units, i.e. entities, carry the core meaning of the content. The main aim of the proposed system is to recognise and extract named entities in Twitter text. The system relies on co-occurrence based word embedding models to extract features for the words in the dataset, which consists of Tamil-language text collected from Twitter. To enhance performance, tri-gram features are extracted from the word embedding vectors, and the systems are then trained on these N-gram embedding features together with the named entity tags. The systems are implemented using a machine learning classifier, the Support Vector Machine (SVM). Comparing the proposed systems, GloVe embedding gives the better result, with an accuracy of 96.93%, whereas word2vec embedding reaches an accuracy of 84.53%. The higher accuracy of the GloVe-based system may be due to the important role that GloVe's co-occurrence information plays in recognising entities.
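The two feature ideas named in the abstract, co-occurrence statistics and tri-gram embedding features, can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes that a tri-gram feature for a token is the concatenation of the embedding vectors of the previous, current and next tokens, with zero vectors as padding at tweet boundaries, and that co-occurrence is counted symmetrically within a fixed window. The function names, the window size, the two-dimensional toy vectors and the transliterated example tweet are all hypothetical. Training the embeddings and the SVM classifier is left out.

```python
from collections import Counter


def cooccurrence_counts(sentences, window=2):
    """Count symmetric word co-occurrences within a fixed window.

    Counts of this kind are the raw statistic that co-occurrence based
    embeddings such as GloVe are fitted to. The window size is an assumption.
    """
    counts = Counter()
    for tokens in sentences:
        for i, word in enumerate(tokens):
            # Scan the right-hand context only and record both directions.
            for j in range(i + 1, min(i + 1 + window, len(tokens))):
                counts[(word, tokens[j])] += 1
                counts[(tokens[j], word)] += 1
    return counts


def trigram_features(tokens, embeddings, dim):
    """Build one feature vector per token from its tri-gram context.

    Assumption: the feature is [previous vector + current vector + next vector].
    Out-of-range positions and unknown words map to a zero vector.
    """
    pad = [0.0] * dim

    def lookup(i):
        if 0 <= i < len(tokens):
            return embeddings.get(tokens[i], pad)
        return pad

    return [lookup(i - 1) + lookup(i) + lookup(i + 1) for i in range(len(tokens))]


# Hypothetical toy data: transliterated tokens with made-up 2-d vectors.
toy_embeddings = {
    "chennai": [0.1, 0.9],
    "is": [0.5, 0.5],
    "hot": [0.8, 0.2],
}
tweet = ["chennai", "is", "hot"]

features = trigram_features(tweet, toy_embeddings, dim=2)
counts = cooccurrence_counts([tweet], window=1)
```

Each row of `features` has three times the embedding dimension and would be paired with the named entity tag of the corresponding token before being passed to a classifier such as an SVM.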

Keywords

Support Vector Machine; Word2vec; GloVe embedding; N-gram embedding; structured skip-gram

