Abstract
Social media is a vibrant space where millions of individuals interact and share their views. Processing social media text in Indian languages is challenging because these languages are morphologically rich. Once such unstructured text has been converted into a consistent format, it is passed to a feature extraction method. In a large corpus, information units, i.e. entities, carry the basic idea of the content. The main aim of the system is to recognise and extract named entities in Twitter text. The proposed system relies on co-occurrence based word embedding models to extract features for the words in the dataset, and uses Tamil-language text data from Twitter. To enhance performance, tri-gram features are extracted from the word embedding vectors, and the systems are trained on these N-gram embedding features together with the named entity tags. The system is implemented using a machine learning classifier, the Support Vector Machine (SVM). Comparing the proposed systems, GloVe embedding gives better results, with an accuracy of 96.93%, whereas word2vec embedding reaches 84.53%. The improvement in accuracy with GloVe embedding may be due to the important role of its co-occurrence information in recognising entities.
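The abstract does not spell out how the tri-gram features are built from the embedding vectors. One common reading, offered here only as a minimal sketch, is to concatenate the embeddings of the previous, current, and next token, with zero vectors as padding at sentence boundaries and for unknown words. The function name `trigram_features`, the zero-padding choice, and the toy two-dimensional vectors are all assumptions for illustration, not details taken from the paper.

```python
from typing import Dict, List

Vector = List[float]


def trigram_features(
    tokens: List[str], embeddings: Dict[str, Vector], dim: int
) -> List[Vector]:
    """Concatenate (previous, current, next) embeddings for each token.

    Assumption: positions outside the sentence and out-of-vocabulary
    words are represented by a zero vector of length `dim`.
    """
    zero = [0.0] * dim

    def lookup(i: int) -> Vector:
        # Return the embedding at position i, or zeros when i is out of
        # range or the token has no embedding.
        if 0 <= i < len(tokens):
            return embeddings.get(tokens[i], zero)
        return zero

    return [lookup(i - 1) + lookup(i) + lookup(i + 1) for i in range(len(tokens))]


# Toy 2-dimensional embeddings; the values are made up for illustration.
toy_embeddings = {"chennai": [0.9, 0.1], "is": [0.0, 0.2], "hot": [0.3, 0.7]}
features = trigram_features(["chennai", "is", "hot"], toy_embeddings, dim=2)
```

Each row of `features` would then be paired with the named entity tag of its centre token and passed to an SVM classifier. The abstract does not say which SVM implementation was used; scikit-learn's `sklearn.svm.SVC` would be one option.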
