Author detection: Analyzing tweets by using a Naïve Bayes classifier

Abstract

In digital social media, where users can obtain information in many ways, it is important to have tools that detect authorship within a corpus supposedly created by a single author. The volume of information coming from social networks has prompted a great deal of research on author profiling, but little on authorship identification. To detect the author of a group of tweets, a Naïve Bayes classifier is proposed, a probabilistic algorithm based on Bayes’ theorem. The main objective is to determine, from its content, whether a particular tweet was written by a specific user. The data set, obtained with the Twitter API, is a simple one: it consists of tweets from four political accounts, each labeled with its username and tweet identifier because the tweets of the different users are mixed together. To describe the performance of the classification model and interpret the results, a confusion matrix is used, from which accuracy, sensitivity, specificity, the Kappa measure, and the positive and negative predictive values are derived. The results show that the prediction model, after several use cases, attains acceptable values against the observed probabilities.
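As a rough illustration of the approach summarized above, the following sketch trains a small Naïve Bayes classifier to decide whether a tweet belongs to a target account, then derives the listed measures from confusion-matrix counts. It is not the authors' implementation. The multinomial model, add-one smoothing, whitespace tokenization, the example tweets, the labels "target" and "other", and the counts passed to `confusion_metrics` are all assumptions made for illustration only.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    # Assumption: lowercase whitespace tokenization; real tweets need more care.
    return text.lower().split()


class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = defaultdict(Counter)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            # log P(class) + sum of log P(word | class), per Bayes' theorem
            score = math.log(n / total)
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for word in tokenize(text):
                score += math.log((self.word_counts[label][word] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def confusion_metrics(tp, fp, fn, tn):
    """Derive the measures named in the abstract from 2x2 confusion counts."""
    n = tp + fp + fn + tn
    accuracy = (tp + tn) / n
    # Expected chance agreement, used by Cohen's kappa
    p_e = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / n ** 2
    return {
        "accuracy": accuracy,
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
        "kappa": (accuracy - p_e) / (1 - p_e),
    }


# Hypothetical training tweets: "target" is the account of interest.
train_texts = [
    "we will lower taxes for working families",
    "taxes must be lower for every family",
    "great match tonight what a goal",
    "new phone review coming tonight",
]
train_labels = ["target", "target", "other", "other"]

model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("lower taxes now"))

# Hypothetical confusion-matrix counts, not results from the paper.
metrics = confusion_metrics(tp=40, fp=10, fn=5, tn=45)
print(metrics)
```

Working in log space avoids numerical underflow when many word probabilities are multiplied, and the add-one smoothing keeps a word unseen for one class from forcing that class's probability to zero.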

Keywords

Naïve Bayes classifier, authorship detection, social network analysis, Twitter, confusion matrix
