An author-specific-model-based authorship analysis using psycholinguistic aspects and style word patterns

Abstract

Authorship analysis, which seeks to identify the author of a document by scrutinizing its writing style, can help curb illegal cyber activities. One major threat on online media is the propagation of false statements attributed to celebrities with the aim of tarnishing their public image, especially as part of online political campaigns. This scenario calls for analyzing the authorship of short documents and capturing author style among a large number of candidate authors belonging to the same domain. This area of authorship analysis is less explored because the task is challenging: traditional methods lose accuracy when the texts of different authors pertain to the same topic. Here we propose a method that performs the analysis in such an environment by combining psycholinguistic, lexical, and syntactic aspects of an author with word co-occurrences obtained by modeling the style word pattern of the text. The method identifies an author’s individual way of expressing emotional aspects, sociolinguistic aspects, and word co-occurrences to obtain an author-style pattern for each candidate author. An author-specific model is then generated. The questioned document is fed into the models so formed, and the final authorship decision is made by an ensemble learning method. In experiments, the proposed method achieved a precision of 0.98 in the best case and 0.45 in the worst case, illustrating an improvement in the accuracy of authorship attribution of short texts over existing methods.
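The pipeline outlined in the abstract (one style model per candidate author, with the questioned document scored against each model and the verdict reached by an ensemble) might be sketched roughly as follows. This is only an illustrative toy under stated assumptions, not the method evaluated in the paper: the `STYLE_WORDS` list, the three feature views, the cosine similarity, and the majority vote are all hypothetical stand-ins for the psycholinguistic, lexical, syntactic, and co-occurrence features the abstract mentions.

```python
from collections import Counter
import math

# Hypothetical style-word list; the abstract does not specify the actual features.
STYLE_WORDS = {"the", "of", "and", "to", "in", "i", "we", "not", "but", "very"}


def tokens(text):
    """Lowercase whitespace tokenization (a deliberate simplification)."""
    return text.lower().split()


def style_word_view(text):
    """Counts of the assumed style words."""
    return Counter(t for t in tokens(text) if t in STYLE_WORDS)


def cooccurrence_view(text):
    """Counts of adjacent word pairs, a crude proxy for word co-occurrence."""
    toks = tokens(text)
    return Counter(zip(toks, toks[1:]))


def word_length_view(text):
    """Histogram of word lengths (capped at 10), a simple lexical feature."""
    return Counter(min(len(t), 10) for t in tokens(text))


VIEWS = (style_word_view, cooccurrence_view, word_length_view)


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def build_author_models(samples):
    """Build one model per author: a list holding one profile per view."""
    return {
        author: [sum((view(t) for t in texts), Counter()) for view in VIEWS]
        for author, texts in samples.items()
    }


def attribute(models, questioned):
    """Each view votes for its most similar author; the majority wins."""
    votes = Counter()
    for i, view in enumerate(VIEWS):
        profile = view(questioned)
        best = max(models, key=lambda a: cosine(models[a][i], profile))
        votes[best] += 1
    return votes.most_common(1)[0][0]


# Toy data, invented purely for illustration.
samples = {
    "alice": ["we do not like the rain but we like the sun"],
    "bob": ["markets rallied sharply yesterday amid broad optimism"],
}
models = build_author_models(samples)
predicted = attribute(models, "we do not like the wind but we like the sea")
print(predicted)
```

Voting across separate views keeps any one feature family from dominating the decision, which is one plausible reading of the ensemble step; a real system would use learned classifiers and far richer features.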

Keywords

Authorship attribution, computational linguistics, cyber forensics, psycholinguistic features, topic modelling, stylometry, ensemble learning

