A graph model based feature set selection from short texts with application to document novelty detection

Abstract

Document novelty detection is a concept learning problem in which the system learns only from the positive documents belonging to a concept and, with that limited knowledge, attempts to detect the negative cases. This work focuses on learning author style as a concept from a given set of documents, particularly emails. Author attribution is harder for short texts such as emails than for longer documents, and techniques originally developed for long documents prove inefficient on short texts. To address this shortcoming of existing algorithms in detecting aberrations in author style, we propose a graph-model-based technique for feature set extraction from short documents. Given the extracted feature set, we also develop two probability-based text representation schemes for presenting a text document to an underlying one-class SVM classifier. The proposed models are compared and evaluated on the public Enron email dataset. Combining the graph-based feature set extraction technique with the inclusive compound probability-based text representation proved very efficient. Because the method is general, the approach applies to all kinds of text documents, including emails.

Keywords

Novelty detection, author attribution, graph model, feature extraction, one-class SVM, email classification
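
The one-class setting described in the abstract can be illustrated with a small sketch. It is not the authors' method: the abstract does not specify how the graph is built or how the probability-based representations are defined. The sketch assumes a word-adjacency graph (an edge joins each pair of consecutive words), keeps the `k` edges that occur in the most positive documents as the feature set, and replaces the one-class SVM with a plain threshold on the share of a document's edges that fall in that set. All function names and parameters are hypothetical.

```python
from collections import Counter

PUNCT = ".,;:!?\"'()"

def tokenize(text):
    # Lowercase, split on whitespace, strip surrounding punctuation.
    words = (w.strip(PUNCT).lower() for w in text.split())
    return [w for w in words if w]

def edges(text):
    # Assumed graph model: one directed edge per pair of adjacent words.
    toks = tokenize(text)
    return list(zip(toks, toks[1:]))

def select_features(positive_docs, k):
    # Count, for each edge, how many positive documents contain it,
    # then keep the k most widely shared edges as the feature set.
    counts = Counter()
    for doc in positive_docs:
        counts.update(set(edges(doc)))
    return {edge for edge, _ in counts.most_common(k)}

def familiarity(doc, features):
    # Share of the document's distinct edges that are known features.
    doc_edges = set(edges(doc))
    if not doc_edges:
        return 0.0
    return len(doc_edges & features) / len(doc_edges)

def is_novel(doc, features, threshold):
    # Stand-in for the one-class classifier: flag low familiarity.
    return familiarity(doc, features) < threshold
```

For example, after learning from a handful of emails by one author, `is_novel(new_email, features, 0.25)` would flag a message whose word transitions rarely match that author's. Only positive examples are used in training, which is what makes this a novelty detection setup rather than ordinary two-class classification.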

