Unsupervised authorship attribution using feature selection and weighted cosine similarity

Abstract

This paper presents a computational model for unsupervised authorship attribution based on a traditional machine learning scheme. Comparing different feature selection methods on the PAN17 author clustering dataset yields an improvement over the state of the art. To achieve this improvement, specific pre-processing and feature extraction methods are proposed, such as a method that separates tokens by type so that each token is assigned to only one category. Similarly, special characters are treated as punctuation marks to improve the results obtained with typed character n-grams. A weighted cosine similarity measure is applied to improve the B³ F-score by reducing the vector values where attributes are exclusive. This measure defines the distances between documents, which the clustering algorithm then uses to perform authorship attribution.
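Since the results are reported as a B³ F-score, a minimal sketch of how that metric is commonly computed may help. It follows the standard B-cubed definition: per-document precision and recall are averaged over all documents and then combined as a harmonic mean. The function name `b_cubed` and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import Counter


def b_cubed(true_authors, predicted_clusters):
    """Return (precision, recall, F) under the standard B-cubed definition.

    Illustrative sketch only: names and inputs are assumptions, not the
    paper's evaluation code.
    """
    if len(true_authors) != len(predicted_clusters):
        raise ValueError("label sequences must have the same length")
    n = len(true_authors)
    if n == 0:
        raise ValueError("at least one document is required")

    pairs = list(zip(true_authors, predicted_clusters))
    author_sizes = Counter(true_authors)         # documents per true author
    cluster_sizes = Counter(predicted_clusters)  # documents per predicted cluster
    overlap = Counter(pairs)                     # documents per (author, cluster)

    # Per-document precision: share of the document's cluster written by its author.
    precision = sum(overlap[(a, c)] / cluster_sizes[c] for a, c in pairs) / n
    # Per-document recall: share of the author's documents placed in this cluster.
    recall = sum(overlap[(a, c)] / author_sizes[a] for a, c in pairs) / n

    f_score = 2 * precision * recall / (precision + recall)
    return precision, recall, f_score


# Toy example: two authors, one document of author "b" misplaced in cluster 0.
p, r, f = b_cubed(["a", "a", "b", "b"], [0, 0, 0, 1])
```

Because precision and recall are averaged per document rather than per cluster, merging documents by different authors lowers precision, while splitting one author's documents across clusters lowers recall.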

Keywords

Authorship attribution, feature selection, similarity measure, clustering, feature extraction

