Abstract
Feature selection is a crucial aspect of classification problems, especially in domains such as text classification, where the number of features is usually very large. Recently, a two-stage feature selection method for text classification, which combines class-based and corpus-based feature selection, was introduced. Based on their experiments, the authors of that work concluded which parameter values for the corpus-based and class-based stages yield a feature selection that outperforms traditional methods in text classification. In this paper, we revisit this two-stage feature selection method and, based on several experiments, reach a different conclusion: the parameters suggested in the original work do not necessarily provide the best results. Our experiments show that, by combining the best parameter value for each stage for the specific corpus under study, the two-stage selection method based on coverage policies yields a subset of features that provides a statistically significant improvement in the classifier's success rates over the traditional methods.
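To make the idea of a two-stage (class-based then corpus-based) selection concrete, the following is a minimal, hypothetical sketch. It is not the method evaluated in the paper: the scoring functions (within-class document frequency for the class-based stage, overall document frequency for the corpus-based stage) and the parameter names `k_class` and `k_corpus` are illustrative assumptions chosen for simplicity.

```python
from collections import Counter, defaultdict

def two_stage_select(docs, labels, k_class=2, k_corpus=3):
    """Illustrative two-stage feature selection sketch.

    Stage 1 (class-based): keep the top-k_class terms per class,
    scored here by within-class document frequency (an assumed,
    simplified score; class-based methods often use chi-square
    or information gain instead).
    Stage 2 (corpus-based): rank the union of stage-1 candidates
    by overall document frequency and keep the top k_corpus.
    """
    df_class = defaultdict(Counter)  # per-class document frequency
    df_all = Counter()               # corpus-wide document frequency
    for doc, label in zip(docs, labels):
        terms = set(doc.split())     # count each term once per document
        df_all.update(terms)
        df_class[label].update(terms)

    # Stage 1: class-based candidate set
    candidates = set()
    for counts in df_class.values():
        for term, _ in counts.most_common(k_class):
            candidates.add(term)

    # Stage 2: corpus-based ranking of the candidates
    ranked = sorted(candidates, key=lambda t: -df_all[t])
    return ranked[:k_corpus]
```

The point the sketch illustrates is that each stage has its own parameter (here `k_class` and `k_corpus`), and the final feature subset depends on how the two are combined, which is exactly the tuning question this paper re-examines.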
