A novel filter feature selection method for text classification: Extensive Feature Selector

Abstract

Because the high dimensionality of textual data limits classification accuracy, feature selection (FS) is an essential dimension reduction step in text classification (TC). Most FS methods for TC combine several probabilities. In this study, we propose a new FS method, the Extensive Feature Selector (EFS), which uses both corpus-based and class-based probabilities in its calculations. The performance of EFS is compared with nine well-known FS methods, namely Chi-Squared (CHI2), Class Discriminating Measure (CDM), Discriminative Power Measure (DPM), Odds Ratio (OR), Distinguishing Feature Selector (DFS), Comprehensively Measure Feature Selection (CMFS), Discriminative Feature Selection (DFSS), Normalised Difference Measure (NDM) and Max–Min Ratio (MMR), using Multinomial Naive Bayes (MNB), Support Vector Machine (SVM) and k-Nearest Neighbour (KNN) classifiers on four benchmark data sets: Reuters-21578, 20-Newsgroup, Mini 20-Newsgroup and Polarity. The experiments were carried out for six feature sizes: 10, 30, 50, 100, 300 and 500. Experimental results show that EFS outperforms the other nine methods in most cases according to micro-F1 and macro-F1 scores.
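A filter FS method of the kind compared here scores each term independently of the classifier and keeps only the top-ranked terms. The EFS formula is not given in this section, so the minimal sketch below uses Chi-Squared (CHI2), one of the baselines named above, as a stand-in. The function names, the toy corpus, and the choice of the maximum over classes as the global score are all assumptions made for illustration.

```python
from collections import Counter, defaultdict


def chi2_scores(docs, labels):
    """Score each term by its maximum chi-squared statistic over classes.

    docs: list of token lists; labels: one class label per document.
    Counts are document-level (term present or absent in a document).
    Taking the max over classes is an assumed globalisation choice.
    """
    n = len(docs)
    class_sizes = Counter(labels)
    df = Counter()                      # term -> documents containing it
    df_in_class = defaultdict(Counter)  # term -> class -> documents containing it
    for tokens, label in zip(docs, labels):
        for term in set(tokens):
            df[term] += 1
            df_in_class[term][label] += 1

    scores = {}
    for term, n_t in df.items():
        best = 0.0
        for cls, n_c in class_sizes.items():
            a = df_in_class[term][cls]  # term present, document in cls
            b = n_t - a                 # term present, document not in cls
            c = n_c - a                 # term absent, document in cls
            d = n - n_t - c             # term absent, document not in cls
            denom = (a + c) * (b + d) * (a + b) * (c + d)
            if denom:
                best = max(best, n * (a * d - b * c) ** 2 / denom)
        scores[term] = best
    return scores


def select_top_k(scores, k):
    """Return the k highest-scoring terms; ties are broken alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]


# Hypothetical toy corpus, not one of the benchmark data sets.
docs = [
    ["cheap", "pills", "offer"],
    ["cheap", "offer", "now"],
    ["meeting", "agenda", "now"],
    ["project", "meeting", "notes"],
]
labels = ["spam", "spam", "ham", "ham"]

scores = chi2_scores(docs, labels)
selected = select_top_k(scores, 3)
print(selected)  # ['cheap', 'meeting', 'offer']
```

In the study's setting, `k` would take the feature sizes listed above (10 to 500), and the reduced vocabulary would then be passed to a classifier such as MNB, SVM or KNN.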

Keywords

Dimension reduction, feature selection, text classification

