An effective dimensionality reduction method for text classification based on TFP-tree

Abstract

Obtaining interesting, topic-relevant information is an important task in Web mining. Text classification from a small proportion of labeled data and a large proportion of unlabeled data, also called semi-supervised learning, is a well-known problem. Despite extensive research on text classification, how to apply valuable frequent patterns effectively and efficiently, and how to handle high-dimensional data, remain open issues. As data volumes grow and high-dimensional data become common, noisy data can distort distance measures and increase time complexity. This paper targets this problem and presents a novel text classification method called CTFP (Classification based on TFP-tree), which uses a TFP-tree (Text-Frequent-Pattern-tree) to generate frequent patterns from large text collections and performs classification in a relatively low-dimensional data space. The method effectively reduces data dimensionality while the classifier is being constructed. Extensive experiments on three datasets (RCV1, SRAA and Reuters-21578) show that the proposed method achieves better performance than many existing state-of-the-art methods in precision, efficiency and other evaluation metrics.
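The abstract does not describe how the TFP-tree is built, so the following is only a minimal sketch of the general idea of using frequent patterns as a reduced feature space. It substitutes a plain Apriori-style count for the TFP-tree, and the function names (`frequent_patterns`, `to_pattern_vector`), the `min_support` threshold and the toy documents are all assumptions made for illustration, not part of the paper.

```python
from collections import Counter
from itertools import combinations


def frequent_patterns(docs, min_support, max_size=2):
    """Return {term tuple: document count} for term sets that occur in
    at least `min_support` documents (Apriori-style, NOT a TFP-tree)."""
    doc_sets = [set(d) for d in docs]

    # Level 1: document frequency of each single term.
    counts = Counter(t for s in doc_sets for t in s)
    level = {(t,): c for t, c in counts.items() if c >= min_support}
    result = dict(level)

    for size in range(2, max_size + 1):
        # Only terms that survived the previous level can appear in a
        # larger frequent pattern (the Apriori property).
        kept_terms = {t for p in level for t in p}
        counts = Counter()
        for s in doc_sets:
            counts.update(combinations(sorted(s & kept_terms), size))
        level = {p: c for p, c in counts.items() if c >= min_support}
        if not level:
            break
        result.update(level)
    return result


def to_pattern_vector(doc, patterns):
    """Binary vector: 1 if the document contains every term of a pattern."""
    terms = set(doc)
    return [1 if terms.issuperset(p) else 0 for p in patterns]


# Hypothetical toy corpus, already tokenized.
docs = [
    ["stock", "market", "rise"],
    ["stock", "market", "fall"],
    ["football", "match", "win"],
    ["football", "match", "stock"],
]
patterns = frequent_patterns(docs, min_support=2)
ordered = sorted(patterns)  # fixed feature order
vectors = [to_pattern_vector(d, ordered) for d in docs]
```

Under these assumptions, rare terms such as "rise" or "win" drop out and each document is represented only by the patterns that meet the support threshold. The resulting vectors could then be passed to any standard classifier, such as the SVM named in the keywords.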

Keywords

Text classification, dimensionality reduction, TFP-tree, SVM, frequent patterns

