Abstract
Obtaining interesting and topic-relevant information is an important task in Web mining. Text classification using a small proportion of labeled data and a large proportion of unlabeled data, also called semi-supervised learning, is a well-known problem. Despite extensive research on text classification, how to apply valuable frequent patterns effectively and efficiently, and how to deal with high-dimensional data, remain open issues. As data volumes grow and high-dimensional data become more common, noisy data can affect both distance measures and time complexity. This paper addresses this problem and presents a novel text classification method called CTFP (Classification based on TFP-tree), which uses a TFP-tree (Text-Frequent-Pattern-tree) to generate frequent patterns from a very large number of texts and performs text classification in a relatively low-dimensional data space. It effectively reduces data dimensionality while constructing the classifier. Extensive experiments on three datasets (RCV1, SRAA and Reuters-21578) show that the proposed method can achieve better performance than many existing state-of-the-art methods on precision, efficiency and many other evaluation metrics.
