Abstract
This paper proposes the automatic generation of domain-specific stopwords from a large labeled corpus. In most text mining tasks, stopwords are removed according to a standard stopword list and/or by filtering on high and low document frequencies. We introduce a new approach to stopword extraction based on the notions of backward filter-level performance and a data sparsity index. First, using the proposed model for evaluating extracted stopwords, we examine high-document-frequency filtering for stopword reduction. Second, we propose a new algorithm for building general and domain-specific stopword lists. The method assumes that a set of candidate stopwords must have minimal information content and prediction capacity, as measured by the performance of a classifier. We show that, to avoid repeatedly computing the classifier's performance, it can be estimated from the sparsity of the training dataset. Moreover, we confirm that even if a given term-ranking measure performs well for feature selection, it is not necessarily effective for selecting poor features (stopwords). A comparative study shows that the proposed approach offers more promising results, guaranteeing minimal information loss while filtering out most stopwords.
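The core idea of the sparsity proxy described above can be illustrated with a minimal sketch. This is not the paper's exact algorithm; the toy corpus, the document-frequency threshold, and all function names are illustrative assumptions. It shows how one might score a candidate stopword set by the change in the sparsity of the term-document matrix, instead of retraining a classifier after each removal.

```python
# Hedged sketch: estimate the effect of removing a candidate stopword set
# via the sparsity of the term-document matrix, rather than via classifier
# accuracy. The corpus and threshold below are illustrative only.

def build_vocab(docs):
    # sorted set of all whitespace-delimited tokens
    return sorted({t for d in docs for t in d.split()})

def term_doc_matrix(docs, vocab):
    # rows = documents, columns = vocabulary terms (raw counts)
    return [[d.split().count(t) for t in vocab] for d in docs]

def sparsity(matrix):
    # data sparsity index here = fraction of zero cells in the matrix
    cells = [x for row in matrix for x in row]
    return sum(1 for x in cells if x == 0) / len(cells)

def doc_frequency(docs, term):
    # number of documents containing the term at least once
    return sum(1 for d in docs if term in d.split())

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell as the market closed",
    "the market rallied on tech stocks",
]

vocab = build_vocab(docs)
base = sparsity(term_doc_matrix(docs, vocab))

# candidate stopwords: terms appearing in at least 3 of the 4 documents
candidates = [t for t in vocab if doc_frequency(docs, t) >= 3]
kept = [t for t in vocab if t not in candidates]
filtered = sparsity(term_doc_matrix(docs, kept))

# Removing high-document-frequency terms drops densely populated columns,
# so the remaining matrix is sparser; comparing `filtered` against `base`
# gives a cheap signal about a candidate set without training a classifier.
print(candidates, round(base, 3), round(filtered, 3))
```

Under this toy threshold only the term "the" qualifies, and the sparsity index rises once its (fully dense) column is removed; the paper's method would use such a signal in place of repeated classifier evaluations.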