Abstract
This paper proposes the automatic generation of domain-specific stopwords from a large labeled corpus. In most text mining tasks, stopwords are removed according to a standard stopword list and/or by filtering on high and low document frequencies. We introduce a new approach to stopword extraction based on the notions of backward filter-level performance and a data sparsity index. First, using the proposed model for evaluating extracted stopwords, we examine high-document-frequency filtering for stopword reduction. Second, we propose a new algorithm for building general and domain-specific stopword lists. The method assumes that a set of candidate stopwords must have minimal information content and prediction capacity, as measured by the performance of a classifier. We show that, to avoid repeatedly computing the classifier's performance, it can be estimated from the sparsity of the training dataset. Moreover, we confirm that even if a given term-ranking measure performs well for feature selection, it is not necessarily effective for selecting poor features (stopwords). A comparative study shows that the proposed approach offers more promising results, guaranteeing minimal information loss while filtering out most stopwords.
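The core idea of the sparsity proxy described above can be illustrated with a minimal sketch. This is not the paper's exact algorithm; the toy corpus, the document-frequency threshold, and all function names are illustrative assumptions. It shows how one might score a candidate stopword set by the change in the sparsity of the term-document matrix, instead of retraining a classifier after each removal.

```python
# Hedged sketch: estimate the effect of removing a candidate stopword set
# via the sparsity of the term-document matrix, rather than via classifier
# accuracy. The corpus and threshold below are illustrative only.

def build_vocab(docs):
    # sorted set of all whitespace-delimited tokens
    return sorted({t for d in docs for t in d.split()})

def term_doc_matrix(docs, vocab):
    # rows = documents, columns = vocabulary terms (raw counts)
    return [[d.split().count(t) for t in vocab] for d in docs]

def sparsity(matrix):
    # data sparsity index here = fraction of zero cells in the matrix
    cells = [x for row in matrix for x in row]
    return sum(1 for x in cells if x == 0) / len(cells)

def doc_frequency(docs, term):
    # number of documents containing the term at least once
    return sum(1 for d in docs if term in d.split())

docs = [
    "the cat sat on the mat",
    "the dog chased the cat",
    "stocks fell as the market closed",
    "the market rallied on tech stocks",
]

vocab = build_vocab(docs)
base = sparsity(term_doc_matrix(docs, vocab))

# candidate stopwords: terms appearing in at least 3 of the 4 documents
candidates = [t for t in vocab if doc_frequency(docs, t) >= 3]
kept = [t for t in vocab if t not in candidates]
filtered = sparsity(term_doc_matrix(docs, kept))

# Removing high-document-frequency terms drops densely populated columns,
# so the remaining matrix is sparser; comparing `filtered` against `base`
# gives a cheap signal about a candidate set without training a classifier.
print(candidates, round(base, 3), round(filtered, 3))
```

Under this toy threshold only the term "the" qualifies, and the sparsity index rises once its (fully dense) column is removed; the paper's method would use such a signal in place of repeated classifier evaluations.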