Supervised learning for building stemmers

Abstract

This work is part of a project that aims to define a methodology for building simple but robust stemmers from only primitive knowledge of the stemmer's target language. The methodology starts with a very simple primary stemmer that removes the longest suffix matching the ending of the examined word, where the candidate suffixes come from the primitive knowledge (the list of available suffixes). Information retrieval (IR) experts then express their arguments against the results of the primary stemmer. These arguments are valuable knowledge that allows us to apply supervised learning to automatically produce better stemmers, that is, stemmers that conform to the arguments expressed by the IR experts. We also evaluate our supervised learning-based methodology, which builds stemmers for languages that the experts have no knowledge of.
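The primary stemmer described in the abstract can be illustrated with a minimal sketch. The function name `primary_stem`, the sample English suffix list, and the rule that a word is never reduced to an empty stem are assumptions made for illustration only; the abstract specifies only that the longest matching suffix is removed.

```python
def primary_stem(word, suffixes):
    """Remove the longest listed suffix that matches the end of `word`.

    Hypothetical sketch: the names and the non-empty-stem guard are
    assumptions, not details taken from the article.
    """
    # Try longer suffixes first so that the longest match wins.
    for suffix in sorted(suffixes, key=len, reverse=True):
        # Skip empty suffixes and refuse to strip the whole word.
        if suffix and len(word) > len(suffix) and word.endswith(suffix):
            return word[: -len(suffix)]
    # No listed suffix matches: return the word unchanged.
    return word


# Illustrative suffix list (an assumption, not the article's data).
SUFFIXES = ["s", "es", "ed", "ing", "ings"]

for w in ["meetings", "boxes", "walked", "cat"]:
    print(w, "->", primary_stem(w, SUFFIXES))
```

A stemmer this simple will over-strip some words and under-strip others, which is presumably the kind of output the IR experts argue against and the supervised learning step then corrects.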

Keywords

Information storage and retrieval; stemming algorithms; stemmer builder; text analysis and indexing
