Benchmarking and assessing the performance of Arabic stemmers

Abstract

Previous studies of Arabic stemming lack fair evaluation, full descriptions of the algorithms used, or access to the source code of the stemmers and the datasets used to evaluate them. Releasing source code and datasets is an essential step that enables researchers to improve the stemmers currently in use and to verify the results of these studies. This study lays the foundation for a benchmark for Arabic stemmers and presents an evaluation of four heavy (root-based) stemmers for the Arabic language. The evaluation assesses the accuracy of each of the four stemmers and measures the strength of each. The four algorithms are the Al-Mustafa stemmer, the Al-Sarhan stemmer, the Rabab’ah stemmer and the Taghva stemmer. The accuracy and strength tests used in this study ranked the Rabab’ah stemmer first, followed by the Al-Sarhan, Al-Mustafa and Taghva stemmers, in that order.

Keywords

Arabic stemming; evaluation of Arabic stemmers; information retrieval; search engines
