LDA-AdaBoost.MH: Accelerated AdaBoost.MH based on latent Dirichlet allocation for text categorization

Abstract

AdaBoost.MH is a boosting algorithm that is considered one of the most accurate algorithms for multilabel classification. It works by iteratively building a committee of weak hypotheses in the form of decision stumps. To build each weak hypothesis, AdaBoost.MH scans the entire set of extracted features in every iteration and examines them one by one to assess how well each characterizes the appropriate category. With a Bag-of-Words text representation, this scan dramatically increases the training time of AdaBoost.MH, especially for large-scale datasets. In this paper we demonstrate how to improve the efficiency and effectiveness of AdaBoost.MH by using latent topics, rather than words, as features. A well-known probabilistic topic modelling method, Latent Dirichlet Allocation (LDA), is used to estimate the latent topics in the corpus, which then serve as features for AdaBoost.MH. To evaluate the resulting method, LDA-AdaBoost.MH, four datasets were used: Reuters-21578-ModApte, WebKB, 20-Newsgroups and a collection of Arabic news. The experimental results confirmed that representing the texts as a small number of latent topics, rather than a large number of words, significantly decreased the training time of AdaBoost.MH and improved its performance for text categorization.
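To make the cost argument concrete, the following is a minimal sketch of an AdaBoost.MH-style training loop with real-valued decision stumps over binary features, in the spirit of the confidence-rated formulation by Schapire and Singer. It is an illustration under stated assumptions, not the implementation evaluated in the paper: the function names (`binarize_topics`, `train_adaboost_mh`, `predict`), the thresholding of topic proportions into 0/1 features, and the toy data are all hypothetical. The point to notice is the inner loop over `n_features`, which runs in every boosting round, so shrinking the feature space from thousands of words to a few dozen topics shrinks the per-round cost proportionally.

```python
import math

def binarize_topics(theta, threshold=0.1):
    """Map per-document topic proportions to 0/1 features.
    The thresholding scheme is an assumption of this sketch."""
    return [[1 if p >= threshold else 0 for p in doc] for doc in theta]

def train_adaboost_mh(X, Y, rounds):
    """X: binary feature vectors; Y: label vectors with entries +1 or -1."""
    m, n_features, k = len(X), len(X[0]), len(Y[0])
    eps = 1.0 / (m * k)                          # smoothing for the stump outputs
    D = [[1.0 / (m * k)] * k for _ in range(m)]  # weights over (example, label) pairs
    committee = []
    for _ in range(rounds):
        best = None
        for f in range(n_features):              # full scan of all features, every round
            # W[j][l] = [negative weight, positive weight] for label l among
            # examples where feature f has value j (0 = absent, 1 = present)
            W = [[[0.0, 0.0] for _ in range(k)] for _ in range(2)]
            for i in range(m):
                for l in range(k):
                    W[X[i][f]][l][1 if Y[i][l] > 0 else 0] += D[i][l]
            # Normalization factor Z; the stump with the smallest Z is selected
            Z = 2.0 * sum(math.sqrt(neg * pos) for block in W for neg, pos in block)
            if best is None or Z < best[0]:
                c = [[0.5 * math.log((pos + eps) / (neg + eps)) for neg, pos in block]
                     for block in W]
                best = (Z, f, c)
        _, f, c = best
        committee.append((f, c))
        # Reweight: pairs the chosen stump gets wrong gain weight, then renormalize
        total = 0.0
        for i in range(m):
            for l in range(k):
                D[i][l] *= math.exp(-Y[i][l] * c[X[i][f]][l])
                total += D[i][l]
        D = [[w / total for w in row] for row in D]
    return committee

def predict(committee, x):
    """Sum the stump outputs per label and take the sign."""
    k = len(committee[0][1][0])
    scores = [0.0] * k
    for f, c in committee:
        for l in range(k):
            scores[l] += c[x[f]][l]
    return [1 if s > 0 else -1 for s in scores]

if __name__ == "__main__":
    # Toy data: 4 documents, 4 topics, 2 labels (all values are made up)
    theta = [[0.45, 0.05, 0.45, 0.05],
             [0.45, 0.45, 0.05, 0.05],
             [0.05, 0.45, 0.45, 0.05],
             [0.05, 0.05, 0.05, 0.85]]
    Y = [[1, -1], [1, 1], [-1, 1], [-1, -1]]
    X = binarize_topics(theta, threshold=0.2)
    committee = train_adaboost_mh(X, Y, rounds=2)
    print([f for f, _ in committee])
    print([predict(committee, x) for x in X])
```

In this sketch each round costs on the order of (number of features) × (number of documents) × (number of labels) weight updates. The same loop would work over binary word-presence vectors; only the length of each row in `X` changes.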

Keywords

AdaBoost.MH; boosting; latent Dirichlet allocation; text categorization; topic modeling
