Utilizing global and path information with language modelling for hierarchical text classification

Abstract

Hierarchical text classification over a Web taxonomy is challenging because it is a very large-scale problem, with hundreds of thousands of categories and associated documents. Furthermore, the categories vary widely in conceptual level and in the amount of training data available. The narrow-down approach is the state of the art: it uses a search engine to generate candidate categories from the taxonomy and then builds a classifier for the final category selection. In this paper, we take the same approach but address the issue of using global information in a language modelling framework to improve effectiveness. We propose three methods of using non-local information for the task: a passive method that uses global information for smoothing; an aggressive method in which a top-level classifier is built and integrated with a local model; and a method that uses the label terms on the path from a category to the root, motivated by our systematic observation that these terms are underrepresented in the documents. For evaluation, we constructed a document collection from Web pages in the Open Directory Project. A series of experiments demonstrates the superiority of our methods and reveals the role of global information in hierarchical text classification.
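As a rough illustration of the first (passive) method only, the sketch below scores a document against each candidate category with a unigram language model that is linearly interpolated with a global model. It is a minimal sketch under stated assumptions, not the implementation evaluated in the paper: the interpolation (Jelinek-Mercer style) form, the weight `lam`, the probability floor, and all function names are assumptions introduced here, and the candidate set is assumed to have already been produced by the search stage.

```python
import math
from collections import Counter


def unigram_model(tokens):
    """Maximum-likelihood unigram model: term -> relative frequency."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {term: count / total for term, count in counts.items()}


def smoothed_log_likelihood(doc_tokens, local_model, global_model,
                            lam=0.7, floor=1e-9):
    """Log-likelihood of a document under a local category model
    interpolated with a global model (lam and floor are assumed values)."""
    score = 0.0
    for term in doc_tokens:
        p = (lam * local_model.get(term, 0.0)
             + (1.0 - lam) * global_model.get(term, 0.0))
        # Floor the probability so unseen terms do not yield log(0).
        score += math.log(max(p, floor))
    return score


def classify(doc_tokens, candidates, global_model, lam=0.7):
    """Pick the candidate category whose smoothed model scores highest.

    `candidates` maps a category label to its local unigram model and
    stands in for the output of the search-based candidate generation.
    """
    return max(
        candidates,
        key=lambda label: smoothed_log_likelihood(
            doc_tokens, candidates[label], global_model, lam),
    )


# Toy usage with hypothetical categories and training text.
soccer = ["goal", "match", "league"]
programming = ["code", "compiler", "bug"]
candidates = {
    "Sports/Soccer": unigram_model(soccer),
    "Computers/Programming": unigram_model(programming),
}
global_model = unigram_model(soccer + programming)
predicted = classify(["goal", "league", "referee"], candidates, global_model)
```

In this toy setting the global model only redistributes probability mass toward terms seen elsewhere in the collection; the aggressive and path-based methods described above would require additional components that are not sketched here.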

Keywords

Hierarchical text classification; language models; web taxonomy
