Abstract
Detection of topics in natural language text collections is an important step towards flexible automated text handling, for tasks such as text translation and summarization. In the currently dominant paradigm of topic modeling, topics are represented as probability distributions over terms. Although such models are theoretically sound, their high computational complexity makes them difficult to use on very large collections. In this work we propose an alternative topic modeling paradigm based on a simpler representation of topics as overlapping clusters of semantically similar documents, which can take advantage of highly scalable clustering algorithms. Our Query-based Topic Modeling framework (QTM) is an information-theoretic method that assumes the existence of a “golden” set of queries that captures most of the semantic information of the collection and produces models with maximum “semantic coherence”. QTM was designed with scalability in mind and was executed in parallel using a MapReduce implementation; we also present complexity measures that support our scalability claims. Our experiments show that QTM can produce models of comparable or even superior quality to those produced by state-of-the-art probabilistic methods.
