Conceptualized phrase clustering with distributed k-means

Abstract

The vast majority of text mining and machine learning algorithms, such as topic models, classification, and clustering, are based on statistical methods and therefore do not consider the semantics or meaning of words and phrases. Interpreting the outputs of such algorithms is difficult for humans because sufficient contextual information is absent. Distributional semantics is a relatively new but active research area in natural language processing that quantifies semantic similarities between linguistic elements based on the contexts in which they occur. Conceptualization algorithms, on the other hand, enrich short text such as words and phrases. This paper proposes an approach that uses a map-reduce framework to combine these two techniques and generate conceptualized semantic clusters of phrases using distributional representations. Rigorous and systematic experiments on unstructured text datasets show that this approach can generate semantically rich and human-interpretable concept clusters from large datasets. Further, the approach is scalable when dealing with high-dimensional data, since it uses a map-reduce based framework for clustering.
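To make the map-reduce clustering idea concrete, the following is a minimal sketch of one way k-means can be split into a map step (assign each vector to its nearest centroid) and a reduce step (average the vectors per centroid). It is not the implementation evaluated in the paper: the function names, the toy 2-D vectors standing in for phrase embeddings, the choice of Euclidean distance, and the in-process simulation of partitions are all assumptions made for illustration.

```python
from collections import defaultdict
from math import dist  # Euclidean distance (Python 3.8+); the metric is an assumption


def map_phase(partition, centroids):
    """Mapper (hypothetical): emit (nearest centroid index, (vector, 1)) per point."""
    for vec in partition:
        nearest = min(range(len(centroids)), key=lambda i: dist(vec, centroids[i]))
        yield nearest, (vec, 1)


def reduce_phase(pairs, centroids):
    """Reducer (hypothetical): sum vectors and counts per centroid, return new means."""
    sums = {}
    counts = defaultdict(int)
    for idx, (vec, n) in pairs:
        if idx in sums:
            sums[idx] = [a + b for a, b in zip(sums[idx], vec)]
        else:
            sums[idx] = list(vec)
        counts[idx] += n
    # A centroid that received no points keeps its previous position.
    return [
        tuple(s / counts[i] for s in sums[i]) if i in sums else centroids[i]
        for i in range(len(centroids))
    ]


def kmeans_mapreduce(partitions, centroids, iterations=10):
    """Repeat map and reduce until the centroids stop changing or iterations run out."""
    for _ in range(iterations):
        # In a real cluster each partition would be mapped on a separate worker.
        pairs = [kv for part in partitions for kv in map_phase(part, centroids)]
        updated = reduce_phase(pairs, centroids)
        if updated == centroids:
            break
        centroids = updated
    return centroids


# Toy 2-D vectors standing in for phrase embeddings, split over two partitions.
partitions = [
    [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0)],
    [(1.0, 0.0), (1.0, 1.0), (10.0, 11.0)],
]
centroids = kmeans_mapreduce(partitions, [(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```

Because the reducer only needs per-centroid sums and counts, each partition can be processed independently, which is what lets this style of k-means scale across machines.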

Keywords

Distributional semantics, concept extraction, semantic clustering, map-reduce, text mining
