The research on text clustering based on LDA joint model

Abstract

This paper proposes a clustering algorithm that combines the LDA (Latent Dirichlet Allocation) probabilistic topic model with the VSM (Vector Space Model), adopting a three-tier framework of text, topic, and feature word. Although LDA alone can uncover hidden topic knowledge, its low-dimensional representation struggles to preserve the full information in a text, which limits its capacity to distinguish between texts. The paper therefore performs cluster analysis in terms of both feature words and topics by integrating the two models. A suitable mix of LDA and VSM improves the clustering effect, and the optimal cluster number K of the K-means algorithm and the optimal topic number T of the LDA model are determined at the same time. To assess the algorithm, the silhouette coefficient and the Dunn coefficient are used as evaluation measures.

Keywords

Text cluster, LDA model, K-means algorithms, VSM model, silhouette coefficient, Dunn coefficient
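
The idea summarized in the abstract, representing each text by both its feature words and its topics and then scoring a clustering with the silhouette coefficient, can be illustrated with a minimal sketch. The abstract does not state how the two models are mixed, so the weighted concatenation below, the weight `lam`, and the function names are assumptions made for illustration only. The toy TF-IDF vectors and topic distributions are hypothetical values, not output from a fitted LDA model.

```python
import math


def joint_vector(tfidf, theta, lam=0.5):
    """Concatenate a text's VSM (TF-IDF) vector and its LDA topic
    distribution, weighted by lam and (1 - lam).
    The weighting scheme is an assumption of this sketch."""
    return [lam * w for w in tfidf] + [(1.0 - lam) * p for p in theta]


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def silhouette(points, labels):
    """Mean silhouette coefficient, s(i) = (b - a) / max(a, b), where a is the
    mean distance to the point's own cluster and b is the smallest mean
    distance to any other cluster."""
    clusters = {}
    for idx, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(idx)
    if len(clusters) < 2:
        raise ValueError("silhouette needs at least two clusters")
    scores = []
    for i, lab in enumerate(labels):
        own = [j for j in clusters[lab] if j != i]
        if not own:
            scores.append(0.0)  # usual convention for a singleton cluster
            continue
        a = sum(euclidean(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(euclidean(points[i], points[j]) for j in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        scores.append((b - a) / max(a, b) if max(a, b) > 0 else 0.0)
    return sum(scores) / len(scores)


# Hypothetical toy data: four texts, three feature words, two topics.
tfidf = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0], [0.0, 0.1, 0.9]]
theta = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9], [0.2, 0.8]]
docs = [joint_vector(v, t) for v, t in zip(tfidf, theta)]

good = silhouette(docs, [0, 0, 1, 1])  # groups similar texts together
bad = silhouette(docs, [0, 1, 0, 1])   # splits similar texts apart
print(good > bad)
```

In a fuller pipeline, one plausible use of this score would be to run K-means on the joint vectors for several candidate values of K and T and keep the pair with the highest silhouette coefficient. That selection loop is likewise an assumption about how the evaluation could be organized, not a description of the paper's procedure.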
