A novel fuzzy k-means latent semantic analysis (FKLSA) approach for topic modeling over medical and health text corpora

Abstract

Medical and health text documents are difficult to manage, and retrieving the relevant and meaningful documents among them is a challenge. Automatic retrieval of significant knowledge, with a better understanding of medical and health documents, is a challenging task. One popular approach for thematically understanding medical and health text documents and finding the topics in them is topic modeling. In this research, we propose a novel topic modeling approach, fuzzy k-means latent semantic analysis (FKLSA), which uses fuzzy clustering. Our method generates local and global term frequencies through the bag-of-words (BOW) model. Principal component analysis is used to remove the negative impact of high dimensionality on global term weighting. Previous work shows that redundancy in medical and health documents has a negative impact on the quality of text mining. Therefore, the main achievement of FKLSA is that it handles the redundancy issue in medical and health text documents and discovers semantically more precise topics. FKLSA is used to find the themes in a medical and health text corpus. These topics can be further used for text classification and clustering tasks in text mining. Experimental results show that FKLSA performs better than LDA and RedLDA on redundant corpora. FKLSA's time performance is also stable as the number of topics increases, and is thus better than that of LDA and LSA on a large Twitter health dataset. Quantitative evaluations on a real-world dataset of health and medical documents show that FKLSA gives higher performance than state-of-the-art topic models such as latent Dirichlet allocation (LDA) and latent semantic analysis (LSA).
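The pipeline summarized in the abstract (bag-of-words term weighting followed by fuzzy clustering) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `bow_tfidf`, `fuzzy_kmeans`, and `top_terms` are hypothetical, inverse document frequency is assumed as the global term weight, the first k documents are assumed as initial centers for reproducibility, and the principal component analysis step is omitted to keep the example dependency-free.

```python
import math
from collections import Counter


def bow_tfidf(docs):
    """Bag-of-words matrix: local term frequency times a global weight."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({w for toks in tokenized for w in toks})
    n_docs = len(docs)
    # Global weight: smoothed inverse document frequency (an assumption).
    df = Counter(w for toks in tokenized for w in set(toks))
    idf = {w: math.log(n_docs / df[w]) + 1.0 for w in vocab}
    rows = []
    for toks in tokenized:
        tf = Counter(toks)  # local term frequency within one document
        rows.append([tf[w] * idf[w] for w in vocab])
    return vocab, rows


def _dist(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def _memberships(points, centers, m):
    """Standard fuzzy k-means membership update with fuzzifier m > 1."""
    power = 2.0 / (m - 1.0)
    memberships = []
    for p in points:
        d = [_dist(p, c) for c in centers]
        zeros = [j for j, dj in enumerate(d) if dj == 0.0]
        if zeros:
            # A point sitting exactly on a center belongs fully to it.
            row = [1.0 / len(zeros) if j in zeros else 0.0
                   for j in range(len(centers))]
        else:
            row = [1.0 / sum((dj / dc) ** power for dc in d) for dj in d]
        memberships.append(row)
    return memberships


def fuzzy_kmeans(points, k, m=2.0, iters=100, tol=1e-9):
    """Return (centers, memberships); each membership row sums to 1."""
    # Deterministic initialization: the first k points (an assumption).
    centers = [list(p) for p in points[:k]]
    u = _memberships(points, centers, m)
    dim = len(points[0])
    for _ in range(iters):
        new_centers = []
        for j in range(k):
            w = [u[i][j] ** m for i in range(len(points))]
            total = sum(w)
            new_centers.append(
                [sum(wi * p[t] for wi, p in zip(w, points)) / total
                 for t in range(dim)]
            )
        shift = max(_dist(a, b) for a, b in zip(centers, new_centers))
        centers = new_centers
        u = _memberships(points, centers, m)
        if shift < tol:
            break
    return centers, u


def top_terms(center, vocab, n=3):
    """Terms with the largest weight in a cluster center, read as a topic."""
    ranked = sorted(range(len(vocab)), key=lambda t: center[t], reverse=True)
    return [vocab[t] for t in ranked[:n]]


if __name__ == "__main__":
    # Toy corpus, invented purely for illustration.
    docs = [
        "diabetes insulin glucose insulin",
        "cancer tumor chemotherapy tumor",
        "glucose diabetes insulin diet",
        "tumor cancer radiation chemotherapy",
    ]
    vocab, matrix = bow_tfidf(docs)
    centers, u = fuzzy_kmeans(matrix, k=2)
    for j, center in enumerate(centers):
        print("topic", j, top_terms(center, vocab))
    for doc, row in zip(docs, u):
        print([round(x, 3) for x in row], doc)
```

Unlike hard k-means, each document receives a graded membership in every cluster, so a document that mixes themes can contribute to more than one topic.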

Keywords

Topic modeling, bag-of-words, term weighting, fuzzy k-means, principal component analysis

