Abstract
Topic modelling is an important technique for extracting meaningful insights from large volumes of unstructured text data. This paper presents a federated technique for topic modelling based on a novel adaptation of the Latent Dirichlet Allocation (LDA) method. The proposed approach enables a topic model to be built in a distributed environment from data continually generated at multiple sources, without sharing the actual data. In the first iteration, unsupervised LDA is run on each device that generates data. The per-device results are aggregated at a central server to produce a set of seed words, which are then used for guided LDA in subsequent iterations of topic modelling. The proposed approach, Federated LDA (F-LDA), has been evaluated on two datasets: a text dataset of dialogues between patients and doctors based on factual conversations, and a dataset of tweets related to depression. Compared with centralized LDA, F-LDA achieves higher coherence and diversity scores, indicating that it produces more interpretable topics covering a wide range of themes without redundancy.
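The server-side aggregation step described in the abstract can be sketched as follows. This is a minimal illustration under assumptions the abstract does not specify: topics from different devices are matched by index, and seed words are chosen by how often a word appears among a topic's top terms across devices. The function and parameter names (`aggregate_seed_words`, `seeds_per_topic`) are hypothetical, not from the paper.

```python
from collections import Counter

def aggregate_seed_words(client_topics, seeds_per_topic=3):
    """Aggregate per-device topic top-words into seed words per topic.

    client_topics: list over devices; each entry is a list over topics,
    where each topic is the list of its top words from that device's
    unsupervised LDA run. Topics are assumed aligned by index, an
    illustrative simplification; the paper's alignment step is not
    described in the abstract.
    """
    num_topics = len(client_topics[0])
    seeds = []
    for t in range(num_topics):
        # Count how many devices surfaced each word for topic t.
        counts = Counter(w for device in client_topics for w in device[t])
        seeds.append([w for w, _ in counts.most_common(seeds_per_topic)])
    return seeds

# Example: three devices, two topics each (toy vocabulary).
device_a = [["pain", "doctor", "sleep"], ["sad", "tired", "alone"]]
device_b = [["doctor", "pain", "fever"], ["sad", "alone", "hopeless"]]
device_c = [["pain", "fever", "doctor"], ["tired", "sad", "anxious"]]

seeds = aggregate_seed_words([device_a, device_b, device_c])
```

The resulting seed-word lists would then be distributed back to the devices as priors for the guided-LDA rounds.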