SETM: A large-scale interactive visualisation of papers based on semantic embedding and topic modelling

Abstract

With the rapid advancement of new technologies, the number of academic papers in various fields has grown exponentially, making traditional keyword-based search methods insufficient to capture semantic information comprehensively. This article presents a method called SETM, which combines Text-embedding-ada-002, BERTopic and PageRank models to perform semantic extraction, topic modelling and ranking on large-scale academic papers. UMAP is used for dimensionality reduction and visualisation, revealing semantic relationships between papers. The system employs a tree-like layout combined with multi-level interactive features, allowing users to conveniently explore and retrieve papers, from a coarse-grained ‘paper galaxy’ view to fine-grained individual papers. Experimental results demonstrate that the proposed approach achieves high accuracy in semantic mining and clustering, providing an effective tool for dynamic visualisation and analysis of large-scale academic papers.
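The abstract names PageRank as the ranking component but gives no implementation details. As a rough illustration only, the following is a minimal sketch of PageRank by power iteration on a small citation graph. The function name `pagerank`, the toy graph, the damping factor of 0.85, and the uniform handling of papers without outgoing citations are all assumptions made for this example and are not taken from the article.

```python
def pagerank(citations, damping=0.85, tol=1e-10, max_iter=200):
    """Power-iteration PageRank.

    `citations` maps each paper id to the list of paper ids it cites
    (a hypothetical input format chosen for this sketch).
    """
    # Collect every paper that appears as a citing or a cited node.
    papers = sorted(set(citations) | {p for cited in citations.values() for p in cited})
    n = len(papers)
    rank = {p: 1.0 / n for p in papers}

    for _ in range(max_iter):
        # Rank held by papers that cite nothing is spread uniformly (assumption).
        dangling = sum(rank[p] for p in papers if not citations.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in papers}

        # Each citing paper passes an equal share of its rank to the papers it cites.
        for citing, cited_list in citations.items():
            if not cited_list:
                continue
            share = damping * rank[citing] / len(cited_list)
            for cited in cited_list:
                new_rank[cited] += share

        converged = sum(abs(new_rank[p] - rank[p]) for p in papers) < tol
        rank = new_rank
        if converged:
            break
    return rank


# Toy citation graph: A and B cite C, and C cites D.
toy_citations = {"A": ["C"], "B": ["C"], "C": ["D"], "D": []}
scores = pagerank(toy_citations)
ranking = sorted(scores, key=scores.get, reverse=True)
```

In this toy graph, D and C rank above A and B, which receive no citations. In a pipeline like the one the abstract describes, such scores could be used to order papers within a topic, though how SETM combines them with the topic model is not stated here.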

Keywords

BERTopic, PageRank, Text-embedding-ada, visualisation

