An integrated LDA and SciBERT Model for Topic Analysis: A Case Study of Network Science

Abstract

Topic analysis facilitates the identification of knowledge flows and emerging trends in scientific research. Traditional models such as LDA generate interpretable topic-term distributions but lack deep semantic representation, while pretrained language models like SciBERT encode rich semantics but offer limited topic interpretability. To address this gap, this study proposes an integrated LDA and SciBERT model for topic analysis. Firstly, the terms of each topic are identified by the LDA model, capturing the underlying statistical information of topic-term associations. Secondly, the SciBERT model is used to obtain semantically similar words for these terms, complementing the statistical topic information with enriched semantic context and reducing semantic information loss, thereby facilitating the extraction of fine-grained topic features and enhancing interpretability. Thirdly, a popularity index and a relevance index are proposed to analyze topic characteristics and domain evolution from both static and dynamic perspectives. Empirical results on network science data show that the proposed model produces semantically rich topics, facilitating understanding of a diverse range of applications and key issues in network science, and reveals an evolutionary trend of the domain from development to maturity. This research will help researchers improve their understanding of topic analyses within their disciplinary fields and promote innovation and the exchange of scientific knowledge.

Keywords

topic analysis; topic interpretability; LDA; SciBERT; network science

Introduction

Topic analysis in scientific domains is crucial for identifying emerging topics, trends, and knowledge transfer, thereby highlighting hot spots within a discipline (Jiang et al., 2021; Li et al., 2015). Methods such as co-word analysis, co-citation network analysis, and topic models play a role in illuminating the landscape of academic research. Many studies have combined these approaches to explore patterns of knowledge structure and topic evolution within subject fields (Nakazawa et al., 2015; Su & Lee, 2010; Zheng et al., 2025). However, co-word analysis focuses only on terms appearing in the same paper without considering the relationships between different papers (Wang et al., 2012), and co-citation network analysis fails to capture the latest dynamics in topic evolution (Liu et al., 2013). Topic models, particularly Latent Dirichlet Allocation (LDA), have been widely used in topic analysis since they were proposed (Jeong & Song, 2014). LDA typically identifies overarching topics and analyzes word frequency distributions within documents through statistical approaches, thus facilitating deeper insights into topics. However, such topic models, by relying solely on the statistical information of topic-term associations, overlook deep semantic and contextual topic information.

In contrast, deep learning techniques, such as Word2Vec (Mikolov et al., 2013) and BERT (Devlin et al., 2019), effectively improve semantic understanding for topic analysis by learning word embeddings that reflect semantic and contextual information. In recent years, several studies have delved into the application of word embedding techniques for enhancing topic semantic understanding (Gao et al., 2022; Wang et al., 2024b; J. Zhang et al., 2023), as well as contributing to the study of topic evolution in scientific publications (Xie et al., 2020). Previous studies combined topic modeling with word embedding technologies to process feature vectors and employed dimensionality reduction techniques to improve the effectiveness of the study. However, dimensionality reduction techniques, commonly used to project high-dimensional word embeddings into lower-dimensional spaces, often lead to the loss of fine-grained semantic information (Ray et al., 2021). In addition, dimensionality reduction typically optimizes for global data distribution rather than preserving topic-specific structure among terms. As a result, terms belonging to the same topic may not cluster together in the reduced space, while words from different topics may overlap, thereby limiting improvements to topic interpretability.

In this study, we introduce a novel approach that enhances topic interpretability and obtains fine-grained topic features by extracting similar words of topic terms within a context-specific semantic space, and we propose a popularity index (PI) and a relevance index (RI) to explore the multidimensional features of topics. Firstly, we use the LDA model (Blei et al., 2003) to generate document-topic and topic-term probability distributions. Although recent models such as the contextualized topic model and BERTopic have emerged (Bianchi et al., 2020; Wang et al., 2024a), we adopt LDA as the base model due to its probabilistic generative framework, which ensures stable and reproducible topic modeling. Moreover, its clear and interpretable topic structures help enhance the consistency and controllability of subsequent semantic enrichment, enabling an effective combination of interpretability and semantic richness in topic representations. Secondly, the SciBERT model (Beltagy et al., 2019) is used to identify similar words for each term in each topic within their context-specific semantic spaces, thus enriching the semantic context and reducing semantic information loss, thereby obtaining fine-grained topic features and enhancing topic interpretability. Thirdly, we propose PI, based on growth rate, and RI, which represents differentiation based on cosine similarity, to explore the multidimensional features of topics and classify the topics into four categories: (a) Widely Followed-Distinct Topics, (b) Widely Followed-General Topics, (c) Less Followed-Distinct Topics, and (d) Less Followed-General Topics. Further details on these categories and the classification methodology are provided in Section 3.3.3. By analyzing topics and their categories over time, we provide a dynamic perspective on the trends within disciplines, thereby enhancing our understanding of disciplines. To evaluate our approach, we conducted an empirical analysis with a bibliographic dataset from network science.

To sum up, the main contributions of this study can be summarized as follows:

(1) We integrate LDA and SciBERT models to mine the fine-grained features of topics by extracting similar words and terms to expand the topic semantics, improve the understanding of topics, increase the description dimension, and enrich the interpretability of topics.

(2) We propose PI based on growth rate and RI representing differentiation based on cosine similarity to explore the multidimensional features of topics from a dynamic perspective.

(3) We conduct an empirical study on a dataset in network science collected from the Web of Science to evaluate the effectiveness of the proposed method and discover the features of the distribution of topics in network science and their evolutionary trends.

The structure of this paper is as follows: In the next section, a brief review of the related work is given. The Methodology section delineates the methodological detail, which includes the development and parameter optimization of the LDA model, the integration of the LDA and SciBERT models, and methods for exploring multidimensional features of topics. The Empirical Studies and Results section reports the empirical results and analysis. The Discussion section explores further points that this study can trigger. Finally, the Conclusion section concludes this study's main contributions, limitations, and directions for future research.

Related Work

This literature review explores three aspects of topic analysis: topic analysis based on LDA, word embedding techniques, and integration of two kinds of methods for topic analysis.

Topic Analysis Based on LDA

Topic modeling is a cornerstone in text mining, data mining, latent data discovery, and explaining relationships within datasets and textual documents (Jelodar et al., 2019). Among various topic models, LDA (Blei et al., 2003) remains one of the most widely used and effective approaches due to its ability to generate explicit and interpretable topic-term distributions (Vayansky & Kumar, 2020). The LDA model facilitates topic analysis by identifying frequently co-occurring words to represent a document’s topics (Xie et al., 2020).

Existing studies usually use the LDA model in two main ways: as a foundational tool to support further research, and as a direct method for analyzing topic structures and dynamics. As a foundational tool, Park et al. (2025) applied the LDA model for topic modeling in the blockchain domain, generating topic distributions across different document types, and further integrating these results through clustering to provide a foundation for subsequent time-series forecasting of topic trends. Liu et al. (2025) applied the LDA model to identify topics in funding texts and used cosine similarity to analyze the relationships between topics. Xiong et al. (2019) applied the LDA model to provide an in-depth analysis of topics in the manufacturing industry, identifying subfield topics that provide valuable insights for researchers. Farea et al. (2024) applied the LDA model in the field of sustainable energy research, analyzing optimal topic number selection, topic interpretability, and temporal evolution trends by systematically evaluating perplexity and coherence scores and exploring the long-term evolution patterns of research topics. As a direct method, Kukreja (2024) employed the LDA model to explore topic structures and research trends within the field of comic recognition, providing a comprehensive overview of its evolving research landscape. However, limited by the Bag-of-Words assumption, LDA struggles to capture contextual semantics (Yang et al., 2019), prompting the use of word embedding techniques for richer semantic representation.

Word Embedding Techniques

Different words in scientific literature may represent similar semantic concepts (Gao et al., 2022). Word embedding techniques typically have the advantage of capturing richer semantic information. Mikolov et al. (2013) introduced Word2Vec to generate word embeddings efficiently and further reveal complex semantic relationships between words. However, this model ignores the fact that the same words may have multiple semantics in different contexts. To enable deeper semantic understanding, Devlin et al. (2019) proposed Bidirectional Encoder Representations from Transformers (BERT). This model achieves enhanced language understanding by capturing the diverse semantics of words in their respective contexts, showing superior performance on multiple language understanding tasks. Several variants of BERT have been developed to satisfy the specific needs of different domains. For example, the Bio-BERT (Lee et al., 2019) targets biomedical texts, optimizing understanding of the corresponding domains. BERT-large adapts to complex linguistic features through a larger number of training parameters. The SciBERT model, specially designed for the scientific domain, was proposed by Beltagy et al. (2019). It was pre-trained on a vast corpus of scientific texts, significantly enhancing the interpretation of scientific terms and concepts. This method notably improves the efficacy of text analysis and topic information extraction, which can serve as an important complement to traditional topic models.

Integration of the Topic Models and Word Embedding Techniques

The above-mentioned review clarifies the respective roles of topic models and word embedding techniques in topic analysis. Traditional topic models primarily depend on word frequency statistics and often fail to capture the semantics of words and the complexity of their relationships within context. To overcome this limitation, researchers are increasingly integrating word embedding techniques into topic models.

Some studies have validated the effectiveness of integrating topic models and word embedding techniques for topic analysis. For instance, Wang et al. (2024b) employed a fine-tuned BERT model to extract key informative sentences and applied noiseless LDA to generate interpretable topics from the cleaned corpus. Venugopalan and Gupta (2022) employed LDA and BERT to introduce an enhanced word embedding method for topic identification. This method integrates topic distribution, semantic knowledge, and syntactic structure, effectively categorizing news topics and proving its efficiency and accuracy across extensive text datasets. Subsequently, Zhou et al. (2022) developed a topic clustering model that employs BERT-LDA joint embeddings, focusing on exploring contextual semantics and narrative coherence to generate unique and relevant topic words, thereby improving insights into topics. Wang et al. (2024b) introduced the BERTopic model, specifically addressing topic analysis and evolution issues within interdisciplinary fields, using library and information science as a case study and identifying two distinct types of interdisciplinary topics within the discipline. Building on this, Benz et al. (2025) compared LDA and BERTopic on a large corpus of biology articles to examine differences in topic space exploration and their practical implications. They further showed that topic modeling can provide a valuable basis for understanding the semantic structure of scientific fields when combined with in-depth domain knowledge of the research object. These studies highlight the significant potential of combining topic models with word embedding technologies for topic and keyword extraction, providing valuable insights and references for topic analysis research in scientific domains.

Based on a summary of previous work, we found that studies combining these two techniques primarily focused on processing vector features and did not perform topic semantic mining based on the corresponding text corpus, which inevitably results in semantic information loss. Unlike previous studies, we combine LDA and SciBERT models to capture the fine-grained features of topics by generating similar words for different terms within the contextual environments of the topics. This approach enhances the semantic information and interpretability of the topics and reduces the semantic information loss associated with vector feature processing.

Methodology

The purpose of this section is to introduce the research framework that is illustrated in Figure 1. This section is divided into three modules: (a) LDA and parameter settings; (b) integrated model of SciBERT and LDA; and (c) exploring the multidimensional features of topics.

Figure 1.

Framework of this study.

LDA and Parameter Settings

The LDA model employs probabilistic distributions to discern document topics. In this study, LDA is primarily used to provide interpretable initial topic representations, including the document-topic and topic-term probability distribution vectors, which serve as the foundation for the subsequent tasks. Figure 2 illustrates the LDA graphical model, detailing the text generation process. Each document d is allocated a topic distribution $θ_{d}$ drawn from $Dir (α)$ . For each word in the document, a specific topic $z_{n}$ is selected from $θ_{d}$ . Subsequently, a word $w_{n}$ is drawn from the word distribution associated with $z_{n}$ , which is itself derived from $Dir (β)$ . This procedure is applied to all N words in the document and to every document in the corpus D. The joint probability of LDA is given by:

\begin{matrix} p (θ, z, w | α, β) = p (θ | α) \prod_{n = 1}^{N} p (z_{n} | θ) p (w_{n} | z_{n}, β) \end{matrix}

(1)

Setting the parameters appropriately is essential for configuring the LDA model, particularly the number of topics (k) and the Dirichlet hyperparameters (α and β), which control the distributional sparsity of topics over documents and words over topics. Lower values of α and β reduce the smoothing effect on the corresponding multinomial distributions, resulting in sharper and more distinct topic representations.
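The generative process described above, and the sharpening effect of small α and β, can be illustrated with a minimal sketch. It uses only the Python standard library; the vocabulary size, topic count, document length, and the toy value of β are hypothetical choices for illustration, not settings from this study.

```python
import random

def sample_dirichlet(concentration, size, rng):
    """Draw one sample from a symmetric Dirichlet via normalized Gamma draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]

def generate_document(num_words, topic_word_dists, alpha, rng):
    """Generate one document by following the LDA generative story."""
    num_topics = len(topic_word_dists)
    theta = sample_dirichlet(alpha, num_topics, rng)  # theta_d ~ Dir(alpha)
    topics, words = [], []
    for _ in range(num_words):
        # z_n is drawn from the document's topic distribution theta_d
        z = rng.choices(range(num_topics), weights=theta)[0]
        # w_n is drawn from the word distribution of topic z_n
        vocab_size = len(topic_word_dists[z])
        w = rng.choices(range(vocab_size), weights=topic_word_dists[z])[0]
        topics.append(z)
        words.append(w)
    return theta, topics, words

rng = random.Random(42)
VOCAB_SIZE, NUM_TOPICS, DOC_LENGTH = 8, 3, 20  # hypothetical toy sizes
# One word distribution per topic, each drawn from Dir(beta) with a toy beta of 0.10
phi = [sample_dirichlet(0.10, VOCAB_SIZE, rng) for _ in range(NUM_TOPICS)]
theta, topics, words = generate_document(DOC_LENGTH, phi, alpha=0.10, rng=rng)
```

With concentration values well below 1, most of the probability mass in `theta` and in each row of `phi` tends to fall on a few entries, which is the sparsity effect discussed above.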

Figure 2.

The structure of LDA.

To determine the number of topics and Dirichlet hyperparameters, this study adopts a mixed qualitative-quantitative approach, as recommended by Yu and Xiang (2023). At the qualitative level, a candidate range of topic numbers (k) was identified through manual inspection of topic coherence and interpretability. At the quantitative level, commonly used values of α and β were tested. The $C_{v}$ coherence index was employed as the primary selection criterion for jointly optimizing k, α, and β. $C_{v}$ coherence, which integrates an indirect cosine measure, normalized pointwise mutual information (NPMI), and a boolean sliding window, demonstrates superior performance compared to UCI and UMass coherence measures (Röder et al., 2015). While UCI and UMass rely heavily on corpus-based co-occurrence statistics, $C_{v}$ leverages semantic similarity among top words, enhancing its interpretability and robustness. Therefore, we use coherence score analysis to select the number of topics and Dirichlet hyperparameters for the LDA model.
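To make one ingredient of $C_{v}$ concrete, the following minimal sketch computes NPMI for pairs of top words. It is not the full $C_{v}$ measure: it omits the indirect cosine step and, as a simplifying assumption, treats each whole document as the boolean window instead of sliding a fixed-size window. The toy documents and function names are hypothetical.

```python
import math
from itertools import combinations

def npmi(word_a, word_b, documents):
    """NPMI of two words from boolean document-level co-occurrence."""
    n_docs = len(documents)
    p_a = sum(word_a in doc for doc in documents) / n_docs
    p_b = sum(word_b in doc for doc in documents) / n_docs
    p_ab = sum(word_a in doc and word_b in doc for doc in documents) / n_docs
    if p_ab == 0.0:
        return -1.0  # the words never co-occur: lower bound of NPMI
    if p_ab == 1.0:
        return 1.0   # both words occur everywhere; avoids dividing by log(1) = 0
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)

def mean_pairwise_npmi(top_words, documents):
    """Average NPMI over all pairs of a topic's top words."""
    pairs = list(combinations(top_words, 2))
    return sum(npmi(a, b, documents) for a, b in pairs) / len(pairs)

# Toy corpus: each document is represented as a set of tokens
docs = [
    {"network", "graph", "node"},
    {"network", "graph"},
    {"epidemic", "spreading"},
    {"epidemic", "spreading", "network"},
]
coherent_score = mean_pairwise_npmi(["network", "graph"], docs)
mixed_score = mean_pairwise_npmi(["graph", "epidemic"], docs)
```

Words that tend to appear together receive scores closer to 1, so a topic whose top words co-occur scores higher than one that mixes unrelated terms.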

Integrated Model of LDA and SciBERT

To address the semantic limitations observed in LDA model outputs, this study integrates the SciBERT (“allenai/scibert_scivocab_uncased”) model. The integration is aimed at enriching the semantics of the extracted topics. The workflow, illustrated in Figure 3, unfolds in four steps.

Figure 3.

LDA with SciBERT workflow.

Step 1: Topic and terms extraction

We used the LDA model to extract k topics from the cleaned corpus. For clarity, we label the individual topics as $T_{k}$ and the set of topics as [ $T_{1}$ , $T_{2}$ , $T_{3}$ , …, $T_{k}$ ]. Each topic contains n terms; the individual terms are labeled $k_{n}$ and the set of terms [ $k_{1}$ , $k_{2}$ , $k_{3}$ , …, $k_{n}$ ].

Step 2: Semantic space generation

After mapping the relevant documents back to the original dataset, we identified the document collection corresponding to each topic, denoted as [ $d_{11}$ , $d_{12}$ , $d_{13}$ , …, $d_{kn}$ ]. The abstracts from each document collection were then fed into the SciBERT model to generate the contextual semantic space that is context-specific for each topic, rather than derived from the entire corpus.

Step 3: Word embedding vector generation

Given the topic terms $k_{n}$ obtained in Step 1, we input each $k_{n}$ into the SciBERT model to generate the corresponding word embedding vector. Leveraging the advanced semantic comprehension capabilities of SciBERT, we can accurately capture the semantic information of each term for subsequent tasks.

Step 4: Similar words extraction

Cosine similarity is a widely implemented index in information retrieval and semantic similarity studies. It measures the angle between vectors and is insensitive to their magnitudes. This makes it more effective than distance-based indices, as cosine similarity better captures semantic relationships in high-dimensional embedding spaces by focusing on vector orientation rather than magnitude (Reimers & Gurevych, 2019). It is defined as:

\begin{matrix} CosineSimilarity (A, B) = \frac{A \cdot B}{∥ A ∥ ∥ B ∥} \end{matrix}

(2)

Specifically, we map the word embedding vector of each topic term into its corresponding contextual semantic space. For each topic term $k_{n}$ , we compute the cosine similarity between its vector and the vectors of all candidate words within this space. We sort these candidate words in descending order based on their similarity scores, and extract the top-5 words ( $k_{s 1}, \dots, k_{s 5}$ ) as semantic expansions of the term. Consequently, the set of similar words for each topic term can be expressed as follows:

\begin{matrix} k_{n} = [k_{s 1}, k_{s 2}, k_{s 3}, k_{s 4}, k_{s 5}] \end{matrix}

(3)
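The ranking in Step 4 can be sketched as follows. The three-dimensional vectors are hypothetical stand-ins for SciBERT embeddings, which in practice would be much higher-dimensional and derived from the topic's document collection, and the function names are illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors, as in Equation (2)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention for zero vectors, which have no direction
    return dot / (norm_a * norm_b)

def top_similar_words(term, term_vector, candidate_vectors, top_n=5):
    """Rank candidate words by cosine similarity to the term and keep the top_n."""
    scored = [
        (word, cosine_similarity(term_vector, vector))
        for word, vector in candidate_vectors.items()
        if word != term  # a term is not its own expansion
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

# Hypothetical toy embeddings for one topic's semantic space
candidates = {
    "graph": [0.9, 0.1, 0.0],
    "node": [0.8, 0.2, 0.1],
    "edge": [0.7, 0.3, 0.0],
    "topology": [0.6, 0.1, 0.3],
    "link": [0.5, 0.5, 0.0],
    "virus": [0.0, 0.1, 0.9],
    "protein": [0.1, 0.0, 0.8],
}
similar = top_similar_words("network", [1.0, 0.1, 0.0], candidates)
similar_words = [word for word, _ in similar]
```

Because cosine similarity ignores vector length, a candidate is ranked by how closely its direction matches that of the term, not by the magnitude of its embedding.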

Exploring the Multidimensional Features of Topics

Topic Interpretability Enhancement

The outcomes of the LDA model capture the fundamental essence of a topic but fall short of providing deeper insights into it. Because topic terms usually carry the most important information in a topic, we treat them as a concentration of the topic's core concepts and theoretical foundations. After similar word extraction for topic terms, described in Section 3.2, we can take the following three perspectives to enhance the interpretability of the topics:

(a) Extending the semantic scope: By incorporating similar words retrieved from the contextual embedding space, we expand beyond the top LDA terms to include semantically related concepts, thereby broadening the coverage of topic semantics.

(b) Improving conceptual understanding of topics: The inclusion of similar words provides additional context and nuances for each topic, helping to clarify ambiguous terms and reinforce the core conceptual structure of the topic.

(c) Adding dimension to the topic description: The similar words enable a multi-faceted representation of each topic by introducing diverse expressions and viewpoints, enriching the overall topic description and supporting more comprehensive topic interpretation.

Definition of Popularity Index and Relevance Index

Topic popularity measures the amount of research and attention given to a topic within a subject field, based on how often the topic has been studied over a given period of time (Xu et al., 2021). However, this view defines the popularity of a topic solely by its size, from a static perspective, and cannot distinguish topics that have similar proportions but different temporal trends. To address this limitation, we introduce the concept of topic growth rate in this study and incorporate it into the calculation of topic popularity. In this way, the proposed popularity index (PI) reflects not only the size of a topic in a given year but also whether its attention has increased or decreased over time, making it more informative than traditional size-based indices. The index can be expressed as:

\begin{matrix} {PI}_{k}^{y} = \frac{N_{k}^{y}}{N^{y}} \cdot (1 + g_{k}^{y}) \end{matrix}

(4)

\begin{matrix} g\text{-}PI_{k} = \sum_{y} {PI}_{k}^{y} \end{matrix}

(5)

In our analysis, ${PI}_{k}^{y}$ is calculated as the share of documents $N_{k}^{y}$ assigned to the $k^{th}$ topic among all documents $N^{y}$ in year y, scaled by $(1 + g_{k}^{y})$ , so the topic growth rate $g_{k}^{y}$ directly affects the annual popularity score of the topic. A positive growth rate ( $g_{k}^{y} > 0$ ) indicates an increase in the topic’s popularity, while a negative rate ( $g_{k}^{y} < 0$ ) indicates a decrease. By summing ${PI}_{k}^{y}$ over all years, we obtain the composite popularity score $g\text{-}PI_{k}$ for the topic across the entire span.
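A minimal sketch of Equations (4) and (5) follows. The text does not spell out how the growth rate is computed, so the sketch assumes a year-over-year relative change in the topic's document count and sets the rate to 0 for the first year or when the previous count is 0; both choices, as well as the toy counts, are assumptions for illustration.

```python
def popularity_index(topic_counts, total_counts):
    """Return the annual PI per year and the composite g-PI for one topic.

    topic_counts: dict mapping year -> documents on the topic (N_k^y)
    total_counts: dict mapping year -> documents across all topics (N^y)
    """
    annual = {}
    previous = None
    for year in sorted(topic_counts):
        count = topic_counts[year]
        # Assumed growth rate: relative change from the previous year,
        # taken as 0 when no previous year (or a zero count) is available.
        growth = 0.0 if not previous else (count - previous) / previous
        annual[year] = (count / total_counts[year]) * (1.0 + growth)
        previous = count
    return annual, sum(annual.values())

# Hypothetical toy counts for a single topic over three years
topic_counts = {2020: 10, 2021: 15, 2022: 12}
total_counts = {2020: 100, 2021: 120, 2022: 120}
annual_pi, g_pi = popularity_index(topic_counts, total_counts)
```

In this toy case the topic's share rises from 0.100 to 0.125 in 2021, and the 50% growth lifts the 2021 score to 0.1875; in 2022 the share is 0.100 again, but the 20% decline pulls the score down to 0.08, which a purely size-based index would not distinguish from 2020.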

To better reflect the differences between topics, we propose a relevance index (RI), which is constructed as an inverse measure of cosine similarity to characterize topic relationships in a new way. The cosine similarity computed from embedding representations provides a fine-grained and semantically consistent basis for measuring topical relationships (Yu & Xiang, 2024). The index can be expressed as:

\begin{matrix} {Diff}_{k} = \frac{1}{N - 1} \sum_{\substack{i = 1 \\ i \neq k}}^{N} [1 - Sim (k, i)] \end{matrix}

(6)

\begin{matrix} {RI}_{k} = 1 / {Diff}_{k} \end{matrix}

(7)

${Diff}_{k}$ represents the average difference between topic k and all other topics, where N is the number of topics and $Sim (k, i)$ is the cosine similarity between topics k and i. ${RI}_{k}$ is defined as the reciprocal $1 / {Diff}_{k}$ : higher values indicate that a topic is less distinct from the others, and lower values indicate that it is more distinguishable.
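Equations (6) and (7) can be sketched as below, given a precomputed topic-by-topic cosine similarity matrix. The 3 × 3 matrix is a hypothetical toy input, and returning infinity when a topic is identical to all others is a convention assumed here to avoid division by zero.

```python
import math

def relevance_index(similarity):
    """Compute Diff_k and RI_k for every topic from a symmetric similarity matrix."""
    n = len(similarity)
    diffs, ris = [], []
    for k in range(n):
        # Average of (1 - Sim(k, i)) over all other topics i != k
        diff_k = sum(1.0 - similarity[k][i] for i in range(n) if i != k) / (n - 1)
        diffs.append(diff_k)
        # Assumed convention: RI is infinite if the topic matches all others exactly
        ris.append(math.inf if diff_k == 0.0 else 1.0 / diff_k)
    return diffs, ris

# Hypothetical toy similarity matrix for three topics
similarity = [
    [1.0, 0.8, 0.2],
    [0.8, 1.0, 0.4],
    [0.2, 0.4, 1.0],
]
diffs, ris = relevance_index(similarity)
```

Here the third topic has the largest average difference (0.7) and therefore the lowest RI (about 1.43), marking it as the most distinct of the three.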

Topics Discernment of Combining Indices

After calculating the $g\text{-}PI_{k}$ and ${RI}_{k}$ values for each topic, we can study the evolutionary trend of the topic and assess topics from two dimensions. Based on the mean values of $g\text{-}PI_{k}$ and ${RI}_{k}$ , topics are classified into four types:

(a) Widely Followed-Distinct Topics (high $g\text{-}PI_{k}$ , low ${RI}_{k}$ ): topics within the discipline that attract broad interest with higher differentiation, covering more refined research content.

(b) Widely Followed-General Topics (high $g\text{-}PI_{k}$ , high ${RI}_{k}$ ): topics within the discipline that attract broad interest with relatively lower differentiation, covering a wider range of research content.

(c) Less Followed-Distinct Topics (low $g\text{-}PI_{k}$ , low ${RI}_{k}$ ): topics within the discipline that attract less interest with higher differentiation, often reflecting specific research content.

(d) Less Followed-General Topics (low $g\text{-}PI_{k}$ , high ${RI}_{k}$ ): topics within the discipline that attract less interest with relatively lower differentiation, but hold potential value in comprehensive applications.

This classification assists researchers in improving the understanding of topics and the dynamic changes within research domains, offering guidance that can encourage innovation and drive disciplinary progress.
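The four-way classification can be sketched as a comparison of each topic's scores against the two means. The scores below are hypothetical, and treating a value exactly equal to the mean as "high" is an assumption, since the text does not state how ties are handled.

```python
def classify_topics(g_pi, ri):
    """Assign each topic to one of the four categories using the mean thresholds.

    g_pi, ri: dicts mapping topic label -> composite popularity / relevance score
    """
    mean_pi = sum(g_pi.values()) / len(g_pi)
    mean_ri = sum(ri.values()) / len(ri)
    labels = {}
    for topic in g_pi:
        # Assumed tie rule: a score equal to the mean counts as high
        followed = "Widely Followed" if g_pi[topic] >= mean_pi else "Less Followed"
        kind = "General" if ri[topic] >= mean_ri else "Distinct"
        labels[topic] = f"{followed}-{kind} Topics"
    return labels

# Hypothetical scores for four topics
g_pi_scores = {"T1": 0.9, "T2": 0.8, "T3": 0.2, "T4": 0.1}
ri_scores = {"T1": 1.2, "T2": 2.0, "T3": 1.1, "T4": 2.1}
labels = classify_topics(g_pi_scores, ri_scores)
```

Applying the same function to scores computed for successive time windows would give the category of each topic over time, which is the dynamic view the classification is meant to support.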

Empirical Studies and Results

Drawing on network science data, this section provides a comprehensive empirical analysis of the multidimensional features of topics. Building on the methods introduced in Section 3, the analysis is conducted in four steps: (a) dataset construction; (b) discovering the topics in network science; (c) semantic-integrated topic interpretability analysis; and (d) topic feature and trend analysis based on PI-RI.

Dataset

Generally, researchers often select datasets based on topic retrieval using related terms (Chen et al., 2023; Yu et al., 2018). However, due to the highly interdisciplinary nature of network science, it is hard to ensure the identification and use of all relevant terms for retrieval. Two pivotal papers around the turn of the 21st century are recognized as foundational to the emergence of network science: Watts and Strogatz (1998) and Barabási and Albert (1999). The former studied small-world networks, and the latter explored scale-free networks. These two works were recognized as ground-breaking studies in the field of network science (Molontay & Nagy, 2019). Figure 4 shows the citation trends of these two dominant papers, exhibiting a fluctuating but rising pattern, highlighting their enduring significance and continued relevance to researchers in the network science community. This study considers a document relevant to network science if it cites either of these two dominant papers.

Figure 4.

Number of citations of two dominant papers over the years.

Figure 5 illustrates the data collection and preprocessing procedures in this study. The data were collected from the Web of Science core collection, a database widely used across disciplines and an important bibliographic resource for scientific research and academic assessment (Liu et al., 2024; Shah et al., 2015). We then obtained 49,176 articles and proceeding papers, including their titles, abstracts, and publication years from 1998 to 2022, focusing on abstracts as the primary data for this study. To enhance the quality of the data, papers without titles and abstracts were eliminated, and all duplicates were removed, resulting in 35,937 papers. In the preprocessing phase, all special characters, punctuation, HTML tags, and stop words were removed, and words appearing in fewer than 15 papers were also removed to further refine the dataset. After these steps, a cleaned corpus was constructed for subsequent analysis tasks.
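The text-cleaning part of this pipeline can be sketched with the standard library alone. The stop-word list below is a small illustrative subset, not the list used in this study, and the regular expressions are one plausible way to strip HTML tags, punctuation, and special characters.

```python
import re
from collections import Counter

# Illustrative subset only; a full stop-word list would be used in practice
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is", "are", "for", "on", "with"}

def tokenize(text):
    """Lowercase an abstract and strip HTML tags, punctuation, and stop words."""
    text = re.sub(r"<[^>]+>", " ", text)              # remove HTML tags
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())  # remove punctuation and special characters
    return [token for token in text.split() if token not in STOP_WORDS]

def build_corpus(abstracts, min_docs=15):
    """Tokenize abstracts and drop words that appear in fewer than min_docs papers."""
    tokenized = [tokenize(abstract) for abstract in abstracts]
    # Document frequency: count each word at most once per abstract
    doc_freq = Counter(token for doc in tokenized for token in set(doc))
    return [[t for t in doc if doc_freq[t] >= min_docs] for doc in tokenized]

# Toy abstracts; min_docs is lowered to 2 so the filter is visible on three documents
abstracts = [
    "<p>The small-world network model.</p>",
    "Scale-free networks and the network robustness!",
    "Epidemic spreading on a network.",
]
corpus = build_corpus(abstracts, min_docs=2)
```

In this toy run only "network" survives, because it is the only word that appears in at least two abstracts; with the threshold of 15 used in this study, the same filter removes rare words from the full corpus.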

Figure 5.

Data collecting and preprocessing.

Discovering the Topics in Network Science

This section first details the LDA modeling and parameter setting. Subsequently, based on the cleaned corpus, it visualizes the discovered topics using word clouds to delineate the conceptual landscape of network science.

LDA Modeling and Parameter Setting

As described in Section 3.1, this study employed a mixed qualitative-quantitative approach to determine the number of topics and Dirichlet hyperparameters.

At the qualitative level, we initially assessed a set of candidate topic numbers, specifically k ∈ {10, 20, 30, 40, 50}. Based on manual interpretation of topic coherence and distinctiveness, the range was narrowed to k ∈ [15, 30]. At the quantitative level, within this refined k range, we tested commonly adopted Dirichlet hyperparameter combinations with α ∈ {0.10, 0.15, 0.20, 0.25} and β ∈ {0.005, 0.010, 0.015, 0.020, 0.025}. The optimal configuration of k, α, and β was determined using the coherence score $C_{v}$ as the principal evaluation index.

The results are presented in Table 1, which reports only the configurations under $k = 21$ , as this setting achieved higher coherence scores than other values of k. Higher coherence scores indicate closer semantic relationships among words and suggest better topic clarity. By combining quantitative and qualitative evidence, $k = 21$ , $α = 0.10$ , and $β = 0.010$ were selected as the optimal parameters in this study.

Table 1.

Results of Parameter Experiments.

α	β	$C_{v}$ coherence	α	β	$C_{v}$ coherence
0.10	0.005	0.4803	0.20	0.005	0.4757
0.10	0.010	0.4849	0.20	0.010	0.4756
0.10	0.015	0.4804	0.20	0.015	0.4769
0.10	0.020	0.4812	0.20	0.020	0.4653
0.10	0.025	0.4732	0.20	0.025	0.4661
0.15	0.005	0.4747	0.25	0.005	0.4671
0.15	0.010	0.4746	0.25	0.010	0.4773
0.15	0.015	0.4738	0.25	0.015	0.4771
0.15	0.020	0.4751	0.25	0.020	0.4682
0.15	0.025	0.4732	0.25	0.025	0.4548

Note. Bold indicates the optimal parameter combination and corresponding result.
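The quantitative selection step amounts to taking the configuration with the highest $C_{v}$ score. The sketch below applies this to the values reported in Table 1 for $k = 21$ ; only the variable names are illustrative.

```python
BETAS = [0.005, 0.010, 0.015, 0.020, 0.025]

# C_v coherence scores from Table 1 (k = 21), one row per alpha, ordered by BETAS
COHERENCE_BY_ALPHA = {
    0.10: [0.4803, 0.4849, 0.4804, 0.4812, 0.4732],
    0.15: [0.4747, 0.4746, 0.4738, 0.4751, 0.4732],
    0.20: [0.4757, 0.4756, 0.4769, 0.4653, 0.4661],
    0.25: [0.4671, 0.4773, 0.4771, 0.4682, 0.4548],
}

# Flatten to {(alpha, beta): score} and pick the best-scoring combination
scores = {
    (alpha, beta): score
    for alpha, row in COHERENCE_BY_ALPHA.items()
    for beta, score in zip(BETAS, row)
}
best_alpha, best_beta = max(scores, key=scores.get)
best_score = scores[(best_alpha, best_beta)]
```

The maximum is reached at α = 0.10 and β = 0.010 with a score of 0.4849.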

In addition, to ensure that performance differences reflect parameter adjustments rather than random fluctuations, we fixed the random seed at 42, which also makes the results reproducible. This approach strengthens the reliability of our findings by minimizing the impact of random variations on the model’s performance. Finally, the number of passes is set to 1000, and the result with the highest a posteriori probability is chosen as the optimal solution. These settings help ensure that the estimated topics are valid and provide a solid foundation for the subsequent semantic and index-based analyses.

Topic Visualization Using Word Clouds

The LDA model identified 21 topics, along with their document-topic and topic-term distributions. The word clouds in Figures 6–8 visually represent the associated terms of each topic. Each word cloud serves as a visual metaphor for a unique topic, and the size of each word in the word cloud reflects its probability within the topic. By analyzing the key terms for each topic and their word cloud distribution features, we identified the labels for each topic.

Figure 6.

Word clouds for Topics 1–8.

Figure 7.

Word clouds for Topics 9–16.

Figure 8.

Word clouds for Topics 17–21.

In network science, network topology structures (T1) play a crucial role in understanding behaviors and features across various network types, serving as the theoretical foundation for complex network studies. There are different types of networks for different fields, such as bio-network analysis (T2), scientific networks (T5), trade networks (T14), network analysis of health data (T15), and brain functional networks (T10), each unveiling the interactions between various entities within complex systems. Recent studies also show that deep learning-based models support anomaly detection and decision making in complex networked systems (e.g., IoT ecosystems and mental health monitoring platforms; Addula et al., 2025; Kumar et al., 2025; Yadulla et al., 2025). Furthermore, peer-to-peer networks (T8) are classified as one kind of complex network, exhibiting high efficiency and robustness in handling large-scale distributed tasks. Mathematical concepts such as degree distribution (T7), link prediction methods (T9), and graph theory (T12) provide the methodological and definitional foundation for these complex networks. Complex network modeling (T17) is a crucial interdisciplinary bridge to virtually all types of networks mentioned previously, characterized by structural properties, dynamic behaviors, multilevel and multiscale attributes, and clustering phenomena. Additionally, the study of network robustness (T6), defined as the capacity of a network to sustain its functionality and services amid various disturbances and failures, is important for understanding and protecting complex networks. This characteristic is crucial across various network types. In most cases, the underlying network structure plays a pivotal role in the system’s survivability against random failures or deliberate attacks. Optimal networks (T4) enhance network robustness by optimizing and upgrading configurations based on specific conditions and constraints. Similar AI-driven optimization ideas have also been applied to supply chain networks, where AI-based demand forecasting is used to improve inventory management and supply chain responsiveness and accuracy (Sajja et al., 2025). Moreover, understanding synchronization and control in complex networks (T16) contributes equally to improving the robustness of networks by elucidating the intrinsic mechanisms of these networks, leading to the development of more efficient and robust networks.

Network science has gained substantial popularity within the field of computer science. This study focuses on neural networks (T11), emphasizing computational and algorithmic foundations rather than biological aspects. Examples include artificial neural network modeling inspired by biological neural networks, graph networks, and other neural network features, which are prominent research topics in computing and artificial intelligence. Evolutionary games (T19) are a tool for studying how individuals influence overall network behavior through strategic interactions in complex networks, often employing agent-based methods to examine game dynamics, with a particular focus on cooperative behaviors in dynamic processes (Szabó & Fath, 2007). In practice, evolutionary game theory has been applied to a variety of scenarios, such as market competition strategies among enterprises and information spreading strategies. Network science aims to build models that replicate the properties of real networks, such as those observed in social network analysis (T18). Random walks on networks (T20) embrace this apparent randomness by constructing and characterizing networks that are inherently random, offering profound insights into strategic decisions and behavioral patterns. For instance, in the epidemic spreading model (T3), social network analysis (T18), and social information diffusion modeling (T21), this approach helps to reveal the dynamics of information spread. However, in network models related to spreading phenomena, most interactions are not continuous but have a finite duration, necessitating the consideration of temporal networks (T15).

These topics have improved our understanding of the concepts underlying network science and revealed its potential for application in various fields. By exploring the collaboration of these topics, we gain insight into the interdisciplinary nature and dynamics of network science, and its ability to explore complex problems.

Semantic-integrated Topic Interpretability Analysis

In this section, we analyze, from a semantic integration perspective, how the similar words that SciBERT retrieves for topic terms enhance topic interpretability and give access to fine-grained topic features. We perform the analysis from three aspects: (a) extending the semantic scope; (b) improving conceptual understanding of topics; and (c) adding dimensions to the topic description. We use T5 and T6 as two examples to verify the applicability of the proposed model. To highlight the crucial information within each topic, we selected the top 10 terms from each topic. Subsequently, we extracted the top five similar words for these terms using the method described in subsection 3.2; the analysis of the number of similar words extracted can be found in subsection 5.1.
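The extraction procedure itself is defined in subsection 3.2 and is not reproduced here. As a rough illustration of the general idea only, the sketch below ranks candidate words by cosine similarity to a topic term and keeps the top k. The function name `top_k_similar`, the vocabulary, and the three-dimensional vectors are hypothetical stand-ins for SciBERT embeddings, which in the base model have 768 dimensions.

```python
import numpy as np


def top_k_similar(term, embeddings, k=5):
    """Return the k words closest to `term` by cosine similarity."""
    words = [w for w in embeddings if w != term]
    matrix = np.array([embeddings[w] for w in words], dtype=float)
    query = np.asarray(embeddings[term], dtype=float)
    # Cosine similarity is the dot product of L2-normalised vectors.
    matrix /= np.linalg.norm(matrix, axis=1, keepdims=True)
    query = query / np.linalg.norm(query)
    scores = matrix @ query
    order = np.argsort(-scores)[:k]  # indices of the k highest scores
    return [(words[i], float(scores[i])) for i in order]


# Toy vectors (hypothetical values, not produced by SciBERT).
toy_embeddings = {
    "author":   [0.9, 0.1, 0.0],
    "writer":   [0.8, 0.2, 0.1],
    "inventor": [0.7, 0.3, 0.0],
    "router":   [0.0, 0.1, 0.9],
    "traffic":  [0.1, 0.0, 0.8],
}

similar = top_k_similar("author", toy_embeddings, k=2)
print(similar)
```

With real embeddings, the same ranking would be applied to each of the top 10 terms of a topic with k set to five.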

Table 2 shows the results for the topic “scientific network.” By extending the semantic scope, the term “author” and related words such as “inventor” and “writer” broaden the understanding of the role of the author from mere writing to the dissemination, invention, and creation of knowledge, thus highlighting the different roles of researchers in the innovation process. The term “paper” and the related words “report” and “essay” encompass many forms of text that are analyzed as different categories in scientific research. Furthermore, from the perspective of improving conceptual understanding of topics, the term “collaboration” and related words such as “time,” “production,” and “cultivation” outline the temporal and developmental nature of the collaborative process, enriching our understanding of the dynamic and critical role of collaboration in scientific research. Finally, the terms “web,” “internet,” and “wikipedia” broaden the scope of networked technologies to include both technical aspects and their role in disseminating information and knowledge, thus adding a dimension to the topic description. In addition, the terms “research,” “project,” “development,” and “investigation” illustrate the diversity of research activities, ranging from project management to the investigation of research content. Together, these terms build a comprehensive framework for understanding research.

Table 2.

Top 5 Similar Words Extraction for Scientific Network (T5).

Terms	S1	S2	S3	S4	S5
research	project	field	development	survey	scholarship
collabor	tempor	produc	cultiv	amelior	solv
use	sale	conduct	tune	deficit	deposit
data	dataset	setup	content	task	contect
analysi	entiti	scrutini	heirarchi	commerci	dichotomi
studi	famili	especi	polici	stori	dichotomi
paper	report	essay	disare	press	document
web	internet	wikipedia	crowd	swiss	vast
articl	singl	triangl	compel	articul	particl
author	authorit	coauthor	inventor	sponsor	writer

Table 3 shows the results for the topic “network robustness.” In terms of extending the semantic scope, the term “network” and its similar words broaden our understanding of networks from purely physical connectivity to multipath and data transmission, integrating numerous concepts ranging from geographic distribution to spectrum management. This extension shows that network robustness is linked to physical infrastructure as well as to the management and optimization of complex data flows. Additionally, the term “node” and its similar words extend the role of network nodes from simple points to hubs of processing and communication, which in turn leads us to understand that the performance of each node, such as its processing speed and transmission capacity, affects the overall effectiveness of the network. Terms such as “attack,” “threat,” “denial,” and “fraud” broaden our understanding of cyberthreats and thus enhance our understanding of the topic. These threats include not only direct intrusions but also tactics to destabilize network operations, such as fraud and denial-of-service attacks, providing a basis for a comprehensive security strategy to mitigate various risks. Finally, from the perspective of adding dimensions to the topic description, the term “power” and its similar words underscore the importance of assessing network robustness from an energy efficiency perspective, emphasizing the need for meticulous management of energy demand and distribution while maintaining high speed and efficiency. This enables topic descriptions to transcend a single element and illustrate interactions with and impacts on other domains.

Table 3.

Top 5 Similar Words Extraction for Network Robustness (T6).

Terms	S1	S2	S3	S4	S5
network	overlay	multipath	neighborhood	router	transport
node	processor	channel	layer	station	rate
robust	multilevel	robustli	transient	explicit	indirect
attack	threat	denial	fraud	fault	hash
failur	expenditur	reconfigur	pressur	fail	fractur
strategi	studi	anomali	diagnosi	dichotomi	histori
rout	permut	hospit	finer	choos	outward
traffic	crime	throughput	multipath	flux	crash
cascad	incap	microgrid	switch	doublelay	singlelay
power	load	bandwidth	powergrid	speed	switch

Topic Feature and Trend Analysis Based on PI-RI

In this section, we use the index-based results to obtain an integrated view of topic trends in network science. Section 4.4.1 presents a two-dimensional analysis of topics based on the PI and RI. Section 4.4.2 further analyzes topic trends from both static and dynamic perspectives, using PI and RI to identify different topic types and trace their evolution over time.

Two-dimensional Feature Analysis of Topics Integrating Indices

Based on Equations (3)–(6), the results of each topic regarding PI and RI can be obtained. Subsequently, we have ranked each topic using the PI score and RI score. The results are shown in Table 4.

Table 4.

PI Scores and RI Scores and Rankings of Topics.

Topic	PI	Rank (PI)	RI	Rank (RI)
Degree Distribution (T7)	6.354	1	6.433	18
Complex Networks Modeling (T17)	3.772	2	9.243	2
Dynamics of Neural Networks (T11)	3.056	3	8.614	5
Bio-network Analysis (T2)	2.043	4	6.052	19
Link Prediction Methods (T9)	2.033	5	7.663	12
Network Robustness (T6)	1.995	6	7.246	14
Scientific Network (T5)	1.767	7	7.149	16
Optimal Network (T4)	1.624	8	8.312	8
Social Information Diffusion Modeling (T21)	1.393	9	8.203	10
Social Network Analysis (T18)	1.217	10	8.943	3
Synchronization and Control in Complex Networks (T16)	1.212	11	7.975	11
Epidemic Spreading Model (T3)	1.138	12	8.582	6
Network Topology Structure (T1)	1.114	13	8.488	7
Brain Functional Networks (T10)	0.993	14	8.263	9
Graph Theory (T12)	0.972	15	8.651	4
Trade Networks (T14)	0.766	16	7.348	13
Random Walks on Networks (T20)	0.756	17	6.772	17
Evolutionary Games (T19)	0.732	18	5.251	21
Peer to Peer Network Systems (T8)	0.696	19	9.639	1
Network Analysis of Health Data (T13)	0.334	20	7.208	15
Temporal network (T15)	0.188	21	5.536	20

In a comprehensive assessment of the popularity of topics within the field of network science, T7 received the highest score, reflecting its pivotal role in elucidating the network’s structural characteristics. It is followed by T17 and T11, with PI values significantly higher than those of other topics. Complex network modeling offers insights into a diverse range of real-world complex systems (Albert & Barabási, 2002), while the dynamics of neural networks benefit from rapid advancements in artificial intelligence, particularly through the development of deep learning technologies that enhance their applicability in various fields. Topics with lower PI scores, such as T13 and T15, attract less attention in the broader academic community due to their highly specialized nature, predominantly focusing on niche areas.

Higher RI values, observed in T8 and T18, indicate the widespread acceptance of these topics and their potential for cross-disciplinary applications. In contrast, the lower RI values seen in T19 and T15 underscore the uniqueness and specialization of these topics, which focus on specific theoretical models or application areas and provide deep insights. Furthermore, topics such as T17 and T11, with their higher RI scores, demonstrate the complexity and multidimensionality inherent in network science. These topics not only facilitate theoretical innovation but also play crucial roles in applications within fields such as machine learning and artificial intelligence. Topics such as T6 and T12 manifest in both theoretical frameworks and practical applications: they underpin theories of network stability and optimization and are applied in areas like network design and security analysis.

The analysis of PI and RI scores highlights the specialization and popularity of topics within network science and also uncovers their interdisciplinary connections and applications. This offers valuable insights for researchers seeking novel research avenues and practical uses, thereby fostering the growth of the field.

Topic Trends in Network Science

This section explores the distributional features of topics from both static and dynamic perspectives. The static perspective is based on the distribution of topics across years, and the dynamic perspective is based on the PI and RI scores of topics in two successive time periods.

From a static perspective, Figure 9 shows the distribution of topics by year, and we find that there is a clear difference in the features of the distribution before and after 2010. Prior to 2010, research topics were concentrated in a few core areas like T7 and T17, highlighting early network science’s focused attention and indicating the significant impact of these topics, with other topics still developing. However, post-2010, the distribution of topics became more balanced, with a broader range of subjects covered beyond the previously emphasized few. This diversification could be attributed to various factors, including the introduction of new technologies, increased interdisciplinary collaboration, and the emergence of new issues and applications. Furthermore, topics that initially received less focus, such as T15 and T19, began to attract more research interest over time, indicating the continuous expansion of knowledge and methodologies within the field of network science. Observing these changes in knowledge structure and the diversification of academic interests within the discipline is invaluable for understanding the developmental trajectory of network science and forecasting future research directions.

Figure 9.

Topic distribution over time.
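One way to put a number on how "balanced" a topic distribution is, offered here only as an illustrative sketch, is the normalized Shannon entropy of the topic shares in each period. This evenness measure is an assumption of the example and is not used in the study, and the paper counts below are hypothetical placeholders rather than the data behind Figure 9.

```python
import math


def topic_shares(counts):
    """Convert raw paper counts per topic into proportions."""
    total = sum(counts.values())
    return {topic: n / total for topic, n in counts.items()}


def normalized_entropy(shares):
    """Shannon entropy divided by log(n): 0 = one topic only, 1 = perfectly even."""
    if len(shares) < 2:
        return 0.0
    entropy = -sum(p * math.log(p) for p in shares.values() if p > 0)
    return entropy / math.log(len(shares))


# Hypothetical counts for four topics in the two periods (not the study's data).
early = {"T7": 600, "T17": 300, "T15": 20, "T19": 80}    # 1998-2010
late = {"T7": 400, "T17": 350, "T15": 250, "T19": 300}   # 2011-2022

early_even = normalized_entropy(topic_shares(early))
late_even = normalized_entropy(topic_shares(late))
print(round(early_even, 3), round(late_even, 3))
```

A higher value for the later period would correspond to the more balanced distribution described above.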

From a dynamic perspective, our trajectory analysis of the network science field employs 2010 as a pivotal year, segmenting the development into two periods: 1998–2010, and 2011–2022. Subsequently, based on the PI and RI scores, we plotted strategic coordinates to delineate the topic areas with the mean of the PI and RI as the central axis, as defined in detail in subsection 3.3.3, and the results of the coordinate plot are shown in Figure 10.

Figure 10.

Index-driven topic distributions over time spans.
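The quadrant assignment can be sketched as follows. This is a minimal illustration under two assumptions: the formal definition in subsection 3.3.3 is not reproduced in this section, so reading an above-mean RI as "General" and a below-mean RI as "Distinct" is inferred from the surrounding discussion; and the input is the whole-period scores from Table 4, not the per-period scores behind Figure 10, so the labels need not match the figure.

```python
from statistics import mean

# (PI, RI) per topic, copied from Table 4.
scores = {
    "T7": (6.354, 6.433), "T17": (3.772, 9.243), "T11": (3.056, 8.614),
    "T2": (2.043, 6.052), "T9": (2.033, 7.663), "T6": (1.995, 7.246),
    "T5": (1.767, 7.149), "T4": (1.624, 8.312), "T21": (1.393, 8.203),
    "T18": (1.217, 8.943), "T16": (1.212, 7.975), "T3": (1.138, 8.582),
    "T1": (1.114, 8.488), "T10": (0.993, 8.263), "T12": (0.972, 8.651),
    "T14": (0.766, 7.348), "T20": (0.756, 6.772), "T19": (0.732, 5.251),
    "T8": (0.696, 9.639), "T13": (0.334, 7.208), "T15": (0.188, 5.536),
}

# The mean of each index serves as the central axis of the coordinate plot.
pi_axis = mean(pi for pi, _ in scores.values())
ri_axis = mean(ri for _, ri in scores.values())


def quadrant(pi, ri):
    """Label a topic by its position relative to the two mean axes."""
    followed = "Widely Followed" if pi >= pi_axis else "Less Followed"
    # Assumption: RI above the mean is read as "General", below as "Distinct".
    breadth = "General" if ri >= ri_axis else "Distinct"
    return f"{followed}-{breadth} Topics"


labels = {topic: quadrant(pi, ri) for topic, (pi, ri) in scores.items()}
for topic in ("T7", "T17", "T8", "T15"):
    print(topic, labels[topic])
```

Topics whose scores lie very close to an axis, such as T4 on PI, are sensitive to small changes in the mean, so their labels should be read with caution.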

The evolution of the field of network science shows a trajectory from early, focused exploration to later mature diversification. As shown in Figure 10, the early stage is exemplified by T7, which stood out within the “Widely Followed-Distinct Topics” category, capturing extensive interest for its distinct contribution to network fundamentals. As the discipline progressed, there was a discernible shift toward a broader thematic embrace. T4 and T2 transitioned from the peripheries of “Less Followed-Distinct Topics” to the more engaging “Widely Followed-General Topics,” signaling an increasing tendency to apply these theories practically across various disciplines. This expansion reflects the ascent of previously niche areas such as T9 and T21. Initially categorized as specialized, these subjects garnered a surge in scholarly attention, positioning them among more generalized topics, indicative of a blend between refined research and versatile applications. Such a trend highlights the gradual shift from a focused study of network principles to the multifaceted, applied research landscape that network science embodies today. The transition also underscores the multidisciplinary impact of network science, where once highly specific topics now inform a broad array of scientific inquiries. This interplay between emerging areas and established disciplines exemplifies the field’s growth and diversification, echoing a more profound understanding of complex systems. As a result, the maturity of the field is reflected not only in the growing diversity of topics but also in the richness of the linkages between the subfields, which provide a sound basis for innovation and interdisciplinary collaboration.

Discussion

The Number of Similar Words Extraction

The extended similar word set improves the semantic richness and precision of topic representations, thereby enhancing topic interpretability and facilitating further analysis. To determine the number of similar words, this study adopts a combined qualitative and quantitative method, as proposed by Yu et al. (2023). In the qualitative analysis, we considered three candidate numbers of similar words (5, 10, and 20) and selected 5 and 10 as the preferred numbers based on their contribution to topic interpretability. The impact of different numbers of similar words was then quantitatively assessed by comparing the RI scores of each topic with 5 and with 10 similar words introduced against the RI scores without similar words; the results are shown in Table 5.

Table 5.

RI Scores Based on Different Similar Words Extraction.

Topic	RI_nos	RI_s5	RI_s10
Degree Distribution (T7)	4.501	6.433	8.974
Complex Networks Modeling (T17)	4.736	9.243	12.345
Dynamics of Neural Networks (T11)	6.001	8.614	10.746
Bio-network Analysis (T2)	6.716	6.052	9.891
Link Prediction Methods (T9)	5.651	7.663	9.750
Network Robustness (T6)	6.485	7.246	11.085
Scientific Network (T5)	6.034	7.149	10.923
Optimal Network (T4)	5.916	8.312	12.619
Social Information Diffusion Modeling (T21)	5.581	8.203	13.136
Social Network Analysis (T18)	5.972	8.943	12.925
Synchronization and Control in Complex Networks (T16)	4.939	7.975	12.391
Epidemic Spreading Model (T3)	5.360	8.582	12.895
Network Topology Structure (T1)	5.879	8.488	11.144
Brain Functional Networks (T10)	5.416	8.263	11.819
Graph Theory (T12)	5.551	8.651	13.801
Trade Networks (T14)	4.933	7.348	10.853
Random Walks on Networks (T20)	5.781	6.772	9.975
Evolutionary Games (T19)	5.489	5.251	10.331
Peer to Peer Network Systems (T8)	5.496	9.639	13.150
Network Analysis of Health Data (T13)	5.185	7.208	11.926
Temporal network (T15)	5.889	5.536	9.771

Note. RI_nos, RI_s5, and RI_s10 represent the RI scores when no similar words are introduced, when five similar words are introduced, and when 10 similar words are introduced, respectively.

In our analysis, it was important to strike a balance between the interpretability of topics and their distinction from each other. Adding five similar words decreased the RI scores of three topics (T2, T19, and T15), implying that these topics became more clearly delineated from the others, while the RI scores of the remaining topics rose. Although integrating similar words thus slightly lowered the distinction among most topics, it simultaneously increased their interpretability, enriching the semantic depth essential for a comprehensive understanding and detailed examination of each topic’s content. Introducing 10 similar words further augmented topic interpretability yet resulted in a pronounced rise in RI scores for every topic, blurring the distinctions among topics. Considering the goal of bolstering topic interpretability while preserving clarity between topics, our combined qualitative and quantitative approach found that introducing five similar words offered the best balance. Therefore, this study adopted the top five similar words as the criterion for the related topic analysis tasks.
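The comparison can be retraced directly from Table 5. The short sketch below uses the table values as given and only illustrates the arithmetic: it lists the topics whose RI falls when five similar words are added and compares the mean RI of the three settings. The variable names are chosen for this example.

```python
from statistics import mean

# (RI_nos, RI_s5, RI_s10) per topic, copied from Table 5.
ri = {
    "T7": (4.501, 6.433, 8.974), "T17": (4.736, 9.243, 12.345),
    "T11": (6.001, 8.614, 10.746), "T2": (6.716, 6.052, 9.891),
    "T9": (5.651, 7.663, 9.750), "T6": (6.485, 7.246, 11.085),
    "T5": (6.034, 7.149, 10.923), "T4": (5.916, 8.312, 12.619),
    "T21": (5.581, 8.203, 13.136), "T18": (5.972, 8.943, 12.925),
    "T16": (4.939, 7.975, 12.391), "T3": (5.360, 8.582, 12.895),
    "T1": (5.879, 8.488, 11.144), "T10": (5.416, 8.263, 11.819),
    "T12": (5.551, 8.651, 13.801), "T14": (4.933, 7.348, 10.853),
    "T20": (5.781, 6.772, 9.975), "T19": (5.489, 5.251, 10.331),
    "T8": (5.496, 9.639, 13.150), "T13": (5.185, 7.208, 11.926),
    "T15": (5.889, 5.536, 9.771),
}

# Topics whose RI drops once five similar words are introduced.
decreased_with_5 = [t for t, (nos, s5, _) in ri.items() if s5 < nos]

# Average RI under each setting.
mean_nos = mean(v[0] for v in ri.values())
mean_s5 = mean(v[1] for v in ri.values())
mean_s10 = mean(v[2] for v in ri.values())

print("RI decreased with 5 words:", decreased_with_5)
print("mean RI:", round(mean_nos, 3), round(mean_s5, 3), round(mean_s10, 3))
```

The step from no similar words to five raises the mean RI by less than the step from five to ten, which is consistent with the choice of five as the compromise.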

Interdisciplinary Insights From Network Science

Interdisciplinary research has become a prominent trend in network science, driving innovations across diverse domains (Omodei et al., 2017; C. Zhang et al., 2024; Zheng et al., 2023). Understanding how emerging topics in network science evolve and diffuse across disciplinary boundaries is crucial for informing both scientific exploration and practical applications. In this context, building on the index-based topic analysis results, we examine representative models and applications across multiple fields to illustrate how different branches of network science contribute to and shape interdisciplinary research trends. This discussion offers an external perspective that helps situate the topic analysis in this study within the broader development of network science.

Interdisciplinary research in network science not only broadens the boundaries of scientific research but also drives technological innovation and theoretical development. For example, complex network modeling (T17) has been used to investigate the root causes of complex phenomena by simulating interactions among systems. Related studies in climate politics and ecology further indicate that network-based models can capture latent organizational structures, multilevel interaction patterns, and stability properties that are difficult to identify with conventional approaches. In climate politics, for example, network analysis has been used to uncover the institutional and corporate configurations underpinning the climate change counter-movement and to highlight the hidden architectures of socio-political influence (Farrell, 2016). In ecology, multilayer network frameworks extend the analysis of ecological systems by incorporating multiple types of interactions and levels of organization, thereby enabling the investigation of high-dimensional and heterogeneous dynamics in nature and linking network architecture to ecosystem robustness and stability (Landi et al., 2018; Pilosof et al., 2017).

Meanwhile, in computer science and artificial intelligence, work on the dynamics of neural networks (T11), especially breakthroughs in graph representation learning, has greatly advanced the field. Kipf and Welling (2017) proposed graph convolutional networks (GCN) for node classification of graph-structured data, influencing subsequent research on graph networks, and Veličković et al. (2017) proposed graph attention networks (GAT), a graph neural network that aggregates the information of neighboring nodes through the attention mechanism to further improve model performance. In addition, graph representation learning techniques such as the node2vec algorithm and the DeepWalk algorithm also use the network structure for relevant tasks (Grover & Leskovec, 2016; Perozzi et al., 2014). Some studies have used these techniques to address practical challenges. Wang et al. (2019) proposed a knowledge graph attention network, which explicitly models the high-order connectivity in knowledge graphs in an end-to-end way, thereby improving recommendation performance. Li et al. (2022) argued that graph representation learning would continue to advance machine learning for biomedicine and healthcare applications, including identifying genetic variations in complex traits and elucidating the role of single-cell behaviors and their impact on health. Taken together, these advances demonstrate that graph representation learning can leverage network structures to address practical challenges and broaden research perspectives, thereby strengthening interdisciplinary links among network science, artificial intelligence, and application domains such as recommendation systems and biomedicine (Jin et al., 2021; Wu et al., 2022).
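To make the role of network structure in these models concrete, the following NumPy sketch applies a single graph convolutional layer in the form introduced by Kipf and Welling, ReLU(D^{-1/2}(A + I)D^{-1/2} H W), to a toy graph. The graph, the one-hot node features, and the random (untrained) weights are illustrative assumptions, and `gcn_layer` is a name chosen for this example, not a library function.

```python
import numpy as np


def gcn_layer(adjacency, features, weights):
    """One propagation step: ReLU(D^-1/2 (A + I) D^-1/2 @ H @ W)."""
    a_hat = adjacency + np.eye(adjacency.shape[0])  # add self-loops
    degree = a_hat.sum(axis=1)                      # >= 1 thanks to self-loops
    d_inv_sqrt = np.diag(degree ** -0.5)
    a_norm = d_inv_sqrt @ a_hat @ d_inv_sqrt        # symmetric normalisation
    return np.maximum(a_norm @ features @ weights, 0.0)


# Toy undirected path graph 0 - 1 - 2 with one-hot node features.
adjacency = np.array([[0, 1, 0],
                      [1, 0, 1],
                      [0, 1, 0]], dtype=float)
features = np.eye(3)

rng = np.random.default_rng(0)
weights = rng.normal(size=(3, 2))  # random, untrained weights

hidden = gcn_layer(adjacency, features, weights)
print(hidden.shape)
```

Each output row mixes a node's own features with those of its neighbors, which is the sense in which such models exploit the network structure.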

Theoretical and Practical Perspective

From the theoretical perspective, this study provides a methodological extension of topic analysis in scientific domains. The model integrating LDA and SciBERT demonstrates how to combine probabilistic topic-term distributions and contextual semantic information without relying on dimensionality reduction. By extracting similar words for topic terms in contextual semantic spaces, the model enriches topic semantics and improves topic interpretability. In addition, the PI and RI describe topics from the two dimensions of growth and differentiation, and the four-quadrant classification offers a clear structure for understanding topic positions and evolution within a discipline. These elements together provide a theoretical reference for future studies that seek to analyze multidimensional topic features and dynamic topic patterns in large text corpora.

From the practical perspective, the proposed model outputs multidimensional topic features, thereby helping researchers and decision makers better identify key topics and understand their evolution trends. In the network science case, the PI-RI quadrant based on the proposed indices makes it possible to identify Less Followed-Distinct Topics and Less Followed-General Topics. These topics typically correspond to under-explored research gaps, including new questions, data sources, or application scenarios. At the same time, topics identified in the PI-RI quadrant as Widely Followed-General Topics and Less Followed-General Topics can reveal connections between network science and other areas (e.g., social sciences, biomedicine, and engineering), which will be priorities for interdisciplinary collaboration. Specifically, researchers can select topics for future projects and design cross-disciplinary teams, while managers and funding agencies can target resources more precisely toward directions with both scientific potential and practical value.

Conclusion

This study leverages advancements in technology and data analytics to investigate the evolving landscape of network science, focusing on the semantic analysis of research topics and exploring the evolutionary trends within the field. We employ the LDA model for topic extraction and the SciBERT model for further semantic understanding, providing a new perspective for understanding topics. This approach enables a comprehensive examination of the field’s dynamics, from core topic identification to tracking developmental trajectories. Additionally, we introduce PI and RI to analyze topic growth rates and distinctiveness, respectively. Through a time-series analysis, we delineate the shift toward a more balanced and holistic development in network science. This multidimensional perspective highlights the trends and features of the field, thus providing insights for future research.

However, this study has its limitations. First, due to variations in methodologies, theoretical frameworks, and field-specific features, the applicability of the methods proposed in this study requires further validation in datasets from different disciplines. Second, the dataset utilized here was primarily derived from citations to ground-breaking works by Watts and Strogatz (1998) and Barabási and Albert (1999). This approach may not fully capture the entire scope of research in network science and may introduce selection bias to some extent. This bias could affect the comprehensiveness of the identified topics. For example, some application-oriented network studies might cite seminal works in the relevant application domains, rather than the two ground-breaking papers mentioned above, and thus be excluded from our dataset.

In future work, we will incorporate highly impactful papers that cite these two ground-breaking papers as additional foundational papers to expand the dataset. Meanwhile, we note that stronger topic models may improve the quality of the initial topic representation. Therefore, we will consider adopting more advanced topic models to further strengthen this component and enhance the overall model performance, and validate the proposed model in datasets from different disciplines.

Footnotes

ORCID iDs

Mingtao Lu

Xiaoling Huang

Ethical Considerations

The authors state that this research complies with ethical standards. This research does not involve either human participants or animals.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: We acknowledge the financial support from the National Natural Science Foundation of China Grant No. 72204213.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

The raw data used in this study has been uploaded to the Figshare database (), including data of 35,937 papers related to network science retrieved from the Web of Science.

References

Albert, Barabási, A. L. (2002). Statistical mechanics of complex networks. Reviews of Modern Physics, 74(1), 47–97.

Addula, S. R., Meesala, M. K., Ravipati, Sajja, G. S. (2025). A hybrid autoencoder and gated recurrent unit model optimized by honey badger algorithm for enhanced cyber threat detection in IoT networks. Security and Privacy, 8(6), e70086.

Beltagy, Cohan (2019). SciBERT: A pretrained language model for scientific text. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing, November 5–7, Hong Kong, China, pp. 3615–3620.

Barabási, A. L., Albert (1999). Emergence of scaling in random networks. Science, 286(5439), 509–512.

Blei, D. M., Ng, A. Y., Jordan, M. I. (2003). Latent Dirichlet allocation. Journal of Machine Learning Research, 3, 993–1022.

Bianchi, Terragni, Hovy (2020). Pre-training is a hot topic: Contextualized document embeddings improve topic coherence. arXiv preprint arXiv:2004.03974.

Benz, Pradier, Kozlowski, Shokida, N. S., Larivière (2025). Mapping the unseen in practice: Comparing latent Dirichlet allocation and BERTopic for navigating topic spaces. Scientometrics, 130, 3839–3870.

Chen, Wang, W. R., Chen, X. M. (2023). Bibliometric methods in traffic flow prediction based on artificial intelligence. Expert Systems with Applications, 228, 120421.

Devlin, Chang, M. W., Lee, Toutanova (2019). BERT: Pre-training of deep bidirectional transformers for language understanding [Conference session]. Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, June 3–5, Minneapolis, MN, USA, pp. 4171–4186.

Farea, Tripathi, Glazko, Emmert-Streib (2024). Investigating the optimal number of topics by advanced text-mining techniques: Sustainable energy research. Engineering Applications of Artificial Intelligence, 136, Article 108877.

Farrell (2016). Network structure and influence of the climate change counter-movement. Nature Climate Change, 6(4), 370–374.

Grover, Leskovec (2016). Node2vec: Scalable feature learning for networks [Conference session]. Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 13–17, San Francisco, CA, USA, pp. 855–864.

Gao, Huang, Dong, Liang (2022). Semantic-enhanced topic evolution analysis: A combination of the dynamic topic model and word2vec. Scientometrics, 127(3), 1543–1563.

Jeong, D. H., Song (2014). Time gap analysis by the topic model-based temporal technique. Journal of Informetrics, 8(3), 776–790.

Jelodar, Wang, Yuan, Feng, Jiang, Zhao (2019). Latent Dirichlet allocation (LDA) and topic modeling: Models, applications, a survey. Multimedia Tools and Applications, 78, 15169–15211.

Jiang, Liu, Zhang, Yin, Liu (2021). Overview of trends in global single cell research based on bibliometric analysis and LDA model (2009–2019). Journal of Data and Information Science, 6(2), 163–178.

Jin, Zeng, Xia, Huang, Liu (2021). Application of deep learning methods in biological networks. Briefings in Bioinformatics, 22(2), 1902–1917.

Kipf, T. N., Welling (2017). Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations, April 24–26, Palais des Congrès Neptune, Toulon, France, pp. 1–14.

Kukreja (2024). Comic exploration and insights: Recent trends in LDA-based recognition studies. Expert Systems with Applications, 255, Article 124732.

Kumar, Pawar, P. P., Addula, S. R., Meesala, M. K., Oni, Cheema, Q. N., Ul Haq, Sajja, G. S. (2025). AI-powered security for IoT ecosystems: A hybrid deep learning approach to anomaly detection. Journal of Cybersecurity and Privacy, 5(4), 90–112.

Liu, Jiang (2013). Collective dynamics in knowledge networks: Emerging trends analysis. Journal of Informetrics, 7(2), 425–438.

Li, Guan, Cui (2015). Mapping publication trends and identifying hot spots of research on internet health information seeking behavior: A quantitative and co-word biclustering analysis. Journal of Medical Internet Research, 17(3), Article e3326.

Landi, Minoarivelo, H. O., Brännström, Hui, Dieckmann (2018). Complexity and stability of ecological networks: A review of the theory. Population Ecology, 60(4), 319–345.

Li, M. M., Huang, Zitnik (2022). Graph representation learning in biomedicine and healthcare. Nature Biomedical Engineering, 6(12), 1353–1369.

Lee, Yoon, Kim, C. H., Kang (2019). BioBERT: A pre-trained biomedical language representation model for biomedical text mining. Bioinformatics, 36(4), 1234–1240.

Liu, W. S., G. Y. (2024). Web of Science Core Collection’s coverage expansion: The forgotten Arts & Humanities Citation Index? Scientometrics, 129, 933–955.

Liu, Yue (2025). Research on the hysteresis effect of topic related evolution for emerging trends prediction. Journal of Data and Information Science, 10(3), 52–77.

Molontay, Nagy (2019). Two decades of network science: As seen through the co-authorship network of network scientists. In Proceedings of the 2019 IEEE/ACM International Conference on Advances in Social Networks Analysis and Mining, August 27–30, Vancouver, British Columbia, Canada, pp. 578–583.

Mikolov, Sutskever, Chen, Corrado, G. S., Dean (2013). Distributed representations of words and phrases and their compositionality. Advances in Neural Information Processing Systems, 26, 3111–3119.

Nakazawa, Itoh, Saito (2015). A visualization of research papers based on the topics and citation network [Conference session]. 2015 19th International Conference on Information Visualisation, July 22–24, Barcelona, Spain, pp. 283–289.

Omodei, De Domenico, Arenas (2017). Evaluating the impact of interdisciplinary research: A multilayer network approach. Network Science, 5(2), 235–246.

Perozzi, Al-Rfou, Skiena (2014). DeepWalk: Online learning of social representations [Conference session]. Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, August 24–27, New York, USA, pp. 701–710.

Pilosof, Porter, M. A., Pascual, Kéfi (2017). The multilayer nature of ecological networks. Nature Ecology & Evolution, 1(4), Article 10101.

Park, Lim, Syafiandini, A. F., Song (2025). Forecasting topic trends of blockchain utilizing topic modeling and deep learning-based time-series prediction on different document types. Journal of Informetrics, 19(2), Article 101639.

Röder, Both, Hinneburg (2015). Exploring the space of topic coherence measures [Conference session]. Proceedings of the Eighth ACM International Conference on Web Search and Data Mining, February 2–6, Shanghai, China, pp. 399–408.

36.

Ray

Reddy

S. S.

Banerjee

(2021). Various dimension reduction techniques for high dimensional data analysis: A review. Artificial Intelligence Review, 54(5), 3473–3515.

37.

Reimers

Gurevych

(2019). Sentence-BERT: Sentence embeddings using Siamese BERT-networks [Conference session]. Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th international joint conference on natural language processing, November 3–7, Hong Kong, China, pp. 3982–3992.

38.

H. N.

Lee

P. C.

(2010). Mapping knowledge structure by keyword co-occurrence: A first look at journal papers in technology foresight. Scientometrics, 85(1), 65–79.

39.

Szabó

Fath

(2007). Evolutionary games on graphs. Physics Reports, 446(4–6), 97–216.

40.

Shah

T. A.

Gul

Gaur

R. C.

(2015). Authors self-citation behaviour in the field of library and information science. Aslib Journal of Information Management, 67(4), 458–468.

41.

Sajja

G. S.

Addula

S. R.

Meesala

M. K.

Ravipati

(2025). Optimizing inventory management through AI-driven demand forecasting for improved supply chain responsiveness and accuracy. In AIP Conference Proceedings, 3306(1), Article 050003.

42.

Veličković

Cucurull

Casanova

Romero

Liò

Bengio

(2018). Graph attention networks. In International Conference on Learning Representations, April 30–May 3, Vancouver Convention Center, Vancouver, BC, Canada, pp. 1–12.

43.

Vayansky

Kumar

S. A. P.

(2020). A review of topic modeling methods. Information Systems, 94, Article 101582.

44.

Venugopalan

Gupta

(2022). An enhanced guided LDA model augmented with BERT based semantic strength for aspect term extraction in sentiment analysis. Knowledge-Based Systems, 246, 108668.

45.

Watts

D. J.

Strogatz

S. H.

(1998). Collective dynamics of ‘small-world’ networks. Nature, 393(6684), 440–442.

46.

Wang

Z. Y.

C. Y.

(2012). Research on the semantic-based co-word analysis. Scientometrics, 90(3), 855–875.

47.

Wang

Cao

Liu

Chua

T. S.

(2019). Kgat: Knowledge graph attention network for recommendation [Conference session]. Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, August 4–8, Anchorage AK, USA, pp. 950–958.

48.

Sun

Zhang

Xie

Cui

(2022). Graph neural networks in recommender systems: A survey. ACM Computing Surveys, 55(5), 1–37.

49.

Wang

Chen

(2024a). Identifying interdisciplinary topics and their evolution based on BERTopic. Scientometrics, 129, 7359–7384.

50.

Wang

Liu

Zhu

X. L.

(2024b). Enhancing emerging technology discovery in nanomedicine by integrating innovative sentences using BERT and NLDA. Journal of Data and Information Science, 9(4), 155–195.

51.

Lakeh

A. B.

Ghaffarzadegan

(2021). Examining the features of impactful research topics: A case of three decades of HIV-AIDS research. Journal of Informetrics, 15(1), 101122.

52.

Xiong

Cheng

Zhao

W. H.

Liu

J. H.

(2019). Analyzing scientific research topics in manufacturing field using a topic model. Computers & Industrial Engineering, 135, 333–347.

53.

Xie

Zhang

X. Y.

Ding

Song

(2020). Monolingual and multilingual topic analysis using LDA and BERT embeddings. Journal of Informetrics, 14(3), 101055.

54.

D. J.

Wang

W. R.

Zhang

W. Y.

Zhang

(2018). A bibliometric analysis of research on multiple criteria decision making. Current Science, 114(4), 747–758.

55.

D. J.

Xiang

(2023). Discovering topics and trends in the field of artificial intelligence: Using LDA topic modeling. Expert Systems with Applications, 225, Article 120114.

56.

D. J.

Xiang

(2024). An ESTs detection research based on paper entity mapping: Combining scientific text modeling and neural prophet. Journal of Informetrics, 18(4), 101551.

57.

Yang

Chen

Shen

Zhu

(2019). Discovering author interest evolution in order-sensitive and semantic-aware topic modeling. Information Sciences, 486, 271–286.

58.

Yadulla

A. R.

Sajja

G. S.

Addula

S. R.

Maturi

M. H.

Nadella

G. S.

De La Cruz

Meduri

Gonaygunta

(2025). A systematic review of mental health monitoring and intervention using unsupervised deep learning on EEG data. Psychology International, 7(3), 61–78.

59.

Zhou

Kong

Lin

(2022). Financial topic modeling based on the BERT-LDA embedding [Conference session]. 2022 IEEE 20th International conference on industrial informatics, July 25–28, Perth, Australia, pp. 495–500.

60.

Zheng

(2025). A topic model-based knowledge graph to detect product defects from social media data. Expert Systems with Applications, 268, 126313.

61.

Zhang

Wang

Huang

Chang

(2024). Interdisciplinarity of information science: An evolutionary perspective of theory application. Journal of Documentation, 80(2), 392–426.

62.

Zhang

Liu

Jiang

Shi

(2023). Discovery of topic evolution path and semantic relationship based on patent entity representation. Aslib Journal of Information Management, 75(3), 618–642.

63.

Zheng

Zhang

Han

Hou

(2023). Research interdisciplinarity and citation impact: A network analysis of social networking sites research. Sage Open, 13(3), 21582440231193472.
