Abstract
Topic analysis facilitates the identification of knowledge flows and emerging trends in scientific research. Traditional models such as LDA generate interpretable topic-term distributions but lack deep semantic representation, while pretrained language models like SciBERT encode rich semantics with limited topic interpretability. To address this gap, this study proposes an integrated LDA and SciBERT model for topic analysis. Firstly, the terms of each topic are identified by the LDA model, capturing the underlying statistical information of topic-term associations. Secondly, the SciBERT model is used to obtain semantically similar words for these terms, complementing the statistical topic information with enriched semantic context and reducing semantic information loss, thereby facilitating the extraction of fine-grained topic features and enhancing interpretability. Thirdly, a popularity index and a relevance index are proposed to analyze topic characteristics and domain evolution from both static and dynamic perspectives. Empirical results on network science data show that the proposed model produces semantically rich topics, facilitating understanding of a diverse range of applications and key issues in network science, and reveals an evolutionary trend of the domain from development to maturity. This research will help researchers improve their understanding of topic analyses within their disciplinary fields and promote innovation and the exchange of scientific knowledge.
Introduction
Topic analysis in scientific domains is crucial for identifying emerging topics, trends, and knowledge transfer, thereby highlighting hot spots within a discipline (Jiang et al., 2021; Li et al., 2015). Methods such as co-word analysis, co-citation network analysis, and topic models play a role in illuminating the landscape of academic research. Many studies have combined these approaches to explore patterns of knowledge structure and topic evolution within subject fields (Nakazawa et al., 2015; Su & Lee, 2010; Zheng et al., 2025). However, co-word analysis focuses only on terms appearing in the same paper without considering the relationships between different papers (Wang et al., 2012), and co-citation network analysis fails to capture the latest dynamics in topic evolution (Liu et al., 2013). Topic models, particularly Latent Dirichlet Allocation (LDA), have been widely used in topic analysis since they were proposed (Jeong & Song, 2014). LDA typically identifies overarching topics and analyzes word frequency distributions within documents through statistical approaches, thus facilitating deeper insights into topics. However, by relying solely on the statistical information of topic-term associations, such topic models overlook deep semantic and contextual topic information.
In contrast, deep learning techniques, such as Word2Vec (Mikolov et al., 2013) and BERT (Devlin et al., 2019), effectively improve semantic understanding for topic analysis by learning word embeddings that reflect semantic and contextual information. In recent years, several studies have delved into the application of word embedding techniques for enhancing topic semantic understanding (Gao et al., 2022; Wang et al., 2024b; J. Zhang et al., 2023), as well as contributing to the study of topic evolution in scientific publications (Xie et al., 2020). Previous studies combined topic modeling with word embedding technologies to process feature vectors and employed dimensionality reduction techniques to improve the effectiveness of the study. However, dimensionality reduction techniques, commonly used to project high-dimensional word embeddings into lower-dimensional spaces, often lead to the loss of fine-grained semantic information (Ray et al., 2021). In addition, dimensionality reduction typically optimizes for global data distribution rather than preserving topic-specific structure among terms. As a result, terms belonging to the same topic may not cluster together in the reduced space, while words from different topics may overlap, thereby limiting improvements to topic interpretability.
In this study, we introduce a novel approach that enhances interpretability and obtains fine-grained topic features by extracting similar words of topic terms within a context-specific semantic space, and we propose a popularity index (PI) and a relevance index (RI) to explore the multidimensional features of topics. Firstly, we use the LDA model (Blei et al., 2003) to generate document-topic and topic-term probability distributions. Although recent models such as the contextualized topic model and BERTopic have emerged (Bianchi et al., 2020; Wang et al., 2024a), we adopt LDA as the base model due to its probabilistic generative framework, which ensures stable and reproducible topic modeling. Moreover, its clear and interpretable topic structures help enhance the consistency and controllability of subsequent semantic enrichment, enabling an effective combination of interpretability and semantic richness in topic representations. Secondly, the SciBERT model (Beltagy et al., 2019) is used to identify similar words for each term in each topic within their context-specific semantic spaces, thus enriching the semantic context and reducing semantic information loss, thereby obtaining fine-grained topic features and enhancing topic interpretability. Thirdly, we propose PI, based on growth rate, and RI, based on cosine similarity and representing differentiation, to explore the multidimensional features of topics and classify the topics into four categories: (a) Widely Followed-Distinct Topics, (b) Widely Followed-General Topics, (c) Less Followed-Distinct Topics, and (d) Less Followed-General Topics. Further details on these categories and the classification methodology are provided in Section 3.3.3. By analyzing topics and their categories over time, we provide a dynamic perspective on the trends within disciplines, thereby enhancing our understanding of disciplines. To evaluate our approach, we conducted an empirical analysis with a bibliographic dataset from network science.
The main contributions of this study are as follows:
(1) We integrate the LDA and SciBERT models to mine fine-grained topic features: extracting similar words for topic terms expands topic semantics, improves the understanding of topics, adds dimensions to topic descriptions, and enhances topic interpretability.
(2) We propose PI based on growth rate and RI representing differentiation based on cosine similarity to explore the multidimensional features of topics from a dynamic perspective.
(3) We conduct an empirical study on a dataset in network science collected from the Web of Science to evaluate the effectiveness of the proposed method and discover the features of the distribution of topics in network science and their evolutionary trends.
The structure of this paper is as follows: In the next section, a brief review of the related work is given. The Methodology section details the method, including the development and parameter optimization of the LDA model, the integration of the LDA and SciBERT models, and methods for exploring multidimensional features of topics. The Empirical Studies and Results section reports the empirical results and analysis. The Discussion section explores further issues raised by this study. Finally, the Conclusion section summarizes this study's main contributions, limitations, and directions for future research.
Related Work
This literature review covers three aspects of topic analysis: topic analysis based on LDA, word embedding techniques, and the integration of these two kinds of methods.
Topic Analysis Based on LDA
Topic modeling is a cornerstone in text mining, data mining, latent data discovery, and explaining relationships within datasets and textual documents (Jelodar et al., 2019). Among various topic models, LDA (Blei et al., 2003) remains one of the most widely used and effective approaches due to its ability to generate explicit and interpretable topic-term distributions (Vayansky & Kumar, 2020). The LDA model facilitates topic analysis by identifying frequently co-occurring words to represent a document’s topics (Xie et al., 2020).
Existing studies usually use the LDA model in two main ways: as a foundational tool to support further research, and as a direct method for analyzing topic structures and dynamics. As a foundational tool, Park et al. (2025) applied the LDA model for topic modeling in the blockchain domain, generating topic distributions across different document types, and further integrating these results through clustering to provide a foundation for subsequent time-series forecasting of topic trends. Liu et al. (2025) applied the LDA model to identify topics in funding texts and used cosine similarity to analyze the relationships between topics. Xiong et al. (2019) applied the LDA model to provide an in-depth analysis of topics in the manufacturing industry, identifying subfield topics that provide valuable insights for researchers. Farea et al. (2024) applied the LDA model in the field of sustainable energy research, analyzing optimal topic number selection, topic interpretability, and temporal evolution trends by systematically evaluating perplexity and coherence scores and exploring the long-term evolution patterns of research topics. As a direct method, Kukreja (2024) employed the LDA model to explore topic structures and research trends within the field of comic recognition, providing a comprehensive overview of its evolving research landscape. However, limited by the Bag-of-Words assumption, LDA struggles to capture contextual semantics (Yang et al., 2019), prompting the use of word embedding techniques for richer semantic representation.
Word Embedding Techniques
Different words in scientific literature may represent similar semantic concepts (Gao et al., 2022). Word embedding techniques typically have the advantage of capturing richer semantic information. Mikolov et al. (2013) introduced Word2Vec to generate word embeddings efficiently and further reveal complex semantic relationships between words. However, this model assigns each word a single vector and therefore ignores the fact that the same word may carry different meanings in different contexts. To enable deeper semantic understanding, Devlin et al. (2019) proposed Bidirectional Encoder Representations from Transformers (BERT). This model achieves enhanced language understanding by capturing the diverse semantics of words in their respective contexts, showing superior performance on multiple language understanding tasks. Several variants of BERT have been developed to satisfy the specific needs of different domains. For example, BioBERT (Lee et al., 2019) targets biomedical texts, optimizing understanding of that domain. BERT-large adapts to complex linguistic features through a larger number of training parameters. The SciBERT model, specially designed for the scientific domain, was proposed by Beltagy et al. (2019). It was pre-trained on a vast corpus of scientific texts, significantly enhancing the interpretation of scientific terms and concepts. This pre-training notably improves the efficacy of text analysis and topic information extraction, so SciBERT can serve as an important complement to traditional topic models.
Integration of the Topic Models and Word Embedding Techniques
The above-mentioned review clarifies the respective roles of topic models and word embedding techniques in topic analysis. Traditional topic models primarily depend on word frequency statistics and often fail to capture the semantics of words and the complexity of their relationships within context. To overcome this limitation, researchers are increasingly integrating word embedding techniques into topic models.
Some studies have validated the effectiveness of integrating topic models and word embedding techniques for topic analysis. For instance, Wang et al. (2024b) employed a fine-tuned BERT model to extract key informative sentences and applied noiseless LDA to generate interpretable topics from the cleaned corpus. Venugopalan and Gupta (2022) employed LDA and BERT to introduce an enhanced word embedding method for topic identification. This method integrates topic distribution, semantic knowledge, and syntactic structure, effectively categorizing news topics and proving its efficiency and accuracy across extensive text datasets. Subsequently, Zhou et al. (2022) developed a topic clustering model that employs BERT-LDA joint embeddings, focusing on exploring contextual semantics and narrative coherence to generate unique and relevant topic words, thereby improving insights into topics. Wang et al. (2024b) introduced the BERTopic model, specifically addressing topic analysis and evolution issues within interdisciplinary fields, using library and information science as a case study and identifying two distinct types of interdisciplinary topics within the discipline. Building on this, Benz et al. (2025) compared LDA and BERTopic on a large corpus of biology articles to examine differences in topic space exploration and their practical implications. They further showed that topic modeling can provide a valuable basis for understanding the semantic structure of scientific fields when combined with in-depth domain knowledge of the research object. These studies highlight the significant potential of combining topic models with word embedding technologies for topic and keyword extraction, providing valuable insights and references for topic analysis research in scientific domains.
Summarizing previous work, we found that studies combining these two techniques focused primarily on processing vector features and did not perform topic semantic mining based on the corresponding text corpus, which inevitably results in semantic information loss. Unlike previous studies, we combine LDA and SciBERT models to capture the fine-grained features of topics by generating similar words for different terms within the contextual environments of the topics. This approach enhances the semantic information and interpretability of the topics and reduces the semantic information loss associated with vector feature processing.
Methodology
The purpose of this section is to introduce the research framework that is illustrated in Figure 1. This section is divided into three modules: (a) LDA and parameter settings; (b) integrated model of SciBERT and LDA; and (c) exploring the multidimensional features of topics.

Figure 1. Framework of this study.
LDA and Parameter Settings
The LDA model employs probabilistic distributions to discern document topics. In this study, LDA is primarily used to provide interpretable initial topic representations, including the document-topic and topic-term probability distribution vectors, which serve as the foundation for the subsequent tasks. Figure 2 illustrates the LDA graphical model, detailing the text generation process. Each document d is allocated a topic distribution
Setting the parameters appropriately is essential for configuring the LDA model, particularly the number of topics (k) and the Dirichlet hyperparameters (α and β), which control the distributional sparsity of topics over documents and words over topics. Lower values of α and β reduce the smoothing effect on the corresponding multinomial distributions, resulting in sharper and more distinct topic representations.
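The effect of the Dirichlet hyperparameters on sparsity can be illustrated with a small simulation. The following is a minimal sketch, not part of the study's pipeline: it samples symmetric Dirichlet distributions using only the Python standard library and compares how concentrated the samples are for a small and a large concentration value. The function names, the choice of 10 categories, and the values 0.1 and 10.0 are assumptions made purely for illustration.

```python
import random

def sample_dirichlet(alpha, k, rng):
    """Draw one sample from a symmetric Dirichlet(alpha) over k categories."""
    # A Dirichlet sample can be obtained by normalizing independent Gamma draws.
    draws = [rng.gammavariate(alpha, 1.0) for _ in range(k)]
    total = sum(draws)
    if total == 0.0:  # guard against numerical underflow; fall back to uniform
        return [1.0 / k] * k
    return [x / total for x in draws]

def mean_max_share(alpha, k=10, n_samples=500, seed=0):
    """Average weight of the largest component; higher means sparser samples."""
    rng = random.Random(seed)
    total = sum(max(sample_dirichlet(alpha, k, rng)) for _ in range(n_samples))
    return total / n_samples

# Illustrative values only (assumptions, not the settings used in this study).
sparse = mean_max_share(alpha=0.1)   # low concentration: mass piles onto few categories
smooth = mean_max_share(alpha=10.0)  # high concentration: mass spread almost evenly
print(f"mean largest share, alpha=0.1: {sparse:.3f}")
print(f"mean largest share, alpha=10:  {smooth:.3f}")
```

With a low concentration value, most of the probability mass falls on one or two categories, which corresponds to the sharper and more distinct topic representations described above.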

Figure 2. The structure of LDA.
To determine the number of topics and Dirichlet hyperparameters, this study adopts a mixed qualitative-quantitative approach, as recommended by Yu and Xiang (2023). At the qualitative level, a candidate range of topic numbers (k) was identified through manual inspection of topic coherence and interpretability. At the quantitative stage, commonly used values of α and β were tested. The Cv coherence index was employed as the primary selection criterion for jointly optimizing k, α, and β.
Integrated Model of LDA and SciBERT
To address the semantic limitations observed in LDA model outputs, this study integrates the SciBERT (“allenai/scibert_scivocab_uncased”) model. The integration is aimed at enriching the semantics of the extracted topics. The workflow, illustrated in Figure 3, unfolds in four steps.

Figure 3. LDA with SciBERT workflow.
Step 1: Topic and terms extraction
We used the LDA model to extract k topics from the cleaned corpus. For clear description, we labeled the individual topics as
Step 2: Semantic space generation
After mapping the relevant documents back to the original dataset, we identified the document collection corresponding to each topic, denoted as [
Step 3: Word embedding vector generation
Given the topic terms
Step 4: Similar words extraction
Cosine similarity is a widely implemented index in information retrieval and semantic similarity studies. It measures the angle between vectors and is insensitive to their magnitudes. This makes it more effective than distance-based indices, as cosine similarity better captures semantic relationships in high-dimensional embedding spaces by focusing on vector orientation rather than magnitude (Reimers & Gurevych, 2019). It is defined as:
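For reference, the standard definition of cosine similarity between two n-dimensional vectors is given below. The symbols u and v are placeholders chosen here for illustration and are not taken from the study's own notation.

```latex
\cos(\mathbf{u}, \mathbf{v})
  = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert}
  = \frac{\sum_{i=1}^{n} u_i v_i}{\sqrt{\sum_{i=1}^{n} u_i^{2}} \, \sqrt{\sum_{i=1}^{n} v_i^{2}}}
```

The value lies in [-1, 1] and depends only on the angle between the vectors, which is why rescaling either vector leaves it unchanged.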
Specifically, we map the word embedding vector of each topic term into its corresponding contextual semantic space. For each topic term
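The similar-word extraction step can be sketched as a ranking by cosine similarity. The example below is a minimal illustration under stated assumptions: embeddings are supplied as plain lists of floats (in the study they would come from SciBERT), and the function names, the three-dimensional toy vectors, and the candidate words are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_u * norm_v)

def top_similar_words(term, term_vector, candidate_vectors, n=5):
    """Rank candidate words by cosine similarity to the term; return the top n."""
    scored = [
        (word, cosine_similarity(term_vector, vec))
        for word, vec in candidate_vectors.items()
        if word != term  # a term is not its own similar word
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]

# Toy semantic space (hypothetical vectors, not real SciBERT embeddings).
toy_space = {
    "author": [1.0, 0.0, 0.2],
    "writer": [0.9, 0.1, 0.2],
    "inventor": [0.8, 0.3, 0.1],
    "node": [0.0, 1.0, 0.0],
}
result = top_similar_words("author", toy_space["author"], toy_space, n=2)
print([word for word, _ in result])
```

In this toy space, "writer" and "inventor" point in nearly the same direction as "author", so they rank ahead of "node".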
Exploring the Multidimensional Features of Topics
Topic Interpretability Enhancement
The output of the LDA model captures the fundamental essence of a topic but falls short of providing deeper insights into it. Because topic terms usually indicate the most important information in a topic, we can regard them as a concentration of the core concepts and grounded theory of the topic. After similar word extraction for topic terms described in section 3.2, we can take the following three perspectives to enhance the interpretability of the topics:
(a) Extending the semantic scope: By incorporating similar words retrieved from the contextual embedding space, we expand beyond the top LDA terms to include semantically related concepts, thereby broadening the coverage of topic semantics.
(b) Improving conceptual understanding of topics: The inclusion of similar words provides additional context and nuances for each topic, helping to clarify ambiguous terms and reinforce the core conceptual structure of the topic.
(c) Adding dimension to the topic description: The similar words enable a multi-faceted representation of each topic by introducing diverse expressions and viewpoints, enriching the overall topic description and supporting more comprehensive topic interpretation.
Definition of Popularity Index and Relevance Index
Topic popularity measures the amount of research and attention given to a topic within a subject field, based on how often the topic has been studied over a given period of time (Xu et al., 2021). However, this view defines the popularity of a topic solely by its size, which is a static perspective. It cannot distinguish topics that have similar proportions but different temporal trends. To address this limitation, we introduce the concept of topic growth rate in this study and incorporate it into the calculation of topic popularity. In this way, the proposed popularity index (PI) reflects not only the size of a topic in a given year but also whether its attention has increased or decreased over time, making it more informative than traditional size-based indices. The index can be expressed as:
In our analysis,
To better reflect the differences between topics, we propose a relevance index (RI), which is constructed as an inverse measure of cosine similarity to characterize topic relationships in a new way. The cosine similarity computed from embedding representations provides a fine-grained and semantically consistent basis for measuring topical relationships (Yu & Xiang, 2024). The index can be expressed as:
Diffk represents the average difference between topics, while
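The idea of an inverse measure of cosine similarity can be illustrated with a short sketch. This is one plausible reading only and should not be taken as the definition of RI used in this study: it assumes that each topic is represented by a vector, that the inverse is computed as 1 minus the cosine similarity, and that the result is averaged over all other topics. The function names and the toy vectors are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def average_dissimilarity(topic_vectors):
    """For each topic, average (1 - cosine similarity) to every other topic.

    Assumption: "inverse measure" is read as 1 - cos; other inverses are possible.
    """
    scores = {}
    for name, vec in topic_vectors.items():
        others = [v for other, v in topic_vectors.items() if other != name]
        if not others:
            scores[name] = 0.0  # a single topic has nothing to differ from
            continue
        total = sum(1.0 - cosine_similarity(vec, v) for v in others)
        scores[name] = total / len(others)
    return scores

# Toy topic vectors (hypothetical): t1 and t2 coincide, t3 is orthogonal to both.
toy_topics = {"t1": [1.0, 0.0], "t2": [1.0, 0.0], "t3": [0.0, 1.0]}
diff = average_dissimilarity(toy_topics)
print(diff)
```

Under this reading, a topic that is semantically far from all others receives a high score, while a topic that overlaps with many others receives a low one.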
Topics Discernment of Combining Indices
After calculating the
(a) Widely Followed-Distinct Topics (high
(b) Widely Followed-General Topics (high
(c) Less Followed-Distinct Topics (low
(d) Less Followed-General Topics (low
This classification assists researchers in improving the understanding of topics and the dynamic changes within research domains, offering guidance that can encourage innovation and drive disciplinary progress.
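The four-way classification can be sketched as a simple two-threshold rule. The example below is illustrative only. It assumes that "high" and "low" are decided against a threshold for each index, that the median across topics serves as that threshold, and that a high RI corresponds to a Distinct topic. None of these choices, nor the toy index values, are taken from this study.

```python
import statistics

def classify_topic(pi, ri, pi_threshold, ri_threshold):
    """Assign one of the four categories from a topic's PI and RI values."""
    followed = "Widely Followed" if pi >= pi_threshold else "Less Followed"
    # Assumption: high RI is read as Distinct, low RI as General.
    kind = "Distinct" if ri >= ri_threshold else "General"
    return f"{followed}-{kind} Topics"

# Toy (PI, RI) pairs per topic; values are hypothetical.
toy_indices = {
    "A": (0.8, 0.7),
    "B": (0.9, 0.2),
    "C": (0.1, 0.6),
    "D": (0.2, 0.1),
}
# Assumption: medians across topics act as the high/low cut-offs.
pi_cut = statistics.median(pi for pi, _ in toy_indices.values())
ri_cut = statistics.median(ri for _, ri in toy_indices.values())

categories = {
    name: classify_topic(pi, ri, pi_cut, ri_cut)
    for name, (pi, ri) in toy_indices.items()
}
for name, label in categories.items():
    print(name, label)
```

Recomputing the categories per year with the same rule would give the dynamic view of how topics move between quadrants over time.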
Empirical Studies and Results
Drawing on network science data, this section provides a comprehensive empirical analysis of the multidimensional features of topics. Building on the methods introduced in Section 3, the analysis is conducted in four steps: (a) dataset construction; (b) discovering the topics in network science; (c) semantic-integrated topic interpretability analysis; and (d) topic feature and trend analysis based on PI-RI.
Dataset
Researchers often select datasets based on topic retrieval using related terms (Chen et al., 2023; Yu et al., 2018). However, due to the highly interdisciplinary nature of network science, it is hard to ensure the identification and use of all relevant terms for retrieval. Two pivotal papers around the turn of the 21st century are recognized as foundational to the emergence of network science: Watts and Strogatz (1998), which studied small-world networks, and Barabási and Albert (1999), which explored scale-free networks. Both are regarded as ground-breaking studies in the field (Molontay & Nagy, 2019). Figure 4 shows the citation trends of these two dominant papers, which exhibit a fluctuating but rising pattern, highlighting their enduring significance and continued relevance to researchers in the network science community. This study therefore considers a document relevant to network science if it cites either of these two papers.

Figure 4. Number of citations of two dominant papers over the years.
Figure 5 illustrates the data collection and preprocessing procedures in this study. The data were collected from the Web of Science core collection, a database widely used across disciplines and an important bibliographic resource for scientific research and academic assessment (Liu et al., 2024; Shah et al., 2015). We then obtained 49,176 articles and proceeding papers, including their titles, abstracts, and publication years from 1998 to 2022, focusing on abstracts as the primary data for this study. To enhance the quality of the data, papers without titles and abstracts were eliminated, and all duplicates were removed, resulting in 35,937 papers. In the preprocessing phase, all special characters, punctuation, HTML tags, and stop words were removed, and words appearing in fewer than 15 papers were also removed to further refine the dataset. After these steps, a cleaned corpus was constructed for subsequent analysis tasks.
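The cleaning steps above can be sketched with the Python standard library. This is a minimal illustration, not the study's actual preprocessing code: the stop-word list is a tiny hypothetical stand-in for a full list, the function names are invented for this example, and the toy call uses a document-frequency threshold of 2 instead of 15 so that the effect is visible on three short texts.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real run would use a full list.
STOP_WORDS = {"the", "a", "an", "of", "in", "and", "is", "are", "to"}

def clean_text(text):
    """Strip HTML tags, punctuation and special characters, then drop stop words."""
    text = re.sub(r"<[^>]+>", " ", text)               # remove HTML tags
    text = re.sub(r"[^a-z0-9\s]", " ", text.lower())   # remove punctuation/special chars
    return [tok for tok in text.split() if tok not in STOP_WORDS]

def build_corpus(abstracts, min_doc_freq=15):
    """Tokenize abstracts and drop words that appear in fewer than min_doc_freq papers."""
    tokenized = [clean_text(a) for a in abstracts]
    # Document frequency: count each word once per paper.
    doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
    return [[tok for tok in doc if doc_freq[tok] >= min_doc_freq] for doc in tokenized]

toy_abstracts = [
    "<p>The small-world network is robust.</p>",
    "Scale-free networks and the small-world effect!",
    "A network of neurons",
]
corpus = build_corpus(toy_abstracts, min_doc_freq=2)
print(corpus)
```

Note that replacing punctuation with spaces splits hyphenated terms such as "small-world" into two tokens; whether the study handled hyphens this way is not stated, so this is a design choice of the sketch.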

Figure 5. Data collection and preprocessing.
Discovering the Topics in Network Science
This section first details the LDA modeling and parameter setting. Subsequently, based on the cleaned corpus, it visualizes the discovered topics using word clouds to delineate the conceptual landscape of network science.
LDA Modeling and Parameter Setting
As described in Section 3.1, this study employed a mixed qualitative-quantitative approach to determine the number of topics and Dirichlet hyperparameters.
At the qualitative level, we initially assessed a range of candidate topic numbers, specifically k ∈ {10, 20, 30, 40, 50}. Based on manual interpretation of topic coherence and distinctiveness, the range was narrowed to k ∈ [15, 30]. At the quantitative level, within this refined k range, we tested commonly adopted Dirichlet hyperparameter combinations with α ∈ {0.10, 0.15, 0.20, 0.25} and β ∈ {0.005, 0.010, 0.015, 0.020, 0.025}. The optimal configuration of k, α, and β was determined using the coherence score Cv.
The results are presented in Table 1, which reports only the configurations under
Table 1. Results of Parameter Experiments.
Note. Bold indicates the optimal parameter combination and corresponding result.
In addition, to verify that performance improvements stem from parameter adjustments rather than random fluctuations, we set the random seed to 42, ensuring the reproducibility of the results. This approach strengthens the reliability of our findings by minimizing the impact of random variations on the model’s performance. Finally, the number of passes is set to 1000, and the result with the highest a posteriori probability is chosen as the optimal solution. These settings help ensure that the estimated topics are valid and provide a solid foundation for the subsequent semantic and index-based analyses.
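The parameter search described in this subsection can be sketched as an exhaustive grid search. The following is a minimal sketch under stated assumptions: every integer k in [15, 30] is tried, and the scoring function is a hypothetical stand-in that would, in a real run, train an LDA model with the given settings and seed and return its Cv coherence. The toy scorer and its arbitrary peak are for demonstration only and do not reflect the results in Table 1.

```python
import itertools

# Search space taken from the text; trying every integer k in [15, 30] is an assumption.
K_VALUES = range(15, 31)
ALPHAS = [0.10, 0.15, 0.20, 0.25]
BETAS = [0.005, 0.010, 0.015, 0.020, 0.025]
SEED = 42  # fixed seed so repeated runs are comparable

def grid_search(score_fn, k_values=K_VALUES, alphas=ALPHAS, betas=BETAS, seed=SEED):
    """Return the (k, alpha, beta) combination with the highest score, and that score."""
    best_config, best_score = None, float("-inf")
    for k, alpha, beta in itertools.product(k_values, alphas, betas):
        score = score_fn(k, alpha, beta, seed)
        if score > best_score:
            best_config, best_score = (k, alpha, beta), score
    return best_config, best_score

def toy_score(k, alpha, beta, seed):
    """Hypothetical scorer with an arbitrary peak; replace with LDA training + Cv."""
    return -abs(k - 20) - abs(alpha - 0.15) - abs(beta - 0.010)

best_config, best_score = grid_search(toy_score)
print(best_config, best_score)
```

Passing the seed into the scorer keeps the comparison fair: each configuration is evaluated under the same source of randomness, so differences in score reflect the parameters themselves.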
Topic Visualization Using Word Clouds
The LDA model identified 21 topics, along with their document-topic and topic-term distributions. The word clouds in Figures 6–8 visually represent the associated terms of each topic. Each word cloud corresponds to one topic, and the size of each word reflects its probability within that topic. By analyzing the key terms of each topic and their word cloud distribution features, we assigned a label to each topic.

Figure 6. Word clouds for Topics 1–8.

Figure 7. Word clouds for Topics 9–16.

Figure 8. Word clouds for Topics 17–21.
In network science, network topology structures (T1) play a crucial role in understanding behaviors and features across various network types, serving as the theoretical foundation for complex network studies. There are different types of networks for different fields, such as bio-network analysis (T2), scientific networks (T5), trade networks (T14), network analysis of health data (T15), and brain functional networks (T10), each unveiling the interactions between various entities within complex systems. Recent studies also show that deep learning-based models support anomaly detection and decision making in complex networked systems (e.g., IoT ecosystems and mental health monitoring platforms; Addula et al., 2025; Kumar et al., 2025; Yadulla et al., 2025). Furthermore, peer-to-peer networks (T8) are classified as one kind of complex network, exhibiting high efficiency and robustness in handling large-scale distributed tasks. Mathematical concepts such as degree distribution (T7), link prediction methods (T9), and graph theory (T12) provide the methodological and definitional foundation for these complex networks. Complex network modeling (T17) is a crucial interdisciplinary bridge to virtually all types of networks mentioned previously, characterized by structural properties, dynamic behaviors, multilevel and multiscale attributes, and clustering phenomena. Additionally, the study of network robustness (T6) is important for understanding and protecting complex networks; robustness is defined as the capacity of a network to sustain its functionality and services amid various disturbances and failures. This characteristic is crucial across various network types. In most cases, the underlying network structure plays a pivotal role in the system’s survivability against random failures or deliberate attacks. Optimal networks (T4) enhance network robustness by optimizing and upgrading configurations based on specific conditions and constraints. Similar AI-driven optimization ideas have also been applied to supply chain networks, where AI-based demand forecasting is used to improve inventory management and supply chain responsiveness and accuracy (Sajja et al., 2025). Moreover, understanding synchronization and control in complex networks (T16) contributes equally to improving the robustness of networks by elucidating the intrinsic mechanisms of these networks, leading to the development of more efficient and robust networks.
Network science has gained substantial popularity within the field of computer science. Here, neural networks (T11) are considered with an emphasis on computational and algorithmic foundations rather than biological aspects. Examples include artificial neural network modeling inspired by biological neural networks, graph networks, and other neural network features, which are prominent research topics in computing and artificial intelligence. Evolutionary games (T19) are a tool to study how individuals influence overall network behavior through strategic interactions in complex networks, often employing agent-based methods to delve into game dynamics, especially focusing on cooperative behaviors in dynamic processes (Szabó & Fath, 2007). In practice, evolutionary game theory has been applied to a variety of scenarios, such as market competition strategies among enterprises and information spreading strategies. Network science aims to build models that replicate the properties of real networks, such as those observed in social network analysis (T18). Random walks on networks (T20) embrace this apparent randomness by constructing and characterizing networks that are inherently random, offering profound insights into strategic decisions and behavioral patterns. For instance, in the epidemic spreading model (T3), social network analysis (T18), and social information diffusion modeling (T21), this approach helps to reveal the dynamics of information spread. However, in network models related to spreading phenomena, most interactions are not continuous but have a finite duration, necessitating the consideration of temporal networks (T15).
These topics have improved our understanding of the concepts underlying network science and revealed its potential for application in various fields. By exploring the interplay among these topics, we gain insight into the interdisciplinary nature and dynamics of network science and its ability to address complex problems.
Semantic-integrated Topic Interpretability Analysis
In this section, we will analyze from a semantic integration perspective how similar words to those terms obtained by SciBERT can enhance the interpretability of the topics and access to topic fine-grained features. We perform the analysis from three aspects: (a) extending semantic scope; (b) improving conceptual understanding of topics; and (c) adding dimension to the topic description. We use T5 and T6 as two examples to verify the applicability of the proposed model. To highlight the crucial information within each topic, we selected the top 10 terms from each topic. Subsequently, we extracted the top five similar words for these terms using the method described in subsection 3.2, and the analysis regarding the number of similar words extracted can be found in subsection 5.1.
Table 2 shows the results for the topic “scientific network.” By extending the semantic scope, the term “author” and related words such as “inventor” and “writer” broaden the understanding of the role of the author from mere writing to the dissemination, invention, and creation of knowledge, thus highlighting the different roles of researchers in the innovation process. The term “paper” and the related words “report,” and “essay” encompass many forms of text that are analyzed as different categories in scientific research. Furthermore, from the perspective of improving conceptual understanding of topics, the term “collaboration” and related words such as “time,”“production,” and “cultivation” outline the temporal and developmental nature of the collaborative process, enriching our understanding of the dynamic and critical role of collaboration in scientific research. Finally, the terms “web,”“internet,” and “wikipedia” broaden the scope of networked technologies to include both technical aspects and their role in disseminating information and knowledge, thus adding the dimensionality of the topic description. In addition, the terms “research,”“project,”“development,” and “investigation” illustrate the diversity of research activities, ranging from project management to research content investigation. Thus, we can find that these terms together build a comprehensive framework for understanding research.
Top 5 Similar Words Extraction for Scientific Network (T5).
Table 3 shows the results for the topic “network robustness.” Extending the semantic scope, the term “network” and its similar words broaden our understanding of networks from purely physical connectivity to multipath and data transmission, integrating concepts ranging from geographic distribution to spectrum management. This extension shows that network robustness is linked to physical infrastructure as well as to complex data flow management and optimization. Additionally, the term “node” and its similar words extend the role of network nodes from simple points to pivots of processing and communication, which in turn shows that the performance of each node, such as its processing speed and transmission capacity, affects the overall effectiveness of the network. Terms such as “attack,” “threat,” “denial,” and “fraud” broaden our understanding of cyberthreats and thus improve our conceptual understanding of the topic. These threats include not only direct intrusions but also tactics to destabilize network operations, such as fraud and denial-of-service attacks, providing a basis for a comprehensive security strategy to mitigate various risks. Finally, from the perspective of adding dimension to the topic description, the term “power” and its similar words underscore the importance of assessing network robustness from an energy efficiency perspective, emphasizing the need for meticulous management of energy demand and distribution while maintaining high speed and efficiency. This enables topic descriptions to transcend a single element and illustrate interactions with and impacts on other domains.
Top 5 Similar Words Extraction for Network Robustness (T6).
Topic Feature and Trend Analysis Based on PI-RI
In this section, we use the index-based results to obtain an integrated view of topic trends in network science. Section 4.4.1 presents a two-dimensional analysis of topics based on the PI and RI. Section 4.4.2 further analyzes topic trends from both static and dynamic perspectives, using PI and RI to identify different topic types and trace their evolution over time.
Two-dimensional Feature Analysis of Topics Integrating Indices
Based on Equations (3)–(6), the PI and RI scores of each topic can be obtained. We then ranked the topics by PI score and by RI score. The results are shown in Table 4.
PI Scores and RI Scores and Rankings of Topics.
In a comprehensive assessment of the popularity of topics within the field of network science, T7 received the highest PI score, reflecting its pivotal role in elucidating the network’s structural characteristics. It is followed by T17 and T11, whose PI values are significantly higher than those of the other topics. Complex network modeling (T17) offers insights into a diverse range of real-world complex systems (Albert & Barabási, 2002), while the dynamics of neural networks (T11) benefit from rapid advancements in artificial intelligence, particularly the development of deep learning technologies that enhance their applicability in various fields. Topics with lower PI scores, such as T13 and T15, attract less attention in the broader academic community because they are highly specialized and focus predominantly on niche areas.
Higher RI values, observed in T8 and T18, indicate the widespread acceptance of these topics and their extensive potential for cross-disciplinary application. In contrast, the lower RI values seen in T19 and T15 underscore the uniqueness and specialization of these topics, which focus on specific theoretical models or application areas and provide deep insights. Furthermore, topics such as T17 and T11, with their higher RI scores, demonstrate the complexity and multidimensionality inherent in network science. These topics not only facilitate theoretical innovation but also play crucial roles in applications within fields such as machine learning and artificial intelligence. Topics such as T6 and T12 manifest in both theoretical frameworks and practical applications: they underpin theories of network stability and optimization and are applied in areas such as network design and security analysis.
The analysis of PI and RI scores highlights the specialization and popularity of topics within network science and also uncovers their interdisciplinary connections and applications. This offers valuable insights for researchers seeking novel research avenues and practical uses, thereby fostering the growth of the field.
Topic Trends in Network Science
This section explores the distributional features of topics from both static and dynamic perspectives. The static perspective is based on the distribution of topics across years, and the dynamic perspective is based on the PI and RI scores of topics in successive time periods.
From a static perspective, Figure 9 shows the distribution of topics by year, and we find that there is a clear difference in the features of the distribution before and after 2010. Prior to 2010, research topics were concentrated in a few core areas like T7 and T17, highlighting early network science’s focused attention and indicating the significant impact of these topics, with other topics still developing. However, post-2010, the distribution of topics became more balanced, with a broader range of subjects covered beyond the previously emphasized few. This diversification could be attributed to various factors, including the introduction of new technologies, increased interdisciplinary collaboration, and the emergence of new issues and applications. Furthermore, topics that initially received less focus, such as T15 and T19, began to attract more research interest over time, indicating the continuous expansion of knowledge and methodologies within the field of network science. Observing these changes in knowledge structure and the diversification of academic interests within the discipline is invaluable for understanding the developmental trajectory of network science and forecasting future research directions.

Topic distribution over time.
From a dynamic perspective, our trajectory analysis of the network science field uses 2010 as a pivotal year, dividing the development into two periods: 1998–2010 and 2011–2022. Based on the PI and RI scores, we then plotted strategic coordinates that delineate the topic areas, with the means of PI and RI as the central axes, as defined in detail in subsection 3.3.3. The resulting coordinate plot is shown in Figure 10.

Index-driven topic distributions over time spans.
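The quadrant assignment behind such a coordinate plot can be sketched as follows. This is a minimal illustration that assumes the mean PI and the mean RI serve as the two axes, that a higher PI corresponds to “Widely Followed,” and that a higher RI corresponds to “General”; the authoritative definition is the one in subsection 3.3.3. The function name, the tie-handling rule (`>=`), and the scores are hypothetical placeholders, not values from Table 4.

```python
from statistics import mean

def classify_topics(scores):
    """Map each topic to a quadrant, using mean PI and mean RI as the axes."""
    pi_axis = mean(pi for pi, _ in scores.values())
    ri_axis = mean(ri for _, ri in scores.values())
    labels = {}
    for topic, (pi, ri) in scores.items():
        # Assumption: ties on an axis fall on the higher side.
        followed = "Widely Followed" if pi >= pi_axis else "Less Followed"
        kind = "General" if ri >= ri_axis else "Distinct"
        labels[topic] = f"{followed}-{kind} Topics"
    return labels

# Placeholder (PI, RI) pairs, made up for illustration only.
scores = {
    "A": (0.9, 0.2),
    "B": (0.8, 0.7),
    "C": (0.1, 0.1),
    "D": (0.2, 0.8),
}
labels = classify_topics(scores)
```

Running the same function separately on the scores for 1998–2010 and for 2011–2022 would show which topics change quadrant between the two periods.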
The evolution of the field of network science shows a trajectory from early, focused exploration to later mature diversification. As shown in Figure 10, the early period is exemplified by T7, which stood out within the “Widely Followed-Distinct Topics” category, capturing extensive interest for its distinct contribution to network fundamentals. As the discipline progressed, there was a discernible shift toward a broader thematic embrace. T4 and T2 transitioned from the peripheries of “Less Followed-Distinct Topics” to the more engaging “Widely Followed-General Topics,” signaling an increasing tendency to apply these theories practically across various disciplines. This expansion also reflects the ascent of previously niche areas such as T9 and T21. Initially categorized as specialized, these subjects garnered a surge in scholarly attention, positioning them among more generalized topics, indicative of a blend between refined research and versatile applications. Such a trend highlights the gradual shift from a focused study of network principles to the multifaceted, applied research landscape that network science embodies today. The transition also underscores the multidisciplinary impact of network science, where once highly specific topics now inform a broad array of scientific inquiries. This interplay between emerging areas and established disciplines exemplifies the field’s growth and diversification, echoing a more profound understanding of complex systems. As a result, the maturity of the field is reflected not only in the growing diversity of topics but also in the richness of the linkages between the subfields, which provide a sound basis for innovation and interdisciplinary collaboration.
Discussion
The Number of Similar Words Extraction
The extended similar word set improves the semantic richness and precision of topic representations, thereby enhancing topic interpretability and facilitating further analysis. To determine the number of similar words, this study adopts a combined qualitative and quantitative method, as proposed by Yu et al. (2023). In the qualitative analysis, we considered 5, 10, and 20 as candidate numbers of similar words and selected 5 and 10 as the preferred numbers based on their contribution to topic interpretability. In the quantitative analysis, we assessed the impact of the number of similar words by comparing the RI scores of each topic when 5 and 10 similar words are introduced with the RI scores when no similar words are introduced. The results are shown in Table 5.
RI Scores Based on Different Similar Words Extraction.
Note. RI_nos, RI_s5, and RI_s10 represent the RI scores when no similar words are introduced, when five similar words are introduced, and when 10 similar words are introduced, respectively.
In our analysis, it was important to strike a balance between the interpretability of topics and their distinction from each other. Adding five similar words decreased the RI scores for certain topics, such as T2, T19, and T15, implying that these topics were delineated more clearly from others. Although integrating similar words slightly lowered the distinction among topics, it simultaneously increased their interpretability, enriching the semantic depth essential for a comprehensive understanding and detailed examination of each topic’s content. Introducing 10 similar words further augmented topic interpretability yet resulted in a pronounced rise in RI scores, blurring the distinctions among topics. Considering the goal of bolstering topic interpretability while preserving clarity between topics, our combined qualitative and quantitative approach found that introducing five similar words offered the best balance. Therefore, this study adopts the top five similar words as the criterion in related topic analysis tasks.
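The quantitative comparison can be sketched as a per-topic difference against the baseline without similar words. The column names RI_nos, RI_s5, and RI_s10 follow the note to Table 5; the function names and the numeric values are hypothetical placeholders chosen for illustration and are not taken from Table 5.

```python
def ri_shift(ri_table):
    """Per-topic change in RI relative to the no-similar-words baseline."""
    shifts = {}
    for topic, row in ri_table.items():
        base = row["RI_nos"]
        shifts[topic] = {
            "s5": row["RI_s5"] - base,
            "s10": row["RI_s10"] - base,
        }
    return shifts

def mean_shift(shifts, key):
    """Average change in RI across topics for one setting ("s5" or "s10")."""
    return sum(s[key] for s in shifts.values()) / len(shifts)

# Placeholder RI scores, made up for illustration only.
ri_table = {
    "X": {"RI_nos": 0.30, "RI_s5": 0.28, "RI_s10": 0.45},
    "Y": {"RI_nos": 0.40, "RI_s5": 0.42, "RI_s10": 0.60},
}
shifts = ri_shift(ri_table)
```

A negative shift means that a topic became more distinct after similar words were added, and a large positive mean shift signals that topics are blurring into one another.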
Interdisciplinary Insights From Network Science
Interdisciplinary research has become a prominent trend in network science, driving innovations across diverse domains (Omodei et al., 2017; C. Zhang et al., 2024; Zheng et al., 2023). Understanding how emerging topics in network science evolve and diffuse across disciplinary boundaries is crucial for informing both scientific exploration and practical applications. In this context, building on the index-based topic analysis results, we examine representative models and applications across multiple fields to illustrate how different branches of network science contribute to and shape interdisciplinary research trends. This discussion offers an external perspective that helps situate the topic analysis in this study within the broader development of network science.
Interdisciplinary research in network science not only broadens the boundaries of scientific research but also drives technological innovation and theoretical development. For example, complex network modeling (T17) has been used to investigate the root causes of complex phenomena by simulating interactions among systems. Related studies in climate politics and ecology further indicate that network-based models can capture latent organizational structures, multilevel interaction patterns, and stability properties that are difficult to identify with conventional approaches. In climate politics, for example, network analysis has been used to uncover the institutional and corporate configurations underpinning the climate change counter-movement and to highlight the hidden architectures of socio-political influence (Farrell, 2016). In ecology, multilayer network frameworks extend the analysis of ecological systems by incorporating multiple types of interactions and levels of organization, thereby enabling the investigation of high-dimensional and heterogeneous dynamics in nature and linking network architecture to ecosystem robustness and stability (Landi et al., 2018; Pilosof et al., 2017).
Meanwhile, in computer science and artificial intelligence, the development of the dynamics of neural networks (T11), especially the breakthroughs in graph representation learning, has greatly advanced research. Kipf and Welling (2016) proposed graph convolutional networks (GCN) for node classification on graph-structured data, influencing subsequent research on graph networks, and Veličković et al. (2017) proposed graph attention networks (GAT), a graph neural network that aggregates the information of neighboring nodes through the attention mechanism to further improve model performance. In addition, graph representation learning techniques such as node2vec and DeepWalk learn node representations from random walks over the network (Grover & Leskovec, 2016; Perozzi et al., 2014). Some studies have used these techniques to address practical challenges. Wang et al. (2019) proposed a knowledge graph attention network, which explicitly models the high-order connectivity in knowledge graphs in an end-to-end way, thereby improving recommendation performance. Li et al. (2022) argued that graph representation learning would continue to advance machine learning for biomedicine and healthcare applications, including identifying genetic variations in complex traits and elucidating the role of single-cell behaviors and their impact on health. Taken together, these advances demonstrate that graph representation learning can leverage network structures to address practical challenges and broaden research perspectives, thereby strengthening interdisciplinary links among network science, artificial intelligence, and application domains such as recommendation systems and biomedicine (Jin et al., 2021; Wu et al., 2022).
Theoretical and Practical Perspective
From the theoretical perspective, this study provides a methodological extension of topic analysis in scientific domains. The model integrating LDA and SciBERT demonstrates how to combine probabilistic topic-term distributions and contextual semantic information without relying on dimensionality reduction. By extracting similar words for topic terms in contextual semantic spaces, the model enriches topic semantics and improves topic interpretability. In addition, the PI and RI describe topics from the two dimensions of growth and differentiation, and the four-quadrant classification offers a clear structure for understanding topic positions and evolution within a discipline. These elements together provide a theoretical reference for future studies that seek to analyze multidimensional topic features and dynamic topic patterns in large text corpora.
From the practical perspective, the proposed model outputs multidimensional topic features, thereby helping researchers and decision makers better identify key topics and understand their evolution trends. In the network science case, the PI-RI quadrant based on the proposed indices makes it possible to identify Less Followed-Distinct Topics and Less Followed-General Topics. These topics typically correspond to under-explored research gaps, including new questions, data sources, or application scenarios. At the same time, topics identified in the PI-RI quadrant as Widely Followed-General Topics and Less Followed-General Topics can reveal connections between network science and other areas (e.g., social sciences, biomedicine, and engineering), which will be priorities for interdisciplinary collaboration. Specifically, researchers can select topics for future projects and design cross-disciplinary teams, while managers and funding agencies can target resources more precisely toward directions with both scientific potential and practical value.
Conclusion
This study leverages advancements in technology and data analytics to investigate the evolving landscape of network science, focusing on the semantic analysis of research topics and exploring the evolutionary trends within the field. We employ the LDA model for topic extraction and the SciBERT model for further semantic understanding, providing a new perspective for understanding topics. This approach enables a comprehensive examination of the field’s dynamics, from core topic identification to tracking developmental trajectories. Additionally, we introduce PI and RI to analyze topic growth rates and distinctiveness, respectively. Through a time-series analysis, we delineate the shift toward a more balanced and holistic development in network science. This multidimensional perspective highlights the trends and features of the field, thus providing insights for future research.
However, this study has its limitations. First, due to variations in methodologies, theoretical frameworks, and field-specific features, the applicability of the methods proposed in this study requires further validation in datasets from different disciplines. Second, the dataset utilized here was primarily derived from citations to ground-breaking works by Watts and Strogatz (1998) and Barabási and Albert (1999). This approach may not fully capture the entire scope of research in network science and may introduce selection bias to some extent. This bias could affect the comprehensiveness of the identified topics. For example, some application-oriented network studies might cite seminal works in the relevant application domains, rather than the two ground-breaking papers mentioned above, and thus be excluded from our dataset.
In future work, we will incorporate highly impactful papers that cite these two ground-breaking papers as additional foundational papers to expand the dataset. Meanwhile, we note that stronger topic models may improve the quality of the initial topic representation. Therefore, we will consider adopting more advanced topic models to further strengthen this component and enhance the overall model performance, and validate the proposed model in datasets from different disciplines.
Footnotes
Ethical Considerations
The authors state that this research complies with ethical standards. This research does not involve either human participants or animals.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: We acknowledge the financial support from the National Natural Science Foundation of China Grant No. 72204213.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
