Abstract
In the age of rapidly expanding textual data, extracting meaningful insights poses a significant challenge. To address this, we introduce TopS-Key, an advanced framework for automatic keyphrase extraction that integrates natural language processing, principal component analysis (PCA), and a fuzzy decision-making method, the fuzzy technique for order preference by similarity to ideal solution (Fuzzy TOPSIS). Our approach enhances traditional preprocessing by incorporating fuzzy string matching and normalization. To identify the most important semantic keyphrases, we calculate 10 feature scores and apply PCA to reduce the dimensionality of these features while preserving essential information. The Fuzzy TOPSIS method is used to rank keyphrases, treating keyphrase candidates as alternatives and principal components as evaluation criteria. Shannon entropy-based weighting is applied to determine the significance of each criterion in the Fuzzy TOPSIS process. We validate our framework using widely recognized datasets, including DUC 2001, SemEval 2017 Task 10, and Inspec, leveraging similarity and threshold functions. The evaluation metrics precision, recall, and F1 score are analyzed, adjusting thresholds to extract the top 3, 5, and 10 keyphrases. Experimental results show that our TopS-Key framework consistently surpasses existing methods across all datasets, demonstrating superior performance in keyphrase extraction. Furthermore, the TopS-Key model shows significant potential for applications in automated text summarization, information retrieval, and enhanced semantic document analysis, opening avenues for further exploration. Future research could focus on exploring alternative weighting methods for criteria in Fuzzy TOPSIS and applying the framework to other domains with unique challenges in keyphrase extraction.
Introduction
In this era of exponentially growing textual data, extracting meaningful knowledge is a monumental challenge. The internet is becoming increasingly populated with blog posts, news articles, research publications, forum posts, and social media posts from millions of people (Ortiz-Ospina, 2019), as shown in Figure 1.

Data Growth of Both Structured & Unstructured Data Worldwide. Source of Figure 1: https://twitter.com/timothy_hughes/status/619075227021090817.
Identifying descriptive keywords and keyphrases manually for each document is impractical due to the massive amount of text generated daily. Keyword extraction, however, can automatically pull keywords or keyphrases from huge sets of data, providing insight into the topics being discussed. The term keyphrase refers to a phrase that sums up the main points discussed in a document (Siddiqi & Sharan, 2015). In other words, keyphrases are words and phrases that define the contents of the document precisely and compactly.
Due to their ability to provide a concise yet precise summary of a document’s contents, they can be utilized for a variety of purposes. A list of keyphrases can serve as a quick indicator for readers of the relevance of a particular document since it can help them determine whether or not it is relevant to their needs. Using good keyphrases can help users find relevant documents by supplementing full-text indexing. Although keyphrases offer well-known advantages, only a small percentage of documents use them. To address this need, research is being conducted on automated approaches to keyphrase extraction.
In the research domain, researchers provide keywords in their publications to represent the main semantic topics. These keywords can be combined with a search mechanism to find related papers in the relevant literature (Siddiqi & Sharan, 2015). Using keywords to find relevant documents has become easier since automated search methods emerged.
Keyphrase extraction is a fundamental natural language processing (NLP) task (Song et al., 2023b) that identifies and extracts keyphrases from a document so that vital information can be summarized. The majority of existing keyphrase generation systems approach this as a supervised machine learning classification task, learning how true keyphrases differentiate themselves from other possible candidates based on labeled keyphrases (Zhang, 2008).
Our paper proposes a novel method for extracting keyphrases from text. First, we preprocess the text to find potential keyphrases using standard NLP preprocessing techniques, augmented with fuzzy string matching and normalization. Then we compute the feature scores for each potential keyphrase using the following: the term frequency (TF) method, inverse document frequency (IDF), TF–IDF (Ramos, 2003), phrase length, noun phrase density, position score, a semantic score using the latent Dirichlet allocation (LDA) topic model (Onan et al., 2016), and similarity scores using bidirectional encoder representations from transformers (BERT) embeddings (Devlin et al., 2018) with cosine similarity, Euclidean distance, and Manhattan distance. Then, to reduce the number of features while retaining the information they contain, we apply principal component analysis (PCA). Next, we apply the fuzzy technique for order preference by similarity to ideal solution (Fuzzy TOPSIS) to rank the potential keyphrases, which are treated as alternatives. Principal components (PCs) are considered the criteria, and the weights for each criterion are determined using Shannon entropy-based weighting, which quantifies the degree of uncertainty or disorder within the system. The integration of fuzzy logic allows for handling the inherent uncertainty and imprecision in the evaluation process, enhancing the robustness of the ranking. We examined the three evaluation metrics as the similarity threshold changes on all three datasets and calculated their decreasing rates. We also implemented other state-of-the-art approaches on our datasets, calculated their average precision, recall, and F1 scores, compared them with our method, and found that our method outperformed them.
Our approach to keyphrase extraction addresses several limitations and research gaps found in traditional methods, resulting in a more effective and comprehensive solution. A significant shortcoming in many existing techniques is their inability to handle variations in language, such as synonyms, abbreviations, or alternative phrasing. Most methods fail to extract keyphrases when exact matches are required. To overcome this, our approach incorporates fuzzy string matching during preprocessing, enabling the identification of keyphrases even when there are variations in wording. This ensures a more exhaustive extraction of relevant key terms, which is often overlooked in traditional approaches. Traditional keyphrase extraction methods often rely on a limited set of features, which can lead to biased or incomplete results. In our method, we addressed this by considering 10 different feature scores for each potential keyphrase, which evaluate various factors such as frequency, positional importance, semantic similarity, and contextual relevance. By adopting this multidimensional feature analysis, our approach provides a more comprehensive and balanced evaluation of each keyphrase, addressing the gap where single-feature-based methods fall short.
The challenge of handling numerous features, which can introduce noise and reduce the efficiency of extraction processes, is often overlooked. To resolve this, we used PCA to condense the 10 feature scores into a smaller, more manageable set of PCs. This step retains the most relevant information while reducing redundancy and noise, ensuring that keyphrase rankings are driven by essential features. It addresses the issue of feature complexity that many traditional methods struggle with.
In many keyphrase extraction models, the assignment of weights to different features is either arbitrary or based on manual heuristics, which can lead to inconsistency and bias. Our approach eliminates this subjectivity by applying Shannon entropy-based weighting, which calculates the importance of each feature based on its variability in the data. This ensures that the feature weights are determined objectively, improving the reliability of the ranking process. This data-driven method of weighting is an important advancement over conventional techniques that rely on subjective decisions.
Moreover, keyphrase extraction often encounters challenges such as ambiguity, synonymy, and contextual variations in keyphrase meanings. Ambiguity arises when a keyphrase holds multiple possible interpretations depending on its usage context, while synonymy refers to the existence of semantically similar or equivalent terms expressed differently. Contextual differences further complicate the relevance and importance of keyphrases, as the same term may carry varying significance across different domains or textual settings. Our method specifically addresses these challenges through the integration of Fuzzy TOPSIS and Shannon entropy. The use of Fuzzy TOPSIS introduces a robust mechanism for handling ambiguity and imprecision, as fuzzy logic enables the model to assess the closeness of keyphrases to both ideal and anti-ideal solutions without relying on exact, crisp distinctions. This allows for more flexible and nuanced evaluation, accommodating varied interpretations and meanings of keyphrases. Shannon entropy weighting further complements this by objectively assigning weights to features based on their information content and variability, ensuring that no single feature disproportionately influences the ranking process. This data-driven weighting captures contextual variability by giving higher importance to discriminative features in different domains, allowing the model to adapt dynamically to varying textual contexts. Together, Fuzzy TOPSIS and Shannon entropy create a balanced, context-aware framework that effectively mitigates ambiguity, recognizes synonymous expressions, and accounts for contextual differences in keyphrase extraction.
A common limitation of existing keyphrase extraction methods is their inability to effectively balance multiple criteria, often leading to an overemphasis on certain features while neglecting others. This imbalance can result in suboptimal rankings, reducing the accuracy and relevance of the extracted keyphrases. To address this, we employ the Fuzzy TOPSIS, which enhances the traditional TOPSIS approach by incorporating fuzzy logic. This allows for the modeling of uncertainty and imprecision inherent in the evaluation process. By comparing potential keyphrases to both an ideal and an anti-ideal solution, Fuzzy TOPSIS ensures a more nuanced and balanced ranking, providing a comprehensive evaluation across multiple criteria while maintaining objectivity.
Beyond achieving state-of-the-art performance, our proposed keyphrase extraction framework offers substantial practical utility across various domains. In information retrieval systems and academic search engines, accurate keyphrase identification enhances indexing, improves search relevance, and facilitates research summarization. Additionally, the adaptability of our approach makes it ideal for recommendation systems, where keyphrase-based profiling supports personalized content delivery. Its robustness across different text domains further enables applications in news aggregation, content categorization, and educational tools requiring reliable, interpretable keyphrase extraction.
The structure of the paper is as follows: Section 2 reviews relevant literature and related work. Section 3 details the methodology behind the proposed approach. To illustrate its functionality, Section 4 presents a step-by-step example. The experimental setup, datasets, software implementation, and feasibility are discussed in Section 5. Section 6 covers threshold sensitivity analysis, ablation studies, comparative evaluations, and statistical performance assessments. In Section 7, we analyze the results and explore the limitations and challenges of the approach. Finally, in Section 8, we discuss the conclusion and future scope of the methodology.
Literature Survey
The keyphrase extraction task was introduced by Turney, who defined it as “the automatic selection of important and topical phrases from the body of a document” (Turney, 2000). A series of advances in keyphrase extraction methods has taken place in the past two decades, from traditional approaches to technologies based on deep learning. Automatically extracting relevant keywords from text can be accomplished using various techniques. There are many different ways to do this, from simple techniques such as counting the number of different words in the text to more complex techniques such as graph-based methods, supervised methods, etc. Let us discuss these methods.
Statistical Approach
A statistical approach to keyphrase extraction has evolved from foundational statistical measures to sophisticated language models, resulting in more nuanced and accurate methods of identifying keyphrases.
In keyphrase extraction, TF is the most influential statistical approach. Ramos (2003) used TF–IDF to select query expansion terms from documents. Further improvements to the statistical approach have been made using other methods. Campos et al. (2018) proposed a statistical approach based on casing, sentence position, TF, and the dispersion of keyphrases. Rose et al. (2012) combined word degree, word frequency, and degree-to-frequency ratios into an unsupervised algorithm that can be applied to any domain, and Sarkar et al. (2010) incorporated TF–IDF into the feature vectors used as input to neural network models. Liu et al. (2019a) used another statistical method, the LDA model, combined with multifeature weighting to extract keywords. Omuya et al. (2021) enhanced PCA for feature selection, improving classification performance by optimizing feature representation.
Graph-Based Approach
Through the evolution of graph-based models, several advancements have been achieved in keyword extraction. Many algorithms were built upon the foundational HITS (Hung et al., 2010) and PageRank (Joshi & Patel, 2018) algorithms, such as TextRank (Mihalcea & Tarau, 2004), in which text is represented as a graph and terms are ranked according to their centrality within it. The PositionRank algorithm (Florescu & Caragea, 2017) uses graph-based ranking, positional information, and various heuristics to examine the content and position of words in a document to identify potential keyphrases, while the RAKE algorithm (Rose et al., 2012) creates a graph wherein nodes represent words and edges between them signify relationships. More recently, graph-based methods that use NLP techniques have been developed, such as FRAKE (Zehtab-Salmasi et al., 2021), which employs a combination of textual features and graph centralities to extract keywords, and KECNW (Biswas et al., 2018; Chen et al., 2019), which exploits word co-occurrence graph features to extract keywords. Recent innovations include SDRank (Xu et al., 2025), which refines graph-based rankings by incorporating both local and global dependencies. Additionally, HCUKE (Xu et al., 2024) enhances graph-based ranking by integrating hierarchical clustering to group related phrases before applying graph centrality-based scoring.
Supervised Approach
In supervised keyword extraction models, the keywords in the dataset are annotated so that a model can be trained on the labeled data. Zhang et al. (2006) used a support vector machine (SVM) classification model for keyword/keyphrase extraction, and Zhang (2008) treated keyword extraction as string labeling, applying a conditional random fields model. Hong and Zhen (2012) proposed an extended TF method that used linguistic features of a keyword and an SVM model for keyword extraction and optimization of the results. More recent supervised approaches use BERT and ExEm to represent candidate keywords as vectors and perform classification by treating keyphrase extraction as a sequence labeling problem. A recent advancement, BRYT (Ahmed et al., 2024), builds upon statistical principles by incorporating multiple ranking strategies and ChatGPT.
Other Approaches
Other approaches include neural and embedding-based models. Munoz (1997) used fuzzy adaptive resonance theory neural networks to represent words as continuous vectors. Pennington et al. (2014) utilized matrix factorization techniques to create word vectors based on global statistics of word co-occurrence, and the BERT model was proposed by Devlin et al. (2018). Liu et al. (2019b) proposed RoBERTa, while Grootendorst (2020) proposed KeyBERT, which uses BERT embeddings and cosine similarity. Automatic extractive text summarization (EATS) generates keywords by topic modeling and performs summarization using clustering schemes driven by genetic algorithms (Hernández-Castañeda et al., 2022). Zhou et al. (2024) incorporated structure and position information in keyphrase extraction, and AdaptiveUKE (Liu et al., 2024) used gated topic modeling. IKDSumm (Garg et al., 2024) integrates keyphrases into BERT for summarizing disaster tweets. Recent studies have adapted TOPSIS for text ranking, including extractive summarization (Pati & Rautray, 2024) and robust hierarchical decision-making (Corrente & Tasiou, 2023), demonstrating its potential for keyphrase extraction. Additionally, optimization techniques for tokenization and memory management have been explored to enhance NLP processing efficiency in multilingual applications (Dodić et al., 2025). Furthermore, Fuzzy TOPSIS has also been explored in ranking and evaluation tasks, demonstrating its adaptability to text analysis; Wang (2022) integrated NLP techniques with Fuzzy TOPSIS for product evaluation.
Various techniques, including statistical methods, graph-based approaches, supervised learning, embedding-based models, PCA, TOPSIS, and fuzzy TOPSIS, have been individually employed for different aspects of keyword extraction and ranking. However, our approach differs by integrating multiple methodologies in a unified framework, leveraging their complementary strengths to enhance keyword extraction effectiveness and robustness.
Proposed Methodology
In this work, we propose a methodology that employs optimal combinations of NLP techniques, Shannon entropy, PCA, and Fuzzy TOPSIS. The flowchart in Figure 2 depicts the steps involved in implementing the proposed model. Our methodology is divided into four major steps: data preprocessing, feature scoring, dimensionality reduction using PCA, and keyword ranking using the Fuzzy TOPSIS method with Shannon entropy-based weights. Each of these steps is explained in detail in the following subsections.

Workflow of the Proposed TopS-Key Framework Integrating Preprocessing, Feature Scoring, PCA-Based Dimensionality Reduction, Entropy Weighting, and Fuzzy TOPSIS Ranking. Note. PCA = principal component analysis; Fuzzy TOPSIS = fuzzy technique for order preference by similarity to ideal solution.
As a first step, we preprocess the data. Along with the standard preprocessing steps, including tokenization, punctuation removal, stopword removal, POS tagging, and noun phrase extraction, we also perform fuzzy string matching and normalization.
In fuzzy string matching (Hanani et al., 2021), strings are compared for similarity despite minor variations and differences between them. We use fuzzy string matching because it is flexible: it accommodates differences in spelling, spacing, or slight alterations of strings. The token set ratio, calculated using equation (1), takes into account the intersection and union of the token sets of two strings to measure their similarity. We use a similarity threshold of 90 to enforce a relatively strict match between strings; lowering the threshold relaxes this strictness and admits more phrases, creating noise in the form of unrelated noun phrases and thereby increasing the number of potential keyphrases.
In the normalization step, the fetched noun phrase is split into individual words, joined back together with a solitary space, and any extra spaces removed between words. After the preprocessing, we obtain a list of potential keyphrases, which are further analyzed to pull the actual keyphrases.
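To make these two preprocessing utilities concrete, the sketch below shows one way to implement the token set ratio comparison and the normalization step; it assumes the thefuzz package (formerly fuzzywuzzy), and the function names are illustrative rather than taken from our implementation.

```python
# Minimal sketch of the fuzzy matching and normalization steps described
# above; `thefuzz` is an assumed library choice for the token set ratio.
from thefuzz import fuzz

def normalize(phrase: str) -> str:
    """Split the noun phrase into words and rejoin with single spaces."""
    return " ".join(phrase.split())

def is_variant(a: str, b: str, threshold: int = 90) -> bool:
    """Treat two phrases as variants when their token set ratio meets the threshold."""
    return fuzz.token_set_ratio(normalize(a), normalize(b)) >= threshold

print(is_variant("deep  learning model", "model deep learning"))  # True: same token set
```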
In this step, we generate feature scores for each potential keyphrase. We consider 10 major features: TF, IDF, the TF–IDF combination, phrase length, noun phrase density, position score, the latent Dirichlet semantic score, BERT embeddings with cosine similarity, BERT embeddings with Euclidean distance, and BERT embeddings with Manhattan distance. Let us discuss these features in detail.
TF: We generate the first score using the statistical frequency-based approach, that is, the TF method. The score is calculated by counting how many times each potential keyphrase appears in the document and then normalizing by the number of potential keyphrases in the document. IDF: We generate the second score using IDF, which measures how much information a term provides by comparing the total number of documents with the number of documents containing the term. It measures how rare a phrase is across a set of documents and gives more weight to terms that are less frequent across all documents and, thus, potentially more informative. Using equation (2), $\mathrm{IDF}(t) = \log\left(N / n_t\right)$, where $N$ is the total number of documents and $n_t$ is the number of documents containing term $t$, the IDF score is calculated for each potential keyphrase.
TF–IDF: We generate the third score using TF–IDF, which combines how often a phrase appears in a specific document with how rare the phrase is across all documents. It gives higher importance to terms that appear frequently in a specific document but in few other documents. Phrase Length: The fourth feature is phrase length, which simply counts the number of words in the noun phrase. It indicates the complexity or specificity of the phrase. Noun Phrase Density: We generate the fifth score by calculating noun phrase density, which measures how much of the document is covered by the noun phrase as a proportion of the total document length. Using equation (3), the noun phrase density is calculated for each potential keyphrase.
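As a rough illustration of the three frequency-based scores described above, the following sketch computes TF, IDF, and TF–IDF for a single candidate; the substring-based counting and the IDF smoothing are simplifying assumptions, not our exact formulation.

```python
import math

def frequency_scores(phrase, document, candidates, corpus):
    """TF, IDF, and TF-IDF for one candidate phrase (illustrative).

    document: the current document as a lowercase string.
    candidates: all potential keyphrases extracted from that document.
    corpus: the full list of documents, used only for the IDF term.
    """
    tf = document.count(phrase) / len(candidates)   # TF normalized by candidate count
    n_docs = sum(phrase in doc for doc in corpus)   # documents containing the phrase
    idf = math.log(len(corpus) / (1 + n_docs))      # one common smoothing choice
    return tf, idf, tf * idf
```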
Position: We generate the sixth score by measuring the position score, which captures the location of a noun phrase within a document relative to the total length of the document. It reflects how early or late a noun phrase appears, giving more importance to phrases that appear earlier in the text. The idea is that phrases appearing earlier in a document may be more relevant, especially in documents such as research papers, where key information is often introduced early. LDA model: The seventh score we calculate is the semantic score using an unsupervised, probabilistic, topic-modeling-based LDA model. In this model, a corpus is created for each group of potential keyphrases and converted into a bag-of-words representation. The LDA model is then constructed considering two topics; it uncovers latent topics and clusters phrases with similar meanings according to how they are used in the paragraph. Two types of probability distributions are generated: the probability distribution of topics within a paragraph, which shows the likelihood that the paragraph belongs to a particular topic, and the probability distribution of phrases within a topic. To calculate both of these probabilities, we use the Dirichlet distribution (Onan et al., 2016). We then measure the similarity between the topic distributions of the paragraph and of each candidate phrase by calculating Hellinger's distance between them using equation (4): $H(P, Q) = \frac{1}{\sqrt{2}} \sqrt{\sum_{i} \left(\sqrt{p_i} - \sqrt{q_i}\right)^2}$.
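The sketch below illustrates this topic-based score with gensim, treating the paragraph and each candidate phrase as short documents; taking the complement of Hellinger's distance as the semantic score is our reading of equation (5), and the toy corpus is purely illustrative.

```python
import numpy as np
from gensim.corpora import Dictionary
from gensim.models import LdaModel

paragraph = "fuzzy topsis ranks candidate keyphrases by closeness to the ideal solution"
candidates = ["fuzzy topsis", "candidate keyphrases", "ideal solution"]

def hellinger(p, q):
    """Hellinger distance between two discrete distributions (equation (4))."""
    return np.sqrt(0.5 * np.sum((np.sqrt(p) - np.sqrt(q)) ** 2))

# Build a tiny corpus: the paragraph plus each candidate phrase.
texts = [paragraph.split()] + [c.split() for c in candidates]
dictionary = Dictionary(texts)
bows = [dictionary.doc2bow(t) for t in texts]
lda = LdaModel(bows, id2word=dictionary, num_topics=2, random_state=0)

def topic_vector(bow, num_topics=2):
    """Dense topic distribution for one bag-of-words document."""
    vec = np.zeros(num_topics)
    for topic_id, prob in lda.get_document_topics(bow, minimum_probability=0.0):
        vec[topic_id] = prob
    return vec

para_topics = topic_vector(bows[0])
# Assumed form of equation (5): semantic score = 1 - Hellinger distance.
semantic_scores = [1.0 - hellinger(para_topics, topic_vector(b)) for b in bows[1:]]
```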
Then, we finally calculate the semantic score using equation (5), which converts Hellinger's distance into a similarity so that phrases topically closer to the paragraph receive higher scores. BERT model: Using the BERT model, mean-pooled BERT embeddings for both the paragraph and the potential keyphrases are calculated. These embeddings capture the semantic meaning of each phrase and quantify its semantic similarity or relevance in the context of the paragraph. The mean-pooled embedding for the paragraph is obtained by taking the mean of its token-level embeddings; it is a condensed representation of the entire paragraph, capturing its semantic content in a fixed-size vector. We then calculate three types of distance between the phrase and paragraph embeddings. Euclidean Distance: It measures the straight-line distance between the embeddings of the noun phrase and the document using equation (6): $d_2(a, b) = \sqrt{\sum_{i} (a_i - b_i)^2}$. The closer the noun phrase is to the document in the embedding space, the more semantically similar they are considered to be.
Cosine Similarity: It is the degree of similarity between the BERT embedding of each potential keyphrase and that of the document. It is calculated using equation (7): $\cos(a, b) = \frac{a \cdot b}{\lVert a \rVert \, \lVert b \rVert}$.
Manhattan Distance: It is also known as the L1 distance; it sums the absolute differences between the components of the embeddings of the noun phrase and the document. It is measured using equation (8): $d_1(a, b) = \sum_{i} \lvert a_i - b_i \rvert$.
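A compact sketch of the three embedding-based scores follows; it assumes the Hugging Face transformers package and the bert-base-uncased checkpoint, since the exact checkpoint is not named above.

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(text: str) -> torch.Tensor:
    """Mean-pooled token embeddings from the last hidden layer."""
    inputs = tokenizer(text, return_tensors="pt", truncation=True)
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state   # (1, tokens, 768)
    return hidden.mean(dim=1).squeeze(0)             # fixed-size phrase/paragraph vector

doc_vec = embed("fuzzy topsis ranks candidate keyphrases")
phrase_vec = embed("fuzzy topsis")

cosine = torch.nn.functional.cosine_similarity(doc_vec, phrase_vec, dim=0)  # equation (7)
euclidean = torch.dist(doc_vec, phrase_vec, p=2)                            # equation (6)
manhattan = torch.dist(doc_vec, phrase_vec, p=1)                            # equation (8)
```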
Thus, as a result, we get these 10 feature scores for each potential keyphrase.
The selection of 10 distinct features for scoring potential keyphrases is based on a thorough review of prior research and the need to capture diverse, complementary aspects of keyphrase relevance. These features span statistical, syntactic, structural, and semantic dimensions to ensure a well-rounded evaluation. Statistical features such as TF, IDF, and TF–IDF capture the importance and uniqueness of terms within a document and across the corpus. Syntactic and structural features—such as phrase length, noun phrase density, and position score—reflect common patterns of keyphrases, which often appear early, are moderately long, and follow noun phrase structures. Semantic features add contextual depth: LDA scores align phrases with the document’s topics, while BERT-based similarity evaluates how well phrases represent overall content. Euclidean and Manhattan distances provide additional ways to measure semantic closeness, complementing the BERT score. This carefully selected set of features avoids redundancy while maximizing relevance and coverage. Reducing the set may miss important cues, while adding more can introduce noise. Thus, the 10 features strike a strong balance between theoretical soundness and practical effectiveness for keyphrase scoring.
Then, we apply PCA to reduce the number of features while retaining most of the information they provide (Esposito et al., 2022; Marukatat et al., 2023).
Using PCA, data is transformed into a set of linearly uncorrelated variables called PCs, thereby reducing the dimensionality. In order to retain most of the variance, these components are arranged in such a way that the first few retain the most. As a result of PCA, information is compressed into a smaller number of components by transforming it into a new coordinate system that maximizes variance. This allows for the retention of essential patterns while discarding noise or less important information.
Firstly, we standardize the feature scores so that each feature has zero mean and unit variance, using equation (9): $z_{ij} = (x_{ij} - \mu_j)/\sigma_j$, where $\mu_j$ and $\sigma_j$ denote the mean and standard deviation of feature $j$. Afterward, we compute the covariance matrix of the standardized data using equation (10): $C = \frac{1}{n-1} Z^{\top} Z$, where $Z$ is the standardized data matrix and $n$ is the number of potential keyphrases.
Then, we compute the eigenvalues and eigenvectors of this covariance matrix. Eigenvectors indicate the new PCs’ directions, while eigenvalues illustrate how much variance they capture.
We then order the resulting components by the amount of variance they capture. The amount of variance explained by a particular PC is related to the corresponding eigenvalue of that component; the eigenvalue represents the variance associated with the PC. The explained variance ratio (EVR) of a particular PC is calculated using equation (11): $\mathrm{EVR}_k = \lambda_k \big/ \sum_{j} \lambda_j$, where $\lambda_k$ is the eigenvalue of the $k$th PC.
Finally, the data is projected onto the PCs, creating a new representation of the data as $T = ZW$, where $W$ is the matrix whose columns are the selected eigenvectors.
Thus, we get a set of new features, the PCs, which are linear combinations of the original weighted features. Since our goal in applying PCA is to reduce the dimensionality by identifying the most important components that account for the majority of the variance while ignoring the components that contribute less, we evaluate the cumulative variance to determine how many PCs are needed to reach a certain percentage of the variance using equation (12): $\mathrm{CV}_m = \sum_{k=1}^{m} \lambda_k \big/ \sum_{j} \lambda_j$. For our analysis, we select the PCs that together account for 90% of the variance.
We select the PCs until the cumulative variance reaches 90%. The number of PCs required to explain 90% of the variance can vary depending on the document.
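The following sketch shows this selection with scikit-learn, where passing a float to n_components keeps exactly enough PCs to reach the requested cumulative variance; the random score matrix is a placeholder for the real (candidates x 10 features) matrix.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
scores = rng.random((18, 10))          # placeholder: 18 candidates x 10 feature scores

standardized = StandardScaler().fit_transform(scores)   # standardization (equation (9))
pca = PCA(n_components=0.90)           # keep PCs up to 90% cumulative variance
components = pca.fit_transform(standardized)

print(components.shape)                          # (18, m) with m chosen automatically
print(pca.explained_variance_ratio_.cumsum())    # cumulative variance (equation (12))
```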
PCA is used in our framework to address potential redundancy and correlations among the selected features. By transforming the original features into uncorrelated PCs, PCA preserves essential information while enhancing robustness and computational efficiency. This dimensionality reduction step ensures that the ranking process remains stable and efficient, free from overlapping signals, making it a crucial component of our methodology. Although PCA captures only linear relationships, we mitigate this limitation through several design choices. Specifically, PCA is applied independently for each document, which helps retain local contextual and statistical patterns specific to that document. Additionally, our framework constructs a modular feature space comprising semantic, statistical, and structural features, prior to dimensionality reduction, which ensures that key nonlinear relationships are already embedded in the input space. Finally, the downstream fuzzy TOPSIS-based ranking with entropy-adaptive weights introduces nonlinear decision modeling, effectively compensating for any minor loss of nonlinear dependencies during PCA. These measures collectively ensure that PCA remains both reliable and well-suited to our keyphrase extraction architecture.
Multicriteria decision-making (MCDM) approaches incorporate multiple, often conflicting, criteria to assess and select from a variety of alternatives. In this work, we apply Fuzzy TOPSIS (Nădăban et al., 2016) to rank the potential keyphrases and identify the best ones; the method incorporates fuzzy logic to handle uncertainty and imprecision in the decision-making process.
Following dimensionality reduction, we employ an enhanced variant of the Fuzzy TOPSIS technique to rank the candidate keyphrases. Fuzzy TOPSIS extends the conventional TOPSIS method by incorporating fuzzy logic principles, which enable it to effectively manage the inherent uncertainty and imprecision present in linguistic assessments and heterogeneous feature evaluations. The procedure comprises several key stages: initially, the construction of a fuzzy decision matrix derived from the evaluated feature values; subsequently, the calculation of feature weights utilizing Shannon entropy to ensure objectivity; next, the identification of the fuzzy positive ideal solution (FPIS) and fuzzy negative ideal solution (FNIS); and finally, the computation of the relative closeness coefficient for each candidate keyphrase.
To further refine the standard Fuzzy TOPSIS framework for the keyphrase extraction task, we introduce several significant improvements. Rather than assigning feature weights manually, we leverage Shannon entropy to automatically derive feature importance scores, ensuring a data-driven and unbiased weighting strategy based on the intrinsic variability of each feature across all candidate keyphrases. Additionally, to mitigate redundancy and retain only the most informative features, we employ PCA, treating the resulting PCs as criteria within the Fuzzy TOPSIS model. This approach enhances both computational efficiency and ranking effectiveness.
In our model, each candidate keyphrase is conceptualized as an alternative, systematically evaluated against the PCA-derived feature criteria. We incorporate a diverse set of syntactic, semantic, structural, and statistical features, ensuring their normalization and consistent integration within the fuzzy decision matrix. This uniform treatment facilitates equitable comparisons across varying feature types. Furthermore, we design customized linguistic scales and membership functions to represent feature evaluations as fuzzy numbers, enabling more accurate modeling of the inherent ambiguity and vagueness associated with NLP and multifaceted feature interpretation.
In order to get started with fuzzy TOPSIS, we define criteria, alternatives, and weights for each criterion. We consider potential keyphrases as the alternatives that we need to rank to find the predicted keyphrases. The criteria are the PCs that we found out in the previous section, and now we need to find the weights for these criteria. For finding the criteria weights, we used Shannon entropy-based weighting.
Shannon Entropy-Based Weighting
Shannon entropy is a concept from information theory introduced by Claude Shannon in 1948 (Shannon, 2001) that measures the amount of uncertainty or disorder in a system. To assign the importance of each feature criterion in a systematic and impartial manner during the Fuzzy TOPSIS process, we adopt Shannon entropy-based weighting. This method provides a principled, data-driven mechanism for quantifying the significance of each criterion, avoiding subjective bias and manual intervention.
The fundamental strength of Shannon entropy lies in its ability to capture the inherent diversity or unpredictability of each feature across all candidate keyphrases. A feature that exhibits high variability among keyphrases conveys more discriminative information, as it better differentiates between alternatives. Conversely, a feature with low variability contributes little to the decision-making process. By applying entropy, we ensure that more informative features are prioritized through higher weights, while less informative, redundant, or homogeneous features receive proportionally lower influence.
Moreover, the entropy-based approach is highly adaptable to different datasets and feature distributions. It automatically adjusts the weights based on the statistical behavior of the features without requiring prior assumptions or domain-specific tuning. This not only enhances the objectivity and scalability of the ranking framework but also ensures that the weighting process reflects the true informational content of the features, thereby improving the overall robustness and effectiveness of keyphrase selection.
To operationalize the entropy-based weighting, we begin by normalizing each PC value such that the sum of values for a given PC across all candidate keyphrases equals 1. This normalization step ensures that the values are proportional and suitable for subsequent entropy computation. The normalization for a PC $j$ is given by equation (13): $p_{ij} = x_{ij} \big/ \sum_{i=1}^{n} x_{ij}$, where $x_{ij}$ is the value of PC $j$ for candidate keyphrase $i$ and $n$ is the number of candidates.
Then, we calculate Shannon entropy-based weights for each PC across the potential keyphrases of a given document. These weights assess the relative importance of each PC in determining the significance of a potential keyphrase and help identify which PC is more informative by analyzing how diverse or uniform its values are. If a PC has high entropy, its values are more evenly spread out, indicating a high degree of uncertainty; conversely, low entropy means the values are more concentrated, suggesting that the information is more predictable or contains fewer variations. The entropy of PC $j$ is given by equation (14): $e_j = -\frac{1}{\ln n} \sum_{i=1}^{n} p_{ij} \ln p_{ij}$.
Then, we evaluate the degree of diversification of each PC using equation (15): $d_j = 1 - e_j$.
The weight of a PC is proportional to its degree of diversification. PCs with higher diversification, that is, lower entropy, are assigned higher weights because they carry more distinct information. The weight of PC $j$ is computed using equation (16): $w_j = d_j \big/ \sum_{k} d_k$.
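A short sketch of equations (13) to (16) over a PC score matrix is given below; shifting each column to be positive before normalization is an assumption we add so that the logarithm is defined when PC scores are negative.

```python
import numpy as np

def entropy_weights(X):
    """Entropy-based criterion weights; X is (n candidates x m PCs)."""
    X = X - X.min(axis=0) + 1e-12                    # assumed shift to positive values
    P = X / X.sum(axis=0)                            # normalization (equation (13))
    n = X.shape[0]
    e = -(P * np.log(P)).sum(axis=0) / np.log(n)     # entropy per PC (equation (14))
    d = 1.0 - e                                      # diversification (equation (15))
    return d / d.sum()                               # weights (equation (16))

weights = entropy_weights(np.random.default_rng(0).random((18, 5)))
print(weights, weights.sum())                        # weights sum to 1
```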
Now, we continue with our Fuzzy TOPSIS calculations. We first construct a fuzzy decision-making matrix whose rows are the candidate keyphrases (the alternatives) and whose columns are the PCs (the criteria), with each entry given by the weighted PC score of the corresponding candidate.
In the traditional TOPSIS method, crisp values are used for the evaluation, which may not adequately represent uncertainty. To resolve this limitation, each crisp value is converted into a triangular fuzzy number $(l, m, u)$, defined by a lower bound, a modal value, and an upper bound.
After the construction of the fuzzy decision matrix, we identify the FPIS and FNIS using equation (19). The FPIS represents the most desirable values across all criteria, capturing the highest level of performance, whereas the FNIS represents the least desirable values, capturing the lowest level of performance; they are denoted by $A^{+}$ and $A^{-}$, respectively.
These ideal solutions will be used as the benchmark values against which all the alternatives are compared.
Then, to ensure comparability across criteria that may have varying scales or units, we normalize the fuzzy values using equation (20): $\tilde{r}_{ij} = \left(l_{ij}/u_j^{*},\; m_{ij}/u_j^{*},\; u_{ij}/u_j^{*}\right)$, where $u_j^{*} = \max_i u_{ij}$ for benefit criteria.
Then, we calculate the distance of each alternative from both the FPIS and FNIS using the Euclidean (vertex) distance formula for triangular fuzzy numbers. These distances quantify how close each alternative is to the ideal solutions, helping to determine the relative desirability of the alternatives. The distance between two triangular fuzzy numbers $\tilde{a} = (a_1, a_2, a_3)$ and $\tilde{b} = (b_1, b_2, b_3)$ is given by equation (21): $d(\tilde{a}, \tilde{b}) = \sqrt{\frac{1}{3}\left[(a_1 - b_1)^2 + (a_2 - b_2)^2 + (a_3 - b_3)^2\right]}$.
Now, the total distances from the FPIS and FNIS for each alternative $i$ are computed using equation (22): $D_i^{+} = \sum_{j} d(\tilde{v}_{ij}, \tilde{v}_j^{+})$ and $D_i^{-} = \sum_{j} d(\tilde{v}_{ij}, \tilde{v}_j^{-})$.
Then, we evaluate the closeness coefficient of each alternative using equation (23): $CC_i = D_i^{-} \big/ (D_i^{+} + D_i^{-})$, which lies between 0 and 1, with values closer to 1 indicating alternatives nearer the ideal solution.
Finally, the alternatives are ranked in descending order of their closeness coefficients $CC_i$, and the top-ranked candidates are taken as the predicted keyphrases.
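The whole ranking step can be sketched as below; the triangular fuzzification of a weighted score v as (0.9v, v, 1.1v) and the assumption of nonnegative scores are illustrative choices, and the normalization of equation (20) is omitted for brevity.

```python
import numpy as np

def fuzzy_topsis(scores, weights):
    """Rank alternatives; scores is (n x m) nonnegative, weights is (m,)."""
    v = scores * weights                              # weighted decision matrix
    tri = np.stack([0.9 * v, v, 1.1 * v], axis=-1)    # (n, m, 3) triangular numbers
    fpis = tri.max(axis=0)                            # best fuzzy value per criterion
    fnis = tri.min(axis=0)                            # worst fuzzy value per criterion

    def dist(a, b):                                   # vertex distance (equation (21))
        return np.sqrt(((a - b) ** 2).mean(axis=-1))

    d_pos = dist(tri, fpis).sum(axis=1)               # total distance to FPIS (eq. (22))
    d_neg = dist(tri, fnis).sum(axis=1)               # total distance to FNIS (eq. (22))
    cc = d_neg / (d_pos + d_neg)                      # closeness coefficient (eq. (23))
    return np.argsort(-cc), cc                        # descending rank order

rng = np.random.default_rng(0)
order, cc = fuzzy_topsis(rng.random((18, 5)), np.full(5, 0.2))
print(order[:5])                                      # indices of the top-5 candidates
```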
We specifically adopt the Fuzzy TOPSIS method for keyphrase ranking due to several key advantages. First, keyphrase extraction inherently involves uncertainty and imprecision, especially when assessing multiple diverse features such as syntactic, semantic, and statistical components. Traditional ranking methods may not effectively capture this vagueness. Fuzzy TOPSIS integrates fuzzy logic, allowing us to model such uncertainty more accurately by representing feature evaluations as fuzzy numbers. Second, TOPSIS inherently considers both the best (positive ideal) and worst (negative ideal) criteria values, ensuring that selected keyphrases are not only close to desirable characteristics but also far from undesirable ones. This dual consideration provides a balanced and robust decision-making framework. Furthermore, by integrating Shannon entropy-based weights, we ensure data-driven, unbiased feature importance assignment without manual intervention. The combination of Fuzzy TOPSIS and entropy weighting thus allows us to objectively and effectively rank keyphrases while handling data ambiguity and avoiding arbitrary weight selection. Despite this expressiveness, the computational overhead remains tractable.
The overall time complexity of the proposed framework can be decomposed into two components, corresponding to the contextual embedding and the downstream processing stages. Formally, the cumulative complexity is expressed as equation (24): $O(n^{2} d) + O(k m)$, where $n$ is the input token length, $d$ is the embedding dimension, $k$ is the number of candidate keyphrases, and $m$ is the number of retained PCs.
The first term encapsulates the BERT-based contextual embedding phase, wherein the complexity arises from the multihead self-attention mechanism that scales quadratically with the input token length $n$. The second term covers the downstream stages, namely feature scoring, PCA projection, entropy weighting, and Fuzzy TOPSIS ranking, all of which operate on the much smaller candidate-by-criteria matrix.
However, since both $n$ and $k$ are bounded in practice, $n$ by the model's maximum sequence length and $k$ by the number of noun phrases in a document, the overall cost remains modest, which is consistent with the CPU-only experiments reported in the Software Implementation subsection.
Illustrative Example
In this section, we provide a detailed illustration of each step involved in the proposed methodology. The process begins with the preprocessing of the text data, followed by feature scoring, dimensionality reduction, and finally, keyword ranking.
Let us consider an example with the title and abstract of a research paper (Pravia et al., 2002) as shown in Figure 3.

Paragraph and Actual Keyphrases of the First Entry of the Inspec Dataset.
Preprocessing is performed on the paragraph. After standard NLP preprocessing, fuzzy string matching, and normalization, we obtain 18 potential keyphrases, which serve as the alternatives in the subsequent ranking. We then compute the 10 feature scores for each candidate and apply PCA; the heatmap in Figure 4 shows the influence of each feature on the PCs.

Heatmap Depicting the Influence of Features on the Principal Components.
Each cell in the heatmap corresponds to the contribution of a specific feature to a particular PC, with the color intensity indicating the magnitude of that contribution. Darker cells indicate positive values, meaning a feature contributes strongly and positively to the associated PC; lighter cells indicate negative values, suggesting a weaker or negative contribution.
The number of PCs needed to reach at least 90% cumulative variance is selected. This ensures that the selected PCs retain a significant amount of the original variance. For this example, we have five PCs, whose scores for each potential keyphrase are listed below.
Principal Component Score for Each Potential Keyphrase.
PCs represent directions of the data along which variance is maximized. The coefficients associated with the features can be positive or negative: a positive coefficient means the feature increases the value of that PC, whereas a negative coefficient decreases it.
Then, we applied the MCDM technique, Fuzzy TOPSIS, to rank the potential keyphrases and obtain the final keyphrases. We have 18 alternatives to rank and five criteria; using Shannon entropy-based weighting, we obtained the weight of each criterion, listed below.
Shannon Entropy Weights for Each Criterion (Top Principal Components (PCs)).
After that, we apply Fuzzy TOPSIS to rank our potential keyphrases. The fuzzy decision matrix is constructed from the entropy-weighted PC scores, as shown below.
Weighted Fuzzy Decision-Making Matrix
Then, we transform these crisp weighted values into triangular fuzzy numbers, as shown below.
Triangular Fuzzy Membership Values of Each Weighted Score.
Then, we evaluate the FPIS and FNIS using equation (19), normalize the matrix using equation (20), and find the distance of each potential keyphrase from both the FPIS and FNIS using equations (21) and (22) to evaluate how close it is to the ideal solution. Finally, we evaluate the closeness coefficient of each potential keyphrase using equation (23) and rank the candidates accordingly, as shown below.
In this way, we are able to get ranked potential keyphrases, of which the top-ranked are considered to be the final predicted keyphrases.
Fuzzy TOPSIS Score and Rank for Each Potential Keyphrase.
Experimental Setup
This section outlines the experimental setup used to validate the proposed methodology. It includes a description of the datasets employed, details of the software environment and computational resources, and the evaluation metrics used to assess the effectiveness and efficiency of the approach. Each of these aspects is discussed in the following subsections.
Datasets
As part of the evaluation process, we use three publicly available benchmark datasets for keyphrase extraction. Note that some gold keyphrases are absent from the original text and some cannot be recognized as candidate keyphrases; extracting 100% of the keyphrases is therefore theoretically impossible.
The first one is the Inspec dataset (Hulth, 2003), from which we retrieved 2,000 English abstracts with their titles and keywords. The abstracts, drawn from journal papers published from 1998 to 2002, cover the fields of computers and control and information technology. Of the three datasets, it has the shortest documents, that is, the fewest words per document.
The second dataset is SemEval 2017 Task 10 (Augenstein et al., 2017), which consists of 493 paragraphs derived from open-access ScienceDirect publications evenly distributed between computer science, material science, and physics.
The third dataset is DUC2001 (Wan & Xiao, 2008), which comprises 308 news articles gathered from TREC-9. The articles cover 30 news topics and contain 740 words on average. Of the three datasets, it has the longest documents, that is, the most words per document.
Software Implementation
The proposed methodology is implemented using Python 3.10.12, with relevant libraries including NumPy 1.24.3, pandas 2.0.3, SciPy 1.10.1, and scikit-learn 1.3.2. Linguistic preprocessing tasks, including tokenization, lemmatization, and part-of-speech tagging, are performed using SpaCy 3.5.0. All experiments are conducted on a local machine equipped with an Intel Core i5-1035G1 CPU @ 1.00 GHz, 8 GB RAM, and a 64-bit Windows 11 Home Single Language operating system. No GPU acceleration is utilized, underscoring the lightweight and scalable nature of the proposed approach. This setup ensures the reproducibility of the research on standard computational resources.
Furthermore, the computational efficiency of the proposed methodology supports its applicability in real-time and large-scale scenarios. The PCA operates on a compact set of 10 feature dimensions, ensuring minimal processing time. Similarly, Fuzzy TOPSIS is applied to a limited set of candidate phrases per sentence, keeping the decision process lightweight. Empirical evaluation confirms that the complete pipeline executes efficiently on a standard CPU-based system without the need for GPU acceleration or high-end hardware. For more demanding real-time applications or larger textual datasets, further optimizations such as incremental PCA and approximation techniques for fuzzy decision-making can be integrated to enhance scalability. Additionally, recent advancements in memory and computational optimization, as highlighted in Dodić and Regodić (2024), offer promising avenues to reduce resource consumption without compromising accuracy. These factors collectively demonstrate the practicality and efficiency of the proposed approach for real-world deployments.
Evaluation Metrics
We evaluate our proposed model, TopS-Key, using three metrics: precision, recall, and F1 score. Each keyphrase extraction result is evaluated by comparing the predicted keyphrases against the gold-standard keyphrases of the document.
Calculating these metrics usually involves looking for exact matches. However, since we evaluate semantic similarity, similarity and threshold functions become crucial for assessing the degree of similarity between predicted and actual keyphrases. When calculating the similarity score between an actual keyphrase and a predicted counterpart, we use the threshold to determine whether the predicted keyphrase matches the actual one.
The evaluation iterates over the actual keyphrases, comparing each against the predicted keyphrases under a predefined similarity threshold. A true positive (TP) is counted when an actual keyphrase matches some predicted keyphrase above the threshold; a false positive (FP) is counted when a predicted keyphrase fails to meet the threshold against every actual keyphrase; and a false negative (FN) arises when an actual keyphrase is not matched by any predicted keyphrase above the threshold. Table 6 shows the formulas used for these metrics.
Formulas for Precision, Recall, and F1 Score.
Precision = TP / (TP + FP); Recall = TP / (TP + FN); F1 = 2 × Precision × Recall / (Precision + Recall).
Note. TP = true positive; FP = false positive; FN = false negative.
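A minimal sketch of this threshold-based matching is shown below; sim stands for whatever phrase-level similarity function is used (e.g., an embedding similarity), which we leave abstract here.

```python
def evaluate(actual, predicted, sim, threshold=0.9):
    """Precision, recall, and F1 under threshold-based keyphrase matching."""
    tp = sum(any(sim(a, p) >= threshold for p in predicted) for a in actual)
    fn = len(actual) - tp                     # actual keyphrases never matched
    fp = sum(all(sim(a, p) < threshold for a in actual) for p in predicted)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```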
Results and Analysis
This section presents a comprehensive analysis of the results obtained from the proposed methodology. We explore the impact of different thresholds, evaluate the contribution of individual features through an ablation study, and conduct both comparative and statistical performance analyses to benchmark our approach. Each of these aspects is detailed in the following subsections.
PCA Cumulative Variance Threshold Sensitivity Analysis
To evaluate the impact of dimensionality reduction on keyphrase extraction performance, we conduct a detailed sensitivity analysis by varying the variance retention threshold in PCA from 0.80 to 1.00. This investigation spans three benchmark datasets, DUC 2001, SemEval 2017 Task 10, and Inspec—each characterized by differing document structures and vocabulary distributions. Performance is assessed using the F1 score at Top@3, Top@5, and Top@10 ranks to jointly account for precision and recall across varying retrieval granularities.
Figure 5 illustrates the F1 score trajectories for each dataset and evaluation depth as a function of the retained variance. A consistent pattern emerges across all datasets: increasing the threshold from 0.80 to 0.90 results in performance gains at most top-$k$ evaluation depths.

F1 Score Trends Across Varying Principal Component Analysis (PCA) Variance Thresholds on Three Datasets.
Beyond the 0.90 threshold, however, the F1 score trend either stabilizes or exhibits slight declines. For example, increasing the threshold from 0.90 to 1.00 results in a drop in performance for both Inspec and DUC at Top@5 and Top@3, with reductions reaching up to 6%. This confirms that 1.00 is the worst-performing threshold for these datasets, likely due to the inclusion of low-variance components that encode noise or redundancy, thereby reducing the discriminative power of the representation. At the lower end, 0.80 also yields suboptimal results, especially for DUC, suggesting that aggressive dimensionality reduction may suppress important semantic features.
Notably, SemEval displays relatively flat performance across all thresholds, with only marginal gains at 0.90. This stability may stem from its broader linguistic variability and longer texts, which render the PCA projection less sensitive to threshold variations. Even so, 0.90 still represents the most balanced setting, achieving the highest Top@10 F1 score with minimal drop elsewhere.
The Inspec dataset consistently outperforms the others at higher Top@$k$ settings, which is consistent with its shorter documents making candidate matching easier.
In summary, a PCA variance threshold of 0.90 offers the optimal tradeoff between dimensionality reduction and semantic preservation. It consistently enhances model performance across datasets and evaluation depths, while thresholds at 0.80 and 1.00 frequently yield the poorest results. Accordingly, we fix the variance retention at 90% in our final pipeline to balance accuracy, interpretability, and computational efficiency.
Similarity Threshold Sensitivity Analysis
We first analyze the effect of varying similarity thresholds on performance using precision, recall, and F1 score, focusing on the Top@3 evaluation depth. Thresholds are varied from 0.70 to 0.90 in increments of 0.05.
In Figure 6, we observe a steady decline in precision as the threshold increases. Higher thresholds impose stricter matching criteria, causing the model to become more selective in predicting keyphrases. As a result, fewer predictions are made, and the number of TPs decreases, which lowers the precision. Among the datasets, DUC 2001 consistently exhibits the lowest precision, likely due to its longer document length and higher lexical variability.

Precision Trends Across Varying Similarity Thresholds for All Datasets at Top@3.
In Figure 7, we observe that increasing the threshold leads to a decrease in recall. A higher threshold indicates a stronger matching requirement, that is, a predicted keyphrase needs a higher level of similarity to be deemed a match. This causes the method to miss some TPs, which subsequently results in lower recall. It is also observed that the dataset with the fewest words per document, Inspec, attains the highest recall of the three datasets.

Recall Trends Across Varying Similarity Thresholds for All Datasets at Top@3.
In Figure 8, for the F1 score, which balances precision and recall, we observe a similar declining trend. This is expected, as both precision and recall drop with higher thresholds. Notably, the best overall F1 performance is obtained at threshold 0.75, where the tradeoff between precision and recall is most favorable—particularly on the Inspec dataset. In contrast, the lowest F1 scores are observed at threshold 0.9, especially for DUC 2001. This suggests that overly strict thresholds penalize semantically relevant but lexically diverse phrases, resulting in performance degradation. As document length increases, this effect becomes more pronounced.

F1 Score Trends Across Varying Similarity Thresholds for All Datasets at Top@3.
We calculated the average decreasing rate of precision, recall, and F1 score as the threshold increases in all the datasets, as shown in Table 7.
Average Decreasing Rate of Precision, Recall, and F1 Score of the Datasets.
Across all three datasets, the precision exhibits a steeper average rate of decline compared to recall, indicating that as the threshold increases, the model becomes more conservative and filters out more candidate phrases. This causes the proportion of relevant items among the retrieved set to drop more sharply, while recall, although affected, declines at a slightly slower pace. The F1 score decreases steadily, with the degradation remaining balanced across precision and recall, suggesting that the model responds uniformly to increasing threshold levels.
The best F1 performance is observed at a threshold of 0.75, where the tradeoff between precision and recall is most favorable, particularly evident in the Inspec dataset. In contrast, the worst F1 scores occur at threshold 0.9, especially for DUC 2001, due to its longer documents and more diverse vocabulary, which are disproportionately penalized under stricter similarity constraints.
Despite this, we select a threshold of 0.9 in our final configuration. This decision is motivated by the need for stricter semantic alignment between predicted and actual keyphrases. The 0.9 setting minimizes FPs and ensures that only highly precise, semantically valid keyphrases are accepted, which is desirable in practical applications requiring interpretability and reliability. Although there is a moderate tradeoff in recall and F1 performance compared to 0.75, the model remains robust across the tested threshold range, and 0.9 offers the most precision-oriented and semantically rigorous configuration.
Ablation Study
To systematically assess the contribution and significance of each feature within our keyphrase extraction framework, we conducted a comprehensive ablation study, a widely adopted technique in the field (Papagiannopoulou & Tsoumakas, 2020). In this study, we remove each feature individually and reevaluate the model's performance on three benchmark datasets: SemEval 2017 Task 10, Inspec, and DUC 2001. The evaluation employs three standard metrics: the average precision, recall, and F1 score of the top-3, top-5, and top-10 ranked keyphrases. Figure 9 illustrates the results of this analysis across the three datasets.

Ablation Study Results Showing the Effect of Each Feature Across Datasets.
In Figure 9, we can observe that for the INSPEC dataset, the ablation results highlight the critical importance of certain features. Specifically, removing the BERT + Manhattan distance feature leads to the highest reduction in average F1 score, with a 19.7% decrease compared to the model using all features. The TF and IDF features also contribute significantly, showing respective F1 score drops of 19.4% and 12.6% upon removal. Additionally, the removal of the BERT + cosine similarity feature results in a 16.6% decrease, and the LDA topic coherence feature causes an 11.8% reduction. These findings indicate that semantic similarity and frequency-based features play the most influential roles in improving keyphrase extraction performance on the INSPEC dataset. The performance degradation after removing each feature confirms that all 10 features are essential, as their absence consistently reduces precision, recall, and F1 scores.
A similar pattern is observed for the SemEval dataset in Figure 9. Removal of the BERT + Manhattan distance feature results in the largest drop in average F1 score, amounting to 18.4%. The TF and IDF features again show substantial influence, with respective F1 decreases of 17.5% and 10.2%. Moreover, the BERT + cosine similarity feature causes a notable decrease of 16.7%, while excluding the LDA topic coherence feature leads to a 12.5% drop. Other features, such as phrase length and noun position, exhibit moderate impacts on performance (decreases of roughly 8%-11%). The consistent decline in performance metrics after the removal of each feature illustrates that all 10 features contribute meaningfully to the model's success.
For the DUC dataset, the ablation trends are consistent with the other datasets. The removal of the BERT + Manhattan distance feature yields the largest decline in average F1 score, with a decrease of 20.6%, further emphasizing its pivotal role. The TF and IDF features follow closely, leading to F1 score reductions of 19.6% and 11.7%, respectively. Other structural features such as phrase length, noun position, and noun phrase density exhibit smaller but noticeable reductions, generally around 7% to 10%.
Across all ablation configurations, the full model utilizing all 10 features consistently achieves the best F1 performance, confirming that each component contributes positively. The removal of the BERT + Manhattan distance feature leads to the worst-case performance, with F1 score drops ranging from 18.4% (SemEval) to 20.6% (DUC 2001), the most severe decline observed. This underscores its central role in capturing fine-grained semantic alignment. Frequency-based features such as TF and IDF also show high impact, with consistent reductions across datasets (e.g., up to 19.6% on DUC), reinforcing their complementary contribution. In contrast, features such as phrase length and noun position cause only modest performance drops, typically below 10%, suggesting a relatively lower influence on the final ranking, especially on longer documents where structural heuristics are less reliable. No single feature’s removal improves performance, indicating that the full model integrating all features offers the optimal balance of semantic, statistical, and structural cues.
This prominence of the BERT + Manhattan distance feature can be theoretically explained by its unique ability to capture fine-grained semantic deviation across embedding dimensions. While all BERT-based features leverage the contextual richness of transformer embeddings, the distance metric plays a crucial role in determining their effectiveness. Cosine similarity measures angular similarity, which is scale-invariant but insensitive to absolute differences in vector space. Euclidean distance captures geometric distance but tends to emphasize larger deviations disproportionately in high-dimensional spaces, often diluting the influence of smaller, semantically meaningful variations. In contrast, Manhattan (L1) distance computes the sum of absolute differences across all dimensions, offering greater sensitivity to small yet consistent deviations, especially valuable in BERT’s dense semantic space. This makes the BERT + Manhattan feature particularly adept at identifying subtle contextual misalignments between a candidate phrase and its surrounding text. Empirically, this observation is supported by our ablation results, where removing this feature resulted in the sharpest performance drop (up to 20.6% in F1 score), underscoring its essential contribution to semantic alignment in the proposed ranking model.
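The following toy sketch (NumPy, illustrative only) shows why the L1 metric reacts to small but consistent per-dimension drift that cosine similarity largely ignores; the random vectors merely stand in for BERT embeddings.

```python
import numpy as np

def cosine_similarity(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def euclidean_distance(u, v):
    return float(np.linalg.norm(u - v))        # L2: emphasizes large deviations

def manhattan_distance(u, v):
    return float(np.sum(np.abs(u - v)))        # L1: sums small, consistent deviations

# Toy illustration: two vectors with the same direction but many small offsets.
rng = np.random.default_rng(0)
doc_vec = rng.normal(size=768)                             # stand-in for a BERT embedding
phrase_vec = doc_vec + rng.normal(scale=0.01, size=768)    # subtle per-dimension drift

print(cosine_similarity(doc_vec, phrase_vec))   # ~1.0: the angular view misses the drift
print(euclidean_distance(doc_vec, phrase_vec))  # small single number
print(manhattan_distance(doc_vec, phrase_vec))  # accumulates every small deviation
```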
In this study, we establish two primary research hypotheses: (1) the proposed model consistently outperforms existing baseline keyphrase extraction methods across various experimental conditions, and (2) the model demonstrates robust and stable performance across different text domains, irrespective of document structures or writing styles. To validate these hypotheses, we conduct extensive comparative analyses against 15 baseline methods, encompassing statistical, graph-based, embedding-based, and hybrid techniques, across three benchmark datasets: INSPEC, SemEval, and DUC.
The consistently superior performance of TopS-Key across all datasets and evaluation settings can be attributed to its holistic treatment of semantic representation, feature weighting, and ranking. Unlike traditional and recent neural baselines, TopS-Key effectively balances contextual informativeness and discriminative relevance through entropy-guided dimensionality reduction and multicriteria decision-making. This allows the model to maintain high precision and recall, particularly in settings characterized by topical drift, variable document lengths, and noisy contexts, challenges that typically degrade the performance of other state-of-the-art methods.
The methods used for comparison are RAKE (Rose et al., 2012) using GloVe embeddings (Pennington et al., 2014), PositionRank (Florescu & Caragea, 2017), MultipartiteRank (Boudin, 2018), EmbedRank (Bennani-Smires et al., 2018), YAKE (Campos et al., 2018), RoBERTa (Liu et al., 2019b), KeyBERT (Grootendorst, 2020), TripleRank (Li et al., 2021), MDERank (Zhang et al., 2021), PromptRank (Kong et al., 2023), HGUKE (Song et al., 2023a), BRYT (Ahmed et al., 2024), AdaptiveUKE (Liu et al., 2024), HCUKE (Xu et al., 2024), and SDRank (Xu et al., 2025).
In our experiments, the GloVe embeddings are 300-dimensional, trained on the Common Crawl dataset. PositionRank uses a window size of 10, while MultipartiteRank sets its weight adjustment hyperparameter to 1.1. EmbedRank connects nodes using a cosine similarity threshold of 0.5. YAKE uses a window size of 1, an n-gram length of 3, and a deduplication threshold of 0.9. For RoBERTa, the max sequence length is 512. KeyBERT considers n-gram ranges of (1, 3) and uses an MMR diversity parameter of 0.7. TripleRank uses a window size of 2 and a standard damping factor.
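For reproducibility, the KeyBERT setting above can be expressed through the library's public API as follows; the document text and the `top_n` value are illustrative.

```python
from keybert import KeyBERT  # pip install keybert

doc = "Automatic keyphrase extraction identifies phrases that summarize a document."
kw_model = KeyBERT()  # defaults to a sentence-transformers backbone

# n-gram range (1, 3) and MMR diversity 0.7, matching the settings reported above.
keyphrases = kw_model.extract_keywords(
    doc,
    keyphrase_ngram_range=(1, 3),
    stop_words="english",
    use_mmr=True,
    diversity=0.7,
    top_n=5,
)
print(keyphrases)  # list of (phrase, relevance score) tuples
```

The MMR diversity parameter trades relevance against redundancy among the returned phrases, which is why a relatively high value (0.7) is used for this baseline.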
We calculated precision (P), recall (R), and F1 score (F1) for the top-3, top-5, and top-10 ranked keyphrases on each dataset; the corresponding results are reported in the performance comparison tables.
Performance Comparison of P, R, and F1 Over Three Datasets at Top-3.
Note. P = precision; R = recall; F1 = F1 score.
Performance Comparison of P, R, and F1 Over Three Datasets at Top-5.
Note. P = precision; R = recall; F1 = F1 score.
Performance Comparison of P, R, and F1 Over Three Datasets at Top-10.
Note. P = precision; R = recall; F1 = F1 score.
To ensure fairness, all models were evaluated on the same datasets (Inspec, SemEval 2017 Task 10, and DUC2001) using a unified semantic similarity-based evaluation protocol with a similarity threshold of 0.9. This consistent framework allowed us to compute precision, recall, and F1 score for the top-3, top-5, and top-10 extracted keyphrases.
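A minimal sketch of this similarity-threshold matching is given below; the `embed` function is a placeholder for whichever sentence-embedding model is used, and the exact matching rules in our evaluation may differ in detail.

```python
# Sketch of semantic similarity-based matching with a 0.9 threshold.
# `embed` is a hypothetical embedding function (phrase -> vector).
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

def precision_recall_f1(predicted, gold, embed, threshold=0.9):
    # A prediction counts as correct if it matches any gold phrase semantically.
    matched = {
        p for p in predicted
        if any(cosine(embed(p), embed(g)) >= threshold for g in gold)
    }
    precision = len(matched) / len(predicted) if predicted else 0.0
    # A gold phrase counts as covered if any prediction matches it.
    covered = {
        g for g in gold
        if any(cosine(embed(g), embed(p)) >= threshold for p in predicted)
    }
    recall = len(covered) / len(gold) if gold else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```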
The results clearly indicate the superior performance of our proposed method, TopS-Key, compared to existing approaches.
We now compare TopS-Key with the state-of-the-art approaches on precision, recall, and F1 score, averaging the scores over the top-3, top-5, and top-10 settings.
In Figure 10, we display a comparison graph based on average precision for the three datasets. We observe that precision is lowest on DUC compared to the other datasets, Inspec and SemEval. This indicates that for longer documents containing more words, including irrelevant information and noise, our proposed method predicts more keyphrases, many of which may be irrelevant or unnecessary. As a document's length grows, the context in which words are used can become more complex or ambiguous, leading to more false-positive (FP) predictions. Longer documents may also cover a broader range of topics, making it harder for the model to pinpoint the most relevant keyphrases. As a result, the model may predict keyphrases that are not relevant to the core content of the document.

Comparison of TopS-Key with the State-of-the-Art Approaches Over Average Precision Value.
The precision values across the different datasets highlight the superior performance of the proposed method compared to many of the latest advanced methods in the field. It consistently outperforms most baseline approaches, achieving precision scores of 0.405 for Inspec, 0.445 for SemEval, and 0.311 for DUC. In contrast, traditional methods such as RAKE + GloVe show significantly lower precision across all datasets; the proposed method outperforms them by 169.5%, 182.4%, and 205.6% for Inspec, SemEval, and DUC, respectively, indicating far more effective keyphrase extraction. More recent advanced methods, including PositionRank, demonstrate competitive performance but still lag behind by approximately 5% to 30%, particularly on SemEval and Inspec. Even methods leveraging semantic embeddings and pretrained language models, such as RoBERTa (0.2099) and KeyBERT (0.2864), show only moderate precision scores and trail by about 20% to 30%. This underscores the advantage of incorporating advanced features and optimization techniques, as the proposed method demonstrates up to 50% higher precision, particularly on datasets such as SemEval, where it achieves the highest precision.
In Figure 11, we can observe that our proposed model has the highest recall compared to other methods, particularly excelling in longer documents, that is, in the DUC dataset. This indicates that our model is very effective at capturing the majority of actual keyphrases, even when the document length increases. The high recall suggests that the model performs comprehensive keyphrase extraction, ensuring that relevant phrases are not missed.

Comparison of TopS-Key with the State-of-the-Art Approaches Over Average Recall Value.
TopS-Key achieves recall scores of 0.2582 for Inspec, 0.1636 for SemEval, and 0.2775 for DUC. In comparison, traditional methods such as RAKE + GloVe show much lower recall scores; the proposed method outperforms them by around 130% to 220%, indicating more effective keyphrase extraction. For example, on SemEval, PositionRank achieves 0.1586, trailing the proposed method by about 3.4%. Methods such as YAKE and KeyBERT, with recall scores of 0.2024 and 0.2029 for Inspec, and 0.1361 and 0.1283 for SemEval, show comparable performance but still underperform by around 20% to 30%. Even methods that use semantic embeddings and pretrained models, such as RoBERTa (0.1507) and MDERank (0.1851), lag behind by 30% to 50% in recall. Overall, the proposed method offers improvements of 10% to 50% over most competing methods, demonstrating its superior ability to capture relevant keyphrases, especially on datasets such as DUC and SemEval.
We also observe that while the model succeeds at identifying most of the actual keyphrases, it predicts a higher number of irrelevant phrases in longer texts, leading to more FPs. Thus, our model is highly effective in scenarios requiring high recall, ensuring that most keyphrases are identified.
In Figure 12, we depict a comparison graph based on the average F1 scores of three datasets. As the harmonic mean of precision and recall, the F1 score tends to follow a more complex pattern. We can observe that our proposed model has the highest F1 score compared to other methods.

Comparison of TopS-Key with the State-of-the-Art Approaches Over Average F1 Score.
The F1 score results across the three datasets demonstrate that the proposed method outperforms many of the latest advanced methods in the field. Specifically, the proposed method achieves F1 scores of 0.2868 for Inspec, 0.2214 for SemEval, and 0.2769 for DUC, with Inspec yielding the best F1 and SemEval the lowest. These scores are notably higher than those of traditional methods such as RAKE; the proposed method outperforms them by significant margins, ranging from 100% to over 200%. More advanced methods, including PositionRank, yield competitive F1 scores, with PositionRank achieving 0.2656 for Inspec, 0.2202 for SemEval, and 0.2747 for DUC. While these methods show some strength, they still fall short of the proposed method by about 5% to 10%.
Methods such as YAKE (0.2155 for Inspec and 0.1727 for SemEval) and KeyBERT (0.2168 for Inspec and 0.1695 for SemEval) perform similarly, but the proposed method still surpasses them by a notable margin of 10% to 30%. Even techniques that leverage advanced models, such as RoBERTa (0.1602 for Inspec and 0.1265 for SemEval) and MDERank (0.2019 for Inspec and 0.1508 for SemEval), show moderate performance, but the proposed method still achieves improvements of approximately 30% to 50% over these models. Overall, the proposed method consistently delivers superior performance, with improvements of 20% to 100% over most baseline and advanced models, particularly excelling on datasets such as Inspec and DUC.
The superior performance of TopS-Key can be attributed to its synergistic integration of multiple keyphrase relevance dimensions, statistical, contextual, syntactic, and positional, within a unified multicriteria decision-making framework. In particular, the use of contextual PCA enables effective dimensionality reduction while preserving semantically discriminative components, and adaptive Shannon entropy weighting highlights informative yet nonredundant features across candidate keyphrases. Moreover, the incorporation of Fuzzy TOPSIS ensures robust ranking by accounting for both ideal and anti-ideal reference points across multiple criteria. This carefully orchestrated combination enables TopS-Key to maintain high precision and recall even across domains with diverse vocabulary, structure, and content density.
To validate the effectiveness of our proposed method, we perform a series of three statistical significance tests to ensure that the observed improvements are not due to random variations. These tests quantitatively assess whether the performance differences between our approach and the baseline methods are statistically significant.
Paired t-Test for Performance Evaluation
To assess the statistical significance of the performance difference between our proposed method and the baseline methods, we perform a paired t-test.
The paired t-test evaluates whether the mean of the per-configuration score differences between two methods is significantly different from zero, computed as $t = \bar{d}/(s_d/\sqrt{n})$, where $\bar{d}$ is the mean difference, $s_d$ its standard deviation, and $n$ the number of paired observations.
We conduct a paired t-test on the precision scores of TopS-Key against each baseline method on all three datasets.
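Such a test can be computed with SciPy, as sketched below; the score arrays are hypothetical stand-ins for the actual per-configuration precision values.

```python
# Illustrative paired t-test on per-configuration precision scores using SciPy.
# `tops_key_scores` and `baseline_scores` are placeholder values; the real
# comparisons cover every baseline method on all three datasets.
from scipy import stats

tops_key_scores = [0.42, 0.38, 0.45, 0.40, 0.44]   # hypothetical values
baseline_scores = [0.31, 0.29, 0.35, 0.30, 0.33]   # hypothetical values

t_stat, p_value = stats.ttest_rel(tops_key_scores, baseline_scores)
print(f"t = {t_stat:.3f}, p = {p_value:.4f}")      # p < .05 => significant difference
```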

Paired T-Test Results on Precision Scores for Inspec (I), SemEval (II), and DUC2001 (III), Indicating Statistical Significance of TopS-Key.
In Figure 13(i) (Inspec), all paired t-test results are significant at the .05 level; the same pattern holds for SemEval in Figure 13(ii) and DUC2001 in Figure 13(iii), indicating that TopS-Key's precision improvements over every baseline are statistically significant.
In addition to the paired t-test statistics, we examine the corresponding p-values.
The p-value quantifies the probability of observing a performance difference at least as large as the measured one under the null hypothesis that the two methods perform equally.
To validate the significance of the observed performance differences, we perform a two-tailed test at the .05 significance level.

Paired T-Test p-Values on Precision Scores for Inspec (I), SemEval (II), and DUC2001 (III), Indicating Statistical Significance of TopS-Key.
As shown in Figure 14, the p-values for all baseline comparisons fall well below the .05 threshold across the three datasets.
The null hypothesis assumes that TopS-Key performs similarly to the baselines. However, all baseline p-values are below .05, so the null hypothesis is rejected for every comparison, confirming that TopS-Key's improvements are statistically significant.
Together, the paired t-statistics and two-tailed p-values confirm that the performance gains of TopS-Key are not artifacts of random variation.
Following the paired t-test, we conduct a one-way analysis of variance (ANOVA) on the F1 scores of all methods to test whether mean performance differs across methods.
The ANOVA F-statistic compares the variance between methods to the variance within methods.
If the resulting p-value falls below the .05 significance level, we conclude that at least one method differs significantly from the others.
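A one-way ANOVA of this kind can be sketched with SciPy as follows; the F1 lists are hypothetical placeholders for the per-configuration scores of each method.

```python
# Illustrative one-way ANOVA over the F1 scores of several methods (SciPy).
# The score lists are placeholders for the real per-configuration F1 values.
from scipy import stats

f1_tops_key = [0.29, 0.28, 0.27]
f1_positionrank = [0.27, 0.26, 0.27]
f1_yake = [0.22, 0.21, 0.17]

f_stat, p_value = stats.f_oneway(f1_tops_key, f1_positionrank, f1_yake)
print(f"F = {f_stat:.3f}, p = {p_value:.4f}")  # p < .05 => some method differs
```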
Since the ANOVA test only determines whether a significant difference exists but does not specify which methods differ, we apply the Tukey honestly significant difference (HSD) test for post-hoc pairwise comparisons.
Tukey HSD evaluates all possible pairwise differences among methods using the studentized range distribution (Abdi & Williams, 2010). The test statistic is computed using equation (33):
$$q = \frac{\bar{x}_i - \bar{x}_j}{\sqrt{MS_{\text{within}}/n}} \quad (33)$$
where $\bar{x}_i$ and $\bar{x}_j$ denote the mean F1 scores of the two methods under comparison, $MS_{\text{within}}$ is the within-group mean square from the ANOVA, and $n$ is the number of observations per group. If $q$ exceeds the critical value of the studentized range distribution, the pairwise difference is significant.
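In practice, these pairwise comparisons can be obtained with statsmodels, as in the illustrative sketch below; the scores and method labels are placeholders.

```python
# Illustrative Tukey HSD post-hoc comparison using statsmodels.
# Scores and method labels are placeholders for the real per-run F1 values.
from statsmodels.stats.multicomp import pairwise_tukeyhsd

scores = [0.29, 0.28, 0.27, 0.27, 0.26, 0.27, 0.22, 0.21, 0.17]
methods = ["TopS-Key"] * 3 + ["PositionRank"] * 3 + ["YAKE"] * 3

result = pairwise_tukeyhsd(endog=scores, groups=methods, alpha=0.05)
print(result)  # table of pairwise mean differences with reject/accept decisions
```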
We further illustrate the F1 score distributions using Figure 15, where each box represents the distribution of F1 scores for a given method. The median, interquartile range, and outliers provide insights into the variability across methods. The statistical significance observed in the ANOVA and Tukey HSD tests is reflected in the variations among these distributions.

Tukey Honestly Significant Difference (HSD) Analysis Results on F1 Scores for Inspec (I), SemEval (II), and DUC2001 (III), Indicating Statistical Significance of TopS-Key.
Figure 15 presents the mean difference in F1 scores for baseline methods across the three datasets. The observed negative mean differences, consistently below zero, indicate that all baseline methods underperform compared to our approach. This confirms that our method achieves statistically significant improvements over existing techniques. The results allow us to reject the null hypothesis, affirming that the observed performance gains are not due to random variations but rather to the effectiveness of our feature-driven optimization. The larger performance gap in the DUC dataset further highlights our method’s robustness in handling complex and lengthy documents, while the consistent trends across all datasets reinforce its generalizability.
The consistently outstanding performance of our proposed method across all benchmark datasets is the result of its carefully designed integration of statistical, syntactic, contextual, and positional features within a robust multicriteria framework. Unlike traditional statistical techniques, which focus primarily on word frequency or co-occurrence statistics and fail to capture deeper semantic nuances or structural dependencies, our approach leverages a broad array of complementary features. This results in a more comprehensive evaluation of each keyphrase candidate’s importance.
Similarly, graph-based models such as MultipartiteRank and PositionRank build term co-occurrence graphs and extract keyphrases based on graph centrality measures. While effective in capturing local word relationships, they tend to ignore the broader contextual and semantic connections spanning the entire document. Our method overcomes this shortcoming by incorporating structural insights alongside statistical and syntactic features, ensuring that both local dependencies and global context are accounted for. Nevertheless, PositionRank demonstrates consistently competitive performance across datasets, with its strength most evident on the INSPEC corpus. This dataset includes both titles and abstracts, where positional information strongly aligns with author-assigned keyphrases. By exploiting the occurrence of candidate terms in titles or early abstract segments, PositionRank capitalizes on structural regularities that are particularly informative in this domain, thereby achieving notably strong results. This explains why PositionRank secured the second-best overall performance in our comparative evaluation, despite being developed earlier than several more recent models.
Embedding-based methods, such as KeyBERT, EmbedRank, and BERT-based approaches, rely heavily on pretrained language models to represent semantic relationships. Although these models excel at capturing word and phrase meanings, they are often constrained by their dependence on external pretrained embeddings, which may not fully adapt to domain-specific language use or the unique characteristics of individual documents. Moreover, these embedding techniques typically evaluate candidate phrases in isolation, without integrating positional or structural information. Our approach, by contrast, dynamically adapts to each document using PCA to reduce feature dimensionality, ensuring that the most informative and document-relevant variance is retained.
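As an illustration, the document-level reduction step can be sketched with scikit-learn as follows; the 0.95 variance-retention threshold is an assumed example value, not necessarily the exact setting used in TopS-Key.

```python
# Minimal sketch of document-adaptive PCA over the 10-dimensional feature
# matrix (rows = candidate keyphrases, columns = feature scores).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

def reduce_features(feature_matrix: np.ndarray) -> np.ndarray:
    scaled = StandardScaler().fit_transform(feature_matrix)
    pca = PCA(n_components=0.95)   # keep components explaining 95% of the variance
    components = pca.fit_transform(scaled)
    return components              # principal components become the TOPSIS criteria
```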
Despite their novelty, recent advanced methods such as PromptRank, HGUKE, BRYT, AdaptiveUKE, HCUKE, and SDRank failed to achieve superior performance in our experiments due to their reliance on specialized mechanisms that do not generalize well across datasets. PromptRank leverages large language model prompting strategies that are sensitive to prompt design and domain mismatch, while HGUKE and HCUKE rely heavily on heuristic graph construction that may overlook semantic nuance outside their intended domain. Similarly, BRYT and AdaptiveUKE emphasize adaptive strategies tailored to particular data characteristics, but their effectiveness diminishes when applied to heterogeneous corpora such as INSPEC, SemEval, and DUC. SDRank, although the most recent, prioritizes semantic density but underutilizes positional and structural cues, which are particularly critical in datasets containing titles and abstracts. As a result, these advanced approaches exhibit strong performance in constrained settings but remain less competitive in broad benchmark comparisons.
A pivotal strength of our framework lies in the application of Shannon entropy-based weighting, which adaptively assigns importance scores based on the distribution variability of each feature. This dynamic weighting mechanism prevents over-dependence on any single feature type and ensures that discriminative features are prioritized depending on dataset-specific characteristics. Traditional statistical models, in comparison, assign uniform weights without adjusting to interfeature variability, potentially missing important contextual signals.
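The entropy weighting described here follows the standard entropy weight method, sketched below; the sketch assumes a nonnegative decision matrix, so in practice the PCA outputs would first be shifted or min-max scaled, and the exact normalization in TopS-Key may differ.

```python
# Sketch of Shannon entropy-based criterion weighting (entropy weight method).
import numpy as np

def entropy_weights(decision_matrix: np.ndarray) -> np.ndarray:
    """Rows = candidate keyphrases, columns = criteria (principal components)."""
    m, _ = decision_matrix.shape
    # Column-normalize so each criterion forms a probability distribution.
    p = decision_matrix / decision_matrix.sum(axis=0, keepdims=True)
    with np.errstate(divide="ignore", invalid="ignore"):
        logp = np.where(p > 0, np.log(p), 0.0)
    entropy = -np.sum(p * logp, axis=0) / np.log(m)  # Shannon entropy in [0, 1]
    divergence = 1.0 - entropy                       # lower entropy => more informative
    return divergence / divergence.sum()             # weights sum to 1
```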
Furthermore, the integration of fuzzy TOPSIS allows our model to effectively handle the inherent subjectivity and ambiguity involved in ranking keyphrases. While graph-based and statistical models often apply rigid criteria to select keyphrases, fuzzy TOPSIS treats the process as a multicriteria decision-making task. This enables our method to evaluate all feature contributions simultaneously, providing a more balanced, flexible, and context-sensitive ranking mechanism. Such flexibility is particularly beneficial when processing documents with varying sentence structures, writing styles, and topic domains, scenarios where embedding-based models or graph algorithms may struggle to generalize.
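For intuition, the ideal/anti-ideal ranking logic can be sketched in its simplified crisp form as follows; the full model operates on fuzzy numbers, and treating all criteria as benefit criteria is an assumption of this sketch.

```python
# Simplified (crisp) TOPSIS ranking sketch; the fuzzy variant follows the same
# ideal/anti-ideal logic. Rows = candidate keyphrases, columns = criteria.
import numpy as np

def topsis_rank(matrix: np.ndarray, weights: np.ndarray) -> np.ndarray:
    # Vector-normalize each criterion, then apply the entropy-derived weights.
    norm = matrix / np.linalg.norm(matrix, axis=0, keepdims=True)
    weighted = norm * weights
    ideal = weighted.max(axis=0)         # positive-ideal solution
    anti_ideal = weighted.min(axis=0)    # negative-ideal solution
    d_pos = np.linalg.norm(weighted - ideal, axis=1)
    d_neg = np.linalg.norm(weighted - anti_ideal, axis=1)
    closeness = d_neg / (d_pos + d_neg)  # higher = closer to the ideal
    return np.argsort(-closeness)        # candidate indices, best first
```

Ranking by the closeness coefficient rewards candidates that are simultaneously near the ideal and far from the anti-ideal, rather than optimizing a single rigid criterion.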
Another crucial factor contributing to our model’s success is its use of syntactic and positional features, including part-of-speech patterns, term positions, and sentence-level embeddings. Unlike models such as YAKE, which use statistical and positional information but lack mechanisms for feature weighting and dimensionality reduction, our method combines these cues with adaptive weighting and PCA, allowing for effective noise reduction and better interfeature interaction modeling.
While individual components such as PCA, entropy-based weighting, or fuzzy TOPSIS have been explored separately in prior keyword extraction and ranking studies, the novelty of our approach lies in the synergistic integration of these elements into a single adaptive framework. Shannon entropy-based weighting dynamically evaluates the discriminative power of features, PCA reduces dimensionality and mitigates feature redundancy, and fuzzy TOPSIS provides a flexible multicriteria ranking mechanism that accommodates uncertainty and subjectivity in keyphrase evaluation. In isolation, each of these techniques addresses a specific aspect of the extraction process, but their combined application enables document-specific adaptability, robust interfeature interaction modeling, and context-sensitive ranking, which are unattainable by any single component alone. This integration not only improves precision and recall across diverse datasets but also enhances interpretability and generalizability, demonstrating a clear advancement over previous methods that relied on individual techniques without exploiting their complementary strengths.
In summary, the superior performance of our approach is driven by the synergistic integration of heterogeneous features, document-specific adaptability, dimensionality reduction through PCA, entropy-based adaptive weighting, and flexible evaluation via fuzzy TOPSIS. This thoughtful combination enables our model to consistently outperform baseline methods, whether statistical, graph-based, embedding-based, or hybrid, across all datasets in terms of precision, recall, and F1 score.
The evaluation results clearly indicate that our method achieves not only state-of-the-art accuracy but also a strong balance between interpretability, adaptability, and robustness. The significant improvements over the 15 baseline methods, confirmed by statistical significance tests including paired t-tests, one-way ANOVA, and Tukey HSD post-hoc analysis, establish TopS-Key as a reliable and generalizable solution for unsupervised keyphrase extraction.
Given its favorable runtime efficiency compared to deep learning-based approaches, TopS-Key demonstrates strong potential for deployment in real-world applications where both scalability and interpretability are essential. In information retrieval systems, the extracted keyphrases can enhance document indexing, metadata generation, and query expansion, thereby improving retrieval precision and recall across large text repositories. Similarly, in automatic text summarization pipelines, keyphrases provide salient content indicators that can improve the quality of both extractive and abstractive summaries. Beyond these, the lightweight and modular design of TopS-Key makes it particularly well-suited for resource-constrained or real-time environments, such as mobile devices, monitoring dashboards, or social media analytics, where low latency and transparency are critical. These deployment scenarios highlight not only the efficiency of the framework but also its practical adaptability to diverse information management settings.
Limitations and Challenges
Despite the robust performance of the TopS-Key framework across standard benchmark datasets, several contextual and operational limitations merit consideration when extending the approach to diverse real-world settings.
The feature design and ranking strategy, while broadly applicable, may require adaptation in domain-specific contexts such as biomedical, legal, or informal social media corpora, where linguistic structures, terminology, and discourse conventions deviate from general-purpose datasets. In such cases, preprocessing modules such as part-of-speech tagging and candidate phrase extraction may benefit from domain-tuned heuristics or resources to preserve semantic relevance and syntactic coherence. Additionally, reliance on part-of-speech-based candidate extraction may overlook idiomatic expressions or multiword domain-specific keyphrases that do not conform to standard syntactic patterns, potentially limiting coverage in specialized corpora.
From a computational standpoint, although the use of dimensionality reduction via PCA and entropy-based weighting contributes to model efficiency, scalability concerns may arise when processing large-scale corpora or streaming data in low-latency environments. However, an inherent tradeoff of PCA is that while it reduces redundancy and highlights dominant patterns, it may also obscure fine-grained feature distinctions that capture subtle semantic or stylistic variations, which could be particularly relevant in domain-sensitive applications. While the proposed framework is significantly more lightweight than deep learning-based alternatives, system-level deployment in real-time applications may necessitate further optimization through parallelization, distributed processing, or model compression techniques. Another practical consideration lies in the sensitivity of certain hyperparameters, such as the PCA variance retention threshold and similarity-based ranking cutoffs. Although empirically stable across datasets, these parameters may require calibration to accommodate variations in text length, domain specificity, or candidate phrase density.
It is important to note that these limitations do not arise from architectural weaknesses, but rather reflect natural tradeoffs between generalizability, interpretability, and computational efficiency inherent in unsupervised modular frameworks. The adaptability and modularity of TopS-Key nonetheless position it as a viable solution for a wide range of downstream tasks, including document indexing, content classification, and summarization.
Conclusion and Future Scope
In this paper, we propose TopS-Key, a hybrid keyphrase extraction model that integrates NLP techniques with fuzzy multicriteria decision-making (Fuzzy TOPSIS). Our approach leverages fuzzy string matching, a 10-feature scoring system, contextual PCA-based dimensionality reduction, and Shannon entropy-based weighting, ensuring an effective and scalable ranking of keyphrases. We evaluated TopS-Key on three benchmark datasets (Inspec, SemEval 2017 Task 10, and DUC2001) and measured performance at the top-3, top-5, and top-10 keyphrase cutoffs, where it consistently outperformed existing baseline methods in precision, recall, and F1 score.
To ensure the statistical validity of our results, we conducted paired t-tests, a one-way ANOVA, and Tukey HSD post-hoc comparisons, all of which confirmed that the improvements achieved by TopS-Key over the baseline methods are statistically significant rather than due to random variation.
Future work will extend TopS-Key to informal short texts, such as tweets and social media posts, which present challenges due to noisy, sparse, and unstructured language. Addressing these cases will require enhanced preprocessing, normalization, and robust candidate selection strategies. We also intend to investigate the applicability of TopS-Key to morphologically rich languages, where inflectional and derivational complexity may affect keyphrase extraction. Furthermore, we plan to explore the use of extracted keyphrases as interpretable features for downstream tasks, such as toxicity or offensive content detection, particularly in short or low-resource text settings. Finally, evaluating the framework on domain-specific corpora, such as legal, biomedical, and technical texts, will help assess its adaptability and generalizability across specialized NLP applications.
Footnotes
Author contributions
All authors contributed equally to this research.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Data availability
The datasets analyzed during the current study are publicly available.
Code availability
The code generated during the current study is available from the corresponding author on reasonable request.
