
Abstract

This research introduces an enhanced stratified sampling-density-based spatial clustering of applications with noise (SS-DBSCAN), a scalable and robust density-based clustering algorithm designed to tackle challenges in high-dimensional and complex data analysis. The algorithm integrates advanced parameter optimization techniques to improve clustering accuracy and interpretability. Key innovations include a fast grid search method that finds the optimal minimum points (MinPts) value while holding the previously estimated $ϵ$ parameter constant. Notably, this study emphasizes the often-overlooked MinPts parameter, introducing a dynamic approach that first calculates density metrics within a specified $ϵ$ distance and then sets the MinPts search range from the mean and standard deviation of these metrics. The optimal MinPts value is then searched for within this range. Comprehensive experiments on five real-world datasets demonstrate SS-DBSCAN’s superior performance compared to density-based spatial clustering of applications with noise (DBSCAN), hierarchical DBSCAN, and ordering points to identify the clustering structure (OPTICS), evidenced by better silhouette and Davies–Bouldin index scores. The results highlight SS-DBSCAN’s ability to capture intrinsic clustering structures accurately, providing deeper insights across various research domains. SS-DBSCAN’s scalability and adaptability to diverse data densities make it a valuable tool for analyzing large, complex datasets.

Keywords

SS-DBSCAN, clustering, high-dimensional, fast grid search (FGS), scalability, adaptability

1. Introduction

Data mining is an interdisciplinary field that merges database technology, statistics, machine learning, and pattern recognition, benefiting from each of these areas (Iavindrasana et al., 2009). Although it is still not extensively adopted in many research domains, numerous studies have highlighted the potential of data mining in developing predictive models, evaluating risks, and assisting with decision-making (Ngiam & Khor, 2019). Data mining on large datasets can generate impactful insights that are vital for precise decision-making and risk evaluation (Wu et al., 2021). Algorithms designed for data mining facilitate the achievement of these objectives.

The advent of large and complex datasets has ushered in a new era of data-driven insights across various domains. Among the myriad available datasets, those that encompass extensive and high-dimensional data such as medical information stand out due to their comprehensive and detailed collection of information (Arya & Abhishek Arya, 2019; Mollura et al., 2020). The complexity, volume, and high dimensionality of these datasets pose significant challenges for clustering and data analysis, necessitating advanced methodologies for effective data preprocessing and clustering parameter optimization.

When managing datasets characterized by high density, arbitrary shapes, and irregular distribution, density-based spatial clustering of applications with noise (DBSCAN) is recommended as a robust algorithm specifically designed to address these complex scenarios (Martin Ester et al., 1996; Ram et al., 2010; Sander et al., 1998; Shah, 2012). Despite its effectiveness in handling complex datasets, DBSCAN has limitations, particularly in the selection and tuning of its two key parameters, which can affect its performance and accuracy (Schubert et al., 2017). These parameters are the minimum number of points required to form a dense region (MinPts) and the maximum distance between two points for one to be considered as in the neighborhood of the other ( $ϵ$ ) (Monko & Kimura, 2023a). Of the two parameters, $ϵ$ has been the subject of extensive research, whereas MinPts has often been overlooked, with its selection frequently relying on rule-of-thumb methods or manual estimation based on data size. However, both parameters play a crucial role in determining clustering outcomes. In particular, improper determination of MinPts can significantly affect clustering results, especially as the data size of the same dataset increases.

Other variants of DBSCAN, such as hierarchical DBSCAN (HDBSCAN) and ordering points to identify the clustering structure (OPTICS), have been developed to address some of these limitations. HDBSCAN extends DBSCAN by converting it into a hierarchical clustering algorithm that does not require the user to specify a fixed value for $ϵ$ , aiming to find clusters of varying densities. However, HDBSCAN still struggles with high-dimensional datasets, as the hierarchical approach can become computationally expensive and less effective in distinguishing between closely spaced clusters in such complex data (Fotopoulou, 2024; Monko & Kimura, 2023b). Similarly, OPTICS improves upon DBSCAN by ordering points to identify the clustering structure and handling clusters of varying densities more effectively. Nevertheless, OPTICS also faces challenges in high-dimensional spaces, where the complexity of data can lead to suboptimal clustering results and increased computational costs (Wang et al., 2019).

This paper elaborates on the enhancements introduced to the stratified sampling-DBSCAN (SS-DBSCAN) algorithm from our previous work, focusing on its innovative approach to automatically select DBSCAN parameters (Monko & Kimura, 2023a, 2023b). The methodology presented here is specifically tailored to navigate the complexities inherent in high-dimensional datasets, providing a more nuanced and effective clustering solution for the unique challenges posed by these extensive datasets. We introduce a more convenient and improved grid search method, named fast grid search (FGS) for determining MinPts. By leveraging automatic parameter selection, SS-DBSCAN aims to improve the precision and applicability of clustering techniques, enhancing the potential for actionable insights in various research domains. Additionally, we demonstrate the pivotal role of principal component analysis (PCA) and t-distributed stochastic neighbor embedding (t-SNE) in preprocessing, alongside a modified approach to SS-DBSCAN parameter optimization, in enhancing clustering accuracy and interpretability within complex datasets. SS-DBSCAN is tested against other DBSCAN variant algorithms to demonstrate its robustness and resilience.

2. Related Works

Clustering algorithms, particularly DBSCAN, have been extensively studied for their capability to identify natural groupings in data without requiring a predefined number of clusters. The original DBSCAN algorithm, introduced by Martin Ester et al. (1996), demonstrated effectiveness in discovering clusters of arbitrary shapes and handling noise. Still, its performance heavily relies on the appropriate selection of two key parameters: $ϵ$ and MinPts.

Numerous subsequent studies have attempted to address these challenges through various enhancements to the DBSCAN algorithm. Schubert et al. (2017) revisited DBSCAN and argued that it remains a practical and effective clustering algorithm, especially when applied with careful consideration of parameters and indexing strategies. Selecting the $ϵ$ parameter for DBSCAN in high-dimensional data is still challenging due to diminished contrast in distances (Hui & Gao, 2021; Hutchison & Mitchell, n.d.; Zimek et al., 2012). This issue persists irrespective of the indexing method, making DBSCAN parameterization difficult in high-dimensional contexts. Algorithms such as OPTICS and HDBSCAN eliminate the need for the $ϵ$ parameter, making them more user-friendly. However, they also face challenges when dealing with high-dimensional data (Ankerst et al., 1999; Deng et al., 2015; Kanagala & Jaya Rama Krishnaiah, 2016).

Other modifications of the DBSCAN algorithm have been proposed to enhance its clustering performance. Liu et al. (2010) introduced DBSCAN-density levels partitioning, which uses a dynamic approach to select the $ϵ$ value by calculating it for each data point based on the local density and mean distance, although this increases computational complexity. Karami and Johansson (2014) developed BDE-DBSCAN, combining binary differential evolution with DBSCAN to fine-tune its parameters, while Ren et al. (2012) created DBCAMM, which uses Mahalanobis distance and an innovative merging strategy for better image segmentation. Lai et al. (2019) proposed an optimization technique using the multiverse optimizer (MVO) algorithm to iteratively refine DBSCAN parameters, and Khan et al. (2018) introduced adaptive DBSCAN to automate parameter selection. Despite these advancements, there is still a need for more adaptable and user-friendly methods for attaining better clusters with DBSCAN, which the current paper aims to address.

Gan and Tao (2015) conducted their experiments with parameter settings that were not well suited to cluster analysis. Their choice of the $ϵ$ parameter was unusually large, set at a minimum of $ϵ = 5{,}000$ for all their experiments, and their results demonstrated better performance only under these questionable settings. In contrast, under more realistic parameter choices, the SS-DBSCAN implementation by Monko and Kimura (2023a), with an effective selection of both parameters (i.e., $ϵ$ and MinPts), produced the best results.

In high-dimensional datasets, the complexity and volume of data present significant challenges for traditional clustering algorithms (Paoletti et al., 2009; Saeed et al., 2002; Wang et al., 2020). These datasets often contain intricate patterns that are not readily apparent, necessitating the use of advanced dimensionality reduction techniques such as PCA and t-SNE (Abdi & Williams, 2010; Melit Devassy & George, 2020; Platzer, 2013; Smetana et al., 2024). In practice, applying PCA to reduce the dimensionality to a smaller number of components (e.g., 30–50) before running t-SNE is a common approach. This ensures that t-SNE works more efficiently and effectively, especially with very large or complex datasets (Pareek & Jacob, 2020; Shah & Silwal, 2019).

Various studies have also highlighted the potential of data mining and clustering in the medical domain. For instance, Zhang et al. (2016) explored big data mining in clinical medicine, emphasizing the utility of clustering techniques in identifying meaningful patterns in patient data. Ngiam and Khor (2019) discussed the role of machine learning algorithms in healthcare delivery, underscoring the importance of robust clustering methods for clinical decision-making.

From the related works discussed above, we identified a need to address the existing gaps by building upon SS-DBSCAN, a variant of DBSCAN that we developed in our previous studies and that incorporates stratified sampling for $ϵ$ estimation and a novel grid search method for determining MinPts. We place particular emphasis on MinPts, for which most algorithms rely on the rule of thumb of 2 for two-dimensional data or 2*D for high-dimensional data, where D is the dimension of the data. Abdulhameed et al. (2024) proposed semi-supervised DBSCAN, which offers significant improvements over traditional DBSCAN by incorporating a pre-specified condition or constraint to better identify core points. The authors determined the MinPts parameter based on whether the dataset is noisy: for noisy data, MinPts is set to 2*D, where D is the number of features, and for noiseless data, MinPts is set to D + 1 (Abdulhameed et al., 2024). These approaches still do not work well with complex, high-dimensional real-world data, and increasing the data size usually results in poorer clusters. In this work, the MinPts determination is improved and offers better results than the other algorithms we compared. The dual optimization of $ϵ$ and MinPts ensures that SS-DBSCAN is finely tuned to the intrinsic clustering structures within the high-dimensional datasets, enhancing clustering accuracy and interpretability.

3. Contribution

This paper makes four main contributions to the field of data mining and clustering high-dimensional datasets:

(1) We enhanced the original DBSCAN to SS-DBSCAN to address the complexities of high-dimensional data through advanced parameter optimization techniques, ensuring precise and reliable clustering results.
(2) We developed a novel adaptive range method based on local density estimates. This method dynamically adjusts the range for MinPts to improve the adaptability of SS-DBSCAN to varying data densities.
(3) We enhanced the grid search technique to significantly reduce the computational time required to determine the optimal MinPts in any DBSCAN variant that utilizes this parameter.
(4) With the enhancements listed above, we realized a scalable and adaptable DBSCAN variant, SS-DBSCAN, which is a valuable tool for analyzing large and complex datasets across various research domains.

4. Methodology

4.1. Overview and Motivation

Clustering high-dimensional data remains a significant challenge, particularly in cases where data density varies across different regions, making parameter selection for traditional clustering algorithms difficult. DBSCAN is a well-known density-based clustering method that detects arbitrarily shaped clusters and identifies noise. However, its performance is highly sensitive to the choice of parameters ( $ϵ$ and MinPts), significantly impacting clustering quality.

The main limitations of DBSCAN include:

Global $ϵ$ estimation: A single fixed $ϵ$ value may not work well for datasets with varying densities.

Rule-based MinPts selection: Traditional DBSCAN often relies on heuristics such as $MinPts = 2D$ , which do not always yield optimal clusters, especially in high-dimensional data.

Noise misclassification: Poor parameter selection can incorrectly classify valid data points as noise or merge distinct clusters.

To address these challenges, we propose SS-DBSCAN, an enhanced version of DBSCAN that automates parameter tuning and improves clustering accuracy in high-dimensional datasets. SS-DBSCAN introduces two key enhancements:

Adaptive $ϵ$ Estimation: Instead of using a fixed global $ϵ$ , SS-DBSCAN applies stratified sampling to estimate $ϵ$ dynamically, improving sensitivity to variations in local density.

Optimized MinPts Selection: Instead of relying on a heuristic formula, SS-DBSCAN performs an FGS using the silhouette score to determine the best MinPts value, ensuring well-separated clusters with minimal noise.

These modifications enable SS-DBSCAN to handle noisy, high-dimensional datasets more robustly, ensuring that clusters are accurately formed without excessive noise classification. The following sections explain how SS-DBSCAN selects parameters, handles noise, and improves clustering performance.

4.2. Data Preprocessing

Our methodological pipeline begins with the Sentence-BERT (S-BERT) encoder, which produces context-sensitive sentence embeddings. The datasets used in our experiments were mainly text data (i.e., Emotion-Sentiment, Coronavirus-Tweets, Cancer-Doc, and MIMIC III) and one numerical dataset (Sonar). The text data comprised sequences ranging from a minimum of 50 to a maximum of 500, with an average sequence length of 250. The average length of 250 reflects the natural distribution within the selected datasets. Therefore, we used the all-mpnet-base-v2 pre-trained model from S-BERT to generate embeddings (He et al., 2024; Jayanthi et al., 2021; Korea & Zahran, 2022). This pre-trained model has a maximum sequence length of 384, which can accommodate the sequence length of our data, produces 768-dimensional embeddings, and has been trained on over 1 billion training pairs (Reimers & Gurevych, 2019). By generating high-quality embeddings using S-BERT, we enhanced the representational capacity of our data, facilitating more accurate and meaningful clustering. Examples of the preprocessed and S-BERT-ready textual inputs are provided in Appendix A. We then normalized the features of the data using a Standard Scaler, followed by PCA and t-SNE as part of data pre-processing. The pipeline then proceeds to the strategies for achieving high-quality clustering using SS-DBSCAN. Figure 1 illustrates the process through data pre-processing techniques and parameter selection, particularly emphasizing the MinPts parameter, which has historically been the most challenging to optimize in previous research.

Figure 1.

Improved MinPts for SS-DBSCAN architecture. Note. MinPts = minimum points; SS-DBSCAN = stratified sampling-density-based spatial clustering of applications with noise.

The Standard Scaler was employed to prevent the dominance of high-variance features and to ensure that all features contribute proportionally to the clustering process. It also helped improve the interpretability of the data and reduce computational complexity. PCA was applied to retain the maximum variance while reducing the number of dimensions, thereby simplifying the data while preserving the essential characteristics necessary for effective clustering. Subsequently, we used t-SNE to project the high-dimensional data into a two-dimensional space, facilitating better visualization and analysis. We ran t-SNE with the following hyperparameters: t-SNE(n_components = 2, perplexity = 30, learning_rate = "auto", n_iter = 300). These values, especially the perplexity, can be adjusted to suit the data. Applying PCA before t-SNE was essential for several reasons. First, PCA reduced the dimensionality of the data, making t-SNE more computationally efficient. Second, by retaining only the most significant features, PCA helped minimize noise, thereby enhancing the quality of input data for t-SNE. Third, it played a crucial role in preventing t-SNE from overfitting to noise in high-dimensional data, resulting in more stable and robust clustering outcomes. With the data preprocessed by PCA, t-SNE was better able to reveal and preserve the underlying structure of the dataset, particularly in cases that involved complex data. This step was vital for both visualization and clustering, as it facilitated the identification of patterns that might not be evident in higher-dimensional representations.

4.3. SS-DBSCAN Parameter Selection

For the clustering component, our algorithm, SS-DBSCAN, incorporates a novel stratified sampling technique for estimating the $ϵ$ parameter. This approach was thoroughly discussed in our previous conference paper (Monko & Kimura, 2023b) and is depicted in Figure 1 as the second step following data pre-processing. The stratified sampling technique effectively accommodates the varying density distributions within the datasets, ensuring that $ϵ$ is optimally set to capture the spatial distribution of data points accurately. As a result, this enhancement strengthens the natural clustering tendency of SS-DBSCAN, leading to more precise and well-defined clusters.

4.3.1. FGS for MinPts

To optimize the selection of MinPts, which determines the core points in the DBSCAN algorithm, we implement an FGS strategy. This approach tests a range of MinPts values to pinpoint the value that maximizes cluster validity, as measured by the silhouette score. This metric assesses how similar an object is to its own cluster compared to other clusters (Habib, 2021; Rousseeuw, 1987; Thinsungnoen et al., 2015). In our previous work, we employed a grid search technique to determine the optimal value for MinPts. We manually established a range, starting from 3 and extending to a maximum value, iterating by 1 or in steps of $n$ , while holding fixed the $ϵ$ value derived from the $k$ -distance graph, which varied based on the data size and number of neighbors, $k$ (Monko & Kimura, 2023a). This approach enabled partial automation in selecting MinPts; however, we still had to manually define the range. This process not only increased execution time due to multiple iterations but also introduced the risk of overlooking critical values that could yield optimal silhouette scores, as the manual range setting might exclude such values.

Our novel method for selecting a single, optimal value for MinPts overcomes these limitations. It employs an adaptive range based on local density estimates. By calculating the density metrics $ρ_{i}$ for points within a specified $ϵ$ distance and computing the standard deviation ( $σ$ ) of these metrics, we dynamically adjust the range for MinPts by defining a lower and an upper bound. The lower bound is the average density minus the standard deviation, while the upper bound is the average density plus the standard deviation. This method significantly improves the adaptability of SS-DBSCAN to varying data densities and sizes. It also provides a robust criterion for other analytical methods that require dynamic adjustments of sample sizes based on data density. The optimal value for MinPts is obtained through the following steps:

Calculating Density Metrics

This function calculates the density metric for each point in a dataset by counting how many points lie within a certain distance $ϵ$ of each point (1). For a dataset $D$ with points $x_{i}$ , the density metric $ρ_{i}$ for point $x_{i}$ is given by:

ρ_{i} = \sum_{x_{j} \in D} \mathbf{1} (‖ x_{i} - x_{j} ‖ \leq ϵ),

(1)

where $\mathbf{1}(\cdot)$ is the indicator function, which is 1 if the condition is true and 0 otherwise, and $‖ x_{i} - x_{j} ‖$ is the Euclidean distance between points $x_{i}$ and $x_{j}$ . Averaging these metrics over the dataset gives

\bar{ρ} = \frac{1}{n} \sum_{i = 1}^{n} ρ_{i},

(2)

where $\bar{ρ}$ is the mean of the density metrics, and $n$ is the total number of points.
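As an illustration of equations (1) and (2), the following minimal Python sketch counts the neighbors of each point by brute force and averages the counts. The function names and the toy points are ours and purely illustrative; the sketch assumes Euclidean distance, as stated above, and makes no claim about how the published implementation computes these quantities.

```python
import math


def density_metrics(points, eps):
    """Equation (1): for each point, count the points within distance eps.

    The sum runs over the whole dataset, so every point counts itself.
    """
    return [
        sum(1 for q in points if math.dist(p, q) <= eps)
        for p in points
    ]


def mean_density(densities):
    """Equation (2): arithmetic mean of the density metrics."""
    return sum(densities) / len(densities)


# Toy data (illustrative only): three nearby points and one isolated point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0)]
rho = density_metrics(points, eps=1.0)
rho_bar = mean_density(rho)
```

The brute-force double loop costs O(n²) distance evaluations, which is acceptable for a sketch but would typically be replaced by a spatial index on larger datasets.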

Computing Standard Deviation

We then compute the standard deviation of the density metrics to understand the variability or spread of the density metrics across the dataset (3). If $ρ$ represents the vector of density metrics across all points, the standard deviation $σ$ is computed as:

σ = \sqrt{\frac{1}{n} \sum_{i = 1}^{n} {(ρ_{i} - \bar{ρ})}^{2}} .

(3)
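A small worked example may help fix the convention in equation (3), which divides by $n$ (the population form) rather than by $n - 1$ (the sample form). The sketch below uses hypothetical density values and compares a direct implementation with `statistics.pstdev` from the Python standard library, which uses the same population form.

```python
import math
import statistics


def density_std(densities):
    """Equation (3): population standard deviation (divides by n)."""
    n = len(densities)
    mean = sum(densities) / n
    return math.sqrt(sum((d - mean) ** 2 for d in densities) / n)


# Hypothetical density metrics for four points.
rho = [3, 2, 2, 1]
sigma = density_std(rho)

# statistics.pstdev is the population form, so the two values agree.
sigma_check = statistics.pstdev(rho)
```

For these values the mean is 2 and the squared deviations sum to 2, so $σ = \sqrt{2/4} \approx 0.707$ .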

Computing the Range for Minimum Samples

Equation (4) computes a range for the MinPts parameter in SS-DBSCAN clustering based on the average density $\bar{ρ}$ and the computed standard deviation $σ$ . The range is defined by a lower and upper bound, adjusted to ensure that the samples are at least $2$ .

\begin{aligned} \text{Lower Bound} = \max (2, \lfloor \bar{ρ} - σ \rfloor), \end{aligned}

(4a)

\begin{aligned} \text{Upper Bound} = \lfloor \bar{ρ} + σ \rfloor . \end{aligned}

(4b)

Here, $⌊ . ⌋$ denotes the floor function, which rounds down to the nearest integer. The function ensures that the lower bound is not less than $2$ , reflecting a minimum practical constraint for clustering.
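The bounds in equations (4a) and (4b) can be sketched in a few lines of Python. The function name `minpts_range` and the sample density values are illustrative assumptions, not taken from the original implementation.

```python
import math


def minpts_range(densities):
    """Equations (4a) and (4b): search range for MinPts from the density metrics."""
    n = len(densities)
    mean = sum(densities) / n
    sigma = math.sqrt(sum((d - mean) ** 2 for d in densities) / n)
    lower = max(2, math.floor(mean - sigma))  # never below 2
    upper = math.floor(mean + sigma)
    return lower, upper


# Hypothetical density metrics: mean 11, standard deviation about 5.06.
lower, upper = minpts_range([10, 12, 8, 20, 5])
```

Here the range is 5 to 16. Note that when the mean density is very low, the upper bound can fall below the enforced lower bound of 2, leaving an empty range; the text does not discuss this case, so a caller would need to decide how to handle it.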

Perform FSG

After determining the range for MinPts defined in equations (4a) and (4b), we employ an FGS technique to identify the best value within it, using the silhouette score as the metric, as shown in Algorithm 1. The search iterates through the identified range and, crucially, addresses the previous issue of unnecessarily printing all values within this range. To optimize the process, we introduced a stopping criterion: after identifying the current best MinPts, the iteration continues for five additional steps, although this number can be increased if doing so yields better results. If the silhouette score shows no improvement or consistently decreases during these iterations, the loop is terminated. This modification led to a significant reduction in computational time, decreasing execution from as long as 2,158 s to as short as 4 s in certain datasets, as demonstrated in Table 1. The execution time, however, depends on the data size, the data type, and the number of neighbors ( $k$ ), all of which directly influence the computation of the range.
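The stopping rule described above can be sketched as a patience-based loop. This is our reading of the description, not a reproduction of Algorithm 1: the function name, the `patience` argument, and the toy scoring function are assumptions. In practice `score_fn` would run DBSCAN with the fixed $ϵ$ and the candidate MinPts and return the silhouette score.

```python
def fast_grid_search(lower, upper, score_fn, patience=5):
    """Scan MinPts from lower to upper and stop early.

    The scan ends once `patience` consecutive candidates fail to improve
    on the best score seen so far. Returns (best MinPts, best score,
    number of candidates evaluated).
    """
    best_minpts, best_score = None, float("-inf")
    stale = 0
    evaluated = 0
    for min_pts in range(lower, upper + 1):
        score = score_fn(min_pts)
        evaluated += 1
        if score > best_score:
            best_minpts, best_score = min_pts, score
            stale = 0
        else:
            stale += 1
            if stale >= patience:
                break
    return best_minpts, best_score, evaluated


def toy_score(min_pts):
    """Stand-in for a silhouette score: a single peak at MinPts = 64."""
    return 1.0 - abs(min_pts - 64) / 100.0


best, score, n_eval = fast_grid_search(50, 184, toy_score)
```

With this toy scorer, the search over the range 50 to 184 stops after 20 evaluations instead of 135. The saving comes at a cost: a score curve with several peaks could cause the loop to stop before a later, higher peak, which is why the number of additional steps is left adjustable.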

Through this dual optimization strategy ( $ϵ$ and MinPts), SS-DBSCAN is finely tuned to improve its sensitivity and adherence to the intrinsic clustering structures, thus promising more precise and reliable clustering outcomes.

5. Experiment Setup

We designed a comprehensive experimentation process to rigorously compare the effectiveness of various clustering algorithms across multiple datasets. In this research, we primarily worked with real-world text and numerical data of varying sizes and complexity to evaluate the effectiveness of SS-DBSCAN and other DBSCAN variants in clustering high-dimensional embeddings. The datasets selected for our experiments included Emotion-Sentiment, Coronavirus-Tweets, Cancer-Doc, Sonar, and MIMIC III, covering various applications. The datasets were chosen based on the following criteria: (1) Diversity: They include different domains such as medical records (MIMIC III, Cancer-Doc), sentiment analysis (Emotion-Sentiment, Coronavirus-Tweets), and structured data (Sonar). (2) Data Complexity: The data selected vary in structure, ensuring SS-DBSCAN’s applicability across different data types. (3) High-Dimensionality Challenges: Each dataset presents clustering challenges that standard DBSCAN struggles with, making them ideal for benchmarking our method.

The performance of each clustering algorithm was evaluated using the silhouette score and the Davies–Bouldin index (DBI). The silhouette score measures the quality of clustering by comparing how similar each point is to its own cluster versus other clusters (Habib, 2021; Rousseeuw, 1987; Thinsungnoen et al., 2015). A higher silhouette score indicates better-defined and more distinct clusters, thereby validating the effectiveness of the clustering technique. DBI evaluates clustering quality based on cluster compactness and separation (Wijaya et al., 2021). Lower DBI values (close to 0) indicate good clustering, while higher DBI values (much greater than 1) indicate poorer clustering.
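For readers who want to see what the two metrics compute, the following is a minimal pure-Python sketch of their standard definitions with Euclidean distance. It is illustrative only: the function names are ours, at least two clusters are required, and the text does not specify how noise points (often labeled -1 by DBSCAN implementations) were treated, so the sketch simply treats every label as a cluster.

```python
import math


def _group(points, labels):
    """Collect points by cluster label."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    return clusters


def silhouette_score(points, labels):
    """Mean of (b - a) / max(a, b) over all points (higher is better)."""
    clusters = _group(points, labels)
    total = 0.0
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(
            sum(math.dist(p, q) for q in members) / len(members)
            for other, members in clusters.items()
            if other != lab
        )
        total += (b - a) / max(a, b)
    return total / len(points)


def davies_bouldin_index(points, labels):
    """Mean over clusters of the worst scatter-to-separation ratio (lower is better)."""
    clusters = _group(points, labels)
    centroids = {
        lab: tuple(sum(coord) / len(members) for coord in zip(*members))
        for lab, members in clusters.items()
    }
    scatter = {
        lab: sum(math.dist(p, centroids[lab]) for p in members) / len(members)
        for lab, members in clusters.items()
    }
    worst = [
        max(
            (scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in clusters
            if j != i
        )
        for i in clusters
    ]
    return sum(worst) / len(worst)


# Two tight, well-separated toy clusters.
points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]
labels = [0, 0, 1, 1]
sil = silhouette_score(points, labels)
dbi = davies_bouldin_index(points, labels)
```

On this toy example the silhouette score is about 0.90 and the DBI is 0.1, matching the interpretation above: compact, well-separated clusters give a high silhouette score and a DBI close to 0.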

Table 1.
Comparison of Grid Search and Fast Grid Search at Various Data Sizes.

		Grid search		Fast grid search
Data size	Search range	MinPts	Time (s)	MinPts	Time (s)
10,000	525–1,509	550	920	550	6
12,000	1,210–3,150	1,217	2,158	1,217	4
10,000	1,225–3,506	1,279	1,980	1,279	6
7,000	755–1,825	810	600	810	31
6,000	95–557	292	480	292	2
5,000	77–317	102	60	102	3
5,000	345–1,225	466	242	446	2
4,000	50–184	64	4	64	2

Note. MinPts = minimum points.

Table 2.

Parameter Values and Cluster Results of Different Algorithms at Various Data Sizes of MIMIC III.

	SS-DBSCAN			DBSCAN			HDBSCAN		OPTICS
Data size	eps	MinPt	Clusters	eps	MinPt	Clusters	Min cluster size	Clusters	xi	MinPt	Clusters
1,000	4.1189	93	2	1.7438	4	3	30	3	0.001	9	5
2,000	3.3084	75	2	1.3950	4	10	20	3	0.001	7	5
3,000	3.1629	59	2	1.0884	4	6	14	4	0.001	7	4
4,000	3.2177	64	2	0.9509	4	16	15	4	0.001	8	4
5,000	2.1996	102	2	0.8634	4	41	20	4	0.001	7	4

Note. DBSCAN = density-based spatial clustering of applications with noise; SS-DBSCAN = stratified sampling-DBSCAN; HDBSCAN = hierarchical DBSCAN; OPTICS = ordering points to identify the clustering structure; MinPt = minimum point.

5.1. Clustering Algorithms Applied in Different Data Sizes

To evaluate the effectiveness of our preprocessing and clustering methodologies, we performed comparative analyses of several algorithms across varying data sizes within the MIMIC III dataset. The primary objective of this experiment was to demonstrate the robustness of SS-DBSCAN, particularly in managing complex datasets, and to assess its performance consistency as data size increases, an area where other algorithms often exhibit limitations, as seen in Table 2. The MIMIC III dataset used in this study primarily consists of two distinct clusters: adverse drug reaction (ADR) and non-ADR cases. Among the algorithms tested, only SS-DBSCAN consistently identified the correct number of clusters, regardless of increasing data size. In contrast, the other algorithms produced varying numbers of clusters with inconsistent results as the dataset expanded. These findings demonstrate that the enhanced SS-DBSCAN algorithm delivers superior clustering accuracy and robustness compared to the other algorithms evaluated in this experiment. Clustering results are also visualized in Figures 2 to 5.

Figure 2.

Comparison of stratified sampling-density-based spatial clustering of applications with noise (SS-DBSCAN) results for different data sizes.

Figure 3.

Comparison of density-based spatial clustering of applications with noise (DBSCAN) results for different data sizes.

Figure 4.

Comparison of hierarchical density-based spatial clustering of applications with noise (HDBSCAN) results for different data sizes.

Figure 5.

Comparison of ordering points to identify the clustering structure (OPTICS) results for different data sizes.

5.1.1. Clustering Results With SS-DBSCAN

SS-DBSCAN employs stratified sampling for precise estimation of the $ϵ$ parameter and utilizes an FGS method for optimizing MinPts, allowing for the dynamic adjustment of DBSCAN’s parameters to better align with the inherent structure of the data. The resulting parameter values vary according to the characteristics of the dataset. Figure 2 presents the clustering results generated by SS-DBSCAN.

5.1.2. Clustering Results With DBSCAN

For DBSCAN, we selected the parameters manually following established practice, setting MinPts to 4 and $ϵ$ to the knee value. The $ϵ$ parameter is determined by identifying the knee point of the $k$ -distance graph, which corresponds to the location of a significant bend in the curve. The MinPts parameter is chosen according to the rule that MinPts should be set to two times the dimensionality of the data (2*dim), which gives 4 for the two-dimensional data obtained after t-SNE. Figure 3 illustrates the clusters obtained.
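The baseline parameter choice can be sketched as follows. The text does not state how the knee point was located, so the detector below is an assumption on our part: a common heuristic that picks the point on the sorted $k$ -distance curve lying farthest below the straight line joining its endpoints. The function names and toy points are likewise illustrative.

```python
import math


def k_distances(points, k):
    """Sorted distance from each point to its k-th nearest neighbor (self excluded)."""
    result = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(dists[k - 1])
    return sorted(result)


def knee_index(values):
    """Index where the curve lies farthest below the chord joining its endpoints.

    Assumed heuristic, not the detector used in the paper. Needs at least
    two values.
    """
    last = len(values) - 1
    first_y, last_y = values[0], values[-1]

    def gap(i):
        chord = first_y + (last_y - first_y) * i / last
        return chord - values[i]

    return max(range(len(values)), key=gap)


# Toy data: a tight square of four points plus one outlier.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (10.0, 10.0)]
dim = len(points[0])

kd = k_distances(points, k=2)
eps = kd[knee_index(kd)]   # epsilon read off at the knee
min_pts = 2 * dim          # rule of thumb: twice the dimensionality
```

For this toy set the knee falls just before the outlier's jump, giving `eps = 1.0` and `min_pts = 4`.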

5.1.3. Clustering Results With HDBSCAN

Another preferred algorithm is HDBSCAN, which can handle varying density clusters without specifying the epsilon or a global density threshold. Figure 4 presents the cluster results for HDBSCAN.

5.1.4. Clustering Results With OPTICS

OPTICS is an extension of the DBSCAN algorithm designed to identify clusters in data with varying densities. Unlike DBSCAN, which relies on fixed parameters, OPTICS produces an ordered list of points based on their reachability distances, allowing it to reveal the clustering structure at multiple density levels. Figure 5 shows the cluster results for the OPTICS algorithm.

5.2. Results for Clustering Algorithms Applied in Different Datasets

In addition to experimenting with a single dataset of various sizes, we also conducted experiments across multiple datasets using all four algorithms. The datasets included Emotion-Sentiment, Coronavirus-Tweets, Cancer-Doc, and Sonar. Our comprehensive evaluation demonstrated that SS-DBSCAN consistently outperformed the other algorithms across all datasets and data sizes, using the parameter values listed in Table 3. The results of these experiments are presented in Figure 6, providing a quantitative comparison of clustering performance. These visualizations highlight the algorithm’s ability to effectively manage varying data densities and complex structures, reinforcing its robustness and applicability across different types of high-dimensional data.

Table 3.
Parameter Values and Cluster Results of Different Algorithms on Various Datasets.

	SS-DBSCAN			DBSCAN			HDBSCAN		OPTICS
Dataset	eps	MinPt	Clusters	eps	MinPt	Clusters	Min cluster size	Clusters	xi	MinPt	Clusters
Emotion Sentiment	3.7833	466	1	1.0365	4	1	15	2	0.001	7	2
Corona Tweets	3.2270	194	2	1.1366	4	3	20	2	0.001	7	1
CancerDoc	3.5805	82	2	1.8789	4	1	15	19	0.001	7	7
MIMIC III	3.2177	64	2	0.9509	4	41	15	4	0.001	8	4
Sonar	5.0213	51	2	1.5947	4	1	15	—	0.001	7	7

Figure 6.

Algorithms’ performances in different data sizes of MIMIC III.

5.3. Comparative Analysis

Each clustering technique is applied to the preprocessed data, and its performance is compared with that of the other algorithms across different data sizes and datasets. The clustering effectiveness is analyzed in the context of the data’s size, complexity, and high dimensionality, taking into account the nuances and variability inherent in these datasets.

Our results in Figures 6 and 7 demonstrate the effectiveness of SS-DBSCAN in achieving more reliable and meaningful clustering outcomes than other DBSCAN variants. More results are described in Tables 4 and 5. The stratified sampling approach for $ϵ$ estimation and the FGS for MinPts significantly enhance the robustness and resilience of the clustering process. This highlights the scalability of SS-DBSCAN and its adaptability to varying data densities, data sizes, and high-dimensional data, making it a valuable tool for complex data analysis and decision-making.

Figure 7.

Algorithms’ performance on different datasets of size 4,000.

6. Result Interpretation

Our implementation of SS-DBSCAN significantly enhances the clustering process by allowing us to precisely select the optimal values for $ϵ$ (the maximum radius of the neighborhood) and MinPts (the minimum number of points in a neighborhood to form a cluster). This method ensures that we consistently achieve reliable clustering outcomes, distinctly improving upon the approach used by other density-based clustering algorithms.

In the standard DBSCAN framework, the MinPts parameter is typically determined using a heuristic based on the dataset’s dimensionality, often set at twice the number of dimensions. In our study, after reducing the data’s dimensionality from 768 to 2, we applied a MinPts value of 4 following this rule of thumb. However, this approach is somewhat arbitrary and fails to accurately reflect the true density distribution in more complex datasets. Consequently, this led to suboptimal clustering results as shown in our experiments.
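As a small illustration of the rule of thumb discussed above, the sketch below derives MinPts as twice the dimensionality and passes it to scikit-learn's `DBSCAN`. The helper name `heuristic_min_pts`, the random data, and the `eps` value are hypothetical and chosen only for demonstration.

```python
import numpy as np
from sklearn.cluster import DBSCAN


def heuristic_min_pts(X):
    """Rule-of-thumb MinPts: twice the number of dimensions of X."""
    return 2 * X.shape[1]


# Random 2-D points standing in for dimensionality-reduced embeddings.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 2))

min_pts = heuristic_min_pts(X)  # 2 dimensions -> MinPts = 4
# eps = 0.3 is an arbitrary illustrative value, not one reported in the paper.
labels = DBSCAN(eps=0.3, min_samples=min_pts).fit_predict(X)

print("MinPts:", min_pts, "| noise points:", int(np.sum(labels == -1)))
```

The heuristic depends only on the number of dimensions, so it yields the same MinPts for any 2-D dataset regardless of how dense or sparse the points actually are.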

On the other hand, HDBSCAN, another variation of DBSCAN, adjusts its sensitivity based on several parameters such as min_cluster_size, min_samples, and alpha. The performance of HDBSCAN hinges significantly on the appropriate selection of min_cluster_size. Ineffective choices for this parameter can lead to poor clustering results, whereas optimal parameter tuning can considerably enhance the clustering quality. However, as the data size increases, the performance of HDBSCAN degrades, often returning meaningless clusters, as seen in Figure 4.

OPTICS was also included in our experiment to explore its potential advantages over traditional density-based methods such as DBSCAN and HDBSCAN. OPTICS attempts to uncover the clustering structure of data by ordering points based on their density-reachability. However, in our experiments, OPTICS underperformed in all datasets, as illustrated in Figures 5 and 6. The algorithm’s sensitivity to initial parameter settings (min_samples, xi, min_cluster_size, metric), coupled with its computational complexity, resulted in very poor clusters. OPTICS struggled to adapt to the intricate density variations in the data, ultimately producing less meaningful clustering outcomes.

Therefore, SS-DBSCAN distinguishes itself from other algorithms by incorporating stratified sampling to determine the best values for $ϵ$ and FGS to determine MinPts without arbitrary estimations. This approach allows SS-DBSCAN to adapt effectively across different sizes and complexities of datasets, including complex and noisy datasets. SS-DBSCAN delivers consistent results, which underscores the algorithm’s robustness and makes it highly effective for diverse applications.

Table 4.
Silhouette and DBI Scores of Different Algorithms at Various Data Sizes of MIMIC III.

	SS-DBSCAN		DBSCAN		HDBSCAN		OPTICS
Data size	Silhouette	DBI	Silhouette	DBI	Silhouette	DBI	Silhouette	DBI
1,000	0.64	0.39	0.24	1.41	0.54	1.51	0.02	1.37
2,000	0.62	0.48	$-0.22$	1.36	0.40	1.59	$-0.11$	1.25
3,000	0.61	0.44	$-0.40$	1.45	0.38	1.50	$-0.16$	1.23
4,000	0.61	0.44	$-0.46$	1.54	0.34	1.59	$-0.22$	1.22
5,000	0.61	0.45	$-0.04$	1.52	0.09	1.57	$-0.35$	1.25

Note. DBI = Davies–Bouldin index; DBSCAN = density-based spatial clustering of applications with noise; SS-DBSCAN = stratified sampling-DBSCAN; HDBSCAN = hierarchical DBSCAN; OPTICS = ordering points to identify the clustering structure.

Table 5.

DBI Scores of Different Algorithms on Various Datasets.

Dataset	Data size	SS-DBSCAN	DBSCAN	HDBSCAN	OPTICS
EmotionsSentiments	4,000	0.79	4.26	4.67	6.30
CoronavirusTweets	4,000	0.83	1.89	2.42	1.56
CancerDoc	4,000	0.71	1.62	1.47	1.65
MIMIC III	4,000	0.44	1.54	1.59	1.22
Sonar	4,000	0.40	1.52	2.40	1.82

7. Discussion

In this paper, we enhanced SS-DBSCAN and evaluated its performance at different data sizes and on different datasets. Our findings underscore the adaptability and robustness of SS-DBSCAN, especially in handling large and complex data. The unique parameter optimization approach of SS-DBSCAN enhances its efficacy in identifying meaningful clusters vital for data mining and decision-making. Below, we discuss several aspects of SS-DBSCAN’s application and the implications of our results:

7.1. Noise Sensitivity

SS-DBSCAN managed noise better than traditional DBSCAN, HDBSCAN, and OPTICS in all datasets used for the experiment. In contrast to standard DBSCAN, which relies on a fixed global $ϵ$ estimation, SS-DBSCAN utilizes stratified sampling to determine $ϵ$ adaptively, ensuring better sensitivity to local density variations and reducing the risk of misclassifying sparse regions as noise. Additionally, SS-DBSCAN improves MinPts selection through an FGS approach, where MinPts is optimized based on the silhouette score, excluding noise points from the computation. This prevents misclassified noise from distorting cluster validity, leading to a more robust and noise-resistant clustering process. Unlike DBSCAN, where MinPts is often set based on a rule of thumb (e.g., twice the number of dimensions), SS-DBSCAN dynamically selects the best-fitting MinPts, ensuring that the clustering structure is well-defined with fewer noise points. These enhancements collectively enable SS-DBSCAN to produce higher-quality clusters with better-defined boundaries, particularly in complex and high-dimensional datasets.
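To make the idea of scoring MinPts candidates on non-noise points concrete, the following sketch runs a plain exhaustive search over MinPts with $ϵ$ held constant. It is not the FGS procedure itself, whose range-narrowing steps are not reproduced here; the function names, the synthetic data, and the `eps` and candidate values are all assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score


def silhouette_without_noise(X, labels):
    """Silhouette score computed only on points not labeled as noise (-1)."""
    mask = labels != -1
    kept = labels[mask]
    if len(np.unique(kept)) < 2:
        return -1.0  # silhouette is undefined for fewer than two clusters
    return silhouette_score(X[mask], kept)


def search_min_pts(X, eps, candidates):
    """Try each MinPts candidate with eps fixed and keep the best-scoring one."""
    best_min_pts, best_score = None, -np.inf
    for min_pts in candidates:
        labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X)
        score = silhouette_without_noise(X, labels)
        if score > best_score:
            best_min_pts, best_score = min_pts, score
    return best_min_pts, best_score


# Synthetic, well-separated blobs (illustrative only).
X, _ = make_blobs(
    n_samples=300,
    centers=[[-5, -5], [0, 5], [5, -5]],
    cluster_std=0.5,
    random_state=0,
)
best_min_pts, best_score = search_min_pts(X, eps=0.8, candidates=range(3, 30, 3))
print("best MinPts:", best_min_pts, "| silhouette:", round(best_score, 3))
```

Masking out the label `-1` before scoring is the key step: without it, noise points would be treated as one more cluster and would pull the silhouette score down.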

7.2. Scalability

The scalability of SS-DBSCAN was rigorously evaluated using the MIMIC III dataset, a large and complex real-world dataset. The results demonstrate the algorithm’s efficiency in handling extensive data volumes while maintaining high-quality clustering performance. This establishes SS-DBSCAN as a highly suitable solution for large-scale datasets where computational efficiency and time constraints are critical factors. Furthermore, the algorithm’s flexibility in determining both the $ϵ$ and MinPts parameters consistently yields more accurate and reliable results, regardless of dataset size. This adaptability underscores SS-DBSCAN’s robustness across varying data densities, further enhancing its applicability in diverse research and real-world scenarios.

7.3. Parameter Adaptivity and Robustness

Our methodology dynamically adjusts $ϵ$ and MinPts based on the dataset’s intrinsic characteristics. This adaptivity allows SS-DBSCAN to respond flexibly to variations in data density and distribution, ensuring optimal clustering across different datasets. We also explored how variations in $ϵ$ and MinPts in different datasets affect the stability of the clusters. Our results show that SS-DBSCAN maintains consistent clustering quality even with minor parameter adjustments, highlighting its reliability for clustering applications where precision is paramount.
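One way to picture a data-driven MinPts range is to count, for every point, how many neighbors fall within $ϵ$ and then bound the search by the spread of those counts. The sketch below does this with a mean ± one standard deviation window; this specific formula, the function name `min_pts_range`, and the demo values are assumptions made for illustration rather than the exact rule used by SS-DBSCAN.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors


def min_pts_range(X, eps):
    """Candidate MinPts range from neighbor counts within eps.

    Assumed rule for illustration: [mean - std, mean + std] of the counts,
    with a lower bound of 2.
    """
    nn = NearestNeighbors(radius=eps).fit(X)
    neighborhoods = nn.radius_neighbors(X, return_distance=False)
    # Each count includes the query point itself, since X is also the training set.
    counts = np.array([len(idx) for idx in neighborhoods])
    mean, std = counts.mean(), counts.std()
    low = max(2, int(np.floor(mean - std)))
    high = max(low, int(np.ceil(mean + std)))
    return low, high


rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))          # illustrative data only
low, high = min_pts_range(X, eps=0.5)  # eps chosen arbitrarily for the demo
print("candidate MinPts range:", low, "to", high)
```

A denser dataset produces larger neighbor counts and therefore a higher candidate range, which is the sense in which the range follows the data rather than a fixed rule.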

7.4. Cluster Validation

We employed the silhouette score and the DBI to validate the clusters generated by SS-DBSCAN. Both silhouette and DBI scores confirm the distinctiveness and relevance of the clusters. In comparison with other algorithms used in our experiment, SS-DBSCAN stands out for its robustness and precision. Unlike methods that require extensive parameter tuning and may not form clusters effectively, SS-DBSCAN adapts its parameters automatically, offering more reliable clustering even in complex datasets.
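Both validation measures are available in scikit-learn, and a minimal sketch of computing them for a density-based clustering is given here. The helper `validate_clusters`, the synthetic data, and the DBSCAN parameters are hypothetical; excluding noise points before scoring is a choice made in this sketch.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score


def validate_clusters(X, labels):
    """Return (silhouette, DBI) on non-noise points, or None if undefined."""
    mask = labels != -1
    kept = labels[mask]
    if len(set(kept)) < 2:
        return None  # both scores need at least two clusters
    return silhouette_score(X[mask], kept), davies_bouldin_score(X[mask], kept)


# Synthetic, well-separated blobs (illustrative only).
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 0], [0, 10]],
    cluster_std=0.5,
    random_state=0,
)
labels = DBSCAN(eps=1.0, min_samples=5).fit_predict(X)
silhouette, dbi = validate_clusters(X, labels)
print(f"silhouette={silhouette:.2f} (higher is better), DBI={dbi:.2f} (lower is better)")
```

The two measures point in opposite directions: silhouette ranges from -1 to 1 with higher values indicating better separation, whereas a lower DBI indicates more compact, better-separated clusters.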

7.5. Importance of Dimensionality Reduction

The S-BERT embeddings used in this study are 768-dimensional, and directly applying clustering algorithms such as SS-DBSCAN and others to such high-dimensional data has proven ineffective due to the curse of dimensionality. High dimensionality distorts distance metrics used in density-based clustering, increases computational complexity, and reduces interpretability and cluster separability. Therefore, dimensionality reduction is essential in clustering high-dimensional data, as it enhances cluster separability, reduces noise, and improves computational efficiency. While the choice of technique is not limited to a specific method, our research found that the combination of PCA and t-SNE yielded the best results for all algorithms used in our experiments. PCA removes noise and extracts principal features, while t-SNE preserves local structure and adapts to varying cluster densities, leading to better-defined clusters across different dataset sizes. Other techniques, such as Uniform Manifold Approximation and Projection (UMAP; McInnes et al., 2018), can also be used, but our experiments showed key differences. UMAP performed well on smaller datasets, producing compact clusters, but as dataset size increased, it introduced significant noise and fragmented clusters. In contrast, PCA $+$ t-SNE maintained consistent clustering performance and better noise handling across all dataset sizes. While UMAP emphasizes local manifold learning, t-SNE’s perplexity parameter provided better cluster separation, making it a more suitable choice for our research. Future work can explore alternative combinations, such as PCA before UMAP or t-SNE after UMAP, to balance local and global structure preservation.
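The PCA followed by t-SNE pipeline can be sketched with scikit-learn as shown here. The intermediate size of 50 PCA components, the perplexity of 30, the function name `reduce_embeddings`, and the random stand-in embeddings are assumptions for illustration; the values actually used in the experiments are not stated in this section.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import TSNE


def reduce_embeddings(embeddings, pca_components=50, perplexity=30.0, seed=0):
    """Reduce high-dimensional embeddings to 2-D with PCA followed by t-SNE."""
    # PCA cannot keep more components than min(n_samples, n_features).
    pca_components = min(pca_components, *embeddings.shape)
    reduced = PCA(n_components=pca_components, random_state=seed).fit_transform(embeddings)
    tsne = TSNE(n_components=2, perplexity=perplexity, init="pca", random_state=seed)
    return tsne.fit_transform(reduced)


# Random vectors standing in for 768-dimensional S-BERT embeddings.
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(200, 768))

points_2d = reduce_embeddings(fake_embeddings)
print(points_2d.shape)
```

Running PCA first keeps the t-SNE step inexpensive and discards low-variance directions before the neighbor-preserving projection is computed.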

7.6. Limitations and Future Work

While we recognize that text data, particularly in high-dimensional embedding spaces, does not typically form spherical clusters, our experiments demonstrated that SS-DBSCAN could still identify meaningful and well-separated clusters on some datasets, such as MIMIC III. This is attributed to the algorithm’s ability to adaptively estimate local density thresholds and handle complex, nonlinear cluster boundaries. Although the silhouette score and DBI that we used to evaluate clustering performance are metrics that traditionally favor spherical cluster structures, we observed that they provided reasonable insights into cluster compactness and separation in our context. Nevertheless, these measures may not fully capture the structure of density-based, nonspherical clusters. As such, in future work, we plan to incorporate the density-based clustering validation index, which is more suitable for evaluating the quality of arbitrarily shaped clusters, thereby offering a more robust and accurate assessment of SS-DBSCAN’s performance on text data.

Future research could also explore the applicability of our technique to a broader range of datasets beyond those considered in this study. Our work primarily focused on text-based and structured (numerical) real-world data, where the proposed approach demonstrated effectiveness in handling high-dimensional and complex data distributions. However, its performance on other datasets, such as image, audio, or multi-modal data, remains an open question. Investigating how our method adapts to different data structures and domains would be a valuable direction for future work, potentially enhancing its generalizability and robustness across diverse applications.

8. Conclusion

In this study, we introduced an enhanced SS-DBSCAN algorithm designed to improve parameter selection and noise handling in complex and high-dimensional datasets. By leveraging adaptive $ϵ$ estimation through stratified sampling and an optimized MinPts selection with the FGS technique, SS-DBSCAN overcomes key limitations of traditional DBSCAN and other density-based clustering algorithms, making it particularly effective for high-dimensional, noisy, and non-spherical datasets.

Our experiments demonstrate that SS-DBSCAN excels in clustering diverse datasets, including text embeddings and structured numerical data. The algorithm dynamically adapts to varying density distributions, ensuring more accurate cluster formations while reducing the risk of misclassification. Additionally, its efficiency in handling large data volumes makes it a valuable tool for real-world clustering applications in domains such as biomedical informatics, social media analysis, and anomaly detection.

Beyond improving clustering accuracy, these enhancements strengthen the interpretability of cluster structures, providing deeper, actionable insights into complex datasets. Such insights are crucial for data-driven decision-making in research and industry.

Footnotes

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Data Availability Statement

The full dataset and code are available at: .

ORCID iDs

Gloriana Monko

Masaomi Kimura

Appendix A: Dataset Formatting for S-BERT

While preparing the MIMIC III dataset, we selected clinically meaningful features such as patient age, gender, admission type, prescribed medications, diagnosis description, and clinical notes. To generate context-sensitive embeddings using S-BERT, we first preprocessed structured and semi-structured clinical records. This included removing punctuation, lowercasing, and segmenting relevant text fields. The resulting text inputs, formatted as plain natural language, were then passed to the S-BERT encoder. Below are representative examples from the MIMIC III dataset formatted for embeddings.

References

Abdi, & Williams, L. J. (2010). Principal component analysis. Wiley Interdisciplinary Reviews: Computational Statistics, 2(4), 433. https://doi.org/10.1002/wics.101

Abdulhameed, T. Z., Yousif, S. A., Samawi, V. W., & Al-Shaikhli, H. I. (2024). SS-DBSCAN: Semi-supervised density-based spatial clustering of applications with noise for meaningful clustering in diverse density data. IEEE Access, 12, 131507–131520. https://doi.org/10.1109/ACCESS.2024.3457587

Ankerst, Breunig, M. M., Kriegel, H. P., & Sander (1999). OPTICS: Ordering points to identify the clustering structure. SIGMOD Record, 28(2), 49–60. https://doi.org/10.1145/304181.304187

Arya, Abhishek Arya, R. B. E. (2019). Exploratory data analysis of intensive care unit patients using MIMIC-III database. https://www.minsal.cl/wp-content/uploads/2019/01/2019.01.23_PLAN-NACIONAL-DE-CANCER_web.pdf

Deng, Zhu, & Huang (2015). A scalable and fast OPTICS for clustering trajectory big data. Cluster Computing: The Journal of Networks, Software Tools and Applications, 18(2), 549–562. https://doi.org/10.1007/s10586-014-0413-9

Fotopoulou (2024). A review of unsupervised learning in astronomy. Astronomy and Computing, 48, 100851. https://doi.org/10.1016/j.ascom.2024.100851

Gan, & Tao (2015). DBSCAN revisited: Mis-claim, un-fixability, and approximation. In Proceedings of the 2015 ACM SIGMOD international conference on management of data (pp. 519–530). https://doi.org/10.1145/2723372.2737792

Habib, A. B. (2021). Elbow method vs silhouette co-efficient in determining the number of clusters. BRAC University Journal of Humanities and Social Sciences (BUJHSS) and Bangladesh Journal of Academy of Sciences. https://doi.org/10.13140/RG.2.2.27982.79688

Yuan, Chen, & Horrocks (2024). Language models as hierarchy encoders. Advances in Neural Information Processing Systems, 37, 14690–147111. https://doi.org/10.5281/zenodo.10511042

Hui, & Gao, B. J. (2021). When is nearest neighbor meaningful: Sequential data. In Proceedings of the 30th ACM international conference on information & knowledge management (pp. 3103–3106). https://doi.org/10.1145/3459637.3482219

Iavindrasana, Cohen, Depeursinge, Müller, Meyer, & Geissbuhler (2009). Clinical data mining: A review. In Yearbook of medical informatics (pp. 121–133). Georg Thieme Verlag KG. https://doi.org/10.1055/s-0038-1638651

Jayanthi, S. M., Embar, & Raghunathan (2021). Evaluating pretrained transformer models for entity linking in task-oriented dialog. http://arxiv.org/abs/2112.08327

Kanagala, H. K., & Jaya Rama Krishnaiah, V. V. (2016). A comparative study of K-means, DBSCAN and OPTICS. In 2016 International conference on computer communication informatics, ICCCI 2016 (pp. 1–6). IEEE. https://doi.org/10.1109/ICCCI.2016.7479923

Karami, & Johansson (2014). Choosing DBSCAN parameters automatically using differential evolution. International Journal of Computer Applications, 91(7), 1–11. https://doi.org/10.5120/15890-5059

Khan, M. M. R., Siddique, M. A. B., Arif, R. B., & Oishe, M. R. (2018). ADBSCAN: Adaptive density-based spatial clustering of applications with noise for identifying clusters with varying densities. In 4th International conference on electrical engineering and information & communication technology iCEEiCT 2018 (pp. 107–111). IEEE. https://doi.org/10.1109/CEEICT.2018.8628138

Korea, & Zahran (2022). UNLPSat TextGraphs-16 natural language premise selection task: Unsupervised natural language premise selection in mathematical text using sentence-MPNet. https://hdl.handle.net/10468/13837

Lai, Zhou, Bian, & Song (2019). A new DBSCAN parameters determination method based on improved MVO. IEEE Access, 7, 104085–104095. https://doi.org/10.1109/ACCESS.2019.2931334

Liu, Xiong, & Gao (2010). Understanding of internal clustering validation measures. In Proceeding – IEEE international conference on data mining, ICDM (pp. 911–916). IEEE. https://doi.org/10.1109/ICDM.2010.35

Martin Ester, X. X., Kriegel, H.-P., & Sander (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD-96 proceedings (pp. 226–231). AAAI Press. https://dl.acm.org/doi/10.5555/3001460.3001507

McInnes, Healy, & Melville (2018). UMAP: Uniform manifold approximation and projection for dimension reduction. arXiv preprint, vol. 1802.03426. https://doi.org/10.48550/arXiv.1802.03426

Melit Devassy, & George (2020). Dimensionality reduction and visualisation of hyperspectral ink data using t-SNE. Forensic Science International, 311, 110194. https://doi.org/10.1016/j.forsciint.2020.110194

Mollura, Mantoan, Romano, Lehman, L. W., Mark, R. G., & Barbieri (2020). The role of waveform monitoring in sepsis identification within the first hour of intensive care unit stay. In 2020 11th conference of the European study group on cardiovascular oscillations computation and modelling in physiology: New challenges and opportunities ESGCO 2020 (pp. 1–9). IEEE. https://doi.org/10.1109/ESGCO49734.2020.9158013

Monko, G. J., & Kimura (2023). Optimized DBSCAN parameter selection: Stratified sampling for epsilon and GridSearch for minimum samples. Computer Science & Information Technology (CS & IT), 43–61. https://doi.org/10.5121/csit.2023.132004

Monko, G. J., & Kimura (2023). SS-DBSCAN: Epsilon estimation with stratified sampling for density-based spatial clustering of applications with noise. In Proceeding 2023 international conference on automation, control and electronics engineering CACEE 2023 (pp. 72–76). IEEE. https://doi.org/10.1109/CACEE61121.2023.00023

Ngiam, K. Y., & Khor, I. W. (2019). Big data and machine learning algorithms for health-care delivery. Lancet Oncology, 20(5), e262–e273. https://doi.org/10.1016/S1470-2045(19)30149-4

Paoletti, et al. (2009). Explorative data analysis techniques and unsupervised clustering methods to support clinical assessment of chronic obstructive pulmonary disease (COPD) phenotypes. Journal of Biomedical Informatics, 42(6), 1013–1021. https://doi.org/10.1016/j.jbi.2009.05.008

Pareek, & Jacob (2020). Data compression and visualization using PCA and T-SNE. In Advances in information communication technology and computing (pp. 327–337). Springer Nature Singapore Pte Ltd. https://doi.org/10.1007/978-981-15-5421-6_34

Platzer (2013). Visualization of SNPs with t-SNE. PLoS One, 8(2), e56883. https://doi.org/10.1371/journal.pone.0056883

Ram, Jalal, A. S., & Kumar (2010). A density based algorithm for discovering density varied clusters in large spatial databases. International Journal of Computer Applications, 3(6), 1–4. https://doi.org/10.5120/739-1038

Reimers, & Gurevych (2019). Sentence-BERT: Sentence embeddings using siamese BERT-networks. In EMNLP-IJCNLP 2019—2019 conference on empirical methods in natural language processing 9th international joint conference on natural language (pp. 3982–3992). Cornell University. https://doi.org/10.18653/v1/d19-1410

Ren, & Liu (2012). DBCAMM: A novel density based clustering algorithm via using the Mahalanobis metric. Applied Soft Computing, 12(5), 1542–1554. https://doi.org/10.1016/j.asoc.2011.12.015

Rousseeuw, P. J. (1987). Silhouettes: A graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, 20(C), 53–65. https://doi.org/10.1016/0377-0427(87)90125-7

Saeed, Lieu, Raber, & Mark, R. G. (2002). MIMIC II: A massive temporal ICU patient database to support research in intelligent patient monitoring. Computers in Cardiology, 29, 641–644. https://doi.org/10.1109/cic.2002.1166854

Sander, Ester, & Kriegel, H. P. (1998). Density-based clustering in spatial databases: The algorithm GDBSCAN and its applications. Data Mining and Knowledge Discovery, 2(2), 169–194. https://doi.org/10.1023/A:1009745219419

Schubert, Sander, Ester, & Kriegel, H. P. (2017). DBSCAN revisited, revisited: Why and how you should (still) use DBSCAN. ACM Transactions on Database Systems (TODS), 42(3). https://doi.org/10.1145/3068335

Shah, G. H. (2012). An improved DBSCAN, a density based clustering algorithm with parameter selection for high dimensional data sets. In 3rd Nirma university international conference on engineering NUiCONE 2012 (pp. 1–6). IEEE. https://doi.org/10.1109/NUICONE.2012.6493211

Shah, & Silwal (2019). Using dimensionality reduction to optimize t-SNE. http://arxiv.org/abs/1912.01098

Smetana, Salles de Salles, Sukharev, & Khazanovich (2024). Highway construction safety analysis using large language models. Applied Sciences, 14(4), 1352. https://doi.org/10.3390/app14041352

Thinsungnoen, Kaoungku, Durongdumronchai, & Kerdprasop (2015). The clustering validity with silhouette and sum of squared errors. Learning, 3(7), 44–51. https://doi.org/10.12792/iciae2015.012

Wang, Y. F., Jiong, G. P., & Qian, Y. R. (2019). A new outlier detection method based on OPTICS. Sustainable Cities and Society, 45, 197–212. https://doi.org/10.1016/j.scs.2018.11.031

Wang, McDermott, M. B. A., Chauhan, Ghassemi, Hughes, M. C., & Naumann (2020). MIMIC-Extract. In ACM CHIL 2020—proceedings of the 2020 ACM conference on health, inference, and learning (pp. 222–235). Association for Computing Machinery. https://doi.org/10.1145/3368555.3384469

Winslett (Ed.). (2009). Scientific and statistical database management: Proceedings of the 21st International Conference, SSDBM 2009, New Orleans, LA, USA, June 2–4, 2009 (Lecture Notes in Computer Science, Vol. 5566). Springer. https://link.springer.com/book/10.1007/978-3-642-02279-1

Wijaya, Y. A., Kurniady, D. A., Setyanto, Tarihoran, W. S., Rusmana, & Rahim (2021). Davies Bouldin index algorithm for optimizing clustering case studies mapping school facilities. TEM Journal – Technology, Education, Management, Informatics, 10(3), 1099–1103. https://doi.org/10.18421/TEM103-13

Wu, W. T., et al. (2021). Data mining in clinical big data: The frequently used databases, steps, and methodological models. Military Medical Research, 8(1), 1–12. https://doi.org/10.1186/s40779-021-00338-z

Zhang, Guo, S. L., Han, L. N., & T. L. (2016). Application and exploration of big data mining in clinical medicine. Chinese Medical Journal, 129(6), 731–738. https://doi.org/10.4103/0366-6999.178019

Zimek, Schubert, & Kriegel, H.-P. (2012). A survey on unsupervised outlier detection in high-dimensional numerical data. Statistical Analysis and Data Mining, 5(5), 363–387. https://doi.org/10.1002/sam.11161

		Grid search		Fast grid search
Data size	Search range	MinPts	Time (s)	MinPts	Time (s)
10,000	525-1,509	550	920	550	6
12,000	1,210–3,150	1,217	2,158	1,217	4
10,000	1,225–3,506	1,279	1,980	1,279	6
7,000	755–1,825	810	600	810	31
6,000	95–557	292	480	292	2
5,000	77–317	102	60	102	3
5,000	345–1,225	466	242	446	2
4,000	50–184	64	4	64	2


Enhanced Stratified Sampling-Density-Based Spatial Clustering of Applications With Noise (SS-DBSCAN) for High-Dimensional Data
