Unsupervised learning is a major class of machine learning techniques in which response information is missing or unavailable. Among these techniques, clustering plays a central role by grouping objects according to a chosen similarity measure. K-Means is one of the most established and widely used clustering methods, known for its simplicity and computational efficiency. For continuous data, K-Means performs well when the number of clusters $K$ is known and correctly specified. However, it faces convergence and overfitting challenges when $K$ is unknown. These issues stem from the K-Means objective function, which decreases monotonically as the number of clusters increases, leading to a tendency to overfit. In this article, we propose an augmented K-Means algorithm that introduces a penalized version of the standard K-Means objective, designed to guard against overfitting and promote model parsimony when $K$ is unknown. We establish key optimality properties of both the traditional K-Means loss function and the proposed penalty term. Extensive simulation studies on benchmark datasets demonstrate the improved performance of the proposed method, including accurate identification of the true number of clusters. Additionally, we apply our approach to the clustering of globular galaxy datasets, an example of truly large-scale (“Big”) data, to further illustrate its effectiveness.
Unsupervised learning problems arise when there is no available information about the response or outcome variable, yet it remains necessary to draw inferences from the observed input features. Clustering is a core task in this setting, aiming to partition the data into distinct groups based on shared characteristics among the features. This process helps reveal the natural structure of the data, offering insights that are both meaningful and applicable across a wide range of domains. A variety of clustering techniques exist, differing in how they segment the data space into clusters of observations. Broadly speaking, the goal is to ensure that observations within a cluster are more similar to each other, while observations in different clusters are more dissimilar, according to a chosen similarity or distance measure. Clustering mechanisms can be partitioned into two subcategories with respect to the mechanism of forming clusters, namely,
Hierarchical clustering: Objects are clustered in a top-down or bottom-up hierarchy based on a chosen distance metric and linkage function, resulting in a tree-like structure. Examples include Agglomerative clustering and Divisive clustering.
Connectivity-based clustering: Each cluster is represented by a “center” and is constructed by minimizing a distance metric between the cluster objects and the center, e.g., K-Means clustering, medoid-based clustering, etc.
Each of these methods can be implemented using either a deterministic approach or a probabilistic, model-based framework. Both approaches have their own advantages and disadvantages, and their popularity largely depends on the application domain (e.g., social sciences, marketing, genetics, infectious disease hotspots, astronomy, etc.) as well as the ease of interpretability within that domain. Recent developments in clustering include density-based approaches (Ester et al., 1996), graph-based/spectral clustering (Shi & Malik, 2000), convex clustering (Chi & Lange, 2015; Radchenko & Mukherjee, 2017; Tan & Witten, 2015), etc. In this article, we focus on the connectivity-based clustering mechanism, especially K-Means clustering, which depends upon the choice of the $K$ cluster centers and a distance metric. For continuous data, the Euclidean norm is the most popular, albeit other distance measures (e.g., the $L_1$ norm) have also been explored; each norm imposes a different probability distribution in case one wants to exploit the duality between loss and probability distribution. K-Means with the Euclidean norm is not only simple to implement and interpret but also has a strong connection with Gaussian mixture models. For a fixed cluster number $K$, theoretical results exist showing the consistency of K-Means under various regularity conditions (MacQueen, 1967; Pollard, 1981). Nevertheless, a fundamental problem with this type of clustering is the choice of the cluster number, as $K$ is often unknown. In this situation there is a tendency of mixture models to overfit on real data (Nguyen, 2013). Several works have addressed this issue through various criteria for choosing the optimal cluster number, including “model-free” approaches. Fraley and Raftery (1998) addressed the problem of cluster choice for model-based clustering via the BIC criterion, Sugar and James (2003) optimized the cluster number of K-Means through a distortion-based measure, Fang and Wang (2012) introduced a clustering-stability-based idea for choosing the optimal cluster number, and de Amorim and Hennig (2015) utilized feature rescaling to achieve optimal clusters in K-Means. In this paper, we make four key contributions. First, we theoretically prove that when $K$ is unknown, the K-Means algorithm minimizes the underlying loss function at $K = n$, where $n$ is the number of data points. This represents the most complex model possible and ultimately defeats the purpose of clustering. Second, to address this overestimation issue, we propose a penalty function on the number of clusters, which is minimized when $K = 1$, thus encouraging the simplest model. Third, we combine the loss and penalty functions to develop an augmented, penalized version of K-Means, and we study its properties under various tuning parameter selection criteria. Fourth, we address another computational limitation of standard K-Means: its non-recursive optimization. Specifically, for each value of $K$, K-Means restarts the optimization from scratch, without leveraging the solution from the previous iteration. This leads to non-monotonicity in the clustering solution with respect to $K$, as the algorithm explores multiple potentially suitable values. Our modified K-Means algorithm improves on this by reusing the solution from the previous step, allowing each optimization for a given $K$ to build on the result from the $(K-1)$-th stage of clustering. This not only accelerates computation but also makes the method scalable and manageable for truly massive datasets (as in our example).
The rest of the article is organized as follows. In Section 2, we discuss some issues with standard K-Means clustering and propose remedies to address these problems. Section 3 presents modifications to the K-Means algorithm for the automatic detection of the number of clusters in a dataset. The optimization methodology and the method for tuning parameter selection are elaborated in Sections 4 and 5, respectively. Simulation studies and applications to a real data problem are provided in Sections 6 and 7, respectively. Finally, Section 8 offers concluding discussion with future directions. For the sake of brevity, all proofs have been provided in the supplementary material.
Properties of K-Means Clustering
At first, we concentrate on some convergence properties of the standard K-Means clustering problem. Assume the data at hand is an $n \times p$ matrix $X$, whose rows $x_1, \ldots, x_n$ are $p$-dimensional vectors. In any clustering mechanism, we partition the data matrix into $K$ possible clusters $\{C_1, \ldots, C_K\}$, where each $C_k$ is an assignment set. For center-based clustering, a data point $x_i$ is part of cluster $C_k$ if
$$\| x_i - m_k \| \le \| x_i - m_{k'} \| \quad \text{for all } k' \ne k,$$
where $\| \cdot \|$ is the Euclidean norm and $m_k$ is the center for cluster $C_k$. In particular, in K-Means clustering the main idea is to choose $K$ centers and update them in such a fashion that the sum of the distances of all points from their respective cluster centers is minimized. Therefore, the optimization problem for K-Means solves
$$\min_{\{C_k\},\, M} \; L(K) = \sum_{k=1}^{K} \sum_{x_i \in C_k} \| x_i - m_k \|^2, \tag{1}$$
with $\sum_{k=1}^{K} |C_k| = n$, where $|\cdot|$ represents the cardinality of a set and $M = (m_1, \ldots, m_K)^\top$ is the matrix of cluster centers. Theoretically $1 \le K \le n$; however, in most practical situations $K \ll n$ and an upper bound is chosen as the maximum plausible cluster number. It is well known that the above optimization works well for a known, fixed $K$ (MacQueen, 1967; Pollard, 1981). However, if the value of the actual cluster number is unknown and treated as a parameter with $1 \le K \le n$, the above loss function will yield a trivial minimum at $K = n$, i.e., each observation is its own cluster center and we will have $L(n) = 0$, which is the lowest possible value of the K-Means objective function. The same is true for a slightly more generalized K-Means loss function,
$$L_d(K) = \sum_{k=1}^{K} \sum_{x_i \in C_k} d(x_i - m_k),$$
where the distance function $d(\cdot)$ is any norm with the following properties:
$d(x) \ge 0$ and attains 0 only if $x = 0$ (non-negativity property),
$d(a x) = |a| \, d(x)$ for any scalar $a$ (homogeneity property),
$d(x + y) \le d(x) + d(y)$ (triangle inequality).
Note that the Euclidean norm is a special case; in fact, any $L_q$ norm with $q \ge 1$ also satisfies these conditions. The aforementioned properties of the distance function are the necessary foundations for the proofs of the following propositions and theorems. Next, we propose a general result on a distance norm satisfying the aforementioned properties, for any three $p$-dimensional points.
For three points $x, y, z \in \mathbb{R}^p$, we have:
If $y$ has the convex form $y = \alpha x + (1 - \alpha) z$ for some $\alpha \in [0, 1]$, then $d(x - y) + d(y - z) = d(x - z)$.
For all other $y$, $d(x - y) + d(y - z) \ge d(x - z)$.
The proof is immediate from the homogeneity and triangle inequality properties of the distance function under any norm.
The proposition can be extended to any convex set containing a set of points $x_1, \ldots, x_n$. In the next lemma, we propose a result on the optimization over any convex set.
For a convex set $C$, the optimizer $m^{*} = \arg\min_{m} \sum_{x_i \in C} d(x_i - m)$ over all points $x_i \in C$ will be unique.
Lemma 1 states that, within a convex set, the point that minimizes the sum of distances to all other points in the set is unique. Leveraging this result, one can establish the uniqueness of the solution to a given loss function under the assumption of convexity. Together, this lemma and the preceding proposition provide the foundational tools necessary to prove the next two theorems, demonstrating that the solution to the optimization problem under consideration in K-Means (in equation (1)) is indeed unique when the convexity condition is satisfied. It is to be noted that if $m_k$ is the cluster center of cluster $C_k$, then we consider the center as $m_k = \sum_{x_i \in C_k} \alpha_i x_i$ with $\alpha_i \ge 0$ and $\sum_i \alpha_i = 1$, i.e., each cluster center can be represented as a convex combination of the points in the cluster. The characteristics of the solution to the K-Means optimization described in equation (1) with unknown $K$ are described in the next theorem.
The loss function $L(K)$ is a monotone decreasing function in $K$ and reaches its minimum at $K = n$, for any norm $d$ satisfying the above properties.
Theorem 1 shows that unless we specify the cluster number of K-Means, the underlying optimization problem denoted by equation (1) will always encourage a more complex model with a high number of clusters. In other words, given $M_K$ as a cluster center matrix, if we introduce another cluster center by optimizing equation (1) to get the cluster center matrix $M_{K+1}$, then $L(K+1) \le L(K)$ is guaranteed to hold. Thus, naive K-Means for an unknown number of clusters is an ill-posed problem, which necessitates a penalty term to guard against it.
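As a quick empirical illustration of Theorem 1 (a sketch of ours, not part of the original development), the following R snippet fits standard K-Means for an increasing number of clusters and checks that the within-cluster sum of squares decreases monotonically, reaching 0 at $K = n$.

```r
# Empirical check: the K-Means objective (total within-cluster sum of
# squares) decreases monotonically in K and vanishes at K = n.
set.seed(1)
X <- matrix(rnorm(200 * 2), ncol = 2)   # n = 200 points in p = 2 dimensions
wss <- sapply(1:25, function(K) kmeans(X, centers = K, nstart = 20)$tot.withinss)
all(diff(wss) <= 0)                     # TRUE, up to local-optimum noise
```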
It is evident from Theorem 1 that K-Means with unknown $K$ will always encourage a large number of clusters, and the optimization function in equation (1) needs to be regularized to prevent overfitting. The penalty term chosen for this work is
$$P(K) = \sum_{1 \le k < k' \le \bar{K}} \mathbb{1}(|C_k| > 0) \, \mathbb{1}(|C_{k'}| > 0) \; d(m_k - m_{k'}), \tag{2}$$
where $\bar{K}$ is a pre-specified upper bound on the number of clusters. Since the contribution of the penalty term comes only from the clusters that are alive, i.e., from pairs for which both $\mathbb{1}(|C_k| > 0)$ and $\mathbb{1}(|C_{k'}| > 0)$ equal 1, we can reduce the penalty to
$$P(K) = \sum_{1 \le k < k' \le K} d(m_k - m_{k'}),$$
where the sum runs over the $K$ alive clusters.
It is to be noted here that $P(K)$ has a trivial minimum at $K = 1$, since all the observations fall within a single cluster, and it takes higher values as $K$ grows. In a sense, it behaves exactly opposite to the loss function in equation (1). This type of penalty function using the Gaussian norm was first considered by our group in Ghosh and Dey (2009). Later, Hocking et al. (2011) considered a convex fusion penalty and Lindsten et al. (2011) considered a sum-of-norms penalized version of it. Though intuitively this makes sense, none of these developments provide a direct proof of the properties of the considered penalty. In Theorem 2 we study the behavior of the penalty as a function of the cluster number $K$.
The penalty $P(K)$ is a monotone increasing function in $K$ and attains its maximum at $K = n$, under the assumption that each cluster center $m_k$ lies in the convex hull of the cluster $C_k$.
Theorem 2 states that when the number of clusters varies, the proposed penalty, when minimized, will always favor the simplest model, i.e., one in which all points belong to a single cluster. Furthermore, as we increase the number of clusters from $K$ to $K + 1$ in the same $p$-dimensional feature space, the corresponding penalty values satisfy $P(K) \le P(K+1)$, and the penalty reaches its maximum when $K = n$. This behavior of the penalty term, when combined with the loss function in equation (1), provides the necessary balance for selecting the optimal value of $K$.
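A companion sketch for Theorem 2, under our reading of equation (2) as the sum of pairwise distances between the alive cluster centers; the penalty is 0 at $K = 1$ and grows as $K$ increases.

```r
# Empirical check of Theorem 2, assuming the penalty is the sum of pairwise
# Euclidean distances between the alive cluster centers.
set.seed(1)
X <- matrix(rnorm(200 * 2), ncol = 2)
pen <- sapply(1:25, function(K) {
  M <- kmeans(X, centers = K, nstart = 20)$centers
  if (K == 1) 0 else sum(dist(M))   # zero when all points share one center
})
all(diff(pen) >= 0)                 # increases with K, up to local-optimum noise
```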
In the next section, we describe an augmented version of K-Means clustering which combines the loss (in equation (1)) and penalty (in equation (2)) in a regularized framework. The monotone nature of both terms ensures parsimonious model selection with an appropriately chosen regularization parameter.
Reformulation of K-Means Clustering
The augmented version of the K-Means problem solves the following optimization problem:
$$\min_{\{C_k\},\, M,\, K} \;\; \sum_{k=1}^{K} \sum_{x_i \in C_k} d(x_i - m_k) \; + \; \lambda \sum_{1 \le k < k' \le K} d(m_k - m_{k'}). \tag{3}$$
The first part of the optimization governs the assignment of data points to clusters, while the second part constrains the number of cluster centers. Thus, with an appropriate choice of the tuning parameter $\lambda$, the augmented K-Means simultaneously optimizes both the cluster centers and the number of clusters. Next, we reformulate the optimization problem in equation (3) using matrix notation, where $Z = ((z_{ik}))$ denotes the $n \times K$ cluster assignment matrix. Each element $z_{ik}$ of $Z$ denotes the cluster assignment of observation $i$ (with $z_{ik} = 1$ if observation $i$ belongs to cluster $k$ and 0 otherwise), whereas $M$ is the $K \times p$ matrix of cluster centers. Thus, equation (3) can be reformulated as
$$\min_{Z,\, M} \;\; \sum_{i=1}^{n} d\big(x_i - M^\top Z_{i\cdot}\big) \; + \; \lambda \sum_{1 \le k < k' \le K} d(m_k - m_{k'}), \tag{4}$$
where $Z_{i\cdot}$ denotes the $i$-th row of $Z$. Note that the optimization problem stated above is fairly general and should work for any choice of distance norm, though for the K-Means algorithm with continuous features the natural choice is the Euclidean norm or its squared version. It should also be noted that different non-concave norms can be chosen separately for the loss and penalty terms, as is done in the regression context for LASSO-type problems (Bach et al., 2012) and in clustering via a regression-based approach (Wu et al., 2016). Albeit the different norms in the loss and penalty will lead to differential contraction (Grossmann & Winkler, 2013), which needs to be studied further theoretically; this may also be dictated by the nature of the feature space (e.g., discrete, continuous, Boolean, etc.) on which clustering is performed. For this work, we stick to the squared error loss (the square of the Euclidean norm) due to its acceptability and ease of computation. Under that choice, we have the following optimization problem:
$$\min_{Z,\, M} \;\; \sum_{i=1}^{n} \big\| x_i - M^\top Z_{i\cdot} \big\|^2 \; + \; \lambda \sum_{1 \le k < k' \le K} \| m_k - m_{k'} \|^2. \tag{5}$$
The loss function in the above equation can be rewritten as $\| X - ZM \|_F^2$, where $\| \cdot \|_F$ denotes the Frobenius norm, whereas it can be easily shown that the difference between cluster centers can be written as $m_k - m_{k'} = M^\top (e_k - e_{k'})$, where $e_k$ is a $K$-dimensional vector of zeros with 1 at the $k$-th position. Thus, our optimization problem of equation (5) boils down to
$$\min_{Z,\, M} \;\; \| X - ZM \|_F^2 \; + \; \lambda \sum_{1 \le k < k' \le K} \big\| M^\top (e_k - e_{k'}) \big\|^2. \tag{6}$$
Note that we still need to pre-specify a value of $\bar{K}$ (as an upper limit) for the optimization in equation (6). Given this value, our algorithm is capable of selecting the optimal number of clusters within the range $1 \le K \le \bar{K}$. The most liberal choice of $\bar{K}$ is the total number of samples, i.e., $\bar{K} = n$. However, this choice comes with a potential drawback: longer run-times, as the computational complexity of K-Means is proportional to the upper limit of the possible number of clusters. In most practical scenarios $K \ll n$, and thus the runtime can be significantly reduced by choosing a $\bar{K}$ that is not too large.
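For concreteness, the following is a minimal R sketch of the criterion in equation (6), assuming the squared Frobenius loss plus $\lambda$ times the sum of squared pairwise distances between centers; the function name and interface are ours, not the paper's.

```r
# Evaluate the penalized K-Means objective of equation (6):
# ||X - ZM||_F^2 + lambda * sum of squared pairwise center distances.
penalized_objective <- function(X, cl, M, lambda) {
  Z <- diag(nrow(M))[cl, , drop = FALSE]  # n x K assignment (indicator) matrix
  loss <- sum((X - Z %*% M)^2)            # squared Frobenius loss
  pen  <- if (nrow(M) > 1) sum(dist(M)^2) else 0
  loss + lambda * pen
}
```

For a `kmeans()` fit `fit`, the criterion can be evaluated as `penalized_objective(X, fit$cluster, fit$centers, lambda)`.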
Optimization of Augmented K-Means Clustering
In this section, we discuss the optimization of the penalized K-Means problem defined in equation (6) and provide an algorithm to determine the optimal number of clusters for a given tuning parameter $\lambda$. To obtain the optimized cluster centers, we differentiate equation (6) with respect to $M$, equate it to 0, and obtain
$$-2 Z^\top (X - ZM) + 2\lambda \sum_{1 \le k < k' \le K} (e_k - e_{k'})(e_k - e_{k'})^\top M = 0. \tag{7}$$
More simplification of equation (7) is possible through some algebraic manipulation. Following Chi and Lange (2015), for the pairwise-difference term we can obtain
$$\sum_{1 \le k < k' \le K} (e_k - e_{k'})(e_k - e_{k'})^\top = K I_K - \mathbf{1}\mathbf{1}^\top, \tag{8}$$
where $\mathbf{1}$ is a length-$K$ vector of ones and $I_K$ is the $K \times K$ identity matrix. Moreover, the matrix on the right-hand side of equation (8) is symmetric. Therefore, from equations (7) and (8), the solution for the estimated cluster centers for a fixed number of clusters (in this case, $K$) is
$$\hat{M} = \big( Z^\top Z + \lambda (K I_K - \mathbf{1}\mathbf{1}^\top) \big)^{-1} Z^\top X. \tag{9}$$
We have also derived an unbiased estimator of the degrees of freedom for the fitted model (see Appendix 4), which can be used to compute any goodness-of-fit statistic. Note that for a fixed number of clusters, the optimized cluster centers can be estimated directly. However, the problem at hand involves simultaneous estimation of both the number of clusters and the cluster centers. To address this, we propose Algorithm 1, which performs a grid search to jointly optimize the cluster number $K$ and the estimated cluster centers $\hat{M}$. Using Algorithm 1, one can obtain appropriate cluster assignments for all data points. Note that the algorithm operates for a fixed value of the tuning parameter $\lambda$. Since the optimization problem includes a penalty on the cluster centers, the value of $\lambda$ influences both the number of clusters and the resulting cluster centers. As with any regularization-based approach, selecting an appropriate tuning parameter is crucial in penalized K-Means. It significantly impacts not only the quality of the resulting clustering but also the convergence behavior of the algorithm. In the following section, we describe the criteria used for selecting the tuning parameter.
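The following R sketch conveys the spirit of Algorithm 1 for a fixed $\lambda$, with `kmeans()` standing in as the inner solver for each candidate $K$; the paper's warm-started recursion over $K$ is replaced here by independent restarts for brevity.

```r
# Grid search over K for a fixed lambda, reusing penalized_objective()
# from the earlier sketch: fit each candidate K and keep the minimizer.
augmented_kmeans <- function(X, lambda, K_bar = 20) {
  fits <- lapply(1:K_bar, function(K) kmeans(X, centers = K, nstart = 20))
  obj  <- sapply(1:K_bar, function(K)
    penalized_objective(X, fits[[K]]$cluster, fits[[K]]$centers, lambda))
  K_hat <- which.min(obj)
  list(K = K_hat, fit = fits[[K_hat]], objective = obj)
}
```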
Selection of Tuning Parameter
The estimated number of clusters and the corresponding cluster assignments in penalized K-Means clustering depend on the value of the non-negative tuning parameter $\lambda$. Smaller values of $\lambda$ tend to yield more clusters, while larger values encourage fewer clusters, thereby controlling the trade-off between model fit and model complexity. To appropriately tune $\lambda$, we adopt a clustering stability-based approach, as proposed in Fang and Wang (2012), Sun et al. (2012), and Wang et al. (2016), for the penalized K-Means framework. A cluster assignment is considered stable if perturbations in the data (e.g., through subsampling) result in minimal changes to the clustering outcome. The stability measure utilized in Sun et al. (2012) and Wang et al. (2016) tunes both the number of clusters and the Lagrangian tuning parameter $\lambda$. However, since the penalized K-Means algorithm provides the optimal number of clusters for fixed $\lambda$, we adapt the stability-based selection procedure to optimize only the tuning parameter $\lambda$. The modified stability selection mechanism for identifying the most appropriate value of $\lambda$, along with the corresponding $\hat{K}$, for a given dataset is described next.
Suppose that we have run Algorithm 1 on a set of tuning parameter values $\lambda_1, \ldots, \lambda_J$ to obtain the corresponding cluster assignments $\hat{\psi}_1, \ldots, \hat{\psi}_J$ with estimated numbers of clusters $\hat{K}_1, \ldots, \hat{K}_J$, respectively. If available, let us denote the original assignments or true labels as $\psi_0$. Draw $B$ bootstrap samples without replacement $X^{(1)}, \ldots, X^{(B)}$, each with $m$ observations, from the original data $X$. Recall that for fixed $\lambda_j$ we have the optimized assignment $\hat{\psi}_j$ with $\hat{K}_j$ clusters. Therefore, for each $j$ we run Algorithm 1 on each bootstrap sample such that we obtain exactly $\hat{K}_j$ clusters with assignments $\hat{\psi}_j^{(b)}$, where $b$ denotes the bootstrap sample. We define two types of stability measures.
If the true labels are available, then we define the true stability as the average agreement between the true labels and the predicted assignments in the bootstrap samples,
$$S_T(\lambda_j) = \frac{1}{B} \sum_{b=1}^{B} \rho\big(\psi_0^{(b)}, \hat{\psi}_j^{(b)}\big),$$
where $\rho(\cdot, \cdot)$ is some concordance measure between the cluster assignments and $\psi_0^{(b)}$ denotes the true labels restricted to the $b$-th bootstrap sample.
For data without any class or label information, the predicted stability measure is defined as the average concordance between the predicted cluster assignment $\hat{\psi}_j$ and the corresponding cluster assignments on the bootstrap samples $\hat{\psi}_j^{(b)}$, evaluated using a chosen measure of association $\rho$,
$$S_P(\lambda_j) = \frac{1}{B} \sum_{b=1}^{B} \rho\big(\hat{\psi}_j^{(b)}, \hat{\psi}_j\big).$$
A common choice for the association/concordance measure $\rho$ is the Rand Index (Rand, 1971). The stability index defined in Sun et al. (2012) can be written in terms of the Rand Index, which measures the proportion of pairs of observations on which two cluster assignments $\psi$ and $\psi'$ agree. Several drawbacks of the Rand Index as a similarity measure have been identified. Morey and Agresti (1984) and Fowlkes and Mallows (1983) showed the high dependence of the Rand Index on the number of clusters and that it converges to 1 as the cluster number increases, which we have also observed. Moreover, the Rand Index also suffers from association by random chance. Thus it yields high values for certain random clusterings that do not coincide with the true labeling of the data. Nonetheless, the concept of concordance captured by the Rand Index remains valuable, particularly for detecting the number of clusters. To address these limitations, we adopt the Adjusted Rand Index (ARI; see Halkidi et al., 2002; Hubert & Arabie, 1985) as the measure of association $\rho$. The ARI is a standardized version of the Rand Index that corrects for both random chance and the inflation caused by higher numbers of clusters. The ARI value is equal or close to 1 only for homogeneous assignments on the bootstrap samples and the true/predicted assignments, and close to 0 for a random partition. The best value of the tuning parameter and the corresponding cluster number are chosen as
$$\hat{\lambda} = \arg\max_{\lambda_j} S_P(\lambda_j),$$
with $\hat{K}$ the cluster number corresponding to $\hat{\lambda}$.
If multiple values of $\lambda$ produce the optimal stability, then we choose the best $\hat{\lambda}$ and $\hat{K}$ in the following manner. Let $\lambda_{j_1}, \ldots, \lambda_{j_r}$ be the values corresponding to the optimal stability, with cluster numbers $\hat{K}_{j_1}, \ldots, \hat{K}_{j_r}$, respectively. Then we take the optimal cluster number as $\hat{K} = \min\{\hat{K}_{j_1}, \ldots, \hat{K}_{j_r}\}$ and the estimate of the tuning parameter as $\hat{\lambda} = \min\{\lambda_{j_l} : \hat{K}_{j_l} = \hat{K}\}$.
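A sketch of the stability-based tuning described above, using the predicted stability with the ARI as implemented in `mclust::adjustedRandIndex`; the subsample fraction and the number of bootstrap draws are illustrative choices of ours.

```r
# Stability-based choice of lambda: refit on bootstrap subsamples with the
# same estimated K and average the ARI against the full-data assignment.
library(mclust)                          # provides adjustedRandIndex()
select_lambda <- function(X, lambdas, B = 25, frac = 0.8) {
  stability <- sapply(lambdas, function(lam) {
    full <- augmented_kmeans(X, lam)     # sketch from the previous section
    mean(replicate(B, {
      idx  <- sample(nrow(X), floor(frac * nrow(X)))  # subsample w/o replacement
      boot <- kmeans(X[idx, ], centers = full$K, nstart = 20)
      adjustedRandIndex(full$fit$cluster[idx], boot$cluster)
    }))
  })
  lambdas[which.max(stability)]          # ties broken as described above
}
```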
Simulation Study
The performance of the augmented K-Means clustering algorithm is first evaluated on several simulated benchmark datasets. Four separate benchmark datasets are selected for testing purposes, as detailed below:
Three Gaussian clusters (Tan & Witten, 2015) are generated, with the data matrix drawn from a multivariate normal distribution and a distinct mean vector assigned to each of the three clusters. The points are equally distributed among the three clusters such that each point corresponds to a unique cluster (a generating sketch is given after this list).
The second simulated setup is the four corners data from Karami and Johansson (2014). The points are chosen in such a way that each belongs to a specific corner.
The outlier data (Karami & Johansson, 2014) are also considered in the simulation study. The data consist of two large half-circular regions and two smaller outlier regions.
The final benchmark dataset used in the simulation study is the crescent full moon dataset (Karami & Johansson, 2014), which consists of 500 data points arranged in a circular region and another 500 points forming a half-moon shape.
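As referenced in the first item above, here is a generating sketch for the three-Gaussian benchmark; the cluster sizes and mean vectors are illustrative placeholders, not the exact values of Tan and Witten (2015).

```r
# Three well-separated Gaussian clusters in two dimensions, equal sizes.
set.seed(2)
n_per <- 100                             # points per cluster (placeholder)
mu <- list(c(0, 0), c(4, 4), c(-4, 4))   # hypothetical mean vectors
X <- do.call(rbind, lapply(mu, function(m)
  cbind(rnorm(n_per, m[1]), rnorm(n_per, m[2]))))
labels <- rep(1:3, each = n_per)         # true cluster memberships
```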
On each of the datasets, we optimized the penalized K-Means algorithm on a grid of $\lambda$ values. For each $\lambda$, the algorithm is optimized to obtain the best cluster number $\hat{K}$. The optimal $\lambda$ and the corresponding cluster number are chosen via the stability measure mentioned above. Figure 1 illustrates the original cluster assignments alongside the best predicted cluster assignments for the four simulated examples. The first row of the panel corresponds to the Gaussian clusters, the second row to the corners data, the third row to the outlier data, and the fourth row to the crescent full moon dataset. The proposed augmented K-Means clustering method accurately detects both the number and the locations of clusters in all cases, with the exception of the crescent full moon data. In that case, the method partitions the crescent moon into two separate clusters. This result is not unexpected, as a key assumption in augmented K-Means is that the cluster center should lie within the convex hull of the points in that cluster. This assumption is violated in non-convex structures like the crescent moon. Thus, for the non-convex shape of the crescent moon, the best possible clustering while adhering to the convex center construction constraint is shown in the last plot of the right panel in Figure 1. It is worth noting that with an appropriate choice of a loss function that accounts for non-convex cluster structures, it is possible to modify the optimization problem presented in equation (3). However, such a modification would require further methodological developments to effectively solve the resulting optimization problem, which is beyond the scope of the current paper.
Figure 1. True Cluster Assignment (Left Panel) vs. Predicted Cluster Assignments (Right Panel). The X and Y Axes Represent the Two Dimensions Used for Data Generation and Clustering.
The next panel of plots, in Figure 2, demonstrates the performance of the augmented K-Means clustering algorithm for different values of the tuning parameter $\lambda$. In all the plots, we kept the values of $\lambda$ along the x-axis and the versions of the ARI along the y-axis. The plots are arranged in the same order of benchmark datasets as Figure 1. The plots in the left panel show the predicted ARI (red dashed line) and the bootstrap version of the ARI (solid black line) calculated with respect to the true response values. The close resemblance between the two curves is clearly visible, showing the stability of the proposed measure for identifying the best tuning parameter. The preferred choice of clusters is the one corresponding to the chosen tuning parameter. We see several ties of $\lambda$ values (each yielding an ARI of 1) in the first three plots in the left panel of Figure 2. In those cases, the best choice of $\lambda$ is determined as mentioned in Section 5, namely the minimum value yielding an ARI of one. This is not true for the fourth dataset, the crescent full moon, where the ARI never reaches unity, as the best clustering detected by our algorithm is not an exact match with the truth. The plots in the right panel show the predicted ARI (solid line) along with its confidence bound with respect to the predicted clusters for each value of $\lambda$. An almost identical pattern to the true ARI (in the left panel) can be noticed for the predicted ARI. A similar idea is implemented to break the ties and choose the best value of the tuning parameter and the corresponding optimal cluster assignments. The plots in the right panel of Figure 1 showing the cluster assignments correspond to the optimal choice of $\lambda$ chosen through the predicted ARI. Note that for real data, we do not observe any response/outcome; calculating the true ARI is therefore infeasible, and we resort to the predicted ARI to optimize the tuning parameter for the proposed augmented K-Means algorithm.
Figure 2. True and Bootstrap Versions of the Adjusted Rand Index (Left Panel) and the Predicted Adjusted Rand Index (Right Panel) Along the Y-axis, for Different Values of $\lambda$ Along the X-axis.
Application to Globular Cluster Detection
Clustering is a very common problem with many application domains, and as such the algorithm proposed in this article is fairly general. We choose the field of astronomy, as the detection of star clusters and galaxies is a very important and challenging problem with unknown cluster numbers. This field often produces a single image large enough that it cannot be loaded into the memory of a personal computer, resulting in truly Big Data (Estévez, 2016; Loredo et al., 2009; Mickaelian et al., 2016). For our purpose, we considered publicly available NASA Hubble telescope images. These galaxies or star clusters are globular in nature and thus can be categorized into convex clusters via a center-based clustering mechanism. We intend to test the performance of the augmented K-Means algorithm in detecting the number and location of globular clusters present in an image of galaxies and star clusters. Two separate images with distinct characteristics are considered for this work, downloaded from NASA (2015) and NASA (2017), respectively. Figure 3 shows the original images, with many possible galaxy and star clusters, from the Hubble deep space image (left) and the two sister galaxies image (right). Some pre-processing was performed on the images to remove noise before running the clustering mechanism. The image pre-processing steps are as follows:
First, we converted the color image to gray-scale. This is done since otherwise the color temperature at each pixel poses as a separate dimension.
The images are then processed to remove background noise and detect pixel values (locations and intensities).
Pixels are segmented, and those with intensities above the chosen percentile are considered for the analysis; the rest are ignored.
The image data are arranged into a final data frame with pixel locations and corresponding gray-scale intensities.
Note that other pixel intensity percentiles can be considered, but we stick with our choice as it provided the best noise reduction for the images. We utilized the R (R Core Team, 2015) package “imager” (Simon, 2017) for the image pre-processing. After pre-processing, the denoised images are shown in Figure 4, where the colored space marks the pixels considered for the clustering analysis. Before denoising, the Hubble deep space image contained a very large number of pixels and corresponding intensity values, which were substantially reduced after the pre-processing; similarly, the two sister galaxies image was scaled down considerably for the final analysis. Analogous to the simulation study, we optimized the augmented K-Means algorithm for each tuning parameter value to obtain the predicted clusters. We chose a fine grid of $\lambda$ values ranging from 0 to 10 for our analysis. The maximum numbers of possible clusters are taken as 100 and 10, respectively, for the two datasets. The model optimization is computationally expensive and requires large memory for both data and storage; thus, it is not possible to run on normal desktops or laptops, even with parallel frameworks. We utilized the grid computing nodes of Wayne State University to run the optimization algorithm and the final prediction.
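A sketch of the pre-processing pipeline with the imager package; the file name and the percentile cut-off below are placeholders, since the exact threshold used in our analysis is not reproduced here.

```r
# Gray-scale conversion, pixel extraction, and intensity thresholding.
library(imager)
img  <- load.image("hubble_deep_field.png")  # hypothetical file name
gray <- grayscale(img)                       # drop the color channels
px   <- as.data.frame(gray)                  # columns: x, y, value (intensity)
cut  <- quantile(px$value, probs = 0.80)     # placeholder percentile cut-off
px   <- px[px$value > cut, ]                 # keep only the bright pixels
```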
Figure 3. Lots of Galaxies Image (Left) and Two Galaxies Image (Right).
Figure 4. Image After Pre-processing for the Galaxy Cluster Data (Left Panel) and the Two Sister Galaxies (Right Panel).
Figure 5. Predicted Adjusted Rand Index (First Row) and the Estimated Clusters (Second Row) for the Galaxy Cluster Data (Left Panel) and the Two Sister Galaxies (Right Panel), with the Predicted Tuning Parameter (Red Dashed Line).
Figure 6. Predicted Clusters and Corresponding Assignments for the Galaxy Cluster Data with 11 Clusters (Left) and the Two Sister Galaxies with 2 Clusters (Right). The Line Within Each Figure Indicates the Outer Shape of Each Cluster.
The next panel of plots, in Figure 4, shows the images after the pre-processing and thresholding. In Figure 5, we have plotted the predicted ARI values and the estimated clusters for both datasets. The optimal value of the tuning parameter is presented as the red dashed line. For all the plots, we kept the values of $\lambda$ along the x-axis, with the predicted ARI (first row of the panel) and the predicted cluster number (second row of the panel) along the y-axis. It can be observed that the optimal cluster number for the Hubble deep space image is 11, whereas 2 clusters are observed for the two sister galaxies. Finally, in Figure 6, we have shown the best cluster assignments among the different galaxies. A point to note is that the images (as drawn) are 2D representations of the original 3D space (consisting of the pixel x-coordinate, pixel y-coordinate, and intensity). Moreover, the optimization of the clusters is done in the transformed space rather than the original space. If we revert to the original space, the assignments are better, but the pictorial demonstration is not optimal.
Conclusion and Future Direction
Despite its limitations and computational challenges when applied to large, heterogeneous datasets, K-Means clustering remains one of the oldest and most widely used clustering algorithms. However, traditional K-Means is prone to overfitting and often converges to local minima. In this work, we revisit the method with the aim of mathematically characterizing some of its limitations. To address these issues, we propose a penalized version of K-Means clustering that enables automated selection of the number of clusters, followed by cluster assignment. Our approach relies on the appropriate selection of a tuning parameter, which plays a crucial role in balancing model complexity and fit. There are several directions in which this work can be extended, and we outline a few of these as potential avenues for future research.
In typical cluster analysis, all available features are often assumed to be useful. However, in practice, not all features are equally informative—or even relevant—for the clustering task. This motivates the incorporation of feature selection prior to clustering, as discussed in Raftery and Dean (2006) and Witten and Tibshirani (2010). Effective feature selection can significantly influence the resulting cluster structure, including the estimated number of clusters. However, stepwise procedures—where features are selected first and the number of clusters is determined afterward—may be suboptimal, as errors introduced in the earlier step cannot be corrected later. Consequently, the simultaneous selection of relevant features and the optimal number of clusters remains a challenging and largely unresolved problem.
As noted throughout the paper, our method is based on K-means and is therefore best suited for convex clusters, where the notion of a cluster center is naturally meaningful. However, real-world data often exhibit non-convex or even irregularly shaped clusters, where this concept becomes less interpretable. In such cases, alternative formulations of the clustering objective are required. Pan et al. (2013) investigated clustering methods involving non-convex penalties, and more recently, Wu et al. (2016) proposed computationally efficient adaptations of such approaches. Nevertheless, both studies continued to rely on convex loss or distance functions. The estimation of the number of clusters under non-convex loss functions remains a challenging and largely unresolved problem.
Centroid-based clustering under non-convex metrics represents a compelling area for future research, posing both theoretical and computational challenges. Another promising direction involves clustering data that do not follow a continuous distribution, which demands the development of novel distance metrics. Identifying the number of clusters in such settings presents an intriguing and worthwhile avenue for further investigation.
Acknowledgments
The suggestions from the three referees and other members of the editorial team have greatly helped in polishing the paper.
ORCID iD
Samiran Ghosh
Funding
The author(s) received no financial support for the research, authorship, and/or publication of this article.
Declaration of Conflicting Interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Supplemental Material
Supplemental materials for this article are available online.
References
Chi, E. C., & Lange, K. (2015). Splitting methods for convex clustering. Journal of Computational and Graphical Statistics, 24(4), 994–1013.
de Amorim, R. C., & Hennig, C. (2015). Recovering the number of clusters in data sets with noise features using feature rescaling factors. Information Sciences, 324, 126–145.
Ester, M., Kriegel, H.-P., Sander, J., & Xu, X. (1996). A density-based algorithm for discovering clusters in large spatial databases with noise. In KDD (Vol. 96, pp. 226–231).
Estévez, P. A. (2016). Big data era challenges and opportunities in astronomy: How SOM/LVQ and related learning methods can contribute? In WSOM (p. 267).
Fang, Y., & Wang, J. (2012). Selection of the number of clusters via the bootstrap method. Computational Statistics & Data Analysis, 56(3), 468–477.
Fowlkes, E. B., & Mallows, C. L. (1983). A method for comparing two hierarchical clusterings. Journal of the American Statistical Association, 78(383), 553–569.
Fraley, C., & Raftery, A. E. (1998). How many clusters? Which clustering method? Answers via model-based cluster analysis. The Computer Journal, 41(8), 578–588.
Ghosh, S., & Dey, D. K. (2009). Model based penalized clustering for multivariate data. In Advances in multivariate statistical methods (pp. 53–71). World Scientific.
Grossmann, C., & Winkler, M. (2013). Mesh-independent convergence of penalty methods applied to optimal control with partial differential equations. Optimization, 62(5), 629–647.
Halkidi, M., Batistakis, Y., & Vazirgiannis, M. (2002). Cluster validity methods: Part I. ACM SIGMOD Record, 31(2), 40–45.
Hocking, T. D., Joulin, A., Bach, F., & Vert, J.-P. (2011). Clusterpath: An algorithm for clustering using convex fusion penalties.
Hubert, L., & Arabie, P. (1985). Comparing partitions. Journal of Classification, 2(1), 193–218.
Karami, A., & Johansson, R. (2014). Choosing DBSCAN parameters automatically using differential evolution. International Journal of Computer Applications, 91(7), 1–11.
Lindsten, F., Ohlsson, H., & Ljung, L. (2011). Clustering using sum-of-norms regularization: With application to particle filter output computation. In 2011 IEEE Statistical Signal Processing Workshop (SSP) (pp. 201–204). IEEE.
Loredo, T. J., Rice, J., & Stein, M. L. (2009). Introduction to papers on astrostatistics. Annals of Applied Statistics, 1, 1.
MacQueen, J. (1967). Some methods for classification and analysis of multivariate observations. In Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability (Vol. 1, pp. 281–297).
Mickaelian, A. M., Abrahamyan, H. V., Gyulzadyan, M. V., Mikayelyan, G. A., & Paronyan, G. M. (2016). Multi-wavelength studies of the statistical properties of active galaxies using big data. Proceedings of the International Astronomical Union, 12(S325), 32–38.
Morey, L. C., & Agresti, A. (1984). The measurement of classification agreement: An adjustment to the Rand statistic for chance agreement. Educational and Psychological Measurement, 44(1), 33–37.
Nguyen, X. (2013). Convergence of latent mixing measures in finite and infinite mixture models. The Annals of Statistics, 41(1), 370.
Pan, W., Shen, X., & Liu, B. (2013). Cluster analysis: Unsupervised learning via supervised learning with a non-convex penalty. The Journal of Machine Learning Research, 14(1), 1865–1889.
Pollard, D. (1981). Strong consistency of k-means clustering. The Annals of Statistics, 9(1), 135–140.
Radchenko, P., & Mukherjee, G. (2017). Convex clustering via l1 fusion penalization. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(5), 1527–1546.
Raftery, A. E., & Dean, N. (2006). Variable selection for model-based clustering. Journal of the American Statistical Association, 101(473), 168–178.
Rand, W. M. (1971). Objective criteria for the evaluation of clustering methods. Journal of the American Statistical Association, 66(336), 846–850.
R Core Team. (2015). R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria.
Shi, J., & Malik, J. (2000). Normalized cuts and image segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8), 888–905.
Simon, B. (2017). imager: Image processing library based on 'CImg'. R package version 0.40.2.
Sugar, C. A., & James, G. M. (2003). Finding the number of clusters in a dataset: An information-theoretic approach. Journal of the American Statistical Association, 98(463), 750–763.
Sun, W., Wang, J., & Fang, Y. (2012). Regularized k-means clustering of high-dimensional data and its asymptotic consistency. Electronic Journal of Statistics, 6, 148–167.
Tan, K. M., & Witten, D. (2015). Statistical properties of convex clustering. Electronic Journal of Statistics, 9(2), 2324.
Witten, D. M., & Tibshirani, R. (2010). A framework for feature selection in clustering. Journal of the American Statistical Association, 105(490), 713–726.
Wu, C., Kwon, S., Shen, X., & Pan, W. (2016). A new algorithm and theory for penalized regression-based clustering. The Journal of Machine Learning Research, 17(1), 6479–6503.