Abstract
Encoding categorical data remains challenging for soft clustering, particularly under ambiguity, missingness, and noise. Popular schemes (One-Hot, Complex, and N-Soft) suffer from high dimensionality, weak frequency semantics, or limited interpretability. We propose C-Soft Set, a complex-valued encoding that preserves both category distinctiveness and distributional information: the magnitude (radius) reflects category frequency/centrality, while the phase separates categories within the same frequency group via equiangular spacing. We evaluate C-Soft against One-Hot, Complex (r = 1), and N-Soft on nine datasets: four UCI benchmarks (Adult, Breast Cancer, Car Evaluation, Credit Approval) and five real-world datasets (Customer Segmentation, Online Retail with ∼0.54 M rows, Foodpanda, Airlines Flights with ∼0.30 M rows, and Perfumes). Using Fuzzy C-Means (m = 2), we report Silhouette, Davies–Bouldin (DB), Xie–Beni (XB), Partition Entropy, runtime, and iteration count. Across the benchmarks and the large real-world datasets, C-Soft achieves competitive or superior separation and compactness (lower DB/XB, competitive Silhouette) while avoiding the dimensional blow-up of One-Hot and maintaining efficient runtimes. Robustness is confirmed under synthetic distortion (missing/ambiguous values) without imputation: performance degrades gracefully, and cluster compactness and separation are largely preserved. These results position C-Soft as a frequency-aware, phase-aware, and interpretable encoding for categorical data in soft clustering at both moderate and large scale.
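The radius/phase idea described in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact formulas, so the choices here are assumptions made only for illustration: the radius is taken to be the relative frequency of a category, a "frequency group" is taken to be the set of categories with the same count, and the phase of the j-th category in a group of size k is taken to be 2πj/k. The function name `c_soft_encode` is hypothetical, and missing or ambiguous values are not handled.

```python
import cmath
import math
from collections import Counter


def c_soft_encode(values):
    """Return a codebook mapping each category to r * exp(i * theta).

    Assumptions (not taken from the paper):
      - r is the relative frequency of the category;
      - categories with identical counts form one frequency group;
      - phases within a group are equally spaced on the circle.
    """
    counts = Counter(values)
    n = len(values)

    # Collect categories that share the same count into one group.
    groups = {}
    for category, count in counts.items():
        groups.setdefault(count, []).append(category)

    codebook = {}
    for count, categories in groups.items():
        radius = count / n          # assumed radius: relative frequency
        k = len(categories)
        # Sort for a deterministic phase assignment within the group.
        for j, category in enumerate(sorted(categories)):
            theta = 2 * math.pi * j / k   # equiangular spacing
            codebook[category] = cmath.rect(radius, theta)
    return codebook


values = ["red", "red", "blue", "blue", "green"]
codebook = c_soft_encode(values)
encoded = [codebook[v] for v in values]
```

Under these assumptions, "red" and "blue" share a radius of 0.4 and sit at opposite phases (π apart), while "green" lies alone on a smaller circle of radius 0.2. Each categorical feature therefore occupies one complex dimension (two real ones) regardless of its cardinality, which is the contrast with One-Hot that the abstract draws.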
