C-Soft Set: A Complex-Valued Encoding Model for Categorical Data

Abstract

Encoding categorical data remains challenging for soft clustering, particularly under ambiguity, missingness, and noise. Popular schemes (One-Hot, Complex, and N-Soft) suffer from high dimensionality, weak frequency semantics, or limited interpretability. We propose C-Soft Set, a complex-valued encoding that preserves both category distinctiveness and distributional information: the magnitude (radius) reflects category frequency/centrality, while the phase separates categories within the same frequency group via equiangular spacing. We evaluate C-Soft against One-Hot, Complex (r = 1), and N-Soft on nine datasets: four UCI benchmarks (Adult, Breast Cancer, Car Evaluation, Credit Approval) and five real-world datasets (Customer Segmentation, Online Retail with ∼0.54 M rows, Foodpanda, Airlines Flights with ∼0.30 M rows, Perfumes). Using Fuzzy C-Means (m = 2), we report Silhouette, Davies–Bouldin (DB), Xie–Beni (XB), Partition Entropy, runtime, and iteration count. Across the benchmarks and the large real-world datasets, C-Soft achieves competitive or superior separation and compactness (lower DB/XB, competitive Silhouette) while avoiding the dimensional blow-up of One-Hot and maintaining efficient runtimes. Robustness is confirmed under synthetic distortion (missing/ambiguous values) without imputation: performance degrades gracefully, and cluster compactness and separation are largely preserved. These results position C-Soft as a frequency-aware, phase-aware, and interpretable encoding for categorical data in soft clustering at both moderate and large scale.
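The radius/phase idea described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the exact formulas, so everything specific here is an assumption made for illustration: the hypothetical function `c_soft_encode`, the choice of relative frequency as the radius, the grouping of categories by identical counts, and the alphabetical ordering used to assign phases.

```python
import cmath
import math
from collections import Counter, defaultdict


def c_soft_encode(values):
    """Map each category to a complex number r * exp(i * theta).

    Assumptions (illustrative, not taken from the paper):
      - r is the category's relative frequency in `values`;
      - categories with identical counts form one frequency group and
        are placed at equal angles 2*pi*k/m on the circle of radius r.
    """
    counts = Counter(values)
    total = len(values)

    # Collect the categories that share the same count (same radius).
    groups = defaultdict(list)
    for category, count in counts.items():
        groups[count].append(category)

    codebook = {}
    for count, members in groups.items():
        radius = count / total
        m = len(members)
        # Sort names so the phase assignment is deterministic.
        for k, category in enumerate(sorted(members)):
            theta = 2 * math.pi * k / m
            codebook[category] = cmath.rect(radius, theta)
    return codebook


colors = ["red", "red", "red", "blue", "blue", "green", "green", "black"]
codes = c_soft_encode(colors)
encoded = [codes[c] for c in colors]
```

In this toy column, "blue" and "green" occur equally often, so they share a radius and are separated only by phase (0 and π), while "red" and "black" sit alone on their own radii. Each categorical feature becomes a single complex value (two real numbers) regardless of how many categories it has, which is the contrast with One-Hot that the abstract draws.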

Keywords

categorical encoding, soft clustering, complex number

