Network-based data complexity measures for multiclass classification of hybrid data and its correlation with supervised classifiers

Abstract

Data complexity assessment is the starting point for many machine learning and data analysis tasks. Among data complexity measures, network-based ones are particularly significant because they focus on the inner structure of the data. However, they have several limitations: they cannot handle non-symmetric dissimilarity functions, they require fixed user-defined thresholds, and they provide only an overall assessment of the data structure rather than of different areas of the data space. Moreover, most data complexity measures do not address hybrid or multiclass data. In this paper, we propose 12 new network-based data complexity measures for assessing the complexity of multiclass, hybrid, and incomplete supervised data, addressing the aforementioned limitations. These measures have been integrated into the EPIC platform, a practical tool for data analysis, which makes them easy to compute. Furthermore, we conducted a correlation analysis between the proposed measures and the performance of two supervised classifiers, the Nearest Neighbor (NN) classifier and the ACID classifier, yielding statistically significant results.
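To make the limitations named above concrete, the following is a minimal sketch of a conventional network-based density measure. It is not one of the 12 proposed measures. It assumes a Gower-style symmetric dissimilarity for hybrid data, a fixed user-defined threshold `epsilon` (the value 0.15 is an assumption), and a single global score. The function names `gower_like_dissimilarity` and `network_density_complexity` are hypothetical and chosen for illustration only.

```python
from itertools import combinations


def gower_like_dissimilarity(a, b, numeric_ranges):
    """Mean per-feature dissimilarity for hybrid data (assumed, Gower-style).

    numeric_ranges maps the index of each numeric feature to its value range;
    every other feature is treated as categorical. Missing values (None) are
    skipped, so incomplete instances can still be compared.
    """
    total, used = 0.0, 0
    for i, (x, y) in enumerate(zip(a, b)):
        if x is None or y is None:
            continue  # ignore features missing in either instance
        rng = numeric_ranges.get(i)
        if rng is not None:
            total += abs(x - y) / rng if rng > 0 else 0.0
        else:
            total += 0.0 if x == y else 1.0
        used += 1
    # With no comparable feature, fall back to maximal dissimilarity.
    return total / used if used else 1.0


def network_density_complexity(X, y, numeric_ranges, epsilon=0.15):
    """One global score in [0, 1]; higher values suggest a harder problem.

    Two instances are linked when they share a class label and their
    dissimilarity is below the fixed threshold epsilon. The score is
    1 minus the density of the resulting graph.
    """
    n = len(X)
    if n < 2:
        return 0.0
    edges = 0
    for i, j in combinations(range(n), 2):
        if y[i] != y[j]:
            continue  # edges between different classes are pruned
        if gower_like_dissimilarity(X[i], X[j], numeric_ranges) < epsilon:
            edges += 1
    return 1.0 - edges / (n * (n - 1) / 2)


if __name__ == "__main__":
    # Toy hybrid data: one numeric feature (index 0) and one categorical feature.
    X = [(0.0, "a"), (0.1, "a"), (1.0, "b"), (0.9, "b")]
    ranges = {0: 1.0}
    print(network_density_complexity(X, [0, 0, 1, 1], ranges))
    print(network_density_complexity(X, [0, 1, 0, 1], ranges))
```

The sketch shows why such a measure depends on a symmetric dissimilarity (each unordered pair is visited once) and on the choice of `epsilon`, and why it yields only one number for the whole dataset.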

Keywords

data complexity measures; supervised classification; multiclass data; hybrid data

