Abstract
Data complexity assessment is the starting point in numerous machine learning and data analysis tasks. Among data complexity measures, network-based ones are particularly significant because they focus on the inner structure of the data. However, they have several limitations: they cannot handle non-symmetric dissimilarity functions, they require fixed user-defined thresholds, and they provide only an overall assessment of the data structure rather than of different areas of the data space. Moreover, most data complexity measures do not address hybrid or multiclass data. In this paper, we propose 12 new network-based data complexity measures for assessing the complexity of multiclass, hybrid, and incomplete supervised data, addressing the aforementioned limitations. These measures have been integrated into the EPIC platform, a practical tool for data analysis, which makes their computation easy and accessible. Furthermore, we conducted a correlation analysis between the proposed measures and the performance of two supervised classifiers, the Nearest Neighbor (NN) classifier and the ACID classifier, yielding statistically significant results.
