
Abstract

Understanding how a classification result is generated and what role individual features play in the classification is crucial in many applications and, in particular, in medical contexts such as the translation of diagnostic biomarkers into clinical practice. The goal is to find (ideally simple) relationships between the features in multi-dimensional data and the classification that explain the underlying phenomenon. Mathematical formulas allow for the expression of these relationships and can serve as classifiers. However, there are infinitely many mathematical formulas for the given features, and they bear an inherent trade-off between complexity and accuracy. We present an interactive visual approach that supports domain experts in mitigating this trade-off. Core to our approach is a novel feature selection method, from which formulas are composed using symbolic regression and where state-of-the-art classifiers serve as a reference. To evaluate our approach and compare the achieved classification performance to that of other state-of-the-art feature selection techniques, we test our methods on well-known machine learning data sets. Our evaluation shows that our feature selection method performs better than randomly selecting features for data sets with many features or when a low number of generations in the symbolic regression is required. In these tests, it also matches or outperforms state-of-the-art methods. In addition, we apply our approach in a case study to a hemodynamic cohort data set, where we report our findings and domain expert feedback. Our approach was able to find formulas containing features that are in agreement with the literature. We could also find formulas that performed better in the micro-averaged F1 score than established histological indices.

Keywords

Classification, feature space, formulas, multidimensional data, visual analysis

Introduction

Classification refers to assigning class labels to data items based on their features, where the features often form a multi-dimensional data set. Many classifiers have been proposed for binary or multi-class classification purposes.^1–6 They all come with advantages and limitations such that no single classifier can be considered the universal solution to all classification problems, but the choice of the classifier depends on the data and task at hand. While the performance of the classifier is generally of utmost importance, it is often similarly important to understand how the classifier comes to its decision. For example, for the translation of diagnostic biomarkers into clinical practice, it is crucial for the domain expert to understand the decision-making of a classifier. Ultimately, the domain expert must be in the position to explain the underlying phenomenon by a minimal set of features and their interplay. Mathematical formulas allow domain experts to formulate these relationships and can, theoretically, be derived from any state-of-the-art model. However, the complexity and readability of such formulas vary greatly depending on the model. For instance, deep neural networks combine all input features multiple times throughout many layers, which would result in extremely long formulas. Other models involve complex transformations of the input features (e.g. support vector machines²) or even rely on randomness (e.g. random forest⁴), rendering it difficult to obtain a single, simple formula.

Finding a simple formula is a non-trivial task, as it is a priori unclear how individual features need to be combined to achieve high classification scores. Hence, a search is required. Manual approaches have been shown not to scale well to a large number of features.⁷ On the other hand, depending on the set of allowed operations (such as addition or multiplication) and the number of features, the search space may be too large to run an exhaustive automatic search. Even if such a search were feasible, the question of how simple the resulting formula should be cannot be answered automatically, as the answer depends on the exact task at hand. Hence, the inherent trade-off between accuracy and simplicity of such formulas needs to be resolved by a user-centric, semi-automatic approach.

To aid the user in this process, we propose an interactive approach to compose formulas that serve as classifiers from multi-dimensional data. The ongoing exchange with our collaboration partners from the cardiovascular imaging group motivated this work and allowed us to define tasks and requirements for our approach. An overview of our interactive visual system is depicted in Figure 1.

Figure 1.

Overview of our approach: Given the set of all features $F$, a feature embedding is calculated in a pre-processing step and presented in the form of a scatter plot (top left). Pairwise distances of embedded features represent the F1 score achieved by the baseline ensemble classifier trained on the respective feature pair $f_{ab}$. Size and color of points encode the feature’s individual F1 score and feature importance,⁷ respectively. Also, the baseline classifiers are evaluated on all features to serve as reference. In an interactive visual system, the user can pan and zoom, filter, and select features in the embedding. Selected features will be used to compute the symbolic regression and to evaluate the baseline classifiers (top right). The user can always change the settings for the search, such as the parsimony coefficient, to control the complexity of the resulting formulas. Resulting formulas can be compared in the trade-off and distribution plots (bottom).

For the automated search component, we employ symbolic regression,⁸ a genetic algorithm that tries to find a mathematical formula to describe the data, while allowing us to control the simplicity of resulting formulas. As reducing the size of the search space is still essential for such a search to find meaningful results in a reasonable time, we introduce a novel feature selection approach.

Central to the interactive feature selection step is an overview visualization of all features in a 2D embedding. The 2D embedding is computed using Multi-dimensional Scaling (MDS),⁹ where the distance between each pair of features encodes their combined classification score. The resulting 2D scatter plot allows for the interactive selection of features, where color and size of the points encode the individual feature’s importance and classification score, respectively.

Once the user has selected a set of features, we compose formulas of these features using symbolic regression. To do so, we first split the data into a training and a test data set, as it is generally done for supervised classification. We rate the classification quality of each formula when applied to the training data and test data, which leads to two scores. We additionally rate the complexity of each formula leading to a third score.

The quality of the calculated formulas based on the three ratings is then analyzed in multiple coordinated views of statistical plots. Within these plots, the quality can also be compared against the performance of state-of-the-art classifiers, which serve as a baseline.

We tested our interactive visual system on well-known machine learning data sets in an informal user study. We compare the classification performance achieved by the user-selected features to the performance achieved by feature sets reported in other state-of-the-art articles. Finally, we apply our approach in a case study to a murine cohort data set acquired using 4D PC-MRI to analyze pulmonary hypertension in three different stages. We report findings and present feedback we received from domain experts when using our tool.

Our main contributions can be summarized as follows:

We introduce an interactive visual feature selection method that effectively reduces the feature space by leveraging pair-wise classification scores, per-feature classification scores, and feature importance to generate a feature space embedding.

We propose an approach to compose classifiers as formulas from labeled multi-dimensional data, which are compared against state-of-the-art classifiers in an interactive visual system.

We demonstrate our approach on multiple labeled multi-dimensional data sets and compare our user selections to state-of-the-art feature selection techniques.

We apply our approach to a pulmonary hypertension blood flow cohort data set acquired with 4D PC-MRI, from which hemodynamic features have been calculated, and report our findings as well as domain expert feedback.

In the remainder of the paper, we will first present our Task Analysis leading to the requirements for our system. At that point, we can discuss Related Work to our approach. As the goal of our approach is to find good classifiers, we next define what “good” means in the section on Classification Performance Calculation. Based on this measure, we then detail how the Feature Selection process is supported by our tool (Figure 1 top left). Having selected features, we explain how they are combined in the Formula Generation step (Figure 1 top right). All components then come together in the description of our Interactive Visual System that embeds the already explained steps and extends them with visual components for the analysis of generated formulas and their classification quality (Figure 1 bottom). Afterward, we present an Evaluation of our approach with multiple data sets, followed by a Case Study to analyze a hemodynamic cohort data set and respective Domain Expert Feedback. Finally, a description of Limitations and Discussion and Conclusion and Future Work wrap up our paper.

Task analysis

In multiple sessions with our collaboration partners from the cardiovascular imaging group, we discussed and analyzed their workflow and the typical problems they face in their research. Generally, they conduct animal studies to find non-invasive biomarkers extracted from time-resolved images acquired using four-dimensional phase-contrast magnetic resonance imaging (4D PC-MRI). In a first task (T1), all features are analyzed for statistical significance using statistical software, that is, they test if the animal cohorts can be separated through a single feature. However, even if no single feature can serve as a classifier, a combination of multiple features might be able to. Finding these combinations hence is the second task (T2). Due to the vast number of features in spatio-temporal data, performing exhaustive computations to find the optimal classifier is often not feasible. Therefore, an interactive approach is desired to assemble classifiers (or biomarker candidates) that capture relationships between individual features. For the representation of these relationships, the domain experts prefer a mathematical formula over alternatives such as decision trees. The main reason for this is the domain experts' familiarity with this particular representation. To give an example, the Teichholz et al. formula¹⁰ $V = \frac{7}{2.4 + D} D^{3}$ captures the relationship between the minor and major axis of the left ventricle to estimate the ventricular volume $V$ through echocardiography, where $D$ is the measured minor axis. Sophisticated tests and requirements were necessary to obtain the final formula from a theoretical model which, in general, may not be available. To assess the quality of such a classifier, it needs to be compared against a baseline.

From these tasks and details, we derived requirements with our collaboration partners that open the design space for our approach. We will later use these formulations to explain our design choices.

R1 Feature Overview: An overview of all features should be provided, such that the best, worst, and average performance of all features is assessable.

R2 Feature Detail: It should be possible to assess the individual features’ performance.

R3 Modelability: Each classifier should be represented by a single formula.

R4 Simplicity: The overall approach should favor a minimal set of features necessary to perform the tasks.

R5 Comparability: It shall be possible to compare different classifiers/formulas.

R6 Validity: Formulas shall be compared against a baseline, that is, state-of-the-art classifiers.

R7 Scalability: The approach should scale well with the number of classes (binary or multi-class), the number of features, and the number of samples.

R8 Interactivity: The formula composition process should be interactive, where necessary components may be computed in a pre-processing step.

Related work

In this section we discuss prior work on classification algorithms, automatic feature selection techniques, and why visual interactive feature selection techniques can yield better classification results.

Classification

Classification is a well-researched field with numerous existing algorithms. In this work, we focus on multi-class (R7), supervised classification algorithms. We focus on supervised algorithms since we assume the class labels to be known and can therefore improve classification performance during training. Classification algorithms can be categorized into linear models like Linear Discriminant Analysis (LDA),¹ Support Vector Machine approaches (SVM),² tree-based models like Decision Trees³ and Random Forest,⁴ Bayesian models like Naive Bayes,⁵ and Machine Learning models.⁶ Different classification algorithms make different assumptions about the data. To capture different data properties, ensemble classifiers have been proposed.^11,12 Ensemble classifiers combine the predictions of multiple classifiers into a more robust prediction. Similar to Haq et al.,¹³ we chose a linear SVM, L1-regularized Logistic Regression (L1-LR), and Random Forest (RF) as our baseline, as their underlying techniques and assumptions can be considered quite orthogonal to each other. These classification algorithms are fast and can achieve good classification performance. However, the models of these classifiers tend to be more complex, and it might be difficult to turn the models into a formula that can be easily interpreted (R3). We add LDA as a fourth baseline classifier, as it is straightforward to derive a formula from the model. However, the resulting formula is a linear combination of all features, which requires classes to be linearly separable, and this cannot be generally assumed. Similarly, a linear SVM also computes a linear combination of features but applies a different decision function. L1-LR applies the Softmax function to the linear combination to obtain the class probabilities. Hence, the resulting formulas can become rather complex. Deriving a formula from an RF model is not straightforward, as one would first need to transform the individual trees and then apply a majority voting to retrieve the dominantly predicted label.¹⁴ As our goal is to generate simple formulas, evolutionary algorithms can be used, as they allow us to search a large search space of features while taking into account simple models (R4). A taxonomy of different genetics-based classification algorithms was presented by Fernandez et al.¹⁵ However, these evolutionary algorithms tend to become slower as the number of features grows due to their iterative nature.

Feature selection

For classification tasks, feature selection plays an important role when the number of features is large. It was shown that the presence of redundant, irrelevant, or noisy features may affect prediction accuracy,¹⁶ since models without an integrated feature selection mechanism (such as convolutional neural networks¹⁷) accumulate the contributions of each noisy feature.¹³ To reduce the feature space, either feature extraction or feature selection methods can be used. Feature extraction methods tend to combine multiple features into more descriptive features, while feature selection methods try to find a subset of features that describe the data best. Since we do not want to transform the original features, we will focus on feature selection methods. Several works^18–21 categorize feature selection methods into filter methods,^22–24 wrapper methods,^25–28 and embedded methods.^3,4 Genetic algorithms for feature selection^29–31 are a special type of wrapper method; they try to find an optimal set of candidate features through natural selection of entities and an underlying functional that should be optimized.

These filter methods make different assumptions on the underlying data. There is no single feature selection method that works well for all data sets. Therefore, Haq et al.¹³ present a feature selection approach using feature ranking and clustering to determine a suitable feature subset for classification. They employ multiple feature ranking techniques with different underlying assumptions regarding their regression function. This makes their approach more robust with regard to different data sets. Their approach filters out individual features based on their feature importance. Thus, their approach assumes that the least important features still perform badly, even when they are combined with other features. However, this might not necessarily be true. Therefore, we do not remove features prematurely but let the user decide which feature subset to keep.

Interactive visual feature selection

While automatic feature selection techniques might suffice in certain scenarios, it cannot be guaranteed that these approaches are able to find good minimal feature sets.³² Therefore, multiple interactive visual feature selection methods were introduced that are user-centric and support data- or task-dependent feature selection. They can be categorized based on their visualization design into radial approaches,^33–35 star coordinates,³⁶ scatter plot-based approaches,^37,38 histogram-based approaches,^39,40 and approaches using 3D visualizations.⁴¹

Wentzel et al.⁴² developed an interactive visual system for creating predictive models to estimate radiodose toxicities in head and neck cancer patients. They perform feature selection by encoding patient data into vectors based on organs and dose-volume histograms, using clustering algorithms to group patients and rank clusters by mean organ doses. In addition, they employ a constrained rule mining algorithm to generate dose thresholds for classification. However, they only derive threshold-based rules to explain their clinical use cases, while we are interested in an actual formula.

Chatzimparmpas et al.⁷ employ a heatmap with features and feature ranking techniques as well as the mean feature importance based on all feature ranking techniques. The paper combines five feature ranking techniques, namely Univariate Feature Selection (FS),⁴³ Impurity-based Feature Importance (FI),⁴⁴ Permutation FI,⁴⁵ Accuracy-based FI,⁴⁶ and Ranking-based FS⁴⁷ using Recursive Feature Elimination (RFE).²⁸ These techniques are integrated into their system by normalizing and averaging the scores and then using these scores to rank the features by importance. The user can select which features to exclude. As we believe our approach would benefit from a robust feature ranking technique, we integrated their feature importance into our approach. Chatzimparmpas et al. allow the filtered features to be analyzed further in a radial tree visualization. Each node in the tree encodes statistical measures like Pearson’s correlation coefficient and mutual information. The edges encode whether the correlation between a feature and the transformed features increases or decreases in general. For a selected feature, more details can be shown in a graph visualization. Besides the encoding for Pearson’s correlation coefficient and mutual information, per-class correlation and variance inflation factor are encoded. This plot also allows for feature generation using four operations $(+, -, \cdot, /)$. Chatzimparmpas et al.’s approach works well for a few features but might not scale that well to a data set with many features (R7). Our feature embedding makes better use of the available screen space, as we project features in proximity when they have a good combined predictive power. Another issue is their graph visualization, which uses a rather complex encoding, as the user has to judge the area of rings and boxes for the four statistical measures. We employ a scatter plot that simply uses position, color, and radius to encode relevant measures to judge the importance of a single feature as well as a set of features.

In summary, we opted for a combination of an ensemble classifier and symbolic regression. The ensemble classifier allows for fast exploration of the feature space, while the symbolic regression is used to find a simple model with comparable classification performance. We choose symbolic regression as we ultimately want to generate a formula (R3). An embedding of the pair-wise feature classification performance, averaged feature importance, and single feature classification performance serves as an overview visualization. By taking those three aspects into account the users can interactively select a feature subset based on their background knowledge and use case.

Our approach generally aligns with the framework of biomedical data analytics as proposed by Nguyen et al.⁴⁸ That is, in order for domain experts to generate knowledge from their data, we also differentiate between Pre-Observation, including data pre-processing, such as normalization; Computational Analysis, where the feature embedding is computed; and Visual Analytics, allowing users to analyze the data using visualizations and interactions.

Classification performance calculation

Input to our approach is labeled multi-dimensional data, the dimensions of which we refer to as a set of features $F$. For each feature $f \in F$, all values are normalized to the range $[0, 1]$. The normalization is, in principle, not required; however, some of the baseline classifiers assume or work best with normalized data. For example, LDA and L1-LR create linear combinations of features using coefficients, which are easier to compare or interpret if all features are scaled to the same value range. Moreover, normalizing to the range $[0, 1]$ can improve the numerical stability of some algorithms. We denote with $x_{i}$ one of the $N$ samples of the data such that $X = \{x_{i}\}_{1 \leq i \leq N}$ is the set of all samples. The corresponding ground truth labels we denote as $y^{true} = \{y_{i}\}_{1 \leq i \leq N}$ and the set of all unique ground truth labels as $Y$. An ideal classifier $C$ can then be defined as a function that maps each sample $x_{i} \in X$ to its ground truth label $y_{i} \in y^{true}$, that is, $C : X \to y^{true}$ with $\forall i : C (x_{i}) = y_{i}$, even if the classifier was not fitted for all samples.

To assess the performance of a classifier that is fitted on $X$ or a subset thereof, the predicted labels need to be compared to the ground truth labels by utilizing a measure tailored for classification results. Choosing a one-fits-all measure is not possible, since desirable properties depend on the data and task at hand. However, as we are generally dealing with data where the number of samples per class may vary, we choose to use the micro-averaged F1 score in favor of others, such as the accuracy score. The F1 score captures both precision and recall and, hence, is a good choice if both need to be optimized. For the remainder of the paper, we refer to the micro-averaged F1 score as F1 score. It is calculated as

$$F1_{micro} = \frac{Σ TP}{Σ TP + 0.5 (Σ FP + Σ FN)},$$

where $Σ TP$, $Σ FP$, and $Σ FN$ are the sums of true positives (TP), false positives (FP), and false negatives (FN) over all classes, respectively. This score is best applicable in cases where the classes are imbalanced and classes with more samples shall have a larger impact on the score, that is, it attributes equal importance to each sample. In cases where each class should be treated with equal importance, the measure could easily be replaced by a more suitable score, such as the macro-averaged F1 score or the Matthews correlation coefficient.⁴⁹ Their interpretation is, however, considered less straightforward.
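The formula above can be illustrated with a minimal Python sketch. The function name `micro_f1` and the example labels are hypothetical and chosen only for illustration; they are not taken from the paper's implementation.

```python
from collections import Counter


def micro_f1(y_true, y_pred):
    """Micro-averaged F1: pool TP, FP, and FN over all classes first."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # the predicted class gains a false positive
            fn[t] += 1  # the true class gains a false negative
    sum_tp = sum(tp.values())
    sum_fp = sum(fp.values())
    sum_fn = sum(fn.values())
    denom = sum_tp + 0.5 * (sum_fp + sum_fn)
    return sum_tp / denom if denom else 0.0


# Hypothetical, imbalanced three-class example (four "a", one "b", one "c").
y_true = ["a", "a", "a", "a", "b", "c"]
y_pred = ["a", "a", "a", "b", "b", "a"]
score = micro_f1(y_true, y_pred)  # 4 TP, 2 FP, 2 FN
```

Because every sample contributes exactly once to the pooled counts, the large class "a" dominates the score in this example, which is the per-sample weighting described above.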

It should also be possible for a single feature to serve as a classifier. In this case, the predicted label would be represented by the feature’s value itself, that is, $C : X \to \mathbb{R}$. Additionally, it may discriminate the classes in reverse order. If, for example, the feature models a negative linear relationship between its value and the class label, the F1 score cannot be applied directly. For these reasons, we introduce an intermediate mapping $ψ : \mathbb{R} \to Y$ from predicted distributions to class labels. We first make the assumption that, by design, for each reasonably well-performing classifier (baseline, feature, or formula), the resulting distributions are linearly separable for each class and uni-modal. If they were not, that would either mean that the classifier predicts a multi-modal distribution or that it cannot separate at least two classes well. Neither is what we aim for, and both should result in lower scores. With this assumption, we use Linear Discriminant Analysis (LDA) to determine the decision boundaries between the distributions. The ground truth labels are used to train the LDA model, which then acts as the required mapping $ψ$. When a classifier $C$ is used to predict the labels for the samples $x_{i}$, the resulting labels are obtained by $y^{pred} = \{ψ (C (x_{i}))\}_{1 \leq i \leq N}$. Then, $y^{true}$ and $y^{pred}$ are used to calculate TP, FP, and FN for each class.
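To make the role of $ψ$ concrete, the following is a minimal sketch of a one-dimensional LDA written in plain Python. It is a simplification under stated assumptions: scalar classifier outputs, a pooled variance shared by all classes, and hypothetical names (`fit_lda_1d`, `psi`) and data. The paper does not specify which LDA implementation is used.

```python
import math
from collections import defaultdict


def fit_lda_1d(values, labels):
    """Fit 1D LDA: per-class means and priors plus a pooled variance."""
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[y].append(v)
    n = len(values)
    means = {y: sum(vs) / len(vs) for y, vs in groups.items()}
    priors = {y: len(vs) / n for y, vs in groups.items()}
    # Pooled within-class variance, shared by all classes as LDA assumes.
    ss = sum((v - means[y]) ** 2 for v, y in zip(values, labels))
    var = ss / max(n - len(groups), 1)
    return means, priors, max(var, 1e-12)  # guard against zero variance


def psi(value, model):
    """Map a scalar classifier output to the label with the largest
    linear discriminant score."""
    means, priors, var = model

    def score(y):
        m = means[y]
        return value * m / var - m * m / (2 * var) + math.log(priors[y])

    return max(means, key=score)


# Hypothetical feature that separates the classes in reverse order:
# high values belong to class 0, low values to class 2.
values = [0.9, 0.8, 0.85, 0.5, 0.55, 0.45, 0.1, 0.2, 0.15]
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]
model = fit_lda_1d(values, labels)
y_pred = [psi(v, model) for v in values]
```

Since the decision boundaries are learned from the ground truth labels, the reversed ordering of the feature values poses no problem, and the predicted labels can be passed directly to the F1 computation.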

Feature selection

All classifiers that we use as a baseline (LDA, SVM, Logistic Regression, and Random Forest) are relatively fast to compute for a larger number of features. To compose formulas, however, we apply symbolic regression.⁸ It uses a genetic algorithm that randomly initializes short formulas, picks the fittest among those formulas, and tries to converge to an optimal solution, that is, a solution with a high classification score, by applying genetic operations (such as selection, mutation, and cross-over).

How fast a satisfactory solution can be found depends strongly on the number of features considered by the symbolic regression. We demonstrate this dependency by setting up a synthetic three-class problem⁵⁰ using the implementation of the scikit-learn library.⁵¹ First, a 10-dimensional hypercube with side length 2 is constructed. Then, a total of 100 samples is distributed among normally distributed clusters $(σ = 1)$ centered at the vertices of the hypercube. Hence, the high-dimensional position of each sample represents the sample’s feature vector. Then, each cluster is assigned uniformly at random to one of the three classes. This defines a three-class classification problem with a feature space of size 10. We then successively add more and more features, initialized with random values from the range $[0, 1]$, which hence do not contribute to the classification task. Next, we set a target classification F1 score of 0.55 and count the generations needed by the symbolic regression to reach the target score. The experiment is repeated 50 times. The result of the experiment, shown in Figure 2, suggests that an increasing number of non-informative features, on average, considerably increases the number of generations needed, corresponding to long computation times.

Figure 2.

Line plot encoding the average number of generations needed by the symbolic regression to achieve an F1 classification score of 0.55 on a synthetic three-class problem with increasing number of additional random features.⁵⁰ The gray band encodes the 95% confidence interval.

We conclude from the experiment that reducing the number of features (being left with the most promising ones), on average, reduces the number of generations needed to find the same set of features in the regression. Hence, a reduction of the feature space is essential to maintain an interactive and iterative approach (R4).

To support feature selection, we first aim to identify measures that indicate or predict if combining a pair of features $f_{ab} \in F \times F$ will lead to high F1 scores during symbolic regression. Commonly used pairwise feature measures^7,13 are Pearson’s correlation coefficient and (continuous) mutual information (cMI). If, for example, Pearson’s correlation coefficient of two features indicates high (anti-)correlation, then we expect that the combinations of this pair of features will not lead to a classifier that performs much better than either of the two features alone. In addition to using Pearson’s correlation coefficient and cMI, we combine our baseline classifiers into an ensemble classifier using majority voting. We record the ensemble classifier’s achieved F1 score as well as the score’s improvement when compared to the features’ individual scores.
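The majority voting mentioned above can be sketched in a few lines of Python. The function name, the example predictions, and the tie-breaking rule (the label that appears first among the tied ones wins) are assumptions made for illustration; the paper does not state how ties are resolved.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-classifier label lists into one label per sample.

    Ties go to the tied label seen first in classifier order
    (an assumption, not specified in the paper).
    """
    voted = []
    for sample_labels in zip(*predictions):
        voted.append(Counter(sample_labels).most_common(1)[0][0])
    return voted


# Hypothetical predictions of four baseline classifiers for four samples.
lda = [0, 1, 2, 1]
svm = [0, 1, 1, 1]
l1_lr = [0, 2, 2, 0]
rf = [1, 2, 2, 2]
ensemble = majority_vote([lda, svm, l1_lr, rf])
```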

Given the set of these four measures $M$, none of the measures is expected to be a good indicator on its own, as the effect on the F1 score is unclear. Hence, we considered all combinations of all measures, that is, the power set $\mathcal{P}(M)$. In order to combine multiple measures into a single indicator, we normalize the value range of each measure $m \in M$ to the range $[0, 1]$ such that higher values indicate preferred combinations of pairs of features $f_{ab}$. For Pearson’s correlation coefficient, we calculate one minus the absolute value, such that a low (anti-)correlation is favored. Similarly, we normalize (continuous) mutual information by calculating one minus the original value divided by the maximum value achieved by all pairs of features. For both the F1 score and respective improvement, only the normalization to the range $[0, 1]$ is required, as higher values indicate a favorable combination of features. We evaluate each measure $m \in M$ for each pair of features $f_{ab}$. For combined measures, we compute the average of the individual measures.
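The normalization and averaging described above can be sketched as follows. This is a minimal illustration under assumptions: the names are hypothetical, the input values are made up, min-max scaling is assumed for the F1 score and its improvement, and only the non-empty subsets of $M$ are enumerated.

```python
from itertools import combinations


def min_max(values):
    """Scale values linearly to [0, 1] (assumed normalization)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def normalize_measures(pearson, cmi, f1, improvement):
    """Each argument holds one value per feature pair. Higher normalized
    values indicate a preferred pair."""
    max_cmi = max(cmi)
    return {
        "pearson": [1.0 - abs(r) for r in pearson],
        "cmi": [1.0 - v / max_cmi if max_cmi > 0 else 1.0 for v in cmi],
        "f1": min_max(f1),
        "improvement": min_max(improvement),
    }


def combine(normalized, names):
    """Average the selected normalized measures per feature pair."""
    columns = [normalized[name] for name in names]
    return [sum(vals) / len(vals) for vals in zip(*columns)]


# Hypothetical values for three feature pairs.
normalized = normalize_measures(
    pearson=[0.9, -0.2, 0.0],
    cmi=[0.5, 1.0, 0.25],
    f1=[0.6, 0.8, 0.7],
    improvement=[0.0, 0.2, 0.1],
)

# All non-empty combinations of the four measures.
names = ("pearson", "cmi", "f1", "improvement")
subsets = [c for k in range(1, len(names) + 1) for c in combinations(names, k)]
indicator = combine(normalized, ("pearson", "cmi"))
```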

For all pairs of features $f_{ab}$, we then run a symbolic regression and compare the achieved F1 score to the respective (combined) measure's value using Spearman’s rank correlation coefficient $r$, as we do not want to make the assumption of a normal distribution and linearity. Our experiments show that only the baseline classifier’s F1 score is a reliable indicator, with $r > 0.8$ for all data sets. The best result for each data set is depicted in Figure 3; the results of all experiments can be found in the appendix. In the same figure, we can observe that the score achieved by the symbolic regression is typically lower than the score achieved by the baseline classifiers. Since we do not allow the formulas to become arbitrarily complex, this demonstrates the inherent trade-off between the classifier’s simplicity and performance.
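For readers unfamiliar with the measure, Spearman's rank correlation coefficient is Pearson's correlation computed on ranks. The sketch below uses plain Python and average ranks for ties. The two score lists are hypothetical and are not results from the paper.

```python
def ranks(values):
    """1-based ranks; tied values share the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        average_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = average_rank
        i = j + 1
    return result


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sum((a - mean_x) ** 2 for a in x) ** 0.5
    std_y = sum((b - mean_y) ** 2 for b in y) ** 0.5
    return cov / (std_x * std_y)


def spearman(x, y):
    """Spearman's rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Hypothetical F1 scores for five feature pairs.
baseline_f1 = [0.60, 0.72, 0.81, 0.90, 0.95]
regression_f1 = [0.55, 0.70, 0.69, 0.80, 0.88]
r = spearman(baseline_f1, regression_f1)
```

Because only the ordering of the scores matters, a monotone but non-linear relationship between the two scores still yields a coefficient of one.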

Figure 3.

Comparison of F1 score achieved by our baseline ensemble classifier to the score achieved by symbolic regression. For all three data sets, we observe a high Spearman’s rank correlation coefficient.

To reduce the number of features, the goal is now to derive a feature embedding that includes all features but projects them into 2D space while preserving the following properties: On the one hand, features that combine well shall be projected close to each other. On the other hand, features that do not combine well shall be projected farther apart. Consequently, in a projection that satisfies these properties, clusters make for a good subset of features that can be selected. To obtain the projection we propose to use Multi-dimensional Scaling (MDS)⁹ as it tries to preserve pairwise distances and, hence, allows us to estimate how well two features combine.

To transform the baseline classifiers’ F1 scores $s$ of all feature pairs $f_{ab}$ into a distance matrix $d$ required as input for MDS, we compute $d_{ab} = Ψ (1 - s_{ab} / \max (s))$, where we define $Ψ$ to be an emphasizing function $Ψ (x) = x^{2}$. The function has the purpose of pulling good feature pairs closer together. The idea behind this modification is that less emphasis is put on features that do not combine well. Without applying this function, MDS had difficulties preserving the distances, that is, the eigenvalues indicated that more than two dimensions would be required to capture most of the variance in the data. Finally, diagonal entries of the matrix are set to zero, and the resulting matrix is fed to MDS to obtain the projection.
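The transformation from pairwise scores to distances can be sketched as follows. The function name and the example scores are hypothetical, and taking the maximum over off-diagonal entries only is an assumption.

```python
def distance_matrix(scores):
    """Turn a symmetric matrix of pairwise F1 scores s_ab into distances
    d_ab = (1 - s_ab / max(s))^2 with a zero diagonal."""
    n = len(scores)
    # Assumption: max(s) is taken over feature pairs, not the diagonal.
    s_max = max(scores[a][b] for a in range(n) for b in range(n) if a != b)
    d = [[0.0] * n for _ in range(n)]
    for a in range(n):
        for b in range(n):
            if a != b:
                d[a][b] = (1.0 - scores[a][b] / s_max) ** 2
    return d


# Hypothetical pairwise F1 scores for three features.
scores = [
    [0.00, 0.90, 0.60],
    [0.90, 0.00, 0.45],
    [0.60, 0.45, 0.00],
]
d = distance_matrix(scores)
```

In this example the best pair receives distance 0, while the pair with half the maximum score receives 0.25 instead of the 0.5 it would get without squaring. The resulting matrix can then be passed to any MDS implementation that accepts precomputed dissimilarities.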

For each data set used in this paper, the resulting projections were evaluated for their ability to preserve the topology of the high dimensional data using trustworthiness, normalized stress, and continuity.⁵² The scores, presented in the appendix, indicate that most of the topology could be preserved.

Finally, we propose to visualize the projected space using a 2D scatter plot (R1) that allows for the interactive selection of projected features (R8). Additionally, we enrich the scatter plot by mapping the feature’s individual F1 scores to the color and the mean feature importance by Chatzimparmpas et al.⁷ (cf. Related Work section) to the radius of the plotted circles (R2), see Figure 4 (gray areas show user selections).

Figure 4.

Feature embedding for the Pulmonary Hypertension data set, where colors encode the features’ individual F1 scores and size the mean feature importance. Only those features are labeled that are included in formulas found by the formula search. The gray area indicates the features selected by the domain expert who provided the data during our case study.

Formula generation

Given a selection of features $F$ , the goal is to assemble formulas that serve as classifiers (R3). We use symbolic regression based on a genetic algorithm to fit a model to the data. To do so, we split the data into a training set and a test set, where the latter is used for validation. Details on the splitting are discussed in the Limitations and Discussion section. In principle, we do not impose a specific implementation of the algorithm as long as it fulfills the requirements outlined below. For details about the implementation that was used in this work, we refer to the comprehensive guide by the author of the implementation.⁵³

Symbolic regression maximizes a fitness function by assembling a mathematical formula. The formula is assembled by using a set of arithmetic operations and applying them to a subset of the features. While, in principle, any arithmetic operation is possible, we restrict the available set to addition, multiplication, subtraction, and division in our work. We argue that these operations allow us to generate truncated Taylor series. Therefore, any function that can be represented by a Taylor series (i.e. its error term converges to zero) can be approximated with these operations. Of course, the allowed operations are also straightforward to extend.
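To illustrate the Taylor series argument, the following sketch approximates the exponential function with a truncated series that uses only operations from the restricted set. The function `exp_taylor` and the number of terms are chosen for illustration only.

```python
import math


def exp_taylor(x, terms=15):
    """Approximate exp(x) by the first `terms` terms of its Taylor series.

    Only addition, multiplication, and division are used.
    """
    total = 1.0
    term = 1.0
    for k in range(1, terms):
        term = term * x / k  # term now equals x^k / k!
        total = total + term
    return total


# Compare against the library implementation at x = 1.
error = abs(exp_taylor(1.0) - math.exp(1.0))
```

Adding more terms reduces the error but also lengthens the expression, which mirrors the trade-off between complexity and accuracy discussed throughout this paper.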

The constructed formula is an arithmetic expression, which is represented by a tree that evolves and is optimized in the fitting process. The evolution starts with a competing population of trees, created from a random subset of all available features that are combined using another random subset of available operations. From each generation, the fittest individuals undergo so-called genetic operations, basically meaning random mutations of the tree. The fitness function we use is, again, the (micro-averaged) F1 score. We also apply the mapping described in the Classification Performance Calculation section to allow for more flexibility in the regression process, that is, the regression only optimizes for separability, not for predicting the exact labels.
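A formula tree and a simple genetic operation could be represented as in the following sketch. It is not the implementation used in this work: the tuple-based tree encoding, the protected division, the mutation rate, and the feature names `f1` to `f3` are assumptions made for illustration.

```python
import operator
import random

# The four arithmetic operations the search is restricted to.
OPS = {
    "add": operator.add,
    "sub": operator.sub,
    "mul": operator.mul,
    # Assumption: division by (almost) zero returns 1.0 to keep trees evaluable.
    "div": lambda a, b: a / b if abs(b) > 1e-9 else 1.0,
}


def evaluate(tree, sample):
    """Evaluate a tree: leaves are feature names, inner nodes are (op, left, right)."""
    if isinstance(tree, str):
        return sample[tree]
    op, left, right = tree
    return OPS[op](evaluate(left, sample), evaluate(right, sample))


def point_mutation(tree, rng, rate=0.2):
    """Return a copy in which each operator is replaced at random with probability `rate`."""
    if isinstance(tree, str):
        return tree
    op, left, right = tree
    if rng.random() < rate:
        op = rng.choice(sorted(OPS))
    return (op, point_mutation(left, rng, rate), point_mutation(right, rng, rate))


# Hypothetical formula (f1 / f2) * f3 evaluated on one sample.
tree = ("mul", ("div", "f1", "f2"), "f3")
sample = {"f1": 6.0, "f2": 3.0, "f3": 4.0}
value = evaluate(tree, sample)
mutated = point_mutation(tree, random.Random(0), rate=1.0)
```

The mutation shown here only exchanges operators and leaves the tree shape untouched; implementations of genetic programming typically offer further operations such as subtree replacement and crossover.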

Another important aspect during the evolution process is the complexity of the resulting formula. We measure the complexity by counting the number of characters in a formula. Hence, a complexity value of one indicates the simplest formula, that is, a single feature.

To generate formulas, the user may start multiple fitting processes, each of which uses a different random seed, that is, it starts with a different initial population. On average, formulas tend to become more complex over the course of the evolution process, which is referred to as bloat. To counteract this behavior, complexity is penalized using a so-called parsimony coefficient in the symbolic regression. In essence, when comparing two competing trees from the current population, a penalty is applied to their respective fitness by subtracting the product of (non-simplified) complexity and the parsimony coefficient from the current fitness value. To the resulting formula, we apply a symbolic simplification to minimize the mathematical expression (e.g. $x \cdot x \cdot x = x^{3}$).
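The effect of the parsimony coefficient when two trees compete can be sketched as follows. The candidate formulas, their scores, and the coefficient values are hypothetical.

```python
def penalized_fitness(f1_score, complexity, parsimony_coefficient):
    """Fitness after subtracting complexity times the parsimony coefficient."""
    return f1_score - parsimony_coefficient * complexity


def winner(candidates, parsimony_coefficient):
    """Select the candidate with the highest penalized fitness."""
    return max(
        candidates,
        key=lambda c: penalized_fitness(c["f1"], c["complexity"], parsimony_coefficient),
    )


# Two hypothetical competitors: a simple formula and a slightly better, longer one.
simple = {"formula": "x1", "f1": 0.85, "complexity": 1}
bloated = {"formula": "x1 * x2 + x3 / x4", "f1": 0.87, "complexity": 7}

with_penalty = winner([simple, bloated], parsimony_coefficient=0.01)
without_penalty = winner([simple, bloated], parsimony_coefficient=0.0)
```

With a coefficient of 0.01, the simple formula wins although its raw F1 score is lower; without the penalty, the longer formula is preferred.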

The outcome of the formula generation is a set of formulas with three ratings per formula, namely its F1 training score (i.e. F1 score for training data), its F1 test score (i.e. F1 score for test data), and its complexity.
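For reference, the micro-averaged F1 score pools true positives, false positives, and false negatives over all classes before computing the score. The following sketch operates on already predicted class labels; the mapping from formula values to labels is not shown, and the labels below are hypothetical.

```python
def micro_f1(y_true, y_pred):
    """Micro-averaged F1 score from pooled per-class counts."""
    classes = set(y_true) | set(y_pred)
    tp = fp = fn = 0
    for c in classes:
        for t, p in zip(y_true, y_pred):
            if p == c and t == c:
                tp += 1
            elif p == c:
                fp += 1  # predicted class c, but the true class differs
            elif t == c:
                fn += 1  # true class c, but another class was predicted
    return 2 * tp / (2 * tp + fp + fn) if (tp + fp + fn) else 0.0


# Hypothetical labels for four samples from three classes.
y_true = ["A", "A", "B", "C"]
y_pred = ["A", "B", "B", "C"]
score = micro_f1(y_true, y_pred)
```

When every sample carries exactly one label, each misclassification contributes one false positive and one false negative, so the micro-averaged F1 score coincides with the fraction of correctly classified samples.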

Interactive visual system

To evaluate our approach and to apply it to different use cases, we implemented an interactive visual system using the Python programming language and the Dash and Plotly frameworks. A summary is presented in Figure 1. For a demonstration of the interactive system, we kindly refer to the accompanying video material.

Pre-processing

In a pre-processing step, the distance matrix required for the embedding is calculated (see Feature Selection section). Also, the baseline and the ensemble classifier are evaluated on all features to serve as references. In our application, we restrict the number of generations to 10 to maintain interactivity (R8). From Figure 2, we can deduce that this requires a very effective feature selection method to achieve high F1 scores.

Feature selection

As described in the introduction, the user is first presented with the feature embedding providing an overview (R1), see Figure 4. Also, the individual feature’s performance can be assessed by its color and radius encoding F1 score and feature importance, respectively (R2). The user, then, may select all features or a subset thereof, on which we run the symbolic regression and evaluate the baseline and ensemble classifier (R3). At this point, we assume the data samples to be randomly split into a training and a test set, containing 80% and 20% of the samples, respectively. Testing on unseen data is essential when investigating how a classifier generalizes and to judge if it is overfitted. Considering only a single data split is generally not enough to investigate generalization, but we cannot evaluate multiple splits due to the nature of how formulas are obtained. A detailed discussion on this can be found in the Limitations and Discussion section. For now, we assume that a single, randomized split is still beneficial in most cases, in particular to assess classifier overfitting.
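A single randomized 80/20 split could be realized as in the following sketch. The function name, the seed handling, and the rounding down of the test set size are assumptions.

```python
import random


def train_test_split_indices(n_samples, test_fraction=0.2, seed=0):
    """Shuffle sample indices and split them into training and test indices."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    n_test = int(n_samples * test_fraction)  # assumption: round down
    return indices[n_test:], indices[:n_test]


# Example with 16 samples: 13 are used for training and 3 for testing.
train_idx, test_idx = train_test_split_indices(16)
```

Changing the seed yields a different split, which in turn is likely to lead to different formulas.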

Trade-off plot

While all baseline classifiers are evaluated (1) for all features and (2) for the subset selected by the user (R6), the formula search is performed only for the user selection (2) within an interactive setting. The classifiers’ performance is then encoded in a standard 2D scatter plot, where each classifier is represented by a point and where the horizontal and vertical axes encode the F1 score for the training data and test data, respectively. To distinguish the two feature sets (all vs selected subset), we depict each classifier as a diamond or dot, respectively, where we assign the same color for the same classifier using a categorical color mapping (e.g. red is chosen for formulas). For the formulas calculated from the user-selected feature subset, the size of the dots encodes the complexity of the formula. Instead of directly mapping the complexity value to the radius, we choose to reinterpret complexity as simplicity, that is, larger dots represent simpler formulas, such that the attention of the user is guided toward finding simple formulas. The exact (ranges of) radii have been chosen empirically. We draw lines between each glyph and its label to simplify the matching for the user. This is useful since all formulas share the same (red) color and might have similar radii. In cases where the glyphs of classifiers overlap, we change the rendering order, such that smaller glyphs are rendered on top of larger glyphs. In addition, we group all labels of overlapping points and only draw one line to the group of overlapping points to improve readability. This plot shall allow the user to decide on the trade-off between complexity and performance of formulas (R5). An example is provided in Figure 5.
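The reinterpretation of complexity as simplicity and the rendering order can be sketched as follows. The linear mapping and the radius range are hypothetical, since the actual radii were chosen empirically.

```python
def simplicity_radius(complexity, max_complexity, r_min=4.0, r_max=16.0):
    """Map complexity to a dot radius so that simpler formulas get larger dots."""
    if max_complexity <= 1:
        return r_max
    t = (complexity - 1) / (max_complexity - 1)  # 0 = simplest, 1 = most complex
    return r_max - t * (r_max - r_min)


def draw_order(glyphs):
    """Sort glyphs from largest to smallest so that small glyphs are drawn last, on top."""
    return sorted(glyphs, key=lambda g: g["radius"], reverse=True)


# Hypothetical formulas with complexities 1, 5, and 10.
glyphs = [
    {"label": "formula_a", "radius": simplicity_radius(5, 10)},
    {"label": "formula_b", "radius": simplicity_radius(1, 10)},
    {"label": "formula_c", "radius": simplicity_radius(10, 10)},
]
ordered = draw_order(glyphs)
```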

Figure 5.

Trade-off plot for pulmonary hypertension data set. The trade-off plot visualizes the classification quality for training and test data for each classifier. Classifiers trained on all features are shown as diamonds, while those trained on user-selected features are shown as dots. Colors indicate which baseline classifier is used or whether a formula is used (red). For formulas, the radius encodes their simplicity. Overlapping glyphs are grouped in the legend.

Distribution plot

For a detailed analysis of the classification quality of individual classifiers, we provide plots that allow for investigating the distributions of predicted values for all classes. For a single formula (or baseline classifier) selected by the user, we plot the respective distributions of all samples when evaluating the formula (or the classifier) for each class in a diagram that combines box plots and bee swarm plots (cf. Figure 9). Additionally, the Mann-Whitney-U test is calculated for all pairs of classes and the resulting statistical significance levels (if any) are indicated in the plot (R6). Moreover, two formulas or classifiers may be selected to perform a direct comparison, in which case we juxtapose the diagrams of the two formulas/classifiers.
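The statistic underlying the Mann-Whitney-U test can be sketched as follows. The sketch computes only the U statistic, not the p-value; the class names and values are hypothetical.

```python
def mann_whitney_u(x, y):
    """U statistic of sample x versus sample y; ties count one half."""
    u = 0.0
    for a in x:
        for b in y:
            if a > b:
                u += 1.0
            elif a == b:
                u += 0.5
    return u


# Hypothetical formula values for two classes (illustrative only).
control = [0.62, 0.58, 0.65]
severe = [0.31, 0.40, 0.35]
u_control = mann_whitney_u(control, severe)
u_severe = mann_whitney_u(severe, control)
```

The two statistics always sum to the product of the sample sizes. Obtaining a significance level additionally requires the null distribution of U, for which statistics libraries such as SciPy (`scipy.stats.mannwhitneyu`) can be used.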

Sample table

Finally, we also support the analysis of how consistently each data sample is assigned to a class by all classifiers. We add a table (see Figure 6) summarizing the predicted class by each baseline classifier and each formula for each sample. The columns of the table represent the different classifiers, while the rows represent the different data samples. Rows can be sorted by the number of false predictions to spot the samples that are misclassified most often by the classifiers. The table contains both the training and the test set, where samples that are part of the test set are highlighted using blue color. Correctly classified samples are color-coded in green while misclassified samples are colored in red.
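Sorting the rows by the number of false predictions can be sketched as follows. The row structure, sample names, and classifier names are hypothetical.

```python
def count_false_predictions(row):
    """Number of classifiers whose prediction differs from the ground truth."""
    return sum(1 for label in row["pred"].values() if label != row["gt"])


# Hypothetical table rows: ground truth and predictions per classifier.
rows = [
    {"sample": "s1", "gt": "A", "pred": {"LDA": "A", "formula_1": "A"}},
    {"sample": "s2", "gt": "B", "pred": {"LDA": "A", "formula_1": "C"}},
    {"sample": "s3", "gt": "C", "pred": {"LDA": "C", "formula_1": "B"}},
]

# Samples that are misclassified most often come first.
sorted_rows = sorted(rows, key=count_false_predictions, reverse=True)
```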

Figure 6.

Sample table for pulmonary hypertension data set. The first column encodes the sample name (here, animal ID), where training data is shown in gray and test data in blue. The second column shows the ground truth (GT) labels. The other columns represent the classifiers’ predicted labels (either baseline or formula), where correct classifications are shown in green and misclassifications in red. The classifiers are split into two parts. The first part highlights the classifiers that were trained on all features during the pre-processing step. The second part contains the classifiers that were trained on the current user selection with a population size of 100 for the symbolic regression (formulas in last five columns).

Evaluation

For our evaluations, we selected two common data sets from the machine learning community, that is, the Wisconsin Breast Cancer (Classification) data set and the QSAR (Quantitative Structure Activity Relationships) biodegradation data set, as well as the Red Wine Quality data set. Details about the data can be found below.

Wisconsin breast cancer data set

The Wisconsin breast cancer data set⁵⁴ aims at classifying breast cancers as either malignant or benign. The data set consists of 30 features and 569 samples, of which 357 are labeled as benign and 212 as malignant. This data set was chosen since it contains a larger number of attributes, that is, to evaluate how the proposed approach generalizes to higher-dimensional data sets.

QSAR data set

As a second data set, we utilize the QSAR data set, where molecules are classified as being biodegradable or not, hence, also representing a binary classification problem. The class distribution is relatively imbalanced with 356 degradable and 594 non-degradable molecules. The data set contains 41 features and was also used by Chatzimparmpas et al.⁷

Red wine quality data set

The wine quality data set consists of two data sets of white and red Portuguese Vinho Verde wines and is used for classification and regression tasks.⁵⁵ The data set contains 11 physicochemical wine attributes as well as a wine quality score that had been assessed by 3 different sensory assessors and aggregated by taking the median of the three scores, with possible scores ranging from 0 to 10. We focused on the red wine data set. It only contains 6 distinct quality scores from 3 to 8. The wine quality is roughly normally distributed, with scores 5 and 6 containing over 600 samples each and each of the other four scores containing fewer than 300, some even fewer than 100, samples. This data set was chosen as it was also used by Chatzimparmpas et al.⁷ Like Chatzimparmpas et al., we split the six quality levels into three classes of inferior, fine, and superior quality, comprising the levels 3 and 4 with 63 samples, 5 and 6 with 1319 samples, and 7 and 8 with 217 samples, respectively. To counteract overfitting of the baseline classifiers, we used oversampling to get 325 samples per quality class but only selected every fourth sample of each class to increase the interactivity.
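The grouping of quality scores into three classes and the subsequent thinning can be sketched as follows. The function names are hypothetical, and the oversampling step itself is not shown.

```python
def quality_class(score):
    """Map a quality score (3 to 8) to one of the three quality classes."""
    if score <= 4:
        return "inferior"
    if score <= 6:
        return "fine"
    return "superior"


def every_fourth(samples):
    """Keep every fourth sample of a class, starting with the first one."""
    return samples[::4]


labels = [quality_class(s) for s in [3, 4, 5, 6, 7, 8]]
# Placeholder for the 325 (oversampled) samples of one class.
subset = every_fourth(list(range(325)))
```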

The feature importance of Chatzimparmpas et al.⁷ for the red wine quality and QSAR data set was calculated using their reported setting. For the Wisconsin breast cancer data set, we used eightfold cross-validation to compute feature importance.

In our evaluation, we wanted to answer the question of whether our feature selection techniques encourage the user to intuitively select the most promising features, while leaving the flexibility to also choose other features on demand. Hence, we formulate the hypothesis that the feature subset selected by the user leads to better classification scores achieved by the symbolic regression than when applied on all features. We add the constraint that only a small number of generations is allowed such that an interactive workflow can be maintained (R8). Given a population size of 100, this number was empirically determined to be 10 on our system (Intel i7-6700k, 16 GB RAM).

To test our hypothesis, we conducted an informal user study with five domain experts: four from the visualization domain and one collaboration partner from the medical imaging domain. The users were selected based on their availability and willingness to participate. Importantly, all participants had relevant domain expertise, which ensured their feedback was both insightful and aligned with the study’s objectives. While our collaboration partners conduct pre-clinical studies using PC-MRI data, the visualization experts were colleagues from our institute who are familiar with dimensionality reduction. Among these experts, three conduct their own research in feature selection techniques, and one is specialized in statistical machine learning. All of them were familiar with MDS. In an individual hands-on session for each user, we briefly provided an overview of all plots and interactions of the entire system using a data set that was not part of the study. They could familiarize themselves with the tool and ask questions at any time during the session. The users in our study had no prior knowledge about the features in the data sets. The task was to use the interactive visual system to select as few features as possible for constructing a simple formula that classifies the data with a decent F1 score, without specifying what “decent” means. The users were allowed to spend as much time as they needed, but we also instructed them not to try to engineer the optimal solution, as it may not exist. The users were asked to complete the task for each of the three data sets described above, in the listed order.

For the evaluation, we recorded all user interactions and, most importantly, the selection of features and formulas once the users were satisfied with the result, as well as the respective time stamps. On average, each session lasted approximately $26 \pm 10$ minutes. The embeddings the users were presented with, as well as their selection made for each data set, are depicted in Figure 7. The areas that are colored with shades of gray encode the number of users that selected the features contained by the areas. The selection of each user is shown as a connected region, which allows individual selections to be discerned. This encoding was only added after the study. We now compare how the symbolic regression performs with respect to the F1 score on (1) all features, (2) the user-selected features, and (3) on feature selections reported in state-of-the-art articles (cf. Related Work section). Namely, Haq et al.¹³ and Sanchez et al.³³ both applied their feature selection techniques on the breast cancer data set and reported two feature subsets each. Specifically, Haq et al. apply a linear (Lr) correlation-based and an information gain-based (IG) approach. Sanchez et al. reduce the number of features in a first (1) application of their approach to 26 features, and then further reduce it to 7 features (2). Their idea is to obtain a subset of features that allows the classes to be separated reasonably well.³³ Similarly, Chatzimparmpas et al.⁷ report a subset of features that they obtained using their tool in an in-depth analysis of the QSAR and red wine quality data set. For details we refer to the respective papers. In total, the symbolic regression was evaluated 25 times for each of the 5 user-selected feature sets and $5 \cdot 25 = 125$ times for each feature set from literature for a fair comparison. The results are summarized in Figure 8 (bottom row) by showing the distribution of the F1 scores using box plots (and the outcome of pairwise statistical significance tests between the distributions). In the appendix, exemplary formulas and decision trees can be found for each user and each data set. Generally, we observe that decision trees can become rather complex even if the data set contains only few features.

Figure 7.

Feature embedding of the three data sets used in the user study. The gray areas indicate how often features have been selected for the final formulas by the five users of the user study. A connected area indicates an individual user’s feature selection. If multiple users have similar selections, these areas are shown in darker shades of gray (see color legend): (a) Breast Cancer (b) QSAR and (c) Red Wine Quality.

Figure 8.

Comparison of F1 scores achieved by symbolic regression when applied to features selected by our approach (all users), when applied to all features, and when applied to feature selections reported in literature. Symbolic regression was applied to three data sets using 2 or 10 generations: (a) Breast Cancer, 2 generations (b) QSAR, 2 generations (c) Red Wine Quality, 2 generations (d) Breast Cancer, 10 generations and (e) QSAR, 10 generations.

We observe that for the two data sets with a low number of features (red wine quality and breast cancer), the formulas generated for the respective feature sets perform similarly well, that is, there is no statistically significant difference. The reason is that, for a population size of 100, the likelihood of each feature occurring at least once in the first generation is quite high. This is backed up by our findings shown in Figure 2. For the QSAR data set with 41 features, this likelihood is lower and we observe a significant difference between the score achieved for user-selected features and all features. When we compare the F1 scores for the users’ selection against the selection reported by Chatzimparmpas et al. for the QSAR and red wine quality data set, no significant improvement can be observed. We want to emphasize that we encode their feature importance measure (cf. Related Work section) as the dots’ size in the scatter plot and, hence, a similar result was to be expected. While the approach by Chatzimparmpas et al. requires the user to go through a list of all features row by row and decide if the respective feature should be included or not, ours supports filtering and selection of multiple features at the same time in the embedded feature space. For the breast cancer data set, the user selection performed significantly better than both feature sets reported by Haq et al.; however, no significant difference is observed when compared to the feature sets reported by Sanchez et al.
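The dependence of this likelihood on the number of features can be illustrated with a deliberately simplified model. The sketch assumes that every leaf of every initial tree is drawn uniformly and independently from all features and that each tree contains a fixed number of leaves; neither assumption is guaranteed to hold for an actual symbolic regression implementation.

```python
def prob_feature_in_first_generation(n_features, population_size, leaves_per_tree=1):
    """Probability that one given feature occurs at least once in the initial population.

    Simplifying assumption: all leaves are drawn uniformly and independently.
    """
    draws = population_size * leaves_per_tree
    return 1.0 - (1.0 - 1.0 / n_features) ** draws


# Feature counts of the three data sets, population size 100.
p_wine = prob_feature_in_first_generation(11, 100)
p_breast = prob_feature_in_first_generation(30, 100)
p_qsar = prob_feature_in_first_generation(41, 100)
```

Under this model, the probability decreases as the number of features grows, so a given good feature is more likely to be missing from the initial population of the 41-feature QSAR data set than from that of the 11-feature red wine quality data set.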

To study the impact of the number of generations, we repeated the same experiment with only 2 generations. The idea is that, ideally, one would like to have as few generations as possible for reduced computation time while maintaining the classification performance. In this scenario, we observe that, for all data sets, the scores are generally lower when compared to those achieved with 10 generations. Nevertheless, scores achieved by the user selection are significantly better than those of all features. Still, no significant difference can be found when compared to Chatzimparmpas et al. for the red wine quality and QSAR data set. However, here the user selection performed significantly better for the breast cancer data set when compared to the feature selections reported by both Haq et al. and Sanchez et al. Again, this can most likely be explained by the low number of features of the data sets. With only 2 generations, the symbolic regressions that are evaluated for the user-selected feature sets already have good features to start with, while those evaluated for all features may need more generations to randomly select the good features, since the search space is larger.

We believe that the learning effect induced by the repeated use of our system is negligible compared to the differences in complexity of the data sets. This is supported by the fact that, in the case of the red wine quality data set, scores obtained using user-selected feature sets did not show significant improvement compared to scores obtained using all features. Conversely, for the QSAR data set, which was provided to users prior to the red wine quality data set, the results were significantly better when user-selected feature sets were used.

We could not find strong evidence about whether the feature embedding can correctly represent transitivity between features. For example, if feature A is projected close to B, and B is projected close to C, that does not necessarily mean that A and C also combine well together. When selecting the three features, the resulting formulas could include either one, two, or all of them. The feature embedding hence can only be viewed as a heuristic.

Our experiments support our hypothesis and show that our feature selection approach is effective when (a) a low number of generations is required, (b) the data set at hand has a large number of features, or (c) both conditions hold. We conclude that our approach supports users in selecting a subset from all available features that is suited for composing formulas by means of symbolic regression.

Case Study

Next, we applied our approach in a case study with real users. The goal of this case study triggered the development of our tool. Our collaboration partners conducted an animal study to analyze the effect of pulmonary blood circulation. Three cohorts were incorporated: (A) a healthy control group, (B) a group with severe, and (C) a group with moderate alteration of blood flow. All experiments were conducted in accordance with approved ethical guidelines.⁵⁶ A prospectively cardiac- and respiratory-triggered 4D-flow stack-of-stars phase-contrast sequence was performed on a 9.4T Bruker BioSpec USR 94/20 imaging scanner. For more details about the setup and imaging parameters, we refer to their paper.⁵⁶

We used 16 subjects from the study, of which 6 were control animals, 4 were severely, and 6 were moderately affected animals. In total, 111 features were provided. Some of them were physiological features such as body weight, while others were anatomical measurements such as the diameter of the right ventricle. From the imaging data, several hemodynamic features were extracted for each segment of the pulmonary trunk, that is, for the right (RPA), left (LPA), and main (MPA) pulmonary artery. This includes velocity magnitude, vorticity magnitude, acceleration, and helicity density. Also, histological features were provided such as the lung assessment score (LASS),⁵⁷ to which we will compare our results later. We want to point out that this data set contains very few samples, leading to overfitting and poor test validation; see the Limitations and Discussion section. Hence, all results have to be interpreted carefully and conclusions cannot be drawn solely based on our report.

The feature space embedding is depicted in Figure 4 along with the feature selection (gray area) made by the domain expert who provided the data. For better readability of the plot, we only show the feature names of those features that were part of the generated formulas. The trade-off plot that resulted from the domain expert’s feature selection is depicted in Figure 5. Five formulas were generated for the domain expert’s feature selection. Overall, all baseline classifiers and generated formulas performed similarly well on the training data (horizontal axis). The fact that all models were able to achieve a training accuracy equal to or close to one (horizontal axis) suggests that it is possible to capture the entire data set with most of the models, due to the small number of samples. However, none of them achieved a perfect test data score (vertical axis). Since the random test data split only contains three samples, we observe three discrete values for the test data score. All formulas contain at most three features, indicating that the classification task for this data set is rather trivial, as their accuracy is still decent. This is backed up by the fact that LDA is the best-performing classifier, that is, the cohorts are linearly separable. Also, we want to point out that LDA performed worse when trained on all features than when trained on the selection made by the domain expert.

From the generated formulas, three contain right-ventricular ejection fraction (RVEF), which is considered a benchmark in literature when comparing to hemodynamic features.⁵⁶ The same scores as for RVEF were achieved by the mean acceleration in the right pulmonary artery lumen (RPA_ACCELERATION_mean). However, the domain expert gave the feedback that if this feature appears alone in a formula, without the counterpart from the left pulmonary artery lumen (LPA_ACCELERATION_mean), it should not be considered. For the same reason, we also ignore the longest formula, that is, LPA_ACCELERATION_min + RPV_V_max + RVEF. Additionally, this formula is longer while only increasing the train data score marginally. Hence, we focus on the product of the ratio of end-diastolic volume (EDV) and end-systolic volume (ESV) ( $\frac{EDV}{ESV}$ or EDV_ESV_ratio), and the maximum velocity in the bronchial branches (LBB_V_max). EDV_ESV_ratio was shown to have good classification properties for pulmonary arterial hypertension (PAH) in literature,⁵⁸ which is related to abnormal blood circulation.

Using juxtaposed diagrams of combined bee swarm and box plots, we compare the distributions of all animals as produced by the benchmark RVEF (see Figure 9(a)) to the mentioned formula (see Figure 9(b)). While the significance levels between the cohorts are identical - indicating that the benchmark could be reproduced - we observe slight differences in cohort distributions. Notably, the severe cohort has a smaller spread compared to the benchmark. Conversely, the control cohort experiences a wider spread.

Figure 9.

The distribution plots combine bee swarm plots, box plots, and the outcome of statistical significance tests to analyze the value distributions of the classes. Additionally, the Mann-Whitney-U test is calculated for all pairs of classes and the resulting statistical significance levels (if any) are indicated in the plot (* and ** indicate p-values below 0.05 and 0.01, respectively). Here, the user selected two formulas for comparison: (a) Distribution plot for formula “RVEF” and (b) Distribution plot for formula “EDV_ESV_ratio · LBB_V_max”.

To analyze the classification result of the two classifiers for each animal individually, we use the sample table (cf. Figure 6). The animal with ID 01-01297 was misclassified by both classifiers and also by most of the other classifiers. Hence, it could be considered an outlier and should be investigated in more detail. While the animal with ID 01-01300 was also misclassified by most of the classifiers, it was correctly predicted by the two selected classifiers.

In addition to the trade-off plot, we also compared the found formulas against established histological indices, that is, Atelectasis Area (AA), Emphysema Area (EA), Small Arteries Media Hypertrophy (SAMH), Peribronchial Arteries Perivascular Cellular Edema (PAPCE), Peribronchial Arteries Media Hypertrophy (PAMH), and Lung Assessment Sum (LASS).⁵⁷ We compute the F1 score between the predicted labels and the ground truth labels, since our formulas were only trained to distinguish different cohorts, while the actual value ranges of the cohort distributions do not matter (cf. Classification Performance Calculation section). The results are depicted in Table 1. The table shows two groups with (1) the five formulas trained on the user selection, and (2) the six histological indices. The two formulas that we analyzed in Figure 9 yield F1 scores comparable to or even better than the histological indices. This observation can be seen as an indicator that the formula generated from the user selection might be a suitable classifier, which will be further investigated by our collaboration partners.

Table 1.

Comparison of F1 scores achieved by formulas generated for the user-selected feature set (1st to 5th row). Histological indices are shown for comparison (6th to 11th row).

Index/Classifier	F1_micro
RVEF	0.8750
RPA_ACC._mean	0.7500
MPA_HELICITY_ABS_mean + RVEF	0.7500
LPA_ACC._min + RPV_V_max + RVEF	0.9375
EDV_ESV_ratio + LBB_V_max	0.8125
AA	0.5000
EA	0.5625
PAMH	0.8750
PAPCE	0.5625
SAMH	0.8125
LASS	0.8125

Domain expert feedback

Throughout the development of our system, we have had multiple meetings with our collaboration partners from the medical imaging domain. We performed a task analysis at the beginning and gathered feedback that we used to iterate our system. This was done by showcasing the current development of the system and letting them analyze their own data. According to the feedback we received, physical plausibility was less important to them than the relationships between the hemodynamic parameters. Furthermore, certain hemodynamic parameters only make sense in combination with other parameters, which motivates why the user should be able to choose certain subsets of parameters for the symbolic regression. Lastly, a biomarker should have a high specificity and sensitivity to be considered a good candidate. Therefore, we chose train and test micro-averaged F1 scores as axes for the trade-off plot.

In the following, we will report our observations made during the user study and the feedback we got from our five domain experts. We observed varying behaviors. Users, when presented with a new data set, initially tended to select multiple features to familiarize themselves with the data set, often by selecting features close to the barycenter of the scatter plot (cf. Figure 7) or by choosing features seemingly at random. Some users, upon finding an effective feature, tended to stick with it, generally focusing on discovering a single, superior formula without much regard for baseline comparisons. The strategies for composing a feature subset varied. Some identified a small feature set and gradually expanded it. Others repeatedly ran the algorithm to try different feature sets by selecting both central and peripheral features based on past experiences. Typically, this involved no more than three selections. One reason for this is that the system still needs some time to evaluate baseline classifiers and symbolic regression for the users’ feature selection.

Feedback from the study included the presumption that there is a bias toward selecting features from the middle of the plot and toward larger points. We agree with the users’ feedback and consider this bias a design goal of the embedding. Instinctively, users tend to select good features, while still being able to reconsider their selection. Users also reported on usability aspects of the interactive visual system, such as initially running the formula search via symbolic regression multiple times for a better understanding of variability. For performance reasons, the default setting is to use a single formula search, but we leave the option to perform additional searches to the user. Another requested change was to keep the formulas generated from the previous feature set. In the system presented to the users, the trade-off plot was cleared any time the user made a new feature selection, that is, only the baselines and formulas derived from all features remained as permanent references. The reason is that, otherwise, one would need to keep track of the features used for each baseline classifier and formula and encode them in the scatter plot, for example, by introducing a data provenance graph. Solving this problem is beyond the scope of this work, but shall be investigated in future work. Finally, it was requested to visually highlight features that are part of the currently selected formula.

As one of our domain experts also provided us with a data set, we asked whether the system would be a valuable addition to their work. They believe that the tool would save them a lot of time and want to use it for their work from now on.

Limitations and discussion

Test data validation

In our work, we focus on finding formulas that serve as classifiers. An important fact is that each formula has to be seen as a separate classifier. This makes it difficult to validate a formula in the general sense of machine learning, which typically requires techniques such as k-fold cross-validation to determine how well the model generalizes. However, a different formula will most likely be generated for each split, rendering it impossible to generalize a single formula. In fact, even running the formula search on the very same split multiple times will, if multiple features are selected, lead to different formulas. As a compromise, we added pseudo-test validation to our system by employing a single random split that can be changed by the user. If the sample size is large enough, the generalization score may be considered sufficiently trustworthy. Although we consider research on test data validation of formulas beyond the scope of this work, we want to raise awareness of this problem and possible solutions. One approach could be to apply cross-validation, running the regression multiple times per split, and then analyze the mathematical expressions of the resulting formulas for similarity. Defining the similarity of formulas is a highly user- and task-specific problem. For example, one could check for common features or even terms.
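The idea of checking formulas for common features can be sketched as follows. This is only one possible similarity measure (Jaccard similarity of the feature sets), not the method of this work. The formula strings are hypothetical and assume a prefix notation with feature tokens named X0, X1, and so on; all function names are illustrative.

```python
import re
from itertools import combinations


def features_of(formula):
    """Return the set of feature tokens (assumed to look like X0, X12) in a formula string."""
    return frozenset(re.findall(r"\bX\d+\b", formula))


def jaccard(a, b):
    """Jaccard similarity of two feature sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def mean_pairwise_similarity(formulas):
    """Average Jaccard similarity over all pairs of formulas (needs at least two)."""
    sets = [features_of(f) for f in formulas]
    pairs = list(combinations(sets, 2))
    return sum(jaccard(a, b) for a, b in pairs) / len(pairs)


# Hypothetical formulas, e.g. one per cross-validation split.
formulas = [
    "div(add(X0, X3), X1)",
    "mul(X0, sub(X3, X7))",
    "add(X0, X3)",
]

# Features that occur in every formula.
common = frozenset.intersection(*(features_of(f) for f in formulas))
print(sorted(common))                      # ['X0', 'X3']
print(mean_pairwise_similarity(formulas))  # 11/18, roughly 0.611
```

Features that appear in every formula, here X0 and X3, would be candidates for a relationship that is stable across splits. Comparing common terms rather than common features would require parsing the expressions into trees.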

Scalability

Concerning the scalability with the number of features, we already showed that decreasing the number of features, on average, decreases the time needed for the symbolic regression to find a good solution (R7), see Figure 2. We tested the embedding on the Pulmonary Hypertension set with 111 features, which would not be feasible with the other presented methods.⁷,³⁶,⁴¹ Also, the visual representation of the 2D embedding in terms of a scatter plot scales quite well with the number of features, and the other visual encodings do not depend on the number of features. How well the 2D embedding represents feature space is another question that is well known to data analysts working with multi-dimensional data projections. If, for example, a data set contains hundreds of features that are irrelevant to the classification task, projecting the features into two dimensions would no longer preserve most of the variance in the data. This can be explained by the extreme case, where all features are equally unlikely to combine well. Then all pairs of features would need to have the same distance from each other, which is only possible in a space of dimensionality equal to the number of features minus one. However, from our experience with the presented data sets, we assume that this is a rather constructed, theoretical case that rarely happens with real data.
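The extreme case can be made concrete with a short worked derivation. It assumes that the 2D embedding is computed with classical (metric) multidimensional scaling on a feature-distance matrix, which this section does not state explicitly. The symbols n (number of features), d (the common pairwise distance), J (centering matrix), and B (double-centered Gram matrix) are introduced here for illustration only.

```latex
% Assumption: classical MDS on n features with all pairwise distances equal to d > 0.
\begin{align*}
D^{(2)} &= d^2\,(\mathbf{1}\mathbf{1}^\top - I), \qquad
J = I - \tfrac{1}{n}\,\mathbf{1}\mathbf{1}^\top, \\
B &= -\tfrac{1}{2}\, J D^{(2)} J
   = -\tfrac{d^2}{2}\,\bigl(J\mathbf{1}\mathbf{1}^\top J - J^2\bigr)
   = \tfrac{d^2}{2}\, J
   \qquad \text{since } J\mathbf{1} = \mathbf{0},\; J^2 = J, \\
\operatorname{spec}(B) &= \Bigl\{\underbrace{\tfrac{d^2}{2},\dots,\tfrac{d^2}{2}}_{n-1},\; 0\Bigr\}, \\
\frac{\lambda_1 + \lambda_2}{\sum_i \lambda_i} &= \frac{2}{n-1}
   \qquad \Longrightarrow \qquad n = 111:\ \tfrac{2}{110} \approx 1.8\,\%.
\end{align*}
```

Under these assumptions, B has n - 1 equal positive eigenvalues, so the points span n - 1 dimensions and no two axes are preferable to any others. A 2D projection of 111 mutually equidistant features would retain under 2% of the total variance.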

Concerning the scalability with the number of samples in a data set, the time needed to evaluate the baseline classifiers and the symbolic regression scales linearly with the number of samples. For the red wine quality data set, for example, this leads to a longer waiting time when the user selects a new feature subset. As highly optimized code is not the focus of this work, the time could likely already be reduced by better utilizing parallel hardware. Another option would be to use only a subset of the data, for example, by applying Monte-Carlo sampling. The visual encodings are independent of the number of samples except for the sample table, but the sorting option allows for focusing on the data samples that require further investigation.
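The subsampling option could look like the following minimal sketch, which draws a random subset of sample indices before the classifiers and the symbolic regression are evaluated. Sampling per class (stratification) is an assumption of this sketch to keep class proportions stable and is not prescribed by this work. The function name, the class labels, and the sampling fraction are hypothetical.

```python
import random
from collections import defaultdict


def stratified_subsample(labels, fraction, seed=0):
    """Randomly pick about `fraction` of the sample indices from each class."""
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    chosen = []
    for indices in by_class.values():
        # Keep at least one sample per class so that no class disappears.
        k = max(1, round(fraction * len(indices)))
        chosen.extend(rng.sample(indices, k))  # sampling without replacement
    return sorted(chosen)


# Hypothetical, imbalanced label vector with 100 samples.
labels = ["low"] * 60 + ["mid"] * 30 + ["high"] * 10
subset = stratified_subsample(labels, fraction=0.2)
print(len(subset))  # 20
```

Since the evaluation time scales linearly with the number of samples, a subset of 20% of the samples would be expected to cut the waiting time by roughly the same factor, at the cost of noisier scores.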

Interpretability

In our work, we neglect the features’ units. The symbolic regression will, hence, combine features of all units using addition, subtraction, etc., which may not be physically meaningful. In physics, one technique to mitigate the issue of very complex unit terms is to introduce constants that add the inverse units to the expression, effectively canceling out unit terms. However, which unit to expect and how to counteract is highly data- and task-specific. Since we aimed for a general solution that finds relationships between features for the purpose of classification and that may lay the foundation for an in-depth analysis, we consider a sophisticated analysis of this problem out of scope for this work.

Another limitation inherent to formulas is division by zero. In general, it cannot be assumed that no sample of a data set contains zero values. When each feature is normalized to the unit interval, zeros are even guaranteed, because the minimum of each feature is mapped to zero. The behavior in this case is implementation-specific and should be chosen carefully. One option is to remove division from the set of allowed operations. Since we believe that division should not generally be excluded, we rely on the default behavior of the symbolic regression implementation,⁵⁹ which uses a protected division that evaluates a term to 1 in case of a division by zero. Resulting formulas with a single feature as denominator should therefore be viewed critically.
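The effect of a protected division can be illustrated with a small sketch. It is a plain-Python stand-in rather than the code of the symbolic regression library. The threshold `eps` for treating a denominator as zero is an assumption, as are the min-max normalization, the function names, and the example values.

```python
def protected_division(numerator, denominator, eps=1e-3):
    """Return numerator / denominator, or 1.0 if the denominator is (close to) zero."""
    if abs(denominator) > eps:
        return numerator / denominator
    return 1.0


def min_max_normalize(values):
    """Scale values to the unit interval; assumes max(values) > min(values)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


# Hypothetical raw feature values for three samples.
feature = min_max_normalize([2.0, 4.0, 6.0])
print(feature)  # [0.0, 0.5, 1.0]

# A formula with this single feature as denominator, here 1 / feature.
scores = [protected_division(1.0, x) for x in feature]
print(scores)   # [1.0, 2.0, 1.0]
```

The sample with the smallest raw value receives the same score as the sample with the largest one, although an unprotected division would diverge for it. This is why formulas with a single feature as denominator deserve a critical look.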

Conclusion and future work

We presented an approach for the interactive visual composition of classifiers in the form of formulas. The approach comprises a novel feature selection technique based on a projection of the feature space, which places features that are likely to synergize well for the classification task close to each other. Our feature selection technique allows for an interactive refinement of formulas generated through symbolic regression. Together with a plot that shows the micro-averaged F1 score and complexity of each formula in comparison to the scores of state-of-the-art classifiers, our tool supports resolving the user- and problem-specific trade-off between quality and complexity of classification models.

In the future, we plan to extend our approach to multi-field data, where the spatial dimensions shall be taken into account. This could be done by aggregating connected regions of similar value distributions in the spatio-temporal space in an effort to reduce the number of potential features. A heat map of said space could be used to encode the regions found. The spatial position could also be mapped back onto a 3D model of the domain, for example, onto a vessel surface to maintain the anatomical context.⁶⁰ Moreover, visual analysis of formula similarity deserves more research.

Supplemental Material

Supplemental material, sj-pdf-1-ivi-10.1177_14738716241270288 for Interactive visual formula composition of multidimensional data classifiers by Adrian Derstroff, Simon Leistikow, Ali Nahardani, Katja Gruen, Marcus Franz, Verena Hoerr and Lars Linsen in Information Visualization

Footnotes

Acknowledgements

The results shown in this paper do not constitute any medical advice.

Funding

The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation) grants 468824876 and 431460824 (CRC 1450).

ORCID iD

Adrian Derstroff

Supplemental material

Supplemental material for this article is available online.

References

1. Fisher. The use of multiple measurements in taxonomic problems. Ann Eugen 1936; 7(2): 179–188.

2. Boser, Guyon, Vapnik. A training algorithm for optimal margin classifiers. In: Proceedings of the fifth annual workshop on Computational learning theory. pp. 144–152.

3. Quinlan. Induction of decision trees. Mach Learn 1986; 1: 81–106.

4. Breiman. Random forests. Mach Learn 2001; 45: 5–32.

5. Langley, Iba, Thompson, et al. An analysis of Bayesian classifiers. In: AAAI. Citeseer, 1992, pp.223–228. Vol. 90.

6. Sucar LE. Bayesian classifiers. In: Singh, Kang (eds) Probabilistic graphical models: Principles and applications. Cham: Springer International Publishing, 2021, pp.43–69. https://doi.org/10.1007/978-3-030-61943-5_4

7. Chatzimparmpas, Martins, Kucher, et al. FeatureEnVi: visual analytics for feature engineering using stepwise selection and semi-automatic extraction approaches. IEEE Trans Vis Comput Graph 2022; 28(4): 1773–1791.

8. Poli, Koza. Genetic programming. In: Burke, Kendall (eds) Search methodologies. Boston, MA: Springer, 2014, pp.143–185.

9. Wickelmaier. An introduction to MDS. Sound Qual Res Unit Aalb Univ Den 2003; 46(5): 1–26.

10. Teichholz, Kreulen, Herman, et al. Problems in echocardiographic volume determinations: echocardiographic-angiographic correlations in the presence or absence of asynergy. Am J Cardiol 1976; 37(1): 7–11.

11. Ganaie, Malik, et al. Ensemble deep learning: a review. Eng Appl Artif Intell 2022; 115: 105151.

12. Dietterich. Ensemble methods in machine learning. In: Goos, Hartmanis, van Leeuwen, et al. (eds) International workshop on multiple classifier systems. Berlin, Heidelberg: Springer, 2000, pp.1–15.

13. Haq, Zhang, Peng, et al. Combining multiple feature-ranking techniques and clustering of variables for feature selection. IEEE Access 2019; 7: 151482–151492.

14. Hassine, Erbad, Hamila. Important complexity reduction of random forest in multi-classification problem. In: 2019 15th international wireless communications & mobile computing conference (IWCMC). Tangier, Morocco: IEEE, pp.226–231.

15. Fernandez, Garcia, Luengo, et al. Genetics-based machine learning for rule induction: state of the art, taxonomy, and comparative study. IEEE Trans Evol Comput 2010; 14(6): 913–941.

16. Fan. High dimensional classification using features annealed independence rules. Ann Stat 2008; 36(6): 2605–2637.

17. LeCun, Bengio, Hinton. Deep learning. Nature 2015; 521(7553): 436–444.

18. Guyon, Gunn, Nikravesh, et al. Feature extraction: Foundations and applications. Berlin, Heidelberg, New York: Springer, 2008. Vol. 207.

19. Bolón-Canedo, Sánchez-Maroño, Alonso-Betanzos. A review of feature selection methods on synthetic data. Knowl Inf Syst 2013; 34: 483–519.

20. Chandrashekar, Sahin. A survey on feature selection methods. Comput Electr Eng 2014; 40(1): 16–28.

21. Cheng, Wang, et al. Feature selection: A data perspective. ACM Comput Surv 2018; 50(6): 1–45.

22. Liu. Efficient feature selection via analysis of relevance and redundancy. J Mach Learn Res 2004; 5: 1205–1224.

23. Peng, Long, Ding. Feature selection based on mutual information criteria of max-dependency, max-relevance, and min-redundancy. IEEE Trans Pattern Anal Mach Intell 2005; 27(8): 1226–1238.

24. Urbanowicz, Meeker, La Cava, et al. Relief-based feature selection: Introduction and review. J Biomed Inform 2018; 85: 189–203.

25. Kittler. Feature set search algorithms. In: Chen CH (ed.) Pattern recognition and signal processing. Alphen aan den Rijn, Netherlands: Sijthoff & Noordhoff, 1978, pp.41–60.

26. Skalak DB. Prototype and feature selection by sampling and random mutation hill climbing algorithms. In: Cohen, Hirsh (eds) Machine learning proceedings 1994. San Francisco (CA): Elsevier, 1994, pp.293–301.

27. Holland JH. Adaptation in natural and artificial systems. Ann Arbor, Michigan: University of Michigan Press, 1975.

28. Guyon, Weston, Barnhill, et al. Gene selection for cancer classification using support vector machines. Mach Learn 2002; 46: 389–422.

29. Rodrigues, Batista, La Cava, et al. SLUG: Feature selection using genetic algorithms and genetic programming. In: European conference on genetic programming (Part of EvoStar). Springer, pp.68–84.

30. Chiesa, Maioli, Colombo, et al. GARS: genetic algorithm for the identification of a robust subset of features in high-dimensional datasets. BMC Bioinformatics 2020; 21(1): 54–11.

31. Zhou, Zhang, Kang, et al. A problem-specific non-dominated sorting genetic algorithm for supervised feature selection. Inf Sci 2021; 547: 841–859.

32. Guyon, Elisseeff. An introduction to variable and feature selection. J Mach Learn Res 2003; 3(Mar): 1157–1182.

33. Sanchez, Soguero-Ruiz, Mora-Jiménez, et al. Scaled radial axes for interactive visual feature selection: a case study for analyzing chronic conditions. Expert Syst Appl 2018; 100: 182–196.

34. Artur, Minghim. A novel visual approach for enhanced attribute analysis and selection. Comput Graph 2019; 84: 160–172.

35. Yang, Peng, Ward, et al. Interactive hierarchical dimension ordering, spacing and filtering for exploration of high dimensional datasets. In: IEEE Symposium on Information Visualization 2003 (IEEE Cat. No. 03TH8714). IEEE, pp.105–112.

36. Wang, Nie, et al. Linear discriminative star coordinates for exploring class and cluster separation of high dimensional data. In: Heer, Ropinski, van Wijk (eds) Computer graphics forum. Barcelona, Spain, 2017, pp.401–410. Vol. 36.

37. Yang, Patro, Huang, et al. Value and relation display for interactive exploration of high dimensional datasets. In: IEEE symposium on information visualization. Austin, Texas: IEEE, pp.73–80.

38. Turkay, Filzmoser, Hauser. Brushing dimensions - a dual visual analysis model for high-dimensional data. IEEE Trans Vis Comput Graph 2011; 17(12): 2591–2599.

39. May, Davey, Ruppert. SmartStripes - looking under the hood of feature subset selection methods. In: Miksch, Santucci (eds) EuroVA@ EuroVis. Bergen, Norway: The Eurographics Association, pp.13–16.

40. Bernard, Steiger, Widmer, et al. Visual-interactive exploration of interesting multivariate relations in mixed research data sets. Comput Graph Forum 2014; 33: 291–300.

41. Dingen, Veer, Houthuizen, et al. RegressionExplorer: interactive exploration of logistic regression models with subgroup analysis. IEEE Trans Vis Comput Graph 2018; 25(1): 246–255.

42. Wentzel, Floricel, Canahuate, et al. DASS Good: Explainable data mining of spatial cohort data. In: Archambault, Bujack, Schreck (eds) Computer graphics forum. Leipzig, Germany: Wiley Online Library, pp.283–295. Vol. 42.

43. Jain, Saha. Rank-based univariate feature selection methods on machine learning classifiers for code smell detection. Evol Intell 2022; 15(1): 609–638.

44. Louppe, Wehenkel, Sutera, et al. Understanding variable importances in forests of randomized trees. Adv Neural Inform Process Syst 2013; 26: 431–439.

45. Radivojac, Obradovic, Dunker, et al. Feature selection filters based on the permutation test. In: Machine Learning: ECML 2004: 15th European Conference on Machine Learning, Pisa, Italy, September 20-24, 2004. Proceedings 15. Springer, pp.334–346.

46. Janecek, Gansterer, Demel, et al. On the relationship between feature selection and classification accuracy. In: New challenges for feature selection in data mining and knowledge discovery. PMLR, pp.90–105.

47. Jeon. Hybrid-recursive feature elimination for efficient feature selection. Appl Sci 2020; 10(9): 3211.

48. Nguyen, Lau, et al. Biomedical data analytics and visualisation - a methodological framework. In: Catchpoole, Simoff, Kennedy, et al. (eds) Data driven science for clinically actionable knowledge in diseases. Abingdon, England: Chapman and Hall/CRC, 2023. pp.174–196.

49. Gorodkin. Comparing two k-category assignments by a k-category correlation coefficient. Comput Biol Chem 2004; 28(5-6): 367–374.

50. Guyon. Design of experiments of the NIPS 2003 variable selection benchmark. In: NIPS 2003 workshop on feature extraction and feature selection. p. 40. Vol. 253.

51. Pedregosa, Varoquaux, Gramfort, et al. Scikit-learn: machine learning in Python. J Mach Learn Res 2011; 12: 2825–2830.

52. Espadoto, Martins, Kerren, et al. Toward a quantitative survey of dimension reduction techniques. IEEE Trans Vis Comput Graph 2021; 27(3): 2153–2173.

53. gplearn implementation details. https://gplearn.readthedocs.io/en/stable/intro.html#initialization (Online; accessed 28-April-2024).

54. Wolberg, Mangasarian, Street, et al. Breast Cancer Wisconsin (Diagnostic). UCI Machine Learning Repository, 1995. https://doi.org/10.24432/C5DW2B.

55. Cortez, Cerdeira, Almeida, et al. Modeling wine preferences by data mining from physicochemical properties. Decis Support Syst 2009; 47(4): 547–553.

56. Nahardani, Leistikow, Grün, et al. Pulmonary arteriovenous pressure gradient and time-averaged mean velocity of small pulmonary arteries can serve as sensitive biomarkers in the diagnosis of pulmonary arterial hypertension: a preclinical study by 4D-flow MRI. Diagnostics 2021; 12(1): 58.

57. Franz, Grün, Betge, et al. Lung tissue remodelling in MCT-induced pulmonary hypertension: a proposal for a novel scoring system and changes in extracellular matrix and fibrosis associated gene expression. Oncotarget 2016; 7(49): 81241–81254.

58. Nahardani, Grün, Krämer, et al. Introduction of the right ventricular-arterial coupling index in CMR reflecting lung histological changes in experimental PAH. In: Presented at the ISMRM 30th Annual Meeting, London, England, UK, 30 2022.

59. Stephens. gplearn. https://pypi.org/project/gplearn/.

60. Derstroff, Leistikow, Nahardani, et al. Explorative visual analysis of spatio-temporal regions to detect hemodynamic biomarker candidates. Poster presented at EuroVIS, Rome, 2022. Published by The Eurographics Association.
