Abstract
Feature selection plays an important role in algorithms for processing high-dimensional data. Traditional pattern classification and information-theoretic methods are widely applied to feature selection. However, traditional pattern classification methods such as Fisher Score, Laplacian Score, and relief use class labels inadequately, and earlier information-theoretic feature selection methods such as MIFS ignore the intra-class tightness and inter-class sparseness of the samples. To address these problems, a feature selection algorithm for the binary classification problem is proposed, based on class label transformation, a self-organizing map (SOM) neural network, and agglomerative hierarchical clustering. The algorithm first converts class labels, which have no numerical meaning, into numerical values that can participate in computation while retaining the classification information, and combines each such value with the attribute value under consideration to form a two-dimensional vector. These two-dimensional vectors are then clustered using the SOM neural network and hierarchical clustering. Finally, an evaluation function value is calculated that reflects intra-cluster tightness, inter-cluster separation, and division accuracy after clustering, and is used to evaluate the ability of candidate attributes to distinguish between classes. Experiments verify that the algorithm is robust, effectively screens attributes with strong classification ability, and improves the prediction performance of the classifier.
Introduction
With the development and application of information technology, more and more high-dimensional data, such as digital images, financial time series, and gene expression microarrays, have been accumulated. Feature selection has become an indispensable pre-processing step in algorithms for processing high-dimensional data. Feature selection refers to the process of selecting, from all attributes, the subset of features most beneficial for subsequent operations, reducing the dimensionality of the feature space while keeping the decision-making capability of the information system unchanged. As an important part of knowledge discovery, feature selection can speed up learning, make learning models more compact, increase their generalization ability,1 and allow data to be used with minimal complexity, improving awareness of the information implicit in massive data.
Filter feature selection has been a hot research topic.2–6 It uses the intrinsic characteristics of the training data to judge the merits of candidate attributes and is independent of the learning algorithm.3 According to the evaluation measure adopted, filter algorithms can be roughly classified into information, distance, dependency, consistency, similarity, and statistics measures.4,7 Existing filter methods can be broadly categorized as either univariate or multivariate.4,5,8 Univariate methods (i.e., feature weighting/ranking) test each attribute and score it according to its relevance to the class labels.4,5 FS,9 LS,10 SPEC,11 and relief12 are typical univariate assessments using different evaluation measures.4 There is a consensus in univariate assessment about attributes that facilitate classification: attribute values are similar (or close) for samples of the same class; attribute values differ greatly (or are far apart) for samples of different classes; and the class labels of samples contain important information that helps in attribute screening.13,14 The FS, LS, SPEC, and relief algorithms score single attributes on the intra-class tightness and inter-class sparseness criterion, during which relief uses sample class labels implicitly in a nearest-sample manner,12 and FS, LS, and SPEC use them in a graph-based manner.11,15 (In fact, some recent algorithms16–19 have also used graphs to organize the attributes.) Multivariate methods focus directly on combinations of attributes that can represent all attributes, and a group of information-theoretic algorithms20–25 represented by MIFS20 is the main representative of the multivariate approach.5
The MIFS algorithm exploits the fact that mutual information can describe nonlinear correlation among attributes and is invariant under spatial transformations,23 and it uses the mutual information between attributes and class labels and between pairs of attributes as the basis for deciding whether an attribute enters the feature subset; however, it ignores the intra-class tightness and inter-class sparseness of the samples to some extent.
To address the above problems, this paper proposes a feature selection algorithm for binary classification based on class labels, a self-organizing map (SOM) neural network, and hierarchical clustering (FS-CSH), which takes intra-class tightness and inter-class sparseness into account and uses class labels explicitly. The main ideas of the algorithm are: (1) mapping the class labels, converting class labels without numerical meaning into values that can participate in computation while retaining classification information, and forming a two-dimensional vector from each such value and the candidate attribute value; (2) clustering these two-dimensional vectors using the SOM neural network and hierarchical clustering; (3) calculating, as the attribute score, an evaluation function value closely related to intra-cluster tightness, inter-cluster separation, and division accuracy after clustering. The algorithm measures the relationship between every attribute and the classes and evaluates the ability of each attribute to distinguish between them. Experiments verify that the algorithm is robust, can effectively rank attributes according to their classification ability, and improves the prediction performance of the classifier.
Problem formulation
Without loss of general description, let
For binary classification, the general naive view of whether an attribute

(a) Strong two-region distribution of
FS selects attributes with close intra-class values and scattered inter-class values. It scores the mth attribute
Relief scores the mth attribute
There is no trace of sample class labels in the above expressions. In fact, the implicit mapping between the distribution of
In addition, when the implicit single mapping between the distribution of
MIFS makes use of class labels in an explicit way, focusing on mutual information to evaluate the relevance of attributes to classes and of attributes to one another, but it under-expresses the "intra-class tight, inter-class sparse" property of the samples. In summary, this paper investigates a way to explicitly fuse class labels with attributes, organizes the fused data by clustering so as to discriminate the class-differentiation ability of attributes from inter-cluster and intra-cluster distances, and proposes a feature selection algorithm based on class labels, SOM, and hierarchical clustering.
Feature selection for binary classification based on class labeling and clustering
The proposed algorithm, shown in Table 1, evaluates all the attributes one by one. To make explicit use of the guiding role of the class labels, the algorithm first fuses the sample values with their class labels to generate new two-dimensional vector samples. Then, considering that attributes with significant classification performance have similar (or close) values for samples of the same class and widely differing (or far-apart) values for samples of different classes, the algorithm uses SOM neural network clustering and hierarchical clustering to automatically organize the division of the two-dimensional vector samples. Finally, the evaluation function of attribute
Feature selection algorithm for binary classification based on class labeling, SOM, and hierarchical clustering.
Mapping class labels
To give the class labels a guiding role in the feature selection process, the attribute values of the samples are combined with the class labels. For the attribute
Using equation (5) to normalize all the sample values of
Equation (6) gives the normalized sample mean
Equation (7) maps the class label
Through the above steps, the two-dimensional vector
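Since equations (5)–(7) are not reproduced above, the following sketch only illustrates the general shape of this step under stated assumptions: the attribute is min-max normalized, the two symbolic labels are mapped to two placeholder numeric values, and each sample becomes a two-dimensional vector. The function name `build_2d_vectors` and the default label values are hypothetical, not the paper's exact mapping.

```python
import numpy as np

def build_2d_vectors(x, labels, label_values=(0.0, 1.0)):
    """Fuse one attribute with class labels into 2-D vectors.

    x            : 1-D array of raw values of a single attribute.
    labels       : 1-D array of binary class labels (any two symbols).
    label_values : numeric values the two labels are mapped to -- a
                   placeholder for the paper's mapping in equation (7).
    """
    x = np.asarray(x, dtype=float)
    # Min-max normalization of the attribute (stand-in for equation (5)).
    span = x.max() - x.min()
    x_norm = (x - x.min()) / span if span > 0 else np.zeros_like(x)
    # Map each symbolic label to a numeric value retaining class information.
    classes = sorted(set(labels))
    y = np.array([label_values[classes.index(c)] for c in labels])
    # Each sample becomes the 2-D vector (normalized value, mapped label).
    return np.column_stack([x_norm, y])
```

The resulting array has one row per sample, so each candidate attribute yields its own set of two-dimensional vectors to be clustered.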
Training the SOM network
The SOM neural network differs from other neural networks and from other clustering algorithms. It can reproduce the topological relationships among the input patterns in the topological relationships of the mapped output-layer neurons, and the spatial distribution of the trained connection weight vectors reflects the statistical properties of the input patterns. 26 The usual SOM adopts a "planar" topology, which maps high-dimensional input patterns onto output-layer neurons distributed on a two-dimensional plane through competition, cooperation, and adaptation steps, and after clustering the aggregation characteristics of the input patterns are reflected in the distribution of these neurons' weight vectors. 27 In this case the clusters are distributed on a two-dimensional plane, which complicates the subsequent hierarchical clustering. In fact, as noted in the literature, a one-dimensional SOM is significantly superior to a two-dimensional SOM in many respects, such as maintaining and achieving linear separability of classes, expressing the similarity between data, the clarity of inter-class location relationships, and the ease of visualizing class boundaries. 28 In addition, clustering unknown datasets with methods based on a one-dimensional SOM overcomes the problem of not knowing the dataset's structure and therefore not being able to choose the correct clustering method, avoids incorrect clustering results caused by an unsuitable method, and helps ensure that the clustering and structural characteristics of an unknown dataset are properly discovered. 28 The one-dimensional SOM can serve as the basis for clustering any type of dataset. Therefore, this paper chooses a one-dimensional SOM whose output-layer neurons are organized in a ring topology, as shown in Figure 2.
The network has l source nodes in the input layer and n neurons in the output layer, with full connectivity between neurons and source nodes. The output-layer neurons are arranged along a "closed curve," in which any neuron is directly connected to the neurons on either side of it, and during the "weight coefficient update" phase of training the inhibition of the winning neuron on its neighbors extends along both sides of the curve. The Gaussian topological field of the winning neuron is shown in Figure 3. After SOM training, the output-layer neurons are distributed along the closed curve.

One dimensional ring SOM structure.

Gaussian topology field of winning neuron.
Training SOM network.
The winning neuron during the competition for SOM training is noted as
Where
To accelerate the rate of Gaussian topological field reduction, the constant
To ensure the convergence accuracy of SOM training, the constant
At the end of network training, the output-layer neurons are mapped to different locations in the ring topology, as determined by the two-dimensional vector
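The competition, cooperation, and adaptation steps described above can be sketched as a minimal one-dimensional ring SOM. This is an illustrative implementation under assumed hyperparameters (`n_neurons`, `eta0`, `sigma0`, iteration count, and the exponential decay schedules are all placeholders, not the paper's settings):

```python
import numpy as np

def train_ring_som(data, n_neurons=8, n_iter=500, eta0=0.5, sigma0=2.0, seed=0):
    """Minimal 1-D ring SOM; hyperparameters are illustrative guesses."""
    rng = np.random.default_rng(seed)
    w = rng.random((n_neurons, data.shape[1]))          # weight vectors
    for t in range(n_iter):
        x = data[rng.integers(len(data))]               # random input sample
        bmu = np.argmin(np.linalg.norm(w - x, axis=1))  # competition step
        # Ring distance between each neuron and the winner (wraps around).
        idx = np.arange(n_neurons)
        d = np.minimum(np.abs(idx - bmu), n_neurons - np.abs(idx - bmu))
        # Exponentially shrinking Gaussian neighborhood and learning rate.
        sigma = sigma0 * np.exp(-t / n_iter)
        eta = eta0 * np.exp(-t / n_iter)
        h = np.exp(-(d ** 2) / (2 * sigma ** 2))        # cooperation step
        w += eta * h[:, None] * (x - w)                 # adaptation step
    return w
```

The ring distance `d` is what distinguishes this topology from a planar SOM: the neighborhood of the winning neuron extends along both sides of the closed curve, as described above.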
SOM primary clustering
To achieve binary clustering of the two-dimensional vectors
Primary clustering based on SOM.
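The primary clustering step assigns each two-dimensional vector to its nearest output-layer neuron, producing one basic cluster per neuron. A minimal sketch, assuming Euclidean distance to the trained weight vectors (the function name is hypothetical):

```python
import numpy as np

def primary_clusters(data, weights):
    """Assign each 2-D vector to its nearest output neuron, forming one
    basic cluster per neuron (empty clusters are kept as empty lists)."""
    # Distance of every sample to every neuron weight vector.
    bmus = np.argmin(
        np.linalg.norm(data[:, None, :] - weights[None, :, :], axis=2), axis=1)
    clusters = [np.where(bmus == k)[0].tolist() for k in range(len(weights))]
    return bmus, clusters
```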
Agglomerative hierarchical clustering of basic clusters
On the basis of the primary clustering, agglomerative hierarchical clustering is performed on the basic clusters until only two clusters remain. At this point all two-dimensional vectors
Agglomerative hierarchical clustering of basic clusters.
In the hierarchical clustering, the Euclidean distance between each output-layer neuron and its right-hand neighbor is calculated in turn, following the "circular" topology of the SOM output layer, and stored in the distance set
Where
Collate the set
Where
Where
Where
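Because only adjacent clusters on the ring are merged, repeatedly merging the closest adjacent pair until two clusters remain is equivalent to cutting the ring at its two largest neighbor gaps. The sketch below exploits this equivalence (the function name and this shortcut formulation are the editor's, not the paper's):

```python
import numpy as np

def two_clusters_on_ring(weights):
    """Merge adjacent neurons on the ring until two clusters remain.

    Merging the closest adjacent pairs until two clusters are left is
    equivalent to cutting the ring at its two largest neighbor gaps,
    which is what this sketch does.
    """
    n = len(weights)
    # Distance from each neuron to its right neighbor on the ring.
    gaps = np.linalg.norm(weights - np.roll(weights, -1, axis=0), axis=1)
    # Indices of the two largest gaps become the cut points.
    cut1, cut2 = sorted(np.argsort(gaps)[-2:])
    cluster_a = list(range(cut1 + 1, cut2 + 1))
    cluster_b = [i for i in range(n) if i not in cluster_a]
    return cluster_a, cluster_b
```

After this step every neuron, and hence every two-dimensional vector assigned to it in the primary clustering, belongs to one of two clusters.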
Time complexity analysis of the algorithm
In terms of time complexity, first, the number of attributes M affects the time overhead of the algorithm, since the evaluation function is calculated for all attributes one by one. Second, when any one attribute is examined, the training of the SOM network takes up most of the time compared with the primary and hierarchical clustering, and the overhead of that process is mainly determined by the number of iterations
Experimental analysis and discussion
Software resources used for the experiments include Python 3.8.11 (https://www.python.org/) and the Spyder IDE 5.1.5 (https://www.spyder-ide.org/), which provide the scripting-language environment and the integrated development environment, respectively. The hardware platform mainly comprises an Intel Core i7-10750H 2.60 GHz processor and 8 GB of 2933 MHz memory.
To test the algorithm proposed in this paper, artificial data are first used to verify its ability to distinguish attributes and its robustness; then, on real data from different sources, the algorithm is compared with classical mainstream algorithms (such as MIFS,20 FS,9 LS,10 and reliefF12) for feature selection, and the performance of LR (Logistic Regression), K-NN (K-Nearest Neighbor), DT (Decision Tree), and SVM (Support Vector Machine) classifiers after feature selection is compared to evaluate the practicality of the FS-CSH algorithm. For LR, the regularization is "l2," the loss-function optimizer is "liblinear," and the residual convergence tolerance is specified as
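The LR setup described above (l2 regularization, liblinear solver) corresponds to a standard scikit-learn configuration. The sketch below is illustrative only: the tolerance value and the synthetic data are placeholders, since the paper's exact tolerance setting is not reproduced here.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Logistic regression configured as described in the text: l2 penalty and
# the liblinear solver; tol=1e-4 is a placeholder, not the paper's value.
clf = LogisticRegression(penalty="l2", solver="liblinear", tol=1e-4)

# Synthetic stand-in data with eight attributes, as in the Pima experiment.
X, y = make_classification(n_samples=200, n_features=8, random_state=0)
scores = cross_val_score(clf, X, y, cv=10)   # 10-fold cross-validation
print(round(scores.mean(), 3))
```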
The relevant parameters of the SOM network in the experiments are set as follows: number of input units
Artificial data sets
Two-dimensional uniformly distributed data
Consider the classification problem in the two-dimensional feature space of Figure 4.20,29 The attribute vector of the sample

Two-dimensional sample distribution.

(a) Fisher linear discriminant vector values, (b) mutual information values, and (c) evaluation function values J_CSH.
Two-dimensional Gaussian mixed probability density data
The robustness of the algorithm was verified by choosing different Gaussian probability densities for the binary classification of two-dimensional Gaussian mixture data.20 Consider a two-dimensional sample space of two classes, with samples described by the attribute vector
The Gaussian mixture probability density of class 1 and class 2 is simply expressed as:
To simulate a real classification task, a series of value-added tests of

(a)
The attribute evaluation function values
Comparison of J_CSH, MI and FLD.

(a)
Real data sets
In the experiments, five binary classification datasets widely used for validating feature selection algorithms and classifier performance are used; they come from different domains, have progressively more attributes, and are available from the UCI repository (http://archive.ics.uci.edu/ml/index.php). These datasets are: Pima Indian Diabetes, Wisconsin Breast Cancer Database,30 MUSK "Clean1" Database,31 LSVT Voice Rehabilitation Dataset,32 and Olivetti Faces Database, as detailed in Table 6.
Information about the data set.
Feature selection algorithms of the same type, MIFS, FS, LS, and reliefF, were selected for comparison. For MIFS, FS, and reliefF, higher-scoring attributes are more important, and their scores are min-max normalized; for the algorithm in this paper and for LS, lower-scoring attributes are more important, and their scores are min-max normalized after taking the inverse. To evaluate the classification accuracy of the classifiers after feature selection, 80% of the data are used for training and 20% for testing; the training data undergo 20 runs of 10-fold cross-validation, and the test data undergo 20 classification experiments, to obtain classification accuracies.
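The score-normalization scheme just described can be sketched as follows. This is an assumption-laden sketch: the paper says scores are "min-max normalized after taking the inverse" for lower-is-better methods, and negation is used here as one simple reading of "taking the inverse."

```python
import numpy as np

def normalize_scores(scores, lower_is_better=False):
    """Min-max normalize attribute scores to [0, 1].

    For methods where lower scores mean more important attributes
    (LS and FS-CSH above), the scores are inverted first so that,
    after normalization, higher always means more important.
    Negation is one simple reading of "taking the inverse."
    """
    s = np.asarray(scores, dtype=float)
    if lower_is_better:
        s = -s
    span = s.max() - s.min()
    return (s - s.min()) / span if span > 0 else np.zeros_like(s)
```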
Pima Indian Diabetes dataset
Pima Indian Diabetes was used to study the classification problem between the class label "diabetic or not" and eight attributes. It contains 768 records of Pima Indian female patients over 21 years of age, of whom 268 have diabetes and 500 do not. The proposed algorithm was used to calculate the J_CSH values of the eight attributes, shown in Figure 8. The J_CSH values of these attributes, in ascending order, were: 2 Glucose, 6 BMI, 3 Blood Pressure, 7 Diabetes Pedigree Function, 8 Age, 1 Pregnancies, 4 Skin Thickness, and 5 Insulin. The top-ranked attributes by J_CSH value were selected as the result of feature selection. The two attributes with the lowest J_CSH values were 2 Glucose and 6 BMI, which is consistent with the conclusion of Chen et al.33 and in line with the opinion of Gong et al.34 that the most dominant attributes in Pima Indian Diabetes are Glucose and BMI.

J-CSH value of features for Pima.
The scores of MIFS, FS, LS, reliefF, and the algorithm in this paper for all attributes are shown in Figure 9. MIFS and FS can select the attributes Glucose and BMI, which have a major impact on the classification, while reliefF and LS deviate. The algorithm in this paper, FS-CSH, not only selects the important attributes that are effective for classification, but also distinguishes the attributes' class-differentiation ability more clearly.

(a) J-CSH value of features for Pima, (b) MI value of features for Pima, (c) FS value of features for Pima, (d) LS value of features for Pima, and (e) reliefF value of features for Pima.
Finally, the impact of the above algorithms on classification accuracy was evaluated using the LR and SVM classifiers. The average accuracy over 20 runs of 10-fold cross-validation on the training data is shown in Figure 10; the average accuracy over 20 classifications of the test data is shown in Figure 11. The attributes selected by the proposed FS-CSH algorithm effectively improve the classification accuracy, which is generally higher than that of the other feature selection algorithms for the various numbers of selected attributes.

(a) LR_80%Train_10fold and (b) SVM_80%Train_10fold.

(a) LR_20%Test and (b) SVM_20%Test.
Wisconsin Breast Cancer dataset
The Wisconsin Breast Cancer dataset (WBCD) contains pathological information on women with breast cancer in Wisconsin, USA. It holds data from 569 individuals, 212 of whom have malignant breast tumors and 357 benign ones, with 30 attributes per record. The evaluation function value J_CSH of the 30 attributes was calculated using the algorithm in this paper, as shown in Figure 12. The top 20% of J_CSH values in ascending order were selected, so the 21st, 5th, 23rd, 1st, 3rd, and 28th attributes constitute the new feature subset. The classification accuracy of LR and SVM is verified by comparison with the feature subsets selected by the MIFS, FS, LS, and reliefF algorithms.

J-CSH value of features for WBCD.
Figure 13(a) shows the accuracy of 10-fold cross-validation on the LR 80% training data; the proposed algorithm outperforms the MIFS, reliefF, and FS algorithms overall and is very close to the LS algorithm. Figure 14(a) shows the accuracy on the LR 20% test data; only when the number of attributes is 2 does the FS-CSH algorithm yield 92.1% accuracy, slightly lower than the MIFS, FS, and LS algorithms but still higher than reliefF. For 1, 3, 4, 5, and 6 attributes, the accuracy of FS-CSH is either second or tied for first with the other algorithms.

(a) LR_80%Train_10fold and (b) SVM_80%Train_10fold.

(a) LR_20%Test and (b) SVM_20%Test.
Figure 13(b) shows the classification accuracy of 10-fold cross-validation on the SVM 80% training data. When the number of attributes is 1, the accuracy of the proposed algorithm equals that of FS, is higher than reliefF, and is slightly lower than MIFS and LS; with 2 attributes, FS-CSH matches MIFS; with 3 and 4 attributes, FS-CSH matches FS and is lower only than LS; with 5 attributes, FS-CSH is the highest; with 6 attributes, FS-CSH matches LS and is lower only than reliefF. Figure 14(b) shows the accuracy on the SVM 20% test data. When the number of attributes is 1, FS-CSH matches LS, is higher than FS and reliefF, and is slightly lower than MIFS; with 2 attributes, FS-CSH is slightly lower than MIFS, FS, and LS but still stronger than reliefF; with 3 and 4 attributes, FS-CSH matches FS and reliefF and is the highest; with 5 and 6 attributes, the accuracy of FS-CSH still reaches 91.2%. The above analysis shows that the proposed algorithm is effective on the Wisconsin Breast Cancer dataset and can filter out a subset of features that benefits classification.
MUSK “Clean1” dataset
The Clean1 part of the MUSK dataset contains 476 molecular samples described by 166 attributes; 207 samples are labeled "musk" and 269 "non-musk." The algorithm of this paper is used to calculate the evaluation function values J_CSH of the 166 attributes, as shown in Figure 15. Considering that the MUSK dataset describes the structure of microscopic molecules and is linearly inseparable, the classification accuracy of the above feature subsets is examined on K-NN, DT, and SVM, as shown in Table 7. The accuracy of the proposed algorithm is lower than MIFS and reliefF only in the 10-fold cross-validation of the 80% training data on DT, and higher than the other algorithms in the remaining cases. It can be concluded that the proposed FS-CSH algorithm is effective on the MUSK "Clean1" set and can filter out a subset of features favorable for classification when the number of attributes is 10% of the total.

J-CSH value of features for MUSK.
Average classification accuracy of K-NN, DT, and SVM.
The bold entries in each row indicate the maximum classification average accuracy.
LSVT Voice Rehabilitation dataset
LSVT Voice Rehabilitation was created by Athanasios Tsanas of the University of Oxford, who obtained clinical information on 14 patients from voice signals provided by LSVT Global. The patients were diagnosed with Parkinson's disease and were receiving LSVT-assisted voice rehabilitation. The set characterizes 126 speech signals using 309 algorithms; that is, it has 126 speech-signal samples, each described by 309 attributes and labeled "acceptable" or "unacceptable."
Due to the peculiar data distribution of each attribute in this dataset, feature selection, feature extraction, and classifier performance validation on it are more challenging. In this experiment, a two-step preprocessing was performed on the set: (1) eliminate attributes with small variances. The unbiased estimated variance of
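The variance-based elimination in step (1) can be sketched as follows. The threshold value is a placeholder (the paper's cutoff is not reproduced above), and the function name is the editor's:

```python
import numpy as np

def drop_low_variance(X, threshold=1e-3):
    """Remove attributes whose unbiased sample variance falls below a
    threshold; the threshold value is illustrative, not the paper's."""
    var = np.var(X, axis=0, ddof=1)     # unbiased (n-1) variance estimator
    keep = np.where(var >= threshold)[0]
    return X[:, keep], keep
```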
The evaluation function values J_CSH of the remaining attributes were calculated using the algorithm in this paper, and the J_CSH values of the excluded attributes were uniformly set to 0.2, as shown in Figure 16. The two attributes with the lowest J_CSH values were 153 entropy_log_4_coef and 84 MFCC_0th coef. Comparisons with the MIFS, FS, LS, and reliefF algorithms on LR, K-NN, DT, and SVM are shown in Figures 17 and 18.

J-CSH value of features for LSVT Voice Rehabilitation dataset.

(a) LR_80%Train_10fold, (b) K-NN_80%Train_10fold, (c) DT_80%Train_10fold, and (d) SVM_80%Train_10fold.

(a) LR_20%Test, (b) K-NN_20%Test, (c) DT_20%Test, and (d) SVM_20%Test.
Figure 17 shows the accuracy of 10-fold cross-validation on 80% of the training data for the four classifiers. With 2 attributes, the proposed algorithm has the highest accuracy on LR and DT, is inferior to MIFS, FS, and reliefF on K-NN, and is equal to FS and reliefF and slightly inferior to MIFS on SVM. With more than 2 attributes, the accuracy of the proposed algorithm is comparable to the other algorithms, inferior to MIFS, FS, and reliefF only on K-NN.
Figure 18 shows the accuracy on 20% of the test data for the four classifiers. With 1 attribute, the proposed algorithm is only slightly lower than MIFS on DT, and higher than or equal to the rest of the algorithms. With more than 2 attributes, the proposed algorithm achieves higher accuracy together with MIFS, FS, and reliefF on K-NN and SVM, is better than LS on LR, and is indistinguishable from MIFS, FS, and reliefF on DT. In summary, for the high-dimensional, complex LSVT Voice Rehabilitation dataset, the FS-CSH algorithm still shows trustworthy screening ability, and its selected feature subset effectively improves classifier accuracy.
Olivetti Faces dataset
The Olivetti Faces set contains 400 face images of 40 subjects, 10 per subject. Each image is 64 × 64 pixels, described as a 4096-dimensional vector, and each pixel has 256 gray levels. For this set, determining whether a person is wearing glasses is a typical binary classification problem. Before the experiment, the face images were histogram-equalized, and each record was labeled 1 if the face is wearing glasses and 0 otherwise.
The evaluation function values J_CSH of the 4096 attributes (pixels) are calculated using the algorithm in this paper, as shown in Figure 19. The scores of the MIFS, FS, LS, and reliefF algorithms are then calculated for each attribute (pixel). The top 32 to 256 attributes (pixels) of each algorithm are selected to construct new feature subsets, and the classification accuracy of the selected subsets is checked with SVM.

J-CSH value of features for Olivetti Faces.
In Figure 20(a), the average accuracy of the proposed algorithm is slightly lower than that of MIFS when the number of attributes (pixels) is 32 or 64, on par with FS and reliefF, and higher than LS; in the remaining cases, the average accuracy of the proposed algorithm is comparable to MIFS and higher than the other algorithms. In Figure 20(b), as the number of attributes (pixels) grows from 32 to 256, the average accuracies of FS-CSH, MIFS, and reliefF alternately lead: FS-CSH leads four times, MIFS once, and reliefF three times, and all three have higher average accuracy than FS and LS.

(a) SVM_80%Train_10fold and (b) SVM_20%Test.
To visualize the effect of the proposed algorithm on the selection of face pixels, Figure 21(a) shows the 119 images of faces wearing glasses among the 400 face records, and Figure 21(b) shows the distribution of the selected pixels (marked by black dots) in these images. Comparing Figure 21(a) and (b), the pixels selected by the proposed algorithm are concentrated mainly in the cheek area below the eyes, while avoiding the frame area. The former is influenced by the refractive effect of the eyeglass lenses, which makes that region more useful for deciding whether glasses are worn; the latter is influenced by the lateral shadows of the eyebrows and the bridge of the nose, which makes it ineffective for that decision. It can therefore be concluded that the proposed FS-CSH algorithm is effective on the Olivetti Faces set and can filter out a subset of features favorable for classification, performing slightly better than the MIFS and reliefF algorithms, which also utilize classification information.

(a) Pictures of people wearing glasses and (b) selected pixels on pictures of people wearing glasses.
Conclusion
In this paper, a feature selection algorithm, FS-CSH, is proposed for the binary classification problem. The algorithm explicitly incorporates class labels, applies SOM and hierarchical clustering, scores each attribute, evaluates its relevance to the class, and filters out the feature subset with the greatest decision-classification capability, which improves the classification accuracy on the reduced-dimensional dataset. Simulations on both artificial and real data confirm these results. Compared with feature selection algorithms of the same type (i.e., algorithms that score each attribute individually), the proposed algorithm is a heuristic with a clear physical meaning, simple computation, effectiveness, and robustness.
Acknowledgements
We gratefully thank the Action Editor and anonymous reviewers for their constructive comments.
Declaration of conflicting interests
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by National Key R&D Program of China (2018YFB1703105) and National Natural Science Foundation of China (grant no. 51865027).
