Sage Journals: Discover world-class research

Abstract

Decision trees are frequently used to overcome classification problems in the fields of data mining and machine learning, owing to their many perks, including their clear and simple architecture, excellent quality, and resilience. Various decision tree algorithms are developed using a variety of attribute selection criteria, following the top-down partitioning strategy. However, their effectiveness is influenced by the choice of the splitting method. Therefore, in this work, six decision tree algorithms that are based on six different attribute evaluation metrics are gathered in order to compare their performances. The choice of the decision trees that will be compared is done based on four different categories of the splitting criteria that are criteria based on information theory, criteria based on distance, statistical-based criteria, and other splitting criteria. These approaches include iterative dichotomizer 3 (first category), C $4.5$ (first category), classification and regression trees (second category), Pearson’s correlation coefficient based decision tree (third category), dispersion ratio (third category), and feature weight based decision tree algorithm (last category). On eleven data sets, the six procedures are assessed in terms of classification accuracy, tree depth, leaf nodes, and tree construction time. Furthermore, the Friedman and post hoc Nemenyi tests are used to examine the results that were obtained. The results of these two tests indicate that the iterative dichotomizer 3 and classification and regression trees decision tree methods perform better than the other decision tree methodologies.

Keywords

Attribute evaluation metric impurity measure splitting criteria decision tree classification

Introduction

Classification problems have been a long-standing focus in data mining and machine learning areas. Decision trees are among the most powerful and popular classifiers for handling these problems due to their numerous benefits, such as their simple architecture, high performance, and adaptability.

The decision tree development process can be summarized in three steps. The first phase involves applying an attribute selection method to pick the best attribute that will serve as the separating attribute. In the subsequent stage, the training data set is divided depending on the chosen splitting attribute. The child nodes are produced in the final phase by creating a branch for each category of the partitioning attribute. This process is repeated iteratively for each non-empty child node until a set of stopping conditions (all instances in the node belong to the same class, the number of instances is fewer than a given minimum, $\dots$ ) are satisfied. The complete procedure is known as the “top-down partitioning strategy.” An illustration of the process of constructing a decision tree using the top-down partitioning technique is shown in Figure 1.

Figure 1.

Illustration of the process of constructing a decision tree using the top-down partitioning technique.

The most challenging part of building decision trees, as shown by Figure 1, is establishing an attribute selection method that will be applied to choose the dividing attributes. Therefore, researchers have suggested a variety of decision tree strategies that are based on various node-splitting criteria. These criteria can be categorized into several groups, including information theory-based criteria, distance-based criteria, statistical-based criteria, and other splitting criteria. The first splitting category contains: Information Gain (IG)¹, Gain Ratio (GR)², Normalized Gain (NG)³, Average Gain (AG)⁴, and etc. In fact, the IG metric uses entropy, which is a measurement derived from information theory, as an impurity measure. Whereas the GR criterion, a kind of variation of the IG measure, was developed in order to counteract the IG’s main shortcoming. Moreover, the NG metric is a normalization of the GR indicator. Furthermore, the AG measurement, a variation of the GR criterion that addresses the problem arising from the undetermination of the GR metric, can be utilized as a node split criterion. On the other hand, the four metrics Gini Index (GI)⁵, Twoing⁵, Cluster Separation (ClusterS)⁶, and Margin of Separation (MarginS)⁷ are thought of as distance-based splitting criteria. The GI is another measure of impurity that is based on probability theory. While the Twoing indicator is advised when the domain of the target attribute is fairly broad. For the ClusterS, it is a measure based on the analysis of clusters. Finally, the MarginS metric is some sort of margin or boundary that is used as a separation between different classes. The third node splitting category is based on statistical coefficients, such as $χ^{2}$ ⁸, Pearson’s Correlation Coefficient (PCC)⁹, Correlation Ratio (CR)¹⁰, and Dispersion Datio (DR).¹¹

One of the top priorities in machine learning is still creating effective and ideal decision trees that are based on various new splitting metrics. Chandra et al.¹² created decision trees using the Distinct Class-based Splitting Measure (DCSM), a novel splitting criterion that is founded on the concept of distinct classes. In addition, Wang and his team¹³ introduce a less greedy two-term Tsallis Entropy Information Metric (TEIM) algorithm for decision tree classification using a new split criterion that is based on two-term Tsallis conditional entropy. Furthermore, Zhou et al. in 2021¹⁴ describe the Feature Weight based Decision Tree (FWDT) decision tree approach based on the feature weight principle, where the weights of features are determined by employing the ReliefF algorithm¹⁵. Additionally, a brand-new node partitioning metric called Entropy Gini Integrated Approach (EGIA), which combines the GI and entropy, was designed by Singh & Chhabra in $2021$ ¹⁶.

The splitting criterion that is utilized when building a decision tree has a significant impact on the performance of the model. In order to verify this assertion, this work compares the performance of several decision tree methods that are based on various splitting criteria in terms of four evaluation metrics: Classification accuracy, tree depth, leaf nodes, and tree construction time. In other words, the main objective of this paper is to contrast various splitting measures that belong to different categories. Consequently, six decision tree algorithms are chosen based on four categories of splitting criteria, including two decision trees with information theory-based splitting criteria (Iterative Dichotomizer 3 (ID3)¹ and C $4.5$ ²), one decision tree with a distance-based splitting metric (Classification And Regression Trees (CART)⁵), two decision trees with statistical-based splitting measurements (Pearson's Correlation Coefficient based decision Tree (PCC-Tree)¹⁷ and Dispersion Datio (DR)¹¹), and Feature Weight based Decision Tree (FWDT) decision tree methodology¹⁴.

The main contributions of this study are:

Background on the six compared decision tree processes ID3, C $4.5$ , CART, PCC-Tree, DR, and FWDT.

Twelve data sets from diverse fields are used to assess the effectiveness of the six techniques using four different evaluation measurements.

To further analyze the obtained results, the Friedman and post-hoc Nemenyi tests are carried out.

The remaining sections of the paper are organized as follows: first, a brief overview of the ID3, C

4.5

, CART, PCC-Tree, DR, and FWDT decision tree strategies is provided. Following that, the performances of these six algorithms are evaluated on 12 data sets in terms of several evaluation metrics. Finally, a brief summary is given at the end.

Background on decision tree algorithms

The six decision tree strategies that will be examined in this work involve three well-known methods, including ID3, C $4.5$ , and CART, and three recently created methodologies, including PCC-Tree, DR, and FWDT algorithm. Although all of these algorithms apply the top-down partitioning technique to build decision trees, the only distinction between them is the attribute selection methods used to select the splitting attributes.

The next subsections provide a brief summary of the different decision tree procedures that employ splitting criteria from various categories.

Decision trees with information theory-based splitting criteria

ID3 method:

The ID3 decision tree technique employs the IG metric to evaluate the significance of attributes.

In a data set $D$ with $N$ classes represented by $C_{1}, C_{2}, \dots, C_{N}$ , the IG of a categorical attribute $A$ can be calculated as follows:

IG (A) = Entropy (D) - Entropy (A, D),

(1)

where Entropy(

D

) assesses the entropy of information associated with the data set

D

and Entropy(

A, D

) measures the entropy of information related to the attribute

A

. Equations (2) and (3) provide these two measurements, respectively.

Entropy (D) = - \sum_{i = 1}^{N} P_{i} \log_{2} (P_{i}),

(2)

where

P_{i} = \frac{∣ C_{i} ∣}{∣ D ∣}

, the denominator represents the number of instances in

D

and the numerator is the number of instances relative to the class label

C_{i}

Assume that the attribute $A$ has $d$ distinct values listed as ${a_{1}, a_{2}, \dots, a_{d}}$ . Thus:

Entropy (A, D) = \sum_{j = 1}^{d} \frac{∣ D_{j} ∣}{∣ D ∣} Entropy (D_{j}),

(3)

where the data set

D

is subdivided into

d

subsets

{D_{1}, D_{2}, \dots, D_{d}}

according to the

d

different values of the attribute

A

and

D_{j}

contains the instances in

D

that have the outcome

a_{j}

of the attribute

A

Finally, the attribute with the highest IG value is then picked as the splitting attribute since it is the most relevant one.

C4.5 technique:

The C $4.5$ approach, an extension of the ID3 strategy, employs the GR as a metric to assess the importance of attributes and determine the optimal attributes for splitting.

In a data set $D$ , the GR of an attribute $A$ with $d$ distinct values, ${a_{1}, a_{2}, \dots, a_{d}}$ , can be defined as follows:

GR (A) = \frac{IG (A)}{SI (A, D)},

(4)

where SI

(A, D)

is the splitting information that calculates information generated after splitting the data set

D

into

d

partitions

D_{1}, D_{2}, \dots, D_{d}

based on the distinct values of

A

. The splitting information of the attribute

A

is given by:

SI (A, D) = - \sum_{j = 1}^{d} \frac{∣ D_{j} ∣}{∣ D ∣} l o g_{2} (\frac{∣ D_{j} ∣}{∣ D ∣}),

(5)

where

D_{j} = {instances \in D / A (instance) = a_{j}}

According to the C $4.5$ strategy, the splitting attribute at each level will be determined by the attribute that maximizes the GR metric.

Decision tree with distance-based splitting criterion

CART strategy:

The GI is utilized as an attribute evaluation metric by the CART technique to choose the optimum splitting attributes.

For a data set $D$ containing $N$ classes indicated by $C_{1}, C_{2}, \dots, C_{N}$ :

GI (D) = 1 - \sum_{i = 1}^{N} P_{i}^{2},

(6)

where

P_{i} = \frac{∣ C_{i} ∣}{∣ D ∣}

, the denominator represents the number of instances in

D

and the numerator is the number of instances relative to the class

C_{i}

Since the CART technique only allows for the creation of binary decision trees with two child nodes per node, the training data set $D$ can be separated into two subsets, $D_{1}$ and $D_{2}$ , by an attribute $A$ . Hence, the GI of the attribute $A$ is computed by Equation (7).

GI (A) = \sum_{j = 1}^{2} \frac{∣ D_{j} ∣}{∣ D ∣} GI (D_{j}) .

(7)

The attribute that reduces the GI measure will therefore be the most powerful separating attribute.

Decision trees with statistical-based splitting criteria

PCC-Tree algorithm:

The PCC-Tree technique analyzes attributes, selects the best attributes for splitting, and identifies the best splitting points by using the PCC as an impurity measure.

Assume that $D$ is a data set that contains $n$ instances ${x_{i}}_{i = 1}^{n}$ , each of which is described by $p$ attributes ${A_{k}}_{k = 1}^{p}$ and one class attribute $Y$ . The structure of the data set $D$ is displayed in Table 1, where $A_{k} (i)$ represents the value of the instance $i$ on the attribute $A_{k}$ .

Table 1.

Structure of the data set.

$x_{i}$	$A_{1}$	$A_{2}$	$\dots$	$A_{p}$	$Y$
$x_{1}$	$A_{1} (1)$	$A_{2} (1)$	$\dots$	$A_{p} (1)$	$y_{1}$
$x_{2}$	$A_{1} (2)$	$A_{2} (2)$	$\dots$	$A_{p} (2)$	$y_{2}$
.	.	.	.	.	.
.	.	.	.	.	.
.	.	.	.	.	.
$x_{n}$	$A_{1} (n)$	$A_{2} (n)$	$\dots$	$A_{p} (n)$	$y_{n}$

The PCC-Tree technique starts by substituting each attribute $A$ of the data set $D$ (either a description attribute $A_{k}$ or the class attribute $Y$ ) by a novel vector $V$ created as follows:

\begin{aligned} V (A) & = {v (A (x_{i}))}_{i = 1}^{n} \\ = {\begin{matrix} A (x_{i}) if \; A \; is \; numerical \\ \in {0, 1, \dots, d - 1} if \; A \; is \; categorical, \end{matrix} \end{aligned}

(8)

where the number of distinct values of the categorical attribute

A

is represented by

d

Furthermore, another vector is generated for each attribute $A$ (either a description attribute $A_{k}$ or the class attribute $Y$ ) based on the first one as follows:

\begin{aligned} V (A, c) & = {v (A (x_{i}), c)}_{i = 1}^{n} \\ = {\begin{matrix} 1 & i f V (A (x_{i})) \leq c \\ 2 & i f V (A (x_{i})) > c, \end{matrix} \end{aligned}

(9)

where the numerical value

c

is used to divide the value domain of the attribute

A

Finally, the splitting attribute $A^{*}$ and the splitting point $c$ that fulfill the following stipulation are chosen to form the splitting rule:

A^{*} = \underset{A_{k}}{a r g m a x} ∣ P (V (A_{k}, c), V (Y)) ∣,

(10)

where

P

stands for the Pearson’s correlation coefficient.

In equation (10), $V (Y)$ is a vector that contains $0, 1, \dots, N - 1$ integers ( $N$ is the total number of distinct class labels). These integers are assigned to replace the original class attribute $Y$ and determine Pearson’s correlation coefficient.

DR approach:

As the first preprocessing stage, the DR decision tree technique discretizes the numerical attributes of the data set using the k-means strategy. The DR metric, a modification of the preexisting CR statistic, is then used to evaluate the importance of all attributes.

The DR of a categorical attribute $A$ can be determined as follows:

DR (A) = \sqrt{\frac{\sum_{y \in Y} n_{y} ({\bar{m}}_{y}^{(A)} - {\bar{m}}^{(A)})^{2}}{\sum_{y \in Y} \sum_{j = 1}^{y} (v_{j y}^{(A)} - {\bar{m}}^{(A)})^{2}}},

(11)

where $y$ is a class label within the set of classes $Y$ , $n_{y}$ represents the number of instances with the class label $y$ , ${\bar{m}}_{y}^{(A)}$ is the relative importance of the attribute $A$ in relation to the class label $y$ , ${\bar{m}}^{(A)}$ illustrates the overall importance of the attribute $A$ , and $v_{j y}^{(A)}$ signifies the relative importance of the $j$ th value of the attribute $A$ in class $y$ .

The numerator of the DR measure represents the dispersion in the relative importances of an attribute among individual classes and the denominator represents the dispersion in the importance of that attribute across the whole population. On the other hand, the relative importance of the attribute $A$ within the class label $y$ is the ratio of the maximum frequency value of occurrence for a distinct value of the attribute $A$ with respect to the class label $y$ and the total number of instances in that class. Subsequently, the overall importance of the attribute $A$ is equal to the ratio of the summation of maximum frequencies of different classes and the total number of instances in the data set. Finally, the relative importance of the $j$ th value of the attribute $A$ in the class label $y$ is the ratio of the frequency value of occurence for the $j$ th value of the attribute $A$ with respect to the class $y$ and the total number of instances in the data set.

According to Equation (11), the attribute with the highest value of DR will be chosen as the splitting attribute.

Decision tree based on another splitting criterion

FWDT approach:

The k-means strategy is used as a preprocessing stage of the FWDT approach to discretize the numerical attributes of the data set. Before constructing the decision tree, the attribute space is reduced using the ReliefF feature selection method, which assigns weights to attributes. Then, all the attributes that have weight values above a certain threshold are chosen, while the others are afterwards removed. The median of the sorted feature weight vector serves as the threshold for the ReliefF method in the FWDT technique.

To evaluate the weight of an attribute $A$ , ReliefF randomly selects a sample $x_{i}$ from the training data set, calculates the distance between that sample and each other sample $x_{j}$ that has the same class label as $x_{i}$ , and then selects the $K$ closest samples. Similarly, the distance between the sample $x_{i}$ and every other sample $x_{k}$ with a different class label is determined, and the $K$ nearest samples are then chosen. Finally, the weight of the attribute $A$ for the instance $x_{i}$ can be evaluated as follows:

\begin{aligned} W (A, x_{i}) & = W (A, x_{i}) - \sum_{j = 1}^{K} \frac{diff (A, x_{i}, x_{j})}{M \times K} \\ + \sum_{C \notin class (x_{i})} \frac{\frac{P (C)}{1 - P (class (x_{i}))} \sum_{k = 1}^{K} diff (A, x_{i}, x_{k} (C))}{M \times K}, \end{aligned}

(12)

where the initial weight of each attribute is set at

0

\sum_{j = 1}^{K} \frac{diff (A, x_{i}, x_{j})}{M \times K}

is the weight of the samples

x_{j}

\sum_{C \notin class (x_{i})} \frac{\frac{P (C)}{1 - P (class (x_{i}))} \sum_{k = 1}^{K} diff (A, x_{i}, x_{k} (C))}{M \times K}

represents the weight of the samples

x_{k}

M

is the number of random samples, and

K

is the number of the nearest neighbors.

In the traditional ReliefF algorithm, $diff (A, x_{i}, x_{j})$ is given by Equation (13) for categorical attributes and it can be computed for numerical attributes by Equation (14).

diff (A, x_{i}, x_{j}) = {\begin{matrix} 0 & if A (x_{i}) = A (x_{j}) \\ 1 & otherwise . \end{matrix}

(13)

diff (A, x_{i}, x_{j}) = \frac{∣ A (x_{i}) - A (x_{j}) ∣}{\max (A) - \min (A)} .

(14)

Greater importance of the attribute to the class attribute is indicated by a bigger estimated weight value.

After the discretization and pre-filtering steps, the decision tree can be constructed based on the remaining attributes. Therefore, the splitting attribute used to separate the current node is the one with the highest weight value.

Each of the six decision tree techniques ID3, C $4.5$ , CART, PCC-Tree, DR, and FWDT has its own advantanges and disadvantages, the main of them are summarized in Table 2.

Table 2.

Advantages and disadvantages of the ID3, C $4.5$ , CART, PCC-Tree, DR, and FWDT decision tree approaches.

Algorithms	Splitting criteria	Categories	Advantages	Disadvantages
ID3	IG	Information theory	1- Builds the fastest tree 2- Builds a short tree 3- Searchs the whole data set to build the decision tree	1- Handles only categorical attributes 2- Prioritizes attributes with more values 3- Do not handle imbalanced data sets and missing values
C $4.5$	GR	Information theory	1- Avoids the bias of selecting attributes with many values 2- Handles both categorical and numeric attributes 3- Dealing with data sets with missing values	1- Only suitable for small-size data sets 2- Constructs empty branches 3- Has a tendency to prefer the attribute with low split info
CART	GI	Distance	1- Handles both categorical and numeric value 2- Tackles classification and regression problems 3- Handles outliers	1- Produces unstable decision trees 2- Computation of GI for attribute having many different values takes considerable amount of time 3- has a preference towards multivalued attributes
PCC-Tree	PCC	Statistical coefficient	1- Handles data sets with mixed-type attributes 2- Avoids the over-partitioning problem	1-Its performance is influenced by the value of the stop criterion $ϵ$ 2- Difficult to determine the right value of the stop criterion $ϵ$
DR	DR	Statistical coefficient	1- Does not prefer the attributes withlarge number of distinct values 2- Not affected by imbalance class distribution 3- Suitable for data sets havingnumeric attributes also4- Handles datasets with more than $2$ classes	1- Difficult to find the exact value of $k$ when $k$ -means is used for discretization 2- Suitable for small size data sets and fewerfeature number
FWDT	Feature Weight	Other	1- Reduces the feature space by using afeature pre-filtering strategy beforeconstructing the decision tree 2- Handles categorical and numericalattributes	1- Requires more time 2- Difficult to find the exact value of $k$ when $k$ -means is used for discretization

Experimental study

In this section, the performances of the decision tree strategies ID3, C $4.5$ , CART, PCC-Tree, DR, and FWDT are compared on several data sets in terms of four evaluation metrics: classification accuracy, decision tree depth, leaf node count, and tree development time.

Data set and computing environment description

Several data sets with various numbers of instances, attributes, and classes are collected for the purpose of experimentally comparing the efficacy of the ID3, C $4.5$ , CART, PCC-Tree, DR, and FWDT procedures. These data sets were obtained through the UCI Learning Repository¹⁸. The tested data sets are described as follows:

The Bankruptcy data set forecasts bankruptcy based on $6$ qualitative parameters provided by experts. This data set contains $250$ instances divided into two categories: $107$ instances with the class label “Bankruptcy” (B) and $143$ instances with the class label “Non-Bankruptcy” (NB).

The Breast Cancer W-D data set, a diagnostic Wisconsin breast cancer database, consists of $569$ instances that are each characterized by $30$ real attributes and one class attribute that records the prognosis: “benign” ( $1$ ) or “malignant” ( $2$ ).

The Hayes Roth data set belongs to the topic of human subjects study. This database includes $4$ attributes with numeric values that divide the $132$ instances it contains into two classes: $51$ instances are positive (P) and $81$ instances are negative (N).

The Hepatitis data set is made up of $155$ instances and $19$ attributes of various types, such as nominal, real, and categorical. In addition, there is the categorical class attribute, which has two possible values: “DIE” ( $32$ instances) and “LIVE” ( $123$ instances).

The Immunotherapy data set involves information about wart treatment results of $90$ patients using immunotherapy. These $90$ instances are defined by $7$ attributes and one class attribute with two labels: $19$ instances are members of class $0$ , and $71$ instances are members of class $1$ .

The Iris data set employs $4$ real-valued continuous attributes to characterize three different types of iris plants: “setosa,” “versicolor,” and “virginica.” The data set contains $150$ flowers; each of the three iris species has $50$ flowers.

The New Thyroid data set is a multi-class database that describes $215$ instances by utilizing $5$ numerical attributes. These $215$ instances comprise $150$ instances belonging to the class $1$ , $35$ instances belonging to the class $2$ , and $30$ instances belonging to the class $3$ .

The Pima data set makes a diabetes diagnosis prediction for a patient based on specific diagnostic measurements ( $8$ numeric-valued attributes). There are $768$ instances in this data set, of which $500$ instances test negatively for diabetes ( $0$ ), while $268$ instances test positively ( $1$ ).

The Seeds data set analyzes $7$ geometrical properties of $210$ kernels from three different varieties of wheat: “Kama” ( $1$ ), “Rosa” ( $2$ ), and “Canadian” ( $3$ ), each with $70$ elements.

The Somerville data set includes $6$ nominal attributes that classify $143$ instances into two groups: $66$ instances with class label “unhappy” ( $0$ ) and $77$ instances with class label “happy” ( $1$ ).

The Statlog dataset, a heart disease database, divides $270$ patients into two classes: $150$ instances do not have heart disease ( $1$ ) and $120$ instances do ( $2$ ) using $13$ attributes of various types, including real, ordered, nominal, and binary.

The Wine data set identifies the origin of wines using chemical analysis. This data set contains $178$ instances that are described by $13$ continuous attributes and one class attribute with three distinct classes (three types of wines). Out of these $178$ instances, $59$ instances have class label $1$ , $71$ instances have class label $2$ , and $48$ instances have class label $3$ .

Table 3 provides a summary of the characteristics of the data sets that were collected.

Table 3.

Detailed information on the employed data sets.

No.	Data sets	Instances	Attributes	Classes
1	Bankruptcy	250	6	2
2	Breast cancer W-D	569	30	2
3	Hayes Roth	132	4	2
4	Hepatitis	155	19	2
5	Immunotherapy	90	7	2
6	Iris	150	4	3
7	New thyroid	215	5	3
8	Pima	768	8	2
9	Seeds	210	7	3
10	Somerville	143	6	2
11	Statlog	270	13	2
12	Wine	178	13	3

ID3: iterative dichotomizer 3; CART: classification and regression trees; PCC-Tree: Pearson’s correlation coefficient based decision tree; DR: dispersion ratio; FWDT: feature weight based decision tree; IG: information gain; GR: gain ratio; GI: Gini index.

The hardware condition used in this experiment is an Intel (R) Core (TM) i $7$ - $11850$ H with a $2.50$ GHz CPU and $32$ GB of memory, while a $64$ -bit version of Windows $10$ is the operating system. On the other side, “R version $4.2.1$ ” is utilized as the computing environment.

Description of evaluation metrics

The performance of the ID3, C $4.5$ , CART, PCC-Tree, DR, and FWDT methods is examined on the different collected data sets according to four metrics, including classification accuracy, decision tree depth, leaf node count, and tree development time. The four evaluation metrics are outlined below:

Classification accuracy is the most important performance indicator for assessing the overall effectiveness of a classifier. It is determined by the number of unseen instances that a classifier correctly classifies out of the total number of instances provided for testing. The classification accuracy is expressed as follows:

Accuracy = \frac{TP + TN}{TP + FN + FP + TN} \times 100,

(15)

where TP stands for the number of positive instances that the classifier correctly identified as positive, TN stands for the number of negative instances that the classifier correctly identified as negative, FP stands for the number of negative instances that the classifier incorrectly identified as positive, and FN stands for the number of positive instances that the classifier incorrectly identified as negative.

Decision tree depth represents the size of a decision tree (i.e., the total number of nodes in a tree).

Leaf node count identifies the number of leaf nodes in a decision tree.

Tree development time indicates the time required by a decision tree algorithm for constructing a tree.

A decision tree strategy performs well when the classification accuracy is higher, the tree depth is shallower, there are fewer leaf nodes, and the tree is formed more quickly.

Each data set is treated using the five-fold cross-validation methodology¹⁹. Consequently, every reported result in this framework is based on the average value of 50 times five-fold cross-validation.

Analysis of the classification accuracy reached using the ID3, C $4.5$ , CART, PCC-Tree, DR, and FWDT strategies

The classification accuracies obtained using the ID3, C $4.5$ , CART, PCC-Tree, DR, and FWDT approaches on the 12 data sets are displayed in Table 4.

Table 4.

Classification accuracies (%) produced by the ID3, C $4.5$ , CART, PCC-Tree, DR, and FWDT techniques.

Data sets	ID3	C $4.5$	CART	PCC-Tree	DR	FWDT
1	98.34	99.51	98.30	91.12	97.52	99.52
2	93.06	92.61	92.66	92.39	91.33	91.44
3	78.25	75.68	77.33	79.84	76.27	62.79
4	80.13	79.41	78.47	77.50	77.74	78.57
5	84.45	75.35	84.78	78.83	76.45	76.83
6	93.49	93.93	93.54	93.49	88.17	95.75
7	91.68	89.62	89.89	89.71	90.09	86.46
8	74.80	71.15	74.48	70.91	68.18	69.35
9	91.28	88.77	90.04	91.23	84.02	80.86
10	62.03	61.38	62.18	58.68	60.43	57.87
11	79.09	74.81	78.08	74.70	76.24	72.87
12	90.28	93.06	88.39	93.19	82.29	86.48
Average	84.74	82.93	84.01	82.63	80.73	79.90

The bold font is utilized to highlight the best results for each data set.

Based on the analysis of Table 4, it can be seen that none of the C $4.5$ and DR processes offer the highest testing accuracy for any given data set. However, each of the CART, PCC-Tree, and FWDT techniques can produce the greatest classification accuracy when applied to just two data sets. Whereas the ID3 method can generate the highest level of classification accuracy on six data sets. Furthermore, the ID3 methodology yields the highest average classification accuracy, which is $1.81 %$ , $0.73 %$ , $2.11 %$ , $4.01 %$ , and $4.84 %$ greater than that attained by the C $4.5$ , CART, PCC-Tree, DR, and FWDT strategies, respectively.

It can be said that the results for classification accuracy obtained using the three traditional decision tree methods are preferred to those acquired using the three other recent techniques.

The Friedman test²⁰, which is based on the average ranks, is used to further examine the classification accuracy levels attained by the six algorithms on the 12 data sets. The null hypothesis that will be tested asserts that the testing accuracies of the ID3, C $4.5$ , CART, PCC-Tree, DR, and FWDT methodologies are identical at the significance level $α = 0.05$ . The average rankings for testing accuracies reached by the six techniques are presented in Table 5.

Table 5.

Average ranks for classification accuracies of the six methods.

Methods	ID3	C $4.5$	CART	PCC-Tree	DR	FWDT
Average ranks	1.875	3.417	2.667	3.792	4.750	4.500

The ID3 method appears to be the best among the six approaches in terms of classification accuracy given that it has the lowest average rank, as shown in Table 5.

Given that $k = 6$ decision tree methods and $N = 12$ data sets are available, the following equation yields the Friedman statistic $χ_{F}^{2}$ :

χ_{F}^{2} = \frac{12 N}{k (k + 1)} [\sum_{j = 1}^{k} R_{j}^{2} - \frac{k (k + 1)^{2}}{4}],

(16)

where

R_{j}

is the average rank of the

j

th decision tree strategy.

Later, based on the Friedman statistic, Iman’s F statistic can be computed as in equation (17).

F_{F} = \frac{(N - 1) χ_{F}^{2}}{N (k - 1) - χ_{F}^{2}} .

(17)

On the other hand, the statistic

F_{F}

, which is distributed according to the F-distribution with

5

and

55

degrees of freedom, has a critical value of

2.383

for

α = 0.05

The Friedman statistic $χ_{F}^{2}$ and the statistic $F_{F}$ for the classification accuracy are $20.536$ and $5.724$ , respectively. Hence, the null hypothesis should be rejected given that Iman’s $F$ statistic is higher than the critical value. Consequently, it can be said that the ID3, C $4.5$ , CART, PCC-Tree, DR, and FWDT approaches differ in their performance depending on the classification accuracy.

After ruling out the null hypothesis, a post hoc Nemenyi test²⁰ is employed to determine which strategy performs better than the others. The Nemenyi test reveals that there is a statistically significant difference in the performance of at least two decision tree strategies when the critical difference (CD), which is given by the equation below, exists between the corresponding average rankings of the two approaches.

\begin{aligned} CD & = q_{α} \sqrt{\frac{k (k + 1)}{6 N}} \\ = 2.177, \end{aligned}

(18)

where

q_{α}

is the critical value for post hoc tests.

The pairwise differences (in absolute value) of average rankings for classification accuracies that are attained by the ID3, C $4.5$ , CART, PCC-Tree, DR, and FWDT techniques are summarized in Table 6. Furthermore, a pairwise difference that differs by more than CD is highlighted in red, which indicates a statistically significant difference in performance.

Table 6.

Pairwise differences of average ranks for testing accuracies.

Approaches	ID3	C $4.5$	CART	PCC-Tree	DR	FWDT
ID3	–
C $4.5$	1.542	–
CART	0.792	0.750	–
PCC-Tree	1.917	0.375	1.125	–
DR	red2.875	1.333	2.083	0.958	–
FWDT	red2.625	1.083	1.833	0.708	0.250	–

The examination of Table 6 demonstrates that, in terms of classification accuracy, the ID3 methodology performs significantly better than the DR and FWDT strategies. However, it is evident that in terms of classification accuracy, none of the three recent decision tree methods, PCC-Tree, DR, and FWDT can significantly exceed the other three traditional ones.

Comparison of the tree depth among the ID3, C $4.5$ , CART, PCC-Tree, DR, and FWDT techniques

With regard to the tree depth measurement, Table 7 displays the tree depths that were reached by applying the ID3, C $4.5$ , CART, PCC-Tree, DR, and FWDT strategies to the 12 data sets.

Table 7.

Tree depths produced by the ID3, C $4.5$ , CART, PCC-Tree, DR, and FWDT techniques.

Data sets	ID3	C $4.5$	CART	PCC-Tree	DR	FWDT
1	3.02	9.38	3.02	30.07	50.83	8.80
2	9.30	47.77	8.06	37.44	267.97	87.94
3	12.68	45.74	11.50	23.05	69.08	18.28
4	8.01	46.07	6.54	36.90	140.81	50.34
5	3.19	22.32	3.00	25.94	100.00	49.72
6	5.20	12.66	5.10	12.44	81.96	6.53
7	6.58	12.17	6.03	17.98	30.24	31.62
8	27.58	43.05	26.47	147.11	1085.08	203.73
9	8.28	27.36	6.80	25.87	18.79	7.44
10	11.92	152.70	11.33	49.41	205.35	65.92
11	13.38	124.44	13.20	81.51	275.88	118.24
12	7.84	14.72	7.70	13.69	157.99	49.67
Average	9.75	46.53	9.06	41.78	207.00	58.19

The bold font is utilized to highlight the best results for each data set.

According to Table 7, the C $4.5$ , PCC-Tree, DR, and FWDT methodologies cannot produce the best tree depth result on any given data set. Whereas the only data set for which the ID3 technique is successful in producing a tree with the minimum depth is the Bankruptcy data set. On the other hand, out of the 12 data sets, the CART method yields the smallest tree depth ( $9.06$ nodes on average).

The Friedman test is also employed to further analyze the tree depths that the six algorithms on the 12 data sets have acquired. The null hypothesis asserts that the tree depths of the ID3, C $4.5$ , CART, PCC-Tree, DR, and FWDT techniques are equivalent at the significance level $α = 0.05$ . The average ranks for tree depths that are obtained using the six methods on the 12 data sets are shown in Table 8.

Table 8.

Average ranks for tree depths of the six methods.

Methods	ID3	C $4.5$	CART	PCC-Tree	DR	FWDT
Average ranks	2.000	4.250	1.083	3.750	5.750	4.167

Given that the CART methodology, as shown in Table 8, has the lowest average rank among the six techniques, it seems to be the most effective in terms of the tree depth metric.

The Friedman $χ_{F}^{2}$ as well as Iman’s $F$ statistics for the tree depth measure are evaluated by Equations (19) and (20), respectively.

χ_{F}^{2} = 48.762.

(19)

F_{F} = 47.729.

(20)

Since

F_{F}

statistic exceeds the critical value (

2.383

), the null hypothesis is rejected. Consequently, there is a significant difference among the tree depths of the ID3, C

4.5

, CART, PCC-Tree, DR, and FWDT approaches.

In order to establish which approach works better than the others, the post hoc Nemenyi test is additionally performed. The pairwise differences (in absolute value) of average ranks for tree depths attained by the ID3, C $4.5$ , CART, PCC-Tree, DR, and FWDT techniques are shown in Table 9. Red highlights are used to indicate pairwise differences that differ by more than CD.

Table 9.

Pairwise differences of average ranks for tree depths.

Approaches	ID3	C $4.5$	CART	PCC-Tree	DR	FWDT
ID3	–
C $4.5$	red2.250	–
CART	0.917	red3.167	–
PCC-Tree	1.750	0.500	red2.667	–
DR	red3.750	1.500	red4.667	2.000	–
FWDT	2.167	0.083	red3.083	0.417	1.583	–

In terms of the tree depth, the ID3 methodology greatly beats both the C $4.5$ and DR strategies, as demonstrated in Table 9. Additionally, all of the C $4.5$ , PCC-Tree, DR, and FWDT approaches are significantly outperformed by the CART algorithm. However, none of the three recently created decision tree techniques can considerably surpass any of the three traditional techniques in terms of tree depth.

Examining the number of leaf nodes produced by the ID3, C $4.5$ , CART, PCC-Tree, DR, and FWDT methods

Table 10 displays the leaf node counts that were reported by implementing the ID3, C $4.5$ , CART, PCC-Tree, DR, and FWDT approaches on the 12 data sets.

Table 10.

Number of leaf nodes produced by the ID3, C $4.5$ , CART, PCC-Tree, DR, and FWDT techniques.

Data sets	ID3	C $4.5$	CART	PCC-Tree	DR	FWDT
1	2.01	6.58	2.01	15.54	34.22	6.20
2	5.15	24.39	4.53	19.22	188.84	49.11
3	6.92	33.80	6.30	12.02	50.98	13.94
4	4.50	23.54	3.77	18.95	98.42	29.53
5	2.10	12.38	2.00	13.47	81.00	41.42
6	3.10	6.83	3.05	6.72	67.94	4.53
7	3.79	6.58	3.51	9.49	22.05	20.62
8	14.29	22.02	13.74	74.06	933.56	158.30
9	4.64	14.18	3.90	13.44	36.58	13.88
10	6.48	122.11	6.18	25.21	164.06	52.24
11	7.19	78.10	7.10	41.26	243.72	98.16
12	4.42	7.86	4.35	7.35	158.35	57.18
Average	5.38	29.86	5.04	21.39	173.31	45.43

The bold font is utilized to highlight the best results for each data set.

The CART strategy, just like it did for the tree depth metric, surpasses all five other strategies in terms of leaf node count, as indicated by Table 10. It can provide the fewest number of leaf nodes across all data sets ( $5.04$ leaves on average).

The leaf node counts acquired by the six algorithms are further examined using the Friedman test. The tested null hypothesis states that the numbers of leaf nodes of the ID3, C $4.5$ , CART, PCC-Tree, DR, and FWDT approaches are indistinguishable at the significance level $α = 0.05$ . Table 11 depicts the average ranks for leaf node counts obtained by the six procedures.

Table 11.

Average ranks for leaf nodes of the six methods.

Methods	ID3	C $4.5$	CART	PCC-Tree	DR	FWDT
Average ranks	1.917	4.083	1.083	3.500	6.000	4.417

The $χ_{F}^{2}$ Friedman statistic for the leaf nodes metric is given by the equation below:

χ_{F}^{2} = 54.095.

(21)

Iman’s

F

statistic can be calculated using the Friedman statistic and is provided by:

F_{F} = 100.774.

(22)

Iman’s F statistic is higher than the critical value, hence the null hypothesis is disregarded. As a result, the ID3, C

4.5

, CART, PCC-Tree, DR, and FWDT strategies behave differently depending on the leaf nodes.

Additionally, the post hoc Nemenyi test is conducted to discover which method performs more effectively than the others. Table 12 displays the pairwise differences (in absolute value) of average rankings for leaf nodes generated by the ID3, C $4.5$ , CART, PCC-Tree, DR, and FWDT algorithms. Red highlights are used to indicate pairwise differences that differ by more than CD.

Table 12.

Pairwise differences of average ranks for leaf nodes.

Approaches	ID3	C $4.5$	CART	PCC-Tree	DR	FWDT
ID3	–
C $4.5$	2.167	–
CART	0.833	red3.000	–
PCC-Tree	1.583	0.583	red2.417	–
DR	red4.083	1.917	red4.917	red2.500	–
FWDT	red2.500	0.333	red3.333	0.917	1.583	–

Table 12 analysis shows that the ID3 method produces decision trees with leaf node counts that are significantly lower than those produced by the DR and FWDT approaches. The CART technique, on the other hand, greatly surpasses each of the C $4.5$ , PCC-Tree, DR, and FWDT methods by producing less leaves. Whereas the PCC-Tree strategy significantly outperforms the DR approach. In contrast, none of the ID3, C $4.5$ , and CART approaches can be surpassed by the PCC-Tree, DR, or FWDT methodologies.

Comparison of the tree construction time among the ID3, C $4.5$ , CART, PCC-Tree, DR, and FWDT approaches

Table 13 gives the times (in seconds) that the ID3, C $4.5$ , CART, PCC-Tree, DR, and FWDT approaches needed to create decision trees for the 12 data sets.

Table 13.

Comparison of the running time (seconds) among the ID3, C $4.5$ , CART, PCC-Tree, DR, and FWDT techniques.

Data sets	ID3	C $4.5$	CART	PCC-Tree	DR	FWDT
1	0.002	0.009	0.002	0.018	0.039	0.442
2	0.041	38.746	0.004	2.057	0.465	13.263
3	0.002	0.027	0.002	0.014	0.038	0.212
4	0.004	0.523	0.003	0.073	0.177	2.343
5	0.002	0.141	0.002	0.030	0.098	0.412
6	0.002	0.095	0.002	0.014	0.067	0.178
7	0.003	0.308	0.002	0.034	0.022	0.596
8	0.015	2.972	0.008	0.342	0.776	3.667
9	0.004	0.992	0.002	0.122	0.030	0.807
10	0.002	0.100	0.002	0.032	0.170	0.466
11	0.005	0.765	0.004	0.133	0.361	3.017
12	0.006	1.172	0.003	0.105	0.240	1.634
Average	0.007	3.821	0.003	0.248	0.207	2.253

The bold font is utilized to highlight the best results for each data set.

According to Table 13, the ID3 technique takes the second-shortest amount of time to generate decision trees, after the CART methodology. The average amount of time required to generate a tree using the CART approach is merely $0.003$ s, however the time needed to do it using the ID3, C $4.5$ , PCC-Tree, DR, and FWDT strategies is $0.004$ s, $3.818$ s, $0.245$ s, $0.204$ s, and $2.250$ s higher than that taken by the CART algorithm, respectively.

Similarly, the Friedman test is also conducted to test the null hypothesis, which states that the times taken by the ID3, C $4.5$ , CART, PCC-Tree, DR, and FWDT procedures are equivalent at the significance level $α = 0.05$ . The average ranks for tree construction times of the six processes are shown in Table 14.

Table 14.

Average ranks for tree construction times produced by the six methodologies.

Methods	ID3	C $4.5$	CART	PCC-Tree	DR	FWDT
Average ranks	1.792	4.833	1.208	3.333	4.000	5.833

For the tree building time measure, the $χ_{F}^{2}$ Friedman and Iman’s $F$ statistics are evaluated as follows:

{\begin{matrix} χ_{F}^{2} = 53.726. \\ F_{F} = 94.199. \end{matrix}

(23)

Iman’s F statistic is over the threshold (

2.383

), rejecting the null hypothesis. Consequently, depending on how quickly the tree is constructed, the ID3, C

4.5

, CART, PCC-Tree, DR, and FWDT techniques function differently.

The Nemenyi test is employed to identify which method performs better than the others. The pairwise differences (in absolute value) of average ranks for tree building durations are summarized in Table 15. Pairwise differences that differ by more than CD are highlighted in red.

Table 15.

Pairwise differences of average ranks for tree construction time.

Approaches	ID3	C $4.5$	CART	PCC-Tree	DR	FWDT
ID3	–
C $4.5$	red3.042	–
CART	0.583	red3.625	–
PCC-Tree	1.542	1.500	2.125	–
DR	red2.208	0.833	red2.792	0.667	–
FWDT	red4.042	1.000	red4.625	red2.500	1.833	–

It can be deduced from Table 15 that the C $4.5$ , DR, and FWDT strategies are significantly outperformed in terms of running time by the ID3 and CART algorithms. Furthermore, the PCC-Tree method is significantly faster than the FWDT approach.

In summary, the results of using the ID3, C $4.5$ , CART, PCC-Tree, DR, and FWDT techniques on the 12 data sets are shown in Figure 2 using box plots in terms of classification accuracy, tree depth, leaf nodes, and tree construction time.

Figure 2.

Comparison among the effectiveness of the ID3, C $4.5$ , CART, PCC-Tree, DR, and FWDT approaches.

Therefore, it can be said that the Friedman and post hoc Nemenyi tests demonstrate that the ID3 and CART procedures are preferred and more efficient than the other techniques in terms of all the four measurements (classification accuracy, tree depth, leaf nodes, and tree construction time).

Conclusion

This paper investigated the efficacy of six decision tree methods that are based on a variety of node splitting metrics. These criteria belong to various categories, including criteria based on information theory, criteria based on distance, statistical-based criteria, and other splitting criteria. The six decision tree strategies comprise the three well-known ID3, C $4.5$ , and CART approaches as well as the three recently developed PCC-Tree, DR, and FWDT approaches.

On eleven data sets with various dimensionalities, the six procedures are validated to see how well they perform in terms of four metrics: classification accuracy, tree depth, leaf nodes, and tree construction time. The obtained results demonstrate that the traditional ID3 and CART approaches perform better than the other methods. To further analyze the experimental results statistically, the Friedman and post-hoc Nemenyi tests are performed. These two tests confirm that the ID3 and CART techniques are very comparable in terms of effectiveness and are both preferable to the other methodologies.

A novel decision tree method, which is based on preordonance theory^21–24, will be proposed in the future work.

Footnotes

Acknowledgements

This work was supported by the National Center for Scientific and Technical Research of Morocco (CNRST).

Declaration of conflicting interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors received no financial support for the research, authorship and/or publication of this article.

ORCID iD

Fadwa Aaboub

References

Quinlan

. Induction of decision trees. Mach Learn 1986; 1(1): 81–106.

Quinlan

. C4.5: Program for machine learning. Morgan Kaufmann Pub 1993; 16: 235–240.

Jun

Kim

Song

et al. A new criterion in selection and discretization of attributes for the generation of decision trees. IEEE Trans Pattern Anal Mach Intell 1997; 19(12): 1371–1375.

Wang

Jiang

. An improved attribute selection measure for decision tree induction. In Fourth International Conference on Fuzzy Systems and Knowledge Discovery (FSKD 2007), volume 4. IEEE, pp. 654–658.

Brieman

Friedman

Stone

et al. Classification and regression tree analysis,, 1984.

Davies

Bouldin

. A cluster separation measure. IEEE Transactions on Pattern Analysis and Machine Intelligence 1979; 1(2): 224–227.

Orsenigo

Vercellis

. Discrete support vector decision trees via tabu search. Comput Stat Data Anal 2004; 47(2): 311–322.

Ahakonye

LAC

Nwakanma

Lee

et al. Scada intrusion detection scheme exploiting the fusion of modified decision tree and chi-square feature selection. Internet of Things 2023; 21: 100676.

Sedgwick

. Pearson’s correlation coefficient. Bmj 2012; 345: e4483–e4483.

10.

Roy

Mondal

Ekbal

et al. Crdt: correlation ratio based decision tree model for healthcare data mining. In 2016 IEEE 16th International Conference on Bioinformatics and Bioengineering (BIBE). IEEE, pp. 36–43.

11.

Roy

Mondal

Ekbal

et al. Dispersion ratio based decision tree model for classification. Expert Syst Appl 2019; 116: 1–9.

12.

Chandra

Kothari

Paul

. A new node splitting measure for decision tree construction. Pattern Recognit 2010; 43(8): 2725–2731.

13.

Wang

Xia

. A less-greedy two-term Tsallis entropy information metric approach for decision tree classification. Knowl Based Syst 2017; 120: 34–42.

14.

Zhou

Zhang

Zhou

et al. A feature selection algorithm of decision tree based on feature weight. Expert Syst Appl 2021; 164: 113842.

15.

Reyes

Morell

Ventura

. Scalable extensions of the relieff algorithm for weighting and selecting features on the multi-label learning context. Neurocomputing 2015; 161: 168–182.

16.

Singh

Chhabra

. Withdrawn: Egia: A new node splitting method for decision tree generation: Special application in software fault prediction, 2021.

17.

Liu

Wang

. A pearson’s correlation coefficient based decision tree and its parallel implementation. Inf Sci (Ny) 2018; 435: 40–58.

18.

Dheeru D, Taniskidou EK. Uci machine learning repository. 2017; http://archive ics uci edu/ml; accessed January 19, 2023.

19.

Kohavi

et al. A study of cross-validation and bootstrap for accuracy estimation and model selection. In Ijcai, volume 14. Montreal, Canada, pp. 1137–1145.

20.

Demšar

. Statistical comparisons of classifiers over multiple data sets. J Mach Learn Res 2006; 7: 1–30.

21.

Chamlal

Ouaderhman

Aaboub

. Preordonance correlation filter for feature selection in the high dimensional classification problem. In 2021 Fifth International Conference On Intelligent Computing in Data Sciences (ICDS). IEEE, pp. 1–5.

22.

Chamlal

Ouaderhman

Aaboub

. A graph based preordonnances theoretic supervised feature selection in high dimensional data. Knowl Based Syst 2022; 257: 109899.

23.

Chamlal

Ouaderhman

Rebbah

. A hybrid feature selection approach for microarray datasets using graph theoretic-based method. Inf Sci (Ny) 2022; 615: 449–474.

24.

Chamlal

Ouaderhman

El Mourtji

. Feature selection in high dimensional data: a specific preordonnances-based memetic algorithm. Knowl Based Syst 2023; 266: 110420.

Statistical analysis of various splitting criteria for decision trees *

Abstract

Keywords

Introduction

Background on decision tree algorithms

Decision trees with information theory-based splitting criteria

Decision tree with distance-based splitting criterion

Decision trees with statistical-based splitting criteria

Decision tree based on another splitting criterion

Experimental study

Data set and computing environment description

Description of evaluation metrics

Analysis of the classification accuracy reached using the ID3, C 4.5 , CART, PCC-Tree, DR, and FWDT strategies

Comparison of the tree depth among the ID3, C 4.5 , CART, PCC-Tree, DR, and FWDT techniques

Examining the number of leaf nodes produced by the ID3, C 4.5 , CART, PCC-Tree, DR, and FWDT methods

Comparison of the tree construction time among the ID3, C 4.5 , CART, PCC-Tree, DR, and FWDT approaches

Conclusion

Footnotes

Acknowledgements

Declaration of conflicting interests

Funding

ORCID iD

References

Analysis of the classification accuracy reached using the ID3, C $4.5$ , CART, PCC-Tree, DR, and FWDT strategies

Comparison of the tree depth among the ID3, C $4.5$ , CART, PCC-Tree, DR, and FWDT techniques

Examining the number of leaf nodes produced by the ID3, C $4.5$ , CART, PCC-Tree, DR, and FWDT methods

Comparison of the tree construction time among the ID3, C $4.5$ , CART, PCC-Tree, DR, and FWDT approaches