
Abstract

Pattern recognition and machine learning methods provide an attractive approach for building decision support systems. Classification trees are frequently used for such tasks owing to their intuitive structure and effectiveness. It has been shown that for complex medical data, combining a number of base classifiers improves overall accuracy. Classification tree ensembles have a number of free parameters to set, which can significantly affect their performance. In recent years such ensembles have often been used by practitioners without a mathematical background (e.g. physicians), who may be unaware of how to obtain the optimal settings. It is therefore difficult for them to choose satisfactory settings, and in most cases the default parameters proposed to them are not necessarily the most effective. The aim of this article is to ascertain which types of combined tree classifiers give the best performance for medical decision support and which parameters should be chosen for them. A set of rules for end-users on how to tune their ensembles is proposed.

Keywords

Classifier ensembles, decision support systems, decision trees, health informatics, machine learning, pattern recognition

Introduction

Machine learning is an attractive approach for building decision support systems.¹ In such tasks the quality of the knowledge base plays the key role in establishing satisfactory final performance. The following problems are often encountered:

domain experts cannot formulate the rules for decision problems because they might not have the knowledge needed to develop effective algorithms (e.g. human face recognition from images);

incomplete, unbalanced or noisy data, resulting from mechanical faults or the high cost of information acquisition, often have to be accommodated;

it is not possible to ascertain which classifier will behave better than the others—known as the ‘no free lunch’ theorem²;

the number of classifiers to test is very high and performing a complex investigation would be too time consuming.

The latter two problems are addressed in the current article. Recently, pattern recognition and machine learning have become interdisciplinary areas of research, attracting researchers from different backgrounds, as well as users of classification techniques to automate the decision-making process.³ Yet, owing to the high complexity of the pattern recognition problem and the great variety of methods and their settings, one may find the task bewildering when attempting to choose an appropriate procedure.⁴ This is especially the case in medicine, where decision support systems play an important role.⁵⁻⁷ It is extremely difficult or nearly impossible to guarantee in all cases that individual experts in pattern recognition would agree on the best methods to tune a given system for a specific problem, owing, in part, to the numerous methods that are available. As a consequence, many users, in this case physicians, are often left to their own judgement, which may be lacking in some cases. This article aims to present a set of complex tests of performance that will highlight the strengths and weaknesses of the chosen classification methods. Here, the focus is on decision trees owing to their intuitive nature and good performance over a variety of problems.⁸ In addition, the behaviour of tree ensembles is investigated.⁹ Combining basic classifiers is considered one of the most promising trends in pattern recognition.¹⁰ Classifier ensembles applied to the analysis of complex data often increase the final accuracy, especially in the case of weak predictors, such as decision trees.¹¹ Moreover, ensembles, compared with the canonical classifiers, have more free parameters that must be set, which makes their proper tuning a challenging problem. The aim is for this article to help practitioners by offering much-needed advice on using ensembles, based on tests on real data.

The content of this article is arranged as follows: in the next section the related work in the field of classification tree ensembles is presented; then, the theoretical framework behind pattern recognition, classification trees and tree ensembles methodologies is briefly presented. After that, the setting-up of computer experiments is discussed, followed by the results and a thorough discussion, and a set of guidance notes and advice on how to tune the tree ensembles. Finally, we present the conclusions.

Related work

Decision trees and their ensembles represent a group of popular methods for medical data classification and are often used in decision support systems. They are constantly being developed by academic researchers. Šprogar et al.¹² thoroughly described how the idea of creating medical decision support systems has changed over the last few decades. Ordonez¹³ showed that in the task of disease prediction, decision trees often behave better than explicit association rules. Rodriguez et al.¹⁴ studied ensembles of functional trees, yet without giving a clear indication of how they should be tuned for certain medical problems.

The Random Forest algorithm is probably the most popular ensemble algorithm created in recent years. Wu et al.¹⁵ showed that the Random Forest algorithm outperforms most of the canonical classifiers for the prediction of ovarian cancer. It is widely used in bioinformatics, as it can also serve at the same time as a classifier and a feature selector, as shown in Díaz-Uriarte et al.¹⁶ One of the novel usages of this algorithm is the application in problems with unbalanced data representation, which occur often when sufficient samples of pathological examples cannot be collected. One of the first works to report such usage was the paper by Khalilia et al.¹⁷

The Rotation Forest algorithm is a recently introduced method for creating a classifier ensemble. While it is not frequently used, possibly because of its relatively recent emergence, there are some reports of its successful implementation in the medical domain. For example, Ozcift et al.¹⁸ presented the behaviour of 30 different machine-learning algorithms combined using the Rotation Forest algorithm for the medical domain. It is noted, however, that the tests were conducted on a very small number of datasets. In a different work, Ozcift¹⁹ presented a combination of feature selection with both Support Vector Machine and Rotation Forest for Parkinson's disease diagnosis. Dehzangi et al.²⁰ reported the successful usage of Rotation Forest for the prediction of protein folds. One of the main drawbacks of Rotation Forest is that it often does not perform well on small datasets with a very high number of features. Therefore, alternative ensemble methods were proposed for such cases, such as those presented by Krawczyk,²¹ based on the combination of feature selection and a random subspace method. Another approach for complex data analysis was proposed by Wilk et al.²² Here, the authors used the one-class classification framework and created the ensemble with the usage of fuzzy logic.

Background

Pattern recognition task

The aim of pattern recognition²³ is to classify a given object to one of the predefined categories on the basis of an observation of the features which describe it. All data concerning the object and its attributes are presented as a feature vector x ∈ X. The pattern recognition algorithm Ψ maps the feature space X to the set of class labels M:

Ψ : X \to M

(1)

The mapping (1) is established based on the examples included in a learning set or rules given by experts. The learning set consists of learning examples, i.e. observations of the features describing an object and its correct classification. Let us assume that there are n classifiers Ψ⁽¹⁾ , Ψ⁽²⁾, … , Ψ⁽ⁿ⁾. For a given object x, each of these decides if it belongs to class i ∈ M = {1, …, M}. The combined classifier makes a decision on the basis of the following formulae:

\bar{Ψ} (Ψ^{(1)} (x), Ψ^{(2)} (x), \dots, Ψ^{(n)} (x)) = \arg \max_{j \in M} \sum_{l = 1}^{n} δ (j, Ψ^{(l)} (x))

(2)

where δ(j,i) = 0 if i ≠ j, otherwise δ(j,i) = 1. The presented approach is based on the 0-1 loss function. It is worth mentioning that recently a new model of the loss function based on fuzzy logic was proposed.²⁴ It was also tuned for the implementation in a decision tree algorithm.²⁵
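The voting rule in equation (2) can be illustrated with a minimal Python sketch. The function name `majority_vote` and the tie-breaking rule (the smallest label wins) are assumptions made for this illustration only; equation (2) does not specify how ties are resolved.

```python
from collections import Counter


def majority_vote(predictions):
    """Return the label with the most votes, as in equation (2).

    `predictions` holds one predicted class label per base classifier,
    i.e. the values of the n base classifiers for a single object x.
    Ties go to the smallest label, an arbitrary rule assumed here.
    """
    counts = Counter(predictions)          # sum of delta(j, prediction) per label j
    top = max(counts.values())             # highest number of votes
    return min(label for label, votes in counts.items() if votes == top)


# Five base classifiers vote on one object; class 2 receives three votes.
print(majority_vote([2, 1, 2, 3, 2]))
```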

Decision tree induction

Most algorithms, such as C4.5 given by Quinlan²⁶ or Alternative Decision Tree (ADTree), are based on the ‘Top Down Induction of Decision Tree’ (TDIDT).²⁷ The central idea of the TDIDT algorithm is the selection of ‘the best’ attribute, i.e. which attribute to test at each node in the tree. The family of algorithms based on the ID3 method (e.g. C4.5) uses the information gain, which measures how well a given attribute separates the training examples according to the target classification. Later implementations of decision tree induction algorithms use measures derived from the information gain (e.g. the gain ratio).
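As a minimal sketch of the information gain for a single nominal attribute, the following Python code computes the reduction in class entropy obtained by splitting on that attribute. The function names and the toy data are illustrative assumptions and are not taken from any of the cited implementations.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels):
    """Entropy reduction obtained by splitting `labels` on a nominal attribute.

    `values[i]` is the attribute value and `labels[i]` the class of example i.
    """
    n = len(labels)
    groups = {}
    for v, y in zip(values, labels):
        groups.setdefault(v, []).append(y)
    # Weighted entropy of the subsets produced by the split.
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# An attribute that separates the two classes perfectly, and one that does not.
labels = [0, 0, 1, 1]
print(information_gain(["a", "a", "b", "b"], labels))
print(information_gain(["a", "b", "a", "b"], labels))
```

A TDIDT-style learner would evaluate such a score for every candidate attribute at a node and test the attribute with the highest value.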

REPTree and CART

All decision tree algorithms tested in this article are constructed based on the information gain. The main difference among them is the pruning process. REPTree,²⁸ a fast decision tree learner, uses reduced-error pruning and requires a validation set. Each subtree of a given decision tree is substituted with the best possible ‘leaf’ and then tested using the validation set. If the resulting error is lower than or equal to that of the previous tree structure, then the subtree is replaced by a leaf. The process continues for all subtrees in the tree.

CART²⁹ is a class of decision trees that implements minimal cost-complexity pruning. As in the case of REPTree, it also replaces each subtree with a best possible leaf. The difference is, however, that the performance of the modified tree can be lower if it brings about a significant reduction of the tree structure, i.e. the number of leaves. This can be more formally written as follows: for a given decision tree K , which was trained on a set of X instances, the number of misclassified training examples is equal to E . If it is assumed that the number of leaves in K is L(K) , then the cost-complexity is defined by Breiman et al.²⁷ as:

\frac{E}{X} + α \times L (K)

(3)

where α is some parameter (yet to be determined). For a substitution of a subtree S in K by a leaf to take place, the new tree would have to have the same cost-complexity as K . This is the case when:

α = \frac{N}{X \times (L (S) - 1)}

(4)

where N is the number of additional misclassifications made by the new tree and L(S) – 1 is the number of leaves which were pruned.
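The step from equation (3) to equation (4) can be written out explicitly. In the sketch below, K' denotes the tree obtained from K by replacing the subtree S with a leaf and C(·) denotes the cost-complexity; both symbols are introduced here for illustration only. The numerical values in the last line are hypothetical.

```latex
% Cost-complexity of the original tree K, equation (3)
C(K) = \frac{E}{X} + α \times L(K)

% Pruning S adds N misclassifications and removes L(S) - 1 leaves
C(K') = \frac{E + N}{X} + α \times \bigl( L(K) - (L(S) - 1) \bigr)

% Setting C(K') = C(K) and cancelling the common terms
\frac{N}{X} = α \times (L(S) - 1)
\quad \Longrightarrow \quad
α = \frac{N}{X \times (L(S) - 1)}

% Hypothetical example: X = 100, N = 2, L(S) = 5
α = \frac{2}{100 \times 4} = 0.005
```

For any α above this threshold the pruned tree has the lower cost-complexity, so the substitution is preferred.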

Bagging

Bagging (or bootstrap aggregating) is an ensemble meta-algorithm developed by Breiman.³⁰ It is based on creating a set of new bootstrap object samples from the original dataset and training one classifier on each of them. This assures that each of the classifiers was created on a diverse, heterogeneous dataset.

A dataset is given consisting of X instances, each belonging to one of M classes. The method generates T new versions of a learning set by taking repeated bootstrap samples from the original dataset. Each new set has the same size as the original (although the size can be adjusted) and, therefore, some of the instances can appear more than once. The algorithm trains each classifier Ψ^(t) based on one of the samples, where t = 1, 2, …,T. The final classifier Ψ^(*) is an aggregation of the results given by T classifiers. In the case of class prediction, it is a plurality vote or average, when a numerical value is predicted. As a basic classifier the decision tree is commonly used.
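The procedure described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the implementation used in the experiments: `train` is a hypothetical callable that builds a classifier from a list of (x, y) pairs, `size_percent` is a hypothetical name for the adjustable sample size, and `train_majority` is a trivial stand-in for a decision tree learner.

```python
import random
from collections import Counter


def bootstrap_sample(data, size_percent=100, rng=random):
    """Draw a bootstrap sample (uniform, with replacement) from `data`."""
    m = max(1, round(len(data) * size_percent / 100))
    return [rng.choice(data) for _ in range(m)]


def bagging_fit(data, train, T=10, size_percent=100, seed=0):
    """Train T base classifiers, one per bootstrap sample."""
    rng = random.Random(seed)
    return [train(bootstrap_sample(data, size_percent, rng)) for _ in range(T)]


def bagging_predict(classifiers, x):
    """Aggregate the T predictions for x by plurality vote."""
    return Counter(clf(x) for clf in classifiers).most_common(1)[0][0]


def train_majority(sample):
    """Stand-in base learner: always predicts the sample's most common class."""
    label = Counter(y for _, y in sample).most_common(1)[0][0]
    return lambda x: label


# Toy dataset of (feature vector, class) pairs.
data = [((i,), 1) for i in range(10)]
ensemble = bagging_fit(data, train_majority, T=5)
print(bagging_predict(ensemble, (3,)))
```

In practice `train_majority` would be replaced by a decision tree learner, and a regression variant would average the predictions instead of voting.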

Random forest

The Random Forest approach was introduced by Breiman.³¹ It is, to some extent, an extension of the bootstrap aggregation³⁰ algorithm. It is similar to the previously presented Bagging algorithm as it also uses new subsets to create heterogeneous classifiers. Yet, in this case, the subspaces do not only consist of bootstrap samples of objects, but also of randomly chosen features. Whereas in Bagging each classifier is trained on the same features but on different objects, in Random Forest each tree is trained on different objects and different features.

Let F be the feature set of a given dataset. The classifier itself consists of T decision trees, each of which uses only a randomly selected subset of features k, where k ⊂ F and |k| << |F|. The subset k is chosen randomly for every tree node and the best split on this subset is applied to a given node. Every tree is fully grown and no pruning takes place. In addition, on the basis of the given dataset, such as in Bagging, T subsets are generated by uniformly sampling the examples with replacement from the standard training set. The size of each such subset is the same as the base learning set. The final decision is made by choosing the classification with the highest number of votes over T classifiers in the ensemble.
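The per-node feature sampling that distinguishes Random Forest from Bagging can be sketched as follows. The function name `candidate_features` is hypothetical, and the fallback subset size of int(log2(|F|)) + 1 is an assumption of this sketch rather than a value stated in the text.

```python
import math
import random


def candidate_features(features, k=None, rng=random):
    """Pick the random feature subset examined at one tree node.

    `features` is the full feature set F. If `k` is None, a subset size of
    int(log2(|F|)) + 1 is used, which is an assumed default for this sketch.
    """
    features = list(features)
    if k is None:
        k = int(math.log2(len(features))) + 1
    k = min(k, len(features))
    # Sampling without replacement: each feature appears at most once.
    return rng.sample(features, k)


# A node in a tree built on 33 features, restricted to 2 candidate features.
rng = random.Random(1)
print(candidate_features(range(33), k=2, rng=rng))
```

A tree learner would call such a function anew at every node and choose the best split among the returned features only.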

Rotation Forest

The Rotation Forest is an algorithm for creating classifier ensembles using feature extraction proposed by Rodríguez et al.³² As a base classifier the decision tree is used. Additionally, it implements principal component analysis (PCA) to increase accuracy and diversity of the ensemble.

The algorithm works as follows: let F be the feature set and X a set of training examples. For each base classifier Ψ^(t) (where t = 1, …, T ) the method splits F into L random subsets, where L is one of the parameters. For each subset of features a random subset of classes is selected, together with a bootstrap sample of instances from X . For such a fragment of the training set PCA is applied. The principal component coefficients obtained from all of the feature subsets are then stored and arranged into a rotation matrix, which is used to map the data onto a set of new features forming the training set for Ψ^(t) . Each tree is trained using the whole learning set. A given x is assigned to a class with the highest average probability over all base classifiers.

Experimental investigation

The main idea behind the experiment is to show that in many practical cases, the results given by ensemble classifiers with standard configurations could be significantly enhanced by choosing appropriate parameter values. The goal is to give some insight into the performance of the presented combined tree classifiers and the effect of different parameter settings on the final accuracy. This can be viewed as an aid for dealing with practical implementation issues in the domain of medical decision support systems. The focus is not on the settings of the base classifiers (single decision trees) and classifier fuser, but rather it is shown that more attention should be paid to the design and tuning of the classifier ensembles.³³ The default fusion methods proposed by the software are used. Because of the focus on medical decision support, all datasets represent real-life medical problems. In total, 15 different datasets are chosen, all of which are publicly available at the UC Irvine repository.³⁴ Each dataset applies to a distinct medical condition or a group of diseases, most of which are commonly encountered in everyday medical practice.

The datasets depict different forms of classification problems. The number of instances ranges from small datasets (such as lung cancer) with 32 objects to quite large datasets, i.e. more than 500 for breast cancer and Pima Indians diabetes. The number of attributes also differs significantly from 8 (E. coli) to 69 (audiology). The attributes are mostly real, binary and nominal values. Five of the tested datasets have missing values, which for some classification methods may have a significant impact on the overall performance. The detailed descriptions of the datasets under consideration are presented in Table 1.

Table 1.

Characteristics of 15 medical datasets used in this study

Dataset	Code	Num. attr	Num. objects	Num. classes	Missing values (%)
Audiology	aud	69	226	24	1.97
Breast cancer	brc	32	569	2	0
Breast tissue	brt	10	106	6	0
Dermatology	der	33	366	6	0.06
E. coli	eco	8	336	8	0
Heart disease	hrt	13	270	2	0
Hepatitis	hep	19	155	2	5.39
Liver disorders	liv	7	345	2	0
Lung cancer	luc	56	32	3	0.27
Lymphography	lym	18	148	4	0
Parkinsons	par	23	197	2	0
Pima indians diabetes	pid	8	768	2	0
Post-operative patient	pop	8	90	3	0
Primary tumor	prt	18	339	21	3.69
Statlog (heart)	sth	13	270	2	0

For the purpose of these experiments two popular ensemble methods are chosen: Random Forest and Bagging. Additionally, a quite recently introduced method known as the Rotation Forest is considered. For the two former algorithms a base classifier must be chosen. In this article the focus is on decision trees, so the base classifiers tested are the three methods from this group of algorithms. These are as follows: C4.5 (the most popular decision tree algorithm), CART and REPTree. Each ensemble method has a unique set of parameters that can be changed and its effect on the outcomes measured.

In the case of Bagging, the percentage of the whole dataset in a training subset was studied. Here, if the parameter is equal to 100 then the training subset has the same size as the full training set. However, they are not the same, because the subset is generated by uniformly sampling the training set with replacement, which leads to duplicates. This can have a negative effect on the performance. To deal with this problem, smaller values of this parameter are investigated, namely 70% and 40%. These are arbitrarily chosen values drawn from experience in working in the machine learning domain. The differences between them in most cases are very clear and, therefore, these values give a clear indication of the influence of this parameter on the overall accuracy. Smaller values could also influence the diversity of the base classifiers by training them on potentially fewer identical instances.
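The effect of sampling with replacement on the number of duplicates can be quantified with a short calculation. When m instances are drawn uniformly with replacement from n originals, each original is missed with probability (1 − 1/n)^m, so the expected share of distinct originals in the sample is 1 − (1 − 1/n)^m. The sketch below evaluates this for the three tested sizes; the function name is hypothetical and n = 768 is taken from the Pima Indians diabetes dataset purely as an example.

```python
def expected_distinct_fraction(n, size_percent):
    """Expected share of the n original instances that appear at least once
    in a sample of round(n * size_percent / 100) draws with replacement."""
    m = round(n * size_percent / 100)
    return 1 - (1 - 1 / n) ** m


for pct in (100, 70, 40):
    print(pct, round(expected_distinct_fraction(768, pct), 3))
```

For large n these shares approach 1 − e^(−1), 1 − e^(−0.7) and 1 − e^(−0.4), i.e. roughly 63%, 50% and 33% of the original instances, so smaller subsets make the base classifiers see increasingly different parts of the data.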

In the Rotation Forest approach two parameters were studied: the number and the size of the disjoint subsets of the feature space. The two parameters are mutually exclusive. Both parameters have been tested using the same values, namely 2 and 4.

The influence on the performance of two important factors is now quantified. The number of base classifiers is tested for five values, namely 10, 20, 40, 80 and 120 trees. Also, the depth of each of the base trees is set to one of three values: two arbitrary levels, the default level and twice the default level.

To provide a detailed comparison between the methods a statistical significance test is used. It allows the tested classifiers to be compared and makes it possible to ascertain whether their differences are, indeed, statistically significant. For this purpose, a Combined 5 × 2 cv F Test³⁵ is used. It repeats twofold cross-validation five times so that in each of the folds the size of the training and testing sets is equal. This test is conducted by comparison of all versus all. The null hypothesis is that the classifiers have the same error rates, and the probability of rejecting it is adopted as the test score. As an alternative hypothesis, it is conjectured that the tested classifiers have different error rates. A small difference in the error rate implies that the different algorithms construct two similar classifiers with similar error rates; thus, the null hypothesis should not be rejected. For a large difference, the classifiers have different error rates and the null hypothesis should be rejected. Therefore, two classifiers differ in a statistically significant way if the null hypothesis considering them is rejected. The combined version of this test takes a majority vote over the ten possible 5 × 2 cv F test results. The benefit of using such a modification is that the combined test has a lower type I error, i.e. a lower probability of rejecting the hypothesis when the classifiers have similar error rates. At the same time, it offers higher power such that there is a larger probability of rejecting when the tested classifiers are different.
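As an illustration, the statistic of the combined 5 × 2 cv F test can be computed from the ten fold-wise differences in error rate between two classifiers. Writing p_i^(j) for the difference on fold j of replication i, and s_i² for the variance estimate of replication i, the statistic is the sum of all squared differences divided by twice the sum of the s_i², and it is approximately F-distributed with 10 and 5 degrees of freedom. The sketch below is a minimal version under these definitions; the function name and the example differences are hypothetical.

```python
def combined_5x2cv_f(diffs):
    """Combined 5x2 cv F statistic for two classifiers.

    `diffs[i][j]` is the difference in error rate between the two
    classifiers on fold j (0 or 1) of replication i (0..4).
    """
    if len(diffs) != 5 or any(len(pair) != 2 for pair in diffs):
        raise ValueError("expected 5 replications of 2 folds each")
    # Numerator: sum of the ten squared differences.
    numerator = sum(p ** 2 for pair in diffs for p in pair)
    # Denominator: twice the sum of the per-replication variance estimates.
    variances = []
    for p1, p2 in diffs:
        mean = (p1 + p2) / 2
        variances.append((p1 - mean) ** 2 + (p2 - mean) ** 2)
    denominator = 2 * sum(variances)
    if denominator == 0:
        raise ValueError("all replications have zero variance")
    return numerator / denominator


# Hypothetical error-rate differences from five twofold cross-validations.
example = [(0.1, 0.3)] * 5
print(combined_5x2cv_f(example))
```

The null hypothesis of equal error rates would be rejected at the 0.05 level when the statistic exceeds the tabulated critical value of the F distribution with 10 and 5 degrees of freedom, which is approximately 4.74.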

All the classification results were obtained with the use of Weka,³⁶ open source software for data analysis.

Results

In this section the results obtained by experimental investigations are presented. In each of the following tables the best results for a dataset are marked by the grey field. Additionally, the classifiers that showed statistical differences in the Combined 5 × 2 cv F Test are in bold.

As a base for the further comparison the results of the base tree classifiers are shown in Table 2.

Table 2.

Classification results of base tree classifiers

Dataset	C4.5	Cart	REP Tree
aud	65.21	54.46	37.41
brc	93.57	91.04	90.23
brt	66.04	68.87	72.64
der	94.26	95.90	94.81
eco	84.23	83.63	80.95
hrt	76.66	78.52	77.78
hep	83.87	78.71	78.71
liv	68.70	67.54	64.06
luc	40.61	50.00	50.00
lym	77.03	76.35	72.30
par	80.51	85.64	86.15
pid	65.11	62.47	65.10
pop	70.00	71.11	70.00
prt	39.82	41.00	38.94
sth	76.67	78.52	77.78

Tables 3–5 present the classification accuracy over the 15 datasets for Bagging, Rotation Forest and Random Forest, respectively. Table 6 presents the comparison between the best results from the three investigated methods.

Table 3.

Classification results of Bagging for three different decision trees and for three dataset sizes (100/70/40 stands for bagging subspace size parameter)

Bagging
Data set	C4.5			Cart			REP Tree
	100	70	40	100	70	40	100	70	40
aud	62.02	62.02	62.82	63.00	63.00	63.42	59.21	58.45	59.03
brc	95.85	95.42	95.99	95.71	94.99	94.84	95.57	95.57	95.57
brt	69.81	66.98	68.86	72.64	69.81	66.98	72.64	66.03	68.86
der	97.27	96.72	97.54	96.17	95.63	95.08	95.63	95.90	93.99
eco	84.82	86.01	86.01	85.12	83.04	83.33	83.33	83.33	82.73
hrt	82.96	79.26	81.48	80.00	78.52	80.37	81.48	78.89	79.26
hep	83.23	82.58	81.94	80.00	83.23	82.58	83.23	83.23	81.94
liv	70.14	70.43	69.57	68.12	69.57	65.93	69.86	69.57	67.54
luc	46.88	56.25	75.00	56.25	50.00	59.38	50.00	59.38	34.38
lym	81.76	83.78	81.08	79.73	82.43	75.68	75.68	75.00	77.03
par	88.72	92.31	87.69	89.74	89.74	85.13	89.23	87.69	83.59
pid	72.01	72.01	72.92	71.01	71.01	71.92	59.11	58.85	63.02
pop	70.00	70.00	68.89	71.11	71.11	71.43	65.56	71.11	71.11
prt	43.07	44.25	41.59	42.77	43.89	44.54	41.00	43.95	39.53
sth	80.00	81.48	82.96	79.26	81.48	78.89	78.89	79.26	77.78

Table 4.

Performance of the Rotation Forest algorithm for three different decision tree classifiers

Rotation Forest
Dataset	C4.5				Cart				REPTree
	N2	N4	S2	S4	N2	N4	S2	S4	N2	N4	S2	S4
aud	63.08	63.01	64.42	64.05	63.12	62.20	63.65	64.56	62.59	62.81	63.09	62.90
brc	97.00	96.85	96.71	96.85	96.71	97.28	98.13	97.13	97.13	97.13	97.28	97.57
brt	73.58	74.53	69.81	74.53	72.64	73.53	73.58	72.64	72.64	71.70	74.53	70.75
der	95.90	96.99	97.54	97.81	96.99	96.99	97.54	96.99	95.08	96.72	98.36	97.81
eco	86.90	87.39	87.79	87.50	86.01	86.30	88.80	86.31	84.52	85.71	85.12	86.90
hrt	82.96	81.48	81.85	81.48	82.96	83.70	78.89	83.33	81.48	81.85	84.74	80.44
hep	83.23	86.45	83.23	83.23	85.16	81.94	82.58	81.94	83.87	84.52	82.58	83.23
liv	68.41	72.36	72.75	70.14	71.59	70.43	71.59	68.12	69.86	73.62	75.43	72.46
luc	40.63	56.25	53.13	53.13	56.25	50.00	59.38	53.13	37.50	46.88	59.38	53.13
lym	83.11	87.16	82.43	84.46	79.05	81.08	80.41	80.41	79.73	75.68	79.73	80.41
par	91.79	91.79	92.28	90.25	89.23	90.25	89.23	89.74	89.23	90.26	93.79	91.28
pid	65.88	65.41	66.42	66.06	65.32	65.21	66.14	65.56	64.29	63.82	65.67	65.40
pop	68.89	71.11	70.00	70.00	71.11	70.02	70.37	70.37	71.11	71.11	71.91	71.11
prt	43.36	44.25	43.66	45.43	45.13	45.72	45.13	45.72	45.13	42.77	42.48	45.72
sth	82.96	81.48	81.85	81.48	82.96	83.70	78.89	83.33	81.48	81.85	80.74	84.44

N, parameter set to the number of subspaces; S, parameter set to the size of subspaces (those settings are mutually exclusive); 2/4, number/size value.

Table 5.

Performance of the Random Forest algorithm for different tree depth and ensemble size

Random Forest
Data set	Def					2					2*Def
	10	20	40	80	120	10	20	40	80	120	10	20	40	80	120
aud	62.12	63.24	62.99	62.54	62.56	63.11	65.23	66.08	67.74	65.41	59.35	59.77	59.80	59.80	59.63
brc	96.14	96.42	96.57	96.57	96.42	96.57	97.00	97.14	97.00	97.14	94.71	95.28	94.85	95.28	95.57
brt	72.64	68.87	70.75	69.81	70.75	66.98	71.70	68.87	70.75	69.81	72.64	71.70	71.70	72.64	71.70
der	97.81	97.54	96.99	96.99	96.99	95.63	96.72	97.27	98.27	96.99	97.81	96.72	96.45	96.17	96.72
eco	83.63	84.82	85.11	85.42	85.42	77.38	78.87	78.27	77.98	83.63	84.23	85.71	84.52	86.31	86.01
hrt	79.26	81.11	81.85	81.85	81.85	81.48	83.33	82.59	83.33	84.07	80.37	82.22	81.48	81.48	82.96
hep	80.00	80.00	81.29	84.52	85.16	83.23	83.23	85.16	83.23	85.87	80.00	80.00	83.87	85.81	85.16
liv	66.38	71.30	72.17	72.17	72.75	66.96	67.25	68.12	67.83	68.99	68.99	72.46	72.17	72.75	72.75
luc	53.13	46.88	50.00	50.00	56.25	53.13	56.25	65.63	59.38	59.38	43.75	37.50	43.75	53.13	46.88
lym	85.81	85.14	84.46	85.81	85.16	75.68	76.35	76.35	79.05	77.70	85.81	85.14	84.46	85.81	85.14
par	91.28	93.33	92.82	92.30	92.82	90.64	90.64	91.13	94.62	94.62	91.28	93.33	92.82	92.31	92.82
pid	62.24	63.15	62.89	62.76	62.63	64.71	65.63	66.28	66.54	73.32	59.77	59.77	59.77	59.77	59.64
pop	62.22	65.56	63.33	63.33	63.33	68.89	71.11	71.11	71.11	71.11	61.11	66.67	65.56	65.56	65.56
prt	42.48	42.18	41.89	42.48	43.07	36.87	38.05	40.23	42.23	44.23	40.12	43.95	43.66	43.54	43.25
sth	80.00	80.37	81.85	81.85	82.96	83.33	82.07	85.00	82.22	83.33	81.48	81.11	82.22	82.59	82.33

10/20/40/80/120, number of trees in the ensemble; Def/2/2*Def, the number of features used to construct each of the base trees; Def, the default number suggested for each of the datasets by the WEKA software.

Table 6.

Comparison of the best results for three ensemble methods under consideration

Dataset	Bagging	Rotation Forest	Random Forest
aud	63.42	64.56	67.74
brc	95.85	98.13	97.14
brt	72.64	74.53	72.64
der	97.54	98.36	98.27
eco	86.01	88.80	86.31
hrt	82.96	84.74	84.07
hep	83.23	86.45	85.87
liv	70.43	75.43	72.75
luc	75.00	59.38	65.63
lym	83.78	87.16	85.81
par	92.31	93.79	94.62
pid	72.92	65.67	73.32
pop	71.43	71.91	71.11
prt	44.54	45.72	44.23
sth	82.96	84.44	85.00

Discussion

Analysing the results from the experiments allows several interesting conclusions to be drawn. It is worth noting that none of the base classifiers outperformed the ensemble methods. This may be owing to the complex nature of the medical data. It is observed that a single classifier attempting to model the whole decision space tends either to overfit or to generalise too broadly. In contrast, it is found that ensemble methods, with different areas of competence for each of the classifiers, can overcome this difficulty. The smallest accuracy gain occurred in the case of the breast tissue dataset (<2%) and the largest in the case of the lung cancer dataset (25%). This clearly shows that using tree ensembles is appropriate and the accuracy gain can compensate for the longer classifier training time.

When comparing the best results from the three different classifier ensemble types, it can be seen clearly that Bagging produced the weakest results. Only in the case of the lung cancer dataset did it outperform the remaining two algorithms. In the case of other datasets the differences between Random Forest and Rotation Forest are very small. For four cases, Random Forest returned the best results, while Rotation Forest was superior for nine. It is worth noting that differences between them were not always statistically significant. This leads to the conclusion that both of these algorithms have good predictive power. However, the Rotation Forest execution time is much greater than that of Random Forest.

In the case of the Bagging algorithm, the size of the subspaces and the choice of the base classifier could be adjusted. From the experiments it can be seen that the use of the REPTree algorithm for Bagging is not a promising direction. For all datasets it was outperformed by C4.5 and CART. Only for the hepatitis dataset did it return good results, but, at the same time, the two Forest algorithms returned identical results. Therefore, it can be stated that REPTree should not be used for Bagging. In fact, it was found that C4.5 displayed the highest accuracy for the Bagging procedure. For 8 out of 10 datasets this combination returned the best results. CART proved to be useful for the breast tissue and audiology datasets, but for the latter the difference was not statistically significant. In most cases of the Bagging test the smaller subspace sizes, namely 70% and, especially, 40%, returned the best results. The subspace size set to 70% returned the best results for 6 out of 15 datasets and, when set to 40%, for 8 out of 15.

In the case of the Rotation Forest algorithm, the number or the size of the disjoint subsets of the feature space and the choice of the base classifier can be manipulated. For this approach REPTree returned the best results, surprisingly outperforming the other two base classifiers in more than half of the cases, i.e. 10 out of 15. This can be explained by the pruning nature of REPTree. Rotation Forest uses the PCA method for feature extraction and it has been shown many times that only a small number of principal components contain most of the discriminant information. Therefore, REPTree allows quick discarding of the irrelevant nodes from the tree, resulting in the highest performance. In second place was the CART algorithm (three best and one equal performance). The worst performance was obtained from the use of C4.5 (one best and one equal). This is in complete contrast to the Bagging approach and it shows the differences between these ensemble methods. As for the other free parameters, the best results were given by the size of the subspaces set to 2, for 11 out of 15 datasets. When the size was set to 4, better results were obtained in only 3 cases out of 15. However, the experimental results showed that there is little to be gained by changing the number of subsets. Fixing this parameter returned good results in only 2 cases out of 15. It may be concluded that this approach should be discouraged.

In the case of the Random Forest algorithm, the number of trees in the ensemble and the number of features that are randomly chosen for tree induction can be manipulated. The approach with very small trees, consisting only of two random features, significantly outperformed the other Random Forest methods considered. Additionally, it worked well with larger ensembles (40–120). This stems from the fact that a larger number of weak predictors yields a more generalised model and at the same time introduces the much-needed diversity into the base classifier pool.⁹ When considering larger trees the probability that they are constructed of similar features increases and, therefore, the diversity decreases.

A set of rules/guidelines for end-users and practitioners on how to use the ensembles is now formulated.

Users should concentrate on the Random Forest and Rotation Forest algorithms. Those two approaches are most likely to deliver the best results. If the user does not have time to run the exhaustive test, then Bagging should be avoided.

For Random Forest, using large ensembles of small trees is most likely to return the best results. Small ensembles of large trees should be avoided. This approach is recommended when there are several features contained within the data.

For the Rotation Forest, the REPTree as a base algorithm returns the best results, especially when the dataset under consideration consists of a large number of objects. The size parameter is worth exploring, while fixing the number of subsets has little positive effect on the classification. This is important, as the implementation of the Rotation Forest provided in the Java language presented in Rodríguez et al.³² allows only a change of one of these parameters.

Random Forest gives the best performance for datasets with a high number of features. This is owing to the fact that it serves at the same time as a classifier and a feature selector and is able to cope with high-dimensionality problems.

Rotation Forest copes well with problems described by a small number of classes. This is thought to be due to its use of PCA, which discards some of the information (albeit the least important). When there are numerous classes, Random Forest should be tested first.
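The guidelines above can be summarised as a small decision helper. This is a minimal sketch rather than part of the reported experiments. The function and class names are hypothetical. The boolean flags are used because no numeric thresholds are given for what counts as "many", and giving a high feature count precedence over a low class count is an assumption made here for illustration.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Candidate:
    algorithm: str
    note: str


def shortlist(many_features: bool, many_classes: bool) -> List[Candidate]:
    """Order the candidate ensembles according to the guidelines.

    Both flags are hypothetical inputs: deciding what counts as "many"
    is left to the user, since the guidelines state no thresholds.
    """
    random_forest = Candidate(
        "Random Forest",
        "large ensemble (40-120 trees) of small trees, two random features",
    )
    rotation_forest = Candidate(
        "Rotation Forest",
        "REPTree as base classifier; explore the size parameter",
    )
    # Random Forest is tried first for high-dimensional or many-class data;
    # otherwise Rotation Forest leads. Bagging is deliberately left out.
    if many_features or many_classes:
        return [random_forest, rotation_forest]
    return [rotation_forest, random_forest]


if __name__ == "__main__":
    for candidate in shortlist(many_features=True, many_classes=False):
        print(candidate.algorithm, "-", candidate.note)
```

The returned order only indicates which ensemble to try first; both candidates remain worth testing when time allows.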

It should be borne in mind that the tests reported here cover only a small part of the wide variety of medical data. It is challenging to give a clear indication of which algorithm will always return the optimal results. Experience suggests, however, that it is advisable to test more than one setting before the final implementation. The experimental trials presented have shown that it is possible to significantly reduce the number of settings that need to be taken into consideration.
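Testing several settings from a reduced pool can be organised as a simple loop. The sketch below is illustrative only: the function names are hypothetical, the ensemble sizes 40, 80 and 120 are assumed sample points within the 40–120 range discussed above, and the scoring function (for example, a cross-validated accuracy) must be supplied by the user.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Setting = Dict[str, int]


def reduced_random_forest_grid(
    sizes: Sequence[int] = (40, 80, 120), n_features: int = 2
) -> List[Setting]:
    """Build a small pool of Random Forest settings (sizes are assumptions)."""
    return [{"n_trees": n, "n_features": n_features} for n in sizes]


def rank_settings(
    settings: Sequence[Setting], evaluate: Callable[[Setting], float]
) -> List[Tuple[float, Setting]]:
    """Score every setting with the user-supplied `evaluate`, best first."""
    scored = [(evaluate(setting), setting) for setting in settings]
    # Sort on the score only, so that the dictionaries are never compared.
    return sorted(scored, key=lambda pair: pair[0], reverse=True)
```

In practice `evaluate` would train the ensemble with the given setting and return its estimated accuracy on held-out data; the first element of the ranked list is then the setting to implement.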

Conclusions

The article has dealt with the problem of choosing an optimal set of parameters for the decision tree ensembles used in medical decision support tasks. It has been shown that the default settings proposed by the available open source data analysis software do not always give the best results. Relying on default settings alone limits the performance of such pattern recognition systems. It has been demonstrated that there is a need for a practical guide to the manual tuning of the ensemble’s parameters, which allows users to maximise the efficiency of their data analysis and gives greater insight into the mechanism of combining tree classifiers. Indeed, this was one of the premises which prompted the work, in which propositions have been evaluated on the basis of computer experiments carried out on diverse benchmark datasets. On the basis of the experimental results obtained, recommendations for the practical implementation of selected methods have been formulated. It is recognised, of course, that one type of setting can never outperform all others on all datasets ever encountered. It has been shown, however, that for the datasets considered, a significant number of settings which never gave good results could be discarded immediately. Another important finding is that, for the remaining pool of settings, certain dependencies exist between the type of data and the type of classifiers that could be used. Consequently, the proposed guidelines allow the practitioner to simplify the tuning task so that the ensemble can be flexibly adapted to the data at hand. It is considered that the proposed concept can be useful for tackling real-life medical decision support problems in which physicians and clinicians are required to routinely analyse data using standard available pattern recognition software.

Footnotes

Funding

This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.

References

1. Alpaydin. Introduction to Machine Learning. 2nd edn. Cambridge, MA: The MIT Press, 2010.

2. Wolpert. The supervised learning no-free-lunch theorems. In: Proc. 6th Online World Conference on Soft Computing in Industrial Applications 2001; 25–42.

3. Eom, Kim. A survey of decision support system applications (1995–2001). J Oper Res Soc 2006; 57: 1264–1278.

4. Sittig, Wright, Osheroff, Middleton, Teich, Ash. Grand challenges in clinical decision support. J Biomed Inform 2008; 41: 387–392.

5. Garg, Adhikari NKJ, McDonald, Rosas-Arellano, Devereaux, Beyene. Effects of computerized clinical decision support systems on practitioner performance and patient outcomes: A systematic review. J Am Med Assoc 2005; 293: 1223–1238.

6. Sleeman, Moss, Gyftodimos, Nicolson, Devereux. A comparison between clinical decisions made about lung cancer patients and those inherent in the corresponding Scottish Intercollegiate Guidelines Network (SIGN) guideline. Health Informatics J 2010; 16(4): 260–273.

7. Kudyba, Gregorio. Identifying factors that impact patient length of stay metrics for healthcare providers with advanced analytics. Health Informatics J 2010; 16(4): 235–245.

8. Hastie, Tibshirani, Friedman. The Elements of Statistical Learning. Data Mining, Inference, and Prediction. Springer Series in Statistics. New York: Springer Verlag, 2001.

9. Kuncheva. Combining Pattern Classifiers: Methods and Algorithms. Hoboken, New Jersey, USA: John Wiley & Sons, Inc., 2004.

10. Jain, Duin RPW, Mao. Statistical pattern recognition: A review. IEEE Trans Pattern Anal Mach Intell 2000; 22: 4–37.

11. Opitz, Maclin. Popular ensemble methods: An empirical study. J Artificial Intelligence Res 1999; 11: 169–198.

12. Šprogar, Lenič, Alayon. Evolution in medical decision making. J Med Syst 2002; 26: 479–489.

13. Ordonez. Comparing association rules and decision trees for disease prediction. Proceedings of HIKM 2006: International Workshop on Healthcare Information and Knowledge Management 2006; 17–24.

14. Rodríguez, García-Osorio, Maudes, Díez-Pastor. An experimental study on ensembles of functional trees. LNCS 2010; 5997: 64–73.

15. Abbott, Fishman, McMurray, Mor, Stone. Comparison of statistical methods for classification of ovarian cancer using mass spectrometry data. Bioinformatics 2003; 19(13): 1636–1643.

16. Díaz-Uriarte, Alvarez de Andrés. Gene selection and classification of microarray data using random forest. BMC Bioinformatics 2006; 7: Article 3.

17. Khalilia, Chakraborty, Popescu. Predicting disease risks from highly imbalanced data using random forest. BMC Med Inform Decision Mak 2011; 11: 51.

18. Ozcift, Gulten. Classifier ensemble construction with rotation forest to improve medical diagnosis performance of machine learning algorithms. Comput Method Program Biomed 2011; 104: 443–451.

19. Ozcift. SVM feature selection based Rotation Forest ensemble classifiers to improve computer-aided diagnosis of Parkinson disease. J Med Syst 2011; 1–7.

20. Dehzangi, Phon-Amnuaisuk, Manafi, Safa. Using rotation forest for protein fold prediction problem: An empirical study. LNCS 2010; 6023: 217–227.

21. Krawczyk. Classifier committee based on feature selection method for obstructive nephropathy diagnosis. Semantic Methods for Knowledge Management and Communication. Studies Computat Intell 2011; 381: 115–125.

22. Wilk, Woźniak. Combination of one-class classifiers for multiclass problems by fuzzy logic. Neural Network World 2010; 20: 853–869.

23. Duda, Hart, Stork. Pattern Classification. Hoboken, New Jersey, USA: John Wiley & Sons, Inc., 2001.

24. Burduk. Classification error in Bayes multistage recognition task with fuzzy observations. Pattern Analysis Appl 2010; 13: 85–91.

25. Burduk. The new upper bound on the probability of error in a binary tree classifier with fuzzy information. Neural Network World 2010; 20: 951–961.

26. Quinlan. C4.5: Programs for Machine Learning. San Mateo, USA: Morgan Kaufmann Publishers, 1993.

27. Quinlan. Induction of decision trees. Mach Learning 1986; 1: 81–106.

28. Quinlan. Simplifying decision trees. Int J Man-Machine Stud 1987; 27: 221–234.

29. Breiman, Friedman, Olshen, Stone. Classification and regression trees. Monterey, CA: Wadsworth and Brooks, 1984.

30. Breiman. Bagging predictors. Mach Learning 1996; 24: 123–140.

31. Breiman. Random forests. Mach Learning 2001; 45: 5–32.

32. Rodríguez, Kuncheva, Alonso. Rotation forest: A new classifier ensemble method. IEEE Trans Pattern Anal Mach Intell 2006; 28: 1619–1630.

33. Wozniak, Zmyslony. Combining classifiers using trained fuser – analytical and experimental results. Neural Network World 2010; 20: 925–934.

34. Frank, Asuncion. UCI machine learning repository, http://archive.ics.uci.edu/ml (2010, accessed 19 May 2012).

35. Alpaydin. Combined 5 x 2 cv F test for comparing supervised classification learning algorithms. Neural Comput 1999; 11: 1885–1892.

36. Holmes, Donkin, Witten IH. WEKA: A machine learning workbench. In: Australian and New Zealand Conference on Intelligent Information Systems – Proceedings, Brisbane, Qld, 1994, pp. 357–361.

On optimal settings of classification tree ensembles for medical decision support
