Abstract

One of the difficulties in using gene expression profiles to predict cancer is how to effectively select a few informative genes from thousands or tens of thousands of genes in order to construct accurate prediction models. We screen highly discriminative genes and gene pairs to create simple prediction models involving single genes or gene pairs, on the basis of a soft computing approach and rough set theory. Accurate cancer prediction is obtained when we apply the simple prediction models to four cancer gene expression datasets: CNS tumor, colon tumor, lung cancer and DLBCL. Some genes closely correlated with the pathogenesis of specific or general cancers are identified. In contrast with other models, our models are simple, effective and robust. Moreover, our models are interpretable because they are based on decision rules. Our results demonstrate that very simple models may perform well on molecular cancer prediction, and that important gene markers of cancer can be detected if the gene selection approach is chosen reasonably.

Keywords

gene expression profiles, cancer prediction, soft computing, rough set theory, feature selection, decision rules

Introduction

Conventional tumor diagnostic methods based on the morphological appearance of tumors are not always effective, as misdiagnoses often occur. On the other hand, a wide variety of studies have revealed cancer to be a disease involving dynamic changes in the genome. Therefore, using molecular markers of cancers might be an alternative approach to the diagnosis of tumors. The rapid advances in gene expression microarray technology, which enable the expression levels of tens of thousands of genes to be measured simultaneously in a single experiment, make the detection of molecular markers of cancer possible.¹ Since the pioneering work of Golub et al in applying gene expression monitoring by DNA microarray to cancer classification,² many investigations using microarray technology to build cancer diagnosis, prognosis or prediction classifiers have been conducted. In general, the major difficulty in this area is how to effectively identify the genes pertaining to the pathogenesis of specific cancers from extremely high-dimensional gene expression data, which often contain a large amount of noise caused by irrelevant genes. Moreover, compared with the number of genes whose expression levels are measured in an experiment, the number of samples is severely limited, which often affects prediction accuracy. In this extreme of very few observations on very many features, it is natural and perhaps essential to investigate feature selection and regularization methods.³ Feature selection, i.e. gene filtering, is particularly crucial for microarray-based cancer prediction since the number of genes irrelevant to prediction may be huge, and as long as feature selection is performed reasonably, accurate prediction is achieved with even the simplest of predictive models.⁴

Various methods of building cancer predictors have been proposed, such as clustering, SVMs (Support Vector Machines), k-NNs (k-Nearest Neighbours), ANNs (Artificial Neural Networks), GAs (Genetic Algorithms), NB (Naive Bayes), DTs (Decision Trees), RSs (Rough Sets) and EPs (Emerging Patterns). In this article, we explore the use of rule-based methods to construct cancer predictors, as rule-based methods are easily understood and therefore more likely to be accepted by biologists and clinicians. Approaches of this kind, such as DTs,⁵ RSs⁶ and EPs,⁷ have been commonly used to produce cancer predictors by many investigators.^7–14 In addition, we attempt to employ only one or two genes to conduct cancer prediction. The same problem has also been addressed by other investigators.^15,16

Our method is based on rough set theory, originally proposed by Pawlak in the early 1980s,⁶ which can be applied to the analysis of both precise and imprecise data.¹⁷ Rough set theory has been applied to cancer classification and prediction in several studies.^8–11 A majority of these studies conduct feature selection by attribute reduction, one core idea of rough set theory. However, to our knowledge, rough set attribute reduction is computationally expensive, and the resultant reducts may not be unique. Moreover, the reducts cannot ensure high prediction performance because there may be redundancy between the attributes in one reduct.¹⁸ To avoid the expensive cost of computing attribute reductions, we select the features (genes) with perfect attribute depended degree, a concept from rough set theory, and then create rule classifiers from the chosen genes instead of running attribute reductions. As it is very difficult to find single genes or gene pairs with perfect attribute depended degree in terms of the canonical definition, we extend the concept of attribute depended degree to the more flexible soft computing framework. Using the extended definition, we can detect single genes or gene pairs that have strong class discriminatory power but would be ignored if the conventional attribute depended degree standard were employed. Consequently, although the rules derived from the detected genes or gene pairs might not be absolutely true, they are comparatively reliable and able to perform effective prediction.
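As a rough illustration of the idea of relaxing the dependency criterion, the sketch below computes a relaxed dependency degree for one gene that has already been discretized. The definition used here is an assumption and is not taken from this section: a group of samples sharing the same attribute value counts toward the positive region when at least a fraction α of its members carry the same class label, and the degree is the fraction of all samples that fall in such groups. The function name `alpha_depended_degree` and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def alpha_depended_degree(values, labels, alpha):
    """Fraction of samples lying in groups that are at least alpha-pure.

    Assumed definition for illustration only (see the lead-in).
    """
    groups = defaultdict(list)
    for v, y in zip(values, labels):
        groups[v].append(y)              # samples with the same attribute value
    covered = 0
    for ys in groups.values():
        majority = Counter(ys).most_common(1)[0][1]
        if majority / len(ys) >= alpha:  # group is pure enough to count
            covered += len(ys)
    return covered / len(labels)

# Hypothetical toy data: one gene discretized into "low"/"high" for ten samples.
values = ["low"] * 5 + ["high"] * 5
labels = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]

print(alpha_depended_degree(values, labels, 1.0))  # strict criterion
print(alpha_depended_degree(values, labels, 0.8))  # relaxed criterion
```

With the strict criterion the "low" group is rejected because one of its five samples has the other label; with α = 0.8 it is accepted, which is the kind of gene the relaxed criterion is meant to recover.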

We apply our algorithm to four well-known gene expression datasets: central nervous system (CNS) tumor, colon tumor, lung cancer, and diffuse large B-cell lymphoma (DLBCL). They are available from the Kent Ridge Bio-medical Data Set Repository (http://datam.i2r.a-star.edu.sg/datasets/krbd/). We validate the efficacy of our method by leave-one-out cross-validation (LOOCV), and compare our results with previously published results. Furthermore, we examine and analyze the biological relevance of the selected genes.

Results

CNS Tumor Dataset

In this dataset, we first try to find single genes with high class discriminative power. When α is set to 0.9 or 0.85, no gene with α depended degree equal to 1 occurs in all 60 training sets; when α is set to 0.8, gene U28963_at occurs in 59 of the 60 training sets; when α is set to 0.75 and 0.7, two and six genes, respectively, occur in all 60 training sets. In every training set, each of the six genes gives rise to two decision rules, which are used to predict the test sample. The final prediction estimate is the average of the 60 test results. Table 1 shows the prediction results for the six genes. Subsequently, we seek gene pairs with strong class discriminative ability. When α is set to 0.9, no gene pair is detected; when α is set to 0.85, only one gene pair is detected; when α is reduced to 0.8, eleven gene pairs are found. In general, each gene pair produces four decision rules. We then apply the four decision rules to classify the test sample, and the average of the 60 test results is the prediction estimate for the gene pair. Table 2 shows the prediction results for the eleven gene pairs.
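The leave-one-out procedure described above can be sketched as follows for a single gene. This is a minimal illustration, not the learning algorithm used in this work: the way the cut point is chosen (the midpoint between adjacent training values that gives the fewest training errors) is an assumption, and the function names and expression values are hypothetical.

```python
def fit_threshold_rule(xs, ys):
    """Choose a cut point and the class assigned at or below it.

    Assumed learner: try every midpoint between adjacent distinct values
    and keep the first one with the fewest training errors.
    """
    pts = sorted(set(xs))
    best = None
    for cut in ((a + b) / 2 for a, b in zip(pts, pts[1:])):
        for low_cls in (0, 1):
            errs = sum((low_cls if x <= cut else 1 - low_cls) != y
                       for x, y in zip(xs, ys))
            if best is None or errs < best[0]:
                best = (errs, cut, low_cls)
    return best[1], best[2]

def loocv_correct(xs, ys):
    """Number of left-out samples predicted correctly."""
    correct = 0
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]      # leave sample i out
        train_y = ys[:i] + ys[i + 1:]
        cut, low_cls = fit_threshold_rule(train_x, train_y)
        pred = low_cls if xs[i] <= cut else 1 - low_cls
        correct += pred == ys[i]
    return correct

# Hypothetical expression levels of one gene and class labels for ten samples.
xs = [120, 200, 310, 390, 420, 450, 520, 610, 700, 880]
ys = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]

n = loocv_correct(xs, ys)
print(f"{n} of {len(xs)} correct ({n / len(xs):.0%})")
```

Each fold relearns its own cut point, so a sample lying close to the class boundary can be misclassified when it is the one left out, even if the full data are perfectly separable.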

Table 1.

6 genes with high prediction accuracy in the CNS tumor dataset.

Probe ID	Correctly-classified sample number (accuracy)	α
U28963_at	47 (78%)	0.75
X99050_rna1_at	45 (75%)	0.75
D83542_at	46 (77%)	0.7
S71824_at	50 (83%)	0.7
U37673_at	40 (67%)	0.7
D86974_at	45 (75%)	0.7

Table 2.

11 gene pairs with high prediction accuracy in the CNS tumor dataset.

1st – 2nd Probe ID	Correctly-classified sample number (accuracy)	α
D83542_at–S71824_at	54 (90%)	0.85
D31763_at–U08998_at	54 (90%)	0.8
D83542_at–X99050_rna1_at	49 (82%)	0.8
D83542_at–D86974_at	52 (87%)	0.8
L33243_at–U36448_at	52 (87%)	0.8
M73547_at–U74324_at	51 (85%)	0.8
M96739_at–U36448_at	54 (90%)	0.8
S71824_at–D86974_at	51 (85%)	0.8
U37143_at–D43682_s_at	48 (80%)	0.8
U79277_at–D43682_s_at	47 (78%)	0.8
X99050_rna1_at–D86974_at	49 (82%)	0.8

Here we denote the expression level of gene G by g(G). When the first sample is left out as the test set and the remaining samples are used to train the learning algorithm, the selected gene U28963_at gives rise to two decision rules:

• If g(U28963_at) ≤ 431, then Class 1;

• If g(U28963_at) > 431, then Class 0.

The two rules have 81% and 84% confidence, respectively. One can use the two rules to classify the test set. When a sample other than the first one is left out, gene U28963_at gives rise to two similar decision rules:

• If g(U28963_at) ≤ x, then Class 1;

• If g(U28963_at) > x, then Class 0.

Here x equals 431 or is close to it. In either case, the rules imply that if gene U28963_at is up-regulated in a CNS tumor patient, the patient is more likely to succumb to the disease. The other chosen genes give rise to rules of similar form.
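To make the single-gene rules concrete, the sketch below applies the pair of rules with the cut point 431 quoted above and computes a confidence for each rule. Confidence is assumed here to mean the fraction of training samples satisfying a rule's condition that also belong to the rule's class; the expression values and labels are hypothetical and are not taken from the CNS tumor dataset.

```python
def classify(expr, cut=431):
    # Rule pair from the text: at or below the cut -> Class 1, above -> Class 0.
    return 1 if expr <= cut else 0

def rule_confidence(xs, ys, cut=431):
    """Confidence of the "<= cut -> Class 1" and "> cut -> Class 0" rules."""
    low = [y for x, y in zip(xs, ys) if x <= cut]
    high = [y for x, y in zip(xs, ys) if x > cut]
    conf_low = low.count(1) / len(low)     # share of Class 1 at or below the cut
    conf_high = high.count(0) / len(high)  # share of Class 0 above the cut
    return conf_low, conf_high

# Hypothetical training data for illustration only.
xs = [150, 260, 300, 410, 431, 500, 640, 720, 815, 900]
ys = [1, 1, 1, 0, 1, 0, 0, 1, 0, 0]

print(classify(380), classify(560))
print(rule_confidence(xs, ys))
```

A confidence below 100% simply records that some training samples satisfy a rule's condition but carry the other class label, which is what the relaxed selection criterion tolerates.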

Likewise, when the first sample is left out for testing while the remaining samples are retained for training, the selected gene pair D83542_at–S71824_at generates four decision rules:

• If g(D83542_at) ≤ 280.5 and g(S71824_at) ≤ 434, then Class 1;

• If g(D83542_at) ≤ 280.5 and g(S71824_at) > 434, then Class 1;

• If g(D83542_at) > 280.5 and g(S71824_at) ≤ 434, then Class 1;

• If g(D83542_at) > 280.5 and g(S71824_at) > 434, then Class 0.

The four rules have 100%, 100%, 89% and 88% confidence, respectively. They can be simplified into three equivalent rules:

• If g(D83542_at) ≤ 280.5, then Class 1;

• If g(S71824_at) ≤ 434, then Class 1;

• If g(D83542_at) > 280.5 and g(S71824_at) > 434, then Class 0.

The three rules have 100%, 92% and 88% confidence, respectively. One can employ either the four rules or the three alternative rules to classify the test set. When a sample other than the first one is left out, gene pair D83542_at–S71824_at generates four similar decision rules. These rules indicate that if both D83542_at and S71824_at are highly expressed in a CNS tumor patient, the patient is very likely to succumb to the disease. Similar rules can be derived from the other chosen gene pairs.
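The equivalence of the four-rule and three-rule forms can be checked mechanically. The sketch below encodes both rule sets with the cut points quoted above and compares their outputs on a small grid of expression values that includes the boundary cases; the grid values and function names are hypothetical.

```python
from itertools import product

CUT_D, CUT_S = 280.5, 434  # cut points for D83542_at and S71824_at from the text

def four_rules(d, s):
    if d <= CUT_D and s <= CUT_S:
        return 1
    if d <= CUT_D and s > CUT_S:
        return 1
    if d > CUT_D and s <= CUT_S:
        return 1
    return 0  # remaining case: d > CUT_D and s > CUT_S

def three_rules(d, s):
    if d <= CUT_D:
        return 1
    if s <= CUT_S:
        return 1
    return 0  # both genes above their cut points

# Hypothetical expression values on either side of, and exactly at, each cut.
d_values = [100, 280.5, 281, 600]
s_values = [50, 434, 435, 900]

agree = all(four_rules(d, s) == three_rules(d, s)
            for d, s in product(d_values, s_values))
print(agree)
```

Both forms assign Class 0 only when both genes exceed their cut points, which is why the simplified set classifies every sample identically even though the confidence attached to each individual rule differs.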

Colon Tumor Dataset

Using the same learning algorithm for the dataset, we screen the genes and gene pairs with comparatively high prediction performance. The results are presented in Table 3 and Table 4. As before, decision rules can be induced by the selected genes or gene pairs.

Table 3.

21 genes with high prediction accuracy in the colon tumor dataset.

GenBank accession no.	Correctly-classified sample number (accuracy)	α
M63391	52 (84%)	0.8
M76378	50 (81%)	0.8
J02854	50 (81%)	0.8
M26383	52 (84%)	0.8
M76378	50 (81%)	0.75
T60155	48 (77%)	0.75
M22382	50 (81%)	0.75
X12671	49 (79%)	0.75
M76378	50 (81%)	0.75
T96873	47 (76%)	0.75
X86693	47 (76%)	0.75
J05032	50 (81%)	0.75
U25138	48 (77%)	0.75
T60778	47 (76%)	0.75
M91463	48 (77%)	0.75
R87126	51 (82%)	0.7
T51571	46 (74%)	0.7
T92451	48 (77%)	0.7
U09564	48 (77%)	0.7
R97912	45 (73%)	0.7
L41559	45 (73%)	0.7

Table 4.

16 gene pairs with high prediction accuracy in the colon tumor dataset.

1st – 2nd GenBank accession no.	Correctly-classified sample number (accuracy)	α
T51571–J02854	56 (90%)	0.9
J02854–L41559	56 (90%)	0.9
M76378–M63391	52 (84%)	0.85
M63391–M76378	52 (84%)	0.85
M63391–Z49269	45 (73%)	0.85
M63391–X86693	53 (85%)	0.85
Z50753–H40095	55 (89%)	0.85
R87126–H81068	55 (89%)	0.85
X12671–J02854	56 (90%)	0.85
X12671–M26383	54 (87%)	0.85
M76378–M26383	55 (89%)	0.85
H40095–M36634	54 (87%)	0.85
R97912–J02854	55 (89%)	0.85
R97912–M26383	54 (87%)	0.85
R06601–X63629	54 (87%)	0.85
M36634–H08393	56 (90%)	0.85

Lung Cancer Dataset

In this dataset, when α is set to 0.8, no gene is detected; when α equals 0.75, eight genes are detected; when α is reduced to 0.7, no more genes are found. To make the decision rules induced by the genes more reliable, we exclude genes with missing values. When α is set to 0.9, 0.85 or 0.8, no gene pair is found; when α is reduced to 0.75, eight gene pairs are detected. The results are presented in Table 5 and Table 6.

Table 5.

8 genes with high prediction accuracy in the lung cancer dataset.

Unigene ID	Correctly-classified sample number (accuracy)	α
505266^a	32 (82%)	0.75
Hs.95243	32 (82%)	0.75
Hs.25882	32 (82%)	0.75
Hs.275198	32 (82%)	0.75
36491^a	32 (82%)	0.75
Hs.170225	33 (85%)	0.75
Hs.17258	29 (74%)	0.75
Hs.11556	31 (79%)	0.75

^a The Unigene ID is not available.

Table 6.

8 gene pairs with high prediction accuracy in the lung cancer dataset.

1st – 2nd Unigene ID	Correctly-classified sample number (accuracy)	α
Hs.169611–Hs.285701	31 (79%)	0.75
Hs.285701–Hs.132415	29 (74%)	0.75
Hs.285701–Hs.57655	30 (77%)	0.75
Hs.57655–Hs.8595	31 (79%)	0.75
Hs.184542–Hs.58323	31 (79%)	0.75
Hs.262823–Hs.8595	31 (79%)	0.75
Hs.262480–Hs.772	32 (82%)	0.75
Hs.112193–505266^a	31 (79%)	0.75

DLBCL Dataset

In this dataset, when α is set to 0.7, four genes are selected; when α increases to 0.75, no gene is found. With respect to gene pairs, when α is set to 0.9 or 0.85, no gene pair is found; when α decreases to 0.8, 22 gene pairs are chosen. The results are presented in Table 7 and Table 8. Table 8 shows only 20 of the 22 gene pairs; the other two are omitted because of their low prediction accuracy.

Table 7.

4 genes with high prediction accuracy in the DLBCL dataset.

Probe ID	Correctly-classified sample number (accuracy)	α
U70663_at	44 (76%)	0.7
M17863_s_at	44 (76%)	0.7
U48865_s_at	43 (74%)	0.7
U90543_at	45 (78%)	0.7

Table 8.

20 gene pairs with high prediction accuracy in the DLBCL dataset.

1st – 2nd Probe ID	Correctly-classified sample number (accuracy)	α
AFFX-BioC-3_at–M95925_at	46 (79%)	0.8
AFFX-BioC-3_at–U70663_at	48 (83%)	0.8
AFFX-M27830_5_at–X70811_at	49 (84%)	0.8
AFFX-M27830_5_at–U46744_at	49 (84%)	0.8
AC002450_at–M95925_at	47 (81%)	0.8
AC002450_at–U48213_at	47 (81%)	0.8
AC002450_at–HG4020-HT4290_s_at	48 (83%)	0.8
M95925_at–X70811_at	46 (79%)	0.8
U23028_at–U70663_at	47 (81%)	0.8
U23028_at–X70811_at	48 (83%)	0.8
U51903_at–U70663_at	48 (83%)	0.8
U51903_at–X70811_at	47 (81%)	0.8
U66702_at–U70663_at	47 (81%)	0.8
U66702_at–HG4020-HT4290_s_at	48 (83%)	0.8
U66702_at–U90543_at	52 (90%)	0.8
U70663_at–U83908_at	47 (81%)	0.8
U70663_at–X83412_at	46 (79%)	0.8
U70663_at–X77777_s_at	47 (81%)	0.8
U70663_at–X16660_cds1_s_at	46 (79%)	0.8
U70663_at–U46744_at	47 (81%)	0.8

Comparison of Prediction Performance

CNS Tumor Dataset

This dataset is dataset C in the study by Pomeroy et al,¹⁹ which is used to analyze treatment outcome for patients with central nervous system embryonal tumors. In this dataset, our best prediction accuracies are 83% and 90% using one and two genes, respectively. Pomeroy et al¹⁹ use a k-NN algorithm to construct an outcome predictor based on gene expression. The reported statistically significant size of the k-NN models ranges from 2 to 21 genes, with the best prediction made by an 8-gene model that made 13/60 classification errors. Several other prediction algorithms, including weighted voting, SVMs and IBM SPLASH, are also tested in that study.¹⁹ Zhang et al²⁰ propose a hybrid approach, which combines the discernibility matrix, the filter strategy and the wrapper method to select gene sets. They then adopt the classifiers C4.5 and NaiveBayes to evaluate the prediction performance of the gene sets. Their prediction accuracy by LOOCV is 75% for C4.5 using 20 genes and 86.67% for NaiveBayes using 29 genes. Tan et al¹² use decision trees (Single C4.5, Bagging C4.5, AdaBoost C4.5) to perform prediction tasks on cancer microarray data including the CNS tumor dataset. They first employ Fayyad and Irani's²¹ discretization method to screen 74 genes for the actual learning process. Their highest prediction accuracy is 88% by tenfold cross-validation. The comparison of our method with the others is summarized in Table 9. The table shows that our results are better than almost all the other compared results from previous studies.

Table 9.

Comparison of best prediction accuracy for the CNS tumor dataset.

Methods (feature selection + classification)^b	# Selected genes	# Correctly-classified samples (accuracy)
α depended degree + decision rules	1	50 (83%)
[this work]	2	54 (90%)
Signal to noise ratios + k-NNs¹⁹	8	47 (78%)
Signal to noise ratios + Weighted voting¹⁹	1–200	46 (77%)
Signal to noise ratios + SVMs¹⁹	150	45 (75%)
Signal to noise ratios + SPLASH¹⁹	1–200	45 (75%)
Signal to noise ratios + TrkC¹⁹	1	40 (67%)
Signal to noise ratios + Staging¹⁹	1–200	41 (68%)
Signal to noise ratios + staging, k-NNs and TrkC¹⁹	1–200	48 (80%)
Signal to noise ratios + SVM, k-NNs and TrkC¹⁹	1–200	48 (80%)
HFW + C4.5²⁰	20	45 (75%)
HFW + NaiveBayes²⁰	29	52 (86.67%)
Discretization + Single C4.5¹²	74^c	51 (85%)^d
Discretization + Bagging C4.5¹²	74^c	53 (88%)^d
Discretization + AdaBoost C4.5¹²	74^c	53 (88%)^d

^b The methods include two parts: feature selection methods and classification methods. The decision tree classification methods are also involved in feature selection.

^c 74 is the number of genes retained for the actual learning process rather than the number of genes contained in the decision trees, which is not provided in the cited study.

^d Tenfold cross-validation accuracy is provided.

Colon Tumor Dataset

The dataset was first studied by Alon et al.²² They propose a two-way clustering approach that classifies genes into functional groups and classifies tissues based on the similarity of their gene expression. Since their original work, the dataset has been frequently investigated by others. In this dataset, our highest prediction accuracies are 84% and 90% using one and two genes, respectively. Table 10 compares the prediction results of our work with those of some other studies. The table shows that, although we use the fewest genes, our prediction accuracy is superior to or matches most of the others; the three methods that report higher accuracy (92% to 94%) use between three and nine genes.

Table 10.

Comparison of best prediction accuracy for the colon tumor dataset.

Methods (feature selection + classification)	# Selected genes	# Correctly-classified samples (accuracy)
α depended degree + decision rules	1	52 (84%)
[this work]	2	56 (90%)
HykGene + k-NNs, SVMs, C4.5, NB¹⁰⁷	3	57 (92%)
MAVE + logistic discrimination¹⁰⁸	50	52 (84%)
Clustering and rough sets attribute reduction + k-NNs¹⁰⁹	6	49 (79%)
Clustering and rough sets attribute reduction + NB¹⁰⁹	6	51 (82%)
Clustering and rough sets attribute reduction + C5.0¹⁰⁹	6	56 (90%)
MRMR + NB¹¹⁰	9	58 (94%)
RBF + C4.5¹¹¹	4	58 (94%)
ReliefF + C4.5¹¹¹	4	53 (85%)
CFS-SF + C4.5¹¹¹	26	55 (89%)

Lung Cancer Dataset

In this dataset, our highest prediction accuracies are 85% and 82% using one and two genes, respectively. For this dataset, apart from the original paper,²³ we find only the results reported by Zhang et al.²⁰ Table 11 presents the comparison between our method and theirs. Although their best prediction accuracy with the HFW feature selection approach is a little higher than ours, the number of genes they use far exceeds ours. As for the other feature selection approaches, including FCBF, CFS-SF and ReliefF, their prediction performance is inferior to ours.

Table 11.

Comparison of best prediction accuracy for the lung cancer dataset.

Methods (feature selection + classification)	# Selected genes	# Correctly-classified samples (accuracy)
α depended degree + decision rules	1	33 (85%)
[this work]	2	32 (82%)
HFW + C4.5²⁰	12	35 (90%)
HFW + NaiveBayes²⁰	18	35 (90%)
FCBF + C4.5²⁰	12	31 (79%)
FCBF + NaiveBayes²⁰	12	24 (62%)
CFS-SF + C4.5²⁰	13	26 (67%)
CFS-SF + NaiveBayes²⁰	13	24 (62%)
ReliefF + C4.5²⁰	12	24 (62%)
ReliefF + NaiveBayes²⁰	18	25 (64%)

DLBCL Dataset

In this dataset, our best prediction accuracies are 78% and 90% using one and two genes, respectively. Table 12 gives the comparison between our method and those reported in two earlier studies.^20,24 Our two-gene models achieve higher accuracy than all the compared methods, and our single-gene models match or exceed most of them.

Table 12.

Comparison of best prediction accuracy for the DLBCL dataset.

Methods (feature selection + classification)	# Selected genes	# Correctly-classified samples (accuracy)
α depended degree + decision rules	1	45 (78%)
[this work]	2	52 (90%)
Signal to noise ratios + Weighted voting²⁴	13	44 (76%)
Signal to noise ratios + k-NNs²⁴	9	41 (71%)
Gradient descent algorithm + SVMs²⁴	unknown^e	45 (78%)
HFW + C4.5²⁰	22	44 (76%)
HFW + NaiveBayes²⁰	19	50 (86%)
FCBF + C4.5²⁰	27	27 (47%)
FCBF + NaiveBayes²⁰	27	31 (53%)
ReliefF + C4.5²⁰	22	25 (43%)
ReliefF + NaiveBayes²⁰	19	31 (53%)

^e No related data is provided.

Analysis of Biological Relevance

CNS Tumor Dataset

In this dataset, we identify six genes with comparatively high prediction performance individually. The six genes are U28963_at, X99050_rna1_at, D83542_at, S71824_at, U37673_at, and D86974_at. According to the decision rules induced by the genes, we suspect that they are all over-expressed in the patients who succumb to their disease. As expected, three out of the six genes are picked as the markers of survival by Pomeroy et al.¹⁹ The three genes are referred to as GPS2 (U28963_at), beta-NAP (U37673_at) and KIAA0220 gene (D86974_at) respectively. Moreover, beta-NAP and KIAA0220 gene are the members of the 8-gene model by which k-NN makes optimal prediction. In addition, three genes named Human polyposis locus (DP1 gene), NSCL1 and VLCAD which compose the gene pairs with strong prediction power are also identified as markers of survival by Pomeroy et al.¹⁹

GPS2 encodes a protein involved in G protein-mitogen-activated protein kinase (MAPK) signaling cascades. The function of this gene may be signal repression. Zhang et al indicate that GPS2 interacts with another protein, RFX4_v3, to modulate transactivation of genes involved in brain morphogenesis.²⁵ Therefore, the dysregulation of GPS2 may be closely correlated with the pathogenesis of CNS tumors. Beta-NAP, a cerebellar degeneration antigen, is a neuron-specific vesicle coat protein.²⁶ NSCL1 is a gene expressed predominantly in the developing nervous system.²⁷ Our rules indicate that if the gene is over-expressed, the patient is more likely to succumb to the CNS tumor, which agrees with the previously reported observation.²⁷

Colon Tumor Dataset

In this dataset, we identify 21 genes, each of which yields relatively accurate prediction on its own. Some of these genes have been shown to be tightly linked with the pathogenesis of colon tumor or other tumors. Desmin is identified as one of three known hub cancer genes in a colon cancer-specific gene network.²⁸ Our rules indicate that the gene is down-regulated in colon tumor samples, a conclusion also reported elsewhere.²⁹ The gene CRP encodes a member of the cysteine-rich protein (CSRP) family. This gene family includes a group of LIM domain proteins, which may be involved in regulatory processes important for development and cellular differentiation. The LIM/double zinc-finger motif found in this gene product occurs in proteins with critical functions in gene regulation, cell growth, and somatic differentiation. This gene has been reported to be associated with several cancers.^30–32 MONAP belongs to the angiogenesis-related genes. Its overexpression is associated with the pathogenesis and progression of a variety of cancers.^33–37 Our rules imply that MONAP is up-regulated in colon tumor samples, which is consistent with the established notion. Moreover, like Desmin, MONAP is identified as one of three known hub cancer genes in the colon cancer-specific gene network.²⁸ hnRNP belongs to the subfamily of ubiquitously expressed heterogeneous nuclear ribonucleoproteins, which are associated with pre-mRNAs in the nucleus and appear to influence pre-mRNA processing and other aspects of mRNA metabolism and transport. Thus its dysregulation may contribute to the occurrence of cancers. Hevin encodes a protein that is implicated in tumor cell growth, differentiation and metastasis, and may act as a tumor suppressor.^38–44 Our rules show that if Hevin is down-regulated in a colon tissue sample, the sample is more likely to come from a colon tumor patient. This supports the view that Hevin is a tumor suppressor.
EF1R is associated with several functions including translation elongation, actin filament depolymerization, apoptosis, and ubiquitin-mediated protein degradation. Its role in oncogenesis has been investigated by some researchers.^45–49 Calgizzarin encodes a protein that belongs to the group of S100 proteins involved in the Ca²⁺ signaling network, and regulates intracellular activities such as cell growth and motility, cell cycle progression, transcription, and cell differentiation.^50,51 Chromosomal rearrangements and altered expression of this gene have been implicated in tumor metastasis. Calgizzarin has been characterized as a proteomic marker of colorectal cancer due to its significant up-regulation in colorectal carcinoma,⁵² and the same observation is reported in other studies.^53–55 Tanaka et al find, by large-scale random cDNA sequencing and Northern blot analysis, that the expression of human calgizzarin is remarkably elevated in colorectal cancers compared with normal colorectal mucosa.⁵⁶ Our rules show the same tendency: calgizzarin is over-expressed in colon tumors. Likewise, our rules indicate that TPM1 is down-regulated in colon tumor, which coincides with a reported finding.⁵⁷ Our rules indicate that PCBD1 is up-regulated in colon tumor, but very few reports in the literature describe the same result. Additionally, several genes tightly associated with colon tumor appear among the selected gene pairs. In our rules, if MIF (macrophage migration inhibitory factor) is up-regulated, then the sample tends to come from tumor tissue. A number of investigations have demonstrated that MIF promotes colon tumor and other cancers.^58–63 Thus, our rules conform to the documented evidence.
CDH3 has been found to be involved in a broad spectrum of cancers including colorectal cancer.^64–71 The gene is identified as an accurate prognostic indicator of several tumors due to its marked up-regulation in these tumors.^66,68,71,72 Our rules show that it is over-expressed in colon tumor as well.

In summary, many important genes relevant to the pathogenesis of colon tumor are marked by our method. The other identified up-regulated genes include Hsp60, Human serine kinase mRNA, IPL1, HYPOTHETICAL PROTEIN IN TRPE 3'REGION and COL11A2, while the down-regulated genes include MYL9, ACTIN, MaxiK, MGP, GLUT4, MYOSIN HEAVY CHAIN and HCC-1. Some of them have definite biological meaning while the others remain to be explored. We emphasize that the genes that distinguish tumor from normal tissue well include not only muscle-specific genes but also non-muscle-specific ones. This is in agreement with the finding reported by Alon et al.²² It also reflects the complexity of cancer pathogenesis.

Lung Cancer Dataset

In this dataset, we identify eight genes with comparatively strong prediction power individually. Our rules reveal that reduced expression of each gene is correlated with poor prognosis of the cancer. Because five of the eight genes have no annotation available in the raw dataset, we discuss only the other three genes: TCEAL1, GEMIN5 and TMPO. TCEAL1, also named p21, which belongs to the Cip/Kip family of cyclin-dependent kinase inhibitors, has been identified as a gene whose product is tightly associated with the development and metastasis of several cancers.^73–77 Direct and indirect evidence suggests that a decrease in the expression level of the gene might promote tumor formation and progression and lead to a bad prognosis. GEMIN5 encodes a protein that is part of a large macromolecular complex, localized to both the cytoplasm and the nucleus, that plays a role in the cytoplasmic assembly of small nuclear ribonucleoproteins (snRNPs). Lee et al⁷⁸ suggest that Gemin5 overexpression inhibits tumor cell motility and may thereby play a role in suppressing metastatic progression. This conforms to our rules. We have not found any evidence indicating that the expression level of TMPO is correlated with the prognosis of cancers, but there are investigations showing that the gene is deregulated in various human tumors.^79,80

In addition, we marked eight gene pairs with good prediction performance. Apart from the non-annotated genes, the genes involved include SMAC, PFDN2, FLJ10829, LOC51646, FLJ10326, FLJ12438, GYS10.145 and MSH5. Our rules imply that decreased expression of these genes indicates a poor prognosis (relapse or metastasis) for NSCLC patients. SMAC encodes an inhibitor of apoptosis protein (IAP)-binding protein. A wide variety of investigations have revealed that low expression levels of SMAC correlate with a worse prognosis in many tumor types including NSCLC.^81–92 At the same time, some researchers propose treating cancers by enhancing SMAC expression in tumor cells.^83,85–87,89 MSH5 encodes a member of the mutS family of proteins, which are involved in DNA mismatch repair (MMR) or meiotic recombination processes. It is a strong candidate for lung cancer susceptibility, as deficiency of MMR has been documented to have a role in lung cancer.⁹³ Hence, it is quite possible that down-regulation of the gene leads to an unfavorable clinical outcome.

DLBCL Dataset

In this dataset, we marked four genes with relatively good prediction ability individually. The four genes are EZF, IGF2, CEBPE and BTF1. Our rules indicate that elevated expression of EZF, CEBPE or BTF1 may indicate a worse prognosis of DLBCL, while abundant expression of IGF2 implies a better prognosis. IGF2 has also been identified as a positive indicator of DLBCL prognosis.⁹³ Although previous investigations indicate that these genes are involved in cancer pathogenesis, further biological insights remain to be clarified.

Some genes in the gene pairs we selected in this dataset may have important biological relevance. DBP is responsible for high, tissue-specific expression of albumin in fully differentiated hepatocytes; it is expressed by adult but not fetal liver cells, and is quickly down-regulated in proliferating hepatocytes.⁹⁴ Our rules indicate that if the gene is down-regulated in a DLBCL patient, the patient is inclined to have a favorable prognosis. This appears plausible. TGM2 encodes an enzyme that catalyzes the crosslinking of proteins and appears to be involved in apoptosis. Oudejans et al point out that differences in apoptosis resistance between DLBCL samples are linked with distinct clinical outcomes.⁹⁵ Since abundant expression of TGM2 promotes the induction of apoptosis, up-regulation of the gene might indicate a good prognosis. Our rules reflect this tendency. In addition, Mishra et al⁹⁶ suggest that TGM2 modification of the p53 oncoprotein could be an additional mechanism whereby TGM2 facilitates apoptosis. Mangala et al⁹⁷ hold that TGM2-induced alterations in the extracellular matrix could effectively inhibit the process of metastasis. Xu et al⁹⁸ argue that TGM2 acts as an inhibitor of tumor progression in combination with another gene. PDCD4 encodes a protein localized to the nucleus in proliferating cells, which is thought to play a role in apoptosis, but its specific role has not yet been determined. Our rules imply that decreased expression of the gene is associated with a good prognosis.
This appears to contradict some previous reports;^99–103 however, Lankat-Buttgereit et al point out that the function of Pdcd4 might be cell type specific and that a role for Pdcd4 in apoptosis or as a tumor suppressor might be limited to certain cell types.¹⁰⁴ The other identified genes, such as HRES-1, DTNA, VIPR1, BTF1, HAB1, PTPRN2, EIF2B and IQGAP2, overall possess strong class discriminative power, while the biological mechanisms by which they indicate the clinical outcome of DLBCL or other tumors remain unclear.

Conclusion

Using gene expression patterns to classify or predict cancer often faces a dilemma: genes (features) far outnumber samples (instances), which leads to poor prediction performance if the model is not chosen reasonably. Another concern is the interpretability of the prediction model, which matters when biologists and clinicians assess the investigation. Here we employ feature selection to overcome the first difficulty and decision rules to handle the second. We propose a feature selection method based on the depended degree, a concept from rough set theory. As the canonical definition of the depended degree is too stringent to perform feature selection well, we extend its definition in a soft computing framework. We define the concept of α depended degree, with which we are able to screen highly discriminative features. Additionally, our work is in accordance with the principle of Occam's razor: when deciding among many models that make equivalent predictions, choose the simplest one. For this purpose, we use only single genes or gene pairs to build decision rules, which are then used to predict cancer. The results demonstrate that our models work well, in that the selected single genes and gene pairs overall give accurate predictions, and some biologically significant genes are identified. In general, our method is simpler and more interpretable than most previously proposed approaches, since our model is based on rules and our rules are created from very few genes. Moreover, our model is robust, as we are able to tune our parameters to suit different datasets. Indeed, through comparison, we find that our method outperforms or at least matches other algorithms in simplicity and efficacy.
It is not strange at all that one or two-gene models are able to result in accurate cancerous prediction because the single genes or gene pairs possibly are the biological or clinical indicators of some specific cancer or general cancer. It appears that one or two gene prediction models are overly simple in that the routine belief is that cancerous pathogenesis is involved in complex systems composed of multi-genes. Whereas our models do not violate the habitual notion in that we have various genes or gene pairs which can cause accurate prediction individually so as to be regarded as candidate markers of cancer. In contrast, some prediction models are not applicable for they contain too many parameters (genes) so that overfitting happens easily. Similar idea is expressed in^{4,7,13,15,105} as well. Another advantage of our models is that significant biomarkers can be identified with ease thanks to the operation of few genes once while it is hard to assess which gene is more important by multi-gene models for they run on the basis of a group of genes.

We test our method on four gene expression datasets: CNS tumor, colon tumor, lung cancer and DLBCL. In each dataset, we identify several important genes with documented biological relevance to the malignancy or the cell type. In the CNS tumor dataset, significant genes such as GPS2, beta-NAP, KIAA0220 and NSCL1 are identified. In the colon tumor dataset, we succeed in selecting genes highly related to colon tumor or other tumors. They include Desmin, CRP, MONAP, hnRNP, Hevin, EF1R, calgizzarin, TPM1, PCBD1 and MIF, of which calgizzarin has been emphasized as a proteomic marker of colorectal cancer.⁵² In the lung dataset, we mark genes such as TCEAL1, GEMIN5, TMPO, SMAC and MSH5, which are associated with the pathogenesis and progression of a variety of cancers. In the DLBCL dataset, IGF2, DBP, TGM2, PDCD4 and others are identified. Their close relationship with tumor occurrence, progression, metastasis and relapse has been widely explored.

Generally speaking, most of the genes associated with tumors encode proteins involved in cell growth, motility and differentiation, apoptosis, angiogenesis, metabolism, chromosomal rearrangement and translocation, and immune reaction. It is worth noting that although a few markers may be particular to a specific tumor, the majority of tumor markers might be shared by several tumors. In addition, the repressor of one tumor may act as the promoter of another. It is also possible that a gene that enhances a tumor at one stage turns into an inhibitor of the same tumor at another stage.

Another issue in the molecular prediction of cancer is whether the prediction performance of a gene or gene set is proportional to its biological interest. We identify some genes that have strong prediction power while their biological or clinical involvement remains unclear. Are these genes indeed correlated with the pathogenesis of cancer, or is their predictive power merely a coincidence? This is an important problem that deserves further investigation.

In summary, our method uses very few genes to build rule classifiers of cancer. These classifiers achieve comparatively accurate prediction. Tests on four gene expression datasets show the efficacy of our method to be satisfactory. Our follow-up study will examine the method on more microarray data, including multi-class datasets. In addition, we plan to design more powerful and robust rule classifiers in conjunction with other machine learning algorithms.

Methods and Materials

Rough Sets

In practice, when we are faced with a large body of data, we often want to describe it in terms of existing knowledge. However, most data cannot be precisely defined by known knowledge. Thus, in rough set theory, Pawlak describes ill-defined data by means of two concepts, the upper approximation and the lower approximation, both based on an equivalence relation, which is also regarded as a piece of knowledge about the studied object set.

Definition 1 Let U be a universe of discourse, X ⊆ U, and R an equivalence relation on U. U/R represents the set of equivalence classes of U induced by R. R_*X, R^*X, br(R, X), pos(R, X) and neg(R, X) represent the lower approximation, upper approximation, boundary region, positive region and negative region of X on R in U, respectively, where

\begin{matrix} R_{*} X = \bigcup \{ Y \in U / R \mid Y \subseteq X \}, \\ R^{*} X = \bigcup \{ Y \in U / R \mid Y \cap X \neq \emptyset \}, \\ pos (R, X) = R_{*} X, \\ br (R, X) = R^{*} X - R_{*} X, \\ neg (R, X) = U - R^{*} X . \end{matrix}

If R^*X = R_*X, then X is called definable, or the precise set on R; otherwise X is called indefinable, or the rough set on R.⁶
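The regions in Definition 1 can be illustrated with a minimal Python sketch. The function names, the toy universe `U`, the attribute `level` and the target set `X` are hypothetical and chosen only for illustration; the equivalence relation is assumed to be "has the same value under `key`".

```python
from collections import defaultdict


def equivalence_classes(universe, key):
    """Return U/R, where x R y holds exactly when key(x) == key(y)."""
    blocks = defaultdict(set)
    for x in universe:
        blocks[key(x)].add(x)
    return list(blocks.values())


def approximations(universe, key, target):
    """Return the lower, upper, boundary and negative regions of `target`."""
    target = set(target)
    lower, upper = set(), set()
    for block in equivalence_classes(universe, key):
        if block <= target:   # Y is a subset of X: contributes to the lower approximation
            lower |= block
        if block & target:    # Y overlaps X: contributes to the upper approximation
            upper |= block
    return lower, upper, upper - lower, set(universe) - upper


# Hypothetical toy data: six samples described by one discretized attribute.
U = [1, 2, 3, 4, 5, 6]
level = {1: "low", 2: "low", 3: "mid", 4: "mid", 5: "high", 6: "high"}
X = {1, 2, 3}

lower, upper, boundary, negative = approximations(U, level.get, X)
is_rough = lower != upper  # X is rough on R when the two approximations differ
```

In this toy case the "mid" class straddles X, so it falls in the boundary region and X is a rough set on R.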

The data studied by rough set theory are mainly organized in the form of decision tables. A decision table can be represented as S = (U, A = C ∪ D), where U is the set of samples, C the condition attribute set and D the decision attribute set. In the decision table, we define the function I_a that maps a member (sample) of U to the value of that member on the attribute a (a ∈ A), and the equivalence relation R(A') induced by the attribute subset A' ⊆ A as follows: for x, y ∈ U, xR(A')y if and only if I_a(x) = I_a(y) for each a ∈ A'.

Pawlak¹⁷ defines a decision logic language (DLL) for a decision table S = (U, A = C ∪ D) as follows: each (a, v) is an atomic formula, where a ∈ A and v ∈ V_a (the set of all values of a); if ϕ and ψ are formulas, then so are ¬ϕ, ϕ∧ψ, ϕ∨ψ, ϕ→ψ, and ϕ↔ψ. The semantics of DLL are defined through the model of decision tables. The satisfiability of a formula ϕ by an object x in S, denoted by $x ⊨_{S} ϕ$, or $x ⊨ ϕ$ for short if S is understood, is defined by the following conditions:

(1) x ⊨ (a, v) if and only if I_a(x) = v,

(2) x ⊨ ¬ϕ if and only if x ⊭ ϕ,

(3) x ⊨ ϕ∧ψ if and only if x ⊨ ϕ and x ⊨ ψ,

(4) x ⊨ ϕ∨ψ if and only if x ⊨ ϕ or x ⊨ ψ,

(5) x ⊨ ϕ→ψ if and only if x ⊨ ¬ϕ∨ψ,

(6) x ⊨ ϕ↔ψ if and only if x ⊨ ϕ→ψ and x ⊨ ψ→ϕ.

We call the set m_S(ϕ) = {x ∈ U | x ⊨_S ϕ} the meaning of formula ϕ in decision table S. m_S(ϕ) is simply written as m(ϕ) if S is understood. Conversely, we call ϕ a description of the object set m(ϕ). The following properties hold:

(a) m((a, v)) = {x ∈ U | I_a(x) = v},

(b) m(¬ϕ) = ∼m(ϕ),

(c) m(ϕ∧ψ) = m(ϕ) ∩ m(ψ),

(d) m(ϕ∨ψ) = m(ϕ) ∪ m(ψ),

(e) m(ϕ→ψ) = ∼m(ϕ) ∪ m(ψ),

(f) m(ϕ↔ψ) = (m(ϕ) ∩ m(ψ)) ∪ (∼m(ϕ) ∩ ∼m(ψ)).
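As an illustration of how the meaning of a DLL formula can be evaluated on a decision table, the following Python sketch represents formulas as nested tuples. This encoding, the function name `meaning` and the four-row toy table are assumptions made for the example, not part of the original method. The biconditional is evaluated as "both sides hold, or neither does", which is what satisfiability condition (6) implies.

```python
def meaning(formula, table):
    """Return m(formula) as a set of row indices of `table`.

    Formulas are nested tuples (a hypothetical encoding):
    ("atom", a, v), ("not", f), ("and", f, g), ("or", f, g),
    ("imp", f, g), ("iff", f, g).
    """
    universe = set(range(len(table)))
    op = formula[0]
    if op == "atom":
        _, attribute, value = formula
        return {i for i in universe if table[i][attribute] == value}
    if op == "not":
        return universe - meaning(formula[1], table)
    if op not in ("and", "or", "imp", "iff"):
        raise ValueError(f"unknown connective: {op}")
    left = meaning(formula[1], table)
    right = meaning(formula[2], table)
    if op == "and":
        return left & right
    if op == "or":
        return left | right
    if op == "imp":
        return (universe - left) | right
    # "iff": both sides hold, or neither does
    return (left & right) | ((universe - left) & (universe - right))


# Hypothetical toy decision table with one gene and a class attribute.
table = [
    {"g": "high", "cls": "tumor"},
    {"g": "high", "cls": "normal"},
    {"g": "low", "cls": "normal"},
    {"g": "low", "cls": "tumor"},
]
phi = ("atom", "g", "high")
psi = ("atom", "cls", "tumor")

m_iff = meaning(("iff", phi, psi), table)
m_both_directions = meaning(("and", ("imp", phi, psi), ("imp", psi, phi)), table)
```

Here `m_iff` and `m_both_directions` coincide, which is a quick consistency check between condition (6) and the set form of the biconditional.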

In rough set theory, the depended degree of an attribute subset P by an attribute subset Q is denoted by γ_P(Q) and is defined as

γ_{P} (Q) = \frac{| {POS}_{P} (Q) |}{| U |},

where $| {POS}_{P} (Q) | = | \bigcup_{X \in U / R (Q)} pos (P, X) |$ represents the size of the union of the positive regions of the equivalence classes in U/R(Q) on P in U, and |U| represents the size of U (the set of samples).

If Q is the decision attribute D and P a subset of the condition attributes, then γ_P(D) indicates the depended degree of the condition attribute subset P by the decision attribute D, that is, the degree to which P can discriminate the distinct classes of D. Thus, γ_P(D) reflects the classification power of the subset P of condition attributes. The greater γ_P(D) is, the stronger the classification ability of P tends to be.
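The depended degree can be computed directly from its definition. The following is a minimal Python sketch, assuming that samples are stored as dictionaries and that attribute subsets are given as lists of keys; the gene names `g1`, `g2` and the five-row table are hypothetical.

```python
from collections import defaultdict


def partition(samples, attrs):
    """Return the equivalence classes U/R(attrs) as sets of sample indices."""
    blocks = defaultdict(set)
    for i, row in enumerate(samples):
        blocks[tuple(row[a] for a in attrs)].add(i)
    return list(blocks.values())


def depended_degree(samples, P, Q):
    """Fraction of samples whose P-class lies entirely inside one Q-class."""
    q_blocks = partition(samples, Q)
    positive = set()
    for block in partition(samples, P):
        if any(block <= q for q in q_blocks):
            positive |= block
    return len(positive) / len(samples)


# Hypothetical discretized decision table.
table = [
    {"g1": "low", "g2": "a", "cls": "tumor"},
    {"g1": "low", "g2": "b", "cls": "tumor"},
    {"g1": "high", "g2": "a", "cls": "normal"},
    {"g1": "high", "g2": "b", "cls": "normal"},
    {"g1": "high", "g2": "b", "cls": "tumor"},
]

gamma_g1 = depended_degree(table, ["g1"], ["cls"])          # 2 of 5 samples
gamma_g2 = depended_degree(table, ["g2"], ["cls"])          # no pure class
gamma_pair = depended_degree(table, ["g1", "g2"], ["cls"])  # 3 of 5 samples
```

In this toy table, g1 alone separates only the "low" samples, g2 alone separates nothing, and the pair separates three of the five samples.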

Rough set theory tries to discover the simplest decision rules with the equivalent explaining power and classification performance as more complicated rules. One decision rule with the form of “A ⇒ B” indicates that “if A, then B”, where A is the description of condition attributes and B the description of decision attributes. The confidence of a decision rule A ⇒ B is defined as:

confidence (A \Rightarrow B)= \frac{support(A \land B)}{support(A)},

where support(A) denotes the proportion of the samples satisfying A, and support(A ∧ B) the proportion of the samples satisfying A and B simultaneously. According to the DLL, the confidence of a decision rule A ⇒ B can be rewritten as:

confidence (A \Rightarrow B)= \frac{| m (A) \cap m (B) |}{| m (A) |} .

The confidence of a decision rule indicates how reliable the rule is. If a decision rule has 100% confidence, we call it a consistent decision rule.
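The confidence formula can be sketched in a few lines of Python. The sketch assumes that A and B are conjunctions of (attribute, value) atoms, written here as dictionaries; the names `meaning_of` and `confidence` and the toy table are hypothetical.

```python
def meaning_of(samples, description):
    """m(A) for a conjunction of (attribute, value) atoms given as a dict."""
    return {
        i
        for i, row in enumerate(samples)
        if all(row[a] == v for a, v in description.items())
    }


def confidence(samples, A, B):
    """confidence(A => B) = |m(A) & m(B)| / |m(A)|."""
    m_a = meaning_of(samples, A)
    if not m_a:
        raise ValueError("no sample satisfies the rule condition")
    return len(m_a & meaning_of(samples, B)) / len(m_a)


# Hypothetical discretized decision table.
table = [
    {"g1": "low", "cls": "tumor"},
    {"g1": "low", "cls": "tumor"},
    {"g1": "high", "cls": "normal"},
    {"g1": "high", "cls": "normal"},
    {"g1": "high", "cls": "tumor"},
]

conf_low = confidence(table, {"g1": "low"}, {"cls": "tumor"})     # consistent rule
conf_high = confidence(table, {"g1": "high"}, {"cls": "normal"})  # 2 of 3 samples
```

The first rule is consistent in the sense above, while the second holds for only two of the three samples that satisfy its condition.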

In previous studies that classify cancer from gene expression profiles using rough set theory, the depended degree is often used as the basis for ranking genes.⁹,¹⁰ However, as the canonical definition of the depended degree is overly stringent, it sometimes fails to express the discriminatory power of features properly. Hence, we extend the definition of the depended degree under soft computing considerations.

Definition 2 Let U be a universe of discourse, $X \subseteq U$, $0 \leq α \leq 1$, and R an equivalence relation on U. pos(R, X, α), representing the α positive region of X on R in U, is defined as:

pos (R, X, α) = \bigcup \{ Y \in U / R \mid | Y \cap X | / | Y | \geq α \} .

Correspondingly, the α depended degree of an attribute subset P by an attribute subset Q, denoted by γ_P(Q, α), is defined as:

γ_{P} (Q, α) = \frac{| {POS}_{P} (Q, α) |}{| U |},

where $| {POS}_{P} (Q, α) | = | \bigcup_{X \in U / R (Q)} pos (P, X, α) |$ represents the size of the union of the α positive regions of the equivalence classes in U/R(Q) on P in U.

Clearly, the α depended degree generalizes the depended degree: when α equals 1, the two definitions are equivalent. We use the α depended degree instead of the depended degree as the basis for screening features. Once the α value is determined, we choose only the genes or gene pairs whose γ_P(D, α) value equals 1 to build classification (decision) rules. Suppose g is one of the selected genes and U is the sample set. U/R(g) = {c₁(g), c₂(g), …, c_n(g)} represents the set of equivalence classes of samples induced by R(g). Two samples s₁ and s₂ belong to the same equivalence class of U/R(g) if and only if they have the same value on g. In addition, we represent the set of equivalence classes of samples induced by R(D) as U/R(D) = {d₁(D), d₂(D), …, d_m(D)}, where D is the class (decision) attribute. Two samples s₁ and s₂ belong to the same equivalence class of U/R(D) if and only if they have the same value on D. For each c_i(g) (i = 1, 2, …, n), if there exists some d_j(D) (j ∈ {1, 2, …, m}) satisfying $| c_{i} (g) \cap d_{j} (D) | / | c_{i} (g) | \geq α$, then we generate the classification rule A(c_i(g)) ⇒ B(d_j(D)), where A(c_i(g)) is the formula describing the sample set c_i(g) by its g value and B(d_j(D)) is the formula describing the sample set d_j(D) by its class value. For gene pairs, we construct classification rules by the same strategy. We emphasize that only the single genes or gene pairs selected in all the leave-one-out training sets are used to build classification rules.
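The selection and rule-generation steps described above can be sketched as follows. This is an illustrative Python sketch rather than the authors' implementation: the function names and the ten-sample toy table are hypothetical, and when more than one class could satisfy the threshold (possible only for α ≤ 0.5) the sketch simply keeps the majority class, which is an assumption of the example.

```python
from collections import Counter, defaultdict


def partition(samples, attrs):
    """Group sample indices by their values on `attrs` (the classes of U/R(attrs))."""
    blocks = defaultdict(set)
    for i, row in enumerate(samples):
        blocks[tuple(row[a] for a in attrs)].add(i)
    return blocks


def alpha_depended_degree(samples, P, decision, alpha):
    """Fraction of samples lying in the alpha positive region of some class."""
    decision_blocks = list(partition(samples, [decision]).values())
    positive = set()
    for block in partition(samples, P).values():
        if any(len(block & d) / len(block) >= alpha for d in decision_blocks):
            positive |= block
    return len(positive) / len(samples)


def induce_rules(samples, P, decision, alpha):
    """Return (condition, class label, confidence) for each qualifying P-class."""
    rules = []
    for key, block in partition(samples, P).items():
        counts = Counter(samples[i][decision] for i in block)
        label, hits = counts.most_common(1)[0]  # assumption: keep the majority class
        if hits / len(block) >= alpha:
            rules.append((dict(zip(P, key)), label, hits / len(block)))
    return rules


# Hypothetical toy table: gene g is "low" in 4 tumor and 1 normal sample,
# and "high" in 5 normal samples.
rows = (
    [{"g": "low", "cls": "tumor"}] * 4
    + [{"g": "low", "cls": "normal"}]
    + [{"g": "high", "cls": "normal"}] * 5
)

strict = alpha_depended_degree(rows, ["g"], "cls", 1.0)   # canonical definition
relaxed = alpha_depended_degree(rows, ["g"], "cls", 0.8)  # tolerates one exception
rules = induce_rules(rows, ["g"], "cls", 0.8)
```

With α = 1 the gene would be rejected because one "low" sample is normal, whereas α = 0.8 selects it and yields one rule per gene value, each with confidence of at least 0.8.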

The confidences of the rules generated by our approach depend on α. The following theorem states the relationship between α and the confidences of the induced rules.

Theorem 1 The confidence of each decision rule induced by our method is no less than α.

Proof. For any condition attribute subset P of size one or two, P is chosen by our method if γ_P(D, α) = 1. Suppose the decision rule A ⇒ B is produced from P. Then, by construction, m(A) ∈ U/R(P), m(B) ∈ U/R(D) and $| m (A) \cap m (B) | / | m (A) | \geq α$. As confidence$(A \Rightarrow B) = | m (A) \cap m (B) | / | m (A) |$, the conclusion follows.

Therefore, by tuning α value, we can not only control the size of the set of selected single genes or gene pairs, but also ensure the confidence of derived decision rules.

For the cancer classification problem, each collected microarray dataset can be represented as a decision table of the form shown in Table 13. In the microarray data decision table, there are m samples and n genes. Every sample is assigned one class label, and g(x, y) represents the expression level of gene y in sample x.

Table 13.

Microarray data decision table.

Samples	Condition attributes (genes)				Decision attributes (classes)
	Gene 1	Gene 2	…	Gene n	Class label
1	g(1, 1)	g(1, 2)	…	g(1, n)	Class (1)
2	g(2, 1)	g(2, 2)	…	g(2, n)	Class (2)
…	…	…	…	…	…
…	…	…	…	…	…
m	g(m, 1)	g(m, 2)	…	g(m, n)	Class (m)

Dataset

CNS Tumor Dataset

The dataset is about patient outcome prediction for central nervous system embryonal tumor.¹⁹ In this dataset, there are 60 observations, each of which is described by the gene expression levels of 7129 genes and a class attribute with two distinct labels—Class 1 (survivors) versus Class 0 (failures). Survivors are patients who are alive after treatment while the failures are those who succumbed to their disease. Among 60 patient samples, 21 are labeled as “Class 1” and 39 are labeled as “Class 0”.

Colon Tumor Dataset

The dataset contains 62 samples collected from colon-cancer patients.²² Among them, 40 biopsies are from tumors (labeled as “negative”) and 22 normal biopsies (labeled as “positive”) are from healthy parts of the colons of the same patients. Each sample is described by 2000 genes.

Lung Cancer Dataset

The dataset contains 39 NSCLC (Non-Small Cell Lung Cancer) samples, 24 of which are from patients with metastasis (labeled as “relapse”) and 15 of which are from patients who are disease-free according to both clinical and radiological testing (labeled as “non-relapse”).²³ The total number of genes is 2880.

DLBCL Dataset

The dataset is about patient outcome prediction for DLBCL.²⁴ The 58 DLBCL samples comprise 32 from cured patients (labeled as ‘cured’) and 26 from refractory patients (labeled as ‘fatal’). The gene expression profile contains 6817 genes.

Table 14 summarizes the four gene expression datasets.

Table 14.

Summary of the four gene expression datasets.

Dataset	# Original genes	Class	# Samples
CNS Tumor	7129	Class 1/Class 0	60 (21/39)
Colon Tumor	2000	negative/positive	62 (40/22)
Lung Cancer	2880	relapse/non-relapse	39 (24/15)
DLBCL	6817	cured/fatal	58 (32/26)

Data Preprocessing

As there are a few missing attribute values in the lung cancer dataset, we first fill each of them with the mean of all the values of that attribute over the samples belonging to the same class as the sample containing the missing value.
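The imputation step can be sketched for a single gene as follows. This is a minimal Python sketch under the assumption that missing values are encoded as `None` and that every class has at least one observed value for the gene; the function name and the toy data are hypothetical.

```python
def impute_class_mean(values, labels):
    """Replace each None in `values` with the mean of the observed values
    of the same class (assumes every class has at least one observed value)."""
    sums, counts = {}, {}
    for value, label in zip(values, labels):
        if value is not None:
            sums[label] = sums.get(label, 0.0) + value
            counts[label] = counts.get(label, 0) + 1
    return [
        sums[label] / counts[label] if value is None else value
        for value, label in zip(values, labels)
    ]


# Hypothetical expression levels of one gene across five samples.
expression = [1.0, None, 3.0, 10.0, None]
outcome = ["relapse", "relapse", "relapse", "non-relapse", "non-relapse"]

filled = impute_class_mean(expression, outcome)
```

The missing "relapse" value is filled with the mean of the two observed "relapse" values, and the missing "non-relapse" value with the single observed "non-relapse" value.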

Because rough set theory is suited to discrete attributes, we discretize all the training set decision tables. We use the entropy-based discretization method proposed by Fayyad et al.²¹ This algorithm recursively applies an entropy minimization heuristic to discretize continuous-valued attributes. The recursion stops according to the minimum description length (MDL) principle. We carry out the discretization with the Weka package.¹⁰⁶ Every continuous-valued attribute is discretized into a one-category, two-category or three-category attribute. Table 15 shows the discretized decision table for the CNS tumor dataset with the first sample left out. We execute our feature selection and decision rule induction algorithm on tables of this kind.

Table 15.

Discretized CNS tumor decision table with the first sample left out.

Samples	Condition attributes (genes)ᶠ							Decision attributes (classes)
	Gene 1	…	Gene 11	…	Gene 18	…	Gene 7129	Class label
1	‘All’	…	‘(-inf-187]’	…	‘(-330-inf]’	…	‘All’	Class 1
2	‘All’	…	‘(-inf-187]’	…	‘(-330-inf]’	…	‘All’	Class 1
…	…	…	…	…	…	…	…	…
20	‘All’	…	‘(-inf-187]’	…	‘(-330-inf]’	…	‘All’	Class 1
21	‘All’	…	‘(-inf-187]’	…	‘(-330-inf]’	…	‘All’	Class 0
22	‘All’	…	‘(187-inf]’	…	‘(-330-inf]’	…	‘All’	Class 0
…	…	…	…	…	…	…	…	…
58	‘All’	…	‘(-inf-187]’	…	‘(-inf--330]’	…	‘All’	Class 0
59	‘All’	…	‘(-inf-187]’	…	‘(-330-inf]’	…	‘All’	Class 0

ᶠ ‘All’ indicates that a gene has the same discretized value in all samples; ‘(-inf-x]’ represents ‘≤x’; ‘(x-inf]’ represents ‘>x’.
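The entropy-based discretization with MDL stopping described above can be sketched in pure Python for a single attribute. This is an illustrative re-implementation of the Fayyad and Irani criterion, not the Weka code used in the study; the function names and the toy data are hypothetical. An attribute for which no cut is accepted corresponds to an ‘All’ entry in Table 15.

```python
import math
from collections import Counter


def entropy(labels):
    """Class entropy (in bits) of a list of labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_cut(pairs):
    """Return (weighted entropy, cut value, split index) of the best boundary,
    or None if all values in the sorted (value, label) list are equal."""
    best = None
    n = len(pairs)
    for i in range(1, n):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # a cut must separate two distinct values
        left = [c for _, c in pairs[:i]]
        right = [c for _, c in pairs[i:]]
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if best is None or e < best[0]:
            best = (e, (pairs[i - 1][0] + pairs[i][0]) / 2, i)
    return best


def mdl_accepts(labels, left, right, weighted_entropy):
    """MDL stopping criterion: accept the cut only if the gain is large enough."""
    n = len(labels)
    k, k1, k2 = len(set(labels)), len(set(left)), len(set(right))
    gain = entropy(labels) - weighted_entropy
    delta = math.log2(3 ** k - 2) - (
        k * entropy(labels) - k1 * entropy(left) - k2 * entropy(right)
    )
    return gain > (math.log2(n - 1) + delta) / n


def mdlp_cuts(values, labels):
    """Recursively collect the accepted cut points for one attribute."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])

    def recurse(segment):
        found = best_cut(segment)
        if found is None:
            return []
        e, cut, i = found
        seg_labels = [c for _, c in segment]
        if not mdl_accepts(seg_labels, seg_labels[:i], seg_labels[i:], e):
            return []
        return recurse(segment[:i]) + [cut] + recurse(segment[i:])

    return recurse(pairs)


# Hypothetical expression values for ten samples.
levels = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
informative = mdlp_cuts(levels, ["a"] * 5 + ["b"] * 5)  # one clean boundary
uninformative = mdlp_cuts(levels, ["a", "b"] * 5)       # no accepted cut
```

The first attribute is split once at 5.5, giving a two-category attribute, while the second receives no cut and would be recorded as a one-category attribute.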

Validation

We employ the leave-one-out cross-validation approach. For a dataset containing n samples, each sample is left out in turn, and the learning algorithm is trained on the remaining n-1 samples. The trained model is then tested on the left-out sample. The final estimate is the average of the n test results.
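The validation loop can be sketched generically in Python. The `train` and `predict` callables are placeholders; the one-nearest-neighbour learner below is a stand-in used only to make the example runnable and is not the rule classifier of this study. The toy values and labels are hypothetical.

```python
def leave_one_out_accuracy(samples, labels, train, predict):
    """Average test result over n rounds, each leaving one sample out."""
    n = len(samples)
    correct = 0
    for i in range(n):
        train_x = samples[:i] + samples[i + 1:]
        train_y = labels[:i] + labels[i + 1:]
        model = train(train_x, train_y)  # fit on the remaining n-1 samples
        if predict(model, samples[i]) == labels[i]:
            correct += 1
    return correct / n


def train_1nn(xs, ys):
    """Stand-in learner: memorize the training pairs."""
    return list(zip(xs, ys))


def predict_1nn(model, x):
    """Stand-in predictor: label of the closest training value."""
    return min(model, key=lambda pair: abs(pair[0] - x))[1]


# Hypothetical one-dimensional data with one atypical "a" sample at 4.9.
values = [1.0, 1.2, 5.0, 5.2, 4.9]
classes = ["a", "a", "b", "b", "a"]

accuracy = leave_one_out_accuracy(values, classes, train_1nn, predict_1nn)
```

In this toy run three of the five left-out samples are predicted correctly, so the estimate is 0.6. Any learner exposing the same `train` and `predict` interface could be substituted.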

Disclosure

The authors report no conflicts of interest.

Footnotes

Acknowledgements

This work was partly supported by KAKENHI (Grant-in-Aid for Scientific Research) on Priority Areas “comparative genomics” from the Ministry of Education, Culture, Sports, Science and Technology of Japan.

References

1. Schena, Shalon, Davis R.W., Brown P.O. Quantitative monitoring of gene expression patterns with a complementary DNA microarray. Science. 1995; 270(5235): 467–70.

2. Golub T.R., Slonim D.K., Tamayo. Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science. 1999; 286(5439): 531–7.

3. Xing E.P., Jordan M.I., Karp R.M. Feature selection for high-dimensional genomic microarray data. In: the Eighteenth International Conference on Machine Learning: 2001; Williams College, MA: Morgan Kaufmann Publishers Inc., San Francisco, U.S.A. 2001: 601–8.

4. Simon. Supervised analysis when the number of candidate features (p) greatly exceeds the number of cases (n). ACM SIGKDD Explorations Newsletter. 2003; 5(2): 31–6.

5. Quinlan. Induction of decision trees. Machine Learning. 1986; 1: 81–106.

6. Pawlak. Rough sets. International Journal of Computer and Information Sciences. 1982; 11: 341–56.

7. …, Wong. Identifying good diagnostic gene groups from gene expression profiles using the concept of emerging patterns. Bioinformatics. 2002; 18(5): 725–34.

8. Sun, Miao, Zhang. Efficient gene selection with rough sets from gene expression data. In: the 3rd International Conference on Rough Sets and Knowledge Technology: 2008: 164–71.

9. …, Zhang. Gene selection using rough set theory. In: the 1st International Conference on Rough Sets and Knowledge Technology: 2006: 778–85.

10. Momin B.F., Mitra. Reduct generation and classification of gene expression data. In: First International Conference on Hybrid Information Technology. 2006: 699–708.

11. Banerjee, Mitra, Banka. Evolutionary-rough feature selection in gene expression data. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews. 2007; 37: 622–32.

12. Tan A.C., Gilbert. Ensemble machine learning on gene expression data for cancer classification. Appl Bioinformatics. 2003; 2(3 Suppl): S75–83.

13. …, Liu, Downing J.R., Yeoh A.E., Wong. Simple rules underlying gene expression profiles of more than six subtypes of acute lymphoblastic leukemia (ALL) patients. Bioinformatics. 2003; 19(1): 71–8.

14. Cong, Tan K.L., KH Tung, Xu. Mining top-k covering rule groups for gene expression data. In: the ACM SIGMOD International Conference on Management of Data: 2005: 670–81.

15. Geman, d'Avignon, Naiman D.Q., Winslow R.L. Classifying gene expression profiles from pairwise mRNA comparisons. Stat Appl Genet Mol Biol. 2004; 3: Article 19.

16. Gordon G.J., Jensen R.V., Hsiao L.L. Translation of microarray data into clinically relevant cancer diagnostic tests using gene expression ratios in lung cancer and mesothelioma. Cancer Res. 2002; 62(17): 4963–7.

17. Pawlak. Rough sets: Theoretical aspects of reasoning about data, vol. 9. Dordrecht; Boston: Kluwer Academic Publishers; 1991.

18. …, Liu. Redundancy based feature selection for microarray data. In: the Tenth ACM SIGKDD Conference on Knowledge Discovery and Data Mining: 2004: 737–42.

19. Pomeroy S.L., Tamayo, Gaasenbeek. Prediction of central nervous system embryonal tumour outcome based on gene expression. Nature. 2002; 415(6870): 436–42.

20. Zhang L.J., Li Z.J., Hu X.H. A Hybrid Gene Selection Method for Cancer Classification. In: VLDB Workshop on Data Mining in Bioinformatics: 2007; Vienna, Austria; 2007.

21. Fayyad U.M., Irani K.B. Multi-interval discretization of continuous-valued attributes for classification learning. In: the 13th International Joint Conference of Artificial Intelligence: 1993: 1022–7.

22. Alon, Barkai, Notterman D.A. Broad patterns of gene expression revealed by clustering analysis of tumor and normal colon tissues probed by oligonucleotide arrays. Proc Natl Acad Sci U S A. 1999; 96(12): 6745–50.

23. Wigle D.A., Jurisica, Radulovich. Molecular profiling of non-small cell lung cancer and correlation with disease-free survival. Cancer Res. 2002; 62(11): 3005–8.

24. Shipp M.A., Ross K.N., Tamayo. Diffuse large B-cell lymphoma outcome prediction by gene-expression profiling and supervised machine learning. Nat Med. 2002; 8(1): 68–74.

25. Zhang, Harry G.J., Blackshear P.J., Zeldin D.C. G-protein pathway suppressor 2 (GPS2) interacts with the regulatory factor X4 variant 3 (RFX4_v3) and functions as a transcriptional co-activator. J Biol Chem. 2008; 283(13): 8580–90.

26. Newman L.S., McKeever M.O., Okano H.J., Darnell R.B. Beta-NAP, a cerebellar degeneration antigen, is a neuron-specific vesicle coat protein. Cell. 1995; 82(5): 773–83.

27. Lipkowitz, Gobel, Varterasian M.L., Nakahara, Tchorz, Kirsch I.R. A comparative structural characterization of the human NSCL-1 and NSCL-2 genes. Two basic helix-loop-helix genes expressed in the developing nervous system. J Biol Chem. 1992; 267(29): 21065–71.

28. Jiang, Li, Rao. Constructing disease-specific gene networks using pair-wise relevance metric: application to colon cancer identifies interleukin 8, desmin and enolase 1 as the central elements. BMC Syst Biol. 2008; 2: 72.

29. Klieveri, Fehres, Griffini, Van Noorden C.J., Frederiks W.M. Promotion of colon cancer metastases in rat liver by fish oil diet is not due to reduced stroma formation. Clin Exp Metastasis. 2000; 18(5): 371–7.

30. Wang, Williamson, Bott. Hypomethylation of WNT5A, CRIP1 and S100P in prostate cancer. Oncogene. 2007; 26(45): 6560–5.

31. Hirasawa, Arai, Imazeki. Methylation status of genes upregulated by demethylating agent 5-aza-2′-deoxycytidine in hepatocellular carcinoma. Oncology. 2006; 71(1-2): 77–85.

32. Sato, Fukushima, Matsubayashi, Goggins. Identification of maspin and S100P as novel hypomethylation targets in pancreatic cancer using global gene expression profiling. Oncogene. 2004; 23(8): 1531–8.

33. Schauer I.G., Ressler S.J., Rowley D.R. Keratinocyte-derived chemokine induces prostate epithelial hyperplasia and reactive stroma in a novel transgenic mouse model. Prostate. 2009; 69(4): 373–84.

34. Bendrik, Dabrosin. Estradiol increases IL-8 secretion of normal human breast tissue and breast cancer in vivo. J Immunol. 2009; 182(1): 371–8.

35. Negaard H.F., Iversen, Bowitz-Lothe I.M. Increased bone marrow microvascular density in haematological malignancies is associated with differential regulation of angiogenic factors. Leukemia. 2009; 23(1): 162–9.

36. Chikazawa, Inoue, Fukata, Karashima, Shuin. Expression of angiogenesis-related genes regulates different steps in the process of tumor growth and metastasis in human urothelial cell carcinoma of the urinary bladder. Pathobiology. 2008; 75(6): 335–45.

37. Lurje, Zhang, Schultheis A.M. Polymorphisms in VEGF and IL-8 predict tumor recurrence in stage III colon cancer. Ann Oncol. 2008; 19(10): 1734–41.

38. Sullivan M.M., Sage E.H. Hevin/SC1, a matricellular glycoprotein and potential tumor-suppressor of the SPARC/BM-40/Osteonectin family. Int J Biochem Cell Biol. 2004; 36(6): 991–6.

39. Framson P.E., Sage E.H. SPARC and tumor growth: where the seed meets the soil? J Cell Biochem. 2004; 92(4): 679–90.

40. Lau C.P., Poon R.T., Cheung S.T., Yu W.C., Fan S.T. SPARC and Hevin expression correlate with tumour angiogenesis in hepatocellular carcinoma. J Pathol. 2006; 210(4): 459–68.

41. Esposito, Kayed, Keleg. Tumor-suppressor function of SPARC-like protein 1/Hevin in pancreatic cancer. Neoplasia. 2007; 9(1): 8–17.

42. Bendik, Schraml, Ludwig C.U. Characterization of MAST9/Hevin, a SPARC-like protein, that is down-regulated in non-small cell lung cancer. Cancer Res. 1998; 58(4): 626–9.

43. Claeskens, Ongenae, Neefs J.M. Hevin is down-regulated in many cancers and is a negative regulator of cell growth and proliferation. Br J Cancer. 2000; 82(6): 1123–30.

44. Nelson P.S., Plymate S.R., Wang. Hevin, an antiadhesive extracellular matrix protein, is down-regulated in metastatic prostate adenocarcinoma. Cancer Res. 1998; 58(2): 232–6.

45. Zhang, Guo, Mi. EF1A1-actin interactions alter mRNA stability to determine differential osteopontin expression in HepG2 and Hep3B cells. Exp Cell Res. 2009; 315(2): 304–12.

46. Umeda, Yano, Yamada, Tachibana. Green tea polyphenol epigallocatechin-3-gallate signaling pathway through 67-kDa laminin receptor. J Biol Chem. 2008; 283(6): 3050–8.

47. Rho S.B., Park Y.G., Park, Lee S.H., Lee J.H. A novel cervical cancer suppressor 3 (CCS-3) interacts with the BTB domain of PLZF and inhibits the cell growth by inducing apoptosis. FEBS Lett. 2006; 580(17): 4073–80.

48. Frum, Busby S.A., Ramamoorthy. HDM2-binding partners: interaction with translation elongation factor EF1alpha. J Proteome Res. 2007; 6(4): 1410–7.

49. Gopalkrishnan R.V., Su Z.Z., Goldstein N.I., Fisher P.B. Translational infidelity and human cancer: role of the PTI-1 oncogene. Int J Biochem Cell Biol. 1999; 31(1): 151–62.

50. Schafer B.W., Heizmann C.W. The S100 family of EF-hand calcium-binding proteins: functions and pathology. Trends Biochem Sci. 1996; 21(4): 134–40.

51. Heizmann C.W., Fritz, Schafer B.W. S100 proteins: structure, functions and pathology. Front Biosci. 2002; 7: d1356–68.

52. Melle, Ernst, Schimmel. Different expression of calgizzarin (S100A11) in normal colonic epithelium, adenoma and colorectal carcinoma. Int J Oncol. 2006; 28(1): 195–200.

53. Stulik, Koupilova, Osterreicher. Protein abundance alterations in matched sets of macroscopically normal colon mucosa and colorectal carcinoma. Electrophoresis. 1999; 20(18): 3638–46.

54. Reichling, Goss K.H., Carson D.J. Transcriptional profiles of intestinal tumors in Apc(Min) mice are unique from those of embryonic intestine and identify novel gene targets dysregulated in human colorectal tumors. Cancer Res. 2005; 65(1): 166–76.

55. Chaurand, DaGue B.B., Pearsall R.S., Threadgill D.W., Caprioli R.M. Profiling proteins from azoxymethane-induced colon tumors at the molecular level by matrix-assisted laser desorption/ionization mass spectrometry. Proteomics. 2001; 1(10): 1320–6.

56. Tanaka, Adzuma, Iwami, Yoshimoto, Monden, Itakura. Human calgizzarin; one colorectal cancer-related gene selected by a large scale random cDNA sequencing and northern blot analysis. Cancer Lett. 1995; 89(2): 195–200.

57. Varga A.E., Stourman N.V., Zheng. Silencing of the Tropomyosin-1 gene by DNA methylation alters tumor suppressor function of TGF-beta. Oncogene. 2005; 24(32): 5043–52.

58. … X.X., Chen, Yang. Macrophage migration inhibitory factor promotes colorectal cancer. Mol Med. 2009; 15(1-2): 1–10.

59. Ren, Law, Huang. Macrophage migration inhibitory factor stimulates angiogenic factor expression and correlates with differentiation and lymph node status in patients with esophageal squamous cell carcinoma. Ann Surg. 2005; 242(1): 55–63.

60. Legendre, Decaestecker, Nagy. Prognostic values of galectin-3 and the macrophage migration inhibitory factor (MIF) in human colorectal cancers. Mod Pathol. 2003; 16(5): 491–504.

61. Ren, Tsui H.T., Poon R.T. Macrophage migration inhibitory factor: roles in regulating tumor cell migration and expression of angiogenic factors in hepatocellular carcinoma. Int J Cancer. 2003; 107(1): 22–9.

62. Wilson J.M., Coletta P.L., Cuthbert R.J. Macrophage migration inhibitory factor promotes intestinal tumorigenesis. Gastroenterology. 2005; 129(5): 1485–503.

63. …, Wang, Ye. Overexpression of macrophage migration inhibitory factor induces angiogenesis in human breast cancer. Cancer Lett. 2008; 261(2): 147–57.

64. Imai, Hirata, Irie. Identification of a novel tumor-associated antigen, cadherin 3/P-cadherin, as a possible target for immunotherapy of pancreatic, gastric, and colorectal cancers. Clin Cancer Res. 2008; 14(20): 6487–95.

65. Bauer, Dowejko, Driemel, Bosserhoff A.K., Reichert T.E. Truncated P-cadherin is produced in oral squamous cell carcinoma. FEBS J. 2008; 275(16): 4198–210.

66. Ben Hamida, Labidi I.S., Mrad. Markers of subtypes in inflammatory breast cancer studied by immunohistochemistry: prominent expression of P-cadherin. BMC Cancer. 2008; 8: 28.

67. Rocha A.S., Soares, Machado J.C. Mucoepidermoid carcinoma of the thyroid: a tumour histotype characterised by P-cadherin neoexpression and marked abnormalities of E-cadherin/catenins complex. Virchows Arch. 2002; 440(5): 498–504.

68. Paredes, Albergaria, Oliveira J.T., Jeronimo, Milanezi, Schmitt F.C. P-cadherin overexpression is an indicator of clinical outcome in invasive breast carcinomas and is associated with CDH3 promoter hypomethylation. Clin Cancer Res. 2005; 11(16): 5869–77.

69. Patel I.S., Madan, Getsios, Bertrand M.A., MacCalman C.D. Cadherin switching in ovarian cancer progression. Int J Cancer. 2003; 106(2): 172–7.

70. Lo Muzio, Campisi, Farina. P-cadherin expression and survival rate in oral squamous cell carcinoma: an immunohistochemical study. BMC Cancer. 2005; 5: 63.

71. Reed C.E., Graham, Hoda R.S. A simple two-gene prognostic model for adenocarcinoma of the lung. J Thorac Cardiovasc Surg. 2008; 135(3): 627–34.

72. Bauer, Bosserhoff A.K. Functional implication of truncated P-cadherin expression in malignant melanoma. Exp Mol Pathol. 2006; 81(3): 224–30.

73. Makino, Tajiri, Miyashita. Differential expression of TCEAL1 in esophageal cancers by custom cDNA microarray analysis. Dis Esophagus. 2005; 18(1): 37–40.

74. Hou Y.F., Yuan S.T., Li H.C. ERbeta exerts multiple stimulative effects on human breast carcinoma cells. Oncogene. 2004; 23(34): 5799–806.

75. Sohda, Ishikawa, Masuda. Pretreatment evaluation of combined HIF-1alpha, p53 and p21 expression is a useful and sensitive indicator of response to radiation and chemotherapy in esophageal cancer. Int J Cancer. 2004; 110(6): 838–44.

76. Kim Y.B., Ki S.W., Yoshida, Horinouchi. Mechanism of cell cycle arrest caused by histone deacetylase inhibitors in human carcinoma cells. J Antibiot (Tokyo). 2000; 53(10): 1191–200.

77. Santos A.M., Sousa, Pinto. Linking TP53 codon 72 and P21 nt590 genotypes to the development of cervical and ovarian cancer. Eur J Cancer. 2006; 42(7): 958–63.

78. Lee J.H., Horak C.E., Khanna. Alterations in Gemin5 expression contribute to alternative mRNA splicing patterns and tumor cell motility. Cancer Res. 2008; 68(3): 639–44.

79. Parise, Finocchiaro, Masciadri. Lap2alpha expression is controlled by E2F and deregulated in various human tumors. Cell Cycle. 2006; 5(12): 1331–41.

80. Weber P.J., Eckhard C.P., Gonser, Otto, Folkers, Beck-Sickinger A.G. On the role of thymopoietins in cell proliferation. Immunochemical evidence for new members of the human thymopoietin family. Biol Chem. 1999; 380(6): 653–60.

81.

Xiao

, Wang

, Zhou

Inhibition of fibroblast growth factor 2-induced apoptosis involves survivin expression, protein kinase C alpha activation and subcellular translocation of Smac in human small cell lung cancer cells.

Acta Biochim Biophys Sin (Shanghai). 2008; 40(4): 297–303.

82.

Kempkensteffen

, Hinz

, Christoph

Expression levels of the mitochondrial IAP antagonists Smac/DIABLO and Omi/HtrA2 in clear-cell renal cell carcinomas and their prognostic value.

J Cancer Res Clin Oncol. 2008; 134(5): 543–50.

83. Mizutani, Nakanishi, Yamamoto. Downregulation of Smac/DIABLO expression in renal cell carcinoma and its prognostic significance. J Clin Oncol. 2005; 23(3): 448–54.

84. Yan, Mahotka, Heikaus. Disturbed balance of expression between XIAP and Smac/DIABLO during tumour progression in renal cell carcinomas. Br J Cancer. 2004; 91(7): 1349–57.

85. Fulda, Wick, Weller, Debatin K.M. Smac agonists sensitize for Apo2L/TRAIL- or anticancer drug-induced apoptosis and induce regression of malignant glioma in vivo. Nat Med. 2002; 8(8): 808–15.

86. Yang, Mashima, Sato. Predominant suppression of apoptosome by inhibitor of apoptosis protein in non-small cell lung cancer H460 cells: therapeutic effect of a novel polyarginine-conjugated Smac peptide. Cancer Res. 2003; 63(4): 831–7.

87. Vogler, Giagkousiklidis, Genze, Gschwend J.E., Debatin K.M., Fulda. Inhibition of clonogenic tumor growth: a novel function of Smac contributing to its antitumor activity. Oncogene. 2005; 24(48): 7190–202.

88. Mao H.L., Liu P.S., Zheng J.F. Transfection of Smac/DIABLO sensitizes drug-resistant tumor cells to TRAIL or paclitaxel-induced apoptosis in vitro. Pharmacol Res. 2007; 56(6): 483–92.

89. Checinska, Hoogeland B.S., Rodriguez J.A., Giaccone, Kruyt F.A. Role of XIAP in inhibiting cisplatin-induced caspase activation in non-small cell lung cancer cells: a small molecule Smac mimic sensitizes for chemotherapy-induced apoptosis by enhancing caspase-3 activation. Exp Cell Res. 2007; 313(6): 1215–24.

90. McNeish I.A., Lopes, Bell S.J. Survivin interacts with Smac/DIABLO in ovarian carcinoma cells but is redundant in Smac-mediated apoptosis. Exp Cell Res. 2005; 302(1): 69–82.

91. Martinez-Velazquez, Melendez-Zajgla, Maldonado. Apoptosis induced by cAMP requires Smac/DIABLO transcriptional upregulation. Cell Signal. 2007; 19(6): 1212–20.

92. Sekimura, Konishi, Mizuno. Expression of Smac/DIABLO is a novel prognostic marker in lung cancer. Oncol Rep. 2004; 11(4): 797–802.

93. Wang, Broderick, Webb. Common 5p15.33 and 6p21.33 variants influence lung cancer risk. Nat Genet. 2008; 40(12): 1407–9.

94. Inaba, Roberts W.M., Shapiro L.H. Fusion of the leucine zipper gene HLF to the E2A gene in human acute B-lineage leukemia. Science. 1992; 257(5069): 531–4.

95. Muris J.J., Meijer C.J., Ossenkoppele G.J., Vos, Oudejans J.J. Apoptosis resistance and response to chemotherapy in primary nodal diffuse large B-cell lymphoma. Hematol Oncol. 2006; 24(3): 97–104.

96. Mishra, Murphy L.J. The p53 oncoprotein is a substrate for tissue transglutaminase kinase activity. Biochem Biophys Res Commun. 2006; 339(2): 726–30.

97. Mangala L.S., Arun, Sahin A.A., Mehta. Tissue transglutaminase-induced alterations in extracellular matrix inhibit tumor invasion. Mol Cancer. 2005; 4: 33.

98. Xu, Begum, Hearn J.D., Hynes R.O. GPR56, an atypical G protein-coupled receptor, binds tissue transglutaminase, TG2, and inhibits melanoma tumor growth and metastasis. Proc Natl Acad Sci U S A. 2006; 103(24): 9023–8.

99. Goke, Barth, Schmidt, Samans, Lankat-Buttgereit. Programmed cell death protein 4 suppresses CDK1/cdc2 via induction of p21(Waf1/Cip1). Am J Physiol Cell Physiol. 2004; 287(6): C1541–6.

100. Jin, Kim T.H., Hwang S.K. Aerosol delivery of urocanic acid-modified chitosan/programmed cell death 4 complex regulated apoptosis, cell cycle, and angiogenesis in lungs of K-ras null mice. Mol Cancer Ther. 2006; 5(4): 1041–9.

101. Schmid, Jansen A.P., Baker A.R., Hegamyer, Hagan J.P., Colburn N.H. Translation inhibitor Pdcd4 is targeted for degradation during tumor promotion. Cancer Res. 2008; 68(5): 1254–60.

102. Wang, Sun, Yang H.S. Downregulation of tumor suppressor Pdcd4 promotes invasion and activates both beta-catenin/Tcf and AP-1-dependent transcription in colon carcinoma cells. Oncogene. 2008; 27(11): 1527–35.

103. Yang H.S., Matthews C.P., Clair. Tumorigenesis suppressor Pdcd4 down-regulates mitogen-activated protein kinase kinase kinase kinase 1 expression to suppress colon carcinoma cell invasion. Mol Cell Biol. 2006; 26(4): 1297–306.

104. Lankat-Buttgereit, Lenschen, Schmidt, Goke. The action of Pdcd4 may be cell type specific: evidence that reduction of dUTPase levels might contribute to its tumor suppressor activity in Bon-1 cells. Apoptosis. 2008; 13(1): 157–64.

105. Holte R.C. Very simple classification rules perform well on most commonly used datasets. Machine Learning. 1993: 63–91.

106. Witten I.H., Frank. Data mining: practical machine learning tools and techniques (second edition). Morgan Kaufmann; 2005.

107. Wang, Makedon F.S., Ford J.C., Pearlman. HykGene: a hybrid approach for selecting marker genes for phenotype classification using microarray gene expression data. Bioinformatics. 2005; 21(8): 1530–7.

108. Antoniadis, Lambert-Lacroix, Leblanc. Effective dimension reduction methods for tumor classification using gene expression data. Bioinformatics. 2003; 19(5): 563–70.

109. Sun, Miao, Zhang. Gene Selection with Rough Sets for Cancer Classification. In: Fourth International Conference on Fuzzy Systems and Knowledge Discovery. 2007: 167–72.

110. Ding, Peng. Minimum redundancy feature selection from microarray gene expression data. J Bioinform Comput Biol. 2005; 3(2): 185–205.

111. Yu, Liu. Redundancy Based Feature Selection for Microarray Data. In: Tenth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. 2004: 737–42.

Microarray-Based Cancer Prediction Using Soft Computing Approach
