Abstract
Objective
To investigate whether previously curated chronic lymphocytic leukemia (CLL) risk genes could be leveraged in gene marker selection for the diagnosis and prediction of CLL.
Methods
A CLL genetic database (CLL_042017) was developed through a comprehensive CLL-gene relation data analysis, in which 753 CLL target genes were curated. Expression values for these genes were used for case-control classification of four CLL datasets, with a sparse representation-based variable selection (SRVS) approach employed for feature (gene) selection. Results were compared with outcomes obtained by using analysis of variance (ANOVA)-based gene selection approaches.
Results
For each of the four datasets, SRVS selected a subset of genes from the 753 CLL target genes that gave significantly higher classification accuracy (100%, 100%, 93.94%, and 89.39%) than randomly selected genes. SRVS also outperformed ANOVA in terms of classification accuracy.
Conclusion
Gene markers selected from the 753 CLL genes could enable significantly greater accuracy in the prediction of CLL. SRVS provides an effective method for gene marker selection.
Introduction
Chronic lymphocytic leukemia (CLL) is the most frequent B-cell leukemia and affects men more often than women. 1 The disease often occurs in elderly patients, and rarely affects children. 2 Despite the efforts of many genetic studies, the molecular abnormalities and genetic mechanisms of CLL remain largely unknown. 3 Most CLL patients are diagnosed without symptoms, with the exception of a high white blood cell count in a routine blood test. Consequently, early CLL could easily remain untreated. 4 Therefore, there is an urgent need for biomarker identification to facilitate early prediction of CLL. 5
In the past, hundreds of genes/proteins have been linked to CLL. Mutations of some risk genes, including IL4 and TP53, have been frequently reported as important markers for the pathogenic development of CLL.6,7 However, these genes may also serve as biomarkers for multiple other diseases,7,8 which decreases their specificity as biomarkers for the prediction of CLL. Additionally, many CLL-gene relationships have been reported, but few can be replicated (e.g., PRKCD and TGFBR2 9,10), reflecting the heterogeneity of CLL and the variance of CLL-related genetic changes among patients. 11 Moreover, a number of novel CLL risk genes are identified each year, 12 facilitating the development of an enriched genetic database for CLL.
The purpose of this study was to investigate whether previously reported CLL genes could be leveraged as a database for gene marker selection, specifically targeting early diagnosis of CLL. We hypothesized that if these CLL genes are effective for the prediction of CLL, gene markers selected from among them should enable significant accuracy in differentiating CLL cases from controls.
Methods
Development and analysis of CLL_042017
Figure 1 presents the database schema of the curated database CLL_042017. The database contains 753 genes (

Figure 1. Chronic lymphocytic leukemia (CLL) genetic database schematic.
SRVS for gene vector selection
A sparse representation-based variable selection (SRVS) algorithm (described in detail elsewhere) 14 was used to rank the 753 CLL target genes on the basis of a given experimental dataset. For each gene, a sparse weight is assigned by SRVS. The gene vector, composed of the top
Gene expression data
In this study, we used four RNA gene expression datasets to evaluate classification performance with CLL target genes: GSE2466, GSE19147, GSE50006, and GSE8835. The datasets were identified with the Illumina BaseSpace Correlation Engine (http://www.illumina.com) and are publicly available at the NCBI Gene Expression Omnibus (www.ncbi.nlm.nih.gov/geo/). The data selection criteria were as follows: 1) the sample organism was Homo sapiens; 2) the data type was RNA expression; 3) the experiment design was CLL case vs. normal control. From each dataset, expression data of normal controls and CLL patients were extracted and used for case/control classification. The genes of each dataset were limited to the CLL target genes curated within the database CLL_042017. The key statistics of the four datasets are summarized in Table 1.
Table 1. Statistics of the four gene expression datasets.
The gene expression profiles of the four gene expression datasets are also included in CLL_042017:
CLL case/control classification
To identify the best gene vector and the corresponding classification accuracy (CR), the CLL target genes were first ranked by SRVSScore in descending order. Then, Euclidean distance-based multivariate classification 18 was performed for each dataset, followed by leave-one-out (LOO) cross-validation. In each run of LOO, gene expression data of one subject were used for testing; the remaining data were used for training. The inputs of the classifier are the top
Following the same process, the best gene subset was identified for each dataset by the ANOVA approach. For comparison purposes, a CR baseline was also generated by using randomly selected gene sets of n (
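The leave-one-out procedure described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the Euclidean distance-based classifier is assumed here to be a nearest-centroid rule, the function names `loo_accuracy` and `best_top_k` are hypothetical, and the data are synthetic.

```python
import numpy as np

def loo_accuracy(X, y, gene_idx):
    """Leave-one-out accuracy of a Euclidean nearest-centroid classifier.

    X: (n_samples, n_genes) expression matrix; y: 0/1 labels (control/CLL);
    gene_idx: column indices of the genes that form the gene vector.
    ASSUMPTION: nearest-centroid stands in for the paper's classifier.
    """
    Xs = X[:, gene_idx]
    n = len(y)
    correct = 0
    for i in range(n):
        train = np.arange(n) != i              # hold out subject i for testing
        Xtr, ytr = Xs[train], y[train]
        # Class centroids are computed from the training subjects only.
        centroids = np.stack([Xtr[ytr == c].mean(axis=0) for c in (0, 1)])
        dists = np.linalg.norm(centroids - Xs[i], axis=1)
        correct += int(np.argmin(dists) == y[i])
    return correct / n

def best_top_k(X, y, ranked_genes, max_k):
    """Sweep k = 1..max_k over a pre-ranked gene list; return (best_k, best_cr)."""
    crs = [loo_accuracy(X, y, ranked_genes[:k]) for k in range(1, max_k + 1)]
    best = int(np.argmax(crs))                 # first k that reaches the maximum
    return best + 1, crs[best]

# Synthetic demo: 10 controls, 10 cases, 30 genes; gene 0 carries the signal.
rng = np.random.default_rng(0)
y = np.array([0] * 10 + [1] * 10)
X = rng.normal(size=(20, 30))
X[y == 1, 0] += 8.0
ranked = np.arange(30)                         # stand-in for a score-based ranking
k, cr = best_top_k(X, y, ranked, max_k=10)
print(f"best k = {k}, LOO accuracy = {cr:.2f}")
```

Sweeping k over a ranked list makes it easy to see where accuracy peaks and whether adding lower-ranked genes helps.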
Results
CLL case/control classification
Figure 2 presents the classification results. Table 2 summarizes the results of LOO cross-validation of the two gene-ranking methods on four datasets, where the maximum CRs, corresponding numbers of top genes, and permutation p-values of the two methods are provided.

Figure 2. Comparison of different metrics through leave-one-out (LOO) cross-validation. Genes were ranked in descending order according to SRVSScore or PValueScore, for sparse representation-based variable selection (SRVS) or analysis of variance (ANOVA), respectively. (a) GSE2466, (b) GSE19147, (c) GSE50006 and (d) GSE8835.
Table 2. LOO cross-validation and permutation results.
SRVS, sparse representation-based variable selection; ANOVA, analysis of variance.
Figure 2 shows that, compared with the CRs generated by randomly selected gene sets, the genes selected from the CLL target genes by both SRVS and ANOVA yielded significantly higher classification accuracies. Notably, the highest CRs were obtained by using only the top genes with the highest SRVSScore/PValueScore (see Figure 2 and Table 2); adding more genes with lower scores did not necessarily improve classification accuracy. These results support the validity of both the SRVS and ANOVA methods. Moreover, SRVSScore outperformed PValueScore in terms of CR (Table 2).
Table 2 also shows that, for each dataset, the top genes selected by both methods could be significantly different (
Discussion
CLL affects approximately one million people globally, but remains poorly diagnosed at early stages. In the past, many studies have been performed with the aim of developing targeted molecular therapy for CLL,6,7 and hundreds of risk genes have been identified. Most of these genes are active within CLL-related genetic pathways, and many have been used as drug targets for the treatment of CLL. However, patients may show genetic variation even within the same disease, implying the need for personalized treatment. 16 Therefore, for a given CLL patient or patient group, feature (gene) selection is important for diagnosis and treatment. Thus far, few studies have tested the validity of curated CLL risk genes for use as genetic markers in the diagnosis and prediction of CLL.
In this study, we first conducted comprehensive literature data mining of 3078 scientific articles, which identified 753 CLL target genes. Gene set enrichment analysis showed that the majority of these genes (594/753) were significantly enriched within multiple genetic pathways that were associated with CLL (p-value<3e-13; q=0.001 for false discovery rate (FDR)). For instance, 230 genes were significantly enriched within eight cell apoptosis pathways (p-value<5.2e-14; q=0.001 for FDR). 15 There were also 240 genes enriched within eight pathways/gene sets related to cell growth and proliferation (p-value<6.8e-15) 16 and 218 genes enriched within immune response (p-value<8.7e-29). 17
More pathways and related information can be identified at
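To make the enrichment p-values above more concrete, the sketch below computes a one-sided hypergeometric (over-representation) p-value for the overlap between a gene list and a pathway. This is a common test for such overlaps, but it is an assumption that it resembles the enrichment analysis used here; the function name `hypergeom_enrichment_p` and all numbers in the demo are hypothetical.

```python
from math import comb

def hypergeom_enrichment_p(N, K, n, k):
    """P(overlap >= k) under the hypergeometric null.

    N: genes in the background; K: background genes in the pathway;
    n: genes in the curated list; k: curated genes found in the pathway.
    """
    total = comb(N, n)
    upper = min(K, n)                       # the overlap cannot exceed either set
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / total

# Hypothetical numbers, chosen only for illustration (not from the study):
# a 20000-gene background, a 300-gene pathway, a 753-gene list, 50 shared genes.
p = hypergeom_enrichment_p(N=20000, K=300, n=753, k=50)
print(f"enrichment p-value = {p:.3g}")
```

Exact integer binomials avoid floating-point underflow in the individual terms, which matters when p-values are very small.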
Sub-network enrichment analysis (SNEA; http://pathwaystudio.gousinfo.com/SNEA.pdf) showed that 717 of 753 genes significantly overlapped with risk genes linked to each of the 97 diseases (p-value<1.6e-100; q=0.001 for FDR;
Within CLL_042017, there were 235 known CLL drugs/small molecules (
CLL case/control classification was conducted on four independent gene expression datasets, with two algorithms for gene selection within the 753 CLL gene pool: SRVS and ANOVA. The rationale for feature (gene) selection is that not all 753 genes will be altered in a given CLL patient or patient group; therefore, it is not appropriate to use all of them as target genes in diagnosis and treatment.
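As an illustration of ANOVA-based gene ranking, the sketch below computes a one-way ANOVA F statistic per gene and orders genes from largest to smallest F. It is a minimal example under assumptions: the exact definition of PValueScore is not given in this text, so the F statistic is used directly (for fixed degrees of freedom, a larger F corresponds to a smaller p-value), and the function names and toy data are hypothetical.

```python
import numpy as np

def anova_f_scores(X, y):
    """One-way ANOVA F statistic for each gene (column of X) given labels y."""
    classes = np.unique(y)
    n, k = X.shape[0], len(classes)
    grand_mean = X.mean(axis=0)
    ss_between = np.zeros(X.shape[1])
    ss_within = np.zeros(X.shape[1])
    for c in classes:
        Xc = X[y == c]
        mean_c = Xc.mean(axis=0)
        ss_between += len(Xc) * (mean_c - grand_mean) ** 2
        ss_within += ((Xc - mean_c) ** 2).sum(axis=0)
    # Mean square between groups divided by mean square within groups.
    return (ss_between / (k - 1)) / (ss_within / (n - k))

def rank_genes(X, y):
    """Gene indices ordered from the largest to the smallest F statistic."""
    return np.argsort(-anova_f_scores(X, y))

# Toy data: 2 controls, 2 cases, 3 genes; gene 0 separates the groups best.
y_toy = np.array([0, 0, 1, 1])
X_toy = np.array([[0.0, 0.0, 0.0],
                  [1.0, 1.0, 2.0],
                  [10.0, 0.0, 1.0],
                  [11.0, 1.0, 3.0]])
print(rank_genes(X_toy, y_toy))
```

A gene with zero within-group variance would cause a division by zero here; real data would need that case handled.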
Compared with randomly selected genes, those selected by both SRVS and ANOVA led to significantly higher prediction power (permutation p-value<0.0014 for SRVS and permutation p-value<0.0016 for ANOVA; CRs of SRVS vs. ANOVA: 100% vs. 100%, 100% vs. 93.94%, 98.64% vs. 98.18% and 89.39% vs. 84.85%, for the four datasets, respectively), as shown in Table 2. These results indicate that genetic markers selected from the 753 CLL target genes possess significant power for the diagnosis and prediction of CLL. Moreover, SRVS outperformed ANOVA in terms of CR, which supports the effectiveness of the SRVS method for gene marker selection in CLL.
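One way a permutation p-value against randomly selected gene sets could be estimated is sketched below. The exact procedure used in the study is not recoverable from this text, so this is an assumption: the observed CR is compared with CRs from random gene sets of the same size, with the common add-one correction. The function `random_baseline_pvalue` and the toy scoring function `toy_cr` are hypothetical.

```python
import numpy as np

def random_baseline_pvalue(observed_cr, score_fn, n_genes, k, n_perm=1000, seed=0):
    """Estimate how often random k-gene sets score at least as well as observed.

    score_fn: callable mapping an array of gene indices to a CR in [0, 1]
    (for example, a leave-one-out accuracy computed on those genes).
    """
    rng = np.random.default_rng(seed)
    null = np.array([
        score_fn(rng.choice(n_genes, size=k, replace=False))
        for _ in range(n_perm)
    ])
    # The add-one correction keeps the estimate strictly positive.
    p = (1 + np.sum(null >= observed_cr)) / (1 + n_perm)
    return p, null

# Toy scoring function (hypothetical): CR grows with the number of
# "informative" genes that happen to fall into the random set.
informative = set(range(20))

def toy_cr(gene_idx):
    hits = sum(int(g) in informative for g in gene_idx)
    return 0.5 + 0.5 * hits / len(gene_idx)

p, null = random_baseline_pvalue(1.0, toy_cr, n_genes=753, k=10, n_perm=200)
print(f"permutation p-value = {p:.4f}, mean random CR = {null.mean():.3f}")
```

The smallest p-value this estimator can report is 1/(n_perm + 1), so the number of random draws limits the attainable resolution.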
Gene markers selected by both SRVS and ANOVA methods demonstrated substantial uniqueness (>25%) across different datasets (Table 2). This indicates that, in addition to the genomic specificity of each patient group, there may be other factors that affect the gene marker selection, which merit further study. As shown in Table 1, the four datasets were acquired from different blood cells and different patient populations. This may contribute to variations in the gene marker selection results (Table 2).
In conclusion, our study suggests that gene markers selected from the 753 CLL genes could provide high accuracy in the prediction of CLL, and that SRVS is an effective method for gene marker selection in CLL diagnosis and prediction.
Footnotes
Declaration of conflicting interest
The authors declare that there is no conflict of interest.
Funding
This research received no specific grant from any funding agency in the public, commercial, or not-for-profit sectors.
