Abstract
BACKGROUND:
The main obstacle encountered in microarray technology is how to mine the valuable information under the profiles and study the genes function.
OBJECTIVE:
Maximal information coefficient (MIC) is a novel, non-parametric statistic that has been successfully applied to genome-wide association studies and differentially gene and miRNA expression analysis. However, the data used in these applications are not gold standard but real data.
METHODS:
Therefore, this study attempts to test the feasibility of MIC for differentially expressed gene identification with simulation data.
RESULTS:
Our experiments indicate that, MIC perfermance is better than Limma always, which is almost the same level of SAM, ROTS or DESeq2. However, the count of AUC
CONCLUSIONS:
Compared to the existing methods, our experiments show that MIC is not only in the first tier in identifying differentially expressed genes and noise immunity, but also shows better robustness and stronger data/environment adaptability.
Introduction
A gene expression analysis plays an important role in studying biological characteristics and gene functions [1, 2]. Based on the analysis, we might identify the differentially expressed genes (DEGs) without being influenced by some factors, such as biological conditions, the states of cell cycle, tissues and individuals. And, by the DEGs, it is possible to discover the disorder of biological processes and dysfunctions of the organism, identify risk genes, and clarify the key influence of the pathogeny on gene expression, which is of great significance for the prevention and treatment of diseases.
A microarray is one of the ordinary means in the field of biomedicine. It can obtain a large number of gene expression profiles, overcome the defects of the analysis on single gene, and integrate bio-information to the extent possible, and then be used to analyze the expression and function of multi-genes during disease development [3, 4, 5, 6]. How to mine the valuable information under the gene profiles and study the genes function is the main obstacle in microarray technology [7]. The technology of gene expression analysis has been widely used in biology and medical statistics.
Since the advent of microarray technology, many promising methods have been proposed for gene expression analysis [5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19]. Some commonly used methods include Signification Analysis of Microarrays (SAM) [14], Limma [11, 20], Reproducibility-Optimized Test Statistic (ROTS) [15, 21, 22], CyberT [16, 23] and Rank Products [17, 24]. In addition, DESeq [25] and DESeq2 [26] can also be employed to identify DEGs, although they are original for RNA-seq analysis. These methods are promising in identifying differentially expressed genes, however, they might have some limitations. For example, SAM might lose some valuable DEGs [27]. Thus, it is one of important works in bioinformatics to explore novel methods unceasingly for differentially expressed genes identification.
Maximal Information Coefficient (MIC) is a novel statistical method to explore some unknown relationships between two variables [28]. It has an important characteristic of model independence, which is suitable for the studies of unknown models such as gene expression. We have successfully employed MIC to genome-wide association studies and identifying DEGs and differentially expressed miRNAs, and achieved well results [29, 30, 31, 32, 33]. However, the data for our studies are real gene expression profiles, which are experimentally derived data, not gold standard. So far, we have not found an accepted and open accessed gold standard profile. Thus, this study attempts to generate many simulation data sets on several distributions for discussing the feasibility of introducing MIC to identifying DEGs, by using SAM, Limma, ROTS and DESeq2 as the benchmarks. Our experiments indicate that, MIC perfermance is better than Limma always, which is almost the same level of SAM, ROTS or DESeq2. However, the count of AUC
Material and methods
Material
Simulation data
Since there are not any accepted and open accessed genes profile marked real DEGs can be used as gold standard, we generated some simulation data as our data sources. Based on the studies of [34, 35], the simulation data include four density distributions of normal, chi-square, exponential and uniform with the parameters shown on Table 1. For each distribution, we take arbitrarily one parameter from non-DEGs and DEGs respectively to construct a pair of parameters to generate datasets. And, each pair of parameters was repeatedly used to generate 100 datasets, each containing 10000 genes (5% of which are DEGs), 6 cases and 6 controls. In this way, a total of 29 groups containing 2900 datasets are generated.
Simulation data distributions and generating parameters
Simulation data distributions and generating parameters
DESeq2 is a novel method for RNA-seq analysis, which needs count data as its inputs. A gene expression dataset, however, is not of count but continuous. Thus, the simulation data must be transformed for DESeq2. The conventional transformation is to simply round the expression values to the nearest integer, which will lose too much information for low expression. Here, we let the values multiply 10 and then round them to the nearest integer to reduce the loss of information. Moreover, the expressed values in normal datasets may be negative because of its logarithm transformation, we translate the dataset up
Methods
Maximal information coefficient
As an exploratory analysis tool, MIC can be used to explore the possible, important and undiscovered relationships in hundreds of variable values, such as the relationship between genes and diseases in a genome-wide dataset. The study [28] defines MIC of two-variable
where
where
MIC is a non-parametric statistic that is independent of any model of the two variables. So far, there is no any reliable model to represent the relationship between the phenotypes and gene expressed values. Thus, MIC is very suitable for gene expression analysis.
Suppose the profile
Thus, it is possible to infer the differentially expressed significance of the gene
In order to compare the performance of identifying differentially expressed genes of MIC, we chose four ordinary methods of SAM, DESeq2, Limma and ROTS as benchmarks.
2.2.2.1. Significant analysis of microarrays
A traditional t-test [36] of two-sample with two independent normal distributions can be written as
where
Significant analysis of microarrays (SAM) is similar to t-test and uses a permutation to estimate the false discovery rates [14]. It introduces a small positive constant
2.2.2.2. DESeq2
DESeq2 is the successor to DESeq. DESeq is a widely used method for massive RNA-seq data analysis. It is based on the NB model with mean and variance linked by local regression [25]. DESeq2 integrates a number of advanced methods for quantitative analysis of RNA-seq data by using shrinkage estimators for dispersion and fold change. In fact, although DESEQ2 is original designed for RNA-seq analysis, it can be employed to gene expression analysis as well.
2.2.2.3. Linear models for microarray
Limma considers a gene expression satisfies
and
where
The variable that represents the possible differences between test groups is
where
The contrast estimator is defined as
where
2.2.2.4. Reproducibility-optimized test statistic
Reproducibility-optimized test statistic (ROTS) optimizes a set of modified
ROTS maximizes the scaled reproducibility based on the parameter and the size
where
In this study, all experiments were based on Windows 7 operating system platform. The simulation datasets were generated by programmed by R language (V3.5.0). Excepting MIC, the other methods were directly implemented by Bioconductor [37] (V3.7) in R language. MIC statistic employs the Matlab codes (the core code is implemented in C language) provided in the study [38]. The R language runs under the RStudio [39] (V1.1.456) shell.
By the above codes, the 2900 simulation datasets were analyzed in R language/Matlab.
Flowchart of the study.
Figure 1 shows the flowchart of this study.
Here, we used AUC to represent the performances of identifying DEGs of the methods. Let MIC and the four benchmarks mine the 2900 simulation datasets to identify DEGs, and calculate AUCs according to the identifying results of each method, and plot boxplots shown in Figs 2–5.
AUC boxplot on normal data. “
AUC boxplot on chi-square data. “
The counts of AUC
Note: The counts are from the 2900 simulation datasets, one for each.
AUC boxplot on exponential data. “
AUC boxplot on uniform data. “
In addition, since a method will lose its value to identify DEGs as AUC
A real gene expression profile contains a large amount of noise [40]. It is an important index that the ability of a method to resist noise interference over identifying DEGs. To evaluate the noise immunity of a method, we simulate noise-bearing gene expression data by adding white noise on the simulated data.
Our experiments show that the AUCs of all the methods in all the datasets are around 0.5 when the variance of white noise reaches to 2.0. Therefore, in the noise immunity experiments, we set the range to [0, 2.0] of the white noise added to the simulation datasets, with step of 0.2, a total of 11 levels of noise. For each noise level, all the methods identify DEGs and calculate the average AUCs grouped by the distributions. Then plot the scatter plots of Noise-AUC, and make linear fits to all points. Figures 6–9 show the plots of MIC and the Table 3 lists the summarization of all methods on the four distributions.
The linear fits of Noise-AUC plots
The linear fits of Noise-AUC plots
Note: “Slope” indicates the slope of the line, “Adj. GOF” is the adjusted goodness of fit, and “
Noise-AUC plot of MIC on normal. The line is fitted by the points. “Slope” indicates the slope of the line, “Adj. GOF” is the adjusted goodness of fit, and “
Noise-AUC plot of MIC on chi-square. The line is fitted by the points. “Slope” indicates the slope of the line, “Adj. GOF” is the adjusted goodness of fit, and “
Moreover, we also recorded the curve fitting of each dataset, and counted the fitted line slopes greater than 0 for each method (Table 4).
The counts of slope
Note: The counts are from the 29 groups of simulation datasets.
Noise-AUC plot of MIC on exponential. The line is fitted by the points. “Slope” indicates the slope of the line, “Adj. GOF” is the adjusted goodness of fit, and “
Noise-AUC plot of MIC on uniform. The line is fitted by the points. “Slope” indicates the slope of the line, “Adj. GOF” is the adjusted goodness of fit, and “
Differentially expressed gene identification is an application of data mining. It identifies genes with differential expression levels (variables) based on sample phenotypes (covariates). Therefore, the relationship between sample phenotype and expression level can be simplified into a model Eq. (3). Based on this model, differentially expressed genes can be screened by simply calculating the MIC values of all genes. The entire calculation process does not involve assumptions and calculations of any parameters.
Performance of identifying DEGs by MIC
Figures 2–5 show that the AUCs of MIC are significantly better than the Limma’s (the AUCs of Limma are also significantly lower than the other benchmarks’). The AUC of MIC is ranked the first with SAM and ROTS in the chi-square distribution data, ranked second after ROTS (small 6.19%) in normal distribution, ranked 4
In addition, it means a method has no value in practice when AUC
In summary, compared with the existing methods, MIC method is in the first tier in the performance of identifying DEGs, and it has stronger robustness and higher data/environment adaptability.
Noise immunity of identifying DEGs by MIC
The noise involved in a gene profile is an important factor affecting the accuracy of a method to identifying DEGs, especially for low expressed genes. In order to investigate the noise immunity of MIC, we tested its performance of identifying DEGs in a noisy environment by adding white noise to a noise-free dataset. For a method with excellent noise immunity, its AUCs should decrease with the increase of noise, and the slower decreasing, the stronger the noise immunity. Our experiments (Figs 6–9) show that the AUCs of each method have a good linear relationship with the noise intensity (the magnitudes of variance of the white noise). Thus, we linearly fit the AUCs to use the slope of the fitted line as a quantitative indicator for the noise immunity of a method. Obviously, the slope of the fitted line should be less than or equal to 0, and the smaller the absolute value, the stronger the noise immunity. By Limma, the slopes of the fitted lines in the normal and chi-square distribution shown in Table 3 are greater than 0. Meanwhile, Table 4 also shows that the method has up to 11 groups with a slope greater than 0 (accounting for up to 37.9%, significantly higher than the other 4 methods). Moreover, Limma is also greatly weaker than the other four methods in performance of identifying DEGs. Therefore, we consider to remove Limma from the next step in the noise immunity test.
In the noise immunity test shown Figs 6–9 and Table 3, the slope of fitted line of MIC is nearly same as SAM and ROTS in the normal data, whose absolute value is larger than DESeq2. And, the slope is almost same as SAM, ROTS and DESeq2 among the other three distributions. Furthermore, in the counts of the fitted line slope shown in Table 4, MIC and ROTS have no slope is greater than 0, while SAM and DESeq2 have 3 and 8, respectively.
In summary, compared with the existing methods, MIC is also in the first tier in terms of noise immunity.
Advantages and disadvantages of MIC
MIC is a non-parametric statistic with good noise immunity. It has better ability to discover non-function relations than the ordinary methods in exploring two-variable relationships. And, it has better uniformity to functional relationships [28] (i.e., MIC can yield almost the same value regardless of the functional relationship). In general, a gene expression data contains amount of noise [40], and the functional relationship between phenotype and expressed values is not clear, making MIC very suitable for the analysis of gene expression.
The deficiencies of MIC are mainly reflected in the fact that it is an exhaustive algorithm, leading its runtime is not more advantageous than the methods including permutation such as SAM, ROTS and DESeq2. When we use MIC to process very large datasets, its algorithmic time is a factor that must be considered. In addition, the essence of MIC is to replace all points in the dataset with some grids on a two-dimensional plane. Since the number of grids is not infinite, it is only an approximate method, being reduced accuracy of the algorithm. And thirdly, compared with the existing methods, although MIC performance is in the first tier, our experiments show that it has a few results of AUC
Our experiments verify that MIC is feasible to identify differentially expressed genes, and its performance of identifying DEGs and noise immunity are in the first tier of the existing methods, and it has advantage with more robustness and adaptability. Since the distributions of gene expression may be diverse, our experiments did not involve more distributions. And, we did not test the runtime of the algorithm. These might be further studied in the future.
Footnotes
Acknowledgments
This research was supported by the National Natural Science Foundation of China (Grant no. 31660321).
Conflict of interest
None to report.
