Sage Journals: Discover world-class research

Abstract

Screens using high-throughput, information-rich technologies such as microarrays, high-content screening (HCS), and next-generation sequencing (NGS) have become increasingly widespread. Compared with single-readout assays, these methods produce a more comprehensive picture of the effects of screened treatments. However, interpreting such multidimensional readouts is challenging. Univariate statistics such as t-tests and Z-factors cannot easily be applied to multidimensional profiles, leaving no obvious way to answer common screening questions such as “Is treatment X active in this assay?” and “Is treatment X different from (or equivalent to) treatment Y?” We have developed a simple, straightforward metric, the multidimensional perturbation value (mp-value), which can be used to answer these questions. Here, we demonstrate application of the mp-value to three data sets: a multiplexed gene expression screen of compounds and genomic reagents, a microarray-based gene expression screen of compounds, and an HCS compound screen. In all data sets, active treatments were successfully identified using the mp-value, and simulations and follow-up analyses supported the mp-value’s statistical and biological validity. We believe the mp-value represents a promising way to simplify the analysis of multidimensional data while taking full advantage of its richness.

Keywords

multidimensional screening multivariate data analysis high-throughput screening microarray analysis high-content screening

Technologies that produce multidimensional readouts for biological assays have become increasingly widespread in recent years because of their ability to capture detailed measurements at ever-increasing scales.^1,2 One such technology, high-content imaging, has transformed traditional microscopy, producing dozens of numeric cytological measurements for single cells in a high-throughput fashion.^3,4 The latest microarrays for gene expression interrogate more than 1 million exons, and next-generation sequencing technologies are capable of calling hundreds of billions of bases per run.^5,6 By producing more than one readout for each sample input, these methods are inherently multidimensional. Synthetic multidimensional data sets are also created by combining data from disparate sources or experiments. Such data sets have been produced to create compound profiles based on kinase assay results, side effect annotations, and cytotoxicity measures in diverse cell lines.^7–10

Confronted with the wealth of information in a multidimensional screening data set, it is not always clear how to proceed with the analysis in a way that can make use of the abundance of available information. Sometimes, a data set may be multidimensional simply because it is easy to collect many readouts, even when only a single readout is wanted; high-content screening (HCS) is an example, in which multiple cytological measurements can be calculated based on one stain just as easily as a single measurement. In these cases, the analysis is easily simplified by narrowing the focus to a single readout and ignoring the rest. Statistically, the problem is now univariate, and methods such as the t-test or the nonparametric Wilcoxon rank-sum test can be used to compare replicates of each treatment to the replicates for each control. However, in many situations, there is no a priori decision made to analyze a single readout from an otherwise multidimensional readout. In these situations, analytical metrics should make use of all available readouts to take full advantage of the data set.

An often-used workflow for processing multidimensional data consists of first normalizing treatments to a negative control (such as vehicle) and then applying machine-learning approaches (such as principal component analysis, support vector machines, k-means, or hierarchical clustering) to discern biologically defined groupings of the samples.^11,12 Although this workflow has been used to create functional groupings of samples including cell lines, yeast deletion mutants, and tissues from drug-treated rats, it aligns imperfectly with the language of biological screening.^13–15 In a screen, there are often negative and positive controls, and treatments are expected to be either active to varying degrees (similar to a positive control/dissimilar from a negative control) or inactive (similar to a negative control). Frequently, a treatment in a screen must display a measured activity beyond a certain threshold to be considered a hit and to progress for further evaluation. This threshold, which serves to remove inactive treatments, is missing from the grouping-based multidimensional data-processing workflow described above.

The normalization step common to the previously mentioned workflow makes it easy to understand comparative differences between two treatments, but it does so at some expense. Namely, the workflow does not use all of the information present in the replicates for both treatments and negative controls. Usually, replicates are summarized first by taking the means or medians of their readouts so that a single treatment profile can then be divided by a single vehicle profile. This summarization discounts variation across the replicates, losing information on the amount of noise present for each readout. In addition, in the traditional workflow, summarized negative control profiles are used only for normalization, not for direct comparison to treatment profiles. Although the use of negative controls for normalization is often seen as a necessary or logical step to reduce batch effects, some evidence suggests, counterintuitively, that it may increase them.¹⁶ By freeing the negative control replicates from the requirement that they be used for normalization, it becomes possible to directly compare treatment replicates to negative control replicates using the same methods by which a treatment would be compared with any other treatment.

Here, we describe the multidimensional perturbation value (mp-value), a single statistic that can be applied to various types of multidimensional screening data and used to determine whether any two treatments differ from each other. The mp-value fully uses all measured readouts as well as all replicates. Calculation of the mp-value is straightforward and based on well-established statistical techniques. It lends itself easily to visualization and is similar in spirit to the well-understood p value. Here, we have used the mp-value to analyze treatments from two gene expression data sets and a high-content imaging data set and have found it to be a simple, biologically sound method for determining whether treatments are active in a given screen. We provide R code and a Web tool for calculating mp-values in order to enable easy application of this method to other data sets.

Materials and Methods

Multidimensional Perturbation Value (mp-Value) Calculation

The mp-value can be used to compare any pair of treatments: a screened treatment against vehicle control(s), a screened treatment against known positive control(s), one batch of a treatment against another batch, and so forth. To simplify the description of the mp-value calculations, the comparison of a screened treatment (“treatment”) to a negative control (“vehicle”) will be used here. A treatment different from the vehicle is considered active, whereas a treatment similar to the vehicle is considered inactive.

A general overview of the mp-value calculation can be found in Figure 1 . All calculations described in this section were performed using R, and an R function for calculating the mp-value can be found in the supplemental material. A Web server is also available for calculating mp-values (http://www.jannahutz.com/mpvalue/). A small example data set is available for download from that site and is used in the supplemental material for a more detailed, step-by-step walkthrough of how the mp-value is calculated.

Figure 1.

Overview of how the mp-value is calculated. Different treatments are illustrated with different colors, and each treatment here is represented in triplicate. After removing treatments that fail to pass quality control filters and performing normalization, multidimensional data subsets are created for each treatment, containing replicates for that treatment as well as vehicle replicates from the same batch. Principal component analysis (PCA) is performed on this two-treatment data set, and components are scaled by the proportion of the total variance that they explain. In this adjusted PCA space, the Mahalanobis distance is calculated between the two treatment groups. Treatment labels are permuted 1000 times (disallowing the original label configuration), and Mahalanobis distances are calculated each time. The mp-value is equivalent to the percentage of permuted Mahalanobis distances greater than the original.

Data selection and normalization

Selection of data points to use for mp-value calculation must take into consideration two properties of the mp-value. First, at least one group (treatment or vehicle) must consist of more than one replicate. Common screen designs, in which treatments are run in duplicate or triplicate and controls in higher numbers, fit this requirement. Second, a statistically significant mp-value is achieved when the multidimensional profiles of the treatment replicates are different from those of the vehicle replicates. To minimize the possibility of a significant mp-value arising from systematic experimental bias rather than biological differences, it is recommended that treatment replicates be matched with vehicle replicates from the same experimental batch. If this is not possible, raw data should be adjusted for batch effects or other sources of experimental variation before combining data from multiple batches. Any replicates failing standard quality control (QC) checks should also be removed before calculating mp-values.

These data are provided to the mpvalue R function, which next creates treatment-specific data sets. For each treatment to be evaluated, its replicates and the replicates of the vehicle control are combined to form a m-by-n data set D. Each of the m rows is a treatment or vehicle replicate (e.g., a single cell, well, or array). The number of columns, n, is equal to the number of readouts for the assay. Next, the rows of D are standardized by subtracting the mean of each row and then dividing the resulting values by the standard deviation of each row. Finally, the columns of D are standardized in the same manner. Row and/or column standardization are used here but may not be appropriate for all types of data; these steps may be omitted in such cases. Although the principles behind the mp-value calculation can be applied to the analysis of univariate data, the mpvalue function is designed to work only for data with multiple readouts; those with single-readout data are recommended to use one of the many existing statistical methods that handle such data.

The example data available for download illustrate the above points. It consists of a single batch subset of normalized, QC-filtered data from the Broad Connectivity Map data set.¹⁷ The vehicle control (DMSO) was run in six replicates in this batch, satisfying the requirement that at least one group in all treatment versus control comparisons have more than one replicate. Two example treatments, clozapine and thioridazine, are profiled in the supplemental material. For each treatment, the D consists of seven rows (one row for the treatment array and six rows for the DMSO arrays) and 22 268 columns (corresponding to the array probes). Each row and column is standardized as described.

Principal component analysis (PCA)

Next, PCA is performed on D, producing an m-by-n matrix P. Now, the n columns represent n principal components instead of measured readouts. They are ordered from 1 to n by descending associated eigenvalue. The n eigenvalues calculated during the PCA are proportional to the amount of variance explained by each principle component, and by dividing each eigenvalue by the sum of all eigenvalues, the vector v is created, containing the percentage of overall variance explained by each principal component. For the thioridazine example (additional details in the supplemental material), P consists of seven rows and seven columns, which represent the seven principal components. The first column explains the most variance (45%), and the last column explains the least (6.6 × 10⁻²⁸%).

To reduce the dimensionality of P, the first c terms in v that sum to up to 0.90 are selected (c ≤ n), and the remaining principal components are discarded, reducing the dimensions of P to m-by-c. Each column in P is then multiplied by its corresponding value in v to weight the value of each principal component by the percentage of variance it explains. At this point, the first two or three principal components can be easily plotted to give an illustration of how the treatment and vehicle replicates relate to each other in PCA space. For the thioridazine example, the first four principle components explain 83% of the variance, so P is trimmed to 7-by-4, and each column is weighted (by 45%, 15%, 13%, and 11%, respectively). Plots of the first two principle components are present in the supplemental material along with the P matrices.

Mahalanobis distance calculation

Next, the Mahalanobis distance is used to assign a value to the separation of the treatment replicates from the vehicle replicates in PCA space. The formula for the calculation of the Mahalanobis distance is shown below, in equation 1.

D_{M} (x) = \sqrt{{(x - μ)}^{T} S^{- 1} (x - μ)} .

Vector x has length c and consists of the means for each principal component in the treatment replicates, whereas the vector µ contains the corresponding vehicle means. The matrix S is the covariance matrix for the principal components in P. To find S, the covariance matrices are first calculated separately for the treatment and vehicle subsets of P, and then each resulting covariance matrix is weighted by the fraction of total samples in P that it contains. The weighted matrices are summed to produce S. These terms (x, µ, and S) are calculated for the example treatments in the supplemental material (clozapine and thioridazine). The resulting Mahalanobis distance for clozapine is equal to 6.60, whereas for thioridazine it is equal to 62.78.

Permutation, mp-values, and significance cutoffs

To assign an mp-value (analogous to an empirical p value) to the Mahalanobis distance, a permutation approach is taken. Permutation tests are commonly used to determine statistical significance when an underlying statistical distribution is unknown and have often been used to analyze biological data.¹⁸ Here, the null hypothesis assumes that there is no difference between the treatment replicates and the vehicle replicates. Therefore, under the null hypothesis, shuffling the replicate labels would not result in a drastically different Mahalanobis distance between the two groups. In contrast, if the treatment replicates are significantly different, the Mahalanobis distance would likely be smaller after randomly shuffling the replicate labels.

To calculate the mp-value, replicate labels for all rows in P are randomly permuted, and the Mahalanobis distance is recalculated. Label permutations identical to the original label configuration are not allowed. In the analyses described here, 1000 permutations were performed. According to the standard procedure for generating permutation-based p values, the significance is equal to the fraction of permutation-derived values greater than or equal to the originally measured value.¹⁸ In other words, the mp-value (significance measurement) is equal to the fraction of permutation-derived Mahalanobis distances that are greater than or equal to the original Mahalanobis distance. Significance values for permutation tests are interpreted in the same way as significance values from parametric statistical analyses; a nominal significance cutoff of 0.05 is often used if multiple testing is not a concern.¹⁸ Permutations yielded an mp-value of 0.328 for clozapine and 0 for thioridazine. Using a significance cutoff of 0.05, thioridazine can be considered active (significantly different from the DMSO control replicates) whereas clozapine cannot.

If a treatment is run in more than one batch, mp-values can be considered separately for each batch or combined. Here, mp-values for any treatments run in multiple batches were combined using Stouffer’s method. Significance cutoffs were established using a false discovery rate of 0.1. To simplify analyses in which the negative log₁₀ of the mp-value is taken, any mp-values equal to zero are set to 0.0001.

Simulations and Comparison to Other Statistics

Simulations of the mp-value’s performance were conducted by repeatedly generating vehicle replicates and positive and negative treatment controls by sampling readouts from a number of different normal distributions. Each treatment control was simulated in duplicate, and 16 replicates were simulated for each vehicle. One hundred readouts were simulated. All readouts were simulated in the vehicle and negative control replicates by sampling from the standard normal distribution. Varying numbers of readouts were simulated in positive treatment control replicates by sampling from other normal distributions (μ = 1, 2, or 3 with σ = 1). Any remaining readouts in positive treatment control replicates were sampled from the standard normal distribution. The mp-value was considered to be accurate if its value was statistically significant (mp-value <0.05) for a positive control treatment or it if was not significant (mp-value >0.05) for a negative control treatment.

The mp-value was compared with three other statistical methods. For the first method, each of the 100 readouts was subjected to a t-test to determine whether that particular readout was significantly different between the treatment and control replicates. A similar methodology was used for the second method, substituting a Wilcoxon rank-sum test for the parametric t-test. In both cases, the 100 resulting p values were combined using Stouffer’s method for meta-analysis, producing a single p value representing the difference between treatment and control replicates. Meta-analysis p values were considered accurate if they were significant for positive control treatments or nonsignificant for negative control treatments. The third method was k-means clustering using the default parameters for the R kmeans function (k = 2). If the k-means clustering grouped only positive control replicates in one cluster and only vehicle replicates in the other, it was considered successful. For the negative control replicates, if they were not separated into a cluster distinct from the vehicle replicates, this was also considered successful.

Experimental Data Sets

The first data set, referred to here as LMF, was generated from a medium-throughput gene expression screen of compounds, siRNAs and cDNAs. The experimental method quantifies gene expression by using flow cytometry to detect levels of fluorophore-labeled cDNA bound to probe-specific beads and has been described previously.¹⁹ For each treatment, the same set of 100 probes was assayed, corresponding to readings for 97 unique genes. Negative controls were DMSO for compound treatments, scrambled siRNA for siRNA treatments, and an empty vector for cDNA treatments. Additional information on this data set as well as the following two data sets can be found in the supplemental material.

The second data set, referred to here as CMap, is the gene expression microarray-based Connectivity Map data set created and described by the Broad Institute.¹⁷ A subset of the microarrays in this data set that shared experimental conditions was used here, consisting of 2674 arrays comprising 1150 unique compound treatments. Arrays were checked for quality using MDQC and normalized using ComBat.^20,21

The third data set, referred to here as HCS, resulted from a screen of a collection of 3805 unique natural product compounds and has been published previously.⁴ Normalization was done using median polish.

Results and Discussion

The mp-value was first assessed using data simulated to reflect a typical biological screen. Treatment samples were simulated to either have no difference (negative control treatments) from the simulated vehicle replicates or to differ from the vehicle (positive control treatments) in one or more readouts. In this simulation, the mp-value was highly accurate in calling negative controls, and it also performed well in identifying positive controls ( Fig. 2 ). It was more successful when more readouts differed between the treatment and vehicle replicates or when readouts differed by a larger margin between the two groups ( Fig. 2 ).

Figure 2.

Performance on simulated data of the mp-value (blue), the t-test (orange), the Wilcoxon rank-sum test (purple), and k-means clustering (gray). Accuracy indicates the percentage of approximately 100 simulated treatments under each condition that were called accurately by each method. (A) Treatments were simulated to not differ from the vehicle control (negative control treatments). (B–D) Treatments varied from the vehicle control by between 1 and 64 readouts. When a readout was simulated to differ in the treatment, it was simulated from a normal distribution with mean of 1 (B), 2 (C), or 3 (D), while all vehicle readouts were simulated from a standard normal distribution.

We next sought to compare the performance of the mp-value to other metrics designed to determine whether two multivariate samples are different. Traditional multivariate statistical tests such as MANOVA and Hotelling’s T² test cannot be conducted when the number of readouts is greater than the total number of replicates, as in the simulated data set. Therefore, two other statistical tests (Student’s t-test and Wilcoxon rank-sum test) were performed for each single readout, with the resulting 100 p values being summarized to one p value by meta-analysis. The third method, k-means clustering, has been used previously in the analysis of multivariate biological data.^22,23 All three of these tests performed similarly to the mp-value in correctly identifying negative controls, with the t-test performing worse than the others ( Fig. 2A ). The decrease in the performance of the t-test would be amplified at the screen level if most treatments in a screen mimic the vehicle control; in this case, the t-test would produce far more false-positives than the other three tests. In addition, the t-test cannot be calculated for treatments in which only a single replicate is available.

For the positive control treatments, all metrics improved when more readouts were simulated to differ between the treatment and vehicle and when the magnitude of the difference was larger ( Fig. 2 ). Relative to the other tests, the Wilcoxon rank-sum test was outperformed for nearly all positive controls. The k-means clustering also performed quite poorly on the positive controls when compared with the mp-value and t-test ( Fig. 2B – D ). The mp-value and the t-test performed best on the positive controls, with the mp-value outperforming the t-test in calling true-positives when few readouts differed between the treatment and vehicle replicates. Because it also outperformed the t-test in correctly calling negative controls, we believe the mp-value is most appropriate for biological screens, which we have found in our experience to have few truly differing readouts and many treatments that do not differ from vehicle controls. For example, in a screen in which 4 of 100 readouts differ in 5% of the treatments (with the remaining 95% of treatments resembling the negative control), extrapolating from the data shown in Figure 2 suggests that the mp-value would be predicted to call 89% to 92% of samples accurately, depending on the magnitude of the biological difference, whereas the t-test would call 78% to 80% correctly. Because of these results, we are confident in the accuracy and the appropriateness of the mp-value for differentiating between pairs of treatments in real biological screening data.

The mp-value calculation was first applied to the LMF data set, which consists of measurements of gene expression levels conducted using 100 probes representing 97 unique genes. Three classes of treatments were screened on this platform: compounds, siRNAs, and cDNAs. Of the 7052 compounds screened, 3893 (55%) had mp-values that were smaller than the statistical significance cutoff, suggesting that they produced a significantly different 100-probe profile than vehicle control and could be considered active in this screen ( Table 1 ). The percentages of active treatments differed for the genomic reagents. Similar numbers of siRNAs and cDNAs were screened on the LMF platform, but the percentages of active treatments were very different. Seventy-nine percent of siRNA treatments had statistically significant mp-values, whereas none of the cDNA treatments reached statistical significance using a false discovery rate threshold of 0.1. Many cDNA treatments tended toward significance; however, 25% of them had nominal mp-values less than 0.05.

Table 1.

Percentage of Treatments in Each Group Considered Active by the mp-value with a False Discovery Rate of 10%.

Data Set	Active Treatments (% of Total)
LMF, compounds	55
LMF, siRNAs	79
LMF, cDNAs	0
CMap	13
High-content screening	93

Several possibilities exist for the differential activity of the siRNAs and cDNAs. Importantly, cDNA selection was unbiased, whereas siRNA selection was limited to genes of interest to internal drug development programs. It makes sense that perturbing these genes might be more likely to produce signatures indicative of biological activity. Another explanation concerns off-target effects, which are known to be common with siRNAs.^24,25 In a signature measuring 97 genes, a given siRNA may perturb several as a result of off-target activity. If these perturbations are consistent across replicates, the siRNA will likely have a statistically significant mp-value. In contrast, cDNA reagents lack a similar inherent mechanism for producing transcriptional off-target effects. Therefore, cDNA-induced perturbations of the gene signature may be comparatively rare. In addition, the negative cDNA control was moderately variable across replicates, and this increased noise likely also contributed to the lack of statistically significant findings. Supplemental Figure 1 shows a cDNA treatment corresponding to one of the genes probed. Its expression increased approximately 11-fold in the treated replicates versus the negative control replicates, but the variability in the control replicates caused the comparison to have an mp-value of only 0.09, higher than the nominal statistical significance cutoff of 0.05.

To explore the biological validity of the mp-values, a detailed analysis was undertaken of a subset of the compounds screened in the LMF data set. In this subset, 256 compounds were run in four-point dose response at concentrations of 10 nM, 100 nM, 1000 nM, and 10 000 nM, and 131 compounds were run in eight-point dose response at concentrations of 5 nM, 14 nM, 41 nM, 123 nM, 370 nM, 1111 nM, 3333 nM, and 10 000 nM. Across all compounds, treatments had more statistically significant mp-values at higher concentrations ( Fig. 3A , B ). To determine whether this aggregate pattern held true for each compound, mp-values were plotted by dose for each compound and then subjected to linear regression. The results of this analysis, details of which can be found in the supplemental material, were highly statistically significant and indicated that mp-values tended to be more statistically significant at higher doses, as expected. These results strongly support the biological validity of the mp-value.

Figure 3.

(A, B) Percentage of treatment-vehicle comparisons with statistically significant mp-values at increasing doses for compounds run in four-point and eight-point dose response, respectively. (C, D) Percentage of treatment-treatment pairs in which treatments are indistinguishable from each other (mp-values are nonsignificant), plotted by the fold difference in dose between the treatments in each pair. All treatment-treatment pairs represented here are pairs in which the treatments are identical in compound structure and differ only in dose. All treatments are also individually active (have significant mp-values when compared with vehicle control). (E) Principal component analysis plots of two concentrations of emetine and corresponding vehicle control wells from the same batch. The higher concentration has a much more significant mp-value. (F) LMF expression profiles for each emetine dose. The strong induction of one gene (CYP1A1) at 10 000 nM likely drives the significance of the mp-value.

A second analysis of the LMF data set supported the ability of the mp-value to detect phenocopies and thereby answer the question of whether two treatments are similar. In large screens, one often wants to identify treatments that act similarly to a positive control, even if the potencies may differ. For example, readouts obtained for a given compound at the 10 nM dose would be expected to be more similar to readouts obtained at 100 nM (a 10-fold difference in concentration) than to readouts obtained at 10 µM (a 1000-fold difference). To test this in the LMF data set, replicates for each dose of a compound were compared with replicates of other doses for that same compound. As shown in Figure 3C and D , two treatments were more often indistinguishable by the mp-value when the dose difference was smaller. However, even when the difference in two doses of the same compound was large (i.e., 2000-fold), the treatments were still called similar approximately 79% of the time. These findings are encouraging and suggest that phenocopies of a treatment can successfully be identified using the mp-value, even in cases in which there are differences in potency.

The mp-value’s biological validity was also supported by examining which compound classes tended to produce significant mp-values. Because LMF is a transcriptional readout, it would make sense for compounds that affect transcriptional processes to have statistically significant mp-values. The biological process terms for 1731 Gene Ontology (GO) were linked to at least three compounds in the LMF data set by using internal compound-target and target-GO term annotation. Of these, 445 GO terms were overrepresented by Fisher’s exact test in the subset of compounds considered active, using a false discovery rate threshold of 0.1. If compounds that target transcriptional processes tend to be more active in the LMF data set, then transcription-related GO terms should be enriched within the 445 GO terms found to be associated with active compounds. This was indeed the case; 30 of the original 1731 GO biological process terms were related to transcription, and 20 of these were enriched in the GO term subset associated with active compounds (Supplemental Table 1). By Fisher’s exact test, the p value for this enrichment was 2.3 × 10⁻⁶. These results provided further evidence for the ability of the mp-value to detect true biological perturbations.

The reliability of the mp-value was also tested on another gene expression–based data set: the Broad Connectivity Map (CMap). The CMap data set contains many more readouts than LMF—more than 22 000 compared with 100. However, the subset of CMap analyzed here shares many similarities with the LMF data set: The cell line is the same (MCF7), as is the time point (6 hr). Of the 1150 compounds screened in this subset, 835 were also screened in LMF. Although many experimental parameters differ between CMap and LMF, there is still high concordance between the significance of the mp-values for the 835 compounds present in both data sets, with a p value of 2.899 × 10⁻⁷ by Fisher’s exact test for the contingency table shown in Table 2 .

Table 2.

Agreement of the Significance of mp-values for Compounds Present in Both the LMF and CMap Data Sets.

	CMap, active	CMap, inactive
LMF, active	75	243
LMF, inactive	52	465

Of the 1150 CMap compounds analyzed, 144 (13%) had statistically significant mp-values. This percentage is substantially lower than the 55% of compounds active in the LMF data set. When the percentages are calculated using only the 835 compounds that were screened in both this subset of CMap and in LMF, the percentages are slightly more similar; 15% are active in CMap, and 38% are active in LMF. A possible explanation for the difference is that the vehicle replicates for CMap may have exhibited higher variability than the vehicle replicates for LMF. In CMap, each treatment was usually run only once in a given batch, so batches were combined to have sufficient replicates to calculate mp-values. Normalization methods performed well in reducing batch effects (Supplemental Fig. 2) but may have been unable to completely eliminate them. Larger variation in vehicle replicates would result in increased mp-values. Another explanation for the different percentages of active compounds may be due to the dimensionality of the original data. Although it may seem counterintuitive that a method measuring fewer variables (such as LMF) would capture more biological variation, recent analyses have shown smaller, optimized gene signatures to be more accurate when used to identify pairs of similar treatments.²⁶ The increased accuracy of smaller gene signatures may be due to the elimination of large numbers of readouts that are not perturbed upon treatment and therefore represent noise.

The CMap data set also presented an opportunity to use the mp-value methodology to answer a different type of question: whether two noncontrol treatments are different from each other. In this data set, eight microarrays were run using trichostatin A obtained from Calbiochem (La Jolla, CA), and 51 microarrays were run using trichostatin A from Sigma (St. Louis, MO). Aside from being run in a number of different batches, all other experimental parameters were identical for these 59 arrays. A plot of the first two principal components for these treatments is shown in Figure 4 . The mp-value calculated for the comparison of these two groups was less than 0.001 and highly statistically significant. This strongly suggests a fundamental difference in the chemicals from two different suppliers.

Figure 4.

Plot of the first two principal components describing the subset of CMap data containing the replicates using trichostatin A supplied from Sigma and the replicates using the same chemical supplied from Calbiochem. The mp-value calculated using these data is statistically significant (mp-value <0.001).

The mp-value was next used to evaluate treatments in a third data set, an HCS of natural products. In this data set, 92.7% of treatments had statistically significant mp-values, a much higher percentage than was found for either of the gene expression–based data sets. A likely reason for this is that there was much more statistical power in this data set because of the high number of replicates, as hundreds of cells were imaged for each treatment and considered as individual replicates. With this sample size, treatment-vehicle differences that are small in magnitude can still be identified as statistically significant. Indeed, many of the treatments in this data set with significant mp-values had much smaller Mahalanobis distances than those seen in the LMF and CMap data sets (Supplemental Fig. 3). When many treatments have significant mp-values, the calculated Mahalanobis distances are a logical secondary criterion for hit prioritization, as they represent the magnitude of the biological difference. Treatments can be chosen for follow-up by selecting all treatments with a significant mp-value and then prioritizing those that have Mahalanobis distances above a threshold, which would differ for each assay. By using the statistical significance of the mp-value as a first-pass selection criterion and the Mahalanobis distance for the second pass, one can be assured that selected hits are significantly different from vehicle in both the statistical and biological sense.

The mp-value is intuitive, straightforward, and easy to calculate. To our knowledge, it is the first method that fully uses all readouts and replicates obtained in a screen and assigns a statistical significance value to each treatment based on an easy-to-visualize separation. Others have developed metrics that use the Mahalanobis distance, PCA, or similar methods for describing the similarity of multivariate profiles, but we believe this to be the first method to use these procedures in concert and to provide an estimate of statistical significance.^27–31 Previously published methods also often require scientists to make an a priori selection of relevant readouts, samples for training set, or a cutoff for the final metric.^27–31

The mp-value can be applied to data from many types of multidimensional screens; we have demonstrated its use here on two gene expression data sets and one high-content imaging data set. These data sets contained varying numbers of readouts (29, 100, and more than 22 000). The biological validity of the mp-value was demonstrated in several ways. As expected, it identified a higher percentage of active treatments among the highest doses versus the lowest doses for compounds run in dose response. In addition, it detected expected effects on transcription in a gene expression-based data set. It also had a high concordance rate when evaluating the significance of a compound in two different expression-based data sets.

As the results from the LMF, CMap, and HCS data sets have shown, there are several key factors that contribute to whether an mp-value will be statistically significant. The first and most intuitive of these is the magnitude of the difference between the two experimental groups; the larger the biological difference, the higher the likelihood that the mp-value will be statistically significant. The second factor is the variation in the replicates for each experimental group. As mentioned above, increased variation in replicates of a given treatment can cause an increase in the mp-value. Including replicates from multiple experimental batches can introduce such variation, so we have stressed that analyses should compare treatments run in the same batch. Finally, the third major factor is the number of replicates that are run. As shown in the HCS data set, when hundreds of replicates are available for each treatment, mp-values can be statistically significant even when the biological differences are small. The converse is also true; with insufficient replicates, mp-values for truly different treatments might not reach statistical significance. We recommend that experimental treatments within a batch be run in duplicate at minimum (and preferably in triplicate) and that controls be run in higher numbers (preferably at least 8 to 16 replicates per batch). Consideration of the first two factors mentioned above (magnitude of expected biological difference and expected assay variation) during assay planning will help inform the selection of the number of replicates to run for each treatment.

Despite the strengths of the mp-value, there are characteristics of the mp-value that may present challenges in certain situations. In some cases, such as in the HCS data set tested here, nearly all of the assayed treatments may have statistically significant mp-values. The mp-value itself is designed to merely say whether two treatments are different; its value should not be used to further prioritize this subset of significant treatments. However, as mentioned above, the Mahalanobis distance calculated during the mp-value calculation process can be used for a secondary prioritization, as it indicates the magnitude of the difference between the treatments. The provided R code outputs both the mp-value and the Mahalanobis distance so both metrics can be considered if desired. The permutation process introduces another characteristic of the mp-value that some may find inconvenient; it produces a limited number of discrete mp-values equal to the number of permutations run. To avoid this, a distribution can be fit to the Mahalanobis distances produced by the permutations, and the p value of the original Mahalanobis distance can be calculated according to the parameters of that distribution. We tested this approach in the HCS data set using gamma distributions fitted for each and found a high correlation between the original mp-values and those derived using the gamma distribution parameters (Supplemental Fig. 4). The provided R code contains an option to calculate mp-values based on the best-fitting gamma distribution for the permuted Mahalanobis distances.

One final inherent characteristic of the mp-value is that its distribution can vary widely from batch to batch. In one batch, vehicle replicates may be highly similar to each other, leading to more significant mp-values than would be seen in a batch in which vehicle replicates are not as consistent. The batch-to-batch variability of mp-values may be seen as an inconvenience, but it can also be used to indicate assay quality; if a batch has no statistically significant mp-values, it may be useful to investigate whether there was a consistency issue with the negative controls. When comparing the positive control(s) to the negative control(s) in a given batch, the mp-value can be considered to act similarly to the univariate Z′ factor; both provide a metric describing the difference between positive and negative controls.³² In general, we find the mp-value to be a simple, appealing solution for evaluating treatments while dealing with the issues of difference, variability, and reproducibility that are inherent in multidimensional screening data.

Footnotes

Acknowledgements

We thank Fritz Roth for helpful discussions and mentorship. We would also like to acknowledge Yan Feng, Anne Kummel, and Xian Zhang for providing the HCS data set and accompanying annotation. In addition, we would like to thank Chris Wilson, Christian Parker, Xian Zhang, and John Tallarico for valuable feedback and discussions. Finally, we wish to thank those who have read and provided feedback on this article, including anonymous reviewers. Janna Hutz and Florian Nigsch are the recipients of NIBR Presidential Postdoctoral Fellowships.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Supplementary material for this article is available on the Journal of Biomolecular Screening Web site at .

References

Feng

Mitchison

T. J.

Bender

Young

D. W.

Tallarico

J. A.

Multi-parameter Phenotypic Profiling: Using Cellular Effects to Characterize Small-Molecule Compounds. Nat. Rev. Drug Discov. 2009, 8, 567–578.

Wagner

B. K.

Clemons

P. A.

Connecting Synthetic Chemistry Decisions to Cell and Genome Biology Using Small-Molecule Phenotypic Profiling. Curr. Opin. Chem. Biol. 2009, 13, 539–548.

Giuliano

K. A.

DeBiasio

R. L.

Dunlay

R. T.

Gough

Volosky

J. M.

Zock

Pavlakis

G. N.

Taylor

D. L.

High-Content Screening: A New Approach to Easing Key Bottlenecks in the Drug Discovery Process. J. Biomol. Screen. 1997, 2, 249–259.

Young

D. W.

Bender

Hoyt

McWhinnie

Chirn

G. W.

Tao

C. Y.

Tallarico

J. A.

Labow

Jenkins

J. L.

Mitchison

T. J.

Feng

Integrating High-Content Screening and Ligand-Target Prediction to Identify Mechanism of Action. Nat. Chem. Biol. 2008, 4, 59–68.

Glenn

T. C.

Field Guide to Next-Generation DNA Sequencers. Mol. Ecol. Resour. 2011, 11, 759–769.

Ning

Fang

Hong

Perkins

Tong

Shi

Next-Generation Sequencing: A Revolutionary Tool for Toxicogenomics. Gen. Appl. Syst. Toxicol. 2011.

Campillos

Kuhn

Gavin

A. C.

Jensen

L. J.

Bork

Drug Target Identification Using Side-Effect Similarity. Science 2008, 321, 263–266.

Fedorov

Marsden

Pogacic

Rellos

Muller

Bullock

A. N.

Schwaller

Sundstrom

Knapp

A Systematic Interaction Map of Validated Kinase Inhibitors with Ser/Thr Kinases. Proc. Natl. Acad. Sci. U. S. A. 2007, 104, 20523–20528.

Fliri

A. F.

Loging

W. T.

Thadeio

P. F.

Volkmann

R. A.

Analysis of Drug-Induced Effect Patterns to Link Structure and Side Effects of Medicines. Nat. Chem. Biol. 2005 1, 389–397.

10.

Shoemaker

R. H.

The NCI60 Human Tumour Cell Line Anticancer Drug Screen. Nat. Rev. Cancer 2006, 6, 813–823.

11.

Butte

The Use and Analysis of Microarray Data. Nat. Rev. Drug Discov. 2002, 1, 951–960.

12.

Girolami

Mischak

Krebs

Analysis of Complex, Multidimensional Datasets. Drug Discov. Today Technol. 2006, 3, 13–19.

13.

Hughes

T. R.

Marton

M. J.

Jones

A. R.

Roberts

C. J.

Stoughton

Armour

C. D.

Bennett

H. A.

Coffey

Dai

Y. D.

Kidd

M. J.

King

A. M.

Meyer

M. R.

Slade

Lum

P. Y.

Stepaniants

S. B.

Shoemaker

D. D.

Gachotte

Chakraburtty

Simon

Bard

Friend

S. H.

Functional Discovery via a Compendium of Expression Profiles. Cell 2000, 102, 109–126.

14.

Natsoulis

El Ghaoui

Lanckriet

G. R.

Tolley

A. M.

Leroy

Dunlea

Eynon

B. P.

Pearson

C. I.

Tugendreich

Jarnagin

Classification of a Large Microarray Data Set: Algorithm Comparison and Analysis of Drug Signatures. Genome Res. 2005, 15, 724–736.

15.

Scherf

Ross

D. T.

Waltham

Smith

L. H.

Lee

J. K.

Tanabe

Kohn

K. W.

Reinhold

W. C.

Myers

T. G.

Andrews

D. T.

Scudiero

D. A.

Eisen

M. B.

Sausville

E. A.

Pommier

Botstein

Brown

P. O.

Weinstein

J. N.

A Gene Expression Database for the Molecular Pharmacology of Cancer. Nat. Genet. 2000, 24, 236–244.

16.

Iskar

Campillos

Kuhn

Jensen

L. J.

van Noort

Bork

Drug-Induced Regulation of Target Expression. PLoS Comput. Biol. 2010, 6.

17.

Lamb

Crawford

E. D.

Peck

Modell

J. W.

Blat

I. C.

Wrobel

M. J.

Lerner

Brunet

J. P.

Subramanian

Ross

K. N.

Reich

Hieronymus

Wei

Armstrong

S. A.

Haggarty

S. J.

Clemons

P. A.

Wei

Carr

S. A.

Lander

E. S.

Golub

T. R.

The Connectivity Map: Using Gene-Expression Signatures to Connect Small Molecules, Genes, and Disease. Science 2006, 313, 1929–1935.

18.

Manly

B. F. J.

Randomization, Bootstrap And Monte Carlo Methods in Biology; Chapman & Hall/ CRC: Boca Raton, FL, 2007.

19.

Peck

Crawford

E. D.

Ross

K. N.

Stegmaier

Golub

T. R.

Lamb

A Method for High-Throughput Gene Expression Signature Analysis. Genome Biol. 2006, 7, R61.

20.

Cohen Freue

G. V.

Hollander

Shen

Zamar

R. H.

Balshaw

Scherer

McManus

Keown

McMaster

W. R.

R. T.

MDQC: A New Quality Assessment Method for Microarrays Based on Quality Control Reports. Bioinformatics 2007, 23, 3162–3169.

21.

Johnson

W. E.

Rabinovic

Adjusting Batch Effects in Microarray Expression Data Using Empirical Bayes Methods. Biostatistics 2007, 8, 118–127.

22.

Huang

Southall

Cho

M. H.

Xia

Inglese

Austin

C. P.

Characterization of Diversity in Toxicity Mechanism Using In Vitro Cytotoxicity Assays in Quantitative High Throughput Screening. Chem. Res. Toxicol. 2008, 21, 659–667.

23.

Gagarin

Makarenkov

Zentilli

Using Clustering Techniques to Improve Hit Selection in High-Throughput Screening. J. Biomol. Screen. 2006, 11, 903–914.

24.

Birmingham

Anderson

E. M.

Reynolds

Ilsley-Tyree

Leake

Fedorov

Baskerville

Maksimova

Robinson

Karpilow

Marshall

W. S.

Khvorova

3′ UTR Seed Matches, But Not Overall Identity, Are Associated with RNAi Off-Targets. Nat. Methods 2006, 3, 199–204.

25.

Jackson

A. L.

Bartz

S. R.

Schelter

Kobayashi

S. V.

Burchard

Mao

Cavet

Linsley

P. S.

Expression Profiling Reveals Off-Target Gene Regulation by RNAi. Nat. Biotechnol. 2003, 21, 635–637.

26.

Nigsch

Hutz

Cornett

Selinger

D. W.

McAllister

Bandyopadhyay

Loureiro

Jenkins

J. L.

Determination of Minimal Transcriptional Signatures of Compounds for Target Prediction. EURASIP J. Bioinform. Syst. Biol. 2012, 1, 2.

27.

Codrea

M. C.

Hakala-Yatkin

Karlund-Marttila

Nedbal

Aittokallio

Nevalainen

O. S.

Tyystjarvi

Mahalanobis Distance Screening of Arabidopsis Mutants with Chlorophyll Fluorescence. Photosynth. Res. 2010, 105, 273–283.

28.

Durr

Duval

Nichols

Lang

Brodte

Heyse

Besson

Robust Hit Identification by Quality Assurance and Multivariate Data Analysis of a High-Content, Cell-Based Assay. J. Biomol. Screen. 2007, 12, 1042–1049.

29.

Kummel

Gubler

Gehin

Beibel

Gabriel

Parker

C. N.

Integration of Multiple Readouts into the z′ factor for Assay Quality Assessment. J. Biomol. Screen. 2010, 15, 95–101.

30.

Lachenmeier

D. W.

Frank

Humpfer

Schäfer

Keller

Mörtter

Spraul

Quality Control of Beer Using High-Resolution Nuclear Magnetic Resonance Spectroscopy and Multivariate Analysis. Eur. Food Res. Technol. 2005, 220, 215–221.

31.

Wilson

C. J.

Thompsons

C. M.

Smellie

Ashwell

M. A.

Liu

J. F.

Yohannes

S. C.

Identification of a Small Molecule That Induces Mitotic Arrest Using a Simplified High-Content Screening Assay and Data Analysis Method. J. Biomol. Screen. 2006, 11, 21–28.

32.

Zhang

J. H.

Chung

T. D.

Oldenburg

K. R.

A Simple Statistical Parameter for Use in Evaluation and Validation of High Throughput Screening Assays. J. Biomol. Screen. 1999, 4, 67–73.

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.

0.83 MB

0.00 MB

The Multidimensional Perturbation Value

Abstract

Keywords

Materials and Methods

Multidimensional Perturbation Value (mp-Value) Calculation

Data selection and normalization

Principal component analysis (PCA)

Mahalanobis distance calculation

Permutation, mp-values, and significance cutoffs

Simulations and Comparison to Other Statistics

Experimental Data Sets

Results and Discussion

Footnotes

Acknowledgements

Declaration of Conflicting Interests

Funding

References

Supplementary Material