Multiparametric Analysis of Screening Data

Abstract

Advances in instrumentation now allow the development of screening assays that are capable of monitoring multiple readouts such as transcript or protein levels, or even multiple parameters derived from images. Such advances in assay technologies highlight the complex nature of biology and disease. Harnessing this complexity requires integration of all the different parameters that can be measured rather than just monitoring a single dimension as is commonly used. Although some of the methods used to combine multiple measurements, such as principal component analysis, are commonly used for microarray analysis, biologists are not yet using many of the tools that have been developed in other fields to address such issues. Visualization of multiparametric data sets is one of the major challenges in this field, and a depiction of the results in a manner that can be readily interpreted is essential. This article describes a number of assay systems being used to generate such data sets en masse, and the methods being applied to their visualization and analysis. We also discuss some of the challenges of applying methods developed in other fields to biology.

Keywords

multiparametric data analysis, multiparametric visualization, high-content screening, machine learning, cell-based screening

Introduction

Phenotype is defined as the observable physical or biochemical characteristics of an organism resulting from its genetic background and the environment it has experienced. Such a definition of phenotype can be applied to whole organisms, organs, and tissues, all the way down to individual cells, including those grown in vitro. The precision with which a phenotype can be defined is often limited by the number of parameters being measured. For example, cell death is a simple phenotype that can be measured using cell number; it can be further refined into apoptosis, necrosis, or senescence using parameters such as induction of caspases, DNA fragmentation, and cell morphology. Another example of how describing a phenotype with more parameters can lead to a more exacting description of the biological process being studied was presented by Tsiper and coworkers, who were able to show that increasing the numbers of morphological features used to describe the effect of statins on cells allowed the different clinical effects of the compounds to be identified.¹

The complexity with which a phenotype can be defined is further increased once tissues and organs are considered when multiple cells and cell types are interacting. Biological systems often depend on a balance of multiple factors, and no single measurement will correctly describe a particular biological state. The characterization of biological states using multiple biological readouts is especially important for describing developmental and physiological states. Stem cells are often characterized by the presence or absence of multiple gene transcripts or protein markers, and the status of immune cells by the presence or absence of multiple cell-surface markers.^2,3 There are now a number of assay systems that allow multiple parameters to be measured on a large number of samples (see Table 1). Such studies are now starting to show that cells may not display discrete phenotypes (in a simple binary manner) but rather display a continuum of states.⁴

Table 1.

Commonly Used Multiparametric Assay Methods.

Methods	What Is Being Measured	Throughput (# of Samples)
RNA
Microarrays	Gene expression	Low
LMF and branched-chain DNA amplification methods	Gene expression	High
Protein
FACS	Proteins	Low/Medium
CyTOF	Proteins	Low/Medium
Luminex	Proteins (cytokines)	Low/Medium
RPA	Proteins	Medium
MS-based proteomics	Proteins (peptides)	Low
Metabolites
Metabolomics	Metabolites	Low
Cell-Based Features
Scanning laser cytometry	Proteins and DNA	High
High-content imaging	Proteins, DNA, and structural features	High

We exclude time-based methods (e.g., CellKey), concentration-based methods (e.g., the half maximal inhibitory concentration, or IC50), or collections of data from different sources (e.g., iPOP and movies). LMF, ligation-mediated amplification, microsphere, and flow-cytometric detection system; FACS, fluorescence-activated cell sorting; CyTOF, time-of-flight mass cytometry; RPA, reverse-phase protein microarray; MS, mass spectrometry.

As technologies have been developed to handle larger numbers of samples, there is an increasing need to use machine-learning tools to evaluate the results from such experiments. In this article, we describe any technology that generates data requiring computational support for data analysis as being high throughput, so that methods that monitor only a few samples but generate hundreds of parameters (e.g., microarray experiments) are included, and methods that can screen millions of samples while still generating multiple readouts (e.g., image-based assays) are also considered. Fortunately, although these assay platforms use very different technologies, they produce data sets with a similar structure to those generated by cheminformatics and other data-intensive fields. Many of the analysis methods developed in these other scientific fields can be reapplied to these studies. Visualization methods are essential not only to assess consistency of the data (which may have been collected during a number of different experiments) but also to allow the data to be explored in novel ways that may lead to further insights into the underlying biology of the process being studied.

In pharmaceutical research, the goal is to correct, or at least modulate, the phenotype of the treated individual. Such a goal requires multiple parameters of the body’s physiology and phenotype to be monitored and optimized, sometimes with opposite objectives [e.g., the optimization of selective glucocorticoid-receptor (GR) agonists requires the optimization of transrepression functions without activation of the transactivation functions of the GR receptor].⁵

The objective of this article is to describe the various ways that are being used to analyze and compare, in a quantitative manner, results from such multiparametric experiments. This article will not review the many commercial or open-source software packages that are available to perform these analysis methods but, rather, will focus on the underlying methods and techniques being applied. The article will make comparisons to methods from engineering, computational chemistry, and other fields for analysis of multiple readouts and also try to highlight some of the visualization tools available to present such results.

Methods Generating Multiple Assay Readouts

Table 1 lists the commonly used assay methods that result in multiple readouts from any assay sample.

RNA

The availability of microarrays in the 1990s made it possible to monitor the biological responses of cells to changing environments, drug treatments, or toxins.⁶ Recent advances in next-generation-sequencing methods now make it possible to monitor transcriptional changes in an increasingly unbiased manner, such that rare splicing variants or unexpected transcripts can be detected and quantified, even in species where microarrays are not available.⁷ Such methods have the ability to report on not only the 20,000 or so genes expressed from the human genome but also any noncoding RNAs that may be expressed in the cell. Methods to collect such data sets have improved throughout time with greater effort being applied to the design of such experiments to obtain sufficient statistical significance when monitoring changes or differences among different states or treatments.⁸

The development of ligation-mediated and branched-chain DNA amplification methods has now made it possible to monitor numerous gene transcripts in multiple samples, allowing compound profiling and screening to be conducted.^9,10,11 The result is that it is now possible to screen large numbers of samples, monitoring the effects of up to 90 genes simultaneously.^9,12 This, combined with the fact that genes can be chosen to represent the combined effects of multiple pathways, means that with careful selection of the transcripts to be tracked, many aspects of cellular physiology can be monitored simultaneously.¹³ In our experience, however, there are significant challenges in generating results that are consistent in a day-to-day and batch-to-batch manner using these technologies. Methods to normalize among different experiments and improvements to assay consistency are still needed to fully exploit these methods.

Protein

Although transcript analysis monitors the cellular effects on gene expression at the level of transcription or RNA stability, it has been reported on a number of occasions that changes in messenger RNA (mRNA) levels do not directly reflect changes in protein levels.¹⁴ Therefore, a number of different methods have been developed to monitor changes in protein levels in response to changing treatments. In general, the mass spectrometry (MS)-based proteomics methods such as two-dimensional (2D) gel electrophoresis or liquid chromatography (LC)-MS-based proteomic methods, recently reviewed,¹⁵ will not be discussed further in this article because the number of samples they can analyze is fairly limited. Methods that focus on monitoring the presence of a number of preselected proteins using affinity reagents such as antibodies or aptamers have a higher throughput and so can be used for screening applications. Such methods include technologies such as protein arrays^16,17 or Luminex bead–based multiplexed assays.¹⁸ Recent developments have enabled the use of fluorescence-activated cell sorting (FACS) for multiparametric screening. Conventional FACS machines are able to monitor up to six different parameters (four colors with forward and side scatter) on an individual cell basis, thus allowing simple profiling of the cell population. With the development of sampling and data-analysis tools,¹⁹ such FACS-based methods are being used for high-throughput screening (HTS) applications.²⁰ Further developments of the detector arrays of FACS machines have allowed the development of machines capable of monitoring up to 44 variables simultaneously on each particle detected in the flow cytometer.²¹ Antibodies labeled with rare earth elements have allowed the development of mass cytometry [or time-of-flight mass cytometry (CyTOF)], in which up to 100 parameters can be detected on a cell-by-cell basis. In addition, by using specific reagents to label the protein content of the cell, this method allows multiplexing of samples for greater throughput.²² The main disadvantage of this method lies in the use of inductively coupled plasma (ICP)-MS for the detection of the labeled antibody, which requires the vaporization of cells in argon plasma, precluding any recovery of the sample.

Compared to transcriptome-based methods, the phenotype of individual cells can be monitored, allowing for the detection of rare cell types or events within a population, such as differentiation or changes in samples containing multiple cell types.

Metabolites

Although changes in mRNA or protein levels are important for monitoring phenotypes, the abundance of metabolites can also be used to track changes in cellular physiology. Metabolomics, just like proteomics, can be focused on a set of preselected analytes or can monitor changes in the overall pattern and abundance of metabolites present in a sample.²³ Such methods allow one to monitor changes in cellular metabolism that may not be apparent on the time scale required to detect changes in transcription or protein abundance, or to analyze samples from species for which genomic-sequence data are not available²⁴ or samples that contain mixtures of organisms.²⁵ The throughput of these methods has not, however, reached the point at which they can be used for routine screening efforts.

Cell Features

Screening approaches that are capable of monitoring cell-by-cell effects have even been extended to laser-scanning cytometry and to microscopy-based efforts, and they are often referred to as high-content assays. Such assays try to capture the effect of treatments on multiple cellular readouts, at multiple concentrations, and now even against multiple cell lines.^26,27 As for metabolites or proteomics, such assays may be used to monitor only specific events of interest to the researchers (e.g., nuclear translocation of a specific transcription factor) or as many morphological features of the cell in response to treatment as possible.²⁸ Because imaging-based assay methods have the ability to report on the state of individual cells (much as FACS and CyTOF methods can do), it becomes possible to record not only individual cell parameters but also population statistics, which has proven to be very important for normalizing and correcting for differences in the cell cycle or the cell local environment in small interfering RNA (siRNA) screens.²⁸

Methods for Analyzing Multiparametric Data

One simple, yet effective, approach to analyze multiparametric data is to focus on a small subset of parameters, analyze them independently, and then combine the results. The gating approach commonly used in flow-cytometry data analysis sets up thresholds on different fluorescent markers to separate cells into unique subpopulations. Such gating strategies have, however, the disadvantage that if conducted manually, they can result in significant variability among users.²⁹ An additional source of variability can be normalization within and among different experiments, although tools to address these issues have been developed.^30,31 In high-content image-based screening, studies typically focus on one primary readout, such as a target protein signal, while using other readouts as filters to identify treatments that have resulted in overt toxicity.³² This approach, however, is limited to a small number of readouts and does not take into account the correlation among readouts. To fully leverage the phenotypic knowledge in multiparametric data, one needs to apply multiparametric statistical learning methods.³³
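The threshold-based gating described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each cell is represented as a dictionary of marker intensities, and the marker names (`CD3`, `CD4`), intensities, and cutoffs are hypothetical values chosen for the example, not taken from any cited study.

```python
from collections import Counter

def gate(cell, thresholds):
    """Label a cell by which markers exceed their cutoffs (e.g. 'CD3+CD4-')."""
    return "".join(
        f"{marker}{'+' if cell[marker] > cutoff else '-'}"
        for marker, cutoff in thresholds.items()
    )

# Hypothetical marker intensities for four cells (arbitrary units).
cells = [
    {"CD3": 850.0, "CD4": 920.0},
    {"CD3": 790.0, "CD4": 40.0},
    {"CD3": 30.0, "CD4": 25.0},
    {"CD3": 910.0, "CD4": 870.0},
]

# Hypothetical, manually chosen cutoffs; another analyst might pick other values.
thresholds = {"CD3": 200.0, "CD4": 200.0}

# Count how many cells fall into each gated subpopulation.
subpopulations = Counter(gate(cell, thresholds) for cell in cells)
print(subpopulations)
```

Because every cutoff is chosen by hand and each marker is thresholded independently, moving a single value in `thresholds` can reassign cells between subpopulations, which is one way to see the user-to-user variability noted above.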

Multiparametric statistical methods are designed to understand the structure and patterns from current data sets, and thus to define predictive functions that can be applied to new data. These methods are widely applied in multiparametric screening data analysis.^26,34,35 A list of different types of analysis and example methods is shown in Table 2, grouped by their objectives. Feature-selection and dimension-reduction methods aim to identify the informative subset or combination of original readouts.³⁶ Distance or similarity methods aim to convert multiple readouts into one distance or similarity score to represent how close the phenotypes of two treatments are. Machine-learning methods aim to identify distinct phenotype groups in the data set, either based on predefined phenotypes (supervised) or without predefined phenotypes (unsupervised).³⁷

Table 2.

List of Multiparametric Analysis Methods.

Multiparametric Analysis	Objective	Method Example
Feature selection	Remove noisy or noninformative readouts.	Filtering, and recursive feature elimination with support vector machine
Dimension reduction	Project readouts to a lower-dimensional space.	Principal component analysis, and factor analysis
Distance or similarity	Calculate a single similarity value based on readouts.	Euclidean, correlation, maximal information coefficient, and Mahalanobis distance
Supervised learning	Classify samples based on a preannotated training set.	Support vector machine, random forest, K-nearest neighbors, linear discriminant analysis, and naïve Bayesian
Unsupervised learning	Cluster samples into groups without a priori knowledge.	Hierarchical clustering, K-means clustering, self-organizing map, and SPADE

SPADE, spanning-tree progression analysis of density-normalized events.

Feature Selection and Dimension Reduction

Raw readouts from screening assays may contain noisy or redundant information. Multiple readouts are often highly correlated, due to either related biology (e.g., antibody readouts in FACS) or correlated measurement [e.g., cell perimeter versus cell area in high-content screening (HCS)]. By focusing on the relevant and informative readouts, dimension reduction and feature selection will help analysis efficiency and accuracy as well as understanding of the underlying biological processes. Many dimension-reduction methods have been applied in bioinformatics;³⁶ here, we focus on methods that have been applied to screening data. One simple way to select the readouts is to apply certain filtering criteria, such as variance throughout a whole data set, reproducibility among replicates, or separation between negative and positive controls. This method is straightforward and fast to compute, but it does not take into account the potential correlation among readouts, and readouts not separating the controls may be important for identifying novel phenotypes in the screen. Alternatively, such analysis can calculate the correlation among readouts and remove the redundant readouts, or transform the readouts to orthogonal components or factors via either principal component analysis³⁸ or factor analysis.³⁹ Each component or factor is a combination of readouts that might provide better biological interpretation. For example, Young et al. have identified six factors by analyzing 36 cytological readouts, in which factor 1 is a combination of multiple nuclear size-related readouts.³⁹ A third type of approach is to embed feature selection in classification analysis. One example is recursive feature elimination embedded with a support vector machine (SVM), in which readouts are removed one by one until the SVM overall classification accuracy is impaired. Loo et al. have applied this method and shown that only 20–40 readouts are needed out of about 300 image-based readouts.⁴⁰
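The two filtering ideas above (dropping low-variance readouts, then dropping readouts that are redundant with one already kept) can be sketched as follows. This is a minimal sketch under assumptions, not the procedure of any cited study: the function names, the readout names, the example values, and the cutoffs `min_variance` and `max_abs_corr` are all hypothetical.

```python
from statistics import mean, pvariance

def pearson(x, y):
    """Pearson correlation coefficient between two equally long lists."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def select_readouts(data, min_variance=1e-6, max_abs_corr=0.95):
    """Drop near-constant readouts, then drop any readout highly correlated
    with one that was already kept (greedy, so the result depends on order)."""
    kept = []
    for name, values in data.items():
        if pvariance(values) < min_variance:
            continue  # uninformative: almost no variation across samples
        if any(abs(pearson(values, data[k])) > max_abs_corr for k in kept):
            continue  # redundant with a readout that was already kept
        kept.append(name)
    return kept

# Hypothetical per-well readouts from an imaging assay (five wells).
data = {
    "cell_area": [100.0, 150.0, 200.0, 250.0, 300.0],
    "cell_perimeter": [36.0, 44.0, 51.0, 57.0, 62.0],
    "nuclear_intensity": [5.0, 9.0, 4.0, 8.0, 6.0],
    "background": [1.0, 1.0, 1.0, 1.0, 1.0],
}
selected = select_readouts(data)
print(selected)
```

In this toy data set the constant `background` readout fails the variance filter and `cell_perimeter` is discarded as redundant with `cell_area`. As noted above, a filter of this kind can also discard readouts that would have mattered for a novel phenotype.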

Distance and Similarity Metric

To identify samples with novel phenotypes, we often need to calculate a single distance or similarity score based on all the readouts. This score can be based on the distance or similarity between a sample and a negative control, to identify samples with significant phenotypes, or among samples, to cluster samples based on their phenotype. Various multiparametric distance or similarity measures can be used, such as Euclidean distance,^41,42 Mahalanobis distance,^43,44 and correlation distance.^45,46

Euclidean distance calculates the square root of the sum of the squared differences between two treatments for each individual readout, in which each readout is equally weighted. Mahalanobis distance is similar to Euclidean distance, except it takes into account the correlation among readouts. Thus, when correlation among readouts is expected, as is often the case in original multiparametric readouts, Mahalanobis distance is preferred. Alternatively, one can apply the feature-selection methods mentioned in this article to reduce correlation before applying Euclidean distance. The third distance method is correlation distance, which calculates the angle between two readout vectors instead of absolute distance.
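The three measures can be compared on a small example. The sketch below is illustrative only: it restricts the Mahalanobis distance to two readouts so that the covariance matrix can be inverted in closed form, and the covariance matrix and treatment profiles are hypothetical values chosen for the example.

```python
import math

def euclidean(a, b):
    """Square root of the summed squared differences, all readouts weighted equally."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def correlation_distance(a, b):
    """One minus the Pearson correlation between two readout profiles."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    den = math.sqrt(sum((x - ma) ** 2 for x in a) * sum((y - mb) ** 2 for y in b))
    return 1.0 - num / den

def mahalanobis_2d(a, b, cov):
    """Mahalanobis distance for two readouts, given their 2x2 covariance matrix."""
    (s11, s12), (s21, s22) = cov
    det = s11 * s22 - s12 * s21
    inv = [[s22 / det, -s12 / det], [-s21 / det, s11 / det]]
    d = [a[0] - b[0], a[1] - b[1]]
    q = sum(d[i] * inv[i][j] * d[j] for i in range(2) for j in range(2))
    return math.sqrt(q)

# Hypothetical covariance of two strongly correlated readouts (unit variance, r = 0.9).
cov = [[1.0, 0.9], [0.9, 1.0]]
control = (0.0, 0.0)
along = (2.0, 2.0)     # shifts both readouts together, as the correlation predicts
against = (2.0, -2.0)  # shifts the readouts in opposite directions

print(euclidean(control, along), euclidean(control, against))
print(mahalanobis_2d(control, along, cov), mahalanobis_2d(control, against, cov))
print(correlation_distance([1.0, 2.0, 3.0], [2.0, 4.0, 6.0]))
```

Both treatments are equally far from the control in Euclidean terms, but the Mahalanobis distance ranks the one that breaks the expected correlation as far more unusual. The last line shows that correlation distance ignores magnitude: a profile and its doubled copy are at a distance of essentially zero.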

Recently developed correlation methods, such as maximal information coefficient⁴⁷ and Brownian covariance,⁴⁸ which capture linear and nonlinear correlation, are also being explored in HCS data analysis.⁴⁹

It is important to select the right distance metric, because the efficiency of classification and clustering will heavily depend on it. For example, while Euclidean and Mahalanobis distances capture the magnitude of the phenotypes, correlation captures the difference among the phenotypes. Great care should be taken to select the distance that best reflects the phenomenon of interest.

Supervised Learning (Classification)

Supervised learning (classification) is suitable when there is a known set of phenotype classes (e.g., cell-cycle phases)⁵⁰ or positive and negative phenotypes. A set of rules based on the readouts (i.e., a classification model) is first learned with a training data set in which class labels are known (e.g., by manual annotation). Models can then be applied to the whole data set and predict phenotype classes. This approach can be applied not only on the cell level for cell phenotypes but also at the well level for sample phenotypes. Popular supervised methods include K nearest neighbors,⁵¹ SVMs,^34,50 linear discriminant analysis (LDA),⁵² naïve Bayesian classifier,^53,54 artificial neural networks (ANNs),⁵⁴ random forests,⁵⁵ and Markov models.⁵⁶ For definitions and detailed descriptions of these methods, readers are referred to machine-learning reviews and books.^33,37 All these methods have been used extensively in bioinformatics, and some methods are more popular than others.⁵⁷ To evaluate different supervised methods, one typical exercise is to randomly divide the annotated data set into two groups, one for training the model and the other for evaluating the model. By this cross-validation exercise, users can estimate the classification accuracy of different methods. Supervised learning has the advantage that one can predefine the relevant phenotypes, and the results are relatively more straightforward to interpret. For example, Neumann et al. predefined 16 morphological classes to assess cell-division phenotypes in an RNA interference (RNAi) screen.⁵⁰ This approach, however, is only able to detect previously known phenotypes, and its performance is unpredictable when used on novel phenotypes that were not present in the training data set, in which case one should consider unsupervised learning methods.
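The evaluation exercise described above, a random split of annotated data into a training group and an evaluation group, can be sketched with a K-nearest-neighbors classifier. This is a minimal sketch under assumptions: the two-readout samples are simulated from well-separated Gaussian distributions, and the function names, class labels, split fraction, and `k` are hypothetical choices for illustration.

```python
import math
import random
from collections import Counter

def knn_predict(train, query, k=3):
    """Majority vote among the k training samples closest to the query."""
    nearest = sorted(train, key=lambda item: math.dist(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]

def holdout_accuracy(samples, train_fraction=0.7, k=3, seed=0):
    """Randomly split annotated samples, train on one part, score on the other."""
    rng = random.Random(seed)
    shuffled = samples[:]
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_fraction)
    train, test = shuffled[:cut], shuffled[cut:]
    hits = sum(knn_predict(train, x, k) == y for x, y in test)
    return hits / len(test)

# Simulated, hypothetical annotated wells: two readouts per well, two classes.
rng = random.Random(1)
samples = [((rng.gauss(0, 1), rng.gauss(0, 1)), "negative") for _ in range(50)]
samples += [((rng.gauss(6, 1), rng.gauss(6, 1)), "positive") for _ in range(50)]

accuracy = holdout_accuracy(samples)
print(f"held-out accuracy: {accuracy:.2f}")
```

A single split gives only one estimate of accuracy; repeating the split with different seeds, or rotating the held-out group across folds, would give a more stable comparison among methods.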

Unsupervised Learning (Clustering)

Unsupervised learning methods try to cluster the data into groups based on their pairwise similarity, so that the samples present in the same group have a higher similarity to each other than to members of other clusters. These groups are then further characterized as phenotypic classes. Typical methods include hierarchical clustering,^39,45 K-means,⁵⁸ neural network,⁵⁹ and SPADE (spanning-tree progression analysis of density-normalized events).⁴ For definitions and detailed descriptions of these methods, readers are referred to machine-learning reviews and books.^33,37,60 To evaluate the performance of clustering methods, one can assess the structure of the clustering via different indexes,⁶¹ or by benchmarking against external labels such as compound structure³⁹ or gene targets.^45,62 Such a clustering approach is impartial because there are no predefined classes. For example, Bendall et al. applied SPADE to cluster cells based on their immune responses, and then projected these results onto a 2D plot to represent their relationships.⁴ Parameters such as the number of clusters, however, which are often arbitrarily defined, can substantially affect the clustering results. One caveat of such an approach is that it also requires additional efforts to understand the clusters and their biological implications.
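As an illustration of clustering by similarity, the following is a bare-bones K-means sketch. It is not the implementation used in any cited study: the function name, the random initialization, the fixed iteration count, and the example points are assumptions, and the number of clusters `k` must be supplied by the analyst, which is exactly the arbitrary parameter discussed above.

```python
import math
import random

def kmeans(points, k, iterations=20, seed=0):
    """Alternate between assigning points to the nearest centroid and
    moving each centroid to the mean of its assigned points."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # random initial centroids
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            idx = min(range(k), key=lambda i: math.dist(p, centroids[i]))
            clusters[idx].append(p)
        centroids = [
            tuple(sum(c) / len(members) for c in zip(*members)) if members else centroids[i]
            for i, members in enumerate(clusters)
        ]
    labels = [min(range(k), key=lambda i: math.dist(p, centroids[i])) for p in points]
    return centroids, labels

# Hypothetical two-readout profiles forming two well-separated groups.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, labels = kmeans(points, k=2)
print(labels)
```

The cluster indices themselves carry no meaning; only the grouping does. Running the same data with `k=3` would split one of the two groups, with nothing in the algorithm signaling that the extra cluster is not biologically relevant.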

It should be noted that the methods mentioned here are neither independent nor exclusive. Feature selection can be embedded with SVMs,⁴⁰ and dimension reduction is often needed before calculation of Mahalanobis distance.⁴⁴ Young et al. first applied factor analysis to reduce dimensions before applying hierarchical clustering,³⁹ and Fuchs et al. applied a supervised method to classify cellular phenotypes before clustering siRNA phenotypes with an unsupervised method.³⁴ Thus, different methods need to be considered and evaluated in conjunction.

Methods for Visualizing Multiparametric Data

There are multiple ways to display multiparametric data, and there is extensive literature that focuses on the mathematical and computational aspects of the problem (see Fig. 1 for examples; for reviews, see Refs. ^63,64). One of the major challenges in multiparametric visualization is how to best simplify the data while at the same time retaining the important relationships among the different samples and parameters. Namely, the challenges are to find the best possible combination of dimensions that avoids losing important details, and to assess the effectiveness of the resulting visualization. In biology, the key challenge remains in the interpretability of the display, a visual metaphor that needs to be consistent with the biological concepts it is trying to describe.

Figure 1.

Examples of multivariate visualizations. (A) A two-dimensional scatterplot; (B) a scatterplot matrix; (C) a three-dimensional scatterplot; (D) a line chart; (E) a parallel coordinate plot; (F) a radar plot; (G) a heat map; (H) a SPADE (spanning-tree progression analysis of density-normalized events) tree; and (I) a self-organizing map.

The repertoire of visual metaphors that are routinely used to present data in biology is limited and includes scatterplots, heat maps, and histograms. Most biologists are not familiar with concepts underlying other types of graphs, such as boxplots, thereby limiting their usefulness. The increasing need for multiparametric visualization, as described in this article, driven by high-content technologies, must be matched with the expectations and experience of users by enhancing familiar concepts rather than introducing new ones.

Familiar Metaphors: Scatterplots and Multiparametric Data

Scatterplots are the workhorse of scientific visualization; every scientist is familiar with them and knows how to interpret them, which makes them a powerful metaphor (Fig. 1a). When the number of dimensions is low, one can use a scatterplot matrix to display the data (Figs. 1b and 2a). As the number of dimensions increases, the number of plots required to display all pairs of dimensions grows quadratically [n(n − 1)/2 plots for n dimensions], making it impractical for all but the simplest data sets. One way to circumvent this issue is to use dimensionality-reduction methods such as principal component analysis (PCA; Figs. 1c and 2b).⁶⁵

Figure 2.

Examples of visualizations applied to high-content screening data. (A) scatterplot matrix of six parameters; (B) scatterplot matrix of first three principal components; (C) radviz projection of six parameters; and (D) heat map of six parameters.

In PCA, each data point is projected on a linear combination of dimensions so that the new dimensions (the principal components) capture as much of the variance in the data in as few dimensions as possible. Using principal components instead of original dimensions, data can be displayed as a scatterplot while retaining most of the information present in the data. In addition, the loadings of each principal component can be interpreted in terms of relations among the measures used in the experiment, confirming known interactions or leading to new biological insights.
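The projection can be made concrete with a small sketch that extracts only the first principal component by power iteration on the covariance matrix. The function name, the example data, and the choice of power iteration are assumptions made for illustration; a numerical library would normally be used, and readouts on different scales would usually be standardized first.

```python
import math

def first_principal_component(rows, iterations=200):
    """Return (loadings, scores, explained_fraction) for the first component."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    centered = [[r[j] - means[j] for j in range(p)] for r in rows]
    # Sample covariance matrix of the readouts.
    cov = [[sum(c[i] * c[j] for c in centered) / (n - 1) for j in range(p)]
           for i in range(p)]
    # Power iteration; assumes the start vector is not orthogonal to the answer.
    v = [1.0 / math.sqrt(p)] * p
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(cov[i][j] * v[j] for j in range(p)) for i in range(p))
    total_variance = sum(cov[i][i] for i in range(p))
    scores = [sum(c[j] * v[j] for j in range(p)) for c in centered]
    return v, scores, eigenvalue / total_variance

# Hypothetical data: readouts 1 and 2 rise together, readout 3 is mostly noise.
rows = [
    (1.0, 2.1, 0.5),
    (2.0, 3.9, 0.4),
    (3.0, 6.2, 0.6),
    (4.0, 7.8, 0.5),
    (5.0, 10.1, 0.4),
]
loadings, scores, explained = first_principal_component(rows)
print(loadings, explained)
```

Here the loadings are large for the two correlated readouts and close to zero for the third, which is the kind of relation among measures that the loadings are used to interpret.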

Other projection methods include multidimensional scaling (MDS) and self-organizing maps (SOMs). These methods project the data onto a 2D grid so that distance among points in 2D is comparable to distance in the multivariate space (Fig. 1i). They allow for visualization of a large number of samples, enabling identification of patterns in the data. This is especially relevant for the analysis of HCS data, in which the relationship between the dimensions being measured (usually, physical parameters) and the biology is weak, whereas the relationship among the phenotypes of the cells (positive controls, negative controls, and samples) is key to calling hits.

Recently, application of topological data analysis to multiparametric data led to the development of t-SNE (t-distributed stochastic neighbor embedding),^16,66 a method in which multiparametric data are projected in a 2D space in a way that preserves the relations among the data points. An implementation of this technique, called viSNE, has recently been used to analyze results of single-cell analysis, highlighting similarities among healthy donors and their differences from patient-derived samples.⁶⁷ Although information about the contribution of each dimension is lost, such a map could be used to analyze results of HCS data at the single-cell level, directly visualizing cells from sample wells in the context of cell populations from control wells.

Compared to PCA, these procedures require optimization and may give different results with every repeated attempt. Moreover, the relationships among dimensions are lost and cannot be used to derive new insights about the biology.

Extending the Scatterplot

Several ways to extend the scatterplot have been described that build on the familiarity that most people have with this way of presenting data. Examples in which such an approach has been taken to extend the scatterplot include Chernoff faces,⁶⁸ stick figures, and, more generally, any icon method that can be combined with geometric methods.^64,69

Another approach combines Hilbert space-filling curves (a pixel method) with scatterplots to display large fingerprints such as those generated in cheminformatics, in HCS, or during meta-analysis of high-throughput data, in the context of two dimensions.^70,71 Although these solutions allow one to display a large amount of information at once, their interpretability is limited by our perception of color, shapes, and angles.^72,73

An interesting yet seldom explored possibility is radial coordinate visualization, or radviz.⁷⁴ In radviz, each dimension is treated as an anchor, to which each data point is linked by a spring. The spring tension then corresponds to the measured intensity for that data point in a given channel. The position of each point corresponds to the point of equilibrium among all springs. Depending on the position of anchors in 2D space (typically, on a circle and evenly spaced) and on the order of the anchors (either derived from knowledge of the system or optimized using an algorithm⁷⁵), different patterns will emerge from the data that can be interpreted in terms of similarity among the data points (Fig. 2c).
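The spring metaphor has a simple closed form: with spring stiffness equal to the rescaled intensity, the equilibrium position is the intensity-weighted average of the anchor positions. The sketch below assumes evenly spaced anchors on the unit circle and min-max rescaling of each dimension; the function name and example values are hypothetical.

```python
import math

def radviz(row, minima, maxima):
    """Project one sample into 2D as the weighted average of anchor positions."""
    n = len(row)
    # Rescale each dimension to [0, 1] so that spring tensions are comparable.
    weights = [(x - lo) / (hi - lo) for x, lo, hi in zip(row, minima, maxima)]
    total = sum(weights)
    if total == 0:
        return (0.0, 0.0)  # no pull from any anchor: the point sits at the center
    # One anchor per dimension, evenly spaced on the unit circle.
    anchors = [(math.cos(2 * math.pi * i / n), math.sin(2 * math.pi * i / n))
               for i in range(n)]
    x = sum(w * ax for w, (ax, _) in zip(weights, anchors)) / total
    y = sum(w * ay for w, (_, ay) in zip(weights, anchors)) / total
    return (x, y)

# Hypothetical four-channel data, each channel ranging from 0 to 1.
minima, maxima = [0.0] * 4, [1.0] * 4
one_channel = radviz([1.0, 0.0, 0.0, 0.0], minima, maxima)
all_channels = radviz([1.0, 1.0, 1.0, 1.0], minima, maxima)
print(one_channel, all_channels)
```

A sample with signal in a single channel lands on that channel's anchor, whereas a sample with equal signal in every channel lands at the center. Samples with very different absolute intensities but the same relative profile therefore overlap, and reordering the anchors changes which profiles are pulled apart.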

Examples of radviz being used to analyze biological results exist but have been limited so far to data mining and evaluation of clustering efficiency.^76,77 Yet the potential for HCS data exists, in which one can quickly assess similarity among data points and controls, or the efficiency of hit-calling mechanisms.

Heat Maps and Line Charts

Heat maps are widely used to display the results of multivariate experiments, and are probably the most familiar method for biologists (Figs. 1g and 2d). Heat maps, in which each column represents a measure and each row represents an event, use a color scale to display the intensity of the measure in the event. To enhance the interpretability, rows and columns can be ordered using hierarchical clustering to highlight strong patterns in the data. Although heat maps can display several thousand measurements in one simple graph, they will fail when strong patterns are absent or if overlapping patterns are present in the data.

Another powerful visual metaphor is the line chart, in which each item being measured is represented by a line in an XY plot, in which dimensions are represented as categories on the x-axis and intensities are captured on the y-axis (Fig. 1d). Line charts use a single y-axis when the dimensions can be normalized to a common scale, whereas parallel-coordinate plots use a distinct y-axis when dimensions of different scale or type exist in the data (Fig. 1e). Radar plots represent a variation of the parallel coordinate plot, in which dimensions are organized radially rather than linearly (Fig. 1f). Such visualizations are very good at presenting how closely a given treatment creates, or correlates to, a desired pattern of responses and which aspects need to be modulated.

Heat maps can display almost any amount of data provided a meaningful order exists for rows and columns. Line charts, however, are rapidly limited in terms of number of events and number of dimensions; one can limit the number of events displayed at once by filtering based on similarity to a reference profile.

The clustering of dimensions in a heat map, or the reference profiles used for line charts, carries information about the biology being measured when derived from unsupervised learning. Alternatively, they can represent hypotheses being tested, such as signatures or known populations.

Visualizing Populations

Recent evidence from single cell analysis suggests there are many more transition states among cell populations in the immune system than were previously thought.⁴ This is likely to be true in other systems as well, suggesting a continuum of cellular phenotypes rather than a collection of discrete states. Yet, clustering remains an attractive way to simplify complex data sets by grouping similar objects together so that higher-order relationships can be explored.

Standard, density-based clustering approaches will tend to group these transition events with the closest population. To address this challenge, the SPADE algorithm first samples a subset of the data so that the density is the same throughout the space being explored. After the clustering step, a minimum-spanning tree algorithm is used to link clusters by proximity in multivariate space while allowing for a simple 2D visualization. Although this method allowed for the identification of several transition steps that had not been observed before, its main limitations remain the estimation of the number of clusters, which depends on the analyst’s definition of a biologically relevant population, and the fact that the minimum-spanning tree might not be an optimal solution to capture and represent relationships among populations.³⁵
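The tree-building step alone can be illustrated with a short sketch. The code below applies Prim's algorithm to a handful of cluster centroids; it does not reproduce the density-dependent sampling or the clustering performed by SPADE, and the function name, the Euclidean metric, and the centroid values are assumptions for illustration only.

```python
from math import dist


def minimum_spanning_tree(centroids):
    """Return the edges (i, j) of a minimum-spanning tree over cluster centroids."""
    if not centroids:
        return []
    in_tree = {0}  # Start growing the tree from the first centroid.
    edges = []
    while len(in_tree) < len(centroids):
        # Pick the shortest edge that connects the tree to a new centroid.
        _, i, j = min(
            (dist(centroids[i], centroids[j]), i, j)
            for i in in_tree
            for j in range(len(centroids))
            if j not in in_tree
        )
        edges.append((i, j))
        in_tree.add(j)
    return edges


# Hypothetical centroids of four clusters in a two-marker space.
centroids = [(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (1.0, 1.0)]
tree_edges = minimum_spanning_tree(centroids)
```

Because every cluster is attached through its single shortest available edge, the tree can be drawn in 2D, but it keeps only one connection per new cluster, which is one reason it may not capture every relationship among populations.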

Alternatively, one can use SOMs,⁵⁹ t-SNE,⁶⁶ or viSNE^35,67 to identify cell populations independently of any hypothesis (Fig. 1h–i). By constructing 2D projections that capture cell–cell relationships in multivariate space, these methods allow for the visual assessment of the presence of populations, their number, and their size. This comes at the expense of biological interpretability, because the position of individual cells or clusters cannot be related directly to the original measurements. Relations among populations are also usually lost, because the optimization focuses on preserving short distances.

Conclusion and Future

As illustrated in this article, a growing number of assay technologies are capable of recording multiple parameters of a biological system, and as the throughput of these methods increases, they are increasingly being used for screening applications. This report has also described some of the tools that can be applied to the analysis of the resulting data sets, as well as some of the methods used to present and visualize the results to aid their conceptualization and interpretation.

A number of challenges in this field of multiparametric evaluation of biological systems can be anticipated. First, there will be a continued need for the development (and widespread deployment) of tools that are accepted by the screening community for the visualization of such multidimensional data sets. An important outcome of such a development is that the community will start to come to a consensus on a common format for presenting such experiments. Second, one aspect missing from the vast majority of reports describing multidimensional assay readouts is the assessment and tracking of assay quality.

One of the most important contributions to HTS has been the development of the Z’ factor, a simple and easy-to-implement statistic to monitor assay quality.⁷⁸ This single measure has now been referenced >700 times since its publication (in PubMed Central alone) and has led to a number of alternatives or developments.^79,80 It should be noted, however, that all these papers spring from the original paper of Sittampalam et al., who first published methods to monitor HTS assay quality.⁸¹ The Z’ parameter has become so widely accepted because it is a simple, easy-to-calculate means of tracking assay performance. We believe that the analysis of multiparametric data sets is reaching a similar inflection point, at which common quality-control statistics need to be accepted by the screening community to allow different assays, assay methods, screening machines, and data sets to be compared in a consistent manner. A number of methods have been presented that condense multiple dimensions into a single value that can then be used to calculate a Z’ value.^44,52,82,83 Such methods all share the disadvantage that they only allow one to discern whether one compound treatment or biological state is similar to a known treatment or biological state. These measures give no indication of direction, so it is not possible to determine whether a treatment is driving the system closer to the desired state over the complete set of descriptors. The development of similarity measures that include direction (e.g., using a vector of cosine similarity) or visualization methods such as Pareto graphs or radviz may help to address this issue. The availability of such measures will then allow assay developers to monitor multiple readouts. This is important because without the ability to track assay quality, comparison of results among days, batches, or screening runs will not be possible.
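For readers less familiar with these statistics, the sketch below computes the Z’ factor in its usual single-readout form, Z’ = 1 − 3(σ_pos + σ_neg)/|μ_pos − μ_neg|, and shows one possible way to add direction to a multiparametric comparison using cosine similarity. The function `direction_score`, which compares the effect of a treatment with the effect of the positive control (both relative to the negative control), is an assumption made for illustration and is not one of the published multiparametric Z’ methods cited above.

```python
from math import sqrt
from statistics import mean, stdev


def z_prime(positive, negative):
    """Z' factor from replicate positive- and negative-control readouts."""
    separation = abs(mean(positive) - mean(negative))
    if separation == 0:
        raise ValueError("control means are identical; Z' is undefined")
    return 1 - 3 * (stdev(positive) + stdev(negative)) / separation


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors (0.0 if either has zero length)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def direction_score(treatment, negative, positive):
    """Cosine between the treatment effect and the desired (positive-control) effect.

    Values near +1 suggest the treatment moves the system toward the desired
    state, values near -1 away from it. Illustrative assumption, not a standard.
    """
    effect = [t - n for t, n in zip(treatment, negative)]
    desired = [p - n for p, n in zip(positive, negative)]
    return cosine_similarity(effect, desired)


# Hypothetical control wells for a single readout.
assay_quality = z_prime([100, 102, 98, 100], [10, 12, 8, 10])

# Hypothetical mean profiles over two descriptors.
toward = direction_score([2.0, 2.0], negative=[0.0, 0.0], positive=[1.0, 1.0])
away = direction_score([-1.0, -1.0], negative=[0.0, 0.0], positive=[1.0, 1.0])
```

A distance-based similarity would rate the two hypothetical treatments by how far they sit from the desired state only, whereas the sign of the cosine distinguishes a treatment that overshoots in the right direction from one that moves the system the wrong way.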

The third challenge that scientists will face using multiparametric data sets will be to develop methods that allow the unbiased exploration of such data sets in a way that enables the discovery of previously unknown phenotypes. This is similar to identifying groups of compounds in large collections using methods such as SVMs or neural networks. Such methods can, however, be swamped by the sheer volume of data in a screening context. One approach that has been taken to overcome this problem is first to identify those treatments that leave the assay in a state similar to the untreated (or DMSO control treated) sample and remove these samples from the subsequent analysis.⁵⁹ In essence, this approach is similar to that first presented for analysis of microarray results⁸⁴ and HTS results.⁸⁵ It may also be possible to extend concepts such as a maximum common substructure (a way to identify the maximal common feature in sets of compounds), which could be applied to identify the maximum set of common features that define a phenotype.
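The idea of removing treatments that leave the assay in a control-like state before further analysis can be sketched as follows. The per-readout z-score rule, the cutoff of 3 standard deviations, and the name `shave_inactive` are assumptions for illustration; they do not reproduce the specific procedures of the cited papers.

```python
from statistics import mean, stdev


def shave_inactive(treatments, controls, cutoff=3.0):
    """Keep treatments with at least one readout >= `cutoff` control SDs from the control mean."""
    columns = list(zip(*controls))
    centers = [mean(c) for c in columns]
    spreads = [stdev(c) for c in columns]
    active = {}
    for name, profile in treatments.items():
        # Simplification: a readout with zero control spread is ignored.
        z_scores = [
            abs(v - m) / s if s > 0 else 0.0
            for v, m, s in zip(profile, centers, spreads)
        ]
        if max(z_scores) >= cutoff:
            active[name] = profile
    return active


# Hypothetical DMSO control wells and treatment profiles (two readouts each).
dmso = [[1.0, 10.0], [1.2, 9.0], [0.8, 11.0]]
treated = {"inert": [1.1, 10.5], "hit": [3.0, 10.0]}
remaining = shave_inactive(treated, dmso)
```

Only the remaining treatments are then passed to the more expensive clustering or classification methods, which reduces the volume of data those methods have to handle.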

Finally, as multiparametric data sets become larger and start to encompass not only larger numbers of readouts and samples but also different types of data collected over time, new methods will be required. An example of this challenge was recently presented by Chen et al.;⁸⁶ in that paper, it was necessary to analyze time-series data sets using methods, such as Fourier spectral analysis, that were developed for use in other fields. More such algorithms, first developed in fields such as engineering, signal processing, and physics, will be needed for the analysis of these larger data sets.
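To give a flavor of what Fourier spectral analysis of a time series involves, the sketch below computes the amplitude spectrum of an evenly sampled readout with a direct discrete Fourier transform. It is a generic illustration and not the analysis performed by Chen et al.; the function name and the synthetic series are assumptions, and a fast Fourier transform library would be preferred for long series.

```python
import cmath
from math import pi, sin


def amplitude_spectrum(series):
    """Return DFT amplitudes for frequencies 0..n//2, in cycles per series length."""
    n = len(series)
    spectrum = []
    for k in range(n // 2 + 1):
        # Project the series onto a complex sinusoid completing k cycles.
        coeff = sum(x * cmath.exp(-2j * pi * k * t / n) for t, x in enumerate(series))
        spectrum.append(abs(coeff) / n)
    return spectrum


# Synthetic readout: 16 evenly spaced time points containing two full oscillations.
readout = [sin(2 * pi * 2 * t / 16) for t in range(16)]
spectrum = amplitude_spectrum(readout)
dominant_frequency = spectrum.index(max(spectrum))
```

The position of the largest amplitude identifies the dominant periodicity in the readout, which is the kind of feature that cannot be seen when each time point is analyzed in isolation.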

Lexicon

ANN	Artificial neural network
CyTOF	Time-of-flight mass cytometry
FACS	Fluorescence-activated cell sorting
HCS	High-content screening
HTS	High-throughput screening
ICP-MS	Inductively coupled plasma mass spectrometry
LC-MS	Liquid chromatography coupled to mass spectrometry
LDA	Linear discriminant analysis
MDS	Multidimensional scaling
MS	Mass spectrometry
PCA	Principal component analysis
RNAi	RNA interference
RPA	Reverse-phase protein array
siRNA	Small interfering RNA
SOM	Self-organizing map
SPADE	Spanning-tree progression of density-normalized events
SVM	Support vector machine
t-SNE	t-distributed stochastic neighbor embedding

Footnotes

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

References

1. Tsiper, M. V.; Sturgis; Avramova, L. V.; et al. Differential Mitochondrial Toxicity Screening and Multi-Parametric Data Analysis. PLoS One 2012, 7 (10), e45226.
2. Basford; Forraz; McGuckin. Optimized Multiparametric Immunophenotyping of Umbilical Cord Blood Cells by Flow Cytometry. Nat. Protoc. 2010, 5 (7), 1337–1346.
3. Kim, S. H.; Bang, S. H.; Park, S. A.; et al. Character Comparison of Abdomen-Derived and Eyelid-Derived Mesenchymal Stem Cells. Cell Prolif. 2013, 46 (3), 291–299.
4. Bendall, S. C.; Simonds, E. F.; Qiu; et al. Single-Cell Mass Cytometry of Differential Immune and Drug Responses Across a Human Hematopoietic Continuum. Science 2011, 332 (6030), 687–696.
5. Rauch; Gossye; Bracke; et al. An Anti-Inflammatory Selective Glucocorticoid Receptor Modulator Preserves Osteoblast Differentiation. FASEB J. 2011, 25 (4), 1323–1332.
6. Braxton; Bedilion. The Integration of Microarray Information in the Drug Development Process. Curr. Opin. Biotech. 1998, 9 (6), 643–649.
7. Jimenez-Guri; Huerta-Cepas; Cozzuto; et al. Comparative Transcriptomics of Early Dipteran Development. BMC Genom. 2013, 14, 123.
8. Scherer; Dai; Meng. Impact of Experimental Noise and Annotation Imprecision on Data Quality in Microarray Experiments. Methods Mol. Biol. 2013, 972, 155–176.
9. Peck; Crawford, E. D.; Ross, K. N.; et al. A Method for High-Throughput Gene Expression Signature Analysis. Genome Biol. 2006, 7 (7), R61.
10. Choi, N. W.; Kim; Chapin, S. C.; et al. Multiplexed Detection of mRNA Using Porosity-Tuned Hydrogel Microparticles. Anal. Chem. 2012, 84 (21), 9370–9378.
11. Flagella; Bui; Zheng; et al. A Multiplex Branched DNA Assay for Parallel Quantitative Gene Expression Profiling. Anal. Biochem. 2006, 352 (1), 50–60.
12. Metzger, D. C.; Luckenbach, J. A.; Dickey, J. T.; et al. Development of a Multiplex Gene Expression Assay for Components of the Endocrine Growth Axis in Coho Salmon. Gen. Comp. Endocrin. 2013, 189, 134–140.
13. Nigsch; Hutz; Cornett; et al. Determination of Minimal Transcriptional Signatures of Compounds for Target Prediction. EURASIP J. 2012, 2012 (1), 2.
14. Vogel; Marcotte, E. M. Insights into the Regulation of Protein Abundance from Proteomic and Transcriptomic Analyses. Nature Rev. Genetics 2012, 13 (4), 227–232.
15. Van Riper, S. K.; de Jong, E. P.; Carlis, J. V.; et al. Mass Spectrometry-Based Proteomics: Basic Principles and Emerging Technologies and Directions. Adv. Exper. Med. Biol. 2013, 990, 1–35.
16. van Oostrum; Calonder; Rechsteiner; et al. Tracing Pathway Activities with Kinase Inhibitors and Reverse Phase Protein Arrays. Proteom. Clin. Appl. 2009, 3 (4), 412–422.
17. Carragher, N. O.; Brunton, V. G.; Frame, M. C. Combining Imaging and Pathway Profiling: An Alternative Approach to Cancer Drug Discovery. Drug Disc. Today 2012, 17 (5–6), 203–214.
18. Wunderlich, M. L.; Dodge, M. E.; Dhawan, R. K.; et al. Multiplexed Fluorometric Immunoassay Testing Methodology and Troubleshooting. J. Visual. Exper. 2011, 10.3791/3715 (58).
19. Edwards, B. S.; Kuckuck, F. W.; Prossnitz, E. R.; et al. HTPS Flow Cytometry: A Novel Platform for Automated High Throughput Drug Discovery and Characterization. J. Biomolec. Screen. 2001, 6 (2), 83–90.
20. Florian, A. E.; Lepensky, C. K.; Kwon; et al. Flow Cytometry Enables a High-Throughput Homogeneous Fluorescent Antibody-Binding Assay for Cytotoxic T Cell Lytic Granule Exocytosis. J. Biomolec. Screen. 2013, 18 (4), 420–429.
21. Gregori; Patsekin; Rajwa; et al. Hyperspectral Cytometry at the Single-Cell Level Using a 32-Channel Photodetector. Cytometry Pt. A 2012, 81 (1), 35–44.
22. Bodenmiller; Zunder, E. R.; Finck; et al. Multiplexed Mass Cytometry Profiling of Cellular States Perturbed by Small-Molecule Regulators. Nature Biotech. 2012, 30 (9), 858–867.
23. Zhang; Sun; Wang. Saliva Metabolomics Opens Door to Biomarker Discovery, Disease Diagnosis, and Treatment. App. Biochem. Biotech. 2012, 168 (6), 1718–1727.
24. Zhang, A. H.; Sun; Han; et al. Ultraperformance Liquid Chromatography-Mass Spectrometry Based Comprehensive Metabolomics Combined with Pattern Recognition and Network Analysis Methods for Characterization of Metabolites and Metabolic Pathways from Biological Data Sets. Anal. Chem. 2013, 85 (15), 7606–7612.
25. Poroyko; Morowitz; Bell; et al. Diet Creates Metabolic Niches in the “Inmature Gut” That Shape Microbial Communities. Nutr. Hosp. 2011, 26 (6), 1283–1295.
26. Perlman, Z. E.; Slack, M. D.; Feng; et al. Multidimensional Drug Profiling by Automated Microscopy. Science 2004, 306 (5699), 1194–1198.
27. Feng; Mitchison, T. J.; Bender; et al. Multi-Parameter Phenotypic Profiling: Using Cellular Effects to Characterize Small-Molecule Compounds. Nat. Rev. Drug Discov. 2009, 8 (7), 567–578.
28. Snijder; Sacher; Ramo; et al. Single-Cell Analysis of Population Context Advances RNAi Screening at Multiple Levels. Molec. Sys. Biol. 2012, 8, 579.
29. Magness, S. T.; Puthoff, B. J.; Crissey, M. A.; et al. A Multi-Center Study to Standardize Reporting and Analyses of Fluorescence-Activated Cell Sorted Murine Intestinal Epithelial Cells. Am. J. Physiol. 2013, 10.1152/ajpgi.00481.2012.
30. Hahne; LeMeur; Brinkman, R. R.; et al. flowCore: A Bioconductor Package for High Throughput Flow Cytometry. BMC Bioinformatics 2009, 10.
31. Hahne; Brinkman, R. R.; et al. flowClust: A Bioconductor Package for Automated Gating of Flow Cytometry Data. BMC Bioinformatics 2009, 10, 145.
32. Ibig-Rehm; Gotte; Gabriel; et al. High-Content Screening to Distinguish between Attachment and Post-Attachment Steps of Human Cytomegalovirus Entry into Fibroblasts and Epithelial Cells. Antiviral Res. 2011, 89 (3), 246–256.
33. Hastie; Tibshirani; Friedman. The Elements of Statistical Learning; Springer: New York, 2008.
34. Fuchs; Pau; Kranz; et al. Clustering Phenotype Populations by Genome-Wide RNAi and Multiparametric Imaging. Molec. Sys. Biol. 2010, 6, 370.
35. Linderman, M. D.; Bjornson; Simonds, E. F.; et al. CytoSPADE: High-Performance Analysis and Visualization of High-Dimensional Cytometry Data. Bioinformatics 2012, 28 (18), 2400–2401.
36. Saeys; Inza; Larranaga. A Review of Feature Selection Techniques in Bioinformatics. Bioinformatics 2007, 23 (19), 2507–2517.
37. Tarca, A. L.; Carey, V. J.; Chen, X. W.; et al. Machine Learning and Its Applications to Biology. PLoS Comput. Biol. 2007, 3 (6), 953–963.
38. Tanaka; Bateman; Rauh; et al. An Unbiased Cell Morphology-Based Screen for New, Biologically Active Small Molecules. PLoS Biol. 2005, 3 (5), e128.
39. Young, D. W.; Bender; Hoyt; et al. Integrating High-Content Screening and Ligand-Target Prediction to Identify Mechanism of Action. Nature Chem. Biol. 2008, 4 (1), 59–68.
40. Loo, L. H.; L. F.; Altschuler, S. J. Image-Based Multivariate Profiling of Drug Responses from Single Cells. Nature Meth. 2007, 4 (5), 445–453.
41. Durr; Duval; Nichols; et al. Robust Hit Identification by Quality Assurance and Multivariate Data Analysis of a High-Content, Cell-Based Assay. J. Biomolec. Screen. 2007, 12 (8), 1042–1049.
42. Caie, P. D.; Walls, R. E.; Ingleston-Orme; et al. High-Content Phenotypic Profiling of Drug Response Signatures across Distinct Cancer Cells. Molec. Cancer Ther. 2010, 9 (6), 1913–1926.
43. Glory; Murphy, R. F. Automated Subcellular Location Determination and High-Throughput Microscopy. Devel. Cell 2007, 12 (1), 7–16.
44. Hutz, J. E.; Nelson; et al. The Multidimensional Perturbation Value: A Single Metric to Measure Similarity and Activity of Treatments in High-Throughput Multidimensional Screens. J. Biomolec. Screen. 2013, 18 (4), 367–377.
45. Adams, C. L.; Kutsyy; Coleman, D. A.; et al. Compound Classification Using Image-Based Cellular Phenotypes. Meth. Enzym. 2006, 414, 440–468.
46. Petrone, P. M.; Simms; Nigsch; et al. Rethinking Molecular Similarity: Comparing Compounds on the Basis of Biological Activity. ACS Chem. Biol. 2012, 7 (8), 1399–1409.
47. Reshef, D. N.; Reshef, Y. A.; Finucane, H. K.; et al. Detecting Novel Associations in Large Data Sets. Science 2011, 334 (6062), 1518–1524.
48. Szekely, G. J.; Rizzo, M. L.; Bakirov, N. K. Measuring and Testing Dependence by Correlation of Distances. Ann. Stat. 2007, 35 (6), 2769–2794.
49. Reisen; Zhang; Gabriel; et al. Benchmarking of Multivariate Similarity Measures for High-Content Screening Fingerprints in Phenotypic Drug Discovery. J. Biomolec. Screen. 2013, 18 (10), 1284–1297.
50. Neumann; Walter; Heriche, J. K.; et al. Phenotypic Profiling of the Human Genome by Time-Lapse Microscopy Reveals Cell Division Genes. Nature 2010, 464 (7289), 721–727.
51. Fix; Hodges, J. L. Discriminatory Analysis. Nonparametric Discrimination: Consistency Properties. Intl. Stat. Rev. 1989, 57 (3), 238–247.
52. Kummel; Gubler; Gehin; et al. Integration of Multiple Readouts into the Z’ Factor for Assay Quality Assessment. J. Biomolec. Screen. 2010, 15 (1), 95–101.
53. Wang; Zhou; Bradley, P. L.; et al. Cellular Phenotype Recognition for High-Content RNA Interference Genome-Wide Screening. J. Biomolec. Screen. 2008, 13 (1), 29–39.
54. Horvath; Wild; Kutay; et al. Machine Learning Improves the Precision and Robustness of High-Content Screens: Using Nonlinear Multiparametric Methods to Analyze Screening Results. J. Biomolec. Screen. 2011, 16 (9), 1059–1067.
55. Statnikov; Wang; Aliferis, C. F. A Comprehensive Comparison of Random Forests and Support Vector Machines for Microarray-Based Cancer Classification. BMC Bioinformatics 2008, 9, 319.
56. Rozowsky, J. S.; Korbel, J. O.; et al. A Supervised Hidden Markov Model Framework for Efficiently Segmenting Tiling Array Data in Transcriptional and chIP–chip Experiments: Systematically Incorporating Validated Biological Knowledge. Bioinformatics 2006, 22 (24), 3016–3024.
57. Jensen, L. J.; Bateman. The Rise and Fall of Supervised Machine Learning Techniques. Bioinformatics 2011, 27 (24), 3331–3332.
58. Wilkins, M. F.; Hardy, S. A.; Boddy; et al. Comparison of Five Clustering Algorithms to Classify Phytoplankton from Flow Cytometry Data. Cytometry 2001, 44 (3), 210–217.
59. Kummel; Selzer; Siebert; et al. Differentiation and Visualization of Diverse Cellular Phenotypic Responses in Primary High-Content Screening. J. Biomolec. Screen. 2012, 17 (6), 843–849.
60. Nugent; Meila. An Overview of Clustering Applied to Molecular Biology. Methods Molec. Biol. 2010, 620, 369–404.
61. Bezdek, J. C.; Pal, N. R. Some New Indexes of Cluster Validity. IEEE T. Syst. Man. Cy. B 1998, 28 (3), 301–315.
62. Ljosa; Caie, P. D.; Ter Horst; et al. Comparison of Methods for Image-Based Profiling of Cellular Morphological Responses to Small-Molecule Treatment. J. Biomolec. Screen. 2013, 10.1177/1087057113503553.
63. de Oliveira, M. C. F.; Levkowitz. From Visual Data Exploration to Visual Data Mining: A Survey. IEEE T. Vis. Comput. Gr. 2003, 9 (3), 378–394.
64. Keim, D. A.; Kriegel, H. P. Visualization Techniques for Mining Large Databases: A Comparison. IEEE T. Knowl. Data En. 1996, 8 (6), 923–938.
65. Pearson. On Lines and Planes of Closest Fit to Systems of Points in Space. Philosophical Magazine 1901, 2 (11), 559–572.
66. Hinton, G. E.; Roweis, S. T. Stochastic Neighbor Embedding. Adv. Neur. Info. Proc. Sys. 2002, 833–840.
67. Amir el, A. D.; Davis, K. L.; Tadmor, M. D.; et al. viSNE Enables Visualization of High Dimensional Single-Cell Data and Reveals Phenotypic Heterogeneity of Leukemia. Nature Biotech. 2013, 31 (6), 545–552.
68. Chernoff. Use of Faces to Represent Points in K-Dimensional Space Graphically. J. Am. Stat. Assoc. 1973, 68 (342), 361–368.
69. Pickett, R. M.; Grinstein. Iconographic Displays for Visualizing Multidimensional Data. Proc. IEEE Intl. Conf. Sys. Man Cyber. 1988, 1, 514–519.
70. Gehlenborg; Brazma. Visualization of Large Microarray Experiments with Space Maps. BMC Bioinformatics 2009, 10, O7.
71. Anders. Visualization of Genomic Data with the Hilbert Curve. Bioinformatics 2009, 25 (10), 1231–1235.
72. Duncan. Selective Attention and the Organization of Visual Information. J. Exper. Psych. Gen. 1984, 113 (4), 501–517.
73. Cleveland, W. S.; McGill. Graphical Perception: Theory, Experimentation, and Application to the Development of Graphical Methods. J. Am. Stat. Assoc. 1984, 79 (387), 531–554.
74. Hoffman; Grinstein; Pinkey. Dimensional Anchors: A Graphic Primitive for Multidimensional Multivariate Information Visualizations. In Proceedings of the 1999 Workshop on New Paradigms in Information Visualization and Manipulation in Conjunction with the Eighth ACM International Conference on Information and Knowledge Management, Kansas City, MO, Nov 2–6, 1999; Ebert, D. S.; Shaw, C. D., Eds.; ACM: New York, 1999; pp 9–16.
75. Di Caro; Frias-Martinez. Analyzing the Role of Dimension Arrangement for Data Visualization in Radviz. In Proceedings of the 14th Pacific-Asia Conference on Advances in Knowledge Discovery and Data Mining, Hyderabad, India, June 21–24, 2010; Zaki, M. J.; J. X.; Ravindran, Eds.; Springer: New York, 2010; pp 125–132.
76. McCarthy, J. F.; Marx, K. A.; Hoffman, P. E.; et al. Applications of Machine Learning and High-Dimensional Visualization in Cancer Detection, Diagnosis, and Management. Ann. NY Acad. Sci. 2004, 1020, 239–262.
77. Sharko; Grinstein; Marx, K. A. Vectorized Radviz and Its Application to Multiple Cluster Datasets. IEEE T. Vis. Comput. Gr. 2008, 14 (6), 1444–1451.
78. Zhang, J. H.; Chung, T. D. Y.; Oldenburg, K. R. A Simple Statistical Parameter for Use in Evaluation and Validation of High Throughput Screening Assays. J. Biomolec. Screen. 1999, 4 (2), 67–73.
79. Shun, T. Y.; Lazo, J. S.; Sharlow, E. R.; et al. Identifying Actives from HTS Data Sets: Practical Approaches for the Selection of an Appropriate HTS Data-Processing Method and Quality Control Review. J. Biomolec. Screen. 2011, 16 (1), 1–14.
80. Iversen, P. W.; Eastwood, B. J.; Sittampalam, G. S.; et al. A Comparison of Assay Performance Measures in Screening Assays: Signal Window, Z’ Factor, and Assay Variability Ratio. J. Biomolec. Screen. 2006, 11 (3), 247–252.
81. Sittampalam, G. S.; Iversen, P. W.; Boadt, J. A.; et al. Design of Signal Windows in High Throughput Screening Assays for Drug Discovery. J. Biomolec. Screen. 1997, 2 (3), 159–169.
82. Mazur; Kozak. Z’ Factor including siRNA Design Quality Parameter in RNAi Screening Experiments. RNA Biol. 2012, 9 (5), 624–632.
83. Kozak; Csucs. Kernelized Z’ Factor in Multiparametric Screening Technology. RNA Biol. 2010, 7 (5), 615–620.
84. Hastie; Tibshirani; Eisen, M. B.; et al. "Gene Shaving" as a Method for Identifying Distinct Sets of Genes with Similar Expression Patterns. Genome Biol. 2000, 1 (2), RESEARCH0003.
85. Schreyer, S. K.; Parker, C. N.; Maggiora, G. M. Data Shaving: A Focused Screening Approach. J. Chem. Info. Comp. Sci. 2004, 44 (2), 470–479.
86. Chen; Mias, G. I.; Li-Pook-Than; et al. Personal Omics Profiling Reveals Dynamic Molecular and Medical Phenotypes. Cell 2012, 148 (6), 1293–1307.