Rapid Assessment and Visualization of Normality in High-Content and Other Cell-Level Data and Its Impact on the Interpretation of Experimental Results

Abstract

When investigators monitor effects on a population of cells following a perturbation, these events rarely occur in a classical normal (or Gaussian) distribution. A normal distribution is, however, implicitly assumed for events within a single well whenever mean values per well are used as an assay metric and, more generally, in measures of assay robustness such as the Z’ score and the V factor. For many technologies, the distribution within a well cannot be examined; high-content screening (HCS), however, measures events in individual cells before averaging them over the well, so the cell-level measurements can be analyzed separately. This study quantifies the extent of nonnormality in experimental samples and its effects on determining the EC₅₀ of a test compound and on the assay robustness statistics. The results, based on five sets of publicly available data, indicate that the Z’ or V-factor score can be improved by as much as 0.44 over standard calculations, and the EC₅₀ of a dose–response curve can be lowered by as much as fivefold, when nonparametric methods are used, although not all data sets show a significant improvement. The effect on analysis depends in part on whether the greatest shift from normality occurs in the upper or lower range of the dose–response curve.

Keywords

cell-based assays high-content screening image analysis statistical analyses cancer and cancer drugs

Introduction

The advent of image-based screening has greatly expanded the range of cell-based assays available to researchers in academia and drug development.¹ These assays include screens for novel phenotypes and workhorse assays for determining compound potency.^2–4 They operate through image analysis of single cells to quantify cellular responses; however, the cell-level data are usually averaged over the whole well, frequently with additional measures, such as the standard deviation for the cells in the well. The relevance of such measures rests on the assumption that the cellular responses are normally distributed, so that the mean and standard deviation are accurate summaries of the cellular responses for a given sample. In cases in which the distribution of the sample populations has been examined, however, it has frequently been observed that the data are not normally distributed.^5–7 This is largely because cells respond to a perturbation across a wide range of values or even as subpopulations.⁸ Instead of recognizing this as a property of cells in culture, nonhomogeneous responses are usually ascribed to outlier cells that are distinct and not relevant to the general population of cells, and their effects are minimized through the use of statistical measures that are less sensitive to rare events. The most common approach is to use the median value as the response for a well, together with the corresponding measure of dispersion, the median absolute deviation (MAD).^9,10 Yet the differences from normality can be severe, and in such cases robust statistical measures may not be entirely sufficient and nonparametric statistical tests may be required.^11,12
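
As a small illustration of why the median and MAD are less sensitive to rare events than the mean and standard deviation, the following Python sketch summarizes one hypothetical well. The cell values and the helper name mad are invented for illustration; they are not taken from the data sets or scripts used in this study.

```python
from statistics import mean, median, stdev

def mad(values):
    """Median absolute deviation: the median of |x - median(x)|."""
    center = median(values)
    return median([abs(v - center) for v in values])

# Hypothetical per-cell ratios for one well; the last cell is an extreme value.
cells = [1.0, 1.1, 0.9, 1.2, 1.0, 0.8, 1.1, 9.0]

# The mean and SD are pulled upward by the single extreme cell,
# whereas the median and MAD stay close to the bulk of the cells.
print("mean, SD:   ", mean(cells), stdev(cells))
print("median, MAD:", median(cells), mad(cells))
```

Neither summary describes the shape of the distribution, which is the limitation the remainder of this article addresses.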

High-content screening (HCS)—or, more generally, image-based screening—represents a richer data set than other well-based measurements such as fluorescence- or luminescence-based enzyme assays that are also performed in microtitre plates.^1–4,13,14 Events occurring within a well are interpreted through image analysis as events per cell. The assay metric can be a highly specialized signaling event, such as the phosphorylation of a kinase substrate or the translocation of a receptor or transcription factor. These events may be immediately proximal to the target being manipulated, creating a specific cell-based corollary to a biochemical assay. Controls for the assay can be equally specific and performed as a multiplexed assay, such as a similar protein that should not be affected by the perturbation if it affects the intended target selectively. Last, and most relevant for this study, HCS is uniquely different from these alternative assay technologies, because HCS reports values for a well after measuring events on the individual cells within each well. As such, the measurements for all the cells in a well are reduced to a single (average) number for each well. HCS captures additional statistics for the well, such as the standard deviation of the cellular measurements for each well. The central caveat to this approach is that the mean and standard deviation are statistical measures of normally distributed data, and no assessment is made of the extent to which the data are actually normally distributed. In point of fact, cells in culture are highly heterogeneous; this heterogeneity is an intrinsic and reproducible property of cells in culture and has the potential to affect the interpretation of experimental results.^6,7

Operationally, image-based screening (or HCS) is concerned with the identification of cellular perturbations, such as screening for potential therapeutics, and typically relies on cellular measurements that are averaged for the well (“well-level” measurements), but the underlying “cell-level” measurements can also be examined directly, for example by comparing histograms. Although used infrequently, cell-level measurements are calculated during image analysis and can be saved or exported for study (DNA content as a measure of cell cycle distribution is one exception, in which cell-level data are used routinely). Studying experiments at the cell level allows for confirmation that the average value is appropriate (which requires that the data be normally distributed) and provides the basis for using methods appropriate to nonnormally distributed data when nonnormality is observed. Heterogeneity of cellular populations can occur when discrete subpopulations are present in a sample (such as the presence of stem cell–like tumor progenitor cells), but the transition from a normally distributed sample to the presence of a true subpopulation is not static, and highly skewed distributions can be functionally heterogeneous, even if they are genetically homogeneous.¹⁵ Furthermore, if the skewing changes among samples within the course of an experiment, some statistical tests and assumptions will be incorrect and potentially misleading. In such cases, even if the causes of cellular heterogeneity are not understood, a high level of skewing and changes in the extent of skewing can lead to heterogeneity of variance among samples, or heteroskedasticity.

The principal challenge for a scientist who wishes to understand the distributions that underlie well-level data and whether there is a potential for misinterpretation based on unusual distribution patterns is to rapidly identify such cases as a part of day-to-day activities. Furthermore, the measurement of the deviation from normality needs to be placed in the context of how selecting a particular method of analysis affects experimental conclusions. What are the practical consequences of a lack of normality, and, specifically, would it cause candidate compounds to be ranked differently, or would it cause small interfering RNAs (siRNAs) for a gene to be scored inappropriately, affecting the number of false-positive or false-negative genes reported in a screen? Are there cases in which some skewing in the sample is the result of the inclusion of a minor population of biologically irrelevant outliers and can be disregarded (as is the case when median values for a well are used instead of the mean), or does the lack of normality indicate that the population is truly heterogeneous, and the use of mean or median values fails to account for biologically relevant complexity? The methods outlined in this article allow the rapid assessment of plate-based assays for normality, map the degree of normality by well (allowing a quick visualization of where the lack of normality is greatest), and quantify the impact of using nonparametric statistical tests on assay metrics. Using these methods, a set of publicly available image sets, representing a wide variety of experiment types, has been evaluated.

A secondary challenge is to learn how to equate cellular heterogeneity observed in culture with the role of cellular heterogeneity in developmental and disease biology. Efforts to reduce cellular heterogeneity can affect the experimental system itself (such as interactions between the mechanism of cell-cycle replication block, for the purpose of synchronizing a population of cells, and the perturbation under study). Instead, cell-level approaches have been used to study signal transduction in native populations through flow cytometry and HCS.^7,16,17 Such approaches reduce the experimental manipulations that can interfere with the experiment. In addition, and perhaps more to the point, cellular heterogeneity is an intrinsic part of many biological processes, including maintenance of the pluripotent state of a colony of stem cells and the growth of some tumor types such as glioblastomas, which develop subpopulations of cells expressing different receptor tyrosine kinases and interact through paracrine-signaling pathways.^18,19

Methods are described in this article that provide such an assessment, and can do so in a rapid and flexible manner. Included in the calculations are measures of mean and median and additional measures of normality, including three goodness-of-fit tests. The methods described here are written to be readily usable by researchers with very little programming experience. The routines expect data to be entered in a microtitre-plate format (with well locations) but can accept any layout of the wells, including partial plates. Metadata that assign wells to doses or other treatments (such as RNA interference (RNAi) reagents) can be added to analyses through an additional file. The routines are provided as source code and can be incorporated into other data-processing routines.

Materials and Methods

Experimental Data Sets

All data sets analyzed in this study are publicly available through the Broad Bioimage Benchmarking Collection²⁰ (BBBC; http://www.broadinstitute.org/bbbc/), including both the images and the metadata explaining the experiments. The image-analysis algorithms are included in these files and specify how the images were corrected for background and other image aberrations, intensities calculated, and features recorded. Image sets were donated and uploaded after being reviewed for image-capture quality, including minimal background aberrations and no saturation. Background problems are frequent but do not necessarily confound image analysis, because there are many methods to correct them, and in fact many algorithms include a background-correction step routinely. Saturation can affect image analysis as well. The dynamic range is instrument specific but is at least 4000:1 (and is approaching 25,000:1), so the dynamic range of the assay itself is typically not the problem; instead, saturation typically occurs when the image-capture settings are based on negative controls or a limited set of perturbations. When saturation is significant, population dynamics such as those explored in this study can be highly compromised, and, indeed, any experimental study will suffer. For population studies, the presence of saturation would be readily detected through an overrepresentation of maximal values (e.g., a strong spike for values in the highest bin in a histogram). Additional technical issues are discussed at the BBBC site, where the image sets can be obtained.

Statistical Analysis Using the R Programming and Visualization Environment

The R programming language was downloaded from one of several mirror sites that provide the language free of charge. Sites are listed at the Comprehensive R Archive Network (http://www.cran.r-project.org).²¹ Several supplemental packages are required for these studies, and they are also located at the mirror sites. These packages are nortest,²² moments,²³ drc,²⁴ ggplot2,²⁵ and gridExtra.²⁶ The analytical and visualization functions they provide are listed in the Supplemental Materials. Statistical tests used to develop alternative metrics (such as the Kolmogorov–Smirnov (KS) test as a measure of response) were based on comparisons of a sample well against the aggregate of cells in the control wells.
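
The comparison of a sample well against the pooled control cells can be sketched as follows. This is an illustrative Python version using only the standard library, not the R code used in this study; the function name ks_statistic, the well names, and the cell values are all hypothetical.

```python
from bisect import bisect_right

def ks_statistic(sample, reference):
    """Two-sample KS statistic: the largest vertical gap between two empirical CDFs."""
    a = sorted(sample)
    b = sorted(reference)
    d = 0.0
    # Both empirical CDFs are step functions that change only at observed
    # values, so it is enough to evaluate the gap at every observed value.
    for x in a + b:
        fa = bisect_right(a, x) / len(a)
        fb = bisect_right(b, x) / len(b)
        d = max(d, abs(fa - fb))
    return d

# Hypothetical cell-level values for two negative-control wells and one treated well.
control_wells = {
    "A01": [0.9, 1.0, 1.1, 1.0, 1.2],
    "B01": [1.0, 0.8, 1.1, 0.9, 1.0],
}
treated_well = [1.4, 1.9, 1.1, 2.3, 1.7]

# Pool all control cells, then score the treated well against the pool.
pooled_controls = [v for cells in control_wells.values() for v in cells]
print(ks_statistic(treated_well, pooled_controls))
```

The statistic ranges from 0 (identical empirical distributions) to 1 (no overlap), so it can serve as a per-well response value without assuming a distribution shape.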

Results

Cellular Properties Quantified by HCS Are Nonnormally Distributed

This study looks at sets of publicly available image-based screening data for the purpose of extracting generalizable trends and observations regarding the analysis of image-based cellular assays. Such benchmarking studies have limited use to the general scientific audience if the data or methods are difficult to obtain. Because of this, Ilya Ravkin and members of the Broad Institute Imaging Platform initiated a repository of image sets for image-analysis benchmarking studies that are intended to be used for evaluating imaging applications and methods.²⁰ These data sets make benchmarking studies much more transparent, because they provide a common set of experiments that can be studied by researchers striving to improve data-analysis methods. Data sets used in this study are described in Supplementary Table S1. For these data sets, a single assay metric is used to measure cellular responses to perturbations such as Wortmannin to block phosphatidylinositol-3 kinase (PI3K) signaling to forkhead homologue (FKHR) in rhabdomyosarcoma cells, tumor necrosis factor alpha (TNFα) to induce nuclear localization of nuclear factor kappa B (NF-κB), and isoproterenol to induce puncta formation of β-arrestin. For the experiments measuring FKHR or NF-κB translocation, the amount of the transcription factor in the nucleus is divided by the amount in the cytoplasm and reported as a ratio. Puncta formation is measured through the texture of the β-arrestin–GFP (green fluorescent protein) signal after treatment. The terms used to identify these assay measurements (metrics) are specific to the image-analysis application and algorithm used and can be adjusted during image analysis. For the data sets discussed in this study, the translocation of the FKHR–GFP fusion protein is reported as Math_Ratio1, the translocation of NF-κB is reported as logratio, and puncta formation of the β-arrestin–GFP is reported as Texture_Gabor CorrGFP_1.

In addition to the data sets, robust image-analysis algorithms (or pipelines) are provided that correct for image and cellular aberration and quantify assay responses with a variety of candidate metrics. Investigators can deconstruct these example materials to better understand the image-analysis process. Common practice for HCS data analysis is to quantify a cellular response as an average for all the cells in the well; this study accesses the original cell-level data that are used to derive the well-level summaries. The process is outlined in Supplementary Figure S1.

Plotting a sample distribution against what the distribution would be if it followed a normal distribution is one of the easiest and most powerful methods for assessing normality. The plot, known as a quantile–quantile (or Q–Q) plot, is used routinely to evaluate data. A quantile is any regular interval of the data range, frequently a percentile but also a quartile or the median (e.g., the 50th percentile or the second quartile). Nonnormality is readily observed as a deviation of the points from the reference line. In many kinds of data, deviations from normality are small, and the points lie along the diagonal. To initiate the characterization of HCS data, wells from various experiments were examined using Q–Q plots. Examples are shown in Figure 1. In these examples, no well of data is truly normally distributed (which occurs when all points lie on the line), although in some cases the data are nearly so. The majority of data from an HCS experiment are, however, grossly nonnormally distributed, as the other examples in Figure 1 show. In the cases in which there is significant discordance with a normal distribution, using a parametric statistical test, or even using the mean as the value reported for that well, may be inappropriate.
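
The coordinates of a Q–Q plot can be computed in a few lines. The sketch below is illustrative Python rather than the R routine used for Figure 1; the function name qq_points and the cell values are hypothetical, and the choice to compare against a normal distribution with the sample's own mean and standard deviation (so that the reference line is y = x) is an assumption of this sketch.

```python
from statistics import NormalDist, mean, stdev

def qq_points(values):
    """Return (theoretical, sample) quantile pairs for a normal Q-Q plot."""
    ordered = sorted(values)
    n = len(ordered)
    # Assumption of this sketch: the reference normal has the same mean and SD
    # as the sample, so perfectly normal data would fall on the line y = x.
    fitted = NormalDist(mean(ordered), stdev(ordered))
    # Plotting positions (i + 0.5) / n keep probabilities strictly between 0 and 1.
    theoretical = [fitted.inv_cdf((i + 0.5) / n) for i in range(n)]
    return list(zip(theoretical, ordered))

# Hypothetical cell-level values for one well, with a long right tail.
well = [0.8, 0.9, 0.9, 1.0, 1.0, 1.1, 1.2, 1.4, 2.0, 4.5]

# Print expected value, observed value, and the deviation from the line.
for expected, observed in qq_points(well):
    print(f"{expected:6.2f} {observed:6.2f} {observed - expected:+6.2f}")
```

Points far above the line at the high end, as for the last cell here, correspond to the upward curvature of a right-skewed sample.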

Figure 1.

Quantile–quantile (Q–Q) plots of cell-level high-content screening (HCS) data. Data from individual wells of several HCS experiments are shown. For each data point, the actual value reported (the sample quantile, on the y-axis) is plotted against the value expected at that rank if the data followed a normal distribution. Because both axes run from low to high values, the data are effectively rank ordered. In a normal distribution, the majority of the data points will fall near the mean, or midpoint, of the axes. If a sample does not follow a normal distribution, many of the points will fall off the reference line (the points would lie along this diagonal if the data were normally distributed). Data shown in the panels are the relevant metric for the indicated well for each experiment, as listed in Supplementary Table S1 and described in the “Results” section. They are the values for nuclear localization of nuclear factor kappa B (NF-κB), forkhead homologue in rhabdomyosarcoma (FKHR), or the granularity of β-arrestin, depending on the experiment.

Moreover, the patterns of nonnormality in HCS data are varied. Some samples show upward curvature in both the low and the high values, indicating a rightward skew in the data (well E10 of the BBBC–iCyte experiment), whereas others show a downward curvature at the tail, indicating a leftward skew (well A02 in the BBBC–Vitra data set is a particularly strong example). The extent of nonnormality also depends on the feature or metric being examined for a well of cells. Figure 1 shows two metrics for well G08 from the BBBC–Bioimage data set. The algorithm, developed by the Broad Institute Imaging Platform, measures the translocation of the FOXO transcription factor in cells treated with PI3K inhibitors, and two of its metrics that can be used to evaluate the effects of the compounds are Math_Ratio1 and Math_Ratio2. These metrics are inversely related measures of the extent of nuclear localization.

Statistical Characterization of Nonnormality in HCS Data

Having noted the lack of normality in HCS data when examined in individual wells, several questions follow: How widespread is this lack of normality, how severe is it, is it affected by experimental or control treatments, and does it have a material effect on experimental results? To begin evaluating these questions, the data sets described above were all evaluated using several well-recognized measures of normality. Three statistical tests for normality are commonly used: the Kolmogorov–Smirnov goodness-of-fit test (KS-GoF; essentially, a KS test that compares a sample to a normal distribution), the Anderson–Darling goodness-of-fit test (AD-GoF), and the D’Agostino–Pearson omnibus test (DP-O) for normality. The KS-GoF test was the first widely used test for normality, but the latter two are preferred by statisticians: the AD-GoF test is more sensitive to differences in the tails of a distribution, and the DP-O test is a composite of measures of skewness and kurtosis.²⁷ In this study, the DP-O test was not performed directly, but skewness and kurtosis were evaluated individually. To accomplish this, scripts were written in the R statistical programming language.²¹ The language is well developed and includes supplemental packages that extend its power. Several of these packages were used in the scripts, providing the goodness-of-fit tests as well as graphical routines for presenting the results. A complete list of packages and routines used in this project is detailed in the Supplementary Information.
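
To make these measures concrete, the following Python sketch computes moment-based skewness and kurtosis and a KS goodness-of-fit distance for one well. It is an illustration under stated assumptions, not the R implementation used in this study: the function names and cell values are hypothetical, kurtosis is reported on the Pearson scale (3 for a normal distribution), and the KS distance is measured against a normal distribution fitted with the sample mean and standard deviation. No p values are computed here.

```python
from statistics import NormalDist, mean, stdev

def central_moment(values, k):
    """k-th central moment, dividing by n."""
    mu = mean(values)
    return sum((v - mu) ** k for v in values) / len(values)

def skewness(values):
    """Moment-based skewness; 0 for a symmetric distribution."""
    return central_moment(values, 3) / central_moment(values, 2) ** 1.5

def kurtosis(values):
    """Pearson kurtosis; 3 for a normal distribution."""
    return central_moment(values, 4) / central_moment(values, 2) ** 2

def ks_gof(values):
    """KS distance between the empirical CDF and a normal with the sample mean and SD."""
    xs = sorted(values)
    n = len(xs)
    fitted = NormalDist(mean(xs), stdev(xs))
    d = 0.0
    for i, x in enumerate(xs):
        f = fitted.cdf(x)
        # Check the gap just after and just before each step of the empirical CDF.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d

# Hypothetical cell-level values with a long right tail.
well = [0.8, 0.9, 0.9, 1.0, 1.0, 1.1, 1.2, 1.4, 2.0, 4.5]
print(skewness(well), kurtosis(well), ks_gof(well))
```

A positive skewness together with kurtosis well above 3 is the pattern expected for a well with a small number of cells at very high values.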

These tests are very powerful, and, as indicated in Figure 1, all samples show some degree of nonnormality. Although some samples are highly nonnormal, not all samples would be classified as such according to either the KS-GoF or AD-GoF test. Examples from several of the data sets are shown in Supplementary Table S2. By these tests, statistically significant deviations from normality are observed in at least some of the samples in all of the experiments. The AD-GoF test tends to show higher p values, is more sensitive to the tails of the distributions, and may be affected more by the frequent occurrence of outliers in HCS data than the KS-GoF test. Both tests do identify samples as nonnormal, in some cases strongly so. This is in large part because the tests are typically applied to sample sizes of 40–200, whereas HCS data sets can range up to thousands of cells per sample; with samples this large, the tests readily return statistically significant differences from normality by standard criteria (typically, p = 0.05 or 0.01). In the case of HCS data, the p values can be many orders of magnitude smaller than these thresholds (more significantly different from normal). The extent of nonnormality varies strongly with dose within an experiment, which by itself limits some methods of analysis that depend on homoskedasticity and affects decisions such as how to treat error measurements or whether data can be transformed to generate a Gaussian (normal) distribution.

Because a lack of normality appears to be a common occurrence and, in at least some cases, an alternative method of analysis may be more statistically appropriate, there is a need to analyze HCS experiments for normality quickly. Toward this end, a method for characterizing normality within samples in a plate-based assay has been developed. The wellstats script generates two files. The first is a table, referred to as the wellstats table, that reports the parameters listed above for every well of the experiment (KS-GoF, AD-GoF, skewness, kurtosis, and the p values for the GoF tests). Supplementary Table S2 includes example wells from several of the experiments as evaluated by the script. The wellstats table contains much of the same information as the dosestats table, which is discussed later, but at the well level. As such, the table can run up to 384 rows, which can be unwieldy for some purposes, but the data can be plotted in other applications. This file is saved in .csv format. The second output from the script is a graphical representation of the platemap showing the extent of nonnormality for each well. Examples of the platemap graphics are shown in Figure 2. The figure presents a graphic depiction of the experimental conditions for two of the studies, the BBBC–Bioimage and BBBC–Vitra experiments, showing the layout of the cell lines and treatment conditions, including the location of the control wells. The platemaps generated by the wellstats script are shown alongside the experimental graphics. Each platemap shows the response metric (nuclear translocation of FOXO3A/FKHR, or NF-κB) as the area of the circle for each well and the extent of nonnormality as its color. Scales for each are shown. In this figure, the AD-GoF statistic is used to quantify normality, but any measure reported in the wellstats table could be used.

When viewed over the entire plate, strong patterns of increased deviation from normality can be observed in each of the plates. In particular, nonnormality is concentrated in the lower or upper half of the dose–response curve, depending on the experiment. In the top experiment (BBBC–Bioimage), increased nonnormality is observed in the lower doses of the dose–response curves for both Wortmannin and LY294002. In the BBBC–Vitra translocation experiment, increased nonnormality occurs in the higher concentrations of the dose–response curve. In addition, nonnormality is more pronounced in MCF7 cells than in A498 cells.
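
The structure of a per-well table of this kind can be sketched as follows. This is a simplified Python illustration, not the wellstats script itself: the function names, column names, well identifiers, and cell values are hypothetical, and only a subset of the statistics is computed. The row letter and column number are split out because a platemap graphic needs them as plotting coordinates.

```python
import csv
import io
from statistics import mean, median

def skewness(values):
    """Moment-based skewness (central moments divided by n)."""
    mu = mean(values)
    m2 = sum((v - mu) ** 2 for v in values) / len(values)
    m3 = sum((v - mu) ** 3 for v in values) / len(values)
    return m3 / m2 ** 1.5

def wellstats_rows(cells_by_well):
    """Build one summary row per well from a {well_id: [cell values]} mapping."""
    rows = []
    for well in sorted(cells_by_well):
        cells = cells_by_well[well]
        rows.append({
            "well": well,
            "row": well[0],            # plate row letter, e.g. "A"
            "column": int(well[1:]),   # plate column number, e.g. 2
            "n_cells": len(cells),
            "mean": mean(cells),
            "median": median(cells),
            "skewness": skewness(cells),
        })
    return rows

def to_csv(rows):
    """Serialize the rows as CSV text (written to memory here rather than to a file)."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0]))
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()

# Hypothetical partial plate with two wells.
plate = {
    "B02": [1.0, 1.1, 0.9, 1.2, 3.5],
    "A01": [1.0, 0.9, 1.1, 1.0, 1.0],
}
print(to_csv(wellstats_rows(plate)))
```

Because the input is a mapping keyed by well identifier, partial plates need no special handling; wells that are absent simply produce no row.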

Figure 2.

Assessment of normality in high-content screening (HCS) data sets. (A) Graphical platemap of the Broad Bioimage Benchmarking Collection (BBBC)–Bioimage experiment. U2OS cells were treated with Wortmannin or LY294002 in increasing doses in quadruplicate wells; increasing doses are shown as right triangles. Positive and negative controls are indicated; coloring is intended to highlight the organization of dose–response curves for the two treatments. (B) Well-level reports for the assay metric (nuclear localization of FOXO-3a), indicated by the size of the spot, and the degree of nonnormality, indicated by the color of the spot. Normality was determined using the Anderson–Darling test for normality. (C) Graphical platemap for the BBBC–Vitra experiment. MCF7 and A498 cells were treated with tumor necrosis factor alpha (TNFα) in quadruplicate wells. Coloring is meant to indicate that one treatment condition is applied to two cell types. (D) Results of the response metric (nuclear factor kappa B (NF-κB) localization) and of the Anderson–Darling test, as described in this article. Assay response is based on a comparison of each well with the negative controls, with greater differences shown as larger spot diameters. Note that the plate data are identified with an early tracking name for the data sets.

The dependence of nonnormality on dose in the dose–response assays was examined further. For these experiments, the relationship between skewing and kurtosis was evaluated. Results are plotted in Figure 3. Two patterns can be observed at the lowest doses (Fig. 3A): some samples show a positive skew, whereas others show little skew. As concentrations increase, all samples show a shift to more negative skewing, either reducing the initial positive skew back to normal or making a shift to negative skewing in the samples that were initially neutral. Kurtosis is strongly correlated with skew: it increases as skewing becomes more pronounced in the cases in which skewing was initially neutral, and decreases as skewing decreases in the samples that were initially positively skewed, as seen in Figure 3B. The relationship between skewing and kurtosis as a function of dose is further compared in Figure 3C. Skew and kurtosis are related, so it is not surprising that a function between the two is observed in the figure, but what is also observed is the extent to which samples track along this relationship as a function of increasing dose, from positive skewing, to relative normality, to negative skew. Examples of skewing and kurtosis are shown in the remaining panels of the figure. Figure 3D shows the distribution of values for well G04 of the BBBC–Bioimage experiment (plotted on the right side of the skewness–kurtosis plot in Fig. 3C). This was a sample of U2OS cells treated with a low dose of the PI3K inhibitor LY294002, and it shows positive skewing in the analysis described above. Figure 3E shows well F12 of the BBBC–Vitra experiment, in this case A498 cells at the lowest concentration of TNFα. As noted previously, this sample shows a distribution that is much closer to normal than other HCS data. The last example is well A02 of the BBBC–Vitra experiment, in which negative skewing can be seen in the distribution plot shown in Figure 3F. These MCF7 cells were treated with the highest concentration of TNFα and show a high level of negative skewing.

Figure 3.

Skewing and kurtosis as functions of assay treatments. (A) Skewness for each dose in each experiment is plotted by experiment and colored by dose; according to the scale at the right of the figure, coloring is by rank order of the dose by each experiment. (B) Kurtosis for the same samples using the same scaling. (C) Plotting skewness by kurtosis in the experiments. Experiments are plotted by symbol; dose is by color to highlight the common trend of increasing dose on skewness and kurtosis among all experiments examined. (D) Example of positive (rightward) skewing. (E) Example of low skewing and kurtosis. (F) Example of negative (leftward) skewing. Note that assessments are made on all the values for each well; distributions in (D)–(F) are positioned to highlight the overall shape (in some cases, showing the complete distribution compresses the shape of the curve to include a very small number of extreme values).

The effect of the skewing in the distribution curves on the measurement of the well-level responses is examined in more detail in Figure 4. In Figure 4A, the dose–response relationship is presented for one row of the MCF7 cells in the BBBC–Vitra experiment (top row of the panel) and for the BBBC–Bioimage experiment (second row of the panel). In the case of the BBBC–Bioimage data, a sharp peak in the distribution at the initial doses becomes a flatter peak as the doses increase, with some detectable heterogeneity.

Figure 4.

Distributions of actual treatment samples and their corresponding normal distributions, based on parametric values. (A) A density plot showing the continuous distribution of the localization of nuclear factor kappa B (NF-κB) per cell for MCF7 cells in the BBBC–Vitra experiment. In essence, this is a smoothed histogram in which the values of the localization metric are along the x-axis and the frequency of cells reporting these values is shown on the y-axis, for each well in the top row of the plate. In this experiment, cells treated with the highest concentration of tumor necrosis factor (TNF) are in column 1, and doses decrease across the plate. (B) Data for each dose of the MCF7 cells in the BBBC–Vitra experiment are shown as density plots in red, plotted by dose (feature data for replicate wells were pooled). For each distribution, the mean and standard deviation were calculated, and a normal distribution having this mean and standard deviation was modeled, plotted in blue, and overlaid on the density plot for the actual data.

In Figure 4B, density plots for the four replicates of the TNFα-treated MCF7 cells are plotted in red. For each dose, the well-level mean and standard deviation were determined. Taking these parameters, normal distributions were calculated, and these were plotted as density plots in blue. Thus, in each panel, the actual distribution of cells can be compared to a corresponding normal distribution. This is not an idealized normal curve but a simulated data table; rerunning the procedure will generate a new data set that is normally distributed around the same mean and standard deviation. The assumed normal distribution is able to track the change in NF-κB localization as the dose of TNFα increases, but it misses much of the complexity in the distributions and generally underreports the degree of change, because the normal distribution lies to the right of the majority of the sample values (including the mode) of the actual distribution at the lower concentrations and to the left at the higher concentrations. Thus, at a minimum, using measurements that assume a normal distribution reduces the assay window in this experiment (the extent of the change caused by the treatment). In addition, some heterogeneity may be present, and the compound may not have a uniform effect on the cells, as can be observed in the doses near the middle of the dose–response curve, particularly 1×10⁻¹⁰ M and 3×10⁻⁹ M TNFα, in which strong shoulders or possible multiple peaks can be observed. Whether these represent stable subpopulations can be determined through clustering.²⁸ The mean of the distribution reports only on the center of the peak, not its shape, meaning that although the assay can be quantified as a move in the mean value as the dose increases, the mean does not fully report on the events at the cellular level, whereas a nonparametric assessment would be better able to capture the complexity of the response.
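
The procedure of modeling a normal distribution from a sample's mean and standard deviation can be sketched as follows. This is an illustrative Python version, not the R code behind Figure 4B; the function name matched_normal_sample is hypothetical, and the "cells" are simulated right-skewed values rather than real HCS measurements.

```python
import random
from statistics import mean, median, stdev

def matched_normal_sample(values, rng):
    """Simulate a normal sample with the same size, mean, and SD as `values`."""
    mu, sigma = mean(values), stdev(values)
    return [rng.gauss(mu, sigma) for _ in values]

rng = random.Random(0)  # fixed seed so that the sketch is repeatable

# Stand-in for a right-skewed well: 2000 simulated lognormal "cells".
cells = [rng.lognormvariate(0.0, 0.75) for _ in range(2000)]
model = matched_normal_sample(cells, rng)

# In the skewed sample the mean sits to the right of the median (and the mode);
# in the matched normal sample the mean and median nearly coincide.
print("cells: mean", mean(cells), "median", median(cells))
print("model: mean", mean(model), "median", median(model))
```

Each call with a different random state yields a new simulated table around the same mean and standard deviation, as described above.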

Effects of Nonparametric Methods on the Z-Prime and V-Factor Screening Statistical Benchmarks and EC₅₀ Measurements

Because nonnormality is greatest near the EC₅₀ dose, but is concentrated on either the lower or the upper side of the midpoint, it may affect the assay statistics, including determination of the EC₅₀ itself. The extent to which this is true was examined. Parametric and nonparametric methods were used to calculate the response of cells in culture to various perturbations. To make these assessments, a second script was written to extend the normality analysis described in Figure 2 to incorporate dosing information and to measure the effects of nonnormality on standard parameters for assay performance and on EC₅₀ calculations. It takes the same feature data used in the first script and generates a new table that reports normality measurements for doses and replicates, and a plot that compares the dose–response curves for the data when evaluated by well means, t test, and KS statistic; the latter two compare an individual well with the pooled samples in the negative-control wells. In addition, the graphical output reports the Z’, V factor, and EC₅₀ for each of these methods. Together, these outputs make it possible to examine, for each experiment, the extent of nonnormality and its effect without significant data manipulation. To provide a comprehensive assessment, a platemap and a plot of the skewness-to-kurtosis relationship are also included. The complete analysis is presented as a dashboard; an example is shown in Figure 5. The platemap described above is included because most experiments are based on a platemap design that is meaningful to the experimenter. In addition, the metadata provide a context for the experiment that can be used to assess the effect of nonnormality on the experiment. This includes mapping the doses to the skewness–kurtosis plot, which helps to relate the extent of nonnormality to the treatment dose.

The effects of the parametric and nonparametric measures on the dose–response curve are shown in Figure 5. The graph illustrates that there is a pronounced effect on the dose–response curve when the data are analyzed by either the KS test or the t test. In deciding which method to select, additional data, such as the Z’ and V factor, can also be considered, which is why these data are included in the dashboard. The script also generates a dosestats table; an example is shown in Supplementary Table S3. The data allow a comparison of the standard assay metrics with measures of normality and alternative methods for calculating the experimental events. The analysis of HCS data by cell-level statistics can be particularly relevant in cases in which anomalous responses occur, such as changes in cell shape that may occur with some compounds. In these cases, sensitivity to events at the cell level, with the capacity to review the images themselves, provides options to an investigator that may not always be available with nonimaging approaches.
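For orientation, the two robustness statistics reported in the dashboard can be sketched as follows. This is a minimal Python illustration that assumes the commonly cited definitions, Z’ = 1 − 3(σ_pos + σ_neg)/|μ_pos − μ_neg| and V = 1 − 6·mean(σ_dose)/|μ_max − μ_min|; whether the study's R scripts implement exactly these forms is an assumption. The function names and the replicate values are hypothetical. The per-well inputs can be well means, t statistics, or KS statistics, which is how the same statistic can be reported for each method.

```python
import statistics

def z_prime(pos, neg):
    """Z' from per-well values of the positive and negative controls."""
    spread = statistics.stdev(pos) + statistics.stdev(neg)
    window = abs(statistics.mean(pos) - statistics.mean(neg))
    return 1.0 - 3.0 * spread / window

def v_factor(replicates_by_dose):
    """V factor from per-well values grouped by dose (one list per dose)."""
    sds = [statistics.stdev(r) for r in replicates_by_dose]
    means = [statistics.mean(r) for r in replicates_by_dose]
    window = abs(max(means) - min(means))
    return 1.0 - 6.0 * statistics.mean(sds) / window

# Hypothetical per-well values for four replicate wells at three doses.
neg = [1.0, 1.1, 0.9, 1.0]
mid = [1.5, 1.6, 1.4, 1.5]
pos = [2.0, 2.1, 1.9, 2.0]

print("Z':", round(z_prime(pos, neg), 3))
print("V factor:", round(v_factor([neg, mid, pos]), 3))
```

In this constructed example every dose has the same replicate spread, so the two statistics coincide; they diverge when the variability at intermediate doses differs from that of the controls, which is the situation the V factor is intended to capture.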

Figure 5.

Example dashboard showing analysis of a data set for deviation from normality in the cell-level data. Because this figure is of a dashboard created by the analysis software, it is not a figure of individual panels but is a single figure generated as output. Elements are, from right to left and from top to bottom: (1) The platemap graphic generated identically to that from the wellstats script. Data will be for all wells of the plate, even if the metadata specify a partial plate. (2) The skew–kurtosis plot, similar to the plot in Figure 3, but restricted to the samples and doses specified by the metadata. Color is scaled by relative dose, as indicated to the right of the panel. (3) A table with calculations of the Z’, V factor, and EC₅₀ as calculated by the standard mean (well-level average value), a KS test, and a t test. (4) A dose–response plot of the data as measured by well-level average, KS test, and t test. Shading represents the standard deviation for each dose as determined by replicate wells.

The standard well-based measurement, the mean value for the well, was compared to both a parametric and a nonparametric comparison of each sample to the control wells. In this case, the KS test was compared to the parametric t test. Although many analytical approaches use the well-mean values for the relevant feature or endpoint, the t test is also used in some cases.²⁹ For a nonparametric test, the KS statistic is probably the most commonly used in HCS, although the Wilcoxon–Mann–Whitney U test is more directly analogous to the t test. The Wilcoxon–Mann–Whitney U test is a cumulative measure of two samples, whereas the KS statistic is a measure of proportional difference between two scaled distributions. The cumulative nature of the Wilcoxon–Mann–Whitney U test makes it difficult to use for high-content data, because the number of cell-level comparisons is very high (from hundreds to tens of thousands). The U test is also less linear than the KS test, particularly at the midranges of the dose–response curves, and is therefore less useful as a metric. In contrast, the KS test returns a value between 0 and 1, based on the point of greatest divergence between the two populations, which is a much easier metric to use and understand. As with the t test, the p values for both nonparametric tests can also serve as metrics if log transformed. Standard deviations are calculated for the differences among replicate wells for each analysis and are plotted in the dose–response curves as shaded regions.
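The difference in scale between the two nonparametric statistics can be seen in a short sketch. The Python functions below (`ks_statistic` and `mann_whitney_u` are hypothetical names, not functions from the study's scripts) compute the two-sample KS statistic as the largest gap between the empirical cumulative distributions, and the U statistic as a count of sample–control pairs, with ties counted as one half. The input values are invented for illustration.

```python
from bisect import bisect_left, bisect_right

def ks_statistic(sample, control):
    """Two-sample KS statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(sample), sorted(control)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both CDFs past every value equal to x (handles ties).
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d

def mann_whitney_u(sample, control):
    """U statistic: number of (sample, control) pairs with sample > control."""
    c = sorted(control)
    u = 0.0
    for s in sample:
        below = bisect_left(c, s)
        ties = bisect_right(c, s) - below
        u += below + 0.5 * ties
    return u

# Hypothetical cell-level values for one treated well and the pooled controls.
control = [1, 2, 3, 4]
treated = [3, 4, 5, 6]

print("KS statistic:", ks_statistic(treated, control))
print("U statistic:", mann_whitney_u(treated, control),
      "of a possible", len(treated) * len(control))
```

The KS statistic is bounded by 0 and 1 regardless of the number of cells, whereas the upper bound of U is the product of the two sample sizes, so its raw scale changes from well to well as cell counts vary.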

The results for these tests on all of the data sets used in this study are summarized in Table 1. As can be seen in the table, effects are experiment specific. For this reason, this study does not conclude that a specific change in experimental analysis is called for; instead, the exploratory nature of these methods can help an investigator decide whether such modifications are warranted for a given experiment. In some cases, the effects were negligible, such as the iCyte experiment, for which the samples were largely normal among all of the treatment conditions and there were essentially no changes in EC₅₀, Z’, and V factor. In contrast, changes in Z’ and V factor can be large, exceeding 0.3 (the maximum value for these tests is 1.0), and the effect on EC₅₀ can exceed fivefold. The final determination (particularly for the EC₅₀) should be made by taking the data that have been exported to a .csv file, recalculating with the standard laboratory software, and substituting the KS test values or other available measurements as desired, for reasons discussed below.

Table 1.

Screening Statistics for Parametric and Nonparametric Analyses.

Experiment	Feature	Z’	V Factor	EC₅₀	Test
SBSBio(Wort)	MR1¹	0.682	0.606	5.71	KS test
		0.607	0.619	10.68	Well-mean test
		0.124	0.426	6.137	t test
	MR2²	0.878	0.745	5.564	KS test
		0.759	0.702	10.5	Well-mean test
		0.348	0.613	7.18	t test
SBSBio(LY)	MR1	0.682	0.442	2.658	KS test
		0.607	0.401	3.675	Well-mean test
		0.124	0.392	3.365	t test
	MR2	0.878	0.682	2.376	KS test
		0.759	0.686	2.922	Well-mean test
		0.492	0.425	3.258	t test
iCyte		0.855	0.799	4.071	KS test
		0.851	0.789	4.195	Well-mean test
		0.874	0.806	5.282	t test
Vitra A498		0.467	0.53	1.148	KS test
		0.15	0.395	6.967	Well-mean test
		0.245	0.34	1.784	t test
Vitra MCF7		0.74	0.758	1.934	KS test
		0.507	0.645	1.819	Well-mean test
		0.311	0.461	0.048	t test

KS, Kolmogorov–Smirnov.

¹MR1, Math_Ratio1.

²MR2, Math_Ratio2.
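The size of the shifts in Table 1 can be checked with simple arithmetic. The Python sketch below copies the KS-test and well-mean rows for two experiments from the table and computes the gain in Z’ and V factor and the fold change in EC₅₀; the dictionary layout and the `compare` helper are hypothetical conveniences, not part of the study's scripts.

```python
# (Z', V factor, EC50) copied from Table 1 for two experiments.
table1 = {
    "Vitra A498": {
        "KS test": (0.467, 0.53, 1.148),
        "Well-mean test": (0.15, 0.395, 6.967),
    },
    "iCyte": {
        "KS test": (0.855, 0.799, 4.071),
        "Well-mean test": (0.851, 0.789, 4.195),
    },
}

def compare(rows):
    """Gain in Z' and V factor, and fold change in EC50, for KS vs. well mean."""
    z_ks, v_ks, ec_ks = rows["KS test"]
    z_wm, v_wm, ec_wm = rows["Well-mean test"]
    return {
        "z_gain": z_ks - z_wm,
        "v_gain": v_ks - v_wm,
        "ec50_fold": ec_wm / ec_ks,
    }

for name, rows in table1.items():
    result = compare(rows)
    print(name, {key: round(value, 3) for key, value in result.items()})
```

For the Vitra A498 data, the KS test raises Z’ by about 0.32 and lowers the EC₅₀ about sixfold relative to the well mean, whereas for the iCyte data both changes are close to zero.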

One aspect of the dashboard that is important to remember is that it is an aggregate of separate programs. The normality tests, skewness, and kurtosis are described above; the latter two are calculated after pooling the cell-level measurements for replicate wells at each dose. The well-level mean values, t test, and KS test values are for each well and dose; standard deviations are calculated at the well level for each method. From these measurements, the dose–response curves are plotted, including the error measurements. The positive- and negative-control samples are evaluated individually through a comparison with the pooled negative-control samples; these individual values can also be added to the plot or the table in the dashboard. The EC₅₀ calculations are derived from a separate routine and therefore have a potential to be discordant with a visual estimation of the EC₅₀ from the plots. EC₅₀ calculations are notoriously difficult to determine from raw data, in which noise and a failure to completely plateau at the maximal dose have strong effects on the EC₅₀ measurement.³⁰ In standard studies of compound effects, a failure to reach a definitive plateau in an experiment may trigger a revised experimental design in which the compound treatment doses would increase until such a plateau was reached. In this case, the EC₅₀ calculation is based on the presumption that the maximal dose has been reached, and therefore it could be considered a “forced” EC₅₀ calculation. The decision to handle the data this way is based on the idea that these procedures are exploratory in nature, designed to provide an indication of whether additional studies are warranted.
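The consequence of a "forced" EC₅₀ can be illustrated with a deliberately simple stand-in for a curve fit. The Python sketch below is an assumption made for illustration and is not the routine used by the dashboard: it treats the response at the highest tested dose as the plateau and finds the half-maximal dose by linear interpolation on the log-dose scale. The function name `forced_ec50`, the doses, and the responses are all hypothetical.

```python
import math

def forced_ec50(doses, responses):
    """Interpolated EC50, taking the lowest and highest doses as the plateaus.

    `doses` must be in ascending order; `responses` holds one value per dose.
    Returns None if the half-maximal response is never crossed.
    """
    bottom, top = responses[0], responses[-1]
    half = bottom + 0.5 * (top - bottom)
    for k in range(len(doses) - 1):
        r0, r1 = responses[k], responses[k + 1]
        if r0 != r1 and (r0 - half) * (r1 - half) <= 0:
            frac = (half - r0) / (r1 - r0)
            log_lo, log_hi = math.log10(doses[k]), math.log10(doses[k + 1])
            return 10 ** (log_lo + frac * (log_hi - log_lo))
    return None

# Hypothetical dose series (molar) and responses.
doses = [1e-9, 1e-8, 1e-7, 1e-6]
responses = [0.0, 0.2, 0.8, 1.0]

full = forced_ec50(doses, responses)
# Same experiment, but stopped one dose early, before the plateau is reached.
truncated = forced_ec50(doses[:-1], responses[:-1])

print("EC50 with plateau reached:", full)
print("EC50 forced from truncated series:", truncated)
```

When the series stops before the true plateau, the assumed maximum is too low and the estimated EC₅₀ shifts toward lower doses, which is one reason the calculated value can disagree with a visual reading of the plotted curve.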

Software Tools for the Rapid Assessment of Normality in HCS Feature Data

Progress in understanding the effects of the cell-level distribution on experimental conclusions and downstream decisions rests largely on the ability of a scientist to evaluate the data quickly. The methods described here have been written in the form of R scripts that perform the analyses and generate tables and plots to enable a typical researcher to rapidly evaluate data from their experiment. With these data, the experimenter can decide whether a departure from normality materially affects the conclusion of an experiment. The experimental metric can be either a feature value taken from the cell-level data file (typically, a .csv file) or a calculated value, such as a ratio of two feature values or a transformed feature; in either case, it is loaded into R as a two-column table. A guide for formatting data is included in the Supplemental Information.
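As a rough illustration of the two-column input, the Python sketch below reads a table of well identifiers and metric values and groups the cell-level values by well. The column names (`well`, `value`), the well labels, and the numbers are assumptions; the required format for the study's R scripts is the one given in the Supplemental Information, which is not reproduced here.

```python
import csv
import io
import statistics
from collections import defaultdict

def load_cell_table(handle):
    """Read a two-column table (well ID, metric value) into {well: [values]}."""
    per_well = defaultdict(list)
    reader = csv.reader(handle)
    next(reader)  # skip the header row
    for well, value in reader:
        per_well[well].append(float(value))
    return dict(per_well)

# Hypothetical file contents: one row per cell.
example = io.StringIO(
    "well,value\n"
    "A01,1.92\n"
    "A01,2.10\n"
    "A01,1.75\n"
    "A02,1.10\n"
    "A02,0.95\n"
)

wells = load_cell_table(example)
for well, values in sorted(wells.items()):
    print(well, len(values), round(statistics.mean(values), 3))
```

Keeping the cell-level values grouped by well, rather than reducing each well to a mean at load time, is what allows the later well-versus-control comparisons to be made on whole distributions.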

In evaluating high-content data, it is common to examine several features as candidate assay metrics. For example, total nuclear intensity could be compared to the nuclear-cytoplasmic ratio and difference, giving three options for measuring the extent of a nuclear translocation. Phosphorylation of a protein could be evaluated by average intensity per cell or total intensity per cell. In all cases, there are biological reasons why one feature may be more appropriate as well as reasons why multiple features may be redundant for an experiment (such as average and total intensity values for cells that do not change size or shape during the course of the experiment). An additional script is included here that allows the evaluation of four features simultaneously. The script generates wellstats tables for each feature and a single platemap plot for the four features. From this analysis, the feature to be chosen as an assay metric can be further evaluated by the dosestats script. The source code for the dosestats script is provided in the Supplementary Information. It can be modified directly, including extraction of the wellstats platemap. Additional scripts may be obtained by direct inquiry with the author or through GitHub (https://github.com/shaney314/normality-scripts/). Additional examples include formatting cell-level normality analysis for RNAi screening and a set of platemaps for multiple features, for cases in which alternative features are being considered (such as average intensity and total intensity per object).

Discussion

Heterogeneity of cells in culture is generally well appreciated, particularly for cancer cell lines, most of which are highly aneuploid. Such cellular heterogeneity is, however, typically viewed as a population of outliers and is either ignored outright or suppressed by taking the median value when a population-based method such as HCS is used. Usually, regression toward the mean is cited as the principle for doing so, with the frequent observation that small samples may show nonnormality, but that as sample size increases, the population assumes a normal distribution. To limit the impact of this sampling-based appearance of nonnormality, the median is used instead of the mean.

These results suggest that nonnormality is an accurate reflection of cellular heterogeneity and is evident in many experiments. Although not the case for all experiments, statistical heterogeneity, particularly variability in the variance of the population, can affect experimental conclusions. For experiments exhibiting severe nonnormality, using a nonparametric test is a demonstrably more sensitive analytical approach. In such cases, the first population of cells that respond to a compound may be a robust indication of critical compound properties, such as the ability of the compound to penetrate cells and hit the target. In contrast, the mean may imply a state that is not represented at the cell level at all. Toriello and coworkers³¹ showed that GAPDH mRNA levels in cells treated with an siRNA were strikingly bimodal; expression levels in 50 cells analyzed at the single-cell level were either 50% or 0% of control. The arithmetic mean of 22% of control did not actually occur in any cell. Sisan et al.⁶ showed that GFP expression is bimodal across a 300-fold range and that cells sorted for high or low GFP expression and followed in culture would eventually give rise to the original distribution, leading to the conclusion that such a bimodal distribution is a true property of the cells in culture and is not captured by the average value for the population.
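The arithmetic behind the bimodal example is easy to reproduce. In the Python sketch below, the split of 22 cells at 50% and 28 cells at 0% of control is inferred from the reported figures (50 cells, two expression states, a mean of 22%) and is an assumption rather than a count taken from the cited study.

```python
import statistics

# Inferred split: 22 cells at 50% of control and 28 cells at 0% of control.
cells = [50.0] * 22 + [0.0] * 28

mean_level = statistics.mean(cells)
median_level = statistics.median(cells)
responding = sum(1 for c in cells if c > 0) / len(cells)

print("mean (% of control):", mean_level)
print("median (% of control):", median_level)
print("fraction of cells at 50%:", responding)
print("any cell at the mean value?", mean_level in cells)
```

Under this assumed split, the mean describes a state that no cell occupies, and the median reports only the larger of the two groups; neither single value conveys that the population is divided into two states.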

The data routines have focused on the development of a single dashboard as a vehicle for exploratory data analysis. Dashboard visualizations continue to grow in importance in the era of “Big Data,” as multiple sets of data need to be reviewed together before a conclusion can be made. For cell-level data generated in HCS experiments, the effect of the analytical method should be considered, and the current version of the dashboard ties assessments of normality to standard experimental measurements, including the EC₅₀, with visualizations of how deviations from normality vary with treatment dose. From these data, a decision can be made on whether the standard well-level measurements based on average cellular responses are appropriate or whether an alternative approach should be considered. One thing that is clear from the analyses performed in this study is that this is an experiment-by-experiment decision, because some experiments show little (or at least inconsequential) deviation from normality, whereas others can show up to a 10-fold change in EC₅₀ value and increased Z’ or V-factor scores. Because of this, the ability to review experiments quickly, and to do so without being encumbered by a fixed experimental design, is important.

Heterogeneity has been shown to be biologically relevant, and the appreciation of its importance in developmental and disease biology is increasing. For example, aggressive glioblastoma is fueled by a mixture of cells expressing multiple receptor tyrosine kinases in a mixed or mosaic pattern that includes EGFR, PDGFRA, KIT, MET, and VEGFRA.^18,19 Cellular heterogeneity is also essential for pluripotency and differentiation of stem cell populations,³² as well as the stratification of liver hepatocytes into gluconeogenic and glycogenolytic subtypes through regulated levels of Wnt signaling.³³ Well-level methods that fail to differentiate effects on essential subpopulations miss opportunities to understand the effects of candidate therapeutics. Clearly, as these studies progress, a specific set of disease cells may come to define the critical population, and methods that can focus on events in subpopulations will differentiate durable therapies from other efforts. HCS is well positioned to focus on such subpopulations and cellular heterogeneity in general. In addition to methods of image analysis that can query subpopulations defined by cytological specifications, refined data analysis, such as the methods described here, can focus an experiment on subpopulations defined by response thresholds. In this regard, a certain irony has emerged, in which efforts to study cells have focused on methods to reduce heterogeneity, even when studying diseases in which heterogeneity is an important part of the biology of the system. Methods that account for heterogeneity may in fact provide better models for natural biological processes, including disease progression.

The question of when heterogeneity in a sample can be considered as an aggregate of individual subclasses is one that will require discussion. Currently, methods exist to define many different classes for a cell line.^8,34 These result when multiple proteins that show broad independent distributions are sorted (two proteins that independently show high, medium, and low levels can be considered to stratify a cell line into nine subpopulations). Comparing this work to that of Sisan⁶ suggests that these states could persist throughout several generations. Even cases in which relaxation back to the pretreatment state occurs more quickly can have a material effect on therapeutic response.³⁵ This has been discussed by Sorger and colleagues,³⁶ who have observed that heterogeneity of cells within a single cell type can result in a survivor fraction, a transient state that can give rise to increased resistance to the perturbation (TRAIL, in this case) but does not define a fundamentally resistant population and is one that will reset its heterogeneity, much as the GFP level did in the Sisan experiments.⁶ This can create a significant disconnect between an assay response at the well level (such as caspase activity) and a treatment effect in the biological context, because the survivor population represents persistence rather than resistance at the cellular level but can nevertheless be a significant cause of treatment failure. This definition is explicitly empirical. Treatment with a therapeutic targeted to mitotic cells for 1 h, followed by washout, will show a high survival fraction, whereas application of the treatment for 72 h should result in nearly complete killing in culture, increasing the effective homogeneity of the cells in culture. The patterns of protein-level adjustments observed by Plant and Sorger suggest that some therapeutics will show incomplete effects even for longer duration treatment cycles.

The potential for these varied and heterogeneous responses can be important for comparing candidate targets or therapeutics. The methods described here can help to identify these behaviors in routine experiments and can alert the experimenter to events that may require more direct follow-up.

Footnotes

Acknowledgements

Lin T. Guey is thanked for several helpful discussions on statistical approaches to high-content data. David Logan provided helpful background on the BBBC image collection and analyses, and Anne Carpenter gave helpful comments on the manuscript; their efforts are also very much appreciated.

Declaration of Conflicting Interests

The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The author received no financial support for the research, authorship, and/or publication of this article.

Supplementary material for this article is available on the Journal of Biomolecular Screening Web site.

References

1. Giuliano, K. A.; DeBiasio, R. L.; Dunlay, R. T.; et al. High-Content Screening: A New Approach to Easing Key Bottlenecks in the Drug Discovery Process. J. Biomol. Screen. 1997, 2, 249.

2. Inglese. Measuring Biological Responses with Automated Microscopy. In Methods in Enzymology; Academic Press: New York, 2006; pp. 348–363.

3. Taylor, D. L.; Haskins, J. R.; Giuliano. High Content Screening: A Powerful Approach to Systems Cell Biology and Drug Discovery; Humana Press: New York, 2006.

4. Haney, S. A. High Content Screening: Science, Techniques and Applications; John Wiley and Sons: Hoboken, NJ, 2008; p. 391.

5. Burchfield, J. G.; Fazakerley, D. J.; et al. Novel Systems for Dynamically Assessing Insulin Action in Live Cells Reveals Heterogeneity in the Insulin Response. Traffic 2013, 14, 259–273.

6. Sisan, D. R.; Halter; Hubbard, J. B.; et al. Predicting Rates of Cell State Change Caused by Stochastic Fluctuations Using a Data-Driven Landscape Model. Proc. Natl. Acad. Sci. 2012, 109, 19262–19267.

7. LaPan; Zhang; Pan; et al. Single Cell Cytometry of Protein Function in RNAi Treated Cells and in Native Populations. BMC Cell Biol. 2008, 9, e43.

8. Slack, M. D.; Martinez, E. D.; Wu, L. F.; et al. Characterizing Heterogeneous Cellular Responses to Perturbations. Proc. Natl. Acad. Sci. 2008, 105, 19306–19311.

9. Malo; Hanley, J. A.; Cerquozzi; et al. Statistical Practice in High-Throughput Screening Data Analysis. Nature Biotech. 2006, 24, 167–175.

10. Chung; Zhang, X. D.; Kreamer; et al. Median Absolute Deviation to Improve Hit Selection for Genome-Scale RNAi Screens. J. Biomol. Screen. 2008, 13, 149–158.

11. Hastie; Tibshirani; Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed.; Springer-Verlag: New York, 2008; p. 763.

12. Hill; LaPan; et al. Analysis of Multiparametric High-Content Data. In High Content Screening: Science, Techniques and Applications; Haney, Ed.; Wiley Interscience: Hoboken, NJ, 2008; pp. 329–353.

13. Buchser; Collins; Garyantes; et al. Assay Development Guidelines for Image-Based High Content Screening, High Content Analysis and High Content Imaging. In Assay Guidance Manual; Sittampalam, G. S.; Gal-Edd; Arkin; et al., Eds.; National Institutes of Health: Bethesda, MD, 2012.

14. Bray, M. A.; Carpenter, A. E. Advanced Assay Development Guidelines for Image-Based High Content Screening and Analysis. In Assay Guidance Manual; Sittampalam, G. S.; Gal-Edd; Arkin; et al., Eds.; National Institutes of Health: Bethesda, MD, 2012.

15. Spencer, S. L.; Gaudet; Albeck, J. G.; et al. Non-Genetic Origins of Cell-to-Cell Variability in TRAIL-Induced Apoptosis. Nature 2009, 459, 428–432.

16. Barteneva, N. S.; Fasler-Kan; Vorobjev, I. A. Imaging Flow Cytometry: Coping with Heterogeneity in Biological Systems. J. Histochem. Cytochem. 2012, 60, 723–733.

17. Krutzik, P. O.; Irish, J. M.; Nolan, G. P.; et al. Analysis of Protein Phosphorylation and Cellular Signaling Events by Flow Cytometry: Techniques and Clinical Applications. Clin. Immunol. 2004, 110, 206–221.

18. Little, S. E.; Popov; Jury; et al. Receptor Tyrosine Kinase Genes Amplified in Glioblastoma Exhibit a Mutual Exclusivity in Variable Proportions Reflective of Individual Tumor Heterogeneity. Cancer Res. 2012, 72, 1614–1620.

19. Snuderl; Fazlollahi; Le, L. P.; et al. Mosaic Amplification of Multiple Receptor Tyrosine Kinase Genes in Glioblastoma. Cancer Cell 2011, 20, 810–817.

20. Ljosa; Sokolnicki, K. L.; Carpenter. Annotated High-Throughput Microscopy Image Sets for Validation. Nature Meth. 2012, 9, 637.

21. R Development Core Team. R: A Language and Environment for Statistical Computing; R Foundation for Statistical Computing: Vienna, 2012.

22. Gross; Ligges. nortest: Tests for Normality [Online]; 2012. http://cran.r-project.org/web/packages/nortest/ (accessed Feb 26, 2014).

23. Komsta; Novomestky. Moments: Moments, Cumulants, Skewness, Kurtosis and Related Tests [Online]; 2012. http://cran.r-project.org/web/packages/moments/ (accessed Feb 26, 2014).

24. Ritz; Streibig, J. C. Bioassay Analysis Using R [Online]; 2005. http://ftp.uni-bayreuth.de/math/statlib/R/CRAN/doc/vignettes/drc/drc.pdf (accessed Feb 26, 2014).

25. Wickham. ggplot2: Elegant Graphics for Data Analysis; Springer: New York, 2009.

26. Baptiste. gridExtra: Functions in Grid Graphics [Online]; 2012. http://cran.r-project.org/web/packages/gridExtra/ (accessed Feb 26, 2014).

27. D’Agostino, R. B.; Stephens, M. A. Goodness-of-Fit Techniques; Dekker: New York, 1986; p. 576.

28. Han; Kamber; Pei. Data Mining: Concepts and Techniques, 3rd ed.; Morgan Kaufmann: Burlington, MA, 2011.

29. Zhang, X. D. Illustration of SSMD, Z Score, SSMD*, Z* Score, and t Statistic for Hit Selection in RNAi High-Throughput Screens. J. Biomol. Screen. 2011, 16, 775–785.

30. Lutz; Kenakin. Quantitative Molecular Pharmacology and Informatics in Drug Discovery; John Wiley and Sons: Hoboken, NJ, 2000; p. 428.

31. Toriello, N. M. Integrated Microfluidic Bioprocessor for Single-Cell Gene Expression Analysis. Proc. Natl. Acad. Sci. 2008, 105, 20173–20178.

32. Wennekamp; Hiiragi. Stochastic Processes in the Development of Pluripotency in Vivo. Biotech J. 2012, 7, 737–744.

33. Si-Tayeb; Lemaigre, F. P.; Duncan, S. A. Organogenesis and Development of the Liver. Devel. Cell 2010, 18, 175–189.

34. Weber; Fernández-Cachón, M. L.; Nascimento, J. M.; et al. Label-Free Detection of Neuronal Differentiation in Cell Populations Using High-Throughput Live-Cell Imaging of PC12 Cells. PLoS One 2013, 8, e56690.

35. Flusberg, D. A.; Roux; Spencer, S. L.; et al. Cells Surviving Fractional Killing by TRAIL Exhibit Transient but Sustainable Resistance and Inflammatory Phenotypes. Mol. Biol. Cell 2013, 24, 2186–2200.

36. Niepel; Spencer, S. L.; Sorger, P. K. Non-Genetic Cell-to-Cell Variability and the Consequences for Pharmacology. Curr. Opin. Chem. Biol. 2009, 13, 556–561.

