Factors Affecting Reproducibility between Genome-Scale siRNA-Based Screens

Abstract

RNA interference-based screening is a powerful new genomic technology that addresses gene function en masse. To evaluate factors influencing hit list composition and reproducibility, the authors performed 2 identically designed small interfering RNA (siRNA)–based, whole-genome screens for host factors supporting yellow fever virus infection. These screens represent 2 separate experiments completed 5 months apart and allow the direct assessment of the reproducibility of a given siRNA technology when performed in the same environment. Candidate hit lists generated by sum rank, median absolute deviation, z-score, and strictly standardized mean difference were compared within and between whole-genome screens. Application of these analysis methodologies within a single screening data set using a fixed threshold equivalent to a p-value ≤0.001 resulted in hit lists ranging from 82 to 1140 members and highlighted the tremendous impact analysis methodology has on hit list composition. Intra- and interscreen reproducibility was significantly influenced by the analysis methodology and ranged from 32% to 99%. This study also highlighted the power of testing at least 2 independent siRNAs for each gene product in primary screens. To facilitate validation, the authors conclude by suggesting methods to reduce false discovery at the primary screening stage. In this study, they present the first comprehensive comparison of multiple analysis strategies and demonstrate the impact of the analysis methodology on the composition of the “hit list.” Therefore, they propose that the entire data set derived from functional genome-scale screens, especially if publicly funded, should be made available as is done with data derived from gene expression and genome-wide association studies.

Keywords

RNA interference analysis, RNAi screen analysis, siRNA, RNAi, siRNA screening, sum rank, median absolute deviation, strictly standardized mean difference, genome-wide, whole genome, comparison, overlap, hit list

Introduction

The development of high-throughput systems for the large-scale application of RNA interference-based assays swiftly followed the discovery of RNA interference (RNAi) in eukaryotic systems.^1,2 Dissemination of genome-scale libraries using RNAi throughout the research community has driven the interrogation of the myriad of biological pathways resulting in many salient discoveries previously impossible. The progression of RNAi technology necessitates continual evaluation of the methodology to identify valuable targets.

The human genome can be interrogated by expressed short-hairpin RNAs (shRNA) or transfected small interfering RNAs (siRNA).^3,4 In this study, we focus on synthetic siRNAs arrayed in microwell format. The evolution of RNAi screening to the genome scale parallels the emergence of high-density microarray technology. As RNAi technology joins the “omics” echelons, so grows the need for analysis methodology to make sense of the enormous data sets generated from genome-scale loss-of-function studies. These data sets can range from the simple output of luminescence or fluorescence of a well to high-content cell-based data containing as many as a hundred separate parameters for each of the thousands of cells in the well. Multiple analysis methods have been used to generate the all-important “hit list,” although none is considered standard procedure.⁵

One specific field benefiting from genome-scale siRNA screening technology is the field of host-pathogen interactions. To date, multiple groups have pursued the identification of factors influencing viral propagation using genome-scale RNAi technology. Recently, 3 groups identified host factors supporting the HIV life cycle in human cells. However, upon comparison of the results, the significant dissimilarity between the proposed host factors posited more questions than any one project answered.^6-8 The low level of overlap among hit lists can be explained by the dissimilar methodologies employed, but it also raises a question that can be explored experimentally: to what degree would one expect RNAi-based genome-wide screens to agree?

Our study explores the effect of screening strategy and analysis methodology on the results of 2 genome-scale siRNA screens. These 2 screens used high-content cell-based imaging and analysis to score for siRNAs that inhibited yellow fever virus propagation in human cells. Both were performed using the same siRNA library, cell line, viral stock, equipment, and procedure but were separated by 5 months. We use these data to illustrate the advantages of testing independent siRNAs during the primary screen. In addition, we compare the performance of 4 accepted analysis methods with respect to the variability and overlap of intra- and interscreen hit lists. Our work defines multiple factors contributing to the variability between genome-scale siRNA screens.

Materials and Methods

siRNA screening

Both genomic screens to identify human host factors of yellow fever virus propagation were performed using the Qiagen Human Genome siRNA Library v1.0 at the Duke RNAi Screening Facility. HuH-7 cells were reverse transfected with 1.0 pmol siRNA in a total volume of 65 µL of medium. Briefly, assay plates were prearrayed from stock library plates using a Velocity11/Agilent Bravo precision liquid-handling robot (Velocity11, Menlo Park, CA). Then, 5 µL of dH₂O containing 1.0 pmol of siRNA was dispensed into 384-well microplates (Corning 3712; Corning, Corning, NY) using the 384-channel ST head of the Bravo. Next, 10 µL of Opti-MEM I (GIBCO 11058; GIBCO, Carlsbad, CA) containing 0.5% RNAimax (Invitrogen 13778-150; Invitrogen, Carlsbad, CA) was dispensed to each well using a Matrix WellMate liquid dispenser (Matrix Technologies Corporation, Hudson, NH) and incubated with the siRNA for 30 min at room temperature. After incubation, 1200 HuH-7 cells in 50 µL of Dulbecco’s modified Eagle’s medium (DMEM; GIBCO 11995) supplemented with 5% fetal bovine serum (FBS; GIBCO 16140) and 1% antibiotics (GIBCO 15140) were dispensed into each well using a Matrix WellMate.

Approximately 51 h after transfection, the HuH-7 cells were infected with yellow fever virus vaccine strain 17D at a multiplicity of infection (MOI) of 0.1 by addition of 20 µL of virus-containing medium to each assay well using the 384-channel ST head of the Bravo, followed by mixing by trituration. The virus infection was allowed to proceed for 42 h at 37°C. The cells were then immunostained for the viral envelope protein using the primary antibody 4G2 followed by Alexa 488–conjugated secondary antibodies.⁹ Reagents were added using a Matrix automated 12-channel pipette, and wash steps were performed using a BioTek ELx405 automated plate washer (BioTek, Winooski, VT). Both genomic screens strictly adhered to a common protocol in which 11 batches of 14 plates per day were assayed within 27 days.

High-content imaging and cell-based assay

After fixation and staining, 2 of the 4 available fields in each assay well were imaged with a 10× objective using the Cellomics Array Scan VTI system (Cellomics, Pittsburgh, PA). The acquired images were analyzed using the vHCS Scan version 5.1.2 (Build 268). The Compartmental Analysis bioapplication was optimized to determine the percentage of HuH-7 cells that were infected by yellow fever virus. The vHCS View software package reports the number of cells as the valid object count (VOC), and the percentage of the analyzed cell population infected by the virus is reported as the “% selected” and henceforward referred to by the moniker percent infection.

Statistical analysis

Valid object count

Only genes for which both assay wells in a single genomic screen had a VOC no lower than 1.9 standard deviations below the mean cell density of the population were analyzed.
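The cell-density filter can be sketched as follows. This is a minimal illustration, assuming the cutoff is the population mean VOC minus 1.9 standard deviations, computed over all wells, and that a pair is retained only when both wells meet it. The function name, the data layout, and the toy values are hypothetical and are not taken from the screening data.

```python
from statistics import mean, stdev

def filter_pairs_by_voc(pairs, k=1.9):
    """Keep gene pairs whose two wells both have VOC >= mean - k * SD.

    `pairs` maps a gene ID to (voc_ab, voc_cd). The mean and the sample
    standard deviation are taken over every well (an assumption).
    """
    all_voc = [voc for wells in pairs.values() for voc in wells]
    cutoff = mean(all_voc) - k * stdev(all_voc)
    kept = {gene: wells for gene, wells in pairs.items() if min(wells) >= cutoff}
    return kept, cutoff

# Hypothetical valid object counts for six genes (set AB well, set CD well).
toy_pairs = {
    "A": (1000, 1020),
    "B": (980, 1010),
    "C": (990, 1005),
    "D": (995, 1015),
    "E": (985, 1000),
    "F": (1000, 100),   # one well with very few cells
}

kept, cutoff = filter_pairs_by_voc(toy_pairs)
print(sorted(kept), round(cutoff, 1))
```

In this toy set, gene F is dropped because one of its two wells falls below the cutoff, even though the other well is normal.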

The following analysis methodologies were performed treating the genomic data as a single data set of 21,853 pairs for GS1 and 21,843 pairs for GS2.

Sum rank (SR)

The SR p-value was calculated for each genomic screen separately.¹⁰ The percent infection values for each set of siRNAs (AB, CD; Fig. 1 ) are ordered lowest to highest and given a rank from 1 to n (n = number of assay wells), respectively. The summation of the ranks for corresponding pairs between each set is calculated. Using the equation presented in Supplemental Table 1, the p-value for each pair is determined. The p-values are ranked. The genes with the lowest 200 p-values and the lowest 500 p-values comprise the top 200 and top 500 hit lists, respectively.
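The ranking and summation steps can be sketched as below. This covers only the rank sum; the p-value equation from Supplemental Table 1 is not reproduced in this text and is therefore not implemented. The function names are hypothetical, the toy percent infection values are invented, and breaking ties by position is an assumption.

```python
def ordinal_ranks(values):
    """Rank values from lowest (1) to highest (n); ties are broken by position."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        ranks[index] = rank
    return ranks

def rank_sums(set_ab, set_cd):
    """Sum the within-set ranks of corresponding AB and CD wells."""
    return [a + c for a, c in zip(ordinal_ranks(set_ab), ordinal_ranks(set_cd))]

# Hypothetical percent infection values for four genes.
pct_ab = [10.0, 78.0, 30.0, 82.0]
pct_cd = [15.0, 20.0, 75.0, 79.0]

sums = rank_sums(pct_ab, pct_cd)
print(sums)  # the gene with the smallest sum ranked low in both sets
```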

Fig. 1.

The human whole genome siRNA library design. The target gene is assayed by 4 unique siRNAs (A, B, C, D), which are pooled as 2 siRNAs per well (AB, CD) and arrayed in corresponding wells within 74 pairs (set AB, set CD) of 384-well assay plates.

Median absolute deviation (MAD)

The MAD was calculated as reported in Supplemental Table 1 separately for each set within each genomic screen.¹¹ The median percent infection value for each set within each genomic screen was used in the calculations. To expedite the identification of the MAD limits for the top 200 and top 500 gene sets, we calculated the number of MAD units away from the median for each percent infection value by

$$
\mathrm{MAD}_n = \frac{x_{ij} - \mathrm{median}(X)}{\mathrm{MAD}}, \qquad x_{ij} = \text{sample value}, \quad X = \text{all samples per population} \tag{1}
$$

The thresholds for the top 200 or top 500 populations were identified as the point at which the MAD_n limit for set AB equaled the MAD_n limit for set CD and the population within the limits contained 200 or 500 hits, respectively.
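Equation 1 can be illustrated with a short sketch. It assumes the unscaled MAD (the median of absolute deviations from the median); Supplemental Table 1 may apply a scaling constant that is not shown here. The function name and the toy values are hypothetical.

```python
from statistics import median

def mad_units(values):
    """Return, for each value, its distance from the median in MAD units."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        raise ValueError("MAD is zero; MAD units are undefined")
    return [(v - med) / mad for v in values]

# Hypothetical percent infection values for one siRNA set.
pct_infection = [90.0, 85.0, 88.0, 92.0, 20.0, 87.0, 91.0]

units = mad_units(pct_infection)
for value, unit in zip(pct_infection, units):
    print(f"{value:5.1f} -> {unit:7.2f} MAD units")
```

A well that strongly inhibits infection sits many MAD units below the median, whereas the bulk of the population stays within a few units of zero.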

z-score

The calculation for z-score is reported in Supplemental Table 1. As applied to these genomic screens, the negative control population consisted of all percent infection values from wells with siGFP for both sets within each genomic screen. The thresholds for the top 200 or top 500 populations were identified as the point at which the z-score for set AB equaled the z-score for set CD and the population within the limits contained 200 or 500 hits, respectively.
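One way to read this thresholding rule is sketched below. It assumes the conventional z-score relative to the negative control mean and standard deviation, and it interprets the shared limit as the value t at which exactly the requested number of genes satisfy z ≤ t in both set AB and set CD. Both function names and all values are hypothetical.

```python
from statistics import mean, stdev

def z_scores(values, neg_controls):
    """z-score of each value relative to the negative control population."""
    mu, sd = mean(neg_controls), stdev(neg_controls)
    return [(v - mu) / sd for v in values]

def common_threshold(z_ab, z_cd, n_hits):
    """Shared limit t such that n_hits genes (more if tied) have both z <= t."""
    worst_of_pair = sorted(max(a, c) for a, c in zip(z_ab, z_cd))
    return worst_of_pair[n_hits - 1]

# Hypothetical siGFP control wells and percent infection for four genes.
sigfp = [80.0, 85.0, 90.0, 75.0, 70.0]
pct_ab = [10.0, 78.0, 30.0, 82.0]
pct_cd = [15.0, 20.0, 75.0, 79.0]

z_ab = z_scores(pct_ab, sigfp)
z_cd = z_scores(pct_cd, sigfp)
limit = common_threshold(z_ab, z_cd, n_hits=1)
hits = [i for i, (a, c) in enumerate(zip(z_ab, z_cd)) if a <= limit and c <= limit]
print(hits, round(limit, 2))
```

Scoring each gene by the weaker of its two wells enforces the requirement that both siRNA pools produce the phenotype. The same selection logic would apply to MAD_n values.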

Strictly standardized mean difference (SSMD)

The calculations for SSMD are reported in Supplemental Table 1.^12,13 The negative control population consisted of all percent infection values from wells with siGFP for both sets within each genomic screen. The genes with the lowest 200 and lowest 500 SSMD values within each genomic screen comprised the top 200 and top 500 hit lists.

Results

Duplicate siRNA-based genomic screens

Two genome-scale siRNA screens were performed to identify host factors supporting yellow fever virus propagation in HuH-7 cells. Approximately 5 months separated the first genomic screen (GS1) and the second genomic screen (GS2). A high-content imaging and analysis platform quantified infection as described in Materials and Methods.

To eliminate unreliable data, we removed assayed pairs in which cells failed to proliferate after plating from further analysis. Of the 22,909 assayed pairs in each genomic screen, 95% of the pairs retained sufficient cell density to be further analyzed for percent infection ( Table 1 ).

Table 1.

Genomic Data Set

	Genomic Screen
	GS1	GS2
Total pairs	22,909	22,909
Pairs removed	1056	1066
Pairs analyzed	21,853	21,843

Library design and screening procedure to reduce false discovery rate

Four distinct siRNA duplexes denoted A, B, C, and D targeted each of the 22,909 predicted mRNAs assayed by a human whole genome siRNA library. We chose a 2 × 2 pooled siRNA screening format that consists of 2 distinct 74-plate sets, each containing 2 unique siRNA duplexes per well (Set^AB, Set^CD). This format results in 148 paired 384-well microplates ( Fig. 1 ). The library design allowed each gene to be assayed by 2 independent tests. Genes that scored strongly for a particular phenotype with both siRNA pools were considered candidates.

We employed a total effective siRNA concentration of 15.4 nM for screening, resulting in each siRNA being present at 7.7 nM. This concentration of siRNA is significantly lower than that commonly used (Suppl. Table 2) and was chosen to reduce off-target effects (OTE).¹⁴
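The stated concentrations follow from the transfection volumes given in Materials and Methods (1.0 pmol of siRNA in 5 + 10 + 50 = 65 µL). The arithmetic is shown here as a small sketch; the helper name is hypothetical.

```python
def nanomolar(amount_pmol, volume_ul):
    """Convert pmol in a volume of microliters to nM (1 pmol/uL = 1000 nM)."""
    return amount_pmol / volume_ul * 1000.0

total_volume_ul = 5.0 + 10.0 + 50.0           # siRNA + lipid mix + cells
total_nm = nanomolar(1.0, total_volume_ul)    # both duplexes in the pool
per_duplex_nm = total_nm / 2                  # 2 siRNA duplexes per well

print(f"total: {total_nm:.1f} nM, per duplex: {per_duplex_nm:.1f} nM")
```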

The low siRNA concentration and the 2 × 2 screening strategy are designed to limit the false discovery rate in the primary screen. These criteria were selected to provide a high-confidence primary hit list, with minimal false positives for academic screeners who may face both personnel and financial constraints during assay validation.

Analysis of GS1 and GS2 control wells to assess behavior of the 2 screens

Twelve siRNA controls were arrayed on each assay plate. Four negative control wells contained an siRNA targeting GFP (siGFP), a protein not present in our system. Four wells contained a nontargeting siRNA (siNSC) as a negative control (Qiagen, Valencia, CA). The remaining 4 wells contained an siRNA duplex that targeted a subunit of the vATPase (si-vATPase), which is required for flaviviral infection, and served as a positive control.^10,15,16 In GS1, the percent infection (mean ± standard deviation) for the negative controls siGFP and siNSC was 78.6 ± 11.4 and 90.8 ± 6.3, respectively, whereas that for the positive control was 7.5 ± 2.7 ( Fig. 2A ). In GS2, the percent infection for siGFP and siNSC was 87.9 ± 5.2 and 98.3 ± 1.1, respectively, whereas that for the positive control was 24.6 ± 4.7 ( Fig. 2C ). Despite our efforts to reproduce the screens in an identical fashion, these data begin to reveal a significant difference in the overall behavior of GS1 and GS2. We attribute this difference to slight changes in the infectivity of the yellow fever virus stocks and minor variability of the HuH-7 cell line after storage in the freezer for 5 months, but we have no definitive explanation for the change in behavior.

Fig. 2.

The behavior of the negative and positive controls across GS1 and GS2. For A, B, C, D: si-nonsilencing negative controls (■), si-GFP negative controls (△), and si-vATPase positive controls (+). (A) The % infection for the negative and positive control wells from the 148 assay plates in GS1 is plotted left to right in the order in which each was assayed. (B) The % infection for the negative or positive control wells from the 74 assay plates in GS1^AB is plotted as coordinate pairs with the corresponding negative or positive control well within the 74 assay plates from GS1^CD on the coordinate plane. (C) Similar to A, except that the negative and positive controls were assayed on the 148 plates in GS2. (D) Similar to B, except that pairs of control wells belong to GS2^AB and GS2^CD.

Although the behavior of the controls was different between screens, when assessed individually, each screen performed very well. Figure 2A (GS1) and Figure 2C (GS2) display all positive and negative control wells left to right in the order in which the screen was performed. Visualization of controls in this fashion allows intra- and interbatch anomalies to be detected. Although minor batch-specific effects are noticeable, we decided to analyze each genomic screen as a single data set (Materials and Methods).

A modified Z′ factor (Z′_n) illustrates the power of replicate tests

Z′ factor (Z′) was developed as a common metric to classify the strength of high-throughput chemical screens and has been similarly applied to siRNA-based screens.¹⁷ We chose to use the values of negative control siGFP to calculate the most conservative Z′. Z′ was calculated between the siGFP and si-vATPase to be 0.40 and 0.53 for GS1 and GS2, respectively. These Z′ factors indicated we could consistently differentiate the positive control wells from the negative control wells within both respective siRNA screens.

In addition, the paired design of our screening format allowed us to improve the resolution of the assay by plotting the performance of the control wells from Set^AB and Set^CD as respective x and y coordinates. In Figures 2B,D , each point represents the percent infection for 2 corresponding controls for all paired assay plates within each screen. Plotting the controls in 2 dimensions illustrates the enhanced resolution afforded by paired tests and results in greater separation of the values. Indeed, by applying a modified equation, one can arrive at a modified Z′ factor, Z′_n.

$$
Z'_n = 1 - \frac{3\,(\sigma_{c+} + \sigma_{c-})}{\sqrt{2}\,\lvert \mu_{c+} - \mu_{c-} \rvert} \tag{2}
$$

where $\mu_{c+}$ and $\sigma_{c+}$ are the mean and standard deviation of the positive control, and $\mu_{c-}$ and $\sigma_{c-}$ are the mean and standard deviation of the negative control.

The adjustment of Z′ to Z′_n quantifies the increased distance between the mean of the negative and positive controls in 2D space without increasing the standard deviations. It assumes that the mean and standard deviation along the x and y axes are equal for the respective control set and that there is no covariance for which to account. When those assumptions are fair, as described in Supplemental Figure 1 and accompanying text, Z′_n can be interpreted in a manner similar to how Zhang et al.¹⁷ proposed interpretation of Z′.

Calculation of the Z′_n factor for GS1 and GS2 results in 0.58 and 0.66, respectively. The increase of Z′_n factor relative to Z′ factor demonstrates the well-known positive influence replicate tests have on data quality.
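As a check on the arithmetic, the sketch below recomputes Z′ and Z′_n from the rounded siGFP and si-vATPase summary statistics reported for the control wells. It assumes the standard Z′ definition of Zhang et al. and the √2 adjustment of equation 2; the function names are hypothetical. Because the inputs are rounded means and standard deviations, the results should match the reported values only to within about 0.01.

```python
from math import sqrt

def z_prime(mu_pos, sd_pos, mu_neg, sd_neg):
    """Standard Z' factor for one positive and one negative control."""
    return 1 - 3 * (sd_pos + sd_neg) / abs(mu_pos - mu_neg)

def z_prime_n(mu_pos, sd_pos, mu_neg, sd_neg):
    """Modified Z' for paired wells: the mean separation grows by sqrt(2)."""
    return 1 - 3 * (sd_pos + sd_neg) / (sqrt(2) * abs(mu_pos - mu_neg))

# (si-vATPase mean, SD, siGFP mean, SD) as reported for each screen.
controls = {
    "GS1": (7.5, 2.7, 78.6, 11.4),
    "GS2": (24.6, 4.7, 87.9, 5.2),
}

results = {
    screen: (z_prime(*stats), z_prime_n(*stats))
    for screen, stats in controls.items()
}
for screen, (zp, zpn) in results.items():
    print(f"{screen}: Z' = {zp:.3f}, Z'_n = {zpn:.3f}")
```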

The 2 × 2 pooled library design improved data integrity by providing independent tests

The increased resolution afforded by replicate tests and illustrated by the Z′_n factor would apply to any screening format performed in duplicate. Our 2 × 2 pooled library format fulfills this criterion, whereas additional power is achieved by using 2 unique pools of siRNAs and requiring the phenotype to reproduce with both pools. We demonstrated the effects of this format by comparing the VOC between screens in several ways. Initially, we limited the analysis to the first 3 plate pairs of each screen: GS1^1-3AB, GS1^1-3CD, GS2^1-3AB, and GS2^1-3CD.

First, we examined the behavior of the same plates between GS1 and GS2 by comparing the observed cell density as reported by VOC. A highly significant correlation was observed between the VOC for GS1^1-3AB and the VOC for GS2^1-3AB (r = 0.91; Fig. 3A ), demonstrating the reproducible effect on cell density of a single siRNA pool. Comparison of the VOC for GS1^1-3CD and GS2^1-3CD resulted in an equally strong correlation (r = 0.94; Fig. 3B ). These results clearly indicate that a specific siRNA pool will affect cell density in a way that is reproducible and predictive of its behavior in a replicate screen.

Fig. 3.

The correlation of a phenotype between identical siRNA pools and independent siRNA pools targeting a subset of genes within genomic screen 1 (GS1) and genomic screen 2 (GS2). (A) The valid object count (VOC) from 984 siRNA pools (GS1^1-3AB) within GS1 is compared across genomic screens to the VOC for 984 identical siRNA pools (GS2^1-3AB) from GS2. The linear regression line is provided with the correlation coefficient (r). (B) Similar to A, except that the VOC from 984 identical siRNA pools from GS1^1-3CD is compared to GS2^1-3CD. Again, the linear regression line is identified along with the correlation coefficient (r). (C) The VOC from 984 siRNA pools (GS1^1-3AB) is compared to the corresponding 984 siRNA pools (GS1^1-3CD) within GS1. Both the linear regression line and correlation coefficient (r) are provided. (D) Similar to C, except that intrascreen comparison is made between the VOC from 984 wells (GS2^1-3AB) to the 984 corresponding wells (GS2^1-3CD) within GS2.

The power of our 2 × 2 pooled screening strategy is demonstrated when the independent siRNA sets are compared within screens: GS1^1-3AB versus GS1^1-3CD and GS2^1-3AB versus GS2^1-3CD. The correlation coefficient of the VOC of GS1^1-3AB versus GS1^1-3CD was 0.11 ( Fig. 3C ) and reflected the independence exhibited by the 2 sets of siRNAs. The second genomic screen also demonstrated independent behavior of the siRNA sets with a correlation coefficient of 0.14 for GS2^1-3AB versus GS2^1-3CD ( Fig. 3D ). We have also performed this analysis on 7 plate pairs from the middle of the screen and 7 pairs from the end of the screen, accounting for greater than 18% of the library, with similar results (data not shown). These results indicate that the effect of each AB siRNA pool on cell density is not predictive of the effect of the corresponding CD siRNA pool.

Our assay system used percent infection to identify hits, and therefore effects on cell number can be considered an OTE of siRNA treatment. When assaying cell number as a surrogate of OTEs, our results demonstrate that testing the same siRNA pool multiple times produced similar results ( Figs. 3A,B ). However, our 2 × 2 pooled siRNA library format is intended to take fullest advantage of the power of 2 independent trials, and supporting that strategy, we clearly demonstrate the ability to alleviate this observed correlation by testing 2 unique siRNA pools ( Figs. 3C,D ). It should be noted that these data are derived from a single siRNA library from 1 manufacturer, and it remains possible that other algorithms for siRNA design may produce siRNA libraries that behave differently.

Four methods for hit selection resulted in significantly different hit lists

At our academic screening facility, we were interested in testing the reproducibility of siRNA screening on a genome-wide scale. As suggested by the behavior of the controls ( Figs. 2A,C ), the distributions of the population of screening wells from GS1 were markedly different from those of GS2 ( Figs. 4A,B ). To compare the screens, we first had to arrive at a collection of hit selection strategies that were appropriate for both data sets.

Fig. 4.

The illustrated thresholds from 4 statistical methodologies overlaid onto the associated genomic population. For A, B, C, D: genomic screen 1 (GS1), genomic screen 2 (GS2), sum rank (SR), median absolute deviation (MAD), z-score, and strictly standardized mean difference (SSMD). (A) Each point (▪) represents the % infection for the analyzed population in GS1^AB plotted against the % infection from the corresponding wells within GS1^CD. The illustrated lines indicate the thresholds within which hits are declared if SR ≤ 0.001350, MAD ≤ −2.000, z-score ≤ −3.000, or SSMD ≤ −3.000. (B) Similar to A, except that the % infection from GS2^AB is plotted against the % infection for corresponding wells from GS2^CD. The thresholds indicate the limits within which hits can be declared assuming SR ≤ 0.001350, MAD ≤ −3.000, z-score ≤ −3.000, or SSMD ≤ −3.000. (C) The thresholds for the top 200 (dashed lines) and top 500 (solid lines) lists generated by SR, MAD, z-score, and SSMD are overlaid onto the scatter plot of the % infection (as described in A) for the paired siRNA pools in GS1^AB and GS1^CD. (D) The % infection for corresponding siRNA pools within GS2 are plotted as coordinates (GS2^AB, GS2^CD). The limits determining the top 200 (dashed lines) and top 500 (solid lines) lists are illustrated in the context of the GS2 population.

Analyses using parametric statistics assume that the population in question can be accurately described by an established probability distribution. The consistent behavior of the siGFP negative control wells within each genomic screen ( Figs. 2A,C ) moderately agrees with a normal distribution, so methods that used the siGFP population mean and variance were acceptable for analysis. In addition, nonparametric analysis can be applied when a population is not easily described by a normal distribution. The skewed distributions of the experimental populations GS1 and GS2 ( Figs. 4A,B ) indicated that nonparametric statistics applied to the genomic population were also appropriate. We found no single batch of data behaving independently, and therefore GS1^AB, GS1^CD, GS2^AB, and GS2^CD were considered distinct populations, and each was analyzed as a single batch. Supplemental Table 1 lists 4 statistical strategies: MAD, SR, z-score, and SSMD. MAD and SR were representative nonparametric methods, whereas z-score and SSMD were representative parametric methods. All 4 strategies had been previously described for genome-wide siRNA-based screen applications.^10-13 As also described in the literature, both the z-score and SSMD methods can be performed by substituting the median and MAD for the mean and standard deviation, respectively, to generate nonparametric estimators that are robust to strong outliers and skewed distributions, but these alternate calculations were not evaluated here.^5,12

To arrive at comparable hit lists using the 4 methods, we initially sought to apply a consistent probability threshold to each. Three standard deviations greater or less than the mean of the negative control or the population median or mean is considered a robust threshold for hit detection.⁵ We chose a z-score ≤ −3.000 and its corresponding probability threshold of p ≤ 0.001350 to meet this convention. The SSMD and MAD values of −3.000 were both defined by the respective authors as being equivalent to a z-score of −3.000, whereas a p ≤ 0.001350 for SR could be used for analysis.^11-13 In the specific instance of the application of MAD to the GS1 data set, a threshold of −3.000 was inappropriate because it exceeded the limits of the assay (−4.3%) for the GS1^AB set. As advised by the method’s authors, we relaxed the MAD limit to k ≤ −2.000.¹¹ Table 2 presents the number of hits each strategy identified and the overlap among the different lists (some fields intentionally left blank to reduce redundancy). It is immediately apparent that application of the 4 strategies resulted in significantly different hit list lengths even when the probability thresholds were held constant. For GS1, SR resulted in 75 hits, whereas z-score generated the most extended hit list with 794 members. Analysis of GS2 produced lists ranging from 82 (SR) to 1140 (SSMD) members.

Table 2.

The Intra- and Interscreen Comparisons of Hit Lists

Screen ID		Genomic Screen 1			Genomic Screen 2
Method and Threshold		MAD, k ≤ −2	z-Score, z ≤ −3	SSMD, β ≤ −3	SR, p ≤ 0.001	MAD, k ≤ −3	z-Score, z ≤ −3	SSMD, β ≤ −3
	List length	148	794	513	82	312	392	1140
Genomic screen 1
SR, p ≤ 0.001	75	75	75	75	43	55	56	59
MAD, k ≤ −2	148		148	148	53	81	86	105
z-score, z ≤ −3	794			445	68	179	199	290
SSMD, β ≤ −3	513				68	153	168	281
Genomic screen 2
SR, p ≤ 0.001	82					82	82	82
MAD, k ≤ −3	312						312	285
z-score, z ≤ −3	392							331

MAD, median absolute deviation; SR, sum rank; SSMD, strictly standardized mean difference.

To better understand the wide range of hit list lengths generated using these analysis methods, we illustrated the limits defined by each strategy upon the GS1 and GS2 population ( Figs. 4A,B ). Visualization of the methods in this way allows one to quickly ascertain how lists can vary in both size and composition. Importantly, the assay design required that a phenotype be identified in both wells, but the SSMD limit illustrated in Figure 4B indicates that for GS2, SSMD is not strictly adhering to the rule. These data clearly demonstrate that reproducibility of hit lists between and even within genome-scale siRNA screens can be greatly influenced by the analysis method. Although we present the overlap among and between these hit lists ( Table 2 ), we believe that the considerable difference in list lengths confounded the interpretation of reproducibility between GS1 and GS2 and chose to pursue fixed-length hit lists for comparison.

Generation of GS1 and GS2 hit lists using fixed list lengths of 200 and 500 members

As an alternative to determining the overlap of hit lists of such divergent lengths, we applied a consistent rule to all statistical strategies that resulted in a fixed number of hits and then studied reproducibility from that perspective. We scanned the literature for whole-genome RNAi-based screens and considered the length of the primary hit lists. Primary hit lists for 13 different genome-wide screens ranged from 0.96% to 6.66% of the populations tested (Suppl. Table 2). The median of the 13 lists identified 407 hits, or 2.09% of their genomic library.¹⁸ Our choice of benchmark lists included the top 200 and top 500 hits, representing 0.88% and 2.21% of the interrogated genome, respectively.

Table 3 presents the thresholds associated with the top 200 and top 500 hit lists generated when SR, MAD, z-score, and SSMD methods are used to rank hits for GS1. Although the thresholds determining the top 200 hits (p ≤ 0.005735, k ≤ −1.900, z ≤ −4.326, and β ≤ −3.505) are no longer related in a statistical sense, they appeared to identify comparable regions of the genomic population (depicted in Fig. 4C as dashed lines). Furthermore, for the top 500, the thresholds p ≤ 0.017673, k ≤ −1.510, z ≤ −3.480, and β ≤ −3.019 also appeared to identify comparable regions of the genomic population (as depicted in Fig. 4C by solid lines). It should be noted that for the MAD-generated hit list, 201 hits are reported, indicating that at the threshold k ≤ −1.900, 2 different genes performed in a statistically indistinguishable manner (a tie).
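The way a tie at the threshold lengthens a fixed-length list can be sketched as follows. This is an illustration only, assuming each gene has already been reduced to a single score for which lower values indicate stronger hits; the function name and the scores are hypothetical.

```python
def top_n_with_ties(scores, n):
    """Return genes scoring at or below the n-th lowest score, plus that score.

    `scores` maps gene ID -> score (lower = stronger hit). Genes tied with the
    n-th gene are all kept, so the list may exceed n members.
    """
    threshold = sorted(scores.values())[n - 1]
    hits = [gene for gene, score in scores.items() if score <= threshold]
    return hits, threshold

# Hypothetical scores: g3 and g4 tie at the cutoff for a top 3 list.
toy_scores = {"g1": -5.0, "g2": -4.2, "g3": -1.9, "g4": -1.9, "g5": -0.3}

hits, threshold = top_n_with_ties(toy_scores, n=3)
print(hits, threshold)
```

Requesting 3 hits returns 4 genes here, which mirrors how a nominal top 200 list can contain 201 members.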

Table 3.

Intrascreen Comparisons of the Top 200 or 500 Hits Identified by SR, MAD, z-Score, and SSMD for Genomic Screen 1

	Genomic Screen 1
	Method and Threshold	MAD, k ≤ −1.900	z-Score, z ≤ −4.326	SSMD, β ≤ −3.505	MAD, k ≤ −1.510	z-Score, z ≤ −3.480	SSMD, β ≤ −3.019
Genomic Screen 1
Method and threshold	List length	201	200	200	500	500	500
SR, p ≤ 0.00573	200	173 (87%)	174 (87%)	176 (88%)
MAD, k ≤ −1.900	201		178 (89%)	151 (76%)
z-score, z ≤ −4.326	200			153 (77%)
SR, p ≤ 0.0176	500				423 (85%)	428 (86%)	436 (87%)
MAD, k ≤ −1.510	500					445 (89%)	375 (75%)
z-score, z ≤ −3.480	500						376 (75%)

Values presented as n (%). MAD, median absolute deviation; SR, sum rank; SSMD, strictly standardized mean difference.

For GS2, the SR, MAD, z-score, and SSMD thresholds are presented in Table 4 , and their limits are illustrated in Figure 4D . The thresholds for the top 200 hits were p ≤ 0.005653, k ≤ −3.726, z ≤ −4.300, and β ≤ −5.211, whereas the thresholds for the top 500 hits were p ≤ 0.016596, k ≤ −2.281, z ≤ −2.491, and β ≤ −4.142. In the instance of the SR top 500 hit list, 501 hits are identified due to 2 hits at the threshold having identical SR values (a tie).

Table 4.

Intrascreen Comparisons of the Top 200 or 500 Hits Identified by SR, MAD, z-Score, and SSMD for Genomic Screen 2

	Genomic Screen 2
	Method and Threshold	MAD, k ≤ −3.726	z-Score, z ≤ −4.300	SSMD, β ≤ −5.211	MAD, k ≤ −2.281	z-Score, z ≤ −2.491	SSMD, β ≤ −4.142
Genomic screen 2
Method and threshold	List length	200	200	200	500	500	500
SR, p ≤ 0.005653	200	172 (86%)	173 (87%)	144 (72%)
MAD, k ≤ −3.726	200		198 (99%)	129 (65%)
z-score, z ≤ −4.300	200			129 (65%)
SR, p ≤ 0.01659	501				422 (84%)	428 (86%)	284 (57%)
MAD, k ≤ −2.281	500					481 (96%)	237 (47%)
z-score, z ≤ −2.491	500						240 (48%)

Values presented as n (%). MAD, median absolute deviation; SR, sum rank; SSMD, strictly standardized mean difference.

Intrascreen comparison of analysis methods using fixed-length hit lists

To examine the impact of analysis methodology on the composition of hit lists, we determined the intrascreen overlap of identified targets using each method. Intrascreen overlap illustrates to what extent the results of differing analysis methods agree with respect to identifying the same factors. The thresholds determining the top 200 and top 500 hit lists for GS1 by the SR, MAD, z-score, and SSMD methods are illustrated in Figure 4C , whereas the intrascreen overlap is tallied in Table 3 . As presented in Table 3 , the best intrascreen overlap for the top 200 and top 500 hit lists for GS1 was between hit lists generated by z-score and MAD methods, in which 89% of the potential overlap was identified by both methods. Similarly, as depicted in Figure 4C , both z-score and MAD methods identify very similar regions of interest. The least reproducibility is consistently associated with comparisons between SSMD and either z-score or MAD methods for both the top 200 and top 500 lists in GS1, ranging from 75% to 77% of the population ( Table 3 ). Again, Figure 4C shows that the SSMD method identified 2 unique regions of interest in which 1 of the 2 assay wells performed quite strongly, whereas the other had a fairly weak signal relative to the control population. However, z-score and MAD methods identified unique populations in which both assay wells performed moderately yet similarly. Comparisons between MAD, z-score, and SSMD reveal the extremes, whereas SR identifies a region of interest depicted in Figure 4C sharing some of the populations independently identified by each and, as tabulated in Table 3 , consistently identifies 85% to 88% of any other method’s top 200 or top 500 hit list.

The intrascreen overlap for the SR, MAD, z-score, and SSMD methods as applied to GS2 is tallied in Table 4, and the limits for each analysis method are depicted in Figure 4D. As in GS1, the MAD and z-score methods applied to the GS2 population have the greatest overlap, at 99% for the top 200 list and 96% for the top 500 list, and their respective limits depicted in Figure 4D support the strong overlap, as they identify nearly identical regions of interest. The SSMD method again behaves most differently from the other methods. Relative to the MAD and z-score methods, the SSMD method shares 65% overlap for the top 200 data sets and 47% to 48% of the population for the top 500 data sets. Explaining the discrepancy, Figure 4D shows that the SSMD algorithm did not strictly adhere to the proposed goal that both siRNA pools had to perform similarly. In fact, some candidate genes identified by SSMD have a higher than average percent infection, relative to the negative siGFP control, for 1 assay well, while nearly completely inhibiting infection in the alternate well. The overlap between SR and MAD or z-score was 86% and 87% for the top 200 list and 84% and 86% for the top 500 list, respectively. SR overlaps with SSMD by only 72% for the top 200 list and 57% for the top 500 list.
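The overlap tallies in Tables 3 and 4 can be illustrated with a minimal sketch. It assumes that the percentage is taken relative to the shorter of the two lists, which is consistent with the tabulated values but is not stated explicitly in the text. The function name and gene identifiers are hypothetical.

```python
def overlap(hits_a, hits_b):
    """Return (n_shared, percent) for two hit lists.

    Assumption: the percentage is expressed relative to the shorter list.
    """
    a, b = set(hits_a), set(hits_b)
    shared = len(a & b)
    shorter = min(len(a), len(b))
    percent = 100.0 * shared / shorter if shorter else 0.0
    return shared, percent


# Hypothetical gene identifiers, for illustration only.
mad_hits = ["GENE1", "GENE2", "GENE3", "GENE4", "GENE5"]
zscore_hits = ["GENE1", "GENE2", "GENE3", "GENE4", "GENE6"]
ssmd_hits = ["GENE1", "GENE2", "GENE7", "GENE8", "GENE9"]

print(overlap(mad_hits, zscore_hits))  # MAD and z-score share 4 of 5 hits
print(overlap(mad_hits, ssmd_hits))    # MAD and SSMD share 2 of 5 hits
```

Applied to real top 200 or top 500 lists, the same calculation would yield the n (%) entries reported for each pair of methods.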

Overlap of the GS1 and GS2 hit lists

The interscreen reproducibility was measured by comparing the top 200 and top 500 hit lists generated by each analysis method (SR, MAD, z-score, and SSMD) from GS1 to the top 200 and top 500 hit lists generated from GS2. Table 5 reports the thresholds associated with each hit list and the overlap for all paired comparisons. The least overlap (32%) was noted when comparing the SSMD-generated top 500 list for GS2 to the z-score or MAD-generated top 500 lists from GS1. The best overlap, 67%, was noted when comparing the SR-generated top 500 list in GS1 to the SR-generated top 200 list in GS2. The general trend indicates that although hit lists of the same length from 2 screens may not have the best overlap, the best overlap is observed when a smaller hit list from one screen is compared to a larger hit list from the other. As practically applied, the best reproducibility between hit lists can be observed when a threshold is set for 1 genomic screen and the resulting list is then compared to a list generated by a relaxed threshold in the second screen.
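The effect of comparing a short list from one screen to a longer list from the other can be sketched as follows. This is a toy illustration under stated assumptions: the scores are invented, more negative scores are taken to mean stronger inhibition (in line with thresholds such as z ≤ −4.3), and `top_k` is a hypothetical helper, not part of the published analysis.

```python
def top_k(scores, k):
    """Return the k genes with the most negative scores as a set."""
    ranked = sorted(scores, key=scores.get)  # ascending: strongest first
    return set(ranked[:k])


# Invented scores for six genes in two hypothetical screens.
screen1 = {"g1": -5.0, "g2": -4.5, "g3": -4.0, "g4": -1.0, "g5": -0.5, "g6": 0.2}
screen2 = {"g1": -4.8, "g2": -2.0, "g3": -3.9, "g4": -4.1, "g5": -0.3, "g6": 0.1}

strict = top_k(screen1, 3) & top_k(screen2, 3)   # same list length in both screens
relaxed = top_k(screen1, 3) & top_k(screen2, 4)  # relaxed threshold in screen 2

print(len(strict), len(relaxed))
```

In this toy case, gene g2 scores strongly in the first screen but only moderately in the second, so it is recovered only when the second list is lengthened.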

Table 5.

Interscreen Comparisons between Lists Identified by SR, MAD, z-Score, and SSMD for Either 200 or 500 Hits

	Genomic Screen 2
	Method and Threshold^a	SR, p ≤ 0.005	MAD, k ≤ −3.72	z-Score, z ≤ −4.3	SSMD, β ≤ −5.2	SR, p ≤ 0.016	MAD, k ≤ −2.28	z-Score, z ≤ −2.5	SSMD, β ≤ −4.1
Genomic screen 1
Method and threshold^a	List size	200	200	200	200	501	500	500	500
SR, p ≤ 0.005	200	98 (49%)	91 (46%)	91 (46%)	88 (44%)	130 (65%)	127 (64%)	126 (63%)	120 (60%)
MAD, k ≤ −1.9	201	90 (45%)	83 (42%)	83 (42%)	77 (39%)	117 (59%)	119 (60%)	118 (59%)	108 (54%)
z-score, z ≤ −4.3	200	91 (46%)	82 (41%)	82 (41%)	78 (39%)	116 (58%)	117 (59%)	116 (58%)	108 (54%)
SSMD, β ≤ −3.5	200	96 (48%)	87 (44%)	87 (44%)	92 (46%)	129 (65%)	124 (62%)	122 (61%)	129 (65%)
SR, p ≤ 0.017	500	133 (67%)	127 (64%)	127 (64%)	118 (59%)	201 (40%)	195 (39%)	194 (39%)	181 (36%)
MAD, k ≤ −1.5	500	127 (64%)	121 (61%)	120 (60%)	105 (53%)	184 (37%)	184 (37%)	183 (37%)	161 (32%)
z-score, z ≤ −3.48	500	125 (63%)	122 (61%)	121 (61%)	108 (54%)	188 (38%)	186 (37%)	184 (37%)	161 (32%)
SSMD, β ≤ −3.02	500	131 (66%)	123 (62%)	123 (62%)	127 (64%)	202 (40%)	190 (38%)	189 (38%)	205 (41%)

Values presented as n (%). MAD, median absolute deviation; SR, sum rank; SSMD, strictly standardized mean difference.

^a Thresholds have been truncated for space. Actual values are in the Results section and reflect 4 significant digits.

Discussion

The work presented here is the first published study to experimentally define the factors that influence reproducibility between genome-scale siRNA screens. The high-throughput loss-of-function genomic screening community has reached a critical milestone in the application of this promising technology. Multiple teams have completed whole-genome siRNA-based screens for factors involved in the same biological system, and the obvious first reaction was to compare the overlap between related screens.^6-8 This seemingly straightforward task resulted in a marked lack of overlap among the published hit lists. A comprehensive meta-analysis attempted to reconcile the data from these substantially different assay systems and did produce some additional overlap of gene families.¹⁹ Without a clear idea of the reproducibility expected, interpretation of the limited overlap was clouded. Our study attempts to shed light on the issue of the reproducibility of genome-scale siRNA screening and provide a context for interpretation of published screening data.

Within 6 months, 2 human whole-genome siRNA-based screens were completed. The genomic screens rigorously adhered to the same protocols, used the same instrumentation and the same batch of virus, and were performed by the same team. The only intentional difference was the 4 months separating the completion of GS1 and beginning of GS2. Variables that were not accounted for included changes in the batches of reagents and the passages of the cell line.

The 2 × 2 pooled library format described in Figure 1 ensured that hit identification would require that at least 2 of 4 siRNAs induce a phenotype.²⁰ If both positive siRNAs resided in 1 well, the hit would be missed. The probability of such a distribution by random chance is 0.18, or 18% of the possible scored wells.

Off-target effects (OTEs) are defined as changes in gene expression for genes not intentionally targeted by the siRNA design. Jackson et al.¹⁴ demonstrated that decreasing the effective concentration of siRNA targeting MAPK14 decreased OTEs while preserving the integrity of the target-specific knockdown. They also noted that some OTEs could not be titrated away. Our assay conditions were designed to minimize OTEs by using a relatively low effective siRNA concentration of 15.4 nM (7.7 nM for each siRNA) when compared to other published screens (Suppl. Table 2).

We addressed the impact of screening format on OTEs as well as reproducibility by examining the effect of pooled siRNA duplexes on cell density. In this case, effects on cell density can be considered an unintended effect, or OTE. The scatter plots in Figure 3 demonstrate the remarkably reproducible effect a given pool of 2 siRNAs had on cell density, whereas independent pools had entirely independent effects on cell density. Although we are unable to rule out that the observed effect could be due to influences such as relying on a single source (Qiagen) for the design and manufacturing of the siRNA library, these data demonstrate that a given pool of siRNAs will produce similar results when tested multiple times. The rationale that multiple tests of a single system better illustrate the true range of the system has sound statistical support. Unfortunately, studying the cell density results in which the same pool of siRNAs is tested twice (Figs. 3A,B) and contrasting them with the population distributions generated by testing independent pools (Figs. 3C,D) clearly shows that the experimental interpretation of the function of a target mRNA is not defined without independent tests. We understand that additional validation of our hit list is required, but by identifying those targets that display the appropriate phenotype with 2 independent pools, we fulfill the important criterion of redundancy in the primary screen.

Recently, Brass et al.²¹ published a screen that identified factors influencing H1N1 propagation in a human model cell line and that used 3 tests of the same pool of 4 siRNA duplexes. This study reported 334 targets from the primary screen, and a follow-up screen to deconvolute the pools resulted in 40% of the 334 putative hits being confirmed by at least 2 independent siRNAs. As a contrast, consider the screen by Zhang et al.²² for modifiers of the circadian clock. These authors used the pooled 2 × 2 format and followed up on hits that scored well in both independent tests. Of the 343 putative hits, 78% reconfirmed with 2 or more siRNAs eliciting a strong effect in a validation screen. Despite screening the genome 3 times, representing a 50% increase in the workload relative to Zhang et al.,²² Brass et al.²¹ did not improve the resolution of the mRNA function.

Our study used the data from each of the four 74-plate sets (GS1^AB, GS1^CD, GS2^AB, and GS2^CD) as batches for analysis for several reasons. On each 384-well plate, 4 negative siGFP and 4 positive si-vATPase controls were arrayed. Both SSMD and z-score methods require a comparison to a negative control set, so the minimum data set for analysis must be 1 assay plate. Zhang et al.²³ demonstrated using SSMD how variability in control wells contributed strongly to the calculated assay performance. This indicated that combining more plates into a single data set could mask the impact that normal variation in the negative control set had on the analysis. Furthermore, Qiagen manufactured its library in a systematic fashion. Consequently, 65% of the siRNA duplexes targeting the G-protein-coupled receptor (GPCR) family members were arrayed on a single plate, and the remaining ones were distributed on 2 adjacent plates. Other gene families are arrayed similarly. SR uses the population to determine a hit, so any population with significant bias would inhibit the application of SR, and thus we chose larger data sets to provide protection from plating bias. Figures 2A,C qualitatively indicated that data within each screen performed consistently, whereas the calculated Z′ factor quantitatively indicated that any variability present was unlikely to negatively affect the resolution of hits as strong as the si-vATPase. Although we concluded that treating all wells in a single siRNA set as 1 data set would be appropriate, one must carefully consider which approach best suits the behavior of the data sets.
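As an illustration of how control-well variability feeds into assay quality, the following sketch computes the conventional Z′ factor of Zhang et al.¹⁷ from 4 positive and 4 negative control wells. It does not reproduce the modified Z′ factor (Z′n) used in this study, and the percent infection values are invented for illustration only.

```python
from statistics import mean, stdev


def z_prime(positive, negative):
    """Conventional Z' factor: 1 - 3*(sd_pos + sd_neg) / |mean_pos - mean_neg|."""
    spread = 3 * (stdev(positive) + stdev(negative))
    window = abs(mean(positive) - mean(negative))
    return 1 - spread / window


# Invented percent infection values for one plate's control wells.
si_gfp = [100, 98, 102, 100]   # negative control (infection unaffected)
si_vatpase = [10, 12, 8, 10]   # positive control (infection inhibited)

print(round(z_prime(si_vatpase, si_gfp), 2))
```

With only 4 wells per control on a plate, a single aberrant well changes the standard deviation substantially, which is one reason the choice of batch size for analysis matters.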

Genomic screening is costly. Significant resources are invested during assay development to determine how best to pursue factors involved in a biological system. It seems contradictory that the investment in the cell-based assay has not been complemented by an investment in understanding the interpretation of screening results. We measured the reproducibility between 2 genomic screens by directly testing how much overlap there was between specific analysis methods. Applying the statistical methods SR, MAD, z-score, and SSMD by following the recommendations of each method’s authors produced a broad range of overlap.^10-13 In some cases, it would seem that there was little overlap between identical screens. However, the lengths of the hit lists were dramatically different. For example, in the comparison between SSMD in GS1 and GS2, the potential overlap could include all 513 hits from GS1, but this potential overlap is only 45% of the 1140 hits in GS2. Furthermore, it was difficult to assess the reproducibility between hit lists within a single screen because the different analytical methods produced lists of substantially different lengths.
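The ceiling that unequal list lengths place on overlap can be checked with a short calculation. The helper name is hypothetical; the list sizes are the SSMD hit counts quoted above.

```python
def potential_overlap(n_a, n_b):
    """Return the largest possible number of shared hits and its share of each list."""
    ceiling = min(n_a, n_b)
    return ceiling, ceiling / n_a, ceiling / n_b


# SSMD hit list sizes quoted in the text: 513 in GS1 and 1140 in GS2.
ceiling, share_gs1, share_gs2 = potential_overlap(513, 1140)
print(ceiling, f"{share_gs1:.0%}", f"{share_gs2:.0%}")
```

Even with perfect agreement, no more than 513 of the 1140 GS2 hits could be shared, which motivates comparing lists of fixed length.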

To compare methodologies within and between genomic screens, we chose to compare lists composed of the same number of hits. The top 200 hits for any method produced between 39% and 49% overlap. Expanding the list to include the top 500 hits did not improve the apparent reproducibility as the range was 32% to 41% overlap. The best indication that genomic data are regularly reproducible was established by comparing the top 200 from any individual method to the top 500 from the alternate screen. This situation reflects the stochastic nature of biological assays. A subset of siRNA targets will always score particularly well because the siRNAs are robust, the gene is easily silenced, or the gene is simply at a critical juncture in a system or pathway. Other siRNAs that may score strongly in one assay but only moderately in a second assay demonstrate that some pathways may be more resistant or adaptable to change or some genes are not as efficiently silenced as others. Thus, a gene that strongly inhibited infection in one screen did not necessarily strongly inhibit infection in the other screen, but it was highly likely to perform well.

As a standard practice, when comparing genomic screens, one must consider the behavior of assay wells that did not make the top tier hit lists. Unlike other “-omic” technologies such as microarray analysis of gene expression, proteomic studies of protein abundance, and genotyping via deep sequencing, the results of siRNA loss-of-function studies are not directly quantitative. The relative strength of the “score” in the phenotypic assay may not be directly related to the abundance of a required target protein. Due to differences in effective concentration and stoichiometry of the reactions involved, the strength of a complex phenotypic score is very unlikely to scale proportionally with protein levels. A target that scores only moderately in a screen may be absolutely required, but due to factors such as siRNA efficiency, protein half-life, and message abundance, the protein levels may only be reduced 30% in the course of the assay. This makes generation of “hit lists” a troubling facet of siRNA screening and certainly contributes to limited overlap and reproducibility. To alleviate these issues, the authors of screens should make available the performance of all the wells in a genomic screen at the time of publication, similar to what is currently done for microarrays.

The dual genomic screens identified reproducibility that reached 67%. This is the strongest published overlap for 2 complete screens. The 2 × 2 pooled siRNA library format and low siRNA concentration provided an efficient assay design to identify strong candidates without costly validation screening. We demonstrated that each siRNA pool behaves quite reproducibly with respect to VOC and posit that independent siRNA pools tested against the same system would provide more robust final data sets. The best practice we can promote at this time is that researchers use several analysis strategies and that all relevant data from each control and each experimental well be provided to future researchers, as is the case for microarray and genome-wide association studies.

Footnotes

Acknowledgements

The project described was supported in part by NIH grants to MGB (1 R21 AI064925 and U54 AI057157) from the Southeastern Regional Center of Excellence for Emerging Infections and Biodefense. Its contents are solely the responsibility of the authors and do not necessarily represent the official views of the NIH. The Duke RNAi Screening Facility was supported by NIH S10 1SA0RR024572-01, North Carolina Biotechnology Center, Duke Institute for Genome Sciences & Policy, Duke Comprehensive Cancer Center, and Duke School of Medicine.

References

1. Echeverri, Perrimon: High-throughput RNAi screening in cultured cells: a user’s guide. Nat Rev 2006;7:373-384.

2. Fire, Montgomery, Kostas, Driver, Mello: Potent and specific genetic interference by double-stranded RNA in Caenorhabditis elegans. Nature 1998;391:806-811.

3. Caplen, Parrish, Imani, Fire, Morgan: Specific inhibition of gene expression by small double-stranded RNAs in invertebrate and vertebrate systems. Proc Natl Acad Sci U S A 2001;98:9742-9747.

4. Paddison, Caudy, Bernstein, Hannon, Conklin: Short hairpin RNAs (shRNAs) induce sequence-specific silencing in mammalian cells. Genes Dev 2002;16:948-958.

5. Birmingham, Selfors, Forster, Wrobel, Kennedy, Shanks: Statistical methods for analysis of high-throughput RNA interference screens. Nat Methods 2009;6:569-575.

6. Brass, Dykxhoorn, Benita, Yan, Engelman, Xavier: Identification of host proteins required for HIV infection through a functional genomic screen. Science 2008;319:921-926.

7. Konig, Zhou, Elleder, Diamond, Bonamy, Irelan: Global analysis of host-pathogen interactions that regulate early-stage HIV-1 replication. Cell 2008;135:49-60.

8. Zhou, Huang, Gates, Zhang, Castle: Genome-scale RNAi screen for host factors required for HIV replication. Cell Host Microbe 2008;4:495-504.

9. Henchal, Gentry, McCown, Brandt: Dengue virus-specific and flavivirus group determinants identified with monoclonal antibodies by indirect immunofluorescence. Am J Trop Med Hyg 1982;31:830-836.

10. Sessions, Barrows, Souza-Neto, Robinson, Hershey, Rodgers: Discovery of insect and human dengue virus host factors. Nature 2009;458:1047-1050.

11. Chung, Zhang, Kreamer, Locco, Kuan, Bartz: Median absolute deviation to improve hit selection for genome-scale RNAi screens. J Biomol Screen 2008;13:149-158.

12. Zhang: A pair of new statistical parameters for quality control in RNA interference high-throughput screening assays. Genomics 2007;89:552-561.

13. Zhang, Ferrer, Espeseth, Marine, Stec, Crackower: The use of strictly standardized mean difference for hit selection in primary RNA interference high-throughput screening experiments. J Biomol Screen 2007;12:497-509.

14. Jackson, Bartz, Schelter, Kobayashi, Burchard, Mao: Expression profiling reveals off-target gene regulation by RNAi. Nat Biotechnol 2003;21:635-637.

15. Krishnan, Sukumaran, Gilfoy, Uchil, Sultana: RNA interference screen for human genes associated with West Nile virus infection. Nature 2008;455:242-245.

16. Krishnan, Sukumaran, Pal, Agaisse, Murray, Hodge: Rab 5 is required for the cellular entry of dengue and West Nile viruses. J Virol 2007;81:4881-4885.

17. Zhang, Chung, Oldenburg: A simple statistical parameter for use in evaluation and validation of high throughput screening assays. J Biomol Screen 1999;4:67-73.

18. Brass, Xavier, Liang: A genome-wide genetic screen for host factors required for hepatitis C virus propagation. Proc Natl Acad Sci U S A 2009;106:16410-16415.

19. Bushman, Malani, Fernandes, D’Orso, Cagney, Diamond: Host cell factors in HIV replication: meta-analysis of genome-wide studies. PLoS Pathogens 2009;5:e1000437.

20. Echeverri, Beachy, Baum, Boutros, Buchholz, Chanda: Minimizing the risk of reporting false positives in large-scale RNAi screens. Nat Methods 2006;3:777-779.

21. Brass, Huang, Benita, John, Krishnan, Feeley: The IFITM proteins mediate cellular resistance to influenza A H1N1 virus, West Nile virus, and dengue virus. Cell 2009;139:1243-1254.

22. Zhang, Liu, Hirota, Miraglia, Welch, Pongsawakul: A genome-wide RNAi screen for modifiers of the circadian clock in human cells. Cell 2009;139:199-210.

23. Zhang, Espeseth, Johnson, Chin, Gates, Mitnaul: Integrating experimental and analytic approaches to improve data quality in genome-wide RNAi screens. J Biomol Screen 2008;13:378-389.
