
Abstract

In most genome-scale RNA interference (RNAi) screens, the ultimate goal is to select siRNAs with a large inhibition or activation effect. The selection of hits typically requires statistical control of 2 errors: false positives and false negatives. Traditional methods of controlling false positives and false negatives do not take into account an important feature of RNAi screens: many small-interfering RNAs (siRNAs) may have very small but real nonzero average effects on the measured response. As a result, these methods do not allow us to effectively control false positives and false negatives. To address the deficiencies of traditional approaches in RNAi screening, the author proposes a new method for controlling false positives and false negatives in RNAi high-throughput screens. The false negatives are statistically controlled through a false-negative rate (FNR) or false nondiscovery rate (FNDR). FNR is the proportion of false negatives among all siRNAs examined, whereas FNDR is the proportion of false negatives among declared nonhits. The author also proposes new concepts, p*-value and q*-value, to control FNR and FNDR, respectively. The proposed method should have broad utility for hit selection in which one needs to control both false discovery and false nondiscovery rates in genome-scale RNAi screens in a robust manner.

Keywords

false discovery rate; false nondiscovery rate; RNAi; high-throughput screening

Introduction

For hit selection, one common task is to control false positives and false negatives statistically.¹ Traditionally, false positives (or more accurately, false discoveries) are defined as the siRNAs with no inhibition or activation effects among selected hits, and false negatives (or more accurately, false nondiscoveries) are defined as those with even very small inhibition or activation effects among declared nonhits. In some RNAi screens, especially the confirmatory screens where the investigated siRNAs are preselected to have effects, few siRNAs have exactly zero effects on average. Many siRNAs have very small but real nonzero effects on measured responses. Consequently, in these screens, if we only control the siRNAs with exactly zero effects on average using the traditional definitions of false positives, we may include too many siRNAs with very small effects in the hit list even if we use a small cutoff of traditional q-value or p-value. Therefore, the traditional definitions of false positives and false negatives are inappropriate for hit selection in these RNAi screens.²

In reality, what we really want to control in the list of selected hits are not just the siRNAs with exactly zero effects but also the siRNAs with very small effects, and what we really do not want to include in the list of declared nonhits are the siRNAs with large effects, not those with very small effects. This special feature of hit selection in RNAi screens requires the adoption of new definitions of false positives and false negatives: false positives are defined to be the siRNAs with a small effect among selected hits, and false negatives are defined to be the siRNAs with a large effect among declared nonhits.²,³ Although false-positive rate (FPR) and false-negative rate (FNR) based on these new definitions of false positives and false negatives have recently been explored,²,³ the corresponding false discovery rate (FDR) and false nondiscovery rate (FNDR) have yet to be investigated. As pointed out by Storey and Tibshirani,⁴ FPR and FDR are often mistakenly equated, but their difference is actually very important. So is the relationship between FNR and FNDR.

In this article, I compare FDR and FNDR with FPR and FNR in mathematical forms and demonstrate theoretically that FDR and FNDR are more tractable and address the question of interest in an RNAi screen better than FPR and FNR do. I propose the concepts of p*-value and q*-value to control FNR and FNDR, respectively, and provide the calculation of p-value, p*-value, q-value, and q*-value based on mean difference and strictly standardized mean difference (SSMD), respectively. In the follow-up article,⁵ my colleagues and I will demonstrate the utility of the proposed method and concepts for hit selection in real genome-scale RNAi screens with replicates.

False Discovery Rate and False Nondiscovery Rate

The effect of an siRNA is usually reflected by the difference of measured values between the siRNA and a negative reference. For simplicity, in this article, we assume that a positive value represents the upregulated direction and a negative value represents the downregulated direction. Suppose the true mean value of the difference for an siRNA is µ.

Traditionally, to select downregulated hits, the interesting siRNAs are those with µ < µ₀, and the noninteresting ones are those with µ ≥ µ₀ (where µ₀ can be 0 or another fixed value) in the downregulated direction (Table 1A). Consequently, based on the traditional definition, the false positives are those with µ ≥ µ₀ among the selected hits, and the false negatives are those with µ < µ₀ among the declared nonhits (Table 1B). In RNAi screens, especially confirmatory screens, many investigated siRNAs may have their true mean difference away from 0 (or µ₀) even though the difference may be very small. Consequently, the false positives should be the siRNAs with µ ≥ µ₂ among the selected hits, where µ₂ ≤ µ₀. Our interest is the siRNAs with a certain large size of downregulated effect, that is, the siRNAs with µ ≤ µ₁, where µ₁ is a fixed negative value less than µ₂ (i.e., µ₁ < µ₂). Therefore, the false negatives should be the siRNAs with µ ≤ µ₁ among the nonhits. Meanwhile, we can tolerate a hit list that contains some siRNAs with weak or moderate effects (i.e., those with µ₁ < µ < µ₂; Table 2A). These definitions of false positives and false negatives in the downregulated direction are displayed in Table 2B.

Table 1.

Traditional Methods for Defining False Discovery and False Nondiscovery Rates among m Simultaneous Tests in the Downregulated Direction

A: Interesting and Noninteresting Regions

B: False Positives and False Negatives
	Declared as Nonhits	Declared as Hits	Total
Noninteresting (β ≥ β₀ or µ ≥ µ₀)	U (#true negatives)	V (#false positives)	m₀
Interesting (β < β₀ or µ < µ₀)	T (#false negatives)	S (#true positives)	m − m₀
Total	m − R	R	m
C: False Discovery Rate (FDR), False Nondiscovery Rate (FNDR), False-Positive Rate (FPR), and False-Negative Rate (FNR)
$FDR = E (\frac{V}{R})$	$FNDR = E (\frac{T}{m - R})$	$FPR = E (\frac{V}{m})$	$FNR = E (\frac{T}{m})$

# denotes “the number of.” β represents the population value of strictly standardized mean difference (SSMD), and µ represents the population value of mean difference.

Table 2.

New Method for Defining False Discovery and False Nondiscovery Rates in m Simultaneous Tests in the Downregulated Direction in a Genome-Scale RNAi Screen

A: Interesting, Tolerable and Noninteresting Regions

B: False Positives and False Negatives
	Declared as Nonhits	Declared as Hits	Total
Noninteresting (β ≥ β₂ or µ ≥ µ₂)	U (#true negatives)	V (#false positives)	m₀
Tolerable (β₂ > β > β₁ or µ₂ > µ > µ₁)	W1	W2	m₁
Interesting (β ≤ β₁ or µ ≤ µ₁)	T (#false negatives)	S (#true positives)	m − m₀ − m₁
Total	m − R	R	m
C: False Discovery Rate (FDR), False Nondiscovery Rate (FNDR), False-Positive Rate (FPR), and False-Negative Rate (FNR)
$FDR = E (\frac{V}{R})$	$FNDR = E (\frac{T}{m - R})$	$FPR = E (\frac{V}{m})$	$FNR = E (\frac{T}{m})$

# denotes “the number of.” β represents the population value of strictly standardized mean difference (SSMD), and µ represents the population value of mean difference. µ₁ < µ₂.

Traditionally, hit selection is based on the test of mean difference (i.e., µ in Tables 1 and 2 ). However, mean difference can neither take into account data variability nor accommodate different measurement units. Consequently, the value of mean difference is not comparable across experiments, and hence no cutoff of mean difference can be applicable to various experiments. Therefore, it is hard to define generally applicable values of µ₂ and µ₁. An alternative that can avoid these issues of mean difference is the so-called effect size.⁶ One effect size that has been developed for high-throughput screening (HTS) experiments is SSMD (i.e., β in Tables 1 and 2 ).² SSMD is the ratio of the mean to the standard deviation of the difference between an siRNA and a negative reference group. SSMD has also been shown to be better than other commonly used effect sizes.⁷ Based on SSMD, the new definitions of false positives and false negatives can be readily applied by replacing mean difference with SSMD as shown in Tables 1 and 2 .

A clear advantage of SSMD over mean difference is that the population value of SSMD is comparable across experiments, and thus we can use the same cutoff for the population value of SSMD to measure the size of siRNA effects.² A meaningful and interpretable SSMD-based criterion for classifying the size of siRNA effects is as follows: |β| ≥ 5 for extremely strong, 5 > |β| ≥ 3 for very strong, 3 > |β| ≥ 2 for strong, 2 > |β| ≥ 1.645 for fairly strong, 1.645 > |β| ≥ 1.28 for moderate, 1.28 > |β| ≥ 1 for fairly moderate, 1 > |β| ≥ 0.75 for fairly weak, 0.75 > |β| > 0.5 for weak, 0.5 ≥ |β| > 0.25 for very weak, and |β| ≤ 0.25 for extremely weak effects.⁸ Based on this criterion, β₁ can be set to −3 or −2, and β₂ can be set to −0.25 in the downregulated direction. For example, in some RNAi screens, the false positives that we want to control are the siRNAs with SSMD ≥ −0.25 among hits; the false negatives are the siRNAs with SSMD ≤ −3 (not SSMD ≤ −0.25) among nonhits. We can tolerate siRNAs with SSMD between −0.25 and −3 in the hit list.
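The classification criterion above can be written as a small lookup. The following Python sketch is illustrative only: the function name `classify_ssmd` is hypothetical, and the top class is assumed to start at |β| ≥ 5, which is what the upper bound of the "very strong" class implies.

```python
def classify_ssmd(beta):
    """Label the size of an siRNA effect from its population SSMD value.

    Boundaries follow the criterion quoted in the text; note that the three
    weakest classes use a strict lower bound (e.g. |beta| = 0.5 is 'very weak').
    """
    a = abs(beta)
    if a >= 5:
        return "extremely strong"
    if a >= 3:
        return "very strong"
    if a >= 2:
        return "strong"
    if a >= 1.645:
        return "fairly strong"
    if a >= 1.28:
        return "moderate"
    if a >= 1:
        return "fairly moderate"
    if a >= 0.75:
        return "fairly weak"
    if a > 0.5:
        return "weak"
    if a > 0.25:
        return "very weak"
    return "extremely weak"


# The cutoffs suggested in the text for the downregulated direction
print(classify_ssmd(-3.0))    # a candidate beta_1
print(classify_ssmd(-0.25))   # a candidate beta_2
```

Under this reading, β₁ = −3 sits at the lower edge of the "very strong" class and β₂ = −0.25 at the upper edge of the "extremely weak" class.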

Traditionally, the FDR is defined based on Table 1. Consider the problem of simultaneously testing m null hypotheses, of which m₀ are noninteresting (i.e., β ≥ β₀ if using SSMD or µ ≥ µ₀ if using mean difference) when the interest is in downregulation. Based on Table 1B, FDR is the expectation of the total number of false positives V divided by the total number of significant tests (i.e., discoveries) R, namely, $E (\frac{V}{R})$; FNDR is the expectation of the total number of false negatives T divided by the total number of nonsignificant tests (i.e., nondiscoveries) m − R, namely, $E (\frac{T}{m - R})$ (Table 1C). By contrast, FPR is the expectation of V divided by the total number of tests m, namely, $E (\frac{V}{m})$; FNR is the expectation of T divided by m, namely, $E (\frac{T}{m})$ (Table 1C).

Based on the definitions of false positives and false negatives shown in Table 2 , we still have the same formats for FPR, FNR, FDR, and FNDR—namely, $FDR = E (\frac{V}{R})$ , $FNDR = E (\frac{T}{m - R})$ , $FPR = E (\frac{V}{m})$ , and $FNR = E (\frac{T}{m})$ , as in Table 1 . However, the U, V, T, and S in Table 2B are obtained using different criteria from those in Table 1B . They are labeled in such a way that the definitions of FDR and FNDR based on Table 2C have the same formats as those based on Table 1C .
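To make the bookkeeping of Table 2B concrete, the sketch below counts V, T, and R for a toy set of siRNAs and reports the realized proportions V/R, T/(m − R), V/m, and T/m, of which FDR, FNDR, FPR, and FNR are the expectations. It assumes the population SSMD values are known, which holds only in a simulation. The function name, the toy data, and the convention of returning 0 when a denominator is empty are all assumptions made for illustration.

```python
def error_proportions(true_beta, is_hit, beta1=-3.0, beta2=-0.25):
    """Realized error proportions for downregulation under Table 2B.

    true_beta: population SSMD of each siRNA (known only in simulation)
    is_hit:    True where the siRNA was declared a hit
    """
    m = len(true_beta)
    R = sum(is_hit)  # number of declared hits
    # V: noninteresting siRNAs (beta >= beta2) that were declared hits
    V = sum(1 for b, h in zip(true_beta, is_hit) if h and b >= beta2)
    # T: interesting siRNAs (beta <= beta1) that were declared nonhits
    T = sum(1 for b, h in zip(true_beta, is_hit) if not h and b <= beta1)
    return {
        "V/R": V / R if R else 0.0,
        "T/(m-R)": T / (m - R) if m > R else 0.0,
        "V/m": V / m,
        "T/m": T / m,
    }


true_beta = [-4.0, -3.5, -1.0, -0.1, 0.0, -3.2, 0.2, -0.5]
is_hit = [True, True, True, True, False, False, False, False]
props = error_proportions(true_beta, is_hit)
print(props)
```

In this toy example, the hit with β = −1.0 and the nonhit with β = −0.5 fall in the tolerable region, so neither counts as a false positive or a false negative.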

There are other concepts of FDR. One of them is positive FDR, $pFDR = E (\frac{V}{R} | R > 0)$, defined as the expectation of $\frac{V}{R}$ conditional on at least 1 rejection.⁹ Another is conditional FDR, $cFDR = E (V | R = r) / r$, defined as the expected proportion of false positives conditional on the event that r rejections have been observed, which answers the question, “What proportion of false positives may one expect in the top list of r siRNAs?” A less used one is the marginal FDR, mFDR = E(V)/E(R), defined as the ratio of the expected number of false positives to the expected number of rejections. Tsai et al.¹⁰ prove that pFDR, cFDR, and mFDR are all equivalent to one another under independence and identical distribution in a Bayesian setting.

p-Value, p*-Value, q-Value, and q*-Value

The control of FPR and FDR is usually based on p-value and q-value. The well-known q-value is a term defined similarly to p-value. The q-value is defined on FDR, whereas p-value is defined on FPR. Considering hit selection in the downregulated direction, the p-value with respect to β₂ of an siRNA with an observed value β_obs is p-value(β_obs) = max{FPR with respect to β₂} = max{Pr(β̂ ≤ β_obs|β ≥ β₂)}. Similarly, the q-value is defined as q-value(β_obs) = max{FDR with respect to β₂}. In terms of p_i (i.e., the p-value for the ith investigated siRNA), the q-value is $q\text{-value} (p_{i}) = \max_{γ \leq p_{i}} \{FDR (γ) \text{ with respect to } β_{2}\}$. When the FDR is nondecreasing in γ, as it should be, then q-value(p_i) = FDR(p_i). The q-value has the following meaning for an individual siRNA: the q-value (with respect to β₂) of a particular siRNA with an observed value β_obs is the maximum FDR if we use the following selection criterion: any siRNA is selected as a hit if it has the estimated SSMD value no more than β_obs and as a nonhit otherwise.

In traditional hypothesis testing, people care about FNDR less than about FDR. However, in RNAi screens, the FNDR with respect to β₁ can be as important as, if not more important than, the FDR with respect to β₂, and FNR can be as important as FPR. Similar to p-value being defined upon FPR, we have p*-value being defined upon FNR as follows: for hit selection in downregulation, the p*-value (with respect to β₁) of an siRNA with β_obs is p*-value(β_obs) = max{FNR with respect to β₁} = max{Pr(β̂ > β_obs|β ≤ β₁)}. In the context of FDR, corresponding to p-value, we have q-value. In parallel, in the context of FNDR, corresponding to p*-value, we have q*-value. That is, the q*-value is defined as q*-value(β_obs) = max{FNDR with respect to β₁}. The q*-value has the following meaning for an individual siRNA: the q*-value (with respect to β₁) of a particular siRNA with an observed value β_obs is the maximum FNDR if we use the following selection criterion: any siRNA is selected as a hit if it has the estimated SSMD value no more than β_obs and as a nonhit otherwise.

There are an impressive number of algorithms for estimating/controlling FDR in the literature.⁹,¹¹⁻¹³ One popular algorithm is the Benjamini-Hochberg (BH) procedure.¹¹ The FDR calculated using the BH procedure is conservative.⁹,¹¹,¹³ After obtaining the p-value using the formulas described in the following section, we can use existing R packages (e.g., qvalue,⁹ multtest,¹¹ or fdrtool¹³) to calculate the q-value with respect to β ≥ β₂ (or µ ≥ µ₂). To calculate q*-value, we can treat p*-value as the p-value for testing H₀: β ≤ β₁ (or H₀: µ ≤ µ₁) and calculate the corresponding q-value; the resulting q-value equals the q*-value with respect to β ≤ β₁ (or µ ≤ µ₁).
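As a dependency-free illustration of this step, the Python sketch below computes BH-adjusted p-values and applies the same routine to p-values and p*-values. This is the conservative BH estimate, not the estimator implemented in the qvalue or fdrtool packages; it is intended to correspond to what `p.adjust(p, method = "BH")` returns in R. The function name `bh_adjust` and the toy inputs are hypothetical.

```python
def bh_adjust(pvals):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])  # indices by ascending p
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so the adjusted values are monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


p_values = [0.01, 0.04, 0.03, 0.20]       # toy p-values (w.r.t. beta_2)
p_star_values = [0.90, 0.30, 0.45, 0.02]  # toy p*-values (w.r.t. beta_1)

q_values = bh_adjust(p_values)            # conservative q-value estimates
q_star_values = bh_adjust(p_star_values)  # p*-values treated as p-values
print(q_values)
print(q_star_values)
```

The second call mirrors the recipe in the text: the p*-values are passed through exactly the same q-value routine, and the result is read as q*-values.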

Analogously to the above methods for the downregulated direction, we can derive the p-value, q-value, p*-value, and q*-value for hit selection in the upregulated direction. The calculation of q-value and q*-value is usually based on the calculation of p-value and p*-value for a single test for an individual siRNA. The formulas for calculating p-value and p*-value are displayed in Table 3. Table 3 also contains simple R code for calculating the corresponding q-value and q*-value from p-value and p*-value, respectively. The following sections show how the formulas in Table 3 are derived.

Table 3.

Calculation of p-Value, p*-Value, q-Value, and q*-Value

Direction	SSMD-Based Method	Mean Difference−Based Method
Downregulation	$p\text{-value} = F_{t (n - 1, \sqrt{n} β_{2})} (\frac{β_{obs}}{k})$	$p\text{-value} = F_{t (n - 1)} (\frac{\sqrt{n} (µ_{obs} - µ_{2})}{s_{D}})$
	$p^{*}\text{-value} = 1 - F_{t (n - 1, \sqrt{n} β_{1})} (\frac{β_{obs}}{k})$	$p^{*}\text{-value} = 1 - F_{t (n - 1)} (\frac{\sqrt{n} (µ_{obs} - µ_{1})}{s_{D}})$
Upregulation	$p\text{-value} = 1 - F_{t (n - 1, \sqrt{n} β_{2})} (\frac{β_{obs}}{k})$	$p\text{-value} = 1 - F_{t (n - 1)} (\frac{\sqrt{n} (µ_{obs} - µ_{2})}{s_{D}})$
	$p^{*}\text{-value} = F_{t (n - 1, \sqrt{n} β_{1})} (\frac{β_{obs}}{k})$	$p^{*}\text{-value} = F_{t (n - 1)} (\frac{\sqrt{n} (µ_{obs} - µ_{1})}{s_{D}})$

The q-values and q*-values can be obtained from p-values and p*-values for all investigated siRNAs using existing R packages such as fdrtool and qvalue as follows.

Using fdrtool, qvalues = fdrtool(pvalues, statistic = "pvalue")$qval; qSTARvalues = fdrtool(pSTARvalues, statistic = "pvalue")$qval

Using qvalue, qvalues = qvalue(pvalues)$qvalues; qSTARvalues = qvalue(pSTARvalues)$qvalues

$β_{obs} = \sqrt{\frac{K}{n - 1}} \frac{\bar{D}}{s_{D}}$ and µ_obs = D̄, where D̄ and s_D are, respectively, the sample mean and sample standard deviation of the difference between an siRNA and a negative reference; n is the number of replicates; $K = 2 \cdot {(\frac{Γ (\frac{n - 1}{2})}{Γ (\frac{n - 2}{2})})}^{2}$, where Γ(·) is the gamma function; $k = \sqrt{\frac{K}{n (n - 1)}}$. $t (n - 1, \sqrt{n} β)$ is a noncentral t-distribution with n − 1 degrees of freedom and noncentrality parameter $\sqrt{n} β$, t(n − 1) is a central t-distribution with n − 1 degrees of freedom, and $F_{t (n - 1, \sqrt{n} β)} (\cdot)$ and $F_{t (n - 1)} (\cdot)$ are the cumulative distribution functions of $t (n - 1, \sqrt{n} β)$ and t(n − 1), respectively. β₁ and β₂ are population values of strictly standardized mean difference (SSMD) that indicate large and small effects, respectively; µ₁ and µ₂ are the corresponding population values of mean difference.

Calculation of SSMD-Based p-Value and p*-Value

Suppose we observe n pairs of samples, $(Y_{11}, Y_{21}), (Y_{12}, Y_{22}), \ldots, (Y_{1n}, Y_{2n})$, from groups G₁ and G₂, respectively. In a confirmatory or primary screen with replicates, groups G₁ and G₂ represent an investigated siRNA and the negative reference, respectively, and n is the number of replicates. Let D_j be the difference between the jth pair of samples, namely, $D_{j} = Y_{1j} - Y_{2j}$. Let D̄ and s_D be the sample mean and sample standard deviation of D, respectively, namely, $\bar{D} = \frac{1}{n} \sum_{j = 1}^{n} D_{j}$ and $s_{D}^{2} = \frac{1}{n - 1} \sum_{j = 1}^{n} {(D_{j} - \bar{D})}^{2}$.

The method-of-moments estimate of SSMD is β̂ = D̄/s_D. Assume that D is normally distributed, namely, $D \sim N (µ_{D}, σ_{D}^{2})$. Then the uniformly minimum variance unbiased estimate (UMVUE) of SSMD is as follows³:

$\hat{β} = \sqrt{\frac{K}{n - 1}} \frac{\bar{D}}{s_{D}}$, where $K = 2 \cdot {(\frac{Γ (\frac{n - 1}{2})}{Γ (\frac{n - 2}{2})})}^{2}$, (1)

and we have the following noncentral t-distribution,

$T = \frac{\sqrt{n} \bar{D}}{s_{D}} \sim \text{noncentral } t (n - 1, \sqrt{n} β)$. (2)
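Formulas (1) and (2) are linked by a simple identity: dividing the UMVUE by $k = \sqrt{K / (n (n - 1))}$ recovers the statistic T. The Python sketch below computes both from a vector of paired differences and checks the identity numerically. The function name `ssmd_umvue` and the four toy differences are hypothetical; the gamma ratio is evaluated through `math.lgamma`, a numerical convenience that is not part of the original formulas.

```python
import math
import statistics


def ssmd_umvue(diffs):
    """Return (beta_hat, k, t_stat) for paired differences D_1..D_n.

    beta_hat follows formula (1); t_stat is T in formula (2).
    Requires n >= 3 so that Gamma((n - 2) / 2) is defined.
    """
    n = len(diffs)
    if n < 3:
        raise ValueError("at least 3 replicates are needed")
    d_bar = statistics.mean(diffs)
    s_d = statistics.stdev(diffs)  # sample SD with n - 1 in the denominator
    # K = 2 * (Gamma((n-1)/2) / Gamma((n-2)/2))^2, computed on the log scale
    K = 2.0 * math.exp(2.0 * (math.lgamma((n - 1) / 2) - math.lgamma((n - 2) / 2)))
    beta_hat = math.sqrt(K / (n - 1)) * d_bar / s_d
    k = math.sqrt(K / (n * (n - 1)))
    t_stat = math.sqrt(n) * d_bar / s_d
    return beta_hat, k, t_stat


beta_hat, k, t_stat = ssmd_umvue([-2.0, -1.0, -1.5, -2.5])
print(beta_hat, beta_hat / k, t_stat)  # the last two values should agree
```

Because β̂/k equals T, a probability statement about β̂ can be evaluated directly with the noncentral t-distribution of T.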

When selecting downregulated hits, for an siRNA with an observed value β_obs of SSMD, given the true value of SSMD no less than a small value β₂, p-value is defined as the maximum probability of selecting this siRNA as a hit if we use the following selection criterion: any siRNA is selected as a hit if it has the estimated SSMD value no more than β_obs and as a nonhit otherwise. That is, p-value is the maximum of Pr(β̂ ≤ β_obs | β ≥ β₂)—namely, p-value = Pr(β̂ ≤ β_obs | β = β₂). Based on the noncentral t-distribution of T in formula (2),

$p\text{-value} = F_{t (ν, b β_{2})} (\frac{β_{obs}}{k})$, (3)

where t(ν, bβ) is a noncentral t-distribution with ν degrees of freedom and noncentrality parameter bβ, F_{t(ν, bβ)}(·) is the cumulative distribution function of t(ν, bβ), and $k = \sqrt{\frac{K}{n (n - 1)}}$, $b = \sqrt{n}$, ν = n − 1. The above p-value corresponds to FPR with respect to β₂. In parallel, for convenience, we can define p*-value with respect to β₁ (β₁ < β₂ ≤ 0) as the maximum of Pr(β̂ > β_obs|β ≤ β₁), namely, p*-value = Pr(β̂ > β_obs|β = β₁), which corresponds to FNR with respect to β₁. Based on the noncentral t-distribution of T in formula (2),

$p^{*}\text{-value} = 1 - F_{t (ν, b β_{1})} (\frac{β_{obs}}{k})$. (4)

The values of β₂ may be 0 or −0.25, and the values of β₁ may be −1.645, −2, −3, or −5.²,⁸
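Formulas (3) and (4) can be evaluated numerically as sketched below in Python. To keep the example free of third-party dependencies, the noncentral t cumulative distribution function is approximated by Simpson integration of a standard mixture representation; this helper (`nct_cdf`) is an illustrative stand-in, and in practice a library routine such as `pt(q, df, ncp)` in R or `scipy.stats.nct.cdf` in Python would normally be used. The function names, the default cutoffs β₂ = −0.25 and β₁ = −3, and the toy differences are assumptions.

```python
import math
import statistics


def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def nct_cdf(t, df, nc, steps=4000):
    """Approximate P(T <= t) for T ~ noncentral t(df, nc).

    Uses T = (Z + nc) / (U / sqrt(df)), Z standard normal, U ~ chi(df), so
    P(T <= t) = E[Phi(t * U / sqrt(df) - nc)], integrated by Simpson's rule.
    """
    log_c = (1.0 - df / 2.0) * math.log(2.0) - math.lgamma(df / 2.0)
    upper = math.sqrt(df) + 12.0  # chi(df) mass beyond this point is negligible
    h = upper / steps
    total = 0.0
    for i in range(steps + 1):
        u = i * h
        density = math.exp(log_c - 0.5 * u * u) * u ** (df - 1)  # chi density
        weight = 1 if i in (0, steps) else (4 if i % 2 else 2)
        total += weight * norm_cdf(t * u / math.sqrt(df) - nc) * density
    return min(1.0, max(0.0, total * h / 3.0))


def ssmd_p_values_down(diffs, beta2=-0.25, beta1=-3.0):
    """p-value (formula 3) and p*-value (formula 4), downregulated direction."""
    n = len(diffs)
    # beta_obs / k equals sqrt(n) * mean(D) / s_D, the statistic T of formula (2)
    t_obs = math.sqrt(n) * statistics.mean(diffs) / statistics.stdev(diffs)
    p = nct_cdf(t_obs, n - 1, math.sqrt(n) * beta2)
    p_star = 1.0 - nct_cdf(t_obs, n - 1, math.sqrt(n) * beta1)
    return p, p_star


p, p_star = ssmd_p_values_down([-2.0, -1.0, -1.5, -2.5])
print(f"p-value = {p:.4f}, p*-value = {p_star:.4f}")
```

For the upregulated direction, formulas (5) and (6) swap the roles of the lower and upper tails, so the same helper applies with `1.0 - nct_cdf(...)` for the p-value and `nct_cdf(...)` for the p*-value.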

Similarly, when selecting upregulated hits (i.e., siRNAs with large positive value), for an siRNA with an observed value β_obs of SSMD, the p-value with respect to β₂ is the maximum of Pr(β̂ ≥ β_obs|β ≤ β₂)—namely, p-value = Pr(β̂ ≥ β_obs|β = β₂); thus,

$p\text{-value} = 1 - F_{t (ν, b β_{2})} (\frac{β_{obs}}{k})$. (5)

And the p*-value with respect to β₁ (β₁ > β₂ ≥ 0) is

$p^{*}\text{-value} = F_{t (ν, b β_{1})} (\frac{β_{obs}}{k})$. (6)

The values of β₂ may be 0 or 0.25, and the values of β₁ may be 3 or 5.²,⁸ The left panel of Table 3 shows formulas (3) to (6) with ν and b expressed in terms of n.

Calculation of Mean Difference − Based p-Value and p*-Value

When D is normally distributed, the estimate of mean difference µ (equivalently mean fold change in log scale) is µ̂ = D̄, and we have the following central t-distribution:

$T = \frac{\sqrt{n} (\bar{D} - µ)}{s_{D}} \sim \text{central } t (n - 1)$. (7)

In the downregulated direction, for an siRNA with an observed value µ_obs of mean difference, the p-value is the maximum of Pr(T ≤ T_obs|µ ≥ µ₂), namely, p-value = Pr(T ≤ T_obs|µ = µ₂). Based on the central t-distribution of T in formula (7),

$p\text{-value} = F_{t (ν)} (\frac{\sqrt{n} (µ_{obs} - µ_{2})}{s_{D}})$, (8)

where t(ν) is a central t-distribution with ν degrees of freedom, F_{t(ν)}(·) is the cumulative distribution function of t(ν), and ν = n − 1. The above p-value corresponds to FPR with respect to µ₂. Similarly, the p*-value with respect to µ₁ (µ₁ < µ₂ ≤ 0) is p*-value = Pr(T > T_obs|µ = µ₁), which corresponds to FNR with respect to µ₁. Based on the central t-distribution of T in formula (7),

$p^{*}\text{-value} = 1 - F_{t (ν)} (\frac{\sqrt{n} (µ_{obs} - µ_{1})}{s_{D}})$. (9)

There is no theoretical basis for choosing the values of µ₁ or µ₂. Some potential values of µ₁ may be ½-fold or ⅓-fold in log scale. Some potential values of µ₂ may be 1-fold, $\frac{1}{1.1}$-fold, or $\frac{1}{1.2}$-fold in log scale.
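A sketch of formulas (8) and (9) in Python follows. It assumes the differences are on a log2 scale, so that ½-fold corresponds to µ₁ = log2(1/2) = −1 and 1/1.2-fold to µ₂ = log2(1/1.2); the base of the logarithm is not specified in the text, so this is an assumption, as are the function names and the toy data. The central t cumulative distribution function is approximated by Simpson integration to avoid third-party dependencies; a library routine such as `pt(q, df)` in R would normally be used instead.

```python
import math
import statistics


def norm_cdf(z):
    """Standard normal cumulative distribution function."""
    return 0.5 * math.erfc(-z / math.sqrt(2.0))


def t_cdf(t, df, steps=4000):
    """Approximate central t CDF: P(T <= t) = E[Phi(t * U / sqrt(df))], U ~ chi(df)."""
    log_c = (1.0 - df / 2.0) * math.log(2.0) - math.lgamma(df / 2.0)
    upper = math.sqrt(df) + 12.0  # chi(df) mass beyond this point is negligible
    h = upper / steps
    total = 0.0
    for i in range(steps + 1):
        u = i * h
        density = math.exp(log_c - 0.5 * u * u) * u ** (df - 1)  # chi density
        weight = 1 if i in (0, steps) else (4 if i % 2 else 2)
        total += weight * norm_cdf(t * u / math.sqrt(df)) * density
    return min(1.0, max(0.0, total * h / 3.0))


def mean_diff_p_values_down(diffs, mu2, mu1):
    """p-value (formula 8) and p*-value (formula 9), downregulated direction."""
    n = len(diffs)
    mu_obs = statistics.mean(diffs)
    scale = math.sqrt(n) / statistics.stdev(diffs)
    p = t_cdf(scale * (mu_obs - mu2), n - 1)
    p_star = 1.0 - t_cdf(scale * (mu_obs - mu1), n - 1)
    return p, p_star


mu2 = math.log2(1 / 1.2)  # assumed log2 scale: 1/1.2-fold
mu1 = math.log2(1 / 2)    # assumed log2 scale: 1/2-fold
p, p_star = mean_diff_p_values_down([-2.0, -1.0, -1.5, -2.5], mu2, mu1)
print(f"p-value = {p:.4f}, p*-value = {p_star:.4f}")
```

Unlike the SSMD-based version, the cutoffs here carry the measurement units of the assay, which is why the text notes that no generally applicable values of µ₁ and µ₂ exist.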

Similarly, when selecting upregulated hits, for an siRNA with an observed value µ_obs of mean difference, the p-value is the maximum of Pr(T ≥ T_obs|µ ≤ µ₂), namely, p-value = Pr(T ≥ T_obs|µ = µ₂). Based on the central t-distribution of T in formula (7),

$p\text{-value} = 1 - F_{t (ν)} (\frac{\sqrt{n} (µ_{obs} - µ_{2})}{s_{D}})$. (10)

And the p*-value with respect to µ₁ (µ₁ > µ₂ ≥ 0) is

$p^{*}\text{-value} = F_{t (ν)} (\frac{\sqrt{n} (µ_{obs} - µ_{1})}{s_{D}})$. (11)

There is no theoretical basis for choosing the values of µ₁ and µ₂. Some potential values of µ₁ may be 2- or 3-fold in log scale. Some potential values of µ₂ may be 1-, 1.1-, or 1.2-fold in log scale. The right panel of Table 3 shows formulas (8) to (11) with ν expressed in terms of n.

Discussion

In most genome-scale RNAi screens, the ultimate goal is to select siRNAs with a large inhibition/activation effect. The hit selection usually requires statistical control of 2 errors: false positives and false negatives, which is commonly achieved through FPR or p-value, FDR or q-value,⁴,¹¹,¹⁴ FNR,³ and FNDR.¹⁵ In this article, I propose a new method for controlling FDR and FNDR, which applies 2 constants to define noninteresting, tolerable, and interesting siRNAs (shown in Table 2A) compared to a single constant in traditional methods (shown in Table 1A). I also recommend SSMD as a more sensible measurement than mean difference to calculate FDR and FNDR because SSMD takes into account data variability and effect size.

As shown in Tables 1 and 2, given a selection process for identifying hits, FPR is the expected proportion of all the true noninteresting siRNAs in a study population selected as hits, whereas FDR is the expected proportion of true noninteresting siRNAs among declared hits. The number of true noninteresting siRNAs in a study population is typically unknown, whereas the number of siRNAs declared as hits is known. Therefore, FPR is typically not verifiable and is of less interest in an RNAi screen than FDR, which is verifiable and of interest. Similarly, given a selection process for identifying hits, FNR is the expected proportion of all the true interesting siRNAs in a study selected as nonhits, whereas FNDR is the expected proportion of true interesting siRNAs among declared nonhits. The number of true interesting siRNAs in a study population is typically unknown, whereas the number of siRNAs declared as nonhits is known. Therefore, theoretically, FNDR is more tractable and addresses the question of interest in an RNAi screen better than FNR does. Consequently, the analytic methods for hit selection should focus on the control of FDR and FNDR rather than the control of FPR and FNR in genome-scale RNAi screens.

In addition, many current analytic methods control false positives and ignore the control of false negatives in high-throughput biotechnologies. In genome-scale RNAi screens, when one chooses a decision rule to select hits, the interest is the control of not only false positives but also false negatives. If one misses a true positive in the first round of screening, one would not have a chance to investigate it again in the follow-up research. Therefore, in some screens, especially primary screens, the control of false negatives is even more important than the control of false positives. In this article, I propose a new concept called q*-value to address FNDR and another concept called p*-value to address FNR.

From Tables 1 and 2 , when the 2 constants are equal, the proposed method is reduced to the traditional method. Therefore, the proposed method can be applied to the situations where traditional methods work, yet it also works in the situation where traditional methods are inappropriate when 2 different constants are required. Therefore, the proposed method works not only in RNAi screens but also in other screens; it is applicable to the screens with either a large or small portion of true hits.

The method in this article has been developed from a methodological perspective based on scientific needs in genome-scale RNAi screens. There is a need to demonstrate the practical usefulness of the proposed method in real genome-scale RNAi screens. In a follow-up article,⁵ my colleagues and I will report on the applications of the proposed method for hit selection in 2 in-house RNAi screens. Although the method presented in this article is developed for hit selection in RNAi-based high-throughput screens, it should be applicable to other assays in which the end point is a difference in signal compared to a reference sample, including those for receptor, enzyme, and cellular function.

Footnotes

Acknowledgements

The author thanks Drs. Holder, Soper, and Heyse for their support in this research.

Conflict of interest statement. XHD Zhang is an employee of Merck Research Laboratories.

References

1. Birmingham, Selfors, Forster, Wrobel, Kennedy, Shanks: Statistical methods for analysis of high-throughput RNA interference screens. Nat Methods 2009;6:569-575.

2. Zhang XHD: A new method with flexible and balanced control of false negatives and false positives for hit selection in RNA interference high-throughput screening assays. J Biomol Screen 2007;12:645-655.

3. Zhang XHD, Marine, Ferrer: Error rates and powers in genome-scale RNAi screens. J Biomol Screen 2009;14:230-238.

4. Storey, Tibshirani: Statistical significance for genomewide studies. Proc Natl Acad Sci U S A 2003;100:9440-9445.

5. Zhang XHD, Lacson, Yang, Marine, McCampbell, Toolan: The use of false discovery and false non-discovery rates for hit selection in genome-scale RNAi screens. J Biomol Screen. In press.

6. Kirk: Practical significance: a concept whose time has come. Educ Psychol Meas 1996;56:746-759.

7. Zhang XHD: Strictly standardized mean difference, standardized mean difference and classical t-test for the comparison of two groups. Stat Biopharm Res 2010;2:292-299.

8. Zhang XHD: A method for effectively comparing gene effects in multiple conditions in RNAi and expression-profiling research. Pharmacogenomics 2009;10:345-358.

9. Storey: A direct approach to false discovery rates. J R Stat Soc Ser B 2002;64:479-498.

10. Tsai, Hsueh, Chen: Estimation of false discovery rates in multiple testing: application to gene microarray data. Biometrics 2003;59:1071-1081.

11. Benjamini, Hochberg: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B 1995;57:289-300.

12. Efron: Large-scale simultaneous hypothesis testing: the choice of a null hypothesis. J Am Stat Assoc 2004;99:96-104.

13. Strimmer: A unified approach to false discovery rate estimation. BMC Bioinformatics 2008;9:303.

14. Zhang XHD, Kuan, Ferrer, Shu, Liu YCX, Gates: Hit selection with false discovery rate control in genome-scale RNAi screens. Nucleic Acids Res 2008;36:4667-4679.

15. Genovese, Wasserman: Operating characteristics and extensions of the false discovery rate procedure. J R Stat Soc Ser B 2002;64:499-517.

An Effective Method for Controlling False Discovery and False Nondiscovery Rates in Genome-Scale RNAi Screens
