An Economic Framework to Prioritize Confirmatory Tests after a High-Throughput Screen

Abstract

How many hits from a high-throughput screen should be sent for confirmatory experiments? Analytical answers to this question are derived from statistics alone and aim to fix, for example, the false discovery rate at a predetermined tolerance. These methods, however, neglect local economic context and consequently lead to irrational experimental strategies. In contrast, the authors argue that this question is essentially economic, not statistical, and is amenable to an economic analysis that admits an optimal solution. This solution, in turn, suggests a novel tool for deciding the number of hits to confirm and the marginal cost of discovery, which meaningfully quantifies the local economic trade-off between true and false positives, yielding an economically optimal experimental strategy. Validated with retrospective simulations and prospective experiments, this strategy identified 157 additional actives that had been erroneously labeled inactive in at least one real-world screening experiment.

Keywords

chemoinformatics fluorescence polarization (FP)data analysis software information management automation/robotics

Introduction

Deciding how many initial positives (“hits”) from a high-throughput screen to submit for confirmatory experiments is a basic task faced by all screeners.^1,2 Often, several hits are experimentally confirmed by ensuring they exhibit the characteristic dose-response behavior common to true actives. We introduce an economic framework for deciding the optimal number to send for confirmatory testing.

Many screeners arbitrarily pick the number of hits to experimentally confirm. There are, however, rigorous statistical methods for selecting the number of compounds to test by carefully considering each compound’s screening activity in relation to all compounds tested^3,4 or the distribution of negative (e.g., dimethyl sulfoxide treatment) controls.⁵ In either case, a p-value is associated with each screened compound, and a multitest correction is used to decide the number of hits to send for confirmatory testing. One of the oldest and most conservative of these corrections is the Bonferroni correction, a type of family-wise error rate (FWER) correction that tests the hypothesis that all hits are true actives with high probability.^6,7 More commonly, false discovery rate (FDR) methods are used to fix the confirmatory experiment’s failure rate at a predetermined proportion.^8,9 Although more principled than ad hoc methods, all multitest correction strategies require the screener to arbitrarily choose an error tolerance without much theoretical guidance.

These methods, which balance global statistical trade-offs based on arbitrarily chosen parameters, are not sensible in the context of high-throughput screening (HTS). In a 2-stage experiment—a large screen followed by fewer confirmatory experiments—global statements about the statistical confidence of hits based on the screen alone are less useful because confirmatory tests can establish the activity (or inactivity) of any hit with high certainty, are much less expensive than the total screening costs, and are required for publication. The experimental design itself implies an economic constraint; resource limitations are the only reason all compounds are not tested for dose-response behavior. We argue, therefore, that the real task facing a screener is better defined as an economic optimization rather than a statistical trade-off.

Tasks like this, deciding the optimal quantity of “goods” to “produce,” have a rich history and form the basis of economic optimization.¹⁰ This optimization aims to maximize the utility minus the cost: the surplus. One of the most important results from analyzing this optimization is that the globally optimal point at which to stop attempting to confirm additional hits can be located by iteratively quantifying whether the local trade-off between true positives and false positives is sensible. A local analysis admits a globally optimal solution. This result motivates our framework.

Methods

Our economic framework is defined by 3 modular components: (1) a utility model for different outcomes of dose-response experiments, (2) a cost model of obtaining the dose-response data for a particular molecule, and (3) a predictive model for forecasting how particular compounds will behave in the dose-response experiment. After each dose-response plate is tested, the predictive model is trained on the results of all confirmatory experiments performed thus far. This trained model is then used to compute the probability that each unconfirmed hit will be determined to be a true hit if tested. Using the utility model and cost model in conjunction with these probabilities, the contents of the next dose-response plate is selected to maximize the expected surplus (expected utility minus expected cost) of the new plate, and the plate is run only if its expected surplus is greater than zero.

This framework maps directly to an economic analysis. From basic calculus, it can be derived that the point at which a pair of supply and demand curves intersect corresponds to the point that maximizes total surplus. This point of intersection is also the last dose-response plate with a marginal surplus greater than zero. The utility model quantifies how the screener values different outcomes of the confirmatory study. This information is used to define a demand curve: the marginal utility as a function of the number of successfully confirmed actives so far. The cost model exposes the cost of obtaining a particular dose-response curve and is expressed in the same units as the utility. The predictive model forecasts the outcome of future confirmatory experiments and, together with the cost model, is used to define a supply curve: the marginal cost of discovery (MCD) as a function of the number of actives successfully confirmed as actives.

Explicitly computing the expected surplus to determine an optimal stop point is critically dependent on modeling a screener’s utility function with mathematical formulas, a daunting task that requires in-depth surveys or substantial historical data in addition to assumptions about the screener’s response to risk.^11-13 However, in practice, the expected surplus need not be explicitly computed to solve this problem. We can access the relevant summary of the screener’s preferences by repeatedly asking whether he or she is willing to pay the required amount to successfully confirm one more active. Iteratively exposing the MCD to the screener is like placing a price tag on the next unit of discovery, and interrogating the screener is equivalent to computing whether the marginal surplus is greater than zero. For illustration purposes, however, we do use mathematical utility functions to show that simple predictive models, which map screening activity to the dose-response outcome, can be constructed and these models admit near-optimal economic decisions. To our knowledge, this is the first time formal economic analysis has informed an HTS protocol.

High-throughput screen

The technical details of the assay and subsequent analysis can be found in PubChem (PubChem AIDs 1832 and 1833). For approximately 300,000 small molecules screened in duplicate, activity is defined as the mean of final, corrected percent inhibition. After molecules with autofluorescence and those without additional material readily available were filtered out, 1323 with at least one activity greater than 25% were labeled “hits” and tested for dose-response behavior in the first batch. Of these tested compounds, 839 yielded data consistent with inhibitory activity. Each hit was considered a “successfully confirmed active” if the effective concentration at half-maximal activity (EC₅₀) was less than or equal to 20 µM. Using this criterion, 411 molecules were determined to be active.

Utility model

In this study, 1 unit of discovery is defined as a single successfully confirmed active. The utility model is defined as a function of the number of confirmed actives obtained so far, n. We consider 2 utility models. First, in the fixed-return model (F), we fix the marginal utility, U′(n), for each confirmed hit at $50. Second, we consider a more realistic diminishing-returns model (D) where the marginal utility of the nth confirmed hit is $(max [502 – 2n, 0]). The utility function for each model can be derived by considering the boundary condition U(0) = $0, yielding U(n) = $(50n) for F and U(n) = $(501 min [n, 250] – min [n, 250]2) for D.

Cost model

We use a fixed-cost model where each plate costs the same amount of money, C(p(n))= $(400 p(n)), where p(n) is the total number of plates required to confirm actives. Allowing for 12 positive and 12 negative controls on each plate and measuring 6 concentrations in duplicate, 30 dose-response curves can be tested on a single 384-well plate; these parameters can easily be modified to suit other formats. We assume the first such plate tests the 30 hits with highest activity in the original screen. Subsequent plates are filled to maximize expected surplus as described in the Introduction.

Clearly, in many scenarios, it is not easy or meaningful to place a dollar value on the cost of doing a confirmatory experiment. In these cases, we recommend expressing the cost function in more general units: the number of confirmatory tests (NCTs) required. In our example, this is equivalent to using the cost model defined as C(p(n)) = (30 p(n)) NCT. This amounts to a negligible change in the downstream analysis, which only changes the y-axis of the resulting supply and demand curves without affecting their shapes. An example of how this strategy functions in practice is included in the Results section.

Predictive model

We consider 2 predictive models: a logistic regressor (LR)¹⁴ and a neural network with a single hidden node (NN1),¹⁵ structured to use the screen activity as the single independent variable and using the result of the associated confirmatory experiment as the single dependent variable. Both models were trained using gradient descent on the cross-entropy error. This protocol yields models with outputs that are interpretable as probabilities.

Ordering hits for confirmation

Any combination of either of the 2 predictive models (LR and NN1) with either of the 2 utility models (F and D) and a fixed-cost model will always test hits in order of their activity in the screen. The proof is as follows: the cost of screening a dose-response plate is fixed and therefore irrelevant to ordering hits. Maximizing the expected utility is exactly equivalent to maximizing the expected number of true hits in the plate. Both NN1 and LR are monotonic transforms of their input, so maximizing the expected number of confirmed hits in a plate is equivalent to selecting the hits with the highest activity. These schemes, therefore, order compounds by activity.

Experimental yield and marginal cost of discovery

Most experiments compute the expected MCD by dividing the cost of the next plate by the number of confirmed actives expected to be found within it, using either LR or NN1. The expected number of actives, the yield, is computed by summing the outputs of either LR or NN1 over all molecules tested in the plate. For the data, the MCD is computed by dividing the cost of each plate by the actual number of confirmed actives within it. When communicating the MCD to our screening team, we present the MCD as the number of confirmatory tests required to successfully confirm one more active instead of dollars. This cost is computed by dividing the number of confirmatory experiments proposed (462 in this case) by the forecast experimental yield.

Results

Relationship between potency and screen activity

For illustration, we use the data from a florescence polarization screen for binding inhibitors of the Caenorhabditis elegans protein MEX-5 to the tcr2 element of its cognate target, glp-1.¹⁶ Additional details are included in the Methods section. As would be expected in any successful screen, potency and screening activity are correlated ( Fig. 1a ; n = 839 and R = 0.3656 with p < 0.0001). Similarly, actives and inactives are associated with different distributions that are not separable, overlapping substantially ( Fig. 1b ; actives having n = 411 with 47.60 mean and 20.48 SD compared with inactives having n = 912 with 31.70 mean and 16.80 SD).

Fig. 1.

Relationship between potency and activity. (a) A scatter plot demonstrating the relationship between a small molecule’s confirmed potency and activity in the initial screen. (b) The distribution of molecules that were verified either active or inactive by dose-response experiments.

Feasibility of forecasting experimental yield

One critical test of our method’s feasibility is to confirm that simple predictive models can accurately forecast the number of successfully confirmed actives in future experiments. The predictive model must extrapolate predictions on data outside the range of the training inputs. We expect that the simplest models, such as LR or NN1, will extrapolate well—perhaps better than more complex models—because they will not easily overfit the data. As expected from a model with more parameters, NN1 fits the data more closely than LR ( Fig. 2 ), yielding an average residual of 2.35 compared with LR’s average residual of 3.07. Certainly, there is systematic error in LR’s predictions that could possibly be corrected. Nonetheless, both models yield curves that appear to be moving averages of the data, showing that simple models do seem to enable reasonable forecasts.

Fig. 2.

Forecast confirmation rate. The observed and predicted (LR and NN1) number of actives for each plate. Each prediction is based on the results of all plates with lower numbers.

Prospective supply curve

We can also evaluate this framework by considering the shape of forecast supply curves ( Fig. 3 ). Unsurprisingly, NN1 fits the data better than LR; a supply curve is the yield plotted in a particular manner. On the same graph, we can also plot demand curves derived from our 2 utility models. For each pair of supply and demand curves, their intersection is the predicted optimal stop point. This point is different for different demand curves, reflecting the differences among screeners’ budgets and projects’ value; this is exactly the behavior we hoped to observe.

Fig. 3.

Supply and demand. The predicted supply curves (LR and NN1) plotted with demand curves (F and D) derived from 2 utility models. For reference, the position of every 10th plate is denoted by vertical gridlines.

Economic optimality of stop points

Predicted stop points can be evaluated by comparing them to the optimal stop points, those empirically found to yield the highest surplus. The surplus varies with the number of compounds tested, has several nearly equivalent plates for F, and has a clearer peak for D ( Fig. 4 ). For F, the optimal stop point is either dose-response plate 19 or 27, yielding a surplus of $6,800. NN1 and LR indicate the screener should stop after plate 24 or 32, yielding surpluses of $6750 and $6250, respectively. For D, both NN1 and LR indicate the screener should stop after plate 15, yielding a surplus of $56,618, within 0.2% of the optimal stop point at plate 13, with a surplus of $56,738. The MCD succeeds in prospectively finding a nearly optimal solution.

Fig. 4.

Surplus as a function of stop point for both the (a) F and (b) D utility models, comparing observations with predictions (LR and NN1).

It is interesting to note that, with F, there are 2 optimal stop points, one at plate 19 and the other at plate 27. This is because stopping at either plate yields the same surplus and, from the point of view of a screener with this utility function, is therefore exchangeable. Geometrically, this occurs because the true supply and demand curves intersect in 2 places and because the surplus has 2 maxima. In practice, this scenario is highly unlikely to occur because real demand curves usually slope downward and supply curves usually slope upward, thereby defining a unique optimum. In this example, however, this observation highlights that methods should be evaluated not by their ability to optimize the specific stop plate number but to optimize the computed surplus at this plate number.

Prospective validation

We further validated our methodology by presenting the MCD of the next batch of compounds to the team running the screening experiment. The MCD was presented in terms of NCT in this scenario, amounting to a simple change of units.

Predictive models were trained on the 1323 dose-response experiments performed so far; these models were then used to forecast the number of successfully confirmed actives we would expect to find by testing the next batch of compounds. The next batch of 462 compounds ranged from 19.52 to 100.25 screen activity, over a larger window than expected because 213 of the best hits became available, although they were initially unavailable when compounds for the first batch of confirmation experiments were being ordered. This batch of confirmations was predicted by LR and NN1 to yield 126.1 and 116.2 confirmed actives corresponding with an MCD of 3.66 and 3.98, respectively.

As a result of this analysis, which surprised the scientists involved, the screening team decided to obtain and test for dose-response behavior an additional batch of compounds. Of these 157 compounds, which were previously assumed to be inactive, were found to be active with about 30% higher yield than predicted. This analysis proved useful to the team running the HTS experiment because they had capacity to test additional compounds and did not expect that additional actives could be retrieved at this yield. Although not as accurate as forecasts on smaller batches, these forecasts provided enough information to appropriately justify testing additional compounds. In this experiment, therefore, the MCD provided the rationale to discover 157 additional actives that would not have otherwise been discovered. We suspect this scenario is not atypical. In HTS experiments, the cost of testing hits is low compared to the overall cost of the screen. In many cases, screeners might send more compounds for confirmatory testing with our analysis.

Comparison with FDR

It is tempting to directly compare statistical methods of error control with MCD. This comparison, however, is not straightforward because there is no correspondence between global and local error rates. However, in the context of fixed-assay cost, our fixed-utility model, F, does imply a particular FDR cutoff. A maximum FDR of 0.73 would fix the average cost of confirming a hit at $50, the marginal utility for F, and also fixes the total surplus at zero. The FDR of all dose-response plates for this example is 0.6893, below the maximum allowable FDR threshold. At least 7 additional plates of confirmatory experiments would be sent even after the screener was paying well over $50 for each additional confirmation. Confirming additional actives when their value is less than their cost, as FDR would propose, is economically irrational. Using FDR in this manner corresponds to a monopolistic supplier capturing profit at the expense of consumers by only selling its product in bundles far larger than consumers would be willing to purchase one by one: a well-known case from microeconomic analysis.¹⁷ Clearly, MCD will approximate an economically optimal solution, whereas FDR will not.

Comparison with local FDR

An uncommonly used statistical method attempts to estimate the “local” FDR from the p-values of each hit with respect to the distribution of the negative controls.^18,19 Local FDR is defined as the proportion of false positives at a particular point in the prioritized list and can function as the predictive model in our framework. In the case of F, selecting the stop point at which the local FDR is 0.7333 is exactly equivalent to the economic solution we have derived. In the case of the more realistic D model, local FDR does not admit a solution by itself. There is no a priori way of determining the economically optimal local FDR threshold without knowing the full utility function of the screener. So although local FDR succeeds in one simplified case, when used alone it is not capable of finding the economically optimal solution in realistic scenarios.

Discussion

The experiments presented here demonstrate the feasibility and power of our economic approach. For clarity, only the simplest variations have been presented. Several extensions are imaginable for each module: utility, cost, and predictive. These extensions explore how our economic analysis functions as a unified framework to quantitatively address several disparate concerns in HTS experiments.

First, more useful utility models can be derived that more accurately model the real value of screening results. As we have seen, the exact utility model can dramatically affect the economically optimal stop point, so this is a critical concern. Although the utility function of screeners remains challenging to mathematically specify, better models of the inputs into the utility function are potentially useful. For example, more meaningful definitions of a “unit of discovery” are important because they can suggest attempting to confirm hits in different orders and increase yield even when only a fixed number of hits are to be tested. This is equivalent to changing the units of the x-axis of the supply and demand curve graphs. One variation would be to define a diversity-oriented utility that measures discovery as the number of scaffolds with at least 1 confirmed active. Related models have been proposed in the cheminformatics literature.²⁰ Even more complex utility models might consider the potential of improving a downstream classifier.²¹ We favor a “clique”-oriented utility function that counts each clique of at least 3 or 4 confirmed actives with a shared scaffold as a single unit of discovery. This models the behavior of medicinal chemists who are often only willing to optimize a lead if other confirmed actives share its scaffold. We will explore the algorithm necessary for these variations in a future study.

Second, more complex cost models can be derived that could, for example, encode the extra overhead of running smaller batch sizes. Other variations could encode the acquisition cost or the intellectual property encumbrance of particular molecules. Furthermore, in some circumstances, it may be more useful to measure cost in units of time instead of money.

Third, better predictive models are possible, including neural networks with several hidden nodes, estimates of the local FDR computed from p-values,^18,19 analytic formulas based on mathematical properties of the system,²² and predictors that take chemical information into account.^23,24 Many of these methods, however, make strong assumptions about the underlying data. In contrast, our simple models make fewer assumptions and could, therefore, be more broadly applicable. Even within the structure of our simple models, there are some important improvements possible. For example, adding a well-designed Bayesian prior to the logistic regressor would enable it to work appropriately even when confirmatory experiments yield no true actives, an important boundary case often encountered in HTS.

In addition to continuing to admit economically optimal experimental strategies, these extensions expose several advantages of MCD and our economic analysis over p-value-based statistical methods designed to control the FDR or FWER. First, more complex predictive or utility models can reorder hits to improve the yield from confirmatory experiments, whereas statistical methods will always prioritize compounds in the order of their screen activity. Second, the supply curve traced by MCD can be fed into higher order models to direct more complex experimental designs, whereas statistical methods do not yield this information. Third, MCD does not require negative or positive controls or make normality assumptions, enabling its application to data sets where calculating accurate p-values is problematic.

Finally, the immediate application of the MCD to practical HTS scenarios is straightforward, even when adaptive HTS, with plate-by-plate confirmatory experiments, is not possible. At any point after confirmatory experiments have been run, the following protocol can be applied: (1) train a predictive model on the confirmatory experiments obtained so far, (2) use this model to pick the contents of the next batch of compounds to be tested, (3) compute the MCD of this batch in terms of NCT as the number of compounds in the next batch divided by the sum of the predictive model’s output for each of these compounds, and (4) obtain and test these compounds if the screener believes that this MCD is an acceptable price for more confirmed actives. This analysis can even be done on publicly available HTS data sets from PubChem to discover active compounds that have yet to be discovered.

We have outlined an economic framework—from which we have derived the MCD—that has applications throughout both science and medicine; prime candidates for its application include any experimental design in which a large screen is followed up with more costly confirmatory experiments. By encoding local cost structure, MCD provides a richer, more informative analysis than statistical methods alone can provide and yields a unified analytic framework to address several disparate tasks embedded in screening experiments.

Footnotes

Acknowledgements

We thank the Physician Scientist Training Program of the Washington University Pathology Department for supporting SJS; NEB and PAC are supported in part by NCI’s Initiative for Chemical Genetics (N01-CO-12400) and NIH Molecular Libraries Network (U54-HG005032), and SPR is supported by the NIH (NS-059380). We also thank Paul Swamidass and Richard Ahn for their helpful comments on the manuscript.

Author contributions: SJS conceived the ideas in this article, performed the virtual experiments, and wrote the initial manuscript. JAB and SPR performed the MEX-5 HTS experiment and the dose-response confirmations. NEB downloaded and pre-processed the data for analysis, and prepared manuscript figures. PAC provided project guidance, helpful comments, and substantial revisions to the manuscript.

Competing interests: The authors declare that they have no competing financial interests.

References

Storey

Dai

Leek

: The optimal discovery procedure for large-scale significance testing, with applications to comparative microarray experiments. Biostatistics 2007;8:414.

Rocke

: Design and analysis of experiments with high throughput biological assay data. Semin Cell Dev Biol 2004;15:703-713.

Zhang

Chung

Oldenburg

: Confirmation of primary active substances from high throughput screening of chemical and biological populations: a statistical approach and practical considerations. J Comb Chem 2000;2:258-265.

Brideau

Gunter

Pikounis

Liaw

: Improved statistical methods for hit selection in high-throughput screening. J Biomol Screen 2003;8:634-647.

Seiler

George

Happ

Bodycombe

Carrinski

Norton

: ChemBank: a small-molecule screening and cheminformatics resource database. Nucleic Acids Res 2008;36:D351.

van der Laan

Dudoit

Pollard

: Augmentation procedures for control of the generalized family-wise error rate and tail probabilities for the proportion of false positives. Stat Appl Genet Mol Biol 2004;3:1042.

van der Laan

Dudoit

Pollard

: Multiple testing: Part II. Step-down procedures for control of the family-wise error rate. Stat Appl Genet Mol Biol 2004;3:1041.

Benjamini

Hochberg

: Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Stat Soc Ser B 1995;57:289-300.

Reiner

Yekutieli

Benjamini

: Identifying differentially expressed genes using false discovery rate controlling procedures. Bioinformatics 2003;19:368-375.

10.

Varian

: Microeconomic Analysis. New York: Norton, 1992.

11.

Kroes

Sheldon

: Stated preference methods: an introduction. J Transport Econ Policy 1988;22:11-25.

12.

Houthakker

: Revealed preference and the utility function. Economica 1950;17:159-174.

13.

Schoemaker

: The expected utility model: its variants, purposes, evidence and limitations. J Econ Literature 1982;20:529-563.

14.

Dreiseitl

Ohno-Machado

: Logistic regression and artificial neural network classification models: a methodology review. J Biomed Inform 2002;35:352-359.

15.

Baldi

Brunak

: Bioinformatics: The Machine Learning Approach. Cambridge, MA: MIT Press, 2001.

16.

Pagano

Farley

McCoig

Ryder

: Molecular basis of RNA recognition by the embryonic polarity determinant MEX-5. J Biol Chem 2007;282:8883-8894.

17.

Adams

Yellen

: Commodity bundling and the burden of monopoly. Q J Econ 1976;90:475-498.

18.

Scheid

Spang

: twilight; a Bioconductor package for estimating the local false discovery rate. Bioinformatics 2005;21:2921-2922.

19.

Scheid

Spang

: A stochastic downhill search algorithm for estimating the local false discovery rate. IEEE Trans Comput Biol Bioinform 2004;1:98-108.

20.

Clark

Webster-Clark

: Managing bias in ROC curves. J Comput Aided Mol Design 2008;22:141-146.

21.

Danziger

Zeng

Wang

Brachmann

Lathrop

: Choosing where to look next in a mutation sequence space: active learning of informative p53 cancer rescue mutants. Bioinformatics 2007;23:i104.

22.

Van Leijenhorst

van der Weide

: A formal derivation of Heaps’ law. Inf Sci 2005;170:263-272.

23.

Swamidass

Azencott

Lin

Gramajo

Tsai

Baldi

: Influence relevance voting: an accurate and interpretable virtual high throughput screening method. J Chem Inform Model 2009;49:756-766.

24.

Posner

Mills

JEJ

: Enhanced HTS hit selection via a local hit rate analysis. J Chem Inf Model 2009;49:2202-2210.