Combined Analysis of Phenotypic and Target-Based Screening in Assay Networks

Abstract

Small-molecule screens are an integral part of drug discovery. Public domain data in PubChem alone represent more than 158 million measurements, 1.2 million molecules, and 4300 assays. We conducted a global analysis of these data, building a network of assays and connecting the assays if they shared nonpromiscuous active molecules. This network spans both phenotypic and target-based screens, recapitulates known biology, and identifies new polypharmacology. Phenotypic screens are extremely important for drug discovery, contributing to the discovery of a large proportion of new drugs. Connections between phenotypic and biochemical, target-based screens can suggest strategies for repurposing both small-molecule and biologic drugs. For example, a screen for molecules that prevent cell death from a mutated version of superoxide-dismutase is linked with ALOX15. This connection suggests a therapeutic role for ALOX15 inhibitors in amyotrophic lateral sclerosis. An interactive version of the network is available online (http://swami.wustl.edu/flow/assay_network.html).

Keywords

chemoinformatics, database and data management, phenotypic drug discovery, pharmacology: ligand binding, receptor binding, statistical analyses

Over the past 10 years, the academic drug discovery landscape has undergone several dramatic changes. One of the most important of these changes is the increase in publicly available data from the early stage of drug discovery. In particular, the availability of small-molecule screening data through PubChem¹—with data generated from efforts funded by the National Institutes of Health and some industry contributors—is exciting because, together, phenotypic and target-based small-molecule screens are responsible for 90% of recently approved, novel medicines with small-molecule active ingredients.²

One long-range goal of the academic screening enterprise is to develop novel treatments for human disease. The most advanced of these efforts has discovered a novel, orally available, selective, and potent sphingosine-1-phosphate receptor agonist. This molecule, designated RPC1063, induces lymphopenia in animal models, entered human trials in 2012, and is being investigated as a treatment for multiple sclerosis.³ Although this example is encouraging, public screening should be evaluated by more than its success in bringing drugs to market.

In addition to academics’ drug discovery projects, the public nature of their data has directly contributed to several advances in screening science. For example, several substructures common to promiscuously active molecules have been identified, placed in the public domain, and are now commonly filtered from hit lists.⁴ Likewise, public data enable the direct comparison of competing methods of selecting hits from primary screens.⁵ Similarly, they enable academics to propose and test improvements to screening experiment design.⁶

Nonetheless, public screening data remain largely untapped for several reasons. First, simple analysis is thwarted by systematic sources of error, such as promiscuously active molecules and other artifacts.^4,7 Second, most screens are not cleanly annotated, making it difficult to identify retrospectively interesting screens or place them in appropriate biological context.^8,9 Third, there is substantial ascertainment bias in screening data, because assays are often run on substantially different libraries. Fourth, previous efforts have focused on the target-based screens instead of phenotypic screens. In this analysis, we propose a new method for overcoming these barriers and incorporating phenotypic data by using a robust measure of similarity between assays to create a global view of all screening data.

We construct an Assay Network, which plots each screen as a node. Nodes are connected when they are strongly correlated. As we will see, the network elucidates the relationships between a large number of seemingly unrelated screens and organizes them into a global structure covering a large swath of studied chemical biology ( Fig. 1 ).

Figure 1.

Our approach generates a correlation between all the screens in a database. (Left) Each node represents a high-throughput screen using either a target-based (T#) or phenotypic (P#) assay. Screens are connected to one another if they share a significant number of nonpromiscuous active molecules. Target-based assays can be connected to known drugs, and phenotypic screens can correspond to diseases or indications. (Middle) In the first use case, drugs useful for a specific indication can be discovered by identifying a phenotypic screen (P4) relevant to the indication. The target-based screens most strongly connected to this phenotypic screen (T1 and T2) connect to drugs (either biologic or small molecule), which may be useful for treating the disease. (Right) In the second use case, new indications can be discovered by identifying an assay corresponding to the known target of a specific biologic or small-molecule drug (T5). The phenotypic screens most strongly connected to this target-based screen (P1 and P2) suggest new indications.

Materials and Methods

Data

In this analysis, we focus on the large screening data sets in PubChem—those that test at least 5000 molecules. This provides 1581 screens with a unique PubChem Assay ID and on average 89,631 molecules tested in each screen.

The data associated with each screen are a table, where each row corresponds to a molecule tested in the screen. Each molecule is associated with two values, the PubChem Outcome and the PubChem Score. The outcome labels molecules as active, inactive, or inconclusive. For our purposes, inconclusive molecules are considered inactive. The score assigns an integer value to each molecule that is supposed to correlate with the assay’s readout. There are on average 1006 actives in each screen.

Considering just these assays, several molecules are promiscuously active, meaning they are labeled as actives in a large fraction of the assays in which they are tested ( Fig. 2 ).

Figure 2.

Prevalence of promiscuity. This histogram plots, on a log scale, the distribution of molecules according to their promiscuity on the x axis. Molecules are binned by their promiscuity, the fraction of screens in which they are active. There are tens of thousands of molecules that are promiscuous and active in more than 10% of assays. Appropriate correlations should down-weight or ignore these promiscuous molecules because they do not convey biologically important information about the relationships between screens.
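The promiscuity defined above (the fraction of screens in which a molecule is active, among those in which it is tested) can be illustrated with a minimal Python sketch. The function name `promiscuity`, the dictionary layout, and the toy assay and molecule IDs are assumptions made for illustration; they are not taken from the original implementation.

```python
from collections import defaultdict


def promiscuity(screens):
    """Return {molecule: (tested_count, active_count, fraction_active)}.

    `screens` maps an assay ID to {molecule_id: outcome}, where outcome is
    "active", "inactive", or "inconclusive" (hypothetical layout).
    """
    tested = defaultdict(int)
    active = defaultdict(int)
    for outcomes in screens.values():
        for molecule, outcome in outcomes.items():
            tested[molecule] += 1
            # Inconclusive molecules are treated as inactive, as in the text.
            if outcome == "active":
                active[molecule] += 1
    return {m: (tested[m], active[m], active[m] / tested[m]) for m in tested}


# Toy data: three hypothetical assays and four hypothetical molecules.
toy_screens = {
    "assay_1": {"m1": "active", "m2": "inactive", "m3": "active"},
    "assay_2": {"m1": "active", "m2": "inconclusive", "m4": "active"},
    "assay_3": {"m1": "active", "m3": "inactive"},
}
stats = promiscuity(toy_screens)
```

In this toy example, `m1` is active in every screen in which it is tested, so it is the kind of molecule that an appropriate correlation should down-weight.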

Correlations and Networks

This study measures the similarity between two assays (X and Y). Two assays are similar when the same molecules are active in both assays. Central to this approach is the choice of an appropriate measure of correlation between screens. The idea here is that if two assays tend to give the same readout for a large number of molecules, then there is likely a strong relationship between them. Perhaps both assays are testing for inhibition of similar proteins or are interrogating closely related cellular pathways. We assess two ways of scoring the correlation between screens and also introduce an adjustment that improves both correlations. In all cases, we implemented these scores using the Python programming language.

Pearson Correlation

Some have proposed using Spearman and Pearson correlations to measure the similarity between screens,¹⁰ and we use them in this study as baseline methods against which to compare our work. In our results, Spearman and Pearson correlations produced nearly identical results. In the interest of clarity and brevity, we have included only results from Pearson correlation (PC). Using the activity of molecules in each screen, we can compute the PC between a pair of assays X and Y as

\mathrm{PC}(X, Y) = \sum_{i \in X \cap Y} \frac{(X_{i} - \mu_{X}) (Y_{i} - \mu_{Y})}{\sigma_{X} \sigma_{Y}},

where i iterates over all the molecules that are tested in both X and Y, X_i and Y_i are the activity of the ith molecule in each screen, µ_X and µ_Y are the mean activities of the molecules tested in both X and Y, and σ_X and σ_Y are standard deviations of activities of the molecules tested in both X and Y. The mean and standard deviation terms are computed over all the molecules in common between screens X and Y, so they will be different for every pair of assays. PC is undefined when there are no molecules in common between X and Y.
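As a hedged illustration, the pairwise computation could be sketched in Python as follows. The function name `pearson_shared` and the dictionary inputs are hypothetical. The sketch uses the standard form of the Pearson coefficient, in which the normalization by the number of shared molecules cancels between the covariance and the standard deviations. Returning `None` when fewer than two molecules are shared or when one screen has zero variance is an assumption that goes beyond the single undefined case named in the text.

```python
import math


def pearson_shared(x_scores, y_scores):
    """Pearson correlation over the molecules tested in both screens.

    `x_scores` and `y_scores` map molecule IDs to activity values.
    Returns None when the correlation is undefined.
    """
    shared = sorted(set(x_scores) & set(y_scores))
    n = len(shared)
    if n < 2:
        return None
    xs = [x_scores[m] for m in shared]
    ys = [y_scores[m] for m in shared]
    # Means are recomputed for every pair, over the shared molecules only.
    mu_x = sum(xs) / n
    mu_y = sum(ys) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(xs, ys))
    var_x = sum((a - mu_x) ** 2 for a in xs)
    var_y = sum((b - mu_y) ** 2 for b in ys)
    if var_x == 0 or var_y == 0:
        return None
    return cov / math.sqrt(var_x * var_y)


# Toy scores: molecules "d" and "e" are each tested in only one screen,
# so they do not contribute to the correlation.
x_scores = {"a": 1, "b": 2, "c": 3, "d": 10}
y_scores = {"a": 2, "b": 4, "c": 6, "e": 0}
r = pearson_shared(x_scores, y_scores)
```

Because the means and variances are taken over the shared molecules only, every pair of assays effectively gets its own normalization, as described above.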

Promiscuity-Adjusted Correlation

In place of the Pearson correlation, we propose the promiscuity-adjusted correlation (PAC). In contrast with related measures that others have proposed,^10–12 PAC accounts for both promiscuously active molecules and ascertainment bias. Furthermore, and in contrast with PC, the PAC focuses on the molecules annotated as actives by the screeners. Relying on the annotated activity focuses the correlation on the most salient information and dampens the influence of artifactual correlations between screens.

Our idea with PAC is to extend the term-frequency inverse-document-frequency (TF-IDF) score, which is frequently used to measure the similarity between text documents,¹³ so as to work with assay data.

Using the activity outcome of each molecule (whether it is labeled as active or inactive), we can compute the PAC between assays that down-weights the molecules that are active in several assays. We compute the promiscuity-adjusted weight of each molecule as

W_{i} = \frac{1}{P_{i}} = \frac{T_{i}}{A_{i}},

where P_i is the promiscuity of the ith molecule, T_i is the number of screens in which it is tested, and A_i is the number of screens in which it is labeled active. The PAC between two assays X and Y is defined as

\mathrm{PAC}(X, Y) = \sum_{i \in X_{A} \cap Y_{A}} \frac{W_{i}^{2}}{K_{XY}},

where i ranges over the molecules active in both X and Y (the active sets are designated X_A and Y_A), and K_XY is a normalization constant that scales PAC to range from 0 to 1. The normalization constant is specific to a pair of screens and is defined as

K_{XY} = \left( \left[ \sum_{i \in X_{A} \cap Y} W_{i}^{2} \right] \left[ \sum_{j \in X \cap Y_{A}} W_{j}^{2} \right] \right)^{\frac{1}{2}},

where i ranges over the molecules tested in Y (regardless of the outcome) and found to be active in X, whereas j ranges over the molecules tested in X (regardless of the outcome) and found to be active in Y. PAC is undefined when no active molecules from one assay have been tested in the other.
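The PAC definition can be sketched in Python under a few stated assumptions: each screen is a dictionary from molecule ID to a Boolean activity flag covering every molecule tested in that screen, and the weights W_i = T_i/A_i have been computed beforehand over the whole data set. The function name `pac`, the argument layout, and the toy values are hypothetical.

```python
import math


def pac(x, y, weights):
    """Promiscuity-adjusted correlation between two screens.

    `x` and `y` map molecule IDs to True (active) or False (inactive) for
    every molecule tested in that screen. `weights` maps molecule IDs to
    W_i = T_i / A_i. Returns None when PAC is undefined.
    """
    x_active = {m for m, is_active in x.items() if is_active}
    y_active = {m for m, is_active in y.items() if is_active}
    # Actives of one screen that were also tested in the other screen.
    x_active_in_y = x_active & set(y)
    y_active_in_x = y_active & set(x)
    if not x_active_in_y or not y_active_in_x:
        return None
    numerator = sum(weights[m] ** 2 for m in x_active & y_active)
    k_xy = math.sqrt(
        sum(weights[m] ** 2 for m in x_active_in_y)
        * sum(weights[m] ** 2 for m in y_active_in_x)
    )
    return numerator / k_xy


# Toy weights: m1 is promiscuous (W = 1), m2 is highly selective (W = 10).
weights = {"m1": 1.0, "m2": 10.0, "m3": 5.0, "m4": 2.0}
screen_x = {"m1": True, "m2": True, "m3": False, "m4": True}
screen_y = {"m1": True, "m2": True, "m3": True}
score = pac(screen_x, screen_y, weights)
```

In this toy case the shared selective molecule `m2` contributes 100 to the numerator while the promiscuous `m1` contributes only 1, which is the intended down-weighting. Molecule `m4` is not tested in `screen_y`, so it is excluded from the normalization rather than counted as a disagreement.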

Neighborhood Adjustment

As an additional filter, we propose using a neighborhood adjustment (NA). A few assays often dominate assay networks constructed using PC and PAC. The effect of these strongly connected assays is reduced by applying an NA to the pairwise correlations computed across the data set. The neighborhood-adjusted correlation between X and Y is defined as the percentile rank of the correlation between X and Y among its neighboring correlations. The neighboring correlations are all the correlations (excluding the undefined correlations) in the full pairwise comparison of the data that include either X or Y. If the adjusted correlation is 1, the unadjusted correlation is the highest among its neighbors. If the adjusted correlation is 0.7, then the unadjusted correlation is greater than 70% of its neighbors. In practice, a cutoff on this correlation can be tuned to generate graphs with the desired edge density, but for the purposes of this study, we use the top 700 edges. The NA can be applied to any correlation, and we use it to generate NA versions of both PC (abbreviated as NA-PC) and PAC (abbreviated as NA-PAC).
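A minimal Python sketch of the neighborhood adjustment follows. The exact percentile-rank convention is not specified in the text; this sketch assumes the rank is the fraction of neighboring correlations (including the pair itself) that are less than or equal to the pair's correlation, so that the highest correlation in a neighborhood maps to 1. The function name `neighborhood_adjust` and the `frozenset` keys are hypothetical choices.

```python
def neighborhood_adjust(correlations):
    """Map each pair's correlation to its percentile rank among neighbors.

    `correlations` maps frozenset({x, y}) to a float; undefined pairs are
    simply absent. Neighbors of (x, y) are all pairs that include x or y.
    """
    adjusted = {}
    for pair, value in correlations.items():
        neighbors = [v for other, v in correlations.items() if pair & other]
        # Assumed convention: fraction of neighbors <= this correlation.
        adjusted[pair] = sum(v <= value for v in neighbors) / len(neighbors)
    return adjusted


# Toy data: hypothetical hub H correlates strongly with everything,
# whereas the C-D correlation is weak in absolute terms.
toy_correlations = {
    frozenset({"H", "A"}): 0.9,
    frozenset({"H", "B"}): 0.8,
    frozenset({"H", "C"}): 0.7,
    frozenset({"C", "D"}): 0.3,
}
adjusted = neighborhood_adjust(toy_correlations)
```

In the toy data the hub edge H-C (0.7) and the weak edge C-D (0.3) receive the same adjusted score, which shows how the adjustment keeps a hub from dominating. The double loop is quadratic in the number of pairs, so for roughly 1600 screens a practical version would likely index correlations by assay first.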

Network Construction

We constructed all networks by connecting nodes if their edges correspond to the top 700 most correlated pairs. We considered other methods, all of which produced similar results. Seven hundred was empirically determined to be a reasonable number of edges to use so as to produce intelligible networks across all correlation types. If the number of screens were to increase, we would expect the number of edges required to increase. Optimal selection of the number of edges, however, is beyond the scope of this study and left for future work.
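Selecting the most correlated pairs can be sketched with the Python standard library. The helper names `top_edges` and `connected_nodes`, and the toy correlations, are hypothetical; the default of 700 edges is the value used in this study.

```python
import heapq


def top_edges(correlations, k=700):
    """Return the k most correlated pairs as (score, assay_a, assay_b)."""
    scored = ((score, *sorted(pair)) for pair, score in correlations.items())
    return heapq.nlargest(k, scored)


def connected_nodes(edges):
    """Return the set of assays that appear in at least one retained edge."""
    return {assay for _, a, b in edges for assay in (a, b)}


# Toy correlations between four hypothetical assays.
toy_correlations = {
    frozenset({"A", "B"}): 0.9,
    frozenset({"A", "C"}): 0.4,
    frozenset({"C", "D"}): 0.7,
    frozenset({"B", "D"}): 0.1,
}
edges = top_edges(toy_correlations, k=2)
nodes = connected_nodes(edges)
```

Assays outside `nodes` are the unconnected nodes that would be dropped from a displayed network.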

The networks were visualized using the freely available software Cytoscape 2.7.0. Coordinates for visualization were computed using the Organic layout engine. Edges were colored by whether the two assays they connect are in the same project. The project membership of each assay was determined using the method published by Calhoun et al.⁸ The width of each edge scales with the number of molecules active in both assays it connects. For clarity, unconnected nodes are removed from the networks displayed in the main document.

Results and Discussion

We assess our assay networks with both qualitative and quantitative studies. Quantitatively, we should see some statistical correspondence between the networks and annotations of the assays in the network (covered in the Quantitative Assessment section). Qualitatively, the assay network should be capable of highlighting connections that expose important biology that would not otherwise be obvious from the data (covered in the Biological Results section). For reference, the complete pairwise correlation matrices are included in the supplementary information. These matrices are indexed by PubChem Assay ID and provided as tab-delimited files.

Biological Results

Assay networks were constructed connecting nodes when they were highly correlated. We considered several different methods of deciding the appropriate cutoff over which to consider edges “highly” correlated. Ultimately, the method that produced the most consistent results was to use the 700 edges with the highest correlation. In the online version of the network (accessible at http://swami.wustl.edu/flow/assay_network.html), edges between assays in the same screening campaign are colored blue, and those between assays in different campaigns are colored red.

Pearson Correlation

Pearson correlation ignores several systematic sources of error in screening data. Unsurprisingly, it does not produce a meaningful network of assays ( Fig. 3 ). The network is dominated by a few large clusters of viability and wavelength-specific counterscreens. One of these clusters consists exclusively of viability screens of the National Cancer Institute’s panel of 60 cancer cell lines (NCI60). The NCI60 screens are, of course, closely related to one another, but in an obvious and uninteresting way. They all identify molecules that are toxic, and they all test the same set of molecules that tend to be toxic.

Figure 3.

The Assay Network constructed by the Pearson correlation. For clarity, hundreds of unconnected assays are not included in the figure. The largest cluster of assays all test molecules for cytotoxicity against different members of the National Cancer Institute’s panel of 60 cancer cell lines. This grouping is accurate in the sense that these assays are related to one another. However, they are also noninformative because they do not expose any useful biological relationships. In fact, none of the large groups of assays revealed any biology.

Promiscuity-Adjusted Correlation

Encouragingly, PAC does seem to yield a more meaningful network, generating a more informative structure instead of clumping assays into unintelligible balls ( Fig. 4A ). The NCI60 cluster is still present, but it does not entirely dominate the network. As we will see, this network recapitulates known biology and suggests novel hypotheses directly relevant to drug design and development.

Figure 4.

The Assay Network constructed by the promiscuity-adjusted correlation (PAC). (A) Complete network. The stronger links are depicted with thicker lines. (B) One of the larger subnetworks of the PAC network, which connects several pathways known to interact with one another. In particular, the connection between a screen for activators of the NF-κB (1) pathway and a screen for increased expression of amyloid precursor protein (APP) (2) is indicated with a dashed line. Most of the assays were done at different locations. There is some overlap in assay technology, but this is not sufficient to explain the connections. For example, the sentrin-specific protease and APP assays are luminescence-based assays. This is not surprising, as a large proportion of PubChem assays are luminescence-based. Still, APP assays are not connected to all luminescence assays, just those in this cluster. This is what we would expect because promiscuously active molecules are down-weighted by PAC.

Specific molecules underlie each connection in the network. For each connection, nonpromiscuous molecules are active in both the assays they connect. This can happen in several biologically important scenarios. First, the connections in the network can identify novel polypharmacology (i.e., the ability of specific classes of molecules to interact simultaneously with multiple proteins). This is because biochemical assays of proteins with structurally similar binding sites will identify some of the same molecules as active and, consequently, will be connected in the network. Isozymes of the same protein have structurally similar binding sites, but sometimes otherwise unrelated proteins also interact with the same molecules.^12,14

Second, connections in the network can yield mechanistic insight into phenotypic assays. If, for example, inhibiting a protein in a cellular system appropriately changes the behavior of cells in a phenotypic assay, the biochemical assay for inhibitors of the protein can identify the same molecules as the phenotypic assay.

There are, of course, other less interesting reasons why assays are connected. Determining the exact reason for specific connections usually requires additional experimental inquiry. Nonetheless, unexpected connections in the network expose provocative experimental data points that are often missed. The key contribution of this strategy is to bring to the surface these data, so that the most compelling connections can be further investigated.

Confidence in the strength of this approach is built by assessing the plausibility of connections in the network. For example, a subnetwork in the PAC Assay Network links several assays together, many of which are clusters of closely related proteins ( Fig. 4B ). There are several sentrin-specific protease (SENP) isozyme inhibition assays, several frataxin activation assays, and signal transducer and activator of transcription activation assays. First, most assays of the isozymes of the same protein are connected to one another. This is sensible because isozymes of the same protein often have very similar binding pockets.

Second, some of the connections between different protein families can be explained by known biological mechanisms of action. For example, in the center of the subnetwork is a connection between activators of amyloid precursor protein (APP) expression and nuclear factor kappa-light-chain-enhancer (NF-κB) activation, a well-supported relationship from the literature.^15–17 Cleaved APP stimulates NF-κB production in cells. So, it is expected that molecules that activate APP production would also activate NF-κB expression. This is exactly what the data show and why the APP and NF-κB assays are connected in the network. There are also reports that link small-ubiquitin-like modifiers—regulated by SENP enzymes—to neurodegeneration and Alzheimer disease.^18,19

Another subnetwork links the flap endonuclease 1 (FEN1) and DNA polymerase β (POLB) inhibition assays (not called out in the figure). Most of the molecules that inhibit both proteins are flavonoids, a class of plant-derived molecules already known to inhibit DNA polymerase.²⁰ This link, therefore, suggests that flavonoids are a class of molecules that can inhibit both POLB and FEN1—two enzymes required for cell proliferation—and dual inhibition of these two enzymes might be the mechanism of some flavonoids’ anticancer activity.²¹

Neighborhood Adjustment

To further refine the PAC network by preventing highly related data sets, such as the NCI60, from dominating the network, we also explored applying an NA to the computed correlations ( Fig. 5A ). The basic idea is to normalize each correlation to be on the same scale as other correlations nearby in the network. This adjustment connects more nodes to the networks and prevents a few hubs from dominating the view. Encouragingly, the strongly connected NCI60 cluster is entirely absent in both the NA-PAC and NA-PC networks, serving as an excellent negative control. Likewise, we observe that NA tends to down-weight counterscreens designed to identify seemingly active molecules that are artifacts of the assay.

Figure 5.

The Assay Network constructed by the neighborhood-adjusted promiscuity-adjusted correlation (NA-PAC). (A) Complete network. The width of each edge scales with the number of molecules active in both assays it connects. The two boxes correspond to two subnetworks called out in the next panels. (B) One section of the network—with the PubChem assay IDs in parentheses—that connects the hLO and SOD1 screens together. (C) A small subnetwork that links HDAC3 and KLK5 together. The molecules active in these two screens show a high degree of structural similarity, falling into three closely related scaffold groups. Assays are labeled with a HUGO identifier, if possible, and with the assay ID in parentheses.

This improved network exposes several interesting relationships. For example, a screen for molecules that prevent cell death from a mutated version of superoxide-dismutase 1 (SOD1) is linked with a screen for human 15-lipoxygenase (ALOX15) inhibitors ( Fig. 5B ). This connection suggests a therapeutic role for ALOX15 inhibitors in amyotrophic lateral sclerosis, a condition thought to be caused by SOD1-mediated cell death. Encouragingly, both ALOX15 and SOD1 are connected to free-radical generation and clearance, and at least one animal study using an ALOX15 inhibitor seems to support this hypothesis.²² Moreover, the one molecule active in both of these small screens, geraniol, is structurally very similar to the natural substrates of ALOX15. This connection is made on the basis of only one shared active molecule, for two reasons. First, only 662 molecules are tested in both screens, and only 32 are found to be active, so the single shared active is 1 more than the number expected by chance. Second, geraniol is not promiscuous: although tested in hundreds of assays, it is active only in the ALOX15 and SOD1 screens.

Likewise, the network connects an assay for inhibitors of histone deacetylase 3 (HDAC3) with an assay for kallikrein-related peptidase 5 (KLK5) inhibitors ( Fig. 5C ). These are two biochemical screens, and the network highlights the existence of a class of molecules that might simultaneously inhibit both HDAC3 and KLK5 in these assays. Molecules with this property might be therapeutically useful. HDAC inhibitors are being investigated for several purposes, including use as either anti-inflammatory²³ or antineoplastic²⁴ agents. KLK5 is thought to regulate skin desquamation²⁵ and also promote tumor metastasis.²⁶ If this link is not an artifact of the assays, perhaps simultaneously inhibiting HDAC3 and KLK5 with a single molecule would be both feasible and desirable in the setting of some cancers or immune-related desquamation such as psoriasis or eczema.

Our confidence that this is an informative connection increases because the molecules active in the KLK5 and HDAC3 assays share the same molecular scaffolds ( Fig. 5C ). Likewise, molecules that support a connection in the network typically have extremely high structural similarity (data not shown). This similarity supports the belief that a chemical structure–driven mechanism lies beneath many connections. Furthermore, because chemical structure is not used to make connections, this similarity is an independent validation of the network.

Quantitative Assessment

Quantitatively assessing these networks is difficult. There is no gold standard by which the accuracy of connections can be judged. A fundamental problem is that screens are often complex, and the exact details of their assays (which are poorly documented) have a large impact on whether or not it is expected for two screens to share active molecules.

We investigated using GO-term enrichment, KEGG pathways, or protein-interaction data to assess network quality, but it was not feasible. Even though two proteins share the same KEGG pathway or interact with one another, it is entirely possible (even likely) that separate assays based on these proteins’ inhibition or activation will identify different molecules. Common protein annotations do not imply common actives. Rather, we would expect molecules to be active in two assays only if the binding sites of the proteins are similar, a feature not well captured in these annotations.

The complexity of screens is a fundamental barrier to automated assessment. For example, one screen was titled as a phosphodiesterase 4 (PDE4) inhibition screen. The title implies it is a simple enzyme inhibition assay, but this is incorrect. Rather, the free-text description indicates that this screen actually uses a cell-based assay for increased levels of cyclic AMP (cAMP). In this screen, active molecules could interact with PDE4 or any other members of the cAMP pathway. Complexities like this are common across most of PubChem, so deciding a priori if two screens should be connected is difficult, requiring a manual reading of the assay description by biological experts.

We did find one way of assessing networks using campaign groupings. Screening campaigns often execute multiple assays, and assays from the same campaign often have strong, known relationships to one another. For example, they may test molecules against several isozymes of the same protein or a cluster of related phenotypes. Campaign groupings are at best a “bronze standard,” but we can reliably extract them from PubChem,⁸ and we expect a good network would connect assays from the same campaign.

One way of quantitating the performance of a scoring metric using campaign groups is with the area under the receiver-operating characteristic (ROC) curve (AUC) metric. ROC curves order all assay pairs in a network by their correlation, plotting the true-positive rate versus the false-positive rate at all possible cutoffs ( Fig. 6 ). Here, connections between assays in the same campaign are considered “true” connections, and connections between assays in different campaigns are considered “false” connections. The PAC (AUC = 0.760) works better than PC (AUC = 0.726). Likewise, NA-PAC (AUC = 0.825) does better than its unadjusted version. The NA-PC (AUC = 0.689) has a lower AUC than PC does. However, the early part of the NA-PC curve hugs the y axis for longer, meaning that its most highly ranked connections are more often true connections. In this case, therefore, the AUC is not the best way of quantifying the relative performance of NA-PC to PC. Rather, among the top-ranked connections, the NA-PC performs better than PC, just as we would expect.

Figure 6.

Quantitative performance. These receiver-operating characteristic curves assess how well assays in the same campaign are grouped together by each metric. Assays in the same campaign are often related to one another by biology and assay technology. The in-campaign connections are a useful baseline by which to gauge the quantitative performance of each score. Nonetheless, this is a “bronze standard” assessment, because biologically related screens are not always executed in the same campaign. The area under each curve summarizes each score’s performance. The promiscuity-adjusted correlation metric performs better than Pearson correlation, and the neighborhood-adjusted promiscuity-adjusted correlation metric performs better still.
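The AUC for a ranking of assay pairs can be computed without drawing the curve, using the rank-sum (Mann-Whitney) identity. The following Python sketch assumes each assay pair is represented as a (score, same_campaign) tuple; the function name `roc_auc` and the toy values are hypothetical and are not the data behind the reported AUCs.

```python
def roc_auc(scored_pairs):
    """Area under the ROC curve for (score, same_campaign) tuples.

    Uses the rank-sum (Mann-Whitney) identity; tied scores receive the
    average of the ranks they span. Returns None if one class is empty.
    """
    ordered = sorted(scored_pairs, key=lambda item: item[0])
    n_pos = sum(1 for _, label in ordered if label)
    n_neg = len(ordered) - n_pos
    if n_pos == 0 or n_neg == 0:
        return None
    rank_sum = 0.0
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and ordered[j][0] == ordered[i][0]:
            j += 1
        average_rank = (i + 1 + j) / 2  # mean of the 1-based ranks i+1 .. j
        rank_sum += average_rank * sum(1 for _, label in ordered[i:j] if label)
        i = j
    return (rank_sum - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)


# Toy ranking: True marks a pair of assays from the same campaign.
toy_pairs = [(0.9, True), (0.8, True), (0.7, False), (0.6, True), (0.2, False)]
auc = roc_auc(toy_pairs)
```

The result equals the probability that a randomly chosen same-campaign pair scores higher than a randomly chosen different-campaign pair, with ties counted as one half.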

There are caveats with this approach because campaign groups are not true gold standards. Screening campaigns often include counterscreens to detect molecules that are active for artifactual reasons. A good correlation will down-weight connections to counterscreens, even though they are in the same projects. Nonetheless, this is a useful assessment because a good network would still group campaign-related screens together. The uninteresting relationships, also, are easy to filter out by coloring or excluding edges between assays in the same campaign in the final network.

These quantitative data, although not definitive on their own, still support both the theoretical conclusions and our qualitative analysis. Accounting for promiscuous molecules and also normalizing assay correlations with an NA improves the accuracy and informativeness of our approach.

Discussion

These experiments and examples demonstrate that it is feasible to extract useful information from large repositories of screens that would not be apparent from looking at each campaign in isolation. In particular, the PAC score we introduce has several advantages over other methods that have been used in the literature.

Term-Frequency Inverse-Document-Frequency

The PAC score between two assays is closely related to the TF-IDF score. There are three key differences. First, TF-IDF computes its weight for the ith term (corresponding with the ith molecule in PAC) as 1/log A_i rather than T_i/A_i. This difference makes the PAC down-weight promiscuous molecules more strongly than TF-IDF and also account for the ascertainment bias introduced when different molecular libraries are profiled by different numbers of assays. Second, TF-IDF computes the normalization term by summing over all words in the dictionary. In contrast, PAC sums over only those molecules tested in both screens and active in at least one, accounting for the ascertainment bias introduced when assays are applied to different libraries of molecules. Third, PAC ignores the number of times each molecule is determined active in a screen (the term frequency), because replicating an experimental result in a screen is not equivalent to repeating a word several times in a document.

Single Common Active

One group has suggested connecting assays when they share one active molecule in common.¹² This approach is very sensitive to promiscuously active molecules and noise in screening data. It would identify all of the connections found by PAC but would also identify a large number of spurious connections. On a separate but related issue, these authors also decided to use their own cutoff to the activity score to call actives, rather than relying on the depositor-specified annotations for actives. This decision is problematic because their choice of a single cutoff at a PubChem Activity of 90 ignores the scale of each assay readout and the complicated ways in which many depositors have used (and abused) the PubChem Activity data type.

Jaccard Similarity

One group has suggested connecting assays when they share more than 10% of active molecules in common.¹¹ Although less so than the single common active method, this approach is also sensitive to promiscuously active molecules. At the same time, it would miss many of the connections identified by PAC. For example, it would miss the connection between HDAC3 and KLK5. This method is also strongly influenced by ascertainment bias. Assays run on different libraries of molecules will have lower correlations than they should. Conversely, promiscuous molecules will increase the correlation of many assay pairs more than they should. In contrast, PAC is much less influenced by these sources of error.
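For comparison, a minimal Python sketch of the Jaccard criterion follows. It assumes the 10% rule is applied to the Jaccard index of the two sets of active molecules (shared actives divided by the union of actives). The function name `jaccard` and the toy sets are hypothetical and do not correspond to any particular PubChem assays.

```python
def jaccard(x_active, y_active):
    """Jaccard index of two sets of active molecules (None if both are empty)."""
    union = x_active | y_active
    if not union:
        return None
    return len(x_active & y_active) / len(union)


# Toy sets: one shared, selective active ("g") among 32 distinct actives.
x_active = {"g"} | {f"x{i}" for i in range(15)}
y_active = {"g"} | {f"y{i}" for i in range(16)}
similarity = jaccard(x_active, y_active)
connected = similarity > 0.10
```

Every active counts equally here, so one highly selective shared molecule cannot lift the pair over the threshold, whereas a weighted score such as PAC can still rank the same pair highly.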

The score and adjustment we introduce in this study, the PAC and the NA, account for promiscuous actives and ascertainment bias in screens, yielding an assay graph that appears to have biologically meaningful connections. Nonetheless, there are caveats to our method. Some of the connections are based on very few active molecules, and the method relies on activity annotations from different screening groups, which may not be consistent. Perhaps most importantly, the network spans a vast range of biology that will require experts from across this vast range to identify and validate the most interesting connections.

Nonetheless, large repositories of small-molecule screens are an opportunity. Substantial effort was spent generating millions of data points. Assay networks are one way of organizing these data into a coherent view that spans both phenotypic and target-based screens. The PAC and NA-PAC networks expose several connections that are supported by known biology, expose novel polypharmacology, or suggest repurposing strategies. This is just an initial attempt at mining these immensely rich but noisy data. We believe that continued effort toward mining screening data with more sophisticated methodology will uncover many more interesting connections.

Footnotes

Acknowledgements

We acknowledge the use of Cytoscape 2.7.0 to visualize networks.

Author Contributions

S.J.S. and P.A. conceived the idea and provided project direction. S.J.S. invented the PAC and NA. C.N.S. coded and performed all the virtual experiments. S.J.S., C.N.S., and P.A. wrote and edited the manuscript. M.R.H. provided expert advice and helped edit the manuscript. M.M. made an interactive Web version of the network.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article. This study was funded by GlaxoSmithKline R&D.

Supplementary material for this article is available on the Journal of Biomolecular Screening Web site.

References

1. Wang; Xiao; Suzek; et al. PubChem: A Public Information System for Analyzing Bioactivities of Small Molecules. Nucleic Acids Res. 2009, 37, W623–W633.

2. Swinney; Anthony. How Were New Medicines Discovered? Nat. Rev. Drug Discov. 2011, 10, 507–519.

3. Kotz. Small (Molecule) Thinking in Academia. SciBX: Science-Business eXchange 2011, 4.

4. Baell; Holloway. New Substructure Filters for Removal of Pan Assay Interference Compounds (PAINS) from Screening Libraries and for Their Exclusion in Bioassays. J. Med. Chem. 2010, 53, 2719–2740.

5. Swamidass; Calhoun; Bittker; et al. Utility-Aware Screening with Clique-Oriented Prioritization. J. Chem. Inf. Model. 2012, 52, 29–37.

6. Swamidass; Calhoun; Bittker; et al. Enhancing the Rate of Scaffold Discovery with Diversity-Oriented Prioritization. Bioinformatics 2011, 27, 2271–2278.

7. Davis; Erlanson, D. A. Learning from Our Mistakes: The “Unknown Knowns” in Fragment Screening. Bioorg. Med. Chem. Lett. 2013, 23, 2844–2852.

8. Calhoun; Browning; Bittker; et al. Automatically Detecting Workflows in PubChem. J. Biomol. Screen. 2012, 17, 1071–1079.

9. Visser; Abeyruwan; Vempati; et al. Bioassay Ontology (BAO): A Semantic Description of Bioassays and High-Throughput Screening Results. BMC Bioinformatics 2011, 12, 257.

10. Chen; McConnell; Wale; et al. Comparing Bioassay Response and Similarity Ensemble Approaches to Probing Protein Pharmacology. Bioinformatics 2011, 27, 3044–3049.

11. Zhang; Lushington; Huan. The Bioassay Network and Its Implications to Future Therapeutic Discovery. BMC Bioinformatics 2011, 12, S1.

12. Chen; Wild; Guha. PubChem as a Source of Polypharmacology. J. Chem. Inf. Model. 2009, 49, 2044–2055.

13. Hiemstra. A Probabilistic Justification for Using TF-IDF Term Weighting in Information Retrieval. Int. J. Digital Libraries 2000, 3, 131–139.

14. Keiser; Roth; Armbruster; et al. Relating Protein Pharmacology by Ligand Chemistry. Nat. Biotechnol. 2007, 25, 197–206.

15. Mattson. Cellular Actions of Beta-Amyloid Precursor Protein and Its Soluble and Fibrillogenic Derivatives. Physiol. Rev. 1997, 77, 1081–1132.

16. Paris; Patel; Quadros; et al. Inhibition of Aβ Production by NF-κB Inhibitors. Neurosci. Lett. 2007, 415, 11–16.

17. Bales; Dodel; et al. The NF-κB/Rel Family of Proteins Mediates Aβ-Induced Neurotoxicity and Glial Activation. Mol. Brain Res. 1998, 57, 63–72.

18. Wang; et al. Positive and Negative Regulation of APP Amyloidogenesis by Sumoylation. Proc. Natl. Acad. Sci. U.S.A. 2003, 100, 259.

19. Dorval; Fraser. SUMO on the Road to Neurodegeneration. Biochim. Biophys. Acta 2007, 1773, 694–706.

20. Ono; Nakane; Fukushima; et al. Differential Inhibitory Effects of Various Flavonoids on the Activities of Reverse Transcriptase and Cellular DNA and RNA Polymerases. Eur. J. Biochem. 1990, 190, 469–476.

21. Ren; Qiao; Wang; et al. Flavonoids: Promising Anticancer Agents. Med. Res. Rev. 2003, 23, 519–534.

22. West; Wang; et al. The Arachidonic Acid 5-Lipoxygenase Inhibitor Nordihydroguaiaretic Acid Inhibits Tumor Necrosis Factor α Activation of Microglia and Extends Survival of G93A-SOD1 Transgenic Mice. J. Neurochem. 2004, 91, 133–143.

23. Adcock. HDAC Inhibitors as Anti-Inflammatory Agents. Br. J. Pharmacol. 2007, 150, 829–831.

24. Johnstone. Histone-Deacetylase Inhibitors: Novel Drugs for the Treatment of Cancer. Nat. Rev. Drug Discov. 2002, 1, 287–299.

25. Borgono; Michael; Komatsu; et al. A Potential Role for Multiple Tissue Kallikrein Serine Proteases in Epidermal Desquamation. J. Biol. Chem. 2007, 282, 3640–3652.

26. Kim; Scorilas; Katsaros; et al. Human Kallikrein Gene 5 (KLK5) Expression Is an Indicator of Poor Prognosis in Ovarian Cancer. Br. J. Cancer 2001, 84, 643–650.
