BioAssay Ontology Annotations Facilitate Cross-Analysis of Diverse High-Throughput Screening Data Sets

Abstract

High-throughput screening data repositories, such as PubChem, represent valuable resources for the development of small-molecule chemical probes and can serve as entry points for drug discovery programs. Although the loose data format offered by PubChem allows for great flexibility, important annotations, such as the assay format and technologies employed, are not explicitly indexed. The authors have previously developed a BioAssay Ontology (BAO) and curated more than 350 assays with standardized BAO terms. Here they describe the use of BAO annotations to analyze a large set of assays that employ luciferase- and β-lactamase–based technologies. They identified promiscuous chemotypes pertaining to different subcategories of assays and specific mechanisms by which these chemotypes interfere in reporter gene assays. Results show that the data in PubChem can be used to identify promiscuous compounds that interfere nonspecifically with particular technologies. Furthermore, they show that BAO is a valuable toolset for the identification of related assays and for the systematic generation of insights that are beyond the scope of individual assays or screening campaigns.

Keywords

compound promiscuity; assay ontology; reporter gene assays; high-throughput screening; data analysis; cheminformatics

Introduction

The field of high-throughput screening (HTS) is rapidly advancing through the development of sophisticated robotics and liquid handling systems, sensitive and versatile detection technologies, and powerful informatics systems that enable miniaturization and increased throughput.¹ Furthermore, HTS is being used to interrogate increasingly complex biological systems and processes, driven by advancements in molecular and cellular biology in combination with innovative assay designs.

In an effort to find novel entry points for drug discovery programs, countless HTS campaigns comprising large commercial and proprietary compound libraries have produced massive data sets, primarily in pharmaceutical companies. The National Institutes of Health (NIH) Molecular Libraries Roadmap Initiative² and the availability of more affordable “out-of-the-box” screening systems and reagents have facilitated a dissemination of HTS capabilities into academic institutes and universities, where they are now relatively common and available to researchers.

HTS data sets, which consist of experimental results and assay metadata, are typically stored in data warehouses using relational database schemas.^3,4 The fast pace of innovation in assay designs and detection technologies, as well as the increasing complexity of the biological targets under investigation, poses challenges to “static” database schemas to capture and manage the diversity of screening experiments and their outcomes. To optimize the value of HTS efforts beyond any individual HTS campaign and to facilitate more informed decision making as compounds progress in the value chain, systematic knowledge management is receiving increased attention from informatics organizations.⁵ In this context, a formal, well-structured, knowledge-based, and extensible description of biological assays is required. Expert biocuration to organize and annotate existing data is also a critical component of any HTS knowledge management solution.

PubChem is a public repository of HTS assay descriptions, small-molecule compounds, and HTS results (which we refer to as endpoints).^6,7 Originally put in place as part of the Molecular Libraries Program (MLP), it serves to host data generated at the MLP centers as well as that from other NIH-funded projects. As of September 2010, there were more than 2100 bioassays from the MLP deposited in PubChem. In addition to PubChem, there are several other publicly available sources of screening data, including ChEMBL,⁸ which contains structure–activity relationship (SAR) data curated from the medicinal chemistry literature; the Psychoactive Drug Screening Program (PDSP)^9,10; and ChemBank.^11,12 In addition, private resources, such as Collaborative Drug Discovery (CDD),^13,14 also make large screening data sets publicly accessible.

Despite recommendations from industry and government work groups, there is currently no agreed-upon standard for the representation of HTS assay data. Such a representation is vital for researchers to meaningfully interpret and compare diverse assay results.¹⁵ Because HTS data repositories lack detailed annotations using standardized terms, seemingly trivial queries such as “list the biochemical vs. cell-based assays” or “list assays that use a luciferase reporter construct” are not possible. In addition, the lack of a formal description of biological assays hinders the integration of HTS data from different sources as well as with other life science databases (e.g., biological pathways).

PubChem’s already large and diverse set of deposited assay results along with several other accessible screening data repositories form a large corpus of data that can serve as a starting point to develop a systematic categorization of HTS assays. The exponential growth of public data repositories indicates that we are only beginning to explore the space of possible assay designs. The development of a clearly structured and standardized formal description of concepts that are relevant to interpreting HTS results is therefore very timely.

In this report, we demonstrate how such a formalized terminology can facilitate analyses across multiple diverse assays to identify promiscuous compounds. These compounds are traditionally problematic for HTS, and it is desirable to identify them as early as possible in a campaign. Compound promiscuity can be related to assay design, detection technology, or interaction with biological targets, and often the specific mechanisms of action are not fully understood. There have been attempts at the identification of compound classes that can interfere with specific assay technologies, but these studies usually focused on a small number of biological assays and did not make use of the large numbers of data sets currently available.^16,17 Here we attempt for the first time to identify promiscuous behavior on a large scale using a curated data set that allowed us to interrogate compound behavior across certain assay categories and subcategories.

Methods

PubChem local mirror database and chemical structures

A local relational mirror of the PubChem bioassay database was created using in-house scripts and a public version of the MySQL database. The details of this database, schema, population and update processes, implementation, and code are reported elsewhere.¹⁸ Briefly, the database consisted of several tables, including assay details (such as AID, assay name, description, project category, protocol), panel assay specifications, result definitions (such as IC₅₀, percentage inhibition, or any other observed measurement or statistics), result data (with PubChem Activity Outcome and Score and the most important results, such as IC₅₀, and qualifiers such as <, >, =), cross-references (links of assays to other National Center for Biotechnology Information (NCBI) databases such as protein or nucleotide target, PubMed, taxonomy), and relationships (links between different assays, links to other NCBI Entrez databases, and links between targets and their sequences). The system used the PubChem FTP site to access XML assay descriptions and CSV assay data and the NCBI Entrez Utilities (eUtils) to access additional information (including if an assay had changed) to keep the mirror database current. The database included a structure table only as a placeholder. Chemical structures corresponding to the assay data were downloaded by substance IDs (SIDs) directly from PubChem as SDFiles using the batch download facility.

PubChem assay annotation and assay clustering

PubChem assays were annotated manually using the mirror database described above, which was fetched from PubChem in April 2010 and contained 2299 assays (by AID). In total, 172 assays had no data at all (on-hold assays). There were 194 summary assays, of which 136 had no substances or activity data. These assays were not considered for annotation. There were 105 assays with no activity outcome method (which is usually assigned as screening, confirmatory, other, or summary); these were deposited by Ambit Biosciences, the Developmental Therapeutics Program/National Cancer Institute, and the Structural Genomics Consortium. The screening centers of the NIH Molecular Libraries Probe Center Network (MLPCN) and the former Molecular Libraries Screening Center Network (MLSCN) contributed 1498 assays, not counting on-hold assays (without data) and summary assays.

To aid the manual annotation process, all assays were clustered based on the assay title, description, protocol, and source. Several assays (other than on-hold or summary assays) did not have a protocol or only a minimal description, but all had information about the source. To cluster the assays, first for each assay a text fingerprint was generated from all words used in title, description, protocol, and source after stemming (to consolidate different grammatical forms) using the Pipeline Pilot 8.0 (Accelrys, San Diego, CA)¹⁹ text analytics component collection. The text fingerprints (TXFP_Custom) encode for each individual assay the presence and absence of word tokens from the global corpus of assays. The assay “documents” were then clustered based on the fingerprints using the Tanimoto similarity metric and setting the average cluster size to five members. The clustering method is a relocation technique based on maximum dissimilarity partitioning implemented in the Pipeline Pilot text analytics collection. A total of 460 clusters were generated. This method grouped together similar assays very effectively—for example, all assays of the same screening campaign by center or assays with the same procedure or assay design (e.g., many National Center for Chemical Genomics [NCGC] toxicity assays); as expected, clusters usually included assays from the same source. The method also grouped together related assays with minimal annotations (such as many of the NCI or Chembank assays), summary assays, or assays that were on hold. Two hundred ninety-nine clusters were generated from the 1498 MLPCN and MLSCN assays that had data deposited and were not summary assays. To illustrate the similarity relationships of these assays, we generated a minimum spanning tree (MST) based on the pairwise (Tanimoto) similarities of the assays computed from their text fingerprints (the same similarities that were used for clustering above). 
The MST was computed using an in-house protocol implementing Kruskal’s algorithm. The MST was visualized in Cytoscape²⁰ and is shown in Supplemental Figure S1A. Supplemental Figure S1B,C shows the cluster and assay memberships for biochemical and cell-based assays, respectively. Assay formats were mapped onto the tree after manual annotation (see the following).
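The text-fingerprint and spanning-tree steps described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the assay identifiers and texts are invented, the suffix-stripping "stemmer" is a crude stand-in for the Pipeline Pilot text analytics components, and the maximum dissimilarity clustering step is not reproduced. Only the Tanimoto similarity on token sets and Kruskal's algorithm correspond directly to methods named in the text.

```python
import re
from itertools import combinations

def text_fingerprint(text):
    """Set of lower-cased word tokens; a crude suffix stripper stands in for real stemming."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return {re.sub(r"(ing|ed|es|s)$", "", t) if len(t) > 4 else t for t in tokens}

def tanimoto(a, b):
    """Tanimoto (Jaccard) similarity of two token sets."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0

def minimum_spanning_tree(n_nodes, weighted_edges):
    """Kruskal's algorithm with union-find; edges are (weight, i, j) tuples."""
    parent = list(range(n_nodes))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path compression
            x = parent[x]
        return x

    tree = []
    for weight, i, j in sorted(weighted_edges):
        root_i, root_j = find(i), find(j)
        if root_i != root_j:  # edge joins two components, so it cannot form a cycle
            parent[root_i] = root_j
            tree.append((i, j, weight))
    return tree

# Hypothetical assay descriptions (not taken from PubChem).
assays = {
    "AID_A": "Luciferase reporter gene assay for NF-kB inhibitors",
    "AID_B": "Luciferase reporter gene assay for NF-kB activators",
    "AID_C": "Cell viability assay measuring ATP content in HepG2 cells",
    "AID_D": "Cell viability assay measuring ATP content in Jurkat cells",
}
ids = list(assays)
fps = [text_fingerprint(assays[a]) for a in ids]

# Edge weight is the dissimilarity 1 - Tanimoto, so similar assays are linked first.
edges = [(1.0 - tanimoto(fps[i], fps[j]), i, j)
         for i, j in combinations(range(len(ids)), 2)]
mst = minimum_spanning_tree(len(ids), edges)
for i, j, weight in mst:
    print(ids[i], ids[j], round(1.0 - weight, 2))
```

In this toy corpus the two reporter gene assays and the two viability assays are joined first, mirroring the observation that related assays group together locally in the tree.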

Following cluster preprocessing, assays were then manually annotated by assay format, design, technology, and the other BioAssay Ontology (BAO) categories. The BAO schema with classes, individuals, relationships, and their definitions can be downloaded from our Web site, and BAO can also be visualized there.²¹ For the limited analysis presented here, we focused specifically on assays based on designs to detect luminescence from the luciferase-catalyzed conversion of luciferin substrates²² and assays employing β-lactamase-based technology.²³

Luciferase assays were classified into five subcategories: reporter gene, viability, adenosine triphosphate (ATP)–coupled, luciferin-coupled, and luciferase enzyme activity assays. Briefly, luciferase reporter gene assays place the luciferase gene downstream of a promoter of interest. The amount of luciferase expressed is quantified by the intensity of light (luminescence) produced in the presence of its substrates, ATP and luciferin. Viability assays estimate the proportion of living cells in an assay by measurement of ATP content in a luciferase-catalyzed reaction. ATP-coupled assays measure the residual amount of ATP (e.g., after a kinase reaction) by a coupled luciferase reaction. Luciferin-coupled assays measure the amount of luciferin generated after detoxification by cytochrome P450 enzyme activity. Luciferase enzyme activity assays quantify the luciferase enzyme activity by the amount of light produced in a biochemical reaction. β-Lactamase technology is used in either reporter gene or enzyme activity assays.

PubChem Promiscuity Index (PCIdx)

The PubChem Promiscuity Index (PCIdx) of a substance (by SID) was defined as the number of assays in which this substance is active divided by the number of assays in which it was tested (equation (1), where N is the assay count for the substance).

$$\mathrm{PCIdx}(\mathrm{substance}) = \frac{N(\mathrm{Active})}{N(\mathrm{Tested})} \tag{1}$$

To define active, we used the PubChem activity outcome, which is one of the required fields to be uploaded by the assay depositor. Activity outcome categorizes tested samples as active, inactive, inconclusive, or unspecified. PubChem does not have rules for when to apply the outcome category “active,” which is defined (subjectively) by the depositor. Therefore, active can have different meanings across different assays. This is clearly not the best way of comparing compounds across a large number of assays, and it would be much better to standardize the most important endpoints across all assays. However, activity outcome is currently one of only two required endpoints (the other is activity score, which is also subjectively defined by the depositor) and therefore the only way to quickly identify “active” compounds.

To compute PCIdx for each compound, all assays in which it was tested and the corresponding activity outcomes were determined by querying the PubChem mirror database above. PCIdx was calculated according to equation (1), separately for single-concentration assays (PubChem activity outcome method “screening”) and concentration–response assays (activity outcome method “confirmatory”). Only assays of a certain category were considered—for example, all luciferase technology assays or a certain subset thereof, such as viability assays or luciferase enzyme inhibition assays.
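The calculation of equation (1) per substance and per activity outcome method can be sketched as follows. This is a minimal illustration, not the Pipeline Pilot protocol used in the study: the record layout, the function name, and the example data are hypothetical, and two conventions are assumptions made here, namely that any reported outcome (including inconclusive) counts as tested, and that an assay counts once per substance even if several result rows exist.

```python
from collections import defaultdict

def promiscuity_index(records, assay_ids):
    """Compute PCIdx per (SID, outcome method) over the assays in `assay_ids`.

    records: iterable of (sid, aid, method, outcome) tuples, where method is
    'screening' or 'confirmatory' and outcome is the PubChem activity outcome.
    Returns {(sid, method): (pcidx, n_active, n_tested)}.
    """
    tested = defaultdict(set)
    active = defaultdict(set)
    for sid, aid, method, outcome in records:
        if aid not in assay_ids:
            continue  # assay does not belong to the category of interest
        key = (sid, method)
        tested[key].add(aid)  # assumption: every reported outcome counts as tested
        if outcome == "active":
            active[key].add(aid)
    return {
        key: (len(active[key]) / len(tested[key]), len(active[key]), len(tested[key]))
        for key in tested
    }

# Hypothetical records: (SID, AID, activity outcome method, activity outcome).
records = [
    (101, 1, "confirmatory", "active"),
    (101, 2, "confirmatory", "inactive"),
    (101, 3, "screening", "active"),
    (102, 1, "confirmatory", "inconclusive"),
    (102, 9, "confirmatory", "active"),  # AID 9 is outside the category below
]
luciferase_reporter_aids = {1, 2, 3}  # hypothetical annotated assay category

result = promiscuity_index(records, luciferase_reporter_aids)
for key in sorted(result):
    print(key, result[key])
```

Restricting the calculation to a set of annotated assay IDs is what ties the index to a category such as "luciferase reporter gene assays".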

Because the significance of the PCIdx measure increases with the number of assays tested, we visualized compounds’ promiscuities by plotting PCIdx against the number of assays tested while indicating the number of active assays by a color code (cf. Fig. 1 and Supplemental Figures S4 and S6).

Fig. 1.

Compound promiscuity by luciferase assay technologies. For each compound, the Promiscuity Index versus number of tested assays is depicted. (A, B) Adenosine triphosphate (ATP)–coupled enzyme activity (e.g., kinase activity, not viability). (C, D) Luciferase enzyme activity. (E, F) Luciferin-coupled enzyme activity (e.g., P450). (G, H) Luciferase reporter gene assays. (I, J) Cell viability assays (ATP coupled). (A, C, E, G, I) Concentration–response assays. (B, D, F, H, J) Single-concentration assays. Color and size indicate the number of assays (of the particular luciferase assay type) in which a compound was active. In total, 87 615 data points with at least one active assay are shown: (A) 3457, (B) 5619, (C) 2313, (D) 3646, (E) 3457, (F) 5619, (G) 14 200, (H) 36 685, (I) 1413, (J) 11 206.

Figure 1, Supplemental Figures S4 and S6, and the heat map in Figure 2 were created in TIBCO Spotfire DecisionSite.²⁴

Data Clustering

Data in Figure 2 (corresponding to Supplementary Table S1) were hierarchically clustered using the unweighted pair group method with arithmetic mean (UPGMA) and PCIdx correlation as a similarity measure.
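To make the clustering step concrete, the following sketch applies UPGMA (average linkage) to PCIdx profiles using 1 minus the Pearson correlation as the distance. It is a plain-Python illustration only: the compound names and profile values are invented, the exact correlation measure and implementation used in Spotfire are not stated in the text, and a production analysis would normally rely on an established clustering library.

```python
from itertools import combinations
from statistics import mean

def pearson(x, y):
    """Pearson correlation of two equally long numeric profiles."""
    mx, my = mean(x), mean(y)
    num = sum((a - mx) * (b - my) for a, b in zip(x, y))
    den = (sum((a - mx) ** 2 for a in x) * sum((b - my) ** 2 for b in y)) ** 0.5
    return num / den if den else 0.0

def upgma(profiles):
    """Agglomerative clustering with average linkage on the distance 1 - r.

    Returns the merge history as (cluster_a, cluster_b, distance) tuples.
    """
    names = list(profiles)
    dist = {frozenset((a, b)): 1.0 - pearson(profiles[a], profiles[b])
            for a, b in combinations(names, 2)}

    def average_distance(c1, c2):
        # UPGMA: unweighted mean over all member pairs of the two clusters
        return mean(dist[frozenset((a, b))] for a in c1 for b in c2)

    clusters = [(n,) for n in names]
    merges = []
    while len(clusters) > 1:
        c1, c2 = min(combinations(clusters, 2), key=lambda p: average_distance(*p))
        merges.append((c1, c2, average_distance(c1, c2)))
        clusters = [c for c in clusters if c not in (c1, c2)] + [c1 + c2]
    return merges

# Hypothetical PCIdx profiles: [reporter gene, viability, luciferase enzyme].
profiles = {
    "cytotoxic_1": [0.7, 0.9, 0.0],
    "cytotoxic_2": [0.8, 1.0, 0.1],
    "luc_inhibitor_1": [0.8, 0.0, 1.0],
    "luc_inhibitor_2": [0.7, 0.1, 0.9],
}
merges = upgma(profiles)
for c1, c2, d in merges:
    print(c1, c2, round(d, 3))
```

With these made-up profiles, the compounds that are also active in viability assays merge with each other before they join the compounds that are also active in enzyme assays.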

Fig. 2.

Heat map of the 161 most promiscuous compounds in luciferase reporter gene assays, which are active in at least five concentration–response and five single-concentration (luciferase reporter) assays. DR, dose response; SC, single concentration. Shown are the promiscuity indices of all compounds in the different luciferase assay categories for both concentration–response and single-concentration assays, respectively, clustered by their PubChem Promiscuity Index (PCIdx) profiles. Two groups of promiscuous reporter gene compounds were apparent, suggesting the mechanism for reporter gene assay promiscuity: one in which compounds were also active in viability assays (red) and the other where compounds were also active in luciferase enzyme assays (blue). See Supplemental Table S1 for details.

Chemical Structure Clustering

Chemical structures were clustered by maximum common substructure (MCS) using ChemAxon Library MCS.²⁵

Chemical Structure Similarities

Compound pairwise similarities and the similarity matrix were computed using extended-connectivity atom-type fingerprints of diameter 4 (ECFP4)²⁶ and the Tanimoto metric implemented in Pipeline Pilot 8.0.¹⁹
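For reference, the Tanimoto metric on two fingerprints can be written out as below, treating each fingerprint as the set of features (here, ECFP4 substructure identifiers) present in a compound. The numbers in the worked line are hypothetical and serve only to illustrate the arithmetic.

```latex
% Tanimoto similarity of feature sets A and B
T(A, B) = \frac{|A \cap B|}{|A \cup B|}
        = \frac{|A \cap B|}{|A| + |B| - |A \cap B|}

% Hypothetical example: |A| = 40, |B| = 50, and 30 shared features
T(A, B) = \frac{30}{40 + 50 - 30} = \frac{30}{60} = 0.5
```

The value ranges from 0 (no shared features) to 1 (identical feature sets).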

Results

BioAssay Ontology and assay annotations

We have developed an ontology (BAO)²¹ to facilitate analyses of screening results from large and diverse sets of biological assays spanning multiple technologies and originating from different sources. The BAO project seeks to develop a formal, extensible, knowledge-based description of biological assays by making use of the description logic–based features of the Web Ontology Language (OWL). Expert curation is an important component of the BAO project, and we have been systematically annotating sets of PubChem bioassays with BAO terms describing assay concepts. The BAO project will also provide software tools to query and explore data sets in the context of the ontology.

The BioAssay Ontology describes several concepts related to biological screening, including Perturbagen, Target, Format, Assay Design, Detection Technology, and Endpoint, including endpoint data manipulation. Perturbagens deposited in PubChem and the other screening data sources mentioned above are mostly small molecules but can include various other perturbing agents that are screened in an assay. We refer to targets as “Meta Target” describing not just protein targets but also pathways, biological processes or events, and so on targeted by the assay. Format describes the biological or chemical features common to each test condition in the assay and includes biochemical, cell based, organism based, and variations thereof. Assay Design describes the assay methodology and implementation of how the perturbation of the biological system is translated into a detectable signal. Detection Technology relates to the physical method and technical details to detect and record a signal. Endpoints are the final HTS results as they are usually published (such as IC₅₀, percentage inhibition, etc.). Endpoint data manipulation specifies how the raw signal(s) is transformed into reported endpoints (i.e., normalization, correction, etc.). BAO also captures other assay properties such as assay purpose and how assays are related in campaigns. BAO is also designed to handle multiplexed assays. All main BAO components include multiple levels of subclasses and specification classes, which are linked via object–property relationships forming a knowledge representation. The details of the development and description of BAO will be reported elsewhere. The BAO schema with classes, individuals, and relationships can be downloaded from our Web site.²¹ BAO classes, their subsumption hierarchies, and class definitions can also be visualized directly on the BAO Web site.²¹

We annotated a set of more than 350 PubChem assays and grouped them into related classes by assay design and detection technology. Specifically, we focused on widely used HTS assay technologies that employ luciferase- and β-lactamase–based reporters.²² By analyzing the outcomes of related assays, we could readily identify compounds of interest, for example, those that were promiscuously active in one or multiple classes of assays. The luciferase assays were annotated and classified into subcategories that relate to assay design (described in Methods). To efficiently annotate assays and to facilitate data analysis across all PubChem assays, we created a local mirror of the PubChem database. This database stores assay descriptions and endpoints in a relational format and can be queried easily using SQL. Mirrored assays were then manually annotated with BAO terms after interpreting the textual descriptions available in PubChem. To aid in the assay annotation process, we clustered the assays based on text fingerprints derived from the free text in assay title, description, protocol, and source (see Methods). Supplemental Figure S1 illustrates, for the 1498 MLPCN and MLSCN assays, the similarity relationships based on their textual descriptions in PubChem, the clusters that were obtained, the most important formats (biochemical and cell based), and the screening center. Supplemental Figure S1A shows the minimum spanning tree of the assays (see Methods), illustrating that assays from the same center and assays of the same format typically group together locally. Supplemental Figure S1B,C shows the cluster memberships (see Methods). Each cluster contained only assays from one center, and most clusters only contained assays of one format. Clusters of assays also typically were of related designs and biological targets (not shown). In our hands, this was an effective method to group similar assays together, enabling the sequential annotation of sets of related assays. We found that this greatly reduced errors and accelerated the annotation process compared to random or chronological (by assay ID [AID]) order of annotation.
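Once assays carry standardized annotations, the "seemingly trivial" queries mentioned in the Introduction become simple SQL lookups. The sketch below illustrates this with Python's built-in sqlite3 module instead of the MySQL mirror used in the study. The table name, column names, AIDs, and rows are all hypothetical and do not reflect the actual mirror schema or the BAO term identifiers.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE assay_annotation (      -- hypothetical annotation table
    aid INTEGER PRIMARY KEY,         -- PubChem assay ID
    assay_format TEXT,               -- e.g. 'biochemical', 'cell-based'
    assay_design TEXT,               -- e.g. 'reporter gene', 'viability'
    detection_technology TEXT        -- e.g. 'luciferase', 'beta-lactamase'
);
""")
conn.executemany(
    "INSERT INTO assay_annotation VALUES (?, ?, ?, ?)",
    [  # invented example rows
        (1001, "cell-based", "reporter gene", "luciferase"),
        (1002, "cell-based", "viability", "luciferase"),
        (1003, "cell-based", "reporter gene", "luciferase"),
        (1004, "biochemical", "enzyme activity", "beta-lactamase"),
        (1005, "cell-based", "reporter gene", "beta-lactamase"),
    ],
)

def assays_by(conn, design, technology):
    """List AIDs annotated with the given assay design and detection technology."""
    rows = conn.execute(
        "SELECT aid FROM assay_annotation "
        "WHERE assay_design = ? AND detection_technology = ? ORDER BY aid",
        (design, technology),
    )
    return [row[0] for row in rows]

def count_by_format(conn):
    """Count assays per assay format (e.g. biochemical vs. cell-based)."""
    rows = conn.execute(
        "SELECT assay_format, COUNT(*) FROM assay_annotation GROUP BY assay_format"
    )
    return dict(rows)

print(assays_by(conn, "reporter gene", "luciferase"))
print(count_by_format(conn))
```

The resulting list of AIDs can then serve as the assay category over which promiscuity statistics are computed.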

Analysis of luciferase technology assays

Using the local relational database created from data as of April 2010, we identified a total of 257 assays using a design based on the luciferase-catalyzed conversion of luciferin substrates that results in the emission of light.²² Specifically, we annotated the following types of luciferase technology assays: reporter gene assays (105), cell viability assays (through detection of ATP, 82), ATP-coupled assays (other than viability assays, 35), luciferin-coupled assays (23), and enzyme (biochemical) activity assays (12). A histogram of assay types is shown in Supplemental Figure S2. We also identified and annotated the assay kits (Supplemental Figure S3).

Using the luciferase assay annotations, we computed promiscuity statistics for each compound that was tested in any of the luciferase assays. We developed a Pipeline Pilot (Accelrys) protocol that queries the relational database to determine how many different assays (of a luciferase technology category) each compound was tested in and in how many it was found active. This was done separately for single-concentration and concentration–response assays. To define active and inactive, we used the PubChem activity outcome endpoint. Although this is a subjective, “local” definition (each depositor can define active and inactive for each assay independently), we found it a useful first approximation. We calculated a PCIdx for each category as the quotient of the number of luciferase assays in which a substance was reported as active and the number of assays in which it was tested (see Methods, equation (1)). The larger the ratio of active luciferase assays to assays tested, the higher a compound’s promiscuity PCIdx. However, the significance of this promiscuity measure increases with the number of assays tested. We therefore visualized promiscuity by a scatter plot of PCIdx and the number of assays tested while also indicating the number of active assays (of each category) by color. Figure 1 shows compound promiscuities for the different luciferase technology categories for single-concentration and concentration–response assays (87 615 data points shown overall). It shows a large number of promiscuous compounds identified from viability and reporter gene assays, which we decided to investigate in greater detail. Supplemental Figure S4 illustrates promiscuities across all luciferase technology assays taken together.

The majority of viability assays were concentration–response series deposited by the NCGC. Figure 3 shows the most promiscuous cytotoxic compounds (by SID) identified by these assays. We have previously demonstrated that such data can be useful to model acute animal toxicity.²⁷ All of the compounds shown have been tested in 44 concentration–response assays and were categorized as active in more than 95% of them. For example, the toxicity of digitonin (SID 17389047) is related to its lipid (membrane) solubilizing properties. Most of the compounds are chemically reactive, which is likely the cause of their toxicity. Crystal violet (hexamethyl-p-rosaniline chloride; SID 17389869) and methylene blue (SID 17388909) are redox-active and electrophilic dyes, respectively. SIDs 17388695 and 17389115 are surfactants (phase transfer reagents), SID 17389451 is a reactive dihydroxyanthraquinone, SID 17389124 is used as a pesticide, and SID 17389974 is an alkylator.

Fig. 3.

Examples of highly promiscuous (cytotoxic) compounds in luciferase viability assays. All compounds have a promiscuity index between 0.95 and 1.0, were tested in 44 assays, and are active in at least 42 assays.

Although luciferase is often used in viability assays, its most common application is in reporter gene assays. To investigate promiscuous compounds in this category, we retrieved all substances that were active in at least five single-concentration and five concentration–response luciferase reporter gene assays. Figure 2 illustrates the PCIdx of the 161 compounds in each of the luciferase assay categories for dose–response (DR) and single-concentration (SC) assays after hierarchical clustering (see Methods). There are two major clusters of compounds. In one group, the compounds were also highly promiscuous across viability assays. This could be expected because broadly cytotoxic compounds should also show up as actives in luciferase reporter gene assays. Importantly, this pattern was immediately revealed by our analysis method, which used activity outcomes across all assays of each category in which a compound had been tested. In the other group, compounds also showed promiscuity in the category of luciferase enzyme inhibition assays. It is therefore likely that the mechanism responsible for their promiscuity across reporter gene assays is inhibition of the luciferase enzyme. Most of those compounds also showed high promiscuity indices in the other categories of luciferase assays. Supplemental Table S1 lists all 161 compounds corresponding to Figure 2 (by SID), their PCIdx values for each category, and the number of assays in which each compound was active versus the number of assays in which it was tested.

Figure 4 and Table 1 illustrate example chemical structures of both categories of highly promiscuous reporter gene compounds. The first row in Figure 4 and the first five entries in Table 1 show selected compounds that likely act by inhibiting the luciferase enzyme. They represent five different chemical classes, including the benzoyl-aryl-urea (SID 3717070) and the 3,5-disubstituted-1,2,4-triazole (SID 865680) scaffolds.¹⁶ The second row of Figure 4 and entries 6 to 10 in Table 1 show cytotoxic compounds that were broadly active across cell proliferation assays. They include reactive compounds such as an electron-deficient vinyl chloride (SID 24817234) and a Michael acceptor (SID 845529), as well as daunorubicin (SID 855534), which is a DNA intercalator used as a chemotherapeutic.

Fig. 4.

Selected examples of promiscuous compounds in luciferase reporter gene assays of two categories. Top row compounds were also active in luciferase enzyme inhibition assays. Bottom row compounds were active in viability assays. Refer to Table 1 for details.

Table 1.

Promiscuity Indices (PCIdx) and Number of Assays Tested and Found Active for Selected Promiscuous Compounds in Luciferase Reporter Gene Assays

	Concentration Response									Single Concentration
	Reporter Gene Assays			Viability Assays			Enzyme Activity Assays			Reporter Gene Assays			Viability Assays			Enzyme Activity Assays
Substance ID	PCIdx	Active	Tested	PCIdx	Active	Tested	PCIdx	Active	Tested	PCIdx	Active	Tested	PCIdx	Active	Tested	PCIdx	Active	Tested
4243980	0.86	12	14				0.50	1	2	0.54	7	13				0.50	1	2
3717070	0.75	9	12				1.00	1	1	0.36	8	22				0.50	1	2
865680	0.81	13	16				0.33	1	3	0.36	8	22				0.00	0	2
3714425	0.69	9	13				1.00	2	2	0.32	7	22				0.67	2	3
24821749	0.88	7	8				0.50	1	2	0.44	7	16				0.50	1	2
24817234	0.71	5	7	0.50	1	2				0.53	8	15	0.83	5	6
861918	0.73	8	11	1.00	1	1				0.36	8	22	0.60	6	10
855543	0.75	6	8	1.00	1	1				0.38	6	16	0.40	4	10
4246251	0.83	10	12	1.00	2	2				0.23	5	22	0.56	5	9
845529	0.71	10	14	0.50	2	4				0.56	15	27	0.55	6	11

This table corresponds to compounds in Figure 4. The top five compounds are luciferase enzyme inhibitors, and the bottom five are cytotoxic.
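As a worked check of how the PCIdx columns in Table 1 follow from the Active and Tested counts via equation (1), two rows are recomputed below for the reporter gene assays. The subscripts CR and SC (concentration response, single concentration) are notation introduced here for illustration only.

```latex
% SID 4243980, luciferase reporter gene assays
\mathrm{PCIdx}_{\mathrm{CR}} = \frac{12}{14} \approx 0.86, \qquad
\mathrm{PCIdx}_{\mathrm{SC}} = \frac{7}{13} \approx 0.54

% SID 845529, luciferase reporter gene assays
\mathrm{PCIdx}_{\mathrm{CR}} = \frac{10}{14} \approx 0.71, \qquad
\mathrm{PCIdx}_{\mathrm{SC}} = \frac{15}{27} \approx 0.56
```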

β-lactamase versus luciferase reporter gene assays

Another widely used assay reporter technology relies on β-lactamase.²³ Most of the implementations use fluorescence resonance energy transfer (FRET) substrates, resulting in a fluorescence shift upon hydrolysis of the β-lactam.²⁸ As of April 2010, we annotated 92 β-lactamase assays, 74 of which were reporter gene assays (Supplemental Figure S5A). Supplemental Figure S5B shows the assay kits used. To identify small-molecule structural classes that were active in a large percentage of the β-lactamase technology assays tested in PubChem, we performed an analysis similar to that for luciferase-based assays. Supplemental Figure S6 shows the compounds’ promiscuity plots for β-lactamase enzyme activity and β-lactamase reporter gene assays, separately for single-concentration and concentration–response assays. From Supplemental Figure S6, several interesting classes of compounds can be identified, including some subtle ones. For example, from quadrant A (biochemical β-lactamase enzyme activity measured by concentration–response assays), a series of 2-alkylsulfonyl-1,3,4-oxadiazoles could be identified, which had previously been demonstrated to covalently modify the enzyme, resulting in its inhibition.¹⁷ Because of its mechanism, this chemotype shows activity in many other assay types (not shown). However, many more compounds can be identified as highly promiscuous among the β-lactamase reporter gene assays.

For further analysis, we selected compounds that had a PCIdx of at least 0.5 and had been tested in at least 10 reporter gene assays (in either single-concentration or concentration–response assays). These compounds were clustered by MCS. Some of the most promiscuous clusters are shown in Figure 5 by their MCS scaffolds. Supplemental Table S2 includes all 97 compounds, their MCS scaffolds, cluster details, PCIdx, and the number of active and tested assays. Interestingly, in contrast to the promiscuous luciferase reporter gene compounds (Fig. 4 and Supplemental Table S1), these compounds formed more pronounced (larger) clusters. The mechanism of promiscuity was not immediately obvious from this analysis. However, we hypothesize that compounds in cluster 1 inhibit the β-lactamase enzyme because these compounds were also promiscuously active in the biochemical β-lactamase inhibition assays. Some of the other series had reactive functional groups, for example, cluster 2 (Fig. 5) or clusters 7 and 4 (Supplemental Table S2), and could therefore be toxic or react chemically with the reporter enzyme or other proteins in the pathways upstream of the promoter.
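The selection criterion can be expressed as a small filter. The sketch below is illustrative only: the substance labels and counts are invented, and it assumes one reading of the criterion, namely that both thresholds must be met within the same assay type (single-concentration or concentration–response).

```python
def is_highly_promiscuous(stats, min_pcidx=0.5, min_tested=10):
    """Return True if any assay type meets both thresholds.

    stats: {method: (n_active, n_tested)} for one substance within one
    reporter technology. Assumption: PCIdx and the tested count are
    evaluated per assay type, not pooled across types.
    """
    return any(
        n_tested >= min_tested and n_active / n_tested >= min_pcidx
        for n_active, n_tested in stats.values()
    )

# Hypothetical substances with (active, tested) counts per assay type.
candidates = {
    "SID_A": {"screening": (12, 20), "confirmatory": (1, 3)},  # 0.60 over 20 assays
    "SID_B": {"screening": (4, 20), "confirmatory": (6, 8)},   # 0.75, but only 8 tested
    "SID_C": {"confirmatory": (5, 10)},                        # exactly at both thresholds
}
selected = sorted(sid for sid, stats in candidates.items()
                  if is_highly_promiscuous(stats))
print(selected)
```

The minimum number of tested assays guards against compounds that reach a high PCIdx from only a handful of assays.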

Fig. 5.

Representative chemical scaffolds of the most promiscuous compounds in β-lactamase reporter gene assays (see Supplemental Table S2 for compounds from all series).

To further investigate how the promiscuity mechanisms differed between luciferase and β-lactamase reporter gene assays, we performed a pairwise comparison of all highly promiscuous compounds across the two technologies: 102 compounds that were active against the majority of luciferase reporter gene assays versus 97 compounds that were active against the majority of β-lactamase reporter gene assays. Compounds were selected if they had a PCIdx ≥0.5 and had been tested in at least 10 assays of their respective reporter technology. Figure 6 shows the histogram of the maximum similarity of each compound in one group to any compound in the other group (see Methods). The complete similarity matrix and the histogram of all pairwise similarities are provided in Supplemental Figures S7 and S8. Figure 6 (and Supplemental Figures S7 and S8) indicated that for most of the compounds active against β-lactamase, there was no significantly similar compound active against luciferase. This supports distinct mechanisms of nonspecific chemical interference in luciferase and β-lactamase reporter gene assays. Chemical classes that were promiscuous in both luciferase and β-lactamase reporter gene assays are shown in Figure 7 (all compounds are provided in Supplemental Table S3). Their generic mechanisms appeared to include high chemical reactivity (e.g., SID 14729238 or SID 4251553) and general toxicity (e.g., emetine [SID 855836],²⁷ a protein synthesis inhibitor). However, the results suggested that other mechanisms are likely to exist; for example, staurosporine (SID 11532977) is one of the most promiscuous pan-kinase inhibitors.²⁹
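The maximum-similarity comparison between the two compound groups can be sketched as follows. This is an illustration only, not the workflow used in the study: fingerprints are represented here as plain Python sets of integer feature identifiers standing in for ECFP4 features, and the compound names and feature values are hypothetical.

```python
def tanimoto(fp_a: set, fp_b: set) -> float:
    """Tanimoto coefficient of two fingerprints given as sets of feature IDs."""
    if not fp_a and not fp_b:
        return 0.0  # convention chosen here for two empty fingerprints
    shared = len(fp_a & fp_b)
    return shared / (len(fp_a) + len(fp_b) - shared)


def max_similarity_to_group(query_group: dict, reference_group: dict) -> dict:
    """For each query compound, the highest Tanimoto similarity to any reference compound."""
    return {
        name: max(
            (tanimoto(fp, ref_fp) for ref_fp in reference_group.values()),
            default=0.0,
        )
        for name, fp in query_group.items()
    }


# Hypothetical feature sets; real ECFP4 fingerprints would come from a cheminformatics toolkit.
luciferase = {"luc_1": {1, 2, 3, 4}, "luc_2": {10, 11, 12}}
beta_lactamase = {"bla_1": {1, 2, 3, 5}, "bla_2": {20, 21}}

best = max_similarity_to_group(luciferase, beta_lactamase)
print(best)
```

A histogram of the values in `best` corresponds to the kind of plot described for Figure 6: if most maxima are low, few compounds in one group have a close structural analog in the other.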

Fig. 6.

Histogram of the maximum pairwise Tanimoto similarities of each of the 102 most promiscuous luciferase reporter gene compounds compared with the 97 most promiscuous β-lactamase reporter gene compounds. Tanimoto similarities were computed using ECFP4 fingerprints. Most promiscuous compounds were defined as those with a promiscuity index (PCIdx) ≥0.5 and that were tested in at least 10 assays. See Supplemental Figure S7 for the full similarity matrix and Supplemental Figure S8 for the histogram of all pairwise similarities.

Fig. 7.

Compounds representing structural classes that show promiscuous activity across luciferase and β-lactamase reporter gene assays. Supplemental Table S3 includes all compounds and their cluster details.

Discussion

The BioAssay Ontology is the first public effort to develop a formal knowledge-based description of HTS assays and screening outcomes.²¹ The value of large public data repositories such as PubChem will ultimately be determined by how well researchers are able to use the information to extract knowledge as a starting point for new research and drug development. Their usefulness will largely be determined by two factors: (1) the content and quality of data in the repository and (2) the ability to retrieve relevant results. The ability to identify, aggregate, and analyze data from various assays that are related to a project of interest is particularly important. BAO primarily addresses this second aspect, but it will also help to analyze data quality by identifying redundancies and related data. While developing BAO, we have annotated more than 350 PubChem assays to organize them by concepts that are relevant to interpreting HTS results. Specifically, we investigated assays based on designs that use the luciferase-catalyzed conversion of luciferin substrates, resulting in luminescence, and assays that detect β-lactamase via FRET substrates. In contrast to previous reports that focused mostly on individual screening campaigns, BAO has enabled a systematic analysis of many related assays to generate results that could not be obtained from individual screens. Our promiscuity analyses also demonstrated clearly that there is valuable information in the PubChem repository beyond individual screening campaigns and that the BAO descriptions can facilitate the extraction of new knowledge from large numbers of related data sets.

Among assays employing luciferase technologies, we identified five subcategories: reporter gene assays, viability assays, ATP-coupled and luciferin-coupled enzyme activity, and biochemical luciferase enzyme activity (Supplemental Figure S2). Analyzing compound promiscuity in viability assays revealed the most generally cytotoxic compounds and compound classes. Many of these assays were performed by the NCGC with compounds that were also studied at the Environmental Protection Agency (EPA).³⁰ Toxicity for these highly promiscuous compounds can be mediated by several mechanisms, as illustrated by our examples. One common and expected theme that could readily be identified for many of these compounds was that chemical reactivity is related to their cytotoxic effects (Figs. 3 and 4).

The majority of the annotated luciferase assays belong to the category of reporter gene assays. We identified the most promiscuous compounds in both single-concentration and concentration–response assays, based on the promiscuity index and the number of luciferase reporter gene assays in which a compound was screened. The identified chemotypes are of interest because it is likely that they will be identified in future luciferase reporter gene assays. The fact that many of the most promiscuous luciferase reporter gene compounds have been tested in concentration–response assays indicates that they were selected as interesting hits from primary assays. On the basis of our observations, researchers would be well advised to exclude these compounds from follow-up studies because they act via a mechanism that is related to the assay technology and not the biological target of interest. Calculated promiscuities of these compounds in the different subtypes of assays that use luciferase in their design suggested two likely mechanisms of action. One was related to cell viability/toxicity and the other to inhibition of the luciferase enzyme. We have shown specific examples for both cases (Figs. 2 and 4; Table 1; and Supplemental Table S1). The analysis presented here was relatively simple because it did not take into consideration variations in assay conditions or the different luciferase enzymes used. We nevertheless could identify many promiscuous and undesired chemotypes, making this information useful for flagging primary screening hits that should be treated with caution. Our simple analysis, which relies on results from many different assays, is thus an effective approach to help identify undesirable compounds and eliminate them before additional resources are spent during the hit verification, lead identification, and optimization stages. Moreover, this computational analysis could also be used to develop hypotheses on the mechanism of compound promiscuity.

We then performed a similar promiscuity analysis for β-lactamase reporter gene assays to identify chemotypes that were nonspecifically active in this category of assays (Fig. 5; Supplemental Figure S6 and Supplemental Table S2). The rationale was the same as for luciferase reporter gene assays: to identify and exclude undesirable hit compounds as early as possible in the discovery and optimization pipeline. In contrast to luciferase reporter gene assays, the β-lactamase promiscuous compounds formed more pronounced, larger clusters after maximum common substructure clustering. This could be due to the composition of the library or because the larger number of luciferase (compared with β-lactamase) reporter gene assays selected a more diverse set of highly promiscuous compounds.

Pairwise comparison of the most promiscuous compounds in luciferase versus β-lactamase reporter gene assays showed that, with a few exceptions, their chemical spaces do not overlap (Fig. 6; Supplemental Figures S7 and S8). This suggests distinct mechanisms of promiscuity that are specific to the reporter technology. Although this may be expected, such a quantitative analysis using a large number of assays is relevant for data analysis and is also directly relevant to HTS assay development. Our analysis demonstrates that the two reporter technologies are orthogonal to one another because they are prone to distinct chemotypes of artifactual hits. Compounds that were identified as promiscuous in both luciferase and β-lactamase reporter gene assays appear generally cytotoxic because of their chemical reactivity or a mechanism unrelated to the reporter, for example, nonselective kinase inhibition (Fig. 7; Supplemental Table S3).

In summary, we have systematically analyzed data from a large number of assays in PubChem to identify compounds that are promiscuously active in assays of specific designs and technologies and via distinct mechanisms of action. Such an analysis is only possible with the detailed annotations that we made based on the BioAssay Ontology, the first reported ontology to formally describe HTS assays and assay outcomes. There are many advantages of a formal description of bioassays and standardized annotations of data sets such as those in PubChem. Here we demonstrated that analyses across many assays are facilitated by standardized annotations such as those produced by BAO and that the results can provide insights that cannot be obtained by analyzing individual data sets. This is particularly relevant for relatively noisy primary HTS results. Analysis across many assays of the same type can also be expected to be more robust than analyses focused on individual data sets. Although HTS data contain false positives and false negatives, the BAO approach does not rely on each individual data point but requires only that the ensemble of results reflects the correct trend (i.e., the fraction of the experiments of a certain category in which a compound is found active).

Although undesirable and reactive chemical functionalities that are prone to cause false positives in HTS have been reported in the past,³¹ the definition of undesired chemical substructures to a large extent depends on the specific assay technologies and biological targets; for example, in some applications, covalent modifiers may be acceptable or even desired, but in others, they have to be excluded. BAO provides a means to identify undesirable chemical substructures in a data-driven manner specific to the assay technologies or biological meta-targets that are covered by BAO. With the type of analysis presented here, it would thus be possible to identify undesirable chemotypes that are specifically relevant to a given discovery project.

We would not recommend removing all compounds that show promiscuity from a screening library a priori but rather flagging them, because such compounds can still be of interest for certain targets, and because orthogonal assay designs and detection technologies are prone to structurally different artifacts (as we have shown for luciferase and β-lactamase reporter gene assays). By the same token, certain chemotypes may cause artifacts across a large number of assay technologies and biological targets, and these could be removed to improve a screening collection. This will require more comprehensive analyses. We are currently annotating more assays from PubChem and will perform similar analyses for various other categories. The curation effort is time-consuming and not an effective long-term strategy to standardize data. Although a certain amount of curation will likely be required to consolidate terminology, it would be desirable to add BAO-type annotations at the stage of assay deposition and to make these annotations available in the primary data sources such as PubChem. BAO is available from our Web site.²¹

As the number of available data sets increases, the type of analyses presented here would have to be repeated periodically to comprehensively and accurately identify promiscuous compounds of a certain category. However, this is a straightforward undertaking, given standardized assay annotations and endpoints. Using BAO annotations and standardized endpoints, we are also currently working on developing predictive classifiers from quantitative outcomes of luciferase assays. Such classifiers could then be used to automatically flag potentially promiscuous compounds.

The BAO software under development²¹ will facilitate the query, exploration, and downloading of curated HTS data by BAO terms and thus will also facilitate the identification of promiscuous compounds for specific assay technologies.

Footnotes

Acknowledgements

The work presented here was supported by NIH grant RC2 HG005668. We acknowledge resources from the Center for Computational Science of the University of Miami. Vance Lemmon holds the Walter G. Ross Distinguished Chair in Developmental Neuroscience.

Supplementary material for this article is available on the Journal of Biomolecular Screening Web site.

References

1. Mayr, L. M.; Bojanic. Novel Trends in High-Throughput Screening. Curr. Opin. Pharmacol. 2009, 9, 580–588.

2. Austin, C. P.; Brady, L. S.; Insel, T. R.; Collins, F. S. NIH Molecular Libraries Initiative. Science 2004, 306, 1138–1139.

3. Schürer; Tsinoremas. Screening Informatics. In A Practical Guide to Assay Development and High-Throughput Screening in Drug Discovery; Chen, T., Ed.; Taylor & Francis: Oxford, UK, 2009.

4. Ling, X. B. High Throughput Screening Informatics. Comb. Chem. High Throughput Screen. 2008, 11, 249–257.

5. Torr-Brown. Advances in Knowledge Management for Pharmaceutical Research and Development. Curr. Opin. Drug Discov. Dev. 2005, 8, 316–322.

6. PubChem Project. http://pubchem.ncbi.nlm.nih.gov/

7. Wang; Bolton; Dracheva; Karapetyan; Shoemaker, B. A.; Suzek, T. O.; Wang; Xiao; Zhang; Bryant, S. H. An Overview of the PubChem BioAssay Resource. Nucleic Acids Res. 2010, 38, D255–D266.

8. ChEMBL Database. http://www.ebi.ac.uk/chembldb/index.php

9. PDSP Ki Database. http://pdsp.med.unc.edu/kidb.php

10. Jensen, N. H.; Roth, B. L. Massively Parallel Screening of the Receptorome. Comb. Chem. High Throughput Screen. 2008, 11, 420–426.

11. ChemBank. http://chembank.broad.harvard.edu/

12. Seiler, K. P.; George, G. A.; Happ, M. P.; Bodycombe, N. E.; Carrinski, H. A.; Norton; Brudz; Sullivan, J. P.; Muhlich; Serrano; et al. ChemBank: A Small-Molecule Screening and Cheminformatics Resource Database. Nucleic Acids Res. 2008, 36, D351–D359.

13. Collaborative Drug Discovery. http://www.collaborativedrug.com/

14. Hohman; Gregory; Chibal; Smith, P. J.; Ekins; Bunin; Bradford; Dole; Spektor; Blondeau; et al. Novel Web-Based Tools Combining Chemistry Informatics, Biology and Social Networks for Drug Discovery. Drug Discov. Today 2009, 14, 261–270.

15. Inglese; Shamu, C. E.; Guy, R. K. Reporting Data from High-Throughput Screening of Small-Molecule Libraries. Nat. Chem. Biol. 2007, 3, 438–441.

16. Auld, D. S.; Southall, N. T.; Jadhav; Johnson, R. L.; Diller, D. J.; Simeonov; Austin, C. P.; Inglese. Characterization of Chemical Libraries for Luciferase Inhibitory Activity. J. Med. Chem. 2008, 51, 2372–2386.

17. Babaoglu; Simeonov; Irwin, J. J.; Nelson, M. E.; Feng; Thomas, C. J.; Cancian; Costi, M. P.; Maltby, D. A.; Jadhav; et al. Comprehensive Mechanistic Analysis of Hits from High-Throughput and Docking Screens against Beta-Lactamase. J. Med. Chem. 2008, 51, 2502–2511.

18. Southern; Griffin. A Java API for Working with PubChem Data-Sets. Bioinformatics 2011, 27, 741–742.

19. Pipeline Pilot 8.0. San Diego, CA: Accelrys, 2010.

20. Shannon; Markiel; Ozier; Baliga, N. S.; Wang, J. T.; Ramage; Amin; Schwikowski; Ideker. Cytoscape: A Software Environment for Integrated Models of Biomolecular Interaction Networks. Genome Res. 2003, 13, 2498–2504.

21. BioAssay Ontology. http://www.bioassayontology.org/

22. Fan; Wood, K. V. Bioluminescent Assays for High-Throughput Screening. Assay Drug Dev. Technol. 2007, 5, 127–136.

23. Qureshi, S. A. Beta-Lactamase: An Ideal Reporter System for Monitoring Gene Expression in Live Eukaryotic Cells. Biotechniques 2007, 42, 91–96.

24. Spotfire DecisionSite 9.0. TIBCO: Palo Alto, CA, 2007.

25. ChemAxon JChem Software Suite. http://www.chemaxon.com/

26. Rogers; Hahn. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50, 742–754.

27. Guha; Schurer, S. C. Utilizing High Throughput Screening Data for Predictive Toxicology Models: Protocols and Application to MLSCN Assays. J. Comput. Aided Mol. Des. 2008, 22, 367–384.

28. Zlokarnik; Negulescu, P. A.; Knapp, T. E.; Mere; Burres; Feng; Whitney; Roemer; Tsien, R. Y. Quantitation of Transcription and Clonal Selection of Single Living Cells with Beta-Lactamase as Reporter. Science 1998, 279, 84–88.

29. Nakano; Omura. Chemical Biology of Natural Indolocarbazole Products: 30 Years since the Discovery of Staurosporine. J. Antibiot. (Tokyo) 2009, 62, 17–26.

30. Xia; Huang; Witt, K. L.; Southall; Fostel; Cho, M. H.; Jadhav; Smith, C. S.; Inglese; Portier, C. J.; et al. Compound Cytotoxicity Profiling Using Quantitative High-Throughput Screening. Environ. Health Perspect. 2008, 116, 284–291.

31. Rishton, G. M. Reactive Compounds and In Vitro False Positives in HTS. Drug Discov. Today 1997, 2, 382–384.
