Abstract

We are pleased to work with the Journal of Biomolecular Screening (JBS) to present a special issue on generating Knowledge from Small-Molecule Screening and Profiling Data.
Since its inception as an avenue for probe- and drug-discovery activities, high-throughput screening (HTS) has produced large data sets with great potential to enrich our understanding of the interactions of chemical matter with biological systems. In most cases, such activities focus on discovering a “needle in a haystack” to perturb a particular cellular process (probe) or find a starting point (lead) from which a treatment (drug) might be developed for a disease.
High-throughput diversity screening is a well-established activity in pharmaceutical and agrochemical research, with a body of evidence to support its effectiveness. 1 Indeed, HTS can no longer be considered a novel or disruptive technology, given it has been practiced for around 20 years! That does not mean, however, that there is nothing to improve. In the realm of data analysis, this special issue illustrates how much there is still to do, and how much still to learn, particularly as screening technology allows us to study multiparametric cellular responses.
With the advent of chemical biology repositories like ChemBank, 2 PubChem, 3 and ChEMBL, 4 it rapidly became clear that small-molecule screening data sets on common compound collections could be more than the sum of their parts, especially when aggregated and publicly shared. The US National Institutes of Health (NIH) Roadmap formally recognized this reality 10 years ago by establishing the Molecular Libraries Program (initially the Molecular Libraries Screening Network [MLSCN] and subsequently the Molecular Libraries Probe Production Centers Network [MLPCN]). Increasingly, as data from this network’s activities have become available via PubChem, 3 and later the BioAssay Research Database (BARD), 5 it has become possible to imagine cross-sectional analyses that make use of data from multiple experiments simultaneously, even those performed by separate investigators worldwide.
This special issue on Knowledge from Small-Molecule Screening and Profiling Data opens with a review article 5 that presents perspective on the development of BARD, a fourth-generation repository and knowledge environment for small-molecule science. At its core, BARD aims to present the successful experiment in public probe discovery undertaken by the NIH Roadmap and to contextualize screening and follow-up data from multiple diverse Network Centers collected over multiple years. BARD also aims to pave the way for future work in chemical biology research using structured vocabularies to describe assays in a way that is amenable to rapid search, filtering, and computational analysis. Two additional reviews from Abraham et al. 6 and Singh et al. 7 provide an overview of multiparametric analyses and suggest that more information could be extracted from future high-content screens through better data analysis.
The original reports of this JBS special issue feature multiple perspectives on the maturation of high-throughput and high-content screening as technologies. Assay quality is still a key determinant of the effectiveness of screening and can be aided by advances in both process and statistics. Zhang et al. 8 describe a novel approach to testing whether a screen is fit for its purpose, while Murie et al. 9 propose a statistical method for dealing with screens that have a high (real) hit rate.
High-content or phenotypic screens are increasingly high throughput and often used directly for lead discovery. This expanded scope has highlighted the need for more sophisticated data-analysis methods to include multiparametric endpoints and imaging. Haney 10 illustrates the importance of visualization and understanding the underlying distribution in high-content data sets. Smith and Horvath 11 offer a novel approach to the challenging area of phenotypic screening analysis, while Bornot et al. 12 show the value of using historical data to aid data analysis.
Once a primary screen is complete, hits often require further triage and prioritization. This step has typically been performed through specificity or selectivity assays that disqualify a hit, but biophysical techniques now offer the possibility of confirming hits via direct binding methods. Genick et al. 13 provide an account of the application of biophysics in a large pharmaceutical company screening environment. When molecule libraries and miniaturized assays mix, there is always the potential for false positives due to an unwanted mechanism, often very specific to the assay technology in use. Schorpp et al. 14 provide a case study in the identification of frequent hitters for the AlphaScreen technology.
A very large study, across multiple assay technologies, is detailed by Hansson et al., 15 illustrating a number of trends that can be observed across a large collection of molecules when applied to many years’ worth of screening data. Many chemists will not be surprised to see an old friend, lipophilicity, appear as a cause of promiscuity in molecules. Large data sets such as these give much opportunity for algorithms to find interesting relationships, such as chemical scaffolds that appear enriched in a single screen or multiple screens. Two groups describe the application of such methods to screening or profiling data to identify small-molecule scaffolds (Wawer et al. 16 ) or natural product motifs (Coma et al. 17 ). Although most of the articles presented in this issue focus on HTS, Beresini et al. 18 remind us there are other ways to use a collection of molecules and that screening a subset can often deliver what is required when full HTS is not possible.
As more data are produced to aid activities not directly associated with the original screen, Dancik et al. 19 use the data to compute similarities in biological response between molecules that might have very different chemical structures, while Swamidass et al. 20 and Jaeger et al. 21 describe how these data can be used to build network models that connect assays, phenotypes, and disease.
As experience with mature public chemical-biology data sets has shown, 2-4 one of the key challenges in integrating data collected at multiple laboratories is connecting metadata (descriptions of experiments) across the many different ways researchers choose to describe their science. The review article on BARD 5 provides one perspective on these challenges, and an original report from the Library of Integrated Network-Based Cellular Signatures (LINCS) Network of NIH-funded Centers 22 provides a detailed account of how that network is addressing these issues. As the volume and complexity of screening and profiling data continue to grow, additional work will be needed to ensure facile interoperability between data sets.
Again, we are delighted to present this special issue to you as a broad and diverse collection of research and perspectives on generating, mining, and interpreting data from high-throughput and high-content experiments directed at probe and drug discovery.
Darren V. S. Green, PhD
Paul A. Clemons, PhD
Footnotes
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors received no financial support for the research, authorship, and/or publication of this article.
