Abstract
The National Institutes of Health Library of Integrated Network-based Cellular Signatures (LINCS) program is generating extensive multidimensional data sets, including biochemical, genome-wide transcriptional, and phenotypic cellular response signatures to a variety of small-molecule and genetic perturbations with the goal of creating a sustainable, widely applicable, and readily accessible systems biology knowledge resource. Integration and analysis of diverse LINCS data sets depend on the availability of sufficient metadata to describe the assays and screening results and on their syntactic, structural, and semantic consistency. Here we report metadata specifications for the most important molecular and cellular components and recommend them for adoption beyond the LINCS project. We focus on the minimum required information to model LINCS assays and results based on a number of use cases, and we recommend controlled terminologies and ontologies to annotate assays with syntactic consistency and semantic integrity. We also report specifications for a simple annotation format (SAF) to describe assays and screening results based on our metadata specifications with explicit controlled vocabularies. SAF specifically serves to programmatically access and exchange LINCS data as a prerequisite for a distributed information management infrastructure. We applied the metadata specifications to annotate large numbers of LINCS cell lines, proteins, and small molecules. The resources generated and presented here are freely available.
Introduction
Modern high-throughput screening based on miniaturized assay technologies has enabled the production of vast data sets in the life sciences, including genomics, proteomics, transcriptomics, and chemical biology. During the past decade, both the number of publicly funded data production projects and the size of data sets have been rising dramatically, providing access to unprecedented amounts and diversity of data in the public domain. Examples of such projects funded by the National Institutes of Health (NIH) include The Cancer Genome Atlas (TCGA), 1 the Encyclopedia Of DNA Elements (ENCODE) project, 2 the Cancer Target Discovery and Development (CTD²) Network, 3 and the Molecular Libraries Probe Center Network (MLPCN). 4
Here we focus on a more recent NIH-funded project, the Library of Integrated Network-based Cellular Signatures (LINCS) program. 5 The LINCS project aims to generate an extensive reference set of cellular response data to a variety of small-molecule and genetic perturbations with the goal of improving our understanding of complex human diseases, such as cancer. Common patterns from these data (signatures) include information about gene transcription, protein binding, cell proliferation, cell signaling, and other cellular phenotypes. LINCS assays span a variety of technologies, model systems, readouts, and perturbations. To produce an integrated view across the diverse LINCS data resources requires (1) defining which biological entities and concepts, experimental parameters, and results must be included in such an integrated view; (2) uniquely identifying the entities of interest, such as small-molecule compounds, proteins, cells, siRNAs, and so forth, so that they can be unambiguously associated with the assays and the screening results; and (3) standardized data formats in which data sets can be exchanged or queried. A fundamental requirement of useful metadata standards for LINCS, and other projects, is their free and open accessibility and well-defined relationships with other standards.
Types of standards that are relevant for reporting biological screening experiments and results include (1) minimum information checklists, (2) controlled vocabularies and ontologies, and (3) data format specifications. Various minimum information specifications have been developed to facilitate reproducibility and critical evaluation and interpretation of biological experiments and their results by others. Such standards relevant to LINCS include Minimum Information About a Cellular Assay (MIACA), 6 Minimum Information About an RNAi Experiment (MIARE), 7 Minimum Information About a Protein Affinity Reagent (MIAPAR), 8 and Minimum Information About a Bioactive Entity (MIABE). 6 These are available via the Minimum Information for Biological and Biomedical Investigations (MIBBI) project. 9 MIBBI checklists are now part of the larger BioSharing effort, 10 which also catalogs other standards (such as terminologies) and databases that use such standards. The ISA framework, including the ISA-Tab file format and software tools, enables the use of such standards; ISA refers to the specific metadata categories “Investigation,” “Study,” and “Assay.” Among many projects, it has also been used at LINCS. 11
Many controlled vocabularies and biomedical ontologies exist, and several have become widely used as standards, such as medical subject headings (MeSH) 12 and the Gene Ontology (GO). 13 However, existing vocabularies and ontologies are still far from comprehensive, and in many cases, ontologies have been developed for specific purposes and are not mapped to one another, thus complicating unique identification of biological entities across domains. 14 To address this challenge in the domain of chemical biology high-throughput screening, we have recently developed BioAssay Ontology (BAO) and demonstrated its utility in classification and analysis of screening experiments and results.15–17 We leveraged BAO and several other ontologies to develop the metadata terminologies required to integrate, interpret, and analyze LINCS data.
Because of the scale and diversity of data generated, the LINCS consortium does not maintain a central repository containing all data. Towards building a distributed federated LINCS information infrastructure, we have developed data format specifications to facilitate exchange and integrated access of LINCS data across the consortium via Web services.
In this article, we describe the metadata standards developed in the LINCS consortium with the goal of generating an integrated view across the diverse LINCS data resources as described above. The metadata standards and annotated data sets, including cell lines, proteins, and small molecules, are freely available for download at the LINCS, 5 LINCS Information FramEwork (LIFE), 18 and Harvard Medical School (HMS) LINCS 19 Web sites.
Data and Methods
LINCS Assays and Data
Data generated in the LINCS project are described on the LINCS Web site 5 and links to individual LINCS Center Web sites therein. Briefly, data considered for the current version of metadata standards include transcript expression data and biochemical and cell phenotypic responses obtained with a variety of assay technologies. Landmark gene (L1000) expression signatures were generated by using multiplex ligation-mediated amplification with the Luminex FlexMAP optically addressed and barcoded microsphere and a flow cytometric detection system. 20 The LINCS L1000 data, along with the original Connectivity Map (v1) data, are available via the LINCS Connectivity Map Project (LINCS cloud). 21 Kinase biochemical profiles are generated using either the DiscoveRx KINOMEscan 22 technology, which is based on a competition binding assay and phage tag PCR amplification, or the KiNativ 23 proteomics assay, which is based on labeling active kinase lysine sites with biotinylated ATP or ADP probes and mass spectrometry detection. Cell-based assays are read out via imaging or bulk fluorescence measurement to quantify phenotypic responses. These data are available via the HMS LINCS Explorer. 24 LINCS data across the consortium can be queried and explored via the LIFE search engine. 25
Metadata Standards Development
LINCS metadata standards were developed in the LINCS Data Working Group (DWG). We set up a DWG private Google Web site/wiki and used Google spreadsheets linked to the Web site to enable convenient sharing and collaborative authoring of the metadata standards with change control. The site and documents have 180 registered users, so a relatively large group has access to the DWG activities and provides input. The DWG documented various use cases related to research and tools development goals of the LINCS consortium. We prioritized an initial list of use cases that were relevant to guide the development of the herein reported metadata standards (see the Results section). For each use case, the relevant LINCS assays (and result types) were listed, and required parameters and annotations for screening result sets were determined as the basis for formalizing relevant and important metadata. We first focused on assay reagents (molecular entities and model systems) used to carry out LINCS assays, specifically cells (primary cells and cell lines), proteins, small molecules, siRNA/shRNA, antibodies, and “other” reagents that do not fit any of the previous categories. We reviewed and summarized applicable elements from various minimum information standards, 9 including MIAME, MIACA, MIAPAR, MIAPE-MSI, MIAPE-MS, MIABE, MIQE, MIFlowCyt, and MIARE. For each of the reagent categories, we created a Google shared spreadsheet that lists all metadata entities describing reagents of that category (compare Tables 1 and 2).
Table 1. Metadata Categories Required for the Development of the Metadata Specifications for the LINCS Assays.
Table 2. Selected Metadata Standards Fields for LINCS Reagent Categories Cell Line, Protein Reagent, Small Molecule, siRNA/shRNA, and Antibody.
Assay Simple Annotation Format
We first developed the requirements of the data format to encode the annotation for LINCS assays and screening results. The primary purpose of the simple annotation format (SAF) is to facilitate programmatic data exchange. We defined specific requirements: the data format must work seamlessly with JavaScript and Web services, in particular a representational state transfer (REST) application programming interface (API); it should support a wide variety of applications; it must be easy to process and to write applications for; and it should be reasonably simple and human readable. JSON, a lightweight data-interchange format, 26 fits these requirements well and thus is a straightforward choice (compared with XML, for example). For each assay, we worked out the fields, data types, and content required to exchange the information and how they are linked to the assay metadata standards and controlled vocabularies. Specifically, BAO version 2.0 classes and the LINCS metadata standards were used to annotate specific assay types from HMS LINCS DB (http://lincs.hms.harvard.edu/db/), and SAF annotations were developed for each assay type by mapping HMS LINCS DB field names to specific SAF elements, which rely on classes from the BAO and corresponding LINCS metadata representations. SAF includes separate sections for the assay annotations and the result sets, which are encoded as tag-value pairs (see the Results section). SAF files thus represent a portable, database-independent means of exchanging these annotations. Full SAF specifications are available at http://lifekb.org/index.php/dcc/SAF. We have made SAF-annotated screening results available through the HMS LINCS DB Web services API; instructions and documentation are available at http://lincs.hms.harvard.edu/resources/software/hms-lincs-database/. SAF-annotated screening results are pulled from this service to upload results into the LIFE software system developed at the University of Miami. 25
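As a rough illustration of the structure described above, the following Python sketch parses a SAF-like JSON object with separate assay-annotation and result-set sections encoded as tag-value pairs. The tag names (`assay_annotations`, `result_sets`, `assay_name`, and so on), the IDs, and the values are hypothetical placeholders chosen for this sketch; the actual element names are defined in the full SAF specifications.

```python
import json

# Hypothetical SAF-like document. Tag names, IDs, and values are illustrative
# assumptions and are not taken from the official SAF specification.
saf_text = """
{
  "assay_annotations": {
    "assay_name": "KINOMEscan assay",
    "endpoint": "percentage control"
  },
  "result_sets": [
    {"small_molecule_id": "SM-0001", "protein_id": "PR-0001", "value": 35.0},
    {"small_molecule_id": "SM-0001", "protein_id": "PR-0002", "value": 98.5}
  ]
}
"""


def parse_saf(text):
    """Parse a SAF-like JSON string and check that both sections are present."""
    doc = json.loads(text)
    for section in ("assay_annotations", "result_sets"):
        if section not in doc:
            raise ValueError(f"missing section: {section}")
    return doc


saf = parse_saf(saf_text)
assay_name = saf["assay_annotations"]["assay_name"]
n_results = len(saf["result_sets"])
print(assay_name, n_results)
```

Because the object is plain JSON, the same structure can be consumed without extra tooling by any language with a JSON parser, which is the portability argument made for the format.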
Results
LINCS Use Cases
One of the central goals of the LINCS project is to evolve more comprehensive systems-level views of normal and diseased states of cellular systems that can be applied for the development of new biomarkers and therapeutics. Toward that goal, the LINCS consortium is cataloging, integrating, and analyzing changes in gene expression and other cellular processes that occur as a response to different types of perturbations. Various LINCS consortium use cases were documented at the DWG site to coordinate the development of LINCS tools, including data integration and analysis, new algorithms, end-user software tools, and user interfaces. The simple use cases that guided the development of the metadata standards reported here, and that ensure LINCS data sets can be annotated to support these goals, include (1) identifying screening model systems related to a specific disease or a disease group of interest or a particular tissue or organ of interest, (2) identifying small-molecule compounds active against a specific kinase target of interest, (3) querying a broad kinase binding profile for a kinase inhibitor of interest, (4) identifying small-molecule compounds that inhibit cell growth in cell lines associated with a disease of interest, (5) identifying small-molecule compounds with a protein target that corresponds to the gene target of a reference siRNA/shRNA, and (6) querying gene expression signatures for a small molecule of interest (for example, one that inhibits a kinase of interest).
Following the approach described in the Methods section, we first reviewed the data types, detection technologies, and assay formats currently used in LINCS assays. We then developed lists of metadata terms required to annotate LINCS assays and screening results, including recommended terminologies (vocabularies). We started with the following LINCS assay types: apoptosis assay, cell cycle state assay, small-molecule binding assay (KINOMEscan and KiNativ), cell viability assay, and L1000 transcriptional response profiling assay. Table 1 shows the required metadata categories to be associated with these assay types. Figure 1 summarizes how the proposed reagent metadata standards relate to selected LINCS assays (and results) and to other important concepts, such as proteins and genes, that relate to the mechanism by which a particular phenotypic response is mediated. Note that the same entity (e.g., protein) can have multiple distinct roles, and these need to be separated in the metadata scheme. For example, protein kinases are specific, biochemically purified protein reagents in the KINOMEscan binding assay. In the broader context of all LINCS assays (most of which are cell-based) and data sets, protein kinases are conceptual targets of small molecules or antibodies. Thus, in our metadata standards, each protein reagent is directly related to a “parent” conceptual protein.

Figure 1. Illustration of how LINCS metadata standards relate to LINCS assays (and results) and biological entities.
Model versus Confounder Metadata
In our approach, we have made a clear distinction between “model” metadata and “confounder” metadata. Model metadata are those required to understand, interpret, and meaningfully integrate experimental results. These include global identifiers for experimental reagents (e.g., key information about cell lines and small-molecule perturbations) and critical experimental parameters (e.g., tested perturbagen concentrations and time points studied). Model metadata should be queryable in software tools and are often shown in published figures that illustrate important conclusions drawn from the data. Confounder metadata, on the other hand, include other details required to reproduce experiments but that are less important for interpreting experimental results. Examples of confounder metadata include specific batch numbers for reagents, detailed descriptions of the experimental equipment used (model of a centrifuge used in a particular step in an assay protocol), and so forth. To describe LINCS assay protocols, for the most part, we make model metadata explicit, whereas other experimental details (confounder metadata) are captured as free text in standard operating procedures implemented by the LINCS data production centers that describe how the assays are run. The specific parameters that are included in the model metadata are determined by use cases. This approach leaves the option to make additional metadata explicit at a later time (by curating the experimental procedures), should they be required for new use cases. Model metadata fields are required in our LINCS metadata standards. Confounder metadata (with some exceptions, such as batch-specific identification of reagents) are considered optional.
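The distinction between required model metadata and optional confounder metadata can be sketched in a few lines of Python. The field names below (`cell_line_id`, `concentration_uM`, `centrifuge_model`, and so on) are hypothetical examples invented for this sketch; in practice the set of model fields is determined by the use cases, as described above.

```python
# Hypothetical set of model metadata fields; the real set is use-case driven.
MODEL_FIELDS = {"cell_line_id", "small_molecule_id", "concentration_uM", "time_point_h"}


def split_metadata(record):
    """Separate required model metadata from optional confounder metadata.

    Raises ValueError if any model field is missing, since model metadata
    are treated as required.
    """
    missing = MODEL_FIELDS - set(record)
    if missing:
        raise ValueError(f"missing model metadata: {sorted(missing)}")
    model = {k: v for k, v in record.items() if k in MODEL_FIELDS}
    confounder = {k: v for k, v in record.items() if k not in MODEL_FIELDS}
    return model, confounder


# Illustrative record with made-up identifiers and values.
record = {
    "cell_line_id": "CL-0001",
    "small_molecule_id": "SM-0001",
    "concentration_uM": 10.0,
    "time_point_h": 24,
    "centrifuge_model": "free-text equipment detail",
}
model, confounder = split_metadata(record)
print(sorted(model), sorted(confounder))
```

Keeping confounder details in a separate, unvalidated mapping mirrors the option of making additional metadata explicit later without changing the required core.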
Metadata Specifications
The full LINCS metadata specifications are publicly available at the LINCS project and LIFE Web sites (http://www.lincsproject.org/data/data-standards/, http://lifekb.org/index.php/data-standards). In the following, we briefly describe these standards. Table 2 lists the most important metadata descriptors of each LINCS reagent category, including the descriptor name, how it relates to a specific (material) instance of the reagent (invariant canonical or batch-specific representation), its importance level (1, essential; 2, recommended/if available; 3, optional), and, for controlled vocabulary, the recommended reference terminology/ontology. Resources for controlled vocabulary that are applied in the metadata standards are listed below. Metadata for each LINCS reagent category in Table 2 are separated into two sections: identification of the reagent and reagent-specific descriptors. For all details of each reagent category, we refer to the full specifications.
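To make the importance levels concrete, the sketch below checks an annotation against a small descriptor table and reports which descriptors are missing at each level. The descriptor names and their assigned levels are hypothetical stand-ins, not entries copied from Table 2, and the example annotation is illustrative only.

```python
# Importance levels follow the text: 1 essential, 2 recommended, 3 optional.
# The descriptor names and level assignments are assumptions for this sketch.
CELL_LINE_DESCRIPTORS = {
    "name": 1,
    "organ": 1,
    "disease": 2,
    "growth_properties": 3,
}


def missing_by_level(annotation, descriptors):
    """Group descriptors that are absent or empty by importance level."""
    missing = {1: [], 2: [], 3: []}
    for name, level in descriptors.items():
        if not annotation.get(name):
            missing[level].append(name)
    return missing


annotation = {"name": "MCF7", "organ": "breast"}
report = missing_by_level(annotation, CELL_LINE_DESCRIPTORS)
is_valid = not report[1]  # valid when no essential descriptor is missing
print(report, is_valid)
```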
Cell Lines and Primary Cells
LINCS assays interrogate a variety of disease models. Cell lines are immortalized cells, whereas primary cells are mortal and generally undergo a finite number of cell divisions, after which they reach senescence. To describe cell lines and primary cells, we incorporated some of the elements proposed in MIACA. The underlying theme among all cell types is their association with a tissue or organ from which the cells were derived. In many cases (especially with cell lines), the cells are also associated with a disease. We proposed explicit fields to describe the source (vendor or laboratory), origin (organism, organ and tissue), cell type (epithelial, neuronal stem cell, etc.), associated disease/disease model (e.g., type of cancer), growth properties (adherent or suspension), genetic modifications (transfection, transduction), inherent mutations (mutations in receptors, oncogenes, tumor suppressors), and culture conditions (culture medium and the medium components, such as serum, growth factors). Cell line source and culture conditions are batch-specific information, whereas the others are canonical (do not change between batches). In addition, permanent cell lines require reporting of cell line authentication, such as short tandem repeat (STR) profiling, whereas primary cells require the passage number and donor details, such as age, ethnicity, gender, and so on.
Small Molecules
Small molecules are used as perturbagens in LINCS experiments. Some of the minimal information standards proposed in MIABE were included in our specifications, such as compound name and ID (PubChem CID, ChEBI ID), canonical structure representation (SMILES, InChI key), software used to generate a canonical structure representation, important molecular descriptors, chemical salt, and so forth. Known biological targets of small molecules should be annotated using standard symbols; as suggested in MIABE, this is particularly important for approved drugs or clinical compounds with a known mechanism. Small-molecule metadata also include substance-specific batch information, such as compound provider, salt form, molecular mass, purity, and aqueous solubility. For Food and Drug Administration–approved drugs, we proposed reporting additional information, such as drug indication and mechanism of action. If available, Protein Data Bank identifiers of co-crystal structures of the small molecule with its target should also be reported.
Protein Reagents
A standardized description of protein reagents is critical to link results of different LINCS assay types. Protein reagents need to be identified in a manner that enables screening results associated with a specific protein reagent (e.g., KINOMEscan) to be linked with data obtained by other assays in which that protein participates as a (material) component (e.g., in a cell-based assay read out via the L1000 transcript profiling method; see Fig. 1). Although this is a fairly obvious requirement, it is not trivial to implement because a protein reagent expressed recombinantly is typically not the exact same entity or in the same state as its corresponding assay participant in a living cell (e.g., kinase domain binding assay vs. corresponding kinase occurring in a specific cell line used for a growth inhibition assay). In this first version of metadata standards, we take a rudimentary approach. We use the UniProt accession to identify and reference proteins, and the approved gene symbol and accession number (NCBI Gene) to identify and reference their coding genes. Although we recognize its limitations, this approach is sufficient for our current simple use cases. Linking protein and gene identifiers is also needed to integrate RNAi reagent gene targets (see below). The recommended explicit fields for proteins include a standardized name, both for the protein and the gene that encodes it; source of protein (e.g., chemically synthesized, purified from natural source, recombinantly expressed); protein modifications (e.g., mutations, posttranslational modification); protein purity; subunit information for components of a protein complex; and isoform information (derived from either alternative promoter usage, alternative splicing, alternative translation initiation, or frame shifting).
We are currently working on a formal description of proteins that will allow ambiguity (more- or less-specific definition of proteins), because in some cases, the exact entity and state of a protein reagent or model system participant is not definitively known (full length, functional domain, exact sequence, mutation, phosphorylation state, etc.).
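The relation between a protein reagent, its "parent" conceptual protein, and the coding gene can be sketched with a few small data classes. This is only one possible way to model the idea described above: the class and field names, the reagent and probe IDs, and the matching rule are assumptions made for this sketch, and the UniProt accessions are used purely as example values.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConceptualProtein:
    uniprot_accession: str
    gene_symbol: str


@dataclass(frozen=True)
class ProteinReagent:
    reagent_id: str            # hypothetical center-specific ID
    parent: ConceptualProtein  # link to the conceptual protein
    source: str
    modification: str = ""


@dataclass(frozen=True)
class SiRNAReagent:
    probe_id: str              # hypothetical probe ID
    target_gene_symbol: str


def reagents_for_sirna(sirna, protein_reagents):
    """Return protein reagents whose parent protein is encoded by the siRNA target gene."""
    return [p for p in protein_reagents
            if p.parent.gene_symbol == sirna.target_gene_symbol]


abl1 = ConceptualProtein("P00519", "ABL1")
egfr = ConceptualProtein("P00533", "EGFR")
reagents = [
    ProteinReagent("PR-0001", abl1, "recombinantly expressed"),
    ProteinReagent("PR-0002", abl1, "recombinantly expressed", "T315I"),
    ProteinReagent("PR-0003", egfr, "recombinantly expressed"),
]
hits = reagents_for_sirna(SiRNAReagent("SI-0001", "ABL1"), reagents)
print([p.reagent_id for p in hits])
```

Matching on the gene symbol of the parent protein, rather than on the reagent itself, lets wild-type and mutant reagents of the same kinase be retrieved together, which is the kind of linkage use case (5) relies on.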
Inhibitory RNAs (siRNA, shRNA)
RNA interference is a standard methodology to transiently knock down gene expression in living cells. This can be achieved using different types of small RNA molecules, including siRNA, shRNA, and miRNA. Information that is relevant to identify and describe these perturbations includes probe ID, name, source/provider, target gene symbol and accession number, sequence of the probe, and any specified modifications to the probe (e.g., chemical modification).
Antibody Reagents
Antibodies are extremely useful because of their high target specificity in detection of proteins, capture of proteins for isolation, purification and quantification, and selective inhibition of protein function (e.g., membrane receptor). Important metadata to be reported include a standardized name and ID of the antibody, identity of the target protein, target organism, information on the immunogen (name, source, modification of the protein/peptide), antibody clonality, antibody isotype, antibody purity, antibody specificity, and whether it was used as a primary or secondary antibody in an assay.
Other Reagents
This category serves to describe generic reagents that fall outside of any of the previously listed specific categories. An example is lipopolysaccharide, a component of the outer membrane of gram-negative bacteria that triggers an immune response similar to that initiated by a bacterial infection. Information that is relevant to be reported about these reagents includes a standardized name and ID, provider information, purity, and source.
Resources and Controlled Vocabularies Used in the Metadata Standards
BAO was initially developed to describe high-throughput assays and therefore already includes many terms and definitions for assay-related entities and concepts. 17 The Ontology for Biomedical Investigations (OBI) 27 is an important midlevel ontology to integrate various domain-specific experimental ontologies. One of the main objectives of BAO was to describe screening outcomes (endpoints) and to enable classification and aggregation of these results by categories that relate to the biology (e.g., target) of the assay, the detection method, the assay design (how a signal is generated), and the model system. In contrast, OBI has a more operational focus (how is an investigation performed, how are the samples processed, etc.). However, the ontologies are not incompatible, and we plan to align BAO with OBI to facilitate future integration with other biomedical investigations. We recently extended BAO to enable more flexible modeling of profile endpoints and signatures that are generated in LINCS assays (manuscript submitted). BAO is specifically used as a reference for the SAF (see below). We formally defined the LINCS assays in BAO; these include the KINOMEscan, KiNativ, cell viability, transcriptional response profiling, apoptosis, and cue signal response (CSR) assays; as such, BAO serves as an important reference to the metadata standards, directly or via imported ontologies. To facilitate the unique identification of reagents and assay annotations, we recommend several other ontologies (
Assay SAF
We developed the SAF specifications to facilitate data exchange between the HMS LINCS DB and LIFE via a Web services API as described in the Methods section. Here we describe the SAF, how it is used, and its implementation in a LINCS publication Web services API. It is a model that can be extended to the entire LINCS network and potentially beyond.
Description of the SAF Format and Content
The SAF is a JSON-based 26 format for annotating and exchanging assay metadata and results. The chief goal of the SAF is to provide a simple, human-readable format for representing and exchanging assay (experiment) data. JSON was chosen for encoding because it is simple to understand, easy for a human to read, ubiquitous, and computationally easy to use (JavaScript, Web services, with support in many applications) for data display and storage. Each SAF JSON object can be any subset of results generated by one assay (which is defined by its annotations); in practice, it is an operational unit, such as one screening experiment. A SAF file (
Table 3 lists the SAF elements, descriptions, and mappings to the HMS database and BAO. Data types include controlled vocabulary, free text, numeric value, and IDs with further differentiation of (LINCS) global and local (center- and/or batch-specific) IDs. Table 3 also lists specific example annotations (tags and values) that apply to the KINOMEscan assay. It should be noted that many metadata annotations that refer to the assay are implicitly defined by the name KINOMEscan assay; this means they can be inferred based on the formal definition of the assay in BAO. For example, the assay format, assay method, detection technology, and so forth do not need to be explicitly annotated because BAO defines all these details for the (KINOMEscan) assay. That also applies for the semantics of the reported endpoint “percentage control.” In this particular case, BAO defines the KINOMEscan assay as a competitive binding assay (assay technology described above) that reports percentage control as the (normalized) percentage of substrate that remains bound to the kinase; 100% control thus is formally defined as no binding of the screened compound to the kinase, and vice versa; 0% control means 100% compound binding. Because compounds bind at the ATP site (competitive with the substrate), this can also be interpreted as 100% inhibition of the kinase.
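The endpoint semantics just described can be written as a small helper. The sketch below assumes a simple linear complement between percentage control and percentage compound binding, anchored at the two defined end points (100% control means no compound binding; 0% control means 100% compound binding). The function name is hypothetical, and treating intermediate values linearly is an assumption of this sketch rather than something stated in the BAO definition.

```python
def percent_compound_binding(percent_control):
    """Convert a percentage-control value to percentage compound binding.

    Assumes a linear complement between the two defined end points:
    100% control -> 0% binding, 0% control -> 100% binding.
    """
    if not 0.0 <= percent_control <= 100.0:
        raise ValueError("percent_control is assumed to lie in [0, 100]")
    return 100.0 - percent_control


for value in (100.0, 35.0, 0.0):
    print(value, percent_compound_binding(value))
```

Measured values can fall slightly outside the nominal range in practice; this sketch rejects them rather than guessing how they should be handled.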
Table 3. List of the SAF Elements, Descriptions, Mappings, and Data Types with Examples That Apply to the KINOMEscan Assay.
BAO = BioAssay Ontology; HMS = Harvard Medical School; SAF = simple annotation format.
SAF has also been implemented for LINCS apoptosis, cell cycle state, cell growth inhibition, and KiNativ assays.
Implementation of SAF as LINCS Publication Service
The SAF provides a mechanism to minimally describe assay and screening result information so that it can be exchanged between screening centers or accessed programmatically. We have started to use the SAF to annotate LINCS assays so that they can be easily indexed and made searchable by the LIFEwrx KnowledgeBase. The LIFEwrx KnowledgeBase is a searchable repository of LINCS assay data linked to the LIFE ontology and accessible through an easy-to-use Web-based user interface ( Fig. 2 ). 25 Previously, data were populated in LIFEwrx by an ETL-like process in which data were loaded from the LINCS centers into a staging database where standardization was done. The data were then annotated using the metadata standards, which enriches the information by linking associated concepts (e.g., disease names and categories). All of this information was made searchable and viewable through the search application. Annotating assays using the SAF simplifies this pipeline because assay information is already in a standard format and linked to ontology concepts ( Fig. 2 ). The SAF annotated assays are made available through the HMS LINCS DB Web services API, which serves as a LINCS publication service (LPS). Data from the service can be pulled directly by the LPS-driven LIFEwrx ingest pipeline with no special processing (see the Methods section for access and references to SAF and API specifications).
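The simplification described above can be illustrated with a minimal indexing step: when assay annotations already arrive as standardized tag-value pairs, an ingest routine can index them directly without a separate staging and standardization pass. The sketch below is not the LIFEwrx implementation; the function name, the tags (`assay_id`, `assay_annotations`, `assay_name`, `disease`), and the example values are all hypothetical.

```python
from collections import defaultdict


def build_index(saf_objects):
    """Map each (tag, value) annotation pair to the set of assay IDs carrying it."""
    index = defaultdict(set)
    for obj in saf_objects:
        assay_id = obj["assay_id"]
        for tag, value in obj["assay_annotations"].items():
            index[(tag, value)].add(assay_id)
    return index


# Made-up SAF-like objects, as they might look after being pulled from a service.
saf_objects = [
    {"assay_id": "A-1",
     "assay_annotations": {"assay_name": "KINOMEscan assay"}},
    {"assay_id": "A-2",
     "assay_annotations": {"assay_name": "cell viability assay", "disease": "breast cancer"}},
    {"assay_id": "A-3",
     "assay_annotations": {"assay_name": "cell viability assay", "disease": "melanoma"}},
]
index = build_index(saf_objects)
viability_ids = sorted(index[("assay_name", "cell viability assay")])
print(viability_ids)
```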

Figure 2. Integration of Harvard Medical School (HMS) LINCS data into LIFE via the LINCS publication service (LPS) REST application programming interface that leverages the simple annotation format (SAF). The ISA-Tab has been used in a pilot project to annotate some LINCS data at HMS, and the SAF is used to facilitate programmatic access via the LPS.
Annotating Data Sets Applying LINCS Metadata Standards
Applying the metadata standards, we have systematically curated and annotated cell lines, small molecules, and proteins used in LINCS assays. Representative examples for cell lines, proteins, and small molecules tested in LINCS assays are shown in
Cell Line Annotation and Linkage to Disease and Organ
Established cell lines are powerful high-throughput screening disease model systems. This is particularly the case in cancer research; for example, the NCI60 screen for effects on viability of multiple cancer-derived cell lines is routinely run on promising lead compounds. To facilitate the integration and analysis of large-scale cell-based screening profiles, such as those generated at LINCS, we systematically annotated cell lines with controlled terms identifying associated organs, diseases, and mutations leveraging the Human Disease Ontology and the organ Uber Anatomy Ontology; example annotations are shown in

Representation (percentage) of the different types of cancers among cell lines tested in the LINCS assays.
A list of all (>1000) annotated cell lines screened at the LINCS consortium is available via the HMS LINCS DB at http://lincs.hms.harvard.edu/db/cells/. Cell lines can also be queried and explored by disease, organ, or assay results via the LIFE software. 25
Protein Annotations
Deregulation of protein kinases is a hallmark of many diseases, including cancer. LINCS addresses the role of protein kinases using several assay types in which activity is either directly measured in biochemical assays (KINOMEscan) or inferred by assessing phenotypes resulting from inhibition in cell-based assays (CSR, apoptosis, cell viability assays, transcriptional response profiling). Protein name, ID, alternate names, posttranslational modification, and mutation status were annotated using standardized terminology from UniProt, NCBI/Protein, and Protein Ontology (example shown in
A list of protein reagents (>1000) is available via the HMS LINCS DB at http://lincs.hms.harvard.edu/db/proteins/, and curation of this list is ongoing. Kinase proteins, including phosphorylation status and mutations, can also be queried and explored via a kinase domain ontology in the LIFE software. 25
Compound Annotations
Small molecules tested in the LINCS assays include approved drugs, clinical kinase inhibitors, MLP probes, and various other screening compounds. Integration of data from different assays and external resources requires a unique identification of small molecules. We used PubChem CIDs, and we annotated the compounds with additional details curated from various sources, including DrugBank, PubChem, the NCBI MLP probe reports, the NCATS Pharmaceutical Collection, and the Protein Data Bank. Example records are shown in
We made the annotations for LINCS small molecules (>4000) available at the LIFE KB Web site (http://lifekb.org/index.php/data-standards). The list of compounds can also be obtained from the HMS LINCS DB (http://lincs.hms.harvard.edu/db/sm/). Compound information can be queried, browsed, and downloaded via LIFE. 25
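To illustrate why a unique compound identifier matters for integration, the following minimal Python sketch merges annotation records from different sources under a shared PubChem CID. The record layout, the function name merge_by_cid, and all example values (including the CIDs and compound names) are hypothetical placeholders and do not reproduce the LINCS specification or any real compound record.

```python
def merge_by_cid(records):
    """Group annotation records from several sources under one PubChem CID.

    Each record is a dict with the (assumed) keys "cid", "name", and "source".
    Returns a dict mapping each CID to the set of names and sources seen for it.
    """
    merged = {}
    for rec in records:
        entry = merged.setdefault(
            rec["cid"], {"cid": rec["cid"], "names": set(), "sources": set()}
        )
        entry["names"].add(rec["name"])
        entry["sources"].add(rec["source"])
    return merged


# Hypothetical placeholder records: the CIDs and names are made up.
EXAMPLE_RECORDS = [
    {"cid": 111, "name": "compound-A", "source": "DrugBank"},
    {"cid": 111, "name": "Compound A (free base)", "source": "PubChem"},
    {"cid": 222, "name": "compound-B", "source": "PubChem"},
]

MERGED = merge_by_cid(EXAMPLE_RECORDS)
```

In this sketch, two records that use different names collapse into a single entry because they share a CID, whereas matching on names alone would have kept them apart.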
Discussion
Formal specifications of metadata are required to make the biological and methodological context of the assays and results explicit. Because of the diversity of methods and data types generated at LINCS, such specifications are critical to generate integrated and interpretable views of diverse LINCS results and also to link to external resources, such as small-molecule activity data in PubChem, ChEMBL, drug information in DrugBank, pathway information, disease data, and so forth. Here we developed metadata specifications for assays and screening results produced in the LINCS consortium. We focused on the model metadata needed to interpret and link assays and results. Guided by prioritized use cases, we determined the required types of biological entities and concepts and the corresponding specifications to uniquely identify each individual entity and to relate them, while not impeding human parsing of the data (common names, descriptions, etc.). We reviewed existing minimum information specifications and available established resources for controlled vocabularies. Although these have been a useful starting point, we determined that the LINCS project requires specific metadata standards to fulfill the current and envisioned future use cases. Comprehensive minimum information specifications for the purpose of replicating experiments were not practically applicable, given limited data curation resources and the focus on model metadata. Vocabulary resources (including ontologies) to describe many of the important LINCS biological entities and concepts were still lacking. We first developed the required metadata specifications in a smaller core group and then passed them to a larger group at LINCS for review and approval before their public release. We have demonstrated the applicability of these metadata standards by annotating LINCS assays and results. 
We have made publicly available information on more than 1000 cell lines with detailed annotations, including disease and organ, on more than 1000 LINCS protein reagents, and on several thousand compounds, including many clinical kinase inhibitors and drugs. The various biological entities and concepts and their associated screening assays and results can be queried and browsed based on these metadata in the LIFE software system. 25 Use cases to develop the LINCS specifications range from relatively simple queries to more complex analyses and also include the development of software tools and user interfaces to query, explore, and analyze LINCS data. We have already implemented a variety of functionality leveraging these metadata standards in the LIFE search engine. 25
To facilitate the programmatic exchange of metadata-annotated screening results, we developed specifications for an assay SAF. The ISA-Tab format was used at HMS to capture important metadata for some experiments. Metadata and screening results are deposited to the HMS LINCS DB. SAF is the native format of the LPS REST API, which publishes this information for programmatic access and further processing by other systems, such as LIFE (Fig. 2). We have described several of the LINCS assays using these SAF specifications and implemented LINCS publication Web services to access these data programmatically. This mechanism is also used to upload data into the LIFEwrx knowledgebase. We have shown several examples of curated annotations using the metadata specifications for cell lines, proteins, and compounds and how an assay is described in SAF; the full lists and details are available at the LINCS, 5 LIFE, 18 and HMS LINCS 19 Web sites.
As an example of linking results from different LINCS assays, we illustrate biochemical, cell growth inhibition, cell cycle state (mitosis/apoptosis), and transcriptional responses of a novel Plk-1 inhibitor, BI-2536, that has been shown to inhibit tumor growth in vivo, 37 has shown modest efficacy and favorable safety in relapsed non–small-cell lung cancer, 38 and is also in a phase I study in advanced solid tumors. 39 The presented standards to annotate cell lines and small molecules enable integration of relevant data. In this example, the cell growth inhibition data for a non–small-cell lung carcinoma cell line, A549, indicate a cell survival rate of 30% (at a BI-2536 concentration of 0.5 µM), whereas the KINOMEscan inhibition data confirm its activity in vitro with Plk-1 inhibition of 81% (at a concentration of 10 µM). Tang et al. 40 identified an unexpected bell-shaped dose response of BI-2536 in the mitosis/apoptosis assay and suggested that low/medium concentrations of the drug inhibit the primary target (Plk-1) in its function of promoting progression through mitosis, so that cells arrest in mitosis and from there move into apoptosis. Meanwhile, medium/higher concentrations of the drug might block mitotic entry altogether, which can protect from cytotoxic effects of antimitotic drugs. At the highest concentrations, cytotoxicity due to off-target inhibition of other kinases is seen, and the apoptosis/death curve rises again as the mitotic index falls. Off-target candidates can readily be identified via the KINOMEscan results for BI-2536. Similarly, gene expression results for BI-2536 in A549 cells and other cell lines can readily be queried and integrated with these results. The utility of the metadata standards is illustrated by their implementation in the LIFE search engine. 25 For example, a simple query of “BI-2536” (LSM-1041) returns various types of LINCS data for this compound, including L1000 transcriptional response, cell cycle state assay, cell growth inhibition, and KINOMEscan results.
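The kind of compound-centered query described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the record fields and the function name results_by_assay are hypothetical and do not reflect the SAF schema or the LIFE implementation. The two records carry only the values quoted in the text for BI-2536 (LSM-1041).

```python
from collections import defaultdict

# Illustrative records; the field names are assumptions, not the SAF schema.
# The values are the two data points quoted in the text for BI-2536.
RESULTS = [
    {
        "assay": "cell growth inhibition",
        "compound_id": "LSM-1041",
        "compound_name": "BI-2536",
        "cell_line": "A549",
        "concentration_uM": 0.5,
        "readout": "cell survival (%)",
        "value": 30.0,
    },
    {
        "assay": "KINOMEscan",
        "compound_id": "LSM-1041",
        "compound_name": "BI-2536",
        "protein": "Plk-1",
        "concentration_uM": 10.0,
        "readout": "inhibition (%)",
        "value": 81.0,
    },
]


def results_by_assay(records, compound_id):
    """Return all results for one compound, grouped by assay name."""
    grouped = defaultdict(list)
    for rec in records:
        if rec["compound_id"] == compound_id:
            grouped[rec["assay"]].append(rec)
    return dict(grouped)


BI2536_RESULTS = results_by_assay(RESULTS, "LSM-1041")
```

Because both assays refer to the compound by the same identifier, a single lookup returns the cell-based and the biochemical readout together; without a shared identifier, the two records could only be matched by name.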
During the development of the metadata standards presented here, and in particular when applying them to curate and annotate cell lines, proteins, and small molecules, it became apparent that such an effort requires significant resources, which are easy to underestimate. Judging by previous attempts, biocuration and systematic annotation of biological data have not been perceived as high-priority efforts in the community and as a result often appear underresourced. 41 It is therefore particularly important to optimize and prioritize minimum annotations that enable the scientific use cases and software functionality that involve integrated data views and linking to external information. Here we have developed and applied such minimum annotations in one of the first attempts to describe and make public large diverse data sets reporting biochemical and phenotypic readouts in addition to gene expression data; this is a major goal for the LINCS project. The development of metadata specifications continues to accommodate new use cases, data analysis algorithms, and software tools. It should be noted that the current metadata specifications already enable more complicated use cases, such as associating kinase targets and genes with diseases. Although causal associations cannot be directly inferred from the LINCS data, the metadata standards in principle include the required details to perform such analyses, for example, linking kinase targets (from KINOMEscan) and diseases (linked to cell lines tested in growth inhibition assays) based on the activity of small molecules tested in both assays (compare Fig. 1, inferred relations).
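The inferred kinase-disease relation described above amounts to a join over compounds that are active in both assay types. The following Python sketch shows one way this could be expressed. The function name, the input layout, and the assumption that the inputs have already been filtered to "active" results by some threshold are all hypothetical; the example values are taken from the BI-2536 case in the text. The resulting links are associations only and do not imply causality.

```python
def infer_kinase_disease_links(kinase_hits, growth_hits, cell_line_disease):
    """Link kinases to diseases through compounds active in both assay types.

    kinase_hits:       iterable of (compound_id, kinase) pairs, assumed to be
                       pre-filtered to active results (e.g. from KINOMEscan).
    growth_hits:       iterable of (compound_id, cell_line) pairs, assumed to
                       be pre-filtered to active growth inhibition results.
    cell_line_disease: dict mapping a cell line to its annotated disease.

    Returns a dict mapping (kinase, disease) to the set of supporting compounds.
    """
    kinases_by_compound = {}
    for compound, kinase in kinase_hits:
        kinases_by_compound.setdefault(compound, set()).add(kinase)

    links = {}
    for compound, cell_line in growth_hits:
        disease = cell_line_disease.get(cell_line)
        if disease is None:
            continue  # cell line carries no disease annotation
        for kinase in kinases_by_compound.get(compound, ()):
            links.setdefault((kinase, disease), set()).add(compound)
    return links


# Example based on the BI-2536 (LSM-1041) case discussed in the text.
LINKS = infer_kinase_disease_links(
    kinase_hits=[("LSM-1041", "Plk-1")],
    growth_hits=[("LSM-1041", "A549")],
    cell_line_disease={"A549": "non-small-cell lung carcinoma"},
)
```

The join only works because the compound, the cell line, and the disease are each annotated with consistent identifiers or controlled terms, which is what the metadata specifications provide.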
In conclusion, the LINCS metadata and SAF specifications facilitate various use cases involving data integration, analysis, development of software tools, and programmatic data exchange across a variety of assay types, screening results, and external biomedical data. We anticipate that the metadata specifications, the SAF, and annotated cell lines, proteins, and small molecules will be useful beyond the LINCS project. All developed resources in this project are freely available.
Footnotes
Acknowledgements
The authors thank members of the LINCS Data Working Group and other members of the LINCS Consortium for helpful comments on drafts of metadata standards.
Declaration of Conflicting Interests
The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding
The authors disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was funded by the LINCS project grants 1U01HL111561, 3U01HL111561-01S1, and 3U01HL111561-02S1, U54HG006097, U54 HG006093.
References
