Causal Network Models for Predicting Compound Targets and Driving Pathways in Cancer

Abstract

Gene-expression data are often used to infer pathways regulating transcriptional responses. For example, differentially expressed genes (DEGs) induced by compound treatment can help characterize hits from phenotypic screens, either by correlation with known drug signatures or by pathway enrichment. Pathway enrichment is, however, typically computed with DEGs rather than “upstream” nodes that are potentially causal of “downstream” changes. Here, we present graph-based models to predict causal targets from compound-microarray data. We test several approaches to traversing network topology, and show that a consensus minimum-rank score (SigNet) beat individual methods and could highly rank compound targets among all network nodes. In addition, larger, less canonical networks outperformed linear canonical interactions. Importantly, pathway enrichment using causal nodes rather than DEGs recovers relevant pathways more often. To further validate our approach, we used integrated data sets from the Cancer Genome Atlas to identify driving pathways in triple-negative breast cancer. Critical pathways were uncovered, including the epidermal growth factor receptor 2–phosphatidylinositide 3-kinase–AKT–MAPK growth pathway and ATR–p53–BRCA DNA damage pathway, in addition to unexpected pathways, such as TGF–WNT cytoskeleton remodeling, IL12-induced interferon gamma production, and TNFR–IAP (inhibitor of apoptosis) apoptosis; the latter was validated by pooled small hairpin RNA profiling in cancer cells. Overall, our approach can bridge transcriptional profiles to compound targets and driving pathways in cancer.

Keywords

Causal modeling triple-negative breast cancer networks

Introduction

Gene-expression profiling is used in various scenarios to characterize cells or tissues that are biologically dysregulated. Transcriptional changes may arise, for example, from treatment with a perturbagen, such as a compound or RNA interference (RNAi), or from disease-associated genetic disruptions. Despite an abundance of research attempting to infer changes in pathways reflected in differentially expressed genes (DEGs), less research has focused on interpreting transcriptional data in a way that sheds light on upstream causal nodes responsible for dysregulation. Oftentimes, causal nodes, such as drug targets, are challenging to predict because they are not themselves subject to transcriptional changes.¹ Nevertheless, DEGs are routinely input directly to gene set enrichment analysis, in which pathways have been defined by protein interaction sets. DEGs, however, better represent the effect rather than the cause of a perturbation. In this vein, it is more appropriate to identify nodes upstream of DEGs to predict the causes of biological dysregulation.

Predicting compound targets from expression data remains an unsolved challenge. In the seminal work describing the Connectivity Map (CMAP) approach,² gene-expression profiles of drugs were used to infer a mechanism by correlating a compound with reference signatures. Compound profile correlations can be used, in turn, to create a drug network in which densely connected nodes tend to group compounds by mechanism.³ Additional throughput may be increased by reducing the number of genes in the signature but still providing enough information to predict a compound mechanism⁴ or infer a global transcriptional response.⁵

Increasingly, network-based methods are being applied to discovering drug targets. Causal modeling is a special branch of network science in which graph topology is used to explain downstream events based on upstream nodes. For example, knowing that protein A phosphorylates and activates protein B creates a directed graph with two nodes and a connecting edge. Such structures provide an opportunity to infer mechanistic relationships in a system as opposed to mere statistical correlations.⁶ In Selventa’s reverse causal reasoning (RCR) approach, networks based on integrated “omics” molecular profiling data are computationally traversed with automated reasoning techniques to predict causal relationships in treated versus control measurements.⁷ The result is a list of hypotheses, consisting of subnetworks connecting upstream and downstream nodes, which are evaluated in terms of “concordance” and “richness.” Importantly, RCR uses prior biological knowledge of molecular interactions as a scaffold to explain new data sets. Chindelevitch et al. also used prior knowledge from curated interactions with special emphasis on the stability of the method to network alterations.⁸ In terms of large-scale drug mechanism-of-action prediction, Laenen et al. recently published a diffusion strategy that used either kernel-based random walks or diffusion to genes with similar connectivity structures in the surrounding network module to predict compound targets.¹ The authors found that predictivity was better for proteins than for small molecules, and nuclear hormone receptors were the easiest target class to predict.

Oftentimes in drug discovery, drugs may not be available for a particular disease indication, but gene-expression data are available from diseased tissues. In these cases, causal modeling from disease DEGs may provide pointers to new drug targets. In this vein, others have attempted network approaches to predict actual drug targets simply by scoring significant nodes in disease microarray data.⁹ In work related to our approach, Goldenberg et al. created methods to identify network nodes that are highly influential on their neighbors’ expression.¹⁰ Intriguingly, using both genes and protein interactions, the authors showed that many of the genes identified as influential and consistent among lung cancer cohorts were genes with significant second-order influence.

We postulate that target prediction from compound-treated samples is a related task to the prediction of potential influential nodes in diseased tissues. In the current study, we describe new approaches to predict causal nodes in a large, integrated network of curated interactions to explain downstream transcriptional changes ( Fig. 1A ). We attempt to find causal nodes in two orthogonal data types. First, we test a number of target prediction schemes on publicly available compound-treated microarray data from the CMAP.² We further assess whether causal hypotheses are more relevant as input for pathway-enrichment analysis than DEGs alone. Next, the best-performing algorithm for predicting targets from microarrays is applied to data sets from the Cancer Genome Atlas (TCGA) to predict causal nodes and pathways in triple-negative breast cancer (TNBC). TNBC is defined by the absence of expression of estrogen receptor 1 (ESR1), progesterone receptor (PR), and amplification of human epidermal growth factor receptor 2 (EGFR2 or HER2). TNBC remains a challenging disease with the poorest prognosis among all breast-cancer subtypes and no known molecular targets for this patient subgroup. Recent advances in genomic profiling have shed new light on the molecular heterogeneity of TNBC.^11,12 But despite multiple targeted agents at various stages of development for this lethal subtype, clinical results have been disappointing. Clearly, new strategies are needed for target discovery in TNBC. Thus, to go beyond single genes and uncover topologically significant nodes in pathways hidden from direct analysis, we apply our network models to multiple TCGA profiling data sets and previously defined subtypes of TNBC.^13,14 We are able to confirm several known and unexpected TNBC pathways, and we validate our inhibitor of apoptosis (IAP) pathway hypothesis by knockdown among breast cancer cell lines.

Figure 1.

(A) Schematic of a mathematical directed graph to link messenger RNA changes to targets of chemical perturbations and, optionally, pathway enrichment. (B) Rewriting the initial graph of triples (A effects B) with the effects “activation” or “inhibition” into a mathematical graph with signed nodes (A+, B−) and only one type of directed-edge “causes.” (C) The effect of “signed path” to the scoring methods. We can determine that A+ outperforms (N+) for the “signed-path” methods “Normalized weights” and SignFluxDegree despite the equal length of the shortest paths and the reachability to transcriptional changes, because (N−) has shorter paths to the same transcriptional changes. Also, we could observe feedback loops represented from N+ to N−.

Materials and Methods

Drug-Specific Gene Signatures

Signatures for gene-expression changes on compound perturbation are derived from the CMAP.² The CMAP represents a collection of genome-wide transcriptional expression data from cultured human cells treated with compounds. CMAP2 contains more than 7000 expression profiles representing 1309 compounds. CMAP2 was done in batches, each batch with one cell line, and one array platform, with different compound treatment concentrations. All signature analyses were stratified by cell line, treatment, and concentration combination (see Suppl. Methods). To better understand the effect of using different signature-generation strategies used internally at Novartis, we used two different methods to generate DEG signatures for each compound treatment (see Suppl. Methods). No attempt was made to combine data from compound treatments with different concentrations; each compound concentration is considered a different experiment hereforth.

Each signature is defined by its genes, the gene fold-changes of expression levels (representing the magnitude), and associated q-values (representing the significance). Only signatures with at least 10 significantly changing genes were considered in this study.

Networks

Networks were derived from different types of interactomes by combining Metabase (Thomson Reuters, New York, NY) and STRING DB v. 9.1¹⁵ into a single text file containing node A, node B, the interaction direction, the interaction mechanism, and the source (see Suppl. Methods). In total, we compiled 497,891 interactions; 230,138 came from Metabase, and 268,684 came from STRING.

In addition, canonical pathways from Metabase with highly confident interactions (referred to as “noodles” by GeneGo developers; personal communications, October 2011) were considered as part of this study. Because noodles describe high-confidence linear interactions, we sought to not penalize graph distances that occur within noodles. The noodles are thus rolled into knots, in which each member of a noodle is only one edge away from all of the other members of the noodle (see Suppl. Methods). We thus used directed graphs generated from three approaches: (1) exclusively from canonical noodles in Metabase and rolled into knots, (2) from pathways extended to the whole Metabase and STRING networks, and (3) a combination of all of the above (in which nodes in noodles receive a significant advantage in the network due to the knot concept).

Mathematical Directed Graph

The networks described in this article were processed and converted into mathematical directed graphs (using Accelrys PipelinePilot and the R iGraph library). Network models were written in R and executed with a GNU Parallel interface.

Every network node was represented twice in the graph, with a signed direction of “+” for up or “−” for down, reflecting the effect of the edges connecting node A to node B. For example, the relationship “the increase of node A increases the levels of node B” would be represented as “node (A+) directed to (→) node (B+)” and “node (A−) → node (B−)” ( Fig. 1B ). In contrast, the relationship “the increase of node A decreases the levels of node B” would be represented as “node A+ → node B−” and “node A− → node B+”. We will refer to these as “causal flows,” having in mind a loose definition of causality among biological entities. In this way, the edge’s direction and effect as described in this article are converted into another graph with directed edges. The second graph is written so that the effects are reflected in the signs of the nodes. In the cases of an edge effect being “unknown,” we treat with equal probabilities the “increase” and the “decrease” effects by generating signed nodes–edge combinations reflecting both of the effects. Also, because of the inherited symmetry in the second graph, the path from node A+ to node A− (in the case of negative feedback loops) also will be a path from node A− to node A+, connecting the nodes with the opposite signs. For example, the path “(A+) → (O+) → (A−)” (negative feedback loop) will be also present as a path “(A−) → (O−) → (A+).”

The path “(A+) → (O+) → (B−)” will be present as “(A−) → (O−) → (B+).”

This does not mean that the shortest paths (if any exist) for “(A+) … → … (B−)” and “(A−) … → … (B+)” will have the same length.

We define the distance from (A) to (A) as 0.

We annotate the opposite of signed node A as ¬ A (negation of A).

Schematic of How Causal Nodes Are Predicted

The significant signature genes are mapped onto the mathematical directed graph based on the sign of the gene-expression fold-change, in other words, a fold-change >0 for gene A will translate to (A+), and a fold-change <0 corresponds to (A−) (step 1 in the schematic; Fig. 1A ). Then, because these signatures reflect transcriptional changes, the edges for one-degree neighbors of the signature genes were filtered down to only the ones related to messenger RNA (mRNA) regulation or expression (step 2 in the schematic; Fig. 1A ). Each node from the graph is then assessed for how “well” it connects (by following the causal flow) to the significantly changing signature genes based on several scoring metrics ( Fig. 1C and legend). The node ranks from each scoring scheme are then combined into one, as explained in this article. The higher the rank, the higher the confidence that a hypothesized node will be a drug target.

In all cases, performance measures and graphs were compared to random predictions (see Suppl. Methods and Suppl. Fig. S1C).

Compound-Target Annotation

For the CMAP data set, we applied two different levels of compound-target annotations (see Suppl. Methods for further description). The first level of target annotation is intended to represent canonical or “intended” compound targets, whereas the second category of target annotations comes from any other assay targets modulated by CMAP compounds as determined by internal assays or in literature and patents.

We combined the mapped compound targets and the associated compound signatures with at least one predicted target. This resulted in a total of 204 signatures for 154 compounds from our Signature 1 approach and in 581 signatures for 191 compounds from our Signature 2 approach.

Network-Model Scoring Schemes, Parameters, and Performance

We explored the power of the directed graphs to predict compound targets based on gene-expression signatures. Different topological properties of the directed graph could have different effects on the ranking of compound targets that connect with the gene-expression signatures of interest. For this reason, we developed several network algorithms that rank each signed node in the directed graph for its ability to “predict” the gene-expression changes on compound perturbation under specific treatments, concentrations, and cell lines. We developed six algorithms, which differ in the way they rank each signed node of the directed graph. We refer to the six methods as Normalized Weights, SignFluxDegree, Lambda, Power Weights, Exponential Weights, and SigNet, with a minimum-rank consensus approach applied to Lambda, Power Weights, and Exponential Weights (see Suppl. Methods).

Pathway Enrichment

For every gene list, pathway enrichments were calculated for Metacore maps [the false discovery rate (FDR) corrected; a p-value cutoff of 0.01; the maximum top 50 pathways reported; reported p-values are the p-values satisfying the FDR, not the q-values]. Enrichment was calculated in steps of 50 genes for each list (i.e., 50 genes, then 100, then 150, etc.). For the CMAP study, the enrichment process was applied to each CMAP treatment instance individually, in which the genes were ranked according to our best-performing method, SigNet.

Metacore maps were also used for pathway visualizations in Figure 6B and the supplemental figures.

TNBC Design of Experiments

TCGA breast-cancer profiling data were obtained for RNAseq, methylation, and reverse phase protein array (RPPA) from the GDAC Firehose pipeline.¹⁶ Consensus clustering using a nonnegative matrix factorization (NMF) algorithm of TCGA data was carried out (GDAC Firehose, Gene Pattern Consensus NMF; http://genepattern.broadinstitute.org), and the optimum number of clusters was determined based on the largest cophenetic correlation coefficient.

The most representative core samples of the clusters were identified based on positive silhouette width.^16,17 Core samples were used to select differentially expressed markers for each cluster and platform by comparing the cluster versus the other clusters using Student’s t-test.

RNAseq and methylation-based differentially expressed markers were further filtered [q-value <0E-15, absolute difference of expression >0.5, and correlation coefficient between gene methylation and RNAseq profiles < (−0.2)].

Molecular-based subtypes of TNBC in TCGA were determined by running the TNBCtype tool.¹⁸ The main subtypes BL, IM, and M+SML were further analyzed using the isoform-level RNAseq and RPPA data. Differentially expressed genes and proteins for each TNBC subtype were determined by two-way comparison to the non-TNBC tumor samples and matched normal. We applied the statistical method Limma with eBayes¹⁹ for determining the significantly differentially expressed markers with contrast coefficient the TNBC subtype versus non-TNBC samples > 3 and contrast coefficient for subtype versus matched normal > 0.5. We also applied Benjamini and Hochberg’s method to control the FDR (adjusted p-value < 0.05).²⁰

Results

Performance of Signatures, Networks, and Scoring Methods

We first sought to combine compound-microarray data with molecular interaction knowledge to predict drug targets from gene-expression data. The CMAP data used in this study contain genome-wide microarray data for cellular compound treatment at multiple concentrations and cell lines. Supplementing this data set, we implemented two different DEG signature-generation approaches. Our causal-modeling approach relies on knowledge-based interaction networks that are used to interpret transcriptional changes. To better understand the influence of network construction on method performance, we used three different interactome data sets, including integrated commercial and public interactions (Metabase+STRING), only canonical linear networks (Noodles), or all of the above (Metabase+STRING+Noodles). Finally, we tried five different network-scoring schemes to predict compound targets as well as a minimum-rank approach that kept the consensus best rank for every target. “Canonical” compound targets (assigned in DrugBank or Integrity) were ranked among the entire human genome for each network type, for each network-scoring method, and for both DEG signatures generated for each treatment condition. For additional internal benchmarking, we also used compound target annotations that were unknown to DrugBank or Integrity but were derived from literature, patents, or legacy assays. These noncanonical targets may be considered novel targets arising from polypharmacology, indirect “assay readout targets” from cellular screens, or simply off-targets.

Performance assessment is based on receiver operating characteristic (ROC)-like measures of sensitivity and specificity, in particular how highly each method ranks in aggregate the known compound targets from every signature. First, we compared the three networks and two DEG signatures ( Fig. 2A ). Due to the practical difficulty of following up target hypotheses, we are most interested in method performance within the top 0.5% (rank < 179; Fig. 2B ). To have equal comparisons, we only use a single scoring approach, the SigNet (minimum-rank) scoring method, which gives every target its best ranking from the scoring approaches. We observe that our implementation of a noodles-as-knots concept performs worse than the other networks that are larger and more inclusive in their interaction compositions. We also observe a marginal performance improvement in target enrichment for DEG Signature 2 within the top 1% (rank < 357; Fig. 2B ). Within the top 0.5% (i.e., rank < 179), the best approach is the use of the combined Metabase+STRING network and our Signature 2 method to process the microarray gene changes, correctly predicting 49% of the 430 compound–target–signature associations. Notably, we also include a random ROC curve created by the randomization of target and signature assignments (see Suppl. Methods and Suppl. Fig. S1C for extensive benchmarking of randomization methods), which would capture on average 11% +/− 2% of targets. It is important to display the “randomly correct” target prediction line, which is steeper than a linear line for a traditional ROC plot. The apparent better-than-randomness is potentially because of a large number of targets tying in their rankings, the nonidentically and independently distributed nature of compound–target pairs, and investigation bias in network nodes.

Figure 2.

Network and signature performance. Enrichment of compound-canonical targets among the entire genome. (A) On the x-axis, we show the percentage rank of each gene or protein, referred to as a hypothesis. Each hypothesis can rank between 1 and 35,656. On the y-axis, we show the percentage of compound signatures with the correctly predicted targets. (B) Close-up of a graph showing enrichment in the top 1%. ErrorBar (Up, Down) measures one standard deviation from the random performance.

Next, we examine our six scoring schemes using the Metabase+String network seeded with DEGs from Signature 2 ( Fig. 3A ). We find that the SigNet consensus method generates uniformly better performance than the rest of the methods and significantly outperforms random ranking methods within the top 1% of ranks ( Fig. 3B ). It also outperformed a consensus MeanRank (not shown). For the individual methods, Exponential Weights and Lambda give the best results, followed by the Power Weights method. None of the three best-performing methods, however, take into account the “sign of the path,” a concept used in Normalized Weights, SignFluxDegree, and other prior work.¹⁵ It is possible that the curated networks in this study have different topological properties than the causal networks used in other studies. The Normalized Weights method performs very close to random for the top 5% of the ranks. This method does not score the length of the shortest path, but rather the “sign of the path” and the average fold-change for the signature genes.

Figure 3.

(A) Scoring scheme performance. Enrichment of compound targets by different scoring schemes. All curves use canonical targets of compounds except for “SigNet (all noncanonical targets),” which includes additional assay targets (see the “Materials and Methods” section). (B) Close-up of a graph showing enrichment in the top 1%. ErrorBar (Up, Down) measures one standard deviation from the Random performance.

We next examined whether the confidence in target annotations influenced our predictions. We compared the accuracy of predicting well-established targets of the CMAP compounds with the accuracy of predicting additional compound targets published in literature and patents or in legacy Novartis assays. We find the SigNet method performs uniformly better for the more inclusive target annotations compared to using only well-established compound targets ( Fig. 3A and 3B and Suppl. Fig. 1A and 1B).

Compound-Target Prediction Examples

We began with 191 compounds that were mapped to one or more targets and input their DEGs from Signature 2 to our network models. We took the minimum rank of the three scoring methods that performed consistently significantly better than random within the top 1% ranks (i.e., Exponential Weights, Lambda, and Power Weights). From all possible compound–target mappings, we found 482 compound–signature–target triplets in the top ~0.5% of predictions (i.e., ranked in the top 170 genes) using our best method (SigNet, or minimum rank-score with the Metabase_STRING network). Of these, there are 172 unique targets (Suppl. Table S2) correctly identified for 114 unique compounds, meaning we successfully ranked true targets in the top 0.5% of the genome for 60% of all compounds; when restricting to compounds with canonical target annotations only (which constitute 431 compound signatures for 135 unique compounds), we identified 40 unique targets in the top 0.5% for 41 unique compounds (30%) from 69 compound–target pairs. ESR1 was the most frequently predicted correct canonical target (10 times for fulvestrant, progesterone, levonorgestrel, etynodiol, raloxifene, genistein, danazol, tamoxifen, melatonin, and mestranol), followed by NR3C1 (7 times), androgen receptor (6 times), HDAC1 (3 times), EGFR (3 times), and PPARG (3 times). Notably, among canonical targets, nuclear hormone receptors are well represented, as are HDACs, chaperones, and key signaling molecules. ESR1 was, in fact, the top prediction in the genome for fulvestrant, whereas PPARG was ranked third for troglitzaone. Table 1 shows a selection of correct predictions with a focus on highlighting different targets, whereas Supplemental Table S1 displays the 107 correctly predicted compound–target pairs from public or commercial sources (DrugBank, Integrity, or published literature and patents).

Table 1.

A Selection of Correctly Predicted Compound–Target Pairs with Best Rank <100.

Drug	Symbol	Best Minimum Rank	Evidence
Fulvestrant	ESR1	1	WO 2005/077968
Resveratrol	CYP1B1	1	PubMed 17584609
Troglitazone	PPARG	3	DrugBank 15003718
Zardaverine	PDE4A	5	PubMed 9400021, 11472205, 15916433
Podophyllotoxin	TUBB2A	6	PubMed 9564789
Fludrocortisone	AR	6	DrugBank
Colchicine	TUBB2A	7	18783207
Geldanamycin	HSP90AA1	9	Integrity; DrugBank
Flunisolide	NR3C1	11	DrugBank
Vorinostat	HDAC3	14	DrugBank
Clomipramine	ADCY8	15	PubMed 10447960
Trichostatin_A	HDAC1	16	Integrity
Rifabutin	HSP90AA1	24	DrugBank
Genistein	EGFR	61	Integrity
Thalidomide	NFKB1	65	DrugBank
Dicycloverine	CHRM1	78	DrugBank

Aside from direct transcriptional regulators like nuclear hormone receptors, our causal modeling approach correctly predicted tubulin (TUBB2A) as a target for podophyllotoxin (ranked sixth) and colchicine (ranked seventh). Several DEGs are shared between podophyllotoxin and colchicine, including downregulation of NOP16 and upregulation of DUSP4, NR4A2, PPP1R16B, GPR183, CD86, TPM4, and TUBB2A. The two compounds also share many upstream nodes in common, such as ACTB and HSPA4, which are highly connected to the DEGs in the signatures. To better understand the hypothesis network leading from the differentially expressed genes for these compounds to tubulin, we plotted the network of interactions between genes moving up or down on compound treatment and the upstream pathway nodes (Suppl. Fig. S2A and S2B). Interestingly, the TUBB2A gene itself is upregulated in cells in response to both drugs—in addition to tubulin being a direct target of podophyllotoxin and colchicine (Suppl. Fig. S2A). An inverse correlation between TUBB2A expression levels and paclitaxel-induced apoptosis in lymphoblastoid cell lines has been reported previously.²¹ Taken together, upregulation of TUBB2A expression may be a natural cellular response to tubulin modulation. Also interestingly, CASP7 appears highly ranked in the podophyllotoxin causal hypotheses (ranked ninth). A chemical analogue of podophyllotoxin called deoxypodophyllotoxin is known to activate CASP7 as part of its apoptotic mechanism.²² These results support a detectable cellular transcriptional signature resulting from the chemical modulation of tubulin.

Multiple compounds that were polyphenolic flavonoid natural-product scaffolds, such as resveratrol, quercetin, and luteolin, were correctly predicted to have CYP1B1 as a target. CYP1B1 is a monooxygenase involved in metabolism of xenobiotics such as polycyclic aromatic hydrocarbons. We observe that genes with altered transcription in response to flavonoids are highly connected to CYP1B1 enzymes upstream in our network models.

Another validated prediction from Table 1 is clomipramine and adenylyl cyclase 8 (ADCY8). Clomipramine is given as an antidepressant and anxiolytic, and for treating obsessive-compulsive disorder. The drug is primarily thought of as a modulator of SLC6A4 sodium-dependent serotonin transporter, although it is also known to potently target adenylyl cyclases.²³ ADCY8 is differentially expressed in specific brain regions of mice that differ in avoidance behavior levels. Moreover, chemical modulation of adenylyl cyclase reduced mouse avoidance behavior.²⁴ Together with prior pharmacology data, our connectivity between clomipramine-induced gene changes and ADCY8 suggests that this target might be more broadly considered as part of the drug’s therapeutic mechanism.

Target Prediction by Pathway Enrichment: DEGs versus Causal Hypotheses

Given the enrichment for compound targets among the network nodes upstream of the compound-induced DEGs, we surmised that pathway-enrichment analysis on the upstream nodes would provide important and divergent information from pathway-enrichment analysis on DEGs. For each signature, we determined the top 10 most significantly enriched pathways starting from either (1) DEGs resulting from compound treatment or (2) highly ranked network nodes upstream of DEGs. To judge success, we determined the number of significant pathways that contained known compound targets, which enables us to “recover” hidden targets not found directly among the DEGs or causal nodes.

First, it is critical to note that in only 29 out of 191cases (15%), the drug target’s expression level itself changed on drug treatment, reinforcing the notion that looking at transcriptional gene changes alone is not usually sufficient for identifying protein targets of compounds.

Using DEGs as input, we could recover targets for only 8% of compounds (16 out of 191 compounds with 17 unique targets). In contrast, when using causal nodes from SigNet as input, pathway-enrichment analysis could recover targets for 46% of compounds (87 out of 191 compounds with 77 unique targets). Thus, SigNet finds around five- to sixfold more targets by pathway analysis than the traditional approach of pathway enrichment using DEG lists. Notably, all 16 compounds for which we could recover targets with DEGs as input are entirely a subset of the compounds with targets recovered by SigNet.

Despite the advantage of pathway enrichment using causal network nodes rather than DEGs, we still recall the most compound targets directly among the top-ranked causal network nodes. Presumably, in many cases, the pathways that are most enriched exclude some known compound targets. Understanding the upstream pathways that are enriched among causal-node hypotheses, however, remains useful for biological interpretation of profiling data. We suggest that there is a trade-off between the interpretability of curated pathway maps and the accuracy of compound-target predictions based on the whole network, a gap that may close in the future as pathway curation captures cellular networks more holistically.

Applying the Network Methods to Signatures from TNBC Subtypes

Because our causal network-modeling approach is able to enrich for compound targets that drive a transcriptional response on perturbation, we next sought to assess whether the same approach could be repurposed to predict driver nodes in cancer. Furthermore, we wanted to assess the integration of diverse profiling data sets rather than gene expression alone. We first analyzed breast invasive carcinoma (BRCA) data sets obtained from TCGA for three platforms: RNAseq, DNA methylation, and RPPA as obtained from GDAC Firehose pipeline.¹⁶ On applying nonnegative matrix factorization (NMF), we observed that the samples corresponding to TNBC stably clustered for each profiling data set (Suppl. Fig. S3). This indicates that TNBC samples possess molecular features that are characteristic and distinct from other breast-cancer samples.

Next, we used the differential markers from our triple-negative (TN)-containing clusters as a subset of the BRCA from all three data platforms for network modeling. As with the CMAP exercise, we generated network-based ranking of the top hypotheses linked to the differential-marker signatures from each profile. We determined enriched pathways for both the causal hypotheses and differential-marker signatures (Suppl. Fig. S4). We observe that the pathways significantly enriched for causal hypotheses are the “TGF, WNT, and cytoskeletal remodeling” (Suppl. Fig. S5) pathway, followed by the “nongenomic action of androgen receptor” (Suppl. Fig. S6A), “gastrin in cell growth and proliferation, and “Development_prolactin receptor signaling” (Suppl. Fig. S6B) pathways.

Meta-Analysis of Common Pathways among Platforms

Having assessed the entire BRCA data set, we next sought to examine the inherent heterogeneity within TNBC by incorporating previously defined subtypes, specifically those reported by (1) Perou in Prat et al.¹⁴, and (2) Lehmann et al.¹³

The Perou signatures defined three major subtypes in TN known as Basal, HER2E, and LAR (luminal androgen receptor).

The TNBCtype tool allowed us to define TN subtypes in TCGA. The main classes we identified were for IM (Immunomodulatory), BL1 (Basal-like 1) merged with BL2, and M (mesenchymal) merged with MSL (mesenchymal stem–like) subtypes. We compared each of these three major TN subtypes to the rest of the breast-cancer tumors—which do not overlap with the extended definition of TN tumors—and derived gene signatures for each major subtype. The comparisons were done on RNAseq isoforms as well as on RPPA levels for the TCGA tumors with matched normal.

From the above signatures, we generated network-based causal hypotheses. Then, to interpret the many causal predictions, we explored them together on a pathway level. We investigated enriched pathways for both the causal nodes and gene signatures. The pathway-similarity approach in principle can uncover pathways on which both the causal hypotheses and differentially expressed genes (which are sometimes the causal hypotheses themselves) converge. Thus, we also applied pathway-similarity statistics (kappa²⁶) on the combined signatures and causal hypotheses to understand the most significantly predicted pathways as well as pathway concordance among signatures, as exemplified for the Basal-like subtype ( Fig. 4 ) and IM subtype ( Fig. 5A ), which complement the IM subtype pathway enrichment with gene signatures as input only (Suppl. Fig. S7). We reconfirmed several well-characterized pathways in the progression of TNBC, including the p53–BRCA DNA damage pathway (Suppl. Fig. S8A and S8B), the EGFR–phosphatidylinositide 3-kinase (PI3K)–AKT–mTOR–MAPK pathway, and the TGF–WNT cytoskeleton-remodeling pathway.²⁷ Unexpectedly, we identified a tumor necrosis factor (TNF)–IAP apoptotic pathway (Fig. 7A and 7B), DNA damage ATM–ATR pathway (Suppl. Fig. S9), and immune-response of interleukin 12 (IL12)–interferon gamma production (Suppl. Fig. S10). For example, the most convergent pathway map for the IM subtype was “Role of IAP proteins in apoptosis.” Indeed, we have observed in pooled small hairpin RNA (shRNA) screening in cancer cell lines that TNBC cell lines are statistically more sensitive to XIAP knockdown than other non-TN estrogen receptor (ER)-positive breast-cancer lineages ( Fig. 5C ), in agreement with our pathway enrichment.

Figure 4.

Most significant pathways in the Basal-like subtype of triple-negative breast cancer (TNBC). Shown are the top 10 significantly enriched pathways based on both causal nodes and gene signatures. The top pathway, DNA damage ATM–ATR, is further explored in Supplemental Figure S9.

Figure 5.

Evidence for the tumor necrosis factor–inhibitor of apoptosis (TNF–IAP) apoptosis pathway in triple-negative breast cancer (TNBC). (A) Most significant pathways in the Immunomodulatory subtype of TNBC. The top 10 enriched pathways based on causal nodes and gene signatures according to map similarity. (B) Significant nodes in the IAP apoptosis pathway (Metacore; Thomson Reuters, New York, NY) detected by our analysis (thermometer icons). (C) TNBC lines are more sensitive to shBIRC4 (XIAP and hIAP3). In three independent small hairpin RNA (shRNA) experiments on 30 breast-cancer lines that contain 15 TNBC-positive lines (gray bars) and 15 estrogen receptor (ER)-positive lines (black bars), TNBC lines are more sensitive than ER-positive lines to the effect of shXIAP knockdown. Each panel represents different RNA interference experiments.

Discussion

In this study, we applied causal graph-based analysis of curated interaction networks as a framework to interpret profiling data. For compound-target prediction from gene expression profiles, we observed that larger, more comprehensive networks consisting of both canonical and noncanonical pathway interactions were more helpful to construct a directed graph for predicting causal, upstream compound targets. This may be because the noodle-based graph is much sparser, and many important relationships are not captured. Alternatively, canonical relationships may not be more influential than noncanonical ones and should not get special weight in the causal modeling. This is exemplified by Supplemental Figure S2A and S2B, in which the network of hypotheses leading to tubulin for both podophyllotoxin and colchicine shows that the network is complex, nonlinear, and highly interconnected. In addition, target prediction was more successful when using all empirically derived compound targets as opposed to only the most established targets, suggesting our approach may provide a means for discovering novel compound polypharmacology.

Our two DEG signature-generation methods did not have a major impact on overall performance compared to the differences in performance that network methods had. Three of the network scoring methods worked better than the other two that incorporated “sign of the path” rules in their scoring. In other words, giving high emphasis to the shortest path flow versus longer paths was less accurate at target prediction. This may suggest that the existence of a biological interaction (A→B) is a more fungible relationship than a directed effect (A→B), which may be more condition dependent. Our three most successful methods put more emphasis on the “reachability” of the upstream nodes to signature genes, allowing for the magnitude of the gene changes to influence the network models to various degrees. Thus, with our experimental setup, network topology was more important than shortest path flow. Finally, simple voting methods combining the different assumptions about network reachability, shortest paths, and DEG fold-changes significantly outperformed the individual methods.

We observed that the targets of drugs are not often themselves transcriptionally altered by drug treatment (15% of compounds). In addition, genes differentially expressed on compound treatment are not particularly good inputs to pathway-enrichment algorithms in that they do not often enrich for pathways containing compound targets (8% of compounds). These observations are because target perturbation is fundamentally an event that is upstream of gene changes. Thus, reconstructing the upstream nodes of resultant gene changes should provide a better set of proteins to use for pathway-enrichment analysis. Indeed, for 46% of compounds, enriched pathways from SigNet upstream nodes contained known drug targets. We therefore suggest that genes found by pathway enrichment starting from DEG lists are unlikely to point to the actual molecular targets of compounds and that approaches like SigNet provide better inputs for pathway analysis.

To further explore applications of our network models in disease settings, we chose the TNBC subtype of breast cancer because it remains the only clinical subtype for which we have no effective targeted therapy. Distinct from previous studies, we applied an integrative network analysis to TCGA breast-cancer mutation, copy number, mRNA, and proteomic profiling data sets. We reaffirmed that the most critical pathway in TNBC is the cell cycle and DNA-damaging pathway ( Figs. 4 and 5 ). The identified driving pathways have clinical implications. First, affirmation of the cell cycle and DNA damage pathway as the top-driving pathway reflects the use of PARP [poly (ADP-ribose) polymerase] inhibitors in BRCA1/2-deficient Basal-like TNBCs. Inhibitors of PARP enzymes prevent base-excision repair of single-strand DNA breaks.²⁸ Despite the clinical shortcomings in meeting the prespecified statistical criteria when a PARP inhibitor was used in combination with DNA-damaging agents, a signal of efficacy was observed that continues to incite mechanistic studies. Ongoing clinical trials (NCT01989546, NCT00516724, and NCT01237067) will further evaluate the potential benefit of PARP inhibitors.

Perhaps the most unexpected critical pathway within TNBC that we identified was the TNFα–IAPs apoptosis pathway. IAPs are key regulators of apoptosis and block apoptosis from both the intrinsic and extrinsic pathways. Relieving the inhibition of apoptosis by the IAP proteins may provide an avenue for the reactivation of cell death. Importantly, our follow-up shRNA validation studies showed growth inhibition of 15 TNBC cell lines when knocking down XIAP. The most successful approach to date for targeting IAPs has been the design of small-molecule second mitochondrial-derived activator of caspases (Smac) mimetics. In preclinical studies, Smac mimetics show single-agent activity in a subset of human breast-cancer cell lines in vitro and in vivo.²⁹ Furthermore, the triple-negative subtype of human breast cancer appears to be particularly sensitive to Smac mimetics. Importantly, Smac mimetics show broad synergistic activity in combination with agents targeting the death receptors, such as TNFα and TRAIL, as well as with conventional chemotherapeutic agents and certain tyrosine kinase receptor inhibitors. To date, six Smac mimetics have advanced into clinical development. Initial clinical data suggest that these compounds are well tolerated in patients with advanced cancer and demonstrate initial evidence of antitumor activity, including tumor regression and stable diseases. Additional clinical trials (NCT01617668, NCT01078649, and NCT01930292) are being conducted to test IAP inhibitor combination with conventional anticancer drugs in patients with TNBC, a strategy that is scientifically supported by the current study.

While we are waiting for the results of those clinical trials, the network models applied in this study may provide one of the platforms for rationally designing combinations at both vertical and horizontal pathway levels. One such case is PARP inhibitors to target the DNA-damaging pathway in combination with PI3K inhibitors to target the critical growth pathway. Experimentally, in a Brca1-deficient rodent model of mammary basal carcinoma, Juvekar et al. recently demonstrated dramatic tumor shrinkage owning to combination treatment with the pan-PI3K inhibitor BYM and a PARP inhibitor, Olaparib.³⁰ A Phase I clinical trial was recently launched to test this in clinical settings.

Given the marked molecular and clinical heterogeneity of TNBC, molecular subtyping may be essential to identify driving pathways in a subset of patients who share a common genetic background. The novel critical pathways identified in the present study are worth further exploration not only for new target discovery but also for rational target combinations and for better understanding mechanisms of resistance to either standard-of-care or targeted therapies.

In conclusion, we used knowledge-based interaction networks as a framework to interpret molecular profiling data together with causal network approaches. Upstream nodes can help identify compound targets as well as pathways dysregulated in cancer, and provide a basis for more relevant pathway-enrichment analysis.

Footnotes

Acknowledgements

We thank Wolfgang Hackl, Doug Selinger for mentoring of Janna Hutz, Wendy Broom, Eugen Lounkine, Adam Hill, Somnath Bandyopadhyay, and Joe Loureiro for helpful conversations and collaborations.

Declaration of Conflicting Interests

The authors declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Funding

The authors received no financial support for the research, authorship, and/or publication of this article.

Supplementary material for this article is available on the Journal of Biomolecular Screening Web site at .

References

Laenen

Thorrez

Bornigen

Finding the Targets of a Drug by Integration of Gene Expression Data with a Protein Interaction Network. Mol. Biosystems 2013, 9, 1676–1685.

Lamb

Crawford

E.D.

Peck

The Connectivity Map: Using Gene-Expression Signatures to Connect Small Molecules, Genes, and Disease. Science 2006, 313, 1929–1935.

Iorio

Bosotti

Scacheri

Discovery of Drug Mode of Action and Drug Repositioning from Transcriptional Responses. Proc. Natl. Acad. Sci. USA 2010, 107, 14621–14626.

Nigsch

Hutz

Cornett

Determination of Minimal Transcriptional Signatures of Compounds for Target Prediction. EURASIP J. Bioinform. Syst. Biol. 2012, (1), 2.

Library of Integrated Network-Based Cellular Signatures. Approach. http://lincs.hms.harvard.edu/about/approach/ (accessed Jan 23, 2014).

Steyvers

Tenenbaum

J.B.

Wagenmakers

E.-J.

Inferring Causal Networks from Observations and Interventions. Cognitive Sci. 2003, 27, 453–489.

Selventa. RCR Whitepaper. http://www.selventa.com/reverse-causal-reasoning-methods-whitepaper (accessed Nov 19, 2012).

Chindelevitch

Ziemek

Enayetallah

Causal Reasoning on Biological Networks: Interpreting Transcriptional Changes. Bioinformatics 2012, 28, 1114–1121.

Emig

Ivliev

Pustovalova

Drug Target Prediction and Repositioning Using an Integrated Network-Based Approach. PLOS One 2013, 8, e60618.

10.

Goldenberg

Mostafavi

Quon

Unsupervised Detection of Genes of Influence in Lung Cancer Using Biological Networks. Bioinformatics 2011, 27, 3166–3172.

11.

Banerji

Cibulskis

Rangel-Escareno

Sequence Analysis of Mutations and Translocations across Breast Cancer Subtypes. Nature 2012, 486, 405–409.

12.

Shah

P. S.

Roth

Goya

The Clonal and Mutational Evolution Spectrum of Primary Triple-Negative Breast Cancers. Nature 2012, 486, 395–399.

13.

Lehmann

B. D.

Bauer

J. A.

Chen

Identification of Human Triple-Negative Breast Cancer Subtypes and Preclinical Models for Selection of Targeted Therapies. J. Clin. Invest. 2012, 121, 2750–2767.

14.

Prat

Adamo

Cheang

Molecular Characterization of Basal-Like and Non-Basal-Like Triple-Negative Breast Cancer. The Oncologist 2013, 18, 123–133.

15.

Franceschini

Szklarczyk

Frankild

STRING v9.1: Protein-Protein Interaction Networks, with Increased Coverage and Integration. Nucleic Acids Res. 2013, 41, D808–D815.

16.

Brunet

J. P.

Tamayo

Golub

T. R.

ii9i9. Proc. Natl. Acad. Sci. USA 2004, 12 (101), 4164–4169.

17.

Rosusseeuw

Silhouettes: A Graphical Aid to the Interpretation and Validation of Cluster Analysis. J. Comp. Applied Math. 1987, 20, 53–65.

18.

Chen

Gray

W.H.

TNBCtype: A Subtyping Tool for Triple-Negative Breast Cancer. Cancer Inform. 2012, 11, 147–156.

19.

Smyth

G. K.

Limma: Linear Models for Microarray Data. In Bioinformatics and Computational Biology Solutions Using R and Bioconductor; Gentleman

Carey

Dudoit

, Eds.; Springer: New York, 2005; pp 397–420.

20.

Benjamini

Hochberg

Controlling the False Discovery Rate: A Practical and Powerful Approach to Multiple Testing. J. R. Statist. Soc. B 1995, 57, 289–300.

21.

Leandro-García

L. J.

Leskelä

Jara

Regulatory Polymorphisms in ß-Tubulin IIa Are Associated with Paclitaxel-Induced Peripheral Neuropathy. Clin. Cancer Res. 2012, 18, 4441–4448.

22.

Shina

S. Y.

Yong

Kim

C.G.

Deoxypodophyllotoxin Induces G2/M Cell Cycle Arrest and Apoptosis in HeLa Cells. Cancer Lett. 2010, 287, 231–239.

23.

Quiñones-Torrelo

Sagrado

Villanueva-Camañas

R.M.

Development of Predictive Retention-Activity Relationship Models of Tricyclic Antidepressants by Micellar Liquid Chromatography. J. Med. Chem. 1999, 42, 3154–3162.

24.

de Mooij-van Malsen

A.J.

van Lith

H.A.

Oppelaar

Interspecies Trait Genetics Reveals Association of Adcy8 with Mouse Avoidance Behavior and a Human Mood Disorder. Biol. Psychiatry 2009, 66, 1123–1130.

25.

Broad Institute of MIT and Harvard. Broad Institute TCGA Genome Data Analysis Center: Analysis Overview for Breast Invasive Carcinoma (Primary Solid Tumor Cohort). https://confluence.broadinstitute.org/display/GDAC/Home (accessed Jan 23, 2014).

26.

Cohen

A Coefficient of Agreement for Nominal Scales. Ed. Psych. Measure. 1960, 20, 37–46.

27.

Klemm

Bleckmann

Siam

Beta-Catenin-Independent WNT Signaling in Basal-Like Breast Cancer and Brain Metastasis. Carcinogenesis 2011, 32, 434–442.

28.

O’Shaughnessy

Osborne

Pippen

J. E.

Iniparib plus Chemotherapy in Metastatic Triple-Negative Breast Cancer. N. Engl. J. Med. 2011, 364, 205–214.

29.

Bai

Sun

SM-164: A Novel, Bivalent Smac Mimetic That Induces Apoptosis and Tumor Regression by Concurrent Removal of the Blockade of cIAP-1/2 and XIAP. Cancer Res. 2008, 68, 9384–9393.

30.

Juvekar

Burga

L.N.

Combining a PI3K Inhibitor with a PARP Inhibitor Provides an Effective Therapy for BRCA1-Related Breast Cancer. Cancer Discov. 2012, 2, 1048–1063.

Supplementary Material

Please find the following supplemental material available below.

For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.

For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.

2.31 MB

0.00 MB