Abstract
Identification of molecular-based signatures is one of the critical steps toward finding therapeutic targets in cancer. In this paper, we propose methods to discover prognostic gene signatures under a causal structure learning framework across the whole genome. The causal structures are represented by directed acyclic graphs (DAGs), wherein we construct gene-specific network modules that constitute a gene and its corresponding regulators. The modules are then subsequently used to correlate with survival times, thus, allowing for a network-oriented approach to gene selection to adjust for potential confounders, as opposed to univariate (gene-by-gene) approaches. Our methods are motivated by and applied to a clear cell renal cell carcinoma (ccRCC) study from The Cancer Genome Atlas (TCGA) where we find several prognostic genes associated with cancer progression - some of which are novel while others confirm existing findings.
Introduction
The most common and most lethal type of kidney cancer or renal cell carcinoma (RCC) is clear cell renal cell carcinoma (ccRCC). 1 Much heterogeneity exists within the ccRCC subtype of kidney cancer, and various factors are used to characterize the disease. This includes disease stage, tumor size, tumor cell morphology, lymph node status, and patient response to treatment. 2 The Cancer Genome Atlas (TCGA) ccRCC study 3 used multiple platforms to determine the associations between the molecular signatures of the disease and patient survival. These analyses identified a role for the PI(3) K/AKT pathways in tumorigenesis and ccRCC progression, and therefore as therapeutic targets.
The goal of this paper is to find gene signatures that are associated with patients' survival times. Based on the understanding of most disease processes, the ccRCC phenotype does not result from a mutation in a single gene, but rather from a coordinated series of interactions involving multiple molecular pathways and multiple genes. 4 Our main hypothesis is that genes influencing patient's survival times are cumulative effects of a given gene as well as its (upstream) regulators. The main goals of this study are (1) to determine the whole genome causal network from gene expression data and (2) to relate each gene expression to patient survival, adjusting for the estimated causal network (hence, regulators). The incorporation of analytic methods that are based on network models of gene expression has improved our ability to identify elements that can serve as biomarkers of patient prognosis. Because the interactions between the expressions of different genes are assessed within a biological network, this construct is labeled as a gene co-expression network. In a layout of this network, the gene functions are shown as vertices and the significant associations between gene functions are shown as connections (or edges) between them. 5 All edges in a co-expression network are undirected and can be quantified by different statistical measures, such as marginal correlations, partial correlations, or mutual information. 6 In existing approaches, there are two main aspects of gene co-expression networks: (1) hub genes and (2) modularity. In a co-expression network that corresponds to a set of genes, hub genes are the genes that connect to a significant proportion of the total genes in the network. In contrast, a modularity approach focuses on a subnetwork that has a higher density of edges within groups of genes than between them. Recent works, such as Han et al. 7 , Taylor et al. 8 , Patel et al. 9 , and Yang et al. 10 , uncovered prognostic genes for an outcome variable based on the characteristics of genes in the co-expression network.
Although the undirected co-expression network estimated from observational gene expression data has been useful to select prognostic genes, they do not explicitly account for directionality of the mechanistic regulation between the genes. The delineation of causal (directed) relations among genes would be useful not only in discovering (upstream) regulators for a particular gene but also in identification of predictive gene signatures involved in cancer progression. Causal relationships can be concisely represented by directed acyclic graph (DAG) models, and given an estimated/known DAG model, the causal effects can be computed using standard methods as in Pearl. 11
However, a DAG model is not directly identifiable (in a statistical sense) from observational/static gene expression data. Maathuis et al. 12 , proposed a method called IDA (intervention-calculus when the DAG is absent) to infer bounds on total causal effects under a limited causal structure estimated from observational data. Motivated by the IDA method, we propose methods to rank genes based on their effects on patient survival times, adjusting for their causal structure, ie, regulators. Our method has two main steps: estimating the causal structure of genes that identifies network modules for each gene and its regulators (“parents”), and then, subsequently, using the modules consisting of a gene and its parents to estimate the effects on survival times. In essence, this constitutes a network-oriented method as opposed to univariate gene analyses - thus, refining the estimations since it adjusts for potential confounders (as we demonstrate in the sequel).
The most challenging part of this analysis is to estimate the causal structure for a large number of genes (typically on the order of 104). IDA starts from a completely connected graph in which all pairs of genes are connected and iteratively removes the edges (“thins the graph”) by excluding edges with all orders of conditional independences (marginal independence, first-order conditional independence, and so on). However, to exclude an edge, the set of variables (conditioned on) needs to be a set of all subsets of the inter-connected variables – which leads to an exhaustive search over the large number of edges and vertices. Using penalized regression technique, we estimate sparse graphical models and use algorithms based on IDA to estimate the causal structure – thus, enabling us to scale our methods to such high-dimensional genomics data. Applying our methodology to TCGA ccRCC tumor samples, we found significant gene modules in ETS and Notch family. The signatures also involve the genes ARID1A and SMARCA4, which were found by the TCGA Research Network's study of ccRCC. 3
In Section 2, we introduce and describe our methodology. In Section 3, we present the results of the analysis of TCGA data to select prognostic gene signatures for the survival time of patients with ccRCC. In Section 4, we provide a summary and discussion.
Methods
We propose an approach for estimating the effect of each gene on patient survival, adjusted for the causal structure of all the genes of interest. The causal structure forms modules for each gene that consists of a gene and its parents - where parents are defined by the set of genes having a directed edge (pointing) toward a gene in a graph. The main challenge is that the unique determination of modules is unidentifiable from observational gene expression data. To address this issue, we propose a principled statistical procedure that consists of two main parts: (1) estimating the causal structure, which includes direct and indirect relations among genes, for high-dimensional gene expression data and (2) evaluating the effects of each gene under the ambiguous causal structure. Figure 1 concisely describes the entire workflow of our method. Briefly, the causal structure is first estimated through several undirected/partially directed graphs from Steps 1 to 6 and the edges are sequentially thinned with different implications of the dependencies for edges in different graphs. The eventual causal structure represented by the completed, partially directed acyclic graph (CPDAG) in Step 6 includes undirected edges when the directions are not identifiable. To address the issue of identifiability, several effect sizes of each gene for all possible modules from the CPDAG are obtained by the Cox-proportional hazards model, and the minimum effect size is used for ranking the genes. As opposed to the single gene analysis, we refine the estimation of effect size based on the estimated causal structure and show how this leads to better quantification of the prognostic effects. In the following subsections, we describe in detail our method for the causal structure estimation and the effect size evaluation given the causal structure.

Workflow to obtain the whole genome causal structure: pairs of genes (edges) are sequentially excluded by conditional (marginal) independence tests, starting from a completely connected graph and arriving at a skeleton. V-structure detection and completion steps then follow. PDAG is partially directed acyclic graph and CPDAG is completed PDAG.
Estimation of the causal structure for genes
We study the causal relations of p variables X1, …, X
p
by a DAG G = (V, E) with a set of vertices V = {1, …, p} and a set of edges E ⊆ V × V. All edges in the DAG are directed; in that, (v, w) ∈ E but (w, v) ∉ E for all v ∈ V ≠ w ∈ V. The acyclic graph means that the DAG contains no cycle (no path from a vertex to the same vertex along with directed edges). In our context, the vertices/variables represent genes, and the edges represent directed relations between pairs of these genes. We assume that X = (X1, …, X
p
)T ∈ Rp follows a multivariate normal distribution N
p
(0, Σ) with the density function fΣ(·). The parent vertices of a vertex i ∈ V are defined by the vertices pointing toward the vertex i. We denote the parent vertices of i ∈ V by pa
i
, and the corresponding variables by Xpa. The Markov properties determined by the DAG G admit recursive factorization of the joint probability density function fΣ,
The joint distribution of X is decomposed into conditional densities of each variable given its parents. Because several different DAGs may determine the same factorization, the DAG G is not identifiable from the observational distribution. The Markov property on the observational distribution of X provides the relations of conditional independence among the random variables. However, a collection of all the DAGs that correspond to the same set of conditional independence restrictions can be assembled into a Markov equivalence class, which can be determined based on observational data.
The approaches described by Spirtes et al. 13 , and Pearl 11 rely on a series of conditional independence tests to estimate a Markov equivalence class. The framework of the inductive causation (IC) algorithm is based on the theorem in Andersson et al. 14 : two DAGs are Markov equivalent if and only if they have the same skeleton and the same v-structures. The skeleton of a DAG G is obtained by replacing all directed edges with undirected edges. A v-structure is an ordered triplet of vertices (i, j, k) such that G contains the directed edges (i, k) ∈ E and (j, k) ∈ E and (i, j) ∉ E (i → k ← j). From the observational distribution, the skeleton and v-structures can be identified and are represented by a partially directed acyclic graph (PDAG). The undirected edges in the skeleton and the directions present in the v-structures imply conditional independencies among the variables corresponding to V:
There is an edge between vertices i and j in the skeleton if and only if the variables Xi and Xj are dependent, conditional on variables corresponding to XS = {Xk: k ∈ S} for all S ⊆ V\{i, j}.
In a v-structure i → k ← j, Xi and Xj are dependent, conditional on every set that contains Xk or its descendants.
The framework of the IC algorithm 11 relies on the conditional independence constraint and consists of three steps: (1) estimation of the skeleton by conditional independence tests, (2) identification of the v-structures, and (3) completion of the PDAG obtained from (1) and (2). We follow the framework of the IC algorithm by modifying the details of the algorithms to be suitable for high-dimensional data. We describe each step of our method in the following subsection.
Estimation of the skeleton. Spirtes et al.
13
, described various algorithms for estimating the skeleton. Our method is a modification of the standard algorithm known as the Peter and Clark (PC) algorithm, which has been shown to be consistent for high-dimensional sparse graphs.
15
The modification of the PC algorithm is based on the concept of a moral graph of a DAG. Given a DAG G, moralization of a DAG is executed by connecting the vertices i and j that form a v-structure i → k ← j, and replacing all directed edges by undirected edges. This moral graph is a Gaussian graphical model (GGM) that is specified by the structure of zeros in the inverse covariance matrix Σ−1.
16
The PenPC algorithm
17
starts from a GGM instead of a completely connected graph in the PC algorithm. The PenPC algorithm uses penalized regressions to estimate the GGM, which allows for the removal of a large amount of edges from the initial stage. The sparsity of the GGM makes skeleton estimation feasible for high-dimensional data, such as our ccRCC gene expression data, with p = 14,576. Motivated by the PenPC algorithm, the estimation of the skeleton proceeds in two stages: (1) the GGM is estimated based on penalized full-order partial correlations and (2) more edges in the GGM are removed by lower order (unpenalized) partial correlation tests. From a known DAG, G = (V, E), we can construct the GGM by the moralization and the skeleton by replacing the directed edges with undirected edges. We denote the GGM and the skeleton of the DAG, G = (V, E), by Gm = (V, Em) and Gu = (V, Eu) with superscripts. The edges in Em and Eu are all undirected, which means that (i, j) ∈ Eu (j, i) ∈ Eu. In an undirected graph, the neighborhood of a vertex v is defined by the set of vertices that are connected to v. Note that l-order conditional independence means that conditional independence exists between two variables given l number of variables. The conditional independence is assessed by estimated partial correlations
[Step 1] Estimation of the GGM, Gm = (V, Em). Meinshausen and Bühlmann 18 proposed a regression-based approach to estimate a GGM. The neighborhood of v is estimated by a penalized regression of the variable corresponding to v versus the remaining variables. After estimating all p penalized p − 1 dimensional coefficients by separate estimations, the graph structure is estimated based on the zero structure of those coefficients. For a response Xv where v ∈ V, we use procedures 1A and 1B, given below:
Procedure 1A. Marginal correlations are calculated between Xv and all other variables,
Procedure 1B. For all Xv, v ∈ V, we select a neighborhood, denoted by ne(v), using a penalized regression with Xv as a response variable and the variables {Xw: w ∈ ne0(v)} as explanatory variables,
In the Step 1 of the PenPC algorithm, 17 they fit the p penalized regressions without marginal independence tests at the beginning; in other words, all regressions involve p − 1 covariates. In this algorithm, we include the neighborhoods from the correlation graph for each response variable as the covariates in the penalized regression corresponding to the response.
[Step 2] Estimation of the skeleton, Gu = (V, Eu). If an edge (v, w) belongs to Em, which is an edge set of the GGM, the genes Xv and Xw are conditionally dependent given all other genes (full-order conditional independence). An edge (v, w) in a skeleton is equivalent to the variables Xv and Xw, which are conditionally dependent given all subsets of the other variables. Therefore, more edges will be removed from the GGM if any lower order partial correlation test provides a P-value greater than the threshold α. For an edge (v, w) that is in the GGM but removed from the skeleton ((v, w) ∈ Em and (v, w) ∉ Eu), we calculate a separation set that is the union of the sets that induce conditional independence between the variables Xv and Xw.
Our algorithm performs the partial correlation tests from the first order, l = 1, until l exceeds the maximum size of the neighborhoods in the current graph. We denote ne(i, G) as the set of neighbors for i ∈ V in an undirected graph G. Our algorithm is summarized in detail in Figure 2. It starts from the first-order partial correlation tests because we already tested marginal correlations to estimate the GGM. For a fixed order, l, each edge is tested by partial correlations given the subsets in the neighborhood for either vertex that forms the edge. We changed the order-independent version of the PC algorithm in Colombo et al. 20 to an algorithm that can operate in parallel with the vertices and the order l. The main difference between our algorithm and the PC algorithm is in the calculation of the separation sets. While the PC algorithm stops testing an edge when a separation set is obtained, our algorithm exhaustively searches all vertices that participate in any of the separation sets (Step 2.2.1 of Fig. 2). The exhaustive search provides a more accurate estimation of the v-structures.

A modified PC algorithm starting from a GGM.
v-structure identification. All edges (v, w) ∉ Eu have separation sets denoted by Svw. Consider a triplet (v, w, k) such that v-k-w in the skeleton. From the conditional independence property of the v-structure, if v → k ← w, k must not be in the separation set Svw because Xv and Xw are conditionally dependent given any set containing the variable Xk corresponding to the child vertex k. Thus, we direct the triplet as
Using all the graphs, the correlation graph, GGM, and skeleton, in our algorithm, we determine the v-structures for the triplets, (v, w, k), such that
The edge (v, w) is excluded from the correlation graph.
(i) The edge (v, w) is in the GGM, ie, Xv and Xw are marginally correlated and partially correlated given all other variables
Using the above rules, we obtain a PDAG that represents the skeleton and v-structures.
Completion of the PDAG. The completion of the PDAG obtained from the skeleton and v-structures is accomplished by maximally orienting the remaining undirected edges, with restrictions of no directed cycle and no extra v-structure. This completion is done by applying several deterministic rules from Meek 21 and Pearl. 11 The resulting graph is called the CPDAG. It represents the Markov equivalence class. The directed edges in the CPDAG exist in every DAG in the equivalent class; otherwise, the directions for the undirected edges in the CPDAG are reversible in some DAGs in the equivalent class.
Selection of gene signatures based on the causal gene modules
In this section, we identify which genes have a significant effect on survival time. The unsupervised causal structural learning in the previous section provided a CPDAG that represents a Markov equivalence class. The main idea is that when we model a gene in relation to survival time, we adjust for its parent genes, which are obtained from the CPDAG. If the DAG that represents the causal relations for V is known, the parent genes for all vertices are obvious. Using the unique gene modules for each gene, we can obtain the effect of a gene on patient survival by including the expressions of its parent genes in a Cox-proportional hazards model. However, the undirected edges in the CPDAG generate uncertainty in determining the parent genes: the arrows of the undirected edges imply either directions. Similar to the IDA method, 12 our approach accounts for all possible parent genes by switching the directions for the undirected edges and obtains the lower bound for the effect of a gene on survival time.
For a gene i, we define a set of parents {j: j → i} and children {j: j ← i} only for the directed edges. If all neighboring vertices of i are directed, then we have an obvious parent set for i. For an undirected edge that connects to vertex i in the CPDAG, the parents and children are indistinguishable. Under this uncertainty, all effects are computed by switching the directions without creating extra v-structures or cycles on the CPDAG. The detailed algorithm for constructing the candidate set of parents for a vertex is described in Maathuis et al. 12
Application to Kidney Cancer Data
TCGA ccRcc data description
RNA-seq and clinical data are available in TCGA for 480 patients with primary ccRCC. Among those 480 patients, 343 were censored; the quantiles of the observed survival times in days were 2 (0%), 326.75 (25%), 456.35 (33%), 619.90 (45%), 731.00 (50%), 830.15 (66%), 1,338.50 (75%), and 2,830 (100%). We analyzed RNA-seq V2 data from TCGA tumor tissues for 480 patients with ccRCC (accessed as of November 7, 2013). We filtered out genes with 75 percentiles across the 480 samples less than 25. After removing genes with low expression, we obtained 14,576 genes that are used for downstream analyses. The expression of each gene within each sample is measured by total read counts. We used log10 transformed total read counts in this study. We obtained a 480 × 14,576 residual dataset after removing the effects of several batch covariates by linear regression: 75th percentile of log10 total read counts, capturing the read depth for each sample, plate, and institution. Then we standardized the residual data to have a mean of zero and a unit variance for each gene. The final expression values conform very well with the normal distribution (Fig. 3) - thus, facilitating GGM approaches.

The histogram of the post-processed expressions. The standard normal density curve is overlaid on the histogram.
From unsupervised learning, we estimated a whole genome expression network, that is a CPDAG in which some of the edges are directed, with the set of vertices V = {1, …, 14,576}. Then we evaluated the effect of each gene in V by using a Cox-proportional hazards model, adjusting for the causal structure, CPDAG. The PC algorithm implemented in the pcalg package of Kalisch and Bühlmann 15 and Colombo and Maathuis 20 is not applicable for estimating a skeleton for the number of vertices we analyzed, P = 14,576. Thus, at the beginning of the PC algorithm, we used penalized regressions, which are more suitable for these high-dimensional data. Based on the estimated causal structure for the genes, we evaluated the effect of each gene on survival time. We describe the details of the CPDAG estimation results in the Appendix.
Results
Using graph theory terminology, we refer to “parent” (“child”) genes as genes pointing toward (away from) the gene of interest.
Gene rankings: Our gene ranking is based on the Cox-proportional hazards model, adjusted for the estimated CPDAG and four clinical covariates: patient age and tumor stage, grade, and metastasis status. We refer to this model as the network-adjusted model. To benchmark our method, we considered the model that includes the gene expression and the four clinical covariates with no parent gene and refer it as the unadjusted model. Therefore, in the network-adjusted model (which includes the set of gene parents for each gene), we have more parameters than the unadjusted model. Figure 4 displays the scatter plot of the effect sizes from the unadjusted model versus the network-adjusted model for all 14,576 genes. The slope of the regression line in the scatter plot was 0.893 with zero intercept. The trend with the slope less than 1 indicates that the effect sizes from the network-adjusted models are overall less than the effect sizes from the unadjusted model. However, the areas, A1 and A2, in Figure 4 indicate that the effect sizes from the network-adjusted model are greater than the maximum in the effect sizes from the unadjusted model. Although the effect sizes from the unadjusted model tend to be greater than those from the network-adjusted model, several genes show evident increases in their effect sizes by adding the set of parent genes (located in the regions A1 and A2). Especially, BAT3 gene showed the sign change in the effect sizes from the unadjusted model (0.04) to the network-adjusted model (-0.5) with 1,136.57% relative increase in their effect sizes by adding the parent genes, BAT2, FLOT1, and NOC2L (Table 1). Both BAT3 and BAT2 genes are HLA-B-associated transcripts, and the sequences of the two genes were shown to be closely linked. 22 EEF1A1 gene showed 1,311.37% relative increase in the effect sizes from the unadjusted model to the network-unadjusted model by adjusting for EEF1A1P9 gene (Table 1). The EEF1A1P9 gene is a pseudogene that is a dysfunctional gene with sequence similar to EEF1A1 gene, and has lost their protein-coding ability or is no longer expressed. 23 Regulated by the pseudogene EEF1A1P9, EEF1A1 gene has a significant effect on the survival times. In summary, the top-ranked genes from the network-adjusted model are in the areas A1 and A2 in Figure 4, and this indicates that most of the top-ranked genes are not found in the unadjusted model.
Top 50 genes (for ccRCC), ranked by the network-adjusted model: effects on survival time measured by the network-adjusted model and unadjusted model, relative increase in effect size resulting from the network-adjusted model compared to that from the unadjusted model, parent genes, and children genes. The red and green texts in the third and fourth columns indicate negative and positive effects, respectively.

Scatter plot of the effect sizes of all genes in relation to survival time, from the unadjusted model versus the network-adjusted model. The names of the top four genes from the network-adjusted model are listed on the graph. The green line represents equal effect sizes between the two models. The red dashed vertical and horizontal lines are drawn at the maximum in the absolute effects from the unadjusted model (0.35). The areas A1 and A2 indicate that effect sizes from the network-adjusted model are greater than the maximum in the effect sizes from the unadjusted model.
Prediction performance: We also evaluated the prediction performance of the network-adjusted and unadjusted models using Harrell's concordance indices. The index is a rank-correlation measure to evaluate predictive accuracy for censored survival outcome and is defined as the proportion of all usable patient pairs in which the predictions and outcomes are concordant. 24 Therefore, the larger index value indicates the more accurate model. Figure 5 shows Harrell's concordance indices of both models for the top 100 genes from the network-adjusted model. For most of the top 100 genes, the indices for the network-adjusted model are larger than those for the unadjusted model. Notably, the SLC30A5 gene shows the greatest increase in the indices after adding the parent genes, AGGF1, GFM2, GMCL1, HEXB, IPO11, MIER3, and NUDT18 (Table 1). We display survival curves for low-expressed and high-expressed groups for the TAF6 gene (the top ranked gene in Table 1) at the median expression levels of its parent genes (Fig. 6A).

For the top 100 genes ranked by the network-adjusted model, concordance indices for the network-adjusted model (red) and the unadjusted model (blue).
Effect size estimation of cancer genes: Next, we focused on the known cancer genes in the Catalogue of Somatic Mutations in Cancer, 25 and found that 415 genes in our gene expression dataset are included in that catalog of cancer genes. From our ranking of the corresponding 415 effect sizes from the network-adjusted model, the top 50 cancer genes are displayed in Table 2. We can consider the relative increase in effect size resulting from the network-adjusted model compared to that from the unadjusted model (listed in Table 2) as the efficiency gain when we use the network-adjusted model as a predictive model. While 2 genes among the top 50 genes had losses and 13 genes had no relative increase, the remaining 35 genes had gains by adjusting for the causal structure. The SEPT6 gene had 5,334.49% gains in effect size by adjusting for expressions of parent genes FOXP1, GPR146, MCOLN2, and SBK1. The REL gene had 5,838.95% gains in effect size by adjusting for expressions of parent genes ETV3, FAM110C, PAPOLG, and SKIL.
Top 50 cancer genes (for ccRCC) we identified within the dataset of 415 genes that are also found in COSMIC, ranked by the network-adjusted model: effects on survival time, from network-adjusted model and unadjusted model; relative increase in effect size resulting from the network-adjusted model compared to that from the unadjusted model; parent genes; and children genes. The red and green texts in the third and fourth columns indicate negative and positive effects, respectively.
Biological interpretation of the top prognostic cancer genes: From the signatures based on the known cancer genes displayed in Table 2, we found the ETV5 gene to be the top-ranked gene and to have parent genes that included the ETV1 gene. ETV5 and ETV1 are ETS family members, share a highly conserved ETS binding domain, and are almost 50% identical along the full protein. 26 Gene fusions involving the ETS family have been identified in a large fraction of prostate cancers.27,28 ETV5 is positively regulated by the Glial cell line-derived neurotrophic factor (GDNF) rearranged during transfection (RET) signaling pathway, which plays a crucial role in kidney development.29,30 We display survival curves for low-expressed and high-expressed groups for the ETV5 gene (the top ranked gene in Table 2) at the median expression levels of its parent genes (Fig. 6B). Also, among the top-ranked genes we identified were ARID1A and SMARCA4, which were also reported in TCGA Research Network's analysis of ccRCC. 3 ARID1A regulates cell cycle progression and prevents genomic instability in human cancer. 31 Hoffman et al. 32 , suggested SMARCA4 as a therapeutic target for BRG1–mutated cancers, such as lung cancer. The NOTCH1 gene was among our top-ranked genes, with parent genes CCDC88C and JAG2 and child gene ZMIZ1. NOTCH1 is included in the Notch signaling network, which is crucial to the control of the fate of a cell and its development processes through local cellular interactions. 33 Sjölund et al. 34 , and Ai et al. 35 , reported that NOTCH1 expression was significantly elevated in mRNA and protein levels in ccRCC tumors, compared to matched nontumor tissues. Interestingly, the NOTCH1 gene was selected by our model after adjusting for one of the parent genes, the JAG2 gene, which is a NOTCH ligand. Rakowski et al. 36 , observed that the gene ZMIZ1, which is regulated by NOTCH1 in our estimated causal network, is coactivated with NOTCH1 in leukemia.

Survival curves from the fitted network-adjusted Cox-proportional hazards models for top genes at the median expression levels of their parent genes. The low-expressed and high-expressed genes indicate the bottom and top 10% of the observed expression levels. (
Discussion
In this paper, we propose methods to select gene signatures based on a causal network learning. The causal network provides modules for each gene that consists of the gene and its parent genes. Rather than treating the modules together for gene signature discovery, we refine the estimation of the effects of each gene by adjusting for the parent genes. We applied this method to determine gene signatures at the gene expression level that correlate with patient survival time based on our analysis of TCGA ccRCC tumor samples. We extensively analyzed RNA sequencing data, including 14,567 genes, which represent 480 TCGA ccRCC tumor samples, to determine the whole genome causal structure, adjusted for batch effects. Then we assessed the effect sizes of all genes and adjusted for the estimated causal structure and clinical covariates of patient age and tumor stage, grade, and metastasis status. As gene signatures, we found ETV5, adjusted by ETV1, and NOTCH1, adjusted by JAG2. The signatures also involve the genes ARID1A and SMARCA4, which were found by the TCGA Research Network's study of ccRCC. 3
The main challenge of our analysis was to construct a whole genome causal structure from tens of thousands of genes. A standard approach is to screen genes by the strength of the association between gene expression and patient survival time in advance of assembling a network. Instead of prescreening the genes, we started with a completely connected graph and screened the edges to obtain a causal structure for all the genes. Our approach uses the PC algorithm, which thins the edges in the completely connected graph by edgewise partial correlations given all possible subsets of all other vertices. However, because the computational time of the PC algorithm is inefficient when working with large numbers of vertices and a P-value cutoff of 0 < α < 1, 17 we added a GGM estimation step to the middle of the PC algorithm. The GGM estimation step involves removing the edges by using penalized full-order partial correlations obtained from p separate penalized regressions. Those penalized regressions form a sparse GGM (0.027% of all possible edges in our data analysis), and we further assessed the lower order partial correlations for the edges in the GGM. Using several meaningful graphs, a correlation graph, a GGM, and a skeleton (as described in Fig. 1), we successfully obtained a causal structure from a whole genome gene expression dataset of gene expressions.
When the normalized read counts follow non-Gaussian distribution, we can still use the similar framework to estimate the causal structure. A recent paper, Loh and Bühlmann, 37 proved that the inverse covariance matrix reflects the moral graph of a DAG when data are generated from a linear, possibly non-Gaussian structural equation model (SEM) under a faithfulness (every conditional independence relations true in the joint distribution are entailed by Markov property applied to the underlying DAG) assumption. However, we need to be more careful to choose the detailed method in each step. In Step 1 of estimating the moral graph (instead of GGM), we can recover the edge set of the moral graph by using node-wise regressions with Lasso. 18 Then our algorithm uses a series of partial correlation testings that rely on Gaussian assumption. Instead of the edgewise test, Loh and Bühlmann 37 suggested to use nonparametric score-based search among DAGs that are consistent with the moral graph. However, the identifiability of a DAG using their score-based algorithm relies on strong assumption on error (error covariances are specified up to a constant multiple). Moreover, the efficiency of their dynamic programing to select an optimal DAG relies heavily on the structure of GGM. To overcome these limitations, a modified framework of PC algorithm still seems to be plausible, especially for the dimensionality of the most genome-wide data. As an alternative to the partial correlation tests, we may use kernel-based conditional independence test, 38 which has been applied in causal discovery. The extension of our method to relax the Gaussian assumption will be our future research.
The computation efficiency of the causal structure estimation mostly relies on the estimation of GGM. In our application data example where P = 14,576 and n = 480, on average a penalized regression took seven minutes (2.00 GHz processor and 128 GB RAM running on Linux using 64 bit R3.0.1), including the tuning parameter search across 100(λ) × 10(τ) two-dimensional grids with median 4,967 covariates. The algorithms we described in this paper to estimate correlation graph, GGM, and skeleton, although computationally expensive, can be parallelized by vertices. We are currently developing freely available software for this method. As future work, we will apply this method to other genomic, epigenetic, and proteomic platforms, eg, protein expression data, microRNA expression data, and DNA methylation data.
Supplement Data
Appendix includes the details of the causal structure estimation results using the TCGA ccRCC data.
Appendix Figure 1
Histograms of the degrees of all genes for graphs estimated from Step 1 and Step 2.
Appendix Figure 2
Histogram of the degrees of all genes for the skeleton estimates and frequencies of the degree change from GGM to skeleton estimates.
Author Contributions
Analyzed the data: MJH. Wrote the first draft of the manuscript: MJH. Contributed to the writing of the manuscript: MJH, VB. Agree with manuscript results and conclusions: MJH, VB, K-AD. Jointly developed the structure and arguments for the paper: MJH, VB, K-AD. Made critical revisions and approved final version: MJH, VB, K-AD. All authors reviewed and approved of the final manuscript.
