Abstract
We have developed a general framework to construct an association network of single nucleotide polymorphisms (SNPs) (SNP association network, SAN) based on the functional interactions of genes located in the flanking regions of SNPs. SAN, which was constructed based on protein-protein interactions in the Human Protein Reference Database (HPRD), showed significantly enriched signals in both linkage disequilibrium (LD) and long-range chromatin interaction (Hi-C). We used this network to further develop two methods for predicting and prioritizing disease-associated genes from genome-wide association studies (GWASs). We found that random walk with restart (RWR) using SAN (RWR-SAN) can greatly improve the prediction of lung-cancer-associated genes by comparing RWR with the use of network in HPRD (AUC 0.81 vs 0.66). In a reanalysis of the GWAS dataset of age-related macular degeneration (AMD), SAN could identify more potential AMD-associated genes that were previously ranked lower in the GWAS study. The interactions in SAN could facilitate the study of complex diseases.
Keywords
Introduction
In the last 10 years, genome-wide association studies (GWASs) have become an important approach for unbiased discovery of common genomic loci, represented by selected single-nucleotide polymorphisms (SNPs) that are associated with complex diseases or traits.1,2 Associations between common SNPs and various diseases have been extensively studied,3–6 but most of them either have small effects on disease risk or only explain a small fraction of the susceptible population.7,8 In a typical GWAS analysis, a large number of SNPs are evaluated for their statistical associations with a certain phenotype. 9 But, because of the need for multiple testing corrections, only very few SNPs can successfully surpass the significance threshold and be selected for the further investigation.10,11 In such a context, one is very likely to miss some crucial information contained in the filtered-out SNP data. On the other side, since many complex diseases are the outcome of the joint action of multiple genes, many real biomarkers that have a significant risk effect in combination but not individually often fail to be detected by a typical GWAS.12,13 Thus, there has been increasing demand in developing methods to reanalyze GWAS datasets and to study associations of high-order SNP combinations with complex phenotypes.14,15
Recently, a gene-level knowledge-based strategy that utilizes prior biological knowledge at the gene level to facilitate GWAS dataset analysis has emerged as a potentially more powerful approach. One of the first attempts to utilize genetic information is gene-based GWAS analysis, in which all SNPs within a candidate gene are considered jointly. 16 The pioneering method to combine SNPs in multiple genes is pathway-based GWAS analysis, in which SNPs located in diverse genes of the same pathway are examined jointly for their association with a disease or trait. 17 In this method, genes in a specific pathway are treated as an exchangeable set. In a newly developed pathway-based method, a Markov random field model was proposed to incorporate the topological structure information of a pathway. 18 Considering that current data sources of pathway cover only less than 20% of proteins and genes, network-based approaches on a larger scale have recently been developed to integrate network information to prioritize genes.19,20
In this paper, we make an attempt in an alternative direction on how to reasonably utilize the genetic information to assist GWAS dataset analysis. Different from previous gene-based approaches that usually first map an SNP to a gene, we establish a general framework to map different sources of gene interaction information (such as protein-protein interaction, gene coexpression, or any types of functional associations) to SNP-tagged genomic loci, and sequentially construct a mutual SNP association network based on this information. Proven by large-scale experiment datasets (such as HapMap 21 and HiC 22 ) and known disease-related SNP data, 23 this SNP association network (SAN) is able to reflect the real functional associations between genomic loci, which may facilitate the analysis of GWAS datasets. In order to test this, we developed a disease-related SNP prediction method by the use of a random walk with restart (RWR) strategy. 24 Compared with the prediction based on the Human Protein Reference Database (HPRD) network, the prediction based on SAN shows a significant improvement (AUC: 0.81 vs 0.66). We further test our SAN by reanalyzing the GWAS dataset of age-related macular degeneration (AMD). 25 By referring to Google's PageRank algorithm, we developed a new method that combined the AMD GWAS dataset with the SAN topological information to rerank the relevance between SNPs and AMD. According to our reranking result, we found new AMD-related SNP candidates, which is in agreement with reports in the literature.
Result
General Idea of SNP Association Network Construction
In GWASs, when an SNP is connected with a specific disease, it actually means that the chromosomal region around this SNP has one or more function elements, such as protein-coding genes, that are related to this disease. 26 Considering that those genes that are involved in the same disease tend to have closer functional interactions in the gene interaction network (GIN) than other genes, 27 we can exploit the gene interaction information to evaluate functional associations between genomic loci. Figure 1 shows a simple example of how SAN is constructed for three SNP-tagged genomic loci based on gene interactions. We can calculate the SNP association score (SAS, Formula 1 in the Method section) between each pair of SNPs and obtain a symmetric SAS matrix for all SNP pairs. SAS is calculated based on the connectivity between genes inside of the loci. The higher the score, the more the possibility that is there a functional association between these two loci. For this SAS matrix, we can further test the significance of each SAS by random permutation. After filtering out SNP pairs with nonsignificant SAS, we can finally construct the SAN.

The general idea of SAN construction: an example network. Gi (or Gj) represents a gene set in the chromosomal region of SNPi (or SNPj). The computing method for SAS is as shown in Formula 1.
Parameter Setting for the SNP Association Network Construction
Several parameters need to be set in the construction of SAN in order to best utilize the information. The first parameter is the length of the genomic locus that each SNP represents. Based on the datasets of known disease-related genes and SNPs that are involved in coronary heart disease, prostate cancer, and schizophrenia, we tested variable lengths of genomic range (from 1 K to 1M). As shown in Figure 2A, when the length is increased, more disease-related genes can be embraced into the represented neighboring region of the known disease-associated SNPs; at the same time, the proportion of disease-related genes among total genes is decreased. We finally chose 100 kb (50 kb each from upstream and downstream of a SNP site) as the neighborhood of this SNP to balance both the coverage and specificity of disease-related genes in the SNP-represented regions. Furthermore, we clustered SNPs whose neighborhoods cover the same gene set into one SNP cluster, as they could not be distinguished in the calculation of functional association. Hence, in the SAN, an SNP cluster can be labeled as one node and represents one genomic locus.

(
The second parameter is a control parameter in the diffusion kernel method. 28 In order to control the noise and to capture the long-range relationships between genes, we used the diffusion kernel method (Formula 2) 28 to transfer the HPRD network 29 into an inter-gene association matrix. In the diffusion kernel formula, the parameter β controls the extension of “diffusion”. To obtain an optimal value of β for multiple diseases, we tested different β values (from 0.01 to 2) using known disease-related genes from coronary heart disease, prostate cancer, and schizophrenia (Fig. 2B). Compared with random background, genes involved in a certain disease are likely to be connected closely, that is, larger scores in the diffusion kernel matrix. We chose 0.5 as the optimal β value because it gives the largest differences of cumulative distributions of diffusion kernel scores between disease-related genes from these three diseases and random background.
The third parameter is the P-value cutoff for selecting the statistically significant associations. Because different genomic loci contain different numbers of genes, which also have different degrees in the HPRD network, we cannot compare the SASs with each other directly. So for the SAS of each SNP cluster pair, we use permutation to generate a random background distribution and convert each SAS into an empirical P-value (Formula 3). The significant SASs can be determined based on a P-value cutoff. As shown in Figure 2C, we assessed the impact of different P-value thresholds on the size of the SAN and chose a P-value less than 1 × 10−4 as the threshold for further study.
In this way, we obtained a SAN with 13,217 nodes (genomic loci) and 153,235 interactions (significant associations). According to the distribution of degrees, the SAN is approximately a scale-free network, 30 which means there are hub nodes in the network (Fig. 2D). These hub nodes represent the hotspots on chromosomes, which tend to have more interactions with other genomic loci. In the circular layout 31 of SAN (Fig. 2E), we can find that those hotpots are mainly located on chr1, chr11, chr12, chr17, and chr19. The density of interactions in the genome is positively correlated with the gene density (ρ = 0.53, P < 2.2 × 10−16, Spearman correlation test), but with no significant correlation to the density of SNP in the genome (ρ = −0.035, P = 0.092, Spearman correlation test).
Linkage Disequilibrium of SNP Cluster Nodes in the SNP Association Network
In population genetics, linkage disequilibrium (LD) is the nonrandom association of alleles at different loci on chromosomes. 32 In the human genome, adjacent SNPs mostly have strong LD, forming the so-called LD block, whereas SNPs on different chromosomes or SNPs on the same chromosome but with long distance are not. In the SAN, about 92% of the interactions are inter-chromosomal while only 8% are intra-chromosomal. Interestingly, although most of the interacting nodes in the SAN are located on different chromosomes that do not exist in proximal LD blocks, they are likely to have a stronger LD compared with background distribution (Fig. 3A, P-value of Kolmogorov–Smirnov test (KS test) < 2.2 × 10−16, genotype data from HapMap). In the SAN, the median of LD between interacting nodes is 0.151, while the random background is 0.098 (P-value of Wilcox test < 2.2 × 10−16). The significantly stronger LD of interacting node pairs in the SAN raises the possibility that these node pairs are likely to have profound associations with similar functions or phenotypes.

(
Figure 3B shows a representative example of LD between two connected SNP cluster nodes SC13676 (on chromosome 7) and SC5103 (on chromosome 2). Both SC13676 and SC5103 have existing LD blocks in their own loci. Interestingly, the SNP pairs between these two loci, which are on different chromosomes, also display strong LD. There are two genes, TWIST1 and GLI2, on the corresponding genomic loci, respectively. TWIST1 and GLI2 do not interact directly in the HPRD network; they are coupled by the gene GLI3. Both GLI2 and GLI3 are members of GLI family of transcription factors and are crucial actors for normal development in the Sonic hedgehog–Patched–Gli (Shh-Ptch-Gli) pathway.33,34 Dysregulation of the Shh-Ptch-Gli pathway leads to several human diseases, including birth defects and cancers.35,36 Recent researches have shown that TWIST, a developmental regulatory gene and potential oncogene, does appear to be linked to Shh signal transduction.37,38 Mouse Twist protein can activate transcription of human GLI1, another member of GLI family of transcription factors, by interacting with the E-boxes in GLI1's first intron. 39 More interestingly, nonsense, missense, deletion, and insertion mutations in several regions of the human TWIST gene have been shown to cause the Saethre–Chotzen syndrome, an autosomal dominant disease whose clinical phenotype partially overlaps with Shh-pathway-related human diseases.40,41 All of these facts indicate that there is a strong functional association between these two genomic loci (represented by SC13676 and SC5103), which is well worth further joint analysis.
HiCinteraction between SNP Cluster Nodes in the SNP Association Network
The functional association of genomic loci with long distance in the genome may also connect with the direct long-range physical interaction of chromatins. The three-dimensional folding of chromosomes can bring distant functional elements such as a promoter and an enhancer into close spatial proximity. Such long-range interaction can be detected by the recently developed HiC technique in an unbiased and genome-wide manner. 22 Here, we compared the genomic loci pairs that have direct interactions in the SAN with that in the human HiC data (Table 1). It was shown that, compared with randomly selected genomic loci pairs, the long-range chromatin interactions detected by HiC exhibit a clear dominance in genomic loci pairs that are directly interacting in the SAN (KS test P-value < 2.2 × 10−16). About 30% of the interacting SNP cluster pairs in the SAN can be found with HiC interactions. This frequency reduces to about 20% in the random background and increases to 40% for interacting SNP cluster pairs related to the same disease. Nearly 1.5% of the interacting SNP cluster pairs are supported by over three HiC interactions, which is 50% higher than that in the random background. For those interacting SNP cluster pairs that are involved in the same disease, this proportion reaches 2.6%. These results indicate that at least some functional associations between the SNP clusters in the SAN are established by the direct physical interaction between the corresponding chromosomal regions.
The distributions of HiC interactions between interacting SNP clusters in the SAN (SAN-link), randomly picked SNP clusters (random), and interacting SNP clusters in the SAN that are related to the same diseases (disease-link).
Close Correlation of Known Disease-Related SNP Cluster Nodes in the SNP Association Network
In the SAN, there are a number of nodes that correspond to known disease-related SNPs. Our results show that the distance distribution between SNP cluster nodes related to the same disease is significantly smaller than that from randomly selected nodes (Table 2). We have checked 13 different types of diseases (each with more than 20 nodes in the SAN). Eleven diseases showed significantly shorter distances between nodes while comparing with the random background (t-test, P < 0.05), with two diseases (prostate cancer, Type 2 diabetes) as exceptions. The smaller distances in SAN are also found in nodes that are related to the similar subtypes of diseases. Autoimmune diseases are caused by inappropriate immune responses of the body against substances and tissues normally present in the body. 42 It has been shown that different autoimmune diseases are likely to share etiological similarities and underlying mechanisms. 43 In the SAN, 251 nodes are related to different subtypes of autoimmune diseases. Compared with the random background, the nodes related to the same subtype of disease form a more closely connected subnetwork. In the autoimmune-disease-related subnetwork, there are 183 edges and the size of the maximally connected subgraph is 64 (Fig. 4A), while in the random background the average number of edges is only 92 and the average size is 20 (both P-value = 0 by random sampling).

(
The disease-related SNP clusters from different diseases having significant shorter distances than those randomly selected clusters in SAN.
As the closely connected subnetworks in the SAN are likely associated with the same disease or phenotype, we can use the topological information of the SAN, such as the clustering coefficient and the shortest distance between nodes, to discover the potential high-order SNP combinations that are relevant to a disease or phenotype. For example, we examined the autoimmune-disease-related subnetwork and found two quasi-cliques (QC1 and QC2) that are separately comprised of eight nodes with 25 edges (Fig. 4B) and eight nodes with 24 edges (Fig. 4C). Studies had shown that these closely linked nodes in both cliques are related to autoimmune diseases. Thus, we inferred that the SAN nodes that have a close connection with nodes in QC1 and QC2 are also involved in autoimmune diseases. There are 7 and 33 SNP cluster nodes in SAN, respectively, that have direct connections with over one-half of the nodes in QC1 or QC2 (the SNP cluster nodes that are already in the autoimmune-disease-related subnetwork are excluded). For those seven SNP cluster nodes connected with QC1, there exist 12 genes of which 10 have been proven to be correlated to autoimmune diseases (P = 1.80 × 10−11, binomial test). For example, STAT3 has been found to be essential for the differentiation of TH17 helper T cells in a variety of autoimmune diseases, 44 while, of those 33 SNP cluster nodes connected with QC2, 17 of 33 genes are proven to have a relationship with autoimmune diseases (P = 2.56 × 10−13, binomial test), such as CTSL1 and HLA-DQA1.45,46
Prediction of Novel Disease-Related SNPs Based on the SNP Association Network
Guilt by association (GBA) is a proven approach for identifying novel disease genes based on the simple idea that genes that are associated with or interacting in a GIN are more likely to be associated with similar traits.47,48 Similar to that of GIN, the genomic locus in the SAN, which has dense connections with the genomic loci that are proven to be related to certain diseases, is probably associated with this disease too. Therefore, we can explore known data of disease-related SNPs and the SAN topological structure to predict novel disease-related SNPs, with no need for introducing a new GWAS dataset. Based on the RWR strategy, 24 we developed a prediction algorithm by using the known disease-related genomic loci as seeds to predict new disease-related SNP cluster candidates. RWR is a ranking algorithm that simulates a random walker of proceeding coequally from each known disease-related seed node and then moving forward randomly to the immediate neighbors at each step. Meanwhile, the random walker can return at a probability “r” to the original seed nodes at each step. Thus, after several rounds of steps, the random walking will reach a steady state. All the nodes in the graph are then ranked by the probability of the random walker reaching the destination, which will evaluate the closeness between these nodes and the known disease-related seed nodes.
We tested our method (RWR-SAN) on known lung-cancer-related SNPs collected from the GWAS Catalog and the Lung Cancer Database.23,49 For comparison, we also implemented a similar RWR procedure on the HPRD network (RWR-HPRD). In the SAN, the known lung-cancer-related SNPs were mapped into the corresponding SNP clusters, which are marked as disease-related nodes. In the HPRD network, these known lung-cancer-related SNPs were mapped into the nearest genes in the genome and also marked as disease-related nodes. We then used leave-one-out cross-validation to examine how well these algorithms recover the disease-related nodes. In each round of cross-validation, we selected one of the known disease-related nodes and used the rest of them as seed nodes. The held-out node and other 99 randomly picked nodes were ranked by the RWR algorithm. Here, we used the receiver operating characteristic (ROC) analysis to compare the two algorithms. 50 Sensitivity is the frequency of a disease-related node that was ranked above a particular threshold. Specificity is the frequency of a non-disease-related node ranked below this threshold. In order to compare different curves obtained by ROC analysis, we calculated the area under the ROC curve (AUC) for each case. As shown in Figure 5, the AUC value of RWR-SAN is much higher than that of RWR-HPRD (0.81 vs 0.66), which indicates that the prediction capability of RWR-SAN is much better than that of RWR-HPRD.

ROC curves of RWR-SAN and RWR-HPRD in lung cancer data.
We further applied RWR-SAN to predict novel lung-cancer-related SNP clusters. All known lung-cancer-related nodes are treated as seed nodes to run RWR-SAN. For the top10 predicted SNP clusters (Table 3), four genomic loci had been proven to contain genes related to lung cancer and the other six loci also have reported evidences related to lung cancer. For instance, the gene FHL2 on SC485751 is a hub gene in the HPRD that has interactions with other 39 genes. Among them, 18 have been related to lung cancer. Another example is the tumor suppressor gene VBP1 on SC1449, which has direct protein-protein interaction with VHL, another known lung-cancer gene. 52 A more interesting example is the gene SLC6A4 on SC7161, which is involved in primary pulmonary hypertension (PPH).53,54 Recent studies have shown that the genesis and progression of PPH is likely consistent with the model of tumorigenesis.55,56
The top10 prediction of lung-cancer-related SNP clusters.
SAN-Assisted Reanalysis of an Age-Related Macular Degeneration GWAS Dataset
The topological information in SAN can be used as an external information source to assist GWAS data analysis. Borrowing from the Google's PageRank algorithm, we can reanalyze the GWAS dataset by integrating the typical GWAS data analysis method with the topological information in the SAN. We tested the performance of our SAN-assisted reanalysis on an AMD GWAS dataset. 25 Here, we adopted the iterative ranking method (details in Method section), in which a SNP cluster's score is calculated from an initial score (which is from typical GWAS analysis) and the normalized scores of its neighbors (which are iteratively updated). 57 According to our reanalysis, each SNP cluster receives a revised score with contributions from both direct evidence from the typical GWAS analysis and indirect evidence from the neighbors in the SAN. Then, we can rerank the SNP clusters based on their revised scores; the higher the rank of the SNP cluster, the closer its correlation with AMD.
In the GWAS analysis of the AMD dataset, Klein et al. found only one significant SNP, rs380390. 25 In our SAN-assisted reanalysis, SNP cluster SC7581 corresponds to SNP rs380390 and is still on the top of the list. Compared with the ranking by using the initial scores from GWAS analysis, the ranks of some SNP clusters get a significant boost after integrating the topological information of SAN (Table 4). For instance, there are two SNP clusters, SC9345 and SC962, whose ranks go up dramatically, with a jump from 541 in the original order to 2 in the reanalysis order for SC9345, and from 244 to 6 for SC962. AMD usually affects older adults and results in a loss of vision in the center of the macula because of damage to the retina. 58 The genomic region of SC9345 contains two genes, bHLHE41 and SSPN. bHLHE41 is the member of basic helix-loop-helix (bHLH) transcription factor family, which makes important contributions to the control of the proliferation and development during differentiation, particularly in neurons.59–61 Studies employed in diverse experimental systems from various species have shown that bHLH genes play decisive roles in the generation of the diverse cell types during the development of the retina.62–64 The gene on genomic locus of SC962 is CDH18, which belongs to CDH gene family, a family of calcium-dependent cell-cell adhesion molecules.65,66 CDH genes mediate neural cell-cell interactions and may play important roles in neural development. For example, CDH3, a member of CDH family, had been proven to be associated with ectodermal dysplasia, ectrodactyly, and macular dystrophy (EEM syndrome). 67 Another member of CDH family, CDH8, has been also found related to retinal survival/protection. 68 More interestingly, in our SAN-assisted reanalysis, the rank of SNP cluster SC688, which contains the gene CDH8, is also boosted greatly, from rank 171 to rank 12. These results indicate that the reanalysis of GWAS data with our SAN may identify more potential disease-associated genes.
The reranking of top10 SNP clusters of the AMD GWAS dataset.
Discussion
So far, large-scale GWAS studies have produced massive data; therefore, how to further reanalyze these data has become an important issue. One reanalysis strategy of GWAS data is meta-analysis, which was originally developed for pooling the results from a set of similar clinical trials but is now widely used to combine different types of studies.69–71 Another strategy is to introduce new information into GWAS data analysis to improve the detection power. It is very attractive to combine GWAS data with gene-interaction information, because the latter can provide us some hints on how to measure the association between SNPs. In this work, we established a general framework to integrate different sources of gene-interaction information to measure the association between SNPs. Although we only used the HPRD network as data resource in this work, our method is capable of integrating different types of gene-interaction information. By using gene-interaction data from different sources (such as protein interaction data, gene coexpression data), our SAN network can investigate SNPs’ correlation in different aspects. Systematically integrating SANs constructed from multiple data sources will allow us to obtain better effect on SAN-based prediction.
Over the last decade, GWAS have revealed a large number of disease- or trait-predisposing SNPs, but most of them are located within noncoding regions. 72 Besides being the regulatory regions in a coding gene (such as enhancer), these SNPs are likely associated with some functional noncoding RNAs. For instance, there are two coronary-artery-diseases-related long noncoding RNAs, myocardial-infarction-associated transcript (MIAT), and antisense noncoding RNA in the INK4 locus (ANRIL) found in GWAS. 73 Recently, a database, named lncRNASNP, also collected such lncRNA-related SNPs, and found that 142 human lncRNA-related SNPs are GWAS-tag SNPs and 197,827 lncRNA-related SNPs are in the GWAS LD regions. 74 In our SAN, we studied only the coding region in the genome. But, if we exploit the coding-noncoding gene interaction/coexpression network into our SAN, 75 it can be further extended to SNPs-tagged noncoding region and be used to annotate lncRNA-related SNP's function.
The studies of SAN can not only perform auxiliary GWAS analysis but also offer biologically meaningful information by itself. In the known studies on GINs, network topology provided important information for function study, and a lot of tools mining functional module were applied greatly to accelerate protein function prediction.76−78 As to how to apply our SAN network structure, here we made a preliminary attempt, including analysis of autoimmune-disease-related quasi-cliques and the RWR method in SAN. Instead of inspecting the possible distinctions between SAN and known gene-interaction networks, we directly used algorithms developed in GIN study. It is believed that by combining the numerous disease-related SNPs in GWASs with in-depth studies of specific characteristics of the SAN network structure, our SAN study can further assist in the prediction of potential disease-related chromosome regions and allow us to find the possible interactions between different diseases.
Methods
SNP Association Score
As shown in Formula 1, the SAS between each pair of SNP
i
and SNP
j
is calculated based on the connectivity among genes inside of the loci. Gi/Gj represents a gene set in chromosomal region of SNPi/SNPj, respectively. A GIN is any interaction/association network between genes. In this work, we use HPRD network.
29
DGIN is a scoring function of gene association based on GIN; here we use the diffusion kernel matrix of HPRD network.
28
Diffusion Kernel on Graph
As shown in Formula 2, diffusion kernel of a graph G is a matrix exponential, where kij measures the similarity between nodes vi and vj. 28 The matrix L is the Laplacian of the graph G, defined as E–D, where E is the adjacency matrix and D is a diagonal matrix containing the nodes’ degrees. The real parameter β controls the magnitude of the diffusion, and its optimal value is data-dependent.
Diffusion kernel on graph is a global measure of similarity since it is calculated using the global connectivity information (ie, adjacency and degree information). In addition, compared with another common measure, namely the shortest path distance similarity that is extremely sensitive to random insertion/deletion of edges, diffusion kernel is more robust to deal with extensive noise in high-throughput datasets.
79
Empirical P-Value of sAs
For each SNP cluster pair (i, j) and its SASi,j, we can compute its corresponding empirical P-value by Formula 3. BKG is the background set of SNP cluster pairs that are generated by randomly picking two SNP clusters that have the same numbers of genes in their neighborhoods, and these genes have the same degrees in the HPRD network. μ(SASBKG) is the mean value of all SASs in the set BKG, and σ(SASBKG) is the standard deviation of all SASs in the set BKG.
Random Walk with Restart
RWR is a ranking algorithm that simulates a random walker who starts on a set of seed nodes and iteratively transits from its current node to a randomly selected immediate neighbor. At each step, the random walker can return to the seed nodes with a certain restart probability. Finally, all the nodes in the graph are ranked by the probability of the random walker reaching this node. 24
RWR can be formally defined as Formula 4. The parameter gamma ∊ (0, 1) is the restart probability (in our application it is set as 0.5). The transition matrix W is the column-normalized adjacency matrix of the graph, and Wij is the transition probability from node i to node j. P0 is the initial probability vector, which was constructed such that equal probabilities were assigned to the seed nodes with the sum of the probabilities equal to 1. Pt is a vector in which the ith element holds the probability of finding the random walker at node i at step t.
After some steps, the probability vector will reach a steady state P∞, which gives a measure of proximity to seed nodes. If P∞(i) > P∞(j), then node i is more proximate to seed nodes than node j. This is obtained by performing the iteration until the difference between Pt and Pt+1 (measured by the L1 norm) fall below 10−10.
SAN-Assisted GWAS re-analysis
SAN-assisted GWAS reanalysis computes a score SC for each SNP cluster C. The higher the score, the closer will be its correlation with diseases or traits. First, by Formula 5 in which Φ−1 is the inverse cumulative distribution function (CDF) of normal distribution, all SNPs’ P-values from the original GWAS study will be transferred to z-scores; that is, smaller P-values correspond to larger z-scores.
80
Second, each SNP-cluster's score SC will be initialized as OC, which is the maximum z-score of all SNPs covered. Then, each SNP-cluster's score SC will be iteratively updated by adding the average score of its immediate neighbors according to Formula 6, where NB(C) is the set of immediate neighboring nodes of the SNP cluster C.
57
The parameter (1 – gamma)/gamma weights the network's contribution to the reanalysis score. Previous work
57
has proved that this iterative ranking method can converge to a unique solution very fast and is not sensitive to the range of (1 – gamma)/gamma.5,50 Here, we set it as 5 in our application.
Data Sources
Human SNP dataset: UCSC Genome Browser (genome.ucsc.edu, SNP132_common). HPRD network: Human Protein Reference Database (www.hprd.org, date 2011–4).
Disease-related SNPs: NIH GWAS catalog (www.genome.gov/gwastudies, date 2011–6).
HapMap genotype dataset: HapMap (hapmap.ncbi.nlm.nih.gov, date 2011–11).
HiC dataset: Hi-C Data Browser (hic.umassmed.edu).
Lung cancer database: HLungDB (www.megabionet.org/bio/hlung).
Coronary heart disease database: CADgene (www.bioguo.org/CADgene).
Prostate cancer database: DDPC (www.cbrc.kaust.edu.sa/ddpc).
Schizophrenia database: SZGR (bioinfo.vipbg.vcu.edu: 8080/SZGR).
Author Contributions
Conceived and designed the experiments: CL, ZX. Analyzed the data: CL. Wrote the first draft of the manuscript: CL. Agree with manuscript results and conclusions: CL, ZX. Jointly developed the structure and arguments for the paper: CL, ZX. Made critical revisions and approved final version: CL, ZX. Both authors reviewed and approved of the final manuscript.
