The Information Filtering of Gene Network for Chronic Diseases: Social Network Perspective

Abstract

Web and mobile platforms have created an environment for technical cooperation through technological development and the diffusion of related devices. Large-scale data sets have become available for analyzing web interaction, and such data reveal new patterns and insights in several research fields. In healthcare, most chronic diseases are caused by environmental and genetic factors (Van der Laan et al., 2003). The relationship between environmental exposure and genetic factors is crucial to disease etiology (Swift et al., 2004). For example, tobacco is considered one of the biggest environmental factors, responsible for many diseases each year. Schwartz and Collins (2007) discussed the importance of the correlation between genetic and environmental factors in human diseases. Thomas (2010) reviewed different approaches to gene-environment association studies that attempt to explain some of the most complex diseases. Although previous studies have examined chronic diseases and their causes one by one, they do not show the integrated relationships between various diseases and their related human genes. Therefore, this study investigates the gene-disease relationships affected by tobacco and seeks new association links using social network analysis and other mining techniques.

1. Introduction

Web and mobile platforms have created an environment for technical cooperation through technological development and the diffusion of related devices. With text mining, large-scale data sets have become available for analyzing web interaction, and such data reveal new patterns and insights in several research fields. In healthcare, most chronic diseases are caused by environmental and genetic factors [1]. Investigating the causal relationships between environmental factors and human genes is important for understanding disease etiology [2, 3]. For example, tobacco is considered one of the biggest environmental factors, responsible for many diseases each year. In 2007, Schwartz and Collins discussed, in Science, the importance of the correlation between genetic and environmental factors in human diseases. In 2010, a review covered different approaches to gene-environment association studies that attempt to explain some of the most complex diseases [4]. Although previous studies have examined chronic diseases and their causes one by one, they do not focus on the integrated relationships between various diseases and their related human genes. Therefore, this study analyzes the gene-disease relationships for tobacco and seeks new association links using social network analysis and mining techniques.

2. Related Works

2.1. Chronic Diseases and Data Mining

Previous studies have investigated chronic diseases with data mining [5]. To find ways of decreasing the risk of chronic diseases, many works used data mining techniques such as case-based reasoning and machine learning [5, 6]. According to these studies, data mining can help improve diagnosis systems and the treatment of chronic diseases [6, 7], and it helps professionals improve the treatment of patients with chronic diseases. Previous studies have also investigated the effectiveness of data mining for improving clinical data repositories and diagnosis systems [6, 8]. Although data mining techniques are powerful for treatment and diagnosis systems, to our knowledge no study has addressed the diagnosis of chronic diseases with social network analysis. This study therefore applies social network analysis together with collaborative filtering to find causal relationships and correlations between genes and chronic diseases.

2.2. Social Network Analysis

Social network analysis (SNA) is the analysis of social relationships based on network theory. A social network consists of nodes (i.e., the individual entities within the relationships) and ties, which represent relationships between the nodes, such as human relationships. The social network is a simple and powerful concept in that it can reveal the types of interaction or connection among users or entities [9]. Such connections arise in various social settings, with both familiar and chance acquaintances. In general, a social network offers numerous advantages by reinforcing the connections between nodes or within the network itself. For these reasons, the social network concept has been applied in many fields, especially in information systems and data analytics [10]. For instance, social websites such as Facebook, LinkedIn, and Twitter strive to provide specific and diverse social services by determining the relationships among their users.

In the health care area, social network analysis helps researchers understand the relationships between each disease and human genes based on the overall structure of the disease-gene network [11, 12]. This is important because it helps us understand the network structure of diseases and their associated human genes and identify the characteristics of the most influential human genes in the early phase of a disease. The network structure makes a large amount of information on disease-gene relationships accessible [13] and can also improve medical treatment and disease prevention by revealing the primary causes of diseases. In particular, social network analysis of disease-gene networks could improve the understanding of patterns shared across diseases and the performance of their prevention [14].

2.3. Collaborative Filtering

Collaborative filtering (CF) is called a social information filtering technique because it uses existing relationships to decide whether an item is likely to be linked to certain other items [14–17]. CF is a prediction algorithm used primarily in recommender systems [18, 19]. We used collaborative filtering because this approach can identify genes related to particular diseases in the collected data.

Collaborative filtering recommends items to a user based on the similarity between that user's preferences and the preference or taste information of many other users [20–22]. Given these characteristics, this study applies a CF algorithm to the human gene network. The SVD (singular value decomposition) algorithm for collaborative filtering is a matrix factorization (MF) model for solving the collaborative filtering problem [23, 24]. SVD maps both users and items to a joint latent factor space of dimensionality $k$. The latent space is used to find similar products or services by comparing user and product information (i.e., descriptions and features of products). SVD assumes that only a small number of factors influence preferences and that a user's preference for each item is determined by how each factor relates to the user and to the item. This can be formulated as an MF problem. Namely, in a k-factor model, given the preference matrix $Y \in R^{m \times n}$ (which can be converted straightforwardly to a 2-mode network), SVD finds two matrices, $U \in R^{m \times k}$ and $M \in R^{n \times k}$, such that

(1) $Y \approx U M^{T}$.

To find matrices U and M, SVD solves the following optimization problem using stochastic gradient descent:

(2) $\min_{U, M} (\lambda / 2) (\|U\|^{2} + \|M\|^{2}) + \sum_{(i, j) \in S} (Y_{ij} - U_{i} M_{j}^{T})^{2}$,

where $\lambda > 0$ is the overfitting regularization parameter and $S = \{(i, j) \mid Y_{ij} > 0\}$. More details are in [15, 25]. The factorization is computed iteratively, starting from random initial values for U and M.
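As a rough illustration of problem (2), the following Python sketch fits $U$ and $M$ by stochastic gradient descent on a small made-up matrix. It is not the implementation used in this study: the function names, the learning rate `lr`, the values of `k`, `lam`, and `epochs`, and the toy matrix are all assumptions chosen for readability, and the regularization term is applied once per observed entry, which is a common simplification.

```python
import random
from math import sqrt


def factorize(Y, k=2, lam=0.05, lr=0.01, epochs=1000, seed=0):
    """Fit Y ~ U M^T on the observed entries (Y[i][j] > 0) by SGD.

    All default values are illustrative, not the study's settings.
    """
    rng = random.Random(seed)
    m, n = len(Y), len(Y[0])
    # Random initial values for U (m x k) and M (n x k).
    U = [[rng.uniform(0.1, 0.5) for _ in range(k)] for _ in range(m)]
    M = [[rng.uniform(0.1, 0.5) for _ in range(k)] for _ in range(n)]
    # S is the set of observed index pairs, as in equation (2).
    S = [(i, j) for i in range(m) for j in range(n) if Y[i][j] > 0]
    for _ in range(epochs):
        rng.shuffle(S)
        for i, j in S:
            err = Y[i][j] - sum(U[i][f] * M[j][f] for f in range(k))
            for f in range(k):
                u, v = U[i][f], M[j][f]
                # Gradient step on err^2 + (lam / 2) * (u^2 + v^2).
                U[i][f] += lr * (2 * err * v - lam * u)
                M[j][f] += lr * (2 * err * u - lam * v)
    return U, M


def observed_rmse(Y, U, M):
    """Root mean squared error over the observed entries only."""
    k = len(U[0])
    squared = [
        (Y[i][j] - sum(U[i][f] * M[j][f] for f in range(k))) ** 2
        for i in range(len(Y))
        for j in range(len(Y[0]))
        if Y[i][j] > 0
    ]
    return sqrt(sum(squared) / len(squared))


# Toy disease-by-gene count matrix (made-up numbers); 0 means "not observed".
Y = [[5, 3, 0], [4, 0, 1], [0, 2, 4]]
U, M = factorize(Y)
print(round(observed_rmse(Y, U, M), 3))
```

Entries of $U M^{T}$ at positions where $Y_{ij} = 0$ can then be read as predicted scores for disease-gene pairs that were not observed.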

3. Empirical Methodology and Results

3.1. Data Collection and Methodology

We collected terms of the human gene semantic type together with disease names related to tobacco, based on their cooccurrence in PubMed abstracts. To gather the raw data, we used the term tobacco as a query and collected 82,538 abstracts in XML [25, 26]. Using a Java program, we parsed the abstracts and created a text file in the form PubMed ID/Title/Journal/abstract/year. In the next step, we extracted the bio entities from the text file, built a MySQL database in the form Extracted Term/UMLS Semantic Type (ST)/CUI (the UMLS unique identifier code), and preprocessed the list of diseases. To unify the terms, we looked up each CUI in the UMLS and replaced the extracted term with its preferred term. We then matched the preferred terms to Gene Ontology, counted each case in which a disease and a gene cooccurred (disease-gene: count pair), and built an undirected network.
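The counting step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the Java and MySQL pipeline used in the study: the function name `cooccurrence_edges`, the placeholder gene labels, and the toy abstracts are hypothetical, and each abstract is assumed to be already reduced to a set of preferred disease terms and a set of preferred gene terms.

```python
from collections import Counter
from itertools import product


def cooccurrence_edges(abstract_terms):
    """Count disease-gene cooccurrences across abstracts.

    abstract_terms: iterable of (diseases, genes) pairs, one per abstract,
    where each element is a collection of preferred terms.
    Returns a Counter mapping (disease, gene) -> number of abstracts.
    """
    counts = Counter()
    for diseases, genes in abstract_terms:
        # Sets ensure a pair is counted at most once per abstract.
        for pair in product(sorted(set(diseases)), sorted(set(genes))):
            counts[pair] += 1
    return counts


# Hypothetical input: three abstracts with already-normalized terms.
toy_abstracts = [
    ({"Asthma"}, {"GENE_A", "GENE_B"}),
    ({"Asthma", "Pneumonia"}, {"GENE_B"}),
    ({"Diabetes"}, set()),  # no gene mentioned, so no edge is created
]
edges = cooccurrence_edges(toy_abstracts)
for (disease, gene), weight in sorted(edges.items()):
    print(disease, gene, weight)
```

Each key of the returned counter corresponds to one weighted, undirected disease-gene edge.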

Following this procedure, the resulting network contained 479 disease nodes, 869 gene nodes, and 2195 edges. After obtaining the weighted undirected network, we simplified it. To apply social network analysis, we used Pearson's correlation to convert the heterogeneous (2-mode) disease-gene network into a homogeneous (1-mode) disease network: the correlation was computed between the gene profiles of each pair of diseases. The larger the correlation score, the higher the similarity between two diseases. For the evaluation, this study used SNA measures (degree, closeness, betweenness, and PageRank centrality), collaborative filtering, and PAM clustering.
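A minimal sketch of the 2-mode to 1-mode conversion is given below, assuming each disease is represented by its vector of cooccurrence counts over a fixed list of genes. The function names and the toy count vectors are hypothetical, and how the study handled constant rows or thresholded the correlations is not stated, so those choices here are assumptions.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    if sd_x == 0 or sd_y == 0:
        return 0.0  # assumption: treat constant profiles as uncorrelated
    return cov / (sd_x * sd_y)


def project_diseases(profiles):
    """Turn disease -> gene-count vectors into disease-disease similarities."""
    names = sorted(profiles)
    return {
        (a, b): pearson(profiles[a], profiles[b])
        for idx, a in enumerate(names)
        for b in names[idx + 1:]
    }


# Made-up cooccurrence counts over three genes.
profiles = {
    "Asthma": [1, 2, 3],
    "Lung diseases": [2, 4, 6],
    "Diabetes": [3, 2, 1],
}
similarity = project_diseases(profiles)
```

In this toy input, the first two diseases have proportional gene profiles and therefore the highest possible similarity, while the third has a reversed profile.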

3.2. Results

Social network theory explains that central positions provide greater access to and control over information. The centrality of a node in a gene network determines its relative importance, or social power, in the network. Usually, nodes of greater centrality are located in central positions of the network visualization. In the disease-gene network, a “clique” is a group in which every disease is directly tied to every gene. “Component” and “clique” are major measures of the cohesion of a network. However, because this study examines disease-gene relationships, we used centrality measures rather than cohesion measures, as reported in Table 1 and Figure 1. Among those measures, we first tested the degree of the network. The degree of a node is the number of links that connect it directly to other nodes.

Table 1

Degree centrality.

	Degree (number of links)	Degree (sum of weight)
Degree centrality	4390	17612
Mean	3.257	13.065
MIN.	1	1
MAX.	133	804

Figure 1

Degree centrality.
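The degree figures in Table 1 can be reproduced in principle with a computation like the one sketched below, which derives both the number of links and the sum of weights per node from a weighted edge list. The function name `degree_stats` and the toy edges are assumptions for illustration only.

```python
from collections import defaultdict


def degree_stats(edges):
    """Summarize degree for an undirected weighted network.

    edges: iterable of (node_u, node_v, weight).
    Nodes without any edge are not counted, which matches a network
    whose minimum degree is 1.
    """
    links = defaultdict(int)      # number of links per node
    strength = defaultdict(int)   # sum of edge weights per node
    for u, v, w in edges:
        for node in (u, v):
            links[node] += 1
            strength[node] += w
    n = len(links)
    return {
        "sum_links": sum(links.values()),
        "mean_links": sum(links.values()) / n,
        "min_links": min(links.values()),
        "max_links": max(links.values()),
        "sum_weight": sum(strength.values()),
        "mean_weight": sum(strength.values()) / n,
        "max_weight": max(strength.values()),
    }


# Toy disease-gene edges with made-up weights.
toy_edges = [("d1", "g1", 3), ("d1", "g2", 1), ("d2", "g2", 5)]
stats = degree_stats(toy_edges)
```

The same arithmetic is consistent with the reported values: every edge contributes to the degree of two nodes, so 2195 edges give a total degree of 4390, and dividing by the 479 + 869 = 1348 nodes gives the mean of 3.257; likewise 17612 / 1348 gives 13.065.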

Closeness centrality describes how central a point is in the network and is measured by the distance between that point and every other point, as shown in Figure 2 and Table 2. The distance between two points is the length of the shortest path connecting them. A point with a low sum of path distances occupies a central position in the network. According to the results, the network closeness centralization index is 25.774%. Node betweenness centrality showed a similar pattern. Betweenness centrality counts the shortest paths between pairs of other nodes that pass through a given node, so it describes how strongly a node connects its neighbors. Thus, betweenness centrality yields a higher score when a node connects clusters of nodes in the entire network. Here the measure reflects the degree to which each disease connects the related human genes. The network node betweenness centralization is 24.485%. PageRank centrality counts the number and quality of connections to a node to estimate how important the node is in the network. The options for this measure were 200 iterations and a damping parameter of 0.85.
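To make two of these measures concrete, the sketch below computes closeness by breadth-first search and PageRank by power iteration with the 200 iterations and damping parameter of 0.85 mentioned above. It is a simplified illustration on an unweighted toy graph, not the software used in the study; the function names, the normalization of closeness, and the handling of nodes without neighbors are assumptions.

```python
from collections import deque


def closeness(adj, source):
    """(reachable nodes - 1) / sum of shortest-path distances from source."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in adj[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total > 0 else 0.0


def pagerank(adj, damping=0.85, iterations=200):
    """Power-iteration PageRank; adj maps each node to its neighbor list."""
    nodes = list(adj)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        new_rank = {v: (1.0 - damping) / n for v in nodes}
        for v in nodes:
            if adj[v]:
                share = damping * rank[v] / len(adj[v])
                for w in adj[v]:
                    new_rank[w] += share
            else:
                # Assumption: a node without neighbors spreads its rank evenly.
                for w in nodes:
                    new_rank[w] += damping * rank[v] / n
        rank = new_rank
    return rank


# Toy undirected star: one hub disease tied to three genes.
star = {"hub": ["a", "b", "c"], "a": ["hub"], "b": ["hub"], "c": ["hub"]}
ranks = pagerank(star)
```

In the star, the hub reaches every other node in one step and receives rank from all three leaves, so it scores highest on both measures.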

Table 2

Centrality measures.

	In-closeness	Out-closeness
Closeness centrality	0.184	0.184
Node betweenness centrality	0.002	0.002
PageRank centrality	0.001	0.001

Figure 2

Closeness for disease-gene network.

As shown in Table 3, we compared our results with those of the CDC (Centers for Disease Control and Prevention). In our data analysis, many genes are related to cerebral palsy and lung diseases, whereas lung and heart diseases rank highest in the CDC report. Thus, the correlation results among genes and diseases provide important implications about which genes increase the risk of particular diseases.

Table 3

Comparison result.

	Our research	CDC reported
(1)	Cerebral palsy	Lung disease
(2)	Lung diseases	Heart disease
(3)	Chronic obstructive airway disease	Chronic airway obstruct
(4)	Deciduous maxillary right second molar tooth	Other heart diseases
(5)	Asthma	Cerebrovascular disease
(6)	Diabetes	Bronchitis
(7)	Ecthyma contagious	Pneumonia
(8)	Pneumonia	Esophagus cancer
(9)	Gardner syndrome	Aortic aneurysm
(10)	Sarcosinemia	Pancreas cancer

Following the SNA results, this study compared the outcomes with actual diseases to evaluate the clustering performance, using PAM (Partitioning Around Medoids) clustering [26, 27]. Clustering refers to the process of dividing a data set into clusters. There are two families of clustering methods: partitional and hierarchical. Partitional clustering determines k clusters that optimize a clustering criterion, typically based on Euclidean distance. Specific partitional methods include k-means and k-medoids.

k-means clustering is a vector quantization method that originated in signal processing and is popular for cluster analysis in data mining. It classifies objects according to a set of user-selected characteristics [28, 29]. This partitions the data space into regions, each consisting of the points closest to one cluster center. The k-medoids method is a clustering algorithm related to k-means and to the medoidshift algorithm [14, 30, 31].

Both algorithms partition the data set into groups and minimize the distance between the points in a cluster and the point designated as the center of that cluster [30]. In contrast to k-means, k-medoids chooses actual data points as centers and can work with an arbitrary matrix of distances between data points [2].

The most common realization of k-medoids clustering is the Partitioning Around Medoids (PAM) algorithm. PAM initializes the clusters by randomly selecting k of the n data points as medoids and associates each data point with the closest medoid; it then swaps medoids with non-medoid points as long as a swap reduces the total distance. In some situations, k-medoids performs better than k-means, as shown in Figure 3, which indicates that k-medoids takes less time asymptotically on large-scale data sets. In this study, we ran k-medoids with the Symmetrize option (method = “MAX”). The number of medoids (clusters) was 25, the maximum number of swaps was 1000, and the proximity was set to similarity. According to the results, the sum of distances to the nearest medoid is 1,301.446, the average distance to the nearest medoid is 2.717, and the maximum distance to the nearest medoid is 5.062.
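The swap idea behind PAM can be sketched as follows. This is a simplified greedy variant written for illustration, assuming a precomputed distance matrix; it is not the tool used in the study and does not reproduce the original BUILD and SWAP phases exactly. The function names, the seed, and the toy points are hypothetical, while `max_swaps` mirrors the maximum number of swaps mentioned above.

```python
import random


def total_cost(dist, medoids):
    """Sum of distances from every point to its nearest medoid."""
    return sum(min(dist[i][m] for m in medoids) for i in range(len(dist)))


def pam(dist, k, max_swaps=1000, seed=0):
    """Greedy k-medoids: accept any medoid/non-medoid swap that lowers cost."""
    rng = random.Random(seed)
    n = len(dist)
    medoids = rng.sample(range(n), k)  # random initial medoids
    cost = total_cost(dist, medoids)
    swaps = 0
    improved = True
    while improved and swaps < max_swaps:
        improved = False
        for pos in range(k):
            for cand in range(n):
                if cand in medoids or swaps >= max_swaps:
                    continue
                trial = medoids[:pos] + [cand] + medoids[pos + 1:]
                trial_cost = total_cost(dist, trial)
                if trial_cost < cost:
                    medoids, cost = trial, trial_cost
                    swaps += 1
                    improved = True
    # Assign each point to its nearest medoid.
    labels = [min(medoids, key=lambda m: dist[i][m]) for i in range(n)]
    return medoids, labels, cost


# Toy data: six points on a line forming two obvious groups.
points = [0, 1, 2, 10, 11, 12]
dist = [[abs(a - b) for b in points] for a in points]
medoids, labels, cost = pam(dist, k=2)
```

Because the medoids are always actual data points and only the distance matrix is consulted, the same code works with any dissimilarity, for example one derived from the disease similarity scores.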

Figure 3

k-medoids clustering.

To test the relationships in the disease-gene network, as in Figure 4, this study used collaborative filtering. The number of features (rank) was 10, and the number of items to recommend was 10. The algorithm stops when |training error at iteration $i$ − training error at iteration $i + 1$| < convergence tolerance. The proportion of the validation set was 10.0%. A validation set is a portion of the data set used to evaluate the predictions of a model that has been fitted on a separate portion of the same data set (the training set), as shown in Table 4 and Figure 5. Both the training and validation sets are randomly selected, and the proportion of the validation set can be adjusted with this option.
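The evaluation protocol described above (random 10% validation split, RMSE, and a stopping rule based on the change in training error) can be sketched as below. The helper names are hypothetical, and the tolerance value in the usage lines is an assumption, since the convergence tolerance used in the study is not reported.

```python
import random
from math import sqrt


def split_entries(entries, validation_fraction=0.1, seed=0):
    """Randomly split observed entries into (training, validation) lists."""
    rng = random.Random(seed)
    shuffled = list(entries)
    rng.shuffle(shuffled)
    n_val = int(round(len(shuffled) * validation_fraction))
    return shuffled[n_val:], shuffled[:n_val]


def rmse(pairs):
    """Root mean squared error over (observed, predicted) pairs."""
    pairs = list(pairs)
    return sqrt(sum((y - p) ** 2 for y, p in pairs) / len(pairs))


def converged(previous_error, current_error, tolerance):
    """Stopping rule: the training error changed by less than the tolerance."""
    return abs(previous_error - current_error) < tolerance


# Usage with made-up entries of the form (disease index, gene index, count).
entries = [(i, j, 1 + (i + j) % 5) for i in range(10) for j in range(10)]
train, validation = split_entries(entries)
stop = converged(3.6763, 3.6762, tolerance=1e-3)  # tolerance is assumed
```

A training RMSE that is clearly lower than the validation RMSE, as in Table 4, is the usual sign that the fitted factors describe the training pairs better than unseen pairs.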

Table 4

Summary of RMSE.

Epoch (number of iterations performed)	Training RMSE	Test RMSE
51	3.67620	5.14195

Figure 4

Classical MDS Algorithm.

Figure 5

The results of RMSE.

4. Conclusion

The purpose of this study is to analyze and understand the relationships in disease-gene networks using various mining methods applied to biomedical research articles. To this end, we collected 82,538 abstracts of research papers from PubMed by querying tobacco. After cleansing the data set, we extracted 479 disease nodes and 869 gene nodes. We also extracted 2195 links under the rule that a disease and a gene appear in the same abstract. Based on this information, we conducted social network analysis, clustering, and collaborative filtering.

Using degree scores weighted by the number of cooccurrences in previous papers, we ranked the diseases and compared the ranking with the CDC report. Using Pearson's coefficient, we also examined the gene and disease network with closeness, betweenness, and PageRank centrality. In addition, we evaluated how each gene was clustered with diseases and recommended new genes related to diseases, beyond the disease-gene relationships already present in the data.

This study focused on the data set and its analysis with SNA and collaborative filtering. Future work should involve domain experts to evaluate the identified human genes in more detail. In addition, ego network analysis of tobacco-related diseases still needs to be carried out to see which gene factors affect the onset of a disease in its first stage.

Footnotes

Disclosure

This work was presented at ISBSS Conference 2014.

Conflict of Interests

The authors declare that there is no conflict of interests regarding the publication of this paper.

Acknowledgments

This research was supported by the Soonchunhyang University Research Fund. This work was supported by the National Research Foundation of Korea Grant funded by the Korean Government (NRF-2014S1A3A2044046).

References

1. van der Laan M. J., Pollard K. S., Bryan. A new partitioning around medoids algorithm. Journal of Statistical Computation and Simulation. 2003;73(8):575–584. doi: 10.1080/0094965031000136012. MR1998670. Scopus: 2-s2.0-30244517052.

2. Sperisen, Pagni. JACOP: a simple and robust method for the automated classification of protein sequences with modular architecture. BMC Bioinformatics. 2005;6(1), article 216. doi: 10.1186/1471-2105-6-216. Scopus: 2-s2.0-25444502317.

3. Swift, Tucker, Vinciotti, Martin, Orengo, Liu, Kellam. Consensus clustering and functional interpretation of gene-expression data. Genome Biology. 2004;5, article R94. doi: 10.1186/gb-2004-5-11-r94. Scopus: 2-s2.0-24944539029.

4. Thomas. Gene-environment-wide association studies: emerging approaches. Nature Reviews Genetics. 2010;11(4):259–272. doi: 10.1038/nrg2764. Scopus: 2-s2.0-77949772292.

5. Huang M.-J., Chen M.-Y., Lee S.-C. Integrating data mining with case-based reasoning for chronic diseases prognosis and diagnosis. Expert Systems with Applications. 2007;32(3):856–867. doi: 10.1016/j.eswa.2006.01.038. Scopus: 2-s2.0-33750992805.

6. Lavrač. Selected techniques for data mining in medicine. Artificial Intelligence in Medicine. 1999;16(1):3–23. doi: 10.1016/s0933-3657(98)00062-1. Scopus: 2-s2.0-0032895111.

7. Huang Q. R., Qin, Zhang, Chow C. M. Clinical patterns of obstructive sleep apnea and its comorbid conditions: a data mining approach. Journal of Clinical Sleep Medicine. 2008;4(6):543–550. Scopus: 2-s2.0-58149327092.

8. Kaur, Wasan S. K. Empirical study on applications of data mining techniques in healthcare. Journal of Computer Science. 2006;2(2):194–200. doi: 10.3844/jcssp.2006.194.200.

9. Rho, Kim, Park. Information mediating in social network sites: a simulation study. The Journal of Society for e-Business Studies. 2013;18(1):33–55. doi: 10.7838/jsebs.2013.18.1.033.

10. Yang C. C., T. D. Terrorism and crime related weblog social network: link, content analysis and information visualization. Proceedings of the IEEE Intelligence and Security Informatics (ISI '07); May 2007; New Brunswick, Canada. IEEE; pp. 55–58. Scopus: 2-s2.0-34748814355.

11. Patel C. J., Butte A. J. Predicting environmental chemical factors associated with disease-related gene expression data. BMC Medical Genomics. 2010;3, article 17. doi: 10.1186/1755-8794-3-17. Scopus: 2-s2.0-77951840730.

12. Schwartz, Collins. Medicine: environmental biology and human disease. Science. 2007;316(5825):695–696. doi: 10.1126/science.1141331. Scopus: 2-s2.0-34249722803.

13. Junker B. H., Koschützki, Schreiber. Exploration of biological network centralities with CentiBiN. BMC Bioinformatics. 2006;7, article 219. doi: 10.1186/1471-2105-7-219. Scopus: 2-s2.0-33746674977.

14. Pan, Zhu, Han. Genetic algorithms applied to multi-class clustering for gene expression data. Genomics Proteomics Bioinformatics. 2003;1(4):279–287. Scopus: 2-s2.0-13844255692.

15. Adomavicius, Tuzhilin. Toward the next generation of recommender systems: a survey of the state-of-the-art and possible extensions. IEEE Transactions on Knowledge and Data Engineering. 2005;17(6):734–749. doi: 10.1109/tkde.2005.99. Scopus: 2-s2.0-20844435854.

16. Choi, Lee H. J., Kim Y. C. The influence of social presence on customer intention to reuse online recommender systems: the roles of personalization and product type. International Journal of Electronic Commerce. 2011;16(1):129–153. doi: 10.2753/jec1086-4415160105. Scopus: 2-s2.0-81255211548.

17. Koren, Bell. Advances in collaborative filtering. Recommender Systems Handbook. New York, NY, USA: Springer; 2011. pp. 145–186. doi: 10.1007/978-0-387-85820-3_5.

18. Illhoi. Cluster ensemble and its applications in gene expression analysis. Proceedings of the 2nd Conference on Asia-Pacific Bioinformatics (APBC '04); January 2004; Dunedin, New Zealand. Australian Computer Society.

19. Kim H. K., Jang M. K., Kim J. K., Cho Y. H. A new item recommendation procedure using preference boundary. Asia Pacific Journal of Information Systems. 2010;20(1):81–99.

20. Herlocker J. L., Konstan J. A., Terveen L. G., Riedl J. T. Evaluating collaborative filtering recommender systems. ACM Transactions on Information Systems. 2004;22(1):5–53. doi: 10.1145/963770.963772. Scopus: 2-s2.0-3042697346.

21. Bobadilla, Ortega, Hernando, Bernal. A collaborative filtering approach to mitigate the new user cold start problem. Knowledge-Based Systems. 2012;26:225–238. doi: 10.1016/j.knosys.2011.07.021. Scopus: 2-s2.0-84155181004.

22. Paterek. Improving regularized singular value decomposition for collaborative filtering. Proceedings of the KDD Cup and Workshop; 2007; pp. 5–8.

23. Kim K.-J., Ahn. Customer level classification model using ordinal multiclass support vector machines. Asia Pacific Journal of Information Systems. 2010;20(2):23–37.

24. Ahn, Kim K.-J. Corporate bond rating using various multiclass support vector machines. Asia Pacific Journal of Information Systems. 2009;19(2):157–178.

25. Dobra, Hans, Jones, Nevins J. R., Yao, West. Sparse graphical models for exploring gene expression data. Journal of Multivariate Analysis. 2004;90(1):196–212. doi: 10.1016/j.jmva.2004.02.009. MR2064941. Scopus: 2-s2.0-15944399178.

26. Andreopoulos, Wang, Schroeder. A roadmap of clustering algorithms: finding a match for a biomedical application. Briefings in Bioinformatics. 2009;10(3):297–314. doi: 10.1093/bib/bbn058. Scopus: 2-s2.0-65549104397.

27. Chao Y. P., Cho K. H., Yeh C. H., Tsao S. P., Chen D. Y., Chen J. H., Lin C. P. Brain segmentation using ATP (automatic twice PAM) in multi diffusion indices. Proceedings of the 14th Science Meeting & Exhibition of the International Society for Magnetic Resonance in Medicine; 2006; Seattle, Wash, USA. p. 2733.

28. Chun. Interrelated two-way clustering: an unsupervised approach for gene expression data analysis. Proceedings of the IEEE 2nd International Symposium on Bioinformatics and Bioengineering Conference; 2001; pp. 41–48. doi: 10.1109/BIBE.2001.974410.

29. Park H.-S., Jun C.-H. A simple and fast algorithm for K-medoids clustering. Expert Systems with Applications. 2009;36(2):3336–3341. doi: 10.1016/j.eswa.2008.01.039. Scopus: 2-s2.0-56349158295.

30. Kaufman, Rousseeuw P. J., Dodge. Clustering by means of medoids. Statistical Data Analysis Based on the L1-Norm and Related Methods. North-Holland; 1987. pp. 405–416.

31. Langfelder, Zhang, Horvath. Defining clusters from a hierarchical cluster tree: the dynamic tree cut package for R. Bioinformatics. 2008;24(5):719–720. doi: 10.1093/bioinformatics/btm563. Scopus: 2-s2.0-40049099114.