Abstract

One of the most powerful techniques for studying proteins is to look for recurrent fragments (also called substructures) and then use them as patterns to characterize the proteins under study. Although protein sequences have been extensively studied in the literature, studying protein three-dimensional (3D) structures can reveal relevant structural and functional information that may not be derivable from sequences alone. An emerging trend consists of parsing protein 3D structures into graphs of amino acids. The search for recurrent substructures is then formulated as frequent subgraph discovery, where each subgraph represents a 3D motif. Several efficient approaches for frequent 3D motif discovery have been proposed in the literature. However, the set of discovered 3D motifs is too large to be efficiently analyzed and explored in any further process. In this article, we propose a novel pattern selection approach that shrinks the large number of frequent 3D motifs by selecting a subset of representative ones. Existing pattern selection approaches do not exploit domain knowledge. In contrast, our approach incorporates the evolutionary information of amino acids defined in substitution matrices in order to select the representative 3D motifs. We show the effectiveness of our approach on a number of real datasets. Our experimental results show that considering substitutions between amino acids allows our approach to detect many similarities between patterns that are ignored by current subgraph selection approaches, and that it considerably decreases the number of 3D motifs while enhancing their interestingness.
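To make the idea of substitution-aware motif selection concrete, the following is a minimal sketch and not the algorithm proposed in the article. It assumes that every motif has the same topology and is given as a tuple of residue labels in a fixed node order, that two motifs count as similar when every aligned residue pair is identical or has a positive substitution score, and that representatives are chosen greedily by decreasing frequency. The score table `TOY_SCORES` and the functions `substitutable`, `similar`, and `select_representatives` are hypothetical names introduced only for this illustration.

```python
from typing import Dict, FrozenSet, List, Tuple

# Hand-entered illustrative scores for a few residue pairs (assumption).
# A real analysis would load a full published matrix such as BLOSUM62.
TOY_SCORES: Dict[FrozenSet[str], int] = {
    frozenset(("I", "L")): 2,
    frozenset(("I", "V")): 3,
    frozenset(("L", "V")): 1,
    frozenset(("D", "E")): 2,
    frozenset(("K", "R")): 2,
    frozenset(("D", "K")): -1,
}

# A motif is reduced to its residue labels in a fixed node order (assumption:
# all motifs compared here share the same edges, so only labels differ).
Motif = Tuple[str, ...]


def substitutable(a: str, b: str) -> bool:
    """True if the residues are identical or their toy score is positive."""
    if a == b:
        return True
    # Pairs missing from the toy table are treated as non-substitutable.
    return TOY_SCORES.get(frozenset((a, b)), -1) > 0


def similar(m1: Motif, m2: Motif) -> bool:
    """True if the motifs have equal size and every aligned pair is substitutable."""
    return len(m1) == len(m2) and all(substitutable(a, b) for a, b in zip(m1, m2))


def select_representatives(motifs: Dict[Motif, int]) -> List[Motif]:
    """Greedily keep a motif only if no more frequent kept motif is similar to it."""
    representatives: List[Motif] = []
    # Most frequent first; ties broken by label order for determinism.
    for motif in sorted(motifs, key=lambda m: (-motifs[m], m)):
        if not any(similar(motif, rep) for rep in representatives):
            representatives.append(motif)
    return representatives


if __name__ == "__main__":
    # Hypothetical motif frequencies, for illustration only.
    frequent = {
        ("I", "D", "K"): 10,
        ("L", "E", "R"): 7,
        ("V", "D", "K"): 5,
        ("I", "K", "K"): 4,
    }
    print(select_representatives(frequent))
```

In this toy run the motifs `("L", "E", "R")` and `("V", "D", "K")` are absorbed by the more frequent `("I", "D", "K")`, while `("I", "K", "K")` is kept because the D/K pair has a negative score. A selection that compared labels by exact match only would keep all four motifs, which is the kind of redundancy the abstract says substitution information helps remove.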



Smoothing 3D Protein Structure Motifs Through Graph Mining and Amino Acid Similarities
