Effect of Protein Repetitiveness on Protein–Protein Interaction Prediction Results Using Support Vector Machines

Abstract

Background: Many computational approaches that use support vector machines (SVMs) to predict protein–protein interactions report high performance. In fact, the performance of currently reported methods is significantly overestimated and is affected by object repetitiveness in the datasets used.

Objective: To study the effect of object repetitiveness in datasets on prediction results.

Method: We present novel methods that use graph maximum matching on protein–protein interaction datasets to construct positive datasets with or without repeated proteins. A corresponding series of negative datasets with different degrees of protein repetitiveness is constructed using the graph adjacency matrix. We then analyze the relationship between the SVM prediction results and the repeated proteins (repeat numbers and repeat rates), as well as the distributions of repeated proteins in the datasets.
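As a rough illustration of the dataset construction described in the Method, the sketch below selects interaction pairs in which no protein appears twice and lists candidate negative pairs from the zero entries of the adjacency matrix. It is a minimal sketch under stated assumptions, not the article's procedure: it uses a greedy maximal matching as a simpler stand-in for a true maximum matching, and the function names, the repeat count, and the toy interaction list are all hypothetical.

```python
from collections import Counter
from itertools import combinations


def greedy_matching(edges):
    """Return a maximal set of pairs in which no protein appears twice.

    Assumption: a greedy maximal matching, used here as a simple stand-in
    for a true maximum matching (which needs e.g. the blossom algorithm).
    """
    used = set()
    matched = []
    for a, b in edges:
        if a != b and a not in used and b not in used:
            matched.append((a, b))
            used.update((a, b))
    return matched


def non_interacting_pairs(edges):
    """List protein pairs whose adjacency-matrix entry is zero."""
    proteins = sorted({p for edge in edges for p in edge})
    index = {p: i for i, p in enumerate(proteins)}
    n = len(proteins)
    adj = [[0] * n for _ in range(n)]
    for a, b in edges:
        adj[index[a]][index[b]] = 1
        adj[index[b]][index[a]] = 1
    return [
        (proteins[i], proteins[j])
        for i, j in combinations(range(n), 2)
        if adj[i][j] == 0
    ]


def repeat_counts(pairs):
    """Map each protein occurring in more than one pair to its count."""
    counts = Counter(p for pair in pairs for p in pair)
    return {p: c for p, c in counts.items() if c > 1}


# Hypothetical toy interaction list (not data from the article).
ppi = [("A", "B"), ("A", "C"), ("B", "C"), ("C", "D"), ("E", "F")]
positives = greedy_matching(ppi)       # pairs without repeated proteins
negatives = non_interacting_pairs(ppi)  # candidate negative pairs
```

Sampling different subsets of `negatives` and checking them with `repeat_counts` would give negative sets with different degrees of repetitiveness; how the article controls this is not specified in the abstract.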

Results: Protein repetitiveness in the positive and negative datasets can affect the prediction results: high protein repetitiveness in either the positive or the negative dataset yields high prediction performance.

Conclusion: This indicates that dealing with object repetitiveness in datasets is a key issue in protein–protein interaction prediction using SVMs, since real-world data contain a certain degree of repeated proteins.


