
Abstract

The proper integration of multiple data sources and the unbalance between annotated and unannotated proteins are two of the main issues in the automated function prediction (AFP) problem. Most supervised and semisupervised learning algorithms for AFP proposed in the literature do not consider these two issues jointly. This hurts both sensitivity and precision, because the unbalance between annotated and unannotated proteins characterizes the majority of functional classes and because each available data source carries specific, complementary information. We propose UNIPred (unbalance-aware network integration and prediction of protein functions), an algorithm that combines different biomolecular networks and predicts protein functions using parametric semisupervised neural models. The algorithm explicitly takes the unbalance between unannotated and annotated proteins into account, both when constructing the integrated network and when predicting protein annotations for each functional class. Full-genome and ontology-wide experiments with three eukaryotic model organisms show that the proposed method compares favorably with state-of-the-art learning algorithms for AFP.
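To make the idea of unbalance-aware network integration concrete, here is a minimal sketch. It is not the UNIPred procedure, which the abstract does not specify. It assumes, purely for illustration, that each network is weighted by the F-measure of a simple leave-one-out neighbor vote on one functional class, and that the integrated network is the weighted sum of the inputs. All names (`f_score`, `neighbor_scores`, `integrate`) and the toy data are hypothetical.

```python
from typing import List, Sequence, Tuple

Matrix = List[List[float]]


def f_score(labels: Sequence[int], scores: Sequence[float],
            threshold: float = 0.5) -> float:
    """F-measure of thresholded scores.

    Unlike accuracy, it is not inflated by the many negative
    (unannotated) proteins of a rare functional class.
    """
    tp = fp = fn = 0
    for y, s in zip(labels, scores):
        predicted = s >= threshold
        if predicted and y == 1:
            tp += 1
        elif predicted and y == 0:
            fp += 1
        elif not predicted and y == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def neighbor_scores(net: Matrix, labels: Sequence[int]) -> List[float]:
    """Leave-one-out score per node: weighted fraction of positive neighbors."""
    n = len(labels)
    out = []
    for i in range(n):
        total = sum(net[i][j] for j in range(n) if j != i)
        pos = sum(net[i][j] for j in range(n) if j != i and labels[j] == 1)
        out.append(pos / total if total else 0.0)
    return out


def integrate(networks: Sequence[Matrix],
              labels: Sequence[int]) -> Tuple[Matrix, List[float]]:
    """Weighted sum of networks; weights are normalized per-network F-measures.

    ASSUMPTION: this weighting rule is illustrative only and is not
    taken from the paper.
    """
    raw = [f_score(labels, neighbor_scores(net, labels)) for net in networks]
    total = sum(raw)
    if total:
        weights = [w / total for w in raw]
    else:
        # No network is informative: fall back to uniform weights.
        weights = [1.0 / len(networks)] * len(networks)
    n = len(labels)
    combined = [[sum(w * net[i][j] for w, net in zip(weights, networks))
                 for j in range(n)] for i in range(n)]
    return combined, weights


# Toy example: 5 proteins, only 2 annotated with the class (unbalanced).
labels = [1, 1, 0, 0, 0]
# Network A links the two positives, and links negatives among themselves.
net_a = [[0, 1, 0, 0, 0],
         [1, 0, 0, 0, 0],
         [0, 0, 0, 1, 0],
         [0, 0, 1, 0, 1],
         [0, 0, 0, 1, 0]]
# Network B links positives only to negatives, so it is misleading here.
net_b = [[0, 0, 1, 0, 0],
         [0, 0, 0, 1, 0],
         [1, 0, 0, 0, 0],
         [0, 1, 0, 0, 0],
         [0, 0, 0, 0, 0]]

combined, weights = integrate([net_a, net_b], labels)
print(weights)
```

In this toy setting the network that connects the two annotated proteins receives all of the weight. A plain accuracy criterion would separate the two networks less sharply, because the three negatives dominate the count.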




UNIPred: Unbalance-Aware Network Integration and Prediction of Protein Functions
