
Abstract

The accelerated growth of protein databases offers great possibilities for studying protein function through sequence similarity and conservation. However, the huge number of sequences deposited in these databases requires new ways of analyzing and organizing the data. The many very similar sequences need to be grouped into clusters with automatically derived annotations that help in understanding their function, evolution, and level of experimental evidence. We developed an algorithm called FastaHerder2, which can cluster any protein database, grouping very similar protein sequences based on near-full-length similarity and/or a high sequence identity threshold. We compressed 50 reference proteomes as well as the SwissProt database, which was reduced by 74.7%. The clustering algorithm was benchmarked using OrthoBench and compared with FASTA HERDER, a previous version of the algorithm; FastaHerder2 clusters a set of proteins with high compression and a lower error rate than its predecessor. We illustrate the use of FastaHerder2 to detect biologically relevant functional features in protein families. With our approach, we seek to promote a modern view and usage of protein sequence databases that is more appropriate to the postgenomic era.
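To make the idea of clustering by near-full-length similarity and high sequence identity concrete, the following is a minimal Python sketch of a generic greedy clustering scheme. It is not the FastaHerder2 implementation: the function names (`similarity`, `greedy_cluster`, `compression`), the 0.9 thresholds, the toy sequences, and the use of `difflib.SequenceMatcher` as a crude stand-in for a real sequence alignment are all assumptions made for illustration. Compression is taken here to mean the fraction of sequences removed when each cluster is reduced to one representative, which is also an assumption.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Return (identity, coverage) for two sequences.

    Assumption: matched characters from difflib stand in for the
    identical positions of a real pairwise alignment.
    """
    blocks = SequenceMatcher(None, a, b, autojunk=False).get_matching_blocks()
    matches = sum(block.size for block in blocks)
    identity = matches / min(len(a), len(b))   # identity over the shorter sequence
    coverage = matches / max(len(a), len(b))   # proxy for near-full-length similarity
    return identity, coverage


def greedy_cluster(seqs, min_identity=0.9, min_coverage=0.9):
    """Greedy clustering: longest sequences are considered first, and each
    sequence joins the first cluster whose representative passes both
    (hypothetical) thresholds, or founds a new cluster otherwise."""
    clusters = []  # list of (representative_id, [member_ids])
    for seq_id in sorted(seqs, key=lambda s: len(seqs[s]), reverse=True):
        for rep_id, members in clusters:
            identity, coverage = similarity(seqs[rep_id], seqs[seq_id])
            if identity >= min_identity and coverage >= min_coverage:
                members.append(seq_id)
                break
        else:
            clusters.append((seq_id, [seq_id]))
    return clusters


def compression(n_sequences, n_clusters):
    """Fraction of sequences removed by keeping one entry per cluster."""
    return 1.0 - n_clusters / n_sequences


# Toy input (invented sequences): p2 differs from p1 by one residue,
# p3 is a short fragment of p1, and p4 is unrelated.
seqs = {
    "p1": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    "p2": "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVA",
    "p3": "MKTAYIAKQRQISFVK",
    "p4": "GSHMLEDPVDAFWQNTRCKY",
}
clusters = greedy_cluster(seqs)
for rep_id, members in clusters:
    print(rep_id, members)
print(f"compression: {compression(len(seqs), len(clusters)):.1%}")
```

In this sketch the fragment `p3` is kept apart from `p1` even though every one of its residues matches, because the coverage criterion requires the similarity to span nearly the full length of both sequences. Under the assumed definition of compression, a figure of 74.7% would correspond to roughly one representative for every four original sequences.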




FastaHerder2: Four Ways to Research Protein Function and Evolution with Clustering and Clustered Databases
