Abstract
Bioinformatics is used at three different steps of proteomic studies of sub-cellular compartments. First one is protein identification from mass spectrometry data. Second one is prediction of sub-cellular localization, and third one is the search of functional domains to predict the function of identified proteins in order to answer biological questions. The aim of the work was to get a new tool for improving the quality of proteomics of sub-cellular compartments. Starting from the analysis of problems found in databases, we designed a new Arabidopsis database named ProtAnnDB ( http://www.polebio.scsv.ups-tlse.fr/ProtAnnDB/ ). It collects in one page predictions of sub-cellular localization and of functional domains made by available software. Using this database allows not only improvement of interpretation of proteomic data (top-down analysis), but also of procedures to isolate sub-cellular compartments (bottom-up quality control).
Introduction
Bioinformatics is of paramount importance for protein analysis in proteomic studies. Proteomics generates huge amounts of data that must be interpreted to answer biological questions. It is used at three steps of proteomic studies (Fig. 1): (i) identification of proteins by peptide mass mapping or peptide sequencing; (ii) prediction of their sub-cellular localization, (iii) prediction of their function. Identification of proteins requires measurement of mass/charge ratios of tryptic peptides or of ions resulting from peptide fragmentation and comparison to sequence databases. This part of the work is done with devoted software such as ProteinProspector (http://prospector.ucsf.edu/), Mascot (http://www.matrixscience.com/search_form_select.html), and Profound (http://prowl.rockefeller.edu/prowl-cgi/profound.exe) that can be used online. Of course, reliability of such identifications depends on the quality of both sequences deposited in databases and of structural annotation of genomes. But many efforts have been done to improve it, especially with the setting of Unigene at NCBI which is defined as “An Organized View of the Transcriptome” (http://www.ncbi.nlm.nih.gov/sites/entrez?db=unigene). Analysis of protein sub-cellular localization and function is more puzzling. For Arabidopsis proteins, information on sub-cellular localization can only be found at MIPS (http://mips.gsf.de/proj/plant/jsf/athal/searchjsp/index.jsp). Information on prediction of functional domains is given in most databases. However, scores are not always given and the names of proteins may not be related to these predictions. The availability of all this information in a reliable and friendly way appeared critical when we obtained loads of data from proteomics. We wanted to use bioinformatics not only as a tool to interpret our experimental data in a “top-down analysis”, but also as “bottom-quality control” of our procedure for preparation of plant cell walls (Fig. 1)1,2. Starting from the analysis of problems found in databases, we designed a new Arabidopsis database named ProtAnnDB for Protein Annotation DataBase. It collects predictions made by available software. It allows the user to see the results in one page without any need to run their query against all of them.

Prediction of Sub-Cellular Localization of Proteins as a Valuable Tool to Assess the Quality of Sub-Cellular Proteomics
The creation and maintenance of cellular structures relies on the regulated expression and spatial targeting of proteins. In addition, determining the sub-cellular localization of a protein is an important first step towards understanding its function. Proteomics is one of the large-scale methods used to identify proteins, made possible by the sequencing of whole genomes. The homogenized cells are fractionated, most commonly through centrifugation. With proper purification and fractionation techniques, the contents of a particular fraction will correspond to a particular organelle. However, the approach is sensitive to contaminations. Purity control mainly relies on tests that should be positive for proteins expected in the purified fraction, and negative for proteins of all other cell compartments.
Plant cell wall is a particular difficult compartment since it is not surrounded by a membrane, and contains anionic carbohydrates that can bind basic proteins from other compartments. The reliability of protein profiling for a compartment like the cell wall thus strongly depends on the quality of the extraction protocol. There are two ways to obtain proteins from the cell wall compartment:4–5 (i) non-destructive methods using the culture medium of cell cultures, or the extraction of extracellular fluids by an infiltration/centrifugation procedure; (ii) destructive methods involving the rupture of the cells and the separation of the insoluble cell wall fraction. Since plant cell walls are mainly built up with highly dense polysaccharides, this property can be used to purify them by centrifugation through high density solutions. Several methods have been used to prepare enriched cell wall fractions and to verify their purity (Table 1). On the one hand, enzymology, immunology, and microscopy have been used to estimate the purity of the isolated fraction. On the other hand, bioinformatic analyses of the sequences of identified proteins allowed the prediction of their sub-cellular localization using PSORT (http://psort.ims.u-tokyo.ac.jp/form.html) 6 and TargetP (http://www.cbs.dtu.dk/services/TargetP/).7,8 In a few cases, sub-cellular localization could not be reliably predicted. The ratio of the number of predicted secreted proteins to the total number of proteins identified has been calculated. This ratio can also be used as a purity control. It can be seen that classical methods of control support a high purity for most of the fractions. However, the concrete results of the bioinformatic predictions show a different picture with much lower degree of purity in most cases, suggesting that many fractions were not pure enough. On the contrary, although such controls have not been performed in the case of preparation of proteins from the hypocotyl cell wall fraction, 1 it is remarkable that the level of purity of the cell wall protein fraction is one of the highest as calculated from bioinformatic predictions. It should be noted that the use of biochemical or immunological markers as purity criteria have produced a number of publications claiming that many well-known intracellular proteins are also secreted without a signal peptide. 9 It is true that not all secreted proteins contain a signal peptide and alternative secretion pathways exist in both prokaryotes and eukaryotes. However, only a reduced number of proteins are secreted without a signal peptide in eukaryotes. 10
These comparisons show that the classical methods used to test for the purity of sub-cellular compartments are not conclusive for proteomic studies. Indeed, the sensitivity of mass spectrometry is much higher than that of enzymatic or immunological tests using specific markers. As a consequence, the characterization and prediction of the intrinsic signals that target proteins to the correct subcellular compartment has become a major task in bioinformatics. Although not all signals for protein sorting in cell compartments are described, bioinformatics can help in predicting subcellular localization of proteins thus contributing to the quality control of proteomic strategies (Fig. 1). In particular, sorting signals for vacuoles are of several types and probably not all are known. 11 In addition, non classical pathway for protein secretion should be taken into account. 10
Using Functional Domains as Efficient Tools for Annotation of Proteins
With regard to protein function and due to automatic annotation of proteins on the basis of BLAST searches (http://blast.ncbi.nlm.nih.gov/Blast.cgi), 12 there are many mistakes in databases on the principle of the children game called the Chinese whispers. Even if functional domains such as InterPro, PFAM or PROSITE are now indicated in the description of protein sequences in most databases, the names proposed for proteins are often incorrect because they result from BLAST searches rather than from the presence of functional domains. Actually, BLAST results can rely on partial sequence homology as shown in the case of the family of 11 leucine-rich repeat extensins (LRXs) 13 as LRXs and PEXs. Query of the NCBI Entrez Protein database (http://www.ncbi.nlm.nih.gov/sites/entrez?db=Protein) results in 14 accession numbers using the following key words: leucine-rich repeat AND extensin AND Arabidopsis. The same functional annotation was found at TAIR (http://arabidopsis.org/index.jsp) and TIGR (http://www.tigr.org/tdb/e2k1/ath1/) whereas only 6 proteins were given related names such as leucine-rich repeat/extensin or extensin-like at MIPS (http://mips.gsf.de/proj/plant/jsf/index.jsp) (Table 2). A detailed analysis of the information available in databases shows that the appropriate functional domains are listed in the description of the proteins (Table 1, supplementary data). However, the names assigned to the proteins are not correct at NCBI, TAIR, and TIGR in three cases (At2g19780, At4g06744, and At4g29240) since these names were given according to BLAST results. As shown for At2g19780 in Figure 2, significant identity was found with an LRX protein encoded by At3g24480, but only in its leucine-rich repeat (LRR) domain. All proteins that are bona fide LRXs according to Baumberger et al. 13 should have at least one LRR domain and one proline-rich domain (Table 3). Annotation of At2g19780, At4g06744, and At4g29240 should be revised. On the contrary, At2g19780 and At3g24480 are annotated as “disease resistance proteins” at MIPS since many of such proteins have LRR domains. But there is no experimental evidence that these two proteins play any role in plant defense. At present, an annotation mentioning only the presence of structural LRR domains would be more relevant.

Number of Arabidopsis proteins annotated as LRXs in various databases and by Baumberger et al. 13
Functional domains found using InterProScan and PROSITE in Arabidopsis proteins annotated as LRXs in databases.
Other problems result from different annotations despite the presence of identical functional domains. This is the case for the extensin gene family which comprises 19 members. 14 Three examples are given in Table 2, supplementary data. At1g21310 gets the name of its mutant (RSH for Root Shoot Hypocotyl Defective) whereas At1g26240 is annotated as “proline-rich extensin-like family protein” and At1g26250 as “proline-rich extensin, putative.” On the contrary, the same name can be attributed to proteins that belong to different families. The “Pollen Ole e1 allergen and extensin family” contains two types of proteins (Table 2, supplementary data): proteins that only contain the IPR006041 domain (Pollen Ole e1 allergen and extensin), and proteins that contain both this domain and a proline-rich region profile (PS50099). Only the MIPS database names At3g33790 a “putative proline-rich protein.”
Although functional domains are correctly listed in description of proteins, annotation may only take into account a profile with a high probability of occurrence, i.e. of poor significance. This is the case of At2g15770 annotated as “glycine-rich protein” although the score obtained for PS50315 (glycine-rich region) is very low (Table 2, supplementary data). A higher score was obtained for IPR008972 (cupredoxin) and the gene has been annotated as AtEn23 by experts. 15 On the contrary, At1g15825 has been annotated as “hydroxyproline-rich glycoprotein family protein” although the only identified structural domain is PS50099 (proline-rich region). A “BLAST 2 sequences” reveals a very poor matching to the true extensin mentioned in databases (At3g19020) (data not shown). Moreover, the prediction for subcellular localization is “cytoplasm” using PSORT with a score of 0.450 and “other” using TargetP with a score of 0.602. A detailed analysis of the protein sequence reveals that at present the only possible annotation is “proline-rich protein”.
The last case will concern a protein that has been annotated as an “extensin-like protein” as inferred from sequence comparison to a protein sequence deduced from a cDNA of Brassica napus (AAK30571) (Fig. 3A). Again, the relevant functional domain is indicated in databases, i.e. IPR003612 (plant lipid transfer protein/seed storage/trypsin-alpha amylase inhibitor). The “BLAST 2 sequences” against AAK30571 gives 94% identities (Fig. 3B). However, since the annotation of the B. napus sequence is wrong, this mistake has been spread over other sequences that are also wrongly annotated.

Building of the ProtAnnDB Dedicated to Collection of Predictions of Sub-Cellular Localization and Functional Domains
The Arabidopsis protein sequences present in ProtAnnDB (http://www.polebio.scsv.ups-tlse.fr/ProtAnnDB/) are taken from the latest version of the Arabidopsis genome annotation (TAIR8, http://www.arabidopsis.org/help/helppages/BLAST_help.jsp#datasets) (32825 proteins). 16 ProtAnnDB provides a link to the NCBI reference protein sequences (RefSeq) which are curated sequences (http://www.ncbi.nlm.nih.gov/RefSeq/). 17 Amino acid sequences have been run against different software predicting either their sub-cellular localization or functional domains. Different programs were used because each of them has its own specificity and it is necessary to compare the results to increase confidence in the prediction. There are feature-based methods, global sequence properties-based methods, and machine learning methods such as neural networks, hidden Markov models and support vector machines. 8
Sub-cellular localization of proteins can be predicted using several programs on line. TargetP, SignalP (http://www.cbs.dtu.dk/services/SignalP/), 10 and Predotar (http://urgi.versailles.inra.fr/predotar/predotar.html) 18 predict N-terminal targeting sequences. TMHMM (http://www.cbs.dtu.dk/services/TMHMM-2.0/) predicts transmembrane domains. Since all these programs use different algorithms to make predictions, it is necessary to compare the results. There is no such tool available on line apart from Aramemnon (http://aramemnon.botanik.uni-koeln.de/), a database describing plant putative membrane proteins based on the comparison of results from different programs predicting transmembrane spanning domains and membrane-anchoring through GPI anchors, prenylation, and myristoylation. 19 Several of these programs were selected in ProtAnnDB: SignalP, TargetP, Predotar, TMHMM and Aramemnon (Fig. 4).

Bioinformatics tools for prediction of protein sub-cellular localization. Protein sequences are from TAIR8 (http://www.arabidopsis.org/index.jsp).
Prediction of functional domains can be achieved using different programs. InterProScan (http://www.ebi.ac.uk/Tools/InterProScan/) collects information from different prediction programs and proposes classification of proteins in superfamilies. 20 Pfam (http://pfam.sanger.ac.uk/) comprises 9318 protein families. 21 PROSITE (http://www.expasy.org/prosite/) comprises 717 patterns and 795 profiles. 22 PANTHER (http://www.pantherdb.org/) classifies genes by their functions, using published experimental evidence and evolutionary relationships. Three of these programs were chosen for building ProtAnnDB: InterProScan, PROSITE and PFAM (Fig. 5).

Bioinformatics tools for prediction of cell wall protein functional domains. Protein sequences are from TAIR8 (http://www.arabidopsis.org/index.jsp). Links to NCBI RefSeq are provided for each protein (http://www.ncbi.nlm.nih.gov/RefSeq/). Examples of cell wall-related gene families annotated by experts are listed in Table 4.
In all cases, use of several programs based on different rationales is required to improve the quality of the prediction of sub-cellular location or function of a protein. However, some doubtful cases cannot be resolved. Finally, relevant literature, especially if it includes experimental data, is the best source of information. Databases annotated by experts in the field also provide checked information and often allow use of unified nomenclature. This information is available either in research articles or on line. It has been included in ProtAnnDB for 1565 sequences of cell wall-related proteins, such as (i) proteins involved in cell wall biogenesis or metabolism, and (ii) proteins predicted to be at the cell wall/plasma membrane interface like receptor-like kinases (Table 4).
Cell wall-related protein families of Arabidopsis annotated by experts.
Two different formats have been designed in ProtAnnDB: a tab delimited text format that can be imported in the Microsoft Office Excel format (http://www.microsoft.com/france/office/2007/programs/excel/overview.mspx) (Fig. 6) and a friendly user web interface format for each protein entry (Fig. 7). This presentation allows (i) checking the annotation of each protein separately and (ii) working with several proteins at the same time. In future versions of ProtAnnDB, it is planned to introduce other plant proteins such as those from rice and poplar since their genomes will been soon fully annotated. In addition, it is planned to allow querying the database with keywords related to sub-cellular localization and/or functional domains.


Conclusion
Proteomic studies generate huge amounts of data consisting in lists of genes. Answering biological questions using these results is a great challenge. It is usually done collecting protein annotations from databases. However, protein annotation in databases can be incomplete or misleading since they are mainly inherited from results of sequence comparisons and do not usually take into account the presence of functional domains. Moreover, the available databases do not allow finding predictions of sub-cellular localization of proteins easily. ProtAnnDB has been designed as a new tool to facilitate the processing of proteomic data. Of course, it also allows processing of transcriptomic data which encounters the same type of problem. ProtAnnDB allows collecting results of prediction of sub-cellular localization and functional domains of Arabidopsis proteins by existing bioinformatics software. It should allow improving the biological interpretation of proteomic and transcriptomic data.
Methods
ProtAnnDB contains all the AGI codes included in the TAIR8 database (http://www.arabidopsis.org/help/helppages/BLAST_help.jsp#datasets). Each AGI code is associated with the results of the predictions of sub-cellular localization and functional domains. ProtAnnDB has been filled using Perl scripts by a two-step procedure. As a first step, all the Arabidopsis protein sequences were submitted to the following programs: TargetP, SignalP, TMHMM, and Predotar for sub-cellular localization; PROSITE (ps-scan.pl) for functional domains. InterPro and PFAM domains were imported from the TAIR8_all.domain file available at TAIR (ftp://ftp.arabidopsis.org/home/tair/Proteins/Domains/). Predictions of GPI anchors have been imported from the Aramemnon database. As a second step, all the results were organized in a MySQL database, which can be queried with a friendly user web interface written in PHP.
Authors’ Contributions
HSC made the scripts to build up ProtAnnDB. EJ selected programs available online and provided examples of cell wall proteins misannotated in databases. RPL reviewed literature about plant cell wall proteomic studies. EJ and RPL wrote the manuscript. All authors read and approved the final manuscript.
Footnotes
Acknowledgments
This work was supported by the Université Paul Sabatier in Toulouse and Centre National de la Recherche Scientifique (France).
The authors report no conflicts of interest.
