Abstract
The Plasmodium falciparum genome being AT-rich, the presence of GC-rich regions suggests functional significance. Evolution imposes selection pressure to retain functionally important coding and regulatory elements. Hence searching for evolutionarily conserved GC-rich, intergenic regions in an AT-rich genome will help in discovering new coding regions and regulatory elements. We have used elevated GC content in intergenic regions coupled with sequence conservation against P. reichenowi, which is evolutionarily closely related to P. falciparum, to identify potential sequences of functional importance. Interestingly, ∼30% of the GC-rich, conserved sequences were associated with antigenic proteins encoded by var and rifin genes. The majority of sequences identified in the 5′ UTR of var genes are represented by short expressed sequence tags (ESTs) in cDNA libraries, signifying that they are transcribed in the parasite. Additionally, 19 sequences were located in the 3′ UTR of rifins and 4 of these also have overlapping ESTs. Further analysis showed that several sequences associated with var genes have the capacity to encode small peptides. A previous report has shown that upstream peptides can regulate the expression of var genes; hence we propose that these conserved GC-rich sequences may play roles in regulation of gene expression.
Introduction
Regulatory motifs that allow fine-tuning of gene expression are of interest in the malaria parasite Plasmodium falciparum. These include promoters, mRNA stability motifs and translation regulatory sequences. Some regulatory motifs also encode non-coding RNAs (ncRNAs) that in turn regulate expression of genes. The importance of regulatory motifs should not be underestimated in the parasite since mechanisms of regulation of gene expression are still being elucidated in this human pathogen.1–3 The comparative genomics approach has been successfully employed to identify evolutionarily well-conserved regulatory elements in C. elegans, S. cerevisiae and Homo sapiens.4–6 This is based on the rationale that functionally important sequences are often conserved among species. Comparative genomics has also been used in Plasmodium species to identify regulatory motifs.7
Another feature of the Plasmodium falciparum genome that has proved useful in the search for new regulatory elements has been nucleotide bias. Plasmodium falciparum has an unusually AT-rich genome, 8 with an average AT content of 80% that increases to 90% in intergenic regions. In such a biased genome, local regions of increased GC content in the non-coding regions appear to correlate with functionally important features. For example, a conserved, GC-rich region found upstream of heat shock protein (hsp) genes is a functionally important DNA regulatory element.9,10 In two reports including one from our group, noncoding RNAs (ncRNAs) were identified in Plasmodium falciparum based on searching for conserved GC-rich intergenic regions.10,11 Similarly, nucleotide compositional contrast has been used to identify ncRNA in the AT rich genome of Dictyostelium discoideum and hyperthermophiles.12–14 This type of screen exploited the fact that most RNA regulatory elements carry out their functions by inter-molecular or intra-molecular base pairing; hence an increase in GC content especially in an AT-rich genome would result in RNAs having more stable secondary structures. 15 Most of these reports also used comparative genomics and evolutionary conservation as a tool to assess functional significance.
The choice of genomes used for comparative genomics is critical. In a bioinformatics screen described previously, 11 since the complete genome of P. yoelii was available we chose this species for identifying conserved, GC-rich intergenic regions that were shown to encode ncRNAs. However, with the recent availability of other Plasmodium genomes, it is likely that other genomes might be equally useful for comparative genomics. Indeed, P. yoelii and P. falciparum appear to have diverged >100 million years ago; 16 however, P. falciparum has been shown to be most closely related to the chimpanzee malaria parasite P. reichenowi.17,19 Apart from housekeeping genes, several ORFs that encode cell surface proteins in P. falciparum are conserved between P. falciparum and P. reichenowi; these include CSP, 20 MSP2 21 and var CSA. 22 In contrast, the var, rifin and stevor multigene families that are involved in antigenic variation in P. falciparum are represented by a single multigene family (yir) in P. yoelii that is most closely related to the vir family in P. vivax.23,24 Over the entire genome, P. yoelii is most closely related to the other rodent malaria parasites P. berghei and P. chabaudi. 25
In this report we ask whether regulatory elements can be identified by a bioinformatics screen using elevated GC content in the P. falciparum genome, followed by sequence conservation in other Plasmodium species. Due to the large evolutionary distance between P. falciparum and P. yoelii, we hypothesized that the choice of these two genomes for comparative genomics may not identify regulatory elements associated with immunogenic genes that are specifically expressed in P. falciparum and not in P. yoelii. Hence for identification of genomic sequences that might be involved in host-specific functions eg, evasion from the immune system or regulation of antigenic variation genes, a primate malaria parasite genome would be more appropriate for the comparative genomics part of any bioinformatics screen.
We show that a large number of GC-rich sequences are conserved in the genomes of P. falciparum and the primate parasite P. reichenowi. Many of these GC-rich sequences flank genes involved in antigenic variation and some may be transcribed and translated. Several reports in the literature show that short RNAs can regulate transcription 26 and short ORFs can regulate translation of downstream genes.27,28 Indeed, one of these reports shows that an upstream ORF regulates expression of certain var genes. 29 We suggest that the sequences identified in this study may play roles in regulation of antigenic gene expression at the level of transcriptional or translational control.
Materials and Methods
GC% filter source data
The genome of Plasmodium falciparum 3D7 was downloaded chromosome wise from the online database (http://www.plasmodb.org/). Exon locations of all protein coding genes were also downloaded from the same database. Due to the unavailability of exon location data in the new version—PlasmoDB 5.2, all the data were downloaded from the older version PlasmoDB 4.4.
GC% C program algorithm
A C program was written which reads large text files of the Plasmodium falciparum genome. The program divides the genome into 70-base chunks using a sliding window that advances 10 bases at a time. It uses exon location data and excludes those chunks which fall within ORFs. The GC% of each chunk was then calculated. An output FASTA file was generated with the sequences of all 70-base chunks with greater than 35% GC according to the sliding window model and lying outside ORFs. If any 70-base chunks with greater than 35% GC overlapped, these were combined and treated as a single sequence. All such chunks were associated with their chromosomal locations; note that since overlapping chunks were merged together, some of the resulting sequences are longer than 70 bases.
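The windowing and merging steps described above can be illustrated with a short sketch. This is not the original C program; it is a minimal Python illustration under stated assumptions: coordinates are 0-based and half-open, a chunk is discarded if it overlaps any exon at all (the text does not say how partial overlaps were handled), and the names `gc_fraction`, `overlaps_any` and `gc_rich_regions` are hypothetical.

```python
def gc_fraction(seq):
    """Return the fraction of G and C bases in seq (0.0 for an empty string)."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def overlaps_any(start, end, exons):
    """True if the half-open interval [start, end) overlaps any exon interval."""
    return any(start < e_end and e_start < end for e_start, e_end in exons)


def gc_rich_regions(genome, exons, window=70, step=10, threshold=0.35):
    """Slide a window along one chromosome and merge overlapping GC-rich chunks.

    Assumption: any overlap with an exon excludes the chunk.
    Returns a list of (start, end, sequence) tuples.
    """
    merged = []
    for start in range(0, len(genome) - window + 1, step):
        end = start + window
        if overlaps_any(start, end, exons):
            continue
        if gc_fraction(genome[start:end]) > threshold:
            if merged and start < merged[-1][1]:
                merged[-1][1] = end          # extend the previous region
            else:
                merged.append([start, end])  # open a new region
    return [(s, e, genome[s:e]) for s, e in merged]


if __name__ == "__main__":
    # Toy chromosome: a 70-base GC block embedded in poly-A sequence.
    toy = "A" * 100 + "GC" * 35 + "A" * 100
    for s, e, seq in gc_rich_regions(toy, exons=[]):
        print(s, e, round(gc_fraction(seq), 2))
```

Because neighbouring windows share 60 of their 70 bases, a single GC-rich stretch is usually reported by several consecutive windows, which is why the merged output can be considerably longer than 70 bases.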
Sequence Conservation Source Data
The genome contigs of Plasmodium species viz. P. yoelii, P. vivax, P. reichenowi, P. berghei, P. gallinaceum, P. knowlesi and P. chabaudi were downloaded from PlasmoDB 5.2. The Washington University BLAST version 2.0 (WU-BLAST) downloaded from http://www.blast.wustl.edu/ was employed to analyze sequence conservation. This BLAST version was installed on a Linux machine.
Shell Script
A shell script was written which took each sequence from the output FASTA file containing sequences having GC content greater than 35% and fed it into the BLAST software. The script ran BLAST for every chunk in each query file against all the available contigs in the database file. The E value cut-off was set at 1e-10.
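The E value filtering step can be sketched as follows. This is only an illustration and does not reproduce the original shell script or the WU-BLAST output format: it assumes the hits have already been exported to tab-separated lines whose first three columns are query name, subject name and E value (a hypothetical layout), and it treats a hit at exactly the cut-off as passing. The function name `conserved_queries` is also hypothetical.

```python
def conserved_queries(hit_lines, evalue_cutoff=1e-10):
    """Return the set of query names with at least one hit at or below the cut-off.

    Assumed input: tab-separated lines, columns = query, subject, E value.
    Blank lines and lines starting with '#' are ignored.
    """
    conserved = set()
    for line in hit_lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        query, _subject, evalue = line.split("\t")[:3]
        if float(evalue) <= evalue_cutoff:
            conserved.add(query)
    return conserved


if __name__ == "__main__":
    example = [
        "# query\tsubject\tevalue",
        "chunk_001\tcontig_17\t2e-25",
        "chunk_002\tcontig_03\t3e-5",
        "chunk_002\tcontig_44\t1e-12",
        "chunk_003\tcontig_09\t0.5",
    ]
    print(sorted(conserved_queries(example)))
```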
Positive controls for the above strategy were rRNA, tRNA and the sequences identified with Plasmodium yoelii earlier by Upadhyay et al. In short, after running the BLAST analysis of GC-rich sequences using different genomes, we checked whether the 43 annotated tRNAs, 27 annotated rRNAs and 18 ncRNA sequences identified by Upadhyay et al were correctly identified.
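The positive-control check amounts to asking whether each known locus overlaps at least one sequence that survived the screen. A minimal sketch of such a check is given below; the data structures, the 0-based half-open coordinates and the function name `recovered_controls` are assumptions made for illustration, not a description of how the check was actually carried out.

```python
def recovered_controls(regions, controls):
    """Report which control loci overlap at least one identified region.

    regions:  iterable of (chromosome, start, end) for screened sequences
    controls: dict mapping a control name to (chromosome, start, end)
    Coordinates are assumed to be 0-based and half-open.
    """
    by_chrom = {}
    for chrom, start, end in regions:
        by_chrom.setdefault(chrom, []).append((start, end))

    found = {}
    for name, (chrom, start, end) in controls.items():
        found[name] = any(
            start < r_end and r_start < end
            for r_start, r_end in by_chrom.get(chrom, [])
        )
    return found


if __name__ == "__main__":
    # Toy coordinates, not real annotations.
    regions = [("chr1", 100, 240), ("chr2", 5000, 5070)]
    controls = {"tRNA_a": ("chr1", 150, 222), "rRNA_b": ("chr2", 9000, 9100)}
    status = recovered_controls(regions, controls)
    print(sum(status.values()), "of", len(status), "controls recovered")
```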
Results and Discussion
Use of the P. reichenowi genome for comparative genomics can identify novel GC-rich conserved sequences
Previous work in our lab had used a bioinformatics strategy to identify GC-rich sequences present in intergenic regions that were conserved between P. falciparum and P. yoelii. This screen used two cut-offs (35% GC followed by an E value cut-off of 1e-10) and identified 18 sequences, many of which were found to be small molecular weight RNAs also known as non-coding RNAs (ncRNAs). These cut-offs were appropriate in searching for ncRNAs since we were able to identify all 43 annotated tRNAs and 21 annotated rRNAs from the P. falciparum genome.
We hypothesized that using the same strategy but with different genomes for the comparative genomics part of the screen might give more GC-rich, conserved sequences that are associated with host-specific functions. These sequences might be regulatory DNA sequences, ncRNAs or protein-encoding regions. To ensure that the 35% GC cut-off was appropriate for identifying such regulatory sequences, and particularly to be sure that the probability of finding the GC-rich sequence was greater than chance, we did a simple statistical analysis. The average GC content of the 23 megabase P. falciparum genome (19%) was compared to the GC content of the 10 base chunks used in the screen (35%) with a Chi-square test using Minitab software. The P value of this test was 0.0003, indicating that the probability of finding a 35% GC-rich sequence of 10 bases in the P. falciparum genome, is very low. Hence, any GC-rich sequences identified should be significantly different from the genome in their nucleotide content. We proceeded to test our hypothesis that sequences greater than 35% GC-rich and conserved in other Plasmodium species might be regulatory sequences associated with host-specific functions.
To test this, we initially performed the bioinformatics screen using only chromosome 1 of P. falciparum. This screen retained the original parameters of GC threshold and BLAST cut-off (>35% GC and an E value of 1e-10); however, the BLAST analysis was performed against seven Plasmodium species: P. yoelii, P. reichenowi, P. berghei, P. vivax, P. gallinaceum, P. knowlesi and P. chabaudi. For all genomes except P. reichenowi, no new GC-rich, conserved sequences were identified. Interestingly, eighty-five new sequences could be identified when the screen involved comparison with the chimpanzee parasite, P. reichenowi. No new sequences were identified when BLAST was performed against the macaque parasite P. knowlesi and the human parasite P. vivax. This is consistent with reports that P. knowlesi falls in the same phylogenetic group as P. vivax. 19 Hence, P. reichenowi was chosen as the most appropriate genome for the comparative analysis to identify regulatory elements in P. falciparum.
Proximal Intergenic Sequences
The bioinformatics screen was repeated using the entire P. falciparum genome to identify GC-rich sequences with a cut-off of 35% GC; these sequences were compared for conservation against the complete P. reichenowi genome (E value of 1e-10), yielding ∼1500 conserved GC-rich regions. In order to further prioritize these sequences an additional parameter was applied. This parameter restricted the output to sequences that lie within 500 bases of the start or stop codons of annotated ORFs (termed proximal intergenic regions). The rationale was that a majority of DNA regulatory elements and translational control elements are generally found within 500 bases of the start or stop codons of flanking genes. Hence we decided to select sequences that could lie within 5′ or 3′ UTRs of P. falciparum genes. Very few P. falciparum UTRs have been annotated; nevertheless, Watanabe et al conclude from their analysis of a cDNA library that the 5′ UTRs of P. falciparum genes are unusually long, averaging 346 bp. 30 Golightly et al report a 3′ UTR of 450 bp in the mRNA of Pgs28, an ookinete protein of the avian parasite P. gallinaceum. 31 Hence, we defined all the intergenic sequences within 500 bp of the coding region as ‘proximal intergenic sequences’. Those intergenic sequences that lie more than 500 bp from the coding sequence were designated as ‘deep intergenic sequences’. Concurrently, Neafsey et al have suggested that conserved CpG dinucleotides enriched in proximal intergenic regions might function as regulatory elements. 32 With these criteria in mind, the ∼1500 new GC-rich sequences that were identified during the bioinformatics analysis described in this report were pruned down to 151 by screening for proximal intergenic sequences (see Supplementary Table 1).
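The proximal versus deep classification can be expressed as a simple distance test. The sketch below is an illustration under stated assumptions rather than the procedure actually used: regions and coding sequences are given as 0-based half-open (start, end) pairs on a single chromosome, strand is ignored (so it does not distinguish 5′ from 3′ flanks), and the function names are hypothetical.

```python
def distance_to_nearest_gene(region, genes):
    """Return the gap in bases between a region and the closest coding sequence.

    region: (start, end); genes: list of (start, end). Returns None if no genes
    are supplied and 0 if the region overlaps a gene.
    """
    r_start, r_end = region
    best = None
    for g_start, g_end in genes:
        if r_end <= g_start:        # region lies before the gene
            gap = g_start - r_end
        elif g_end <= r_start:      # region lies after the gene
            gap = r_start - g_end
        else:                       # overlap, not expected for intergenic input
            gap = 0
        if best is None or gap < best:
            best = gap
    return best


def classify_region(region, genes, max_distance=500):
    """Label a region 'proximal' (within max_distance of a gene) or 'deep'."""
    gap = distance_to_nearest_gene(region, genes)
    if gap is not None and gap <= max_distance:
        return "proximal"
    return "deep"


if __name__ == "__main__":
    genes = [(1000, 2000)]  # toy coordinates
    for region in [(880, 950), (100, 170), (2500, 2570)]:
        print(region, classify_region(region, genes))
```

Adding strand information to each gene would allow the same gap calculation to report whether a proximal sequence sits on the 5′ or the 3′ side of its flanking ORF.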
BLAST analysis of antigenic proteins.
Immunogenic Proteins are Conserved in P. falciparum and P. reichenowi
Having shown that 151 sequences that are GC-rich and present in intergenic regions are conserved between P. falciparum and P. reichenowi, we wished to test our hypothesis that these might be associated with antigenic genes that are found in these two species. As a first step towards this, we tested whether families of antigenic genes found in P. falciparum are also present in P. reichenowi.
A comparison of the chimpanzee's genetic blueprints with that of the human genome shows that our closest living relatives share 96 percent of our DNA. Humans and chimps originate from a common ancestor, and scientists believe they diverged some six million years ago. 33 Interestingly the human malaria parasite P. falciparum diverged from the chimpanzee malaria parasite P. reichenowi around 5–7 million years ago17,34 suggesting that the primate parasites may have diverged at the same period when their hosts diverged.
Several studies have shown that P. falciparum is most closely related to P. reichenowi.20,21 This is true not only for housekeeping genes but also for genes that encode proteins involved in host-parasite interactions. These include some of the var genes that encode the PfEMP family of proteins important for antigenic variation and evasion of the host immune response. Indeed, Trimnell et al 22 have shown that fragments of the var1CSA and var2CSA genes are conserved between P. falciparum and P. reichenowi, suggesting an ancient origin of some var loci. Like P. falciparum, P. reichenowi has also been shown to express key invasion proteins like EBLs and MAEBLs.35,36 To further test the extent of relatedness of the parasites, an analysis was done for other genes involved in antigenic variation. Antigenic proteins of P. falciparum involved in host-pathogen interactions were chosen and BLAST analysis of the genes was performed with P. reichenowi contigs (PlasmoDB BLAST server, blastn). Two genes were chosen at random from each of the PfEMP, rifin and stevor families of antigenic surface proteins and the P. yoelii genome was used for comparison. Table 1 shows the results of this analysis.
Except for the var gene PF07_0051 there were fewer than 5 matches to the P. yoelii genome with the antigenic genes tested. PF07_0051 showed 27 matches with a best E value of 3e-5 indicating that this var gene may have weak homology to sequences in the P. yoelii genome. This is consistent with the data that there have been no genes showing homology to the var gene family in reports on P. yoelii genome analysis. 8 Instead, the P. yoelii genome contains a multigene family (yir) that shows homology to the P. vivax vir multigene family.23,24 In contrast, 34–194 matches of the var, rifin and stevor genes were obtained by using BLAST against the P. reichenowi genome and these matches gave extremely low E values (E value < e-140) indicating that the sequences are highly conserved. The high numbers of matches obtained (eg, 194 with a rifin gene) indicate that P. reichenowi also has three different families of antigenic proteins like P. falciparum. Hence the data suggests that the P. falciparum genome is more similar to the genome of P. reichenowi than P. yoelii when antigenic variation genes are analyzed.
Sequences Proximal to var Genes
Having shown that antigenic variation genes are conserved in P. falciparum and P. reichenowi and that 151 GC-rich sequences are also conserved in the two genomes, the next question was whether these GC-rich sequences flanked antigenic variation genes.
As mentioned in the previous section, sequestration and rosetting are key determinants of P. falciparum pathogenesis and these processes are mediated by the proteins encoded by the var gene family, called Plasmodium falciparum Erythrocyte Membrane Protein 1 (PfEMP1). To evade immunity and extend infections, parasites clonally vary the PfEMP1 proteins that are expressed on the surface of the infected red blood cells. 31 Mechanisms of regulation of var genes have been a topic of intense research due to the clinical importance of these genes.38,39 Expression of var genes is regulated by two regions with separate promoters, one upstream of the coding region and a second within the intron. 40 Upstream promoters of var genes fall into four major sequence classes: upsA, upsB, upsC and upsE, 41 of which upsA-, upsB- and upsE-type var genes lie in sub-telomeric regions and upsC-type var genes lie in internal clusters. Recent evidence indicates that var genes are activated by recruitment of the promoter to a perinuclear site that is permissive for transcription 42 and also that the PfSIR2 regulator plays a role in var gene silencing.43,44 Recent studies indicate that ncRNAs associate with chromatin and thus regulate the expression of the var gene family. 45 Additionally, an upstream ORF can regulate certain var genes. 29
Interestingly, the BLAST result with P. reichenowi showed that 27 of the proximal intergenic GC-rich sequences flank var genes (listed in Table 2). All these sequences lie in the 5′ UTR of the flanking var genes and most are less than 20 bp away from the predicted ORF of PfEMP1 proteins. The close proximity of the GC-rich sequences to the var ORF led us to wonder whether these sequences might be transcribed either as short RNAs or as part of the var mRNA transcripts.
Conserved GC rich sequence associated with var genes.
A search of PlasmoDB revealed that the Sugano malaria cDNA library30,46,47 has identified several short transcripts (ESTs AU088275 and AU087013) in the 5′ UTRs of var genes. An analysis of the GC-rich sequences that are proximal to var genes showed that all except the PfNC4.4var overlap with at least one of the two ESTs AU088275 and AU087013. The two ESTs are transcribed from the same strand as the PfEMP1 mRNA and AU088275 and AU087013 showed alignment with 30 and 16 regions of the P. falciparum genome respectively. This bioinformatics study was able to identify 23 out of 30 and 10 out of 16 regions in the case of AU088275 and AU087013 respectively. The GC-rich sequences that were not identified in this study are less conserved compared to P. reichenowi and hence did not show up after the BLAST with a cut off of 1e-10. The presence of short transcripts that overlap with the GC-rich sequences identified in this bioinformatics screen suggests that indeed these sequences are transcribed.
PfNC4.4var was the only sequence with no associated ESTs and this sequence lies 190 bases away from the annotated PfEMP1. A BLAST was performed with the sequence of PfNC4.4var against the genome of P. falciparum and we identified 6 matches that were all proximal to PfEMP1 genes. To test whether any short RNAs are associated with the sequence PfNC4.4var we performed Northern analysis on mixed stage asexual parasites using strand-specific probes. These results indicate that the sequence is not expressed in mixed stage asexual parasites (data not shown); perhaps the expression of this sequence is below the limit of detection by Northern analysis or is stage-specific. Alternatively, the sequence may function as a DNA regulatory element rather than as RNA or may be involved in translational control of the flanking var gene.
The sequences of the ESTs AU088275 and AU087013 were compared with each other and with the sequence PfNC4.4var using ClustalW (http://www.ebi.ac.uk/clustalw/). The scores obtained show that the ESTs AU088275 and AU087013 are 68% similar to each other at the sequence level while the sequence PfNC4.4var is quite distinct from either of these ESTs showing 25%–32% sequence similarity in the ClustalW analysis. Further analysis of the ESTs showed that AU088275 and AU087013 are in the 5′ UTRs of var genes of the upsB or upsBsh subtypes while sequence PfNC4.4var is found in the 5′ UTRs of 7 var genes of the upsC subtype.
Having shown that the GC-rich sequences that flank var genes are found in short transcripts, we next asked whether these sequences have the capacity to encode proteins, either as upstream ORFs (uORFs) or as N-terminal extensions of the annotated var genes. Indeed, a majority of the GC-rich sequences showed the presence of upstream ORFs (uORFs) ranging in size from minimal ORFs (1 amino acid) to 21 amino acids. Several of the uORFs are found in a majority of the GC-rich regions (pentapeptide MYATI found 20 times) and others are found less frequently (MYQNTTKPCMPRYKPRMHDIM found once).
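A search for uORFs of this kind can be sketched as a scan for every ATG followed by an in-frame stop codon. The code below is an illustrative sketch only and is not the method used in this study: it assumes the standard genetic code, scans the given strand only, reports only ORFs whose stop codon lies inside the supplied sequence, and uses a hypothetical function name `find_uorfs`. A minimal ORF (a start codon immediately followed by a stop codon) is reported as the single residue M.

```python
# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def find_uorfs(seq):
    """Return (start_index, peptide) for every ATG with an in-frame stop codon.

    Only the strand given is scanned. ORFs that run off the end of the
    sequence, or that contain a codon with an ambiguous base, are skipped.
    """
    seq = seq.upper().replace("U", "T")
    peptides = []
    for start in range(len(seq) - 2):
        if seq[start:start + 3] != "ATG":
            continue
        peptide = []
        for pos in range(start, len(seq) - 2, 3):
            residue = CODON_TABLE.get(seq[pos:pos + 3])
            if residue is None:      # ambiguous base such as N
                break
            if residue == "*":       # in-frame stop codon closes the ORF
                peptides.append((start, "".join(peptide)))
                break
            peptide.append(residue)
    return peptides


if __name__ == "__main__":
    # Toy sequence containing one possible coding sequence for MYATI.
    print(find_uorfs("AAATGTATGCTACTATATAAGG"))
```

Since several codons can encode the same residue, comparing the translated peptides rather than the nucleotide strings is one way to count how often a given uORF recurs across different GC-rich regions.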
Interestingly, when all the GC-rich sequences that flank var genes were aligned with each other, it was noticed that the most conserved sequences (highlighted in grey with asterisks), encoded the uORF pentapeptide MYATI (Fig. 1). In contrast, sequence conservation was poor in the regions surrounding the uORF. This suggests an evolutionary pressure to maintain the uORF encoding sequences indicating these sequences may have functional importance. A sequence alignment between var-associated GC-rich sequences of P. falciparum and P. reichenowi (Fig. 2) shows a significant sequence similarity between PfNC12.4var and the homologous region from P. reichenowi and the uORF MYATI is conserved between the two species.

Sequence alignment of proximal upstream regions of upsB var genes.

BLAST result of PfNC12.4var against the Plasmodium reichenowi genome.
uORFs have been shown to play important roles in translational control. For example, a minimal uORF can regulate translation of certain HIV genes. 48 This minimal ORF (consisting of only a start and a stop codon) overlaps with the start codon of the vpu gene, and mutating the start and stop codons of this minimal ORF results in a reduction of translation of the downstream env gene. Upstream AUGs and uORFs in human and rodent genes appear to regulate translation initiation by the ribosome scanning machinery. 27 Finally, and most pertinently for this work, the presence of uORFs has been shown to regulate the expression of the downstream var gene. 29 We propose that the uORFs identified in this report, which flank var genes in their 5′ regions, may play similar roles in regulation of var gene expression.
Sequences Proximal to rifin Genes
Rifin genes constitute the largest multi-gene family in the P. falciparum genome with 149 members. Transcription from rifin genes is highest at the ring and early trophozoite stages and proteins encoded by these mRNAs are localized to the Maurer's clefts.49,50 Presence of antibodies against RIFINs in patient sera suggests that these proteins are indeed exposed on the surface of erythrocytes. 51 More recently, a PEXEL/VTS transport signal, found in proteins exported from the parasite vacuole to the erythrocyte, was observed in RIFIN proteins and is consistent with a potential cell surface localization.52,53 The function of RIFINs is unknown; however, these proteins may be involved in cytoadherence. Similar to var genes, rifin genes are also clonally variable although the mechanisms underlying the two processes appear to be different.
A search of the proximal intergenic GC-rich sequences obtained in our screen of the P. falciparum genome shows that 19 sequences flank rifin genes. The list of sequences is shown in Table 3. All the sequences except for one (PfNC10.1rif) lie in the 3′ UTR of rifin genes and are 1 to 500 bases away from the stop codon of the rifin open reading frame. PfNC10.1rif is located in the 5′ UTR of rifin gene PF10_0002w. Four of the GC-rich regions that flank rifins are associated with short ESTs (BI816203 and BQ577081) and all the ESTs are transcribed from the same strand as the rifin gene. There is a paucity of information regarding regulation of rifin gene expression. A recent study has mapped promoter elements that are required for expression of one rifin gene (PF11_0009) that is highly expressed in 3D7 parasites. 54 The promoter elements include two repressor regions that are bound by nuclear proteins expressed at different stages of the parasite life cycle. While 5′ flanking sequences are essential for transcriptional regulation, it is tempting to speculate that events in the 3′ UTRs of rifin genes, particularly the GC-rich sequences discovered in this study may play roles in gene regulation.
Conserved GC rich regions associated with rifin genes.
Conclusion
In conclusion, this report shows that a bioinformatics strategy involving a search for GC-rich intergenic regions that are conserved between P. falciparum and P. reichenowi can be used to uncover conserved GC-rich sequences proximal to antigenic variation genes. These sequences are transcribed and may also encode short upstream ORFs. It will be of interest to test the functional importance of these sequences in regulation of antigenic variation and clinical disease.
Disclosures
Author(s) have provided signed confirmations to the publisher of their compliance with all applicable legal and ethical obligations in respect to declaration of conflicts of interest, funding, authorship and contributorship, and compliance with ethical requirements in respect to treatment of human and animal test subjects. If this article contains identifiable human subject(s) author(s) were required to supply signed patient consent prior to publication. Author(s) have confirmed that the published article is unique and not under consideration nor published by any other publication and that they have consent to reproduce any copyrighted material. The peer reviewers declared no conflicts of interest.
Acknowledgements
SP thanks the Department of Biotechnology, Government of India for funding this project. PP was supported by funding from the Indian Institute of Technology (IIT) Bombay during her summer internship. PB received a fellowship from the Council for Scientific and Industrial Research (CSIR).
Appendix
List of 151 GC-rich sequences proximal to annotated genes identified in P. falciparum.
