Abstract
Pan-genome is defined as the set of orthologous and unique genes of a specific group of organisms. The pan-genome is composed by the core genome, accessory genome, and species- or strain-specific genes. The pan-genome is considered open or closed based on the alpha value of the Heap law. In an open pan-genome, the number of gene families will continuously increase with the addition of new genomes to the analysis, while in a closed pan-genome, the number of gene families will not increase considerably. The first step of a pan-genome analysis is the homogenization of genome annotation. The same software should be used to annotate genomes, such as GeneMark or RAST. Subsequently, several software are used to calculate the pan-genome such as BPGA, GET_HOMOLOGUES, PGAP, among others. This review presents all these initial steps for those who want to perform a pan-genome analysis, explaining key concepts of the area. Furthermore, we present the pan-genomic analysis of 9 bacterial species. These are the species with the highest number of genomes deposited in GenBank. We also show the influence of the identity and coverage parameters on the prediction of orthologous and paralogous genes. Finally, we cite the perspectives of several research areas where pan-genome analysis can be used to answer important issues.
Introduction
Pan-genome was a term coined by Tettelin et al
1
to describe the gene content of several strains of
A search in PubMed database using the words “pan genome” or “pan-genome” returns a total of 494 works published in the last 5 years (2015 to date). This number tends to increase because the results of pan-genome analyses are becoming more accurate. Zeng et al
5
used a new pan-genome reverse vaccinology approach and found 121 cell surface-exposed proteins belonging to the core genome of
In this review, we present the main concepts and software used for the analysis of prokaryotic pan-genomes. First, introduction to basic concepts is presented followed by an up-to-date description of the most recent software for pan-genome analysis. We also present a pan-genome analysis of the 9 bacterial species with the highest number of genomes deposited in GenBank.
Basic Concepts
Pan-genome structure
The sequence of a single genome does not reflect the entire genetic variability of a bacterial species. Complex analysis such as evolutionary genomics and molecular pathogenesis require a large number of sequenced genomes.1,9 Fortunately, the constant evolution of sequencing technologies has been allowing the reduction of sequencing time and cost. Consequently, an exponential increase in the number of genomes available in the databases has been observed. New research fields have emerged such as comparative genomics, whose principle is to compare the genetic content of several taxonomically related microorganisms. 10 For example, in the pre-genomic era, 2 strains were classified in the same species if they presented 70% DNA-DNA reassociation. 11 In the post-genomic era, several other methods can be applied to evaluate taxonomic relationships such as average nucleotide identity (ANI). 12 A broader discussion about the concept of bacterial species will be accomplished later. Recently, the tree of life was updated based on the comparative analysis of a large number of bacterial genomes. 13
A pan-genome is determined through comparative genomic analyses. A pan-genome consists in the set of non-redundant gene families belonging to a taxonomically related group of organisms. The pan-genome is divided into 3 subgroups as demonstrated in Figure 1. Core genome is the set of genes shared by all analyzed microorganisms. Most of these genes are involved in vital roles for bacterial survival.1,14,15 However, genes of the core genome may also be involved in pathogenicity and virulence in some bacterial species.16,17 Accessory or dispensable genome is composed by the set of genes that are present in 2 or more genomes but not all. 18 Singleton genes such as species- or strain-specific genes are those present in only one genome. Usually, accessory and singleton genes are acquired by horizontal gene transfer (HGT) or evolved due to mutations in pre-existing genes. They are commonly related to a specific metabolism, virulence, antibiotic resistance mechanism, or other environmental adaptation. 19 A pan-genome is classified as open or closed depending on the probability of detecting new gene families as new genomes are added into the analysis. In an open pan-genome, the number of gene families will continuously increase with the addition of new genomes to the analysis. In contrast, in a closed pan-genome, the number of gene families will not increase considerably.

Venn diagram representing the subgroups of a pan-genome. Each set represents the gene families detected in a genome. The intersection of these sets represents the core genome. The number of gene families in the core genome corresponds to the size of the intersection. The fraction of genes corresponding to the core genome in an open pan-genome (A) is smaller than in a closed pan-genome (B). In contrast, the fraction of genes corresponding to the accessory genome in an open pan-genome (A) is higher than in a closed pan-genome (B).
Genomic plasticity and the accessory genome
The flexible portion of the pan-genome (accessory and strain-specific genes) is the main genetic component responsible for the adaptation of a bacterial population to different environmental stresses. In this context, it is important to distinguish the terms genomic plasticity and accessory genome. Genomic plasticity is used to describe the mobile genetic elements (MGEs) and hypervariable regions that transform the genome into a dynamic molecule. Therefore, it is a concept used to discuss the genetic variability of a single or multiple genome without necessarily making use of a pan-genomic approach. Accessory genome is the set of genes that, after a pan-genomic analysis, are present in 2 or more genomes but not all. Thus, it comprises the variable portion of a pan-genome. In some cases, MGEs and hypervariable regions comprise most of the accessory genome.
20
In strains of
Orthologous and paralogous genes
Most comparative genomic analyses begin by identifying the homologous characteristics between 2 or more prokaryotic genomes. These homologies range from large chromosomal segments to genes or even point mutations. In a pan-genome analysis, the genes are the main characteristics evaluated. From an evolutionary perspective, a gene is classified as homologous or analogous. Homologous genes are those originated from a common ancestor, whereas the analogous genes evolved independently through convergent evolution. In both cases, they will present the same function but in different organisms. About 15% of the genes of a bacterium are acquired through HGT. 22 Thus, it is difficult to apply the concept of analogous genes in the Bacteria domain.
A pan-genomic analysis searches for homologous genes within the set of analyzed genomes. These homologous genes are divided into orthologous and paralogous genes. 23 Orthologous genes diverged via evolutionary speciation. Paralogous genes diverged via gene duplication. Thus, orthologous genes are those shared by 2 or more bacteria and have equivalent biological function. It is worth noting that orthologous genes tend to be more conserved than paralogous genes. 24 In contrast, paralogous genes commonly undergo several mutations after their duplication leading to a change in their biological function. 25
As pan-genomic analysis is based on sequence homology, some parameters such as coverage and identity must be carefully chosen. These parameters will strongly influence in the detection of orthologous genes. To demonstrate this issue, we calculated the pan-genome of

Pan-genome, core genome, and alpha value for each one of the five analyses performed using 30 genomes of
High values of coverage and identity lead to overestimation of the pan-genome size and may separate groups of orthologous genes. The opposite is also true. Low values of coverage and identity lead to the clustering of non-orthologous genes. One way to determine the best values for these parameters is checking the clustering of known orthologous genes in such a way that they will serve as an internal control during the analysis. Other characteristics that should be taken into consideration include the genomic plasticity and the taxonomic level of the microorganisms being compared.
Some software allows the user to modify other parameters that also strongly influence the results of the analysis. For example, GET_HOMOLOGUES 26 offers the possibility to use DIAMOND algorithm instead of BLAST to perform the alignment. This software also allows user to choose the algorithm used for bidirectional search of best hits, between COGtriangle and orthoMCL. A broader discussion about the types of orthology analysis used by each software will be further presented.
Pan-genomic concept of bacterial species
In 1942, Ernst Mayr proposed a species concept that is widely used for eukaryotes: “species are groups of interbreeding natural populations that are reproductively isolated from the other such groups.” The microbial world, basically composed by microorganisms that reproduce asexually, is therefore one of the great bottlenecks of the species concept proposed by Mayr. Bacteria are able to exchange genetic material through HGT.
Nevertheless, taxonomy and systematics are extremely important for basic analysis in microbiology. Species is the fundamental taxonomic unit and in the absence of a concept that encompasses all living beings, new ideas were presented or re-discussed. 27 The Bergey’s Manual of Systematics of Archaea and Bacteria has a broader concept of species: “a distinct group of strains that have certain distinguishing features and that generally bear a close resemblance to one another in the more essential features of organization.” Due to the emergence of modern (or molecular) microbiology that is based on the genetic analysis of cultivable or uncultivable strains, the concept of bacterial species is becoming increasingly divergent from the concept of Mayr and clearer than the concept provided by the Bergey’s manual. Currently, bacteria with >70% DNA-DNA hybridization and >97% 16S rRNA sequence identity are classified in the same species. 28 Even so, taxonomic classification of environmental bacteria is a bottleneck. The overwhelming majority of free-living microorganisms are uncultivable, which makes the hybridization analysis of complete DNA molecules unfeasible. To circumvent this bias, environmental microbiologists began to use the Operational Taxonomic Unit (OTU) definition to replace the biological species concept in microbial ecology analyses. 29
One of the most recent concepts of bacterial species was proposed by Bobay and Ochman. 30 They proposed that bacterial strains should be classified in the same species if they present an intra-group rate of gene flow higher than the rate between that group and any other strains. HGT is the main mechanism responsible for the spread of these genes within the bacterial population. HGT also allows exchange of genetic material between taxonomically distinct bacterial populations but on a smaller scale. 30 A hypothesis presented by Baumdicker et al 31 supports the concept proposed by Bobay and Ochman. 30 Baumdicker et al 31 argue that in the microbial world, individual bacterial cells maintain compact genomes whereas a higher number of genes exists at the population level. This idea was called the distributed genome hypothesis. The distributed genome of a group of bacteria can be accessed by calculating its pan-genome. The most recent methods used to characterize a pan-genome will be described in a later section.
Moldovan and Gelfand
32
proposed a new method for defining bacterial species using pan-genome data. The pan-genome can be represented by a gene frequency spectrum
(a) Must be monophyletic in a sequence-based tree;
(b) Should be composed of a homogeneous set of genomes;
(c) Should be the maximal set of strains satisfying conditions a and b.

Ecological perspective and the pan-selectome hypothesis
Theoretically, a bacterial species whose population is highly clonal (closed pan-genome) is more successful in colonizing stable environments such as the human or animal tissues. In contrast, free-living microorganisms have a greater gene variability to adapt to different environmental conditions. The coagulase-negative staphylococci
The maintenance of gene frequency in a pan-genome has been subject of several studies. Rodriguez-Valera et al 36 raised the hypothesis that the pan-genome of a bacterial population is maintained and equalized by phage predation. Phages are more abundant than bacteria in several environments. 37 Many works analyze the relationship between microbial communities and abiotic factors. However, bacteria also need to adapt to biotic factors such as phage predation. A bacterial population under constant phage predation is also under constant modulation of its gene content. This process is called pan-selectome, and the pan-genome is a snapshot of the gene frequency in a given population under constant phage predation. 36 Subsequently, Rodriguez-Valera et al 36 postulated that this pan-selectome is the evolutionary unit of selection in the microbial world (therefore, at the genomic level the unit of selection is the pan-genome).
Bioinformatic Tools
The first step in a pan-genomic analysis is the homogenization of the genome annotation followed by gene clustering based in gene orthology and, finally, the curve fitting (Figure 4). In addition, some software performs phylogeny analyses based on core genome and single-nucleotide polymorphism (SNP) calling. The homogenization of the annotation avoids the wrong classification of core genes into the shared subset and shared genes assigned to singletons. It should be performed using genome annotation software like RAST 38 and Prokka. 39 Alternatively, the researcher may use refseq-annotated genomes from NCBI 40 or gene prediction using software such as FGENES 41 and GeneMark, 42 as long as the methodology is the same for all genomes being analyzed.

Flow diagram representing the main steps in a pan-genomic analysis. Each process (represented by blocks) can be performed by different methods. Table 1 details the methods used by different software. COG indicates Clusters of Orthologous Groups of proteins; KEGG, Kyoto Encyclopedia of Genes and Genomes.
The clustering analysis is normally achieved by first performing an all-vs-all bidirectional BLAST analysis followed by the use of an orthology identification software, such as OrthoMCL 43 and Orthofinder. 44 Orthofinder is capable of eliminating gene length bias in orthogroup detection. Emms and Kelly 44 showed that Orthofinder performed 25% better than OrthoMCL. Another strategy is to use the results from BLAST to define the orthology directly from the size of the alignment and the identity of the sequence alignment, by setting a threshold for both. Also, a strategy described by Lerat performs the orthology identification by means of the score rate value (SRV), a normalization of the bit score from blast analyses. Afterward, as previously described, genes that are present in all strains are assigned to core genome, whereas genes that are shared by more than 2 but not all strains are assigned to shared genome and unique genes, that are only present in one strain, are assigned to singletons. 45
The complete table with all the orthologous genes may then be used for pan-genome, core genome, and singletons development analyses, which will fit the specific curve generated from permutations of all genomes in all positions. Normally, the software performs curve fitting of the pan-genome using Heaps law or Power Law, whereas the curve fitting of shared genome and singletons are performed by means of exponential regression decay. The formula for the Heaps Law is
where
where
Interestingly, the α value is representative of the openness or closeness of the pan-genome, where an α lower than 1 is representative of an open pan-genome and an α higher than 1 is representative of a closed pan-genome. Also,
BPGA 48 software uses USEARCH, 49 CD-HIT, 50 and OrthoMCL 43 software for the orthology analyses and power-law regression and exponential curve fit for the pan-genome and core genome developments (Table 1), respectively. It also implements other relevant analyses such as core/pan/MLST (Multi Locus Sequence Typing) phylogeny, subset analysis, and KEGG 51 & COG 52 mapping. EDGAR, 53 on the other hand, uses SRV analyses for the orthology identification followed by Heaps law and exponential curve fitting of pan-genome and core genome development analyses, respectively. The website also plots venn diagrams, allows the analyses of subgroups of genomes, and exports multiple sequence alignments for phylogeny analyses.
Main software used for pan-genomic analysis and their respective algorithms.
Not mentioned in manuscript.
Pan-genome analysis of intergenic regions.
GET_HOMOLOGUES 26 uses bidirectional best-hit, COGtriangles 26 or OrthoMCL 43 for orthology analyses and performs pan-genome and core development analyses using the script plot_pancore_matrix.pl. The software also generates high-quality graphics, computes pan-genome trees, and performs syntenic cluster analyses. Pandelos 54 uses a dictionary-based method for orthology analyses and introduces a measure based on K-mer multiplicity and computation of Jaccard similarity. Panseq 55 uses MUMmer 56 and BLASTn 57 for the orthology analyses and it exports the accessory genome and binary presence/absence data, and core genome and SNPs for phylogeny analyses. PanX 58 uses Diamond 59 and MCL 60 for the Orthology analysis and it displays an alignment, a phylogenetic tree for each gene cluster, besides mapping mutation and inferring gain and loss of genes on the core genome phylogeny. It also has introduced a divide-and-conquer strategy for large datasets, by clustering small batches of genomes and combining different batches.
PGAP 61 uses Inparanoid, MultiParanoid, and Gene Family for the orthology assignment and Heap law and power law for the pan-genome and core genome developments. PGAP 61 is based in 5 modules: cluster, pan-genome, genetic variation, species evolution, and function enrichment analyses. PANWeb 62 and PGAWeb 63 are graphical interfaces for the use of PGAP.
PGAT 64 uses BLASTp for Orthology analysis with a sequence alignment of >80% and sequence identity >91% to 92%. PGAT 47 also allows comparison of sequence polymorphisms, multigenome display of regions surrounding a query gene, comparisons of metabolic pathways, and manual community annotation. Roary 65 uses CD-HIT, 49 BLAST, 57 and MCL 60 for the orthology analyses. Roary 65 runs in thousands of genomes in pan-genome analyses in standard desktops. Also, it uses the context provided by conserved gene neighborhood information for orthology analyses. Piggy 66 emulates Roary, 65 but for intergenic regions (Table 1).
Pan-Genome Assessment of 9 Bacterial Species
Pan-genomic analyses become more accurate the greater the number of genomes used. Thus, the 9 bacterial species with the higher number of complete sequenced genomes were selected in GenBank and their pan-genomes were calculated using PGAP v.1.2.0.
61
Fifty genomes of each species were used, totaling 450 genomes analyzed (Table 2). The gene annotation was normalized submitting all fasta files of each genome to the RAST
38
server. The PGAP
61
parameters were 50% identity, 60% coverage, and an
Genome, pan-genome, and core genome size of each species.
Abbreviation: α, alpha value of Heap law.
The average percentage of uncharacterized proteins is also presented. To remove the uncharacterized proteins, we opened the genbank files in the genome browser Artemis 77 and using the feature selector tool we excluded all coding sequences that contained the names “uncharacterized protein” or “hypothetical protein” in their qualifier/product. The alpha value of each pan-genome analyzed with and without uncharacterized proteins is also shown.
Today, a basic step in genomics is the prediction and annotation of coding sequences (CDS) using bioinformatics. The need for homogenization of genome annotation was discussed in a previous section. Several software that use different methods for prediction of CDS were developed such as Glimmer3,
75
Prodigal,
76
or GeneMarkS-2.
42
Prediction of false-positive genes is common.
42
Theoretically, in a pan-genome analysis, even a small rate of prediction error can change the results significantly due to the high number of genomes analyzed. False positives are commonly predicted as CDSs encoding uncharacterized proteins. We determined the average percentage of uncharacterized proteins in the genomes of the 9 species (Table 2).
Applications and Future Perspectives
A pan-genomic analysis presents all the gene variability of a group of organisms. The set of genes shared among all organisms as well as species- or strain-specific genes are also extremely useful information. All of these data allow an improvement of time and technology in different areas of biology and bioinformatics.
Comparison of several bacterial genomes is a valuable aid for reverse vaccinology analysis. Reverse vaccinology was a pioneering method first applied to the serogroup B meningococci. 78 The method has greater advantages over classic approaches of vaccine development because it is less laborious, less costly, and more accurate in choosing a gene target. Several reverse vaccinology studies have now used pan-genome to determine the main targets for vaccine development.79,80
The host-pathogen interaction can be evaluated at the genomic level through genes that are responsible for processes such as adhesion, invasion, and toxin production. Therefore, a pan-genome analysis helps to define which virulence genes are shared among all pathogenic species, as well as which genes are specific to one isolate. This has direct implications for understanding the evolution of pathogenic species.
In addition to the examples cited above, pan-genomic analysis has been increasingly used to assist in the taxonomic classification of microorganisms, 81 to determine a set of molecular markers for phylogenomic analysis, 82 among other applications. Therefore, determining the pan-genome of a group of organisms is sometimes the initial step of a research. Several downstream analyses depend on a good prediction of the pan-genome. Thus, knowledge about the basic concepts and the correct choice of software, algorithms, and parameters are extremely important to the success of the research.
Footnotes
Funding:
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: R.A.B. would like to thank Fundação Amazônia Paraense de Amparo a Estudos e Pesquisas (FAPESPA) for supporting the research (grant number 2155/2017), and Pró-Reitoria de Pesquisa e Pós-Graduação of Universidade Federal do Pará (Programa de Apoio à Publicação Qualificada – PAPQ 2020).
Declaration of conflicting interests:
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Author Contributions
All authors contributed equally to the conception and writing of this manuscript. SSC performed the pan-genomic analysis.
