Abstract
With the advent of next-generation whole-genome sequencing (WGS), the need for good-quality, well-characterised Salmonella genomes has increased in recent years. Good-quality complete genomes are often required for assembly reference mapping or phylogenetic single nucleotide polymorphism (SNP) analysis. Complete genomes or contigs from specific sources or serovars are also sought for clustering analysis or source attribution studies. New bioinformatics tools are therefore needed to extract good-quality, well-characterised genomes from public databases. Here, we developed SalmoDEST, an open-source Python tool that extracts Salmonella genomes with a coverage higher than 50x and a genome length over 4 Mb from the GenBank database, as complete genomes or contigs, verifies the serovar to which they belong and identifies the corresponding multi-locus sequence type (MLST) profile. To validate the ability of SalmoDEST to screen for and retrieve good-quality genomes, we compared our results for S. Typhi complete genomes with those available in the literature and extracted Salmonella genomes of strains isolated from bovine sources worldwide. Finally, we provide a list of 239 high-quality complete genomes covering 123 Salmonella serovars. SalmoDEST is a handy and easy-to-use open-source tool to extract complete genomes or contigs that can be routinely used in public health, food safety and research laboratories. SalmoDEST (SALMOnella Download gEnome Serotype sT) is available at https://github.com/I-Guy/SalmoDEST.
Introduction
The investigation of genetic markers or genome relationships between different pathogens and microorganisms requires good-quality genomes. A large panel of good-quality genomes makes it possible to study chromosome rearrangements in more detail, identify sequences of interest and improve the identification of genetic clustering. Among the most frequently consulted sequence databases for collecting genomes is the open-access GenBank database, housed by the National Center for Biotechnology Information (NCBI). GenBank is an annotated collection of all publicly available nucleotide sequences generated by laboratories throughout the world from more than 100,000 distinct organisms. Release 242.0, produced in February 2021, contained over 12 trillion nucleotide bases in more than 2 billion sequences. 1 To facilitate the retrieval of genomes of interest from the GenBank database, we designed a workflow (called SalmoDEST) to search and download genomes with a coverage greater than 50x. The tool can download either complete genomes or contigs. Optionally, it can also download protein fasta files and store all the selected fasta files in a user-defined output directory. The SalmoDEST tool was developed for Salmonella, a well-known and widely distributed foodborne pathogen. Salmonella enterica is regulated in the European Union (EU) and monitored in the United States (US) and many other countries. In the US, the economic burden due to salmonellosis is estimated to be US$3.66 billion per year.
In 2016, the incidences of culture-confirmed cases of salmonellosis were 14.51 and 20.4 cases per 100,000 population in the US and the EU, respectively.2,3 The economic, social and public health importance of diseases caused by Salmonella has led many developing and developed countries to incorporate whole-genome sequencing (WGS) of isolated strains into their monitoring systems, together with clustering by core-genome single nucleotide polymorphism (SNP) analysis for outbreak and source attribution investigations. Countries that can carry out WGS need access to Salmonella genomes from different regions of the world for which the serovar has been verified and the multi-locus sequence type (MLST) profile identified. For countries in which WGS is still not readily available, studies based on good-quality and well-identified open-access Salmonella genomes can prove to be an essential asset.
Materials and Methods
Workflow description
SalmoDEST is implemented as an open-source Python tool (https://github.com/I-Guy/SalmoDEST). It is based on a succession of two Python scripts and a Bash process (Figure 1).

Figure 1. SALMOnella Download gEnome Serotype sT (SalmoDEST) pipeline.
SalmoDEST is a workflow designed to search and download Salmonella genomes from the NCBI GenBank database using either the ncbi-acc-download 4 tool for complete genomes or ncbi-genome-download 5 for contigs. Using these tools, the first Python script ‘Get_HQ_Genome_1.py’ in SalmoDEST automatically downloads the genome fasta files of the strains whose accession numbers are listed in the input text file. Then, a Bash process carries out serovar prediction, MLST profile prediction and the calculation of assembly statistics (length, N50 and number of contigs) for the downloaded genomes using SeqSero2, 6 the MLSTseeman tool 7 and Quast, 8 respectively. The second Python script ‘Get_HQ_Genome2.py’ renames the downloaded fasta files, adding the accession number, the serovar and the MLST profile predictions as follows: antigenic formula or serovar name_ST_ID_Accession number (eg, Montevideo_81_42N_CP037893.1). The Python script ‘Get_HQ_Genome2.py’ also downloads the gff and gbk files and checks the quality of each genome. It retains only those with a coverage greater than 50x and a genome length longer than 4 Mb, and removes the others. Finally, this Python script compresses (zips) all files.
Optionally, the user can download protein fasta files and choose an output directory in which all the selected fasta files are stored.
Get_HQ_Genome_1.py script
The input file of SalmoDEST and the ‘Get_HQ_Genome_1.py’ script is a text file, obtained from an NCBI Nucleotide database query (https://www.ncbi.nlm.nih.gov/nuccore) or compiled by the user, listing the accession numbers of the complete genomes or contigs to download.
If an NCBI Nucleotide database query is used, the ‘Complete Record’ must be exported into a destination ‘File’ in the ‘Accession List’ format sorted by ‘Default order’.
In the ‘Get_HQ_Genome_1.py’ script, the function named ‘getFastafromNuccore’ downloads fasta files and records the accession numbers of the downloaded fasta files in a tsv file. The function named ‘Renamer’ renames every fasta file as ‘ID_Accession.fasta’ and creates a folder with the same name to which it moves the fasta file. The function named ‘Filter1Genome’ runs only if the user chooses the ‘complete genome mode’. The function named ‘Filter1Contig’ runs only if the user chooses the ‘contigs mode’. These two functions copy the accession numbers of the fasta files to a tsv file named ‘Genome_HQ.tsv’. Then, they count the number of contigs in every fasta file and report it in a second tsv file named ‘Genome_HQ_Filter1.tsv’. If the ‘complete genome mode’ is selected, ‘Filter1Genome’ discards all fasta files with more than one contig.
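The contig-counting step described above can be illustrated with a minimal sketch. It is not the SalmoDEST source code: the helper names `count_contigs` and `passes_filter1` are hypothetical, and treating an empty file as a failure in both modes is an assumption. Only the mode letters (`g` for complete genomes, `c` for contigs) come from the options reported in this study.

```python
def count_contigs(fasta_path):
    """Count sequence records in a fasta file (one '>' header line per contig)."""
    with open(fasta_path) as handle:
        return sum(1 for line in handle if line.startswith(">"))


def passes_filter1(fasta_path, mode):
    """Hypothetical stand-in for the Filter1Genome / Filter1Contig decision.

    mode "g" (complete genome): keep only files with exactly one contig.
    mode "c" (contigs): keep any file with at least one contig.
    Rejecting empty files is an assumption, not documented behaviour.
    """
    n_contigs = count_contigs(fasta_path)
    if mode == "g":
        return n_contigs == 1
    return n_contigs >= 1
```

In this sketch, a two-contig file would be kept in contig mode and discarded in complete genome mode.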
Get_HQ_Genome2.py
The ‘Get_HQ_Genome2.py’ script runs after the Bash process has queried the SeqSero2, MLSTseeman and Quast tools. The function named ‘ReadSeqSero’ reads the results from the SeqSero2 tool and retrieves the accession numbers of the genomes and the serovar predictions, with the associated probabilities. Similarly, the function named ‘ReadMLST’ reads results from the MLSTseeman tool and stores accession numbers and MLST profiles. The function named ‘ReadQuast’ reads results from the Quast tool and retrieves the length, the N50 value and the number of contigs of each genome. The function named ‘MergeResult’ merges all the information from the previous functions (ie, serovar predictions, MLST profiles, number of contigs, genome length and N50) with the information from ‘Genome_HQ_Filter1.tsv’ (produced by the ‘Get_HQ_Genome_1.py’ script) in a third tsv file named ‘TableMerge.tsv’. The function named ‘GetGBK’ downloads the gbk (GenBank) files associated with the fasta files. The function named ‘Renamer2’ moves the gbk files to the folder containing the fasta files and renames them according to the fasta file names. The function named ‘Filter2’ generates a fourth tsv file called ‘TableMergeFilter2.tsv’ with the keys (ie, accession numbers) of all genomes that have a coverage higher than 50x (> 50x), based on the gbk files, and a length longer than 4 Mb (> 4 Mb). It also adds information on the sequencing technology used to this tsv file. The function named ‘GetGFF’ downloads gff files.
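As an illustration of the second filter, the sketch below reads the coverage and sequencing technology from the text of a gbk file and applies the two thresholds given above. It is not the ‘Filter2’ implementation: the function names are hypothetical, and it assumes that the gbk file carries the usual ‘Genome Coverage ::’ and ‘Sequencing Technology ::’ lines of a GenBank assembly comment. How SalmoDEST handles a record with no coverage line is not described here, so the `keep_missing` switch is also an assumption.

```python
import re

MIN_COVERAGE = 50.0      # coverage must be > 50x (threshold from the text)
MIN_LENGTH = 4_000_000   # genome length must be > 4 Mb (threshold from the text)

# Assumed layout of the assembly comment lines in a gbk file.
COVERAGE_RE = re.compile(r"Genome Coverage\s*::\s*(\d+(?:\.\d+)?)\s*x", re.IGNORECASE)
TECHNOLOGY_RE = re.compile(r"Sequencing Technology\s*::\s*(.+)")


def parse_gbk_comment(gbk_text):
    """Return (coverage, technology); either value is None when the line is absent."""
    coverage = COVERAGE_RE.search(gbk_text)
    technology = TECHNOLOGY_RE.search(gbk_text)
    return (
        float(coverage.group(1)) if coverage else None,
        technology.group(1).strip() if technology else None,
    )


def passes_filter2(coverage, length, keep_missing=True):
    """Apply the > 50x and > 4 Mb thresholds.

    keep_missing is a hypothetical switch: when no coverage is recorded,
    the genome is judged on length alone (True) or rejected (False).
    """
    if coverage is None:
        return keep_missing and length > MIN_LENGTH
    return coverage > MIN_COVERAGE and length > MIN_LENGTH
```

Both comparisons are strict, so in this sketch a genome with a coverage of exactly 50x or a length of exactly 4 Mb would be rejected.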
The function named ‘RenamerGFF_FASTAprot’ renames gff files and protein fasta files and moves them to the folder containing the fasta files. The function named ‘FinalRenamer’ renames every file and directory as described above (ie, antigenic formula/serovar name_ST_ID_Accession). The ‘Renamer’ functions can be easily modified at the user’s convenience. The function named ‘zipfiles’ compresses (zips) all the folders containing the downloaded files.
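The final naming convention can be sketched as a simple join of four fields. The function name `final_name` is hypothetical, and reading ‘42N’ in the example Montevideo_81_42N_CP037893.1 as the ID field is an interpretation of that example. No character sanitising is attempted, because the way SalmoDEST handles antigenic formulas in file names is not described here.

```python
def final_name(serovar, sequence_type, strain_id, accession):
    """Build 'serovar_ST_ID_Accession' (hypothetical helper, not the FinalRenamer code)."""
    return "_".join([serovar, str(sequence_type), strain_id, accession])


# Reproduces the example given in the workflow description.
print(final_name("Montevideo", 81, "42N", "CP037893.1"))
```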
Workflow application
In this study, we report two application examples for SalmoDEST. In the first example, we evaluate the ability of the SalmoDEST tool to download complete Salmonella genomes from the NCBI GenBank database and, in the second, its ability to download Salmonella genome contigs for strains isolated from bovine sources.
Selection of complete genomes from a public database
Complete reference genomes are often required for assembly reference mapping or phylogenetic SNP analysis, both for the mapping step and for the calculation of pairwise distances between genomes. Nevertheless, it may be difficult for a single laboratory to hold a complete set of reference genomes, particularly considering that the genus Salmonella is separated into six subspecies and over 2000 serovars. 9 The SalmoDEST tool was tested to search, download and select all complete Salmonella reference genomes available in the GenBank database. SalmoDEST applies a coverage filter set to a minimum of 50x. A second, manual filter is based on serovar identification: the serovars listed in the metadata were compared with the serovars predicted by SeqSero2 in the TableMergeFilter2 tsv file. In this study, SalmoDEST was tested using the list of accession numbers obtained using the NCBI ‘All Databases’ query: ‘Salmonella[title] AND Genome[title] AND Salmonella enterica[title] AND Genome Assembly and Annotation report[title]’ (https://www.ncbi.nlm.nih.gov/genome/browse/#!/prokaryotes/152/) with the filter ‘Complete’ (on 24 June 2021). A list of 1648 accession numbers was retrieved and, after eliminating duplicates, 1048 unique accession numbers were found (Supplementary Table S1). The SalmoDEST option for complete genome mode ‘-m g’ was used. Finally, after serovar prediction and genome length verifications, 1040 genomes were retained and downloaded. Four tsv output files were produced, including the final TableMergeFilter2 tsv file (Supplementary Table S2).
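The deduplication of the accession list mentioned above can be sketched as follows. The study does not state how duplicates were removed, so this is only one possible approach, and the function name `unique_accessions` is hypothetical. It keeps the first occurrence of each accession number and preserves the order of the exported list.

```python
def unique_accessions(lines):
    """Return accession numbers in their original order, without blanks or duplicates."""
    seen = set()
    unique = []
    for line in lines:
        accession = line.strip()
        if accession and accession not in seen:
            seen.add(accession)
            unique.append(accession)
    return unique
```

Applied to the lines of an ‘Accession List’ export, the length of the returned list gives the number of unique accession numbers to pass to the tool.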
Selection of contig genomes from a public database
Microbiologists need access to Salmonella serovar genomes from specific sources for many types of analyses, such as clustering analyses, source attribution studies or screening for molecular markers.10-13 Obtaining genomes from laboratories around the world is therefore a major advantage. Here, we tested the ability of the SalmoDEST tool to obtain Salmonella genomes from strains isolated from bovine sources worldwide. The SalmoDEST tool was tested using the list of assembly accession numbers obtained using the NCBI ‘All Databases’ query: ‘Salmonella[title] AND Genome[title] AND Salmonella enterica[title] AND Genome Assembly and Annotation report[title]’ (https://www.ncbi.nlm.nih.gov/genome/browse/#!/prokaryotes/152/) with the following filters: ‘Contig’ AND ‘Bovine’ AND ‘bovine’ (on 24 June 2021). In total, 89 unique accession numbers were found (Supplementary Table S3). The SalmoDEST option for contig genome mode ‘-m c’ was used and, after the filtering process, 88 genomes were downloaded. Four tsv output files were created, including the final TableMergeFilter2 tsv file (Supplementary Table S4).
Results and Discussion
The NCBI query carried out on 24 June 2021 resulted in 1648 accessions. After deduplication, 1048 unique accessions were included in the input txt file and downloaded by the SalmoDEST tool that we developed here. All these complete genomes were checked for 50x coverage, genome length and predicted serovar matching. Finally, 1040 good-quality complete genomes were downloaded and their MLST profiles were determined. From the initial list of 1048 complete genomes in the input txt file, SalmoDEST excluded one genome (CP060132.1) for incorrect serovar prediction and seven others (OU015718.1, OU015719.1, OU015720.1, OU015717.1, LR792437.1, LR792391.1 and LN868943.1) due to low genome length (< 4 Mb, ranging from 277 503 to 3 746 274 bases). We obtained 16 genomes of S. enterica subsp. salamae, 10 S. enterica subsp. arizonae, 13 S. enterica subsp. diarizonae, 10 S. enterica subsp. houtenae and 991 S. enterica subsp. enterica, representing 135 serovars with different antigenic formulas. No S. enterica subsp. indica genomes with a coverage higher than 50x were found. Four serovars were overrepresented (ie, more than 50 complete genomes) in the GenBank database and in our results: S. Typhi (ie, responsible for human typhoid fever, with 124 genomes/1040), S. Enteritidis, S. Typhimurium and S. 4,[5],12:i:-, with 114/1040, 141/1040 and 56/1040 genomes, respectively. These latter three serovars are the most frequently isolated non-typhoid Salmonella serovars worldwide. These serovars were followed by S. Heidelberg (40/1040), S. Newport (38/1040), S. Anatum (32/1040), S. Bareilly (30/1040), S. Indiana (22/1040), S. Montevideo (21/1040) and S. Senftenberg (20/1040) (Figure 2). Our results are consistent with CDC and EFSA reports.14-18 Since 2016, these 11 serovars have belonged to the top 30 most frequently isolated serovars in the EU and the US.14-18

Figure 2. Histogram of serovar diversity among the 1040 complete Salmonella genomes downloaded from the NCBI GenBank database using the SalmoDEST tool developed in this study. Only serovars with more than five complete genomes and a complete antigenic formula are shown, with the exception of S. 4,[5],12:i:- and S. 1,3,19:g,s,t:-.
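The serovar tally behind these counts can be reproduced from a column of predicted serovars with a short sketch such as the one below. The helper names are hypothetical and the code is not part of SalmoDEST. The counts used in the example are those reported above for six of the serovars, and the threshold follows the definition of overrepresentation used in this study (more than 50 complete genomes).

```python
from collections import Counter

OVERREPRESENTED_MIN = 50  # "more than 50 complete genomes", as defined in the text


def tally_serovars(predicted_serovars):
    """Count how many genomes were assigned to each predicted serovar."""
    return Counter(predicted_serovars)


def overrepresented(counts, minimum=OVERREPRESENTED_MIN):
    """List serovars with more than `minimum` genomes, most frequent first."""
    return [name for name, n in counts.most_common() if n > minimum]


# Counts reported in this study for six serovars (out of 1040 genomes).
reported = Counter({
    "Typhimurium": 141,
    "Typhi": 124,
    "Enteritidis": 114,
    "4,[5],12:i:-": 56,
    "Heidelberg": 40,
    "Newport": 38,
})
print(overrepresented(reported))
# ['Typhimurium', 'Typhi', 'Enteritidis', '4,[5],12:i:-']
```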
To validate the ability of SalmoDEST to screen for and retrieve complete genomes of good quality, we compared our results for S. Typhi with those available in the literature. As expected, in accordance with the study published by Yap and Thong in 2017, 19 SalmoDEST was able to recover 124 S. Typhi genomes. The SalmoDEST tool developed in this study succeeded in screening for and downloading good-quality reference genomes for S. Typhi, confirming its ability to make good-quality genomes available quickly.
Finally, due to the need for complete genomes for sequence assembly and for SNP phylogenetic analyses (ie, for mapping analyses and to calculate the pairwise distance between genomes), we constituted a panel of complete reference genomes for Salmonella from the SalmoDEST output obtained in this study. We selected 239 complete genomes from the initial 1040 genomes, with 10 S. enterica subsp. salamae, 8 S. enterica subsp. arizonae, 7 S. enterica subsp. diarizonae, 8 S. enterica subsp. houtenae and 206 S. enterica subsp. enterica, representing 123 serovars and 185 MLST profiles (Table 1 and Supplementary Table S5). When possible, the sequencing technology used for complete genome assembly (ie, both short and long reads) and the coverage were taken into account for the selection of the final panel. This panel of complete genomes can be used by microbiologists in food poisoning and typhoid investigations involving Salmonella spp.
Table 1. List of good-quality complete Salmonella genomes (ID, serovar and MLST profile predictions) downloaded from the NCBI GenBank database on 28 June 2021.
Salmonella contig genomes from bovine sources
Among the recognised pathogens causing human disease, almost 60% are of animal origin 20 and cattle bred for meat and for milk are common reservoirs of Salmonella spp. 21 Almost 40% of a herd can be infected, and the risk of infection increases with the size of the herd.22,23 Salmonellosis in cattle exposes producers to direct economic losses associated with mortality or body weight loss, and also to indirect losses caused by reduced feed conversion or veterinary care costs. 23 Genomes from strains isolated from cattle can be used in source attribution studies, as well as in searches for specific host marker sequences. Our test successfully downloaded Salmonella genomes of strains isolated from bovine animals. From the initial input list of 89 genomes, the SalmoDEST tool was able to download 88 contig genomes of Salmonella isolated from bovine sources with a coverage of > 50x, a length of > 4 Mb and a correct serovar prediction. One genome (GCA_004744895.1) was excluded due to a genome length of < 4 Mb (Supplementary Tables S3 and S4). Fifty-two entries in the TableMergeFilter2.tsv file showed missing information on coverage and sequencing technology in the gbk files of the corresponding genomes. Interestingly, among the 88 contig genomes downloaded, the most represented serovars were S. Typhimurium (28 contig genomes/88), S. Newport (14/88) and S. Dublin (11/88). These three serovars are well known for infecting bovine animals in the EU and the US.18,20,22
50x coverage
The value of 50x was chosen for Salmonella in the SalmoDEST tool following the recommendations of the European Centre for Disease Prevention and Control (ECDC). 24 The amount of data generated per Salmonella isolate by a DNA sequencer is substantial (ie, megabytes) and a trade-off must be struck between genome coverage (ie, quality) and the size of the files generated. For example, although a coverage of 30x is typically sufficient for routine surveillance of foodborne pathogens, the appropriate coverage threshold is platform-dependent and may also vary by organism. 25 ECDC has fixed a coverage of 50x for Salmonella, considering this value reasonable with respect to the corresponding file size. 24 Coverage is frequently considered the main quality metric in WGS. Furthermore, the quality of genome sequences also has an impact on successful in silico serovar prediction. Missing or incomplete MLST and cgMLST loci sequences largely contribute to errors in identification.6,26 Similarly, partial or missing antigenic data in the rfb region (ie, the O-antigen flippase and polymerase genes) and the fliC and fljB genes influence in silico serovar prediction. 6 Good coverage helps to prevent poor MLST, cgMLST and antigenic data and contributes to the correct listing of the serovar.6,26,27
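To make the trade-off between coverage and data volume concrete, the sketch below uses the standard estimate of average coverage (number of reads multiplied by read length, divided by genome length). The function names are hypothetical, and the 4.8 Mb genome length and 150-base read length in the example are illustrative assumptions rather than values taken from this study.

```python
import math


def estimated_coverage(n_reads, read_length, genome_length):
    """Average depth of coverage: total sequenced bases divided by genome length."""
    return n_reads * read_length / genome_length


def reads_needed(target_coverage, read_length, genome_length):
    """Smallest number of reads that reaches the target average coverage."""
    return math.ceil(target_coverage * genome_length / read_length)


# Illustrative values only: a 4.8 Mb genome sequenced with 150-base reads.
print(reads_needed(50, 150, 4_800_000))              # 1600000
print(estimated_coverage(1_600_000, 150, 4_800_000)) # 50.0
```

Under these assumptions, about 1.6 million reads give an average coverage of exactly 50x, which would still fall just short of a strict ‘greater than 50x’ threshold.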
Errors in serotyping
Salmonella genomes from GenBank have already revealed errors in the serovar listed in their metadata. In 2016, Yoshida et al carried out in silico serovar prediction on over 4,291 genomes extracted from GenBank, and revealed that 3.5% gave incorrect serovar predictions and that 1.8% had missing or ambiguous metadata, making it impossible to ascertain the listed phenotypic serovar. 26 For this reason, we integrated the Bash process in the SalmoDEST tool to query the SeqSero2 6 and MLSTseeman 7 tools. SeqSero is a Web-based tool developed by the Centers for Disease Control and Prevention (CDC) in Atlanta, GA (US) for determining Salmonella serotypes using the rfb region and the fliC and fljB alleles.6,28 SeqSero2 was chosen because it is the only tool that relies on characterising genetic determinants of Salmonella serovars without consulting any other markers, such as MLST types; it saves time because it can predict serovars directly from raw sequencing reads and not only from assemblies; and finally, it is able to detect inter-serovar contamination. 6 MLSTseeman is a tool developed by Torsten Seemann 7 that scans contig files against traditional PubMLST typing schemes, conceived as part of the development of the first MLST scheme in 1998, 29 making it possible to include all levels of sequence data, from single gene sequences up to and including complete, finished genomes. 30
Information on serovar and MLST type was integrated in SalmoDEST to enable genome verification and because both are integral to surveillance and outbreak investigations.
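The comparison between the serovar listed in the metadata and the serovar predicted in silico can be sketched as a simple string check. This is an illustration only: the function name, the normalisation rule (case and whitespace) and the accession placeholders are assumptions, and the comparison in this study was carried out manually on the TableMergeFilter2 tsv file.

```python
def serovar_mismatches(records):
    """Return accessions whose listed and predicted serovars differ.

    records: iterable of (accession, listed_serovar, predicted_serovar).
    Names are compared after lower-casing and collapsing whitespace,
    which is an assumed normalisation, not a documented SalmoDEST rule.
    """
    def normalise(name):
        return " ".join(name.lower().split())

    return [
        accession
        for accession, listed, predicted in records
        if normalise(listed) != normalise(predicted)
    ]


# Placeholder accessions and serovars, for illustration only.
example = [
    ("ACC_0001", "Typhimurium", "typhimurium"),
    ("ACC_0002", "Newport", "Dublin"),
]
print(serovar_mismatches(example))
```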
Conclusion
SalmoDEST is a handy and easy-to-use tool that can be routinely used in public health, food safety and research laboratories to extract complete Salmonella reference genomes of high quality from GenBank. It can also be used to download contig genomes from a list of assembly IDs. A coverage of 50x, as well as correct Salmonella genome size and serovar and MLST type prediction, are used as quality controls for both genome modes (ie, complete and contig genomes search and download). Moreover, SalmoDEST screens downloaded genomes for contamination by using the SeqSero2 tool for serovar prediction.
Supplemental Material
Supplemental material for Retrieving Good-Quality Salmonella Genomes From the GenBank Database Using a Python Tool, SalmoDEST by Emeline Cherchame, Guy Ilango and Sabrina Cadel-Six in Bioinformatics and Biology Insights:
sj-jpg-1-bbi-10.1177_11779322221080264
sj-txt-1-bbi-10.1177_11779322221080264
sj-txt-2-bbi-10.1177_11779322221080264
sj-xls-1-bbi-10.1177_11779322221080264
sj-xlsx-1-bbi-10.1177_11779322221080264
sj-xlsx-2-bbi-10.1177_11779322221080264
Footnotes
Acknowledgements
We thank Laurent Vigneron (ANSES) for providing high-performance computing resources.
Author Contributions
SC-S, EC and GI conceived the study. EC and GI contributed equally to the design and analysis of data. GI conceptualised the algorithms. EC implemented scripts and executed commands. SC-S drafted the manuscript. EC reviewed the draft. All authors commented and approved the final manuscript, take public responsibility for appropriate portions of the content and agree to be accountable for all aspects of the work in terms of accuracy or integrity.
Declaration of Conflicting Interests:
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Funding:
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study was supported by funding by the French Ministry of Agriculture, Food and Forestry, by the Salmonella Network, part of the ANSES-Laboratory for Food Safety (France).
Supplemental material
Supplemental material for this article is available online.
References
