Investigating the Genomic Background of CRISPR-Cas Genomes for CRISPR-Based Antimicrobials

Abstract

CRISPR-Cas systems provide adaptive immunity that protects prokaryotes against foreign genetic elements. Genetic templates acquired during past infection events enable DNA-interacting enzymes to recognize foreign DNA for destruction. Due to the programmability and specificity of these genetic templates, CRISPR-Cas systems are potential alternative antibiotics that can be engineered to self-target antimicrobial resistance genes on the chromosome or plasmid. However, several fundamental questions remain before these tools can be repurposed against drug-resistant bacteria. For endogenous CRISPR-Cas self-targeting, antimicrobial resistance genes and functional CRISPR-Cas systems have to co-occur in the target cell. Furthermore, these tools have to outcompete the DNA repair pathways that respond to the nuclease activities of Cas proteins, even for exogenous CRISPR-Cas delivery. Here, we conduct a comprehensive survey of CRISPR-Cas genomes. First, we address the co-occurrence of CRISPR-Cas systems and antimicrobial resistance genes in the CRISPR-Cas genomes. We show that the average number of these genes varies greatly by CRISPR-Cas type, and some CRISPR-Cas types (IE and IIIA) have over 20 genes per genome. Next, we investigate the DNA repair pathways of these CRISPR-Cas genomes, revealing that the diversity and frequency of these pathways differ by CRISPR-Cas type. The interplay between CRISPR-Cas systems and DNA repair pathways is essential for the acquisition of new spacers in CRISPR arrays. We conduct simulation studies to demonstrate that the efficiency of these DNA repair pathways may be inferred from the time-series patterns in the RNA structure of CRISPR repeats. This bioinformatic survey of CRISPR-Cas genomes highlights the need to consider multifaceted interactions between different genes and systems in order to design effective CRISPR-based antimicrobials that can specifically target drug-resistant bacteria in natural microbial communities.

Keywords

CRISPR-Cas systems; antimicrobial resistance genes; alternative antibiotics; CRISPR-based antimicrobials; prokaryotic DNA repair systems; CRISPR repeats; dimensionality reduction; RNA structure; simulation studies; comparative genomics

Introduction

Clustered regularly interspaced short palindromic repeats (CRISPR), found in many prokaryotic genomes, store sequence information about foreign DNA that has invaded these microorganisms.^1-3 With this information, the CRISPR-associated system (Cas genes) provides an adaptive immunity that protects the cell against invasive mobile genetic elements such as bacteriophages. The ability of CRISPR-Cas systems to cut and edit DNA has opened a new era of genome-editing technologies in various fields such as medicine and agriculture.⁴ Such applications have driven the scientific community to discover diverse CRISPR-Cas systems in nature, to uncover those that may be better tools for editing eukaryotic genomes.^5-7 CRISPR-Cas systems are currently divided into Class 1 (Type I, III, IV) and Class 2 (Type II, V, VI), with each type further classified into several subtypes.⁵

CRISPR-Cas systems have recently been investigated for their potential to selectively target bacteria with antimicrobial resistance (AMR) genes.^8-10 Antimicrobial resistance is now considered a “hidden pandemic” that threatens to undermine the effectiveness of modern medicine, from minor surgical procedures to cancer treatments, through hospital-acquired infections.¹¹ In 2019, infections from multidrug-resistant bacteria were estimated to have caused more than 1.2 million deaths worldwide.¹² Given the severity of the uncontrolled spread of these superbugs, the World Health Organization (WHO) recently published a list of priority pathogens for which new antibiotics are urgently needed, including carbapenem-resistant Acinetobacter baumannii and Pseudomonas aeruginosa. CRISPR-based antimicrobials are potential alternatives to traditional small-molecule antibiotics, as the CRISPR component is programmable to target specific genes with a complex of Cas proteins. Several studies have independently engineered CRISPR-Cas systems to selectively remove AMR genes from bacterial populations.^13-15

Despite the potential of CRISPR-based antimicrobials, several challenges remain before these tools can be successfully repurposed to remove AMR-carrying bacteria or plasmids from natural microbial communities.^8-10 In addition to the practical issues such as the delivery to target bacteria, there are several fundamental questions related to the effectiveness of CRISPR-based antimicrobials. For endogenous CRISPR-Cas self-targeting, both AMR genes and functional CRISPR-Cas systems have to be present in the chromosome or plasmid of target bacteria. For such bacteria, CRISPR-based antimicrobials can simply be composed of a self-targeting CRISPR array that is compatible with the endogenous Cas system.^13,16 Without functional endogenous CRISPR-Cas systems, a complete set of CRISPR-Cas systems that targets a specific AMR gene has to be delivered exogenously.⁸ Thus, it is necessary to understand the genomic background of target bacteria for effective design and delivery of CRISPR-based antimicrobials. In this study, we use the public CRISPR-Cas database to survey the genomic background of CRISPR-Cas genomes, which we define as prokaryotic genomes that have 1 functional CRISPR-Cas system (Figure 1). These CRISPR-Cas genomes are searched for AMR genes, to investigate the co-occurrence of functional CRISPR-Cas systems and AMR genes in diverse bacteria, particularly in pathogenic bacteria.

Figure 1.

The genomic background analysis of CRISPR-Cas genomes.

Another pertinent question is the impact of DNA repair pathways on the effectiveness of CRISPR-based antimicrobials.^8-10 Bacteria have evolved complex DNA repair pathways that can repair DNA damage in response to various external and internal triggers (eg, UV irradiation, antibiotics, stalled replication, recombination) that can be lethal if not repaired before cell division.^17,18 Despite the high efficiency of self-targeting spacers, a small percentage of the bacterial population targeted by these CRISPR-Cas systems persisted in a number of previous studies.^13,19 Here, we scan the CRISPR-Cas genomes for DNA repair pathways to investigate the potential interference against the activities of CRISPR-based antimicrobials.

We further explore, through simulation studies, the interplay of CRISPR-Cas systems and DNA repair pathways, whose co-evolution was predicted under the Lamarckian evolution of directed mutagenesis.^20,21 Notably, the acquisition of new spacers in CRISPR arrays requires DNA repair, during which several proteins engage in DNA unwinding, editing, and repairing activities along with the Cas proteins. Recent evidence shows that most CRISPR-Cas systems acquire new spacers through site-specific integration, with the leader end spacers being the most recent and most active.^22-24 This strategy prioritizes the defense against the most recent invader at the leader end through differential expression of crRNAs across the CRISPR array. However, in the absence of efficient DNA repair pathways, this acquisition step is susceptible to mutation accumulation in the CRISPR repeats. Thus, we investigate the time-series patterns in CRISPR repeats to examine the potential interference of the DNA damage response in utilizing CRISPR-based antimicrobials against prokaryotes. We first examine how the RNA structures of CRISPR repeats change over time by visualizing and analyzing the time-series patterns of CRISPR arrays associated with different Cas system types. We show that Class 1 CRISPR repeats are more structured than Class 2 CRISPR repeats, and that this structural component is maintained throughout the site-specific integration of new spacers over time, indicating the active role of DNA repair pathways in these genomes. Furthermore, we show that DNA repair pathways in these CRISPR-Cas genomes are numerous and diverse. These results demonstrate that the DNA damage response should be considered as part of the genomic background of target bacteria in order to design effective CRISPR-based antimicrobials tailored against these disease-causing strains.

Results

CRISPR-Cas genomes have numerous antimicrobial resistance (AMR) genes

From the dataset of CRISPR-Cas genomes (Supplemental Tables S1-S3), we conducted an AMR gene analysis to investigate the potential of self-targeting AMR genes with endogenous CRISPR-Cas systems (Figure 2a). The different types of CRISPR-Cas genomes, except for Type IA and Type IV, had several AMR-related genes per genome, ranging from 0.3 genes per genome for Type VA to 23.5 genes per genome for Type IE (Figure 2b). AMR-related genes were absent in Types IA and IV because these types had only a few CRISPR-Cas genomes, which belonged to nonpathogenic prokaryotes such as Clostridium perfringens and Alteromonas mediterranea. In the reference gene catalog of the AMR database,²⁵ the AMR-related genes are further classified into antimicrobial resistance, stress response, and virulence genes. The classification results show that most genes confer antimicrobial resistance, while a few confer virulence to the pathogens and others respond to external stresses such as metals or biocides (Table 1). Notably, only certain types of the CRISPR-Cas genomes (Types IB, IE, IF, IIC, and IIIA) have virulence genes, with Type IE having the highest ratio of virulence to AMR genes. Many CRISPR-Cas genomes of Type IE belong to pathogenic strains, including Salmonella enterica and Shigella spp., which are on the WHO priority pathogens list for new antibiotics. This result shows that the co-occurrence of AMR-related genes and CRISPR-Cas systems differs vastly depending on the Cas system type; thus, the AMR analysis is a first step toward understanding the genomic background of target pathogens for the effective design and delivery of CRISPR-based antimicrobials.
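The virulence-to-AMR comparison above can be checked directly from the counts reported in Table 1. The following is a minimal sketch rather than the analysis code used in this study; the names `TABLE1_COUNTS` and `virulence_to_amr_ratios` are hypothetical, and "-" entries in the table are assumed to mean zero genes.

```python
# AMR and virulence gene counts copied from Table 1 as (AMR, virulence).
# Assumption: "-" entries in the table are read as 0.
TABLE1_COUNTS = {
    "IB": (137, 8), "IC": (546, 0), "ID": (0, 0), "IE": (1631, 1351),
    "IF": (571, 42), "IIA": (313, 0), "IIB": (26, 0), "IIC": (918, 40),
    "IIIA": (416, 60), "IIIB": (32, 0), "IIIC": (1, 0), "IIID": (5, 0),
    "VA": (3, 0), "VIB1": (2, 0), "VIB2": (2, 0),
}


def virulence_to_amr_ratios(counts):
    """Return {Cas type: virulence/AMR} for types with at least one AMR gene."""
    return {t: vir / amr for t, (amr, vir) in counts.items() if amr > 0}


ratios = virulence_to_amr_ratios(TABLE1_COUNTS)
# Cas system types that carry at least one virulence gene.
types_with_virulence = sorted(t for t, r in ratios.items() if r > 0)
# Cas system type with the highest virulence-to-AMR ratio.
highest = max(ratios, key=ratios.get)
print(types_with_virulence, highest, round(ratios[highest], 3))
```

Under these assumptions, the types with a non-zero ratio are the five named in the text, and Type IE has the largest ratio.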

Table 1.

Classification of the AMR-related genes in the CRISPR-Cas genomes by the Cas system type. CRISPR-Cas genomes of Type IA and Type IV had no AMR-related gene.

Cas system type	AMR	Virulence	Stress: Acid	Stress: Biocide	Stress: Metal	Stress: Heat
IB	137	8	-	-	36	-
IC	546	-	-	3	115	5
ID	-	-	-	-	2	-
IE	1631	1351	438	92	1356	45
IF	571	42	24	19	306	1
IIA	313	-	-	-	89	-
IIB	26	-	-	-	-	-
IIC	918	40	-	37	194	-
IIIA	416	60	1	2	32	-
IIIB	32	-	-	1	5	-
IIIC	1	-	-	-	-	-
IIID	5	-	-	-	3	-
VA	3	-	-	-	-	-
VIB1	2	-	-	-	1	-
VIB2	2	-	-	-	-	-

Figure 2.

(a) The first bar plot summarizes the non-redundant CRISPR-Cas genomes in the dataset by each Cas system type, with the brown color representing bacterial genomes and the green color representing archaeal genomes. The 3D macromolecular protein structure of a signature Cas protein for each system is shown on the left panel. (b) The second bar plot shows the number of AMR-related genes per CRISPR-Cas genome in the dataset by each Cas system type. The length of the error bars is the standard deviation of AMR-related genes per CRISPR-Cas genome at each Cas system type.

CRISPR-Cas genomes have diverse DNA repair pathways

We investigated the distribution of DNA repair pathways in the CRISPR-Cas genomes, based on a previous study of double-strand break (DSB) repair pathways in prokaryotic genomes.²⁶ We searched for diverse DSB repair pathways, including the SOS response, non-homologous end-joining (NHEJ), and various nuclease proteins. The number of components of each DNA repair pathway per genome was calculated for the CRISPR-Cas genomes of each Cas system type (Supplemental Table S4). The results are visualized as a heatmap (Figure 3) with the proteins belonging to each DNA repair pathway shown on the right axis label (eg, Ku, LigD1, LigD2, and LigD3 are components of NHEJ pathways). The heatmap shows that some DSB repair pathways are enriched in most CRISPR-Cas genomes, including the AddAB pathway, AdnAB pathway, and RuvAB pathway. Furthermore, some proteins such as RecG and RecN are enriched in almost all types of the CRISPR-Cas genomes.

Figure 3.

Heatmap showing the number of DNA repair pathways per CRISPR-Cas genome for each Cas system type. The name of the protein belonging to each DNA repair pathway is indicated on the right axis label. The color bar shows a scale from 0 to 7 DNA repair proteins per CRISPR-Cas genome, with the red color indicating the highest frequency.

The DSB repair pathways of some Cas system types show outlier patterns relative to the other CRISPR-Cas genomes (Figure 3). In particular, the DSB repair pathways of Type ID and Type VIB2 stand out as outliers, in which the RecBCD and the RuvAB pathways are more enriched while the AddAB pathway is less enriched, relative to the other types. Additionally, the CRISPR-Cas genomes of Type IIIA and Type IV stand out as outliers with relatively high numbers of genes belonging to the NHEJ pathway, which has only recently been identified and verified to be active in prokaryotic genomes.^27,28 For this pathway, ligation is usually carried out by LigD proteins, but other ligases can be recruited by Ku in their absence.

DNA repair during acquisition generates variant CRISPR repeats

Recent studies on the acquisition step show that the site-specific integration of new spacers in CRISPR arrays is polarized; most spacers are added to the leader end of the CRISPR array^22-24 (Figure 4a). In this step, the Cas1-Cas2 complex acts as a spacer integrase,^29,30 during which the terminal 3′ ends of a protospacer carry out nucleophilic attacks on each end of the repeat. After this reaction, the 3′ ends of the protospacer are ligated to the repeat ends, and the single-strand gaps are presumed to be filled in by a DNA polymerase, which duplicates the repeat.^31-33 During this repeat duplication, the repeat sequence at the leader end of the CRISPR array is used as a template due to the polarity of the spacer acquisition.

Figure 4.

(a) Acquisition steps of new spacers in a CRISPR array show how repeats are being repaired by the DNA repair pathways after new spacer acquisition. (b) Projection of CRISPR repeats on the 2-dimensional latent space labeled with the associated Cas system type.

We investigated the CRISPR repeats of each Cas system type by dimensionality reduction to visualize the variation of CRISPR repeat sequences resulting from the DNA repair activities (Figure 4b). We used various summary statistics of biological features to interpret the principal components of these clusters. Each cluster of the repeats differs in mean length and standard deviation (Table 2 and Supplemental Table S5). The cluster analysis shows that the length of a sequence and the metric entropy (ie, randomness of a sequence) are captured on the first latent dimension, with an explained variance of 20.9% (Supplemental Figures S1 and S2). Furthermore, the clusters have a wide range of GC/AT ratios (eg, 0.66 for Cluster 0 vs 2.19 for Cluster 1), which is captured on the second latent dimension, with an explained variance of 9.9%. Another important feature of the CRISPR repeats is the RNA secondary structure. The clusters with low minimum free energy (Clusters 1 and 4) lie on the upper side of the second principal component, which indicates highly structured CRISPR repeats. In contrast, those with high minimum free energy (Clusters 0 and 2) lie on the lower side of the second principal component, which indicates CRISPR repeats without distinct secondary structures.
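Two of the summary statistics used to interpret the latent dimensions can be sketched in a few lines. This is an illustrative sketch, not the code used in this study: the function names are hypothetical, and metric entropy is assumed here to be the Shannon entropy of the nucleotide composition (in bits) divided by the sequence length, which is consistent with the magnitudes reported in Table 2 but is not stated explicitly in the text.

```python
import math
from collections import Counter


def gc_at_ratio(seq):
    """Ratio of G+C count to A+T(U) count; infinite if no A/T/U is present."""
    seq = seq.upper()
    gc = sum(seq.count(b) for b in "GC")
    at = sum(seq.count(b) for b in "ATU")
    return gc / at if at else math.inf


def metric_entropy(seq):
    """Shannon entropy of base composition (bits) divided by sequence length.

    Assumption: this is the definition of metric entropy intended in the text.
    """
    n = len(seq)
    counts = Counter(seq.upper())
    shannon = -sum((c / n) * math.log2(c / n) for c in counts.values())
    return shannon / n


# A hypothetical 28-nt repeat used only to illustrate the two statistics.
example_repeat = "GTTCCCCGCGCCAGCGGGGATAAACCGA"
print(gc_at_ratio(example_repeat), metric_entropy(example_repeat))
```

Under this definition, a repeat that uses all four bases evenly has a Shannon entropy of 2 bits, so a 30-nt repeat would score about 0.067, in the range of the cluster means in Table 2.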

Table 2.

Summary statistics of CRISPR repeats by the Gaussian Mixture Model cluster.

Cluster	Mean length ± SD (number of data)	GC/AT ratio ± SD	Metric entropy ± SD	Minimum free energy of RNA ± SD
0	37.35 ± 3.38 (n = 2753)	0.66 ± 0.42	0.050 ± 0.0047	−4.13 ± 4.57
1	28.97 ± 0.22 (n = 2122)	2.19 ± 0.52	0.065 ± 0.0021	−13.57 ± 2.10
2	30.44 ± 2.11 (n = 1449)	0.93 ± 0.67	0.061 ± 0.0053	−6.73 ± 5.77
3	27.59 ± 1.64 (n = 3007)	1.46 ± 0.75	0.070 ± 0.0050	−8.86 ± 4.17
4	32.72 ± 2.61 (n = 1759)	1.69 ± 0.60	0.058 ± 0.0059	−11.53 ± 3.39

CRISPR repeat structures show the patterns of DNA repair by the Cas system type

CRISPR arrays contain multiple repeats that separate unique spacers (typically, <50 spacers in bacteria and <100 spacers in archaea),³⁴ and the dimensionality reduction study showed the variation of these repeats within an array. To elucidate how these secondary structures of CRISPR repeats change over time due to DNA repair during the acquisition, we predicted RNA secondary structures of individual repeats within an array and quantified the Minimum Free Energy (MFE) associated with the secondary structure (Supplemental Table S6). The lower the MFE value, the higher the probability of sequences forming stable RNA secondary structures. We plotted the time-series graphs of the MFE values for CRISPR repeats within each array chronologically, in which the CRISPR repeats were separated by the number of unique repeats in an array (Supplemental Figure S5). The number of unique repeats was assumed to be mutation events during the spacer acquisition process, varying from 2 to 24 time points. These time-series graphs show that the MFE values of CRISPR repeats fluctuate over time. This result shows that the secondary structures of CRISPR repeats are dynamic due to mutation events during the spacer acquisition process. Another noticeable trend is the difference in the baseline of MFE values in CRISPR repeats associated with different Cas system types. For example, the MFE baselines of Class 2 subtypes, including IIA, IIB, and IIC, were consistently higher than some of Class 1 subtypes, including IC, IE, and IF. Interestingly, the MFE baselines of IA, IB, and some III types do not appear to follow the same trend.

To visualize the change in the CRISPR repeat structure over time, we built a selected collection of the graphical output of these RNA structures by the associated Cas system type (Supplemental Figure S6). Consistent with the time-series graphs built with the MFE values, the CRISPR repeat structures of Class 1 subtypes, particularly IC, IE, and IF, tend to have more distinctive hairpin structures of palindromic sequences over time as compared to those of Class 2 subtypes. Such difference in time-series patterns of CRISPR secondary structures according to the associated Cas system types raises an intriguing question on the differential effects of DNA repair during the genome-editing events of CRISPR-Cas systems.

Simulated studies show the effects of DNA repair under Lamarckian evolution

We simulated a selection of CRISPR repeats associated with Class 1 Type IE and Class 2 Type IIA (Supplemental Table S6) using a population genetic model that simulates the genetic drift of mutations. These simulations under the Darwinian evolution model were conducted for comparison with the observed evolution of CRISPR repeats, which undergo genome-editing events equivalent to Lamarckian evolution.²⁰ According to the population genetic model, mutations on non-coding sequences are assumed to be neutral and their genetic drift through generations is modeled through binomial sampling.^35,36 As shown in Figure 5a, the CRISPR repeats associated with Class 1 Type IE maintain low MFE values over time despite some fluctuations. However, the simulated trajectory of MFE values starting from the same initial sequences shows a trend toward zero MFE (Figure 5b). The difference in these trends is highlighted by the visualization of RNA secondary structures under each graph. Under the population genetic model, any mutation on the CRISPR repeat sequences is likely to degrade the RNA secondary structure by breaking the palindromic patterns. However, the CRISPR repeats associated with Class 1 Type IE tend to maintain the RNA secondary structures in the presence of mutations more robustly than expected. For the CRISPR repeats associated with Class 2 Type IIA (Figure 5c), the temporal patterns in MFE values are similar to the simulated patterns of MFE from the same initial sequences (Figure 5d). These temporal patterns are consistent with each other because the initial repeat sequences of Type IIA are unstructured, so mutations have no RNA secondary structure to break down.
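The binomial sampling step of neutral genetic drift can be sketched as a Wright-Fisher-style simulation. This is a minimal illustration of the general technique, not the simulation code used in this study; the function name, population size, starting frequency, and seed are all hypothetical choices.

```python
import random


def simulate_drift(initial_freq, pop_size, generations, rng):
    """Track the frequency of a neutral mutation under binomial sampling.

    Each generation, pop_size copies are drawn independently, each carrying
    the mutation with probability equal to the current frequency.
    """
    freq = initial_freq
    trajectory = [freq]
    for _ in range(generations):
        # Binomial(pop_size, freq) draw built from independent Bernoulli trials.
        copies = sum(1 for _ in range(pop_size) if rng.random() < freq)
        freq = copies / pop_size
        trajectory.append(freq)
    return trajectory


rng = random.Random(0)  # fixed seed so the sketch is reproducible
trajectory = simulate_drift(initial_freq=0.1, pop_size=200, generations=50, rng=rng)
print(trajectory[-1])
```

Because the mutation is neutral, its frequency wanders until it is either lost (0) or fixed (1); nothing in this model acts to preserve a palindromic repeat structure, which is why the simulated MFE trajectories drift toward zero.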

Figure 5.

Time-series graphs of the secondary structure of CRISPR repeats in the forward direction with 5 time points. (a) Minimum free energy of Class 1 Type IE CRISPR repeats. (b) Simulated minimum free energy of Class 1 Type IE CRISPR repeats. (c) Minimum free energy of Class 2 Type IIA CRISPR repeats. (d) Simulated minimum free energy of Class 2 Type IIA CRISPR repeats.

Discussion

CRISPR-Cas systems were initially discovered in prokaryotic genomes, where they were found to provide adaptive immunity against invading mobile genetic elements. Due to their ability to cut DNA/RNA specifically with the CRISPR RNA as a guide template, CRISPR-Cas systems were first applied as genome-editing tools to alter certain phenotypic features in eukaryotes, including somatic human cells and agricultural plant cells. Recently, CRISPR-Cas systems have been repurposed as CRISPR-based antimicrobials, a highly potent alternative to traditional antibiotics that self-targets drug-resistant pathogens.^8-10 The CRISPR RNA component can be reprogrammed to self-target antimicrobial resistance (AMR) genes in the chromosome or plasmid of these drug-resistant pathogens. Moreover, CRISPR-based antimicrobials have the potential to be used as preventive measures, such as controlling reservoirs of AMR genes in microbial communities to regain or retain the antimicrobial activity of traditional antibiotics.¹³ However, most prokaryotes have the ability to repair DNA damage, including damage caused by the nuclease activity of CRISPR-Cas systems, which themselves require DNA repair to integrate new spacers and to regenerate new repeats in CRISPR arrays.^26,37

According to the comprehensive survey of AMR-related genes in the curated prokaryotic genome dataset, most CRISPR-Cas genomes (except for Types IA and IV) have numerous AMR-related genes that can be self-targeted with endogenous CRISPR-Cas systems. This co-occurrence of CRISPR-Cas systems and AMR-related genes enables the delivery of CRISPR-based antimicrobials to be simplified to self-targeting CRISPR arrays on mobile genetic elements. Recently, phage capsids have been engineered to deliver self-targeting CRISPR-based antimicrobials to pathogenic bacteria.^14,15 For pathogens with both CRISPR-Cas systems and AMR-related genes, a simpler construct of self-targeting CRISPR arrays can be packaged into these viral vectors.⁸ Efficient delivery to specific bacteria is one of the main challenges of programmable CRISPR-based antimicrobials. Although several studies demonstrated genetic elements encoding foreign systems can be delivered to target bacteria using various vectors such as phage capsids, conjugative plasmids, and nanoparticles,¹⁰ the specificity and efficiency of such delivery vectors in a complex natural environment is still an ongoing area of research. Furthermore, the defense mechanisms and the resistance development of pathogens against these CRISPR-based antimicrobials should be studied and monitored extensively to demonstrate the long-term effectiveness of these novel antibiotics.^8-10

In this study, we investigated the potential interference of DNA repair pathways in utilizing CRISPR-based antimicrobials. Given that we found numerous and diverse DNA repair pathways in the CRISPR-Cas genomes, we focused on 2 general mechanisms to repair DNA damage. Homologous recombination (HR) requires a homologous template to repair the DNA damage with high fidelity.^17,37 We found that all CRISPR-Cas genomes have diverse HR-related genes, including genes necessary for RecBCD, AddAB, and AdnAB pathways. Many bacteria contain multiple copies of the genome, or at least partially replicated forms before cell division, which may require CRISPR-based antimicrobials to target all copies simultaneously because diverse HR pathways can use an intact copy as a repair template. Non-homologous end-joining (NHEJ) is a DNA repair pathway that processes the DNA damage and directly ligates the DNA ends without requiring template DNA.¹⁷ Previously, bacteria were assumed to rely mainly on HR to repair double-strand breaks, but the recent discovery of alternative NHEJ pathways strengthens the evidence that bacteria have the ability to ligate unrelated DNA ends that do not share homology to create new genetic combinations.²⁷ However, Type IIA CRISPR-Cas systems in bacteria were found to inhibit NHEJ repair pathways due to the antagonistic interactions of recognizing the same DNA damage.³⁷ Consistently, we found that CRISPR-Cas genomes of Type IIA are devoid of NHEJ-related genes. In contrast, other CRISPR-Cas genomes have NHEJ-related genes, with Type IIIA and Type IV being relatively enriched. These findings show the complex interactions between CRISPR-Cas systems and DNA repair pathways in CRISPR-Cas genomes, and the application of CRISPR-based antimicrobials to bacteria requires extensive investigation of the genomic background of the target bacteria.

Inspired by the interplay between CRISPR-Cas systems and DNA repair pathways, we further investigated the unique genome-editing features governing the evolution of CRISPR-Cas genomes. The ability of CRISPR-Cas immunity to specifically modify the genome of a prokaryote in response to an external challenge (eg, virus infection) has been recognized as a unique example of Lamarckian evolution.²⁰ Unlike Darwinian evolution, whose variation results from random mutations, Lamarckian evolution relies on the high specificity of mutations that results in an efficient adaptation to the external challenge, and the necessity to co-evolve effective DNA repair pathways along with CRISPR-Cas systems was predicted by theoretical evolutionary modeling.³⁸ In this study, we brought further insights into the interaction between CRISPR-Cas systems and DNA repair pathways by time-series visualization of CRISPR repeat secondary structures and the simulation studies of CRISPR repeat evolution. We demonstrated that the diversity of CRISPR repeat structures is an important biological feature of different CRISPR-Cas systems, and the variation within a CRISPR array reflects the interplay of CRISPR-Cas systems and DNA repair pathways during the genome-editing event of spacer acquisition. Furthermore, the simulation studies showed that the secondary RNA structures of Type I CRISPR repeats are maintained better than expected under Darwinian evolution, which further supports the ability of some CRISPR-Cas genomes to repair DNA damage with high fidelity.

In this study, we emphasized the importance of understanding the genomic background of CRISPR-Cas genomes to exploit the potential of CRISPR-based antimicrobials to self-target AMR-related genes. CRISPR-based antimicrobials are unique programmable tools that can target bacteria specifically for their pathogenicity, despite the various challenges such as delivery issues and host resistance. Next-generation antibiotics are urgently needed. The antibiotic market is currently not viable because new antibiotics can only be used sparingly, as a last resort, to prevent the rise of new drug resistance.^39-41 As opposed to traditional antibiotics, for which drug resistance emerges rapidly, CRISPR-based antimicrobials offer an opportunity to exploit the recent progress in understanding the complexity and evolution of prokaryotic genomes to strategically counteract the spread of drug-resistant bacteria.

Methods

Curating a labeled dataset of CRISPR-Cas genomes by the Cas system type

We used the public database CRISPRCasdb (downloaded on 21/01/2021) to build a dataset of CRISPR-Cas genomes labeled by the Cas system type, which we define as prokaryotic genomes that have one complete set of Cas genes and one associated CRISPR array. We chose this one-to-one association to eliminate confounding factors resulting from complex associations between multiple CRISPR arrays and multiple Cas gene systems within the same genome. From 26 340 bacterial genomes and 436 archaeal genomes, CRISPRCasdb found 10 890 (41.34%) bacterial genomes with CRISPR arrays and 333 (76%) archaeal genomes with CRISPR arrays (Supplemental Table S1). Overall, 9554 (36.27%) bacterial genomes and 308 (70.74%) archaeal genomes had both CRISPR arrays and Cas gene systems. Hereinafter, we refer to CRISPR arrays in prokaryotic genomes without Cas gene systems as “orphan arrays.” As each CRISPR array typically contains multiple repeat sequences, the total number of unique repeats adds up to 26 958.

The number of non-redundant CRISPR-Cas genomes labeled by the associated Cas system type from the CRISPRCasdb is summarized in Supplemental Table S2. The number of CRISPR-Cas genomes varies by the Cas system type. For example, there are 209 CRISPR-Cas genomes associated with Type IE, whereas only 1 CRISPR-Cas genome is associated with Type VIB2. The disparity among the types may be due to sampling bias in CRISPRCasdb toward human pathogens. Furthermore, it may result from other factors such as the selection criterion of one-to-one associations, the recent discovery of some subtypes (such as Type VI), and their true relative rarity in nature. The number of unique CRISPR repeats labeled by different Cas system types is shown in Supplemental Table S3. For visualization analyses, we merged the CRISPR-Cas genomes associated with extremely rare Cas system types (eg, VIB2) into one category, while keeping other subtypes of Class 1 and Class 2 as separate categories.

Analysis of AMR genes and DNA repair pathways in CRISPR-Cas genomes

For the AMR gene analysis, we used the NCBI Antimicrobial Resistance Gene Finder (AMRFinderPlus Version 3.10.20),²⁵ which has an accompanying database of antimicrobial resistance genes, including some point mutations. We ran this software on the protein sequences of the CRISPR-Cas genomes to search for AMR-related genes; it uses BLASTP and HMMER for gene matches and classifies novel sequences using a hierarchical tree of gene families.

For the DNA repair analysis, we used the components of the double-strand break repair system that had previously been constructed using MacSyFinder (Version 1.0.2).²⁶ For these DNA repair pathways, the protein profiles for new proteins had been built from multiple sequence alignments of homologous proteins using MAFFT (Version 7.205) and HMMER (Version 3.1).²⁶ We downloaded from NCBI (downloaded 10/01/2022) the whole genomes that contained each CRISPR array, grouped by the associated Cas system, and we used the HMM profiles of the DSB repair system to search for the components with hmmsearch (Version 3.3.2). We counted the hits for each component of the DSB repair system that passed the sequence reporting threshold (E-value < $1 \times 10^{-3}$) and calculated the number of each component per genome for each group of CRISPR arrays by the associated Cas system.
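The per-genome counting step can be sketched as follows. This is an illustrative sketch rather than the pipeline used in this study: it assumes the search results are available in the tabular format written by the `--tblout` option of hmmsearch (target name, accession, query name, accession, full-sequence E-value, ...), treats the reporting threshold as an upper bound on the E-value as is conventional in HMMER, and uses hypothetical protein identifiers and a hypothetical function name.

```python
def component_counts_per_genome(tblout_text, n_genomes, max_evalue=1e-3):
    """Count hits per HMM profile that pass the E-value cutoff, per genome."""
    counts = {}
    for line in tblout_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment/header lines
        fields = line.split()
        # Column 3 is the query (profile) name; column 5 is the
        # full-sequence E-value.
        query_name, evalue = fields[2], float(fields[4])
        if evalue < max_evalue:
            counts[query_name] = counts.get(query_name, 0) + 1
    return {name: n / n_genomes for name, n in counts.items()}


# Hypothetical, abbreviated tblout content for two genomes of one Cas type.
EXAMPLE_TBLOUT = """\
# target name  accession  query name  accession  E-value  score  bias
protein_0001   -          RecG        -          1.2e-50  170.3  0.1
protein_0002   -          RecN        -          3.0e-20   71.9  0.0
protein_0003   -          RecG        -          0.5        5.2  0.0
protein_0104   -          RecG        -          4.1e-48  162.0  0.2
"""

per_genome = component_counts_per_genome(EXAMPLE_TBLOUT, n_genomes=2)
print(per_genome)
```

In this toy input, the third hit is discarded because its E-value exceeds the cutoff, so the per-genome averages are computed from the three remaining hits.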

Dimensionality reduction of CRISPR repeats

Principal Component Analysis (PCA) reduces the dimensions of data by computing the principal components and retaining the first few to increase interpretability. We used a direct PCA approach that transforms the sequence matrix into Boolean vectors for direct analysis of nucleotide sequences.⁴² The conventional PCA approach uses model-based distance matrices to estimate the distance among samples. However, the process of summarizing variations in data disperses information oriented in various directions. The direct PCA approach used in this study finds the relationship of CRISPR repeat sequences directly from the sequence matrix, which shows the relationship between CRISPR repeats and nucleotide bases simultaneously. In terms of the source of distances, the direct PCA approach indicates the distances between the center and samples, while the conventional PCA approach indicates those between samples. The differences among CRISPR repeats were estimated by calculating Euclidean distances⁴²:

\hat{e}_{1,2} = \sqrt{\frac{\sum {(\vec{x_{1}} - \vec{x_{2}})}^{2}}{2}}

where $\vec{x_{1}}$ and $\vec{x_{2}}$ are the encoded vectors of the 2 nucleotide sequences being compared. Digitization of nucleotide sequences has been explored extensively in previous studies, mainly by encoding the 4 nucleotides as 1-hot vectors.⁴³⁻⁴⁶ This transformation has the merits that it is completely reversible and that PCA can be applied directly to the transformed sequence matrix. The maximum length of repeats across all categories is 50 (Supplemental Figure S1). For interpretability, we used a 2-dimensional latent space, as the third dimension does not add information about the biological features relevant to this study (the explained variance of the third component is 6%).
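The encoding and projection described above can be sketched with NumPy. This is an illustration under stated assumptions and not the Direct-PCA package itself: repeats shorter than the maximum length are zero-padded, the PCA is computed by a singular value decomposition of the centered 1-hot matrix without the scaling used in the original method, and the function names `one_hot`, `direct_pca_sketch`, and `repeat_distance` are hypothetical.

```python
import numpy as np

BASES = "ACGT"


def one_hot(seq, length):
    """Encode a sequence as a flattened (length x 4) 1-hot vector.

    Positions beyond the end of the sequence stay all-zero (padding).
    """
    mat = np.zeros((length, len(BASES)))
    for i, base in enumerate(seq[:length]):
        if base in BASES:
            mat[i, BASES.index(base)] = 1.0
    return mat.ravel()


def direct_pca_sketch(seqs, length=50, n_components=2):
    """Project 1-hot encoded sequences onto the first principal components."""
    X = np.array([one_hot(s, length) for s in seqs])
    centered = X - X.mean(axis=0)  # distances are measured from the center
    _, S, Vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ Vt[:n_components].T
    explained = S**2 / np.sum(S**2)  # fraction of variance per component
    return scores, explained[:n_components]


def repeat_distance(x1, x2):
    """Euclidean distance between two encoded sequences, halved under the root."""
    return float(np.sqrt(np.sum((x1 - x2) ** 2) / 2))


# Toy repeats (made-up sequences, not taken from the dataset).
repeats = ["GTTTTAGAGCTA", "GTTTTAGAGCTG", "CGGTTCATCCCC", "CGGTTTATCCCC"]
scores, explained = direct_pca_sketch(repeats, length=12)
d = repeat_distance(one_hot("ACGT", 4), one_hot("ACGA", 4))
```

With this encoding, one mismatch changes two entries of the vector, so dividing by 2 under the square root makes the distance between sequences that differ at a single position equal to 1.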

Clustering with Gaussian mixture models (GMM)

We used Gaussian mixture models (GMMs) as a probabilistic model to define clusters. GMMs assume that all data points are generated from a mixture of a finite number of Gaussian distributions with unknown parameters. GMMs can be viewed as a generalization of k-means clustering that incorporates the covariance structure of the input data as well as the centers of the Gaussian distributions. The number of clusters must be defined before a GMM is fitted. For model selection, we used the Bayesian information criterion (BIC) to choose the number of clusters without overfitting.⁴⁷ The BIC introduces a penalty term for the increasing number of parameters in the model:

BIC = k \ln (n) - 2 \ln (\hat{L})

where $k$ is the number of parameters, $n$ is the number of observed data points, and $\hat{L}$ is the maximized value of the likelihood function of the evaluated model.

We evaluated a range of cluster numbers (1-9), with 4 different covariance types of the input data for each model (spherical, tied, diagonal, and full). The BIC scores from the GMM model selection for the repeats are summarized in Supplemental Figure S3. The BIC scores show that assuming the full covariance of the input data gives the best result for every number of clusters. For the GMMs with the full covariance, the last substantial drop in the BIC score occurs between 4 and 5 clusters. Thus, the GMM with 5 clusters was chosen as the simplest model that best fits the data according to the BIC (Supplemental Figure S4). Based on this GMM, we labeled each cluster with the associated Cas system type for further analyses (Supplemental Table S5).
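To make the penalty term concrete, the following sketch computes the BIC from a log-likelihood and counts the free parameters of a GMM for the 4 covariance types. It is an illustration only: the parameter counts follow the usual convention (means, covariance terms, and mixture weights that sum to 1), and the function names `gmm_n_parameters` and `bic` are hypothetical rather than part of any library.

```python
import math


def gmm_n_parameters(n_components, n_features, covariance_type):
    """Number of free parameters of a Gaussian mixture model."""
    d = n_features
    covariance_terms = {
        "spherical": n_components,               # one variance per component
        "diag": n_components * d,                # one variance per dimension
        "tied": d * (d + 1) // 2,                # one shared full matrix
        "full": n_components * d * (d + 1) // 2,  # one full matrix each
    }[covariance_type]
    means = n_components * d
    weights = n_components - 1  # weights sum to 1, so one is not free
    return covariance_terms + means + weights


def bic(k, n, log_likelihood):
    """BIC = k ln(n) - 2 ln(L), given the maximized log-likelihood ln(L)."""
    return k * math.log(n) - 2.0 * log_likelihood


# A 5-component GMM with full covariance in a 2-dimensional latent space.
k_full = gmm_n_parameters(5, 2, "full")
# Toy values: 1000 points and a made-up log-likelihood of -2500.
score = bic(k_full, n=1000, log_likelihood=-2500.0)
```

Because the full covariance type has the most parameters, it also carries the largest penalty, so a lower BIC for that type indicates that the gain in likelihood outweighs the added complexity.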

Biological feature interpretations of clusters

We evaluated each cluster with summary statistics to infer biological interpretations of the features the PCA extracted from the CRISPR repeats (Table 2). We calculated the entropy of the CRISPR repeats from each cluster to assess the randomness in these sequences. We used the Shannon entropy bounded between 0 and 1 as a measure of information content in a sequence⁴⁸:

H(X) = - \sum_{i=1}^{M} P(x_{i}) \log_{2} P(x_{i})

where $P(x_{i})$ is the probability of the event $x_{i}$ and $M$ is the number of possible events. The Shannon entropy is maximal when the 4 nucleotides (A, T, G, C) are equiprobable and independent. We obtained the metric entropy by dividing the Shannon entropy by the sequence length (Table 2).
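A minimal sketch of the two entropy measures follows. It assumes that the probabilities are estimated from the nucleotide frequencies within a single repeat; the function names `shannon_entropy` and `metric_entropy` are hypothetical.

```python
import math
from collections import Counter


def shannon_entropy(seq):
    """Shannon entropy (in bits) of the symbol frequencies in a sequence."""
    if not seq:
        raise ValueError("sequence must not be empty")
    n = len(seq)
    counts = Counter(seq)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def metric_entropy(seq):
    """Shannon entropy divided by the sequence length."""
    return shannon_entropy(seq) / len(seq)


h_uniform = shannon_entropy("ACGT")   # all 4 nucleotides equally frequent
h_constant = shannon_entropy("AAAA")  # a single repeated nucleotide
m_uniform = metric_entropy("ACGT")
```

In this sketch a repeat made of one nucleotide has zero entropy, and a repeat with equal frequencies of all 4 nucleotides reaches the maximum of 2 bits before normalization by length.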

We used the ViennaRNA Package to predict the RNA secondary structure of the CRISPR repeats. The RNAfold (Version 2.4.14) function of the package calculates the minimum free energy (MFE in kcal/mol) of the thermodynamic ensemble to predict the stability of RNA secondary structures.⁴⁹ We chose the centroid method to predict the optimal secondary structure, which yields the secondary structure with the minimum total base-pair distance to the entire thermodynamic ensemble of structures.⁴⁹,⁵⁰ The centroid method finds the optimal secondary structure that minimizes the following sum of base-pair distances:

\sum_{1 \leq k \leq m} \sum_{i} \sum_{j} {(I_{i j}^{k} - I_{i j})}^{2}

for a set of $m$ secondary structures $I_{1}, I_{2}, \ldots, I_{m}$, with $I_{k} = \{I_{ij}^{k}\}$, $1 \leq k \leq m$. The biological features of the CRISPR repeats, including metric entropy, sequence length, GC/AT ratio, and minimum free energy, were calculated for each cluster defined by the GMM.
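The base-pair distance in the sum above can be illustrated on dot-bracket strings. This sketch is not RNAfold's centroid computation, which works on the base-pairing probabilities of the full Boltzmann ensemble; it only picks, from a small list of candidate structures, the one with the smallest total distance to the others. The function names are hypothetical.

```python
def base_pairs(dot_bracket):
    """Return the set of (i, j) base pairs encoded by a dot-bracket string."""
    stack, pairs = [], set()
    for i, ch in enumerate(dot_bracket):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            pairs.add((stack.pop(), i))
    return pairs


def bp_distance(a, b):
    """Number of base pairs present in exactly one of the two structures.

    With 0/1 indicators I_ij, this equals the sum of (I_ij^a - I_ij^b)^2.
    """
    return len(base_pairs(a) ^ base_pairs(b))


def closest_to_sample(structures):
    """Structure in the list with the minimum total distance to all others."""
    return min(
        structures,
        key=lambda s: sum(bp_distance(s, t) for t in structures),
    )


# Toy sample of structures for a 6-nucleotide sequence.
sample = ["((..))", "(....)", "(....)"]
best = closest_to_sample(sample)
```

Here the structure with a single outer pair is chosen because it differs from the rest of the sample by fewer base pairs in total.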

Time-series patterns in RNA secondary structures of CRISPR repeats

To visualize the secondary structures of CRISPR repeats, the Vienna RNA software (Version 2.4.18) was used. Using the software, minimum free energy (MFE) values of the RNA secondary structures were predicted,⁵¹ and the optimal secondary structure was recorded from the outputs of the software, which include the centroid structure, the partition function, and the matrix of base-pairing probabilities.⁵² The MFE values of the optimal secondary structure were obtained for all CRISPR repeats, and they were plotted in time-series graphs by the number of time points in each CRISPR array (Supplemental Figure S5). For the visualization of RNA secondary structures in subtypes with more than 100 sequences, 100 CRISPR repeats were selected randomly while ensuring that every bacterial species was included. For subtypes with 100 or fewer sequences, all repeats in the dataset were analyzed (Supplemental Figure S6).

Simulated patterns of the minimum free energy (MFE) of CRISPR repeats

To investigate the time-series patterns of CRISPR secondary structures under Lamarckian evolution, we simulated the evolution of CRISPR repeats under Darwinian evolution with genetic drift for comparison. For the simulation studies, we chose CRISPR repeats of the 2 subtypes (Class 1 Type IE and Class 2 Type IIA) that had the most prominent patterns in our previous time-series analyses. We chose CRISPR repeats from arrays with 5 time points to show clear temporal trends, and only from arrays with a known direction (Supplemental Table S6). The CRISPR repeat sequences from the first time point were the input sequences to the following simulation studies. For the simulation, we assumed the following population genetic model: the genetic drift of mutations under binomial sampling of wildtype and mutant between generations. The mutation rate of microbes in nature is extremely difficult to measure; thus, we chose the high end of the estimated range of mutation rates in microbial organisms ($1 \times 10^{-5}$ mutations per generation). For every mutation event, 1 of the 4 nucleotides (A, U, G, C) was randomly chosen to replace the wildtype nucleotide. To ensure the presence of mutations, we ran the simulation for 10,000 generations, and these simulations were run for 5 time points. The simulated CRISPR repeat sequences of each time point were processed using the Vienna RNA software (Version 2.4.18) as above for visualization of RNA secondary structures and quantification of MFE values. We repeated these simulations 100 times for each input sequence of CRISPR repeats, and the means of the MFE values were plotted in time-series graphs for comparison (Figure 5).
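The simulation scheme can be sketched as below. Several details are assumptions made for illustration and are not stated in the text: the mutation rate is applied per site, the input sequence is treated as the first time point with each later time point separated by one block of generations, and the population size in the drift step is a free parameter. The function names `mutate`, `simulate_time_points`, and `drift` are hypothetical, and the MFE of each simulated sequence would still need to be computed separately with an RNA folding tool.

```python
import random

NUCLEOTIDES = "AUGC"


def mutate(seq, mu, rng):
    """Replace each site by a random nucleotide with probability mu."""
    return "".join(
        rng.choice(NUCLEOTIDES) if rng.random() < mu else base for base in seq
    )


def simulate_time_points(seq, mu=1e-5, generations=10_000, n_points=5, rng=None):
    """Return the sequence at each time point, starting with the input."""
    rng = rng or random.Random()
    current, snapshots = seq, [seq]
    for _ in range(n_points - 1):
        for _ in range(generations):
            current = mutate(current, mu, rng)
        snapshots.append(current)
    return snapshots


def drift(freq, pop_size, generations, rng):
    """Mutant frequency after binomial sampling between generations."""
    for _ in range(generations):
        mutants = sum(rng.random() < freq for _ in range(pop_size))
        freq = mutants / pop_size
    return freq


# Toy run with a made-up repeat and a deliberately high rate so that
# changes are visible in a short simulation.
snapshots = simulate_time_points(
    "GUUUUAGAGCUA", mu=0.01, generations=50, n_points=5, rng=random.Random(0)
)
final_freq = drift(0.1, pop_size=100, generations=20, rng=random.Random(1))
```

Repeating such a run 100 times per input sequence and averaging the MFE at each time point would give a neutral baseline against which the observed time-series patterns can be compared.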

Supplemental Material

sj-docx-1-evb-10.1177_11769343221103887, sj-docx-2-evb-10.1177_11769343221103887, and sj-docx-3-evb-10.1177_11769343221103887: Supplemental material for Investigating the Genomic Background of CRISPR-Cas Genomes for CRISPR-Based Antimicrobials by Hyunjin Shim in Evolutionary Bioinformatics

Footnotes

Acknowledgements

We thank Wesley De Neve, Ho-min Park, and Yunseol Park for helpful discussions, and Yuju Ahn and Moobeom Hong for the bioinformatic support.

Funding

The author disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: The research and development activities described in this study were funded by Ghent University Global Campus (GUGC), Incheon, Korea.

Declaration of Conflicting Interests

The author declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.

Author Contributions

HS conceived of the presented idea, designed the analysis tool, performed the analysis, and wrote the paper.

ORCID iD

Hyunjin Shim

Data and Code Availability

All the CRISPR-Cas sequences are available in the CRISPR-Cas++ database (https://crisprcas.i2bc.paris-saclay.fr) as well as our project GitHub page (http://github.com/). For AMR analysis, we used AMRFinderPlus v.3.10.20 (https://github.com/ncbi/amr). For DNA repair analysis, we used hmmsearch v.3.3.2 (http://hmmer.org). For dimensionality reduction, we used Direct-PCA (https://github.com/TomokazuKonishi/direct-PCA-for-sequences) and scikit-learn (https://scikit-learn.org). For RNA secondary structure, we used Vienna RNA software v.2.4.18. For all data analysis and visualization, Python v.3.7.3 (https://www.python.org), SciPy v.1.1.0 (https://www.scipy.org), and seaborn v.0.9.0 were used.

Supplemental Material

Supplemental material for this article is available online.

References

1. Mojica, Díez-Villaseñor, García-Martínez, Soria. Intervening sequences of regularly spaced prokaryotic repeats derive from foreign genetic elements. J Mol Evol. 2005;60:174-182.

2. Makarova, Grishin, Shabalina, Wolf, Koonin EV. A putative RNA-interference-based immune system in prokaryotes: computational analysis of the predicted enzymatic machinery, functional analogies with eukaryotic RNAi, and hypothetical mechanisms of action. Biol Direct. 2006;1:7.

3. Andersson, Banfield JF. Virus population dynamics and acquired virus resistance in natural microbial communities. Science. 2008;320:1047-1050.

4. Jinek, Chylinski, Fonfara, Hauer, Doudna, Charpentier. A programmable dual-RNA-guided DNA endonuclease in adaptive bacterial immunity. Science. 2012;337:816-821.

5. Makarova, Wolf, Iranzo, et al. Evolutionary classification of CRISPR-Cas systems: a burst of class 2 and derived variants. Nat Rev Microbiol. 2020;18:67-83.

6. Makarova, Wolf, Alkhnbashi, et al. An updated evolutionary classification of CRISPR-Cas systems. Nat Rev Microbiol. 2015;13:722-736.

7. Koonin, Makarova, Zhang. Diversity, classification and evolution of CRISPR-Cas systems. Curr Opin Microbiol. 2017;37:67-78.

8. Bikard, Barrangou. Using CRISPR-Cas systems as antimicrobials. Curr Opin Microbiol. 2017;37:155-160.

9. Pursey, Sünderhauf, Gaze, Westra, van Houte. CRISPR-Cas antimicrobials: challenges and future prospects. PLoS Pathog. 2018;14:e1006990.

10. Duan, Cao, Zhang L-H. Harnessing the CRISPR-Cas systems to combat antimicrobial resistance. Front Microbiol. 2021;12:716064.

11. Reygaert WC. An overview of the antimicrobial resistance mechanisms of bacteria. AIMS Microbiol. 2018;4:482-501.

12. Antimicrobial Resistance Collaborators. Global burden of bacterial antimicrobial resistance in 2019: a systematic analysis. Lancet. Published online January 19, 2022.

13. Gomaa, Klumpe, Luo, Selle, Barrangou, Beisel CL. Programmable removal of bacterial strains by use of genome-targeting CRISPR-Cas systems. mBio. 2014;5:e00928.

14. Citorik, Mimee, Lu TK. Sequence-specific antimicrobials using efficiently delivered RNA-guided nucleases. Nat Biotechnol. 2014;32:1141-1145.

15. Bikard, Euler, Jiang, et al. Exploiting CRISPR-Cas nucleases to produce sequence-specific antimicrobials. Nat Biotechnol. 2014;32:1146-1150.

16. Luo, Mullis, Leenay, Beisel CL. Repurposing endogenous type I CRISPR-Cas systems for programmable gene repression. Nucleic Acids Res. 2015;43:674-681.

17. Wigley DB. Bacterial DNA repair: recent insights into the mechanism of RecBCD, AddAB and AdnAB. Nat Rev Microbiol. 2013;11:9-13.

18. Kreuzer KN. DNA damage responses in prokaryotes: regulating gene expression, modulating growth patterns, and manipulating replication forks. Cold Spring Harb Perspect Biol. 2013;5:a012674.

19. Edgar, Qimron. The Escherichia coli CRISPR system protects from λ lysogenization, lysogens, and prophage induction. J Bacteriol. 2010;192:6291-6294.

20. Koonin, Wolf YI. Is evolution Darwinian or/and Lamarckian? Biol Direct. 2009;4:42.

21. Koonin, Wolf YI. Evolution of microbes and viruses: a paradigm shift in evolutionary biology? Front Cell Infect Microbiol. 2012;2:119.

22. Wright, Liu, Knott, Doxzen, Nogales, Doudna JA. Structures of the CRISPR genome integration complex. Science. 2017;357:1113-1118.

23. Xiao, Nam. How type II CRISPR-Cas establish immunity through Cas1-Cas2-mediated spacer integration. Nature. 2017;550:137-141.

24. Wright, Doudna JA. Protecting genome integrity during CRISPR immune adaptation. Nat Struct Mol Biol. 2016;23:876-883.

25. Feldgarden, Brover, Gonzalez-Escalona, et al. AMRFinderPlus and the reference gene catalog facilitate examination of the genomic links among antimicrobial resistance, stress response, and virulence. Sci Rep. 2021;11:12728.

26. Bernheim, Bikard, Touchon, Rocha EPC. A matter of background: DNA repair pathways as a possible cause for the sparse distribution of CRISPR-Cas systems in bacteria. Philos Trans R Soc Lond B Biol Sci. 2019;374:20180088.

27. Chayot, Montagne, Mazel, Ricchetti. An end-joining repair mechanism in Escherichia coli. Proc Natl Acad Sci USA. 2010;107:2141-2146.

28. Shuman, Glickman MS. Bacterial DNA repair by non-homologous end joining. Nat Rev Microbiol. 2007;5:852-861.

29. Nuñez, Kranzusch, Noeske, Wright, Davies, Doudna JA. Cas1–Cas2 complex formation mediates spacer acquisition during CRISPR–Cas adaptive immunity. Nat Struct Mol Biol. 2014;21:528-534.

30. Nuñez, Harrington, Kranzusch, Engelman, Doudna JA. Foreign DNA capture during CRISPR-Cas adaptive immunity. Nature. 2015;527:535-538.

31. Yosef, Goren, Qimron. Proteins and DNA elements essential for the CRISPR adaptation process in Escherichia coli. Nucleic Acids Res. 2012;40:5569-5576.

32. Ivančić-Baće, Cass, Wearne, Bolt EL. Different genome stability proteins underpin primed and naïve adaptation in E. coli CRISPR-Cas immunity. Nucleic Acids Res. 2015;43:10821-10830.

33. Arslan, Hermanns, Wurm, Wagner, Pul. Detection and characterization of spacer integration intermediates in type I-E CRISPR–Cas system. Nucleic Acids Res. 2014;42:7884-7893.

34. Garrett SC. Pruning and tending immune memories: spacer dynamics in the CRISPR array. Front Microbiol. 2021;12:664299.

35. Shim, Laurent, Matuszewski, Foll, Jensen JD. Detecting and quantifying changing selection intensities from time-sampled polymorphism data. G3. 2016;6:893-904.

36. Foll, Shim, Jensen JD. WFABC: a Wright-Fisher ABC-based approach for inferring effective population sizes and selection coefficients from time-sampled data. Mol Ecol Resour. 2015;15:87-98.

37. Bernheim, Calvo-Villamañán, Basier, et al. Inhibition of NHEJ repair by type II-A CRISPR-Cas systems in bacteria. Nat Commun. 2017;8:2094.

38. Koonin, Wolf YI. Just how Lamarckian is CRISPR-Cas immunity: the continuum of evolvability mechanisms. Biol Direct. 2016;11:9.

39. Park H-M, Park, Vankerschaver, Van Messem, De Neve, Shim. Rethinking protein drug design with highly accurate structure prediction of anti-CRISPR proteins. Pharmaceuticals. 2022;15:310.

40. Shim, Shivram, Lei, Doudna, Banfield JF. Diverse ATPase proteins in mobilomes constitute a large potential sink for prokaryotic host ATP. Front Microbiol. 2021;12:691847.

41. Lepore, Silver, Theuretzbacher, Thomas, Visi. The small-molecule antibiotics pipeline: 2014-2018. Nat Rev Drug Discov. 2019;18:739.

42. Konishi, Matsukuma, Fuji, Nakamura, Satou, Okano. Principal component analysis applied directly to sequence matrix. Sci Rep. 2019;9:19297.

43. Xia, Zhang, et al. DeeReCT-PolyA: a robust and generic deep learning method for PAS identification. Bioinformatics. 2019;35:2371-2379.

44. Umarov, Kuwahara, Gao, Solovyev. Promoter analysis and prediction in the human genome using sequence-based deep learning models. Bioinformatics. 2019;35:2730-2737.

45. Han, Wang, Gao. DeepSimulator: a deep simulator for nanopore sequencing. Bioinformatics. 2018;34:2899-2908.

46. Angermueller, Lee, Reik, Stegle. DeepCpG: accurate prediction of single-cell DNA methylation states using deep learning. Genome Biol. 2017;18:67.

47. Schwarz. Estimating the dimension of a model. Ann Stat. 1978;6:461-464.

48. Tenreiro Machado. Shannon entropy analysis of the genome code. Math Probl Eng. 2012;2012:1-12.

49. Gruber, Lorenz, Bernhart, Neuböck, Hofacker IL. The Vienna RNA websuite. Nucleic Acids Res. 2008;36:W70-W74.

50. Ding, Chan, Lawrence CE. RNA secondary structure prediction by centroids in a Boltzmann weighted ensemble. RNA. 2005;11:1157-1166.

51. Lorenz, Bernhart, Höner Zu Siederdissen, et al. ViennaRNA Package 2.0. Algorithms Mol Biol. 2011;6:26.

52. McCaskill JS. The equilibrium partition function and base pair binding probabilities for RNA secondary structure. Biopolymers. 1990;29:1105-1119.
