Abstract
Genomic repeats are functionally ubiquitous structural units found in all genomes. Studying these repeats of different origins is essential for understanding the evolution and adaptation of a given organism. These repeating patterns have manifold signatures and structures with varying degrees of homology, making their identification challenging. To address this challenge, we developed a new algorithm and software that can rapidly and accurately detect any repeated sequences de novo with varying degrees of homology in genomic sequences in interspersed or clustered repeats. Numerous forms of repeated sequences and complex patterns can be identified, even for complex sequence variants and implicit or mixed types of repeat blocks. Direct and inverted-repeat elements, perfect and imperfect microsatellite repeats, and any short or long tandem repeat belonging to a wide range of higher-order repeat structures of telomeres or large satellite sequences can be detected. By combining precision and versatility, our tool contributes significantly to elucidating the intricate landscape of genomic repeats.
Keywords
Introduction
Repeated sequences are functionally ubiquitous structures found in all genomes (from viruses and prokaryotes to eukaryotes), with higher abundance, especially in noncoding eukaryotic sequences. 1 Genomic repeats represent integral structural elements across all genomes and contribute to their functional complexity. 2 As 1% of genomic DNA in the human genome is dedicated to protein-coding sequences, various researchers have suggested that the mobile genetic elements and the other repetitive sequences that constitute the so-called ‘noncoding DNA’ are where the entire evolutionary history of these genomes may be linked.3-6 A tandem repeat (TR) is a DNA sequence in which a pattern of 2 or more different nucleotides is repeated one or more times consecutively. The TRs are found throughout the genome and can vary in length from a few base pairs to several kilobases. 7 Eukaryotic ribosomal RNA (rDNA) gene families typically display TR structures and are present in long arrays. In most eukaryotic genomes, rDNA genes are clustered in long TRs at one or several loci. 8 Variation in copy number is a common characteristic of rDNAs and has been reported in many organisms. Using rat lines as an example, we previously observed that the copy number of animal genes is also highly variable. 9 The diversity of TR forms makes identification challenging, as the origin of their development is different. The 2 other forms of TRs are monomeric arrays and higher-order repeats (HORs).7,10 Monomeric arrays are formed when TRs occur as individual copies of the same sequence and are repeated one after the other. A TR set is also represented by more complex forms of sequences, 11 similar to tandemly repeated Alu elements (belonging to an order of nonautonomous retroelements termed short interspersed elements) and 5S rRNA genes. Mobile genetic elements are ubiquitous in eukaryotes and are present in high copy numbers in most genomes, making them significant constituents of eukaryotic genomes.12,13 Retrotransposons are substantial elements of interspersed and clustered TRs in all eukaryotic genomes. 14 One of the remarkable features of retrotransposons is their high copy numbers. This abundance is believed to be due to their ability to replicate and insert into new genomic locations. In addition, the primary order of these elements in various plant species is long terminal repeat (LTR) retrotransposon multiple extended tandem arrays. 15 Diverse families of inverted-repeat elements, such as the miniature inverted-repeat transposable element (MITEs) family, which belong to a group of nonautonomous class II transposable elements (DNA sequences), are especially typical of the plant genome. 16 In addition to other mobile elements, MITEs can also form extended TRs that are not detected using next-generation sequencing but can be detected using third-generation sequencing, such as Oxford Nanopore Technologies (ONT) or Pacific Biosciences Sequel systems. The presence of diverse families of inverted and direct repeat elements in genomes is associated with responses to environmental stress and diseases. This may enable genomes to adapt rapidly to new challenges, such as pathogens, climate variations, and other stressors. Detecting repetitive genomic sequences is challenging, and many algorithms are available.17-23 These repetitive patterns are characterised by manifold distinct signatures and intricate architectures with varying degrees of homology, making identification difficult. 24 We developed an advanced tool that detects de novo diverse forms of direct and inverted repeated sequences within genomic compositions. This tool recognises interspersed and clustered repeats, encompassing a broad spectrum of variations, including nested repeat patterns. This enabled the identification of inverted-repeat elements, perfect and imperfect microsatellite repeats, and diverse short and long TRs, thus spanning a comprehensive array of configurations with varying degrees of homology and repeat frameworks of extensive satellite sequences. Our tool offers a robust and rapid solution for comprehensively detecting genomic repeats, even in intricate sequence variations and complex repeat architectures.
Design and Implementation
We developed an easy-to-use command line and versatile application in Java that can identify various forms of direct and inverted repeat sequences that can be applied to any target genome. Our tool is implemented in Java (requires Java Platform, Standard Edition 23 or higher). User-friendly and accessible genomic repeat analysis software can help researchers identify repeat sequences efficiently and accurately in genomic data, which can further advance the understanding of the role of genomic repeats in biology. The tool runs locally, allowing users to analyse their data (a single or all files in the target folder). It takes the genomic sequence(s) and several parameters as input, defining the length of the k-mers and the type of analysis. In the first stage of the algorithm, identical k-mers that match in different regions of the sequences are lengthened pairwise by the maximum length for them.
In the initial stage of repeat identification, the software uses sequence screening for size-defined k-mers and identifies repeating regions of different lengths in the target sequence. The adjacent repetitive areas are combined to determine the boundaries of all types of repeats. Repeats shorter than the specified filter size will be ignored in further analysis. At the same time, a masked sequence is generated in which regions with repeats are highlighted by lowercase letters. After identifying all repeats and defining their boundaries, the software determines the homology of each repeat to other repeats. The identification of homologous sequences is achieved through the screening of matching k-mers. The minimum similarity of homologous sequences can be at least 60%. Homologous sequences are clustered together and will be output as a separate cluster. The result will be saved in a text file containing a list of clusters and coordinates of repeating blocks in each cluster. In the current version, k-mers should be identical and can be used as ‘seeds’ to detect homologous sequences. The default length of k-mers is 12 nucleotides but can be as short as 5. The recommended minimum size of k-mers is 18 or longer for large genomes. Homologous fragments are searched for both in the same orientation and for a complementary chain.
In the ‘quick search’ mode, the identified homologous sequences are subsequently grouped into their respective clusters. Although this mode is the fastest, there will potentially be some false clustering of these repeats for large genomes with different types of mixed repeats. This mode is recommended by default for analysing large genomic sequences (eg, chromosomes) to obtain an overview of global genomic rearrangements.
Under profile search mode (quick = false), the homologous sequences thus detected are grouped into an individual cluster and will not be clustered with overlapping sequences from different clusters. This mode allows the study of the complex structure of genomic repeats, including large telomeric or chromosomal duplications and rearrangements. This mode is relatively fast, does not use inter-clustering, and is the safest to analyse genomic rearrangements and general analysis of repetitive elements in the genome. Using this mode, it is possible to observe which clusters overlap and are related.
This mode enables the separation of different types of repeat sequences, a critical feature of the analysis since sequences typically consist of multiple repeat types. The software clusters repeated blocks for smaller genomes, such as those of bacteria, fungi, protozoa, and other target sequences. Overlapping blocks are automatically considered homologous and grouped into a common cluster. At this stage, however, different repeat elements may become mixed. This misclustering is especially common in eukaryotic genomes, where mixed elements can be distinguished by their copy number, as mixed repeats are less likely to be repeated elsewhere.
Many mobile elements contain low-complexity sequences, such as microsatellites or poly-A-rich regions, which must be accurately linked to their corresponding genetic elements and distinguished from other genome sequences. The algorithm effectively separates scattered low-complexity sequences from those associated with specific genetic elements. Furthermore, analysing the sequence complexity allows for the identification of perfect and imperfect microsatellite repeats, as well as short or long TRs.
The result is saved in a GFF (9-column, table-delimited) text file, a sequential list of clusters of homologous sequences indicating the start and end of each site, its orientation, and the sequence itself. When visualising the results, sequences in the forward orientation are indicated by blue, and those in the opposite orientation are indicated by red (mode with quick = false) in the output graphic PNG file.
The main achievement of our software is a rapid analysis of any repetitive sequences in both short and large chromosomes. The obtained masked sequence, GFF, and graphical PNG files allow for an understanding of the overall structure of the target sequence for any repeats and the evolution of these repeats. It is possible to trace different types of repeats and their relationship at the genomic level. In a graphical PNG file, each cluster is localised in a separate line along the current sequence. In the sequence scale, the first cluster appears at the top, with subsequent clusters arranged beneath it. If a cluster contains homologous sequences in different orientations, they are colour-coded: blue for forward orientation and red for inverted orientation. In cases where inverted repeats are analysed (with quick = false), both forward and reverse orientations are displayed but are grouped into separate clusters. This feature allows the detection of homologous sequences located in different parts of the sequence.
Chromosome analyses performed with the profile mode (quick = false) enable a detailed distinction of the repeat profile at the genomic level, revealing its characteristic repetitive structure. Notably, this level of repeat profiling is a unique capability of the software, unmatched by any other existing tools.
In addition, our algorithm eliminates the need for extensive alignment matrix calculations and does not rely on prior knowledge of repeat patterns. By employing a simple and efficient approach based solely on k-mer screening, it maps sequences and determines sequence similarity with high precision. The algorithm swiftly defines the boundaries of all repeats and calculates homologous distances between them using k-mer distance. Importantly, there is no size limitation on the repeats that can be detected, and the algorithm accounts for nucleotide substitutions between repeated copies.
Unlike traditional methods, our algorithm accurately identifies repeated sequences without relying on frequency pattern analysis, alignment, or suffix tree construction. It is designed to autonomously detect repeated sequences, regardless of their divergence or length, making it highly versatile. This adaptability allows the software to excel in detecting a wide range of repeat structures, including mobile elements, ribosomal genes, large genomic rearrangements, telomeric and centromeric regions, and other direct and inverted repeats. These repeated sequences may be situated at varying distances from each other and oriented in different directions, yet the algorithm consistently identifies them with precision.
The masked sequence contains all variants of repeat sequences without any initial classification. The identified related repeat sequences can serve as a platform for subsequent classification, allowing researchers to determine their relationships to established repeat element groups or even discover novel repeat types. To streamline this process, the software immediately categorises all identified repeats into clusters based on shared sequence features. However, it does not perform classification or detailed analysis of de novo repeats associated with genes, gene families, or mobile elements.
Classifying repeats into specific families remains one of the most challenging aspects of repeat annotation, as repeats within a genome can often be misgrouped, especially in the case of hybrid elements. This misclassification is common in eukaryotic genomes. Therefore, a more specialised analysis of each repeat type must be conducted using external tools. For example, established tools like RepeatMasker 25 or CENSOR tool 26 can be employed to compare query sequences against reference repeat collections (eg, https://www.girinst.org/).
While the current software does not handle detailed repeat family classification, the data generated from the clustered repeats can be used for further in-depth analysis. In profiling mode (quick = false), the software excels at analysing complex ‘repeat within repeat’ patterns and intricate sequence variants, including implicit and mixed repeat blocks. This advanced capability is particularly useful for detecting highly complex sequence variations and mixed repeat types.
Furthermore, the software provides accurate classification of perfect and imperfect microsatellite repeats, as well as short and long TRs that belong to various HOR structures, such as telomeres or large satellite sequences. This multifaceted functionality makes the software a comprehensive tool for researchers investigating the complex landscape of genomic repeats.
Results
Unveiling complex TRs in genomes: insights from our analysis
To illustrate the functionality of our tool, we performed analyses on different genomic sequences from different taxonomic groups of eukaryotes (animals, birds, plants, insects, fungi, and protists), bacteria, plastids, and giant viruses. We compared the de novo identification of repeat sequence with REPuter (https://bibiserv.techfak.uni-bielefeld.de/reputer), 27 RED, 21 RECON, and RepeatScout28,29 (Table 1). For comparison, the ability of RepeatMasker to identify only already characterised repeats was considered. For example, we showed that 2 inverted repeats in the chloroplast genome have a more complex structure, corresponding to inverted repeats within direct repeats (Figure 1). REPuter, a widely used software, employs a suffix tree algorithm to identify approximate repeats within genomic sequences. It has been particularly effective in detecting TRs and palindromic sequences. RED (REpetitive DNA detector) uses a graph-based approach to identify repeats, focusing on exact and approximate matches. RECON (REpetitive DNA CONsensus) employs a consensus sequence approach to identify repeats, aiming to provide a comprehensive view of repetitive elements within genomic sequences. Generic Repeat Finder (GRF) is a software for identifying repetitive elements within genomic sequences. It employs a simple yet efficient algorithm that scans the input sequences for repetitive patterns and reports their occurrences. However, GRF may lack the sophistication to handle complex repeat structures and may not provide comprehensive insights into the organisation and distribution of repetitive elements within genomes. RepeatScout, on the contrary, is a more advanced software tool designed to identify repeats through de novo discovery. It employs a suffix tree-based algorithm to identify significant repeats within genomic sequences without relying on pre-existing repeat libraries. RepeatScout excels in detecting novel repeat families and offers valuable insights into the evolutionary dynamics of repetitive elements within genomes. Furthermore, we considered the performance of RepeatMasker, a widely used tool for identifying repeats based on a library of known repeat elements. We evaluated its ability to accurately identify previously characterised repeats within the genomic sequences under analysis. Through our comparative analysis, we aimed to provide insights into the strengths and limitations of each software solution, with a focus on their applicability across different taxonomic groups and the comprehensiveness of repetitive element identification. Compared with existing software, our algorithm does not use multiple alignments for the de novo identification of repeat sequences; the algorithm eliminates the need for extensive alignment matrix calculations and suffix tree construction. The size of repeats that can be detected is unlimited, and the algorithm accounts for nucleotide substitutions between repeated copies. It is designed to autonomously detect repeated sequences regardless of variance in divergence and length. At the same time, it is possible to identify all repetitive sequences with a copy number of at least 2 copies and a homology of at least 60%. Although the algorithm concept for identification of repeat sequences may be similar to existing software, implementing these algorithms is different. The main advantage of our approach is the implementation of a graphical representation of identified repeats at the genomic level. Repeats can have an internal structure consisting of other repeats in both forward (highlighted in blue) and reverse orientation (highlighted in red). If we trace the structure of genomic repeats at the chromosomal level, we can determine their relatedness and evolutionary divergence without annotating their classification. Genomic blocks formed by duplication and located in different parts of chromosomes and orientations accumulate multiple changes in the process of evolution. However, we can identify the relatedness of these genomic blocks by using certain structural patterns. Our algorithm identifies low-complexity repetitive sequences characteristic of perfect and imperfect microsatellite repeats and any short or long TRs belonging to a wide range of repeat structures organised from telomeres or large satellite sequences and extended genomic repeats, which can make up the majority of chromosomal sequences. Therefore, in this tool, we do not limit ourselves to a particular repeat type but identify all available repetitive sequences, regardless of whether they belong to one repeat type. The examples below demonstrate genomic analyses for different organisms and structural sequences. Table 1 compares the software available for de novo identification of repeat sequence analysis. Our tool does not require additional steps, such as the individual compilation of certain blocks for a particular operating system or the installation of additional libraries, as it is a standard application for the Java Platform. The direct launch of the application is made in the command line with the call of the application and the target file or the path to the directory with files: java -jar Repeater2.jar <inputfile>.
The programme’s main features include a comparative analysis of genome-wide tools for de novo identification of interspersed repeats.

Genome-wide repeat profiling of chloroplast genomes was performed to identify interspersed and clustered repeats of (
This is a significant advantage over existing applications that require skill and are challenging for a user to deploy. The main differences concern the algorithms used in this software and the potential for further development. Our tool also leaves the potential for further development and improvements in the annotation of repetitions and their classification. Masked sequences can be used for subsequent analysis of homologous repeats within a particular chromosome and between different chromosomes of the same or other related species, followed by their annotation. Graphical representation of individual chromosomes for whole genome analysis will be a practical comparative and evolutionary analysis of a particular species and related species. Thus, mobile elements are identified in genomic sequences by their characteristic structural elements, namely LTRs, primer binding sites (PBS), and specific genes (gag, rt, int). 30 However, these limit the search for new types of mobile repeats and related elements. Therefore, the primary analysis should be based on identifying all repetitive sequences and analysing these repeats and their annotation.
Another example of complex direct repeats is the LTRs in endogenous retroviruses and retrotransposons, which also have a loop structure characteristic of inverted repeats despite being direct repeats. 15 This is a remarkable phenomenon, probably characteristic of most LTR-retrotransposons, in which long direct repeats are both inverted repeats, forming a loop structure between distal repeats. We observed loop structure patterns for direct repeats exemplified by MITEs in plant genomes and genomic repeats in eukaryotes and prokaryotes. Thus, most tandem elements are complex mixed repeated sequences with a ‘tandem within a tandem’ pattern.
This software identifies long tandems, typically either for retrotransposons or ribosomal genes; repeats of any length can be detected locally and throughout the genome. Repeats associated with the chromosome’s centromeric or other structural regions consist of a complex mixture of sequences included in each other or the neighbourhood. For example, chromosome 1A from the fungal genome of

Comparative analysis of a region (KY962518) of the
There is a direct correlation between genome size and the density of repeat forms. Plant genomes contain many repeats, similar to animal and human genomes. A unique feature of plant genomes is the occurrence of TRs of mobile elements, like ribosomal gene clusters. In addition, the main elements of the plant genome are LTR retrotransposons, which are numerous elongated tandem arrays whose copy numbers differ among plant lines or cultivars and are challenging to analyse and sequence. Plant genomes are also characterised by repeats originating from LTR-retrotransposons that form extended tandem blocks.
15
For insect genomes, telomeric regions are also characterised by the same extended tandem blocks of repeating LTR-retrotransposon identical copies. However, the uniqueness of insect genomes is not limited to the specificity of telomeric regions. Recent studies have examined 601 insect species and found significant differences in the dynamics of mobile genetic elements.
31
Our software analysed genome-wide de novo identification and visualisation of interspersed and TRs for insect genomes. In the insect genomes, we observed large arrays of diverse repeats of various origins, including very long repeats. The insect genomes of

Genome-wide profile of all repeats for the small hive beetle (

Genome-wide profile of all repeats for insect genomes using
Genome-wide de novo identification and visualisation of interspersed and TRs was performed for nucleomorphs. Nucleomorphs represent some of the smallest genomes that have been sequenced and are characteristic of both cryptomonads and chlorarachniophytes, converted to a similar size from larger genomes. The terminal and concurrent telomeric regions of all 3 chromosomes of the nucleomorph

Genome-wide profile of all repeats in the 3 chromosomes of the nucleomorph
Comparative analysis of the genomes of 8 sequenced

The extrachromosomal replicon of
Comparative analysis of the genomes of 8 sequenced
The equally complex structure of the giant virus (

Genome-wide profile of all repeats for the complete genome of the
Using CENSOR software, we analysed the repeats for the
The task of clustered regularly interspaced short palindromic repeat (CRISPRs) array detection is to search for short, direct repeats (approximately 20-50 bp) in archaeal and bacterial genomes that can form CRISPR array repeats and then evaluate the identified putative array.
35
Identification criteria included the length of the repetitive sequence and the similarity between repeats within the array. Therefore, we used parameters for short fragments not exceeding 20 to 30 nucleotides to identify the entire repeat-spacer array of the CRISPR loci. Supplemental Figure S3 shows an example of identifying the CRISPR repeat-spacer array in the
Conclusion
Despite the exponential growth in the diversity of genomes, it is evident that the rate at which the scientific community contributes to repeated element databases has not kept up, hindering efficient annotation and comprehensive study of repeated elements across various groups. Thus, there exists a significant opportunity for the field of biodiversity genomics to thoroughly embrace and prioritise research on repeated elements. We developed a unique and flexible tool for the de novo identification of diverse forms of direct and inverted repeated sequences interspersed and clustered in genomic sequences. The algorithm can identify all types of repeated sequences, including perfect and imperfect microsatellite repeats and any short TR belonging to a wide range, organised into HOR structures of large satellite sequences and telomeres. This software is a susceptible and automated method for repeated sequence identification. Classification and comprehensive analysis of the detection of de novo repeats associated with genes and their families and mobile elements were not performed using our tool. The main features of our software include a novel, rapid approach that does not rely on alignment or suffix tree construction, versatility in identifying various types of repeats, ability to handle large genomes, and visual representation of repeat profiles. Our software was tested on animals, insects, plant chromosomes, chloroplasts, and prokaryotic and giant virus genome sequences. It was rapid, efficient, and simple to use, with a user-friendly interface. Novel complex repeats were predicted in multiple genomes. In addition, the software can provide information on the location and distribution of these repeat sequences within the genome. This can help researchers identify potential regulatory regions or areas prone to genetic instability. By providing a user-friendly interface and comprehensive analysis options, this accessible software greatly facilitates the study of genomic repeats and their impacts on biological processes. Furthermore, the ability to run the tool locally ensures data privacy and allows customised analysis parameters tailored to specific research needs.
Supplemental Material
sj-docx-1-bbi-10.1177_11779322241306391 – Supplemental material for Genome-Wide Tool for Sensitive de novo Identification and Visualisation of Interspersed and Tandem Repeats
Supplemental material, sj-docx-1-bbi-10.1177_11779322241306391 for Genome-Wide Tool for Sensitive de novo Identification and Visualisation of Interspersed and Tandem Repeats by Ruslan Kalendar and Ulykbek Kairov in Bioinformatics and Biology Insights
Footnotes
Acknowledgements
We thank the core facility at Nazarbayev University for helping with oligonucleotide synthesis and DNA sequencing.
Funding:
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This study was funded by the Committee of Science of the Ministry of Science and Higher Education of the Republic of Kazakhstan (grant no. AP23490594, BR18574184, BR18574184, BR24993023, BR24992841) and Nazarbayev University funding CRP grant 021220CRP2222.
Declaration of conflicting interests:
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Author Contributions
All authors have revised the manuscript accordingly.
Supplemental Material
Supplemental material for this article is available online.
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
