Abstract
Background
Variable minisatellites count among the most polymorphic markers of eukaryotic and prokaryotic genomes. This variability can affect gene coding regions, like in the prion protein gene, or gene regulation regions, like for the cystatin B gene, and be associated or implicated in diseases: the Creutzfeld-Jakob disease and the myoclonus epilepsy type 1, for our examples. When it affects neutrally evolving regions, the polymorphism in length (
Motivation
In these tandem repeat sequences, different mutational mechanisms let the number of copies, as well as the copies themselves, vary. Especially, the interspersion of events of tandem duplication/contraction and of punctual mutation makes the succession of variant repeats much more informative than the sole allele length. To exploit this information requires the ability to align minisatellite alleles by accounting for both punctual mutations and tandem duplications.
Results
We propose a minisatellite maps alignment program that improves on previous solutions. Our new program is faster, simpler, considers an extended evolutionary model, and is available to the community. We test it on the data set of 609 alleles of the MSY1 (DYF155S1) human minisatellite and confirm its ability to recover known evolutionary signals. Our experiments highlight that the informativeness of minisatellites resides in their length and composition polymorphisms. Exploiting both simultaneously is critical to unravel the implications of variable minisatellites in the control of gene expression and diseases.
Availability
Software is available at http://atgc.lirmm.fr/ms_align/
Introduction
Polymorphic tandem repeats
Polymorphic tandem repeat loci, also known as Variable Number Tandem Repeats (VNTR), are widely used as genetic markers in population genetics, gene mapping, and forensic medicine (Jeffreys et al. 1985). Microsatellites, which are tandem repeats of a 1–6 bp long motif, show a frequent variability in their number of repeats. The expansion in some triplet microsatellites forms the molecular basis of a dozen inherited neurodegenerative diseases (Cummings and Zoghbi, 2000). Polymorphism was observed in another class of tandem repeats with motif size in the 7–100 bp range, the minisatellites. Unlike microsatellites, unstable minisatellites display not only repeat number variability, but also sequence heterogeneity between repeats. The succession of the variant repeats along a minisatellite allele can be obtained by a specific method called Minisatellite Variant Repeat-PCR (Jeffreys et al. 1991) (MVR-PCR). It provides a
Variable minisatellites were also shown to be involved in the development of inherited diseases: either by influencing gene transcription, like in the progressive myoclonus epilepsy type 1, or by being part of a coding sequence, like in the prion protein gene, which is responsible for the Creutzfeld-Jakob disease (Buard and Jeffreys, 1997; Bois and Jeffreys, 1999). The informativeness of unstable minisatellites has led to their widespread use for individual identification, parentage analysis (Jeffreys et al. 1985), and for discrimination between bacterial strains, including anthrax strains (Le Flèche et al. 2001).
Minisatellite variability and turnover processes
Comparison between the internal structure of alleles has been shown to be the key to elucidate the mechanism of minisatellite expansion and deletion for several human autosomal GC-rich minisatellite loci. Complex rearrangements involving transfer of groups of repeats between alleles as well as intra-allelic duplications have been deduced by alignment “by eye” between MVR maps of progenitor and new length alleles (Buard and Vergnaud, 1994; Jeffreys et al. 1994; May et al. 1996; Stead and Jeffreys, 2000; Tamaki et al. 1999). However, as the rearrangements may completely reshuffle the repeat array in a single generation (Tamaki et al. 1999), the parental relationships is easily lost when studying more distantly related alleles (
Many variable minisatellites with mutation rate below 1% per gamete have been reported in human and in other species. For instance, in the human insulin minisatellite, the variation is mainly due to the gain or loss of one repeat, which occurs at a rate of 10–3 per gamete, while complex inter-allelic rearrangements happen at much lower frequency (10–5) (Stead and Jeffreys, 2000). In other eukaryotic species, no hypervariable minisatellites were discovered (Bois and Jeffreys, 1999). Detailed investigation of murine minisatellites provides evidence of a variability dominated again by simple intra-allelic duplication occuring at a rate of 10–4 per gamete (Bois et al. 1998). This suggests that these minisatellites can serve for evolutionary population studies. Indeed, alignments between MVR maps have also been used to deduce the evolutionary relationships between alleles for the study of recent human population history (Alonso and Armour, 1998; Armour et al. 1993; Hurles et al. 1998; Jobling et al. 1998; Stead and Jeffreys, 2002). Both the potential investigative power of variable minisatellites for evolutionary studies and their use for identification has been limited by the lack of computerized methods to objectively compare alleles.
The evolutionary events at work in minisatellite turnover are divided into inter- and intra-allelic events. Inter-allelic events meanrearrangements between the two alleles of an autosomal minisatellite, while intra-allelic events comprise the amplification and contraction of a repeat or of a block of consecutive repeats, as well as the nucleotidic mutations inside the repeats. For the acquisition of MVR maps, the limits of the variant repeats are chosen arbitrarily, and when comparing maps, duplication events are assumed to copy complete variants (and not, for example, a variant and the half of the following variant). However, the mechanisms of DNA duplication may duplicate any segment of DNA inside the minisatellite, and their templates do not always correspond to complete repeats. Therefore, comparison of minisatellite maps relies on the assumption that the boundaries of the variant repeats are fixed and that duplications copy complete variants. This assumption (discussed from the algorithmic view-point in (Benson and Dong, 1999) and (Rivals, 2004)) may not be satisfied for all minisatellites, but seems generally valid for polymorphic tandem repeat loci.
Existing algorithms for minisatellite comparison
Several methods to compare MVR maps were published recently. All of them consider solely intra-allelic evolution. The statistical similarity measure defined in (Brion et al. 2002) computes a weighted sum of the number of shared variants when the two maps are compared at different relative positions. This measure depends to a great extent on the weight function used; in addition, the distance based on it is not a metric, which is a serious drawback for phylogenetic reconstruction. Alignment of minisatellite maps under a specific evolutionary model that considers indels, substitutions, but also tandem duplications and contractions of variants was first described in (Bérard and Rivals, 2003). There, as well as in (Behzadi and Steyaert, 2004; Behzadi and Steyaert, 2005), duplications and contractions are limited to a single variant (
Other works aim at improving the efficiency of this algorithm. A first method performs a Run-Length Encoding of the maps to reduce the complexity of the procedure (Behzadi and Steyaert, 2004). Run-Length Encoding (RLE) is a data compression technique in which a stretch of the same letter is coded as a power of this letter,
Biological validation
To validate our algorithm, we choose a data set from a minisatellite on the human paternally inherited Y chromosome, called MSY1. Most of the Y chromosome is haploid and thus escapes recombination. It contains the whole range of polymorphic systems observed in the human genome with mutation rates varying from 5.10–7% for base substitutions to 3.8% for the hypervariable minisatellite MSY1 (DYF155S1) (Andreassen et al. 2002). Probably due to its obligatory intra-allelic mode of mutation (no complex rearrangement between alleles) and to its AT-richness, MSY1 evolution is dominated by the gain or loss of a single variant (Andreassen et al. 2002), and is thus adequate to the model we hypothesized in this work. As a result, MSY1 alleles have a simple modular structure of variant repeats compared with the intermingled pattern of variants resulting from recombinational exchanges at hypervariable autosomal minisatellites (Figure 1 and also (Jobling et al. 1998)). Homogenization of the variants along the array has been observed at the locus MSY1, suggesting the occurrence of complex rearrangements like at autosomal loci. However, this phenomenon is restricted to alleles belonging to a single Y haplogroup (Bouzekri et al. 1998).

Examples of MSY1 maps of 11 individuals. The columns indicate the haplotype code, his population of origin, his haplogroup, and the MSY1 map (Jobling et al. 1998). For space reasons, the maps of the alleles
MSY1 haploid nature also avoids physical separation of the two alleles, which is a technical bottleneck for autosomal minisatellites. Consequently, a large number of MSY1 alleles originating from males all over the world were typed by several authors. The Y chromosome represents a unique system for comparing inter-chromosome relationships established with MVR maps and those deduced from more stable markers. (Jobling and Tyler-Smith, 2000), studied the evolutionary relationships between stable markers on the Y chromosome. They defined haplogroups using these markers and reconstructed a most parsimonious tree for them. The availability of known precise relationships between a large set of alleles is a major reason for the choice of MSY1.
The purpose of our experiments was to investigate whether known phylogenetic relationships between Y chromosomes could be independently recovered from the alignments of MSY1 MVR maps. Moreover, as the MSY1 MVR maps were obtained from individuals taken from different populations, we could check whether the coalescence of MSY1 haplotypes we inferred inside a Y chromosomal haplogroup agrees with the populations' relationships. We analyzed 609 alleles of MSY1 with our program and found inter- and intra-haplogroup relationships that are consistent with the evolution of the Y chromosome.
The remaining of this paper is organized as follows. In Section 2, we provide a notation and define an evolutionary model for minisatellites. Section 3 explains the alignment algorithm. Section 4 describes the experiments performed on 609 alleles of the MSY1 human haploid minisatellite to validate our program MS_ALIGN. We conclude in Section 5.
Evolutionary Model
Notations
Let Σ be a finite alphabet of variants. A map
Evolutionary model
Tandem repeat sequences undergo two kinds of evolutionary events: point mutations (substitutions and indels) that act on nucleotides and modify the minisatellites variants, and specific events acting on a complete variant: tandem
These five operations are gathered under the term
The alignment cost under this model is a metric (see the proof of (Bérard and Rivals, 2003)). It matters when using the cost as a distance, for example in phylogenetic reconstruction.
Segmental operations
The set of operations of our evolutionary model induces “long distance” dependencies. As an example, let us consider the following two generations of sequence
The first one contains 2 amplifications and 1 mutation though the second one contains 2 amplifications and 2 mutations. In the first generation, the last character
Our alignment algorithm is composed of two phases:
Computation of the costs of all possible segmental operations in each map; Alignment of the 2 maps by dynamic programming, taking into account both the elementary and the segmental operations computed at step 1.
This phase precomputes the costs of all segmental operations for each map. For a map, say
In
For each pair (
The initialization computes the generation costs of all one-character substrings,
The main recurrence is decomposed in 3 cases, it uses the intermediate table For the table Duplicate the symbol Generate the prefix or symmetrically, generate the prefix from ε, and generate the suffix from symbol To generate Finally, to generate
This preprocessing is performed on both

To align globally maps
the compression of a suffix of
the generation of a suffix of

Second phase of calculation. Two possible segmental compressions: (
This explains the following recurrence for A which takes the minimum between 4 possibilities:
Remember that elementary operations are trivial cases of segmental operations as illustrated in Figure 3: when
We provide a novel program (MS_ALIGN Version 2) to compare minisatellite maps. The set of events authorized comprises all elementary events that occur in the intra-allelic evolution of a minisatellite. The evolutionary model has been extended to account for variable mutation costs. Note that the model can be extended to enable all costs to be variant dependent. In the original model ℳ

An alignment produced by MS_ALIGN: the alignment between the maps
We also envisaged reducing the complexity of our algorithm
Independently of the costs
Nevertheless, in practice MS_ALIGN can compare several hundred maps (the size of the largest available data sets) in a reasonable time, which enables the user to test the robustness of the alignments by varying the event costs.
Our algorithm is implemented in a program called MS_ALIGN, which can be run through a web interface. For more details, we refer the reader to the section User's guide of the website http://atgc.lirmm.fr/ms_align/. The distances matrix output by MS_ALIGN can be directly used as input of standard phylogenetic reconstruction programs.
Biological validation on MSY1
In Section 3, we proposed an algorithm to align minisatellite maps under the evolutionary model detailed in Section 2. An optimal alignment of two maps represents a series of mutations needed to transform one map into the other. The alignment score can serve as a weighted measure of the evolutionary distance between these two maps. Thus, we wanted to know whether this measure is adequate to infer evolutionary relationships between the haplotypes represented by these maps. To this purpose, we tested our program on real biological data. We chose a large set of 609 MSY1 alleles of men originating from 76 different populations. The main reason for this choice resides in the availability of known evolutionary relationships for the Y chromosomes, and the knowledge of the Y-chromosomal haplogroup of each individual of the data set. Indeed, human Y chromosomal haplogroups have been defined from the analysis of more stable Y marker systems, mainly SNPs, and an evolutionary tree for these haplogroups have been proposed (Jobling and Tyler-Smith, 2000). Therefore, we used MS_ALIGN to compute metric distances between pairs of alleles and infered phylogenetic trees from these distances using BIONJ (Gascuel, 1997). We could then investigate whether known phylogenetic relationships between Y chromosomes could be independently recovered from alignments between MSY1 MVR maps.
Data set
Mark Jobling provided us with the MSY1 maps of 690 individuals originating from 76 populations. Figure 1 displays examples of MSY1 maps. The MSY1 variants are 25 bases long and 7 different variants have been identified. Besides these variants, some maps display unidentified variants, which are termed
Although a new nomenclature of Y chromosomal haplogroups has been published recently (The Y Chromosome Consortium, 2002), we use the nomenclature of (Jobling and Tyler-Smith, 2000) in the sequel. The new nomenclature has defined a larger number of haplogroups and inferred an evolutionary tree of their relationships using SNP data. However, the correspondence between the old and new nomenclatures is complex: for instance, the old haplogroup 2 is now split between clades B, G, and I of the new haplogroup tree. Without further information, it is not possible to determine to which of the new haplogroups an individual in our data set belongs, nor to compare our results to the haplogroup tree of new haplogroups. However, the data set based on the nomenclature of (Jobling and Tyler-Smith, 2000) is relevant for validation purposes.
The costs we used are
Haplogroup prediction with MSY1 maps
Using the matrix of pairwise distances between individuals and the
An evolutionary tree of the Y-chromosome haplogroups derived from MSY1
From the matrix of distances between the individuals, we compute the matrix of average distances between the haplogroups. We then let BIONJ (Gascuel, 1997) reconstruct an evolutionary unrooted tree for the haplogroups with these distances as input. This tree is shown in Figure 5(

(
In conclusion, the MSY1 tree agrees with the SNP tree for the most recent levels of evolution, which is consistent with the hypervariability of MSY1.
We constructed two trees for the Finnish and the Mongolian populations (Figures 6 and 7). A remarkable feature is that subsets of maps of the same haplogroup cluster together. Clearly, the structures of both trees reflect the relationships of the haplogroups in the MSY1 haplogroups tree, although the latter is based on average haplogroups distances. Hence, when the trees are estimated from the raw distances, they corroborate the partition in haplogroups and the relationships between these haplogroups.

Phylogenetic tree of the Finnish population. Each haplotype is described by its code and its haplogroup (between braces). The MVR map of each haplotype is shown in its Run-Length encoded form between parentheses: a map is a sequence of stretches of identical variants. Each stretch is represented by the variant type number (as in Figure 1) with a power equal to the number of variants in the stretch;

Phylogenetic tree of the Mongolian population. Each haplotype is described by its code and its haplogroup (between braces). The map of each haplotype is shown between parentheses (see Figure 6 for description).
The Y chromosomes tend to be more closely related to one another inside a population than autosomal chromosomes. This and the difference in behavior between gender result in geographical specificity of the Y chromosomes. By computing evolutionary trees for the haplotypes within a haplogroup, we could check whether the alignment of MVR maps is able to recover a geographical clustering of the haplotypes.
For these experiments, we choose haplogroups that contain at least two different populations with at least two maps in each (otherwise it is impossible to detect a population separation). We focus on haplogroup 2 (European populations), 4 (Asian populations), and 16 (European and Asian populations). For haplogroup 2, which includes mainly European haplotypes, alleles of different populations are neighbors in the tree irrespective of their geographical origins (tree not shown). A reason for this is the fact that Europeans populations do not live isolated from each other and have largely exchanged their genetic material. The tree of haplogroup 4 (see Supplementary Material) perfectly separates 10 Japaneses from 3 Tibetans and 1 Mongolian. The Japanese haplotypes coalesce in a single clade. These populations live in geographically distinct area of Asia and their haplotypes cluster well in the tree.
The tree of haplogroup 16 (Figure 8) contains 57 MSY1 haplotypes, among which 3 populations are represented by several individuals: Mongolian (23), Finnish (10), Yakut and Siberian Yakut (13 + 5). In the tree, the Yakuts are monophyletic, all Finns but two form a monophyletic group, and the Mongolians agglomerate together (with one Japanese, two Finns, two Russians, and two Norwegians) and branch out between the Finnish and Yakut subtrees. Here, for haplogroup 16, whose number of haplotypes is much larger than haplogroup 4, we observe a high level of population specific coalescence. Apart from a few individuals in the Mongolian group, geographical separation appears clearly in this tree and agrees well with the geographical specificity of the Y.

Phylogenetic tree of haplogroup 16. For each haplotype, we give its code, its population of origin between square brackets, and its MVR map between parentheses (see Figure 6 for description). The maps of this haplogroup were compared pairwise with MS_ALIGN, and the resulting distance matrix was used to infer an haplotype evolutionary tree using a Neighbor-Joining method (Gascuel, 1997). Among the 57 maps of haplogroup 16, most belong to three main populations: Mongolian (23), Finnish (10), Yakut and Siberian Yakut (13 + 5). The Yakuts are monophyletic, all Finns but two form a monophyletic group, and the Mongolians agglomerate together (with one Japanese, two Finns, two Russians and two Norwegians) and branch out between the Finnish and Yakut subtrees.
Here, we presented a novel method for the alignment of minisatellite maps, which considers an extended evolutionary model with variable mutation costs. It improves in simplicity and in computational time upon previous solutions. Moreover, the program MS_ALIGN can be used through a web-interface and is available upon request from the authors. We have applied our method on a large real data set from the human haploid hypervariable minisatellite MSY1. The alignment distance enables us to recover known phylogenetic relationships between Y-chromosomal haplogroups, showing the validity of the approach. In tentative experiments, we investigate the coalescence of alleles within haplogroups, and the outcome suggest the method could prove useful for micro-evolutionary studies. Our results highlight that the informativeness of minisatellites resides in their length and composition polymorphisms, which can both be exploited simultaneously.
MS_ALIGN can be also used to analyze other types of tandem repeats.
To further validate our program for a wider range of minisatellite, we tested it with the variable GC-rich autosomal insuline minisatellite (INS). A study of the structural diversity of INS alleles could assign the alleles in three lineages called classes I, IIIA, and IIIB. Visual inspection and multidimensional scaling further divided class I into classes IC and ID (Stead and Jeffreys, 2000). In an experiment similar to those performed with MSY1 data, we compared the set of 181 INS alleles published in (Stead and Jeffreys, 2000) with MS_ALIGN, and reconstructed a coalescence of these alleles from the resulting alignment distances (tree available as Supplementary Material). In this tree, the classes IC, ID, IIIA, and IIIB are all monophyletic, and the main split separates classes IC/ID from classes IIIA/IIIB. Again, by comparing the alleles of a variable GC-rich minisatellite, our alignment tool could infer automatically the distinct classes of alleles. This suggests that MS_ALIGN could be well suited for deciphering the evolution of unstable, but non hypervariable, minisatellites.
Tandemly repeated protein sequences are also amenable to analysis. In an other work, our program was succesfully applied to decipher the evolution of a large family of proteins that contain a variable tandem repeat in the N-terminal parts of their sequences, the Pentatricopeptide Repeat family in
A limitation of MS_ALIGN for minisatellite analysis is the restriction on duplication and contraction to operate on a single variant, and not on a block of consecutive variants. In cases of block duplications, MS_ALIGN overestimates the allele distance. In both MSY1 and INS, some alleles provide evidence for block duplications (
To authorize block duplications/contractions in minisatellite alignment makes the evolutionary model more general and even more realistic, but increases the complexity of the alignment procedure (as in (Sammeth et al. 2005)). A major challenge for future developments will be to generalize the evolutionary model (
Footnotes
Acknowledgements
We thank Mark Jobling for providing us with MSY1 MVR maps. We are grateful to Erik Desmarais and Philippe Jarne for their comments on earlier version of the manuscript. We also thank Valentin Guignon for his help in the development of MS_ALIGN. We thank the anonymous referees for their insightful comments. All authors are supported by a research grant from the “Action Concertée Incitative—Informatique, Mathématiques, Physique pour la Biologie” REPEVOL.
Additional Material
The file SupplMat.pdf is available at: http://www.lirmm.fr/~rivals/REPEVOL/MSY1/.
