Abstract
Some genetic systems frequently present ambiguous data that cannot be straightforwardly analyzed with common methods of population genetics. Two possibilities arise to analyze such data: one is the arbitrary simplification of the data and the other is the development of methods adapted to such ambiguous data. In this article, we present an attempt at such a development, the UNIFORMAT grammar and the GENEE[RATE] tools, highlighting the specific aspects and the adaptations required to analyze ambiguous nominal data in population genetics.
Keywords
Introduction
Diploid systems are characterized by genotypes with two alleles at each locus: one inherited from the mother and the other from the father. The typing technologies used to determine these alleles do not always reveal exactly two alleles; sometimes only one is seen and sometimes more than two seem to be present. An illustration of such data is provided by the typing of the major histocompatibility complex in humans, generally referred to as human leukocyte antigen (HLA), routinely made for clinical purposes and also used for studies of human peopling history and population genetics.1–9 Such data, usually designated as ambiguous, have the characteristic of not providing a single two-allele genotype (possibly two identical alleles) for every individual typing. Ambiguous data, then, indicate an incertitude in the determination of the two alleles of the genotype of an individual rather than the occurrence of more than two alleles in an individual's genotype.
Essentially, all classical methods of population genetics have been conceived for data without ambiguities and, therefore, are not appropriate to handle ambiguous data. This is even more so for nominal data, where alleles are the distinct forms, potentially infinite as described by Kimura and Crow (see Hedrick 10 ), possible for a given locus. A common approach is to preprocess the data in order to eliminate ambiguities and hence obtain single two-allele genotypes. 11 The problem with such approaches is that they always involve arbitrary decisions that, at least in the long run, are inconsistent. For instance, some alleles now well identified with sequencing techniques were initially seen in ambiguous data typing, but they were reported as not present as a result of preprocessing treatments. 12 To tackle this problem, we have applied the alternative approach of adapting population genetics methods to ambiguous nominal data (see references below).
To achieve a generalization of population genetics methods adapted to ambiguous and highly polymorphic diploid data, a number of steps were involved. A fundamental first step was to establish a well-defined format that could clearly and explicitly describe ambiguities, that is, UNIFORMAT, which is described in the following section. A second step was to adapt, prove, and validate algorithms, encode them as programs, and test them empirically. A third point to address was how to handle specific difficulties related to the generalized methods. For instance, when estimating genetic frequencies, it was necessary to take into account the diagnosis of the estimation procedure, such as the number of distinct solutions and its control based on the convergence of the log likelihood. It was also necessary to find strategies to deal with the possible effects of indistinguishable ambiguities (allele blocks), and a further challenge was to find alternatives to essential equilibrium measures, such as Hardy–Weinberg equilibrium (HWE) and linkage disequilibrium (LD), that could take into account the extra information provided by the ambiguities and avoid statistical tests whose assumptions are likely to be broken by ambiguous data.
In this article, we provide an updated description of UNIFORMAT and a unified account of the methods using the GENE[RATE] tools, including references to some applications already running and published studies that use them. As far as we know, the approach presented here is the only one currently available, which is able to handle both highly polymorphic (typical samples include hundreds of alleles per locus and tens of thousands of two-locus haplotypes) and ambiguous data.
The GENE[RATE] framework
The algorithms and programs described in this manuscript have been developed in the context of analyses of ambiguous HLA data.2,13–15 These tools have been used in a number of studies already published, but here, we provide an integrated description of all the tools and a number of updates and improvements. Furthermore, we want to present this suite of tools in a larger, non-HLA context of ambiguous diploid data for which they are also expected to be useful.
The current implementation depends on other software and computer libraries in addition to the key components presented in the following section, all of which constitute a suite of facilities to perform routine tasks in population genetics data analyses. The GENE[RATE] tools are available online at two addresses: the original page: http://geneva.unige.ch/generate and a version tightly integrated with an HLA database, http://hla-net.eu/tools, whose usage is described in a previous work. 16 Examples of UNIFORMAT files and their preparation using common computer programs are provided in the website. These include text documentation, the slides of a tutorial presentation, and a screencast on the preparation and validation of UNIFORMAT files.
The essence of the framework is a suite of computer programs that handle highly polymorphic and ambiguous data organized around a data format that allows for the expression of complex phenotypes. The GENE[RATE] tools currently include utilities to convert files from and into a few data formats and to recode datasets by transliterating some alleles or combinations of alleles into others - an operation required, for instance, in studies involving several population samples, all of which are not typed at the same time or in the same laboratory. The tools also include programs to estimate a number of one- and two-locus parameters relevant for population genetics, whose rationale is presented in the following section.
Methods
UNIFORMAT version 3
To describe ambiguous data, we formally defined the UNIFORMAT grammar. 17 This grammar allows the user to express all kinds of ambiguities that can occur in diploid data and includes abbreviations that may help express some usual cases, such as possible but uncertain homozygous, one “known” allele and several possible second alleles and untyped loci. Examples of data written in UNIFORMAT are given as follows:
The formal specification of the current version of UNIFORMAT in a BNF-like form (see Levine 18 for BNF-like descriptions), slightly simplified from the one actually used in the parser implementation by omitting terms related to error control, follows below:
In easy-to-read language, UNIFORMAT can be described as follows: a sample is a file where each line represents an individual; each line starts with an identifier and is followed by the phenotypes for the locus or loci of the individual; all the parts are separated by white space; each locus phenotype is a string (without spaces inside it) where allele pairs (that we call alps) are separated by a vertical bar, “|”, and alleles of a pair by a comma, with the proviso that the left allele is “smaller” than the right one. This last condition prevents an allele pair from appearing under two different forms (such as a1, a2 and a2, a1), which is likely to introduce confusion. It also enables simplification of the algorithms, allowing for huge speedups during computer-intensive calculations because no comparisons are required. The condition assumes that allele names can be ordered in some way, for instance, alphanumeric sorting, or, for computer implementations, numeric sorting (easily implemented by translating allele names into numbers representing their positions in a list of allele names). Allele names can also be used to describe haplotypes, by separating the locus allele names with a tilde sign (“~”). All the semantics of single allele names are valid for haplotype names, and currently, these can be mixed, which might lead to strange, but useful expressions. For instance, A*01~B*08, A*02 represents a typing for which one haplotype is known to be exactly A*01~B*08, while the other haplotype is only known to have A*02 at its first locus. Actually, haplotype data can be represented as a special case of (a highly polymorphic and ambiguous) single locus data.
Finally, allele names (ALLELE) and identifiers (IDENT) can use any of the following characters: alphabetic and numeric characters and also characters from the set {*, +, /, −, _,', : ,~}. Other characters will produce invalid names, and some do have special meaning; hence, LOCI_SEP, the locus separator, is a horizontal white space; AL_SEP, the allele separator, is a comma; ALP_SEP, the alps separator, is a vertical bar; untyped loci are marked with the “@” sign corresponding to AL_UNDET or AL_UNDET_NO_BLANK; a multiple allele, MULTI_AL, is just a suite of alleles (ALLELE) separated by an ampersand, “&”; and EOF means the end-of-file.
The main changes introduced by UNIFORMAT version 3 are the simplification of the locus separator, which can now be any white space (any number of tabs and spaces) and not necessarily one tabulation mark as before, and the integration of the treatment of the haplotype notation. These simplifications are possible because, in this version, white spaces cannot occur inside single locus phenotypes. Although no other changes were made to the grammar definition, its actual implementation as a parser has been much improved by using hash tables to store allele (and haplotype) names, allowing for a faster encoding of input files and decoding of results into output files. The current implementation also allowed the integration of the previous tools into a single tool for validation and recoding. Overall, this simplifies the creation and manipulation of UNIFORMAT files, while maintaining compatibility with version 2 files. Files in UNIFORMAT version 1 can no longer be used because of the HLA nomenclature changes effective in 2010, 19 but they can still be converted to newer versions using the validation tool.
Tools for using UNIFORMAT effectively
The grammar defined earlier has been implemented in a program, UNIFORMAT, that checks if a file is a valid UNIFORMAT file, performs all abbreviation expansions, and, in case of errors, returns an indication of the lines where the errors occur.
The same program can also perform recoding, or transliteration, of valid UNIFORMAT files. Recoding or transliteration is an operation often required when working with population data samples coming from different laboratories, or not typed with the same techniques, or typed at different times using different allele definitions. The transliteration facility makes this operation much simpler, as a single transliteration file can be used for all data sources that are used in a project. The transliteration file is just a list of old and new names; an example is provided in the web page of the tool.
There are many reasons for which “file conversions” are needed; that is why there are many options for this tool. When the data are compatible (usually this means not ambiguous), both forward and backward conversions are available. The formats that have been included in this tool are the ones that we have most commonly found in our practice, but they may be extended in the future.
Some people develop their own typing kits, and, for those, PHENOTYPE might be the tool of choice for a fast and accurate interpretation of the results. This tool expects two inputs: a probe-definition file and a typing-reactivity results file. The output is a UNIFORMAT file that provides the interpretation of the reactivities in terms of allele pairs required to explain them. The originality of this tool is the use of allele pairs rather than lists of alleles, thus avoiding spurious ambiguities such as those resulting from the use of NMDP codes, see discussions in Buhler et al and Sanchez-Mazas et al.14,20
Frequency estimation
All methods based on the expectation-maximization (EM) algorithm 21 are actually generalizations of the gene-counting method initially proposed by Ceppellini et al. 22 , which were extended to haplotypes in 1995.23–25 Implementations of the gene-counting methods were further extended to deal with ambiguities (initially a fixed number, around 200; personal communication by J. Clayton in 1995, personal communication) and further generalized to any number of ambiguities and loci around 2005. 26
The principles of the GENE[RATE] implementation of frequency estimation have been succinctly described, 17 and the relevant mathematical details appear in the Appendix. Basically, it is an implementation of the gene-counting method that allows the use of ambiguous data by representing them as alternative allele pairs. What was and still is particular to the GENE[RATE] implementation is that the algorithm is controlled by an EM algorithm rather than by the changes in frequency estimates (as in the original gene-counting method and most current EM implementations). The algorithm itself is implemented in such a way as to report quality information about the estimates, ie, the number of iterations until convergence and the number of distinct solutions found. This is indeed essential because the estimation procedure does not necessarily have a unique solution. 27 The assessment of the uniqueness of the solution is made by using multiple random starting points and checking if they all converge to the same maximum, as in Excoffier and Slatkin, 24 but, unlike these authors, we report all distinct solutions found and not only the one with the largest log likelihood. If the solution is unique, it means that we have good estimates; if there are multiple solutions, the results should not be used as frequency estimates. Common ways to tackle the problem of getting multiple solutions include increasing the sample size, reducing the level of resolution of the typing, or recoding some indistinguishable alleles as a single entity. Sometimes these remedies produce the desired effect (ie, a single solution), but sometimes they do not, meaning that there are no acceptable maximum likelihood frequency estimates. From our experience, this is very rarely the case with one-locus alleles and rare with two-locus haplotypes, whereas it happens more frequently with highly ambiguous data, possibly with several loci, and with not so large sample sizes. In practice, this has rarely been a problem for us. The effect of sampling variation is much more important than variation due to the use of ambiguities in the estimation process, as shown in a previous study. 28 The possibility of dealing with ambiguities requires scrutiny from the user, especially if the relative amount of ambiguities in the sample is large. An account of the questions raised by the use of ambiguities is given in Buhler et al. 14
A further extension of the gene-counting estimation implemented in GENE[RATE] is the replacement of Hardy–Weinberg with a more general model that includes one parameter to allow for deviation from Hardy–Weinberg proportions (see mathematical details in Appendix). This model is a key element for efficient Hardy–Weinberg testing with our method, as described in the following section.
Testing HWE
Testing for HWE on HLA data presents a number of difficulties, such as non-observed genotypes, ambiguous data, and the presence of a recessive-like blank allele. To avoid such difficulties, we considered an approach using a likelihood ratio test (LRT). The test consists of comparing the likelihood of frequency estimates under the Hardy–Weinberg model with the likelihood of frequency estimates under a model that generalizes HWE by including an extra parameter (such as an inbreeding coefficient). Thus, we consider two models: one of them being a particular case of the other. Formally, the full model is a function of both the allele frequencies and a parameter, F, which measures deviation from the Hardy–Weinberg model:
The usual Hardy–Weinberg model just sets this extra parameter to zero:
As usual for LRTs, twice the difference of the log likelihoods follows a chi-square distribution with one degree of freedom. This test presents an advantage over the more usual chi-square, exact, or Monte-Carlo-Markov-Chains (MCMC) tests, in that it is not affected by the problems mentioned at the beginning of this section.
Assessing selective neutrality
Neutrality testing is frequently performed using the revised version of the Ewens–Watterson test proposed by Slatkin, hereafter EWS.29,30 Using this test with ambiguous data is, however, problematic due to the presence of genotypes that are not determined unequivocally. To tackle this problem, a parametric resampling schema is used in such a way that the EWS is applied to a batch of samples with no ambiguous genotypes, which is randomly drawn from the allele frequencies estimated in the population.
28
The batch of
As an indication of the quality of the assessment, the number of zeros or ones, which are values, respectively, smaller or larger than all others, is reported, but the adjusted
Linkage disequilibrium
LD for two bi-allelic loci reduces to a single coefficient; hence, the existence of a haplotype in LD implies that the other three haplotypes are also in disequilibrium. In this case, we can also say that the two loci are in LD. This simple situation does not hold when working with loci having more than two alleles. Therefore, we need to consider separately global LD, and the LD of each individual haplotype. The latter is the same as for bi-allelic loci, ie, the difference between the observed (often estimated) haplotype frequency and the product of the frequencies of the alleles defining the haplotype. The global measure of LD is supposed to provide a summary of the situation, taking into account all possible haplotypes for two given loci. It is possible that two loci are not in global LD while some specific haplotypes for these loci do present strong LD. On the other hand, global LD requires that at least one haplotype exhibits strong LD.
To test the LD, the usual chi-square test (or equivalents such as Fisher's exact test and its MCMC approximations
32
) is in general inapplicable to HLA data, because it cannot handle ambiguous data or the presence of a putative recessive-like blank allele, and it also suffers from the large number of haplotypes that are generally not observed (ie, with estimated frequencies of 0). Instead, our approach consists in using an LRT (as for the Hardy–Weinberg test) for comparing two estimations: the log likelihood of the estimated haplotype frequency and the product of the log likelihoods of the estimated allele frequencies. Under the hypothesis of no LD, the two statistics are expected to be similar. As the model of “no LD” can be seen as a special case of the more general LD model (by setting the disequilibrium coefficients of all haplotypes to zero), the number of degrees of freedom associated with this LRT used is the number of possible haplotypes. Formally, this uses a full model described by the following equation:
In general, the number of degrees of freedom associated with this likelihood is provided by the number of independent (free) parameters, which is given by considering the restrictions on the sums of the frequencies of all haplotypes carrying a given allele (they must add up to the allele frequency) and on the sum of the allele frequencies at the two loci, that is
The previous likelihoods provide the first measure of global LD that we report in our outputs (LRT). Given the problems raised earlier about the convergence of the LRT test statistic to its limiting chi-square distribution, we have complemented this LRT test with a resampling procedure, where the null hypothesis of no LD is used to generate a given number of two-locus samples (often 10,000) to which the LRT test is applied. The procedure provides an empirical distribution for the LRT statistic under the hypothesis of no LD, and the reported
An additional statistic is calculated, not to be used for a global test of LD, but rather to identify individual haplotypes whose frequencies deviate from their expectations more than by chance. (Chance, here, means random sampling, and the deviations are expected to be normally distributed.) This is done by considering the standardized residuals proposed by Agresti 33 for a chi-square test. Residuals of this kind are considered to be more independent from the observed or expected frequencies than other residuals, and unlike other measures of LD such as D and D' (see Hedrick 10 ), they allow for direct comparisons of deviations, even for haplotypes with very different observed or expected frequencies.
Discussion and Conclusion
To our knowledge, currently, the GENE[RATE] tools are the only suite of computer programs that are able to work with highly polymorphic and ambiguous nominal data. These tools are an extension of the classic methods of gene-counting and haplotype estimation published in 1995 (references are mentioned in “Frequency estimation” section) and are shown to be particular instances of the EM algorithm. The framework of nominal alleles is rather distinct from that of sequence data. Nominal alleles refer to longer or shorter chromosomal regions that can span tens to thousands of nucleotides and are just considered as equal or distinct, without taking into account the molecular information. This is often because such information is not simply available given the typing techniques used to produce the data. Such data are clearly not as rich as sequence data, but they present interesting characteristics that make them useful for population genetics analyses. The high polymorphism of the data allows for a high discrimination of populations. The eight loci of the system are considered as segregating independently; they span over a contiguous region of 6 million nucleotides. Finally and most importantly, data for HLA are very abundant, given the clinical relevance of the system. (A detailed discussion of HLA relevance for population genetics has been given by Sanchez-Mazas et al. 13 )
As we have stated, not all data ambiguities are of the same nature, and they do not have the same relevance, for nominal and sequence data. It is interesting to see that the approach taken for sequence data is either to make a call or discard the ambiguous position,34–37 while the approach presented here includes the ambiguities in the calculations. In practice, instead of using two alleles for an individual, what we consider is two probability distributions for each individual (that reduce to a single allele for nonambiguous data).
The spirit of Li's samtools 37 and that of GENE[RATE] are similar, but a number of differences in the nature of the data makes a direct comparison impossible. We do plan, however, to perform such a comparative study using a triple approach: the alleles defined at nominal levels (as usual with HLA data), their translation as sequence data, and the raw reads produced by an NGS typing technique.
The UNIFORMAT grammar and the GENE[RATE] tools described in this article are specially adapted to nominal diploid data with ambiguities. Although primarily developed to solve problems arising with the analysis of the highly polymorphic genetic system HLA, the human MHC, the suite is completely general and applicable to diploid data. It may even accommodate pedigree information by directly specifying known, possibly partial, haplotypes in the UNIFORMAT data file. Actually, these tools have already been used in mixed analyses of HLA, classical and nonclassical genes, and other immunogenetic systems such as KIR and MICA.38–41
The web interface is currently the easiest way to use this suite because of the large number of dependencies on external libraries and other software, but the code source will eventually be packaged, including dependency information, in the Debian format 42 or as an R package, and made publicly available. The web interface is continuously maintained and updated.
Questions, comments, error reports, and suggestions are welcome and can be addressed to the author.
Footnotes
Acknowledgments
This work was supported by Swiss FNS grants #31003A_127465 and #31003A_144180 and by COST Action BM0803. The author thanks Stéphane Buhler and Alicia Sanchez-Mazas for reading a first version of this text and providing valuable comments and suggestions and, more importantly, for their continuous support and advice with all the many details that are an essential part of GENE[RATE].
Author Contributions
JMN is the sole author, responsible for all of the content of this paper. The author has reviewed and approved of the final manuscript.
