Abstract
Some organizations such as 23andMe and the UK Biobank have large genomic databases that they re-use for multiple different genome-wide association studies. Even research studies that compile smaller genomic databases often utilize these databases to investigate many related traits. It is common for the study to report a genetic risk score (GRS) model for each trait within the publication. Here, we show that under some circumstances, these GRS models can be used to recover the genetic variants of individuals in these genomic databases—a reconstruction attack. In particular, if two GRS models are trained by using a largely overlapping set of participants, it is often possible to determine the genotype for each of the individuals who were used to train one GRS model, but not the other. We demonstrate this theoretically and experimentally by analyzing the Cornell Dog Genome database. The accuracy of our reconstruction attack depends on how accurately we can estimate the rate of co-occurrence of pairs of single nucleotide polymorphisms within the private database, so if this aggregate information is ever released, it would drastically reduce the security of a private genomic database. Caution should be applied when using the same database for multiple analysis, especially when a small number of individuals are included or excluded from one part of the study.
1. Introduction
In a survey of genomic privacy experts, the long-term privacy of genomic information was deemed both the most important and the most challenging problem to overcome (Mittos et al., 2019). If an individual's password or ID number gets leaked, it is always possible to change it. However, it is impossible for a person to change their genetic code and they will pass part of it onto their children, so any information leaks can have long-term impacts on both the individual and their descendants. Although much of the research focus on long-term privacy of genomic databases rests on the longevity of the encryption scheme (Huang et al., 2015), it is also important to remember that these genomic databases are not just sitting on a server somewhere, but are also being continually utilized for making new scientific discoveries. Each time these databases are accessed and the scientific results are published, there is a risk that information will be leaked and that eventually this would enable an attacker to reconstruct private information held in the database.
Genomic researchers are already aware that some forms of aggregate data from their databases should not be released publicly, because there is a risk that an attacker may be able to determine whether a particular individual is a member of the database (a membership inference attack). For instance, such attacks have already been developed for summary statistics about the frequency of single nucleotide polymorphisms (SNPs; Cai et al., 2015; Dwork et al., 2015; Simmons and Berger, 2015). Membership inference attacks have also been developed for the case where a person is allowed to repeatedly query a database to learn whether at least one individual contains a particular SNP (Shringarpure and Bustamante, 2015; Raisaro et al., 2017; von Thenen et al., 2018). These kinds of aggregate statistics about the frequency or presence/absence of a particular SNP in a database might be useful to release to the broader research community, but it is not an essential output of the research process.
However, the main research findings—that is, the SNPs associated with the trait of interest and their strength of association—are essential to publish since the entire purpose of these genomic research projects is to uncover the relationship between genetic variants and phenotypic traits. Moreover, knowledge of these SNPs can lead to new diagnosis procedures or new potential drug targets, so their release is important for the public interest (Visscher et al., 2017). However, even this information can potentially leak private information about individuals in the database. For instance, Im et al. (2012) found that information about individuals in a genomic database is leaked when studies publish whether each SNP is correlated or anti-correlated to the trait of interest. It is important to quantify how much information is leaked by publishing these research findings, so that scientists can make informed decisions about when to publish their results and whether it is worth risking the privacy of the participants.
In this study, we demonstrate that the kind of research output that is published from genome-wide association studies (GWAS) has the potential to leak enough information to recover the SNPs of individuals in the database (a reconstruction attack), under specific circumstances. In particular, we focus on the release of genetic risk scores (GRS), a common research output for finding genetic associations with continuous traits (Qi et al., 2011; Belsky et al., 2013; Zhao et al., 2014; Chouraki et al., 2016; Day et al., 2017; Knowles and Ashley, 2018). We also focus on cases where a database is repeatedly used to perform a GWAS analysis, but not all the individuals are part of all the analyses. This could be the case because some individuals drop out of the study or skip specific survey questions. Alternatively, some databases, such as 23andMe, may grow in size over time and allow several GWAS to be performed within a short period. Under these circumstances, we demonstrate that it is possible to completely reconstruct the SNPs of an individual by using a custom expectation–maximization (EM) algorithm. We also provide suggestions for avoiding this kind of attack.
To be clear, this article focuses on the simpler case in which the exact same trait is investigated in multiple GWAS studies; however, we expect that some version of this attack may be developed in the near future for the case of multiple highly correlated traits.
1.1. Overview of scenarios that will be investigated
We demonstrate a series of reconstruction attacks that enable us to infer the genotypes of individuals in private genomic databases, based on publicly released GRS. These attacks will initially be deployed in a very favorable scenario, but the scope of the attack will be subsequently expanded, building up to the scenario shown in Figure 1. It is worth noting that the reconstruction attacks that we will describe do not depend on (1) how the SNPs were initially filtered or (2) how strongly they associate with the trait of interest.

We investigate the case where two GWAS studies are performed on two datasets that mostly contain the same individuals. We reconstruct the genotype of those individuals added to the second study, using the GRS from each study and an estimate of SNP frequencies. GRS, genetic risk score; GWAS, genome-wide association studies; SNP, single nucleotide polymorphism.
We will begin by investigating a simple scenario: Two GWAS studies are performed to identify SNPs associated with the same trait, and the two studies use the same set of participants, except that the second study includes one extra individual. In addition, we will assume that we know the frequencies of each SNP and the frequencies that pairs of SNPs co-occur in the same individual. We assume that both studies publish the coefficients associated with the GRS models that they infer as part of the analysis.
Next, we will consider the case in which the second study includes more than one additional participant and we demonstrate that in many circumstances this still allows us to easily reconstruct the individual genotypes of all the individuals that are found in the second study but not the first (see Section 3.2).
Afterward, we will demonstrate that we do not need to know the precise frequencies of SNPs and frequencies of co-occurring SNPs, as long as we have a reasonable estimate of these values from public databases (see Section 3.3). We also briefly discuss how loosening additional restrictions would impact our ability to predict individual genotypes. In particular, we analyze the case where the two sets of SNPs that are used by the two studies are not identical. These results imply that if two sets of GRS are released on two genetic datasets with largely overlapping populations, it may be possible to reconstruct the genotypes of those individuals who participated in one study but not the other (Fig. 1).
2. Methods
GRS models describe the relationship between a particular phenotype of interest and particular SNPs. These models are fit in a two-stage process: First, a reduced set of SNPs is selected from a potentially very large pool of candidates; then, this reduced set is used as the independent variables in a linear regression analysis. The set of SNPs is selected by first filtering for those that significantly correlate to the trait of interest, after controlling for other covariates. These SNPs are then further filtered to ensure that they are far apart from one another, to decrease the correlation between them. In this setting, we suppose that M individuals have taken part in a study, and N SNPs have passed the filtration steps to be used in a linear model. Let yM be the vector of M real-valued phenotypes, and XM be an
The GRS model parameter
where 
where we have defined the symmetric
Now, suppose a second study is run, targeting the same phenotype, which adds a single extra individual with SNPs represented by the N length vector x0. This corresponds to adding the row
We assume that both GRS models
where C is a scalar, specifically
Our approach, thus, centers on the use of the vector that we define as d1,

corresponding to a rescaled copy of the input SNP data in the design matrix
We additionally consider the case where m additional individuals have been included in the second study, yielding a new GRS model 
where Cm is a vector of length m. For sufficiently small m (relative to N), exact reconstruction of all m added individual genomes is also possible in this setting, following the algorithm we will introduce in Section 3.2.
The previous examples have focused on cases in which the participants in the first study are a subset of the individuals in the second study. In Section 5.7, we consider the case in which the first study has some participants that are not found in the second study and vice versa. We show that the same strategies for reconstructing the genome can be used as in the previous scenario that we discussed, in which multiple participants are added to the second study.
2.1. Estimation of K
As it turns out, the entries of matrix K correspond to simple population-level statistics of the SNPs, which could either be inadvertently released (under the assumption they would be safe to share), or could be estimated from another sample from the same population. In fact, the entries of K depend only on the SNP frequencies and SNP co-occurrence frequencies in the dataset:
– For – For – For – Finally,
Thus, knowledge of SNP frequencies and pairwise co-frequencies from the original study are all that is required to compute K. In the following Sections 3.1 and 3.2, we consider adding one and multiple individuals at once, respectively, in the setting where this matrix K can be estimated exactly. However, although
3. Results
The key observation from the previous section is that the vectors d1 and dm, derived from the change in parameter vectors
This section describes algorithmically how these vectors can be used to recover the genomes of the additional individuals, as well as empirical tests that use the Cornell Dog Genome dataset as a case study (Hayward et al., 2016). More details on the experimental setup can be found in Section 5.1.
3.1. Complete reconstruction of one individual's genotype when SNP frequency information is known
The first, most straightforward case is when only one participant is added between the first and second studies, that is, where
Given K,

3.2. Complete reconstruction of multiple individuals' genotype when SNP frequency information is known
We now consider the case where m additional individuals have been included in the second study, yielding a new GRS model
Consider again Eq. (8) described earlier. The
If we were to learn which values of dm are also found in Cm, then we could infer the complete genotypes of all the m individuals added to the second study. We would be able to reconstruct m complete genotype vectors, although it would be impossible to know which of the genotypes corresponded to which of the m individuals. In fact, in many cases it is extremely straightforward to determine which values in dm correspond to values in Cm. Here, we describe a simple algorithm for finding Cm when there are exactly
First, extract all unique, nonzero values from dm.
Find the sum of all pairs of values in this vector.
Find all values that are in Eq. (1), but not in Eq. (2). The values of Cm appear in this list. There is no way to know which value of Cm corresponds to which index, so for simplicity we can randomly assign them indices.
Each value in Cm corresponds to a specific individual who was added to the second study. Each value in dm can be described as a sum of a unique combination of values in Cm. For instance, if
We tested this approach by using the Cornell Dog Database, in a test scenario where the second study added three different dogs. We were able to uniquely identify the genotypes of all three dogs with 100% accuracy, with both common and uncommon SNPs (Fig. 2C).
3.3. Accurate estimation of an individual's genotype when SNP frequency information is estimated from a public database
Previously, we assumed that the attacker had access to the matrix K, which consists of population-level statistics on frequencies and co-occurrence frequencies of SNPs. Although this could be released voluntarily by organizations that are not aware of the risk, we now consider the case where K is not directly available to the attacker but is instead estimated from a separate public database assumed to correspond to individuals from the same population.
We simulated this scenario by using the Cornell Dog Database by taking one random set of dogs for building the GRS model, and a second non-overlapping set of dogs for estimating

Example values taken by the noisy vector
The main challenge is that the vector

Accuracy at reconstruction of genomes x0 using EM estimation and a noisy estimate
Crucially, we show that our approach is able to reconstruct, with relatively high accuracy, the genotypes of dogs even when they differ significantly from those in the public dataset (Fig. 5). This shows that our attack is able to extract information about the particular individuals that differ across the two studies, not merely about the general population as in the most-common-variant baseline. By definition, dogs that have genotypes that differ significantly from the general population have a higher proportion of uncommon SNPs, and the ability to recover these uncommon SNPs is particularly important from a privacy perspective. Indeed, uncommon SNPs can be used to identify a particular individual and are also more likely to be associated with disease phenotypes, which is sensitive information. In general, we find that the larger the public dataset available, and the more similar the dataset is to the unknown private dataset, the better we are able to reconstruct the genome of the added individual. Full details and description of the experimental setting are given in Section 5.1. We also derive theoretical error bounds for our estimate of

Results of Figure 4 broken down by individual dogs. Here, each point represents a dog and we define atypicality as the proportion of uncommon variants that the dog has compared with the public database—for instance, if 51% or more of dogs in the public database have a G in a specific locus, but this dog has a T, then this would count toward the dog's atypicality. In other words, dogs further to the right are less and less similar to average dogs present in the public dataset (measured by percentage of different variants). In contrast to the most-common-variant baseline, our method generalizes well even to dogs that are highly dissimilar to those in the public dataset. Larger public databases (right) provide more accurate population estimates of
This task becomes more challenging when multiple individuals are added simultaneously and K is unknown; an algorithm for estimating
3.4. Accurate estimation of an individuals' genotype when different SNPs are used in each study
When GRS models are constructed, the first step is to filter the set of SNPs down to a small set of SNPs that are (1) significantly correlated to the trait after covariates are considered and (2) far apart from one another along the genome. If the two studies use two different sets of SNPs to construct the GRS model, it is still possible to recover whether or not each of the SNPs in the overlap is present in the new individual. This process is highly analogous to the previous cases and is detailed in Section 5.6.
4. Discussion
In this study, we demonstrate that private information is leaked when GRS models are published, specifically in the case where two sets of largely overlapping individuals are used for multiple studies. In particular, we show that we can recover SNPs from an individual in a private database—a reconstruction attack. Even though we would not have a name associated with this genotype, it may be possible to identify the individual once the genotypic data are available to the attacker. For instance, the attacker may have access to partial genotypic information of the individual and then be able to identify them. Alternatively, they could use the genotype information to predict ethnicity and other phenotypic traits that could then be used to uniquely identify the individual.
We also note that even an incomplete reconstruction attack (in which only a proportion of the SNPs are correctly identified) is likely to be sufficient to perform a membership inference attack. Investigating the relationship between the reconstruction attack and the membership inference attack will be a subject of future research. Importantly, if the attackers were unable to link the genomic data with a particular individual, the reconstruction attack would still be a breach in privacy that could have serious consequences. For instance, the patient may have only consented to have their genomic data used in particular kinds of research studies, whereas the attacker may use the reconstructed genomic data for a different (potentially unethical) purpose.
4.1. Suggestions for good practice
We provide a number of simple suggestions for good practice that would help limit this attack.
Aggregate statistics about the frequency of SNPs in the database or the frequency of co-occurrence of SNPs should never be released. We have shown that this information, combined with GRS, allows to precisely reconstruct individual genomes in various settings. It may be possible to release noisy versions of SNP frequency data, but this would be equivalent to releasing
If a genetic dataset is intended to serve for multiple complementary analyses, it is important that all study participants are used in every analysis performed. If there are missing phenotypic data from a few individuals, they should not be included in any of the analyses performed, or their privacy may be compromised.
When multiple individuals are added in between two studies, then the ability to reconstruct the genomes depends on the number of SNPs, being large relative to the number of individuals. In particular, if m new dogs are added, exact reconstruction is only possible by using the approach in Section 3.2 if the number of SNPs
4.2. Extensions and future work
Although we have analyzed the case where the genome is represented by binary values of 0 or 1, often studies instead count the number of times each allele is present, which would lead to a design matrix
A possible countermeasure to our reconstruction attack could consist of randomly perturbing the GRS models before releasing them, as done in differentially private linear regression (Wang, 2018). However, a naive application of this strategy could destroy the utility of the models. A formal and empirical analysis of the effectiveness of such protection against reconstruction attacks, as well as of the usefulness of the resulting GRS models to genomic researchers, is beyond the scope of this article and left for future work.
Another countermeasure is to refrain from releasing precise information about the population structure of the study population to prevent the attacker from estimating K effectively. This would, however, limit the utility of the research study, because the researchers would not know to what populations the research applies.
Our work has a number of limitations. For instance, we only test our EM algorithm on dog data. Dog populations may have different population structures than human populations due to selective breeding, so in the future we aim at investigating how properties of population structure will impact our ability to estimate K and the accuracy of our reconstruction attack.
It may seem unlikely on the surface that two GWAS analyses will include nearly the same participants. One potentially common setting where this could arise is when a single study collects both genotype and phenotype data from a single set of participants, and it releases multiple models to predict multiple traits. In this case, there may be a small number of individuals who are used in one analysis, but not the other; for instance, there may be a small subset of participants who skip a particular survey question that was used to collect phenotype information, and this is, indeed, evident in a recent study (Jiang et al., 2019). In such settings, it could be very possible for multiple released GRS models to be computed on sets of individuals that differ by only a few participants. In future work, we aim at extending our analysis and attack to settings where multiple GRS models are released, each predicting different but highly correlated traits.
5. Appendix
5.1. Experimental details
5.1.1. Cornell dog database
To experimentally test the reconstruction attacks, we used data from the Cornell Dog Genome Database, which contains data about SNPs from a wide range of dog breeds and a number of associated phenotypic traits. The two traits we focused on were average breed weight and average breed height, because these two phenotypes had the fewest number of missing values. For the initial investigation, we binarized the genotype matrix—considering all heterogenous alleles to have a value of 1. (We also repeated the analysis with the original genotype matrix.) Only common SNPs (i.e., SNPs that were found in 25%–75% of the dogs) were used, leaving 23,497 SNPs. For each linear model built,
5.1.2. Experiment with imprecise K
First, two linear models were constructed to predict average breed weights: one with the M = 1000 randomly sampled dogs and another that contained one additional randomly sampled dog. This gives
5.2. Adding multiple dogs
Here, we explain Eqs. (7) and (8). Note that the former is a special case of the latter so we will only explain the latter in detail. First note that, by definition,

Substituting these into the left hand side of the following equation gives the right hand side:
This equation can be rearranged to give
Defining the length m vector
5.3. Algorithm for identifying unique genotypes of multiple dogs when K is known
Although the simple approach described in the main article will work in many cases, there are a few special circumstances where a more complex algorithm may be required. In particular, it would not work if there are combinations of SNPs that are not observed among the individuals added to the database. For instance, if there is not an SNP location where the first individual has an SNP variant and the others do not, then we would miss the corresponding value in Cm. However, it is still possible to identify all the values in Cm through a more complex algorithm:
First, extract all unique, nonzero values from dm.
Find the sum of all pairs of values in Eq. (1).
Find all values that are in Eq. (1), but not in Eq. (2).
If there are exactly m values in Eq. (3) and the sum of these values equal the last value of dm (corresponding to the intercept term), then we have found the correct values of Cm.
Otherwise, this suggests that there are one or more elements of Cm that are missing from Eq. (3) and possibly a few values in Eq. (3) that are not in Cm.
Begin by subtracting every pair of values in Eq. (3). These are now also potential values of Cm
Search for a set of m values from Eqs. (3) and (6) that sum to the last element of dm. There may be more than one set of values for which this is true.
If this search is unsuccessful, repeat steps 6–7. Eventually, a set of m values summing to dm should be found.
If more than one possible set of values is found for Cm in Eq. (7), it is still possible to compare these sets and identify which is the most likely to contain the true values of Cm. For each possible Cm vector, a set of genotypes can be constructed for the m additional individuals. Using the frequencies of each SNP, it is possible to calculate the probability of observing each genotype. The set of values that produces the most likely genotypes for the m individuals is most likely to be the correct one.
In addition, this algorithm depends on the fact that it is extremely unlikely that if someone were to sample three random continuous numbers i, j, and k, it would just so happen that
5.4. Estimating K
If the true matrix K is unknown, it can be estimated with public data. We denote this estimator by
Also of note, the same analysis given next applies to the scenario in which the researchers do not release K, but rather release a “noisy” version of K, where the noise is drawn from a normal distribution. They might consider doing this if they feel that releasing information about SNP frequencies is important for the research community, but they do not wish to release the real K because this would allow for an exact reconstruction of genotype. This noisy K could still be used in a reconstruction attack in the same way as an estimate of K from a public database is used.
5.4.1. Analytic bound on
For convenience, we only consider the case of adding a single individual, though the generalization is quite straightforward. If
Note that
This means that we can bound the error by two quantities. The term
The key term in Eq. (10) is
5.4.2. Modeling the error in
In this section, we define a model to capture the error in
Let
for some small
which corresponds to the difference between the two GRS model parameter vectors when m additional dogs are added. Given the true value of K, the system of equations
relates the known quantity 
We need to estimate all m constants
If K is Gaussian [following Eq. (11)], then the linear transformation
meaning overall the vector
With some algebraic re-arrangement, and since for the true underlying value of K we have 
where
For the special case of
5.4.3. Parameter estimation with EM
We now can use this model to derive EM algorithms for finding maximum likelihood estimates of all parameters, and estimate the posterior distribution over SNP variants for the added individuals between the two studies.
For notational convenience in this section, denote the entries of the m new individuals 
Supposing we know values of
An EM algorithm to estimate
Given
Given the posterior
For each
the conditional probability of each particular entry taking a value of 1, rather than 0, for each zj given the values of the other zk,
5.4.4. Exact EM algorithm when 1 individual is added
For the special case of
the posterior probability of each particular entry taking a value of 1, rather than 0. To maximize 
This yields

which we set equal to zero and solve to find
These updates taken together can be used to define an EM algorithm that optimizes the values of C and
The overall EM algorithm can be summarized by the following iterative updates:
To initialize the algorithm, we can set
5.4.5. Stochastic EM (SEM) for multiple individuals
For
draw approximate posterior samples of
conditioned on the current sampled values
Although this does not converge to an exact parameter value, under suitable conditions the algorithm converges in distribution to a Gaussian centered on the maximum likelihood estimate of the parameter. A point estimate can be extracted by averaging across many iterations after convergence.
In contrast to the EM updates, the updates for values of Cj and 
using the vector notation
, has the solution
The maximum likelihood estimate of 
To address permutation invariance in the entries
Empirical results quantifying the performance of this algorithm are shown in Figure 6, in an experimental setup similar to that for evaluating EM when a single dog is added to a dataset in the main article, with unknown K. A private dataset is assumed to contain 1000 individuals, whereas a separate public dataset of 800 is available;

Results for running the stochastic EM algorithm when estimating SNPs for three additional dogs simultaneously. This experimental setup replicates the experiment for one additional dog, across 5 public/private/test dataset splits, with 20 different test sets of three additional dogs for each. (Left) Accuracy at predicting SNP presence relative to the “most common variant” baseline. On average, the SEM algorithm predicts the correct SNP 75.5% of the time, relative to 71.5% for the baseline. (Right) As in the one-dog example, we see relative improvement in the performance of our algorithm when considering more atypical dogs. SEM, stochastic EM.
5.5. Scaling of EM algorithm with size of private dataset
Figure 7 demonstrates the change in accuracy of the EM algorithm over a range of different private database sizes. For this test, a synthetic dataset with 100 SNPs and 1,000,000 individuals is generated; 10,000 are held out as a public database, and 30 individuals are taken as a fixed test dataset of new dogs to add and are used to estimate EM algorithm accuracy, across increasingly large private database sizes. The algorithm has stable performance for increasingly large private databases.

Accuracy at reconstruction of the genome of one additional individual, using EM estimation and a noisy estimate
5.6. Estimating
with different SNP sets
Here, we analyze what can still be said in the event that the two studies do not use exactly the same set of SNPs. We will still assume that the sets of SNPs are considered to have a significant overlap.
For this purpose we will need a greater variety of notation. A primed variable denotes that it corresponds to the second set of SNPs, for example,
As described earlier, from the first experiment, we have
and now, from the second experiment, we have
Taking the difference between these expressions, as earlier, gives
Restricting to the overlapping set gives that
Noting that
Analogous to the previous cases,
Thus, if K is known, it can be used to deduce whether the additional individual has each of the SNPs in the overlapping set. If K is not known exactly, it can be estimated from public data just as in the same SNP case.
5.7. Case in which each GWAS study adds two new sets of participants
This article mostly explores the case in which one study's participants are a subset of the other study's participants. Here, we demonstrate that this is equivalent to the case where each of the two studies contains a small number of participants that are not found in the other study.
In particular, let us say that the first study has
Let us define the following
and the following
Then, this gives us:
This means that having two nonoverlapping participant sets is equivalent to the setting in which the first study is a subset of the second (only m is now
5.8. Description of K when the genotypes are non-binary
In many cases, GRS are calculated on genotype matrices that are non-binary. In particular, they may take on three discrete values 0, 1, and 2, where 0 indicates that the most common variant is homozygous, 1 indicates that the individual is heterozygous for the uncommon variant, and 2 indicates that the individual is homozygous for the uncommon variant.
If this is the case, the description of K will change. However, it is still the case that the entries of K depend only on the SNP frequencies and SNP co-occurrence frequencies in the dataset, and that knowledge of SNP frequencies and pairwise co-frequencies from the original study, are all that is required to compute K.
– For
– For
– For
– Finally,
Footnotes
Author Disclosure Statement
The authors declare they have no competing financial interests.
Funding Information
This project was funded by the Alan Turing Institute Research Fellowship under EPSRC Research grant (TU/A/000017); EPSRC/BBSRC Innovation Fellowship (EP/S001360/1), and under the EPSRC grant EP/N510129/1. It was also partly funded by a grant from CPER Nord-Pas de Calais/FEDER DATA Advanced data science and technologies 2015–2020.
