Advanced genotyping technology has made it feasible for large numbers of individuals to be genotyped resulting in many biobanks across the world. These biobanks are an excellent resource to study haplotype matching at a large scale. Durbin’s positional Burrows–Wheeler transform (PBWT) supports efficient haplotype matching and queries given a panel of haplotypes and scales well with large data. It has been widely used for statistical phasing, imputation, and identity-by-descent detection. However, the original PBWT panel does not support updates when haplotypes need to be added or deleted from the panel. Dynamic-PBWT (d-PBWT) solved this problem but is not memory efficient. While the memory constraint problem of the PBWT has been tackled by Syllable-PBWT and
-PBWT, these are static data structures that do not allow updates. In addition, Syllable-PBWT only supports long-match query and
-PBWT only supports set-maximal match query, limiting their functionality in the compressed form. In this article, we present Dynamic
-PBWT (which can also be seen as compressed d-PBWT) that is memory efficient and supports dynamic updates. We run-length compress PBWT to achieve a better compression rate and store the runs in self-balancing trees to enable dynamic updates. We show that the number of updates per insertion or deletion in the tree at each site is constant regardless of the number of haplotypes in the panel and the updates can be made without decompressing the index. Moreover, we use orders of magnitude less memory than d-PBWT. We provide set maximal match and long match query algorithms on Dynamic
-PBWT. The long match query algorithm can easily be extended back to the original
-PBWT. We benchmark all algorithms on the UK Biobank and 1000 Genomes Project dataset. Overall, the flexibility and space-efficiency of Dynamic
-PBWT make it a potential index data structure for biobank-scale genetic data maintenance and analysis.