An Efficient Dynamic Data Structure for Haplotype Matching and Compression on Biobank-Scale Data

Abstract

Advances in genotyping technology have made it feasible to genotype large numbers of individuals, resulting in many biobanks across the world. These biobanks are an excellent resource for studying haplotype matching at scale. Durbin’s positional Burrows–Wheeler transform (PBWT) supports efficient haplotype matching and queries over a panel of haplotypes and scales well to large data. It has been widely used for statistical phasing, imputation, and identity-by-descent detection. However, the original PBWT does not support updates when haplotypes need to be added to or deleted from the panel. Dynamic-PBWT (d-PBWT) solved this problem but is not memory efficient. While the memory constraints of the PBWT have been tackled by Syllable-PBWT and μ-PBWT, these are static data structures that do not allow updates. In addition, Syllable-PBWT supports only long-match queries and μ-PBWT supports only set-maximal match queries, which limits their functionality in compressed form. In this article, we present Dynamic μ-PBWT (which can also be seen as a compressed d-PBWT), which is memory efficient and supports dynamic updates. We run-length compress the PBWT to achieve a better compression rate and store the runs in self-balancing trees to enable dynamic updates. We show that the number of updates per insertion or deletion in the tree at each site is constant regardless of the number of haplotypes in the panel, and that the updates can be made without decompressing the index. Moreover, Dynamic μ-PBWT uses orders of magnitude less memory than d-PBWT. We provide set-maximal match and long-match query algorithms on Dynamic μ-PBWT. The long-match query algorithm can easily be extended back to the original μ-PBWT. We benchmark all algorithms on the UK Biobank and 1000 Genomes Project datasets. Overall, the flexibility and space efficiency of Dynamic μ-PBWT make it a potential index data structure for biobank-scale genetic data maintenance and analysis.

Keywords

μ-PBWT; biobank; d-PBWT; Dynamic μ-PBWT; haplotype matching; run-length compression
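
To make the run-length idea in the abstract concrete, the following is a minimal static sketch, not the authors' implementation. It builds each PBWT column for a small biallelic panel using the stable 0/1 partition from Durbin's original construction, then run-length encodes the column. The function names `pbwt_columns` and `run_length_encode` and the toy panel are hypothetical, chosen only for illustration. The runs are held in plain Python lists here; the self-balancing trees that the abstract describes for dynamic updates are deliberately left out.

```python
from itertools import groupby


def pbwt_columns(haplotypes):
    """Yield, for each site k, the alleles at k listed in positional prefix order.

    `haplotypes` is a list of equal-length lists of 0/1 alleles (an assumption
    of this sketch: biallelic sites, no missing data).
    """
    m = len(haplotypes)
    n_sites = len(haplotypes[0]) if m else 0
    order = list(range(m))  # at site 0 the order is the original panel order
    for k in range(n_sites):
        yield [haplotypes[i][k] for i in order]
        # Stable partition by the allele at site k: zeros first, then ones.
        zeros = [i for i in order if haplotypes[i][k] == 0]
        ones = [i for i in order if haplotypes[i][k] == 1]
        order = zeros + ones


def run_length_encode(column):
    """Collapse a column into (allele, run_length) pairs."""
    return [(allele, len(list(group))) for allele, group in groupby(column)]


# Hypothetical toy panel: 4 haplotypes, 4 sites.
panel = [
    [0, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 1, 1],
    [1, 0, 0, 0],
]

for k, column in enumerate(pbwt_columns(panel)):
    print(k, column, run_length_encode(column))
```

On this toy panel, the column at the second site is `[1, 1, 1, 0]`, which collapses to two runs, `[(1, 3), (0, 1)]`. On real panels, haplotypes that share a long prefix tend to sit next to each other and carry the same allele, which is why storing runs rather than individual alleles saves space.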

References

1. Auton, Abecasis, Altshuler (Co-Chair), et al.; The 1000 Genomes Project Consortium. A global reference for human genetic variation. Nature, 2015; 526(7571):68–74; doi: 10.1038/nature15393

2. Bick, Metcalf, Mayo, et al.; The All of Us Research Program Genomics Investigators. Genomic data in the All of Us Research Program. Nature, 2024; 627(8003):340–346; doi: 10.1038/s41586-023-06957-x

3. Browning, Tian, Zhou, et al. Fast two-stage phasing of large-scale sequence data. Am J Hum Genet, 2021; 108(10):1880–1890; doi: 10.1016/j.ajhg.2021.08.005

4. Bycroft, Freeman, Petkova, et al. The UK Biobank resource with deep phenotyping and genomic data. Nature, 2018; 562(7726):203–209; doi: 10.1038/s41586-018-0579-z

5. Cozzi, Rossi, Rubinacci, et al. μ-PBWT: A lightweight r-indexing of the PBWT for storing and querying UK Biobank data. Bioinformatics, 2023; 39(9):btad552; doi: 10.1093/bioinformatics/btad552

6. Danecek, Bonfield, Liddle, et al. Twelve years of SAMtools and BCFtools. Gigascience, 2021; 10(2):giab008; doi: 10.1093/gigascience/giab008

7. Delaneau, Zagury J-F, Robinson, et al. Accurate, scalable and integrative haplotype estimation. Nat Commun, 2019; 10(1):5436; doi: 10.1038/s41467-019-13225-y

8. Durbin. Efficient haplotype matching and storage using the positional Burrows–Wheeler transform (PBWT). Bioinformatics, 2014; 30(9):1266–1272; doi: 10.1093/bioinformatics/btu014

9. BGT: Efficient and flexible genotype query across many samples. Bioinformatics, 2016; 32(4):590–592; doi: 10.1093/bioinformatics/btv613

10. BWT construction and search at the terabase scale. Bioinformatics, 2024; 40(12):btae717; doi: 10.1093/bioinformatics/btae717

11. Fast construction of FM-index for long sequence reads. Bioinformatics, 2014; 30(22):3274–3275; doi: 10.1093/bioinformatics/btu541

12. Loh P-R, Palamara, Price. Fast and accurate long-range phasing in a UK Biobank cohort. Nat Genet, 2016; 48(7):811–816; doi: 10.1038/ng.3571

13. Naseri, Holzhauser, Zhi, et al. Efficient haplotype matching between a query and a panel for genealogical search. Bioinformatics, 2019; 35(14):i233–i241; doi: 10.1093/bioinformatics/btz347

14. Navarro. Compact Data Structures: A Practical Approach. Cambridge University Press, USA, 1st edition, 2016.

15. Prezza. A Framework of Dynamic Data Structures for String Processing. In: (Iliopoulos CS, Pissis SP, Puglisi SJ, et al., eds), 16th International Symposium on Experimental Algorithms (SEA 2017), volume 75 of Leibniz International Proceedings in Informatics (LIPIcs), pages 11:1–11:15, Schloss Dagstuhl – Leibniz-Zentrum für Informatik: Dagstuhl, Germany; 2017; doi: 10.4230/LIPIcs.SEA.2017.11

16. Rubinacci, Delaneau, Marchini. Genotype imputation using the positional Burrows Wheeler transform. PLoS Genet, 2020; 16(11):e1009049; doi: 10.1371/journal.pgen.1009049

17. Sanaullah, Zhi, Zhang. d-PBWT: Dynamic positional Burrows–Wheeler transform. Bioinformatics, 2021; 37(16):2390–2397; doi: 10.1093/bioinformatics/btab117

18. UK Biobank. Nearly £50 million unlocked for world-leading UK Biobank following new industry backing. 2025. Available from: https://www.ukbiobank.ac.uk/learn-more-about-uk-biobank/news/nearly-50-million-unlocked-for-world-leading-uk-biobank-following-new-industry-backing [Last accessed: September 15, 2025].

19. Wang, Naseri, Zhang, et al. Syllable-PBWT for space-efficient haplotype long-match query. Bioinformatics, 2023; 39(1):btac734; doi: 10.1093/bioinformatics/btac734

20. Wei, Naseri, Zhi, et al. RaPID-Query for fast identity by descent search and genealogical analysis. Bioinformatics, 2023; 39(6):btad312; doi: 10.1093/bioinformatics/btad312

21. Yang, Durbin, Iversen AKN, et al. Sparse haplotype-based fine-scale local ancestry inference at scale reveals recent selection on immune responses. medRxiv, 2024; doi: 10.1101/2024.03.13.24304206