
Abstract

Modern genomic datasets, like those generated under the 1000 Genomes Project, contain millions of variants belonging to known haplotypes. Although these datasets are more representative than a single reference sequence and can alleviate issues like reference bias, they are significantly more computationally burdensome to work with, often involving large indexed genome graph data structures for tasks such as read mapping. The construction, preprocessing, and mapping algorithms can require substantial computational resources depending on the size of these variant sets. Moreover, the accuracy of mapping algorithms has been shown to decrease when working with complete variant sets. Therefore, a drastically reduced set of variants that preserves important properties of the original set is desirable. This work provides a technique for finding a minimal subset of variants $S$ such that, for given parameters α and δ, all substrings of length up to α in the haplotypes are still guaranteed to be alignable to the appropriate locations with either Hamming or edit distance at most δ, using only $S$. Our contributions include showing the NP-hardness and inapproximability of these optimization problems and providing Integer Linear Programming (ILP) formulations. Our edit distance ILP formulation carefully decomposes the problem according to variant locations, which allows it to scale to support all of chromosome 22’s variants from the 1000 Genomes Project. Our experiments also demonstrate a significant reduction in the number of variants. For example, for moderately long reads, e.g., α = 1000, over 75% of the variants can be removed while preserving read mappability with edit distance at most one.
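To make the optimization problem concrete, the following is a minimal brute-force sketch of the Hamming-distance version on a toy instance. It is not the ILP formulation described above, and it rests on simplifying assumptions that are not stated in the abstract: every variant is a SNP identified by its position, each haplotype is given as the set of variant positions it carries, and a substring of a haplotype aligns to its original location with one mismatch per dropped variant it overlaps. Under these assumptions, a subset $S$ is feasible when no haplotype has more than δ dropped variants inside any window of length α. All function names are hypothetical.

```python
from itertools import combinations


def max_dropped_in_window(positions, alpha):
    """Largest number of the given positions covered by one window of length alpha."""
    positions = sorted(positions)
    best = 0
    left = 0
    for right, p in enumerate(positions):
        # Two positions share a window of length alpha iff they differ by < alpha.
        while p - positions[left] >= alpha:
            left += 1
        best = max(best, right - left + 1)
    return best


def is_feasible(kept, haplotypes, alpha, delta):
    """Check that every haplotype window of length alpha has at most delta dropped SNPs."""
    for hap in haplotypes:
        dropped = [p for p in hap if p not in kept]
        if max_dropped_in_window(dropped, alpha) > delta:
            return False
    return True


def smallest_variant_subset(haplotypes, alpha, delta):
    """Exhaustively search for a smallest feasible subset (exponential time, toy sizes only)."""
    all_positions = sorted(set().union(*haplotypes))
    for size in range(len(all_positions) + 1):
        for kept in combinations(all_positions, size):
            if is_feasible(set(kept), haplotypes, alpha, delta):
                return set(kept)
    return set(all_positions)


# Toy instance (made-up data): three haplotypes over five SNP positions.
haplotypes = [{2, 5, 9}, {5, 20}, {2, 9, 21}]
print(smallest_variant_subset(haplotypes, alpha=5, delta=1))  # {5}
```

In this toy instance only the first haplotype has two variants within a window of length 5, so retaining the single variant at position 5 suffices. The exhaustive search is only meant to illustrate the feasibility condition; given the NP-hardness noted above, realistic inputs call for an ILP solver or a similar approach.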



GraphSlimmer: Preserving Read Mappability with the Minimum Number of Variants
