
Abstract

Germline genetic variation contributes to cancer etiology, but self-reported race is not always consistent with genetic ancestry, and samples may not have identifying ancestry information. In this study, we describe a flexible computational pipeline, PopInf, to visualize principal component analysis output and assign ancestry to samples with unknown genetic ancestry, given a reference population panel of known origins. PopInf is implemented as a reproducible workflow in Snakemake with a tutorial on GitHub. We provide a preprocessed reference population panel that can be quickly and efficiently implemented in cancer genetics studies. We ran PopInf on The Cancer Genome Atlas (TCGA) liver cancer data and identified discrepancies between reported race and inferred genetic ancestry. The PopInf pipeline facilitates visualization and identification of genetic ancestry across samples, so that this ancestry can be accounted for in studies of disease risk.
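The general idea described in the abstract, projecting samples onto principal components and labeling unknown samples relative to a reference panel, can be illustrated with a minimal sketch. This is not the PopInf implementation: the function names `pca_scores` and `assign_ancestry`, the simulated genotypes, and the nearest-centroid assignment rule are all assumptions chosen for illustration, and the abstract does not state which assignment rule the pipeline actually uses.

```python
import numpy as np


def pca_scores(genotypes, n_components=2):
    """Project samples onto the top principal components.

    genotypes: (n_samples, n_variants) array of 0/1/2 allele counts.
    """
    centered = genotypes - genotypes.mean(axis=0)
    # SVD of the centered matrix; u * s gives the per-sample PC scores.
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


def assign_ancestry(scores, labels):
    """Give each unlabeled sample (label None) the label of the nearest
    reference-population centroid in PC space (an assumed rule)."""
    known = np.array([lab is not None for lab in labels])
    populations = sorted({lab for lab in labels if lab is not None})
    centroids = np.array([
        scores[np.array([lab == pop for lab in labels])].mean(axis=0)
        for pop in populations
    ])
    assigned = list(labels)
    for i in np.flatnonzero(~known):
        distances = np.linalg.norm(centroids - scores[i], axis=1)
        assigned[i] = populations[int(np.argmin(distances))]
    return assigned


# Simulated toy data (hypothetical): two reference populations with very
# different allele frequencies, plus one unknown sample drawn from each.
rng = np.random.default_rng(0)
n_variants = 200
ref_a = rng.binomial(2, 0.1, size=(20, n_variants))
ref_b = rng.binomial(2, 0.9, size=(20, n_variants))
unknown_a = rng.binomial(2, 0.1, size=(1, n_variants))
unknown_b = rng.binomial(2, 0.9, size=(1, n_variants))

genotypes = np.vstack([ref_a, ref_b, unknown_a, unknown_b])
labels = ["A"] * 20 + ["B"] * 20 + [None, None]

scores = pca_scores(genotypes)
assigned = assign_ancestry(scores, labels)
print(assigned[-2:])
```

In this sketch the unknown samples are included in the PCA together with the reference panel, so all samples share one coordinate system. A real analysis would typically start from filtered variant calls rather than simulated counts, and may use a stricter rule than nearest centroid, for example leaving a sample unassigned when it lies far from every reference cluster.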




PopInf: An Approach for Reproducibly Visualizing and Assigning Population Affiliation in Genomic Samples of Uncertain Origin
