
Abstract

Phylogenetic methods are emerging as a useful tool to understand cancer evolutionary dynamics, including tumor structure, heterogeneity, and progression. Most currently used approaches utilize either bulk whole genome sequencing or single-cell DNA sequencing and are based on calling copy number alterations and single nucleotide variants (SNVs). Single-cell RNA sequencing (scRNA-seq) is commonly applied to explore differential gene expression of cancer cells throughout tumor progression. The method exacerbates the single-cell sequencing problem of low yield per cell with uneven expression levels. This accounts for low and uneven sequencing coverage and makes SNV detection and phylogenetic analysis challenging. In this article, we demonstrate for the first time that scRNA-seq data contain sufficient evolutionary signal and can also be utilized in phylogenetic analyses. We explore and compare results of such analyses based on both expression levels and SNVs called from scRNA-seq data. Both techniques are shown to be useful for reconstructing phylogenetic relationships between cells, reflecting the clonal composition of a tumor. Both standardized expression values and SNVs appear to be equally capable of reconstructing a similar pattern of phylogenetic relationship. This pattern is stable even when phylogenetic uncertainty is taken into account. Our results open up a new direction of somatic phylogenetics based on scRNA-seq data. Further research is required to refine and improve these approaches to capture the full picture of somatic evolutionary dynamics in cancer.

1. INTRODUCTION

Phylogenetic analysis is an approach that relies on reconstructing evolutionary relationships between organisms to determine population genetics parameters such as population growth (Heled and Drummond, 2008; Kingman, 1982), structure (Müller et al, 2017a), or geographical distribution (Lemey et al, 2010; Lemey et al, 2009). Typically, the reconstructed phylogeny is not the end-goal. Using previously estimated trees, various evolutionary hypotheses can be explored, such as the evolutionary relationship of traits carried by individual taxa (Freckleton, 2012; Grafen and Hamilton, 1989; Pagel et al, 2004).

Within-organism cancer evolution is increasingly being studied using population genetics approaches, including phylogenetics (Alves et al, 2019; Alves et al, 2017; Caravagna et al, 2020; Caravagna et al, 2018; Detering et al, 2019; Kuipers et al, 2020; Malikic et al, 2019; Navin et al, 2011; Singer et al, 2018; Schwartz and Schäffer, 2017; Werner et al, 2020; Yuan et al, 2015), to understand evolutionary dynamics of cancer cell populations. These approaches have shown promise to be developed into therapeutic applications in the personalized medicine framework (Abbosh et al, 2017; Gerlinger et al, 2012; Rao et al, 2020b).

Specifically, the clonal composition of tumors, metastasis initiation, development, and timing can be reconstructed using phylogenetic methods (Angelova et al, 2018; Alves et al, 2019; El-Kebir et al, 2018; Yuan et al, 2015). Unlike other evolutionary processes prone to events such as hybridization or horizontal gene transfer, population dynamics of somatic cells is underpinned by a strictly bifurcating clonal process driven by cell division. This is in perfect agreement with theoretical assumptions routinely applied in stochastic phylogenetic models such as coalescent (Hudson, 1990; Kingman, 1982; Posada, 2020) or birth-death processes (Aldous, 2001; Aldous, 1996; Komarova, 2006).

From the methodological perspective, however, cancer is an evolutionary process with unique characteristics, which are not modeled in conventional phylogenetic approaches. These include a high level of genomic instability with structural changes (gene losses and duplications), which accumulate along with point mutations during the course of growth and evolution (Beerenwinkel et al, 2015; Posada, 2015).

Traditional whole genome sequencing (WGS) methods have been instrumental in understanding cancer mutational profiles and oncogene detection (Mardis and Wilson, 2009; Nakagawa and Fujita, 2018). DNA from a tissue sample is isolated and sequenced “in bulk.” This increases the total amount of DNA, which improves coverage and reduces amplification errors. To establish the presence or absence of mutations, a variant allele frequency (VAF) is calculated and compared to a threshold, typically 10-20% (Strom, 2016). This filters out rare mutations present only in a few reads that are likely to be false positives or sequencing errors (Petrackova et al, 2019). More recently, bulk sequencing has been used to study cancer evolution using phylogenetic methods, either by comparing VAF (Ling et al, 2015; Zhao et al, 2016; Zhai et al, 2017) or estimating copy number variants (CNVs) (Desper et al, 1999; Demeulemeester et al, 2016; Tarabichi et al, 2021).
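The VAF filter described above can be illustrated with a minimal sketch. The read counts, site names, and the 10% threshold below are hypothetical values chosen for illustration; they are not taken from any dataset in this article.

```python
def variant_allele_frequency(alt_reads: int, total_reads: int) -> float:
    """Fraction of reads covering a site that support the variant allele."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    return alt_reads / total_reads


def passes_vaf_threshold(alt_reads: int, total_reads: int,
                         threshold: float = 0.10) -> bool:
    """Call the variant present when its VAF reaches the threshold."""
    return variant_allele_frequency(alt_reads, total_reads) >= threshold


# Hypothetical sites: (reads supporting the variant, total reads at the site)
sites = {"site_a": (30, 100), "site_b": (2, 100)}
calls = {name: passes_vaf_threshold(alt, total)
         for name, (alt, total) in sites.items()}
```

Here `site_b`, with a VAF of 2%, falls below the threshold and would be discarded as a likely error, which is also how a genuine but rare subclonal mutation would be lost.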

However, the usage of bulk sequencing in this context is problematic. Bulk samples contain cells from multiple cell lineages, including non-tumor cells, such as immune or blood vessel cells (Racle et al, 2017), and there is strong evidence for a constant migration of metastatic cells between tumors (Aguirre-Ghiso, 2010; Cheung and Ewald, 2016; Casasent et al, 2018; Reiter et al, 2017). High VAF thresholds ignore tumor heterogeneity, but by lowering the threshold, mutations in nontumor cells or clonal lineages are retained instead. Sequences or mutational profiles derived from bulk samples thus have a chimeric origin (Alves et al, 2017).

A typical assumption in classical phylogenetics is that the sequences or mutational profiles represent individual taxonomic units, either individuals or populations of closely related individuals. If these methods are used on the data from bulk samples, the reconstructed trees are not phylogenies describing an evolutionary history, but evolutionarily meaningless sample similarity trees (Alves et al, 2017). To address this issue, phylogenetic trees are reconstructed by estimating the sequential order of somatic mutations using VAF from one or multiple tumor samples (Deshwar et al, 2015; El-Kebir et al, 2018; Miura et al, 2018). Given the tumor heterogeneity and insufficient read depth to reliably estimate VAF, this is not a simple problem and the performance of current methods is limited (Miura et al, 2020).

Single-cell DNA sequencing (scDNA-seq) does not suffer from the chimeric DNA origin of bulk sequencing as each DNA segment is barcoded to guarantee its known cell of origin. Recent progress in WGS technology made sequencing individual cells cost-efficient (Gawad et al, 2016) and this approach is now regularly used for the phylogenetic reconstruction of metastatic cancer or the subclonal structure of a single tumor (Leung et al, 2017; Myers et al, 2019; Potter et al, 2013; Roth et al, 2016). However, this increased resolution comes with additional complications. Current methods are not sensitive enough to sequence DNA from a single cell and DNA amplification is required (Gawad et al, 2016). This process suffers from a random bias with different parts of the genome amplified in different quantities or not at all (Satas and Raphael, 2018).

In addition, polymerase does not replicate DNA without error; this can have a significant impact if the replication errors occur early in the amplification process (Gawad et al, 2016). This not only increases the error rate for identified single nucleotide variants (SNVs); a large proportion of SNVs might also simply be missing (Hicks et al, 2018). The advantages associated with scDNA-seq led to the development of novel approaches that tackle these challenges using an error model to correct for amplification errors and false-positive SNV calls (Luquette et al, 2019; Kozlov et al, 2022; Zafar et al, 2019).

Similar technological development led to proliferation of single-cell RNA sequencing (scRNA-seq), which, compared to traditional bulk RNA sequencing, enabled detection of gene expression profiles for individual cells in the tissue sample (González-Silva et al, 2020; Jerby-Arnon et al, 2018; Olsen and Baryawno, 2018; Müller et al, 2017b). This allows understanding tumor heterogeneity by identifying different cell populations (Andrews and Hemberg, 2018), estimating immune cell content within a tumor (Yu et al, 2019), or even identifying individual clones and subclones, as they can differ in their behavior (Fan et al, 2020).

However, as the levels of RNA expression vary between genes and cells, the amplification problems of scDNA-seq that cause unequal expression and dropout effects are more pronounced in scRNA-seq. There is an increased interest in SNV calling on scRNA-seq data using bulk SNV callers (Chen et al, 2016; Liu et al, 2019; Poirion et al, 2018; Schnepp et al, 2019) and specialized CNV callers (Kuipers et al, 2020; Harmanci et al, 2020a,b; Gao et al, 2021) as this allows for identification of mutations in actively expressed genes.

In this work, we test if expression values and SNVs inferred from scRNA-seq contain phylogenetic information to reconstruct a population history of cancer. We reconstruct phylogenetic trees using expression values and SNVs from three different datasets. We then compare and test these phylogenies against expected evolutionary history to determine whether scRNA-seq data contain phylogenetic signal.

2. METHODS

2.1. Experimental design

To test if scRNA-seq data contain phylogenetic signal, we select datasets with multiple regional samples and test if cells from a regional sample are phylogenetically closer to each other than to cells from other samples.

Methods that reconstruct phylogenies from sequence data assume that the data were generated by an evolutionary process. Under this assumption, a low phylogenetic signal would manifest as a high mutation rate and/or a star-like tree structure. However, both a high mutation rate and star trees are biologically plausible in case of rapid population expansion, such as during a rapid metastatic spread (Schwarz et al, 2015). Classical methods for estimation of phylogenetic signal, such as Pagel's $λ$ (Pagel, 1999) or Blomberg's $K$ (Blomberg et al, 2003) (see Münkemüller et al (2012) for review), are not applicable, as they test the presence of a phylogenetic signal of an observed trait based on a known phylogeny.

An alternative method to test whether the data do contain a phylogenetic signal that we employ in this article is to compare the resulting phylogeny to an expected relationship, such as the expected monophyly of groups of taxa obtained from multiple regional samples, for example, phylogenetic clusters of cells from healthy tissues or individual metastases. To further account for migration and uncertainty, we do not require cells from a single regional sample to be monophyletic, but test whether cells from a regional sample are phylogenetically closer to each other than to cells from other samples using phylogenetic clustering tests.

2.2. Dataset selection

We perform our phylogenetic analyses on three different datasets: a new unique molecular identifier (UMI)-based dataset of breast cancer-derived xenografts (BCX), and two previously published datasets, a UMI-based dataset of small intestinal neuroendocrine cancer (INC) (Rao et al, 2020a) and a non-UMI-based dataset of gastric cancer (GC) (Wang et al, 2021). These datasets contain primary and metastatic cells from multiple regional samples, allowing us to assess the performance of the phylogenetic analysis using the phylogenetic clustering tests. We expect that primary cells, metastatic cells, and cells from each regional sample will each cluster together, forming cell-type and region-specific clades.

The BCX dataset consisted of three individual-specific tumor samples T1, T2, and T3, seeded from the same population of cells, and two matched circulating tumor cell (CTC) samples CTC1 and CTC2. Each tumor was derived from the same population and shares a common ancestor; each tumor thus represents a distinct regional sample. We expect cells isolated from each individual, and cells from each sample, to form clusters. As the cell line MDA-MB-231-LM2 is highly metastatic (Minn et al, 2005), we do not expect CTC to form a single monophyletic clade, but a larger number of smaller clades embedded inside the tumor cells of a sampled individual.

The INC dataset from Rao et al (2020a) consisted of a primary tumor and a paired liver metastatic sample. Both samples contained a mixture of cancerous and noncancerous cells (fibroblasts, endothelial cells, and immune cells). We expect cancerous cells to form a cluster, with metastatic cells forming distinct clusters among the cancer cells.

The GC dataset from Wang et al (2021) consisted of 94 cells from a primary tumor and a lymph node of three patients (GC1, GC2, and GC3). We expect that for each patient, the lymph node cells would form a monophyletic lineage derived from the primary tumor cell, but due to the small number of cells, clustering of the primary tumor cells is also interpreted as a success.

2.3. Preparation of the BCX dataset

MDA-MB-231-LM2 (green fluorescent protein: GFP+) (Minn et al, 2005) cells were injected into the R4 mammary fat pad of Nu/J mice (250,000 cells per mouse, 3 mice), and tumor growth was monitored for 8 weeks. Mice were euthanized when tumor size approached the endpoint (2 cm). Tumors were resected and dissociated into single cells. To extract CTC, up to 1 mL of blood was drawn immediately posteuthanasia using cardiac puncture. Red blood cells were removed using RBC lysis buffer. All cells (tumor derived and CTC) were stained with 4′,6-diamidino-2-phenylindole (DAPI) and sorted for DAPI and GFP using a BD FACSAria cell sorter. Libraries were generated using the 10x Chromium single cell gene expression system immediately after cell sorting, and sequenced on an Illumina NextSeq platform together to eliminate batch effect.

2.4. Mapping and demultiplexing

The BCX and INC reads were mapped with the Cellranger v5.0 software to the GRCh38 v15 from the Genome Reference Consortium using the analysis-ready assembly without alternative locus scaffolds (no_alt_analysis_set) and associated GTF annotation file. The Cellranger software performs mapping, demultiplexing, cell detection, and gene quantification for the 10x Genomics scRNA-seq data.

Reads for the non-UMI GC dataset were mapped with STAR v2.7.9a (Dobin et al, 2013), using the same reference and GTF annotation file.

2.5. Postprocessing expression data

For the INC and GC datasets, previously published expression datasets from Rao et al (2020a) (GSE140312) and Wang et al (2021) (GSE158631) were used. For the BCX dataset, we have used the filtered expression values produced by Cellranger.

2.5.1. Standardizing expression values

The filtered feature-barcode expression values from Cellranger were processed using the R Seurat v4.0.4 package (Stuart et al, 2019). Expression values from different regional samples (or individuals in case of BCX) were merged and analyzed together. The expression values for each gene were centered ( $μ = 0$ ) and rescaled ( $σ^{2} = 1$ ). No normalization or filtering was done at this step.
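The centering and rescaling step can be sketched for a single gene as follows. This is a minimal illustration and not the Seurat implementation; the function name `standardize` and the choice of the sample standard deviation are assumptions made for the example.

```python
from statistics import fmean, stdev


def standardize(values):
    """Center one gene's expression values to mean 0 and rescale to unit variance.

    Assumption: the sample standard deviation (n - 1 denominator) is used.
    A gene with no variation is mapped to zeros instead of dividing by zero.
    """
    mu = fmean(values)
    sigma = stdev(values, mu)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Hypothetical expression values of one gene across three cells
scaled = standardize([1.0, 2.0, 3.0])
```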

2.5.2. Discretizing expression values

To reconstruct phylogenies, expression values need to be discretized as phylogenetic software does not support tree reconstruction from continuous data. The rescaled expression values were categorized into a five-level ordinal scale ranging from 1 (low level of expression) to 5 (high level of expression). The five-level scale was chosen to capture the data distribution of rescaled expression values and represent a compromise between introducing data noise with too many levels or artificial similarity with only a few categories.

The interval boundaries used for categorization were derived from the 60% and 90% highest density intervals (HDI), the shortest intervals containing 60% or 90% of values, respectively. Values inside the 60% HDI were categorized as normal, values inside the 90% HDI but outside the 60% HDI as increased/decreased expression, and values outside the 90% HDI as extremely increased/decreased expression.

Genes with the same categorized value in every cell were removed as phylogenetically uninformative, and the discretized values were then transformed into fasta format.
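The HDI-based categorization can be sketched as below. This is an illustrative reading of the description above, not the phyloRNA code: the function names are hypothetical, the HDI is taken as the narrowest window of sorted values covering the requested fraction, and the 60% HDI is assumed to lie inside the 90% HDI.

```python
import math


def hdi(values, mass):
    """Shortest interval containing at least `mass` of the values."""
    xs = sorted(values)
    n = len(xs)
    k = max(1, math.ceil(mass * n))  # number of points the interval must cover
    # Slide a window of k consecutive sorted values and keep the narrowest one.
    start = min(range(n - k + 1), key=lambda i: xs[i + k - 1] - xs[i])
    return xs[start], xs[start + k - 1]


def discretize(values, inner=0.60, outer=0.90):
    """Map rescaled expression values to ordinal levels 1 (low) to 5 (high)."""
    lo_in, hi_in = hdi(values, inner)
    lo_out, hi_out = hdi(values, outer)

    def level(v):
        if v < lo_out:
            return 1  # extremely decreased: below the 90% HDI
        if v < lo_in:
            return 2  # decreased: inside the 90% HDI, below the 60% HDI
        if v <= hi_in:
            return 3  # normal: inside the 60% HDI
        if v <= hi_out:
            return 4  # increased: inside the 90% HDI, above the 60% HDI
        return 5      # extremely increased: above the 90% HDI

    return [level(v) for v in values]


# Hypothetical rescaled expression values
example = [-3, -1.5, -0.5, -0.2, 0, 0.1, 0.3, 0.6, 1.4, 5]
levels = discretize(example)
```

In this small example the 90% HDI spans the nine lowest values, so only the largest value is labeled as extremely increased and no value falls into level 1.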
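A minimal sketch of this filtering and export step is given below. The data layout (a dictionary of per-gene level lists), the function names, and the use of `-` for unknown values are assumptions made for illustration.

```python
def drop_invariant_genes(levels_by_gene):
    """Keep genes whose known levels differ between at least two cells."""
    return {
        gene: levels
        for gene, levels in levels_by_gene.items()
        if len({v for v in levels if v is not None}) > 1
    }


def to_fasta(levels_by_gene, cell_names, unknown="-"):
    """Write one sequence per cell; each character is one gene's level (1-5)."""
    genes = list(levels_by_gene)
    records = []
    for i, cell in enumerate(cell_names):
        seq = "".join(
            unknown if levels_by_gene[g][i] is None else str(levels_by_gene[g][i])
            for g in genes
        )
        records.append(f">{cell}\n{seq}")
    return "\n".join(records) + "\n"


# Hypothetical discretized table: one list of levels per gene, one entry per cell
levels_by_gene = {
    "geneA": [3, 3, 3],     # same level in every cell: uninformative
    "geneB": [1, 3, 5],
    "geneC": [2, None, 4],  # None marks an unknown value
}
informative = drop_invariant_genes(levels_by_gene)
fasta = to_fasta(informative, ["cell1", "cell2", "cell3"])
```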

2.5.3. Recording unexpressed genes as unknown data

The amount of coverage in a standard bulk RNA-seq expression analysis is usually sufficient to conclude that genes for which no molecule was detected are not expressed (Lähnemann et al, 2020). In scRNA-seq, however, the sequencing coverage is very low, a dropout effect is likely, and thus this assumption does not hold. This is especially a problem for non-UMI-based technologies (Cao et al, 2021), but it is not entirely absent from the UMI-based technologies either, due to biological and technological processes (Hsiao et al, 2020; Townes and Irizarry, 2020).

According to the standard expression pipeline, these values are commonly treated as biological zeros, that is, no detected expression of a particular gene, and have a significant impact on the data distribution during the normalization and rescaling steps (Hicks et al, 2018; Townes and Irizarry, 2020). Without an explicit model of dropout effect to account for technical or biological variation, these values might be more accurately represented as unknown values rather than true biological zeros (Van den Berge et al, 2018). We have modified the Seurat code to treat these values as unknown values (NA in R) and included modified functions in the phyloRNA package.

We will further use data density to describe the proportion of known values in both expression and SNV datasets, with 100% representing a dataset without unknown values and 0% representing a dataset formed entirely of unknown values.
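Under this definition, data density is the share of known entries in the cell-by-feature table. The sketch below assumes the table is stored as a list of rows with `None` marking unknown values; the function name is hypothetical.

```python
def data_density(matrix):
    """Proportion of known (non-None) entries in a cells x features table."""
    total = sum(len(row) for row in matrix)
    known = sum(value is not None for row in matrix for value in row)
    return known / total if total else 0.0


# Hypothetical table with two cells and two genes; one of four values is known
density = data_density([[1, None], [None, None]])
```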

2.6. SNV

2.6.1. Preprocessing reads for SNV detection

The BAM files from Cellranger were processed using the Broad Institute's Genome Analysis ToolKit (GATK) v4.2.3.0 (Poplin et al, 2018) according to GATK best practices of somatic short variant discovery. Reads were sorted, processed, and recalibrated using GATK's SortSam, SplitNCigarReads, and Recalibrate.

2.6.2. SNV detection and filtering

To obtain SNVs for individual cells of the scRNA-seq data, first, a list of SNVs was obtained by running Mutect2 (Benjamin et al, 2019), treating the data set as a pseudo-bulk sample, and retaining only the SNVs that passed all filters.

For the BCX dataset, Mutect2 was run in the tumor with matched normal mode using the parental cell line MDA-MB-231 from Kidwell et al (2021) as the normal sample and a Panel of Normals derived from the same source; see Supplementary Materials S1 for details.

For the INC and GC datasets, we have used Mutect2 in a tumor-only mode using the Panel of Normals and the gnomAD germline data from the GATK best practice resource bundle.

SNVs for individual cells were then obtained by individually summarizing reads belonging to each single cell at the positions of the SNVs obtained beforehand using the pysam library, which is built on htslib (Li et al, 2009). The most common base for every cell and every position was retained, and base heterogeneity and CNVs were ignored. This SNV table was then transformed into fasta format.
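The per-cell summarization can be sketched without reference to a particular BAM library. In the example below, the bases observed in each cell at each SNV position are assumed to have been collected already (for instance from a pileup); the data layout, the function names, the tie-breaking behavior, and the use of `N` for positions without coverage are all assumptions made for illustration.

```python
from collections import Counter

VALID_BASES = frozenset("ACGT")


def most_common_base(bases, unknown="N"):
    """Most frequent base among a cell's reads at one position.

    Ties are resolved in favor of the base seen first (an arbitrary choice);
    positions without any valid base are reported as unknown.
    """
    counts = Counter(b for b in bases if b in VALID_BASES)
    if not counts:
        return unknown
    return counts.most_common(1)[0][0]


def snv_table_to_fasta(pileups, positions):
    """Build one fasta record per cell from per-position base observations."""
    records = []
    for cell, by_position in pileups.items():
        seq = "".join(most_common_base(by_position.get(p, [])) for p in positions)
        records.append(f">{cell}\n{seq}")
    return "\n".join(records) + "\n"


# Hypothetical observations: cell -> SNV position -> bases from that cell's reads
pileups = {
    "cell1": {101: ["A", "A", "G"], 205: []},
    "cell2": {101: ["G"], 205: ["T", "T"]},
}
snv_fasta = snv_table_to_fasta(pileups, positions=[101, 205])
```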

2.7. Finding a well-represented subset of data

When treating the potentially unexpressed genes as unknown values, only a small proportion of the expression count values was known, with the data set derived from SNV suffering from the same problem due to the low number of reads for each cell.

While model-based phylogenetic methods can process missing data by treating the missing data as phylogenetically neutral, this significantly flattens the likelihood space, which can cause artifacts, convergence problems or increase computational time (Jiang et al, 2014; Wiens, 2006; Xi et al, 2016).

Data sets analyzed with published phylogenetic tools designed for single-cell DNA ranged from 47 cells and 40 SNVs (Jahn et al, 2016) to 370 cells and 50 SNVs (Singer et al, 2018) or, in an extreme case, 18 cells and 50,000 SNVs (Singer et al, 2018), with at most 58% of missing data across these data sets. In comparison, scRNA-seq technology can produce up to tens of thousands of cells with tens of thousands of detected genes (Chen et al, 2019), and data reduction is often required.

To alleviate these issues, we employ two different filtering strategies to reduce the size of datasets, while preserving as much information as possible, a selection strategy, where a set of high-quality cells is selected, and a stepwise filtration algorithm, where a subset of data with the highest data density is selected. Under the selection filtering strategy, a set of cells is selected, either cells of interest from the expression analysis, or a fixed subset of cells with the highest data density. This allows for a construction of datasets of specific size.

The stepwise filtering algorithm aims to find a well-represented subset of the data. By iteratively cutting out cells and genes/SNVs with the smallest number of known values, we increase the data density until a local maximum or desired data density is reached. This is equivalent to the gene/cell quality filtering during the scRNA-seq postprocessing pipeline, such as in the Seurat's standard preprocessing workflow described above, where low-quality cells and genes are removed. The advantage of this method is that a desired density can be reached with the least amount of data removed.
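One way to read the stepwise procedure is sketched below. This is not the phyloRNA implementation: the function name, the use of the known-value proportion to compare cells against genes/SNVs, and the stopping rule are assumptions made for the example.

```python
def stepwise_filter(matrix, target_density):
    """Iteratively drop the sparsest cell (row) or feature (column).

    `matrix` is a list of rows with None marking unknown values. Returns the
    indices of the retained rows and columns. Filtering stops when the target
    density is reached, when removing the sparsest row or column would no
    longer increase the density (a local maximum), or when only one row or
    column is left.
    """
    rows = list(range(len(matrix)))
    cols = list(range(len(matrix[0]))) if matrix else []

    while len(rows) > 1 and len(cols) > 1:
        known = sum(matrix[r][c] is not None for r in rows for c in cols)
        density = known / (len(rows) * len(cols))
        if density >= target_density:
            break
        # Proportion of known values in each remaining row and column
        row_known = {r: sum(matrix[r][c] is not None for c in cols) / len(cols)
                     for r in rows}
        col_known = {c: sum(matrix[r][c] is not None for r in rows) / len(rows)
                     for c in cols}
        worst_row = min(rows, key=row_known.get)
        worst_col = min(cols, key=col_known.get)
        if min(row_known[worst_row], col_known[worst_col]) >= density:
            break  # local maximum: no removal increases the density
        if row_known[worst_row] <= col_known[worst_col]:
            rows.remove(worst_row)
        else:
            cols.remove(worst_col)
    return rows, cols


# Hypothetical 4 cells x 4 genes table, None = unknown
table = [
    [1, 1, None, 1],
    [1, 1, None, None],
    [None, None, None, None],
    [1, 1, 1, None],
]
kept_rows, kept_cols = stepwise_filter(table, target_density=1.0)
```

In this example the empty third cell is removed first, followed by the two sparsest genes, leaving a fully known table of three cells and two genes.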

2.8. Phylogenetic analysis

To reconstruct phylogenetic trees from the categorized expression values and identified SNVs, we used IQ-TREE v2.1.4 (Minh et al, 2020) and BEAST2 v2.6.3 (Bouckaert et al, 2019).

The IQ-TREE analysis was performed with an ordinal model and an ascertainment bias correction (-m ORDINAL+ASC) for the expression data, and a standard model selection was performed for the SNV data (-m TEST). Where the size of the dataset allowed, tree support was evaluated using the standard nonparametric bootstrap (Felsenstein, 1985) with 100 replicates (-b 100).

The BEAST2 analysis was performed with a birth-death tree prior. An earlier version of this analysis used a coalescent tree prior with exponential population growth (Kingman, 1982), which is not compatible with the birth-death model. For the expression data, BEAST2 was run using the ordinal model available in the Morph-Models package, while the SNV values were analyzed using the Generalized Time-Reversible model (Tavaré, 1986). For both the expression and the SNV data sets, BEAST2 was set to not ignore ambiguous states.

2.9. Phylogenetic clustering tests

To test if the phylogenetic methods were able to recover expected population history, we employ mean pairwise distance (MPD) (Webb, 2000) and mean nearest taxon distance (MNTD) (Webb, 2000). MPD is calculated as a mean distance between each pair of taxa from the same group, while MNTD is calculated as a mean distance to the nearest taxon from the same group. For each sample and samples isolated from a single individual, MPD and MNTD are calculated and compared to a null distribution obtained by permuting sample labels on a tree and calculating MPD and MNTD for these permutations. The p-value is then calculated as a rank of the observed MPD/MNTD in the null distribution normalized by the number of permutations.

The MPD and MNTD are calculated using the ses.mpd and ses.mntd functions implemented in the package picante (Kembel et al, 2010). For the Bayesian phylogenies, MPD and MNTD were calculated for a sample of 1000 trees from the posterior distribution and then summarized with mean and 95% confidence interval. For maximum likelihood phylogenies, MPD and MNTD were calculated from 100 trees from the nonparametric bootstrap and summarized in the same manner as Bayesian trees.
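The logic of these tests can be sketched in a few lines. The example below is not the picante implementation used in this work: it assumes that pairwise distances between tips of the tree have already been computed, the function names are hypothetical, and the p-value is taken as the rank of the observed statistic among the permuted values divided by the number of permutations plus one, which is one common convention.

```python
import random
from itertools import combinations


def mpd(dist, group):
    """Mean distance over all pairs of taxa in the group."""
    pairs = list(combinations(group, 2))
    return sum(dist[a][b] for a, b in pairs) / len(pairs)


def mntd(dist, group):
    """Mean distance from each taxon to its nearest taxon in the same group."""
    return sum(min(dist[a][b] for b in group if b != a) for a in group) / len(group)


def clustering_p_value(dist, group, statistic, runs=999, seed=0):
    """Permutation p-value; small values indicate phylogenetic clustering."""
    rng = random.Random(seed)
    taxa = sorted(dist)
    observed = statistic(dist, group)
    # Permuting labels on the tree is equivalent to drawing a random group
    # of the same size from all tips.
    null = [statistic(dist, rng.sample(taxa, len(group))) for _ in range(runs)]
    rank = 1 + sum(value <= observed for value in null)
    return rank / (runs + 1)


def symmetric(pair_distances):
    """Expand {(a, b): d} into a nested dictionary usable in both directions."""
    dist = {}
    for (a, b), d in pair_distances.items():
        dist.setdefault(a, {})[b] = d
        dist.setdefault(b, {})[a] = d
    return dist


# Hypothetical tip-to-tip distances: (a, b) and (c, d) are two tight clades
dist = symmetric({
    ("a", "b"): 0.2, ("c", "d"): 0.2,
    ("a", "c"): 2.0, ("a", "d"): 2.0, ("b", "c"): 2.0, ("b", "d"): 2.0,
})
observed_mpd = mpd(dist, ["a", "b"])
observed_mntd = mntd(dist, ["a", "b"])
p_mpd = clustering_p_value(dist, ["a", "b"], mpd)
```

With only four tips the null distribution is too coarse to reach significance; the example is meant to show the mechanics of the test and not a realistic analysis.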

2.10. Code and data availability

Code required to replicate the data processing steps is available at https://github.com/bioDS/phyloRNAanalysis.

To aid in creating pipelines for phylogenetic analysis of scRNA-seq data, we have integrated a number of common tools in the R phyloRNA package, which is available at https://github.com/bioDS/phyloRNA.

Raw reads and expression matrices produced by Cellranger for the BCX dataset are in the NCBI GEO under the accession number GSE163210.

Alignments in the fasta format, reconstructed trees, and phylogenetic clustering test results are available at https://github.com/bioDS/phyloRNAanalysis/tree/processed_files.

3. RESULTS

3.1. Breast cancer-derived xenografts

3.1.1. Sample overview

In total, five samples were used in this analysis, three tumor samples (T1, T2, and T3) and two CTC samples (CTC1 and CTC2). The number of cells isolated from the CTC3 sample was too small for scRNA sequencing and the sample was removed from the study. The number of detected cells in the tumor samples was generally smaller than in the CTC samples, but the reverse was true for the total number of detected UMIs—the number of unique mRNA transcripts (Table 1). In the T2 sample, a large number of cells but a small number of UMIs were detected in a similar pattern to the CTC samples.

Table 1.

An Overview of the Breast Cancer-Derived Xenograft Dataset Used in This Work


(a) Experiment overview

(b) Sample overview

Sample	Cells (FACS)	Cells (Cellranger)	Genes	UMI	UMI/cell	Data density
T1	11,258	701	17k	3,167k	4,518	2.97%
T2	20,233	2,794	5k	69k	25	0.04%
T3	13,865	806	18k	2,876k	3,569	2.57%
CTC1	605	3,125	8k	129k	41	0.06%
CTC2	415	3,161	9k	155k	49	0.06%
Total	46,376	10,587	20k	6.4M	604	0.44%

In total, five samples were isolated from three individuals (a): three tumor samples (T1, T2, and T3) and two CTC samples (CTC1 and CTC2). For each sample, the number of cells from FACS, the number of cells identified by Cellranger, the number of detected genes, the number of UMIs, UMI/cell ratio, and the data density are reported (b).

CTC, circulating tumor cell; FACS, fluorescence-activated cell sorting; UMI, unique molecular identifier.

Compared to the fluorescence-activated cell sorting, Cellranger detected fewer cells for tumor samples, but more cells for the CTC samples. Cellranger classifies barcodes as cells based on the amount of UMI detected to distinguish real cells from background noise (Lun et al, 2019). The large number of detected cells in the CTC samples is likely a result of lysed cells or cell-free RNA (Fleming et al, 2019). In all cases, the proportion of known expression values across data sets was relatively low, with the best sample, T1, amounting to about 3% of known expression values.

3.1.2. SNV identification

To identify SNVs in scRNA-seq data, we first identified a list of SNVs by treating the single-cell reads as a pseudo-bulk sample. A total of 21,261 SNVs that passed all quality filters were identified this way. When these SNVs were called for each individual cell, the resulting data set had a data density of <0.13%.

The expression data are expected to have higher data density than the SNV data because, for expression quantification, the presence or absence of a molecule is sufficient, while for SNVs, knowledge of each position is required. This expectation is confirmed in Table 1, where data density of the expression data is summarized. About 40% of the 10,587 cells represented in this data set did not contain any positively identified SNV after filtering out false positives; these were relatively equally distributed among the T2 (1487), CTC1 (1379), and CTC2 (1324) samples. This represents a challenge from a data analysis perspective, given the large sample size and its small data density.

3.1.3. Data reduction

With over 10,000 cells and more than 20,000 genes and SNVs, the unfiltered datasets would require substantial computational resources. An additional issue we encountered in our data was a significant difference in quality between individual samples: only five CTC1 and six CTC2 cells passed the quality filtering criteria of a minimum of 250 represented genes and a minimum of 500 UMI per cell, and no T2 cell passed the quality filtering. This contrasts with the T1 and T3 samples, where 701 and 806 cells passed the quality filtering criteria, respectively.

Due to this varied quality of samples, filtering data to a higher data density using the stepwise filtering algorithm leads to the removal of low-quality samples (T2, CTC1, and CTC2), which bars us from testing the phylogenetic structure using the phylogenetic clustering tests. For this reason, we have selected a small number of cells with the least amount of missing data from each sample using the selection filtering method. The small number of cells is not sufficient to represent the full diversity of the tumor, but allows us to test the phylogenetic relationship between individual samples without introducing a bias due to an unequal size of the samples.

A total of 58 cells were retained for both the expression and SNV datasets: 20 cells each for the T1 and T3 samples and six cells each for the T2, CTC1, and CTC2 samples. In these reduced datasets, genes that were not present in any of the cells or present only in a single cell were removed. The reduced expression data set contained 30% of known data distributed across 7,520 genes. The SNV data set contained 10% of known data distributed across 1,058 SNVs. These reduced data sets were analyzed using maximum likelihood and Bayesian methods to further explore the topological uncertainty.

Reconstructed trees and phylogenetic tests for the data filtered to the 20%, 50%, and 90% data density using the stepwise filtering algorithm are provided in the Supplementary Materials S1.

3.1.4. Phylogenetic reconstruction from expression data

The maximum likelihood tree reconstructed from the reduced expression data set showed significant clustering of most samples (Fig. 1a). This is confirmed by the phylogenetic clustering tests, where all but the CTC2 cells had a significant MPD p-value (Table 2). Four out of six CTC2 cells clustered together, but on the opposite side of the tree with phylogenetic proximity to the T1 cells. This close phylogenetic relationship suggests that T1 and CTC2 were isolated from a single individual. This pattern is further reinforced as T2 cells clustered in a single compact clade with phylogenetic proximity to the CTC1 sample. When this relationship was tested with phylogenetic clustering methods, both MPD and MNTD confirmed the strong clustering signal between T2 and CTC1. The same tests were not significant for the T1-CTC2 grouping, likely due to the presence of two nonclustering CTC2 cells.

FIG. 1.

(a) Maximum likelihood and (b) Bayesian trees reconstructed from the expression data for the 58 selected cells. Terminal branches are colored according to cell's sample of origin (T1, T2, T3, CTC1, and CTC2). In the maximum likelihood tree, the T2, CTC1, and CTC2 samples are also marked with colored circles. For the Bayesian tree, Bayesian posterior values show the topology uncertainty. CTC, circulating tumor cell.

Table 2.

Test of Phylogenetic Clustering for the Reduced Dataset of the 58 Selected Cells

Groups	Cells	Expression (ML) MPD	Expression (ML) MNTD	Expression (BI) MPD	Expression (BI) MNTD	SNV (ML) MPD	SNV (ML) MNTD	SNV (BI) MPD	SNV (BI) MNTD
T1	20	*0.003	0.755	*0.001	*0.006	*0.001	*0.008	*0.001	0.434
T2	6	*0.001	*0.001	*0.001	*0.003	*0.001	*0.001	*0.001	*0.001
T3	20	*0.001	0.423	*0.001	*0.017	*0.002	0.660	*0.001	0.184
CTC1	6	*0.002	*0.004	0.992	0.479	*0.001	*0.001	*0.001	*0.002
CTC2	6	1.000	0.988	0.724	0.966	0.247	0.595	0.223	0.594
T1 and CTC1	26	0.059	0.407	0.139	0.261	0.475	*0.008	0.466	0.305
T2 and CTC2	12	0.997	0.027	0.970	0.165	0.998	0.800	0.977	0.317
T1 and CTC2	26	0.637	0.999	*0.005	0.932	*0.001	0.630	*0.001	0.943
T2 and CTC1	12	*0.001	*0.001	0.674	0.034	*0.001	*0.001	*0.001	*0.001

MPD and MNTD calculated for the ML and BI trees from the expression and SNV data. p-Values for MPD and MNTD were calculated for each sample (T1, T2, T3, CTC1, and CTC2) and expected clustering for cells isolated from a single individual (T1 with CTC1, and T2 with CTC2) and to test a possible mislabeling between CTC1 and CTC2 samples (T1 with CTC2, and T2 with CTC1). Significant p-values at $α = 0.05$ after correcting for multiple comparisons using the False Discovery Rate method (Benjamini and Hochberg, 1995) are marked with an asterisk. Values of MNTD and MPD calculated for the ML bootstrap sample and BI posterior tree sample are available in the Supplementary Table S1.

*Significant support.

BI, Bayesian; MNTD, mean nearest taxon distance; MPD, mean pairwise distance; ML, maximum likelihood; SNV, single nucleotide variant.
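The False Discovery Rate correction mentioned in the table note can be sketched as follows. This is a plain implementation of the Benjamini-Hochberg step-up procedure. The p-values are the two Expression (ML) columns of Table 2, and treating these 18 values as one family of tests is an assumption made only for illustration, as the grouping used for the correction is not stated here.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return one boolean per p-value: True where the test is significant
    under the Benjamini-Hochberg false discovery rate procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    cutoff = 0
    for rank, i in enumerate(order, start=1):
        if p_values[i] <= rank / m * alpha:
            cutoff = rank
    # Every test ranked at or below the cutoff is declared significant.
    significant = [False] * m
    for i in order[:cutoff]:
        significant[i] = True
    return significant

# Expression (ML) columns of Table 2: nine MPD values, then nine MNTD values.
p_values = [0.003, 0.001, 0.001, 0.002, 1.000, 0.059, 0.997, 0.637, 0.001,
            0.755, 0.001, 0.423, 0.004, 0.988, 0.407, 0.027, 0.999, 0.001]
flags = benjamini_hochberg(p_values)
```

Under this assumed grouping, eight values are flagged, and the nominally small value of 0.027 (T2 and CTC2, MNTD) is not, which agrees with the asterisks in those two columns.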

The phylogenies reconstructed from the same data using Bayesian inference show a similar pattern of clustering (Fig. 1b, Table 2), although neither CTC1 nor CTC2 formed a compact cluster. The T2 and CTC1 connection is not supported, but about half of the CTC1 cells were placed in a group with the T2 samples. Similar to the maximum likelihood tree, this group was not closely related to the T1 and T3 cells; instead, it formed a distantly related sister group. The relationship between T1 and CTC2 is supported by the MPD statistic on the Bayesian phylogeny (Table 2).

Neither the MNTD nor the MPD statistic on the maximum likelihood and Bayesian phylogenies supported the clustering of CTC2 cells. This might suggest that the CTC2 cells are polyphyletic, with their origin in the seeding population before the injection. This is plausible, given that the cell line used (MDA-MB-231-LM2) is highly metastatic (Minn et al, 2005).

In addition to testing on the best phylogeny, we integrated the topological uncertainty of the reconstructed phylogenies by performing the phylogenetic clustering tests on the 100 bootstrap replicates from the maximum likelihood analysis and on a sample of 1000 trees from the Bayesian posterior. The distributions of MPD and MNTD p-values calculated across these trees were then summarized using the mean and $95\%$ confidence interval. The majority of relationships from the best tree were also supported by the tree samples (Supplementary Table S1). This suggests that, while there is high uncertainty in the data and reconstructed topologies, we can reconstruct broad topological patterns with relatively high certainty.
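A minimal sketch of such a summary is given below. It assumes that the clustering test has already produced one p-value per tree in the sample, and it uses the empirical 2.5% and 97.5% quantiles as the interval. The text does not say how the interval was computed, so the quantile choice and the function names are assumptions.

```python
from statistics import mean

def percentile(sorted_values, q):
    """Linear-interpolation percentile of an already sorted list, q in [0, 1]."""
    position = q * (len(sorted_values) - 1)
    lower = int(position)
    upper = min(lower + 1, len(sorted_values) - 1)
    fraction = position - lower
    return sorted_values[lower] * (1 - fraction) + sorted_values[upper] * fraction

def summarize(p_values):
    """Mean and empirical 95% interval of p-values computed across a tree sample."""
    ordered = sorted(p_values)
    return mean(ordered), percentile(ordered, 0.025), percentile(ordered, 0.975)

# Hypothetical p-values, one per tree in a sample of 101 trees.
per_tree_p = [i / 100 for i in range(101)]
avg, low, high = summarize(per_tree_p)
```

A grouping whose whole interval lies below the significance threshold is supported across the tree sample, whereas a wide interval indicates that support depends on the particular topology.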

3.1.5. Phylogenetic reconstruction from the SNV data

The maximum likelihood tree reconstructed from the reduced SNV dataset (Fig. 2a) displayed a similar but weaker pattern than the one reconstructed from the expression data. The CTC2 cells no longer formed two compact clusters and were dispersed along the tree. Similar to the expression data, the T2 and CTC1 cells were placed together on a long branch, suggesting a long shared evolutionary history. However, unlike in the expression data, the T1 and T3 cells were more interspersed, with very short branches. The phylogenetic clustering tests confirm the grouping of all samples except CTC2 (Table 2), as well as the putative relationships between the T1 and CTC2 samples and between the T2 and CTC1 samples. This reinforces the hypothesis of a possible mislabeling between the CTC1 and CTC2 samples.

FIG. 2.

(a) Maximum likelihood and (b) Bayesian trees reconstructed from the SNV data for the 58 selected cells. Terminal branches are colored according to each cell's sample of origin (T1, T2, T3, CTC1, and CTC2). In the maximum likelihood tree, cells are also marked with colored circles, and an extremely long branch leading to the T2 cells (dashed line) was collapsed. For the Bayesian tree, Bayesian posterior values show the topological uncertainty. SNV, single nucleotide variant.

A similar pattern of sample clustering can be observed on the Bayesian phylogeny reconstructed from the same data (Fig. 2b), with the T2 and CTC1 cells placed on a distantly related sister branch to all other samples. The T1 and T3 cells are still interspersed, but the CTC2 cells seem to cluster together more closely. As with the expression analysis, these relationships are stable when the topological uncertainty is taken into account in the phylogenetic clustering tests (Supplementary Table S1).

Additional topological comparison between trees from SNV and expression data can be found in Supplementary Materials S1.

3.1.6. Biological zero or unknown value

To test whether zero expression values should be treated as unknown data rather than as biological zeros, that is, no expression of a particular gene, we reconstructed the phylogenies from the scRNA-seq expression data while treating the zeros in the dataset as biological zeros. Data were processed as per the standard methodology to obtain the alignments, but instead of treating the zeros as an unknown position, they were treated as a category 0, in addition to the five-level ordinal scale. Phylogenies were then reconstructed using both maximum likelihood and Bayesian methods, with sample clustering explored using the phylogenetic clustering tests.
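The difference between the two treatments of zeros can be sketched as a small encoding step. The cut points for the five-level scale, the use of `?` as the unknown-state symbol, and the function name are hypothetical choices made for this illustration and are not taken from the methodology of the study.

```python
from bisect import bisect_right

# Hypothetical cut points separating five ordinal levels of standardized expression.
BREAKS = (-1.5, -0.5, 0.5, 1.5)

def encode_cell(raw_counts, standardized, zero_as_unknown=True):
    """Encode one cell as a string of character states, one per gene.

    Genes with a raw count of zero become '?' (unknown) or '0' (biological
    zero); all other genes are assigned a level from '1' to '5'."""
    states = []
    for count, value in zip(raw_counts, standardized):
        if count == 0:
            states.append("?" if zero_as_unknown else "0")
        else:
            states.append(str(bisect_right(BREAKS, value) + 1))
    return "".join(states)

# Hypothetical cell with five genes, two of which have no reads.
raw = [0, 3, 10, 0, 1]
std = [0.0, -2.0, 0.1, 0.0, 2.0]
as_unknown = encode_cell(raw, std, zero_as_unknown=True)      # "?13?5"
as_biological = encode_cell(raw, std, zero_as_unknown=False)  # "01305"
```

Only the coding of zero-count positions differs between the two alignments, so any change in the reconstructed phylogeny can be attributed to the information, or noise, carried by the zeros.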

In the phylogenies reconstructed from the expression data when zero is treated as a biological zero (Fig. 3), the CTC2 cells formed a cluster and were placed close to the T2 and CTC1 cluster. This cluster was no longer placed as a sister branch to the T1 and T3 cells, but was deeply nested. The T1 and T3 samples were more interspersed than when zero is treated as unknown data. This change in the phylogenetic structure is supported by the phylogenetic clustering tests (Table 3): the clustering of T1 and of T3 is no longer supported, whereas the clustering of CTC2 cells is supported in both the maximum likelihood and Bayesian phylogenies. Likewise, the T1 and CTC2 grouping is not supported, as the CTC2 cells group together with the CTC1 and T2 samples.

FIG. 3.

Table 3.

Test of Phylogenetic Clustering for the Expression Data When Zero Expression Level Is Treated as Biological Zero

Groups	Cells	Expression (ML) MPD	Expression (ML) MNTD	Expression (BI) MPD	Expression (BI) MNTD
T1	20	1.000	1.000	0.998	0.992
T2	6	*0.001	*0.001	*0.001	*0.001
T3	20	0.543	0.994	0.804	0.998
CTC1	6	*0.001	*0.001	*0.001	*0.001
CTC2	6	*0.004	*0.013	*0.001	*0.011
T1 and CTC1	26	0.992	0.928	0.968	0.560
T2 and CTC2	12	*0.001	*0.001	*0.001	*0.001
T1 and CTC2	26	0.997	0.994	0.991	0.888
T2 and CTC1	12	*0.001	*0.001	*0.001	*0.001

MPD and MNTD calculated for the ML and BI trees from the expression data, with zeros treated as biological zeros. p-Values for MPD and MNTD were calculated for each sample (T1, T2, T3, CTC1, and CTC2) and expected clustering for cells isolated from a single individual (T1 with CTC1, and T2 with CTC2) and to test a possible mislabeling between CTC1 and CTC2 samples (T1 with CTC2, and T2 with CTC1). Significant p-values at $α = 0.05$ after correcting for multiple comparisons using the False Discovery Rate method (Benjamini and Hochberg, 1995) are marked with an asterisk.

*Significant support.

These results do not provide a conclusive answer on which assumption should be preferred. Assuming all zeros to be biological zeros will bias the model, as many of them might be technical zeros instead. At the same time, the pattern of expression and nonexpression seems to carry information. This information is lost when all zeros are assumed to be technical zeros and thus unknown data. For our datasets, treating zeros as technical zeros, and thus as unknown data, seems to create better agreement in the phylogenetic structure between the expression and SNV data and thus should be preferred. However, our datasets also suffered from unequal data quality (Table 1), and under different conditions, assuming zeros to be biological zeros might be preferred.

3.2. Intestinal neuroendocrine cancer

We derived two subsets from the expression and SNV data for the INC dataset from Rao et al (2020a): a subset with all cell types and a subset with cancer cells only. To do this, cells were labeled according to their sample of origin (primary tumor and metastasis) and their cell type, which was determined by replicating the analysis from Rao et al (2020a). For each subset, the 1000 cells with the least missing data were selected, 500 from the primary tumor and 500 from the metastatic sample.

However, not all cells found in the expression subsets were found in the SNV data. This is likely due to a different version of the Cellranger software being used in this work than in Rao et al (2020a). In both subsets derived from the expression data, metastatic cells showed a strong clustering tendency ( $p = 0.001$ ) into several large clades (Fig. 4; Table 4). This suggests a strong phylogenetic relationship with several well-preserved lineages.

FIG. 4.

Maximum likelihood trees constructed from the expression and SNV data published by Rao et al (2020a). Terminal branches are colored according to each cell's type or sample of origin. In the tree reconstructed from expression data for all cells (a), the vast majority of cancer cells cluster in a single clade. The tree reconstructed from expression data for cancer cells only (b) shows a strong clustering of primary and metastatic cells. While the metastatic cells are not clustered in a single clade, multiple metastatic events are biologically plausible. In the trees reconstructed from the SNV data (c, d), primary and metastatic cells, as well as cells of different types, are relatively evenly distributed without any apparent clustering.

Table 4.

Test of Phylogenetic Clustering on the Maximum Likelihood Trees from Rao et al (2020a)

Data	Groups	Cells (cancer only)	MPD (cancer only)	MNTD (cancer only)	Cells (all cell types)	MPD (all cell types)	MNTD (all cell types)
Expression	Cancer cells	1000	n/a	n/a	355	*0.001	*0.001
Expression	Fibroblasts	0	n/a	n/a	552	1.000	0.989
Expression	Endothelial cells	0	n/a	n/a	71	0.860	0.903
Expression	Immune cells	0	n/a	n/a	22	0.651	0.565
Expression	Metastasis	500	*0.001	*0.001	500	*0.001	*0.001
Expression	Primary	500	1.000	1.000	500	1.000	1.000
SNV	Cancer cells	981	n/a	n/a	355	0.362	*0.004
SNV	Fibroblasts	0	n/a	n/a	552	0.402	0.997
SNV	Endothelial cells	0	n/a	n/a	71	0.596	0.785
SNV	Immune cells	0	n/a	n/a	22	0.949	0.968
SNV	Metastasis	500	0.907	0.132	500	1.000	*0.001
SNV	Primary	481	*0.004	0.076	500	*0.001	0.095

MPD and MNTD calculated for the phylogeny reconstructed from the dataset containing only cancer cells and from the dataset containing all cell types. p-Values for MPD and MNTD were calculated for the sample of origin and cell types where applicable. Significant p-values at $α = 0.05$ after correcting for multiple comparisons using the False Discovery Rate method (Benjamini and Hochberg, 1995) are marked with an asterisk.

*Significant support.

In addition, in the derived subset containing all cell types, the cancer cells showed significant clustering ( $p = 0.001$ ), while other cell types showed the opposite tendency (Table 4). However, the cancer clade contained deeply nested clades of endothelial cells and immune cells. A similar, although much weaker, pattern of cancer cell clustering can be observed on the trees derived from the SNV data (Table 4). In both subsets derived from the SNV data, the primary cells clustered together, but the pattern was less consistent and was confirmed by only one of the two tested statistics.

3.3. Gastric cancer

For both the expression and SNV data from the GC dataset published by Wang et al (2021), only a single patient showed significant clustering of lymph nodes (Fig. 5; Table 5). Poor separation of primary and lymph node cells based on expression levels was pointed out in the original study (Wang et al, 2021). In addition, non-UMI-based methods suffer from an increased error rate through zero-count inflation (Cao et al, 2021) and amplification variability (Townes and Irizarry, 2020). In the absence of a strong phylogenetic signal shared by a large percentage of genes, this additional noise makes phylogenetic reconstruction difficult, if not impossible. At the same time, the typically higher coverage of non-UMI-based sequencing compared with UMI-based sequencing should improve the identification of SNVs and decrease the misspecification error. This might suggest that different strategies for phylogenetic reconstruction should be applied to UMI- and non-UMI-based sequencing.

FIG. 5.

Maximum likelihood trees for patient GC2 constructed from the (a) expression and (b) SNV data published by Wang et al (2021). Terminal branches are colored according to each cell's sample of origin. Only patient GC2 shows a significant clustering signal on the trees from both the expression and SNV data. For all trees, see Supplementary Figures S3 and S4.

Table 5.

Test of Phylogenetic Clustering on the Maximum Likelihood and Bayesian Trees Calculated from Expression and Single Nucleotide Variant Data Published by Wang et al (2021)

Data	Type	Groups	GC1 Cells	GC1 MPD	GC1 MNTD	GC2 Cells	GC2 MPD	GC2 MNTD	GC3 Cells	GC3 MPD	GC3 MNTD
Expression	ML	Primary tumor	19	0.171	0.123	27	*0.001	*0.001	19	0.829	0.854
Expression	ML	Lymph node	4	0.830	0.869	13	1.000	1.000	12	0.138	0.093
Expression	BI	Primary tumor	19	0.776	0.995	27	0.999	0.953	19	0.218	0.232
Expression	BI	Lymph node	4	0.086	0.035	13	*0.002	*0.001	12	0.205	0.333
SNV	ML	Primary tumor	19	0.102	0.111	27	*0.001	*0.001	19	0.552	0.231
SNV	ML	Lymph node	4	0.507	0.092	13	1.000	0.945	12	0.276	0.208
SNV	BI	Primary tumor	19	0.117	0.025	27	*0.001	*0.014	19	0.531	0.372
SNV	BI	Lymph node	4	0.955	0.935	13	1.000	0.960	12	0.487	0.322

MPD and MNTD calculated for the ML and BI trees reconstructed from the expression and the SNV data for patients GC1, GC2, and GC3. p-Values for MPD and MNTD were calculated for the sample of origin. Significant p-values at $α = 0.05$ after correcting for multiple comparisons using the False Discovery Rate method (Benjamini and Hochberg, 1995) are marked with an asterisk.

*Significant support.

4. DISCUSSION

Phylogenetic methods using scDNA-seq data are becoming increasingly common in tumor evolution studies. scRNA-seq is currently used for studying expression profiles of cancer cells and their behavior. However, while clustering approaches to identify cells with similar expression profiles are common and frequently used, scRNA-seq data are yet to be used in phylogenetic analyses to reconstruct the population history of somatic cells.

To test whether scRNA-seq data contain a phylogenetic signal sufficient to reliably reconstruct the population history of cancer, we performed an experiment to produce a known history by injecting immunosuppressed mice with human cancer cells derived from the same population. Then, using two different forms of scRNA-seq data, expression values and SNVs, we reconstructed phylogenies using maximum likelihood and Bayesian phylogenetic methods. By comparing the reconstructed trees to the known population history, we confirmed that scRNA-seq data contain a phylogenetic signal that can be used to reconstruct the population history of cancer, with both the expression values and SNVs producing a similar phylogenetic pattern.

However, this signal is burdened by uncertainty in both the source data and the reconstructed phylogeny. Accurate phylogenies might thus need an explicit error model to account for this increased uncertainty (Hicks et al, 2018). Still, by taking this topological uncertainty into account, we can draw conclusions about the structural relationships of individual cells. This highlights that scRNA-seq can be utilized to explore both the physiological behavior of cancer cells and their population history using a single source of data.

While the expression phylogeny can be obtained at virtually no added cost, the choice of normalization method, batch-effect correction, discretization, or the choice between biological and technical zeros can greatly influence the phylogeny. The phylogenies reconstructed from SNVs are not affected by these decisions, and SNV-calling pipelines will likely differ only in the number of false negatives and false positives.

Without any specialized phylogenetic or error model for the scRNA-seq data, conventional methods and software tools developed for systematic biology are able to reconstruct population history from these data, potentially at low computational cost. This implies that more accurate inference will be possible when and if specialized models and software are developed, and serious computational resources are employed. For example, computationally more intensive standard nonparametric bootstrap or Bayesian methods on the unfiltered data sets are certainly within the reach of modern computing clusters. This is a future direction for research.

In this work, we tested for phylogenetic signal on three data sets: a new data set consisting of five tumor samples seeded using a population sample, and two previously published data sets, each consisting of a primary tumor with a paired lymph node or metastatic sample. The nature of the experiment and the amount of uncertainty in scRNA-seq data barred us from a more detailed exploration of the tree topology, as only broad patterns, namely the phylogenetic clustering of cells according to sample and individual of origin, could be considered. Our clustering analyses show that the phylogenetic trees conform broadly to the expected shapes under different experimental conditions, and thus expression and SNV data can both be used to infer phylogenetic trees from scRNA-seq. Nevertheless, our results also demonstrate that all such trees contain significant uncertainty, so new datasets and methods will be required to extend this work.

The degree to which low and uneven gene expression plays a role in scRNA-seq requires special attention, especially for non-UMI based data sets, as this not only causes a large proportion of missing data but also burdens the known values with a significant error rate. Research should aim at trying to quantify this expression-specific error rate and build specialized models to include the uncertainty about observed data in the phylogenetic reconstruction itself. This could potentially include removing a large proportion of low-coverage data in favor of robust analysis and proper uncertainty estimation of the inferred topology.

The estimation of the topological uncertainty, be it the bootstrap branch support or the Bayesian posterior clade probabilities, is a staple of phylogenetic analyses. Currently existing methods for the phylogenetic analysis of scDNA-seq, such as SCITE (Jahn et al, 2016), SiFit (Zafar et al, 2017), or SCIΦ (Singer et al, 2018), do not provide this uncertainty estimate. This makes interpretation of the estimated topology difficult, because a single topology may be only marginally better supported than a number of alternative topologies.

Of the packages we are aware of, only CellPhy, through its integration in the phylogenetic software RAxML-NG (Kozlov et al, 2019), provides an estimate of topological uncertainty through the bootstrap method. Bayesian methods could be a solution, as they provide an uncertainty estimate through the posterior distribution. However, they are significantly more computationally intensive than maximum likelihood methods. Instead, as the size of single-cell data sets will only increase, bootstrap approximations optimized for a large amount of missing data need to be developed to provide a fast and accurate estimate of topological uncertainty.
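The idea behind branch support and posterior clade probabilities can be shown with a short sketch that counts how often each clade of a reference tree reappears in a sample of trees. Rooted trees are represented here as nested tuples of tip names. This representation and the function names are assumptions made for illustration and do not correspond to the interface of any of the packages named above.

```python
def clades(tree):
    """Return (tips, clades) for a rooted tree given as nested tuples,
    where each clade is the frozenset of tips below an internal node."""
    if isinstance(tree, str):
        return frozenset([tree]), set()
    tips = frozenset()
    found = set()
    for child in tree:
        child_tips, child_clades = clades(child)
        tips |= child_tips
        found |= child_clades
    found.add(tips)
    return tips, found

def clade_support(reference, sample):
    """Share of sampled trees that contain each clade of the reference tree."""
    _, reference_clades = clades(reference)
    counts = {clade: 0 for clade in reference_clades}
    for tree in sample:
        _, tree_clades = clades(tree)
        for clade in reference_clades & tree_clades:
            counts[clade] += 1
    return {clade: n / len(sample) for clade, n in counts.items()}

# Hypothetical four-cell example with four sampled trees.
reference = (("A", "B"), ("C", "D"))
sample = [
    (("A", "B"), ("C", "D")),
    (("A", "B"), ("C", "D")),
    (("A", "C"), ("B", "D")),
    ((("A", "B"), "C"), "D"),
]
support = clade_support(reference, sample)
```

Here the clade of A and B is recovered in three of four trees and the clade of C and D in two of four, while the root clade containing all tips is trivially present in every tree. The same counting applies whether the sample consists of bootstrap replicates or of trees drawn from a posterior distribution.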

An aspect of scRNA-seq expression data that was not considered in this study is correlated gene expression (Bageritz et al, 2019; Wang et al, 2004). A single somatic mutation could thus induce a change in the expression of multiple genes. This might be problematic, given that phylogenetic methods assume that individual sites are independent, and it would cause an overestimation of the mutation rate. However, phylogenetic methods are generally rather robust to a wide range of model violations (Huelsenbeck, 1995a,b; Song et al, 2010; Philippe et al, 2011). In addition, by randomly sampling sites, the bootstrap analysis does explore solutions that would arise from this model violation. An investigation of the effect of correlated gene expression on the estimated phylogeny provides an interesting direction for further research.
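The site resampling referred to above can be sketched as follows. The alignment is assumed to be a dictionary mapping cell names to equal-length strings of character states. The names and the toy alignment are hypothetical, and in practice this step is performed internally by the phylogenetic software.

```python
import random

def bootstrap_alignment(alignment, rng):
    """Create one bootstrap replicate by resampling alignment columns (sites)
    with replacement, keeping the number of sites unchanged."""
    n_sites = len(next(iter(alignment.values())))
    columns = [rng.randrange(n_sites) for _ in range(n_sites)]
    # The same column indices are applied to every cell, so each resampled
    # site keeps its states together across all cells.
    return {cell: "".join(sequence[i] for i in columns)
            for cell, sequence in alignment.items()}

# Hypothetical alignment of three cells and five sites.
alignment = {"c1": "12345", "c2": "54321", "c3": "1?3?5"}
replicate = bootstrap_alignment(alignment, random.Random(42))
```

Because whole columns are drawn at random, a replicate may contain several, one, or none of a set of correlated genes, so the collection of replicates spans different weightings of such genes.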

Multiomic approaches are increasingly popular as they integrate information from multiple biological layers (Bock et al, 2016; Hasin et al, 2017; Nam et al, 2020). While CNVs were ignored in this article, it is possible to detect large-scale CNVs from scRNA-seq data (Gao et al, 2021; Harmanci et al, 2020a,b; Kuipers et al, 2020; Müller et al, 2018). Combined with the SNVs and expression data as analyzed in this article, this enables a multiomic approach using just a single scRNA-seq data source, without the additional cost of DNA sequencing.

ACKNOWLEDGMENTS

We thank Dr. Jon Preall and the Genomics Technology Development Core (CSHL) for scRNA-seq library preparation, and Pamela Moody and the Flow Cytometry Facility (CSHL) for support with single-cell sorting. We acknowledge Suzanne Russo for technical assistance with animal experiments.

AUTHORS' CONTRIBUTIONS

J.C.M.: conceptualization, data curation, formal analysis, methodology, software, visualization, writing – original draft, and writing – review and editing. R.L.: methodology and writing – review and editing. D.L.S.: investigation, resources, and writing – review and editing. S.D.D.: conceptualization, data curation, funding acquisition, investigation, methodology, resources, writing – original draft, and writing – review and editing. A.G.: conceptualization, funding acquisition, methodology, project administration, resources, supervision, writing – original draft, and writing – review and editing.

AUTHOR DISCLOSURE STATEMENT

The authors declare they have no conflicting financial interests.

FUNDING INFORMATION

A.G. and J.C.M. acknowledge support from the Royal Society Te Apārangi through a Rutherford Discovery Fellowship (RDF-UOC1702). A.G., J.C.M., R.L., and S.D.D. acknowledge support from an Endeavour Smart Ideas grant (UOOX1912). A.G. acknowledges support from a Data Science Programs grant (UOAX1932). S.D.D. acknowledges support from a Rutherford Discovery Fellowship (RDF-UOO1802) and the NIH/NCI (National Institutes of Health/National Cancer Institute) grant (1K99CA215362-01). D.L.S. acknowledges support from the NCI grant (5P01CA013106-Project 3). We would also like to acknowledge the CSHL Next-Gen Sequencing Core (NCI-2P30CA45508).

References

1. Abbosh, Birkbak, Wilson, et al. Phylogenetic ctDNA analysis depicts early-stage lung cancer evolution. Nature, 2017; 545(7655):446–451; doi: 10.1038/nature22364
2. Aguirre-Ghiso JA. On the theory of tumor self-seeding: Implications for metastasis progression in humans. Breast Cancer Res, 2010; 12(2):304; doi: 10.1186/bcr2561
3. Aldous. Probability distributions on cladograms. In: Random Discrete Structures. Springer: New York, NY, USA; 1996, pp. 1–18; doi: 10.1007/978-1-4612-0719-1_1
4. Aldous DJ. Stochastic models and descriptive statistics for phylogenetic trees, from Yule to today. Stat Sci, 2001; 16(1):23–34; doi: 10.1214/ss/998929474
5. Alves, Prado-López, Cameselle-Teijeiro, et al. Rapid evolution and biogeographic spread in a colorectal cancer. Nat Commun, 2019; 10(1):5139; doi: 10.1038/s41467-019-12926-8
6. Alves, Prieto, Posada. Multiregional tumor trees are not phylogenies. Trends Cancer Res, 2017; 3(8):546–550; doi: 10.1016/j.trecan.2017.06.004
7. Andrews, Hemberg. Identifying cell populations with scRNASeq. Mol Aspects Med, 2018; 59:114–122; doi: 10.1016/j.mam.2017.07.002
8. Angelova, Mlecnik, Vasaturo, et al. Evolution of metastases in space and time under immune selection. Cell, 2018; 175(3):751.e16–765.e16; doi: 10.1016/j.cell.2018.09.018
9. Bageritz, Willnow, Valentini, et al. Gene expression atlas of a developing tissue by single cell expression correlation analysis. Nat Methods, 2019; 16(8):750–756; doi: 10.1038/s41592-019-0492-x
10. Beerenwinkel, Schwarz, Gerstung, et al. Cancer evolution: Mathematical models and computational inference. Syst Biol, 2015; 64(1):e1–25; doi: 10.1093/sysbio/syu081
11. Benjamin, Sato, Cibulskis, et al. Calling somatic SNVs and indels with Mutect2. bioRxiv, 2019; doi: 10.1101/861054
12. Benjamini, Hochberg. Controlling the false discovery rate: A practical and powerful approach to multiple testing. J R Stat Soc, 1995; doi: 10.1111/j.2517-6161.1995.tb02031.x
13. Blomberg, Garland T Jr, Ives. Testing for phylogenetic signal in comparative data: Behavioral traits are more labile. Evolution, 2003; 57(4):717–745; doi: 10.1111/j.0014-3820.2003.tb00285.x
14. Bock, Farlik, Sheffield. Multi-Omics of single cells: Strategies and applications. Trends Biotechnol, 2016; 34(8):605–608; doi: 10.1016/j.tibtech.2016.04.004
15. Bouckaert, Vaughan, Barido-Sottani, et al. BEAST 2.5: An advanced software platform for Bayesian evolutionary analysis. PLoS Comput Biol, 2019; 15(4):e1006650; doi: 10.1371/journal.pcbi.1006650
16. Cao, Kitanovski, Küppers, et al. UMI or not UMI, that is the question for scRNA-seq zero-inflation. Nat Biotechnol, 2021; 39(2):158–159; doi: 10.1038/s41587-019-0379-5
17. Caravagna, Giarratano, Ramazzotti, et al. Detecting repeated cancer evolution from multi-region tumor sequencing data. Nat Methods, 2018; 15(9):707–714; doi: 10.1038/s41592-018-0108-x
18. Caravagna, Heide, Williams, et al. Subclonal reconstruction of tumors by using machine learning and population genetics. Nat Genet, 2020; 52(9):898–907; doi: 10.1038/s41588-020-0675-5
19. Casasent, Schalck, Gao, et al. Multiclonal invasion in breast tumors identified by topographic single cell sequencing. Cell, 2018; 172(1-2):205–217.e12; doi: 10.1016/j.cell.2017.12.007
20. Chen, Ning, Shi. Single-Cell RNA-Seq technologies and related computational data analysis. Front Genet, 2019; 10:317; doi: 10.3389/fgene.2019.00317
21. Chen, Zhou, Wang, et al. Single-cell SNP analyses and interpretations based on RNA-Seq data for colon cancer research. Sci Rep, 2016; 6(1):34420; doi: 10.1038/srep34420
22. Cheung, Ewald. A collective route to metastasis: Seeding by tumor cell clusters. Science, 2016; 352(6282):167–169; doi: 10.1126/science.aaf6546
23. Collienne. Spaces of Phylogenetic Time Trees. PhD thesis, University of Otago; 2021. Available from: https://hdl.handle.net/10523/12606. Last accessed on October 5, 2022.
24. Collienne, Gavryushkin. Computing nearest neighbour interchange distances between ranked phylogenetic trees. J Math Biol, 2021; 82(1-2):8; doi: 10.1007/s00285-021-01567-5
25. Demeulemeester, Kumar, Møller, et al. Tracing the origin of disseminated tumor cells in breast cancer using single-cell sequencing. Genome Biol, 2016; 17(1):1–15; doi: 10.1186/s13059-016-1109-7
26. Deshwar, Vembu, Yung, et al. PhyloWGS: Reconstructing subclonal composition and evolution from whole-genome sequencing of tumors. Genome Biol, 2015; 16:35; doi: 10.1186/s13059-015-0602-8
27. Desper, Jiang, Kallioniemi, et al. Inferring tree models for oncogenesis from comparative genome hybridization data. J Comput Biol, 1999; 6(1):37–51; doi: 10.1089/cmb.1999.6.37
28. Detering, Tomás, Prieto, et al. Accuracy of somatic variant detection in multiregional tumor sequencing data; 2019. Available from: https://www.biorxiv.org/content/early/2019/05/31/655605. Last accessed on October 5, 2022.
29. Dobin, Davis, Schlesinger, et al. STAR: Ultrafast universal RNA-seq aligner. Bioinformatics, 2013; 29(1):15–21; doi: 10.1093/bioinformatics/bts635
30. El-Kebir, Satas, Raphael. Inferring parsimonious migration histories for metastatic cancers. Nat Genet, 2018; 50(5):718–726; doi: 10.1038/s41588-018-0106-z
31. Fan, Slowikowski, Zhang. Single-cell transcriptomics in cancer: Computational challenges and opportunities. Exp Mol Med, 2020; 52(9):1452–1465; doi: 10.1038/s12276-020-0422-0
32. Felsenstein. Confidence limits on phylogenies: An approach using the bootstrap. Evolution, 1985; 39(4):783–791; doi: 10.1111/j.1558-5646.1985.tb00420.x
33. Fleming, Marioni, Babadi. CellBender remove-background: A deep generative model for unsupervised removal of background noise from scRNA-seq datasets. bioRxiv, 2019; doi: 10.1101/791699
34. Freckleton RP. Fast likelihood calculations for comparative analyses. Methods Ecol Evol, 2012; 3(5):940–947; doi: 10.1111/j.2041-210X.2012.00220.x
35. Gao, Bai, Henderson, et al. Delineating copy number and clonal substructure in human tumors from single-cell transcriptomes. Nat Biotechnol, 2021; 39:1–10; doi: 10.1038/s41587-020-00795-2
36. Gavryushkin, Whidden, Matsen 4th. The combinatorics of discrete time-trees: Theory and open problems. J Math Biol, 2018; 76(5):1101–1121; doi: 10.1007/s00285-017-1167-9
37. Gawad, Koh, Quake. Single-cell genome sequencing: Current state of the science. Nat Rev Genet, 2016; 17(3):175–188; doi: 10.1038/nrg.2015.16
38. Gerlinger, Rowan, Horswell, et al. Intratumor heterogeneity and branched evolution revealed by multiregion sequencing. N Engl J Med, 2012; 366(10):883–892; doi: 10.1056/nejmoa1113205
39. González-Silva, Quevedo, Varela. Tumor functional heterogeneity unraveled by scRNA-seq technologies. Trends Cancer Res, 2020; 6(1):13–19; doi: 10.1016/j.trecan.2019.11.010
40. Grafen, Hamilton. The phylogenetic regression. Philos Trans R Soc Lond B Biol Sci, 1989; 326(1233):119–157; doi: 10.1098/rstb.1989.0106
41. Harmanci, Harmanci, Zhou. CaSpER identifies and visualizes CNV events by integrative analysis of single-cell or bulk RNA-sequencing data. Nat Commun, 2020a; 11(1):1–16; doi: 10.1038/s41467-019-13779-x
42. Harmanci, Harmanci, Zhou. Inference of clonal copy number alterations from RNA-sequencing data. J Cancer Immunol, 2020b; 2(3):66–68.
43. Hasin, Seldin, Lusis. Multi-omics approaches to disease. Genome Biol, 2017; 18(1):83; doi: 10.1186/s13059-017-1215-1
44. Heled, Drummond. Bayesian inference of population size history from multiple loci. BMC Evol Biol, 2008; 8:289; doi: 10.1186/1471-2148-8-289
45. Hicks, Townes, Teng, et al. Missing data and technical variability in single-cell RNA-sequencing experiments. Biostatistics, 2018; 19(4):562–578; doi: 10.1093/biostatistics/kxx053
46. Hsiao, Tung, Blischak, et al. Characterizing and inferring quantitative cell cycle phase in single-cell RNA-seq data analysis. Genome Res, 2020; 30(4):611–621; doi: 10.1101/gr.247759.118
47. Hudson RR. Gene genealogies and the coalescent process. Oxford Surv Evol Biol, 1990; 7(1):44.
48. Huelsenbeck JP. The robustness of two phylogenetic methods: Four-taxon simulations reveal a slight superiority of maximum likelihood over neighbor joining. Mol Biol Evol, 1995a; 12(5):843–849; doi: 10.1093/oxfordjournals.molbev.a040261
49. Huelsenbeck JP. Performance of phylogenetic methods in simulation. Syst Biol, 1995b; doi: 10.1093/sysbio/44.1.17
50. Jahn, Kuipers, Beerenwinkel. Tree inference for single-cell data. Genome Biol, 2016; 17:86; doi: 10.1186/s13059-016-0936-x
51. Jerby-Arnon, Shah, Cuoco, et al. A cancer cell program promotes T cell exclusion and resistance to checkpoint blockade. Cell, 2018; 175(4):984.e24–997.e24; doi: 10.1016/j.cell.2018.09.006
52. Jiang, Chen S-Y, Wang, et al. Should genes with missing data be excluded from phylogenetic analyses? Mol Phylogenet Evol, 2014; 80:308–318; doi: 10.1016/j.ympev.2014.08.006
53. Kembel, Cowan, Helmus, et al. Picante: R tools for integrating phylogenies and ecology. Bioinformatics, 2010; 26:1463–1464; doi: 10.1093/bioinformatics/btq166
54. Kidwell, Casalini, Pradeep, et al. Laterally transferred macrophage mitochondria act as a signaling source promoting cancer cell proliferation. bioRxiv, 2021; doi: 10.1101/2021.08.10.455713
55. Kingman JFC. The coalescent. Stochastic Process Appl, 1982; 13(3):235–248; doi: 10.1016/0304-4149(82)90011-4
56. Komarova NL. Spatial stochastic models for cancer initiation and progression. Bull Math Biol, 2006; 68(7):1573–1599; doi: 10.1007/s11538-005-9046-8
57. Kozlov, Alves, Stamatakis, et al. CellPhy: Accurate and fast probabilistic inference of single-cell phylogenies from scDNA-seq data. Genome Biol, 2022; 23(1):37; doi: 10.1186/s13059-021-02583-w
58. Kozlov, Darriba, Flouri, et al. RAxML-NG: A fast, scalable and user-friendly tool for maximum likelihood phylogenetic inference. Bioinformatics, 2019; 35(21):4453–4455; doi: 10.1093/bioinformatics/btz305
59. Kuhner, Yamato, Felsenstein. Maximum likelihood estimation of population growth rates based on the coalescent. Genetics, 1998; 149(1):429–434; doi: 10.1093/genetics/149.1.429
60. Kuipers, Tuncel, Ferreira, et al. Single-cell copy number calling and event history reconstruction; 2020. Available from: https://www.biorxiv.org/content/early/2020/04/30/2020.04.28.065755. Last accessed on October 5, 2022.
61. Lähnemann, Köster, Szczurek, et al. Eleven grand challenges in single-cell data science. Genome Biol, 2020; 21(1):1; doi: 10.1186/s13059-020-1926-6
62. Lemey, Rambaut, Drummond, et al. Bayesian phylogeography finds its roots. PLoS Comput Biol, 2009; 5(9):e1000520; doi: 10.1371/journal.pcbi.1000520
63. Lemey, Rambaut, Welch, et al. Phylogeography takes a relaxed random walk in continuous space and time. Mol Biol Evol, 2010; 27(8):1877–1885; doi: 10.1093/molbev/msq067
64. Leung, Davis, Gao, et al. Single-cell DNA sequencing reveals a late-dissemination model in metastatic colorectal cancer. Genome Res, 2017; 27(8):1287–1299; doi: 10.1101/gr.209973.116

65.

, Handsaker

, Wysoker

, et al. The sequence alignment/map format and SAMtools. Bioinformatics, 2009; 25(16):2078–2079; doi: 10.1093/bioinformatics/btp352

66.

Ling

, Hu

, Yang

, et al. Extremely high genetic diversity in a single tumor points to prevalence of non-darwinian cell evolution. Proc Natl Acad Sci USA, 2015; 112(47):E6496–E6505; doi: 10.1073/pnas.1519556112

67.

Liu

, Zhang

, et al. Systematic comparative analysis of single-nucleotide variant detection methods from single-cell RNA sequencing data. Genome Biol, 2019; 20(1):242; doi: 10.1186/s13059-019-1863-4

68.

Lun

ATL

, Riesenfeld

, Andrews

, et al. EmptyDrops: Distinguishing cells from empty droplets in droplet-based single-cell RNA sequencing data. Genome Biol, 2019; 20(1):63; doi: 10.1186/s13059-019-1662-y

69.

Luquette

, Bohrson

, Sherman

, et al. Identification of somatic mutations in single cell DNA-seq using a spatial model of allelic imbalance. Nat Commun, 2019; 10(1):3908; doi: 10.1038/s41467-019-11857-8

70.

Malikic

, Jahn

, Kuipers

, et al. Integrative inference of subclonal tumour evolution from single-cell and bulk sequencing data. Nat Commun, 2019; 10(1):2750; doi: 10.1038/s41467-019-10737-5

71.

Mardis

, Wilson

. Cancer genome sequencing: A review. Hum Mol Genet, 2009;18(R2):R163-8; doi: 10.1093/hmg/ddp396

72.

Minh

, Schmidt

, Chernomor

, et al. IQ-TREE 2: New models and efficient methods for phylogenetic inference in the genomic era. Mol Biol Evol, 2020; 37(5):1530–1534; doi: 10.1093/molbev/msaa015

73.

Minn

, Gupta

, Siegel

, et al. Genes that mediate breast cancer metastasis to lung. Nature, 2005; 436(7050):518–524; doi: 10.1038/nature03799

74.

Miura

, Gomez

, Murillo

, et al. Predicting clone genotypes from tumor bulk sequencing of multiple samples. Bioinformatics, 2018; 34(23):4017–4026; doi: 10.1093/bioinformatics/bty469

75.

Miura

, Vu

, Deng

, et al. Power and pitfalls of computational methods for inferring clone phylogenies and mutation orders from bulk sequencing data. Sci Rep, 2020; 10(1):3498; doi: 10.1038/s41598-020-59006-2

76.

Müller

, Rasmussen

, Stadler

The structured coalescent and its approximations. Mol Biol Evol, 2017a;34(11):2970–2981; doi: 10.1093/molbev/msx186

77.

Müller

, Cho

, Liu

, et al. CONICS integrates scRNA-seq with DNA sequencing to map gene expression to tumor sub-clones. Bioinformatics, 2018; 34(18):3217–3219; doi: 10.1093/bioinformatics/bty316

78.

Müller

, Kohanbash

, Liu

, et al. Single-cell profiling of human gliomas reveals macrophage ontogeny as a basis for regional differences in macrophage activation in the tumor microenvironment. Genome Biol, 2017b;18(1):234; doi: 10.1186/s13059-017-1362-4

79.

Münkemüller

, Lavergne

, Bzeznik

, et al. How to measure and test phylogenetic signal. Methods Ecol Evol, 2012; 3(4):743–756; doi: 10.1111/j.2041-210X.2012.00196.x

80.

Myers

, Satas

, Raphael

. CALDER: Inferring phylogenetic trees from longitudinal tumor samples. Cell Syst, 2019; 8(6):514–522.e5; doi: 10.1016/j.cels.2019.05.010

81.

Nakagawa

, Fujita

Whole genome sequencing analysis for cancer genomics and precision medicine. Cancer Sci, 2018; 109(3):513–522; doi: 10.1111/cas.13505

82.

Nam

, Chaligne

, Landau

. Integrating genetic and non-genetic determinants of cancer evolution by single-cell multi-omics. Nat Rev Genet, 2020; doi: 10.1038/s41576-020-0265-5

83.

Navin

, Kendall

, Troge

, et al. Tumour evolution inferred by single-cell sequencing. Nature, 2011; 472(7341):90–94; doi: 10.1038/nature09807

84.

Olsen

, Baryawno

Introduction to single-cell rna sequencing. Curr Protocols Mol Biol, 2018:122(1):e57; doi: 10.1002/cpmb.57

85.

Pagel

Inferring the historical patterns of biological evolution. Nature, 1999; 401(6756):877–884; doi: 10.1038/44766

86.

Pagel

, Meade

, Barker

Bayesian estimation of ancestral character states on phylogenies. Syst Biol, 2004; 53(5):673–684; doi: 10.1080/10635150490522232

87.

Petrackova

, Vasinek

, Sedlarikova

, et al. Standardization of sequencing coverage depth in NGS: Recommendation for detection of clonal and subclonal mutations in cancer diagnostics. Front Oncol, 2019;9:851; doi: 10.3389/fonc.2019.00851

88.

Philippe

, Brinkmann

, Lavrov

, et al. Resolving difficult phylogenetic questions: Why more sequences are not enough. PLoS Biol, 2011; 9(3):e1000602; doi: 10.1371/journal.pbio.1000602

89.

Poirion

, Zhu

, Ching

, et al. Using single nucleotide variations in single-cell RNA-seq to identify subpopulations and genotype-phenotype linkage. Nat Commun, 2018; 9(1):4892; doi: 10.1038/s41467-018-07170-5

90.

Poplin

, Ruano-Rubio

, DePristo

, et al. Scaling accurate genetic variant discovery to tens of thousands of samples. bioRxiv, 2018; doi: 10.1101/201178

91.

Posada

Cancer molecular evolution. J Mol Evol, 2015;81(3–4):81–83; doi: 10.1007%2Fs00239-015-9695-7

92.

Posada

CellCoal: Coalescent simulation of Single-Cell sequencing samples. Mol Biol Evol, 2020; 37(5):1535–1542; doi: 10.1093/molbev/msaa025

93.

Potter

, Ermini

, Papaemmanuil

, et al. Single-cell mutational profiling and clonal phylogeny in cancer. Genome Res, 2013; 23(12):2115–2125; doi: 10.1101/gr.159913.113

94.

Racle

, de Jonge

, Baumgaertner

, et al. Simultaneous enumeration of cancer and immune cell types from bulk tumor gene expression data. eLife, 2017;6; doi: 10.7554/elife.26476

95.

Rao

, Oh

, Moffitt

, et al. Comparative single-cell RNA sequencing (scRNA-seq) reveals liver metastasis-specific targets in a patient with small intestinal neuroendocrine cancer. Cold Spring Harb Mol Case Stud, 2020a;6(2); doi: 10.1101/mcs.a004978

96.

Rao

, Somarelli

, Altunel

, et al. From the clinic to the bench and back again in one dog year: How a Cross-Species pipeline to identify new treatments for sarcoma illuminates the path forward in precision medicine. Front Oncol, 2020b;10:117; doi: 10.3389/fonc.2020.00117

97.

Reiter

, Makohon-Moore

, Gerold

, et al. Reconstructing metastatic seeding patterns of human cancers. Nat Commun, 2017;8:14114; doi: 10.1038/ncomms14114

98.

Roth

, McPherson

, Laks

, et al. Clonal genotype and population structure inference from single-cell tumor sequencing. Nat Methods, 2016; 13(7):573–576; doi: 10.1038/nmeth.3867

99.

Satas

, Raphael

. Haplotype phasing in single-cell DNA-sequencing data. Bioinformatics, 2018; 34(13):i211–i217; doi: 10.1093/bioinformatics/bty286

100.

Schnepp

, Chen

, Keller

, et al. SNV identification from single-cell RNA sequencing data. Hum Mol Genet, 2019; 28(21):3569–3583; doi: 10.1093/hmg/ddz207

101.

Schwartz

, Schäffer

. The evolution of tumour phylogenetics: Principles and practice. Nat Rev Genet, 2017; 18(4):213–229; doi: 10.1038/nrg.2016.170

102.

Schwarz

, Ng

CKY

, Cooke

, et al. Spatial and temporal heterogeneity in high-grade serous ovarian cancer: A phylogenetic analysis. PLoS Med, 2015; 12(2):e1001789; doi: 10.1371%2Fjournal.pmed.1001789

103.

Singer

, Kuipers

, Jahn

, et al. Single-cell mutation identification via phylogenetic inference. Nat Commun, 2018; 9(1):5144; doi: 10.1038/s41467-018-07627-7

104.

Smith

MR.

Information theoretic generalized robinson-foulds metrics for comparing phylogenetic trees. Bioinformatics, 2020a;36(20):5007–5013; doi: 10.1093/bioinformatics/btaa614

105.

Smith

MR.

TreeDist: Distances between phylogenetic trees. R package version 2.4.0.9001, 2020b. Last accessed on October 5, 2022.

106.

Song

, Sheffield

, Cameron

, et al. When phylogenetic assumptions are violated: Base compositional heterogeneity and among-site rate variation in beetle mitochondrial phylogenomics. Syst Entomol, 2010; 35(3):429–448; doi: 10.1111/j.1365-3113.2009.00517.x

107.

Strom

SP.

Current practices and guidelines for clinical next-generation sequencing oncology testing. Cancer Biol Med, 2016; 13(1):3–11; doi: 10.28092/j.issn.2095-3941.2016.0004

108.

Stuart

, Butler

, Hoffman

, et al. Comprehensive integration of single-cell data. Cell, 2019;177:1888–1902; doi: 10.1016/j.cell.2019.05.031

109.

Tarabichi

, Salcedo

, Deshwar

, et al. A practical guide to cancer subclonal reconstruction from DNA sequencing. Nat Methods, 2021; 18(2):144–155; doi: 10.1038/s41592-020-01013-2

110.

Tavaré

Some probabilistic and statistical problems in the analysis of DNA sequences. Lect Math Life Sci, 1986; 17(2):57–86.

111.

Townes

, Irizarry

. Quantile normalization of single-cell RNA-seq read counts without unique molecular identifiers. Genome Biol, 2020; 21(1):1–17; doi: 10.1186/s13059-020-02078-0

112.

Van den Berge

, Perraudeau

, Soneson

, et al. Observation weights unlock bulk RNA-seq tools for zero inflation and single-cell applications. Genome Biol, 2018; 19(1):24; doi: 10.1186/s13059-018-1406-4.

113.

Wang

, Zhang

, Qing

, et al. Comprehensive analysis of metastatic gastric cancer tumour cells using single-cell RNA-seq. Sci Rep, 2021; 11(1):1–10; doi: 10.1038/s41598-020-80881-2

114.

Wang

, Azuaje

, Bodenreider

, et al. Gene expression correlation and gene Ontology-Based similarity: An assessment of quantitative relationships. Proc. IEEE Symp. Comput. Intell. Bioinforma. Comput Biol, 2004;2004:25–31; doi: 10.1109/cibcb.2004.1393927

115.

Webb

CO.

Exploring the phylogenetic structure of ecological communities: An example for rain forest trees. Am Nat, 2000; 156(2):145–155; doi: 10.1086/303378

116.

Werner

, Case

, Williams

, et al. Measuring single cell divisions in human tissues from multi-region sequencing data. Nat Commun, 2020; 11(1):1035; doi: 10.1038/s41467-020-14844-6

117.

Wiens

JJ.

Missing data and the design of phylogenetic analyses. J Biomed Inform, 2006; 39(1):34–42; doi: 10.1016/j.jbi.2005.04.001

118.

, Liu

, Davis

. The impact of missing data on species tree estimation. Mol Biol Evol, 2016; 33(3):838–860; doi: 10.1093/molbev/msv266

119.

, Chen

, Conejo-Garcia

, et al. Estimation of immune cell content in tumor using single-cell RNA-seq reference data. BMC Cancer, 2019; 19(1):715; doi: 10.1186/s12885-019-5927-3

120.

Yuan

, Sakoparnig

, Markowetz

, et al. BitPhylogeny: A probabilistic framework for reconstructing intra-tumor phylogenies. Genome Biol, 2015; 16(1):36; doi: 10.1186/s13059-015-0592-6

121.

Zafar

, Navin

, Chen

, et al. Siclonefit: Bayesian inference of population structure, genotype, and phylogeny of tumor clones from single-cell genome sequencing data. Genome Res, 2019; 29(11):1847–1859; doi: 10.1101/gr.243121.118

122.

Zafar

, Tzen

, Navin

, et al. SiFit: Inferring tumor trees from single-cell sequencing data under finite-sites models. Genome Biol, 2017; 18(1):178; doi: 10.1186/s13059-017-1311-2

123.

Zafar

, Wang

, Nakhleh

, et al. Monovar: Single-nucleotide variant detection in single cells. Nat Methods, 2016; 13(6):505–507; doi: 10.1038/nmeth.3835

124.

Zhai

, Lim

TK-H

, Zhang

, et al. The spatial organization of intra-tumour heterogeneity and evolutionary trajectories of metastases in hepatocellular carcinoma. Nat Commun, 2017;8:4565; doi: 10.1038/ncomms14565

125.

Zhao

Z-M

, Zhao

, Bai

, et al. Early and multiple origins of metastatic lineages within primary tumors. Proc Natl Acad Sci USA, 2016; 113(8):2140–2145; doi: 10.1073/pnas.1525677113
