Branching-Process Modeling of Homology Distribution in Salmonid Genomes

Abstract

Comparative analysis of sequence similarity distributions reveals evolutionary mechanisms shaping gene families. In Salmonidae, whole-genome duplication (WGD) and rapid speciation make it difficult to model retained homologs and sequence divergence. We introduce a stochastic branching-process framework that models sequence similarity decay over evolutionary time and quantifies fractionation rates across successive duplication events. We derive moment-generating functions of pairwise similarity scores and carry out simulation-based validation. Applying our model to multiple salmonid genomes (Atlantic salmon, rainbow trout, Chinook salmon, …), we recapitulate the observed bimodal similarity distributions and quantify gene retention across evolutionary branches. The estimated fractionation rates for both WGDs ($μ_{1}, μ_{2} \approx 0.0009$–0.0013 per Myr) remain highly consistent across species and insensitive to synteny block size, supporting a conserved post-WGD gene loss dynamic. In contrast, lineage-specific differences in duplicate retention arise primarily from the temporal gap between duplication events rather than from differences in instantaneous loss rates. These findings underscore the stability of fractionation dynamics and the critical role of structural genome decay in shaping retention patterns in salmonids and in sucker fish.

Keywords

branching process; gene retention dynamics; Salmonidae; sequence similarity distribution; synteny loss; whole-genome duplication

1. INTRODUCTION

Whole-genome duplications (WGDs) have been pivotal in shaping the evolutionary trajectory of teleost fish, providing raw material for functional diversification. Two early rounds of duplication in the vertebrate ancestor (1R and 2R) laid the foundation for vertebrate complexity, followed by the teleost-specific duplication (Ts3R) approximately 300 million years ago (Mya) (Jaillon et al., 2004; Near et al., 2012; Glasauer and Neuhauss, 2014; Robertson et al., 2017). A subsequent event, the salmonid-specific autotetraploidization (Ss4R), occurred around 80 Mya, profoundly influencing genome architecture and gene function within salmonids (Macqueen and Johnston, 2014; Lien et al., 2016, 2021). Post-WGD processes in salmonids involve extensive fractionation and subfunctionalization, which have contributed to their ecological and physiological diversity. Beyond salmonids, lineage-specific WGDs such as Cat-4R in Chinese sucker (Myxocyprinus asiaticus, 60 Mya) illustrate the diversity of post-duplication trajectories, showing conserved synteny and relatively stable genome structures in contrast to the widespread rearrangements seen in other polyploid fish lineages (Kumar et al., 2017; Bagley et al., 2018; Lecaudey et al., 2018; Krabbenhoft et al., 2021).

Understanding the evolutionary consequences of WGD requires more than identifying duplicate genes. It involves characterizing the processes that govern their retention, divergence, and loss. Previous work, including Yu and Sankoff (2022), showed that fractionation during synteny evolution is not random but influenced by genomic context, dosage balance, and chromosomal rearrangements (Blanc and Wolfe, 2004; Zhang et al., 2021a).

In this study, we analyze coding-sequence similarity among paralogous gene pairs. Our framework applies a discrete-time branching process to model duplicate retention and loss while incorporating stochastic sequence divergence. Two processes dominate this dynamic: (i) WGD-driven duplication, which creates gene pairs, and (ii) fractionation, which eventually removes one gene from most pairs. Together, these processes produce a mixture of similarity distributions, where distinct peaks often reflect different duplication events.

While the branching-process approach has been successfully applied in plant comparative genomics, this study represents its first application to teleost fish. To overcome challenges such as overlapping signals and synteny erosion in older WGDs, we extend the original model with three innovations: (i) a moment-based probabilistic formulation using the moment-generating function (MGF) to derive closed-form expectations and variances; (ii) simulation-based validation with Gaussian mixtures to ensure robust cutoff-based event separation; and (iii) an integrated approach that combines Ks values with sequence similarity to improve resolution in cases where multiple duplication signals overlap.

2. BRANCHING PROCESS MODEL IN THE WGD CONTEXT

Gene proliferation and loss within genomes can be modeled as a population of genes evolving stochastically (see Eq. 2). Discrete-time branching processes provide a good model for WGDs and gene loss.

Consistent with previous research methodologies (Blanc and Wolfe, 2004; Glasauer and Neuhauss, 2014; Zhang et al., 2021a), the survival or deletion of each gene in subsequent generations is treated as a multinomial event, with the assumption that at least one gene remains. This framework allows estimation of $u_{j}^{(i)}$, the probability that exactly $j$ of the $r_{i}$ progeny produced by a gene at WGD time $t_{i}$ survive to time $t_{i + 1}$. At times $t_{1} < \dots < t_{n - 1}$ (where $n \geq 2$), each existing gene is replaced by $r_{1}, \dots, r_{n - 1}$ progeny, respectively, where each $r_{i} \geq 2$. In this context, $t_{n - 1}$ is the time of the final WGD event and $t_{n}$ is the time of observation. This means:

t_{1} \overset{u_{j}^{(1)}}{\to} t_{2} \overset{u_{j}^{(2)}}{\to} \dots \overset{u_{j}^{(i)}}{\to} t_{i + 1} \overset{u_{j}^{(i + 1)}}{\to} \dots \overset{u_{j}^{(n - 1)}}{\to} t_{n}

and

\sum_{j = 1}^{r_{i}} u_{j}^{(i)} = 1

A complete combinatorial formulation of evolutionary histories under this model, including the probability of specific configurations $P (r; a)$ , is presented in earlier work (Blanc and Wolfe, 2004; Zhang et al., 2021a), where

P (r; a) = \prod_{i = 1}^{n - 1} (\begin{matrix} m_{i} \\ a_{1}^{(i)}, \dots, a_{r_{i}}^{(i)} \end{matrix}) \prod_{j = 1}^{r_{i}} {(u_{j}^{(i)})}^{a_{j}^{(i)}},

with $r = (r_{1}, \dots, r_{n - 1})$ denoting the sequence of duplication events, $m_{i}$ the number of gene copies at step $i$, $a_{j}^{(i)}$ the number of copies assigned to trajectory $j$, and $u_{j}^{(i)}$ the survival probability of that trajectory. Here, we extend this framework to incorporate similarity-based observations and introduce probabilistic mixture decomposition, as discussed in Section 4. Although the inclusion of multinomial coefficients and high-degree polynomials may appear computationally demanding, the model remains practical because typical polyploid levels are small (usually 2 or 3), ensuring tractability.

The observations are made at time $t_{n}$ , reflecting the similarity measures (e.g., coding sequence similarity) among all gene pairs in the population, with the original $m_{1}$ genes considered unrelated. The model initiates with $m_{1} \geq 1$ genes at time $t_{1}$ .

Using expected values allows for straightforward calculation of useful expressions. This feature is pivotal for a comprehensive analysis of gene pair similarity distributions. The expected number of genes at time $t_{i + 1}$ , considering all possible retained copies, follows the recurrence relation:

E (m_{i + 1}) = E (m_{i}) (\sum_{j = 1}^{r_{i}} j \cdot u_{j}^{(i)}) .
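
As a worked special case of this recurrence (a sketch, assuming every event is a duplication with $r_{i} = 2$ and writing $u (i)$ as shorthand for the probability $u_{2}^{(i)}$ that both copies survive, so that $u_{1}^{(i)} = 1 - u (i)$), the mean number of surviving progeny per gene is $1 + u (i)$:

```latex
\sum_{j = 1}^{2} j \cdot u_{j}^{(i)} = 1 \cdot (1 - u (i)) + 2 \cdot u (i) = 1 + u (i),
\qquad
E (m_{n}) = m_{1} \prod_{i = 1}^{n - 1} (1 + u (i)) .
```

Under this assumption, each duplication event multiplies the expected gene count by a factor between 1 (complete fractionation) and 2 (complete retention).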

Our analysis focuses on the evolution of gene pairs surviving from WGD events. Each pair evolves independently, with sequence similarity declining over time due to the accumulation of random mutations at both nucleotide and amino acid levels. Because simultaneous loss of both copies in a functional pair is assumed to be deleterious, the model excludes concurrent deletion events. Combining these assumptions with the branching process, we compute the expected number of descendant gene pairs $d^{(i, n)}$ observed at $t_{n}$ that share a common ancestor at duplication event $t_{i}$, given by:

E (d^{(i, n)}) = E (d^{(i, i + 1)}) {[E^{(i + 1, n)} (m_{n})]}^{2},

where $E (d^{(i, i + 1)})$ denotes the expected number of pairs formed at generation $i$ and surviving to $t_{i + 1}$, and $E^{(i + 1, n)} (m_{n})$ represents the expected number of genes at time $t_{n}$ descended from a gene present at $t_{i + 1}$.

2.1. Synteny block

We examined two categories of retained duplicate gene pairs: (i) individual pairs, analyzed without reference to genomic context, and (ii) synteny-block pairs, located within conserved blocks that preserve gene order. Histogram analysis of these two categories (Appendix A1, Fig. A1) showed a bimodal distribution of similarity values. The pattern was more evident for synteny-block pairs, reflecting stronger signals of both ancient and recent duplication events.

We also identified singleton genes, defined as duplicates with no corresponding copy in any synteny block (horizontal markers in Fig. 1). These reflect gene loss or rearrangement and were traced back to conserved regions to infer their origin.

FIG. 1.

Example of two conserved synteny blocks (A and B) derived from a WGD. Rectangles (G1, G2, G3, etc.) represent retained genes. Dashed lines connect matching gene pairs that still sit within the blocks (block pairs). Genes without a corresponding copy in the other block (shown as horizontal lines) are singletons, because their counterpart has been lost or rearranged. WGD, whole-genome duplication.

To estimate the number of singleton genes, one could subtract the count of genes in $t_{i}$ pairs from the genome’s total. This approach, however, requires care because a gene can appear in multiple pairs across various $t_{i}$ events. Moreover, relying directly on the genome’s total gene count can lead to inaccuracies due to gene expansions and duplications that occur after $t_{n - 1}$. We therefore focus on singletons within synteny blocks to infer retention rates, identifying the $t_{i}$ at which each singleton was generated. This method offers a more precise alternative to simple subtraction and provides essential data for parameter estimation. The presence of singletons within synteny blocks plays a crucial role in our analysis of gene complement sizes and the computation of “crumble” coefficients.

Our model assumes a uniform fractionation probability across all pairs generated by the same WGD event, selecting one pair for loss at each step. Although fractionation within blocks introduces local dependencies, our analysis relies only on sums of expected values, which are unaffected by nonindependence because expectation is linear.

2.2. Erosion of synteny blocks over time

Over time, fractionation erodes synteny blocks through processes such as chromosomal rearrangement and gene pair divergence, compounded by technical limitations in block detection, including threshold-based filtering to distinguish true collinearity from background noise. This phenomenon, termed block crumble, complicates the inference of gene retention rates for older WGD events by underestimating the number of gene pairs and singletons originally present in these blocks, thereby biasing retention estimates.

To mitigate this bias, we use the concept of a “syntenic cohort” and employ a set of “crumble coefficients” $c_{1}, \dots, c_{n - 1}$ (Zhang et al., 2021a). These coefficients adjust the $m_{i}$ values, enabling a more accurate reconstruction of the original gene complement and retention probabilities, thereby improving inferences of synteny erosion across evolutionary intervals, particularly in genomes with multiple rounds of duplication.

2.3. A model for synteny blocks

By incorporating these adjustments, we refine our understanding of synteny block erosion across evolutionary intervals. This is particularly relevant for genomes that have experienced multiple rounds of duplication (Zhang et al., 2021a).

For example, consider a single gene at $t_{1}$ with all $r_{i} = 2$ , and let $u (i)$ ( $i = 1, \dots, n - 1$ ) denote the probability that both progeny of a gene at $t_{i}$ survive to $t_{i + 1}$ (Glasauer and Neuhauss, 2014). The expected number $N_{i}$ of duplicate pairs originating at $t_{i}$ and observed at $t_{n}$ is:

E (N_{1}) = m_{1} c_{1} u (1) \prod_{j = 2}^{n - 1} {(1 + u (j))}^{2}

E (N_{i}) = m_{1} c_{i} u (i) \prod_{j = 1}^{i - 1} (1 + u (j)) \prod_{j = i + 1}^{n - 1} {(1 + u (j))}^{2}

E (N_{n - 1}) = m_{1} u (n - 1) \prod_{j = 1}^{n - 2} (1 + u (j))

In general, as illustrated in Figure 2, we count gene pairs at time $t_{n}$ and trace them back to their origin at $t_{i}$.

FIG. 2.

The three unfractionated descendants of gene g at time $t_{i}$ define three $t_{i}$-pairs (ovals). We follow the pair in the uppermost oval: its two members present at $t_{i + 1}$ (shaded triangles) evolve independently into $m_{n}^{'}$ and $m_{n}^{''}$ genes by $t_{n}$, yielding $m_{n}^{'} m_{n}^{''}$ $t_{n}$-pairs (see Zhang et al., 2021a).

Equation (5) gives the expected number of pairs originating at $t_{1}$ , incorporating the cumulative effect of subsequent WGDs. Equation (6) generalizes this for an intermediate duplication, while (7) describes the final event, influenced only by prior duplications.

The expected numbers of singletons are as follows:

E (S_{1}) = m_{1} c_{1} (1 - u (1))

E (S_{i}) = m_{1} c_{i} (1 - u (i)) \prod_{j = 1}^{i - 1} (1 + u (j))

E (S_{n - 1}) = m_{1} (1 - u (n - 1)) \prod_{j = 1}^{n - 2} (1 + u (j))
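
The expected pair and singleton counts can be evaluated numerically. The following Python sketch is illustrative only: the function name `expected_counts` and its argument layout are hypothetical, it assumes all $r_{i} = 2$, and it treats the crumble coefficient of the most recent event as 1 so that the case $n = 3$ reproduces the entries of Table 1. The inputs are the rounded Atlantic salmon parameters from Table 3 (block size 3), so agreement with the observed counts is only approximate.

```python
from math import prod


def expected_counts(m1, u, c):
    """Expected pairs and singletons per WGD event when every r_i = 2.

    u[k] and c[k] belong to event t_{k+1}; both lists have length n - 1.
    Pass c[-1] = 1.0 for the most recent event (no crumble correction).
    """
    pairs, singletons = [], []
    for i in range(len(u)):
        # Expected copies per original gene just before event i.
        before = prod(1 + x for x in u[:i])
        # Each member of a surviving pair multiplies independently afterwards.
        after = prod((1 + x) ** 2 for x in u[i + 1:])
        pairs.append(m1 * c[i] * u[i] * before * after)
        singletons.append(m1 * c[i] * (1 - u[i]) * before)
    return pairs, singletons


# Rounded Atlantic salmon values (block size 3) from Table 3.
pairs, singletons = expected_counts(m1=21759, u=[0.335, 0.461], c=[0.544, 1.0])
print([round(x) for x in pairs])       # compare with observed 8457 and 13,384
print([round(x) for x in singletons])  # compare with observed 7863 and 15,668
```

Because the tabulated parameters are rounded to three decimals, the computed expectations differ from the observed counts by a fraction of a percent.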

The number of gene pairs and singleton genes within the identified components is counted. These observed counts are substituted into evolutionary equations in Table 1, allowing for the estimation of evolutionary rates within the model.

Table 1.

The Expected Number of Pairs and Singletons for Two Successive WGDs

Event	Observed	Expected number
$t_{1}$	Pairs	$c_{1} m_{1} u (1) {(1 + u (2))}^{2}$
$t_{2}$	Pairs	$m_{1} (1 + u (1)) u (2)$
$t_{1}$	Singletons	$c_{1} m_{1} (1 - u (1))$
$t_{2}$	Singletons	$m_{1} (1 + u (1)) (1 - u (2))$

WGD, whole-genome duplication.
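
The four expectations in Table 1 can be inverted algebraically for the four unknowns $u (1)$, $u (2)$, $m_{1}$, and $c_{1}$. The Python sketch below shows one such method-of-moments inversion; it is an illustration under the two-WGD model, the function name `estimate_parameters` is hypothetical, and it is not necessarily the numerical procedure used to produce Tables 3–5. The input counts are those reported for Atlantic salmon with block size 3 in Table 3.

```python
def estimate_parameters(s1, s2, p1, p2):
    """Invert the four Table 1 expectations for (u1, u2, m1, c1).

    s1, s2: observed singletons for t_1 and t_2; p1, p2: observed pairs.
    """
    # t_2 rows: p2 / s2 = u2 / (1 - u2), and p2 + s2 = m1 * (1 + u1).
    u2 = p2 / (p2 + s2)
    # t_1 rows: p1 / s1 = u1 * (1 + u2)^2 / (1 - u1).
    odds = p1 / (s1 * (1 + u2) ** 2)
    u1 = odds / (1 + odds)
    m1 = (p2 + s2) / (1 + u1)
    c1 = s1 / (m1 * (1 - u1))
    return u1, u2, m1, c1


# Atlantic salmon, block size 3 (Table 3).
u1, u2, m1, c1 = estimate_parameters(s1=7863, s2=15668, p1=8457, p2=13384)
print(round(u1, 3), round(u2, 3), round(m1), round(c1, 3))
```

The recovered values lie close to the tabulated $u (1) = 0.335$, $u (2) = 0.461$, $m_{1} = 21{,}759$, and $c_{1} = 0.544$, which suggests that the first two count columns of Table 3 refer to $t_{1}$ and $t_{2}$, respectively.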

Table 2.

Total Gene Pairs in Synteny Blocks for Different Species Across Block Sizes 3, 4, and 5. The Data Compares the Number of Gene Pairs in Synteny Blocks for Atlantic Salmon, Lake Trout, Chum Salmon, Chinook Salmon, and Rainbow Trout

Total pairs	Block 3	Block 4	Block 5
Atlantic Salmon	21,763	17,770	15,776
Lake Trout	21,143	16,741	14,721
Chum Salmon	14,577	12,063	10,722
Chinook Salmon	20,501	17,377	15,773
Rainbow Trout	23,548	20,146	18,395

2.4. Sequence divergence: similarity decay and cutoff estimation

While the equations above quantify the expected number of surviving pairs, they do not explicitly model how sequence similarity between retained gene pairs decays over time. To incorporate evolutionary divergence, we introduce a similarity decay function:

S_{i, n} = e^{- μ_{i} (t_{i} - t_{n})} + ϵ_{i}

Here, $S_{i, n}$ denotes the sequence similarity between paralogous genes originating at time $t_{i}$ and observed at time $t_{n}$, $μ_{i}$ is the fractionation rate, and $ϵ_{i} \sim N (0, σ_{i}^{2})$ represents normally distributed random noise. This time-dependent exponential decay function reflects the accumulation of point mutations and is consistent with previous studies (Blanc and Wolfe, 2004; Sankoff et al., 2019). The exponential approximation assumes independent sites and neglects multiple substitutions and back mutations at the same position.

For comparison, we also evaluated a model for multiple substitutions and back mutations, $\frac{1}{4} + \frac{3}{4} e^{- μ_{i} \frac{4}{3} (t_{i} - t_{n})} + ϵ_{i}$ (Jukes and Cantor, 1969), and found that the estimates of the fractionation rate were highly consistent with those from the simpler approximation.
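
To see how close the two decay models are over the relevant timescales, the noise-free expectations can be compared directly. This is a minimal Python sketch: the function names are hypothetical, `elapsed` denotes the non-negative time since the duplication in Myr (an assumed sign convention), and the rate 0.0012 per Myr is simply a value within the range reported in Tables 3–5.

```python
import math


def similarity_exponential(mu, elapsed):
    """Expected similarity exp(-mu * elapsed); elapsed = time since the WGD (Myr)."""
    return math.exp(-mu * elapsed)


def similarity_jukes_cantor(mu, elapsed):
    """Jukes-Cantor-corrected expectation 1/4 + 3/4 * exp(-(4/3) * mu * elapsed)."""
    return 0.25 + 0.75 * math.exp(-(4.0 / 3.0) * mu * elapsed)


mu = 0.0012  # per Myr, illustrative value within the reported range
for elapsed in (0, 80, 220, 1000):
    print(elapsed,
          round(similarity_exponential(mu, elapsed), 3),
          round(similarity_jukes_cantor(mu, elapsed), 3))
```

Both curves start at 1 with the same initial slope $- μ$; the corrected curve lies slightly above the simple exponential at later times and levels off at 1/4 rather than 0, so the two differ little while similarity remains high.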

To characterize the distribution of S, we derive its MGF:

M_{S} (θ) = E [e^{θ S}] = \exp (θ e^{- μ_{i} (t_{i} - t_{n})} + \frac{θ^{2} σ_{i}^{2}}{2})

This closed-form expression facilitates the derivation of moments. Specifically, the first moment (expected value) and second central moment (variance) are:

E [S_{i}] = {\frac{d}{d θ} M_{S} (θ) |}_{θ = 0} = e^{- μ_{i} (t_{i} - t_{n})}, \qquad Var (S_{i}) = σ_{i}^{2}
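
For completeness, the variance follows from the second derivative of the MGF. In the sketch below, $\bar{S}_{i}$ is shorthand (introduced here for brevity) for the mean similarity $e^{- μ_{i} (t_{i} - t_{n})}$:

```latex
M_{S}' (θ) = (\bar{S}_{i} + θ σ_{i}^{2}) M_{S} (θ), \qquad
M_{S}'' (θ) = [σ_{i}^{2} + {(\bar{S}_{i} + θ σ_{i}^{2})}^{2}] M_{S} (θ),

E [S_{i}^{2}] = M_{S}'' (0) = σ_{i}^{2} + \bar{S}_{i}^{2}, \qquad
Var (S_{i}) = E [S_{i}^{2}] - {(E [S_{i}])}^{2} = σ_{i}^{2} .
```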

After fitting the similarity distribution with a Gaussian mixture model using mixtools (Benaglia et al., 2009), we estimate the component parameters ( $S_{i, n}, σ_{i}, λ_{i}$ ) and compute an optimal cutoff H to separate evolutionary groups of gene pairs. The cutoff maximizes the likelihood under two-component mixtures:

H = \arg \max_{h \in (0, 1)} \prod_{x \leq h} λ_{1} N (S_{1, n}, σ_{1}) \prod_{x > h} λ_{2} N (S_{2, n}, σ_{2}) .

In the Gaussian mixture model, $λ_{i}$ denotes the weight of the i-th component, with $λ_{i} \geq 0$ and $\sum_{i} λ_{i} = 1$ .

Differentiating the log-likelihood with respect to h and solving yields:

H = \frac{2 (\frac{S_{1, n}}{σ_{1}^{2}} - \frac{S_{2, n}}{σ_{2}^{2}}) \pm \sqrt{{(2 (\frac{S_{1, n}}{σ_{1}^{2}} - \frac{S_{2, n}}{σ_{2}^{2}}))}^{2} - 4 (\frac{1}{σ_{1}^{2}} - \frac{1}{σ_{2}^{2}}) (\frac{S_{1, n}^{2}}{σ_{1}^{2}} - \frac{S_{2, n}^{2}}{σ_{2}^{2}} - 2 \log (\frac{λ_{1}}{λ_{2}}))}}{2 (\frac{1}{σ_{1}^{2}} - \frac{1}{σ_{2}^{2}})} .

When $σ_{1} \approx σ_{2}$, the solution simplifies to a weighted average:

H = \frac{λ_{1} \frac{S_{1, n}}{σ_{1}^{2}} + λ_{2} \frac{S_{2, n}}{σ_{2}^{2}}}{λ_{1} \frac{1}{σ_{1}^{2}} + λ_{2} \frac{1}{σ_{2}^{2}}} .

Pairs with similarity above H are assigned to the most recent WGD; those below are inferred as older duplicates. These optimal cutoff values, which separate gene pairs into distinct evolutionary groups, are summarized in Tables 3–5.
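
A cutoff of this kind can also be located numerically. The Python sketch below finds the point between the two component means at which the weighted Gaussian densities are equal, which is one common way to define a two-component decision boundary; it is not a transcription of the closed-form expressions above and may differ from them slightly. The function names and the mixture parameters are hypothetical and chosen only to resemble the two similarity peaks.

```python
import math


def log_weighted_density(x, weight, mean, sd):
    """log(weight * Normal(x; mean, sd)), dropping the shared 2*pi constant."""
    return math.log(weight) - math.log(sd) - 0.5 * ((x - mean) / sd) ** 2


def equal_density_cutoff(w1, m1, s1, w2, m2, s2, tol=1e-10):
    """Bisection for the point between the two means (m1 < m2) where the
    weighted component densities are equal."""
    def diff(x):
        return (log_weighted_density(x, w1, m1, s1)
                - log_weighted_density(x, w2, m2, s2))

    lo, hi = m1, m2
    if diff(lo) <= 0 or diff(hi) >= 0:
        raise ValueError("no sign change between the component means")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if diff(mid) > 0:
            lo = mid  # component 1 still dominates: move right
        else:
            hi = mid  # component 2 dominates: move left
    return 0.5 * (lo + hi)


# Hypothetical mixture parameters (weight, mean, sd for each component).
H = equal_density_cutoff(0.4, 0.77, 0.05, 0.6, 0.94, 0.03)
print(round(H, 3))
```

Bisection is used here because it needs no special handling when the two standard deviations are nearly equal, a case in which the quadratic coefficient of the analytic solution approaches zero.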

Table 3.
Observed Counts of Gene Pairs and Singletons, Cutoff Values, and Parameters for Synteny Blocks with Three or More Genes

Type (block 3)	Singleton ($t_{1}$)	Singleton ($t_{2}$)	Pair ($t_{1}$)	Pair ($t_{2}$)	$S_{1, n}$ (%)	$S_{2, n}$ (%)	Cutoff	$c_{1}$	$m_{1}$	u(1)	u(2)	$μ (1)$	$μ (2)$
Atlantic Salmon	7863	15,668	8457	13,384	77	94	0.852	0.544	21,759	0.335	0.461	0.0012	0.0008
Lake Trout	6727	9088	9627	11,577	75	93	0.859	0.708	15,081	0.370	0.560	0.0013	0.0009
Chum Salmon	3862	8596	5095	9504	75	92	0.852	0.431	14,332.5	0.375	0.482	0.0013	0.0010
Chinook Salmon	5250	10,206	8572	11,960	75	92	0.853	0.563	15,744	0.408	0.540	0.0013	0.0010
Rainbow Trout	5202	13,622	9283	14,292	75	93	0.844	0.477	19,407	0.438	0.512	0.0013	0.0009

Table 4.

Observed Counts of Gene Pairs and Singletons, Cutoff Values, and Parameters for Synteny Blocks with Four or More Genes

Type (block 4)	Singleton ($t_{1}$)	Singleton ($t_{2}$)	Pair ($t_{1}$)	Pair ($t_{2}$)	$S_{1, n}$ (%)	$S_{2, n}$ (%)	Cutoff	$c_{1}$	$m_{1}$	u(1)	u(2)	$μ (1)$	$μ (2)$
Atlantic Salmon	5033	15,604	5114	12,681	76	93	0.847	0.350	21,326	0.326	0.448	0.0012	0.0009
Lake Trout	4395	9218	5590	11,163	75	93	0.853	0.445	15,133	0.347	0.548	0.0013	0.0009
Chum Salmon	2341	8869	3050	9016	76	92	0.845	0.282	13,098	0.365	0.504	0.0013	0.0011
Chinook Salmon	3775	10,535	5605	11,776	76	92	0.844	0.384	16,065	0.389	0.528	0.0013	0.0010
Rainbow Trout	3874	12,605	6228	13,920	76	93	0.838	0.348	18,828	0.409	0.525	0.0013	0.0009

Table 5.

Observed Counts of Gene Pairs and Singletons, Cutoff Values, and Parameters for Synteny Blocks with Five or More Genes

Type (block 5)	Singleton ($t_{1}$)	Singleton ($t_{2}$)	Pair ($t_{1}$)	Pair ($t_{2}$)	$S_{1, n}$ (%)	$S_{2, n}$ (%)	Cutoff	$c_{1}$	$m_{1}$	u(1)	u(2)	$μ (1)$	$μ (2)$
Atlantic Salmon	3193	15,382	3551	12,235	76	93	0.843	0.239	20,485	0.348	0.443	0.0012	0.0009
Lake Trout	2982	9351	3838	10,885	76	93	0.839	0.308	14,963	0.352	0.538	0.0012	0.0009
Chum Salmon	1619	8982	2006	8717	76	92	0.836	0.193	13,039	0.357	0.493	0.0013	0.0011
Chinook Salmon	2932	10,671	4136	11,638	76	92	0.841	0.291	16,187	0.379	0.522	0.0012	0.0010
Rainbow Trout	2856	12,415	4622	13,774	76	93	0.831	0.261	18,573	0.410	0.526	0.0012	0.0009

2.5. Analytical workflow

The analytical workflow begins with extracting relevant genomic sequences and proceeds through synteny block identification and evolutionary rate estimation (Fig. 3). We use CoGe’s SynMap with default settings to perform a genome self-comparison, detecting regions of conserved gene order and orientation (synteny blocks). Block detection depends on parameters such as the number of collinear genes, intergenic distances, and sequence similarity. In practice, we used a maximum gap of 20 genes and a minimum of three to five collinear gene pairs.

FIG. 3.

Workflow for counting $t_{i}$ -pairs and singletons through successive WGDs.

3. FIVE SALMONINAE GENOMES

We analyzed genomic data from the National Center for Biotechnology Information (NCBI) (National Center for Biotechnology Information, 2023) and the Comparative Genomics (CoGe) platform (CoGe, n.d.), which provides web-based tools for synteny analysis and genome comparison. Synteny detection was conducted using DAGchainer and SynMap (Haas et al., 2004; Lyons et al., 2008; Haug-Baltzell et al., 2017), both integrated within CoGe, to identify sequence-similar gene pairs and collinear regions. DAGchainer was configured with a maximum allowed gap of 20 genes (−D = 20) and a minimum of three to five aligned pairs (−A = 3–5). To ensure data quality, only the highest percent-identity LASTZ hit per query gene was retained. The uniform similarity and collinearity within identified synteny blocks support the assumption of preserved pre-duplication gene order. In contrast, interspersed singletons likely correspond to fractionated pairs from the same evolutionary period.
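
The best-hit filtering step can be sketched as follows. This Python fragment is illustrative only: the function name, the tuple layout `(query_gene, subject_gene, percent_identity)`, and the gene identifiers are hypothetical and do not correspond to a specific LASTZ or CoGe output format.

```python
def best_hit_per_query(hits):
    """Keep, for each query gene, the hit with the highest percent identity.

    hits: iterable of (query_gene, subject_gene, percent_identity) tuples.
    Ties keep the first hit encountered.
    """
    best = {}
    for query, subject, identity in hits:
        if query not in best or identity > best[query][1]:
            best[query] = (subject, identity)
    return best


# Toy input with made-up gene identifiers.
hits = [
    ("geneA", "geneX", 91.2),
    ("geneA", "geneY", 76.5),
    ("geneB", "geneZ", 88.0),
]
filtered = best_hit_per_query(hits)
print(filtered)
```

Retaining a single hit per query removes lower-identity matches to more distant paralogs before collinear chaining is attempted.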

The analysis included multiple salmonid species (Fig. 4). Publicly available genomes were used for Atlantic salmon [Salmo salar (Lien et al., 2016)] and lake trout [Salvelinus namaycush (Smith et al., 2022)], while privately downloaded NCBI assemblies were utilized for Chinook salmon [Oncorhynchus tshawytscha (Christensen et al., 2018)], rainbow trout [Oncorhynchus mykiss (Berthelot et al., 2014)], and chum salmon [Oncorhynchus keta (Lee and Kim, 2019)].

FIG. 4.

Phylogenetic relationships of selected Salmonidae species and the catostomid fish (Myxocyprinus asiaticus), showing major WGD events (red diamonds) and estimated divergence times. Ts3R occurred $\sim 300$ Mya, Ss4R $\sim 80$ Mya, and Cat-4R $\sim 60$ Mya. Mya, million years ago.

Atlantic salmon (Salmo salar, ID: 28938) has a genome size of 2.24 Gb assembled into 29 chromosomes (NCBI GCA_905237065.2). The assembly is unmasked, with noncoding regions accounting for approximately 2.17 Gb. Unplaced sequences labeled “NW_” in the GFF file were removed to improve assembly quality.

Lake trout (Salvelinus namaycush, ID: 63989) has a genome size of 2.35 Gb across 4,121 contigs (NCBI GCA_013841185.1), released on May 24, 2022. The assembly is unmasked.

Chinook salmon (Oncorhynchus tshawytscha, ID: 64176) has a genome size of 2.3 Gb organized into 34 chromosomes and 9982 contigs within 9977 scaffolds, containing 2313 gaps. The Scaffold N50 and Contig N50 are both 2.9 Mb, indicating moderate continuity.

Rainbow trout (Oncorhynchus mykiss, ID: 63876) has a 2.3 Gb genome comprising 32 chromosomes, 1228 contigs, and 938 scaffolds with 196 gaps. The assembly quality is high, with Scaffold N50 of 39.2 Mb and Contig N50 of 15.6 Mb.

Chum salmon (Oncorhynchus keta, ID: 64186) has a genome size of 2.6 Gb assembled into 37 chromosomes and 17,497 contigs within 17,479 scaffolds, including 3405 gaps. Both Scaffold N50 and Contig N50 are approximately 2 Mb.

Table 2 reports the total number of gene pairs in synteny blocks across species. Note that these totals may appear smaller than the sum of synteny block pairs shown in Tables 3–5. This discrepancy occurs because the same gene pair can belong to multiple candidate blocks prior to filtering. When applying mean similarity cutoffs at the block level, some blocks (and their pairs) are excluded, and duplicate gene pairs are removed, leading to a reduced final count.

3.1. Similarity distributions across salmonid genomes

Figure A1 in Appendix A1 illustrates the distribution of similarity scores for individual gene pairs and for gene pairs within synteny blocks across the five salmonid species, providing insights into WGD events. The distinct local peaks observed in both distributions reflect the synchronicity of gene duplications associated with WGD and correspond to evolutionary periods marked by significant genomic changes. Analyzing these patterns allows inference of both the timing and ploidy levels of WGD events.

Notably, the two local peaks in the similarity distributions of synteny blocks, around 75% and 90%, indicate two distinct levels of conservation. This pattern is similar to that of individual gene pairs, where two distinct peaks likewise reflect varying degrees of conservation or divergence but may not capture the broader evolutionary context as effectively as synteny blocks.

The presence of an ambiguous peak around 95% in the distribution of synteny blocks points to a subset of highly conserved genomic segments. This peak likely reflects either recent segmental duplications of these regions or assembly errors.

We further validated the cutoff inference approach using synthetic data generated under a two-component Gaussian mixture model (parameters chosen based on empirical similarity peaks). The estimated cutoff closely matched the theoretical decision boundary, confirming the robustness of our likelihood-based method (Fig. A2 in Appendix A2).
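
A validation of this kind can be sketched with the standard library alone. The Python example below simulates a two-component Gaussian mixture and refits it with a plain expectation-maximization (EM) loop to check parameter recovery. It is an illustration, not the mixtools-based procedure used in the analysis: the component means follow the approximate peak locations noted in this section (75% and 90%), while the standard deviations, weights, sample size, seed, and function names are assumptions.

```python
import math
import random
import statistics


def normal_pdf(x, mean, sd):
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2.0 * math.pi))


def fit_two_gaussians(xs, iterations=100):
    """Plain EM for a two-component one-dimensional Gaussian mixture."""
    xs = sorted(xs)
    n = len(xs)
    means = [xs[n // 4], xs[(3 * n) // 4]]  # start from the quartiles
    sds = [statistics.pstdev(xs)] * 2
    weights = [0.5, 0.5]
    for _ in range(iterations):
        # E-step: posterior probability that each point belongs to component 0.
        resp0 = []
        for x in xs:
            a = weights[0] * normal_pdf(x, means[0], sds[0])
            b = weights[1] * normal_pdf(x, means[1], sds[1])
            resp0.append(a / (a + b))
        # M-step: responsibility-weighted updates of means, sds, and weights.
        for k in range(2):
            resp = resp0 if k == 0 else [1.0 - r for r in resp0]
            total = sum(resp)
            means[k] = sum(r * x for r, x in zip(resp, xs)) / total
            var = sum(r * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / total
            sds[k] = math.sqrt(var)
            weights[k] = total / n
    return means, sds, weights


random.seed(1)
true_means, true_sds, true_weights = [0.75, 0.90], [0.04, 0.03], [0.4, 0.6]
sample = [
    random.gauss(true_means[k], true_sds[k])
    for k in random.choices([0, 1], weights=true_weights, k=2000)
]
means, sds, weights = fit_two_gaussians(sample)
print([round(m, 3) for m in means],
      [round(s, 3) for s in sds],
      [round(w, 3) for w in weights])
```

With components this well separated, the fitted means, standard deviations, and weights should land close to the generating values, so any cutoff computed from the fitted parameters should in turn lie close to the one implied by the true parameters.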

4. RESULTS

Synteny block analysis across five Salmoninae genomes reveals distinct post-polyploidization patterns of gene retention and loss. As expected, reducing the block size threshold from 5 to 3 increases both singleton and gene pair counts due to the less stringent requirements for synteny block formation.

Despite variation in block size and species, similarity cutoffs remain stable, ranging from 0.831 to 0.859, indicating consistent criteria for defining synteny blocks based on sequence similarity and gene collinearity.

The crumble coefficient (c), representing the proportion of conserved syntenic regions, increases as block size decreases (Fig. 5). This pattern indicates that smaller blocks are the outcome of structural erosion (block crumble), reflecting accumulated gene loss, duplication, and rearrangement.

FIG. 5.

Left plot showing the crumble coefficients (c) for different Salmonidae species across synteny block sizes ( $\geq 3$ , $\geq 4$ , and $\geq 5$ genes). Right plot showing the total gene pairs for different Salmonidae species across synteny block sizes ( $\geq 3$ , $\geq 4$ , and $\geq 5$ genes).

Comparative analysis with plant genomes [poplar, durian (Glasauer and Neuhauss, 2014)] shows that Salmoninae exhibit approximately half the c value observed in plants, indicating a higher rate of genomic rearrangement and fractionation in these fish. The plants appear more tolerant of polyploidization, retaining more duplicated genes.

The apparent increase in the initial gene complement ( $m_{1}$ ) when moving from block size 5 to 3 likely reflects detection bias rather than a true expansion. Erosion of syntenic blocks over time and improved sensitivity in smaller blocks can distort retention estimates, leading to an overestimation in more recent blocks.

Finally, the survival probabilities $u (1)$ and $u (2)$, which represent the temporal dynamics of gene retention, provide further insight. Across all datasets, $u (1) < u (2)$, consistent with the longer evolutionary interval between the first and second WGD (approximately 220 Myr) compared with the shorter interval from the second WGD to the present (approximately 80 Myr).

4.1. Similarity distributions and temporal retention across salmonid genomes

We applied our branching-process-based framework to the five salmonid genomes to investigate post-duplication divergence patterns. Histogram-based analysis of retained duplicate gene pairs revealed a characteristic bimodal similarity distribution (Fig. A1 in Appendix A1). The first mode, located above $S = 0.85$, corresponds to gene pairs derived from the salmonid-specific whole-genome duplication (Ss4R), whereas the second mode reflects remnants of the older teleost-specific duplication (Ts3R). These results are consistent with previous reports and further validate the capacity of similarity-based models to capture polyploidy signatures.

To quantify the evolutionary timescale underlying these patterns, we calibrated the similarity decay model:

E (S_{i}) = e^{- μ_{i} (t_{i} - t_{n})}

using the known timing of the WGDs and the observed similarity at the peaks. Solving for $μ_{i}$ in each species gives the two rightmost columns of Tables 3–5:

μ_{i} = - \frac{\ln (S_{i, n})}{t_{i} - t_{n}}

Applying this transformation across all duplicate pairs yielded retention time distributions for each genome (Tables 3–5 and Fig. 6). This parameterization enabled the conversion of similarity scores into estimated divergence times.
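
The calibration can be checked with a short calculation. In this Python sketch the function names are hypothetical, and the elapsed times of 220 and 80 Myr are assumptions taken from the intervals quoted earlier in this section; with the Atlantic salmon peak similarities from Table 3 (77% and 94%), they reproduce the tabulated rates to the reported precision.

```python
import math


def decay_rate(similarity, elapsed_myr):
    """mu = -ln(S) / elapsed time, from E(S) = exp(-mu * elapsed)."""
    return -math.log(similarity) / elapsed_myr


def elapsed_time(similarity, mu):
    """Inverse use of the same relation: elapsed = -ln(S) / mu."""
    return -math.log(similarity) / mu


# Atlantic salmon, block size 3: peak similarities 77% and 94% (Table 3).
mu1 = decay_rate(0.77, 220.0)  # assumed elapsed time for the older peak
mu2 = decay_rate(0.94, 80.0)   # assumed elapsed time for the Ss4R peak
print(round(mu1, 4), round(mu2, 4))  # 0.0012 0.0008
```

The second function shows how the same relation converts a similarity score into an estimated divergence time once a rate has been fixed.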

FIG. 6.

Fractionation rate parameters ( $μ_{1}$ and $μ_{2}$ ) for five salmonid species across three block-size thresholds (3, 4, and 5 genes). Bars represent $μ_{1}$ (solid) and $μ_{2}$ (hatched) for each block-size category.

Interestingly, although the branching-process model does not impose equality between the fractionation parameters $μ_{1}$ and $μ_{2}$, which correspond to the two major WGD events (Ts3R and Ss4R), their estimated values are nearly identical across all five Salmoninae species. Specifically, the observed range for both fractionation parameters is approximately 0.0009–0.0013 per Myr, regardless of block-size criteria. This slow rate of fractionation is consistent with earlier findings that teleost WGDs (e.g., Ss4R in salmonids and Cat-4R in catostomids) maintain duplicate genes for tens of millions of years due to the slow pace of rediploidization (Lien et al., 2016; Robertson et al., 2017; Krabbenhoft et al., 2021). In contrast, plant lineages such as Malvaceae usually experience rapid post-WGD fractionation, losing most duplicates within a few million years (Zhang et al., 2021b).

The consistent retention rates observed across different WGD events suggest two possible interpretations. First, fractionation dynamics following polyploidization may follow a conserved mechanism across different WGD events, resulting in similar long-term retention probabilities. Alternatively, the similarity of $μ_{1}$ and $μ_{2}$ may reflect the limited statistical resolution to distinguish between two events using present-day retention patterns alone. In either case, the evidence supports a relatively stable rate of post-WGD gene loss. This observation aligns with previous reports of conserved fractionation mechanisms in polyploid genomes across taxa (Blanc and Wolfe, 2004; Zhang et al., 2021a).

5. DISCUSSION: A COMPARATIVE PERSPECTIVE ON POST-WGD FRACTIONATION

Our analyses indicate that post-WGD fractionation dynamics in salmonid genomes are characterized by relatively slow rediploidization, with well-resolved signals across multiple species. To place these patterns in a broader evolutionary context, we briefly examined the Chinese sucker (Myxocyprinus asiaticus, ID: 61661), a catostomid fish that experienced an independent, lineage-specific whole-genome duplication (Cat-4R).

In addition to Cat-4R, M. asiaticus retains detectable signals from the teleost-specific WGD (Ts3R), as reflected by two major similarity peaks ($\sim 75\%$ and $\sim 92\%$) and additional overlapping components (Fig. 7). Compared with salmonids, the presence of highly conserved syntenic blocks in M. asiaticus leads to substantial overlap among duplication signals, posing challenges for component separation using similarity alone.

FIG. 7.

Orange marks recent duplications or WGD, often aligned near the diagonal; green marks moderately diverged duplicates; and red marks the more ancient duplicates for block 3.

To improve resolution in this multi-WGD setting, we integrated $K_{s}$ information with similarity scores. Lower $K_{s}$ values cluster within high-similarity blocks, consistent with a more recent duplication origin, whereas the higher similarity peak likely reflects Cat-4R together with remnants of older events. Visual inspection using GEvo further confirms strong local collinearity alongside missing matches and gaps, indicating ongoing fractionation despite overall synteny conservation (Fig. 8).

FIG. 8.

Left: GEvo visualization of a syntenic block between Chromosome 7 and Chromosome 8 in Myxocyprinus asiaticus. Red boxes represent annotated genes, and pink lines connect syntenic gene pairs between the two regions. Right: Histogram of block similarity in Myxocyprinus asiaticus, with bars colored according to the mean log(Ks) value of genes within each bin.
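The binning behind the right-hand panel can be sketched as follows. This is a minimal illustration, assuming that per-pair similarity scores and $K_{s}$ values are already available as arrays; the function name `mean_log_ks_per_bin`, the bin settings, and the toy values are hypothetical and are not taken from the M. asiaticus data. The natural logarithm is assumed, since the base is not stated.

```python
import numpy as np

def mean_log_ks_per_bin(similarity, ks, n_bins=20, lo=0.5, hi=1.0):
    """Histogram similarity scores and average log(Ks) within each bin."""
    similarity = np.asarray(similarity, dtype=float)
    ks = np.asarray(ks, dtype=float)
    # Keep pairs inside the plotted range with positive Ks (log undefined otherwise).
    keep = (similarity >= lo) & (similarity <= hi) & (ks > 0)
    similarity, ks = similarity[keep], ks[keep]
    edges = np.linspace(lo, hi, n_bins + 1)
    # digitize gives 1-based bin indices; clip so similarity == hi lands in the last bin.
    idx = np.clip(np.digitize(similarity, edges) - 1, 0, n_bins - 1)
    counts = np.bincount(idx, minlength=n_bins)
    sums = np.bincount(idx, weights=np.log(ks), minlength=n_bins)
    means = np.full(n_bins, np.nan)  # empty bins stay NaN
    filled = counts > 0
    means[filled] = sums[filled] / counts[filled]
    return edges, counts, means

# Toy input (hypothetical values, for illustration only).
sim = [0.74, 0.76, 0.91, 0.93, 0.92]
ks = [1.20, 1.00, 0.10, 0.08, 0.12]
edges, counts, means = mean_log_ks_per_bin(sim, ks, n_bins=5)
```

The returned `counts` give the bar heights and `means` the value used to color each bar; in the toy input the high-similarity bin has the lower mean log(Ks), mirroring the pattern described in the text.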

Taken together, these observations suggest that the post-duplication trajectory of M. asiaticus differs from the prolonged and well-resolved rediploidization observed in salmonids. However, the current data do not allow a quantitative comparison of fractionation timescales across lineages. A systematic analysis explicitly modeling M. asiaticus within the same branching-process framework and across alternative synteny detection methods will be required to assess lineage-specific differences in post-WGD dynamics.

6. CONCLUSION AND FUTURE WORK

Starting with the SynMap software on the CoGe platform, we assessed coding sequence similarities to evaluate fractionation rates and polyploidy events in salmonid genomes, using a combination of algebraic and statistical methods. This approach emphasizes the role of singleton genes in the fractionation process, which supports accurate identification and interpretation of retained and lost genes.

Our branching-process framework, applied here to distinguish two major WGD signals, can be extended to accommodate three or more components by generalizing the retention model to a mixture structure (Sankoff et al., 2019; Zhang et al., 2018, 2019):

$$f(S) = \sum_{k = 1}^{K} λ_{k} N(e^{- μ_{k} t_{k}}, σ_{k}^{2})$$

where K denotes the number of duplication layers, $μ_{k}$ represents the fractionation rate for component k, and $λ_{k}$ its mixing proportion.

A limitation of this study is the potential sensitivity of synteny-based modeling to syntenic block definitions. Systematic comparison across multiple synteny detection methods represents an important direction for future work.
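The K-component mixture f(S) given above can be evaluated directly. The sketch below is illustrative only: $t_{k}$ is read as the time since the k-th duplication event (an assumption, since the text does not define it), and all parameter values in `components`, including the third layer, are hypothetical placeholders rather than fitted estimates.

```python
import math

def normal_pdf(x: float, mean: float, sd: float) -> float:
    """Density of N(mean, sd^2) at x."""
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

def mixture_density(s: float, components) -> float:
    """f(S) = sum_k lam_k * N(exp(-mu_k * t_k), sigma_k^2), evaluated at s.

    components: iterable of (lam, mu, t, sigma); the lam values should sum to 1.
    """
    return sum(
        lam * normal_pdf(s, math.exp(-mu * t), sigma)
        for lam, mu, t, sigma in components
    )

# Hypothetical K = 3 example: (lam_k, mu_k per Myr, t_k in Myr, sigma_k).
components = [
    (0.5, 0.001, 80.0, 0.03),
    (0.3, 0.001, 300.0, 0.05),
    (0.2, 0.001, 500.0, 0.06),
]

# Component means implied by exp(-mu_k * t_k).
means = [math.exp(-mu * t) for _, mu, t, _ in components]
density_at_090 = mixture_density(0.90, components)
```

Adding a duplication layer only requires appending one more tuple, which is the sense in which the two-component model generalizes.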

Future research should also aim to standardize synteny block identification methods, enabling more accurate cross-species comparisons. Expanding the study to include a wider range of species will help clarify the evolutionary consequences of polyploidization. Model refinement is also needed to incorporate processes such as gene conversion and subfunctionalization, which play critical roles in post-WGD genome evolution. Developing automated computational pipelines will further improve the efficiency of fractionation analysis, while functional characterization of fractionated genes will provide deeper insights into the mechanisms driving genome diversification.

AUTHORS’ CONTRIBUTIONS

Y.Z.: Conceptualization, methodology, implementation, and writing—original draft. D.S.: Conceptualization, methodology, and writing—review and editing. All authors read and approved the final article.

AUTHOR DISCLOSURE STATEMENT

The authors declare no conflicts of interest.

FUNDING INFORMATION

This study was supported by the Natural Sciences and Engineering Research Council of Canada (NSERC) under Grants RGPIN-2024-06717 and RGPIN-5212-2022 and by the Thompson Rivers University Open Access Fund (Grant No. 103883).

Appendix A1

APPENDIX A2: SIMULATION-BASED VALIDATION OF SIMILARITY THRESHOLDS

To validate the robustness of the overall branching-based framework introduced in our original model—including the modeling of synteny block disintegration, the use of similarity score distributions, and the inferred threshold separating duplicated gene classes—we conducted a simulation-based evaluation using synthetic data.

To emulate observed similarity distributions in real Salmonidae datasets, we defined a synthetic two-component Gaussian mixture model:

$$S \sim λ_{1} N(μ_{1}, σ_{1}^{2}) + λ_{2} N(μ_{2}, σ_{2}^{2})$$

where $μ_{1} = 0.78$ and $μ_{2} = 0.88$ are the component means, and $σ_{1} = 0.06$ and $σ_{2} = 0.03$ are the component standard deviations. The mixing proportions were set to $λ_{1} = 0.4$ and $λ_{2} = 0.6$ , reflecting an imbalance in the retention of gene pairs derived from distinct whole-genome duplication (WGD) events.

These parameters were selected based on empirical peaks observed in pairwise similarity histograms from the salmon dataset, representing two modes of post-WGD divergence. We then applied our original inference pipeline, which combines a moment-based representation of similarity scores (see Section 2) with a log-likelihood maximization approach, to estimate the optimal cutoff H for distinguishing these components. The log-likelihood is computed as:

$$L(H) = \sum_{S_{i} \leq H} \log [λ_{1} ϕ(S_{i}; S_{1, n}, σ_{1}^{2})] + \sum_{S_{i} > H} \log [λ_{2} ϕ(S_{i}; S_{2, n}, σ_{2}^{2})]$$

The cutoff H maximizing this likelihood was found to closely align with the theoretical decision boundary separating the two Gaussian modes (e.g., $H^{*} \approx 0.829$ ). As shown in Figure A2, the histogram of simulated similarity scores is effectively partitioned by this inferred threshold. These results support the application of our model to empirical polyploid genomes and justify its use in downstream evolutionary inference.
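The simulation and cutoff search can be sketched as follows. This is a minimal illustration rather than the pipeline itself: it assumes component-specific standard deviations (0.06 and 0.03, as listed above), substitutes the simulation means 0.78 and 0.88 for the model-derived $S_{1, n}$ and $S_{2, n}$, restricts candidate cutoffs to observed scores, and uses hypothetical function names, sample size, and random seed.

```python
import numpy as np

LAM = np.array([0.4, 0.6])     # mixing proportions
MEAN = np.array([0.78, 0.88])  # component means
SD = np.array([0.06, 0.03])    # component standard deviations (assumed per component)

def log_weighted_pdf(s, k):
    """log[ lambda_k * phi(s; mean_k, sd_k^2) ] for component k (0 or 1)."""
    z = (s - MEAN[k]) / SD[k]
    return np.log(LAM[k]) - np.log(SD[k]) - 0.5 * np.log(2 * np.pi) - 0.5 * z**2

def simulate(n, rng):
    """Draw n similarity scores from the two-component Gaussian mixture."""
    comp = rng.choice(2, size=n, p=LAM)
    return rng.normal(MEAN[comp], SD[comp])

def best_cutoff(samples):
    """Observed score H that maximizes L(H)."""
    s = np.sort(samples)
    lw1 = log_weighted_pdf(s, 0)
    lw2 = log_weighted_pdf(s, 1)
    # L(H = s[j]) = sum_{i <= j} lw1[i] + sum_{i > j} lw2[i]
    loglik = np.cumsum(lw1) + (lw2.sum() - np.cumsum(lw2))
    return s[np.argmax(loglik)]

def theoretical_boundary():
    """Root of lambda_1 * phi_1 = lambda_2 * phi_2 lying between the two means."""
    a = 0.5 / SD[1]**2 - 0.5 / SD[0]**2
    b = MEAN[0] / SD[0]**2 - MEAN[1] / SD[1]**2
    c = (0.5 * MEAN[1]**2 / SD[1]**2 - 0.5 * MEAN[0]**2 / SD[0]**2
         + np.log(LAM[0] * SD[1] / (LAM[1] * SD[0])))
    roots = np.roots([a, b, c]).real
    return roots[(roots > MEAN[0]) & (roots < MEAN[1])][0]

rng = np.random.default_rng(0)
h_hat = best_cutoff(simulate(20000, rng))
h_star = theoretical_boundary()
print(f"estimated H = {h_hat:.3f}, theoretical H* = {h_star:.3f}")
```

With these settings both values should fall close to the $H^{*} \approx 0.829$ quoted above. The cumulative-sum form evaluates L(H) at every candidate cutoff in a single pass over the sorted scores.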

References

1. Bagley, Mayden, Harris. Phylogeny and divergence times of suckers (Cypriniformes: Catostomidae) inferred from Bayesian total-evidence analyses of molecules, morphology, and fossils. PeerJ, 2018; 6:e5168; doi: 10.7717/peerj.5168

2. Benaglia, Chauveau, Hunter, et al. mixtools: An R package for analyzing mixture models. J Stat Softw, 2009; 32(6):1–29; doi: 10.18637/jss.v032.i06

3. Berthelot, Brunet, Chalopin, et al. The rainbow trout genome provides novel insights into evolution after whole-genome duplication in vertebrates. Nat Commun, 2014; 5:3657; doi: 10.1038/ncomms4657

4. Blanc, Wolfe. Widespread paleopolyploidy in model plant species inferred from age distributions of duplicate genes. Plant Cell, 2004; 16(7):1667–1678; doi: 10.1105/tpc.021345

5. Christensen, Leong, Sakhrani, et al. Chinook salmon (Oncorhynchus tshawytscha) genome and transcriptome. PLoS One, 2018; 13(4):e0195461; doi: 10.1371/journal.pone.0195461

6. Comparative Genomics (CoGe). CoGe: Comparative Genomics Platform. n.d. Available from: https://genomevolution.org/coge/ [Last accessed: December 10, 2025].

7. Glasauer, Neuhauss. Whole-genome duplication in teleost fishes and its evolutionary consequences. Mol Genet Genomics, 2014; 289(6):1045–1060; doi: 10.1007/s00438-014-0889-2

8. Gundappa, To, Grønvold, Martin SAM, et al. Genome-wide reconstruction of rediploidization following autopolyploidization across one hundred million years of salmonid evolution. Mol Biol Evol, 2022; 39(1); doi: 10.1093/molbev/msab310

9. Haas, Delcher, Wortman, et al. DAGchainer: a tool for mining segmental genome duplications and synteny. Bioinformatics, 2004; 20(18):3643–3646; doi: 10.1093/bioinformatics/bth397

10. Haug-Baltzell, Stephens, Davey, et al. SynMap2 and SynMap3D: Web-based whole-genome synteny browsers. Bioinformatics, 2017; 33(14):2197–2198; doi: 10.1093/bioinformatics/btx144

11. Jaillon, Aury, Brunet, et al. Genome duplication in the teleost fish Tetraodon nigroviridis reveals the early vertebrate protokaryotype. Nature, 2004; 431(7011):946–957; doi: 10.1038/nature03025

12. Jukes, Cantor. Evolution of protein molecules. In: Mammalian Protein Metabolism. (Munro, ed.) Academic Press; 1969, pp. 21–132.

13. Krabbenhoft, MacGuigan, Backenstose NJC, et al. Chromosome-level genome assembly of Chinese sucker (Myxocyprinus asiaticus) reveals strongly conserved synteny following a catostomid-specific whole-genome duplication. Genome Biol Evol, 2021; 13(9):evab190; doi: 10.1093/gbe/evab190

14. Kumar, Stecher, Suleski, et al. TimeTree: A resource for timelines, timetrees, and divergence times. Mol Biol Evol, 2017; 34(7):1812–1819; doi: 10.1093/molbev/msx116

15. Lecaudey, Schliewen, Osinov, et al. Inferring phylogenetic structure, hybridization and divergence times within Salmoninae (Teleostei: Salmonidae) using RAD-sequencing. Mol Phylogenet Evol, 2018; 124:82–99; doi: 10.1016/j.ympev.2018.02.022

16. Lee, Kim. Dataset for characterization of thrombospondin family in chum salmon (Oncorhynchus keta). Data Brief, 2019; 22:866–870; doi: 10.1016/j.dib.2019.01.008

17. Lien, Koop, Sandve, et al. The Atlantic salmon genome provides insights into rediploidization. Nature, 2016; 533(7602):200–205; doi: 10.1038/nature17164

18. Lyons, Pedersen, Kane, et al. Finding and comparing syntenic regions among Arabidopsis and the outgroups papaya, poplar, and grape: CoGe with rosids. Plant Physiol, 2008; 148(4):1772–1781.

19. Macqueen, Johnston. A well-constrained estimate for the timing of the salmonid whole genome duplication reveals major decoupling from species diversification. Proc Biol Sci, 2014; 281(1778):20132881.

20. National Center for Biotechnology Information. NCBI RefSeq annotation release GCF_019703515.2-RS_2023_02. RefSeq annotation release, University at Buffalo; 2023. Available from: https://www.ncbi.nlm.nih.gov/assembly/GCF_019703515.2

21. Near, Eytan, Dornburg, et al. Resolution of ray-finned fish phylogeny and timing of diversification. Proc Natl Acad Sci U S A, 2012; 109(34):13698–13703.

22. Robertson, Gundappa, Grammes, et al. Lineage-specific rediploidization is a mechanism to explain time-lags between genome duplication and evolutionary diversification. Genome Biol, 2017; 18(1):111; doi: 10.1186/s13059-017-1241-z

23. Sankoff, Zheng, Zhang, et al. Models for similarity distributions of syntenic homologs and applications to phylogenomics. IEEE/ACM Trans Comput Biol Bioinform, 2019; 16(3):727–737.

24. Smith, Normandeau, Djambazian, et al. A chromosome-anchored genome assembly for Lake Trout (Salvelinus namaycush). Mol Ecol Resour, 2022; 22(2):679–694; doi: 10.1111/1755-0998.13483

25. , Sankoff. Syntenic dimensions of genomic evolution. In: RECOMB-Comparative Genomics (RECOMB-CG) 2022, Vol. 13234. Lecture Notes in Bioinformatics. (Jin, Durand, eds.) Springer; 2022, pp. 21–30; doi: 10.1007/978-3-031-06220-9_2

26. Zhang, Yu, Zheng, et al. Integrated synteny- and similarity-based inference on the polyploidization–fractionation cycle. Interface Focus, 2021a; 11(4):20200059; doi: 10.1098/rsfs.2020.0059

27. Zhang, Zheng, Islam, et al. Branching out to speciation in a model of fractionation: The Malvaceae. IEEE/ACM Trans Comput Biol Bioinform, 2021b; 18(5):1875–1884; doi: 10.1109/TCBB.2019.2955649

28. Zhang, Zheng, Sankoff. Pinning down ploidy in paleopolyploid plants. BMC Genomics, 2018; 19(Suppl 5):287.

29. Zhang, Zheng, Sankoff. Distinguishing successive ancient polyploidy levels based on genome-internal syntenic alignment. BMC Bioinformatics, 2019; 20(Suppl 20):635; doi: 10.1186/s12859-019-3202-x