Abstract
The study of gene family evolution has benefited from the use of phylogenetic tools, which can greatly inform studies of both relationships within gene families and functional divergence. Here, we propose the use of a network-based approach that in combination with phylogenetic methods can provide additional support for models of gene family evolution. We dissect the contributions of each method to the improved understanding of relationships and functions within the well-characterized family of AGAMOUS floral development genes. The results obtained with the two methods largely agreed with one another. In particular, we show how network approaches can provide improved interpretations of branches with low support in a conventional gene tree. The network approach used here may also better reflect known and suspected patterns of functional divergence relative to phylogenetic methods. Overall, we believe that the combined use of phylogenetic and network tools provide a more robust assessment of gene family evolution.
Introduction
Advances in sequencing technology have lead to dramatic expansions in the number of sequenced genes within most gene families, both through the use of whole genome or whole transcriptome sequencing and through broader taxon sampling. Gene families are generally studied through the use of phylogenetic approaches to identify closely and distantly related sequences, as well as to classify divergence between gene copies into those resulting from speciation (orthology) or gene duplication (paralogy).1,2 Thus, phylogenetic approaches are widely employed to study how sequence divergence can lead to divergence of structure and/or function.3,4 When coupled with genome context information, this approach can provide insightful understanding of gene regulation and function.
For instance, it is well known that orthologous genes conserved at syntenic locations in the genome are more likely to exhibit conserved regulation 5 and function 6 than genes at nonsyntenic locations. However, the prevalence of whole genome duplications in plants poses challenges to the study of gene family evolution using exclusively phylogeny-based methods 3 due to the diverse outcomes of duplicated genes. Whole genome duplications produce syntenic paralogs that can be reciprocally lost,7,8 sub- or neofunctionalized, 9 or even retained in the same functional roles as a result of relative or absolute dosage constraints. 10
A fundamental assumption of any phylogenetic reconstruction is that the observed changes occur exclusively through a hierarchical bifurcated branching process. This model is certainly a good representation of a major evolutionary force (ie, descent with modification) 11 ; however, many will argue that it fails to capture the diversity of evolutionary processes which shape the gene content of extant species.12,13
One way to address the complexity of evolutionary processes is to apply network approaches to address questions related to cell organization and functioning, 14 human diseases relationships, 15 and plant gene function prediction. 16 Network approaches have also been successfully applied to study fungi evolution based on enzymes related to the chitin synthase pathway. 17 Recently, Carvalho et al 18 have used a network-based approach to address the origin of the mitochondria, providing a new perspective on the study mitochondrial evolution.
Network-based approaches can overcome some of the limitations of phylogenetic methods. For instance, these approaches do not require the assumption of a hierarchical bifurcating framework and therefore may be capable of dealing with more complex biological patterns and phenomena.19–21 Networks are generally less precise in their ability to reconstruct the divergence points of different groups within a gene family; however, they may be able to capture additional insight into function evolution and divergence using information which might be lost in phylogenetic reconstructions.
In this study, we compare the information gained from conventional phylogenetic analysis and a network-based approach using a well-characterized subfamily of floral transcription factors, the AGAMOUS floral genes. The AGAMOUS gene subfamily comprises MADS-box transcription factors and is involved in important aspects of flower and fruit development. 22 Among angiosperms (flowering plants), the AGAMOUS subfamily is traditionally divided into the C and D lineages. C lineage genes include the closest relatives of the Arabidopsis thaliana AGAMOUS (AG) gene23,24 in all angiosperms, as well as close relatives of the SHATTERPROOF (SHP) gene, present exclusively in core eudicots.
D lineage genes, on the other hand, include angiosperm SEEDSTICK (STK) genes.25,26 The C/D split likely occurred after the split between gymnosperms and angiosperms and gymnosperms usually carry a single-gene copy of the AGAMOUS subfamily. While D lineage genes are usually related to ovule development, C lineage genes have been implicated in stamen and carpel development. Particularly, in core eudicots, SHP genes have also been shown to be involved in fruit development and ripening.27–30
This gene subfamily has been extensively studied and mutant characterization has provided insights into their functional roles in carpel, ovule, and fruit development as well as floral meristem termination. The AGAMOUS subfamily has undergone several instances of duplication followed by neo- and subfunctionalization throughout its evolutionary history in angiosperms,26,31 and understanding the evolutionary history of this group has proven challenging as a result of low support for deep nodes on the tree.
Here, we investigate the contributions of a a similarity-based phylogenetic network approach to our understanding of AGAMOUS gene family evolution.32–34 The phylogenetic network methods used here do not require the assumption of a scale-free topology, or the need to calculate gene correlation based on expression data,16,35 which makes the approach used more straightforward. Also, the approach used here does not rely on an existing tree to generate the networks, as most phylogenetic networks. Overall, both the phylogeny and network results showed consistent clustering of the gene families. However, our results suggest that the network approach was less affected by sequence divergence. We demonstrate that a combination of both methods can provide additional insight into evolutionary events and functional divergence within gene families.
Methods
Sequence search and multiple sequence alignment
C and D lineage AGAMOUS nucleotide sequences were retrieved on Phytozome (https://phytozome.jgi.doe.gov/pz/portal.html) and NCBI (National Center for Biotechnology Information). Species of origin and accession numbers for each sequence included in this analysis are provided in Table 1. A multiple sequence alignment was performed using the ClustalW 36 alignment tool within Geneious v7.0.4, 37 based on translated nucleotides. Further refinements were made manually, using translated sequences as a way to guide manual curation. Manual curation of the multiple sequence alignment was performed using a codon-preserving approach and taking into account domains and motifs previously described in the literature. 25 Unalignable regions were removed prior to further analysis. The final multiple sequence alignment included 549 nucleotides. The alignment statistics obtained from HMMSTAT, from HMMER3 package, 38 were eff_nseq = 2.72, M = 531, relent = 0.45, info = 0.45, p relE = 0.31, and compKL = 0.02. jModelTest 2.1.1 was used to estimate the best-fit evolutionary model of nucleotide evolution. 39 A protein multiple sequence alignment was also performed with the same sequences and used in downstream phylogenetic analysis.
List of species and sequence identifiers used in this study.
Genes retrieved from NCBI (National Center for Biotechnology Information) (genes with * were retrieved from Phytozome) (long).
Phylogenetic analysis
Maximum likelihood analysis was performed using PhyML 3.0 (http://www.atgc-montpellier.fr/phyml/)40,41 with the TN93 model, 42 a gamma distribution parameter of 1.107. Bootstrap support was calculated based on 100 iterations. The most likely tree was computed based on the PhyML-estimated parameters: transition/transversion ratio for purines of 2.541, transition/transversion ratio for pyrimidines of 4.342, and nucleotides frequencies of f(A) = 0.33406, f(C) = 0.20359, f(G) = 0.24537, and f(T) = 0.21698. An ML tree of the protein sequence multiple sequence alignment was also performed on PhyML 3.0 using the LG model of amino acid substitution.
Obtaining identity matrix
A pairwise distance matrix, based on a nucleotide multiple sequence alignment of the 93 sequences, was calculated using MEGA7. Even though the length of the final alignment obtained was 543 positions, removal of gaps and missing data was performed to calculate the distance matrix, resulting in a final set of 372 informative positions in the final filtered data set. 43 The number of base substitutions per site between sequences was calculated using the maximum composite likelihood model. 44 To obtain the identity value of the sequence pairs, we subtracted 1 from the distance value of every term of the distance matrix to finally obtain the identity matrix.
Network analysis
Once the gene identity matrix was generated, a set of 101 networks were created based on the identity threshold between sequence pairs (1 network for each threshold, 0% through 100%), which is represented by the parameter σ. In each network, each nucleotide sequence is represented by a single node. Two nodes (say i and j) are considered connected if the identity threshold is greater than σ. The networks were represented in the format of an adjacency matrix M(σ), where the matrix elements Mij (pairs of sequences) were either 1, if they were connected, or 0, if they were not connected.
45
Then, neighborhood matrices
Summary of symbols.
Gephi was used to visualize and further interrogate the networks. 48 The modularity calculation from Gephi, based on Blondel et al 49 and resolution from Lambiotte et al, 50 was used to classify individual nodes into communities.
To summarize the network approach applied here, we describe the main steps performed:
Alignment of gene sequences;
Calculation of genetic distances and generation of identity matrix;
Calculation of network distances;
Identification of best σ;
Network generation and analysis under most informative value.
The proposed approach used here requires less than 10 seconds to run on an Acer Intel Core i7-6700 CPU @ 3.40GHz for all data sets tested to date (<100 sequences). The scripts used here can be found on GitHub (https://github.com/deCarvalho90/network_analysis) and the software with a graphical interface is available in the work by Goés-Neto et al. 51
Results
Phylogenetic analysis
The maximum likelihood phylogeny of AGAMOUS genes presented in Figure 1 is consistent with the topology previously published studies of the AGAMOUS gene family.25,26,31 The most likely nucleotide tree had a log likelihood score of −20654.546986. The ML protein tree had poor support for main clades and therefore was not used in subsequent analysis (data not shown).

Phylogenetic tree of the AGAMOUS family genes. Main functional groups are highlighted in black boxes along the tree.
Gymnosperm AGAMOUS genes (here termed C/D homologues) form a paraphyly at the base of the unrooted tree. An initial duplication event separates C and D lineage angiosperm genes and likely occurred in the common ancestor of angiosperms. Basal angiosperm C lineage homologues, although clustering with D lineage genes, exhibit expression patterns, and likely function, similar to that of core eudicot C lineage genes. D lineage genes form a monophyletic clade that includes all other angiosperm species included in this study.
Monocot D lineage genes appear as a paraphyly at the base of the D lineage clade; however, the relationships among D lineage genes otherwise are largely consistent with known species relationships. The relationships of C lineage genes are more convoluted. The base of this subtree is a polyphyly including monocot, basal eudicot, and core eudicot genes. At the base of the core eudicots, a second duplication event resulted in the split of the AGAMOUS and PLENA/SHATTERPROOF (SHP) lineages. A third duplication, likely at the base of the Brassicales, resulted in 2 copies of SHP genes in this group (SHP1 and SHP2; Figure 1).
Basal angiosperm C lineage genes form a group that diverges after the gymnosperm C/D lineage, but before the angiosperm C/D lineage split. The artificial polyphyletic group of the paleoAGAMOUS includes monocot and basal eudicot sequences. While the basal eudicot genes group with other core eudicot AGAMOUS genes, monocot paleoAGAMOUS genes share a most recent common ancestor with D lineage genes. It is important to notice, however, that the low branch support in many areas of the AGAMOUS gene tree poses challenges to the interpretation of the evolutionary relationship between clades.
Network analysis
The network distance graph showed its highest peak at 75% identity, which means that the network generated at that peak is the most distant from the others (Figure 2A). Also, it means that the network presents a clear community structure with relevant evolutionary information. Despite the fact that the network with the biggest distance was obtained at 75% identity, the community structure was already too fragmented to answer questions about the evolution of the gene families analyzed in the phylogeny (Figure S1A). Even though the network obtained at 75% was too fragmented, the network still provided relevant information about the functional divergence of the genes. However, we wanted to see how the community structure would behave in a scenario closer to the phylogeny. To do so, we had to find the network where all sequences were connected in a way that it would still be possible to retrieve a community structure. A similar situation occurred in the work by Carvalho et al, 18 and the problem was solved by analyzing other networks in different peaks. Here, we attempted to solve this problem by analyzing the network at 51% to find the last network where all sequences were connected. However, it was not possible to see a clear community structure in this network due to the high degree of connectivity between nodes (Figure S1B). Finally, in this study, we focused mainly on the network obtained at the identity threshold 67, which meant that 2 sequences had to have an identity value of 67% or higher to be connected. The choice of the network threshold was based on the fact that all sequences in this study were connected, with exception of the out-group sequences, which reflected a scenario similar to the phylogeny.

(A) Network distance graph based on the δ(σ, σ + Δσ) distance. The values for the analyzed networks obtained at 51%, 67%, and 75% are marked. (B) Network obtained at 67% identity. Nodes are colored based on the community they belong to (C1-C5), as result of the modularity algorithm (see methods). The sequences that do not belong to any community are represented as gray nodes. Network obtained at 67% identity, colored based on (C) gene function and (D) species phylogenetic placement.
After applying the modularity calculation (see section “Methods”) in the 67% network, it was possible to see the emergence of the community structure of the network, containing 5 communities (C1-C5; Figure 2B). Each one of the communities mainly cluster genes that have similar functions. In C1, 3 out of the 5 nodes from gymnosperm C/D homologues are connected. Even though the 5 nodes are not connected, this result was expected due to the fact that they are part of the most distant out-group sequences as seen in Figure 1. In C2, however, the functions of the nodes are related to AG, paleoAG and basal angiosperm C homologues. This might suggest that the AGAMOUS genes have retained a function very similar to their basal angiosperm C homologues. In C3, the SHP genes are clustered together, but in a different community of the AG genes, also suggesting functional divergence. The genes clustered in C4 comprise the STK genes. Even though the communities were mostly composed of genes with similar functions, 3 genes exhibited unexpected placements. For instance, the SHP gene from Vitis vinifera (ViviSHP) clustered with other AG genes in C2, instead of with other SHP genes in C3. Similarly, Sorghum bicolor SbAG2, a STK gene, clustered in C5, instead of the expected C4, whereas Sorghum bicolor SbAG3, a paleoAG gene, clustered in C4, instead of the expected C5.
Finally, the genes clustered in C5 belong to the monocots paleoAG. This result might suggest that monocot paleoAG genes are evolving under different evolutionary forces than the paleoAG and AG lineages. Finally, we can notice that the grouping obtained by both methods were consistent with one another by comparing Figures 1 and 2D. Also, the results obtained at the 67% threshold is largely congruent with the one obtained for the protein network generated (Figure S3), obtained at the 60% threshold (highest peak). However, the protein network showed lower resolution because it clustered together AG, eudicot paleoAG, and SHP genes, whereas we see a clear separation of SHP from the other genes.
Even though the 75% network showed a fragmented community structure for this study, we can notice that it shows that the STK sequences from maize, sorghum, rice, and brachypodium are in a separate community. This information might suggest that STK genes from grasses might be undergoing a functional divergence compared with the remaining STK genes; however, limitations in gene functional annotation do not enable us to further support this inference.
Both the phylogenetic and network-based analyses returned largely consistent sets of gene clusters. However, the grouping of monocots paleoAG sequences in a separate cluster (C5) than other C homologues from basal angiosperm, basal eudicot, and eudicot sequences (jointly clustered in C2) in the network-based analysis suggest 2 testable hypothesis: (1) monocot sequences are undergoing different and independent evolutionary processes when compared with other non-monocot AG homologues and (2) non-monocot AG sequences are clustered with euAG genes due to conservation of function.
Discussion
The use of phylogenetic methods to study gene family evolution has provided vast increases in the understanding of molecular evolution, and the utility of these methods for reconstructing ancestral relationships remains unparalleled. However, in many cases, complex evolutionary processes including neofunctionalization, repeated co-option into new biological roles, as has occurred in independent origins of C4 photosynthesis, 52 high birth/death gene families, and reciprocal gene loss following gene or genome duplication, reconstructing phylogenetic relationships may not be the most effective method for identifying genes with equivalent functional roles. Among the contributions of a network approach to gene family studies is the interpretation of the relationships among gene sequences that are not limited to a bifurcating pattern, which is often the case in a phylogenetic framework. A network approach allows for the emergence of patterns that are not seen otherwise. Here, we propose the use of network-based approach which has complementary sets of strengths and weaknesses to conventional phylogenetic methods and tested the contributions of these methods using data from the well-characterized AGAMOUS family of floral transcription factors.
For instance, in the phylogenetic tree, the non-monocot euAG and paleoAG genes are not clustered with the basal C homologues. Rather, in the networks we notice that these genes are clustered together suggesting a higher functional conservation between them, which is not seen in the tree. Also, from the tree alone we cannot infer whether either euAG or SHP genes neofunctionalized. However, because all the euAG and non-monocot paleoAG are clustered together with the ancestral C homologues, and apart from the SHP genes, we can infer that the SHP genes neofunctionalized, whereas the euAG and non-monocot paleoAG retained an ancestral function. We believe that a combined approach might help with discerning functional and structural evolution in a way that neither methods can provide on its own.
In agreement with the literature,26,53 the network-based analysis recovered clusters of paeloAG and AG genes from basal angiosperms, basal eudicots, and core eudicots, potentially indicating conserved functional roles for the genes included in these clusters despite sequence divergence. In contrast, the position of the basal angiosperms’ C lineage in the phylogenetic tree leads to uncertain interpretations of conserved or divergent function with respect to the D lineage. The network-based approach also separated the STK and paleoAG genes within the monocot lineage, despite the close phylogenetic relatedness of these 2 gene clades, consistent with reports of distinct functional roles for these 2 sets of genes in monocots.54,55 For instance, paleoAG gene from maize has undergone a duplication event in the common ancestor of maize, wheat, and rice 25 which leads to subfunctionalization of these genes that perform functions still related to, but different from Arabidopsis AG. 56 A similar process also occurred in rice. 57 These differences may be the reason the monocot paleoAG clustered together in the network, but in a different community than the remaining AG gene sequences. Moreover, genetic networks of the inflorescence meristems can vary a lot between grasses and eudicots because several changes in these regulatory networks are either only present in grasses or perform a different function in eudicots. 58
However, network-based approaches to studying gene families bring with them their own set of limitations. Some of these are inherent to the particular methodology used here, whereas others are a result of the relative immaturity of statistical and software tools for applying these methods to the analysis of gene family evolution. For example, a range of statistical methods are widely available for estimating the level of support for individual branches/clades within a given phylogeny, such as jackknife, bootstrap, and posterior probabilities.59,60 In contrast, methods for calculation of cluster support in a biological context are far less mature, at least for the implementation employed here. The use of sequence identity as a measure of distance, while computationally tractable, also means discarding a great deal of information on the frequency of different types of substitutions at both the nucleotide and amino acid levels which can be incorporated into many modern phylogenetic algorithms. 61
Figure 3 summarizes the contributions and relative strengths and weaknesses of phylogenetic and network-based approaches to the study of gene family evolution. We propose that the combination of both methods can provide more assessment of both functional and historical relationships between sequences than either approach alone.

Schematic diagram of results based on phylogenetic (left) and network (right) analyses. Potential contributions of each approach, as well as benefits steaming from the combination of both methods are described below the diagrams.
Conclusions
Investigating the contributions of a particular network-based approach to the study of the evolution of a well-known family of transcription factor genes involved in floral development supports the idea that network-based approaches, when used in conjunction with phylogenetic methods, can be used to improve our understanding of functional conservation or divergence within gene family evolution. The network-based analysis of gene families used here currently lacks the robust ecosystem of computational tools and statistical approaches developed for phylogenetic analysis; however, it can already provide an independent assessment of relationship structures which can aid in the interpretation of phylogenetic data, especially in areas of the tree exhibiting low branch support. In particular, network analysis can be used to generate testable hypotheses regarding the conservation or divergence of gene function in cases of potential subfunctionalization or neofunctionalization. In combination, we believe that these methods provide a robust framework that expands the power of gene family evolution studies.
Supplemental Material
Supplementary_Material – Supplemental material for Integrating Phylogenetic and Network Approaches to Study Gene Family Evolution: The Case of the AGAMOUS Family of Floral Genes
Supplemental material, Supplementary_Material for Integrating Phylogenetic and Network Approaches to Study Gene Family Evolution: The Case of the AGAMOUS Family of Floral Genes by Daniel S Carvalho, James C Schnable and Ana Maria R Almeida in Evolutionary Bioinformatics
Footnotes
Funding:
The author(s) disclosed receipt of the following financial support for the research, authorship, and/or publication of this article: This work was supported by CAPES (BJT 069/2013) and Instituto Nacional de Ciência e Tecnologia em Estudos Interdisciplinares e Transdisciplinares em Ecologia e Evolução (INCT-INTREE, 465767/2014-1) to A.M.R.A., a CNPq Science Without Borders scholarship (214038/2014-9) to D.S.C., and a Robert B. Daugherty Water for Food Institute research support award to J.C.S.
Declaration of Conflicting Interests:
The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or publication of this article.
Author Contributions
DSC, JCS, and AMRA designed the analysis, analyzed the results and wrote the paper; DSC and AMRA conducted the analyses.
References
Supplementary Material
Please find the following supplemental material available below.
For Open Access articles published under a Creative Commons License, all supplemental material carries the same license as the article it is associated with.
For non-Open Access articles published, all supplemental material carries a non-exclusive license, and permission requests for re-use of supplemental material or any part of supplemental material shall be sent directly to the copyright owner as specified in the copyright notice associated with the article.
