© 2017 American Society of Plant Biologists. All Rights Reserved.
Abstract
Mainly due to their economic importance, genomes of 10 legumes, including soybean (Glycine max), wild peanut (Arachis duranensis and Arachis ipaensis), and barrel medic (Medicago truncatula), have been sequenced. However, a family-level comparative genomics analysis has been unavailable. With grape (Vitis vinifera) and selected legume genomes as outgroups, we managed to perform a hierarchical and event-related alignment of these genomes and deconvoluted layers of homologous regions produced by ancestral polyploidizations or speciations. Consequently, we illustrated genomic fractionation characterized by widespread gene losses after the polyploidizations. Notably, high similarity in gene retention between recently duplicated chromosomes in soybean supported the likely autopolyploidy nature of its tetraploid ancestor. Moreover, although most gene losses were nearly random, largely but not fully described by geometric distribution, we showed that polyploidization contributed divergently to the copy number variation of important gene families. Besides, we showed significantly divergent evolutionary levels among legumes and, by performing synonymous nucleotide substitutions at synonymous sites correction, redated major evolutionary events during their expansion. This effort laid a solid foundation for further genomics exploration in the legume research community and beyond. We describe only a tiny fraction of legume comparative genomics analysis that we performed; more information was stored in the newly constructed Legume Comparative Genomics Research Platform (www.legumegrp.org).
The Fabaceae, Leguminosae, or Papilionaceae, commonly known as the legume, pea, or bean family, is a large and economically important monophyletic family of flowering plants. It includes trees, shrubs, and perennial or annual herbaceous plants, which are easily recognized by their fruit (legume) and their compound, stipulate leaves (Goebel, 1969). As the third-largest land plant family, legumes are widely distributed and divided into 650 genera and over 18,860 species, accounting for about 7% of flowering plant species (Magallón and Sanderson, 2001). Along with cereals, fruits, and tropical roots, a number of legumes have been staple human foods, and their use is closely related to human evolution (Zhu et al., 2005). Furthermore, legumes are an important part of natural ecosystems, as they fix atmospheric nitrogen through intimate symbioses with microorganisms (Doyle, 2011).
Mainly due to their economic importance, whole-genome sequences for a number of legumes have been deciphered, including soybean (Glycine max; Schmutz et al., 2010), chickpea (Cicer arietinum; Varshney et al., 2013), barrel medic (Medicago truncatula; Young et al., 2011; Tang et al., 2014), lotus (Lotus japonicus; Sato et al., 2008), mung bean (Vigna radiata; Kang et al., 2014), adzuki bean (Vigna angularis; Kang et al., 2015), pigeon pea (Cajanus cajan; Varshney et al., 2011), common bean (Phaseolus vulgaris; Schmutz et al., 2014), and two wild peanuts (Arachis duranensis and Arachis ipaensis; Bertioli et al., 2016; Chen et al., 2016). These legume genomes have sizes ranging from ∼400 Mb (barrel medic) to 1,150 Mb (soybean), packaged into six to 20 chromosomes.
Most if not all legumes, having originated from a common ancestor about 60 million years ago (Mya), shared a tetraploid ancestor (named legume-common tetraploid [LCT]) of similar age (Schmutz et al., 2010) that played a major role in shaping legume genome organization (Young et al., 2011). Before the LCT, legumes shared an ancient core eudicot-common hexaploid (ECH) ancestor (often named gamma), which was revealed first with the Arabidopsis genome sequence (Bowers et al., 2003) and then described in detail based on the grape (Vitis vinifera) genome (Jaillon et al., 2007; Jiao et al., 2012), often taken as a valuable reference with which to explore the genome structure of eudicots. More recent polyploidizations continued to occur in some legume lineages, offering the opportunity for punctational change in the evolution of these plants (e.g. one occurring ∼13 Mya and specifically contributing to the formation of the extant soybean genome [Schmutz et al., 2010]; named soybean-specific tetraploid [SST]).
Polyploidization, as an abrupt evolutionary event, can occur overnight but exerts an enormous effect on the evolution of a plant and even triggers speciation and diversification processes (Paterson et al., 2004; Soltis et al., 2008; Jiao et al., 2011). Recently, polyploidization was suggested to explain the long-standing mystery of the rapid formation and diversification of land plants (Frohlich and Chase, 2007; Van de Peer, 2011). Polyploidization can have short-term and long-term effects, genetically or epigenetically, and/or at the single-gene or whole-genome scale. After a new polyploid forms, the genome can be very unstable, and in the first generations, it may lose much of its DNA content, as evidenced, for example, by the production of synthetic tetraploid wheat (Triticum aestivum; Kashkush et al., 2002). Evolutionary analysis also supports this inference. Comparative analysis of the cereal genomes, sharing a 100-Mya tetraploid ancestor, suggested that the majority of gene losses (97% or more) occurred before the divergence of sorghum (Sorghum bicolor; panicoids) and rice (Oryza sativa; oryzoids; Paterson et al., 2009). Nonetheless, thousands of polyploidy-derived duplicated genes can still be preserved in extant genomes. These duplicated genes may take different evolutionary avenues to share or divide ancestral gene functions or develop novel genetic functions (Feldman et al., 2012; Lin et al., 2014). As for gene expression, it has been proposed that at least 57% to 85% of paleopolyploid-produced duplicates have diverged in rice (Throude et al., 2009), and duplicates with high expression tend to have higher CG body methylation (Wang et al., 2013). This suggests that epigenetic changes may have contributed to the genomic preservation, maintenance, and restoration of genomic stability (Wang et al., 2013).
The availability of 10 hard-won legume genomes provides a precious opportunity to understand legume biology. Here, by developing approaches to perform hierarchical comparative genomics analysis, we produced multiple alignments of all 10 of these legume genomes. By tracking information about ancestral polyploidization, we deconvoluted the layer-by-layer homology between the legume genomes. This enabled us to evaluate evolutionary divergence among legumes, redate major evolutionary events, and reveal rules of massive gene losses and expression changes between duplicated genes. The hierarchical alignment yielded a homologous gene list relating to different evolutionary events such as recursive polyploidizations and plant divergences. These efforts provided a valuable genomic platform for researchers in the plant community to investigate evolutionary changes, functional innovations, and phylogenetic structures of gene families and regulatory pathways.
RESULTS
Gene Colinearity within and among Genomes
Intragenomic Homology
By inferring gene colinearity, we detected colinear genes within each legume genome, between each pair of them, and between them and grape, which was used as an outgroup reference. Homologous blocks with more than four, 10, 20, and 50 colinear genes were checked (Supplemental Tables S1 and S2).
The legume genomes were divergent in numbers of duplicated blocks and colinear genes residing in them. For blocks containing more than four colinear genes, we found the most duplicated genes in soybean (25,302 pairs) and the fewest in adzuki bean (1,956 pairs; Supplemental Table S1). The large difference in duplicated gene numbers among genomes might be related to the SST in soybean or to the incomplete assembly of the legume genomes. In soybean, 434, 224, and 87 blocks had more than 10, 20, and 50 colinear genes, which contain 20,365, 17,578, and 13,191 colinear genes, accounting for 44.9%, 38.8%, and 29.1% of total gene contents, respectively. The longest homologous region supported by gene colinearity was from soybean chromosomes Gm10 and Gm20, having 824 colinear genes in a 12.87-Mb region. The other genomes had much shorter duplicated blocks, often with fewer than 10 blocks having more than 50 colinear genes. For example, among hundreds of duplicated blocks in barrel medic and mung bean, each had only nine duplicated blocks with more than 50 colinear genes. Common bean has the most (12) duplicated blocks of more than 50 colinear genes.
Intergenomic Homology
Intergenomic homology among legumes often is better than intragenomic homology, consistent with speciations often being more recent than genome duplications. Between these legume genomes, there were often many thousands of colinear genes (Supplemental Table S1). Soybean had more colinear genes with other legumes than were found between any other legumes, due to the SST. For example, soybean and barrel medic genes form 50,672 colinear gene pairs located in 2,824 homologous blocks with more than four colinear genes, involving 21,103 (∼35.4%) and 34,822 (∼47.7%) genes from the two genomes, respectively. There were often tens of intergenomic blocks with more than 50 colinear genes. Two peanut genomes have 16,484 colinear genes in 50 blocks, with each containing at least 50 colinear genes. Detailed statistics of the numbers of inferred paralogous and orthologous genes, gene pairs, and blocks are given in Supplemental Tables S2 to S5.
Multiple Genome/Chromosome Alignment
Event-Related Genomic Homology
Intergenomic comparison helped to unravel the structural complexity of legume genomes, which had been the result of recursive polyploidization events successively doubling or tripling the numbers of existing homologous regions (Fig. 1). Analysis of the grape genome contributed to understanding the triplicated nature of the ancestral core eudicot genome, which appears to have transitioned from 2n = 2x = 14 to 2n = 6x = 42 chromosomes (Jaillon et al., 2007). Here, we used the grape genome to distinguish orthologous from outparalogous regions between legumes as well as paralogous regions within each legume. Homologous regions in different genomes are called outparalogous when they were produced by the genomic duplication in two species’ common ancestor, to distinguish from paralogous regions produced by duplication specific to one species. Homologous gene dot plots (Supplemental Figs S1–S4) depict genomic comparisons and provide for inferences of orthology and paralogy. Orthologous regions between grape and legumes have much better DNA similarity than between outparalogous regions, the latter being a result of the ECH. The details inferring orthology and paralogy can be found in “Materials and Methods” and Supplemental Text S1. Similar analyses have been described for grass genomes and the cotton (Gossypium hirsutum) genome (Paterson et al., 2012; Wang et al., 2015, 2016). In that an extra LCT is shared by all legumes, there would be an expected 1:2 ratio of orthologous regions between grape and most legumes, with the additional SST conferring a 1:4 ratio between grape and soybean. In partial summary, intergenomic analysis revealed layers of genomic homology in the complex legume genomes. Above, we used grape as the outgroup reference to deconvolute the genomic complexity of barrel medic and other legumes to find duplicated blocks in each of them and homology between them. 
In a similar manner, we adopted barrel medic and common bean as references to distinguish recent SST duplicated regions in soybean.
Species and gene phylogenetic trees. A, Phylogenetic tree of soybean (G), A. duranensis (A), A. ipaensis (B), barrel medic (M), common bean (P), lotus (L), chickpea (E), pigeon pea (C), adzuki bean (U), mung bean (R), and grape (V). The ECH is denoted by the blue hexagon, the LCT by the red square, and the SST by the yellow square. B, Gene phylogenetic tree. Three paralogous genes in the grape genome, V1, V2, and V3, produced by the ECH, each have two orthologs in nonsoybean legume genomes and four orthologs in soybean.
Multiple Alignment
With the grape genome as a reference, we produced a table to store intergenomic and intragenomic homology information. First, we filled in all grape gene identifiers in the first column of the table, then added gene identifiers from legumes column by column, species by species, according to the colinearity inferred by multiple alignments. As noted above, in the absence of gene loss, the grape genes would have two colinear orthologous genes in most legumes and four in soybean. When a legume species contained a gene showing colinearity with a grape gene, a gene identifier was filled into an appropriate cell in the table. When a legume species did not have an expected colinear gene, often due to gene loss or translocation or insufficient assembly, a dot (signifying missing) was filled into an appropriate cell. For 11 (sub)genomes (including two subgenomes for soybean), there are 23 (9 × 2 + 4 + 1) columns in the table. Moreover, due to the ECH, each chromosomal segment would repeat three times in each genome. Based on homology inferred in grape, therefore, we extended the table to 69 columns. Finally, we constructed a table of colinear genes reflecting three polyploidizations and all salient speciations. In partial summary, the table summarized the results of multiple-genome and event-related alignments, reflecting layers of tripled and/or doubled homology due to recursive polyploidizations (Fig. 2).
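The column arithmetic above (23 columns per grape paralog, 69 in total) can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the column-naming scheme and the helper `table_columns` are hypothetical, and only the single-letter species abbreviations are taken from Figure 1.

```python
# Expected number of orthologous copies per grape gene, in the absence of gene loss.
# Letters follow the species abbreviations in Figure 1; the dictionary is illustrative.
ORTHOLOG_COPIES = {
    "G": 4,  # soybean: doubled by the LCT and again by the SST
    "A": 2, "B": 2, "M": 2, "P": 2, "L": 2, "E": 2, "C": 2, "U": 2, "R": 2,
}

def table_columns(n_ech_copies=3):
    """Return hypothetical column labels for the event-related alignment table.

    Each of the ECH-derived grape paralogs (V1, V2, V3) is followed by the
    expected orthologous copies in every legume.
    """
    columns = []
    for ech in range(1, n_ech_copies + 1):
        columns.append(f"V{ech}")  # grape reference column
        for species, copies in ORTHOLOG_COPIES.items():
            for k in range(1, copies + 1):
                columns.append(f"{species}{ech}.{k}")
    return columns

print(len(table_columns(1)))  # columns for one grape paralog: 9 * 2 + 4 + 1
print(len(table_columns()))   # all three ECH-derived paralogs
```

A cell in such a table would hold either a gene identifier or a dot for a missing gene, as described above.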
Homologous alignments of legume genomes with grape as a reference. Genomic paralogy, orthology, and outparalogy information within and among 10 legumes, with same name abbreviations as in Figure 1, are displayed in 69 circles, each corresponding to an extant gene in Figure 1B. The curved lines within the inner circle are formed by 19 grape chromosomes color coded to correspond to the seven ancestral chromosomes before the ECH. The short lines forming the innermost grape chromosome circles represent predicted genes, which have two sets of paralogous regions, forming another two circles. Each of the three sets of grape paralogous chromosomal regions has two orthologous copies in a legume, with the exception of soybean, which has four. The resulting 69 circles are marked according to species by a capital letter, as defined in Figure 1. Each circle has an underline colored to indicate its source plant corresponding to the color scheme in Figure 1A, and each circle is formed by short vertical lines that denote homologous genes, colored to indicate chromosome number in their respective source plant as shown in the color scheme at bottom.
The genomic alignment table for 10 legumes with grape as a reference is not complete; in particular, it cannot include all duplicated genes produced by the SST. That is, genes specific to legumes and absent from the grape genome are not represented. Therefore, the grape-legume homology table was supplemented by a genomic homology table with barrel medic as a reference (Supplemental Fig. S5) to better represent pan-legume gene content.
Event-Related Duplicated Genes
The cross-legume genome analyses described above helped to identify duplicated genes produced by each polyploidization event and to infer gene content in the ancestral genomes before each polyploidization and speciation event. In grape, we inferred 1,764 pairs of genes in 86 homologous regions derived from the ECH, involving 2,893 extant genes (Table I). Being affected by more polyploidizations, legume genomes contain more duplicates. In barrel medic, 2,504 gene pairs involving 2,961 genes were inferred in 194 ECH-derived homologous regions. However, fewer ECH-derived duplicates were inferred in some legumes. For example, only 300 to 1,400 ECH gene pairs were inferred for pigeon pea, adzuki bean, and lotus. The most ECH-derived gene pairs were inferred from soybean, with 3,663 gene pairs involving 2,575 genes from 344 homologous regions. The high numbers of soybean ECH genes result partly from the additional SST, which would have produced up to 5 times [(6,2)/(3,2)] the number of various combinations of homologous gene pairs found in other legumes. Here, (m,n) defines the combinatorial number.
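The fivefold figure follows directly from the combinatorial numbers quoted above. As a quick check, reading (m,n) as the binomial coefficient "m choose n" (the function name below is ours, for illustration only):

```python
import math

def pair_combinations(n_copies):
    """Number of distinct gene pairs that can be formed among n_copies homologous copies."""
    return math.comb(n_copies, 2)

# (6,2)/(3,2): 15 possible pairs versus 3 possible pairs
ratio = pair_combinations(6) / pair_combinations(3)
print(pair_combinations(6), pair_combinations(3), ratio)  # 15 3 5.0
```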
We also characterized LCT-derived gene pairs, which showed 10-fold variation among legumes. In barrel medic, 4,796 gene pairs involving 4,198 genes were inferred from 309 LCT-derived homologous regions. In soybean, 8,317 gene pairs involving 9,486 genes were derived from 343 LCT-derived homologous regions. Pigeon pea has the fewest LCT-derived gene pairs (869). The reduced abundance of inferred LCT-derived gene pairs may have resulted from poor assembly. SST produced 17,104 gene pairs involving 19,210 genes that were derived from 133 homologous regions.
Genomic Fractionation
Genomic fractionation reshapes plant genomes. Key forces driving genomic fractionation include polyploidization, multiplying gene content of an entire genome, and transposon activities, duplicating and relocating individual genes (Wang et al., 2011). Here, using grape, barrel medic, and common bean as references, we show how gene removal eroded colinearity between homologous genomic regions.
Using the grape genome as a reference, it is clear that widespread genomic fractionation followed the LCT (Supplemental Table S6). For example, taking grape chromosome 1 as the outgroup, pairwise alignment of grape with each barrel medic duplicate showed that 75% and 77% of grape genes were not found at the respective colinear locations; in the three-way alignment of both barrel medic duplicated regions with the outgroup, 70% of the grape genes were absent from both colinear locations. For common bean, the corresponding numbers are 94%, 89%, and 83%, respectively. Using barrel medic chromosome 1 as a reference, 74%, 73%, and 69% of its genes were not found at the respective colinear locations in each or both of the duplicated regions produced by the SST. A local alignment of colinear blocks among genomes shows the pattern of genomic fractionation (Fig. 3). Some genes missing from the homologous locations may be related to deletions of adjacent transposons or to movements of transposons disrupting gene order, and some may be related to poor assemblies or annotations, as discussed further below.
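The pairwise and three-way loss rates quoted above can be computed from the alignment table in a few lines. The sketch below assumes a simplified input: one row per reference gene, holding the two colinear copies in the studied genome, with a dot marking a missing gene as in the table described earlier. The function name and the toy identifiers are hypothetical.

```python
MISSING = "."  # placeholder used in the alignment table for an absent colinear gene

def loss_rates(rows):
    """Return (rate lost in copy 1, rate lost in copy 2, rate lost in both).

    rows: list of (copy1_id, copy2_id) tuples, one per reference gene.
    """
    n = len(rows)
    if n == 0:
        raise ValueError("no reference genes supplied")
    lost1 = sum(1 for a, _ in rows if a == MISSING)
    lost2 = sum(1 for _, b in rows if b == MISSING)
    lost_both = sum(1 for a, b in rows if a == MISSING and b == MISSING)
    return lost1 / n, lost2 / n, lost_both / n

# Toy example with made-up identifiers: four reference genes
rows = [("geneA", "."), (".", "."), (".", "geneB"), ("geneC", "geneD")]
print(loss_rates(rows))  # (0.5, 0.5, 0.25)
```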
Local alignment in selected genomes: grape, barrel medic, and soybean. The graph shows details of a short segment of alignment marked by the triangle in Figure 2. Homologous block phylogeny is shown at left: three paralogous chromosome segments in the grape genome, Grape-14, Grape-05, and Grape-07, from ancestral chromosomes affected by ECH, each with two orthologous barrel medic and four soybean chromosome segments. Chromosome numbers are shown after the names of plants, and locations on chromosomes also are shown. Genes are shown by rectangles with small arrows indicating their transcriptional direction. Homologous genes between neighboring chromosomal regions are linked with lines.
To investigate the scale and potential mechanisms of fractionation, we counted the numbers of runs of removed genes in each legume genome relative to a reference genome, that is, the numbers of consecutive genes from the reference not appearing in the studied genome. Many missing genes constituted small runs (i.e. of only one or two genes; Fig. 4). For example, these small runs make up 53% of missing genes and up to 71% of all 10,604 runs in common bean; 15% of genes and up to 49% of all 13,936 runs in barrel medic; and 15.2% of genes and up to 44.7% of all 7,984 runs in the referenced grape genome. From another perspective, 77.6%, 56.5%, and 47% of genes were removed from their anticipated locations in runs of 10 genes or fewer that account for up to 48.4%, 89.5%, and 85.6% of all runs for each of the reference grape, barrel medic, and common bean genomes, respectively. The references work as temporal outgroups, with common bean, barrel medic, and grape being successively more diverged from soybean (Supplemental Tables S7 and S8). Missing genes were more likely to appear in small runs using common bean as a reference than barrel medic or grape. This suggests an accumulating effect with initial gene loss resulting in small runs that are gradually extended over time.
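Counting runs of removed genes amounts to scanning the reference gene order and measuring each stretch of consecutive genes without a colinear counterpart. A minimal sketch, assuming the presence/absence calls are already available as a list of booleans in reference gene order (function names and the toy input are ours):

```python
from itertools import groupby

def missing_runs(presence):
    """Lengths of runs of consecutive reference genes absent from the studied genome.

    presence: list of bool in reference gene order; True means a colinear gene was found.
    """
    return [sum(1 for _ in group) for present, group in groupby(presence) if not present]

def small_run_summary(presence, max_len=2):
    """Share of runs, and share of missing genes, that fall in runs of at most max_len genes."""
    runs = missing_runs(presence)
    if not runs:
        return {"runs": 0, "missing_genes": 0, "run_share": 0.0, "gene_share": 0.0}
    small = [r for r in runs if r <= max_len]
    return {
        "runs": len(runs),
        "missing_genes": sum(runs),
        "run_share": len(small) / len(runs),
        "gene_share": sum(small) / sum(runs),
    }

toy = [True, False, True, False, False, False, True, False, False]
print(missing_runs(toy))  # [1, 3, 2]
print(small_run_summary(toy))
```

The two shares correspond to the two kinds of percentages reported above: the fraction of all runs that are small, and the fraction of all missing genes that lie in small runs.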
Fitting a geometric distribution and gene loss rates: soybean to the grape (A), barrel medic (B), and common bean (C) genomes. The x axis indicates numbers of continuously missing genes in gene colinearity regions.
The lengths and numbers of runs of removed genes closely approximated a geometric distribution. We fitted the observed distribution of numbers of different runs using different density curves of the geometric distribution, with extension parameters of 0.33, 0.31, and 0.3, respectively, for common bean, barrel medic, and grape as references, finding goodness of fit of 0.995, 0.991, and 0.994 with P values of 0.92, 0.91, and 0.89 (F test), respectively (Supplemental Table S9). The closer the reference plant is to soybean, the shorter the runs of lost genes (Fig. 4), indicating a better gene-sharing pattern. The deviation between the observed and the theoretically predicted numbers becomes larger as the gene loss runs become longer, which also supports the extension of removed-gene runs over time.
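To make the fitting step concrete, the sketch below fits a geometric distribution to a list of run lengths and reports an R²-style goodness of fit. It is an illustration under our own assumptions: we use the parameterization P(L = k) = (1 − p)^(k−1) p with a maximum likelihood estimate of p, which is not necessarily the "extension parameter" or the fitting procedure used for Supplemental Table S9, and the toy run lengths are invented.

```python
from collections import Counter

def fit_geometric(run_lengths):
    """Maximum likelihood estimate of p for P(L = k) = (1 - p)**(k - 1) * p, k = 1, 2, ..."""
    return len(run_lengths) / sum(run_lengths)

def expected_counts(p, n_runs, max_len):
    """Expected number of runs of each length 1..max_len under the fitted model."""
    return [n_runs * (1 - p) ** (k - 1) * p for k in range(1, max_len + 1)]

def r_squared(observed, expected):
    """Coefficient of determination between observed and expected run counts."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - e) ** 2 for o, e in zip(observed, expected))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1 - ss_res / ss_tot

# Invented run lengths: 8 runs of one gene, 4 of two, 2 of three, 1 of four
runs = [1] * 8 + [2] * 4 + [3] * 2 + [4]
p_hat = fit_geometric(runs)
counts = Counter(runs)
observed = [counts.get(k, 0) for k in range(1, max(runs) + 1)]
expected = expected_counts(p_hat, len(runs), max(runs))
print(round(p_hat, 3), round(r_squared(observed, expected), 3))
```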
Correspondingly Balanced Fractionation between the SST Homologous Chromosomes
Aligning duplicated soybean regions onto corresponding single barrel medic chromosomes permitted us to reconstruct (infer) the gene composition of ancestral duplicated SST paralogous chromosomes, which often show significant divergence of gene retention rates. Among eight barrel medic chromosomes, seven have significantly divergent paralogous soybean chromosomal regions at a χ2 test significance level of 0.05 (or six at 0.01; Supplemental Tables S7 and S10). This finding shows unbalanced gene retention between homologous chromosomes. However, scrutiny of gene retention/loss using a sliding window along chromosomes showed that, in nearly all local regions, with the exception of large patches of DNA losses in one copy of the duplicated chromosomes, genomic retention and loss are often highly similar (Fig. 5). The difference in gene retention between corresponding paralogous regions varies around zero. The difference observed above at the chromosome level is likely to have been caused by large patches of alternative segmental DNA losses due to genomic instability (Fig. 5). In general, this finding suggests little if any dominance between members of homologous chromosome pairs, providing further evidence of the likely autotetraploid nature of the SST (Garsmeur et al., 2014).
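The two views of retention described above, a chromosome-level test and a sliding-window comparison, can be sketched as follows. This is a simplified illustration: we assume a 2 × 2 χ2 test (retained versus lost, for each of the two soybean copies) with 1 degree of freedom, and arbitrary window and step sizes; the exact test design and window settings used for Supplemental Table S10 and Figure 5 are not specified here, and all function names are ours.

```python
import math

def chi2_retention(retained_a, retained_b, n_reference):
    """Chi-square test (1 df) comparing retained gene counts in two paralogous regions.

    Both regions are aligned to the same n_reference reference genes.
    Returns (chi-square statistic, p-value).
    """
    observed = [
        [retained_a, n_reference - retained_a],
        [retained_b, n_reference - retained_b],
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [observed[0][j] + observed[1][j] for j in range(2)]
    grand = sum(row_totals)
    if 0 in row_totals or 0 in col_totals:
        raise ValueError("chi-square is undefined when a margin is zero")
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / grand
            chi2 += (observed[i][j] - expected) ** 2 / expected
    # Survival function of the chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value

def window_retention_difference(present_a, present_b, window=100, step=50):
    """Difference in retention rate (copy A minus copy B) in sliding windows.

    present_a, present_b: lists of 0/1 in reference gene order (1 = gene retained).
    """
    diffs = []
    for start in range(0, len(present_a) - window + 1, step):
        rate_a = sum(present_a[start:start + window]) / window
        rate_b = sum(present_b[start:start + window]) / window
        diffs.append(rate_a - rate_b)
    return diffs

print(chi2_retention(60, 40, 100))
print(window_retention_difference([1, 1, 1, 1], [1, 0, 1, 0], window=2, step=2))
```

A chromosome-level difference can be significant even when most window differences hover near zero, if one copy has lost a few large segments; this is the pattern the text describes.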
Homologous alignments and soybean gene retention along corresponding orthologous barrel medic chromosomes. Genomic paralogy and orthology information within and among genomes is displayed in three circles. The short lines forming the innermost barrel medic chromosome circles represent predicted genes. Each of the barrel medic paralogous chromosomal regions has two orthologous copies in soybean. Each circle is formed by short vertical lines that denote homologous genes, colored to indicate chromosome number in their respective source plant as shown in the color scheme at bottom. A, Rates of retained genes in sliding windows of soybean homologous region group 1 (red) and homologous region group 2 (black). B, The differences between two groups (blue) are displayed.
Karyotype Changes and Intergenomic Representation
After recursive polyploidizations, plants often restore chromosome numbers to relatively small values. Grape and legumes share a eudicot common ancestor inferred to have had 2n = 6x = 42 chromosomes, resulting from triplication of a basal set (x) of seven chromosomes (2n = 2x = 14) by the ECH. Using gene colinearity information, the 19 grape chromosomes or chromosomal regions were grouped into seven sets of paralogous triplets, which were mapped onto the chromosomes of legumes (Fig. 1).
After the LCT, the nonsoybean legumes under consideration have six to 11 haploid chromosomes, suggesting considerable chromosome number reduction. The legume-common ancestor may have had 11 chromosomes, still found in common bean and its indigoferoid/millettioid relatives, while the dalbergioid (peanut) and hologalegina (chickpea and barrel medic) legumes may have experienced chromosome number reductions. Soybean tetraploidization might have produced 22 chromosomes, with a chromosome fusion resulting in 20 extant chromosomes. Within the indigoferoid clade, three legumes have the same chromosome number (n = 11), but their chromosomes differ in composition (Fig. 6). At least six common bean chromosomes were largely preserved in other legumes (Fig. 6).
Chromosome representation using the seven eudicot ancestral chromosomes and those of common bean. Each chromosome from grape and legume genomes is first represented by genes colinear to grape. Genes are denoted by short lines in seven different colors related to ancestral chromosomes before the ECH. Second, with the exception of common bean, chromosomes from the other 10 legumes are represented by genes having common bean colinear genes, and these colinear genes in each plant are colored to indicate common bean chromosomes where their orthologs reside. Thus, a chromosome in the legume genomes is displayed in two sets of short lines arranged side by side.
Evolutionary Divergence and Dating
We found that legume genes evolve at considerably divergent rates in different genomes. By estimating synonymous nucleotide substitutions at synonymous sites (Ks), we characterized divergence levels between colinear homologs in different legumes or within a legume. Recursive polyploidization events can be identified based on Ks distributions for duplicated genes as Ks peaks that deviate from a general decline in frequency with increasing Ks value. For example, the soybean duplicates form a distribution with three peaks reflecting three polyploidizations (SST, LCT, and ECH) over time, although the peaks resulting from more ancient events can be difficult to discern. Ks distributions of interlegume colinear homologs reflect both polyploidization events common to them and speciation events that differentiate them. The peak corresponding to their differentiation is often more prominent than the polyploidization-derived ones due to widespread gene losses following polyploidizations. We adopted kernel function analysis to distinguish different components in Ks distributions (for details, see “Materials and Methods”), and each Ks distribution was represented by a linear combination of multiple normal distributions, each corresponding to an ancestral event (polyploidization or speciation; Supplemental Table S11).
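The representation of a Ks distribution as a linear combination of normal components can be illustrated with a short sketch. It only evaluates a given mixture and locates its peaks on a grid; it does not perform the kernel-based fitting described in "Materials and Methods". The component weights, means, and standard deviations below are invented for illustration and are not values from Supplemental Table S11.

```python
import math

def normal_pdf(x, mean, sd):
    """Density of a normal distribution at x."""
    return math.exp(-0.5 * ((x - mean) / sd) ** 2) / (sd * math.sqrt(2 * math.pi))

def mixture_density(x, components):
    """Linear combination of normal densities; components = [(weight, mean, sd), ...]."""
    return sum(w * normal_pdf(x, m, s) for w, m, s in components)

def find_peaks(components, lo=0.0, hi=3.0, n=3001):
    """Ks values on a regular grid where the mixture density has a local maximum."""
    xs = [lo + (hi - lo) * i / (n - 1) for i in range(n)]
    ys = [mixture_density(x, components) for x in xs]
    return [round(xs[i], 3) for i in range(1, n - 1) if ys[i - 1] < ys[i] > ys[i + 1]]

# Hypothetical three-event mixture: a young, an intermediate, and an old duplication
EXAMPLE_COMPONENTS = [(0.5, 0.15, 0.05), (0.3, 0.6, 0.1), (0.2, 1.5, 0.3)]
print(find_peaks(EXAMPLE_COMPONENTS))
```

In a real distribution the older components are broader and flatter, which is why their peaks are harder to discern, as noted above.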
Both the LCT and ECH produced Ks peaks with divergent locations in different legumes (Fig. 7; Supplemental Table S11), revealing divergent gene evolutionary rates. Lotus has evolved the slowest and peanut the fastest (with a nearly 25% difference). Relative to soybean, gene sequences of other legumes have evolved 17% to 24% faster (peanut, 23.9%; adzuki bean, 20.7%; mung bean, 19.4%; chickpea, 19.1%; barrel medic, 18.8%; and common bean, 17%) or 3.9% to 11.2% slower (pigeon pea, 3.9%; lotus, 11.2%; Supplemental Table S11).
Dating evolutionary events within and among the legume genomes: soybean (G), wild peanut (A and B), barrel medic (M), common bean (P), lotus (L), chickpea (E), pigeon pea (C), adzuki bean (U), mung bean (R), and grape (V). A, Distribution of average Ks levels between colinear gene pairs in intergenomic (solid curves) and intragenomic blocks (dashed curves). B, Distribution of average Ks levels after correction to account for the evolutionary rate of soybean genes. C, Correction to the Ks distribution and occurrence of key evolutionary events.
Such high divergence in evolutionary rates may jeopardize efforts to date evolutionary events and perform phylogenetic analysis, hindering our understanding of legume biology and evolution. Using soybean as a reference, we performed a correction to other legumes’ evolutionary rates, calibrating the LCT peaks in the other legumes’ Ks distribution to that in soybean (for details, see “Materials and Methods”; Fig. 7, B and C; Supplemental Table S12). Supposing that ECH occurred ∼130 Mya (Jiao et al., 2012), we estimated that LCT occurred ∼59 Mya, that peanut (from the dalbergioid tribe) split from the other legumes about 49.1 Mya, and that the hologalegina (including barrel medic, lotus, and chickpea) and millettioid (including soybean, pigeon pea, mung bean, adzuki bean, and common bean) tribes split 48.1 Mya.
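The logic of rate correction and dating can be sketched under a deliberately simple assumption: each species' Ks values are rescaled by the ratio of the soybean LCT peak to that species' LCT peak, and ages are then read off proportionally from a calibration point (ECH at ∼130 Mya). The actual procedure is the one given in "Materials and Methods" and may differ, for example in how intergenomic gene pairs are handled. All function names and Ks values below are hypothetical.

```python
def rate_correction(ks_lct_reference, ks_lct_species):
    """Scaling factor that aligns a species' LCT Ks peak with the reference (soybean) peak."""
    return ks_lct_reference / ks_lct_species

def corrected_ks(ks_value, correction):
    """Rescale a Ks value by a rate correction factor."""
    return ks_value * correction

def date_event(ks_event, ks_calibration, age_calibration_mya=130.0):
    """Age of an event, assuming Ks accumulates linearly with time after correction."""
    return age_calibration_mya * ks_event / ks_calibration

# Hypothetical peaks: a species whose LCT peak lies at Ks = 0.75 when soybean's lies at 0.60
factor = rate_correction(0.60, 0.75)
print(factor)                      # factor below 1: this species evolved faster than soybean
print(corrected_ks(1.0, factor))   # a raw Ks of 1.0 after correction
print(date_event(0.60, 1.50))      # age implied by a corrected peak at 0.60 if ECH lies at 1.50
```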
Inference of Ancestral Genome Content
Using information of event-related colinearity, we inferred gene content at the major evolutionary nodes of legumes (Fig. 8). Two colinear orthologs from different genomes show that the most recent common ancestor had a single ancestral gene at the corresponding location in its genome, whereas two colinear (out)paralogous genes produced by the same polyploidization would derive from an ancestral gene in the paleogenome before the event. Therefore, by referring to the event-related colinear gene table (Table I), it was quite easy to infer the ancestral gene content at any evolutionary node during the evolution and divergence of these legumes. For example, the most recent common ancestors had at least 22,177 genes for soybean and common bean, 18,935 genes for the two peanut genomes, and 28,900 genes for all legumes after the LCT. After the ECH, there were at least 11,672 genes in the eudicot common ancestor.
Inferred ancestral gene numbers during the evolution of legumes.
Gene Ontology Analysis
By counting genes still in colinearity, we explored how each polyploidization event contributed to copy number variations for genes with different functions. By characterizing Gene Ontology functions, it was clear that each event increased copy numbers for all functional genes but by divergent increments (Supplemental Fig. S6), and different events resulted in divergent contributions to the enhancement of functions. After the SST, genes related to macromolecular complexes, membrane function and organelle function (classified in view of cellular components), and metallochaperone, molecular regulator, and structural activities (classified in view of molecular functions) were significantly retained. The most significantly preserved genes were related to macromolecular complexes, accounting for up to 9.24% of the SST α-duplicates but only 6.15% of all genes in the genome (Fisher’s exact test, P = 6.75 × 10−35; Supplemental Table S13).
In contrast, genes that were least increased by the SST were related to catalytic activities (P = 1.04 × 10−70), and nearly all genes relating to biological processes were not increased, with the exception of those relating to localization.
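An enrichment test of this kind compares the share of a functional category among the duplicates with its share in the whole genome. The sketch below computes a one-sided hypergeometric tail probability, which corresponds to a one-sided Fisher's exact test when the duplicates are treated as a sample drawn from the genome. Whether the reported P values are one- or two-sided is not stated here, so this choice is an assumption, and the counts in the example are invented.

```python
import math

def log_comb(n, k):
    """Natural log of the binomial coefficient 'n choose k'."""
    return math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)

def enrichment_p_value(k, n, K, N):
    """P(X >= k) for a hypergeometric draw.

    N: genes in the genome; K: genes in the genome carrying the term;
    n: duplicated genes examined; k: duplicated genes carrying the term.
    """
    log_denominator = log_comb(N, n)
    total = 0.0
    for x in range(k, min(n, K) + 1):
        if n - x > N - K:
            continue  # impossible configuration: not enough genes without the term
        total += math.exp(log_comb(K, x) + log_comb(N - K, n - x) - log_denominator)
    return min(total, 1.0)

# Invented counts: 30 of 200 duplicates carry a term found in 600 of 10,000 genes
print(enrichment_p_value(30, 200, 600, 10000))
```

Working with log-gamma values keeps the computation stable for genome-scale counts, where the binomial coefficients themselves would be astronomically large.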
By checking the barrel medic genome, we evaluated what genes were likely to be removed from soybean after the SST. These genes are still in the barrel medic genome but have no corresponding copies at the expected locations in soybean, which could be a result of postpolyploidy instability (Supplemental Fig. S7). Genes in metabolic processes (P = 5.08 × 10−8), catalytic activity (P = 3.7 × 10−12), and molecular binding (P = 4.5 × 10−4) were frequently not deleted or transposed. Comparatively, genes related to biological regulation (P = 8.6 × 10−10), membrane part (P = 4.24 × 10−5), and nucleic acid-binding transcription factors (P = 2.6 × 10−6) were frequently deleted or transposed (Supplemental Table S14).
Nodulation and Oil Synthesis
A topic of singular importance to legume biology is whether recursive polyploidizations contributed to the evolution of key traits, such as nodulation associated with the symbiotic nitrogen fixation that is a distinguishing feature of legumes. Legumes have divergent numbers of nodulation-related genes (Supplemental Table S15). Using the reported soybean nodulation genes as seeds (Schmutz et al., 2010), we detected their homologs in all legumes at BLASTP E < 1e-10 and a score greater than 150 (Supplemental Table S15). Soybean has the most nodulation-related genes (1,702), comprising four families of 50 or fewer genes and three families of more than 200 genes (Supplemental Table S16). We wanted to know whether the recursive polyploidizations had contributed to their expansion. Because large gene families are excluded from inferences of colinearity (see “Materials and Methods”) and are therefore underrepresented in the colinear gene table, we plotted the distribution of nodulation-related genes across the whole genome, together with the colinear genes related to each polyploidization (Fig. 9). Notably, in soybean, we found that 78%, 74%, and 66% of nodulation-related genes could be located at paralogous chromosomal regions related to the three polyploidization events (SST, LCT, and ECH), respectively. Genes involved in younger polyploidizations also could be involved in older events if they have a paralogous copy produced by the latter. Nonetheless, these findings showed that polyploidizations may have contributed to the increase of nodulation-related gene copy numbers, with increases of 73 in the SST (Fig. 9), 284 in the ECH (based on barrel medic), and 852 related to the LCT. Similar findings have been observed in the other legumes.
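The homolog search thresholds given above (E < 1e-10 and a score greater than 150) can be applied to BLASTP results with a simple filter. The sketch assumes the results were saved in the standard 12-column tabular format, in which the e-value and bit score are the 11th and 12th fields, and that "score" refers to the bit score; neither detail is stated in the text. The gene identifiers in the example are made up.

```python
def filter_hits(lines, max_evalue=1e-10, min_score=150.0):
    """Keep BLASTP hits with e-value below max_evalue and bit score above min_score.

    lines: iterable of tab-separated rows in the 12-column tabular layout
    (query, subject, identity, length, mismatches, gap opens,
    query start, query end, subject start, subject end, e-value, bit score).
    """
    hits = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment headers
        fields = line.split("\t")
        query, subject = fields[0], fields[1]
        evalue, bitscore = float(fields[10]), float(fields[11])
        if evalue < max_evalue and bitscore > min_score:
            hits.append((query, subject, evalue, bitscore))
    return hits

example = [
    "seed1\tcandidateA\t85.0\t300\t45\t0\t1\t300\t1\t300\t1e-120\t420.0",
    "seed1\tcandidateB\t40.0\t120\t72\t3\t5\t124\t9\t128\t1e-12\t95.5",
    "seed2\tcandidateC\t35.0\t200\t130\t4\t1\t200\t1\t200\t2e-08\t160.0",
]
for hit in filter_hits(example):
    print(hit)
```

Only the first example row passes both thresholds: the second fails on score and the third on e-value.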
Figure 9. Nodulation gene amplification model related to gene duplication events in soybean. A, Curved lines within the inner circle, colored green, link paralog pairs on the 20 soybean chromosomes produced by SST. B and C, LCT (B) and ECH (C). Nodulation subfamily genes are displayed in colors as follows: light salmon (subfamily 1), green (subfamily 2), gray (subfamily 3), yellow (subfamily 4), black (subfamily 5), blue (subfamily 6), and red (subfamily 7). Colored curved lines link nodulation gene pairs with Ks < 0.15.
While new genes can be produced by tandem duplications and transposon activities, these events produced fewer genes than polyploidization. At Ks < 0.15, a time after or overlapping the SST event, we found more than 13 genes residing in duplicated regions from soybean chromosomes 4 and 6, 7 and 8, and 11 and 12 that were clearly produced by the SST. We also found young tandem gene clusters on chromosomes 16, 14, 9, and others and young transposed genes on many other chromosomes (Fig. 9). One tandem cluster on chromosome 16 contains more than 20 young duplicated genes, some with Ks ∼ 0 and four pairs with Ks < 0.015, involving six genes (Glyma16g07010.1, Glyma16g07051.1, Glyma16g07031.1, Glyma16g07060.1, Glyma16g30695.1, and Glyma16g30911.1), showing a hotspot of new gene production.
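As a rough illustration of how young duplicates might be sorted by chromosome, the sketch below keeps gene pairs with Ks < 0.15 and groups them by the chromosome pair encoded in their identifiers. It is not the procedure used in the study. It assumes identifiers of the form seen above ('Glyma' followed by a two-digit chromosome number and 'g'); the function names, gene IDs, and Ks values in the example are hypothetical. Same-chromosome groups are only candidates for tandem duplication, and deciding between SST-derived and transposed copies would still require the colinearity evidence described in the text.

```python
import re

KS_CUTOFF = 0.15  # threshold used in the text for young duplicates

def chromosome_of(gene_id):
    """Extract the chromosome number from an ID such as 'Glyma16g07010.1'.

    Assumed pattern: 'Glyma' + two digits + 'g'.
    """
    match = re.match(r"Glyma(\d{2})g", gene_id)
    if match is None:
        raise ValueError(f"unrecognized gene ID: {gene_id}")
    return int(match.group(1))

def group_young_pairs(pairs, ks_cutoff=KS_CUTOFF):
    """Group gene pairs with Ks below the cutoff by their chromosome pair."""
    groups = {}
    for gene_a, gene_b, ks in pairs:
        if ks >= ks_cutoff:
            continue  # too old to count as a young duplicate
        key = tuple(sorted((chromosome_of(gene_a), chromosome_of(gene_b))))
        groups.setdefault(key, []).append((gene_a, gene_b, ks))
    return groups

# Hypothetical gene IDs and Ks values, for illustration only.
pairs = [
    ("Glyma16g00001.1", "Glyma16g00002.1", 0.01),  # same chromosome
    ("Glyma08g00001.1", "Glyma07g00001.1", 0.12),  # different chromosomes
    ("Glyma04g00001.1", "Glyma06g00001.1", 0.40),  # above the cutoff
]
young = group_young_pairs(pairs)
```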
Then, we checked how polyploidizations affected the copy number variation of genes participating in the synthesis of high concentrations of seed oils that are an important economic product of many legumes. Oil synthesis-related (OSR) genes could be classified into nine different functions: synthesis of fatty acids in plastids, synthesis and storage of oil, metabolism of acyl lipids in mitochondria, lipid signaling, fatty acid elongation, wax and cutin metabolism, synthesis of membrane lipids in the endomembrane system, degradation of storage lipids and straight fatty acids, and miscellaneous functions, as reported previously (Wang and Brendel, 2006; Schmutz et al., 2010). Each of these families has more than 50 genes in soybean (Supplemental Table S17). There are more than 850 OSR genes in the peanut genomes and 1,528 in soybean (Supplemental Table S18). In peanut, 42% and 22% of OSR genes can be related to paralogous regions produced by the LCT and ECH events, respectively; in soybean, 65%, 58%, and 27% of OSR genes can be related to the SST, LCT, and ECH events, respectively (Supplemental Fig. S8). This shows that each of these polyploidizations may have expanded the OSR families, which also seems true in other legumes. As with nodulation genes, tandem duplications and transposon activities also might have contributed to expansion of the OSR families. At Ks < 0.15, more than 13 genes residing in duplicated regions of soybean chromosomes 4 and 6, 7 and 8, 11 and 12, and 14 and 17 were clearly produced by the SST. We also found young small tandem clusters of six or fewer genes on 14 soybean chromosomes and 11 young transposed genes between chromosomes (Supplemental Fig. S9). Interestingly, OSR genes and nodulation genes shared paralogous regions in the soybean genome. However, only four genes (Glyma03g42460, Glyma06g04940, Glyma19g23446, and Glyma19g45230) execute functions contributing to both traits.
Legume polyploidizations also may have contributed to the expansion of NBS-LRR resistance genes, which were further subjected to a birth-death process due to ectopic recombination (for details, see Supplemental Text S1).
DISCUSSION
Event-Related Alignment of Legume Genomes
Recursive polyploidizations make plant genomes very complex, conferring genomic structural changes and complicating deductions about the evolutionary trajectories of genes (Soltis et al., 2014, 2015). Here, we performed an event-related whole-genome-scale alignment of all 10 sequenced legume genomes, aided by the use of appropriate reference genomes, grape for all legumes and barrel medic for soybean. This effort is valuable in deconvoluting the layers of homologous regions packed together after recursive polyploidizations, producing a list of homologous genes, paralogs, and orthologs, and relating these homologs to each ancestral polyploidization event. The list tells how and when a pair of homologs were produced and diverged and whether there was gene deletion after certain events, providing valuable information to reveal the evolutionary and function-innovation trajectories of genes, gene families, regulatory pathways, and economically and agriculturally important traits. Nodulation and OSR genes provide examples to show how each polyploidization event contributed to their copy number expansion. By integrating more sophisticated phylogenetic and evolutionary analyses, many key genes could be rechecked at certain phylogenetic nodes to clarify specific evolutionary changes correlated with their functional innovation.
The sequences acquired for legume genomes are each incomplete to certain extents, which affects the inference of genomic homology and other comparative studies shown above. New sequencing technologies are empowering resequencing efforts to obtain more complete genome sequences. The present hierarchical and event-related alignment of legume genomes will be updated when more complete or novel genome sequences are ready in the future and stored in the legume comparative genomics database.
A Hypothetical Paleoautopolyploidization
All well-sequenced and annotated plant genomes show evidence of genome duplication, supporting the likelihood that all plants are paleopolyploid. It has been proposed that polyploidization has contributed to the origination, divergence, and success of seed and flowering plants (Jiao et al., 2011) and their domestication (Kellogg, 2016). However, the relative frequencies in evolutionary history of allopolyploidy (hybridization between diverged [sub]species) and autopolyploidy (duplication of the same genome by means such as unreduced gamete formation) are unknown. Autopolyploid formation is thought to be more frequent than allopolyploidy. However, autopolyploids may suffer from reduced fertility, while allopolyploids are thought to have advantages during the establishment phase owing to their potential for heterosis. These thoughts are consistent with the observation (Barker et al., 2016) that more crops are allopolyploid (e.g. wheat, cotton, tobacco [Nicotiana tabacum], strawberry [Fragaria spp.], and oilseed rape [Brassica napus]) than autopolyploid (e.g. potato [Solanum tuberosum], sugarcane [Saccharum officinarum], and banana [Musa spp.]).
Based on information from sequenced genomes, maize (Zea mays), wheat, and the common ancestor of grasses were proposed to result from allopolyploidy, with only the most recent duplication in sugarcane proposed to be autopolyploidy (Schnable et al., 2011; Chalhoub et al., 2014; International Wheat Genome Sequencing Consortium, 2014). Here, by characterizing gene losses between the SST-duplicated chromosomes in soybean, we proposed that the SST was likely an autotetraploidy. In maize, all homologous pairs of duplicated chromosomes have divergent gene loss/retention rates across most of their lengths. In contrast, in soybean, all homologous pairs of chromosomes have similar gene loss/retention rates in most corresponding regions. Although an allotetraploid might, in principle, have two subgenomes with neither dominant over the other, such a phenomenon has not been observed so far. It is also difficult to envisage two diverged species having nearly equal genetic advantage over one another across nearly their full genomes when merging to produce an allotetraploid. Therefore, the simplest explanation of the SST is an autotetraploid. This inference of a likely paleoautopolyploidization in a dicot, building on one in sugarcane (Jannoo et al., 2004), begins to provide some support for the role of autopolyploidy in the establishment of new species.
Gene Loss and Retention
It is recognized that large-scale gene losses follow polyploidization (Soltis et al., 2016), and it has been shown that the maize genome fractionated through accumulated small runs of gene deletions (Schnable et al., 2011) after its polyploidization ∼26 Mya (Wang et al., 2015). Here, we showed that gene losses broke the continuity of gene colinearity in legumes and resulted in the removal of neighboring genes that could mostly be described by a geometric distribution and might have occurred in a mostly random manner. However, an obvious deviation of the data from the random distribution suggested that gene losses could be more complex. For example, gene losses might have occurred in a recursive manner, extending the length of runs, which might require a more sophisticated model to be revealed. Another possibility is that natural selection, although no ready approach is available to test it, may have acted occasionally to determine the direction of evolution involving gene losses. Gene losses often could be harmful and even lethal and may be constrained by selection, a factor not reflected in our random model. Moreover, gene losses could contribute to the reestablishment of disomic chromosome pairing in an unstable neopolyploid (Bowers et al., 2005), a phenomenon that might be facilitated by the loss of large patches of chromosomes that we observed, which again could not be described in the geometric distribution model.
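To make the geometric model concrete, the sketch below compares observed run lengths of consecutively lost genes with the counts expected under a geometric distribution. It is an illustration only, under stated assumptions: run lengths start at 1, a single parameter p is estimated by maximum likelihood (p = 1 / mean run length), and the run-length data are hypothetical. The fitting procedure actually used is described in Supplemental Text S1 and may differ.

```python
from collections import Counter

def fit_geometric(run_lengths):
    """Maximum-likelihood p for P(L = k) = (1 - p)**(k - 1) * p, k = 1, 2, ..."""
    if not run_lengths or min(run_lengths) < 1:
        raise ValueError("run lengths must be positive integers")
    return len(run_lengths) / sum(run_lengths)

def expected_counts(p, n_runs, max_len):
    """Expected number of runs of each length 1..max_len under the fit."""
    return {k: n_runs * (1 - p) ** (k - 1) * p for k in range(1, max_len + 1)}

# Hypothetical run lengths: mostly short runs plus a few long ones,
# mimicking the loss of large chromosomal patches.
runs = [1] * 50 + [2] * 25 + [3] * 12 + [4] * 6 + [5] * 3 + [12] * 4

p_hat = fit_geometric(runs)
observed = Counter(runs)
expected = expected_counts(p_hat, len(runs), max(runs))

for length in sorted(observed):
    print(f"run length {length}: observed {observed[length]}, "
          f"expected {expected[length]:.2f}")
```

In this toy data set, the runs of length 12 are far more frequent than the fitted geometric distribution predicts, which is the kind of excess of long runs that a purely random, single-parameter model cannot capture.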
Gene movement and gene annotation may affect the estimated level of gene loss. As a check, we searched the grape genes against all legume genes and the barrel medic genes against other legume genes, and we searched soybean, barrel medic, and lotus genes against their respective ESTs (Supplemental Tables S25–S28). We found that less than half of grape genes had legume best matches and about one-fourth had bidirectional best matches (at protein matched coverage ≥ 50% and identity ≥ 60%). This could have resulted from gene divergence and gene loss. Among the best matched genes, about half share gene colinearity between genomes. We obtained similar findings with barrel medic as a reference. These findings suggest that gene movement, possibly involving transposons, may contribute to genomic fractionation. With ESTs, at coverage ≥ 30% and identity ≥ 90%, we found that at least 50% of genes have no EST support, which suggests that legume gene annotations need much improvement. The annotation of genes would affect the inference of gene colinearity and, therefore, the characterization of gene losses and genomic fractionation. We will update our inference based on the latest versions of annotated genes in the future.
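The bidirectional best-match criterion can be sketched as follows. This is a hedged illustration, not the code used in the study: the hit tuples, the function names, and the way coverage is represented (a fraction of the protein covered by the alignment) are assumptions, and the example values are invented.

```python
def best_hits(hits, min_coverage=0.5, min_identity=60.0):
    """Map each query to its highest-scoring subject among passing hits.

    Each hit is (query, subject, identity_percent, coverage_fraction, score).
    Hits below the coverage or identity threshold are ignored.
    """
    best = {}
    for query, subject, identity, coverage, score in hits:
        if coverage < min_coverage or identity < min_identity:
            continue
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}

def bidirectional_best(forward_hits, reverse_hits):
    """Return pairs that are each other's best hit in both directions."""
    forward = best_hits(forward_hits)
    reverse = best_hits(reverse_hits)
    return sorted((q, s) for q, s in forward.items() if reverse.get(s) == q)

# Hypothetical hits for illustration only.
grape_vs_legume = [
    ("Vv1", "Mt1", 80.0, 0.9, 500),
    ("Vv1", "Mt2", 70.0, 0.8, 300),
    ("Vv2", "Mt1", 65.0, 0.7, 200),
    ("Vv3", "Mt3", 55.0, 0.9, 400),  # fails the identity threshold
]
legume_vs_grape = [
    ("Mt1", "Vv1", 80.0, 0.9, 500),
    ("Mt2", "Vv1", 70.0, 0.8, 300),
    ("Mt3", "Vv3", 55.0, 0.9, 400),  # fails the identity threshold
]
pairs = bidirectional_best(grape_vs_legume, legume_vs_grape)
```

Here Vv2 has a best match (Mt1) but not a bidirectional one, because the best match of Mt1 is Vv1; this is why the count of bidirectional best matches is lower than the count of one-directional best matches.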
Unbalanced Evolutionary Rates among Legumes
Duplicated genes deriving from a shared duplication event provide a direct means to compare evolutionary rates among taxa. In grasses, duplicated genes produced by a grass-common tetraploidization show 8.5% to 48% divergence in evolutionary rates, with rice being the slowest (Wang et al., 2015). A phylogenetic analysis with mulberry (Morus notabilis) genes and their orthologs from Rosales relatives showed that mulberry evolved much (even 3 times) faster than other Rosales species (He et al., 2013).
Polyploidization itself may drive genes with duplicates to evolve faster, as duplicated genes may buffer mutations in one another, possibly resulting in neofunctionalization or subfunctionalization. For example, cotton genes affected by a decaploidization may have evolved 19% and 15% faster than orthologs in cacao (Theobroma cacao) that have not experienced duplication since the two taxa diverged (Wang et al., 2016). Furthermore, genes from a duplicated pair of grass chromosomes affected by gene conversion evolved faster than those not affected by gene conversion (Wang et al., 2009).
Unexpectedly, duplicated genes in soybean, affected by the SST, did not necessarily evolve faster than those of other legumes. With soybean as a reference, genes in peanut, adzuki bean, mung bean, chickpea, barrel medic, and common bean evolve faster and those in pigeon pea and lotus evolve slower. This weakens the generalization that duplicated genes evolve faster than single-copy genes, perhaps pointing to the importance of other factors, such as living in different environments for millions of years.
MATERIALS AND METHODS
Genomic Materials
We downloaded genomic sequences and annotations from respective Web sites for each genome project, for which complete information can be found in Supplemental Table S29.
Inferring Gene Colinearity
With annotated genes as input, chromosomes from within a genome or between different genomes were compared. First, by performing BLASTP (Altschul et al., 1990), protein sequences were searched against one another to find potentially homologous genes (E < 1e-5). A relaxed E value threshold such as this admits more-diverged homologous genes and helps find anciently duplicated genes. Second, information on gene homology was used as input for the software ColinearScan (Wang et al., 2006) to locate homologous gene pairs in colinearity. The key parameter, the maximum gap, was set to be 50 intervening genes, as adopted in previous genomics research (Wang et al., 2015, 2016). Large gene families with 30 or more copies in a genome were removed from inferring colinearity.
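The prefiltering of homologous gene pairs before colinearity inference can be sketched as below. This is a simplified illustration and does not reproduce ColinearScan or its input format. It assumes that family size can be approximated by the number of distinct genes a query matches in the target genome; that approximation, the function name, and the example data are all hypothetical.

```python
from collections import defaultdict

def filter_for_colinearity(hits, max_evalue=1e-5, large_family=30):
    """Keep homologous pairs suitable for colinearity inference.

    hits: iterable of (query_gene, subject_gene, evalue).
    Pairs must have E < max_evalue. Queries matching `large_family` or
    more distinct subjects are dropped (assumed proxy for family size).
    """
    partners = defaultdict(set)
    for query, subject, evalue in hits:
        if evalue < max_evalue and query != subject:
            partners[query].add(subject)

    kept = []
    for query, subjects in partners.items():
        if len(subjects) >= large_family:
            continue  # treated as a member of a large gene family
        kept.extend((query, subject) for subject in subjects)
    return sorted(kept)

# Hypothetical hits for illustration only.
hits = [("geneA", "x1", 1e-20), ("geneA", "x2", 1e-3)]
hits += [("geneB", f"y{i}", 1e-50) for i in range(30)]  # 30 matches
pairs = filter_for_colinearity(hits)
```

In this example, the second geneA hit fails the E value cutoff and every geneB pair is removed because geneB reaches the 30-copy limit.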
Inferring Genomic Homology
To infer chromosomal homology in legumes, we used the grape (Vitis vinifera) genome as an outgroup reference, which provides information of chromosome homology transitively. The grape genome has preserved the ancestral genome structure from before and after the ECH, which was common to most eudicot plants (Bowers et al., 2003; Jaillon et al., 2007), much better than other sequenced eudicot genomes, which often are affected by further polyploidizations. The grape genome was important to reveal paralogous blocks within legume genomes and to distinguish whether they were produced by the ECH event. Due to the ECH, any one grape genomic region often has two paralogous regions within grape itself and more in legume genomes. Dot plots of genomic homology between genomes produced by our custom software were used to help distinguish orthologous and outparalogous regions between different genomes.
We produced dot plots between grape and various legumes. For example, we show how the grape-barrel medic (Medicago truncatula) homology dot plot helps us understand the barrel medic genome structure. The 19 chromosomes of grape were denoted with blocks in seven colors, corresponding to seven ancestral eudicot chromosomes before the ECH. Due to the ECH and the legume-specific LCT, we anticipated that a grape region would have two orthologous barrel medic regions, which are paralogous to one another, and four outparalogous regions (Supplemental Fig. S1). In the grape-barrel medic dot plot, orthologous and outparalogous blocks can be inferred without much difficulty. A grape chromosomal region often is much more similar, measured by colinear gene number, to its barrel medic orthologous regions than to the outparalogous regions. Some outparalogous blocks can have few homologous gene dots and can only be inferred by transitively using paralogy between grape chromosomes (Supplemental Fig. S11; detailed in Supplemental Text S1). Ideally, a grape chromosome would have two orthologous corresponding regions. However, often, they are broken into pieces by chromosomal rearrangement. A complementary pattern of broken segments helps us infer that they derived from the same ancestral chromosome.
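The expected numbers of orthologous and outparalogous regions follow from simple multiplication of ploidy levels, which the sketch below makes explicit. It also shows one naive way to label candidate blocks by ranking their colinear gene counts. Both functions are illustrative assumptions, not the custom dot-plot software used in the study; the block names and counts are hypothetical, the calculation ignores gene loss and rearrangement, and the soybean figures are simply the same arithmetic extended by the SST.

```python
from math import prod

def expected_regions(shared_events, lineage_events):
    """Expected (orthologous, outparalogous) regions per reference region.

    shared_events: ploidy multipliers of events shared with the reference
                   (e.g. [3] for the hexaploid ECH).
    lineage_events: multipliers of events specific to the target lineage
                    (e.g. [2] for the tetraploid LCT).
    """
    orthologous = prod(lineage_events)
    total = prod(shared_events) * orthologous
    return orthologous, total - orthologous

def label_blocks(colinear_gene_counts, n_orthologous):
    """Label the blocks with the most colinear genes as orthologous."""
    ranked = sorted(colinear_gene_counts,
                    key=colinear_gene_counts.get, reverse=True)
    return {block: "orthologous" if rank < n_orthologous else "outparalogous"
            for rank, block in enumerate(ranked)}

# Grape versus barrel medic: ECH (x3) shared, LCT (x2) legume specific.
medic = expected_regions([3], [2])       # (2, 4), as stated in the text
# Grape versus soybean: the SST adds another doubling.
soybean = expected_regions([3], [2, 2])  # (4, 8) by the same arithmetic

# Hypothetical colinear gene counts for six barrel medic blocks.
counts = {"Mt_a": 120, "Mt_b": 95, "Mt_c": 20,
          "Mt_d": 12, "Mt_e": 8, "Mt_f": 5}
labels = label_blocks(counts, n_orthologous=medic[0])
```

Ranking by colinear gene count alone is a simplification: as noted above, some outparalogous blocks retain so few homologous genes that they can only be inferred transitively through paralogy between grape chromosomes.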
The above strategy also was applied to a comparative analysis between grape and the other legumes. To infer intragenomic homology in soybean after its specific SST, we used the barrel medic genome as a reference.
Accession Numbers
Sequence data from this article can be found in Supplemental Table S29.
Supplemental Data
The following supplemental materials are available.
Supplemental Figure S1. Homologous dot plot between grape and barrel medic genomes.
Supplemental Figure S2. Homologous dot plot between grape and A. duranensis genomes.
Supplemental Figure S3. Homologous dot plot between grape and A. ipaensis genomes.
Supplemental Figure S4. Homologous dot plot between barrel medic and soybean genomes.
Supplemental Figure S5. Homologous alignments of 10 legume genomes with barrel medic as a reference.
Supplemental Figure S6. Gene Ontology analysis distribution of soybean retention genes produced by ECH, LCT, and SST.
Supplemental Figure S7. Gene Ontology analysis distribution of soybean lost genes in ECH, LCT, SST, and LCT-SST.
Supplemental Figure S8. Oil gene amplification model related to gene duplication events in soybean.
Supplemental Figure S9. NBS-class gene amplification model related to gene duplication events in soybean.
Supplemental Figure S10. NBS domain gene amplification model related to gene duplication events in soybean.
Supplemental Figure S11. Homologous dot plot between grape and barrel medic chromosomes.
Supplemental Table S1. Number of homologous blocks and gene pairs within a genome or between genomes.
Supplemental Table S2. Number of homologous genes within a genome or between genomes.
Supplemental Table S3. Number of paralogous, orthologous, and outparalogous gene pairs within a genome or between genomes.
Supplemental Table S4. Number of paralogous, orthologous, and outparalogous genes within a genome or between genomes.
Supplemental Table S5. Number of paralogous, orthologous, and outparalogous blocks within a genome or between genomes.
Supplemental Table S6. Legume gene loss rates and gene translocation with grape as a reference genome.
Supplemental Table S7. Legume gene loss and gene translocation rates with barrel medic as a reference genome.
Supplemental Table S8. Legume gene loss and gene translocation rates with common bean as a reference genome.
Supplemental Table S9. Observed distribution of gene loss and translocation numbers fitted using different density curves of geometric distribution.
Supplemental Table S10. Gene retention in soybean duplicated chromosomes.
Supplemental Table S11. Kernel function analysis of Ks distribution related to duplication events within each genome and between selected legumes (before evolutionary rate correction).
Supplemental Table S12. Kernel function analysis of Ks distribution related to duplication events within each genome and between selected legumes (after evolutionary rate correction).
Supplemental Table S13. Gene Ontology analysis distribution of soybean retention genes produced by ECH, LCT, and SST.
Supplemental Table S14. Gene Ontology analysis distribution of soybean lost genes in ECH, LCT, SST, and LCT-SST.
Supplemental Table S15. Nodulation genes related to duplication events in each legume genome.
Supplemental Table S16. Nodulation subfamily 7 genes related to duplication events in the soybean genome.
Supplemental Table S17. Oil subfamily 9 genes related to duplication events in the soybean genome.
Supplemental Table S18. Oil genes related to duplication events in each legume genome.
Supplemental Table S19. NBS-CC genes related to duplication events in each legume genome.
Supplemental Table S20. NBS-TIR genes related to duplication events in each legume genome.
Supplemental Table S21. NBS-TNL genes related to duplication events in each legume genome.
Supplemental Table S22. NBS-TNx genes related to duplication events in each legume genome.
Supplemental Table S23. NBS-xNL genes related to duplication events in each genome.
Supplemental Table S24. NBS-xNx genes related to duplication events in each genome.
Supplemental Table S25. Bidirectional BLAST searched against all annotated genes between grape and legume.
Supplemental Table S26. Bidirectional BLAST searched against all annotated genes between barrel medic and other legumes.
Supplemental Table S27. Barrel medic, soybean, and lotus genes against their respective ESTs (alignment of coverage ≥ 30%).
Supplemental Table S28. Barrel medic, soybean, and lotus genes against their respective EST sequences (alignment of coverage ≥ 50%).
Supplemental Table S29. Information of original data material.
Supplemental Text S1. Description of details about inferring genomic colinearity, estimating nucleotide substitution, evolutionary dating, modeling gene loss, and inferring Gene Ontology.
Acknowledgments
We thank Liming Zhou for helpful discussions about the article.
Footnotes
X.W. conceived and led the research; J.W. implemented and coordinated the analysis; P.S., Y.Li, Y.Liu, R.X., X.M., J.Y., N.Y., S.Su., X.L., B.J., Y.X., X.S., J.Z., L.J., J.S., J.Y., R.C., X.D., and S.Sh. performed the analysis; T.Li. and T.Le. contributed analysis tools; W.G., L.W., Z.W., L.Z., D.G., D.J., Y.P., J.Q., and M.Z. performed the analysis with constructive discussions; X.W., A.P., and J.W. wrote the article.
* Address correspondence to wangxiyin{at}vip.sina.com. The author responsible for distribution of materials integral to the findings presented in this article in accordance with the policy described in the Instructions for Authors (www.plantphysiol.org) is: Xiyin Wang (wangxiyin{at}vip.sina.com).
1 This work was supported by the China Department of Science and Technology Key Research Project Seven Key Crop Breeding Project (grant no. SQ2016ZY03002918), the China National Science Foundation (grant no. 31501333 to J.W. and grant no. 31371282 to X.W.), the Natural Science Foundation of Hebei Province (grant no. C2015209069 to J.W. and grant no. C2016209097 to W.G.), the Hebei New Century 100 Creative Talents Project, the Hebei 100 Talented Scholars Project, and the Tangshan Key Laboratory Project (to X.W.), the National Fund Cultivation Project of the North China University of Science and Technology (grant no. GP201508 to D.J.), the U.S. National Science Foundation (grant no. ACI1339727 to X.W. and A.P.), and the Georgia Peanut Commission and the Southeastern Peanut Research Initiative (to A.P.).
[OPEN] Articles can be viewed without a subscription.
Glossary
- Mya: million years ago
- LCT: legume-common tetraploid
- ECH: eudicot-common hexaploid
- SST: soybean-specific tetraploid
- Ks: synonymous nucleotide substitutions at synonymous sites
- OSR: oil synthesis-related
- Received January 3, 2017.
- Accepted March 19, 2017.
- Published March 21, 2017.