First published online November 2, 2007; 10.1104/pp.107.104588 Plant Physiology 146:5-21 (2008) © 2008 American Society of Plant Biologists OPEN ACCESS ARTICLE
Identification and Characterization of Nucleotide-Binding Site-Leucine-Rich Repeat Genes in the Model Plant Medicago truncatula
Laboratoire des Interactions Plantes Microorganismes, UMR CNRS-INRA 442–2594, 31326 Castanet Tolosan, France (C.A.-T.); Departments of Plant Pathology and Plant Biology, University of Minnesota, St. Paul, Minnesota 55108 (C.A.-T., B.-B.W., N.D.Y.); Advanced Center for Genome Technology and Department of Chemistry and Biochemistry, University of Oklahoma, Norman, Oklahoma 73109 (M.S.O., B.A.R.); Department of Plant and Soil Sciences, University of Kentucky, Lexington, Kentucky 40546–0236 (H.Z.); and United States Department of Agriculture-Agricultural Research Service and Department of Agronomy, Iowa State University, Ames, Iowa 50011 (S.B.C.)
The nucleotide-binding site (NBS)-Leucine-rich repeat (LRR) gene family accounts for the largest number of known disease resistance genes, and is one of the largest gene families in plant genomes. We have identified 333 nonredundant NBS-LRRs in the current Medicago truncatula draft genome (Mt1.0), likely representing 400 to 500 NBS-LRRs in the full genome, or roughly 3 times the number present in Arabidopsis (Arabidopsis thaliana). Although many characteristics of the gene family are similar to those described in other plant genomes, several evolutionary features are particularly pronounced in M. truncatula, including a high degree of clustering, evidence of significant numbers of ectopic translocations from clusters to other parts of the genome, a small number of more evolutionarily stable NBS-LRRs, and numerous truncations and fusions leading to novel domain compositions. The gene family clearly has had a large impact on the structure of the genome, both through ectopic translocations (potentially, a means of seeding new NBS-LRR clusters), and through two extraordinarily large superclusters. Chromosome 6 encodes approximately 34% of all TIR-NBS-LRRs, while chromosome 3 encodes approximately 40% of all coiled-coil-NBS-LRRs. Almost all atypical domain combinations are in the TIR-NBS-LRR subfamily, with many occurring within one genomic cluster. This analysis shows that the gene family not only is important functionally and agronomically, but also plays a structural role in the genome.
Plants have evolved sophisticated mechanisms to recognize and guard against pathogens. Interaction between hosts and pathogens triggers both localized and systemic resistance responses. Disease resistance frequently is governed by specific recognition between pathogen AVIRULENCE genes and corresponding plant disease RESISTANCE (R) genes. This type of gene for gene interaction usually is accompanied by a hypersensitive response leading to the restriction of pathogen growth. In the past decade, R genes have been cloned from numerous plant species, conferring resistance to a wide range of plant pathogens including bacteria, fungi, oomycetes, viruses, and nematodes (Dangl and Jones, 2001
The largest class of R genes encodes proteins with a nucleotide-binding site (NBS) and a Leu-rich repeat (LRR) region. This domain architecture is consistent with a role in pathogen recognition and defense response signaling. The NBS domain contains several conserved motifs typically found in ATP- or GTP-binding proteins and also present in several structurally related regulators of animal apoptosis (Traut, 1994
The NBS-LRR family of R genes can be further divided into two subfamilies based on deduced N-terminal structural domains. One subfamily, termed TIR-NBS-LRR or TNL, encodes a domain with similarity to the intracellular signaling domains of the Drosophila Toll and mammalian INTERLEUKIN1 receptor, while the second, termed coiled-coil (CC)-NBS-LRR or CNL, codes for a putative CC domain in the N-terminal region. These two subfamilies can also be distinguished by the unique amino acid motifs found within the NBS domain itself (Meyers et al., 1999
Conservation of the NBS domain has been used to study the genomic architecture of this gene family. R genes are unevenly distributed in plant genomes and many reside in local multigene clusters. The clustered distribution of R genes provides a reservoir of genetic variation from which new specificities can evolve. Mechanisms like duplication, unequal crossing over, ectopic recombination, gene conversion, and diversifying selection have been proposed to contribute to the structure of R gene clusters and the evolution of resistance specificities (Michelmore and Meyers, 1998
M. truncatula is a self-fertile, annual, and diploid plant that has been selected as a model legume (Barker et al., 1990 In the public draft assembly, which is estimated to span approximately 60% of the euchromatic space of M. truncatula, we identified at least 333 NBS-LRR encoding genes in the A17 genotype that is now being sequenced. Here we report an analysis of the evolution and genomic organization of these genes in M. truncatula based on genomic sequence data from the first large-scale genome assembly of the ongoing sequencing project (www.medicago.org/genome).
Genome Assembly, Gene Prediction, and Nomenclature
Genomic sequence consisted of the 1.0 draft genome assembly generated by the Medicago Genome Sequencing Consortium (MGSC; http://www.medicago.org/genome/downloads/Mt1/), with gene predictions from the International Medicago Genome Annotation Group (IMGAG; Town, 2006 The M. truncatula NBS-LRR genes in this study were identified from IMGAG annotated genes. Gene names (Supplemental Table S1) follow the IMGAG naming convention, as illustrated by gene AC148761_18.3: The characters before the underscore are the GenBank accession for the source BAC; the number after the underscore is the gene number within the BAC; and the number after the period is the version of this gene call. For convenience in tree figures, shorter, more informative names are used. Aliases to IMGAG names are provided in Supplemental Table S1. The format for short names is illustrated by Mt2g1873: genus and species in the first two characters, followed by chromosome number, then type (g = gene), then gene order in the pseudomolecule build (Mt1.0; http://www.medicago.org/genome/downloads/Mt1). These names are for use only in this article; the persistent names (in GenBank and EMBL) use the IMGAG format.
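The two naming schemes described above are regular enough to parse mechanically. The following Python sketch is illustrative only: the regular expressions are inferred from the two examples given in the text (AC148761_18.3 and Mt2g1873), the single-digit chromosome field is an assumption, and the function names `parse_imgag_name` and `parse_short_name` are hypothetical.

```python
import re

# Patterns inferred from the two example names in the text; names that
# deviate from these examples would need a looser pattern.
IMGAG_RE = re.compile(r"^(?P<bac>[A-Z]{1,2}\d+)_(?P<gene>\d+)\.(?P<version>\d+)$")
SHORT_RE = re.compile(r"^(?P<species>[A-Z][a-z])(?P<chrom>\d)(?P<kind>[a-z])(?P<order>\d+)$")


def parse_imgag_name(name):
    """Split an IMGAG-style name into BAC accession, gene number, and version."""
    m = IMGAG_RE.match(name)
    if m is None:
        raise ValueError(f"not an IMGAG-style name: {name!r}")
    return {
        "bac_accession": m.group("bac"),
        "gene_number": int(m.group("gene")),
        "version": int(m.group("version")),
    }


def parse_short_name(name):
    """Split a short name into species code, chromosome, type, and gene order."""
    m = SHORT_RE.match(name)
    if m is None:
        raise ValueError(f"not a short-format name: {name!r}")
    return {
        "species": m.group("species"),
        "chromosome": int(m.group("chrom")),  # assumed to be one digit
        "type": m.group("kind"),              # 'g' = gene
        "order": int(m.group("order")),       # order in the pseudomolecule build
    }


imgag = parse_imgag_name("AC148761_18.3")
short = parse_short_name("Mt2g1873")
print(imgag)
print(short)
```

Keeping the two parsers separate mirrors the text: the IMGAG form is the persistent identifier, while the short form is a convenience alias used only in this article.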
We used similarity searches based on extended NBS-LRR domains (see "Materials and Methods") to identify NBS-LRR genes in the A17 ecotype genomic sequence. A total of 333 nonredundant sequences, consisting of 177 putative CNL and 156 TNL sequences, were used in subsequent analyses, as described in "Materials and Methods" (Supplemental Table S1). Thirty additional sequences that appear to be related to or derived from NBS-LRRs, but were too divergent for inclusion in phylogenetic analyses, are also shown in Supplemental Table S1. The 333 sequences included in phylogenetic analyses are distributed across all chromosomes of M. truncatula (Fig. 1 ), with six (four CNLs and two TNLs) located on still-unmapped BACs. Three chromosomes (3, 4, and 6) contain a disproportionately large number of NBS-LRRs (more than 54%).
As M. truncatula has not been fully sequenced, we also attempted to estimate how many NBS-LRR genes may be missing in this study. We asked what proportion of a random set of expressed M. truncatula NBS genes are found in the Mt1.0 assembly. We find that proportion to be approximately 196/294 = 2/3, as follows. For the random set of expressed M. truncatula NBS genes (the denominator), we took the M. truncatula transcript assemblies and singletons (TA unigenes from http://plantta.tigr.org, release 2) that (1) have a tblastn E value of 1e-15 with some M. truncatula NBS gene, and (2) have higher similarity to an Arabidopsis NBS-LRR than to an Arabidopsis gene from any other gene family. Applying only the first criterion, we find 661 NBS-like M. truncatula sequences among the 55,182 TA unigenes. Applying the second criterion lowers the number to 294, because a large proportion of the initial candidates are actually more similar to some other Arabidopsis gene. Given the denominator (294), we then asked how many M. truncatula NBS genes have a nearly identical match to those Mt NBS sequences (196). Therefore, we estimate that the current genome sequence contains roughly two-thirds (196/294) of M. truncatula NBS-LRR genes. This is consistent with estimates that the Mt1.0 assembly covers approximately two-thirds of the euchromatic space of M. truncatula.
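The coverage argument above amounts to proportional scaling, which can be written out as a short calculation. This is a minimal sketch using only the counts quoted in the text (196, 294, and 333); the function name `estimate_total_genes` is hypothetical, and the calculation ignores any sampling error in the EST-derived proportion.

```python
def estimate_total_genes(found_in_assembly, matched, sampled):
    """Scale the number of genes found by the fraction of a reference sample recovered.

    found_in_assembly: genes identified in the draft assembly
    matched: reference sequences with a near-identical match in the assembly
    sampled: all reference sequences tested
    """
    # Equivalent to found_in_assembly / (matched / sampled)
    return found_in_assembly * sampled / matched


coverage = 196 / 294                            # fraction of expressed NBS genes recovered
total = estimate_total_genes(333, 196, 294)     # implied genome-wide count

print(f"estimated coverage: {coverage:.3f}")
print(f"implied total NBS-LRR genes: {total:.1f}")
```

Under these assumptions the implied total sits near the upper end of the 400 to 500 range given in the abstract.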
Most NBS genes are physically clustered in the genome (Fig. 1), with more than 54% of the NBS-LRR genes encoded on chromosomes 3, 4, and 6. Using a sliding window size of 100 kb, 79.8% of NBS domains occur in clusters of at least two genes, and 49.5% are in clusters of at least five genes. For this window size, the largest cluster (on chromosome 6) contains 14 genes. Using a sliding window size of 250 kb, 83.6% of NBS genes are in clusters of at least two genes and 68.9% are in clusters of at least five genes. For this window size, the largest cluster (on chromosome 3) contains 23 genes. Further relaxing these clustering criteria, a significant fraction of all M. truncatula NBS genes are in two very large, extended clusters: one at the north end of chromosome 3 containing 82 genes (73 CNL and nine TNL) and extending across 55 BAC clones (including 11 unspanned gaps) and another at the south end of chromosome 6 containing 57 genes (all TNL) and extending across 34 BAC clones (including 10 unspanned gaps). Together, these two clusters contain a remarkable 39% of all NBS genes in this study.
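One way to make the sliding-window definition concrete is sketched below. The exact windowing procedure used for these figures is not spelled out in this section, so the rule used here is an assumption: a gene counts as belonging to a cluster of size k if some window of the given length contains it together with at least k - 1 other NBS genes on the same chromosome. The function names and toy coordinates are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def max_window_membership(genes, window):
    """For each gene, return the largest number of genes found in any window
    of length `window` (bp) that contains it.

    genes: iterable of (chromosome, position_bp); positions are assumed
    unique within a chromosome.
    """
    by_chrom = defaultdict(list)
    for chrom, pos in genes:
        by_chrom[chrom].append(pos)

    best = {}
    for chrom, positions in by_chrom.items():
        positions.sort()
        # Anchoring a window at each gene start is enough: any window can be
        # slid right until its left edge touches a gene without losing members.
        for i, start in enumerate(positions):
            j = bisect_right(positions, start + window)
            size = j - i
            for pos in positions[i:j]:
                key = (chrom, pos)
                best[key] = max(best.get(key, 0), size)
    return best


def fraction_clustered(genes, window, min_size):
    """Fraction of genes that fall in a window holding at least min_size genes."""
    best = max_window_membership(genes, window)
    return sum(1 for size in best.values() if size >= min_size) / len(best)


# Toy coordinates, for illustration only (not taken from Mt1.0).
toy = [("chr3", 10_000), ("chr3", 60_000), ("chr3", 105_000),
       ("chr3", 900_000), ("chr6", 5_000)]
frac = fraction_clustered(toy, window=100_000, min_size=2)
print(frac)
```

Running the same function with `window=250_000` and `min_size=5` would correspond to the second set of percentages quoted above, given real gene coordinates.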
Phylogenetic trees were constructed from 333 NBS sequences (177 CNLs and 156 TNLs). As previous studies have shown that phylogenies calculated from the NBS domain robustly distinguish the TNL and CNL subfamilies (Meyers et al., 1999
Figures 1 and 2, A to C, provide overviews of the CNL and TNL subfamilies. Figure 1 shows gene position and subfamily on M. truncatula chromosome pseudomolecules. Figure 2 shows phylogenies, including chromosome of origin (by color and sequence name), clade age relative to Arabidopsis (Arabidopsis or At) or Populus trichocarpa (poplar or Pt) sequences (pink dots), relationship to putative internal genomic duplications (black arrows), gene relatedness and evolutionary rate (phylogenetic structure and branch length), and approximate expression levels and regulatory element counts (right). The chromosome of origin is informative in that it highlights local expansion of some sequence types, as well as changes in chromosome origin, suggesting either large-scale rearrangements or ectopic translocations. The approximate locations of Arabidopsis and poplar sequences and the internal genomic duplications (black arrows) inferred from large-scale duplications within the genome are informative in that they provide approximate relative age calibrations for each clade.
Most NBS derive from relatively recent gene duplications and for the most part they are highly similar to other NBS in the same genomic clusters (although some of the observed sequence similarity may also be the result of gene conversion). In Figure 2, A to C, pink dots indicate approximate locations of coalescent points with poplar or At NBS clades. These points provide relative time references, showing which clades have probably expanded within legumes. The poplar and Arabidopsis coalescence points are reported together because, although poplar and Arabidopsis separated from the legumes at approximately 70 to 84 mya and 108 to 117 million years ago (mya), respectively (Wikström et al., 2001 Most legume NBS sequences are found at greater than 0.5 PAM units (accepted point mutations per site) from these coalescence points with nonlegume species. By measuring the phylogenetic distance between M. truncatula NBS sequences, we can assess how many have originated recently. In this context, a distance cutoff of 0.5 PAM units between M. truncatula sequences (or average distance of 0.25 PAM to the M. truncatula-M. truncatula coalescence point) is a reasonable indicator of nearness in that it is much shorter than the evolutionary distance to Arabidopsis or poplar sequences, and so represents gene duplications that most probably occurred within legumes. On average, each M. truncatula NBS is within 0.5 PAM of 9.2 other sequences, again indicating that many groups of sequences have high sequence similarity. This is evident in the trees in Figure 2, A to C, in the form of clades with many sequences and short branch lengths, such as the many similar sequences from chromosome 3 in Figure 2A. The tree in Figure 2, A to C, is divided into 17 CNL and eight TNL clades. These are legume specific, if we define legume specific to mean that each clade contains no sequences from Arabidopsis or poplar (using the coalescence points described above). 
Properly, this designation should be further qualified. More closely related nonlegume species (for example, from Rosaceae) might still fall within these clades and the absence of a poplar or Arabidopsis sequence could sometimes be due to gene loss. Gene conversion or homogenization might also foreshorten distances in some M. truncatula clades. Nevertheless, the genes in this large gene family trace to a small number of legume-specific progenitor sequences. It is also likely, therefore, that a much larger number of genes have arisen and been lost in this timeframe.
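The "within 0.5 PAM" tally used above as a measure of recent origin can be illustrated with a small pairwise-distance matrix. This is a sketch under stated assumptions: the distances below are invented for illustration, the cutoff is treated as inclusive, and `close_neighbor_counts` is a hypothetical helper rather than the procedure actually applied to the phylogeny.

```python
def close_neighbor_counts(dist, cutoff=0.5):
    """Count, for each sequence, how many other sequences lie within `cutoff`.

    dist: square, symmetric matrix (list of lists) of pairwise distances
    in PAM units.
    """
    n = len(dist)
    return [
        sum(1 for j in range(n) if j != i and dist[i][j] <= cutoff)
        for i in range(n)
    ]


# Hypothetical distances among four sequences (A, B, C, D).
toy_dist = [
    [0.0, 0.2, 0.4, 1.3],
    [0.2, 0.0, 0.6, 1.1],
    [0.4, 0.6, 0.0, 1.2],
    [1.3, 1.1, 1.2, 0.0],
]
counts = close_neighbor_counts(toy_dist)
mean_neighbors = sum(counts) / len(counts)
print(counts, mean_neighbors)
```

Applied to the full distance matrix, the mean of these counts corresponds to the statistic reported in the text as 9.2 close neighbors per sequence.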
Most legume-specific clades are dominated by sequences from one chromosome (and usually from one or a small number of genomic clusters), but many also contain small numbers of sequences from other chromosomes. Specifically, 14 CNL and six TNL legume-specific clades are mixed (with sequences from multiple chromosomes). This is 80% of all clades. A mixed clade could arise in several ways: by chromosomal rearrangement (for example, breakage and fusion), by transposition, or by large-scale genomic duplication. In at least five clades, the origin of sequences from different chromosomes (or widely separated parts of one chromosome) can be traced to internal synteny in M. truncatula, the likely remnants of an early episode of polyploidy in the legumes (Cannon et al., 2006 While some mixed clades, including TNL-1 just described, can be traced to internal segmental duplications and others are probably cryptic remnants of duplications, no longer apparent after rearrangements, there are also other mixed clades and clusters that are best explained as ectopic translocations. Supplemental Table S1 indicates 29 such cases: instances in which a clade of closely related sequences comes from one genomic cluster, with one sequence occurring in a distant part of the genome. These instances can be thought of as having donor regions (a cluster of related sequences in one part of the genome) and acceptor regions (the location of the related gene outside of the home cluster). Examples of such clades are Figure 2A, CNL-3 (Mt5, Mt2, Mt7); Figure 2B, CNL-9 (Mt6, Mt3; Mt7, Mt5; Mt5, Mt1); Figure 2C, TNL-1 (Mt6, Mt2); TNL-4 (Mt6, Mt2; Mt6, Mt1). Probable transpositions (donations) do not seem to target particular locations. They seem to occur throughout the genome and not just in NBS-rich regions. For example, there are seven donations to chromosome 1, but there are only 11 NBS-LRRs, in total, on chromosome 1. These are mostly unclustered (five donations occur as singletons and the remainder occur in two clusters).
There are, however, some instances of apparent donations into existing clusters. Several examples are the TNL genes in the large CNL cluster on chromosome 3 (Mt6g1868 While some clusters contain genes that appear to have come from other regions of the genome, a broader phenomenon is physical clusters that include divergent sequences (regardless of the origin). There are 26 instances of TNL and CNL genes falling within 100 kb of one another (Supplemental Table S1). In fact, five pairs of CNL and TNL genes fall within single BACs: Mt7g2240/AC169666_26.4 (CNL-4) and Mt7g2242/AC169666_23.4 (TNL-2); Mt5g3987/CT963106_1.4 (CNL-9) and Mt5g3991/CT963106_17.4 (TNL-8); Mt3g812/CT963074_17.3 (TNL-4) and Mt3g833/CT963074_13.3 (CNL-3); Mt3g796/CT963132_6.3 (TNL-4) and Mt3g809/CT963132_9.3 (CNL-4); Mt3g654/CT967304_15.3 (CNL-2) and Mt3g655/CT967304_17.3 (TNL-8).
Protein domains of the 333 NBS-encoding genes in this study were predicted using Hidden Markov model (HMM) searches against Pfam v. 20 (Bateman et al., 2002 Comparisons between protein sequences within a single structural category revealed some likely inaccuracies in automated gene predictions and annotations. Probable misannotations were detected in 10 proteins (indicated with an asterisk in the domains column of Supplemental Table S1), and generally consisted of an additional exon in the C-terminal region. Such exons include HSP70, reverse transcriptase, MMR-HSR (GTPase), RNase H, and chaperone-associated domain. These are not included in the tally of domain classes in Table I.
Pfam analyses could not identify the CC motif present in the N-terminal region, even though previous studies have demonstrated that the presence or absence of this motif is correlated with specific signatures in the NBS domain (Meyers et al., 1999
A majority of the proteins examined belong to the canonical classes described in the literature (Meyers et al., 1999
In the CNL subfamily, the predominant unusual domain arrangement is a missing LRR; specifically, 25/177 (16%) lack the LRR (i.e. CN). The only other unusual class observed in the CNL subfamily is the result of a putative fusion with the Rpw8 domain, in CU013515_1.4/Mt5g1164. The closest homolog of CU013515_1.4/Mt5g1164 in Arabidopsis (At5g66910) displays the same domain structure. The Rpw8 gene in Arabidopsis provides broad-spectrum mildew resistance (Xiao et al., 2001
In contrast to the CNL, the TNL subfamily is highly diverse in terms of domain arrangements. Only 86/156 (55%) are typical TNL. The second and third most common classes are TN (27/156 = 17%) and NL (25/156 = 16%). One of the most intriguing sets of atypical domain arrangements is within clade TNL-8 (bottom of Fig. 2C), where a cluster of predicted peptides on chromosome 4 includes the structures TNTNL (5), TNLT (2), TNL (1), TTNL (1), N (1), and NL (1). The sister clade, with genes from chromosomes 7 and 4 (and two unplaced BACs), contains one each of NT, NTNL, N, TNTNL, and NT. That these sequences occur mainly on two chromosomes suggests that these unusual sequences have been maintained for some time, probably at least since polyploidy, early in the legumes (Schlueter et al., 2004 An additional intriguing instance of a putative fusion in the TNL subfamily is a predicted protein with domains TNLTNL (AC126790_31.4/Mt6g1826; GenBank ID ABE83302.2). Such a fusion would not be unprecedented, as at least one gene with similar structure is present in the current manually annotated Arabidopsis peptides (At3g25510; The Institute for Genomic Research v.7). The Arabidopsis gene is not an ortholog, as it apparently results from an independent event. Reexamination of M. truncatula BAC AC126790.38 confirmed the initial prediction, and about one-third of the sequence has 100% cDNA support (with ESTs CX538931 and CX524109). The gene structure, with five exons, is otherwise not unusual, and the 3,123 nt of coding sequence occur within the 4,550 nt total gene region.
Analyses of motifs within NBS domains reveal additional features. Since typical NBS domains often contain variable motifs (NBS-A and -C, described in Meyers et al., 1999
To assess which genes in this study have expression support, we compared the predicted genes against available ESTs (231,765 ESTs from 55 libraries, from GenBank in April, 2006). Because many NBS-LRR genes are similar to one another, only top matches were considered, after applying a high match stringency of at least 95% nucleotide identity between EST and genomic sequences. At this threshold, 168 NBS genes in this study have EST support, representing 50.5% of predicted genes in the study (indicated by EST matches in the right-hand column of Fig. 2, A–C; Supplemental Table S1). Altogether, 530 EST matches were identified, with an average of 3.1 ESTs per expressed NBS gene. ESTs are approximately equally distributed in the CNL and TNL classes: 81 CNLs and 87 TNLs are represented by at least one EST. A majority of CNL and TNL genes display one or two ESTs, but 22 genes have at least five ESTs and one (AC135229_11.5/Mt8g3004) has 31 ESTs (Fig. 2, A–C; Supplemental Table S1). Among these relatively highly expressed genes, most are located on the lower arm of chromosome 6, within the supercluster described above. These genes are expressed in a wide range of libraries, including those constructed from various developmental stages, tissue types, and pathogen-challenged or nonchallenged tissue. Approximate expression patterns, judged by counts of EST matches, vary substantially between clades and even between highly similar genes within the same clade. For example, genes on most branches in the CNL tree in Figure 2A have lower expression than those in clade CNL-16 (bottom of Fig. 2B). Within clade CNL-16, however, corresponding EST numbers range from 1 to 31.
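The "top match only, at least 95% identity" rule described above can be sketched as a small filter. This is an illustration under assumptions, not the pipeline used in the study: the tuple layout `(est_id, gene_id, percent_identity, score)`, the use of an alignment score to pick the top match, and the function name `est_support_counts` are all hypothetical.

```python
from collections import Counter


def est_support_counts(hits, min_identity=95.0):
    """Assign each EST to its single best-scoring gene and count ESTs per gene.

    hits: iterable of (est_id, gene_id, percent_identity, score).
    Hits below min_identity are discarded before the top match is chosen.
    """
    best = {}  # est_id -> (gene_id, score)
    for est_id, gene_id, identity, score in hits:
        if identity < min_identity:
            continue
        if est_id not in best or score > best[est_id][1]:
            best[est_id] = (gene_id, score)
    return Counter(gene_id for gene_id, _ in best.values())


# Toy hits for illustration only.
toy_hits = [
    ("est1", "geneA", 99.0, 500),
    ("est1", "geneB", 96.0, 450),  # weaker hit for the same EST, ignored
    ("est2", "geneB", 94.0, 600),  # below the identity threshold, dropped
    ("est3", "geneB", 97.5, 300),
]
support = est_support_counts(toy_hits)
print(dict(support))
```

Counting only the top match per EST keeps a single transcript from being credited to several near-identical paralogs, which is the concern raised in the text.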
There are 49 unique sequences in Mt1.0 with stop codons, identified using CC or TIR consensus NBS sequences in a tblastn query against the Mt1.0 nucleotide chromosome assemblies, filtered at an E value of 1e-10. This is probably an underestimate, because many pseudogene fragments may fall below this level of significance, and other sequences may lack stop codons in the region of the query yet not be predicted among the IMGAG gene calls in this assembly. Nevertheless, the stated criteria give values for comparison. Pseudogene counts per chromosome are (for chromosomes 0–8) 1, 2, 2, 4, 15, 6, 7, 5, and 7. Counts of CNL- and TNL-like pseudogenes are 22 and 27, respectively (Supplemental Table S2).
We also have observed that 91.8% of all predicted NBS pseudogenes are within 100 kb of another predicted NBS gene. These pseudogenes are not, however, distributed on the chromosomes in the same way as predicted NBS genes without stop codons. More pseudogenes are found on Mt4 than would be expected (15 observed versus 6.1 expected), and fewer are found on Mt3 than expected (four observed versus 13.5 expected). These differences are supported by a test for independence by chromosome, with a There also is evidence that some of the predicted pseudogenes may be expressed. Four of the 49 predicted pseudogenes match at least one EST at 99% to 100% identity over 58% to 89% of the genomic pseudogene length, and from 76% to 100% of the EST length (Supplemental Table S2). For example, TA38236_3880 (AL375406 AL375407) matches over 1,403 nucleotides, with one mismatch, and neither the genomic nor the EST sequence has extended open reading frames; each contains at least three stop codons in the 716 nt aligning region.
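The observed-versus-expected comparison above can be written out as follows. This is a sketch only: it assumes the expected count for a chromosome is the 49 pseudogenes apportioned by that chromosome's share of the 333 predicted genes, and it uses a chi-square-style term as one plausible form of the test for independence mentioned in the text. Only the two reported chromosomes are shown, so the result is not the full test statistic; the function names are hypothetical.

```python
def expected_pseudogenes(genes_on_chrom, total_genes=333, total_pseudogenes=49):
    """Expected pseudogene count if pseudogenes were distributed like genes."""
    return total_pseudogenes * genes_on_chrom / total_genes


def chi_square_term(observed, expected):
    """Contribution of one chromosome to a chi-square statistic."""
    return (observed - expected) ** 2 / expected


# (observed, expected) pairs as reported in the text.
reported = {"Mt4": (15, 6.1), "Mt3": (4, 13.5)}
terms = {chrom: chi_square_term(obs, exp) for chrom, (obs, exp) in reported.items()}

for chrom, value in terms.items():
    print(f"{chrom}: chi-square contribution {value:.2f}")
```

Summing such terms over all chromosomes, with per-chromosome gene counts from Supplemental Table S1, would give the complete statistic.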
We identified promoter sequences in 2 kb windows upstream of predicted NBS-LRR genes (Supplemental Table S1). Four regulatory elements implicated in either response to pathogens or plant stress were identified as being overrepresented in the 2 kb region upstream of NBS-LRRs. The regulatory elements were: WBOX cassettes, associated with the WRKY transcription factors (Dong et al., 2003 WBOX elements are the most numerous, averaging 8.6 for the CNL and 8.4 for the TNL subfamilies (Supplemental Table S1; Fig. 2, A–C). In contrast, the average numbers of other element types are 0.68 (CBF), 0.08 (GCC), and 0.39 (DRE). Of the predicted NBS-LRR genes, 75% contain between five and 11 predicted WBOXs, with six WBOXs being the most common. The other three identified elements are generally observed only once per upstream region, with just a few cases of multiple boxes predicted. A striking feature of counts of these regulatory elements is that they are distributed quite uniformly across the tree (Fig. 2, A–C). For example, when average numbers of WBOXs are calculated per major clade in this phylogeny (CNL1-17 and TNL1-8), the SD is 2.15 on clade-wise averages of 9.0 WBOXs. Similarly, no clades show systematic excesses or deficits of the (less numerous) CBF, GCC, or DRE box motifs. We see no clear evidence of a correlation between the arrangement of these promoter cassettes (WBOX, CBF, GCC, and DRE) and in silico expression via EST counts.
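Counting a cis-element in an upstream window, as described above, can be illustrated with a simple motif scan. The motif definition used in the study is not given in this section, so this sketch assumes the commonly cited W-box core TTGAC(C/T), scanned on both strands with overlapping matches allowed; the function name `count_wbox` and the example sequence are hypothetical.

```python
import re

# Assumed W-box core: TTGAC followed by C or T. Lookaheads allow
# overlapping matches to be counted.
WBOX_FORWARD = re.compile(r"(?=TTGAC[CT])")
# Reverse complement of the same motif, so the opposite strand is
# covered without reversing the sequence.
WBOX_REVERSE = re.compile(r"(?=[AG]GTCAA)")


def count_wbox(upstream_seq):
    """Count W-box cores on both strands of an upstream sequence."""
    seq = upstream_seq.upper()
    forward = len(WBOX_FORWARD.findall(seq))
    reverse = len(WBOX_REVERSE.findall(seq))
    return forward + reverse


# Short invented sequence with one forward and one reverse-strand match.
example = "AATTGACCGGGGTCAATT"
n_boxes = count_wbox(example)
print(n_boxes)
```

In practice the input would be the 2 kb of sequence upstream of each predicted gene, and per-clade means and standard deviations could then be computed from the resulting counts.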
Many aspects of the NBS-LRR disease resistance gene family have been extensively studied and described in other species. This study of NBS-LRRs in M. truncatula confirms many patterns observed in other plant species, but also clarifies some patterns and finds some features that differ at least quantitatively from those seen in other plants. Analysis of overall localization, predicted domain structure, in silico gene expression, promoter regions, and molecular evolution reveals a number of striking features: (1) predominantly recently derived sequences, with most having originated through local duplications; (2) evidence that NBS-LRR clusters, which in many cases dominate multimegabase regions, have played an important role in genomic remodeling; (3) evidence of ectopic translocations of NBS-LRRs from many clusters to other parts of the genome; (4) surprisingly variable domain arrangements, primarily in the TNL subfamily; (5) several novel domain combinations that appear to have originated and proliferated within the legumes; (6) dramatically varying expression patterns, with expression varying both between and within clades; (7) surprising uniformity of promoter regions across the gene family; and (8) patterns of pseudogene distributions related to NBS gene distributions, but differing significantly between clusters.
As is the case in other plant genomes, NBS genes predominantly are clustered physically in M. truncatula. This is clearly an outcome of the birth and death process that results from tandem duplication or contraction in a cluster. More intriguing, perhaps, are the exceptions. While most clusters are composed predominantly of closely related genes, most also include distantly related strangers. And while most NBS genes are found in clusters, some exist as singletons that in some cases have close homologs elsewhere in the genome but in other cases appear to have been evolving independently. These exceptions have important implications for the evolution of this family (and the genome) because, although they are rare, they provide sources of novelty and change in the genome.
The pattern of clustered, related NBS sequences clearly is an outcome of the birth and death process that results from tandem duplication or contraction in a cluster (Michelmore and Meyers, 1998
Not only do M. truncatula NBS-LRRs tend to cluster, but many also lie in superclusters, such as the 82 NBS genes on the upper arm of chromosome 3 and the 57 NBS genes on the lower arm of chromosome 6. Interestingly, Mt6 also is more transposon dense than any other chromosome (Cannon et al., 2006
The NBS genes encoded on the Mt3 and Mt6 superclusters represent more than 5% of all the genes, NBS and non-NBS, found on the upper arm of Mt3 and the lower arm of Mt6. At this scale, NBS superclusters probably played a central role in genomic remodeling during the evolution of these chromosome regions. Superclusters in M. truncatula resemble the situation in Arabidopsis, where 32 and 43 NBS-LRRs are found on chromosomes At-1 and At-5, respectively (Meyers et al., 1999
Although most clusters are predominantly composed of similar sequences, many clusters also contain some phylogenetically distant NBS genes. Indeed, 26 of 120 M. truncatula clusters include both TNL and CNL members. The presence of heterogeneous NBS clusters in M. truncatula resembles the situation in rice (Monosi et al., 2004 Among the minority of NBS-LRRs that are singletons, some of these are closely related to sequences elsewhere in the genome. Although this is a small proportion of all NBS genes in the genome, these genes may play the role of pioneers, seeding new regions of the genome with NBS-LRRs, and potentially establishing new locations for future clusters. Examples of singletons with related genes elsewhere are the three Mt1 NBS-LRRs (pink) in clade TNL-4, Figure 2C, nested in a large clade of sequences primarily located on Mt6 (blue). Approximately 7% (23/333 = 6.9%) of NBS genes in this study have close homologs that appear to have come from clusters elsewhere in the genome.
The last class of NBS-LRRs are singletons that have no close relatives in the genome. Examples include the single-gene clade CNL-17 or the nine low-copy, unclustered genes in CNL-13 to CNL-15 (Fig. 2B). In each case, there is a candidate ortholog from lotus, poplar, or Arabidopsis, and in none of those cases are the orthologs in clusters in those genomes (Supplemental Data S13 and S14). Some NBS genes may remain as stable singletons simply by chance and because they are in stable regions of the genome. That is, singletons will tend to remain singletons. They are, by definition, not in clusters, which are inherently prone to expansion and contraction through unequal crossing over (Cooley et al., 2000
Several studies have described evidence of large-scale genomic duplication (possibly a whole-genome duplication [WGD]) early in the evolution of the legumes (Schlueter et al., 2004
In an evaluation of the 2,000 bp upstream of the NBS-LRR genes in M. truncatula, we found surprising uniformity in the numbers of four overrepresented cis-elements. This uniformity was found across all clades examined, in both TNL and CNL subfamilies. At least within clusters, similar regulatory elements might be expected if regulatory regions duplicate and undergo changes at rates similar to their associated genes. This would be consistent with the finding that tandemly duplicated genes in Arabidopsis have higher levels of conservation of cis-elements when compared to segmentally duplicated genes (Haberer et al., 2004
M. truncatula NBS genes show diverse domain combinations, although almost all of the diversity exists in the TIR subfamily. The only variants in the CNL subfamily (apart from variation in LRR repeat number and a possible fusion with an RPW8 homolog) are CN (no LRR) and CNL (the canonical structure). In contrast, there are 10 domain arrangements in the TIR subfamily, counting the canonical TNL: N, NL, NT, NTNL, TN, TNL, TNLT, TNLTNL, TNTNL, and TTNL.
The much greater domain diversity in the TNL subfamily compared with CNL might be explained in part by their exon-intron structure. The CNL proteins mostly are encoded by a single exon, unlike TNLs that usually are encoded by multiple exons (Meyers et al., 1999
Intriguingly, much of the structural diversity in TNL genes exists in a small number of clusters, suggesting a linkage between physical organization in the genome and the origin of novelty in gene structure. All of the unusual TNL domain arrangements except one (TNLTNL) are found in a single clade (Fig. 2C, TNL-8). Most of these sequences fall into two classes: singletons on Mt8, Mt5, Mt7, and Mt3, and a cluster on Mt4. Several pairs of most-similar sequences in this clade are singletons on Mt5 and Mt8, which show the largest amount of internal synteny when the M. truncatula genome is compared with itself (Cannon et al., 2006 In general, expression patterns (at least measured by counts of EST matches) are highly variable and are not strongly associated with domain structure or sequence similarity. This is especially striking on chromosome Mt6. Here, there are frequent instances of neighboring NBS genes differing significantly in both expression and structure. For example, the six TNL genes in one cluster on Mt6 (on a single BAC clone, AC126790) have EST counts ranging from 0 to 31. These same TNL genes also display four distinct domain combinations and differ in upstream WBOX counts, which range from 0 to 11.
An examination of pseudogenes supports rapid turnover of genes in this gene family and identifies some particularly active clusters that have generated both large numbers of diverse new genes and pseudogenes. A relatively restrictive criterion for identifying pseudogenes finds 49, in comparison with the 333 predicted NBS genes reported here. This proportion is similar to that observed in the Arabidopsis TN and TIR-X subfamilies, which contain 47 genes and four pseudogenes, respectively (Meyers et al., 2002).
It also is interesting to note that at least some of the pseudogenes may be expressed, and therefore not under neutral selection. Four of the predicted pseudogenes have near-perfect (99%–100% identity) support from ESTs. In one such case, the full 716-nt EST contig matches the genomic pseudogene, and both contain at least three stop codons. Expressed pseudogenes have been observed to regulate the messenger-RNA stability of the corresponding homologous coding gene (Hirotsune et al., 2003).

Some mechanisms of gene turnover are suggested by the distribution of NBS pseudogenes in comparison to predicted NBS genes. Most (91.8%) of the pseudogenes are found within 100 kb of predicted NBS genes, suggesting that most turnover occurs within clusters. However, there is clearly a greater rate of turnover in some clusters than in others. A large excess of pseudogenes is present on Mt4, with 15 observed versus 6.0 expected if the 49 pseudogenes were distributed as are the 333 predicted NBS genes. Further, most (10) of the pseudogenes on Mt4 occur in the cluster that accounts for a large portion of domain diversity in the TNL subfamily (Table I; Fig. 2C, bottom). Thus, this large TNL cluster on Mt4 has generated unusual diversity, much of which has apparently been discarded, but some of which has been retained and includes highly expressed genes (Supplemental Table S2).

Just as the diverse Mt4 cluster has generated a large share of both diverse genes and pseudogenes, other chromosomes have fewer pseudogenes than expected, specifically Mt3, with four observed versus 13.5 expected. These pseudogenes are in the CNL subfamily and occur in the large CNL cluster on Mt3. That cluster contains only two domain arrangements (domain classes 1 and 2 in Table I) and is in this sense more conservative than the Mt4 TIR clusters. A much smaller portion of these genes have clearly become pseudogenes, suggesting a clade expansion in which most genes have been accepted.
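The expected pseudogene counts quoted above follow from simple proportionality: a chromosome's share of the 333 predicted NBS genes is multiplied by the 49 pseudogenes. The short Python sketch below illustrates that arithmetic only. The function name and the per-chromosome gene counts (41 and 92) are hypothetical inputs picked so that the output lands near the 6.0 and 13.5 quoted in the text; they are not per-chromosome counts taken from this article.

```python
def expected_pseudogenes(genes_on_chrom, total_genes=333, total_pseudogenes=49):
    """Expected pseudogene count on a chromosome if pseudogenes were
    distributed in proportion to the predicted NBS genes."""
    return total_pseudogenes * genes_on_chrom / total_genes


# Hypothetical gene counts, for illustration only (not values from the article).
example_gene_counts = {"chromosome_A": 41, "chromosome_B": 92}

for chrom, n_genes in example_gene_counts.items():
    print(chrom, round(expected_pseudogenes(n_genes), 1))
```

Comparing such expected values with the observed counts (for example, 15 observed on Mt4) is what identifies a chromosome as enriched or depleted in pseudogenes.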
The NBS-LRR gene family remains, despite a great deal of work on many fronts, fascinating and surprising. There was little reason to suspect, prior to the sequencing of M. truncatula, that there would be 3 times as many NBS-LRRs in this genome as in Arabidopsis, or that they would dominate large parts of two chromosomes (Mt3 and Mt6). Similarly, it was surprising to find such domain novelty, and to find that almost all the domain novelty exists in the TNL subfamily, and most of that within one genomic cluster. Besides raising more intriguing questions (e.g. precisely how do NBS-LRR translocations occur, and are frequencies different between genomes?), these findings have direct practical agronomic implications. The dramatic pace of birth and death in the family is emphasized by the fact that the large majority of M. truncatula NBS-LRRs exist in cluster nurseries, and will not have one-to-one correspondences to NBS-LRRs in other species. A striking counterexample exists, however, for a minority of genes, which seem to follow a different, more stable evolutionary trajectory.
Identification of TNL and CNL Sequences and Pseudogenes in Medicago truncatula
We used the 1.0 draft genome assembly generated by the MGSC (http://medicago.org/genome/release1.0), with gene predictions from the IMGAG (Town, 2006).

Consensus of CNLs used for BLAST search was as follows:

erpsestiVGletmleklwnrLledndvgivgiyGMGGVGKTTLatqifNdfdvkgehFdrviWVvVSkefnvekiqqdIlekLglgdeewlekteeekaaeienLfqlLegKkfLLvLDDvWekevdLdkigvpfPdrenGsKvlfTTRsesvavcgdmgvdxmevecLtpeeAWeLFqkkvfentlksdpeieelaKevvkkCgGLPLAlkVlGgllacKrtvqEWkraievlssslaaefsgmessilpvLklSYdnLppelKsCFLYcalFPEDykIekekLieyWiaEGfideseggetaedvGyeylgeLVrrsLleegdktdnetsrketVkMHDvvREmALwiaseegfkeviiVraGvglreipnvkswntvrRmSlmnneieelldspenpklrsLltlllqsnsh.

Consensus of TNLs used for BLAST search was as follows:

RDFddlVGiEaHlekmksLLcLdsdeVrMVGIwGPaGIGKTTIARALfsqLSssFqlsaFmenlrgsyStrpaglDeYsmKLhLQeqfLSkILnqkDikIhHLGvieERLkdqKVLIiLDDVDdleQLdALAketqWFGpGSRIIVTTeDkqLLkaHgInhIYeVgfPSkeeALqIFCrsAFgQnsPpdGFeeLAreVtkLaGnLPLGLrVlGSsLRGkskeeWedmLpRLrtsLDgkIekvLrvsYDgLhekDqaLFLhIACfFNgekvdyVkalLadsnLDVrqGLkvLadKSLIhisplgdgtieMHnLLqqLGReIVrkQsidePgKRqFLvDaeeIcdVLtdnTGtgsVlGIslDtseieeelnIsekAFegMrNLqFLriykksfrddgk.
Pseudogenes were identified using the same consensus CNL and TNL sequences, using a tblastn search (Altschul et al., 1997).
Candidate NBS-LRR proteins were provisionally assigned to either the CNL or TNL groups on the basis of similarity, then were aligned to an HMM calculated from a large collection of TNL and CNL extended NBS domains (Cannon et al., 2002).

Consensus from NBS HMM used for whole-family NBS alignment was as follows:

GKTTLAraVYNkiadhFeakcFlcvvrefsvkhxlkhlqkqlxxxxxkeikldnvleglsiilkrLsgKKvLLVLDDVwneeQLeaLaggldwxxpGSRIIITTRdkhvLsshgvvrxxtYevegLneeealeLFckkAFkgxxspvdpeYeeigkkiVkycgGLPL.
Prior to phylogeny construction, sequences containing fewer than 75% of the HMM match-state residues were excluded from subsequent analysis, and indels and poorly aligning regions were removed by trimming regions outside the HMM match states. Also, although the IMGAG pseudomolecule assembly process removed most overlapping regions, some redundant sequence remains in the 1.0 draft in unfinished BAC clones.

Phylogenies were calculated using parsimony and bootstrapped neighbor joining. Parsimony trees were calculated using protpars in the Phylip suite (PHYLIP [Phylogeny Inference Package] version 3.6; distributed by the author). The input sequence order was jumbled five times, and a topology was calculated based on each data order. One most-parsimonious tree was chosen at random to serve as the basis for branch length calculations. Maximum likelihood branch lengths were calculated on the parsimony topologies using TreePuzzle 5.2 (Schmidt et al., 2002).
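The coverage filter and match-state trimming described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the procedure actually used: it assumes HMMER-style aligned strings in which uppercase letters are residues in match states, "-" marks a deleted match state, and lowercase letters and "." belong to insert states. It also assumes that sequences below the 75% cutoff are the ones set aside. All function names and the toy sequences are hypothetical.

```python
def match_state_coverage(aligned):
    """Fraction of HMM match columns occupied by a residue.

    Assumed convention: uppercase = match-state residue, '-' = deleted
    match state, lowercase and '.' = insert-state columns.
    """
    match_columns = [c for c in aligned if c.isupper() or c == "-"]
    if not match_columns:
        return 0.0
    filled = sum(1 for c in match_columns if c != "-")
    return filled / len(match_columns)


def trim_to_match_states(aligned):
    """Remove insert-state characters, keeping only match columns."""
    return "".join(c for c in aligned if c.isupper() or c == "-")


def filter_alignment(records, min_coverage=0.75):
    """Keep sequences meeting the coverage cutoff, trimmed to match states."""
    return {
        name: trim_to_match_states(seq)
        for name, seq in records.items()
        if match_state_coverage(seq) >= min_coverage
    }


# Toy aligned sequences, for illustration only.
toy_records = {"geneA": "GKTTLAra.VYNK", "geneB": "G--T--..----"}
print(filter_alignment(toy_records))
```

Trimming to match columns gives every retained sequence the same length, which is what distance and parsimony programs require of their input alignment.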
Domains were predicted using hmmpfam (Eddy, 2003).
Medicago truncatula EST and cDNA sequences were downloaded from the GenBank nucleotide database using the query txid3880[ORGN] AND "biomol mrna"[PROP]. All EST/cDNA sequences were mapped to Mt1.0 BAC sequences using the program GMAP (Wu and Watanabe, 2005).
To estimate the number of NBS-LRR genes in the EST collection, 55,182 Medicago Transcript Assemblies and singleton sequences (TA unigenes, release 2) were downloaded from http://plantta.tigr.org. Identified M. truncatula NBS-LRR protein sequences were used as query sequences to search against the TA unigenes by BLAST (Altschul et al., 1997).
For each predicted NBS gene, the 2-kb upstream region was selected according to the position of the gene provided by the IMGAG annotation (Medicago Sequencing Resources) on the BAC sequences of M. truncatula. The extracted sequences were screened against the PLACE database (Higo et al., 1999).
Comparisons of NBS-LRR gene duplications and large-scale genomic duplications were carried out using the Medicago genome pseudomolecule build (Mt1.0; http://www.medicago.org/genome/downloads/Mt1). Syntenic regions were predicted using National Center for Biotechnology Information blastp self comparisons at E-value 1e-10, then filtering to consider only the top reciprocal best hit between each chromosome pair, then synteny prediction using DiagHunter (Cannon et al., 2003).
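The reciprocal-best-hit filtering step can be illustrated with the Python sketch below. It is a simplified version under stated assumptions: hits are supplied as (query, subject, E-value, bit score) tuples, self-hits are ignored, the best hit is the one with the highest bit score, and the restriction to each chromosome pair described above is omitted. The function names and the toy hits are hypothetical and do not reproduce the actual pipeline.

```python
def best_hits(hits, max_evalue=1e-10):
    """Map each query to its highest-scoring subject passing the E-value cutoff.

    hits: iterable of (query, subject, evalue, bitscore) tuples (assumed format).
    """
    best = {}
    for query, subject, evalue, bitscore in hits:
        if query == subject or evalue > max_evalue:
            continue  # skip self-hits and weak hits
        if query not in best or bitscore > best[query][1]:
            best[query] = (subject, bitscore)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(hits, max_evalue=1e-10):
    """Return sorted gene pairs that are each other's best hit."""
    best = best_hits(hits, max_evalue)
    pairs = {
        tuple(sorted((query, subject)))
        for query, subject in best.items()
        if best.get(subject) == query
    }
    return sorted(pairs)


# Toy self-comparison hits, for illustration only.
toy_hits = [
    ("geneA", "geneB", 1e-50, 200.0),
    ("geneB", "geneA", 1e-50, 200.0),
    ("geneA", "geneC", 1e-20, 90.0),
    ("geneC", "geneA", 1e-20, 90.0),
    ("geneD", "geneA", 1e-5, 40.0),
]
print(reciprocal_best_hits(toy_hits))
```

Requiring reciprocity discards one-directional matches, such as a member of a large family whose best hit has a closer partner elsewhere, which reduces noise before collinear runs are sought.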
The following materials are available in the online version of this article.
Thanks to Roxanne Denny for lab assistance and to Xiaohong Wang, Jayprakash Vasdewani, and Ethalinda Cannon for bioinformatics assistance. Received June 25, 2007; accepted October 19, 2007; published November 2, 2007.
1 This work was supported by the National Science Foundation (grant nos. 0321664 and 0321460 to N.D.Y.). The author responsible for distribution of materials integral to the findings presented in this article in accordance with the policy described in the Instructions for Authors (www.plantphysiol.org) is: Steven B. Cannon (steven.cannon@ars.usda.gov).
[W] The online version of this article contains Web-only data.
[OA] Open Access articles can be viewed online without a subscription. www.plantphysiol.org/cgi/doi/10.1104/pp.107.104588 * Corresponding author; e-mail steven.cannon{at}ars.usda.gov.