Plant Physiology 135:1179-1197 (2004)
© 2004 American Society of Plant Biologists
Computational Identification and Characterization of Novel Genes from Legumes
Department of Plant Biology, University of Minnesota, St. Paul, Minnesota 55108
The Fabaceae, the third largest family of plants and the source of many crops, has been the target of many genomic studies. Currently, only the grasses surpass the legumes for the number of publicly available expressed sequence tags (ESTs). The quantity of sequences from diverse plants enables the use of computational approaches to identify novel genes in specific taxa. We used BLAST algorithms to compare unigene sets from Medicago truncatula, Lotus japonicus, and soybean (Glycine max and Glycine soja) to nonlegume unigene sets, to GenBank's nonredundant and EST databases, and to the genomic sequences of rice (Oryza sativa) and Arabidopsis. As a working definition, putatively legume-specific genes had no sequence homology, below a specified threshold, to publicly available sequences of nonlegumes. Using this approach, 2,525 legume-specific EST contigs were identified, of which less than three percent had clear homology to previously characterized legume genes. As a first step toward predicting function, related sequences were clustered to build motifs that could be searched against protein databases. Three families of interest were more deeply characterized: F-box related proteins, Pro-rich proteins, and Cys cluster proteins (CCPs). Of particular interest were the >300 CCPs, primarily from nodules or seeds, with predicted similarity to defensins. Motif searching also identified several previously unknown CCP-like open reading frames in Arabidopsis. Evolutionary analyses of the genomic sequences of several CCPs in M. truncatula suggest that this family has evolved by local duplications and divergent selection.
Legumes constitute a large plant family that presents humans with a treasure trove of resources for a variety of uses. Throughout the world, legumes provide important sources of protein, oil, mineral nutrients, and nutritionally important natural products (Graham and Vance, 2003
An important feature of legumes is their ability to obtain nutrients via symbioses with soil microbes. The formation of nitrogen-fixing nodules via interaction with bacteria collectively known as rhizobia is virtually unique to legumes, although some species in eight families of the eurosid I clade of dicots can form nodules in association with nitrogen-fixing actinomycetes (Soltis et al., 1995
The many uses of legumes and the variety of symbiotic and pathogenic interactions found provide numerous targets for functional genomics research. Currently, there are nearly 700,000 nucleotide sequences representing the Fabaceae available from the National Center for Biotechnology Information (NCBI taxonomy browser, http://www.ncbi.nlm.nih.gov/Taxonomy, November, 2003). In particular, significant strides have been made in the functional genomics of the model legumes M. truncatula and Lotus japonicus and the crop legume soybean (G. max and Glycine soja; VandenBosch and Stacey, 2003 The wide availability of ESTs and other gene-containing sequences from plants, including the near complete rice (Oryza sativa) and Arabidopsis genomes, provides an opportunity for comparative sequence analysis. For example, the well-characterized Arabidopsis genome is an asset to identify orthologous genes in other, less well-characterized species. However, comparative approaches may also be used to identify genes that are taxon specific.
In the case of legumes, it is clear that many of the genes involved in the hallmark legume functions of nodulation or isoflavonoid biosynthesis likely evolved from pathways shared among many plant species. For example, several classes of broadly conserved receptor kinases have been found to be required for nodule formation or regulation of nodule numbers (Parniske and Downie, 2003
The goal of this project was to use available sequence databases to quickly and efficiently identify and characterize sequences from M. truncatula, soybean, and L. japonicus that lack close homologs in other nonlegume taxa. As an operational definition, legume-specific genes have no BLAST (Altschul et al., 1997
Identification of Legume-Specific Genes
A series of BLAST analyses was used to identify and remove broadly conserved sequences from the set of expressed sequences from legumes. A summary of the approach and the results are presented in Table I. Initially, the TIGR M. truncatula, L. japonicus, and soybean (G. max and G. soja) TCs were compared to the TIGR maize (Zea mays, ZmGI), tomato (Lycopersicon esculentum, LeGI), rice (OsGI), and Arabidopsis (AtGI) gene indices (Quackenbush et al., 2001
With the remaining subset of putatively legume-specific TCs, increasingly stringent and computationally intensive searches were used to identify homologous sequences from nonlegume species. Stringency levels were raised by increasing the number of sequences present in the nonlegume data set and/or by changing the BLAST analysis from BLASTN to BLASTX or TBLASTX. As before, legume TCs with homology to nonlegume sequences with an E-value more significant than 10^-4 were eliminated. BLASTX was used to compare the remaining 3,973 legume-specific TCs to the GenBank protein database (NR). Three hundred and forty additional legume TCs with homology to nonlegume sequences were identified and removed. TBLASTX was then used to compare the remaining 3,633 legume-specific TCs to the remaining nonlegume plant gene indices available at TIGR: barley (Hordeum vulgare), Chlamydomonas reinhardtii, cotton (Gossypium spp.), grape (Vitis vinifera), ice plant (Mesembryanthemum crystallinum), lettuce (Lactuca sativa), Pinus spp., potato (Solanum tuberosum), rye (Secale cereale), sorghum (Sorghum bicolor), sunflower (Helianthus annuus), and wheat (Triticum aestivum). An additional 283 legume TCs with homology to nonlegume sequences were eliminated.
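The successive filtering rounds described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: it assumes each search round was saved as BLAST tabular output in which column 1 is the query ID and column 11 is the E-value, and the function names, TC identifiers, and hit rows are hypothetical.

```python
import csv
import io

E_CUTOFF = 1e-4  # threshold stated in the text


def conserved_queries(blast_tsv, cutoff=E_CUTOFF):
    """Return query IDs with at least one hit more significant than cutoff.

    Assumes tab-separated BLAST output: column 1 = query ID,
    column 11 = E-value.
    """
    hits = set()
    for row in csv.reader(blast_tsv, delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        if float(row[10]) < cutoff:
            hits.add(row[0])
    return hits


def filter_rounds(candidates, rounds):
    """Apply each search round in turn, dropping conserved queries."""
    remaining = set(candidates)
    for blast_tsv in rounds:
        remaining -= conserved_queries(blast_tsv)
    return remaining


# Toy data: identifiers and hits are invented for illustration only.
round1 = io.StringIO(
    "TC1\tAt1\t90.0\t100\t5\t0\t1\t100\t1\t100\t1e-30\t200\n"
    "TC2\tOs9\t40.0\t50\t30\t0\t1\t50\t1\t50\t0.5\t30\n"
)
round2 = io.StringIO(
    "TC3\tLe4\t60.0\t80\t20\t0\t1\t80\t1\t80\t2e-6\t60\n"
)
legume_specific = filter_rounds({"TC1", "TC2", "TC3", "TC4"}, [round1, round2])
print(sorted(legume_specific))
```

In this toy run TC1 and TC3 are removed because they have hits below the cutoff, while TC2 (weak hit only) and TC4 (no hit) are retained.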
Until this point, all of the described BLAST searches involved nonlegume sequences that most likely represented real genes, including those that have been identified experimentally and others that have been predicted from genomic sequences. Since gene prediction programs are fallible (Zhang, 2002 One area of concern was that some of the putative legume-specific TCs might be too short to yield informative BLAST hits with nonlegume sequences. One example would be if the TC corresponded to the untranslated portions of a transcript. Since untranslated regions do not code for protein, they are less likely to be conserved among species. In order to identify sequences falling into this category, we used TBLASTX to compare the remaining legume-specific genes to the original set of legume TCs. If a legume-specific TC showed homology to a lengthier legume TC that was not legume specific, it was unlikely to be legume specific. However, if a legume-specific TC had homology only to itself or other remaining legume-specific TCs, it was retained in the legume-specific category. After this analysis, 2,525 total legume-specific TCs remained from all three species (Table I and Supplemental Table I, which can be viewed at www.plantphysiol.org).
To confirm that the identified M. truncatula legume-specific TCs were of plant origin and not that of the nitrogen-fixing symbiont, the M. truncatula legume-specific TCs were compared to the complete genome of Sinorhizobium meliloti (Galibert et al., 2001
Following the analyses above, 92, 861, and 1,572 legume-specific TCs remained in L. japonicus, M. truncatula, and G. max/soja, respectively. A comprehensive list of all TCs identified is presented in Supplemental Table I, while a summary of TC length and the number of ESTs per TC in the legume-specific TCs can be found in Table II. BLASTX analysis to the GenBank NR database revealed that less than three percent of the identified legume-specific TCs had homology to other legume sequences with an E-value less than 10^-4 (Supplemental Table II). Of the five most highly expressed TCs in M. truncatula, two (MtTC59272 and MtTC60015) had sequence homology (E-value
Single Linkage Clustering and Motif Analysis
One approach for hypothesizing function for legume-specific TCs lacking significant BLAST hits was to scan the TCs for conserved motifs identified from other proteins. InterProScan (European Bioinformatics Institute, Hinxton, UK; Mulder et al., 2003 A second approach was to mine groups of related legume-specific genes for common, uncharacterized motifs. Motif searching can be more sensitive than BLAST analysis for several reasons. First, highly conserved residues carry more weight in motif analysis. Second, a minimum exact word match is not required. Finally, the information from many homologs can be combined in a single motif. Once a novel motif description is generated, it can be used to scan the public protein databases for matches. Common motifs among target proteins suggest possible functions. In order to identify families of legume-specific TCs for motif analysis, single linkage clustering analysis of legume-specific TCs was performed. Including as many diverse sequences as possible within a cluster increases the likelihood of identifying conserved motifs. Therefore, the 2,525 legume-specific genes were combined with 50, 672, and 688 homologous singletons from L. japonicus, G. soja/max, and M. truncatula, respectively. Clustering identified 665 groups that corresponded to potential gene families or cross-species homologs. The groups are identified in Supplemental Table I, with their size distribution shown in Supplemental Figure 1. The majority of TCs did not cluster with other sequences and are denoted as Group 0 in Supplemental Table I. Sixty-seven groups were chosen for motif analysis based on their size and/or tissue-specific expression patterns. For 17 of the groups, no obvious open reading frame with shared motifs could be found, suggesting that the sequences in these groups were not full length, represented unusually long untranslated regions, or that the gene products themselves were RNAs. 
Nine additional groups were almost certainly not legume specific. In these cases, TCs clustered with singletons that had strong hits to nonlegumes. Motif analyses in many of the remaining groups were quite fruitful. Thirteen groups had Motif Alignment Search Tool (MAST) or hidden Markov model (HMM) hits (E-value < 10^-4) to nonlegume Swiss-Prot/TrEMBL sequences or to the Arabidopsis genome. The member sequences, motifs, and sequence alignments for each group are provided in additional supplemental data available at www.medicago.org/documents/Publications/Graham04_supplement. Three classes of families are particularly noteworthy and are described in detail below: (1) novel F-box proteins, (2) Pro-rich cell wall proteins, and (3) small Cys-rich proteins. Tables III and IV and Figures 1 and 2 demonstrate the general features of groups in these categories.
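Single linkage clustering, as used above to group TCs and singletons, places two sequences in the same group whenever a chain of pairwise similarity links connects them. The following is a minimal union-find sketch of that idea. The sequence names and links are hypothetical, and how a "link" was defined in the actual analysis (for example, which similarity threshold) is not assumed here.

```python
def single_linkage(ids, links):
    """Group ids so that any two joined by a chain of links share a group."""
    parent = {i: i for i in ids}

    def find(x):
        # Follow parent pointers to the root, shortening the path as we go.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a  # merge the two groups

    groups = {}
    for i in ids:
        groups.setdefault(find(i), []).append(i)
    return sorted(sorted(g) for g in groups.values())


# Toy input: TC1 and TC2 are not linked directly, only through singleton S1.
ids = ["TC1", "TC2", "TC3", "S1", "S2"]
links = [("TC1", "S1"), ("S1", "TC2")]
clusters = single_linkage(ids, links)
print(clusters)
```

Sequences that end up alone, such as TC3 and S2 here, correspond to the unclustered entries that the text labels Group 0. The example also shows why adding singletons helps: TC1 and TC2 join the same family only because S1 bridges them.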
F-Box Proteins
Two groups (640 and 630) of sequences without significant BLAST similarity with one another clearly shared motifs with discrete families of F-box proteins. Group 640 had five sequences from M. truncatula, from a variety of root and shoot libraries (Table III). Multiple EM for Motif Elicitation (MEME; Bailey and Elkan, 1994
One area of concern was that members of group 640 were not full-length genes. The missing portions of these genes could encode F-box domains. Therefore, the 22 hits from the M. truncatula genome, which were identified by a domain outside of the F-box, were scanned for F-box domains using InterProScan (Zdobnov and Apweiler, 2001
A second group (group 630) of F-box-containing proteins from M. truncatula was also found among the legume-specific groups. While no BLAST hit was found for this group, InterProScan identified a cyclin-like F-box (IPR001810) for one of the members of the group, but not the others, which may not have been full length. Using HMMs created for this group, we found several hits to the Arabidopsis genome (At1g20790, At5g18160, and At1g15680; Table IV). All of the significant hits were annotated as hypothetical F-box-containing proteins at the Arabidopsis Information Resource (TAIR, http://www.arabidopsis.org; Huala et al., 2001
Group 5 included 57 sequences from M. truncatula (one TC and 56 singletons) and seven singletons from soybean (Table III). Sequences within this group were composed of pentameric and hexameric repeats and had homology to Pro-rich cell wall proteins from a variety of legumes. BLAST analyses with sequences from this family were problematic. Using a low-complexity filter during BLAST effectively screened out the mature protein sequences because they were repetitive. However, removing the filter resulted in hits to numerous Pro-rich proteins in which the order and arrangement of Pro residues were not conserved. Similar problems were encountered using MEME. Therefore, Perl regular expression patterns were used, as described in the methods section. The first pattern identified proteins containing at least three subunits of PPVEK, PPVYK, or PPVVK, in any combination. This pattern identified 11 sequences in Swiss-Prot/TrEMBL from a diverse set of legumes, plus a single sequence from carrot (Daucus carota; Table IV). As a more exhaustive test, we scanned dbEST for the nucleic acid equivalent of our motif. This search identified 1,824 ESTs from a variety of legumes and sunflower. Many of these corresponded to TCs or singletons in our original list, but the frequency of misassembly for this repetitive Pro-rich family at TIGR was high, based on visual inspection of contigs. Therefore, all the identified ESTs were assembled into contigs. In the end, 59 unique consensus sequences were identified: 27 from M. truncatula, 15 from G. max/soja, 1 from L. japonicus, 3 from Lupinus luteus, 3 from Lupinus albus, 1 from Phaseolus coccineus, and 9 from sunflower. The carrot and sunflower sequences were significantly different from their legume counterparts, having exclusively valines and/or histidines in the fourth position of the repeat (Fig. 1B). Among the legumes, valines were found in this position infrequently, and histidines were rare. 
Noting the variations observed in the carrot and sunflower sequences, we generalized the motif to (PP[ILMV][EYVHNA][KT]){3}. This search identified two additional sequences from carrot and one from Arabidopsis. The Arabidopsis sequence is significantly different from all other sequences identified; it is several times longer, contains hexameric in addition to the usual pentameric repeats, and it frequently has Thr rather than Lys in the last position (Fig. 1B). Two other groups of Pro-rich proteins were also identified (group 485 and group 699). Group 485 was made up of two soybean sequences and one M. truncatula sequence. Group 669 was made up of three M. truncatula sequences. Neither group had any apparent tissue specificity (Table III). Like group 5, members of groups 485 and 669 also encoded a signal peptide and Pro-rich mature peptide (Table IV). However, groups 485 and 699 lacked the repetitive Pro-rich pentamers and encoded much smaller peptides. Motif analysis using these small groups was unable to identify similar sequences in the M. truncatula or Arabidopsis genomes.
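The generalized motif quoted above, (PP[ILMV][EYVHNA][KT]){3}, uses syntax that works unchanged in most regular expression engines. The sketch below applies it with Python's re module rather than Perl. The stricter pattern is only an assumed reading of the first search (three consecutive PPVEK, PPVYK, or PPVVK units), and all test peptides are invented.

```python
import re

# Generalized motif exactly as given in the text (non-capturing group added).
GENERAL = re.compile(r"(?:PP[ILMV][EYVHNA][KT]){3}")

# Assumption: the first pattern required three *consecutive* units drawn
# from PPVEK, PPVYK, and PPVVK; the text does not spell this out.
STRICT = re.compile(r"(?:PPV[EYV]K){3}")


def classify(protein):
    """Return which of the two patterns a protein sequence matches."""
    return {
        "strict": STRICT.search(protein) is not None,
        "general": GENERAL.search(protein) is not None,
    }


# Invented peptides for illustration only.
examples = {
    "legume_like": "MASPPVEKPPVYKPPVVKTT",   # three mixed units in a row
    "his_variant": "PPVHKPPVHKPPVHK",        # His in the fourth position
    "thr_variant": "PPIETPPIETPPIET",        # Thr in the last position
    "interrupted": "PPVEKPPVYKAAPPVVK",      # only two consecutive units
}
results = {name: classify(seq) for name, seq in examples.items()}
for name, outcome in results.items():
    print(name, outcome)
```

The His and Thr variants match only the generalized pattern, mirroring how broadening the character classes recovered the additional carrot and Arabidopsis sequences.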
A large fraction of the legume-specific genes encoded Cys-cluster proteins (CCPs). These share several common features: (1) an N-terminal signal sequence, (2) a small, highly charged or polar mature protein sequence, (3) a characteristic arrangement of 4, 6, 8, or 10 Cys residues likely involved in disulfide bridges, (4) an apparent tissue-specific expression profile, and (5) low similarity to other expressed CCPs (Table III). The largest group of CCPs, group 31, is made up almost entirely of M. truncatula sequences. The group contains 197 M. truncatula singleton sequences and 136 TCs, all of which are expressed almost exclusively in nodules. A small fraction (less than 1%) of EST sequences from this group were also found in other root tissues. While the majority of sequences in this group had no BLAST homology to sequences within NR, approximately 20% had low-homology hits (E-value approximately 10^-4) to the same group of sequences: a hypothetical protein from G. orientalis (CAB51773 Kaijalainen et al., 2002 The MEME program, when applied to the nodule-specific CCPs (group 31), generated three motifs of 21, 15, and 11 amino acids, respectively. The first motif encoded the signal peptide, and the remaining two contained pairs or triplets of Cys residues and their surrounding residues (Table IV). Scans of Swiss-Prot/TrEMBL using MAST recovered all of the GenBank BLAST hits of individual TCs mentioned above, plus several new hits with significance better than 10^-4 (Table IV). The new hits included several more genes from legumes annotated as early and late nodulins, plus a set of five potassium channel-blocking neurotoxins from Manchurian scorpion (Mesobuthus martensii). Further, the dozen hits with E-values between 0.1 and 10^-4 were almost all K+ channel-blocking neurotoxins from a variety of scorpion species. All scorpion toxins hit only motifs 2 and 3, having their own distinctive signal peptide. 
The extensive divergence among the nodule-specific CCPs made the creation of an accurate multiple-sequence alignment impossible for the whole group. However, an accurate alignment is a prerequisite for building an effective HMM. Hence, 261 sequences from group 31 were distributed among 11 distinct subgroups. A final set of HMMs was created to describe each of the 11 subgroups of nodule-specific CCPs. Alignments of the largest two subgroups, 31.01 and 31.02, are displayed in Figure 2A. For HMM generation, the subgroups were deliberately modeled without the signal peptide, since only the mature peptide itself proved to be similar to the scorpion toxins in the MEME/MAST analysis. Since HMMs were specifically designed for each subgroup, none of the 11 HMMs picked up as many significant hits as had the MEME motifs when scanning Swiss-Prot/TrEMBL. However, they proved to be useful in subdividing the large diverse family. For example, only subfamily two picked up any hits to scorpion toxins. The 11 HMMs were then used to scan the Arabidopsis genome, which was translated into all six reading frames. These searches yielded eight hits with E-values more significant than 10^-4, but seven of these hits were accounted for by subgroups two (subgroup 31.02; Fig. 2B) and nine (subgroup 31.09). Seven of the eight hits to the Arabidopsis genome lie in regions with no predicted genes on chromosomes one, two, and four of TIGR Arabidopsis sequence 4.0 (TAIR). The remaining hit, At1g43720, was a predicted hypothetical gene. In contrast to Arabidopsis, the rice genome had no significant hits to any of the 11 subgroups. In addition to the group of nodule-specific CCPs, we also identified several groups of predominantly seed-specific CCPs (Table III). Group 645 was composed of 10 M. truncatula TCs and two singletons corresponding to the pods with seeds and immature seed libraries. 
Groups 38 (5 TCs, 6 singletons), 40 (2 TCs and a singleton), and 41 (2 TCs and a singleton) were soybean specific and were composed of ESTs from immature seed coats, seed coats, and very young seeds. Group 36 contains one M. truncatula singleton, one soybean singleton, and two soybean TCs composed of ESTs from mature and immature seed coats, mature pods, and immature cotyledons. Group 655 (2 TCs and a singleton) corresponded to developing flowers and pods with seeds. Unlike the nodule-specific CCPs, none of these groups had significant BLAST hits in the NR database. However, if a lower E-value cutoff of 10^-4 were used, many would cluster with the nodule-specific CCPs. The first round of MEME motif building did not yield any hits more significant than 10^-4 for any of the seed-specific Cys-rich protein groups. However, the first round analysis for group 645 had a weak (E-value > 10^-4) hit to a protease inhibitor from pear (Pyrus communis) that conserved the exact Cys positions (Table IV). Adding this single sequence to the second round of MEME analysis resulted in a set of motifs that had significant hits to nearly 100 proteins in Swiss-Prot/TrEMBL. Collectively, these included sequences annotated as gamma thionins, protease inhibitors, insect and plant defensins, and sodium channel-blocking scorpion toxins. A few examples of these can be seen in Figure 2C. A single HMM was created from an alignment of all 12 original members from group 645, the largest seed-specific CCP group. Like the first iteration of the MEME motif, the HMM also did not have any significant hits to Swiss-Prot/TrEMBL. However, it had eight hits to the Arabidopsis genome and one hit to the existing M. truncatula genomic bacterial artificial chromosome (BAC) sequence (E-value < 10^-4). An alignment of these sequences is shown in Figure 2C. At5g63660 was predicted to be a plant defensin by TAIR. P82761 (Swiss-Prot/TrEMBL) is the putative self-incompatibility protein LCR46. 
The remaining six hits lie in regions not predicted to encode genes on chromosomes two, three, and four (Fig. 2C; TAIR). Group 666 also appeared to be Cys-rich and had strong sequence homology to the soybean albumin 1 precursor (GenBank accession BAA04219). This group was composed of one TC and three singletons from soybean and three TCs and singletons from M. truncatula. Two of the TCs from M. truncatula had high levels of expression. MtTC59272 and MtTC60015 were composed of 61 and 36 ESTs, respectively, mainly from roots and mycorrhizal roots. Unlike the CCPs in group 31, members of group 666 are longer, contain more conserved Cys residues, and show diminished tissue specificity.
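Several of the Arabidopsis hits reported above fall in regions without predicted genes, which is possible only because the genome was translated in all six reading frames before being scanned with protein models. A minimal sketch of six-frame translation follows. It assumes the standard genetic code and is not the tool used in the study; the input sequence is invented.

```python
from itertools import product

# Standard genetic code, with codons enumerated in TCAG order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(dna):
    """Translate complete codons; codons with unknown bases become 'X'."""
    return "".join(
        CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3)
    )


def six_frames(dna):
    """Return translations of the three forward and three reverse frames."""
    dna = dna.upper()
    reverse = dna.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(dna[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


frames = six_frames("ATGTGCAAATAA")  # invented example sequence
for name, peptide in frames.items():
    print(name, peptide)
```

Each translated frame can then be scanned with a protein motif or profile, so an unannotated open reading frame is visible to the search even when no gene model exists at that locus.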
In order to examine the genomic organization of all the CCPs, BLASTN was used to search the available M. truncatula genome sequences. Thirteen BACs were identified that had exact matches to expressed members of the nodule-specific CCP family. Seven of these BACs contained at least two predicted CCPs. Only BAC clone Mth2-34P9 (GenBank accession AC121238) was completely assembled. This BAC contained four predicted CCPs, only one of which had corresponding ESTs. CCP2 corresponded to TC78513, which contained 10 ESTs from four different nodule libraries spanning four to 60 d postinoculation with S. meliloti. Comparison of the EST sequences of CCP2 with the genomic sequence revealed it did not contain an intron. Given the structural similarity to CCP1 and CCP3, it is unlikely that these have introns. However, CCP4 appears to contain a 314-bp insertion or intron relative to the other CCPs. Without EST confirmation, it is impossible to tell if CCP4 actually contains an intron or is a pseudogene. Dot plot analysis was used to examine the organization of CCPs within BAC Mth2-34P9 (Fig. 3). Initially three tandem repeats were identified, each of which corresponded to one CCP. Repeat R1 was 4,184 bp in length and contained CCP1. Repeat R2 was 4,192 bp in length and corresponded to CCP2. Repeat R3 was 19,948 bases in length and corresponded to CCP3 and CCP4. The large size of R3 and the extra CCP were due to a large insertion of 17,916 bases. Repeats R1 and R2 shared more than 97% nucleotide identity with each other and 92% nucleotide identity with R3.
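Dot plot analysis, as used above, compares a sequence against itself and marks positions that share an identical word; tandem repeats then appear as lines parallel to the main diagonal, offset by the repeat length. The sketch below is a simplified exact-match version of that idea, not the program used in the study. The word size and the toy sequence are arbitrary choices.

```python
from collections import Counter, defaultdict


def self_dot_plot(seq, k=8):
    """Return (i, j) pairs with i < j whose k-mers are identical."""
    positions = defaultdict(list)
    for i in range(len(seq) - k + 1):
        positions[seq[i:i + k]].append(i)
    dots = []
    for pos in positions.values():
        for a in range(len(pos)):
            for b in range(a + 1, len(pos)):
                dots.append((pos[a], pos[b]))
    return dots


def dominant_offset(dots):
    """Return the most common distance between matching positions."""
    offsets = Counter(j - i for i, j in dots)
    return offsets.most_common(1)[0][0] if offsets else None


# Toy sequence: an invented 20-bp unit repeated three times in tandem.
unit = "ACGTTGCAAGCTTAGGCATC"
period = dominant_offset(self_dot_plot(unit * 3))
print(period)
```

For real BAC-scale data, the dominant offsets would suggest candidate repeat lengths, which would then need confirmation by alignment because diverged copies (such as those sharing 92% to 97% identity) break exact word matches.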
Closer inspection of the large repeat units revealed a number of tandemly duplicated mini repeats (MRs). Repeats R1 and R2 each contained four tandemly duplicated MRs (MR1-1, MR1-2, MR1-3, MR1-4, and MR2-1, MR2-2, MR2-3, MR2-4, in order, respectively) that ranged in size from 502 to 523 bases. MR1-4 ended 25 bases before the predicted translation start site of CCP1, while 54 bp separated MR2-4 from CCP2. Repeat R3 contained three tandemly duplicated MRs (MR3-1, MR3-3, and MR3-4) ranging in size from 513 bases to 524 bases. MR3-4 was split by the 17,916-bp insertion. The insertion itself also contained a 617-bp MR (MR4-1). MR3-4 and MR4-1 each ended 54 bp upstream of the predicted translation start site of CCP3 and CCP4. Interestingly, we were able to identify a TC identical to portions of MR2-2 and MR2-3. TC82353 corresponded to bases 76 through 522 of MR2-2 and bases 1 through 376 of MR2-3. The TC was composed of two ESTs from senescing nodules. Further experimental analysis will be performed to determine if the MRs are being expressed as a single transcript with CCP2. Phylogenetic analyses of repeats and MRs were performed to determine how this complex arrangement of repeats might have developed (Fig. 4A). Sequence identity was highest between MRs located at the same relative position within the different larger repeats. For example, MR1-2 and MR2-2 (from the larger repeat regions R1 and R2, respectively) showed greater similarity to each other than to other MRs from their own larger repeats. This confirmed that the three tandem repeats all originated from a single ancestral sequence that likely contained four MRs and a CCP coding sequence.
The large number of repeats within this BAC and their high degree of sequence similarity suggested this region might be prone to unequal recombination. The organization of MRs in repeats R1 and R2 suggested that region R3 was missing an MR. Phylogenetic analysis revealed clustering between MR1-3, MR2-3, and MR3-3 and between MR1-4, MR2-4, and MR3-4 (Fig. 4A). The MR3-1 sequence did not clearly cluster with MR1-1 and MR2-1, or with MR1-2 and MR2-2. However, if MR3-1 was divided in half, the first half clustered with MR1-1 and MR2-1, while the other half clustered with MR1-2 and MR2-2. This suggested that unequal recombination between repeats might have deleted portions of two ancestral MRs to form a new MR (MR3-1), which was a mixture of both. To determine if this had occurred, polymorphic sites were identified by comparing conserved nucleotide positions shared between MR1-1 and MR2-1 to conserved nucleotide positions shared by MR1-2 and MR2-2 (Fig. 4B). Polymorphic sites included base changes and insertions/deletions. Once these polymorphic sites were identified, they were compared to the sequence of MR3-1. At the 10 polymorphic sites identified from consensus position 30 through 157, MR3-1 matched the sequences of MR1-1 and MR2-1. However, at the 23 polymorphic sites identified from consensus positions 215 to 550, MR3-1 matched the sequences of MR1-2 and MR2-2. Therefore, a recombination event, somewhere between consensus positions 158 and 215, combined two ancestral MRs to form MR3-1. The 17,916-bp insertion in MR3-4 is also likely due to unequal recombination.
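The breakpoint reasoning above can be expressed as a small procedure: label each diagnostic site by which parental type the chimeric sequence matches, then find the single switch point that leaves the fewest sites unexplained. The sketch below illustrates this under stated assumptions. The site positions borrow the interval endpoints from the text, but the alleles and the number of sites are invented.

```python
def label_sites(sites):
    """Label each diagnostic site 'A' or 'B' by which parent the query matches.

    sites: iterable of (position, allele_a, allele_b, allele_query),
    sorted by position. Sites where the parents agree, or where the query
    matches neither parent, are skipped as uninformative.
    """
    labels = []
    for pos, a, b, q in sites:
        if a == b:
            continue
        if q == a:
            labels.append((pos, "A"))
        elif q == b:
            labels.append((pos, "B"))
    return labels


def locate_breakpoint(labels):
    """Find the A-to-B switch that misclassifies the fewest sites.

    Returns (last position left of the switch, first position right of it,
    number of sites inconsistent with that switch).
    """
    best_split, best_cost = 0, None
    for split in range(len(labels) + 1):
        cost = sum(lab != "A" for _, lab in labels[:split])
        cost += sum(lab != "B" for _, lab in labels[split:])
        if best_cost is None or cost < best_cost:
            best_split, best_cost = split, cost
    left = labels[best_split - 1][0] if best_split > 0 else None
    right = labels[best_split][0] if best_split < len(labels) else None
    return left, right, best_cost


# Invented alleles; "-" stands for an insertion/deletion difference.
sites = [
    (30, "A", "G", "A"), (90, "C", "T", "C"), (157, "G", "-", "G"),
    (215, "T", "C", "C"), (400, "A", "-", "-"), (550, "G", "A", "A"),
]
interval = locate_breakpoint(label_sites(sites))
print(interval)
```

A cost of zero means every diagnostic site is consistent with a single crossover, and the returned positions bound the interval in which the exchange must have occurred.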
As a final step in the analysis of CCPs in this region, the ratio (Ka/Ks) of nonsynonymous (Ka) to synonymous (Ks) nucleotide substitutions was determined. If a gene is under purifying selection, the Ka/Ks ratio should be less than one (Hughes, 1999 To determine if MRs were associated with CCPs elsewhere in the genome, the different MRs were used as BLAST queries against MtGI (version 7) and the available M. truncatula genome sequence. Two TCs and a singleton that did not match the sequence of BAC Mth2-34P9 were identified. TC85293 showed homology to all the MRs and encoded a CCP. It was composed of 10 ESTs from mature nodules. TC79662 also had homology to a small portion of an MR and contained a CCP. It was composed of four ESTs from three different nodule libraries. A singleton (BE998091) from the senescing nodule library was most similar to bases 1 through 183 of MR1-2. To determine if the pattern of duplications observed was unique to BAC Mth2-34P9, another BAC with multiple CCPs was examined. BAC Mth2-10A20 (GenBank accession AC138527) contains five CCPs, four of which contain ESTs exclusively from mature and senescing nodules (TC64692, TC75960, TC64231, and TC73971). Comparisons of the ESTs with the BAC sequence revealed that the four expressed CCPs and likely the fifth CCP, whose expression has not been detected, contain introns. The introns range in length from 107 to 108 bp and are located 70 to 73 bp downstream of the translation start site. A complete genomic analysis of the CCPs on BAC Mth2-10A20 was impossible, since the remaining six sequence contigs have not been assembled. However, three CCPs are present in a single contig. This contig contains a tandem duplication, 4,629 bp in length. A second duplication, at least 1,570 bp in length, is also present. However, the extent of this duplication cannot be determined without further sequence assembly.
Using increasingly stringent BLAST searches, we have identified over 2,500 legume-specific genes from M. truncatula, L. japonicus, and G. max/soja. The analyses included comparisons to the GenBank NR and EST_others databases, as well as comparisons to the rice and Arabidopsis genomes. By the very nature of the analysis, only a small subset of the legume-specific genes identified have homology to previously characterized genes or gene products. The observed results are consistent with the representation of legume sequences in the GenBank NR database relative to those of better characterized groups. For example, within NR, there are over 117,000 protein entries for Arabidopsis alone (NCBI taxonomy browser, http://www.ncbi.nlm.nih.gov/Taxonomy, November, 2003). In contrast, for all of the legume species within the Fabaceae, there are only 11,300 protein sequences. While Arabidopsis has proven to be a useful model for many aspects of plant biology, it is not a good model for studying nodule development. Legume-specific TCs are especially enriched in transcripts from nodules; almost 56% of the ESTs corresponding to legume-specific genes in M. truncatula come from nodules. In contrast, only 22% of the ESTs corresponding to all Medicago TCs come from nodules. Future analysis of legume-specific genes expressed in nodules may therefore provide insights into novel symbiotic functions. Given the low representation of legume sequences in the GenBank databases, several approaches were taken in order to assign putative functions to the legume-specific genes. All of the legume-specific genes and homologous singletons were grouped together into families of related sequences. The sequences within a group were then mined for conserved motifs that could be used to scan the protein databases at Swiss-Prot/TrEMBL. Proteins of known function that shared these motifs could provide a hint of function. Using this technique, we identified several groups of interest. 
However, the most interesting were families of F-box related proteins, Pro-rich proteins, and Cys-rich proteins.
F-box-related proteins have been identified with a wide array of cellular functions. F-box proteins are involved in transcriptional regulation and signal transduction through Skp1, Cdc53/Cullin1, F-box protein ubiquitin-ligase complexes, transcript elongation, cell cycle transition, and self-incompatibility in plants (Kipreos and Pagano, 2000
In our analysis, we identified two groups with distant homology to F-box proteins and related domains. In Arabidopsis, roughly 35% of genes encoding F-box-related proteins exist in clusters of two to seven F-box genes (Gagne et al., 2002
The plant cell wall has several important functions, including structural support, defense against pathogens, and signaling between the plant cell and the outside world. The Hyp-rich glycoprotein superfamily includes extensins, repetitive Pro-rich proteins (RPRPs), and arabinogalactan proteins that are proposed to represent a phylogenetic continuum (Kieliszewski and Lamport, 1994
We have identified a group (group 5) of sequences encoding proteins composed of pentameric repeats with homology to RPRP wall proteins from a variety of legumes (Fig. 1B), including the well-characterized PRP1 and PRP2 from G. max (Hong et al., 1987
If one considers the pentameric Pro-rich repeat to be a functional unit within RPRPs, it is probable that legume RPRPs may have novel functions related to their novel pentameric motifs. Most notably, the PPVYK motif appears to be rare in RPRPs known to date from plants outside the legume family, and is therefore clearly not diagnostic of the RPRPs as a group. This motif functions in peroxide-mediated cross linking involving Tyr residues in both extensins and RPRPs, which renders the proteins insoluble and confers rigidity to the wall (Bradley et al., 1992
The first nodule-specific CCP homolog, ENOD3, was identified by Scheres et al. (1990)
Motif analysis of the CCPs has revealed similarity to plant defensins, whose conserved Cys residues are important in the formation of the knottin fold (Thomma et al., 2002
The similarity of CCPs to defensins is intriguing because of their potential utility in crop improvement. Defensin expression in plants can typically be found in leaves (Terras et al., 1995
Defensins have been identified throughout the animal and plant kingdoms and are thought to be members of small (15-40 members) gene families (Boman, 2003
Why are there so many more CCPs in M. truncatula? It is possible that there are not, and that additional, as yet undiscovered defensins exist in other species. The sequence diversity of defensins has made them difficult to identify experimentally and computationally. For example, using CCP motifs developed from M. truncatula, we were able to identify nine putative defensins from Arabidopsis. Only one of these had been previously identified, while the other eight were identified in regions that were not predicted to contain genes (TAIR). Advances in computational biology will likely lead to the discovery of additional defensins (Schutte et al., 2002
How have so many diverse CCPs originated? Analysis of the available genomic data suggests an evolutionary model similar to that of the nucleotide binding site (NBS)/Leu-rich repeat (LRR) family of disease resistance genes. Like the CCPs and defensins, NBS/LRR genes have been found both as single genes and clustered throughout plant genomes (for review, see Hulbert et al., 2001
Until further sequencing of the M. truncatula genome is complete, it will be difficult to determine all the mechanisms governing the evolution of the CCPs. Our analysis of BACs Mth2-34P9 and Mth2-10A20 clearly reveals evidence of multiple duplication events. Analysis of the repeat regions on BAC Mth2-34P9 has shown that unequal recombination has occurred between repeat units. If the repeats themselves can undergo recombination with paralogous repeats, it is likely their close proximity to the CCPs would allow unequal recombination to occur within the CCPs. The significant levels of nonsynonymous amino acid substitutions could be the result of recombination between paralogous CCPs within a cluster and/or the accumulation of point mutations. The identification of other TCs with MRs and CCPs that do not match BAC Mth2-34P9 suggests that these events occur elsewhere in the genome. Similar phenomena have been seen in mammalian defensin gene clusters. Semple et al. (2003)
Why then are so many CCPs expressed specifically in nodules? Like seeds, nodules are one of the largest sink tissues in plants. Fifteen to 30 percent of the net photosynthate is transported to the nodule and its surrounding root system (Schubert, 1986
One of our hypotheses was that some of the identified legume-specific genes were derived from nonlegume origins, but have diverged so much they appear unique to legumes. Using single-linkage clustering and motif analysis, we were able to identify gene families with conserved motifs. In some cases, such as the defensin-like CCPs and F-box-related proteins, the motifs identified were clearly represented across diverse taxa. Thus, as hypothesized, these genes may be examples of fast-evolving genes that are so divergent that similarity to their progenitors is not readily detectable by BLAST algorithms. Sequences that are truly novel in legumes may be present among the families that were too small for motif analysis, families where motifs could not be detected, or families whose motifs failed to detect similarity to any known proteins. Experimental analyses and sequence information from a wider diversity of organisms will aid in determining if these genes are indeed novel. While the function of many legume-specific genes could not be predicted by computational approaches, their expression patterns suggest they are worth investigating experimentally in the future. All of the legume-specific genes we have identified have been made publicly available in the supplemental data, representing a rich resource for legume biologists (see supplemental data for this article and at www.medicago.org/documents/Publications/Graham04_supplement). Among these legume-specific genes, we identified many gene families with nonspecific expression patterns. Additionally, we have identified 10 gene families specifically expressed in roots and nodules, eight in seeds, four expressed only in leaves and flowers, and seven from stressed or pathogen-inoculated tissues. The tissue specificity of these genes suggests they would make excellent candidates for transformation or gene silencing in future analyses of gene function.
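The single-linkage grouping of related sequences into families can be illustrated with a minimal Python sketch. This is not the Perl code used in this study. The function name, the (query, subject, E-value) tuple format for pairwise hits, and the default cutoff of 1e-6 are assumptions made for illustration only. Two sequences are placed in the same family whenever a chain of pairwise hits at or below the cutoff connects them.

```python
from collections import defaultdict


def single_linkage_clusters(hits, max_evalue=1e-6):
    """Group sequence IDs into families by single linkage.

    hits: iterable of (query_id, subject_id, evalue) tuples (assumed format).
    max_evalue: illustrative cutoff; hits above it do not link sequences.
    Returns a list of sorted ID lists, largest family first.
    """
    parent = {}

    def find(seq_id):
        # Union-find lookup with path halving.
        parent.setdefault(seq_id, seq_id)
        while parent[seq_id] != seq_id:
            parent[seq_id] = parent[parent[seq_id]]
            seq_id = parent[seq_id]
        return seq_id

    for query, subject, evalue in hits:
        # Register both IDs so unlinked sequences become singleton families.
        root_q, root_s = find(query), find(subject)
        if evalue <= max_evalue and root_q != root_s:
            parent[root_s] = root_q  # merge the two families

    families = defaultdict(set)
    for seq_id in list(parent):
        families[find(seq_id)].add(seq_id)
    return sorted((sorted(f) for f in families.values()),
                  key=lambda f: (-len(f), f))


if __name__ == "__main__":
    # Hypothetical hits; the IDs and E-values are made up for the example.
    example_hits = [
        ("TC1", "TC2", 1e-20),
        ("TC2", "TC3", 1e-8),
        ("TC4", "TC5", 1e-3),
    ]
    for family in single_linkage_clusters(example_hits):
        print(family)
```

Because linkage is transitive, TC1 and TC3 fall into one family in the example even though no direct hit between them is listed. This chaining is what lets divergent members of a family be grouped, and it is also why single linkage can merge loosely related sequences if the cutoff is too permissive.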
Upon request, all novel materials described in this publication will be made available in a timely manner for noncommercial research purposes, subject to the requisite permission from any third-party owners of all or parts of the material. Obtaining any permissions will be the responsibility of the requestor. Perl scripts will be made available upon request.
Unless otherwise stated, all computer analyses were performed on a single Macintosh PowerG4 computer running Mac OS 10.22 with dual 800 MHz processors. Locally installed versions of the NCBI BLASTN, BLASTX, and TBLASTX (Altschul et al., 1997
In the first BLAST iteration (Table I), the TIGR Medicago truncatula, Lotus japonicus, and soybean (Glycine max and Glycine soja) TCs were compared to the TIGR maize (ZmGI), tomato (LeGI), rice (OsGI), and Arabidopsis (AtGI) gene indices using BLASTN and TBLASTX (Table I; Quackenbush et al., 2001
BLASTN analysis against the composite genome of S. meliloti (Galibert et al., 2001
TBLASTX analyses were used to compare the remaining legume-specific TCs with the original legume TCs. Putative legume-specific TCs with significant (E-value
Prior to single linkage clustering, the legume-specific TCs were blasted against the legume singletons using TBLASTX and a 10^-6 E-value cutoff. TBLASTX was used to compare the legume-specific TCs and the homologous singletons against themselves using an E-value cutoff of 10^-6. Perl scripts were |