First published online November 21, 2007; 10.1104/pp.107.110353 Plant Physiology 146:45-59 (2008) © 2008 American Society of Plant Biologists OPEN ACCESS ARTICLE
TEnest: Automated Chronological Annotation and Visualization of Nested Plant Transposable Elements
Bioinformatics and Computational Biology, Department of Plant Pathology and Center for Plant Responses to Environmental Stresses (B.A.K., R.P.W.) and Corn Insects and Crop Genetics Research, United States Department of Agriculture-Agricultural Research Service (R.P.W.), Iowa State University, Ames, Iowa 50011–1020
Organisms with a high density of transposable elements (TEs) exhibit nesting, with subsequent repeats found inside previously inserted elements. Nesting splits the sequence structure of TEs and makes annotation of repetitive areas challenging. We present TEnest, a repeat identification and display tool made specifically for highly repetitive genomes. TEnest identifies repetitive sequences and reconstructs separated sections to provide full-length repeats and, for long-terminal repeat (LTR) retrotransposons, calculates age since insertion based on LTR divergence. TEnest provides a chronological insertion display to give an accurate visual representation of TE integration history showing timeline, location, and families of each TE identified, thus creating a framework from which evolutionary comparisons can be made among various regions of the genome. A database of repeats has been developed for maize (Zea mays), rice (Oryza sativa), wheat (Triticum aestivum), and barley (Hordeum vulgare) to illustrate the potential of TEnest software. All currently finished maize bacterial artificial chromosomes totaling 29.3 Mb were analyzed with TEnest to provide a characterization of the repeat insertions. Sixty-seven percent of the maize genome was found to be made up of TEs; of these, 95% are LTR retrotransposons. The rate of solo LTR formation is shown to be dissimilar across retrotransposon families. Phylogenetic analysis of TE families reveals specific events of extreme TE proliferation, which may explain the high quantities of certain TE families found throughout the maize genome. The TEnest software package is available for use on PlantGDB under the tools section (http://www.plantgdb.org/prj/TE_nest/TE_nest.html); the source code is available from http://wiselab.org.
Transposable elements (TEs) are mobile DNA found throughout eukaryotic organisms. Although abundance is extremely high in some organisms, little is known about the processes governing the distribution of TEs across the genome. Each classification level of a TE may exhibit different genetic makeup, different modes of replication, or preference for different genomic habitats. By the nature of their mobility, TEs have the potential to induce change throughout an organism's genome. As a consequence of multiple TE copies, unequal crossover and recombination can occur between chromosome regions. TE insertions can cause gene or regulatory mutations, altering levels of transcripts, or provide new genetic material for novel gene functions to evolve (Kidwell and Lisch, 2000
Abundance of TEs varies widely across different organisms. Human (Homo sapiens) DNA is composed of 45% (Lander et al., 2001
High quantities of TEs, especially the LTRs of retrotransposons, greatly impede sequence assembly as well as genome annotation (Rabinowicz and Bennetzen, 2006
Current repeat annotation tools have not adequately addressed the issue of nested TEs and are unable to rebuild fragmented elements. Three distinct methods of TE identification have been developed. RepeatMasker (http://www.repeatmasker.org) uses a repeat database to locate sequence matches. This provides correct identification of fragmented TEs in nested repeat clusters, but reconstruction of whole TEs and evolutionary timeline of insertions is not possible. LTR retrotransposon detection software, such as LTR_struct (McCarthy and McDonald, 2003
To fully analyze repeat-dense grass genomes, we have developed TEnest. Using a community-updated repeat database of LTR retrotransposons, non-LTR retrotransposons, DNA transposons, and other repetitive elements, TEnest will identify all TE insertions in the input sequence. With additional repeat database construction, TEnest will annotate TEs in any organism's genome. For LTR retrotransposons, TEnest will identify the two flanking LTR sequences and calculate the time since insertion based on the rate of mutation accumulation in repetitive sequences of grasses (1.3 × 10⁻⁸; Kimura, 1980
Recent evidence suggests that the high percentage of repetitive elements, especially LTR retrotransposons in maize, is due to the replication activities of just a few element families (Meyers et al., 2001
Throughout this article we follow the TE nomenclature format outlined in Wicker et al. (2007)
The nested TE identification software package TEnest has three sections for use in genome sequence analysis: the organism-specific repeat databases; TEnest, a program for identification of TE coordinates; and svg_ltr, a graphical display program for visualization of TE insertions.
The repeat databases are kept up to date by two methods. First, when new genomic contigs are completed, they are entered into PlantGDB (http://www.plantgdb.org; Dong et al., 2004 Second, users of TEnest on PlantGDB can update the repeat databases with newly identified TEs. This submission system requires information about the TE, such as the organism, TE classification, sequence locations of identification, and the proposed name. The new TE is aligned to known TE families in the organism and flagged for manual review; when review is complete, users will be notified of the status of their TE submission. TEnest users can also use this submission system to suggest revisions or repairs to TE database entries.
With use of the plant repeat databases, TEnest identifies all TE insertions in the input sequence, reconstructs fragmented elements, and determines age of insertion for LTR retrotransposons producing a list of coordinates for each TE sequence location. TE insertions are classified as one of four data types: SOLO, corresponding to solo LTR sequences; PAIR, right and left LTRs of a LTR retrotransposon grouped by base pair similarity and the corresponding internal sequences of the TE; NLTR, full-length TEs of classes not containing LTRs (non-LTR retrotransposons and DNA transposons); and FRAG, partial sequences of the NLTR class or internal fragmented regions of LTR retrotransposons.
Identification of Retrotransposon LTR Sequences
Each pairwise LTR alignment coordinate set is entered into the recombination process where a power set algorithm is used to rejoin separated LTR sections. A power set is the set of all the subsets of a set (Suppes, 1972
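The power set idea can be illustrated with a minimal sketch. It enumerates every subset of a few alignment sections and keeps the subsets whose pieces advance along the TE in the same order as they appear in the input sequence. The field names (`seq_start`, `te_start`, `te_end`) and the collinearity test are assumptions made for illustration; they are not taken from the TEnest source code.

```python
from itertools import chain, combinations


def power_set(items):
    """Return every subset of items, including the empty set, as tuples."""
    items = list(items)
    return list(chain.from_iterable(
        combinations(items, r) for r in range(len(items) + 1)))


def collinear(sections):
    """True when sections, ordered by input position, also advance along the TE."""
    ordered = sorted(sections, key=lambda s: s["seq_start"])
    return all(a["te_end"] <= b["te_start"] for a, b in zip(ordered, ordered[1:]))


def candidate_joins(sections):
    """Subsets of two or more sections that could be pieces of one split element."""
    return [s for s in power_set(sections) if len(s) >= 2 and collinear(s)]


# Hypothetical sections: positions in the input sequence and on the TE consensus.
sections = [
    {"name": "A", "seq_start": 100, "te_start": 1, "te_end": 300},
    {"name": "B", "seq_start": 5000, "te_start": 301, "te_end": 700},
    {"name": "C", "seq_start": 9000, "te_start": 100, "te_end": 300},
]
joins = candidate_joins(sections)
```

Because a power set grows as 2^n, a practical implementation would presumably limit the number of sections considered together, for example with a maximum distance between sections.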
LTR retrotransposons replicate into new locations across the genome by means of reverse transcription and integration. A seven-step process produces an exact DNA intermediate of the retrotransposon with one exception; the two new LTR sequences are reverse transcribed from an intermediate LTR, which itself is a unique LTR sequence formed from the combination of the two original LTRs of the parent retrotransposon. This results in a new integrated retrotransposon with identical LTRs (Boeke and Corces, 1989 For each TE family, each LTR alignment returned from the power set recombination process is excised from the input sequence and joined into a single contiguous sequence. By TE family, the LTR sequences are locally aligned to each other and the base pair substitution rate (BSR) is determined. Any insertion, deletion, or substitution of any length is scored as a single substitution for the alignment. BSR is calculated by the total substitutions divided by the alignment length of the two LTR sequences. LTR sequences are grouped according to the smallest BSR. The two paired LTRs are classified in the PAIR data type; any LTR sequences not paired are assigned to the SOLO class. LTRs can be found in a solo configuration throughout the genome. There are three possibilities where TEnest is unable to assign a LTR to a pair. (1) The second LTR sequence is missing; it is found either off the end of the contig or in a sequence gap. (2) The LTR's true partner was incorrectly paired with another LTR. Although unlikely, if two LTRs from different retrotransposon insertions have evolved to be more similar than those from the same insertion, incorrect LTR pairing can occur and possibly cause solo LTR identification. Any such occurrences of incorrect LTR pairing are resolved by TEnest by the discrepancy function discussed below. 
(3) The solo LTR is the result of homologous unequal recombination that has caused deletion of the internal retrotransposon region and one LTR, leaving just a single LTR.
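The BSR scoring and pairing described above can be sketched as follows. This is an illustrative reading, not the TEnest implementation: it assumes the two LTRs are supplied as equal-length gapped alignment strings, counts each contiguous run of gap columns or mismatch columns as one event, and pairs LTRs greedily from the smallest BSR upward. The function names are hypothetical.

```python
def bsr(aligned_a, aligned_b):
    """Base pair substitution rate: difference events divided by alignment length."""
    if len(aligned_a) != len(aligned_b) or not aligned_a:
        raise ValueError("aligned sequences must be non-empty and of equal length")
    events = 0
    previous = "match"
    for x, y in zip(aligned_a, aligned_b):
        if x == "-" or y == "-":
            state = "gap"
        elif x != y:
            state = "mismatch"
        else:
            state = "match"
        # A run of any length is scored once, when the run begins.
        if state != "match" and state != previous:
            events += 1
        previous = state
    return events / len(aligned_a)


def pair_ltrs(ltr_ids, bsr_by_pair):
    """Greedily pair LTRs by smallest BSR; unpaired LTRs are reported as solo."""
    pairs, used = [], set()
    for (i, j), score in sorted(bsr_by_pair.items(), key=lambda kv: kv[1]):
        if i not in used and j not in used:
            pairs.append((i, j, score))
            used.update((i, j))
    solos = [ltr for ltr in ltr_ids if ltr not in used]
    return pairs, solos


# One two-column gap run and one mismatch over 12 alignment columns.
example_bsr = bsr("ACGT--ACGTAC", "ACGTTTACCTAC")
pairs, solos = pair_ltrs(
    ["a", "b", "c"],
    {("a", "b"): 0.01, ("a", "c"): 0.02, ("b", "c"): 0.05},
)
```

In this toy input, LTRs `a` and `b` form a PAIR and `c` is left as a SOLO candidate, which corresponds to the cases enumerated above.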
Identification of Retrotransposon Internal Regions
However, there is one significant difference between the alignment method here and the previously described LTR identification process. First, paired LTRs are arranged by smallest sequence-spanning length first; after each paired LTR is processed, the identified regions are ignored by subsequent alignments. Second, during the alignment process, the sequence database of the initial WU-BLAST alignment contains only one TE sequence, the TE type corresponding to the paired LTRs. These restrictions give TEnest the ability to correctly annotate the entire internal regions of nested LTR retrotransposons. The identified internal regions are added to the PAIR data type signifying classification as whole LTR retrotransposons.
Identification of Non-LTR Retrotransposons and DNA Transposons
The power set recombination process of TEnest is useful for joining sections separated by nesting of subsequent TE insertions; however, this process can run into problems. Although uncommon, recombined TE sections are susceptible to coordinate discrepancies, defined as a rejoined sequence set whose grouping configuration disagrees with another rejoined set. This is seen when sections from two or more recombined TE annotations are found in alternating orders across the input sequence (Fig. 1B) as opposed to nested within one another. Whereas a biological process, such as local small inversions, can explain such occurrences, a nested insertion display is unable to represent disagreeing rejoined sections. TEnest, therefore, assumes the discrepancies are caused by either incorrect power set grouping or incorrect LTR pairing. To resolve each recombination discrepancy, each TEnest data type (PAIR, SOLO, NLTR, FRAG) is self checked and each combination of data types is checked for possible coordinate discrepancies by TEnest. Any discrepancies found are scored based on alignment identities, sequence length percentage of the whole TE, and number of discrepancies; the joins of those with the worst discrepancy scores are broken to split combined sections into separate groups.
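One way to detect the alternating configuration described above is sketched below. It assumes each rejoined group is a list of non-overlapping `(start, end)` sections on the input sequence, and it treats two groups as compatible when all sections of one group fall into a single gap of the other (nested or disjoint). The scoring and join-breaking steps are not shown, and the function names are hypothetical.

```python
from bisect import bisect_right


def slot_indices(group, other):
    """For each section of other, the index of the gap in group that holds it."""
    starts = sorted(start for start, _ in group)
    return {bisect_right(starts, start) for start, _ in other}


def is_discrepancy(group_a, group_b):
    """True when the two groups interleave rather than nest or sit side by side."""
    return (len(slot_indices(group_a, group_b)) > 1
            and len(slot_indices(group_b, group_a)) > 1)


group_a = [(100, 200), (500, 600)]
nested = [(250, 300), (350, 400)]        # both sections inside one gap of group_a
alternating = [(250, 300), (700, 800)]   # sections alternate with those of group_a
```

Here `is_discrepancy(group_a, nested)` is false, whereas `is_discrepancy(group_a, alternating)` is true and would be passed on for scoring.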
TEnest Processing Time Is Decreased with Use of Multiprocessors and Clustered Computers
In addition, to make TEnest a viable resource for chromosome-sized maize pseudomolecules, a TEnest wrapper script, clusterTEnest.pl, has been developed. This script takes a large input sequence, splits it into sections of user-defined length, and sends each section to a separate node of a clustered computer so that several instances of multiprocessor TEnest run simultaneously. Once each split sequence is complete, the annotation results are regrouped and the identified TEs are removed from the input sequence. A final TEnest run is performed on the full sequence, ultimately providing the same output as an original TEnest submission. This split function decreases process time for long sequences and decreases incorrect LTR BSR pairing that may be found when analyzing a large number of a retrotransposon type. For example, when the same 1-Mb contig (GenBank accession no. EF517601) was split into 100-kb segments and sent to 10 nodes of a clustered computer (each with dual 3.2-GHz processors and 4 GB of RAM), it took 35 min to complete. The benefit of clusterTEnest.pl will increase with longer sequence contigs.
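The splitting step can be sketched in a few lines. clusterTEnest.pl itself is a Perl script whose internals are not described here, so the Python below is only an illustration of the idea: cut the input into fixed-length sections, remember each section's offset, and shift section-level coordinates back onto the full sequence when results are regrouped. The function names are hypothetical, and job submission to cluster nodes is not shown.

```python
def split_sequence(sequence, chunk_length):
    """Split sequence into (offset, chunk) pairs of at most chunk_length bases."""
    if chunk_length <= 0:
        raise ValueError("chunk_length must be positive")
    return [(offset, sequence[offset:offset + chunk_length])
            for offset in range(0, len(sequence), chunk_length)]


def to_full_coordinates(offset, start, end):
    """Map coordinates reported within a chunk back onto the full input sequence."""
    return offset + start, offset + end


# A toy 100-base sequence split into sections of 30 bases.
toy_sequence = "ACGT" * 25
chunks = split_sequence(toy_sequence, 30)
```

Elements that span a section boundary cannot be found within a single section, which is presumably why a final run on the full sequence is still needed.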
TE insertions identified by TEnest are visualized in a triangle insertion graph with the program svg_ltr (Fig. 2 ). svg_ltr uses the coordinate table of identified TE locations from TEnest (Supplemental Figs. S1–S4) to produce the main output of TEnest. The nesting display graph represents the original DNA prior to repeat insertions as a black horizontal line and the TE insertions within it as triangles. The horizontal top of a triangle TE insertion corresponds to the length of the TE insertion; the bottom point shows the insertion location. The genome distance between any two points on a triangle graph is determined by the addition of all horizontal lines (including the black DNA representation and TE triangle tops) on all levels between the two points. Spacing and alignment of triangles are adjusted to prevent overlapping triangles; however, TE triangle insertion point locations are preserved to show the true location of a TE. Triangle color corresponds to TE type, shown in the legend at the bottom of the display.
Several functions are included with svg_ltr to produce graphs containing information needed by the user. Display of data types SOLO, PAIR, NLTR, and FRAG can be toggled on or off. Arrows representing location and direction of LTRs can be shown at the top of the triangle. Either BSR or millions of years ago (Mya) can be displayed inside a box within a LTR retrotransposon triangle; see Figure 3 for calculation of BSR and Mya. Coordinates corresponding to either the TE or the input sequence can be displayed at the top of the TE triangle for each section of the recombined TE group. Insertions found within a TE that do not align to a TE found in the repeat databases can be omitted from the display; this is shown by a white triangle within the TE triangle.
In addition, several functions are included with svg_ltr that make this a stand-alone program for display of sequence region annotations. svg_ltr can also display two user-inputted data types: GENE, corresponding to gene annotations, and PSDO, corresponding to pseudogene annotations. Both are displayed along with the TE annotations in the svg_ltr display; GENE and PSDO are shown as rectangular regions with direction-indicating arrows, and can show separated sections, such as multiple exon genes. A script, checkTE.pl, included with TEnest, is provided to assist users in adding this additional information to the svg_ltr input file by converting to and from both the TEnest coordinate table output and a generic feature format (GFF3) table.
To assess the repetitive element identification capability of TEnest, three verification experiments were conducted. (1) The TEnest outputs of high-density repetitive regions of maize and rice were compared to GenBank-submitted TE annotations. (2) Permutations of TE annotations were evaluated to exclude confounding TE annotations. (3) Simulated genomic maize sequences were made with TE insertions and examined with TEnest.
Comparisons of Maize and Rice to Submitted GenBank Annotations
For maize, TEnest identified every annotation formerly found by the original curators. In addition, TEnest identified many LTR retrotransposons, solo LTRs, fragmented repeats, DNA transposons, and other repeats not found in the original annotations. For rice, TEnest found all but seven miniature inverted-repeat TE (MITE) insertions originally identified by the initial annotations. These missing MITEs were truncated and below the default cutoff values in TEnest; additional runs with altered parameters to allow smaller sequence alignments identified all the missing insertions. In addition, TEnest located 11 additional DNA transposons not found in the original analysis. From the analyzed rice BACs, no TEs were seen in nested configuration, compared with 70% of maize TEs found nested. In additional rice BACs analyzed, nested TEs were seen at a rate of approximately 1 per 75 to 100 kb. This rate of nested TEs is due to the low amounts of repeats in rice as well as the small average lengths of rice TE insertions, caused by the high number of MITE insertions in the rice genome.
Permutations of the Maize Repeat Database
Construction and Analysis of Simulated Maize Genomic Sequences
TEnest presents a unique ability to observe TE family distributions across plant genomes in relation to age of insertion, sequence similarity, and sequence retention. At the time of submission, the GenBank sequence database contained 165 ordered and oriented genomic maize sequence contigs greater than 100 kb. This included 56 finished contigs and 109 gapped-sequence submissions with sections presented in correct order and orientation relative to the genome sequence. In total, these BACs equal 29.3 Mb or about 1% of the maize genome. This dataset contains BAC clones sequenced with intentions of gene discovery, as well as BACs randomly selected to survey the entire maize genome. Therefore, this set of sequence contigs may be slightly higher in gene content and lower in repetitive amounts than will be observed across the entire maize genome. These 165 contigs were evaluated with TEnest to provide a broad picture of TE clusters in the maize genome.
Distribution of TE Insertions Is Unequal across Families
LTR retrotransposons make up 60.59% of the total maize sequence analyzed. This is divided into 37.87% whole LTR retrotransposons, 22.23% partial or fragmented LTR retrotransposons, and 0.50% solo LTR sequences. Whole LTR retrotransposons are defined as containing both flanking LTR sequences and >90% of the internal region based on the TE consensus sequence. Partial LTR retrotransposons are incomplete insertions resulting from deletions or transpositions or gaps in the sequence assembly. Partial LTR retrotransposon annotations include any amount of the internal regions of TE sequences that may be reconstructed sections from later insertions and may also include LTR sequences. In terms of sequence length of the identified TEs, partial LTR retrotransposons cover 6.2 Mb of the BAC sequence in this analysis and whole LTR retrotransposons cover 11.1 Mb; this ratio of partial to whole is 1:1.8, showing that, even in fragmented TE remnants, sequence structure is moderately reconstructable. Solo LTRs are defined as annotations >50% of the LTR length and not connected to internal regions of the TE. Identified solo LTR regions that equaled <50% of the solo LTR length once reconstructed were not analyzed here and were classified as fragmented LTR retrotransposons. Solo LTR-to-whole LTR ratio varies by TE family (Table III), ranging from Gyma, with 2.3 whole TEs per solo LTR, to Zeon, with 121 whole TEs per solo LTR. The majority of solo-to-whole LTR retrotransposon ratios lie between one solo LTR per seven whole elements and one per 15, a range that includes members of both Copia and Gypsy superfamilies. A much higher recombination frequency (one solo LTR per two to three whole elements) is seen with three retrotransposons, Gyma, Ruda, and Danelle, again in both Copia and Gypsy superfamilies. Huck and Zeon show very infrequent solo LTR formation, with one solo LTR per 70 or 121 whole elements. In addition, many TEs have no corresponding solo LTR and are not shown in Table III.
Sequence structure may play a role in solo LTR formation; two pairs of retrotransposon families with similar sequence have almost identical unequal recombination rates: Ji and Opie, with 64.2% sequence identity, have 14.9 and 15.1 whole elements per solo LTR; Danelle and Gyma, with 55.7% sequence identity, have 3.3 and 2.3 whole elements per solo LTR. These are considered closely related for between-family comparisons; within TE families, members may have <60% identity to each other over their entire lengths. Solo LTRs are found on average one per 156 kb across the analyzed sequences; however, sequence AC148093, located in a near-centromeric region of chromosome 4, contains one solo LTR per 16.3 kb. This region contains only one solo LTR from the high-rate solo LTR families (zero Danelle, zero Gyma, one Ruda); its many solo LTRs instead come from a variety of lower-rate solo-forming LTR retrotransposon families, which suggests a high level of unequal recombination in this region.
TEnest calculates time since insertion (Mya) for each whole LTR retrotransposon using sequence identity from the paired LTRs. A total of 1,457 LTR pairs were identified in this set of 165 maize contigs. Fifty percent of all LTR retrotransposon insertions occurred <0.875 Mya, and 75% of all LTR retrotransposon insertions are <1.5 Mya (Fig. 3). As shown in Figure 3, the age of insertion across the four most abundant whole LTR retrotransposons (Ji, 268 copies; Opie, 226 copies; Huck, 212 copies; Zeon, 121 copies) follows this same distribution of insertion times. Less represented LTR retrotransposon insertions may be younger or older than this general distribution; however, too few copies are present in the analyzed dataset to accurately calculate distributions for individual families.
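As a worked illustration of how LTR divergence can be converted to an insertion age, the sketch below applies the commonly used relation T = d / (2r), where d is the divergence between the two LTRs and r is the substitution rate. The factor of 2 reflects that both LTRs accumulate changes independently after insertion. The rate of 1.3 × 10⁻⁸ is the value quoted earlier in this article; its units (substitutions per site per year) and the exact form of the equation used by TEnest are assumptions here, and the function name is hypothetical.

```python
# Assumed units: substitutions per site per year (value quoted in the article).
SUBSTITUTION_RATE = 1.3e-8


def insertion_age_mya(ltr_divergence, rate=SUBSTITUTION_RATE):
    """Estimate time since insertion, in millions of years, as d / (2r)."""
    if ltr_divergence < 0 or rate <= 0:
        raise ValueError("divergence must be non-negative and rate positive")
    years = ltr_divergence / (2.0 * rate)
    return years / 1e6


# Under these assumptions, about 2.3% divergence corresponds to the
# median insertion age reported above.
median_age = insertion_age_mya(0.02275)
```

Under these assumptions, a divergence of 0.02275 between paired LTRs corresponds to about 0.875 Mya, and a divergence of 0.039 corresponds to 1.5 Mya.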
Forty-seven families of LTR retrotransposons were identified: nine Ty1/Copia, 11 Ty3/Gypsy, and 27 others. The three most abundant families, Ji, Opie, and Huck, each account for 15% to 18% of the total amount of LTR retrotransposons identified, together comprising more than one-half of the LTR retrotransposons found. As illustrated in Figure 4, there is a considerable drop in TE abundance in the rest of the identified LTR retrotransposon families, which range from 8% to <1%.
Analysis of the sequence relationship between individual elements within each TE family may give insight into the evolution and expansion of TEs across the genome and may possibly explain the unequal amounts of retrotransposon families. Using the output table of TEnest, a multiple alignment of every element insertion for each family of LTR retrotransposons was made with ClustalW (Thompson et al., 1994
A tree with clustered insertion ages does not follow the expected phylogenetic result of continuously replicating LTR retrotransposons, with each clade representing a distinct family subset replicating individually and concurrently. Here, one expects a tree with a similar range of insertion ages on each phylogenetic branch. Instead, LTR proliferation is observed, where, at specific times throughout the genome evolution, the Ji retrotransposon family has undergone cycles of rapid expansion. This suggests multiple instances of extreme proliferation events by one or a few related members propagating many similar insertions in a small time frame. The Grande retrotransposon family (Fig. 5B) also seems to follow the proposed proliferation process, although with only 39 members in this analysis the clusters of insertion ages are less obvious. This initial evidence from low-copy families excludes the proliferation process as the only explanation for the extremely high copy number of Ji, Opie, and Huck retrotransposons.
Relative amounts of LTR retrotransposon superfamilies Ty1/Copia and Ty3/Gypsy are similar, 22.66% and 24.16%, respectively, whereas the other class is much less abundant with 13.77% of the sequence content. However, the similar amount of Copia and Gypsy elements does not mean they are found equally across the genome; instead, they correspond to genome locations. In general, Copia and Gypsy sequence quantities per location are inversely proportional (Fig. 6). With the maize WebFPC July 19, 2005 release (Coe et al., 2002
TEnest: An Efficient Algorithm for Nested TE Annotation
TEnest was initially designed for annotation of maize BAC contigs (e.g. EF517601, EF517601); the repeat database has since been expanded to include rice, wheat, and barley (Hordeum vulgare), and potentially could include other sequenced grasses such as sorghum (Sorghum bicolor) and Brachypodium (Brachypodium distachyon). TEnest can be extended further for use in a variety of organisms; however, the main advantage of TEnest over other repeat identification software is the ability to annotate nested TE insertions, primarily seen in the densely repetitive grass genomes. To evaluate other organisms, users can create a custom repeat database. This custom database can be used with the downloadable version of TEnest or uploaded onto the online version. In addition, users can submit and suggest edits of TE database entries; both of these systems are in place to keep TEnest up to date as more sequence is produced and more TEs are identified. There are several important steps of the TEnest system that give it the ability to accurately annotate nested TEs and make it a viable resource for genome repeat analysis. The two-alignment method, first using a quick BLAST search to locate general regions of interest, then using a pairwise local alignment to accurately identify the complete sequence alignment, greatly increases speed and precision when using sequences of similar identity, such as repeat databases. The power set reconstruction method builds TE segments separated by nesting, using coordinates based on the TE insertion, giving TEnest the ability to recognize and correctly resolve TE families nested within themselves (a whole Ji retrotransposon nested within another Ji) or to identify a duplicated region within a TE (an extra portion of a Ji found within a Ji element).
TEnest pairs the left and right LTRs of retrotransposons based on their divergence, identifying the TE family and the sequence ends of the insertion and allowing TEnest to quickly build the internal region with more relaxed criteria, thus obtaining the complete annotation. When joining separated sections, both the power set reconstruction method and the LTR pairing method introduce a possibility of join discrepancies, where two or more joined regions disagree (Fig. 1B). If the reconstruction or pairing processes suggest combinations that could not have occurred by TE nesting, but would require a local inversion or translocation, TEnest uses the discrepancy process to separate the most likely incorrect join. TEnest provides the user with three output formats: an annotation table of TE insertions with insertion ages of LTR retrotransposons, a repeat-masked sequence file, and a vector-format graphical display of the chronology of TE insertions. Use of these output files can assist with identification of genes and other functionally important locations and can also answer questions regarding the sequence makeup of the genome. When used to analyze a BAC or a single sequence region, TEnest gives information about sequence structure, content, rearrangement, and evolutionary dynamics of the area. Expanding this analysis to multiple regions across the genome, such as the analysis of the currently finished maize BAC contigs within this article, gives a more in-depth example of the capabilities of TEnest. With more sequence information, comparisons between TE families and classes and evolutionary analysis of single TE families can begin to answer questions about TE evolution, replication, and their effect on the genome. The included cluster submission script, which splits the input sequence and runs an instance of TEnest on each node of a cluster, gives users the ability to evaluate long sequences in relatively short time frames.
Three analyses were used to validate the output from TEnest. The first verification compared TEnest outputs to curated submitted maize and rice contigs. The important information from this analysis is that TEnest was able to identify all known TE insertions. TEnest did identify extra insertions; however, the goals of the original curators were varied and may have intentionally not included all TEs. In addition, at the time of original annotation, community repeat databases were less complete. TEnest uses a repeat database to identify TE insertions; this repeat database is made of consensus or representative sequences. This process could allow easy identification of similar TEs within the family while failing to accurately identify distantly related TEs within the family. In the permutation verification, the TE insertion and similar TEs from the sequence contig were removed from database construction to show that TEnest correctly identified the distantly related TEs within a family. In each case, TEnest was able to correctly identify the permuted TEs, showing that, regardless of the individual TEs used in the database construction, TEnest gives unbiased repeat annotation. These results are possible because of the unique processes of TEnest, specifically initial identification of paired LTRs and relaxation of internal region alignment parameters. The final validation highlights a further resource of the TEnest outputs. Hypothetical ancestral sequences, prior to TE insertions, can be constructed by removing whole TE annotations, which we believe are more accurate representations than simple masking of all repeat-matching sequences. These ancestral predictions can be used for comparative genome analysis to give a cleaner assessment of the shared sequence regions between genomes or sequence regions. Time-point sequences can be made by removing TEs inserted after a certain age.
TEs cause sequence rearrangement and recombination by insertion and translocation of their own and other sequences throughout the genome. Additionally, by their seemingly unbridled expansion of genome size, LTR retrotransposons are significant drivers of sequence evolution. TEnest was designed to quickly and accurately analyze completely sequenced genomes and to explore how TEs affect whole-genome evolution. As with smaller sequence scale analyses, TEnest can provide TE insertion locations, distributions, and insertion preferences to show the current structure of the whole genome. But, on a larger scale, it can give the whole view of each TE family evolution, the sequence divergence history of each type from a common ancestor, along with its age since insertion and its genome location. Combination of TE insertion age, sequence relationships, and location in the genome can be used to investigate fundamental questions about TEs, such as their rates of proliferation across the genome, their paths of replication over time, and ultimately their effects on genome evolution. Based on data presented here, different TE families experience unequal recombination that results in solo LTRs at different frequencies. The rate of solo LTR formation does not seem to be influenced by length of either the LTR or the whole retrotransposon. Only those LTR retrotransposons with at least one observed solo LTR were included in this analysis; many retrotransposons had no solo LTRs, most likely due to the limited amount of maize genome sequence in this study, as well as the low rate of recombination within the retrotransposon family. Gyma and Danelle, and likewise Ji and Opie, share similar rates of solo formation as well as relatively similar sequence identities, suggesting that TE structure or sequence is an important factor for unequal recombination. Phylogenetic analysis of LTR retrotransposon families shows similar ages of insertion clustered within clades of the tree.
We hypothesize that this is caused by proliferation of LTR retrotransposons, where at specific time points in evolution a single element or a related group of elements has rapidly expanded across the genome. These rapid TE expansions could correspond to times of relaxed mutation standards, such as genome duplication events or environmental stress conditions where mutations caused by TE insertions are less detrimental to the organism. Alternatively, these TE proliferations could be caused by advantageous mutations in the TE sequence, allowing a TE copy to replicate across the genome. Similar proliferation-style phylogenetic trees are observed across many LTR retrotransposon families and therefore the process is not TE specific and cannot explain differences in TE amounts. The abundance of certain TE families is due to selective processes not yet understood. Two other hypotheses for LTR retrotransposon replication do not explain the observed trees. Continual copying of TEs in a family until mutation prevents replication of an individual will give a tree in which TEs from any clade of the family are able to replicate. In this scenario, the phylogenetic tree has clades containing a variety of insertion ages. Alternatively, genome duplication could immediately double the amount of TEs within the genome. A tree following a genome duplication event will contain twice as many TEs in every clade, each with the same insertion age, but each clade still contains a variety of insertion ages, and those TEs still able to replicate will continue to increase the age ranges. However, genome duplication could play a role in the proliferation hypothesis by allowing proliferation to increase with a decreased chance of harming the genome.
Construction of Consensus Repeat Databases for TEnest
Maize (Zea mays), rice (Oryza sativa), wheat (Triticum aestivum), and barley (Hordeum vulgare) repeat databases were constructed from the following sources: GIRI RepBase (Jurka et al., 2005 Each final set of clustered repeat entries was aligned with ClustalW. Neighbor-joining trees were made using the PHYLIP package. The resulting phylogenetic trees were examined for well-defined separations into subgroups, such as a tree with only two distant clades. If present, these clustered tree sections were split into subgroups of the original repeat set. Consensus sequences were made from each repeat set or each subgroup within a set. Many repeat groups contained high diversity between elements; if >10% of the consensus sequence was Ns or if the sequence had stretches of 90% Ns for more than 100 bases, a consensus sequence was not used. Instead, a representative repeat entry was selected for use in the repeat database from a central branch off the phylogenetic analysis.
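The consensus acceptance rule described above can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' code: the phrase "stretches of 90% Ns for more than 100 bases" is interpreted here as any 101-base sliding window in which at least 90% of positions are N, and all function and parameter names are hypothetical.

```python
def n_fraction(seq):
    """Fraction of positions in `seq` that are the ambiguity code N."""
    seq = seq.upper()
    return seq.count("N") / len(seq) if seq else 0.0


def has_n_rich_stretch(seq, window=101, min_n_fraction=0.9):
    """True if any `window`-base stretch is at least `min_n_fraction` N.

    The 101-base window is an assumed reading of "more than 100 bases".
    """
    seq = seq.upper()
    if len(seq) < window:
        return False
    threshold = min_n_fraction * window
    n_count = seq[:window].count("N")
    if n_count >= threshold:
        return True
    # Slide the window one base at a time, updating the N count in O(1).
    for i in range(window, len(seq)):
        n_count += (seq[i] == "N") - (seq[i - window] == "N")
        if n_count >= threshold:
            return True
    return False


def consensus_is_usable(seq, max_total_n=0.10):
    """Apply both rejection criteria from the text to a consensus sequence."""
    if not seq:
        return False
    return n_fraction(seq) <= max_total_n and not has_n_rich_stretch(seq)


if __name__ == "__main__":
    clean = "ACGT" * 100
    diverse = "ACGT" * 50 + "N" * 30        # more than 10% N overall
    gapped = "ACGT" * 1250 + "N" * 120      # few Ns overall, one long N run
    for label, seq in [("clean", clean), ("diverse", diverse), ("gapped", gapped)]:
        print(label, consensus_is_usable(seq))
```

When a consensus fails this check, the text above says a representative repeat entry is chosen from a central branch of the tree instead; that selection step is not modeled here.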
Consensus sequences were checked against the GenBank maize database (Benson et al., 2006
Customization of TEnest runs is accomplished using the many available parameter settings. All of these settings are explained in further detail in the TEnest README file found with the TEnest Web service or bundled with a downloaded version. Similar parameter settings are available for each TEnest identification process: LTRs, internal retrotransposon regions, fragmented regions, and non-LTR retrotransposon regions. Users can alter the number of pairwise alignments reported (default 7), the gap open penalty (default 30 for LTRs, 75 for others), the gap extension penalty (default 15 for LTRs, 75 for others), and the pairwise alignment E-value cutoff score (default 10^-20). The number of base pairs allowed to overlap when joining sections is also customizable, both for pairwise alignments (default 25) and when reconstructing separated sections in the power set process (default 30). The smallest returned reconstructed LTR (default 25) can be raised to limit unnecessary annotations; the maximum distance between power set reconstructed sections can also be altered (default 100 kb). The LTR pairing process can be customized with gap open (default 12) and gap extension (default 4) penalties, and the amount of LTR pairs to consider (default 0.1).
TE makeup differs across organisms, and some TEnest settings have proved more useful when attempting annotation on other species. Rice has a high number of small MITE insertions; TEnest has better success identifying these elements when ignoring long-spanning TEs (decreasing the power set reconstruction maximum) and allowing for smaller TE alignments (decreasing the E-value cutoff for pairwise alignments and decreasing the size of reported sections). Some success has been seen with TEnest on nonplant species. In Drosophila melanogaster, the LTRs of LTR retrotransposons are very small in relation to those of the grass species. We have achieved TE annotations with TEnest on D. melanogaster sequences when lowering the LTR overlap lengths, pairwise alignment cutoffs, and LTR size cutoff.
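To keep the defaults listed above in one place, and to show how a species-specific profile might be layered on top of them, here is a small Python sketch. It is not TEnest's interface: TEnest is a Perl package, and its actual option names are documented in its README. Every key name below is hypothetical, the E-value default is read here as 1e-20 (an assumption about the printed value), and the override values in the usage example are placeholders, since the text gives no species-specific numbers.

```python
import copy

# Defaults as stated in the text; key names are hypothetical labels.
DEFAULTS = {
    "ltr": {"alignments_reported": 7, "gap_open": 30, "gap_extend": 15},
    "other": {"alignments_reported": 7, "gap_open": 75, "gap_extend": 75},
    "e_value_cutoff": 1e-20,              # assumed reading of the default
    "pairwise_join_overlap_bp": 25,
    "power_set_overlap_bp": 30,
    "min_reconstructed_ltr": 25,          # unit not stated in the text
    "power_set_max_distance_bp": 100_000,  # 100 kb
    "ltr_pairing": {"gap_open": 12, "gap_extend": 4, "pairs_to_consider": 0.1},
}


def with_overrides(defaults, overrides):
    """Return a deep copy of `defaults` with `overrides` merged in.

    Nested dictionaries are merged key by key; unknown keys raise KeyError
    so that a misspelled setting is not silently ignored.
    """
    merged = copy.deepcopy(defaults)
    for key, value in overrides.items():
        if key not in merged:
            raise KeyError(f"unknown setting: {key}")
        if isinstance(merged[key], dict) and isinstance(value, dict):
            merged[key] = with_overrides(merged[key], value)
        else:
            merged[key] = value
    return merged


if __name__ == "__main__":
    # Placeholder values: the text says only that the power set maximum
    # should be decreased for rice, not by how much.
    rice_like = with_overrides(DEFAULTS, {"power_set_max_distance_bp": 20_000})
    print(rice_like["power_set_max_distance_bp"])
    print(DEFAULTS["power_set_max_distance_bp"])
```

Keeping the defaults immutable and deriving each profile by copy makes it easy to record exactly which settings differed between runs on different species.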
The TEnest software package is available for use on PlantGDB under the tools section (http://www.plantgdb.org/prj/TE_nest/TE_nest.html); the source code, along with maize, rice, wheat, and barley repeat databases, is available from http://wiselab.org. To install a local version of TEnest, Perl (http://www.perl.org), WU-BLAST version 2.0 (http://blast.wustl.edu), and FASTA2 (ftp://ftp.virginia.edu/pub/fasta) are required. To display TEnest annotations, svg_ltr uses the scalable vector graphics format (http://www.w3.org/Graphics/SVG), displayable in Mozilla Firefox (www.mozilla.com/firefox) version 2 or later. GenBank sequence submissions submitted with this manuscript: LTR retrotransposons Danelle, EF562447, and Stella, EF621725.
The following materials are available in the online version of this article.
We thank Dr. Thomas Peterson and Dr. Dan Voytas for critical review of the manuscript. This article is a joint contribution of The Iowa Agriculture and Home Economics Experiment Station and the Corn Insects and Crop Genetics Research Unit, U.S. Department of Agriculture-Agricultural Research Service. Mention of trade names or commercial products in this publication is solely for the purpose of providing specific information and does not imply recommendation or endorsement by the U.S. Department of Agriculture. Received October 8, 2007; accepted November 15, 2007; published November 21, 2007.
1 This work was supported by the U.S. Department of Agriculture-National Research Initiative (grant no. 2002–35301–12064). The author responsible for distribution of materials integral to the findings presented in this article in accordance with the policy described in the Instructions for Authors (www.plantphysiol.org) is: Roger P. Wise (rpwise{at}iastate.edu).
[W] The online version of this article contains Web-only data.
[OA] Open Access articles can be viewed online without a subscription. www.plantphysiol.org/cgi/doi/10.1104/pp.107.110353 * Corresponding author; e-mail rpwise{at}iastate.edu.
Bao Z, Eddy SR (2002) Automated de novo identification of repeat sequence families in sequenced genomes. Genome Res 12: 1269–1276
Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL (2006) GenBank. Nucleic Acids Res 34: D16–20
Boeke JD, Corces VG (1989) Transcription and reverse transcription of retrotransposons. Annu Rev Microbiol 43: 403–434
Brunner S, Fengler K, Morgante M, Tingey S, Rafalski A (2005) Evolution of DNA sequence nonhomologies among maize inbreds. Plant Cell 17: 343–360
Caldwell KS, Langridge P, Powell W (2004) Comparative sequence analysis of the region harboring the hardness locus in barley and its collinear region in rice. Plant Physiol 136: 3177–3190
Coe E, Cone K, McMullen M, Chen SS, Davis G, Gardiner J, Liscum E, Polacco M, Paterson A, Sanchez-Villeda H, et al (2002) Access to the maize genome: an integrated physical and genetic map. Plant Physiol 128: 9–12
Dong Q, Schlueter SD, Brendel V (2004) PlantGDB, plant genome database and analysis tools. Nucleic Acids Res 32: D354–359
Edgar RC, Myers EW (2005) PILER: identification and classification of genomic repeats. Bioinformatics (Suppl 1) 21: i152–i158
Emrich SJ, Aluru S, Fu Y, Wen TJ, Narayanan M, Guo L, Ashlock DA, Schnable PS (2004) A strategy for assembling the maize (Zea mays L.) genome. Bioinformatics 20: 140–147
Felsenstein J (2005) PHYLIP (Phylogeny Inference Package) Version 3.6. Department of Genome Sciences, University of Washington, Seattle
Fu H, Dooner HK (2002) Intraspecific violation of genetic collinearity and its implications in maize. Proc Natl Acad Sci USA 99: 9573–9578
Fu H, Zheng Z, Dooner HK (2002) Recombination rates between adjacent genic and retrotransposon regions in maize vary by 2 orders of magnitude. Proc Natl Acad Sci USA 99: 1082–1087
Gu YQ, Salse J, Coleman-Derr D, Dupin A, Crossman C, Lazo GR, Huo N, Belcram H, Ravel C, Charmet G, et al (2006) Types and rates of sequence evolution at the high-molecular-weight glutenin locus in hexaploid wheat and its ancestral genomes. Genetics 174: 1493–1504
Haberer G, Young S, Bharti AK, Gundlach H, Raymond C, Fuks G, Butler E, Wing RA, Rounsley S, Birren B, et al (2005) Structure and architecture of the maize genome. Plant Physiol 139: 1612–1624
Huang X, Miller W (1991) A time-efficient, linear-space local similarity algorithm. Adv Appl Math 12: 337–357
IRGSP (2005) The map-based sequence of the rice genome. Nature 436: 793–800
Jurka J, Kapitonov VV, Pavlicek A, Klonowski P, Kohany O, Walichiewicz J (2005) Repbase update, a database of eukaryotic repetitive elements. Cytogenet Genome Res 110: 462–467
Kalyanaraman A, Aluru S (2005) Efficient algorithms and software for detection of full-length LTR retrotransposons. J Bioinform Comput Biol 4: 197–216
Kaminker JS, Bergman CM, Kronmiller B, Carlson J, Svirskas R, Patel S, Frise E, Wheeler DA, Lewis SE, Rubin GM, et al (2002) The transposable elements of the Drosophila melanogaster euchromatin: a genomics perspective. Genome Biol 3: RESEARCH0084
Kidwell MG, Lisch DR (2000) Transposable elements and host genome evolution. Trends Ecol Evol 15: 95–99
Kimura M (1980) A simple method for estimating evolutionary rates of base substitutions through comparative studies of nucleotide sequences. J Mol Evol 16: 111–120
Lal SK, Giroux MJ, Brendel V, Vallejos CE, Hannah LC (2003) The maize genome contains a helitron insertion. Plant Cell 15: 381–391
Lander ES, Linton LM, Birren B, Nusbaum C, Zody MC, Baldwin J, Devon K, Dewar K, Doyle M, FitzHugh W, et al (2001) Initial sequencing and analysis of the human genome. Nature 409: 860–921
Ma J, Bennetzen JL (2004) Rapid recent growth and divergence of rice nuclear genomes. Proc Natl Acad Sci USA 101: 12404–12410
McCarthy EM, McDonald JF (2003) LTR_STRUC: a novel search and identification program for LTR retrotransposons. Bioinformatics 19: 362–367
Messing J, Bharti AK, Karlowski WM, Gundlach H, Kim HR, Yu Y, Wei F, Fuks G, Soderlund CA, Mayer KF, et al (2004) Sequence composition and genome organization of maize. Proc Natl Acad Sci USA 101: 14349–14354
Meyers BC, Tingey SV, Morgante M (2001) Abundance, distribution, and transcriptional activity of repetitive elements in the maize genome. Genome Res 11: 1660–1676
Nagano H, Kunii M, Azuma T, Kishima Y, Sano Y (2002) Characterization of the repetitive sequences in a 200-kb region around the rice waxy locus: diversity of transposable elements and presence of veiled repetitive sequences. Genes Genet Syst 77: 69–79
Price AL, Jones NC, Pevzner PA (2005) De novo identification of repeat families in large genomes. Bioinformatics (Suppl 1) 21: i351–i358
Quesneville H, Bergman CM, Andrieu O, Autard D, Nouaud D, Ashburner M, Anxolabehere D (2005) Combined evidence annotation of transposable elements in genome sequences. PLoS Comput Biol 1: 166–175
Rabinowicz PD, Bennetzen JL (2006) The maize genome as a model for efficient sequence analysis of large plant genomes. Curr Opin Plant Biol 9: 149–156
SanMiguel P, Gaut BS, Tikhonov A, Nakajima Y, Bennetzen JL (1998) The paleontology of intergene retrotransposons of maize. Nat Genet 20: 43–45
SanMiguel P, Tikhonov A, Jin YK, Motchoulskaia N, Zakharov D, Melake-Berhan A, Springer PS, Edwards KJ, Lee M, Avramova Z, et al (1996) Nested retrotransposons in the intergenic regions of the maize genome. Science 274: 765–768
Song R, Llaca V, Linton E, Messing J (2001) Sequence, regulation, and evolution of the maize 22-kD alpha zein gene family. Genome Res 11: 1817–1825
Suppes P (1972) Axiomatic Set Theory. Dover, New York, pp 46–49
Thompson JD, Higgins DG, Gibson TJ (1994) CLUSTALW: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22: 4673–4680
Tikhonov AP, SanMiguel PJ, Nakajima Y, Gorenstein NM, Bennetzen JL, Avramova Z (1999) Collinearity and its exceptions in orthologous adh regions of maize and sorghum. Proc Natl Acad Sci USA 96: 7409–7414
Wei F, Coe E, Nelson W, Bharti AK, Engler F, Butler E, Kim H, Goicoechea JL, Chen M, Lee S, et al (2007) Physical and genetic structure of the maize genome reflects its complex evolutionary history. PLoS Genet 3: e123
Wicker T, Matthews DE, Keller B (2002) TREP: a database for Triticeae repetitive elements. Trends Plant Sci 7: 561–562
Wicker T, Sabot F, Hua-Van A, Bennetzen JL, Capy P, Chalhoub B, Flavell A, Leory P, Morgante M, Panaud O, et al (2007) A unified classification system for eukaryotic transposable elements. Nat Rev Genet 8: 973–982