Replication of nonautonomous retroelements in soybean appears to be both recent and common.

Retrotransposons and their remnants often constitute more than 50% of higher plant genomes. Although extensively studied in monocot crops such as maize (Zea mays) and rice (Oryza sativa), the impact of retrotransposons on dicot crop genomes is not well documented. Here, we present an analysis of retrotransposons in soybean (Glycine max). Analysis of approximately 3.7 megabases (Mb) of genomic sequence, including 0.87 Mb of pericentromeric sequence, uncovered 45 intact long terminal repeat (LTR)-retrotransposons. The ratio of intact elements to solo LTRs was 8:1, one of the highest reported to date in plants, suggesting that removal of retrotransposons by homologous recombination between LTRs is occurring more slowly in soybean than in previously characterized plant species. Analysis of paired LTR sequences uncovered a low frequency of deletions relative to base substitutions, indicating that removal of retrotransposon sequences by illegitimate recombination is also operating more slowly. Significantly, we identified three subfamilies of nonautonomous elements that have replicated in the recent past, suggesting that retrotransposition can be catalyzed in trans by autonomous elements elsewhere in the genome. Analysis of 1.6 Mb of sequence from Glycine tomentella, a wild perennial relative of soybean, uncovered 23 intact retroelements, two of which had accumulated no mutations in their LTRs, indicating very recent insertion. A similar pattern was found in 0.94 Mb of sequence from Phaseolus vulgaris (common bean). Thus, autonomous and nonautonomous retrotransposons appear to be both abundant and active in Glycine and Phaseolus. The impact of nonautonomous retrotransposon replication on genome size appears to be much greater than previously appreciated.

Transposable elements are abundant components of plant genomes. They are typically divided into two groups based on their mechanism of transposition. Class I transposons transpose via an RNA intermediate and must therefore use reverse transcriptase (RT) during the replication process. Class II transposons do not have an RNA intermediate and usually use a cut-and-paste mechanism for transposition. Elements of both classes have had major impacts on genome structure, apparently not only promoting mutations of genes and affecting gene regulatory sequences but also playing a substantial role in the creation of new genes by "exon-shuffling" and retrotransposition (Jin and Bennetzen, 1994; Jiang et al., 2004; Bennetzen, 2005; Morgante et al., 2005; Zabala and Vodkin, 2005; Wang et al., 2006).
Retrotransposons and their remnants often constitute more than 50% of higher plant genomes and can account for as much as 90% (Sabot and Schulman, 2006). Because the majority of such elements appear to have inserted in the last few million years, it was once believed that there had been a relatively recent burst in retrotransposon activity that led to a recent expansion in plant genome sizes. However, it is now clear that genome expansion resulting from retrotransposon activity is counteracted by spontaneous deletions resulting from unequal homologous recombination and illegitimate recombination events (Bennetzen, 2005; Vitte and Bennetzen, 2006). Plant genomes appear to differ not only in the content of their repetitive fraction but in the dynamics of DNA removal as well. The latter may be estimated by examining the ratio of intact and possibly active elements to their fragmented or recombined counterparts.
The specific families of retrotransposons present in different plant species and their relative abundance vary tremendously, indicating that they are rapidly evolving and may undergo bursts of activity. In addition, most elements are represented by both autonomous (full-length elements encoding all proteins necessary for transposition) and nonautonomous (mutated elements lacking one or more proteins required for transposition) versions in the same genome, with both types varying even among individuals of the same species. These observations, combined with the presence of retrotransposon-derived mRNA, indicate that many elements are still active. Because retroelement sequences decay at a rapid rate, it can be difficult to identify and properly annotate their positions, especially using automated tools. This has led to frequent overestimation of genic sequences in genome annotations. Because of their impact on genome size and structure, however, proper annotation of retrotransposon-derived sequences in genomes is especially important for studying genome-wide mechanisms of sequence evolution.
As part of the National Science Foundation (NSF)-funded project Comparative Analysis of Legume Genome Evolution, we have generated approximately 4 megabases (Mb) of genomic sequence derived from two varieties of soybean (Glycine max), which we are comparing to orthologous regions of a wild perennial relative of soybean (Glycine tomentella) and to common bean (Phaseolus vulgaris; scientific names will be used for clarity; Innes et al., 2008). These comparisons have allowed us to estimate the impact of retrotransposons on G. max genome evolution. In addition, because the two Glycine species share a genome duplication event that occurred approximately 10 to 14 million years ago, we were able to evaluate how duplicated regions differed in their subsequent retrotransposon activity and whether such differences were shared between these two species, which themselves diverged 5 to 7 million years ago (Innes et al., 2008).

Strategy for Identifying Long Terminal Repeat-Retrotransposons
The majority of retrotransposons in plants and animals contain long terminal repeats (LTRs), which are generated during the transposition process. LTRs thus provide a convenient signature when searching genomic sequence for the presence of retrotransposons. We used a combination of publicly available programs that search for repeats, along with manual BLAST searches (Altschul et al., 1997) for homology to known retroelement sequences, to identify LTR-containing retrotransposons (see "Materials and Methods"). Approximately 3.7 Mb of G. max genomic sequence were searched, including 1 Mb from the Rpg1-b region on molecular linkage group F (homoeologue 1 [H1]) and 0.87 Mb of homoeologous sequence (H2) on molecular linkage group E. To sample other areas of the soybean genome, we also analyzed 1.85 Mb derived from bacterial artificial chromosome clones (BACs) not assigned to a particular location but available through the National Center for Biotechnology Information (NCBI) high-throughput genomic sequence database.
LTR-retrotransposons were classified as intact when they possessed two full-length LTRs flanked by target-site duplications (TSDs), a recognizable primer binding site, and a polypurine tract. Intact elements were additionally classified as autonomous if they contained intact Gag and Pol open reading frames (ORFs). Gag encodes the structural protein required for nucleocapsid formation, while Pol encodes a polyprotein containing an RT domain, an integrase domain, and an aspartic proteinase domain, which is responsible for posttranslational processing of the Pol ORF product. Intact elements lacking complete Gag and Pol ORFs were classified as nonautonomous. LTRs were classified as solo-LTRs when they contained sequence similarity to previously identified LTRs, appeared to be full length, were not associated with a second LTR, and were flanked by TSDs. Solo-LTRs are believed to arise by homologous recombination between LTRs of an individual element, resulting in deletion of the intervening retroelement sequence. All other elements with similarity to retrotransposon sequences, but judged not to be intact, were classified as remnants.
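The classification rules above can be summarized as a small decision function. This is only an illustrative sketch: the `Candidate` record and its field names are hypothetical, and it assumes each feature has already been scored as present or absent by the repeat-finding and BLAST steps.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """Hypothetical summary of one retrotransposon-like sequence."""
    full_length_ltrs: int  # number of full-length LTRs found (0, 1, or 2)
    has_tsd: bool          # flanked by target-site duplications
    has_pbs: bool          # recognizable primer binding site
    has_ppt: bool          # recognizable polypurine tract
    intact_gag: bool       # intact Gag ORF
    intact_pol: bool       # intact Pol ORF


def classify(c: Candidate) -> str:
    """Apply the intact / autonomous / solo-LTR / remnant rules from the text."""
    intact = (c.full_length_ltrs == 2 and c.has_tsd
              and c.has_pbs and c.has_ppt)
    if intact:
        # Intact elements are autonomous only with complete Gag and Pol ORFs.
        if c.intact_gag and c.intact_pol:
            return "autonomous"
        return "nonautonomous"
    # A single full-length LTR flanked by TSDs is treated as a solo-LTR.
    if c.full_length_ltrs == 1 and c.has_tsd:
        return "solo-LTR"
    return "remnant"


print(classify(Candidate(2, True, True, True, False, True)))
```

The example call describes an intact element with a disrupted Gag ORF, which falls into the nonautonomous class.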

Glycine and Phaseolus Contain Many Retrotransposon Families with Recent Insertions
We identified 45 intact LTR-retrotransposons in G. max, 23 in G. tomentella, and seven in P. vulgaris (Table I; Supplemental Table S2). All LTR-retrotransposons with recognizable Gag-Pol domains fell into two superfamilies, Ty1/copia-like and Ty3/gypsy-like, based on the order of the protein domains contained within the Pol polyprotein. In Ty1/copia-like elements, the integrase domain appears N terminal to the RT domain, whereas in Ty3/gypsy-like elements, the integrase domain appears after the RT domain (Fig. 1A).
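As a minimal sketch of the superfamily rule just described, the function below assigns an element from the N-to-C-terminal order of its Pol domains. The domain labels ("AP", "INT", "RT") are hypothetical shorthand chosen for this example, not identifiers from any annotation tool.

```python
def superfamily(domain_order):
    """Assign a superfamily from the order of Pol domains.

    domain_order lists domain labels from N terminus to C terminus and
    must contain both "INT" (integrase) and "RT" (reverse transcriptase).
    """
    int_pos = domain_order.index("INT")
    rt_pos = domain_order.index("RT")
    # Integrase N terminal to RT marks copia-like elements;
    # integrase after RT marks gypsy-like elements.
    if int_pos < rt_pos:
        return "Ty1/copia-like"
    return "Ty3/gypsy-like"


print(superfamily(["AP", "INT", "RT"]))
print(superfamily(["AP", "RT", "INT"]))
```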
We further classified the LTR-retrotransposons into 41 families based on their LTR sequences (Supplemental Table S2). We used LTR sequences to classify families rather than more commonly used RT sequences for two reasons. First, many nonautonomous retrotransposon sequences lack an intact RT domain. Second, RT domain sequences diverge at a slower rate than LTR sequences, making it difficult to distinguish more recently diverged families based on the RT domain alone. Following previously proposed guidelines for transposable element annotation, we grouped elements into the same family when their LTRs shared >80% identity across at least 80% of their length. Using these criteria, we grouped the G. max elements into 20 families (Supplemental Table S2). Only three of these families contained previously described G. max retrotransposons: SIRE-, Diaspora-, and Calypso-like elements (Wright and Voytas, 2002; Laten et al., 2003; Yano et al., 2005).
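The family criterion (>80% identity across at least 80% of LTR length) can be sketched as follows. Two points are assumptions of this example rather than details given in the text: coverage is measured against the longer of the two LTRs, and elements are merged by single linkage, so one qualifying pairwise hit is enough to join two groups. The element names and hit values are made up.

```python
def same_family(identity, aligned_len, len_a, len_b,
                min_identity=0.80, min_coverage=0.80):
    """Return True if a pairwise LTR hit satisfies the family criterion."""
    coverage = aligned_len / max(len_a, len_b)  # assumption: longer LTR
    return identity > min_identity and coverage >= min_coverage


def group_families(lengths, hits):
    """Group elements into families by single linkage.

    lengths: {element name: LTR length in bp}
    hits: iterable of (name_a, name_b, identity, aligned_len)
    """
    parent = {name: name for name in lengths}

    def find(x):
        # Follow parent links to the group representative.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, identity, aligned_len in hits:
        if same_family(identity, aligned_len, lengths[a], lengths[b]):
            parent[find(a)] = find(b)

    families = {}
    for name in lengths:
        families.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in families.values())


# Hypothetical elements and pairwise hits, for illustration only.
lengths = {"re-1": 1000, "re-2": 980, "re-3": 400}
hits = [("re-1", "re-2", 0.93, 950), ("re-1", "re-3", 0.85, 300)]
print(group_families(lengths, hits))
```

In this toy input the second hit passes the identity threshold but covers only 30% of the longer LTR, so re-3 remains in its own family.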
At the time of retrotransposon insertion, the two LTR sequences are identical. It is thus possible to estimate the time since insertion by aligning the two LTR sequences of each element and counting the number of nucleotide substitutions (see "Materials and Methods"). Eight of the 20 G. max families contained elements that appear to have inserted within the last one million years, and three elements, each from a different family, contained identical LTRs, indicating that the insertion events were very recent (Supplemental Table S2). In addition, we identified insertion events in cv Williams 82 that are absent from line PI 96983 and vice versa (Innes et al., 2008). Furthermore, we identified G. max EST sequences with over 90% DNA sequence identity to elements in 10 of the 20 G. max families (Supplemental Table S2). Of particular note are families 1 (SIRE-like), 9, and 10, all of which had EST matches with 97% or higher identity and insertions unique to cv Williams 82 or PI 96983. Thus, at least some elements in these families are being actively expressed and are likely generating new insertions.
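The dating approach described above can be sketched as a short calculation. This example assumes the two LTRs are already aligned, uses the uncorrected proportion of substituted sites rather than a model-corrected distance, and takes an illustrative substitution rate of 1.3e-8 per site per year; the rate actually used is specified in "Materials and Methods" and may differ.

```python
ASSUMED_RATE = 1.3e-8  # substitutions per site per year (illustrative assumption)


def ltr_divergence(ltr5, ltr3):
    """Proportion of substituted sites between two aligned LTRs.

    Alignment columns containing a gap ("-") are skipped.
    """
    if len(ltr5) != len(ltr3):
        raise ValueError("LTRs must be aligned to equal length")
    compared = mismatches = 0
    for a, b in zip(ltr5.upper(), ltr3.upper()):
        if a == "-" or b == "-":
            continue
        compared += 1
        if a != b:
            mismatches += 1
    if compared == 0:
        raise ValueError("no ungapped columns to compare")
    return mismatches / compared


def insertion_age_years(ltr5, ltr3, rate=ASSUMED_RATE):
    """Estimate time since insertion as T = K / (2r)."""
    # Both LTRs accumulate substitutions independently, hence the factor 2.
    return ltr_divergence(ltr5, ltr3) / (2 * rate)


# Identical LTRs imply a very recent insertion (age estimate of zero).
print(insertion_age_years("ACGTACGT", "ACGTACGT"))
```

Under the assumed rate, one substitution per 100 aligned sites corresponds to roughly 385,000 years.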
In G. tomentella, we grouped the 23 intact elements into 16 families (Supplemental Table S2). Similar to G. max, nine of these 16 families contain elements that had inserted within the last million years, and two elements contained identical LTRs. We grouped the seven P. vulgaris elements into five families, including one element with identical LTRs and one that had inserted approximately 600,000 years ago. Thus, all three legume species characterized contain multiple retrotransposon families that have been active in the recent past.
Because of how rapidly LTR sequences diverge, it was not possible to align LTR sequences of elements from different families accurately and hence was not possible to construct phylogenetic trees based on the LTR sequences. We therefore used the RT domains to construct phylogenetic trees using Bayesian analyses (see "Materials and Methods"), splitting the copia-like and gypsy-like elements into separate trees (Fig. 2, A and B). The LTR-based families, indicated by shaded ovals in Figure 2, grouped together at the terminal branches of the RT trees, indicating that there has been little to no recombination between elements belonging to different RT clades. The copia-like elements exhibited a high level of diversity in their RT domains, with no easily recognizable super-clades. In contrast, the gypsy-like elements formed three distinct clades. A recent analysis of retrotransposon content in the model legume Medicago truncatula uncovered similar patterns of diversity, with the Copia superfamily being significantly more diverse than the Gypsy superfamily (Wang and Liu, 2008). Similarly, an analysis of the retroelement content in garden pea (Pisum sativum) revealed a greater diversity of copia elements than gypsy elements (Macas et al., 2007), suggesting that this general pattern arose prior to the split between the Glycine lineage and the Medicago/Pisum lineage. Despite this similarity in pattern, comparison of abundant repeat families between M. truncatula and G. max found low levels of sequence similarity, with no major repeat families shared between the two species other than rDNA (Macas et al., 2007; Swaminathan et al., 2007).
To see how these legume retrotransposons were related to previously described retrotransposons from other plant species, we used representative RT sequences from divergent branches of each tree to search the NCBI nonredundant database for related RT sequences. Top hits were then added to the RT alignment and new trees constructed. As shown in Figure 2, A and B, these additional sequences were dispersed throughout the two RT trees, indicating that Glycine and Phaseolus contain a diversity of retrotransposons that are distributed widely among angiosperms. Assuming that these elements have not been transferred horizontally between species, it would suggest that at least some of these lineages predate the split between monocots and eudicots. This conclusion is supported by a recent phylogenetic analysis of copia elements from wheat (Triticum aestivum), barley (Hordeum vulgare), rice (Oryza sativa), and Arabidopsis (Arabidopsis thaliana), which revealed the presence of six distinct copia lineages that predate the monocot-dicot split (Wicker and Keller, 2007).
Figure 2. A, Bayesian tree derived from the RT domains of copia-like LTR-retrotransposons. B, Bayesian tree derived from the RT domains of gypsy-like LTR-retrotransposons. Species of origin are indicated by color-coding and elements belonging to the same LTR family are indicated by shaded ovals. Elements not in shaded ovals belong to other families as indicated in Supplemental Table S2. Numbers indicate posterior probabilities, and the scale indicates nucleotide substitutions per site.
Within one of the three clades of gypsy-like elements, we identified several that contained a chromatin organization modifier (pfam 00385, CHROMO) domain, which is a hallmark of the CHROMO domain-containing retrotransposons, also known as Chromoviruses. The CHROMO domain is part of the integrase domain and is located just upstream of the putative polypurine tract. It is thought to be involved in binding to methylated histone tails and/or to RNA (Nielsen et al., 2002). Four elements from G. max (129e12-re-1, 52d1-re-3, 109b11-re-4, and 77p13-re-2) contained a CHROMO domain. CHROMO domain retrotransposons are distributed widely among eukaryotes, and examples can be found in Arabidopsis, Medicago, and rice (Fig. 2B) as well as in animals and fungi, suggesting that they form an ancient family of gypsy-like elements (Marin and Llorens, 2000;Kordis, 2005).

Independent Increases in Retrotransposon Content in H2 in Both G. max and G. tomentella
As stated above, G. max and G. tomentella diverged approximately 5 to 7 million years ago (Innes et al., 2008) and share a whole-genome duplication event that occurred approximately 10 to 14 million years ago (Schlueter et al., 2004; Innes et al., 2008). At the time of the whole-genome duplication event, it is assumed that the resulting homoeologous chromosomes were very similar in terms of gene and retrotransposon content, particularly if G. max is derived from an autotetraploid event, as some current data suggest (Straub et al., 2006). Comparison of retrotransposon content in H1 to H2 in G. max revealed striking differences in content and number (Supplemental Table S2; Innes et al., 2008), with H2 containing many more insertions than H1. Significantly, we also observed this pattern in G. tomentella. The preferential accumulation of retrotransposons in H2 appears to have occurred independently in G. max and G. tomentella, because the majority of the elements that we identified inserted after the speciation event that gave rise to these two species (Supplemental Table S2). These data indicate that H2 is more prone to retrotransposon accumulation than H1 in both species. Fluorescence in situ hybridization analyses in G. max, along with preliminary analyses of the G. max whole-genome shotgun sequence (Soybean Genome Project, DoE Joint Genome Institute; http://www.phytozome.net/soybean.php), indicate that the H2 region is located near a centromere, while the H1 region is not (Innes et al., 2008). These observations suggest that the H2 region has been translocated to a pericentromeric position, which may be promoting retrotransposon accumulation (Innes et al., 2008). It seems likely, therefore, that this translocation event occurred sometime after the divergence of H1 and H2, but before the divergence of G. max and G. tomentella, and thus predisposed H2 to retrotransposon accumulation in both species.

Relative Abundance of Intact Elements
We analyzed approximately 3.7 Mb of genomic sequence from G. max and identified 45 intact elements, which corresponds to an average density of 12.2 elements per Mb. We do not think this density is an overestimate, as only about 25% of the sequence analyzed came from known pericentromeric BACs (i.e. H2; Supplemental Table S2), while the soybean genome is thought to be made up of 40% to 60% repetitive DNA (Goldberg, 1978;Gurley et al., 1979;Swaminathan et al., 2007). Nevertheless, if we exclude the H2 sequence from the calculation, we identified 22 elements in 2.83 Mb, or an average of 7.8 intact elements per Mb. This average is still much higher than in M. truncatula, where only 2.3 elements per Mb were identified (Wang and Liu, 2008), and in rice, where the density across the whole genome was found to be 0.84 elements per Mb (Gao et al., 2004). It should be noted, however, that the M. truncatula value may be an underestimate, as it is derived from BAC sequences from the M. truncatula genome project, which is focused on gene-rich regions (Young et al., 2005).
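The densities quoted above follow directly from the element counts and sequence lengths. The short check below reproduces them; the H2-only figure is not stated in the text and is derived here by subtraction (45 - 22 elements in 3.7 - 2.83 Mb), so it should be read as an approximation.

```python
def density_per_mb(n_elements, megabases):
    """Intact elements per megabase of analyzed sequence."""
    return n_elements / megabases


all_regions = density_per_mb(45, 3.7)             # all G. max sequence
non_h2 = density_per_mb(22, 2.83)                 # excluding pericentromeric H2
h2_only = density_per_mb(45 - 22, 3.7 - 2.83)     # derived by subtraction

print(round(all_regions, 1))  # 12.2
print(round(non_h2, 1))       # 7.8
print(round(h2_only, 1))      # 26.4
```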

The Glycine Genome Appears to Be Expanding
The remarkable variation in nuclear genome size of flowering plants is associated mainly with the size of the repetitive element fraction, especially LTR-retrotransposons (Bennetzen, 2005; Ammiraju et al., 2007). Increases in LTR-retrotransposon content are counterbalanced by internal genomic forces driving DNA removal, such as unequal crossing-over between homologous sequences and illegitimate recombination resulting from multiple mechanisms, including repairs of double-strand breaks (nonhomologous end-joining) and slip-strand mispairing. The rates of both genome expansion and genome contraction processes appear to vary between species (Devos et al., 2002; Ma et al., 2004; Bennetzen et al., 2005; Vitte and Bennetzen, 2006), allowing some genomes to shrink while others expand.
The rate of DNA removal caused by homologous recombination between LTR sequences of an individual element can be estimated by calculating the ratio of intact LTR-retroelements to solo-LTRs (Devos et al., 2002;Ma et al., 2004;Bennetzen et al., 2005;Vitte and Bennetzen, 2006). Analysis of our G. max and G. tomentella data revealed ratios of 8.0 and 7.7, respectively (Supplemental Table S3; P. vulgaris was not included in these calculations due to the low number of both intact retroelements and solo-LTRs identified in the sequenced regions). These ratios are much higher than those calculated for Arabidopsis (0.9), rice (0.7), and Medicago (1.8) and are similar to maize (Zea mays; 6.8) and Lotus japonicus (5.7; Vitte and Bennetzen, 2006). The low frequency of solo-LTRs compared to intact LTR-retroelements in our analysis indicates that homologous recombination between LTRs is not a major force driving removal of retroelement sequences in Glycine, at least in the regions analyzed.
DNA loss through illegitimate recombination is likely the stronger force driving DNA removal in plants (Devos et al., 2002;Ma et al., 2004;Grover et al., 2008). Such recombination events are typically associated with small deletions and can be detected by aligning LTR sequences from single elements. The relative frequency of these events can be estimated by comparing the ratio of base substitutions to insertion/deletion events, with higher ratios indicating lower rates of DNA removal via illegitimate recombination. G. max and G. tomentella had ratios of 12.8 and 13.7, respectively (Supplemental Table S2), significantly higher than that reported for previously analyzed plant species such as maize (8.2) and Medicago (3.6; Vitte and Bennetzen, 2006; Table I). These high ratios, combined with the low frequency of solo-LTRs, suggest that DNA loss rates from Glycine are lower than those reported for other plant species. This apparently low DNA loss rate combined with what appears to be rapid increases in retrotransposon content suggests that the genomes of G. max and G. tomentella are still expanding and doing so independently since their divergence from a common ancestor. It should be noted, however, that 51% of the LTR pairs analyzed in G. max are located in a pericentromeric region (H2) that has undergone a dramatic accumulation of retroelements in the last 10 to 14 million years (Innes et al., 2008); thus, the overall rate of genome expansion is likely to be lower than that indicated by our dataset. Nevertheless, if only the non-H2 elements are considered, we still observe a high ratio of base substitutions to insertion/deletions (10.9) and a low frequency of solo LTRs (two solo LTRs total compared to three in the H2 region; Supplemental Tables S2 and S3), suggesting that the Glycine genome overall is still expanding.
This conclusion would seem at odds with the general observation that polyploid genomes tend to be smaller than the sum of the genome sizes of their diploid ancestors (Soltis and Soltis, 1999; Ozkan et al., 2003; Gu et al., 2006). However, a recent analysis of insertion/deletion events in diploid and polyploid cotton (Gossypium spp.) suggests that the genomes of both the diploid ancestors and of the polyploid derivative are adding DNA (via retroelement replication) faster than they are losing it (via homologous and illegitimate recombination; Grover et al., 2008). The diploids appear to be accumulating DNA faster than the polyploid, though, hence giving an overall appearance that the polyploid species is losing DNA relative to its diploid progenitors.
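The ratio of base substitutions to insertion/deletion events used in this section can be computed from aligned LTR pairs roughly as sketched below. Two choices here are assumptions of the example: a run of consecutive gap columns is counted as one indel event, and counts are pooled across elements before taking the ratio. The sequences are invented for illustration.

```python
def substitution_indel_counts(ltr5, ltr3):
    """Count substitutions and indel events in one aligned LTR pair."""
    if len(ltr5) != len(ltr3):
        raise ValueError("LTRs must be aligned to equal length")
    substitutions = indel_events = 0
    in_gap = False
    for a, b in zip(ltr5.upper(), ltr3.upper()):
        if a == "-" or b == "-":
            # Count each contiguous run of gap columns once.
            if not in_gap:
                indel_events += 1
            in_gap = True
        else:
            in_gap = False
            if a != b:
                substitutions += 1
    return substitutions, indel_events


def pooled_ratio(aligned_pairs):
    """Substitutions per indel event, pooled over many LTR pairs."""
    subs = indels = 0
    for ltr5, ltr3 in aligned_pairs:
        s, i = substitution_indel_counts(ltr5, ltr3)
        subs += s
        indels += i
    return subs / indels if indels else float("inf")


# Invented alignment: one substitution and two indel events.
pair = ("ACGTACGTAC--GTACGT", "ACGAACGTACGGGTAC-T")
print(substitution_indel_counts(*pair))
```

A higher pooled ratio corresponds to fewer small deletions per substitution, which the text interprets as slower DNA removal by illegitimate recombination.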

Transition to Transversion Ratios in Glycine LTRs Indicate a High Rate of Retrotransposon Methylation
Analysis of LTR mutation patterns can also be used to gain insight into whether these sequences are typically methylated. Methylated LTR sequences are more likely to accumulate transition mutations than transversion mutations due to the high frequency at which 5-methyl cytosine can be replaced by thymine during DNA replication (Vitte and Bennetzen, 2006); thus, the ratio of transition mutations to transversion mutations (Ts:Tv) is often used as an indicator of DNA methylation. The majority of LTRs in all species studied to date show a Ts:Tv ratio higher than that of nontransposon-related coding sequences (SanMiguel et al., 1998; Vitte and Bennetzen, 2006). The Ts:Tv ratios observed for LTRs from G. max and G. tomentella were 2.4 (760:316) and 2.6 (266:104), respectively (Supplemental Table S2), which are similar to those reported previously for the LTRs found in the legumes L. japonicus and M. truncatula (2.4 and 2.5; Vitte and Bennetzen, 2006). Comparison of 15 G. max and G. tomentella protein coding genes on H1 (genes A through O in Innes et al., 2008) revealed a Ts:Tv ratio of 1.8 (378:212), while comparison of eight genes on H2 (genes A, B, D, G, H, J, K, and O) gave a Ts:Tv ratio of 1.7 (255:147). Similarly, comparison of G. max 'Williams 82' H1 and H2 (genes A through O) gave a Ts:Tv ratio of 1.7 (700:409). Based on an expected Ts:Tv ratio of 1.7, the ratio of 2.4 (760:316) for retroelements in G. max is significantly different from low copy genes (chi-square = 27.5; P < 0.0001). The elevated ratio found in the LTRs relative to the low copy genes leads us to conclude that the majority of Glycine retrotransposon LTRs have become methylated. It is also noteworthy that the H2 low copy genes do not display an elevated Ts:Tv ratio, suggesting that the low copy genes have not become methylated despite their pericentromeric location.
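The comparison of LTR and low copy gene Ts:Tv ratios can be illustrated with a one-degree-of-freedom chi-square goodness-of-fit test. The sketch below assumes the expected ratio is exactly 1.7 and applies no continuity correction; under those assumptions the statistic comes out near 27, close to but not identical with the reported 27.5, which presumably reflects a slightly different expected ratio or rounding.

```python
import math


def chi_square_ts_tv(ts, tv, expected_ratio):
    """Goodness-of-fit statistic for observed Ts and Tv counts.

    expected_ratio is the Ts:Tv ratio expected under the null hypothesis.
    """
    total = ts + tv
    expected_ts = total * expected_ratio / (1 + expected_ratio)
    expected_tv = total - expected_ts
    return ((ts - expected_ts) ** 2 / expected_ts
            + (tv - expected_tv) ** 2 / expected_tv)


def p_value_1df(chi2):
    """Upper-tail probability of a chi-square variable with 1 df."""
    return math.erfc(math.sqrt(chi2 / 2))


ltr_ratio = 760 / 316                      # G. max LTR counts from the text
chi2 = chi_square_ts_tv(760, 316, 1.7)     # expected ratio from low copy genes
p = p_value_1df(chi2)

print(round(ltr_ratio, 1), round(chi2, 1), p < 0.0001)
```

The same function can be applied to the G. tomentella counts (266:104) to check whether the elevation holds there as well.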

Nonautonomous LTR-Retrotransposons Appear to Be Replicating in G. max and G. tomentella
The terms autonomous and nonautonomous were first applied to DNA-based transposons to distinguish between elements that encoded all necessary proteins for transposition versus those that relied on other elements to provide transposition functions (McClintock, 1950; Fedoroff et al., 1983). Typically, nonautonomous elements are derived from autonomous elements via deletion of transposase genes in the case of DNA-based transposons or deletion of Gag and Pol genes in the case of retrotransposons. Retrotransposons frequently suffer deletions that render their Gag-Pol ORFs nonfunctional. However, it has only recently been established that such elements can still be replicated, presumably by Gag and Pol proteins provided by other elements in the genome (Witte et al., 2001; Kalendar et al., 2004). Replication of nonautonomous retrotransposons has been inferred by the presence of genetically uniform families of elements lacking functional Gag-Pol genes and displaying recent insertions. Examples include the terminal repeats in miniature (TRIMs), large retrotransposon derivatives (LARDs), and so-called Morganes families (Fig. 1B; Witte et al., 2001; Kalendar et al., 2004). In each of these cases, however, the element(s) providing the trans-acting Gag and Pol products have not been identified. There are two reports of putative pairs of autonomous and nonautonomous retroelements in plants, Dasheng and RIRE2 from rice (Jiang et al., 2002) and BARE-1 and BARE-2 from barley (Tanskanen et al., 2007), but in both of these cases, there is substantial sequence divergence between the autonomous and nonautonomous families and no direct evidence that the transposition functions are being provided by the autonomous member.
We identified several apparently replicating nonautonomous retrotransposon families in the genomes of G. max and G. tomentella, including one family of elements containing both autonomous and nonautonomous members. Family 6 from G. max is an example of a family for which no autonomous members were found in the sequences we analyzed (Supplemental Table S2). All elements in this family possessed similar LTRs, primer binding sites, and polypurine tracts. This family appears to be one of the most numerous in the G. max genome, as we identified a total of five intact copies on three different BAC clones. Support for this conclusion was obtained by searching a database of low-pass whole-genome 454 DNA sequence reads in which high copy number repetitive elements have been assembled into contigs (http://stan.cropsci.uiuc.edu/sequencing.php; Swaminathan et al., 2007). Using the LTR from family 6 element 45m6-re-2 as a query, we identified a contig in this database that was 94% identical over the entire length of the LTR (contig 80367). This contig was 6.3 kb long and was made up of 1,067 reads. Using the formula described in Swaminathan et al. (2007), this level of read redundancy corresponds to a copy number of 265 genome wide. We also identified a second contig, 80354, that was 93% identical to the LTR of 45m6-re-2, which had an estimated copy number of 252. Thus, family 6 likely has a copy number of at least 500 in the soybean genome. Element 45m6-re-2 had nearly identical LTRs (Supplemental Table S2) but a highly degenerated Gag-Pol region; thus, we infer that it was recently inserted and that the Gag and Pol functions must have been supplied by another element located elsewhere in the genome.
In contrast to family 6, family 10 appears to have both autonomous and nonautonomous members. We identified five members of this family, three of which were classified as nonautonomous, while two appeared to be fully autonomous (Fig. 3). Both classes appear to have been active recently, as two of the nonautonomous class and one of the autonomous class members contained only a single nucleotide difference in their LTRs. The three nonautonomous members of this family were nearly identical to each other across their whole length but, compared to the two autonomous elements, were missing approximately 2 kb that spanned the RT domain. All five elements of this family were >97% identical to each other across the entirety of their shared sequence. Phylogenetic analysis using just this shared sequence showed that the three nonautonomous elements clustered closely together and were equally related to the two autonomous elements (data not shown), further supporting a model in which the autonomous and nonautonomous elements are both replicating. To our knowledge, this is the first example of a nonautonomous family of LTR-retroelements that appears to be recently derived from an autonomous "parent" element.

Replication of Nonautonomous Retroelements Is Likely Having a Large Impact on Genome Size in Glycine Species
As described above, we identified several different families of nonautonomous retrotransposons that appear to be actively replicating in the genomes of G. max and G. tomentella. When combined with the previously identified nonautonomous elements such as TRIMs, LARDs, and Morganes (Fig. 1B), there appears to be a great diversity in the structures of such elements. This suggests that almost any element with intact LTRs, primer binding site, and polypurine tract may be capable of replication when appropriate Gag and Pol proteins are provided in trans. It is tempting to speculate that nonautonomous families of retrotransposons can arise anytime that active autonomous members are present. This resembles the quasispecies concept in the evolution of retroviruses and RNA viruses (Domingo et al., 1985), which has also been applied to the evolution of retrotransposons (Casacuberta et al., 1995; Sabot and Schulman, 2006). The replication of RNA by RT is an error-prone process; thus, replication of retroviruses inevitably leads to generation of many different mutant variants (quasispecies), which depend on their active and autonomous cousins for replication functions. There is no reason why the same should not occur with retrotransposons, and our findings support this hypothesis. If any element with intact LTRs, primer binding site, and polypurine tract can be replicated, this provides a mechanism by which retrotransposon-related sequences in plant genomes may be driven to very high copy numbers by autonomous elements.

Apparent "Hitchhiking" of Unrelated DNA Sequences within LTR-Retrotransposons
Families 21 and 22 from G. tomentella were unique among the families we characterized in that all elements in these families contained a large insertion of apparently noncoding sequence downstream of the Pol ORF. The inserted sequence differed between the two families but was highly conserved within each family. Both families contained elements that had inserted recently, as well as elements that had inserted much earlier; thus, the inserted sequences have been replicated along with these elements for millions of years. This implies that LTR-retrotransposons are capable of replicating other unrelated DNA sequences and could potentially pick up functional genes. Although both families 21 and 22 are gypsy-like elements, the insertion and replication of additional DNA sequence downstream of the Pol ORF has also been described in the copia-like SIRE elements, which contain an additional ORF in an equivalent position (Laten et al., 1998; Holligan et al., 2006). Additionally, insertion and replication of a large ORF upstream of the Gag-Pol genes has been reported in gypsy-like elements named Ogre from the legumes pea and Vicia pannonica (Neumann et al., 2006; Macas et al., 2007).
We observed a possible example of such retrotransposon hitchhiking in the gypsy-like family 28 from G. tomentella. A single element in this family on BAC clone gtt1-129o17 (AC188784.13) contained an insertion of approximately 10.5 kb. The origin of this insertion is unclear, but it contains a mixture of noncoding sequence, several gene fragments, and one full-length nucleotide binding-Leu-rich repeat (NB-LRR) disease resistance-like gene (Fig. 4). We believe that this 10.5-kb region is contained within a single retrotransposon element based on the structure of the LTR sequences, which are 97% identical to each other and are flanked by a target site duplication (TAAGT/TAAGT). These LTRs are 85% identical to other members of family 28 that lack the 10.5-kb insertion. Based on similarity to nearby NB-LRR sequences, the NB-LRR gene within this element may be flanked by appropriate promoter and terminator sequences. What is not clear is whether this retrotransposon can still be replicated, because we did not identify any other copies of this family that carried an NB-LRR gene. However, recent work on the legume V. pannonica has shown that Ogre elements larger than 25 kb can be replicated at a high frequency (Neumann et al., 2006). If the gtt1-129o17 element were replicated, this would represent a new mechanism for duplicating and dispersing disease resistance genes throughout a plant genome. Intriguingly, both transcription and transposition of the tobacco retroelement Tnt1 can be induced by fungal elicitors (Melayah et al., 2001), suggesting that pathogen infection could promote retroelement multiplication. If transposition of the gtt1-129o17 element were induced by pathogen infection, it would provide a link between pathogen infection and creation of new disease resistance genes.

Figure 4. An LTR-retroelement with a disease resistance gene insertion. LTR-retrotransposon 129o17-re-2 located on G. tomentella BAC gtt1-129o17 contains an approximately 10.5-kb insertion that includes a full-length plant disease resistance gene belonging to the NB-LRR family. Numbers indicate nucleotide position relative to the complete BAC sequence (accession no. AC188784.13).

Long Interspersed Nuclear Elements of G. max, G. tomentella, and P. vulgaris
Long interspersed nuclear elements (LINEs) represent a non-LTR class of retroelements found throughout eukaryotes (Eickbush, 1992). Xiong and Eickbush's (1990) cladistic studies suggest that the first LTR-retroelements arose through the acquisition of LTRs by LINEs, which would make LINEs the oldest class of eukaryotic retroelements. Compared to LTR-retroelements, there have been relatively few analyses performed on LINEs in plants. For those species that have been studied (maize, barley, Arabidopsis, lotus [L. japonicus], and sugar beet [Beta vulgaris]), LINEs appear to be more diverse but less numerous than LTR-retroelements (Schwarz-Sommer et al., 1987; Schmidt et al., 1995; Wright et al., 1996; Vershinin et al., 2002; Holligan et al., 2006). To identify potential LINEs in our BAC sequences, we used BLASTX to search the NCBI nonredundant protein database for similarity to previously characterized LINEs using the BAC DNA sequences as queries. We identified multiple LINE-like elements in G. max, G. tomentella, and P. vulgaris. As observed in other plant species, LINEs were much less common than LTR-retroelements. A total of 21 putative LINEs, including remnants, were identified among 36 G. max BACs analyzed. These elements were widely dispersed, as only two BACs contained more than one element, and both of these had just two elements. To assess the diversity of these LINEs, we constructed an RT-based phylogenetic tree and included representative LINEs from maize, barley, Arabidopsis, and sugar beet (Fig. 5). This analysis revealed that the G. max LINEs are quite diverse. The LINEs from other plant species were distributed throughout the tree, suggesting that the G. max LINEs are of ancient origin.

CONCLUSION
The analyses presented above show that the G. max genome has been heavily impacted by the activity of retroelements and likely continues to be shaped by their replication. Of most significance is our identification of three different nonautonomous families that have undergone recent replication. This observation suggests that rapid expansion of genome size can be driven by both autonomous and nonautonomous elements. A second striking feature of our dataset is the relatively low frequency of insertion/deletion events observed in the LTRs of both G. max and G. tomentella compared to previously characterized plant species, including the legume M. truncatula (Table I). Although the underlying cause for this is not known, it suggests that the G. max genome is likely still expanding. Finally, the identification of a retroelement carrying an NB-LRR disease resistance-like gene provides a potential new mechanism for the rapid evolution of new resistance genes.

BAC and Retroelement Sequences
All BAC sequences were obtained from either the High Throughput Genomic Sequence database or the nonredundant nucleotide database maintained by NCBI. The majority of these sequences were generated as part of the NSF-funded project "Comparative Analysis of Legume Genome Evolution" (grant no. DBI-0321664; Innes et al., 2008). Accession numbers for each BAC are provided in Supplemental Table S1.

Identifying Retroelements
Approximately 3.7 Mb of Glycine max genomic sequence were searched, including 1 Mb from H1 from the NSF project (Innes et al., 2008), about 0.85 Mb of H2, and 1.85 Mb derived from BACs not assigned to a particular genomic location that have been sequenced as part of an ongoing project in the R. Shoemaker laboratory. Where BACs covered overlapping regions, only the unique sequences were counted in determining the total area analyzed and total elements identified. We used the program LTR_STRUC as the first step in identifying retrotransposon sequences (McCarthy and McDonald, 2003). LTRs from the elements identified by LTR_STRUC were used as queries in BLAST searches (Altschul et al., 1997). These BACs were also searched for the presence of retrotransposon-related genes using the BAC sequences as a query to search the NCBI nonredundant database using BLASTX. Regions of homology to known retrotransposon-like sequences (e.g. RT, integrase, etc.) were then manually evaluated for the presence of LTRs. In addition, we used the REPuter and RepeatMasker programs to identify repeated sequences (Kurtz et al., 2001; Smit et al., 1996-2008). These additional searches uncovered several intact elements missed by the LTR_STRUC program. Essentially the same approach was used to identify retrotransposons in 1.6 Mb of genomic sequence from Glycine tomentella (0.5 Mb from H1, 0.35 Mb from H2, and 0.75 Mb of G. tomentella sequence from BACs not yet assigned a genomic location) and 0.94 Mb of sequence from Phaseolus vulgaris (Supplemental Table S1). To identify potential LINEs, we used BLASTX to search the NCBI nonredundant protein database for similarity to previously characterized LINEs using the BAC DNA sequences as queries. All elements identified by the above approaches were deposited in a local database, and BAC DNA sequences were then searched for homology to this database using BLASTN (Altschul et al., 1997).
To group retrotransposons into families, all LTR sequences were compared to each other in pair-wise BLASTN comparisons. Elements that shared a minimum of 80% sequence identity over at least 80% of the length of the shortest LTR were grouped into the same family per the recommendations of .
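The 80%/80% grouping rule can be illustrated with a minimal Python sketch. It assumes the pairwise BLASTN results have already been reduced to tuples of (query, subject, percent identity, alignment length), and it assumes that grouping is transitive (single linkage), which the text does not state. The function name `group_families` and its input layout are hypothetical and are not part of the original analysis.

```python
def group_families(ltr_lengths, hits, min_identity=80.0, min_coverage=0.8):
    """Group LTRs into families by single-linkage clustering.

    ltr_lengths: dict mapping LTR name -> length in bp.
    hits: iterable of (query, subject, percent_identity, alignment_length),
          assumed to be pre-parsed pairwise BLASTN results.
    Two LTRs are linked when identity >= min_identity and the alignment
    covers >= min_coverage of the shorter LTR. Transitive linking is an
    assumption of this sketch.
    """
    parent = {name: name for name in ltr_lengths}

    def find(x):
        # Union-find root lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, subject, identity, aln_len in hits:
        shortest = min(ltr_lengths[query], ltr_lengths[subject])
        if identity >= min_identity and aln_len >= min_coverage * shortest:
            parent[find(query)] = find(subject)

    families = {}
    for name in ltr_lengths:
        families.setdefault(find(name), []).append(name)
    return sorted(sorted(members) for members in families.values())


# Hypothetical example data, not taken from the study.
lengths = {"a": 500, "b": 480, "c": 510, "d": 300}
hits = [("a", "b", 92.0, 450), ("b", "c", 85.0, 400), ("a", "d", 95.0, 100)]
families = group_families(lengths, hits)
```

In this made-up example, `d` stays in its own family because its alignment to `a` covers only 100 bp of a 300-bp LTR, below the 80% coverage threshold.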

Sequence Alignments and Phylogenetic Tree Construction
Multiple sequence alignments were performed using ClustalX (Jeanmougin et al., 1998) and the MEGA software package version 3.1 followed by manual adjustments to optimize the alignments (Kumar et al., 2004). Transition and transversion mutation rates were also calculated using the MEGA software package. Trees were generated using the MrBayes software package version 3.1.2 (Ronquist and Huelsenbeck, 2003) using the General Time Reversible DNA substitution model with gamma-distributed rate variation across sites and a proportion of invariable sites (General Time Reversible +I+G model). We performed paired runs with four chains each with sampling every 100 generations. The priors for each analysis were the program's defaults. All runs started with a random tree and were run for 5 million generations. After elimination of the first 25% of runs, which included the burn-in phase, the remaining iterations were summarized in a consensus tree with posterior probabilities as nodal support.
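As a rough check on the scale of the Bayesian analysis, the sampling settings above can be turned into sample counts. The sketch below assumes that the "first 25%" refers to sampled trees in each run and ignores any initial sample taken at generation 0; the function name `retained_samples` is hypothetical.

```python
def retained_samples(generations, sample_every, burnin_fraction, runs):
    """Count MCMC samples kept after burn-in.

    Assumes one sample every `sample_every` generations per run and that
    the burn-in fraction is applied to the samples of each run separately.
    """
    per_run = generations // sample_every
    discarded = int(per_run * burnin_fraction)
    return {
        "per_run": per_run,
        "discarded_per_run": discarded,
        "retained_total": (per_run - discarded) * runs,
    }


# Settings stated in the text: 5 million generations, sampling every 100
# generations, 25% burn-in, two paired runs.
summary = retained_samples(5_000_000, 100, 0.25, runs=2)
```

Under these assumptions, each run yields 50,000 samples, 12,500 of which are discarded, leaving 75,000 samples across the two runs for the consensus tree.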

Dating LTR-Retrotransposon Insertion Times
The insertion times of LTR-retroelements were dated by aligning their 5′ and 3′ LTR sequences and identifying transition and transversion substitutions using the MEGA software package version 3.1 (SanMiguel et al., 1998). The time since element insertion was calculated using the formula T = K/2r, where T = time, K = distance calculated using Kimura's two parameter model as implemented within the MEGA software package, and r = substitution rate. Kimura's two parameter model corrects for multiple hits (Kimura, 1980). Two values for the substitution rate were used and are shown in Supplemental Table S2: 5.1 × 10⁻⁹ (the average synonymous substitution rate estimated for genic sequences in G. max; Pfeil et al., 2005) and 1.3 × 10⁻⁸, which is the value used by Vitte and Bennetzen (2006) in Table I. The latter value takes into account the observation that LTR sequences accumulate mutations at a higher rate than silent sites in standard housekeeping genes, possibly because of the high rate of cytosine methylation observed in LTR sequences (Vitte and Bennetzen, 2006).
Sequence data from this article can be found in the GenBank/EMBL data libraries under accession numbers FJ197979 to FJ198023 (G. max LTR-retrotransposons), FJ402900 to FJ402922 (G. tomentella LTR-retrotransposons), and FJ402923 to FJ402929 (P. vulgaris LTR-retrotransposons) and are also listed in Supplemental Table S2. The LINEs analyzed in Figure 5 can be found under accession numbers FJ402887 to FJ402899.
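The dating formula T = K/2r can be illustrated with a minimal Python sketch. It is not the MEGA implementation: it assumes the two LTRs are already aligned to equal length, skips any column containing a gap or ambiguous base, and uses the standard Kimura two-parameter distance K = -(1/2) ln[(1 - 2P - Q) sqrt(1 - 2Q)], where P and Q are the proportions of transitions and transversions. The function names and the example sequences are hypothetical.

```python
import math

TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}


def k2p_distance(ltr5, ltr3):
    """Kimura two-parameter distance between two aligned LTR sequences."""
    if len(ltr5) != len(ltr3):
        raise ValueError("sequences must be aligned to the same length")
    sites = transitions = transversions = 0
    for a, b in zip(ltr5.upper(), ltr3.upper()):
        if a not in "ACGT" or b not in "ACGT":
            continue  # skip gaps and ambiguous bases (assumption of this sketch)
        sites += 1
        if a == b:
            continue
        if (a, b) in TRANSITIONS:
            transitions += 1
        else:
            transversions += 1
    if sites == 0:
        raise ValueError("no comparable sites")
    p = transitions / sites
    q = transversions / sites
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


def insertion_time(k, rate):
    """Years since insertion, T = K / (2r), with r in substitutions/site/year."""
    return k / (2 * rate)


# Hypothetical 1,000-bp LTR pair differing by a single transition.
ltr5 = "A" * 1000
ltr3 = "G" + "A" * 999
k = k2p_distance(ltr5, ltr3)
t_fast = insertion_time(k, 1.3e-8)  # rate of Vitte and Bennetzen (2006)
t_slow = insertion_time(k, 5.1e-9)  # rate of Pfeil et al. (2005)
```

For this made-up pair, K is about 0.001, which corresponds to roughly 38,500 years at r = 1.3 × 10⁻⁸ and roughly 98,000 years at r = 5.1 × 10⁻⁹, showing how strongly the estimate depends on the assumed rate.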

Supplemental Data
The following materials are available in the online version of this article.
Supplemental Table S1. BAC clones screened for retroelement-related sequences.
Supplemental Table S2. Analysis of paired LTRs.
Supplemental Table S3. Numbers of intact elements, solo-LTRs, and fragmented elements.