Plant Physiol, May 2003, Vol. 132, pp. 52-63
CACTA Transposons in Triticeae. A Diverse Family of High-Copy Repetitive Elements1
Institute of Plant Biology, University of Zurich, Zollikerstrasse 107, 8008 Zurich, Switzerland
In comparison with retrotransposons, which comprise the majority of the Triticeae genomes, very few class 2 transposons have been described in these genomes. Based on the recent discovery of a local accumulation of CACTA elements at the Glu-A3 loci in the two wheat species Triticum monococcum and Triticum durum, we performed a database search for additional such elements in Triticeae spp. A combination of BLAST search and dot-plot analysis of publicly available Triticeae sequences led to the identification of 41 CACTA elements. Only seven of them encode a protein similar to known transposases, whereas the other 34 are considered to be deletion derivatives. A detailed characterization of the identified elements allowed a further classification into seven subgroups. The major subgroup, designated the "Caspar" family, was shown by hybridization to be present in at least 3,000 copies in the T. monococcum genome. The close association of numerous CACTA elements with genes and the identification of several similar elements in sorghum (Sorghum bicolor) and rice (Oryza sativa) led to the conclusion that CACTA elements contribute significantly to genome size and to organization and evolution of grass genomes.
All genomes contain repetitive
elements and in some species, such elements comprise the majority of
the nDNA. Repetitive elements can be divided into two main groups:
class 1 and class 2 elements. Class 1 elements (also called
retrotransposons) replicate via an mRNA intermediate that is reverse
transcribed into DNA and integrated somewhere else in the genome.
Retrotransposons contribute a large fraction to the total
genomic DNA of plants with large genomes such as wheat, barley
(Hordeum vulgare), or maize (Zea mays;
SanMiguel and Bennetzen, 1998).

The terminal regions of all identified CACTA elements show a similar
sequence organization. They are flanked by short terminal inverted
repeats (TIRs) of 10 to 28 bp in size that terminate in the CACTA
motif. These serve as recognition sequences for the transposase protein
(Lewin, 1997).

Diploid Triticeae spp. such as barley or Triticum monococcum
have genome sizes of more than 5,000 Mb and contain approximately 80%
repetitive DNA (Smith and Flavell, 1975).

Recent analysis of the Glu-A3 loci in diploid and tetraploid
wheat revealed the presence of 12 different CACTA transposons (Wicker
et al., 2003).

The objective of our study was to characterize the previously described CACTA elements from wheat and to identify new Triticeae elements present in the public databases. Here, we report the identification and characterization of 41 novel CACTA transposons from Triticeae. Our results indicate that this transposon class is present at a high copy number in the wheat genome and that a large number are deletion derivatives. Elements similar to the ones in Triticeae were found in rice (Oryza sativa) and sorghum, indicating that these genomes also contain a wide variety of CACTA elements.
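The terminal structure described in this introduction (TIRs of 10 to 28 bp that begin with the CACTA motif) can be illustrated with a minimal sketch. The function names are hypothetical, and exact base matching is a simplifying assumption: real TIRs are often imperfect, so this is not the search procedure used in the study.

```python
# Minimal sketch: check a candidate element for CACTA-type terminal
# inverted repeats (TIRs). Function names are illustrative only, and
# exact matching is an assumption (real TIRs tolerate mismatches).
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def tir_length(element: str, max_len: int = 28) -> int:
    """Count leading bases that pair exactly with the inverted 3' end."""
    seq = element.upper()
    rc = reverse_complement(seq)
    length = 0
    for left_base, right_base in zip(seq[:max_len], rc[:max_len]):
        if left_base != right_base:
            break
        length += 1
    return length


def looks_like_cacta(element: str, min_tir: int = 10) -> bool:
    """True if the element starts with CACTA and has a TIR of >= min_tir bp."""
    seq = element.upper()
    return seq.startswith("CACTA") and tir_length(seq) >= min_tir


# Toy element: a 12-bp TIR, an arbitrary internal region, and the inverted TIR.
left_tir = "CACTACAAGAAA"
toy_element = left_tir + "GGGCCCAAAG" + reverse_complement(left_tir)
print(tir_length(toy_element), looks_like_cacta(toy_element))
```

The 10 to 28 bp bounds come from the text; everything else, including the toy sequence, is made up for illustration.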
Identification of CACTA Transposons by BLAST Search and Dot-Plot Analysis
Because only a minority of the CACTA transposons were expected to actually encode a transposase-like protein, a first approach for the identification of new elements was based on their TR sequences. The TR regions that contain the TIRs and the sub-TRs usually have a size of 200 to 500 bp. In this study, the term "element with complete ends" was used for elements in which both TIRs contain an intact CACTA motif and are flanked by a 3-bp target site duplication. They were distinguished from elements truncated by deletions or elements with damaged ends (referred to as "truncated elements"). Ten of the 12 CACTA elements with complete ends identified on the
Glu-A3 contigs (Wicker et al., 2003) ...

It was clear that the consensus TR would not identify CACTA elements
that contain divergent TRs. Therefore, a second approach for the
identification of new elements was based on their structural similarity
rather than on sequence conservation: The subterminal direct and
inverted repeats display a specific pattern when the transposon
sequence is plotted against itself by dot plot (program DOTTER;
Sonnhammer and Durbin, 1995).
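The idea of plotting a sequence against itself to expose direct and inverted repeats can be sketched with exact word matching. This is only an illustration under stated assumptions: DOTTER scores sliding windows rather than exact words, and the function name and word size below are hypothetical choices.

```python
# Minimal sketch of a self dot plot based on exact word (k-mer) matches.
# Assumption: exact matching with a fixed word size stands in for the
# scored sliding-window comparison that a program such as DOTTER performs.
from collections import defaultdict

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def self_dot_plot(seq: str, word: int = 8):
    """Return (direct, inverted) lists of (i, j) dots with i < j.

    A direct dot means the word at i recurs at j; an inverted dot means
    the reverse complement of the word at i occurs at j.
    """
    seq = seq.upper()
    index = defaultdict(list)
    for i in range(len(seq) - word + 1):
        index[seq[i:i + word]].append(i)

    direct, inverted = [], []
    for i in range(len(seq) - word + 1):
        kmer = seq[i:i + word]
        direct.extend((i, j) for j in index.get(kmer, []) if j > i)
        rc = reverse_complement(kmer)
        inverted.extend((i, j) for j in index.get(rc, []) if j > i)
    return direct, inverted


# Toy sequence: a terminal unit, an internal spacer, and the inverted unit.
unit = "CACTACAAG"
toy = unit + "GGATCTGACC" + reverse_complement(unit)
direct_dots, inverted_dots = self_dot_plot(toy, word=6)
print(inverted_dots)
```

In such a plot, subterminal direct repeats appear as short diagonals parallel to the main diagonal and inverted repeats as anti-diagonals, which is the kind of signature the text refers to.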
In total, the database mining resulted in the identification of 16 new Triticeae CACTA transposons from genomic sequences and nine from EST sequences. None of the 16 new elements found in genomic sequences had been annotated as such. It is likely that they were not recognized because none of them encodes a transposase protein. As it was previously described for retrotransposons in Triticeae, the CACTA elements were often found as nested insertions in other class 1 or class 2 elements. Two additional elements (Jorge_TREP766 and
Caspar_TREP788) were kindly provided by Dr. Jorge Dubcovsky
(University of California, Davis) and Dr. Nils Stein (Institute
of Plant Genetics and Crop Plant Research, Gatersleben, Germany),
respectively. Together with the initial 12 elements, TAT-1
(Feuillet et al., 2001) ...
CACTA Transposons Can Be Classified Based on Their TR Sequences
Because the majority of the identified transposons have no
apparent coding capacity and vary greatly in size, we decided to base
their classification on the TR sequences, the only feature that all of
them have in common. The 14 truncated elements contain only one intact
TR each, whereas for the 26 elements with complete ends, both TRs could
be used. In total, 66 TR sequences from Triticeae transposons were used
for a multiple sequence alignment. The alignment was done with the
terminal 200 bp of the elements. A phylogenetic analysis of the
multiple sequence alignments allowed the classification of the TR
sequences into seven distinct clades (Fig. 2A). Sequence conservation between
members of different families is essentially restricted to the terminal
20 to 30 bp containing the CACTA motif. The major group containing 28 TR sequences was designated the "Caspar" family. One
exclusive feature of the Caspar family is that the TR starts
with a CACTAGT motif, whereas all others start with CACTAC(A/T). Three
additional main families were designated Balduin,
Mandrake, and TAT-1. Further similarities were
discovered between Jorge_TREP766 and the previously
described unclassified XB element (Wicker et al.,
2001).
To test this classification, a second approach was based on the similarity of TR sequences displayed by dot-plot analysis: TRs from members of the same family display the characteristic transposon signature, whereas TRs of elements from different families show no signature. The terminal 300 bp of one TR from each element were used to generate a large array, which was then compared against itself by dot plot. Examples for dot-plot alignments of three different families are displayed in Figure 2B. In this approach, the classification into seven groups as it was obtained by the multiple sequence alignment could be confirmed for all elements. The results of the two classification approaches are summarized in Table I.

The CACTA Family Comprises Full-Length Elements and a Wide Variety of Deletion Derivatives
To investigate the range of diversity in size and sequence organization among members of the CACTA family, only the 26 elements with complete ends were used. Truncated elements were excluded because it is not possible to determine their actual size and coding capacity. Seven of the 26 elements with complete ends encode a transposase protein (Table I). However, none of the seven encodes a functional protein because they all contain frameshifts or in-frame stop codons within their coding region (see below). In this study, we refer to elements that encode a transposase protein as "full-length elements," even if the coding region of the transposase protein is apparently defective. Four of the seven elements encode a second protein (which we refer to as CTG-2) in addition to the transposase. The CTG-2 coding gene was only found in the members of the Caspar family (see below). All identified full-length elements are large in size, ranging from 9.9 up to 13.1 kb.
The other 19 CACTA transposons are considered to be deletion derivatives that have lost some or all of their coding capacity and depend for their transposition on enzymes encoded elsewhere in the genome. These deletion derivatives vary drastically in size: At one end of the spectrum, there are seven SNAC transposons that encode no proteins and range in size from 750 bp to 1.5 kb. The TR regions of these seven SNAC elements have sizes of 200 to 300 bp and are separated by an internal domain. Three SNAC elements belonging to the Caspar family
(Caspar_107G22-1, Caspar_426K20-2, and
Caspar_AF325198-1) plus a fragment of a putative SNAC
element (Caspar_107G22-3) contain a 64-bp region that is
75% to 81% identical to a part of the 5S rDNA gene (120 bp) from
T. monococcum (accession no. Z11461). This region is
embedded in an approximately 400-bp region that is more strongly conserved than the rest of the elements. In the 400-bp region, all four
are 91% to 95% identical, whereas their overall sequence identity is
79% to 91%. The 5S derivative conserved in the four elements
corresponds to the internal RNA polymerase III promoter that is
involved in the recruitment of transcription factors. It includes the
highly conserved motifs BoxA, IE, and BoxC (Cloix et al.,
2000).

The 12 large deletion derivatives range in size from 3,411 bp up to 16.5 kb. Seven are members of the Caspar family, five of which encode a CTG-2 protein. All seven large Caspar deletion derivatives contain regions of tandem repeated DNA (see below). The other five deletion derivatives do not contain any sequences similar to known repetitive elements or genes. They also do not contain obvious structures like direct repeats, which would explain their large size. The largest deletion derivative identified is Jorge_AF326781-1, which has a size of 16,497 bp.

Elements of the Caspar Family Encode a Transposase and a Protein of Unknown Function
Four Caspar elements (Caspar_453N11-1, Caspar_18B1-1, Caspar_AF521177-1, and Caspar_TREP788) gave strong BLASTX hits with numerous transposase-like proteins from rice and sorghum. The coding region for the transposase is located in the 5' region of the elements. All four are likely to be nonfunctional because they all contain frameshifts or in-frame stop codons within their coding regions. However, because they show a high degree of sequence conservation within the coding region of the transposase, a multiple sequence alignment made it possible to determine at which positions frameshifts have to be introduced in an individual element to obtain a contiguous open reading frame. All four elements contain between one and three frameshifts, and Caspar_453N11-1 and Caspar_TREP788 contain one and two in-frame stop codons, respectively. Comparison with transposase proteins from public databases helped to determine the positions of the putative start and stop codons. The four deduced transposase proteins have sizes ranging from 1,044 to 1,122 amino acids and are 73% to 79% similar to one another. The coding region does not contain any introns. The four putative proteins are 68% to 74% similar to TNP2-like proteins from rice (accession no. Q9AUX7) and from sorghum (accession no.
Q9XEQ1) but only 40% to 45% similar to the transposase of En/Spm (accession no. AAA66266). The transposase genes of Caspar elements are expressed, as more than 30 ESTs from Triticeae corresponding to the transposase region were found in public databases.

Nine Caspar elements contain a coding region for a second protein we refer to as CTG-2 (Caspar transposon gene 2). BLAST search of the CTG-2 region revealed similarity to 12 hypothetical proteins from rice and one from sorghum. In contrast to the transposase, which is well conserved among the different Caspar elements, the CTG-2 protein is highly variable. Therefore, it was difficult to predict a protein sequence. Based on sequence conservation between different Caspar elements and on the similarity to the proteins identified by BLASTX, putative protein sequences of eight Caspar CTG-2 proteins were deduced. The proteins have sizes of 968 to 1,292 amino acids. In all cases, they consist of one large putative first exon, which varies strongly in size between the different copies. The differences are caused by a region that contains multiple repeats of short 3- to 30-bp units, and the number of repeat units differs in the different elements. This putative first exon is followed by five short exons (25-50 amino acids) that show a higher degree of sequence conservation. The exon/intron structure of the last five exons was determined by comparison with the amino acid sequences of the 12 hypothetical proteins from rice that were identified by BLASTX. The predicted exon/intron structure of CTG-2 is strongly conserved in all analyzed elements. Eight ESTs similar to the CTG-2 region were found in public databases, indicating that the CTG-2 proteins are also expressed. The predicted CTG-2 protein sequences show no clear homology to previously described transposon proteins.
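The in-frame stop codons noted above for the transposase coding regions can be located with a very small scan. This is a minimal sketch with a hypothetical function name; it assumes the reading frame is already known and does not attempt to detect frameshifts, which in the study were placed with the help of a multiple sequence alignment.

```python
# Minimal sketch: report premature in-frame stop codons in a coding sequence.
# Assumptions: the reading frame is given, the standard stop codons apply,
# and the final codon is ignored because a stop is expected there.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def in_frame_stops(cds: str, frame: int = 0) -> list:
    """Return 0-based start positions of stop codons before the last codon."""
    seq = cds.upper()
    codon_starts = list(range(frame, len(seq) - 2, 3))
    premature = []
    for start in codon_starts[:-1]:  # skip the terminal codon
        if seq[start:start + 3] in STOP_CODONS:
            premature.append(start)
    return premature


# Toy coding sequences (made up): one with a premature TAG, one intact.
print(in_frame_stops("ATGAAATAGCCCTAA"))
print(in_frame_stops("ATGAAACCCTAA"))
```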
A weak similarity to previously described proteins could be shown if sequences were aligned with the GCG program BESTFIT (Genetics Computer Group, Madison, WI), and gap creation and gap extension penalties were decreased to 4 and 1, respectively. Using these parameters, all CTG-2 proteins are between 42% and 50% similar over most of their length to the TNP1 protein of Tam-1 (accession no. CAA40554) and TNPA of En/Spm (accession no. AAG17044). However, the sequence alignments contain a large number of gaps; therefore, one can only speculate that the CTG-2 protein may represent a highly diverged homolog to TNP1 and TNPA.

CACTA Elements Contain Large Amounts of Low-Complexity DNA
Dot-plot analysis of the identified transposons revealed that several elements contain patterns of tandem repeats of variable length and sequence. The repeated sequence units range in size from short units of 2 to 30 bp up to units of 380 bp. A selection of 13 CACTA elements with complete ends that contain multiple different repeat structures was chosen for further analysis (Fig. 3). Eleven of them are members of the Caspar family, and the two others are Balduin_453N11-1 and Isaac_107G22-1. SNAC transposons, the large deletion derivatives Jorge_TREP766, Jorge_AF326781-1 and Enac_453N11-1, and truncated elements were excluded because they do not contain comparable repeat patterns.
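Tandem repeats of short units, as described above, can be found with a simple greedy scan. The following is a minimal sketch under stated assumptions: it requires exact copies, uses hypothetical parameter names, and is not the dot-plot procedure used in the study. Larger or diverged units (for example repeats of several hundred base pairs) would need a higher `max_unit` and a tolerant comparison.

```python
# Minimal sketch: greedy detection of exact tandem repeats.
# Assumptions: copies are identical, and unit sizes between min_unit and
# max_unit are tried at every position (simple, not optimized for speed).
def find_tandem_repeats(seq: str, min_unit: int = 2, max_unit: int = 30,
                        min_copies: int = 3) -> list:
    """Return (start, unit, copies) tuples for tandem arrays in seq."""
    seq = seq.upper()
    hits = []
    i = 0
    while i < len(seq):
        best = None
        for unit_len in range(min_unit, max_unit + 1):
            unit = seq[i:i + unit_len]
            if len(unit) < unit_len:
                break  # ran off the end of the sequence
            copies = 1
            while seq[i + copies * unit_len:i + (copies + 1) * unit_len] == unit:
                copies += 1
            span = copies * unit_len
            # Prefer the longest array; ties keep the shorter unit.
            if copies >= min_copies and (best is None or span > best[2] * len(best[1])):
                best = (i, unit, copies)
        if best is not None:
            hits.append(best)
            i += len(best[1]) * best[2]  # continue after the array
        else:
            i += 1
    return hits


# Toy sequence (made up): five copies of CAG between unique flanks.
print(find_tandem_repeats("GGATC" + "CAG" * 5 + "TTGAC"))
```

Comparing the `copies` value between related elements is one way to express the differences in repeat number that the text describes for closely related copies.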
The repeat regions in Balduin_453N11 and
Isaac_107G22 showed no similarity to each other or to the
ones from the Caspar family, whereas nine of the 11 Caspar elements share common repeat units. A surprising
finding was that eight Caspar elements contain the previously described Afa repeats (Rayburn and Gill,
1986).

The tandem repeats within CACTA elements obviously can undergo rapid changes in copy number: Four Caspar elements from barley (Caspar_AF427791-1, Caspar_AF474373-1, Caspar_AF474373-2, and Caspar_AF474072-1) appear to be very closely related because they are approximately 92% to 95% identical on the DNA level. However, the most striking difference between them is the number of direct repeats (Fig. 3). Caspar_AF427791-1, for example, contains three copies of TM-1, nine Afa repeats, and five copies of TM-2, whereas Caspar_AF474373-1 contains four TM-1 units, four Afa units, and 16 TM-2 units. In contrast, Caspar_AF474373-2 contains only four TM-1 repeats but neither Afa nor TM-2 repeats (Fig. 3).

The Caspar Family Is Present at a High-Copy Number in the Wheat Genome
The fact that the transposons of the Caspar family were
found in several copies in the publicly available sequences suggested that these elements may occur very frequently in Triticeae genomes. To
estimate the copy number of the Caspar transposons, one
high-density filter (Filter C) from the T. monococcum BAC
library (Lijavetzky et al., 1999) ...
Caspar-Like Elements Are Also Frequently Found in Other Grass Genomes
The apparently high copy number of Caspar elements in Triticeae genomes inspired the search for similar elements in other grass genomes. Three BACs from rice and one from sorghum encoding the proteins that gave the strongest BLASTX hits with CTG-2 from Caspar were screened for the presence of transposon-like sequences. In all four cases, an annotated transposase protein was found upstream of the protein that gave the BLASTX hit with CTG-2, but transposase and CTG-2 were not annotated as belonging to the same element. In all four cases, CTG-2 was annotated as a putative gene. The predicted exon/intron structure as it was annotated in the publicly available sequences differed slightly from our prediction of the structure of CTG-2. However, comparison with our predicted proteins from the Triticeae elements showed that the same exon/intron structure also can be found in the elements from rice and sorghum, although the proteins from the different species were only about 46% to 50% similar to one another. Two proteins from rice BACs AP002484 and AP003020 and one from sorghum BAC AF114171 were deduced by applying our predicted exon/intron structure and used as query sequences for a TBLASTN search. The number of hits was striking: CTG-2_AP002484 and CTG-2_AP003020 gave 218 and 214 hits in rice, respectively, with E values below 3E-4. CTG-2_AF114171 identified five putative CTG-2 proteins in sorghum (E value = 0.0). Using dot plot, the actual borders of the elements on the rice and sorghum BACs were identified, and four Caspar-like elements with complete ends could be characterized. In addition, the four BAC clones were searched for further transposon signatures by dot plot, which led to the identification of two additional SNAC transposons (one from rice BAC AP002484 and one from sorghum BAC AF114171), neither of which was annotated.
The positions of the elements on their respective BAC clones are shown in Table II.
All sequences identified in this way were used for a next round of BLASTN search against the National Center for Biotechnology Information nonredundant database to obtain a rough estimate of the abundance of these elements in the rice and sorghum genomes. This search revealed the presence of a very high number of similar elements in the genomes of rice and sorghum, ranging from 493 hits for SNAC_AP002484-1 up to 824 hits for the CACTA element from rice BAC AP003020 that contains both a transposase and CTG-2. E values for all these BLASTN hits were below 3E-4. The CACTA element from sorghum BAC AF114171 identified four elements in sorghum (E value = 0.0).

Because the focus of this study was not a complete survey of rice CACTA elements but to study their structure and sequence organization, we focused our attention on the isolation of a small number of elements with complete ends. The result of the database mining was a set of 18 CACTA elements from rice and six elements from sorghum. The precise location of all identified rice and sorghum elements on their source sequences is provided as supplemental material (Table III). Interestingly, only one additional element that encodes proteins was identified, and all others were SNAC transposons. None of the SNAC transposons had been annotated as such. These data suggest that the rice genome might contain a very large number of yet undiscovered CACTA elements and that the majority of them might be small nonautonomous elements. A very interesting finding in this context is SNAC_AP003446-1 from rice, which at 274 bp is the smallest element identified in this study (Table II). It is the only element that does not contain an internal domain but consists exclusively of terminal and sub-TR sequences.
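Counting hits below an E-value cutoff, as in the hit numbers reported above, can be sketched as follows. This is an illustration only: it assumes the 12-column tab-separated layout produced by current BLAST+ tabular output (query ID in the first column, E value in the eleventh), which is not necessarily the output format used in the study, and the function name and example rows are hypothetical.

```python
# Minimal sketch: count BLAST hits per query below an E-value cutoff.
# Assumption: input is 12-column tab-separated BLAST output
# (qseqid, sseqid, pident, length, mismatch, gapopen,
#  qstart, qend, sstart, send, evalue, bitscore).
def count_hits(tabular_text: str, max_evalue: float = 3e-4) -> dict:
    """Return {query_id: number of hits with E value below max_evalue}."""
    counts = {}
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        query_id, evalue = fields[0], float(fields[10])
        if evalue < max_evalue:
            counts[query_id] = counts.get(query_id, 0) + 1
    return counts


# Made-up rows for illustration; they are not results from the study.
example = "\n".join([
    "query_A\tsubj1\t90.0\t100\t10\t0\t1\t100\t1\t100\t2e-50\t200",
    "query_A\tsubj2\t80.0\t50\t10\t0\t1\t50\t1\t50\t0.5\t30",
    "query_B\tsubj3\t99.0\t300\t3\t0\t1\t300\t1\t300\t0.0\t550",
])
print(count_hits(example))
```

Hit counts of this kind give only a rough abundance estimate, since several hits can fall within one element and redundant database entries are counted more than once.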
Why Were the CACTA Elements in Triticeae Not Discovered Earlier?
The high density of CACTA elements observed at the
Glu-A3 loci from T. monococcum and T. durum was a fortunate constellation (Wicker et al., 2003).

CACTA Sequences in Grass Genomes Are Mainly Deletion Derivatives
All identified CACTA elements appear to be defective or
nonautonomous because they either lack sufficient coding capacity, or
their coding sequences are interrupted by frameshifts or in-frame stop
codons. For En/Spm and Ac/Ds elements from maize,
it was shown that numerous deletion derivatives exist that are only
able to transpose in the presence of a functional element (for review, see Gierl and Saedler, 1989).

However, during the evolution of nonautonomous elements, there was
obviously no selection pressure that would favor smaller sized
elements, as is illustrated by the numerous large elements such as
Jorge_AF326781-1. An even more impressive example is the 23-kb Candystripe1 transposon from sorghum. This CACTA
element was shown to be active in sorghum, although it is also
considered to be nonautonomous (Chopra et al., 1999).

The Presence of Afa Repeats in Caspar Elements Explains Some of the Features of These Repeats But Also Raises New Questions
Because Afa repeats were found in several members of
the Caspar family but never isolated outside of
Caspar elements, we conclude that all Afa repeats
are actually components of such transposons. This "transposon
hypothesis" explains three properties of this repeat family as they
were described by Nagaki et al. (1998a).

The presence of Afa and other repeat structures such as
TM-1, TM-2, and the extensive regions comprising
short sequence repeats raises new questions. First, the amplification
mechanism is still obscure. Template slippage during DNA replication or
unequal crossing over can explain the rapid change in copy number, but
it does not explain why only some conserved repeat sequences are
amplified. A rolling circle amplification, as was suggested by
Nagaki et al. (1998a) ...

Despite these open questions, the mere knowledge that tandem repeats
are often found within transposons might be important for future
analysis of genomic regions. The presence of such arrays can be an
indication for the presence of a novel diverged transposon family that
could not be detected otherwise. In addition, it is possible that in
future studies, tandem repeats from other species such as saccharum
CENtromeric sequence repeats from sugarcane (Saccharum officinarum; Nagaki et al.,
1998b) ...

The Contribution of CACTA Elements to Genome Evolution
The function and possible benefit of repetitive elements for the
"host" plant is a hotly debated question. MITEs, for example, are
often found in close association with genes, and they are believed to
contribute regulatory sequences that may alter gene expression
(Zhang et al., 2000).

The finding that the four Caspar SNAC elements contain
sequences similar to 5S rDNA genes is intriguing. The fact that the region that contains the 5S derivative is more conserved among the four
elements than the rest of the elements suggests that a selection
pressure has been acting on these sequences. It is possible that these
sequences have been acquired by a CACTA element during evolution and
that they have gained a function that was beneficial for the plant,
eventually leading to their fixation within the genome. Acquisition of
fragments of cellular genes by CACTA elements has been reported before
(Takahashi et al., 1999).

Concluding Remarks
Repetitive DNA, which is still often referred to as "junk DNA," is rarely the focus of a detailed analysis. Our results demonstrate the importance of detailed characterization of repetitive elements and of mining public databases. Because of their high amount of repetitive DNA, genomic sequences from Triticeae are an essential resource for the identification of novel repetitive elements. The information gained about these elements then can be used for a targeted search for similar elements in other plant genomes. This was demonstrated by the discovery of the rice SNAC transposons, which were not annotated in the publicly available rice sequences. Another important result of our study is the finding that the CTG-2 protein is actually a part of the Caspar transposon. This information suggests that numerous sequences that were interpreted as genes could actually belong to repetitive elements. This has an important implication for future estimates of the total gene content of entire genomes and also for the calculation of local gene densities in large genome plants such as wheat or maize. Finally, the identification of novel CACTA elements could eventually lead to the discovery of active wheat transposons that could be used for transposon-tagging systems similar to those based on En/Spm and Ac/Ds elements.
Southern Hybridization of High-Density BAC Filters
Two copies of Filter C from the Triticum
monococcum BAC library (Lijavetzky et al., 1999) ...

Database Mining and Sequence Analysis
Public databases and the database for Triticeae repetitive
elements (TREP, http://wheat.pw.usda.gov/ITMI/Repeats) were screened with the BLASTN and BLASTX algorithms (Altschul et al.,
1997).

Distribution of Materials
Upon request, all novel materials described in this publication will be made available in a timely manner for noncommercial research purposes, subject to the requisite permission from any third party owners of all or parts of the material. Obtaining any permissions will be the responsibility of the requestor.
The authors would like to thank Dr. Jorge Dubcovsky (University of California, Davis) and Dr. Nils Stein (Genomanalyse im biologischen System Pflanze grant no. 0312280A, Bundesministerium für Bildung und Forschung, Berlin, Germany) for making their unpublished transposon sequences available for our study. We are also grateful to Dr. Catherine Feuillet (Institute of Plant Biology, University of Zurich, Switzerland) and Clair Wicker for critical reading of the manuscript.
Received October 4, 2002; returned for revision November 30, 2002; accepted January 30, 2003.
1 This work was supported by the Swiss National Science Foundation (grant no. 31-65114.01).
* Corresponding author; e-mail bkeller{at}botinst.unizh.ch; fax 41-1-634-82-04.
Article, publication date, and citation information can be found at www.plantphysiol.org/cgi/doi/10.1104/pp.102.015743.