First published online December 31, 2008; 10.1104/pp.108.131656 Plant Physiology 149:1316-1324 (2009) © 2009 American Society of Plant Biologists OPEN ACCESS ARTICLE
Highly Diversified Molecular Evolution of Downstream Transcription Start Sites in Rice and Arabidopsis1,[W],[OA]
Division of Genome and Biodiversity Research, National Institute of Agrobiological Sciences, Tsukuba, Ibaraki 305–8602, Japan (T.T., T.I.); and Graduate School of Information Science and Technology, Hokkaido University, Sapporo, Hokkaido 060–0814, Japan (K.O.K.)
Alternative usage of transcription start sites (TSSs) is one of the key mechanisms to generate gene variation in eukaryotes. Here, we show diversified molecular evolution of TSSs in remotely related flowering plants, rice (Oryza sativa) and Arabidopsis (Arabidopsis thaliana), by comprehensive analyses of large collections of full-length cDNAs and genome sequences. We determined 45,917 representative TSSs within 23,445 loci of rice and 35,313 TSSs within 16,964 loci of Arabidopsis, about two TSSs per locus in either species. The nucleotide features around TSSs displayed distinct patterns when the most upstream TSSs were compared with downstream TSSs. We found that CG-skew and AT-skew were clearly different between upstream and downstream TSSs, and that this difference was commonly observed in rice and Arabidopsis. Relative entropy analysis revealed that the most upstream TSSs had retained canonical cis elements, whereas downstream TSSs showed atypical nucleotide features. Expression patterns were distinguishable between upstream and downstream TSSs. These results indicate that plant TSSs were generally diversified in downstream regions, resulting in the development of new gene expression patterns. Furthermore, our comparative analysis of TSS variation between the species showed a positive correlation between TSS number and gene conservation. Rice and Arabidopsis might have evolved novel TSSs in an independent manner, which led to diversification of these two species.
While a complete genome sequence enables us to estimate the total number of genes in an organism, transcriptional activities of genes can be verified by either tiling array analyses or mapping ESTs and cDNAs onto a genome (Suzuki et al., 2001a
Past studies of TSS variation have focused mainly on mammals and fungi. For example, in humans, millions of 5'-end sequences of full-length cDNAs (FLcDNAs) were used to determine 269,774 TSSs, from which 30,964 TSS clusters of 14,628 genes were obtained (Kimura et al., 2006
Recent progress in genome and transcriptome sequencing of rice and Arabidopsis gives us an opportunity to investigate TSS variation in higher plants. RIKEN and Ceres have released over 200,000 5'-end sequences obtained from Arabidopsis FLcDNA clones (Seki et al., 2002 In this study, we identified TSSs comprehensively by mapping transcripts to the genomes of rice and Arabidopsis and compared nucleotide signals around the TSSs. To our knowledge, this is the first attempt at a large-scale TSS comparison between higher plants based on FLcDNA sequences. We also conducted a comparative analysis of TSS variation and gene conservation to elucidate how TSSs and genes have evolved during plant evolution.
Identification and Clustering of TSSs
We mapped rice FLcDNAs and their 5'-end sequences onto the rice genome to identify TSSs by a previously described method (The Rice Annotation Project, 2007
In addition to alternative TSSs, it is known that a TSS can fluctuate in animal and fungus genomes for biological reasons (Carninci et al., 2006
To date, several criteria have been used for TSS clustering. For example, to analyze Cap Analysis of Gene Expression data, tags of 20- or 21-bp overlaps were clustered (Carninci et al., 2006 Although hundreds of thousands of sequences were evaluated, the number of transcripts might not be saturated and there might be TSS variants missing in the cDNA libraries used. In fact, 50% and 58% of the clusters were determined by only one transcript in rice and Arabidopsis, respectively. This result suggests that more TSSs might be detected if we further collect cDNA clones. Thus, our estimates of the numbers of TSS clusters should be taken as lower limits, and there is likely to be more variation of TSSs than observed in this study.
As described in "Materials and Methods," we defined a representative TSS in each TSS cluster for further analyses. We refer to these representatives simply as TSSs, unless otherwise noted. When there are two or more TSSs in a locus, cis elements of a downstream TSS may overlap with a transcribed region of an upstream TSS. In fact, 18.9% of downstream TSSs of rice were located within the protein-coding regions of transcripts initiated from upstream TSSs and resulted in truncated open reading frames (ORFs). The nucleotide compositions around the downstream TSSs might be distinct from those of the upstream TSSs, because the functional constraints of the transcribed regions of the upstream TSSs should create nucleotide biases. To assess this possibility, we separated the TSS data sets into the most upstream TSSs and the remaining downstream TSSs and analyzed their nucleotide features, such as CG-skew and AT-skew. When there was one TSS in a locus, it was included in the upstream TSS data set.
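The split into the most upstream and the remaining downstream TSSs can be sketched as follows. This is a minimal illustration, not the authors' code: the `(locus_id, strand, position)` record layout and the function name are assumptions, and every TSS in a locus is assumed to lie on the same strand.

```python
from collections import defaultdict


def split_upstream_downstream(tss_records):
    """Split TSSs into the most upstream TSS per locus and the rest.

    tss_records: iterable of (locus_id, strand, position) tuples
    (a hypothetical layout chosen for this sketch).
    """
    by_locus = defaultdict(list)
    for locus_id, strand, position in tss_records:
        by_locus[locus_id].append((locus_id, strand, position))

    upstream, downstream = [], []
    for records in by_locus.values():
        strand = records[0][1]
        # On the minus strand, "upstream" means a larger genomic coordinate.
        ordered = sorted(records, key=lambda r: r[2], reverse=(strand == "-"))
        upstream.append(ordered[0])  # loci with a single TSS land here too
        downstream.extend(ordered[1:])
    return upstream, downstream
```

Loci with a single TSS contribute only to the upstream set, matching the rule stated above.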
First, we observed a strong peak of CG-skew in the upstream TSS data sets, whereas the downstream TSSs showed considerably reduced CG-skew in the two plants (Fig. 1, A and B). In downstream TSSs, the peaks were weakened and shifted slightly upstream. Second, we investigated AT-skew around TSSs. Our data clearly showed that the AT-skew was significantly biased around the TSSs, with a similar tendency in rice and Arabidopsis (Fig. 1, C and D). In addition, the overall distributions of the AT-skew were quite different between the upstream and downstream TSSs, in a similar manner in both species. It seems that AT-skew is a much clearer indicator of the nucleotide signals around TSSs than CG-skew. Third, in both upstream and downstream TSSs, similar patterns of GC content were observed (Fig. 2). Clear peaks and drastic changes around TSSs were seen in all cases. However, TATA-like signals around –35 to –25 bp upstream from the TSSs were significantly diminished in downstream TSSs. Last, the relative entropy display of the nucleotide compositions clearly showed strong signals of TATA-like motifs around the –35 to –25 bp regions of upstream TSSs, but these signals disappeared in downstream TSSs (Fig. 3). This analysis was also conducted with single-TSS loci excluded, and we obtained essentially the same results (Supplemental Fig. S4). We found a skewed occurrence of C at –1 bp and of A/G at TSSs, both of which were weaker in downstream TSSs. In previous reports, the TATA box of rice and Arabidopsis was frequently detected in conjunction with the Y patch motif, which is a stretch of C/Ts and is located –100 to –1 bp upstream from the TSS (Yamamoto et al., 2007a
These comparisons of the nucleotide signals around TSSs between the upstream and downstream data sets suggested that in rice and Arabidopsis downstream transcription might be differently regulated from upstream transcription, which had canonical cis elements such as the TATA box. Previous studies have defined TSSs as being distinct if they were separated by over 500 bp, so that the downstream transcription signals do not heavily overlap with the upstream transcripts (Kimura et al., 2006
It is expected that use of alternative TSSs is related to gene expression patterns in differentiated tissues or in response to specific conditions (Macknight et al., 2002
TSS Diversity Correlated with Protein Sequence Evolution
To elucidate the evolutionary significance of TSS diversification, we used two approaches to analyze the relationship between the numbers of TSSs per locus and the evolutionary conservation of protein sequences. We determined orthologs between rice and Arabidopsis by reciprocal best hits of BLASTP searches and calculated protein identity between the orthologs. We found a positive correlation between the number of rice TSSs and protein identity (Fig. 5), and a similar tendency was observed in Arabidopsis TSSs (Supplemental Fig. S7). These results suggest that a locus encoding evolutionarily conserved proteins had acquired more TSSs than one encoding diverged proteins.
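The reciprocal-best-hit step can be illustrated with a short sketch. It assumes the BLASTP results have already been reduced to `(query, subject, score)` triples and that hits are ranked by a single score such as the bit score. The ranking criterion, the tie handling (first hit seen wins), and the function names are assumptions of this example, not details given in the article.

```python
def best_hits(hits):
    """Return {query: subject} keeping the highest-scoring subject per query.

    hits: iterable of (query, subject, score) triples (hypothetical layout).
    Ties keep the first hit encountered, which is an assumption.
    """
    best = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def reciprocal_best_hits(a_vs_b, b_vs_a):
    """Pairs (a, b) where b is the best hit of a and a is the best hit of b."""
    forward = best_hits(a_vs_b)
    backward = best_hits(b_vs_a)
    return sorted((a, b) for a, b in forward.items() if backward.get(b) == a)
```

A protein whose best hit points back to a different partner is dropped, so each protein appears in at most one ortholog pair.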
Next, we searched the UniProtKB database for homologous sequences of the rice and Arabidopsis proteins and classified them into four groups by their level of conservation. The ratio of conserved protein groups increased as the number of TSSs per locus grew (Supplemental Fig. S8). However, if the cDNA collection was insufficient, the number of TSSs of poorly expressed genes might be underestimated. To exclude the possibility of insufficient sampling of cDNAs, we used TSSs that were supported by five or more transcripts and confirmed that the same tendency was observed (data not shown). Therefore, highly variable TSSs seemed to be prevalent in conserved protein-coding genes of either rice or Arabidopsis.
To cluster TSSs that fluctuated for biological or experimental reasons, we used a threshold interval of 21 bp for rice and 27 bp for Arabidopsis. Since the resultant average sizes of the TSS clusters were much smaller, 4.2 bp in rice and 8.7 bp in Arabidopsis, than those initial intervals, it seems that fluctuating TSSs were clustered effectively and that there was little excessive clustering. The longer average size of Arabidopsis TSS clusters may be due to experimental errors, as we observed more discrepancies in Arabidopsis sequences obtained from the same clone compared with those of rice (Table II).
As each locus had on average two TSS clusters in either species, this TSS variation should have contributed significantly to these species. Indeed, TSS variants of several genes are known to be responsible for different expression patterns (Landry et al., 2003
In addition to CG-skew, which characterizes plant and yeast TSSs (Fujimori et al., 2005
One reason for the weak signals of downstream TSSs appears to be overlap with upstream protein-coding regions. Since protein-coding regions are under functional constraints, the nucleotide compositions and genomic positions of cis elements will be affected. For example, the TATA box frequently contains TAA, which is a stop codon and may prematurely terminate translation. The median distances between TSSs were relatively small, 149 and 184 bp in rice and Arabidopsis, respectively, so it was possible that the signals overlapping the upstream protein-coding region remained generally weak. However, even when we used only downstream TSSs separated from upstream TSSs by more than 500 bp, the signals were almost identical to those of all downstream TSSs (data not shown). Therefore, we concluded that the distinct signals of the downstream TSSs were not necessarily due to upstream coding regions but are intrinsic to the nature of the downstream TSSs. We should note that the downstream TSSs might produce a truncated protein whose function is impaired or lost. Thus, regulation by alternative TSS usage may be achieved in a loss-of-function manner, which is suggested to be of evolutionary importance (Oda et al., 2002
It is intuitively plausible that, if protein sequences are highly diverged because of relaxed functional constraint, regulation of their expression becomes concordantly variable. However, our analyses revealed that the number of TSSs increased proportionally to the sequence conservation in both rice and Arabidopsis. Although it was expected that the gene function affected the number of TSSs, our functional categorization of the proteins by the Gene Ontology hierarchy showed no significant correlation between gene function and the number of TSSs (Supplemental Fig. S9). Since highly conserved proteins generally play essential roles, are used in a variety of tissues, and are regulated by complex processes, elaborate transcriptional regulation to control several TSSs might be required. Intriguingly, the TSSs that we identified were not necessarily conserved between rice and Arabidopsis. As shown in Figure 4, B and C, both rice and Arabidopsis used the same upstream TSSs in this locus, whereas the downstream TSSs obviously differed between these orthologs. Likewise, in humans and mice, merely one-fourth of the promoter regions between orthologs were conserved (Wasserman et al., 2000
We determined TSSs in rice and Arabidopsis by large-scale computation and found that both species have, on average, about two TSSs per locus. The nucleotide signals around TSSs were similar in these two plants, while they were quite different between the upstream and downstream TSSs. A positive correlation between TSS numbers and gene conservation was also observed. This study provides insight into the diversified transcriptional variation that is likely to have contributed to the evolution of plant species.
Genome and cDNA Sequences
We used FLcDNAs and their 5'-end sequences for TSS determination (Supplemental Table S2). The FLcDNAs and 5'-end sequences of rice (Oryza sativa; Kikuchi et al., 2003
Positions of transcripts on the genome sequences were determined by methods described previously (The Rice Annotation Project, 2007
We clustered 5'-end positions that fluctuated for biological or for experimental reasons. To determine an appropriate threshold for the distance between 5'-end positions to be clustered, the relationship between the distance and the total number of clusters was examined (Supplemental Fig. S10). The cluster number decreased gradually and monotonically as the distance increased. We adopted the threshold distance at which the rate of decrease in the total number of clusters was less than 1%: 21 bp for rice and 27 bp for Arabidopsis. Juxtaposed 5'-end positions within the threshold distance were clustered by the single-linkage clustering method. In each cluster, a single representative TSS was selected in the following order: (1) supported by a full-length sequence, (2) supported by the most clones, and (3) the most upstream 5'-TSS.
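The clustering and representative selection described above can be sketched as follows. This is an illustrative reading rather than the authors' implementation. The record layout (dicts with `pos`, `clones`, `full_length`) and all function names are hypothetical; positions are assumed to lie on one strand of one locus; "within the threshold distance" is taken to mean a gap less than or equal to the threshold; and the threshold scan encodes only one possible interpretation of the 1% rule.

```python
def cluster_tss(records, threshold):
    """1-D single-linkage clustering: neighbouring 5' ends whose gap is
    <= threshold (assumed inclusive) end up in the same cluster."""
    clusters = []
    for rec in sorted(records, key=lambda r: r["pos"]):
        if clusters and rec["pos"] - clusters[-1][-1]["pos"] <= threshold:
            clusters[-1].append(rec)
        else:
            clusters.append([rec])
    return clusters


def pick_representative(cluster, strand="+"):
    """Apply the stated order: full-length support, then most clones,
    then the most upstream position."""
    sign = 1 if strand == "+" else -1  # upstream = smaller coordinate on "+"
    return min(
        cluster,
        key=lambda t: (not t["full_length"], -t["clones"], sign * t["pos"]),
    )


def count_clusters(positions, threshold):
    """Number of single-linkage clusters for a list of positions."""
    ordered = sorted(positions)
    return sum(
        1 for i, p in enumerate(ordered) if i == 0 or p - ordered[i - 1] > threshold
    )


def choose_threshold(positions, max_distance=100, cutoff=0.01):
    """Smallest distance at which the cluster count drops by less than
    `cutoff` relative to the previous distance (one assumed reading)."""
    previous = count_clusters(positions, 0)
    if previous == 0:
        return None
    for distance in range(1, max_distance + 1):
        current = count_clusters(positions, distance)
        if (previous - current) / previous < cutoff:
            return distance
        previous = current
    return None
```

In one dimension, single-linkage clustering reduces to cutting the sorted positions wherever the gap between neighbours exceeds the threshold, which is why a single pass suffices.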
We extracted genomic sequences that spanned the –250 to +350 bp region around each TSS. When an ambiguous nucleotide denoted by N existed in a sequence file, the sequence was eliminated from our analysis. CG-skew values [= (C – G)/(C + G)] were computed in a sliding window of 100 bp with 1-bp steps, where C stands for the total number of cytosines in the window and G stands for the total number of guanines. The position of a window in Figure 1 is represented by the 51st nucleotide of the window. Likewise, AT-skew values were calculated for adenines (A) and thymines (T).
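The sliding-window skew calculation can be written compactly. In the sketch below, AT-skew is taken to be (A – T)/(A + T) by analogy with the CG-skew formula, and a window containing neither base is assigned 0.0. Both choices, and the function name, are assumptions of this example rather than details stated in the text.

```python
def skew_profile(seq, plus="C", minus="G", window=100):
    """Skew (plus - minus) / (plus + minus) in a sliding window, 1-bp steps.

    values[i] belongs to the window starting at 0-based index i, whose
    51st nucleotide sits at index i + 50 when window == 100.
    """
    seq = seq.upper()
    values = []
    for start in range(len(seq) - window + 1):
        win = seq[start:start + window]
        p, m = win.count(plus), win.count(minus)
        # Assumption: report 0.0 when the window has neither base.
        values.append((p - m) / (p + m) if p + m else 0.0)
    return values
```

Calling `skew_profile(seq)` gives the CG-skew profile, and `skew_profile(seq, plus="A", minus="T")` gives the AT-skew profile for the same sequence.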
We represented nucleotide biases by relative entropy, modifying a previously reported method (Schneider and Stephens, 1990
The rice protein set we used was compared with the Arabidopsis protein set of TAIR. Homologs and orthologs were determined, as described elsewhere (The Rice Annotation Project, 2007
The following materials are available in the online version of this article.
We thank H. Numa and H. Sakai for their suggestions; S. Kikuchi, M. Seki, and T. Sakurai for providing information about FLcDNA clones; the Rice Annotation Project members for rice genome annotation data; and Y.Y. Yamamoto for helpful discussions.
Received October 26, 2008; accepted December 21, 2008; published December 31, 2008.
1 This work was supported by the Ministry of Agriculture, Forestry, and Fisheries of Japan (Integrated Research Project for Plant, Insect, and Animal Using Genome Technology grant no. GD–1002 to T.T., T.I., and K.O.K. and Genomics for Agricultural Innovation grant no. GIR–1001 to T.T. and T.I.). The author responsible for distribution of materials integral to the findings presented in this article in accordance with the policy described in the Instructions for Authors (www.plantphysiol.org) is: Takeshi Itoh (taitoh{at}affrc.go.jp).
[W] The online version of this article contains Web-only data.
[OA] Open Access articles can be viewed online without a subscription.
www.plantphysiol.org/cgi/doi/10.1104/pp.108.131656
* Corresponding author; e-mail taitoh{at}affrc.go.jp.
Alexandrov NN, Troukhan ME, Brover VV, Tatarinova T, Flavell RB, Feldmann KA (2006) Features of Arabidopsis genes and genome discovered using full-length cDNAs. Plant Mol Biol 60: 69–85
Blencowe BJ (2006) Alternative splicing: new insights from global analyses. Cell 126: 37–47
Brett D, Pospisil H, Valcarcel J, Reich J, Bork P (2002) Alternative splicing and genome complexity. Nat Genet 30: 29–30
Carninci P, Sandelin A, Lenhard B, Katayama S, Shimokawa K, Ponjavic J, Semple CA, Taylor MS, Engstrom PG, Frith MC, et al (2006) Genome-wide analysis of mammalian promoter architecture and evolution. Nat Genet 38: 626–635
Chen FC, Wang SS, Chaw SM, Huang YT, Chuang TJ (2007) Plant Gene and Alternatively Spliced Variant Annotator: a plant genome annotation pipeline for rice gene and alternatively spliced variant identification with cross-species expressed sequence tag conservation from seven plant species. Plant Physiol 143: 1086–1095
Crooks GE, Hon G, Chandonia JM, Brenner SE (2004) WebLogo: a sequence logo generator. Genome Res 14: 1188–1190
Frith MC, Ponjavic J, Fredman D, Kai C, Kawai J, Carninci P, Hayashizaki Y, Sandelin A (2006) Evolutionary turnover of mammalian transcription start sites. Genome Res 16: 713–722
Fujimori S, Washio T, Tomita M (2005) GC-compositional strand bias around transcription start sites in plants and fungi. BMC Genomics 6: 26
Halasz G, van Batenburg MF, Perusse J, Hua S, Lu XJ, White KP, Bussemaker HJ (2006) Detecting transcriptionally active regions using genomic tiling arrays. Genome Biol 7: R59
Iida K, Go M (2006) Survey of conserved alternative splicing events of mRNAs encoding SR proteins in land plants. Mol Biol Evol 23: 1085–1094
Kikuchi S, Satoh K, Nagata T, Kawagashira N, Doi K, Kishimoto N, Yazaki J, Ishikawa M, Yamada H, Ooka H, et al (2003) Collection, mapping, and annotation of over 28,000 cDNA clones from japonica rice. Science 301: 376–379
Kim N, Shin S, Lee S (2005) ECgene: genome-based EST clustering and gene modeling for alternative splicing. Genome Res 15: 566–576
Kimura K, Wakamatsu A, Suzuki Y, Ota T, Nishikawa T, Yamashita R, Yamamoto J, Sekine M, Tsuritani K, Wakaguri H, et al (2006) Diversification of transcriptional modulation: large-scale identification and characterization of putative alternative promoters of human genes. Genome Res 16: 55–65
Landry JR, Mager DL, Wilhelm BT (2003) Complex controls: the role of alternative promoters in mammalian genomes. Trends Genet 19: 640–648
Lee JR, Jang HH, Park JH, Jung JH, Lee SS, Park SK, Chi YH, Moon JC, Lee YM, Kim SY, et al (2006) Cloning of two splice variants of the rice PTS1 receptor, OsPex5pL and OsPex5pS, and their functional characterization using pex5-deficient yeast and Arabidopsis. Plant J 47: 457–466
Li L, Wang X, Sasidharan R, Stolc V, Deng W, He H, Korbel J, Chen X, Tongprasit W, Ronald P, et al (2007) Global identification and characterization of transcriptionally active regions in the rice genome. PLoS One 2: e294
Macknight R, Duroux M, Laurie R, Dijkwel P, Simpson G, Dean C (2002) Functional significance of the alternative transcript processing of the Arabidopsis floral promoter FCA. Plant Cell 14: 877–888
Miura F, Kawaguchi N, Sese J, Toyoda A, Hattori M, Morishita S, Ito T (2006) A large-scale full-length cDNA analysis to explore the budding yeast transcriptome. Proc Natl Acad Sci USA 103: 17846–17851
Oda M, Satta Y, Takenaka O, Takahata N (2002) Loss of urate oxidase activity in hominoids and its evolutionary implications. Mol Biol Evol 19: 640–653
Rhee SY, Beavis W, Berardini TZ, Chen G, Dixon D, Doyle A, Garcia-Hernandez M, Huala E, Lander G, Montoya M, et al (2003) The Arabidopsis Information Resource (TAIR): a model organism database providing a centralized, curated gateway to Arabidopsis biology, research materials and community. Nucleic Acids Res 31: 224–228
Rice P, Longden I, Bleasby A (2000) EMBOSS: the European Molecular Biology Open Software Suite. Trends Genet 16: 276–277
Sakurai T, Satou M, Akiyama K, Iida K, Seki M, Kuromori T, Ito T, Konagaya A, Toyoda T, Shinozaki K (2005) RARGE: a large-scale database of RIKEN Arabidopsis resources ranging from transcriptome to phenome. Nucleic Acids Res 33: D647–D650
Satoh K, Doi K, Nagata T, Kishimoto N, Suzuki K, Otomo Y, Kawai J, Nakamura M, Hirozane-Kishikawa T, Kanagawa S, et al (2007) Gene organization in rice revealed by full-length cDNA mapping and gene expression analysis through microarray. PLoS One 2: e1235
Schneider TD, Stephens RM (1990) Sequence logos: a new way to display consensus sequences. Nucleic Acids Res 18: 6097–6100
Seki M, Narusaka M, Kamiya A, Ishida J, Satou M, Sakurai T, Nakajima M, Enju A, Akiyama K, Oono Y, et al (2002) Functional annotation of a full-length Arabidopsis cDNA collection. Science 296: 141–145
Suzuki Y, Taira H, Tsunoda T, Mizushima-Sugano J, Sese J, Hata H, Ota T, Isogai T, Tanaka T, Morishita S, et al (2001a) Diverse transcriptional initiation revealed by fine, large-scale mapping of mRNA start sites. EMBO Rep 2: 388–393
Suzuki Y, Tsunoda T, Sese J, Taira H, Mizushima-Sugano J, Hata H, Ota T, Isogai T, Tanaka T, Nakamura Y, et al (2001b) Identification and characterization of the potential promoter regions of 1031 kinds of human genes. Genome Res 11: 677–684
Szecsi J, Joly C, Bordji K, Varaud E, Cock JM, Dumas C, Bendahmane M (2006) BIGPETALp, a bHLH transcription factor is involved in the control of Arabidopsis petal size. EMBO J 25: 3912–3920
Tanaka T, Tateno Y, Gojobori T (2005) Evolution of vitamin B6 (pyridoxine) metabolism by gain and loss of genes. Mol Biol Evol 22: 243–250
The Rice Annotation Project (2007) Curated genome annotation of Oryza sativa ssp. japonica and comparative genome analysis with Arabidopsis thaliana. Genome Res 17: 175–183
The Rice Annotation Project (2008) The Rice Annotation Project Database (RAP-DB): 2008 update. Nucleic Acids Res 36: D1028–D1033
The UniProt Consortium (2007) The Universal Protein Resource (UniProt). Nucleic Acids Res 35: D193–D197
Tsuritani K, Irie T, Yamashita R, Sakakibara Y, Wakaguri H, Kanai A, Mizushima-Sugano J, Sugano S, Nakai K, Suzuki Y (2007) Distinct class of putative "non-conserved" promoters in humans: comparative studies of alternative promoters of human and mouse genes. Genome Res 17: 1005–1014
Wasserman WW, Palumbo M, Thompson W, Fickett JW, Lawrence CE (2000) Human-mouse genome comparisons to locate regulatory sites. Nat Genet 26: 225–228
Yamada K, Lim J, Dale JM, Chen H, Shinn P, Palm CJ, Southwick AM, Wu HC, Kim C, Nguyen M, et al (2003) Empirical analysis of transcriptional activity in the Arabidopsis genome. Science 302: 842–846
Yamamoto YY, Ichida H, Abe T, Suzuki Y, Sugano S, Obokata J (2007a) Differentiation of core promoter architecture between plants and mammals revealed by LDSS analysis. Nucleic Acids Res 35: 6219–6226
Yamamoto YY, Ichida H, Matsui M, Obokata J, Sakurai T, Satou M, Seki M, Shinozaki K, Abe T (2007b) Identification of plant promoter constituents by analysis of local distribution of short sequences. BMC Genomics 8: 67