Highly diversified molecular evolution of downstream transcription start sites in rice and Arabidopsis.

Alternative usage of transcription start sites (TSSs) is one of the key mechanisms to generate gene variation in eukaryotes. Here, we show diversified molecular evolution of TSSs in remotely related flowering plants, rice (Oryza sativa) and Arabidopsis (Arabidopsis thaliana), by comprehensive analyses of large collections of full-length cDNAs and genome sequences. We determined 45,917 representative TSSs within 23,445 loci of rice and 35,313 TSSs within 16,964 loci of Arabidopsis, about two TSSs per locus in either species. The nucleotide features around TSSs displayed distinct patterns when the most upstream TSSs were compared with downstream TSSs. We found that CG-skew and AT-skew were clearly different between upstream and downstream TSSs, and that this difference was commonly observed in rice and Arabidopsis. Relative entropy analysis revealed that the most upstream TSSs had retained canonical cis elements, whereas downstream TSSs showed atypical nucleotide features. Expression patterns were distinguishable between upstream and downstream TSSs. These results indicate that plant TSSs were generally diversified in downstream regions, resulting in the development of new gene expression patterns. Furthermore, our comparative analysis of TSS variation between the species showed a positive correlation between TSS number and gene conservation. Rice and Arabidopsis might have evolved novel TSSs in an independent manner, which led to diversification of these two species.

While a complete genome sequence enables us to estimate the total number of genes in an organism, transcriptional activities of genes can be verified by either tiling array analyses or mapping ESTs and cDNAs onto a genome (Suzuki et al., 2001a;Yamada et al., 2003;Halasz et al., 2006;Li et al., 2007). The complicated gene structure of eukaryotes, which includes alternative transcripts, hampers precise computational predictions of exon-intron boundaries. Discoveries of alternative variants of genes have, therefore, been accomplished by the experimental verification of transcripts (Kim et al., 2005;Blencowe, 2006;Chen et al., 2007). Findings from a wide variety of alternative transcripts in higher eukaryotes led to the concept that the number of transcript variants, rather than the total number of genes, would better reflect the biological complexity of an organism (Brett et al., 2002). Therefore, to understand the relationship between genes and their functions, it is necessary to study transcript variation. Alternative transcripts are mainly generated by two mechanisms: alternative splicing and alternative usage of transcription start sites (TSSs). Both mechanisms are known to play important roles in tissue-specific gene expression and functional variation, which have significant impact on biological processes (Landry et al., 2003;Iida and Go, 2006). Recent large-scale sequencing projects have produced a considerable number of 5'-end sequences of full-length cDNAs (FLcDNAs) from rice (Oryza sativa subsp. japonica; Satoh et al., 2007) and from Arabidopsis (Arabidopsis thaliana; Seki et al., 2002;Alexandrov et al., 2006). Therefore, in this paper, we attempt to elucidate the biological significance and evolution of alternative TSSs in plants.
Past studies of TSS variation have focused mainly on mammals and fungi. For example, in human, millions of 5'-end sequences of FLcDNAs were used to determine 269,774 TSSs, from which 30,964 TSS clusters of 14,628 genes were obtained (Kimura et al., 2006). It was shown that alternative promoters and the resultant alternative usage of first exons had created a large number of transcript variants (Kimura et al., 2006). In yeast, 5'-end sequences of FLcDNAs were mapped on the genome sequence, and numerous TSSs were also clearly determined. Over 90% of the analyzed yeast loci had more than two transcript variants derived from different TSSs (Miura et al., 2006). These results indicated that TSS variation could be observed widely in animals and fungi. However, a comparison of promoters between human and mouse revealed that of 5,463 genes, which contained putative alternative promoters, only 807 were evolutionarily conserved (Tsuritani et al., 2007). In addition, another study of Cap Analysis of Gene Expression data sets found that TSSs of the orthologous genes did not always reside at the equivalent locations in the human and mouse genomes. These observations have suggested flexibility and rapid turnover of TSSs during evolution. Despite the large amount of TSS information in animals and fungi, there is a paucity of TSS studies in plants. Therefore, analysis of TSSs in the plant lineage could add to knowledge about the evolution of TSSs in eukaryotes. Recent progress of genome and transcriptome sequencing in rice and Arabidopsis gives us an opportunity to investigate TSS variation in higher plants. RIKEN and Ceres have released over 200,000 5'-end sequences obtained from Arabidopsis FLcDNA clones (Seki et al., 2002;Alexandrov et al., 2006). In addition, more than 580,000 5'- or 3'-end sequences of rice FLcDNAs have become available (Satoh et al., 2007). This wealth of sequence information allows us to conduct identification and comparative analyses of TSSs in more than 10,000 loci of these plants. Previous studies have indicated increased CG-skew around TSSs in both plants and fungi but not in animals (Fujimori et al., 2005) and that the skewed prevalence of adenine at TSSs and of cytosine at -1 bp of TSSs were common characteristics of rice and Arabidopsis (Alexandrov et al., 2006). Yamamoto et al. (2007a, 2007b) conducted a comprehensive analysis of promoter regions to detect frequently observed octamers derived from TATA box, Y patch, and CpG and reported that different octamers could be used for different gene expression mechanisms.
1 This work was supported by the Ministry of Agriculture, Forestry, and Fisheries of Japan (Integrated Research Project for Plant, Insect, and Animal Using Genome Technology grant no. GD-1002 to T.T., T.I., and K.O.K. and Genomics for Agricultural Innovation grant no. GIR-1001 to T.T. and T.I.).
* Corresponding author; e-mail taitoh@affrc.go.jp.
The author responsible for distribution of materials integral to the findings presented in this article in accordance with the policy described in the Instructions for Authors (www.plantphysiol.org) is: Takeshi Itoh (taitoh@affrc.go.jp).
[W] The online version of this article contains Web-only data.
[OA] Open Access articles can be viewed online without a subscription.
www.plantphysiol.org/cgi/doi/10.1104/pp.108.131656
In this study, we identified TSSs comprehensively by mapping transcripts to the genomes of rice and Arabidopsis and compared nucleotide signals around the TSSs. To our knowledge, this is the first attempt of a large-scale TSS comparison between higher plants based on FLcDNA sequences. We also conducted a comparative analysis of TSS variation and gene conservation to elucidate how TSSs and genes have evolved during plant evolution.

Identification and Clustering of TSSs
We mapped rice FLcDNAs and their 5'-end sequences onto the rice genome to identify TSSs by a previously described method (The Rice Annotation Project, 2007). As shown in Table I
In addition to alternative TSSs, it is known that a TSS can fluctuate in animal and fungus genomes for biological reasons (Kimura et al., 2006;Miura et al., 2006). A TSS may not be determined on a single site of a genome sequence; hence, the TSS number inferred from transcript mapping could be overestimated if all fluctuating TSSs are counted. Taking this possibility into account, we decided to define regions of TSSs by clustering closely located TSSs. When we calculated the distance between TSSs of the same locus, 79.8% and 86.9% of all distances between TSS pairs were less than 100 bp in rice and Arabidopsis, respectively (Supplemental Fig. S1). Since these small fluctuations can be caused either by biological processes or by experimental errors, we first checked the accuracy of 5'-end sequencing. A large number of 5'-end sequences that were obtained from the same clones were determined by two independent experiments: complete sequencing of the FLcDNA and partial 5'-end sequencing. If there are experimental errors, the TSS positions should vary even though they are derived from the same clone. We evaluated the TSS positions in 15,381 FLcDNAs of rice and in 19,865 FLcDNAs of Arabidopsis. Our results showed that approximately 10% to 20% of TSSs of the same clones were not identical (Table II; Supplemental Fig. S2), and 4.8% and 4.5% of TSS pairs were more than 5 bp apart in rice and Arabidopsis, respectively. Note that the distance should be 0, because the sequences were determined using the same clone. Therefore, not only biological fluctuations but also these experimental errors should be diminished by clustering closely mapped TSSs.
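The same-clone consistency check described above can be sketched as follows. This is a minimal illustration, not the authors' pipeline: the function name, the input layout (one pair of mapped 5'-end coordinates per clone), and the 5-bp tolerance parameter are assumptions made for the example.

```python
from typing import Dict, Sequence, Tuple


def discrepancy_summary(pairs: Sequence[Tuple[int, int]], tolerance: int = 5) -> Dict[str, float]:
    """Summarise disagreement between two 5'-end positions mapped for the same clone.

    pairs: hypothetical input, one (full_length_position, est_position) per clone.
    tolerance: distance in bp above which a pair is counted as clearly discordant.
    """
    if not pairs:
        raise ValueError("at least one clone is required")
    distances = [abs(a - b) for a, b in pairs]
    n = len(distances)
    return {
        # Fraction of clones whose two positions are not identical.
        "non_identical": sum(d > 0 for d in distances) / n,
        # Fraction of clones whose two positions differ by more than `tolerance` bp.
        "beyond_tolerance": sum(d > tolerance for d in distances) / n,
    }
```

Because both positions come from one physical clone, any nonzero distance in this summary reflects sequencing or mapping error rather than biological fluctuation.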
To date, several criteria have been used for TSS clustering. For example, to analyze Cap Analysis of Gene Expression data, tags of 20- or 21-bp overlaps were clustered, while Kimura et al. (2006) adopted a 500-bp interval between distinct TSSs, and, in yeast, a fixed 100-bp interval was used for clustering (Miura et al., 2006). In this study, we first estimated appropriate interval sizes (for details, see "Materials and Methods"), and they were determined to be 21 bp for rice and 27 bp for Arabidopsis. The TSSs were clustered by the single-linkage clustering method with these thresholds. As a result, the maximum cluster sizes were 133 and 193 bp in length, and the average sizes were 4. Fig. S3).
Although hundreds of thousands of sequences were evaluated, the number of transcripts might not be saturated and there might be TSS variants missing in the cDNA libraries used. In fact, 50% and 58% of the clusters were determined by only one transcript in rice and Arabidopsis, respectively. This result suggests that more TSSs might be detected if we further collect cDNA clones. Thus, our estimates of the numbers of TSS clusters should be taken as lower limits, and there is likely to be more variation of TSSs than observed in this study.

Prominent Nucleotide Features around TSSs
As described in "Materials and Methods," we defined a representative TSS in each TSS cluster for further analyses. We refer to these representatives simply as TSSs, unless otherwise noted. When there are two or more TSSs in a locus, cis elements of a downstream TSS may overlap with a transcribed region of an upstream TSS. In fact, 18.9% of downstream TSSs of rice were located within the protein-coding regions of transcripts initiated from upstream TSSs and resulted in truncated open reading frames (ORFs). The nucleotide compositions around the downstream TSSs might be distinct from those of the upstream TSSs, because the functional constraints of the transcribed regions of the upstream TSSs should create nucleotide biases. To assess this possibility, we separated the TSS data sets into the most upstream TSSs and the remaining downstream TSSs and analyzed their nucleotide features, such as CG-skew and AT-skew. When there was one TSS in a locus, it was included in the upstream TSS data set.
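The partition into upstream and downstream data sets can be sketched as below. The data layout (a strand and a list of representative TSS coordinates per locus) and the function name are assumptions for illustration. The strand handling reflects the fact that "upstream" means the smallest coordinate on the plus strand and the largest on the minus strand.

```python
from typing import Dict, List, Tuple


def split_tss(
    tss_by_locus: Dict[str, Tuple[str, List[int]]]
) -> Tuple[List[Tuple[str, int]], List[Tuple[str, int]]]:
    """Split representative TSSs into the most upstream one per locus and the rest.

    tss_by_locus: hypothetical layout {locus_id: (strand, [genomic positions])},
    where every locus has at least one position.
    """
    upstream: List[Tuple[str, int]] = []
    downstream: List[Tuple[str, int]] = []
    for locus, (strand, positions) in tss_by_locus.items():
        # Order positions in the direction of transcription.
        ordered = sorted(positions, reverse=(strand == "-"))
        # A locus with a single TSS contributes only to the upstream set.
        upstream.append((locus, ordered[0]))
        downstream.extend((locus, pos) for pos in ordered[1:])
    return upstream, downstream
```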
First, we observed a strong peak of CG-skew in the upstream TSS data sets, whereas the downstream TSSs showed considerably reduced CG-skew in the two plants (Fig. 1, A and B). In downstream TSSs, the peaks were weakened and shifted slightly upstream. Second, we investigated AT-skew around TSSs. Our data clearly showed that the AT-skew was significantly biased around the TSSs, with similar tendency in rice and Arabidopsis (Fig. 1, C and D). In addition, the overall distributions of the AT-skew were quite different between the upstream and downstream TSSs, in a similar manner in both species. It seems that AT-skew is a much clearer indicator of the nucleotide signals around TSSs than CG-skew. Third, in both upstream and downstream TSSs, similar patterns of GC contents were observed (Fig. 2). Clear peaks and drastic changes around TSSs were seen in all cases. However, TATA-like signals around -35 to -25 bp upstream from the TSSs were significantly diminished in downstream TSSs. Last, the relative entropy display of the nucleotide compositions clearly showed strong signals of TATA-like motifs around the -35 to -25 bp regions of upstream TSSs, but these signals disappeared in downstream TSSs (Fig. 3). This analysis was conducted for the case in which the loci of a single TSS were excluded, and we obtained essentially the same results (Supplemental Fig. S4). We found skewed appearance of C at -1 bp and of A/G at TSSs, which were also weaker in downstream TSSs. In previous reports, the TATA box of rice and Arabidopsis was frequently detected in conjunction with the Y patch motif, which is a stretch of C/Ts and is located -100 to -1 bp upstream from the TSS (Yamamoto et al., 2007a, 2007b). In our relative entropy analysis, signals of C/Ts were detected between TSSs and TATA-like motifs in the -25 to -1 bp upstream regions of rice but not of Arabidopsis.
Similarly, in the downstream TSSs, the signals corresponding to the Y patch motif were clear only in rice. These comparisons of the nucleotide signals around TSSs between the upstream and downstream data sets suggested that in rice and Arabidopsis downstream transcription might be differently regulated from upstream transcription, which had canonical cis elements such as the TATA box. Previous studies have defined TSSs as being distinct if they were separated by over 500 bp, so that the downstream transcription signals do not heavily overlap with the upstream transcripts (Kimura et al., 2006;Tsuritani et al., 2007). To examine the possibility that the weakened downstream signals were caused by overlapping upstream transcripts, we reanalyzed nucleotide signals of downstream TSSs that were located more than 500 bp away from any upstream TSSs. We observed decreased nucleotide signals at the same level as those of all downstream TSSs (data not shown). In addition, this tendency was not different in protein-coding regions, untranslated regions, and introns (Supplemental Fig. S5). These results suggest that the weakened nucleotide signal might be due to different transcriptional signals rather than to overlapping transcription with the upstream TSSs.

Relationship between TSS Diversification and Gene Expression Patterns
It is expected that use of alternative TSSs is related to gene expression patterns in differentiated tissues or in response to specific conditions (Macknight et al., 2002;Lee et al., 2006;Szecsi et al., 2006). We examined whether TSS diversification is correlated with patterns of alternative gene expression, using information from the rice transcript library. To exclude ambiguity and experimental errors, we focused on loci where TSSs were determined by more than five transcripts. As a result, 1,012 (48.5%) of 2,088 loci that had exactly two TSSs consisted of cDNAs that were obtained from different libraries. Hence, as expected, differential TSSs should result in variations of gene expression patterns. For example, there are two TSSs in a locus named Os08g0199300 and annotated as "similar to YyaF/YCHF TRANSFAC/OBG family small GTPase plus RNA binding domain TGS" (Fig. 4). The downstream TSS of this locus started in the third exon of the upstream transcript. Thus, the downstream ORF was predicted to be a truncated form. While the upstream transcripts were collected from various libraries, such as callus, flower, and shoot, the downstream transcripts were found only in one library, designated as "1 week after flowering ear" (Table III). As another example, the locus Os01g0303200, in which a hypothetical protein was predicted, had two TSSs associated with different expression patterns (Supplemental Fig. S6). Transcripts from the downstream TSS that showed no coding potential had been derived only from the library of "Leaf (9 leaf stage)," and no transcripts of the upstream TSS had been derived from this library (Supplemental Table S1). These results support the idea that TSS diversification contributed to the variations of gene expression patterns. The Arabidopsis ortholog, AT1G35220, had similar but slightly different TSSs (see "Discussion").

TSS Diversity Correlated with Protein Sequence Evolution
To elucidate the evolutionary significance of TSS diversification, we used two approaches to analyze the relationship between the numbers of TSSs per locus and the evolutionary conservation of protein sequences. We determined orthologs between rice and Arabidopsis by reciprocal best hits of BLASTP searches and calculated protein identity between the orthologs. We found a positive correlation between the number of rice TSSs and protein identity (Fig. 5), and a similar tendency was observed in Arabidopsis TSSs (Supplemental Fig. S7). These results suggest that a locus encoding evolutionarily conserved proteins had acquired more TSSs than one encoding diverged proteins.
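The reciprocal-best-hit criterion for orthologs can be illustrated with a short sketch. It assumes that the best BLASTP hit of each protein has already been extracted into two dictionaries, one per search direction. Parsing of BLAST output and the tie-breaking rule are not covered by the text and are left out here, and the identifiers used in any call are placeholders.

```python
from typing import Dict, List, Tuple


def reciprocal_best_hits(
    best_a_to_b: Dict[str, str], best_b_to_a: Dict[str, str]
) -> List[Tuple[str, str]]:
    """Return (a, b) pairs where b is the best hit of a and a is the best hit of b.

    best_a_to_b: hypothetical mapping from each species-A protein to its best species-B hit.
    best_b_to_a: the same mapping in the opposite search direction.
    """
    return sorted(
        (a, b) for a, b in best_a_to_b.items() if best_b_to_a.get(b) == a
    )
```

A protein whose best hit points back to a different protein (for example, a paralog) is dropped, which is why this criterion tends to yield a conservative ortholog set.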
Next, we searched the UniProtKB database for homologous sequences of the rice and Arabidopsis proteins and classified them into four groups by their level of conservation. The ratio of conserved protein groups increased as the number of TSSs per locus grew (Supplemental Fig. S8). However, if the cDNA collection was insufficient, the number of TSSs of poorly expressed genes might be underestimated. To exclude the possibility of insufficient sampling of cDNAs, we used TSSs that were supported by five or more transcripts and confirmed that the same tendency was observed (data not shown). Therefore, highly variable TSSs seemed to be prevalent in conserved protein-coding genes of either rice or Arabidopsis.
Table III. Library information and the number of cDNA clones obtained from the rice locus, Os08g0199300

DISCUSSION
To cluster TSSs that fluctuated for biological or experimental reasons, we used a threshold interval of 21 bp for rice and 27 bp for Arabidopsis. Since the resultant average sizes of the TSS clusters were much smaller, 4.2 bp in rice and 8.7 bp in Arabidopsis, than those initial intervals, it seems that fluctuating TSSs were clustered effectively and that there was little excessive clustering. The longer average size of Arabidopsis TSS clusters may be due to experimental errors, as we observed more discrepancies in Arabidopsis sequences obtained from the same clone compared with those of rice (Table II).
As each locus had on average two TSS clusters in either species, there should have been significant contribution of this TSS variation to these species. Indeed, TSS variants of several genes are known to be responsible for different expression patterns (Landry et al., 2003;Iida and Go, 2006). In this study, our large-scale analysis revealed that TSSs had been obtained from different libraries in about half of the loci that had two TSSs (Table III; Supplemental Table S1). This observation is consistent with our finding that the nucleotide signatures are distinguishable between the upstream and downstream TSSs, as canonical signals, such as the TATA box motif, were clearly depicted in the upstream TSSs but were considerably diminished in the downstream TSSs. These results suggest that transcription from the upstream TSSs is, at least in part, under a common regulatory mechanism, while the downstream TSSs are generally regulated by specialized systems (Fig. 6), which should lead to highly differentiated expression patterns.
In addition to CG-skew, which characterizes plant and yeast TSSs (Fujimori et al., 2005), AT-skew was found to be another strong indicator of TSSs. It is of particular interest that the distributions of AT-skew were nearly identical between rice and Arabidopsis. The sharp contrast of the AT-skew patterns between the upstream and downstream TSSs also supports the aforementioned idea that TSS variations are related to expression differences. A possible application of this clear AT-skew is that, since the AT-skew has been conserved between these remotely related plant species, one may consider a generalized method by which TSSs can be predicted from newly sequenced genomic DNA of plants. Because plants and fungi share common nucleotide features around TSSs (Fujimori et al., 2005), the animal machinery might have evolved independently.
A reason for the weak signals of downstream TSSs appears to be overlap with upstream protein-coding regions. Since protein-coding regions are under functional constraints, the nucleotide compositions and genomic positions of cis elements will be affected. For example, the TATA box frequently contains TAA, which is a stop codon and may prematurely terminate translation. The medians of the distances between TSSs were relatively small, 149 and 184 bp in rice and Arabidopsis, respectively, so that it was possible that the signals overlapping the upstream protein-coding region remained generally weak. However, even though we used only downstream TSSs separated from upstream TSSs by more than 500 bp, the signals were almost identical to those of all downstream TSSs (data not shown). Therefore, we concluded that the distinct signals of the downstream TSSs were not necessarily due to upstream coding regions but that they are intrinsic to the nature of the downstream TSSs. We should note that the downstream TSSs might produce a truncated protein whose function is deteriorated or lost. Thus, regulation by alternative TSS usage may be achieved in a loss-of-function manner, which is suggested to be of evolutionary importance (Oda et al., 2002;Tanaka et al., 2005). On the contrary, if a new TSS is generated in the upstream region, it would affect the downstream canonical transcriptional signals. Therefore, upstream TSS generation might have been suppressed during evolution (Fig. 6). Our hypothesis is that plants have generally retained an upstream "genuine" TSS with the TATA box and created downstream diversity. This seems to be in contrast with the observation that, in humans, the TATA box was used for tissue-specific expression while ubiquitously expressed genes are dependent on CpG islands (Suzuki et al., 2001b;Carninci et al., 2006), suggesting that plants and animals independently evolved their basic transcription regulation machineries.
Figure 5. Relationship between TSS number and amino acid identity of orthologs between rice and Arabidopsis. The TSS number is that of the rice loci.
It is intuitively plausible that, if protein sequences are highly diverged because of relaxed functional constraint, regulation of their expression becomes concordantly variable. However, our analyses revealed that the number of TSSs increased proportionally to the sequence conservation in both rice and Arabidopsis. Although it was expected that the gene function affected the number of TSSs, our functional categorization of the proteins by the Gene Ontology hierarchy showed no significant correlation between gene function and the number of TSSs (Supplemental Fig. S9). Since highly conserved proteins generally play essential roles, are used in a variety of tissues, and are regulated by complex processes, elaborate transcriptional regulation to control several TSSs might be required. Intriguingly, the TSSs that we identified were not necessarily conserved between rice and Arabidopsis. As shown in Figure 4, B and C, both rice and Arabidopsis used the same upstream TSSs in this locus, whereas the downstream TSSs obviously differed between these orthologs. Likewise, in humans and mice, merely one-fourth of the promoter regions between orthologs were conserved (Wasserman et al., 2000;Tsuritani et al., 2007). Therefore, TSS variation seems to be unstable in the course of evolution, and this variation should contribute to biodiversity among a wide range of species.

CONCLUSION
We determined TSSs in rice and Arabidopsis by large-scale computation and found that both species have, on average, two or more TSSs per locus. The nucleotide signals around TSSs were similar in these two plants, while they were quite different between the upstream and downstream TSSs. A positive correlation between TSS numbers and gene conservation was also observed. This study provides an insight for diversified transcriptional variation that is likely to have contributed to the evolution of plant species.

Genome and cDNA Sequences
We used FLcDNAs and their 5'-end sequences for TSS determination (Supplemental Table S2). The FLcDNAs and 5'-end sequences of rice (Oryza sativa; Kikuchi et al., 2003;Satoh et al., 2007) and the FLcDNAs and 5'-end sequences of Arabidopsis (Arabidopsis thaliana; Seki et al., 2002;Alexandrov et al., 2006) were retrieved from the GenBank/EMBL/DDBJ DNA databases. In addition, the Arabidopsis FLcDNAs sequenced by RIKEN were downloaded from the RIKEN Arabidopsis Genome Encyclopedia (http://rarge.gsc.riken.jp/archives/rafl/sequence/; Sakurai et al., 2005). The library information of the rice FLcDNA clones, which was derived from 41 different libraries including unknown resources, was provided by Dr. S. Kikuchi (personal communication). For the rice genome sequence, the International Rice Genome Sequencing Project genome sequence build 4 was used (http://rgp.dna.affrc.go.jp/IRGSP/download.html). The Arabidopsis genome sequence was downloaded from the National Center for Biotechnology Information's FTP site (ftp://ftp.ncbi.nih.gov/genomes/) as of August 13, 2004. ORFs and annotation data of rice were downloaded from the RAP-DB (http://rapdb.dna.affrc.go.jp/; The Rice Annotation Project, 2008). ORFs of Arabidopsis were retrieved from The Arabidopsis Information Resource (TAIR) 7 annotation data (http://www.arabidopsis.org/) as of June 19, 2007 (Rhee et al., 2003).

cDNA Mapping to Genome Sequences
Positions of transcripts on the genome sequences were determined by methods described previously (The Rice Annotation Project, 2007). We used 5'-end positions aligned by the est2genome program with the following options: gap open penalty, 8; mismatch penalty, 6 (Rice et al., 2000). Since the cDNA data sets included redundant sequences, which were determined as a full-length sequence and as a 5'-EST of the same clone, we used only the full-length cDNAs. We noticed that approximately 5% of the mapped transcripts contained an unaligned 5'-region of 7 bp or more, which were possibly derived from remaining vector sequences. These unaligned regions were discarded from our analyses. We found that 764 RAP loci included nonoverlapping transcripts, which might be due to transcriptional read-through. These read-through candidates were not used in this study, because read-through transcripts lead to overestimation of alternative TSSs. Because 1,807 5'-end sequences of Arabidopsis did not correspond to any TAIR protein-coding regions, they were eliminated from our data sets.

Clustering of 5#-End Positions
We clustered 5'-end positions that fluctuated for biological or for experimental reasons. To determine an appropriate threshold for the distance between 5'-end positions to be clustered, the relationship between the distance and the total number of clusters was examined (Supplemental Fig. S10). The cluster number decreased gradually and monotonically as the distance increased. We adopted the threshold distance at which the rate of decrease in the total number of clusters was less than 1%: 21 bp for rice and 27 bp for Arabidopsis. Juxtaposed 5'-end positions within the threshold distance were clustered by the single-linkage clustering method. In each cluster, a single representative TSS was selected in the following order: (1) supported by a full-length sequence, (2) supported by the most clones, and (3) the most upstream 5'-TSS.
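The clustering procedure can be sketched in a few lines. This is a minimal illustration under stated assumptions rather than the authors' implementation: all function names and data layouts are hypothetical, "within the threshold distance" is read as a gap of at most the threshold, and the "rate of decrease" is read as the relative drop in total cluster count when the distance grows by 1 bp. In one dimension, single-linkage clustering reduces to chaining sorted neighbours.

```python
from typing import Dict, List, Sequence, Tuple


def single_linkage_clusters(positions: Sequence[int], threshold: int) -> List[List[int]]:
    """Chain sorted positions whose gap to the previous position is <= threshold bp."""
    clusters: List[List[int]] = []
    for pos in sorted(positions):
        if clusters and pos - clusters[-1][-1] <= threshold:
            clusters[-1].append(pos)  # joins the current chain
        else:
            clusters.append([pos])    # gap too large: start a new cluster
    return clusters


def choose_threshold(
    positions_by_locus: Dict[str, Sequence[int]], max_distance: int = 100, cutoff: float = 0.01
) -> int:
    """Smallest distance d at which the cluster count drops by less than `cutoff` versus d - 1."""
    def total(d: int) -> int:
        return sum(len(single_linkage_clusters(p, d)) for p in positions_by_locus.values())

    previous = total(0)
    for d in range(1, max_distance + 1):
        current = total(d)
        if previous and (previous - current) / previous < cutoff:
            return d
        previous = current
    return max_distance


def pick_representative(sites: Sequence[Tuple[int, bool, int]], strand: str = "+") -> Tuple[int, bool, int]:
    """Pick one site per cluster; each site is (position, has_full_length, n_clones)."""
    def key(site: Tuple[int, bool, int]) -> Tuple[bool, int, int]:
        pos, has_full_length, n_clones = site
        upstream_rank = pos if strand == "+" else -pos
        # Priority: full-length support, then most clones, then most upstream.
        return (not has_full_length, -n_clones, upstream_rank)

    return min(sites, key=key)
```

Because single linkage chains neighbours, a cluster can span more than the threshold itself, which is consistent with the maximum cluster sizes of 133 and 193 bp reported for thresholds of 21 and 27 bp.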

Calculation of CG-Skew and AT-Skew
We extracted genomic sequences that spanned the -250 to +350 bp region around each TSS. When an ambiguous nucleotide denoted by N existed in a sequence file, the sequence was eliminated from our analysis. CG-skew values, defined as $(C - G)/(C + G)$, were computed in a sliding window of 100 bp with 1-bp steps, where C stands for the total number of cytosines in the window and G stands for the total number of guanines. The position of a window in Figure 1 is represented by the 51st nucleotide of the window. Likewise, AT-skew values were calculated for adenines (A) and thymines (T).
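A sliding-window skew calculation of this kind can be sketched as follows. The function name and return layout are assumptions for illustration. The value assigned to a window that contains neither counted nucleotide (0.0 here) is also an assumption, since the text does not say how such windows were handled. Sequences containing N are expected to be filtered out before the call.

```python
from typing import List, Tuple


def skew_profile(seq: str, plus: str, minus: str, window: int = 100) -> List[Tuple[int, float]]:
    """Return (index, skew) for every window, sliding by 1 bp.

    skew = (plus - minus) / (plus + minus), e.g. plus="C", minus="G" for CG-skew.
    index is the 0-based position of the 51st nucleotide of a 100-bp window
    (more generally, start + window // 2).
    """
    seq = seq.upper()
    profile: List[Tuple[int, float]] = []
    for start in range(len(seq) - window + 1):
        chunk = seq[start:start + window]
        p, m = chunk.count(plus), chunk.count(minus)
        # Assumption: a window with neither nucleotide is reported as zero skew.
        value = (p - m) / (p + m) if (p + m) else 0.0
        profile.append((start + window // 2, value))
    return profile
```

Calling `skew_profile(seq, "C", "G")` gives the CG-skew profile and `skew_profile(seq, "A", "T")` the AT-skew profile. Averaging the per-sequence values at each index across all TSS-centred sequences would give a curve like those described for Figure 1, although the averaging step is not spelled out in the text.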

Calculation of the Relative Entropy at a Nucleotide Site
We represented nucleotide biases by relative entropy, modifying a previously reported method (Schneider and Stephens, 1990;Crooks et al., 2004). The relative entropy (R) at a particular nucleotide position is:
$$R = \sum_{n \in \{A, T, G, C\}} p_n \log_2 \left( \frac{p_n}{p_g} \right)$$
where $p_n$ is the observed frequency of nucleotide n (A, T, G, or C) at the position and $p_g$ is the genomic frequency of n. Previous studies have assumed random occurrence of the nucleotide in a background distribution, so that $p_g$ was 0.250 for any n, but the GC contents of rice and Arabidopsis were 43.6% and 36.0%, respectively, which clearly deviated from 50%. Therefore, for example, $p_g$ for the adenine of rice was set to 0.282, assuming that A and T, or G and C, distributed equally in either DNA strand. The height ($H_n$) of each nucleotide n at a particular position in Figure 1 was determined by multiplying the relative entropy by the frequency of that nucleotide (Schneider and Stephens, 1990), as follows:
$$H_n = p_n R$$
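The relative entropy and letter heights at one aligned position can be computed as in the sketch below. The function names are hypothetical. Terms with an observed frequency of zero are taken to contribute zero, which is the usual convention and an assumption here. The input column is assumed to contain only A, T, G, and C.

```python
import math
from typing import Dict, Tuple


def background(gc_content: float) -> Dict[str, float]:
    """Genomic frequencies assuming A = T and G = C, given the genome GC fraction."""
    at_each = (1.0 - gc_content) / 2.0
    gc_each = gc_content / 2.0
    return {"A": at_each, "T": at_each, "G": gc_each, "C": gc_each}


def relative_entropy(column: str, bg: Dict[str, float]) -> Tuple[float, Dict[str, float]]:
    """Return (R, heights) for the nucleotides observed at one position across all TSSs.

    R = sum over n of p_n * log2(p_n / p_g); heights[n] = p_n * R.
    """
    total = len(column)
    freqs = {n: column.count(n) / total for n in "ATGC"}
    # Assumption: nucleotides that never occur at this position add nothing to R.
    r = sum(p * math.log2(p / bg[n]) for n, p in freqs.items() if p > 0)
    heights = {n: p * r for n, p in freqs.items()}
    return r, heights
```

With a rice GC content of 43.6%, `background(0.436)["A"]` evaluates to about 0.282, matching the value quoted in the text.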

Sequence Analysis of Orthologs
The rice protein set we used was compared with the Arabidopsis protein set of TAIR. Homologs and orthologs were determined, as described elsewhere (The Rice Annotation Project, 2007). Homologous sequences of other organisms were identified by BLASTP searches against UniProtKB (release 10.2) downloaded as of April 9, 2007 (The UniProt Consortium, 2007). We adopted an E value of less than $10^{-4}$ as a threshold. On the basis of the taxonomic groups to which the organisms of the homologs belonged, we categorized the rice and Arabidopsis proteins into (1) Oryzeae/Brassicaceae, (2) Liliopsidae/Eudicotyledons, (3) Viridiplantae, and (4) nonplant organisms (including fungi, animals, and prokaryotes).

Supplemental Data
The following materials are available in the online version of this article.
Supplemental Figure S2. Distribution of the distances of aligned 5'-positions between FLcDNAs and 5'-ESTs obtained from the same clones.
Supplemental Figure S3. Distribution of average TSS distances of a locus in rice and Arabidopsis.
Supplemental Figure S4. Relative entropy of nucleotides around the loci of a single TSS and loci with multiple TSSs.
Supplemental Figure S5. Relative entropy of nucleotides around downstream TSSs in rice.
Supplemental Figure S7. Relationship between the TSS number and amino acid identity in Arabidopsis.
Supplemental Figure S8. Evolutionary conservation and TSS occurrence frequency per locus.
Supplemental Figure S9. Relationship between TSS numbers in a locus and Gene Ontology categories.
Supplemental Figure S10. Definition of threshold distance.
Supplemental Table S1. Library information of the number of cDNA clones from Os01g0303200.
Supplemental Table S2. Number of transcript sequences used in this study.
Supplemental Information S1. Evaluation of TSSs of two FLcDNA sets obtained from independent cloning methods.