Plant Physiology 132:469-484 (2003) © 2003 American Society of Plant Biologists
Refined Annotation of the Arabidopsis Genome by Complete Expressed Sequence Tag Mapping
Department of Zoology and Genetics (W.Z., S.D.S., V.B.) and Department of Statistics (V.B.), Iowa State University, Ames, Iowa 50011-3260
Expressed sequence tags (ESTs) currently encompass more entries in the public databases than any other form of sequence data. Thus, EST data sets provide a vast resource for gene identification and expression profiling. We have mapped the complete set of 176,915 publicly available Arabidopsis EST sequences onto the Arabidopsis genome using GeneSeqer, a spliced alignment program incorporating sequence similarity and splice site scoring. About 96% of the available ESTs could be properly aligned with a genomic locus, with the remaining ESTs deriving from organelle genomes and non-Arabidopsis sources or displaying insufficient sequence quality for alignment. The mapping provides verified sets of EST clusters for evaluation of EST clustering programs. Analysis of the spliced alignments suggests corrections to current gene structure annotation and provides examples of alternative and non-canonical pre-mRNA splicing. All results of this study were parsed into a database and are accessible via a flexible Web interface at http://www.plantgdb.org/AtGDB/.
The efforts of an international collaboration to obtain the complete genome sequence of the flowering plant Arabidopsis resulted in the release and annotation of 115.4 Mb of the genome (estimated at 125 Mb) in December of 2000 (Arabidopsis Genome Initiative, 2000
Expressed sequence tags (ESTs) are single-pass sequencing reads of cDNA clones that have become a widely employed method for gene identification, expression profiling, and polymorphism analysis. Presently, more than 13.4 million EST entries have been deposited into the National Center for Biotechnology Information (NCBI) dbEST public database, including Arabidopsis with 176,915 ESTs and 21 other species with EST sets of more than 100,000 entries (http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html). In the absence of a whole-genome sequencing project for a particular species, clustering of ESTs into contigs that represent unique genes is one of the most promising strategies to glimpse the gene space of that organism. Challenges of EST clustering arise from poor average sequence quality, incomplete EST sampling, polymorphisms, alternative transcript isoforms, representation of highly similar transcripts from distinct members of multigene families, and cloning artifacts. Different strategies for EST clustering and the associated gene indexing databases have been reviewed by Bouck et al. (1999
For Arabidopsis, up-to-date EST clusters are available in the form of the UniGene clusters at NCBI (http://www.ncbi.nlm.nih.gov/UniGene/) and as The Institute for Genomic Research (TIGR) Gene Index (AtGI; http://www.tigr.org/tdb/tgi/agi/; Quackenbush et al., 2001
EST Spliced Alignments
The Arabidopsis EST (ATest) data set employed in this study consists of 176,915 entries. As shown in Figure 1, only 2,059 EST sequences (1.2%) did not show any significant alignments with the genome. Further investigation based on BLASTN (Altschul et al., 1997
Of the ESTs, 96.0% have at least one high-quality spliced alignment (hqSPA; see "Materials and Methods") with the Arabidopsis genome (such ESTs denoted as high-quality ESTs [hqESTs]), and about 13.2% have more than one hqSPA with the genome (such ESTs denoted as mhqESTs; see Fig. 1). The distribution of the number of hqSPAs per hqEST is shown in Table I. The majority of the ESTs have only one or two hqSPAs, but there are 38 ESTs with at least 10 hqSPAs. These ESTs were found to be associated with transposon families and other highly prolific genome elements. For example, EST gi:9787698 (with 170 hqSPAs) appears to be derived from an Arabidopsis putative retroelement polyprotein gene, clustered around all five centromeres of the Arabidopsis genome as shown in Figure 2.
Overall, about 82.8% (146,527 entries) of the ATest data set are uhqESTs (see "Materials and Methods"), which align with a single locus in the genome. To properly position the remaining ESTs, which display multiple hqSPAs, we make the assumption that for each EST the alignment with maximal score (similarity score x coverage score) identifies the true cognate location of that EST. Such alignments are designated putative cognate spliced alignments (pcSPAs; see "Materials and Methods"). In this way, 172,137 pcSPAs were generated from 169,888 hqESTs and 206,833 hqSPAs. Because of virtual equalities among the scores of some hqSPAs for certain mhqESTs (Fig. 3), there are more pcSPAs than hqESTs.
We should emphasize that our restriction on hqESTs largely eliminates typical problems of EST clustering and EST-based gene annotation, as caused by chimeric clones, for example. Thus, chimeric sequences would typically lead to alignments with coverage score below 0.8 because in any given genomic location, only one part of the sequence would match (or if the foreign sequence were only very short, it would not be used in the GeneSeqer spliced alignment, which optimizes the local alignment score). According to the aforementioned assumption, the similarity and coverage scores for each pcSPA correlate with our confidence in the prediction of cognate transcript origin for the hqEST in question. Higher alignment similarity and coverage scores denote greater confidence. The vast majority of pcSPAs have similarity and coverage scores in the 0.99 to 1.0 range (Fig. 4). This implies high confidence in the classification of these alignments as cognate. The designation of "putative" cognate is formally accurate, however, because the matched ESTs and genomic sequences were not isolated from the same plant. When considering the alignment of ESTs not derived from the Columbia ecotype on which the genomic sequences are based, cognate position implies the cognate origin of the most probable transcript ortholog to the aligned EST. According to dbEST annotation, about 98% of the Arabidopsis ESTs were derived from the Columbia ecotype. Three hundred of the 337 ESTs annotated as derived from ecotype Landsberg have pcSPAs with average similarity score 0.93 and average coverage score 0.94. Thus, the different Arabidopsis ecotypes appear to have such a high degree of sequence conservation that correct mapping of the ESTs onto the Columbia ecotype genome is unproblematic (see also Haas et al., 2002
EST assembly refers to the problem of finding the correct orientation and order of EST sequences in a tiling path covering the cognate mRNA. Because EST sequences are typically generated by single-pass sequencing and, thus, contain a fair number of errors and ambiguous bases, this assembly can be difficult in the absence of genome sequence data. However, when the entire genome sequence is available, the spliced alignment of ESTs gives reliable assemblies and can be used for prediction of gene structure and alternative splicing (Kan et al., 2001
Because of the relative facility of EST sequencing, EST projects have outpaced genome sequencing projects for many species. EST clustering is typically the first analysis step in deriving a "unigene" set representing the transcriptome of the species. By clustering, EST sequences that share significant sequence similarity are partitioned into presumed gene-specific contigs, thus reducing the redundancy of the EST set. Such reduction is often dramatic, especially in the case of EST sets not derived from normalized libraries. Cluster-based reduction may be a practical necessity before EST assembly for large EST sets.
Here, we are particularly interested in evaluating the utility of ESTs in gene identification. pcSPAs, representing putative cognate gene locations, were clustered based on chromosome location. Each cluster contains ESTs from a single gene provided that the intergenic regions between neighboring genes are sufficiently long compared with the maximal allowed gap (negative overlap) set by the clustering parameters (see "Materials and Methods"). Because genome-based EST clustering does not depend on pair-wise EST sequence overlap, which is a necessary requirement for comparison-based assembly programs, small gaps in local genome coverage can be allowed, thereby joining partial gene annotations through a genome scaffolding scheme. 
In addition, high coverage as required for the pcSPAs excludes erroneous alignment of chimeric clones, which typically pose annoying problems for comparison-based assembly.
Figure 5 provides an example of the possibilities and difficulties of gene structure annotation by EST clustering. Full-length cDNA evidence indicates four genes in alternating directions in the displayed region of chromosome four. Current GenBank annotation misses the second gene, the 5' end of which is overlapping the 5' end of the third gene transcribed on the opposite strand. Genome-based EST clustering without using clone pair information would give the three clusters that are bounded by ESTs gi:19864852 and gi:19802435, gi:19822861 and gi:19863255, and gi:8732113 and gi:19863376, respectively. If clone pair information is used, the clusters resolve to four clusters that correctly identify the four genes. For comparison, Figure 5 also shows the alignment of TIGR Arabidopsis Gene Index tentative contigs (Quackenbush et al., 2001
Choosing various clustering parameters from 50-bp overlap to 100-bp gap was shown to alter the number of clusters by less than 12% (Table II). The following results are based on the 27,611 clusters obtained by allowing a maximal gap of 60 bp (other criteria give similar results; data not shown). About one-half of the clusters contain only one or two pcSPAs (Table III). Large clusters correspond to highly expressed genes (e.g. Fernandes et al., 2002
As described in the next section, some of our spliced alignment results contradict particular gene models in the most recent Arabidopsis genome annotation. To safeguard against possible errors in our employed methods, we exploited a set of 5,000 nonredundant full-length cDNAs derived in a Ceres/TIGR collaboration (Haas et al., 2002
The results showed that 4,999 of the cDNAs have at least one hqSPA. The only unmatched cDNA (gi:21405014, Ceres identification no. CT23693) matches mitochondrial DNA. Generally, the pcSPA of a full-length cDNA is regarded to be the most decisive experimental evidence to define gene structures. Therefore, the cDNA-derived pcSPAs provide a reliable set to assess EST-based gene prediction. Overall, the 4,999 cDNAs have 4,691 uhqSPAs, 308 mhqSPAs, and 5,013 pcSPAs. Surprisingly, 1,100 (21.9%) of the pcSPAs are embedded in longer EST clusters (see http://www.plantgdb.org/AtGDB/prj/ZSB03PP/extendedCoverage.html). This discrepancy may result from alternative transcription initiation and termination sites or systematic biases in the cDNA cloning process (Haas et al., 2002 Because the cDNA-covered gene set is not representative of the entire Arabidopsis gene set (highly expressed genes have a greater chance to be cloned and sequenced both as ESTs and full-length cDNAs), the 91% fraction of pcSPAs from full-length cDNAs covered also by the EST-derived pcSPAs is an upper bound of the estimated fraction of genes identified by ESTs. The comparison confirms that both ESTs and cDNAs were accurately mapped to the genome with our method and that these approaches provide both alternative and complementary paths to gene discovery.
Spliced alignments of ATest and ATcdna and the recent annotation of the Arabidopsis genome were parsed and imported into an MySQL relational database, which was named AtGDB. An elaborate Web interface was designed for the database to allow users to browse the genome and query the database by sequence similarity, identifiers, or description (http://www.plantgdb.org/AtGDB/). In general, the Web interface is composed of three parts: the genomic context view, the query view, and the sequence view. The genomic context view allows users to browse a specific genomic region in the context of multiple annotation resources. The region graphic displays these multiple sources of alignment information relative to one another. Each is colored with respect to its specific annotation source (see Fig. 5). The query view allows users to view and interact with the results of a user query. Stored EST/cDNA alignments and annotated transcripts each have an individual page, the sequence view, which glues together sequence data, analysis tools, and related external links. This Web interface efficiently presents the database entries on the fly and facilitates data access and utilization as described below.
After mapping the ESTs to the genome, we not only acquired the genomic loci each EST originated from but also confirmation of other annotation resources by comparison with the EST spliced alignments. Here, we explored several applications listed below. However, we should emphasize that we cannot describe in-depth analysis of these data within the scope of this manuscript and rather wish to point out possibilities of further studies based on the rich data source provided by the comprehensive EST mapping.
Consistency of Gene Structure Annotation
In addition to suggesting corrections to current gene annotations, the EST spliced alignments also identify novel gene locations. Thus, of the 27,611 EST contigs assembled on the basis of proximity in their genomic locations, 129 occur in regions without any annotated gene models and contain open reading frames (ORFs) longer than 100 residues that show no significant hits with annotated Arabidopsis proteins using BLASTP (threshold 1e-10). Eighty-two of these show no hits at the same threshold when compared against the NCBI nonredundant protein database, and the remaining 47 EST contigs show at least one hit (data available at http://www.plantgdb.org/AtGDB/prj/ZSB03PP/novelGenes.html). For example, ESTs gi:19863912, gi:9786135, and gi:8721866 form a cluster that supports an ORF of 108 residues between genes At4g02400 and At4g02410; the existence of a gene in that region is also supported by full-length cDNAs gi:14596167 and gi:20148266. In other cases, the novel ORFs may correspond to upstream or downstream exons of incompletely annotated genes. The display at AtGDB allows users to provide updated annotation upon more in-depth analysis of individual cases.
5'- and 3'-Untranslated Regions (UTRs) in mRNAs
The gene density in the Arabidopsis genome is high, with about one gene every 5 kb. Therefore, intergenic regions are typically very short, which may make accurate UTR assignments difficult. We cataloged high-quality predicted introns that mapped into annotated intergenic regions into potential 5'-UTR or 3'-UTR introns, depending on whether the constituent hqSPAs extend from the flanking coding region into the upstream or downstream region, respectively (note that in some cases the additional exons may extend an annotated ORF; thus, the derived set of potential UTR introns is a superset of EST-confirmed UTR introns). In this way, 2,282 potential 5'-UTR introns in 2,023 annotated genes (including 199 genes with multiple potential 5'-UTR introns; all data displayed at http://www.plantgdb.org/AtGDB/prj/ZSB03PP/upstreamUTRintrons.html) and 570 potential 3'-UTR introns in 487 annotated genes (including 47 genes with multiple potential 3'-UTR introns; all data displayed at http://www.plantgdb.org/AtGDB/prj/ZSB03PP/downstreamUTRintrons.html) were identified. Seventy-two genes have both potential 5'-UTR and potential 3'-UTR introns. Thus, at least 9% of Arabidopsis genes may have introns in their UTRs. Our listing of these features at AtGDB should provide a valuable resource to study possible roles for these introns in the regulation of gene expression and to develop models for UTR prediction (see also dbUTR; Pesole et al., 2002
Non-Canonical Splice Sites
In this study, 738 introns (1.7% of the 43,165 high-quality predicted introns derived from EST alignments) were found to have non-canonical splice sites (Table IV). GC-AG introns represent the large majority of non-canonical introns (453 cases, or about 1.0% of all high-quality predicted introns). AT-AC introns comprise the second largest category (25 cases). Many of the non-canonical introns have short direct repeats spanning the donor and acceptor sites. In these cases, the exact intron position cannot be unambiguously determined by spliced alignment; thus, some of the classifications in Table IV may prove incorrect. The complete listing of apparent non-canonical introns (http://www.plantgdb.org/AtGDB/prj/ZSB03PP/ncSpliceSites.html) should facilitate experimental investigation of splicing in the absence of the standard splice site features.
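The categories used above (GT-AG, GC-AG, AT-AC) refer to the first two and last two nucleotides of an intron. As a minimal sketch of that bookkeeping, assuming the intron sequence has already been extracted from a spliced alignment, the labeling could look like the following. The function name `classify_intron_termini` and the label strings are hypothetical and are not part of GeneSeqer or AtGDB.

```python
def classify_intron_termini(intron_seq):
    """Label an intron by its terminal dinucleotides (illustrative only).

    intron_seq is the intron from its first to its last nucleotide,
    given as DNA or RNA; RNA 'U' is converted to 'T' before comparison.
    """
    seq = intron_seq.upper().replace("U", "T")
    if len(seq) < 4:
        raise ValueError("intron too short to have distinct donor and acceptor dinucleotides")
    donor, acceptor = seq[:2], seq[-2:]
    label = f"{donor}-{acceptor}"
    if label == "GT-AG":
        return "canonical GT-AG"
    if label in ("GC-AG", "AT-AC"):
        return f"non-canonical {label}"
    return f"non-canonical other ({label})"


if __name__ == "__main__":
    # Made-up sequences, chosen only to exercise each branch.
    for seq in ("GTAAGTTTCAG", "GCAAGTTTTTTAG", "ATATCCTTTTAC", "CTAAGTTTCAC"):
        print(seq, "->", classify_intron_termini(seq))
```

The label depends entirely on where the intron boundaries are placed, so introns flanked by short direct repeats can receive a different label if the alignment shifts by a few bases, which is the ambiguity noted in the text.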
The 453 GC-AG introns (http://www.plantgdb.org/AtGDB/prj/ZSB03PP/non_canonical/gc_ag.html) have the consensus donor sequence (non-U)AG/GCAAGU (donor site boldfaced) exactly as reported before for other data sets (Burset et al., 2000
Dietrich et al. (1997
Mutual comparison of the genes containing the putative U12-type introns shows that some of them may correspond to duplications within gene families. For example, the genes At1g56280, At5g26990, At3g06760, At3g05700, and At4g02200 all encode a drought-induced-19 like protein. A detailed study shows that all five genes have a U12-type intron between coding exons three and four (At4g02200 has a U12-dependent GT-AG intron, whereas the other four genes have a U12-dependent AT-AC intron). Similarly, the genes At3g53520 (Fig. 6A) and At3g62830, which encode a dTDP-Glc 4-6-dehydratase-like protein, also both have a U12-dependent AT-AC intron in the same location. Inspection of a homologous rice gene shows that the U12-type intron location is not only conserved among the Arabidopsis paralogs but also across the monocot/dicot divide (Fig. 6B). This observation is consistent with the conjecture of the early origin of U12-class introns (Wu et al., 1996
The analysis of U12-type introns gives an example of how to utilize the EST data and AtGDB resource, and it also exposes several annotation problems. For instance, of the 23 AT-AC U12-type introns, only four AT-AC introns are explicitly annotated (At3g53520, At5g22650, At5g26990, and At5g27380). One AT-AC U12-type intron in gene At3g62830 is incorrectly annotated as a CA-TA intron, even with the presence of six cognate full-length cDNAs. In addition, gene structures predicted by ab initio methods will typically never include non-canonical introns (for example, the gene At1g76170). Furthermore, EST data can provide a check on the accuracy of the genome sequence (Brendel and Zhu, 2002
Alternative Splicing
Mini-Exons and Mini-Introns
Based on EST evidence, we did not find any introns less than 50 bp. According to the GenBank annotation, there are 46 introns ranging from 1 to 10 bp, but it seems likely that these are annotation mistakes. One 27-bp intron was annotated in the gene At3g53740, which is supported by full-length cDNA CT267357 (gi:21405387). However, 33 pcSPAs uniformly support a continuous exon in that position. It is possible that this region is polymorphic between Columbia and other ecotypes and that the cognate origin of CT267357 includes a standard-sized intron. Conversely, 128 nonterminal mini-exons are supported by EST evidence. These exons range in size from 5 to 25 bp, with 13 of them no longer than 10 nucleotides in length (http://www.plantgdb.org/AtGDB/prj/ZSB03PP/miniexons.html). In a few cases, these mini-exons may occur in regions of increased alternative splicing activity. An example of this is given in Figure 5. However, most mini-exons appear to be constitutively spliced, as confirmed by the consistent alignments of several ESTs. For example, a six-nucleotide exon in At5g14030 is unanimously confirmed by 12 EST spliced alignments and conserved in an apparent rice homologous gene (Figs. 7 and 8). Due to steric constraint imposed by their size, we find it difficult to explain the accurate splicing of mini-exons by exon definition, and intron-definition and/or facilitation of splicing by splicing enhancers may be a more plausible splice site selection model in this case. Interestingly, most mini-exons are characterized by high splice prediction scores in the flanking exon-intron junctions (data not shown), suggesting that the associated spliceosome and mechanism of splicing involved in resolving mini-exons may be highly similar to that of normal exons.
ESTs have become the most popular method for gene discovery in eukaryotic species without a whole-genome sequencing project and a key technology for genome annotation when genome sequence data are available. We are particularly interested in systematic, functional, and phylogenetic comparisons of the gene repertoires of plants. Currently, a near-complete genome has been assembled for only Arabidopsis and rice. In contrast, some of the largest species-specific EST collections are from plants, including wheat (Triticum aestivum; more than 415,000), barley (Hordeum vulgare; more than 310,000), soybean (Glycine max; more than 305,000), maize (Zea mays; more than 195,000), and Medicago truncatula (more than 180,000; http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html). Kalyanaraman et al. (2003 In comparison with other indexing methods such as UniGene or the TIGR Gene Indices that work entirely on the mRNA level, genome location-based clustering not only has the advantage of accuracy but also allows using low-quality ESTs more effectively. For example, EST gi:8332684 has a uhqSPA with a similarity score marginally higher than 0.8, but the GeneSeqer spliced alignment still accurately reveals the exon-intron boundaries of the gene At1g20620 (catalase 3). This EST is clustered with hundreds of other cognate ESTs located in the same region. However, although labeled as weakly similar to At1g20620, it is clustered as a singleton in the TIGR Arabidopsis Gene Index.
Surprisingly, the complete EST mapping revealed a large number of discrepancies between the current gene structure annotation and assignments of exons and introns indicated by the spliced alignment. Previously, Haas et al. (2002
In addition to providing the standards for EST clustering and data for refining basic gene structure annotation, the spliced alignments also provide a rich resource for more in-depth analysis of pre-mRNA processing, including assessment of the extent of alternative splicing and use of non-canonical splice sites. Based on very stringent spliced alignment criteria, we established alternative splicing (excluding possible intron retention) for only about 1.5% of the Arabidopsis genes. The majority of alternative splicing occurs at either the donor site or the acceptor site of an intron but not on both ends simultaneously (292 of 327 cases). We also observe that most alternative splice sites are within 50 bp of the common splice site (220 of 292). Specifically, in 134 cases, the distances between the alternative splice site and the common splice site are less than 10 bp. Such transcript isoforms with minor differences may be easily overlooked in conventional EST clustering and transcript assembly. For example, the gene At1g02500 has two alternative isoforms with a difference of only 3 bp in the location of the acceptor site of its sole intron. Each of the isoforms has at least six ESTs to support its unique gene structure. However, all of these ESTs are assembled into one index in the TIGR Arabidopsis Gene Index (Cluster ID:TC149272).
Most certainly, these estimates of the occurrence of alternative splicing are very conservative. First, these estimates were based on only very good spliced alignments that leave no doubt as to the origin of the respective ESTs. Second, the ATest collection is still very small compared with the human collection, for example, which is about 30 times larger. However, we still estimate the occurrence of alternative splicing in Arabidopsis much lower than the reported 40% to 60% of human genes (Black, 2000
Currently, most gene identification efforts rely heavily on ab initio gene prediction programs (Pavy et al., 1999 To facilitate refined genome annotation and further study of pre-mRNA processing based on the spliced alignment data, all of our results were stored in a MySQL database and are visually presented on a special Web site, AtGDB (http://www.plantgdb.org/AtGDB/). Several established and comprehensive Arabidopsis databases are already available to date, such as TAIR (http://www.arabidopsis.org/), Munich Information Center for Protein Sequences (http://mips.gsf.de/proj/thal/), and the TIGR Arabidopsis Database (http://www.tigr.org/tdb/e2k1/ath1/). All displays in AtGDB are linked to the corresponding entries in those databases. AtGDB adds a convenient sequence-centered view of the genome. Users of AtGDB can easily find the distribution of target sequences in the genome, see their related annotations, and exact genomic coordinates (based upon the most recent release of Arabidopsis genome annotation) of ESTs and cDNAs. Analytical tools are linked to the displays to allow further analysis with additional data, for example spliced alignment with ESTs from sources other than Arabidopsis. We hope that this analysis and the new Web tools will contribute to more complete and accurate genome annotation.
Data Sets
The five chromosome sequences of Arabidopsis were obtained from GenBank (http://www.ncbi.nih.gov/entrez/query.fcgi?db=Nucleotide) as accessions NC_003070 (chromosome I, dated August 20, 2002, 30,028,691 bp), NC_003071 (chromosome II, dated August 20, 2002, 19,646,746 bp), NC_003074 (chromosome III, dated August 20, 2002, 23,467,821 bp), NC_003075 (chromosome IV, dated August 20, 2002, 17,550,036 bp), and NC_003076 (chromosome V, dated August 20, 2002, 26,583,670 bp). Arabidopsis ESTs were downloaded from the dbEST database (http://www.ncbi.nlm.nih.gov/dbEST/). Our analysis was based on 176,915 EST records available October 25, 2002 (data set label: ATest). According to the GenBank records, 111,155 non-RIKEN ESTs were derived from the Columbia ecotype. An additional 61,481 ESTs are from RIKEN, and these ESTs also were from Columbia (Seki et al., 2002
Alignment of cDNAs or ESTs to a genomic template is known as spliced alignment because the alignment must correctly reflect the removal of introns from the pre-mRNA copy of the genomic template. Several programs and services are available for this task, including PROCRUSTES (Gelfand et al., 1996
The default GeneSeqer parameters are set to allow detection of gene structure through alignment of ESTs from non-cognate ESTs derived from a homologous gene elsewhere in the genome (or even ESTs from a homologous locus in a related species). For some of the questions studied here, it was necessary to restrict the data to only the cognate alignments. Because of allelic variation and sequencing errors, even cognate alignments will not necessarily display 100% sequence matching; however, the overall alignment quality generally should be much higher than for heterologous alignments. For a given EST, GeneSeqer assesses alignment quality by two parameters: a similarity score, defined as the ratio of the observed alignment score over the maximum possible alignment score obtained in the absence of any substitutions and insertions or deletions; and a coverage score, defined as the fraction of the EST nucleotides involved in the displayed alignment (because the GeneSeqer spliced alignment is local, any poorly matching N- or C-terminal EST regions are culled from the displayed alignment). Here, we define hqEST spliced alignments (hqSPAs) as alignments that give similarity and coverage scores both of at least 0.8. ESTs with at least one hqSPA are defined as hqEST. An hqEST is further categorized according to the number of hqSPAs derived from the given EST. It is called a uhqEST if the EST matches a unique locus in the genome, and it is called an mhqEST if the EST matches multiple sites in the genome (presumably corresponding to duplicated genes). The corresponding spliced alignments are referred to as uhqSPAs and mhqSPAs. The major task of spliced alignment discussed in this paper was to identify cognate positions for each entry of ATest. 
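The definitions above reduce to two threshold tests and a count. The following is a minimal sketch, assuming each spliced alignment is available as a record with a similarity and a coverage score. The `SplicedAlignment` class, its field names, and the function names are hypothetical and do not reflect GeneSeqer's output format. The sketch also assumes that every hqSPA of an EST corresponds to a distinct genomic locus.

```python
from dataclasses import dataclass

MIN_SCORE = 0.8  # threshold on both similarity and coverage, as stated in the text


@dataclass(frozen=True)
class SplicedAlignment:
    """Hypothetical record for one EST-to-genome spliced alignment."""
    est_id: str
    chromosome: str
    start: int
    end: int
    similarity: float
    coverage: float


def is_hqspa(aln):
    """An alignment is high quality if both scores are at least 0.8."""
    return aln.similarity >= MIN_SCORE and aln.coverage >= MIN_SCORE


def classify_est(alignments):
    """Classify one EST from all of its spliced alignments.

    Returns 'uhqEST' for exactly one hqSPA, 'mhqEST' for several,
    and 'not hqEST' when no alignment passes both thresholds.
    """
    n_hq = sum(1 for aln in alignments if is_hqspa(aln))
    if n_hq == 0:
        return "not hqEST"
    return "uhqEST" if n_hq == 1 else "mhqEST"


if __name__ == "__main__":
    # Made-up alignments for a single EST.
    alns = [
        SplicedAlignment("est1", "1", 1000, 1600, 0.99, 1.00),
        SplicedAlignment("est1", "3", 5000, 5600, 0.85, 0.90),
        SplicedAlignment("est1", "5", 200, 500, 0.95, 0.40),  # fails coverage
    ]
    print(classify_est(alns))
```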
Because the EST set was not masked or filtered to remove contaminations, low-complexity regions, or repeats, and because high-sensitivity/low-specificity default GeneSeqer parameters were applied for the spliced alignment, we limited most of our derived results to hqSPAs and hqESTs. The product of similarity and coverage scores was utilized as a measure to identify the pcSPAs, based on the assumption that the pcSPA should have the best score among hqSPAs for each specific hqEST. Due to recent gene duplications, possible genome assembly errors, or other uncertain reasons, some hqESTs may have several hqSPAs with identical or near-identical score in different locations of the genome. Thus, the pcSPA for each hqEST is not necessarily unique. The distribution of score differences among multiple hqSPAs for an EST is shown in Figure 3. Based on this distribution, all hqSPAs with scores strictly within 0.015 of the maximal score for that EST were labeled as pcSPA. With default parameters, a GeneSeqer-reported similarity score of s corresponds to 0.5 x (1 + s) x 100% sequence identity (for an alignment without gaps). Thus, two alternative full-length alignments of an EST will be distinguished as cognate and non-cognate if the weaker match has on average one additional mismatch to the genomic sequence per 100 nucleotides compared with the better match. The average nucleotide difference between the duplicated genes identified by hqESTs was calculated as 11.4% ± 4.6%. Therefore, the given criterion would safely distinguish duplicated genes except for very recent duplications that result in such minor sequence differences that they are indistinguishable from EST sequencing error rates.
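The selection rule described above can be sketched as follows, assuming the hqSPAs of one EST are given as (locus, similarity, coverage) tuples. The function names and the tuple layout are hypothetical. The sketch reads "scores" as the product of similarity and coverage, and "strictly within 0.015" as a difference smaller than 0.015 from the best product. The identity conversion is the formula quoted in the text for ungapped alignments with default parameters.

```python
PC_TOLERANCE = 0.015  # score window around the best hqSPA, from the text


def combined_score(similarity, coverage):
    """Product of similarity and coverage, used to rank hqSPAs."""
    return similarity * coverage


def select_pcspas(hqspas):
    """Return the loci of all putative cognate alignments for one EST.

    hqspas: list of (locus, similarity, coverage) tuples that already
    passed the hqSPA thresholds. More than one locus can be returned
    when several alignments score within the tolerance of the best one.
    """
    if not hqspas:
        return []
    scored = [(locus, combined_score(sim, cov)) for locus, sim, cov in hqspas]
    best = max(score for _, score in scored)
    return [locus for locus, score in scored if best - score < PC_TOLERANCE]


def approx_identity(similarity):
    """Fraction of identical positions implied by similarity score s:
    0.5 * (1 + s), for an alignment without gaps."""
    return 0.5 * (1.0 + similarity)


if __name__ == "__main__":
    # Made-up hqSPAs for a single mhqEST.
    hqspas = [("chr1:100", 0.99, 1.0), ("chr3:500", 0.98, 1.0), ("chr5:900", 0.90, 0.95)]
    print(select_pcspas(hqspas))
    print(approx_identity(0.98))
```

Under the identity conversion, a similarity difference of 0.015 corresponds to an identity difference of 0.75%, that is, roughly one mismatch per 100 nucleotides, which matches the distinction drawn in the text.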
hqESTs were mapped to the Arabidopsis genome based on pcSPAs as described in the previous section. The mapped hqESTs were clustered according to genome coordinates derived from their pcSPAs requiring a defined minimal overlap length or a maximal coverage gap size. Precisely, let est1 map to region [a,b] and est2 to region [c,d], where a ≤ c on the same chromosome; then est1 and est2 are clustered if c ≤ b + G + 1, where G is the clustering parameter. G could be negative (overlap required) or positive (specifying the maximal allowed gap). For ESTs giving multiple-exon spliced alignments, the overlap rule is superseded by the requirement for consistency of strand orientation as indicated by GeneSeqer. Thus, ESTs from overlapping genes in opposite transcriptional directions can be separated into different clusters (compare with Fig. 5). In addition, ESTs from the same plasmid (clone pairs) were used to join clusters independent of their local map coordinates. Different sets of clusters based on alignment and clustering parameters are available at http://www.plantgdb.org/AtGDB/prj/ZSB03PP/ESTclustering.html. ESTs of each cluster were further assembled by the built-in function of GeneSeqer to generate alternative gene structures and predicted peptide sequences (PPSs) derived from long ORFs in the alternative gene structures. The PPSs were searched against ATpep via BLASTP to locate putative novel genes as described below.
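The pairwise rule c ≤ b + G + 1 can be turned into a single sweep over sorted coordinates. The sketch below is illustrative only: the function name and tuple layout are hypothetical, pairwise links are assumed to chain transitively (single linkage), and the strand-consistency and clone-pair steps described above are deliberately left out.

```python
def cluster_by_location(intervals, gap):
    """Group mapped ESTs by genome location.

    intervals: list of (est_id, chromosome, start, end) tuples, one per pcSPA.
    gap: the clustering parameter G; negative values require an overlap,
         positive values give the maximal allowed gap in bp.
    Returns a list of clusters, each a list of est_ids.
    """
    clusters = []
    current = []
    current_chrom = None
    current_end = None
    # Sort by chromosome, then start, so that a <= c holds for every comparison.
    for est_id, chrom, start, end in sorted(intervals, key=lambda r: (r[1], r[2], r[3])):
        joins = (
            bool(current)
            and chrom == current_chrom
            and start <= current_end + gap + 1  # the rule c <= b + G + 1
        )
        if joins:
            current.append(est_id)
            current_end = max(current_end, end)  # right-most end seen in this cluster
        else:
            if current:
                clusters.append(current)
            current = [est_id]
            current_chrom = chrom
            current_end = end
    if current:
        clusters.append(current)
    return clusters


if __name__ == "__main__":
    # Made-up coordinates.
    ests = [
        ("a", "1", 100, 200),
        ("b", "1", 250, 300),
        ("c", "1", 400, 500),
        ("d", "2", 100, 200),
    ]
    print(cluster_by_location(ests, 60))
    print(cluster_by_location(ests, -50))
```

Comparing each new EST against the right-most end of the current cluster is sufficient here, because any earlier member that satisfies the rule ends no further right than that position.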
The set of full-length cDNAs was aligned to the genome similarly to the EST alignments (the GeneSeqer option x 30 y 50 was used, which probes for potential gene locations by about 50-base identities in the suffix array, thus quickly identifying cognate loci). These alignments served as quality control in two ways. First, the results test the integrity of our analysis method. Because these full-length cDNAs were used previously to improve the Arabidopsis genome annotation (Haas et al., 2002
The raw output of GeneSeqer occupied a total of 1.6 billion bytes of disk space. The output was parsed and imported into a MySQL relational database management system (http://www.mysql.com) for further analysis. The database is accessible via the Web at http://www.plantgdb.org/AtGDB/. Supplementary data for the results of this study are available at http://www.plantgdb.org/AtGDB/prj/ZSB03PP/.
GeneSeqer assigns two scores to each splice site, a prediction score and a local similarity score. The prediction score is between 0 and 1.0, based on a statistical model for the probability of the site to function as a splice site. Non-canonical splice sites receive 0 as a prediction score. The local similarity score measures sequence matching in the 40- to 50-bp flanking exon regions derived from the spliced alignment. This score is also normalized to 1.0 for complete identity. For exons shorter than 40 bp, the local similarity scores of the flanking splice sites are both set to 0. In this study, high-quality predicted introns were selected as predicted introns with: (a) splice site prediction scores for the donor and acceptor sites both higher than 0, i.e. the intron should be a canonical intron; and (b) local similarity scores for the donor and acceptor sites both higher than 0.95 (implying that the flanking exons should be no less than 40 bp and that at most one mismatch is allowed in the 40- to 50-bp flanking exon region alignment).
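The two selection criteria can be expressed as a short Python sketch. The `SpliceSite` and `Intron` classes and the function `is_high_quality` are hypothetical names introduced for illustration and do not correspond to GeneSeqer output fields; only the thresholds (prediction score above 0, local similarity above 0.95) are taken from the text.

```python
from dataclasses import dataclass


@dataclass
class SpliceSite:
    prediction: float  # 0 for non-canonical sites, otherwise up to 1.0
    similarity: float  # local similarity of the flanking exon; 0 if exon < 40 bp


@dataclass
class Intron:
    donor: SpliceSite
    acceptor: SpliceSite


def is_high_quality(intron, min_similarity=0.95):
    """Apply criteria (a) and (b) to both the donor and the acceptor site."""
    sites = (intron.donor, intron.acceptor)
    return all(
        site.prediction > 0 and site.similarity > min_similarity
        for site in sites
    )
```

Under these assumptions, an intron flanked by an exon shorter than 40 bp is rejected automatically, because the local similarity score of the adjoining splice site is 0.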
The 5' site motif ATCC in positions +3 to +6 is highly conserved in U12 introns (Wu and Krainer, 1996
The mapped ESTs provide a rich data set for studying many aspects of genome and gene structure. Here, we have explored the following issues.
Consistency of Gene Structure Annotation
5'- and 3'-UTRs in mRNAs
Non-Canonical Splice Sites
Alternative Splicing
Mini-Exons and Mini-Introns
The authors would like to thank Peter T. Vedell for early contributions to this work and Jessica A. Schlueter for critical reading of the manuscript.
Received November 21, 2002; returned for revision January 6, 2003; accepted February 20, 2003.
Article, publication date, and citation information can be found at www.plantphysiol.org/cgi/doi/10.1104/pp.102.018101.
1 This work was supported in part by the National Science Foundation (grant no. DBI-0110254 to V.B.).
* Corresponding author; e-mail vbrendel{at}iastate.edu; fax 515-294-6755.
Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389–3402
Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408: 796–815
Bailey TL, Elkan C (1994) Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol 2: 28–36
Bailey TL, Gribskov M (1998) Combining evidence using p-values: application to sequence homology searches. Bioinformatics 14: 48–54
Berget SM (1995) Exon recognition in vertebrate splicing. J Biol Chem 270: 2411–2414
Black DL (2000) Protein diversity from alternative splicing: a challenge for bioinformatics and post-genome biology. Cell 103: 367–370
Bouck J, Yu W, Gibbs R, Worley K (1999) Comparison of gene indexing databases. Trends Genet 15: 159–162
Brendel V, Kleffe J (1998) Prediction of locally optimal splice sites in plant pre-mRNA with applications to gene identification in Arabidopsis thaliana genomic DNA. Nucleic Acids Res 26: 4748–4757
Brendel V, Zhu W (2002) Computational modeling of gene structure in Arabidopsis thaliana. Plant Mol Biol 48: 49–58
Brett D, Pospisil H, Valcarcel J, Reich J, Bork P (2002) Alternative splicing and genome complexity. Nat Genet 30: 29–30
Brown JW, Smith P, Simpson CG (1996) Arabidopsis consensus intron sequences. Plant Mol Biol 32: 531–535
Burge CB, Padgett RA, Sharp PA (1998) Evolutionary fates and origins of U12-type introns. Mol Cell 2: 773–785
Burset M, Seledtsov IA, Solovyev VV (2000) Analysis of canonical and non-canonical splice sites in mammalian genomes. Nucleic Acids Res 28: 4364–4375
Burset M, Seledtsov IA, Solovyev VV (2001) SpliceDB: database of canonical and non-canonical mammalian splice sites. Nucleic Acids Res 29: 255–259
Coward E, Haas SA, Vingron M (2002) SpliceNest: visualizing gene structure and alternative splicing based on EST clusters. Trends Genet 18: 53–55
Davuluri RV, Grosse I, Zhang MQ (2001) Computational identification of promoters and first exons in the human genome. Nat Genet 29: 412–417
Davuluri RV, Suzuki Y, Sugano S, Zhang MQ (2000) CART classification of human 5' UTR sequences. Genome Res 10: 1807–1816
Dietrich RC, Incorvaia R, Padgett RA (1997) Terminal intron dinucleotide sequences do not distinguish between U2- and U12-dependent introns. Mol Cell 1: 151–160
Fernandes J, Brendel V, Gai X, Lal S, Chandler VL, Elumalai RP, Galbraith DW, Pierson EA, Walbot V (2002) Comparison of RNA expression profiles based on maize expressed sequence tag frequency analysis and micro-array hybridization. Plant Physiol 128: 896–910
Florea L, Hartzell G, Zhang Z, Rubin GM, Miller W (1998) A computer program for aligning a cDNA sequence with a genomic DNA sequence. Genome Res 8: 967–974
Gelfand MS, Mironov AA, Pevzner PA (1996) Gene recognition via spliced sequence alignment. Proc Natl Acad Sci USA 93: 9061–9066
Haas BJ, Volfovsky N, Town CD, Troukhan M, Alexandrov N, Feldmann KA, Flavell RB, White O, Salzberg SL (2002) Full-length messenger RNA sequences greatly improve genome annotation. Genome Biol 3: research 0029.1–0029.2
Huang X, Adams MD, Zhou H, Kerlavage AR (1997) A tool for analyzing and annotating genomic sequences. Genomics 46: 37–45
Huang YH, Chen YT, Lai JJ, Yang ST, Yang UC (2002) PALS db: Putative Alternative Splicing database. Nucleic Acids Res 30: 186–190
Kalyanaraman A, Kothari S, Brendel V, Aluru S (2003) Efficient clustering of large EST data sets on parallel computers. Nucleic Acids Res 31: in press
Kan Z, Rouchka EC, Gish WR, States DJ (2001) Gene structure prediction and alternative splicing analysis using genomically aligned ESTs. Genome Res 11: 889–900
Levine A, Durbin R (2001) A computational scan for U12-dependent introns in the human genome sequence. Nucleic Acids Res 29: 4006–4013
Modrek B, Lee C (2002) A genomic view of alternative splicing. Nat Genet 30: 13–19
Mott R (1997) EST_GENOME: a program to align spliced DNA sequences to unspliced genomic DNA. Comput Appl Biosci 13: 477–478
Pavy N, Rombauts S, Déhais P, Mathé C, Ramana DVV, Leroy P, Rouzé P (1999) Bioinformatics 15: 887–899
Pesole G, Liuni S, Grillo G, Licciulli F, Mignone F, Gissi C, Saccone C (2002) UTRdb and UTRsite: specialized databases of sequences and functional elements of 5' and 3' untranslated regions of eukaryotic mRNAs. Update 2002. Nucleic Acids Res 30: 335–340
Quackenbush J, Cho J, Lee D, Liang F, Holt I, Karamycheva S, Parvizi B, Pertea G, Sultana R, White J (2001) The TIGR Gene Indices: analysis of gene transcript sequences in highly sampled eukaryotic species. Nucleic Acids Res 29: 159–164
Seki M, Narusaka M, Kamiya A, Ishida J, Satou M, Sakurai T, Nakajima M, Enju A, Akiyama K, Oono Y et al. (2002) Functional annotation of a full-length Arabidopsis cDNA collection. Science 296: 141–145
Sharp PA, Burge CB (1997) Classification of introns: U2-type or U12-type. Cell 91: 875–879
Tabaska JE, Davuluri RV, Zhang MQ (2001) Identifying the 3'-terminal exon in human DNA. Bioinformatics 17: 602–607
Usuka J, Brendel V (2000) Gene structure prediction by spliced alignment of genomic DNA with protein sequences: increased accuracy by differential splice site scoring. J Mol Biol 297: 1075–1085
Usuka J, Zhu W, Brendel V (2000) Optimal spliced alignment of homologous cDNA to a genomic DNA template. Bioinformatics 16: 203–211
Wheelan SJ, Church DM, Ostell JM (2001) Spidey: a tool for mRNA-to-genomic alignments. Genome Res 11: 1952–1957
Wu HJ, Gaubier-Comella P, Delseny M, Grellet F, Van Montagu M, Rouzé R (1996) Non-canonical introns are at least 10(9) years old. Nat Genet 14: 383–384
Wu Q, Krainer AR (1996) U1-mediated exon definition interactions between AT-AC and GT-AG introns. Science 274: 1005–1008
Wu Q, Krainer AR (1999) AT-AC pre-mRNA splicing mechanisms and conservation of minor introns in voltage-gated ion channel genes. Mol Cell Biol 19: 3225–3236
Yeh RF, Lim LP, Burge CB (2001) Computational inference of homologous gene structures in the human genome. Genome Res 11: 803–816