Plant Physiology 138:18-26 (2005)
© 2005 American Society of Plant Biologists
The Institute for Genomic Research Osa1 Rice Genome Annotation Database¹
The Institute for Genomic Research, Rockville, Maryland 20850
We have developed a rice (Oryza sativa) genome annotation database (Osa1) that provides structural and functional annotation for this emerging model species. Using the sequence of O. sativa subsp. japonica cv Nipponbare from the International Rice Genome Sequencing Project, pseudomolecules, or virtual contigs, of the 12 rice chromosomes were constructed. Our most recent release, version 3, represents our third build of the pseudomolecules and is composed of 98% finished sequence. Genes were identified using a series of computational methods developed for Arabidopsis (Arabidopsis thaliana) that were modified for use with the rice genome. In release 3 of our annotation, we identified 57,915 genes, of which 14,196 are related to transposable elements. Of these 43,719 nontransposable element-related genes, 18,545 (42.4%) were annotated with a putative function, 5,777 (13.2%) were annotated as encoding an expressed protein with no known function, and the remaining 19,397 (44.4%) were annotated as encoding a hypothetical protein. Multiple splice forms (5,873) were detected for 2,538 genes, resulting in a total of 61,250 gene models in the rice genome. We incorporated experimental evidence into 18,252 gene models to improve the quality of the structural annotation. A series of functional data types has been annotated for the rice genome that includes alignment with genetic markers, assignment of gene ontologies, identification of flanking sequence tags, alignment with homologs from related species, and syntenic mapping with other cereal species. All structural and functional annotation data are available through interactive search and display windows as well as through download of flat files. To integrate the data with other genome projects, the annotation data are available through a Distributed Annotation System and a Genome Browser. All data can be obtained through the project Web pages at http://rice.tigr.org.
Rice (Oryza sativa) has emerged as a model species for the cereals, a group of grass species that includes not only rice but also the major crop species maize (Zea mays), wheat (Triticum aestivum), barley (Hordeum vulgare), sorghum (Sorghum bicolor), oats (Avena sativa), and millet (Eleusine coracana). Features of rice that have contributed to its utility as a model species include its small stature in comparison to other cereals, transformability, dense genetic map, well-developed genomic resources, and small genome (430 Mb; Arumuganathan and Earle, 1991).
Of the four rice genome sequencing projects, the public effort of the International Rice Genome Sequencing Project (IRGSP) has generated the highest quality and most complete genome sequence, that of the japonica subspecies, cultivar Nipponbare. However, although the IRGSP has generated a near-complete finished sequence for rice (http://rgp.dna.affrc.go.jp/IRGSP/), the annotation of the rice genome is still ongoing. Annotation, in which features are noted on the genome sequence, is a dynamic, iterative process. The primary component of any genome annotation effort is identifying the genes, also termed structural annotation. This is a challenging task as it relies heavily on computational methods with less than optimal sensitivity and specificity. Structural annotation can be improved dramatically by access to experimental evidence, such as transcripts and protein sequences. However, the strength of experimental evidence is variable. The most powerful evidence is that of full-length cDNAs (FL-cDNAs). For rice, a collection of approximately 32,000 FL-cDNA sequences is available (The Rice Full-Length cDNA Consortium, 2003).
Layered on top of structural annotation is functional annotation. This involves identifying the function of the genes and annotating associated genome sequences with biologically relevant features. Assigning gene function is perhaps the most subjective aspect of functional annotation. This is typically done based on transitive annotation, as only a small portion of genes within any genome have been verified for function at the experimental level. Gene function can be assigned based on sequence similarity with known proteins or through the presence of protein domains with known function. Gene ontologies have been developed to provide controlled vocabularies to annotate gene function, thereby allowing for cross-kingdom querying of genes (The Gene Ontology Consortium, 2000).
We have generated a public annotation database for the rice genome based on the near-complete sequence generated by the IRGSP. This database, termed Osa1 for Oryza sativa 1, is a Sybase relational database that stores and tracks rice genome sequence and annotation. Annotation data are generated using a series of bioinformatic processes initially developed for annotating the Arabidopsis (Arabidopsis thaliana) genome (Wortman et al., 2003).
The goal of The Institute for Genomic Research (TIGR) rice annotation database is to provide high-quality, uniform structural and functional annotation of the rice genome. This involves identifying all the genes, constructing gene models (including alternative splice forms), and identifying putative function for these genes. In addition, the rice genome is annotated with functional annotation data types to provide biologists with the highest quality content possible. The bulk of the data is stored in the Osa1 database. This database is similar in structure to other eukaryotic annotation databases at TIGR, such as the Ath1 database utilized in the reannotation of Arabidopsis (Wortman et al., 2003).
The basic sequence unit of our rice genome annotation pipeline is the bacterial artificial chromosome (BAC)/P1 artificial chromosome (PAC) clones generated by the IRGSP. The BAC/PACs are processed using the Eukaryotic Genome Control (EGC) pipeline (Wortman et al., 2003).
Comparison of sensitivity and specificity of the five ab initio gene finders listed above revealed FGENESH as superior in predicting rice genes (Q. Yuan and C.R. Buell, unpublished data). Thus, for all of our automated annotation, we utilized the output of the FGENESH program and not the other ab initio gene finders to generate gene models that were improved through use of the Program to Assemble Spliced Alignments (PASA; Haas et al., 2003).
Pseudomolecules (virtual contigs) of the 12 rice chromosomes are constructed to remove the overlapping sequences between the BAC/PAC clones. The BAC/PAC clones are aligned on the chromosome based on the clone order as reported by the IRGSP, and overlaps between the BAC/PAC clones are confirmed (http://rgp.dna.affrc.go.jp/cgi-bin/statusdb/irgsp-status.cgi). An overwhelming majority of the BAC/PAC clones match perfectly with most discrepancies involving polymorphisms of copy number of mono- and dinucleotide simple sequence repeats. The pseudomolecule DNA sequence is constructed by trimming the overlap region at junction points in which the genes are least disrupted and the annotation data are transferred from the BAC/PAC clones to the pseudomolecules. As a consequence of identifying junction points within the overlap region based on gene location, minimal resolution of incongruent gene models must be made. Statistics on our current pseudomolecules can be found in Table I and through the project Web pages (http://rice.tigr.org/tdb/e2k1/osa1/pseudomolecules/info.shtml).
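The junction-point idea described above can be illustrated with a minimal sketch. It is not the TIGR implementation: the function names, the coordinate convention, and the rule of picking the offset that cuts the fewest gene intervals are all assumptions made for illustration, and the toy sequences are hypothetical.

```python
def choose_junction(overlap_len, gene_intervals):
    """Return the offset within the overlap that cuts the fewest genes.

    gene_intervals holds (start, end) pairs in overlap coordinates; a gene
    that extends beyond the overlap may have start < 0 or end > overlap_len.
    A junction at offset j is counted as cutting a gene when start < j < end.
    (Assumed scoring rule, chosen only to illustrate the idea.)
    """
    def genes_cut(j):
        return sum(1 for start, end in gene_intervals if start < j < end)

    # min() returns the first offset that has the lowest count.
    return min(range(overlap_len + 1), key=genes_cut)


def join_clones(left, right, overlap_len, junction):
    """Join two clone sequences whose ends overlap by overlap_len bases.

    The left clone is kept up to the junction and the right clone is kept
    from the junction onward, so the overlap appears only once.
    """
    if not 0 <= junction <= overlap_len:
        raise ValueError("junction must lie within the overlap")
    keep_left = len(left) - overlap_len + junction
    return left[:keep_left] + right[junction:]


if __name__ == "__main__":
    # Hypothetical toy clones sharing a 4-base overlap ("CCGG").
    left, right = "AAAACCGG", "CCGGTTTT"
    # One hypothetical gene that starts in the left clone and ends 2 bases
    # into the overlap.
    junction = choose_junction(4, [(-3, 2)])
    print(junction, join_clones(left, right, 4, junction))
```

When the overlap is identical in both clones, the joined sequence is the same for every junction; the choice matters for which clone's gene models are carried over and for the rare overlaps that contain discrepancies.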
The total length of the 12 pseudomolecules is 370.6 Mb, smaller than the 430 Mb genome size reported for rice (Arumuganathan and Earle, 1991).
Recently, another public version of the rice pseudomolecules was released by the IRGSP (http://rgp.dna.affrc.go.jp/IRGSP/Build3/build3.html). While the foundation of both pseudomolecule sets is the same, our pseudomolecules and those of the IRGSP differ in several ways. First, the IRGSP molecules were constructed in July 2004 and thus represent a slightly older build with a higher percentage of unfinished sequence. Second, the IRGSP utilized a "left greedy" approach in constructing the pseudomolecules, and, thus, sequence in the overlap regions between the two builds will differ on occasion. Third, to date, no annotation of the genes is available from the IRGSP pseudomolecules, although the IRGSP pseudomolecules were the substrate used in the Rice Annotation Project 1 annotation effort in which FL-cDNAs were annotated (http://rgp.dna.affrc.go.jp/IRGSP/Build3/build3.html) by a community of rice and bioinformatics experts.
In release 3 of our pseudomolecules, we identified a total of 57,915 genes with 14,196 related to transposable elements (TEs), leaving 43,719 non-TE-related genes (Table I). Of these non-TE-related genes, we were able to assign a putative function to 18,545 genes (42.4%), while 5,777 (13.2%) were annotated as encoding an expressed protein due to the presence of EST and/or FL-cDNA support. The remaining 19,397 genes (44.4%) lacked experimental evidence or sequence similarity with known proteins or domains and were annotated as encoding hypothetical proteins. Pack-MULEs, which are chimeric Mutator-like elements that have assimilated host sequences, were reported in the rice genome by Jiang et al. (2004).
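The gene counts and percentages reported above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch; the numbers are taken from the text, while the variable and function names are illustrative only.

```python
# Release 3 gene counts as reported in the text.
total_genes = 57_915
te_related = 14_196
putative_function = 18_545
expressed_protein = 5_777
hypothetical_protein = 19_397

# Genes remaining once TE-related genes are set aside.
non_te = total_genes - te_related


def percent(part, whole):
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100.0 * part / whole, 1)


if __name__ == "__main__":
    print("non-TE genes:", non_te)
    # The three annotation classes should account for every non-TE gene.
    print("classes sum to non-TE total:",
          putative_function + expressed_protein + hypothetical_protein == non_te)
    for label, count in [("putative function", putative_function),
                         ("expressed protein", expressed_protein),
                         ("hypothetical protein", hypothetical_protein)]:
        print(label, percent(count, non_te))
```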
Abundant transcript evidence (approximately 300,000 ESTs and approximately 32,000 FL-cDNAs) is available for rice and provides evidence for gene expression as well as experimental data to refine gene model structure. Using cutoff criteria of 95% identity over 50% of the length of the EST/FL-cDNA sequence, 25,410 genes could be aligned with either a rice EST and/or FL-cDNA, suggesting that at least 43.9% of the genes are expressed. To provide users with access to the pattern and frequency of gene expression, we developed the Expression Viewer Tool (http://rice.tigr.org/tdb/e2k1/osa1/expression/expression.info.shtml) for gene models in the Osa1 database. Alternative splicing does occur in rice, and we identified 2,538 genes representing a total of 5,873 alternative splice forms in the rice genome using both EST and FL-cDNA evidence. These can be viewed in our Alternative Splice Form Viewer Tool (http://rice.tigr.org/tdb/e2k1/osa1/expression/alt_spliced.info.shtml). To improve the gene structure in an automated manner, we used the PASA program, which employs more stringent criteria than a simple alignment of EST and FL-cDNA sequence to the genome (Haas et al., 2003).
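A cutoff of this kind can be sketched as a simple filter. The text does not state how identity and coverage were computed, so the definitions below (identity as matches over aligned bases, coverage as aligned bases over transcript length) and all names are assumptions for illustration. The same sketch also checks the 43.9% figure and the total of 61,250 gene models given in the abstract, on the reading that each alternatively spliced gene contributes all of its splice forms in place of a single model.

```python
from dataclasses import dataclass


@dataclass
class TranscriptAlignment:
    """One EST or FL-cDNA aligned to the genome (hypothetical record layout)."""
    transcript_id: str
    transcript_length: int  # full length of the transcript, in bases
    aligned_length: int     # bases of the transcript placed on the genome
    matches: int            # identical bases within the aligned region


def passes_cutoff(aln, min_identity=0.95, min_coverage=0.50):
    """Apply the 95% identity over 50% of transcript length criteria."""
    if aln.aligned_length == 0 or aln.transcript_length == 0:
        return False
    identity = aln.matches / aln.aligned_length
    coverage = aln.aligned_length / aln.transcript_length
    return identity >= min_identity and coverage >= min_coverage


# Figures reported in the text.
total_genes = 57_915
genes_with_transcript_support = 25_410
alt_spliced_genes = 2_538
alt_splice_forms = 5_873

percent_expressed = round(100.0 * genes_with_transcript_support / total_genes, 1)
gene_models = total_genes - alt_spliced_genes + alt_splice_forms

if __name__ == "__main__":
    example = TranscriptAlignment("est_example", 1000, 600, 580)  # hypothetical
    print(passes_cutoff(example), percent_expressed, gene_models)
```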
In previous releases of our pseudomolecules, we referred to the genes and gene models using an internal identifier (feat_name) that was cumbersome for the user and not readily convertible through releases of the pseudomolecules. However, implementation of a more stable identifier prior to release 3 was not feasible due to the instability of the unfinished genome sequence. In release 3, 98% of the underlying sequence is finished with the remaining 2% sequence derived from phase II HGS BAC/PAC clones, which although unfinished are in ordered and oriented contigs. Thus, for release 3, we have implemented locus identifiers for the genes. The convention we have implemented is similar to the nomenclature used in Arabidopsis with adaptations made for the larger size of the rice genome and to the nomenclature currently under discussion by the rice community (http://www.gramene.org/documentation/nomenclature/). However, as the nomenclature has not been finalized and alternative builds of the pseudomolecules are publicly available (http://rgp.dna.affrc.go.jp/IRGSP/Build3/build3.html), we have chosen to use locus nomenclature that clearly denotes the TIGR loci. Each gene is labeled LOC_OsXXg#####, with LOC referring to locus, Os referring to rice, XX referring to chromosome (01-12), g referring to gene, and a 5-digit number referring to the gene order on the chromosome. We have sequentially numbered the genes (loci) on each of the chromosomes in increments of 10 to allow for insertion of future loci. To accommodate additional sequence that may be identified in the physical gaps, we have provided larger spacing in the locus numbering at the physical gaps. As locus identifiers are associated with release 3 and not our previous releases, we have developed a Version Converter (http://rice.tigr.org/tdb/e2k1/osa1/v_converter/index.shtml) to allow users to find locus identifiers for genes and models from releases 1 and 2, which had been identified solely with feat_names. A tab-delimited flat file is also available at the project FTP site. As with the addition of newly identified loci, we have scripts in place to handle merging, splitting, and retirement of locus identifiers as the annotation improves over the course of our project.
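The LOC_OsXXg##### convention lends itself to a small parser and formatter. The sketch below assumes a two-digit chromosome from 01 to 12 and a five-digit gene number, as the convention describes; the function names, the starting number of 10, and the example identifiers are hypothetical and are not taken from the Osa1 database.

```python
import re

# LOC_Os + two-digit chromosome + "g" + five-digit gene number.
LOCUS_PATTERN = re.compile(r"LOC_Os(\d{2})g(\d{5})")


def parse_locus(locus_id):
    """Split a locus identifier into (chromosome, gene_number)."""
    match = LOCUS_PATTERN.fullmatch(locus_id)
    if match is None:
        raise ValueError(f"not a LOC_OsXXg##### identifier: {locus_id!r}")
    chromosome = int(match.group(1))
    if not 1 <= chromosome <= 12:
        raise ValueError(f"chromosome out of range: {chromosome}")
    return chromosome, int(match.group(2))


def format_locus(chromosome, gene_number):
    """Build a locus identifier from a chromosome and a gene number."""
    if not 1 <= chromosome <= 12:
        raise ValueError("rice has 12 chromosomes")
    if not 0 <= gene_number <= 99_999:
        raise ValueError("gene number must fit in five digits")
    return f"LOC_Os{chromosome:02d}g{gene_number:05d}"


def number_loci(chromosome, gene_count, start=10, step=10):
    """Number genes along a chromosome in increments of `step`.

    Spacing of 10 leaves room to insert future loci between neighbors.
    The starting value is an assumption made for this sketch.
    """
    return [format_locus(chromosome, start + i * step) for i in range(gene_count)]


if __name__ == "__main__":
    print(number_loci(1, 3))
    print(parse_locus("LOC_Os12g00120"))
```

A locus discovered later between the first two genes in this example could take any unused number between them, which is the purpose of the increments of 10.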
The rice EGC annotation pipeline is highly automated. Manual curation, in which a trained annotator inspects the gene model and evidence(s) supporting the model and then creates the most congruent model possible, is considered the highest level of annotation possible. While this is clearly desirable at the whole-genome level, this is labor intensive and not feasible with a genome the size of rice or with the iterative updates of the annotation that are required with incremental releases of new experimental evidence. However, a portion of the release 3 gene model set is derived from manually curated genes (282 BACs, 5,420 genes). This provided us the opportunity to assess the accuracy of our automated annotation pipeline to determine the qualitative impact of automation on annotation. As shown in Table II, our automated annotation pipeline captures similar gene structure as manually curated genes. One clear difference is the slightly higher density of genes in manually curated annotation versus automated annotation, suggesting lack of capture of all genes in the automated method. This can be explained by the use of a sole ab initio gene finder (FGENESH) for the identification of genes in the automated pipeline. With manual curation, an annotator would examine all ab initio gene finder output (five programs in total) as well as experimental evidence and create a gene if the evidence warrants. Another feature that differs in manual versus automated annotation is the assignment of putative function, a subjective process at best. This is clearly an aspect of annotation that can be highly benefited by manual curation, and we will be addressing this in future curation activities through annotation of paralogous families. Another aspect that we feel warrants manual inspection is curation of gene model structure of genes with EST and/or FL-cDNA evidence that failed our PASA validation tests (3,948 genes in release 3).
In addition to structural annotation of the genes, we annotated a number of other data types within the rice genome. At the sequence level, we have mapped >10,000 sequence-based genetic markers derived primarily from rice to our pseudomolecules (http://rice.tigr.org/tdb/e2k1/osa1/BACmapping/markers_physical_map.shtml). This collection of markers (13,900) includes RFLP-based markers (Causse et al., 1994). At a stringency of 97% identity over 90% of the marker length, 10,279 markers could be aligned to 10,695 locations within the rice genome. Reducing this stringency to 95% identity over 90% of the marker length resulted in 10,945 markers aligning to 11,630 locations within the genome. Markers with multiple alignments within the genome are denoted on our Rice Genetic Marker search page.
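Because a marker can align to more than one place, the number of aligned markers and the number of genomic locations differ. The following minimal sketch shows how such counts, and the list of multiply aligned markers, could be derived from a table of hits. The tuple layout, function name, and toy data are assumptions for illustration and do not describe the actual marker pipeline.

```python
from collections import defaultdict


def summarize_marker_hits(hits, min_identity, min_coverage=0.90):
    """Count aligned markers, distinct locations, and multi-location markers.

    hits is an iterable of (marker_id, identity, coverage, location) tuples,
    with identity and coverage expressed as fractions between 0 and 1.
    """
    locations = defaultdict(set)
    for marker_id, identity, coverage, location in hits:
        if identity >= min_identity and coverage >= min_coverage:
            locations[marker_id].add(location)
    n_markers = len(locations)
    n_locations = sum(len(places) for places in locations.values())
    multi = sorted(m for m, places in locations.items() if len(places) > 1)
    return n_markers, n_locations, multi


if __name__ == "__main__":
    # Hypothetical hits: marker_id, identity, coverage, (chromosome, start).
    toy_hits = [
        ("M1", 0.99, 0.95, ("Chr1", 1_000)),
        ("M1", 0.98, 0.93, ("Chr5", 42_000)),
        ("M2", 0.96, 0.92, ("Chr2", 7_500)),
        ("M3", 0.99, 0.80, ("Chr3", 100)),
    ]
    print(summarize_marker_hits(toy_hits, min_identity=0.97))
    print(summarize_marker_hits(toy_hits, min_identity=0.95))
```

Relaxing the identity threshold in the toy data admits one more marker, mirroring the direction of the change reported when moving from 97% to 95% identity.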
Multiple efforts are under way to generate a large collection of tagged insertion mutants in rice using T-DNA tagging (Sallaud et al., 2004).
To provide resources for other plant biologists, we have aligned the rice genome with sequences from other plant species. These alignments are displayed through several Web interfaces: through a single interface aligning the pseudomolecules with 20 of the TIGR Plant Gene Indices (http://rice.tigr.org/tigr-scripts/alignTC/db/chrs.pl?taxon=16), through the evidence window on the Manatee page for each gene, and through a genome browser (see below). For the alignments with the Gene Indices, we used the BLAT program (Kent, 2002).
At the proteome level, we provide several layers of annotation. A catalog of domains and motifs for the predicted rice proteome is available, including Pfam domains (Bateman et al., 2002).
Multiple access modes are available for the sequence and annotation data associated with the current release. Perhaps the most user-friendly access is a set of Web pages with text and search tools for searching the sequence and annotation data. These can all be accessed through the project homepage (http://rice.tigr.org). In addition, all of the sequence and a majority of the annotation data are available through anonymous FTP download in XML and GFF3 format (ftp://ftp.tigr.org/pub/data/Eukaryotic_Projects/o_sativa/annotation_dbs/). A subset of the sequence and annotation data can be accessed through the Data Extractor tool, which generates flat files of user-selected datasets (http://rice.tigr.org/tdb/e2k1/osa1/data_download.shtml). In addition, the current sequence and annotation are available through a Genome Browser (Fig. 2; Stein et al., 2002).
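For users working with the GFF3 downloads, a feature line can be read with a short parser. The sketch below follows the general GFF3 layout of nine tab-separated columns with a semicolon-separated attribute column. The example line, including its sequence name, source, and attribute values, is hypothetical and not copied from the Osa1 files.

```python
from urllib.parse import unquote

GFF3_COLUMNS = ("seqid", "source", "type", "start", "end",
                "score", "strand", "phase", "attributes")


def parse_gff3_line(line):
    """Parse one GFF3 feature line into a dict, or return None for
    blank lines and comment/directive lines (those starting with '#')."""
    if not line.strip() or line.startswith("#"):
        return None
    columns = line.rstrip("\n").split("\t")
    if len(columns) != len(GFF3_COLUMNS):
        raise ValueError(f"expected 9 tab-separated columns, got {len(columns)}")
    record = dict(zip(GFF3_COLUMNS, columns))
    # Coordinates are 1-based and inclusive in GFF3.
    record["start"] = int(record["start"])
    record["end"] = int(record["end"])
    attributes = {}
    if record["attributes"] != ".":
        for pair in record["attributes"].split(";"):
            if pair:
                key, _, value = pair.partition("=")
                # Reserved characters are percent-encoded in GFF3.
                attributes[unquote(key)] = unquote(value)
    record["attributes"] = attributes
    return record


if __name__ == "__main__":
    # Hypothetical feature line for illustration only.
    example = ("Chr1\tTIGR\tgene\t1000\t2500\t.\t+\t.\t"
               "ID=LOC_Os01g00010;Note=hypothetical%20protein")
    feature = parse_gff3_line(example)
    print(feature["type"], feature["start"], feature["end"],
          feature["attributes"]["ID"])
```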
The size of the rice genome sequence and the volume of accompanying annotation data limit our ability to rapidly provide and track updates in these two data sets. Thus, we schedule updates to the sequence (i.e. the pseudomolecules) and the accompanying annotation on a biannual basis. Release 1 of the pseudomolecules was in September 2003, followed by release 2 in April 2004. The third release was made in late December 2004. Each release has been accompanied with a major improvement in sequence and annotation quality. With each release, we have been able to expand the annotation data types to augment the data available to the community. In the near future, we will be providing updates quarterly through addition of new tracks in the Genome Browser. With the release of the IRGSP pseudomolecules, we will provide users alignments between our pseudomolecules and those of the IRGSP to provide a cross-reference of these two datasets. We will also provide any annotation made available by the IRGSP or other entities as separate tracks in our Genome Browser, which will allow the users of the Osa1 database the ability to see alternative annotations of the rice genome and select their preferred annotation.
The efforts of the TIGR Bioinformatics department in generating a suite of tools and resources for eukaryotic sequence and annotation are appreciated.
Received December 31, 2004; returned for revision February 24, 2005; accepted March 21, 2005.
1 This work (on rice genome annotation) was supported by the National Science Foundation (grant no. DBI-0321538 to C.R.B.) and the U.S. Department of Agriculture (grant no. 2003-35317-13173 to C.R.B.).
2 Present address: Laboratory of Neurogenetics, NIAAA, NIH, 5625 Fishers Lane, Suite 3532, MSC 9412, Bethesda, MD 20892.
* Corresponding author; e-mail rbuell@tigr.org; fax 301-838-0208.
www.plantphysiol.org/cgi/doi/10.1104/pp.104.059063.
Arumuganathan K, Earle ED (1991) Nuclear DNA content of some important plant species. Plant Mol Biol Rep 9: 208-219
Barry GF (2001) The use of the Monsanto draft rice genome sequence in research. Plant Physiol 125: 1164-1165
Bateman A, Birney E, Cerruti L, Durbin R, Etwiller L, Eddy SR, Griffiths-Jones S, Howe KL, Marshall M, Sonnhammer EL (2002) The Pfam protein families database. Nucleic Acids Res 30: 276-280
Burge C, Karlin S (1997) Prediction of complete gene structures in human genomic DNA. J Mol Biol 268: 78-94
Causse MA, Fulton TM, Cho YG, Ahn SN, Chunwongse J, Wu K, Xiao J, Yu Z, Ronald PC, Harrington SE, et al (1994) Saturated molecular map of the rice genome based on an interspecific backcross population. Genetics 138: 1251-1274
Dowell RD, Jokerst RM, Day A, Eddy SR, Stein L (2001) The Distributed Annotation System. BMC Bioinformatics 2: 7
Gale MD, Devos KM (1998) Comparative genetics in the grasses. Proc Natl Acad Sci USA 95: 1971-1974
The Gene Ontology Consortium (2000) Gene ontology: tool for the unification of biology. Nat Genet 25: 25-29
Goff SA, Ricke D, Lan TH, Presting G, Wang R, Dunn M, Glazebrook J, Sessions A, Oeller P, Varma H, et al (2002) A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science 296: 92-100
Greco R, Ouwerkerk PB, Taal AJ, Favalli C, Beguiristain T, Puigdomenech P, Colombo L, Hoge JH, Pereira A (2001) Early and multiple Ac transpositions in rice suitable for efficient insertional mutagenesis. Plant Mol Biol 46: 215-227
Haas BJ, Delcher AL, Mount SM, Wortman JR, Smith RK Jr, Hannick LI, Maiti R, Ronning CM, Rusch DB, Town CD, et al (2003) Improving the Arabidopsis genome annotation using maximal transcript alignment assemblies. Nucleic Acids Res 31: 5654-5666
Harushima Y, Yano M, Shomura A, Sato M, Shimano T, Kuboki Y, Yamamoto T, Lin S, Antonio BA, Parco A, et al (1998) A high-density rice genetic linkage map with 2275 markers using a single F2 population. Genetics 148: 479-494
Huang X, Adams MD, Zhou H, Kerlavage AR (1997) A tool for analyzing and annotating genomic sequences. Genomics 46: 37-45
Jiang N, Bao Z, Zhang X, Eddy SR, Wessler SR (2004) Pack-MULE transposable elements mediate gene evolution in plants. Nature 431: 569-573
Juretic N, Bureau TE, Bruskiewich RM (2004) Transposable element annotation of the rice genome. Bioinformatics 20: 155-160
Kent WJ (2002) BLAT - the BLAST-like alignment tool. Genome Res 12: 656-664
Kim CM, Piao HL, Park SJ, Chon NS, Je BI, Sun B, Park SH, Park JY, Lee EJ, Kim MJ, et al (2004) Rapid, large-scale generation of Ds transposant lines and analysis of the Ds insertion sites in rice. Plant J 39: 252-263
Krogh A, Larsson B, von Heijne G, Sonnhammer ELL (2001) Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol 305: 567-580
Lowe TM, Eddy SR (1997) tRNAscan-SE: a program for improved detection of transfer RNA genes in genomic sequence. Nucleic Acids Res 25: 955-964
Lukashin AV, Borodovsky M (1998) GeneMark.hmm: new solutions for gene finding. Nucleic Acids Res 26: 1107-1115
Miyao A, Tanaka K, Murata K, Sawaki H, Takeda S, Abe K, Shinozuka Y, Onosato K, Hirochika H (2003) Target site specificity of the Tos17 retrotransposon shows a preference for insertion within genes and against insertion in retrotransposon-rich regions of the genome. Plant Cell 15: 1771-1780
Mulder NJ, Apweiler R, Attwood TK, Bairoch A, Barrell D, Bateman A, Binns D, Biswas M, Bradley P, Bork P, et al (2003) The InterPro Database, 2003 brings increased coverage and new features. Nucleic Acids Res 31: 315-318
Nielsen H, Engelbrecht J, Brunak S, von Heijne G (1997) Identification of prokaryotic and eukaryotic signal peptides and prediction of their cleavage sites. Protein Eng 10: 1-6
Ouyang S, Buell CR (2004) The TIGR Plant Repeat Databases: a collective resource for identification of repetitive sequences in plants. Nucleic Acids Res (Database Issue) 32: D360-D363
Pertea M, Lin X, Salzberg SL (2001) GeneSplicer: a new computational method for splice site prediction. Nucleic Acids Res 29: 1185-1190
Qi LL, Echalier B, Chao S, Lazo GR, Butler GE, Anderson OD, Akhunov ED, Dvorak J, Linkiewicz AM, Ratnasiri A, et al (2004) A chromosome bin map of 16,000 expressed sequence tag loci and distribution of genes among the three genomes of polyploid wheat. Genetics 168: 701-712
Quackenbush J, Cho J, Lee D, Liang F, Holt I, Karamycheva S, Parvizi B, Pertea G, Sultana R, White J (2001) The TIGR Gene Indices: analysis of gene transcript sequences in highly sampled eukaryotic species. Nucleic Acids Res 29: 159-164
The Rice Full-Length cDNA Consortium (2003) Collection, mapping, and annotation of over 28,000 cDNA clones from japonica rice. Science 301: 376-379
Salamov AA, Solovyev VV (2000) Ab initio gene finding in Drosophila genomic DNA. Genome Res 10: 516-522
Sallaud C, Gay C, Larmande P, Bes M, Piffanelli P, Piegu B, Droc G, Regad F, Bourgeois E, Meynard D (2004) High throughput T-DNA insertion mutagenesis in rice: a first step towards in silico reverse genetics. Plant J 39: 450-464
Sasaki T, Burr B (2000) International Rice Genome Sequencing Project: the effort to completely sequence the rice genome. Curr Opin Plant Biol 3: 138-141
Stein LD, Mungall C, Shu S, Caudy M, Mangone M, Day A, Nickerson E, Stajich JE, Harris TW, Arva A, et al (2002) The generic genome browser: a building block for a model organism system database. Genome Res 12: 1599-1610
Wortman JR, Haas BJ, Hannick LI, Smith RK, Maiti R, Ronning CM, Chan AP, Yu C, Ayele M, Whitelaw CA, et al (2003) Annotation of the Arabidopsis genome. Plant Physiol 132: 461-468
Wu J, Maehara T, Shimokawa T, Yamamoto S, Harada C, Takazaki Y, Ono N, Mukai Y, Koike K, Yazaki J, et al (2002) A comprehensive rice transcript map containing 6591 expressed sequence tag sites. Plant Cell 14: 525-535
Yu J, Hu S, Wang J, Wong GK, Li S, Liu B, Deng Y, Dai L, Zhou Y, Zhang X, et al (2002) A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science 296: 79-92
Yuan Q, Ouyang S, Liu J, Suh B, Cheung F, Sultana R, Lee D, Quackenbush J, Buell CR (2001) The TIGR Rice Genome Annotation Resource: annotating the rice genome and creating resources for plant biologists. Nucleic Acids Res 31: 229-233
Zdobnov EM, Apweiler R (2001) InterProScan - an integration platform for the signature-recognition methods in InterPro. Bioinformatics 17: 847-848