Plant Physiol, March 2001, Vol. 125, pp. 1166-1174

Rice Bioinformatics. Analysis of Rice Sequence Data and Leveraging the Data to Other Plant Species¹

The Institute for Genomic Research, 9712 Medical Center Drive, Rockville, Maryland 20850
Rice (Oryza sativa) is a model species for monocotyledonous plants, especially for members of the grass family. Several attributes, such as small genome size, diploid nature, transformability, and established genetic and molecular resources, make it a tractable organism for plant biologists. With an estimated genome size of 430 Mb (Arumuganathan and Earle, 1991), it is feasible to obtain the complete genome sequence of rice using current technologies. An international effort has been established and is in the process of sequencing O. sativa ssp. japonica var "Nipponbare" using a bacterial artificial chromosome/P1 artificial chromosome shotgun sequencing strategy. Annotation of the rice genome is performed using prediction-based and homology-based searches to identify genes. Annotation tools such as optimized gene prediction programs are being developed for rice to improve the quality of annotation. Resources are also being developed to leverage the rice genome sequence to partial genome projects such as expressed sequence tag projects, thereby maximizing the output from the rice genome project. To provide a low level of annotation for rice genomic sequences, we have aligned all rice bacterial artificial chromosome/P1 artificial chromosome sequences with The Institute for Genomic Research Gene Indices, a set of nonredundant transcripts generated from nine public plant expressed sequence tag projects (rice, wheat, sorghum, maize, barley, Arabidopsis, tomato, potato, and barrel medic). In addition, we have used data from The Institute for Genomic Research Gene Indices and the Arabidopsis and Rice Genome Projects to identify putative orthologues and paralogues among these nine genomes.
CURRENT STATUS OF PLANT GENOMICS

Plant biologists have capitalized on the advances in sequencing
technologies made within the last decade. As of October 27, 2000, there
are over 102 plant species (among 278 species total) represented in the
expressed sequence tag (EST) division (dbEST) of GenBank
(http://www.ncbi.nlm.nih.gov/dbEST/dbEST_summary.html). These
various EST projects collectively represent over 835,884 entries (among
6,259,492 total) in GenBank. At the forefront of plant genomics is the
model dicotyledonous plant, Arabidopsis. Starting with the Arabidopsis
EST project in the early 1990s (Hofte et al., 1993)

With the completion of the Arabidopsis genome, plant biologists will
have the opportunity to assess the entire gene complement of a plant
for the first time. New avenues of research have begun that will
culminate in determining the function of every gene in Arabidopsis
(http://www.nsf.gov/pubs/2001/nsf0113/nsf0113.htm). Although the
complete sequence and the subsequent analyses are an immense
achievement in plant biology, Arabidopsis cannot be utilized to address
all aspects of plant growth, development, and reproduction. For other
plant species that represent diverse physiological and developmental
programs, complete genomic sequencing is unlikely to be completed in
the foreseeable future. Thus, sequencing of ESTs remains the primary
tool for genomic exploration and for functional genomics analyses. The
value of EST resources can be greatly enhanced if the data are used to
reconstruct a high-fidelity set of nonredundant transcripts such as
gene indices (Liang et al., 2000a).

A major class of plants not represented by Arabidopsis is the
monocotyledonous plants. Development of a model monocot species
to parallel the achievements in Arabidopsis would have a tremendous
impact on plant biology. One of the most important families of monocots
is the Gramineae family as it includes major agricultural crop species
such as maize, wheat, barley, sugarcane, sorghum, and rice. These grass
species share extensive synteny across their genomes, allowing for one
species to serve as the base for comparative genomics within the family
(Moore et al., 1995).

Due to several factors, rice is the most tractable species for
genomic applications in a cereal. Perhaps the most significant factor
in selecting a model species is the small genome size of rice compared
with other members of the Gramineae family. The genome size of rice is
estimated to be 430 Mb (Arumuganathan and Earle, 1991).

The current strategy employed by the International Rice Genome Sequencing Project (IRGSP; http://rgp.dna.affrc.go.jp/Seqcollab.html) is a bacterial artificial chromosome (BAC) or P1 artificial chromosome (PAC) shotgun sequencing approach. In this approach, a minimally overlapping set, or tile, of clones that spans each chromosome is identified, and these clones are sequenced in a high throughput fashion. The BAC/PAC sequences are assembled individually and the entire chromosome is then assembled from the overlapping BAC/PAC sequences. The sequences for each BAC/PAC are then annotated for gene function and the data are released to the public. Regions of the rice genome are allocated to the IRGSP participants on a chromosomal level. Table II lists the participants of the IRGSP with the corresponding web site for each sequencing center.
IRGSP progress can be monitored at several levels. BAC/PAC sequences can be submitted as unfinished sequences to the high throughput genomic sequence (HTGS) division of GenBank at the Phase 1, Phase 2, and Phase 3 levels. Phase 1 submissions consist of unordered, unoriented assemblies greater than 2 kbp in length. Phase 2 submissions consist of ordered, oriented assemblies greater than 2 kbp. Phase 3 submissions contain no gaps and may contain annotation. Upon completion, the BAC/PAC sequence is moved to the PLN division of GenBank, at which time annotation may be added.
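The choice of a minimally overlapping set of clones can be pictured as an interval-covering problem. The sketch below is illustrative only and is not the IRGSP procedure: it assumes that every clone already has known start and end coordinates on the chromosome, which a real project would have to infer from mapping data. The function name `minimal_tiling_path` and the clone names are hypothetical.

```python
from typing import List, Tuple

# Hypothetical clone record: (clone name, start coordinate, end coordinate).
Clone = Tuple[str, int, int]


def minimal_tiling_path(clones: List[Clone], region_start: int,
                        region_end: int) -> List[str]:
    """Greedily pick the fewest clones that cover [region_start, region_end].

    At each step, among the clones that start at or before the position
    covered so far, the one reaching farthest to the right is chosen.
    """
    ordered = sorted(clones, key=lambda c: c[1])  # sort by start coordinate
    path: List[str] = []
    covered_to = region_start
    i = 0
    while covered_to < region_end:
        best = None
        # Scan every clone that overlaps (or abuts) the covered region.
        while i < len(ordered) and ordered[i][1] <= covered_to:
            if best is None or ordered[i][2] > best[2]:
                best = ordered[i]
            i += 1
        if best is None or best[2] <= covered_to:
            raise ValueError(f"gap in clone coverage at position {covered_to}")
        path.append(best[0])
        covered_to = best[2]
    return path


if __name__ == "__main__":
    example = [("a", 0, 100), ("b", 50, 120), ("c", 90, 220),
               ("d", 200, 300), ("e", 110, 210)]
    print(minimal_tiling_path(example, 0, 300))
```

Clones "b" and "e" are skipped in this example because "c" and "d" extend the covered region further, so sequencing them would add only redundant overlap.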
Table III lists the base pairs of rice DNA in GenBank as of November 9, 2000. A total of 65,681 entries comprising 28,282,731 bases of sequence have been submitted from the rice EST sequencing projects. Over 35 Mb of sequence is from the IRGSP and represents BAC/PAC shotgun sequencing efforts. Another 56.9 Mb is in the Genome Survey Sequence division and is derived from BAC end sequencing. Together with directed rice sequencing efforts by individual labs, these data place rice seventh in total number of bases of DNA/RNA in GenBank, representing the most sequence for any plant species, with the exception of Arabidopsis (ftp://ncbi.nlm.nih.gov/GenBank/gbrel.txt; GenBank Release 120.0).
In addition to the public BAC/PAC strategy of the IRGSP, Monsanto has sequenced 3,391 rice BACs; however, these clones were sequenced at a lower coverage than that of the IRGSP and thus are of reduced quality (http://www.rice-research.org). From this partial coverage, 259 Mb of assembled sequence data is present in 52,202 contigs. The Monsanto draft of the rice genome is available for basic local alignment searches to academic researchers through a licensing agreement (http://www.rice-research.org). As with the public IRGSP approach, Monsanto used O. sativa ssp. japonica variety "Nipponbare," allowing for integration of the Monsanto partial sequence with the public IRGSP effort. In addition, Monsanto has analyzed the draft sequence for simple sequence repeats, which are invaluable in mapping studies. A file of approximately 7,000 simple sequence repeats with flanking DNA and putative map location (if known) is available to researchers through a free download on the Monsanto web site (http://www.rice-research.org).
Integration of Rice Genetic and Physical Maps

The data from the IRGSP can be utilized even at this early stage. BAC end sequences were used to accelerate the identification of BAC clones that are anchored to the rice genetic map. In Yuan et al. (2000)

Another important tool in positional cloning is the rapid identification of new reagents. We have implemented an automated process to download rice sequences from GenBank and search them against the rice genetic markers. Using a high stringency cutoff, we display the alignment of the markers with the BACs/PACs on the TIGR web site (http://www.tigr.org/tdb/rice/BACmapping/description.shtml). Through this in silico alignment, we have generated a high-resolution map of rice for positional cloning purposes.

Rice Databases

In a genome project it is imperative that the sequence information is rapidly integrated with available genetic, marker, clone, and other resources, as this will maximize the ability of researchers to utilize the data. Such integration was invaluable in the Arabidopsis Genome Initiative, where a single database, initially Arabidopsis Database (AtDB) and then The Arabidopsis Information Resource (http://www.Arabidopsis.org), served as a central repository for molecular, genetic, clone, and sequence data for Arabidopsis biologists. Likewise, with the generation of a vast amount of rice genomic sequence data, the necessity to integrate rice sequence data with other information from rice genetics, breeding, physiology, and biochemistry is apparent. Several centers have developed databases to integrate rice data from multiple sources (Table IV) and will be working to integrate the sequence data from the IRGSP.
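The automated marker search described under Integration of Rice Genetic and Physical Maps can be pictured as a filter over alignment hits. The following is a minimal sketch under stated assumptions: the article does not give the cutoff values, so the thresholds `min_identity` and `min_coverage`, the `Hit` record, and the marker and clone names are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Hit:
    """One alignment between a genetic marker and a BAC/PAC sequence."""
    marker: str
    clone: str
    percent_identity: float
    aligned_length: int
    marker_length: int


def anchor_clones(hits: List[Hit], min_identity: float = 95.0,
                  min_coverage: float = 0.9) -> Dict[str, List[str]]:
    """Keep only high-stringency hits and group the clones by marker.

    The default thresholds are placeholders, not the values used for the
    published alignments.
    """
    anchored: Dict[str, List[str]] = defaultdict(list)
    for hit in hits:
        coverage = hit.aligned_length / hit.marker_length
        if hit.percent_identity >= min_identity and coverage >= min_coverage:
            if hit.clone not in anchored[hit.marker]:
                anchored[hit.marker].append(hit.clone)
    return dict(anchored)


if __name__ == "__main__":
    hits = [
        Hit("M1", "BAC_A", 99.0, 480, 500),  # passes both cutoffs
        Hit("M1", "BAC_B", 88.0, 500, 500),  # identity too low
        Hit("M2", "BAC_A", 97.0, 200, 500),  # covers too little of the marker
    ]
    print(anchor_clones(hits))
```

Requiring both high identity and near-complete coverage of the marker is one common way to avoid anchoring a clone on a partial or paralogous match.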
One valuable step in analyzing rice genomic sequences is the
identification of repetitive sequences, which can otherwise lead to
false associations. To provide a tool to identify repetitive sequences in
rice genomic DNA, we constructed a rice repeat database using known,
curated rice sequences available from GenBank (Yuan et al., 2000).
Annotation of Completed BAC/PAC Sequences

Annotation involves adding biological information to DNA sequences and includes identification of genes as well as other miscellaneous features. The annotation system takes a DNA sequence and searches it against a comprehensive protein sequence database to identify similar proteins and against DNA databases such as EST and BAC databases to identify similar gene structures. It also uses ab initio gene finders to identify potentially new genes. The results of these searches are displayed in a graphical genome viewer for a human annotator to analyze. The annotator can modify the model of exon-intron structure for any gene, and can assign a gene name. After careful curation, these data are submitted to public databases (e.g. GenBank, EMBL, and DDBJ) and are displayed on the appropriate web sites. An example of a single-gene model from an annotated rice BAC is shown in Figure 2. The evidence from the database searches and the output from the gene prediction programs are shown along with the final working model constructed by the annotator. This model, OSJNBa0051D19.18, encodes a putative alpha-galactosidase.
Features of Annotated Rice Genomic Sequences

We have annotated 1.33 Mb of rice genomic sequence from completed BACs on the lower arm of chromosome 10 and thus far we have identified a total of 235 genes. The average rice gene is 2.2 kbp with 3.9 exons and 2.9 introns. The density of rice genes is one gene per 5.7 kbp. The G/C content of rice differs between the coding region (59.1% G/C) and the intergenic region (39.1% G/C). We could identify a putative function for 116 of the genes (49.4%), leaving 50.6% of the genes as unknown (having similarity to transcripts with no known function) or hypothetical (having no evidence other than the prediction of a gene finder).

Preliminary Annotation of Rice BACs

Valuable insight into the gene content of BAC/PACs can be obtained prior to closure of the clone, which is the rate-limiting step in high throughput sequencing. We have developed an automated mechanism to search our unfinished rice BACs (Phases 1 and 2 of HTGS) using available gene prediction and homology-based programs to provide preliminary annotation of rice BACs prior to their completion. The output from the searches is parsed and displayed on the TIGR web page (http://www.tigr.org/tdb/edb2/osa1/htmls/osa1.html). An automated mechanism is used to parse the "top hit" from the database search results and assign a putative name to the working model. This level of preliminary annotation provides researchers a first glimpse of the gene content in rice genome sequences.

Additional interspecies information can be gained by examining the alignment of the EST and gene sequence data in the TIGR Gene Indices with reference to the rice genome. Using all publicly available rice BAC sequence data, including finished, phase 1, and phase 2 sequences in the HTGS division of GenBank, we tabulated the alignment of the tentative consensus sequences (TCs) and singletons from the nine TIGR Plant Gene Indices with all rice BAC/PAC sequences (http://www.tigr.org/tdb/ogi/alignTC.html).
These alignments provide not only a low-level functional annotation of all available rice genomic sequence, but also insight into conservation of gene structure across the plant kingdom. An example alignment of an annotated gene on chromosome 10 can be seen in Figure 3. The coding sequence of a putative rice cytoplasmic malate dehydrogenase (OSJNBa0055P24.3) shows significant similarity in gene structure to that of other plant species, as is evident from an alignment between the rice genomic sequence and the various plant TCs. The multiple hits seen in some species may represent paralogues, gene families, alternative splice forms, or partial TC assemblies. However, a clear separation of gene structure is evident between the transcripts identified in the monocot versus the dicot species, as all monocot transcripts and no dicot transcripts contain the more distal 5' exon.
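The "top hit" naming step used for preliminary annotation can be sketched as follows. This is a simplified illustration, not the TIGR pipeline: the hit representation, the `min_score` parameter, and the function name `name_from_top_hit` are assumptions made for the example, and the fallback label follows the definition of a hypothetical gene given above (no evidence beyond a gene finder).

```python
from typing import List, Optional, Tuple

# Hypothetical search result: (description of the matched entry, alignment score).
SearchHit = Tuple[str, float]


def name_from_top_hit(hits: List[SearchHit], min_score: float = 0.0) -> str:
    """Assign a putative name to a gene model from its best-scoring hit.

    Hits scoring below min_score are ignored. A model with no usable hit
    is labeled as hypothetical.
    """
    best: Optional[SearchHit] = None
    for hit in hits:
        if hit[1] >= min_score and (best is None or hit[1] > best[1]):
            best = hit
    if best is None:
        return "hypothetical protein"
    return f"putative {best[0]}"


if __name__ == "__main__":
    print(name_from_top_hit([("alpha-galactosidase", 310.0),
                             ("protein kinase", 120.0)]))
    print(name_from_top_hit([]))
```

Because a name assigned this way depends on a single database match, it is a first approximation that a human annotator would still review.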
Improvement of Rice Annotation Tools

One current difficulty with rice annotation is the lack of accurate gene prediction programs. In some instances the gene prediction programs and the homology searches indicate a clear choice for the working model. However, in most cases the gene prediction programs do not agree perfectly with one another and are often in conflict with evidence from sequence homology search results. When similar protein sequences exist, the annotators almost always prefer this evidence over the output of gene finders. However, when sequence homology is very faint or nonexistent, gene prediction programs provide the only available information. As in all completed genomes, rice has a substantial number of genes that are hypothetical in that they are predicted solely on the basis of gene prediction programs. Thus, it is imperative that the quality of gene prediction programs be improved for rice.

GlimmerR, a Rice Gene Finder

A special version of GlimmerM (Salzberg et al., 1999)

Training of GlimmerR

Training the system required a set of confirmed genes from rice.
The training database was assembled as follows. Thirty-eight complete
rice genes were found in GenBank and had Medline references indicating
that genomic DNA and cDNA sequences were available. Next, all rice
entries with "complete coding regions" in the definition line were downloaded from GenBank and separated into cDNA and non-cDNA
groups. To match each cDNA to its corresponding genomic DNA, the two
sets of sequences were aligned using DDS (Huang et al., 1997).

The accuracy of the splice site module on the training set indicates that splice site detection is quite good; for the integrated system, a detection threshold had to be chosen for donor and acceptor sites. To avoid missing many true sites, the system was set with a donor site false negative rate of 0.44%, which corresponds to a 6.2% false positive rate, and an acceptor site false negative rate of 0.7%, which corresponds to a 9.5% false positive rate. (The false positive rate refers to the percentage of GT/AG dinucleotides that are mistakenly labeled as true splice sites.) When tested on the set of 172 complete genes, GlimmerR's predictions were exactly correct on 100 genes (58%). At the nucleotide level, the sensitivity was 94% (measured as the percentage of true coding bases correctly identified) and specificity was 97% (measured as the percentage of bases labeled as coding that were truly coding). The system exactly predicted 755 exons out of the total number of 921 true exons. One GenBank entry, accession number AF013580, was not included in the training data, but was added later. For this gene, with 3 exons and a total length of 870 bp, GlimmerR predicted all three exons correctly. As sequencing progresses and further genes are found and validated, the system's accuracy can be measured more precisely and new genes will be added to the training set to improve its performance.
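The nucleotide-level measures quoted above can be computed from per-base labels. The sketch below follows the definitions given in the text (note that "specificity" as defined here is what is elsewhere often called precision); the function name `coding_accuracy` and the toy label vectors are hypothetical and are not GlimmerR output.

```python
from typing import Sequence, Tuple


def coding_accuracy(true_coding: Sequence[bool],
                    predicted_coding: Sequence[bool]) -> Tuple[float, float]:
    """Return (sensitivity, specificity) at the nucleotide level.

    sensitivity: fraction of truly coding bases that were predicted coding.
    specificity: fraction of bases predicted coding that are truly coding.
    """
    if len(true_coding) != len(predicted_coding):
        raise ValueError("label sequences must have the same length")
    both = sum(1 for t, p in zip(true_coding, predicted_coding) if t and p)
    n_true = sum(1 for t in true_coding if t)
    n_pred = sum(1 for p in predicted_coding if p)
    if n_true == 0 or n_pred == 0:
        raise ValueError("need at least one true and one predicted coding base")
    return both / n_true, both / n_pred


if __name__ == "__main__":
    truth = [True, True, True, True, False, False, False, False]
    pred = [True, True, True, False, True, True, False, False]
    print(coding_accuracy(truth, pred))
    # Exon- and gene-level fractions from the counts reported in the text.
    print(round(755 / 921, 2), round(100 / 172, 2))
```

The last line re-derives the exon-level and gene-level fractions from the reported counts: 755 of 921 exons is about 82%, and 100 of 172 genes is about 58%.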
To provide additional information for functional genomic analysis
we have established the TOGA database
(http://www.tigr.org/tdb/toga/toga.shtml). Homologous genes can be
separated into two classes, orthologues and paralogues (Fitch, 1970).

An example plant TOG can be seen in Figure 4. This TOG (13934) contains rice TC28184 from Figure 1 along with two putative orthologues, one from maize and one from wheat. The high degree of sequence identity between the members of the TOG is apparent in the JALview alignment shown in Figure 4. Although orthologues are properly defined using functional information and protein sequence data, the stringent overlap criteria used to generate the TOGs provide a degree of confidence in the assignment. Further, the progress of the IRGSP, combined with mapping data to be generated by numerous plant EST projects, will provide additional data on syntenic relationships for sequences in the plant TOGA database that will assist in validating the TOG assignments.
Rice has numerous features that make it a model species for the grasses. Its small genome size is compatible with current genomic technologies and sequencing efforts are under way to determine the complete sequence of the rice genome. Tools and resources are being developed to maximally interpret the rice genome sequence. These include improvement of gene prediction programs, expansion of the rice EST and cDNA resources, and identification of molecular resources for mapping. As the data from rice genomics can be leveraged rapidly to other grass species, it is imperative that resources be developed to exploit rice in this manner. Current efforts to extend the rice sequence data to other genomes include the identification of putative orthologues, alignment of rice BAC/PAC sequences with plant gene indices, and integration of rice sequence data into comparative genetic maps. All of these efforts will continue to be refined as more sequence information is collected. As evidenced by the accomplishments of the Arabidopsis Genome Initiative, knowledge of rice and its close relatives in the grass family will increase rapidly in the next few years.
The authors are indebted to Anna Glodek for database development. The authors also wish to thank Michael Heaney and Susan Lo for database support, and Vadim Sapiro, Billy Lee, Sonja Gregory, Corey Irwin, Rajeev Kramchedu, Jackie Neubrech, Mark Sengamalay, and Eddie Arnold for computer system support. The authors wish to thank Lowell Umayan, Jeremy Peterson, Hanif Khalak, Patee Gesuwan, and Qi Yang for their informatic support. All the TIGR data and databases described in this article, as well as the GlimmerR software, are freely available from the TIGR web site (www.tigr.org) or upon direct request from the corresponding author.
Received November 16, 2000; accepted December 18, 2000.

¹ This work was supported in part by the U.S. Department of Agriculture (grant no. 99-35317-8275 to C.R.B.), by the National Science Foundation (grant no. DBI998282 to C.R.B.), and by the U.S. Department of Energy (grant no. DE-FG02-99ER20357 to C.R.B.). This work was also supported by the U.S. Department of Energy (grant no. DE-FG02-99ER62852 to J.Q.) and by the U.S. National Science Foundation (grant nos. DBI-9983070, DBI-9813392, and DBI-9975866 to J.Q.). J.Q. was also supported in part by the National Science Foundation (grant no. KDI-9980088). S.L.S. and M.P. were supported in part by the National Institutes of Health (grant no. R01-LM06845) and by the National Science Foundation (grant nos. KDI-9980088 and IIS-9902923).
* Corresponding author; e-mail rbuell{at}tigr.org; fax 301-838-0208.