|
|
||||||||
|
Plant Physiology 139:610-618 (2005) © 2005 American Society of Plant Biologists Comparative Plant Genomics Resources at PlantGDB1Department of Genetics, Development and Cell Biology (Q.D., C.J.L., S.D.S., M.D.W., V.B.), and Department of Statistics (V.B.), Iowa State University, Ames, Iowa 500113260; Zentrum für Bioinformatik, Universität Hamburg, 20146 Hamburg, Germany (S.K.); and Department of Computer Science, University of South Dakota, Vermillion, South Dakota 57069 (C.L.)
PlantGDB (http://www.plantgdb.org/) is a database of plant molecular sequences. Expressed sequence tag (EST) sequences are assembled into contigs that represent tentative unique genes. EST contigs are functionally annotated with information derived from known protein sequences that are highly similar to the putative translation products. Tentative Gene Ontology terms are assigned to match those of the similar sequences identified. Genome survey sequences are assembled similarly. The resulting genome survey sequence contigs are matched to ESTs and conserved protein homologs to identify putative full-length open reading frame-containing genes, which are subsequently provisionally classified according to established gene family designations. For Arabidopsis (Arabidopsis thaliana) and rice (Oryza sativa), the exon-intron boundaries for gene structures are annotated by spliced alignment of ESTs and full-length cDNAs to their respective complete genome sequences. Unique genome browsers have been developed to present all available EST and cDNA evidence for current transcript models (for Arabidopsis, see the AtGDB site at http://www.plantgdb.org/AtGDB/; for rice, see the OsGDB site at http://www.plantgdb.org/OsGDB/). In addition, a number of bioinformatic tools have been integrated at PlantGDB that enable researchers to carry out sequence analyses on-site using both their own data and data residing within the database.
Plant genome sequence data have been accumulating from three major sources: whole-genome sequencing and assembly (Arabidopsis [Arabidopsis thaliana]: Lin et al., 1999 PlantGDB (http://www.plantgdb.org/) is our ongoing effort to aid in the organization and interpretation of sequence data through the development and implementation of integrated databases and analytical tools. In this article, we discuss some of the unique sequence storage and analysis capabilities provided by PlantGDB and compare them to those made available through other online resources. All PlantGDB data and scripts described here are freely available from our download site (http://www.plantgdb.org/download/download.php) or by request.
PlantGDB is a plant sequence database. Its data consist of plant sequences and their associated annotations. There are mainly three types of plant sequences: complete genome sequences for Arabidopsis and rice, other kinds of sequences including EST and GSS extracted from public sequence repositories such as GenBank (Benson et al., 2005
Plant sequences that are made available through public repositories compose the core PlantGDB sequence set. Currently, PlantGDB contains sequences from more than 24,000 plant species (belonging to more than 6,000 genera). Our sequence-processing scripts extract all plant nucleotide sequences from EST (http://www.ncbi.nlm.nih.gov/dbEST/), GSS (http://www.ncbi.nlm.nih.gov/dbGSS/), sequence tagged sites (http://www.ncbi.nlm.nih.gov/dbSTS/), high-throughput genomic (http://www.ncbi.nlm.nih.gov/HTGS/), and other genomic DNA sequence categories at GenBank and populate our relational database. All scripts are written in Perl and are available upon request. The extracted sequences are sorted by taxonomic classification to provide fast and easy access to subsets of sequences limited to an individual species or to a phylogenetically related group via the PlantGDB download site (http://www.plantgdb.org/download/download.php). To ensure that our data are synchronized with GenBank, this procedure is run daily. To maintain consistency with the version updates occurring at GenBank, our pipeline automatically tracks two GenBank data files, gbchg.txt and gbdel.txt (available from ftp://ftp.ncbi.nih.gov/genbank/), corresponding to sequence version changes and deletions, respectively. The entries in these two files are converted into corresponding "update/delete" SQL statements, which allow our database to stay synchronized with GenBank. Similarly, plant protein sequences are extracted from UniProt (Apweiler et al., 2004
Sequence sets downloaded from GenBank can contain "contaminants". Contaminants typically only pose a problem when unrecognized (for an example of clever use of nonnative sequences derived in Drosophila sequencing projects that resulted in the serendipitous assembly of three Wolbachia bacterial genomes, see Salzberg et al., 2005
The Institute for Genomic Research (TIGR) plant repeat database (http://www.tigr.org/tdb/e2k1/plant.repeats/; Ouyang and Buell, 2004
EST sequences are valuable data for gene discovery, especially for plant species with large genomes that have not been fully sequenced, and they provide a convenient means of accessing the transcriptome of a given species. However, ESTs generally correspond to only partial cDNA sequences, and EST samples are typically highly redundant (especially if EST sets are not derived from normalized EST libraries). Therefore, the assembly of overlapping ESTs into putative unique transcript contigs on a frequent and regular basis constitutes the first step for all EST analyses performed at PlantGDB (for more details, see http://www.plantgdb.org/prj/ESTCluster/progress.php). A similar analysis is provided by the TIGR gene indices for selected species with sufficiently large numbers of ESTs (http://www.tigr.org/tdb/tgi/plant.shtml; Lee et al., 2005
EST assembly remains a computational challenge given the large number of EST sequences currently available. For example, with more than 400,000 maize ESTs, CAP3 (one of the most popular assembly programs; Huang and Madan, 1999 In addition to piecing together significantly overlapping fragments, EST assembly can be considered to be an initial step toward reducing the redundancy that exists in available EST datasets. Because no EST assembly can be guaranteed to be error free, we caution researchers to consider searches against the PlantGDB assemblies to be complementary and exploratory steps in gene discovery relative to more comprehensive analyses of promising targets. One obvious advantage of searching a database of EST contigs (rather than unassembled ESTs) is that the likelihood of finding a complete match against one's query should be increased (because EST contigs are longer, on average, than raw EST sequences).
Instead of deriving EST assembly parameters specific to each species, we use a common set of conserved assembly criteria for most assemblies: ESTs are initially clustered whenever they share a minimum overlap of 40 bases with at least 95% identity (these initial clusters may split into several contigs based on overall similarity; Huang and Madan, 1999 Groups working on a particular organism often carefully generate their own species-specific EST contigs, and some groups have asked to deposit their assemblies into PlantGDB to gain easy access to our annotations and integrated analysis tools. For example, the barley (Hordeum vulgare) EST contigs at PlantGDB are mirrored from the HarvEST Triticeae database (http://harvest.ucr.edu/). The PlantGDB display includes mapping of oligomer probes for microarray expression studies and links to expression data at BarleyBase (http://www.barleybase.org/). For species with EST sequence sets that are assembled at PlantGDB directly, feedback from researchers is used to determine the build release schedule: if a researcher has a need to gain access to a new build for a given species' EST assembly, that dataset can be given priority for a speedy build and release. After a new assembly has been created, the deprecated assemblies remain accessible online and can still be viewed on the Web (this serves as an historic record to enable long-term accessibility). However, deprecated assemblies cannot be accessed by the PlantGDB data analysis tools (e.g. BLAST@PlantGDB).
For each contig sequence, a putative function is assigned based upon sequence similarity to gene products that have been previously functionally annotated. This is accomplished by means of an automated BLAST search (Altschul et al., 1997 PlantGDB was designed to be a Web-based research workbench. Thus, all records are linked to tools that allow for immediate retrieval of raw data, and all data are made available alongside related data and applications that can be used to recalculate curated records using updated or proprietary data. For example, both the EST assembly and associated BLAST hit annotation can be updated by researchers using only data present at PlantGDB or data present at PlantGDB in conjunction with additional data from elsewhere (e.g. sequencing reads the researcher has not yet submitted to GenBank), or using parameters other than default via tools embedded within the sequence record display pages.
Roughly 2.6 million sequences have entered the current PlantGDB maize GSS assembly, which was generated using the PCAP program with default parameters (Huang et al., 2003
Genome-Wide Gene Structure Annotation
The Arabidopsis EST and full-length cDNA collections were threaded onto the five established Arabidopsis chromosome sequences using the GeneSeqer program. Those data are stored in a specialized ancillary database called AtGDB (Arabidopsis thaliana Genome Database; http://www.plantgdb.org/AtGDB/). AtGDB incorporates 418,564 EST sequences, 64,840 full-length cDNA sequences, 31,971 predicted transcripts, and an ever-increasing number of user-contributed annotations (http://www.plantgdb.org/AtGDB/Annotation/UCAlist.php) into a workbench for genome informatics. Approximately 80% of the predicted protein-coding gene models are supported by EST or cDNA evidence, whereas 20% are based solely upon computational gene structure prediction. The degree of support for individual annotations in Arabidopsis is substantially better than that of rice, with 68% being supported to the extent that major changes are not likely (Tables I and II). However, 70% of the predicted Arabidopsis gene models exist in a genomic context such that corresponding EST and cDNA alignments reveal some form of incongruence, including incompletely annotated noncoding regions, alternative splicing, and erroneous gene predictions (Schlueter et al., 2005
OsGDB (http://www.plantgdb.org/OsGDB/) is the rice equivalent to AtGDB. ESTs and full-length cDNAs stored at OsGDB are threaded onto the full set of rice bacterial artificial chromosome (BAC) sequences. Altogether, 298,857 EST sequences, 32,136 full-length cDNAs, 3,453 BAC sequences, 66,224 gene models (defined as GenBank file features), and 62,121 transcription unit (TU) models (defined by TIGR; http://www.tigr.org/tdb/e2k1/osa1/irgsp/eukan_routine_irgsp.shtml) are on display at OsGDB. Of the current TU gene model predictions, about one-half are supported by EST or cDNA evidence, whereas the remaining one-half are based on computational gene prediction alone (Table I). Similar to what was observed for Arabidopsis, a large percentage of rice TUs reveals some level of incongruence with local EST and cDNA alignments (Table II; http://www.plantgdb.org/OsGDB/Annotation/gaeval/gaeval_lists.php).
PlantGDB is built upon a relational database. Underlying PlantGDB is the MySQL (http://www.mysql.com/) database management system running on the open-source RedHat Linux operating system (http://www.redhat.com/). The PlantGDB database schema can be accessed at http://www.plantgdb.org/document/public/schema.php. The schema was designed to efficiently store sequence data and associated annotations to facilitate meaningful biological queries. For example, for all the raw plant DNA sequences downloaded from GenBank, we extract and reorganize the information originally stored at GenBank as ASN.1 flat-files and store those records in tables. Such information extraction and reorganization enables researchers to construct meaningful biological queries that can be carried out in a precise and flexible environment, a process that is not always possible at GenBank. For example, using the National Center for Biotechnology Information (NCBI) Entrez tool, a researcher cannot construct queries like "show me all maize promoter sequences," even though promoter region annotations are embedded in GenBank records. If a researcher attempted to use a query like "Zea mays[ORGN] AND Promoter," he or she would only end up with a list of sequence records that contain the word "Promoter." The word could appear anywhere in the sequence record (e.g. in the "comment" field). Furthermore, it is impossible to extract just the promoter regions from all the sequences and the output list provided at GenBank. However, at PlantGDB, the promoter (and all other GenBank feature) information is stored within a relational table, enabling researchers to accurately specify that only promoter regions be retrieved. This sort of flexibility is made possible by PlantGDB's TableMaker tool (http://www.plantgdb.org/search/query/TableMaker.php; Fig. 2). Furthermore, the PlantGDB TableMaker does not require that the researcher have any knowledge of SQL: TableMaker translates the researcher's specifications to SQL query statements and presents the results in a tabular format. For advanced users, direct SQL access to the PlantGDB backend database is also available online (http://www.plantgdb.org/search/query/websql.php).
For species that have a complete genome sequence available, specialized databases (i.e. AtGDB for Arabidopsis and OsGDB for rice) have been created to make available annotation of detailed exon-intron structures for protein-coding genes, based upon the threading of EST and full-length cDNAs onto chromosomes. Spliced alignments of ESTs and cDNAs as well as the recent annotation of the Arabidopsis and rice genomes are parsed and imported into database tables, and an elaborate Web interface was developed and is made available to provide a visual assessment of annotated gene structure (Zhu et al., 2003
Correct gene annotation must leverage the expertise that exists within the research community. To truly involve the research community, a simple feedback form is not sufficient. Instead, well-designed and easy-to-use annotation tools must be made available to researchers. To address this need, researchers using AtGDB can contribute updated annotations of their own to a shared community annotation collection through the use of Web-based annotation tools (Schlueter et al., 2005
PlantGDB makes available data analysis tools to help the research community gain access to the wealth of information that can be gleaned through the analysis of sequence data. Three online sequence analysis tools that are unique to PlantGDB are the BLAST@PlantGDB, GeneSeqer@PlantGDB, and PatternSearch@PlantGDB tools.
Although nearly all sequence databases provide an online BLAST server, most only allow researchers to search against one database at a time. For example, the current NCBI BLAST server (http://www.ncbi.nlm.nih.gov/BLAST/) requires that only one database be selected from a predefined set of database options (e.g. "nr," "est," "gss," etc.). In addition, at NCBI researchers can further choose to search against either all organisms or a group of species that share a given taxonomic rank. This is not always convenient if a researcher wishes to search against multiple databases (e.g. EST and GSS) or to search against multiple species (e.g. maize and rice but not sorghum). BLAST@PlantGDB (http://www.plantgdb.org/PlantGDB-cgi/blast/PlantGDBblast/) serves as a much-needed supplement to NCBI's BLAST service, providing selection flexibility for both the database and species data source options (e.g. only rice or maize ESTs, or both; or all monocot GSSs; or rice ESTs and all cereal EST contigs, etc.). The BLAST@PlantGDB server implements the standard NCBI BLAST stand-alone search engine (Altschul et al., 1997
GeneSeqer produces plant gene structure models based on spliced alignment to genomic sequences of both native and homologous EST, cDNA, and protein sequences. The GeneSeqer@PlantGDB (http://www.plantgdb.org/PlantGDB-cgi/GeneSeqer/PlantGDBgs.cgi) Web service allows researchers to "thread" EST/cDNA sequences onto genomic DNA across all plant species. Integration of the stand-alone GeneSeqer program with backend database operations allows researchers to conveniently align ESTs of specific quality or origin to their genomic counterparts. For a detailed description of this server, see Schlueter et al. (2003)
The PatternSearch@PlantGDB tool (http://www.plantgdb.org/PlantGDB-cgi/vmatch/patternsearch.pl) allows researchers to conduct pattern searches, i.e. searches for relatively short matches possibly interspersed with mismatches and/or insertions/deletions ("indels"), against PlantGDB sequences (e.g. against the Arabidopsis and rice genome, or cDNAs, etc.). For example, if a researcher were to design primers (e.g. for reverse transcription-PCR or genomic PCR) to amplify a gene of interest, the primers might also hybridize with nontarget sequences. This researcher could use the PlantGDB PatternSearch tool to find out whether the primer sequences chosen are unique to the gene of interest. The underlying search engine for the PlantGDB PatternSearch is the Vmatch program, which is based on enhanced suffix arrays (Abouelhoda et al., 2004
An educational site ancillary to PlantGDB provides a centralized repository for various Plant Genome Research "Outreach" Program (PGROP) materials and related activities (http://www.plantgdb.org/PGROP/pgrop.php; Baran et al., 2004
Compared to the large number of finished prokaryotic and animal genomes, plant genomics is still in its infancy. However, the increasing importance of plant biotechnology is spearheading a dramatic increase in the resources available for plant genomics research. Thus, we can anticipate with certitude the complete genome sequencing of several crop species in the next few years. As is the case for the prokaryotic and animal genomics fields, comparative genomics will be a most valuable tool for turning the plant sequence information into knowledge of genome repertoire, organization, and function. Databases like PlantGDB that synthesize current knowledge will provide a requisite foundation for making rapid progress by leveraging existing knowledge to annotate and evaluate new genomes. There are a number of services that we are working to improve at PlantGDB in the coming months. For instance, the creation of formally defined "versions" of both the data and database will help researchers to cite the database using links and version designators that will persist long after a manuscript's publication. We also are working to provide various case studies in tutorial form as examples of work flow using PlantGDB data and tools. We also plan to provide access to data and analysis tools via Web services to allow for better integration with other online resources.
We thank Ann Fu and Mike Brekke for their technical contributions to this work. Received December 31, 2004; returned for revision May 4, 2005; accepted May 12, 2005.
1 This work was supported by the National Science Foundation Plant Genome Research Projects (grant no. DBI0321600 to V.B. and C.L.).
2 Present address: Corn Insects and Crop Genetics Research, U.S. Department of Agriculture Agricultural Research Service and Department of Agronomy, Iowa State University, Ames, IA 500113260. www.plantphysiol.org/cgi/doi/10.1104/pp.104.059212. * Corresponding author; e-mail vbrendel{at}iastate.edu; fax 5152946755.
Abouelhoda MI, Kurtz S, Ohlebusch E (2004) Replacing suffix trees with enhanced suffix arrays. J Discrete Algorithms 2: 5386
Altschul SF, Madden TL, Schäffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 33893402
Apweiler R, Bairoch A, Wu CH, Barker WC, Boeckmann B, Ferro S, Gasteiger E, Huang H, Lopez R, Magrane M, et al (2004) UniProt: the Universal Protein knowledgebase. Nucleic Acids Res 32: D115D119 Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al (2000) Gene Ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25: 2529[CrossRef][ISI][Medline]
Baran S, Lawrence CJ, Brendel V (2004) Plant Genome Research Outreach Portal. A gateway to plant genome research "outreach" programs and activities. Plant Physiol 134: 889 Bedell JA, Budiman MA, Nunberg A, Citek RW, Robbins D, Jones J, Flick E, Rholfing T, Fries J, Bradford K, et al (2005) Sorghum genome sequencing by methylation filtration. PLoS Biol 3: e13[CrossRef][Medline]
Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL (2005) GenBank. Nucleic Acids Res 33: D34D38
Brendel V, Xing L, Zhu W (2004) Gene structure prediction from consensus spliced alignment of multiple ESTs matching the same genomic locus. Bioinformatics 20: 11571169
Emrich SJ, Aluru S, Fu Y, Wen TJ, Narayanan M, Guo L, Ashlock DA, Schnable PS (2004) A strategy for assembling the maize (Zea mays L.) genome. Bioinformatics 20: 140147 Fernandes J, Dong Q, Schneider B, Morrow DJ, Nan GL, Brendel V, Walbot V (2004) Genome-wide mutagenesis of Zea mays L. using RescueMu transposons. Genome Biol 5: R82[CrossRef][Medline] Gilks WR, Audit B, De Angelis D, Tsoka S, Ouzounis CA (2000) Modeling the percolation of annotation errors in a database of protein sequences. Bioinformatics 18: 16411649
Goff SA, Ricke D, Lan TH, Presting G, Wang R, Dunn M, Glazebrook J, Sessions A, Oeller P, Varma H, et al (2002) A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science 296: 92100
Huang X, Madan A (1999) CAP3: a DNA sequence assembly program. Genome Res 9: 868877
Huang X, Wang J, Aluru S, Yang SP, Hillier L (2003) PCAP: a whole-genome assembly program. Genome Res 13: 21642170
Kalyanaraman A, Aluru S, Kothari S, Brendel V (2003) Efficient clustering of large EST data sets on parallel computers. Nucleic Acids Res 31: 29632974
Lee Y, Tsai J, Sunkara S, Karamycheva S, Pertea G, Sultana R, Antonescu V, Chan A, Cheung F, Quackenbush J (2005) The TIGR Gene Indices: clustering and assembling EST and known genes and integration with eukaryotic genomes. Nucleic Acids Res 33: D71D74 Lin X, Kaul S, Rounsley S, Shea TP, Benito MI, Town CD, Fujii CY, Mason T, Bowman CL, Barnstead M, et al (1999) Sequence and analysis of chromosome 2 of the plant Arabidopsis thaliana. Nature 402: 761768[CrossRef][Medline] Mayer K, Schuller C, Wambutt R, Murphy G, Volckaert G, Pohl T, Dusterhoft A, Stiekema W, Entian KD, Terryn N, et al (1999) Sequence and analysis of chromosome 4 of the plant Arabidopsis thaliana. Nature 402: 769777[CrossRef][Medline]
Ouyang S, Buell CR (2004) The TIGR Plant Repeat Databases: a collective resource for the identification of repetitive sequences in plants. Nucleic Acids Res 32: D360D363
Palmer LE, Rabinowicz PD, O'Shaughnessy AL, Balija VS, Nascimento LU, Dike S, de la Bastide M, Martienssen RA, McCombie WR (2003) Maize genome sequencing by methylation filtration. Science 302: 21152117 Salanoubat M, Lemcke K, Rieger M, Ansorge W, Unseld M, Fartmann B, Valle G, Blocker H, Perez-Alonso M, Obermaier B (2000) Sequence and analysis of chromosome 3 of the plant Arabidopsis thaliana. Nature 408: 820822[CrossRef][Medline] Salzberg SL, Dunning Hotopp JC, Delcher AL, Pop M, Smith DR, Eisen MB, Nelson WC (2005) Serendipitous discovery of Wolbachia genomes in multiple Drosophila species. Genome Biol 6: 402[CrossRef][Medline]
Schlueter SD, Dong Q, Brendel V (2003) GeneSeqer@PlantGDB: gene structure prediction in plant genomes. Nucleic Acids Res 31: 35973600 Schlueter SD, Wilkerson MD, Huala E, Rhee SY, Brendel V (2005) Community-based gene structure annotation for the Arabidopsis thaliana genome. Trends Plant Sci 10: 914[Medline] Usuka J, Brendel V (2000) Gene structure prediction by spliced alignment of genomic DNA with protein sequences: increased accuracy by differential splice site scoring. J Mol Biol 297: 10751085[CrossRef][ISI][Medline]
Whitelaw CA, Barbazuk WB, Pertea G, Chan AP, Cheung F, Lee Y, Zheng L, van Heeringen S, Karamycheva S, Bennetzen JL, et al (2003) Enrichment of gene-coding sequences in maize by genome filtration. Science 302: 21182120
Yu J, Hu S, Wang J, Wong GK, Li S, Liu B, Deng Y, Dai L, Zhou Y, Zhang X, et al (2002) A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science 296: 7992 Yuan Y, SanMiguel PJ, Bennetzen JL (2003) High-Cot sequence analysis of the maize genome. Plant J 34: 249255[CrossRef][ISI][Medline]
Zhu W, Schlueter SD, Brendel V (2003) Refined annotation of the Arabidopsis genome by complete EST mapping. Plant Physiol 132: 469484 This article has been cited by other articles:
| |||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||