Plant Physiology 138:1280-1288 (2005)
© 2005 American Society of Plant Biologists
Plant Genome Resources at the National Center for Biotechnology Information
National Center for Biotechnology Information, National Institutes of Health, Bethesda, Maryland 20894
The National Center for Biotechnology Information (NCBI) integrates data from more than 20 biological databases through a flexible search and retrieval system called Entrez. A core Entrez database, Entrez Nucleotide, includes GenBank and is tightly linked to the NCBI Taxonomy database, the Entrez Protein database, and the scientific literature in PubMed. A suite of more specialized databases for genomes, genes, gene families, gene expression, gene variation, and protein domains dovetails with the core databases to make Entrez a powerful system for genomic research. Linked to the full range of Entrez databases is the NCBI Map Viewer, which displays aligned genetic, physical, and sequence maps for eukaryotic genomes, including those of many plants. A specialized plant query page allows maps from all plant genomes covered by the Map Viewer to be searched in tandem to produce a display of aligned maps from several species. PlantBLAST searches against the sequences shown in the Map Viewer allow BLAST alignments to be viewed within a genomic context. In addition, precomputed sequence similarities, such as those for proteins offered by BLAST Link, enable fluid navigation from unannotated to annotated sequences, quickening the pace of discovery. NCBI Web pages for plants, such as Plant Genome Central, complete the system by providing centralized access to NCBI's genomic resources as well as links to organism-specific Web pages beyond NCBI.
The National Center for Biotechnology Information (NCBI) provides a data-rich environment in support of genomic research by integrating the data from more than 20 biological databases through a flexible search and retrieval system called Entrez. A core database in Entrez, Entrez Nucleotide, includes GenBank (Benson et al., 2005 The NCBI Map Viewer (http://www.ncbi.nlm.nih.gov/mapview/), used to display genomic maps for many plant and animal genomes, takes advantage of the same database technology as Entrez and supports text queries with Boolean logic. Searches of complete genomic sequences from organisms ranging from microbes to higher plants and animals may be performed via genomic BLAST searches that lead to Map Viewer displays in which the genomic context of the hits can be seen. The NCBI genomics environment offers a unique opportunity for researchers. While specialized genomic resources provide excellent coverage of the data for a single organism or a closely related group of organisms, the environment at NCBI enables the comparison of genomic data from species across the entire taxonomic range. This wide taxonomic coverage greatly enhances the utility of the data by allowing advances in the understanding of the genomics of one organism to be applied to a broader genomic context spanning many organisms.
In the descriptions that follow, addresses for the key NCBI Web pages appear directly in the text and are also compiled in the "URLs for NCBI Resources and Download Sites" section below. For a more detailed overview of NCBI resources, see the NCBI home page (http://www.ncbi.nlm.nih.gov) and the review in Wheeler et al. (2005)
The bulk of the primary plant genomic data available at NCBI falls into one of three categories that mirror the types of projects currently conducted by the plant research community: genomic assemblies, batches of Expressed Sequence Tags (ESTs), and genetic or physical genomic maps. Genomic assemblies include that of the Arabidopsis (Arabidopsis thaliana) genome produced by the Arabidopsis Genome Initiative (2000).
Data resulting from large-scale EST sequencing projects for over 70 plants have been deposited into GenBank and are integrated with other sequence data at NCBI in a number of ways. Organisms for which more than 70,000 EST sequences have been deposited are automatically entered into the UniGene database, in which case their EST sequences are combined with other transcript sequences in GenBank and partitioned into gene-oriented clusters. In this manner, ESTs from 20 plant species have been used to produce more than 200,000 transcript clusters. The UniGene clusters themselves are subjected to further analysis to link them to the STSs in UniSTS, genes in Entrez Gene, gene homologs in HomoloGene, and proteins in Entrez Protein. As a consequence, it is often possible to simply query Entrez with a GenBank EST accession number and arrive at a gene location, protein sequence, and estimate of sequence conservation across organisms in a single step. ESTs arising from several plant species, as well as the UniGene clusters derived from them, are aligned to the well-assembled Arabidopsis and O. sativa genomes. ESTs from the Liliopsida are aligned to the O. sativa genome; these include Hordeum vulgare, O. sativa, Sorghum bicolor, Triticum aestivum, and Zea mays. Those from the Eudicotyledons are aligned to the Arabidopsis genome and include Arabidopsis itself, Glycine max, Lactuca sativa, L. corniculatus, Malus x domestica, M. truncatula, Populus tremula x Populus tremuloides, Solanum tuberosum, and V. vinifera.
These alignments make an important link between EST data, which is relatively inexpensive to obtain, and assembled, well-annotated genomic sequence, a more expensive commodity. In addition to these ESTs, more than three million SNPs, submitted to NCBI's dbSNP database, have been mapped to the O. sativa genome at NCBI.
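The two assignment rules just described (the EST-count threshold for entry into UniGene and the choice of reference genome by taxonomic group) can be illustrated with a short sketch. This is not NCBI's pipeline: the function names and the small lineage table are hypothetical, and only the 70,000-EST threshold and the group-to-genome pairings come from the text.

```python
# Illustrative sketch only; all names here are hypothetical, not an NCBI API.
UNIGENE_EST_THRESHOLD = 70_000  # threshold stated in the text

# Reference genome used for EST alignment, keyed by taxonomic group (from the text).
REFERENCE_GENOME = {
    "Liliopsida": "Oryza sativa",
    "Eudicotyledons": "Arabidopsis thaliana",
}

# Hand-made lineage table covering a few of the species named in the text.
LINEAGE = {
    "Hordeum vulgare": "Liliopsida",
    "Zea mays": "Liliopsida",
    "Glycine max": "Eudicotyledons",
    "Solanum tuberosum": "Eudicotyledons",
}


def qualifies_for_unigene(est_count: int) -> bool:
    """True when an organism has more than the threshold number of ESTs."""
    return est_count > UNIGENE_EST_THRESHOLD


def reference_genome_for(species: str):
    """Return the genome that ESTs from `species` would be aligned to, or None."""
    group = LINEAGE.get(species)
    return REFERENCE_GENOME.get(group)


if __name__ == "__main__":
    print(reference_genome_for("Zea mays"))
    print(qualifies_for_unigene(85_000))
```

A species missing from the lineage table simply yields `None`, which mirrors the case of organisms whose ESTs are not routinely aligned to either genome.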
Genetic mapping projects are under way for a number of plants, and maps resulting from these projects are displayed in the Map Viewer discussed below. Such nonsequence data is imported by NCBI from many publicly available databases in formats ranging from Structured Query Language table dumps to flat files with unique formats. NCBI processes the data to link feature locations with GenBank records, with data in related databases, and with URLs that point back to the source of the data. Once processed, the data is entered into the Map Viewer database for display. Sources of nonsequence data shown in the Map Viewer include maps from MaizeGDB (Lawrence et al., 2004
The Map Viewer provides a central interface to plant genomic data at NCBI, serving as an interactive tool for viewing an ensemble of aligned genetic, physical, or sequence-based maps with an adjustable focus ranging from that of a complete chromosome to that of a portion of a gene. The maps displayed in the Map Viewer may be derived from a single organism or from multiple organisms; map alignments are performed on the basis of shared markers. There are many species of plant for which Map Viewer displays are available: Arabidopsis, oat (Avena sativa), G. max, H. vulgare, L. esculentum, O. sativa, S. bicolor, T. aestivum, Triticum monococcum, Z. mays, and L. corniculatus var. japonicus. The creation of displays for M. sativa and M. truncatula is in progress. A number of genetic or physical maps are usually available for each organism displayed in the Map Viewer. For two organisms, O. sativa and Arabidopsis, sequence maps are available with annotated genes, and for these two genomes EST and UniGene alignments are available, as discussed above. Annotations on the plant genomic assemblies displayed in the Map Viewer are those produced by the appropriate model organism community annotation effort. References and links to the projects generating these annotations are shown on Map Viewer genome overview pages. Navigation to a target Map Viewer display is via one of three primary routes: a Map Viewer text query, an Entrez text query, or a PlantBLAST search. When a match is found through one of these avenues, it is possible to navigate to a Map Viewer display from which many routes to genomic comparison and analysis are available. The usage scenarios outlined below illustrate these three entry mechanisms.
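The idea of aligning maps on the basis of shared markers can be sketched as follows. This is a minimal illustration under stated assumptions, not the Map Viewer's actual method: the function name, marker names, and positions are all invented for the example.

```python
# Minimal sketch of pairing two maps through their shared markers.
# Marker names and positions below are hypothetical.

def shared_marker_pairs(map_a, map_b):
    """Return (marker, position_a, position_b) tuples for markers present
    on both maps, ordered by their position on map_a."""
    shared = map_a.keys() & map_b.keys()
    return sorted(
        ((marker, map_a[marker], map_b[marker]) for marker in shared),
        key=lambda item: item[1],
    )


if __name__ == "__main__":
    genetic_map = {"mk1": 12.5, "mk2": 40.0, "mk3": 77.3}           # positions in cM
    physical_map = {"mk2": 1_200_000, "mk3": 3_400_000, "mk4": 50}  # positions in bp
    for marker, cm, bp in shared_marker_pairs(genetic_map, physical_map):
        print(marker, cm, bp)
```

Markers found on only one map (here `mk1` and `mk4`) contribute nothing to the alignment, which is why maps with few markers in common align poorly.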
The Plant Genomes Query page (http://www.ncbi.nlm.nih.gov/mapview/map_search.cgi?chr=plants.inf) enables one to simultaneously search the physical, genetic, and sequence maps shown in the Map Viewer for several plant genomes. Queries may include clone names or aliases, gene names or symbols, locus identifiers, gene products or descriptions, and genetic marker or GenBank identifiers. Multiple search terms can be combined using Boolean logic.
Since the early 1990s, various lines of research have shown that large-scale genome structure is conserved in blocks across the grasses (Ahn and Tanksley, 1993
The entire suite of Entrez databases may be searched in tandem using the Entrez Global Query page (http://www.ncbi.nlm.nih.gov/gquery/gquery.fcgi). Using Global Query, it is not necessary to know in advance in which databases the data will be found, and, as such, it is an excellent starting point for an initial survey of a new research area. As an example, consider a search for salt tolerance-related genes in plants. One may begin with the following query: "salt tolerance" AND viridiplantae [organism]. The query consists of two parts, linked by a Boolean "AND," which is always given in capital letters. For the query to be successful, the quoted phrase "salt tolerance" must be found in some part of a database record. The second portion of the query limits the search to the taxon viridiplantae within the Entrez organism field, given in the square brackets. Hence, for databases, such as the sequence databases, that classify records by organism, only records for the viridiplantae will be returned.
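The structure of this query (a quoted phrase, a capitalized Boolean operator, and a field-restricted term) can be assembled programmatically. The sketch below is an assumption about how one might do so: the article describes only the Web interface, whereas the code builds a request URL for NCBI's E-utilities ESearch service, which the article does not mention. The helper names are hypothetical, and no network request is made.

```python
from urllib.parse import urlencode

# Assumption: the E-utilities ESearch endpoint, not the Global Query page from the text.
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_query(phrase: str, organism: str) -> str:
    """Combine a quoted phrase with an organism restriction using Boolean AND."""
    return f'"{phrase}" AND {organism}[organism]'


def esearch_url(db: str, term: str, retmax: int = 20) -> str:
    """Build (but do not send) an ESearch request URL for one Entrez database."""
    return ESEARCH_URL + "?" + urlencode({"db": db, "term": term, "retmax": retmax})


if __name__ == "__main__":
    term = build_query("salt tolerance", "viridiplantae")
    print(term)
    print(esearch_url("gene", term))
```

Unlike Global Query, a request of this kind targets one database at a time, so surveying several databases would mean repeating the call with a different `db` value.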
The entries identified with the query are displayed on the Global Entrez results page (Fig. 2H). The search matches entries in the Entrez Nucleotide, Protein, Gene, UniGene, HomoloGene, and GEO databases. Navigating directly to the matches in the Gene database by clicking on the link in Global Query generates a display of summaries (data not shown) for six genes. Clicking on the link to the first of these Entrez Gene records, the salt-tolerance (STO) gene of Arabidopsis, yields the report shown in the main area of Figure 2. The report begins with the Entrez Gene GeneID and name of the gene, shown in section A. Following this is a graphical representation of the structure of the gene, section B, showing two splice variants, a two-exon form, and a three-exon form. In section C, the genomic context of STO is shown, followed, in section D, by its type, aliases, taxonomic lineage, and links to the literature in PubMed. Gene Ontology (Gene Ontology Consortium, 2004
Returning to the Global Query page, it may be of interest to see if additional genes related to salt tolerance can be found using a different route. For example, clicking on the Global Query link to the Entrez Protein database yields a list of the first 20 matching proteins. Among these entries is a SwissProt-derived record for Arabidopsis SAL1 phosphatase, accession number Q42546. Clicking on the accession number to see the GenBank flat file view of this entry reveals that the first literature reference (Quintero et al., 1996
The suite of BLAST sequence similarity programs offers an alternative to a text query as a method of entry into the Entrez system. Using BLAST, a novel sequence can be quickly linked to known database sequences to which it is similar. Links from these sequence records to others in Entrez allow fluid navigation between NCBI resources. As an example, consider an unknown European white birch (Betula pendula) EST, GenBank accession number AJ606366. A simple Entrez query reveals that this EST does not belong to any UniGene cluster and, therefore, no gene-oriented information is available. In addition, since European white birch ESTs are not among those routinely aligned to the O. sativa or Arabidopsis genomic sequence by NCBI, it will not be possible to simply look this one up in Map Viewer. However, using Plant Genome BLAST, it may still be possible to place this EST on the Arabidopsis genome.
The PlantBLAST page (http://www.ncbi.nlm.nih.gov/BLAST/Genome/PlantBlast.shtml) is reached via the Plants link on the BLAST homepage. The query form allows the selection of the species of plant to BLAST against, the database to search, and the BLAST algorithm to use. For example, one may select Arabidopsis, one of 10 available plant species, as the organism, the set of protein sequences derived from its genomic annotation as the database, and the BLASTX program from the pulldown menus, using the form shown in Figure 3A. Alternatively, mapped sequences from all available plants may be selected as the target of the search. The BLASTX program compares the six-frame protein translations of a nucleotide query to the sequences in a protein database and is a very sensitive way to perform a cross-species search. As a query, either the EST sequence or its GenBank accession number, AJ606366, may be used. Such a search returns significant matches to several members of the chitinase protein family, among them NCBI RefSeq accession number NP_172076. Clicking on the Genome View button in the BLAST results and adding the gene map as the rightmost map (data not shown) invokes a graphical Arabidopsis Map Viewer overview revealing many hits across the Arabidopsis genome (Fig. 3B). The "NP_172076" link in the overview leads to the detailed display of the alignment of NP_172076 to the Arabidopsis genome shown in Figure 3C, which carries further links: "At1g05850" leads to an Entrez Gene report for the corresponding gene, and "TAIR" leads to The Arabidopsis Information Resource. The Entrez Gene report contains links to a publication (Zhong et al., 2002
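The six-frame translation that BLASTX applies to a nucleotide query can be illustrated with a short, self-contained sketch. It shows only the translation step under the standard genetic code, not the BLAST search itself; the function names are hypothetical, and codons containing ambiguous bases are translated as "X" by assumption.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def translate(seq: str) -> str:
    """Translate complete codons from the start of seq; '*' marks a stop codon."""
    codons = (seq[i:i + 3] for i in range(0, len(seq) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


def six_frame_translations(seq: str) -> dict:
    """Return translations for frames +1..+3 and -1..-3 of a DNA sequence."""
    forward = seq.upper()
    reverse = forward.translate(COMPLEMENT)[::-1]  # reverse complement
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(forward[offset:])
        frames[f"-{offset + 1}"] = translate(reverse[offset:])
    return frames


if __name__ == "__main__":
    for frame, protein in six_frame_translations("ATGGCC").items():
        print(frame, protein)
```

Because every frame on both strands is translated, a protein-level match can be found even when the reading frame and orientation of an EST are unknown, which is what makes this approach useful for cross-species searches.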
Consider as a final example an EST with GenBank accession number AU251269 from Italian rye grass (Lolium multiflorum). A BLASTX PlantBLAST search against O. sativa proteins, performed as in the preceding case, reveals a putative protein homolog in O. sativa with RefSeq accession number NP_909676. Navigating to the BLink report for NP_909676 shows conservation across all taxonomy branches (Fig. 5A) and includes an alignment to human protein NP_003932, Wiskott-Aldrich syndrome gene-like protein (Fig. 5B). The Entrez Gene report for this curated RefSeq contains links to a large collection of articles in PubMed and a number of Gene Ontology terms pertaining to actin activity (Fig. 5, C and D) as well as an impressive array of links to other resources (Fig. 5E).
Resources
The authors would like to thank Pavel Bolotov, Andrei Kochergin, Igor Tolstoy, and Boris Kiryutin for their expertise and diligence in the maintenance of many of the databases highlighted in this article.
Received December 22, 2004; returned for revision April 7, 2005; accepted April 22, 2005.
www.plantphysiol.org/cgi/doi/10.1104/pp.104.058842.
* Corresponding author; e-mail wheeler@ncbi.nlm.nih.gov; fax 301-480-9241.
Ahn SN, Tanksley SD (1993) Comparative linkage maps of the rice and maize genomes. Proc Natl Acad Sci USA 90: 7980-7984
Barrett T, Suzek TO, Troup DB, Wilhite SE, Ngau WC, Ledoux P, Rudnev D, Lash AE, Fujibuchi W, Edgar R (2004) NCBI GEO: mining millions of expression profiles: database and tools. Nucleic Acids Res 33: D562-D566
Benson DA, Karsch-Mizrachi I, Lipman DJ, Ostell J, Wheeler DL (2005) GenBank: update. Nucleic Acids Res 33: D34-D38
Boeckmann B, Bairoch A, Apweiler R, Blatter MC, Estreicher A, Gasteiger E, Martin MJ, Michoud K, O'Donovan C, Phan I, et al (2003) The SWISS-PROT protein knowledgebase and its supplement TrEMBL in 2003. Nucleic Acids Res 31: 365-370
Deshpande N, Addess KJ, Bluhm WF, Merino-Ott JC, Townsend-Merino W, Zhang Q, Knezevich C, Xie L, Chen L, Feng Z, et al (2004) The RCSB Protein Data Bank: a redesigned query system and relational database based on the mmCIF schema. Nucleic Acids Res 33: D233-D237
Devos KM, Chao S, Li QY, Simonetti MC, Gale MD (1994) Relationship between chromosome 9 of maize and wheat homeologous group 7 chromosomes. Genetics 138: 1287-1292
Garcia-Hernandez M, Berardini TZ, Chen G, Crist D, Doyle A, Huala E, Knee E, Lambrecht M, Miller N, Mueller LA, et al (2002) TAIR: a resource for integrated Arabidopsis data. Funct Integr Genomics 2: 239
Gene Ontology Consortium (2004) The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res 32: D258-D261
Kanz C, Aldebert P, Althorpe N, Baker W, Baldwin A, Bates K, Browne P, van den Broek A, Castro M, Cochrane G, et al (2004) The EMBL nucleotide sequence database. Nucleic Acids Res 33: D29-D33
Kurata N, Moore G, Nagamura Y, Foote T, Yano M, Minobe Y, Gale MD (1994) Conservation of genome structure between rice and wheat. Biotechnology (N Y) 12: 276-278
Lawrence CJ, Dong Q, Polacco ML, Seigfried TE, Brendel V (2004) MaizeGDB, the community database for maize genetics and genomics. Nucleic Acids Res 32: D393-D397
Lederberg EM (1986) Plasmid prefix designations registered by the Plasmid Reference Center 1977-1985. Plasmid 1: 57-92
Maglott D, Ostell J, Pruitt KD, Tatusova T (2004) Entrez Gene: gene-centered information at NCBI. Nucleic Acids Res 33: D54-D58
Marchler-Bauer A, Anderson JB, Cherukuri PF, DeWeese-Scott C, Geer LY, Gwadz M, He S, Hurwitz DI, Jackson JD, Ke Z, et al (2004) CDD: a Conserved Domain Database for protein classification. Nucleic Acids Res 33: D192-D196
Pruitt KD, Tatusova T, Maglott DR (2005) NCBI Reference Sequence (RefSeq): a curated non-redundant sequence database of genomes, transcripts and proteins. Nucleic Acids Res 33: D501-D504
Quintero FJ, Garciadeblas B, Rodriguez-Navarro A (1996) The SAL1 gene of Arabidopsis, encoding an enzyme with 3'(2'),5'-bisphosphate nucleotidase and inositol polyphosphate 1-phosphatase activities, increases salt tolerance in yeast. Plant Cell 8: 529-537
Sasaki T, Matsumoto T, Yamamoto K, Sakata K, Baba T, Katayose Y, Wu J, Niimura Y, Cheng Z, Nagamura Y, et al (2002) The genome sequence and structure of rice chromosome 1. Nature 420: 312-316
Schoof H, Zaccaria P, Gundlach H, Lemcke K, Rudd S, Kolesov G, Arnold R, Mewes HW, Mayer KF (2002) MIPS Arabidopsis thaliana Database (MAtDB): an integrated biological knowledge resource based on the first complete plant genome. Nucleic Acids Res 30: 91-93
Sherry ST, Ward MH, Kholodov M, Baker J, Phan L, Smigielski EM, Sirotkin K (2001) dbSNP: the NCBI database of genetic variation. Nucleic Acids Res 29: 308-311
Tateno Y, Saitou N, Okubo K, Sugawara H, Gojobori T (2004) DDBJ in collaboration with mass-sequencing teams on annotation. Nucleic Acids Res 33: D25-D28
The Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408: 796-815
Van Deynze AE, Nelson JC, O'Donoughue LS, Ahn SN, Siripoonwiwat W, Harrington SE, Yglesias ES, Braga DP, McCouch SR, Sorrells ME (1995) Comparative mapping in grasses: oat relationships. Mol Gen Genet 249: 349-356
Ware D, Jaiswal P, Ni J, Pan X, Chang K, Clark K, Teytelman L, Schmidt S, Zhao W, Cartinhour S, et al (2002) Gramene: a resource for comparative grass genomics. Nucleic Acids Res 30: 103-105
Wheeler DL, Barrett T, Benson DA, Bryant SH, Canese K, Church DM, DiCuccio M, Edgar R, Federhen S, Helmberg W, et al (2005) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 33: D39-D45
Wu CH, Yeh LSL, Huang H, Arminski L, Castro-Alvear J, Chen Y, Hu Z, Kourtesis P, Ledley RS, Suzek BE, et al (2003) The Protein Information Resource. Nucleic Acids Res 31: 345-347
Yu J, Hu S, Wang J, Wong GK, Li S, Liu B, Deng Y, Dai L, Zhou Y, Zhang X, et al (2002) A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science 296: 79-92
Zhong R, Kays SJ, Schroeder BP, Ye ZH (2002) Mutation of a chitinase-like gene causes ectopic deposition of lignin, aberrant cell shapes, and overproduction of ethylene. Plant Cell 14: 165-179