Plant Physiology 138:1301-1309 (2005)
© 2005 American Society of Plant Biologists
Munich Information Center for Protein Sequences Plant Genome Resources. A Framework for Integrative and Comparative Analyses 1,[w]
Technische Universität München, Chair of Genome Oriented Bioinformatics, Center of Life and Food Science, D-85354 Freising-Weihenstephan, Germany (H.S., M.S.); and Institute for Bioinformatics, GSF National Research Center for Environment and Health, D-85764 Neuherberg, Germany (H.S., M.S., L.Y., R.E., H.G., D.H., G.H., K.F.X.M.)
With several plant genomes sequenced, the power of comparative genome analysis can now be applied. However, genome-scale cross-species analyses are limited by the effort required for data integration. To develop an integrated cross-species plant genome resource, we maintain comprehensive databases for model plant genomes, including Arabidopsis (Arabidopsis thaliana), maize (Zea mays), Medicago truncatula, and rice (Oryza sativa). Integration of data and resources is emphasized, both in house and with external partners and databases. Manual curation and state-of-the-art bioinformatic analysis are combined to achieve quality data. Easy access to the data is provided through Web interfaces and visualization tools, bulk downloads, and Web services for application-level access. This allows a consistent view of the model plant genomes for comparative and evolutionary studies, the transfer of knowledge between species, and the integration with functional genomics data.
The Munich Information Center for Protein Sequences (MIPS; http://mips.gsf.de) has been involved in maintaining plant genome databases since the Arabidopsis (Arabidopsis thaliana) genome project (Arabidopsis Genome Initiative, 2000
However, genome-scale bioinformatics analyses are limited by the effort required for data integration (Wilkinson et al., 2005
To overcome these limitations, an integrated cross-species plant genome resource is desirable. Such a resource would complement species-specific databases, which focus on providing the best possible annotation for a single genome, by providing a platform that facilitates comparative analyses. The challenges are as follows.
The MIPS plant genome resources started out as a collection of species-specific databases but with the goal of merging these into an integrated, comparative framework (Schoof et al., 2004
Instead of a single data warehouse, a modular approach was chosen for the MIPS plant genome resources. This allows for easy addition of new modules when new types of data need to be accommodated and for more flexibility in development of individual modules without being hampered by the complexity of the whole system. To be able to manage these data modules in a component-oriented manner, a multitier architecture following the Java 2 Platform, Enterprise Edition standard (http://java.sun.com/j2ee) was implemented (Mewes et al., 2004
The core of the system, around which other data modules are organized, is a flexible data model for representing genome sequence and annotation (Karlowski et al., 2003
The third data module, GeneticElement, contains all features or elements that can be represented through coordinates on the genome sequence: protein coding genes, noncoding RNAs, repeats, sequenced markers, and transposons. This list is extensible whenever new features or elements are discovered and can utilize, for example, the Sequence Ontology (http://song.sourceforge.net) for semantic relationships, e.g. that a coding sequence is part of a transcript. GeneticElements can have subelements, such as exons or domains, that do not exist independently of their parent GeneticElement; exons, for example, exist only as parts of a transcript. To accommodate more abstract concepts, like "gene," GeneticElements can be grouped. In this way, all GeneticElements belonging to a gene (promoter, transcript, alternative transcripts, regulatory elements, cDNA matches, etc.) can be identified through a single group entry. For every species, a separate physical instance of all three data modules is created to ensure scalability and separation of namespaces. The mapping of species to physical database instance is performed by the middleware. The database schema is available as supplemental data or on the MIPS Web site.
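As an illustration only, the relationships described for the GeneticElement module (elements located by coordinates, subelements that exist only within a parent element, and groups that stand in for abstract concepts such as "gene") might be sketched with a few Python data classes. All class names, field names, and example values below are hypothetical and do not reproduce the actual MIPS database schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class SubElement:
    """A part of an element (e.g. an exon); it is only stored inside its parent."""
    kind: str
    start: int  # assumed 1-based, inclusive contig coordinates
    end: int


@dataclass
class GeneticElement:
    """Any feature that can be placed on the genome sequence by coordinates."""
    identifier: str
    kind: str  # e.g. "transcript", "repeat", "marker"
    contig: str
    start: int
    end: int
    strand: str = "+"
    subelements: List[SubElement] = field(default_factory=list)


@dataclass
class ElementGroup:
    """An abstract concept such as a gene, defined only by its member elements."""
    name: str
    members: List[GeneticElement] = field(default_factory=list)

    def span(self) -> Tuple[int, int]:
        """Smallest coordinate range that covers every member of the group."""
        return (min(m.start for m in self.members),
                max(m.end for m in self.members))


# Invented example data, not taken from any MIPS database.
transcript = GeneticElement(
    "T1", "transcript", "contigA", 1000, 2000,
    subelements=[SubElement("exon", 1000, 1200),
                 SubElement("exon", 1600, 2000)],
)
promoter = GeneticElement("P1", "promoter", "contigA", 700, 999)
gene = ElementGroup("gene1", [promoter, transcript])
print(gene.span())
```

In this sketch, the group carries no coordinates of its own; its extent is derived from its members, which mirrors the idea that a "gene" is identified through a single group entry rather than stored as a feature.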
Data Content and Sources
Arabidopsis
Medicago
Gene prediction and protein annotation of Medicago sequences are performed in collaboration with the International Medicago Genome Annotation Group (IMGAG; Cannon et al., 2005
Automated annotation of Medicago sequences, including gene prediction, is also performed at other sites (VandenBosch and Stacey, 2003
Grass Genomes: Rice and Maize
For rice, the MIPS Oryza sativa database (Karlowski et al., 2003
The maize database includes 100 publicly available BAC sequences with manually curated gene predictions. Repeat detection and classification were also enhanced by manual efforts. These data provide an insight into the structure and composition of the maize genome and provide a basis for comparative and combinatorial analysis (Messing et al., 2004
The aim of the Web interface to the MIPS plant genome resources, available at http://mips.gsf.de/projects/plants/, is to provide access to all included genomes in a common format, together with tools for cross-species comparisons. To browse data, the user can navigate in a genome-oriented way. Starting, for example, from the chromosome list, all contigs anchored to each chromosome can be retrieved. A contig report contains detailed information on the entry as well as links to the sequence, EMBL database records, a list of annotated genetic elements, and a graphical viewer. The genetic element list links to reports on the protein-coding genes or other features on display. Sequences can be viewed and downloaded in Hypertext Markup Language (HTML; http://www.w3.org/MarkUp/), Extensible Markup Language (XML; http://www.w3.org/XML/), or FASTA format. For protein-coding genes, unspliced, spliced (transcript), and coding DNA sequences are available as well as protein sequences. Moreover, cross-references in the reports allow easy access to the corresponding entries in external databases.
Alternatively, complete lists of all sequenced contigs, all genetic elements, or all elements of a selected type are available for browsing. The tables displayed on a given page can be sorted by clicking on a column heading and filtered by clicking on the content of a table cell; filtering restricts the view to all rows that contain the value that was clicked. Lists of clones can be browsed separately, by chromosome if linkage data is available.
To visualize and browse genetic elements on a specified contig, a graphical interface, DBBrowser, was developed (Fig. 1A). DBBrowser uses the Scalable Vector Graphics (SVG) format and thus allows seamless zooming of the image. The SVG images can be downloaded and extensively edited as vector graphics. The controls for zooming and moving the image depend on the plugin used. For users who do not have an SVG plugin for their browser, an applet is provided that displays the SVG.
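To make the distinction between the unspliced and spliced sequences offered for protein-coding genes concrete, the following minimal sketch derives both from a contig sequence and a list of exon coordinates and writes the result in FASTA format. It assumes 1-based, inclusive coordinates and a gene on the forward strand; the function names and the toy sequence are hypothetical and are not part of the MIPS interface.

```python
def unspliced_sequence(contig_seq, start, end):
    """Genomic sequence of the gene, introns included (1-based, inclusive)."""
    return contig_seq[start - 1:end]


def spliced_sequence(contig_seq, exons):
    """Concatenate the exon sequences in ascending coordinate order.

    Assumes a forward-strand gene; a reverse-strand gene would additionally
    need to be reverse-complemented.
    """
    return "".join(contig_seq[s - 1:e] for s, e in sorted(exons))


def to_fasta(identifier, seq, width=60):
    """Format one sequence as a FASTA record with fixed-width lines."""
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return ">" + identifier + "\n" + "\n".join(lines) + "\n"


# Invented toy data: a 12-bp contig with one gene of two exons.
contig = "AAACCCGGGTTT"
exons = [(1, 3), (7, 9)]

genomic = unspliced_sequence(contig, 1, 9)
transcript = spliced_sequence(contig, exons)
record = to_fasta("gene1_spliced", transcript, width=4)
print(record, end="")
```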
Search options include search by name, free text, or sequence. The free-text search option allows inspection of the content of all text fields, and it is available for individual genomes or across all databases. BLAST is used as a homology search engine (Altschul et al., 1997
Finally, the download section provides FTP access to various data downloads, including FASTA-formatted sequence files for all clones/contigs and protein coding genes. The download section can also create a Genome Annotation Markup Elements file (GAMEXML; http://xml.coverpages.org/game.html) for a specified contig and coordinate range, which can then be downloaded. The GAMEXML format is used by the Apollo Genome Browser (Lewis et al., 2002
Bioinformatics tools and databases are most commonly accessible through Web interfaces that allow navigation using a standard browser. While it is relatively easy to provide a human-readable presentation of data in this way, the data is not easily accessed from applications in order to be integrated with remote data sets. Solutions to this problem have been screen scraping, where the HTML code of Web pages was parsed in order to grab the information therein, or the import of bulk data dumps into local data warehouses to make the data available for integration. Maintaining current data is hard, as the HTML representation or data dump format may change, requiring changes to the parser, and as data needs to be reimported into the local warehouse for every update.
Recently, Web services have been suggested as a solution to the problem of data and analysis resource interoperability and data integration for biology and bioinformatics (Schoof et al., 2004
Currently, 35 BioMOBY-compliant Web services provide retrieval (22) and analysis (13) functionality (see http://www.eu-plant-genome.net [select "Tools"]). These allow retrieval of protein and DNA sequences for keywords, EMBL accessions, or AGI locus codes (Schoof et al., 2002
These services implement a public application interface that provides consistent access to all plant genome resources over the Internet. On the one hand, these allow a programmer to build applications that retrieve data as required, e.g. retrieving genomic sequences, then iterating over all protein-coding genes and extracting the upstream sequences based on their coordinates. On the other hand, BioMOBY also provides tools with more user-friendly interfaces that allow point-and-click discovery of data (Wilkinson et al., 2005
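The example mentioned above, iterating over protein-coding genes and extracting upstream sequences from their coordinates, can be sketched as follows. The sketch works on data already held in memory rather than on calls to the actual Web services, assumes 1-based, inclusive coordinates given on the forward strand, and uses hypothetical function names and an invented toy contig.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A", "N": "N"}


def reverse_complement(seq):
    """Reverse-complement a DNA sequence (A, C, G, T, N only)."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def upstream_sequence(contig_seq, start, end, strand, length):
    """Return up to `length` bases upstream of a gene, read 5'->3' on its strand.

    `start` and `end` are 1-based, inclusive, forward-strand coordinates.
    Near a contig end the returned sequence is simply shorter.
    """
    if strand == "+":
        lo = max(0, start - 1 - length)
        return contig_seq[lo:start - 1].upper()
    # Reverse strand: the upstream region lies after `end` on the forward strand.
    return reverse_complement(contig_seq[end:end + length])


# Invented toy data.
contig = "AACCGGTTACGTACGT"
genes = [
    {"id": "g1", "start": 9, "end": 12, "strand": "+"},
    {"id": "g2", "start": 5, "end": 8, "strand": "-"},
]
upstream = {
    g["id"]: upstream_sequence(contig, g["start"], g["end"], g["strand"], 3)
    for g in genes
}
print(upstream)
```

Handling the strand explicitly is the main design point: for a reverse-strand gene the upstream region follows the gene on the forward strand and has to be reverse-complemented to be read in the direction of transcription.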
The first overview of a genome can be achieved through some basic statistics, like average gene length, exon number, gene density, or GC content. These are regularly calculated from the plant genome databases and made available in the Web interface (Fig. 1B).
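How such summary statistics might be derived from sequence and annotation can be sketched in a few lines. The following is a minimal illustration under stated assumptions (1-based, inclusive gene coordinates; gene density expressed as genes per kilobase; GC content computed over unambiguous bases only); the function names and example data are hypothetical and do not describe the procedure actually used for the MIPS databases.

```python
def gc_content(seq):
    """Fraction of G and C among the unambiguous bases (A, C, G, T)."""
    seq = seq.upper()
    acgt = sum(seq.count(base) for base in "ACGT")
    if acgt == 0:
        return 0.0
    return (seq.count("G") + seq.count("C")) / acgt


def genome_statistics(contig_seq, genes):
    """Basic statistics for one contig.

    `genes` is a list of dicts with "start" and "end" (1-based, inclusive)
    and "exons", a list of (start, end) tuples.
    """
    n = len(genes)
    if n == 0 or not contig_seq:
        raise ValueError("need a non-empty sequence and at least one gene")
    total_length = sum(g["end"] - g["start"] + 1 for g in genes)
    total_exons = sum(len(g["exons"]) for g in genes)
    return {
        "average_gene_length": total_length / n,
        "average_exon_number": total_exons / n,
        "genes_per_kb": n / (len(contig_seq) / 1000),
        "gc_content": gc_content(contig_seq),
    }


# Invented toy data: a 2,000-bp contig with two genes.
contig = "ACGT" * 500
genes = [
    {"start": 101, "end": 500, "exons": [(101, 200), (301, 500)]},
    {"start": 1001, "end": 1600,
     "exons": [(1001, 1100), (1201, 1300), (1401, 1450), (1501, 1600)]},
]
stats = genome_statistics(contig, genes)
print(stats)
```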
An important tool for comparative genomics is the prediction of orthologs between genomes. We use a tool initially designed for the detection of putative conserved orthologous set (COS) markers (Fulton et al., 2002
The consistent application interface to all plant databases at MIPS facilitates the implementation of sophisticated query tools. We have set up a sequence export tool that allows users to download specific sequence data sets such as all first introns of all protein-coding genes on a selected contig, or a selected number of base pairs upstream of all start codons in a genome (Fig. 1D). This will soon also be available as a Web service.
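One of the export examples, the first intron of a gene, can be derived from exon coordinates alone. The sketch below is illustrative and not the code of the MIPS sequence export tool; it assumes exons given as 1-based, inclusive forward-strand coordinates, and the function name is hypothetical.

```python
def first_intron(exons, strand):
    """Coordinates of the first intron in transcript order, or None.

    `exons` is a list of (start, end) tuples in 1-based, inclusive
    forward-strand coordinates. On the reverse strand the first exon of the
    transcript is the right-most one, so the first intron lies between the
    last two exons in forward-strand order.
    """
    ordered = sorted(exons)
    if len(ordered) < 2:
        return None  # single-exon genes have no intron
    if strand == "+":
        left, right = ordered[0], ordered[1]
    else:
        left, right = ordered[-2], ordered[-1]
    return (left[1] + 1, right[0] - 1)


# Invented toy data: one gene model, evaluated on either strand.
exons = [(100, 200), (301, 400), (551, 700)]
forward = first_intron(exons, "+")
reverse = first_intron(exons, "-")
single = first_intron([(100, 200)], "+")
print(forward, reverse, single)
```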
The Web service interface to the MIPS plant genome resources provides a versatile and powerful, yet easy to use, access for remote users, enabling them to create their own analyses. An example workflow that can be realized in this way would retrieve putative GeneOntology (GO; Ashburner et al., 2000
This can be extended to a whole-genome comparative analysis by retrieving GO term annotation for all Arabidopsis genes with a putative ortholog in both Medicago and rice and comparing it to that of all Arabidopsis genes with no detected ortholog in Medicago or rice. This workflow is available as supplemental data or from the MIPS Web site. It first retrieves all identifiers for all protein-coding genetic elements in the MIPS Arabidopsis database. Then, it queries a service that returns putative orthologs between Arabidopsis, Medicago, and rice (based on best bidirectional hits extracted from the MIPS SIMAP database; Güldener et al., 2005
The result is provided as an Excel file in the supplemental data; the most frequent GO terms are, unsurprisingly, GO:0000004 biological_process unknown and GO:0008372 cellular_component unknown. However, both these terms are underrepresented in the set of Arabidopsis proteins with putative orthologs, suggesting that conserved proteins are more likely to be annotated with functional or localization information. Executing this workflow requires several hours and 1 GB of RAM on a standard workstation, as tens of thousands of Web service calls are executed, returning 89,444 GO term-protein associations. This may seem tedious to anyone accustomed to answering such questions using, for example, SQL queries on a data warehouse. However, Taverna and the Web service architecture are fully capable of this kind of analysis and allow enormous flexibility in the generation of queries, without the need for warehoused data or knowledge of SQL: distributed data sources can be combined on the fly. Rather than using precalculated orthologs, a similar workflow could incorporate a BLAST service to calculate similarities on the fly, allowing users to start with their own set of proteins and build a workflow to discover which of these have homologs in the MIPS plant resources.
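The logic of this comparison, independent of Taverna and of the actual services, can be sketched in plain Python: putative orthologs are taken as best bidirectional hits, the Arabidopsis proteins are split into those with an ortholog in both other species and those with none, and the fraction of proteins carrying each GO term is computed for both sets. The protein identifiers, hit tables, and annotations below are invented for illustration, and the function names are hypothetical; real input would come from the services described above.

```python
from collections import Counter


def best_bidirectional_hits(a_to_b, b_to_a):
    """Keep pairs in which each protein is the other's best hit."""
    return {a: b for a, b in a_to_b.items() if b_to_a.get(b) == a}


def term_frequencies(proteins, annotations):
    """Fraction of the given proteins annotated with each GO term."""
    counts = Counter()
    for protein in proteins:
        counts.update(set(annotations.get(protein, ())))
    n = len(proteins)
    return {term: c / n for term, c in counts.items()} if n else {}


# Invented best-hit tables (query -> best hit in the other proteome).
ath_to_mt = {"A1": "M1", "A2": "M2", "A3": "M9"}
mt_to_ath = {"M1": "A1", "M2": "A2", "M9": "A7"}
ath_to_os = {"A1": "O1", "A2": "O5"}
os_to_ath = {"O1": "A1", "O5": "A8"}

# Invented GO annotations for four Arabidopsis proteins.
annotations = {
    "A1": ["GO:0003677"],
    "A2": ["GO:0003677"],
    "A3": ["GO:0000004"],
    "A4": ["GO:0000004", "GO:0008372"],
}

in_medicago = set(best_bidirectional_hits(ath_to_mt, mt_to_ath))
in_rice = set(best_bidirectional_hits(ath_to_os, os_to_ath))
conserved = in_medicago & in_rice
unmatched = set(annotations) - in_medicago - in_rice

conserved_freq = term_frequencies(conserved, annotations)
unmatched_freq = term_frequencies(unmatched, annotations)
print(sorted(conserved), conserved_freq)
print(sorted(unmatched), unmatched_freq)
```

Comparing the two frequency tables term by term then shows which annotations are over- or underrepresented among conserved proteins; a statistical test would be needed before drawing conclusions from real data.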
Plant genome data and associated data have revolutionized plant research in recent years. The individual model plant research communities have benefited greatly from appropriate storage, communication, and display of genome data, which are a prerequisite for the sustainable maintenance of genome annotation and an indispensable informational infrastructure. The results from genome research and genome-scale analyses have had significant impact. Even from partial genomes such as collections of BAC end sequences (BESs), detailed insights into the composition and characteristics of particular genomes can be gained (Palmer et al., 2003
This comparative aspect is essential to unleash the full potential of mining sequence data. Comparative analyses have been demonstrated to be extremely powerful in yeasts and vertebrates (Mouse Genome Sequencing Consortium, 2002
We will continue to work toward a consistent view of the model plant genomes, which entails intense collaboration with international partners to synchronize our efforts with other plant genome initiatives, databases, and analysis resources. To this end, BioMOBY-compliant Web services are increasingly helpful. At the same time, they enable a completely new user experience, providing a remote application interface that can be harnessed by tools such as Taverna to bring comparative analyses to remote users without the need for programming or warehousing. While current services focus on sequence, gene structure, and function annotation, integration of modules to handle expression data from microarray experiments is expected to add significantly to the value of the resource. Further plans are to enable browsing syntenic regions and viewing predicted orthologs or homologs in any gene report. To this end, the integration with existing MIPS systems like SIMAP (Mewes et al., 2004
Plant genome databases will have to evolve with our increasing knowledge of how genomes work. At the same time, well-structured integrated knowledge resources and new query and access interfaces can facilitate research and help uncover some of the genomes' secrets.
Several students contributed code and programming effort. We thank Arthur Zimek for work on DBBrowser, Mirjam Maier and Peter Kral for work on the sequence export, Elisabeth Wischnitzki and Patrick Tischler for work on the COS markers, and Daniela Foelsl for work on the BLAST BioMOBY services. Thomas Rattei has helped with SIMAP access. Tobias Hindemitt assisted with graphics and Web design. We thank all members of the BioMOBY community, the IMGAG, and the PlaNet project for cooperation and helpful discussions on data exchange and standards.
Received January 5, 2005; returned for revision April 6, 2005; accepted May 3, 2005.
1 This work was supported by the Federal Ministry of Education and Research, Germany, through the Genomanalyse im biologischen System Pflanze (http://www.gabi.de) project. Work on the European Medicago and Legume Database is funded in the European Grain Legumes Integrated Project by the Sixth European Union Framework Programme of the European Commission (grant no. FP6 FOOD-CT-2004-506223). Data and database integration for Arabidopsis is funded by the Fifth European Union Framework Programme PlaNet project (grant no. FP5 QLRI-CT-2001-00006).
2 Present address: Max Planck Institute for Plant Breeding Research, D-50829 Cologne, Germany.
[w] The online version of this article contains Web-only data.
www.plantphysiol.org/cgi/doi/10.1104/pp.104.059188.
* Corresponding author; e-mail schoof{at}mpiz-koeln.mpg.de; fax 492215062413.
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389-3402
Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408: 796-815
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25: 25-29
Bejerano G, Pheasant M, Makunin I, Stephen S, Kent WJ, Mattick JS, Haussler D (2004) Ultraconserved elements in the human genome. Science 304: 1321-1325
Bonnet E, Wuyts J, Rouze P, Van de Peer Y (2004) Detection of 91 potential conserved plant microRNAs in Arabidopsis thaliana and Oryza sativa identifies important target genes. Proc Natl Acad Sci USA 101: 11511-11516
Cannon SB, Crow JA, Heuer ML, Wang X, Cannon EKS, Dwan C, Lamblin AF, Vasdewani J, Mudge J, Cook A, et al (2005) Databases and information integration for the Medicago truncatula genome and transcriptome. Plant Physiol 138: 38-46
Fulton TM, Van der Hoeven R, Eannetta NT, Tanksley SD (2002) Identification, analysis, and utilization of conserved ortholog set markers for comparative genomics in higher plants. Plant Cell 14: 1457-1467
Goff SA, Ricke D, Lan TH, Presting G, Wang R, Dunn M, Glazebrook J, Sessions A, Oeller P, Varma H, et al (2002) A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science 296: 92-100
Güldener U, Münsterkötter M, Kastenmüller G, Strack N, van Helden J, Lemer C, Richelles J, Wodak SJ, Garcia-Martínez J, Pérez-Ortín JE, et al (2005) CYGD: the Comprehensive Yeast Genome Database. Nucleic Acids Res (Database issue) 33: D364-D368
Haberer G, Hindemitt T, Meyers BC, Mayer KF (2004) Transcriptional similarities, dissimilarities, and conservation of cis-elements in duplicated genes of Arabidopsis. Plant Physiol 136: 3009-3022
International Chicken Genome Sequencing Consortium (2004) Sequence and comparative analysis of the chicken genome provide unique perspectives on vertebrate evolution. Nature 432: 695-716
Karlowski WM, Schoof H, Janakiraman V, Stuempflen V, Mayer KFX (2003) MOsDB: an integrated information resource for rice genomics. Nucleic Acids Res 31: 190-192
Kellis M, Patterson N, Endrizzi M, Birren B, Lander ES (2003) Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature 423: 241-254
Lawrence CJ, Dong Q, Polacco ML, Seigfried TE, Brendel V (2004) MaizeGDB, the community database for maize genetics and genomics. Nucleic Acids Res 32: 393-397
Lewis SE, Searle SMJ, Harris N, Gibson M, Iyer V, Ricter J, Wiel C, Bayraktaroglu L, Birney E, Crosby MA, et al (2002) Apollo: a sequence annotation editor. Genome Biology 3: RESEARCH0082
Mayer K, Schuller C, Wambutt R, Murphy G, Volckaert G, Pohl T, Dusterhoft A, Stiekema W, Entian KD, Terryn N, et al (1999) Sequence and analysis of chromosome 4 of the plant Arabidopsis thaliana. Nature 402: 769-777
Messing J, Bharti AK, Karlowski WM, Gundlach H, Kim HR, Yu Y, Wei F, Fuks G, Soderlund CA, Mayer KF, et al (2004) Sequence composition and genome organization of maize. Proc Natl Acad Sci USA 101: 14349-14354
Mewes HW, Amid C, Arnold R, Frishman D, Güldener U, Mannhaupt G, Münsterkötter M, Pagel P, Strack N, Stümpflen V, et al (2004) MIPS: analysis and annotation of proteins from whole genomes. Nucleic Acids Res 32: 41-44
Mouse Genome Sequencing Consortium (2002) Initial sequencing and comparative analysis of the mouse genome. Nature 420: 520-562
Oinn T, Addis M, Ferris J, Marvin D, Greenwood M, Carver T, Pocock MR, Wipat A, Li P (2004) Taverna: a tool for the composition and enactment of bioinformatics workflows. Bioinformatics 20: 3045-3054
Palmer LE, Rabinowicz PD, O'Shaughnessy AL, Balija VS, Nascimento LU, Dike S, de la Bastide M, Martienssen RA, McCombie WR (2003) Maize genome sequencing by methylation filtration. Science 302: 2115-2117
Pan X, Liu H, Clarke J, Jones J, Bevan M, Stein L (2003) ATIDB: Arabidopsis thaliana insertion database. Nucleic Acids Res 31: 1245-1251
Rat Genome Sequencing Project Consortium (2004) Genome sequence of the Brown Norway rat yields insights into mammalian evolution. Nature 428: 493-521
Reinhart BJ, Weinstein EG, Rhoades MW, Bartel B, Bartel DP (2002) MicroRNAs in plants. Genes Dev 16: 1616-1626
Rhee SY, Beavis W, Berardini TZ, Chen G, Dixon D, Doyle A, Garcia-Hernandez M, Huala E, Lander G, Montoya M, et al (2003) The Arabidopsis Information Resource (TAIR): a model organism database providing a centralized, curated gateway to Arabidopsis biology, research materials and community. Nucleic Acids Res 31: 224-228
Riley ML, Schmidt T, Wagner C, Mewes HW, Frishman D (2005) The PEDANT genome database in 2005. Nucleic Acids Res 33: D308-D310
Rudd S, Schoof H, Mayer KFX (2005) PlantMarkers: a database of predicted molecular markers from plants. Nucleic Acids Res 33: D628-D632
Schiex T, Moisan A, Rouzé P (2001) EuGène: an eukaryotic gene finder that combines several sources of evidences. In O Gascuel, M-F Sagot, eds, Computational Biology. LNCS 2066. Springer-Verlag, Heidelberg, pp 111-125
Schoof H (2003) Towards interoperability in genome databases: the MAtDB (MIPS Arabidopsis thaliana database) experience. Comp Funct Genomics 4: 255-258
Schoof H, Ernst R, Mayer KFX (2004) The PlaNet consortium: a network of European plant databases connecting plant genome data in an integrated biological knowledge resource. Comp Funct Genomics 5: 184-189
Schoof H, Zaccaria P, Gundlach H, Lemcke K, Rudd S, Kolesov G, Arnold R, Mewes HW, Mayer KFX (2002) MIPS Arabidopsis thaliana Database (MAtDB): an integrated biological knowledge resource based on the first complete plant genome. Nucleic Acids Res 30: 91-93
VandenBosch KA, Stacey G (2003) Summaries of legume genomics projects from around the globe: community resources for crops and models. Plant Physiol 131: 840-865
Ware D, Jaiswal P, Ni J, Pan X, Chang K, Clark K, Teytelman L, Schmidt S, Zhao W, Cartinhour S, et al (2002) Gramene: a resource for comparative grass genomics. Nucleic Acids Res 30: 103-105
Whitelaw CA, Barbazuk WB, Pertea G, Chan AP, Cheung F, Lee Y, Zheng L, van Heeringen S, Karamycheva S, Bennetzen JL, et al (2003) Enrichment of gene-coding sequences in maize by genome filtration. Science 302: 2118-2120
Wilkinson MD, Schoof H, Ernst R, Haase D (2005) BioMOBY successfully integrates distributed heterogeneous bioinformatics web services. The PlaNet exemplar case. Plant Physiol 138: 5-17
Wilkinson MD, Links M (2002) BioMOBY: an open source biological web services proposal. Brief Bioinform 3: 331-341
Wortman JR, Haas BJ, Hannick LI, Smith RK Jr, Maiti R, Ronning CM, Chan AP, Yu C, Ayele M, Whitelaw CA, et al (2003) Annotation of the Arabidopsis genome. Plant Physiol 132: 461-468
Yazaki J, Kojima K, Suzuki K, Kishimoto N, Kikuchi S (2004) The Rice PIPELINE: a unification tool for plant functional genomics. Nucleic Acids Res 32: 383-387