Plant Physiology 135:1888-1892 (2004)
© 2004 American Society of Plant Biologists
The Plant-Specific Database. Classification of Arabidopsis Proteins Based on Their Phylogenetic Profile¹
Department of Energy Plant Research Laboratory (R.A.G., C.W.), Department of Biochemistry and Molecular Biology (R.A.G.), and Department of Plant Biology and Genomics Technology Support Facility (M.D.L., C.W.), Michigan State University, East Lansing, Michigan 48824-1312
One of the main goals of the plant community is to determine the function of every Arabidopsis gene by the end of the year 2010 (Chory et al., 2000). The main goal of PLASdb is to stimulate further research on some of the least-studied proteins of plants. To this end, we have compiled and integrated information from public data sources (e.g. The Institute for Genomic Research [TIGR], Munich Information Center for Protein Sequences [MIPS], The Arabidopsis Information Resource [TAIR]) with links to the original information and provided links to external databases (e.g. Salk Institute Genomic Analysis Laboratory [SIGNAL]). In addition, we have performed predictions of subcellular localization and transmembrane helices, analyzed gene expression in organs based on expressed sequence tag (EST) frequencies and microarray data, and grouped protein families based on sequence similarity clustering (BLASTCLUST). Web-based interfaces to several search engines allow gene-driven or exploratory modes of data access. Information in PLASdb can be quickly downloaded in text or Excel format for further analysis (http://genomics.msu.edu/plant_specific).
PLASdb is a relational database of the nuclear-encoded Arabidopsis proteins (hereinafter proteins) classified according to their pattern of sequence similarity in the protein sets of the following organisms: Homo sapiens, Rattus norvegicus, Drosophila melanogaster, Caenorhabditis elegans, Mus musculus, Schizosaccharomyces pombe, Saccharomyces cerevisiae, a combined set of 88 species of Bacteria, and a combined set of 16 species of Archaea. We determined the phylogenetic profile of each protein sequence as a vector of nine values that indicates the similarity of the Arabidopsis protein to the proteins in each of the other sets. A similar approach was taken in previous studies of protein function in Escherichia coli (Pellegrini et al., 1999).
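The nine-value profile described above can be illustrated with a minimal sketch. It assumes that each value is reduced to present (1) or absent (0) by comparing the best BLAST E-value against a cutoff; the function name, the input mapping, and the default cutoff are hypothetical and are not taken from the PLASdb implementation.

```python
from typing import Mapping, Optional, Tuple

# The nine comparison sets named in the text, kept in a fixed order so that
# every profile has the same layout.
PROTEIN_SETS = (
    "Homo sapiens",
    "Rattus norvegicus",
    "Drosophila melanogaster",
    "Caenorhabditis elegans",
    "Mus musculus",
    "Schizosaccharomyces pombe",
    "Saccharomyces cerevisiae",
    "Bacteria",
    "Archaea",
)


def phylogenetic_profile(
    best_evalues: Mapping[str, Optional[float]],
    cutoff: float = 1e-10,  # hypothetical cutoff, not the value used by the authors
) -> Tuple[int, ...]:
    """Return a 9-tuple of 0/1 flags, one per comparison set.

    best_evalues maps a set name to the best E-value found for one
    Arabidopsis protein in that set; a missing key or None means no hit.
    """
    profile = []
    for name in PROTEIN_SETS:
        evalue = best_evalues.get(name)
        profile.append(1 if evalue is not None and evalue <= cutoff else 0)
    return tuple(profile)
```

Under this assumption, a protein whose profile is all zeros has no detectable similarity in any of the nine sets, which makes it a candidate for the plant-specific class.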
Because an apparent absence of similarity to every protein set could arise trivially from incorrect Arabidopsis gene predictions, we carried out a second filtering step. We compared the 7,868 protein sequences that showed no detectable sequence similarity to proteins in other organisms against the Arabidopsis EST database and the EST databases of 13 other vascular plant species (see list of species in the PLASdb Web site). We considered as plant specific any of the 7,868 proteins identified previously that showed significant sequence similarity (E-value
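The second filtering step can be sketched as follows. The E-value threshold is not recoverable from the text here, so it is a required argument rather than a default; the function name and the input structures are hypothetical.

```python
from typing import Iterable, List, Mapping, Optional


def filter_est_supported(
    no_hit_proteins: Iterable[str],
    best_est_evalue: Mapping[str, Optional[float]],
    est_cutoff: float,  # must be supplied by the caller; not stated in this text
) -> List[str]:
    """Keep proteins that have a significant match in a plant EST database.

    no_hit_proteins: proteins with no detectable similarity in other organisms.
    best_est_evalue: best E-value of each protein against the plant EST sets
                     (missing key or None means no EST match).
    """
    supported = []
    for protein in no_hit_proteins:
        evalue = best_est_evalue.get(protein)
        if evalue is not None and evalue <= est_cutoff:
            supported.append(protein)
    return sorted(supported)
```

The point of the filter is that EST support is evidence the predicted gene is actually transcribed, so proteins without it are set aside rather than labeled plant specific.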
In addition to the classification of Arabidopsis proteins based on their individual phylogenetic profiles, in PLASdb we have integrated information from multiple public databases and computer prediction programs. For each protein in PLASdb, the following fields of information are stored when available (summarized in Fig. 1): Mr and pI (TIGR); enzyme commission identifier from the Kyoto Encyclopedia of Genes and Genomes (Kanehisa et al., 2002
All the information associated with an Arabidopsis protein is displayed in an HTML page that we termed the Protein Properties page (see example page in Fig. 2A). This page can be accessed directly from the home page with a locus identifier (e.g. At2g39990) or from any of the HTML tables generated by the output engines (Fig. 1) as detailed below.
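A lookup by locus identifier presupposes well-formed input. The sketch below normalizes a free-text list of identifiers before a query, assuming the common AGI pattern for nuclear loci (At, chromosome 1 to 5, g, five digits) as in the example At2g39990. The function is hypothetical and is not part of PLASdb.

```python
import re
from typing import List, Tuple

# Nuclear loci only: chromosomes 1-5, e.g. At2g39990 (case-insensitive).
_LOCUS_RE = re.compile(r"^AT([1-5])G(\d{5})$", re.IGNORECASE)


def normalize_loci(raw: str) -> Tuple[List[str], List[str]]:
    """Split user input on whitespace, commas, or semicolons.

    Returns (valid, invalid): valid identifiers are rewritten in the
    'At2g39990' style and de-duplicated in input order; anything that does
    not match the pattern is reported back unchanged.
    """
    valid: List[str] = []
    invalid: List[str] = []
    for token in re.split(r"[\s,;]+", raw.strip()):
        if not token:
            continue
        match = _LOCUS_RE.match(token)
        if match:
            locus = f"At{match.group(1)}g{match.group(2)}"
            if locus not in valid:
                valid.append(locus)
        else:
            invalid.append(token)
    return valid, invalid
```

Returning the rejected tokens separately lets a caller report typos instead of silently dropping them.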
PLASdb was designed to allow easy and quick access to the stored information. We have implemented intuitive Web-based interfaces and simple ways for individual users to retrieve data for further analysis. We envision at least two scenarios that would motivate researchers to visit PLASdb.
In this scenario, researchers would have a gene or list of genes of interest that they would like to study in PLASdb. The database can be queried for individual or multiple Arabidopsis locus identifiers. Searching by genes will retrieve the matching entries found in the database and a list of the following properties: locus number, gene name, prediction of subcellular localization, prediction of transmembrane helices, the top automatic MIPS functional category assignment, and the total number of ESTs for the corresponding gene (TIGR Arabidopsis Gene Index). Individual Protein Properties pages can be accessed from this table by following the hyperlink on each of the locus numbers in the left-most column (Fig. 2B). For users interested in groups of functionally related genes, it is also possible to search PLASdb by gene families, as defined by experts in the field. We have incorporated the gene family information compiled and maintained by TAIR (Rhee et al., 2003).
A second scenario can be envisioned in which users would browse the database in an exploratory mode to discover new aspects of plant biology. A researcher would have a process or general question in mind and would use PLASdb as a way to guide the generation of a new hypothesis. To illustrate this scenario, we present two examples.
Case 1: Plant-Specific Proteins Involved in Photosynthesis
Case 2: Using PLASdb as a Guide to Study the Function of Unknown Arabidopsis Proteins
See the HOW-TO in the PLASdb Web site for further details and step-by-step examples of how to use the database.
We believe the study of Arabidopsis proteins that are unique to the plant lineage should be a priority for future functional studies. Many of these proteins have unknown functions and are not likely to be studied in other model organisms, such as yeast. Thus, their characterization represents a challenge for the plant community. Here, we describe a new resource to facilitate the functional characterization of the Arabidopsis proteins that are specific to plants. PLASdb identifies 3,848 Arabidopsis proteins as plant specific. In addition, 4,816 other proteins are classified in various groups based on their specific patterns of conservation among other eukaryotes, bacteria, or archaea. PLASdb contains extensive information compiled from multiple public data sources (e.g. annotation information, expression in organs based on microarray data) and generated with predictive algorithms (e.g. protein families, subcellular localization). We hope this new resource stimulates further research by identifying the Arabidopsis plant-specific proteins and providing quick and easy access to the available information about them.
We thank Dr. Pamela J. Green, Dr. Kenneth Keegstra, and Dr. John B. Ohlrogge for support and valuable comments throughout this project. We thank Dr. John B. Ohlrogge for critical reading of this manuscript. We thank Dr. Robert Halgren for expert bioinformatics assistance. We thank Dr. Vivek Anantharaman and Eugene V. Koonin for providing the list of Arabidopsis proteins involved in RNA metabolism. We thank Karen Bird for editorial assistance.
Received March 28, 2004; returned for revision April 8, 2004; accepted April 8, 2004.
1 This work was supported by the Department of Energy (R.A.G.; grant no. DE-FG02-91ER20021 to Dr. Pamela J. Green and Dr. Kenneth Keegstra) and by the National Science Foundation (R.A.G.; grant no. DBI-9943561 to Dr. Pamela J. Green, Dr. Kenneth Keegstra, and Dr. John B. Ohlrogge).
2 Present address: Department of Biology, 100 Washington Square East, 1009 Main Building, New York University, New York, NY 10003.
* Corresponding author; e-mail rg98{at}nyu.edu; fax 212-995-4204.
www.plantphysiol.org/cgi/doi/10.1104/pp.104.043687.
Alonso JM, Stepanova AN, Leisse TJ, Kim CJ, Chen H, Shinn P, Stevenson DK, Zimmerman J, Barajas P, Cheuk R, et al (2003) Genome-wide insertional mutagenesis of Arabidopsis thaliana. Science 301: 653-657
Anantharaman V, Koonin EV, Aravind L (2002) Comparative genomics and evolution of proteins involved in RNA metabolism. Nucleic Acids Res 30: 1427-1464
Beisson F, Koo AJ, Ruuska S, Schwender J, Pollard M, Thelen JJ, Paddock T, Salas JJ, Savage L, Milcamps A, et al (2003) Arabidopsis genes involved in acyl lipid metabolism. A 2003 census of the candidates, a study of the distribution of expressed sequence tags in organs, and a web-based database. Plant Physiol 132: 681-697
Chory J, Ecker JR, Briggs S, Caboche M, Coruzzi GM, Cook D, Dangl J, Grant S, Guerinot ML, Henikoff S, et al (2000) National Science Foundation-Sponsored Workshop Report: "The 2010 Project" functional genomics and the virtual plant. A blueprint for understanding how plants are built and how to improve them. Plant Physiol 123: 423-426
Emanuelsson O, Nielsen H, Brunak S, von Heijne G (2000) Predicting subcellular localization of proteins based on their N-terminal amino acid sequence. J Mol Biol 300: 1005-1016
Frishman D, Mokrejs M, Kosykh D, Kastenmuller G, Kolesov G, Zubrzycki I, Gruber C, Geier B, Kaps A, Albermann K, et al (2003) The PEDANT genome database. Nucleic Acids Res 31: 207-211
Gutiérrez RA, Green PJ, Keegstra K, Ohlrogge JB (2004) Phylogenetic profiling of the Arabidopsis thaliana proteome: What proteins distinguish plants from other organisms? Genome Biol 5: R53
Kanehisa M, Goto S, Kawashima S, Nakaya A (2002) The KEGG databases at GenomeNet. Nucleic Acids Res 30: 42-46
Krogh A, Larsson B, von Heijne G, Sonnhammer EL (2001) Predicting transmembrane protein topology with a hidden Markov model: application to complete genomes. J Mol Biol 305: 567-580
MacIntosh GC, Wilkerson C, Green PJ (2001) Identification and analysis of Arabidopsis expressed sequence tags characteristic of non-coding RNAs. Plant Physiol 127: 765-766
Pellegrini M, Marcotte EM, Thompson MJ, Eisenberg D, Yeates TO (1999) Assigning protein functions by comparative genome analysis: protein phylogenetic profiles. Proc Natl Acad Sci USA 96: 4285-4288
Peregrin-Alvarez JM, Tsoka S, Ouzounis CA (2003) The phylogenetic extent of metabolic enzymes and pathways. Genome Res 13: 422-427
Rhee SY, Beavis W, Berardini TZ, Chen G, Dixon D, Doyle A, Garcia-Hernandez M, Huala E, Lander G, Montoya M, et al (2003) The Arabidopsis Information Resource (TAIR): a model organism database providing a centralized, curated gateway to Arabidopsis biology, research materials and community. Nucleic Acids Res 31: 224-228
The TIGR Gene Index Databases (2003). The Institute for Genomic Research. http://www.tigr.org