First published online April 27, 2007; 10.1104/pp.106.095059 Plant Physiology 144:857-866 (2007) © 2007 American Society of Plant Biologists OPEN ACCESS ARTICLE
EGENES: Transcriptome-Based Plant Database of Genes with Metabolic Pathway Information and Expressed Sequence Tag Indices in KEGG 1,[C],[W],[OA]
Laboratory of Bioknowledge Systems, Bioinformatics Center, Institute for Chemical Research, Kyoto University, Gokasho Uji, Kyoto 611-0011, Japan (A.M.-N., S.G., R.J., M.I., Y.M., M.K.); Laboratory of Bioinformatics and Bioknowledge Systems, Institute of Biochemistry and Biophysics, and Center of Excellence for BioMathematics, School of Computer Science, University of Tehran, Tehran, Iran (A.M.-N.); Laboratory of Genome Database, Human Genome Center, University of Tokyo, Tokyo 108-8639, Japan (S.K., M.K.); and Laboratory of Plant Genetics, Division of Applied Bioscience, Kyoto University, Kyoto 606-8502, Japan (T.R.E.)
EGENES is a knowledge-based database for efficient analysis of plant expressed sequence tags (ESTs) that was recently added to the KEGG suite of databases. It links plant genomic information with higher order functional information in a single database. It also provides gene indices for each genome. The genomic information in EGENES is a collection of EST contigs constructed from assembly of ESTs. Due to the extremely large genomes of plant species, the bulk collection of data such as ESTs is a quick way to capture a complete repertoire of genes expressed in an organism. Using ESTs for reconstructing metabolic pathways is a new expansion in KEGG and provides researchers with a new resource for species in which only EST sequences are available. Functional annotation in EGENES is a process of linking a set of genes/transcripts in each genome with a network of interacting molecules in the cell. EGENES is a multispecies, integrated resource consisting of genomic, chemical, and network information containing a complete set of building blocks (genes and molecules) and wiring diagrams (biological pathways) to represent cellular functions. Using EGENES, genome-based pathway annotation and EST-based annotation can now be compared and mutually validated. The ultimate goals of EGENES will be to: bring new plant species into KEGG by clustering and annotating ESTs; abstract knowledge and principles from large-scale plant EST data; and improve computational prediction of systems of higher complexity. EGENES will be updated at least once a year. EGENES is publicly available and is accessible by the following link or by KEGG's navigation system (http://www.genome.jp/kegg-bin/create_kegg_menu?category=plants_egenes).
In the past decade, bioinformatics has become an integral part of research and development in the biological sciences. It now plays an essential role both in deciphering genomic, transcriptomic, and proteomic data generated by high-throughput experimental technologies and in organizing information gathered from traditional biology. Sequence-based protocols for analyzing individual genes or proteins have been elaborated and expanded, and different methods have been developed for analyzing large numbers of genes or proteins simultaneously. To date, genome sequences for more than 349 species have been published, and sequencing of 987 other prokaryotic and 588 eukaryotic genomes is in progress (Liolios et al., 2006).
To increase our understanding of cellular processes from genome information, pathway databases such as KEGG (Kanehisa, 1997
EGENES is a new database recently added to the KEGG suite of databases (Kanehisa et al., 2006). In this article, we provide an introduction to EGENES and discuss its importance for the plant research community. Because all the resources in KEGG follow the same architecture and design, an appraisal of EGENES should give readers an idea of the available information stored in KEGG and how to use it efficiently.
One of the main problems with EST data is poor quality and contamination (vectors, repeats, organelles, and low-complexity regions), which cannot be completely avoided. Many groups have used different vector and repeat databases for decontamination, but none of these databases covers all contaminants. To remove contaminants effectively, we first made a nonredundant custom database of repeats and vectors covering almost all publicly available vector and repeat databases (Masoudi-Nejad et al., 2006).
The pipeline used for processing, assembling, and KEGG-based annotation of the ESTs is summarized in Figure 1 and was described previously (Masoudi-Nejad et al., 2004).
The sequence cleaning process involves basic procedures, such as removing polyA/polyT tails, clipping low-quality ends (ends rich in undetermined bases), and discarding sequences that are too short (shorter than 100 bases) or that consist mostly of low-complexity sequence. The repeat masking process uses the RepeatMasker program (Smit et al., 2004).
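The cleaning steps listed above can be illustrated with a minimal Python sketch. It is not the pipeline used for EGENES: the function names, the 10-base homopolymer threshold, the 20-base window, and the 10% cutoff for undetermined bases are all assumptions chosen for illustration, and low-complexity filtering and repeat masking are left out. Only the minimum length of 100 comes from the text.

```python
import re
from typing import Optional

MIN_LENGTH = 100  # sequences shorter than this are discarded (value from the text)


def trim_poly_tail(seq: str, min_run: int = 10) -> str:
    """Remove a trailing polyA run or a leading polyT run of at least min_run bases.

    min_run is an illustrative assumption, not a documented EGENES setting.
    """
    seq = re.sub(r"A{%d,}$" % min_run, "", seq)
    seq = re.sub(r"^T{%d,}" % min_run, "", seq)
    return seq


def clip_n_rich_ends(seq: str, window: int = 20, max_n_fraction: float = 0.1) -> str:
    """Clip bases from either end while the terminal window is rich in N."""
    while len(seq) >= window and seq[:window].count("N") / window > max_n_fraction:
        seq = seq[1:]
    while len(seq) >= window and seq[-window:].count("N") / window > max_n_fraction:
        seq = seq[:-1]
    # Drop any undetermined bases still sitting at the very ends.
    return seq.strip("N")


def clean_est(seq: str) -> Optional[str]:
    """Return the cleaned sequence, or None if it is too short to keep."""
    seq = clip_n_rich_ends(seq.upper())
    seq = trim_poly_tail(seq)
    return seq if len(seq) >= MIN_LENGTH else None


if __name__ == "__main__":
    raw = "ACGT" * 30 + "A" * 15 + "N" * 5
    print(clean_est(raw))
```

Clipping the ends before trimming the tail is a design choice of this sketch: a polyA run followed by undetermined bases would otherwise not be anchored at the end of the read.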
For annotation, EST contigs were compared against each genome in the KEGG GENES database by BLAST searches (Moriya et al., 2005).
The current version of EGENES (release 41.0, January 2007) consists of 25 species. Table I shows the species included in the initial version of EGENES. Two genome-based plant entries, for Arabidopsis (Arabidopsis thaliana) and rice (Oryza sativa), are also available in KEGG; their genome-based pathways were produced by manual KO annotation, based on Smith-Waterman sequence similarity searches of the latest TIGR gene annotations of Arabidopsis and rice against the KEGG GENES database. The table shows the corresponding KEGG codes, covering 11 families (mustard, rue, mallow, pea, willow, daisy, nightshade, grape, grasses, pine, and moss). Entries based on ESTs have four-character codes that start with "e" (e.g. "eath" for the Arabidopsis EST data), and complete genome sequences have three-character codes (e.g. "ath" for the Arabidopsis genome data). Clicking on the species code in the EGENES entry page opens a new window with relevant information for each species, including taxonomy data (Fig. 2A). The data for each species in EGENES are organized in two distinct parts: pathway maps and gene catalogs.
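The code convention described above is simple enough to express as a small Python helper. This is a sketch only: the function name and the returned dictionary layout are hypothetical, and the idea that a genome-based counterpart can be found by dropping the leading "e" is an assumption drawn from the eath/ath and eosa/osa pairs mentioned in this article, not a documented KEGG rule.

```python
def describe_kegg_code(code: str) -> dict:
    """Classify a KEGG organism code by the convention described in the text.

    Four-character codes starting with "e" denote EST-based entries;
    three-character codes denote genome-based entries.
    """
    if len(code) == 4 and code.startswith("e"):
        # Assumption: the genome-based counterpart, if one exists,
        # is the same code without the leading "e" (e.g. eath -> ath).
        return {"code": code, "source": "EST", "genome_code": code[1:]}
    if len(code) == 3:
        return {"code": code, "source": "genome", "genome_code": code}
    raise ValueError("unrecognized organism code: %r" % code)


if __name__ == "__main__":
    for organism in ("eath", "ath", "eosa", "osa"):
        print(describe_kegg_code(organism))
```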
The pathway maps for each species were derived from KEGG's reference pathways based on the automatic annotation described above. The KEGG PATHWAY database contains 266 reference pathways, all of which have been manually curated. Each reference pathway can be viewed as a network of enzymes or a network of enzyme commission (EC) numbers. Knowledge-based prediction of metabolic pathways involves matching genes in the genome against enzymes in the KEGG reference pathway. Historically, the integration of pathway information and genomic information was first achieved in KEGG through EC numbers. The EC system uses four numbers that define the function of an enzyme in a hierarchical manner. However, different enzymes can have the same EC number but function in different pathways, so when an EC number refers to more than one enzyme, we have to determine which enzyme is appropriate to a particular KEGG pathway. To overcome this and other problems inherent in the enzyme nomenclature, and to incorporate nonmetabolic pathways, a new scheme based on ortholog identifiers (KO) was introduced to replace the EC numbers. KO is based on computational analysis, as well as manual curation, and decomposes all genes in the complete genomes into sets of orthologs. Here, two genes are considered orthologs, or members of the same KO group, when they are mapped to the same KEGG pathway node. KO will be used not only to characterize all known pathways but also to explore unknown pathways that have not been experimentally verified but can be inferred by sequence similarity to enzymes from other species. Once enzyme genes are identified in the genome based on sequence similarity and the EC numbers and KO identifiers are properly assigned, species-specific pathways can be constructed computationally by correlating EST contigs with gene products (enzymes) in the reference pathways according to the matching KO identifiers and EC numbers.
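The last step of this paragraph, deriving a species-specific pathway by matching contig annotations to reference pathway nodes, can be sketched in Python as follows. This is an illustration under stated assumptions rather than the KEGG implementation: the function name, the dictionary-based data layout, and the node "unmatched_node" with its placeholder identifier are hypothetical. Only the pairing of contig 3213 with K00844 and EC 2.7.1.1 is taken from this article.

```python
from typing import Dict, List, Set


def build_species_pathway(
    reference_nodes: Dict[str, Set[str]],
    contig_to_ko: Dict[str, str],
) -> Dict[str, List[str]]:
    """Map each reference pathway node to the contigs that support it.

    reference_nodes: node label (e.g. an EC number) -> KO identifiers behind it.
    contig_to_ko:    contig identifier -> KO identifier assigned by annotation.
    A node with an empty list has no supporting contig in this species.
    """
    # Invert the annotation so contigs can be looked up by KO identifier.
    ko_to_contigs: Dict[str, List[str]] = {}
    for contig, ko in contig_to_ko.items():
        ko_to_contigs.setdefault(ko, []).append(contig)

    species_pathway: Dict[str, List[str]] = {}
    for node, kos in reference_nodes.items():
        hits: List[str] = []
        for ko in kos:
            hits.extend(ko_to_contigs.get(ko, []))
        species_pathway[node] = sorted(hits)
    return species_pathway


if __name__ == "__main__":
    reference = {
        "2.7.1.1": {"K00844"},                  # hexokinase, as cited in the article
        "unmatched_node": {"K_PLACEHOLDER"},    # hypothetical node with no contig
    }
    annotation = {"3213": "K00844"}
    print(build_species_pathway(reference, annotation))
```

Keying the match on KO identifiers rather than EC numbers mirrors the reasoning in the text: one EC number can cover several enzymes, whereas a KO group is tied to a pathway node.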
Figure 2B shows a reference pathway diagram in KEGG. Each pathway consists of rectangles representing enzymes labeled with their EC numbers, small circles representing reactants of the enzyme reactions, and their connections. Here, each rectangle has the corresponding KOs as underlying function information. Because the KO can be used for linking enzymes and genes coding for the enzymes in each species, this graphic representation is most useful in superimposing the genomic information onto the knowledge of metabolic pathways, which helps to deduce metabolism for each species. Because the metabolic pathway, especially for intermediary metabolism, is well conserved among most species and we have already annotated enzyme genes from completely sequenced genomes, it is possible to manually draw one reference pathway and consequently computationally generate many species-specific pathways by using EGENES annotation based on BHR.
Pathway maps have been hierarchically organized into four main categories, each of which consists of many subcategories (Kanehisa et al., 2006).
Exploring the Pathway Maps
By clicking the pathway maps link in Figure 2A, a hierarchical context window will appear, displaying all the pathways present in the current species (Supplemental Fig. S1). Each number to the left of the pathway map name is the pathway map identification. Clicking on each pathway will show the diagram for that pathway, including all enzymes and compounds in the reference pathway. For example, by clicking on Photosynthesis under Energy Metabolism, the corresponding pathway diagram will be shown (http://www.genome.jp/dbget-bin/get_pathway?org_name=eath&mapno=00195). White boxes (e.g. PsbB, PsbF, etc.) indicate that there are no Arabidopsis ESTs similar to those genes/enzymes in other species. Colored boxes are the enzymes that are present in Arabidopsis (EC 1.18.1.2, PetB, PsbA, PsbO, etc.). By clicking on any colored box, gene catalog data for that gene (EST) are shown.
By clicking the Gene Catalogs link in Figure 2A, the entry point for the EST-based gene catalog of Arabidopsis will appear (http://www.genome.jp/kegg-bin/show_organism?menu_type=gene_catalogs&org=eath). The gene catalog is the EST/gene index and EGENES functional classification, which includes functional hierarchies of KEGG pathways, ortholog groups (KO), and protein families. Clicking on KO will show the first level of the pathway-based classification of ortholog groups. The first level contains Metabolism, Genetic Information Processing, Environmental Information Processing, and Cellular Processes (http://www.genome.jp/dbget-bin/get_htext?eath00001.keg+-p+/kegg/brite/eath+-f+F). The second and third levels show further hierarchies inside each part of the first level. The fourth level shows the list of contigs orthologous to each enzyme in each category (http://www.genome.jp/dbget-bin/get_htext?eath00001.keg+-p+/kegg/brite/eath+-f+F+D). For example, the second contig that appears is 3213, corresponding to KO K00844. By clicking on K00844, a new window shows the information related to the contig, including other related pathways (Fig. 2C). An ortholog of this contig is hexokinase (EC 2.7.1.1), which is present in glycolysis (PATH ko00010), Fru and Man metabolism (PATH ko00051), Gal metabolism (PATH ko00052), starch and Suc metabolism (PATH ko00500), and aminosugars metabolism (PATH ko00530), all of which belong to carbohydrate metabolism. Figure 2C also shows other associated information, such as EC, reaction, substrate, product, pathway, and Gene Ontology (GO) number. Reaction includes the reactions in which this enzyme participates. By clicking on any reaction identifier (e.g. R00867), users will get new windows showing the associated reactions and compounds (http://www.genome.jp/dbget-bin/www_bget?rn+R00867). Clicking on any Compound (e.g. C00002) shows the compound on which this enzyme acts (http://www.genome.jp/dbget-bin/www_bget?compound+C00002).
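The links quoted in this walkthrough follow two visible patterns: a pathway viewer that takes an organism code and a map number, and an entry viewer that takes a database name and an identifier joined by "+". The Python sketch below rebuilds those addresses from their parts. The function names are hypothetical, the patterns are inferred only from the example links in this article, and the addresses may have changed since publication, so this should be read as an illustration of the link structure rather than a supported interface.

```python
from urllib.parse import urlencode

# Base address taken from the example links quoted in the article.
BASE = "http://www.genome.jp/dbget-bin"


def pathway_url(org_code: str, map_number: str) -> str:
    """Build a pathway-diagram link such as the photosynthesis example."""
    query = urlencode({"org_name": org_code, "mapno": map_number})
    return "%s/get_pathway?%s" % (BASE, query)


def entry_url(database: str, entry_id: str) -> str:
    """Build an entry link in the 'database+identifier' style of the examples."""
    return "%s/www_bget?%s+%s" % (BASE, database, entry_id)


if __name__ == "__main__":
    print(pathway_url("eath", "00195"))
    print(entry_url("rn", "R00867"))
    print(entry_url("compound", "C00002"))
```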
Clicking on 3213 shows Arabidopsis EST contig 3213, which is similar to its corresponding enzyme. The main window has seven sub-windows, including KO annotation (gene catalogs), EST members of the contigs with links to graphical and text alignment, links to pathway and other databases, and the contig's sequence (http://www.genome.jp/dbget-bin/www_bget?ko+K00844). Clicking on the Alignment button shows the text alignment of the EST sequences that form this contig, which in fact is the output of the CAP3 (Huang and Madan, 1999
To assess whether the coverage of EST contigs is sufficient for computationally predicting pathways, we compared the pathway nodes present in each pathway between the manually curated pathways for Arabidopsis (ath) and rice (osa) and the pathways predicted for these plants from ESTs (eath and eosa). In many pathways, we found that EST-based coverage exceeded genome-based coverage. In total, we found 251 enzymes (pathway nodes) that are present in the EST-based rice metabolic pathways (eosa) but not in the rice pathways based on genomic DNA. For Arabidopsis, we found 295 enzymes that are present only in the EST-based predicted pathways. Figure 3, A and B, shows diagrams of the genome-based (ath) and EST-based (eath) inositol phosphate metabolism pathways for Arabidopsis, respectively. Total nodes/reactions complete: ath = 7; eath = 14. Reactions common and complete between ath and eath = 3 (2.7.1, 2.7.1.68, 5.5.1.4). Reactions unique to ath = 4 (3.1.3.57 [2x], 3.1.3.67, 2.7.1.137), and reactions unique to eath = 10 (2.7.1.34, 2.7.1.134 [2x], 3.1.3.25 [3x], 3.1.3.36, 2.7.1.67, 4.6.1.13, 1.13.99.1; Supplemental Table S1).
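A comparison of this kind amounts to multiset arithmetic on pathway nodes, because the same EC number can label more than one node. The Python sketch below shows one way to do it with `collections.Counter`, using only the EC numbers listed above for inositol phosphate metabolism. The function and variable names are hypothetical, and this is not the procedure used to produce the reported figures. Note that the EC numbers listed for eath add up to 13 nodes, one fewer than the stated total of 14, so the sketch reports counts for the listed values only.

```python
from collections import Counter
from typing import Dict

# EC numbers as listed in the text; repeated entries stand for the [2x]/[3x] nodes.
COMMON = ["2.7.1", "2.7.1.68", "5.5.1.4"]
ATH_ONLY = ["3.1.3.57", "3.1.3.57", "3.1.3.67", "2.7.1.137"]
EATH_ONLY = [
    "2.7.1.34", "2.7.1.134", "2.7.1.134",
    "3.1.3.25", "3.1.3.25", "3.1.3.25",
    "3.1.3.36", "2.7.1.67", "4.6.1.13", "1.13.99.1",
]


def compare_node_sets(genome: Counter, est: Counter) -> Dict[str, int]:
    """Count nodes shared by both annotations and nodes found in only one."""
    return {
        "shared": sum((genome & est).values()),        # multiset intersection
        "genome_only": sum((genome - est).values()),   # multiset difference
        "est_only": sum((est - genome).values()),
    }


if __name__ == "__main__":
    ath = Counter(COMMON + ATH_ONLY)
    eath = Counter(COMMON + EATH_ONLY)
    print(compare_node_sets(ath, eath))
```

A plain `set` would collapse the repeated nodes, which is why a counter is used here.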
Another entry point for EGENES is the sequence similarity search in the KEGG database using BLAST (http://blast.genome.jp/) and FASTA (http://fasta.genome.jp/). For example, to find all the sequences similar to Arabidopsis cDNA clone gi 32885696, click on the BLAST button, copy and paste the sequence into the form, choose BLASTN and the database to search, KEGG EGENES (Fig. 3C), and press the compute button. All the similar contigs from EGENES (including species other than Arabidopsis) will be shown. From here, a variety of analyses of the BLAST results is available via the GenomeNet server (http://align.genome.jp/), such as multiple alignments (ClustalW and MAFFT), graphical alignments of the BLAST search, and drawing of a phylogenetic tree (Fig. 3D).
With the release of the fully sequenced genomes of Arabidopsis and rice, the initiation of genome sequencing projects for many other plant species, and large-scale EST sequencing for many more, there is a fast-growing desire to place this genomic information into a metabolic context. ESTs provide researchers with a quick and inexpensive route for discovering new genes, obtaining data on gene expression and regulation, and constructing genome maps. Efforts to computerize our knowledge of cellular functions, at present either by the controlled vocabulary of GO or by graph representation, will facilitate the computational mapping of genomic data to complex cellular properties and the detection of empirical relationships between genomic and higher order properties.
There are several molecular biology databases available on the Internet, such as AraCyc (Mueller et al., 2003
Using accurate and comprehensive reference databases is the key to ensuring the quality of derived databases. Examples of comprehensive pathway databases are KEGG and MetaCyc (Krieger et al., 2004
Although both KEGG and EcoCyc attempt to be as complete as possible, there are differences between them in terms of combination and coverage. These differences also exist between EGENES and AraCyc. Each database has its unique features. AraCyc uses only Arabidopsis genomic data that are based on annotation by TIGR and TAIR, and it is highly contingent on the quality of the annotation in the input data. KEGG includes 26 species for plants (EGENES 25) and two draft genomes, those of Arabidopsis and rice. Indeed, KEGG has two entries each for Arabidopsis and rice, one based on genome data (ath and osa) and one based on EST data (eath and eosa). There are inevitably false-positive associations of genes with pathways when only genomic data and automatic assignments are used (Zhang et al., 2005). The comparison of coverage of the metabolic reconstruction showed that the EST data increased the coverage more than the genomic information in many instances. In some cases, the number of pathway hits was similar but included hits unique to one or the other, such as the citrate cycle (tricarboxylic acid cycle) in rice, where eosa and osa each have 16 hits, including hits unique to one or the other (data not shown). In Arabidopsis, eath and ath each have 22 hits, six of which are found in only one or the other (data not shown). This may be because either the sequences or the annotations of these genomes are still incomplete. In many cases, by BLASTing EST contigs against the genome, we found that many genes in a given pathway had been missed by the genome-based annotation. Of the many entries in ath and eath, some are common to both, and each database has many specific entries not found in the other. These differences may be caused by the lack of a complete sequence for Arabidopsis and rice (the genome assembly lacks some sequenced regions) or by misannotation (many parts annotated as intergenic regions were hit by our EST contigs). There were also cases where genome-based annotation gave more hits than EST-based annotation. For example, in the photosynthesis pathway, we have 45 hits; 19 hits were unique to ath and 25 hits were common to both ath and eath (data not shown). This is because the number of sequenced ESTs, the size of the cDNA library, or the range of tissues from which ESTs have been produced is insufficient to cover all pathway nodes.
The EGENES project is an attempt to capture all reactions and pathways that occur in plants based on both the genome and its expressed part. This is, however, only a starting point. We see the role of EGENES as providing a framework of possible reactions that, when combined with expression, genetic, physical location, and enzyme kinetic data, provides the infrastructure for the true integration of biological knowledge and data. Integration in this respect does not simply involve methodology, such as links and common interfaces, but rather involves biology. We are working to make EGENES data available in a variety of standard formats, including text and XML. This will also enable data exchange with other pathway databases, such as AraCyc, and with other molecular interaction databases. The ultimate integration of biological databases will be a computer representation of living cells and organisms, whereby any aspect of biology can be examined computationally.
Our immediate goals for improving the data coverage and quality of EGENES are as follows: increasing the number of plant species included in the EGENES; including other organisms than plants for comparative genomics analysis; improving the EST clustering and assembling algorithm; including the GSS data for clustering; providing the splice-alignment information for ESTs/GSS contigs; manually curating each species pathway by verifying and linking present pathway diagrams and information to literature and adding new pathway diagrams not present in current versions of the KEGG's reference pathways; improving automatic assignments of genes/ESTs/GSSs to reactions and pathways for avoiding false-positive assignments; linking the ESTs/genes/GSSs in reactions and pathways to genetic and physical locations in the genome; linking the ESTs/genes/GSSs in reactions and pathway to microarray chip information and tissue/condition information described in dbEST to give further meaningful information, including isoenzyme discrimination that KO does not explicitly handle; and updating EGENES twice a year to cope with the rapid influx of new ESTs resulting from high-throughput sequencing projects. In summary, EGENES provides overviews of fundamental biological processes in plants in a form that is useful to students working on a single gene/protein, as well as to bioinformaticians striving to make sense of large-scale datasets. EGENES provides biologist-friendly visualization of biological pathway data. EGENES data and pathway information are publicly available for downloading in XML format from KEGG's FTP site (ftp://ftp.genome.jp/pub/kegg/).
The following materials are available in the online version of this article.
We express our gratitude to Dr. Kiyoko F. Aoki-Kinoshita and Dr. Nelson Hayes for critical reading of the manuscript. Received December 19, 2006; accepted April 18, 2007; published April 27, 2007.
1 This work was supported by the Ministry of Education, Science, Sports and Culture, Japan (Grant-in-Aid for Scientific Research no. 123066001 and Grants-in-Aid for Scientific Research on priority areas nos. 17020005 and 17017019), by JSPS (postdoctoral award to A.M.-N.), and by Kyoto University (COE fellowship to A.M.-N.).
2 Present address: BIOBASE GmbH, Haltchersche Str. 33, D38304 Wolfenbüttel, Germany. The author responsible for distribution of materials integral to the findings presented in this article in accordance with the policy described in the Instructions for Authors (www.plantphysiol.org) is: Ali Masoudi-Nejad (amasoudin{at}ibb.ut.ac.ir).
[C] Some figures in this article are displayed in color online but in black and white in the print edition.
[W] The online version of this article contains Web-only data.
[OA] Open Access articles can be viewed online without a subscription.
www.plantphysiol.org/cgi/doi/10.1104/pp.106.095059
* Corresponding author; e-mail amasoudin{at}ibb.ut.ac.ir; fax 98(0)2166404680.
Carollo V, Matthews DE, Lazo GR, Blake TK, Hummel DD, Lui N, Hane DL, Anderson OD (2005) GrainGenes 2.0. An improved resource for the small-grains community. Plant Physiol 139: 643-651
Christoffels A, van Gelder A, Greyling G, Miller R, Hide T, Hide W (2001) STACK: Sequence Tag Alignment and Consensus Knowledgebase. Nucleic Acids Res 29: 234-238
Ewing B, Hillier L, Wendl MC, Green P (1998) Base-calling of automated sequencer traces using Phred I: accuracy assessment. Genome Res 8: 175-185
Goto S, Okuno Y, Hattori M, Nishioka T, Kanehisa M (2002) LIGAND: database of chemical compounds and reactions in biological pathways. Nucleic Acids Res 30: 402-404
Huang X, Madan A (1999) CAP3: a DNA sequence assembly program. Genome Res 6: 829-845
Kanehisa M (1997) A database for post-genome analysis. Trends Genet 13: 375-376
Kanehisa M, Goto S, Hattori M, Aoki-Kinoshita KF, Itoh M, Kawashima S, Katayama T, Araki M, Hirakawa M (2006) From genomics to chemical genomics: new developments in KEGG. Nucleic Acids Res 34: D354-D357
Kanehisa M, Goto S, Kawashima S, Okuno Y, Hattori M (2004) The KEGG resource for deciphering the genome. Nucleic Acids Res 32: D277-D280
Karp PD, Riley M, Paley SM, Pellegrini-Toole A, Krummenacker M (1998) EcoCyc: encyclopedia of Escherichia coli genes and metabolism. Nucleic Acids Res 26: 50-53
Kim N, Shin S, Lee S (2005) ECgene: genome-based EST clustering and gene modeling for alternative splicing. Genome Res 15: 566-576
Krieger CJ, Zhang P, Mueller LA, Wang A, Paley S, Arnaud M, Pick J, Rhee SY, Karp PD (2004) MetaCyc: a multiorganism database of metabolic pathways and enzymes. Nucleic Acids Res 32: D438-D442
Kunne C, Lange M, Funke T, Miehe H, Thiel T, Grosse I, Scholz U (2005) CR-EST: a resource for crop ESTs. Nucleic Acids Res 33: D619-D621
Kurata N, Yamazaki Y (2006) Oryzabase. An integrated biological and genome information database for rice. Plant Physiol 140: 12-17
Lee Y, Tsai J, Sunkara S, Karamycheva S, Pertea G, Sultana R, Antonescu V, Chan A, Cheung F, Quackenbush J (2005) The TIGR Gene Indices: clustering and assembling EST and known genes and integration with eukaryotic genomes. Nucleic Acids Res 33: D71-D74
Liang F, Holt I, Pertea G, Karamycheva S, Salzberg SL, Quackenbush J (2000) An optimized protocol for analysis of EST sequences. Nucleic Acids Res 28: 3657-3665
Liolios K, Tavernarakis N, Hugenholtz P, Kyrpides NC (2006) The Genomes On Line Database (GOLD) v.2: a monitor of genome projects worldwide. Nucleic Acids Res 34: D332-334
Masoudi-Nejad A, Jauregui R, Kawashima S, Goto S, Kanehisa M, Endo TR (2004) The kingdom of Plantae EST Indices: a resource for plant genomics community. Genome Informatics 2004. PP-102. The 15th International Conference on Genome Informatics, December 16-18, 2004, Pacifico Yokohama, Japan
Masoudi-Nejad A, Tonomura K, Kawashima S, Itoh M, Kanehisa M, Endo T, Goto S (2006) EGassembler: online bioinformatics service for large-scale processing, clustering and assembling ESTs and genomic DNA fragments. Nucleic Acids Res 34: W459-W462
Moriya Y, Itoh M, Okuda S, Kanehisa M (2005) KAAS: KEGG automatic annotation server. Genome Informatics 2005. P005-1. The 16th International Conference on Genome Informatics, December 19-21, 2005, Pacifico Yokohama, Japan
Mueller LA, Zhang P, Rhee SY (2003) AraCyc: a biochemical pathway database for Arabidopsis. Plant Physiol 132: 453-460
Ptitsyn A, Hide W (2005) CLU: a new algorithm for EST clustering. BMC Bioinformatics (Suppl 2) 15: S3
Schneeberger K, Malde K, Coward E, Jonassen I (2005) Masking repeats while clustering ESTs. Nucleic Acids Res 33: 2176-2180
Schneider M, Bairoch A, Wu CH, Apweiler R (2005) Plant protein annotation in the UniProt Knowledgebase. Plant Physiol 138: 59-66
Schuler GD, Boguski MS, Stewart EA, Stein LD, Gyapay G, Rice K, White RE, Rodriguez-Tomé P, Aggarwal A, Bajorek E, et al (1996) A gene map of the human genome. Science 274: 540-546
Smit A, Hubley FA, Green P (2004) RepeatMasker Open-3.0. Institute for Systems Biology. http://www.repeatmasker.org (January 10, 2007)
Ware DH, Jaiswal P, Ni J, Yap IV, Pan X, Clark KY, Teytelman L, Schmidt SC, Zhao W, Chang K, et al (2002) Gramene, a tool for grass genomics. Plant Physiol 130: 1606-1613
Wheeler DL, Smith-White B, Chetvernin V, Resenchuk S, Dombrowski SM, Pechous SW, Tatusova T, Ostell J (2005) Plant genome resources at the national center for biotechnology information. Plant Physiol 138: 1280-1288
Zhang P, Foerster H, Tissier CP, Mueller L, Paley S, Karp PD, Rhee SY (2005) MetaCyc and AraCyc. Metabolic pathway databases for plant research. Plant Physiol 138: 27-37