First published online November 5, 2008; 10.1104/pp.108.128579 Plant Physiology 149:171-180 (2009) © 2009 American Society of Plant Biologists OPEN ACCESS ARTICLE
GRASSIUS: A Platform for Comparative Regulatory Genomics across the Grasses1,[W],[OA]
Department of Plant Cellular and Molecular Biology and Plant Biotechnology Center (A.Y., B.G.F., E.G.), and Department of Biomedical Informatics (D.J.), The Ohio State University, Columbus, Ohio 43210; Instituto de Química, Departamento de Bioquímica, Universidade de São Paulo, São Paulo, Brazil (M.Y.N., G.M.S.); and Department of Biology, University of Toledo, Toledo, Ohio 43606 (J.G.)
Transcription factors (TFs) are major players in gene regulatory networks and interactions between TFs and their target genes furnish spatiotemporal patterns of gene expression. Establishing the architecture of regulatory networks requires gathering information on TFs, their targets in the genome, and the corresponding binding sites. We have developed GRASSIUS (Grass Regulatory Information Services) as a knowledge-based Web resource that integrates information on TFs and gene promoters across the grasses. In its initial implementation, GRASSIUS consists of two separate, yet linked, databases. GrassTFDB holds information on TFs from maize (Zea mays), sorghum (Sorghum bicolor), sugarcane (Saccharum spp.), and rice (Oryza sativa). TFs are classified into families and phylogenetic relationships begin to uncover orthologous relationships among the participating species. This database also provides a centralized clearinghouse for TF synonyms in the grasses. GrassTFDB is linked to the grass TFome collection, which provides clones in recombination-based vectors corresponding to full-length open reading frames for a growing number of grass TFs. GrassPROMDB contains promoter and cis-regulatory element information for those grass species and genes for which enough data are available. The integration of GrassTFDB and GrassPROMDB will be accomplished through GrassRegNet as a first step in representing the architecture of grass regulatory networks. GRASSIUS can be accessed from www.grassius.org.
A large fraction of the genome of any organism is dedicated to specifying when, where, and how much of each mRNA needs to be produced. This regulatory information, hardwired into the genomic DNA, is essentially the same in every cell and largely constant over time and generations. Because these regulatory sequences are often in close proximity to the genes they control, we refer to them here as the cis-regulatory apparatus, which is formed by a mosaic arrangement of cis-regulatory elements (CREs). However, depending on the cell type or on the particular environmental circumstance, the same regulatory sequences can be interpreted in very different ways. It is the function of a group of trans-acting proteins, the transcription factors (TFs), to interpret the sequence code hardwired in the cis-regulatory apparatus and execute it in the form of a signal to the basal transcription machinery that will result in RNA production. TFs are organized into hierarchical gene regulatory networks in which one TF, often in cooperation with other proteins, positively or negatively regulates the expression of another TF. This establishes a variety of regulatory motifs, which, when assembled into regulatory modules, provide the scale-free architecture that characterizes gene regulatory networks (Babu et al., 2004
Several databases, including AtTFDB (http://arabidopsis.med.ohio-state.edu/AtTFDB; Davuluri et al., 2003
Here, we describe the development of a first version of GRASSIUS (GRASSIUS v1, already deployed at www.grassius.org) as a knowledge-based public Web resource that integrates information on TFs (in the GrassTFDB database) and gene promoters (in the GrassPROMDB database) for maize (Zea mays), rice (Oryza sativa), sorghum (Sorghum bicolor), and sugarcane (Saccharum spp.), and that is expected to expand to other grasses as genome information becomes available. In addition to providing the framework for building a comprehensive parts list, GRASSIUS also serves as a centralized clearinghouse for TF synonyms for the grasses. Combined with the discovery of phylogenetic relationships among members of TF families, and as a portal for available TF open reading frames (ORFs) in convenient recombination vectors, GRASSIUS provides a valuable resource for comparative regulatory genomics across the grasses.
GRASSIUS furnishes a user-friendly online database tool developed as a comprehensive resource for retrieving information regarding the components involved in the regulation of gene expression across the grasses, initially focusing on maize, rice, sorghum, and sugarcane. GRASSIUS currently consists of two integrated databases, GrassTFDB and GrassPROMDB. As previously done for AGRIS (Palaniswamy et al., 2006
Whereas many other proteins participate in the regulation of gene expression, we limit here our definition of TFs to proteins that contain a characteristic structural motif, the DNA-binding domain, which is involved in recognizing a short (usually 4–8 bp) DNA sequence. Based on the structure of the DNA-binding domain, TFs are classified into a variable number of different families (usually 40–60), and in plants, 5% to 7% of all the protein-encoding genes correspond to TFs meeting these characteristics (Riechmann et al., 2000
Because many of the grass genomes are not yet completely sequenced or annotated, the total number of TFs that should be expected is hard to predict. As a first step toward estimating the total number of TFs, particularly from maize and sugarcane where genomic information is either incomplete or missing, we performed a correlation between the number of genes in various genomes and the number of identified TFs. For sugarcane, the gene number was estimated based on the EST data from the SUCEST Project (Vettore et al., 2003
Interestingly, when a similar analysis was done for several nonplant genomes (fungal and animal), a similar trend was observed (Fig. 1, blue triangles). The fit of the regression (r² = 0.74, blue dashed line; Fig. 1) was significantly improved by combining the plant and nonplant points (r² = 0.82; data not shown), suggesting that plants and animals have a similar relationship between total gene numbers and TF numbers.
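The kind of regression described above can be sketched in a few lines of Python. This is only an illustration of an ordinary least-squares fit of TF number against gene number, assuming a simple linear model; the function names `linear_fit` and `predict_tf_count` are hypothetical, and the numbers in the example are made-up placeholders, not the values plotted in Figure 1.

```python
def linear_fit(xs, ys):
    """Ordinary least-squares fit y = slope * x + intercept, plus r^2."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("x values are all identical")
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # For a simple linear regression, r^2 is the squared Pearson correlation.
    r2 = 1.0 if syy == 0 else (sxy * sxy) / (sxx * syy)
    return slope, intercept, r2


def predict_tf_count(gene_count, slope, intercept):
    """Expected number of TFs for a genome with the given gene count."""
    return slope * gene_count + intercept


# Placeholder inputs for illustration only (NOT data from the article).
gene_counts = [10000, 20000, 30000, 40000]
tf_counts = [600, 1150, 1900, 2350]
slope, intercept, r2 = linear_fit(gene_counts, tf_counts)
estimate = predict_tf_count(25000, slope, intercept)
```

Once the fit is available, the expected TF count for a genome with an incomplete annotation is read off the regression line from its estimated gene number, which is how the predicted values are compared with the database contents.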
With these estimates in mind regarding the expected number of TFs that GRASSIUS should contain, we initiated the generation of GrassTFDB. As a first step, publicly available plant TFs from PlnTFDB, PlantTFDB, and DBD were obtained, and a comprehensive and nonredundant list of plant TFs was generated. Then, previously unidentified TFs were searched in the most recent genome sequence releases by scanning for PFAM domains found in plant TFs (Fig. 2). Predicted TFs in each species were sorted into 47 families, following criteria similar to those used for developing AtTFDB and AGRIS (Davuluri et al., 2003
For rice and sugarcane, the number of TFs currently present in GrassTFDB is very close to the predicted number of TFs (Table I), indicating that the database has good coverage. In the case of sorghum, the number of predicted TFs is higher than the expected number, suggesting that GrassTFDB may contain some duplicates and splice variants that should be collapsed into single TFs. For maize, the number of TFs in GrassTFDB is close to the expected number, in agreement with most of the coding region of the maize genome being already available. The contents of GrassTFDB also compare very favorably to other TF databases in terms of comprehensiveness (Table II).
When all TFs in GrassTFDB are arranged into species and families, interesting differences become evident, according to the online summary table furnished by GRASSIUS (http://grassius.org/summary.html). For example, although maize has the largest gene count, it does not have the highest number of TFs in every family. Maize has significantly more TFs only in the ABI3VP1, AP2-EREBP, bZIP, C2C2-YABBY, CPP, E2F-DP, Homeobox, Jumonji, MYB, NAC, SBP, and TUB families. In all other cases, the numbers are about the same as for the other grasses, with the exception of the GRAS family, which shows a significantly (P < 0.05; Weisberg t test) lower TF number. Similarly, the C2C2-CO-like family in rice has significantly fewer members than those found in the same family in the other species (P < 0.05; Weisberg t test), while the rice Trihelix family is significantly larger (P < 0.05; Weisberg t test). In sugarcane, the C3H family has significantly more members than the C3H families of maize, sorghum, or rice (P < 0.05; Weisberg t test). These trends may reflect the expansion/contraction of individual families in a particular taxon. The recent amplification of R2R3-MYB regulators during the radiation of the grasses (Dias et al., 2003
An important function of GRASSIUS will be to provide information that facilitates comparative regulatory genomics studies. Central to this is the identification of orthologous TF pairs between the various grasses. Therefore, GRASSIUS contains an application that permits retrieval of precomputed phylogenetic trees for a particular family (Fig. 3D; Supplemental Fig. S1). Phylogenetic analyses were performed by aligning conserved domains of all members of a particular TF family, and trees were constructed using RAxML (see "Materials and Methods"). The branches and nodes of the tree, visualized with A TREE VIEWER (ATV; Supplemental Fig. S1), are hyperlinked to the underlying data within GRASSIUS. A click on a terminal branch displays a single sequence. A click on an internal node displays all the data for the group of sequences that node subtends.
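The behavior of a click on an internal node (returning every sequence that the node subtends) amounts to collecting the terminal branches below that node. The following Python sketch illustrates the idea under stated assumptions: the `Node` class and `subtended_leaves` function are hypothetical and are not part of ATV or GRASSIUS, and the leaf names are placeholders.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """Minimal tree node: leaves carry a sequence name, internal nodes carry children."""
    name: str = ""
    children: list = field(default_factory=list)

    def is_leaf(self):
        return not self.children


def subtended_leaves(node):
    """Return the names of all terminal sequences below `node`, left to right.

    A terminal branch returns just its own name, mirroring a click on a leaf.
    """
    if node.is_leaf():
        return [node.name]
    names = []
    # Explicit stack (instead of recursion) so very deep trees do not hit
    # Python's recursion limit; children are pushed reversed to keep order.
    stack = list(reversed(node.children))
    while stack:
        current = stack.pop()
        if current.is_leaf():
            names.append(current.name)
        else:
            stack.extend(reversed(current.children))
    return names


# Placeholder tree: ((seqB, seqC), seqA) with hypothetical sequence names.
inner = Node(children=[Node("seqB"), Node("seqC")])
root = Node(children=[inner, Node("seqA")])
clade_members = subtended_leaves(inner)
```

In a Web front end, the list returned for a node would then be used to look up and display the database records for each member of the clade.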
The availability of FLcDNA clones and, in particular, the coding sequence prescribed by the ORF, for any given gene greatly advances the potential for further research of the protein encoded therein. Ready access to a cloned ORF accelerates the pace of research by permitting a variety of fusion and overexpression constructs to be engineered. Despite the large number of ESTs available through various projects, researchers are often lacking a FLcDNA for particular genes of interest. Thus, as part of the effort to establish a central resource for grass regulatory genomics, the development of a collection of clones containing ORFs for grass TFs was initiated (Supplemental Fig. S2). The clones in this collection are distinct from FLcDNA in that the coding sequence without 5'- and 3'-untranslated regions was specifically amplified omitting the stop codon, and then cloned into a Gateway entry vector. Such clones can be easily recombined into a variety of destination vectors (e.g. Karimi et al., 2002
With the ultimate goal of populating GrassPROMDB with all the regulatory sequences in the grasses, this initial release focuses on a set of experimentally verified regulatory sequences as well as predicted rice promoters. Experimentally characterized promoters constitute the gold standard because they furnish information on when and how a particular regulatory sequence is active, often providing information on the CREs responsible for expression and the TFs that they recruit. GrassPROMDB summarizes much of that information for every promoter in a single page (see Supplemental Fig. S3), with links to the corresponding TFs that recognize each CRE, providing the information necessary for building GrassRegNet. However, given how laborious it is to experimentally dissect promoter function, it is expected that GrassPROMDB will be primarily populated with predicted promoters. Predicted promoters can be of two types, curated promoters and upstream regions. Curated promoters correspond to sequences directly upstream of the TSS. Identifying curated promoters requires the availability of FLcDNAs to precisely determine TSSs. Upstream regions will be used instead of curated promoters when FLcDNAs are not available. Upstream regions consist of sequences 5' of the translation start codons (ATG), thus including 5'-untranslated regions. Currently (September 2008), GrassPROMDB contains 56,278 rice gene upstream regions corresponding to sequences 5 kb upstream of the ATG, according to the latest release of The Institute for Genomic Research (TIGR) rice genome annotation (release 5). These upstream regions carry the same unique identifier as the genes from which they were extracted.
GRASSIUS provides a first step toward building a comprehensive platform integrating information, tools, and resources for comparative regulatory genomics across the grasses. All the data in GRASSIUS are downloadable and freely available to the community. While initially containing information for maize, sorghum, sugarcane, and rice, as increasing genome sequence data for other grasses (e.g. wheat [Triticum aestivum], Brachypodium) accumulate, GRASSIUS has the potential to include them as well. GRASSIUS also serves as an initial centralized clearinghouse for TF synonyms, and as a source of information regarding orthologous TF pairs between several grasses.
GrassTFDB
For the development of GrassTFDB, information on TFs was obtained from the respective genome sequence databases and SUCEST, the sugarcane (Saccharum spp.) EST database (Vettore et al., 2003
Protein and respective nucleotide sequences for known TFs were integrated and grouped by species, resulting in a nonredundant set. Filtering of redundant nucleotide sequences was performed using the Perl module Digest::MD5 (available at http://grassius.org/help.html), which computes for each sequence a digest of 32 hexadecimal digits that unequivocally identifies each TF sequence for each species. In a second step, BLAST searches were performed to eliminate redundancy within each species. The proteins were considered duplicated if they were found in the same species, had a query coverage
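The first, digest-based filtering step can be sketched as follows. The article's pipeline used the Perl module Digest::MD5; this illustration instead uses Python's standard `hashlib` module, and the function names, the record layout, and the normalization (removing whitespace and uppercasing before hashing) are assumptions made for the example. It removes only exact duplicates within a species and does not reproduce the subsequent BLAST-based step.

```python
import hashlib


def sequence_digest(sequence):
    """Return the 32-hex-digit MD5 digest of a sequence.

    Assumption: whitespace is stripped and case is ignored before hashing,
    so formatting differences do not produce different digests.
    """
    normalized = "".join(sequence.split()).upper()
    return hashlib.md5(normalized.encode("ascii")).hexdigest()


def drop_exact_duplicates(records):
    """Keep the first record for each (species, digest) pair.

    `records` is an iterable of (species, identifier, sequence) tuples.
    """
    seen = set()
    unique = []
    for species, identifier, sequence in records:
        key = (species, sequence_digest(sequence))
        if key in seen:
            continue
        seen.add(key)
        unique.append((species, identifier, sequence))
    return unique


# Hypothetical records for illustration only.
records = [
    ("maize", "id1", "ATGC"),
    ("maize", "id2", "atgc\n"),   # same sequence as id1 after normalization
    ("rice", "id3", "ATGC"),      # same sequence, different species: kept
    ("maize", "id4", "ATGG"),
]
unique_records = drop_exact_duplicates(records)
```

Keying on the species together with the digest matches the description that the digest identifies each TF sequence for each species, so identical sequences from different species are retained.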
Based on the information available in AGRIS and in the other TF databases, we created a comprehensive list of PFAM domains, which was used to generate a database containing all FASTA sequences for each domain cataloged in the PFAM database. Each PFAM domain can be represented by a median of 67 domain sequences from different species, and it was the reference for the approach to identify new TFs from the respective genome databases. This was accomplished by collecting all protein sequences from indica (Gramene; http://www.gramene.org) and japonica (TIGR5; http://rice.plantbiology.msu.edu) rice (Oryza sativa), sorghum (Sorghum bicolor; Joint Genome Institute; http://genome.jgi-psf.org/Sorbi1), maize (Zea mays; http://maizesequence.org), and nucleotide sequences from Saccharum species hybrids (SUCEST; http://sucest-fun.org). For the first three species, we used BLASTP. For sugarcane, BLASTX was used during alignment against the database of TF domain sequences. The first criterion in the BLAST alignment was to retrieve hits with e-value
Phylogenetic analyses were conducted by aligning conserved domains of all members of one TF family. InterProScan of TF sequences revealed the locations of domains, information that was utilized to extract the nucleotide sequence and perform subsequent analyses. A standard workflow that consisted of multiple sequence alignments of nucleotide sequences by ClustalW (Thompson et al., 1994
The ORFs of selected TFs were amplified from FLcDNA templates available from various sources (mainly the Arizona Genomics Institute) using PCR and directionally cloned into a Gateway entry vector (Invitrogen) according to the manufacturer's protocol. A high-fidelity DNA polymerase (Phusion; New England Biolabs) was employed to minimize errors during amplification. Individual entry clones were picked, and plasmids were isolated and sequenced to confirm the absence of errors, the correct orientation, and the removal of the stop codon. Clones that passed this quality control were then stored in duplicate at –80°C.
Cloned TFs were named according to nomenclature guidelines developed by the community (see Letter to the Editor, this issue [Gray et al., 2009
The development of GrassPROMDB was based on gathering published promoter sequence information, along with detailed CRE information extracted from the literature (experimentally verified regulatory sequences). For the predicted promoter sequences, 1-kb regions upstream from the ATG translation start codon of all rice genes were obtained from the latest TIGR release of the rice genome and extended to 5-kb upstream regions using the available genomic sequence.
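Extracting an upstream region of a given length relative to the ATG can be sketched as below. This is a minimal illustration, not the code used to build GrassPROMDB: the function names, the 0-based coordinate convention, and the handling of minus-strand genes (taking the sequence on the far side of the codon and reverse complementing it) are assumptions made for the example.

```python
_COMPLEMENT = str.maketrans("ACGTNacgtn", "TGCANtgcan")


def reverse_complement(sequence):
    """Reverse complement of a DNA sequence (A/C/G/T/N, either case)."""
    return sequence.translate(_COMPLEMENT)[::-1]


def upstream_region(chromosome, atg_index, strand, length=5000):
    """Return up to `length` bases immediately 5' of a start codon.

    `atg_index` is the 0-based forward-strand position of the first base of
    the start codon as read in gene orientation (the 'A' of ATG). The result
    is always reported in gene orientation and is shorter than `length` if
    the chromosome end is reached.
    """
    if strand == "+":
        start = max(0, atg_index - length)
        return chromosome[start:atg_index]
    if strand == "-":
        # On the minus strand, upstream lies at higher forward coordinates.
        end = min(len(chromosome), atg_index + 1 + length)
        return reverse_complement(chromosome[atg_index + 1:end])
    raise ValueError("strand must be '+' or '-'")


# Toy sequences for illustration only.
plus_example = upstream_region("GGGGATGCCC", 4, "+", length=3)
minus_example = upstream_region("AAACATTTGC", 5, "-", length=3)
```

With `length=5000`, the same call would produce the 5-kb upstream regions described above, including the 5'-untranslated region because the coordinates are taken relative to the ATG rather than the TSS.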
The Web interface for GRASSIUS consists of Perl-embedded HTML and was developed by using HTML::Mason (http://search.cpan.org/dist/HTML-Mason; accessible through http://grassius.org/help.html). Such an approach allowed implementing already available BioPerl (www.bioperl.org; Stajich et al., 2002
Data Visualization and User Interface
GrassTFDB
The TF information page provides domain information, nucleotide or peptide sequences, orthologs in the other grasses (when available), and expression information as it becomes available (Fig. 6, sample screen shot for OsMYB1). Domain information is extracted from InterProScan results, which gather information from multiple databases including BlastProDom, FPrintScan, HMMPIR, HMMPfam, HMMSmart, HMMTigr, ProfileScan, ScanRegExp, SuperFamily, SignalPHMM, TMHMM, HMMPanther, and Gene3D. The positions of the corresponding domains with respect to a schematic representation of the protein (N terminus on the left, C terminus on the right) are represented by boxes generated by the Bio::Graphics module of BioPerl. Each box provides an identification number specific to the source database and a description of the particular domain and, when clicked, takes the user to detailed information about the protein domain at the corresponding database.
GrassPROMDB
Community Contribution
Downloads, Help Pages
The following materials are available in the online version of this article.
We thank Saranyan Palaniswamy, Ramana Davuluri, and Eric Easley for assistance at various stages of this project. Received August 29, 2008; accepted October 29, 2008; published November 5, 2008.
1 This work was supported by the National Science Foundation (grant no. DBI–0701405 to J.G. and E.G.) and Fundação de Amparo à Pesquisa do Estado de São Paulo (grant to G.M.S.). G.M.S. is also a recipient of a CNPq fellowship.
2 These authors contributed equally to the article. The author responsible for distribution of materials integral to the findings presented in this article in accordance with the policy described in the Instructions for Authors (www.plantphysiol.org) is: Erich Grotewold (grotewold.1{at}osu.edu).
[W] The online version of this article contains Web-only data.
[OA] Open Access articles can be viewed online without a subscription. www.plantphysiol.org/cgi/doi/10.1104/pp.108.128579 * Corresponding author; e-mail grotewold.1{at}osu.edu.
Babu MM, Luscombe NM, Aravind L, Gerstein M, Teichmann SA (2004) Structure and evolution of transcriptional regulatory networks. Curr Opin Struct Biol 14: 283–291
Bennetzen JL (2007) Patterns in grass genome evolution. Curr Opin Plant Biol 10: 176–181
Braun EL, Grotewold E (2001) Fungal Zuotin proteins evolved from MIDA1-like factors by lineage-specific loss of MYB domains. Mol Biol Evol 18: 1401–1412
Curtis MD, Grossniklaus U (2003) A gateway cloning vector set for high-throughput functional analysis of genes in planta. Plant Physiol 133: 462–469
Davuluri RV, Sun H, Palaniswamy SK, Matthews N, Molina C, Kurtz M, Grotewold E (2003) AGRIS: Arabidopsis gene regulatory information server, an information resource of Arabidopsis cis-regulatory elements and transcription factors. BMC Bioinformatics 4: 25
Deplancke B, Dupuy D, Vidal M, Walhout AJ (2004) A gateway-compatible yeast one-hybrid system. Genome Res 14: 2093–2101
Dias AP, Braun EL, McMullen MD, Grotewold E (2003) Recently duplicated maize R2R3 Myb genes provide evidence for distinct mechanisms of evolutionary divergence after duplication. Plant Physiol 131: 610–620
Earley KW, Haag JR, Pontes O, Opper K, Juehne T, Song K, Pikaard CS (2006) Gateway-compatible vectors for plant functional genomics and proteomics. Plant J 45: 616–629
Gray J, Bevan M, Brutnell T, Buell CR, Cone K, Hake S, Jackson D, Kellogg E, Lawrence C, McCouch S, et al (2009) A recommendation for naming transcription factor proteins in the grasses. Plant Physiol 149: 4–6
Guo AY, Chen X, Gao G, Zhang H, Zhu QH, Liu XC, Zhong YF, Gu X, He K, Luo J (2008) PlantTFDB: a comprehensive plant transcription factor database. Nucleic Acids Res 36: D966–969
Jannoo N, Grivet L, Chantret N, Garsmeur O, Glaszmann JC, Arruda P, D'Hont A (2007) Orthologous comparison in a gene-rich region among grasses reveals stability in the sugarcane polyploid genome. Plant J 50: 574–585
Karimi M, Inze D, Depicker A (2002) GATEWAY vectors for Agrobacterium-mediated plant transformation. Trends Plant Sci 7: 193–195
Kummerfeld SK, Teichmann SA (2006) DBD: a transcription factor prediction database. Nucleic Acids Res 34: D74–81
Luscombe NM, Austin SE, Berman HM, Thornton JM (2000) An overview of the structures of protein-DNA complexes. Genome Biol 1: REVIEWS001
Palaniswamy K, James S, Sun H, Lamb R, Davuluri RV, Grotewold E (2006) AGRIS and AtRegNet: A platform to link cis-regulatory elements and transcription factors into regulatory networks. Plant Physiol 140: 818–829
Riano-Pachon DM, Ruzicic S, Dreyer I, Mueller-Roeber B (2007) PlnTFDB: an integrative plant transcription factor database. BMC Bioinformatics 8: 42
Riechmann JL, Heard J, Martin G, Reuber L, Jiang C, Keddie J, Adam L, Pineda O, Ratcliffe OJ, Samaha RR, et al (2000) Arabidopsis transcription factors: genome-wide comparative analysis among eukaryotes. Science 290: 2105–2110
Riechmann JL, Ratcliffe OJ (2000) A genomic perspective on plant transcription factors. Curr Opin Plant Biol 3: 423–434
Schlitt T, Brazma A (2007) Current approaches to gene regulatory network modelling. BMC Bioinformatics (Suppl 6) 8: S9
Shahmuradov IA, Gammerman AJ, Hancock JM, Bramley PM, Solovyev VV (2003) PlantProm: a database of plant promoter sequences. Nucleic Acids Res 31: 114–117
Souza GM, Simoes ACQ, Oliveira KC, Garay HM, Fiorini LC, Gomes FS, Nishiyama-Junior MY, da Silva AM (2001) The sugarcane signal transduction (SUCAST) catalogue: prospecting signal transduction in sugarcane. Genet Mol Biol 24: 25–34
Stajich JE, Block D, Boulez K, Brenner SE, Chervitz SA, Dagdigian C, Fuellen G, Gilbert JG, Korf I, Lapp H, et al (2002) The Bioperl toolkit: Perl modules for the life sciences. Genome Res 12: 1611–1618
Stamatakis A, Ludwig T, Meier H (2005) RAxML-III: a fast program for maximum likelihood-based inference of large phylogenetic trees. Bioinformatics 21: 456–463
Thompson JD, Higgins DG, Gibson TJ (1994) CLUSTAL W: improving the sensitivity of progressive multiple sequence alignment through sequence weighting, position-specific gap penalties and weight matrix choice. Nucleic Acids Res 22: 4673–4680
Vettore AL, da Silva FR, Kemper EL, Souza GM, da Silva AM, Ferro MI, Henrique-Silva F, Giglioti EA, Lemos MV, Coutinho LL, et al (2003) Analysis and functional annotation of an expressed sequence tag collection for tropical crop sugarcane. Genome Res 13: 2725–2735
Wilson D, Charoensawan V, Kummerfeld SK, Teichmann SA (2008) DBD—taxonomically broad transcription factor predictions: new content and functionality. Nucleic Acids Res 36: D88–92
Yamamoto YY, Obokata J (2008) ppdb: a plant promoter database. Nucleic Acids Res 36: D977–981
Yu H, Gerstein M (2006) Genomic analysis of the hierarchical structure of regulatory networks. Proc Natl Acad Sci USA 103: 14724–14731
Zmasek CM, Eddy SR (2001) ATV: display and manipulation of annotated phylogenetic trees. Bioinformatics 17: 383–384