Plant Physiology 136:2621-2632 (2004) © 2004 American Society of Plant Biologists
GENEVESTIGATOR. Arabidopsis Microarray Database and Analysis Toolbox1,[w]
Institute of Plant Sciences, Swiss Federal Institute of Technology and Zurich-Basel Plant Science Center, ETH Center, CH8092 Zurich, Switzerland (P.Z., M.H.-H., L.H., W.G.); and Functional Genomics Center Zurich, UNI Irchel, Y32 H52, CH8057 Zurich, Switzerland (W.G.)
High-throughput gene expression analysis has become a frequent and powerful research tool in biology. At present, however, few software applications have been developed for biologists to query large microarray gene expression databases using a Web-browser interface. We present GENEVESTIGATOR, a database and Web-browser data mining interface for Affymetrix GeneChip data. Users can query the database to retrieve the expression patterns of individual genes across chosen environmental conditions, growth stages, or organs. Conversely, mining tools allow users to identify genes specifically expressed under selected stresses, at selected growth stages, or in particular organs. Using GENEVESTIGATOR, the gene expression profiles of more than 22,000 Arabidopsis genes can be obtained, including those of 10,600 currently uncharacterized genes. The objective of this software application is to direct gene functional discovery and the design of new experiments by providing plant biologists with contextual information on the expression of genes. The database and analysis toolbox is available as a community resource at https://www.genevestigator.ethz.ch.
A major challenge in biology today is the large-scale determination of gene function (Boyes et al., 2001
The complete sequencing of the Arabidopsis genome achieved in the year 2000 (The Arabidopsis Genome Initiative, 2000
The exploitation of large-scale gene expression datasets, mainly from Saccharomyces cerevisiae and Escherichia coli, has already led to the discovery of global structures governing metabolic and regulatory networks (Lee et al., 2002
The Affymetrix platform provides a standardized system with a high degree of reproducibility (Hennig et al., 2003
Here, we describe a novel online tool called GENEVESTIGATOR, comprising a gene expression database and a number of querying and analysis functionalities developed to facilitate gene functional discovery. GENEVESTIGATOR presents the data in the context of plant development, plant organ, and environmental conditions, either for individual genes or for families of genes, thereby answering questions such as "in which growth stage is my gene of interest expressed?" or "which genes are specifically expressed in roots?" The main objective of the software is to assign contextual information to gene expression data, directing the design of new experiments and gene functional discovery.
Database Concept and Software Design
GENEVESTIGATOR was conceived as a user-friendly online tool for large-scale expression data analysis. It consists of a MySQL relational database and a Web server application programmed in the PHP (PHP Hypertext Preprocessor) scripting language. The database works as a "data warehouse" containing experimental and annotation data, preprocessed data, as well as diverse tables for control of workflow and analysis (Fig. 1).
Raw experimental data from users is processed using Affymetrix MAS 5.0 software to a target value (TGT) of 1,000 (Liu et al., 2002
The experiment annotation is curated, entered, and structured in either hierarchical (e.g. plant organs), unique (e.g. growth stage), or multi-select form (e.g. environmental condition). The software has been designed for easy addition of new annotations in any of these formats and for rapid creation of the corresponding tools to analyze and visualize the data. The annotation of arrays was based on the information provided by users or public repositories. Missing information does not affect the results, as the corresponding arrays are not included in the respective calculations. Ambiguous or unsuitable annotations were likewise ignored. For example, arrays from RNA extracted from whole adult plants (including roots, rosette leaves, and inflorescence) are unsuitable for tools relating to plant organ specificity (Gene Atlas) and are therefore not included in the corresponding calculations, but may be suitable for other tools such as Gene Chronologer. Each tool therefore accesses the best available sources of data for its own processing, while unsuitable data is ignored.
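The scaling of each array to a common target value can be illustrated with a minimal sketch. It assumes a global trimmed-mean scaling, in which every signal on an array is multiplied by one factor so that the trimmed mean equals the target. The function name `scale_to_target` and the 2% trim fraction are assumptions made for illustration and are not taken from the text; the actual MAS 5.0 software should be consulted for the exact procedure.

```python
def scale_to_target(signals, target=1000.0, trim=0.02):
    """Scale one array's signals so that their trimmed mean equals `target`.

    `trim` is the fraction of values dropped at each end before averaging
    (an assumed value, chosen only for illustration).
    """
    if not signals:
        raise ValueError("signals must not be empty")
    ordered = sorted(signals)
    k = int(len(ordered) * trim)  # number of values dropped at each end
    kept = ordered[k:len(ordered) - k] if k else ordered
    trimmed_mean = sum(kept) / len(kept)
    factor = target / trimmed_mean  # one scaling factor per array
    return [s * factor for s in signals]


# Invented raw intensities for a single array.
raw = [120.0, 450.0, 800.0, 95.0, 2300.0, 640.0]
scaled = scale_to_target(raw)
```

Because each array receives its own factor, arrays scaled this way share the same overall intensity level, which is what makes signal values comparable across experiments.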
Data from the ATH1 and AG arrays are processed separately. Different sets of oligonucleotide sequences are used to probe identical target genes on the two array types, and thus different efficiencies of target to probe hybridization and nontarget to probe cross-hybridization make a direct comparison of signal intensities impossible. Although a high degree of reproducibility was found for most target genes probed by both the ATH1 and the AG arrays, 300 pairs of probe sets for identical target genes yielded strongly differing results (Hennig et al., 2003
As of July 2004, the database contained publicly available data from 750 ATH1 and 121 AG arrays covering 81 public experiments from the Gruissem Laboratory (http://www.pb.ethz.ch; Menges et al., 2003
GENEVESTIGATOR is freely accessible to all academic institutions. Since the database at present contains both publicly available and confidential data, we have implemented a dual user profile management system for public and private users. All users are therefore asked to register once and to log in for each session. We limit the collection and use of personal information to what is necessary to administer the database and improve the utility of GENEVESTIGATOR. Personal information is not shared with third parties.
The GENEVESTIGATOR tools generally offer two types of queries: a gene-centric approach reporting signal intensity values for individual genes, and a genome-centric approach providing lists of genes fulfilling chosen criteria. The results obtained from any tool are based on all available signal intensity values and the corresponding annotations. In some cases, present/absent call information as defined by the MAS5.0 algorithm is indicated (see below).
The first tool, Digital Northern, retrieves the signal intensity values of input genes for a chosen selection of GeneChip experiments. An elaborate selection tool (Fig. 2A) allows the user to choose exactly those experiments that fit single or multiple criteria such as anatomy, growth stage, or environmental factors. Up to 10 probe sets can be processed simultaneously and are displayed in different colors, shapes, and fills, showing signal intensity values together with present call (closed symbols) and absent call (open symbols) information (Fig. 2B).
The Gene Correlator compares the signal intensity values of two genes across all chosen experiments (Fig. 2C; identical selection tool as for Digital Northern). Each spot represents a GeneChip and can be identified by mouse-over or by linking to the annotation database. Pearson's correlation coefficient is given as a measure of the relationship between the expression signals of the two genes. Present call information is visualized by color coding (Fig. 2C).
Because the objective of the software was to provide contextual information for the expression of genes, we additionally focused on relating gene expression to three main annotation groups: plant organ, developmental stage, and environmental stress.
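The Pearson's correlation coefficient reported by Gene Correlator can be computed as in the following minimal sketch. The function name `pearson` and the signal values are invented for illustration; the sketch shows the standard formula, not the tool's PHP implementation.

```python
import math


def pearson(x, y):
    """Pearson's correlation coefficient between two equally long signal lists."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    # Sums of cross products and of squared deviations from the means.
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        raise ValueError("correlation is undefined for a constant signal")
    return sxy / math.sqrt(sxx * syy)


# Invented signal intensities of two genes on the same five GeneChips.
gene_1 = [210.0, 480.0, 950.0, 1500.0, 3100.0]
gene_2 = [105.0, 240.0, 475.0, 750.0, 1550.0]
r = pearson(gene_1, gene_2)
```

Each pair of values corresponds to one GeneChip, so the two lists must be built from the same selection of experiments in the same order.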
The Gene Atlas tool similarly provides the average signal intensity values of a gene of interest in all organs or tissues annotated in the database (Fig. 2D). Conversely, GENEVESTIGATOR can output lists of genes for which signal intensities exceed a chosen threshold in selected organs versus a baseline choice of organs (Fig. 2E). This allows users to find genes expressed preferentially in certain organs or tissues, such as roots, young leaves, or stamens. The anatomy annotation was based on standard anatomy terms as defined by the Plant Ontology Consortium (http://www.plantontology.org/) that we classified into six main groups (callus, cell suspension, seedling, inflorescence, rosette, and roots) and the corresponding subgroups. These categories cover all tissues that can currently be isolated for expression analysis, but can easily be extended as tissue and cell separation techniques become more precise (Birnbaum et al., 2003
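The genome-centric Gene Atlas query described above can be sketched as follows. The exact selection rule is not given in the text, so this sketch assumes one possible criterion: a gene is reported when its lowest mean signal among the selected organs is at least `fold` times its highest mean signal among the baseline organs. The function name, the `fold` parameter, the gene labels, and all values are hypothetical.

```python
def organ_specific_genes(atlas, selected, baseline, fold=2.0):
    """Return genes whose signal in `selected` organs exceeds the baseline.

    `atlas` maps gene -> {organ: mean signal}. Organs without data for a
    gene are skipped, mirroring the rule that arrays with missing
    annotation are left out of the calculation.
    """
    hits = []
    for gene, by_organ in atlas.items():
        sel = [by_organ[o] for o in selected if o in by_organ]
        base = [by_organ[o] for o in baseline if o in by_organ]
        if not sel or not base:
            continue  # not enough data to compare
        if min(sel) >= fold * max(base):
            hits.append(gene)
    return sorted(hits)


# Invented mean signal intensities per organ.
atlas = {
    "geneA": {"roots": 5000.0, "rosette": 400.0, "inflorescence": 300.0},
    "geneB": {"roots": 900.0, "rosette": 1100.0, "inflorescence": 1000.0},
    "geneC": {"rosette": 800.0},
}
root_genes = organ_specific_genes(atlas, ["roots"], ["rosette", "inflorescence"])
```

A stricter or looser rule (for example, comparing averages rather than extremes) would change which genes are returned, so the criterion should be chosen to match the question being asked.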
The Gene Chronologer tool, based on the Boyes growth stage ontology (Boyes et al., 2001
The Response Viewer tool provides the same functionalities as Gene Atlas and Gene Chronologer, based on stress response annotations (Fig. 2, H and I). For each condition, one or several representative experiments were chosen. Each stress factor is given with the corresponding control from these experiments, allowing direct comparison.
The Meta-Analyzer utility has been designed to study the gene expression profiles of several genes simultaneously in the context of environmental stresses, organs, and growth stages (Fig. 2, J-L). Lists of genes can be entered in diverse formats (comma-, semicolon-, or space-separated, CRLF [carriage return, line feed], or directly copied from a spreadsheet). The output is a heat map of normalized signal intensity values (see Documentation section on our Web page) clustered by either single, average, or complete linkage hierarchical clustering. This tool is especially useful to compare members of gene families and to identify clusters of similarly expressed genes.
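Accepting gene lists in several formats, as Meta-Analyzer does, amounts to splitting the input on any of the allowed separators. The sketch below is a hypothetical Python equivalent, not the tool's PHP code. Converting identifiers to upper case and dropping duplicates are assumptions added for illustration.

```python
import re


def parse_gene_list(text):
    """Split a pasted gene list on commas, semicolons, and any whitespace.

    Whitespace covers spaces, tabs (spreadsheet columns), and CRLF or LF
    line breaks. Identifiers are upper-cased and de-duplicated while
    keeping their first-seen order (both are assumptions of this sketch).
    """
    tokens = re.split(r"[,;\s]+", text.strip())
    seen = set()
    genes = []
    for token in tokens:
        if not token:
            continue  # an empty input yields one empty token
        key = token.upper()
        if key not in seen:
            seen.add(key)
            genes.append(key)
    return genes


pasted = "At1g71692, At4g11880;At2g22630\r\nAT3G51810\tat1g71692"
genes = parse_gene_list(pasted)
```

The identifiers in the example are gene codes mentioned elsewhere in this article; the mixed separators only demonstrate that all input styles give the same list.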
Finally, the Database and Documentation sections provide users with annotation information about experiments in the database, as well as technical information (Fig. 2, M and N). Since GENEVESTIGATOR was conceived to be an analysis tool and not a data repository, a reduced set of annotations is stored locally. The full MIAME (Minimum Information About a Microarray Experiment) compliant annotations (Brazma et al., 2001
The database contains expression data from a high diversity of experiments covering different tissues, ages, and treatments (Table I). The general hypothesis in our approach is that as the number of experiments per category (e.g. growth stage 5.10) increases, individual effects are averaged out and global trends become visible. As a measure of confidence for the expression of genes in different categories, we indicate the respective number of GeneChips and the SE of the mean for each category.
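The per-category summary described above (number of GeneChips, mean, and SE of the mean) can be sketched as follows. The function name `summarize` and the signal values are invented; the category label 5.10 is the growth stage used as an example in the text.

```python
import math
import statistics


def summarize(values):
    """Return (n, mean, SE of the mean) for one annotation category."""
    n = len(values)
    if n == 0:
        raise ValueError("category contains no arrays")
    mean = statistics.mean(values)
    # The SE is undefined for a single array, so NaN is reported instead.
    se = statistics.stdev(values) / math.sqrt(n) if n > 1 else float("nan")
    return n, mean, se


# Invented signal intensities of one gene, grouped by growth stage.
by_stage = {
    "5.10": [100.0, 200.0, 300.0],
    "6.00": [850.0],
}
summary = {stage: summarize(vals) for stage, vals in by_stage.items()}
```

The SE shrinks as the number of arrays in a category grows, which reflects the hypothesis that individual effects average out in well-populated categories.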
To validate our hypothesis, we checked whether strongly populated categories yield results that are consistent with the literature. In a first step, we selected a number of marker genes with preferential expression in particular organs, at specific growth stages, or in response to certain stresses and then analyzed their expression patterns generated by GENEVESTIGATOR. Marker genes were chosen from the literature.
First, using Gene Atlas, three AGAMOUS-like genes known to be preferentially expressed in roots as measured by reverse transcription-PCR (AGL12 [At1g71692], AGL14 [At4g11880], and AGL17 [At2g22630]; Parenicova et al., 2003
Second, to verify the reliability of the Gene Chronologer tool, we looked for genes annotated as being developmentally regulated. Two genes involved in seed germination and seedling development (encoding the embryonic abundant protein ATEM1 [AT3G51810, Vicient et al., 2000
Third, the Response Viewer tool was used for several genes known to be responsive to particular stresses (Fig. 3, L-Q). GENEVESTIGATOR correctly showed the expression pattern of a light-induced gene encoding a light-harvesting chlorophyll a/b binding protein (AT4G14690, Jansson et al., 2000
This first validation step confirms that global trends can be detected in the expression profiles of individual genes by combining numerous normalized expression data sets generated on the same technical platform, i.e. the Affymetrix system. Based on this information, we performed a second validation step, in which we tested whether GENEVESTIGATOR can identify genes with known expression profiles. Using Gene Atlas, 72 genes were identified as being expressed in pollen. Of these, 9 had been identified by Honys and Twell (2003)
Public repositories such as GEO and ArrayExpress provide tools for submission, storage, and retrieval of heterogeneous data sets. In contrast, GENEVESTIGATOR contains a coherent data set from a single organism generated on a common hybridization platform. Despite the high diversity of experiments represented in the database, the validation steps we carried out demonstrate that the underlying hypothesis is valid and that biologically meaningful results can be obtained using GENEVESTIGATOR. The software generally performs primary level analysis and displays results either as graphs or as numeric data, which can easily be combined, exported, or further analyzed with other data analysis and visualization tools.
The complexity of multicellular life requires the proper context-dependent expression of genes, which is achieved by highly interconnected transcriptional networks. The inference of such module networks may require the use of many data types such as gene expression, protein abundance, protein interaction, metabolite abundance, affinity precipitation, synthetic lethality, etc. (Troyanskaya et al., 2003
Critical issues in using the GENEVESTIGATOR tools are (1) the questions being addressed by queries and (2) the interpretation of output data. First, GENEVESTIGATOR allows queries at a high level of detail and in a large variety of combinations specifying organ, developmental stage, or treatment. Although GENEVESTIGATOR currently contains information from more than 750 publicly available full genome arrays, some combinations at a very detailed level may not yet have sufficient data support to yield robust results. The quality of the results therefore depends strongly on the level of granularity the user chooses and on the number and types of underlying experiments. Second, care must be taken not to over-interpret output data computed by GENEVESTIGATOR. To facilitate data interpretation, the number of samples per category and the SEs of the means are indicated. Nevertheless, when working at a detailed level of granularity, individual genes should be verified afterward with the Digital Northern tool to confirm the origin of the effects observed.
Both the forward and reverse validation of GENEVESTIGATOR revealed that the combination of annotated data from various sources using the same technology platform is a valid approach to reveal contextual information about elements of the dataset. In our case, the expression profiles of more than 22,000 genes from Arabidopsis can be generated in the context of plant organ, plant development and environmental stress. Although not all annotated categories are currently well covered in terms of number of arrays, and therefore the output from these categories may be somewhat biased, the general quality of results obtained using GENEVESTIGATOR is high. The permanent submission of new datasets is expected to constantly improve the quality of the output. The resulting information can be used to confirm previous hypotheses or generate new hypotheses about gene expression network structures and genetic regulatory networks, resulting in the design of more precise and targeted experiments.
We thank Eva Vranová and Franziska Humair for feedback on the use of the software in development. We are also grateful to the Functional Genomics Center Zurich for providing support and the Affymetrix platform for GeneChip experiments, as well as all public repositories for providing data.
Received May 14, 2004; returned for revision July 12, 2004; accepted July 16, 2004.
1 This work was supported by ETH, Strategic Excellence Project 27421302/TH8/022, and by the Functional Genomics Center Zurich.
2 These authors contributed equally to the paper.
[w] The online version of this article contains Web-only data.
www.plantphysiol.org/cgi/doi/10.1104/pp.104.046367.
* Corresponding author; e-mail wilhelm.gruissem@ipw.biol.ethz.ch; fax 4116321079.
The Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408: 796-815
Albani D, Sardana R, Robert LS, Altosaar I, Arnison PG, Fabijanski SF (1992) A Brassica napus gene family which shows sequence similarity to ascorbate oxidase is expressed in developing pollen. Molecular characterization and analysis of promoter activity in transgenic tobacco plants. Plant J 2: 331-342
Ashburner M, Ball CA, Blake JA, Botstein D, Butler H, Cherry JM, Davis AP, Dolinski K, Dwight SS, Eppig JT, et al (2000) Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet 25: 25-29
Becker JD, Boavida LC, Carneiro J, Haury M, Feijo JA (2003) Transcriptional profiling of Arabidopsis tissues reveals the unique characteristics of the pollen transcriptome. Plant Physiol 133: 713-725
Bergmann S, Ihmels J, Barkai N (2004) Similarities and differences in genome-wide expression data of six organisms. PLoS Biol 2: E9
Birnbaum K, Shasha DE, Wang JY, Jung JW, Lambert GM, Galbraith DW, Benfey PN (2003) A gene expression map of the Arabidopsis root. Science 302: 1956-1960
Boyes DC, Zayed AM, Ascenzi R, McCaskill AJ, Hoffman NE, Davis KR, Gorlach J (2001) Growth stage-based phenotypic analysis of Arabidopsis: a model for high throughput functional genomics in plants. Plant Cell 13: 1499-1510
Brazma A, Hingamp P, Quackenbush J, Sherlock G, Spellman P, Stoeckert C, Aach J, Ansorge W, Ball CA, Causton HC, et al (2001) Minimum information about a microarray experiment (MIAME) - toward standards for microarray data. Nat Genet 29: 365-371
Brazma A, Parkinson H, Sarkans U, Shojatalab M, Vilo J, Abeygunawardena N, Holloway E, Kapushesky M, Kemmeren P, Lara GG, et al (2003) ArrayExpress - a public repository for microarray gene expression data at the EBI. Nucleic Acids Res 31: 68-71
Craigon DJ, James N, Okyere J, Higgins J, Jotham J, May S (2004) NASCArrays: a repository for microarray data generated by NASC's transcriptomics service. Nucleic Acids Res (Database issue) 32: D575-D577
Edgar R, Domrachev M, Lash AE (2002) Gene Expression Omnibus: NCBI gene expression and hybridization array data repository. Nucleic Acids Res 30: 207-210
Friedman N (2004) Inferring cellular networks using probabilistic graphical models. Science 303: 799-805
Hennig L, Gruissem W, Grossniklaus U, Köhler C (2004) Transcriptional programs of early stages of plant reproduction. Plant Physiol 135: 1765-1775
Hennig L, Menges M, Murray JA, Gruissem W (2003) Arabidopsis transcript profiling on Affymetrix GeneChip arrays. Plant Mol Biol 53: 457-465
Honys D, Twell D (2003) Comparative analysis of the Arabidopsis pollen transcriptome. Plant Physiol 132: 640-652
Ihmels J, Levy R, Barkai N (2004) Principles of transcriptional control in the metabolic network of Saccharomyces cerevisiae. Nat Biotechnol 22: 86-92
Jansson S, Andersson J, Kim SJ, Jackowski G (2000) An Arabidopsis thaliana protein homologous to cyanobacterial high-light-inducible proteins. Plant Mol Biol 42: 345-351
Kleffmann T, Russenberger D, von Zychlinski A, Christopher W, Sjolander K, Gruissem W, Baginsky S (2004) The Arabidopsis thaliana chloroplast proteome reveals pathway abundance and novel protein functions. Curr Biol 14: 354-362
Lee TI, Rinaldi NJ, Robert F, Odom DT, Bar-Joseph Z, Gerber GK, Hannett NM, Harbison CT, Thompson CM, Simon I, et al (2002) Transcriptional regulatory networks in Saccharomyces cerevisiae. Science 298: 799-804
Lehman A, Black R, Ecker JR (1996) HOOKLESS1, an ethylene response gene, is required for differential cell elongation in the Arabidopsis hypocotyl. Cell 85: 183-194
Lelandais G, Le Crom S, Devaux F, Vialette S, Church GM, Jacq C, Marc P (2004) yMGV: a cross-species expression data mining tool. Nucleic Acids Res (Database issue) 32: D323-D325
Liu WM, Mei R, Di X, Ryder TB, Hubbell E, Dee S, Webster TA, Harrington CA, Ho MH, Baid J, Smeekens SP (2002) Analysis of high density expression microarrays with signed-rank call algorithms. Bioinformatics 18: 1593-1599
Lockhart DJ, Dong H, Byrne MC, Follettie MT, Gallo MV, Chee MS, Mittmann M, Wang C, Kobayashi M, Horton H, et al (1996) Expression monitoring by hybridization to high-density oligonucleotide arrays. Nat Biotechnol 14: 1675-1680
Menges M, Hennig L, Gruissem W, Murray JA (2003) Genome-wide gene expression in an Arabidopsis cell suspension. Plant Mol Biol 53: 423-442
Mouline K, Very AA, Gaymard F, Boucherez J, Pilot G, Devic M, Bouchez D, Thibaud JB, Sentenac H (2002) Pollen tube development and competitive ability are impaired by disruption of a Shaker K(+) channel in Arabidopsis. Genes Dev 16: 339-350
Onate-Sanchez L, Singh KB (2002) Identification of Arabidopsis ethylene-responsive element binding factors with distinct induction kinetics after pathogen infection. Plant Physiol 128: 1313-1322
Parenicova L, de Folter S, Kieffer M, Horner DS, Favalli C, Busscher J, Cook HE, Ingram RM, Kater MM, Davies B, et al (2003) Molecular and phylogenetic analyses of the complete MADS-box transcription factor family in Arabidopsis: new openings to the MADS world. Plant Cell 15: 1538-1551
Pelaz S, Gustafson-Brown C, Kohalmi SE, Crosby WL, Yanofsky MF (2001) APETALA1 and SEPALLATA3 interact to promote flower development. Plant J 26: 385-394
Ravasz E, Somera AL, Mongru DA, Oltvai ZN, Barabasi AL (2002) Hierarchical organization of modularity in metabolic networks. Science 297: 1551-1555
Redman JC, Haas BJ, Tanimoto G, Town CD (2004) Development and evaluation of an Arabidopsis whole genome Affymetrix probe array. Plant J 38: 545-561
Ruiz-Garcia L, Madueno F, Wilkinson M, Haughn G, Salinas J, Martinez-Zapater JM (1997) Different roles of flowering-time genes in the activation of floral initiation genes in Arabidopsis. Plant Cell 9: 1921-1934
Runge S, Sperling U, Frick G, Apel K, Armstrong GA (1996) Distinct roles for light-dependent NADPH:protochlorophyllide oxidoreductases (POR) A and B during greening in higher plants. Plant J 9: 513-523
Segal E, Shapira M, Regev A, Pe'er D, Botstein D, Koller D, Friedman N (2003) Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data. Nat Genet 34: 166-176
Stelling J, Klamt S, Bettenbrock K, Schuster S, Gilles ED (2002) Metabolic network structure determines key aspects of functionality and regulation. Nature 420: 190-193
Troyanskaya OG, Dolinski K, Owen AB, Altman RB, Botstein D (2003) A Bayesian framework for combining heterogeneous data sources for gene function prediction (in Saccharomyces cerevisiae). Proc Natl Acad Sci USA 100: 8348-8353
Vicient CM, Hull G, Guilleminot J, Devic M, Delseny M (2000) Differential expression of the Arabidopsis genes coding for Em-like proteins. J Exp Bot 51: 1211-1220
Wille A, Zimmermann P, Vranová E, Bleuler S, Fürholz A, Hennig L, Laule O, Prelíc A, von Rohr P, Thiele L, et al (2004) Sparse graphical gaussian modeling for genetic regulatory network inference. Genome Biol (in press)