First published online October 6, 2006; 10.1104/pp.106.085639
Plant Physiology 142:1589-1602 (2006)
© 2006 American Society of Plant Biologists
Large-Scale cis-Element Detection by Analysis of Correlated Expression and Sequence Conservation between Arabidopsis and Brassica oleracea 1,[W]
Munich Information Center for Protein Sequences, Institute for Bioinformatics (G.H., P.K., M.S., L.Y., K.F.X.M.), and Institute of Stem Cell Research (M.T.M.), GSF National Research Center for Environment and Health, 85764 Neuherberg, Germany
The rapidly increasing amount of plant genomic sequences allows for the detection of cis-elements through comparative methods. In addition, large-scale gene expression data for Arabidopsis (Arabidopsis thaliana) have recently become available. Coexpression and evolutionarily conserved sequences are criteria widely used to identify shared cis-regulatory elements. In our study, we employ an integrated approach to combine two sources of information, coexpression and sequence conservation. Best-candidate orthologous promoter sequences were identified by a bidirectional best blast hit strategy in genome survey sequences from Brassica oleracea. The analysis of 779 microarrays from 81 different experiments provided detailed expression information for Arabidopsis genes coexpressed in multiple tissues and under various conditions and developmental stages. We discovered candidate transcription factor binding sites in 64% of the Arabidopsis genes analyzed. Among them, we detected experimentally verified binding sites and showed strong enrichment of shared cis-elements within functionally related genes. This study demonstrates the value of partially shotgun sequenced genomes and their combinatorial use with functional genomics data to address complex questions in comparative genomics.
Brassica oleracea enjoys a close evolutionary relationship to Arabidopsis (Arabidopsis thaliana). The two genera separated approximately 12 to 24 million years ago (Yang et al., 1999b
Comparative genomics has been proven to be a powerful tool for the discovery of a large variety of functional elements by their conservation between related species. The usefulness of Brassica GSSs for the improvement of genome and specifically gene structure annotation in Arabidopsis, as well as for comparative studies of the repeat contents of both genomes, has been reported (Zhang and Wessler, 2004
In particular, it has been shown that comparative genomics approaches are able to detect genetic elements that are often difficult to discover due to their small size and/or limited information content. Examples include genetic elements like micro-RNAs and cis-regulatory elements (Wasserman et al., 2000
Comparative genomics detects cis-elements by their conservation between two or more evolutionarily related sequences from orthologous genes. The assumption is that orthologs exhibit a common regulatory mode that is reflected in the conservation of transcription factor binding sites. Phylogenetic footprinting (Wasserman et al., 2000
In addition to sequence conservation, a different popular approach uses functional information, mainly coexpression information, within one species to discover cis-elements. Powerful technologies to monitor transcriptional states and dynamics on a genome scale are well established and widely applied. The analysis of coexpressed genes under different conditions and states has been shown to be highly valuable for the analysis of shared cis-regulatory elements (Harmer et al., 2000
The majority of studies have used either sequence conservation or overrepresentation of motifs in promoters of coexpressed genes to discover cis-regulatory elements. However, some studies used a combination of both approaches to evaluate and/or screen detected motifs. For instance, Kellis and coworkers (Kellis et al., 2003
Recent developments in the detection of cis regulatory elements integrate both phylogenetic information as well as coexpression information (Wang and Stormo, 2003
To integrate coexpression and sequence conservation for motif discovery, adequate information sources, coexpression and orthology information, are required. Recent contributions have provided both large-scale expression data for Arabidopsis and sequence data from Brassica. A large and high-quality expression dataset comprising about 800 microarrays has recently been made available (Craigon et al., 2004 In this study, we undertook a comprehensive analysis of thousands of Brassica-Arabidopsis orthologous promoter pairs. To generate coexpression information, we analyzed a set of 81 microarray experiments from Arabidopsis totaling 779 chips. Promoters from coexpressed genes and their respective Brassica orthologous counterpart were selected. The resulting promoter sets were analyzed, and a large number of candidate sites have been discovered. These sites are derived from profiles, which are conserved between orthologous promoters and associated with coexpression. Many of the detected motifs are enriched for specific biological processes and pathways. Evaluation of our analysis with the aid of experimentally validated cis-regulatory elements from Arabidopsis confirms their significance. This study provides the basis for future cis-regulatory module analysis and analysis of regulatory circuits not restricted to Brassicaceae. It further demonstrates the benefits of partial genome sequences to address complex problems in comparative genomics.
The main goal of our study was the genome-wide discovery of candidate cis-regulatory regions in Arabidopsis. We selected PhyloCon to combine two sources of information for cis-element discovery: coexpression and sequence conservation. PhyloCon has been demonstrated to be very powerful both on biological and controlled artificial data (Wang and Stormo, 2003
To avoid potential misassignments, we applied a stringent bidirectional best BLASTN hit strategy to detect the best available candidates for orthologous promoter sequences (see "Materials and Methods"). In the following, we use the term orthologous promoters for these candidate pairs. Coexpressed Arabidopsis genes were determined using 779 microarray hybridization data (Craigon et al., 2004
Brassica Orthologous Upstream Sequences
We assembled a set of 567,365 Brassica GSSs by applying highly stringent clustering, both to minimize redundancy within the GSSs and to prevent the generation of erroneous hybrid clusters by excluding repetitive and/or ambiguous sequences. Assembly and repeat masking/filtering resulted in 142,489 clusters with an average length of 987 bp, totaling 140.6 Mb of nonredundant sequence. The genome size of Brassica has been estimated to be about 600 Mb (Arumuganathan and Earle, 1991
Next, we determined orthologous upstream sequences between Arabidopsis and Brassica by reciprocal BLASTN comparisons (E
Expression data available from the Nottingham Arabidopsis Stock Centre (Craigon et al., 2004
The background distribution was derived from the pairwise correlations of all 21,559 genes used in this study (Fig. 2). To define groups of coexpressed genes, the 99% quantile of this distribution was considered as significant (r = 0.803; Fig. 2). For each Arabidopsis gene, we assigned all genes exceeding a Pearson correlation of r
Motif Discovery by PhyloCon
Each PhyloCon analysis group (PAG) is composed of the orthologous gene pairs of an individual CEG. Arabidopsis genes with no detectable orthologous upstream sequence in Brassica were removed from the analysis set. In the following, we refer to an orthologous pair of a Brassica GSS assembly and Arabidopsis promoter/upstream sequence as an orthologous promoter group (OPG). Thus, the collection of all OPGs of a particular CEG represents a PAG. Elimination of genes without corresponding Brassica OPG significantly reduced the size and number of coexpressed groups, because an OPG has been identified on average for only about one-sixth of the Arabidopsis genes. In addition, each PAG had to consist of at least two OPGs. This filtering resulted in 4,540 PAGs that were subjected to a PhyloCon analysis.
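The grouping rules described above can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `build_pags`, the input layout (`cegs` mapping a group identifier to its member genes, `opg` mapping an Arabidopsis gene to its Brassica assembly), and all gene and assembly identifiers are hypothetical and are not taken from the original pipeline.

```python
def build_pags(cegs, opg, min_opgs=2):
    """Return {ceg_id: sorted list of (arabidopsis_gene, brassica_assembly)}.

    cegs: hypothetical mapping of a CEG identifier to its set of member genes.
    opg:  hypothetical mapping of an Arabidopsis gene to its Brassica assembly.
    """
    pags = {}
    for ceg_id, members in cegs.items():
        # Keep only members with a detectable Brassica orthologous upstream sequence.
        pairs = sorted((gene, opg[gene]) for gene in members if gene in opg)
        # A PAG has to consist of at least two OPGs.
        if len(pairs) >= min_opgs:
            pags[ceg_id] = pairs
    return pags


# Toy input with made-up identifiers.
cegs = {
    "CEG_A": {"AT1G01010", "AT2G02020", "AT3G03030"},
    "CEG_B": {"AT4G04040", "AT5G05050"},
}
opg = {"AT1G01010": "BoGSS_0001", "AT3G03030": "BoGSS_0042", "AT5G05050": "BoGSS_0107"}
pags = build_pags(cegs, opg)
```

In this toy input, CEG_A retains two OPGs and becomes a PAG, whereas CEG_B retains only one and is dropped, mirroring the reduction in group size and number noted above.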
PhyloCon initially creates profiles from pairwise OPG alignments. In subsequent cycles, merging and trimming profiles of preceding cycles generates derived profiles. Thus, profiles of later cycles are derived from alignments of an increasing number of distinct OPGs. An example for profile generation by PhyloCon is given in Figure 3B. For each PAG, both the final alignment matrices as well as profiles of previous cycles to which we refer as intermediate matrices were collected. This step reduces the likelihood to miss significant motifs in a noisy data set (see "Materials and Methods" for details). Analysis of all PAGs revealed a total of 322,079 preliminary profiles, including a large number of redundant intermediate matrices (see "Materials and Methods"). As (CT)n-repeats [or its respective complement, (GA)n] are very prominent in Arabidopsis promoters, we prefiltered consensus sequences of our matrices for the presence of such simple repeats. We analyzed the filtered matrices for overrepresentation within the associated CEG as compared to their frequency in all 21,559 analyzed Arabidopsis genes by testing against the cumulative binomial distribution. Within 3,861 PAGs, we detected at least one motif model that was significantly overrepresented for the respective CEG (P
Lengths of profiles predominantly range between 6 and 15 bp (Supplemental Fig. S1). This is in good agreement with sizes of known individual transcription factor binding sites. To estimate the number of candidate sites per gene, overlapping sites/instances of different motifs were merged (see "Materials and Methods"). Fusion of overlapping sites did not change the size distribution (Supplemental Fig. S1). This indicates that our profiles detect well-confined regions within the promoters. We found on average 7.3 nonredundant sites per gene and a total of 61,745 sites in 8,407 Arabidopsis genes (out of 13,254 genes contained in all CEGs). Each CEG and each gene can be queried for a list of significant profiles at http://mips.gsf.de/proj/plant/webapp/expressionDB/index.jsp. For the Web display, we assorted identical or nearly identical profiles exceeding a Pearson correlation of r
We investigated whether profiles detected in this study match known cis-elements. For this purpose, we screened elements from the PLACE and AGRIS databases (Higo et al., 1999 In total, 537 motifs are contained within the two databases. However, there is a significant degree of redundancy both between the databases as well as within individual databases. In many cases, it is difficult to decide whether two motif variants constitute binding sites for two distinct transcription factors or represent two sites for one transcription factor. Therefore, we used all binding sites listed in both databases. For 255 sites (out of 537 sites; 47.5%), we detected a profile similar to the described motif within PLACE or AGRIS. Table I lists a selection of detected matches.
Several reasons complicate the evaluation for profile matches to motifs reported in PLACE or AGRIS. First, motifs in PLACE are derived from various plant species. Hence, some motifs might not be present within Arabidopsis and Brassica. Second, many motifs are either reported as consensus sequences or experimental reports are restricted to only one specific site in a particular promoter. Particularly, the latter description is likely too specific as many transcription factor binding sites are degenerated. In addition, some consensus sequences do not describe binding sites for individual transcription factors but instead give the (degenerated) consensus for a family of transcription factors such as, for example, the Myb transcription factors (see Table I). This problem is especially pronounced for known motifs for which only a short core sequence is present and that are involved in the regulation of numerous pathways. Examples comprise the ACGT-element or the CAAT-box. Most importantly, the Brassica assembly only partially covers the Brassica genome, and the average length of the Brassica GSSs is about one-half the average length of Arabidopsis promoters used in this study. Thus, we are missing a considerable number of genes or promoter regions for comparison. Nevertheless, our findings for several known motifs are consistent with experimental findings and functional enrichments described below. For instance, the PALBOX is frequently found in promoters of genes catalyzing steps in the phenylpropanoid biosynthesis. Consistently, we detected a significant enrichment of several profiles highly similar to it (e.g. Table II) in the flavonoid, phenylpropanoid, and lignin biosynthesis and in the category response to UV-C. A detailed description of the detected sites within the PAL promoter and their matches to known sites within this promoter is given in the last section. We also detected several profiles matching the G-box-related abscisic acid-responsive element GCCACGTG. 
In agreement with its regulatory function, these profiles were significantly overrepresented in the functional category response to abscisic acid stimulus (Table II).
Detected Profiles Are Overrepresented within Specific Functional Categories and Biochemical Pathways
Numerous studies demonstrated that coexpressed genes have an increased probability to be involved in a common biological process (DeRisi et al., 1997
The enrichment in a particular KEGG pathway or a biological process defined by a GO term (see "Materials and Methods") was determined using the binomial coefficient against the genome-wide background distribution. P values have been Bonferroni corrected for multiple testing, and corrected P values
We evaluated our results using experimental findings. CRABS CLAW (CRC), a member of the YABBY gene family, is required for nectary and carpel development in Arabidopsis (Lee et al., 2005
Figure 5A depicts a PAG enriched for genes involved in phenylpropanoid biosynthesis. Within this group, an enzymatic chain involving phenylalanine ammonia-lyase (PAL1, the entry point of the biosynthetic pathway), trans-cinnamate 4-monooxygenase (C4H/CYP73A5), p-coumarate 3-hydroxylase, and a caffeoyl-CoA O-methyltransferase-like protein is present. We found these genes to frequently cocluster within our analysis. In addition, many of these PAG clusters contained two isoforms of the coumarate CoA-ligase (4CL1, 4CL2) and a second PAL variant (PAL2). Tight coexpression of PAL1, C4H, and 4CLs has been reported not only for Arabidopsis (Mizutani et al., 1997 In summary, the examples illustrate that conserved sites detected in this study correlate very well with known transcription factor binding sites.
The detection of cis-regulatory elements in higher eukaryotes is a major challenge in functional genomics. Bioinformatic sequence analysis of transcription factor binding sites is notoriously difficult, as cis-elements are difficult to distinguish from background. Two major approaches are commonly undertaken. In the first strategy, coexpressed genes are selected and analyzed for shared cis sequence elements. A second approach, phylogenetic footprinting, aims to detect candidate transcription factor binding sites from conserved regions in alignments of orthologous promoters. Both strategies have been demonstrated to be powerful (Zhang and Gerstein, 2003
In our analysis, we applied PhyloCon to discover cis-elements in Arabidopsis upstream sequences. Coexpression information for Arabidopsis genes was derived from a large set of microarray experiments (Craigon et al., 2004
One limitation in our approach is the incomplete Brassica genome sequence that covers approximately one-fourth of the genome. OPGs identified in this study represent the best available candidates for orthologous promoter pairs. Although PhyloCon uses sequence conservation for motif discovery, a strict orthologous relationship of sequence pairs is not compulsory. For the identification of cis-regulatory elements, paralogous promoters that contain conserved regulatory regions are useful as well (Haberer et al., 2004
A number of complete plant genomes will be available in the near future and will help to circumvent some of the limitations encountered with partial genomes. Map information for these genomes will enable us to detect syntenic relationships and thus support the detection of true corresponding orthologous promoters. Map-derived synteny relations, however, are impaired by the highly dynamic nature of plant genomes. Genome, segmental, and tandem duplications are prevalent, and in plant genomes, gene families are often highly expanded (Arabidopsis Genome Initiative, 2000
Detection of cis-regulatory elements is known to be an error-prone process. Nevertheless, several observations indicate a successful enrichment for functional transcription factor binding sites in our study: sequence conservation of motifs between Arabidopsis and Brassica, enrichment of motifs in functional categories, and detection of known sites. Sequence conservation between evolutionarily related species is generally considered an indicator of either short divergence times or the functional importance of the respective elements. Insufficient sequence divergence poses a severe problem for classic phylogenetic footprinting analysis based on sequence alignments, as nonfunctional elements cannot be delimited from functional elements. From a large set of We investigated enrichments within functional categories by making use of GOSlim and the KEGG biochemical pathway annotations for the respective Arabidopsis genes. Many detected profiles are enriched in a wide range of biological functional categories involving metabolism (e.g. gluconeogenesis), development (flower development), signaling (abscisic acid and GA signaling), as well as cell maintenance tasks like ribosome biogenesis. Applying the guilt-by-association rule, the occurrence of particular profiles or the functional enrichment within particular CEGs may help to transfer knowledge to genes of yet unknown function. To assist in this task, we implemented a database and a Web portal providing structured access to all results of this study (http://mips.gsf.de/proj/plant/webapp/expressionDB/index.jsp). We analyzed to what extent known Arabidopsis and plant cis-elements present within the AGRIS and PLACE databases overlap with the motifs detected within our analysis. We compared all cis-element entries present within the two databases with the profiles resulting from our analysis. We successfully detected 255 out of 537 elements present within the databases.
Limiting factors in this analysis are the incomplete Brassica genome and the partial coverage of many Arabidopsis promoters by corresponding orthologous Brassica GSS contigs. An additional limitation is the partial coverage of the Arabidopsis transcriptome by Affymetrix GeneChips (21,559 out of 26,535 genes in MAtDB). Given these restrictions, the successful detection of 47.5% of described cis-elements from PLACE and AGRIS can be viewed as highly satisfactory. However, due to the incomplete data set as well as some limitations of motifs within the databases (e.g. single site reports, consensus sequences of transcription factor families, see "Results"), an exact global assessment of specificity and sensitivity for our results is not feasible.
Two examples we have studied in detail illustrate the correlation of sites detected in our analysis with regulatory elements involved in transcriptional regulation. CRC coclustered with several floral development genes that have been shown to interact with CRC (Lee et al., 2005 In our work, we demonstrate that complex questions in comparative genomics can be addressed by using fragmented genome information and an integrative analytical approach, i.e. the combination of expression data with comparative sequence analysis and phylogenetic footprinting. Our analysis uses the comparison of a full and a partial genome sequence. The approach can be extended to additional partial or complete genomes to enhance the support for and the refinement of discovered motifs. The simultaneous analysis of several partial genomes, however, would decrease the number of OPGs available for the analysis, as best candidate orthologs have to be detected in multiple partial sequence sets. For instance, for the analysis of two genomes with coverage of one-quarter each, one would expect one-sixteenth of candidates on average to be present in both sets. Instead of a simultaneous analysis, partial genomes may be sequentially subjected to a comparison against a complete genome. Subsequent processing and merging of results derived from pairwise comparisons would lead to a more comprehensive cis-element catalog.
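The one-sixteenth estimate for two partial genomes can be written out as a short calculation. The symbols below are introduced here for illustration only, and the calculation assumes that each partial genome covers a given promoter independently of the others.

```latex
% c_j: fraction of genome j covered by its partial sequence set (assumed independent)
% f_shared: expected fraction of candidates present in all k sets
\[
  f_{\mathrm{shared}} = \prod_{j=1}^{k} c_j ,
  \qquad
  k = 2,\; c_1 = c_2 = \tfrac{1}{4}
  \;\Longrightarrow\;
  f_{\mathrm{shared}} = \tfrac{1}{4}\cdot\tfrac{1}{4} = \tfrac{1}{16}.
\]
```

Under the same assumption, each additional partial genome multiplies the expected fraction by its own coverage, which is why sequential pairwise comparisons against a complete genome retain more candidates than a simultaneous analysis.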
Full genome sequences are labor and cost intensive, and high quality genome projects are expected in the near future for only a few model organisms and economically important species. Large-scale expression data will be subject to similar limitations. Genome scale comparative genomics would thus have to rely on a few species that may be separated by large evolutionary distances, restricting the scope of comparative analyses. In plants, this problem is particularly accentuated as up to now only two genomes, rice (Oryza sativa) and Arabidopsis, have been analyzed extensively (Arabidopsis Genome Initiative, 2000
Brassica Dataset
Sequences were retrieved from the National Center for Biotechnology Information selecting for the keyword Brassica oleracea in the field Organism. The vast majority of the sequences represent GSSs of B. oleracea deposited by a sequencing consortium of The Institute for Genomic Research, the Cold Spring Harbor Laboratories, and Washington University. A total of 567,985 sequences (567,365 GSS) were obtained. The 567,365 GSS sequence reads represented approximately 384 Mb of sequence. A rigid clustering regime using the Harvester assembly pipeline (BIOMAX Informatics) was applied. The assembly method of Harvester is based on the CAP3 program (Huang and Madan, 1999
Individual Arabidopsis (Arabidopsis thaliana) upstream sequences were selected from the genomic sequence. Sequences were delimited either by the 5' neighboring gene or a maximum size of 3 kb (excluding the 5'-untranslated region [UTR]). Because 5' UTRs may harbor motifs or signals relevant to the transcriptional activity of a gene, 5'-UTR sequences were included in the analysis. To identify the best available candidates for orthologous promoter regions between partial genome information of Brassica and the complete Arabidopsis genome sequence, a bidirectional best Blast hit analysis strategy was applied. Upstream Arabidopsis sequences were compared against the Brassica GSS assemblies by BLASTN (E
Arabidopsis genome scale expression data have become available from a variety of microarray platforms. Among them are several cDNA arrays, both commercial and custom made, as well as two Affymetrix oligonucleotide GeneChips (http://www.affymetrix.com/products/arrays/index.affx?Arabidopsis). However, it is well known that comparisons among different platforms are problematic (The Toxicogenomics Research Consortium, 2005
Experiments available from Nottingham Arabidopsis Stock Centre (http://nasc.nott.ac.uk, CD-ROM release as of November 2004; Craigon et al., 2004
Due to annotation updates and enhanced gene modeling, GeneChip oligonucleotide mapping is frequently erroneous and outdated. Therefore, probe sets were recalculated using an enhanced oligonucleotide mapping against the Arabidopsis genome template.
All oligonucleotides present on the ATH1 GeneChip of Affymetrix (sequences downloaded from www.affymetrix.com as of October, 2004) were realigned against the coding sequence, and, for genes with associated full-length cDNA information, against the UTR sequences (MAtDB release from September 24, 2004; Schoof et al., 2004 Oligonucleotides aligning to more than one gene and probes without perfect matches were excluded. For subsequent calculations, only probe sets with at least five probe pairs were considered. Most of the probe sets still consist of nine to 11 probe pairs. Four percent of the probes matched perfectly to at least two genes and led to partial unspecific estimates for 10% of the original probe sets, indicating the need for the realignments. We excluded those probe sets from our refined sets. In summary, expression measurements from 21,559 genes met the quality criteria and were used for subsequent analyses.
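One possible reading of these probe-level filtering rules is sketched below. The function `refine_probe_sets`, the input layout, and the identifiers are hypothetical. The sketch drops individual ambiguous probes and then applies the five-probe minimum; the original procedure may instead have excluded whole probe sets affected by unspecific probes.

```python
from collections import defaultdict


def refine_probe_sets(probe_hits, min_probes=5):
    """Group specific probes by gene and keep sufficiently large probe sets.

    probe_hits: hypothetical mapping of a probe ID to the set of gene IDs
                that the probe matches perfectly after realignment.
    """
    by_gene = defaultdict(list)
    for probe_id, genes in probe_hits.items():
        # Drop probes without a perfect match and probes matching several genes.
        if len(genes) == 1:
            (gene,) = genes
            by_gene[gene].append(probe_id)
    # Only probe sets with at least `min_probes` remaining probes are considered.
    return {gene: sorted(ids) for gene, ids in by_gene.items() if len(ids) >= min_probes}


# Toy input: geneA keeps six specific probes, geneB only four.
probe_hits = {f"a{i}": {"geneA"} for i in range(6)}
probe_hits.update({f"b{i}": {"geneB"} for i in range(4)})
probe_hits["x0"] = {"geneB", "geneC"}  # cross-hybridizing probe, excluded
probe_hits["n0"] = set()               # no perfect match, excluded
refined = refine_probe_sets(probe_hits)
```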
The statistical analysis of the expression data was carried out in R (R Development Core Team, 2004) using the FunDaMiner system (http://mips.gsf.de/proj/express). We calculated MAS 5.0, dChip (Li and Wong, 2001
For 779 measurements, i.e. microarray experiments, we computed the correlation matrix of all-against-all probe sets. The full matrix consists of about 4.65 × 10^8 (21,559^2) correlation pairs. Correlations were determined as metric (Pearson) correlation coefficients. The full correlation matrix (except self correlations) served as background and the 99% quantile has been derived from this distribution (r = 0.803). Correlations with a correlation coefficient higher than the 99% quantile of the background distribution, analogous to a one-sided 1% significance level, were considered as relevant.
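The thresholding step can be illustrated with a small, standard-library-only sketch. It is not the original R implementation: the function names, the toy data, and the simple empirical quantile rule are assumptions. Because the correlation matrix is symmetric, the sketch uses each unordered gene pair once, which leaves the quantile of the background distribution unchanged.

```python
import math
import random
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient of two equally long value lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def coexpression_cutoff(expr, q=0.99):
    """Return ({(gene, gene): r}, cutoff), with cutoff the empirical q-quantile.

    expr: hypothetical mapping of a gene ID to its expression values across arrays.
    """
    corr = {(g, h): pearson(expr[g], expr[h]) for g, h in combinations(sorted(expr), 2)}
    ranked = sorted(corr.values())
    # Smallest r such that at least a fraction q of all pairs have a value <= r.
    cutoff = ranked[math.ceil(q * len(ranked)) - 1]
    return corr, cutoff


# Toy data: 40 genes measured on 25 arrays (random values, for illustration only).
random.seed(0)
expr = {f"gene{i:02d}": [random.gauss(0, 1) for _ in range(25)] for i in range(40)}
corr, cutoff = coexpression_cutoff(expr)
coexpressed = [pair for pair, r in corr.items() if r > cutoff]
```

Pairs in `coexpressed` correspond to correlations above the background quantile; on the real data set this cutoff was r = 0.803.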
For each of the 21,559 genes, its CEG was defined as those genes showing a Pearson correlation r
PhyloCon was downloaded from http://ural.wustl.edu/~twang/Phylocon/ (Wang and Stormo, 2003 A common problem in motif discovery is the degree of noise in the selected set of genes. One reason is that coexpression does not necessarily result from coregulation, as coexpression of two genes can be attributed to secondary effects (e.g. transcription factor cascades). Measurement errors, cross hybridization, biological variation, and erroneous annotations are additional sources of noise. In this study, the incomplete and fragmented genome of B. oleracea represents an additional difficulty. With an average length of about 1 kb for the GSS assemblies and about 1.8 kb for the Arabidopsis upstream sequences, orthologous information is available on average for only approximately 55% of the promoter region. Thus, even if all studied promoters of one specific CEG contain a conserved binding site, approximately one-half of the phylogenetic comparisons will on average fail to detect it, as alignments cannot cover the conserved site. However, PhyloCon computes candidate matrices in a stepwise manner. Starting from pairwise OPG alignments, it sequentially adds OPGs to the alignments from which new matrices are built. Importantly, PhyloCon allows for a report of intermediate matrices, i.e. matrices derived from preceding cycles.
To overcome the limitations of missing sites or noisy expression groups, we retrieved a maximum of 10 intermediate matrices per cycle. As additional parameters, we allowed for 200 temporary (or test) matrices per cycle and the number of SDs was set to 0.5 (for details, see Wang and Stormo, 2003
Primary profiles were filtered for (CT)n- and (GA)n-repeats, as these repeats are very prominent on Arabidopsis promoters. Alignment matrices reported by PhyloCon were transformed into position weight matrices (PWMs) to generate a scoring function for sequence instances. For an alignment matrix of length m, we determined the number of occurrences nij of the four possible nucleotides i
For each PWM, we tested the statistical significance of its overrepresentation within the respective CEG in comparison to all 21,559 Arabidopsis upstream sequences (see "Materials and Methods"). P values were obtained from the cumulative binomial distribution and a PhyloCon PWM was considered to be significantly overrepresented for P
To identify identical and almost identical profiles, profiles with a PCC of r Note that we did not merge any redundant profiles/PWMs, i.e. recompute a new alignment matrix derived from the merged profiles. Although this approach results in a significant redundancy, we avoid any flaws by low-quality matrices that potentially strongly alter the specificity of a merged profile. As a consequence, one would need to reassess findings for the merged matrices, e.g. enrichments of particular profiles in CEGs and functional categories, which have been obtained from the more specific individual profiles. To derive an estimate for the number of sites detected in the PAG analysis, we merged detected instances/sites (not profiles) in each promoter if instances overlapped by more than 90%.
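The site-merging step can be sketched as follows. The text does not state how the 90% overlap is measured, so this sketch assumes it is relative to the shorter of the two instances. The function name and the half-open coordinate convention are also assumptions, and each site is compared only with the most recently merged site, which is a simplification.

```python
def merge_sites(sites, min_overlap=0.9):
    """Merge site instances on one promoter that overlap by more than min_overlap.

    sites: list of (start, end) half-open intervals (hypothetical convention).
    """
    merged = []
    for start, end in sorted(sites):
        if merged:
            m_start, m_end = merged[-1]
            shared = min(end, m_end) - max(start, m_start)
            shorter = min(end - start, m_end - m_start)
            # Assumption: overlap is measured relative to the shorter instance.
            if shorter > 0 and shared / shorter > min_overlap:
                merged[-1] = (m_start, max(end, m_end))
                continue
        merged.append((start, end))
    return merged


# Two identical instances and one contained instance collapse into a single site;
# the instance at 106-118 overlaps by only 50% and stays separate.
sites = [(100, 112), (101, 112), (100, 112), (106, 118), (120, 128)]
nonredundant = merge_sites(sites)
```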
To compare detected profiles with reported sites, all motifs listed in the AGRIS (http://Arabidopsis.med.ohio-state.edu/AtcisDB/bindingSiteContent.jsp) and PLACE (http://www.dna.affrc.go.jp/PLACE) databases were downloaded (Higo et al., 1999
GOSlim annotation for Arabidopsis and the KEGG pathway map has been obtained from The Arabidopsis Information Resource (www.arabidopsis.org). All functional categories containing only one member have been excluded from subsequent analysis. Gene lists of categories were matched with the 21,559 genes used in this study. For each profile, we selected the genes containing the respective profile within their upstream sequences. Overrepresentation of the profile in a functional category was consequently checked for each GO annotation associated with the selected genes. P values for each test were obtained by cumulative binomial probability.
Profiles present in only one GO annotation were not considered (x > 1) as no reliable statistics can be computed for only one occurrence. Multiple testing corrections were performed by multiplication of the P value with the total number of assayed GO annotations for each profile. For the KEGG pathways, we employed a similar binomial testing scheme. P values were corrected for multiple testing by the number of different KEGG pathways.
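The enrichment test and its correction can be illustrated with a minimal sketch. It follows one reading of the procedure: the number of trials is the number of genes carrying a profile, a success is a gene belonging to the category, and the success probability is the category's share of the background genes. The function names and the example numbers are hypothetical.

```python
from math import comb


def binomial_tail(k, n, p):
    """Cumulative binomial probability P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def enrichment_pvalue(hits, genes_with_profile, category_size, background_size, n_tests):
    """Bonferroni-corrected P value for a profile being enriched in one category."""
    p_background = category_size / background_size
    raw = binomial_tail(hits, genes_with_profile, p_background)
    # Bonferroni: multiply by the number of categories assayed, capped at 1.
    return min(1.0, raw * n_tests)


# Illustrative numbers only: 8 of 40 genes carrying a profile fall into a category
# of 200 genes, against a background of 21,559 genes and 50 assayed categories.
p_adj = enrichment_pvalue(8, 40, 200, 21559, 50)
```

The same function applies to the KEGG analysis if `n_tests` is set to the number of different KEGG pathways.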
The following materials are available in the online version of this article.
We thank Markus Schmid and Detlev Weigel for providing us with microarray data from AtGenExpress, and Chris D. Town from The Institute for Genomic Research for making the Brassica GSS dataset available to us prior to publication. The authors also wish to thank Louise Gregory for helpful discussions.
Received June 22, 2006; accepted September 24, 2006; published October 6, 2006.
1 This work was supported by the GABI program of the German Ministry of Education and Research (BMBF).
2 These authors contributed equally to the paper. The author responsible for distribution of materials integral to the findings presented in this article in accordance with the policy described in the Instructions for Authors (www.plantphysiol.org) is: Klaus F.X. Mayer (kmayer{at}gsf.de).
[W] The online version of this article contains Web-only data.
www.plantphysiol.org/cgi/doi/10.1104/pp.106.085639
* Corresponding author; e-mail kmayer{at}gsf.de; fax 4908931873585.
Aarts MG, Hodge R, Kalantidis K, Florack D, Wilson ZA, Mulligan BJ, Stiekema WJ, Scott R, Pereira A (1997) The Arabidopsis MALE STERILITY 2 protein shares similarity with reductases in elongation/condensation complexes. Plant J 12: 615-623
Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408: 796-815
Arumuganathan K, Earle ED (1991) Nuclear DNA content of some important plant species. Plant Mol Biol Rep 9: 208-218
Ayele M, Haas BJ, Kumar N, Wu H, Xiao Y, Van Aken S, Utterback TR, Wortman JR, White OW, Town CD (2005) Whole genome shotgun sequencing of Brassica oleracea and its application to gene discovery and annotation in Arabidopsis. Genome Res 15: 487-495
Bailey TL, Elkan C (1994) Fitting a mixture model by expectation maximization to discover motifs in biopolymers. Proc Int Conf Intell Syst Mol Biol 2: 28-36
Bao X, Franks RG, Levin JZ, Liu Z (2004) Repression of AGAMOUS by BELLRINGER in floral and inflorescence meristems. Plant Cell 16: 1478-1489
Berardini TZ, Mundodi S, Reiser R, Huala E, Garcia-Hernandez M, Zhang P, Mueller LM, Yoon J, Doyle A, Lander G, et al (2004) Functional annotation of the Arabidopsis genome using controlled vocabularies. Plant Physiol 135: 1-11
Blanc G, Wolfe KH (2004) Functional divergence of duplicated genes formed by polyploidy during Arabidopsis evolution. Plant Cell 16: 1679-1691
Boffelli D, McAuliffe J, Ovcharenko D, Lewis KD, Ovcharenko I, Pachter L, Rubin EM (2003) Phylogenetic shadowing of primate sequences to find functional regions of the human genome. Science 299: 1391-1394
Bolstad BM, Irizarry RA, Astrand M, Speed TP (2003) A comparison of normalization methods for high density oligonucleotide array data based on variance and bias. Bioinformatics 19: 185-193
Bowers JE, Chapman BA, Rong J, Paterson A (2003) Unravelling angiosperm evolution by phylogenetic analysis of chromosomal duplication events. Nature 422: 433-438
Cleveland WS (1979) Robust locally weighted regression and smoothing scatterplots. J Am Stat Assoc 74: 829-836
Cleveland WS, Grosse E, Shyu WM (1992) Local regression models. In JM Chambers, TJ Hastie, eds, Statistical Models in S. Wadsworth and Brooks, Pacific Grove, CA, pp 309-376
Cliften P, Sudarsanam P, Desikan A, Fulton L, Fulton B, Majors J, Waterston R, Cohen BA, Johnston M (2003) Finding functional features in Saccharomyces genomes by phylogenetic footprinting. Science 301: 71-76
Craigon DJ, James N, Okyere J, Higgins J, Jotham J, May S (2004) NASCArrays: a repository for microarray data generated by NASC's transcriptomics service. Nucleic Acids Res 32: D575-D577
Davuluri RV, Sun H, Palaniswamy SK, Matthews N, Molina C, Kurtz M, Grotewold E (2003) AGRIS: Arabidopsis gene regulatory information server, an information resource of Arabidopsis cis-regulatory elements and transcription factors. BMC Bioinformatics 4: 25-36
DeRisi JL, Iyer VR, Brown PO (1997) Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 278: 680-686
Dudoit S, Yang YH, Callow MJ, Speed TP (2000) Statistical methods for identifying differentially expressed genes in replicated cDNA microarray experiments. Technical Report 578. Stanford University School of Medicine, Stanford, CA
Elemento O, Tavazoie S (2004) Fast and systematic genome-wide discovery of conserved regulatory elements using a non-alignment based approach. Genome Biol 6: R18
Guo H, Moose SP (2003) Conserved noncoding sequences among cultivated cereal genomes identify candidate regulatory sequence elements and patterns of promoter evolution. Plant Cell 15: 1143-1158
Haberer G, Hindemitt T, Meyers BC, Mayer KF (2004) Transcriptional similarities, dissimilarities and conservation of cis-elements in duplicated genes of Arabidopsis. Plant Physiol 136: 3009-3022
Harmer SL, Hogenesch JB, Straume M, Chang HS, Han B, Zhu T, Wang X, Kreps JA, Kay SA (2000) Orchestrated transcription of key pathways in Arabidopsis by the circadian clock. Science 290: 2110-2113
Hertz GZ, Stormo GD (1999) Identifying DNA and protein patterns with statistically significant alignments of multiple sequences. Bioinformatics 15: 563-577
Hertzberg M, Aspeborg H, Schrader J, Andersson A, Erlandsson R, Blomqvist K, Bhalerao R, Uhlen M, Teeri TT, Lundeberg J, et al (2001) A transcriptional roadmap to wood formation. Proc Natl Acad Sci USA 98: 14732-14737
Higo K, Ugawa Y, Iwamoto M, Korenaga T (1999) Plant cis-acting regulatory DNA elements (PLACE) database. Nucleic Acids Res 27: 297-300
Huang X, Madan A (1999) CAP3: A DNA sequence assembly program. Genome Res 9: 868-877
Hughes TR, Marton MJ, Jones AR, Roberts CJ, Stoughton R, Armour CD, Bennett HA, Coffey E, Dai H, He YD, et al (2000) Functional discovery via a compendium of expression profiles. Cell 102: 109-126
Inada DC, Bashir A, Lee C, Thomas BC, Ko C, Goff SA, Freeling M (2003) Conserved noncoding sequences in the grasses. Genome Res 13: 2030-2041
International Rice Genome Sequencing Project (2005) The map-based sequence of the rice genome. Nature 436: 793-800
Jones-Rhoades MW, Bartel DP (2004) Computational identification of plant microRNAs and their targets, including a stress-induced miRNA. Mol Cell 14: 787-799
Katari MS, Balija V, Wilson RK, Martienssen RA, McCombie WR (2005) Comparing low coverage random shotgun sequence data from Brassica oleracea and Oryza sativa genome sequence for their ability to add to the annotation of Arabidopsis thaliana. Genome Res 15: 496-504
Kellis M, Patterson N, Endrizzi M, Birren B, Lander ES (2003) Sequencing and comparison of yeast species to identify genes and regulatory elements. Nature 423: 241-254
Lee JY, Baum SF, Alvarez J, Patel A, Chitwood DH, Bowman JL (2005) Activation of CRABS CLAW in the nectaries and carpels of Arabidopsis. Plant Cell 17: 25-36
Li C, Wong WH (2001) Model-based analysis of oligonucleotide arrays: expression index computation and outlier detection. Proc Natl Acad Sci USA 98: 31-36
McCue L, Thompson W, Carmack C, Ryan MP, Liu JS, Derbyshire V, Lawrence CE (2001) Phylogenetic footprinting of transcription factor binding sites in proteobacterial genomes. Nucleic Acids Res 29: 774-782
Mizutani M, Ohta D, Sato R (1997) Isolation of a cDNA and a genomic clone encoding cinnamate 4-hydroxylase from Arabidopsis and its expression manner in planta. Plant Physiol 113: 755-763
Moses AM, Chiang DY, Eisen MB (2004) Phylogenetic motif detection by expectation-maximization on evolutionary mixtures. Pac Symp Biocomput 2004: 324-335
Schmid M, Davison TS, Henz SR, Pape UJ, Demar M, Vingron M, Schoelkopf B, Weigel D, Lohmann JU (2005) A gene expression map of Arabidopsis thaliana development. Nat Genet 37: 501-506
Schones DE, Sumazin P, Zhang MQ (2005) Similarity of position frequency matrices for transcription factor binding sites. Bioinformatics 21: 307-313
Schoof H, Ernst R, Nazarov V, Pfeifer L, Mewes HW, Mayer KF (2004) MIPS Arabidopsis thaliana database (MAtDB): an integrated biological knowledge resource for plant genomics. Nucleic Acids Res 32: D373-D376
Siddharthan R, Siggia ED, van Nimwegen E (2005) PhyloGibbs: a Gibbs sampling motif finder that incorporates phylogeny. PLoS Comput Biol 1: e67
Simillion C, Vandepoele K, Van Montagu MC, Zabeau M, Van de Peer Y (2002) The hidden duplication past of Arabidopsis thaliana. Proc Natl Acad Sci USA 99: 13627-13632
Sinha S, Blanchette M, Tompa M (2004) PhyME: a probabilistic algorithm for finding motifs in sets of orthologous sequences. BMC Bioinformatics 5: 170
The Toxicogenomics Research Consortium (2005) Standardizing global gene expression analysis between laboratories and across platforms. Nat Methods 2: 351-356
Thijs G, Marchal K, Lescot M, Rombauts S, De Moor B, Rouzé P, Moreau Y (2002) A Gibbs sampling method to detect over-represented motifs in upstream regions of coexpressed genes. J Comput Biol 9: 447-464
Tompa M, Li N, Bailey TL, Church GM, De Moor B, Eskin E, Favorov AV, Frith MC, Fu Y, Kent WJ, et al (2005) Assessing computational tools for the discovery of transcription factor binding sites. Nat Biotechnol 23: 137-144
Town CD, Cheung F, Maiti R, Crabtree J, Haas BJ, Wortman JR, Hine EE, Althoff R, Arbogast TS, Tallon LJ, et al (2006) Comparative genomics of Brassica oleracea and Arabidopsis thaliana reveal gene loss, fragmentation and dispersal after polyploidy. Plant Cell 18: 1348-1359
Vision T, Brown DG, Tanksley SD (2000) The origins of genomic duplications in Arabidopsis. Science 290: 2114-2117
Wang T, Stormo GD (2003) Combining phylogenetic data with co-regulated genes to identify regulatory motifs. Bioinformatics 19: 2369-2379
Wasserman WW, Palumbo M, Thompson W, Fickett JW, Lawrence CE (2000) Human-mouse genome comparisons to locate regulatory sites. Nat Genet 26: 225-228
Windsor AJ, Schranz ME, Formanova N, Gebauer-Jung S, Bishop JG, Schnabelrauch D, Kroymann J, Mitchell-Olds T (2006) Partial shotgun sequencing of the Boechera stricta genome reveals extensive microsynteny and promoter conservation with Arabidopsis. Plant Physiol 140: 1169-1182
Yang WC, Ye D, Xu J, Sundaresan V (1999a) The SPOROCYTELESS gene of Arabidopsis is required for initiation of sporogenesis and encodes a novel nuclear protein. Genes Dev 13: 2108-2117
Yang YW, Lai KN, Tai PY, Li WH (1999b) Rates of nucleotide substitution in angiosperm mitochondrial DNA sequences and dates of divergence between Brassica and other angiosperm lineages. J Mol Evol 48: 597-604
Zhang X, Wessler SR (2004) Genome-wide comparative analysis of transposable elements in the related species Arabidopsis thaliana and Brassica oleracea. Proc Natl Acad Sci USA 101: 5589-5594
Zhang Z, Gerstein M (2003) Of mice and men: phylogenetic footprinting aids the discovery of regulatory elements. J Biol 2: 11