Thousands of cis-regulatory sequence combinations are shared by Arabidopsis and Poplar

The identification of cis-regulatory modules can greatly advance our understanding of gene regulatory mechanisms. Despite the existence of binding sites of more than three transcription factors in a cis-regulatory module, studies in plants often consider only the co-occurrence of binding sites of one or two transcription factors. In addition, cis-regulatory module studies in plants are limited to combinations of only a few families of transcription factors. It is thus not clear how widespread plant transcription factors work together, which transcription factors work together to regulate plant genes, and how the combinations of these transcription factors are shared by different plants. To fill these gaps, we applied a frequent pattern mining based approach to identify frequently used cis-regulatory sequence combinations in the promoter sequences of two plant species, Arabidopsis thaliana and Populus trichocarpa . A cis-regulatory sequence here corresponds to a DNA motif bound by a transcription factor. We identified 18638 combinations composed of 2 to 6 cis-regulatory sequences that are shared by the two plant species. In addition, with known cis-regulatory sequence combinations, gene function annotation, gene expression data, and known functional gene sets, we shown that the functionality of at least 96.8% and 65.2% of these shared combinations in Arabidopsis are partially supported, under a false discovery rate of 0.1 and 0.05, respectively. Finally, we discovered that 796 of the 18638 combinations might relate to functions that are important in bioenergy research. Our work will facilitate the study of gene transcriptional regulation in plants.


INTRODUCTION
Identifying cis-regulatory modules (CRMs) is important for the understanding of gene transcriptional regulation (Singh, 1998;Yuh et al., 1998;Blanchette et al., 2006;Hu et al., 2008;Cai et al., 2010). CRMs are short DNA regions of a few hundred base pairs (bp) that contain multiple transcription factor binding sites (TFBSs). It is estimated that there are at least five times more CRMs than genes in high eukaryotic species (Davidson, 2006). In addition, in high eukaryotes such as metazoans, CRMs instead of individual TFBSs often determine the spatial and temporal expression patterns of neighboring genes (Singh, 1998;Yuh et al., 1998). Therefore, identification of CRMs is crucial for studying gene transcriptional regulation.
In past decades, many studies, both experimental and computational, have identified CRMs in animals (Yuh et al., 1998;Kel-Margoulis et al., 2000;Loots et al., 2000;Frith et al., 2001; although a plant CRM may consist of TFBSs of 3 or more TFs (Baudry et al., 2004;Kim et al., 2004;Akyildiz et al., 2007;Kawashima et al., 2009;Wang et al., 2010). In addition, these studies often identify only individual CRMs for individual genes, and only consider a limited number of TFs (Steffens et al., 2005). Furthermore, compared with a large number of computational CRM studies on animals mentioned above, only a couple of computational CRM studies in plants exist.
These studies often focus on the identification of 2 to 3 TFs that are potentially working together to regulate target genes (Steffens et al., 2005;Vandepoele et al., 2006;Chang et al., 2008;Michael et al., 2008). Therefore, it is not clear how widespread CRMs are in plants, which multiple TFs likely coordinate the transcriptional regulation of their target genes, and how these TF combinations are shared by different plant species.
To fill these gaps in our knowledge about plant CRMs, we developed computational approaches to study plant CRMs in the upstream 1 kilo base (kb) regions of all genes in two plant species, Arabidopsis thaliana (Arabidopsis) and Populus trichocarpa (poplar) by using cis-regulatory sequences from the PLACE database (Higo et al.,1999). These cis-regulatory sequences in the PLACE database are normally called motifs, each of which represents the common pattern of TFBSs bound by a TF. In this paper, we called them cis-regulatory sequences instead of motifs because a PLACE motif often has only a few experimentally verified TFBSs and we often do not know the TF behind a PLACE motif. We found that 18638 cis-regulatory sequence combinations composed of 2 to 6 cis-regulatory sequences are shared by Arabidopsis and Poplar. In addition, we discovered that a large number of these shared combinations in Arabidopsis are partially supported by various sources of functional evidence. The developed methods and the predicted shared cis-regulatory sequence combinations will facilitate future CRM studies in plants.
sequence is a motif in the PLACE database (Higo et al., 1999), which is represented as consensus sequence such as RTACGTGGCR. It is also worth pointing out that scanning 1kb long sequences with cis-regulatory sequences will result in many false positive cis-regulatory sequence instances, which we filtered below by considering co-occurrence of these instances in a large number of sequences and by assessing the statistical significance of this co-occurrence.
Second, we identified individual cis-regulatory sequence combinations composed of multiple cis-regulatory sequences, whose instances co-occurred in upstream 1kb sequences of at least 100 genes in one species. Third, we assessed the statistical significance of these combinations by the Poisson clumping heuristic approximation (Aldous, 1989). Finally, significant cis-regulatory sequence combinations were reported as the final predicted combinations. The upstream 1kb sequences containing instances of a cis-regulatory sequence combination were considered as CRMs of this combination and the genes containing CRMs of a combination were taken as target genes of this combination. See Material and Method Section for details.
In total, we identified 51176 cis-regulatory sequence combinations in 92.2% of Arabidopsis genes. These combinations in Arabidopsis consist of 2 to 6 cis-regulatory sequences, and 3.88 cis-regulatory sequences on average ( Figure 2). Similarly, we identified 58031 cis-regulatory sequence combinations in 98.2% of genes in poplar. These poplar combinations consist of 2 to 6 cis-regulatory sequences, and 3.91 cis-regulatory sequences on average ( Figure 2). Among all these cis-regulatory sequence combinations, 18638 combinations are shared by Arabidopsis and poplar. Here a shared combination has exactly the same cis-regulatory sequence combination in the two species, but the majority of its target genes in Arabidopsis may be not orthologous to its target genes in poplar. In fact, for a shared cis-regulatory sequence combination, on average, only 33.3% of its target genes in Arabidopsis have defined orthologous genes in poplar in the Inparanoid database (Ostlund et al., 2009). In addition, for a shared combination, among the target genes with orthology information in the two species, no more than 13.6% of its target genes in Arabidopsis are orthologous to its target genes in poplar. These percentages could be higher, since we have no orthology information for the majority of genes in the two species.
Because we identified these combinations in the two species separately, the 18638 shared cisregulatory sequence combinations are likely biologically meaningful. See the supplemental file S1 for all 18638 combinations and their target genes.

Shared cis-regulatory sequence combinations are supported by known combinations
To assess the reliability of the above 18638 shared cis-regulatory sequence combinations, we compared these combinations with known cis-regulatory sequence combinations. Many experimental studies have identified co-occurring TFBSs of a pair of interacting TFs or individual TFs with multiple DNA binding domains in plants (Solano et al., 1995;Zhang et al., 1995;Chen et al., 1996;Davies et al., 1996;Singh, 1998;Chen and Singh, 1999;Kagaya et al., 1999;Yanagisawa and Schmidt, 1999;Nagano et al., 2001;Abe et al., 2003;Baudry et al., 2004;Konishi and Yanagisawa, 2010). In Table 1, we list seven well known cis-regulatory sequence combinations. TF or TFBS combinations were considered as cis-regulatory sequence combinations here since each TF or TFBS corresponds to a cis-regulatory sequence. There are other known TF combinations as well. However, cis-regulatory sequence information for known TF combinations is not always available. In case that there are known cis-regulatory sequences for TFs, it is difficult to match these cis-regulatory sequences with the PLACE cis-regulatory sequences we used due to different degrees of mismatches between the cis-regulatory sequences.
Therefore, we only used the seven known cis-regulatory sequence combinations in Table 1.
We found that our approach predicted all seven known combinations. For each known combination, there are two or more predicted cis-regulatory sequence combinations that contain this known combination. For instance, for the known bHLH-MYB (MYC-MYB) combination, 32 predicted cis-regulatory sequence combinations contain two motifs from the bHLH and MYB families, respectively. Multiple predicted cis-regulatory sequence combinations are consistent with a known combination because different TFs in the same family may bind different motifs.
In addition, additional cis-regulatory sequences together with the same known combinations form different predicted cis-regulatory sequence combinations, which regulate different target genes. For instance, M9269 (composed of three cis-regulatory sequences AAATTAACCAA Since the predicted cis-regulatory sequence combinations include additional cis-regulatory sequences, we further investigated whether there is functional evidence to support the addition of other cis-regulatory sequences. We found that this is indeed the case for at least several predicted combinations. For instance, two predicted combinations, M9269 and M9274, contain the known bHLH-MYB combination. Besides the two PLACE motifs SORLIP3AT and LBOXLERBCS that correspond to a bHLH TF and an MYB TF, M9269 and M9274 contain an extra cisregulatory sequence TACGTGTC (ABREMOTIFAOSOSEM) and CAATNATTG (ATHB5ATCORE), respectively. The bHLH and MYB function as transcriptional activators in abscisic acid (ABA) signaling (Abe et al., 2003). The extra motif ABREMOTIFAOSOSEM in M9269 is also related to ABA signaling (Hobo et al., 1999). Similarly, the TF that binds the extra motif ATHB5ATCORE in M9274 is a homeodomein-leucine zipper protein, ATHB5 (Johannesson et al., 2001), which relates to the ABA signaling as well (Johannesson et al., 2003). Therefore, the function of two extra ci-regulatory sequences is consistent with that of the known combinations, both of which relate to ABA signaling.
Besides predicting known cis-regulatory sequence combinations, we also predicted known target genes as target genes of the corresponding predicted combinations. For 6 out of 7 known combinations in Table 1, we predicted at least one known target gene as a target gene of multiple corresponding predicted combinations. Note that not all known target genes are included in our prediction. For instance, AT1G61720 is a target gene of the known combination bHLH and MYB (Debeaujon et al., 2003), and was not predicted as a target gene of any corresponding predicted combination. The lack of known target genes could be due to the limited number of known motifs in the PLACE database (only 469 motifs) and upstream sequences (1kb) used. It could also be a result of the choice of parameters used in the MOPAT software (Hu et al., 2008). See Table 1 and the supplemental file S2 for the target genes and the details of comparisons of predicted cis-regulatory sequence combinations with known combinations.

combinations
To investigate whether the 18638 shared cis-regulatory sequence combinations make sense, we also performed GO enrichment analysis (Boyle et al., 2004) on the target genes of each of the 18638 cis-regulatory sequence combinations in Arabidopsis. The GO enrichment analysis is to assess whether a group of genes significantly shares any function, where a function is described by a standard GO term. That is, we want to check whether the target genes of a cis-regulatory sequence combination significantly share any function, represented by GO terms here, when compared with all annotated genes in Arabidopsis. We did GO enrichment analysis for each cisregulatory combination only in Arabidopsis because the Arabidopsis genome is much better annotated than the poplar genome.
By performing GO enrichment analysis with the tool GOTermFinder (Boyle et al., 2004), we found that target genes of 58.1% and 77.5% of cis-regulatory sequence combinations in Arabidopsis significantly share GO terms, with a false discovery rate (FDR) of 0.05 and 0.1, respectively. Below, we provide two examples of the predicted combinations with literature support of their functions. Researchers with interest in specific genes will find more supported cis-regulatory sequence combinations in the supplemental file S1. Example 1. The cis-regulatory sequence combination M3121 is composed of three PLACE cisregulatory sequences, TTGCATGACT (SORLREP5AT), ATAAAACGT (SORLREP2AT), and TAAAAGTTAAAAAC (BOX1PVCHS15). SORLREP5AT and SORLREP2AT are known to be phyA-repressed motifs (Hudson and Quail, 2003) and phyA is critical for red or far-red light photoreceptor activities (Dehesh et al., 1993). BOX1PVCHS15 is known to be bound by GT-1 (Villain et al., 1996) and GT-1 is also related to red light and far-red light response (Gilmartin and Chua, 1990;Escobar et al., 2004). Thus, this cis-regulatory sequence combination M3121 is important for the red or far-red light photoreceptor activity and all three cis-regulatory sequences in this combination may interact to regulate their 187 target genes. We found the following GO terms significantly shared by the target genes of this predicted combination: GO:0009883 (red or far-red light photoreceptor activity, p-value 8.47E-05) and GO:0008020 (G-protein coupled photoreceptor activity, p-value 1.68E-04). These GO terms of target genes are consistent with the function of the three cis-regulatory sequences in this combination, which supports the functionality of this cis-regulatory sequence combination.
It has been shown that both CNX1 and HY5 mediate the development of plants in the light (Mangeon et al., 2011). Thus, the function of the two cis-regulatory sequences in this combination is consistent and the two cis-regulatory sequences in this combination may interact to regulate their 245 target genes. We found that the following GO terms are significantly shared by the target genes of this combination: GO:0019684 (photosynthesis, light reaction, p-value 2.31E-06), GO:0015979 (photosynthesis, p-value 1.41E-08), and GO:0010207 (photosystem II assembly, p-value 6.36E-05). These functions of target genes are consistent with that of cisregulatory sequences in this combination, which supports the functionality of this combination.

Co-expressed gene clusters and known functional gene sets in literature supports predicted cis-regulatory sequence combinations shared between Arabidopsis and poplar
In addition to GO enrichment analysis, we also compared target genes of the 18638 shared cisregulatory sequence combinations with co-expressed gene clusters in Arabidopsis. Since coexpressed target genes are often co-regulated (Allocco et al., 2004), the significant overlap of target genes of a cis-regulatory sequence combination with co-expressed gene clusters supports the functionality of the predicted combinations as well. We found that for 954 of the 18638 shared cis-regulatory sequence combinations, under a FDR of 0.05, their target genes significantly overlap with at least one co-expressed gene cluster from 8 microarray datasets.
When we changed the FDR to 0.10, we found that target genes of 14910 combinations significantly overlap with at least one co-expressed gene cluster. The co-expressed gene clusters were listed in the supplemental file S5. These 954 and 14910 cis-regulatory sequence combinations could be obtained by sorting corresponding columns in the supplemental file S1.
In addition, we compared target genes of the 18636 shared cis-regulatory sequence combinations with all gene sets mentioned in the poplar genome paper (Tuskan et al., 2006) and are available in the literature. In total, we obtained six and five gene sets in Arabidopsis and in poplar, respectively ( Table 2, row 5 to 15). Each of these eleven gene sets contains genes of a specific function that are important for the production of bioenergy. We found that each of the eleven gene sets enriched the target genes of several cis-regulatory sequence combinations. In total, we obtained 796 shared cis-regulatory sequence combinations that are likely to regulate genes in these functional gene sets, which are important for bioenergy research. For instance, target genes of the cis-regulatory sequence combination M164 significantly overlap with the cell wall formation gene set in Arabidopsis. This combination is composed of cis-regulatory sequences CCCACCTACC (ACIPVPAL2) and TCCATGCAT (SPHCOREZMC1). The motif ACIPVPAL2 interacts directly with the PAL2 gene (Phenylalanine ammonia-lyase gene family). The PAL2 gene is involved with phenylpropanoid biosynthesis pathway (Hatton et al., 1995). C1, one of the target genes of the motif SPHCOREZMC1, relates to the cell wall formation process (Suzuki et al., 1997;Yamaguchi et al., 2000). Therefore, the two cis-regulatory sequences corresponding to the two motifs both relate to the cell wall formation process, which conforms to the function of genes in this gene set. The supplemental file S3 provides the details of shared cis-regulatory sequence combinations with target genes significantly overlapping with these eleven gene sets.

DISCUSSION
In this study, we identified 18638 cis-regulatory sequence combinations shared by Arabidopsis and poplar. By comparing these shared combinations with known cis-regulatory sequence combinations, analyzing functional similarity of target genes, comparing target genes with coexpressed gene clusters, and overlapping target genes with known functional gene sets, we have shown that various sources of evidence support the functionality of more than 96.8% and 65.2% of the 18638 shared cis-regulatory sequence combinations in Arabidopsis, under a FDR of 0.05 and 0.1, respectively. All 18638 combinations, the sources of functional evidence supporting their functions, and their target genes are in the supplemental files S1, S2, and S3.
We compared the 18638 shared cis-regulatory sequence combinations with seven categories of known combinations. Although all seven categories of known combinations were included in our predictions, not all known specific TF-TF interactions of each type were predicted. For instance, the TF combination OBP1-OBF (Dof-bZIP) has been reported by several papers (Zhang et al., 1995;Chen et al., 1996;Yanagisawa and Schmidt, 1999). However, we did not find this specific combination in our predicted combinations that contain cis-regulatory sequences corresponding to bZIP and Dof. The missed specific TF-TF combinations could be due to the reason that we removed many cis-regulatory sequence instances and putative cis-regulatory sequence combinations as well to control false positives.
For the 18638 cis-regulatory sequence combinations, on average, about two-thirds of their target genes in the two plant species have no orthologous gene information. The incompleteness of orthologous gene information, together with the limited number of known cis-regulatory sequences and the short non-coding sequences around each gene considered, results in fewer than 13.8% of Arabidopsis target genes with defined poplar orthologs orthologous to the corresponding poplar target genes, for each of the 18638 cis-regulatory sequence combinations.
However, we doubt that the percentage of orthologous target genes for most of the shared combinations will be increased significantly with more information, given the long divergence time of the two species. In fact, previous studies have confirmed that substantial changes occur between the target genes of the same TFs (TF combinations) between even closely related species (Borneman et al., 2007;Tuch et al., 2008;Perez and Groisman, 2009).
In each of the 18638 predicted cis-regulatory sequence combinations, there are two or six cisregulatory sequences. Note that the same sequences may be included in a cis-regulatory sequence combination multiple times, if there are multiple cis-regulatory sequences in the PLACE database corresponding to the same TFs. For instance, two cis-regulatory sequences, GCCACGTGGc (ACGTROOT1) and GCCACGTGGg (ABREAZMRAB28), are likely bound by the same TF. We found instances of these two cis-regulatory sequences co-occur in 221 1kb long upstream sequences. Thus, we predicted a cis-regulatory sequence combination, M103, which consists of the two cis-regulatory sequences. It is worth mentioning that similar cisregulatory sequences occurring in the same combination is due to their multiple occurrences in 1kb long sequences instead of the overlap of their instances, since overlapping instances were removed from the beginning (See Materials and Methods). Because of filtering overlapping instances in our method, it is also evident that there could be more TFs in a plant TF combination.
Among the 18638 shared cis-regulatory sequence combinations, 796 combinations are likely to regulate genes that are important in bioenergy research (Table 2). These genes are known to relate to cell wall formation, secondary metabolism, phytohormones, membrane transporter, and disease resistance (Tuskan et al., 2006). We found enrichment with genes related to bioenergy production among the target genes of 796 shared cis-regulatory sequence combinations. We have also shown that cis-regulatory sequences in some combinations are consistent with these functions. These 796 predicted cis-regulatory sequence combinations will be useful for generating hypotheses for future bioenergy research.
Our method is useful for genome-wide studies of gene transcriptional regulation in plants. To our knowledge, previous computational methods for plant CRM identification are often restricted to a small group of genes (Vandepoele et al., 2006) or require TFBS co-occurrence in short regions There are several limitations in our current study. We only considered instances of cis-regulatory sequences in the upstream 1kb of all genes in a species here. It has been shown, however, that TFBSs can be anywhere in the non-coding sequences (Blanchette et al., 2006). In future studies, we should extend our method to include the entire non-coding regions of a plant genome (Cai et al., 2010). In addition, we could consider plant motifs in other databases (Wingender et al., 1996;Lescot et al., 2002;Sandelin et al., 2004) in addition to the PLACE database. This is feasible because the main component of our method, the frequent pattern mining technique, has been used to effectively deal with a much larger number of different items (cis-regulatory sequences) in transactions (upstream sequences) in the research field of databases (Han et al., 2000;Grahne and Zhu, 2005). Finally, we only examined the shared combinations in two plant species here.
With more genomic data such as gene expression data and epigenetic data in Arabidopsis and poplar, we could consider the function of other combinations in individual species as well.
Our study provides a useful method for the study of gene transcriptional regulation in plants. A software package based on the developed method will be freely available at http://www.cs.ucf.edu/~xiaoman/arabpoplar/webserver. This software package can be readily applied to a genome-wide study of any sequenced plant species. Our study also supplies a valuable resource of predicted CRMs and cis-regulatory sequence combinations in Arabidopsis and poplar. Besides the prediction provided in the supplementary files, we also provided a web server at http://www.cs.ucf.edu/~xiaoman/arabpoplar/webserver for researchers to extract TF combinations they are interested in. The predicted CRMs and shared cis-regulatory sequence combinations in two species, despite the support of many sources of evidence, needs to be experimentally validated. We expect that the application of next generation sequencing techniques to plant genomes will greatly accelerate the validation process and facilitate their use in plant and bioenergy research.

Plant motifs, genes, upstream sequences, and orthology information
A set of 469 plant motifs were obtained from the PLACE database (Higo et al., 1999) (Supplemental file S4). Detailed description about these PLACE motifs can be retrieved at (http://www.dna.affrc.go.jp/PLACE/info.html). These motifs are represented as consensus sequences, in which the most frequent nucleotide(s) at each position of the sequences is given.
For instance, the motif for the sequence bound by CBF2 is CCACGTGG. We called these PLACE motifs "cis-regulatory sequences" in this paper.
We obtained all genes and their upstream 1kb sequences relative to the gene transcriptional start sites in Arabidopsis and poplar from the ENSEMBL Plant (http://plants.ensembl.org/index.html), where the TAIR10 release and the JGI2.0 were used for these two species. In total, we obtained 33518 and 41365 genes and their upstream sequences in Arabidopsis and poplar, respectively.
Repeats in the upstream sequences were then masked by using RepeatMasker (http://www.repeatmasker.org/). We obtained all pairs of orthologous genes in Arabidopsis and poplar from the Inparanoid database (Ostlund et al., 2009), from which 10499 pairs of orthologous genes are defined in the two species. These pairs of orthologous genes are composed of 9707 Arabidopsis genes and 9941 poplar genes.

Gene expression data and co-expressed clusters
We obtained gene expression data on the GPL198 platform (Affymetrix Arabidopsis ATH1 Genome Array) from the GEO database (Edgar et al., 2002). Data from the GPL198 platform was used because this microarray platform has the largest number of samples in Arabidopsis. We further filtered data series that have fewer than 10 samples or have not been normalized by the RMA algorithm (Irizarry et al., 2003). We required at least 10 samples in each dataset to ensure more reliable co-expressed gene clusters. The requirement of normalization by RMA will be helpful for others to repeat this study. In this way, we obtained 15 datasets and 292 samples.
Because of our interest in development and bioenergy, we further chose time series data and stress resistance related data from the 15 datasets. In total, we obtained 8 datasets and 188 samples. The series number of these datasets and the number of samples is listed in supplemental file S5. For each dataset, we calculated the pair wise Pearson's correlation coefficient for each gene pair and performed hierarchical clustering with complete linkage to obtain co-expressed clusters, pairs of genes in each of which have a Pearson's correlation coefficient larger than 0.6 or smaller than -0.6. In total, we obtained 2461 clusters from the eight GEO datasets (Supplemental file S5). In addition to using 0.6 as the correlation coefficient cutoff, we also obtained clusters based on the cutoffs 0.7 and 0.8. The results from different expression correlation coefficient cutoffs were similar. We reported our results based on the cutoff 0.6 in this paper without specification.

Identification of cis-regulatory sequence combinations in 1kb long upstream sequences
Previously, we developed an efficient method (Cai et al., 2010) to identify motif combinations in a large number of sequences of 1kb length, by using 522 known vertebrate motifs from the TRANSFAC database (Wingender et al., 1996). We applied this method here to identify cis- regulatory sequence combinations in upstream 1kb sequences of all genes in Arabidopsis and poplar with slight modifications. The reason to modify the previous method is that the quality of plant motifs in the PLACE database (Higo et al., 1999) is lower than that in the TRANSFAC database, with a much fewer number of experimentally verified TFBSs for each motif and a consensus representation of motifs in PLACE. The consequence of scanning upstream sequences with low quality motifs is that many motifs occur in almost every upstream sequence even requiring exact match of consensus sequences. We thus called these plant motif consensuses "cis-regulatory sequences". The details of the method follow.
First, we defined a score cutoff for each plant cis-regulatory sequence. To define the score cutoff, we permutated all upstream sequences and then scanned these random sequences in each species separately, with each PLACE cis-regulatory sequence. For each PLACE cis-regulatory sequence, we calculated the score of each segment of length k in the random sequences, where k is the length of the PLACE cis-regulatory sequence. The score is calculated by the following formula: In this formula, p(b,i) is defined as With the score of each segment, we obtained the score distribution of random segments in each species. We then chose the 85% quartile of this distribution as the score cutoff of the corresponding PLACE cis-regulatory sequence in the corresponding species. That is, if a segment with the score larger than this cutoff, this segment is a potential cis-regulatory sequence instance of this PLACE cis-regulatory sequence. This 85% cutoff corresponds to at most one mismatch for 6-8 bp long PLACE cis-regulatory sequences.
Second, we defined cis-regulatory sequence instances of each PLACE cis-regulatory sequence in each upstream sequence. With the PLACE cis-regulatory sequences and their score cutoffs, we scanned each upstream sequence and obtained all segments with scores larger than the corresponding score cutoffs. These segments are called cis-regulatory sequence instances. Note that some cis-regulatory sequence instances overlap with each other. For those overlapping cis- regulatory sequences instances with starting positions shorter than 4 bp away from each other, we only kept the one with the largest score, as in the previous study (Hu et al., 2008). We also removed instances of cis-regulatory sequences whose instances occur in more than 50% of upstream sequences, as these cis-regulatory sequences are too degenerate to be identified in statistically meaningful cis-regulatory sequence combinations. The following cis-regulatory sequence combination analysis is based on the remaining cis-regulatory sequence instances.
Third, we identified cis-regulatory sequence combinations that occur in at least 100 upstream sequences. With cis-regulatory sequence instances in each upstream sequence, we applied frequent pattern mining techniques (Han et al., 2000;Grahne and Zhu, 2005) to identify cisregulatory sequence combinations whose cis-regulatory sequence instances co-occur in at least 100 upstream sequences. The frequent pattern mining techniques were originally developed as a database mining tool to identify frequent combination of patterns in databases (Han et al., 2000).
Consider a database containing customer purchase information. With what frequency will a customer who buys item A also buys item B, C, …? The frequent pattern mining techniques can efficiently solve this type of problems. In our case, the 469 PLACE cis-regulatory sequences correspond to items and upstream sequences correspond to customers. The cis-regulatory sequence instances in an upstream sequence are the exact items this sequence contains.
Fourth, we identified significant cis-regulatory sequence combinations by using the Poisson clumping heuristic (Aldous, 1989 . Therefore, the p-value that this combination occurs in g sequences is where C(N1,i) is defined as the frequent cis-regulatory sequence combinations output at the third step. For each selected cisregulatory sequence combination, the upstream sequences containing instances of all cisregulatory sequences in this combination are defined as CRMs of this combination. Similarly, genes corresponding to these upstream sequences are defined as the target genes of this cisregulatory sequence combination.

P-value calculation for significant overlap between target genes of a cis-regulatory sequence combination and a gene set of special interest
For every cis-regulatory sequence combination, we calculated the significance of overlap between its target genes and each co-expressed gene cluster obtained from the 8 microarray expression datasets, by using the hypergeometric test. Assume the set of all genes that are included in the GPL198 platform is S and |S|, the number of genes in S, is equal to N. Given a target gene set S 1 of a cis-regulatory sequence combination and a co-expressed gene cluster S 2 , assume the number of genes in the intersection of the three sets S, S 1 , and S 2 is   T  a  b  l  e  2  .  F  u  n  c  t  i  o  n  a  l  e  v  i  d  e  n  c  e  f  o  r  t  h  e  1  8  6  3  8  s  h  a  r  e  d  c  i  s  -r  e  g  u  l  a  t  o  r  y  s  e  q  u  e  n  c  e  c  o  m  b  i  n  a  t  i  o  n  s  .   F  u  n  c  t  i  o  n  a  l  E  v  i  d  e  n  c  e  s  o  u  r  c  e  (  I  D  )  #  S  i  g  n  i  f  i  c  a  n  t  c  o  m  b  i  n  a  t  i  o  n