Expansion and diversification of the Populus R2R3-MYB family of transcription factors

The R2R3-MYB proteins comprise one of the largest families of transcription factors in plants. R2R3-MYB family members regulate plant-specific processes, such as the elaboration of specialised cell types, including xylem, guard cells, trichomes and root hairs, and the biosynthesis of specialised branches of metabolism, including phenylpropanoid biosynthesis. As such, R2R3-MYB family members are hypothesised to contribute to the emergence of evolutionary innovations that have arisen in specific plant lineages. As a first step in determining the role played by R2R3-MYB family members in the emergence of lineage-specific innovations in the genus Populus , the entire Populus trichocarpa R2R3-MYB family was characterised. The Populus R2R3-MYB complement is much larger than that found in other angiosperms with fully sequenced genomes. Phylogenetic analyses, together with chromosome placement, showed that the expansion of the Populus R2R3-MYB family was not only attributable to whole genome duplication, but also involved selective expansion of specific R2R3-MYB clades. Expansion of the Populus R2R3-MYB family prominently involved members with expression patterns that suggested a role in specific components of Populus life history, including wood formation and reproductive development. An expandable compendium of microarray-based expression data (PopGenExpress) and associated web-based tools were developed to better enable within- and between-species comparisons of Populus R2R3-MYB gene expression. This resource, which includes intuitive graphic visualisation of gene expression data across multiple tissues, organs and treatments, is freely available to, and expandable by, scientists wishing to better understand the genome biology of Populus , an ecologically dominant and economically important forest tree genus.

Given the multiplicity of plant-specific processes controlled by R2R3-MYB transcription factors, it has been postulated that the elaboration of the R2R3-MYB family may account for some of the evolutionary innovations that contribute to plant diversity (Riechmann et al., 2000). In A. thaliana, members of the R2R3-MYB family of transcription factors have been identified and their targets and expression patterns are being characterised (Kranz et al., 1998;Meissner et al., 1999;Stracke et al., 2001). However, A. thaliana is an herbaceous, annual plant, which is not representative of a large number of important plant species, particularly long-lived woody perennial plants, like forest trees.
Trees are distinct from herbaceous species in many ways: they have a self-supporting structure, the secondary growth or wood, and they have a much longer lifespan (Bradshaw et al., 2000;Brunner et al., 2004). The regulatory networks and molecular mechanisms that underlie these unique properties cannot be investigated through the examination of non-tree species, thus demonstrating the importance and opportunity presented by the recent publication of the complete genome sequences for two wood-forming species -Populus trichocarpa (Tuskan et al., 2006), and Vitis vinifera (Velasco et al., 2007). Molecular evidence suggests that P. trichocarpa and A. thaliana shared their last common ancestor some 100 to 120 million years ago (Tuskan et al., 2006). Since then, A. thaliana and P. trichocarpa have evolved different life histories, including herbaceous versus arboreal development, annual versus perennial habit, and selfing versus outcrossing mating strategies. In both plants and animals, the evolution of such diversity has been hypothesised to be the consequence of changes in -6www.plantphysiol.org on August 29, 2017 -Published by Downloaded from Copyright © 2008 American Society of Plant Biologists. All rights reserved. the expression pattern of the genes encoding transcription factors, and/or changes in the function of the transcription factors themselves (Doebley and Lukens, 1998;Weatherbee et al., 1998). The elaboration of the R2R3-MYB transcription factor family is postulated to account for some of the evolutionary innovations that contribute to plant diversity (Jin and Martin, 1999). Here we explore the contribution of R2R3-MYB proteins in the diversification of plant form and function, through an analysis of the entire R2R3-MYB family encoded in the Populus genome.

RESULTS AND DISCUSSION
The Populus R2R3-MYB family is larger than that of other sequenced dicotyledonous angiosperms The Populus R2R3-MYB family was compared to the corresponding families from the woody perennial Vitis vinifera, which is a sister taxon to Populus in Eurosids I, and the more distantly related A. thaliana, which is a member of Eurosids II. The predicted R2 and R3 MYB repeats of the MYB DBD are highly conserved across plant lineages (Jin and Martin, 1999). Using the conserved amino-terminal DBD as the defining feature, 192 genes in the P. trichocarpa genome were annotated as encoding R2R3-MYB proteins, and five genes were annotated as encoding 3R-MYB proteins (Table I, Supplemental Table S1). Like their counterparts in other plant species, the R2 and R3 MYB repeats of the P. trichocarpa R2R3-MYB family contain characteristic amino acids, including a series of evenly distributed and highly conserved tryptophan residues that play a role in sequence-specific binding of DNA (Ogata et al., 1995;Stracke et al., 2001) (Fig. 1). In keeping with this, the phylogeny constructed with all predicted Populus 3R-MYB and R2R3-MYB proteins, plus all A. thaliana and V. vinifera family members (Fig. 2), is highly similar to a previously published phylogeny which included all known A. thaliana and Oryza sativa R2R3-MYB proteins (Jiang et al., 2004). Most of the subgroups described in this previously published phylogeny, defined on the basis of their A. thaliana complement, are maintained in the current phylogeny, with most subgroups merely expanding to include both Populus and Vitis R2R3-MYB members (Supplemental Table S2).
The P. trichocarpa genome encodes many more R2R3-MYB family members (192) than either A. thaliana (126) (Stracke et al., 2001;Durbarry et al., 2005) or Vitis vinifera (123). This expansion appears to be the result of multiple gene duplication processes including a whole-genome duplication event in the Populus lineage as well as multiple segmental and tandem duplication events (Tuskan et al., 2006). To examine the relative contribution of each of these factors in the -7 - expansion of the R2R3-MYB family, R2R3-MYB genes were electronically mapped to loci across all 19 linkage groups (LG). In total, 172 Populus R2R3-MYB genes were mapped, it was not possible to map the remaining 21 R2R3-MYB genes as they have not yet been assigned to any linkage group.
There are as many as 21 (LG_II) and as few as a single (LG_XVI) R2R3-MYB genes per linkage group. On average, there is one R2R3-MYB gene every 2.4Mb.
Expansion of the Populus R2R3-MYB family has occurred through multiple mechanisms, and resulted in lineage-specific differences in R2R3-MYB clades Evidence of the salicoid-specific whole genome duplication event in the Populus lineage (ca. 65 MYA) is present throughout the P. trichocarpa genome (Tuskan et al., 2006). Consistent with this, many predicted MYB-encoding genes have paralogous counterparts in syntenic regions of related linkage groups. For example, large sections of linkage groups II and V are thought to be the product of the salicoid-specific genome duplication (Tuskan et al., 2006). In keeping with this, three predicted MYB genes on linkage group II (LG_II) (PtrMYB109, PtrMYB189, and PtrMYB220) show conserved organization with three highly similar MYB genes on LG_V (PtrMYB030, PtrMYB158, and Tandem gene duplication has also played an important role in the elaboration of the R2R3-MYB gene family. More than 35% (68/192) of the R2R3-MYB-encoding genes in the P. trichocarpa genome are present as tandem repeats, where the gene duplications were directly adjacent to each other on a given linkage group, with no intervening annotated gene. Tandem repeats most commonly include duplicate or triplicate R2R3-MYB-encoding genes, but there is one instance of four tandem repeats (LG_XI) and one of six repeats (LG_II).
While the number of R2R3-MYB-encoding genes has expanded in Populus, the number of 3R-MYB genes has not. That is, A. thaliana (Stracke et al., 2001), V. vinifera and P. trichocarpa genomes all encode only five 3R-MYB proteins, despite the fact that they encode a widely varying number of R2R3-MYB proteins -125, 123 and 192 respectively (Supplemental Tables S1, S3, S4). The 3R-MYB gene typifies the vertebrate MYB family (Lipsick, 1996) and is present in all major plant lineages (Kranz et al., 2000;Ito, 2005;Haga et al., 2007)  proteins regulate progress through cell cycle transition (Ito et al., 2001;Okada et al., 2002;Ito, 2005;Haga et al., 2007). Though there appears to be partially redundant function within this family of genes in A. thaliana, loss-of-function mutations in 3R-MYB genes have been demonstrated to lead to incomplete cytokinesis (Haga et al., 2007). This suggests that 3R-MYB proteins fulfill a number of core cellular functions that are conserved across evolution, while the larger, more diverse family of R2R3-MYB proteins in plants suggests that they may play a role in generating phenotypic diversity within this kingdom.
Phylogenetic analysis of the predicted R2R3-MYB protein sequences revealed that there was not equal representation of Populus, Vitis and A. thaliana R2R3-MYB proteins within given clades ( Remarkably, several clades do not include any A. thaliana R2R3-MYB proteins, but only members from Populus and Vitis (Fig. 2). This suggests that the genes in these clades may have specialised roles that were either lost in A. thaliana or acquired in the Populus and Vitis lineages after divergence from the last common ancestor with A. thaliana. It remains to be determined whether the absence of A. thaliana genes in these clades extends to other members of Eurosids II, or whether it is something particular to the A. thaliana genome.

PopGenExpress, an Affymetrix microarray-based compendium of Populus transcript abundance, provides insights into R2R3-MYB gene expression and diversification
Affymetrix Poplar Genome Arrays were used to assess the transcript abundance of 180 R2R3-MYBencoding genes (Supplemental Table S1). There were no probe sets on the array corresponding to the remaining R2R3-MYB-encoding transcripts. Transcript abundance for the R2R3-MYB-encoding genes was assessed in biological triplicate RNA samples extracted from seedlings grown under different light regimes, young leaves, mature leaves, roots, xylem, female catkins, and male catkins. The compendium of data derived from these experiments is referred to as the Populus gene expression (PopGenExpress) dataset.
As is commonly the case for genes encoding transcription factors, many of the P. trichocarpa R2R3-MYB-encoding genes had low transcript abundance levels as determined by the Affymetrix microarray analysis. Nevertheless, distinct transcript abundance patterns were readily identifiable in the PopGenExpress dataset for all 180 of the R2R3-MYB probe sets on the microarray. Groups of MYB-encoding genes showed preferential accumulation of transcripts in a given organ or tissue, or under a specific condition (Fig. 4). In fact, the majority (75%) of Populus R2R3-MYB-encoding genes exhibited transcript abundance profiles that had marked peaks in transcript abundance in only one distinct condition in the current PopGenExpress dataset. This suggests that R2R3-MYB proteins function as regulators of processes that are limited to discrete cells, organs or conditions.
In keeping with their roles as regulators in plant specific processes, 23/180 Populus R2R3-MYBencoding genes showed the highest level of transcript abundance in differentiating xylem, a tissue that gives rise to the woody stem characteristic of trees like Populus. A further 29 R2R3-MYB genes (16%) had the greatest accumulation of transcripts in roots. Remarkably, none of these genes encoded a protein with high sequence similarity to WEREWOLF, which is involved in the In addition to groups of genes that had similar transcript abundance profiles but were relatively phylogenetically distinct, several phylogenetic clades were characterised by having members that largely shared the same transcript abundance profile. Similar transcript abundance patterns were observed even between A. thaliana and Populus members of the clade. Prominent amongst these clades were those that included R2R3-MYB-encoding genes related to those previously implicated in the regulation of phenylpropanoid metabolism.

Populus R2R3-MYB proteins related to those implicated in regulation of phenylpropanoid metabolism tend to share transcript abundance profiles
Many R2R3-MYB proteins modulate expression of genes encoding enzymes involved in various facets of phenylpropanoid metabolism. Phenylpropanoid metabolism generates a vast array of compounds that are important for a diversity of plant functions, including resistance to herbivore and pathogen attacks (Peters and Constabel, 2002;Miranda et al., 2007), and for the construction of structural components of the plant body (Boerjan et al., 2003;Deluc et al., 2006). Phenylpropanoid metabolism is particularly important in tree species, which make use of phenylpropanoid-derived -11 - Moreover, trees like Populus have a woody stem composed of up to 30% phenylpropanoid-derived lignins, which provide structural integrity, water transport capacity, and defence against degradation for this key portion of the tree body. In keeping with the importance of maintaining a diverse and robust phenylpropanoid metabolism in trees, genes encoding enzymes of phenylpropanoid metabolism are over-represented in the Populus genome relative to the genome A. thaliana (Tuskan et al., 2006). On the basis of the analysis of Populus genes predicted to encode R2R3-MYB proteins, the trend in diversifying genes encoding enzymes in the phenylpropanpoid pathway also seems to extend to genes encoding regulators of this pathway. In the present phylogeny, two V. vinifera proteins and four P. trichocarpa proteins group in the lignification-related R2R3-MYB clade. The P. trichocarpa genes encoding these proteins are located on LG_1 (PtrMYB002 and PtrMYB003) and on LG_IX (PtrMYB020 and PtrMYB021) in regions that are thought to be the paralogous product of the recent salicoid whole genome duplication event (Tuskan et al., 2006). The P. trichocarpa proteins are most similar to EgMYB2 and to the V. vinifera members of this clade. Three of the four Populus R2R3-MYB genes exhibit high levels of transcript accumulation in xylem tissue (Fig. 5b), suggesting that function in this tissue has been retained for most family members since the duplication event. The transcript abundance profile also suggests that, like their counterparts in other plant species, these transcription factors also function in xylem- and then moved into the light for 3h, and may therefore play a role in invoking anthocyanin biosynthesis in vegetative tissues to protect them from the deleterious effects of UV light (Fig. 5d).
PtrMYB113 shows low transcript accumulation throughout the tissues and organs examined, and may function to regulate anthocyanin biosynthesis in response to a stimulus that was not included in this study.
The genes encoding the proteins in the clade implicated in the regulation of anthocyanin biosynthesis have a distinctive arrangement the A. thaliana, V. vinifera, and P. trichocarpa genomes.
In all three organisms, the genes are present as tandem duplications that appear to have arisen since they last shared a common ancestor (Fig. 5c, Supplemental Fig. S1). Notably, the R3 MYB present at the 3' end of the second exon. Based on an alignment of the predicted mRNA molecules for the A. thaliana, V. vinifera, and P. trichocarpa genes in this clade, it does not appear that there has been an error in splice site prediction (Supplemental Fig. S2b). That is, the additional amino acids in the Populus R3 domain appear to be bona fide components of the motif. Given the role of this motif in protein-DNA interactions, the additional four amino acids in the Populus proteins could impact DNA binding. The fact that this motif is present in all six of the Populus members of the clade suggests that they are likely to bind to similar targets, but it remains to be determined if their bindingspecificity or selectivity differs from their A. thaliana counterparts. Future studies could examine this possibility, and use it as the basis to explore co-evolution of DNA-binding domains relative to their cognate cis-acting DNA binding sites.
The high degree of redundancy within those clades where genes share highly similar transcript abundance profiles is consistent with the hypothesis that sub-functionalisation has occurred, with individual family members assuming non-redundant functions in the same tissue. In some cases, sub-functionalisation may have resulted in one of the MYB proteins acting as an activator of gene expression and another acting as a repressor. Co-expression of such antagonistic pairs of MYB proteins can produce a "gearing mechanism" where the regulation of shared target genes is a function of the relative abundance of a strong activator relative to a weaker one (Moyano et al., 1996). Alternatively, the high degree of redundancy within such clades may reflect the fact that there has not been adequate time for selection to purge one of the paralogues. It must be noted that genetic distance between paralogous pairs of R2R3-MYB genes was not a good predictor of the Pearson correlation coefficient describing their expression profiles (Supplemental Fig. S3). Namely, paralogous pairs separated by shorter genetic distances were no more likely to be co-expressed than were pairs separated by longer genetic distances.

The Populus Electronic Fluorescent Pictograph browser facilitates rapid hypothesis testing related to MYB expression
Identification of Populus R2R3-MYB genes with transcript abundance patterns or gene products that are phylogenetically close to MYB proteins with known function from other species, provides candidates for future studies aimed at thoroughly dissecting MYB function. Such studies would benefit from tools that enabled more intuitive representations of transcript abundance patterns from the PopGenExpress compendium, so that rapid comparisons could be made to putative orthologues from better-characterised species such as A. thaliana. To enable community-wide, simple, graphical -14 -  R2R3-MYB-encoding genes, or, indeed, any other gene in Populus that is represented on the Affymetrix Poplar GeneChip. By pre-computing A. thaliana -Populus orthologues, we also enable the easy cross-species expression viewing to identify the true functional orthologue in Populus of a given A. thaliana gene, if it exists. The gene expression database is expandable, and will acquire greater value as other researchers add their GeneChip data to the core compendium we have established. This will add greatly to the expanding toolbox available to characterise the molecular mechanisms underpinning the basic biology of an ecologically dominant, and economically important tree genus.

Plant Material
Plant material was collected from clonally-propagated, 8-week-old Populus balsamifera saplings (Clone 1006, Alberta Pacific Forest Industries Inc., Boyle, AB, Canada) grown in a climate-controlled growth chamber (mature leaf, young leaf, roots, differentiating xylem), or from P. balsamifera seeds with a 16h photoperiod, a maximum daytime temperature of 22°C and a minimum nighttime temperature of 17°C. All tissues were collected 8h into the light phase during a 16h:8h light:dark cycle , except for the midnight samples, which were collected at 4h after the shift from light to dark in the same cycle,. All tissues were collected from saplings without water limitations. For the comparison of dark-grown seedlings versus those grown in the dark and then exposed to light for 3h, seeds were germinated in a dark, in a growth chamber on wetted filter paper in petri plates with a constant temperature of 21°C, and were grown in this way for five days. On the sixth day, half of the plates were exposed to light (150μmol) for three hours. At this time seedlings were collected from plates exposed to light and from plates in continuous darkness. All collected plant materials were flash frozen in liquid nitrogen upon harvesting and stored at -80°C until RNA was extracted.
Each sample comprises pooled material from three individuals, except for the seedling samples, which comprise pooled material from 20 individuals. Mature leaf samples comprise the first fully expanded leaf of three saplings, including the petiole. Young leaf samples include the first leaf, with its petiole, that was completely uncurled, but was still shiny and much smaller than the mature -16 - leaves. Root samples include the distal 15cm of well-rinsed roots mass. Differentiating xylem was collected from the bottom 6 inches of the stem by peeling off the bark and immediately scraping the surface of the exposed wood into micro collection tubes containing liquid nitrogen. Seedlings were grown for 5 days in the dark, on the sixth day, one half of the seedlings were transferred to the light and samples (comprising entire seedlings) were collected at 30 minutes after transfer to light. Male and female catkins were collected on April 30, 2007.  Table S3). Different identifiers have been used for AtMYB8 (At1g35515), AtMYB39 (At4g17785), and AtMYB118 (At3g27785)  identified 108 putative R2R3-MYB-encoding genes, careful scrutiny of the coding sequences revealed that 15 of these genes actually represent single repeat MYB proteins, and so were excluded from the analysis described herein. An additional 25 R2R3-and five 3R-MYB genes were identified on the IGGP's website using the search terms "R2R3" and "MYB transcription factor" (Supplemental Table S4).
Putative MYB genes in the P. trichocarpa genome were identified using the PAC v0.1 Ortholog Finder using A. thaliana R2R3-MYBs as reference (Supplemental Table S1). MYB gene models were identified using the KOG keyword MYB. All P. trichocarpa and V. vinifera models were manually inspected to ensure that the putative gene models contained two or three MYB DBDs, and that the gene models mapped to unique loci in their respective genomes. It is important to note that a proportion of the 21 P. trichocarpa R2R3-MYB genes that have not been assigned to a linkage group may be alleles of genes already placed on a linkage group, but as this has not yet been resolved empirically, these 21 genes were treated as distinct loci. An additional 22 R2R3-MYB transcription factors from other tree species were included in the phylogeny -7 from Pinus taeda, 13 from Picea glauca and two from Eucalyptus gunnii (Supplemental Table S5). A further 32 'landmark' MYB proteins from a variety of other organisms were also included in the analysis (Supplemental Table S6) (Jiang et al., 2004). Sequences were retrieved from NCBI protein database (NCBI Entrez, http://www.ncvi.nlm.nih.gov/sites/entrez).
Phylogenetic analysis included the above 506 R2R3-MYB proteins and 15 3R-MYB proteins. The full-length amino acid sequences were aligned using MAFFT using the G-INS-i algorithm (Katoh et al., 2005). A neighbour-joining tree was constructed using MEGA 4 (Tamura et al., 2007) with settings for Jones-Taylor-Thornton (JTT) substitution model and a Gamma parameter of 1.0 to allow for the uneven rates of substitution across the length of the protein, including the highly conserved amino terminal DNA binding domain and the more divergent carboxy-terminal activation domain.
Pairwise gap deletion was used to ensure that the carboxy-terminal residues, which are more divergent in MYB proteins, could contribute to the topology of the neighbour-joining tree.

RNA Extraction and Microarray Hybridisation
Plant material was ground to a fine powder under liquid nitrogen, and total RNA was extracted from each sample using the procedure described by Chang et al. (Chang et al., 1993)

Microarray Analysis
GeneChip data analysis was performed using the BioConductor suite (Gentleman et al., 2004) in R (R: A language and environment for statistical computing, http://www.R-project.org) (R Core Development Team, 2007) using the affy package (Gautier et al., 2004). Pre-processing of the arrays was done using GC-robust multiarray analysis (GCRMA) (Gentleman et al., 2004). Probe sets corresponding to the putative R2R3-MYB and 3R-MYB transcription factors were identified using the Probe Match tool on the NetAffx Analysis Centre website (Table 1) (Netaffx, http://www.affymetrix.com/analysis/index.affx). The transcript abundance levels for probe sets corresponding to the putative MYB genes in the P. trichocarpa genome were extracted from the complete data set and the probe sets were then clustered using heirarchical clustering based on Pearson correlation coefficients (Ploner, 2008).

Poplar eFP Browser
A Poplar eFP Browser (http://www.bar.utoronto.ca/efppop/cgi-bin/efpWeb.cgi) was developed based on the Bio-Array Resource database framework described in Toufighi et al. (Toufighi et al., 2005) and the eFP engine developed by Winter et al. (Winter et al., 2007). Briefly, database tables were generated using MySQL to archive meta-information for the samples used to produce the transcriptome data in this study, as well as to store the expression data themselves, after the MIAME convention (Brazma et al., 2001). This database may be queried through other tools described in Toufighi et al. (2005), such as Expression Browser, Project Browser, and Expression Angler for identifying co-expressed genes of interest (Toufighi et al., 2005). In the case of the poplar gene expression database, we normalized the Affymetrix data using the GCOS algorithm, with a TGT value of 500. Diagrammatic representations of poplar were generated and processed as described in Winter et al. (Winter et al., 2007) to create an input for the eFP Browser. In addition, an XML control file was created to instruct the eFP engine for proper functioning. Finally, A. thaliana -Populus orthologues based on the P. trichocarpa gene models were computed using the OrthoMCL algorithm (Li et al., 2003). These were stored in a separate database along with pre-computed A.
-19 -    with paralogous and syntenic genes on two LG. The groups indicated are not exhaustive, but serve to indicate the widespread evidence of the Salicoid-specific whole genome duplication event in the P. trichocarpa genome.   Supplemental Table S1: R2R3-and 3R-MYB encoding genes from Populus trichocarpa. When a MYB number appears to be absent, this has arisen from the fact that, when the Populus genome was annotated, some gene models were mis-annotated as R2R3-MYB-encoding genes, when they