Tobacco Transcription Factors: Novel Insights into Transcriptional Regulation in the Solanaceae.

Tobacco (Nicotiana tabacum L.) is a member of the Solanaceae, one of the agronomically most important groups of flowering plants. We have performed an in silico analysis of 1.15 million gene space sequence reads from the tobacco nuclear genome and report the detailed analysis of over 2,500 tobacco transcription factors (TFs). The tobacco genome contains at least one member of each of the 64 well-characterized TF families identified in sequenced vascular plant genomes, indicating that evolution of the Solanaceae was not associated with the gain or loss of TF families. However, we found notable differences between tobacco and non-Solanaceae species in TF family size and evidence for both tobacco- and Solanaceae-specific subfamily expansions. Compared to TF families from sequenced plant genomes, tobacco has a higher proportion of ERF/AP2, C2H2 zinc finger, homeodomain, GRF, TCP, zinc finger homeodomain, BES and SAP genes and novel subfamilies of BES, C2H2 zinc finger, SAP and NAC genes. The novel NAC subfamily, termed TNACS, appears restricted to the Solanaceae as they are absent from currently sequenced plant genomes but present in tomato, pepper and potato. They comprise about 25% of NAC genes in tobacco. Based on our phylogenetic studies we predict that many of the over 50 tobacco Group IX ERF genes are involved in jasmonate responses. Consistent with this, over two thirds of Group IX ERF genes tested showed increased mRNA levels following jasmonate treatment. Our data is a major resource for the Solanaceae and fills a void in studies of TF families across the plant kingdom.
[Figure caption, fragment] ... constructed using the Neighbor-Joining method. Each tobacco gene identified in the GSR dataset is designated by an arbitrary number. Six clades (numbered 1-6 in the figure) are found in tobacco and other plant species, and three clades, designated TNAC A to TNAC C, are found in tobacco and other Solanaceae. EST sequences from TNAC genes of potato, tomato and pepper included in tobacco gene ...

The Solanaceae (Nightshade Family) are one of the largest and most important families of flowering plants with over 3000 species. Many of its members are important crop plants, such as tomato, potato, aubergine and chilli pepper. Others are prized for their medicinal, poisonous or psychotropic effects and are the source of drugs such as atropine, scopolamine and hyoscyamine (Oksman-Caldentey, 2007). As a result, the Solanaceae have been the focus of considerable research including genome sequencing projects for both tomato and potato (Mueller et al., 2005;Mullins et al., 2006).
Regulation of gene expression at the level of transcription is a major control point in many biological processes and plant genomes devote approximately 7% of their coding sequence to transcription factors (TFs) (Udvardi et al., 2007). In plants, changes in transcription rates are seen as the plant grows and develops and also as the plant is required to respond to changes in the environment. A plant's ability to respond appropriately to these cues may ultimately influence the plant's chances of survival and affect yields in crop plants. Transcriptional regulation can determine a number of agronomically important traits and, therefore, studies of TFs form a major focus in plant biology (Richardt et al., 2007;Udvardi et al., 2007).
The first genome-wide analysis of TFs in any plant species was reported soon after the first plant genome sequence, that of Arabidopsis, was completed (Riechmann et al., 2000). This analysis revealed that Arabidopsis contains at least twenty-nine families of TFs with sixteen families appearing to have no counterpart in animals. However, this first analysis was incomplete due to the discovery of additional TF families and refinements in bioinformatic analysis used to find the genes. The Database of Arabidopsis TFs (DATF) (Guo et al., 2005) currently lists 64 families of TFs as present in Arabidopsis. All 64 of these families are also present in poplar (Zhu et al., 2007), while the monocot rice contains only 63 families (Gao et al., 2006), missing the SAP family that is represented by only a single gene in both of the dicot species. Recent large scale analyses of plant TFs have produced databases for TF studies across the entire plant kingdom (Richardt et al., 2007;Udvardi et al., 2007). These are excellent resources, but are largely restricted to sequences already present in public databases and contain no comprehensive analysis of a member of the Solanaceae.
Tobacco [Nicotiana tabacum L.] has been a model plant for decades and is one of the most studied higher plant species. Many resources are already available to facilitate functional genomics in tobacco including transformation and regeneration systems, reduced complexity cell
To study the TF gene families in more detail, phylogenetic trees were constructed with MEGA version 4 (Tamura et al., 2007) using the Neighbor-Joining method. In each case, the conserved domains were used to construct the phylogenetic trees. Two phylogenetic trees were produced for each TF family, one containing only the tobacco TF family members and the other containing the tobacco sequences and sequences of selected members of the Arabidopsis TF family as markers for specific subgroups. We used the Arabidopsis domains because they are the most complete and well-characterized data set. The latter phylogenies enable direct comparisons to be made between family members from the two species and allow the rapid identification of any novel lineages within the tobacco gene family. Differences represent components of potentially novel gene regulatory mechanisms and we predict that some of these TFs regulate Solanaceae- or tobacco-specific processes.
We analysed eight of the largest TF families in tobacco, the ERF, R2R3MYB, bHLH, NAC, homeodomain, MADS box, WRKY and bZIP families. These account for over a third of the total number of tobacco TFs. Our analyses showed that seven of these families appear similar in composition to those from other vascular plants, the one notable exception being the NAC family.
Despite broad overall similarity of these seven families, each provides insights into the evolution, subfamily distribution and possible functional relatedness of the gene family members. All trees were produced by the Neighbor-Joining method using MEGA version 4 (Tamura et al., 2007) and similar trees were produced using other methods.
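The Neighbor-Joining step used for these trees can be illustrated with a minimal, pure-Python sketch. It is not the MEGA 4 implementation: it uses a simple p-distance (fraction of differing aligned positions) as an assumed distance measure, returns only the tree topology without branch lengths or bootstrap support, and all names (`p_distance`, `distance_table`, `neighbor_joining`, the taxa A-D) are hypothetical.

```python
from itertools import combinations


def p_distance(a, b):
    """Fraction of differing positions between two aligned sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def distance_table(aligned):
    """aligned: dict name -> aligned domain sequence.
    Returns {frozenset({a, b}): p-distance} for every pair."""
    return {frozenset((a, b)): p_distance(aligned[a], aligned[b])
            for a, b in combinations(aligned, 2)}


def neighbor_joining(names, dist):
    """Return the unrooted topology as nested tuples (labels must be unique)."""
    nodes = list(names)
    d = dict(dist)
    while len(nodes) > 2:
        n = len(nodes)
        # Net divergence: summed distance from each node to all others
        r = {i: sum(d[frozenset((i, k))] for k in nodes if k != i)
             for i in nodes}

        def q(pair):
            # Standard NJ criterion: Q(i, j) = (n - 2) d_ij - r_i - r_j
            i, j = pair
            return (n - 2) * d[frozenset(pair)] - r[i] - r[j]

        i, j = min(combinations(nodes, 2), key=q)
        new = (i, j)  # internal node represented by its two children
        for k in nodes:
            if k != i and k != j:
                d[frozenset((new, k))] = 0.5 * (
                    d[frozenset((i, k))] + d[frozenset((j, k))]
                    - d[frozenset((i, j))]
                )
        nodes = [k for k in nodes if k != i and k != j] + [new]
    return tuple(nodes)


# Toy additive distances for four hypothetical taxa (not data from this study)
toy = {
    frozenset(("A", "B")): 3, frozenset(("A", "C")): 9,
    frozenset(("A", "D")): 10, frozenset(("B", "C")): 10,
    frozenset(("B", "D")): 11, frozenset(("C", "D")): 7,
}
tree = neighbor_joining(["A", "B", "C", "D"], toy)
print(tree)
```

In practice the distance table would be computed from the aligned conserved domains, for example with `distance_table`, and a dedicated phylogenetics package would be used to obtain branch lengths and support values.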

The ERF Family
The ERF family of TFs was first discovered in tobacco (Ohme-Takagi and Shinshi, 1995) and is a large multigene family whose members play important roles in biotic and abiotic stress responses as well as the regulation of plant growth and development (McGrath et al., 2005;Nakano et al., 2006).
The ERFs are part of the larger AP2/ERF superfamily that is defined by the presence of the AP2 domain. Our searches revealed 241 tobacco sequences that contain complete or partial ERF domains. Of these, 10 contain only the N-terminal part of the domain and 7 the C-terminal portion. Based on their amino acid sequences and phylogenetic positions, a maximum of two N-terminal fragments could potentially correspond to C-terminal fragments and the minimum number of ERF genes in tobacco is therefore 239. A minimum of 35 AP2 genes are also present (Table I) and the 274 tobacco AP2/ERF genes form the largest single family of tobacco TFs.
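The arithmetic behind these minimum counts can be written out as a short sketch. The numbers are those given in the text; the function and variable names are hypothetical and only illustrate the reasoning that each possible N-terminal/C-terminal pairing may reduce the count by one gene.

```python
def minimum_gene_count(total_sequences, max_fragment_pairs):
    """Lower bound on gene number when some partial sequences
    may be the two halves of a single gene."""
    return total_sequences - max_fragment_pairs


erf_sequences = 241      # complete or partial ERF domains
n_terminal_only = 10     # fragments with only the N-terminal part
c_terminal_only = 7      # fragments with only the C-terminal part
max_pairs = 2            # at most two N/C fragments could be one gene each

# A pairing needs one fragment of each kind, so it cannot exceed either count
assert max_pairs <= min(n_terminal_only, c_terminal_only)

erf_min = minimum_gene_count(erf_sequences, max_pairs)
ap2_min = 35             # minimum number of AP2 genes (Table I)
ap2_erf_min = erf_min + ap2_min
print(erf_min, ap2_erf_min)
```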
The tobacco ERF genes form clades that correspond to subgroups I-X that have been found in both rice and Arabidopsis (McGrath et al., 2005;Nakano et al., 2006). There is, however, one major difference between the tobacco ERF family and that from Arabidopsis, namely that the group V genes from tobacco form two separate clades. We suggest that this may be a common feature of ERF gene families as we have observed that the group V genes from cowpea (Vigna unguiculata L. Walp.) also form two clades in a similar way (Timko et al., 2008). Additionally, the group V genes from rice are also not monophyletic (Nakano et al., 2006). Overall, the tobacco ERF gene family confirms the similarity of the ERF gene families from both monocot and dicot plants and also suggests a modification to the classification of Group V ERF genes.

The WRKY Family
The WRKY TFs are one of the largest groups of transcriptional regulators in plants (Rushton et al., 1995;Rushton et al., 1996;Eulgem et al., 2000;Ulker and Somssich, 2004;Zhang and Wang, 2005), mediating responses not only to biotic stresses (Rushton et al., 1996) but also abiotic stresses, such as wounding, drought, and cold adaptation (Eulgem et al., 2000;Ulker and Somssich, 2004). WRKY genes are classified into groups I, IIa, IIb, IIc, IId, IIe and III based on their primary amino acid sequence and the structure of their zinc finger motifs (Eulgem et al., 2000;Zhang and Wang, 2005). A total of 131 full or partial WRKY domains, representing a minimum of 93 WRKY genes, were identified and used in our phylogenetic analysis. Included in this number are eighteen published tobacco WRKY sequences whose domain coding regions were not found in our GSR data set. Fifteen of these genes were in fact tagged in the GSRs, but only contained sequence information outside of the conserved domain used in our analysis.
The composition of the WRKY gene family in tobacco is similar to that from other dicot plants as it consists of the same seven groups of WRKY genes (I, IIa, IIb, IIc, IId, IIe and III) (Fig. 2). The relative number of members in each group is also similar. This contrasts with the WRKY gene family from rice where the number of genes in Group III is greatly expanded (Zhang and Wang, 2005), making up about a third of all WRKY genes. This is not seen in the tobacco gene family (Fig. 2) and it appears that amplification of this group is associated with the evolution of monocots. Initial classification of WRKY genes into Groups I, IIa, IIb, IIc, IId, IIe and III was based only partly on phylogeny and relied also on the number and structures of the WRKY domains (Eulgem et al., 2000). Recently, phylogenetic analysis has led Zhang and Wang (2005) to propose that group II WRKY genes are not monophyletic, but instead form three distinct clades: IIa + IIb, IIc and IId + IIe. The tobacco WRKY gene family provides new data to test this and suggests that this new classification is correct. Group II WRKY genes are not monophyletic and IIa + IIb, IIc and IId + IIe are to be found in different areas of the phylogenetic tree (Fig. 2).

The Homeodomain Family
Homeodomain proteins were initially discovered during the study of homeotic mutants in Drosophila (Gehring, 1987) and play important roles in developmental regulation in all eukaryotic lineages. Tobacco genes were isolated by searches with the homeodomains from each of the major subgroups of plant homeodomain proteins (Chan et al., 1998). We found that this approach was crucial as the different subgroups of tobacco homeodomain genes are dissimilar.
Blast searches with the homeodomains from one group often failed to identify genes from other groups, even when using a very high e-value cut off. To illustrate this, searches with the homeodomain from glabra2 resulted in 74 hits whereas searches with that from knotted1 only yielded 15 hits and none of these were present in the data set obtained with glabra2. The combined results from all searches revealed a minimum of 129 homeodomain genes in tobacco.
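The comparison of hit sets from different query domains can be sketched as follows. This assumes BLAST results saved in the standard 12-column tabular format (query ID, subject ID, ..., e-value in column 11, bit score in column 12), which the text does not specify; the sequence identifiers, e-values and cutoff are invented for illustration and are not results from this study.

```python
import csv
import io


def hits_below_evalue(tabular_text, max_evalue):
    """Collect subject IDs from BLAST tabular text whose e-value <= cutoff."""
    hits = set()
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank and comment lines
        subject_id, evalue = row[1], float(row[10])
        if evalue <= max_evalue:
            hits.add(subject_id)
    return hits


def make_row(query, subject, evalue):
    # 12 standard columns; only query, subject and e-value matter here
    return "\t".join([query, subject, "60.0", "55", "22", "0",
                      "1", "55", "1", "165", evalue, "70.1"])


# Hypothetical hit tables for two homeodomain queries
glabra2_out = "\n".join([
    make_row("glabra2_HD", "GSR_0001", "1e-20"),
    make_row("glabra2_HD", "GSR_0002", "3e-08"),
    make_row("glabra2_HD", "GSR_0003", "50"),      # fails the cutoff below
])
knotted1_out = "\n".join([
    make_row("knotted1_HD", "GSR_0104", "2e-15"),
    make_row("knotted1_HD", "GSR_0105", "4e-06"),
])

cutoff = 10.0  # deliberately permissive, as an assumed example value
glabra2_hits = hits_below_evalue(glabra2_out, cutoff)
knotted1_hits = hits_below_evalue(knotted1_out, cutoff)

shared = glabra2_hits & knotted1_hits      # empty if the searches do not overlap
combined = glabra2_hits | knotted1_hits    # candidate set from all searches
print(len(shared), len(combined))
```

Taking the union over one search per subgroup is what makes the combined candidate set robust when, as here, a single query recovers only its own subgroup.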
This phylogenetic similarity should aid the identification of functional homologs in tobacco of characterised homeodomain genes from other higher plants.

The MYB Family
The MYB family is the largest family of TFs in many plant species (Qu and Zhu, 2006) and our data suggest it is the second largest in tobacco (Table I). MYB TFs are defined by up to three imperfect repeats of the MYB DNA-binding domain (Ogata et al., 1992). The largest MYB subfamily is the R2R3MYB family and these appear to be predominately involved in "plant specific" processes such as the regulation of plant secondary metabolism and the identity and fate of plant cells (Kranz et al., 1998;Stracke et al., 2001). We identified 232 R2R3MYB sequences and this represents a minimum number of 194 R2R3MYB genes in tobacco. Tobacco also contains a minimum of 56 MYB-related genes (Table I) and the MYB family in tobacco therefore contains at least 250 genes. The phylogenetic tree of the R2R3MYB genes can be divided into numerous small clades (Supplemental Fig. 3). This is similar to Arabidopsis where at least twenty three R2R3MYB subgroups have been defined (Stracke et al., 2001). As there is no clear cross-species nomenclature, we have given the tobacco clades arbitrary numbers. Previous analysis of the Arabidopsis R2R3MYB family concluded that there are clear examples of functional conservation between related members of the R2R3MYB family across species (Stracke et al., 2001). This potential conservation has enabled us to use the phylogenetic tree to identify tobacco genes that are similar to the key regulators GLABROUS1, WEREWOLF, TRANSPARENT TESTA, ALTERED TRYPTOPHAN REGULATION1 and GAMYB.

The bZIP Family
A total of 75 bZIP genes were identified in the tobacco GSR dataset. These genes form 10 clades, corresponding to groups A-E, G, H, I, M and S previously defined in Arabidopsis (Jakoby et al., Fig. 4). In this regard, the tobacco bZIP gene family is broadly similar to that found in Arabidopsis. However, tobacco may be missing homologs of the Arabidopsis group F bZIP genes, as none were identified in the GSR dataset. The absence of Group F genes in tobacco could indicate that this group is part of a species or family specific expansion.

The MADS box Family
Tobacco contains at least 119 MADS box genes (Supplemental Fig. 5). Members of the MADS box family are known to be predominately involved in developmental processes (Parenicova et al., 2003) and are found in animal as well as plant species. In plants, the gene family has greatly expanded and this appears to have been instrumental in shaping the evolution of the true flower around 120-150 million years ago (Rijpkema et al., 2007). The MADS box genes can be divided into five subgroups (MIKC, Mα, Mβ, Mγ, and Mδ) based on phylogenetic relationships of the MADS box domain (Parenicova et al., 2003). The MADS box genes in tobacco are found in these same five subgroups and it appears that the MADS box gene family is similar in structure and composition across higher plants.

The bHLH Family
Of all the similarities we observed between tobacco TF families and those from sequenced plant genomes, the bHLH family is the most striking. bHLHs form one of the largest families of TFs in plants (Bailey et al., 2003) and regulate numerous processes. Analysis of the bHLH family is technically more difficult than for most of the other TF families (Bailey et al., 2003;Heim et al., 2003;Toledo-Ortiz et al., 2003) and, therefore, we performed a total of 30 independent searches using a representative bHLH domain from each of the 23 subfamilies described by Heim et al. (2003). This resulted in the identification of 192 bHLH sequences representing a minimum of 190 bHLH genes. Figure 3 shows a phylogenetic tree of the tobacco bHLH domains together with a marker domain (shown by the letter of the subfamily) for each of the 23 subfamilies. There is clear evidence suggesting that tobacco contains members of all 23 bHLH subfamilies that are present in Arabidopsis. Even smaller subfamilies such as groups I, M and U appear to have at least one tobacco member. This supports the detailed classification of Arabidopsis bHLH genes (Bailey et al., 2003) and suggests conservation of the composition of the bHLH gene family among vascular plants. The phylogenetic tree identifies possible tobacco homologs of key regulators of light responses such as the PIF/PILs (subgroup E) and jasmonate responses such as AtMYC2/jin1/jai1 (Subgroup N). Recently, it was shown that MYC2, a key transcriptional activator of jasmonate responses in Arabidopsis, interacts with the JAZ family of transcriptional repressors (Chini et al., 2007;Thines et al., 2007). The JAZ proteins are members of the ZIM family of TFs and interact with another central regulator of JA signalling, the F-box protein COI1 (Chini et al., 2007;Thines et al., 2007). 
We have found at least 13 ZIMs in tobacco (Table I) and there are apparent homologs of the complete COI1/JAZ1/MYC2 jasmonate-inducible signalling cascade in tobacco (data not shown).

Differences between Tobacco TFs and Those from Sequenced Plant Genomes
We found a number of notable differences in the composition of several TF families in tobacco compared with those in poplar, Arabidopsis, and rice. This includes a number of novel TF subfamilies that may be components of regulatory circuits specific to tobacco or the Solanaceae.

The NAC Family
A major difference is found in the NAC gene family, one of the largest families of plant specific TFs (Guo et al., 2005;Olsen et al., 2005). NACs have been implicated in regulating diverse processes including flower development, reproduction, defense against insect pests and pathogens, abiotic stress responses and responses to hormones (Olsen et al., 2005). NAC TFs are defined by the presence of the NAC domain, a conserved DNA-binding domain that appears to have no known close structural homologs (Aida et al., 1997;Ernst et al., 2004).
We found 203 complete or partial NAC domains in tobacco and a minimum number of 152 NAC genes. Previous phylogenetic analysis of NAC TFs has been limited. The most comprehensive study of NACs is by Ooka et al. (2003), who divided the rice and Arabidopsis NAC family into two major subgroups and numerous minor groups. Figure 4 shows the phylogenetic relationship of members of the tobacco NAC gene family. We identified 7 major subfamilies, 6 of which are present in tobacco and other plant species and a seventh subfamily that contains the largest number of tobacco NAC genes and appears unique to the Solanaceae. This subfamily, termed TNACS, represents not only a novel subgroup of NAC genes but also a major difference between tobacco and all sequenced plant genomes. There are approximately 50 TNAC genes and they account for about one quarter of all NAC genes in tobacco. The TNAC genes can be further subdivided into three major clades (A, B and C) with members in each clade having clearly different primary amino acid sequences in their NAC domains. The differences among the NAC domain sequences in the TNAC genes (subdomains A-C) and how they differ from the NAC domain consensus derived from the other six groups of tobacco NACs is
We sought to determine the distribution of TNACs across the plant kingdom. Extensive blast searches of the NCBI databases (All GenBank+EMBL+DDBJ+PDB sequences, Plant genomes EST sequences, Genome Survey Sequence, Unfinished High Throughput Genomic Sequences) and the Sol Genomics Network (http://www.sgn.cornell.edu/) revealed that there are no published TNAC genes and that they are completely absent from all currently sequenced plant genomes.
A total of just six TNAC ESTs, one from tomato (BI422367), one from potato (CV505554), one from pepper (U204177) and three from tobacco (AM845922, BP137257 and AF211685) were found in the databases, together with one tomato TNAC gene (SL_MboI0012J01 3) from a Genome Survey Sequence and one predicted TNAC gene on tomato chromosome 2 (C02SLm0076E07). Like tobacco, the crop plants tomato, potato and pepper are also members of the Solanaceae and this large group of novel NAC genes appears to be a previously unknown feature of transcriptional regulation in the Solanaceae.
To demonstrate the Solanaceae-specific nature of the TNAC subfamily, we constructed a phylogenetic tree containing the complete NAC gene families from Arabidopsis, poplar and rice (indica), together with the tobacco NAC genes shown in Figure 5. The A-C portions of the NAC domains were used for its construction. Figure 6 shows this combined phylogeny of over 450 NAC genes. The non-TNAC genes from tobacco (blue dots) are widely scattered across the phylogenetic tree, illustrating that they belong to NAC subfamilies that are present in different plant species. In stark contrast, the TNAC genes (red dots) clearly form a separate clade and this demonstrates that TNACs are absent from the other three plant genomes and likely to be Solanaceae-specific. Due to the limited previous phylogenetic information on the NAC gene family, our analysis of over 450 NAC genes should be of broad use for studies of NAC genes.
The TNAC ESTs from tobacco come from both the A and C clades. The A clade tobacco EST AF211685 comes from cell suspension cultures harvested 30 min after treatment with the Avr9 peptide from the fungus Cladosporium fulvum. The other tobacco A clade EST (AM845922) appears to come from the TNAC gene NtNAC84 and was isolated from cold treated whole plants.
This suggests that some A clade genes may play roles in stress responses in tobacco. We have performed expression studies on a small number of representative TNAC genes and found that members of all three clades are expressed in tobacco. Some TNAC genes such as the B clade gene NtNAC176 and the C clade gene NtNAC156 appear to be expressed in most plant tissues, whereas the A clade gene NtNAC151 is mostly expressed in roots and young leaves (data not shown). We suggest that TNACs are components of regulatory circuits some of which are specific to tobacco or the Solanaceae.

The BES Family
The prototype of the BES family of TFs is BES1 (Yin et al., 2002;Yin et al., 2005), a TF that binds to, and activates, brassinosteroid-regulated gene promoters. Tobacco contains at least nineteen BES-like TFs distributed in 5 main clades (Fig. 7A). Four clades have members with homologs in other plant species. Among the best characterized BES genes are the Arabidopsis genes BES1 and BZR1 that play roles in stem elongation and senescence (Yin et al., 2002;Yin et al., 2005). Based on sequence homology, NtBES10 and NtBES14 are likely to be their functional homologs in tobacco (Fig. 7). The fifth clade is composed of four tobacco BES-like genes that are significantly different from members of the other four clades. Figure 7B shows a comparison of the conserved N-terminal regions of the tobacco BES genes and their Arabidopsis counterparts.
There are at least six amino acid differences in the conserved domain that are a hallmark of the tobacco-specific group. This N-terminal domain contains a bipartite NLS, together with a highly conserved bHLH-like DNA binding domain (Yin et al., 2005) that is responsible not only for heterodimer formation with bHLH TFs, but also for direct binding to E Box elements (CANNTG). The differences between the novel tobacco subgroup of BES TFs and other BESs are found within both the basic region and both helices (Fig. 7B). This raises intriguing questions as to the interacting partners of these tobacco BES genes, their DNA binding preferences and their roles in planta. Database searches of the NCBI databases (All GenBank+EMBL+DDBJ+PDB sequences, Plant genomes EST sequences, Genome Survey Sequence, Unfinished High Throughput Genomic Sequences) and the Sol Genomics Network (http://www.sgn.cornell.edu/) failed to find any members of this BES group in other plants and it is possible that these BES-like genes are specific to tobacco and related Nicotiana species.
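The E Box consensus mentioned above (CANNTG, where N is any nucleotide) translates directly into a pattern search. The following is a minimal sketch for scanning a promoter sequence; the function name and the test sequence are hypothetical. Because the reverse complement of CANNTG is again CANNTG, scanning one strand is sufficient to locate sites on both.

```python
import re

# N in the CANNTG consensus = any of A, C, G, T.
# The lookahead lets overlapping sites be reported as well.
EBOX = re.compile(r"(?=(CA[ACGT]{2}TG))")


def find_eboxes(dna):
    """Return (0-based start, site) for every E Box in the sequence."""
    return [(m.start(), m.group(1)) for m in EBOX.finditer(dna.upper())]


# Invented promoter fragment containing two E Boxes
sites = find_eboxes("ttCACGTGaaCAGCTGtt")
print(sites)
```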

The SAP Family
A recessive mutation in Arabidopsis STERILE APETALA (SAP) causes severe aberrations in inflorescence and flower and ovule development (Byzova et al., 1999). Together with the organ identity gene AGAMOUS, SAP is required for the maintenance of floral identity acting in a manner similar to APETALA1. Both Arabidopsis and poplar have just a single SAP gene and it is the only family among the 64 TF families that appears to be absent in rice (Gao et al., 2006).
Tobacco has at least four SAP genes (Fig. 8). Although similar to the Arabidopsis and poplar genes, the tobacco SAP genes are more similar to each other than they are to the genes from the other two species. Whether expansion of the tobacco SAP gene family led to redundant, partially redundant, or distinct functions for the SAP genes remains to be experimentally determined. In the near future, genome sequences for tomato and potato will be available and this information should help clarify whether expansion in the SAP gene family is a general feature of the Solanaceae.

The C2H2 Family
The C2H2 zinc finger family appears to be over-represented in the tobacco genome compared to other plants (Supplemental Table II). This is due almost entirely to the presence of 23 unusual gene coding sequences in the tobacco GSR data set that contain tandem zinc fingers (up to nine repeats), mostly of the C-x2-C-x12-H-x3-H type (data not shown). None of these GSRs have corresponding homologs in currently available EST sequence data. Based on blast homology, the domains in these GSRs may be related to the pfam domain DUF1644 found in a number of hypothetical plant proteins of unknown function. A lack of clear homology makes it difficult to determine an exact number for these genes as they are not particularly similar to known plant proteins. The tandem zinc fingers are most similar to transcriptional repressors like Kruppel (Hanna-Rose et al., 1997), but their roles in plants remain unknown.
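The C-x2-C-x12-H-x3-H spacing described above can be expressed as a regular expression to count tandem fingers in a translated sequence. This is only a sketch of the motif notation, not the search procedure used in this study; the toy protein, its filler residues and the linker are invented for illustration.

```python
import re

# C-x2-C-x12-H-x3-H: two Cys separated by 2 residues, 12 residues,
# then two His separated by 3 residues
C2H2 = re.compile(r"C[A-Z]{2}C[A-Z]{12}H[A-Z]{3}H")


def c2h2_fingers(protein):
    """Return the 0-based start of each non-overlapping C2H2 match."""
    return [m.start() for m in C2H2.finditer(protein.upper())]


# Toy sequence: three identical 21-residue fingers joined by a 5-residue linker
finger = "C" + "AA" + "C" + "A" * 12 + "H" + "AAA" + "H"
toy_protein = "TGEKP".join([finger] * 3)

starts = c2h2_fingers(toy_protein)
print(len(starts), starts)
```

Real zinc fingers vary in spacing, so a profile-based domain search would normally be preferred over a fixed-spacing pattern for anything beyond a quick count.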

Group IX ERF Genes Regulate Jasmonate Responses in Tobacco
Group IX ERF genes have been implicated in regulating defense responses mediated by methyl jasmonate (MeJA) in a number of plant species, including Arabidopsis (McGrath et al., 2005), Catharanthus roseus (van der Fits and Memelink, 2000) and tobacco (Goossens et al., 2003;De Sutter et al., 2005). In C. roseus, the Group IX ERF ORCA3 has been shown to be a master regulator of primary and secondary metabolism during jasmonate responses (van der Fits and Memelink, 2000, 2001). In tobacco, the MeJA-responsive Group IX ERFs NtORC1 and NtJAP1 are able to positively regulate the jasmonate-inducible gene putrescine N-methyltransferase (PMT), that plays a key role in secondary metabolism (De Sutter et al., 2005). These results suggest that Group IX ERFs are among the key regulators of jasmonate responses in plants. We therefore used our phylogenetic analysis of the tobacco ERF gene family to determine if we could use phylogenetic position to reveal potential function. We predicted that phylogenetically similar genes may have similar functions and specifically that many Group IX ERF genes play roles in jasmonate responses in tobacco. This translational biology approach therefore led us to predict that many tobacco group IX genes would be MeJA-inducible. To test this, we treated tobacco BY-2 cells with MeJA and used semi-quantitative RT-PCR together with gene-specific primers (Supplemental Table III) to determine whether nineteen representative tobacco group IX ERF genes are indeed inducible by MeJA. Figure 9A shows that the BY-2 cells responded well to MeJA as judged by induction of the PMT1a gene that is up-regulated as part of a reprogramming of secondary metabolism in tobacco. Of the nineteen genes studied, thirteen were inducible by MeJA with the remaining six genes yielding no product. The inducible genes are found across the whole of the group IX phylogenetic tree (Fig. 9B).
The induction of the ERF genes followed several different kinetic patterns ranging from rapid and sustained induction (NtERF210, NtERF179, ORC1) to very rapid and transient induction (NtERF165, ACRE1), suggesting that several different mechanisms of gene activation are involved. The maximum level of mRNA was after 1-2 hours of treatment for most of the ERF genes, with the maximum in NtPMT1a mRNA level being reached later, about 4 hours after treatment. One of the most notable patterns of mRNA accumulation was found with NtERF210.
NtERF210 mRNA was undetectable in the absence of MeJA treatment but then showed an extremely rapid induction with maximum levels being reached within 30 minutes of hormone treatment. This maximum level was then sustained throughout the following 24 hour period. In addition to the striking pattern of mRNA accumulation, our phylogenetic analysis of the tobacco Group IX ERF genes revealed that NtERF210 is found in a small clade (containing NtERF163, NtERF210 and NtERF91) that contains the genes that are the most similar to ORCA3, the key regulator of jasmonate responses in Catharanthus roseus. This is validation of the use of phylogenies as an aid for functional prediction as it suggests not only that many tobacco group IX ERF genes are indeed involved in jasmonate responses, but it also identifies NtERF210 as a potential key regulator of secondary metabolism during jasmonate responses in tobacco.

Conclusions and Perspectives
Reduced representation sequence datasets, such as the MF dataset used here, are useful resources for plant species where whole genome sequencing projects are lacking. They deliver sequences for the majority of genes and are invaluable tools for functional genomics and systems biology.
However, they do lack information on some genes. Our discovery of members from 64 TF gene families in tobacco, including those that consist of only one or two genes in other plants, suggests that a large percentage of genes have been tagged in the tobacco gene space data set. Due to the incomplete nature of gene space sequences, some TF contigs in our dataset contain partial DNA binding domains, normally due to the presence of introns. For these reasons, it is only possible to give a minimum number of TF genes that are present in tobacco based on the largest number of independent sequences that contained a certain portion of the conserved domain. The actual number of tobacco TFs in our data set is between 2,513 (the minimum number of genes) and 2,882 (the total number of contigs). Given the allopolyploid nature of N. tabacum, it is probable that the actual number of TF genes in tobacco is over 3,000.
Seven of the nine largest TF gene families in tobacco are very similar in size and complexity to their counterparts in Arabidopsis, rice and poplar (Guo et al., 2005;Gao et al., 2006;Zhu et al.,
default settings for pairwise alignments and assembly except for the minimum overlap match which was set at 30 bp. These parameters give positive alignment scores when regions of two sequences that are 70% or more identical are extended. In almost all cases where we were able to assemble contigs, the alignments were 90% identical or greater. The average contig length was about 1,000 bp, with each contig containing on average between three and four independent GSRs. The shortest contig was about 250 bp and the longest about 2.8 kb. Each predicted gene was then individually manually curated by blastx searches against the non-redundant protein database housed at http://www.ncbi.nlm.nih.gov/. This served two purposes. First, all sequences were verified as coding for at least part of a conserved domain (i.e., all false positives that did not encode any part of the conserved domain from the gene family were discarded at this point).
Second, the searches identified whether the sequence contained all of the conserved domain or only part of it. If the hit was only partial, the searches also identified which part of the conserved domain was present. This information was used to determine the minimum number of members in each TF gene family by calculating the number of independent sequences that contained any certain portion of the domain (e.g., amino acids 20-30). For larger gene families, the genes were first assigned to known subfamilies and then the minimum number of genes calculated for each subfamily.
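The minimum-number calculation described above can be sketched as a coverage count: each independent sequence covers a stretch of the conserved domain, and the position covered by the most sequences sets the lower bound on gene number. The function names, the 1-based inclusive coordinates and the example fragments are assumptions made for illustration, not the authors' implementation or data.

```python
def minimum_gene_number(fragments, domain_length):
    """fragments: list of (start, end) domain positions (1-based, inclusive)
    covered by each independent sequence. Returns the largest number of
    sequences that cover any single position of the domain."""
    coverage = [0] * (domain_length + 1)  # index 0 is unused
    for start, end in fragments:
        if not (1 <= start <= end <= domain_length):
            raise ValueError(f"fragment {(start, end)} lies outside the domain")
        for pos in range(start, end + 1):
            coverage[pos] += 1
    return max(coverage)


def minimum_family_size(fragments_by_subfamily, domain_length):
    """For larger families: sum the per-subfamily minimum numbers."""
    return sum(minimum_gene_number(frags, domain_length)
               for frags in fragments_by_subfamily.values())


# Hypothetical 60-residue domain: two complete domains, two N-terminal
# fragments and one C-terminal fragment (e.g. domains split by an intron)
example = [(1, 60), (1, 60), (1, 30), (1, 25), (35, 60)]
family_min = minimum_gene_number(example, 60)
print(family_min)
```

Here five sequences give a minimum of four genes, because the C-terminal fragment never overlaps the N-terminal fragments and could belong to the same gene as one of them.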
The data sources for the Arabidopsis, poplar and rice datasets were the following. The
Short partial domains were discarded as they cannot be used to construct acceptable phylogenies. Fewer than ten fragmentary NAC domains were excluded as their amino acid sequences are uncertain and they could not be aligned correctly.
Total RNA (5 μg) was used for cDNA synthesis using the ThermoScript RT-PCR System (Invitrogen) according to the manufacturer's instructions. Each PCR reaction contained 0.2 μg of cDNA and gene specific primers described in Supplemental Table III. To ensure that each product represented the mRNA level from a single gene, each gene was compared at the DNA level with its closest neighbour in the phylogeny and two primers were designed for each gene that were specific for this ERF member. The NtPMT1a gene, a well documented MeJA induced transcript in tobacco (Nagata et al., 1992;Goossens et al., 2003), was included as a positive control and actin was used as a non-MeJA induced control. PCR products were separated by electrophoresis on 2% agarose gels. Experiments were replicated at least three times and representative data are shown.
Table I. Comprehensive list of predicted transcription factor genes in TOBFAC and the Genbank accession numbers of their constituent GSRs.
Table II. Comparison of predicted TF gene family sizes in tobacco with those from sequenced plant genomes.
Table III. Gene-specific primers used in RT-PCR analysis of methyl jasmonate-induced gene expression.
Table IV. Comprehensive list of amino acid sequences of the TF domains used in Blast homology searches.

Table I. Size Distribution of the Transcription Factor Families Found in Tobacco.
The table shows the minimum number of members identified in each of the 64 families of transcription factors found in the tobacco GSR dataset. Each family is identified by its abbreviated name and the estimated minimum number of gene family members is given in parentheses.