Gene discovery of modular diterpene metabolism in nonmodel systems.

Plants produce over 10,000 different diterpenes of specialized (secondary) metabolism, and fewer diterpenes of general (primary) metabolism. Specialized diterpenes may have functions in ecological interactions of plants with other organisms and also benefit humanity as pharmaceuticals, fragrances, resins, and other industrial bioproducts. Examples of high-value diterpenes are taxol and forskolin pharmaceuticals or ambroxide fragrances. Yields and purity of diterpenes obtained from natural sources or by chemical synthesis are often insufficient for large-volume or high-end applications. Improvement of agricultural or biotechnological diterpene production requires knowledge of biosynthetic genes and enzymes. However, specialized diterpene pathways are extremely diverse across the plant kingdom, and most specialized diterpenes are taxonomically restricted to a few plant species, genera, or families. Consequently, there is no single reference system to guide gene discovery and rapid annotation of specialized diterpene pathways. Functional diversification of genes and plasticity of enzyme functions of these pathways further complicate correct annotation. To address this challenge, we used a set of 10 different plant species to develop a general strategy for diterpene gene discovery in nonmodel systems. The approach combines metabolite-guided transcriptome resources, custom diterpene synthase (diTPS) and cytochrome P450 reference gene databases, phylogenies, and, as shown for select diTPSs, single and coupled enzyme assays using microbial and plant expression systems. In the 10 species, we identified 46 new diTPS candidates and over 400 putatively terpenoid-related P450s in a resource of nearly 1 million predicted transcripts of diterpene-accumulating tissues. Phylogenetic patterns of lineage-specific blooms of genes guided functional characterization.

With more than 10,000 different structures, plant diterpenes are one of the largest and most diverse classes of plant metabolites. All of these compounds are formed from (E,E,E)-geranylgeranyl diphosphate (GGPP 1, Fig. 1). A relatively small number of diterpenes of general (i.e. primary) metabolism, such as GAs or chlorophyll (a diterpene conjugate), are known to serve essential functions in plant growth and development. These compounds and their respective biosynthetic pathways are conserved broadly across the plant kingdom. In contrast, most diterpenes are products of specialized (i.e. secondary) metabolism and may have roles in ecological interactions of plants with other organisms. Examples are antimicrobial diterpene phytoalexins in rice (Oryza sativa), wheat (Triticum aestivum), maize (Zea mays), and other cereal crop plants of the family Poaceae (Peters, 2006;Schmelz et al., 2011;Wu et al., 2012;Zhou et al., 2012a); diterpene resin acids of the oleoresin defense of spruce (Picea spp.), pine (Pinus spp.), and other species of the Pinaceae family (Keeling and Bohlmann, 2006;Hall et al., 2013); or antiherbivore diterpene glycosides in Nicotiana attenuata (Heiling et al., 2010). Individual specialized diterpenes are often of limited taxonomic distribution, where certain metabolites may be found only in a few species, genera, or families. For example, taxol and related taxoids are unique to species of the genus Taxus (Jennewein and Croteau, 2001;Guerra-Bubb et al., 2012); labdane-related diterpene phytoalexins are found in several cultivated species of the Poaceae family (Peters, 2006;Schmelz et al., 2011); conifers of the Pinaceae and related families produce an abundance of different diterpene resin acids (Keeling and Bohlmann, 2006;Hamberger et al., 2011;Hall et al., 2013); and different species of wild and cultivated tobacco (Nicotiana spp.) produce different profiles of antimicrobial or antiherbivore diterpenes (Heiling et al., 2010;Sallaud et al., 2012;Seo et al., 2012). Schematic of proposed pathways leading from GGPP 1 to pseudolaric acid B 2, cis-abienol 3, triptolide 4, oridonin 5, carnosol 6, marrubiin 7, forskolin 8, grindelic acid 9, ingenol-3-angelate 10, and jatrophone 11. These diterpenes are formed by bifunctional class I or class I/II enzymes or pairs of monofunctional class II and class I diTPSs, catalyzing the cycloisomerization of GGPP into distinct diterpene scaffolds. Products of diTPSs can undergo various functional modifications primarily through the activity of P450 enzymes. Considering the modular organization of diterpene specialized metabolism (e.g. Hall et al., 2013), we showed the variable diTPS and P450 enzyme modules as "LEGO-like" blocks. This is conceptually modified from Baldwin (2010), who referred to the five-carbon units of terpenoids as "LEGO" blocks. Here, the different colors of the diTPS modules indicate variations of the gba three-domain structure of plant diTPSs; a red "X" indicates loss of function of class I or class II activity. Recombining diTPS and P450 enzyme modules of diterpene pathway may allow for production of known and new diterpenes through metabolic engineering or synthetic biology.
Plant specialized diterpenes also have a multibillion dollar annual market value as pharmaceuticals, fragrances, resins, and other industrial bioproducts Zerbe and Bohlmann, 2013). Diterpene pharmaceuticals include the anticancer drugs taxol derived from Taxus species (Jennewein and Croteau, 2001) and ingenol-3-angelate from Euphorbia peplus (Li et al., 2010), the cAMP-regulating and vasodilator drug forskolin from Coleus forskohlii (Alasbahi and Melzig, 2010b), and the analgesic and antidiabetic drug candidate marrubiin from species of Marrubium (Mnonopi et al., 2012). Beyond pharmaceuticals, examples of industrially used plant diterpenes are steviol glycosides sold as natural sweeteners (Goyal et al., 2010), sclareol produced in Salvia sclarea (Caniard et al., 2012) as a source of ambroxide fragrances, and diterpene resin acids used as a large feedstock for industrial resins, inks, and coatings . Sufficient and sustainable access to these compounds can be limited due to lack of cultivation or lack of crop production systems for relevant plant species, low yields from cultivated or noncultivated plant material, overharvesting of wild plant material, or a need for protection of endangered species and habitats. In addition, yields of complex diterpenes and purity of particular stereoisomers obtained by chemical synthesis are often inadequate for large-scale production. Solutions to these issues may be found with improved agricultural or biotechnological production systems. For example, production of taxol uses combinations of precursors isolated from foliage of cultivated Taxus species, semisynthesis, and plant cell cultures (Jiang et al., 2012;Wilson and Roberts, 2012). Alternatively, microbial metabolic engineering and synthetic biology approaches are being explored for production of high-value diterpenes (Ajikumar et al., 2010;Caniard et al., 2012;Schalk et al., 2012;Zerbe et al., 2012a;Zhou et al., 2012b). These approaches, however, require knowledge of the specific diterpene pathways, genes, and enzymes.
The structural diversity of specialized diterpenes and their lineage-specific patterns of taxonomic distribution are the result of divergent evolution of specialized diterpene biosynthetic pathways. In addition to sharing a common GGPP 1 starter unit, these pathways are generally built in a "LEGO-like" fashion (compare with Baldwin, 2010) with modules from a few classes of enzymes and along a common building plan (Fig. 1). The common building blocks of modular diterpenoid pathways include enzymes of different classes of diterpene synthases (diTPSs), cytochrome P450-dependent monooxygenases (P450s), and various types of transferases (e.g. acyl-, aryl-, methyl-, or glycosyltransferases) and oxidoreductases. Specific functions of each module may be unique for a given plant species, genus, or family. The modular diterpene pathways can be short, composed for example of only a single diTPS as in the biosynthesis of cis-abienol in Abies balsamea (Zerbe et al., 2012a), or employ a suite of up to 20 different modules as in taxol biosynthesis (Walker and Croteau, 2001). In either case, the committed biosynthetic pathways begin with a diTPS, which in the above examples is either a bifunctional cis-abienol synthase (CAS), which in its particular structure and function is only known from A. balsamea (Zerbe et al., 2012a), or a taxadiene synthase (TXS), which is of a unique structure and function that has only ever been found in Taxus species (Köksal et al., 2011).
While diterpene biosynthetic pathways found across the plant kingdom employ common modules (classes of enzymes), the functional space within each module is hugely diverse. The catalytic divergence within modules contributes much to the structural diversity of diterpene products and intermediates of these pathways. The diversity of function within modules also has profound implications for gene discovery and gene annotation in the many different species that produce interesting diterpenes. In essence, specific gene and enzyme functions are often unique to specific taxonomic groups. Due to the high level of sequence similarity and functional plasticity within modules, such as the diTPS and P450 modules, specific functions cannot be predicted based on sequence similarity alone. As a consequence, there is no single model or reference system to guide rapid annotation of the many unique genes and their specific enzymatic functions beyond basic module classifications.
Of the plethora of different plant species that produce ecologically or economically important specialized diterpenes, few have been well characterized for sets of genes and enzymes of the respective biosyntheses. These few include the biosyntheses of taxol in Taxus spp. (Walker and Croteau, 2001;Guerra-Bubb et al., 2012), diterpene resin acids in a few conifers (Hamberger et al., 2011;Keeling et al., 2011a;Zerbe et al., 2012a;Hall et al., 2013), and diterpene defense compounds in rice and wheat (Xu et al., 2007a;Wu et al., 2012;Zhou et al., 2012a). In addition, individual enzymes of specialized diterpene biosynthesis, in particular different diTPS genes, have been functionally characterized in approximately a dozen different species. In contrast to the diTPS module, only a few P450s of specialized diterpene pathways have been characterized, specifically P450s of the CYP725 family with functions in taxol biosynthesis (Jennewein and Croteau, 2001;Jennewein et al., 2004), P450s of the CYP720B subfamily of diterpene resin acid formation in pine and spruce (Ro et al., 2005;Hamberger et al., 2011), and P450s of the CYP71, CYP99, CYP76, and CYP701 families of diterpene phytoalexin formation in rice (Swaminathan et al., 2009;Wang et al., 2011Wang et al., , 2012aWang et al., , 2012bWu et al., 2011).
Deep or enriched transcriptome resources of diterpene-producing nonmodel systems provide opportunities for the discovery of new genes and enzymes and for exploration of the functional space of specialized diterpene metabolism (Caniard et al., 2012;Facchini et al., 2012;Zerbe et al., 2012a;Higashi and Saito, 2013). It is, however, important to note that accumulation and biosynthesis of specialized diterpenes may be spatially restricted to certain organs, tissue, or cell types or may be temporally limited to certain stages of plant growth and development. In some species, biosynthesis may be induced in response to contact with pathogens or herbivores. Examples are the constitutive and induced production of diterpene resin acids, which in conifers are located in epithelial cells of resin ducts (Zulak and Bohlmann, 2010), the formation of cis-abienol in phloem tissue of A. balsamea (Zerbe et al., 2012a) or in tobacco leaf trichomes (Ennajdaoui et al., 2010;Sallaud et al., 2012), or the formation of sclareol in flowers of S. sclarea (Caniard et al., 2012). Thus, development of transcriptome resources has to be guided by information of metabolite profiles, taking into consideration the fact that biosynthesis and accumulation may be spatially or temporally separated.

Transcriptome Sequencing and de Novo Assembly
Nonnormalized complementary DNA (cDNA) libraries were prepared from tissues with highest metabolite abundance for transcriptome sequencing using Roche 454 GS-FLX Titanium and Illumina GAIIx or HiSeq platforms. A one-half-plate 454 sequencing reaction per library yielded a total of approximately 5,500,000 (98% of total reads) high-quality sequences, averaging 610,377 reads per species (Supplemental Table S1), compared with approximately 1,500,000,000 high quality reads (96%), with 59,648,341 and 256,120,403 reads per library on average, obtained with one lane of Illumina GAIIx or HiSeq, respectively. De novo transcriptome assemblies of 454 data sets using a Roche Newbler 2.6 assembler generated 188,270 isogroups (i.e. unique transcripts) with on average 4% of singletons (Supplemental Data S1-S10). The length distribution of isotigs ranged from 76 to 9,530 bp with a median length of 943 bp (Supplemental Fig. S1). Illumina reads were assembled using a Trinity de novo assembler (Grabherr et al., 2011) and yielded 1,164,906 contigs that built 712,222 components (compare with isogroups) per species (Supplemental Data S1-S10). Contig length varied from 200 to 20,019 bp, showing with 765 bp a shorter median length as compared with 454 assemblies. Mapping of raw Illumina reads against the respective Trinity assemblies showed on average 70% of reads with significant pairing to the assembly (Supplemental Table S1). Overall gene coverage of the transcriptome inventories was validated by comparing all assemblies to the Arabidopsis (Arabidopsis thaliana) Core Eukaryotic Genes Mapping Approach (CEGMA) data set (Parra et al., 2007). On average, 92% CEGMA sequences were present in 454 assemblies as compared with 99% in Illumina data sets (defined as query sequence aligning to $95% in length to top BLASTn hits at 1e -20 E-value cutoff; Supplemental Table S2), consistent with the typically deeper coverage of Illumina assemblies. Only with R. officinalis we obtained lower transcriptome quality with fewer than 74% CEGMA transcripts detected and a higher number of singletons (15%). Of the sequences matching CEGMA targets, 42% of 454 and 57% of Illumina transcripts were full length (FL; defined as 95% coverage compared with top tBLASTn hits at 1e -20 E-value cutoff). The independent 454 and Illumina assemblies proved useful to validate predicted genes by comparison of the two assemblies for each species.

Targeted Mining of Large Transcriptome Resources for Diterpenoid Biosynthetic Pathways
In plants, the plastidial 2-C-methyl-D-erythritol-4-P (MEP) pathway and the cytosolic mevalonate (MVA) pathway produce five-carbon precursors of isoprenyl diphosphates, with precursors of GGPP produced primarily through the MEP route (Davis and Croteau, 2000). To assess the quality of the transcriptome resources for terpenoid gene discovery, we first evaluated the presence of MEP and MVA genes by mapping the 454 and Illumina assemblies against these two pathways using custom reference sequences from the Kyoto Encyclopedia of Genes and Genomes (KEGG) database (E-value threshold 1e -20 ). Both pathways, as well as short-chain isoprenyl diphosphate synthases, were completely represented in all 10 species, with 82% of 454-derived and 93% of Illumina-derived transcripts detected as FL sequences (Fig. 3).

Association of diTPS Candidates with Patterns of Evolutionary Divergence of the diTPS Family
The diTPS candidate genes identified in the transcriptomes of the eight angiosperm and two gymnosperm species, together with previously characterized TPSs (Chen et al., 2011), substantiate a phylogeny according to which angiosperm diTPSs of specialized metabolism evolved from ancestral genes along several different branches and in several different subfamilies of the TPS family (Fig. 4).
In the transcriptomes of the two gymnosperm species P. amabilis and A. balsamea, we identified four diTPSs (A. balsamea levopimaradiene/abietadiene synthase [AbLAS], A. balsamea isopimaradiene synthase [AbISO], AbCAS, and PxaTPS4) with features of bifunctional class I/II diTPSs of the TPS-d3 group (Fig. 4). Three of these enzymes, AbLAS, AbISO, and AbCAS, were recently functionally characterized (Zerbe et al., 2012a). Bifunctional class I/II diTPSs represent an ancestral form of plant diTPS and have only been found in bryophytes (e.g. Physcomitrella patens CPS/KS [for copalyl diphosphate synthase/kaurene synthase]), the vascular nonseed plant S. moellendorffii, and gymnosperms (Chen et al., 2011). In P. amabilis we also identified three apparently monofunctional gba-domain TPS candidates (PxaTPS1-PxaTPS3). These TPSs contain the DDxxD and NSE/DTE motifs, but lack the DxDD motif, and cluster together with previously characterized gymnosperm bisabolene synthases (Bohlmann et al., 1998b;Martin et al., 2004). This particular class I TPS-d3 group contains diTPS-like sesqui-TPS, which, together with Taxus spp. TXS genes, apparently evolved from bifunctional class I/II diTPSs by loss of class II activity and further neofunctionalization (Fig. 4). Evolution of sesqui-TPS functionality required change of substrate specificity (McAndrew et al., 2011). Our phylogeny suggests that loss of class II activity, followed by g-domain loss further led to the evolution of ba-domain sesqui-TPS as well as casbene synthases (CBS) and related diTPSs with macrocyclic products of the large angiosperm TPS-a subfamily (Fig. 4). Based on the topology of the phylogenetic tree, GGPP-converting macrocyclases appear to have derived from sesqui-TPS through events of reverted evolution with a substrate change from C 15 to C 20 .
Species-specific patterns of repeated diTPS gene duplications and sequence divergence (referred to as "blooms"; compare with Feyereisen, 2011) became apparent for several species of the transcriptome analysis, and most prominently with EpTPS and TwTPS, members of the TPS-a subfamily. At larger taxonomic scales, blooms of angiosperm specialized metabolism diTPSs of the large TPS-c and TPS-e/f subfamilies are rooted in ent-CPS (ECPS) and ent-KS (EKS) genes, respectively, of GA biosynthesis. In contrast, no such blooms are found for gymnosperm diTPSs of the TPS-c and TPS-e/f subfamilies represented in the phylogeny with Picea sitchensis ECPS and EKS genes (Fig. 4).

Phylogenetic Relationships and Blooms of Candidate P450s of the CYP71 and CYP85 Clans
Known P450s of terpenoid biosynthesis do not represent a monophyletic group, but are found across the huge P450 superfamily in seven of the 10 P450 clans of vascular plants: CYP51, CYP71, CYP72, CYP85, CYP86, CYP710, and CYP711 (Nelson and Werck-Reichhart, 2011;Hamberger and Bak, 2013). Since only very few P450s of diterpene metabolism are functionally characterized, it is currently not possible to reconstruct the evolution of P450s of diterpene metabolism. Ancestral P450s of the CYP71 and the CYP85 clans with functions in general diterpene GA biosynthesis and triterpene biosynthesis may have served as progenitors for the evolution of P450s of specialized terpene metabolism, since these two clans show features of blooms of terpenoid P450s (Hamberger and Bak, 2013). The CYP71 and the CYP85 clans have previously been mined for the discovery of P450s of specialized diterpene biosynthesis, specifically . Phylogeny of candidate diTPSs. Maximum likelihood tree, illustrating phylogenetic relationships of diTPS candidates relatively to previously known diTPSs. P. patens copalyl diphosphate synthase/kaurene synthase (PpCPS/KS) was used as the tree root. For abbreviations and accession numbers see Supplemental Table S3. biosynthesis of taxol (CYP725A subfamily of the CYP85 clan; Jennewein and Croteau, 2001;Jennewein et al., 2004), diterpene resin acids (CYP720B subfamily of the CYP85 clan; Ro et al., 2005;Hamberger et al., 2011), and rice diterpene phytoalexins (CYP99, CYP76 and CYP701 families of the CYP71 clan; Swaminathan et al., 2009;Wang et al., 2011Wang et al., , 2012aWang et al., , 2012bWu et al., 2011).
Querying the transcriptomes of the 10 diterpene specialized metabolite producing species, we found a total of well over 1,000 candidate P450s, of which up to 440 genes fall into terpenoid-related subfamilies of the CYP71 and CYP85 clans. Within these clans we identified expansive lineage-specific blooms of P450 subfamilies as shown in Figure 5 for four of the 10 species. These blooms are indicative of roles in specialized metabolism. In T. wilfordii, six members of a novel CYP88H subfamily (CYP85 clan) were found, in addition to the related TwCYP88A (Fig. 5A). While CYP88A genes function in GA biosynthesis and often occur as single copy genes, the TwCYP88H subfamily identified here is distantly related to CYP88D6 from licorice (Glycyrrhiza glabra) involved in triterpene biosynthesis of glycyrrhizin (Seki et al., 2008), indicative of roles in specialized metabolism. In C. forskohlii, we observed a large expansion of the CYP716 family (CYP85 clan; Fig. 5B). Five C. forskohlii CYP716 genes fall into a clade of CfCYP716A subfamily members, two C. forskohlii genes each belong to subfamilies CYP716D and CYP716E, and one C. forskohlii gene represents the CYP716C subfamily. Based on read counts, CfCYP716C1, CfCYP716A3, and CfCYP716A4 were highly overrepresented in the 454-sequenced transcriptome. In the gymnosperm P. amabilis, we found two large clusters of in total 18 members of the CYP76AA subfamily (CYP71 clan) that group closely with P. sitchensis CYP76Z1 (Fig. 5C). Lack of angiosperm representatives in this CYP76 branch indicates a gymnosperm-specific family diversification and suggests that angiosperm CYP76 genes evolved from a common ancestor with the apparently gymnospermspecific CYP76Z and CYP76AA subfamilies. In E. peplus, several closely related subfamilies of the CYP71 clan showed a high degree of expansion (Fig. 5D). A total of 23 E. peplus CYP71 candidates form a largely expanded CYP71D tribe comprising 16 members and a smaller 726A subfamily with seven candidates. In particular, members of the large CYP71D bloom are of interest, with several CYP71D enzymes being involved in terpenoid biosynthesis across the plant kingdom (Ralston et al., 2001;Wüst et al., 2001;Wang and Wagner, 2003).

Functional Characterization of diTPS Candidates
Definitive functional annotation of genes such as diTPSs and P450s that belong to multigene families with divergent functions and large catalytic plasticity of the encoded enzymes requires experimental verification of predicted functions. While attempting this for the several hundred P450s identified here is beyond the scope of the present work, we selected a few diTPS candidates to validate the identification of new diTPSs and the feasibility of both in vitro assays with microbial expressed proteins and in vivo assays based on transient (co)expression in Nicotiana benthamiana using single diTPSs as well as combinations of diTPSs.
We tested three different G. robusta diTPS candidates of Asteraceae-specific subgroups of the TPS-c and TPSe/f subfamilies using in vitro and in vivo assays (Figs. 6B-D and 7A). Concurrent with its position next to Arabidopsis and grapevine (Vitis vinifera) geranyllinalool synthases (GLSs) on a distinct branch of the TPS-f subfamily, GrTPS5 was functionally identified as a GLS by comparison of the enzyme product to authentic geranyllinalool (Fig. 6B). GrTPS1 was identified as a class II LPPS by comparison to the product profile of a previously reported protein variant of AbCAS ( Fig. 6C; Zerbe et al., 2012a). Consistent with the AbCAS variant and other LPP synthases (LPPSs) (Caniard et al., 2012), GC-MS analysis of the dephosphorylated GrTPS1 product showed, next to labd-13en-8,15-diol (i.e. dephosphorylated LPP), a racemic mixture of manoyl oxide and epi-manoyl oxide (Fig.  6C). The latter diterpenes could not be detected without dephosphorylation of the GrTPS1 product, indicating that they represent derivatives formed under GC-MS conditions. While no in vitro or in vivo activity could be detected for GrTPS6 alone, combination of GrTPS6 with GrTPS1 in in vitro and in vivo assays resulted in formation of manoyl oxide with only traces of epi-manoyl oxide (Figs. 6D and 7A), demonstrating a new class I function for GrTPS6 that proceeds via cleavage of the diphosphate group of LPP and subsequent heterocyclic ring closure including the hydroxyl group at C-8.
We also tested the possibility of using combinations of diTPSs from different species. Products with high similarity to pimaradiene and abietadiene were observed, when coupling GrTPS6 with a maize ECPS (Harris et al., 2005) and in trace amounts when coupled with a (+)-CPP-producing variant of Norway spruce (Picea abies) LAS (Zerbe et al., 2012a; Fig. 6D).
In agreement with its close phylogenetic relationship to known Euphorbia esula and Triadica sebifera CBS enzymes of the TPS-a family, heterologous expression of the ba-domain EpTPS3 in N. benthamiana resulted in the formation of casbene (Fig. 7B) as identified by comparison with reference mass spectra (Reiling et al., 2004). E. peplus TPS, a member of the class II TPS-c subfamily, and EpTPS1 and C. forskohlii TPS14, both belonging to the class I TPS-e/f subfamily, were also analyzed using the N. benthamiana in vivo assay system (Fig. 7). While expression of the individual enzymes in N. benthamiana yielded no detectable new products, coexpression of EpTPS7 with EpTPS1 or CfTPS14 resulted in accumulation of ent-kaurene (Fig.  7C) and suggested functions of these three enzymes in GA biosynthesis.

Mining of Plant Biodiversity for Bioproduct Genes and Synthetic Biology
Plants are estimated to produce more than onequarter million different specialized metabolites (Pichersky and Lewinsohn, 2011), which each typically require several unique enzymes for biosynthesis. To accomplish this biosynthetic diversity, nature must have evolved and maintained across the biodiversity of plant species hundreds of thousands of more or less closely related enzymes for specialized metabolism. Discovery and accurate annotation of the corresponding genes of specialized metabolism pose a challenge for high-throughput biology, but if successful also present new opportunities for development of biocatalysts, bioproducts, and synthetic biology. Discovering and exploring the enormous enzyme diversity of plant specialized metabolism on a large scale requires genomic and biochemical research beyond traditional model systems. Annotation of the divergent enzymes of specialized metabolism is further supported by the development of databases that accurately capture the functional space of enzymes in specialized metabolism, where minor sequence variations may substantially affect substrate or product specificity.  Table S4.
In this study we focused on diterpene specialized metabolites for proof-of-concept work in an important and very large class of plant-derived bioproducts. The many roles of plant diterpenes in the natural world and their applications as pharmaceuticals, fragrances, resins, food additives, and other commercial bioproducts have spurred a growing interest in exploring and harnessing the biodiversity of diterpene metabolism. Because diterpene biosynthesis is generally organized with pathways of common modules (Fig. 1), we began to investigate select modules and the gene space and functional diversity within such modules. We reasoned that this approach can provide a rational starting point for a general strategy of enzyme discovery of diterpenoid biosynthesis that could likewise apply to other classes of terpenes, such as the thousands of monoterpenes and sesquiterpenes. The pathways of mono-, sesqui-and diterpenes differ in the Figure 6. In vitro characterization of recombinant diTPSs from G. robusta and P. amabilis. GC-MS traces of reaction products from single or coupled in vitro assays of recombinant proteins with 15 mM GGPP 1 as substrate. Major products are compared with authentic standards or reference mass spectra of the Wiley Registry MS Libraries. A, Diterpene alcohol and olefin formation by PxaTPS4. B, Geranyllinalool production by GrTPS5. C, Formation of LPP with manoyl oxide and epi-manoyl oxide byproducts by GrTPS1 verified through identification of the dephosphorylated reaction product (labd-13-en-8,15-diol). D, Activity of GrTPS6, forming manoyl oxide and traces of epi-manoyl oxide in combination with GrTPS1, and abietadiene when coupled with ECPS (ZmECPS; Harris et al., 2005) or (+)-CPS (PaLAS:D611A; Zerbe et al., 2012a). IS, Internal standard 1.6 mM eicosene; peak a, palustradiene; peak b, levopimaradiene; peak c, abietadiene; peak d, neoabietadiene; peak e/f, epimers of 13-hydroxy-8(14)-abietene; peak g, geranyllinalool; peak h, manoyl oxide; peak i, epi-manoyl oxide; peak j, labd-13-en-8,15-diol; peak k, abietadiene; peak l, pimaradiene-type diterpene. AbCAS:D621A, A. balsamea CAS variant, producing LPP (Zerbe et al., 2012a). chain length of a few common isoprenyl diphosphate substrates at the beginning of each pathway, with GGPP 1 being the common precursor for diterpenes. The first diversifying module of terpenoid biosynthesis is the TPS module, which includes a large number of diTPSs, in addition to hemi-TPS, mono-TPS, and sesqui-TPS (Bohlmann et al., 1998a;Chen et al., 2011). In this work, we identified features of the TPS phylogeny that will guide general identification of diTPSs relative to hemi-TPS, mono-TPS, or sesqui-TPS, and may also provide informed direction for exploring specific functions (see below).
Following the diTPS module, the P450 superfamily is a huge reservoir for enzymes that may modify diTPS products and provide pathway end products or intermediates for further functional decoration. Compared with the diTPSs, much less is known about the P450 module of diterpene biosynthesis, but our work highlights that plant species with diterpene specialized metabolism show blooms of P450s within clans, families, or subfamilies of the P450 superfamily that are promising candidates for terpene metabolism. The functional characterization of any P450 in terpene biosynthesis is a difficult task, often limited by availability of authentic metabolite standards and daunting in the face of the large number of putative candidate genes. Developing in vivo systems for producing relevant diterpene substrates by metabolic engineering of diTPSs and combinations thereof in heterologous hosts provides platforms in which candidate P450 genes can subsequently be tested. The usefulness of platform diTPS expression systems for P450 discovery has been demonstrated in previous work (Ro et al., 2005;Swaminathan et al., 2009;Hamberger et al., 2011;Wang et al., 2011Wang et al., , 2012aWang et al., , 2012bWu et al., 2011), and we have extended here on the development of diTPSexpressing platforms using both yeast (Saccharomyces cerevisiae) and plant expression systems, which will be suitable for additional expression of P450s. Selecting candidate P450s from expanded lineage-specific tribes and subfamilies within relevant clades of the P450 superfamily greatly reduces the number of candidates to be tested in a first screen with in vivo platform expression systems. This approach of selecting candidates from lineage-specific P450 blooms proved successful in the past for the discovery of a new gymnosperm subfamily of CYP720B genes of the Pinaceae family that produce a diverse array of diterpene resin acids (Hamberger et al., 2011), and should now be applicable to other systems based on mining of deep transcriptome resources from diterpene-producing tissues.
Our approach of individual and combinatorial diTPS expression also already moved beyond proof of concept to the identification of diTPS functionalities, which have not previously been described, as illustrated here by the stereospecific formation of manoyl oxide through sequential activity of GrTPS1 (class II) and GrTPS6 (class I). The particular functions of manoyl oxide formation with combined GrTPS1 and GrTPS6 activity can now be utilized for developing diterpene platforms for additional gene discovery and production of hydroxylated diterpene therapeutics, such as forskolin 8. We also showed here the utility to explore substrate promiscuity and new diTPS combinations with GrTPS6, which also converted ent-CPP and to a lesser extent (+)-CPP into pimaradiene-and abietadiene-type diterpenes, when combining this class I diTPS with different class II diTPSs of other species.

A Functionally Informed and Informative Phylogeny of Plant diTPS
We showed here that plant diTPSs are not restricted to one or a few subfamilies of the TPS family (Chen et al., 2011), but are found in several major subfamilies, specifically TPS-a, TPS-c, TPS-d, TPS-e/f, and TPS-f. diTPSs are not known in the TPS-b and TPS-g subfamilies. diTPSs have evolved as mono-or bifunctional enzymes of different domain structures. Nevertheless, all plant diTPSs can be traced back to a putative bifunctional three-domain class I/II diTPS ancestor, which may be most closely resembled in extant plants by the CPS/KS enzyme of the moss P. patens (Fig. 4). Our analysis of diTPSs of different TPS subfamilies highlights a few vignettes of the TPS phylogeny as it pertains to evolution of diTPS functions. The resulting functionally informed diTPS phylogeny will be useful for directing experimental validation of new candidate diTPSs as shown with a few select examples in this study.
Originating from an ancestral PpCPS/KS-type bifunctional diTPSs, class I/II diTPSs of the gymnosperm specific TPS-d3 clade are a major relatively basal group of the TPS family. LAS and ISO functionality appears to be a more ancient function of the TPS-d3 group. This interpretation is supported by the functional characterization of PxaTPS4, described here, as a functional ortholog to known LAS (Peters et al., 2000;Martin et al., 2004;Keeling et al., 2011a;Zerbe et al., 2012a;Hall et al., 2013), suggesting that LAS functionality was established prior to the divergence of different genera in the Pinaceae. From this basal function evolved, within the gymnosperm TPS-d3 group, monofunctional class I diTPSs (represented with TXS) and sesqui-TPSs of the bisabolene synthase type. Descendants of the TPS-d3 group also include the many gymnosperm mono-TPSs and sesqui-TPSs of the TPS-d1 and TPS-d2 clades (not shown in Fig. 4; see Keeling et al., 2011a). Also branching out from the gymnosperm TPS-d3 group are the large TPS-a group, which includes angiosperm diTPS and sesqui-TPS functions (Fig. 4). In contrast to the relationship between blooms of the gymnosperm TPS-d subfamily and the angiosperm TPS-a subfamily, blooms of angiosperm diTPSs of TPS-c and TPS-e/f subfamilies are not paralleled by similarly expanded families in the gymnosperms. Blooms of angiosperm diTPSs of TPS-c and TPS-e/f subfamilies appear to have evolved through repeated duplication and neofunctionalization of ancestral ECPS and EKS of GA biosynthetic diTPSs (Peters, 2010). In contrast, the bona fide ECPS and EKS genes of gymnosperm GA biosynthesis, positioned at the base of the TPS-c and TPS-e/f subfamilies respectively, appear to be resistant to diversification and are retained as orphans in "single-copy families" (compare with Keeling et al., 2010;De Smet et al., 2013).
Across the 10 species of our analysis we found blooms of CBS-like diTPSs of the TPS-a subfamily in E. peplus, J. gossypiifolia (Euphorbiaceae), and T. wilfordii (Celastraceae) with more than 20 genes (Fig. 4). Based on the relatedness with Ricinus communis CBS (Mau and West, 1994) and E. esula CBS (Kirby et al., 2010), and considering the macrocyclic diterpene biochemistry of E. peplus and J. gossypiifolia (Fig. 1), these genes are good candidates for discovery of macrocyclases; indeed we verified CBS macrocyclase activity for EpTPS3 (Fig. 7B). The diTPS phylogeny also showed emerging patterns of functional clades within the TPS-c and TPS-e/f subfamilies. Within the TPS-c subfamily, ECPS genes of GA biosynthesis cluster together, and in the TPS-e/f subfamily EKS and other KS-like genes cluster together. The informative nature of clustering was supported with functional characterization of EpTPS7 as an ECPS in the TPS-c subfamily, and functional characterization of EpTPS1 and CfTPS14 as EKS in the TPS-e/f subfamilies (Fig. 7C). A separate functionally informative cluster of specialized metabolism diTPSs also emerged around a scaffold of three different previously described LPPSs (Falara et al., 2010;Caniard et al., 2012;Sallaud et al., 2012), indicating that new candidate class II diTPS that are situated together with these genes may also encode LPPS or related functions and are therefore primary targets for roles in the biosynthesis of hydroxylated diterpenoids, such as marrubiin 7, forskolin 8, and oridonin 5 in M. vulgare, C. forskohlii, and I. rubescens, respectively. Within the TPS-e/f subfamily, a few candidate diTPS from M. vulgare, C. forskohlii, and I. rubescens form a distinct group of class I diTPSs with functionally characterized SsSS (Caniard et al., 2012) and SmMS (Gao et al., 2009). These Lamiaceae genes show an unusual ba-domain architecture that can be explained by loss of the g-domain (Fig. 4), as previously reported for a diTPSs from wheat and S. miltiorhizza . This cluster of Lamiaceae diTPSs is likely to be involved in specialized metabolism, given its distinct separation from gba-domain class I enzymes and the high abundance of the corresponding transcripts in the target tissue-specific transcriptomes of M. vulgare, C. forskohlii, and I. rubescens. With G. robusta TPS5, we functionally identified the first Asteraceae member of the TPS-f family of GLSs (Fig. 6B) in addition to known GLSs from Arabidopsis, grapevine, and N. attenuata (Herde et al., 2008;Martin et al., 2010;Jassbi et al., 2008). This group also contains linalool synthase from Clarkia brewerii (Dudareva et al., 1996), which may have evolved from a GLS function by a simple change of substrate specificity, but retaining formation of an acyclic product.
Despite the emerging patterns of the diTPS phylogeny, which provides functional information, a priori distinction of diTPSs of general and specialized metabolism within the TPS-c and TPS-e/f clades is still not trivial. This is due, in part, to the fact that similar diTPS functions may have evolved in parallel in separate taxonomic groups and the corresponding genes may cluster by taxonomic rather than functional relatedness. For example, we functionally identified G. robusta GrTPS1 as an LPPS ( Fig. 6C; Fig. 7A) despite its closer relationship to other diTPSs of the Asteraceae than to LPPSs from other species.

Clades of Terpenoid-Related P450s
The enormous size of the P450 superfamily (Nelson and Werck-Reichhart, 2011;Hamberger and Bak, 2013) with very few functionally characterized P450s of diterpene biosynthesis provides a challenge for a priori gene annotations. However, some general observations may guide selection of candidate P450s from transcriptome or genome sequences. First, the majority of P450 families involved in specialized terpene metabolism belong to the CYP71 and CYP85 clans. Second, emergence of lineage-specific P450 blooms (in the P450 nomenclature system typically named as subfamilies) in plant species or families with unique and diverse diterpene metabolism may be indicative of specialized P450 functions as was shown with the CYP720B subfamily of diterpene resin-producing gymnosperms of the Pinaceae family (Hamberger et al., 2011). Such blooms can distinguish genes of potentially similar functions in general metabolism, as was also shown with the CYP720B example that was clearly distinct from the CYP701A subfamily of entkaurene oxidation in GA biosynthesis. Here, we found several species-specific blooms within the CYP71 clan with P450 candidates from P. amabilis and E. peplus, which are now being explored as candidates in the biosynthesis of pseudolaric acid 2 and ingenol-3angelate 10 (Fig. 5). Similarly, blooms within the CYP85 clan were observed with a diversified CYP88H family from T. wilfordii and a bloom of highly abundant members of the CYP716 subfamily in C. forskohlii that may include candidates with possible functions in triptolide 4 and forskolin 8 biosynthesis, respectively.

General Applications and Conclusion
We developed an approach that identified nearly 500 diTPS and P450 candidate genes from 10 plant species of five different gymnosperm and angiosperm families. Our strategy combined tissue-specific metabolite analyses with development and mining of deep transcriptome resources using customized reference sequence databases for relevant target gene modules (diTPSs and P450s). Informative phylogenies, from which general patterns of evolution of gene functions and lineage-specific blooms could be inferred, proved useful to guide functional annotation as shown for a set of diTPSs using both in vitro and in vivo analyses of functions. We also showed the potential for developing combinatorial platforms for heterologous production of natural and nonnatural diterpenes, which we had previously proposed as a concept for synthetic biology of plant bioproducts (Facchini et al., 2012). The results of this study provide a resource for much future work in the species of our focus, which are broadly of interest because of the various bioproduct applications of their diterpene metabolites. The diTPS and P450 query databases used in this work, the approaches of identifying candidate genes from functionally informative gene family phylogenies, and the in vivo and in vitro expression systems can be applied to other plant species that produce interesting diterpenes, where scale of investigation can be focused on one particular species of interest or can be a broad-scale search for new enzymes across the plant kingdom. The general model of the present work can also be applied to other classes of terpenoids.
We showed that confirming presence of target compounds by metabolite profiling of different tissues provided confidence in the choice of tissues selected for transcriptome analysis. Since metabolite transport and accumulation in tissues other than their biosynthetic origin cannot be ignored, assessing transcriptomes for core precursor pathways provided additional confidence in the assembled transcriptome resources, as shown here with the complete coverage of all steps of the MVA and MEP pathway genes in each transcriptome paralleled with large numbers of relevant diTPS and P450 candidates. We found that combining Illumina and 454 assemblies was advantageous as it expanded the recovery of FL assemblies with different read-depth distribution from both sequencing platforms; however, read-length and -depth advantages of different high-throughput sequencing are rapidly changing and will provide continuously improving coverage for targeted gene discovery. The development and use of nonredundant diTPS and P450 databases, which encompass the functional space of these gene families was most important as it allowed efficient query of large transcriptome datasets. In all 10 plant species (except for R. officinalis, showing a lower assembly quality) multigene diTPS and P450 families were identified, supporting the identification of patterns of lineage-specific expansion and diversification of the diTPS-and P450-ome in species of different angiosperm and gymnosperm families. Blooms of diTPSs and P450s in species producing diverse diterpene metabolites are hot spots for functional gene identification. Finally, reliable functional annotation is a key challenge in the discovery of diTPS and P450 genes, given the functional plasticity and high sequence similarity within these large gene families. For example, with diTPSs, change of function as a result of small changes in the active site composition has been illustrated (e.g. Wilderman and Peters, 2007;Xu et al., 2007b;Keeling et al., 2008;Morrone et al., 2008;Leonard et al., 2010;Zhou and Peters, 2011;Criswell et al., 2012;Gao et al., 2012;Zerbe et al., 2012b). However, emerging patterns of functional evolution, such as changes in domain architecture and presence or absence of active side signature motifs, can now guide functional annotation and characterization of new diTPSs.

Plant Collection
Abies balsamea and Pseudolarix amabilis were purchased from Arbutus Grove Nursery and Forestfarm. Both species were obtained as 2-year-old trees and maintained in University of British Columbia greenhouses. Tripterygium wilfordii was provided by the University of British Columbia Botanical Garden. Seeds of Coleus forskohlii, Rosmarinus officinalis, Isodon rubescens, Marrubium vulgare, Euphorbia peplus, Jatropha gossypiifolia, and Grindelia robusta were obtained from B&T World Seeds and Horizon Herbs, and cultivated under long-day conditions (16-h light; day/night temperature 22°C/25°C) for 2 to 4 weeks and further maintained under greenhouse conditions.

Diterpene Profiling
For the isolation of diterpene metabolites, tissues were ground in liquid nitrogen and extracted at room temperature using suitable solvents. Extracts were filtered through cheesecloth, freed from surplus water by adding anhydrous Na 2 SO 4 , and reduced under nitrogen stream to approximately 1 mL. Further chromatographic purification on silica, alumina, or charcoal material was applied as required. Samples were filtered through 0.22-mm GHP membrane filters (www.pall.com) and analyzed by LC-MS or GC-MS with metabolite identification by comparison with authentic standards and reference mass spectra from the Wiley Registry Mass Spectral Libraries. Detailed extraction and analytical procedures are given in Supplemental Materials and Methods S1.

RNA Isolation, cDNA Library Construction, and Transcriptome Sequencing
RNA was isolated as previously described (Kolosova et al., 2004) using 50 to 150 mg of tissue. High RNA integrity was verified using Bioanalyzer 2100 RNA Nano-chip assays (www.chem.agilent.com). Construction of nonnormalized cDNA libraries and transcriptome sequencing was performed at the McGill University and Génome Québec Innovation Centre. cDNA libraries for 454 sequencing were constructed from 200 ng of fragmented mRNA using the cDNA Rapid Library Preparation kit, GS FLX Titanium series (www. roche.com). After yield and fragment size validation on a 6,000 Nano-chip (Agilent), 200 ng of prepared libraries were subject to a one-half-plate reaction of 454 pyrosequencing with the Roche GS FLX Titanium platform. For Illumina sequencing, cDNA libraries were prepared from 10 mg of total RNA using the mRNA Seq Sample Preparation Kit (www.illumina.com), with normalization of mRNA amounts to 100 ng prior to library construction. Yields and correct fragment sizes were assessed using a high sensitivity DNA chip (Agilent). Sequencing was performed from 7 pmol of each library using 108 bp (GAIIx) or 100 bp (HiSeq) pair-ended runs.
De Novo Transcriptome Assembly 454 reads were cleaned to remove adapter sequences and 15 bp at front ends of sequences using FastQC (www.bioinformatics.babraham.ac.uk/projects/ fastqc/). Flowgram files were clipped with tools provided by Roche. The resulting high-quality reads were assembled with the "cdna" switch paired with 45-bp minimum read length using the Roche Newbler 2.6 genome assembler. Illumina assemblies were conducted with the Trinity de novo assembler (Grabherr et al., 2011) after cleaning of sequence reads with a custom perl script, allowing trimming of Illumina reads in fastq format and further removal of 15 bp at front ends of sequences. Mapping of raw Illumina reads to respective assemblies was achieved using Burrows-Wheeler Aligner (Li and Durbin, 2009) with default parameters.

Terpenoid Pathway Gene Discovery
Functional annotation of terpenoid backbone biosynthetic genes was conducted using a BLASTx approach against a manually curated version of the terpenoid backbone biosynthesis pathway (ec00900) from the KEGG database with exclusion of duplicates. Assemblies were screened with an E-value cutoff of 1e -20 . For identifying diTPS and P450 candidates, custom databases (diTPS, Supplemental Data S11; P450, Supplemental Data S12) were created based on publicly available protein sequences that represent minimal nonredundant sequence sets, resembling the functional range of TPS and P450 genes. Gene mining of significant candidates from generated assemblies was performed using tBLASTx with an E-value threshold of 1e -50 and a minimum read length of 150 amino acids.

cDNA Cloning
Cloning of FL cDNAs was conducted using gene-specific oligonucleotides (Supplemental Table S5) or, where required, through completion of 59-and 39sequences by RACE using the SMARTer RACE cDNA amplification kit (www.clontech.com) and touchdown PCR from cDNA prepared with the SuperScript III First Strand Synthesis kit (www.invitrogen.com) and random hexamer oligonucleotides. For expression in Escherichia coli, obtained amplicons were ligated into pJET (Clontech), sequence verified, and subcloned into the pET28b(+) expression vector (www.emdmillipore.ca). For alternative expression in Nicotiana benthamiana, genes were amplified from FL cDNA clones using PfuX7 polymerase (Nørholm, 2010) and cloned into pCAMBIA130035Su as described earlier (Nour-Eldin et al., 2006).

Expression of diTPS in E. coli and diTPS in Vitro Assays
Recombinant proteins were expressed in E. coli BL21DE3-C41 and Ni 2+ -affinity purified as described elsewhere (Zerbe et al., 2012b). In vitro enzyme assays were conducted as previously reported (Zerbe et al., 2012b) in the form of single or coupled assays using 50 mg of purified protein (50 mg each for coupled assays) and 15 mM of GGPP 1 (Sigma), with incubation for 1 h at 30°C and extraction of reaction products with 500 mL pentane prior to GC-MS analysis. The detection of diphosphate intermediates required treatment with 10 units of calf intestinal phosphatase (Invitrogen) for 16 h at 37°C prior to GC-MS analysis. Detailed procedures are given in Supplemental Materials and Methods S2.

Expression of diTPS in N. benthamiana and diTPS in Vivo Assays
For expression in N. benthamiana, diTPS candidates were cloned into the pCAMBIA130035Su vector (Nour-Eldin et al., 2006), and transformed into Agrobacterium tumefaciens strain GV3850 by electroporation. The resulting strains were grown at 28°C in Luria-Bertani media supplemented with 50 mg L -1 kanamycin, 50 mg L -1 ampicillin and 34 mg L -1 rifampicin. After harvest, cells were resuspended to a final optical density at 600 nm of 2 in 10 mM MES buffer with 10 mM MgCl 2 and 100 mM acetosyringone. Following 60-min shaking incubation at 50 rpm and room temperature, strains were mixed with 0.25 volume of the P19 suppressor strain (Voinnet et al., 2003) for singleconstruct transformations, or using equal volumes for combinatorial expression of two diTPSs with 0.25 volume of P19 before infiltration into the underside of 5-week-old N. benthamiana leaves. Plants were incubated 4 d prior to metabolite extraction with 1 mL hexane and GC-MS analysis as described in Supplemental Materials and Methods S2.

Supplemental Data
The following materials are available in the online version of this article.
Supplemental Figure S1. Average contig length distribution of assemblies generated from 454-and Illumina-derived transcriptomes.
Supplemental Figure S2. Discovery of diTPSs and P450s in the transcriptome assemblies.
Supplemental Table S1. Summary of Roche 454 and Illumina GAII and HiSeq transcriptome sequencing results.
Supplemental Table S2. Comparison of 454 and Illumina assemblies to the Arabidopsis (Arabidopsis thaliana) CEGMA data set.
Supplemental Table S3. Abbreviations and accession numbers of diTPSs used for phylogenetic analyses in Figure 4.
Supplemental Table S4. Abbreviations and accession numbers of P450s used for phylogenetic analyses in Figure 5.
Supplemental Table S5. Oligonucleotides used in PCR procedures described in Figures 6 and 7.
Supplemental Data S1. De novo 454 and Illumina assemblies of A. balsamea bark tissue. Supplemental Data S10. De novo 454 and Illumina assemblies of I. rubescens leaf tissue.
Supplemental Materials and Methods S1. Detailed description of extraction and metabolite analysis procedures used for the identification of the respective diterpenes targeted in this study.
Supplemental Materials and Methods S2. Detailed description of GC-MS analyses used for in vitro and in vivo functional characterization of diTPSs as illustrated in Figures 6 and 7.