Plant Physiology 141:811-824 (2006)
© 2006 American Society of Plant Biologists
Identification of Genes with Potential Roles in Apple Fruit Development and Biochemistry through Large-Scale Statistical Analysis of Expressed Sequence Tags
Pomology Group (S.P., S.v.N.) and Postharvest Technology and Physiology Laboratory (N.S., R.B.), Horticulture Department; and Bioinformatics Core, Research Technology Support Facility (M.D.L.), Michigan State University, East Lansing, Michigan 48824
Advanced studies of apple (Malus domestica Borkh) development, physiology, and biochemistry have been hampered by the lack of appropriate genomics tools. One exception is the recent acquisition of extensive expressed sequence tag (EST) data. The entire available EST dataset for apple resulted from the efforts of at least 20 contributors and was derived from more than 70 cDNA libraries representing diverse transcriptional profiles from a variety of organs, fruit parts, developmental stages, biotic and abiotic stresses, and from at least nine cultivars. We analyzed apple EST sequences available in public databanks using statistical algorithms to identify those apple genes that are likely to be highly expressed in fruit, expressed uniquely or preferentially in fruit, and/or temporally or spatially regulated during fruit growth and development. We applied these results to the analysis of biochemical pathways involved in biosynthesis of precursors for volatile esters and identified a subset of apple genes that may participate in generating flavor and aroma components found in mature fruit.
Cultivated apple (Malus domestica Borkh) is among the most diverse and ubiquitously cultivated fruit species. Apple is a member of the Rosaceae family, which includes many commercial fruit (e.g. pear [Pyrus communis], strawberry [Fragaria spp.], cherry [Prunus avium], peach/nectarine [Prunus persica], apricot [Prunus armeniaca]), nut (almond [Prunus amygdalus]), forest (black cherry [Prunus serotina Ehrh.]), and ornamental (rose [Rosa hybrida], crab apple [Malus coronaria]) species. In the United States alone, apple production is worth approximately $1.6 billion annually, and Rosaceous fruits, collectively, are the most economically important fruit crops (U.S. Department of Agriculture, National Agricultural Statistics Service, Noncitrus Fruits and Nuts 2004 Summary; http://www.nass.usda.gov/). In spite of its importance to agriculture and its pervasive role in human health, relatively little is known about apple fruit development, physiology, and biochemistry. This lack of knowledge has contributed to perpetual difficulties in breeding, production, and storage. Unlike tomato (Lycopersicon esculentum Mill.), which has emerged as a model for fruit growth and development, there have been relatively few molecular-oriented studies of apple. However, such studies would be expected to yield novel insights into fruit biology. Like tomato, apple is a climacteric fruit, with a clear respiratory climacteric and ethylene peak associated with ripening. However, unlike tomato, which is a true berry fruit, the majority of apple fruit is derived from proliferated receptacle tissues, with the ovary-derived tissues restricted to the center of the mature fruit (core). The skin (epidermal and subepidermal cell layers) is strikingly different from the cortex, and, in almost all apple varieties studied thus far, biosynthesis of pigments and most volatile esters associated with aroma is concentrated in epidermal and subepidermal tissues.
Pigmentation is dominated by anthocyanins (compared with carotenoids in tomato), and tomato fruit is not known to synthesize volatile esters. Compared with most fruits, apple has an extremely long developmental sequence, often exceeding 150 d. Although ripening in apple is accompanied by changes in texture, it is one of the few commercially important fruits that undergo significant softening only after extended storage and deterioration. An additional attraction of apple for studies of fruit biology is the enormous diversity in fruit-related traits among the large number of cultivars and related wild genotypes (>3,000) available for analysis.
Advanced studies of apple development, physiology, and biochemistry have been hampered by the general lack of appropriate genomics tools. An exception is the recent generation and public release of extensive expressed sequence tag (EST) data. Public sequence databanks contain approximately 200,000 apple EST sequences. The majority of the ESTs were contributed through large-scale sequencing efforts in the United States and New Zealand (Korban et al., 2004
Analysis of ESTs provides a powerful complement to genome sequencing for model plants and the primary tool for gene discovery in many plant species of agronomic and economic interest. For example, in apple, Newcomb et al. (2006) Fruit tissues are well represented in the current apple EST collections (approximately 40% of sequences; Table I). This offers the opportunity to apply EST frequency analysis to a nonmodel fruit and to explore the validity of this approach where data are derived from heterogeneous sources. In this study, we performed an extensive analysis of apple EST sequence data found in public databanks as a first step in identifying genes with important functions in apple fruit development, including the biosynthesis of volatile esters during ripening.
We assembled the approximately 200,000 apple sequences found in public sequence databanks into approximately 23,000 contiguous sequences (clusters containing more than one EST) and approximately 21,000 singletons (solitary or nonclustered ESTs; Table II). The number of unique sequences derived from the combined set of clusters and singletons (approximately 44,000) is similar to the number of unique sequences (approximately 43,000) determined by Newcomb et al. (2006)
The number of unique sequences identified in our study is an overestimate of the number of expressed genes sampled because clusters or singletons representing the same gene may not overlap and distinct clusters or singletons may represent alternatively transcribed or processed mRNAs originating from the same gene. Also, because the ESTs analyzed in this study were derived from several cultivars and because cultivated apple is highly heterozygous, there is the potential for the degree of polymorphisms to exceed the sequence match stringency utilized for cluster analysis, leading to the clustering of sequences from distinct alleles of the same gene into separate clusters. We reasoned that if allelic divergence was a significant factor contributing to cluster number, then this should be evident by frequent occurrence of homologous pairs of clusters each containing ESTs derived from only one cultivar. To explore this, we examined clusters of ESTs originating from young fruit or leaves, two tissue sources that were each represented by non-normalized libraries from both the Royal Gala and GoldRush cultivars (Supplemental Table I). We used a stringent statistical test (see "Materials and Methods") to identify bias in cultivar-associated EST representation, then eliminated from consideration unbiased clusters and biased clusters containing ESTs derived from both cultivars. The 14,447 ESTs from young fruit were assembled into 7,439 clusters, of which 71 and 49 contained significant numbers of ESTs originating only from Royal Gala or GoldRush, respectively. Of these, 50 and 20, respectively, were found to contain ESTs from the other cultivar when the entire EST collection was considered, indicating that they did not represent a cultivar-specific gene or allele. We used the remaining 21 and 29 clusters as queries in BLAST analyses interrogating the entire clustered EST collection. 
Only two and four clusters from Royal Gala and GoldRush, respectively, were related to clusters specific to the other cultivar. Similarly, of the 6,354 clusters assembled from 12,677 leaf ESTs, we identified six and 23 clusters containing significant numbers of ESTs originating only from Royal Gala or GoldRush, respectively. Of these, three and 16, respectively, were found to contain ESTs from the other cultivar when the entire EST collection was considered. None of the remaining clusters was related to clusters specific to the other cultivar. From these results, we concluded that polymorphism was not a significant factor in clustering in this study and that allelic variants were typically clustered together. The cultivar-specific clusters that we identified may correspond to genes that are differentially expressed between cultivars, an idea that we evaluated further below. Unique sequences in our clustered set were used as queries in homology searches of protein and derived protein databanks. Approximately 40% of unique sequences did not exhibit significant homology with currently cataloged sequences from plant or other genomes (data not shown), suggesting that the respective apple tissues may represent a rich source of novel plant genes.
As a first step in characterizing the transcriptional profile of fruit, we used EST frequency analysis to identify potential highly expressed genes in fruit tissues. This analysis collectively considered 19 non-normalized cDNA libraries derived from various stages of fruit development (very young to mature) and various fruit parts (core, cortex, and skin/peel; Supplemental Table I).
ESTs generated from these libraries comprised a total of 15,000 unique sequences, including about 10,300 clusters, with about 40% of clusters also including EST representatives from nonfruit tissues (see below). Clusters contained up to 284 fruit-derived ESTs, or about 0.7% of the fruit EST pool (Supplemental Table II), with the median cluster containing 39 ESTs. We found that the top second percentile of clusters in terms of number of representative ESTs (282 clusters) contained 24% of all fruit ESTs. A large proportion of these highly represented clusters (27 clusters [about 10%]) did not exhibit significant homology with the sequenced nuclear genome of the reference plant Arabidopsis (Arabidopsis thaliana), but instead were closely related to segments of sequenced plastid or mitochondrial genomes (Supplemental Table II). We presumed that these high-abundance sequences resulted from organellar DNA or RNA contamination of the RNA sources utilized in library construction and, therefore, these were eliminated from further analyses. We also found an additional significant proportion of clusters (9%) that did not exhibit sequence similarity with any currently cataloged sequence, were enriched in A/T nucleotides or contained extended A/T tracts, and/or did not contain an appreciably long open reading frame (ORF). Such clusters may also have originated from high-copy/repetitive apple nuclear genomic or organellar DNA or RNA or from uncharacterized microbial or viral sequences. An alternative explanation is that these could represent authentic products of transcription of nonprotein-coding genes or intergenic sequences (Yamada et al., 2003
Of the 20 most abundantly represented clusters (Supplemental Table II), none represents genes whose function in relation to fruit biology is unambiguously known. The most highly represented cluster, MD249100, is very closely related to THI4, an Arabidopsis gene that participates in thiamine synthesis (Machado et al., 1997 For each highly represented cluster, we also evaluated EST frequency in nonfruit-derived tissues (see below), as well as frequency of ESTs representing closely related sequences (paralogous groups), which may define functionally homologous genes (Supplemental Table II). Resulting data for the clusters were analyzed by k-means grouping (Fig. 1) to identify clusters with similar general frequency profiles across fruit-derived and nonfruit-derived EST sources. We found that there was a high degree of overlap between clusters represented at high frequency by fruit-derived ESTs and those that were represented preferentially by fruit ESTs, such that only a minority of highly represented clusters (Fig. 1, group 1; 87 of 282 clusters [approximately 30%]) were also highly represented by ESTs originating outside of fruit tissues. Genes corresponding to this group likely carry out functions important for both fruit-associated and nonfruit-associated tissues. Examples of group 1 clusters included those representing basic metabolic enzymes (MD249140, corresponding to the small subunit of Rubisco, and several clusters corresponding to chlorophyll a/b-binding proteins) and structural proteins (MD012900, corresponding to histone H2B, and several clusters corresponding to tubulin), among other functional classes.
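The abundance ranking used in this part of the analysis can be illustrated with a short sketch. It is not the pipeline used in the study: the function name `top_share` and the toy counts are hypothetical, and a cluster's abundance is taken simply as its number of fruit-derived ESTs.

```python
def top_share(est_counts, percent=2):
    """Return (n_top, share): the number of clusters in the top `percent`
    of clusters ranked by EST count, and the fraction of all ESTs they hold."""
    ranked = sorted(est_counts, reverse=True)
    # Integer arithmetic avoids floating-point surprises when sizing the top set.
    n_top = max(1, len(ranked) * percent // 100)
    return n_top, sum(ranked[:n_top]) / sum(ranked)

# Toy data (hypothetical): two very abundant clusters among 100.
counts = [300, 200] + [5] * 98
n_top, share = top_share(counts)
print(n_top, round(share, 3))
```

Applied to real per-cluster fruit EST counts, the same calculation would give the share of the fruit EST pool held by the most highly represented clusters.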
The majority of clusters showed significantly higher representation by fruit-derived ESTs as compared with nonfruit-derived ESTs (Fig. 1, groups 2-5). Within this subset, clusters in group 2 (approximately 37% of total) either did not exhibit significant sequence homology with other apple sequences or were closely related to clusters only infrequently represented in apple EST libraries, suggesting that they may define genes with essentially nonredundant functions of special relevance for the fruit. A sample cluster, MD170530, corresponds to a gene whose closest homolog in Arabidopsis (E = 3e-59) is BOR1, encoding a protein required for xylem loading of boron in roots (Takano et al., 2002
Clusters included in another profile class (Fig. 1, group 3) showed homology with other clusters that also exhibited significantly higher representation in libraries from fruit tissues relative to nonfruit tissues. These clusters and their relatives may define paralogous groups of genes with emphasized function in fruit. Examples in this group are MD024600 and MD024610, which show very high sequence homology with a reported progesterone 5-
Clusters in another profile subset (Fig. 1, groups 4 and 5) were not strongly represented in nonfruit-derived libraries but were closely related to clusters represented with high frequency in nonfruit-derived libraries. Thus, group 4 and 5 clusters may define fruit-specific representatives of paralogous groups of genes with biochemical roles that are not limited to fruit. An example in this group is MD187410, encoding a likely lipoxygenase (LOX). Although MD187410 was represented almost exclusively by fruit-derived ESTs, we identified an additional 14 closely related clusters containing ESTs from at least 18 fruit and nonfruit libraries. LOX in vegetative tissues is best characterized in response to wounding through its role in biosynthesis of jasmonic acid (Liechti and Farmer, 2003
To expand this analysis and identify additional genes with potential fruit-specific roles, we identified all clusters that had statistically higher representation of ESTs from fruit-derived libraries as compared with nonfruit-derived libraries, regardless of their absolute frequency (Supplemental Table I). Using stringent selection criteria (see "Materials and Methods"), we identified 714 clusters overrepresented by fruit ESTs. Conversely, we identified 345 clusters as underrepresented by fruit ESTs (Supplemental Table III). Of these 1,059 clusters, 349 did not exhibit significant similarity (<1e-10) with any sequence cataloged in current databanks. In fruit EST-overrepresented clusters, sequences classified as unknown were proportionally much higher than in the fruit EST-underrepresented clusters (Supplemental Table III; data not shown), perhaps reflecting the poor representation of sequences derived from fleshy fruits in current databases.
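The selection criteria themselves are given in "Materials and Methods" and are not reproduced here. As a stand-in, the sketch below shows one common way to ask whether a cluster is overrepresented by fruit ESTs: a one-sided Fisher's exact test, computed as the upper tail of the hypergeometric distribution. The function name, its parameters, and the example counts are all hypothetical, and this test is an assumption, not necessarily the one applied in the study.

```python
from math import comb

def fisher_enrichment_p(k, n_fruit, K, n_total):
    """One-sided Fisher's exact test (hypergeometric upper tail).

    k        ESTs of the cluster that come from fruit libraries
    n_fruit  total ESTs from fruit libraries
    K        total ESTs in the cluster (fruit + nonfruit)
    n_total  total ESTs in all libraries
    Returns P(X >= k) if the K cluster ESTs were drawn at random from the pool.
    """
    denom = comb(n_total, K)
    upper = min(K, n_fruit)
    tail = sum(comb(n_fruit, i) * comb(n_total - n_fruit, K - i)
               for i in range(k, upper + 1))
    return tail / denom

# Invented example: a 30-EST cluster with 25 fruit ESTs, in a pool of
# 200,000 ESTs of which 80,000 come from fruit libraries.
p = fisher_enrichment_p(25, 80_000, 30, 200_000)
print(p < 0.01)
```

Underrepresentation could be assessed the same way from the lower tail, P(X <= k). When thousands of clusters are tested, a correction for multiple testing would normally be applied on top of the per-cluster P value.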
To help relate these data to gene function, we assigned functional categories to those clusters that were closely related to known genes (see "Materials and Methods"). For the majority of functional classifications, representation by clusters from the fruit-overrepresented and fruit-underrepresented classes was not significantly different. However, fruit-overrepresented clusters showed a marked enrichment for functions related to transcription and signal transduction (Fig. 2A). Analysis of clusters in these categories revealed that several were closely related to genes previously implicated in pathogen response and ethylene signaling. For example, MD094540 and MD116370 were very closely related to genes encoding ethylene-responsive, Ras-related GTP-binding proteins of the RAB8/ARA-3 and RAB11 classes, respectively, from tomato (Zegzouti et al., 1999
To further help characterize function, we subjected fruit-overrepresented clusters to k-means analysis based on EST representation from various fruit libraries. For this analysis, the 19 non-normalized fruit cDNA libraries were assigned to eight classes according to stage and tissues (Supplemental Table I). The clusters were then grouped into eight different frequency profiles (Supplemental Table III; Fig. 2B). We noted that the vast majority of the fruit-overrepresented clusters exhibited an apparent specificity of EST representation according to stage and/or tissue (Fig. 2B). To lend support for this observation, we subjected the clusters to statistical analyses (see "Materials and Methods") to identify those that were significantly more or less represented in any one of the eight EST sources. Of the 714 clusters identified as fruit overrepresented, 573 showed statistically significant (P < 0.01) library-associated frequency, suggesting that the corresponding genes may act in a temporal and/or tissue-specific pattern during fruit development. Clusters found in group 1 (young fruit libraries) were generally not statistically well supported (median P value = 0.058); this was at least partly a result of the greater complexity of the source libraries and consequent lower proportional EST representation (Supplemental Table III) rather than of the occurrence of large numbers of representative ESTs in other libraries. Of the remaining, statistically supported clusters, those in group 7 (predominantly 126 DAFB cortex and 150 DAFB cortex) generally had the least statistical support (median P value = 0.0013), and this was likely due to broad EST representation among the four related libraries associated with ripening and mature fruit (126 DAFB core, 126 DAFB cortex, 150 DAFB cortex, and 150 DAFB skin).
To extend this analysis to the identification of potential ripening-related genes, we examined all clusters that contained EST representatives from libraries prepared from fruit cortex, regardless of representation in nonfruit-derived libraries, for significant differences in EST frequency among developmental stages. This analysis considered libraries derived from Royal Gala fruit 87 DAFB, 126 DAFB, or 150 DAFB, representing fruit at a late stage of cell expansion and growth, early in the ripening process, and ripe fruit, respectively. This resulted in the identification of 165 clusters, which we subsequently classified into 10 groups according to frequency profiles using k-means analysis (Fig. 3; Supplemental Table IV).
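Grouping clusters by frequency profile with k-means can be sketched as below. This is a minimal pure-Python version of the standard Lloyd iteration, written only to illustrate the idea. The distance measure (squared Euclidean), the random initialization, the value of k, and the toy profiles (relative EST frequency at 87, 126, and 150 DAFB) are assumptions; the settings used in the study are not stated in this section.

```python
import random

def sq_dist(p, q):
    """Squared Euclidean distance between two frequency profiles."""
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, iterations=50, seed=0):
    """Plain Lloyd's k-means; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # start from k distinct input profiles
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each profile joins its nearest centroid.
        labels = [min(range(k), key=lambda c: sq_dist(p, centroids[c]))
                  for p in points]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(col) / len(members)
                                     for col in zip(*members))
    return labels, centroids

# Hypothetical profiles: share of a cluster's ESTs at 87, 126, and 150 DAFB.
profiles = [
    (0.1, 0.3, 0.6), (0.0, 0.3, 0.7), (0.1, 0.2, 0.7),  # rising with ripening
    (0.7, 0.2, 0.1), (0.6, 0.3, 0.1), (0.8, 0.2, 0.0),  # falling with ripening
]
labels, centroids = kmeans(profiles, k=2)
print(labels)
```

The group numbers returned are arbitrary labels, so only the partition is meaningful. With real data, results can depend on initialization and on how counts are scaled before clustering.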
Clusters in groups 1, 2, and 3 showed significant frequency increases between 87- and 126-DAFB library sources, suggesting that the corresponding genes are up-regulated in association with the early ripening stage. Group 1 clusters showed a further significant increase between 126- and 150-DAFB library sources, whereas group 3 clusters showed a significant decrease in EST frequency between 126- and 150-DAFB libraries.
An example of a group 1 cluster is MD028310, which is very closely related to known adenine phosphoribosyltransferases (APTs). A characterized activity for this enzyme is the recycling of adenine into adenylate nucleotides; APTs may also be involved in modification of cytokinins (Schnorr et al., 1996 Group 4 clusters showed frequency increases in 150-DAFB libraries compared with 126-DAFB libraries, suggesting a role in a later ripening stage. One of these, MD108530, encodes a likely thiosulfate sulfurtransferase. Sulfurtransferases/rhodaneses are a group of enzymes widely distributed in plants, animals, and bacteria that catalyze the transfer of sulfur from a donor molecule to a thiophilic acceptor substrate. This class of enzyme is not known to be regulated, and the substrate for such an enzyme in ripening fruit has not been proposed. Another group of clusters, group 9, showed statistically significant EST increases in 150-DAFB libraries compared with 87-DAFB libraries, but not with 126-DAFB libraries, and potentially represent genes that are more subtly up-regulated throughout ripening. This group includes cluster MD140930, related to 14-3-3 genes from tomato and other plants.
Finally, we identified a set of clusters, predominantly group 7 (Fig. 3), that showed frequency decreases between 87- and 126-DAFB library sources, suggesting that the corresponding genes are down-regulated in association with ripening. This set includes the XET-related cluster MD176720. Down-regulation of XET gene expression preceding ripening has previously been observed for tomato LeEXT1 (Catalá et al., 2000
To evaluate the accuracy of the predicted expression patterns of genes corresponding to these ripening-related clusters, we analyzed the mRNA levels of representative genes of the six major ripening k-means groups during a developmental sequence of fruit growth and ripening in field-grown plants. These plants were of a different cultivar (Jonagold) and were maintained independently of plants utilized for EST analysis. The advent of ripening in fruit was apparent by a substantial increase in ethylene concentration from about 0.5 ppm 141 DAFB to about 100 ppm 158 DAFB, and by a transient increase in CO2 production peaking around 155 DAFB (Fig. 4). All six genes analyzed showed an apparent ripening-related expression pattern substantially similar to that predicted by EST frequency analysis (Fig. 4). For example, the APT-related MD028310 (group 1), ACO-related MD033250 (group 2), and progesterone 5-
To determine to what extent presumed ripening-related changes in expression of specific genes might be rendered inconsequential by static or reciprocal changes in expression of functionally similar genes, we analyzed the collective frequency of ESTs representing paralogous groups of clusters in libraries derived from fruit 87, 126, and 150 DAFB. We found that of the 121 clusters that showed significant EST frequency increase associated with fruit ripening (Supplemental Table IV), only seven were members of paralogous groups that collectively showed no significant frequency increase (data not shown). Similarly, of 100 clusters that exhibited EST frequency decrease during ripening, only eight were included in paralogous groups that collectively showed no significant frequency decrease (data not shown). Thus, most of the presumed changes in gene expression that we identified are likely to be biologically meaningful.
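The pooling of paralogous clusters can be illustrated with a small sketch. The function name, the data layout, and all counts below are hypothetical; the example only shows how a rise in one cluster can be offset by a reciprocal fall in a close relative, so that the collective frequency of the group is unchanged.

```python
def pooled_frequency(group_counts, library_sizes):
    """Sum EST counts over all clusters of a paralogous group at each stage
    and divide by the number of ESTs sequenced from that stage."""
    return {
        stage: sum(counts.get(stage, 0) for counts in group_counts.values())
               / library_sizes[stage]
        for stage in library_sizes
    }

# Invented counts for two related clusters with reciprocal changes.
library_sizes = {"87 DAFB": 1000, "126 DAFB": 1000, "150 DAFB": 1000}
group = {
    "cluster_A": {"87 DAFB": 2, "126 DAFB": 6, "150 DAFB": 10},
    "cluster_B": {"87 DAFB": 10, "126 DAFB": 6, "150 DAFB": 2},
}
pooled = pooled_frequency(group, library_sizes)
print(pooled)
```

In this invented case cluster_A alone appears ripening induced, but the pooled frequency is flat across the three stages, which is the situation the analysis above was designed to detect.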
To view these results in an evolutionary context, we compared the subsets of clusters representing potential ripening-regulated genes in apple with those in tomato, another plant for which large numbers of ESTs from representational, ripening fruit-derived libraries are available (Fei et al., 2004
The remaining 48 apple clusters that showed sequence homology in tomato were homologous with a total of 243 tomato clusters, none of which exhibited ripening-associated EST frequency increases that were significant even at a less stringent statistical cutoff (P
The majority of EST sequences analyzed in this study (approximately 85%) were derived from two cultivars, GoldRush and Royal Gala. GoldRush is a very firm variety characterized by a complex spicy flavor and yellowish-orange skin at maturity. This variety produces relatively low levels of ethylene, matures late in the season, and stores well (Janick, 2001 We identified 169 clusters with EST representation biased toward GoldRush and 292 biased toward Royal Gala (Supplemental Table VI). To determine whether these differences reflected distinct biological processes that could account for cultivar-associated characteristics, we assigned biased clusters into one or more of 25 categories of predicted function (see "Materials and Methods"). Although most of these categories contained similar numbers of clusters from each cultivar (data not shown), some differences were noted. Clusters annotated as related to nucleic acid metabolism were derived predominantly from Royal Gala, and this may reflect an extension in Royal Gala of the period of cell division activity that typically occurs early in apple fruit development. In contrast, clusters annotated as related to response to stress, biotic stimulus, or abiotic stimulus were derived predominantly from GoldRush (Fig. 5), potentially reflecting a relatively higher basal level of activity of the associated response pathways in GoldRush.
We analyzed the results for potential difference in expression of genes with specific roles in fruit quality attributes and found several candidates. One example is the XET-related cluster MD247820, which is represented at high frequency in a GoldRush young fruit library (20 of 7,736 ESTs [0.26%]) but lacks any representation from Royal Gala young fruit libraries. Moreover, this cluster did not contain ESTs derived from any other Royal Gala library, suggesting that the corresponding gene is expressed or exists only in GoldRush. We were also unable to detect any closely related (E < 12) sequence, or any sequences annotated as XET-like or included in GH16, in young fruit libraries from Royal Gala (data not shown). A single XET-related sequence (MD176720) is observed in Royal Gala libraries derived from 87-DAFB fruit, where the ESTs represent nearly 2% of the total EST pool, and EST representation decreased markedly (P < 0.001) between 87 and 126 DAFB. MD176720 also contains EST representatives from GoldRush, showing that it is not a Royal Gala-specific gene, but these ESTs originate only from nonfruit libraries (data not shown). XET-like genes have been implicated in both fruit growth and softening, and we speculate that a difference in expression of this gene early in development may contribute to the distinct textural qualities of the mature GoldRush and Royal Gala fruit.
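The arithmetic behind the MD247820 observation can be made explicit. The sketch below reproduces the reported frequency (20 of 7,736 ESTs) and then asks how likely it would be to sample no ESTs at all from a second library if the transcript were equally abundant there. The binomial sampling model and the second library's size of 6,000 ESTs are assumptions made for illustration; the size of the Royal Gala young fruit library is not given in this section.

```python
def est_frequency(count, pool):
    """Fraction of a library's ESTs that belong to one cluster."""
    return count / pool

def prob_no_ests(freq, pool):
    """Chance of drawing zero ESTs for a transcript of relative abundance
    `freq` when `pool` ESTs are sampled independently (binomial model)."""
    return (1 - freq) ** pool

f = est_frequency(20, 7736)
print(f"{100 * f:.2f}%")  # 0.26%, as reported for the GoldRush library

# Hypothetical second library of 6,000 ESTs with the same true abundance.
p0 = prob_no_ests(f, 6000)
print(p0 < 1e-6)
```

Under these assumptions, complete absence from a library of comparable size would be very unlikely by chance alone, consistent with a real difference in expression or gene content between the cultivars.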
Flavonoids are important secondary metabolites in apple and contribute to antioxidant capacity and pigmentation. We found four clusters tied to flavonoid biosynthesis that were overrepresented in GoldRush: MD245860 and MD244690, related to leucoanthocyanidin dioxygenase (LDOX); MD013330, related to anthocyanidin reductase (ANR); and MD060870, related to anthocyanidin-3-glucoside rhamnosyltransferase. ANR and LAR participate in synthesis of flavan-3-ol monomers required for formation of PA polymers (i.e. tannins), utilizing anthocyanidin and leucocyanidin, respectively, as substrates (Xie et al., 2003
We applied our EST frequency analyses to the components of biochemical pathways likely to be involved in the generation of precursors for volatile esters in ripening fruit. Ester formation in apple is largely confined to the fruit tissue and, more specifically, the skin (Knee and Hatfield, 1981
We found a total of 49 clusters collectively representing these genes (data not shown), with 10 showing EST frequency bias among the sources analyzed (Table III). In addition to the acyl-CoA synthetase cluster MD042900 and LOX cluster MD187410, seven other clusters were overrepresented in fruit tissues relative to vegetative tissues (MD111420 and MD157240, representing potential enoyl-CoA hydratases; MD216630, representing an additional potential acyl-CoA synthetase; MD177270, representing a potential malonyl-CoA:ACP transacylase; and MD075410 and MD142530, representing potential ADHs). In addition to MD187410, the two acyl-CoA synthetase clusters MD042900 and MD123520 showed EST frequency increases associated with ripening. We found that seven clusters from five gene groups showed overrepresentation in skin tissues relative to cortex: the enoyl-CoA hydratase clusters MD111420 and MD157240, the acyl-CoA synthetase cluster MD042900, the acyl carrier protein cluster MD045430, the malonyl-CoA:ACP transacylase cluster MD177270, and the ADH clusters MD075410 and MD142530. We did not observe significant ripening-related or skin-associated overrepresentation of ESTs representing the alcohol acyl transferase (AAT) clusters present in the libraries analyzed (data not shown). This observation is not inconsistent with expression data in studies by Defilippi et al. (2005) We analyzed mRNA levels of genes corresponding to the LOX cluster MD187410 and acyl-CoA synthetase cluster MD042900 in relation to the ripening-related production of volatile esters in fruit from field-grown plants. In these fruit, substantial amounts of volatiles were first detected 134 DAFB, increased rapidly between 141 and 155 DAFB, peaked at 162 DAFB, and declined thereafter (Fig. 6A).
RNA of the LOX gene represented by MD187410 was barely detectable by reverse transcription (RT)-PCR at 123 DAFB, increased in abundance between 123 and 144 DAFB before the onset of significant volatile production, peaked at 162 DAFB, and remained at similar levels 183 DAFB. This pattern is substantially similar to that predicted for this cluster, which is included in ripening k-means group 2. RNA for the acyl-CoA synthetase gene was easily detectable at 123 DAFB, increased in abundance markedly by 144 DAFB, and declined thereafter; this pattern is also predicted for this cluster, which is included in ripening k-means group 3.
We also analyzed mRNA levels of representative genes predicted to be expressed preferentially in skin relative to cortex tissues of ripe fruit. We confirmed our predictions for the enoyl-CoA hydratase cluster MD111420, the acyl-CoA synthetase cluster MD042900, the acyl carrier protein cluster MD045430, the malonyl-CoA:ACP transacylase cluster MD177270, and the ADH cluster MD142530. Also consistent with predictions based on EST frequency, the LOX cluster MD187410 was found to be expressed to relatively higher levels in cortex compared with skin tissues. The malonyl-CoA:ACP transacylase cluster MD177270 was apparently expressed to similar levels in both samples, although it was predicted to be expressed to higher levels in skin tissues.
In this study, we applied the technique of EST frequency analysis to apple, a crop for which no additional genomic resources are currently available. Although these predictions of gene expression based on EST frequency serve as an excellent entry point to studies of fruit molecular biology, this type of study is subject to numerous caveats. Identification of the most abundant ESTs, which should represent the most active genes in these tissues, is highly sensitive to artifacts such as differential amplification of cDNAs during library preparation and contamination of libraries with abundant organellar DNAs, highly repetitive DNA in the nuclear genome, and microbial nucleic acids. We found that a considerable fraction of the most highly represented apple clusters corresponded to known mitochondrial- or plastid-encoded genes. Although the most likely explanation is contamination by organellar DNAs or RNAs, it remains possible that at least a portion of these sequences may be derived from organellar sequences integrated into the genome, as has been observed for Arabidopsis (Arabidopsis Genome Initiative, 2000 Additional caveats apply to this study where library construction, sequencing, and submission were outside of the investigators' control. For tree species such as apple, tissue sources will typically be field derived and thus subject to a variety of biotic and abiotic stresses that may not be apparent at the time when tissues are collected. In addition, it is not unlikely that absence of ESTs representing some genes reflects selective withholding from submission by investigators, rather than low transcript abundance.
However, in spite of the numerous potential caveats, we were able to predict with some accuracy the temporal and/or tissue-associated expression behavior in 13 of the 14 cases attempted, even though the plants subjected to gene expression analysis were a different cultivar than those subjected to EST analysis and were grown in distinct climates and seasons. Rigorous standardization and extensive documentation of tissue sourcing and experimental procedures associated with EST sequencing and submission will likely further expand the utility of this type of analysis. We also note that the application of EST frequency analysis is limited to proportional (representational) libraries. Many EST sequencing efforts utilize normalization techniques to increase library complexity, precluding this type of analysis. We noted that the 50,721 sequences derived from four normalized GoldRush libraries represented 21,565 unique sequences, whereas the 74,914 sequences derived from 33 non-normalized Royal Gala libraries represented 23,050 unique sequences (data not shown). Therefore, at least in the case studied here, exhaustive sequencing of a few normalized libraries apparently provided some advantage over limited sequencing of diverse and specific cDNA libraries, but at the expense of losing representational data.
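The comparison of normalized and non-normalized libraries can be restated as the fraction of sequenced ESTs that were unique. The short calculation below is only a worked illustration of the counts reported above; the function name `unique_fraction` is introduced here for convenience and is not part of the original analysis.

```python
def unique_fraction(n_sequences: int, n_unique: int) -> float:
    """Fraction of sequenced ESTs that represent distinct sequences."""
    return n_unique / n_sequences


# Counts reported in the text.
normalized = unique_fraction(50_721, 21_565)      # four normalized GoldRush libraries
non_normalized = unique_fraction(74_914, 23_050)  # 33 non-normalized Royal Gala libraries

print(f"normalized: {normalized:.3f}, non-normalized: {non_normalized:.3f}")
```

On these counts, roughly 43% of the normalized-library sequences were unique, compared with roughly 31% for the non-normalized libraries.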
We demonstrated the utility of this approach by analyzing a small subset of genes potentially involved in the synthesis of fatty acid precursors for the volatile esters that contribute to aroma and flavor. Although it is well known that levels of fatty acids accumulate during apple fruit ripening (Meigh and Hulme, 1956)
Analysis of ESTs is likely to continue to play a dominant role in genome characterization, especially in organisms where other genomics technologies are difficult or intractable, and in organisms of relatively minor collective importance. As methods for high-throughput sequence analysis become more accessible and affordable, EST data for minor crops should become increasingly abundant. Many of these data will likely be in the form of relatively small datasets and heterogeneous with respect to cultivar/genotype analyzed, tissue sources, growth conditions, and methods for library construction. Our results show that this type of analysis can be informative even where based on heterogeneous sources.
Sequence Sources
A total of 198,684 apple (Malus domestica Borkh) sequences were identified in the EST and nonredundant nucleotide databases at the NCBI (http://www.ncbi.nlm.nih.gov). These sequences consisted of 198,068 ESTs and 616 cDNAs, including 394 full-length cDNAs, and are cataloged at http://genomics.msu.edu/fruitdb (version 3.0). k-means analyses and P-value calculations utilized data in version 2.0 of this database (159,254 sequences). A minimally redundant EST set from tomato (Lycopersicon esculentum Mill.) was obtained from the Tomato Gene Index of TIGR (http://www.tigr.org/tigr-scripts/tgi/T_index.cgi?species=tomato; Quackenbush et al., 2000).
Apple sequences were clustered using StackPACK (version 2.2; http://www.egenetics.com; Miller et al., 1999). For identification of putative paralogs of presumed highly represented genes in fruit tissue, we used tBLASTx and a single-linkage method to define groups of sequences that exhibited significant similarity (E < 1e-40). The sum of EST counts for all members in the group was considered to be the EST frequency for the paralogous group. For comparative analysis of ripening with tomato, we utilized the TIGR Tomato Gene Index (http://www.tigr.org/tigr-scripts/tgi/T_index.cgi?species=tomato), version 10.1, containing 162,621 ESTs and 1,587 cDNA sequences.
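The single-linkage grouping of putative paralogs can be sketched as follows. This is a minimal illustration, not the code used in the study: it assumes the tBLASTx results have already been parsed into (cluster, cluster, E value) triples, and the function name, the union-find implementation, and the example identifiers and counts are all hypothetical.

```python
from collections import defaultdict


def paralog_group_frequencies(est_counts, hits, e_cutoff=1e-40):
    """Group clusters by single linkage and sum EST counts per group.

    est_counts: dict mapping cluster ID -> number of ESTs in that cluster.
    hits: iterable of (cluster_a, cluster_b, e_value) pairwise similarities.
    Returns a dict mapping a sorted tuple of member IDs -> summed EST count.
    """
    parent = {cluster: cluster for cluster in est_counts}

    def find(cluster):
        # Follow parent links to the group representative, compressing the path.
        while parent[cluster] != cluster:
            parent[cluster] = parent[parent[cluster]]
            cluster = parent[cluster]
        return cluster

    # Single linkage: one significant hit is enough to merge two groups.
    for a, b, e_value in hits:
        if e_value < e_cutoff and a in parent and b in parent:
            root_a, root_b = find(a), find(b)
            if root_a != root_b:
                parent[root_b] = root_a

    totals = defaultdict(int)
    members = defaultdict(list)
    for cluster, count in est_counts.items():
        root = find(cluster)
        totals[root] += count
        members[root].append(cluster)
    return {tuple(sorted(members[root])): totals[root] for root in totals}


# Hypothetical example: A-B and B-C pass the cutoff, C-D does not.
example_counts = {"A": 10, "B": 5, "C": 3, "D": 7}
example_hits = [("A", "B", 1e-60), ("B", "C", 1e-45), ("C", "D", 1e-20)]
groups = paralog_group_frequencies(example_counts, example_hits)
print(groups)
```

Because linkage is transitive, A and C fall into the same group even though no direct A-C hit is listed, which is the defining property of single-linkage grouping.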
Similarity searches were carried out with the stand-alone BLAST programs (Altschul et al., 1997). For functional grouping of genes corresponding to fruit EST-overrepresented or fruit EST-underrepresented clusters, we utilized BLASTx to identify the closest expressed sequence in Arabidopsis (Arabidopsis thaliana) and generated the corresponding GO Slim classification by mapping plant GO Slim to The Arabidopsis Information Resource (TAIR) Arabidopsis GO gene associations, using Perl script map2slim.pl (available at http://www.geneontology.org). Only matches with E values less than 1e-10 were included in the analysis. To avoid inconsistency and error associated with manual or alternative-source annotation, we ignored all clusters without an Arabidopsis relative, even if they exhibited strong similarity with other reported sequences. Of the 1,059 clusters that showed differential representation from fruit and nonfruit-derived libraries, 622 (59%) were selected for functional classification; of the remainder, 78 exhibited significant similarity only with non-Arabidopsis sequences. Where genes were classified into multiple GO Slim categories, each classification was treated separately in the analysis to prevent bias associated with manual selection.
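The filtering and counting rules described above can be sketched as follows. This is an illustration under stated assumptions rather than the pipeline used in the study: it assumes the best BLASTx hit per cluster and the locus-to-GO Slim mapping (for example, the output of map2slim.pl) have already been loaded into dictionaries, and the function name, identifiers, and category labels in the example are placeholders.

```python
from collections import Counter


def go_slim_counts(best_hits, slim_terms, e_cutoff=1e-10):
    """Count GO Slim categories for clusters with a qualifying Arabidopsis hit.

    best_hits: dict mapping cluster ID -> (arabidopsis_locus, e_value), or None
               when no Arabidopsis relative was found.
    slim_terms: dict mapping Arabidopsis locus -> list of GO Slim categories.
    Returns (number of clusters classified, Counter of category -> count).
    """
    counts = Counter()
    classified = 0
    for cluster, hit in best_hits.items():
        if hit is None:
            continue  # clusters without an Arabidopsis relative are ignored
        locus, e_value = hit
        if e_value >= e_cutoff:
            continue  # only matches with E < cutoff are included
        terms = set(slim_terms.get(locus, []))
        if not terms:
            continue  # assumption: loci with no GO Slim term are not classified
        classified += 1
        counts.update(terms)  # each category is counted separately
    return classified, counts


# Placeholder data: identifiers and categories are illustrative only.
example_hits = {
    "cluster_1": ("locus_1", 1e-50),
    "cluster_2": ("locus_2", 1e-5),   # fails the E value cutoff
    "cluster_3": None,                # no Arabidopsis relative
    "cluster_4": ("locus_3", 1e-30),
}
example_slim = {
    "locus_1": ["category_a", "category_b"],
    "locus_2": ["category_a"],
    "locus_3": ["category_a"],
}
n_classified, category_counts = go_slim_counts(example_hits, example_slim)
print(n_classified, dict(category_counts))
```

A cluster assigned to two categories contributes one count to each, so category totals can exceed the number of classified clusters.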
Digital analyses of gene expression were performed based on AC statistics (Audic and Claverie, 1997).
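For reference, the Audic and Claverie (1997) statistic gives the probability of observing y ESTs for a gene in a library of N2 sequences, given x ESTs for the same gene in a library of N1 sequences: p(y|x) = (N2/N1)^y (x+y)! / [x! y! (1 + N2/N1)^(x+y+1)]. The sketch below evaluates this formula in log space for numerical stability. The function names are ours, and the one-sided tail sum is only one common way to derive a P value from the distribution; it is not necessarily the exact procedure used in this study.

```python
from math import exp, lgamma, log


def ac_probability(x, y, n1, n2):
    """Audic-Claverie probability p(y|x) for library sizes n1 and n2."""
    ratio = n2 / n1
    log_p = (
        y * log(ratio)
        + lgamma(x + y + 1) - lgamma(x + 1) - lgamma(y + 1)  # log[(x+y)!/(x! y!)]
        - (x + y + 1) * log(1 + ratio)
    )
    return exp(log_p)


def ac_upper_tail(x, y, n1, n2):
    """One-sided probability of observing y or more ESTs, given x."""
    below = sum(ac_probability(x, k, n1, n2) for k in range(y))
    return max(0.0, 1.0 - below)


# Example with equal library sizes (values chosen for illustration only).
p_equal = ac_probability(0, 0, 1000, 1000)
tail = ac_upper_tail(2, 15, 1000, 1000)
print(p_equal, tail)
```

With equal library sizes, p(0|0) is 1/2 and p(1|0) is 1/4, which provides a quick check of the implementation.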
For analysis of mRNA levels during fruit ripening, fruit (cv Jonagold) was harvested periodically from trees maintained under field conditions at the Michigan State University Clarksville Horticultural Station. Fruit was allowed to acclimate to ambient laboratory conditions for 24 h before analysis. Monitoring of fruit ethylene production, respiration, and volatile ester biosynthesis was carried out essentially as described by Jayanty et al. (2002).
We thank Curtis Wilkerson for additional bioinformatics support and members of the van Nocker group for helpful critiques. Received March 27, 2006; returned for revision May 12, 2006; accepted May 16, 2006.
1 This work was supported by the Michigan Agricultural Experiment Station (funding to S.v.N and R.B.). The author responsible for distribution of materials integral to the findings presented in this article in accordance with the policy described in the Instructions for Authors (www.plantphysiol.org) is: Steven van Nocker (vannocke{at}msu.edu).
[W] The online version of this article contains Web-only data. www.plantphysiol.org/cgi/doi/10.1104/pp.106.080994. * Corresponding author; e-mail vannocke{at}msu.edu; fax 517-355-0249.
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389-3402
Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408: 796-815
Audic S, Claverie JM (1997) The significance of digital gene expression profiles. Genome Res 7: 986-995
Baker A, Graham IA, Holdsworth M, Smith SM, Theodoulou FL (2006) Chewing the fat:
Bogs J, Downey MO, Harvey JS, Ashton AR, Tanner GJ, Robinson SP (2005) Proanthocyanidin synthesis and expression of genes encoding leucoanthocyanidin reductase and anthocyanidin reductase in developing grape berries and grapevine leaves. Plant Physiol 139: 652-663
Catalá C, Rose JK, Bennett AB (2000) Auxin-regulated genes encoding cell wall-modifying proteins are expressed during early tomato fruit growth. Plant Physiol 122: 527-534
Chen G, Hackett R, Walker D, Taylor A, Lin Z, Grierson D (2004) Identification of a specific isoform of tomato lipoxygenase (TomloxC) involved in the generation of fatty acid-derived flavor compounds. Plant Physiol 136: 2641-2651
Chou A, Burke J (1999) CRAWview: for viewing splicing variation, gene families, and polymorphism in clusters of ESTs and full-length sequences. Bioinformatics 15: 376-381
Crosby JA, Janick J, Pecknold PC, Goffreda JC, Korban SS (1994) Goldrush Apple. HortScience 29: 827-828
Defilippi B, Kader AA, Dandekar AM (2005) Apple aroma: alcohol acyltransferase, a rate limiting step for ester biosynthesis, is regulated by ethylene. Plant Sci 168: 1199-1210
Eisen MB, Spellman PT, Brown PO, Botstein D (1998) Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 95: 14863-14868
Fei Z, Tang X, Alba RM, White JA, Ronning CM, Martin GB, Tanksley SD, Giovannoni JJ (2004) Comprehensive EST analysis of tomato and comparative genomics of fruit ripening. Plant J 40: 47-59