© 2005 American Society of Plant Biologists
Abstract
To take full advantage of the power of functional genomics technologies and in particular those for metabolomics, both the analytical approach and the strategy chosen for data analysis need to be as unbiased and comprehensive as possible. Existing approaches to analyze metabolomic data still do not allow a fast and unbiased comparative analysis of the metabolic composition of the hundreds of genotypes that are often the target of modern investigations. We have now developed a novel strategy to analyze such metabolomic data. This approach consists of (1) full mass spectral alignment of gas chromatography (GC)-mass spectrometry (MS) metabolic profiles using the MetAlign software package, (2) followed by multivariate comparative analysis of metabolic phenotypes at the level of individual molecular fragments, and (3) multivariate mass spectral reconstruction, a method allowing metabolite discrimination, recognition, and identification. This approach has allowed a fast and unbiased comparative multivariate analysis of the volatile metabolite composition of ripe fruits of 94 tomato (Lycopersicon esculentum Mill.) genotypes, based on intensity patterns of >20,000 individual molecular fragments throughout 198 GC-MS datasets. Variation in metabolite composition, both between- and within-fruit types, was found and the discriminative metabolites were revealed. In the entire genotype set, a total of 322 different compounds could be distinguished using multivariate mass spectral reconstruction. A hierarchical cluster analysis of these metabolites resulted in clustering of structurally related metabolites derived from the same biochemical precursors. The approach chosen will further enhance the comprehensiveness of GC-MS-based metabolomics approaches and will therefore prove a useful addition to nontargeted functional genomics research.
Functional genomics technologies designed to assess gene activity (transcriptomics) and protein accumulation (proteomics) are now well established in the quest to link gene to function (Holtorf et al., 2002). Subsequently, metabolomics approaches have been put forward as a means to link the functional biochemical phenotype to other functional genomics data (Weckwerth and Fiehn, 2002; Sumner et al., 2003; Bino et al., 2004; Hall et al., 2005). Like transcriptomics and proteomics, metabolomics involves two main components: instrumental analysis (analytical) and data analysis (bioinformatics). Both components need to be as comprehensive as possible for true, broad, metabolic profiling and comparative analysis of the biochemical status of living organisms. Several analytical methods for metabolomics have already been reported using model plants in genomic studies (Fiehn et al., 2000a, 2000b; Roessner et al., 2000, 2001; Huhman and Sumner, 2002; Tolstikov and Fiehn, 2002; Roessner-Tunali et al., 2003; Kopka et al., 2004; Desbrosses et al., 2005). A significant number of these studies have, however, been dedicated to metabolic profiling specifically of the nonvolatile compounds involved in primary plant metabolism using gas chromatography (GC) coupled to mass spectrometry (MS). Another significant part of the plant metabolome, comprising the volatile metabolites, is of particular interest, since these compounds play an important role in fundamental processes such as signaling mechanisms and interorganism interactions (Shulaev et al., 1997; Seskar et al., 1998; Ozawa et al., 2000; Arimura et al., 2002; Liechti and Farmer, 2002; Dicke et al., 2003; Dudareva et al., 2004; Engelberth et al., 2004; Ryu et al., 2004).
In addition, these components are also of great agronomic importance as volatile metabolites are major determinants of food and flower quality in terms of flavor and fragrance (Buttery and Ling, 1993; Baldwin et al., 2000, 2004; Yilmaz et al., 2001; Tandon et al., 2003; Krumbein et al., 2004; Simkin et al., 2004; Ruiz et al., 2005).
Solid phase microextraction coupled to GC-MS (SPME-GC-MS) is an analytical approach that is suitable for metabolomics studies of volatiles, since it is renowned for its high sensitivity, reproducibility, and robustness (Yang and Peppard, 1994; Matich et al., 1996; Verhoeven et al., 1997; Song et al., 1997, 1998; Augusto et al., 2000; Verdonk et al., 2003). GC-MS-based approaches utilize gas chromatographic separation of metabolites extracted from plant material and, in the case of SPME, the volatiles are first extracted from the headspace above the plant material using a specially designed adsorbent fiber (Fig. 1A). Subsequently, separated metabolites are fragmented to charged molecular fragments (ions) that are then detected in the mass spectrometer. Each metabolite produces a unique spectrum of molecular fragments with specific masses and a fixed relative abundance. This unique fingerprint can therefore be used for metabolite recognition and identification.
Figure 1. GC-MS-based metabolomics. A, Analytical approach used. B, Conventional approach. C, Alternative, unbiased approach to GC-MS data analysis.
Hundreds of different metabolites can be detected in crude plant extracts using GC-MS. This is, however, just a small fraction of the more than 10,000 metabolites that have been described in plants (Fiehn et al., 2000b). Moreover, even this limited amount of biochemical information cannot be fully subjected to a comparative metabolomic analysis when conventional strategies are used. Such strategies, in general, consist of three consecutive steps (Fig. 1B). First, metabolites must be recognized and/or identified from the tens of thousands of molecular fragments that constitute a typical GC-MS profile (Fig. 1B-a). Second, quantitative values (often relative) of the identified metabolites are aligned throughout all the metabolic profiles of the genotypes (Fig. 1B-b) in order to perform the third step, comparative analyses of their metabolic phenotypes using multivariate exploratory techniques used in metabolomics (e.g. hierarchical cluster analysis [HCA], principal component analysis [PCA], self-organizing maps, etc.; Fig. 1B-c). The metabolic data can then also be linked to data derived using other functional genomics technologies. The comprehensiveness of this strategy therefore depends on the number of metabolites that can be identified in the samples to be compared. However, the extreme complexity of the plant metabolome already generates a bottleneck at the first step of this procedure: Despite chromatographic separation, metabolites still coelute prior to being subjected to MS. Consequently, this coelution results in overlapping of the unique fragmentation patterns. In addition to the problem of coelution, a high variability in metabolite quantity within large numbers of biological samples further complicates metabolite identification and thus limits the entire analysis to a metabolite subset that includes only those compounds that can be reliably identified throughout all genotypes.
Many other possible metabolic differences may then be overlooked. To overcome these limitations and make the metabolomic data analysis truly comprehensive and unbiased, we propose a novel strategy for data analysis (Fig. 1C). This strategy is based on a fully automated alignment of metabolic profiles at the level of individual molecular fragments without prior assignment to the chemical structures of the metabolites they represent. Subsequently, a multivariate comparative analysis of individual metabolic profiles is performed, which is based on all chemical information obtained by the analytical approach. Although this strategy initially removes the need for prior metabolite identification, identification is eventually still required in order to attach a biological meaning to the differences found. To relate the thousands of molecular fragments normally constituting a chromatogram to their parent metabolites, a novel approach, multivariate mass spectral reconstruction (MMSR), has been developed. Using MMSR, clusters of related metabolite fragments can be recognized and the corresponding metabolites subsequently identified.
The entire data analysis strategy is applicable to many kinds of mass spectral data and exceeds existing approaches to unbiased metabolomic data analysis in terms of resolution and comprehensiveness (Nielsen et al., 1998; Fraga et al., 2001; Johnson et al., 2003; Jonsson et al., 2004; Wiener et al., 2004; Willse et al., 2005). In addition, it uses widely available software tools and simple statistical procedures, both of which make it useful for a wide range of studies in the fields of biochemistry, physiology, functional genomics, and systems biology.
The strategy was used for a comparative multivariate analysis of a set of 94 contrasting tomato (Lycopersicon esculentum Mill.) genotypes covering the variation in the germplasm of commercial tomato varieties. The analysis was based on the profiles of all volatiles that could be detected by the analytical method used (SPME-GC-MS) and revealed a total of 322 different compounds in the entire genotype set. This covers approximately 80% of the more than 400 tomato volatile compounds, which have been detected in tomato fruit using different analytical methods (for review, see Petro-Turza, 1987).
RESULTS
Automated Sequential Headspace SPME-GC-MS: Method Development
In order to produce and release volatiles, tomato material (e.g. juice or pulp) is usually incubated for a fixed period, during which essential enzymes such as lipoxygenase and hydroperoxide lyases are allowed to remain active. This is followed by the addition of concentrated CaCl2 to stop enzyme activity and to drive the volatiles into the headspace (Bezman et al., 2003; Verdonk et al., 2003). To test this method for its suitability for effective, prolonged, sequential automated analysis of tomato samples, the standard deviations (sd) of the 15 major tomato volatiles (Baldwin et al., 2000), measured sequentially in four sample replications, each after 3-h intervals, were calculated (Table I). The addition of CaCl2 alone resulted in large variations in metabolite abundance between replicate analyses (average % sd = 41%; Table I). However, a marked improvement in reproducibility (average % sd = 9%) was achieved by the addition of NaOH/EDTA solution, which was chosen for its effectiveness compared to a number of alternative buffers tested (data not shown). In combination with subsequent CaCl2-induced enzyme inactivation, this procedure resulted in sufficient stability and reproducibility over a 12-h period. On average, the biological variation between the genotypes was then approximately 10 times the analytical variation. To estimate the metabolic variation that can be observed within a genotype, samples of five individual fruits of the same genotype were analyzed. The fruit-to-fruit variation, which necessarily included the analytical variation, ranged from 8% to 35% sd for the 15 volatiles. For all metabolites, the fruit-to-fruit variation was significantly less than the biological variation between genotypes, according to both % sd and the range between the lowest and highest values (Table I).
Table I. Biological and analytical variation of the tomato volatile metabolites
For the analysis, a mix of tomato samples was made and separate aliquots were measured after 0, 4, 8, and 12 h. Using these four measurements, % sd (presented as the % of total value) was calculated for the sole use of CaCl2 (second column) and for the combination NaOH/EDTA + CaCl2 (third column). For the analysis of biological variation within genotype (fourth column), five individual fruits of the same genotype were profiled for volatiles, and % sd for these five replicates was calculated. Biological variation between genotypes (fifth column) was calculated as % sd of means of all 94 tomato samples when the NaOH/EDTA + CaCl2 procedure was used. The maximal relative fruit-to-fruit variation as well as the maximal variation between all 94 genotypes was calculated as the ratio of maximal and minimal relative values of the 15 volatiles across the five fruits and the 94 genotype samples, respectively. It is given in parentheses as fold difference (fourth and fifth columns).
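The two summary statistics described in this legend can be illustrated with a minimal sketch. It assumes that % sd is the standard deviation divided by the mean and that the sample (n - 1) estimator is used; the text does not state which estimator was applied. The function names and the example responses are hypothetical.

```python
from statistics import mean, stdev


def percent_sd(values):
    """Standard deviation as a percentage of the mean (assumes the n - 1 estimator)."""
    return 100.0 * stdev(values) / mean(values)


def fold_difference(values):
    """Ratio of the maximal to the minimal value (the fold difference in the legend)."""
    return max(values) / min(values)


# Hypothetical relative responses of one volatile in aliquots measured at 0, 4, 8, and 12 h.
time_course = [100.0, 96.0, 104.0, 100.0]
print(f"% sd: {percent_sd(time_course):.1f}")                   # % sd: 3.3
print(f"fold difference: {fold_difference(time_course):.2f}")   # fold difference: 1.08
```

The same two functions would apply unchanged to the five-fruit and the 94-genotype comparisons, with only the input values differing.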
In total, 94 tomato fruit samples, in duplicate, were profiled for volatile metabolites. Consequently, including the daily external reference samples, 198 GC-MS datasets were obtained in this tomato volatile study.
A Stepwise Approach for Nontargeted Data Analysis
Step 1 (Fig. 1C-a) is as follows. The entire 198-sample GC-MS dataset was analyzed using the dedicated MetAlign software package. After automated baseline correction, intensities of approximately 20,000 molecular fragments with corresponding retention times were aligned throughout 198 GC-MS profiles by MetAlign.
Step 2 (Fig. 1C-b) is as follows. A common problem in SPME-GC-MS analyses is the production of molecular fragments originating from contaminants coming from the SPME fiber material. These molecular fragments had a typical pattern of occurrence throughout the samples, which was very different from the plant-derived molecular fragments. These could therefore be efficiently recognized by means of, for example, HCA (Fig. 2). The cluster of molecular fragments, which was clearly separate from the other clusters (Fig. 2), had mass and retention time characteristics that were identical to those of nonplant compounds identified in blank injections. Therefore, this entire group of nonplant molecular fragments, which were highly correlated to the contaminant-specific fragments (such as m/z 207, 267, 355, etc.) related to a number of polysiloxanes, could readily be excluded from the dataset before further analysis. This is an essential prerequisite before effective comparison of the plant-specific data can be made.
Figure 2. HCA of >20,000 molecular fragments based on their expression patterns throughout 198 GC-MS profiles. To simplify the view, only the highest branches of the dendrogram are displayed, showing the main groups of compounds as triangles. This procedure produced a dendrogram revealing a distinct cluster of nonplant components, comprising molecular fragments derived from constituents of the SPME fiber material that could then be readily removed from the dataset prior to further analysis.
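The idea behind Step 2 can be sketched in a simplified form. Instead of a full HCA, the snippet below flags every fragment whose intensity pattern correlates strongly with a fragment already known to be nonplant (for example, from blank injections). This is a swapped-in shortcut, not the procedure used in the study; the fragment ids, the intensities, and the 0.8 cutoff are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long intensity patterns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def flag_fiber_fragments(patterns, marker_ids, threshold=0.8):
    """Return ids of fragments whose pattern tracks any contaminant marker.

    patterns   -- {fragment id: intensities across the GC-MS profiles}
    marker_ids -- fragments already known to be nonplant (e.g. from blank injections)
    threshold  -- hypothetical correlation cutoff, not taken from the text
    """
    flagged = set()
    for frag_id, pattern in patterns.items():
        if any(pearson(pattern, patterns[m]) >= threshold for m in marker_ids):
            flagged.add(frag_id)
    return flagged


# Hypothetical aligned intensities in five profiles; ids combine m/z and retention time.
patterns = {
    "mz207@5.10": [10, 50, 20, 80, 40],   # siloxane marker seen in blanks
    "mz267@5.10": [5, 26, 11, 39, 21],    # follows the marker closely
    "mz91@12.30": [90, 20, 70, 10, 30],   # behaves differently, so it is kept
}
contaminants = flag_fiber_fragments(patterns, marker_ids=["mz207@5.10"])
cleaned = {k: v for k, v in patterns.items() if k not in contaminants}
print(sorted(cleaned))  # ['mz91@12.30']
```

Passing the marker fragments explicitly, rather than selecting them by m/z alone, avoids discarding a plant-derived fragment that happens to share a mass with a polysiloxane ion.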
Step 3 (Fig. 1C-c) is as follows. The data matrix cleaned of the fiber contaminants was subjected to a multivariate comparative analysis. First, HCA of the 94 tomato genotypes was performed using the Pearson correlation between means of genotype analytical replicates. The HCA revealed a high correlation between the reference samples, which were analyzed daily during the entire experiment in order to monitor the stability of the analytical system (Fig. 3A). The cherry genotypes formed a distinct cluster, clearly separated from the round and beef varieties. The latter two tomato types could not be separated into distinct groups. One cherry genotype could be regarded as intermediate by its volatile composition, due to its location at the very edge of the round-beef cluster.
Figure 3. Multivariate analyses of 94 tomato genotypes. A, Hierarchical tree of the 94 tomato genotypes based on intensity patterns of >20,000 individual molecular fragments. B, PCA plot showing two major types of differences between the tomato genotypes: between-type variation, discriminating the cherry tomatoes from round and beef tomatoes along vector 1, and within-type variation, independent of fruit type, along vector 2. C, PCA plot showing the distribution of >20,000 molecular fragments: Those molecular fragments (a) distributed along vector 1 determine the between-type variation, and molecular fragments (b) distributed along vector 2 determine the within-type variation. D, PCA plot showing the distribution of the identified volatile metabolites determining the main differences between the tomato genotypes. E and F, Two enlarged parts of the PCA plot shown in D: Compounds are shown as colored shapes and the numbers refer to the compounds presented in Table II. The smaller black dots represent unknown compounds.
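The genotype-level HCA of Step 3 can be sketched as follows: analytical replicates are averaged per genotype, and genotypes are then merged agglomeratively using 1 minus the Pearson correlation as the distance. The linkage rule is not stated in this section, so average linkage is an assumption here, and the genotype names and intensities are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long intensity patterns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def replicate_means(profiles):
    """Average the analytical replicates of each genotype, fragment by fragment."""
    return {g: [sum(col) / len(col) for col in zip(*reps)] for g, reps in profiles.items()}


def average_linkage(means):
    """Agglomerative clustering on the distance 1 - r; returns the merge history."""
    clusters = {name: [name] for name in means}
    merges = []
    while len(clusters) > 1:
        keys = list(clusters)
        best = None
        for i in range(len(keys)):
            for j in range(i + 1, len(keys)):
                pairs = [(p, q) for p in clusters[keys[i]] for q in clusters[keys[j]]]
                d = sum(1.0 - pearson(means[p], means[q]) for p, q in pairs) / len(pairs)
                if best is None or d < best[0]:
                    best = (d, keys[i], keys[j])
        d, a, b = best
        merged = clusters.pop(a) + clusters.pop(b)
        clusters[a + "+" + b] = merged
        merges.append((sorted(merged), round(d, 3)))
    return merges


# Hypothetical duplicate profiles (four fragments each) for four genotypes.
profiles = {
    "cherry_1": [[9, 8, 1, 2], [11, 8, 1, 2]],
    "cherry_2": [[8, 9, 2, 1], [8, 9, 2, 1]],
    "round_1": [[1, 2, 9, 8], [1, 2, 9, 8]],
    "beef_1": [[2, 1, 8, 9], [2, 1, 8, 9]],
}
merges = average_linkage(replicate_means(profiles))
for members, distance in merges:
    print(members, distance)
```

In this toy input the round and beef samples merge first and the two cherry samples merge next, so the final split of the tree separates the cherry samples from the rest.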
PCA revealed two major types of metabolic differences within the 94 tomato genotypes (Fig. 3B). First, in accordance with HCA, PCA showed a clear between-fruit-type variation, separating the cherry tomatoes, on the one hand, from the round and beef tomatoes, on the other hand (vector 1). In addition, PCA revealed a clear within-type variation in metabolite content, separating the 94 tomato cultivars into two groups independent of fruit type (vector 2). The daily replicated reference samples are located in the middle of both vectors of genotype differentiation. This is logical, since the reference sample was created by pooling of fruit material of several genotypes of each fruit type. The molecular fragments determining both the between- and within-type variations could be found by projection of the genotype differentiation vectors onto the PCA plot showing the distribution of the molecular fragments (Fig. 3C).
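The relation between the genotype plot (Fig. 3B) and the fragment plot (Fig. 3C) can be illustrated with a small PCA sketch: the scores place the samples and the loadings place the molecular fragments along the same component. The snippet uses the NIPALS algorithm on a column-centered matrix; the preprocessing and software used in the study are not described in this section, so the centering, the algorithm, and the toy data are all assumptions.

```python
from math import sqrt


def center_columns(rows):
    """Subtract each column (fragment) mean across the samples."""
    n = len(rows)
    means = [sum(col) / n for col in zip(*rows)]
    return [[v - m for v, m in zip(row, means)] for row in rows]


def nipals_pca(rows, n_components=2, max_iter=500, tol=1e-12):
    """Return (scores, loadings) for a samples x fragments matrix.

    Assumes the data contain at least n_components directions of nonzero variance.
    """
    x = center_columns(rows)
    scores, loadings = [], []
    for _component in range(n_components):
        # Start from the column with the largest sum of squares.
        t = list(max(zip(*x), key=lambda col: sum(v * v for v in col)))
        for _step in range(max_iter):
            tt = sum(v * v for v in t)
            p = [sum(row[j] * ti for row, ti in zip(x, t)) / tt for j in range(len(x[0]))]
            norm = sqrt(sum(v * v for v in p))
            p = [v / norm for v in p]                                   # unit-length loadings
            t_new = [sum(a * b for a, b in zip(row, p)) for row in x]   # sample scores
            converged = sum((a - b) ** 2 for a, b in zip(t_new, t)) < tol
            t = t_new
            if converged:
                break
        scores.append(t)
        loadings.append(p)
        # Deflate: remove the variation explained by this component.
        x = [[v - ti * pj for v, pj in zip(row, p)] for row, ti in zip(x, t)]
    return scores, loadings


# Hypothetical genotype means: rows are samples, columns are molecular fragments.
samples = ["cherry_1", "cherry_2", "round_1", "beef_1"]
matrix = [
    [10, 8, 1, 2],
    [8, 9, 2, 1],
    [1, 2, 9, 8],
    [2, 1, 8, 9],
]
scores, loadings = nipals_pca(matrix)
for name, score in zip(samples, scores[0]):
    print(name, round(score, 2))
```

Fragments with large positive or negative loadings on a component are the ones that drive the separation of the samples along it, which corresponds to projecting the genotype differentiation vectors onto the fragment plot.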
Step 4 (Fig. 1C-d) is as follows. A novel MMSR strategy was developed to reconstruct chemical structures of metabolites from the molecular fragment information of GC-MS profiles and subsequently to discover a biochemical meaning of the metabolic differences found.
The approach is based on two points. First, since fragmentation of a metabolite by the mass spectrometer occurs after chromatographic separation, molecular fragments derived from the same metabolite will appear within a peak of a certain width at a certain retention time in a chromatogram. Second, the relative ratio between intensities of molecular fragments derived from the same metabolite is constant. Therefore, the expression patterns of these molecular fragments must be identical throughout a set of variable metabolic profiles and hence must be highly correlated to each other. Based on these points, a metabolite may be defined as a group of highly correlated molecular fragments situated within a certain retention time window. Proceeding from this definition, all of the 20,000 molecular fragments were subjected to HCA by calculating the Pearson correlation between their intensity patterns throughout the GC-MS profiles of all the tomato genotypes analyzed. HCA resulted in clustering of molecular fragments showing identical or highly similar patterns of intensities throughout all GC-MS datasets (Fig. 4A). Those molecular fragments, which clustered together with a Pearson correlation coefficient equal to or higher than 0.8 and were situated within a maximal deviation in retention time of ≤6 s (corresponding to an average peak width at one-half height in the chromatograms we obtained), were considered to belong to the mass spectrum of one and the same metabolite. In total, 322 molecular fragment clusters were obtained. The mass spectra of the 15 key flavor-related tomato volatiles (Baldwin et al., 2000) were in agreement with the mass spectra reconstructed from the molecular fragment clusters at their corresponding retention times, as shown by the example of 2-isobutylthiazole in Figure 4C. This suggests that the 322 molecular fragment clusters each represent the mass spectrum of an individual volatile compound. 
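The working definition of a metabolite given above (fragments with Pearson r of at least 0.8 that lie within 6 s of each other) can be sketched as a grouping procedure. The snippet links every qualifying pair of fragments and reports the connected groups. This is a simplification of the HCA-based procedure described in the text; the function name, the m/z values, and the intensities are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equally long intensity patterns."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy) if sxx > 0 and syy > 0 else 0.0


def reconstruct_spectra(fragments, min_r=0.8, max_rt_gap=6.0):
    """Group fragments into putative compounds; returns sorted lists of m/z values.

    fragments -- list of dicts with 'mz', 'rt' (seconds), and 'intensity'
                 (one value per GC-MS profile)
    """
    parent = list(range(len(fragments)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]   # path compression
            i = parent[i]
        return i

    for i in range(len(fragments)):
        for j in range(i + 1, len(fragments)):
            a, b = fragments[i], fragments[j]
            close = abs(a["rt"] - b["rt"]) <= max_rt_gap
            if close and pearson(a["intensity"], b["intensity"]) >= min_r:
                parent[find(i)] = find(j)   # same putative compound

    groups = {}
    for i, frag in enumerate(fragments):
        groups.setdefault(find(i), []).append(frag["mz"])
    return sorted(sorted(g) for g in groups.values())


# Hypothetical fragments: two coeluting compounds plus one distant fragment.
fragments = [
    {"mz": 99, "rt": 1250.0, "intensity": [10, 40, 20, 80, 5]},
    {"mz": 141, "rt": 1251.0, "intensity": [5, 20, 10, 40, 2.5]},
    {"mz": 91, "rt": 1252.0, "intensity": [60, 10, 50, 5, 30]},
    {"mz": 65, "rt": 1252.5, "intensity": [30, 5, 25, 2.5, 15]},
    {"mz": 43, "rt": 1400.0, "intensity": [10, 40, 20, 80, 5]},  # correlated but too far away
]
print(reconstruct_spectra(fragments))  # [[43], [65, 91], [99, 141]]
```

The pairwise loop is quadratic in the number of fragments; for a matrix of about 20,000 fragments, sorting by retention time and comparing only fragments inside the window would keep the number of comparisons manageable.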
Overlapping mass spectra of coeluting compounds could also be successfully discriminated from each other using MMSR. Molecular fragments of coeluting compounds were clustered based on the similarity of their patterns throughout the samples, and the number of clusters indicated the number of overlapping chemical compounds at a certain retention time (Fig. 4, A and B). In many cases, MMSR allowed extraction of all major fragments of the mass spectrum of a particular coeluting compound (Fig. 4C; Supplemental Data II). In others, it revealed only a few compound-specific fragments (data not shown). For compound identification, the AMDIS software package, dedicated to chromatogram deconvolution, was used as a bridge to match the compound spectral information derived by MMSR to entries of the National Institute of Standards and Technology (NIST) library of chemical compound mass spectra (as described in “Materials and Methods”).
Figure 4. MMSR-driven discrimination of mass spectra. A, Dendrogram showing a clustering of intensity patterns of ions situated in the retention time window 20.8 to 21.07 min into several molecular fragment clusters. B, MMSR indicated the presence of five individual compounds within a visually single total ion count (TIC) peak within the chosen time window. C-1, An experimental mass spectrum, obtained by plotting the original intensities of the molecular fragments of compound b, could be matched to the mass spectrum of the chemical standard analog of 2-isobutylthiazole (C-2), which also has a retention time falling within the chosen window.
Phe-Derived Volatiles Mostly Explain the Difference in the Composition of the Tomato Fruit Volatile Metabolome
The MMSR and NIST library matching results revealed that the molecular fragments (Fig. 3C), which were most discriminative between the tomato genotypes (Fig. 3B), belonged to two groups of volatile metabolites, derived from the phenolic (depicted in pink) and phenylpropanoid (depicted in blue) pathways (Fig. 3D). Interestingly, both groups originate from the amino acid Phe. Cherry tomatoes could be distinguished from round and beef by a relatively high accumulation of phenolic-derived volatiles (Fig. 3, D and F, vector 1). Two phenolic alcohols, phenylethanol and benzyl alcohol, showed the highest contribution to the cherry versus round/beef contrast. Volatiles derived from the phenylpropanoid pathway, including methyl and ethyl salicylate, guaiacol, eugenol, and salicylaldehyde, were responsible for the division of the genotypes into two groups independent of the tomato fruit type (Fig. 3D, vector 2). Both types of Phe-derived volatiles revealed the largest relative variation across the 94 genotypes (Table I).
Besides phenolic volatiles, cherry tomatoes also contained relatively high levels of lipid derivatives (Fig. 3F) and low levels of terpenoids, open-chain carotenoid derivatives, and Leu/Ile-derived products (Fig. 3E).
Patterns of the 322 Volatiles Are Correlated According to Their Precursor or Metabolic Pathway
The 322 compounds were subjected to HCA using the Pearson correlation coefficient. This revealed the presence of a few major compound clusters, as shown in a correlation matrix (Fig. 5). Compounds situated in the clusters were subjected to a putative identification by matching their mass spectra to the NIST library. Reliable matching results were obtained for 100 of them, of which 70 metabolites had previously been described as tomato fruit volatiles (Petro-Turza, 1987; Table II). The reliability of the identity prediction was assessed through comparison with 46 authentic chemical standards. Each of those standards represented a first hit from the NIST library search (see Supplemental Data I) and together covered a few members of each compound cluster. Of those 46 standards, 43 confirmed the identity of the predicted compound, indicating that the prediction of compound identity was very reliable.
Figure 5. Metabolite-metabolite correlation matrix of the 322 plant-derived compounds. A, The main compound clusters are situated along the diagonal line (groups a–g). Correlations between metabolites are shown in grayscale: the darker the gray, the higher the percentage of similarity between metabolite expression patterns. B, Detailed dendrogram of each compound cluster with putative compound identity as described in Table II. Compound cluster: a, phenylpropanoid volatiles; b, other phenolic volatiles; c, Leu and Ile derivatives (c1 and c2, respectively); d, lipid derivatives. Isoprenoids: e, terpenoids; f, open-chain carotenoid derivatives; g, cyclic carotenoid derivatives.
Table II. Putative identity of volatile metabolites present within the clusters obtained using HCA (Fig. 5)
Metabolites were identified by matching their mass spectra to the NIST library. RT, Retention time; specific ion (m/z), mass (m/z value) of a compound-specific molecular fragment; identity, putative identity, according to the highest NIST library match; NIST match, matching score (1,000 = 100% identical to the NIST library entry), (+) or (−) after the NIST match value (the NIST match was confirmed [+] or was not confirmed [−] by an authentic chemical standard injection); biochemical group, corresponding cluster in Figure 5.
The identification results revealed that each of the compound clusters contained compounds that have a common biochemical precursor or belong to the same metabolic pathway (Fig. 5): Phe derivatives (phenolic and phenylpropanoid volatiles), Leu and Ile derivatives, lipid derivatives, and isoprenoid derivatives, consisting of open-chain and cyclic carotenoid breakdown products and terpenes (Buttery and Ling, 1993; Baldwin et al., 2000).
DISCUSSION
High-Throughput Screening of Volatiles
Volatile tomato fruit metabolites have been profiled using headspace SPME-GC-MS, a procedure that has been used in the past for many plant matrices, including tomato fruits (Song et al., 1998; Deng et al., 2004). SPME is superior to other sampling methods in both speed and robustness (Yang and Peppard, 1994). Only direct thermal desorption exceeds SPME in terms of sensitivity (Pfannkoch and Whitecavage, 2000). SPME-based methods, like other methods based on headspace extraction, are considered semiquantitative because of matrix effects and a relatively short linear dynamic range, drawbacks that in many cases do not allow absolute quantification. However, metabolomics, like other profiling techniques such as microarray analyses, mostly operates with intensity patterns formed by relative responses, which allow searching for potential differences and performing multivariate comparative analyses. Absolute quantification of the levels of these volatiles will be performed using more sophisticated methods in our future experiments.
For a high-throughput analysis of a large number of biological samples, an automated sequential manipulation of the samples is required. To obtain reliable data in this way, the metabolic composition has to be stable during the entire period of experimentation. This is especially important when analyzing complex native plant materials such as fruit tissue. To develop an automated high-throughput SPME-GC-MS method to screen and profile fruit volatiles of 94 tomato cultivars, the initial focus was placed on 15 volatile metabolites that are of particular importance in determining tomato fruit flavor (Buttery and Ling, 1993; Baldwin et al., 2000). First, we optimized the stability of the metabolites by adding NaOH/EDTA/CaCl2 at the end of the sample preparation procedure. This procedure stabilizes the fruit matrix for at least 12 h, presumably by increasing the pH and exploiting the chelating effect of EDTA to prevent compound oxidation. The method was found to be suitable, reliable, and accurate and enabled the automated measurement of the large numbers of fruit samples required for this investigation.
Full Spectral Alignment Enables Unbiased Comparative Metabolomics
Metabolomics aims to generate a comprehensive overview of the identity and quantity of metabolites in biological materials. The general principle currently used is that all compounds are identified prior to their, often relative, quantification and subsequent comparison throughout the biological samples. When using GC-MS, each chemical compound is classified on the basis of both its relative retention time and its mass spectrum. This mass spectrum gives a unique fingerprint of the chemical resulting from its fragmentation on entering the mass spectrometer. However, when using complex plant extracts, despite effective prior chromatographic separation, many compounds inevitably coelute and their mass spectra overlap, thus complicating their discrimination. Consequently, the compound discrimination step (not always unbiased) limits the comprehensiveness of the metabolomic analyses. As an alternative strategy for comparative metabolomic analysis, we propose here a protocol that is based upon an unbiased empirical quantification and search for metabolic differences at the level of molecular fragments (ions) prior to compound identification. This approach avoids the time-consuming need for any prior assignment of chemical information to the molecular structure for hundreds of datasets and thus makes it possible to gain a faster, less biased, and nontargeted metabolomic overview. Furthermore, this approach makes it possible to home in specifically on those mass peaks that are discriminatory between samples. This approach, however, depends on the initial ability to align the spectral patterns of the tens of thousands of molecular fragments present throughout all the GC-MS datasets to be compared. For this, we used the MetAlign software package to eliminate noise, compensate for retention time shifts, and align the mass spectral information. This resulted in a data matrix of about 4,000,000 data points (198 datasets × 20,000 mass peaks detected).
Each row of this data matrix displays the intensities of a unique molecular fragment throughout the 198 GC-MS datasets. However, a number of contaminants resulting from the fiber material can usually be found when using SPME. A multivariate analysis (HCA) of the molecular fragment patterns throughout the GC-MS profiles obtained allowed us to extract the fragments related to the fiber and to remove them from the dataset automatically. The complete mass spectral alignment of metabolic profiles has thus allowed us to perform a reliable, multivariate comparative analysis of the 94 genotypes studied. This analysis revealed both between-fruit-type metabolic differences, discriminating cherry tomatoes from round and beef, and within-fruit-type metabolic differences, which were independent of fruit type, and allowed the discrimination of the molecular fragments determining the variation between genotypes. However, to subsequently obtain biologically relevant information, we have to be able to relate these discriminative molecular fragments to their parent compounds in order to perform a putative identification. To overcome the limitations of metabolite recognition and identification that are due to high metabolome complexity and variability, we developed an approach that allows an automated reconstruction of the mass spectra of individual compounds (MMSR). This approach is based on the fixed ratio of molecular fragment intensities resulting from the fragmentation of a particular molecule. Logically, even if the abundance of a compound varies between samples, the ratios of its molecular fragment intensities derived from the parent molecule should remain the same throughout all the samples. Consequently, when molecular fragments cluster together after being subjected to a multivariate analysis such as HCA and their retention times fall within a predefined window, it can be concluded that they relate to the same chemical compound.
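The fixed-ratio argument can be written out explicitly. The symbols below are introduced here for illustration and do not appear in the original text: $a_i > 0$ is the relative abundance of fragment $i$ in the spectrum of a compound, $c_j$ is the amount of that compound in sample $j$, and $I_{ij}$ is the measured intensity of fragment $i$ in sample $j$, assuming an idealized, noise-free response.

```latex
I_{ij} = a_i \, c_j
\quad\Longrightarrow\quad
I_{kj} = \frac{a_k}{a_i} \, I_{ij} \quad \text{for every sample } j,
\qquad\text{so}\qquad
r\left(I_{i\cdot}, I_{k\cdot}\right)
= \frac{\mathrm{cov}(a_i c,\; a_k c)}
       {\sqrt{\mathrm{var}(a_i c)\,\mathrm{var}(a_k c)}}
= \frac{a_i a_k \,\mathrm{var}(c)}{a_i a_k \,\mathrm{var}(c)}
= 1 .
```

The result requires only that the compound amount varies between samples, that is, $\mathrm{var}(c) > 0$. In real data, noise and partial spectral overlap pull $r$ below 1, which is consistent with grouping fragments by a correlation threshold (0.8 in this study) rather than by exact equality.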
Using MMSR, we were able to discriminate the full array of chemical compounds present in all datasets using one automated procedure, even in cases of complex overlapping mass spectra (Fig. 4). In comparison, when using AMDIS alone (a software package dedicated to resolving compound overlap by means of automated mass spectral deconvolution) for chemical compound discrimination, we were unable to obtain an equally reliable prediction of the number and chemical identity of overlapping compounds. This is due to their variable mass intensities across the wide range of different samples (data not shown).
Deconvolution procedures are generally reliable and frequently used to handle individual GC-MS datasets. However, when analyzing hundreds of samples, a limited number of datasets that are assumed to fully represent the compound diversity of the entire sample set have to be selected visually for deconvolution. The compounds that can be discriminated in these representative datasets are subsequently used for a comparative analysis of the entire sample set. Such procedures, based on a prior mass spectral deconvolution of GC-MS profiles, have been used successfully, and this has allowed the discrimination and identification of many compounds in plant extracts (Taylor et al., 2002). However, in contrast to this conventional procedure, MMSR is not limited to a subset of representative datasets and uses all available spectral information, thus allowing discrimination and recognition of all individual compounds based on their variability patterns. This significantly improves comprehensiveness, since even when a particular compound is only abundant in one of the 94 samples it will still be included in the analysis.
In our tomato study, a total of 322 tomato volatile compounds could be discriminated in 198 datasets. This is approximately 80% of all the volatile metabolites (>400 different volatiles) that have so far been reported in tomato fruit (Petro-Turza, 1987). The multivariate analyses (HCA, PCA) revealed that most of the compounds that could be identified clustered on the basis of their biochemical nature (Fig. 5), and the entire metabolic organization could be characterized by the existence of a few large compound groups (e.g. lipid derivatives, phenolic and phenylpropanoid volatiles, isoprenoids, etc.). The main metabolite clusters could subsequently be divided into smaller subclusters. For example, the compounds of cluster c (Fig. 5B) could be clearly divided into two distinct subclusters based on their biochemical precursors, Leu and Ile. Interestingly, volatiles derived from Leu include, besides alcohols (e.g. 3-methylbutanol) and aldehydes (e.g. 3-methylbutanal), a number of nitrogen-containing compounds (yet to be identified) and even a sulfur-containing heterocyclic compound, 2-isobutylthiazole, which are all known to be Leu derived (Buttery and Ling, 1993).
All isoprenoid volatiles can be roughly separated into three subclusters representing terpenoids, open-chain, and cyclic carotenoid derivatives (Fig. 5B, groups e, f, and g, respectively). Interestingly, the terpenoids α- and β-citral appeared in the group of open-chain carotenoid derivatives. This is in line with previous observations that the citral isomers may be derived as a degradation product of lycopene (Cole and Kapur, 1957; Schreier et al., 1977). For several other compounds, such as acetophenone and 4-methylacetophenone, the biosynthetic pathway is still unclear and their clustering with terpenoids and cyclic carotenoid volatiles, respectively, may shed new light on their biochemical origin.
Mathematical analyses of metabolic pathway databases of many organisms have led to the concept of hierarchical modularity in the organization of metabolic networks. This concept implies that cellular functionality is organized in a set of functional modules, which consequently are organized in a few large modules, which in turn can be grouped into even larger modules (Jeong et al., 2000; Ravasz et al., 2002; Ihmels et al., 2004). The hierarchical modularity of metabolic network organization would allow robust, error-tolerant, and energetically efficient functioning of biological systems. Our experimental results based on an analysis of metabolic expression patterns clearly reflect the features of this concept: structurally related metabolites resulting from different enzymatic or nonenzymatic reactions, but originating from a common metabolic precursor, were clustered into groups and subgroups representing distinct metabolic pathways. This modularity may be due to the existence of a coordinate regulation of these metabolic pathways, e.g. by specific transcription factors activating the expression of the structural genes in a pathway. It may also reflect regulation by the activity of key enzymes, which determines the flux through the downstream pathway or the availability of metabolite precursors at the beginning of a metabolic pathway. Although the functional implications of this modular clustering still remain to be elucidated, the existence of such functional modularity can be assumed for the group of phenylpropanoid metabolites and their derivatives, as seen here for tomato, since phenylpropanoid metabolism is known to contribute to plant stress responses (Dixon and Paiva, 1995) and methyl salicylate has been shown to act as an airborne signaling agent in plant pathogen resistance (Shulaev et al., 1997; Seskar et al., 1998).
In this light, it is possible that genotypes with increased levels of these compounds may have been selected through the years for their increased capability to respond to biotic or abiotic stress.
CONCLUSION
The high-resolution, comprehensive, and unbiased strategy for metabolomic data analysis presented here is novel and opens new directions of discovery in the field of metabolomics. Full mass spectral alignment of GC-MS metabolic profiles followed by a universal strategy for chemical compound discrimination has allowed us to perform a high-resolution, unbiased, and fast multivariate comparative analysis of volatile biochemical composition of 94 tomato genotypes (198 complex plant extracts) based on metabolic information derived by the analytical method. The large-scale picture of the volatile part of the tomato fruit metabolome reflects the hierarchical modularity of metabolism organization that is assumed to be common for different levels of a biological system. Further projecting the data into data from other “omics” technologies will pave the way for a true systems biology approach to investigating cell networks and more directed gene discovery.
The main goal of this study was to describe this novel efficient approach for unbiased analysis of complex biochemical datasets. A detailed biological interpretation of the data obtained is beyond the scope of this article, but it is anticipated that this will provide much new information on the heterogeneity in biochemical composition within tomato varieties, and this will be the subject of our future investigations.
MATERIALS AND METHODS
Plant Material
Ninety-four tomato (Lycopersicon esculentum Mill.) genotypes were obtained from six different tomato seed companies, each with its own breeding program. As such, the cultivars should represent a considerable collection of genetic and therefore phenotypic variation, not just between tomato types (cherry, round, and beef), but also within the individuals of each type. This study was deliberately performed blind. We only received information from the tomato breeders of the companies supplying the material concerning the tomato fruit types and not their genetic background. For classification, breeders generally use a combination of (1) fruit diameter and (2) number of locules in the fruit (fl). For the latter, the criteria were as follows: cherry-type fl = 2; round fl = 3; beef fl = 4 or more. All cultivars were grown in the summer of 2003 under greenhouse conditions at a single location in Wageningen, The Netherlands. Nine plants, randomly distributed over three adjacent greenhouse compartments, were grown for each cultivar. Pink-staged tomato fruits of all plants were picked on two consecutive days. To mimic the conditions from the farm to the fork, fruits were stored for 1 week at 15°C and then shifted to 20°C 24 h prior to freezing. During this period, the fruits continued to ripen slowly and, at the moment of sampling, the fruits were fully red ripe, resembling the conditions at the time of consumption. For each cultivar, a selection of red ripe fruits (12 for round and beef tomatoes and 18 for cherry tomatoes) was pooled to make a representative fruit sample. The fruit material was immediately frozen in liquid nitrogen, ground in an analytical electric mill, and stored at −80°C before analyses.
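The locule criterion stated above can be written as a small rule. This is a sketch of that single criterion only: breeders combine it with fruit diameter, for which no thresholds are given here, and the function name and the handling of counts below 2 are assumptions made for illustration.

```python
def fruit_type_from_locules(locule_count: int) -> str:
    """Fruit type from the number of locules (fl), using only the stated criteria.

    cherry: fl = 2; round: fl = 3; beef: fl = 4 or more.
    Counts below 2 are not covered by the stated criteria, so they are
    rejected here (an assumption of this sketch).
    """
    if locule_count < 2:
        raise ValueError("locule counts below 2 are not covered by the criteria")
    if locule_count == 2:
        return "cherry"
    if locule_count == 3:
        return "round"
    return "beef"

for fl in (2, 3, 4, 7):
    print(fl, fruit_type_from_locules(fl))
```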
Standard Chemicals
Fifteen analytical grade chemicals (all obtained from Sigma) were used as authentic standards to optimize the SPME-GC-MS method for automated sequential analysis of hundreds of samples. These were cis-3-hexenal, β-ionone, hexanal, 1-penten-3-one, 2-methylbutanal, 3-methylbutanal, trans-2-hexenal, 2-isobutylthiazole, trans-2-heptenal, phenylacetaldehyde, 6-methyl-5-hepten-2-one, cis-3-hexenol, 2-phenylethanol, 3-methylbutanol, and methyl salicylate. For metabolite identification, an additional set of standards was used. These included 2-methylbutanol, 3-methylbutanoic acid, 3-methylbutanol nitrite, 1-hexanol, pentanal, 2-ethylfuran, 1-pentanol, heptanal, E,E-2,4-hexadienal, salicylaldehyde, eugenol, guaiacol, ethylbenzene, styrene, benzaldehyde, benzonitrile, benzyl alcohol, phenylacetonitrile, β-phenylpropionaldehyde, phenol, p-cresol, acetophenone, 4-methylacetophenone, geranylacetone, α-isophorone, β-cyclocitral, α-citral, β-citral, limonene, cis- and trans-linalool oxide, and α-terpineol.
Sample Preparation Procedure and Headspace SPME-GC-MS Analysis
Frozen fruit powder (1 g fresh weight) was weighed in a 5-mL screw-cap vial, closed, and incubated at 30°C for 10 min. An EDTA-NaOH water solution was prepared by adjusting 100 mM EDTA to a pH of 7.5 with NaOH. Then, 1 mL of the EDTA-NaOH solution was added to the sample to a final EDTA concentration of 50 mM. Solid CaCl2 was then immediately added to give a final concentration of 5 M. The closed vials were then sonicated for 5 min. A 1-mL aliquot of the pulp was transferred into a 10-mL crimp cap vial (Waters), capped, and used for SPME-GC-MS analysis.
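The stated final EDTA concentration follows from a simple dilution, sketched below. The calculation assumes that 1 g of fruit powder contributes approximately 1 mL to the final volume; this volume is an assumption of the sketch and is not stated in the protocol.

```python
# Dilution check: C_final = C_stock * V_added / (V_added + V_sample)
stock_edta_mm = 100.0    # EDTA-NaOH solution, in millimolar
added_volume_ml = 1.0    # volume of EDTA-NaOH solution added
sample_volume_ml = 1.0   # ASSUMPTION: 1 g fruit powder is treated as about 1 mL

final_edta_mm = stock_edta_mm * added_volume_ml / (added_volume_ml + sample_volume_ml)
print(f"final EDTA concentration: {final_edta_mm:.0f} millimolar")
```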
Each of the 94 tomato fruit samples was analyzed using two replicated aliquots. In total, 22 freshly prepared samples were measured per day (two series of 11 samples). In addition, reference tomato samples were made by mixing fruit powders from several genotypes of the round, beef, and cherry fruit phenotypes. This mixture was routinely analyzed every day of experimentation as an external control in order to monitor the stability of the analytical system. The samples were automatically extracted and injected into the GC-MS via a Combi PAL autosampler (CTC Analytics AG). Headspace volatiles were extracted by exposing a 65-μm polydimethylsiloxane-divinylbenzene SPME fiber (Supelco) to the vial headspace for 20 min under continuous agitation and heating at 50°C. The fiber was inserted into a GC 8000 (Fisons Instruments) injection port and volatiles were desorbed for 1 min at 250°C. Chromatography was performed on an HP-5 (50 m×0.32 mm×1.05 μm) column with helium as carrier gas (37 kPa). The GC interface and MS source temperatures were 260°C and 250°C, respectively. The GC temperature program began at 45°C (2 min), was then raised to 250°C at a rate of 5°C/min, and finally held at 250°C for 5 min. The total run time, including oven cooling, was 60 min. Mass spectra in the 35 to 400 m/z range were recorded by an MD800 electron impact MS (Fisons Instruments) at a scanning speed of 2.8 scans/s and an ionization energy of 70 eV. The chromatography and spectral data were evaluated using Xcalibur software (http://www.thermo.com).
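The timing of the analysis can be checked with a short calculation based only on the parameters given above. The sketch assumes that spectra are acquired throughout the oven program, which is not stated explicitly, and all variable names are illustrative.

```python
# Oven program parameters as stated in the text.
initial_temp_c = 45.0
final_temp_c = 250.0
ramp_c_per_min = 5.0
initial_hold_min = 2.0
final_hold_min = 5.0
total_run_min = 60.0      # includes oven cooling
scans_per_s = 2.8
samples_per_day = 22

ramp_min = (final_temp_c - initial_temp_c) / ramp_c_per_min
program_min = initial_hold_min + ramp_min + final_hold_min
cooling_min = total_run_min - program_min

# ASSUMPTION: spectra are recorded during the whole oven program.
scans_per_run = round(program_min * 60 * scans_per_s)
# Scans across a 6-s window (the peak-width window used for MMSR).
scans_per_6s = round(6 * scans_per_s, 1)

instrument_hours_per_day = samples_per_day * total_run_min / 60

print(f"ramp: {ramp_min:.0f} min, oven program: {program_min:.0f} min")
print(f"time left for oven cooling: {cooling_min:.0f} min")
print(f"scans per run: {scans_per_run}, scans per 6 s: {scans_per_6s}")
print(f"instrument time per day: {instrument_hours_per_day:.0f} h")
```

Under these figures the oven program itself takes 48 min, leaving 12 min of the 60-min run for cooling, and 22 samples occupy 22 h of instrument time per day.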
Data Analyses: Multivariate Comparative Analysis and MMSR
1. For automated baseline correction, mass spectra extraction, and subsequent spectral data alignment, in total 198 GC-MS datasets were processed simultaneously using the dedicated MetAlign metabolomics software package (http://www.metalign.nl; Fig. 2A).
2. The aligned metabolic profiles were subjected to multivariate analyses, HCA (using the Pearson correlation coefficient) and PCA, to search for metabolic differences between the tomato genotypes at the level of molecular fragments (Fig. 2, A–C). The multivariate analyses were performed using the GeneMaths software package (http://www.applied-maths.com). A log2 transformation was applied to the data prior to the multivariate analyses.
3. MMSR was used to assign the molecular fragments to compounds. For this, the patterns of all molecular fragments were subjected to HCA. Those molecular fragments that showed a Pearson correlation of 0.8 or higher and were situated within a 6-s retention time window (which corresponds to the average peak width at one-half height in the chromatograms we obtained) were considered as belonging to the spectrum of one compound.
4. For compound identification, the following steps were used: (1) for each compound selected for putative identification, the optimal chromatogram was selected with respect to relative abundance and overlap with other compounds at the specific position; (2) for each selected compound, specific molecular fragments (ions, m/z) were selected from the corresponding fragment cluster derived by MMSR; (3) the selected fragments were used as a basis for deconvolution of the chromatographic peak at the corresponding retention time using AMDIS (Stein, 1999). Mass spectral models derived in this way were matched to the NIST mass spectral library (http://www.nist.gov).
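The fragment-to-compound assignment of steps 2 and 3 can be sketched as follows. This is a simplified illustration and not the GeneMaths HCA used in this study: fragments are linked pairwise when their log2-transformed patterns reach the Pearson threshold and their retention times differ by no more than the window, and linked fragments are merged into groups. The function names, the pairwise reading of the retention time window, and the example values are assumptions; intensities are assumed to be positive and not constant across samples.

```python
import math
from itertools import combinations

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

def group_fragments(intensities, retention_s, min_r=0.8, rt_window_s=6.0):
    """Group fragment ids whose patterns correlate and whose peaks co-elute.

    intensities: dict of fragment id -> intensities across samples (all > 0)
    retention_s: dict of fragment id -> retention time in seconds
    """
    ids = list(intensities)
    # log2 transformation prior to the correlation analysis (step 2).
    logged = {f: [math.log2(v) for v in intensities[f]] for f in ids}
    parent = {f: f for f in ids}

    def find(f):
        # Follow parent links to the representative of the group.
        while parent[f] != f:
            parent[f] = parent[parent[f]]
            f = parent[f]
        return f

    for a, b in combinations(ids, 2):
        co_eluting = abs(retention_s[a] - retention_s[b]) <= rt_window_s
        if co_eluting and pearson(logged[a], logged[b]) >= min_r:
            parent[find(a)] = find(b)  # merge the two groups

    groups = {}
    for f in ids:
        groups.setdefault(find(f), []).append(f)
    return sorted(sorted(g) for g in groups.values())

# Hypothetical fragments measured in four samples.
intensities = {
    "A1": [100.0, 400.0, 50.0, 800.0],   # compound A
    "A2": [20.0, 80.0, 10.0, 160.0],     # compound A, fixed ratio to A1
    "B1": [500.0, 30.0, 700.0, 40.0],    # co-eluting but different pattern
    "C1": [110.0, 390.0, 55.0, 820.0],   # similar pattern, elutes much later
}
retention_s = {"A1": 600.0, "A2": 601.5, "B1": 602.0, "C1": 900.0}

groups = group_fragments(intensities, retention_s)
print(groups)
```

In this example the co-eluting fragment B1 is kept apart from compound A because its pattern differs, and C1 is kept apart because it elutes outside the window despite a similar pattern. Both criteria are therefore needed to reconstruct a spectrum.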
Acknowledgments
The authors are grateful to Syngenta Seeds, Seminis, Enza Zaden, Rijk Zwaan, Nickerson-Zwaan, and De Ruiter Seeds for providing seeds of the 94 tomato cultivars. We would like to thank Mrs. Fien Meijer-Dekens, Mrs. Petra van den Berg, Dr. A.W. van Heusden, and Dr. Pim Lindhout for excellent greenhouse management and plant cultivation, and Dr. Harro Bouwmeester and Mr. Francel Verstappen for helpful discussions and technical support.
Footnotes
- The author responsible for distribution of materials integral to the findings presented in this article in accordance with the policy described in the Instructions for Authors (www.plantphysiol.org) is: Arnaud G. Bovy (arnaud.bovy{at}wur.nl).
- 1 This work was supported by the research program of the Centre of BioSystems Genomics, which is part of the Netherlands Genomics Initiative/Netherlands Organization for Scientific Research.
- [w] The online version of this article contains Web-only data.
- Received July 6, 2005.
- Revised September 13, 2005.
- Accepted September 13, 2005.
- Published November 11, 2005.