Plant Physiol. Bio-Rad Microplate Reader
HOME HELP FEEDBACK SUBSCRIPTIONS ARCHIVE SEARCH TABLE OF CONTENTS
 QUICK SEARCH:   [advanced]


     


This Article
Right arrow Full Text (PDF)
Right arrow Alert me when this article is cited
Right arrow Alert me if a correction is posted
Services
Right arrow Email this article to a friend
Right arrow Similar articles in this journal
Right arrow Similar articles in ISI Web of Science
Right arrow Similar articles in PubMed
Right arrow Alert me to new issues of the journal
Right arrow Download to citation manager
Right arrow reprints & permissions
Citing Articles
Right arrow Citing Articles via HighWire
Right arrow Citing Articles via CrossRef
Right arrow Citing Articles via ISI Web of Science (50)
Right arrow Citing Articles via Google Scholar
Google Scholar
Right arrow Articles by Kell, D. B.
Right arrow Articles by Draper, J.
Right arrow Search for Related Content
PubMed
Right arrow PubMed Citation
Right arrow Articles by Kell, D. B.
Right arrow Articles by Draper, J.
Agricola
Right arrow Articles by Kell, D. B.
Right arrow Articles by Draper, J.

Plant Physiol, July 2001, Vol. 126, pp. 943-951

SCIENTIFIC CORRESPONDENCE

Genomic Computing. Explanatory Analysis of Plant Expression Profiling Data Using Machine Learning1


Douglas B. Kell,* Robert M. Darby, and John Draper

Institute of Biological Sciences, University of Wales, Aberystwyth SY23 3DD, United Kingdom


    INTRODUCTION
TOP
INTRODUCTION
Multivariate Data Analysis and...
A Data Flood Defense...
Evolving Simple Answers to...
An Example: Analysis of...
CONCLUSION
LITERATURE CITED

"Actually, the orgy of fact extraction in which everybody is currently engaged has, like most consumer economies, accumulated a vast debt. This is a debt of theory, and some of us are soon going to have an exciting time paying it back---with interest, I hope."

---Sydney Brenner, In Theory, 1997

As with every other organism whose genome has been sequenced (Hinton, 1997; Bork et al., 1998), a chief finding in plants (Bevan et al., 1999; Somerville and Somerville, 1999) is the presence of a vast number of genes (many with no relatives in the databases) whose existence, let alone function, had previously gone unrecorded. The importance of finding the function of these genes has led to what amounts to a complete reversal of conventional scientific strategies (Brent, 1999, 2000; Kell and Mendes, 2000), in which one would start with a phenotype (e.g. flower color) and devise experiments that would lead one to the genes whose products were responsible for producing that phenotype. Now, the dawn of the post-genomic era has (consequently) spawned major commercial and academic programs in which plants with more or less defined genotypes (e.g. knockouts; Martienssen, 1998) are being subjected to parallel and high-throughput analyses at the level of the transcriptome (Ruan et al., 1998; Schaffer et al., 2000; Schenk et al., 2000), the proteome (Santoni et al., 1998; Jacobs et al., 2000; Prime et al., 2000; van Wijk, 2000), the metabolome (Oliver et al., 1998; Trethewey et al., 1999; Fiehn et al., 2000; Johnson et al., 2000; Kell and Mendes, 2000; Raamsdonk et al., 2001; Trethewey, 2001), and the phenotype (Rieger et al., 1999), which will provide the wherewithal to assess the contribution of different genes through the activities of their products to the overall functioning of cells and organisms. The problem at hand is then how best to exploit the high-dimensional data floods so generated (e.g. with thousands of gene products or metabolites) for providing the comparatively low-dimensional explanations that we require at higher levels of organization (this gene is or is not important, for example, in cold tolerance).


    Multivariate Data Analysis and Machine Learning
TOP
INTRODUCTION
Multivariate Data Analysis and...
A Data Flood Defense...
Evolving Simple Answers to...
An Example: Analysis of...
CONCLUSION
LITERATURE CITED

This highly multivariate data analysis problem is most easily thought of in relation to Figure 1A (Kell and King, 2000), which depicts a familiar view in the style of a spreadsheet or database table (Fig. 1A), in which the samples of interest are represented in different rows and a set of their properties in columns. Some of the columns might represent, for example, expression profiling data ("explanatory variables" or "x-data" in the jargon) that are going to be the inputs, whereas the functional or other classes of interest, which are still variables associated with the samples ("dependent variables" or "y-data"), are thus represented by a subset of the columns. The game then is to use the values of the input (x) variables to predict the appropriate classes of interest (the appropriate value of the y variable). Thus, Figure 1B equivalently depicts the problem as a set of multivariate inputs which may be transformed, via a set of mathematical transformations, into a series of outputs (possibly just one), such that application of the data vector on the left leads to the correct classification of the object from which the data were generated on the right side of the transformation.



View larger version (26K):
[in this window]
[in a new window]
 
Figure 1.   Supervised learning of a classical propositional system. A, The samples are set out to form the rows of a table, while relevant values (or category membership such as male/female) of their associated variables form the columns. Some of the variables are used to predict the values of other variables. B, In a different but equivalent representation, the explanatory variables appear on the left and the dependent variables on the right, and the aim is to produce a mathematical transformation that uses some or all of the inputs and classifies the object into the correct class on the right. Such a classification can also have a numerical value, such as the concentration of a metabolite that cannot be measured directly, the severity of a disease, and so on.

In machine learning, it is normal to distinguish methods that use only the x-data (unsupervised methods) from supervised learning methods, which are trained using both the x-data and the y-data (Duda and Hart, 1973; Jain and Dubes, 1988; Therrien, 1989; Rich and Knight, 1991; Weiss and Kulikowski, 1991; Fukunaga, 1992; Michie et al., 1994; Bishop, 1995; Livingstone, 1995; Ripley, 1996; Mitchell, 1997). A vast pantheon of examples shows that supervised methods are always much more powerful than are unsupervised ones (such as the widely used Principal Components Analysis and clustering methods), because they concentrate on the variance that matters for the question of interest. In an example of our own concerning the exploitation of mass spectrometry in the assessment of the adulteration of extra virgin olive oils (Goodacre et al., 1992, 1993), most of the variance in the spectra was found to be due to the cultivar of olive, and not whether the oils were adulterated, such that unsupervised methods were fine for discriminating cultivars (see also Bianchi et al., 2001) but were useless for detecting adulteration. By contrast, a supervised method (in that case a fully interconnected backpropagation-type neural network (Wasserman, 1989; Hertz et al., 1991; Bishop, 1995; Ripley, 1996) trained on the same data succeeded in classifying all the oils in an unseen (double blind) test set of data (Goodacre et al., 1992; Goodacre et al., 1993).

The generalized problem so described parallels rather clearly the numerous DNA microarray experiments that have been performed and which are normally analyzed purely in terms of co-expression or clustering (Eisen et al., 1998; Tamayo et al., 1999; Burke, 2000; Getz et al., 2000), a strictly unsupervised method. Supervised methods are again much more appropriate here, though each has strengths and weaknesses (Brown et al., 1999; Alizadeh et al., 2000; Gilbert et al., 2000; Hastie et al., 2000), and using supervised learning methods to classify individual samples from microarray or other data in terms of a gene function or other type of class does require that one has a sensible class structure in the first place (Kell and King, 2000).

It should be noted that many general approaches and methods exist for supervised learning (e.g. neural, statistical, rule-based, symbolic, and so on; Rich and Knight, 1991; Weiss and Kulikowski, 1991; Hutchinson, 1994; Michie et al., 1994; Michalewicz and Fogel, 2000). However, all machine learning methods have their strengths and weaknesses, and there is provably no universal method that works best for all datasets (no free lunch; Radcliffe and Surry, 1995; Wolpert and Macready, 1997). Consequently, the method one may choose is governed by one's individual preferences regarding features such as speed, accuracy, mathematical rigor, and robustness, and by the comprehensibility of the model formed. Any claims for the superiority of one method over another consequently should be seen in this light. In our view (Kell and Sonnleitner, 1995; Davey and Kell, 1996; Alsberg et al., 1997; Goodacre et al., 2000), the best methods not only give one the correct answer, but give an explanation of what, in biological terms, is the basis for that answer. This depends on identifying a subset of the variables with high explanatory power.

Support vector machines (SVMs; Cortes and Vapnik, 1995; Scholkopf et al., 1997; Burges, 1998; Cristianini and Shawe-Taylor, 2000; Vapnik and Chapelle, 2000) are an approach that has generated much recent interest, most pertinently here regarding the analysis of expression profiling data (Brown et al., 1999, 2000). Like all approaches, SVMs too have their strengths and weaknesses. The strengths include the existence of formal theory (see above citations) and a rapid speed of convergence. Against this, they normally give equal weighting to every variable, so unless one has reasons to remove some of the variables, those that contribute only noise will tend to dominate if they are present in large numbers. This is certainly the case in microarray experiments (Wittes and Friedman, 1999), where, for example, an SVM failed to learn the helix-turn-helix class (Brown et al., 1999, 2000), whereas other methods that select only subsets of the variables succeeded on the same data (Gilbert et al., 2000; Delneri et al., 2001). Second, like neural nets (Mozer and Smolensky, 1989; Andrews et al., 1995; Tickle et al., 1998; Alexander and Mozer, 1999), such methods can have poor explanatory power (i.e. when you have trained the system and asked it to classify the test set it will do so,but it will not straightforwardly tell you which variables are used). However, as pointed out by a referee, hybrid methods could prove an interesting approach in which something like an evolutionary algorithm (see below) is used to select candidate subsets of variables for processing through an SVM (compare with Guyon et al., 2001; Weston et al., 2001) or other learning method, as has been done using genetic algorithms for neural network (Broadhurst et al., 1997) and other statistical data processing analyses (Horchner and Kalivas, 1995).


    A Data Flood Defense System
TOP
INTRODUCTION
Multivariate Data Analysis and...
A Data Flood Defense...
Evolving Simple Answers to...
An Example: Analysis of...
CONCLUSION
LITERATURE CITED

Next, it should be noted that the problem of finding an adequate model (i.e. a set of mathematical transformations in the sense of Fig. 1B) scales only linearly with the number of objects but combinatorially with the number of variables. Thus, the number of models that either do or do not use a particular variable (let alone seek to parametrize it) when faced with a choice of M variables is obviously 2M (a variable is used or not used, binary 1 or 0). Even for M = 100, which is much smaller than the numbers typically used in microarrays, 2100 congruent  1030, and the lifetime of the universe is only approximately 1017 s. Exhaustive search of such models to find the best model is clearly computationally intractable. By contrast, if we state that a model should use just two, three, four, or five variables out of the 100, the numbers of combinations are, respectively, just 4,950, 161,700, 3,921,225, and 75,287,520. These kinds of numbers are both (a) much more tractable and (b) likely to lead to much more intelligible explanations of which variables (genes/proteins/metabolites) are most important to a particular process (stress, flowering time, and so on) of interest. The rather Zen conclusion is clearly that we are wise to start by seeking simplicity in our explanations.


    Evolving Simple Answers to Complex Questions of Functional Genomics
TOP
INTRODUCTION
Multivariate Data Analysis and...
A Data Flood Defense...
Evolving Simple Answers to...
An Example: Analysis of...
CONCLUSION
LITERATURE CITED

A particularly useful approach to attacking combinatorial optimization problems of this type lies in the use of the methods of evolutionary computing. In evolutionary computing (Goldberg, 1989; Michalewicz, 1994; Mitchell, 1995; Bäck et al., 1997, 2000a, 2000b; Corne et al., 1999; Zitzler, 1999; Michalewicz and Fogel, 2000), we have a population of individual computer programs or algorithms whose output is a potential solution to a problem (typically a combinatorial optimization problem). These outputs are ranked according to their "fitness" (usually their closeness to the true solution in the dataset they are given, though other criteria such as simplicity of explanation may be used as well), and the better-performing individuals retained. Some of these individuals/programs/algorithms are then modified, typically by mutating (changing) them "asexually" or by recombining parts of them from more than one parent "sexually," and the process of generating an output, evaluating the fitness function, and mutating and selecting at each generation continued until a specified stopping criterion is met. (This is usually the number of generations or the achievement of an adequately small difference between desired and true values.)

This may be set out in pseudocode as follows:

1. START by generating a population of individuals (computer programs/algorithms)

2. At the end of each generation, EVALUATE the "fitness" of each individual

3. RANK these individuals and RETAIN a certain fraction of them with a probability related to their fitness for one or more purposes

4. CREATE NEW INDIVIDUALS from these parents so as to replenish the population by mutation (changing an individual parent) or recombination (between two or more parents)

5. Return to step 2 UNTIL...

6. ...when a suitable criterion (elapsed time, evolved fitness, generation number) is met, then STOP.

A particularly interesting subset of evolutionary computing methods, popularized by John Koza as genetic programming (GP; Koza, 1992, 1994; Banzhaf et al., 1998; Langdon, 1998; Koza et al., 1999, 2000), involves an arrangement in which the rules are arrayed in a tree-like structure that is read from the bottom and a subset of variables passed through appropriate operators or functions to provide the output (i.e. fitness). Such so-called parse trees---unlike conventional computer programs---can be mutated and recombined to provide variants that remain syntactically correct (Fig. 2). Thus, one can evolve solutions to a complex problem yet produce equations that are simple and intelligible. These equations are essentially in the form of rules, in that the best ones give entirely different outputs for the different inputs characteristic of examples from particular classes, and if the classes are encoded as the outputs (say in binary, 1 or 0 for contribution of a gene to a particular function), the equations are the rules. (An equivalent procedure can be used, for example, in spectroscopy for high-throughput screening, where the output is continuous and is the concentration of a substance of interest [Gilbert et al., 1997; Taylor et al., 1998b; Woodward et al., 1999].) The special power of GP, which we have found particularly valuable (Gilbert et al., 1997, 1998, 1999; Jones et al., 1998; Taylor et al., 1998a, 1998b; Goodacre and Gilbert, 1999; Woodward et al., 1999; Goodacre et al., 2000; Johnson et al., 2000), stems from the fact that both the (potentially small number of) explanatory variables and the functional form of the relationship between them are evolved together. As suggested by a referee, and to avoid confusion, we note that genetic programming differs significantly from methods such as genetic algorithms. In genetic algorithms, the length of the string is normally fixed, and what evolves is a parameterization of a given model. In GP, the model itself evolves too (and while this is normally effected in a tree structure, as shown in Fig. 2, linear versions of GP are also possible; Banzhaf et al., 1998; Gilbert et al., 2000).



View larger version (21K):
[in this window]
[in a new window]
 
Figure 2.   A, Parse-tree representation of an equation which takes some of the variables as the input and, by reading from the lowest leaves of the tree, produces an output at the top, which is clearly equivalent to the equation given on the right. Also shown are cartoons of how mutation (B) and recombination (C) can be effected to produce equations (rules) with different properties while preserving a syntactically logical structure.

We also mention here that a significant problem, especially with some of the original flavors of GP, can be their tendency to bloat, i.e. to continue to grow branches onto trees even when the contribution of these branches to fitness is small (Langdon and Poli, 1998), a phenomenon mirroring what may be observed in nature where it is referred to as "Muller's ratchet" (Muller, 1964). There are ways around this however, e.g. by incorporating the desirability for small trees into the fitness function itself (Goodacre et al., 2000), and we do not nowadays find this a problem. Another feature of GPs is that because of the stochastic way in which they are initiated and evolve, they are not deterministic. This said, it is possible to turn such properties to advantage by running the GP several times, as there is plenty of evidence that combining or voting among several independent solutions to a problem can give improved learning (Drucker et al., 1994; Bauer and Kohavi, 1999; Dietterich, 2000a, 2000b; Friedman et al., 2000; King et al., 2000). Finally, all computer-intensive methods of this type, including those based on purely multivariate statistical strategies (Martens and Næs, 1989), have a great many degrees of freedom, which require that we provide a careful evaluation with respect to the reality of the solutions found (Chatfield, 1995).

In our own approach, which we refer to as genomic computing, we choose to equate the fitness of an individual, not by a regression-based analysis such as the root mean square error of prediction or the percentage of samples classified correctly, but by using a ranking scheme in which the metric encoding the quality of the model is analyzed in terms of the ordering of the samples with respect to their ease of prediction via the model. We find that this is particularly good at drawing out the variables that are most important for the particular problem.


    An Example: Analysis of the Tobacco Metabolome in Transgenic Plants
TOP
INTRODUCTION
Multivariate Data Analysis and...
A Data Flood Defense...
Evolving Simple Answers to...
An Example: Analysis of...
CONCLUSION
LITERATURE CITED

As an illustrative example, we set up a transgene discovery problem in which we measured a series of metabolites via HPLC and used these as the inputs to a genetic program designed to find a rule that would tell from the metabolome data whether the transgene of interest was present or absent (of course, we could have sought to encode its activity). The experiment was also aimed at investigating the biosynthesis and function of salicylic acid in plant defense by the expression of a salicylate hydroxylase enzyme to block accumulation (Bi et al., 1995).

Salicylic acid has been known for more than a decade to play a key role in defense mechanisms in many plants and is associated specifically with the hypersensitive response and the phenomenon of systemic acquired resistance (Mur et al., 1996, 2000; Draper, 1997; Mur et al., 1997). Tobacco (Nicotiana tabaccum) has provided a model organism for the study of salicylate biology in plant defense, but despite a considerable amount of research, little is known regarding its synthesis, catabolism, and mode of action. A bacterial gene encoding the enzyme salicylate hydroxylase (SH-L) expressed from the cauliflower mosaic virus 35S promoter has provided a useful tool to block salicylic acid accumulation in transgenic tobacco (Bi et al., 1995; Mur et al., 1996, 1997; Darby et al., 2000). Six-week-old transgenic tobacco plants (35S-SH-L) and control plants (Samsum NN) were inoculated with tobacco mosaic virus (TMV) at a temperature (32°C) non-permissive for the hypersensitive response (Mur et al., 1997; Roberts et al., 1999). Under these conditions, the TMV can replicate and spread without inducing lesion formation. Following a shift to a permissive temperature (24°C), hypersensitive response is induced synchronously with cell death visible after 8 h. Leaf tissue from TMV-inoculated, temperature-shifted plants was sampled at different time points (0-24 h), flash frozen in liquid N2, extracted in 90% (v/v) methanol, dried, partitioned with dichloromethane, and then analyzed by HPLC using standard procedures (Bi et al., 1995). A total of 48 peaks from the HPLC traces were digitized and integrated using standard software provided with the instrument, and a total of 36 samples were studied.

The metabolite peak values were used as inputs to the genomic computing software Gmax-bio (Aber Genomic Computing, Aberystwyth, UK), with the presence or absence of SH-L in the genotype being encoded 1 or 0.

One of many rules which evolved could be written as follows:


   x1 = V24
   x2 = V37
   If x1 <> 0 Then x1 = x2/x1 Else x1 = 1
   x1 = Sqr(Abs(x1))
   x2 = V24
   x3 = V42
   x2 = x2 + x3
   x3 = V30
   If x2 <> 0 Then x2 = x3/x2 Else x2 = 1
   x2 = Sqr(Abs(x2))
   x1 = x1 + x2
   SCORE = x1
   PRBBLITY = 1/(1 + Exp[-{-8.046777 + SCORE * 1.872833}])

This rule has an accuracy of more than 95% and is shown in tree form in Figure 3A. A power of genomic computing is that it ranks variables in order of their utility in successful rules. The top three variables are peaks 24, 30, and 42, and peak 24 is indeed salicylate. The low intrinsic dimensionality of this problem thus allows us to visualize the salient features of the experiment in a straightforward way, as illustrated in Figure 3B.



View larger version (26K):
[in this window]
[in a new window]
 
Figure 3.   A rule derived from genomic computing for the presence of a specific transgene (salicylate hydroxylase) in tobacco. A, Tree illustrating the rule. B, Plot of the data using the three variables identified as most significant. Presence of SH-L is encoded by the symbol (, none; , present). Time after shift (0, 3, 6, 9, 12, and 24 h) is encoded by size: F27 (abscissa) = variable 24, F33 (ordinate) = variable 30, and F45 (coming out of page) is variable 42. The latest time points have low values for variable 24 but large values for variable 23 (data not shown). Note that the problem is apparently not linearly separable. The Gmax-bio software (Aber Genomic Computing) was used with the following default parameters: population size, 1,000; maximum program length, 44 nodes; fitness, based on tournament selection/Gmax(v); crossover operator used 80% of the time; and of the mutations, terminals were selected 20% of the time. Operators used were the default numeric (0.1, 1, 3, 5, and Rand) and arithmetic (+, -, /, and *) operators plus square root, log, tanh, <, AND, OR, or IfT (this latter operator takes three inputs and if the first not equal  0 then it returns the second, else the third). The derivation of this rule required 56 generations and approximately 2 min on a standard 750-MHz Pentium III (Intel) personal computer.


    CONCLUSION
TOP
INTRODUCTION
Multivariate Data Analysis and...
A Data Flood Defense...
Evolving Simple Answers to...
An Example: Analysis of...
CONCLUSION
LITERATURE CITED

Many modern "omics" technologies are producing highly multivariate data at unprecedented rates. Only modern machine learning methods can turn these data into knowledge. Genomic computing provides an approach that can effect this desirable transformation and provide simple rules that map back onto the variables measured in the real world and thus have high explanatory power.


    ACKNOWLEDGMENT

We thank Mike Sinclair for useful discussions.

    FOOTNOTES

Received April 13, 2001; accepted May 1, 2001.

1 This work was supported by the UK Biotechnology and Biological Science Research Council.

* Corresponding author; e-mail dbk{at}aber.ac.uk; fax 44-0-1970-622354.


    LITERATURE CITED
TOP
INTRODUCTION
Multivariate Data Analysis and...
A Data Flood Defense...
Evolving Simple Answers to...
An Example: Analysis of...
CONCLUSION
LITERATURE CITED

  • Alexander JA, Mozer MC (1999) Template-based procedures for neural network interpretation. Neural Netw 12: 479-498[Medline]
  • Alizadeh AA, Eisen MB, Davis RE, Ma C, Lossos IS, Rosenwald A, Boldrick JG, Sabet H, Tran T, Yu X (2000) Distinct types of diffuse large B-cell lymphoma identified by gene expression profiling. Nature 403: 503-511[CrossRef][Medline]
  • Alsberg BK, Goodacre R, Rowland JJ, Kell DB (1997) Classification of pyrolysis mass spectra by fuzzy multivariate rule induction-comparison with regression, K-nearest neighbour, neural and decision-tree methods. Anal Chim Acta 348: 389-407[CrossRef]
  • Andrews R, Diederich J, Tickle AB (1995) Survey and critique of techniques for extracting rules from trained artificial neural networks. Knowledge-Based Syst 8: 373-389
  • Bäck T, Fogel DB, Michalewicz Z (1997) Handbook of Evolutionary Computation. IOP Publishing/Oxford University Press, Oxford
  • Bäck T, Fogel DB, Michalewicz Z (2000a) Evolutionary Computation 1: Basic Algorithms and Operators. IOP Publishing, Bristol, UK
  • Bäck T, Fogel DB, Michalewicz Z (2000b) Evolutionary Computation 2: Advanced Algorithms and Operators. IOP Publishing, Bristol, UK
  • Banzhaf W, Nordin P, Keller RE, Francone FD (1998) Genetic Programming: An Introduction. Morgan Kaufmann, San Francisco
  • Bauer E, Kohavi R (1999) An empirical comparison of voting classification algorithms: bagging, boosting, and variants. Mach Learn 36: 105-139[CrossRef]
  • Bevan M, Bancroft I, Mewes HW, Martienssen R, McCombie R (1999) Clearing a path through the jungle: progress in Arabidopsis genomics. Bioessays 21: 110-120[CrossRef][ISI][Medline]
  • Bi YM, Kenton P, Mur L, Darby R, Draper J (1995) Hydrogen peroxide does not function downstream of salicylic acid in the induction of PR protein expression. Plant J 8: 235-245[CrossRef][ISI][Medline]
  • Bianchi G, Giansante L, Shaw A, Kell DB (2001) Chemometric criteria for the characterization of Italian protected denomination of origin (DOP) olive oils from their metabolic profiles. Eur J Lipid Sci Technol 103: 141-150[CrossRef]
  • Bishop CM (1995) Neural Networks for Pattern Recognition. Clarendon Press, Oxford
  • Bork P, Dandekar T, Diaz-Lazcoz Y, Eisenhaber F, Huynen M, Yuan YP (1998) Predicting function: from genes to genomes and back. J Mol Biol 283: 707-725[CrossRef][ISI][Medline]
  • Brenner S (1997) In theory. In Loose Ends. Current Biology, London, pp 37
  • Brent R (1999) Functional genomics: learning to think about gene expression data. Curr Biol 9: R338-R341[CrossRef][Medline]
  • Brent R (2000) Genomic biology. Cell 100: 169-183[CrossRef][ISI][Medline]
  • Broadhurst D, Goodacre R, Jones A, Rowland JJ, Kell DB (1997) Genetic algorithms as a method for variable selection in multiple linear regression and partial least squares regression, with applications to pyrolysis mass spectrometry. Anal Chim Acta 348: 71-86[CrossRef]
  • Brown MPS, Grundy WN, Lin D, Cristianini N, Sugnet C, Furey TS, Manuel Ares J, Haussler D (1999) Support vector machine classification of microarray gene expression data. Technical Report No. UCSC-CRL-99-09. University of Santa Cruz, Santa Cruz, CA. http://www.cse.ucsc.edu/research/compbio/genex/genex.ps
  • Brown MPS, Grundy WN, Lin D, Cristianini N, Sugnet C, Furey TS, Manuel Ares J, Haussler D (2000) Knowledge-based analysis of microarray gene expression data by using support vector machines. Proc Natl Acad Sci USA 97: 262-267[Abstract/Free Full Text]
  • Burges CJC (1998) A tutorial on support vector machines for pattern recognition. Data Mining Knowledge Discovery 2: 121-167[CrossRef]
  • Burke HB (2000) Discovering patterns in microarray data. Mol Diagn 5: 349-357[CrossRef][ISI][Medline]
  • Chatfield C (1995) Model uncertainty, data mining and statistical inference. J R Stat Soc Ser A 158: 419-466[CrossRef]
  • Corne D, Dorigo M, Glover F (1999) New Ideas in Optimization. McGraw Hill, London
  • Cortes C, Vapnik V (1995) Support-vector networks. Mach Learn 20: 273-297
  • Cristianini N, Shawe-Taylor J (2000) An Introduction to Support Vector Machines and Other Kernel-Based Learning Methods. Cambridge University Press, Cambridge, UK
  • Darby RM, Maddison A, Mur LAJ, Bi YM, Draper J (2000) Cell-specific expression of salicylate dydroxylase in an attempt to separate localised HR and systemic SAR in tobacco. Plant Mol Pathol 1: 115-124
  • Davey HM, Kell DB (1996) Flow cytometry and cell sorting of heterogeneous microbial populations: the importance of single-cell analysis. Microbiol Rev 60: 641-696[Abstract/Free Full Text]
  • Delneri D, Brancia FL, Oliver SG (2001) Towards a truly integrative biology through the functional genomics of yeast. Curr Opin Biotechnol 12: 87-91[CrossRef][Medline]
  • Dietterich TG (2000a) Ensemble methods in machine learning. In Multiple Classifier Systems, Lecture Notes in Computer Science, Vol. 1857. Springer, Berlin, pp 1-15
  • Dietterich TG (2000b) An experimental comparison of three methods for constructing ensembles of decision trees: bagging, boosting, and randomization. Mach Learn 40: 139-157[CrossRef]
  • Draper J (1997) Salicylate, superoxide synthesis and cell suicide in plant defense. Trends Plant Sci 2: 162-165[CrossRef][ISI]
  • Drucker H, Cortes C, Jackel LD, Lecun Y, Vapnik V (1994) Boosting and other ensemble methods. Neural Comput 6: 1289-1301
  • Duda RO, Hart PE (1973) Pattern Classification and Scene Analysis. John Wiley, London
  • Eisen MB, Spellman PT, Brown PO, Botstein D (1998) Cluster analysis and display of genome-wide expression patterns. Proc Natl Acad Sci USA 95: 14863-14868[Abstract/Free Full Text]
  • Fiehn O, Kopka J, Dormann P, Altmann T, Trethewey RN, Willmitzer L (2000) Metabolite profiling for plant functional genomics. Nat Biotechnol 18: 1157-1161[CrossRef][ISI][Medline]
  • Friedman J, Hastie T, Tibshirani R (2000) Additive logistic regression: a statistical view of boosting. Ann Stat 28: 337-374[CrossRef]
  • Fukunaga K (1992) Introduction to Statistical Pattern Recognition, Ed 2. Academic Press, New York
  • Getz G, Levine E, Domany E (2000) Coupled two-way clustering analysis of gene microarray data. Proc Natl Acad Sci USA 97: 12079-12084[Abstract/Free Full Text]
  • Gilbert RJ, Goodacre R, Shann B, Taylor J, Rowland JJ, Kell DB (1998) Genetic programing-based variable selection for high-dimensional data. In JR Koza, W Banzhaf, K Chellapilla, K Deb, M Dorigo, DB Fogel, MH Garzon, DE Goldberg, H Iba, RL Riolo, eds, Genetic Programming 1998: Proceedings of the Third Annual Conference. Morgan Kaufmann, San Francisco, pp 109-115
  • Gilbert RJ, Goodacre R, Woodward AM, Kell DB (1997) Genetic programming: a novel method for the quantitative analysis of pyrolysis mass spectral data. Anal Chem 69: 4381-4389[CrossRef]
  • Gilbert RJ, Johnson HE, Winson MK, Rowland JJ, Goodacre R, Smith AR, Hall MA, Kell DB (1999) Genetic programming as an analytical tool for metabolome data. In WB Langdon, R Poli, P Nodin, T Fogarty, eds, Late-Breaking Papers of EuroGP-99. CWI, Amsterdam, pp 23-33
  • Gilbert RJ, Rowland JJ, Kell DB (2000) Genomic computing: explanatory modelling for functional genomics. In D Whitley, D Goldberg, E Cantú-Paz, L Spector, I Parmee, H-G Beyer, eds, Proceedings of the Genetic and Evolutionary Computation Conference (GECCO-2000). Morgan Kaufmann, Las Vegas, pp 551-557
  • Goldberg DE (1989) Genetic Algorithms in Search, Optimization and Machine Learning. Addison-Wesley, Reading, MA
  • Goodacre R, Gilbert RJ (1999) The detection of caffeine in a variety of beverages using Curie-point pyrolysis mass spectrometry and genetic programming. Analyst 124: 1069-1074[CrossRef]
  • Goodacre R, Kell DB, Bianchi G (1992) Neural networks and olive oil. Nature 359: 594[CrossRef]
  • Goodacre R, Kell DB, Bianchi G (1993) Rapid assessment of the adulteration of virgin olive oils by other seed oils using pyrolysis mass spectrometry and artificial neural networks. J Sci Food Agric 63: 297-307
  • Goodacre R, Shann B, Gilbert RJ, Timmins EM, McGovern AC, Alsberg BK, Kell DB, Logan NA (2000) Detection of the dipicolinic acid biomarker in Bacillus spores using Curie-point pyrolysis mass spectrometry and Fourier transform infrared spectroscopy. Anal Chem 72: 119-127[Medline]
  • Guyon I, Weston J, Barnhill S, Vapnik V (2001) Gene selection for cancer classification using support vector machines. Mach Learn (in press)
  • Hastie T, Tibshirani R, Eisen MB, Alizadeh A, Levy R, Staudt L, Chan WC, Botstein D, Brown P (2000) "Gene shaving" as a method for identifying distinct sets of genes with similar expression patterns. Genome Biol 1: research 0003.1-0003.21
  • Hertz J, Krogh A, Palmer RG (1991) Introduction to the theory of neural computation. Addison-Wesley, Redwood City, CA
  • Hinton JCD (1997) The Escherichia coli genome sequence: the end of an era or the start of the FUN? Mol Microbiol 26: 417-422[CrossRef][ISI][Medline]
  • Horchner U, Kalivas JH (1995) Further investigation on a comparative study of simulated annealing and genetic algorithm for wavelength selection. Anal Chim Acta 311: 1-13[CrossRef]
  • Hutchinson A (1994) Algorithmic Learning. Clarendon Press, Oxford
  • Jacobs DI, van der Heijden R, Verpoorte R (2000) Proteomics in plant biotechnology and secondary metabolism research. Phytochem Anal 11: 277-287
  • Jain AK, Dubes RC (1988) Algorithms for Clustering Data. Prentice Hall, Englewood Cliffs, NJ
  • Johnson HE, Gilbert RJ, Winson MK, Goodacre R, Smith AR, Rowland JJ, Hall MA, Kell DB (2000) Explanatory analysis of the metabolome using genetic programming of simple, interpretable rules. Genet Progr Evolvable Mach 1: 243-258
  • Jones A, Young D, Taylor J, Kell DB, Rowland JJ (1998) Quantification of microbial productivity via multi-angle light scattering and supervised learning. Biotechnol Bioeng 59: 131-143[Medline]
  • Kell DB, King RD (2000) On the optimization of classes for the assignment of unidentified reading frames in functional genomics programmes: the need for machine learning. Trends Biotechnol 18: 93-98[CrossRef][ISI][Medline]
  • Kell DB, Mendes P (2000) Snapshots of systems: metabolic control analysis and biotechnology in the post-genomic era. In A Cornish-Bowden, ML Cárdenas, eds, Technological and Medical Implications of Metabolic Control Analysis. Kluwer Academic Publishers, Dordrecht, The Netherlands, pp 3-25
  • Kell DB, Sonnleitner B (1995) GMP---good modelling practice: an essential component of good manufacturing practice. Trends Biotechnol 13: 481-492
  • King RD, Ouali M, Strong AT, Aly A, Elmaghraby A, Kantardzic M, Page D (2000) Is it better to combine predictions? Protein Eng 13: 15-19[Abstract/Free Full Text]
  • Koza JR (1992) Genetic Programming: On the Programming of Computers by Means of Natural Selection. MIT Press, Cambridge, MA
  • Koza JR (1994) Genetic Programming II: Automatic Discovery of Reusable Programs. MIT Press, Cambridge, MA
  • Koza JR, Bennett FH, Keane MA, Andre D (1999) Genetic Programming III: Darwinian Invention and Problem Solving. Morgan Kaufmann, San Francisco
  • Koza JR, Keane MA, Yu J, Bennett FH III, Mydlowec W (2000) Automatic creation of human-competitive programs and controllers by means of genetic programming. Genet Progr Evolvable Mach 1: 121-164[CrossRef]
  • Langdon WB (1998) Genetic Programming and Data Structures: Genetic Programming + Data Structures = Automatic Programming! Kluwer, Boston
  • Langdon WB, Poli R (1998) Fitness causes bloat: mutation. In W Banzhaf, R Poli, M Schoenauer, TC Fogarty, eds, Proceedings of the First European Workshop on Genetic Programming, Vol. 1391. Springer-Verlag, Berlin, pp 37-48
  • Livingstone D (1995) Data Analysis for Chemists. Oxford University Press, Oxford
  • Martens H, Næs T (1989) Multivariate Calibration. John Wiley, Chichester, UK
  • Martienssen RA (1998) Functional genomics: probing plant gene function and expression with transposons. Proc Natl Acad Sci USA 95: 2021-2026[Abstract/Free Full Text]
  • Michalewicz Z (1994) Genetic Algorithms + Data Structures = Evolution Programs, Ed 3. Springer-Verlag, Berlin
  • Michalewicz Z, Fogel DB (2000) How to Solve It: Modern Heuristics. Springer-Verlag, Heidelberg
  • Michie D, Spiegelhalter DJ, Taylor CC (1994) Machine Learning: Neural and Statistical Classification, Ellis Horwood Series in Artificial Intelligence. Ellis Horwood, Chichester, UK
  • Mitchell M (1995) An Introduction to Genetic Algorithms. MIT Press, Boston
  • Mitchell TM (1997) Machine Learning. McGraw Hill, New York
  • Mozer MC, Smolensky P (1989) Using relevance to reduce network size automatically. Connect Sci 1: 3-16
  • Muller HJ (1964) The relation of recombination to mutational advance. Mutat Res 1: 2-9[ISI]
  • Mur LAJ, Bi YM, Darby RM, Firek S, Draper J (1997) Compromising early salicylic acid accumulation delays the hypersensitive response and increases viral dispersal during lesion establishment in TMV-infected tobacco. Plant J 12: 1113-1126[CrossRef][ISI][Medline]
  • Mur LAJ, Brown IR, Darby RM, Bestwick CS, Bi YM, Mansfield JW, Draper J (2000) A loss of resistance to avirulent bacterial pathogens in tobacco is associated with the attenuation of a salicylic acid-potentiated oxidative burst. Plant J 23: 609-621[CrossRef][ISI][Medline]
  • Mur LAJ, Naylor G, Warner SAJ, Sugars JM, White RF, Draper J (1996) Salicylic acid potentiates defense gene expression in tissue exhibiting acquired resistance to pathogen attack. Plant J 9: 559-571[CrossRef]
  • Oliver SG, Winson MK, Kell DB, Baganz F (1998) Systematic functional analysis of the yeast genome. Trends Biotechnol 16: 373-378[CrossRef][ISI][Medline]
  • Prime TA, Sherrier DJ, Mahon P, Packman LC, Dupree P (2000) A proteomic analysis of organelles from Arabidopsis thaliana. Electrophoresis 21: 3488-3499[CrossRef][ISI][Medline]
  • Raamsdonk LM, Teusink B, Broadhurst D, Zhang N, Hayes A, Walsh M, Berden JA, Brindle KM, Kell DB, Rowland JJ (2001) A functional genomics strategy that uses metabolome data to reveal the phenotype of silent mutations. Nat Biotechnol 19: 45-50[CrossRef][ISI][Medline]
  • Radcliffe NJ, Surry PD (1995) Fundamental limitations on search algorithms: evolutionary computing in perspective. Comput Sci Today 1995: 275-291
  • Rich E, Knight K (1991) Artificial Intelligence. McGraw Hill, New York
  • Rieger K-J, Orlowska G, Kaniak A, Coppee J-Y, Aljinovic G, Slonimski PP (1999) Large-scale phenotypic analysis in microtitre plates of mutants with deleted open reading frames from yeast chromosome III: key step between genomic sequencing and protein function. In AG Craig, JD Joheisel, eds, Methods in Microbiology. Automation: Genomic and Functional Analysis, Vol. 28. Academic Press, London, pp 205-227
  • Ripley BD (1996) Pattern Recognition and Neural Networks. Cambridge University Press, Cambridge, UK
  • Roberts MR, Warner SAJ, Darby R, Lim EK, Draper J, Bowles DJ (1999) Differential regulation of a glucosyl transferase gene homologue during defense responses in tobacco. J Exp Bot 50: 407-410
  • Ruan Y, Gilmore J, Conner T (1998) Towards Arabidopsis genome analysis: monitoring expression profiles of 1400 genes using cDNA microarrays. Plant J 15: 821-833[CrossRef][ISI][Medline]
  • Santoni V, Rouquie D, Doumas P, Mansion M, Boutry M, Degand H, Dupree P, Packman L, Sherrier J, Prime T (1998) Use of a proteome strategy for tagging proteins present at the plasma membrane. Plant J 16: 633-641[CrossRef][ISI][Medline]
  • Schaffer R, Landgraf J, Perez-Amador M, Wisman E (2000) Monitoring genome-wide expression in plants. Curr Opin Biotechnol 11: 162-167[CrossRef][ISI][Medline]
  • Schenk PM, Kazan K, Wilson I, Anderson JP, Richmond T, Somerville SC, Manners JM (2000) Coordinated plant defense responses in Arabidopsis revealed by microarray analysis. Proc Natl Acad Sci 97: 11655-11660[Abstract/Free Full Text]
  • Scholkopf B, Sung KK, Burges CJC, Girosi F, Niyogi P, Poggio T, Vapnik V (1997) Comparing support vector machines with Gaussian kernels to radial basis function classifiers. IEEE Trans Sign Process 45: 2758-2765[CrossRef]
  • Somerville C, Somerville S (1999) Plant functional genomics. Science 285: 380-383[Abstract/Free Full Text]
  • Tamayo P, Slonim D, Mesirov J, Zhu Q, Kitareewan S, Dmitrovsky E, Lander ES, Golub TR (1999) Interpreting patterns of gene expression with self-organizing maps: methods and application to hematopoietic differentiation. Proc Natl Acad Sci USA 96: 2907-2912[Abstract/Free Full Text]
  • Taylor J, Goodacre R, Wade WG, Rowland JJ, Kell DB (1998a) The deconvolution of pyrolysis mass spectra using genetic programming: application to the identification of some Eubacterium species. FEMS Microbiol Lett 160: 237-246[CrossRef][Medline]
  • Taylor J, Rowland JJ, Goodacre R, Gilbert RJ, Winson MK, Kell DB (1998b) Genetic programming in the interpretation of Fourier transform infrared spectra: quantification of metabolites of pharmaceutical importance. In JR Koza, W Banzhaf, K Chellapilla, K Deb, M Dorigo, DB Fogel, MH Garzon, DE Goldberg, H Iba, RL Riolo, eds, Genetic Programming 1998: Proceedings of the Third Annual Conference. Morgan Kaufmann, San Francisco, pp 377-380
  • Therrien CW (1989) Decision Estimation and Classification: An Introduction to Pattern Recognition and Related Topics. Wiley, New York
  • Tickle AB, Andrews R, Golea M, Diederich J (1998) The truth will come to light: directions and challenges in extracting the knowledge embedded within trained artificial neural networks. IEEE Trans Neural Netw 9: 1057-1068[Medline]
  • Trethewey RN (2001) Gene discovery via metabolic profiling. Curr Opin Biotechnol 12: 135-138[CrossRef][ISI][Medline]
  • Trethewey RN, Krotzky AJ, Willmitzer L (1999) Metabolic profiling: a Rosetta Stone for genomics? Curr Opin Plant Biol 2: 83-85[CrossRef][ISI][Medline]
  • van Wijk KJ (2000) Proteomics of the chloroplast: experimentation and prediction. Trends Plant Sci 5: 420-425[CrossRef][Medline]
  • Vapnik V, Chapelle O (2000) Bounds on error expectation for support vector machines. Neural Comput 12: 2013-2036[Abstract/Free Full Text]
  • Wasserman PD (1989) Neural Computing: Theory and Practice. Van Nostrand Reinhold, New York
  • Weiss SH, Kulikowski CA (1991) Computer Systems that Learn: Classification and Prediction Methods from Statistics, Neural Networks, Machine Learning, and Expert Systems. Morgan Kaufmann Publishers, San Mateo, CA
  • Weston J, Mukherjee S, Chapelle O, Pontil M, Poggio T, Vapnik V (2001) Feature selection for SVMs. In Neural Information Processing Systems, Vol. 13. Morgan Kaufmann, San Francisco (in press)
  • Wittes J, Friedman HP (1999) Searching for evidence of altered gene expression: a comment on statistical analysis of microarray data. J Natl Cancer Inst 91: 400-401[Free Full Text]
  • Wolpert DH, Macready WG (1997) No Free Lunch theorems for optimization. IEEE Trans Evol Comput 1: 67-82[CrossRef]
  • Woodward AM, Gilbert RJ, Kell DB (1999) Genetic programming as an analytical tool for non-linear dielectric spectroscopy. Bioelectrochem Bioenerg 48: 389-396[CrossRef][Medline]
  • Zitzler E (1999) Evolutionary algorithms for multiobjective optimization: methods and applications. Shaker Verlag, Aachen, Germany
© 2001 American Society of Plant Physiologists



This article has been cited by other articles:


Home page
Plant Physiol.Home page
M. C. McCann, M. Defernez, B. R. Urbanowicz, J. C. Tewari, T. Langewisch, A. Olek, B. Wells, R. H. Wilson, and N. C. Carpita
Neural Network Analyses of Infrared Spectra for Classifying Cell Wall Architectures
Plant Physiology, March 1, 2007; 143(3): 1314 - 1326.
[Abstract] [Full Text] [PDF]


Home page
J. Nutr.Home page
R. Goodacre
Metabolomics of a Superorganism
J. Nutr., January 1, 2007; 137(1): 259S - 266S.
[Abstract] [Full Text] [PDF]


Home page
Proc. Natl. Acad. Sci. USAHome page
D. P. Enot, M. Beckmann, D. Overy, and J. Draper
Predicting interpretability of metabolome models based on behavior, putative identity, and biological relevance of explanatory signals
PNAS, October 3, 2006; 103(40): 14865 - 14870.
[Abstract] [Full Text] [PDF]


Home page
Proc. Natl. Acad. Sci. USAHome page
G. S. Catchpole, M. Beckmann, D. P. Enot, M. Mondhe, B. Zywicki, J. Taylor, N. Hardy, A. Smith, R. D. King, D. B. Kell, et al.
Hierarchical metabolomics demonstrates substantial compositional similarity between genetically modified and conventional potato crops
PNAS, October 4, 2005; 102(40): 14458 - 14462.
[Abstract] [Full Text] [PDF]


Home page
J Exp BotHome page
R. Goodacre
Making sense of the metabolome using evolutionary computation: seeing the wood with the trees
J. Exp. Bot., January 1, 2005; 56(410): 245 - 254.
[Abstract] [Full Text] [PDF]


Home page
Appl. Environ. Microbiol.Home page
J. Allen, H. M. Davey, D. Broadhurst, J. J. Rowland, S. G. Oliver, and D. B. Kell
Discrimination of Modes of Action of Antifungal Substances by Use of Metabolic Footprinting
Appl. Envir. Microbiol., October 1, 2004; 70(10): 6157 - 6165.
[Abstract] [Full Text] [PDF]


Home page
Plant Physiol.Home page
L. J. Sweetlove, R. L. Last, and A. R. Fernie
Predictive Metabolic Engineering: A Goal for Systems Biology
Plant Physiology, June 1, 2003; 132(2): 420 - 425.
[Full Text] [PDF]


Home page
Appl. Environ. Microbiol.Home page
D. I. Ellis, D. Broadhurst, D. B. Kell, J. J. Rowland, and R. Goodacre
Rapid and Quantitative Detection of the Microbial Spoilage of Meat by Fourier Transform Infrared Spectroscopy and Machine Learning
Appl. Envir. Microbiol., June 1, 2002; 68(6): 2822 - 2828.
[Abstract] [Full Text] [PDF]