|
|
||||||||
|
Plant Physiology 133:1717-1725 (2003) © 2003 American Society of Plant Biologists Finding Unexpected Patterns in Microarray Data1IFEVA, Facultad de Agronomía, Universidad de Buenos Aires, Av. San Martín 4453, 1417-Buenos Aires, Argentina (S.P., J.J.C.); Instituto de Investigaciones en Ingenieria Genetica y Biologia Molecular, Vuelta de Obligado 2490, 1428-Buenos Aires, Argentina (M.A.M., J.M.); and Syngenta Biotechnology, Inc., 3054 Cornwallis Road, Research Triangle Park, North Carolina 27709 (T.Z.)
We describe the performance of a protocol based on the sequential application of unsupervised and supervised methods to analyze microarray samples defined by a combination of factors. Correspondence analysis is used to visualize the emerging patterns of three set of novel or previously published data: photoreceptor mutants of Arabidopsis grown under different light/dark conditions, Arabidopsis exposed to different types of biotic and abiotic stress, and human acute leukemia. We find, for instance, that light has a dramatic effect on plants despite the absence of the four major photoreceptors, that bacterial-, fungal-, and viral-induced responses converge at later stages of attack, and that sample preparation procedures used in different hospitals have large effects on transcriptome patterns. We use canonical discriminant analysis to identify the genes associated with these patters and hierarchical clustering to find groups of coregulated genes that are easily visualized in a second round of correspondence analysis and ordered tables. The unconventional combination of standard descriptive multivariate methods offers a previously unrecognized tool to uncover unexpected information.
Microarray data can be used for different purposes. In clinical studies, a common aim is to identify a set of genes whose expression signature is sufficiently specific for a given condition and can be used as markers in diagnosis. Biologists interested in understanding the mechanisms involved in a given process may want to find the genes whose expression is characteristic of a given set of experimental cases. There are basically three groups of multivariate approaches that can be followed to achieve this goal: class prediction methods, projection methods, and clustering. The first is defined as a "supervised" method whereas the other two are considered unsupervised because they do not reflect any previous knowledge or classification scheme (Slonim, 2002
Class prediction methods such as support vector machines, decision tree algorithms, and discriminant analysis are specifically designed to classify objects into known groups and can be used to identify genes that characterize different conditions (Golub et al., 1999
Projection methods such as principal component analysis (PCA) or correspondence analysis (CA) can also be used to link cases to genes. These dimension reduction techniques allow the visualization of the structure of large microarray data sets in a few dimensions that retain a large amount of the original variation (Fellenberg et al., 2001
Clustering methods such as hierarchical clustering and k-means (Kaufman and Rousseeuw, 1990
When the number of experimental conditions is large, the wealth of information obtained from the samples hybridized to microarrays is useful not only to learn about the function of genes but also to learn about the samples themselves. The application of CA to a series of samples from synchronized cells of Brewer's yeast (Saccharomyces cerevisiae) at different phases of the cell cycle suggested that some samples had been improperly classified because their position in the biplot was distant from that corresponding to other samples originally thought to correspond to the same phase (Fellenberg et al., 2001
In the aforementioned example with Brewer's yeast, the useful information about the samples could derive from the fact that 800 cell-cycle associated genes were used as starting point. Alternatively, CA could be intrinsically useful to uncover unknown relationships (visualized as convergence on the biplot) among experimental conditions, particularly when they result from the combination of two or more factors. To investigate this possibility, we applied CA to three sets of data. First, our own transcriptome data from seedlings of Arabidopsis involving the factorial combination of three light/dark conditions and four genotypes, the wild type (WT), the double mutant lacking the red and far-red light photoreceptors phytochrome A and B (phyA phyB), the double mutant lacking the blue-light photoreceptors cryptochrome 1 and 2 (cry1 cry2), and the quadruple mutant lacking phytochromes A and B and cryptochromes 1 and 2 (phyA phyB cry1 cry2). Second, previously published data from Arabidopsis plants exposed to different types of biotic (bacterial, fungal, and viral) or abiotic stress each one at different times after application (Chen et al., 2002
Supervised and unsupervised methods are normally used as alternatives, depending on the aims of the study. A second contribution of the work presented here is to demonstrate that these apparently contrasting approaches can be complementary. We describe a protocol combining a first unsupervised visualization of the whole data set through CA with the supervised exploration of the particular structures that emerged in this first step. This exploration is based on the discriminant loadings of the genes, derived from canonical discriminant analysis (CDA; Seber, 1984
Light Has Large Effects on Plant Transcriptome Even in the Absence of Four Major Photoreceptors
Light perceived by phytochromes A and B and cryptochromes 1 and 2 has profound effects on plant growth and development (Quail et al., 1995
We used CDA to identify the genes that best discriminate among the three light/dark conditions. These genes were arranged in groups following common trends in their pattern of expression by using a hierarchical farthest-neighbor clustering algorithm, based on correlation coefficient as similarity measure. Instead of using the representation generated by CDA itself, the genes and the cases were displayed on the biplot generated by a second round of CA (Fig. 1C). This grouping was highly significant (P < 0.0001) as indicated by a multivariate non-parametric test, multiresponse permutation procedure (MRPP; Mielke, 1984
We carried out CA with the complete matrix of data published by Chen et al. (2002
To gain precision in the search for responses in expression caused by the different types of stress, we left aside the cases involving healthy plants at different stages of development and calculated the deviations of each sample respect to its control profile. We then applied CDA to identify the TF genes that better discriminate between groups of stressors: abiotic, fungal, viral, and bacterial agents. The 146 TF genes selected by CDA were then clustered in five groups by hierarchical clustering (Supplementary Table II). CA was now applied to the resultant reduced matrix (146 genes x 33 treatments, rank-transformed data). Three homogeneous groups containing samples only from bacterial, viral, or abiotic stresses were obtained, but an additional group of cases still included fungal, viral, and bacterial attack (Fig. 3A). A closer inspection revealed that the samples of this apparently heterogeneous group of treatments had something in common as they corresponded to the latest time points used to characterize bacterial attack (30 h) or viral attack (5 and 14 d) in conjunction with the majority of fungal treatments. Thus, the ordination reveals that the TF expression profiles converge at late stages of attack by virus, bacteria, and fungi. If instead of using CA to visualize the cases, these are represented on the projection space generated by CDA itself, the convergence is no longer obvious (Fig. 3B). This reflects the fact that the CDA display maximizes discrimination whereas the cases are more freely ordered on the CA plot according to their extent of similarity.
To search for the TF genes that could express the apparent convergence in expression profiles associated with late biotic stress, we selected specific treatments narrowing the boundaries of the encompassed variability. The treatments included in this step of the analysis were: early virus attack (1 d, 5 cases), late virus attack (5 and 14 d, 6 cases), early bacterial attack (6 h, 9 cases), late bacterial attack (30 h, 2 cases), early fungal attack (12 h, 1 case), and late fungal attack (24-84 h, 5 cases); i.e. a total of 15 early stress cases and 13 late stress cases. We used CDA to identify the TF genes that best discriminate between early and late stress responses, i.e. only two groups (the analysis continued with the deviations from the respective controls). This procedure pointed to 30 TF genes that were arranged in two groups by hierarchical clustering. A new round of CA was conducted with the reduced matrix (30 genes x 28 treatments), and the biplot is shown in Figure 3C. Although we selected the genes that best discriminate between early and late stress responses making no difference between stress type, bacterial and virus early responses still show clear differences in expression profiles. This grouping was highly significant (P < 0.0001) as indicated by MRPP. The first and third groups, dominated by AP2 domain and MYB-related TF, included TF that were induced respectively early or late in response to attack by bacteria, virus, or fungi. The second group contained TF that were induced early only by viral attack, including two auxin-induced TF (IAA8 and IAA17/AXR3-1). The ordered table shows the correspondence between treatments and of TF genes (Fig. 3D, left) or between the average for each treatments group (e.g. all the samples corresponding to early bacterial attack regardless of the bacterial agent) and genes (Fig. 3D, right). Although the genes identified by CDA yield three clearly distinct groups of cases in the CA biplot (Fig. 3C), it is very difficult to find in the table individual genes with homogeneous response accounting for the differences among these groups of treatments (Fig. 3D, left), even if gene data are averaged for all bacteria, virus, or fungus attack samples (Fig. 3D, right). This is a strong evidence of the inherent multivariable nature of the gene expression responses, which determines the intrinsic caveat of univariate (i.e. gene per gene) approaches to detect some underlying patterns of potential interest.
An additional feature of the combined unsupervised/supervised protocol proposed here is that genes identified for their ability to discriminate among certain conditions can be used to visualize a different set of data in the search for unknown biologically meaningful relationships. For instance, we applied CA to the samples involving different developmental conditions by using the same 30 TF genes that best discriminate between early and late biotic stress (note that developmental samples were already out of the analysis at the point when these 30 genes were identified). The ordination of treatments on the plane generated by the first two axes of CA revealed three groups: one containing all root samples, another containing most leaf samples, and a third, more heterogeneous group containing seedling and reproductive tissue samples (Fig. 4). Most of the TF genes showing high expression in early bacterial stress are also highly expressed in leaves, whereas TF genes with high expression in late biotic stress showed relatively high expression in roots.
To investigate the applicability of the protocol presented here to a well-known data set, we used CA to characterize the patterns that emerges from 72 cases of patients with leukemia described by Golub et al. (1999
We describe a powerful protocol combining unsupervised and supervised methods to extract biologically meaningful information from microarray data (Fig. 6): CA is used to investigate the major underlying patterns of the cases from which the samples are extracted. CDA is then applied to identify the genes that cause the pattern of interest. Finally, a second round of CA based on these genes (grouped by using hierarchical clustering) is used to visualize both cases and genes (biplot). The strength and novelty of the approach is given by the combination of methods previously used only in isolation.
CA allows the representation of cases and genes on the same plane (biplot) and can therefore be used to find the genes whose expression correlates with a given set of samples (Fellenberg et al., 2001
The comparative analysis revealed the better performance of CA over PCA for dimension reduction of microarray profiles. The first principal components of PCA are strongly affected by size (in this case the absolute values of expression of a microarray sample) because the method is based on Euclidean distances (Seber, 1984
We confirm the usefulness of discriminant loadings derived from CDA to find the genes that best discriminate among classes. However, according to the protocol proposed here, these groups do not necessarily reflect previous knowledge on the subject because they can result from the unsupervised aggregation of experimental conditions resulting from CA. The combination between unsupervised and supervised methods is extended to the use of hierarchical clustering applied not to the original set of data but to the reduced set of genes emerging from CDA, thus eliminating the drawback of using hierarchical clustering with large bodies of data (Slonim, 2002 Another unconventional feature of the protocol presented here is the use of a second round of CA based on the genes emerging from CDA to visualize both genes and cases on the biplot, rather than using the CDA projection itself. The CDA display maximizes the separation among pre-ordained sample classes. The display of CA is not subjected to this restriction; the samples are freely ordered according to their similarity and can reveal unexpected patterns. For instance, we used CDA to find the TF genes that better discriminate among different types of biotic and abiotic stress. The CDA display confirms that the genes are able to discriminate among cases (Fig. 3B). The CA plot reveals that the genes that discriminate among types of biotic stress show convergence of their patterns at late stages of attack, a previously unsuspected pattern. The enhanced visualization of the data also makes evident the multivariate nature of microarray data. Although the groups of samples are clearly distinguishable, it is difficult to find a single gene able by itself to discriminate among the different treatments. Rather it is the behavior of the groups of genes as a whole that provides discrimination among treatments. This underscores the need to use multivariate methods for the analysis of microarray data as significant patterns may be ignored in a gene-by-gene analysis. Another application of the combination of supervised and unsupervised methods is the ability to use a set of genes found by CDA to discriminate among a given set of treatments to classify an unrelated set of cases. These cases are represented on a plane generated by CA based on the previously identified discriminating genes. Following this procedure, we have observed that genes that discriminate between early and late biotic attack are differentially expressed in different plant tissues. Microarray data present two different, unrelated challenges. One is the efficient use of the potential of the technique because important biological information may remain buried in the large amount of data. The other is the statistical treatment of the results because, given the large number of genes and the generally small number of replicates, false positive genes are not unlikely. The protocol proposed here addresses only the first of these two issues, and the probabilities provided in our analysis indicate that the group patterns are unlikely to be found by mere chance. This does not imply that the pattern of expression of each of the genes of a given group is statistically significant. Testing the latter requires the application of a separate set of techniques and very often, additional experimental follow-up.
Sample Processing and Hybridizations
The WT of Arabidopsis and all photoreceptor mutants were in the Landsberg erecta background. The production of multiple mutants has been described previously (Yanovsky et al., 2000
The proposed protocol consists of a first round of CA (PROC CORRESPOND, SAS/STAT V8.02; for details of the technique, see Fellenberg et al., 2001
The hypothesis of no difference between groups of entities (either samples or genes) was tested by using a multivariate non-parametric procedure (MRPP, PC-ORD V4; Mielke, 1984
We thank Dr. Marcelo Yanovsky (IFEVA) for his valuable help with sample collection. Received June 17, 2003; returned for revision July 23, 2003; accepted August 31, 2003.
www.plantphysiol.org/cgi/doi/10.1104/pp.103.028753.
1 This work was supported by Fondo Nacional de Ciencia y Técnica (grant no. PICT 06739 to J.J.C.), by the University of Buenos Aires (grant no. G 067 to J.J.C.), by the National Research Council of Argentina (Consejo Nacional de Investigaciones Científicas y Técnicas grant no. PID 888 to J.J.C.), and by Fundación Antorchas (grant no. 14116-16 to J.J.C.). * Corresponding author; e-mail casal{at}ifeva.edu.ar; fax 5411-4514-8730.
Berling B, Kolbinger F, Grunert F, Thompson JA, Brombacher F, Buchegger F, von Kleist S, Zimmermann W (1999) Cloning of a carcinoembryonic antigen gene family member expressed in leukocytes of chronic myeloid leukemia patients and bone marrow. Cancer Res 50: 6534-6539
Cashmore AR, Jarillo JA, Wu Y-J, Liu D (1999) Cryptochromes: blue light receptors for plants and animals. Science 284: 760-765
Chen WNJ, Provart J, Glazebrook F, Katagiri H, Chang T, Eulgem F, Mauch S, Luan G, Zou SA, Whitham PR et al. (2002) Expression profile matrix of Arabidopsis transcription factor genes suggests their putative functions in response to environmental stresses. Plant Cell 14: 559-574 Cosgrove DJ (1999) Enzymes and other agents that enhance cell wall extensibility. Annu Rev Plant Physiol Plant Mol Biol 50: 391-417[CrossRef][Web of Science][Medline] Crisan D, David D, DiCarlo R (1996) Use of myeloperoxidase mRNA as a marker for myeloid lineage in acute leukemias. Arch Pathol Lab Med 120: 828-834[Medline]
Fellenberg KNC, Hauser B, Brors A, Neutzner JD, Hoheisel, Vingron M (2001) Correspondence analysis applied to microarray data. Proc Natl Acad Sci USA 98: 10781-10786 Folta KM, Spalding EP (2001) Unexpected roles for cryptochrome 2 and phototropin revealed by high-resolution analysis of blue-light mediated hypocotyl inhibition. Plant J 26: 471-478[CrossRef][Web of Science][Medline]
Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov J, Coller H, Loh ML, Downing JR, Caligiuri et al. (1999) Molecular classification of cancer: class discovery and class prediction by gene expression monitoring. Science 286: 531-537
Gower JC (1966) Some distance properties of latent root and vector methods used in multivariate analysis. Biometrika 53: 325-338 Gower JC (1982) Euclidean distance geometry. Math Scientist 7: 1-14 Greenacre MJ (1984) Theory and Applications of Correspondence Analysis. Academic Press, London
Hadfield KA, Bennet AB (1998) Polygalacturonases: many genes in search of a function. Plant Physiol 117: 337-343 Horwitz M, Benson KF, Person RE, Aprikyan AG, Dale DC (1999) Mutations in ELA2, encoding neutrophil elastase, define a 21-day biological clock in cyclic haematopoiesis. Nat Genet 23: 433-436[CrossRef][Web of Science][Medline] Kaufman L, Rousseeuw PJ (1990) Finding Groups in Data. An Introduction to Cluster Analysis. John Wiley & Sons, New York Mazzella MA, Cerdán PD, Staneloni RJ, Casal JJ (2001) Hierarchical coupling of phytochromes and cryptochromes reconciles stability and light modulation of Arabidopsis development. Development 128: 2291-2299[Web of Science][Medline] McCune B, Mefford MJ (1999) PC-ORD. Multivariate Analysis, Version 4. MjM Software Design, Gleneden Beach, OR Mielke PWJ (1984) Meteorological applications of permutation techniques based on distance functions. In PR Krishnaiah, PK Sen, eds, Handbook of Statistics. Elsevier Science Publishers, New York, pp 813-830
Misra JW, Schmitt D, Hwang L, Hsiao S, Gullans G, Stephanopoulos GD, Stephanopoulos G (2002) Interactive exploration of microarray gene expression patterns in a reduced dimensional space. Genome Res 12: 1112-1120
Neznanov N, Man AK, Yamamoto H, Hauser CA, Cardiff RD, Oshima RG (1999) A single targeted Ets2 allele restricts development of mammary tumors in transgenic mice. Cancer Res 59: 4242-4246 Paul CC, Ackerman SJ, Mahrer S, Tolbert M, Dvorak AM, Baumann MA (1994) Cytokine induction of granule protein synthesis in an eosinophilinducible human myeloid cell line, AML14. J Leukoc Biol 56: 74-79[Abstract]
Quail PH, Boylan MT, Parks BM, Short TW, Xu Y, Wagner D (1995) Phytochromes: photosensory perception and signal transduction. Science 268: 675-680 Ramaswamy SP, Tamayo R, Rifkin S, Mukherjee C, Yeang M, Angelo C, Ladd M, Reich E, Latulippe JP, Mesirov T et al. (2001) Multiclass cancer diagnosis using tumor gene expression signatures. Proc Natl Acad Sci USA 98: 15148-15154 Rousselet MC, Laniece A, Gardais J, Dautel M, Gardembas-Pain M, Pellier I, Ifrah N, Saint-Andre JP (1995) Immunohistochemical characterization of acute leukemia: study of 31 bone marrow biopsies. Ann Pathol 15: 119-126[Medline] Seber GAF (1984). Multivariate Observations. John Wiley & Sons, New York
Skold S, Rosberg B, Gullberg U, Olofsson T (1999) A secreted proform of neutrophil proteinase 3 regulates the proliferation of granulopoietic progenitor cells. Blood 93: 849-856 Slonim DK (2002) From patterns to pathways: gene expression data analysis comes of age. Nat Genet Suppl 32: 502-508[CrossRef]
Stephanopoulos GD, Hwang WA, Schmitt J, Misra JW, Stephanopoulos G (2002) Mapping physiological states from microarray expression measurements. Bioinformatics 18: 1054-1063 ter Braak CJF (1985) Correspondence analysis of incidence and abundance data: properties in terms of a unimodal response model. Biometrics 41: 859-873[CrossRef] Tzachanis D, Freeman GJ, Hirano N, van Puijenbroek AA, Delfs MW, Berezovskaya A, Nadler LM, Boussiotis VA (2001) Tob is a negative regulator of activation that is expressed in anergic and quiescent T cells. Nat Immunol 2: 1174-1182[CrossRef][Web of Science][Medline] Yanovsky MJ, Mazzella MA, Casal JJ (2000) A quadruple photoreceptor mutant still keeps track of time. Curr Biol 10: 1013-1015[CrossRef][Web of Science][Medline]
Winkel-Shirley B (2001) Flavonoid biosynthesis: a colorful model for genetics, biochemistry, cell biology, and biotechnology. Plant Physiol 126: 485-493 Zhu T, Budworth P, Han B, Brown D, Chang HS, Zou G, Wang X (2001) Toward elucidating the global gene expression patterns of developing Arabidopsis: parallel analysis of 8300 genes by high-density oligonucleotide probe array. Plant Physiol Biochem 39: 221-242[CrossRef][Web of Science] This article has been cited by other articles:
|
||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||||
| HOME | HELP | FEEDBACK | SUBSCRIPTIONS | ARCHIVE | SEARCH | TABLE OF CONTENTS |
| ASPB Publications | PLANT PHYSIOLOGY® | THE PLANT CELL | |
|---|---|---|---|