First published online June 1, 2004; 10.1104/pp.104.040840

Plant Physiology 135:637-652 (2004)
© 2004 American Society of Plant Biologists

PERSPECTIVES ON TRANSLATIONAL BIOLOGY

Methods for Transcriptional Profiling in Plants. Be Fruitful and Replicate

Blake C. Meyers*, David W. Galbraith, Timothy Nelson and Vikas Agrawal

Department of Plant and Soil Sciences and Delaware Biotechnology Institute, University of Delaware, Newark, Delaware 19711 (B.C.M., V.A.); Department of Molecular, Cellular, and Developmental Biology, Yale University, New Haven, Connecticut 06511 (T.N.); and Department of Plant Sciences, University of Arizona, Tucson, Arizona 85721 (D.W.G.)

Because of the tractability of large-scale RNA measurements compared with protein studies, the first application of genomics in many organisms is to catalog and then measure transcriptional activity. Substantial investment in the US and abroad has led to dramatic growth in the availability of gene sequences for many plant species. With these sequences in hand, many molecular biologists are building the resources and technologies to enable large-scale transcriptional analyses for different plant species. The availability of the complete genome sequence of Arabidopsis made this the first plant for which transcriptional profiling platforms were developed. The experience gained from the applications of these technologies in Arabidopsis will shape the direction of similar experiments performed in other plant species.

The ability to simultaneously measure the expression of thousands of genes is a powerful analytical system, and the availability of technologies for this has presented scientists with many new opportunities. In most plant species, these experiments are being conducted largely with microarrays, although there are a growing number of alternative technologies. Some of these alternative technologies generate data that are distinct from and complementary to microarray data. The massive datasets generated by gene expression technologies present novel statistical and analytical problems, resulting in a convergence of biology, mathematics, and computer science. Users have developed a broad range of applications for the platforms, so that the use of microarrays has gone beyond simple measurements of relative transcript abundance to include genotyping, tissue classification, and pathway studies. Competition is intense among commercial microarray vendors vying in the plant market, and new companies join the fray on a regular basis. For laboratories working in plant species other than Arabidopsis, or for students and teachers of plant molecular biology, the question arises of what lessons to take away from the experience of this model plant, and how to best apply these technologies and approaches without squandering limited resources.


    TECHNOLOGIES FOR MEASURING GENE EXPRESSION
 
The last decade has seen major advances in technologies for measuring gene expression. However, no method is without serious limitations, so many more advances will be required before we have achieved the necessary sensitivity and scope. The forerunner of many of the current methods is the RNA gel blot (northern), in which a labeled probe is hybridized to an RNA target, and the resulting band size and signal intensity are used to confirm and quantify expression. Advances in genomic technologies now permit the simultaneous analysis of thousands of genes, although many are based on the same concept of specific probe-target hybridization. These methods, described in more detail in this section, most prominently include DNA microarrays. However, sequencing-based methods are an alternative; these methods started with the use of expressed sequence tags (ESTs), and now include methods based on short tags, such as serial analysis of gene expression (SAGE) and massively parallel signature sequencing (MPSS). Differential display techniques provide yet another means of analyzing gene expression; this family of techniques is based on random amplification of cDNA fragments generated by restriction digestion, and bands that differ between two tissues identify cDNAs of interest. With a well-characterized genome, it is possible to match fragments to specific genes (Shimkets et al., 1999). Most differential display techniques require a large number of reactions to achieve maximal coverage of all active transcripts, and it is difficult to sample every transcript. Differential display-like approaches have been reviewed elsewhere (Green et al., 2001) and will not be discussed in detail in this review. All of these transcriptional profiling technologies permit the analysis of complex mRNA populations from selected cells or tissues, producing large-scale measurements of gene expression, but different technologies provide data with different uses. In fact, none of the existing technologies addresses all experimental needs, and there are advantages and disadvantages to each. These differences make the technologies complementary; in addition to good experimental design and analysis, the validation of apparent quantitative differences in mRNA levels by using several of these complementary approaches is critically important.


Single Gene Measurements

Although measurements of single genes have advanced well beyond northern blots, northern blot data are still considered to be the gold standard. This confidence may rest more on historical reasons than on any data indicating that northerns are more reliable than other methods. In situ hybridizations can provide both a qualitative and quantitative assessment of gene expression in specific tissues. In recent years, quantitative real-time PCR (QRT-PCR) has been demonstrated to generate robust, quantitative expression data for a single gene; this method also offers rapid and reproducible results and a large dynamic range (Hayward-Lester et al., 1995; Bustin, 2002; Ginzinger, 2002; Klein, 2002). Fluorescence signals are generated by dyes that are specific to double-stranded DNA (dsDNA) or by sequence-specific fluorescently-labeled oligonucleotide primers. The signal is proportional to the amount of PCR product, and special PCR machines are designed to monitor the process of amplification in real time. The amplification curve is used to quantify the initial concentration of a specific transcript in a template mixture. One of the major advantages of QRT-PCR is a broad dynamic range that can precisely quantify transcript concentrations over more than eight orders of magnitude (Heid et al., 1996). QRT-PCR can be performed using a dye like SYBR Green and unlabeled primers, with one amplification target per tube and control reactions performed in parallel. Alternatively, a pair of gene-specific primers is synthesized, one of which is fluorescently labeled; several pairs of control primers are added to the sample and each primer pair is labeled with a different fluorochrome to allow specific detection. The former method using SYBR Green is less expensive than the latter; in both cases, reactions are replicated and the results are averaged.
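
As a rough illustration of how an amplification curve can be turned into a quantity, the sketch below assumes an idealized exponential model in which the product after c cycles is N0 × (1 + E)^c, where E is the per-cycle efficiency, and uses the cycle at which a fixed detection threshold is crossed. The threshold-cycle approach, the function names, and the example values are assumptions made for illustration; they are not taken from the studies cited above.

```python
def initial_template(threshold_amount: float, threshold_cycle: float,
                     efficiency: float = 1.0) -> float:
    """Back-calculate the starting amount of template.

    Assumes ideal exponential growth: amount(c) = N0 * (1 + efficiency) ** c.
    efficiency = 1.0 means perfect doubling in every cycle (an assumption).
    """
    return threshold_amount / (1.0 + efficiency) ** threshold_cycle


def fold_difference(cycle_sample: float, cycle_reference: float,
                    efficiency: float = 1.0) -> float:
    """Ratio of starting template (sample / reference).

    Both reactions are assumed to share the same threshold and efficiency,
    so a sample that crosses the threshold earlier started with more template.
    """
    return (1.0 + efficiency) ** (cycle_reference - cycle_sample)


if __name__ == "__main__":
    # A sample crossing the threshold 3 cycles before the reference
    # started with 2 ** 3 = 8 times more template under perfect doubling.
    print(fold_difference(cycle_sample=20.0, cycle_reference=23.0))  # 8.0
```

In practice the efficiency is rarely exactly 1.0 and is usually estimated from a dilution series, so the numbers above should be read as a model calculation only.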

One of the more intriguing new methods for the measurement of single genes uses so-called polonies, which stands for polymerase-colonies (Mitra and Church, 1999; Mitra et al., 2003). While still in its infancy, this technology is based on the in situ amplification of DNA or cDNA in a thin-layer acrylamide gel on a microscope slide. Because the PCR products are essentially immobilized, the amplification yields large numbers of polonies, spherical colonies of DNA, distributed across the slide. Each polony is derived from a single template molecule, and specific genes or transcripts can be detected by hybridizing labeled probes, similar to a classic colony lift blot. By counting the proportion of polonies derived from a specific transcript compared to the total (detected by a nonspecific stain, for example), a quantitative estimate of gene expression can be obtained (Mitra and Church, 1999; Mikkilineni et al., 2004). Modifications of this technology may go beyond expression analysis to monitor RNA splicing (Zhu et al., 2003) or other applications.

The analysis of expression of single genes or small sets of genes will further advance with the increased availability of well-curated expression data in public repositories. Using these preexisting data sets, it may be possible to measure gene expression using only a computer and internet access. Such analyses constitute electronic or virtual northern blots. Several groups, including our own, have made plant gene expression data accessible from easy-to-use Web interfaces (see http://mpss.udel.edu or the gene expression section of http://www.arabidopsis.org). A more limited set of plant data is available as part of the Gene Expression Omnibus section of GenBank (http://www.ncbi.nlm.nih.gov/geo/); their SAGEmap Web page performs differential expression analyses and provides a limited ability to measure single genes (Lash et al., 2000). However, this site is primarily a repository for published SAGE data (described below), and by design it is not optimized for any particular organism. These resources provide starting points for researchers interested in specific genes or gene families.


DNA Microarrays

The DNA microarray has produced a revolution in expression analysis. These chips simultaneously determine expression levels for thousands of genes. Data are then analyzed for patterns of expression that change over various treatments or time points. Microarrays may be comprised of short oligonucleotides or complete cDNA clones and provide a rapid and relatively inexpensive way to monitor in parallel the expression of thousands of transcripts. Because microarrays have now been used in hundreds of publications and the technology has been discussed in scores of review articles, the reader is directed elsewhere for in-depth discussions and technical details.

Early microarrays were built of cDNA fragments robotically gridded and immobilized on microscope slides (Schena et al., 1995), much as if the probes for a northern blot were laid down in a dense pattern. This approach, though still widely used, requires the maintenance and handling of microtiter dishes, validation of clones, and large-scale PCR reactions. A competing approach that has become the dominant system is based on short DNA oligonucleotides that serve as probes. There are several reasons for the dominance of these oligo arrays; one reason is that oligos can be synthesized either in plates or directly on solid surfaces (in situ synthesis), making it easier to obtain reliable amounts of material than for cDNA clones. In addition, even for a well-characterized plant like Arabidopsis, cDNA clones may represent less than 60% of the predicted genes (Wortman et al., 2003). Oligo-based approaches can effectively target selected regions starting from only the DNA sequence, such as anonymous open reading frames found in genomic sequence. With any of these microarray technologies, one of the most serious problems is ensuring that cDNA or oligonucleotide sequences are correctly assigned to their source. This is a particular problem if any sort of spotting or gridding is used to build the microarray, because a small proportion of microtiter dishes and tubes inevitably are mishandled. A different concern for commercially manufactured arrays can be validating the identity or genomic location of a specific probe, as these probe sequences are often not available. The assumption that microarrays are manufactured without errors can lead to misinterpretations or delays in understanding data that result from poor sample tracking, informatics errors, or contamination.

For plant research, the tractability and genomic resources of Arabidopsis have made it an attractive system in which to develop or commercialize microarrays. Because development costs were high in the early days of microarrays, and because resources for plant research are limited, several academic groups formed a consortium (the Arabidopsis Functional Genomics Consortium, or AFGC) to produce and make publicly available the first Arabidopsis arrays (Wisman and Ohlrogge, 2000). While these arrays were used by many academic laboratories, commercial arrays such as those produced by Affymetrix (Santa Clara, CA) were quickly adopted as well. The AFGC ended on December 31, 2002 and its public microarray project was discontinued; some public groups still produce Arabidopsis arrays, representing the model that was an impetus for the development of core microarray facilities at many institutions. However, some of these core facilities are now gathering dust due to the centralization of microarray production and competition from commercial operations. In general, this has proven to be a positive step because it relieves research scientists of relatively mundane manufacturing responsibilities. For example, one of the most critical steps in array construction is quality control to ensure minimal variation from array to array. Companies or public groups focused solely on array production can afford to spend considerable effort to ensure quality control, and a competitive pressure for quality works to the benefit of the researcher. Companies were quick to recognize the commercial potential for Arabidopsis arrays and have aggressively pursued the production of Arabidopsis microarrays. The drawback of commercial production is that the high costs of overhead, labor, and development are included in the arrays, whereas these costs are often absorbed in academically-produced arrays. Another drawback to removing microarray production from the hands of researchers can be the loss of control over the content and format.

Competition is heating up among companies that can or do produce Arabidopsis microarrays. The popular Affymetrix GeneChip arrays are comprised of sets of 25-base oligonucleotides synthesized in situ via a photolithographic process (Lockhart et al., 1996); the original array design that included more than 8,000 genes was the first commercial Arabidopsis array on the market (Zhu and Wang, 2000). The most recent design that is often called the whole genome array (WGA) includes more than 24,000 genes (http://www.affymetrix.com). Rosetta Inpharmatics (Kirkland, WA) developed the process of ink-jet "printing" of 60-base probes (Hughes et al., 2001). The Arabidopsis array based on this technology is produced by Agilent Technologies (Palo Alto, CA) and includes 21,500 genes; later in 2004, this array will contain approximately 44,000 features. In addition to arrays produced by Agilent, other companies are now marketing so-called long oligo microarrays. These arrays typically are comprised of a single oligonucleotide primer of 50 to 70 nucleotides for each gene, and the oligos are synthesized in situ or synthesized using conventional methods and then spotted on the arrays (Barczak et al., 2003). Spotted oligo arrays offer several advantages, such as a low manufacturing cost and flexibility, but usually require a substantial commitment by a company to presynthesize the 20,000+ long oligos that are spotted on these arrays. However, once the oligos have been synthesized, the materials can be distributed to individual labs for use with conventional gridding robots. For example, Operon (a subsidiary of Qiagen) produces oligo sets for three plant species (http://oligos.qiagen.com/), and at least one academic group grids and distributes arrays based on these oligos (http://www.ag.arizona.edu/microarray/). Customized or whole-genome Arabidopsis arrays may potentially be made using any of the platforms based on rapid and flexible in situ synthesis. This includes platforms developed by NimbleGen Systems (Madison, WI; Nuwaysir et al., 2002) and febit ag (Mannheim, Germany; Baum et al., 2003). NimbleGen uses a flexible photolithographic process capable of synthesizing high-density arrays with oligos of 24 to 90 bases; febit produces a benchtop machine capable of producing arrays of up to 48,000 features per slide with an oligo length of approximately 30 nucleotides. Because of ongoing changes in the technologies and commercial competitors, it is impossible to provide a comprehensive list of microarray platforms. However, many commercial microarray options are now available to Arabidopsis researchers.

Microarrays are now becoming available for additional plant species. Rice (Oryza sativa) is a widely-studied organism for which the complete genome sequence is anticipated by the end of 2004. As with Arabidopsis, early rice microarray experiments were based on limited sets of ESTs (Kawasaki et al., 2001). With more sequence data now available, Agilent has announced the release of a rice long-oligo microarray that includes approximately 60% of the estimated 50,000 rice transcripts (http://www.chem.agilent.com/). As with Arabidopsis, other companies are entering the business (for example, GreenGene Biotech; http://www.ggbio.com), heating up competition with a recently funded public rice array project (http://www.ricearray.org/). Despite a lack of genomic sequence data, other plant species have not been left without microarray resources. Academic collaborations have led to the development of microarrays for barley (Hordeum vulgare), cotton (Gossypium hirsutum), cabbage (Brassica capitata), maize (Zea mays), potato (Solanum tuberosum), tomato (Lycopersicon esculentum), and wheat (Triticum aestivum); commercial interest in developing arrays for these and other plant species is growing. As in the case of Arabidopsis, the release of commercial microarray products can drive some academics out of the array manufacturing business. However, because the primary motivation for some academic labs to fabricate microarrays is to generate the resources they need for experimentation, the entrance of a commercial competitor may be welcomed.

Despite the broad adoption of microarrays as a research tool, there are several technical issues with the technology, some of which are better understood than others. Most of these limitations result from the principle of hybridization that is at the core of the technology. For example, cross-hybridization, the hybridization of multiple targets to single probes, remains poorly characterized. Genome duplications impede the design of oligos that distinguish between closely related sequences (Ishii et al., 2000). In many plant species, genome duplications resulting in cross-hybridization may be a limitation for determining the expression of any single gene; in Arabidopsis, one of the simplest plant genomes, approximately 60% of the genome is duplicated and 17% of the genes are present in tandem arrays (Blanc et al., 2000; Grant et al., 2000; Vision et al., 2000; Simillion et al., 2002). The general migration from cDNA to oligo arrays means that probes can be selected based on regions of dissimilarity among generally similar genes, improving specificity (Talla et al., 2003). Hybridization and washing conditions are a critical issue for any array platform; these conditions are influenced by variations in temperature, ionic strength, or pH. The limit of detection for Affymetrix chips is approximately 1/100,000 transcripts (http://www.affymetrix.com); changes in genes expressed near this level are difficult to detect with statistical significance (Ishii et al., 2000). Background signal intensities at this level are similar to signals of many weakly expressed transcripts (Duggan et al., 1999). Spotted microarrays built from presynthesized components have several potential sources of variation that differ from those of arrays manufactured by in situ synthesis. Spotted microarrays are subject to variation in the pin geometry, variations in spot geometry, and differences in the amount of material deposited onto and subsequently bound to the slide surface. The method of preparation of the RNA and labeled cDNA targets used in any microarray experiment can also introduce variation, as there are many methods for the processing, isolation, and labeling of RNA samples, and factors such as the degradation rate of transcripts may also affect the final data (Auer et al., 2003). Sequence-specific differences in the efficiency of dye incorporation may also produce variation for biologically-irrelevant reasons. In the use of microarrays, the source of variation, whether technical or biological, should be identified and quantitatively estimated by replicating experiments at two levels: technical replications that are separate preparations and arrays run for the same RNA sample, and biological replications that are RNA samples extracted from separate but identically treated biological materials (Lee et al., 2000). It is important to note that technical variation appears much lower for in situ synthesized and spotted oligo arrays than for those produced from PCR amplicons, and this consistency decreases the relative importance of technical replicates to the point at which these may be eliminated while retaining biological replications (Zhu and Wang, 2000).
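
The two levels of replication described above can be separated numerically. The sketch below is one simple way to do so, assuming a balanced design (the same number of technical replicates for every biological sample) and the standard one-way random-effects estimator; the function name and the example values are hypothetical and are not drawn from the cited studies.

```python
from statistics import mean, variance
from typing import List, Tuple


def variance_components(replicates: List[List[float]]) -> Tuple[float, float]:
    """Estimate (technical, biological) variance for one gene.

    replicates[i] holds the technical replicate measurements (for example,
    log ratios from separate arrays) for biological sample i.
    Assumes a balanced design with at least 2 samples and 2 replicates each.
    """
    n_tech = len(replicates[0])
    if len(replicates) < 2 or n_tech < 2:
        raise ValueError("need at least 2 biological samples and 2 technical replicates")
    if any(len(r) != n_tech for r in replicates):
        raise ValueError("balanced design assumed: equal technical replicates per sample")

    # Technical variance: average spread among arrays run on the same RNA sample.
    technical = mean([variance(r) for r in replicates])

    # The variance of the sample means contains biological variance plus
    # technical variance / n_tech, so the technical share is subtracted.
    between_means = variance([mean(r) for r in replicates])
    biological = max(0.0, between_means - technical / n_tech)
    return technical, biological


if __name__ == "__main__":
    # Hypothetical log-ratio measurements: 3 biological samples x 2 arrays each.
    tech, bio = variance_components([[1.0, 1.2], [2.0, 2.2], [3.0, 3.2]])
    print(tech, bio)
```

When the technical component is small relative to the biological one, as the text suggests for in situ synthesized and spotted oligo arrays, additional biological replicates are the more informative use of arrays.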

An involvement of statistics is inevitable given the large numbers of simultaneous measurements that can be made using microarrays, and these large numbers raise problems that are not normally encountered in molecular biology. For example, an α value of 0.05 would be viewed as highly satisfactory for most biological measurements, where α is the accepted probability of detecting a false positive for a single event (a Type I error). However, when making independent measurements of 26,000 genes (events) on a typical Arabidopsis whole-genome microarray, this α value allows roughly 1,300 false positives for the experiment (26,000 × 0.05). Since downstream procedures, which are more labor intensive, less high throughput, and more expensive per unit of information, cannot reasonably accommodate this proportion of false leads, the importance of achieving more restrictive α values is readily apparent. This is possible through replication of the microarray experiments and requires greater numbers of microarrays as well as an appropriate statistical design. A particularly accessible review of this area has been provided by Draghici (2002). Among statistical treatments, the application of mixed model ANOVA methods to microarray data has considerable promise for both spotted and in situ synthesized microarrays (Kerr et al., 2000; Wolfinger et al., 2001). General agreement has not yet been reached on the optimal statistical treatment for the sets of 10 or more probes designed for each gene represented on the Affymetrix microarrays (probe level expression data). There are advantages to using existing statistical methodologies instead of the standard Affymetrix software; better accuracy and sensitivity are provided by the use of various types of models or probe level data (Li and Wong, 2001a, 2001b), ANOVA analyses (Chu et al., 2002), or analyses of inherent noise (Naef et al., 2002; Draghici et al., 2003). Identification of the sources of variance in expression data is essential to enable the detection of small but biologically relevant differences in transcriptional profiles (Jin et al., 2001). It has been clearly demonstrated that the failure to apply appropriate statistical analyses to microarray data can result in misleading conclusions (Hsieh et al., 2003).
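
The false-positive arithmetic above, and what a more restrictive threshold looks like, can be sketched in a few lines. The Bonferroni and Benjamini-Hochberg procedures used here are two widely used corrections chosen for illustration; they are an assumption of this sketch rather than methods recommended in the text, and the p-values in the example are made up.

```python
from typing import List


def expected_false_positives(alpha: float, n_tests: int) -> float:
    """Expected number of false positives if no gene is truly changed."""
    return alpha * n_tests


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-gene threshold that keeps the experiment-wide error rate near alpha."""
    return alpha / n_tests


def benjamini_hochberg(p_values: List[float], q: float = 0.05) -> List[bool]:
    """Flag genes as significant while controlling the false discovery rate at q.

    Returns one boolean per input p-value, in the original order.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        # Keep the largest rank whose p-value is under its stepped threshold.
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank
    significant = [False] * m
    for idx in order[:cutoff_rank]:
        significant[idx] = True
    return significant


if __name__ == "__main__":
    print(expected_false_positives(0.05, 26_000))  # about 1,300, as in the text
    print(bonferroni_threshold(0.05, 26_000))      # about 1.9e-06 per gene
    print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]))
```

Bonferroni is the more conservative of the two; false discovery rate control accepts a small, stated proportion of false leads among the reported genes in exchange for greater sensitivity.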


Tag-Based Methods

Exhaustive sequencing of ESTs is a common method for gene expression profiling, although the primary purpose of EST sequencing is usually to generate genic sequence data. EST data are generated by large-scale, single-pass, partial sequencing of cDNA clones (approximately 500 bp), usually from a large number of libraries representing diverse tissues (Adams et al., 1995). Comparisons of EST frequencies in different libraries can expose differential gene expression on a broad basis (Okubo et al., 1992, 1995; Matsubara and Okubo, 1993; Ewing et al., 1999). In theory, the abundance of an EST is an exact digital representation of the number of copies of a transcript in the tissue. Large numbers of ESTs derived from diverse tissues produce quantitative estimates of gene expression, but ESTs are relatively slow and costly to generate, making it difficult to achieve saturation of a library. Theoretically, expression profiles could be derived for very weakly expressed genes if ESTs were sequenced in sufficient number. This has been performed with human EST libraries that contain tens of thousands of sequences (Okubo et al., 1992, 1995; Matsubara and Okubo, 1993; Adams et al., 1995; Kawamoto et al., 2000). In plants, Ewing et al. (1999) compared and analyzed 10 rice libraries containing between 1,000 and 5,000 ESTs and were able to identify statistically significant patterns of gene expression among several rice tissues. However, public plant EST libraries are in general too small or from too many sources for accurate quantitative expression analyses, although private companies have amassed databases of more than a million plant ESTs (Mazur et al., 1999). For Arabidopsis, there are currently 196,988 ESTs or cDNAs in GenBank (as of January, 2004; http://www.ncbi.nlm.nih.gov/dbEST), but because most of these were generated either from a single library of mixed tissues or were selected from normalized libraries (Newman et al., 1994; Delseny et al., 1997), the Arabidopsis EST abundance does not accurately reflect expression levels. In general, the low total number of EST sequences for a given organism confounds accurate estimates of gene expression levels.

SAGE, like EST sequencing, is a quantitative or digital method of gene expression analysis. Unlike EST sequencing, SAGE extracts only a 10- to 14-base tag from a unique position within each species of mRNA (Velculescu et al., 1995; Zhang et al., 1997). These short SAGE tags are derived from a position directly 3'-adjacent to the 3'-most recognition site for a particular restriction enzyme, such as NlaIII. The tag sequence and position are important for the identification of the gene from which the tag was derived. Whereas ESTs each require a single sequencing read, SAGE tags are released from cDNAs by restriction enzymes, ligated together, amplified by PCR, and sequenced as concatamers. This results in a higher throughput and lower cost for SAGE than ESTs. A number of modifications to the original protocol have been reported. Modifications that increase the length of the tag include the LongSAGE method (Saha et al., 2002) that produces 21- or 22-base tags, and the SuperSAGE method that produces 26-base tags (Matsumura et al., 2003); a recent report describes modifications that dramatically improve the efficiency of LongSAGE library construction (Gowda et al., 2004). The primary limitation of SAGE or its variants is the cost of sequencing reactions; even at $1 per read, SAGE tags cost roughly $0.04 each and a library of 100,000 tags would cost $4,000. Sampling error has also been a source of bias in SAGE (Stollberg et al., 2000), although increasing the number of available tags addresses this problem.
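
The tag position and the cost arithmetic described above can be illustrated with a short sketch. It assumes that the NlaIII recognition site is CATG, that the tag is reported without the site itself, and that the quoted figures ($1 per read, $0.04 per tag) imply about 25 tags per sequenced concatamer; the function names and the toy sequence are hypothetical.

```python
import math
from typing import Optional

NLAIII_SITE = "CATG"  # recognition site of NlaIII, the enzyme named in the text


def sage_tag(cdna: str, site: str = NLAIII_SITE, tag_length: int = 10) -> Optional[str]:
    """Return the tag directly 3' of the 3'-most restriction site.

    Returns None if the site is absent or too close to the end of the
    sequence to yield a full-length tag.
    """
    seq = cdna.upper()
    site_start = seq.rfind(site)  # rfind gives the 3'-most occurrence
    if site_start == -1:
        return None
    tag_start = site_start + len(site)
    tag = seq[tag_start:tag_start + tag_length]
    return tag if len(tag) == tag_length else None


def library_cost(n_tags: int, cost_per_read: float = 1.0, tags_per_read: int = 25) -> float:
    """Sequencing cost of a tag library; 25 tags per read is inferred, not stated."""
    reads_needed = math.ceil(n_tags / tags_per_read)
    return reads_needed * cost_per_read


if __name__ == "__main__":
    toy_cdna = "AAACATGTTTTCATGACGTACGTACGGAAAA"  # made-up sequence with two CATG sites
    print(sage_tag(toy_cdna))     # ACGTACGTAC (taken after the second, 3'-most site)
    print(library_cost(100_000))  # 4000.0, matching the $4,000 figure in the text
```

Raising `tag_length` to 21 or 26 mimics the longer LongSAGE and SuperSAGE tags, at the cost of fewer tags per read.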

A recent advance in tag-based gene expression analysis is MPSS, developed and commercialized by Lynx Therapeutics (Hayward, CA). MPSS is based on methods to clone individual cDNA molecules on microbeads and sequence, in parallel, short tags or signatures from these cDNAs (Brenner et al., 2000a, 2000b). A complex mix of cDNAs, such as those derived from a particular plant tissue, is cloned onto microbeads, with the representation of molecules on the beads identical to that in the original sample (e.g. one cDNA per bead). Using an unconventional but ingenious method of sequencing, large numbers of beads are sequenced in parallel. A series of digestion, ligation, and hybridization reactions are performed in consecutive steps while the beads are immobilized in a flow-cell underneath a high-power microscope so that the reagents flow over and around the beads, and there are no gels or capillaries (Brenner et al., 2000a). The final output of MPSS is a set of abundances for thousands of distinct 17- or 20-base signatures, most of which uniquely identify a particular transcript. The parallel sequencing method produces millions of MPSS signatures in only a few weeks; however, the technology is sufficiently complex that unlike SAGE, it cannot be performed in individual laboratories. On a per-tag basis, MPSS is currently less than half the cost of SAGE.

The sequence-based expression data from ESTs, SAGE, or MPSS experiments have many uses. The availability of complete genome sequences permits the direct comparison of tags to genomic sequence and further extends the utility of the data (Meyers et al., 2004b). The identification of transcribed regions is performed by aligning the signatures to genomic sequence. The expression levels of nearly all polyadenylated transcripts can be quantitatively determined, and the abundance of a given tag for a specific library is representative of the expression level of the corresponding gene. The approximate location of the polyadenylation site for each transcript is known because both SAGE and MPSS tags are derived from defined restriction sites in the 3' end of a transcript. Several distinct SAGE or MPSS tags matching different sites within a single gene indicate alternative polyadenylation or 3' splicing. Expressed tags that uniquely match to unannotated regions of the genome provide experimental evidence for novel transcripts (Meyers et al., 2004c). Quantitative methods for the analysis of tag frequencies and detection of differences among libraries have been published (Audic and Claverie, 1997; Greller and Tobin, 1999; Lash et al., 2000; Stekel et al., 2000).
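
To give a sense of what such a comparison of tag frequencies involves, the sketch below normalizes counts to tags per million and evaluates a conditional probability of the form p(y | x) = (N2/N1)^y (x + y)! / [x! y! (1 + N2/N1)^(x + y + 1)], which is our understanding of the statistic described by Audic and Claverie (1997); readers should check the original paper before relying on it. Here x and y are the counts of one tag in libraries of N1 and N2 total tags, and all function names and example numbers are hypothetical.

```python
from math import exp, lgamma, log


def tags_per_million(count: int, library_size: int) -> float:
    """Normalize a raw tag count so libraries of different depth are comparable."""
    return count * 1_000_000 / library_size


def audic_claverie_prob(x: int, y: int, n1: int, n2: int) -> float:
    """Probability of observing y tags in library 2 given x tags in library 1,
    under the hypothesis that the transcript is equally abundant in both.

    Computed in log space with lgamma to avoid overflowing the factorials.
    """
    ratio = n2 / n1
    log_p = (y * log(ratio)
             + lgamma(x + y + 1) - lgamma(x + 1) - lgamma(y + 1)
             - (x + y + 1) * log(1.0 + ratio))
    return exp(log_p)


def upper_tail(x: int, y: int, n1: int, n2: int) -> float:
    """Probability of seeing y or more tags in library 2 given x in library 1."""
    below = sum(audic_claverie_prob(x, k, n1, n2) for k in range(y))
    return max(0.0, 1.0 - below)


if __name__ == "__main__":
    # Made-up counts: 5 tags in a library of 100,000 versus 20 in another of 100,000.
    print(tags_per_million(5, 100_000), tags_per_million(20, 100_000))
    print(upper_tail(5, 20, 100_000, 100_000))
```

A small upper-tail probability suggests that the difference in counts is unlikely to arise from sampling alone, although it says nothing about biological variation between independently prepared libraries.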

Genome duplications complicate the unique assignment of short tags to specific genes, particularly when members of a gene family have a high degree of similarity. Issues of genome duplications are likely to be particularly relevant to many plant species that have polyploid origins and show evidence of large-scale segmental duplications. The short length of SAGE tags (usually 14 bases) complicates the assignment of tags to distinct genes in even minimally complex genomes (a tag-to-gene ambiguity; Lash et al., 2000; Stollberg et al., 2000). Tag-to-gene ambiguities may be avoided by using longer tag sequences, such as 20-base MPSS signatures, 21-base LongSAGE tags (Saha et al., 2002), or the 26-base SuperSAGE tags (Matsumura et al., 2003). An analysis of potential MPSS signatures in the Arabidopsis genome demonstrates that 18.1% of 17-base tags and 12.5% of 20-base tags are duplicated (Meyers et al., 2004b). Analyses using the Arabidopsis genome indicate that there is a diminishing return for tag lengths beyond 20 bases, such that it may be more economical to sacrifice some specificity to obtain a greater number of tags of approximately 20 bases and sort out differential expression among nearly identical gene family members using different techniques (C.D. Haudenschild and B.C. Meyers, unpublished data). A gene may also have more than one unique tag as a result of alternative termination of some transcripts, creating a gene-to-tag ambiguity (Lash et al., 2000; Meyers et al., 2004b).
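
The effect of tag length on tag-to-gene ambiguity can be demonstrated on a toy example. The sketch below assumes one tag per transcript, taken directly 3' of the 3'-most CATG site, and reports the fraction of distinct tags shared by more than one gene; the three sequences are invented so that two "family members" are identical for the first 12 bases of the tag region, and the function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, Optional, Set


def tag_after_last_site(seq: str, site: str, tag_length: int) -> Optional[str]:
    """Tag directly 3' of the 3'-most site, or None if no full-length tag exists."""
    start = seq.upper().rfind(site)
    if start == -1:
        return None
    tag = seq.upper()[start + len(site):start + len(site) + tag_length]
    return tag if len(tag) == tag_length else None


def ambiguous_tag_fraction(transcripts: Dict[str, str], tag_length: int,
                           site: str = "CATG") -> float:
    """Fraction of distinct tags that map to more than one gene."""
    genes_by_tag: Dict[str, Set[str]] = defaultdict(set)
    for gene, seq in transcripts.items():
        tag = tag_after_last_site(seq, site, tag_length)
        if tag is not None:
            genes_by_tag[tag].add(gene)
    if not genes_by_tag:
        return 0.0
    shared = sum(1 for genes in genes_by_tag.values() if len(genes) > 1)
    return shared / len(genes_by_tag)


# Invented transcripts: geneA and geneB diverge only 12 bases past the site.
TOY_TRANSCRIPTS = {
    "geneA": "GGCATG" + "ACGTACGTACGT" + "AAAATTTT",
    "geneB": "CCCATG" + "ACGTACGTACGT" + "CCCCGGGG",
    "geneC": "TTCATG" + "TTTTGGGGCCCCAAAATTTT",
}

if __name__ == "__main__":
    for length in (10, 20):
        print(length, ambiguous_tag_fraction(TOY_TRANSCRIPTS, length))
```

With 10-base tags the two similar genes collapse onto one tag, while 20-base tags separate them; on a real genome the same calculation would be run over all annotated transcripts or all candidate sites.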

Methods like SAGE have not been applied extensively to plant species, but more and more examples can be found in the literature (Matsumura et al., 1999; Chakravarthy et al., 2003; Jung et al., 2003; Lee and Lee, 2003; Fizames et al., 2004). Early applications in nonplant species used SAGE to characterize transcriptomes (Velculescu et al., 1995, 1997), to study the differences between them (Zhang et al., 1997), to annotate genomic sequences (Saha et al., 2002), and for whole-genome studies of transcriptional activity (Caron et al., 2001). In our laboratory, we have been using MPSS to analyze gene expression in Arabidopsis, and we have developed a Web site for public access to these data (Meyers et al., 2004a, 2004b). For reasons that are not entirely clear, MPSS has been more rapidly adopted in the plant community than in animal species, although there are only a few published studies outside of our own laboratory (e.g. Hoth et al., 2002, 2003; Christensen et al., 2003). One limitation for all of the tag-based methods compared to microarrays is that the cost of biological or technical replications is prohibitive, so estimates of variance for the tag-based methods are incomplete or poorly characterized.


    THE DANGERS OF PROLIFERATING TECHNOLOGIES
 
There are both advantages and disadvantages to the growing number of competing technologies and technology platforms for the measurement of gene expression. Some comparisons are not entirely fair; for example, the two broad categories that we describe above, tag-based systems and microarrays, have different and complementary uses (see below), so these are not directly competing technologies. Competition among microarray platforms has led to lower costs, improved quality control, and increased numbers of genes per array, at least in the case of Arabidopsis. The disadvantage of having a proliferation of array platforms is that it can create orphan data. In other words, experiments performed with an older generation or a different type of microarray may be difficult to compare to data derived from the latest microarray format. This may necessitate the repetition of experiments to directly confirm other laboratories' findings.

The prospect of comparing data across experiments raises the question of whether the measurements from gene expression technologies are directly comparable and how good the correlations are. While no definitive answer yet exists, several groups have addressed or are currently addressing this question. In a comparison of SAGE with the Affymetrix oligonucleotide microarrays, the two approaches correlate for genes expressed at high levels, and SAGE is more accurate than the arrays for genes expressed at low levels (Ishii et al., 2000). We are currently conducting comparisons of MPSS and microarray analyses. Among microarray platforms, several comparisons have been published. Tan et al. (2003) compared gene expression measurements generated from identical human RNA samples using the Affymetrix (25-mer), Agilent (60-mer), and Amersham (30-mer; Piscataway, NJ) microarray platforms. A total of five arrays were used for each time point in their analysis, including technical and biological replicates. Their results demonstrated considerable variation for comparisons of significant gene expression changes, and correlations in gene expression levels across the different platforms were modest (Pearson's correlation coefficient average of 0.53, range of 0.48–0.60). In addition, although many of the genes present on each microarray platform were the same, the differentially expressed genes identified by each technology were not substantially overlapping. Other studies have compared spotted cDNA microarrays with Affymetrix GeneChip arrays and found a poor correlation between these disparate array types (Kuo et al., 2002; Yuen et al., 2002), although the level of experimental replication in these studies was not clear. Poor statistical designs or a lack of replications could also generate low correlations. In general, published cross-platform analyses suggest that the conclusions derived from a microarray analysis may be largely dependent upon the type of platform used in the experiment. 
This is not encouraging news, and suggests that a great deal remains to be learned about factors intrinsic to different microarray platforms that can affect the data.
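
A cross-platform comparison of the kind summarized above can be sketched as follows. This is a minimal illustration, not the procedure of any of the cited studies: it assumes that each platform's measurements are available as a mapping from a gene identifier to a positive signal value, restricts the comparison to genes present on both platforms, and computes Pearson's correlation coefficient on log2-transformed values. The gene identifiers and signal values are hypothetical.

```python
from math import log2, sqrt


def pearson(xs, ys):
    """Pearson's correlation coefficient for two equal-length lists."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two lists of equal length with at least 2 values")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined when one platform shows no variation")
    return cov / sqrt(var_x * var_y)


def cross_platform_correlation(platform_a, platform_b):
    """Correlate log2 signals for the genes represented on both platforms."""
    shared = sorted(set(platform_a) & set(platform_b))
    log_a = [log2(platform_a[gene]) for gene in shared]
    log_b = [log2(platform_b[gene]) for gene in shared]
    return pearson(log_a, log_b), len(shared)


# Hypothetical signals; "gene4" and "gene5" are each present on one platform only.
signals_a = {"gene1": 100.0, "gene2": 200.0, "gene3": 400.0, "gene4": 50.0}
signals_b = {"gene1": 10.0, "gene2": 20.0, "gene3": 40.0, "gene5": 7.0}
r, n_shared = cross_platform_correlation(signals_a, signals_b)
```

The number of shared genes is returned alongside the coefficient because, as discussed in this section, the sets of genes represented on different platforms are not identical, and a correlation based on a small shared subset should be interpreted with caution.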

Incongruous data or conclusions from gene expression measurements performed using different technology platforms may result from several sources of variation. A very simple example is that the set of genes represented in the arrays may not be identical; the Agilent, Affymetrix, and Qiagen/Operon probe sets for Arabidopsis microarrays each represent 21,500 to 24,197 genes, but only 17,149 genes are shared among the three platforms. However, there are additional issues in such a comparison, because oligo lengths, positions, and numbers per gene vary among manufacturers. It is possible that some genes are better measured by the probes on different microarray platforms, and no single type of array accurately measures every gene. It may take many years of empirical studies before we achieve optimal designs and understand the impact of the sequence and position of the oligo on the signal strength. The process of correlating design features with expression data would be facilitated if all manufacturers released the sequence of the probes on their arrays. Probe sequences are considered proprietary information by some companies because of a fear that competitors will market arrays based on identical probes or use the information to decipher design algorithms. With some exceptions, complete sets of probe sequences can be hard to obtain except via nondisclosure agreements with manufacturers. In fact, oligo design software is still rapidly developing (e.g. Mei et al., 2003; Nielsen et al., 2003; Talla et al., 2003), so it is highly unlikely that any existing microarrays contain a complete set of optimally designed probes. It may also be desirable (although perhaps not plausible) for array manufacturers to agree on a set of standard template sequences; if different splice variants or models of a gene are used for probe design, it is possible that probes with the same gene identifier may be measuring different transcripts. 
Standardization of experimental design and methods would also facilitate comparisons of array data produced by different labs. One of the first steps in this direction was the development of a standard set of technical details that should be reported for every microarray experiment. The Minimum Information About a Microarray Experiment (MIAME) standard requires the reporting of enough details to ensure that the results of a microarray experiment could be interpreted or repeated independently (Brazma et al., 2001). These basic data should be sufficient to store the data in public repositories such as the Gene Expression Omnibus (GEO) or ArrayExpress and enable the use of standardized data analysis tools.

In the coming years and as sequence databases are populated with ESTs and genomic data for diverse plant species, the research community working in each of these organisms may face the question of which gene expression platform to choose. This may be an issue if it comes down to a choice among commercial platforms, because several of the major microarray production companies charge significant set-up fees (although for a large-enough market, these fees may be waived and absorbed into the sales of the arrays). The barley GeneChip microarray is an example of an organized and united approach taken by a consortium of plant researchers to build resources for expression profiling in a crop species that had not attracted the interest of commercial microarray manufacturers (Close et al., 2004). An international group of laboratories focused and coordinated their efforts to develop a single microarray platform for transcriptional profiling. A public data storage site, BarleyBase (http://barleybase.org/), was constructed as part of this project to integrate expression profiling data from all researchers using the platform. This creates a synergistic effect because all array data generated for barley will be directly and easily comparable. BarleyBase is also incorporating controlled vocabularies to facilitate cross-species comparisons (Close et al., 2004). The coordinated development of the barley microarray may represent a paradigm for other plant species in which too many technology platforms could diminish the utility of individual data sets and fragment the research community.


    OPEN VERSUS CLOSED TECHNOLOGIES AND THE IDENTIFICATION OF NOVEL TRANSCRIPTS
 
Technologies such as ESTs, SAGE, or MPSS require no prior knowledge of the sequences of the transcripts and can discover previously unknown transcripts. This feature defines an open architecture for these expression technologies. In contrast, closed architectures, like most microarrays, are based on existing knowledge of genes, with probe sets designed to match known or predicted transcripts. The data derived from the open technologies can be used to annotate genomic sequence, whereas data from closed technologies are often cheaper to obtain and can more easily be used for focused experiments. However, one of the more interesting applications of the microarray is the development of a hybrid approach. In different organisms, several groups have constructed true WGAs containing tiled probe sets that include nearly every nucleotide in the genome (Kapranov et al., 2002; Yamada et al., 2003). WGAs have extended the potential of microarrays by creating an open system on a platform generally characterized as closed. Such arrays have recently been applied to Arabidopsis and led to the identification of transcription from unannotated regions of the genome (Yamada et al., 2003). In addition, these tiled arrays uniquely offer the ability to characterize, at the whole-genome level, transcriptional variants that differ in the use of splice sites and exons and to describe previously uncharacterized 5' or 3' untranslated regions.

In fact, transcriptional data from open technologies suggest that automated annotations of genomic sequence fail to identify many transcripts. Through the application of WGAs, MPSS, and targeted RACE experiments, the Arabidopsis genome is still yielding previously unknown transcripts, although the genome was mostly completed and first annotated more than 3 years ago (Arabidopsis Genome Initiative, 2000; Xiao et al., 2002; Yamada et al., 2003; Meyers et al., 2004b). The WGA and MPSS data of Yamada et al. (2003) and Meyers et al. (2004c) suggest that a comprehensive annotation of transcripts encoded in a genome requires significant experimental data beyond the complete sequencing of chromosomal DNA. Many of these RNA molecules may not encode proteins, but could have independent functions as regulatory molecules. Transcripts that do not encode proteins but can function directly as RNA molecules are called noncoding RNAs (ncRNAs; Eddy, 2001). With the exception of housekeeping RNAs, like tRNAs or small nucleolar RNAs, relatively few potential regulatory ncRNAs have been characterized from plants; those that have been identified appear to be plant-specific (MacIntosh et al., 2001). Nearly all of the 29,000+ predicted genes in Arabidopsis encode proteins; very few ncRNAs are annotated (MacIntosh et al., 2001; Wortman et al., 2003). Natural anti-sense transcripts (NATs) overlap with transcribed coding regions and may be involved in the regulation of gene expression (Vanhee-Brossollet and Vaquero, 1998). These NATs and other ncRNAs are a major component of the diversity of transcripts produced in higher eukaryotes (Eddy, 2001; Numata et al., 2003; Yelin et al., 2003). Some of the first experiments using SAGE and MPSS in plant genomes have identified a number of anti-sense transcripts (Gibbings et al., 2003; Meyers et al., 2004b). 
Therefore, the comprehensive use of open transcriptional profiling approaches will add significant new information to any sequenced genome by identification of ncRNAs, NATs, or other transcripts that are poorly predicted. Because the transcriptional complexity of sequenced genomes has yet to be fully explored, microarray designs should be flexible and facilitate the addition of newly discovered transcripts.

There are additional transcripts missing from or insufficiently measured by current technology platforms. Methods also need to be developed for high-throughput quantification of splice variants. Simultaneous quantification of all splice variants of a single gene is presently done on a gene-by-gene basis using QRT-PCR (Renner and Pilger, 1999; Goel et al., 2001). Large numbers of variants of known transcripts have been found in Arabidopsis, generated by alternative splicing or polyadenylation (Haas et al., 2003; Meyers et al., 2004b). These variants may have novel functions. Additionally, there are no systematic processes for identification and quantification of microRNAs (miRNAs), which have important biological roles in plants and animals (Carrington and Ambros, 2003). These small RNA molecules (approximately 21 nucleotides) play regulatory roles in plant development and are processed from longer noncoding transcripts (Aukerman and Sakai, 2003; Palatnik et al., 2003). However, it is not yet clear that all possible miRNAs have been characterized from Arabidopsis. A technology to measure these on a global scale would contribute greatly to our understanding and open the door to novel experiments.

Future genomics projects will take advantage of the advances in techniques and technologies to deliver genomes at a fraction of previous costs. We anticipate that high-throughput open technologies, such as MPSS, will be important because the data can be used to annotate genomic sequence. Ultimately, it may be possible to estimate the extent of gaps in the genomic sequence based on the percentage of unmatched MPSS signatures. Statistical approaches to estimating the complete size and complexity of the human transcriptome based on limited SAGE data were unsuccessful (Stern et al., 2003), but it may be possible to estimate the complexity of the Arabidopsis transcriptome using more extensive sets of MPSS data.


    TISSUE ISSUES: MEASUREMENTS OF GENE EXPRESSION IN SPECIFIC CELL TYPES
 
Multicellular eukaryotic organisms comprise complex interspersions of different cell types. Higher plants are no exception, and it is increasingly apparent that methods are required to isolate specific cell types when considering gene expression in the whole organ. Typical experiments may utilize intact leaves, flowers, or other organs that comprise multiple cell types and utilize RNA that is isolated essentially from a population or mixture of cells. For certain studies, this homogenization of a heterogeneous starting material may dilute, alter, or mask the true biological state of individual cells. The averaging of a response across millions of cells may produce a signal that is artificial and accurately reflects none of the varied transcriptional states found in individual cells. Signals that emanate from a single plant cell (perhaps one under attack from a pathogen) may be found in a gradient that decreases with distance from the source, such that the timing and magnitude of the transcriptional response varies dramatically in cells that are further from the source. However, until technologies are better able to precisely measure the state of single cells, this will remain speculation.

Several methods are being employed to allow subsets of cells to be isolated and analyzed for gene expression with the techniques described above. These methods are described in more detail below, but one limitation that still exists is the large amount of RNA required for an experiment. Standard microarray experiments utilize fluorescent dyes that necessitate microgram quantities; SAGE and MPSS library construction requires similar quantities of starting material. The use of radioactively-labeled targets requires only nanogram quantities for accurate detection and measurement, but methods employing radiation, such as macroarrays (the larger cousin of microarrays with probes gridded on nylon membranes), have been predominantly supplanted due to relatively low throughput. Amplification of small quantities of RNA may provide a way around this requirement. Methods and products for RNA amplification are available, but amplification could bias the representation in the sample due to variation in the length or sequence of the transcripts. Amplification methods are complicated slightly for oligo-based microarrays; the immobilized probe on these arrays consists of a single strand of DNA, and to ensure strand specificity for the RNA target, amplification methods must ensure production of the complementary target. We have developed accurate methods based on in vitro transcription for the linear amplification of plant total RNA that start from as little as 50 ng of material; we have also developed methods for exponential amplification of picogram quantities of RNA (F.-C. Gong and D. W. Galbraith, unpublished data).


Isolation of Cell-Specific RNA and Other Macromolecules by Laser-Capture Microdissection

Several methods have been developed for the isolation of macromolecules such as DNA, RNA, and protein from selected cells. Some schemes rely on tissue dissociation (e.g. tissue digestion and cell sorting) and thus rely on the prior identification of cell-specific markers (see below). Other techniques, such as direct micropipetting of cell contents, are highly labor-intensive or have limited access to internal tissues (Karrer et al., 1995; Brandt et al., 2002). In contrast, laser-capture microdissection (LCM) provides a rapid means of isolating pure cellular preparations directly from heterogeneous tissues, based on conventional histological identification (Emmert-Buck et al., 1996). Specific markers can assist with the identification of the desired cells, including prestaining with β-glucuronidase reporters (N. Gandotra and T. Nelson, unpublished data), but this is not a requirement. The LCM system can also incorporate immunological identification of specific cells to assist the laser-harvest step. Two studies to date have reported the use of laser microdissected cells from plant tissues as the source of RNA for profiling on microarrays (Asano et al., 2002; Nakazono et al., 2003).

In the LCM version developed at the National Institutes of Health (Emmert-Buck et al., 1996) and commercially available as the Arcturus Pix-Cell system (http://www.arctur.com), a low-power infrared laser beam is used to fasten selected cells to a thermoplastic film suspended above a tissue slice while it is viewed on an inverted microscope. Cells harvested onto the film can be subjected to high efficiency procedures for the isolation and analysis of DNA, RNA, and protein. The advantage of this version of the LCM method is that the laser dimples the adhesive film onto individual cells (for review, see Roberts, 2002); the cells are not struck by the laser beam. Images are obtained of samples before and after cell harvest, as well as of the harvested cells. The harvest of hundreds or thousands of individual cells is feasible, using either a manual aim-and-fire method or a fully automated method in which the desired cells are marked on a screen for robotic harvest from the slide. A variety of proof-of-concept and analytical studies have demonstrated that the DNA, RNA, and protein obtained from LCM-harvested cells can be suitable for microarray-based RNA expression profiling, proteomic protein profiling and genomic mutational analysis (Banks et al., 1999; Jin et al., 1999; Luo et al., 1999; Simone et al., 2000; Wong et al., 2000; Craven et al., 2002; Ohyama et al., 2002; Nakazono et al., 2003).

Kerk et al. (2003) optimized LCM for use with tissues from a variety of plants, including rice, maize, Arabidopsis, radish (Raphanus sativus), and other species. Their approach used conventional histological methods, including paraffin-embedding; this method provides high-resolution access to cells of all ages and types, and is stable enough to permit archiving and resampling of the tissue. RNA can be isolated from paraffin-archived materials for at least several months without degradation in quality. In addition, samples can be taken from multiple sections onto the same collecting film to pool cells that are rare, such as single cells from a particular location. Using the paraffin methodology, recoveries of 10 ng of RNA/50 LCM-harvested cells are possible, sufficient for a strong signal by single-round RT-PCR from a moderately expressed gene or to serve as a template for linear amplification into probes for microarrays (N. Gandotra, T. Ceserani, S.L. Tausta, and T. Nelson, unpublished data).


Flow Sorting of Cell-Type Specific Nuclei or Protoplasts

Specific cell types can be labeled with fluorescent proteins and protoplasts prepared and purified using flow cytometry and cell sorting. The sorted protoplasts can then be subjected to gene expression analyses. The green fluorescent protein (GFP) of Aequorea victoria is the prototypic label; specific cell types can be tagged by driving expression of such proteins with highly specific promoters. This approach was used by Birnbaum et al. (2003) to create a gene expression map of the developing Arabidopsis root. Groups of genes with coordinated expression, as determined using Affymetrix GeneChips, defined local expression domains. Statistically significant overrepresentation of genes of known functions within the local expression domains provided testable hypotheses about root development. These hypotheses concerned the influences and involvement of signal transduction, hormone responses, gene organization, and other regulatory mechanisms. The map also provides a useful resource for the design of further experimental and computational strategies to explore gene regulation in roots. One caveat is the possible influence of the process of protoplast production on gene expression patterns. For Arabidopsis roots, this influence appears minor (Birnbaum et al., 2003), although subtle changes in genes expressed at low levels may not have been detected by the expression platform. For organ systems, the question also exists as to whether protoplasts can be successfully isolated from all cell types that are present within that organ.

The approach of GFP-based cell type-specific labeling can also be applied to subcellular organelles such as nuclei (Galbraith, 2003). Flow sorting of GFP-tagged nuclei from homogenates of transgenic plants allows rapid purification of sources of primary transcripts. Given that polyadenylation is essentially cotranscriptional (Orphanides and Reinberg, 2002), this approach should provide information about transcriptional regulation that is unaffected by the types of perturbation of gene expression associated with protoplast production. A further advantage is that plant homogenization can be adapted more readily for high throughput handling than can protoplast production.


    APPLICATIONS OF TRANSCRIPTIONAL PROFILING: AN EXPANDING RANGE OF POSSIBILITIES
 

Dissection of Changes in Gene Expression Levels

One of the temptations of whole-genome expression platforms is to simply generate data for discovery purposes. While this may be a valid approach for open technologies in which the data can be used for genome annotation, it is harder to justify for microarrays and other closed technology platforms. Despite the ease of producing reams of data, they will be meaningless unless experiments are properly designed with the appropriate biological materials and replicates. The extraction of meaningful data requires analytical strategies, and the interpretation depends on close interactions among biologists, computer scientists, and statisticians.

The detection of differential expression between two types of tissues differing by some experimental variable is one of the most basic questions addressed with transcriptional analysis. Typically, a user-defined cut-off or threshold for the ratio of expression levels in the two tissues is used to identify differentially expressed genes. The underlying assumption is that genes with differential expression are somehow involved in the condition that distinguished the tissues. The statistical methods for identifying such genes have been much better developed in recent years (for review, see Slonim, 2002), and are able now to identify up- or down-regulated genes with statistical significance. The end product is a list of candidate genes believed to be involved in the phenotype of interest; these genes must then be validated using much more time-consuming functional studies. The integration of pathway information could lead to the association of pathways with a process when genes in that pathway are overrepresented in the differentially expressed genes. Although for most organisms few data are available describing the pathways and related genes, such data may be generated empirically by the application of pattern discovery methods. These methods include the numerous clustering techniques designed to construct groups of genes with related patterns within the dataset. This simplifies and structures the data based on inherent patterns rather than imposing assumptions made a priori. Ultimately, it may be possible to reconstruct or model complex signaling pathways by combining inferences made from transcriptional profiling data with biochemical and metabolic data.
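
The two steps described above, selection of genes by a ratio cut-off and a test for overrepresentation of a pathway among the selected genes, can be sketched as follows. The hypergeometric test used here is one common choice for the second step and is an assumption of this example, not necessarily the method of the studies cited; the gene names, ratios, and the 2-fold cut-off are likewise hypothetical.

```python
from math import comb


def differentially_expressed(ratios, fold_cutoff=2.0):
    """Genes whose expression ratio is at least fold_cutoff up or down.

    ratios maps a gene identifier to a positive ratio (tissue 1 / tissue 2).
    """
    return {gene for gene, ratio in ratios.items()
            if ratio >= fold_cutoff or ratio <= 1.0 / fold_cutoff}


def enrichment_p_value(total_genes, pathway_genes, selected_genes, overlap):
    """Hypergeometric P(X >= overlap).

    Probability that at least `overlap` of the `selected_genes` fall in a
    pathway of `pathway_genes` members when drawn at random, without
    replacement, from `total_genes` measured genes.
    """
    upper = min(pathway_genes, selected_genes)
    favorable = sum(comb(pathway_genes, k)
                    * comb(total_genes - pathway_genes, selected_genes - k)
                    for k in range(overlap, upper + 1))
    return favorable / comb(total_genes, selected_genes)


# Hypothetical experiment: 10 measured genes, 3 of which belong to one pathway.
ratios = {"g1": 4.0, "g2": 0.25, "g3": 3.1, "g4": 1.1, "g5": 0.9,
          "g6": 1.3, "g7": 0.8, "g8": 1.0, "g9": 1.2, "g10": 0.7}
pathway = {"g1", "g2", "g3"}

selected = differentially_expressed(ratios)
p_value = enrichment_p_value(len(ratios), len(pathway),
                             len(selected), len(selected & pathway))
```

A ratio cut-off alone ignores measurement variance, which is one reason replication is emphasized throughout this article; in practice the selection step would normally rest on a replicate-based statistical test rather than a fixed ratio.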


Categorization of Tissues Based on Expression Patterns

Expression profiling provides a comprehensive approach for the molecular characterization of tissues, treatments, or cell types. The state of the transcriptome represents a phenotype that provides a clear physiological picture of cellular activity (Hughes et al., 2000). Class prediction methods are statistical techniques that can be used to classify expression profiles from different samples into known groups (for review, see Slonim, 2002). The use of microarray phenotypes for tissue classification is most widely and successfully used in cancer research; the molecular data can distinguish tumors more reliably than other approaches, resulting in more accurate disease diagnoses (Russo et al., 2003). Comparisons among different samples of the same cancer type reveal distinct subgroups, provide a molecular classification of the cancer type, and can determine the stages of progression of the disease (Russo et al., 2003). Hierarchical clustering analysis of the array data is used to sort specimens. These studies have defined candidate marker genes that can discriminate between normal and diseased tissues. The combined sets of diagnostic marker genes may be used to develop specialized or customized arrays that contain only the diagnostic genes of specific interest. However, while the idea of customized arrays was pertinent when array densities were low and most arrays were homemade, this strategy may be less important as costs decrease for high-density commercial arrays for which uninformative genes can be ignored.

In plants, this type of classification based on transcriptional profiles could be applied to the sorting of mutants based on perturbations in distinct signaling pathways. This strategy does not require optimal microarray probe design or even that the probes identify known genes. The microarray elements must serve as molecular markers, providing detectable signals and behaving independently. Moreover, complete coverage of all genes by the technology is not critical, as long as the genes that are represented provide enough resolution for diagnosis or identification. Every informative array element or probe will provide an additional dimension for the analysis; for maximum resolution and significance, these probes should outnumber the distinct pathways or mutants under analysis.
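
As a sketch of how mutants might be sorted by expression profile, the following example implements a nearest-centroid classifier, one of the simplest class prediction methods. It is offered only as an illustration under stated assumptions: each profile is a list of log ratios for the same ordered set of array elements, the class labels and values are hypothetical, and no claim is made that this is the method used in any study cited here.

```python
from math import dist  # Euclidean distance, available in Python 3.8 and later


def class_centroids(training):
    """Average the profiles of each class, element by element.

    training maps a class label to a list of equal-length expression profiles.
    """
    return {label: [sum(values) / len(profiles) for values in zip(*profiles)]
            for label, profiles in training.items()}


def classify(profile, centroids):
    """Assign a profile to the class whose centroid is closest."""
    return min(centroids, key=lambda label: dist(profile, centroids[label]))


# Hypothetical log-ratio profiles over three array elements for mutants
# known to be perturbed in one of two signaling pathways.
training_profiles = {
    "pathway_1_mutant": [[2.0, 0.1, 0.0], [1.8, -0.1, 0.2]],
    "pathway_2_mutant": [[0.0, 1.9, -2.0], [0.2, 2.1, -1.8]],
}
centroids = class_centroids(training_profiles)
predicted = classify([1.9, 0.0, 0.1], centroids)  # an uncharacterized mutant
```

Note that the array elements serve purely as markers in this scheme: the classifier never needs to know which genes they represent, only that their signals differ reproducibly between classes.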


Application of Technologies to Diverse Genotypes

Natural variation in gene expression levels between closely related plant varieties can be treated as a genetic polymorphism. Microarrays or other methods can be used to describe patterns of gene expression among individuals in a mapping population. Each pattern constitutes a molecular phenotype. Transcript abundance levels differing in the parents of a mapping population and segregating among the progeny can be mapped and characterized as quantitative traits (for review, see Cheung and Spielman, 2002). These expression profiles may be more easily interpreted or quantified than some visible phenotypes. Differences in expression of a given gene may result either from allelic differences in its promoter or from effects of distal regulatory loci. In both cases, the variation is due to genetic differences that can be subjected to genetic analysis. In parallel, the individuals in the population can be genotyped using standard molecular techniques. With molecular phenotypic and genotypic data, expression level differences can be mapped using approaches based on quantitative traits, and with these data, quantitative phenotypic measurements may be associated with genetic markers (Jansen and Nap, 2001). Accessions of Arabidopsis are rich in genetic variation for many traits (Alonso-Blanco and Koornneef, 2000), and the analysis of this natural variation using quantitative methods may provide more insight into plant signaling and gene function than classical mutagenesis studies. This is because of the complexity of variation found between ecotypes and because variation in the genetic background may increase the penetrance of certain weak alleles or promote novel phenotypes resulting from gene interactions. Another important point is that alterations in the transcriptional activity of a gene may have more significant effects than polymorphisms that alter the protein sequence. 
Substantial variation in gene expression has been demonstrated between primate species and among fish populations (Enard et al., 2002; Oleksiak et al., 2002), suggesting that natural selection may act at least as effectively on transcriptional differences as on translational differences. In plants, most such studies will first be carried out in Arabidopsis due to the experimental advantages of this model plant; there is little doubt that gene expression analysis ultimately will be used to characterize and to map complex phenotypes in many plant species.
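
The idea of treating transcript abundance as a quantitative trait can be illustrated with a single-marker association scan. The sketch below is a deliberately simplified assumption-laden example: it supposes a population of homozygous lines scored as carrying parental allele "A" or "B" at each marker, compares the expression of one gene between the two genotype classes with Welch's t statistic, and reports the marker with the largest absolute statistic. The line names, marker names, and values are hypothetical, and real analyses would use interval mapping and significance thresholds appropriate to the population.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(group_a, group_b):
    """Welch's t statistic for two groups of expression values (>= 2 each)."""
    standard_error = sqrt(variance(group_a) / len(group_a)
                          + variance(group_b) / len(group_b))
    difference = mean(group_a) - mean(group_b)
    if standard_error == 0.0:
        if difference == 0.0:
            return 0.0
        return float("inf") if difference > 0 else float("-inf")
    return difference / standard_error


def best_marker(expression, genotypes):
    """Return (marker, t) for the marker most associated with expression.

    expression maps line -> transcript level for one gene;
    genotypes maps marker -> {line: "A" or "B"}.
    Markers with fewer than two lines in either class are skipped.
    """
    best = None
    for marker, calls in genotypes.items():
        group_a = [expression[line] for line, allele in calls.items()
                   if allele == "A" and line in expression]
        group_b = [expression[line] for line, allele in calls.items()
                   if allele == "B" and line in expression]
        if len(group_a) < 2 or len(group_b) < 2:
            continue
        t = welch_t(group_a, group_b)
        if best is None or abs(t) > abs(best[1]):
            best = (marker, t)
    return best


# Hypothetical data: expression of one gene in six lines and two markers.
EXPRESSION = {"L1": 10.0, "L2": 11.0, "L3": 9.5, "L4": 4.0, "L5": 5.0, "L6": 4.5}
GENOTYPES = {
    "m1": {"L1": "A", "L2": "A", "L3": "A", "L4": "B", "L5": "B", "L6": "B"},
    "m2": {"L1": "A", "L2": "B", "L3": "A", "L4": "B", "L5": "A", "L6": "B"},
}
top_marker, top_t = best_marker(EXPRESSION, GENOTYPES)
```

If the associated marker lies at the position of the gene itself, the variation is consistent with an allelic difference in or near the gene, such as in its promoter; an association with a marker elsewhere in the genome points instead to a distal regulatory locus.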

Which technology platforms will be used for studies of natural variation in gene expression? All of the platforms described above will measure variation in expression, but some will also be sensitive to genotypic differences that could interfere with measurements of expression. For example, the oligos used in some microarray platforms are short enough to be sensitive to sequence polymorphisms within the homologous region of the transcript. The short probes (25-base oligos) used on Affymetrix arrays will be most sensitive to single nucleotide polymorphisms (SNPs); a single base difference over the length of the oligo is enough to substantially diminish hybridization. Because Affymetrix uses 10 or more probes for each gene, differences in hybridization intensity among the probes may be attributed to genomic polymorphisms. In fact, some research groups have exploited this property using labeled genomic DNA to identify SNPs or insertion/deletion events. An early and elegant study demonstrated polymorphic hybridization to Affymetrix microarrays due to strain-specific differences in yeast (Saccharomyces cerevisiae; Winzeler et al., 1998). Borevitz et al. (2003) used Affymetrix arrays to assess the polymorphisms in the Landsberg ecotype of Arabidopsis by hybridization of genomic DNA to the array designed from the Columbia genome. In contrast to the 25-mer oligos, long oligos (70-mers) are more tolerant to polymorphisms, presumably because the additional nucleotides provide greater stability. This has been demonstrated in experiments using RNA from Arabidopsis thaliana, Arabidopsis arenosa, and Brassica oleracea (Lee et al., 2004). Whole-genome long-oligo arrays could be used to analyze gene expression in a wide variety of related species with smaller genotypic effects on hybridization. However, this reduced sensitivity to SNPs also means that long-oligo microarrays will not be useful for distinguishing expression levels of alleles or closely related gene families.


Measurement of Allele-Specific Differences

Beyond simply measuring expression level differences among homozygous inbred lines, an additional challenge for gene expression technologies will be to characterize and quantify subtle allele-specific differences in expression at heterozygous loci. Hybrid vigor is a well-characterized but poorly understood trait that is important to modern agriculture; one possible explanation for hybrid vigor is transgressive variation in expression. Expression differences for a particular allele in a hybrid compared with the parental lines result either from imprinting (Oakey and Beechey, 2002; a cis effect) or trans-acting regulatory elements encoded in the two genomes. Imprinting is generally associated with monoallelic expression (Oakey and Beechey, 2002), so biallelic nonparental expression is indicative of trans-acting regulation of expression. To put it differently, the promoter and other adjacent regulatory elements for a given allele are identical in the F1 hybrid and parental lines, so any differences in expression for a specific allele between an inbred parent and the F1 hybrid must result from the interchromosomal effects in the hybrid. Similar intergenome effects may alter gene expression patterns in polyploids (Osborn et al., 2003). Draft sequences of rice indica and japonica varieties have been published (Goff et al., 2002; Yu et al., 2002), and these data create a unique opportunity for large-scale measurements of differential expression in closely related varieties and hybrids, because the sequence of alleles from each variety will be known and may be used for measurements of allele-specific expression levels.

Sequence-based measurements of gene expression such as LongSAGE or MPSS are sensitive to single nucleotide polymorphisms and therefore could be used to globally quantify allele-specific expression. However, the sequence of both alleles must be known to ensure a specific match for the tag. For microarrays, a priori knowledge of SNP locations enables the use of short oligonucleotides, such as those present on the Affymetrix arrays, to measure differential expression between alleles. This type of analysis was performed using human genes and demonstrated that a significant proportion of the alleles that were examined were differentially expressed (Lo et al., 2003). Differential display-type methods, which distinguish genes based on restriction site polymorphisms, can be used to screen for allele-specific expression (Hagiwara et al., 1997); these approaches are advantageous when the sequence of one or both alleles is unknown. Differential display has been used to identify allele-specific differences in expression for small numbers of maize genes (Guo et al., 2003). Whole-genome analyses of allele-specific expression in plants will require gene sequences from multiple varieties and may require specialized microarrays that detect SNPs to distinguish alleles.
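
As a rough illustration of how tag-based data could be used for allele-specific quantification, the sketch below counts exact matches to the two allelic versions of a tag in a library from a heterozygous sample. It assumes that both allele sequences are known and that the SNP falls within the tag. The 17-base tag sequences, the library, and the function name are hypothetical.

```python
from collections import Counter


def allele_specific_counts(tags, allele_a_tag, allele_b_tag):
    """Count exact matches to each allelic tag and return
    (count_a, count_b, fraction_a). fraction_a is None when neither
    allele was observed."""
    if allele_a_tag == allele_b_tag:
        # If the SNP lies outside the tag, the two alleles cannot be separated.
        raise ValueError("allelic tags are identical; the tag is not informative")
    counts = Counter(tags)
    count_a = counts[allele_a_tag]
    count_b = counts[allele_b_tag]
    total = count_a + count_b
    fraction_a = count_a / total if total else None
    return count_a, count_b, fraction_a


# Hypothetical tags that differ at a single position (the SNP).
tag_a = "GATCAAGTCCTTGACAG"
tag_b = "GATCAAGTCCTTGACTG"
unrelated = "GATC" + "T" * 13  # a tag from some other transcript

# Hypothetical tag library from an F1 hybrid.
library = [tag_a] * 30 + [tag_b] * 10 + [unrelated] * 5

count_a, count_b, fraction_a = allele_specific_counts(library, tag_a, tag_b)
print(count_a, count_b, fraction_a)  # 30 10 0.75
```

Exact matching is what makes the approach allele-specific, and it is also why the sequences of both alleles must be known in advance: a tag from an unsequenced allele would simply go uncounted.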


    FUTURE DIRECTIONS
 
Eventually it may be possible to perform global expression profiling experiments on single plant cells. Attempts have been made to do this for human cancer cells (Klein et al., 2002). LCM can isolate RNA from a single cell, which can then be amplified by a linear method into sufficient probe for an array experiment. Few technologies exist to precisely measure single-cell transcription without amplification. One recent report used oligomer DNA probes tagged with fluorophores to detect RNAs by fluorescence in situ hybridization (Levsky et al., 2002). However, this analysis was limited to 11 genes. These experiments suggested that gene expression is stochastic and that a single sampled cell may have properties highly divergent from the average (Levsky et al., 2002; Levsky and Singer, 2003). Studies of bacterial colonies also show substantial stochasticity in gene expression, suggesting that, for biological reasons, substantial noise will be inherent in any measure of gene expression (Elowitz et al., 2002). This makes it important to pool cells of a type and to compare multiple samples to understand their average or typical behavior. This could be done using LCM or flow sorting, as described above; it is relatively easy to collect samples of hundreds of cells of one type by LCM, as long as the histological preparation makes them visible and accessible.
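
The value of pooling can be shown with a small simulation. This is only a sketch under simplifying assumptions: single-cell expression levels are drawn independently from a Gaussian distribution truncated at zero, which real cells need not follow, and the mean, standard deviation, and pool sizes are arbitrary. Under these assumptions, the spread among pooled samples shrinks roughly with the square root of the number of cells per pool.

```python
import random
from statistics import mean, pstdev


def simulate_pool_means(pool_size, n_pools, rng, mean_level=100.0, cell_sd=50.0):
    """Simulate n_pools pooled samples. Each pooled measurement is the
    average of pool_size single-cell expression levels."""
    pools = []
    for _ in range(n_pools):
        # Truncate at zero because expression levels cannot be negative.
        cells = [max(0.0, rng.gauss(mean_level, cell_sd)) for _ in range(pool_size)]
        pools.append(mean(cells))
    return pools


rng = random.Random(0)  # fixed seed so the simulation is repeatable

# Spread among samples when each sample is a single cell.
spread_single = pstdev(simulate_pool_means(1, 200, rng))
# Spread among samples when each sample pools 100 cells.
spread_pooled = pstdev(simulate_pool_means(100, 200, rng))

print(f"single cells: {spread_single:.1f}, pools of 100: {spread_pooled:.1f}")
```

Replicate pools remain necessary even when each pool is large, because the simulation captures only cell-to-cell noise and not variation between plants, preparations, or hybridizations.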

Because existing transcriptional profiling methods require the physical disruption of tissues and cells, gene expression is measured only at discrete time points. Ideally, future technologies should monitor transcripts in situ and in real time for the duration of a treatment or developmental phase. The technique mentioned above using labeled DNA probes and FISH permits this type of analysis (Levsky et al., 2002). However, significant advances will be required to make this more practical and to enable large-scale measurements of transcriptional activity.

Intriguing advances in DNA and protein detection are being made with nanoparticles. The laboratory of Chad