Plant Physiology 133:475-481 (2003) © 2003 American Society of Plant Biologists
DNA Sequence-Based "Bar Codes" for Tracking the Origins of Expressed Sequence Tags from a Maize cDNA Library Constructed Using Multiple mRNA Sources¹
Department of Agronomy (F.Q., T.-J.W., P.S.S.) and Mathematics (D.A.A.), Interdepartmental Graduate Program in Bioinformatics and Computational Biology (L.G.), Interdepartmental Genetics Graduate Programs (F.L.), Center for Plant Genomics (P.S.S.), Iowa State University, Ames, Iowa 50011
To enhance gene discovery, expressed sequence tag (EST) projects often make use of cDNA libraries produced using diverse mixtures of mRNAs. As such, expression data are lost because the origins of the resulting ESTs cannot be determined. Alternatively, multiple libraries can be prepared, each from a more restricted source of mRNAs. Although this approach allows the origins of ESTs to be determined, it requires the production of multiple libraries. A hybrid approach is reported here. A cDNA library was prepared using 21 different pools of maize (Zea mays) mRNAs. DNA sequence "bar codes" were added during first-strand cDNA synthesis to uniquely identify the mRNA source pool from which individual cDNAs were derived. Using a decoding algorithm that included error correction, it was possible to identify the source mRNA pool of more than 97% of the ESTs. The frequency at which a bar code is represented in an EST contig should be proportional to the abundance of the corresponding mRNA in the source pool. Consistent with this, all ESTs derived from several genes (zein and adh1) that are known to be exclusively expressed in kernels or preferentially expressed under anaerobic conditions, respectively, were exclusively tagged with bar codes associated with mRNA pools prepared from kernel and anaerobically treated seedlings, respectively. Hence, by allowing for the retention of expression data, the bar coding of cDNA libraries can enhance the value of EST projects.
To exploit the power of functional genomic technologies (e.g. microarrays, proteomics, and reverse genetics) in a particular species, it is desirable to have available a large collection of genes from that species. The sequencing of random cDNAs, i.e. an expressed sequence tag (EST) approach, is an attractive method for the high-throughput discovery of genes in organisms with complex genomes. However, to fully explore the gene space of an organism, EST-based gene discovery projects must overcome the challenge that genes are differentially expressed. Specifically, many genes are expressed only in specific tissues, organs, developmental states, genotypes, or under particular environmental conditions. Because of this, EST projects often make use of cDNA libraries produced using mixtures of mRNAs isolated from multiple sources. This approach, however, suffers from the disadvantage that expression data are lost because the origins of the resulting ESTs cannot be determined. Another approach is to prepare multiple libraries, each from a fairly uniform source of mRNA (e.g. one organ at a particular developmental stage). Although this approach allows the origins of ESTs to be determined, the production of multiple libraries requires a great deal of labor and time. We report here an alternative approach for conducting EST-based gene discovery. To maximize the representation and complexity of cDNAs and thereby facilitate gene discovery, multiple sources of maize (Zea mays) mRNAs were pooled to construct a single cDNA library. Distinct 6-bp "bar codes" were added to the 3' ends of each mRNA source during first-strand cDNA synthesis. It was possible to identify the source mRNA pool of more than 97% of the ESTs from this library.
Library Construction and EST Sequencing
A cDNA library was prepared using a complex mixture of mRNAs from the maize inbred line B73. To maximize the gene representation in this library and to thereby facilitate gene discovery, mRNA samples were extracted from 60 different plant samples that included various organs, at various stages of development, and that had been subjected to various treatments. These mRNA samples were grouped into 21 pools (Table I). First-strand cDNA synthesis was conducted on each pool using unique NotI/oligo(dT) primers that differed by the inclusion of unique 6-bp DNA sequence bar codes embedded between the NotI cloning site and (dT)18. First-strand cDNAs were pooled and used to construct a single cDNA library (ISUM6) that contained approximately 1.15 × 10^6 clones. On the basis of double restriction enzyme digestion analysis of 96 random clones, the average length of cDNA inserts is 850 to 900 bp, and the frequency of empty vectors is 2% (data not shown).
Sequencing reactions were performed on 5,184 cDNA clones from library ISUM6 using a primer that provides data from the 3' end of the cDNAs. Of these attempts, 3,684 (71%) resulted in EST sequences that included a poly(T) tail and more than 200 bp of high-quality, non-vector sequence. These ESTs have been deposited in GenBank as GenInfo Identifier nos. 18177912 to 18181595.
A method to decipher bar codes was developed so that the mRNA source pool from which an individual EST was derived could be ascertained (see "Materials and Methods"). Of the 3,684 sequences that were passed to this decoding algorithm, 3,531 (95.8%) had exact bar code matches, 70 (1.9%) had errors in their bar codes that were decodable, and 83 (2.3%) were not decodable (see "Materials and Methods"). Hence, the origins of more than 97% of the ESTs from this cDNA library could be determined. The distribution of the bar codes among the ESTs is provided in Table I. Even though efforts were made to use equal amounts of first-strand cDNA from each of the 21 mRNA source pools for the construction of the cDNA library, there are approximately 3-fold differences in the representation of pools within this collection of ESTs. This could be a consequence of differences in the quality of the pools of first-strand cDNAs and/or errors in measuring the concentration of first-strand cDNAs in these pools.
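The decoding step can be illustrated with a minimal Python sketch. It is not the authors' implementation: the three bar codes and pool labels are hypothetical (the real assignments are in Table I), and the rule of accepting a read only when exactly one bar code lies within a single edit is an assumption chosen to match the single-error correction reported here. The sketch also assumes the observed bar code has already been excised from between the NotI site and the poly(T) tail.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


# Hypothetical bar codes and pool labels, for illustration only.
BARCODES = {"ACGCAG": "pool A", "GTCAGC": "pool B", "CAGAGA": "pool C"}


def decode(observed: str, barcodes: dict) -> tuple:
    """Return (pool, status); pool is None when the read cannot be assigned."""
    if observed in barcodes:
        return barcodes[observed], "exact"
    # Assumed correction rule: accept only an unambiguous single-edit neighbor.
    near = [pool for code, pool in barcodes.items()
            if edit_distance(observed, code) == 1]
    if len(near) == 1:
        return near[0], "corrected"
    return None, "undecodable"


# Counts reported in the text: exact matches, corrected, and total ESTs.
decoded_fraction = (3531 + 70) / 3684

print(decode("ACGCAG", BARCODES))  # exact match
print(decode("ACGGAG", BARCODES))  # one substitution
print(decode("ACGAG", BARCODES))   # one deletion
print(decode("TTTTTT", BARCODES))  # too far from every bar code
print(round(decoded_fraction, 3))
```

Requiring a unique neighbor means that a read equally close to two bar codes is left unassigned rather than guessed, which is the conservative choice when pool assignments feed into expression estimates.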
Unlike EST projects that are composed of 5' sequences from a variety of genetic backgrounds, it is possible for 3' ESTs, all of which are from the same genetic background (the inbred B73), to be assembled into a set of unique sequence clusters (i.e. genes) with a high degree of confidence. Using CAP3 (Huang and Madan, 1999
Because library ISUM6 was not normalized, the frequencies at which particular bar codes appear within a contig should correspond to the relative expression levels of the corresponding gene in the 21 pools of mRNA. The numbers of bar codes detected within each of the 483 EST contigs are shown in Table III. Approximately 12% of the EST contigs are derived from a single mRNA source pool. This is almost certainly an overestimate of the number of maize genes that are expressed in a single mRNA source, because many of the EST contigs in this study are not large enough to adequately sample the expression space.
The utility of using bar codes to extract expression data can, however, be confirmed by the analysis of several of the larger EST contigs. The distribution of bar codes among ESTs that comprise the 20 contigs with 10 or more members is shown in Table IV. All of the ESTs in two contigs (297 and 305) that have decipherable bar codes have the bar code corresponding to the kernel mRNA pool (bar code 3). On the basis of BLASTX analysis both of these contigs are derived from genes that encode proteins (i.e. zeins) that accumulate predominately, if not exclusively, in kernel endosperms (Woo et al., 2001
The largest contig (196) consists of 135 ESTs and is derived from a gene that encodes a metallothionein-like protein (Table IV). The distribution of bar codes associated with ESTs in contig 196 was used to examine the expression pattern of this metallothionein-like gene. The numbers of ESTs in this contig that were isolated from the various mRNA pools differ from those expected based on the distribution of bar codes in the entire EST collection (Fig. 1; Table I). This gene is overexpressed in mRNA pools 18 and 20 to 21 and is underexpressed in pools 2 to 3, 6, 8 to 10, and 13. Hence, this gene is apparently up-regulated in seedlings treated with the plant hormones brassinolide, GA3, and jasmonic acid; down-regulated in the presence of cycloheximide; and not well expressed in mature tissues, kernels, and female reproductive structures.
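The comparison of observed and expected bar code counts can be sketched as follows. The pool labels and counts are invented for illustration and are not taken from Table I or Table IV, and the function names are hypothetical. The expected count for a pool is the contig size multiplied by that pool's share of all decoded ESTs; the ratio of observed to expected then indicates apparent over- or under-representation.

```python
def expected_counts(library_counts: dict, contig_size: int) -> dict:
    """Expected ESTs per pool if the contig mirrored the whole library."""
    total = sum(library_counts.values())
    return {pool: contig_size * n / total for pool, n in library_counts.items()}


def observed_over_expected(library_counts: dict, contig_counts: dict) -> dict:
    """Ratio > 1 suggests over-representation; ratio < 1 under-representation."""
    contig_size = sum(contig_counts.values())
    expected = expected_counts(library_counts, contig_size)
    return {pool: contig_counts.get(pool, 0) / expected[pool]
            for pool in library_counts if expected[pool] > 0}


# Invented toy numbers: ESTs per pool in the whole library and in one contig.
library = {"pool A": 100, "pool B": 300, "pool C": 600}
contig = {"pool A": 5, "pool B": 3, "pool C": 2}

for pool, ratio in observed_over_expected(library, contig).items():
    print(pool, round(ratio, 2))
```

For contigs with few members these ratios are noisy, so a formal test on the counts would be needed before drawing conclusions about any single pool.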
Five of the contigs encode proteins that are novel or are most similar to proteins that lack even a predicted functional assignment (contigs 180, 257, 264, 329, and 439). Any expression data that can be extracted from the bar-coded ESTs will provide clues as to the functions of these genes. Nine of the 20 ESTs in contig 264 carry bar code 20; this suggests that GA3 induces this gene. Five of the 19 ESTs in contig 439 carry bar code 14, suggesting that this gene is induced under anaerobic conditions. Contigs 55 and 339 encode proteins that are similar to ABA-induced proteins. Interestingly, only one of 14 and zero of 14 of the ESTs from these contigs carry the bar code (19) that is associated with the mRNA pool from ABA-treated seedlings. Because these rates are less than the rate of ESTs with this bar code in the entire library (3.5%, Table I), it appears that the genes defined by these contigs are not induced by ABA, at least at the level of mRNA accumulation under the induction conditions used in this study. In contrast, the gene associated with contig 111, which encodes a protein similar to one induced by drought, does appear to be overexpressed in adventitious roots, in that five of 23 ESTs from this contig are derived from this mRNA pool.
Analyses of the bar-coded ESTs generated by this project support the prior observation that the protein synthesis inhibitor cycloheximide deregulates gene expression (Koshiba et al., 1995
The bar codes associated with unpollinated ears (bar code 9) and ear shanks (bar code 10) were observed at substantially lower than predicted rates (Table IV; Fig. 1). These results suggest that these structures have reduced expression of the 20 genes that were most prevalent in this collection of ESTs. Libraries prepared from these structures might therefore be ideal subjects for further gene discovery.
Application of Bar Codes to EST Projects
Data mining tools have been developed to extract gene expression data from EST databases (Zhang et al., 1997
Here, we describe a method to tag cDNAs from different mRNA pools with unique DNA sequence bar codes before the preparation of a library and the means to interpret the resulting data. In a non-normalized cDNA library, the frequency at which a bar code is represented in an EST contig should be proportional to the abundance of the corresponding mRNA species in the source mRNA pool. The utility of this approach was established by demonstrating that all of the ESTs derived from zein and adh1 genes, which are preferentially expressed in kernels and under anaerobic stress, respectively, were exclusively tagged with bar codes associated with mRNA pools prepared from kernels and anaerobically treated seedlings, respectively. Although only a few thousand ESTs were sequenced in this study, analysis of the largest EST contig provided quantitative data concerning the expression of a metallothionein gene. Data were also obtained that suggest that several novel genes are up-regulated by the plant hormone GA3 or anaerobic stress. In addition, several EST contigs that exhibit similarity to putative ABA-responsive genes do not exhibit strong evidence of ABA induction at the level of mRNA accumulation. The utility of bar-coded EST data increases in proportion to the size of the data set. Hence, EST data would be substantially more useful if the bar coding of cDNA libraries were widely adopted. Significantly, this could be achieved for little additional cost.
Although bar codes have been used previously in the construction of cDNA libraries (Bonaldo et al., 1996
The natural metric for the design of DNA bar codes is the edit metric (Gusfield, 1997
In the current study, the rates of single mutations and multiple mutations in bar codes were 1.9% and 2.3%, respectively. Because it was possible to correct single mutations, mRNA source data were unavailable for only 2.3% of the ESTs. The rate of uncorrectable errors might, however, be higher in other bar-coded libraries. This is because the error rate is likely to depend on a number of factors, e.g. the quality of the oligonucleotides used for first-strand cDNA synthesis, the Escherichia coli host strain in which the library is propagated, and the DNA sequencing protocol. It might therefore be desirable to use in the future a set of bar codes that would allow for the correction of two errors, i.e. that are at least five edits apart. If the length of bar codes is increased by just 2 bp (to 8 bp), it is possible to design 34 unique bar codes that meet this criterion (Ashlock et al., 2002
Sources of mRNAs
Sixty tissue samples that included different stages of development, organs, and various treatments were collected from the maize (Zea mays) inbred line B73. Before mRNA extraction, RNA samples were grouped into 21 pools (Table I). Pool 1 consisted of RNAs from germinated seeds and seedlings grown in paper rolls and collected 1, 2, 8, and 11 d after planting. Pool 2 contained RNAs from a mixture of tissues from field-grown plants 17, 21, 38, 69, and 77 d after planting. Pool 3 consisted of RNAs from kernels collected 3, 5, 10, 15, 20, 25, and 30 d after pollination. Pool 4 consisted of RNAs from adventitious roots collected from field-grown plants 65 d after planting. Pool 5 consisted of RNAs from tassels with lengths between 3 and 39 cm collected from plants 53 and 56 d after planting. Pool 6 consisted of RNAs from immature ears with lengths of between 0.2 and 3.0 cm collected from plants 53, 56, and 59 d after planting. Pools 7 to 10 consisted of RNAs from husks, silks, unpollinated first ears, and ear shanks collected from three plants 73 d after planting, respectively. Pool 11 consisted of RNAs from etiolated seedlings grown in paper rolls in the dark and collected 8 d after planting. Pool 12 consisted of RNAs from calli derived from immature zygotic embryos that had been tissue-cultured in medium based on N6 salts for 72 d (Songstad et al., 1991
RNAs were extracted from the 60 samples using Trizol Reagent (Invitrogen, Carlsbad, CA) and examined for RNase activity using the RNase Alert Kit (Ambion, Austin, TX). Equal amounts of the RNA samples were combined to form the 21 pools as shown in Table I. Pooled RNA samples were digested with DNase I (Invitrogen) to remove DNA contamination and precipitated using LiCl. Extraction of mRNA from these 21 RNA pools was performed using Oligotex mRNA kits (Qiagen USA, Valencia, CA). First-strand cDNAs were prepared from the 21 mRNA pools by priming with 21 distinct NotI/oligo(dT) primers that contained distinguishable bar code tags, (N)6, 5'-AAC TGG AAG AAT TCG CGG CCG CNN NNN NTT TTT TTT TTT TTT TTT T-3'. The bar code tags associated with specific pools are shown in Table I and can be used to identify the mRNA pool from which a particular cDNA clone was derived.
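The primer layout given above can be written out as a short Python sketch. The fixed 5' portion and the 18 T residues are copied from the primer sequence in the text; the function name and the example bar code are hypothetical and do not correspond to any entry in Table I.

```python
# Fixed 5' portion of the NotI/oligo(dT) primer, as given in the text
# (it contains the NotI site GCGGCCGC at its 3' end).
PRIMER_5_PRIME = "AACTGGAAGAATTCGCGGCCGC"
OLIGO_DT = "T" * 18  # the (dT)18 tract that anneals to the poly(A) tail


def build_primer(barcode: str) -> str:
    """Assemble 5'-[fixed region][6-bp bar code][(dT)18]-3'."""
    if len(barcode) != 6 or set(barcode) - set("ACGT"):
        raise ValueError("bar code must be exactly 6 bases drawn from A, C, G, T")
    return PRIMER_5_PRIME + barcode + OLIGO_DT


primer = build_primer("ACGCAG")  # hypothetical bar code
print(primer, len(primer))
```

Because the bar code sits immediately between the NotI site and the oligo(dT) tract, it can be located in a 3' read by finding those two flanking features.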
Bar codes are potentially subject to mutations such as insertions, deletions, and substitutions during primer synthesis and subsequent in vitro and in vivo manipulations that can confuse the origins of clones that have nearly identical bar codes. An edit-distance lexicode algorithm (Ashlock et al., 2002
The lexicode algorithm for locating error-correcting codes, which in this case are collections of DNA bar codes, is as follows. The members of a set of potential bar codes, e.g. all length-6 DNA words, are placed in alphabetical order. The algorithm is given a minimum pairwise edit distance d that must exist between any two bar codes. An empty set of bar codes B is initialized. Traversing the list of potential bar codes in order, a word is added to B if it is at least d edits from every word already in B. Because it takes the next possible word, the lexicode algorithm is an example of a greedy algorithm.
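The greedy procedure just described can be sketched in a few lines of Python. This is an illustration of the unmodified lexicode algorithm only, without seeds or biological constraints, and it is not the authors' code; the function names are hypothetical. Edit distance is computed here as the standard Levenshtein distance.

```python
from itertools import product


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: fewest insertions, deletions, and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


def lexicode(length: int, min_dist: int, alphabet: str = "ACGT") -> list:
    """Greedy lexicode: scan words alphabetically, keep each one that is at
    least min_dist edits from every word already kept."""
    code = []
    for letters in product(sorted(alphabet), repeat=length):
        word = "".join(letters)
        if all(edit_distance(word, kept) >= min_dist for kept in code):
            code.append(word)
    return code


# Small demonstration: length-4 words, minimum pairwise edit distance 3.
demo = lexicode(4, 3)
print(len(demo), demo[:3])
```

Calling `lexicode(6, 3)` applies the same procedure to all 4,096 length-6 words. As a general coding-theory fact, a minimum pairwise distance of d allows correction of up to floor((d - 1) / 2) errors, so d = 3 permits correction of a single edit and d = 5 of two.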
The unmodified lexicode algorithm does not locate maximal-size bar code sets for a given length and minimum distance. The algorithm can be modified by handing it a nonempty set B of initial words that are in the bar code set by fiat. Such initial sets of included bar codes are called seeds. Most seeds yield smaller codes than those found by the unmodified algorithm; some yield larger codes. An evolutionary algorithm is used to search for three-member seeds that yield larger codes. This algorithm acts on a population of seeds in a manner analogous to biological evolution (for details, see Ashlock et al., 2002
The algorithm requires one additional modification to be used to produce embeddable bar codes. Bar codes located with the modified lexicode algorithm as given do not respect restriction sites for enzymes or other biological constraints. At any point where a potential bar code is checked for minimum distance from words already in the code, or when words in seeds are chosen, the words are also checked for compliance with biological constraints. Various biological restrictions were considered in the design of the bar codes. Bar codes were not accepted that ended in T, contained the strings TT or AAA, or contained EcoRI (GAATTC) or NotI (GCGGCCGC) restriction enzyme sites. Following these rules, 21 unique bar codes were generated to label the 21 mRNA pools (Table I).
Approximately equal amounts of first-strand cDNA from each pool were combined and used as templates for DNA PolI-catalyzed second-strand synthesis. After the addition of EcoRI adapters, double-stranded cDNAs were digested with NotI. Molecules between 0.5 and 2.0 kb were directionally cloned into the EcoRI and NotI sites of the pSlip7 expression vector (F. Liu and P.S. Schnable, unpublished data; GenBank accession no. AY217101). Plasmid DNA isolated from the resulting library was digested with NotI to remove empty vector clones. Linear DNA molecules of between 5.4 and 7 kb were gel purified and self-ligated at low concentration to promote recircularization. Ligation products were precipitated and transformed into DH10B host cells.
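Returning to the bar code design rules stated above (no terminal T, no TT or AAA, and no EcoRI or NotI sites), the compliance check can be sketched in Python as below. Two points are assumptions rather than statements from the text: the restriction-site rule is interpreted as "the bar code must not create an additional site when embedded in the primer", because a 6-bp word cannot by itself contain the 8-bp NotI site, and the function names are hypothetical. The flanking sequences are taken from the primer given in the text.

```python
ECORI = "GAATTC"
NOTI = "GCGGCCGC"
PRIMER_5_PRIME = "AACTGGAAGAATTCGCGGCCGC"  # fixed 5' region, from the text
OLIGO_DT = "T" * 18                        # (dT)18 tract, from the text


def count_overlapping(seq: str, site: str) -> int:
    """Count occurrences of site in seq, including overlapping ones."""
    return sum(1 for i in range(len(seq) - len(site) + 1)
               if seq.startswith(site, i))


def is_allowed(barcode: str) -> bool:
    """Apply the stated sequence rules, then the assumed embedded-site check."""
    if barcode.endswith("T"):
        return False  # a terminal T would blur the boundary with the oligo(dT) tract
    if "TT" in barcode or "AAA" in barcode:
        return False
    embedded = PRIMER_5_PRIME + barcode + OLIGO_DT
    # The fixed region already carries one EcoRI and one NotI site;
    # reject bar codes that would introduce another copy of either.
    return (count_overlapping(embedded, ECORI) == 1
            and count_overlapping(embedded, NOTI) == 1)


for candidate in ["ACGCAG", "ACGCAT", "ACTTGC", "CAAAGC", "GGCCGC"]:
    print(candidate, is_allowed(candidate))
```

The last candidate, GGCCGC, passes the simple string rules but completes a second NotI site with the GC at the end of the fixed region, which is why the check is made on the embedded sequence. In a constrained lexicode search, a filter of this kind would be applied to each word before the distance test.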
Plasmid DNAs of cDNA clones from the ISUM6 library were isolated in a 96-well format using a modified alkaline lysis method adapted from one provided by the Clemson University Genomic Institute (http://www.genome.clemson.edu). Sequencing reactions were conducted using 4 µL (0.5 µg) of plasmid DNA, 2 µL of BigDye Version2 mix (Applied Biosystems, Foster City, CA), 2 µL of 5x sequencing reaction buffer (10 mM MgCl2 and 0.4 M Tris, pH 9.0), and 1 µL of 3.2 µM universal primer (GTAAAACGACGGCCAGT), with the following PCR program on a PTC-225 Tetrad thermal cycler (MJ Research, Waltham, MA): 25 cycles of 96°C for 30 s, 50°C for 15 s, and 60°C for 4 min. Unincorporated dye terminators were removed from the sequencing reactions using Sephadex G-50 columns. Sequencing reactions were subjected to electrophoresis on an ABI PRISM 3700 DNA analyzer at the Iowa State University DNA Sequence and Synthesis Facility.
Base calling was performed by using Phred (Ewing et al., 1998
The position of the vector-NotI/cDNA boundary was determined for each EST using the output from the Lucy software (Chou and Holmes, 2001
Upon request, all novel materials described in this publication will be made available in a timely manner for noncommercial research purposes, subject to the requisite permission from any third-party owners of all or parts of the materials. Obtaining any permissions will be the responsibility of the requestors.
We thank Diane Sickau, Regan Slonecker, Erin Archer, Luke Holst, Ka-Wai Leong, and Shu-Ting Tsao for technical assistance with EST sequencing; Olga Alechina for help isolating RNAs; Dr. Frank Hochholdinger for preparing and collecting hormone-treated seedlings; and the Iowa State University Plant Transformation Facility for providing maize callus tissue.
Received April 16, 2003; returned for revision May 19, 2003; accepted June 29, 2003.
www.plantphysiol.org/cgi/doi/10.1104/pp.103.025015.
1 This research was supported by the National Science Foundation Plant Genome Program (grant no. DBI-9975868 to P.S.S., D.A.A., and others) and by the Iowa Corn Promotion Board (to P.S.S.). This is journal paper number J-19730 of the Iowa Agriculture and Home Economics Experiment Station (Ames; project no. 3409, supported by Hatch Act and State of Iowa funds).
2 These authors contributed substantially to this report. F.Q. constructed the bar-coded cDNA library. L.G. conducted the bioinformatic analysis of the EST data and extracted expression data from the bar-coded ESTs.
3 Present address: Department of Statistics, Iowa State University, Ames, IA 50011.
4 Present address: Department of Biological Chemistry, University of California at Irvine, Irvine, CA 92697.
* Corresponding author; e-mail schnable{at}iastate.edu; fax 515-294-5256.
Altschul SF, Gish W, Miller W, Myers EW, Lipman DJ (1990) Basic local alignment search tool. J Mol Biol 215: 403-410
Ashlock D, Guo L, Qiu F (2002) Greedy closure genetic algorithms. In Fogel DB, ed, Proceedings of the 2002 Congress on Evolutionary Computation CEC2002. IEEE Press, Piscataway, NJ, pp 1296-1301
Bailey-Serres J, Dawe RK (1996) Both 5' and 3' sequences of maize adh1 mRNA are required for enhanced translation under low-oxygen conditions. Plant Physiol 112: 685-695
Bonaldo MF, Lennon G, Soares MB (1996) Normalization and subtraction: two approaches to facilitate gene discovery. Genome Res 6: 791-806
Chou HH, Holmes M (2001) DNA sequence quality trimming and vector removal. Bioinformatics 17: 1093-1104
Ewing B, Hillier L, Wendl MC, Green P (1998) Base-calling of automated sequencer traces using phred: I. Accuracy assessment. Genome Res 8: 175-185
Gusfield D (1997) Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology. Cambridge University Press, UK
Huang XQ, Madan A (1999) CAP3: a DNA sequence assembly program. Genome Res 9: 868-877
Koshiba T, Ballas N, Wong LM, Theologis A (1995) Transcriptional regulation of PS-IAA4/5 and PS-IAA6 early gene expression by indoleacetic acid and protein synthesis inhibitors in pea (Pisum sativum). J Mol Biol 253: 396-413
Lal A, Lash AE, Altschul SF, Velculescu V, Zhang L, McLendon RE, Marra MA, Prange C, Morin PJ, Polyak K et al. (1999) A public database for gene expression in human cancers. Cancer Res 59: 5403-5407
Lemke-Keyes CA, Sachs MM (1989) Genetic variation for seedling tolerance to anaerobic stress in maize germplasm. Maydica 34: 329-337
Scheurle D, DeYoung MP, Binninger DM, Page H, Jahanzeb M, Narayanan R (2000) Cancer gene discovery using digital differential display. Cancer Res 60: 4037-4043
Schmitt AO, Specht T, Beckmann G, Dahl E, Pilarsky CP, Hinzmann B, Rosenthal A (1999) Exhaustive mining of EST libraries for genes differentially expressed in normal and tumour tissues. Nucleic Acids Res 27: 4251-4260
Songstad DD, Armstrong CL, Petersen WL (1991) AgNO3 increases Type II callus production from immature zygotic embryos of inbred B73 and its derivatives. Plant Cell Rep 9: 699-702
Wheeler DL, Chappey C, Lash AE, Leipe DD, Madden TL, Schuler GD, Tatusova TA, Rapp BA (2000) Database resources of the National Center for Biotechnology Information. Nucleic Acids Res 28: 10-14
Woo YM, Hu DW, Larkins BA, Jung R (2001) Genomics analysis of genes expressed in maize endosperm identifies novel seed proteins and clarifies patterns of zein gene expression. Plant Cell 13: 2297-2317
Zhang L, Zhou W, Velculescu VE, Kern SE, Hruban RH, Hamilton SR, Vogelstein B, Kinzler KW (1997) Gene expression profiles in normal and cancer cells. Science 276: 1268-1272