First published online September 16, 2005; 10.1104/pp.105.066134
Plant Physiology 139:869-884 (2005)
© 2005 American Society of Plant Biologists

Sorghum Expressed Sequence Tags Identify Signature Genes for Drought, Pathogenesis, and Skotomorphogenesis from a Milestone Set of 16,801 Unique Transcripts

Department of Plant Biology (L.H.P., C.L., M.S., F.S., H.W., S.P.R., M.-M.C.-P.), Center for Applied Genetic Technologies (A.R.G.), Plant Genome Mapping Laboratory (A.H.P., H.-m.M.), and Department of Statistics (X.Z.), University of Georgia, Athens, Georgia 30602; Clemson University Genomics Institute (R.W.) and Department of Plant Pathology and Physiology (R.D.), Clemson University, Clemson, South Carolina 29634; United States Department of Agriculture Agricultural Research Service, Southern Plains Agricultural Research Center, College Station, Texas 77845 (R.K.); Department of Plant and Soil Sciences, Texas Tech University, Lubbock, Texas 79409 (H.T.N.); and Department of Biochemistry and Biophysics, Texas A&M University, College Station, Texas 77843 (D.T.M., J.E.M.)
Improved knowledge of the sorghum transcriptome will enhance basic understanding of how plants respond to stresses and serve as a source of genes of value to agriculture. Toward this goal, Sorghum bicolor L. Moench cDNA libraries were prepared from light- and dark-grown seedlings, drought-stressed plants, Colletotrichum-infected seedlings and plants, ovaries, embryos, and immature panicles. Other libraries were prepared with meristems from Sorghum propinquum (Kunth) Hitchc. that had been photoperiodically induced to flower, and with rhizomes from S. propinquum and johnsongrass (Sorghum halepense L. Pers.). A total of 117,682 expressed sequence tags (ESTs) were obtained representing both 3' and 5' sequences from about half that number of cDNA clones. A total of 16,801 unique transcripts, representing tentative UniScripts (TUs), were identified from 55,783 3' ESTs. Of these TUs, 9,032 are represented by two or more ESTs. Collectively, these libraries were predicted to contain a total of approximately 31,000 TUs. Individual libraries, however, were predicted to contain no more than about 6,000 to 9,000, with the exception of light-grown seedlings, which yielded an estimate of close to 13,000. In addition, each library exhibits about the same level of complexity with respect to both the number of TUs preferentially expressed in that library and the frequency with which two or more ESTs are found in only that library. These results indicate that the sorghum genome is expressed in a highly selective fashion in the individual organs and in response to the environmental conditions surveyed here. Close to 2,000 differentially expressed TUs were identified among the cDNA libraries examined, of which 775 were differentially expressed at a confidence level of 98%. From these 775 TUs, signature genes were identified defining drought, Colletotrichum infection, skotomorphogenesis (etiolation), ovary, immature panicle, and embryo.
The Poaceae contains numerous species of importance to human nutrition. A thorough exploration of the transcriptome of this important plant family is an important step in understanding its fundamental biology, as well as in identifying genes that will continue to improve its agricultural productivity. Defining the transcriptome of a complex, multicellular eukaryote is, however, a daunting challenge. The two most widely used and comprehensive approaches are whole-genome sequencing coupled with application of gene prediction algorithms (Mathé et al., 2002
Among available approaches, an appropriately designed EST project offers a number of substantial advantages: (1) It most often is a much less expensive route to gene discovery than is whole-genome sequencing; (2) it offers unambiguous identification of transcribed genomic sequences; (3) it results in a cDNA resource that can serve a broad scientific community; (4) it provides at no additional cost templates suitable for cDNA-based microarray applications as well as (5) information about gene expression as a function of developmental stage, organ, and/or environmental parameters at the time plant material is harvested for RNA isolation; and (6) it can reveal information about several transcript properties, including untranslated region (UTR) structures, polyadenylation signals, and alternative splicing. Because of these and other advantages, several EST projects in commercially important plant species have been initiated (Michalek et al., 2001
The cereals are among the agriculturally most important members of the Poaceae. The extensive synteny among their genomes (Hulbert et al., 1990
We characterize and explore here 117,682 sorghum ESTs derived from approximately half that number of independent cDNAs, most of which were sequenced at both 5' and 3' ends. A Milestone set (freeze) of 16,801 unique transcripts, or tentative UniScripts (TUs), has been identified from 55,783 3' ESTs and is in use for microarray applications (Buchanan et al., 2005 All ESTs are available for examination and download at http://fungen.org/Sorghum.htm and http://cggc.agtec.uga.edu/cggc. Additional information concerning data access is provided in "Materials and Methods."
EST Characteristics
The 13 cDNA libraries from which ESTs were obtained are summarized in Table I. Three major considerations went into the choice of libraries. Some were selected to provide linkage to other EST projects (e.g. pathogen-infected plants, incompatible [PI1] and pathogen-infected plants, compatible [PIC1]; Ronning et al., 2003
Initial choices for species and genotype were made for four reasons. (1) Genotype BTx623 was selected for most libraries because it is one of the most widely used Sorghum bicolor L. Moench accessions in breeding programs and (2) has been used as one of the parents for the construction of both of the most detailed genetic maps for sorghum (Menz et al., 2002 From a total of 151,870 sequence attempts, 117,682 high-quality 3' and 5' ESTs were obtained (Table II). With the exceptions of RHIZ1, DSAF1, and DSBF1, about one-half of the clones contain full-coding-length cDNAs. Estimates for cDNAs cloned backwards from expectations range from 0.5% to 3.5%. Most libraries were sequenced to a depth of about 5,000 cDNAs. After trimming for vector, adapter, and quality, ESTs as submitted to GenBank averaged 516 and 529 nucleotides (nt) for 3' and 5' reads, respectively. The greatest number of trimmed sequences had lengths between 500 and 599 nt, with 89% exceeding 300 nt (Fig. 1). These sequences can be explored and downloaded as fasta files as described in "Materials and Methods."
Milestone TUs
Only 3' ESTs were clustered for two reasons. First, ESTs deriving from the same gene would be expected to have substantial sequence overlap. Conversely, 5' ESTs would be expected to start at different places depending upon where reverse transcription (RT) terminated. Thus, by using only 3' ESTs the much greater frequency of error associated with clustering 5' ESTs is avoided. Wang et al. (2004) The 55,783 3' ESTs clustered here identify 6,114 singletons, 1,655 contigs-of-one, and 9,032 clusters of two or more members (Table II). When a sequence is sufficiently similar to one or more other sequences, phrap attempts to assemble it with them. If phrap ultimately fails to do so, however, the sequence is designated by phrap as a contig-of-one. The identifier of a TU in this category begins with 1. A sequence that bears so little resemblance to any other sequence that no attempt is ever made to assemble it with other sequences is designated by phrap as a singleton. The identifier for this category begins with 0. While both categories contain only one EST, it can be important to be aware that those originally identified as a contig-of-one do have a strong resemblance to one or more other TUs. The identifier of a TU with two or more members begins with a 2. For simplicity, the term singleton in the following will also refer to contigs-of-one. Collectively, singletons and assemblies with two or more members will be referred to here as TUs, as already defined. The distribution of TU consensus sequence lengths is presented in Figure 1. With few exceptions, they are little more than about 100 nt longer than individual sequences (Fig. 1). The number of TUs as a function of the number of ESTs per TU indicates that very few genes are observed to be expressed at high frequency (Fig. 2). Only 42 TUs are detected at a frequency exceeding one transcript per 1,000, while only 2,158 exceed a frequency of one per 10,000.
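The identifier convention and frequency thresholds described above can be sketched in a few lines of Python. This is a minimal illustration, not the project's own pipeline: the underscore-separated identifier format is assumed from identifiers such as 2_8855 that appear later in the text, and the function names are hypothetical.

```python
from collections import Counter

# Category labels keyed by the leading digit of a TU identifier,
# following the convention described in the text.
TU_CATEGORIES = {"0": "singleton", "1": "contig-of-one", "2": "multi-member"}


def classify_tu(tu_id: str) -> str:
    """Return the category for an identifier such as '2_8855'.

    Assumption: the category digit is everything before the first underscore.
    """
    prefix = tu_id.split("_", 1)[0]
    if prefix not in TU_CATEGORIES:
        raise ValueError(f"unrecognized TU identifier: {tu_id!r}")
    return TU_CATEGORIES[prefix]


def summarize(tu_ids):
    """Count how many identifiers fall into each category."""
    return Counter(classify_tu(t) for t in tu_ids)


def min_ests_for_frequency(total_ests: int, per: int) -> int:
    """Smallest EST count in a TU whose frequency exceeds one transcript in `per`."""
    return total_ests // per + 1
```

With 55,783 3' ESTs, `min_ests_for_frequency(55783, 1000)` gives the smallest TU size that exceeds one transcript per 1,000, which is the threshold used for the 42 most abundant TUs.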
The relative coverage of this EST data set has been evaluated by BLASTn to 255,964 sugarcane, 416,090 maize, and 284,234 rice (Oryza sativa) ESTs downloaded from GenBank on September 13, 2004. The best return for each TU from each database was binned, revealing the expected inverse relationship between frequency of high-quality hits and evolutionary distance. The percentage of TUs returning an Expect value of E-100 or lower was 54.9%, 43.1%, and 11.6% for sugarcane, maize, and rice, respectively. Conversely, these percentages for Expect values >E-5 were 19.6%, 23.4%, and 35.5%, respectively. A bar chart that includes these data is presented in Supplemental Figure 1.
Because the overwhelming majority of cDNAs were randomly selected from unamplified and nonnormalized libraries, results of clustering 3' ESTs can be used to estimate the rate of discovery of new TUs as a function of the number of 3' ESTs accumulated. Because the required information has been entered into the same Oracle database that also contains the results of EST clustering, it is possible to calculate and display the number of TUs as a function of the number of 3' ESTs included in the data set (Fig. 3), to do the same for each cDNA library separately, and to do the same cumulatively, as additional libraries are added (Fig. 4). From the theoretical curve in Figure 3, obtained as described in "Materials and Methods," it is then possible to define the rate of TU discovery at any number of 3' ESTs and to extrapolate in order to obtain an estimate of the total number expected if one or more libraries were sequenced to infinite depth (Fig. 4). The rate of gene discovery remains substantial, even after sampling 55,783 cDNAs. At this point the rate of discovery of new TUs by sequencing new cDNA clones picked at random from these same libraries is predicted from the slope of the theoretical curve to be 13.6% (Fig. 3). At infinite sequencing depth, the result predicts that these libraries contain representatives of approximately 30,600 TUs (Fig. 4). Each library individually is predicted to contain representatives of no more than about 13,000 TUs, with most containing only about 6,000 to 9,000 (Fig. 4).
The richest library in terms of the maximum number of TUs predicted is that prepared from young, light-grown seedlings (LG1 in Fig. 4). TUs enhanced in their expression, however, were no more frequent in LG1 than in other libraries. This is the case whether fold induction relative to the average expression across all libraries is measured (Fig. 5), or the frequency with which TUs consisting of two or more 3' ESTs is observed in only one library is determined (Fig. 6). For each library or subgroup, fold induction is the frequency with which that library or subgroup was represented in a TU (the number of 3' ESTs in the TU from that library or subgroup divided by the total number of 3' ESTs in the library or subgroup) divided by the ratio of the total number of 3' ESTs in that TU to the total number of 3' ESTs (55,783).
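The fold-induction ratio defined above, together with the "two or more 3' ESTs from only one library" criterion, can be written out as follows. This is a sketch under stated assumptions: the dictionary layout, function names, and example counts are hypothetical and are not taken from the project database.

```python
def fold_induction(tu_counts, library_totals, library):
    """Fold induction of one TU in one library or subgroup.

    tu_counts:      dict mapping library name -> number of 3' ESTs of this TU from that library
    library_totals: dict mapping library name -> total number of 3' ESTs in that library
    """
    # Frequency with which the library is represented in the TU.
    in_library = tu_counts.get(library, 0) / library_totals[library]
    # Ratio of all 3' ESTs in the TU to all 3' ESTs in the data set.
    across_all = sum(tu_counts.values()) / sum(library_totals.values())
    return in_library / across_all


def only_library(tu_counts):
    """Return the single library supplying every EST of a TU with two or more members, else None."""
    present = [lib for lib, count in tu_counts.items() if count > 0]
    if len(present) == 1 and tu_counts[present[0]] >= 2:
        return present[0]
    return None


# Hypothetical counts for illustration only.
library_totals = {"WS1": 5000, "LG1": 5000, "DG1": 5000}
tu = {"WS1": 12, "LG1": 2, "DG1": 1}
ws1_fold = fold_induction(tu, library_totals, "WS1")
```

A value above 1 indicates that the TU is over-represented in that library relative to its average representation across all libraries.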
Hierarchical Clustering, Differentially Expressed TUs, and Signature Genes
Hierarchical clustering of 3' ESTs representing the 258 TUs with 20 or more members revealed that few of these highly expressed genes were expressed uniformly among all libraries (data not shown). Similarly, an evaluation of the 10 most abundantly expressed genes indicated that most were expressed preferentially in only a few libraries, with the three drought-related libraries (WS1, DSAF1, DSBF1) accounting for the majority of expression in half of these 10 (data not shown). To explore in greater detail the ability of this EST data set to discriminate among the different environmental conditions or plant organs from which the individual libraries were obtained, the R statistic of Stekel et al. (2000)
This subset of 775 TUs was evaluated by hierarchical clustering, yielding the result illustrated in Figure 7. The number of members in these TUs ranged from 4 to 215. The two pathogen libraries clustered as a group, as did the three drought libraries when the analysis was repeated with the inclusion of DSAF1 and DSBF1 (data not shown). FM1 and RHIZ2, both from S. propinquum, also clustered together. Individual examination of several of the TUs that identify these latter two libraries (green bar to right of heat map in Fig. 7) reveals that they most often represent genes whose orthologs in S. propinquum and S. bicolor differ enough that the ESTs derived from them were separated into different TUs.
With the further exclusion of RHIZ1, RHIZ2, and FM1, 70 TUs were selected from Figure 7 and resubmitted to hierarchical clustering. These 70 TUs consisted of 10 representing each of seven subgroups or libraries. The results identify well-defined signature TUs for each of the environmental conditions (drought, pathogenesis, skotomorphogenesis, photomorphogenesis) or tissues (embryo, immature panicle, ovary) examined (Fig. 8). Two size fractions of the ovary library were picked and sequenced for a practical reason described in "Materials and Methods." Comparison of the ESTs derived from these two library fractions in Figures 7 and 8 indicates that variable size distribution in these two library fractions does lead to minor differences in the TUs identified (Fig. 7), even though ovary 1 (OV1) and OV2 nonetheless cluster well with one another (Figs. 7 and 8).
Eighteen signature genes identified by hierarchical clustering were evaluated by quantitative RT-PCR. To evaluate the utility of these signature genes, seven comparisons were made between abscisic acid (ABA)-treated and light-grown seedlings and 11 between dark- and light-grown seedlings. The former comparisons were designed to assess the utility of the signature genes with respect to ABA response, which is a subcomponent of dehydration stress, and to connect these signature genes to an in-depth microarray evaluation of ABA and dehydration responsive genes in sorghum (Buchanan et al., 2005 The entire Milestone 1.0 data set is available in comma-delimited format as Supplemental Table I. It is also available for download, together with all consensus sequences, using MAGIC Gene Discovery at http://fungen.org/Sorghum.htm. Supplemental Table I contains TU identification (ID), number of 3' ESTs in the TU, number of 3' ESTs in each library for that TU, BLASTx target description, Expect value, Protein Information Resource Non-redundant Reference Protein (PIR-NREF) ID, and the 3' EST that represents the TU (TU anchor sequence). Supplemental Table II provides the same information for the data in Figure 7, the R statistic, and the order in which TUs appear in the heat map.
The analysis of sorghum ESTs presented here is an early step in taking advantage of sorghum as a model organism for genome-scale investigations of stress-related genes among the Poaceae. It complements the more extensive effort that has already been put into mapping the sorghum genome (Whitkus et al., 1992
By random sampling of a relatively large number of mostly nonnormalized, unamplified, and diverse cDNA libraries to a uniform depth of about 5,000 cDNAs, and by sequencing both 3' and 5' ends of each cDNA (Tables I and II), the advantages enumerated in the introduction have been realized. The random sampling permits more rigorous interpretation of the results of hierarchical clustering. Sequencing both ends of each cDNA permitted more rigorous clustering, as compared to the large majority of other plant EST projects, which focused almost exclusively on 5' ESTs (e.g. Shoemaker et al., 2002
The PIR-NREF database was selected for default provisional electronic annotation for several reasons (Wu et al., 2002
The observation that TU consensus sequences are only slightly longer than individual 3' ESTs (Fig. 1) is one indication that the TUs identified here are of good quality. Because all 3' ESTs should start at or near the same position, depending upon differential polyadenylation sites, consensus sequences should never be much longer than individual sequences (Fig. 1). MAGIC Gene Discovery at http://fungen.org permits visual inspection of individual TUs, with discrepancies from the consensus sequence highlighted, and identifies ESTs that have been assembled as their reverse complement, thereby permitting independent judgments of quality (Supplemental Fig. 3). Moreover, as an additional consequence of curating the assembly in a relational database and of assigning to each TU an anchor cDNA clone, as the size of this assembly grows TU identifiers will inasmuch as possible be retained, and, when necessary, a means for tracking necessary ID changes will be provided (C. Liang, F. Sun, H. Wang, D. Kolychev, L.H. Pratt, and M.-M. Cordonnier-Pratt, unpublished data). Consequently, the value of this EST assembly will have more permanence than is usually the case, which is an important consideration when used for microarray and other downstream applications.
Given the relatively low cost, sequencing both ends of cDNAs randomly selected from predominantly unamplified and nonnormalized libraries and to a relatively shallow depth provides an excellent compromise between cost and benefit. This approach maintains a substantial rate of gene discovery (Fig. 3) without unnecessarily reducing the information content of the cDNA libraries (Figs. 4-8
Although collectively these libraries appear to contain in excess of 30,000 TUs, with the exception of LG1 no one library is predicted to contribute more than about 6,000 to 9,000 (Fig. 4). Each library also exhibits about the same level of complexity. This same observation holds when considering either the number of TUs preferentially expressed in individual libraries (Fig. 5) or the frequency with which TUs consisting of two or more 3' ESTs coming from only one library or library subgroup are observed (Fig. 6). While the data suggest that LG1 is the richest library in terms of total number of genes being expressed (Fig. 4), it appears to provide, if anything, fewer preferentially expressed genes (Fig. 5) and, as compared to other libraries, about the same frequency of TUs with two or more members coming from only one library (Fig. 6). Combined with the need for redundancy in order to explore events such as alternative polyadenylation and differential splicing (Burke et al., 1998 The results just discussed, especially those in Figure 4, also document that sorghum expresses only a small fraction of its genome either in any one organ, at any one developmental stage, or in response to any specific environmental influence. We are unaware of a quantitative analysis similar to that presented here for any other plant, but see no reason why this observation should not have general validity. The apparent enhancement in the frequency of TUs with two or more members coming from only one library that is observed for FM1, RHIZ1, and RHIZ2 (Fig. 6) results from the observation that some of the genes in S. propinquum (FM1, RHIZ2) and johnsongrass (RHIZ1) differ enough from their orthologs in S. bicolor to be grouped into different TUs when the genes are highly expressed. 
This outcome is not surprising given that the clustering performed here was intended to discriminate among different members of a gene family and, as a consequence, was sensitive to relatively small differences in sequence, especially in the 3' UTR. Hierarchical clustering of the 258 TUs with 20 or more members (data not shown) provides an outcome much like that seen in Figure 7, which is to say that FM1 and RHIZ2 form a distinct cluster separate from all other libraries. Manual inspection of these S. propinquum signature genes (Fig. 7, green bar) reveals that they appear to be orthologs of S. bicolor genes that were separated into individual TUs.
Comparison by BLASTn of all 16,801 sorghum TUs to ESTs from sugarcane, maize, and rice reveals, as anticipated, decreasing similarity with increasing phylogenetic distance. Even in the case of sugarcane, however, close to 20% of sorghum TUs returned an Expect value >E-5 and almost 5% returned a value >1 (>E0; Supplemental Fig. 1). For maize, the equivalent values are just over 23% and 10%, while for rice they are just over 35% and 10%. It is evident that even when compared to these 956,288 Poaceae ESTs, the results documented here for sorghum indicate, at least superficially, that there remains a large pool of genes to be discovered by this approach. It should be noted, however, that since the bulk of sugarcane, maize, and rice ESTs are 5' while the TUs are defined by 3' ESTs, one might expect insufficient overlap in at least some instances. Nonetheless, we have observed that 5' sorghum ESTs derived from TU anchor clones often find fewer and/or poorer matches than do the 3' TU sequences (data not shown; see also below). This outcome indicates that average cDNA lengths for other EST projects have often been relatively short such that even 5' ESTs are near the 3' terminus.
The 16,801 TUs identified here from 55,783 cDNAs is consistent with observations from other plant EST projects. A comparable estimate for potato (Solanum tuberosum; Ronning et al., 2003
It is important to note that like Fei et al. (2004)
Evaluation as described by Stekel et al. (2000)
One of the two foci of this effort was to investigate the influence of drought on the sorghum transcriptome. The 9,656 ESTs derived from libraries WS1, DSAF1, and DSBF1 define 717 TUs with two or more members and 1,517 singleton TUs containing ESTs from only these three libraries. As a group, they exhibit the highest frequency of context-specific gene expression as estimated in Figure 6. Because DSAF1 and DSBF1 when constructed were not originally designed to be included in this project, however, they were both subtracted libraries. Consequently, they were not included in the hierarchical clustering presented here. Of the 775 TUs in Figure 7, 72 are preferentially expressed in WS1 (Fig. 7, violet bar; TUs 454-525 in Supplemental Table II). Of the 1,591 ESTs in these 72 TUs, 1,042 or 65% derive from WS1. DSAF1 and DSBF1 contributed another 225 ESTs to these 72 TUs. They include three dehydrins, four heat-shock proteins, a late embryogenesis-abundant protein, a drought-inducible protein, a dehydration-responsive protein, a tonoplast-intrinsic protein, and a pore-protein homolog. About half have at least a putative or hypothetical function assigned based upon BLASTx returns from PIR-NREF.
Comparison of this entire Milestone EST data set to ABA-induced sorghum TUs identified by microarray and confirmed by quantitative RT-PCR (Buchanan et al., 2005
Similar to the characterization here of sorghum genes expressed preferentially in response to both compatible and incompatible infections, an earlier potato EST project obtained about 5,000 ESTs from each of two cDNA libraries, prepared from plants challenged with either a compatible or incompatible pathogen (Ronning et al., 2003
A more recent tomato EST data set (Fei et al., 2004
While coexpressed TUs are sometimes expected to identify genes encoding proteins that interact with one another, the data set here is too small to provide statistically meaningful information within this context (Price and Rieffel, 2004
Annotations of the signature genes are often informative and consistent with expectations (Fig. 8). Obvious examples include a seed maturation protein in the embryo library, a dehydrin in the drought libraries, and a chlorophyll a/b-binding protein in the light-grown library. In addition, it will be of interest to follow up several of the annotated signature genes in order to obtain further insight into their biological function in sorghum. For example, differential representation of a pseudo-response regulator gene in the pathogen libraries may indicate that modified clock gating is important in mobilizing responses to pathogens. Similarly, the pathogen signature gene encoding a putative phosphoinositide phosphatase suggests that down-regulation of phospholipid signaling may play a role in the response of sorghum to pathogens (Laxalt and Munnik, 2002 Other signature genes, however, are annotated as hypothetical, putative, similarity to, or some other designation indicating that annotation is at best highly speculative. Yet other genes are effectively not annotated at all, returning in two cases Expect values greater than 1. Thus, functions of the products of many or most of these signature TUs are effectively unknown. Nonetheless, their differential digital expression patterns can provide assistance in elucidating their functions, although such investigations are beyond the scope of this analysis.
Of the 70 signature TUs, eight returned from PIR-NREF an Expect value
The probability is quite high that each of the 775 TUs characterized in Figure 7 is expressed differentially (Table III) and that the signature genes identified in Figure 8 are truly diagnostic. For example, the 10 TUs representing the 10 drought signature genes contain not only 203 ESTs deriving from WS1, but an additional 88 ESTs deriving from DSAF1 and DSBF1, which as noted previously were not included in the hierarchical clustering. Moreover, evaluation of 18 TUs by quantitative RT-PCR is consistent with the results of hierarchical clustering with only two exceptions (Fig. 8). In the case of TU 2_8855 no induction by ABA was detected by RT-PCR. Similarly, however, none of the 2,342 3' ESTs from a subsequently produced sorghum ABA-induced cDNA library (http://fungen.org) can be associated with this TU. Thus, it represents a gene induced by drought, but apparently not by ABA. In the case of TU 2_7723, fold induction as assayed by RT-PCR was very low. Because arabinogalactan-proteins derive from a relatively large gene family (Gaspar et al., 2001
cDNA Libraries
A total of 13 libraries were prepared from Sorghum bicolor L. Moench, Sorghum propinquum (Kunth) Hitchc., or johnsongrass (Sorghum halepense L. Pers.) as summarized in Table I. With the exception of libraries DSAF1 and DSBF1, S. bicolor libraries were prepared from genotype BTx623. DSAF1 and DSBF1, which were initially prepared for a different purpose, were from genotypes B35 and Tx7000, respectively. B35 is an inbred line with stay-green, post-flowering drought tolerance, while Tx7000 is an elite, high-yielding accession with nonstay-green, preflowering drought tolerance. For DSBF1, water was withheld after 4 weeks to impose gradual water deficit and to simulate natural preflowering drought stress. For DSAF1, final irrigation was administered 3 d after anthesis (about 2 months after sowing) to impose gradual water deficit and to simulate natural postflowering drought stress. Harvested tissue was frozen by immersion in liquid nitrogen and stored at -80°C. Embryos were isolated by milling imbibed grain with a Quaker model 4-E plate mill (Clinton Separators), immersing the ground grain in liquid nitrogen, and filtering it through a sieve with pores of 0.84 mm. Endosperm, which was ground to a powder by the mill, passed through while embryos were retained. With the exception of RHIZ1, DSAF1, and DSBF1, libraries were constructed by Stratagene, beginning with total RNA extracted from plant material finely ground under liquid nitrogen. cDNAs were cloned into the EcoRI (5' end) and XhoI (3' end) sites of lambda ZAPII. Average insert sizes, as reported by Stratagene for 12 randomly picked clones from each library, were between 1.25 and 2.0 kb. RHIZ1, DSAF1, and DSBF1 were similarly prepared in the same vector, but in the laboratories of Andrew Paterson (RHIZ1) or Henry Nguyen (DSAF1, DSBF1).
Library phage were received from Stratagene in two or three fractions per library, with each fraction representing a different insert size range. With one exception, plasmids derived from these libraries were obtained from the fraction with the longest insert size range. In the case of the ovary library, the fraction with the longest inserts (OV2) yielded too few clones. Hence, the second of three fractions, which had the next-longest insert size range, was also used (OV1). RHIZ1, DSAF1, and DSBF1 plasmids were obtained from libraries that were amplified, but not size fractionated. DSAF1 and DSBF1 were also subtracted using driver cDNA prepared from poly(A)+-RNA obtained from nonstressed sorghum leaves essentially as described by Soares and Bonaldo (1998)
Following transformation by electroporation, bacteria were plated, clones randomly picked into freezing medium in 96- or 384-well plates, and frozen at -80°C after overnight growth at 37°C in a HiGro (Genomic Solutions). All colonies used for sequencing were grown in triplicate: two sets of shallow 96-well plates for subsequent clone distribution, and one set of deep-well blocks for preparation of template DNA. The latter was prepared in the same deep-well blocks in which the bacteria were cultured, using an alkaline lysis procedure essentially as described by Roe et al. (1996
ABI BigDye Terminator Cycle Sequence Ready Reaction version 2 or 3 was used at 12-fold dilution, as described by Roe et al. (1996
Data-processing pipelines and an Oracle database were created for this project (Cordonnier-Pratt et al., 2004
Vector/adaptor- and quality-trimmed 3' ESTs were clustered and assembled with phrap (http://www.phrap.org). To reduce the frequency of poorly assembled TUs, members of each TU were resubmitted to phrap one TU at a time. Because phrap discriminates among sequences far better when assembling them in smaller groups, this resubmission eliminated most of the poorly assembled TUs by subdividing them. Extensive data for each TU was entered into database tables designed for this purpose. These data included the first and last base positions of a sequence relative to the consensus, the offset of each sequence relative to the consensus, whether a sequence had been reverse complemented to match the consensus, the length of each sequence including pads required for alignment, and all discrepancies from the consensus. From this information a normalized percentage of alignment of each trimmed and padded sequence to the consensus for its TU was determined in order to identify poorly assembled TUs. The latter were eliminated from the Milestone 1.0 assembly presented here. While these poorly assembled TUs are not included in the 16,801 TUs reported and characterized here, they have not been disregarded. Instead, they have been flagged in the database as poorly assembled and added to the Milestone TUs for use in microarray applications (Buchanan et al., 2005
The following relationship, obtained from Dr. Bruce Roe and James White of the University of Oklahoma, was used to estimate the number of TUs in a library or group of libraries as a function of sequencing depth (Figs. 3 and 4):

$$\hat{G} = \frac{G\,n}{n + G\,S}$$

where $\hat{G}$ is the estimated number of TUs when n ESTs have been obtained, G is the maximum number of TUs expected as n approaches infinity, and S is an empirically derived parameter that when multiplied by G corresponds to the number of ESTs required to obtain one-half of the maximum number of TUs. S therefore effectively determines the slope of the curve. An iterative process is used to determine the G and S values that yield a curve that best fits the experimental data as shown in Figure 3. Note that as n goes to infinity, the function simplifies to $\hat{G} = G$. This relationship is a special case of a widely used pharmacological drug responsiveness model:
$$y = \frac{b_0}{1 + (b_2/x)^{b_1}}$$

where x is the dose level, y is the response as a percentage of the maximum, b0 is the expected response at saturating dose, b2 is the concentration that produces half-maximal response, and b1 determines the slope of the function. With the substitutions described below, the drug responsiveness model can be transformed to that used here to model the rate of gene discovery while simultaneously retaining its usefulness for modeling a saturation curve of the sort to be expected as a randomly picked cDNA library is sequenced to increasing depth. It is of general applicability and thus useful with other EST datasets. The substitutions and the rationales behind them are as follows. (1) The number of ESTs that have been obtained (n) is substituted for dose level (x), which in both cases is the independent variable. (2) The estimated number of TUs when n ESTs have been sequenced as a proportion of the maximum number as n approaches infinity ($\hat{G}$) is substituted for drug response (y), which in both cases is the dependent variable. (3) The number of TUs expected as n approaches infinity (G) is substituted for the expected response at saturating dose (b0), which in both cases is the maximum to be anticipated. (4) The number of ESTs required to identify one-half of the maximum number of TUs (G × S) is substituted for the dose required to give the half-maximal drug response (b2), which in both cases determines the slope of the function. (5) In addition, b1 is set to 1, which reflects the assumption that as a randomly picked cDNA library is sequenced, the rate of gene discovery as a function of the number of ESTs sequenced remains relatively unchanged. In our experience, this has proven to be the case when one examines curves like that in Figure 3 for each individual cDNA library (data not shown).
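The saturation model can be explored numerically with the short sketch below. It reads the model, from the definitions of G and S given above, as estimated TUs = G·n / (n + G·S). The fitting routine is a substitute technique, not the authors' iterative procedure: it uses the linearization n/TU = n/G + S and ordinary least squares. All function names are hypothetical.

```python
def expected_tus(n, G, S):
    """Estimated number of TUs after n ESTs, for plateau G and parameter S."""
    return G * n / (n + G * S)


def discovery_rate(n, G, S):
    """Slope of the curve at n: expected new TUs per additional EST."""
    half = G * S  # number of ESTs at which half of G has been found
    return G * half / (n + half) ** 2


def solve_S(n_obs, tu_obs, G):
    """S for which a curve with plateau G passes through the point (n_obs, tu_obs)."""
    return n_obs * (G - tu_obs) / (G * tu_obs)


def fit_linearized(points):
    """Estimate (G, S) from (n, observed TUs) pairs.

    Uses least squares on the linear form n/TU = n/G + S, so that
    slope = 1/G and intercept = S. This is a stand-in for the iterative
    curve fit described in the text.
    """
    xs = [n for n, _ in points]
    ys = [n / t for n, t in points]
    m = len(points)
    mean_x = sum(xs) / m
    mean_y = sum(ys) / m
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return 1.0 / slope, intercept


# Consistency check against values reported in the text: with a plateau of
# about 30,600 TUs and 16,801 TUs observed at 55,783 3' ESTs, the implied
# slope should be close to the reported 13.6% discovery rate.
S_implied = solve_S(55783, 16801, 30600)
rate_at_milestone = discovery_rate(55783, 30600, S_implied)
```

Forcing the curve through the single observed point is an assumption made here only to check the arithmetic; a real fit would use the full accumulation curve of Figure 3.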
Provisional electronic annotation of all ESTs and TU consensus sequences was obtained by BLASTx (Altschul et al., 1990 BLASTx returns from this curated PIR-NREF database were also used to estimate the percentage of clones containing full-coding-length inserts and of inserts cloned inversely from expectations. For each library, BLASTx returns with an Expect value less than E-13 and with three or fewer high-scoring pairs were identified. From this subset, the percentage of query 5' ESTs that either matched the initiating Met or contained sufficient 5' sequence upstream of the match to encode the initiating Met was determined. This calculation assumes that a target protein is the same length as that encoded by the query sequence. While not always correct, it is nonetheless a reasonable assumption that targets are as likely to be shorter than the query as they are to be longer such that on average the assumption is reasonable. The percentage of inverted clones was estimated from the same subset of 5' ESTs. If the reading frame was negative, that was taken as evidence that a presumed 5' EST was in fact a 3' EST. These calculations can be redone with different parameters using MAGIC Gene Discovery described in "Data Access" below.
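The full-coding-length and inverted-clone estimates described above can be sketched as follows. This is an illustrative reading of the procedure, not the MAGIC Gene Discovery implementation: the `BlastxHit` record, its field names, and the choice to report both percentages relative to the filtered subset are assumptions.

```python
from dataclasses import dataclass


@dataclass
class BlastxHit:
    """Best BLASTx return for one presumed 5' EST (hypothetical record layout)."""
    expect: float        # Expect value of the return
    n_hsps: int          # number of high-scoring pairs
    query_start: int     # 1-based nucleotide position where the match begins in the EST
    subject_start: int   # 1-based residue position where the match begins in the protein
    frame: int           # reading frame: positive or negative


def in_subset(hit, max_expect=1e-13, max_hsps=3):
    """Keep returns with Expect below the cutoff and three or fewer high-scoring pairs."""
    return hit.expect < max_expect and hit.n_hsps <= max_hsps


def covers_initiating_met(hit):
    """True if the match starts at the initiating Met, or if enough sequence lies
    upstream of the match to encode the residues back to the initiating Met."""
    upstream_codons = (hit.query_start - 1) // 3
    return upstream_codons >= hit.subject_start - 1


def library_summary(hits):
    """Percentages of full-coding-length and inverted clones within the filtered subset."""
    subset = [h for h in hits if in_subset(h)]
    if not subset:
        return None
    # A negative reading frame is taken as evidence of an inverted clone.
    inverted = sum(1 for h in subset if h.frame < 0)
    full_length = sum(1 for h in subset if h.frame > 0 and covers_initiating_met(h))
    return {
        "n_subset": len(subset),
        "pct_full_length": 100.0 * full_length / len(subset),
        "pct_inverted": 100.0 * inverted / len(subset),
    }
```

As in the text, the Met test assumes that the target protein has the same length as the protein encoded by the query, so individual calls can be wrong even if the library-level percentage is reasonable.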
The R statistic of Stekel et al. (2000)
The relationship between the R statistic and the likelihood that expression of a given TU differs significantly from the null hypothesis of uniform expression across all libraries was determined following the suggestion of Stekel et al. (2000)
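For orientation, the R statistic of Stekel et al. (2000) is a log-likelihood ratio that compares the observed distribution of a TU's ESTs across libraries with the distribution expected under uniform expression. The sketch below assumes the commonly cited form R = sum over libraries of x_i · ln(x_i / (N_i · f)), where x_i is the number of ESTs of the TU in library i, N_i is the library size, and f is the TU's overall frequency. The use of the natural logarithm and the function name are assumptions, and the mapping from R to a confidence level described in the text is not reproduced here.

```python
import math


def r_statistic(counts, library_sizes):
    """Log-likelihood ratio statistic for one TU.

    counts:        number of 3' ESTs of the TU in each library
    library_sizes: total number of 3' ESTs in each library (same order)
    """
    # Overall frequency of the TU across all libraries (null hypothesis).
    f = sum(counts) / sum(library_sizes)
    r = 0.0
    for x, size in zip(counts, library_sizes):
        if x > 0:  # libraries with zero counts contribute nothing to the sum
            r += x * math.log(x / (size * f))
    return r
```

R is zero when a TU is represented in every library in proportion to library size and grows as expression becomes concentrated in a few libraries.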
Quantitative RT-PCR was performed as described by Salzman et al. (2005)
ESTs have been deposited in GenBank. Accession numbers and associated laboratory sequence names are available in comma-delimited format in Supplemental Tables III (DG1, DSAF1, DSBF1, EM1, FM1), IV (IP1, LG1, OV1, OV2, PI1), and V (PIC1, RHIZ1, RHIZ2, WS1). Sequences can also be viewed at and downloaded from http://fungen.org/Sorghum.htm