Plant Physiology 140:1169-1182 (2006)
© 2006 American Society of Plant Biologists
Partial Shotgun Sequencing of the Boechera stricta Genome Reveals Extensive Microsynteny and Promoter Conservation with Arabidopsis1,[W]
a Formanová
Max-Planck-Institut für chemische Ökologie, D-07745 Jena, Germany (A.J.W., M.E.S., N.F., S.G.-J., D.S., J.K., T.M.-O.); and Washington State University, School of Biological Sciences, Vancouver, Washington 98686 (J.G.B.)
Comparative genomics provides insight into the evolutionary dynamics that shape discrete sequences as well as whole genomes. To advance comparative genomics within the Brassicaceae, we have end sequenced 23,136 medium-sized insert clones from Boechera stricta, a wild relative of Arabidopsis (Arabidopsis thaliana). A significant proportion of these sequences, 18,797, are nonredundant and display highly significant similarity (BLASTn e-value ≤ 10^-30) to low copy number Arabidopsis genomic regions, including more than 9,000 annotated coding sequences. We have used this dataset to identify orthologous gene pairs in the two species and to perform a global comparison of DNA regions 5' to annotated coding regions. On average, the 500 nucleotides upstream of coding sequences display 71.4% identity between the two species. In a similar analysis, 61.4% identity was observed between 5' noncoding sequences of Brassica oleracea and Arabidopsis, indicating that regulatory regions are not as diverged among these lineages as previously anticipated. By mapping the B. stricta end sequences onto the Arabidopsis genome, we have identified nearly 2,000 conserved blocks of microsynteny (bracketing 26% of the Arabidopsis genome). A comparison of fully sequenced B. stricta inserts to their homologous Arabidopsis genomic regions indicates that indel polymorphisms >5 kb contribute substantially to the genome size difference observed between the two species. Further, we demonstrate that microsynteny inferred from end-sequence data can be applied to the rapid identification and cloning of genomic regions of interest from nonmodel species. These results suggest that among diploid relatives of Arabidopsis, small- to medium-scale shotgun sequencing approaches can provide rapid and cost-effective benefits to evolutionary and/or functional comparative genomic frameworks.
The genomes of higher plants are dynamic entities and their evolutionary histories have been influenced by duplication, deletion, rearrangement, transposition, and changes in ploidy. As such it is not surprising that significant investments have been made toward the development of Arabidopsis (Arabidopsis thaliana), a species with a modest haploid genome composed of five chromosomes, approximately 125 Mb, and relatively little repetitive DNA, as a model genetic/genomic system. Beyond the superficial simplicity of the Arabidopsis genome, the species offers other characteristics that make it amenable to laboratory study, including self compatibility, rapid generation time, and responsiveness to culture conditions and transformation. These characteristics have translated themselves into an extensive history of both classical- and molecular-genetic application, the complete sequencing of this genome (Arabidopsis Genome Initiative, 2000
In the postgenomic era, comparative analyses between related genomes have proved invaluable. Animal and fungal research communities have invested heavily in the establishment of comparative genomic resources (Genomes OnLine Database v2.0 provides a comprehensive list of completed, proposed, and on-going genome-sequencing projects; http://www.genomesonline.org/). The dividends of these investments include: better assessments of conserved microsynteny and colinearity; improved annotation and ab initio gene prediction; and the identification of novel genes, cis-regulatory sequences, and noncoding RNAs. Within the plant biology community, the development and application of comparative genomic approaches have also met with success in maize (Zea mays), rice (Oryza sativa), sorghum (Sorghum bicolor), and other agronomically important members of Poaceae.
While many agriculturally important crop and weed species (Dietz et al., 1999
These efforts will, no doubt, be highly informative. It is, however, difficult to ignore the complexity of the Brassicaceae, a diverse family composed of approximately 340 genera and 3,400 species. Brassicaceae is marked by recurrent polyploidization (Arabidopsis Genome Initiative, 2000
Phylogenetic analysis of multiple nuclear and plastid encoded loci (Koch et al., 2001a
From an evolutionary perspective, B. stricta provides an interesting contrast to Arabidopsis. B. stricta is common in mountainous regions of western North America and, unlike Arabidopsis, thrives in undisturbed habitats. B. stricta is a highly selfing, nonweedy, short-lived perennial with substantial genetic variation within populations, and clear isolation by distance on a regional scale (Song et al., 2006
B. stricta End Sequencing
We end sequenced 23,136 medium-sized insert clones (approximately 13 kb) from B. stricta. A total of 39,976 sequencing reactions yielded read lengths
Identification of B. stricta-Arabidopsis Homologous Genomic Regions
We identified 25,879 B. stricta end sequences from the sequence-indexed library that display significant similarity (BLASTn e-value From the informative sequence set, 12,668 sequences corresponded to both sequencing reads from 6,334 inserts (paired end sequences; Fig. 2) and the other 6,129 sequences were classified as solo end sequences (Fig. 2). All informative sequences were mapped to Arabidopsis chromosomal pseudomolecules using the physical position of the Arabidopsis nucleotide homologous to the 5'-most B. stricta nucleotide in a given BLASTn high-scoring pair (HSP; Fig. 3 ; Supplemental Fig. 1). Approximately 15.5 Mb of nonredundant B. stricta sequence with highly significant similarity to Arabidopsis genomic sequences have been identified.
Nearly half (9,034) of the B. stricta end sequences in the informative set also have significant BLASTn hits (e-value ≤ 10^-10) to annotated Arabidopsis coding sequences (CDSs), with an average identity of 90.3% among all nonredundant HSPs (Supplemental Data 1). As expected, the number of CDS hits per Arabidopsis chromosome corresponds to the physical size of the chromosome (Table I; Supplemental Data 1).
Conservation of Microsynteny in B. stricta and Arabidopsis
The 12,668 paired end sequences (6,334 inserts) with similarity to Arabidopsis genomic sequences were analyzed to identify end sequences with conserved microsyntenic relationships relative to homologous Arabidopsis genomic regions. This analysis was performed with syntenyFinder.py, which recognizes four syntenic categories (described in Fig. 4): syntenic, tight colinear, physically linked, and unlinked. From the starting collection of inserts, 4,691 (74.1%) qualified as having conserved synteny relative to Arabidopsis, while only 13.5% of inserts fell within the categories of tight colinear or linked (Fig. 4). The 1.2% of B. stricta end sequences scored as tight colinear (Fig. 4) indicated that the relevant Arabidopsis homologs show conserved colinearity and physical proximity, but diverge with regard to relative orientation. The physical sizes of the Arabidopsis chromosomal intervals bracketed by B. stricta end sequences with conserved microsynteny (Supplemental Data 2) are distributed as shown in Figure 5. In a one-tailed t test assuming unequal variances, the mean length of these Arabidopsis intervals (11,949 bp) differs significantly from the average B. stricta insert size (13,187 bp; P < 0.005), indicating that Arabidopsis genomic regions are, on average, smaller than homologous B. stricta genomic regions.
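The four categories can be illustrated with a minimal Python sketch. This is not the syntenyFinder.py source: the category definitions live in Figure 4, which is not reproduced in the text, so the rules below are assumptions. The sketch assumes that "unlinked" means the two ends hit different chromosomes, that "linked" means the same chromosome but farther apart than the 50-kb limit given in "Materials and Methods", and that correctly oriented paired ends map to opposite strands because both reads point into the insert. The names EndHit and classify_insert are hypothetical.

```python
from dataclasses import dataclass

# Physical limit for a syntenic region in Arabidopsis, as reported in the methods.
SYNTENY_LIMIT_BP = 50_000


@dataclass(frozen=True)
class EndHit:
    """Best Arabidopsis hit for one end read of a B. stricta insert (hypothetical type)."""
    chromosome: int
    position: int  # Arabidopsis coordinate assigned to the end read
    strand: str    # '+' or '-'


def classify_insert(left: EndHit, right: EndHit, limit: int = SYNTENY_LIMIT_BP) -> str:
    """Assign one of four categories under the assumptions stated in the lead-in."""
    if left.chromosome != right.chromosome:
        return "unlinked"
    if abs(left.position - right.position) > limit:
        return "linked"
    # Assumption: end reads point into the insert, so a conserved
    # arrangement places the two hits on opposite strands.
    if left.strand != right.strand:
        return "syntenic"
    # Close together and colinear, but relative orientation differs.
    return "tight colinear"


if __name__ == "__main__":
    print(classify_insert(EndHit(1, 1_000, "+"), EndHit(1, 13_000, "-")))
```

Keeping the distance limit as a parameter makes it easy to explore how sensitive the category counts are to the chosen window.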
The B. stricta inserts displaying conserved microsynteny were subsequently organized into virtual microsyntenic blocks. These blocks consist of either singleton inserts or multiple, overlapping inserts, as suggested by their similarity to the Arabidopsis genome. In total, 1,974 blocks have been identified, bracketing approximately 31.5 Mb of the Arabidopsis genome (Table II; Supplemental Data 3).
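One plausible way to assemble such blocks is to merge overlapping Arabidopsis intervals chromosome by chromosome. The sketch below assumes each syntenic insert has already been reduced to a (chromosome, start, end) tuple; the function name build_blocks and the tuple layout are illustrative and are not taken from the authors' scripts.

```python
from collections import defaultdict


def build_blocks(intervals):
    """Merge overlapping Arabidopsis intervals into blocks.

    intervals: iterable of (chromosome, start, end) tuples, one per insert.
    Returns a list of (chromosome, start, end, n_inserts) tuples.
    """
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        # Normalise so that start <= end regardless of hit orientation.
        by_chrom[chrom].append((min(start, end), max(start, end)))

    blocks = []
    for chrom in sorted(by_chrom):
        cur_start = cur_end = None
        count = 0
        for start, end in sorted(by_chrom[chrom]):
            if cur_start is not None and start <= cur_end:
                # Overlaps the open block: extend it.
                cur_end = max(cur_end, end)
                count += 1
            else:
                # Close the previous block (if any) and open a new one.
                if cur_start is not None:
                    blocks.append((chrom, cur_start, cur_end, count))
                cur_start, cur_end, count = start, end, 1
        if cur_start is not None:
            blocks.append((chrom, cur_start, cur_end, count))
    return blocks


if __name__ == "__main__":
    demo = [(1, 100, 200), (1, 150, 300), (1, 500, 600), (2, 10, 20)]
    for block in build_blocks(demo):
        print(block)
```

Summing `end - start` over the returned blocks gives the total bracketed length, which is the kind of figure reported above as approximately 31.5 Mb.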
To verify the homology of the B. stricta sequence-indexed clones identified as displaying conserved microsynteny to their corresponding Arabidopsis genomic regions, two approaches were taken. In the first approach, eight B. stricta inserts were randomly selected from the distribution presented in Figure 5 and sequenced completely. Microsynteny and identity are conserved in all regions examined, with disruptions being attributed to indel polymorphisms between the two species (Fig. 6, A and C ; Supplemental Fig. 2). In comparisons where there is a substantial difference in the sizes of the B. stricta genomic region versus the Arabidopsis homolog, the differences arise from large indel polymorphisms (>5.0 kb). Near-isometric regions contain multiple small indel polymorphisms that occur in a complementary fashion in both species.
For the second approach, 21 B. stricta sequence-indexed clones were specifically targeted for complete sequencing. End-sequence similarities for these clones suggested that their inserts should contain B. stricta genomic regions that are homologous to Arabidopsis loci involved in insect/pathogen resistance and flowering time. Every insert was correctly identified by syntenyFinder.py as a homolog of the relevant Arabidopsis genomic region (data not shown). As an example, the B. stricta class I chitinase region is presented in Figure 6B. The B. stricta region shown is composed of two overlapping, sequence-indexed inserts that were predicted by syntenyFinder.py to be members of a virtual microsyntenic block.
The B. stricta end sequences with significant similarity to annotated Arabidopsis CDSs were cross-referenced against the Arabidopsis duplication annotations of K. Wolfe (http://wolfe.gen.tcd.ie/athal/all_results/). In total, we identified 7,160 nonredundant B. stricta hits to the Arabidopsis CDSs duplicated by polyploidization within the last 24 to 40 million years (Blanc et al., 2003
To more closely investigate orthologous/paralogous relationships between B. stricta and Arabidopsis, we analyzed a subset of the above data (Table III, selected dataset; Supplemental Data 4). This dataset, which consists of 1,440 B. stricta end sequences, includes only data from B. stricta sequences where similarity is detected to at least one member of a given Arabidopsis gene pair at an e-value ≤ 10^-90. The calculation of Δlog e-value for a given B. stricta end sequence (see "Materials and Methods") yields an integer ranging from -170 to 0. Lower values of Δlog e-value indicate greater differences in the e-values of the BLASTn hits to the members of a given Arabidopsis gene pair and improve our ability to distinguish between candidates for orthology and close paralogs. The frequency distribution of Δlog e-value (Fig. 7) shows that a majority (98%) of Arabidopsis gene pairs have Δlog e-value scores < -5, indicating more than five orders-of-magnitude difference in the significance of the B. stricta end-sequence BLASTn hits to the two Arabidopsis paralogs.
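The exact formula for the log e-value difference is given in "Materials and Methods" and is not reproduced here, so the following Python sketch is only one reading that is consistent with the description: the log10 e-value of the better hit minus the log10 e-value of the other paralog, which is never positive and becomes more negative as the two hits diverge. The floor applied to e-values reported as 0.0 is a hypothetical choice, as are the names.

```python
import math

# Hypothetical floor so that BLAST e-values reported as 0.0 stay finite.
E_VALUE_FLOOR = 1e-180


def delta_log_evalue(e_first: float, e_second: float, floor: float = E_VALUE_FLOOR) -> float:
    """Assumed definition: log10(better e-value) - log10(other e-value), always <= 0."""
    best, other = sorted((max(e_first, floor), max(e_second, floor)))
    return math.log10(best) - math.log10(other)


if __name__ == "__main__":
    # A 25 orders-of-magnitude gap between the hits to two paralogs.
    print(delta_log_evalue(1e-120, 1e-95))
```

Under this reading, a score below -5 corresponds to the "more than five orders-of-magnitude" difference described in the text.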
Comparison of Upstream Noncoding Sequences
B. stricta and Arabidopsis genomic regions upstream to homologous CDSs were analyzed using the UntransID.py program. The analysis, which proceeds in two phases, was performed on B. stricta sequences from the sequence-indexed library dataset as well as sequences generated from a B. stricta small-insert library (see "Materials and Methods"). In the first phase, all B. stricta sequences with significant similarity to Arabidopsis CDSs were screened to identify sequences that traverse the translation initiation codon of the homologous Arabidopsis CDS and contain a minimum of 500 bp of sequence 5' to the presumptive initiation codon. Redundancies and B. stricta sequences with homology to Arabidopsis CDSs displaying complex paralogous relationships were purged during the screening process. At the conclusion of this phase of the analysis, 657 B. stricta sequences were deemed sufficiently informative for comparison to Arabidopsis upstream regions (Supplemental Data 5).
During the second phase of the analysis, the 657 B. stricta sequences identified in phase 1 were aligned to homologous Arabidopsis upstream regions (The Arabidopsis Information Resource, Arabidopsis sequence datasets; ftp://ftp.arabidopsis.org/Sequences/) using the algorithm of Needleman and Wunsch (1970) As percent identity is a one-dimensional representation of sequence similarity, the proportions of Arabidopsis nucleotides scored as identity and noninformative were plotted against Arabidopsis nucleotide position for all 657 alignments (Fig. 8A ). The proportion of identities is highest (approximately 0.70) at positions directly abutting the presumptive translation initiation codons and declines steadily at positions increasingly 5' to the start codon. This trend is linear and significant (P < 0.005; Supplemental Data 5).
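The per-position identity profile described above can be sketched in Python as follows. The sketch assumes each global alignment is available as a pair of equal-length gapped strings, with the Arabidopsis row ordered so that its first nucleotide is the first position of the comparison window. It counts identities only and does not model the separate "noninformative" score used by UntransID.py. The function name and input layout are hypothetical.

```python
def identity_profile(alignments, window=500):
    """Proportion of alignments with an identity at each Arabidopsis position.

    alignments: list of (arabidopsis_row, stricta_row) gapped strings of equal length.
    window: number of Arabidopsis positions to profile.
    """
    if not alignments:
        raise ValueError("at least one alignment is required")

    identical = [0] * window
    for ath_row, bst_row in alignments:
        if len(ath_row) != len(bst_row):
            raise ValueError("aligned rows must have equal length")
        pos = 0  # index along the ungapped Arabidopsis sequence
        for a, b in zip(ath_row, bst_row):
            if a == "-":
                continue  # insertion in B. stricta: no Arabidopsis position consumed
            if pos < window and a.upper() == b.upper():
                identical[pos] += 1
            pos += 1

    n = len(alignments)
    return [count / n for count in identical]


if __name__ == "__main__":
    demo = [("ACGT", "ACGA"), ("A-CGT", "ATCGT")]
    print(identity_profile(demo, window=4))
```

Plotting the returned list against position would give a curve of the same kind as Figure 8A; a linear regression on it is one way to test for the declining trend reported in the text.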
In the postgenomic era, the utility of comparative genomics as applied to functional and evolutionary questions has become the subject of increasing interest. This work represents the first foray into establishing a whole-genome resource from a wild relative of Arabidopsis that provides immediate comparative genomic application for an established research community; literature searches demonstrate that B. stricta and the encompassing genus, Boechera, are among the most extensively studied wild relatives of Arabidopsis. Further, the relatedness of the two species, similar breeding systems, and comparable general genome organizations contrasted against different life histories, geographic distribution, and ecology suggested that a B. stricta-Arabidopsis comparison would be highly informative in both functional and evolutionary terms.
To establish a conservative estimate for coverage of the Arabidopsis genome by homologous B. stricta end sequences, a stringent significance threshold (e-value
At the specified significance threshold, BLASTn identified a single, best Arabidopsis homolog for the vast majority of B. stricta end sequences (Fig. 3, green). Duplications, i.e. close paralogs, in Arabidopsis are also detected (Fig. 3, yellow), while regions with three or four paralogs in Arabidopsis (Fig. 3, light orange and orange) are observed less frequently. These results are consistent with what is known regarding segmental duplication in the lineage leading to Arabidopsis and related species (Arabidopsis Genome Initiative, 2000 Several informative B. stricta sequences display similarity to Arabidopsis sequences with ≥5 times representation in the Arabidopsis genome (Fig. 3; Supplemental Fig. 1, red). These Arabidopsis homologs are associated with centromeres and pericentromeric regions, suggesting that they are heterochromatic repeats that were not identified by our filtering datasets. Further work is required to position these repeats in the B. stricta genome relative to centromeric and pericentromeric regions. Nearly half of the informative B. stricta end sequences intersect unique, annotated Arabidopsis CDSs (Supplemental Data 1). This accounts for a full third of the CDSs currently identified in the Arabidopsis genome (approximately 27,500). The average identity observed among HSPs (90.3%) coupled with the mean dN/dS observed in B. stricta by Arabidopsis comparisons (approximately 0.3; K. Schmid, personal communication) points toward functional constraints on the evolution and divergence of CDSs in these two species. On a per-site basis, dN/dS quantifies the level of nonsynonymous to synonymous changes. Our estimate of informative B. stricta end sequences relative to Arabidopsis is intentionally conservative in an attempt to make the conclusions drawn from this initial analysis as unambiguous as possible. Thus, the approximately 14,000 B. stricta end sequences classified as having weak similarity to Arabidopsis genomic regions (Fig. 2) should not be interpreted as uninformative or as representing substantive interspecific differences; rather, they represent a resource for future analysis and interpretation.
From the informative B. stricta end-sequence dataset, we have identified 6,334 B. stricta sequence-indexed inserts in which both end sequences are anchored to homologous sequences in the Arabidopsis genome. The end sequences from three quarters of these inserts display conserved microsynteny or tight colinearity relative to Arabidopsis genomic regions (Fig. 4) and are distributed uniformly across each Arabidopsis chromosome (Table II; Supplemental Data 3). Microsyntenic relationships have been further confirmed by sequencing a subset of inserts to completion (Fig. 6; Supplemental Fig. 2). Taken together and in the absence of a high-resolution genetic map for B. stricta, the data demonstrate that the genomes of B. stricta and Arabidopsis are largely colinear. Further, we can infer conservatively that 26% of the Arabidopsis genome has been bracketed in our sequence-indexed clone collection (Table II) with the homologous B. stricta genomic regions comprising the sequence-indexed library. The informative B. stricta inserts with end sequences whose homologs in Arabidopsis have been categorized as linked or unlinked (Fig. 4) represent departures from conserved colinearity. These inserts may contain large indel polymorphisms, breakpoints of local rearrangements, and/or translocations in one species relative to the other; however, further investigation is required to validate such conclusions. Observations where only one end sequence for a given B. stricta insert displays similarity to Arabidopsis (Fig. 2) mostly arise as a result of poor sequence quality on one clone end, or where the significance of the BLASTn hit to a given Arabidopsis genomic region failed to meet our criterion, and cannot be meaningfully interpreted relative to conserved colinearity between the two species. However, a significant subset of solo end sequences results from transposon insertions or other indel polymorphisms that affect local but not global colinearity. 
To further validate and demonstrate the utility of microsynteny as inferred from B. stricta end-sequence data, we have used the sequence-indexed library to clone several B. stricta genomic regions that contain homologs to Arabidopsis regions of interest. B. stricta inserts that were doubly anchored to Arabidopsis by similarity routinely contained the desired B. stricta CDSs (Fig. 6B; data not shown). Further, inserts anchored to the Arabidopsis genome at a single end and inserts anchored by identity to a fully sequenced B. stricta clone previously mapped to the Arabidopsis genome, i.e. walking out in silico, also proved successful in the identification of B. stricta inserts of interest (data not shown).
Genome size is heterogeneous both within the Brassicaceae (Johnston et al., 2005
The genome of B. stricta is about 60% larger than that of Arabidopsis, but is similar in size to the diploids, Arabidopsis lyrata and Capsella rubella (Johnston et al., 2005
A major obstacle in comparative genomics is inference of orthology versus paralogy for coding regions in interspecific comparisons. This is especially true in lineages, such as the Brassicaceae, that have experienced recurrent polyploidization and/or duplication events. In these lineages, the issue is complicated not only by the number of intraspecific paralogs, but also by the sub- or neofunctionalization of these paralogs (Lynch et al., 2001
Throughout this study, the significance value attached to identified B. stricta-Arabidopsis similarity has been assumed to be the best indicator of orthology. To test this assertion more directly, we have contrasted the significance values of B. stricta BLASTn hits to known Arabidopsis paralog pairs. In a majority of instances (98%; Fig. 7), the best Arabidopsis candidate for orthology was readily identified by standard BLASTn analysis. Hence, computational approaches will prove highly informative in the establishment of orthology for comparisons within the genus Arabidopsis and among near diploid relatives (Fig. 1). These species, however, share a common history with regard to genome duplication and, as demonstrated by this study and related work (Acarkan et al., 2000
Phylogenetic footprinting and shadowing have emerged as powerful techniques for the identification of candidates for functionally conserved domains in cis-transcriptional regulators of plant genes (Koch et al., 2001b To assess the utility of B. stricta in global comparisons of 5' noncoding regions, we have aligned 657 such B. stricta regions to their Arabidopsis orthologs. In informative alignments, the average identity displayed to the Arabidopsis regions was 71.4%, indicating that neutral divergence has not been saturated. These data indicate that B. stricta and Arabidopsis 5' regions are more diverged than identified orthologous coding regions, where the average identity among HSPs was observed to be 90.3%. While more diverged than coding regions, orthologous 5' noncoding sequences still display a high degree of identity. A similar global comparison of 1,208 B. oleracea 5' regions to their Arabidopsis homologs (Fig. 8B) has indicated that while the 5' regions of Brassica are more diverged from Arabidopsis than those of B. stricta, conservation at the level of sequence is still high in these regions. On average, upstream regions displayed 61.4% identity (40.0% when noninformative alignments are included in the calculation) in B. oleracea-Arabidopsis comparisons. Interestingly, 34.8% of alignments in these comparisons were deemed noninformative (Fig. 8B). This may reflect complex ortholog-paralog relationships between B. oleracea and Arabidopsis, as evidenced by the observation that only 17.7% noninformative alignments were detected among B. stricta by Arabidopsis comparisons (Fig. 8A). When the proportion of identity among all alignments is plotted against Arabidopsis nucleotide position (Fig. 8, A and B), comparisons to B. stricta and B. oleracea display declining identity 5' to translation initiation codons. Two processes may contribute to this pattern. 
First, insertions in either species may cause regulatory domains to fall outside our window of comparison, hence distal regulatory domains may not be included in some comparisons. We expect cumulative insertions (and the probability that regulatory domains fall outside our window of comparison) to show a linear increase toward distal promoter regions, as observed in B. stricta, but not B. oleracea. Second, natural selection to maintain regulatory function may be strongest near the initiation codon, and decline in distal promoter regions. Although it is difficult to disentangle these two processes using partial shotgun sequences, the nonlinear pattern of decrease in B. oleracea suggests that both factors contribute to observed patterns of promoter similarity, making our estimates of mean similarity conservative. Our observations suggest that B. stricta, C. rubella, and even species as distant as B. oleracea are too similar to Arabidopsis to consistently identify highly conserved regulatory regions, since identity arising from functional constraint cannot be easily distinguished from that arising through common history. This highlights the necessity of multiple comparisons for effective phylogenetic footprinting of regulatory regions, even at a global scale.
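The two B. oleracea identity figures quoted above (61.4% among informative alignments, 40.0% when noninformative alignments are included) can be reconciled with a short calculation. The sketch assumes that a noninformative alignment contributes zero identity to the overall mean; that assumption is not stated in the text, but it reproduces the reported 40.0% from the reported 34.8% noninformative fraction. The function name is hypothetical.

```python
def mean_identity_all(mean_informative_pct: float, fraction_noninformative: float) -> float:
    """Overall mean identity, assuming noninformative alignments score zero identity."""
    return mean_informative_pct * (1.0 - fraction_noninformative)


if __name__ == "__main__":
    # B. oleracea versus Arabidopsis, using the values reported in the text.
    print(round(mean_identity_all(61.4, 0.348), 1))
```

The same function can be applied to the B. stricta values (71.4% and 17.7% noninformative), but the result would be a derived figure, not one reported by the authors.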
We have applied a random shotgun sequencing approach to a medium-insert size B. stricta genomic library. This approach was rapid, cost effective, and allowed the generation and analysis of a high-quality genomic sequence dataset with immediate application to comparative genomics within the Brassicaceae. Using this dataset, we have shown that analyses of B. stricta and, potentially, species of similar evolutionary distance from Arabidopsis will be informative relative to issues of genome size, coding regions, and gross genomic organization such as duplication, synteny, and orthology. Our data suggest that global comparisons to Arabidopsis have limited ability to identify conserved regulatory domains. Further, additional work is required to generalize our conclusions regarding genome evolution to lineages outside of Arabidopsis and its sister clade. As such, we suggest that similar, exploratory end-sequencing projects from a wide sampling of the Brassicaceae would be beneficial in establishing a comparative genomic strategy for the Brassicaceae research community. Such a sampling will help to inform the selection of candidate species and help to establish priorities for large-scale genome projects within the Brassicaceae in the future.
Plant Materials and DNA Preparation
Seeds derived from a Boechera stricta individual, SAD12.4 (Taylor Creek, Colorado, population), were surface sterilized and grown in liquid culture as described previously for Arabidopsis (Arabidopsis thaliana; Windsor and Waddell, 2000
A SAD12 genomic library,
A second, small-insert library was constructed by partially digesting 200 ng of SAD12 genomic DNA with Sau3AI. The digestion was halted with a 20 min incubation at 65°C followed by a 5 min incubation on ice. Digestion products were run on a standard 1.0% agarose gel and the fragments in the size interval of 1.5 to 5.0 kb were purified using NucleoSpin Extract (Macherey-Nagel) columns according to the manufacturer's instructions. Recovered SAD12 DNA fragments were incubated at 65°C for 10 min, snap cooled to 4°C for 5 min, and ligated to dephosphorylated, BamHI digested pUC19 (Fermentas) as described for the production of
Recovered SAD12 clones were transformed en masse into electrocompetent E. coli DH10b. From the resulting pool of DH10b transformants, DNA from 23,136 colonies was isolated, transferred to microtiter plates, and sequenced. As each clone has been cataloged to a unique plate and plate coordinate and each clone can, therefore, be accessed specifically on demand, we refer to this resource as the SAD12.4 sequence-indexed library.
All clones that comprise the SAD12.4 sequence-indexed library were end sequenced in both directions using T7 and T3 primers; the small-insert library clones were sequenced in both directions with M13 forward and reverse primers. Cycle-sequencing reactions were carried out in GeneAMP 2700 thermal cyclers (Applied Biosystems) using Big Dye terminator cycle-sequencing kits (Applied Biosystems) and read with an ABI PRISM 3730XL DNA sequencer (Applied Biosystems). The traces generated for both libraries were trimmed of vector and processed for quality using SeqMan 5.0 (DNAStar) at a quality threshold of 12. To augment sequence quality, SAD12.4 sequence-indexed reads with less than 200 bp of reliable sequence were excluded from further analysis. Small-insert clones were assembled into contigs with SeqMan and only clones with overlapping sequencing reads were considered further. A subset of SAD12.4 sequence-indexed library inserts were sequenced to completion. Sequencing primer sites were introduced into inserts using the HyperMu <KAN-1> insertion kit (Epicentre Biotechnologies) via a scaled-down version of the manufacturer's protocol; sequencing reactions were performed as described earlier using the primers MUKAN-1 FP-1 and MUKAN-1 RP-1 supplied with the HyperMu system. Using SeqMan 5.0, traces were first quality trimmed followed by a fixed 5' trim of 60 bp to remove MuKan sequences. pBlueStar sequences were removed manually during contig assembly.
The BLASTn program (Altschul et al., 1997
For the purpose of filtering the B. stricta sequence-indexed library dataset, BLASTn analyses with default parameters were used to compare B. stricta sequences to the sequences in the AtRepBase (Cold Spring Harbor, database and sequence set of repetitive DNAs identified in Arabidopsis; http://nucleus.cshl.org/protarab/AtRepBase.htm/) and the Arabidopsis mitochondrial and chloroplast genomes (ftp://ftp.arabidopsis.org/Sequences/). Further, BLASTx (default parameters) was used to identify B. stricta sequences with significant similarity to the transposable element translated coding-sequence dataset of Zhang and Wessler (Zhang and Wessler, 2004
To analyze the B. stricta end-sequence data, a custom suite of software was developed. These scripts were written in the Python language using the Python core distribution (distribution site for the Python object-oriented programming language; http://www.python.org/), the BioPython extension (Chapman and Chang, 2000 Duplicated SAD12.4 sequence-indexed inserts were identified by the dupCloneFinder.py script with an e-value threshold of 10^-30 and a read_error setting of 25 bp. Mapping of B. stricta sequence-indexed end sequences to physical intervals along the Arabidopsis chromosome pseudomolecules and filtering of end sequences with significant similarity to repetitive DNAs, rRNA genes, and organellar genomes were accomplished with the compGenomeFilterv2.py script and an e-value threshold setting of 10^-30. The syntenyFinder.py script was used to identify B. stricta sequence-indexed clones displaying microsynteny to Arabidopsis chromosomal regions with the following parameters: physical limit for a syntenic region in Arabidopsis, 50 kb; average read length, 826 bp; and the default duplicated region filtering mode.
B. stricta-Arabidopsis promoter region comparisons were performed using the UntransID.py program. UntransID.py was run with extra sequence quality measures: an e-value threshold of 10^-10 for Arabidopsis CDS screening, the Needleman-Wunsch global alignment algorithm, 10,000 random comparisons for the determination of the alignment quality score threshold, a quality score threshold of 300, and 100 iterations for the determination of the mean and SE for the proportion of Arabidopsis nucleotides scored as identities in random comparisons. To implement the Needleman-Wunsch global alignment algorithm, UntransID.py invokes the needle program of the EMBOSS 3.0 sequence analysis suite (Rice et al., 2000
Brassica oleracea-Arabidopsis promoter region comparisons were performed as described for the B. stricta-Arabidopsis analysis with the exception of the quality score threshold parameter, which was set to 290. This analysis was based on an initial pool of 415,521 publicly available B. oleracea shotgun sequences (Ayele et al., 2005
Annotation of fully sequenced B. stricta sequence-indexed inserts was performed using gene-prediction models generated with the Twinscan (Korf et al., 2001
Dot-plot alignments were performed with the Dotter program (Sonnhammer and Durbin, 1995
B. stricta end sequences displaying significant similarity (e-value
Detailed analysis was performed on a more conservative dataset derived from the above (Supplemental Data 4, selected dataset). To be included in this dataset, the e-value for a given B. stricta end-sequence BLASTn hit to at least one member of an Arabidopsis gene pair was
For a given B. stricta end sequence,
All statistical analyses were performed with either the Excel spreadsheet package (Microsoft) or the stats.py and pstat.py modules of G. Strangman.
The sequence-indexed and small-insert datasets are available as GenBank accession numbers DU667459 to DU708532. The B. stricta class I chitinase genomic region is available as accession number DQ275145.
The authors wish to thank X. Zhang and S. Wessler for supplying their B. oleracea/Arabidopsis transposable element CDS dataset, A. Heidel for data relating to the recovery of insect resistance loci cloned using inferred synteny and the sequence-indexed library, and U. Göbel for insightful discussions regarding promoter and 5'-untranslated region analyses. The authors also wish to thank C. Ortlepp and J. Haupt for technical support and R. Oyama, K. Schmid, E. Kellogg, and A. Navarro-Quezada for comments on the manuscript. Received November 9, 2005; returned for revision January 19, 2006; accepted February 4, 2006.
1 This work was supported by the Max Planck Society.
2 Present address: Duke University, Department of Biology, Box 91000, Durham, NC 27708-0338. The author responsible for distribution of materials integral to the findings presented in this article in accordance with the policy described in the Instructions for Authors (www.plantphysiol.org) is: Aaron J. Windsor (aaron.windsor{at}duke.edu).
[W] The online version of this article contains Web-only data. www.plantphysiol.org/cgi/doi/10.1104/pp.105.073981.
* Corresponding author; e-mail aaron.windsor{at}duke.edu; fax 919-613-8177.
Acarkan A, Rossberg M, Koch M, Schmidt R (2000) Comparative genome analysis reveals extensive conservation of genome organisation for Arabidopsis thaliana and Capsella rubella. Plant J 23: 55-62
Altschul SF, Madden TL, Schaeffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389-3402
Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408: 768-815
Ayele M, Haas BJ, Kumar N, Wu H, Xiao Y, Van Aken S, Utterback TR, Wortman JR, White OR, Town CD (2005) Whole genome shotgun sequencing of Brassica oleracea and its application to gene discovery and annotation in Arabidopsis. Genome Res 15: 487-495
Ayre BG, Blair JE, Turgeon R (2003) Functional and phylogenetic analyses of a conserved regulatory program in the phloem of minor veins. Plant Physiol 133: 1229-1239
Bao X, Franks RG, Levin JZ, Liu Z (2004) Repression of AGAMOUS by BELLRINGER in floral and inflorescence meristems. Plant Cell 16: 1478-1489
Beilstein MA, Al-Shehbaz IA, Kellogg EA (2006) Brassicaceae phylogeny and trichome evolution. Am J Bot 93: (in press)
Bennetzen JL, Ma J, Devos KM (2005) Mechanisms of recent genome size variation in flowering plants. Ann Bot (Lond) 95: 127-132
Blanc G, Hokamp K, Wolfe KH (2003) A recent polyploidy superimposed on older large-scale duplications in the Arabidopsis genome. Genome Res 13: 137-144
Blanc G, Wolfe KH (2004a) Functional divergence of duplicated genes formed by polyploidy during Arabidopsis evolution. Plant Cell 16: 1679-1691
Blanc G, Wolfe KH (2004b) Widespread paleopolyploidy in model plant species inferred from age distributions of duplicate genes. Plant Cell 16: 1667-1678
Bleeker W (2003) Hybridization and Rorippa austriaca (Brassicaceae) invasion in Germany. Mol Ecol 12: 1831-1841
Bleeker W, Matthies A (2005) Hybrid zones between invasive Rorippa austriaca and native R-sylvestris (Brassicaceae) in Germany: ploidy levels and patterns of fitness in the field. Heredity 94: 664-670
Boivin K, Acarkan A, Mbulu R-S, Clarenz O, Schmidt R (2004) The Arabidopsis genome sequence as a tool for genome analysis in Brassicaceae: a comparison of the Arabidopsis and Capsella rubella genomes. Plant Physiol 135: 735-744