Computational Finishing of Large Sequence Contigs Reveals Interspersed Nested Repeats and Gene Islands in the rf1-associated Region of Maize

The architecture of grass genomes varies on multiple levels. Large long terminal repeat (LTR) retrotransposon clusters occupy significant portions of the intergenic regions, and islands of protein-encoding genes are interspersed among the repeat clusters. Hence, advanced assembly techniques are required to obtain completely finished genomes, as well as to investigate gene and transposable element (TE) distributions. To characterize the organization and distribution of repeat clusters and gene islands across large grass genomes, we present 961 kb and 594 kb contiguous sequence contigs associated with the rf1 locus in the near-centromeric region of maize chromosome 3. We present two methods for computational finishing of highly repetitive BAC clones that have proved successful in closing all sequence gaps caused by TE insertions. Sixteen repeat clusters were observed, ranging in length from 23 kb to 155 kb. These repeat clusters are almost exclusively LTR retrotransposons, of which the paleontology of insertion varies throughout the cluster. Gene islands contain from 1 to 4 predicted genes, resulting in a gene density of 1 gene per 16 kb in gene islands, and 1 gene per 111 kb over the entire sequenced region. The two sequence contigs, when compared to the rice and sorghum genomes, retain gene collinearity of 50% and 71%, respectively, and of 70% and 100%, respectively, for high-confidence gene models. Collinear genes on single gene islands show that while most expansion of the maize genome has occurred in the repeat clusters, gene islands are not immune and have experienced growth in both intra- and inter-gene locations.


INTRODUCTION
Sequencing of the Zea mays (maize) genome is nearing completion (Bennetzen et al. 2001; Chandler and Brendel 2002; Wessler 2006); it is the largest and most difficult to assemble plant genome sequenced to date. Maize is an important economic, agricultural, industrial, and research crop. However, with a genome close to the size of the human genome (2.8 Gb) and its high percentage of repetitive elements, acquiring the maize genome sequence seemed a daunting task.
Approximately 67% of the genome is made up of transposable elements (TEs) (Haberer et al. 2005; Kronmiller and Wise 2008), increasing the difficulty of assembly (Rabinowicz and Bennetzen 2006). Much exploratory work has gone into isolating and sequencing just the gene areas and ignoring the repetitive regions, both by methylation filtration (Rabinowicz et al. 1999; Palmer et al. 2003; Whitelaw et al. 2003) and high-C0t (Whitelaw et al. 2003; Yuan et al. 2003) systems, which have assisted researchers with selecting only genic regions to sequence. These methods have captured a majority of the maize genic sequence (Fu et al. 2005), but they still have the potential to miss important regions. The current genome-sequencing project aims to capture the entire gene set of maize, including regulatory regions. However, the current strategy will not provide a fully assembled genome, but rather assembled bacterial artificial chromosome (BAC) contigs ordered and oriented to provide complete gene regions that are adjacent to potentially incomplete TE clusters.
The landscape of the maize genome provides an interesting challenge for both sequencing and subsequent annotation. A high density of long terminal repeat (LTR) retrotransposons has had a direct effect on the genome size of many plant genomes, including maize (SanMiguel et al. 1996; Bennetzen et al. 2005; Hawkins et al. 2006; Piegu et al. 2006). Besides expanding genome size, LTR retrotransposons can have an impact on the evolution of the species (Kidwell and Lisch 2000). LTR retrotransposon insertions tend to form nested clusters (SanMiguel and Bennetzen 1998).
Previous studies of large contiguous regions of maize have provided a general view of the landscape of the genome. Unfinished sequence totaling 7.8 megabases (Mb) from chromosome 1 and 6.6 Mb from chromosome 9 shows a gene density of 1 gene per 33 kb and 27 kb, respectively (Bruggmann et al. 2006). BAC contigs ranging in size from 126 kilobases (kb) to 405 kb show a gene density of 1 gene per 19 kb, with genes found in small groups between large repeat clusters (Brunner et al. 2005). Genome-wide analysis of maize BACs has painted a different picture: while the gene density of 100 random BACs, at 1 gene per 44 kb, was similar to the above results, genes were not observed in tight clusters (Haberer et al. 2005). This dichotomy of gene density is also seen when investigating gene-specific areas of maize. Analysis of gene-rich regions such as the 22-kD α-zein gene family on maize chromosome 4 reveals a high density of genes, with one gene observed per 10 kb over 346 kb (Song et al. 2001). The Adh1 locus on maize chromosome 1 contains two genes across 280 kb, or 1 gene per 140 kb. Perhaps the only message learned here is that gene density across the maize genome varies to a great degree; large contiguous sequenced regions can begin to capture the true diversity of maize chromosome architecture.
In order to characterize large contiguous regions of maize sequence, we identified and sequenced two B73 BAC contigs from the centromeric region of chromosome 3. These contigs of 961 kb and 594 kb correspond to contigs 117 and 119 respectively on maize WebFPC (Wei et al. 2007) and span regions associated with the rf1 (restorer of fertility) locus for Texas (T) cytoplasmic male sterility (cmsT) (Duvick et al. 1961;Wise et al. 1996). As a foundation for the isolation of the Rf1 locus, four rf1 male-sterile mutants were recovered from a screen of 123,500 flowering plants (Wise et al. 1996). A 5.5 kb Mu1 hybridizing EcoRI restriction fragment was identified that cosegregated with the rf1-m3207 allele. Sequences from this fragment were hybridized to a Rf1 cDNA library and probes designed from the identified cDNA, p6140-1 (Wise et al. 1999), were found to cosegregate with the rf1 locus in a recombinant population selected from over 10,000 progeny.
Using probes designed off the 5.5 kb cosegregating restriction fragment and the p6140-1 cDNA, we have identified two BAC contigs spanning the rf1 locus. Sixteen BACs were sequenced to completion to provide high-quality finished sequence. Here we present two methods for computational finishing of highly repetitive grass genomes, which were successfully utilized to close 11 TE-induced gaps. Sixteen nested repeat clusters were found, each spanning as much as 155 kb and containing a variety of LTR retrotransposon types and ages of insertion. Genes are found tightly clustered, showing a density of 1 gene per 16 kb within gene islands. Finally, comparative analysis to Oryza sativa (rice) and Sorghum bicolor (sorghum) shows that while many genes are retained across all three species, genes have been both lost and translocated across the genomes.

RESULTS AND DISCUSSION
Mapping, Sequence, and Assembly of Maize rf1 Contigs

Six tightly linked and cosegregating low-copy AIMS fragments (Frey et al. 1998) were identified from three rf1-m families (Wise et al. 1996) and, along with sequences selected from the cosegregating 6140-1 cDNA (Wise et al. 1999), were used as probes against the first B73 library filters (ZMMBBa, Clemson University Genomics Institute). From each of the resulting short non-overlapping contigs, overgo probes were designed off sequenced BAC ends. Low-copy hybridizing probes were used for the next round of hybridization to the B73 BAC library; identified BACs were used to extend the length of the existing rf1 contigs.
After the NSF-sponsored maize physical mapping project was underway (NSF-PGR #9872655), additional BAC clones were identified by hybridization to the ZMMBBb and ZMMBBc libraries and subsequently via in silico overlaps from the maize WebFPC database (Coe et al. 2002;Wei et al. 2007) and a minimal tiling path was constructed from a total of 796 BACs from the three B73 BAC libraries. The minimal tiling path formed two contigs both located on chromosome 3. rf1-associated contig 1 (rf1-C1), and rf1-associated contig 2 (rf1-C2), correspond to contigs 117 and 119 respectively in maize WebFPC and are located in maize bin 3.04.

Sequencing and Initial Assembly of Maize BACs
Sixteen BAC clones were fully shotgun sequenced to provide the most accurate representation of this region. Initial sequencing produced 8- to 9-fold coverage; however, after initial assembly, additional plates were produced if the BAC was deemed highly repetitive. Once draft sequence was completed, BACs averaged 10X coverage, depending on their repeat content (Table I).
BACs were finished via standard gap-closing techniques (see Methods).
At this stage, BAC assemblies were as close to the best possible condition as finish sequencing could bring them (Table I). Finished BAC clones were verified with restriction digest analysis. For non-gap regions, base-pair quality is well within sequencing standards, with less than 1 error in 1 × 10^5 bases per BAC assembly. In the minimal tiling path, BACs average 32 kb of overlap, although in the area of the overlap between ZMMBBb0211C05 and ZMMBBb0331I02 multiple BACs were sequenced to resolve mapping discrepancies. Fully assembled, rf1-C1 is 961 kb and rf1-C2 is 594 kb; they have been submitted to GenBank (Benson et al. 2006) as EF517601 and EF517600, respectively (Fig. 1).

Characterization of Repetitive Gaps in Maize Sequence Assembly
In particular, two methods proved very useful to resolve maize sequence gaps that were unclosable with traditional laboratory-based finishing methods. Eleven gaps in the BAC assemblies were closed with purely computational methods. Two cases of gap-causing misassembly were found to be common in maize BACs, both involving the duplicated LTR regions of retrotransposons. The first misassembly type is much like any misassembly caused by a duplicated area within a BAC: the traces for one LTR all assemble into the second copy, breaking the sequence of the first LTR and causing a gap. This was seen most often in TEs with long LTRs, where the whole sequence trace or even both end sequences from an entire subclone were within the LTR boundaries. It was also commonly seen in LTR retrotransposons with a recent age of insertion, because fewer polymorphisms had accumulated between the two LTRs since insertion, causing more assembly confusion.
The second common case of misassembly was also caused by the LTRs of retrotransposons, and was seen when an LTR retrotransposon nested into one of the LTRs of an existing LTR retrotransposon. In this type the gap can be found in either of the two LTRs of the first retrotransposon (Fig. 2A). Once this insertion occurs, the sequence of one LTR is interrupted by the sequence of the nested transposon. During assembly, the sequence from the complete LTR incorrectly aligns to both LTR locations, removing the join between the interrupted LTR and the nested TE. This recruitment causes a gap: one or both of the contig ends that point into the gap now have assembled traces belonging to the other LTR (Fig. 2B). The result can be one of two gaps in the nested LTR, or one gap in the un-nested LTR of the original LTR retrotransposon.
The closure of the final unfinished gap, found in ZMMBBb0331I02 (Table I)

Computational Methods for Closing Difficult Gaps: Genome Based Approach
Two computational methods were designed to combat the misassemblies caused by repetitive sequences. The first method is termed the genome based approach because it uses the biological or genomic information present in the BAC sequence to determine the correct assembly configuration. As explained above, many assembly gaps occurred when similar sequences were found in multiple locations in the BAC. In maize, this occurs frequently with the long LTRs of retrotransposons: the traces for one or more locations collapse into a single assembled copy.
Our genome based approach uses the structure of the nested TEs to suggest the gap filling sequence.
The first step in the genome based approach was to run both contigs surrounding the gap through TEnest (Kronmiller and Wise 2008). This gave two pictures of the nested structure of the contigs, and the TE insertions leading into the gap were examined for any gap-split TE insertions. For example, for the gap presented in Figure 2, one contig end would contain a partial LTR (and possibly some internal TE sequence) of one nested retrotransposon near the end of the gap; the other contig would contain the other sections of this partial retrotransposon along with the complete sequence of the nested TE. Additional TE insertions beyond those in the simple example of Figure 2 could confound the identification of the split TE, but they could also be of assistance. If the two TEs shown in the example were both nested in an older TE insertion, the older TE would also be split around the gap, at an even greater distance from the problem region, providing more evidence for the nesting pattern.
Once the nesting structure of the TEs was identified using the above process, a string of DNA sequence could be filled in to span the gap. Sequence surrounding the gap was built to resemble the predicted nested TE structure. This built sequence contains three sections: first, the split LTR, formed by identifying its missing sequence, donated by the corresponding full LTR; second, the join point between the split LTR and the nested TE, exactly identified by the nested location on the other side of the split LTR; and finally, the sequence of the nested TE, added to complete the sequence spanning the gap. A low-quality backbone phd file (Ewing et al. 1998) was created from the proposed gap-spanning sequence and used to drive the phrap assembly. From here the correct sequence traces were found either during the assembly or by the user in Consed (Gordon et al. 1998). Several iterations were generally required to add or remove any sequence differences between the proposed backbone and the true sequence. Ultimately, sequences were found to span the gaps; custom sequencing primers were designed to help span low-quality regions if necessary.
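The construction of the proposed backbone can be illustrated as simple string assembly. The following is a minimal sketch, not the scripts used in this work: the function name, its arguments, and the uniform low quality value are hypothetical, target site duplications are ignored, and the result would still have to be written out in phd format before being given to phrap.

```python
def build_backbone(full_ltr, nest_offset, nested_te, low_quality=1):
    """Propose a gap-spanning backbone for an LTR interrupted by a nested TE.

    full_ltr    -- sequence of the intact LTR, used as donor for the split LTR
    nest_offset -- 0-based position in the LTR where the nested TE inserted
                   (assumed known from the nesting pattern on the other contig)
    nested_te   -- complete sequence of the nested TE
    low_quality -- hypothetical quality assigned to every backbone base so that
                   real traces outvote the backbone during assembly
    """
    if not 0 <= nest_offset <= len(full_ltr):
        raise ValueError("nest_offset must lie within the LTR")
    left_part = full_ltr[:nest_offset]    # split LTR up to the join point
    right_part = full_ltr[nest_offset:]   # remainder of the split LTR
    sequence = left_part + nested_te + right_part
    qualities = [low_quality] * len(sequence)
    return sequence, qualities


# Toy example with short made-up sequences.
backbone, quals = build_backbone("AAAACCCC", 4, "GGTT")
```

Because every backbone base carries a low quality, differences between the proposal and the true sequence are expected to be corrected by the real traces over successive assembly iterations, as described above.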

Computational Methods for Closing Difficult Gaps: Sequence Based Approach
The second computational method used for difficult gap closure used the sequence information from paired-end plasmids. Essentially mimicking a localized constrained assembly, the sequence based approach would back out of the gap into the contig, looking for unique sequence not duplicated elsewhere in the BAC assembly. This process backed up on both contigs for at least 4 kb (the largest plasmid clone length), but often much farther, to find unique sequence. At the unique locations, all the traces found in the area and the plasmid end pairs for these traces were built into separate assemblies. Backbone phd sequences were made from these small localized assemblies, and overlapping sequences and their mate pairs were again added and assembled into the localized assemblies. This continued until the contigs identified and correctly assembled missing or incorrectly placed sequences and walked into the gap.
This sequence based approach was most useful on the simpler gaps caused by duplicated regions in the BAC that condensed the sequence into one region. In these misassemblies the collapsed traces were identified by their plasmid mate pairs anchored in unique sequence and forced to assemble into the duplicated copy. This process also proved helpful for building a backbone phd sequence when closing gaps by the genome based approach explained above.
Often the sequences needed to span the gap were hard to identify, or did not match the predicted backbone sequence well enough to be found by assembly or by hand; this sequence based method was useful to draw them to the correct location.
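The core of the sequence based approach, identifying collapsed traces through mates anchored in unique sequence, can be sketched as follows. This is an illustrative sketch under stated assumptions: the `Read` record, the coordinate lists, and the function names are hypothetical, and read placements are assumed to have been exported from the draft assembly beforehand.

```python
from dataclasses import dataclass


@dataclass
class Read:
    name: str
    mate: str    # name of the paired end read from the same plasmid subclone
    start: int   # placed start coordinate in the draft assembly
    end: int     # placed end coordinate in the draft assembly


def lies_within(read, regions):
    """True if the read is fully contained in any (start, end) region."""
    return any(s <= read.start and read.end <= e for s, e in regions)


def reads_to_recruit(reads, unique_regions, repeat_regions):
    """Names of reads placed in a repeat whose mate is anchored in unique sequence."""
    by_name = {r.name: r for r in reads}
    recruits = []
    for r in reads:
        mate = by_name.get(r.mate)
        if mate is None:
            continue  # unpaired read: no positional evidence available
        if lies_within(r, repeat_regions) and lies_within(mate, unique_regions):
            recruits.append(r.name)
    return sorted(recruits)


# Toy example: a1 sits in the collapsed repeat, its mate a2 in unique sequence.
reads = [
    Read("a1", "a2", 1000, 1800),
    Read("a2", "a1", 5000, 5700),
    Read("b1", "b2", 1100, 1900),
    Read("b2", "b1", 1200, 1950),
]
recruits = reads_to_recruit(reads, unique_regions=[(4500, 6000)],
                            repeat_regions=[(900, 2000)])
```

Reads recruited this way are candidates for the localized assembly next to the unique anchor; reads whose mates also fall in the repeat (b1 and b2 here) carry no such evidence.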

Transposable Element Annotation Reveals Large Repeat Clusters
The two sequence contigs were repeat annotated with TEnest. Definite separation between gene areas and repeat areas can be seen when large sections of the maize genome are evaluated. In maize this phenomenon is known as oceans and islands, where islands of genes are found within oceans of repetitive clusters (SanMiguel et al. 1998).
For this analysis of repeat clusters, we defined a cluster or ocean as a group of nested or closely inserted TEs. TEs found inserted less than 5 kb from each other and not separated by a predicted non-transposon related gene were grouped together as a repeat cluster. Groups of TEs identified by this definition that contained fewer than three TE insertions were designated as TE insertions within a gene island and so were left out of repeat clusters. In total, sixteen TE clusters were identified in the two rf1-associated contigs. Repeat clusters range in size from 23 kb to 155 kb and contain from 3 to 18 TE insertions (Table II). While sizes of TE clusters are generally evenly distributed across the contigs, the two largest clusters are found on the smaller 594 kb rf1-C2 contig. This corresponds to its higher repeat percentage, 82% versus 75% for rf1-C1.
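The repeat-cluster definition above lends itself to a short interval-grouping sketch. The code below is an illustration only: the function name and the `(start, end)` tuple representation are assumptions, and a gene is treated as separating two TEs only when it lies entirely in the gap between them.

```python
def repeat_clusters(tes, genes, max_gap=5000, min_insertions=3):
    """Group TE intervals into repeat clusters.

    tes, genes -- lists of (start, end) coordinates in bp
    TEs closer than max_gap (nested TEs have a negative gap) are merged unless
    a gene lies between them; groups with fewer than min_insertions TEs are
    dropped, since they count as insertions within a gene island.
    """
    clusters, current, current_end = [], [], None
    for start, end in sorted(tes):
        if current:
            gene_between = any(
                current_end <= g_start and g_end <= start
                for g_start, g_end in genes
            )
            if start - current_end < max_gap and not gene_between:
                current.append((start, end))
                current_end = max(current_end, end)
                continue
            clusters.append(current)
        current, current_end = [(start, end)], end
    if current:
        clusters.append(current)
    return [c for c in clusters if len(c) >= min_insertions]


# Toy coordinates: one nested group of three TEs, then smaller groups.
tes = [(0, 10000), (2000, 6000), (12000, 20000), (40000, 45000),
       (46000, 50000), (60000, 70000), (71000, 80000), (82000, 90000)]
genes = [(80500, 81500)]
clusters = repeat_clusters(tes, genes)
```

In this toy input only the first three TEs form a repeat cluster; the remaining groups either have fewer than three insertions or are split by the gene at 80.5 kb.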

TEnest displays clusters of TE insertions, with multiple layers of chronologically inserted TEs nested into one another. As repeat clusters become more dense and complex, the heights or levels of these TE insertion clusters increase (Fig. 1).
Exon structure for each gene model identified by the three prediction programs was plotted on GBrowse and compared to the evidence-based sequence alignments. A consensus approach between the results of the three gene prediction programs and the EST and protein alignments was used to pick candidate genes and build gene models. Eight predicted genes were identified in rf1-C1 and 6 in rf1-C2 (Table III). Complete gene models were identified for all 14 predicted genes. Gene model and exon coordinates are given in Supplemental Table S1. Predicted functions were assigned to 9 of the identified genes (Table III). Genes to which we were unable to assign a function were given one of two notations: predicted, if the predicted protein has a full-length alignment to other submitted, functionally uncharacterized proteins; or hypothetical, if the predicted protein has a less-than-full-length alignment to submitted proteins. Hypothetical predicted genes, while having complete gene
We identified 14 gene islands as a result of characterization of 16 nested TE clusters.
Because our repeat-cluster definition (explained above) did not allow repeat clusters to contain predicted non-transposon related genes, all of the predicted genes are found in these 14 gene islands. While genes found within gene islands or between islands do not seem to form any tight clusters, there is obvious clustering of genes when observed on a contig-wide scale. Gene islands have just one or a few predicted gene annotations; no gene islands contain large clusters of genes.
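The contig-wide gene density follows directly from the contig lengths and gene counts reported above. A short check, using only those reported figures (the variable names are illustrative):

```python
# Reported contig lengths (kb) and predicted gene counts.
contig_length_kb = {"rf1-C1": 961, "rf1-C2": 594}
predicted_genes = {"rf1-C1": 8, "rf1-C2": 6}

total_kb = sum(contig_length_kb.values())    # 1555 kb sequenced in total
total_genes = sum(predicted_genes.values())  # 14 predicted genes
kb_per_gene = total_kb / total_genes         # about 111 kb per gene

per_contig = {
    name: contig_length_kb[name] / predicted_genes[name]
    for name in contig_length_kb
}
print(f"1 gene per {kb_per_gene:.0f} kb over {total_kb} kb")
```

The island-level figure of 1 gene per 16 kb cannot be reproduced the same way here, because it depends on the individual gene island lengths rather than the contig totals.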

Collinearity between Orthologous regions in Maize, Rice, and Sorghum
To examine sequence collinearity between grass genomes, the 14 gene islands were aligned to the rice assembly (IRGSP 2005) version 5 and the sorghum assembly version 1 (sbi1). Seven out of the 14 predicted maize genes align when compared to the rice genome, all 7 in a syntenic location on rice chromosome 1. As illustrated in Figure 4 and Table III, predicted maize along the rice chromosome at 13.2 Mb. Of the 7 genes found in conserved collinear locations, 2 (genes 8 and 13) are found in reverse orientation relative to maize. One non-predicted region on the maize contigs, a region near 85 kb on rf1-C1, aligns to rice gene Os01g14670 on rice chromosome 1, also in this conserved location, near 8.2 Mb. These conserved gene regions show expanded intra-gene distance in maize as compared to rice, as expected from the increased density of repeat clusters surrounding gene islands.
Ten of the 14 predicted maize genes align to the sorghum genome. On rf1-C1, predicted genes 1, 2, and 4 align with a conserved order and orientation to a 50 kb region on sorghum chromosome 3 near 5.4 Mb. This same set of predicted maize genes, along with genes 5, 6, 7, and 8, is also found on sorghum chromosome 3 near 10.2 Mb (Table III, Fig. 4). This shows that at least 500 kb of the maize sequence is duplicated in the sorghum genome on the same chromosome, while only one copy of this region is found in rice, and only one copy is found in the currently sequenced maize genome. Similar to the rice genome comparison, the non-predicted region near 85 kb on rf1-C1 aligns to sorghum chromosome 3 at both the 5.4 and 10.2 Mb locations. rf1-C2 gene predictions show that genes 10, 12, and 13 are shared between maize and sorghum over the sequence of this contig in similar order and orientation.
The four maize genes that did not have sorghum counterparts correspond to the four hypothetical gene predictions, further suggesting these may not be real genes.
The set of 7 predicted maize genes found on rice chromosome 1 in a conserved order is contained in the set of 10 genes found conserved when compared to the sorghum genome. The two genes in conserved order and location but found in a reverse direction in rice are seen in the same orientation in maize and sorghum, suggesting the direction change for these genes occurred either in rice after the split from maize/sorghum, or in the maize/sorghum ancestor. Three maize genes are found in two locations on sorghum chromosome 3; these genes are not found duplicated in the rice genome. Nor are these 3 genes seen duplicated in the initial maize genome sequence, either on chromosome 3 or elsewhere.

CONCLUSION
Based on the sequence length, 78% of the rf1-associated contigs consist of repetitive sequences ( repetitive (Lander et al. 2001); mouse, 37.5% repetitive (Waterston et al. 2002), and Drosophila, 3.9% repetitive (Kaminker et al. 2002), this near-centromeric region of maize chromosome 3 is not proportionally more difficult (Celniker et al. 2002) to bring to sequence completion. This is due to a number of reasons. First, the maize genome has many families of transposable elements, and therefore a given BAC is less likely to contain multiple copies of one type of element. Second, the average size of transposable elements in maize is larger than in other sequenced organisms, again decreasing the chance of obtaining multiple copies in a single BAC. Third, simple repeats are much less common in maize than in some other sequenced organisms. Simple repeats are generally small (less than 500 bp, similar in length to sequence traces) and tandemly duplicated, causing havoc with assembly algorithms.
Fourth, the phenomenon of nesting transposable elements in maize is only seen on a small scale in previously sequenced genomes (Quesneville et al. 2005). Nesting within a transposable element will break up the repetitive sequence into smaller sections. Once broken up, these segments are flanked by unique sequences in relation to other similar elements, and so are actually easier to assemble. Unfortunately for sequence assemblies, LTRs of maize transposable elements are in general much longer than those of other sequenced genomes. LTRs are very similar to each other, and they cause many of the gaps seen in initial draft assemblies.
Seven of the predicted maize genes are found conserved in the rice genome, and 10 are found in the sorghum genome. One non-predicted gene region is found conserved in both rice and sorghum; it is not near any predicted maize genes, which suggests it is a pseudogene. Fifty percent of predicted maize genes are found in collinear locations on rice chromosome 1, and 71% are found collinear to sorghum chromosome 3.
For high-confidence gene models (the set of predicted maize genes excluding those termed hypothetical), 70% are found in collinear locations on rice chromosome 1 and 100% are found in collinear locations on sorghum chromosome 3. Of genes found conserved across both compared organisms, 27% of shared genes are not seen collinear between maize and rice, and 23% of shared genes are not seen in collinear locations between maize and sorghum. Gene islands are not found conserved in their entirety in their orthologous locations. Rather, gene islands are made up of one to two collinear genes, with additional genes found at other chromosome locations or not found in the comparison organism. In the maize to rice comparison, one gene island is found containing at least two genes in the collinear region. The distance between these two genes has expanded almost 7-fold in maize. In the maize to sorghum comparison, three sets of genes are found with two genes in a gene island in the collinear region. One set of genes is seen with a similar distance between the genes in maize and sorghum, one set has had an approximately 3-fold expansion in maize relative to sorghum, and the final set of genes, the same set observed in the maize to rice comparison, has experienced an almost 9-fold increase of inter-gene distance in the maize genome. While the most common increase of inter-gene distance has occurred between gene islands, increase in genome sequence is not limited to repeat clusters. In several instances, genes found on the ends of collinear regions of rice and sorghum did not have a maize counterpart; however, due to the increased inter-gene distances, these genes may be found off the ends of our sequenced contigs.
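The collinearity percentages quoted at the start of this section can be recovered from the gene counts given in the Results: 14 predicted genes, of which 4 are termed hypothetical, with 7 collinear in rice and 10 collinear in sorghum. A brief check (the helper name is illustrative):

```python
def percent(part, whole):
    """Percentage rounded to the nearest whole number."""
    return round(100 * part / whole)


predicted = 14
hypothetical = 4
high_confidence = predicted - hypothetical   # 10 gene models

collinear = {"rice": 7, "sorghum": 10}

all_models = {sp: percent(n, predicted) for sp, n in collinear.items()}
high_conf_models = {sp: percent(n, high_confidence) for sp, n in collinear.items()}
print(all_models, high_conf_models)
```

The high-confidence figures assume, as stated in the Results, that none of the four hypothetical genes has a rice or sorghum counterpart, so the numerators stay at 7 and 10.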
Sixteen repeat clusters were identified across the 2 sequenced contigs. These clusters are 23 kb to 155 kb long and contain a variety of TEs and LTR retrotransposons with a range of insertion ages. In a few cases, several LTR retrotransposon families are seen highly clustered in tight groupings within 1 to 2 repeat clusters, which may indicate preferential nesting of TEs. Recent insertions of LTR retrotransposons, those that can be considered the currently active replicating and transposing elements, are seen almost exclusively in the top levels of nested repeat clusters. Insertions into these locations are further away from genes, and therefore mutations in these regions have a less detrimental effect on the organism. Gene islands, located between each repeat cluster, range from 4 kb to 98 kb long and contain from 1 to 4 gene predictions. The average gene density across islands is 1 gene per 16 kb for islands that contain genes. This density is not consistent across islands; larger gene islands do not necessarily contain more genes. While it may be an artifact of our definition of repeat oceans and gene islands, TEs found inserted in gene islands are seen on a very small scale as opposed to the large nested repeat clusters. In all but one case, LTR retrotransposon insertions in gene islands are estimated to have older ages of insertion when compared to the younger TE insertions on upper levels of repeat clusters. This suggests TEs integrated near genes are rare, or not selected for, possibly due to their potential to cause plant-altering mutations. One LTR retrotransposon is seen within the intron of predicted gene 10, increasing the size of the intron by 4.5 kb. The rice and sorghum orthologs of maize predicted gene 10 do not share this observed increase of intron length due to TE insertion. of related grass genomes to the clustering of genes or repeats, diversity is observed at different sequence scales and across various sequence lengths.
We hope the assembly techniques presented here will assist the community, ultimately providing long contiguous grass genome assemblies that facilitate examination of the genome as a whole.

Identification of BACs in the rf1 Region
Three rf1-m allele families (rf1-m3207, rf1-m7323, and rf1-m7212) (Wise et al. 1996)
The oligonucleotides were annealed to each other and a fill-in reaction was performed using [α-32P] dCTP and dATP. The BAC-end overgos were labeled by a revision of the random priming technique with [α-32P] dCTP and dATP. The hybridization protocol for overgos was similar to that explained above for AIMS probes, except that overgos were hybridized at 58 °C and washed in two 15-minute washes in 1X SSPE, 0.1% SDS and one 15-minute wash in 0.5X SSPE, 0.1% SDS at 58 °C.

BAC Sequencing and Assembly
BAC clones were sequenced by MWG Biotech (Ebersberg, Germany). BACs were sheared and cloned into 3 kb subclone libraries; subclones were end sequenced to a coverage of 8 to 10X.
The BAC sequences were initially assembled with the phred/phrap package (Ewing and Green 1998; Ewing et al. 1998) (http://www.phrap.org) to determine the coverage condition. If the BAC was deemed highly repetitive, additional plates of sequence were produced to increase the sequence depth.
Finishing assembly was conducted with phrap and CAP3 (Huang and Madan 1999). To increase quality of poor regions, low quality and failed subclone sequences were identified for re-sequencing. If low quality was due to the DNA structure (hairpin folding) or difficult sequence (mono/dinucleotide strings) subclone sequences were identified and re-sequenced with alternate sequencing chemistries. To close gaps, sequencing primers were designed and sequenced off the subclone and BAC template in order to walk in the direction of the gap. For larger gaps, PCR primers were designed surrounding the area and amplified to make template for sequencing into the gap. Entire plasmid subclones and PCR products were identified that spanned the gap regions and other unsequenceable areas and fully sequenced with transposon bombing insertion methods (Kimmel et al. 1997).
Assembly of repetitive gap regions was aided by use of TEnest (Kronmiller and Wise 2008). Individual BACs and combined BAC contigs were run with TEnest using default parameters on the provided maize repeat database. Collapsed repeat-spanning assemblies were manipulated with Consed (Gordon et al. 1998). HindIII restriction digests were compared to in silico digestion of finished sequence files (Marra et al. 1997). Any discrepancies found between the two digestions were re-examined for sequence misassemblies.
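The in silico side of the HindIII comparison can be sketched in a few lines. This is a minimal illustration rather than the software used here: it treats the finished sequence as a linear molecule, ignores the BAC vector, and the function names and the 5% size tolerance are assumptions. HindIII recognizes AAGCTT and cuts after the first A.

```python
import re

HINDIII_SITE = "AAGCTT"  # HindIII recognition site, cut as A^AGCTT
CUT_OFFSET = 1           # cut position relative to the start of the site


def hindiii_fragments(seq):
    """Fragment lengths from a complete HindIII digest of a linear sequence."""
    seq = seq.upper()
    cuts = [m.start() + CUT_OFFSET for m in re.finditer(HINDIII_SITE, seq)]
    bounds = [0] + cuts + [len(seq)]
    return [b - a for a, b in zip(bounds, bounds[1:])]


def unmatched_fragments(predicted, observed, tolerance=0.05):
    """Pair predicted and observed sizes; return those left without a partner."""
    remaining = sorted(observed)
    missing = []
    for size in sorted(predicted):
        match = next(
            (o for o in remaining if abs(o - size) <= tolerance * size), None
        )
        if match is None:
            missing.append(size)
        else:
            remaining.remove(match)
    return missing, remaining


# Toy sequence with two HindIII sites.
fragments = hindiii_fragments("GG" + "AAGCTT" + "CCCC" + "AAGCTT" + "T")
discrepancies = unmatched_fragments([3000, 10000, 6000], [3050, 6100, 15000])
```

A predicted fragment with no observed band of similar size, or the reverse, marks a region worth re-examining for misassembly.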

Annotation of BAC Contigs
Sequence files masked with TEnest were used for gene predictions. Three programs were used: GeneSeqer (Schlueter et al. 2003), FGENESH (Salamov and Solovyev 2000), and GeneMark.hmm (Lukashin and Borodovsky 1998). Predicted gene models were compared across the three prediction programs to determine a consensus for predicted genes. Protein and EST databases for A. thaliana, A. sativa, B. distachyon, H. vulgare, O. sativa, S. officinalis, S. cereale, S. bicolor, T. aestivum, and Z. mays were downloaded from GenBank (Benson et al. 2006) and aligned to determine gene models. Predicted gene exons were exported and examined for similarity to the plant protein, EST, and predicted gene sets of GenBank to determine possible functions. Output from gene prediction programs, alignments to protein and EST sequences, and predicted genes were displayed on the GMOD package GBrowse (Stein et al. 2002).

Comparative Analysis of Sequence Contigs
Orthologous regions were identified using the VISTA comparative genomics tools (Dubchak et al. 2000; Mayor et al. 2000; Bray et al. 2003; Brudno et al. 2003; Frazer et al. 2004).

Data Access
Upon request, all novel materials described in this publication will be made available in a timely manner for non-commercial research purposes, subject to the requisite permission from any third-party owners of all or parts of the material. Obtaining any permissions will be the responsibility of the requestor.
GenBank sequence submissions reported in this manuscript: rf1-associated contig 1, EF517601, and rf1-associated contig 2, EF517600.
In either case, the sequence of the left LTR of the green TE has been split apart, sequences belonging to the right LTR have incorrectly assembled at this split location (shown as the arrow pointing to the red sequence) and cause the gap assembly. The assembly gap can also occur on the right LTR of the green TE. Here the join sequences between the left LTR and the blue TE, found on both sides of the blue TE insertion, can assemble incorrectly into the sequence of the right LTR and prevent the sequence from aligning. Successful closing of these types of gaps is crucial to characterization of maize nested repeat clusters.
Figure 4. Comparative analysis of maize sequence contigs to the rice and sorghum genomes. The two sequenced rf1-associated BAC contigs are shown in the center of the figure, predicted genes are shown as red rectangles on the black sequence contig lines with gene identification numbers found in red above. Comparative sequence analysis to rice is shown at the top half of the figure, shared sequence regions between maize and rice are shown as green connecting lines. Comparative sequence analysis to sorghum is shown at the bottom half of the figure, shared sequence regions between maize and sorghum are shown as blue connecting lines. Collinear regions are seen between maize chromosome 3, rice chromosome 1 and sorghum chromosome 3. Seven out of 14 predicted genes are found in collinear order and orientation between maize and rice. 10 out of 14 predicted genes are found in collinear order and orientation between maize and sorghum, 3 of these genes are found duplicated in a second location on sorghum chromosome 3. One non-predicted gene region at the left end of rf1-C1 aligns to collinear regions in both rice and sorghum. This is probably a maize pseudogene.