Genome-wide analysis of the core DNA replication machinery in the higher plants Arabidopsis and rice.

Core DNA replication proteins mediate the initiation, elongation, and Okazaki fragment maturation functions of DNA replication. Although this process is generally conserved in eukaryotes, important differences in the molecular architecture of the DNA replication machine and the function of individual subunits have been reported in various model systems. We have combined genome-wide bioinformatic analyses of Arabidopsis (Arabidopsis thaliana) and rice (Oryza sativa) with published experimental data to provide a comprehensive view of the core DNA replication machinery in plants. Many components identified in this analysis have not been studied previously in plant systems, including the GINS (go ichi ni san) complex (PSF1, PSF2, PSF3, and SLD5), MCM8, MCM9, MCM10, NOC3, POLA2, POLA3, POLA4, POLD3, POLD4, and RNASEH2. Our results indicate that the core DNA replication machinery from plants is more similar to vertebrates than single-celled yeasts (Saccharomyces cerevisiae), suggesting that animal models may be more relevant to plant systems. However, we also uncovered some important differences between plants and vertebrate machinery. For example, we did not identify geminin or RNASEH1 genes in plants. Our analyses also indicate that plants may be unique among eukaryotes in that they have multiple copies of numerous core DNA replication genes. This finding raises the question of whether specialized functions have evolved in some cases. This analysis establishes that the core DNA replication machinery is highly conserved across plant species and displays many features in common with other eukaryotes and some characteristics that are unique to plants.

DNA replication depends on the coordinated action of numerous multiprotein complexes. At the simplest level, it requires an initiator to establish the site of replication initiation, a helicase to unwind DNA, a polymerase to synthesize new DNA, and machinery to process the Okazaki fragments generated during discontinuous synthesis. Much is known about the DNA replication machinery in yeast (Saccharomyces cerevisiae) and animal model systems, but relatively little is known about the apparatus in plants. To gain insight into plant DNA replication components, we have combined published experimental information with our own bioinformatic analysis of genomic sequence data to examine the core DNA replication machinery in the model plants Arabidopsis (Arabidopsis thaliana) and rice (Oryza sativa).
Figure 1 depicts a model eukaryotic DNA replication fork and illustrates the protein complexes known or suspected to be part of the core DNA replication machine. These complexes mediate the initiation, elongation, and maturation stages of DNA replication and, as such, constitute the core eukaryotic DNA replication machinery. The events leading to the formation of an active DNA replication fork occur in a stepwise fashion, but our understanding of the timing and specific details of how these events unfold in diverse eukaryotes is limited, and there are a growing number of examples of variations between model systems (Bell, 2002;Bell and Dutta, 2002;Kearsey and Cotterill, 2003).
In recent years, there has been increased interest in plant DNA replication and in using plants as models for understanding DNA replication in eukaryotes. A detailed understanding of the core DNA replication machinery in plants will provide researchers with an important tool for understanding what makes plants unique with respect to replicative and developmental capacity and for investigating how plant strategies compare to the mechanisms employed by animals.

Strategy
To identify the core DNA replication genes in Arabidopsis and rice, we developed an approach that incorporated experimental data from the literature with homology-based computational gene annotation. First, we assembled a database of yeast and animal proteins that have been determined experimentally to be part of the core eukaryotic DNA replication machinery. The BLAST algorithm was used to search against the translated Arabidopsis genome database at The Arabidopsis Information Resource (TAIR), and sequences with significant similarity were assigned putative annotations based on their functions in yeast and animal systems. The Arabidopsis sequences were then used to identify putative homologs in The Institute for Genomic Research (TIGR) rice genome database. Next, we searched the primary literature and, when available, incorporated experimental results that pertained to plant systems to validate the annotation (relevant plant literature is listed in Table  I). In cases where no experimental data from plants could be found, we generated protein sequence alignments from diverse eukaryotes and considered the validity of putative annotations based on the quality of the alignment and the presence of highly conserved domains. Using this strategy, we report the core DNA replication machinery in the dicot Arabidopsis and the monocot rice (Table I). Together, these results established that there is a general conservation of DNA replication machinery in plants.
We encountered numerous instances where the existing gene model resulted in a protein that either lacked highly conserved sequences or contained additional residues compared to other eukaryotic proteins. When available, plant-derived transcripts from GenBank and the TIGR Plant Transcript Assembly databases (TIGR-TA; Childs et al., 2007) were used to guide the prediction of a new gene model. In cases where full-length transcripts could not be identified, we used protein sequence alignments to create gene models that maximized sequence conservation with other eukaryotes (Table I). The predicted coding and resulting protein sequences are provided in Supplemental Text S1.
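The homology-search step of this strategy can be illustrated with a minimal sketch that keeps the top-scoring hit per query from BLAST tabular output. The column order is the standard 12-column BLAST+ tabular layout; the E-value cutoff, the function name `best_hits`, and the query and locus names are assumptions made for illustration and are not values or identifiers from this study.

```python
import csv
import io

# Column order of the standard 12-column BLAST+ tabular output (-outfmt 6).
FIELDS = ["qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
          "qstart", "qend", "sstart", "send", "evalue", "bitscore"]


def best_hits(tabular_text, max_evalue=1e-10):
    """Return {query: (subject, evalue, bitscore)} with the top-scoring hit per query.

    max_evalue is an illustrative cutoff, not a threshold reported in the study.
    """
    best = {}
    for row in csv.reader(io.StringIO(tabular_text), delimiter="\t"):
        if not row or row[0].startswith("#"):
            continue  # skip blank lines and comment lines
        hit = dict(zip(FIELDS, row))
        evalue = float(hit["evalue"])
        bitscore = float(hit["bitscore"])
        if evalue > max_evalue:
            continue  # hit is not significant under the assumed cutoff
        current = best.get(hit["qseqid"])
        if current is None or bitscore > current[2]:
            best[hit["qseqid"]] = (hit["sseqid"], evalue, bitscore)
    return best


# Hypothetical hits: query and locus names are placeholders, not real results.
example = "\n".join([
    "query_1\tlocus_A\t35.0\t400\t260\t5\t1\t400\t10\t410\t1e-60\t210",
    "query_1\tlocus_B\t28.0\t150\t108\t3\t1\t150\t5\t155\t1e-12\t70",
    "query_2\tlocus_C\t30.0\t40\t28\t0\t1\t40\t1\t40\t0.5\t25",
])
hits = best_hits(example)
```

In this toy input, `query_1` is assigned `locus_A` and `query_2` receives no annotation because its only hit fails the cutoff. Candidates selected this way would still need the literature and alignment checks described above.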

Preinitiation
One of the first steps toward establishing a functional DNA replication fork is binding of the eukaryotic initiator complex termed the origin recognition complex (ORC) to DNA in late M and early G1 phases of the cell division cycle (Dutta and Bell, 1997;Bryant et al., 2001;Bell, 2002;Bell and Dutta, 2002;DePamphilis, 2003). Next, CDC6 interacts with the origin-bound ORC, which together recruit a CDT1/MCM2-7 complex of proteins. Hydrolysis of ATP by ORC/CDC6 causes the release of CDT1 and structural alteration of the ring-shaped MCM complex, leading to its loading around DNA (Randell et al., 2006;Ranjan and Gossen, 2006;Waga and Zembutsu, 2006). MCM loading is reiterated through the action of a single ORC/CDC6 complex resulting in the recruitment of 10 to 40 MCM complexes at each potential origin (Blow and Dutta, 2005). The resulting protein/DNA assembly consisting of ORC1-6/CDC6/MCM2-7 is termed the prereplication complex (pre-RC), and sites containing this complex are considered licensed with the potential to serve as origins of replication (Fig. 1A).
Figure 1. Model depicting the core eukaryotic DNA replication machinery from initiation through Okazaki fragment maturation. A, Components of the preinitiation complex. DNA bound ORC recruits NOC3, CDC6, and CDT1 in early G1. Reiterative loading of 10 to 40 MCM complexes forms a licensed origin. After MCM loading is complete, CDC6 and CDT1 dissociate from the origin. B, At the G1/S transition a subset of licensed origins transition to an initiation complex. The precise order of events is not clear and may vary between systems. CDC45, TOPBP1, and MCM8-10 contribute to GINS complex loading, DNA unwinding, and recruitment of the polymerases. C, Components of the active DNA replication fork. MCM2-7, CDC45, and GINS unwind the duplex DNA. Leading strand synthesis is accomplished primarily by POLE. GINS increases the processivity of POLE. On the lagging strand, RPA stabilizes ssDNA, POLA lays down a short RNA/DNA primer and then is replaced by POLD, which completes the Okazaki fragment. RFC loads PCNA, which increases the processivity of POLD. The precise role of MCM8-10 in this process is not clear. D, The dominant mechanism of Okazaki fragment maturation requires FEN1 to cleave the RNA/DNA flap, resulting in a nick that is sealed by LIG1.
All six ORC genes have been identified in Arabidopsis (Gavin et al., 1995;Collinge et al., 2004;Masuda et al., 2004), and genes encoding ORC1 to 5 have been reported for rice (Kimura et al., 2000a;Li et al., 2005;Mori et al., 2005) and maize (Zea mays; Witmer et al., 2003). The Arabidopsis ORC proteins show 22% to 37% amino acid identity with human ORC subunits (Table I). Consistent with a role in DNA replication, Arabidopsis ORC transcripts have been shown to be abundant in proliferating tissues such as root tips, young leaves, and flower buds, and their expression is induced upon cell cycle reentry following Suc starvation of cultured suspension cells (Masuda et al., 2004;Diaz-Trivino et al., 2005). Interestingly, ZmORC3 (Witmer et al., 2003) and AtORC5-6 (Diaz-Trivino et al., 2005) transcripts are also abundant in postmitotic tissues, suggesting that plant ORC subunits may have additional functions in mature tissues.
The six-subunit MCM complex (MCM2-7) represents the putative eukaryotic replicative helicase (Forsburg, 2004;Masai et al., 2005;Maiorano et al., 2006), and genes encoding one copy of each subunit have been identified in Arabidopsis (Springer et al., 1995;Stevens et al., 2002;Masuda et al., 2004;Dresselhaus et al., 2006). MCM3 (Sabelli et al., 1996;Sabelli et al., 1999) and MCM6 (Dresselhaus et al., 2006) proteins have been identified in maize, and MCM3 has been reported in tobacco (Nicotiana tabacum; Dambrauskas et al., 2003). We identified strong candidates for each of the MCM2-7 proteins in rice (Table I). Importantly, these proteins contain the sequence features that define the MCM family, including Walker A and Walker B domains, a zinc finger region, and an Arg finger motif (Forsburg, 2004;Maiorano et al., 2006).
Nucleolar complex-associated (NOC) proteins are conserved in eukaryotes and are involved in ribosome biogenesis (Milkereit et al., 2001) and cell differentiation (Tominaga et al., 2004). One member of this complex, NOC3, has been shown to interact with ORC and MCM proteins and is required for pre-RC formation in budding yeast (Zhang et al., 2002). Our analysis indicates that Arabidopsis and rice both code for a NOC3 protein (Table I). The TAIR gene model for AtNOC3 (At1g79150.1) produces a protein of 496 amino acids that is missing conserved sequences in the C-terminal region. This model is based on a cDNA sequence in GenBank (accession NM_106566), but there is a second cDNA sequence in GenBank (accession AAC17047) with a different intron-exon structure that contains the conserved C-terminal portion of NOC3. It would be interesting to investigate whether these two AtNOC3 transcripts represent alternative splicing events or are simply artifacts. We assembled the available transcripts to predict a putative full-length AtNOC3 transcript (Supplemental Text S1). Our results support the conclusion that a complete pre-RC is conserved in plants.

Initiation
The pre-RC assembles at many sites, but only a subset of these sites recruit replication machinery and initiate DNA synthesis (Bell, 2002;DePamphilis et al., 2006). Neither the order of events nor the proteins involved in the transition from pre-RC to active replication fork are completely defined. However, several proteins are known to have critical roles in this initiation process (Table I; Fig. 1B).
MCM8 and MCM9 proteins are conserved in a diverse array of eukaryotes, but are lacking in most fungi and Caenorhabditis elegans (Blanton et al., 2005). Human and frog MCM8 proteins associate with chromatin in S-phase after loading of the MCM2-7 complex, and may stabilize replication protein3 (RPA3) and POLA1 binding to the replication fork (Gozuacik et al., 2003;Maiorano et al., 2006). The function of MCM9 is not known, but it is expressed maximally in S-phase and is transcriptionally regulated by E2F1, indicative of a role in DNA replication (Yoshida, 2005). In Arabidopsis, sequences with similarity to eukaryotic MCM8 and MCM9 proteins have been reported (Dresselhaus et al., 2006), and we identified putative MCM8 and MCM9 genes in rice, suggesting that these proteins are generally conserved in plants. The TIGR gene model for OsMCM8 predicts a protein of 482 amino acids, which is considerably shorter than other MCM8 proteins and lacks several highly conserved domains. However, a sequence encoding the missing domains is present in the rice genome, supporting our prediction of a new gene model for OsMCM8 (Supplemental Text S1).
Our examination of the Arabidopsis and poplar (Populus spp.) MCM9 gene sequences suggested that they also may be alternatively spliced or misrepresented by transcripts in the databases, and new gene models that maximize protein sequence conservation were predicted (Supplemental Text S1). Arabidopsis MCM8 and MCM9 are expressed at low levels (Schmid et al., 2005). This could explain why the functional transcripts have not been cloned, but a directed effort toward identifying primary and alternatively spliced MCM8 and MCM9 transcripts is needed to determine the relevance of our predicted gene models.
Like other MCM family members, the central region of plant MCM8 and MCM9 proteins contains Walker A and B NTP-binding domains, a putative zinc finger, and an Arg finger motif (Supplemental Figs. S1 and S2). In both plants and animals, the MCM8 and MCM9 proteins contain a classic GKS sequence in the Walker A motif compared to the deviant A/SKS sequence found in MCM2-7 (Maiorano et al., 2005). Arabidopsis, poplar, and rice lack the first approximately 60 amino acids of animal MCM8 proteins and are also missing a QVLTKDLEXXAAXLQXDE motif found in human, chicken, frog, and sea urchin homologs (Supplemental Fig. S1, region C). Additional differences between plant and animal MCM8 proteins are indicated (Supplemental Fig. S1, regions A and B and D-I).
In animals, MCM9 proteins are the largest of the MCM family members due to a long, poorly conserved C-terminal domain. Plants do not have this C-terminal extension, raising the possibility that MCM9 functions differently in plants and animals (Supplemental Fig.  S2). It is also noteworthy that while all MCM2-8 proteins contain the IDEFDKM Walker B sequence, only the IDEF is conserved in MCM9 proteins (Walker et al., 1982;Neuwald et al., 1999).
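The distinction between the classic GKS and deviant A/SKS Walker A sequences can be expressed as a simple pattern check. The sketch below is illustrative only: the function name `classify_walker_a` and the toy fragments are hypothetical and are not real MCM sequences. It is meant for short fragments already known to span the Walker A region, because a three-residue pattern would also match by chance elsewhere in a full-length protein.

```python
import re

# A residue G, A, or S immediately followed by "KS".
WALKER_A_CORE = re.compile(r"([GAS])KS")


def classify_walker_a(fragment):
    """Return 'classic' for GKS, 'deviant' for AKS or SKS, or None if absent."""
    match = WALKER_A_CORE.search(fragment)
    if match is None:
        return None
    return "classic" if match.group(1) == "G" else "deviant"


# Toy fragments for illustration; these are not real protein sequences.
toy = {
    "mcm8_like": "LVGDPGLGKSQLL",
    "mcm2_7_like_a": "LLGDPGTAKSQFL",
    "mcm2_7_like_s": "MIGDPSTSKSQLL",
    "no_motif": "MKTAYIAKQR",
}
calls = {name: classify_walker_a(seq) for name, seq in toy.items()}
```

The same approach could be adapted to other short motifs discussed here, such as the IDEF core of the Walker B sequence, by swapping in a different pattern.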
MCM10, which is conserved from yeast to humans, does not contain the sequence features that define the rest of the MCM family. However, it is an essential part of the core DNA replication machinery and has been implicated in a variety of DNA replication processes, including loading and stabilizing DNA polymerase α (POLA; Ricke and Bielinsky, 2004), recruitment of CDC45 (Wohlschlegel et al., 2002;Sawyer et al., 2004), and as a component of the replisome progression complex (Gambus et al., 2006;Pacek et al., 2006). Although we found no published information regarding MCM10 proteins in plants, we identified putative MCM10 homologs in Arabidopsis, rice, maize, and the western columbine (Aquilegia formosa). Eukaryotic MCM10 proteins are not highly conserved (Supplemental Fig. S3).
Arabidopsis and rice MCM10 proteins show 44% amino acid identity, while Arabidopsis and human MCM10 proteins display 33% identity within the aligned region. In budding yeast, MCM10 interacts with itself through a CCCH-type zinc finger motif to form a large homocomplex that is required for DNA replication (Cook et al., 2003). The CCCH zinc finger is conserved in plant and animal MCM10 proteins, suggesting that zinc binding and homomultimerization are shared properties of MCM10 proteins (Supplemental Fig. S3).
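Percent identity "within the aligned region" depends on which alignment columns are counted. The sketch below shows one common convention, counting only columns where neither sequence has a gap. This convention, the function name `percent_identity`, and the toy alignment rows are assumptions made for illustration; the text does not state which denominator was used for the reported values.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identical residues over columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for a, b in zip(aligned_a, aligned_b):
        if a == gap or b == gap:
            continue  # gapped columns are excluded from the denominator
        compared += 1
        if a == b:
            identical += 1
    if compared == 0:
        return 0.0
    return 100.0 * identical / compared


# Toy alignment rows for illustration; these are not MCM10 sequences.
row_1 = "MKT-LLVAG"
row_2 = "MRTALL-AG"
identity = percent_identity(row_1, row_2)  # 6 identical of 7 compared columns
```

Dividing by the full alignment length or by the length of the shorter protein instead would give lower values for the same alignment, so identities are comparable only when computed the same way.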
CDC45 is essential for both the initiation and elongation stages of DNA replication (Bell and Dutta, 2002;Pollok et al., 2003;Pacek and Walter, 2004). It assembles onto the origin in late G1 after the MCM2-7 complex and concurrent with the onset of initiation (Zou and Stillman, 1998). CDC45 is required for POLA complex loading (Mimura and Takisawa, 1998) and is a component of large complexes containing MCMs and GINS. These observations have led to the suggestion that CDC45 serves as an anchor coupling POLA to the replication fork via the replisome complex (Gambus et al., 2006;Moyer et al., 2006;Pacek et al., 2006).
Published results indicate that Arabidopsis CDC45 is expressed in proliferating tissues and transcripts are most abundant at the G1/S transition (Stevens et al., 2004), consistent with a role at this cell cycle stage. Interestingly, AtCDC45 has also been implicated in meiosis, a role not yet reported for any other eukaryote (Stevens et al., 2004). In rice, we identified two putative CDC45 genes located on chromosomes 11 and 12. These two proteins differ only at four positions, indicative of a very recent duplication event or strong selective pressure.
The GINS complex, which consists of four proteins, PSF1, PSF2, PSF3, and SLD5, was identified recently as a critical part of the initiation process. GINS is essential for the establishment and maintenance of a functional DNA replication fork (Kanemaki et al., 2003;Kubota et al., 2003;Takayama et al., 2003;Gambus et al., 2006). The GINS proteins copurify as a tightly associated heterotetrameric complex with a ring-like structure that resembles the DNA polymerase δ (POLD) processivity factor, proliferating cell nuclear antigen (PCNA), in electron micrographs (Kubota et al., 2003). GINS has been shown to bind weakly to DNA polymerase ε (POLE) and specifically stimulate DNA synthesis by POLE in vitro, leading to the suggestion that GINS functions as a POLE processivity factor analogous to the function of PCNA (Seki et al., 2006). However, there is also evidence that GINS is a core component of the eukaryotic DNA replication fork helicase. GINS interacts stably with the MCM2-7 complex and CDC45, and the GINS/MCM/CDC45 supercomplex functions as a helicase in vitro (Gambus et al., 2006;Moyer et al., 2006). The precise role that GINS plays at the DNA replication fork is not yet clear, but GINS interactions with POLE and the CDC45/MCM helicase complex place it between these two complexes on the leading strand (Fig. 1C).
GINS complex proteins have been identified in a broad array of eukaryotes based on sequence similarity (Kubota et al., 2003), and experimental evidence from yeast (cited above), fly (Moyer et al., 2006), frog (Kubota et al., 2003;Pacek et al., 2006), and mouse (Ueno et al., 2005) suggests that the complex is functionally conserved in eukaryotes. Consistent with this suggestion, we identified a complete GINS complex from two dicots, Arabidopsis and soybean (Glycine max), and two monocots, rice and maize. In Arabidopsis, we found two loci (see Table I) that encode for nearly identical copies of the putative PSF3 protein. We also identified transcripts representing one or several of the GINS complex proteins from a diverse array of additional plant species, demonstrating that GINS is conserved broadly in plants. We performed phylogenetic analysis of the GINS complex from plants, animals, and yeasts, demonstrating that the proteins cluster primarily by subunit and secondarily by taxonomy (Fig. 3). In all cases, the vertebrates (human, chicken, zebrafish, and frog) form a tight cluster with good bootstrap support. Drosophila sequences cluster loosely with this group, and C. elegans sequences are more divergent. Plant sequences also form a highly supported cluster, with monocot and dicot sequences tending to separate into subgroups. For PSF1, PSF2, and SLD5, the plant sequences cluster with animals, while for PSF3, plants and yeast cluster, although the bootstrap support for these divisions is generally low (Fig. 3).
GINS complex proteins are highly conserved with respect to amino acid sequence (Supplemental Table  S1). PSF2, which shares 66% identity between Arabidopsis and rice and 42% identity between Arabidopsis and human, is the most highly conserved GINS complex subunit. Amino acid sequence length and pI are also conserved features of eukaryotic GINS proteins (Supplemental Table S2). PSF1 shows the least size variability with an average of 199 amino acids and a SD of 4.4 between plants, vertebrates, and yeasts (data not shown). In budding yeast, PSF3 and SLD5 proteins are longer than plant and animal sequences due to an approximately 25 amino acid N-terminal extension and several small internal insertions. The predicted pIs of GINS complex proteins typically range from 5 to 7 (Supplemental Table S2). PSF1 in chicken, which has a predicted pI of 8.8, and PSF3 in Arabidopsis and rice, which have predicted pIs of 8.3 and 9.2, respectively, are notable exceptions.
To identify conserved and unique features of plant GINS complex proteins, we generated sequence alignments from diverse plant species and compared them to yeast and vertebrate GINS proteins (Fig. 2, B-E). These alignments indicated that eukaryotic PSF1 proteins are similar along their entire length, but show the highest degree of sequence conservation in the central and C-terminal regions (Fig. 2B). Two blocks of identical residues, RNKRCLMAY (block I) and VDMVPPKDP (block II), and a highly conserved motif in the C terminus (block III), are apparent in plant sequences. These domains are also highly conserved in yeast and animals, suggesting that they are critical for PSF1 function. Supporting this conclusion, it has been shown that mutation of a conserved Arg residue in block I of budding yeast [NK(R-to-G)CL] results in cell growth arrest and morphology consistent with a DNA replication defect (Takayama et al., 2003). PSF1 is predicted to adopt primarily a helical conformation with a short elongated region (β sheet) near its C terminus. Given the structural constraints conferred by Pro residues, the conserved double Pro in block II may serve an important role in defining the structural properties of PSF1.
PSF2 proteins from yeast, animals, and plants contain tracts of identical and conserved residues spread across the length of the protein (Fig. 2C). In contrast to the rest of the protein, the C terminus stands out as being poorly conserved. Plant proteins have an additional 15 to 20 amino acids at the C terminus including a short, conserved motif, PRRxLRR (region B). Plant and vertebrate PSF2 proteins also contain a conserved sequence (region A) that is lacking in budding and fission yeasts. Alignment of PSF3 proteins reveals two conserved features in the N-terminal region (Fig. 2D, region A and region B), a high degree of similarity through the central portion of the protein, and an LGRKR motif at the C-terminal end (region C). This motif does not align with yeast and vertebrate proteins. However, vertebrate sequences have a conserved NYXKRK motif in this region, suggesting that positive charge may be important at the C terminus (data not shown).
Our analysis indicated that SLD5 proteins contain two prominent blocks of highly conserved amino acids (Fig. 2E, blocks I and II). Except for a short conserved region at the extreme C terminus (region A), the N and C termini of SLD5 are divergent. We used the COILS algorithm (Lupas et al., 1991) to predict that the plant SLD5 proteins adopt a coiled-coil structure between blocks I and II (Fig. 2E, COILS track). Coiled-coil domains, which are common in transcription factors (Leu zipper motif), SNARE complexes, and spindle pole body components, are thought to interact primarily with other coiled-coil domains (Lupas, 1996, 1997;Martin et al., 2004;Rose et al., 2004). We did not detect coiled-coil domains in the other GINS complex proteins, and it would be interesting to ask if SLD5 acts as a homodimer or facilitates interactions between the GINS complex and other coiled-coil proteins.
Our analysis suggested that the initiation components have largely been conserved in plants, and supports the hypothesis that similar mechanisms govern the transition from pre-RC to active replication fork in plants and animals.

Elongation Complex
Initiation of active replication at the G1/S transition requires the assembly of additional proteins including DNA polymerases and Okazaki fragment maturation factors to form a complete replication factory (Fig. 1C; Waga and Stillman, 1998;Bell and Dutta, 2002). Three eukaryotic DNA polymerase complexes have been implicated in DNA replication: POLA, POLD, and POLE (Burgers et al., 2001;Garg and Burgers, 2005;Johnson and O'Donnell, 2005).
The POLA complex includes a catalytic subunit (POLA1), two primase subunits (POLA3 and POLA4), and POLA2, which is thought to tether the complex to the replication fork (Frick and Richardson, 2001). Protein complexes containing polymerase and primase activity have been purified from a variety of plant systems (Coello and Vazquez-Ramos, 1995;Garcia et al., 2002), demonstrating that a POLA-like function exists in plants. However, sequence homology has been investigated only for the POLA1 subunit in rice (Yokoi et al., 1997). Rice POLA1 was originally reported to be shorter than other eukaryotic POLA1 homologs, due to a truncated N terminus (Yokoi et al., 1997). However, publication of the rice genomic sequence (Sasaki et al., 2002) allowed us to predict a full-length OsPOLA1 (GenBank accession O48653). We identified all four putative POLA subunits in Arabidopsis and the remaining three subunits in rice. We predicted new gene models for rice POLA2 and POLA4 subunits, resulting in better conservation to other eukaryotes (Supplemental Text S1).
Seven protein sequence features have been established as conserved in all eukaryotic DNA polymerase catalytic subunits (Spicer et al., 1988;Wong et al., 1988), and an additional five regions are conserved in POLA1 proteins (Miyazawa et al., 1993). We found that these defined regions are conserved in the Arabidopsis and rice POLA1 proteins. Although the sequence features of POLA2 to 4 are less well characterized, Arabidopsis POLA2, POLA3, and POLA4 align with 37%, 41%, and 47% identity to their corresponding human proteins, respectively. According to our analyses, the majority of sequence features that are conserved between yeast and human are present in the corresponding Arabidopsis and rice proteins, supporting the hypothesis of conserved function. One notable exception is a YYRRLFP motif of unknown function located at the N terminus of yeast and animal POLA4 proteins but absent in Arabidopsis and rice POLA4. We conclude that POLA is a four-subunit complex in Arabidopsis and rice.
POLD is known to function as a heterotetramer in fission yeast and animals (POLD1-4), but only three subunits have been identified in budding yeast (POLD1-3; Johnson and O'Donnell, 2005). The largest subunit (POLD1) contains the polymerase and exonuclease activity, while the other subunits are involved in complex stabilization and interactions with PCNA. In rice, the POLD1 and POLD2 genes have been shown to be expressed primarily in proliferating tissues and induced upon regrowth following Suc starvation in cell culture (Uchiyama et al., 2002). Interestingly, only POLD1 transcripts were detected in mature leaves and induced upon UV irradiation treatment (Uchiyama et al., 2002), leading to the suggestion that POLD1 has specific DNA repair functions independent of POLD2.
An alternate explanation is that POLD2 protein activity does not correlate with its transcription. POLD1 and PCNA protein levels correlate in maize, further supporting the conclusion that plant POLD functions in DNA replication (Garcia et al., 2006).
A previously published alignment of Arabidopsis, soybean, rice, and maize sequences indicated that plant POLD1 proteins contain most of the conserved domains present in other eukaryotic POLD1 proteins, but the dicot sequences lacked two C-terminal zinc finger motifs (Garcia et al., 2006). This result was surprising because these motifs are highly conserved in other eukaryotes and are critical for interaction with POLD2. The AtPOLD1 and GmPOLD1 sequences were derived from transcripts in GenBank (accessions NP_201201 and AAC18443, respectively). We identified a second Arabidopsis transcript (accession ABA41487) that utilizes an alternative splice donor site (GT) at the end of exon 27 (data not shown) that results in a frameshift, which restores conservation in the C terminus, including the zinc finger motifs. Similarly, we searched the TIGR-TA database and identified an assembly (TA66266_3847) for GmPOLD1 that encodes a protein with these zinc finger motifs. It is not known if these transcripts represent bona fide alternative splicing events or artifacts, but it is clear that dicots produce transcripts specifying proteins that contain these important zinc finger domains.
Our analysis indicated that Arabidopsis and rice POLD2 proteins also contain all of the sequence features conserved between animals and yeasts (data not shown). In humans, a region of hydrophobic residues (MRPFL) near the N terminus of POLD2 has been shown to mediate interaction with PCNA (Lu et al., 2002). We found a similar sequence (MRT/NLL) in Arabidopsis and rice POLD2 proteins at this position. We also identified a conserved PCNA-binding motif in Arabidopsis and rice POLD3, suggesting that multiple POLD subunits mediate PCNA interactions in plants as has been reported for human (Ducoux et al., 2001). The N termini of plant and animal POLD3 also show significant sequence similarity, while their central regions are more divergent (data not shown).
The POLD4 subunit is not essential for growth in fission yeast (Reynolds et al., 1998), but increases the processivity of both fission yeast and human POLD1 to 3 complexes in vitro (Zou and Stillman, 2000;Li et al., 2006). Human POLD4 also stabilizes the POLD complex and participates in interactions with PCNA (Li et al., 2006). A POLD4 homolog has not been identified in budding yeast, and it has been uncertain whether the plant POLD complex consists of three or four subunits. We identified a single Arabidopsis POLD4 and two putative POLD4 genes in rice (Table I), indicating that POLD consists of at least four subunits in plants.
However, AtPOLE1A/B double mutants arrest earlier than single AtPOLE1A mutants, suggesting some functional overlap in vivo (Jenik et al., 2005). We identified a single POLE1 gene in rice that specifies a protein that is 66% and 63% identical to AtPOLE1A and AtPOLE1B, respectively (Table I). Like the AtPOLE1 proteins, OsPOLE1 contains all of the functional domains conserved in other eukaryotes.
We identified two candidate POLE2 genes in the rice genome (Table I). The gene models for these loci (OsPOLE2A, LOC_Os05g06840.1 and OsPOLE2B, LOC_Os08g36330.1) predict proteins that are considerably shorter than other eukaryotic POLE2 proteins and are missing several highly conserved domains. Because these gene models were derived solely by computational methods, we searched the TIGR-TA database for biological transcripts. We identified a single transcript assembly (TA60386_4530) representing OsPOLE2. This transcript aligns to the OsPOLE2A locus but has a different intron/exon structure than the computational gene model. Translation of this sequence results in a protein containing the domains missing from the computational model and likely specifies a functional OsPOLE2 protein. We were unable to detect any biological transcripts for the OsPOLE2B gene, and stop codons in the genomic sequence prevent the prediction of a full-length coding sequence that would contain all of the conserved domains. As a consequence, we suggest that OsPOLE2B is a pseudogene.
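The observation that stop codons interrupt a candidate coding sequence can be checked with a simple in-frame scan. The sketch below is a minimal illustration: the function name `premature_stops` and the toy sequences are hypothetical and are not taken from the rice genome, and the code assumes the input has already been spliced into a single reading frame starting at the first base.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def premature_stops(cds):
    """Return 0-based nucleotide offsets of in-frame stop codons before the last codon."""
    cds = cds.upper()
    usable = len(cds) - len(cds) % 3  # ignore a trailing partial codon
    offsets = []
    for i in range(0, usable - 3, 3):  # the final complete codon is not examined
        if cds[i:i + 3] in STOP_CODONS:
            offsets.append(i)
    return offsets


# Toy sequences for illustration only.
intact = "ATGGCTGAAAAGTAA"       # ATG GCT GAA AAG TAA
interrupted = "ATGGCTTGAAAGTAA"  # ATG GCT TGA AAG TAA
intact_stops = premature_stops(intact)
interrupted_stops = premature_stops(interrupted)
```

The `intact` sequence contains a TGA spanning two codons, which is correctly ignored because it is out of frame. A scan like this only flags candidates, since an apparent premature stop can also result from an incorrect intron/exon model.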
Our search for POLE3 and POLE4 homologs in Arabidopsis and rice returned a family of histone-fold proteins, which includes the core histones as well as a large number of CCAAT box-binding transcription factors. Histone-fold proteins share a conserved three-dimensional conformation but are only distantly related in primary sequence (Arents and Moudrianakis, 1995;Marino-Ramirez et al., 2006). We were unable to specify POLE3 and POLE4 homologs based on sequence similarity. However, the POLE small subunits have been identified and functionally verified in a multitude of other eukaryotes (Garg and Burgers, 2005), and it is likely that a directed experimental approach will identify plant homologs.
PCNA, the processivity clamp for POLD, is highly conserved among eukaryotes and is structurally related to the bacterial b-sliding clamp (Maga and Hubscher, 2003;Naryzhny et al., 2005). PCNA homologs have been described in numerous plants (Toueille et al., 2002) and will not be described in detail here.
Replication factor C (RFC) is a five-subunit clamp loader complex that uses ATP to load PCNA onto DNA (Ellison and Stillman, 1998; Venclovas et al., 2002; Majka and Burgers, 2004). Variable nomenclature has been used to describe the subunits in yeasts and animals, with HsRFC1 to HsRFC5 corresponding to budding yeast ScRFC1, ScRFC4, ScRFC5, ScRFC2, and ScRFC3, respectively (we have adopted the human nomenclature here). All five RFC subunits have been identified in both Arabidopsis and rice, and contain conserved sequence motifs characteristic of other eukaryotic RFCs. In rice, the RFC subunits are expressed in proliferating tissues and transcript levels respond to chemical treatments that arrest cell cycle progression. The sequence conservation and experimental data indicate that, like other eukaryotes, plants utilize a five-subunit RFC complex to load PCNA.
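Because the human and budding yeast RFC subunit numbers do not match one to one, the correspondence stated above can be kept as an explicit lookup when comparing annotations. The sketch below encodes only the pairing given in the text; the variable and function names are hypothetical.

```python
# Human-to-budding-yeast RFC subunit correspondence, as stated in the text.
HS_TO_SC_RFC = {
    "HsRFC1": "ScRFC1",
    "HsRFC2": "ScRFC4",
    "HsRFC3": "ScRFC5",
    "HsRFC4": "ScRFC2",
    "HsRFC5": "ScRFC3",
}

# Inverse lookup, for converting yeast names to the human nomenclature.
SC_TO_HS_RFC = {sc: hs for hs, sc in HS_TO_SC_RFC.items()}


def to_human_name(yeast_name):
    """Translate a budding yeast RFC subunit name to the human nomenclature."""
    return SC_TO_HS_RFC[yeast_name]


print(to_human_name("ScRFC4"))  # HsRFC2
```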
Budding yeast DPB11 is essential for the recruitment of POLE and POLA complexes to origins (Masumoto et al., 2000; Bell and Dutta, 2002). Fission yeast RAD4, human TOPBP1, Drosophila MUS101, and Xenopus CUT5 proteins are all thought to be functional homologs of ScDPB11 (Kim et al., 2005). Amino acid sequence conservation between the yeast and animal proteins is limited, but all contain copies of the breast cancer 1 gene (BRCA1) C-terminal domain (BRCT). Four BRCT domains are present in ScDPB11, while HsTOPBP1 and DmMUS101 contain eight and seven copies of the BRCT domain, respectively (Makiniemi et al., 2001; Kim et al., 2005). We searched for an Arabidopsis DPB11/TOPBP1 homolog and found two BRCT domain-containing proteins, meiosis 1 (MEI1, At1g77320), and At4g02110 (Table I). AtMEI1 contains five BRCT domains and plays an essential role in DNA repair during meiosis (Grelon et al., 2003). At4g02110 contains only two BRCT domains (Pfam data not shown) but does show significant similarity to other TOPBP1 proteins (Table I). Based on sequence similarity, it is not possible to determine if one, both, or neither of these proteins are true homologs of TOPBP1. AtMEI1 mutants do not display any visible mitotic phenotypes, suggesting that plants do not require a TOPBP1 homolog or that another protein performs this function (Grelon et al., 2003). In rice, we identified one protein (Os11g08660) that is most similar to AtMEI1 (38% identity) and two proteins with similarity to At4g02110 (Table I). Considering that TOPBP1/DPB11 function is conserved from yeast to human, a directed effort to identify a functional homolog in plants would be worthwhile.

Okazaki Fragment Maturation
Semidiscontinuous replication requires machinery to process the Okazaki fragments generated during lagging strand synthesis (Fig. 1D). As POLD/PCNA extends the Okazaki fragment, it encounters the 5′ end of the downstream replication product and displaces it from the template strand, generating a flap (Maga et al., 2001). The flap is then cleaved to generate a nick, which is ligated to form the intact nascent strand (Kao and Bambara, 2003). The dominant mechanism of flap cleavage requires Flap Endonuclease1 (FEN1) to cleave the 5′ flap structure and DNA Ligase1 (LIG1) to seal the nick (Kao et al., 2004; Rossi and Bambara, 2006). Both FEN1 and LIG1 homologs have been described in plants (Table I). Other models of Okazaki fragment maturation require DNA2 and/or RNASE H in addition to FEN1 for efficient processing of the flap structure (Qiu et al., 1999; Masuda-Sasa et al., 2006; Stewart et al., 2006). We identified a single putative DNA2 gene in both the Arabidopsis and rice genomes (Table I). AtDNA2 is 33% identical to HsDNA2 and 54% identical to the putative OsDNA2 protein. We were unable to identify an RNASEH1 homolog in any plant species, but both Arabidopsis and rice encode an RNASE H2 homolog. Perhaps RNASE H2 is the dominant RNASE H enzyme in plants.

Multiple Copy Core DNA Replication Genes
Plants may be unique among eukaryotes in that they have multiple copies of numerous core DNA replication genes (Table II). This raises the question of whether some copies have evolved specialized functions. Indeed, this has been demonstrated for the single-stranded DNA (ssDNA)-binding RPA complex in rice (see references cited below).
RPA functions as a heterotrimeric complex to stabilize ssDNA during replication, repair, and transcription (Iftode et al., 1999; Fanning et al., 2006; Zou et al., 2006). The largest subunit (RPA1) contains the primary ssDNA-binding activity, while the two smaller subunits (RPA2 and RPA3) stabilize the complex and mediate interactions with replication and repair machinery (Zou et al., 2006). Aside from humans, which have two RPA2 homologs (Keshav et al., 1995), plants are the only eukaryotes that possess multiple copies of an RPA gene. Rice has three copies each of RPA1 and RPA2, and a single RPA3 gene (Table II). Arabidopsis has five putative RPA1 genes and two copies each of RPA2 and RPA3 (Table II). Three distinct RPA complexes, termed A type, B type, and C type, have been characterized in rice (Ishibashi et al., 2006). The A-type complex localizes to the chloroplast and, thus, is expected to function primarily in organelle processes. The B- and C-type complexes both localize to the nuclear compartment, suggesting that they act in nuclear processes, but the precise function of each complex remains to be determined. It is not known whether analogous complexes occur in Arabidopsis, but mutation of Arabidopsis RPA1A (At2g06510) is lethal while mutation of a different RPA1 copy (At5g08020) results in viable but mutagen-sensitive plants.
Three members of the pre-RC (ORC1, CDC6, and CDT1) are duplicated in Arabidopsis. Both the AtORC1A (At4g14700) and AtORC1B (At4g12620) promoters have been shown to contain consensus E2F-binding sites (Masuda et al., 2004), but only AtORC1A transcripts were found to be elevated in tissues that undergo extra endocycles (Diaz-Trivino et al., 2005), suggesting distinct functions with respect to mitotic and endocycling cells. Similarly, AtCDC6A (At2g29680) and AtCDC6B (At1g07270) have distinct expression profiles (Masuda et al., 2004). The only other eukaryote known to have multiple CDC6 genes is Xenopus laevis, where XlCDC6A and XlCDC6B have distinct N-terminal regulatory motifs and different expression patterns in the developing frog embryo (Tikhmyanova and Coleman, 2003). XlCDC6A acts prior to the midblastula transition, after which XlCDC6B becomes the dominant protein (Tikhmyanova and Coleman, 2003). The midblastula transition coincides with extensive chromatin remodeling, activation of zygotic transcription, and a clear shift in the regulation of origin usage. It has been suggested that XlCDC6A and XlCDC6B play key roles in determining origin usage during development. An understanding of whether the two AtCDC6 genes are functionally distinct awaits further analysis. We identified only single copies of ORC1 and CDC6 in rice, indicating that any specialized functions that may have evolved in Arabidopsis are not required for plant development. Rice contains two nearly identical CDC45 proteins while Arabidopsis only has one, but both Arabidopsis and rice contain two CDT1 genes. A directed effort to understand whether these multicopy DNA replication genes in plants have functional significance would be worthwhile.
PCNA and POLE1 genes have also been duplicated in Arabidopsis (Table II). Interestingly, we observed that the AtCDC6B (At1g07270), AtPCNA1 (At1g07370), and AtPOLE1A (At1g08260) genes are located in close physical proximity on chromosome 1, and the other copies of these genes, AtCDC6A (At2g29680), AtPCNA2 (At2g29570), and AtPOLE1B (At2g27120), are clustered on chromosome 2 (data not shown). A published analysis of segmental duplications in the Arabidopsis genome indicated that this region was duplicated in a polyploidy event approximately 24 to 40 million years ago, prior to the Arabidopsis/Brassica rapa split (Blanc et al., 2003). We compared the sequences and found that the levels of nucleotide conservation between each copy of AtCDC6, AtPCNA, and AtPOLE1 are similar at 82%, 85%, and 85%, respectively. However, at the amino acid level, the two copies of AtPCNA and AtPOLE1 are identical at 96% and 90% of residues, respectively, while the two copies of AtCDC6 only show 72% identity. This situation provides an excellent opportunity for a more detailed analysis of the different evolutionary pressures on these genes.
Table footnote: The second copy of RPA2 in human is named RPA4. GF, Putative gene family. Species abbreviations: Hs, human; Sc, yeast; At, Arabidopsis; Os, rice.

CONCLUSION
Through genome-wide bioinformatic analysis of Arabidopsis and rice and a comprehensive review of the extant literature, we report that the core DNA replication machinery of animals and yeasts is conserved in plants. Generalization to other plant species is supported by the inclusion of both a monocot and a dicot in this analysis. Identification of components that have not previously been reported from any plant, including the GINS complex, MCM10, NOC3, POLA2 to 4, POLD3 and 4, and RNASEH2, will open up new avenues of research. Additionally, extension of many previously reported components to include both monocot and dicot proteins should facilitate comparison within plants and between plants and other eukaryotes.
We did not detect candidate homologs for RNASE H1 or geminin, leading us to suggest that these proteins are not conserved in plants. Geminin is a critical regulator of CDT1 activity in some metazoans (Ballabeni et al., 2004; Lutzmann et al., 2006) but has not been identified in yeast. It would be interesting to determine how CDT1 activity is regulated in plants. An intriguing possibility is that one or more members of the large CDK and cyclin gene families encoded by plant genomes (Vandepoele et al., 2002) function as CDT1 regulators. Very little is known about RNASE H enzymes in plants, and the lack of an obvious RNASE H1 homolog in plants suggests that plants may be different from other eukaryotic organisms in this regard. However, we did identify an RNASE H2 gene in both Arabidopsis and rice, and an effort to define the functional capacity of this RNASE H2 enzyme in plants is needed.
Our analysis also indicated that core DNA replication proteins from plants are more similar to human than to budding yeast proteins (Table I). This observation holds true for the majority of proteins listed in Table I and is most striking for ORC3, ORC5, ORC6, CDT1, TOPBP1, and POLD3, for which no significant alignment between Arabidopsis and budding yeast proteins could be generated. The parallels in the core DNA replication machinery between plants and animals are not limited to amino acid similarity. For example, budding yeast have only three POLD subunits, while animals have four, and there are four strong POLD candidates in both Arabidopsis and rice. The Arabidopsis and rice genomes also encode putative MCM8 and MCM9 proteins, which are part of the replication initiation complex in animals but not in yeasts. In summary, the available data suggest that animal systems may be more relevant models than budding yeast for plant DNA replication.
We also found numerous components of the core DNA replication machine that are encoded by small gene families in both Arabidopsis and rice. With few exceptions, this situation seems to be unique to plants. There is some evidence of functional divergence between copies, and it would be interesting to investigate the evolutionary relationships and functional roles of these genes in greater detail. There are many examples of overlapping functions in DNA replication and repair machinery, and it is attractive to hypothesize that some members of DNA replication gene families have specialized roles related to repair. Similarly, plant cells often undergo endoreduplication as part of normal development, and there is some evidence suggesting that members of DNA replication gene families have specialized roles in this process. A systematic approach to determining the function of each gene copy in such families would provide an important contribution to the fields of DNA replication and plant developmental biology.

Assembly of Yeast and Animal Reference Sequences
Core DNA replication genes were defined primarily by review of the literature. The STRING database (Snel et al., 2000; von Mering et al., 2007) was also used to supplement known protein interaction networks. Yeast (Saccharomyces cerevisiae) sequences were downloaded from the National Center for Biotechnology Information (NCBI) and the Saccharomyces Genome Database. Animal sequences were downloaded from the NCBI RefSeq database when possible, and the NCBI nonredundant database otherwise. The NCBI HomoloGene database was useful for identifying homologs in various organisms. The BLAST programs (BLASTP and TBLASTN) were used to query sequence databases. Vector NTI (Invitrogen) was used to manage and analyze sequences in house. GenBank accession numbers are provided in Supplemental File S1.

Plant Sequences
To identify core DNA replication proteins in Arabidopsis (Arabidopsis thaliana), yeast and animal proteins were used to query (BLASTP) the Arabidopsis genome databases at TAIR, TIGR, and NCBI. Sequences with significant similarity were downloaded into our Vector NTI database and putative annotations were assigned based on the function of yeast and animal proteins. Next, the NCBI PubMed and ISI Web of Science literature databases were queried for publications relevant to each protein in a plant system. Pertinent information was used to manually curate the putative annotations we assigned based on sequence similarity. This curated list of Arabidopsis proteins was then used to query (BLASTP and TBLASTN) the rice (Oryza sativa L. sp. japonica) genome database managed by TIGR. Sequences with significant similarity to Arabidopsis proteins were downloaded into our Vector NTI database and annotated accordingly. Transcripts representing core DNA replication proteins from all other plants in this analysis were downloaded from either the TIGR plant transcript assembly (Childs et al., 2007) or NCBI databases. Protein translations were performed using the Vector NTI software package.
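The first step of this workflow, retaining database sequences with significant similarity to a query, can be sketched as follows. This is an illustration only: it assumes hits are available in the 12-column tab-delimited layout produced by BLAST tabular output (percent identity in column 3, E-value in column 11), and the E-value cutoff of 1e-5 and all identifiers in the example are hypothetical rather than values used in this study.

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    query: str
    subject: str
    percent_identity: float
    evalue: float


def parse_tabular_hits(text: str) -> List[Hit]:
    """Parse tab-delimited BLAST hits (assumed 12-column tabular layout)."""
    hits = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        hits.append(Hit(fields[0], fields[1], float(fields[2]), float(fields[10])))
    return hits


def significant_hits(hits: List[Hit], max_evalue: float = 1e-5) -> List[Hit]:
    """Keep hits at or below the E-value cutoff (an assumed threshold),
    ordered from most to least significant."""
    kept = [h for h in hits if h.evalue <= max_evalue]
    return sorted(kept, key=lambda h: h.evalue)


# Hypothetical example: one strong hit and one weak hit for the same query.
example = (
    "query_protein\tplant_locus_A\t45.0\t300\t0\t0\t1\t300\t1\t300\t1e-50\t200\n"
    "query_protein\tplant_locus_B\t25.0\t80\t0\t0\t1\t80\t1\t80\t0.5\t30\n"
)
for hit in significant_hits(parse_tabular_hits(example)):
    print(hit.subject, hit.percent_identity, hit.evalue)
```

Candidates retained by such a filter would still require the manual curation against the literature described above before an annotation is assigned.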

Protein Sequence and Phylogenetic Analyses
Percent amino acid identity and similarity values reported in Table I, Supplemental Table S1, and in the text were generated by pairwise BLAST on the NCBI Web site using the default parameters. The percentages reported correspond to regions of the proteins that were aligned by the algorithms. In cases where we revised gene models (Supplemental Text S1), we used those revised models in making the alignments. Multiple sequence alignments were performed using the ClustalW algorithm within the Vector NTI suite and the BLOSUM62 scoring matrix. Similar amino acids were defined based on the chemical properties of residue side chains as follows: acidic, Asp (D) and Glu (E); aliphatic, Gly (G), Ala (A), Val (V), Leu (L), and Ile (I); amide, Asn (N) and Gln (Q); aromatic, Phe (F), Tyr (Y), and Trp (W); basic, His (H), Lys (K), and Arg (R); hydroxyl, Ser (S) and Thr (T); and sulfur containing, Met (M) and Cys (C). Conserved sequence features were annotated by searching a variety of databases including Pfam, SMART, and the NCBI conserved domain database. Phylogenetic trees were constructed using the neighbor-joining method (Saitou and Nei, 1987) within the Molecular Evolutionary Genetics Analysis software package (MEGA3; Kumar et al., 2004). Alignment for tree construction was done using ClustalW with the Gonnet scoring matrix. For bootstrap tests, the p-distance method and 5,000 iterations were selected.
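The side-chain groups listed above translate directly into a residue lookup. The sketch below shows how percent identity and similarity could be computed from a pair of pre-aligned sequences under that grouping. It is not the BLAST or Vector NTI implementation: the choice to skip gapped columns and to use only gap-free columns as the denominator is an assumption made for illustration, and the function name and example sequences are hypothetical.

```python
# Similarity groups exactly as listed in the text. Pro (P) is assigned to no
# group there, so it counts only when identical.
SIMILARITY_GROUPS = {
    "acidic": "DE",
    "aliphatic": "GAVLI",
    "amide": "NQ",
    "aromatic": "FYW",
    "basic": "HKR",
    "hydroxyl": "ST",
    "sulfur": "MC",
}
RESIDUE_TO_GROUP = {
    residue: name
    for name, residues in SIMILARITY_GROUPS.items()
    for residue in residues
}


def identity_and_similarity(aligned_a, aligned_b, gap="-"):
    """Return (percent identity, percent similarity) for two aligned sequences.

    Assumption: columns containing a gap in either sequence are skipped, and
    percentages are taken over the remaining columns.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = identical = similar = 0
    for a, b in zip(aligned_a.upper(), aligned_b.upper()):
        if a == gap or b == gap:
            continue
        compared += 1
        if a == b:
            identical += 1
            similar += 1  # identical residues also count as similar
        elif a in RESIDUE_TO_GROUP and RESIDUE_TO_GROUP[a] == RESIDUE_TO_GROUP.get(b):
            similar += 1
    if compared == 0:
        raise ValueError("no gap-free columns to compare")
    return 100.0 * identical / compared, 100.0 * similar / compared


# Hypothetical toy alignment: D/D and A/A identical, E/D and K/R similar,
# and the gapped column is skipped.
print(identity_and_similarity("DEK-A", "DDRWA"))  # (50.0, 100.0)
```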
Accession numbers and locus identifiers for sequences used in these analyses are provided in Supplemental File S1.

Supplemental Data
The following materials are available in the online version of this article.
Supplemental Figure S1. Multiple sequence alignment of MCM8 proteins.
Supplemental Figure S2. Multiple sequence alignment of MCM9 proteins.
Supplemental Figure S3. Multiple sequence alignment of MCM10 proteins.
Supplemental Table S1. Pairwise BLAST of GINS complex proteins.
Supplemental Table S2. Properties of GINS complex proteins.
Supplemental Text S1. Text file containing nucleotide coding sequences and amino acid sequences in FASTA format for new gene models predicted in these analyses.
Supplemental File S1. Microsoft Excel file containing accession numbers for sequences used in these analyses.