Plant Physiol, December 2002, Vol. 130, pp. 1585-1586
UPDATE ON GENOMICS
ARTICLE
Rice (Oryza sativa) is the first grass species to be sequenced, and as of September 2002, there are four draft genome sequences available. All four drafts are available to the academic community, although two drafts have some limitations with respect to access and distribution. Although none of the four draft sequences is complete, they collectively provide our first view of the landscape and the content of a monocot genome.
The first rice genome sequence made accessible in large tracts was that of the O. sativa subsp japonica cv Nipponbare generated by the International Rice Genome Sequencing Project (IRGSP; Sasaki and Burr, 2000), an international consortium of public laboratories. Using a bacterial artificial chromosome (BAC)-by-BAC approach, the IRGSP has generated draft sequence of 3,083 BAC or P1 artificial chromosome (PAC) clones that is available through GenBank/DNA Data Bank of Japan (DDBJ)/EMBL (as of September 17, 2002). These 3,083 BAC/PAC clones represent 426 Mb of sequence, and assuming an overlap of 15% between the clones, this would represent 362 Mb of unique sequence. With an estimated genome size of 430 Mb (Arumuganathan and Earle, 1991), this represents 84% of the rice genome. Alignment of the IRGSP sequence with 13,895 sequenced genetic markers reveals that 11,442 markers can be anchored to a BAC/PAC clone using high-stringency criteria (http://www.tigr.org/tdb/e2k1/osa1/BACmapping/description.shtml), indicating that, based on coverage of markers, the IRGSP sequence represents 82% of the genome. A graphic depiction of the anchoring of the BAC/PAC clones to the chromosomes can be viewed at http://www.tigr.org/tdb/e2k1/osa1/BACmapping/description.shtml. There is clearly representation throughout most of the chromosomes, with the exceptions occurring in regions that lack a high density of genetic markers with which to anchor the BAC/PAC clones. Likewise, regions where it is technically difficult to identify BAC/PAC clones (telomeres, centromeres, and nucleolar-organizing regions) are under-represented in the IRGSP sequence.
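The two coverage estimates above follow from simple arithmetic on the quoted figures. The short Python sketch below reproduces them; the variable and function names are illustrative only, and the 15% overlap is the assumption stated in the text rather than a measured value.

```python
# Figures quoted in the text (IRGSP status as of September 17, 2002).
TOTAL_CLONE_SEQUENCE_MB = 426   # summed length of the 3,083 BAC/PAC clones
ASSUMED_OVERLAP = 0.15          # assumed fraction shared between clones
GENOME_SIZE_MB = 430            # estimate of Arumuganathan and Earle (1991)
MARKERS_TOTAL = 13_895          # sequenced genetic markers aligned
MARKERS_ANCHORED = 11_442       # markers anchored to a BAC/PAC clone


def unique_sequence_mb(total_mb, overlap):
    """Sequence remaining after discounting the assumed clone overlap."""
    return total_mb * (1.0 - overlap)


unique_mb = unique_sequence_mb(TOTAL_CLONE_SEQUENCE_MB, ASSUMED_OVERLAP)
length_coverage = unique_mb / GENOME_SIZE_MB          # coverage by length
marker_coverage = MARKERS_ANCHORED / MARKERS_TOTAL    # coverage by markers

print(f"unique sequence: {unique_mb:.0f} Mb")
print(f"coverage by length: {length_coverage:.0%}")
print(f"coverage by markers: {marker_coverage:.0%}")
```

The length-based and marker-based figures are independent estimates, so their close agreement (84% versus 82%) lends some support to the assumed overlap.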
Although the majority of the IRGSP sequence is draft sequence, approximately a third of the sequence is finished (1,023 BAC/PAC clones as of September 12, 2002; http://www.tigr.org/tdb/e2k1/osa1/BACmapping/description.shtml). In fact, manuscripts describing the sequence, annotation, and analysis of chromosomes 1 and 4 are in press (T. Sasaki and B. Hin, personal communication) and a manuscript on chromosome 10 is in preparation (C.R. Buell, W. McCombie, J. Messing, and R.A. Wing, personal communication), highlighting the role of the IRGSP in finishing the rice genome. In addition, the overall quality of the draft sequence generated by the IRGSP is high: the bulk of it is 10×, phase 2 sequence, where 10× is the level of sequence coverage and phase 2 means that the contigs are ordered and oriented when deposited in GenBank (http://www.ncbi.nlm.nih.gov/HTGS/). Although the immediate goal of the IRGSP is completion of a phase 2 draft of the rice genome by the end of 2002 (http://rgp.dna.affrc.go.jp/rgp/press_conference.html), the ultimate goal is a finished rice genome.
Annotation for the IRGSP BAC/PAC clones is available for finished clones in GenBank/DDBJ/EMBL. Annotation data for unfinished sequences are generated through automated annotation processes and are available from The Institute for Genomic Research (http://www.tigr.org/tigr-scripts/e2k1/irgsp.spl) and the Rice Genome Program (http://rgp.dna.affrc.go.jp/giot/INE.html). Although manually curated annotation is always preferred over automated annotation, automated annotation is a valuable resource for sequences that are not yet finished. Other analyses of the rice genome, such as alignment with expressed sequence tags from other monocot species, identification of motifs/domains within the rice proteome, analysis of repetitive sequences, and identification of syntenic sequences, are available through several public sources (http://www.tigr.org/tdb/e2k1/osa1/; http://rgp.dna.affrc.go.jp/; http://www.gramene.org).
Draft sequence of the same japonica cv Nipponbare sequenced by the IRGSP is available from two separate private sources, Pharmacia (Peapack, NJ) and Syngenta (San Diego). The Pharmacia draft sequence was generated using a BAC-by-BAC approach and represents 259 Mb of sequence (Barry, 2001). Access to this draft sequence is available to academic scientists under an access agreement with Pharmacia (http://www.rice-research.org). An agreement between Pharmacia and the IRGSP has resulted in the incorporation of the Pharmacia BAC clones and sequence into the IRGSP sequence. The Syngenta draft sequence was generated using a whole-genome shotgun sequencing approach and provides 93% coverage of the genome (Goff et al., 2002). This draft sequence is available through a licensing agreement with Syngenta (http://www.tmri.org). Although the Syngenta draft sequence has been annotated, the annotation data are not available to the public. Insights into the rice genome and proteome gained from the Syngenta draft sequence were recently reported by Goff et al. (2002). It is estimated that the rice genome encodes between 32,000 and 50,000 proteins. Comparative analyses revealed not only a high degree of homology between rice genes and those of other cereals, but also synteny between rice and other cereals, especially maize (Zea mays). These analyses extend previous studies on the high degree of similarity between rice and other cereals (Gale and Devos, 1998) and further highlight the role rice can have in cereal comparative genomics.
A draft sequence of O. sativa subsp indica cultivar 93-11 was reported by the Beijing Genomics Institute (BGI; Yu et al., 2001, 2002). This draft, generated through a whole-genome shotgun sequencing approach, represents 360 Mb of assembled sequence and provides a resource not only for gene discovery in rice subsp indica but also for rice comparative genomics. Unlike the Pharmacia and Syngenta draft sequences, the BGI sequence is freely available via the BGI web site (http://btn.genomics.org.cn/rice) and through GenBank/DDBJ/EMBL. An analysis of the BGI sequence suggests that the rice genome encodes between 46,022 and 55,612 proteins (Yu et al., 2002), consistent with the estimate made by Goff et al. (2002).
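One way to make "consistent" concrete is to check whether the two quoted ranges overlap. The sketch below does only that; treating overlap as the criterion is an assumption made here for illustration, not a test stated by either group, and the names are hypothetical.

```python
# Protein-count ranges quoted in the text.
GOFF_RANGE = (32_000, 50_000)   # Goff et al. (2002), japonica draft
YU_RANGE = (46_022, 55_612)     # Yu et al. (2002), indica draft


def overlap(a, b):
    """Return the shared sub-range of two closed ranges, or None."""
    low = max(a[0], b[0])
    high = min(a[1], b[1])
    return (low, high) if low <= high else None


shared = overlap(GOFF_RANGE, YU_RANGE)
print(shared)
```

The shared interval runs from the lower bound of the Yu et al. estimate to the upper bound of the Goff et al. estimate, so the two estimates agree only at the high end of the Goff et al. range.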
To date, rice has the most genes of any sequenced organism, almost twice as many as the dicotyledonous model plant Arabidopsis (Arabidopsis Genome Initiative, 2000). In a comparative study between rice and Arabidopsis, rice was found to have a homolog for approximately 81% of the proteins in the Arabidopsis genome (Yu et al., 2002), suggesting substantial overlap in the genes required for basic plant functions in monocots and dicots. However, in the reciprocal comparison, a homolog in Arabidopsis could be found for only one-half of the rice proteins.
Although these four draft sequences provide a rich resource for data mining, they have limitations. The nature of draft sequence, regardless of source, is that it contains errors and is incomplete. The errors can be simple sequencing errors (incorrect bases, low-quality regions) or larger misassemblies. However, the main disadvantage of draft sequence is its incompleteness. Not only can a gene of interest be truncated in the draft sequence because of sequencing gaps, but some portions of the genome are not represented at all. Telomeres, centromeres, and other regions that are difficult to sequence are absent, under-represented, or misassembled in these draft sequences. Thus, studies that require a more complete sequence, such as those of chromosome structure and organization, cannot be performed with these draft sequences.
Obtaining a finished rice genome is necessary if rice is to be used as the base species in comparative genomics in cereals (Goff, 2002; Leach et al., 2002). It is anticipated that the IRGSP will take the lead in finishing the genome. Because the IRGSP has finished one-third of the genome to date and will have the remainder of the genome ready for finishing in December 2002, finishing the rice genome appears to be on track. In addition, although the rice genome may be finished and annotation deposited in GenBank at the BAC level, it will be essential to have annotation of the genome at a level comparable with that of Arabidopsis if rice is to be leveraged for the study of other monocot genomes. Construction of pseudomolecules (reference molecules) of the chromosomes, along with uniform, high-quality annotation, will be required for researchers to gain the most information from the rice genome sequence. Although the rice genome and its annotation are incomplete today, multiple centers are contributing to finishing the sequence and improving the annotation. From these efforts, a truly finished and well-annotated reference sequence for the first cereal species will become available.
FOOTNOTES
Received September 18, 2002; returned for revision September 25, 2002; accepted September 25, 2002.
1 The work on rice genome sequencing at TIGR was supported by the U.S. Department of Agriculture (grant no. 99-35317-8275), by the National Science Foundation (grant no. DBI998282), and by the U.S. Department of Energy (grant no. DE-FG02-99ER2035).
* E-mail rbuell{at}tigr.org; fax 301-838-0208.
www.plantphysiol.org/cgi/doi/10.1104/pp.014878.