
Plant Physiol, December 2000, Vol. 124, pp. 1456-1459

Deciphering a Weed. Genomic Sequencing of Arabidopsis

Nancy Federspiel*

Exelixis, Inc., 170 Harbor Way, P.O. Box 511, South San Francisco, California 94083-0511


    ARTICLE

By the end of 2000, the genomic sequence of the model plant, Arabidopsis, will be completed, annotated, and released. Although the hype over the draft of the human genome sequence may overshadow the significance of this lowly weed, its importance for the future food security of the world's population may have equal weight for human health with the actual catalog of human genes. In recent months the popular media have enthusiastically reported protests around the world against the use of genetically modified organisms, especially plants. While farmers have been breeding, and thereby modifying, agricultural crops for millennia, protesters feel that the use of recombinant DNA for changing the characteristics of plants is inherently dangerous. Nevertheless, the entire complement of Arabidopsis genes will define the first member of the "other" multicellular biological kingdom and along with the emerging rice genome it will facilitate even more precise manipulations of crop species to achieve targeted goals. In addition, this will be the most accurate and most complete sequence of a higher eukaryotic genome in existence and it will probably retain that status for some years to come. At this historic juncture it is well worth a quick review and acknowledgment of the process and the people whose efforts have made this achievement possible.

It was just over a decade ago that the Arabidopsis Multinational Science Steering Committee, a small group of forward-thinking Arabidopsis researchers, decided that the most valuable goal for the advancement of plant science was the determination of the entire sequence of the approximately 130-Mb Arabidopsis genome. With a target completion date of 2004, new sources of funding and scientists interested in undertaking such a large effort had to be identified without detracting from other aspects of basic plant biology research. Early projects focused on identifying expressed genes through single-pass sequencing of cDNA clones; Tom Newman and colleagues at Michigan State University (Newman et al., 1994) and a group of investigators in France (Höfte et al., 1993) produced a large number of expressed sequence tags (ESTs) that were collected in the database dbEST (Boguski et al., 1993). Preparations for genomic sequencing in the early 1990s in Europe and the U.S. were directed toward establishing the optimal molecular resources such as appropriate large-insert libraries and physical maps of the individual chromosomes. Genomic libraries of Arabidopsis DNA were constructed in cosmids (Hauge and Goodman, 1992) and yeast artificial chromosomes (Matallana et al., 1992; Schmidt et al., 1995; Zachgo et al., 1996), and much effort went into restriction mapping and hybridization of these clones to localize known genetic markers and to generate a sequence-ready map. However, it became apparent that there were difficulties inherent in these vectors, such as clone instability and incomplete representation of the genome, which would limit their utility as substrates in a full genome sequencing project. In the mid-1990s, bacterial artificial chromosomes (BACs; Shizuya et al., 1992) were emerging as the vectors of choice for this type of project due to their reasonably large insert size (approximately 70-200 kb), low copy number in Escherichia coli, and apparent insert stability. Rod Wing's group, then at Texas A&M University, and Thomas Altmann's group at the Max-Planck-Institut für Molekulare Pflanzenphysiologie in Golm, Germany, constructed complementary BAC libraries using different restriction enzymes. These two libraries became the mainstay of the sequencing projects in Europe and the U.S.; in addition, the Japanese developed libraries to support their sequencing efforts in a P1 vector and in a modified BAC vector that could be used in Agrobacterium-mediated plant transformation.

The Europeans were the first to organize a sequencing consortium and obtain funding for a pilot genomic sequencing effort. Led by Mike Bevan at the John Innes Centre in Norwich, 18 European laboratories formed "European Scientists Sequencing Arabidopsis" to sequence approximately 1.9 Mb on chromosome 4. In December 1995, the Kazusa DNA Research Institute in Japan initiated their Arabidopsis sequencing efforts, focusing on chromosome 5. Soon after, in 1996, three U.S. funding agencies (the Department of Energy, the Department of Agriculture, and the National Science Foundation as the lead agency) requested proposals and funded three pilot sequencing projects in the U.S. Arabidopsis sequencing at The Institute for Genomic Research (TIGR) focused initially on chromosome 2 and was directed by Steve Rounsley and Craig Venter. The second U.S. effort, a consortium of researchers at Cold Spring Harbor Laboratory led by Dick McCombie, at the Genome Sequencing Center of Washington University School of Medicine in St. Louis headed by Rick Wilson, and at Applied Biosystems Inc. with Ellson Chen, focused on the top arm of chromosome 4. The third U.S. project, concentrating on chromosome 1, was directed by the SPP Consortium composed of the Stanford DNA Sequencing and Technology Center under Ron Davis and Nancy Federspiel, Joe Ecker's group at the University of Pennsylvania, and Sakis Theologis' group at the Plant Gene Expression Center/UC-Berkeley. In the summer of 1996, representatives of all the international sequencing groups met at the first AGI (Arabidopsis Genome Initiative) meeting and established guidelines for sequencing standards and accuracy, data release, and resolution of conflicts. The AGI has played an essential role for the last 5 years in coordinating the international efforts, monitoring progress, allocating regions of the genome to new participants, and redistributing areas to utilize excess sequencing capacity. Subsequent rounds of funding in the U.S. and Europe increased the rate of progress and included an additional European consortium led by the French genome center Genoscope under Marcel Salanoubat and Francis Quetier, focusing on chromosome 3. The combined efforts of all the groups have led to the completion of the Arabidopsis genome 4 years ahead of the original schedule!

Considering that over 40 laboratories have participated in this enormous undertaking, it is interesting to compare the different strategies that were used (or not used) to define the genomic structure of Arabidopsis. These strategies are variations along a gradient from fully random sample sequencing to highly directed localized sequencing. Whole genome shotgun sequencing (in which total genomic DNA is sheared into small pieces, cloned, and sequenced) was proposed as one component of the effort by the SPP Consortium because the resulting random sequences would tag nearly every gene in the genome in the first phase of the project. Directed sequencing, in contrast, produces highly accurate sequence in discrete areas of individual chromosomes, whereas other chromosomal areas remain unknown until late in the process. At the first AGI meeting there was vigorous debate among the representatives over the whole genome shotgun strategy and it was decided by majority vote not to support it due to concerns over how the data could be utilized and incorporated into the final, highly accurate product. Five years later this strategy has become much more accepted in the biological community as an efficient way of surveying a new genome in the initial stages of analysis.

Participants in the AGI then utilized the more directed approaches with BAC or other large-insert clones that were mapped to specific chromosomes as substrates for all the sequencing efforts. European Scientists Sequencing Arabidopsis, composed of a large number of laboratories with varying sequencing capacities, chose early on to invest in a large mapping effort to establish an overlapping set of clones distributed along their chosen chromosomal locations. Once the list of clones was defined, individual clones were allocated to the different participating laboratories for sequencing, and each laboratory could operate independently of the others on its own set of clones. Using this approach, the first large contiguous stretch of Arabidopsis genomic sequence was completed and it was revealed that the predicted gene density in Arabidopsis is quite high, averaging 1 gene/5 kb (Bevan et al., 1998). An alternative approach was used by many of the other sequencing groups, which required less initial effort on mapping clones; this strategy instead required a database of the end sequences of all the available BAC clones. To initiate the process, "seed" BACs were mapped to widely distributed sites along a chromosome and then completely sequenced. The sequence of each seed BAC was then compared with the end-sequence database to identify the BAC clones that overlapped each end by a minimum amount. These BACs were in turn sequenced completely to extend the region of contiguous sequence and the process was repeated until the "contigs" merged to form the entire sequence of a chromosomal arm (centromeric repeats are difficult to sequence and require specialized strategies). This approach led to the completion of the sequence of chromosome 2 (Lin et al., 1999) and the mapping strategy resulted in the completion of the sequence of chromosome 4 (Mayer et al., 1999).
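The seed-and-extend strategy described above can be sketched as a toy simulation. This is only an illustration under stated assumptions: clones are modeled as (start, end) intervals on a chromosome arm, and the comparison of finished sequence against the BAC end-sequence database is replaced by a simple coordinate check. The names Clone, walk_from_seed, and min_overlap, along with the clone coordinates, are hypothetical and are not taken from any AGI pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Clone:
    name: str
    start: int  # left end position on the chromosome arm (bp)
    end: int    # right end position (bp)


def walk_from_seed(seed, library, min_overlap=2_000):
    """Grow one contig outward from a fully sequenced seed BAC.

    Returns (contig_start, contig_end, names_of_sequenced_clones).
    """
    contig_start, contig_end = seed.start, seed.end
    sequenced = [seed.name]
    remaining = [c for c in library if c.name != seed.name]

    while True:
        # Stand-in for the end-sequence search: a clone qualifies when one of
        # its ends lies at least min_overlap inside the contig and its other
        # end reaches beyond the contig edge.
        right = [c for c in remaining
                 if contig_start <= c.start <= contig_end - min_overlap
                 and c.end > contig_end]
        left = [c for c in remaining
                if contig_start + min_overlap <= c.end <= contig_end
                and c.start < contig_start]
        if not right and not left:
            break  # no overlapping clone left: a gap or the end of the arm

        if right:
            best = max(right, key=lambda c: c.end)  # furthest extension
            contig_end = best.end
            sequenced.append(best.name)
            remaining.remove(best)
        if left:
            best = min(left, key=lambda c: c.start)  # furthest extension
            contig_start = best.start
            sequenced.append(best.name)
            remaining.remove(best)

    return contig_start, contig_end, sequenced


# Hypothetical library: B1 is the seed, B6 lies beyond an uncovered gap.
library = [
    Clone("B1", 100_000, 200_000),
    Clone("B2", 190_000, 290_000),
    Clone("B3", 150_000, 260_000),
    Clone("B4", 20_000, 105_000),
    Clone("B5", 280_000, 380_000),
    Clone("B6", 500_000, 600_000),
    Clone("B7", 199_500, 300_000),
]
print(walk_from_seed(library[0], library))
```

In this toy library the walk sequences B2 and B4 in the first round and B5 in the second, then stops because no remaining clone bridges the gap before B6. Choosing the candidate that reaches furthest is one plausible way to keep the number of fully sequenced clones low; the text specifies only that candidates must overlap by a minimum amount.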

Although most sequencing groups relied heavily on manual labor and, later in the project, some commercial automation for sample preparation, the SPP Consortium used the Arabidopsis sequencing project as the testbed for new robotic instrumentation developed at Stanford to reduce the cost and increase the throughput for sequencing (Marziali et al., 1997). The AGI agreed to follow the so-called "Bermuda standards" for sequence accuracy established by the Human Genome Project so that the resulting final product should have less than one error per 10,000 bases. However, adhering to the Bermuda standards for immediate data release to GenBank proved to be more problematic for some AGI groups due to their sources of funding and other internal issues. As a result, public release of sequence data ranged from depositing rough draft sequence in GenBank within 24 h of assembly to posting of sequence data on individual web sites, to releasing to GenBank only after finishing to high quality and annotation. This was a continuing source of discussion and negotiation among AGI members throughout the project, especially in the latter stages when re-allocation of clones required data exchange to confirm overlaps and fill gaps in the sequence. Publication of the sequence of the remaining three chromosomes, as well as a discussion of the structure and content of the entire genome, is scheduled for the end of 2000 and will ensure that all AGI sequence data is in the public domain (The Arabidopsis Genome Initiative, in press).
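As a rough illustration of what this accuracy target means at genome scale, the bound can be worked out directly. The sketch below assumes the genome size of approximately 130 Mb quoted earlier in this article; the function name is hypothetical.

```python
def max_expected_errors(genome_size_bp, bases_per_error=10_000):
    """Upper bound on errors when there is at most 1 error per bases_per_error bases."""
    return genome_size_bp / bases_per_error


genome_size_bp = 130_000_000  # approximately 130 Mb, as stated in the text
print(max_expected_errors(genome_size_bp))
```

Under these assumptions the finished sequence would contain no more than about 13,000 erroneous bases across the whole genome.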

The accumulation of approximately 130 million base pairs of Arabidopsis sequence in the database certainly does not in and of itself provide the key to understanding how this model plant works or how to modify economically important crop species for improvement in nutrition, response to stress, and other traits required for long-term agricultural sustainability. It is analogous to now having a book published in a foreign language, divided into five chapters (for the five chromosomes), but having no punctuation to separate the words and sentences and no dictionary for translation. Annotation of the sequence, or defining the interesting features such as genes, repetitive sequences, regulatory regions, etc., will initiate the process of producing a Rosetta stone for the plant genome. Members of the AGI have all been responsible for annotating the clones that were sequenced in the individual groups, and diverse methods have been used to provide these gene labels along the genomic sequence. In general, sequence matches with the EST database, with the non-redundant protein sequence database, and results of gene prediction programs are the bases of annotation.

Numerous annotation issues must be addressed in any large-scale genomic sequencing effort such as this. ESTs generally represent only a fraction of the genes in an organism, and the individual sequence tags usually define only a portion of each gene. Comparisons to characterized genes from other organisms provide another method of identifying conserved genes, but intron/exon junctions, start/stop sequences, and regulatory regions may or may not be similarly conserved. Gene prediction programs often produce conflicting output that must be interpreted by a human annotator whose final choice may not be biologically accurate. In addition, characterization of a "hypothetical protein" that has no database matches may be correct on the day the annotation was completed and submitted to GenBank, but may be incorrect the following day after a new set of sequences from any source is deposited into the database. To help basic researchers during the course of the genome project, several of the AGI members created databases to automatically predict coding sequences for GenBank records lacking annotation, to identify protein motifs and protein similarities, and to provide regular updates of new entries (DAtA, Palm et al., 2000; http://www.kazusa.or.jp/kaos/; http://www.mips.biochem.mpg.de/proj/thal/).

When the complete Arabidopsis sequence has been compiled from the individual BAC sequences, TIGR is funded to provide a unified set of annotation that will provide consistent nomenclature and current database matches (http://www.tigr.org/tdb/ath1/htmls/ath1.html). However, this again will be a snapshot in time of the existing data and knowledge, and ongoing efforts to curate the sequence data in a stable, accessible database will be required for full development of the long-term value of the Arabidopsis sequencing project. The Arabidopsis Information Resource is one such possible resource, and their efforts are described in another article in this issue. Despite the best efforts of annotators and of programmers attempting to improve gene prediction software, a large percentage of putative genes will remain essentially guesses without corroborating experimental evidence. Full-length cDNA sequencing, characterization of tagged mutant lines, and expression profiling with arrays are among the approaches that will greatly aid in deciphering the meaning of the Arabidopsis genome sequence. The knowledge gained from this major endeavor will ultimately define the basic building blocks of this model plant and provide the tools for manipulating desired traits in other more complex plants. It is to be hoped that the precise definition of what is changed in a genetically modified plant will allow a more exact understanding of the outcome of this manipulation. In this way, political, environmental, and biological issues can be addressed so that the scientific community and the general public alike can be informed of the merits of the new agricultural genomics technologies.

    FOOTNOTES

Received September 7, 2000; accepted September 18, 2000.

* E-mail nfedersp{at}exelixis.com; fax 650-837-8204.


    LITERATURE CITED

  • Boguski MS, Lowe TMJ, Tolstoshev CM (1993) dbEST: database for "expressed sequence tags." Nat Genet 4: 332-333
  • Hauge BM, Goodman HM (1992) In C Koncz, N-H Chua, J Schell, eds, Methods in Arabidopsis Research. World Scientific, Singapore, pp 191-223
  • Höfte H, Desprez T, Amselem J, Chiapello H, Caboche M, Moisan A, Jourjon MF, Charpenteau JL, Berthomieu P, Guerrier D (1993) An inventory of 1152 expressed sequence tags obtained by partial sequencing of cDNAs from Arabidopsis thaliana. Plant J 4: 1051-1061
  • Lin X (1999) Sequence and analysis of chromosome 2 of the plant Arabidopsis thaliana. Nature 402: 761-768
  • Marziali A, Federspiel N, Davis R (1997) Automation for the Arabidopsis genome sequencing project. Trends Plant Sci 2: 71-74
  • Matallana E, Bell CJ, Dunn P, Lu M, Ecker JR (1992) In C Koncz, N-H Chua, J Schell, eds, Methods in Arabidopsis Research. World Scientific, Singapore, pp 191-223
  • Mayer K, Schuller C, Wambutt R, Murphy G, Volckaert G, Pohl T, Dusterhoft A, Stiekema W, Entian KD, Terryn N, Harris B, Ansorge W, Brandt P, Grivell L, Rieger M, Weichselgartner M, de Simone V, Obermaier B, Mache R, Muller M, Kries M, Delseny M, Puigdomenech P, Watson M, McCombie WR (1999) Sequence and analysis of chromosome 4 of the plant Arabidopsis thaliana. Nature 402: 769-777
  • Newman T, Bruijn FJ de, Green P, Keegstra K, Kende H, McIntosh L, Ohlrogge J, Raikhel N, Somerville S, Thomashow M, Retzel E, Somerville C (1994) Genes galore: a summary of methods for accessing results from large-scale partial sequencing of anonymous Arabidopsis cDNA clones. Plant Physiol 106: 1241-1255
  • Palm CJ, Federspiel NA, Davis RW (2000) DAtA: database of Arabidopsis thaliana annotation. Nucleic Acids Res 28: 102-103
  • Schmidt R, West J, Love K, Lenehan Z, Lister C, Thompson H, Bouchez D, Dean C (1995) Physical map and organization of Arabidopsis thaliana chromosome 4. Science 270: 480-483
  • Shizuya H, Birren B, Kim U-J, Mancino V, Slepak T, Tachiiri Y, Simon M (1992) Cloning and stable maintenance of 300 kilobase-pair fragments of human DNA in Escherichia coli using an F-factor-based vector. Proc Natl Acad Sci USA 89: 8794-8797
  • The Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature (in press)
  • The EU Arabidopsis Genome Project (1998) Analysis of 1.9 Mb of contiguous sequence from chromosome 4 of Arabidopsis thaliana. Nature 391: 485-488
  • Zachgo EA, Wang ML, Dewdney J, Bouchez D, Camilleri C, Belmonte S, Huang L, Dolan M, Goodman HM (1996) A physical map of chromosome 2 of Arabidopsis thaliana. Genome Res 6: 19-25
© 2000 American Society of Plant Physiologists


