Plant Physiol. (1998) 117: 1129-1133
UPDATE ON GENOMICS
Genome Sequencing and Informatics: New Tools for Biochemical
Discoveries
Milton H. Saier Jr.*
Department of Biology, University of California at San Diego, La
Jolla, California 92093-0116

INTRODUCTION
During the past 3 years, we have
experienced a major revolution in the biological sciences resulting
from a tremendous flux of information generated by genome-sequencing
efforts. Our understanding of microorganisms, the metabolic processes
they catalyze, the genetic apparatuses encoding cellular proteinaceous
constituents, and the pathological conditions caused by these organisms
has greatly benefited from the availability of complete microbial genomic sequences. Many research institutes around the world are now
devoting their efforts solely to genome sequencing and to analysis of
the data produced. Dozens of international conferences have been held
with the primary purpose of keeping the scientific community abreast of
recent developments. In this Update I will summarize some of
the exciting information reported at a recent conference on microbial
genomics.1
Compared with microbial genomics, plant genomics is still in its infancy,
since only the Arabidopsis genome is currently being sequenced
(Bevan et al., 1998). However, within a few years we can expect
that the genomic sequences of at least two plants, Arabidopsis and rice
(Oryza sativa), will be available. Meanwhile, as of January
1998, 12 microbial genomes have already been sequenced and published.
These include genomes of representative organisms from the three
domains of life: bacteria, archaea, and eukarya. The genomes of one
eukaryote (the brewers' yeast Saccharomyces cerevisiae) as
well as three archaea and eight bacteria have been completely
sequenced. By examining what has been learned from sequencing of
microbial genomes, we can get some idea about what kinds of information
we can expect from plant genome-sequencing efforts.
 |
MICROBIAL GENOME SEQUENCING |
Most of the bacterial genomes that have been sequenced are small,
of 2 Mbp or less. However, three large bacterial genomes have been
sequenced: those of the prototypic gram-negative bacterium Escherichia coli (Blattner et al., 1997); the
best-characterized gram-positive bacterium, Bacillus
subtilis (Kunst et al., 1997); and a representative
cyanobacterium, Synechocystis PCC6803 (Kaneko et al., 1996).
In addition, six microbial genomes have been completely sequenced but
not yet published. Pathogens with fully sequenced genomes include
Mycobacterium tuberculosis, the causative agent of
tuberculosis; Treponema pallidum, the spirochete that causes syphilis; and Borrelia burgdorferi, another spirochete that
causes Lyme disease. Two interesting nonpathogens that have recently been sequenced are Deinococcus radiodurans, the organism
reported to be most resistant to UV irradiation, and Aquifex
aeolicus, a marine hyperthermophile capable of growth at 95°C.
Based on 16S rRNA analyses, A. aeolicus may represent the
deepest lineage within the bacterial domain, and its genome may
therefore provide clues about primitive prokaryotic life-forms that
existed billions of years ago.
More than 50 microbial genomes are currently being sequenced. These
include the genomes of virtually every major pathogen. Industrially
important bacteria such as Clostridium acetobutylicum, a
principal producer of organic solvents, are also being sequenced, as is
the deep-sea manganese-oxidizing bacterium Shewanella
putrefaciens. The sequence of the first animal genome, that
of the worm Caenorhabditis elegans, is expected to
be completed in the middle of this year (July, 1998). Although only 3%
of the human genome has been sequenced, this small fraction already
represents more DNA than all of the completed microbial genomes
combined.
Table I summarizes some of the most
impressive genome-sequencing efforts completed by the end of 1997. The
first organism to have its genome sequenced was Haemophilus
influenzae (Fleischmann et al., 1995). It has a genome size of 1.8 Mbp encoding 1743 recognized genes. One laboratory at the Institute for
Genomic Research, involving about 40 people, completed the project in 1 year. B. subtilis, with a genome of 4.2 Mbp, was
sequenced by an international consortium of 46 laboratories involving
about 160 people in 5 years (Kunst et al., 1997).
S. cerevisiae possesses about 13 Mbp of DNA, and 12 Mbp of
this DNA has been fully sequenced. The remaining 1 Mbp consists of
repetitive rDNA (encoding rRNA molecules) representing more than 100 repeats. This repetitive DNA was not sequenced as part of the
yeast-sequencing effort partly for technical reasons, and partly
because it was expected to yield little new information. Ninety-six
laboratories and 640 people completed this project in 6 years (Goffeau
et al., 1997). The prototypical bacterium E. coli (4.6 Mbp)
was sequenced by a single laboratory in an effort involving 17 people,
but the project took about 10 years (Blattner et al., 1997). Because of
recent technological advances, it is estimated that a microbial genome
of about 2 Mbp can now be fully sequenced in 1 year by a single
laboratory of 6 people with an expenditure of less than $1 million.
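The effort figures quoted above can be put on a rough common scale. The following Python sketch multiplies people by years to obtain a crude person-year total and divides by genome size. It assumes, purely for illustration, that everyone counted worked full time for the whole project, which the reported figures do not claim; the names `PROJECTS` and `person_years_per_mbp` are illustrative.

```python
# Figures as quoted in the text: (genome size in Mbp, people, years).
# For S. cerevisiae the 12 Mbp that was actually sequenced is used.
PROJECTS = {
    "H. influenzae": (1.8, 40, 1),
    "B. subtilis": (4.2, 160, 5),
    "S. cerevisiae": (12.0, 640, 6),
    "E. coli": (4.6, 17, 10),
    "2-Mbp genome (1998 estimate)": (2.0, 6, 1),
}


def person_years_per_mbp(size_mbp, people, years):
    """Crude effort metric; assumes full-time work by everyone for the whole project."""
    return people * years / size_mbp


if __name__ == "__main__":
    for name, (size, people, years) in PROJECTS.items():
        effort = person_years_per_mbp(size, people, years)
        print(f"{name:30s} {effort:7.1f} person-years per Mbp")
```

Under these assumptions the 1998 estimate comes to about 3 person-years per Mbp, roughly a hundredfold less than the figure obtained for the yeast project.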

GENOME SEQUENCING LEADS TO THE IDENTIFICATION OF NEW PROTEINS
Tremendous benefits have resulted from the microbial
genome-sequencing efforts completed to date (Table
II). For example, important new proteins
have been identified. Owen White at the Institute for Genomic Research
reported that D. radiodurans is the first nonphotosynthetic
organism to be shown to possess the light-sensing protein phytochrome.
In D. radiodurans this molecule may function to regulate the
synthesis of pigments that protect the organism from irradiation. The
organism also has a novel type of RecA protein involved in DNA repair-related
recombination.
Richard Roberts of New England Biolabs analyzed various genomes for DNA
restriction-modification systems, enzyme systems that protect organisms
from the potentially detrimental effects of foreign DNA (such as that
of viruses). It was found that although B. subtilis and
E. coli possess 3 to 4 such restriction-modification systems, some pathogenic bacteria have far more. Thus, H. influenzae has 7, Neisseria gonorrhoeae has 18, and
Helicobacter pylori, which is a causative agent of peptic
ulcers, has 23. The physiological significance of
such large numbers of restriction-modification systems in small-genome bacteria remains
a matter of debate.
Entirely new families of protein paralogs (families of proteins arising
by gene duplication within a single organism) have been revealed by
genome sequencing. For example, Claire Fraser, Sherwood Casjens, and
Steven Norris reported that the genomes of the two pathogenic
spirochetes T. pallidum and B. burgdorferi both
exhibit large families of species-specific paralogs of unknown function. The same has been observed for methanogenic
(methane-producing) archaea. In the case of B. subtilis, a
large family of Bacillus-specific protein paralogs has been
identified as "Rap" phosphatases that release phosphate from
phosphorylated aspartyl residues in response to regulatory proteins.
These regulatory proteins control the initiation of sporulation, a
gram-positive bacterial cell-differentiation process that leads to the
generation of thick-walled, metabolically inert, environmentally
resistant spores (Perego et al., 1994, 1996).

DISCOVERY OF NEW METABOLIC PATHWAYS
Entirely new metabolic pathways have been revealed by genome
sequencing. Tyrrell Conway reported the discovery of a previously unrecognized pathway in E. coli for the metabolism of the
sugar acid idonate, a compound that had not been known to be a
substrate for the growth of this bacterium (C. Bausch, N. Peekhaus, C. Utz, T. Blais, E. Murray, T. Lowary, and T. Conway, unpublished data). Additionally, analysis of several bacterial genomes showed that many bacteria have all of the requisite enzymes of the
ribulose-monophosphate-hexulose-monophosphate pathway, a pathway
for the interconversion of five- and six-carbon sugars (Reizer et al.,
1997). Such a pathway had been previously demonstrated only in
methanogenic bacteria. Finally, Owen White reported evidence that
D. radiodurans has a novel pathway for the efficient repair
of double-stranded DNA breaks, a fact that explains the ability of this
octaploid organism to repair more than 150 double-stranded DNA breaks
in its genome without loss of viability (Battista, 1997).
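The kind of inference described here, that a genome encodes "all of the requisite enzymes" of a pathway, amounts to a set comparison once the genes have been annotated. The Python sketch below illustrates the idea only. The two enzyme names are those generally regarded as characteristic of the ribulose monophosphate pathway and are not taken from the text, and the annotation sets in `GENOMES` are hypothetical, not real genome data.

```python
# Enzymes generally regarded as characteristic of the ribulose monophosphate
# pathway (an assumption for illustration; the text does not name them).
RUMP_ENZYMES = {
    "3-hexulose-6-phosphate synthase",
    "6-phospho-3-hexuloisomerase",
}

# Hypothetical annotation sets, NOT real genome data.
GENOMES = {
    "genome_1": {
        "3-hexulose-6-phosphate synthase",
        "6-phospho-3-hexuloisomerase",
        "transketolase",
    },
    "genome_2": {"transketolase", "6-phospho-3-hexuloisomerase"},
}


def missing_enzymes(annotated, required):
    """Return the required enzymes that are absent from an annotation set."""
    return set(required) - set(annotated)


def has_pathway(annotated, required):
    """True if every required enzyme is present in the annotation set."""
    return not missing_enzymes(annotated, required)


if __name__ == "__main__":
    for name, annotated in GENOMES.items():
        gaps = sorted(missing_enzymes(annotated, RUMP_ENZYMES))
        print(name, "complete" if not gaps else f"missing: {gaps}")
```

In practice the difficult step is the annotation itself, that is, deciding by sequence similarity which genes encode which enzymes; the set comparison is the trivial final step.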
The availability of a complete genomic sequence allows one to estimate
the total metabolic capability of an organism. For example, knowledge
of the complete gene complement permits estimation of the nutrients
that can be taken up by the cell. Such knowledge reflects the natural
lifestyle of the organism; hence, ecological deductions can be made,
as can inferences regarding the use of the bacterium for purposes
of bioremediation (Clayton et al., 1997; Paulsen et al., 1998).
Based on the complement of genes encoded within microbial genomes,
Peter Karp of the Pangea Corporation in Oakland, CA, and his co-workers
have created "EcoCyc" for E. coli and "HinCyc" for
H. influenzae, which are computerized displays of the
complete metabolic pathways of these two closely related organisms
(Karp et al., 1996). "EcoCyc," which is available on the World Wide Web, is now being expanded to include transport and regulatory information as well as metabolic reactions.
Results reported at the Microbial Genomics II conference clearly
suggested that many obligate prokaryotic human parasites have condensed
and streamlined their genomes, with the loss of regulatory functions
and biosynthetic capabilities. Moreover, although strict human
pathogens such as Mycoplasma genitalium, T. pallidum, and B. burgdorferi have adapted to an
anaerobic lifestyle using glycolytic sugar metabolism as their
primary source of energy, as reported by Claire Fraser,
Chlamydia (a common sexually transmitted disease agent
[Peeling and Brunham, 1996]) and Rickettsia (the causative
agent of Rocky Mountain spotted fever [Andersson and Andersson,
1997]) have adapted to a strictly aerobic lifestyle. These two
bacteria have lost their glycolytic enzymes and satisfy their energy
needs by metabolizing organic acids via the Krebs cycle and electron
flow, as reported by Richard Stephens and Siv Andersson, respectively
(unpublished results).

COMPARATIVE GENOMICS YIELDS NEW CLUES ABOUT PATHOGENESIS AND
HORIZONTAL GENE TRANSFER
Virulence factors and novel mechanisms of pathogenesis have been
revealed as a result of the development of the new discipline of
comparative genomics. Thus, Fred Blattner, who recently sequenced a
pathogenic E. coli strain and compared it with the
nonpathogenic E. coli K12 strain commonly used for
laboratory research, reported that the former bacterium possesses 1.2 Mbp more DNA than the latter, an approximately 20% increase in genomic
size. This additional genetic material codes for virulence
factor-bearing prophage, a virulence plasmid, and a "pathogenicity
island" that allows the bacteria to secrete toxic proteins directly
from the bacterial cytoplasm into that of the host animal cell (see
Groisman and Ochman, 1996). In addition, three new tRNAs, a
eukaryotic-type Ser/Thr protein kinase, and a novel iron-transport
system, all possibly important for pathogenicity, were revealed.
Potential drug targets were identified and studied. Lynn Miesel
reported on studies characterizing the targets of the antituberculosis drug isoniazid. This drug appears to inhibit the growth of M. tuberculosis by blocking the function of a biosynthetic enzyme that makes a cell-surface fatty acid called mycolic acid. Isoniazid thereby disrupts the outer protective layer of this fastidious bacterium, rendering it sensitive to host defense mechanisms.
Horizontal gene transfer between related gram-negative bacteria could
be established using the comparative genomics approach. Pathogenicity
islands, encoding virulence factor genes and the apparatus for their
transfer to host cells, have now been identified in many bacteria
(Groisman and Ochman, 1996). Fred Blattner noted that 30% of the new
DNA present in pathogenic E. coli, but lacking in
nonpathogenic E. coli, is shared by Yersinia
pestis, the causative agent of plague.

GENOMICS, PROTEOMICS, AND BIOINFORMATICS
Tremendous technological advances are being made in the related
fields of genomics, proteomics, and bioinformatics (Tables III-V). In
the area of genomics (the study of an organism's gene complement;
Table III), novel chips bearing
oligonucleotide probes for analysis of the expression of the complete
complement of genes encoded within the genome of an organism have
already been used and the results published (DeRisi et al., 1997;
Wodicka et al., 1997). Such a chip costs about $200 and is good for a
single information-rich experiment. However, the machine that reads the
chip costs about $150,000.
Expression vectors, gene knockouts, and reporter-gene constructs for
all of the genes in an organism will soon be available for at least two
major experimental organisms, the yeast S. cerevisiae, and
the sporulating, gram-positive bacterium B. subtilis. Our laboratory has recently taken advantage of such technology to identify
and characterize a protein kinase that controls catabolite repression
and carbon metabolism in B. subtilis (Reizer et al., 1998).
In earlier studies we had purified the protein and determined an
N-terminal amino acid sequence, but this sequence alone
was insufficient to allow us to clone the gene. When the complete
genome of B. subtilis became available, identification of the gene became almost automatic. Thus, when expression vectors, knockouts, and
fusion constructs become available commercially, the academic molecular
biologist will be largely obsolete.
In proteomics (the study of an organism's proteins; Table
IV), technological advances are beginning
to have an effect on basic research. Tremendous advances have been made
in coupling two-dimensional gel analysis of the total protein
complement of an organism to analysis by MS or to N-terminal sequence
analysis. Chips are being developed for pico-quantity protein
purification based on hydrophobic and ion-exchange chromatography as
well as adsorption chromatography.
Rapid procedures for studying protein-protein interactions are being
developed. For example, a yeast two-hybrid system has recently been
developed for the assay of whole libraries of genes to determine which
of the encoded proteins interact with high affinity with a specific
test protein (Williams et al., 1998). Additionally, novel chips for
studying protein-ligand interactions are being developed.
Finally, in the area of bioinformatics (the computational analysis of
genetic and biochemical data; Table V),
novel software is being developed for more refined homology searches of
the databases, for protein and nucleic acid secondary and tertiary
structural predictions, and for protein and DNA motif identification.
Many of the programs used routinely in academic laboratories for
characterizing families of proteins have been or are now being
automated. Such advances will be required to keep up with the
exponentially increasing volume of sequence data that will be available
in the near future.
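As a small illustration of what protein motif identification involves, the Python sketch below converts a PROSITE-style pattern into a regular expression and scans a sequence with it. It is a minimal sketch, not one of the programs referred to above. The example pattern is the widely cited P-loop (Walker A) ATP/GTP-binding motif, which is not mentioned in the text, and the sequence is a made-up toy string. Only a subset of the pattern syntax is handled.

```python
import re

# One pattern element: a residue, [..] alternatives, {..} exclusions, or the
# wildcard x, optionally followed by a repeat count such as (4) or (2,4).
_ELEMENT = re.compile(r"(\[[A-Z]+\]|\{[A-Z]+\}|[A-Zx])(?:\((\d+)(?:,(\d+))?\))?")


def prosite_to_regex(pattern):
    """Convert a simple PROSITE-style pattern to a Python regular expression."""
    parts = []
    for element in pattern.strip(".").split("-"):
        m = _ELEMENT.fullmatch(element)
        if m is None:
            raise ValueError(f"unsupported pattern element: {element!r}")
        core, low, high = m.groups()
        if core == "x":                      # wildcard: any residue
            core = "."
        elif core.startswith("{"):           # {..}: any residue except these
            core = "[^" + core[1:-1] + "]"
        if low is not None:                  # repeat count
            core += "{" + low + ("," + high if high else "") + "}"
        parts.append(core)
    return "".join(parts)


def find_motif(sequence, pattern):
    """Return (1-based start, matched text) for each non-overlapping match."""
    regex = re.compile(prosite_to_regex(pattern))
    return [(m.start() + 1, m.group()) for m in regex.finditer(sequence)]


if __name__ == "__main__":
    p_loop = "[AG]-x(4)-G-K-[ST]"            # P-loop (Walker A) pattern
    toy_sequence = "MSLAGPSGSGKSTLLR"        # hypothetical sequence
    print(prosite_to_regex(p_loop))
    print(find_motif(toy_sequence, p_loop))
```

A short pattern of this kind matches many unrelated proteins by chance, so a hit is a prompt for further analysis rather than evidence of function.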
Genome-sequence analyses have revealed that most of the proteins
encoded within the genome of a living organism belong to families of
homologous proteins that share a common evolutionary origin and are
present in many dissimilar organisms. Some of these families are very
large, with hundreds of currently sequenced members (Pao et al., 1998).
Other families are small, with only a few currently sequenced members
(Saier, 1998). By constructing dendrograms (which show approximate
relationships of the proteins to each other without providing numerical
values for the relative phylogenetic distances that separate them), or
instead by constructing phylogenetic trees (which not only cluster the
proteins according to their relative degrees of sequence similarity,
but also provide quantitative measures of their degrees of
relatedness), one can estimate the probability that any two members of
a protein family will prove to serve the same function. Phylogenetic
trees thus indicate relative degrees of sequence similarity and provide
a reliable guide to biochemical function.
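The distinction drawn above, between a clustering order alone and a tree that also carries numerical distances, can be seen in a very small distance-based example. The Python sketch below computes the fraction of differing positions (the p-distance) between aligned sequences and clusters them by UPGMA, a simple average-linkage method chosen here only for brevity; it is not presented as the method used in the studies cited. The four sequences in `TOY_ALIGNMENT` are hypothetical.

```python
from itertools import combinations

# Hypothetical aligned toy sequences; not real proteins.
TOY_ALIGNMENT = {
    "A": "MKTLLVAGGS",
    "B": "MKTLLVAGGT",
    "C": "MKSLIVAGAT",
    "D": "MRSLIVSGAT",
}


def p_distance(a, b):
    """Fraction of aligned positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def upgma(seqs):
    """Return the UPGMA merges as (cluster, height) pairs, earliest merge first."""
    sizes = {name: 1 for name in seqs}       # cluster -> number of sequences in it
    dist = {frozenset(pair): p_distance(seqs[pair[0]], seqs[pair[1]])
            for pair in combinations(seqs, 2)}
    merges = []
    while len(sizes) > 1:
        pair = min(dist, key=dist.get)       # closest pair of clusters
        a, b = sorted(pair, key=repr)        # fixed order for reproducible output
        d = dist.pop(pair)
        new = (a, b)
        for c in sizes:
            if c not in (a, b):
                dac = dist.pop(frozenset((a, c)))
                dbc = dist.pop(frozenset((b, c)))
                # Size-weighted average distance to the merged cluster.
                dist[frozenset((new, c))] = (
                    (sizes[a] * dac + sizes[b] * dbc) / (sizes[a] + sizes[b])
                )
        sizes[new] = sizes.pop(a) + sizes.pop(b)
        merges.append((new, d / 2))          # node height is half the distance
    return merges


if __name__ == "__main__":
    for cluster, height in upgma(TOY_ALIGNMENT):
        print(f"{cluster}  joined at height {height:.3f}")
```

The order of the merges alone corresponds to a dendrogram; the heights attached to each merge supply the quantitative measure of relatedness. UPGMA assumes equal rates of change in all lineages, so it should be read as an illustration of the idea rather than as a recommended tree-building method.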

WHERE IS GENOMICS GOING?
Where is the new discipline of genomics taking us? Directly into
an exciting and ever-expanding new century of scientific discovery. The
current sequencing explosion will lead to the development of novel
disciplines such as "comparative genomics" and "molecular archaeology" (Table VI). Because about
20% to 30% of each newly sequenced genome consists of genes encoding
proteins with no recognizable homologs in the current databases
(Koonin et al., 1997), a tremendous effort must be devoted to
the functional identification of these "orphan" proteins. Novel
regulatory constraints and virulence mechanisms will be revealed
(Ewald, 1996). New kingdoms of currently unculturable microbes
(Bloomfield et al., 1998) are likely to be revealed. Discoveries
leading to tremendous expansion of industrial applications will be
made, and additional novel technological advances will undoubtedly
appear in the not-too-distant future. But most importantly, genomics
will lead to discoveries currently unforeseen and nearly unfathomable.
Even 2 years ago we did not dream that microbial genomes would exhibit
the degree of plasticity that has been revealed by genome sequencing,
or that the "minimal genome" would be so limited in scope. Living
organisms have solved the fundamental problems of life in many diverse
ways. It is an exciting time to be working in the biological sciences,
and this excitement is largely attributable to developments in
genomics. The advances of the future will be limited only by the
imaginations of current and future generations of molecular biologists.

FOOTNOTES
* E-mail msaier@ucsd.edu; fax 1-619-534-7108.
Received April 5, 1998; accepted April 18, 1998.
1 The conference, entitled "Microbial Genomics
II," was organized by Claire Fraser of the Institute for Genomic
Research in the United States and Bart Barrell of the Sanger Center in
England, and took place at Hilton Head, South Carolina, from January 31 to February 3, 1998. The program and abstracts of the meeting are
presented in Microbial and Comparative Genomics (1998) 3: 1-96.

ABBREVIATIONS
Abbreviation: Mbp, megabase pair.

LITERATURE CITED
Andersson JO, Andersson SGE (1997)
Genomic rearrangements during evolution of the obligate intracellular parasite Rickettsia prowazekii as inferred from an analysis of 52015 bp nucleotide sequence. Microbiology 143: 2783-2795
Battista JR (1997)
Against the odds: the survival strategies of Deinococcus radiodurans. Annu Rev Microbiol 51: 203-224
Bevan M, Bancroft I, Bent E, Love K, Goodman H, Dean C,
Bergkamp R, Dirkse W, Van Staveren M, Stiekema W, and others (1998)
Analysis of 1.9 Mb of contiguous sequence from chromosome 4 of Arabidopsis thaliana. Nature 391: 485-488
Blattner FR, Plunkett G III, Bloch CA, Perna NT, Burland V, Riley M,
Collado-Vides J, Glasner JD, Rode CK, Mayhew GF, and others (1997)
The complete genome sequence of Escherichia coli K-12. Science 277: 1453-1474
Bloomfield SF, Stewart GSAB, Dodd CER, Booth IR, Power EGM (1998)
The viable but non-culturable phenomenon explained? Microbiology 144: 1-2
Clayton RA, White O, Ketchum KA, Venter JC (1997)
The first genome from the third domain of life. Nature 387: 459-462
DeRisi JL, Iyer VR, Brown PO (1997)
Exploring the metabolic and genetic control of gene expression on a genomic scale. Science 278: 680-686
Ewald PW (1996)
Guarding against the most dangerous emerging pathogens: insights from evolutionary biology. Emerging Infect Dis 2: 245-257
Fleischmann RD, Adams MD, White O, Clayton RA, Kirkness EF, Kerlavage AR,
Bult CJ, Tomb JF, Dougherty BA, Merrick JM, and others (1995)
Whole-genome random sequencing and assembly of Haemophilus influenzae Rd. Science 269: 496-512
Goffeau A, Aert R, Agostini-Carbone ML, Ahmed A, Aigle M, Alberghina L,
Albermann K, Albers M, Aldea M, Alexandraki D, and others (1997)
The yeast genome directory. Nature 387(suppl): 1-105
Groisman EA, Ochman H (1996)
Pathogenicity islands: bacterial evolution in quantum leaps. Cell 87: 791-794
Kaneko T, Sato S, Kotani H, Tanaka A, Asamizu E, Nakamura Y,
Miyajima N, Hirosawa M, Sugiura M, Sasamoto S, and others (1996)
Sequence analysis of the genome of the unicellular cyanobacterium Synechocystis sp. strain PCC6803. II. Sequence determination of the entire genome and assignment of potential protein-coding regions. DNA Res 3: 109-136
Karp PD, Riley M, Paley SM, Pellegrini-Toole A (1996)
EcoCyc: an encyclopedia of Escherichia coli genes and metabolism. Nucleic Acids Res 24: 32-39
Koonin EV, Mushegian AR, Galperin MY, Walker DR (1997)
Comparison of archaeal and bacterial genomes: computer analysis of protein sequences predicts novel functions and suggests a chimeric origin for the archaea. Mol Microbiol 25: 619-637
Kunst F, Ogasawara N, Moszer I, Albertini AM, Alloni G, Azevedo V,
Bertero MG, Bessieres P, Bolotin A, Borchert S, and others (1997)
The complete genome sequence of the Gram-positive bacterium Bacillus subtilis. Nature 390: 249-272
Pao SS, Paulsen IT, Saier MH Jr (1998)
The major facilitator superfamily. Microbiol Mol Biol Rev 62: 1-32
Paulsen IT, Sliwinski MK, Saier MH Jr (1998)
Microbial genome analyses: global comparisons of transport capabilities based on phylogenies, bioenergetics and substrate specificities. J Mol Biol 277: 573-592
Peeling RW, Brunham RC (1996)
Chlamydiae as pathogens: new species and new issues. Emerging Infect Dis 2: 307-319
Perego M, Glaser P, Hoch JA (1996)
Aspartyl-phosphate phosphatases deactivate the response regulator components of the sporulation signal transduction system in Bacillus subtilis. Mol Microbiol 19: 1151-1157
Perego M, Hanstein C, Welsh KM, Djavakhishvili T, Glaser P, Hoch JA (1994)
Multiple protein-aspartate phosphatases provide a mechanism for the integration of diverse signals in the control of development in B. subtilis. Cell 79: 1047-1055
Reizer J, Hoischen C, Titgemeyer F, Rabus R, Stülke J, Rivolta C,
Karamata D, Saier MH Jr, Hillen W (1998)
A novel protein kinase that controls carbon catabolite repression in bacteria. Mol Microbiol 27: 1157-1169
Reizer J, Reizer A, Saier MH Jr (1997)
Is the ribulose monophosphate pathway widely distributed in bacteria? Microbiology 143: 2519-2520
Saier MH Jr (1998)
Molecular phylogeny as a basis for the classification of transport proteins from bacteria, archaea and eukarya. In RK Poole, ed, Advances in Microbial Physiology. Academic Press, San Diego, CA (in press)
Williams JM, Chen G-C, Zhu L, Rest RF (1998)
Using the yeast two-hybrid system to identify human epithelial cell proteins that bind gonococcal Opa proteins: intracellular gonococci bind pyruvate kinase via their Opa proteins and require host pyruvate for growth. Mol Microbiol 27: 171-186
Wodicka L, Dong H, Mittmann M, Ho MH, Lockhart DJ (1997)
Genome-wide expression monitoring in Saccharomyces cerevisiae. Nature Biotechnol 15: 1359-1367