Department of Plant Biology, Michigan State University, East
Lansing, Michigan 48824-1312
Plastid envelope proteins from the Arabidopsis nuclear genome
were predicted using computational methods. Selection criteria were:
first, to find proteins with NH2-terminal plastid-targeting peptides from all annotated open reading frames from Arabidopsis; second, to search for proteins with membrane-spanning domains among the
predicted plastidial-targeted proteins; and third, to subtract known
thylakoid membrane proteins. Five hundred forty-one proteins were
selected as potential candidates of the Arabidopsis plastid inner
envelope membrane proteins (AtPEM candidates). Only 34%
(183) of the AtPEM candidates could be assigned to
putative functions based on sequence similarity to proteins of known
function (compared with the 69% function assignment of the total
predicted proteins in the genome). Of the 183 candidates with assigned
functions, 40% were classified in the category of "transport
facilitation," indicating that this collection is highly
enriched in membrane transporters. Information on the predicted
proteins, tissue expression data from expressed sequence tags and
microarrays, and publicly available T-DNA insertion lines were
collected. The data set complements proteomic-based efforts in the
increased detection of integral membrane proteins, low-abundance
proteins, or those not expressed in tissues selected for proteomic
analysis. Digital northern analysis of expressed sequence tags
suggested that the transcript levels of most AtPEM
candidates were relatively constant among different tissues in contrast
to stroma and the thylakoid proteins. However, both digital northern
and microarray analyses identified a number of AtPEM
candidates with tissue-specific expression patterns.
INTRODUCTION
Plastids exist in a wide range of
differentiated forms, including proplastids, chloroplasts, etioplasts,
amyloplasts, leucoplasts, and chromoplasts, depending on the
developmental stage and function of the plant cells in which they
reside. To a large extent, the types of plastids that are carried by
cells determine the metabolic function and products of the particular
plant tissue (Kirk and Tilney-Bassett, 1978).
One constant feature among the various types of plastids is the double
membrane envelope structure that surrounds the organelle. The envelope,
especially the inner envelope, effectively separates plastid metabolism
from that of the cytosol. At the same time, almost all carbon and a
major flux of other metabolites, various polypeptides, and signals must
move through the envelope to coordinate and integrate metabolism with
the entire cell. Plastid envelopes contain protein transport machinery
(Schnell, 1995; Cline and Henry, 1996; Heins et al., 1998) and are a
major site for membrane biogenesis. Metabolite transporters from
chloroplast and/or nongreen plastids accommodate the requirements of
the different photosynthetic or heterotrophic tissues (Kammerer et al.,
1998; Neuhaus and Emes, 2000). Membrane constituents (glycerolipids,
pigments, and prenylquinones) are synthesized on the envelope membrane,
as are the porphyrin ring and phytyl chain used in chlorophyll
synthesis, and the envelope also contains the enzymes required for
chlorophyll breakdown in senescing plastids (for review, see Joyard et
al., 1998). It is also
hypothesized that fatty acids that are synthesized within the stroma
are exported through a channeled system on the envelope (Pollard and
Ohlrogge, 1999; A.J. Koo, J.B. Ohlrogge, and M. Pollard,
unpublished data). Acyl group modifications of lipids, such as
desaturation, take place on the inner envelope, and lipid-derived
signals are produced on the envelope membranes (Miquel and Browse,
1992).
Despite the importance and complexity of the plastid envelope, only a
small fraction of envelope proteins have been purified or characterized
(Joyard et al., 1998). Recently, several groups have initiated
proteomic studies to identify the constituents of plant subcellular
organelles and membranes (Santoni et al., 1998; Sazuka et al., 1999;
Seigneurin-Berny et al., 1999; Ferro et al., 2000; Peltier et al.,
2000). Most of these studies have been based on one- or two-dimensional
electrophoretic methods followed by mass spectrometric analysis of
peptides derived from fractionated proteins. The continued development
and improvement of these methods has resulted in the identification of
hundreds of proteins, particularly from organisms with sequenced
genomes. Nevertheless, there are some limitations in current proteomic technologies. In particular, many membrane proteins are found in very
low abundance and/or expressed only in certain tissues, making
comprehensive samples difficult to obtain. In addition, many integral
membrane proteins remain difficult to solubilize and often do not
resolve well by gel electrophoresis or isoelectric focusing (Adessi et
al., 1997; Santoni et al., 2000). Finally, trypsin, most frequently
used to digest proteins before mass spectrometric analysis, has
relatively few sites in the more hydrophobic integral membrane
proteins. Together, these limitations make it likely that a significant
fraction of the integral proteins present in membranes will remain
difficult to detect by most proteomic technologies.
One of the first steps in understanding the function of a gene is to
determine the subcellular localization of the gene product (Somerville
and Dangl, 2000). To augment the proteomic approaches described above,
the complete sequencing of the Arabidopsis genome (Arabidopsis Genome
Initiative, 2000) now offers an additional opportunity to discover the
putative localization of proteins. To determine subcellular
localization of plastidial proteins, one can make use of certain
features of plastid-targeted proteins. The plastid genome itself
encodes only about 120 single-copy genes; the plastid must therefore rely on
proteins encoded in the nucleus. The protein constituents that are encoded in
the nuclear genome are synthesized in the cytoplasm as higher
Mr precursors containing an amino-terminal transit peptide that is cleaved after entry into the plastid (Cline and
Henry, 1996). These presequences of nuclear-encoded plastid proteins,
even though not strictly conserved, share some common features that can
be used to predict localization using computer algorithms (Emanuelsson
et al., 2000). This, in combination with transmembrane α-helical
domain characteristics, namely charge bias and hydrophobicity
(Sonnhammer et al., 1998; Krogh et al., 2001), makes it possible to
identify candidates for plastid membrane proteins. Considering the
nature of predictors and the selection criteria used (see "Results
and Discussion"), we expect our collection described in this paper to
contain mostly inner envelope integral membrane proteins.
Our laboratory has a long-standing interest in enzymes related to lipid
metabolism and proteins that may facilitate transport of lipids across
membranes. To identify additional proteins involved in these processes,
we reasoned that such proteins would be integral membrane proteins and
might not be easily detected by standard proteomic technologies. We
considered that a bioinformatics analysis of the Arabidopsis genome
might provide a useful complement to other approaches toward
characterization of integral proteins of the plastid envelope. This
study and its associated database of predicted candidates, together
with ongoing global functional analysis, will provide insight into
proteins associated with the plastid envelope as well as an initial
step to further elucidate their function.
RESULTS AND DISCUSSION
Plastid Envelope Membrane Protein Candidates
The strategy to select candidates of plastid envelope integral
membrane proteins from Arabidopsis nuclear genome sequences by
computational methods and the results are summarized in Figure 1.

Figure 1.
Summary scheme of plastid envelope protein
prediction from the Arabidopsis nuclear genome. The graph displays the
number of proteins after each selection step. Subcellular localization
was predicted using the TargetP version 1.01 Web server. Transmembrane
region prediction was done using the TMHMM version 2.0 Web server. cTP,
Chloroplast transit peptide. RC, TargetP reliability class. TM,
Transmembrane domain. See "Materials and Methods" for Web site
addresses for the predictors.
The first step was to predict plastid-targeted proteins from the entire
Arabidopsis genomic sequences (see "Materials and Methods"
for Arabidopsis genomic sequence retrieval). This was done using
TargetP (http://www.cbs.dtu.dk/services/TargetP/). TargetP is
considered the best subcellular location predictor that is publicly
available (Emanuelsson and von Heijne, 2001; Bannai et al., 2002;
Peltier et al., 2002) and was also chosen by the Arabidopsis Genome
Initiative to analyze the recently finished genome sequence of
Arabidopsis (Arabidopsis Genome Initiative, 2000). Protein targeting to
the plastid usually relies on an amino-terminal presequence (Cline and
Henry, 1996), and TargetP, a neural network-based tool, is trained to
recognize those signals. The overall success rate on test sets analyzed by
the developers was 85% (Emanuelsson et al., 2000). The sensitivity for
chloroplast-targeted proteins was 85%, whereas the specificity was
lower (69% and 84% on two different test sets). This means that more
false positives are expected than false negatives. To increase the
specificity, cutoff restraints can be applied. However, to include a
maximum number of possible candidates, we instead chose the default decision rule (winner takes all) and recorded in our database the
TargetP "reliability classes" from 1 to 5 (Emanuelsson et al., 2000). In
total, 3,665 of 25,552 (14.3%) proteins encoded by the Arabidopsis
nuclear genome were predicted to be plastidial, which is similar to the reported value of 3,574 from the Arabidopsis Genome Initiative (2000).
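The winner-takes-all rule and the reliability classes can be illustrated with a short Python sketch. It is not the TargetP implementation: the score dictionary, the function name, and the example values are hypothetical, and the class boundaries assume that the reliability class is defined by the gap between the two highest output scores (class 1 for a gap above 0.8, down to class 5 for a gap below 0.2), as described for TargetP.

```python
from typing import Dict, Tuple


def classify(scores: Dict[str, float]) -> Tuple[str, int]:
    """Return (predicted location, reliability class) for one protein.

    `scores` maps a location label (e.g. "cTP", "mTP", "SP", "other")
    to a predictor output score. Labels and values are illustrative.
    """
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    (best_location, best), (_, second) = ranked[0], ranked[1]
    gap = best - second  # winner takes all; the gap measures confidence

    # Assumed class boundaries: a larger gap gives a more reliable call.
    if gap > 0.8:
        reliability_class = 1
    elif gap > 0.6:
        reliability_class = 2
    elif gap > 0.4:
        reliability_class = 3
    elif gap > 0.2:
        reliability_class = 4
    else:
        reliability_class = 5
    return best_location, reliability_class


# Fraction of the nuclear-encoded proteome predicted to be plastidial,
# using the counts reported in the text.
plastidial_percent = round(100 * 3665 / 25552, 1)
```

Keeping every winner, including class 5 predictions, trades specificity for coverage, which matches the stated aim of including a maximum number of possible candidates.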
These 3,665 predicted plastid-targeted proteins were next analyzed for
transmembrane α-helical domains (Fig. 1). There are several Web-based
predictors available (Moller et al., 2001). A recent study compared the
performance of 14 different methods against 883 membrane-spanning
regions of biochemically characterized proteins, and concluded that
Transmembrane Hidden Markov Model (TMHMM) is the most accurate
method for transmembrane segment detection (Moller et al., 2001).
Subsequently, Hidden Markov Model for TOpology Prediction (HMMTOP) also
has been shown to have similar performance (Moller et al., 2002). Most
recent methods for prediction of transmembrane helices rely on
hydrophobicity and charge bias of the transmembrane regions. There were
variations in the accuracy of overall correct topology prediction
(number and location and orientation of transmembrane regions);
however, correct prediction of the presence of transmembrane segments
per se was overall 95% in a comparison of the three best predictors
(Sonnhammer et al., 1998). TMHMM is reported to discriminate between
soluble and membrane proteins with both specificity and sensitivity
better than 99% (Krogh et al., 2001). TMHMM, like TargetP, is a
machine learning-based program, but it is built on a hidden Markov
model, an architecture that corresponds more closely to the biological
system. The probability that a predicted integral protein is a true
membrane protein is positively correlated with the expected number of
amino acid residues within the transmembrane segments and also with the
expected number of transmembrane segments in the protein. In other
words, the more amino acid residues per predicted transmembrane
segment and the more predicted transmembrane segments per
protein, the more likely it is to be a true integral protein. To
include as many integral proteins as possible, we selected all proteins
that have at least one predicted transmembrane domain, regardless of
amino acid residue numbers. One of the advantages of TMHMM is that it
provides plots of probabilities for prediction throughout the analyzed
proteins. In our study, proteins that were overall not predicted to be
transmembrane proteins, yet contained weak α-helical domains with
probabilities over 50%, were also included in the database as
potential membrane proteins with their probabilities. Because the
accuracy of TMHMM prediction drops when signal peptides are present
(Krogh et al., 2001), transmembrane predictions within the predicted
targeting peptides (according to the TargetP cleavage site predictions)
were removed and those domains that were predicted close to the
targeting peptides were designated in the database as "N-term."
After processing as described above, of 3,665 predicted plastidial
proteins, 562 (about 15%) contained potential transmembrane
α-helical domains (Fig. 1) and, thus, are considered as plastid
membrane protein candidates.
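The handling of transmembrane predictions relative to the targeting peptide can be sketched as follows. This is an illustration under stated assumptions, not the procedure actually scripted for the database: the function name and the segment representation are hypothetical, a segment is treated as lying "within" the targeting peptide when it starts at or before the predicted cleavage site, and the 20-residue window used to label a segment "N-term" is an arbitrary placeholder because the text does not give the distance used.

```python
from typing import List, Tuple

Segment = Tuple[int, int]  # (start, end) residue positions, 1-based


def mature_tm_segments(
    segments: List[Segment],
    cleavage_site: int,
    n_term_window: int = 20,  # assumed value; not stated in the text
) -> List[Tuple[int, int, str]]:
    """Drop predicted helices inside the targeting peptide and flag
    those lying just after the predicted cleavage site as "N-term"."""
    kept = []
    for start, end in segments:
        if start <= cleavage_site:
            # Predicted helix begins inside the targeting peptide: removed.
            continue
        label = "N-term" if start - cleavage_site <= n_term_window else "TM"
        kept.append((start, end, label))
    return kept


def is_membrane_candidate(segments: List[Segment], cleavage_site: int) -> bool:
    """A protein is retained if at least one helix survives the filter."""
    return len(mature_tm_segments(segments, cleavage_site)) >= 1
```

For example, with a predicted cleavage site at residue 50, a helix at residues 5 to 25 would be discarded, a helix at 60 to 80 would be kept with the "N-term" flag, and a helix at 200 to 220 would be kept as an ordinary transmembrane segment.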
The third step was to eliminate the thylakoid integral membrane
proteins. Although there are suborganelle location predictors such as
PSORT, the overall accuracy of such programs is not high enough (about
50%) to use for discrimination between the thylakoid and the
envelope-localized proteins (Nakai and Horton, 1999). It is
estimated that there are at least 200 proteins in the lumenal space and
in the periphery of the thylakoid membrane (Peltier et al., 2000). Four
multisubunit protein complexes with 75 to 100 peptides involved in
photosynthesis comprise the majority of the thylakoid membrane proteome
(Peltier et al., 2000). These proteins are relatively well studied and
many of them are known to be encoded by the plastid genome, which
contains about 90 protein-encoding genes (Sugita and Sugiura, 1996). If
the known nuclear-encoded thylakoid membrane proteins are removed from
our candidate list of 562, the remaining candidates are expected to be
highly enriched in envelope proteins. After manual removal of 21 known
thylakoid-associated proteins and their homologs, 541 proteins remained
as potential candidates for Arabidopsis plastid envelope membrane
proteins (AtPEM).
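The three selection steps amount to two filters followed by a set subtraction, as in the minimal Python sketch below. The record fields, the function name, and the gene identifiers are hypothetical placeholders; only the counts at the end are taken from the text.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set


@dataclass
class Protein:
    gene_id: str    # placeholder identifier
    location: str   # predicted location label, e.g. "cTP"
    n_tm: int       # predicted helices remaining after transit peptide removal


def select_candidates(proteins: Iterable[Protein],
                      known_thylakoid: Set[str]) -> List[str]:
    """Apply the three selection steps in order."""
    plastidial = [p for p in proteins if p.location == "cTP"]       # step 1
    membrane = [p for p in plastidial if p.n_tm >= 1]                # step 2
    return [p.gene_id for p in membrane
            if p.gene_id not in known_thylakoid]                     # step 3


# Toy input with made-up identifiers.
toy = [
    Protein("geneA", "cTP", 4),
    Protein("geneB", "cTP", 0),    # soluble, dropped at step 2
    Protein("geneC", "mTP", 6),    # not plastidial, dropped at step 1
    Protein("geneD", "cTP", 1),    # known thylakoid protein, dropped at step 3
]
toy_candidates = select_candidates(toy, known_thylakoid={"geneD"})

# Funnel counts reported in the text.
remaining = 562 - 21                               # candidates after step 3
membrane_percent = round(100 * 562 / 3665, 1)      # share with a predicted helix
```

The manual removal of thylakoid proteins and their homologs is represented here by a simple lookup set; in practice that step depended on curated knowledge rather than on a predictor.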
It is important to note that these are predicted integral
membrane protein candidates and that the candidate list will contain both omissions and false inclusions of various types. First, as mentioned above,
the specificity of TargetP prediction is lower than the sensitivity.
Because we did not employ any specificity cutoff for TargetP prediction
and used low stringency for transmembrane prediction (transmembrane
domain probability > 50%), these 541 candidates may contain many
false positives. If instead the TargetP specificity is set to >0.95
and proteins with low probabilities of transmembrane domain are
excluded, about 250 candidates remain and these represent a higher
reliability core set. These higher reliability candidates are color
coded on our Web site. Second, there will be many membrane proteins
that do not have transmembrane domains but are associated with the
envelope through different types of interactions such as β-sheet
conformation, polyisoprenylation, acylation, glycolipid anchors, and
protein-protein interaction-mediated associations. Third, most
outer membrane-localized proteins appear to insert directly into the
outer envelope membrane without going through the general import
apparatus (Cline and Henry, 1996; Fulgosi and Soll, 2001;
Schleiff and Klosgen, 2001). Therefore, the 541 AtPEM
candidates will underrepresent the outer membrane proteins. However,
the protein composition of outer membranes is much simpler than that of
the inner membrane. The lipid to protein ratio of the outer membrane is about 3 times higher than that of the inner membrane, and the inner envelope
membrane, which contains various metabolite translocators, is the main
permeability barrier for solutes, although there is growing evidence
suggesting that the outer envelope may contain regulated ion channels
(Cline et al., 1981; Block et al., 1983; Douce and Joyard, 1990;
Pohlmeyer et al., 1998; Flugge, 2000; Bolter and Soll, 2001).
Thus, we consider it likely that AtPEM candidates represent
the majority of integral plastid envelope proteins.
What Are They?
Figure 2A indicates the functional
identification status for the selected 541 AtPEM candidates
based on the short descriptions in the annotation databases (MIPS and
GenBank). The proteins with identified function (identified) or that
have homology with known proteins (X like, putative X) comprised only
about 29% of the 541, whereas 71% of the candidate proteins could not
be assigned to any known function (putative, unknown, and
hypothetical). This is a very low level of characterization compared with
the overall 69% functional assignment of the whole genome (Arabidopsis
Genome Initiative, 2000) and reflects the past difficulties in studying membrane proteins (Wilkins et al., 1998; Seigneurin-Berny et al., 1999).

Figure 2.
Functional classification of the Arabidopsis
plastid envelope protein (AtPEM) candidates. A, Overall
functional annotation status based on the short descriptions of each
candidate from the Munich Information Center for Protein Sequences
(MIPS; chromosomes 1, 3, and 5) and from GenBank (chromosomes 2 and 4).
X, Any protein/gene/enzyme name. B, Detailed functional classification
using automatically derived functional categories from the Protein
Extraction Description and Analysis Tool (PEDANT) Web server (see
"Materials and Methods"). C, Subclassification of the proteins
classified as "transport facilitation" and "metabolism" by
PEDANT.
More detailed functional classification of the
AtPEM candidates was done according to the
automatically derived functional categories from the MIPS
Arabidopsis Database using the PEDANT Web server
(http://pedant.gsf.de/; Frishman et al., 2001). Among the 541 candidates, 183 (34%) were found in the PEDANT database and were
classified into 17 classes and 89 subclasses. Figure 2B shows the
number of AtPEM candidates found in each category. The
largest number of AtPEM candidates (39% of 183 classified predicted proteins) fell into the class of "transport
facilitation." Thus, membrane transporters are highly enriched in the
selected AtPEM candidates. Figure 2C shows the
subclassification of "transport facilitation" and "metabolism"
classes. The three major subclasses for "transport facilitation"
were "ion transporters," "C compound and carbohydrate
transporters," and "ATP binding cassette (ABC) transporters."
There were also several candidates involved in protein translocation
into the plastids. As expected, none of the outer membrane components of the protein
translocon were found among AtPEM candidates
(many outer membrane proteins lack obvious plastid-targeting sequences,
as discussed above), whereas five Arabidopsis homologs of known pea
(Pisum sativum) components of the translocon at the inner membrane of
chloroplasts (Tic110, Tic20, Tic40, and Tic55; Jackson-Constan and
Keegstra, 2001) were present among the candidates (Table I). Two Arabidopsis homologs of Tic22
were predicted to be targeted to the plastids by TargetP, but did
not have an obvious membrane-spanning domain (by TMHMM prediction).
Tic22 is thought to be peripherally associated with the inner envelope
membrane facing the intermembrane space (Jackson-Constan and Keegstra,
2001).
In agreement with the known role of plastid envelopes in lipid
metabolism, proteins related to "lipid, fatty acid, and isoprenoid metabolism" represented the largest subset among the
"metabolism" class. Table I presents the results of a more detailed
survey of proteins predicted or known to be involved in plant lipid
metabolism; the search was carried out using our database of predicted
lipid-related genes in Arabidopsis (Mekhedov et al., 2000; F. Beisson
and J.B. Ohlrogge, unpublished data). Seven proteins potentially
responsible for known enzymatic reactions to produce glycerolipids in
the plastid envelope were found among AtPEM candidates
(Table I; for review, see Ohlrogge and Browse, 1995). Arabidopsis
homologs of plastidial 2-lysophosphatidic acid acyltransferase
(EC 2.3.1.51), which mediates acyl group transfer from acyl-acyl
carrier protein to the sn-2 position of lyso-phosphatidic acid
(1-acyl-sn-glycerol 3-phosphate) to produce phosphatidic acid
(1,2-diacyl-sn-glycerol 3-phosphate), were missing. All three
Arabidopsis homologs of this enzyme were predicted to have a different
subcellular location ("mitochondria" and "others") by TargetP.
Digalactosyldiacylglycerol synthase (EC 2.4.1.184), which is known to
localize on the outer envelope (Joyard et al., 1998), was not among the
AtPEM candidates. Although it had a plastid transit peptide
recognized by TargetP, no membrane-spanning region was detected by TMHMM.
In plants, fatty acid synthesis takes place in the plastid. Although a
portion of the newly synthesized acyl chains are then used for lipid
synthesis within the plastid, a major portion is exported into the
cytosol for glycerolipid assembly at the endoplasmic reticulum or other
sites. In addition, some of the extraplastidial glycerolipids return to
the plastid. Thus, there is considerable lipid exchange and
intermixing between the plastid and the endoplasmic reticulum (Ohlrogge
and Jaworski, 1997). This involves lipid transport across the
plastid envelope membranes. Evidence is emerging that some ABC
transporters are involved in lipid transport (Ruetz and Gros, 1994;
Zhou et al., 1998; Hettema and Tabak, 2000). There were 15 potential
ABC transporters among the AtPEM candidates, which are therefore
candidates for lipid transporters of the plastid envelope.
Surprisingly, there were several candidates that are homologous
to proteins involved in cell wall biogenesis and modifications, such as
an arabinogalactan-protein homolog, a cellulose synthase catalytic subunit-like protein, a pectinesterase-like protein, a pectin
methylesterase-like protein, and a putative xyloglucan fucosyltransferase.
These unexpected relationships between the plastid envelope
and predicted proteins annotated with functions that are generally
considered non-plastidial may reflect inaccurate prediction by
TargetP software. Alternatively, because these putative functions are
assigned based solely upon sequence similarities, an equally likely
possibility is that the annotations point toward related, but
previously undescribed, functions in the plastid. Therefore, this
incongruence represents an example where bioinformatics analysis can
direct attention toward potential novel functions in the plastid envelope.
Characteristic Features of Plastid Envelope Candidates
The distribution of the peptide length of AtPEM
candidates is shown in Figure 3A. Eighty
percent of the candidates were shorter than 600 amino acids.
The average length of the peptides was 411 amino acids, which is
slightly smaller than the average peptide length of 434 amino acids for
the nuclear genome (Arabidopsis Genome Initiative, 2000). As indicated
in Figure 3B, 63% of the AtPEM candidates contained two or
fewer membrane-spanning domains, whereas there were 40 candidates that
had 10 or more membrane-spanning domains. The spanning domains of
proteins with only a few such domains might serve as an anchor to the
envelope or to tether the peripheral proteins. Interestingly, 70% of
candidates with 10 or more membrane-spanning domains were classified as
"transport facilitation." Due to their high hydrophobicity, these
potential transporters would be particularly difficult to display on
two-dimensional electrophoretic gels of purified plastid
envelopes.

Figure 3.
Characteristics of AtPEM candidates. A,
Peptide length distribution. L, Number of amino acid residues.
Plastid-targeting sequences were removed based on TargetP cleavage site
predictions. B, Distribution of the number of membrane-spanning
domains. The TMHMM version 2.0 Web server was used for the
membrane-spanning domain prediction. prob<1, Proteins with spanning
domain probability less than 1.
AtPEM candidates provide an opportunity to compare the predicted
plastid envelope proteome with the proteomes of other organisms, e.g.
cyanobacteria. The Synechocystis PCC6803 genome contains about 3,167 protein-encoding genes, which is roughly the number predicted to be plastidial in Arabidopsis. Highly homologous proteins for about 32% (175) of AtPEM candidates were found in the
Synechocystis PCC6803 genome (data not shown),
compared with 44% and 47% of thylakoid and stromal proteins,
respectively. Thus, the boundary envelope of the plastid may have a more
"mixed" evolutionary origin than the internal components.
We also compared our list of candidates with approximately 180 proteins
identified by mass spectrometric analysis of Arabidopsis chloroplast
envelope preparations (B.S. Phinney and J.E. Froehlich, unpublished data). Among these 180 proteins, about 70 had at least one
transmembrane domain predicted by TMHMM (TM probability > 50%)
and 38 proteins overlapped with AtPEM candidates. Thus, the proteomics approach identified many nonintegral membrane proteins and
outer membrane proteins not selected by our approach, whereas the
bioinformatic identifications used here may provide novel information
on a set of inner envelope integral membrane proteins not easily
characterized by proteomics.
Digital mRNA Expression Profile
In addition to the subcellular localization, indirect information
on cellular or developmental function can be obtained from spatial and
temporal expression patterns of genes. Gene chips, microarrays,
expressed sequence tags (ESTs), and serial analysis of gene expression
all provide useful means to study mRNA expression profiles (Bouchez and
Hofte, 1998).
A large proportion (68%) of AtPEM candidates have at
least one EST in the GenBank dbEST (113,330 ESTs as of January 4, 2002; Table II) and 87% (470 of 541) were
represented by The Institute for Genomic Research (TIGR) Tentative
Consensus (TC) sequences (163,752 TCs as of May 23, 2001;
http://www.tigr.org/tdb/agi/; Quackenbush et al., 2000; see
"Materials and Methods"). Thus, the majority of the
AtPEM candidates are transcriptionally active.
The abundance of ESTs sequenced from different cDNA libraries
can provide an estimate of relative transcript abundance provided a
number of conditions are met (Audic and Claverie, 1997). To obtain
tissue-specific expression profiles of AtPEM candidates, 110,000 ESTs deposited in GenBank and sequenced from 55 different cDNA
libraries were grouped into eight "library pools" according to the
source tissues from which the cDNA libraries were derived (Table II; F. Beisson and J.B. Ohlrogge, unpublished data). We then surveyed the
abundance of ESTs for AtPEM candidates within each library pool, normalizing by dividing the number of
AtPEM candidate ESTs in a given "library pool" by the
total number of ESTs in that pool. The relative
abundances of ESTs for AtPEM candidates in each tissue were
similar, ranging from 1.4% to 2.5% of the total ESTs, except for the
"flowers" pool, which showed the highest abundance at 4.2% of ESTs.
Although these aggregate EST frequencies varied little, a number of
individual proteins were found to have more distinct tissue-specific expression patterns. To study this in detail, tissue-specific expression patterns of individual candidates were analyzed using statistical equations developed by Audic and Claverie
(http://igs-server.cnrs-mrs.fr) that differentiate between random EST
sampling fluctuations versus significant change in EST frequencies. The
number of ESTs corresponding to a given gene (AtPEM
candidate) found in each of three tissue-specific library pools
("flowers," "roots," or "seeds") was compared with that in
the reference "mixed" pool; the results are presented in Table III.
A total of 21 AtPEM
candidates (6% of 365 candidates that had at least one EST) displayed
tissue-specific expression when a threshold of P < 0.005 (P, the probability that the compared EST abundances differ by chance) was applied.
Figure 4 visualizes the digital expression
profile of the 21 AtPEM candidates that showed a tissue-specific
expression pattern at P < 0.005. Among the
differentially expressed genes were Glc-6-phosphate/phosphate
translocator (GPT) and phosphoenolpyruvate/phosphate
translocator (PPT). GPT was shown previously to be highly expressed in
developing maize (Zea mays) kernel and potato
(Solanum tuberosum) tuber (Kammerer et al., 1998) and the
ESTs of GPT were abundant in the Arabidopsis developing seed EST
database (White et al., 2000). In agreement with these observations,
GPT had higher EST abundance in the "seeds" library pool than the
reference at P < 0.004. Similarly, for PPT, biochemical analysis and mRNA-blotting results indicated a high expression of PPT in nongreen tissue (Kammerer et al., 1998), especially roots (Fischer et al., 1997), and in our EST analysis, PPT
expression was highest in the "roots" (P < 0.003). Thus, the comparisons in Table III agree with previous
biochemical characterizations and support the validity of the digital
expression profile approach. Many of the proteins shown in Table III
are of unknown function, including several with high representation in
specific tissues.
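The normalization and the statistical comparison described above can be sketched in Python. The first function is the stated normalization (gene-specific ESTs divided by the pool total). The second implements the probability of Audic and Claverie (1997) for observing y tags in a library of size N2 given x tags in a library of size N1, p(y|x) = (N2/N1)^y (x+y)! / [x! y! (1 + N2/N1)^(x+y+1)]. The function names and the one-sided tail summation are assumptions made for illustration; the text does not specify how the Web server derives the reported P values.

```python
import math


def relative_abundance(gene_ests: int, pool_total: int) -> float:
    """EST count for a gene (or gene set) divided by the pool size."""
    return gene_ests / pool_total


def ac_probability(y: int, x: int, n1: int, n2: int) -> float:
    """Audic-Claverie p(y | x): chance of seeing y ESTs in a pool of
    n2 ESTs given x ESTs for the same gene in a pool of n1 ESTs."""
    ratio = n2 / n1
    log_p = (
        y * math.log(ratio)
        + math.lgamma(x + y + 1)      # log (x + y)!
        - math.lgamma(x + 1)          # log x!
        - math.lgamma(y + 1)          # log y!
        - (x + y + 1) * math.log(1 + ratio)
    )
    return math.exp(log_p)


def ac_upper_tail(y: int, x: int, n1: int, n2: int) -> float:
    """P(Y >= y | x): probability of at least y ESTs arising from
    sampling fluctuation alone (one-sided; an illustrative choice)."""
    below = sum(ac_probability(k, x, n1, n2) for k in range(y))
    return max(0.0, 1.0 - below)
```

A gene would be flagged as enriched in a tissue pool when the tail probability, computed with the reference "mixed" pool as the first library, falls below the chosen threshold (0.005 in Table III). Working in log space with `math.lgamma` avoids overflow of the factorials for abundant transcripts.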
Table III.
Summary of AtPEM candidates with tissue-specific
transcript abundances
Nos. of ESTs in the "flowers," "roots," and "seeds" pools
were compared with that in the "mixed" pool. Candidates with
probabilities of EST frequency differences being by chance less than
0.005 are grouped according to the tissue specificity of the transcript
abundances.

Figure 4.
Digital differential display analysis.
AtPEM candidates with statistically different levels
(P < 0.005, where P is probability of
difference being by chance) of EST frequencies are displayed. Number of
ESTs in the "flowers," "roots," or "seeds" pools was
compared with that in the reference "mixed" library pool. The bars
indicate the abundances of the ESTs (EST frequency axis) corresponding
to the proteins (proteins axis) in each library pool (tissue
axis).
The bioinformatics predictions of suborganellar localization, together
with tissue-specific expression, provide initial clues that may help in
discovering the functions of these proteins. However, it should be
noted in such analysis that the normalization procedures used in the
construction of the cDNA libraries for EST projects can undermine the
statistical analysis, and even with a stringent threshold such as
P < 0.005 used in Table III, there could still be
false positives (Audic and Claverie, 1997). Therefore, the digital
expression profile analysis should be considered as an initial step to
find possible candidates for differentially expressed genes from a
large collection of data and needs to be verified by experiments such
as northern blots, reverse transcription-PCR, or western blots.
Plastids undergo massive changes during differentiation into their
various forms (chloroplast, chromoplast, amyloplast, leucoplast, etioplast, etc.) and need to import different spectra of proteins according to their changed biochemical properties. However, the structure of the concentric pair of envelope membranes remains constant
(Douce and Joyard, 1990; Joyard et al., 1998). The extent to which
the proteome of the envelope also remains
unchanged is not yet known. The low number (21) of candidates differentially expressed
between the different tissues may indicate that the envelope proteome
candidates are relatively constitutive in terms of transcript abundance
across the different tissues. Furthermore, when the
tissue-specific EST distributions of AtPEM candidates and
those of 39 thylakoid-localized proteins and 1,802 stromal proteins were
compared, AtPEM candidates showed the least changes (Table IV).
Whereas only 8% of the
AtPEM candidates were tissue specific at P < 0.05, 69% of the analyzed thylakoid-localized proteins showed
specific EST tissue distributions. The stromal proteins also displayed
a higher level (19% at P < 0.05) of differential expression than the AtPEM candidates. Thus, based on EST
frequencies from different cDNA libraries, the transcript abundance for
the envelope membrane proteins appears relatively constant compared with that of the stromal or thylakoid proteins.
Table IV.
Tissue-specific transcript abundance comparison
between the plastid envelope, thylakoid, and stroma-localized proteins
Microarray Analysis
To further investigate the tissue-specific gene expression pattern
of AtPEM candidates, we analyzed publicly available cDNA microarray data from the Stanford Microarray Database (SMD;
http://genome-www.stanford.edu/microarray; Sherlock et al., 2001)
and from the Arabidopsis developing seed array Web site
(http://www.bpp.msu.edu/Seed/SeedArray.htm; Girke et al., 2000; Ruuska
et al., 2002). Expression profiles of 230 AtPEM candidates
for about 110 different microarray experiments were found in the SMD
public domain (as of March 12, 2002). We specifically examined tissue
comparison data that were publicly available for comparisons between
flower, leaf, or root versus the whole plant for 147 AtPEM
candidates and for seed versus leaf or seed versus plantlets for 59 AtPEM candidates. Figure 5
presents a summary of genes that displayed more than 2-fold changes in at least two different tissue comparisons and whose pattern agreed in all combinations (e.g. At1g29390 expression is higher in leaf
than in flower and also higher in leaf than in root).
Twenty-two candidates showed higher levels of transcript
abundance in leaf than in root, whereas only three
showed higher levels in root than in leaf. Twenty-one of these 22 candidates (expressed higher in leaf than in root) also had higher
transcript abundance in leaf than in flower, whereas only three
candidates showed higher expression in flower than in leaf. Comparison
between the seed and the leaf also showed higher levels of transcript
abundance in leaf (23 candidates higher in leaf and four higher in seed). Although it is well known that thylakoid and stromal chloroplast proteins are
very abundant in leaves, very little comparative information is
available on the envelope proteins. The data of Figure 5 suggest that
transcripts for many plastid envelope proteins are also more abundant
in leaves than in heterotrophic tissues.

Figure 5.
Tissue-specific microarray analysis of
AtPEM candidates. AtPEM candidates with >2-fold
tissue-specific expression in at least two different tissue comparisons
are shown clustered according to the tissue specificity (indicated with
the brackets and labeled). The intensity of green and red colors
represents the relative level of transcripts (refer to the color keys
at the bottom for the actual ratios) in tissues under comparison (the
tissues under comparison are indicated above the image and colored with
representative colors). The data for flower, leaf, and root comparisons
were from SMD (http://genome-www.stanford.edu/microarray) and data for
seed versus leaf and seed versus plantlet were from the Arabidopsis
developing seed array (Girke et al., 2000; Ruuska et al., 2002). The
graphic was generated by Tree View software (Eisen et al., 1998).
Six genes selected by the digital expression profile analysis in Table
III as tissue specific were also present in the microarray data shown
in Figure 5. Although the digital expression profiles used mixed
tissues as a reference and microarray data compared individual tissues
or plantlets, data for all six genes agree at least partially between
the two types of analysis. However, the cross validity between the
statistical EST analysis and the microarray analysis should be further
characterized by larger sets of genes that have data analyzed by both
methods, combined with additional experiments (northern blots, RT-PCR,
etc.) to verify results.
Changes in transcript abundance in seeds from 5 to 13 d after
flowering were recently reported by Ruuska et al. (2002). Twenty-six
AtPEM candidates changed more than 2-fold during this period
(data not shown). The expression of many genes involved in seed storage
protein, starch, and lipid biosynthesis also changes during these
stages (Ruuska et al., 2002); therefore, these 26 AtPEM
candidates may contribute to the role of plastids in seed filling and development.
Toward the Function of Unknown Candidates
Gene knockouts often can provide key information to
link genes of unknown function to a phenotype. Large collections of
T-DNA insertion mutants are publicly available, and when the DNA
flanking the T-DNA insertion sites is sequenced and aligned with the
Arabidopsis genome sequence, these collections provide mutants useful for studying gene
function. We searched the Sequence-Indexed Library of Insertion
Mutations generated by the Salk Institute Genome Analysis Laboratory
(http://signal.salk.edu/tabout.html; containing 32,758 T-DNA sequences
as of March 8, 2002). In total, 388 insertion lines corresponding to
217 AtPEM candidates (approximately 40% of AtPEM
candidates) were found. The "Insertion Ids" can be used to search for
and to order the mutant line seed stocks from the Arabidopsis
Biological Resource Center
(http://www.biosci.ohio-state.edu/~plantbio/Facilities/abrc/abrchome.htm) at Ohio State University (Columbus). These "Insertion Ids,"
as well as the accession numbers to search the publicly available microarray data in SMD (http://genome-www.stanford.edu/microarray), were deposited on our Web site
(http://www.plantbiology.msu.edu/PlastidEnvelope/).
 |
CONCLUSIONS |
Plastids draw the attention of plant biologists in large part
because of their defining roles in establishing the character of the
plant cell. Envelope proteins may hold one key to the understanding of
coordinated control between the plastid and the rest of the cell. So
far, there is no established inventory of plastid envelope proteins. In
this study, we attempted to predict integral plastid envelope proteins
from the Arabidopsis nuclear genome using computational methods.
As the word "candidates" implies, the results of our
study are not definitive because of the nature of prediction software and the possibility of errors in protein annotations in the genome
databases (van Wijk, 2001). Although it is likely that at least
10% of the AtPEM candidates represent incorrect
predictions, it is reasonable to assume that the selected candidates
represent a large portion of the real envelope proteome. These
candidates can be used as a starting point for designing further
biological experiments. For example, one alternative strategy to
complement "proteomics" approaches is to experimentally
characterize envelope localization of selected candidates. The
AtPEM candidates could be analyzed for their targeting to
the chloroplasts and partitioning to the envelopes by in vitro
reconstitution of import into the chloroplast followed by fractionation
into the suborganellar compartments. This method can be especially
powerful in identifying highly hydrophobic, low-expressed transporters
and also can circumvent the difficulties in isolation of pure plastid
envelope from many non-photosynthetic tissues. This approach is
currently being developed and evaluated in our laboratory. Another
application to further characterize selected candidates will be to
define subsets of AtPEM proteins that show concerted
spatial, developmental, or conditional expression profiles. Finally,
based on our database, and such selections, a more focused analysis of
the phenotypes and biochemical compositions altered in T-DNA insertion
lines can be carried out.
 |
MATERIALS AND METHODS |
Establishment of a Database for Plastid Envelope Protein
Candidates
The results of TargetP prediction for Arabidopsis chromosomes 2 and 4 were from
http://www.cbs.dtu.dk/services/TargetP/predictions/pred.html. The
sequences for these two chromosomes were from TIGR and the European
Union Arabidopsis Sequencing Consortium (as of January 7, 2000).
Additional sequence retrieval was from the National Center for
Biotechnology Information Batch Entrez Web server
(http://www.ncbi.nlm.nih.gov:80/Entrez/batch.html). Chromosomes 1, 3, and 5 sequences were downloaded from the MIPS Arabidopsis Database
ftp site (ftp://ftpmips.gsf.de/cress/, as of June 2001). Subcellular
localization was predicted using TargetP version 1.01 from the Center
for Biological Sequence Analysis (http://www.cbs.dtu.dk/services/TargetP/). No cutoff was applied;
instead, "Reliability Class" values from 1 to 5 were recorded for
each predicted protein. Proteins with plastid transit peptides were
then evaluated for membrane-spanning domains using the TMHMM version
2.0 Web server at http://www.cbs.dtu.dk/services/TMHMM-2.0/.
Proteins that were not strongly predicted to be transmembrane proteins
by the program, yet contained weak α-helical domains with
probabilities over 50%, were also included in the database as possible
membrane proteins. Proteins with transmembrane α-helical domains
within the range of possible plastid-targeting sequences (as defined by the TargetP cleavage site prediction) were removed, and proteins in which the domain was
located close to but not within the predicted cleavage site were marked
"N-term," to reduce false prediction of hydrophobic targeting
sequences as membrane-spanning domains. Proteins that are known to be
located in the thylakoid were removed manually.
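The selection rules described in this section can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the procedure actually used in this study: the `Protein` record, its field names, the `classify` function, and the `N_TERM_MARGIN` value are hypothetical (the text gives no numerical definition of "close to" the cleavage site), and rejecting a protein only when no helix begins beyond the transit peptide is one possible reading of the removal rule.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# Hypothetical margin: residues past the cleavage site still counted as "close".
N_TERM_MARGIN = 10


@dataclass
class Protein:
    locus: str
    targetp_loc: str       # assumed one-letter location code, "C" for chloroplast
    cleavage_site: int     # predicted transit peptide length in residues
    tm_helices: List[Tuple[int, int]] = field(default_factory=list)  # (start, end)
    known_thylakoid: bool = False


def classify(p: Protein, margin: int = N_TERM_MARGIN) -> Optional[str]:
    """Return None (rejected), "N-term" (flagged), or "candidate"."""
    # Step 1: keep only proteins predicted to be plastid targeted.
    # Step 3: subtract known thylakoid proteins.
    if p.targetp_loc != "C" or p.known_thylakoid:
        return None
    # Step 2: helices that begin inside the transit peptide are discounted,
    # because hydrophobic targeting sequences can mimic membrane spans.
    outside = [(s, e) for s, e in p.tm_helices if s > p.cleavage_site]
    if not outside:
        return None
    # Flag proteins whose remaining helices all start just after the cleavage site.
    if all(s <= p.cleavage_site + margin for s, _ in outside):
        return "N-term"
    return "candidate"
```

Keeping the flagged "N-term" class separate from outright rejection mirrors the text: such proteins stay in the database but can be treated with more caution.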
Classification by Function
Functional classification was based on the MIPS Arabidopsis
Database automatically derived functional categories. The catalogue was
downloaded from the PEDANT Web server (http://pedant.gsf.de/). The
catalogue of genes for plant glycerolipid biosynthesis was from
http://www.canr.msu.edu/lgc/ (Mekhedov et al., 2000). An updated inventory of lipid metabolism-related genes was available at our regional database (F. Beisson and J.B. Ohlrogge, unpublished
data). The full inventory of Arabidopsis ABC proteins was
downloaded from http://www.arabidopsisabc.net/ (Sanchez-Fernandez et
al., 2001) and was queried for AtPEM candidates.
Digital mRNA Expression Profiling
A set of all public Arabidopsis ESTs was obtained through a
structured query language query of our in-house "SeqStore" database that contained 103,109 EST sequences from the GenBank EST database (dbEST, http://www.ncbi.nlm.nih.gov/dbEST/index.html). The EST sequences were used as queries (BLASTN version 2.2.1) against the
target database of all predicted transcripts from the Arabidopsis genome (ftp://ftp.tigr.org/pub/data/a_thaliana/ath1/, January 10, 2002). Then the plastid envelope protein candidates were queried against the resulting database using common chromosomal locus identifiers (e.g. At2g39040). Tentative consensus (TC) sequences assembled from ESTs and
expressed transcript sequences were retrieved from the TIGR Arabidopsis Gene Index (http://www.tigr.org/tdb/agi/). To match TCs with the chromosomal locus identifiers, the retrieved TCs were mapped to all the
predicted transcripts of Arabidopsis genome by Blast alignments (BLASTN
version 2.2.1). TCs corresponding to the plastid envelope protein
candidates were selected by chromosomal locus identifiers.
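Assigning ESTs to loci from the BLASTN results might look like the following sketch. It assumes that the hits are available as 12-column tab-delimited BLAST output (query, subject, and further columns ending with the bit score), that each EST is assigned to its single highest-scoring transcript, and that transcript names carry the locus identifier before a version suffix (for example `At2g39040.1`). None of these details is specified above, no score or identity cutoff is applied, and the function name `ests_per_locus` is hypothetical.

```python
from collections import Counter


def ests_per_locus(tabular_text: str) -> Counter:
    """Count ESTs per chromosomal locus from tab-delimited BLAST hits."""
    best = {}  # EST identifier -> (bit score, locus) of its best hit so far
    for line in tabular_text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        est, transcript, bitscore = fields[0], fields[1], float(fields[11])
        locus = transcript.split(".")[0]  # "At2g39040.1" -> "At2g39040"
        if est not in best or bitscore > best[est][0]:
            best[est] = (bitscore, locus)
    # Each EST contributes once, to the locus of its best-scoring hit.
    return Counter(locus for _, locus in best.values())
```

Counting each EST only once avoids inflating the frequencies of loci that belong to gene families with similar sequences.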
To identify possible differential expression of the
AtPEM candidates, the relative frequencies of ESTs
between tissue-specific library pools were compared (55 different EST
libraries were grouped into eight tissue-specific "library pools"
according to the source tissues from which the library was derived; F. Beisson and J.B. Ohlrogge, unpublished data). The influence of
random fluctuations and sampling size was considered statistically to
discern the reliability of the digital expression profiling (Audic and
Claverie, 1997). For this analysis, software was downloaded from
http://igs-server.cnrs-mrs.fr. EST analysis for the thylakoid and
stromal proteins was done using 39 selected known thylakoid-localized
proteins, most of which are related to photosynthesis, and 1,802 stromal proteins that were chosen randomly from plastidial proteins
predicted by TargetP after subtracting those with transmembrane domains
and obvious thylakoid proteins. ESTs for the thylakoid- and
stroma-localized proteins were from TIGR Arabidopsis Gene Index TCs.
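The statistic of Audic and Claverie (1997) gives the probability of observing y ESTs for a gene in a library of N2 total ESTs, given x ESTs in a library of N1 total ESTs, as p(y|x) = (N2/N1)^y (x+y)! / [x! y! (1 + N2/N1)^(x+y+1)]. The sketch below evaluates this expression in log space and sums it into a tail probability. It is offered only as an illustration: the analysis in this study used the downloaded software, the function names are hypothetical, and reporting the smaller of the two one-sided tails is an assumption rather than a description of that software.

```python
import math


def ac_prob(x: int, y: int, n1: int, n2: int) -> float:
    """p(y | x) for EST counts x and y drawn from libraries of size n1 and n2."""
    r = n2 / n1
    log_p = (
        y * math.log(r)
        + math.lgamma(x + y + 1)          # log (x + y)!
        - math.lgamma(x + 1)              # log x!
        - math.lgamma(y + 1)              # log y!
        - (x + y + 1) * math.log1p(r)     # log (1 + r)^(x + y + 1)
    )
    return math.exp(log_p)


def ac_tail(x: int, y: int, n1: int, n2: int) -> float:
    """Smaller one-sided tail probability of a count at least as extreme as y."""
    lower = sum(ac_prob(x, k, n1, n2) for k in range(y + 1))   # P(Y <= y)
    upper = 1.0 - lower + ac_prob(x, y, n1, n2)                # P(Y >= y)
    return min(1.0, max(0.0, min(lower, upper)))


# Hypothetical counts: 2 ESTs among 20,000 in the reference pool versus
# 15 ESTs among 10,000 in a tissue pool.
p_value = ac_tail(2, 15, 20000, 10000)
```

Because the expression depends on the library sizes only through the ratio N2/N1, pools of different depth can be compared directly, which is the property that makes it suitable for library pools of unequal size.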
Microarray Data
Public microarray data for AtPEM candidates were
retrieved from a local database that contained most of the microarray
data downloaded from SMD (as of March 1, 2002; Sherlock et al., 2001) and was provided courtesy of Rodrigo Gutierrez (Michigan State
University-Department of Energy [MSU-DOE] Plant Research Laboratory,
East Lansing). The duplicates in the tissue comparison data sets (SMD
experiment identifiers: 7,197, 7,199, 7,200, 7,201, 7,203, and 7,205)
were averaged and then divided by one another to give pair-wise
comparisons between flower, leaf, and root tissues. Arabidopsis
developing seed array data were from a local database and are available
at http://www.bpp.msu.edu/Seed/SeedArray.htm (Girke et al., 2000; Ruuska et al., 2002). The cluster image (Fig. 5) was generated by Tree
View software downloaded from http://rana.lbl.gov/ (Eisen et al., 1998).
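The derivation of pair-wise tissue comparisons from the duplicate experiments can be illustrated as follows. The sketch assumes that each experiment reports, for a given gene, the ratio of one tissue to the common whole-plant reference, and that duplicates are combined by an arithmetic mean; the averaging method is not stated above, and the function name and the input layout are hypothetical.

```python
from itertools import combinations
from statistics import mean
from typing import Dict, List, Tuple


def pairwise_ratios(
    ratios_vs_reference: Dict[str, List[float]],
) -> Dict[Tuple[str, str], float]:
    """Convert tissue/reference ratios into tissue/tissue ratios for one gene."""
    # Average the duplicate measurements of each tissue versus the reference.
    averaged = {t: mean(values) for t, values in ratios_vs_reference.items()}
    # Dividing two tissue/reference ratios cancels the shared reference,
    # leaving the ratio between the two tissues.
    return {
        (a, b): averaged[a] / averaged[b]
        for a, b in combinations(sorted(averaged), 2)
    }
```

Each pair is reported in one direction only; the opposite comparison is the reciprocal of the value returned.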
T-DNA Insertion Mutants
T-DNA insertion mutant lines were searched from the
Sequence-Indexed Library of Insertion Mutations generated by the Salk Institute Genome Analysis Laboratory using the Arabidopsis Gene Mapping
Tool (http://signal.salk.edu/cgi-bin/tdnaexpress).
Information on AtPEM candidates and their various
attributes, including tissue-specific EST frequencies, accession
numbers to search SMD, and T-DNA insertion mutant stock numbers from
Salk Institute Genome Analysis Laboratory can be downloaded from
our Web site
(http://www.plantbiology.msu.edu/PlastidEnvelope/).
We thank Dr. Rob Halgren (MSU Bioinformatics Group, East
Lansing) and Rodrigo Gutierrez (MSU-DOE Plant Research Laboratory) for
providing the EST and the microarray information. We thank Dr. John
Froehlich (MSU-DOE Plant Research Laboratory) for the Arabidopsis
chloroplast envelope proteomics data and Dr. Sari Ruuska (MSU
Department of Plant Biology) for the Arabidopsis developing seed array
data. We thank Dr. Andreas Weber and Dr. Xiaoming Bao (MSU
Department of Plant Biology), Dr. Curtis Wilkerson (MSU-DOE Plant
Research Laboratory), Anne Plovanich-Jones (National Center for Food
Safety and Toxicology, East Lansing), Dr. Jeff Landgraf (MSU Genomics
Technology Support Facility), and Dr. Jay Thelen (University of
Missouri Proteomics Center, Columbia) for helpful comments on the manuscript.
Received May 7, 2002; returned for revision May 30, 2002; accepted June 13, 2002.
Article, publication date, and citation information can be found at
www.plantphysiol.org/cgi/doi/10.1104/pp.008052.