Plant Physiology 141:825-839 (2006) © 2006 American Society of Plant Biologists
Formation of the Arabidopsis Pentatricopeptide Repeat Family
Laboratoire d'Informatique, de Robotique et de Microélectronique de Montpellier, Centre National de la Recherche Scientifique, Unité Mixte de Recherche 5506, Université de Montpellier II, 34392 Montpellier cedex 5, France (E.R.); and Institut de Biotechnologie des Plantes, Centre National de la Recherche Scientifique, Unité Mixte de Recherche 8618, Université Paris-Sud, 91405 Orsay cedex, France (C.B., C.T.-N., A.L.)
In Arabidopsis (Arabidopsis thaliana) the 466 pentatricopeptide repeat (PPR) proteins are putative RNA-binding proteins with essential roles in organelles. Roughly half of the PPR proteins form the plant combinatorial and modular protein (PCMP) subfamily, which is land-plant specific. PCMPs exhibit a large and variable tandem repeat of a standard pattern of three PPR variant motifs. The association or not of this repeat with three non-PPR motifs at their C terminus defines four distinct classes of PCMPs. The highly structured arrangement of these motifs and the similar repartition of these arrangements in the four classes suggest precise relationships between motif organization and substrate specificity. This study is an attempt to reconstruct an evolutionary scenario of the PCMP family. We developed an innovative approach based on comparisons of the proteins at two levels: namely the succession of motifs along the protein and the amino acid sequence of the motifs. It enabled us to infer evolutionary relationships between proteins as well as between the inter- and intraprotein repeats. First, we observed a polarized elongation of the repeat from the C terminus toward the N-terminal region, suggesting local recombinations of motifs. Second, the most N-terminal PPR triple motif proved to evolve under different constraints than the remaining repeat. Altogether, the evidence indicates different evolution for the PPR region and the C-terminal one in PCMPs, which points to distinct functions for these regions. Moreover, local sequence homogeneity observed across PCMP classes may be due to interclass shuffling of motifs, or to deletions/insertions of non-PPR motifs at the C terminus.
The pentatricopeptide repeat (PPR) gene family of 466 genes is one of the largest gene families discovered in the complete sequence of the Arabidopsis (Arabidopsis thaliana) genome (Aubourg et al., 2000
At the time of its discovery in 2000, the PPR family had no known function, but a number of its members have recently attracted increasing interest from different laboratories. Some PPRPs are involved in plant development (Cushing et al., 2005
The description of the unusually complex motif organization of PPRPs is progressively improving. The different motif organizations of Arabidopsis PPRPs are summarized in Table I. Figure 1 gives a brief history of the structural annotation of this complex family and shows gene models and different representations of the motif organizations provided by different approaches for one PPRP-P and one PCMP. In the PPRP-P subfamily, the P motifs are usually adjacent to each other, i.e. in tandem repeats. The modular organization in PCMPs is more complex than in PPRP-Ps, but it nonetheless follows a small number of systematic rules (Aubourg et al., 2000
The characterization of the proteins of a given family often relies on the detection of regions of their sequences shared by all family members. Computing the consensus of such regions provides a motif that is used to recognize new members of the family (Servant et al., 2002 A peculiarity of PCMPs is that they may also be considered as a specific sequence of a variable number of PPR motifs, P, L, or S, and of PCMP blocks, either PLS, LSP, or SPL, associated or not with three different kinds of non-PPR motifs. The motif sequence of a given PCMP, at the level of the organization of both the PPR motifs and the PCMP blocks, has been shaped during evolution by a succession of duplication and functionalization events. Selection pressure was clearly critical on the motif sequence as evidenced by the unusually high constraint on the motif pattern in spite of the important increase in both the number of genes and the number of motifs. Furthermore, despite the high number of PCMP block repeats into the whole genome, the PCMP blocks are absolutely specific to the PCMP family.
Interest in the function of PPRPs has only just begun. The complex and highly structured arrangement of PCMP motifs suggests precise relationships between the organization of motifs and protein substrate specificity. This prompted us to undertake an exhaustive study of the organization of PCMP blocks over the whole PCMP family and to investigate how this family has developed. The number of different PCMPs (198) and PCMP blocks (about 600) is both a challenge and an opportunity. One difficulty is to carry out an expert, and thus time-consuming, annotation of the whole PCMP family, including the characterization of all the motifs. This annotation involved many manual steps and was based on the structural annotation of PCMP genes available at GeneFarm (Aubourg et al., 2005
Experimental data on PCMP targets will no doubt soon provide evidence for PCMP functions. In this context, we believe that our study will greatly help the functional investigation of PCMPs. Indeed, there are indications both in chloroplasts (Miyamoto et al., 2004
Overview of the PCMP Blocks in PCMPs
Until recently, two different terminologies have been used to describe the modular organization of PPRPs depending on the fusion of PPRP-Ps and PCMPs into one (Small and Peeters, 2000 On average, PCMPs have 3.8 PCMP blocks made of L, S, and P motifs, and slightly more than half of the PCMPs have either three or four PLS blocks, not counting the P2L2S2 block present in each protein. Supplemental Table I gives the distribution of the number of tandem repetitions of PLS blocks in both the all-PCMP (198 proteins) and the nonredundant (nr)-PCMP (109 proteins) sequence databases. Overall, the two sequence databases are similar and the distribution of the number of block repeats (maximum at 3 to 4) is the same in the different classes. Thus, globally, the diversification of the PCMP family into four classes is not correlated with a clustering of different protein structures among the four classes. Nevertheless, there is one intriguing exception in class H, where one protein structure made of three PCMP blocks is observed in 11 proteins.
Trees from Block Sequences
Supplemental Table II gives the values of five treeness criteria (Guénoche and Garreta, 2000
The most reliable tree (parameters Am = 1, Ip = 8, Ab = 3, and In = 50) is shown in Figure 2. First, classes A and H are monophyletic, i.e. the lowest common ancestor of all proteins in such a class is not an ancestor of any protein not in that class. By contrast, classes E and F are not monophyletic. Indeed, classes E and F are split into three and two subtrees, respectively: class F branches out between two E subtrees, while class H branches out between two F subtrees. The proteins from different classes are not mixed together in the tree. However, the support Re varies among the internal edges leading to the classes: A|EFH, AEF|H, as well as AE|FH have a confidence value equal to 1 (i.e. maximal), showing that the monophyly of A and of H is well supported. For both classes E and F, the edge leading to one of their subgroups is less well supported by the data: Fa (0.65), Fbc (0.36), Ea (0.69), and Ebc (0.43). Indeed, the edges that split F in two are short and not well supported, while the edge leading to H is perfectly supported. Another feature is that the subtree of each class is further split according to the N-terminal block of the proteins. This block may be incomplete, and if one reads LSP blocks in the motif sequence the first block is (LS)k in group a, (S)k in group b, and (PLS)k in group c (Table I), with k larger than or equal to 1. In the subtree of each class, group c is monophyletic and branches out between two group b subtrees. In classes H and F, group a is monophyletic. Moreover, group a is in general the group nearest to the neighboring class, suggesting that its proteins are more likely to change class by losing some non-PPR motifs at their C terminus. This structure may be explained by the relatively high frequency of two events that preferentially alter the N-terminal block: motif loss and S motif tandem duplication.
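The monophyly criterion used above can be made concrete with a small check. The sketch below is illustrative only: it assumes a rooted tree written as nested Python tuples with protein names as leaves, and the toy tree and its labels (A1, E1, F1, H1, ...) are hypothetical, not the tree of Figure 2.

```python
def leaves(node):
    """Return the set of leaf labels under a node (a leaf is a string)."""
    if isinstance(node, str):
        return {node}
    found = set()
    for child in node:
        found |= leaves(child)
    return found


def is_monophyletic(tree, group):
    """True if the lowest common ancestor of `group` has no leaf outside `group`."""
    group = set(group)

    def lca_leaves(node):
        node_leaves = leaves(node)
        if not group <= node_leaves:
            return None          # this subtree misses part of the group
        if not isinstance(node, str):
            for child in node:
                deeper = lca_leaves(child)
                if deeper is not None:
                    return deeper  # a smaller subtree still holds the whole group
        return node_leaves       # this node is the lowest common ancestor

    found = lca_leaves(tree)
    return found is not None and found == group


# Hypothetical toy tree: H sits inside a subtree that splits class E in two.
toy_tree = (("A1", "A2"), ("E1", (("F1", ("H1", "H2")), "E2")))

a_ok = is_monophyletic(toy_tree, {"A1", "A2"})
h_ok = is_monophyletic(toy_tree, {"H1", "H2"})
e_ok = is_monophyletic(toy_tree, {"E1", "E2"})
```

In this toy case the A and H groups pass the test, while the E group fails because its lowest common ancestor also covers F and H leaves. Trees produced by distance methods are usually unrooted, so a root would have to be chosen before applying such a check.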
Several trees computed with different parameter values have VAF and Re values similar to that of the optimal trees. To see whether the PCMP evolution looks different in a suboptimal tree we compare the optimal tree with the tree computed for parameters Am = 1, Ip = 6, Ab = 3, and In = 50, whose VAF and Re values equal 0.99 and 0.63, respectively (Supplemental Fig. 1). The differences between the two trees are at the lower level of the trees. The group subtrees are modified as well as the repartition of the proteins between the two group b subtrees of each PCMP class. The picture of the evolution of PCMP classes and the relative positions of the groups inside these PCMP classes remain exactly the same.
In a relatively large range of evolutionary distances between organisms, the level of similarity of amino acid sequences between orthologs is generally higher than between paralogs. This observation is exploited, for instance, in the database of clusters of orthologous genes (Tatusov et al., 2003 By looking at the similarity of the amino acid sequences of blocks, we attempt to determine: (1) at which positions in the tandem array, blocks show paralogous or orthologous relationships as defined above, (2) if the array was extended in a preferential direction, i.e. toward the N or the C terminus and, (3) if extension depended on a preferential phase for block addition (PLS or LSP). For this sake, we searched in the whole set of blocks for homogeneous groups according to sequence similarity, i.e. for groups of blocks that are more similar to each other than to blocks not in the group. We performed this analysis starting either with blocks from the same class of proteins or with blocks located at the same position in the tandem array of proteins from different classes. As previously explained, classical techniques for amino acid sequence comparison are not adapted to PCMPs; we thus develop an approach based on HMMs and on a graphical display of their results (for details see "Materials and Methods").
Depending on the reading phase, the most frequent PCMP block may be read either PLS, LSP, or SPL. We first asked the question of which one of these PCMP blocks has been duplicated during the formation of the PCMPs. Thus, a HMM has been built up with 20 sequences of PLS or LSP blocks located at the most carboxy side of the PCMP block repeat region (region two in Table I) of 20 proteins, i.e. at position 1 (Fig. 1B). At this step we based our determination of the positions of the blocks in the proteins on the fact that the most carboxy-terminal PCMP block (PL2S, line "Lurin et al. 2004
Hence, the LSP2 model has a high similarity with most of the LSP2 blocks and a comparatively low similarity with most of the LSP blocks. These results confirm that the P2L2S2 block and the PLS block have a common ancestor that was duplicated, yielding two blocks that diverged significantly to generate P2L2S2 and PLS. As all PCMPs have only one P2L2S2 and at least one PLS block, this very first duplication probably took place before the advent of the family, i.e. in an ancestral protein common to all PCMPs. Since many PCMPs have more than one PLS, the homology between P2L2S2 and PLS suggests an extension of the protein from its carboxy end toward its amino-terminal region. The protein extended first by a duplication of an ancestral PCMP block, followed by successive additions of PLS blocks either by recruitment from ectopic loci or by tandem duplication. We further tried to discriminate between these two nonexclusive hypotheses by looking at the similarity between blocks at the same position in different proteins or adjacent in the same protein. For each analysis we retrieved two pieces of information: first, the total number of PCMP blocks found similar to a given HMM and, second, the slope of the regression line for the number of blocks belonging to the same category. An example of a category is the set of blocks that occupy, in different proteins, the same position as the 20 blocks used to build the HMM. The combination of these two results gives an indication of the similarity between the blocks in the category of the HMM and the blocks from other categories. There is a negative correlation between the numbers of position-1 PLS blocks similar to the position-1 PLS model per group of 20 blocks (columns in Fig. 3A) and the E-values of the sequence comparison (Fig. 3A, inset in the top right corner).
For instance, in column 2 of Figure 3A, the E-value goes from 3.8 E-18 to 1.5 E-15 and 16 out of 20 PLS blocks are at position 1 in their proteins, while in column 17, the E-value goes from 8.2 E-01 to 4.2 E+01 and only two PLS blocks are at position 1 in their respective proteins. The equation of the regression line is -0.82x + 17.1 (R² = 0.90), with a significance lower than 10⁻³ for both the slope and the origin. We repeated the above experiment three times, each time with a position-1 PLS model built with 20 different blocks sampled independently from the all-PCMP sequence database. The three experiments gave similar equations and confidence levels: -0.90x + 17.9 (R² = 0.85); -0.97x + 18.7 (R² = 0.82); and -1.05x + 18.7 (R² = 0.83). Therefore, an HMM built with a sample of 20 PLS blocks can be representative of the whole set of 557 PLS blocks. To estimate the sequence similarity between position-1 PLS blocks and PLS blocks at a position other than position 1 in proteins, we accumulated the data from the four repetitions to work on higher numbers in each E-value class (results in Supplemental Table III). There is a positive correlation between the relative numbers of PLS blocks located at positions 2, 3, or others in the proteins and the E-values. The slopes are equal to 1.42, 1.57, and 0.70 for positions 2, 3, and others, respectively, with a significance better than 10⁻³ for the first two and better than 10⁻² for the last one. For comparison, in this cumulative experiment, the slope for position-1 PLS blocks is -3.70. Thus, the PLS model built with a representative set of position-1 PLS blocks has an affinity for PLS blocks that decreases from position 1, at the carboxy terminus of the proteins, to position 3 and above, toward the amino terminus. The result obtained with the complete set of PCMPs cannot be explained by a bias introduced by the redundancy of block sequences.
Indeed, a similar result was obtained when the experiment was carried out with the nr-PCMP sequence database and a position-1 PLS matrix (result in Supplemental Table IV), even though the numbers of proteins and blocks are about half as large. The equation of the regression line is -1.65x + 18.6 (R² = 0.80) and the significances are better than 10⁻² for the slope and better than 10⁻³ for the origin. This latter experimental verification is in good agreement with the direct characterization of redundancy in the four PCMP classes (Supplemental Table I). Indeed, the pattern of redundancy is quite similar in PCMP classes E, F, and H. Results shown in Figure 3C using an HMM built with 20 sequences of PLS position-2 blocks are clearly different from those shown in Figure 3A using a model built with 20 sequences of PLS position-1 blocks. In Figure 3C there is no significant negative correlation between the E-values and the number of position-2 PLS blocks, nor between the E-values and the number of PLS blocks at other positions. Moreover, the total number of PLS blocks similar to the PLS position-2 model, 395 (Fig. 3C, 19 columns with 20 blocks + column 20 with 15 blocks), is markedly higher than with the position-1 model, 344 (Fig. 3A). As before, we repeated the experiments three times and all gave results similar to those in Figure 3C. In the experiment shown in Figure 3C, there are only 18 position-2 blocks in the first two columns (E-values from 2.7 E-23 to 7.9 E-15). For 11 of the 18 proteins that contain these position-2 PLS blocks, the position-1 block is in the first five columns, whereas the number expected by chance was only 0.967, i.e. more than 10 times lower. Thus, PLS blocks at position 2 may be more similar to the blocks surrounding them in a given protein (paralogous blocks in one protein) than to blocks at the same position in other proteins (orthologous blocks in duplicated genes). Collectively these results suggest three major trends in the formation of the PCMP family.
First, the PCMP blocks that have been duplicated from an ancestral block are of the PLS kind rather than LSP. Second, a significant part of the block duplication events might have involved tandem duplication more often than block recruitment from different chromosome loci. Third, a substantial part of the tandem duplications have added PLS blocks at the amino-terminal region of proteins.
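The column counts and regression lines discussed in this section can be illustrated as follows. This is a minimal sketch under stated assumptions, not the authors' pipeline: hits are assumed to be (E-value, block position) pairs, columns are groups of 20 hits ranked by increasing E-value, the regression variable x is assumed to be the column index, and the function names and toy numbers are hypothetical.

```python
def count_per_column(hits, position=1, size=20):
    """Rank hits by E-value, cut them into columns of `size`,
    and count the blocks found at `position` in each column."""
    ranked = sorted(hits, key=lambda hit: hit[0])
    columns = [ranked[i:i + size] for i in range(0, len(ranked), size)]
    return [sum(1 for _, pos in column if pos == position) for column in columns]


def fit_line(counts):
    """Least-squares line y = slope * x + intercept, with x = 1, 2, ... (column index).
    Needs at least two columns. Returns (slope, intercept, r_squared)."""
    n = len(counts)
    xs = range(1, n + 1)
    mean_x = sum(xs) / n
    mean_y = sum(counts) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, counts))
    syy = sum((y - mean_y) ** 2 for y in counts)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r_squared = (sxy * sxy) / (sxx * syy) if syy else 1.0
    return slope, intercept, r_squared


# Hypothetical counts of position-1 blocks in five successive columns.
toy_counts = [16, 15, 14, 13, 12]
slope, intercept, r_squared = fit_line(toy_counts)

# Hypothetical hits, grouped here in columns of 2 for brevity.
toy_hits = [(1e-2, 1), (1e-10, 1), (1e-3, 2), (1e-9, 1)]
toy_columns = count_per_column(toy_hits, position=1, size=2)
```

A negative slope, as reported for the position-1 model, means that blocks at the modelled position become rarer as similarity to the HMM decreases. The significance values quoted in the text would require an additional test on the slope, which this sketch omits.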
Gene duplications have been very frequent during the formation of the family and we observe four classes of proteins defined by the presence of non-PPR motifs at the carboxy terminus of the proteins (Table I). The results described in the previous paragraph suggest that the evolution of the carboxy- and the amino-terminal regions may have been independent. This favors more than one generation of ancestral proteins for the four classes during the formation of the family, rather than an early formation of the classes from four different ancestors. This hypothesis can be tested using the three protein classes that contain a large number of members, classes E, F, and H. In the case of an early formation of the PCMP classes from one or a small number of ancestors, an HMM built with sequences of 20 PLS blocks from one class of PCMP should be more similar to sequences of PLS blocks from this PCMP class than to sequences of PLS blocks from the two other PCMP classes. We expect the opposite result in the case of a continuous generation of the PCMP classes by independent events of deletion/insertion of non-PPR motifs. The number of PLS blocks belonging to the different PCMP classes (E, F, and H) and found similar to a class-specific HMM does not change with the E-value class, i.e. whatever the E-value, we observed an equivalent number of PLS blocks belonging to each PCMP class (Fig. 4, A-C). The best horizontal line that may be computed through the E-value classes is at a number of blocks roughly proportional to the number of proteins in each PCMP class (54 E, 51 F, and 87 H). Our data show that the distribution of pair distances between amino acid sequences of PLS blocks in one PCMP class or between the three PCMP classes is similar. In other words, the PLS blocks of one PCMP class are not more similar in sequence to PLS blocks belonging to proteins in the same class than to PLS blocks from proteins belonging to the two other classes of PCMPs.
It is interesting to highlight that the HMM obtained with a sample of 20 sequences of PLS blocks from class-F proteins (Fig. 3B) recovers a higher number of blocks than both the HMMs built up with 20 PLS blocks belonging to PCMP class H or E. This result is best explained by an oriented flux of gene transformation either from genes coding for proteins of class H toward proteins of class-E through class-F proteins, or the opposite. It suggests also that the evolution of the PLS tandem region and of the carboxy region containing the non-PPR motifs has been, to a large extent, independent. Indeed, non-PPR blocks have been either inserted or deleted independently of the events of gene duplication and of elongation by tandem duplication of PLS blocks of the amino-terminal region.
The next question concerns the direction of the gene flux observed between the three PCMP classes. Similar to the non-PPR motifs, the P2L2S2 blocks have undergone a different evolutionary history from the other PCMP blocks. Indeed, even if the homology with the PLS block is clear, the P2L2S2 blocks do not appear in tandem in these proteins. Hence, P2L2S2 blocks are good candidates to investigate the direction of gene duplication between PCMP-H, -F, and -E, and to ask the question about the relative importance of intraclass duplications during the formation of the extant PCMP family. The sequences of the P2L2S2 block are less divergent than the PLS block ones and present in only one copy at a conserved position adjacent to the non-PPR motifs thought to form the active site of the protein. All these data are consistent with a higher functional pressure on P2L2S2 than PLS blocks. The sequence similarity between two P2L2S2 blocks might thus be a better indicator of the divergence time since the ancestral gene duplication than the distances between PLS blocks. Thus, we analyzed the similarity between the amino acid sequences of P2L2S2 blocks using HMMs built up with 20 sequences from either class-H, -F, or -E proteins (Fig. 5 ). The three class-specific HMMs output 184, 173, and 171 P2L2S2 blocks out of 198, for classes -H, -F, and -E, respectively. The results obtained with P2L2S2 HMMs differ completely from those obtained with PLS position-1 HMMs (Fig. 4). With the P2L2S2 model from class H, the number of P2L2S2 blocks belonging to class H in a E-value class is correlated negatively with the E-value (Fig. 5A). Thus, the P2L2S2 blocks exhibit a large range of decreasing similarity with the HMM, and contrary to PLS blocks at position 1 (Fig. 4A), P2L2S2 blocks from class-H proteins are globally more similar between them than to P2L2S2 from classes E and F (Fig. 5A).
The results obtained with the two other HMMs built with P2L2S2 sequences either from PCMP-F (Fig. 5B) or -E (Fig. 5C) proteins are different from those described above with PLS from the same PCMP class (Fig. 4, B and C) and also different from those with PCMP-H HMMs for P2L2S2 (Fig. 5A). With both PCMP-F (Fig. 5B) and -E HMMs (Fig. 5C) we observed a minimum of blocks similar to the class model at intermediate E-values, i.e. at intermediate similarity between P2L2S2 blocks and the model. Thus, in contrast to what we observed with the PCMP-H model (Fig. 5A), the distributions of pairwise distances between P2L2S2 sequences are not homogeneously organized in PCMP-E and -F. Rather, they are clustered by the HMMs in three distinct groups. First, an HMM built from 20 sequences has a high similarity with a limited group of sequences: the sequences used to build the model and some other sequences probably generated by intraclass gene duplications. This is what we observed in the first two or three columns in PCMP-F (Fig. 5B) and -E (Fig. 5C), respectively. Second, at intermediate E-values, a PCMP class-specific HMM is similar to a second group of P2L2S2 blocks that mainly belongs to proteins of other PCMP classes and particularly to PCMP-H. Third, at higher E-values, i.e. lower similarity, a PCMP class-specific HMM is similar to an increasing number of P2L2S2 blocks from the PCMP class used to build the HMM. The second and third groups contain P2L2S2 blocks that may not share a direct common ancestor with the sequences used for the HMM. Rather, they might derive from proteins that changed PCMP class after gene duplication. Two PCMP-F or -E genes may be generated by a duplication of a PCMP-H gene followed by non-PPR motif deletions. The number of events necessary to pass from class H to class E (deletions of two non-PPR motifs) should on average be higher than the number necessary to pass from class H to class F (only a deletion of the Dyw motif).
Thus, the mean time since duplication of the PCMP-H ancestor and, as a consequence, the sequence similarity, should be lower between P2L2S2 blocks inside PCMP-F and PCMP-E than between them and P2L2S2 blocks of some PCMP-H. Conversely, in genes generated by intraclass duplications, P2L2S2 blocks should be more similar to one another than to P2L2S2 blocks in proteins of the same class (-F or -E) that were generated by a duplication of a PCMP-H gene followed by non-PPR motif deletions, or to P2L2S2 blocks from PCMP-H. Consistent with this hypothesis, the minimum of sequences for proteins of the same group as the proteins used to build the HMM is displaced more toward the high E-values for PCMP-E (Fig. 5C) than for PCMP-F (Fig. 5B). Thus, collectively these results may be explained by a predominant and oriented flux of gene duplications from the PCMP-H proteins toward the PCMP-E through the PCMP-F and a somewhat less important contribution of intraclass duplications.
Methodological Improvements
Numerous proteins, often members of large families, contain tandem repeats. In such protein families, the number of copies of the motif usually varies among members of the family and makes these proteins difficult to align (Thompson et al., 1999
The tree obtained from the motif sequences (Fig. 2) supports the clustering of PCMPs in four classes. Indeed, it provides evidence for the monophyly of classes A and H and favors separate origins for classes E and F. The search results of the HMM built with P2L2S2 motifs, the only part of the sequence that is common to all PCMPs, showed that the interclass divergence is higher than the intraclass divergence, in agreement with the tree results. For PCMP-E and -F, the results suggest that these two classes do not originate from a single ancestor gene, but rather from a few. Concerning the PLS tandem repeat (excluding P2L2S2), our results suggest that the block that preferentially underwent duplication is PLS rather than LSP. Searches performed with the most N-terminal blocks (data not shown) and with the PLS blocks in position 1 also gave weight to a preferred direction, from the most C-terminal block toward the most N-terminal block, in the elongation of the tandem repeat. The distinct behavior of the HMMs built from any PLS block or from P2L2S2 blocks reveals that the PPR motifs that compose them are homologous but different. We can thus infer that a common ancestor of PCMPs contained at least a PLS block and a P2L2S2 block, and that these blocks probably resulted from the duplication of an ancestral PLS block. Moreover, the results from HMMs built with PLS blocks from different classes give a blurred view of the class relationships, as if the PLS repeat were a region that underwent interclass homogenization. Homogenization of sequences may be the effect either of an interclass shuffling of blocks or of deletions/insertions in the carboxy-terminal region resulting in a change of class for a given protein. The second hypothesis is better supported by the results shown in Figure 4 as well as by the interclass similarity in both the redundancy of block structures and the distribution of the number of blocks (Supplemental Table I).
The PCMP family comprises 198 members in Arabidopsis and 229 in O. sativa. Although the advent of the family predated the separation between mono- and dicotyledonous plants, the conservation of the number of proteins is surprising. On the one hand, the amino acid sequences of PLS blocks are highly divergent. On the other hand, the three largest PCMP classes (E, F, and H) exhibit a similar redundancy, as well as an overall homogeneity in the composition of their PLS repeat, indicating an evolutionary constraint. Altogether, the results suggest that the PLS repeat existed before the separation between mono- and dicotyledons and has evolved since then under a functional selection pressure. This mode of evolution seems to differ from that of the C-terminal region, which we think begins with the P2L2S2 block (included). This partition into two regions with distinct evolution, together with the diversity of PLS repeats observed throughout the family, leads to the belief that the PLS repeat could serve as an RNA-binding domain in which the succession of motifs encodes the information needed for specific recognition of a given binding partner (Lurin et al., 2004
The gene family coding for PPRPs expanded vastly during the evolution of the land plants. A recent estimation using FLAGdb++ (Samson et al., 2004
Several features of the PCMPs help to figure out the mechanisms involved in the evolution of the family. The paucity of introns in PCMPs as compared to the mean number of introns in Arabidopsis genes suggests that this family expanded mainly by reverse transcription events promoting duplicate dispersal through the genome (Lecharny et al., 2003
Our approach may be relevant for other families of proteins with repeated motifs (Patthy, 2003
Data Sets, Motif Sequence, and Classes
In this study, we used the whole Arabidopsis (Arabidopsis thaliana) family as annotated in GeneFarm (Aubourg et al., 2000 In all PCMPs, the C-terminal region is a succession of non-PPR motifs. Depending on this region, PCMPs were divided into four classes: H, F, E, and A (see "Results" section "Overview of PCMP Blocks" for details; Supplemental Table I). The repartition of PCMPs is: 87 in class H, 51 in class F, 54 in class E, and six in class A. In GeneFarm classes F and E are fused in a unique class F. The correspondence between GeneFarm and Arabidopsis Genome Initiative identifications (AGI-IDs) is shown in Supplemental Table IV. Some PCMPs share the same motif sequence. As we used these sequences to compare the proteins, we excluded this redundancy and built up a nr set, the nr-PCMP sequence database, which contains 109 proteins. A protein or motif sequence identifier in the nr-PCMP set is made of the protein AGI-ID concatenated with a label indicating its class followed by its group (i.e. At4g16470_Fb for a class-F protein of group b). Supplemental Table V gives the list of nr-PCMP proteins, the motif sequences, and the associated proteins that share the same motif sequence but are not in nr-PCMP.
The N-terminal part of all PCMP proteins is a tandem repeat of a block of PPR motifs. The most represented block is the triple LSP, and all other internal blocks are of the form L(S)nP with n > 1. Note that in a tandem repeat, which has a cyclic structure, LSP, SPL, and PLS are equivalent. The most N-terminal block is a suffix of an L(S)nP block. To code the block sequence, we arbitrarily consider a block to start with an L motif (or with the protein N terminus) and to end at the beginning of the next L or L2 motif. We encoded each different block observed in the all-PCMP set, as well as each non-PPR motif, by a single letter (block letter code in Supplemental Table VI). We then recoded the motif sequence of each protein as a sequence of block letters. This defines the block sequence. As the block code is unambiguous, the block sequence of a protein is strictly equivalent to its motif sequence. We used the block sequences to perform appropriate protein comparisons as described below.
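The block coding described above can be sketched in a few lines. This is only an illustration of the stated rule (a block starts at an L or L2 motif, or at the N terminus, and each non-PPR motif is its own unit); the function names, the example motif sequence, the placeholder non-PPR label "N1", and the letters assigned are hypothetical and do not reproduce the code of Supplemental Table VI.

```python
import string


def split_into_blocks(motifs, non_ppr=frozenset()):
    """Cut a motif sequence into blocks: a new block opens at each L or L2
    motif, and every non-PPR motif forms a block of its own."""
    blocks, current = [], []
    for motif in motifs:
        if motif in non_ppr:
            if current:
                blocks.append(tuple(current))
                current = []
            blocks.append((motif,))
        elif motif in ("L", "L2") and current:
            blocks.append(tuple(current))
            current = [motif]
        else:
            current.append(motif)
    if current:
        blocks.append(tuple(current))
    return blocks


def encode_blocks(blocks, code=None):
    """Map each distinct block to one letter (first seen, first lettered).
    Passing the same `code` dict for all proteins keeps the coding shared.
    This toy alphabet supports at most 52 distinct blocks."""
    code = {} if code is None else code
    letters = []
    for block in blocks:
        if block not in code:
            code[block] = string.ascii_letters[len(code)]
        letters.append(code[block])
    return "".join(letters), code


# Hypothetical motif sequence with an incomplete N-terminal block (S S P).
example = ["S", "S", "P", "L", "S", "P", "L", "S", "S", "P2", "L2", "S2", "N1"]
example_blocks = split_into_blocks(example, non_ppr={"N1"})
example_letters, block_code = encode_blocks(example_blocks)
```

Because each block maps to exactly one letter and back, the letter string carries the same information as the motif sequence, which is the equivalence the text relies on.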
We computed a mutation cost between any pair of blocks. Any block can be transformed into any other block by insertion or loss of one or two PPR motifs (e.g. LSP <-> SP) and by tandem duplication of the S motif (e.g. LSP <-> LSSP). For example, the block LSP can be transformed into LSSSSP by three S motif duplications, while LSP can be obtained from the N-terminal block SSP by an S motif contraction and the insertion of an L motif. We denote by Am (for amplification of motif) the cost of an S motif amplification/contraction (the word amplification is used as a synonym for duplication), and by Ip the cost of a PPR motif insertion/deletion. The mutation costs were calculated for different values of the ratio Am/Ip, with Am = 1 and Ip = 10, 12, 15, 20, or 30. The rationale is that an insertion of a motif is less probable than a duplication/contraction. For fixed costs (e.g. Am = 1 and Ip = 12), the mutation costs are stored in a matrix (all matrices are given in Supplemental Table VI), just as amino acid substitution costs are recorded in a PAM matrix for classical alignment.
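One way to read the cost scheme above is sketched below. It is an assumption-laden illustration, not the published matrices: blocks are written as plain strings of the form L(S)nP or a part thereof, the L2, S2, and P2 variants are ignored, and the rule used when one block has no S motif at all (one insertion, then duplications) is a guess made here for completeness.

```python
def parse_block(block):
    """Describe a block as (has L, number of S motifs, has P)."""
    has_l = block.startswith("L")
    has_p = block.endswith("P")
    n_s = block.count("S")
    expected = ("L" if has_l else "") + "S" * n_s + ("P" if has_p else "")
    if not block or block != expected:
        raise ValueError(f"not of the form [L](S)n[P]: {block!r}")
    return has_l, n_s, has_p


def mutation_cost(block_a, block_b, am=1, ip=12):
    """Cost of turning block_a into block_b.
    am: cost of one S motif duplication/contraction (Am in the text).
    ip: cost of one PPR motif insertion/deletion (Ip in the text)."""
    l_a, s_a, p_a = parse_block(block_a)
    l_b, s_b, p_b = parse_block(block_b)
    cost = ip * ((l_a != l_b) + (p_a != p_b))   # gain or loss of L and/or P
    if s_a and s_b:
        cost += am * abs(s_a - s_b)             # pure S duplications/contractions
    elif s_a or s_b:
        cost += ip + am * (max(s_a, s_b) - 1)   # assumed: insert one S, then duplicate
    return cost


def cost_matrix(blocks, am=1, ip=12):
    """Symmetric table of costs, indexed by pairs of block strings."""
    return {(a, b): mutation_cost(a, b, am, ip) for a in blocks for b in blocks}


three_duplications = mutation_cost("LSP", "LSSSSP")   # three S duplications
n_terminal_case = mutation_cost("SSP", "LSP")         # one contraction + one L insertion
toy_matrix = cost_matrix(["LSP", "LSSP", "SP"])
```

With Am = 1 and Ip = 12, the two examples from the text cost 3 and 13, respectively, and the table is symmetric because each event and its reverse share one cost.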
We compared the block sequences of the 109 nr-PCMPs pairwise using an alignment method, MS_Align (Bérard and Rivals, 2003). Comparisons were performed using several parameter sets, covering all combinations of the following parameters: Am = 1; Ip = 10, 12, 15, 20, or 30; Ab = 3, 4, 5, or 6; and In = 12, 15, 20, 30, 40, or 50. Our evolutionary model is symmetrical: The costs of dual events are identical; for instance, a deletion cost equals an insertion cost. The choice of parameter values reflects several facts about the motifs that form PCMPs. First, amplifications/contractions of the S motif and of PCMP blocks have been frequent events, since their numbers vary greatly among the PCMPs. Thus, we give lower costs to these events (Am = 1; Ab = 3, 4, 5, or 6) than to insertions/deletions. Second, as non-PPR motifs are not homologous to any other motifs, we forbid such motif mutations by giving them an infinite cost. The parameter values of the different experiments differ notably in the ratio Am/Ip and in the difference (In − Ip), which was kept positive. Also, a PPR motif insertion (Ip) costs less than a non-PPR motif insertion (In), since it seems plausible that the former could be obtained by amplification of any other PPR motif and subsequent mutations in the amino acid sequence, while the latter could only be acquired by insertion.
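The parameter grid can be enumerated as in the short Python sketch below. The names are hypothetical, and the optional filter In > Ip is only one reading of the statement that the difference between In and Ip was kept positive; the text does not say how many parameter sets were actually run.

```python
from itertools import product

AM = 1
IP_VALUES = (10, 12, 15, 20, 30)      # PPR motif insertion/deletion
AB_VALUES = (3, 4, 5, 6)              # block amplification/contraction
IN_VALUES = (12, 15, 20, 30, 40, 50)  # non-PPR motif insertion/deletion


def parameter_sets(require_in_above_ip=True):
    """List cost settings; optionally keep only those with In strictly above Ip."""
    sets = []
    for ip, ab, in_cost in product(IP_VALUES, AB_VALUES, IN_VALUES):
        if require_in_above_ip and in_cost <= ip:
            continue  # assumption: a non-positive difference In - Ip is excluded
        sets.append({"Am": AM, "Ip": ip, "Ab": ab, "In": in_cost})
    return sets


print(len(parameter_sets(require_in_above_ip=False)))  # 120 combinations in total
print(len(parameter_sets()))                           # 80 with In > Ip
```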
An important issue concerns the reliability of the approach. Is it sound to measure the evolution with the alignment procedure used in our approach? Or in other words, can those distances be reliably represented by a tree? In a valuated tree, the distance between any two nodes is a tree distance, i.e. it satisfies the four points condition (Buneman, 1974).
Precisely, we use the alignment distance matrix D to feed a distance-based phylogenetic reconstruction program, FastME, which implements an improved neighbor-joining algorithm (Desper and Gascuel, 2002).
It assesses if a tree is a good model to represent D(i,j). The second is a topological criterion called the Re. Consider a subset of four proteins, {i,j,k,l}, and an internal edge e of the tree that separates {i,j} from {k,l}. In the tree, the distances between {i,j,k,l} must satisfy the four points condition (Buneman, 1974).
The edge e is correct if this condition is also satisfied by the distance D. In this case, one says the quartet {i,j,k,l} supports the edge e. The support value for edge e, denoted R(e), is defined as the average number of quartets that support e. The Re is simply the average value of R(e) over all internal edges. The value R(e) for an internal edge is a confidence value for that edge.
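The quartet test can be sketched in Python as follows. The sketch rests on several assumptions that the text does not settle: the four points condition is checked in its usual form, D(i,j) + D(k,l) being the smallest of the three pairwise sums; R(e) is read as the fraction of supporting quartets; and the quartets of an edge are taken as all pairs of leaves from each side, which may differ from the quartet selection of Guénoche and Garreta (2000). The function names are hypothetical.

```python
from itertools import combinations


def quartet_supports(D, i, j, k, l):
    """True if D agrees with a tree edge separating {i, j} from {k, l}."""
    within = D[i][j] + D[k][l]
    return within <= min(D[i][k] + D[j][l], D[i][l] + D[j][k])


def edge_support(D, side_a, side_b):
    """Fraction of quartets ({i, j} in side_a, {k, l} in side_b) that support the edge."""
    quartets = [
        (i, j, k, l)
        for i, j in combinations(side_a, 2)
        for k, l in combinations(side_b, 2)
    ]
    if not quartets:
        raise ValueError("each side of the edge needs at least two leaves")
    return sum(quartet_supports(D, *q) for q in quartets) / len(quartets)


# Toy tree distance: leaves 0 and 1 on one side, 2 and 3 on the other,
# external branches of length 1 and an internal edge of length 2.
D = [
    [0, 2, 4, 4],
    [2, 0, 4, 4],
    [4, 4, 0, 2],
    [4, 4, 2, 0],
]
print(edge_support(D, [0, 1], [2, 3]))  # 1.0: the true split is fully supported
print(edge_support(D, [0, 2], [1, 3]))  # 0.0: a wrong split is not supported
```

Averaging `edge_support` over all internal edges of an inferred tree would then give a global value in the spirit of the Re.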
Supplemental Table II gives the values of five treeness criteria (Guénoche and Garreta, 2000).
On the other hand, a Re value of 0.64 corresponds to trees that do have long external branches and to data that incorporate between 15% and 20% of noise. The Re is a very stringent criterion: with real data, it is usually lower than the VAF (although its theoretical maximum is also 1), and it seems more sensitive to noise and to the presence of long edges. Nevertheless, with up to 20% of noise, the inferred tree remains reliable (Guénoche and Garreta, 2000).
A PCMP block is an ordered association of the three PPR motifs, P, L, and S. Depending on the phasing, three different PCMP blocks are encountered: PLS, LSP, or SPL. Other possible arrangements, such as PSL, are not present in PCMPs. Trees derived from multiple alignments of the amino acid sequences of PCMP blocks exhibit low bootstrap values at their nodes (data not shown). As already mentioned, a data set made of a large number of short and divergent sequences is ill suited to this type of analysis. To classify the PCMP blocks, we designed an alternative approach based on the HMMer package (Eddy, 1998).