Evolutionary and expression signatures of pseudogenes in Arabidopsis and rice.

Pseudogenes (Psi) are nonfunctional genomic sequences resembling functional genes. Knowledge of Psis can improve genome annotation and our understanding of genome evolution. However, there has been relatively little systemic study of Psis in plants. In this study, we characterized the evolution and expression patterns of Psis in Arabidopsis (Arabidopsis thaliana) and rice (Oryza sativa). In contrast to animal Psis, many plant Psis experienced much stronger purifying selection. In addition, plant Psis experiencing stronger selective constraints tend to be derived from relatively ancient duplicates, suggesting that they were functional for a relatively long time but became Psis recently. Interestingly, the regions 5' to the first stops in the Psis have experienced stronger selective constraints compared with 3' regions, suggesting that the 5' regions were functional for a longer period of time after the premature stops appeared. We found that few Psis have expression evidence, and their expression levels tend to be lower compared with annotated genes. Furthermore, Psis with expressed sequence tags tend to be derived from relatively recent duplication events, indicating that Psi expression may be due to insufficient time for complete degeneration of regulatory signals. Finally, larger protein domain families have significantly more Psis in general. However, while families involved in environmental stress responses have a significant excess of Psis, transcription factors and receptor-like kinases have lower than expected numbers of Psis, consistent with their elevated retention rate in plant genomes. Our findings illustrate peculiar properties of plant Psis, providing additional insight into the evolution of duplicate genes and benefiting future genome annotation.

Pseudogenes (Cs) are defined as nonfunctional genomic sequences with significant sequence similarity to functional RNA or protein-coding genes (Li, 1983;Vanin, 1985;Balakirev and Ayala, 2003). The first described C has similarity to the Xenopus laevis 5S ribosomal RNA gene but is truncated and not expressed (Jacq et al., 1977). Protein-coding sequences are defined as Cs if degenerative features are present, such as premature stops, frameshift mutations, and truncations of the full-length gene. In this study, we focus on protein-coding Cs derived from previously functional genes or duplication of existing Cs (Li, 1983). Depending on the mechanism of the duplication event that created the C copy, Cs can be classified into two categories. Processed Cs are derived from retro-transposition events where double-stranded cDNAs derived from reverse transcription events are integrated into the genome. Nonprocessed Cs are derived from duplication of genomic DNA by whole-genome, tandem, and/or segmental duplication.
Cs are by definition nonfunctional and therefore are expected to be evolving neutrally, consistent with the finding that approximately 95% of human Cs are evolving neutrally (Torrents et al., 2003). Lack of function at the protein level, however, does not preclude the possibility that some Cs may still function as RNA genes (Balakirev and Ayala, 2003;Zheng and Gerstein, 2007). It has been suggested that C transcripts may act as intracellular inhibitors by hybridizing to sense RNA derived from their target genes (McCarrey and Riggs, 1986). The most prominent example is a nitric oxide synthase (NOS) C that contains an approximately 140-bp region that forms a heteroduplex with the functional NOS transcript and suppresses the translation of NOS protein (Korneev et al., 1999). Another example of C function at the RNA level is the Makorin1-p1 C, which potentially regulates the stability of its homologous coding gene transcript (Hirotsune et al., 2003) and experiences nonneutral evolution (Podlaha and Zhang, 2004); however, its function remains controversial (Gray et al., 2006). Although there are very few examples of Cs that clearly function at the RNA level, recent large-scale transcriptome sequencing projects and global expression studies using genome tiling arrays indicate that some Cs may be expressed. For example, close to 10% of distinct transcripts in the FANTOM collection of mouse full-length cDNAs are likely derived from Cs (Frith et al., 2006). Tiling array studies of the Arabidopsis (Arabidopsis thaliana) transcriptome suggest that approximately 20% of annotated Cs may be expressed (Yamada et al., 2003). The functional relevance of C expression remains to be explored. But even if Cs are not functional at the protein or RNA level, they may still aid in the evolution of genes by serving as reservoirs for generating genetic diversity (Balakirev and Ayala, 2003).
Cs are evolutionary relics of functional components in the genome and provide substantial information regarding the history of gene and genome evolution (Li, 1983;Balakirev and Ayala, 2003). For example, an understanding of the process of pseudogenization is important for estimating how frequently duplicate genes are retained in genomes (Sakai et al., 2007). In addition, studying Cs and understanding their properties will aid genome annotation. Because Cs are similar to functional genes, a large number of Cs are misidentified as potentially functional genes during the genome annotation process (Arabidopsis Genome Initiative, 2000;Lander et al., 2001;Mounsey et al., 2002;Mouse Genome Sequencing Consortium, 2002). For example, in a detailed analysis of the Bric-a-Brac/ Tramtrack/Broad domain family in rice (Oryza sativa), 43 out of 192 annotated genes were found to contain frameshifts and/or premature stops (Gingerich et al., 2007). Therefore, distinguishing Cs from functional genes is important for primary genome annotation. For these reasons, Cs have been extensively studied in various animal genomes and yeast (Torrents et al., 2003;Zhang et al., 2003Lafontaine et al., 2004;. In plants, despite a body of literature describing individual Cs or Cs in a limited number of gene families, there have been no systematic studies of Cs. The only genome-wide analysis so far concerned the identification of hundreds of processed Cs in the Arabidopsis genome (Benovoy and Drouin, 2006). It remains unclear if C evolution in plants is similar to that in animals in terms of C abundance, selection, expression, and patterns of preferential pseudogenization among gene families. In addition, although Cs contain frameshifts and/or premature stops, they may still lead to the generation of truncated forms of proteins that remain functional (Zheng and Gerstein, 2007). Furthermore, some plant Cs are likely expressed, but it is unclear if these expressed Cs have properties distinct from nonexpressed Cs. Finally, Cs are remnants of duplicates that were not retained. Since plant gene family sizes vary widely and there is substantial bias in what kinds of duplicates were retained (Maere et al., 2005;Hanada et al., 2008), it is anticipated that the relative abundance of Cs among plant gene families should complement studies on gene gains. To address these questions, we identified and examined the properties of thousands of Cs from two model plant species, Arabidopsis and rice, focusing on the strength of purifying selection on these Cs, their expression, and their representation in various protein domain families.

RESULTS AND DISCUSSION
Abundant Cs in the Genomes of Rice and Arabidopsis The overall C analysis pipeline and the number of sequences found during each step are shown in Figure  1. We identified over 21,000 and 123,000 intergenic sequence regions with significant similarity to known plant proteins from GenBank in Arabidopsis and rice, respectively (Fig. 1). These regions are referred to as "pseudoexons" (Zhang and Chasin, 2004). Note that it is possible that we missed some intergenic regions resembling protein sequences from nonplant species. After repeat masking, the pseudoexons in close proximity to each other and having the same plant protein matches were joined together to form contigs (see "Materials and Methods" and Supplemental Methods S1). These contigs are regarded as putative Cs (set I).
In Arabidopsis, a set of annotated genes has been designated as Cs. These annotated Arabidopsis Cs were combined with the intergenic Cs identified in this study for all subsequent analyses. An independent analysis has been conducted recently to identify Cs among annotated rice genes, but this data set is not included in our study (Thibaud-Nissen et al., 2009). The locations and the pseudocoding sequences of these Cs are provided in Supplemental Tables S1 to S3. A total of 28,330 set I Cs were identified in rice, versus only 4,771 in Arabidopsis. Based on our earlier analysis of the Bric-a-Brac/Tramtrack/Broad ubiquitin ligase family (Gingerich et al., 2007), we found that a number of the annotated genes in rice may be Cs. Therefore, the number of rice Cs is likely even higher than what we have presented here. In most cases, a C is derived from nonfunctionalization of one of a pair of duplicate genes while the other copy maintains its ancestral function (Little, 1982;Li, 1983). However, a substantial number of set I Cs (Fig. 1) do not have significant within-species matches (paralogs), even though they were originally identified based on significant similarities to protein sequences from other plant species. We further filtered set I Cs by requiring that these Cs have one or more non-transposableelement paralogs with alignment lengths of 50 amino acids or greater, eliminating approximately 1,000 Arabidopsis and approximately 20,000 rice putative C sequences. These remaining Cs are referred to as set II Cs (Fig. 1). Note that 10 times more set I rice Cs were eliminated compared with Arabidopsis Cs. Therefore, it appears that more rice genes either were fast evolving, precluding paralog identification, or were singlecopy genes that became Cs.

Zou et al.
After eliminating Cs without paralog(s), the ratio of set II Cs between rice and Arabidopsis is approximately 2:1, which correlates with the ratio of genome sizes between Arabidopsis (125 Mb; Arabidopsis Genome Initiative, 2000) and rice (372 Mb;International Rice Genome Sequencing Project, 2005) and the ratio of annotated protein-coding gene numbers (Arabidopsis, 27,029 versus rice, 41,030, after excluding annotated Cs and repetitive elements). This ratio remains the same for Cs with apparent "disabling mutations" (premature stops and/or frameshifts, referred to as set III). Given that, the larger set II and III C numbers in rice may simply reflect the fact that a larger pool of rice genes can become Cs. In addition to the difference in C number between Arabidopsis and rice, set II Cs in Arabidopsis are much shorter relative to their paralogs (P , 0.01), while the length differences between rice set II Cs and their paralogs are not significant (P . 0.09), potentially reflecting a stronger deletion pressure in Arabidopsis. Taken together, we have identified thousands of putative Cs from Arabidopsis and rice. A subset of these putative Cs contains disabling mutations that potentially disrupt protein function. It is not clear if putative Cs without disabling mutations represent false-positive Cs. To address this possibility further and to examine the selection pressure imposed on these putative Cs, we evaluated the strength of purifying selection on Cs in both plant species.

Strength of Purifying Selection on Plant Cs
Cs are expected to evolve neutrally. Therefore, after a sufficient amount of time, the signature of purifying selection at the amino acid level will ultimately be erased. We determined the strength of selection on the Arabidopsis and rice Cs by estimating v, the ratio of the nonsynonymous substitution rate (Ka) to the synonymous substitution rate (Ks) between each set II C and its closest "functional" paralog. Here, a functional paralog (designated "FP") is defined as an annotated protein-coding gene that is not a repetitive element or a C according to Arabidopsis and rice genome annotation. The v value for each pair of FP reciprocal best non-C matches (FP-FP) was generated as well to represent the selection strength on presumably functional protein-coding genes. In general, set II C-FP pairs from both Arabidopsis ( Fig. 2A) and rice ( Fig. 2B) have significantly higher v values compared with those of FP pairs (Wilcoxon rank sum test, P , 2.2e-16 for both; Fig. 2, A and B). Assuming that, immediately after gene duplication, one duplicate becomes nonfunctional and evolves neutrally (v approximately 1) and the other duplicate is still subject to strong purifying selection with v approximately 0.2, the v value for a C-FP pair is expected to be approximately 0.6. However, a substantial number of C-FP pairs in both plants, particularly in Arabidopsis, have v values that resemble those of nonneutrally evolving sequences (P , 2.2e-16, Wilcoxon rank sum test; Fig. 2, A and B).
One explanation for this difference between Arabidopsis and rice is that a substantial number of Arabidopsis Cs may be derived from relatively more recent duplication events than rice Cs and therefore had less time for neutral evolution. A more recent whole genome duplication (WGD) has occurred in the Arabidopsis lineage (20-40 million years ago [Blanc et al., 2003]) compared with the rice lineage (53-94 million years ago [Yu et al., 2005]). This is consistent with the presence of a clear peak at Ks approximately 0.8 among FP-FP pairs in Arabidopsis but not in rice (Fig. 2, C and D). Using Ks as a proxy of time, if the relatively more recent WGD in Arabidopsis contributed to the apparently lower v among its C-FP pairs, we would expect a Ks peak of C-FP pairs following the Ks peak representing the WGD. However, there is no apparent C-FP peak after WGD (Fig. 2, C and D). In addition, the Ks frequency distributions of Arabidopsis and rice C-FP pairs are very similar, indicating that the lower v values in Arabidopsis Cs cannot be clearly attributed to its more recent WGD event. The absence of a Ks peak near the WGD also indicates that many Cs derived from WGD duplicates likely were deleted or became too degenerated for detection within 20 to 40 million years.
Another explanation for the more relaxed selection of rice Cs compared with those in Arabidopsis is that the larger retrogene pool in rice contributed to an overall higher v value in rice. A much larger number of retroelements and retrogenes are present in rice compared with those in Arabidopsis Benovoy and Drouin, 2006;Wang et al., 2006). In addition, given that retrogenes tend to be inserted in genomic regions with mostly irrelevant regulatory contexts, they are expected to be "dead on arrival" (Li et al., 1981;Brosius, 1991). We classified rice Cs into retro and nonretro categories and determined the strength of purifying selection on these two sets. Although the classification rate is low (18%; Fig. 1), the stringent criteria we used (see "Materials and Methods") ensure low false identification rates of both retro and nonretro Cs. We found that nonretro Cs tend to have significantly lower v values compared with retro Cs (P , 4.3e-06, Wilcoxon rank sum test; Supplemental Fig. S1), indicating that the difference in strength of past selection between Arabidopsis and rice Cs can be attributed, at least in part, to the differences in selection on duplicated and retro Cs.
In an earlier study of human Cs, approximately 95% of the 19,724 Cs were found to be neutrally evolving based on v values (Torrents et al., 2003). The abundance of retrogenes in the human genome (Kazazian, Figure 2. Strength of purifying selection on annotated genes and Cs. A and B, Frequency distributions of v, the ratio between the nonsynonymous substitution rate (Ka) and the synonymous substitution rate (Ks) of sequence pairs. A, Arabidopsis sequence pairs. Red symbols indicate annotated "functional" gene (FP) pairs that are not known transposable elements (TE) or Cs. Blue symbols indicate C-FP pairs; Cs are those identified in this study (set II; Fig. 1) that do not resemble known TEs. Cyan symbols indicate non-TE annotated C-FP pairs. B, Rice sequence pairs. Red symbols indicate FP-FP pairs that are not known TEs. Blue symbols indicate C-FP pairs; Cs are those identified in this study that do not resemble known TEs. Green symbols indicate TE C-FP pairs. C and D, Distributions of Ks values of FP-FP pairs (red) and C-FP pairs (blue) in Arabidopsis (C) and rice (D).
2004; Sakai et al., 2007) and the finding that 72% (7,819 of 10,834) human Cs identified are derived from retrotransposition events  likely significantly contributed to the apparently more relaxed selection on human Cs compared with plant Cs.

Additional Explanations for the Signature of Nonneutral Evolution among Plant Cs
There are 685 and 926 C-FP pairs with v # 0.2 in Arabidopsis and rice, respectively. These Cs have apparently experienced selective constraints as strong as those experienced by most functional genes. To evaluate if these Cs are false-positive predictions, we compared the distributions of v values of Cs with and without disabling mutations in Arabidopsis and rice to those of FP-FP pairs (Fig. 2). Although the median v values for Cs with disabling mutations are slightly higher than those without disabling mutations in both plants (Supplemental Fig. S2), Cs without disabling mutations have much higher v values compared with those of FP-FP pairs (Supplemental Fig. S2). Therefore, although we cannot rule out the possibility that a few of these truncated sequences may still be functional as protein-coding genes, the v value distributions strongly suggest that most Cs without disabling mutations are likely nonfunctional at the protein level and that the absence of disabling mutations may be explained by their recent pseudogenization.
An implicit assumption made when using Cs with and without disabling mutations to assess the extent of C false positives is that Cs with disabling mutations are true Cs, at least in the context of protein function. If a gene has acquired a premature stop or frameshift but remains functional at the protein level, we would expect the v value of the segment 5# to the first stop to be significantly lower than the value of the 3# segment (after the first stop, in the original reading frame alignable to its paralog) of the same C (set III; Fig. 1). To test this, we eliminated stops and frameshifted positions and determined v values of 5# and 3# regions. Interestingly, the median v value of the 5# half of set III Cs (Arabidopsis, 0.40; rice, 0.48) is indeed significantly smaller than that of the 3# half (Arabidopsis, 0.46; rice, 0.58; Wilcoxon rank sum tests: Arabidopsis, P , 9.2e-05; rice, P , 2.2e-16; Fig. 3, A and B). Importantly, there is no significant difference between the v value distributions of 5# regions and 3# regions of FP-FP pairs in both species when we assume that there is a premature stop in the same position as in the pseudogenes (Arabidopsis, P . 0.12; rice, P . 0.82; Supplemental Fig. S4). These findings indicate that the regions 3# to the disabling mutation may become nonfunctional but the 5# regions escaped nonsense-mediated decay (Chang et al., 2007) and continued to experience purifying selection for some time. It should be noted that 5# segments have almost three times as many sequences with v values between 0 and 0.05 compared with 3# segments. Therefore, this pattern of differential selection on 5# and 3# segments of Cs is noticeable only for very young duplicates with small Ks values.
In addition to differential selection between the 5# and 3# parts of the Cs, another possibility for strong selective constraints on Cs is that they were derived from relatively ancient duplicates that became Cs recently. The rationale behind this hypothesis is that, although some duplicate genes have persisted in the genome for tens to hundreds of millions of years, the Ks frequency distributions of duplicate genes indicate that most duplicates will eventually be lost (Fig. 2, C and D). Consistent with our expectation, Cs derived from older duplication events (larger Ks) tend to have experienced stronger purifying selection (lower Ka/Ks) than Cs derived from younger duplication events (Fig.  3, C and D). The selection on younger duplicates is much more relaxed, presumably due to the presence of two functionally identical sequences that are free to accumulate mutations (Hughes, 1994). Therefore, the relatively recent pseudogenization of Cs derived from old duplicates and the insufficient time for accumulation of neutral mutations in these Cs likely contributed to the signature of selection of plant Cs.

Evidence of C Expression
Although the Cs we have identified may not be functional at the protein level, it remains an open question if they are still useful as RNA genes. To evaluate this possibility, we first determined if set II Cs are transcribed using EST and massively parallel signature sequencing (MPSS) data sets. Among annotated protein genes, 73% and 49% have either EST and/or MPSS evidence in Arabidopsis and rice, respectively (Table I). In contrast, significantly fewer Cs have evidence of expression (2%-5% and 2%-3% in Arabidopsis and rice, respectively; Fisher's exact test, P , 2e-16 in both cases; Table I). Our findings indicate that the majority of Cs are no longer expressed at a sufficiently high level to be detected by EST sequencing and MPSS approaches. Interestingly, studies of mammalian Cs have shown that 2% to 5% of Cs are expressed based on similar expression tag analysis (Yano et al., 2004;Harrison et al., 2005;Zheng et al., 2005;Frith et al., 2006). The consistency in the proportion of Cs with sequence tags between mammals and plants suggests that Cs contribute similarly to the noncoding RNA gene repertoire in these divergent taxa.
In addition to the C expression tags, we analyzed tiling array data from Arabidopsis and rice to compare the levels of C expression with that of other sequence features (Fig. 4). Intron sequences in general have significantly lower hybridization intensities compared with those of exons and Cs in both rice and Arabidopsis (Wilcoxon rank sum tests, all P , 2e-16), indicating that a large number of Cs may be expressed but at a relatively low level. In addition, in Arabidopsis, exon sense expression is significantly higher than C expression in either the sense or antisense direction (Wilcoxon rank sum test, P , 2e-16; Fig. 4A). In rice, however, the intensity distributions of exons (sense direction) and Cs are similar (Wilcoxon rank sum test, P . 0.7; Fig. 4B). This is inconsistent with the finding that there are more than 10 times more expression tags for annotated genes than for Cs in rice. This does not appear to be due to cross-hybridization, since we have eliminated probes with potential to cross-hybridize (see "Materials and Methods"). We suspect this may be partly due to the overall lower hybridization intensity in the rice tiling array data sets , but we do not have a definitive explanation.
Using the 95th percentile of the intron probe intensity level distribution as the threshold, we found that 610 Arabidopsis (16.79%) and 1,047 rice (22.91%) Cs may be considered sense expressed (Table I), consistent with earlier studies (Yamada et al., 2003). We also found that 523 Arabidopsis (14.42%) and 922 rice (20.17%) Cs are likely expressed in the antisense direction (Table I). Another interesting finding is that the difference in the intensity distributions of C probes between sense and antisense directions (Wilcoxon rank sum test, P , 3e-4) is not as significant as those for exons (Wilcoxon rank sum test, P , 2.2e-16; Fig. 4,  A and B). Therefore, C sense and antisense expression appear to be uncoupled, suggesting that Cs may not be subjected to regulatory control at a level similar to functional genes.

Properties of Expressed Cs
The finding that a significant number of plant Cs have evidence of expression raises the question if these Cs are selected at the RNA level. Based on the findings presented below, it seems that in most cases C expression is not stable over a long evolutionary time and therefore may not be subjected to purifying selection. We found that expressed Cs tend to have significantly lower Ks values, indicating that they were derived from relatively recent duplication events (Wilcoxon rank sum tests: P , 1.0e-11 for Arabidopsis [ Fig. 4C] and P , 0.05 for rice [ Fig. 4D]). In addition, expressed Cs tend to be more "complete." That is, the alignment coverage of Cs to their functional paralogs is significantly higher for Cs with expression tags than for all Cs (Wilcoxon rank sum tests: Arabidopsis, P , 7.0e-10; rice, P , 9.0e-9; Supplemental Fig. S3). A possible explanation for the tendency for younger Cs to be expressed is that their cis-regulatory regions are not completely degenerated, assuming that the completeness of the C coding region reflects the completeness of the associated promoter region. Another potential source of expressed Cs is from those derived from ancient duplication events that have experienced strong purifying selection but have become pseudogenized relatively recently. Unfortunately, the number of expressed Cs derived from ancient duplication is too small to assess this possibility statistically.
Note that the above explanations assume that C expression evolved neutrally and that Ks can be used as a proxy for evolutionary time. These assumptions would not hold if in fact some expressed Cs were functional as RNA genes and purifying selection acted on the majority of the nucleotide positions. In these cases, C Ks would be an underestimate of the true neutral evolution rate and our earlier suggestion that expressed Cs tend to be young would be invalid. It has been shown that the transcript from a NOS C is a natural antisense of a paralogous NOS gene and is implicated in the transcriptional regulation of the latter (for review, see Zheng and Gerstein, 2007). Nevertheless, the heteroduplex between NOS C and its functional paralog only involves an approximately 140-bp region over the greater than 2-kb NOS C. This is similar to the findings of an analysis of trans-natural antisense transcripts in 10 eukaryotes, where the me-dian sizes of regions of significant sequence similarity between the antisense transcripts and their target genes are 45 to 170 bp (Li et al., 2008). Therefore, if transcriptional regulatory roles of Cs on their paralogs are mechanistically similar to NOS regulation, we do not expect the sequences outside of the relatively much smaller antisense regions to experience strong purifying selection. Taken together, our Ks estimate is likely not substantially influenced by purifying selection of the potential antisense regions important for RNA gene functions, and our suggestion that expressed Cs tend to be derived from young duplicates is likely correct.
The Ks distribution of expressed Cs in plants (Fig. 4, C and D) is similar to that of mouse Cs, where there is a much higher frequency of C-FP pairs with low Ks and a long tail of pairs with high Ks values (Frith et al., 2006). In the mouse study, it was suggested that the long tail indicated that some expressed Cs may have persisted in the genome over a long evolutionary time scale. We have also found a similar pattern among plant Cs. However, it is also possible that some of these expressed Cs are simply derived from older duplicates that have become Cs recently. Further studies examining the timing of pseudogenization would be necessary to assess these possibilities. Given the enrichment of expressed Cs that are derived from younger duplicates, it seems that in most cases C expression does not persist.

Correlation between the Number of Cs and Family Size
Most of the Cs we identified are remnants of duplicates that were not retained, and their presence can provide important clues to the past history of gene family evolution. Studies of duplicate genes in Arabidopsis have established that gene families vary greatly in size and that there is a substantial bias in the functions of the retained duplicates (Maere et al., 2005;Hanada et al., 2008). What is the relationship between the functional and C members of a subfamily? Are the numbers of Cs and functional members in a gene a Annotated genes are Arabidopsis or rice genes in TAIR version 7 and TIGR version 5 annotations. A PUT was matched to an annotated gene if the PUT-gene pair had 97% or greater identity over 50% or more of the shorter sequence with length of 300 bp or greater. Only MPSS tags with a 100% unique match to genes were considered. b The numbers of annotated protein genes in Arabidopsis (TAIR version 7) and rice (TIGR version 5) are approximately 41,030 and 27,029, respectively (after excluding annotated Cs and repetitive elements). c The numbers of set II Cs in Arabidopsis (TAIR version 7) and rice (TIGR version 5) are 3,697 and 7,762, respectively.
family positively correlated, as would be expected assuming that there is no differential retention of duplicates?
To address this question, we first classified Arabidopsis and rice protein-coding genes into domain families based on Pfam domain designations. We then assigned Cs according to the domain family designations of their top matching functional paralogs. Here, we assumed that the domain content of a C and its functional paralog is identical. This is in general true, since few domain configurations have changed among functional duplicate pairs with Ks # 2 (Arabidopsis, 197 of 3,848 paralogous pairs; rice, 209 of 2,990). Based on these domain family assignments, we found that there are significant positive correlations between the numbers of Cs and functional members in both Arabidopsis (Spearman rank, r = 0.54, P , 2.2e-16; Fig. 5A) and rice (r = 0.56, P , 2.2e-16; Fig. 5B). Our findings indicate that larger gene families tend to experience more gene loss than smaller gene families. Given that the most common fate of duplicate genes is pseudogenization (Lynch and Conery, 2000;Harrison et al., 2002), assuming that the gene loss rate is similar among families, larger gene families likely have proportionally higher numbers of duplicates at any given time. This may be the explanation for the observed correlation.
The same correlation has been observed in yeast (Lafontaine et al., 2004). However, in yeast the correlation is nearly perfect (r 2 = 0.98), which is in sharp contrast to our findings. Despite the highly significant correlation between numbers of Cs and numbers of functional members in plant domain families, there is substantial variation that results in relatively moderate correlation coefficients (Fig. 5). Assuming a simple model of linear correlation between numbers of Cs and family size, many gene families are either above the trend line, indicating excess loss, or below the trend line, indicating a higher than usual rate of retention. In addition to the correlation between numbers of Cs and family size in each species, the numbers of Cs per family in Arabidopsis and rice are also significantly correlated (Fig. 5C). This can be partly attributed to the fact that gene family sizes between species are correlated because of common ancestry and a similar degree of parallel retention (Hanada et al., 2008). This correlation also suggests that the proportion of family members that become pseudogenized after duplication is similar in rice and Arabidopsis.

Overrepresentation and Underrepresentation of Cs among Domain Families
Although there is a significant correlation between numbers of Cs and functional members in plant gene families, many gene families seem to have lost more or lost less than average (as indicated by deviation from the trend lines; Fig. 5). Which families have an overrepresented or underrepresented number of Cs? In addition, do Cs tend to be derived from genes with certain functions? To address these questions, we assigned Cs to GeneOntology categories based on the annotations of each C's best matching functional paralog to assess if a particular domain family has a significantly overrepresented or underrepresented number of Cs (for original data and test statistics, see Supplemental Table S4). Among families with overrepresented numbers of Cs, only 13 overlap between Arabidopsis and rice. In addition, only two families are underrepresented consistently between these two plants. Among Arabidopsis domain families with an overrepresented number of Cs, four "functional classes" of domain families stand out (Fig. 6A): (1) defense genes: nucleotide-binding adaptor shared by APAF-1, certain R gene products, and CED-4 (NB-ARC), Toll/interleukin-1 receptor homology domain (TIR), leucine-rich repeat (LRR-3), S-domain in receptorlike kinases (PAN_2, S_locus_glycop, B_lectin); (2) cell wall-modifying enzymes: cellulose synthase, polygalacturonase (Glyco_hyrdo_28), pectin esterase; (3) enzymes involved in secondary metabolism: cytochrome P-450, terpene synthase; and (4) protein degradation machinery: F-box and various associated domains (FBD, FBA_1, FBA_3, and LRR_2), Seven in absentia (Sina), Meprin and TRAF-Homology domain (MATH), and U-box.
The excess of Cs in these families is not simply due to the trivial explanation that larger families have more Cs (this is accounted for in the statistical tests). Interestingly, genes harboring some of these domain families tend to be tandem duplicates. These domain families tend to experience repeated gene gain in one organismal lineage and undergo loss in another (single lineage expansion; Hanada et al., 2008). We found that there is a significantly higher ratio of Cs that are in tandem compared with that of functional genes in both plants (Supplemental Table S5). In an earlier study, we also found that the genes derived from single-lineage expansion also tend to be responsive to environmental, particularly biotic, stresses (Hanada et al., 2008). We found that, as expected, genes in class 1 (as defined above; Fig. 6A) tend to be responsive to biotic stress ( Fig. 6B; Supplemental Table S4). Our findings suggest that genes involved in defense responses, particularly those in tandem repeats, have high duplication rates but turn over rapidly, presumably due to selection imposed by rapid changes in biotic environments. Aside from domain families with genes implicated in defense response, what factors explain elevated rates of pseudogenization? Most of the domain fam-ilies in functional classes 2, 3, and 4 ( Fig. 6A) do not have excess numbers of genes that are stress responsive (Fig. 6B). In class 2, the excess Cs in the P450 and Figure 6. C representation and stress responsiveness of domain families in Arabidopsis. A, Arabidopsis protein domain families with significantly overrepresented (shades of red) or underrepresented (shades of blue) numbers of Cs. C/NC tests determine if a domain family contains an underrepresented or overrepresented number of Cs compared with non-Cs (NC). The "C/NC test" column indicates the significance of the test statistic (P value) in shades of blue and red (significant underrepresentation and overrepresentation, respectively). The "Funct. class" column indicates four groups of domain families with overrepresented numbers of Cs: defense genes (class 1), cell wall-modifying enzymes (class 2), enzymes involved in secondary metabolism (class 3), and protein degradation machinery (class 4); and two groups with underrepresented numbers of Cs: transcription factor-associated domains (class 5) and receptorlike kinases (class 6). B, Abiotic/biotic stress test. This test determines if the annotated members of a domain family tend to be responsive to a stress condition (see "Materials and Methods"). The color shading scheme is the same as that for A, with domain families containing genes that tend to be up-regulated under a particular stress condition compared to the genome average in shades of red and families with genes that tend not to be up-regulated in shades of blue.
terpene synthase families are likely indicative of rapid turnover due to selection for the production of a battery of secondary compounds serving as part of defense mechanisms and/or herbivore deterrents (Guengerich, 2004;Tholl, 2006). But it is puzzling why there are so many Cs related to cell wall modification and protein degradation. The large family sizes among cell wall-modifying enzymes may be explained by selection for functional divergence for more flexible cell wall modification controls in myriad environments (Kim et al., 2006); however, it is still unclear why there is an overrepresentation of Cs in these families. Similarly, it would appear that, given the complexity of cellular networks, a large complement of specificity determinants to accurately regulate protein degradation is essential. While this may explain why the F-box (Gagne et al., 2002), U-box (Mudgil et al., 2004), and MATH (Gingerich et al., 2007) domain families are much larger compared with most plant gene families, the selection pressures that contribute to the disproportionally larger numbers of Cs in these families remain unclear.
There are two apparent functional classes among domain families with an underrepresented number of Cs. We found a number of transcription factorassociated domains (class 5; Fig. 6A). We also found that there are significantly fewer transcription factor Cs compared with all other Cs in both Arabidopsis and rice (Supplemental Table S6). This is consistent with earlier studies showing that transcription factors tend to be retained after whole genome duplication in Arabidopsis (Blanc and Wolfe, 2004;Seoighe and Gehring, 2004) and that there is a higher rate of lineage-specific expansion of transcription factors in plants compared with other eukaryotes (Shiu et al., 2005). In addition to transcription factors, another domain class that behaves similarly in both Arabidopsis and rice is typically found in plant receptor-like kinases (RLKs; Shiu and Bleecker, 2001). The RLKs belong to the largest gene family in plants, with over 600 and 1,200 members in Arabidopsis and rice, respectively . A number of RLKs with LRRs are important in controlling plant growth and development as well as in defense responses (Shiu and Bleecker, 2001;Morillo and Tax, 2006). The underrepresentation of RLK Cs suggests that more RLK duplicates were retained than became Cs. Aside from these examples, it is not entirely clear why some of the other domain families have underrepresented numbers of Cs. It remains to be determined if this underrepresentation is due to a much reduced duplication rate or an elevated rate of duplicate retention.

CONCLUSION
In this study, we identified Cs that likely were remnants of protein-coding genes in the genomes of Arabidopsis and rice. We found that a large number of plant Cs seem to be subjected to a substantial period of purifying selection before pseudogenization. A number of the plant Cs are likely expressed but at a lower level compared with functional genes. Our findings also indicate that C expression is found mostly among young duplicates, suggesting that many of the expressed Cs may not be under selection as noncoding RNA. In addition, we found that the number of Cs in a domain family is significantly correlated with the number of presumably functional members in the family. This correlation illustrates that pseudogenization at the gene family level is to some extent a neutral process. However, the correlation is far from perfect, because many families either have significantly more or significantly less than the expected numbers of Cs, consistent with the substantial functional bias of retained duplicates in mammals  and plants (Blanc and Wolfe, 2004;Seoighe and Gehring, 2004). Genes in families with underrepresented numbers of Cs likely have a higher than usual retention rate of their family members (Shiu et al., , 2005. On the other hand, genes in families with overrepresented numbers of Cs may have experienced rapid birth-and-death evolution (Nei and Rooney, 2005). In several cases, for example the F-box family, rapid turnover is observed but the selection pressure driving the rapid birth and death is not clear. Our study represents, to our knowledge, the first comprehensive study of plant C evolution and expression. The findings here highlight multiple properties of plant Cs that are important for our understanding of genic evolution in plants and for guiding future annotation efforts of plant genomes.

Identification of Cs in the Rice and Arabidopsis Genomes
The intergenic sequences used for identifying putative Cs were defined according to The Institute of Genome Research (TAIR) version 7 annotations for Arabidopsis (Arabidopsis thaliana) and Rice Annotation Project 2 (RAP2) and The Institute of Genome Research (TIGR; currently the J. Craig Venter Institute) version 5 annotation for rice (Oryza sativa). The overall pipeline for C identification is outlined in Figure 1 and is generally based on the pseudopipe workflow  with modifications. The pipeline consists of six major steps: (1) identifying intergenic regions with sequence similarity to known proteins; (2) repeat masking; (3) linking C fragments (pseudoexons) into contigs (set I C); (4) quality filtering (set II C); (5) identifying features that disrupt contiguous protein sequences (set III C); and (6) distinguishing retro and nonretro C (for details, see Supplemental Methods S1). The intergenic regions that qualify for the first three steps are referred to as set I C. Those passing through the first four steps are referred to as set II C. Cs with disabling mutations identified during step 5 are referred to as set III C. The coordinates and the pseudocoding sequences of these Cs are included in Supplemental Tables S1 to S3.

Analysis of Expression
For each plant species, we determined the number of annotated genes that are presumably functional and the number of Cs that have evidence of expression based on (1) putative unique transcript (PUT), (2) MPSS tags, and (3) tiling array data. Rice and Arabidopsis PUTs were downloaded from PlantGDB (version 163a; Duvick et al., 2008) and were used to search against annotated genes and set II Cs. PUTs are regarded as cognate transcripts for annotated genes and Cs if (1) these PUTs do not have a better match to other genes, (2) their identities are 97% or greater, (3) the aligned regions are 300 nucleotides or greater, and (4) the matched region is greater than 50% of the shorter sequence length. The MPSS tags for rice and Arabidopsis were downloaded (http://mpss.udel.edu/; Nobuta et al., 2007) and mapped to the rice and Arabidopsis pseudomolecules (100% identity and 100% coverage). MPSS tags that mapped uniquely to functional genes or Cs were regarded as expression tags for the respective genic sequences.
The third type of expression data set we examined was tiling array data for Arabidopsis (GEO: GSE601; Yamada et al., 2003) and for rice (GEO: GSE6996; Li et al., 2007). In both studies, the genomes were covered with the use of multiple arrays. A between-array normalization procedure was applied to each data set using the affyPLM package of Bioconductor (Gentleman et al., 2004). Among probe sequences with perfect matches to the genome assemblies, probes with one or more match with 85% or greater identity were discarded due to their potential contribution to cross-hybridization. Intensities for probes enclosed within the regions defined by the coordinates of exons and introns in the annotations as well as Cs identified in this study were averaged to serve as summary statistics of expression level.
Gene expression data under eight abiotic and eight biotic stress conditions were obtained from AtGenExpress (http://www.uni-tuebingen.de/ plantphys/AFGN/atgenex.htm) and processed as described previously (Hanada et al., 2008). Significantly up-and down-regulated genes under each condition were identified using LIMMA (Wettenhall and Smyth, 2004). Overrepresentation and underrepresentation of each domain family (D) member under each condition (C) were determined by setting up a two-by-two contingency table comparing numbers of D and non-D members that are upor down-regulated in C and numbers of D and non-D members without significant expression change under C using Fisher's exact test.

Evolutionary Rate Calculation
To evaluate the level of selective constraint on Cs, we calculated the synonymous and nonsynonymous substitution rates (Ks and Ka, respectively) between each C and its closest paralog in Arabidopsis and rice. The rates were determined using the yn00 program in the PAML program package (Yang, 1997) based on pairwise alignments of the sequence pairs. Very few pairs had run errors (e.g. NAN in PAML output) and were excluded. Sequence pairs that were too similar (Ks # 0.005) or too divergent (Ks . 3) were excluded as well. Note that these filtered Cs were used for comparing the strength of selection between 5# segments before the first disabling mutation and 3# segments after the last disabling mutation. However, we did not apply the Ks cutoffs on the segments. The rates for within-species reciprocal best match gene pairs (FP pairs) were determined as mentioned above.

Classification of Annotated Genes and Cs into Domain Families and Tandem/Nontandem Classes
Annotated genes in Arabidopsis and rice were first classified into families based on their Pfam domain composition. First, the protein sequences from these two plants were compared against hidden Markov models using the HMMER program package (http://hmmer.wustl.edu/) with the trusted cutoff threshold scores defined by Pfam (build May 2007; Bateman et al., 2004). For genes with multiple annotations due to alternatively spliced forms, the longest annotated protein sequences were used for domain identification. Because in most cases Cs are truncated and/or degenerate, they were assigned to domain families based on the assumption that their domain compositions were the same as their closest paralogs identified in step 4 of the C identification pipeline. Tandemly duplicated genes were defined as genes in any gene pair, T1 and T2, that (1) belong to the same gene family, (2) are located within an average distance of 10 genes, and (3) are separated by 10 or fewer nonhomologous spacer genes.

Supplemental Data
The following materials are available in the online version of this article.
Supplemental Figure S1. Frequency distributions of v values (Ka/Ks) of rice annotated genes, retro-Cs, and nonretro-Cs.
Supplemental Figure S2. Frequency distributions of v values (Ka/Ks) of rice annotated genes, all Cs, and Cs with premature stops and/or frameshifts.
Supplemental Figure S3. Proportion of C sequences covered by their closest functional paralogs in Arabidopsis and rice.
Supplemental Figure S4. Strength of purifying selection on regions 5# and 3# to the first stops in non-transposable-element C-FP pairs and FP-FP pairs.
Supplemental Table S1. C annotation data.
Supplemental Table S2. Arabidopsis C coding sequences.
Supplemental Table S3. Rice C coding sequences.
Supplemental Table S4. GeneOntology categories with consistently overrepresented or underrepresented numbers of Cs in Arabidopsis and rice.
Supplemental Table S6. Underrepresentation of transcription factor Cs.