Plant Physiology 137:31-42 (2005)
© 2005 American Society of Plant Biologists
Exploring the Plant Transcriptome through Phylogenetic Profiling1,[w]
Department of Plant Systems Biology, Flanders Interuniversity Institute for Biotechnology (VIB), Ghent University, B-9052 Ghent, Belgium
Publicly available protein sequences represent only a small fraction of the full catalog of genes encoded by the genomes of different plants, such as green algae, mosses, gymnosperms, and angiosperms. By contrast, an enormous number of expressed sequence tags (ESTs) exists for a wide variety of plant species, representing a substantial part of all transcribed plant genes. Integrating protein and EST sequences in comparative and evolutionary analyses is not straightforward because of the heterogeneous nature of both types of sequence data. By combining information from publicly available EST and protein sequences for 32 different plant species, we identified more than 250,000 plant proteins organized in more than 12,000 gene families. Approximately 60% of the proteins are absent from current sequence databases but provide important new information about plant gene families. Analysis of the distribution of gene families over different plant species through phylogenetic profiling provides insights into plant gene evolution and identifies species- and lineage-specific gene families, orphan genes, and conserved core genes across the green plant lineage. We counted a similar number of approximately 9,500 gene families in monocotyledonous and eudicotyledonous plants and found strong evidence for the existence of at least 33,700 genes in rice (Oryza sativa). Interestingly, the larger number of genes in rice compared to Arabidopsis (Arabidopsis thaliana) can partially be explained by a larger number of species-specific single-copy genes and species-specific gene families. In addition, a majority of large gene families, typically containing more than 50 genes, are larger in rice than in Arabidopsis, whereas the opposite seems true for small gene families.
Comparative genomics provides a powerful means to study gene structure and the evolution of gene function and regulation. Analysis of genes or pathways in a broad phylogenetic context allows scientists to better understand how complex biological processes are regulated and evolve (Soltis and Soltis, 2003
Perhaps the best-known example of an integrated sequence-based system applying phylogenetic profiles is the clusters of orthologous groups (COG) database, which is a comprehensive repository of functionally annotated clusters of bacterial and eukaryotic orthologous genes (Tatusov et al., 2003
Here, we present an integrated sequence repository (available as Sequence platform for the Phylogenetic analysis of Plant Genes [SPPG] in the section Databases at http://www.psb.ugent.be/bioinformatics/) that combines EST sequence data with protein information, providing an excellent starting point for plant comparative and evolutionary genomics. This is illustrated by the examination of several thousand gene families distributed over a large number of different plant species, which reveals unique features about the evolution of plant gene families.
EST Assembly, ORF Detection, Protein Clustering, and Functional Annotation
Initially, 106,174 proteins and 2,884,000 EST sequences from 32 different plant species were retrieved from EMBL and The Institute for Genomic Research (TIGR) to construct a nonredundant and high-quality data set of plant proteins. After the assembly of the EST sequences, annotation of open reading frames (ORFs) on EST clusters, and processing all currently available proteins for the plant species selected here (see "Materials and Methods" for technical details), a total of 86,077 nonredundant plant proteins from EMBL and TIGR were obtained, together with 253,857 EST clusters derived from more than 1.8 million clustered EST sequences (Table I; Fig. 1). Fifty-seven percent of all initial EST sequences could be assembled into an EST cluster comprising, on average, 6.16 ESTs. These results are very comparable with similar plant EST assembly initiatives (TIGR Plant Gene Indices, Quackenbush et al., 2001
Gene families and individual genes have been functionally annotated based on the available gene descriptions and Gene Ontology (GO) annotations of protein sequences derived from EMBL and TIGR. Approximately 58,000 gene descriptions could be mapped on 11,938 different gene families, and 22,395 functional GO labels of Arabidopsis could be assigned to 4,099 gene families. When gene descriptions are transferred between different members of the same gene family, more than 80% of plant sequences can be labeled with functional information.
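The transfer of gene descriptions between members of the same family can be illustrated with a small sketch. The text does not state how conflicting descriptions within one family were resolved, so the majority vote used here is purely an assumption, and the function name and data layout are hypothetical.

```python
from collections import Counter


def transfer_descriptions(families, descriptions):
    """Label unannotated family members with the family's most common description.

    families:     dict mapping a family ID to a list of protein IDs
    descriptions: dict mapping annotated protein IDs to a description
    Majority voting is an illustrative assumption, not the published rule.
    """
    labeled = dict(descriptions)  # existing annotations are never overwritten
    for members in families.values():
        known = [descriptions[p] for p in members if p in descriptions]
        if not known:
            continue  # nothing to transfer within this family
        consensus = Counter(known).most_common(1)[0][0]
        for protein in members:
            labeled.setdefault(protein, consensus)
    return labeled
```

Families without any annotated member stay unlabeled, which is one reason why coverage after transfer remains below 100%.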
In addition to assigning general gene descriptions to families or individual proteins, information about the nuclear or organellar origin of genes has also been integrated, which allows us to determine the number of chloroplast and mitochondrial DNA sequences that have been inserted into or transferred to the nucleus. In total, 704 chloroplast and 275 mitochondrial gene products were identified that could be clustered into 202 distinct gene families. Interestingly, in numerous gene families, genes from different origins were grouped. Sixty-six and 24 gene families were found uniquely for chloroplast or mitochondrial genomes, respectively, whereas 110 organelle families were identified for which homologs were also detected in the nuclear plant genome of Arabidopsis or rice. Two gene families (NADH dehydrogenase subunits 1 and 4) were encoded by both the chloroplast and the mitochondrial genome. Gene families in both mitochondrial and nuclear genomes encode cytochrome c subunits, ribosomal proteins, and tRNAs, whereas a wide variety of genes, covering 66 different gene families, was found in both chloroplast and nuclear genomes (for full list, see Supplemental Table I). In addition, 10 families were identified in the mitochondrial, chloroplast, and nuclear genomes of different species encoding ribosomal proteins, NADH dehydrogenase subunits, Fe-superoxide dismutase, ATP synthase subunit 1, and an Asn tRNA. This confirms previous findings that genes frequently are transferred from the chloroplast or mitochondrial genome to the nucleus, where they acquire new expression control and targeting signals for the correct expression, translation, and reimport into the organelle (Martin, 2003
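The grouping of families by genome of origin described above amounts to tallying, for each family, the set of genomes that encode its members. A minimal sketch, assuming a hypothetical mapping from family ID to the genomes observed for that family (the labels "chloroplast", "mitochondrion", and "nucleus" are illustrative):

```python
from collections import Counter


def tally_origins(family_genomes):
    """Count gene families by the combination of genomes encoding their members.

    family_genomes: dict mapping a family ID to an iterable of genome labels.
    Returns a Counter keyed by frozenset of labels.
    """
    return Counter(frozenset(genomes) for genomes in family_genomes.values())
```

Using a frozenset as the key makes "chloroplast and nucleus" and "nucleus and chloroplast" the same category, so each family is counted in exactly one combination.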
Strikingly, whereas 19% (15 out of 76 gene families) of all chloroplast gene functions in Arabidopsis are also present in the nuclear genome, in rice 37% (30 out of 81 gene families) of all chloroplast gene functions are found in the nuclear genome. This difference confirms previous findings that the rice nuclear genome is significantly more enriched with plastid genome sequences than that of Arabidopsis (Shahmuradov et al., 2003
An overview of the number of proteins ascribed to gene families is shown in Table I. As expected, the largest numbers of proteins that can be assigned to gene families are derived from Arabidopsis and rice (22,412 and 30,993 genes, respectively), for which nearly complete nuclear genome sequences have been determined. Monocotyledonous plants, such as Triticum aestivum, Zea mays, Sorghum bicolor, and Hordeum vulgare, are also well represented, as well as the eudicotyledonous plants Glycine max, Medicago truncatula, Solanum tuberosum, Lycopersicon esculentum, and Vitis vinifera. For the moss Physcomitrella patens, more than 6,300 proteins are clustered into gene families, which can be explained by the recent exhaustive EST-sequencing efforts (Nishiyama et al., 2003
In addition to defining sensu stricto phylogenetic profiles at the species level, we also determined the overall presence of each gene family over distinct taxa of the Viridiplantae. The different taxa scored were, at lower taxonomic levels, Chlorophyta, Bryophyta, gymnosperms, and angiosperms, the latter being further subdivided into monocots and eudicots. At a higher taxonomic level, Eurosids I, Eurosids II, Rosids, Asterids, and Caryophyllales were discerned. Given the still very incomplete nature of most available plant gene sequences, these high-level phylogenetic profiles offer an alternative representation of the distribution of gene families within the green lineage (Fig. 2). Moreover, these alternative profiles provide a valuable tool for the extraction of information about the evolution of gene functions.
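The two kinds of profile can be sketched in a few lines: a species-level profile records which species contribute at least one member to a family, and a high-level profile collapses those species into taxa. The species-to-taxon table below is a small illustrative subset, and the function names and input layout are assumptions rather than the SPPG implementation.

```python
from collections import defaultdict

# Illustrative subset of the species in the data set (not the full list of 32).
SPECIES_TAXON = {
    "Arabidopsis thaliana": "eudicots",
    "Glycine max": "eudicots",
    "Oryza sativa": "monocots",
    "Zea mays": "monocots",
    "Physcomitrella patens": "Bryophyta",
    "Chlamydomonas reinhardtii": "Chlorophyta",
}


def species_profiles(assignments):
    """Build species-level profiles from (family ID, species) pairs."""
    profiles = defaultdict(set)
    for family, species in assignments:
        profiles[family].add(species)
    return dict(profiles)


def taxon_profiles(profiles, species_taxon=SPECIES_TAXON):
    """Collapse species-level profiles into taxon-level (high-level) profiles."""
    return {
        family: {species_taxon[s] for s in members}
        for family, members in profiles.items()
    }
```

Because a taxon counts as present as soon as any of its species has a member, the high-level profile is less sensitive to the incomplete sampling of individual species.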
Core Plant Genes, Species- and Lineage-Specific Gene Families, and Orphans
Examination of the high-level phylogenetic profiles revealed that a total of 397 gene families covering 53,796 proteins were present in Chlorophytes, Bryophytes, gymnosperms, and angiosperms. These conserved gene families thus represent a set of core genes found in all major divisions of the Viridiplantae. As expected, the functional classification of these gene families shows that they encode basic components of the plant cell machinery, such as genes involved in translation, ribosomal structure, posttranslational modifications, energy production, secretion, amino acid transport, and metabolism (see Supplemental Fig. 1). The number of core proteins in Arabidopsis identified here (4,177) is larger than the 1,152 Arabidopsis proteins conserved in all eukaryotes (Gutierrez et al., 2004
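Selecting core families from high-level profiles is a subset test: a family qualifies when its profile covers all four major divisions. The sketch below assumes taxon-level profiles stored as sets of taxon names, with monocots and eudicots folded into angiosperms; all names are illustrative.

```python
ANGIOSPERM_SUBTAXA = {"monocots", "eudicots"}
CORE_DIVISIONS = {"Chlorophyta", "Bryophyta", "gymnosperms", "angiosperms"}


def to_divisions(taxa):
    """Fold monocots and eudicots into the single division 'angiosperms'."""
    return {"angiosperms" if t in ANGIOSPERM_SUBTAXA else t for t in taxa}


def core_families(taxon_profile):
    """Return the sorted IDs of families present in all four major divisions.

    taxon_profile: dict mapping a family ID to a set of taxon names.
    """
    return sorted(
        family
        for family, taxa in taxon_profile.items()
        if CORE_DIVISIONS <= to_divisions(taxa)
    )
```

A family found only in monocots, for example, still counts toward angiosperms, but it needs members in the other three divisions as well to be reported as core.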
In contrast with the set of core genes, a large number of gene families are specific to one particular plant. Initially, 3,337 species-specific gene families (SSGFs) were identified when querying the profiles of all gene families. Because the general gene family delineation was performed with rather conservative criteria, less stringent protein clustering parameters were applied in order to determine the real number of SSGFs, lineage-specific gene families (LSGFs), and orphan genes (see "Materials and Methods"). In total, 1,116 SSGFs containing 5,180 proteins were detected, with the largest number in rice, Arabidopsis, and Physcomitrella, covering 637 (approximately 4,258 proteins), 187 (approximately 1,241 proteins), and 164 (approximately 408 proteins) gene families, respectively. The availability of a complete genome sequence for Arabidopsis and rice may be the reason for the larger number of SSGF proteins, whereas for Physcomitrella the absence of sequence data from closely related species in combination with the large number of available EST/cDNA sequences explains the high number of SSGF proteins. Approximately 82% of all SSGF proteins lack a functional annotation, which indicates that they play a role in unknown or poorly characterized biological processes. Although one might expect that LSGFs will be hard to detect in an incomplete and fragmented plant data set (Jabbari et al., 2004
To estimate the real number of orphan genes for a particular organism, we compared these proteins with the total data set by using less strict sequence similarity criteria than those used for the construction of the gene families. Still more than 14,000 orphan genes were detected, the largest number being found in rice and the lowest in Zinnia elegans (Table I). Interestingly, the number of expressed orphan genes is only 6,482 because almost one-half of all putative orphans are predicted genes of rice and Arabidopsis lacking proof of expression (no EST- or cDNA-supported gene model). P. patens seems to be the organism with the highest number of expressed orphan genes (2,053) in the full data set, which can be explained by its unique taxonomic position and current EST/cDNA sequencing status. Indeed, P. patens is the only moss representative in the data set and has a high number of ESTs yielding more than 10,000 different moss proteins. Overall, disregarding P. patens, the observed correlation between the number of initial EST sequences and the final number of orphan genes for all plant species is linear (r² = 0.83; y = 0.0011x + 25.482). Hence, within these plant species, the number of detected orphan genes increases by only about one per approximately 900 additional ESTs. In this respect, the 131 orphan genes for C. reinhardtii, which also lacks closely related species in this data set and has a high number of ESTs (>140,000), seems unexpectedly low. Most probably, the fact that only 26% of all C. reinhardtii EST clusters yielded a protein sequence of more than 50 amino acids compared to 79% for P. patens, for which overall longer cDNA sequences could be obtained, reduces the number of detectable Chlamydomonas orphan genes. The current sequencing and gene annotation of the Chlamydomonas genome will probably reveal additional information about the number of Chlorophyta-specific and orphan genes (Grossman et al., 2003
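The figure of roughly 900 ESTs per additional orphan follows directly from the slope of the reported fit, since one extra orphan requires 1/0.0011 additional ESTs. A minimal sketch of that arithmetic, using only the slope and intercept quoted in the text (the function names are illustrative):

```python
SLOPE = 0.0011      # orphan genes per EST, from the reported linear fit
INTERCEPT = 25.482  # intercept of the reported linear fit


def expected_orphans(n_ests):
    """Number of orphan genes predicted by the reported fit for n_ests ESTs."""
    return SLOPE * n_ests + INTERCEPT


def ests_per_new_orphan():
    """Additional ESTs needed, on average, to detect one more orphan gene."""
    return 1.0 / SLOPE
```

Here `ests_per_new_orphan()` evaluates to about 909, in line with the "approximately 900" quoted. As a rough illustration, taking 140,000 ESTs as a lower bound for C. reinhardtii gives `expected_orphans(140_000)` of about 179, above the 131 orphans observed.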
To determine specific gene-loss events in Arabidopsis and rice, we searched the phylogenetic profiles for conserved gene functions present in numerous eudicots and grasses but absent in Arabidopsis and rice, respectively. Subsequently, we used less stringent sequence similarity criteria (see "Materials and Methods") to validate whether a particular gene family indeed was absent in the full proteome of Arabidopsis or rice. We identified seven gene families that were present in five or more plant species, including related Eurosid II species, but were absent from Arabidopsis. A detailed search with protein sequences of related plants for the missing genes against the raw genomic Arabidopsis bacterial artificial chromosome (BAC) sequences yielded three loci with significant similarity (Table III; Supplemental Table II). This indicates that these loci may represent active genes missed by the current gene annotation efforts, whereas the absence of the other four gene families could point to gene loss in Arabidopsis. An alternative explanation is that these four gene functions do exist in Arabidopsis but are located in currently unsequenced chromosomal regions, such as centromeres (Yamada et al., 2003
Despite the high number of publicly available protein and EST sequences for monocots that are extremely valuable for extrinsic gene prediction approaches (Mathé et al., 2002
Comparing all conserved gene families between Arabidopsis and rice makes it possible to verify whether the larger number of genes in rice, as suggested in the past (Goff et al., 2002
Apart from analyzing the conserved gene families between Arabidopsis and rice, we also examined the distribution of gene families containing Arabidopsis or rice genes over a wider range of plant species using the high-level phylogenetic profiles (see above). Although 69% of the gene families in grasses are also present in eudicots, 3,006 gene families are unique to the grasses, of which 42% represent grass-specific families found in multiple cereals. These results correspond with previous estimates of putative monocot-specific genes using sugarcane (Saccharum officinarum) ESTs (Vincentz et al., 2004
Recent estimates show that approximately 43,000 plant protein sequences are known, which can be classified into approximately 4,053 gene families (Mohseni-Zadeh et al., 2004
Construction of the Data Set
The data set consists of two subsets, one including publicly available plant proteins and the other containing EST sequences. The protein data set covers data extracted from EMBL (Kulikova et al., 2004
EST sequences were transformed into EST clusters (also called unigene or tentative consensus) and a set of singleton ESTs with the EST clustering software developed by TIGR (Pertea et al., 2003
Next, putative ORFs were delineated for all EST clusters. For those EST clusters containing experimentally derived mRNAs, the corresponding coding sequence (CDS) information was retained. For all other sequences, the coding frame and putative CDS were determined with the FrameD software tool (Schiex et al., 2003
All translated coding sequences of the EST clusters and all sequences from the protein data set were used to construct gene families by applying sequence-based protein clustering (Li et al., 2001
GO gene associations for Arabidopsis proteins were retrieved from TIGR (ftp.tigr.org/pub/data/a_thaliana/ath1/DATA_RELEASE_SUPPLEMENT/) and remapped to the generic GO Slim classification scheme (ftp.geneontology.org/pub/go/GO_slims/goslim_generic.go) with the Perl script map2slim.pl (available at www.geneontology.org).
Throughout this analysis, we assumed that Arabidopsis and rice genes derived from the genome sequencing projects represented full-length proteins. Given the fact that the family delineation algorithm does not create family relationships between homologous proteins that vary extremely in length (i.e. that lack global homology), we believe that gene families including Arabidopsis and rice proteins will generally not contain clustered partial proteins. These full-length families represent the majority of all gene families (i.e. 68% of all 14,639 gene families). We obtained 4,341 gene families without Arabidopsis and/or rice homologs that might contain partial proteins (designated partial protein families [PPFs]). For each of the 14,369 gene families, a random gene representative was selected and compared with all other gene representatives. Subsequently, all significant similarities (BLASTP E-value <1e-15) between genes representing full-length families and PPFs were scored. Finally, we identified those PPFs that were significantly shorter than the homologous full-length family. We found 1,415 and 1,515 PPFs that were more than 50% and more than 30% shorter than the homologous full-length family, respectively. To reduce the chance of overpredicting the final number of gene families, we selected the 1,515 gene families that were at least 30% shorter than their full-length counterpart as gene families consisting of partial proteins. These families were discarded when the number of gene families in the different lineages was discussed (Fig. 2). Applying other E-value similarity and length-difference cutoffs yielded similar results (data not shown).
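The PPF filter described above can be sketched as a comparison of representative lengths across significant hits. This sketch assumes that the length of a family is the length of its randomly chosen representative, that the E-value cutoff is 1e-15, and that hits arrive as (PPF ID, full-length family ID, E-value) triples; all of these are illustrative assumptions, as are the function and parameter names.

```python
E_VALUE_CUTOFF = 1e-15  # assumed BLASTP significance cutoff
MIN_SHORTFALL = 0.30    # PPF must be more than 30% shorter


def flag_partial_families(hits, lengths, full_length_ids,
                          shortfall=MIN_SHORTFALL, e_cutoff=E_VALUE_CUTOFF):
    """Return the PPFs that are markedly shorter than a homologous full-length family.

    hits:            iterable of (ppf_id, full_id, e_value) triples
    lengths:         dict mapping a family ID to its representative length (aa)
    full_length_ids: set of family IDs treated as full length
    """
    flagged = set()
    for ppf_id, full_id, e_value in hits:
        if e_value >= e_cutoff:
            continue  # similarity not significant
        if full_id not in full_length_ids or ppf_id in full_length_ids:
            continue  # only PPF versus full-length comparisons count
        if lengths[ppf_id] < (1.0 - shortfall) * lengths[full_id]:
            flagged.add(ppf_id)
    return flagged
```

Raising `shortfall` to 0.50 corresponds to the stricter of the two thresholds reported and, as in the text, flags fewer families.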
All orphan proteins or proteins of gene families specific for one plant species or lineage were compared against the full set of proteins using less stringent criteria (BLASTP E-value <1e-05) compared to the criteria applied by the protein clustering algorithm for delineating gene families (see above). Proteins without non-self BLAST hits (i.e. only hitting themselves) were designated orphans, whereas only those genes uniquely matching proteins of the same species or lineage were retained as species or lineage specific, respectively.
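The classification rule in this paragraph can be written as a short decision function over the BLAST hits of one query protein. The sketch assumes an E-value cutoff of 1e-05 and hits given as (subject ID, E-value) pairs; the category label "shared" for proteins with hits outside their own group is an illustrative addition, and all names are hypothetical.

```python
def classify_protein(query, hits, group_of, e_cutoff=1e-5):
    """Classify one protein as 'orphan', 'specific', or 'shared'.

    query:    ID of the query protein
    hits:     iterable of (subject_id, e_value) pairs for this query
    group_of: dict mapping a protein ID to its species (or lineage)
    """
    partners = {s for s, e in hits if e < e_cutoff and s != query}
    if not partners:
        return "orphan"    # no significant non-self hit
    if all(group_of[s] == group_of[query] for s in partners):
        return "specific"  # hits only within the same species or lineage
    return "shared"        # at least one hit outside the group
```

Passing a protein-to-species map yields species-specific calls, whereas passing a protein-to-lineage map (for example, mapping every grass protein to one label) yields lineage-specific calls with the same code.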
Upon request, all novel materials described in this publication will be made available in a timely manner for noncommercial research purposes, subject to the requisite permission from any third-party owners of all or parts of the material. Obtaining any permission will be the responsibility of the requestor.
We thank M. De Cock for help with the manuscript and F. Dierick for technical assistance.
Received October 11, 2004; returned for revision November 10, 2004; accepted November 10, 2004.
1 This work was supported by the Instituut voor de aanmoediging van Innovatie door Wetenschap en Technologie in Vlaanderen (predoctoral fellowship to K.V.).
[w] The online version of this article contains Web-only data. www.plantphysiol.org/cgi/doi/10.1104/pp.104.054700.
* Corresponding author; e-mail yves.vandepeer@psb.ugent.be; fax 32-9-3313809.
Allen JE, Pertea M, Salzberg SL (2004) Computational gene prediction using multiple sources of evidence. Genome Res 14: 142-148
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25: 3389-3402
Arabidopsis Genome Initiative (2000) Analysis of the genome sequence of the flowering plant Arabidopsis thaliana. Nature 408: 796-815
Bennetzen JL, Coleman C, Liu R, Ma J, Ramakrishna W (2004) Consistent over-estimation of gene number in complex plant genomes. Curr Opin Plant Biol 7: 732-736
Dong Q, Schlueter SD, Brendel V (2004) PlantGDB, plant genome database and analysis tools. Nucleic Acids Res 32 (Database issue): D354-D359
Doyle JJ, Gaut BS (2000) Evolution of genes and taxa: a primer. Plant Mol Biol 42: 1-23
Durbin ML, McCaig B, Clegg MT (2000) Molecular evolution of the chalcone synthase multigene family in the morning glory genome. Plant Mol Biol 42: 79-92
Ermolaeva MD, Wu M, Eisen JA, Salzberg SL (2003) The age of the Arabidopsis thaliana genome duplication. Plant Mol Biol 51: 859-866
Feng Q, Zhang Y, Hao P, Wang S, Fu G, Huang Y, Li Y, Zhu J, Liu Y, Hu X, et al (2002) Sequence and analysis of rice chromosome 4. Nature 420: 316-320
Goff SA, Ricke D, Lan TH, Presting G, Wang R, Dunn M, Glazebrook J, Sessions A, Oeller P, Varma H, et al (2002) A draft sequence of the rice genome (Oryza sativa L. ssp. japonica). Science 296: 92-100
Griffiths S, Dunford RP, Coupland G, Laurie DA (2003) The evolution of CONSTANS-like gene families in barley, rice, and Arabidopsis. Plant Physiol 131: 1855-1867
Grossman AR, Harris EE, Hauser C, Lefebvre PA, Martinez D, Rokhsar D, Shrager J, Silflow CD, Stern D, Vallon O, et al (2003) Chlamydomonas reinhardtii at the crossroads of genomics. Eukaryot Cell 2: 1137-1150
Gutierrez RA, Green PJ, Keegstra K, Ohlrogge JB (2004) Phylogenetic profiling of the Arabidopsis thaliana proteome: What proteins distinguish plants from other organisms? Genome Biol 5: R53
Jabbari K, Cruveiller S, Clay O, Le Saux J, Bernardi G (2004) The new genes of rice: a closer look. Trends Plant Sci 9: 281-285
Kevei Z, Vinardell JM, Kiss GB, Kondorosi A, Kondorosi E (2002) Glycine-rich proteins encoded by a nodule-specific gene family are implicated in different stages of symbiotic nodule development in Medicago spp. Mol Plant Microbe Interact 15: 922-931
Kinoshita T, Fukuzawa H, Shimada T, Saito T, Matsuda Y (1992) Primary structure and expression of a gamete lytic enzyme in Chlamydomonas reinhardtii: similarity of functional domains to matrix metalloproteases. Proc Natl Acad Sci USA 89: 4693-4697
Koonin EV, Fedorova ND, Jackson JD, Jacobs AR, Krylov DM, Makarova KS, Mazumder R, Mekhedov SL, Nikolskaya AN, Rao BS, et al (2004) A comprehensive evolutionary classification of proteins encoded in complete eukaryotic genomes. Genome Biol 5: R7
Kriventseva EV, Biswas M, Apweiler R (2001) Clustering and analysis of protein families. Curr Opin Struct Biol 11: 334-339
Kulikova T, Aldebert P, Althorpe N, Baker W, Bates K, Browne P, van den Broek A, Cochrane G, Duggan K, Eberhardt R, et al (2004) The EMBL Nucleotide Sequence Database. Nucleic Acids Res 32 (Database issue): D27-D30
Li WH, Gu Z, Wang H, Nekrutenko A (2001) Evolutionary analyses of the human genome. Nature 409: 847-849
Martin W (2003) Gene transfer from organelles to the nucleus: frequent and in big chunks. Proc Natl Acad Sci USA 100: 8612-8614
Mathé C, Sagot MF, Schiex T, Rouzé P (2002) Current methods of gene prediction, their strengths and weaknesses. Nucleic Acids Res 30: 4103-4117
Mergaert P, Nikovics K, Kelemen Z, Maunoury N, Vaubert D, Kondorosi A, Kondorosi E (2003) A novel family in Medicago truncatula consisting of more than 300 nodule-specific genes coding for small, secreted polypeptides with conserved cysteine motifs. Plant Physiol 132: 161-173
Mohseni-Zadeh S, Louis A, Brezellec P, Risler JL (2004) PHYTOPROT: a database of clusters of plant proteins. Nucleic Acids Res 32 (Database issue): D351-D353
Mounsey A, Bauer P, Hope IA (2002) Evidence suggesting that a fifth of annotated Caenorhabditis elegans genes may be pseudogenes. Genome Res 12: 770-775
Nagaki K, Cheng Z, Ouyang S, Talbert PB, Kim M, Jones KM, Henikoff S, Buell CR, Jiang J (2004) Sequencing of a rice centromere uncovers active genes. Nat Genet 36: 138-145
Nishiyama T, Fujita T, Shin IT, Seki M, Nishide H, Uchiyama I, Kamiya A, Carninci P, Hayashizaki Y, Shinozaki K, et al (2003) Comparative genomics of Physcomitrella patens gametophytic transcriptome and Arabidopsis thaliana: implication for land plant evolution. Proc Natl Acad Sci USA 100: 8007-8012
Parkinson J, Guiliano DB, Blaxter M (2002) Making sense of EST sequences by CLOBBing them. BMC Bioinformatics 3: 31
Pertea G, Huang X, Liang F, Antonescu V, Sultana R, Karamycheva S, Lee Y, White J, Cheung F, Parvizi B, et al (2003) TIGR Gene Indices clustering tools (TGICL): a software system for fast clustering of large EST datasets. Bioinformatics 19: 651-652
Pryer KM, Schneider H, Zimmer EA, Ann Banks J (2002) Deciding among green plants for whole genome studies. Trends Plant Sci 7: 550-554
Quackenbush J, Cho J, Lee D, Liang F, Holt I, Karamycheva S, Parvizi B, Pertea G, Sultana R, White J (2001) The TIGR Gene Indices: analysis of gene transcript sequences in highly sampled eukaryotic species. Nucleic Acids Res 29: 159-164
Raes J, Vandepoele K, Simillion C, Saeys Y, Van de Peer Y (2003) Investigating ancient duplication events in the Arabidopsis genome. J Struct Funct Genomics 3: 117-129
Rice Chromosome 10 Sequencing Consortium (2003) In-depth view of structure, activity, and evolution of rice chromosome 10. Science 300: 1566-1569
Rost B (1999) Twilight zone of protein sequence alignments. Protein Eng 12: 85-94
Rouzé P, Pavy N, Rombauts S (1999) Genome annotation: which tools do we have for it? Curr Opin Plant Biol 2: 90-95
Rudd S (2003) Expressed sequence tags: alternative or complement to whole genome sequences? Trends Plant Sci 8: 321-329
Sasaki T, Matsumoto T, Yamamoto K, Sakata K, Baba T, Katayose Y, Wu J, Niimura Y, Cheng Z, Nagamura Y, et al (2002) The genome sequence and structure of rice chromosome 1. Nature 420: 312-316
Schiex T, Gouzy J, Moisan A, de Oliveira Y (2003) FrameD: a flexible program for quality check and gene prediction in prokaryotic genomes and noisy matured eukaryotic sequences. Nucleic Acids Res 31: 3738-3741
Shahmuradov IA, Akbarova YY, Solovyev VV, Aliyev JA (2003) Abundance of plastid DNA insertions in nuclear genomes of rice and Arabidopsis. Plant Mol Biol 52: 923-934
Shewry PR, Halford NG (2002) Cereal seed storage proteins: structures, properties and role in grain utilization. J Exp Bot 53: 947-958
Shiu SH, Karlowski WM, Pan R, Tzeng YH, Mayer KF, Li WH (2004) Comparative analysis of the receptor-like kinase family in Arabidopsis and rice. Plant Cell 16: 1220-1234
Soltis DE, Soltis PS (2003) The role of phylogenetics in comparative genetics. Plant Physiol 132: 1790-1800
Tatusov RL, Fedorova ND, Jackson JD, Jacobs AR, Kiryutin B, Koonin EV, Krylov DM, Mazumder R, Mekhedov SL, Nikolskaya AN, et al (2003) The COG database: an updated version includes eukaryotes. BMC Bioinformatics 4: 41
Timmis JN, Ayliffe MA, Huang CY, Martin W (2004) Endosymbiotic gene transfer: organelle genomes forge eukaryotic chromosomes. Nat Rev Genet 5: 123-135
Torrents D, Suyama M, Zdobnov E, Bork P (2003) A genome-wide survey of human pseudogenes. Genome Res 13: 2559-2567
Vandepoele K, Simillion C, Van de Peer Y (2003) Evidence that rice and other cereals are ancient aneuploids. Plant Cell 15: 2192-2202
Vincentz M, Cara FA, Okura VK, da Silva FR, Pedrosa GL, Hemerly AS, Capella AN, Marins M, Ferreira PC, Franca SC, et al (2004) Evaluation of monocot and eudicot divergence using the sugarcane transcriptome. Plant Physiol 134: 951-959
Wortman JR, Haas BJ, Hannick LI, Smith RK Jr, Maiti R, Ronning CM, Chan AP, Yu C, Ayele M, Whitelaw CA, et al (2003) Annotation of the Arabidopsis genome. Plant Physiol 132: 461-468
Yamada K, Lim J, Dale JM, Chen H, Shinn P, Palm CJ, Southwick AM, Wu HC, Kim C, Nguyen M, et al (2003) Empirical analysis of transcriptional activity in the Arabidopsis genome. Science 302: 842-846
Yang J, Lusk R, Li WH (2003) Organismal complexity, protein complexity, and gene duplicability. Proc Natl Acad Sci USA 100: 15661-15665
Yu J, Hu S, Wang J, Wong GK, Li S, Liu B, Deng Y, Dai L, Zhou Y, Zhang X, et al (2002) A draft sequence of the rice genome (Oryza sativa L. ssp. indica). Science 296: 79-92
Yuan Q, Ouyang S, Liu J, Suh B, Cheung F, Sultana R, Lee D, Quackenbush J, Buell CR (2003) The TIGR rice genome annotation resource: annotating the rice genome and creating resources for plant biologists. Nucleic Acids Res 31: 229-233
Zhou T, Wang Y, Chen JQ, Araki H, Jing Z, Jiang K, Shen J, Tian D (2004) Genome-wide identification of NBS genes in japonica rice reveals significant expansion of divergent non-TIR NBS-LRR genes. Mol Genet Genomics 271: 402-415