Overview
Phytozome is a joint project of the Department of Energy's Joint Genome Institute and the Center for Integrative Genomics to facilitate comparative genomic studies amongst green plants. Clusters of orthologous and paralogous genes that represent the modern descendants of ancestral gene sets are constructed at key phylogenetic nodes. These clusters allow easy access to clade-specific orthology/paralogy relationships as well as clade-specific genes and gene expansions. As of release v8.0, Phytozome provides access to thirty-one sequenced and annotated green plant genomes, which have been clustered into gene families at ten evolutionarily significant nodes. Where possible, each gene has been annotated with PFAM, KOG, KEGG, and PANTHER assignments, and publicly available annotations from RefSeq, UniProt, TAIR, and JGI are hyperlinked and searchable.
Included Organisms
The proteomes of the following organisms are clustered in release v8.0 of Phytozome:
Nodes
Clustering is used to group extant genes into sets representing the ancestral genes that existed just prior to various significant evolutionary events (nodes). Extant genes have been clustered at nodes representing the following speciation events:
| Viridiplantae (~475 Mya): | Genes representing the most recent common ancestor of Embryophytes and chlorophytes. |
| Chlorophyte (~50-200 Mya): | Genes representing the most recent common ancestor of Chlamydomonas and Volvox. |
| Embryophyte (~450 Mya): | Genes representing the most recent common ancestor of Tracheophytes and Bryophyta (represented by Physcomitrella). |
| Tracheophyte (~420 Mya): | Genes representing the most recent common ancestor of Selaginella and the angiosperms. |
| Angiosperm (~160 Mya): | Genes representing the most recent common ancestor of grasses and eudicots. |
| Core eudicot (~115 Mya): | Genes representing the most recent common ancestor of the rosids and asterids (here represented by Mimulus). |
| Rosid (~1 Mya): | Genes representing the most recent common ancestor of Grape, the Fabidae and the Brassicales. |
| Fabid (~107 Mya): | Genes representing the most recent common ancestor of the nitrogen-fixing and Malpighiales clades. |
| Nitrogen-fixing(~90 Mya): | Genes representing the most recent common ancestor of Fabales and Rosales. |
| Grass (~70 Mya): | Genes representing the most recent common ancestor of Sorghum, Maize, foxtail millet and the BEP clade. |
| BEP (~50 Mya): | Genes representing the most recent common ancestor of Rice and Brachypodium. |
Clustering Methodology
The protein translations of the longest transcript at each locus for all 31 genomes were subjected to an all-versus-all BLASTP alignment, and all pairs with hits more significant than an E-value of 1e-03 subsequently underwent Needleman-Wunsch alignment. This global pairwise alignment was used to compute the evolutionary distance between the two peptides using a JTT model of protein evolution.
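The global alignment step can be illustrated with a minimal sketch. It is not the Phytozome pipeline: it assumes unit match/mismatch/gap scores instead of a protein substitution matrix, and it reports a simple p-distance (fraction of differing aligned residues) as a stand-in for the JTT-model distance, which needs a full substitution-rate matrix. The function names `needleman_wunsch` and `p_distance` are hypothetical.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align two sequences; return the two gapped strings.

    Scores are illustrative assumptions, not the values used by Phytozome.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        return match if a[i - 1] == b[j - 1] else mismatch

    # Fill the dynamic-programming matrix.
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),  # align a[i-1] with b[j-1]
                score[i - 1][j] + gap,            # gap in b
                score[i][j - 1] + gap,            # gap in a
            )

    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def p_distance(aln_a, aln_b):
    """Fraction of differing residues over gap-free columns (stand-in for JTT)."""
    cols = [(x, y) for x, y in zip(aln_a, aln_b) if x != "-" and y != "-"]
    if not cols:
        return 1.0
    return sum(x != y for x, y in cols) / len(cols)
```

A model-based distance such as JTT would replace `p_distance` with a maximum-likelihood or corrected estimate computed from the same aligned columns.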
Founding Gene Family Construction
The data are first classified into core and non-core genomes. A core genome is one that has a high-quality assembly and annotation with a relatively complete gene set, and/or occupies a phylogenetically essential location for gene family construction. For Phytozome version 8, the core genomes are Populus trichocarpa (poplar), Glycine max (soybean), Prunus persica (peach), Arabidopsis thaliana, Vitis vinifera (grapevine), Mimulus guttatus (monkey flower), Sorghum bicolor, Oryza sativa, Brachypodium distachyon, Selaginella moellendorffii, Physcomitrella patens, Volvox carteri and Chlamydomonas reinhardtii. The initial gene families are constructed predominantly using peptides from the core genomes, as described below.
Beginning at the crown (single extant species) nodes, gene families are built up hierarchically, progressing towards the root (Viridiplantae) node. A "backbone" of mutual best hit (MBH) pairwise scores between in-group (species present at that node) and out-group (species not present at that node) peptides is constructed and ordered by increasing evolutionary distance. Starting with the closest in-group backbone peptide, X, other in-group peptides that are not part of the backbone are added to X if their distance from X is less than X's backbone out-group MBH distance (paralog accumulation step). Note that, at a given node, all paralog accumulation is built around in-group peptides with an MBH to an out-group peptide. Proceeding up from the crown nodes, the gene families constructed at the previous step are merged if they have an in-group MBH between them. A new backbone is constructed on these merged families, and paralog accumulation is again performed. This process is repeated at each node until the root (Viridiplantae) node.
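The backbone and paralog accumulation steps at a single node can be sketched as follows. This is a simplified illustration under stated assumptions: the in-group and out-group are treated as flat sets of peptides rather than per-species sets, distances are supplied in a dictionary keyed by unordered peptide pairs, and a peptide that has joined one family is removed from further consideration (an assumption made here so that each gene lands in only one family). All names are hypothetical.

```python
from math import inf


def best_hits(dist, queries, targets):
    """For each query peptide, return its closest target as (target, distance)."""
    best = {}
    for q in queries:
        cands = [(dist[frozenset((q, t))], t)
                 for t in targets if frozenset((q, t)) in dist]
        if cands:
            d, t = min(cands)
            best[q] = (t, d)
    return best


def mutual_best_hits(dist, ingroup, outgroup):
    """Backbone: (distance, in-group peptide, out-group peptide), closest first."""
    fwd = best_hits(dist, ingroup, outgroup)
    rev = best_hits(dist, outgroup, ingroup)
    pairs = [(d, i, o) for i, (o, d) in fwd.items()
             if rev.get(o, (None, inf))[0] == i]
    return sorted(pairs)


def accumulate_paralogs(dist, ingroup, outgroup):
    """Grow a family around each backbone peptide X.

    A non-backbone in-group peptide joins X if it is closer to X than
    X is to its out-group mutual best hit.
    """
    backbone = mutual_best_hits(dist, ingroup, outgroup)
    anchors = {i for _, i, _ in backbone}
    unassigned = set(ingroup) - anchors
    families = {}
    for d_out, anchor, _ in backbone:
        members = {p for p in unassigned
                   if dist.get(frozenset((anchor, p)), inf) < d_out}
        families[anchor] = {anchor} | members
        unassigned -= members  # assumption: one family per peptide
    return families
```

For example, with in-group peptides `a1`, `a2`, `b1` and out-group peptides `o1`, `o2`, where `a1`/`o1` and `b1`/`o2` are mutual best hits, `a2` joins the family of `a1` only if its distance to `a1` is smaller than the `a1`/`o1` distance.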
Because the Viridiplantae lack good outgroup candidates (ones that are not too phylogenetically distant), paralog accumulation is not attempted at the root node, where it can lead to the merger of large, weakly similar families into giant superfamilies. Only in-group MBH merging is performed there.
The end result is a set of founding gene families defined across a series of evolutionary nodes with the following properties:
- every founding gene is a member of one and only one gene family at a given node
- founding genes in the same family at a given node are also in the same family at all nodes ancestral to this one
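These two properties amount to saying that the families at each node form a partition of the founding genes, and that the partition at a node is nested inside the partition at every ancestral node. A small sketch of how they could be checked, assuming each node's families are held as a dictionary from a (hypothetical) family identifier to a set of gene identifiers:

```python
def is_partition(families):
    """True if no gene appears in more than one family at this node."""
    seen = set()
    for members in families.values():
        if seen & members:
            return False
        seen |= members
    return True


def is_nested(child, parent):
    """True if every family at the child node lies within one parent family."""
    return all(
        any(members <= parent_members for parent_members in parent.values())
        for members in child.values()
    )


# Illustrative, made-up identifiers.
rosid = {"R1": {"g1", "g2"}, "R2": {"g3"}}
core_eudicot = {"E1": {"g1", "g2", "g3", "g4"}}
ok = is_partition(rosid) and is_nested(rosid, core_eudicot)
```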
Members of non-core genomes can be added to a core gene family if they have mutual best hits (MBH) to this family that are unique (hit only one family per node) and consistent (hit this family and all its descendant and ancestor families). For a given family, the set of core genome members, plus the non-core members that pledge consistently via MBH, are called "founding members" of the family. Founding members of a gene family are indicated by the letter F on the Gene Family page.
Multiple sequence alignments (MSA), consensus sequences and hidden Markov model (HMM) profiles are constructed for each founding gene family.
Pledging of non-founding genome members to Founding Gene Families
Remaining non-core genes are added to existing gene families by two methods:
- Inconsistent and/or non-unique MBH to founding members of a family, indicated by the letter M.
- HMMsearch of the gene against the family profile results in an evalue < 1.0e-5 and a coverage > 45%, indicated by the symbols Pc or Pi.
Pc (Pledge consistent) means the gene pledges consistently and uniquely to this family (hits to other families with an evalue more than 1e20 times larger than the best evalue are ignored). Pi (Pledge inconsistent) means it does not. For genes marked as Pi or M on a Gene Family page, the consistency or uniqueness conditions are violated, so such genes may be found in more than one family at a given node.
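The HMM-based pledging rule can be sketched for a single node as below. The thresholds come from the text above; everything else is an assumption for illustration: the `HmmHit` record and `classify_pledge` function are hypothetical, coverage is expressed as a fraction, and only uniqueness at one node is tested (the cross-node consistency check against ancestor and descendant families is omitted).

```python
from dataclasses import dataclass


@dataclass
class HmmHit:
    family: str      # family whose profile was hit
    evalue: float    # evalue reported by the profile search
    coverage: float  # covered fraction, 0.0 to 1.0


EVALUE_MAX = 1.0e-5    # hit must have evalue below this
COVERAGE_MIN = 0.45    # hit must have coverage above this
IGNORE_RATIO = 1e20    # hits this much weaker than the best are ignored


def classify_pledge(hits):
    """Map each family the gene pledges to onto 'Pc' or 'Pi'.

    Returns an empty dict if no hit passes the thresholds.
    """
    passing = [h for h in hits
               if h.evalue < EVALUE_MAX and h.coverage > COVERAGE_MIN]
    if not passing:
        return {}
    best = min(h.evalue for h in passing)
    # If the best evalue underflows to 0.0, only other 0.0 hits are kept.
    kept = [h for h in passing if h.evalue <= best * IGNORE_RATIO]
    families = {h.family for h in kept}
    label = "Pc" if len(families) == 1 else "Pi"
    return {family: label for family in families}
```

Under this sketch, a gene with hits of 1e-80 to one family and 1e-50 to another pledges uniquely to the first, because the second hit is more than 1e20 times weaker and is ignored.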
Citing Phytozome
| David M. Goodstein, Shengqiang Shu, Russell Howson, Rochak Neupane, Richard D. Hayes, Joni Fazo, Therese Mitros, William Dirks, Uffe Hellsten, Nicholas Putnam, and Daniel S. Rokhsar, Phytozome: a comparative platform for green plant genomics, Nucleic Acids Res. 2012 40 (D1): D1178-D1186 |
Phytozome Team
| Software: | Joni Fazo, David M. Goodstein, Richard D. Hayes, Shengqiang Shu |
| Analysis: | Uffe Hellsten, Therese Mitros, Raf Podowski, Simon Prochnik, Dan Rokhsar |