Overview

Phytozome is a joint project of the Department of Energy's Joint Genome Institute and the Center for Integrative Genomics to facilitate comparative genomic studies among green plants. Clusters of orthologous and paralogous genes that represent the modern descendants of ancestral gene sets are constructed at key phylogenetic nodes. These clusters allow easy access to clade-specific orthology/paralogy relationships as well as clade-specific genes and gene expansions. As of release v8.0, Phytozome provides access to thirty-one sequenced and annotated green plant genomes which have been clustered into gene families at ten evolutionarily significant nodes. Where possible, each gene has been annotated with PFAM, KOG, KEGG, and PANTHER assignments, and publicly available annotations from RefSeq, UniProt, TAIR, and JGI are hyperlinked and searchable.

Included Organisms

The proteomes of the following organisms are clustered in release v8.0 of Phytozome:

Organism | Common name | Source
Aquilegia coerulea | Colorado blue columbine | JGI 8X assembly v1.0, annotation v1.1
Arabidopsis lyrata | Lyre-leaved rock cress | JGI release v1.0
Arabidopsis thaliana | Thale cress | TAIR release 10 acquired from TAIR
Brachypodium distachyon | Purple false brome | JGI 8x assembly release v1.0 of strain Bd21 with JGI/MIPS PASA annotation v1.2
Brassica rapa | Napa cabbage | Annotation v1.2 on assembly v1.1 from brassicadb.org
Capsella rubella | Red shepherd's purse | JGI annotation v1.0 on assembly v1
Carica papaya | Papaya | ASGPB release of 2007
Chlamydomonas reinhardtii | Green algae | Augustus update 10.2 (u10.2) annotation of JGI assembly v4
Citrus clementina | Clementine | JGI v0.9 assembly and annotation
Citrus sinensis | Sweet orange | JGI v1.1 annotation on v1 assembly
Cucumis sativus | Cucumber | Roche 454-XLR assembly and JGI v1.0 annotation
Eucalyptus grandis | Eucalyptus | JGI assembly v1.0, annotation v1.1
Glycine max | Soybean | JGI Glyma1.0 annotation of the chromosome-based Glyma1 assembly
Linum usitatissimum | Flax | BGI v1.0 on assembly v1.0
Malus domestica | Apple | GDR prediction v1.0 on Malus x domestica assembly v1.0
Manihot esculenta | Cassava | Assembly version 4, JGI annotation v4.1
Medicago truncatula | Barrel medic | Release Mt3.0 from the Medicago Genome Sequence Consortium
Mimulus guttatus | Monkey flower | JGI 7x assembly release v1.0 of strain IM62, annotation v1.0
Oryza sativa | Rice | MSU Release 7.0 of the Rice Genome Annotation
Phaseolus vulgaris | Common bean | JGI annotation v0.91 on assembly v0.9 using published ESTs and JGI RNAseq
Physcomitrella patens | Moss | JGI assembly release v1.1 and COSMOSS annotation v1.6
Populus trichocarpa | Poplar | JGI assembly release v2.0, annotation v2.2
Prunus persica | Peach | JGI release v1.0
Ricinus communis | Castor bean | TIGR release 0.1
Selaginella moellendorffii | Spikemoss | JGI v1.0 assembly and annotation
Setaria italica | Foxtail millet | JGI 8.3X chromosome-scale assembly release 2.0, annotation version 2.1
Sorghum bicolor | Sweet sorghum | Sbi1.4 models from MIPS/PASA on v1.0 assembly
Thellungiella halophila | Salt cress | JGI annotation v1.0 on assembly v1
Vitis vinifera | Grape | March 2010 12X assembly and annotation from Genoscope
Volvox carteri | Volvox | JGI annotation 2.0 on assembly v2
Zea mays | Maize | 5b.60 annotation (filtered set) of the maize "B73" genome v2 produced by the Maize Genome Project

Nodes

Clustering is used to group extant genes into sets representing the ancestral genes that existed just prior to various significant evolutionary events (nodes). Extant genes have been clustered at nodes representing the following speciation events:

Viridiplantae (~475 Mya): Genes representing the most recent common ancestor of embryophytes and chlorophytes.
Chlorophyte (~50-200 Mya): Genes representing the most recent common ancestor of Chlamydomonas and Volvox.
Embryophyte (~450 Mya): Genes representing the most recent common ancestor of tracheophytes and Bryophyta (represented by Physcomitrella).
Tracheophyte (~420 Mya): Genes representing the most recent common ancestor of Selaginella and the angiosperms.
Angiosperm (~160 Mya): Genes representing the most recent common ancestor of grasses and eudicots.
Core eudicot (~115 Mya): Genes representing the most recent common ancestor of the rosids and asterids (here represented by Mimulus).
Rosid (~1 Mya): Genes representing the most recent common ancestor of grape, the Fabidae and the Brassicales.
Fabid (~107 Mya): Genes representing the most recent common ancestor of the nitrogen-fixing and Malpighiales clades.
Nitrogen-fixing (~90 Mya): Genes representing the most recent common ancestor of Fabales and Rosales.
Grass (~70 Mya): Genes representing the most recent common ancestor of sorghum, maize, foxtail millet and the BEP clade.
BEP (~50 Mya): Genes representing the most recent common ancestor of rice and Brachypodium.

Clustering Methodology


The protein translations of the longest transcript at each locus for all 31 genomes were subjected to an all-versus-all BLASTP alignment, and all pairs with hits more significant than 1e-03 subsequently underwent Needleman-Wunsch alignment. This global pairwise alignment was used to compute the evolutionary distance between the two peptides using a JTT model of protein evolution.
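The filter-then-align step can be illustrated with a minimal Python sketch. It is not the Phytozome pipeline: the function names, the match/mismatch/gap scores and the toy peptides are assumptions made for illustration, and a simple p-distance (fraction of mismatched aligned residues) stands in for the JTT-based distance, which is not implemented here.

```python
def significant_pairs(hits, max_evalue=1e-3):
    """Keep (query, subject) pairs whose BLASTP E-value is below the cutoff.

    `hits` is a list of (query, subject, evalue) tuples (hypothetical format).
    """
    return [(q, s) for q, s, e in hits if e < max_evalue]


def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Return one optimal global alignment of a and b as two gapped strings.

    Uses a flat match/mismatch/linear-gap scheme (illustrative values only).
    """
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover the alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if (i > 0 and j > 0 and a[i - 1] == b[j - 1]) else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def p_distance(aln_a, aln_b):
    """Fraction of mismatches over gap-free columns (stand-in for a JTT distance)."""
    cols = [(x, y) for x, y in zip(aln_a, aln_b) if x != "-" and y != "-"]
    if not cols:
        return 1.0
    return sum(x != y for x, y in cols) / len(cols)


hits = [("pepA", "pepB", 1e-10), ("pepA", "pepC", 0.5)]
pairs = significant_pairs(hits)              # only pepA/pepB passes the cutoff
aln = needleman_wunsch("MKTAY", "MKAY")      # toy peptide sequences
dist = p_distance(*aln)
```

A model-based distance such as JTT additionally corrects for multiple substitutions at the same site, so the p-distance here should be read only as a placeholder for that step.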

Founding Gene Family Construction

The genomes are first classified as core or non-core. A core genome either has a high-quality assembly and annotation with a relatively complete gene set, or occupies a phylogenetically essential position for gene family construction, or both. For Phytozome version 8, the core genomes are Populus trichocarpa (poplar), Glycine max (soybean), Prunus persica (peach), Arabidopsis thaliana, Vitis vinifera (grapevine), Mimulus guttatus (monkey flower), Sorghum bicolor, Oryza sativa, Brachypodium distachyon, Selaginella moellendorffii, Physcomitrella patens, Volvox carteri and Chlamydomonas reinhardtii. The initial gene families are constructed predominantly using peptides from the core genomes, as described below.

Beginning at the crown (single extant species) nodes, gene families are built up hierarchically, progressing towards the root (Viridiplantae) node. A "backbone" of mutual best hit (MBH) pairwise scores between in-group (species present at that node) and out-group (species not present at that node) peptides is constructed and ordered by increasing evolutionary distance. Starting with the closest in-group backbone peptide, X, other in-group peptides that are not part of the backbone are added to X if their distance from X is less than X's backbone out-group MBH distance (the paralog accumulation step). Note that, at a given node, all paralog accumulation is built around in-group peptides with an MBH to an out-group peptide. Proceeding up from the crown nodes, the gene families constructed at the previous step are merged if they have an in-group MBH between them. A new backbone is constructed on these merged families, and paralog accumulation is again performed. This process is repeated at each node until the root (Viridiplantae) node is reached. Because there are no good (not too phylogenetically distant) out-group candidates for the Viridiplantae, paralog accumulation is not attempted at the root node, where it can lead to the merger of large, weakly similar families into giant superfamilies; only in-group MBH merging is performed there.
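The backbone and paralog accumulation steps at a single node can be sketched as follows. This is an illustrative reading of the description above, not the production code: the function names, the distance dictionary keyed by peptide pairs, and the toy peptides are all hypothetical, and the sketch assumes each non-backbone peptide joins at most one family (the first backbone peptide that accepts it). Merging of families between nodes is not shown.

```python
import math


def distance(dist, a, b):
    """Look up a symmetric pairwise distance; missing pairs count as infinite."""
    return dist.get(frozenset((a, b)), math.inf)


def build_backbone(in_group, out_group, dist):
    """Return mutual best hits (in_pep, out_pep, d), sorted by increasing distance."""
    backbone = []
    for x in in_group:
        best_out = min(out_group, key=lambda o: distance(dist, x, o))
        d = distance(dist, x, best_out)
        if math.isinf(d):
            continue  # x has no hit to any out-group peptide
        best_in = min(in_group, key=lambda i: distance(dist, i, best_out))
        if best_in == x:  # the best hit is reciprocal
            backbone.append((x, best_out, d))
    return sorted(backbone, key=lambda t: t[2])


def accumulate_paralogs(backbone, in_group, dist):
    """Add non-backbone in-group peptides to X when they are closer to X
    than X is to its out-group mutual best hit."""
    assigned = {x for x, _, _ in backbone}
    families = {}
    for x, _, d_out in backbone:  # closest backbone pair first
        members = [x]
        for p in in_group:
            if p not in assigned and distance(dist, x, p) < d_out:
                members.append(p)
                assigned.add(p)  # assumption: one family per peptide
        families[x] = members
    return families


# Toy data: three in-group peptides, two out-group peptides.
pairs = {
    ("a1", "o1"): 0.2, ("a1", "o2"): 0.9,
    ("a2", "o1"): 0.8, ("a2", "o2"): 0.3,
    ("a3", "o1"): 0.5, ("a3", "o2"): 0.6,
    ("a1", "a3"): 0.1, ("a2", "a3"): 0.4, ("a1", "a2"): 0.7,
}
dist = {frozenset(k): v for k, v in pairs.items()}
in_group, out_group = ["a1", "a2", "a3"], ["o1", "o2"]

backbone = build_backbone(in_group, out_group, dist)
families = accumulate_paralogs(backbone, in_group, dist)
```

In the toy data, a3 is closer to a1 (0.1) than a1 is to its out-group partner o1 (0.2), so a3 is accumulated into a1's family as a paralog, while a2 remains alone.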

The end result is a set of founding gene families defined across a series of evolutionary nodes with the following properties:


Members of non-core genomes can be added to a core gene family if they have mutual best hits (MBH) to this family that are unique (they hit only one family per node) and consistent (they hit this family and all its descendant and ancestor families). For a given family, the set of core genome members, plus the non-core members that pledge consistently via MBH, are called "founding members" of the family. Founding members of a gene family are indicated by the letter F on the Gene Family page.
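One way to read the uniqueness and consistency conditions is sketched below. The data layout is hypothetical: `hits_by_node` maps each node to the set of families the gene has an MBH to, `node_order` lists nodes from the most recent to the most ancient, and `parent` maps each family to its ancestor family at the next node. The sketch also assumes the gene must have a hit at every listed node, which the text above does not state.

```python
def is_unique_and_consistent(hits_by_node, node_order, parent):
    """Return True if the gene hits exactly one family at every node and
    those families form a single descendant-to-ancestor chain."""
    chain = []
    for node in node_order:
        families = hits_by_node.get(node, set())
        if len(families) != 1:  # uniqueness violated (or no hit at this node)
            return False
        chain.append(next(iter(families)))
    # Consistency: each family's ancestor must be the family hit at the next node.
    return all(parent.get(child) == ancestor
               for child, ancestor in zip(chain, chain[1:]))


# Hypothetical family hierarchy across three nodes.
node_order = ["BEP", "Grass", "Angiosperm"]
parent = {"bep1": "grass1", "grass1": "angio1",
          "bep2": "grass2", "grass2": "angio1"}

good = {"BEP": {"bep1"}, "Grass": {"grass1"}, "Angiosperm": {"angio1"}}
inconsistent = {"BEP": {"bep1"}, "Grass": {"grass2"}, "Angiosperm": {"angio1"}}
not_unique = {"BEP": {"bep1"}, "Grass": {"grass1", "grass2"}, "Angiosperm": {"angio1"}}
```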

Multiple sequence alignments (MSA), consensus sequences and hidden Markov model (HMM) profiles are constructed for each founding gene family.

Pledging of non-founding genome members to Founding Gene Families

Remaining non-core genes are added to existing gene families by two methods:

Pc (Pledge consistent) means the gene pledges consistently and uniquely to this family (hits to other families with an E-value more than 1e20 times larger than the best E-value are ignored). Pi (Pledge inconsistent) means it does not. Note that for genes marked as Pi or M on a Gene Family page, the consistency or uniqueness conditions are violated. Such genes may therefore be found in more than one family at a given node.
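The E-value ratio rule can be illustrated with a short Python sketch. It covers only the filtering of weak hits and the uniqueness check at a single node; the consistency check across nodes is not included. The function name, the (family, E-value) hit format and the example values are assumptions.

```python
def families_after_filter(hits, ratio=1e20):
    """Return the families still in play after ignoring weak hits.

    `hits` is a list of (family_id, evalue) tuples for one gene at one node.
    Hits whose E-value is more than `ratio` times the best E-value are ignored.
    If the best E-value is exactly 0.0, only hits with E-value 0.0 survive.
    """
    if not hits:
        return set()
    best = min(evalue for _, evalue in hits)
    return {family for family, evalue in hits if evalue <= best * ratio}


# The F2 hit is 1e30 times weaker than the best hit, so it is ignored.
unique_case = families_after_filter([("F1", 1e-80), ("F2", 1e-50), ("F1", 1e-70)])

# Here F2 is only 1e15 times weaker, so both families remain and the
# gene does not pledge uniquely at this node.
ambiguous_case = families_after_filter([("F1", 1e-80), ("F2", 1e-65)])
```

A gene whose filtered set contains a single family satisfies the uniqueness condition at that node; a set with several families corresponds to the situation in which a gene may appear in more than one family.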

Citing Phytozome


David M. Goodstein, Shengqiang Shu, Russell Howson, Rochak Neupane, Richard D. Hayes, Joni Fazo, Therese Mitros, William Dirks, Uffe Hellsten, Nicholas Putnam, and Daniel S. Rokhsar, Phytozome: a comparative platform for green plant genomics, Nucleic Acids Res. 2012 40 (D1): D1178-D1186

Phytozome Team


Software: Joni Fazo, David M. Goodstein, Richard D. Hayes, Shengqiang Shu
Analysis: Uffe Hellsten, Therese Mitros, Raf Podowski, Simon Prochnik, Dan Rokhsar
  ©2011 University of California Regents. All rights reserved