Orthologs, paralogs and homology inference: Where are we now?

Motivation for Reference Genome Effort

Fully and reliably annotated genomes:
• empower scientific research
• are essential for use in automatic inference

We comprehensively capture the experimental data from the most active research communities producing high-confidence functional descriptions, to leverage the power of the comparative method for inference.

Deliverables of Reference Genome Effort
1. Proteome sets
2. Annotation best practices documentation
3. Annotation software tool
4. Reference annotations for inference of function in other species

Evolutionary relationships are the “glue” in RefGenome
• Goal
  – Identify genes in reference genomes that may have the same or similar functions, so that comprehensive curation can be done simultaneously
• Why?
  – Different model organisms have different strengths for investigating gene function, and these can often inform each other
  – Most genes did not first evolve within a given extant species: they were INHERITED from a common ancestor shared with other species. Genes in different organisms have similar functions because they were inherited, and haven’t changed much since the common ancestor.

Current process
• Structural annotation of genomes used to build gp2protein files
• Gp2protein files used to build “ortholog clusters”
• ISS annotations made independently by each MOD
• Selection of “annotation set”, including independent ortholog identification at each MOD
• Individual MODs annotate in-depth each gene in set

New process: coordinate and centralize where possible
• Structural annotation of genomes used to build gp2protein files
• Gp2protein files used to build trees
• Gp2protein files used to build “ortholog clusters”
• Trees and clusters used to define ref. genome annotation sets
• Individual MODs annotate in-depth each gene in set
• Inferences made to ancestral proteins
• Inferences made to extant proteins

Callouts on the new process:
• Select “gene set for concurrent annotation” from a central resource with more complete information
• Make homology-based annotations concurrently and consistently in the context of an evolutionary tree

Update on progress: comprehensive gene sets from each MOD
• Short term solution implemented as of 9/4
  – Gp2protein files are now approximately complete
    • Most sets were OK as deposited by the MOD
    • A few sets had to be augmented (missing genes filled in from Ensembl or Entrez Gene); one set had to be reduced by selecting a single “representative” protein sequence per gene
• Long term solution: UniProt?
  • SwissProt record includes all alternatively spliced exons, which is ideal for evolutionary modeling of protein coding gene history
  • We have already shared the gp2protein files with SwissProt, and they are comparing to UniProt “complete proteome” sets

Proposal made at this meeting
• Write a white paper describing the “complete protein-coding gene set” needs/requirements for the RefGenome project
• Michael will approach Amos and discuss options for working together

Example: NEDD4
• Selected for electronic jamboree Oct. 2008
• Human NEDD4 was “core” target
• OrthoMCL identified “orthologs” in
  – Drosophila
  – C. elegans
  – Mouse (2)
  – Human (2)
  – Zebrafish
  – Rat
• Curators at SGD identified an ortholog in yeast from a published paper

[Figure: Orthologs (green) and paralogs (orange) of human NEDD4 (red). Labeled events: duplications at base of metazoa (WWP1/2; SMURF1/2 diverge, NEDD4 conserved); duplication at base of chordata (HACE1 diverges, NEDD4 conserved); duplication at base of reptilia?]

[Figure: OrthoMCL cluster containing human NEDD4/NEDD4L (blue) and curator-identified yeast ortholog (lt. blue). Labeled events: duplications at base of metazoa; duplication at base of chordata; duplication at base of reptilia.]

[Figure: Orthologs (green) and paralogs (orange) of human NEDD4 (red), and “conserved orthologs” of NEDD4/NEDD4L (yellow). Labeled events: duplications at base of metazoa; duplication at base of chordata; duplication at base of reptilia.]

Update on progress: gene trees and “homology set” selection tool
• Gene trees have been built for all existing PANTHER families, from all RefGenome species, plus 35 other “phylogenetically informative” species
• Tree Curation Tool has been updated by Paul’s and Suzi’s groups in collaboration
  – Retrieves and displays tree, and UniProt information for each sequence
  – Displays OrthoMCL clustering results; scalable to any number of different clustering algorithms
  – “Pre-alpha” prototype has been installed and is being tested by Pascale
• GOC has obtained supplemental funding to support
  – Adding multiple homology clustering algorithms
  – A “protein family curator”

Proposal made at this meeting
• Lead RefGenome Curator and Protein Family Curator work together to define set of genes to be annotated concurrently
• No need for review by individual MODs

Annotation inference based on homology
• We need to make homology inferences correctly and consistently
  – Infer only from annotations with experimental evidence
  – Use explicit evolutionary model: inheritance (maybe with modification) from a common ancestor!
• Homology inference is actually two inferences
  – 1. The common ancestor has the same annotation as its descendant that has been characterized
  – 2. Another (unannotated) descendant has the same annotation as its ancestor
  – Need traceable, versioned evidence trail:
    • Inferred annotation -> tree -> experimental annotation(s) -> literature

[Figure labels: GO process: cellular response to UV; GO process: positive regulation of synaptogenesis; GO function: ubiquitin-protein ligase activity; two “?” marks]

Proposal made at this meeting
• Protein family curator makes first pass at homology inferences
  – Confers with individual MODs as necessary
• Iterative: protein family curator prepares list of inferred annotations for each MOD, each MOD reviews and can suggest changes

Annotation process
• Structural annotation of genomes used to build gp2protein files
• Gp2protein files used to build trees
• Gp2protein files used to build “ortholog clusters”
• Trees and clusters used to define ref. genome annotation sets
  1. Protein family curator (Princeton/Pascale) suggests protein set based on report/examination of trees
  2. MOD curators annotate all experimental data to completion
  3. Protein family curator mediates annotation review
• Review and sign off on r.g. experimental annotations
• Inferences made to ancestral proteins (protein family curator)
• Inferences made to extant proteins (protein family curator; reviewed by protein family and MOD curators)
• Done!

[Flowchart repeated, labeled “Transformations”]

Princeton / P-POD update
• New run with protein sets used by PANTHER under way
• Implementing algorithms for generation of consensus clusters and other ortholog prediction methods
• New P-POD features

[Screenshots: P-POD search; P-POD results/disambiguation; P-POD-Notung]

[Flowchart repeated, with these notes and open questions:]
• UniProt complete proteome project?
• How to most efficiently incorporate input from all MOD curators?
• Pascale picks a focal gene
• How are resulting homology-based annotations delivered to MODs?
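
The two-step homology inference described under “Annotation inference based on homology” can be illustrated with a small sketch. This is not the RefGenome or PANTHER tooling. It is a minimal Python illustration that assumes a toy gene tree, and the `Node` class, the `infer` function, the protein names, the `tree-v1` version label and the `REF:hypothetical-1` reference are all hypothetical. It shows an annotation passing from an experimentally characterized protein up to a common ancestor (step 1) and then down to uncharacterized descendants (step 2), while keeping the trail inferred annotation -> tree -> experimental annotation(s) -> literature.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node in a gene tree; leaves are extant proteins, internal nodes are ancestors."""
    name: str
    children: list = field(default_factory=list)
    # GO term -> literature reference, for experimentally supported annotations only
    experimental: dict = field(default_factory=dict)


def leaves(node):
    """Return all extant (leaf) proteins below a node, left to right."""
    if not node.children:
        return [node]
    found = []
    for child in node.children:
        found.extend(leaves(child))
    return found


def infer(ancestor, term, tree_version):
    """Two-step inference of `term` through `ancestor`, keeping the evidence trail."""
    # Step 1: the ancestor takes the annotation of its characterized descendants.
    support = [(leaf.name, leaf.experimental[term])
               for leaf in leaves(ancestor) if term in leaf.experimental]
    if not support:
        return []  # infer only from annotations with experimental evidence
    # Step 2: each uncharacterized descendant takes the annotation of its ancestor.
    return [
        {
            "protein": leaf.name,             # inferred annotation target
            "term": term,
            "tree": tree_version,             # versioned tree used for the inference
            "via_ancestor": ancestor.name,
            "experimental_support": support,  # (protein, literature reference) pairs
        }
        for leaf in leaves(ancestor) if term not in leaf.experimental
    ]


# Hypothetical toy tree: one characterized protein, two uncharacterized ones.
term = "ubiquitin-protein ligase activity"
human = Node("human_protein", experimental={term: "REF:hypothetical-1"})
mouse = Node("mouse_protein")
fly = Node("fly_protein")
vertebrate_ancestor = Node("vertebrate_ancestor", children=[human, mouse])
root = Node("metazoan_ancestor", children=[vertebrate_ancestor, fly])

inferred = infer(vertebrate_ancestor, term, tree_version="tree-v1")
for record in inferred:
    print(record["protein"], "<-", record["via_ancestor"], "<-", record["experimental_support"])
```

Choosing which ancestor to infer through is the curation decision: inferring through `vertebrate_ancestor` reaches only the mouse protein, while inferring through `root` would also annotate the fly protein. This sketch leaves that choice to the caller and does not model “inheritance with modification”.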
