
Functional annotation
Datasources
Konstantinos Mavrommatis
[email protected]
MGM workshop. 19 Oct 2010

Let's get started…
Information from databases is used to predict the function of a protein (functional annotation).
- Product name
- Enzyme catalog number
- Domain architecture
- …

But what is function?
cobalamin biosynthetic enzyme, cobalt-precorrin-4 methyltransferase (CbiF)
- molecular/enzymatic (methyltransferase)
  - Reaction (methylation)
  - Substrate (cobalt-precorrin-4)
  - Ligand (S-adenosyl-L-methionine)
- metabolic (cobalamin biosynthesis)
- physiological (maintenance of healthy nerve and red blood cells, through B12).

Functional annotation
- Predict the biochemistry and physiology of an organism based on its genome sequence
- Explain known biochemical and physiological properties

Homologs/Orthologs/Paralogs

Function prediction
- Function transfer by homology
- Homology
  - implies a common evolutionary origin.
  - not retention of similarity in any of their properties.
- Homology ≠ similarity of function.
Punta & Ofran. PLOS Comp Biol. 2008

Trust transfer of annotation?
Punta & Ofran. PLOS Comp Biol. 2008

Dos and Don'ts
Type                       Don't           Do
Homology                   Same function   Probability for same function
Orthology                  Same function   Probability for same function
Paralogy                   Same function   Probability for same function
Sequence similarity        Same function   Probability for same function
High sequence similarity   Same function   Probability for same function
Same sequence              Same function   Probability for same function

What if nothing is similar?
- Subcellular localization
- Gene context
- Special features
- Prediction of binding residues (DISIS, bindN)
[Figure labels: S~S, S~S, Periplasm, Cytoplasm]

Annotation should make sense
Model pathway: Substrate A, Enzyme 1, Substrate B, Enzyme 2, Substrate C, Enzyme 3, Substrate D
Genome annotation (figure marks, in slide order): ? Enzyme 1 Enzyme 2 ? Enzyme 3 ✓

Databases
- Databases used for the analysis of biological molecules.
- Databases contain information organized in a way that allows users/researchers to retrieve and exploit it.
- Why bother?
  - Store information.
  - Organize data.
  - Predict features (genes, functions ...).
  - Understand relationships (metabolic reconstruction).

Primary nucleotide databases
EMBL/GenBank/DDBJ (http://www.ncbi.nlm.nih.gov/, http://www.ebi.ac.uk/embl)
- Archive containing all sequences from:
  - genome projects
  - sequencing centers
  - individual scientists
  - patent offices
- The sequences are exchanged between the three centers on a daily basis.
- Database is doubling every 10 months.
- Sequences from >140,000 different species.
- 1400 new species added every month.

Year   Base pairs        Sequences
2004   44,575,745,176    40,604,319
2005   56,037,734,462    52,016,762
2006   69,019,290,705    64,893,747
2007   83,874,179,730    80,388,382
2008   99,116,431,942    98,868,465

Primary protein sequence databases
- Contain coding sequences derived from the translation of nucleotide sequences
- GenBank
  - Valid translations (CDS) from nt GenBank entries.
- UniProtKB/TrEMBL (1996)
  - Automatic CDS translations from EMBL.
  - TrEMBL Release 40.3 (26-May-2009) contains 7,916,844 entries.

Errors in databases
There are many errors in the primary sequence databases:
- In the sequences themselves:
  - Sequencing errors.
  - Cloning vector sequences.
- In the annotations:
  - Inaccuracies, omissions, and even mistakes.
  - Inconsistencies between some fields.

Redundancy
Redundancy is a major problem. Entries are partially or entirely duplicated:
- e.g. 20% of vertebrate sequences in GenBank.
- Partial and complete sequence duplications

NCBI Derivative Sequence Data
[Figure labels: Labs, GenBank, Curators, RefSeq, Genome Assembly, Algorithms, UniGene]

RefSeq
- Curated transcripts and proteins.
  - reviewed by NCBI staff.
- Model transcripts and proteins.
  - generated by computer algorithms.
- Assembled Genomic Regions (contigs).
- Chromosome records.

Secondary protein databases
- UniProt/SWISS-PROT (1986) (http://ca.expasy.org/spro)
  - a curated protein sequence database
  - high level of annotation (such as the description of the function of a protein, its domain structure, post-translational modifications, variants, etc.)
  - a minimal level of redundancy
  - high level of integration with other databases

Classification databases
- Groups (families/clusters) of proteins based on…
  - Overall sequence similarity.
  - Local sequence similarity.
  - Presence / absence of specific features (active site, signal peptides…).
  - Structural similarity.
  - ...
- These groups contain proteins with similar properties.
  - Specific function, enzymatic activity.
  - General function.
  - Evolutionary relationship.
  - …

Overall sequence similarity

Clusters of orthologous groups (COGs)
- COGs were delineated by comparing protein sequences encoded in 43 complete genomes representing 30 major phylogenetic lineages.
- Each cluster has representatives of at least 3 lineages.
- A function (specific or broad) has been assigned to each COG.
http://www.ncbi.nlm.nih.gov/COG/

How it works
[Figure labels: Reciprocal best hit, Bidirectional best hit, Blast best hit, Unidirectional best hit, COG1, COG2]
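
The bidirectional (reciprocal) best hit idea named on the "How it works" slide can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the hit tuples, the gene names and the helper functions `best_hits` and `bidirectional_best_hits` are hypothetical, the scores are invented, and actual COG construction involves more than pairwise best hits (for example, the requirement of at least 3 lineages per cluster).

```python
from typing import Dict, List, Tuple

# Hypothetical BLAST-like hits: (query, subject, bit_score).
Hit = Tuple[str, str, float]


def best_hits(hits: List[Hit]) -> Dict[str, str]:
    """Return the highest-scoring subject for each query (unidirectional best hit)."""
    best: Dict[str, Tuple[str, float]] = {}
    for query, subject, score in hits:
        if query not in best or score > best[query][1]:
            best[query] = (subject, score)
    return {query: subject for query, (subject, _) in best.items()}


def bidirectional_best_hits(a_vs_b: List[Hit], b_vs_a: List[Hit]) -> List[Tuple[str, str]]:
    """Return pairs (a, b) where b is the best hit of a AND a is the best hit of b."""
    best_ab = best_hits(a_vs_b)
    best_ba = best_hits(b_vs_a)
    return sorted((a, b) for a, b in best_ab.items() if best_ba.get(b) == a)


# Toy data, invented for illustration only.
a_vs_b = [("a1", "b1", 250.0), ("a1", "b2", 90.0),
          ("a2", "b1", 120.0), ("a2", "b3", 80.0)]
b_vs_a = [("b1", "a1", 245.0), ("b2", "a1", 88.0), ("b3", "a3", 60.0)]

pairs = bidirectional_best_hits(a_vs_b, b_vs_a)
print(pairs)
```

In this toy set, a1 and b1 pick each other, so they form a bidirectional best hit. a2 also picks b1, but b1 does not pick a2, so that relationship is only a unidirectional best hit.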

Profiles & Pfam
- A method for classifying proteins into groups exploits region similarities, which contain valuable information (domains/profiles).
- These domains/profiles can be used to detect distant relationships, where only a few residues are conserved.

Regions similarity

Pfam
http://pfam.sanger.ac.uk
HMMs of protein alignments (local) for domains, or global (cover whole protein)

TIGRfam
- Full length alignments.
- Domain alignments.
- Equivalogs: families of proteins with specific function.
- Superfamilies: families of homologous genes.
- HMMs
http://www.tigr.org/TIGRFAMs/

KEGG orthology

Composite pattern databases
- To simplify sequence analysis, the family databases are being integrated to create a unified annotation resource: InterPro
- Release 30.0 (Dec10) contains 21178 entries
- Central annotation resource, with pointers to its satellite dbs
http://www.ebi.ac.uk/interpro/

* It is up to the user to decide if the annotation is correct *

ENZYME
http://ca.expasy.org/enzyme/

KEGG
- Contains information about biochemical pathways, and protein interactions.
http://www.kegg.com

Functional annotation
[Flowchart, flattened in extraction; labels listed in a best-effort grouping]
- Searches run on each gene:
  - PSI BLAST, 1e-2: COGs
  - BLASTp, <1e-10, >45% id, >70% length: KO terms
  - BLASTp, evalue<10, 20 best hits
  - Hmmsearch (BLAST preprocessing): Pfam, TIGRfam
  - IMG
- Product name, decided by a YES/NO cascade in this order: IMG term, TIGRfam, COG, COG + pfam, Pfam, BLAST; if all are NO: hypothetical
- (based on translation tables)
http://imgweb.jgi-psf.org/img_er_v260/doc/img_er_ann.pdf

Sequencing projects & Metadata
http://www.genomesonline.org

Literature search
- PubMed http://www.ncbi.nlm.nih.gov/Pubmed
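
Two parts of the IMG functional annotation flowchart lend themselves to a short Python sketch: the BLASTp cutoffs quoted for KO terms (<1e-10, >45% id, >70% length) and the YES/NO cascade that picks a product name. Everything beyond those numbers and labels is an assumption: the `BlastHit` fields, the function names, the reading of ">70% length" as an aligned fraction, and the cascade order itself, which is read off a flowchart that did not survive extraction intact. It is not the IMG implementation.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class BlastHit:
    """Hypothetical container for one BLASTp hit."""
    evalue: float
    percent_identity: float
    aligned_fraction: float  # assumption: alignment length / sequence length


def passes_ko_thresholds(hit: BlastHit) -> bool:
    """Apply the cutoffs quoted on the slide: <1e-10, >45% id, >70% length."""
    return (hit.evalue < 1e-10
            and hit.percent_identity > 45.0
            and hit.aligned_fraction > 0.70)


# Assumed priority order, read from the flattened flowchart.
PRIORITY = ["IMG term", "TIGRfam", "COG", "COG + pfam", "Pfam", "BLAST"]


def product_name(evidence: Dict[str, Optional[str]]) -> str:
    """Return the name from the first source that has one, else 'hypothetical'."""
    for source in PRIORITY:
        name = evidence.get(source)
        if name:
            return name
    return "hypothetical"


good = BlastHit(evalue=1e-30, percent_identity=52.0, aligned_fraction=0.85)
weak = BlastHit(evalue=1e-30, percent_identity=40.0, aligned_fraction=0.85)
print(passes_ko_thresholds(good), passes_ko_thresholds(weak))

# Invented example names, for illustration only.
evidence = {"Pfam": "methyltransferase domain protein",
            "COG": "precorrin-4 methylase"}
print(product_name(evidence))
print(product_name({}))
```

A gene with both a COG and a Pfam name takes the COG name here because COG comes earlier in the assumed order; a gene with no evidence at all falls through to "hypothetical".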

Specialized databases
- There is a large number of databases devoted to specific organisms.
- For some model organisms there are often concurrent systems.
- These databases are typically associated with sequencing or mapping projects.

Other specialized databases
- Signal transduction, regulation, protein-protein interactions
  - TRANSFAC (Transcription Factor database)
  - BRITE (Biomolecular Relations in Information Transmission and Expression database)
  - DIP (Database of Interacting Proteins)
  - BIND (Biomolecular Interaction Network database)
  - BioCarta
- Biochemical pathways
  - KLOTHO (Biochemical Compounds Declarative database)
  - BRENDA (enzyme information system)
  - LIGAND (similar to Enzyme but with more information for substrates)
- Gene order and co-occurrence
  - STRING
- Gene expression
  - GXD (Mouse Gene Expression Database)
  - The Stanford Microarray Database
- Mapping
  - GDB (Genome Data Base)
  - EMG (Encyclopedia of Mouse Genome)
  - MGD (Mouse Genome Database)
  - INE (Integrated Rice Genome Explorer)
- Polymorphism
  - ALFRED (Allele Frequency Database)
- Protein quantification
  - SWISS-2DPAGE
  - PDD (Protein Disease Database)
  - Sub2D (B. subtilis 2D Protein Index)
- 3D structures
  - PDB (Protein Data Bank)
  - MMDB (Molecular Modelling Data Base)
  - NRL_3D (Non-Redundant Library of 3D Structures)
  - SCOP (Structural Classification of Proteins)
- Molecular interactions
  - DIP (Database of Interacting proteins)
  - BIND (Biomolecular Interaction Network Database)

List of databases
http://www.oxfordjournals.org/nar/database/c

Databanks interconnection
[Figure nodes: Blocks, MIMMAP, REBASE, PDBFINDER, ALI, PROSITEDOC, OMIM, ProDom, PROSITE, SWISSNEW, ENZYME, DSSP, SWISSDOM, HSSP, FSSP, GenBank, PDB, MOLPROBE, SWISS-PROT, NRL_3D, ECDC, EPD, YPDREF, PMD, EMBL, YPD, EMNEW, TFSITE, TrEMBLNEW, ProtFam, FlyGene, TrEMBL, PIR, TFACTOR]
- Not all databases are updated regularly.
- Changes of annotation in one database are not reflected in others.

Summary
- Gene annotation should make sense in the context of the organism.
- We have main archives (GenBank), curated databases (RefSeq, SwissProt), protein classification databases (COG, Pfam), and many, many more…
- They help predict the function, or the network of functions.
- Systems that integrate the information from several databases, visualize and allow handling of data in an intuitive way are required.

QUESTIONS?
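
The first summary point, that annotation should make sense in the context of the organism, can be illustrated with the model pathway from the "Annotation should make sense" slide (Enzyme 1, Enzyme 2, Enzyme 3). The sketch below is a minimal Python illustration, assuming a pathway is just an ordered list of enzyme names and the genome annotation is a set of names; the function `missing_steps` and the example annotation are hypothetical.

```python
from typing import List, Set

# Enzymes of the model pathway on the slide:
# Substrate A -> B -> C -> D, catalysed by Enzyme 1, 2 and 3.
MODEL_PATHWAY: List[str] = ["Enzyme 1", "Enzyme 2", "Enzyme 3"]


def missing_steps(pathway: List[str], annotated: Set[str]) -> List[str]:
    """Return pathway enzymes with no matching annotation, in pathway order."""
    return [enzyme for enzyme in pathway if enzyme not in annotated]


# Hypothetical genome annotation in which one step has no assigned gene.
annotated = {"Enzyme 1", "Enzyme 3"}

gaps = missing_steps(MODEL_PATHWAY, annotated)
print(gaps)
```

A gap in an otherwise complete pathway is a hint worth following up: either the step is carried out by an unrecognized gene, or the neighbouring annotations deserve a second look.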
