Advanced practical course in genome bioinformatics
DAY 6: Functional annotation
Petri Törönen
Earlier version: Patrik Koskinen

Three sections
• Background
• Methods
• Demonstration of tools

Outline of Background
• Goal
• How function can be defined
  – Description
  – Gene Ontology
• What can be used to predict function
• What I am omitting here
• Why these two matter

GOAL
• You have an unknown protein sequence
• It should be functionally annotated
  – In wet lab (precise, slow)
  – In silico (less precise, faster)
• With thousands of sequences, the wet lab is not an option
  – Collective annotation of sequences (hundreds of scientists)
  – Combined use of a few in-silico methods

Examples (Good, Bad)
Good: >NP_001602.1 tartrate-resistant acid phosphatase type 5 precursor [Homo sapiens]
Bad: >EFB24196.1 hypothetical protein PANDA_021498, partial [Ailuropoda melanoleuca] (GenBank: EFB24196.1)

How function is defined
• How can we describe a function for a gene?

How function is defined
• Functional description as human-readable text
• Linking gene to Key Words (UniProt)
• Linking gene to Gene Ontology classes
• Linking gene to Enzyme categories
• Linking gene to Signalling Pathways or Biochemical Pathways (KEGG)
• Linking Domain to functional activity
Focus on description and on Gene Ontology

Human readable descriptions
• tartrate-resistant acid phosphatase type 5 precursor [Homo sapiens]
Summary: This gene encodes an iron containing glycoprotein which catalyzes the conversion of orthophosphoric monoester to alcohol and orthophosphate. It is the most basic of the acid phosphatases and is the only form not inhibited by L(+)tartrate. [provided by RefSeq, Aug 2008].

Gene Ontology (GO)
• GO is currently a popular standard in gene annotation
• GO defines categories that represent gene function
• Creates a union of the genes in the same process
• Easy summary for genes with similar function
• Easier to predict than text descriptions

Gene Ontology (GO)
• 3 sub-parts: Biological Process, Molecular Function, Cellular Component (cellular localization)
  – Molecular Function => chemical activity
  – Biological Process => biology, cellular process
  – Cellular Component => location of the gene product in the cell
• Hierarchical structure
  – Categories with very precise function
  – Categories with less precise function
  – Categories with very broad function

Gene Ontology
• tartrate-resistant acid phosphatase type 5 precursor [Homo sapiens]

Advantages of GO
• Cross-species comparison
  – Already used by several databases
• Comprehensive
  – GO covers all biological and chemical processes
  – Many terms per gene product
• Simplifies querying
  – Uses a restricted vocabulary developed by curators and annotators
• Use of evidence codes
  – How reliable is the given information

Methods outline
• Stupid method
• What can be used to predict function
• Current in-silico methods
• Method comparisons

Stupid method
• Run a BLAST search
• Take the first sequence hit

Stupid Functional annotation
Traditional way to go:
• Annotation of the nearest neighbour = annotation of the query sequence
• Threshold in search space (e.g. BLAST E-value 1e-5)

Example BLAST results from Melitaea cinxia (the Glanville fritillary butterfly)

Sequences producing significant alignments:  Score (bits)  E-value
tr|Q7QF43|Q7QF43_ANOGA AGAP000295-PA OS=Anopheles gambiae GN=AGA...  155  1e-35
tr|B4JK90|B4JK90_DROGR GH12613 OS=Drosophila grimshawi GN=GH1261...  140  5e-31
tr|Q29H35|Q29H35_DROPS GA13573 OS=Drosophila pseudoobscura pseud...  136  8e-30
tr|B4N1J8|B4N1J8_DROWI GK16321 OS=Drosophila willistoni GN=GK163...  135  1e-29
tr|B4PXZ6|B4PXZ6_DROYA GE17857 OS=Drosophila yakuba GN=GE17857 P...  133  5e-29
tr|Q9VZ71|Q9VZ71_DROME CG15211, isoform A OS=Drosophila melanoga...  133  7e-29
tr|B4R7M7|B4R7M7_DROSI GD16987 OS=Drosophila simulans GN=GD16987...  132  9e-29
tr|B4IDT9|B4IDT9_DROSE GM11433 OS=Drosophila sechellia GN=GM1143...  132  9e-29
tr|B3NVE7|B3NVE7_DROER GG18369 OS=Drosophila erecta GN=GG18369 P...  132  9e-29
tr|B3MQD3|B3MQD3_DROAN GF20425 OS=Drosophila ananassae GN=GF2042...  132  9e-29
tr|B4L7Y1|B4L7Y1_DROMO GI11027 OS=Drosophila mojavensis GN=GI110...  132  1e-28
tr|Q16VM8|Q16VM8_AEDAE Putative uncharacterized protein (Fragmen...  131  3e-28
tr|B0WJ18|B0WJ18_CULQU Putative uncharacterized protein OS=Culex...  129  1e-27
tr|C1C2H5|C1C2H5_9MAXI Plasmolipin OS=Caligus clemensi GN=PLLP P...  104  3e-20
tr|Q1DGM2|Q1DGM2_AEDAE Putative uncharacterized protein (Fragmen...  100  8e-19
tr|C3ZW39|C3ZW39_BRAFL Putative uncharacterized protein OS=Branc...   74  6e-11
tr|A3KQ86|A3KQ86_DANRE Novel protein similar to vertebrate trans...   73  1e-10
tr|C3YLB7|C3YLB7_BRAFL Putative uncharacterized protein OS=Branc...   72  1e-10
sp|Q8CJ61|CKLF4_MOUSE CKLF-like MARVEL transmembrane domain-cont...   67  4e-09
tr|A4IFB9|A4IFB9_BOVIN CMTM4 protein OS=Bos taurus GN=CMTM4 PE=2...   67  8e-09
sp|P47987|PLLP_RAT Plasmolipin OS=Rattus norvegicus GN=Pllp PE=1...   66  1e-08
sp|Q9DCU2|PLLP_MOUSE Plasmolipin OS=Mus musculus GN=Pllp PE=2 SV=1   66  1e-08
tr|Q4SI17|Q4SI17_TETNG Chromosome 5 SCAF14581, whole genome shot...   65  2e-08
tr|B1H2E6|B1H2E6_XENTR LOC100145405 protein OS=Xenopus tropicali...   65  3e-08
tr|A7YYE5|A7YYE5_DANRE CKLF-like MARVEL transmembrane domain con...   65  3e-08
tr|C4Q9A0|C4Q9A0_SCHMA Marvel-containing potential lipid raft-as...   65  3e-08
tr|Q6DGM6|Q6DGM6_DANRE CKLF-like MARVEL transmembrane domain con...   64  5e-08
tr|C3ZW41|C3ZW41_BRAFL Putative uncharacterized protein (Fragmen...   64  5e-08
tr|C1BJ60|C1BJ60_OSMMO Plasmolipin OS=Osmerus mordax GN=PLLP PE=...   63  1e-07
sp|Q8IZR5|CKLF4_HUMAN CKLF-like MARVEL transmembrane domain-cont...   62  2e-07
…

Annotation of the first hit: AGAP000295-PA
Is this an informative annotation to adopt?
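
The "stupid method" above can be written down in a few lines, which also shows why it gives so little here. This is only a minimal sketch: the `Hit` type and the `naive_annotation` function are hypothetical names introduced for illustration, and the four hits are copied from the BLAST list above (descriptions kept truncated as in the output).

```python
from typing import List, NamedTuple, Optional


class Hit(NamedTuple):
    """One line of a BLAST hit list (hypothetical container for this sketch)."""
    accession: str
    description: str
    bits: float
    evalue: float


# A few hits copied from the example hit list; descriptions are truncated
# exactly as in the BLAST output.
HITS = [
    Hit("Q7QF43", "AGAP000295-PA OS=Anopheles gambiae GN=AGA...", 155, 1e-35),
    Hit("B4JK90", "GH12613 OS=Drosophila grimshawi GN=GH1261...", 140, 5e-31),
    Hit("C1C2H5", "Plasmolipin OS=Caligus clemensi GN=PLLP P...", 104, 3e-20),
    Hit("Q8CJ61", "CKLF-like MARVEL transmembrane domain-cont...", 67, 4e-09),
]


def naive_annotation(hits: List[Hit], max_evalue: float = 1e-5) -> Optional[str]:
    """Copy the description of the best hit that passes the E-value threshold."""
    passing = [h for h in hits if h.evalue <= max_evalue]
    if not passing:
        return None  # nothing similar enough: the query stays unannotated
    best = min(passing, key=lambda h: h.evalue)
    return best.description


if __name__ == "__main__":
    print(naive_annotation(HITS))
```

With these hits the function returns the description of Q7QF43, i.e. the uninformative name AGAP000295-PA, although plasmolipin and CKLF-like hits further down the list also pass the threshold.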

Example BLAST results from Melitaea cinxia (the Glanville fritillary butterfly)
(Same hit list as in the first example: 30 hits, scores 155 to 62 bits, E-values 1e-35 to 2e-07.)
Informative hits further down the list:
• Plasmolipins
• Chemokine-like factor superfamily members

Why is the traditional method stupid?
• Blind copying of the nearest-neighbour annotation creates errors in database annotations

Errors in public databases
Schnoes AM, Brown SD, Dodevski I, Babbitt PC, 2009. Annotation Error in Public Databases: Misannotation of Molecular Function in Enzyme Superfamilies. PLoS Computational Biology 5(12)

Errors in databases
• So not all annotations in the databases are correct
• What can be done?

Rational Functional annotation
• Collect some features for the analyzed sequence
• Compare these features to the features of known sequences
• Estimate the function based on the similarity with many sequences

What can be used to predict function
• Where can computer programs and researchers get information about the sequence?

Function Prediction: What can we use to predict function
This lecture discusses sequence-based features!
• Sequence homology (BLAST result list)
• Phylogenetic tree of sequences
• Protein domains (PFAM domains)
• Short sequence patterns – motifs
• Sequence features

Function Prediction: What I am omitting
• Literature search
• Gene expression (in tissues, development...)
• Protein-protein interactions
• Genetic interactions
These are also important sources, but only if they are available.
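
The "rational functional annotation" idea above, estimating function from the similarity with many sequences instead of copying the single best hit, can be sketched with sequence homology as the only feature. This is an illustrative sketch under stated assumptions, not the scoring of any particular tool: the weight −log10(E-value), the function name `score_go_classes`, the hit identifiers and the `GO:toy_*` labels are all made up for the example.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def score_go_classes(
    hits: Iterable[Tuple[str, float]],
    go_annotations: Dict[str, List[str]],
    max_evalue: float = 1e-5,
) -> List[Tuple[str, float]]:
    """Sum a per-hit weight over all hits that carry the same GO class.

    hits           -- (hit_id, evalue) pairs from a similarity search
    go_annotations -- hit_id -> GO classes already known for that hit
    Returns GO classes sorted by decreasing combined score.
    """
    scores: Dict[str, float] = defaultdict(float)
    for hit_id, evalue in hits:
        if evalue > max_evalue:
            continue  # too weak to be used as evidence
        # Assumed weight: -log10(E-value); the floor avoids log(0) for
        # hits reported with an E-value of 0.0.
        weight = -math.log10(max(evalue, 1e-300))
        for go_class in go_annotations.get(hit_id, ()):
            scores[go_class] += weight
    return sorted(scores.items(), key=lambda item: item[1], reverse=True)


if __name__ == "__main__":
    # Toy data: the best hit has no annotation at all, weaker hits do.
    toy_hits = [("hitA", 1e-35), ("hitB", 3e-20), ("hitC", 1e-08), ("hitD", 1e-2)]
    toy_go = {
        "hitA": [],
        "hitB": ["GO:toy_membrane"],
        "hitC": ["GO:toy_membrane", "GO:toy_myelin"],
        "hitD": ["GO:toy_other"],
    }
    for go_class, score in score_go_classes(toy_hits, toy_go):
        print(f"{go_class}\t{score:.2f}")
```

In the toy data the unannotated best hit contributes nothing, while two weaker annotated hits together support the same class. That is the situation in the butterfly example, where the informative hits are not at the top of the list.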

Why the two previous groups matter
• Assume you are predicting genes for neuronal growth in a mammal
• Assume you are predicting enzyme functions

Sequence Homology Methods
• Do a BLAST search with a query sequence
• Collect the GO classes of the genes in the BLAST result list
• Give a weight to each BLAST hit, often based on log(E-value), e.g. −log(E-value)
• Combine the scores from the genes that belong to the same GO class
• Report the best-scoring / significant GO classes

Example BLAST results revisited
(Same hit list as in the first example: 30 hits, scores 155 to 62 bits, E-values 1e-35 to 2e-07.)
Informative hits further down the list:
• Plasmolipins
• Chemokine-like factor superfamily members

Sequence Homology Methods
• Simple method
• Can fail to detect some similarities
• Programs
  – BLAST2GO (http://www.blast2go.com/b2ghome)
  – GOTCHA (http://www.compbio.dundee.ac.uk/gotcha/gotcha.php)
  – ARGOT2 (http://www.medcomp.medicina.unipd.it/Argot2/form.php)
  – PFP (http://kiharalab.org/web/pfp.php)
  – PANNZER (http://ekhidna2.biocenter.helsinki.fi/sanspanz/)

Phylogenetic tree methods
• Create the pair-wise distances for the set of genes
• Do a hierarchical clustering of the genes
• Map the known GO functions to the cluster tree
• Look for unknown genes in a cluster with many genes from the same GO class
• Report the best-scoring / significant GO classes
• More => http://genome.cshlp.org/content/8/3/163.full

Phylogenetic tree methods
• These should outperform sequence homology methods
• Require a set of related genes
• Often much heavier calculations
• Programs:
  – SIFTER (http://genome.cshlp.org/content/early/2011/07/22/gr.104687.109)

Prediction with Protein domains
• Look at which protein domains are present in the query protein (PFAM)
• Map the functions that are linked to the domains to your query sequence
  – PFAM2GO
• Programs: InterProScan + PFAM2GO
• Drawbacks:
  – This mapping is the same in plants, mammals and bacteria
  – Many domains to specific function

Prediction with Protein domains
• Benefits:
  – Can create an annotation from separate domains
  – Similar sequences do not have to be in the database
• Programs: InterProScan (http://www.ebi.ac.uk/InterProScan/)

Our contribution: PANNZER
• Uses the BLAST result list
• Adds taxonomic information
• Scores GO classes using a score that takes the frequency of the GO class in the sequence database into account
• The method is used to predict:
  – GO classes
  – Description line

Our contribution: PANNZER
• Benefits:
  – Takes the species taxonomy into account
  – Improved use of statistics
• Drawbacks:
  – Uses only sequence similarity

Method comparisons
• Many annotation methods are available
• All methods claim to be the best available
• Which methods are really the best?

Critical Assessment of Function Annotations (CAFA)
• Select a set of unknown genes
• Ask research groups to predict GO terms for them
• After the deadline, start collecting new annotations for the genes
• Then evaluate the methods against the new annotations

Radivojac et al. A large-scale evaluation of computational protein function prediction. Nature Methods. 2013 Jan 27. doi: 10.1038/nmeth.2340.
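
The slides do not say how the methods are evaluated once the new annotations have been collected. As an illustration only, the sketch below assumes a simplified precision/recall comparison in which the F-measure is maximised over score thresholds. It ignores the GO hierarchy and is not claimed to reproduce the exact CAFA protocol; the function name `f_max`, the protein names and the GO labels are hypothetical.

```python
from typing import Dict, List, Optional, Set


def f_max(
    predictions: Dict[str, Dict[str, float]],
    truth: Dict[str, Set[str]],
    thresholds: Optional[List[float]] = None,
) -> float:
    """Best F-measure over score thresholds.

    predictions -- protein -> {GO class: confidence score in [0, 1]}
    truth       -- protein -> set of GO classes found after the deadline
    Precision is averaged over proteins with at least one prediction at the
    threshold; recall is averaged over all proteins with known annotations.
    """
    if thresholds is None:
        thresholds = [i / 100 for i in range(1, 101)]
    best = 0.0
    for t in thresholds:
        precisions, recalls = [], []
        for protein, true_terms in truth.items():
            if not true_terms:
                continue  # nothing to compare against
            scored = predictions.get(protein, {})
            predicted = {go for go, score in scored.items() if score >= t}
            true_positives = len(predicted & true_terms)
            if predicted:
                precisions.append(true_positives / len(predicted))
            recalls.append(true_positives / len(true_terms))
        if not precisions or not recalls:
            continue
        precision = sum(precisions) / len(precisions)
        recall = sum(recalls) / len(recalls)
        if precision + recall > 0:
            best = max(best, 2 * precision * recall / (precision + recall))
    return best


if __name__ == "__main__":
    toy_predictions = {"p1": {"GO:A": 0.9, "GO:B": 0.4}, "p2": {"GO:A": 0.6}}
    toy_truth = {"p1": {"GO:A"}, "p2": {"GO:B"}}
    print(round(f_max(toy_predictions, toy_truth), 3))
```

Sweeping the threshold matters because each method reports scores on its own scale, so fixing a single cut-off for all methods would favour some of them arbitrarily.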

CAFA 1
• Most successful methods
  – JonesUCL
    • http://www.slideshare.net/idoerg/david-jones-afpcafa2011
  – Argot2
    • http://www.medcomp.medicina.unipd.it/Argot2/
  – Pannzer
    • http://ekhidna.biocenter.helsinki.fi/pannzer

Method demonstrations
• InterProScan
• Argot
• Pannzer

Demo sequences
• 5 sequences
• 3 from Eukaryota, 2 from Prokaryota
• All are annotated as unknown
• They can still be annotated based on the sequence

InterProScan
• http://www.ebi.ac.uk/interpro/search/sequence-search
• InterProScan is a metaserver that looks for many sequence features
• Many of these features can be used to annotate sequences
• You have to give one sequence at a time

InterProScan Results
• http://www.ebi.ac.uk/interpro/sequencesearch/iprscan5S20161208-163056-0935-89512669-oy
• http://www.ebi.ac.uk/interpro/sequencesearch/iprscan5S20161208-163142-0955-53752333-oy
• http://www.ebi.ac.uk/interpro/sequencesearch/iprscan5S20161208-163259-0726-8535830-oy
• http://www.ebi.ac.uk/interpro/sequencesearch/iprscan5S20161208-163327-0153-62340367-oy
• http://www.ebi.ac.uk/interpro/sequencesearch/iprscan5S20161208-163413-0491-69491884-pg

Argot2
• http://www.medcomp.medicina.unipd.it/Argot2/
• Tool that processes BLAST output
• One of the best methods in CAFA 1

Argot2
• Results: http://www.medcomp.medicina.unipd.it/Argot2/getStatus.php?js=12196

PANNZER2
• http://ekhidna2.biocenter.helsinki.fi/sanspanz/
• Processes results similar to BLAST output
• Predicts text descriptions and Gene Ontology classes
• Significantly faster than the other tools

Conclusion
• These methods are increasingly needed
• Some methods exist
• Unfortunately there is no clear evaluation
• Remember: these are predictions. There is no certain information until they are tested in the wet lab…