
Automate Function Prediction

Outline
• Goal
• How function is defined
• Why Gene Ontology
• Methods for protein function prediction
• End points

GOAL
• A) You find a new protein
• B) You sequence the whole genome of your favorite organism
• The obtained gene(s) should be annotated
• A can be solved manually; B needs automatic tools

How function is defined
• Functional description as text
• Linking the gene to Key Words (Uniprot)
• Linking the gene to Gene Ontology
• Linking the gene to Signalling Pathways or Biochemical Pathways (KEGG)

Why Gene Ontology (GO)
• GO is currently a popular standard in gene annotation
• GO categories represent gene function
• Creates a union of the genes in the same process
• Easy summary for genes with similar function

Why Gene Ontology (GO)
• 3 sub-parts: Biological Process, Molecular Function, Cellular Component
  – Molecular Function => chemical activity
  – Biological Process => biology, cellular process
  – Cellular Component => location of the gene product
• Hierarchical structure
  – Categories with very precise function
  – Categories with less precise function
  – Categories with very broad function

How GO helps
• End user: summary categories for genes with various functions
• Computer programs: classifier algorithms can be taught to predict the categories for genes

Understanding GO
• Amigo server (http://amigo.geneontology.org/cgibin/amigo/go.cgi)

Function Prediction: What can we use to predict function
• Sequence homology (BLAST result list)
• Phylogenetic tree of sequences
• Protein domains (PFAM domains)
• Short sequence patterns (motifs)
• Sequence features (secondary structure, low-complexity regions)

Sequence Homology Methods
• Do a BLAST search with a query sequence
• Collect the GO classes for the genes in the BLAST result list
• Give a weight to each BLAST hit, often based on log(E-value)
• Combine the scores from the genes that belong to the same GO class
• Report the top-scoring / significant GO classes

Sequence Homology Methods
• Simple methods
• Programs
  – BLAST2GO (http://www.blast2go.com/b2ghome)
  – GOTCHA (http://www.compbio.dundee.ac.uk/gotcha/gotcha.php)
  – ARGOT (http://www.medcomp.medicina.unipd.it/Argot2/form.php)
  – PFP (http://kiharalab.org/web/pfp.php)

Phylogenetic tree methods
• Create the pair-wise distances for the set of genes
• Do a hierarchical clustering of the genes
• Map the known GO functions onto the cluster tree
• Look for unknown genes in a cluster with many genes from the same GO class
• Report the top-scoring / significant GO classes
• More => http://genome.cshlp.org/content/8/3/163.full

Phylogenetic tree methods
• These should outperform sequence homology methods (CAFA 2011?)
• Require a set of related genes
• Often much heavier calculations
• Programs:
  – Sifter (http://genome.cshlp.org/content/early/2011/07/22/gr.104687.109)

Prediction with Protein domains
• Look at which protein domains the query protein contains (PFAM)
• Map the functions that are linked to the domains to your query sequence
  – PFAM2GO
• Programs: InterProScan + PFAM2GO
• Drawbacks:
  – The mapping is the same in plant, mammal, bacteria
  – Many domains to specific function

Prediction with Protein domains
• Benefits:
  – Can create an annotation from separate domains
  – Similar sequences do not have to be in the database
• Programs (?): InterProScan (http://www.ebi.ac.uk/InterProScan/)

Prediction with patterns and motifs
• Same principle as before, but we look at sequence patterns and motifs
• Map the functions that are linked to the patterns to your query sequence
• Programs:
  – InterProScan
  – IBM BioDictionary (http://cbcsrv.watson.ibm.com/Tpa.html)
• Drawbacks and benefits are approximately the same as before

Prediction with sequence features
• Again the same principle as before
• We look at sequence features (see picture)
• These are given as input to a classifier algorithm (Support Vector Machine)

Prediction with sequence features
[Figure: sequence features]

Prediction with sequence features
• Benefits:
  – No actual sequence similarity needed
  – Information collected from vague similarities
  – Use of a classifier => feature weighting
• Program: FFPred (http://bioinf.cs.ucl.ac.uk/ffpred/)
• Drawbacks:
  – Calculations are probably quite heavy
  – No use of nearby sequence similarities (domains etc.)

Our contribution: PANNZER
• Use the BLAST result list
• Add taxonomic information
• Score GO classes using a score that takes the frequency of the GO class in the sequence DB into account
• The method is used to predict:
  – GO classes
  – Description line

Our contribution: PANNZER
• Benefits:
  – Takes the species taxonomy into account
  – Improved use of statistics
• Not public yet

Our contribution: No Name Yet
• Take PFAM domain predictions, BLAST similarities and taxonomic information
• Feed these to feature selection and to a classifier algorithm
• …Wait…
• The method is used to predict GO classes
• Not public + testing is ongoing

Conclusion
• These methods are increasingly needed
• Some methods exist
• Unfortunately there is no clear evaluation (my opinion)
• Remember: these are predictions. No certain information until they are tested in the wet lab…
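The steps listed under "Sequence Homology Methods" (weight each BLAST hit, sum the weights per GO class, report the top classes) can be illustrated with a minimal Python sketch. It is not the scoring used by BLAST2GO, GOTCHA, ARGOT, PFP or PANNZER. The function name `score_go_classes`, the weight -log10(E-value), the E-value floor and the example protein identifiers are all assumptions made for illustration, and the BLAST search itself is assumed to have been run already.

```python
import math
from collections import defaultdict


def score_go_classes(hits, go_annotations, top_n=5, min_evalue=1e-200):
    """Rank GO classes for a query from its BLAST hit list.

    hits           -- list of (hit_id, evalue) pairs from a BLAST search
    go_annotations -- dict mapping hit_id to a set of GO class identifiers
    top_n          -- number of top-scoring GO classes to report
    min_evalue     -- assumed floor, so that an E-value of 0 has a finite weight
    """
    scores = defaultdict(float)
    for hit_id, evalue in hits:
        # Assumed weighting: -log10(E-value), so stronger hits weigh more.
        weight = -math.log10(max(evalue, min_evalue))
        if weight <= 0:
            # Hits with an E-value of 1 or more carry no evidence here.
            continue
        # Combine the weights of all hits that share a GO class.
        for go_id in go_annotations.get(hit_id, ()):
            scores[go_id] += weight
    # Highest score first; ties are broken by GO identifier.
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_n]


# Hypothetical hit list and annotations, for illustration only.
example_hits = [("protA", 1e-50), ("protB", 1e-20), ("protC", 1e-5)]
example_annotations = {
    "protA": {"GO:0003824"},
    "protB": {"GO:0003824", "GO:0005524"},
    "protC": {"GO:0005524"},
}

for go_id, score in score_go_classes(example_hits, example_annotations):
    print(go_id, round(score, 1))
```

A plain sum favours GO classes that are frequent in the database. The PANNZER slide describes correcting for this by taking the frequency of the GO class in the sequence DB into account, which this sketch does not attempt.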
