Predicting interactions between genes based on genome sequence comparisons
The “genomic context” component of STRING
Bioinformatics seminar series, 15-11-2005
Berend Snel

Today
• Announcement: the seminar of Jakob de Vlieg on 22 November is canceled. Please consult the website of the seminar series (www.cmbi.ru.nl/edu/seminars) for the new date.
• Seminar (today); please ask questions!
• Handing out article and questions: “Identification of a bacterial regulatory system for ribonucleotide reductases by phylogenetic profiling.” Read the article and hand in the answers to the questions by Monday November 28th.

Contents
• Predicting functional interactions between proteins; what & why
• Genomic context methods
– General
– Gene fusion
– Gene order
– Presence / absence of genes across genomes
• Integration and benchmarking of predictions
• Biochemistry by other means: BolA
• In addition to genomic context: functional genomics data

Complete genomes, now what?
• Post-genomic era = we have the parts list (complete genomes)
• To understand the cell we need to know the functions of the genes

A bacterial genome
gene 408..1748
/gene="dnaA"
/locus_tag="BCE33L0001"
/old_locus_tag="BCZK0001"
CDS 408..1748
/gene="dnaA"
/locus_tag="BCE33L0001"
/old_locus_tag="BCZK0001"
/inference="non-experimental evidence, no additional details recorded"
/codon_start=1
/transl_table=11
/product="chromosomal replication initiator protein"
/protein_id="AAU20227.1"
/db_xref="GI:51978677"
/translation="MENISDLWNSALKELEKKVSKPSYETWLKSTTAHNLKKDVLTIT
APNEFARDWLESHYSELISETLYDLTGAKLAIRFIIPQSQAEEEIDLPPAKPNAAQDD
SNHLPQSMLNPKYTFDTFVIGSGNRFAHAASLAVAEAPAKAYNPLFIYGGVGLGKTHL
MHAIGHYVIEHNPNAKVVYLSSEKFTNEFINSIRDNKAVDFRNKYRNVDVLLIDDIQF
LAGKEQTQEEFFHTFNALHEESKQIVISSDRPPKEIPTLEDRLRSRFEWGLITDITPP
DLETRIAILRKKAKAEGLDIPNEVMLYIANQIDSNIRELEGALIRVVAYSSLINKDIN
ADLAAEALKDIIPNSKPKIISIYDIQKAVGDVYQVKLEDFKAKKRTKSVAFPRQIAMY
LSRELTDSSLPKIGEEFGGRDHTTVIHAHEKISKLLKTDTQLQKQVEEINDILK"
gene 1927..3066
/gene="dnaN"
/locus_tag="BCE33L0002"
/old_locus_tag="BCZK0002"
CDS 1927..3066
/gene="dnaN"
/locus_tag="BCE33L0002"
/old_locus_tag="BCZK0002"
/EC_number="2.7.7.7"
/inference="non-experimental evidence, no additional details recorded"
/codon_start=1
/transl_table=11
/product="DNA polymerase III, beta subunit"
/protein_id="AAU20226.1"
/db_xref="GI:51978676"
/translation="MRFTIQKDYLVRSVQDVMKAVSSRTTIPILTGIKVVATEEGVTL
TGSDADISIESFIPVEEDGKEIVEVKQSGSIVLQAKYFSEIVKKLPKETVEISVENHL
MTKITSGKSEFNLNGLDSAEYPLLPQIEEHHVFKIPTDLLKHMIRQTVFAVSTSETRP
ILTGVNWKVYNSELTCIATDSHRLALRKAKIEGIADEFQANVVIPGKSLNELSKILDE
SEEMVDIVITEYQVLFRTKHLLFFSRLLEGNYPDTTRLIPAESKTDIFVNTKEFLQAI
DRASLLARDGRNNVVKLSTLEQAMLEISSNSPEIGKVVEEVQCEKVDGEELKISFSAK
YMMDALKALDSTEIKISFTGAMRPFLIRTVNDESIIQLILPVRTY"

For most genes in any genome we need function prediction
• E. coli, the most intensively studied organism: only 1924 genes (~43%) have been (partially) experimentally characterized.

Predicting protein function
What is function? Various levels of description:
Sequence similarity/homology has the largest relevance for “Molecular Function”. This aspect of protein function is best conserved. Molecular function can often be predicted from similarities between protein sequences (BLAST), or structures.

Homology: BLAST and / or SMART/PFAM/CDD
gi|22209068|Mayven [Homo sapiens]
gi|21410410|Klhl2 protein [Mus musculus]
. . . . . .
gi|55725960|hypothetical protein [Pongo pygmaeus]
gi|6644176|Klhl3 [Homo sapiens]
gi|19354513|Klhl3 protein [Mus musculus]
gi|12644384|Ring canal kelch protein [Drosophila melanogaster]
Scores, in the same order: 1159, 1145, 887, 885, 765, 676

“Beyond” homology and molecular function
Homology-based function prediction works very well, yet:
• A large fraction of genes are poorly described (no homologs, uncharacterized homologs; this holds for ~60% of the human genes)
• There are other aspects of function: functional associations, e.g. the target of a protein kinase or a transcriptional regulator, i.e.
to understand the cell we need to know the interactions of the genes

Thus: predicting associations
There are many types of functional associations (AKA functional interactions, interactions, functional links, functional relations) in molecular biology:
• Transcription regulation
• Signalling pathways
• Protein complexes
• Cellular process
• Metabolic pathways

Types of functional associations
• Metabolic pathways: filling gaps

Types of functional associations
• Transcription regulation
• Signalling pathways

Types of functional associations
• Cellular process (“DNA repair”, “Apoptosis”)
• Protein complexes

So how can knowledge of the functional associations help?
• If we did not know anything about the function of the protein, we can now say in which process it is involved
• If we already knew something about the function, we might now know much more about it (e.g. if we knew it was a hydrolase, we might now know in which metabolic pathway it is active)
• If the gene was already well characterized, we might understand its role better (e.g. new targets for a kinase)

Contents
• Predicting functional interactions between proteins
• Genomic context methods
– General (how do we predict functional interactions)
– Gene fusion
– Gene order
– Presence / absence of genes across genomes
• Integration and benchmarking of predictions
• Biochemistry by other means: BolA
• In addition to genomic context: functional genomics data

How can we now predict / detect functional associations?
• Functional genomics / high-throughput experiments
• GENOMIC CONTEXT: functionally associated proteins leave evolutionary traces of their relation in genomes

We can thus detect evolutionary traces of a functional association by comparing genomes

Genomic context is a tool to predict functional associations between genes
• Use the genome sequences themselves (through comparative genome analysis) for interaction prediction: genomic context methods
• Genomic context methods have been shown to be reliable indicators for functional interaction
• Genomic context is also known as in silico interaction prediction, or genomic associations
[Figure: curves for Fusion, Gene Order, and Co-occurrence; x-axis: Score, 0 to 1; y-axis: 0 to 1]
http://string.embl.de

Three different genomic context methods in STRING
• Gene fusion, Rosetta stone method
• Conserved gene order between divergent genomes
• Co-occurrence of genes across genomes, phylogenetic profiles

Contents
• Predicting functional interactions between proteins
• Genomic context methods
– General
– Gene fusion
– Gene order
– Presence / absence of genes across genomes
• Integration and benchmarking of predictions
• Biochemistry by other means: BolA
• In addition to genomic context: functional genomics data

Gene fusion
• i.e. the orthologs of two genes in another organism are fused into one polypeptide
• A very reliable indicator for functional interaction, partly because it is a relatively infrequent evolutionary event: 3470 distinct fusions when surveying 179 genomes
[Figure: Fusion]

Gene fusion: an example

Contents
• Predicting functional interactions between proteins
• Genomic context methods
– General
– Fusion
– Gene order
– Presence / absence of genes across genomes
• Integration and benchmarking of predictions
• Biochemistry by other means: BolA
• In addition to genomic context: functional genomics data

Gene order evolves rapidly
But … Differential retention of divergent / convergent gene pairs suggests that conservation implies a functional association

“Operons”: comparison to pathways
Conservation implies a functional association
[Figure: average metabolic distance and number of COGs (logarithmic axis) plotted against the number of co-occurrences in operons, 0 to 30]

Conserved gene order
• i.e.
genes that are present over ‘sufficiently large’ evolutionary distances in the same gene cluster
• Contributes by far the most predictions

Conserved gene order
NB1: predicting operons is not trivial; in fact conserved gene order or functional association is a major clue
NB2: using ‘only’ operons without requiring conservation results in much less reliable function prediction

Conserved gene order: an example from metabolism of propionyl-CoA
[Figure labels: “target”, “query”]

Conserved gene order: an example from metabolism of propionyl-CoA
Biochemical assays confirm the function of members of COG0346 as a DL-methylmalonyl-CoA racemase

Contents
• Predicting functional interactions between proteins
• Genomic context methods
– General
– Gene fusion
– Gene order
– Presence / absence of genes across genomes
• Integration and benchmarking of predictions
• Biochemistry by other means: BolA
• In addition to genomic context: functional genomics data

Presence / absence of genes
Gene content co-evolution. (The easy case, few genomes.)
Differences between gene content reflect differences in phenotypic potentialities
Genomes share genes for phenotypes they have in common

Presence / absence of genes
[Figure: gene content of L. innocua (non-pathogen) compared with L. monocytogenes (pathogen)]

Presence / absence of genes
Genes involved in pathogenicity
[Figure: gene content of L. innocua (non-pathogenic) compared with L. monocytogenes (pathogenic)]

Generalization: phylogenetic profiles / co-occurrence
[Figure: presence/absence (1/0) profiles of Gene 1, Gene 2, Gene 3, … across species 1, species 2, species 3, species 4, species 5, …]

Co-occurrence of genes across genomes
• i.e.
two genes have the same presence/absence pattern over multiple genomes: they have ‘coevolved’
• AKA phylogenetic profiles

Predicting function of a disease gene protein with unknown function, frataxin, using co-occurrence of genes across genomes
• Friedreich’s ataxia
• No (homolog with) known function
Frataxin has co-evolved with hscA and hscB, indicating that it plays a role in iron-sulfur cluster assembly
[Figure: phylogenetic profiles.
Species: H.sapiens, D.melan., C.elegans, S.cerevisiae, C.albicans, S.pombe, A.thaliana, M.jannaschii, A.pernix, E.coli, P.multocida, H.influenzae, V.cholerae, Buchnera, P.aeruginosa, X.fastidiosa, N.meningitidis, M.loti, C.crescentus, R.prowazekii, C.jejuni, H.pylori, D.radiodurans, M.tuberculosis, M.genitalium, B.subtilis, Synechocystis, A.aeolicus.
Genes: cyaY Yfh1, hscB Jac1, hscA ssq1, iscS Nfs1, iscU Isu1-2, iscA Isa1-2, fdx Yah1, RnaM, IscR, Hyp, Atm1, Nfu1, Arh1.
Panel labels: Prediction; Confirmation; Iron-Sulfur (2Fe-2S) cluster in the Rieske protein]

The opposite of co-occurrence: anti-correlation / complementary patterns: predicting analogous enzymes
Genes with complementary phylogenetic profiles tend to have a similar biochemical function.
[Figure: complementary presence/absence patterns of genes A and B]

Complementary patterns in thiamin biosynthesis predict analogous enzymes
Morett E, Korbel JO, Rajan E, Saab-Rincon G, Olvera L, Olvera M, Schmidt S, Snel B,

Prediction of analogous enzymes is confirmed

Contents
• Predicting functional interactions between proteins
• Genomic context methods
– General
– Gene fusion
– Gene order
– Presence / absence of genes across genomes
• Integration and benchmarking of predictions
• Biochemistry by other means: BolA
• In addition to genomic context: functional genomics data

Benchmark and integration: KEGG maps

Integrating genomic context scores into one single score
• Compare each individual method against an independent benchmark (KEGG), and find “equivalency”
• Multiply the chances that two proteins are not interacting and subtract from 1; naive Bayesian, i.e. assuming independence:
$S = 1 - \prod_i (1 - S_i)$
[Figure: curves for Fusion, Gene Order, and Co-occurrence; x-axis: Score, 0 to 1; y-axis: 0 to 1]

Benchmark
[Figure: curves for Integrated, Gene Order (norm.), Gene Order (abs.), Co-occurrence, Fusion (norm.), and Fusion (abs.); x-axis: Accuracy (fraction of confirmed predictions, i.e. same KEGG map), 0.5 to 1.0; y-axis: logarithmic, 10 to 100000]

Performance of genomic context compared to high-throughput interaction data
[Figure: Coverage (fraction of reference set covered by data) versus Accuracy (fraction of data confirmed by reference set) for purified complexes HMS-PCI, purified complexes TAP, genomic context, mRNA co-expression, synthetic lethality, and yeast two-hybrid; legend: raw data, filtered data, parameter choices; combined evidence shown for two methods and three methods]

Contents
• Predicting functional interactions between proteins
• Genomic context methods
– General
– Gene fusion
– Gene order
– Presence / absence of genes across genomes
• Integration and benchmarking of predictions
• Biochemistry by other means: BolA
• In addition to genomic context: functional genomics data

Data-mining proteins for protein function prediction: BolA

An interaction of BolA with a mono-thiol glutaredoxin? (STRING)
• BolA and Grx occur as neighbors in a number of genomes
• BolA and Grx have an (almost) identical phylogenetic distribution
• BolA and Grx have been shown to interact in Y2H in S. cerevisiae and D. melanogaster, and in Flag tag in S. cerevisiae

BolA phylogeny
[Figure labels: Cell division / Cell wall; (oxidative) stress]

BolA does have (predicted) interactions with cell-division / cell-wall proteins. Those appear secondary to the link with Grx.

STRING has obtained a higher resolution in function prediction than phenotypic analyses

BolA is homologous to the peroxide reductase OsmC, suggesting a similar function
OsmC uses thiol groups of two evolutionarily conserved cysteines to reduce substrates
Problem: The BolA family does not have conserved cysteines.
…It would have to obtain its reducing equivalents from elsewhere…

BolA family alignment

Prediction of interaction partner and molecular function complement each other
• BolA interacts with Grx?
• BolA is (homologous to) a reductase
• Grx provides BolA with reducing equivalents!? (or “scaffolding”?)

Genomic context: biochemistry by other means
Despite the high performance of genomic context methods, using them for function prediction is not a push-button method. It is more like biochemistry by other means. Often quite a lot of manual input and expert knowledge from the researcher is needed to distill associations into a concrete function prediction.
Small-scale bioinformatics?

Contents
• Predicting functional interactions between proteins
• Genomic context methods
– General
– Fusion
– Gene order
– Co-occurrence across genomes
• Integration and benchmarking of predictions
• Interaction networks
• In addition to genomic context: functional genomics data

STRING currently in addition includes:
• Functional association data from large-scale / high-throughput biochemical experiments (functional genomics data)
– protein complex purification
– yeast two-hybrid
– ChIP-on-chip
– micro-array gene expression
• “Known” functional relations, so-called “legacy data”, as present in PubMed abstracts and databases like MIPS or KEGG.

• Handing out article and questions: “Identification of a bacterial regulatory system for ribonucleotide reductases by phylogenetic profiling.” Read the article and hand in the answers to the questions by Monday November 28th.
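
Worked example for the slide “Integrating genomic context scores into one single score”: the rule stated there (multiply the chances that two proteins are not interacting and subtract from 1) can be sketched in a few lines of Python. This is only an illustration, not STRING's actual implementation. The function name integrate_scores and the example scores are hypothetical, and the sketch assumes each per-method score has already been calibrated against the benchmark to a probability between 0 and 1.

```python
from math import prod


def integrate_scores(scores):
    """Combine per-method scores as S = 1 - prod(1 - S_i).

    Assumes the methods are independent (the naive Bayesian assumption)
    and that every score is a probability between 0 and 1.
    """
    scores = list(scores)
    for s in scores:
        if not 0.0 <= s <= 1.0:
            raise ValueError(f"score out of range [0, 1]: {s}")
    # Chance that no method reflects a true association, subtracted from 1.
    return 1.0 - prod(1.0 - s for s in scores)


# Hypothetical scores for one protein pair (illustrative values only).
example = {"fusion": 0.5, "gene_order": 0.6, "co_occurrence": 0.2}
combined = integrate_scores(example.values())
print(f"integrated score: {combined:.2f}")
```

With these made-up inputs the product of the complements is 0.5 × 0.4 × 0.8 = 0.16, so the integrated score is about 0.84, which is higher than any single method's score. This is the intended effect of combining independent lines of evidence.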