Protein Interaction Networks
Feb. 21, 2013
Aalt-Jan van Dijk
Applied Bioinformatics, PRI, Wageningen UR & Mathematical and Statistical Methods, Biometris, Wageningen University
[email protected]

My research
• Protein complex structures
  – Protein-protein docking
  – Correlated mutations
• Interaction site prediction/analysis
  – Protein-protein interactions
  – Enzyme active sites
  – Protein-DNA interactions
• Network modelling
  – Gene regulatory networks
  – Flowering related

Overview
• Introduction: protein interaction networks
• Sequences & networks: predicting interaction sites
• Predicting protein interactions
• Sequence and network evolution
• Interaction network alignment

Protein Interaction Networks
• Hemoglobin: obligatory
• Mitochondrial Cu transporters: transient

Experimental approaches (1)
Yeast two-hybrid (Y2H)

Experimental approaches (2)
Affinity Purification + mass spectrometry (AP-MS)

Interaction Databases
• STRING http://string.embl.de/
• HPRD http://www.hprd.org/
• MINT http://mint.bio.uniroma2.it/mint/
• INTACT http://www.ebi.ac.uk/intact/
• BIOGRID http://thebiogrid.org/

Some numbers
Organism           Number of known interactions
H. sapiens         113,217
S. cerevisiae       75,529
D. melanogaster     35,028
A. thaliana         13,842
M. musculus         11,616
Source: Biogrid (physical interactions)

Overview
• Introduction: protein interaction networks
• Sequences & networks: predicting interaction sites
• Predicting protein interactions
• Sequence and network evolution
• Interaction network alignment

Binding site

Binding site prediction
Applications:
• Understanding network evolution
• Understanding changes in protein function
• Predict protein interactions
• Manipulate protein interactions
Input data:
• Interaction network
• Sequences (possibly structures)

Sequence-based predictions

Sequences and networks
• Goal: predict interaction sites and/or motifs
• Data: interaction networks, sequences
• Validation: structure data, “motif databases”

Motif search in groups of proteins
• Group proteins which have the same interaction partner
• Use motif search, e.g. find PWMs
Neduva PLoS Biol 2005

Correlated Motifs
• Motif model
• Search
• Scoring

Predefined motifs

Correlated Motif Mining
Find motifs in one set of proteins which interact with (almost) all proteins with another motif
Motif models:
• PWM – so far not applied
• (l,d) with l=length, d=number of wildcards
Score: overrepresentation, e.g. χ2
Search:
• Interaction driven
• Motif driven

Interaction driven approaches
Mine for (quasi-)bicliques: most-versus-most interaction
Then derive motif pair from sequences

Motif driven approaches
Starting from candidate motif pairs, evaluate their support in the network (and improve them)

D-MOTIF
Tan BMC Bioinformatics 2006

IMSS: application of D-MOTIF
[Figure: test error versus number of selected motif pairs, for protein X and protein Y]
Van Dijk et al., Bioinformatics 2008
Van Dijk et al., PLoS Comp Biol 2010

Experimental validation
[Figure: test error versus number of selected motif pairs, for protein X and protein Y]
Van Dijk et al., Bioinformatics 2008
Van Dijk et al., PLoS Comp Biol 2010

SLIDER
Boyen et al. Trans Comp Biol Bioinf 2011
• Faster approach, enabling genome wide search
• Scoring: χ2
• Search: steepest ascent

Validation
• Performance assessment on simulated data
• Performance assessment using protein structures

Extensions of SLIDER
Extension I: better coverage of network (Boyen et al. Trans Comp Biol Bioinf 2013)
Extension II: use of more biological information

bioSLIDER
DGIFELELYLPDDYPMEAPKVRFLTKI
• conservation
• accessibility
Thresholds for conservation and accessibility
Extension of motif model: amino acid similarity (BLOSUM)

bioSLIDER
Using human and yeast data for training and optimizing parameters
[Figure: interaction-coverage versus motif-accuracy, comparing “No conservation, no accessibility” with “Conservation and accessibility”]
Leal Valentim et al., PLoS ONE 2012

Application to Arabidopsis
Input data: 6200 interactions, 2700 proteins
Interface predictions for 985 proteins (on average 20 residues)
Arabidopsis Interactome Mapping Consortium, Science 2011
Ecotype sequence data (SNPs)
• SNPs tend to ‘avoid’ predicted binding sites
• In 263 proteins there is a SNP in a binding site; these proteins are much more connected to each other than would be randomly expected

Summary
• Prediction of interaction sites using protein interaction networks and protein sequences
• Correlated motif approaches

Overview
• Introduction: protein interaction networks
• Sequences & networks: predicting interaction sites
• Predicting protein interactions
• Sequence and network evolution
• Interaction network alignment

Protein Interaction Prediction
Lots of genomes are being sequenced… (www.genomesonline.org)
             ARCHAEA   BACTERIA   EUKARYA   TOTAL
Complete         182       3767       183    4132
Incomplete       264      14393      2897   17514
But how do we know how the proteins in there work together?!
Protein Interaction Prediction
• Interactions of orthologs: interologs
• Phylogenetic profiles
  A 1 0 1 1 0 0 1
  B 1 0 1 1 0 0 1
• Domain-based predictions

Orthology based prediction

Phylogenetic profiles
A 1 0 1 1 0 0 1
C 1 0 1 1 1 0 1
B 1 0 1 1 1 0 1
D 0 1 0 1 0 0 1

Domain Based Predictions

Overview
• Introduction: protein interaction networks
• Sequences & networks: predicting interaction sites
• Predicting protein interactions
• Sequence and network evolution
• Interaction network alignment

Duplications

Duplications and interactions
• Gene duplication: 0.001 Myear⁻¹
• Interaction loss: 0.1 Myear⁻¹

Duplications and interaction loss
Duplicate pairs share interaction partners

Interaction network evolution
Science 2011

Overview
• Introduction: protein interaction networks
• Sequences & networks: predicting interaction sites
• Predicting protein interactions
• Sequence and network evolution
• Interaction network alignment

Network alignment
• Local Network Alignment: find multiple, unrelated regions of isomorphism
• Global Network Alignment: find the best overall alignment

PATHBLAST
Kelley, PNAS 2003

PATHBLAST: scoring
[Figure labels: homology, interaction]
Kelley, PNAS 2003

PATHBLAST: results
For yeast vs H. pylori, with L=4, all resulting paths with p<=0.05 can be merged into just five network regions
Kelley, PNAS 2003

Multiple alignment
Scoring: probabilistic model for interaction subnetworks
Sub-networks: bottom-up search, starting with exhaustive search for L=4, followed by local search
Sharan PNAS 2005

Multiple alignment: results
Applications include protein function prediction and interaction prediction
Sharan PNAS 2005

Global alignment
Alignment: greedy selection of matches
Singh PNAS 2008

Network alignment: the future?
Sharan & Ideker Nature Biotech 2006

Summary
• Interaction network evolution: mostly “comparative”, not much mechanistic
• Approaches exist to integrate and model network analysis within context of phylogeny (not discussed)
• Outlook: combine interaction site prediction with network evolution analysis

Exercises
The datafiles “arabidopsis_proteins.lis” and “interactions_arabidopsis.data” contain Arabidopsis MADS proteins (which regulate various developmental processes including flowering), and their mutual interactions, respectively.
[Figure labels: SOC1, AGL24]

Exercise 1
• Start by getting familiar with the basic Cytoscape features described in section 1 of the tutorial http://opentutorials.cgl.ucsf.edu/index.php/Tutorial:Introduction_to_Cytoscape
• Load the data into Cytoscape
• Visualize the network and analyze the number of interactions per protein: which proteins have a lot of interactions?

Exercise 2
Write a script that reads interaction data and implements a data structure which enables further analysis of the data (see setup on next slides). Use the datafiles “arabidopsis_proteins.lis” and “interactions_arabidopsis.data” and let the script print a table in the following format:
PROTEIN Number_of_interactions
Make a plot of those data.

    # two subroutines

    # input: filename
    # output: list with content of file
    sub read_list {
        my $infile = $_[0];
        # YOUR CODE
        return @newlist;
    }

    # input: protein list and interaction list (both passed as array references)
    # output: hash mapping each protein to a list of its partners
    sub combine_prot_int($$) {
        my ($plist, $intlist) = @_;
        # YOUR CODE
        return %inthash;
    }

    # reading input data
    my @plist   = read_list($ARGV[0]);
    my @intlist = read_list($ARGV[1]);

    # obtaining hash with interactions
    my %inthash = combine_prot_int(\@plist, \@intlist);

    # YOUR CODE
    # loop over all proteins and print their name and their number of interactions

Exercise 3
In “orthology_relations.data” we have a set of predicted orthologs for the Arabidopsis proteins from exercise 1. “protein_information.data” describes, among other things, which species these proteins are from. Finally, “interactions.data” contains interactions between those proteins. Use the Arabidopsis interaction data from exercise 1 to “predict” interactions in other species using the orthology information. Compare your predictions with the real interaction data and make a plot that visualizes how good your predictions are.
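
The interolog idea behind Exercise 3 (transfer an interaction between two proteins to their orthologs in another species, then compare with observed interactions) can be sketched as follows. This is only a minimal illustration under stated assumptions: the data are small in-memory toy structures invented for the example, not the contents of the exercise files, whose formats are not described here. The names `pair_key` and `predict_interologs`, the `sp2_*` protein identifiers, and the use of precision and recall as the comparison measure are all hypothetical choices made for this sketch.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Toy input, invented for illustration (NOT read from the exercise files).
# Interactions in the reference species:
my @ref_interactions = ( [ 'SOC1', 'AGL24' ], [ 'SOC1', 'AP1' ] );

# Hypothetical orthology map: reference protein -> orthologs in species 2.
my %orthologs = (
    SOC1  => ['sp2_A'],
    AGL24 => [ 'sp2_B', 'sp2_C' ],
    AP1   => [],                      # no ortholog known
);

# Hypothetical observed interactions in species 2.
my @observed = ( [ 'sp2_A', 'sp2_B' ], [ 'sp2_B', 'sp2_C' ] );

# Order-independent key, so that A-B and B-A count as the same interaction.
sub pair_key {
    my ( $p, $q ) = @_;
    return join '|', sort { $a cmp $b } ( $p, $q );
}

# Transfer each reference interaction to all pairs of orthologs (interologs).
sub predict_interologs {
    my ( $interactions, $ortho ) = @_;
    my %predicted;
    for my $pair (@$interactions) {
        my ( $p, $q ) = @$pair;
        for my $op ( @{ $ortho->{$p} // [] } ) {
            for my $oq ( @{ $ortho->{$q} // [] } ) {
                $predicted{ pair_key( $op, $oq ) } = 1;
            }
        }
    }
    return %predicted;
}

my %predicted = predict_interologs( \@ref_interactions, \%orthologs );

my %real;
$real{ pair_key(@$_) } = 1 for @observed;

# Overlap between predicted and observed interactions.
my $tp     = grep { $real{$_} } keys %predicted;
my $n_pred = keys %predicted;
my $n_real = keys %real;

my $precision = $n_pred ? $tp / $n_pred : 0;
my $recall    = $n_real ? $tp / $n_real : 0;

printf "predicted=%d observed=%d overlap=%d precision=%.2f recall=%.2f\n",
    $n_pred, $n_real, $tp, $precision, $recall;
```

With this toy input, the SOC1-AGL24 interaction is transferred to two ortholog pairs, while SOC1-AP1 yields no prediction because AP1 has no ortholog in the map. Computing the same two numbers per species would give one possible basis for the plot the exercise asks for.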