Protein Sequence Motifs
Aalt-Jan van Dijk
Plant Research International, Wageningen UR
Biometris, Wageningen UR
www.bioinformatics.nl

Plant Bioinformatics (overview diagram of research areas)
• Genomics: Next Generation Sequencing; genome assembly & annotation; (comparative) genome analysis; SNP analysis, marker development
• Computational infrastructure: database development; web-based analysis tools; software development; workflow management systems
• Further topics on the diagram: machine learning; data (pre-)processing, pipelining; transcriptomics; alternative splicing; EST analysis; proteomics; protein interaction networks; metabolomics; metabolite and pathway identification; integrated analysis of omics datasets; systems biology network modelling (bottom-up); technology

My research
• Protein complex structures
• Protein-protein docking
• Correlated mutations
• Interaction site prediction/analysis
• Protein-protein interactions
• Protein-DNA interactions
• Motif search
• Enzyme active sites

Overview
• Protein Motif Searching
• Hydrophobicity & Transmembrane Domains
• Protein Interactions
• Sequence motifs to predict interaction sites
• Secondary Structure Prediction

Protein Motif Searching

What is a motif?
A motif is a description of a particular element of a protein that contains a specific sequence pattern.
Motifs are identified by
• 3D structural alignment
• Multiple sequence alignment
• Pattern searching programs

Protein Motif Searching
Strict consensus pattern: use only strictly conserved residues

    C--QASCDGIPLKMNDC
    C---VTCEGLPMRMDQC
    CERTLGCQPMPVH---C

    CxxxxxCxxxPxxxxxC

But what about variable residues? Gaps?

Strict consensus patterns contain
• no alternative residues
• no flexible regions
• no mismatches
• no gaps

Protein Motif Searching
Most motifs are defined as regular expressions. Motifs can contain
• alternative residues
• flexible regions

    C-x(2,5)-C-x-[GP]-x-P-x(2,5)-C

      CXXXCXGXPXXXXXC
      |   | | |     |
    FGCAKLCAGFPLRRLPCFYG

The PROSITE Syntax

    A-[BC]-X-D(2,5)-{EFG}-H

    A         A
    [BC]      B or C
    X         anything
    D(2,5)    2-5 D's
    {EFG}     not E, F, or G
    H         H

PROSITE entries
Mandatory motifs characterise a protein (super-)family

ID  SUBTILASE_ASP; PATTERN.
DE  Serine proteases, subtilase family, aspartic acid active site.
PA  [STAIV]-x-[LIVMF]-[LIVM]-D-[DSTA]-G-[LIVMFC]-x(2,3)-[DNH].

ID  SUBTILASE_HIS; PATTERN.
DE  Serine proteases, subtilase family, histidine active site.
PA  H-G-[STM]-x-[VIC]-[STAGC]-[GS]-x-[LIVMA]-[STAGCLV]-[SAGM].

ID  SUBTILASE_SER; PATTERN.
DE  Serine proteases, subtilase family, serine active site.
PA  G-T-S-x-[SA]-x-P-x(2)-[STAVC]-[AG].

Exercise
• Find the three subtilase motifs in PROSITE (prosite.expasy.org)
• Compare the lists of proteins in which the motifs occur: what does this tell you?
• Similarly, compare protein structures in which the motifs occur
• Have a look at the "sequence logo"

Protein Motif Searching
Some motifs match frequently in proteins even where the feature may not actually be present, for example post-translational modification sites:

ID  ASN_GLYCOSYLATION; PATTERN.
DE  N-glycosylation site.
PA  N-{P}-[ST]-{P}.

Exercise
• Use a glycosylation site predictor such as http://www.cbs.dtu.dk/services/NetNGlyc/
• Input: your favorite set of sequences
• Do you observe that some N-{P}-[ST] sites are likely to be glycosylated and others not?

Profiles
Many motifs cannot be easily defined using simple patterns. Such motifs can be defined using profiles. A profile is constructed from a multiple sequence alignment.
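To make the idea of a profile concrete, here is a minimal Python sketch that scores one alignment column by averaging the substitution-matrix rows of the residues found in that column. The function name `profile_column` and the optional `weights` argument are illustrative assumptions; only the A and W rows of BLOSUM62 are included to keep the example short. Tools such as prophecy (EMBOSS) apply their own sequence weighting and scaling, so their numbers will generally differ from this plain average.

```python
# Column order of the substitution-matrix rows below.
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"

# Two rows of BLOSUM62 (A and W); a full implementation would hold all 20.
BLOSUM62_ROWS = {
    "A": [4, -1, -2, -2, 0, -1, -1, 0, -2, -1, -1, -1, -1, -2, -1, 1, 0, -3, -2, 0],
    "W": [-3, -3, -4, -4, -2, -2, -3, -2, -2, -3, -2, -3, -1, 1, -4, -3, -2, 11, 2, -3],
}


def profile_column(column, weights=None):
    """Score one alignment column.

    column  : string with one residue per aligned sequence, e.g. "AAAAAWWWWW"
    weights : optional per-sequence weights (assumption: equal weights if omitted)
    Returns a dict mapping each amino acid to its (weighted) average score.
    """
    if weights is None:
        weights = [1.0] * len(column)
    total_weight = sum(weights)
    scores = [0.0] * len(AMINO_ACIDS)
    for residue, weight in zip(column, weights):
        row = BLOSUM62_ROWS[residue]  # KeyError for residues without a row here
        for index, value in enumerate(row):
            scores[index] += weight * value
    return {aa: score / total_weight for aa, score in zip(AMINO_ACIDS, scores)}


if __name__ == "__main__":
    for column in ("AAAAAAAAAA", "AAAAAWWWWW", "WWWWWWWWWW"):
        scores = profile_column(column)
        print(column, {aa: scores[aa] for aa in "AWY"})
```

A column that contains only A simply reproduces the A row of the matrix; mixed columns interpolate between the rows of the residues they contain.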
For each position, each amino acid is given a score depending on how likely it is to occur.

Calculating a Profile
For each alignment position: take the (weighted) average of the appropriate rows from the scoring matrix.

An (extremely simple) example:

    seq_01  AAAAAAAAAAW
    seq_02  AAAAAAAAAWW
    seq_03  AAAAAAAAWWW
    seq_04  AAAAAAAWWWW
    seq_05  AAAAAAWWWWW
    seq_06  AAAAAWWWWWW
    seq_07  AAAAWWWWWWW
    seq_08  AAAWWWWWWWW
    seq_09  AAWWWWWWWWW
    seq_10  AWWWWWWWWWW

Excerpt from the EBLOSUM62 matrix:

    A   R   N   D   C   Q   E   G   H   I   L   K   M   F   P   S   T   W   Y   V
A   4  -1  -2  -2   0  -1  -1   0  -2  -1  -1  -1  -1  -2  -1   1   0  -3  -2   0
W  -3  -3  -4  -4  -2  -2  -3  -2  -2  -3  -2  -3  -1   1  -4  -3  -2  11   2  -3

Resulting profile scores for three alignment columns (prophecy (EMBOSS), using Henikoff profile type and BLOSUM62 matrix):

Residue    10A   5A+5W    10W
A          4.0     1.0   -3.0
C          0.0    -2.0   -2.0
D         -2.0    -6.0   -4.0
E         -1.0    -4.0   -3.0
F         -2.0    -1.0    1.0
G          0.0    -2.0   -2.0
H         -2.0    -4.0   -2.0
I         -1.0    -4.0   -3.0
K         -1.0    -4.0   -3.0
L         -1.0    -3.0   -2.0
M         -1.0    -2.0   -1.0
N         -2.0    -6.0   -4.0
P         -1.0    -5.0   -4.0
Q         -1.0    -3.0   -2.0
R         -1.0    -4.0   -3.0
S          1.0    -2.0   -3.0
T          0.0    -2.0   -2.0
V          0.0    -3.0   -3.0
W         -3.0     8.0   11.0
Y         -2.0     0.0    2.0

Pattern Searching
• Short linear motifs: e.g.
http://dilimot.russelllab.org/
• Profiles: MEME http://meme.sdsc.edu/meme/cgi-bin/meme.cgi

Exercise
Take a number of sequences which contain the PROSITE subtilase motif and find motifs in those sequences with MEME.

Hydropathy Plot
Prediction of hydrophobic and hydrophilic regions in a protein

Partition Coefficients
[Figure: partition coefficients between oil and water; hydrophilic vs. hydrophobic]

Hydrophobicity/Hydrophilicity Values
(residues ordered from hydrophilic to hydrophobic)

Residue   Fauchere & Pliska   Kyte & Doolittle   Hopp & Woods   Eisenberg
R               -1.37              -4.50             3.00         -2.53
K               -1.35              -3.90             3.00         -1.50
D               -1.05              -3.50             3.00         -0.90
Q               -0.78              -3.50             0.20         -0.85
N               -0.85              -3.50             0.20         -0.78
E               -0.87              -3.50             3.00         -0.74
H               -0.40              -3.20            -0.50         -0.40
S               -0.18              -0.80             0.30         -0.18
T               -0.05              -0.70            -0.40         -0.05
P                0.12              -1.60             0.00          0.12
Y                0.26              -1.30            -2.30          0.26
C                0.29               2.50            -1.00          0.29
G                0.48              -0.40             0.00          0.48
A                0.62               1.80            -0.50          0.62
M                0.64               1.90            -1.30          0.64
W                0.81              -0.90            -3.40          0.81
L                1.06               3.80            -1.80          1.06
V                1.08               4.20            -1.50          1.08
F                1.19               2.80            -2.50          1.19
I                1.38               4.50            -1.80          1.38

Hydrophobicity Plot
• Sum (average) the amino acid hydrophobicity values in a given window
• Plot the value in the middle of the window
• Shift the window one position

$$H_i = \frac{1}{2k+1} \sum_{n=i-k}^{i+k} H_n$$

Here $H_n$ on the right-hand side is the scale value of residue $n$, and the window spans $2k+1$ residues centred on position $i$.

Sliding Window Approach
• Calculate the property for the first sub-sequence
• Use the result (plot/print/store)
• Move to the next residue position, and repeat

Hydrophobicity Plot
[Figure: hydrophobicity plot for the example sequence MEZCALTASTESVERYNICE; x-axis: residue position (0 to 21), y-axis: hydrophobicity (-2 to 1.5); built up over several slides]
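The sliding-window procedure described above (compute the property for a window, assign it to the middle residue, shift by one position) can be sketched in Python as follows. The function name `hydropathy_profile` and the half-window parameter `k` (window of 2k+1 residues) are illustrative choices, and the Kyte & Doolittle values are used as the scale. The example peptide is hypothetical; letters without a scale value, such as Z, raise a KeyError in this sketch.

```python
# Kyte & Doolittle hydropathy values for the 20 standard amino acids.
KYTE_DOOLITTLE = {
    "R": -4.5, "K": -3.9, "D": -3.5, "Q": -3.5, "N": -3.5,
    "E": -3.5, "H": -3.2, "S": -0.8, "T": -0.7, "P": -1.6,
    "Y": -1.3, "C": 2.5, "G": -0.4, "A": 1.8, "M": 1.9,
    "W": -0.9, "L": 3.8, "V": 4.2, "F": 2.8, "I": 4.5,
}


def hydropathy_profile(sequence, k=4, scale=KYTE_DOOLITTLE):
    """Return (position, window average) pairs for a window of 2k+1 residues.

    Positions are 1-based and refer to the residue in the middle of the
    window; the first and last k residues get no value because a full
    window does not fit there.
    """
    values = [scale[residue] for residue in sequence.upper()]
    width = 2 * k + 1
    profile = []
    for centre in range(k, len(values) - k):
        window = values[centre - k:centre + k + 1]
        profile.append((centre + 1, sum(window) / width))
    return profile


if __name__ == "__main__":
    # Hypothetical peptide, chosen only to show the output format.
    for position, value in hydropathy_profile("MKTLLVAGILLAVSRDEK", k=2):
        print(f"{position:3d} {value:6.2f}")
```

Larger values of `k` give a smoother curve, at the cost of losing more positions at both ends of the sequence.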
Transmembrane Regions
(geometry of an alpha helix)
• Rotation is 100 degrees per amino acid
• Climb is 1.5 Angstrom per amino acid residue

Transmembrane Regions
[Figure: membrane, thickness 30 Angstrom]
So we need approx. 30 / 1.5 = 20 amino acids to span the membrane.

Adapting the window size to the size of the membrane-spanning segment makes the picture easier to interpret.
[Figure: hydrophobicity plots of the same protein with window = 1, window = 9, window = 19 and window = 121]

Protein Interactions
• Obligatory: hemoglobin
• Transient: mitochondrial Cu transporters

Experimental approaches
(1) Yeast two-hybrid (Y2H)
(2) Affinity purification + mass spectrometry (AP-MS)
Interaction Databases
• STRING http://string.embl.de/
• HPRD http://www.hprd.org/
• InteroPorc http://biodev.extra.cea.fr/interoporc/Default.aspx
• Many others, e.g. see http://nar.oxfordjournals.org./content/39/suppl_1.toc

[Figure: yeast protein interaction network]

Sequence-based Protein Binding Site Prediction
[Figure: binding site]
[Figure: predefined motifs]

Motif search in groups of proteins
• Group proteins which have the same interaction partner
• Use motif search, e.g. find position weight matrices (PWMs)
(Neduva, PLoS Biol 2005)

Correlated Motif Search

Interactors        Non-interactors
AARLL  PLTEQ       AARLL  MARLT
MARLT  DLTEP       VVRLM  MARLT
VVRLM  MMTER       PLTEQ  DLTEP

Correlated motif pair: (RL, TE)

Experimental validation
(Van Dijk et al., PLoS Comput Biol 2010)

New approach: SLIDER
• Faster approach: genome-wide searching for interaction motifs
• Improve mining algorithm with a priori biological knowledge (conservation score, surface accessibility)
(Boyen et al., IEEE/ACM Trans Comput Biol Bioinform. 2011)

The end. Questions?
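As a recap of the correlated motif search example above, the following Python sketch recovers the pair (RL, TE) from the six interacting and non-interacting protein pairs shown on the slide. It is a brute-force toy under stated assumptions: motifs are plain 2-mers, and a motif pair is scored as the number of interacting pairs it explains minus the number of non-interacting pairs it would wrongly explain. The function names and this scoring rule are illustrative and are not the published algorithm of Van Dijk et al. or of SLIDER.

```python
# Protein pairs taken from the correlated motif search example.
INTERACTING = [("AARLL", "PLTEQ"), ("MARLT", "DLTEP"), ("VVRLM", "MMTER")]
NON_INTERACTING = [("AARLL", "MARLT"), ("VVRLM", "MARLT"), ("PLTEQ", "DLTEP")]


def kmers(sequence, k=2):
    """All substrings of length k that occur in the sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}


def explains(motif_a, motif_b, protein_1, protein_2):
    """True if one protein contains motif_a and the other contains motif_b."""
    return ((motif_a in protein_1 and motif_b in protein_2)
            or (motif_b in protein_1 and motif_a in protein_2))


def best_correlated_pair(interacting, non_interacting, k=2):
    """Exhaustively score all unordered k-mer pairs; return (best pair, score)."""
    proteins = {p for pair in interacting + non_interacting for p in pair}
    motifs = sorted(set().union(*(kmers(p, k) for p in proteins)))
    best_pair, best_score = None, float("-inf")
    for index, motif_a in enumerate(motifs):
        for motif_b in motifs[index:]:
            hits = sum(explains(motif_a, motif_b, p, q) for p, q in interacting)
            false_hits = sum(explains(motif_a, motif_b, p, q) for p, q in non_interacting)
            score = hits - false_hits
            if score > best_score:
                best_pair, best_score = (motif_a, motif_b), score
    return best_pair, best_score


if __name__ == "__main__":
    print(best_correlated_pair(INTERACTING, NON_INTERACTING))
```

Exhaustive enumeration is only feasible for toy data like this; the number of candidate motif pairs grows quickly with motif length and alphabet size, which is the motivation for faster search strategies.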
Secondary Structure Prediction

Secondary Structure Prediction
Traditional methods (statistical and/or rule-based)
• E.g. Garnier, Osguthorpe & Robson (GOR): statistical method
• Accuracy ~ 60%

GOR Helix Parameters

     i-8  i-7  i-6  i-5  i-4  i-3  i-2  i-1    i  i+1  i+2  i+3  i+4  i+5  i+6  i+7  i+8
Gly   -5  -10  -15  -20  -30  -40  -50  -60  -86  -60  -50  -40  -30  -20  -15  -10   -5
Ala    5   10   15   20   30   40   50   60   65   60   50   40   30   20   15   10    5
Val    0    0    0    0    0    0    5   10   14   10    5    0    0    0    0    0    0
Leu    0    5   10   15   20   25   28   30   32   30   28   25   20   15   10    5    0
Ile    5   10   15   20   25   20   15   10    6    0  -10  -15  -20  -25  -20  -10   -5
Ser    0   -5  -10  -15  -20  -25  -30  -35  -39  -35  -30  -25  -20  -15  -10   -5    0
Thr    0    0    0   -5  -10  -15  -20  -25  -26  -25  -20  -15  -10   -5    0    0    0
Asp    0   -5  -10  -15  -20  -15  -10    0    5   10   15   20   20   20   15   10    5
Glu    0    0    0    0   10   20   60   70   78   78   78   78   78   70   60   40   20
Asn    0    0    0    0  -10  -20  -30  -40  -51  -40  -30  -20  -10    0    0    0    0
Gln    0    0    0    0    5   10   20   20   10  -10  -20  -20  -10   -5    0    0    0
Lys   20   40   50   55   60   60   50   30   23   10    5    0    0    0    0    0    0
His   10   20   30   40   50   50   50   30   12  -20  -10    0    0    0    0    0    0
Arg    0    0    0    0    0    0    0    0   -9  -15  -20  -30  -40  -50  -50  -30  -10
Phe    0    0    0    0    0    5   10   15   16   15   10    5    0    0    0    0    0
Tyr   -5  -10  -15  -20  -25  -30  -35  -40  -45  -40  -35  -30  -25  -20  -15  -10   -5
Trp  -10  -20  -40  -50  -50  -10    0   10   12   10    0  -10  -50  -50  -40  -20  -10
Cys    0    0    0    0    0    0   -5  -10  -13  -10   -5    0    0    0    0    0    0
Met   10   20   25   30   35   40   45   50   53   50   45   40   35   30   25   20   10
Pro  -10  -20  -40  -60  -80 -100 -120 -140  -77  -60  -30  -20  -10    0    0    0    0

[Figure: the same helix parameter table overlaid with the example sequence ISGARNIERHELIXPREDICT]
GOR Prediction
[Figure: GOR prediction curves for beta sheet and helix]

Secondary Structure Prediction
Recent methods
• Neural networks = flexible statistics
• Multiple alignments = variability
• Heuristics = common sense
• Or a combination of the above
Accuracy ~ 70%

Heuristics
• Conserved parts are structurally and/or functionally important
• Segments with many gaps must be in loop regions

Secondary Structure Prediction Strategy
• Use as many methods as possible
• Use homologous sequences
• Combine predictions into a consensus prediction

Why can't it be 100% correct?
• All current 2D prediction schemes are based upon observation of the occurrence of 2D elements in 3D structures
• Deduction of 2D elements from structures is ambiguous!
• DSSP, Stride, and the PDB (human) annotation do not always agree upon the assigned elements

[Figure: helix end in a 3D structure. Do these residues still belong to the helix?]
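The consensus step of the prediction strategy above (run several methods, then combine their predictions) can be sketched as a per-residue majority vote. This is a minimal Python illustration under assumptions: each prediction is a string of equal length using H (helix), E (strand) and C (coil), the function name `consensus_prediction` is hypothetical, and ties are marked with `?` rather than resolved. The input strings are made up for the example and are not output of real predictors.

```python
from collections import Counter


def consensus_prediction(predictions, tie_symbol="?"):
    """Combine equal-length secondary structure strings by majority vote.

    predictions : list of strings such as "HHHEEC", one per method
    tie_symbol  : character written where no single state has the most votes
    """
    if not predictions:
        raise ValueError("at least one prediction is required")
    if len({len(p) for p in predictions}) != 1:
        raise ValueError("all predictions must have the same length")
    consensus = []
    for states in zip(*predictions):  # one tuple of votes per residue
        ranked = Counter(states).most_common()
        top_state, top_votes = ranked[0]
        tied = len(ranked) > 1 and ranked[1][1] == top_votes
        consensus.append(tie_symbol if tied else top_state)
    return "".join(consensus)


if __name__ == "__main__":
    # Hypothetical outputs of three prediction methods for a 6-residue peptide.
    methods = ["HHHEEC", "HHCEEC", "HCCEEE"]
    print(consensus_prediction(methods))
```

Because the reference assignments themselves disagree at the ends of helices and strands, positions where the vote is split are often exactly these boundary residues.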