Protein Architecture: Four Levels

Cost per genome
☞ Cost of first genome (Human Genome Project, started 1990): 3 billion US$
☞ Exponential decay of computing cost
☞ Soon below $1,000?
(Hayden, Nature 2014; http://sulab.org/)

Structures in the Protein Data Bank
[Figure: number of structures in the Protein Data Bank, total and per year, for X-ray crystallography and NMR spectroscopy; axis marks at 80,000 and 10,000]

Membrane proteins of known structure
[Figure; source: Stephen White lab, UC Irvine]

Today’s lecture
Much more sequence information is available than structural information!
(A) Sequence alignments: How similar are two (amino acid) sequences?
(B) Phylogenetic trees: Find the evolutionary tree from a set of sequences
(C) Structure prediction: Predict protein structure from amino acid sequence

(A) Sequence alignment
Why sequence alignments?
• Multiple sequence alignments
  → Identify amino acids that were conserved during evolution
  → Relevant for protein function or protein stability
• Quantify how similar two (or more) sequences are
  → Quantify distance in evolution, build evolutionary (phylogenetic) trees
• Find the best (or most likely) alignment of two sequences
  → homology modelling

(A) Sequence alignment
Problem: How similar are two sequences?
s = B A N A N A
t = A N A N A S

More precisely:
#1) Find the weight of the transformation that turns s into t by a specific sequence of mutations
    [Figure: two mutation paths from s to t, one via x and y with weights p_1, p_2, p_3, one via x' and y' with weights p_1', p_2', p_3']
    $p = p_1 p_2 p_3 + p_1' p_2' p_3'$
    Possible transformations:
    • mutation
    • insertion / deletion
#2) Align the amino acid sequences such that the weight of the transformation is maximised, e.g.:
    B A N A N A
      A N A N A S

Requirement: “scoring matrix” for transformations
• mutations: $s_i \to t_i$ with weight $p_{s_i,t_i}$
• insertion / deletion (gap): weight $p_G$
Common scoring matrices: BLOSUM, PAM matrices

Scoring matrix
BLOSUM62 scoring matrix (1992) (BLOck SUbstitution Matrix)
• Diagonal elements: weight for keeping the amino acid
  Cysteine: large weight since important for 3D structure (disulfide bonds)
  Tryptophan: largest amino acid, mutation unlikely
• Off-diagonal elements: weight for mutations
  Leucine to aspartate: hydrophobic to anionic, mutation unlikely
  Leucine to isoleucine: similar amino acids, mutation likely
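The diagonal and off-diagonal statements above can be checked against a handful of BLOSUM62 entries. The following is a minimal sketch, not a full implementation: it lists only a small subset of the matrix, and the names `BLOSUM62_SUBSET`, `pair_score` and `ungapped_score` are hypothetical helpers introduced for illustration. BLOSUM entries are log-odds scores, so scores are added along an alignment, which corresponds to multiplying the weights p.

```python
# A few BLOSUM62 entries (integer log-odds scores; the matrix is symmetric).
# Assumption: only the pairs used below are listed. A real application would
# load the complete 20x20 matrix instead.
BLOSUM62_SUBSET = {
    # diagonal: weight for keeping the amino acid
    ("C", "C"): 9, ("W", "W"): 11, ("L", "L"): 4,
    ("E", "E"): 5, ("G", "G"): 6, ("A", "A"): 4,
    # off-diagonal: weight for mutations
    ("L", "I"): 2, ("L", "D"): -4,
    ("K", "M"): -1, ("N", "Q"): 0, ("F", "R"): -3, ("P", "W"): -4,
}


def pair_score(a, b):
    """Look up the score for the pair (a, b), using the symmetry of the matrix."""
    if (a, b) in BLOSUM62_SUBSET:
        return BLOSUM62_SUBSET[(a, b)]
    return BLOSUM62_SUBSET[(b, a)]  # raises KeyError if the pair is not listed


def ungapped_score(s, t):
    """Sum the pair scores of two equally long sequences aligned without gaps."""
    if len(s) != len(t):
        raise ValueError("ungapped scoring needs sequences of equal length")
    return sum(pair_score(a, b) for a, b in zip(s, t))


# Diagonal entries are large for cysteine and tryptophan.
print(pair_score("C", "C"), pair_score("W", "W"))
# Leucine to isoleucine scores higher than leucine to aspartate.
print(pair_score("L", "I"), pair_score("L", "D"))
# Score of two short fragments aligned without gaps.
print(ungapped_score("EKNGFPA", "EMQGRWA"))
```

The dictionary stores each pair once and the lookup tries both orders, which keeps the subset short while respecting the symmetry of the matrix.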
BLOSUM62 scoring matrix (1992), example:
    E K N G F P A
    |     |     |
    E M Q G R W A
BLOSUM62 score = 7
One- to three-letter code: E = Glu, K = Lys, M = Met, N = Asn, Q = Gln, G = Gly, F = Phe, R = Arg, P = Pro, W = Trp, A = Ala

(A) Sequence alignment
Example:
    s = R I - L V S D K V I
    t = R I S L V - - K A I
$p = 1 \cdot 1 \cdot p_G \cdot 1 \cdot 1 \cdot p_G^2 \cdot 1 \cdot p_{VA} \cdot 1$
Here simplified: p = 1 for keeping an amino acid.

[Figure: alignment grid with s along one axis and t along the other. Cell (i,j) with weight $w_{i,j}$ is reached from $w_{i-1,j-1}$ with weight $p(s_i,t_j)$, and from $w_{i-1,j}$ or $w_{i,j-1}$ with weight $p_G$.]
Every possible alignment corresponds to one specific path!
Task:
• Find the shortest (weighted) path (#2)
• Sum over all paths (#1)

Number of possible paths/alignments: $\binom{2n}{n} \approx \frac{2^{2n}}{\sqrt{\pi n}}$
☞ n = 100: $10^{59}$
☞ n = 1000: $10^{600}$
→ NP-problem? No!
Needleman / Wunsch (1970)
Smith / Waterman (1976)

Idea (analogous to the path integral for the Schrödinger eq.): complete the sum $w_{ij}$ over all paths to (i,j) recursively:
$w_{ij} = w_{i-1,j-1}\, p_{s_i,t_j} + w_{i-1,j}\, p_G + w_{i,j-1}\, p_G$   (solves #1)
$w_{ij} = \mathrm{Max}\{ w_{i-1,j-1}\, p_{s_i,t_j},\; w_{i-1,j}\, p_G,\; w_{i,j-1}\, p_G \}$   (solves #2)
Computational cost: $O(n^2)$ (like a route planner)
“Dynamic programming”

Close relation to Smoluchowski/Feynman path integrals
[Figure: paths from $(x_0, t_0)$ to $(x_1, t_1)$]
$\psi(x_1, t_1) = \int dx_0\, \psi(x_0, t_0) \int_{\text{all paths}} \mathcal{D}x(t)\, \exp\left( \frac{i}{\hbar} \int_A^B dt\, L(x, \dot{x}, t) \right)$
with the action S: $e^{iS/\hbar} = \exp\left( \frac{i}{\hbar} \int_A^B dt\, L(x, \dot{x}, t) \right)$

Discretisation
[Figure: path through $x_0, x_1, x_2, \dots, x_n$]
$\psi(x_n, t_n) = \int dx_0\, \psi(x_0, t_0) \int dx_1\, e^{\frac{i}{\hbar} \int_{t_0}^{t_1} dt\, L} \cdots \int dx_{n-1}\, e^{\frac{i}{\hbar} \int_{t_{n-1}}^{t_n} dt\, L}$
$\psi(x_{i+1}, t_{i+1}) = \int dx_i\, \psi(x_i, t_i)\, e^{\frac{i}{\hbar} \int_{t_i}^{t_{i+1}} dt\, L}$
$\quad = \int dx_i\, \psi(x_i, t_i) \exp\left( \frac{i}{\hbar} \int_{t_i}^{t_{i+1}} dt \left( \frac{\dot{x}^2}{2} - V(x) \right) \right)$
$\quad = \int dx_i\, \psi(x_i, t_i) \exp\left( \frac{i}{\hbar} \int_{t_i}^{t_{i+1}} dt \left( \frac{1}{2} \left( \frac{x_{i+1} - x_i}{\Delta t} \right)^2 - V(x) \right) \right)$
Develop $\psi(x,t)$ in powers of $\Delta x$, … → Schrödinger equation
$\frac{\partial \psi}{\partial t} = \frac{i}{\hbar} \left( \frac{1}{2} \nabla^2 - V(x) \right) \psi$
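Returning to the two recursions for $w_{ij}$ on the dynamic-programming slide, the following is a minimal Python sketch under stated assumptions. The weights are multiplicative as on the slides. The default values `p_match=1.0`, `p_mismatch=0.1` and `p_gap=0.2` are hypothetical illustration values, not taken from BLOSUM or PAM. The boundary condition (leading gaps each cost $p_G$, i.e. global alignment) and the function name `align` are also assumptions, since the slides do not specify them.

```python
def align(s, t, p_match=1.0, p_mismatch=0.1, p_gap=0.2):
    """Return (best_weight, total_weight, aligned_s, aligned_t).

    best_weight  : weight of the single best path (Max recursion, solves #2)
    total_weight : sum of the weights of all paths (sum recursion, solves #1)
    """
    def p(a, b):
        # Hypothetical two-valued scoring matrix: match or mismatch.
        return p_match if a == b else p_mismatch

    n, m = len(s), len(t)
    best = [[0.0] * (m + 1) for _ in range(n + 1)]
    total = [[0.0] * (m + 1) for _ in range(n + 1)]
    best[0][0] = total[0][0] = 1.0  # the empty path has weight 1

    for i in range(n + 1):
        for j in range(m + 1):
            if i == 0 and j == 0:
                continue
            # Collect the up to three predecessor cells and their step weights.
            steps = []
            if i > 0 and j > 0:
                steps.append((i - 1, j - 1, p(s[i - 1], t[j - 1])))
            if i > 0:
                steps.append((i - 1, j, p_gap))
            if j > 0:
                steps.append((i, j - 1, p_gap))
            best[i][j] = max(best[a][b] * w for a, b, w in steps)
            total[i][j] = sum(total[a][b] * w for a, b, w in steps)

    # Trace back one best path from (n, m) to (0, 0).
    aligned_s, aligned_t = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and best[i][j] == best[i - 1][j - 1] * p(s[i - 1], t[j - 1]):
            aligned_s.append(s[i - 1])
            aligned_t.append(t[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and best[i][j] == best[i - 1][j] * p_gap:
            aligned_s.append(s[i - 1])
            aligned_t.append("-")
            i -= 1
        else:
            aligned_s.append("-")
            aligned_t.append(t[j - 1])
            j -= 1
    return best[n][m], total[n][m], "".join(reversed(aligned_s)), "".join(reversed(aligned_t))


best_weight, total_weight, a, b = align("BANANA", "ANANAS")
print(a)
print(b)
print(best_weight, total_weight)
```

Both tables are filled in a single pass over the (n+1) x (m+1) grid, which is the $O(n^2)$ cost quoted above. For long sequences the products become very small, so a practical variant would work with logarithms of the weights, turning products into sums.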
Sequence alignment of a ribosomal protein P0
[Figure; source: Wikipedia]

Sequence comparison: hemoglobin alpha chain vs beta chain
Dot plots: highlight similar regions in two sequences
[Figure: dot plot; axes: residue number of alpha chain, residue number of beta chain]
Details on filtering: window size: 31, match: +5, mismatch: -4

(B) Phylogenetic trees
Given: N sequences s(1), …, s(N)
Task: Find the most probable evolutionary tree:

French:   un     deux   trois   quatre    cinq
German:   eins   zwei   drei    vier      fünf
Italian:  uno    due    tre     quattro   cinque
Spanish:  un     dos    tres    cuatro    cinco
English:  one    two    three   four      five

[Figure: tree with leaves German, English, French, Spanish, Italian]

• Example
    s(1) = B A N A N A
    s(2) = A N A N A S
    s(3) = H O T D O G
  [Figure: tree; axis: distance]
• Cost: NP-complete
☞ Trees for different proteins are (usually) similar
☞ Reconstruction of evolution
Problem: horizontal gene transfer

Phylogenetic trees
[Figure] Phylogenetic tree of dogs (Nature 438, 803-819)
[Figure] Phylogenetic tree of vertebrates (Nature 496, 311-316)
[Figure] Phylogenetic tree of ribosomal RNA (Wikimedia)
[Figure] Phylogenetic tree of Indo-European languages (Science 337, 957-960 (2012))

(C) Structure prediction: from sequence to structure
• Recall: many more sequences than structures available
• “Folding problem”
• Ab initio → only possible for smallest proteins (since recently)

(a) Secondary structure prediction
Chou-Fasman method (empirical)
• Calculate probabilities from known structures (A: amino acid, S: secondary structure):
  $P(S|A) \propto \frac{P(A|S)}{P(A)} = \frac{n_{A,S}/n_S}{n_A/n}$
• Search for regions with high (average) propensities for certain secondary structures
• Search secondary structure boundaries (e.g., “helix breakers” such as proline)
☞ 75% prediction rate (compare to random guess: 33%)

Relative frequencies (propensities) of amino acids in secondary structure elements (alpha helix P<a>, beta sheet P<b>, turn P<t>) and hydropathy Hp:

A.A.   P<a>   P<b>   P<t>    Hp
A      1.42   0.83   0.66    1.80
R      0.98   0.93   0.95   -4.50
N      0.67   0.89   1.56   -3.50
D      1.01   0.54   1.46   -3.50
C      0.70   1.19   1.19    2.50
Q      1.11   1.10   0.98   -3.50
E      1.51   0.37   0.74   -3.50
G      0.57   0.75   1.56   -0.40
H      1.00   0.87   0.95   -3.20
I      1.08   1.60   0.47    4.50
L      1.21   1.30   0.59    3.80
K      1.16   0.74   1.01   -3.90
M      1.45   1.05   0.60    1.90
F      1.13   1.38   0.60    2.80
P      0.57   0.55   1.52   -1.60
S      0.77   0.75   1.43   -0.80
T      0.83   1.19   0.96   -0.70
W      1.08   1.37   0.96   -0.90
Y      0.69   1.47   1.14   -1.30
V      1.06   1.70   0.50    4.20

(C) Structure prediction
(b) Homology modelling
Observation in PDB: similar sequence (30% identity) → similar structure
☞ Strategy:
• Search homologous sequence with known structure
• Align sequences
• Change differing amino acids in template structure
• Meet steric criteria (avoid atomic overlaps), and other criteria
• Optimize rotamers
• Critical: correct alignment
[Figure: crystal structures of Aquaporin-1 and GlpF]
[Figure: GlpF crystal structure vs. GlpF model based on Aqp1 (bad due to wrong alignment)]

(C) Structure prediction: from sequence to structure
(c) Protein threading
No homologous structure available?
☞ Into which known fold does the sequence fit best?
[Figure: sequence / structure statistics $p(a_i, s_j)$ for amino acids (aa: S, A, R, N, D, …) in α-helix, β-sheet and turn]
☞ Find the known fold with the maximal $\sum_{i=1}^{N} \ln p(a_i, s_j)$
Improvements: better statistics, e.g. consider triplets, spatial neighbours, cys-cys bonds, …
E.g., make use of the hydrophobicity of amino acids (non-polar surface area [Å^2]): hydrophobic residues are mainly buried inside.
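The threading score $\sum_i \ln p(a_i, s_j)$ above can be sketched in a few lines of Python. This is only an illustration under loud assumptions: the Chou-Fasman propensities P<a>, P<b>, P<t> from the secondary-structure table are used as a stand-in for the sequence/structure statistics $p(a_i, s_j)$, each candidate fold is reduced to a string of per-residue states, and the state letters H (helix), E (sheet), T (turn) as well as the names `threading_score` and `best_fold` are hypothetical.

```python
import math

# Chou-Fasman propensities (alpha helix, beta sheet, turn) per amino acid,
# copied from the table in the secondary-structure section.
# Assumption: they serve here as a stand-in for p(a_i, s_j).
PROPENSITY = {
    "A": (1.42, 0.83, 0.66), "R": (0.98, 0.93, 0.95), "N": (0.67, 0.89, 1.56),
    "D": (1.01, 0.54, 1.46), "C": (0.70, 1.19, 1.19), "Q": (1.11, 1.10, 0.98),
    "E": (1.51, 0.37, 0.74), "G": (0.57, 0.75, 1.56), "H": (1.00, 0.87, 0.95),
    "I": (1.08, 1.60, 0.47), "L": (1.21, 1.30, 0.59), "K": (1.16, 0.74, 1.01),
    "M": (1.45, 1.05, 0.60), "F": (1.13, 1.38, 0.60), "P": (0.57, 0.55, 1.52),
    "S": (0.77, 0.75, 1.43), "T": (0.83, 1.19, 0.96), "W": (1.08, 1.37, 0.96),
    "Y": (0.69, 1.47, 1.14), "V": (1.06, 1.70, 0.50),
}

# Hypothetical one-letter state codes for the three columns above.
STATE_INDEX = {"H": 0, "E": 1, "T": 2}


def threading_score(sequence, fold):
    """Sum of ln p(a_i, s_i) for a sequence threaded onto a string of states."""
    if len(sequence) != len(fold):
        raise ValueError("sequence and fold must have the same length")
    return sum(
        math.log(PROPENSITY[aa][STATE_INDEX[state]])
        for aa, state in zip(sequence, fold)
    )


def best_fold(sequence, folds):
    """Return the candidate fold with the maximal threading score."""
    return max(folds, key=lambda fold: threading_score(sequence, fold))


# Toy candidates: all-helix, all-sheet, all-turn folds of matching length.
for seq in ("MEALLKE", "VIVYVT"):
    candidates = [state * len(seq) for state in "HET"]
    print(seq, best_fold(seq, candidates))
```

Real threading compares against a library of known folds with position-dependent environments; the three uniform toy folds here only show how the log-sum ranks candidates.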
Non-polar surface area and estimated hydrophobic effect per residue:

Residue   Non-polar surface area [Å^2]   Estimated hydrophobic effect [kcal/mol]
Trp       236                            4.11
Leu       164                            4.10
Ile       155                            3.88
Phe       194                            3.46
Met       137                            3.43
Val       135                            3.38
Pro       124                            3.10
Lys       122                            3.05
Tyr       154                            2.81
His       129                            2.45
Thr        90                            2.25
Arg        89                            2.23
Ala        86                            2.15
Glu        69                            1.73
Gln        66                            1.65
Ser        56                            1.40
Cys        48                            1.20
Gly        47                            1.18
Asp        45                            1.13
Asn        42                            1.05

(C) Structure prediction
(d) Empirical potentials
E.g., ψ-angles between amino acids, e.g., Ala-Asn:
[Figure: histogram $h(\psi)$]
$V = -k_B T \ln h(\psi)$
☞ 20x20 pair interactions $V_{ij}$
☞ minimize $\sum_{i=1}^{N} V_{S_i,S_{i+1}}$

Ramachandran plots
• Another source of empirical potentials: Ramachandran plots
• Distribution of φ / ψ backbone angles
[Figure: Ramachandran plot for glycine; Ramachandran plot for proline]

Bottom line: structure prediction is still not very accurate and reliable!

Today’s summary
Learning from sequence alignments
• Multiple sequence alignments
  → Identify amino acids that were conserved during evolution
  → Relevant for protein function or protein stability
• Quantify how similar two (or more) sequences are
  → Quantify distance in evolution, build evolutionary (phylogenetic) trees
• Find the best (or most likely) alignment of two sequences
  → homology modelling
Structure prediction, from sequence to structure
• Homology modelling
• Threading
• Ab initio modelling, empirical potentials

Master / Bachelor projects
• Interpretation of X-ray scattering experiments
• Membrane biophysics
Contact: Jochen Hub, [email protected], phone 39-14189