CS790 – Introduction to Bioinformatics
Disulfide Bonds
Two cysteines in close proximity will form a covalent bond
  • Called a disulfide bond, disulfide bridge, or dicysteine bond
Significantly stabilizes tertiary structure

Protein Folding Intro to Bioinformatics

Determining Protein Structure
There are O(100,000) distinct proteins in the human proteome.
3D structures have been determined for 14,000 proteins, from all organisms
  • Includes duplicates with different ligands bound, etc.
Coordinates are determined by X-ray crystallography

X-Ray Crystallography
[Figure: protein crystal, ~0.5 mm]
  • The crystal is a mosaic of millions of copies of the protein.
  • As much as 70% is solvent (water)!
  • May take months (and a “green” thumb) to grow.

X-Ray diffraction
Image is averaged over:
  • Space (many copies)
  • Time (of the diffraction experiment)

Electron Density Maps
Resolution is dependent on the quality/regularity of the crystal
R-factor is a measure of “leftover” electron density
Solvent fitting
Refinement

The Protein Data Bank
http://www.rcsb.org/pdb/

ATOM      1  N    ALA E   1      22.382  47.782 112.975  1.00 24.09      3APR 213
ATOM      2  CA   ALA E   1      22.957  47.648 111.613  1.00 22.40      3APR 214
ATOM      3  C    ALA E   1      23.572  46.251 111.545  1.00 21.32      3APR 215
ATOM      4  O    ALA E   1      23.948  45.688 112.603  1.00 21.54      3APR 216
ATOM      5  CB   ALA E   1      23.932  48.787 111.380  1.00 22.79      3APR 217
ATOM      6  N    GLY E   2      23.656  45.723 110.336  1.00 19.17      3APR 218
ATOM      7  CA   GLY E   2      24.216  44.393 110.087  1.00 17.35      3APR 219
ATOM      8  C    GLY E   2      25.653  44.308 110.579  1.00 16.49      3APR 220
ATOM      9  O    GLY E   2      26.258  45.296 110.994  1.00 15.35      3APR 221
ATOM     10  N    VAL E   3      26.213  43.110 110.521  1.00 16.21      3APR 222
ATOM     11  CA   VAL E   3      27.594  42.879 110.975  1.00 16.02      3APR 223
ATOM     12  C    VAL E   3      28.569  43.613 110.055  1.00 15.69      3APR 224
ATOM     13  O    VAL E   3      28.429  43.444 108.822  1.00 16.43      3APR 225
ATOM     14  CB   VAL E   3      27.834  41.363 110.979  1.00 16.66      3APR 226
ATOM     15  CG1  VAL E   3      29.259  41.013 111.404  1.00 17.35      3APR 227
ATOM     16  CG2  VAL E   3      26.811  40.649 111.850  1.00 17.03      3APR 228

A Peek at Protein Function
Serine proteases – cleave other proteins
  • Catalytic Triad: ASP, HIS, SER

Cleaving the peptide bond

Three Serine Proteases
Chymotrypsin – Cleaves the peptide bond on the carboxyl side of aromatic (ring) residues: Trp, Phe, Tyr; and large hydrophobic residues: Met.
Trypsin – Cleaves after Lys (K) or Arg (R)
  • Positive charge
Elastase – Cleaves after small residues: Gly, Ala, Ser, Cys

Specificity
Binding Pocket

The Protein Folding Problem
Central question of molecular biology: “Given a particular sequence of amino acid residues (primary structure), what will the tertiary/quaternary structure of the resulting protein be?”
Input: AAVIKYGCAL…
Output: φ1ψ1, φ2ψ2, … = backbone conformation (no side chains yet)

Protein Folding – Biological perspective
“Central dogma”: Sequence specifies structure
Denature – to “unfold” a protein back to random coil configuration
  • β-mercaptoethanol – breaks disulfide bonds
  • Urea or guanidine hydrochloride – denaturant
  • Also heat or pH
Anfinsen’s experiments
  • Denatured ribonuclease
  • Spontaneously regained enzymatic activity
  • Evidence that it re-folded to native conformation

Folding intermediates
Levinthal’s paradox – Consider a 100-residue protein. If each residue can take only 3 positions, there are 3^100 ≈ 5 × 10^47 possible conformations.
  • If it takes 10^-13 s to convert from one structure to another, exhaustive search would take 1.6 × 10^27 years!
Folding must proceed by progressive stabilization of intermediates
  • Molten globules – most secondary structure formed, but much less compact than “native” conformation.
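The Levinthal estimate on the slide above can be reproduced with a few lines of Perl. This is only a sketch of the arithmetic: the inputs (100 residues, 3 positions each, 10^-13 s per conversion) are the slide's figures, while the sub name `levinthal_years` and the 365.25-day year are assumptions made for the illustration.

```perl
use strict;
use warnings;

# Levinthal estimate: time for an exhaustive search over all conformations.
# The three inputs below are the figures quoted on the slide.
my $residues         = 100;     # chain length
my $states_per_res   = 3;       # positions available to each residue
my $seconds_per_step = 1e-13;   # time to convert one structure to another

# Assumption for this sketch: a year of 365.25 days.
my $seconds_per_year = 365.25 * 24 * 3600;

# Hypothetical helper: returns (number of conformations, search time in years).
sub levinthal_years {
    my ($n, $states, $step) = @_;
    my $conformations = $states ** $n;    # 3^100, held as a floating-point value
    my $years = $conformations * $step / $seconds_per_year;
    return ($conformations, $years);
}

my ($conformations, $years) =
    levinthal_years($residues, $states_per_res, $seconds_per_step);
printf "conformations: %.1e\n", $conformations;
printf "search time:   %.1e years\n", $years;
```

The result lands on the order of 10^47 conformations and 10^27 years, matching the slide; the point is the scale rather than the exact digits.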
Forces driving protein folding
It is believed that hydrophobic collapse is a key driving force for protein folding
  • Hydrophobic core
  • Polar surface interacting with solvent
Minimum volume (no cavities)
Disulfide bond formation stabilizes
Hydrogen bonds
Polar and electrostatic interactions

Folding help
Proteins are, in fact, only marginally stable
  • Native state is typically only 5 to 10 kcal/mole more stable than the unfolded form
Many proteins help in folding
  • Protein disulfide isomerase – catalyzes shuffling of disulfide bonds
  • Chaperones – break up aggregates and (in theory) unfold misfolded proteins

The Hydrophobic Core
Hemoglobin A is the protein in red blood cells (erythrocytes) responsible for binding oxygen.
The mutation E6V in the β chain places a hydrophobic Val on the surface of hemoglobin
The resulting “sticky patch” causes hemoglobin S to agglutinate (stick together) and form fibers which deform the red blood cell and do not carry oxygen efficiently
Sickle cell anemia was the first identified molecular disease

Sickle Cell Anemia
Sequestering hydrophobic residues in the protein core protects proteins from hydrophobic agglutination.

Computational Problems in Protein Folding
Two key questions:
  • Evaluation – how can we tell a correctly folded protein from an incorrectly folded protein?
      H-bonds, electrostatics, hydrophobic effect, etc.
      Derive a function, see how well it does on “real” proteins
  • Optimization – once we get an evaluation function, can we optimize it?
      Simulated annealing/Monte Carlo
      EC (evolutionary computing)
      Heuristics
We’ll talk more about these methods later…

Fold Optimization
Simple lattice models (HP models)
  • Two types of residues: hydrophobic and polar
  • 2-D or 3-D lattice
  • The only force is hydrophobic collapse
  • Score = number of HH contacts

Scoring Lattice Models
H/P model scoring: count noncovalent hydrophobic interactions.
Sometimes:
  • Penalize for buried polar or surface hydrophobic residues

What can we do with lattice models?
For smaller polypeptides, exhaustive search can be used
  • Looking at the “best” fold, even in such a simple model, can teach us interesting things about the protein folding process
For larger chains, other optimization and search methods must be used
  • Greedy, branch and bound
  • Evolutionary computing, simulated annealing
  • Graph theoretical methods

Learning from Lattice Models
The “hydrophobic zipper” effect: Ken Dill ~ 1997

Representing a lattice model
Absolute directions
  • UURRDLDRRU
Relative directions
  • LFRFRRLLFFL
  • Advantage: we can’t have UD or RL, as we can in absolute directions
  • Only three directions: LRF
What about bumps? LFRRR
  • Bad score
  • Use a better representation

Preference-order representation
Each position has two “preferences”
  • If it can’t have either of the two, it will take the “least favorite” path if possible
Example: {LR},{FL},{RL},{FR},{RL},{RL},{FR},{RF}
Can still cause bumps: {LF},{FR},{RL},{FL},{RL},{FL},{RF},{RL},{FL}

“Decoding” the representation
The optimizer works on the representation, but to score, we have to “decode” it into a structure that lets us check for bumps and compute the score.
Example: How many bumps in: URDDLLDRURU?
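The relative-direction representation (L, F, R) from the earlier slide can be decoded in the same spirit. The sketch below is one possible decoder, not the course's own code. The sub names `decode_relative`, `count_bumps` and `hh_contacts` are hypothetical, and it assumes that the first residue sits at (0,0), that the initial heading is "up", and that the sequence has one more residue than there are moves. Occupied sites are kept in a hash so that negative coordinates need no special handling.

```perl
use strict;
use warnings;

# Decode a relative-direction string (L, F, R) into 2-D lattice coordinates.
# Assumptions (not from the slides): residue 0 is at (0,0), initial heading is up.
sub decode_relative {
    my ($moves) = @_;
    my @heading = ([0, 1], [1, 0], [0, -1], [-1, 0]);   # up, right, down, left
    my $h = 0;
    my ($x, $y) = (0, 0);
    my @coords = ([$x, $y]);
    for my $m (split //, $moves) {
        $h = ($h + 1) % 4 if $m eq 'R';   # turn right
        $h = ($h + 3) % 4 if $m eq 'L';   # turn left; F keeps the heading
        $x += $heading[$h][0];
        $y += $heading[$h][1];
        push @coords, [$x, $y];
    }
    return @coords;
}

# A bump is a residue placed on a lattice site that is already occupied.
sub count_bumps {
    my @coords = @_;
    my %seen;
    my $bumps = 0;
    for my $c (@coords) {
        $bumps++ if $seen{"$c->[0],$c->[1]"}++;
    }
    return $bumps;
}

# HP score: count H-H pairs on adjacent sites that are not chain neighbours.
sub hh_contacts {
    my ($sequence, @coords) = @_;
    my @res = split //, $sequence;
    my $score = 0;
    for my $i (0 .. $#coords) {
        for my $j ($i + 2 .. $#coords) {
            next unless $res[$i] eq 'H' && $res[$j] eq 'H';
            my $dist = abs($coords[$i][0] - $coords[$j][0])
                     + abs($coords[$i][1] - $coords[$j][1]);
            $score++ if $dist == 1;
        }
    }
    return $score;
}

# The slide's bump example: three right turns in a row close a square.
print "bumps in LFRRR: ", count_bumps(decode_relative("LFRRR")), "\n";
# A 4-residue chain folded into a square brings its two H ends into contact.
print "HH contacts for HPPH/FRR: ", hh_contacts("HPPH", decode_relative("FRR")), "\n";
```

Under these conventions LFRRR revisits the site reached after the first move, which is the bump the "Representing a lattice model" slide warns about.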
We can do it on graph paper
  • Start at 0,0
  • Fill in the graph
In PERL we use a two-dimensional array

A two-dimensional array in PERL
    $configuration = "URDDLLDRURU";
    $sequence = "HPPHHPHPHHH";
    # Mark every cell of a 101 x 101 grid as empty
    foreach $i (0..100) {
        foreach $j (0..100) {
            $grid[$i][$j] = "empty";
        }
    }
    # Start in the middle of the grid so that moves down or left
    # never produce a negative array index
    $x = 50; $y = 50;
    @moves = split(//, $configuration);
    @residues = split(//, $sequence);

Setting up the grid
    foreach $move (@moves) {
        $residue = shift(@residues);
        # "eq" compares strings; "=" would assign
        if ($move eq "U") { $y++; }
        if ($move eq "R") { $x++; }
        if ($move eq "D") { $y--; }
        if ($move eq "L") { $x--; }
        if ($grid[$x][$y] ne "empty") {
            print "BUMP!\n";
        } else {
            $grid[$x][$y] = $residue;
        }
    }

More realistic models
Higher resolution lattices (45° lattice, etc.)
Off-lattice models
  • Local moves
  • Optimization/search methods and φ/ψ representations
      Greedy search
      Branch and bound
      EC, Monte Carlo, simulated annealing, etc.

The Other Half of the Picture
Now that we have a more realistic off-lattice model, we need a better energy function to evaluate a conformation (fold).
Theoretical force field:
  • G = G_van der Waals + G_h-bonds + G_solvent + G_coulomb
Empirical force fields
  • Start with a database
  • Look at neighboring residues – similar to known protein folds?

Threading: Fold recognition
Given:
  • Sequence: IVACIVSTEYDVMKAAR…
  • A database of molecular coordinates
Map the sequence onto each fold
Evaluate
  • Objective 1: improve scoring function
  • Objective 2: folding

Secondary Structure Prediction
    AGVGTVPMTAYGNDIQYYGQVT…
    A-VGIVPM-AYGQDIQY-GQVT…
    AG-GIIP--AYGNELQ--GQVT…
    AGVCTVPMTA---ELQYYG--T…

    AGVGTVPMTAYGNDIQYYGQVT…
    ----hhhHHHHHHhhh--eeEE…

Secondary Structure Prediction
Easier than folding
  • Current algorithms can predict secondary structure with 70-80% accuracy
Chou, P.Y. & Fasman, G.D. (1974). Biochemistry, 13, 211-222.
  • Based on frequencies of occurrence of residues in helices and sheets
PHD – Neural network based
  • Uses a multiple sequence alignment
  • Rost & Sander, Proteins, 1994, 19, 55-72

Chou-Fasman Parameters
Name            Abbrv  P(a)  P(b)  P(turn)  f(i)   f(i+1)  f(i+2)  f(i+3)
Alanine         A      142    83    66      0.060  0.076   0.035   0.058
Arginine        R       98    93    95      0.070  0.106   0.099   0.085
Aspartic Acid   D      101    54   146      0.147  0.110   0.179   0.081
Asparagine      N       67    89   156      0.161  0.083   0.191   0.091
Cysteine        C       70   119   119      0.149  0.050   0.117   0.128
Glutamic Acid   E      151    37    74      0.056  0.060   0.077   0.064
Glutamine       Q      111   110    98      0.074  0.098   0.037   0.098
Glycine         G       57    75   156      0.102  0.085   0.190   0.152
Histidine       H      100    87    95      0.140  0.047   0.093   0.054
Isoleucine      I      108   160    47      0.043  0.034   0.013   0.056
Leucine         L      121   130    59      0.061  0.025   0.036   0.070
Lysine          K      114    74   101      0.055  0.115   0.072   0.095
Methionine      M      145   105    60      0.068  0.082   0.014   0.055
Phenylalanine   F      113   138    60      0.059  0.041   0.065   0.065
Proline         P       57    55   152      0.102  0.301   0.034   0.068
Serine          S       77    75   143      0.120  0.139   0.125   0.106
Threonine       T       83   119    96      0.086  0.108   0.065   0.079
Tryptophan      W      108   137    96      0.077  0.013   0.064   0.167
Tyrosine        Y       69   147   114      0.082  0.065   0.114   0.125
Valine          V      106   170    50      0.062  0.048   0.028   0.053

Chou-Fasman Algorithm
Identify α-helices
  • Find 4 out of 6 contiguous amino acids that have P(a) > 100
  • Extend the region until 4 amino acids with P(a) < 100 are found
  • Compute P(a) and P(b) for the region; if the region is >5 residues and P(a) > P(b), identify it as a helix
Repeat for β-sheets [use P(b)]
If an α and a β region overlap, the overlapping region is predicted according to P(a) and P(b)

Chou-Fasman, cont’d
Identify hairpin turns:
  • P(t) = f(i) of the residue × f(i+1) of the next residue × f(i+2) of the following residue × f(i+3) of the residue at position (i+3)
  • Predict a hairpin turn starting at positions where:
      P(t) > 0.000075
      The average P(turn) for the four residues > 100
      P(a) < P(turn) > P(b) for the four residues
Accuracy 60-65%

Chou-Fasman Example
CAENKLDHVRGPTCILFMTWYNDGP
CAENKL – Potential helix (!C and !N)
Residues with P(a) < 100: RNCGPSTY
  • Extend: When we reach RGPT, we must stop
  • CAENKLDHV: P(a) = 972, P(b) = 843
  • Declare alpha helix
Identifying a hairpin turn
  • VRGP: P(t) = 0.000085
  • Average P(turn) = 113.25
  • Avg P(a) = 79.5, Avg P(b) = 98.25
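The hairpin-turn numbers in the example can be checked against the Chou-Fasman Parameters table with a short Perl sketch. The parameter values are copied from that table; the hash `%cf` and the subs `average`, `turn_stats` and `region_sum` are hypothetical names introduced only for this illustration, and the code covers just the turn test and the region sums, not the full algorithm.

```perl
use strict;
use warnings;
use List::Util qw(sum);

# residue => [P(a), P(b), P(turn), f(i), f(i+1), f(i+2), f(i+3)]
my %cf = (
    A => [142,  83,  66, 0.060, 0.076, 0.035, 0.058],
    R => [ 98,  93,  95, 0.070, 0.106, 0.099, 0.085],
    D => [101,  54, 146, 0.147, 0.110, 0.179, 0.081],
    N => [ 67,  89, 156, 0.161, 0.083, 0.191, 0.091],
    C => [ 70, 119, 119, 0.149, 0.050, 0.117, 0.128],
    E => [151,  37,  74, 0.056, 0.060, 0.077, 0.064],
    Q => [111, 110,  98, 0.074, 0.098, 0.037, 0.098],
    G => [ 57,  75, 156, 0.102, 0.085, 0.190, 0.152],
    H => [100,  87,  95, 0.140, 0.047, 0.093, 0.054],
    I => [108, 160,  47, 0.043, 0.034, 0.013, 0.056],
    L => [121, 130,  59, 0.061, 0.025, 0.036, 0.070],
    K => [114,  74, 101, 0.055, 0.115, 0.072, 0.095],
    M => [145, 105,  60, 0.068, 0.082, 0.014, 0.055],
    F => [113, 138,  60, 0.059, 0.041, 0.065, 0.065],
    P => [ 57,  55, 152, 0.102, 0.301, 0.034, 0.068],
    S => [ 77,  75, 143, 0.120, 0.139, 0.125, 0.106],
    T => [ 83, 119,  96, 0.086, 0.108, 0.065, 0.079],
    W => [108, 137,  96, 0.077, 0.013, 0.064, 0.167],
    Y => [ 69, 147, 114, 0.082, 0.065, 0.114, 0.125],
    V => [106, 170,  50, 0.062, 0.048, 0.028, 0.053],
);

# Sum of one parameter column (0 = P(a), 1 = P(b), 2 = P(turn)) over a region.
sub region_sum {
    my ($col, $region) = @_;
    return sum(map { $cf{$_}[$col] } split //, $region);
}

# Average of one parameter column over a region.
sub average {
    my ($col, $region) = @_;
    return region_sum($col, $region) / length($region);
}

# Apply the three hairpin-turn conditions to a four-residue window.
# Returns (P(t), avg P(a), avg P(b), avg P(turn), 1 or 0).
sub turn_stats {
    my ($tetra) = @_;
    my @r = split //, $tetra;
    my $p_t = 1;
    $p_t *= $cf{ $r[$_] }[ 3 + $_ ] for 0 .. 3;   # f(i) * f(i+1) * f(i+2) * f(i+3)
    my ($pa, $pb, $pturn) = map { average($_, $tetra) } 0 .. 2;
    my $is_turn = ($p_t > 0.000075 && $pturn > 100
                   && $pturn > $pa && $pturn > $pb) ? 1 : 0;
    return ($p_t, $pa, $pb, $pturn, $is_turn);
}

my ($p_t, $pa, $pb, $pturn, $is_turn) = turn_stats("VRGP");
printf "VRGP: P(t) = %.6f, avg P(turn) = %.2f, avg P(a) = %.2f, avg P(b) = %.2f, turn = %d\n",
    $p_t, $pturn, $pa, $pb, $is_turn;
printf "CAENKLDHV: P(a) = %d, P(b) = %d\n",
    region_sum(0, "CAENKLDHV"), region_sum(1, "CAENKLDHV");
```

For VRGP all three conditions hold, and the region sums for CAENKLDHV agree with the 972 and 843 quoted in the example.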