* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Slides 3 - Department of Computer and Information Science and
Artificial gene synthesis wikipedia , lookup
Peptide synthesis wikipedia , lookup
Gene expression wikipedia , lookup
Expression vector wikipedia , lookup
G protein–coupled receptor wikipedia , lookup
Ribosomally synthesized and post-translationally modified peptides wikipedia , lookup
Magnesium transporter wikipedia , lookup
Amino acid synthesis wikipedia , lookup
Interactome wikipedia , lookup
Point mutation wikipedia , lookup
Biosynthesis wikipedia , lookup
Ancestral sequence reconstruction wikipedia , lookup
Genetic code wikipedia , lookup
Structural alignment wikipedia , lookup
Protein purification wikipedia , lookup
Western blot wikipedia , lookup
Metalloprotein wikipedia , lookup
Protein–protein interaction wikipedia , lookup
Two-hybrid screening wikipedia , lookup
CAP5510 – Bioinformatics Protein Structures Tamer Kahveci CISE Department University of Florida 1 What and Why? • • • Proteins fold into a three dimensional shape Structure can reveal functional information that we can not find from sequence Misfolding proteins can cause diseases Hemoglobin – Sickle cell anemia, mad cow disease • Used in drug design Normal v.s. sickled blood cells E→V HIV protease inhibitor 2 Goals • Understand protein structures • Primary, secondary, tertiary • Learn how protein shapes are – determined – Predicted • Structure comparison (?) 3 A Protein Sequence >gi|22330039|ref|NP_683383.1| unknown protein; protein id: At1g45196.1 [Arabidopsis thaliana] MPSESSYKVHRPAKSGGSRRDSSPDSIIFTPESNLSLFSSASVSVDRCSSTSDAHDRDDSLISAWKEEFEVKKDDESQNL DSARSSFSVALRECQERRSRSEALAKKLDYQRTVSLDLSNVTSTSPRVVNVKRASVSTNKSSVFPSPGTPTYLHSMQKGW SSERVPLRSNGGRSPPNAGFLPLYSGRTVPSKWEDAERWIVSPLAKEGAARTSFGASHERRPKAKSGPLGPPGFAYYSLY SPAVPMVHGGNMGGLTASSPFSAGVLPETVSSRGSTTAAFPQRIDPSMARSVSIHGCSETLASSSQDDIHESMKDAATDA QAVSRRDMATQMSPEGSIRFSPERQCSFSPSSPSPLPISELLNAHSNRAEVKDLQVDEKVTVTRWSKKHRGLYHGNGSKM 4 Amino Acid Composition • Basic Amino Acid Structure: – The side chain, R, varies for each of the 20 amino acids Side chain R H O N C C H Amino group H OH Carboxyl group 5 The Peptide Bond • Dehydration synthesis • Repeating backbone: N–C –C –N–C –C O O – Convention – start at amino terminus and proceed to carboxy terminus 6 Peptidyl polymers • A few amino acids in a chain are called a polypeptide. A protein is usually composed of 50 to 400+ amino acids. • We call the units of a protein amino acid residues. carbonyl carbon amide nitrogen 7 Side chain properties • Carbon does not make hydrogen bonds with water easily – hydrophobic • O and N are generally more likely than C to hbond to water – hydrophilic • We group the amino acids into three general groups: – Hydrophobic – Charged (positive/basic & negative/acidic) – Polar 8 The Hydrophobic Amino Acids 9 The Charged Amino Acids 10 The Polar Amino Acids 11 More Polar Amino Acids And then there’s… 12 Planarity of the Peptide Bond Psi () – the angle of rotation about the C-C bond. Phi () – the angle of rotation about the N-C bond. The planar bond angles and bond lengths are fixed. 13 Primary & Secondary Structure • Primary structure = the linear sequence of amino acids comprising a protein: AGVGTVPMTAYGNDIQYYGQVT… • Secondary structure – Regular patterns of hydrogen bonding in proteins result in two patterns that emerge in nearly every protein structure known: the -helix and the -sheet – The location of direction of these periodic, repeating structures is known as the secondary structure of the protein 14 The Alpha Helix 60° 15 Properties of the Alpha Helix • 60° • Hydrogen bonds between C=O of residue n, and NH of residue n+4 • 3.6 residues/turn • 1.5 Å/residue rise • 100°/residue turn 16 Properties of -helices • 4 – 40+ residues in length • Often amphipathic or “dual-natured” – Half hydrophobic and half hydrophilic • If we examine many -helices, we find trends… – Helix formers: Ala, Glu, Leu, Met – Helix breakers: Pro, Gly, Tyr, Ser 17 The beta strand (& sheet) 135° +135° 18 Properties of beta sheets • Formed of stretches of 5-10 residues in extended conformation • Parallel/aniparallel, contiguous/non-contiguous 19 Anti-Parallel Beta Sheets 20 Parallel Beta Sheets 21 Mixed Beta Sheets 22 Turns and Loops • Secondary structure elements are connected by regions of turns and loops • Turns – short regions of non-, non- conformation • Loops – larger stretches with no secondary structure. – Sequences vary much more than secondary structure regions 23 Ramachandran Plot 24 Levels of Protein Structure • Secondary structure elements combine to form tertiary structure • Quaternary structure occurs in multienzyme complexes 25 Protein Structure Example Beta Sheet Helix Loop ID: 12as 2 chains 26 Views of a Protein Wireframe Ball and stick 27 Views of a protein Spacefill Cartoon CPK colors Carbon = green, black, or grey Nitrogen = blue Oxygen = red Sulfur = yellow Hydrogen = white 28 Common Protein Motifs 29 Mostly Helical Folding Motifs • Four helical bundle: Globin domain: 30 / Motifs • / barrel: 31 Open Twisted Beta Sheets 32 Beta Barrels 33 Determining the Structure of a Protein Experimental Methods •X-ray •NMR As of August 2013, structure of > 85,000 proteins are determined 34 X-Ray Crystallography Discovery of X-rays (Wilhelm Conrad Röntgen, 1895) Crystals diffract X-rays in regular patterns (Max Von Laue, 1912) The first X-ray diffraction pattern from a protein crystal (Dorothy Hodgkin, 1934) 35 X-Ray Crystallography • Grow millions of protein crystals – Takes months • Expose to radiation beam • Analyze the image with computer – Average over many copies of images • PDB • Not all proteins can be crystallized! 36 NMR • Nuclear Magnetic Resonance • Nuclei of atoms vibrate when exposed to oscillating magnetic field • Detect vibrations by external sensors • Computes inter-atomic distances. • Requires complex analysis. NMR can be used for short sequences (<200 residues) • More than one model can be derived from NMR. 37 Determining the Structure of a Protein Computational Methods 38 The Protein Folding Problem • Central question of molecular biology: “Given a particular sequence of amino acid residues (primary structure), what will the secondary/tertiary/quaternary structure of the resulting protein be?” • Input: AAVIKYGCAL… Output: 11, 22… 39 Structure v.s. Sequence • Observation: A protein with the same sequence (under the same circumstances) yields the same shape. • Protein folds into a shape that minimizes the energy needed to stay in that shape. • Protein folds in ~10-15 seconds. 40 Secondary Structure Prediction 41 Chou-Fasman methods • Uses statistically obtained Chou-Fasman parameters. • For each amino acid has – P(a): alpha – P(b): beta – P(t): turn – f(): additional turn parameter. 42 Chou-Fasman Parameters Name Alanine Arginine Aspartic Acid Asparagine Cysteine Glutamic Acid Glutamine Glycine Histidine Isoleucine Leucine Lysine Methionine Phenylalanine Proline Serine Threonine Tryptophan Tyrosine Valine Abbrv A R D N C E Q G H I L K M F P S T W Y V P(a) 142 98 101 67 70 151 111 57 100 108 121 114 145 113 57 77 83 108 69 106 P(b) P(turn) 83 66 93 95 54 146 89 156 119 119 37 74 110 98 75 156 87 95 160 47 130 59 74 101 105 60 138 60 55 152 75 143 119 96 137 96 147 114 170 50 f(i) 0.06 0.07 0.147 0.161 0.149 0.056 0.074 0.102 0.14 0.043 0.061 0.055 0.068 0.059 0.102 0.12 0.086 0.077 0.082 0.062 f(i+1) 0.076 0.106 0.11 0.083 0.05 0.06 0.098 0.085 0.047 0.034 0.025 0.115 0.082 0.041 0.301 0.139 0.108 0.013 0.065 0.048 f(i+2) 0.035 0.099 0.179 0.191 0.117 0.077 0.037 0.19 0.093 0.013 0.036 0.072 0.014 0.065 0.034 0.125 0.065 0.064 0.114 0.028 f(i+3) 0.058 0.085 0.081 0.091 0.128 0.064 0.098 0.152 0.054 0.056 0.07 0.095 0.055 0.065 0.068 0.106 0.079 0.167 0.125 0.053 43 C.-F. Alpha Helix Prediction (1) A E A T T L C M Q S T Y C Y V P(a) 142 151 142 83 83 121 70 145 111 77 83 69 70 69 106 P(b) 83 37 83 119 119 130 119 105 110 75 119 147 119 147 170 • Find P(a) for all letters • Find 6 contiguous letters, at least 4 of them have P(a) > 100 • Declare these regions as alpha helix 44 C.-F. Alpha Helix Prediction (2) A E A T T L C M Q S T Y C Y V P(a) 142 151 142 83 83 121 70 145 111 77 83 69 70 69 106 P(b) 83 37 83 119 119 130 119 105 110 75 119 147 119 147 170 • Extend in both directions until 4 consecutive letters with P(a) < 100 found 45 C.-F. Alpha Helix Prediction (3) A E A T T L C M Q S T Y C Y V P(a) 142 151 142 83 83 121 70 145 111 77 83 69 70 69 106 P(b) 83 37 83 119 119 130 119 105 110 75 119 147 119 147 170 • Find sum of P(a) (Sa) and sum of P(b) (Sb) in the extended region – If region is long enough ( >= 5 letters) and P(a) > P(b) then declare the extended region as alpha helix 46 C.-F. Beta Sheet Prediction • Same as alpha helix replace P(a) with P(b) • Resolving overlapping alpha helix & beta sheet – Compute sum of P(a) (Sa) and sum of P(b) (Sb) in the overlap. – If Sa > Sb => alpha helix – If Sb > Sa => beta sheet 47 C.-F. Turn Prediction A E A T T L C M Q S T Y C Y V P(a) 142 151 142 83 83 121 70 145 111 77 83 69 70 69 106 P(b) 83 37 83 119 119 130 119 105 110 75 119 147 119 69 170 P(t) 66 74 66 96 96 59 119 60 98 143 96 114 119 114 50 i i+1 i+2 i+3 f() • An amino acid is predicted as turn if all of the following holds: – f(i)*f(i+1)*f(i+2)*f(i+3) > 0.000075 – Avg(P(i+k)) > 100, for k=0, 1, 2, 3 – Sum(P(t)) > Sum(P(a)) and Sum(P(b)) for i+k, (k=0, 1, 2, 3) 48 Other Methods for SSE Prediction • Similarity searching – Predator • Markov chain • Neural networks – PHD • ~65% to 80% accuracy 49 Tertiary Structure Prediction 50 Forces driving protein folding • It is believed that hydrophobic collapse is a key driving force for protein folding – Hydrophobic core – Polar surface interacting with solvent • • • • Minimum volume (no cavities) Disulfide bond formation stabilizes Hydrogen bonds Polar and electrostatic interactions 51 Fold Optimization • Simple lattice models (HP-models or Hydrophobic-Polar models) – Two types of residues: hydrophobic and polar – 2-D or 3-D lattice – The only force is hydrophobic collapse – Score = number of HH contacts 52 Scoring Lattice Models • H/P model scoring: count noncovalent hydrophobic interactions. • Sometimes: – Penalize for buried polar or surface hydrophobic residues 53 Can we use lattice models? • For smaller polypeptides, exhaustive search can be used – Looking at the “best” fold, even in such a simple model, can teach us interesting things about the protein folding process • For larger chains, other optimization and search methods must be used – Greedy, branch and bound – Evolutionary computing, simulated annealing 54 The “hydrophobic zipper” effect Ken Dill ~ 1997 55 Representing a lattice model • Absolute directions – UURRDLDRRU • Relative directions – LFRFRRLLFFL – Advantage, we can’t have UD or RL in absolute – Only three directions: LRF • What about bumps? LFRRR – Bad score – Use a better representation 56 Preference-order representation • Each position has two “preferences” – If it can’t have either of the two, it will take the “least favorite” path if possible • Example: {LR},{FL},{RL}, {FR},{RL},{RL},{FL},{RF} • Can still cause bumps: {LF},{FR},{RL},{FL}, {RL},{FL},{RF},{RL}, {FL} 57 More realistic models • Higher resolution lattices (45° lattice, etc.) • Off-lattice models – Local moves – Optimization/search methods and / representations • Greedy search • Branch and bound • EC, Monte Carlo, simulated annealing, etc. 58 How to Evaluate the Result? • Now that we have a more realistic off-lattice model, we need a better energy function to evaluate a conformation (fold). • Theoretical force field: – G = Gvan der Waals + Gh-bonds + Gsolvent + Gcoulomb • Empirical force fields – Start with a database – Look at neighboring residues – similar to known protein folds? 59 Comparative Modeling 1. Identify similar protein sequences from a database of known proteins (BLAST) 2. Find conserved regions by aligning these proteins (CLUSTAL-W) 3. Predict alpha helices and beta sheets from conserved regions, backbone 4. Predict loops 5. Predict side chain positions 6. Evaluate 60 Threading: Fold recognition • Given: – Sequence: IVACIVSTEYDVMKAAR… – A database of molecular coordinates • Map the sequence onto each fold • Evaluate – Objective 1: improve scoring function – Objective 2: folding 61 Folding : still a hard problem • Levinthal’s paradox – Consider a 100 residue protein. If each residue can take only 3 positions, there are 3100 = 5 1047 possible conformations. – If it takes 10-13s to convert from 1 structure to another, exhaustive search would take 1.6 1027 years. 62 Protein Classification • Class: Similar secondary structure properties – All alpha, all beta, alpha/beta, alpha+beta • Fold: major secondary structure similarity. – Globin like (6 helices, folded leaf, partly opened) • Super family: distant homologs. 25-30% sequence identity. • Family: close homologs. Evolved from the same ancestor. High identity. 63