Protein structure prediction
Haixu Tang
School of Informatics

Basic operations in a cell (Central Dogma)
• A gene is expressed in two steps
  1) Transcription: RNA synthesis
  2) Translation: Protein synthesis

Proteins are major functional biomolecules in cells
• Examples of protein functions
  – Catalysis: Almost all chemical reactions in a living cell are catalyzed by protein enzymes. Example: alcohol dehydrogenase oxidizes alcohols to aldehydes or ketones.
  – Transport: Some proteins transport various substances, such as oxygen and ions. Example: haemoglobin carries oxygen.
  – Information transfer: For example, hormones. Example: insulin controls the amount of sugar in the blood.

Protein is composed of amino acids
[Figure: general structure of an amino acid, NH3+–CH(R)–COO−, with the amino group (NH3+), the carboxylic acid group (COO−), and the side chain R attached to the central carbon]
• Different side chains, R, determine the chemical properties of the 20 amino acids.

20 amino acids
Glycine (G), Alanine (A), Valine (V), Isoleucine (I), Leucine (L), Proline (P), Methionine (M), Phenylalanine (F), Tryptophan (W), Asparagine (N), Glutamine (Q), Serine (S), Threonine (T), Tyrosine (Y), Cysteine (C), Lysine (K), Arginine (R), Histidine (H), Aspartic acid (D), Glutamic acid (E)
Colour key in the original figure: White: Hydrophobic, Green: Hydrophilic, Red: Acidic, Blue: Basic

Proteins are linear polymers of amino acids
[Figure: amino acids condense with loss of H2O to form peptide bonds (CO–NH), giving the chain NH3+–CH(R1)–CO–NH–CH(R2)–CO–NH–CH(R3)–CO–...]
• The amino acid sequence is called the primary structure.

Each protein has a unique structure
Amino acid sequence: NLKTEWPELVGKSVEE AKKVILQDKPEAQIIVL PVGTIVTMEYRIDRVR LFVDKLDNIAEVPRVG
↓ folding
[Figure: folded three-dimensional structure]

Protein Structure Determination
• X-ray crystallography
  – most accurate
  – in vitro
  – needs protein crystals
  – ~100K per structure
• Nuclear Magnetic Resonance
  – Fairly accurate
  – in vivo, in solution
  – No need for crystals
  – Limited to small proteins

Protein Data Bank
http://www.rcsb.org/pdb/
• PDB files: atom coordinates, etc.
• Example (1atn: actin/DNAse I complex):

    ATOM      1  CA  ACE A   0     105.046  51.546  40.626  1.00 72.72      1ATN 263
    ATOM      2  C   ACE A   0     105.314  50.822  41.951  1.00 72.72      1ATN 264
    ATOM      3  O   ACE A   0     105.220  51.451  43.013  1.00 72.56      1ATN 265
    ATOM      4  N   ASP A   1     105.665  49.507  41.867  1.00 71.64      1ATN 266
    ATOM      5  CA  ASP A   1     105.992  48.589  42.982  1.00 70.20      1ATN 267
    ATOM      6  C   ASP A   1     107.024  49.191  43.936  1.00 69.70      1ATN 268
    ATOM      7  O   ASP A   1     106.927  49.088  45.163  1.00 69.14      1ATN 269
    ATOM      8  CB  ASP A   1     106.533  47.248  42.410  1.00 70.66      1ATN 270

Visualizing protein structure (PDB files)
[Figure: rendered protein structure]

Basic structural units of proteins: Secondary structure
[Figure: α-helix and β-sheet]
• Secondary structures, α-helix and β-sheet, have regular hydrogen-bonding patterns.

Three-dimensional structure of proteins
[Figure: tertiary structure and quaternary structure]

Hierarchical nature of protein structure
Primary structure (Amino acid sequence)
↓
Secondary structure (α-helix, β-sheet)
↓
Tertiary structure (Three-dimensional structure formed by assembly of secondary structures)
↓
Quaternary structure (Structure formed by more than one polypeptide chain)

Secondary Structure Prediction
• Given a protein sequence, secondary structure prediction aims at predicting the state of each amino acid as being either H (helix), E (extended = strand), or O (other).
• The quality of secondary structure prediction is measured with a “3-state accuracy” score, or Q3. Q3 is the percentage of residues whose predicted state matches “reality” (the X-ray structure).

Early methods for Secondary Structure Prediction
• Chou and Fasman (Chou and Fasman. Prediction of protein conformation. Biochemistry, 13: 211-245, 1974)
• GOR (Garnier, Osguthorpe and Robson. Analysis of the accuracy and implications of simple methods for predicting the secondary structure of globular proteins. J. Mol.
Biol., 120: 97-120, 1978)

Chou and Fasman

    Amino Acid   α-Helix   β-Sheet   Turn    Group
    Ala          1.29      0.90      0.78    Favors α-helix
    Cys          1.11      0.74      0.80    Favors α-helix
    Leu          1.30      1.02      0.59    Favors α-helix
    Met          1.47      0.97      0.39    Favors α-helix
    Glu          1.44      0.75      1.00    Favors α-helix
    Gln          1.27      0.80      0.97    Favors α-helix
    His          1.22      1.08      0.69    Favors α-helix
    Lys          1.23      0.77      0.96    Favors α-helix
    Val          0.91      1.49      0.47    Favors β-strand
    Ile          0.97      1.45      0.51    Favors β-strand
    Phe          1.07      1.32      0.58    Favors β-strand
    Tyr          0.72      1.25      1.05    Favors β-strand
    Trp          0.99      1.14      0.75    Favors β-strand
    Thr          0.82      1.21      1.03    Favors β-strand
    Gly          0.56      0.92      1.64    Favors turn
    Ser          0.82      0.95      1.33    Favors turn
    Asp          1.04      0.72      1.41    Favors turn
    Asn          0.90      0.76      1.23    Favors turn
    Pro          0.52      0.64      1.91    Favors turn
    Arg          0.96      0.99      0.88

The GOR method
• For each position j in the sequence, eight residues on either side are considered.

Accuracy
• Both Chou and Fasman and GOR have been assessed, and their accuracy is estimated to be Q3 = 60-65%.

Neural networks
• The most successful methods for predicting secondary structure are based on neural networks.
• The overall idea is that neural networks can be trained to recognize amino acid patterns in known secondary structure units, and to use these patterns to distinguish between the different types of secondary structure.
• Neural networks classify “input vectors” or “examples” into categories (2 or more).

Protein 3D Structure Prediction
• In theory, a protein structure can be predicted computationally
• A protein folds into the 3D structure that minimizes its free energy
• The problem can be formulated as a search for the minimum-energy structure
  – the search space is enormous even for small proteins!
  – the number of local minima increases exponentially with the size of the protein

Computational Methods for Protein 3D Structure Prediction
• Comparative modeling
  – Protein threading: make a structure prediction by identifying a “good” sequence-structure fit
  – Homology modeling: identify homologous proteins through sequence alignment, then predict the structure by placing residues into the “corresponding” positions of the homologous structure models

Protein Threading
• Find the “correct” sequence-structure alignment between a target sequence and its native-like fold in PDB
• Energy function
  – knowledge (or statistics) based rather than physics based
  – Should be able to distinguish correct structural folds from incorrect structural folds
  – Should be able to distinguish the correct sequence-fold alignment from incorrect sequence-fold alignments

Protein Threading
• Structure database
• Fitness function
• Sequence-structure alignment algorithm
• Prediction reliability assessment

Protein Threading – structure database
• Build a template database

Protein Threading – fitness function
Target sequence: MTYKLILNGKTKGETTTEAVDAATAEKVFQYANDNGVDGEWTYTE
• E_p: how preferable it is to put two particular residues nearby
• E_s: how well a residue fits a structural environment
• E_g: alignment gap penalty
• Find a sequence-structure alignment that minimizes the energy function

Protein Threading (sequence-structure alignment)
• Unlike sequence-sequence alignment, where amino acids are aligned with amino acids, a sequence-structure alignment aligns amino acids with structural environments
• A simple definition of structural environment
  – secondary structure: alpha-helix, beta-strand, loop
  – solvent accessibility: 0, 10, 20, …, 100% of accessibility
  – each combination of secondary structure and solvent accessibility level defines a structural environment
• E.g., (alpha-helix, 30%), (loop, 80%), …

Protein Threading – algorithm
• Threading algorithm
  – find the sequence-structure alignment that minimizes the fitness function
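
The sequence-structure alignment step described above can be sketched with dynamic programming. This is only a minimal illustration under stated assumptions, not the method used by any particular threading program: it keeps the environment term E_s and the gap term E_g but leaves out the pairwise term E_p, the function names (`toy_env_energy`, `thread`), the hydrophobic residue set, the 30% burial cut-off, and the gap penalty of 2.0 are all hypothetical choices made for the example.

```python
from typing import Callable, List, Tuple

# A structural environment: (secondary structure, solvent accessibility in %).
Environment = Tuple[str, int]

# Toy residue grouping, assumed only for this illustration.
HYDROPHOBIC = set("AVILMFWC")


def toy_env_energy(residue: str, env: Environment) -> float:
    """Hypothetical E_s (lower is better).

    Positions with accessibility below 30% are treated as buried and are
    assumed to prefer hydrophobic residues; exposed positions prefer the rest.
    """
    _, accessibility = env
    buried = accessibility < 30
    hydrophobic = residue in HYDROPHOBIC
    return -1.0 if buried == hydrophobic else 1.0


def thread(
    sequence: str,
    template: List[Environment],
    env_energy: Callable[[str, Environment], float] = toy_env_energy,
    gap_penalty: float = 2.0,
) -> float:
    """Return the minimum of E_s + E_g over global sequence-structure alignments."""
    n, m = len(sequence), len(template)
    # dp[i][j]: best energy aligning the first i residues to the first j positions.
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap_penalty
    for j in range(1, m + 1):
        dp[0][j] = j * gap_penalty
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = min(
                # residue i placed in template position j
                dp[i - 1][j - 1] + env_energy(sequence[i - 1], template[j - 1]),
                # residue i left unaligned
                dp[i - 1][j] + gap_penalty,
                # template position j left unfilled
                dp[i][j - 1] + gap_penalty,
            )
    return dp[n][m]


# A four-position template alternating buried and exposed helix positions.
template = [("alpha-helix", 10), ("alpha-helix", 80),
            ("alpha-helix", 10), ("alpha-helix", 80)]
good_fit = thread("LKLK", template)   # hydrophobic residues land on buried positions
poor_fit = thread("KLKL", template)   # pattern is out of phase with the template
```

In a full threading run the same score would be computed against every template in the structure database and the lowest-energy fold reported. Including E_p makes the energy of one position depend on where other residues are placed, so this simple recurrence no longer applies directly.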
CASP
• CASP = Critical Assessment of Structure Prediction
• First held in 1994, every 2 years afterwards
• Teams make structure predictions from sequences alone

CASP
• Two categories of predictors
  – Automated
    • Automatic servers, must complete analysis within 48 hours
    • Shows what is possible through computer analysis alone
  – Non-automated
    • Groups spend considerable time and effort on each target
    • Utilize computer techniques and human analysis techniques

CASP
• CASP6, 2004
  – 200 prediction teams from 24 countries
  – Over 30,000 predictions for 64 protein targets collected and evaluated
  – Conference held afterwards to discuss results, with many teams presenting individual results and methodologies
  – Helps to steer future work
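
Assessments of this kind come down to comparing a prediction with an experimentally determined structure. For secondary structure, the simplest such comparison is the Q3 score defined earlier. The sketch below assumes both the prediction and the reference are given as equal-length strings over the three states H, E and O; the function name `q3_accuracy` and the example strings are made up for illustration.

```python
def q3_accuracy(predicted: str, observed: str) -> float:
    """Percentage of residues whose predicted state (H, E or O) matches the observed one."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed must have the same length")
    if not observed:
        raise ValueError("sequences must not be empty")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


# Hypothetical eight-residue example: one residue is predicted O where the
# reference has E, so 7 of 8 states agree.
score = q3_accuracy("HHHEEOOO", "HHHEEEOO")
```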