* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download Document
G protein–coupled receptor wikipedia , lookup
RNA interference wikipedia , lookup
Interactome wikipedia , lookup
RNA polymerase II holoenzyme wikipedia , lookup
Point mutation wikipedia , lookup
Polyadenylation wikipedia , lookup
Genetic code wikipedia , lookup
Biochemistry wikipedia , lookup
Silencer (genetics) wikipedia , lookup
Artificial gene synthesis wikipedia , lookup
Deoxyribozyme wikipedia , lookup
Western blot wikipedia , lookup
Proteolysis wikipedia , lookup
Ancestral sequence reconstruction wikipedia , lookup
Metalloprotein wikipedia , lookup
RNA silencing wikipedia , lookup
Protein–protein interaction wikipedia , lookup
Two-hybrid screening wikipedia , lookup
Nucleic acid analogue wikipedia , lookup
molecule's structure prediction Outline RNA RNA folding Dynamic programming for RNA secondary structure prediction Protein Secondary Structure Prediction Homology Modeling Protein Threading ab-initio RNA Basics 23 Hydrogen Bonds – more stable RNA bases A,C,G,U Canonical Base Pairs A-U G-C G-U “wobble” pairing Bases can only pair with one other base. Image: http://www.bioalgorithms.info/ RNA Secondary Structure Pseudoknot Stem Interior Loop Single-Stranded Bulge Loop Junction (Multiloop) Hairpin loop Image– Wuchty RNA secondary structure representation Circular representation: Bacillus Subtilis RNase P RNA RNA secondary structure representation DotPlot representation of the same Bacillus Subtilis RNA folding: A dot is placed to represent a base pair RNA secondary structure definition An RNA sequence is represented as: R = r1, r2, r3, …, rn (ri is the i-th nucleotide). Each ri belongs to the set {A, C, G, U}. A secondary structure on R is a set S of ordered pairs, written as i•j, 1≤i<j≤n, satisfying: Computing RNA secondary structure Working hypothesis: The native secondary structure of a RNA molecule is the one with the minimum free energy Restrictions: No knots (ri,rj) , (rk,rl), i<k<j<l No close base pairs: (ri,rj) j – i > 3 (exclude “close” base pairs) Base pairs: A-U, C-G and G-U Computing RNA secondary structure Tinoco-Uhlenbeck postulate: Assumption: the free energy of each base pair is independent of all the other pairs and the loop structures Consequence: the total free energy of an RNA is the sum of all of the base pair free energies Independent Base Pairs Approach Use solution for smaller strings to find solutions for larger strings This is precisely the basic principle behind dynamic programming algorithms! RNA folding: Dynamic Programming Notation: e(ri,rj) : free energy of a base pair joining ri and rj Bij : secondary structure of the RNA strand from base ri to base rj. Its energy is E(Bij) S(i,j) : optimal free energy associated with segment ri…rj S(i,j) = max -E(Bij) B RNA folding: Dynamic Programming There are only four possible ways that a secondary structure of nested base pair can be constructed on a RNA strand from position i to j: 1. i is unpaired, added on to a structure for i+1…j S(i,j) = S(i+1,j) 2. j is unpaired, added on to a structure for i…j-1 S(i,j) = S(i,j-1) RNA folding: Dynamic Programming 4. i j paired, but not to each other; the structure for i…j adds together structures for 2 sub regions, 3. i j paired, added on to i…k and k+1…j a structure for i+1…j-1 S(i,j) = max {S(i,k)+S(k+1,j)} S(i,j) = S(i+1,j-1)+e(ri,rj) i<k<j RNA folding: Dynamic Programming Since there are only four cases, the optimal score S(i,j) is just the maximum of the four possibilities: S (i 1, j ) S (i, j 1) S (i, j ) max S (i 1, j 1) e( ri , rj ) S (i, k ) S (k 1, j ) max i k j ri unpaired rj unpaired i, j base pair i, j paired , but not to each other To compute this efficiently, we need to make sure that the scores for the smaller sub-regions have already been calculated Dynamic Programming !! RNA folding: Dynamic Programming Notes: S(i,j) = 0 if j-i < 4: do not allow “close” base pairs Reasonable values of e are -3, -2, and -1 kcal/mole for GC, AU and GU, respectively. In the DP procedure, we use 3, 2, 1 (or replace max with min) Build upper triangular part of DP matrix: - start with diagonal – all 0 - works outward on larger and larger regions - ends with S(1,n) Traceback starts with S(1,n), and finds optimal path that lead there. j A U A C C C U G U G G U A U Initialisation: No close basepairs A 0 0 0 0 U 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 A C C C U i G U G G U A U 0 j 1 10 5 A U A C C C U G U G G U A U Propagation: 1 C5….U9 : C5 unpaired: S(6,9) = 0 C5-U10 paired S(6,8) +e(C,U)=0 0 0 0 0 U 0 0 0 0 0 A 0 0 0 0 2 0 0 0 0 3 0 0 0 0 0 0 0 0 0 3 0 0 0 0 1 0 0 0 0 1 0 0 0 0 2 0 0 0 0 1 0 0 0 0 0 0 0 0 0 C 5 U10 unpaired: S(5,8)=0 A 0 C C U G C5 paired, U10 paired: U S(5,6)+S(7,9)=0 10 G S(5,7)+S(8,9)=0 G U A U 0 j 1 5 10 A U A C C C U G U G G U A U Propagation: 1 C5….G11 : A 0 0 0 0 0 0 2 U 0 0 0 0 0 2 3 0 0 0 0 2 3 5 0 0 0 0 3 3 3 0 0 0 0 0 3 6 0 0 0 0 3 3 0 0 0 0 1 1 0 0 0 0 1 2 0 0 0 0 2 2 0 0 0 0 1 0 0 0 0 0 0 0 0 0 A C5 unpaired: S(6,11) = 3 C 5 G11 unpaired: S(5,10)=3 C5-G11 paired S(6,10)+e(C,G)=6 C C U G C5 paired, G11 paired: U S(5,6)+S(7,11)=1 10 G S(5,7)+S(8,11)=0 S(5,8)+S(9,11)=0 G S(5,9)+S(10,11)=0 U A U 0 j 1 5 10 A U A C C C U G U G G U A U Propagation: 1 A 0 0 0 0 0 0 2 3 5 6 6 8 10 12 U 0 0 0 0 0 2 3 5 6 6 8 10 10 0 0 0 0 2 3 5 5 6 8 8 8 0 0 0 0 3 3 3 6 6 6 6 0 0 0 0 0 3 6 6 6 6 0 0 0 0 3 3 3 3 3 0 0 0 0 1 1 3 3 0 0 0 0 1 2 2 0 0 0 0 2 2 0 0 0 0 1 0 0 0 0 0 0 0 0 0 A C 5 C C U i G U 10 G G U A U 0 j A U A C C C U G U G G U A U Traceback: A 0 0 0 0 0 0 2 3 5 6 6 8 10 12 U 0 0 0 0 0 2 3 5 6 6 8 10 10 0 0 0 0 2 3 5 5 6 8 8 8 0 0 0 0 3 3 3 6 6 6 6 0 0 0 0 0 3 6 6 6 6 0 0 0 0 3 3 3 3 3 0 0 0 0 1 1 3 3 0 0 0 0 1 2 2 0 0 0 0 2 2 0 0 0 0 1 0 0 0 0 0 0 0 0 0 A C C C U i G U G G U A U 0 FINAL PREDICTION U G C AUACCCUGUGGUAU U C G C G A U U A A U Total free energy: -12 kcal/mol Protein structure prediction 2000000 1800000 1600000 1400000 1200000 1000000 800000 600000 400000 200000 0 1980 200000 180000 160000 140000 120000 100000 80000 60000 40000 20000 0 1985 1990 1995 2000 2005 Structures Sequences The sequence-structure gap The gap is getting bigger The protein folding problem The information for 3D structures is coded in the protein sequence Proteins fold in their native structure in seconds Secondary Structure Prediction Given a primary sequence ADSGHYRFASGFTYKKMNCTEAA what secondary structure will it adopt ? 25 Backbone A polypeptide chain. The R1 side chains identify the component amino acids. Atoms inside each quadrilateral are on the same plane, which can rotate according to angles and . Protein structure Secondary Structure Prediction Methods Chou-Fasman / GOR Method Based on amino acid frequencies Machine learning methods PHDsec and PSIpred 28 Chou and Fasman (1974) The propensity of an amino acid to be part of a certain secondary structure (e.g. – Proline has a low propensity of being in an alpha helix or beta sheet breaker) Name Alanine Arginine Aspartic Acid Asparagine Cysteine Glutamic Acid Glutamine Glycine Histidine Isoleucine Leucine Lysine Methionine Phenylalanine Proline Serine Threonine Tryptophan Tyrosine Valine P(a) 142 98 101 67 70 151 111 57 100 108 121 114 145 113 57 77 83 108 69 106 P(b) 83 93 54 89 119 037 110 75 87 160 130 74 105 138 55 75 119 137 147 170 P(turn) 66 95 146 156 119 74 98 156 95 47 59 101 60 60 152 143 96 96 114 50 29 Success rate of 50% Secondary Structure Method Improvements ‘Sliding window’ approach Most alpha helices are ~12 residues long Most beta strands are ~6 residues long Look at all windows, calculate a score for each window. If >threshold predict this is an alpha helix/beta sheet TGTAGPOLKCHIQWMLPLKK 30 Improvements since 1980’s Adding information from conservation in MSA Smarter algorithms (e.g. Machine learning). Success -> 75%-80% 31 Machine learning approach for predicting Secondary Structure (PHD, PSIpred) Query SwissProt Query Subject Subject Subject Subject Step 1: Generating a multiple sequence alignment 32 Query Step 2: Additional sequences are added using a profile. We end up with a MSA which represents the protein family. seed MSA Query Subject Subject Subject Subject 33 Step 3: Query The sequence profile of the protein family is compared (by machine learning methods) to sequences with known secondary structure. seed MSA Query Subject Subject Subject Subject Machine Learning Approach Known structures 34 Neural Network architecture used in BetaTPred2 Predicting protein 3d structure Goal: 3d structure from 1d sequence An existing fold Fold recognition Homology modeling A new fold ab-initio Homology Modeling Simplest, reliable approach Basis: proteins with similar sequences tend to fold into similar structures Has been observed that even proteins with 25% sequence identity fold into similar structures Does not work for remote homologs (< 25% pairwise identity) Homology Modeling Given: A query sequence Q A database of known protein structures Find protein P such that P has high sequence similarity to Q Return P’s structure as an approximation to Q’s structure Homology modeling needs three items of input: The sequence of a protein with unknown 3D structure, the "target sequence." A 3D “template” – a structure having the highest sequence identity with the target sequence ( >25% sequence identity) An sequence alignment between the target sequence and the template sequence Fold recognition = Protein Threading Which of the known folds is likely to be similar to the (unknown) fold of a new protein when only its amino-acid sequence is known? Protein Threading The goal: find the “correct” sequence-structure alignment between a target sequence and its native-like fold in PDB MTYKLILN …. NGVDGEWTYTE Energy function – knowledge (or statistics) based rather than physics based Should be able to distinguish correct structural folds from incorrect structural folds Should be able to distinguish correct sequence-fold alignment from incorrect sequence-fold alignments Protein Threading Basic premise The number of unique structural (domain) folds in nature is fairly small (possibly a few thousand) Statistics from Protein Data Bank (~2,000 structures) 90% of new structures submitted to PDB in the past three years have similar structural folds in PDB Chances for a protein to have a structural fold that already exists in PDB are quite good. Protein Threading Basic components: Structure database Energy function Sequence-structure alignment algorithm Prediction reliability assessment ab-initio folding Goal: Predict structure from “first principles” Requires: A free energy function, sufficiently close to the “true potential” A method for searching the conformational space Advantages: Works for novel folds Shows that we understand the process Disadvantages: Applicable to short sequences only Qian et al. (Nature: 2007) used distributed computing* to predict the 3D structure of a protein from its amino-acid sequence. Here, their predicted structure (grey) of a protein is overlaid with the experimentally determined crystal structure (color) of that protein. The agreement between the two is excellent. *70,000 home computers for about two years. Overall Approach Multiple Sequence Alignment Database Searching No Homologue in PDB Protein Sequence Secondary Structure Prediction Fold Recognition Yes Homology Modelling 3-D Protein Model Sequence-Structure Alignment Ab-initio Structure Prediction Yes Predicted Fold No Thank you for learning with me!