Protein Structure

What We Are Going to Cover
• Secondary structure prediction
  – Chou-Fasman rules for individual amino acids
  – Nearest neighbor approaches
  – Machine learning techniques (neural networks)
  – Physical properties vs. statistics
  – Measuring prediction quality
• Tertiary structure prediction
  – Force fields
  – Threading
  – Homology modeling

Background
• In the 1950s Christian Anfinsen showed that pancreatic RNase could refold itself into its active configuration after denaturation, without any external guidance. This, and many confirming experiments on other proteins, has led to the general belief that the amino acid sequence of a protein contains all the information needed for it to fold properly, without any additional energy input.
  – For most proteins this process is very rapid, on the order of milliseconds.
  – The primary driving forces are the need for hydrophobic side chains to minimize their interactions with water, and the formation of hydrogen bonds.
• Chaperone proteins assist in the folding of some proteins, and in the re-folding of mis-folded or aggregated proteins. Mis-folding is often a consequence of elevated temperatures, and chaperone proteins were first discovered because they were highly expressed during heat shock in Drosophila.
• Prions are normal cellular proteins (PrPC) that can fold into an abnormal configuration (PrPSc) that causes other normal PrPC proteins to refold into the bad configuration, causing large aggregates of tightly packed beta sheets to form.
  – If refolding is spontaneous, is the bad form a lower energy configuration than the good form?

[Figure: PrPC (left) and PrPSc (right)]

Secondary Structure
• Linus Pauling defined the two main protein secondary structures in the 1950s: the alpha helix and the beta sheet.
  – There are other related structures: for instance, the alpha helix has hydrogen bonds between the backbone N–H group of one residue and the backbone C=O group of the residue 4 positions earlier in the chain (i+4 -> i).
  – There are also the 3₁₀ helix (i+3 -> i) and the π helix (i+5 -> i).
  – The Dictionary of Protein Secondary Structure (DSSP) defines 8 states: the 4 mentioned above plus several forms of turn, plus "everything else", called random coil.
  – Real proteins have these structures bent and distorted, not laid out in theoretical perfection.
  – Most prediction programs classify every amino acid in a chain into just 3 states: alpha helix, beta sheet, or random coil.
  – These predictions can then be compared to X-ray crystallography results.

Assessing Prediction Accuracy
• The first problem is classification: given a solved X-ray structure of a protein, how do you classify its amino acids into alpha-helix, beta-sheet, etc.?
  – Three separate programs, DSSP, STRIDE, and DEFINE, all use different definitions, and thus come up with somewhat different results, mostly at the ends of the structures.
  – The criteria include hydrogen-bonding patterns, dihedral angles (the φ and ψ angles of the alpha carbons in the backbone), and interatomic distances compared with ideal structures.
• A second issue is the source of protein structures to use for training. You want them to be unrelated to each other and to the sequences you will use for testing: a structural "nr" data set.
  – The Protein Data Bank (PDB) has a set called PDB_SELECT that does this: about 4000 chains (late 2008), all less than 25% sequence identity, with high-quality X-ray crystallographic resolution (< 3.5 Angstroms).
• CASP (Critical Assessment of Structure Prediction) is a series of contests in which the sequences of a group of proteins whose structures have recently been determined but not yet published are released to anyone who wants to try predicting their structures. After a few months, there is a meeting to score everyone's results. The CASP8 contest has just ended.

More on Prediction Accuracy
• Q3 is the proportion of amino acids in a test dataset whose predicted state (alpha-helix, beta-strand, or random coil) matches the actual state.
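Q3 as defined here is simple to compute. A minimal sketch, using the usual H/E/C single-letter state labels and the OBS/PRED strings from the Sov example in this section:

```python
# Q3: percentage of residues whose predicted 3-state label
# (H = helix, E = strand, C = coil) matches the observed label.
def q3(observed: str, predicted: str) -> float:
    assert len(observed) == len(predicted)
    matches = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * matches / len(observed)

print(round(q3("CHHHHHHHHHHC", "CHCHCHCHCHCC"), 1))  # -> 58.3
print(round(q3("CHHHHHHHHHHC", "CCCHHHHHCCCC"), 1))  # -> 58.3
```

Both predictions score the same Q3 even though only the second preserves a contiguous helix, which is exactly the weakness the Sov segment-overlap measure addresses.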
  – Note that you can't use sequences from the training dataset for this.
  – Given equal distributions of the 3 states, you would expect random guessing to give a Q3 score of 33%. A test using random guesses with an actual dataset gave 38%.
  – Also, given the variations in automatically assigning types to known structures, scores above 85% are unlikely.
• Sov (segment overlap) is a measure of how well individual segments match that tries to avoid variation in end-of-segment predictions. It measures the percentage of times that there is an overlap between observed segments and predicted segments. It works at the individual protein or fold level, as opposed to Q3, which measures performance over a whole database. Similar Q3 scores are not always meaningful:

      OBS:   CHHHHHHHHHHC    Sov    Q3
      PRED1: CHCHCHCHCHCC    12.5   58.3
      PRED2: CCCHHHHHCCCC    63.2   58.3

Chou-Fasman Rules
• Give the propensity of each amino acid to form or break alpha helices and beta strands.
  – Originally developed in the 1970s from a very small set of proteins (15!).
  – Originally just a qualitative measure: "helix-forming", "indifferent", "helix-breaking", etc.
  – It has since been made quantitative and extended to 14 structures, which involved some fairly large changes in parameter values.
  – This method considers each amino acid in isolation, ignoring its neighborhood.

Extensions to Chou-Fasman
• The GOR method uses a sliding window of 17 amino acids (8 before and 8 after the residue being predicted) with rules based on known sequences to predict each amino acid's state.
  – It is based on self-information (each individual residue's propensity for each type of secondary structure: approximately what the Chou-Fasman rules are), directional information (how each other amino acid in the window affects the current residue, regardless of what the current residue is), and pair-information (how each other amino acid in the window affects the current residue, considering what type of amino acid it is).
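The window mechanics of a GOR-style predictor can be sketched as follows. This is illustrative only: real GOR sums self-, directional, and pair-information terms estimated from databases of known structures, whereas the tiny information table here is invented (positive values favor a state), and the demo uses a narrow window rather than GOR's 17-residue one.

```python
# Toy "information" values per state (H/E/C) for a few amino acids.
# These numbers are made up for illustration; they are not GOR parameters.
TOY_INFO = {
    "H": {"A": 0.5, "L": 0.4, "G": -0.6, "P": -1.0},
    "E": {"A": -0.1, "L": 0.1, "G": -0.3, "P": -0.5},
    "C": {"A": -0.3, "L": -0.3, "G": 0.6, "P": 0.8},
}

def predict(seq: str, half_window: int = 8) -> str:
    """For each residue, sum information over the window and pick the best state."""
    states = []
    for i in range(len(seq)):
        window = seq[max(0, i - half_window): i + half_window + 1]
        scores = {s: sum(TOY_INFO[s].get(aa, 0.0) for aa in window)
                  for s in TOY_INFO}
        states.append(max(scores, key=scores.get))
    return "".join(states)

# A small window makes the toy behavior visible: helix in the A/L-rich
# region, coil around the helix breakers G and P.
print(predict("AALLAAGPGG", half_window=2))  # -> HHHHHCCCCC
```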
  – Several improved versions consider pairs and triplets of amino acids, and increase the number of sequences analyzed to produce the statistics.

[Figure: Directional information: effects of the alpha-helix breaker proline (top) and the non-breaker methionine (bottom) 5 residues downstream from residue j, whose type is not specified.]

Use of Multiple Alignments
• It seems obvious that homologous proteins will have slightly different sequences with approximately the same secondary structure.
  – This allows a better understanding of which residues are important for the different structures.
  – The predictions for each homologous region can be averaged.
  – Alpha helices and beta strands are less likely to tolerate insertions and deletions than random coil.
• The Zpred program is a modification of the GOR program that uses multiple alignment information at each position in the aligned sequences to improve prediction.
  – Start by predicting each sequence separately as in GOR, then average the predictions.
  – Zpred also encodes the amino acid properties Venn diagram we have seen before. The Zvelibil conservation number is 1.0 if all homologous residues at a given position are identical, 0.9 if they are not identical but all within the same set in the Venn diagram, and lesser values for amino acids in different groups. A value of 0 is given if any sequence has a gap.
• A modification of this, the nearest neighbor approach, finds the best set of matching sequences (homologous or not) for just the region of the sliding window. The idea is that similar sequences will share the same fold even if the rest of the protein is different.

Neural Networks
• Neural networks are a common bioinformatics tool for machine learning and optimization.
  – Based on neurobiology observations, primarily of the retina of the eye.
  – Often used in structure prediction, among other applications.
  – Machine learning: automatic procedures to fit model parameters to a training dataset.
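Before moving on to the network details, the Zvelibil conservation number described above can be sketched. The two property sets below are a made-up stand-in for the amino acid properties Venn diagram (the real scheme uses many overlapping property sets), and the 0.5 returned for mixed groups is an illustrative "lesser value":

```python
# Hypothetical two-set stand-in for the amino acid property Venn diagram.
PROPERTY_SETS = [set("AVLIMFWC"),   # roughly hydrophobic (illustrative)
                 set("DEKRHNQ")]    # roughly polar/charged (illustrative)

def conservation(column: str) -> float:
    """Zvelibil-style conservation for one alignment column ('-' = gap)."""
    if "-" in column:
        return 0.0                  # any gap scores 0
    if len(set(column)) == 1:
        return 1.0                  # all residues identical
    for group in PROPERTY_SETS:
        if all(aa in group for aa in column):
            return 0.9              # not identical, but one property set
    return 0.5                      # different groups: a lesser value

print(conservation("LLLL"))  # -> 1.0
print(conservation("LIVM"))  # -> 0.9
print(conservation("L-VM"))  # -> 0.0
```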
• Layers of nodes, with each node connected to all the nodes in the next layer.
  – The top layer is the input, a model of the sequence being analyzed.
  – The bottom layer is the output: generally 3 nodes, one for alpha-helix, one for beta-strand, and one for random coil.
  – A model with just an input and an output layer is called a perceptron.
  – Usually there are one or more (and sometimes many more, like 80) hidden layers that allow interactions between the inputs.
  – Some networks allow feedback between layers.

More Neural Networks
• Inputs. The most common way to code neural network input is to have one node for each type of amino acid (and often an additional one for a gap), multiplied by a node for each position in the sliding window.
  – Thus, for a 13-residue window, the net would have 21 x 13 = 273 input nodes.
  – A few extra inputs can encode things like sequence length and distance from the N- and C-termini.
  – Each input node "fires" either a 0 or a 1.
• Signal processing. Each node sends the same signal to all the nodes in the next layer.
  – The receiving node weights each of its inputs differently.
  – The receiving node can also add in a bias factor to affect how it will respond.
  – The node then adds the weighted input signals plus the bias and runs them through a response function (or transfer function) to determine its output.
  – Usually all nodes (at least within a given layer) have the same transfer function.
  – The output can be 0/1 or some numerical value.

[Figure: Weights applied to the possible outputs from the input layer at a hidden node. Black is positive, red is negative for alpha helix.]
[Figure: Some possible transfer functions]

Still More Neural Networks
• Parameterization of a neural network is a big issue. Essentially it comes down to assigning weights for the inputs to each node and assigning bias factors, assuming a constant transfer function.
  – Most of the weighting comes from the input node array.
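The input encoding and a single node's signal processing described above can be sketched as follows; the weights and window are invented, and the sigmoid is just one of the possible transfer functions:

```python
import math

ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"   # 20 amino acids plus a gap symbol

def encode_window(window: str) -> list[int]:
    """One-hot encode a window: 21 input nodes (0/1) per position."""
    bits = []
    for aa in window:
        bits.extend(1 if aa == sym else 0 for sym in ALPHABET)
    return bits

def node_output(inputs, weights, bias):
    """Weighted sum of inputs plus bias, through a sigmoid transfer function."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1.0 / (1.0 + math.exp(-total))

window = "SSHHAAKLMEDFG"             # hypothetical 13-residue window
x = encode_window(window)
print(len(x))                        # -> 273, i.e. 21 x 13 input nodes
print(node_output(x, [0.01] * len(x), -0.1))  # close to 0.5: weak net input
```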
  – It's a matter of using a training set where you know the input sequence and the secondary structure at each residue. You start with randomly assigned parameters and adjust them using an optimization procedure until the outputs match the known results or until no further improvement happens.
• It is quite common to feed the output of one neural network into another network.
  – This allows predictions for one residue to influence the predictions for neighboring residues.
  – This works best with numerical outputs as opposed to 0/1. For example, the predictions for alpha helix, beta strand, and random coil are better reported as (0.76, 0.55, 0.37) instead of (1, 0, 0).
  – It is also useful for allowing homologous sequences to influence each other: first run the predictions for each sequence separately, then use the second neural net to combine them.
  – The first net is a sequence-to-structure net, and the second one is a structure-to-structure net.

Tertiary Structure
• Are we going to have time to say anything meaningful about this subject?
  – It is the cutting edge, the hardest unsolved bioinformatics problem.
  – It is the basis of rational drug design.
• Ab initio (from first principles) structure prediction: using just the sequence and known physical and chemical properties, predict the protein's final 3D structure. This is very difficult and not wildly successful.
• More commonly used are homology modeling and threading.
  – In homology modeling, an unknown protein's structure is modeled in comparison to the known structure of a homologous protein.
  – For threading, no homolog is used; instead the protein's sequence is matched to a library of known folds.

[Figure: Plots of potential energy (y-axis) and root-mean-square deviation of atomic coordinates from their actual positions for a large number of ab initio structure prediction runs. The red arrow points to the best prediction. The left one, E. coli RecA protein, worked well, but the right one, human Fyn tyrosine kinase, did not.]
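The sequence-to-structure / structure-to-structure cascade described above can be sketched with the second stage simplified to window averaging; in real systems the second stage is another trained network, and the score triples here are invented:

```python
# First stage produces numerical (H, E, C) score triples per residue;
# the second stage lets neighboring predictions influence each other.
def second_pass(scores, half_window=1):
    """Average each residue's scores with its neighbors' scores."""
    smoothed = []
    for i in range(len(scores)):
        window = scores[max(0, i - half_window): i + half_window + 1]
        avg = tuple(sum(s[k] for s in window) / len(window) for k in range(3))
        smoothed.append(avg)
    return smoothed

def to_states(scores):
    """Pick the highest-scoring state (H, E, or C) at each residue."""
    return "".join("HEC"[max(range(3), key=s.__getitem__)] for s in scores)

raw = [(0.9, 0.1, 0.2), (0.2, 0.1, 0.8), (0.8, 0.2, 0.1), (0.7, 0.1, 0.3)]
print(to_states(raw))                # -> HCHH (an isolated coil in a helix run)
print(to_states(second_pass(raw)))   # -> HHHH (the second pass smooths it out)
```

This is why numerical outputs beat hard 0/1 outputs here: the averaging in the second stage has graded evidence to work with.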
Force Fields
• Based on the Anfinsen RNase experiments and many others like them, it is thought that proteins fold into the lowest free energy state.
  – The problem is, the energy landscape is like a mountain range: there are vast numbers of possible folding states, with many local minima separated by high peaks.
  – "Free energy" has both an enthalpy term and an entropy term: ΔG = ΔH − TΔS. Enthalpy is the energy stored in covalent bonds and non-covalent atomic interactions. Entropy is hard to calculate, so in practice this work is done using enthalpy, or potential energy.
• Various forces affect protein folding: covalent bonds, hydrogen bonds, polar and ionic interactions, van der Waals forces, and solvent interactions.
  – These forces can be described by a set of equations, collectively known as the "force field".
  – Solvent interactions are troublesome because they involve large numbers of independently moving water molecules. Sometimes these are dealt with by using a statistical distribution rather than trying to model individual solvent molecules.
• Covalent bonds are most easily described using internal coordinates: bond lengths and angles. The potential energy of the bonds is easily described in these terms.
• On the other hand, non-covalent forces are more easily described using external coordinates, (x, y, z) positions, because the forces involved vary with the distance separating the atoms.
  – Thus the final force field needs to combine the two coordinate systems.

Threading
• When there is no homolog with a known structure, the protein sequence can be compared to a library of protein folds.
  – This is based on the theory that there are a limited number of possible folds, perhaps around 2000. The 35,000 protein structures in the PDB (when the book was written) can be described in terms of 750-1500 folds.
• One type of threading method uses position-specific scoring matrices (PSSMs) to describe the folds.
  – These come from PSI-BLAST.
  – The matrices are asymmetric, since you are comparing a known to an unknown, rather than comparing two unknowns. For example, the penalty for substituting Lys for Arg is different from that for an Arg-to-Lys substitution.
  – This can also be done using a set of environment-specific scoring matrices for the various positions, similar to BLOSUM matrices.
• The other common threading method is to explicitly calculate the potential energy of the sequence when forced into the fold. Dynamic programming methods make it possible to ignore wildly improbable configurations.

[Figure: These structures are the SH3 domain of dihydrofolate reductase and a kinase. They have only 14.5% identical amino acids.]

Homology Modeling
• When your sequence has a homolog with a known structure, it is possible to build a model using the known structure as a template.
  – Originally done with wires and plastic balls!
• The more similar your protein is to the known structure, the better it works.

[Figure: Dependence of modeling accuracy on sequence identity. Red points are ab initio predictions.]
[Figure: Sperm whale myoglobin model]

More Homology Modeling
• Start by finding homologs with structures in the PDB, using a BLAST search.
• Then do a careful hand-refined alignment.
• Then fit the highly conserved regions.
• Model insertions as loops.
• Predict non-identical amino acids with rotamer libraries.
  – Side chains can generally take on only a few configurations, based on "exhaustive" searches of known structures.
• After the preliminary model is built, refinements can be made to minimize energy and fit things together as well as possible.

[Figure: Rotamer libraries for tyrosine (left) and phenylalanine (right). Each has 2 main positions with variants.]
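The PSSM-based threading score described earlier can be sketched as follows. The matrix values are invented for illustration (real ones come from PSI-BLAST), gaps are ignored for simplicity, and the -4 default for unlisted residues is an assumed penalty:

```python
# A fold is represented by one column of per-amino-acid scores per
# position; a candidate sequence is scored by summing the entries for
# its residues. Because scores depend on fold position, the scoring is
# asymmetric, as noted above.
TOY_PSSM = [
    {"A": 2, "G": -1, "V": 0, "L": 1},
    {"A": -1, "G": 3, "V": -2, "L": 0},
    {"A": 0, "G": -2, "V": 2, "L": 1},
]

def thread_score(seq: str, pssm) -> int:
    """Score a sequence against a fold's PSSM (no gaps, for simplicity)."""
    assert len(seq) == len(pssm)
    return sum(column.get(aa, -4) for aa, column in zip(seq, pssm))

print(thread_score("AGV", TOY_PSSM))  # -> 7 (2 + 3 + 2): fits the fold well
print(thread_score("VGA", TOY_PSSM))  # -> 3 (0 + 3 + 0): fits less well
```

A real threader would align the sequence to each fold in the library with dynamic programming (allowing gaps) and rank the folds by score.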