Download Machine learning methods for Protein Secondary Structure Prediction

COT 6930 HPC and Bioinformatics
Protein Structure Prediction
Xingquan Zhu
Dept. of Computer Science and Engineering

[Figure: DNA → (transcription) → RNA → (translation) → protein → phenotype, annotated with the related databases: genomic DNA databases; cDNA, ESTs, UniGene; protein sequence databases; protein structure databases; gene expression database]

Outline
• Protein Structure
• Why structure
• How to predict protein structure
   • Experimental methods
   • Computational methods (predictive methods)
• Protein Structure Prediction
   • Secondary structure prediction (2D)
      • Machine learning methods for protein secondary structure prediction
   • Tertiary structure prediction (3D)
      • Ab initio
      • Homology modeling

Proteins
• Proteins play a crucial role in virtually all biological processes, with a broad range of functions.
• The activity of an enzyme or the function of a protein is governed by its three-dimensional structure.

Protein Structure is Hierarchical
[Figure: hierarchy of protein structure]

Protein Structure Video
• http://www.youtube.com/watch?v=lijQ3a8yUYQ

Primary Structure: Sequence
• The primary structure of a protein is the amino acid sequence.

Protein Structure Prediction Problem
• Protein structure prediction: predict protein 3D structure from the (amino acid) sequence
• One step closer to useful biological knowledge
• Sequence → secondary structure → 3D structure → function

Why Predict Structure?
• Structure is more conserved than sequence
• Structure determines function
• Goals:
   1. Predict structure from sequence
   2. Predict function based on structure
   3. Predict function based on sequence
[Figure label: Molecular function]

Why predict structure: Structure is more conserved than sequence
[Figure: structural similarity at 28% sequence identity]

Why predict structure: Can label proteins by dominant structure
• SCOP: Structural Classification Of Proteins

Why predict structure: Large number of proteins vs. relatively small number of folds
• Small number of unique folds found in practice
• 90% of proteins are covered by fewer than 1000 folds; an estimated ~4000 folds exist in total
• http://www.rcsb.org/pdb/home/home.do (as of 02/05/2008: 48,878 structures)

Examples of Fold Classes
[Figure: examples of fold classes]

How to Predict Protein Structure
• A related biological question: what are the factors that determine a structure?
   • Energy
   • Kinematics
• How can we determine structure?
   • Experimental methods
      • X-ray crystallography or NMR (nuclear magnetic resonance) spectroscopy
      • Limitations: protein size; crystallized proteins are required
   • Computational methods (predictive methods)
      • 2-D structure (secondary structure)
      • 3-D structure (tertiary structure)

Geometry of Protein Structure
[Figure: backbone geometry with the rotatable bonds labeled]

Inter-atomic Forces
• Covalent bond (short range, very strong)
   • Binds atoms into molecules / macromolecules
• Hydrogen bond (short range, strong)
   • Binds two polar groups (hydrogen + electronegative atom)
• Disulfide bond / bridge (short range, very strong)
   • Covalent bond between sulfhydryl (sulfur + hydrogen) groups
• Hydrophobic / hydrophilic interaction (weak)
   • Hydrogen bonding w/ H2O in solution
• Van der Waals interaction (very weak)
   • Nonspecific electrostatic attractive force

Types of Inter-atomic Forces
[Figure: types of inter-atomic forces]

Quick Overview of Energy
   Bond                          Strength (kcal/mole)
   H-bonds                       3-7
   Ionic bonds                   10
   Hydrophobic interactions      1-2
   Van der Waals interactions    1
   Disulfide bridge              51

Protein Folding Animation
• http://www.youtube.com/watch?v=fvBO3TqJ6FE
• http://www.youtube.com/watch?v=swEc_sUVz5I

Two Related Problems in Structure Prediction
• Directly predicting protein structure from the amino acid sequence has proved elusive
• Two sub-problems
   • Secondary structure prediction
   • Tertiary structure prediction

Secondary Structure Prediction (2D)
• For each residue in a protein structure there are three possible states: α (α-helix), β (β-strand), t (others).
[Figure: an amino acid sequence aligned with its secondary structure sequence]
• As of 2000, the accuracy of secondary structure prediction methods is nearly 80%.
• Secondary structure prediction can provide useful information to improve other sequence and structure analysis methods, such as sequence alignment and 3-D modeling.
• http://bioinf.cs.ucl.ac.uk/psipred/psiform.html

PSSP: Protein Secondary Structure Prediction
• Three generations
   • Based on statistical information of single amino acids
   • Based on local amino acid interaction (segments); typically a segment contains 11-21 amino acids
   • Based on evolutionary information of homologous sequences

Secondary Structure Preferences for Amino Acids
The normalized frequencies for each conformation were calculated from the fraction of residues of each amino acid that occurred in that conformation, divided by this fraction for all residues. Random occurrence of a particular amino acid in a conformation would give a value of unity. A value greater than unity indicates a preference for a particular type of secondary structure.
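The normalization described above can be read as P(aa, s) = [n(aa, s) / n(aa)] / [n(s) / N], where n(aa, s) counts residues of amino acid aa in state s and N is the total number of residues. The following is a minimal Python sketch under that reading. The function name `propensities`, the toy sequence, and the ASCII letters `a`, `b`, `t` standing in for the three states are assumptions made for illustration and do not come from the slides.

```python
from collections import Counter

def propensities(sequence, structure):
    """Normalized frequency of each (amino acid, state) pair.

    Value = (fraction of that amino acid found in the state)
            / (fraction of all residues found in the state).
    A value of 1.0 means no preference; above 1.0 means a preference.
    """
    if len(sequence) != len(structure):
        raise ValueError("sequence and structure must have the same length")
    total = len(sequence)
    aa_counts = Counter(sequence)                    # n(aa)
    state_counts = Counter(structure)                # n(s)
    pair_counts = Counter(zip(sequence, structure))  # n(aa, s)
    table = {}
    for (aa, state), n in pair_counts.items():
        frac_for_aa = n / aa_counts[aa]
        frac_for_all = state_counts[state] / total
        table[(aa, state)] = frac_for_aa / frac_for_all
    return table

# Hypothetical toy data: 'a' = helix, 't' = other (no strand residues here).
toy_sequence = "AAAAGGGG"
toy_structure = "aaatattt"
toy_table = propensities(toy_sequence, toy_structure)
print(toy_table[("A", "a")])  # alanine favors the helix state in this toy data
```

In this toy data, alanine is in the helix state three times out of four while half of all residues are helical, so its helix value is above unity and glycine's is below.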
Machine learning methods for Protein Secondary Structure Prediction
• Introduction to classification
• Generalize protein secondary structure prediction as a machine learning problem
• Introduction to Neural Network

Classification and Classifiers
• Given a database table DB with a set of attribute values and a special attribute C, called the class label.
• Example:
   A1   A2   A3   A4   C
   1    1    m    g    Tumor
   0    1    v    g    Normal
   1    0    m    b    Normal

Classification and Classifiers
• An algorithm is called a classification algorithm if it uses the data to build a set of patterns
   • Decision rules or decision trees, etc.
• Those patterns are structured in such a way that we can use them to classify unknown sets of objects (unknown records).
• For that reason (because of the goal), the classification algorithm is often called a classifier for short.

Classifier Example
[Figure: classifier example]

Classification and Classifiers
• Building a classifier consists of two phases: training and testing.
   • Training: use the training data set to create patterns (rules, trees, or a trained neural network).
   • Testing: evaluate the created patterns on test data whose classification is known.
• In both phases we use data (a training data set and a disjoint test data set) for which the class labels are known for ALL of the records.
• The measure of a trained classifier's accuracy is called predictive accuracy.
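The two phases and the resulting predictive accuracy can be sketched in a few lines of Python. This is only an illustration: the 1-nearest-neighbor rule with Hamming distance is a stand-in chosen for brevity (the slides do not prescribe it), the training rows are the three records from the example table, and the test records and all function names are hypothetical.

```python
def hamming(a, b):
    """Number of attribute positions where two records differ."""
    return sum(x != y for x, y in zip(a, b))

def train(records):
    """Training phase: a 1-nearest-neighbor model just stores the labeled records."""
    return list(records)

def classify(model, attrs):
    """Predict the label of the stored record closest to attrs."""
    nearest = min(model, key=lambda rec: hamming(rec[0], attrs))
    return nearest[1]

def predictive_accuracy(model, test_records):
    """Testing phase: fraction of test records whose known label is predicted."""
    correct = sum(classify(model, attrs) == label for attrs, label in test_records)
    return correct / len(test_records)

# Training set: the three rows (A1, A2, A3, A4, C) of the example table.
training_set = [
    (("1", "1", "m", "g"), "Tumor"),
    (("0", "1", "v", "g"), "Normal"),
    (("1", "0", "m", "b"), "Normal"),
]
# Hypothetical test set, disjoint from the training set, with known labels.
test_set = [
    (("0", "1", "v", "b"), "Normal"),
    (("1", "0", "v", "b"), "Normal"),
    (("0", "0", "v", "b"), "Tumor"),
]

model = train(training_set)
accuracy = predictive_accuracy(model, test_set)
print(f"predictive accuracy: {accuracy:.2f}")
```

The third test record is misclassified, so the sketch reports two correct predictions out of three.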
Predictive Accuracy Evaluation
• The main methods of predictive accuracy evaluation are listed below as (training set size ; test set size), where N is the number of instances in the dataset:
   • Re-substitution (N ; N)
   • Holdout (2N/3 ; N/3)
   • x-fold cross-validation (N-N/x ; N/x)
   • Leave-one-out (N-1 ; 1)
• The process of building and evaluating a classifier is also called supervised learning or, more recently when dealing with large databases, a classification method in data mining.

Classification Models: Different Classifiers
• Typical classification models
   • Decision Trees (ID3, C4.5)
   • Nearest Neighbors
   • Support Vector Machines
   • Neural Networks
• Most of the best classifiers for PSSP are based on the Neural Network model

Demonstration

How to generalize protein secondary structure prediction as a machine learning problem?
• Use a sliding window that moves along the amino acid sequence
   • Each window denotes an instance
   • Each amino acid inside the window denotes an attribute
   • The known secondary structure of the central amino acid is the class label

How to generalize protein secondary structure prediction as a machine learning problem?
• A set of "examples" is generated from sequences with known secondary structures
• The examples form a training set
• Build a neural network classifier
• Apply the classifier to a sequence with unknown secondary structure

Introduction to Neural Network
• What is an artificial Neural Network?
   • An extremely simplified model of the brain
   • Essentially a function approximator
   • Transforms inputs into outputs to the best of its ability

Introduction to Neural Network
• Composed of many "neurons" that co-operate to perform the desired function

How do Neural Networks Work?
• A neuron (perceptron) is a single-layer NN
• The output of a neuron is a function of the weighted sum of the inputs plus a bias

Activation Function
• Binary activation function
   • f(x) = 1 if x >= 0
   • f(x) = 0 otherwise
• The most common sigmoid function used is the logistic function
   • f(x) = 1/(1 + e^(-x))
• The calculation of derivatives is important for neural networks, and the logistic function has a very nice derivative
   • f'(x) = f(x)(1 - f(x))

Where Do The Weights Come From?
• The weights in a neural network are the most important factor in determining its function
• Training is the act of presenting the network with some sample data and modifying the weights to better approximate the desired function
• Supervised training
   • Supplies the neural network with inputs and the desired outputs
   • The response of the network to the inputs is measured
   • The weights are modified to reduce the difference between the actual and desired outputs

Perceptron Example
• Simplest neural network with the ability to learn
• Made up of only input neurons and output neurons
• Output neurons use a simple threshold activation function
• In basic form, can only solve linear problems
   • Limited applications

Perceptron Example
• Perceptron weight updating
   • If the output is not correct, the weights are adjusted according to the formula:
     w_new = w_old + α·(desired - output)·input, where α is the learning rate
   • Assuming given instance {(1,0,1), 0}

Multi-Layer Feedforward NN
• An extension of the perceptron
   • Multiple layers
      • The addition of one or more "hidden" layers in between the input and output layers
   • Activation function is not simply a threshold
      • Usually a sigmoid function
• A general function approximator
   • Not limited to linear problems
• Information flows in one direction
   • The outputs of one layer act as inputs to the next layer

Multi-Layer Feedforward NN Example
• XOR problem

Back-propagation
• Searches for weight values that minimize the total error of the network over the set of training examples
   • Forward pass: compute the outputs of all units in the network, and the error of the output layers.
   • Backward pass: the network error is used for updating the weights (credit assignment problem).

NN for Protein Secondary Structure Prediction
[Figure: neural network for protein secondary structure prediction]

Ab initio Prediction
• Sampling the global conformation space
   • Lattice models / discrete-state models
   • Molecular dynamics
• Picking native conformations with an energy function
   • Solvation model: how the protein interacts with water
   • Pair interactions between amino acids

Lattice String Folding
• HP model: the main modeled force is hydrophobic attraction
   • Amino acids are classified into two types: hydrophobic (H) or polar (P)
• NP-hard on both the 2-D square and the 3-D cubic lattice
• Constant-factor approximation algorithms exist
• Not so relevant biologically

Lattice String Folding
[Figure: lattice string folding]

Energy Minimization
• Many forces act on a protein
   • Hydrophobic: the inside of the protein wants to avoid water
      • Hydrophobic molecules associate with each other in a water solvent as if the water molecules repelled them. It is like oil/water separation.
   • Packing: atoms can't be too close, nor too far away
      • van der Waals interactions
   • Bond angle/length constraints
   • Long distance, e.g.
      • Electrostatics & hydrogen bonds
      • Disulphide bonds
      • Salt bridges
• Can calculate all of these forces, and minimize
• Intractable in the general case, but can be useful

Molecular Dynamics (MD)
In molecular dynamics simulation, we simulate the motions of atoms as a function of time according to Newton's equation of motion. The equations for a system consisting of N atoms can be written as

\[ m_i \frac{d^2 \mathbf{r}_i(t)}{dt^2} = \mathbf{F}_i(t), \quad i = 1, 2, \ldots, N. \tag{1} \]

Here, r_i and m_i represent the position and mass of atom i, and F_i(t) is the force on atom i at time t. F_i(t) is given by

\[ \mathbf{F}_i = -\nabla_i V(\mathbf{r}_1, \mathbf{r}_2, \ldots, \mathbf{r}_N), \tag{2} \]

where V(r_1, r_2, …, r_N) is the potential energy of the system, which depends on the positions of the N atoms in the system. ∇_i is

\[ \nabla_i = \mathbf{i}\,\frac{\partial}{\partial x_i} + \mathbf{j}\,\frac{\partial}{\partial y_i} + \mathbf{k}\,\frac{\partial}{\partial z_i} \tag{3} \]

Energy Functions used in Molecular Simulation
[Figure: bond length r, bond angle Θ, and dihedral angle Φ; an O···H hydrogen bond at distance r; two charges (+ and -) at distance r]

\[ V_{\text{total}} = \sum_{\text{bonds}} K_b (r - r_0)^2 + \sum_{\text{angles}} K_\Theta (\Theta - \Theta_0)^2 + \sum_{\text{dihedrals}} K_\Phi \left[ 1 + \cos(n\Phi - \delta) \right] + \sum_{\text{H-bonds}} \left( \frac{C_{ij}}{r_{ij}^{12}} - \frac{D_{ij}}{r_{ij}^{10}} \right) + \sum_{\substack{\text{van der Waals} \\ i,j \text{ pairs}}} \left( \frac{A_{ij}}{r_{ij}^{12}} - \frac{B_{ij}}{r_{ij}^{6}} \right) + \sum_{\substack{\text{electrostatic} \\ i,j \text{ pairs}}} \frac{q_i q_j}{\varepsilon\, r_{ij}} \]

• Terms, in order: bond stretching, angle bending, dihedral, H-bonding, van der Waals, electrostatic
• The sums over i, j pairs are the most time-demanding part.

Homology-based Prediction
• Align the query sequence with sequences of known structure, usually >30% similar
• Superimpose the aligned sequence onto the structure template, according to the computed sequence alignment
• Perform local refinement of the resulting structure in 3D
• The number of unique structural folds is small (possibly a few thousand)
• 90% of new structures submitted to PDB in the past three years have similar folds in PDB

Homology-based Prediction
• Raw model
• Loop modeling
• Side chain placement
• Refinement

Homology-based Prediction
[Figure: homology-based prediction]
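The ">30% similar" rule of thumb for choosing a template in homology-based prediction can be checked on an existing pairwise alignment. The sketch below is a simplified illustration in Python, not the method of any particular tool: it assumes the two sequences are already aligned with `-` as the gap character, and it defines identity as identical residues divided by columns where neither sequence has a gap (other definitions use a different denominator). The function name and both sequences are hypothetical.

```python
def percent_identity(aligned_a, aligned_b, gap="-"):
    """Percent identity over the columns where neither sequence has a gap."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    compared = 0
    identical = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap or y == gap:
            continue  # skip columns that contain a gap
        compared += 1
        if x == y:
            identical += 1
    if compared == 0:
        return 0.0
    return 100.0 * identical / compared

# Hypothetical pre-aligned query and template sequences.
query_aligned = "MKT-AYIAKQ"
template_aligned = "MKSLAYI-RQ"

identity = percent_identity(query_aligned, template_aligned)
usable_template = identity > 30.0  # threshold taken from the rule of thumb above
print(identity, usable_template)
```

A real pipeline would first compute the alignment itself and would usually score similarity with a substitution matrix rather than plain identity.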
