A Novel Method of Protein Secondary Structure Prediction with High Segment Overlap Measure: SVM Approach
by S. Hua & Z. Sun
Presenter: Sancar Adali

Outline
- Problem Statement
- SVM primer
- Methods and Parameters
- Results
- Conclusion
- Discussion

Problem Statement
Predicting secondary protein structure from amino acid sequences. Secondary protein structure is the local shape of the polypeptide chain, determined by the hydrogen bonds between different amino acids. In the big picture, the goal is to solve tertiary/quaternary structure in 3D; by first solving the more easily tackled secondary structure problem, we attack an easier version of the problem.

Some important terminology (IUPAC definitions):
- Residue: when two or more amino acids combine to form a peptide, the elements of water are removed, and what remains of each amino acid is called an amino acid residue.
- Segment: a group of consecutive residues that form a secondary structure element such as a helix, coil, or sheet.
- C-terminus: the residue with a free carboxyl group.
- N-terminus: the residue with a free amino group.

Problem Statement (continued)
Our main goal is to find the sequence-to-structure relationship. We usually know the amino acid sequence of a protein, but the problem is complicated by the many interactions and bonds between amino acid residues. We therefore start at a lower level of complexity and try to predict secondary structure.

Amino Acids to Proteins
[Diagram in the original slides: two amino acids, each with an amino (NH3) group, a carboxyl (COO) group, and a side chain R1 or R2, join through a peptide bond; the resulting chain runs from the N-terminus to the C-terminus.]

Linear Decision Boundary
Number of classes: 2. Output y ∈ {−1, +1}, input x ∈ R^N, training set L = ((x_1, y_1), (x_2, y_2), …, (x_n, y_n)), weight vector w ∈ R^N, with

    w·x = Σ_{j=1}^{N} w_j x_j.

Hyperplane: H = {x : w·x + b = 0}.

SVM (Support Vector Machine): Linearly Separable Training Samples
Original idea: choose H by maximizing, with respect to w and b,

    min { ||x − x_i|| : w·x + b = 0, i = 1, …, n },

i.e., maximize the margin, the distance from the hyperplane to the closest positive and negative training points.

Solving the Linearly Separable Case
Hyperplane H = {x : w·x + b = 0}, margin M = d+ + d−. Choose w and b to maximize the margin; then f_SVM(x) = sign(w·x + b). Since the data points are linearly separable, there exist w and b such that y_i(w·x_i + b) > 0 for all i = 1, …, n.

Rescale w and b so that the points closest to the hyperplane satisfy |w·x_i + b| = 1 (standardizing the distance); then y_i(w·x_i + b) ≥ 1 for all i = 1, …, n. We now have two planes parallel to H:

    H1: w·x + b = +1
    H2: w·x + b = −1

The distance between H1 and H2 is 2/||w||. We want to maximize this distance, which means minimizing ||w||. The problem has become an optimization problem:

    min_{w,b} ||w||²  such that  y_i(x_i·w + b) ≥ 1, i = 1, …, n.

The decision boundary depends only on a subset of the training samples, called the support vectors. (A minimal code sketch of this separable case follows.)
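As an illustration of the separable case, here is a minimal sketch, assuming scikit-learn and toy 2D data (none of this is from the paper): a very large C approximates the hard-margin problem above, and the fitted model exposes w, b, and the support vectors.

```python
# Hard-margin linear SVM sketch: min ||w||^2  s.t.  y_i (w.x_i + b) >= 1.
# A very large C makes the soft-margin solver behave like the hard-margin one.
import numpy as np
from sklearn.svm import SVC

# Toy linearly separable data with labels y in {-1, +1}.
X = np.array([[1.0, 1.0], [2.0, 2.5], [0.5, 2.0],
              [3.0, 0.5], [4.0, 1.0], [3.5, -0.5]])
y = np.array([+1, +1, +1, -1, -1, -1])

clf = SVC(kernel="linear", C=1e6).fit(X, y)

w, b = clf.coef_[0], clf.intercept_[0]
print("w =", w, "b =", b)
print("margin 2/||w|| =", 2.0 / np.linalg.norm(w))
# Only these points determine the decision boundary:
print("support vectors:\n", clf.support_vectors_)
# Decision rule f_SVM(x) = sign(w.x + b):
print("predictions:", np.sign(X @ w + b))
```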
Solving the Linearly Separable Case (continued)
Introduce nonnegative Lagrange multipliers α_i ≥ 0 to solve the problem. The solution is

    w = Σ_{i=1}^{n} α_i y_i x_i,    b = y_i − w·x_i  for any support vector x_i ∈ S,

and the decision function is

    f_SVM(x) = sign( Σ_{i=1}^{n} α_i y_i (x_i·x) + b ).

An error bound on the SVM follows from leave-one-out cross-validation (LOOCV): the expected error is bounded by E[# support vectors] / (# training samples).

Not Linearly Separable Case
Introduce slack variables ξ_i ≥ 0 to take classification errors into account (if ξ_i > 1, we make an error), and add an error term to the objective:

    min_{w,b} ||w||² + C Σ_{i=1}^{n} ξ_i  such that  y_i(x_i·w + b) ≥ 1 − ξ_i, i = 1, …, n.

The solution is again

    w = Σ_{i=1}^{n} α_i y_i x_i,    f_SVM(x) = sign( Σ_{i=1}^{n} α_i y_i (x_i·x) + b ),

now with 0 ≤ α_i ≤ C.

Nonlinear SVM
Important observation: training samples appear only through dot products x_i·x or x_i·x_j. So map the training samples into a higher-dimensional space where they become linearly separable, and solve the decision problem there:

    Φ : R^N → H (possibly dim H = ∞),    L = {(Φ(x_i), y_i)}_{i=1}^{k}.

We do not need to know Φ explicitly; we only need the dot products Φ(x_i)·Φ(x_j), which can be represented by a kernel function K : R^N × R^N → R with

    Φ(x_i)·Φ(x_j) = K(x_i, x_j).

The decision function becomes

    f(x) = sign( Σ_{i=1}^{N_S} α_i y_i Φ(s_i)·Φ(x) + b ) = sign( Σ_{i=1}^{N_S} α_i y_i K(s_i, x) + b ),

where the s_i are the N_S support vectors. Φ appears neither in the decision function nor in the training step.

Typical Kernels
- Polynomial: K(x, x') = (x·x' + 1)^d
- Gaussian RBF (the one used in the article): K(x, x') = exp(−||x − x'||² / (2σ²))
- Sigmoid-type: K(x, x') = arctan(x·x')

Methods
- SVMs with the radial basis function kernel.
- Three classes: H (α-helix), E (β-sheet), and C (coil, the remaining part).
- Construction of tertiary (three-class) classifiers from multiple binary classifiers, since SVMs are binary classifiers.

SVM Implementation
Gaussian RBF with r = 0.10; the authors claim this parameter is not critical and report having tried other values. C, the penalty for classification errors, must be chosen carefully to find a balance between overfitting (penalty too high) and underfitting (penalty too low). The optimal value for C was calculated using 7-fold cross-validation in this paper.

Encoding Amino Acid Sequence Information
The 20 amino acids plus 1 symbol for the N- or C-terminus are encoded as 21 binary unit vectors (1,0,…,0), (0,1,…,0), …. These vectors are concatenated over each sliding window, so for window length ℓ the feature vector has dimension 21·ℓ.

Encoding Example
Assume ℓ = 5 and, for our purposes, only three amino acids: Arg = (1,0,0,0), Lys = (0,1,0,0), Met = (0,0,1,0); the two ends of the chain are encoded separately as (0,0,0,1). The feature vector for the Arg* residue in the sequence Arg, Lys, Arg*, Met, Arg, Met, Met is the concatenation of the codes in the window Arg, Lys, Arg*, Met, Arg centered on Arg*:

    x_i = (1,0,0,0; 0,1,0,0; 1,0,0,0; 0,0,1,0; 1,0,0,0).

This is the input to the SVM. (A code sketch of this encoding follows.)
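A minimal sketch of the window encoding, using the slide's toy three-letter alphabet; the function and symbol names are illustrative, not from the paper, and the real encoder uses all 20 amino acids plus the terminal symbol (21·ℓ features).

```python
# Sliding-window orthogonal encoding sketch for the toy alphabet above.
ALPHABET = ["Arg", "Lys", "Met", "<end>"]  # <end> marks the N-/C-terminus

def unit_vector(symbol):
    """Binary unit-vector encoding: a single 1 at the symbol's position."""
    v = [0] * len(ALPHABET)
    v[ALPHABET.index(symbol)] = 1
    return v

def encode_window(sequence, center, window=5):
    """Concatenate unit vectors over the window centered on `center`;
    positions falling off either end are encoded as the terminal symbol."""
    half = window // 2
    features = []
    for i in range(center - half, center + half + 1):
        symbol = sequence[i] if 0 <= i < len(sequence) else "<end>"
        features.extend(unit_vector(symbol))
    return features

seq = ["Arg", "Lys", "Arg", "Met", "Arg", "Met", "Met"]
# Window centered on the third residue (Arg*, index 2):
print(encode_window(seq, center=2))
# -> [1,0,0,0, 0,1,0,0, 1,0,0,0, 0,0,1,0, 1,0,0,0]
```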
Determination of Optimal Window Length
The window length determines the dimension of the feature vector; the optimal length was determined using 7-fold cross-validation.

Data Sets
- RS126 set: 126 nonhomologous proteins, where nonhomologous means no two proteins in the set share more than 25% sequence identity over a length of more than 80 residues.
- CB513 set: 513 nonhomologous proteins; almost all sequences of the RS126 set are included in CB513. Here nonhomology is defined more stringently: an SD score of ≥ 5 is regarded as homology. The SD score (the distance of the alignment score from the mean score of randomized sequences, measured in standard deviations) is a more stringent measure than percentage identity; in fact, 11 pairs of proteins in the RS126 set are sequence similar when the SD score is used instead of percentage identity. The CB513 set contains 16 chains of fewer than 30 residues, and short chains usually decrease accuracy.

Protein Secondary Structure Definition
DSSP assigns eight states: H (α-helix), G (3_10-helix), I (π-helix), E (β-strand), B (isolated β-bridge), T (turn), S (bend), and − (rest). The original slide tabulates two schemes, Reduction 1 and Reduction 2, for mapping the eight states onto the three prediction classes; in the slide's layout, H, G, and I map to helix (H), E and B map to sheet (E), and T, S, and − map to coil (C).

Reliability Index
A reliability index (RI) for single classifications is defined as a discretized function of the sample's distance to the separating hyperplane. Comparing RI with the accuracy measure makes the point: RI increases monotonically with increasing accuracy.

Accuracy Measures
Two per-class quantities are estimated, where Y is the true class/state variable and Ŷ is the output of the classifier:

    p(Ŷ = I | Y = I) = p(Ŷ = I, Y = I) / p(Y = I)
    p(Y = I | Ŷ = I) = p(Y = I, Ŷ = I) / p(Ŷ = I)

Segment Overlap Measure
The segment overlap measure (SOV) is a more realistic way of measuring accuracy than per-residue accuracy, in terms of relevance to 3D structure. Problems with per-residue accuracy measurement:
- Proteins with the same 3D fold differ by about 12% in secondary structure, so the maximum attainable Q3 should be about 88%.
- The ends of segments can vary between proteins with the same 3D structure, so their classification is less relevant to determining the protein structure.
SOV accounts for:
- the type and position of secondary structure segments, rather than per-residue agreement;
- the variation of the residues at the ends of segments.
[The original slides include a comparison of SOV and per-residue accuracy measures.]

Correlation Measure

    C_α = (p_α n_α − u_α o_α) / sqrt((n_α + u_α)(n_α + o_α)(p_α + u_α)(p_α + o_α)),

where p_α is the number of true positives, n_α the number of true negatives, o_α the number of false positives, and u_α the number of false negatives; α stands for H, E, or C. (A sketch of this computation appears at the end of these notes.)

Tertiary Classifier Design
- SVM_MAX_D: three one-versus-rest classifiers (H/~H, E/~E, C/~C); choose the class whose classifier output has the maximum distance from the hyperplane.
- SVM_VOTE: the one-versus-rest classifiers together with the pairwise classifiers (H/C, E/H, E/C) vote.
- SVM_NN: the outputs of the binary classifiers are combined by a neural network.
- SVM_JURY: all of the tertiary classifiers designed so far are combined to form a jury.

Results
[Tables in the original slides: overall results and a comparison with PHD.]

Results, Details: Support Vectors
In typical pattern recognition problems the ratio of support vectors to training samples is much lower. Interpretation: recall that the LOOCV error bound is E[# support vectors] / (# training samples); this means the error bound here is very high compared with typical pattern recognition problems.

Conclusions
- Application of the SVM method to secondary structure prediction gives good results (an improvement over PHD).
- The accuracy measure used to evaluate the algorithms, and the class reduction method, are important.
- A machine learning method is used to attack the protein structure problem at a lower level of complexity: the general protein folding problem in three dimensions requires too much computational power, so this is a cheap alternative.

Critique of the Article
- The number of support vectors is too large, which puts the performance of the SVM into question.
- The article does not put the machine learning aspects into clear form; it is too focused on the results.
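As referenced in the Correlation Measure section above: a small self-contained sketch of that computation. The names and example strings are illustrative, not from the paper.

```python
# Per-class correlation (Matthews correlation coefficient) for 3-state
# secondary structure strings, following the formula quoted above.
import math

def correlation(true_ss, pred_ss, cls):
    """C = (p*n - u*o) / sqrt((n+u)(n+o)(p+u)(p+o)) for class `cls`."""
    p = sum(t == cls and y == cls for t, y in zip(true_ss, pred_ss))  # true positives
    n = sum(t != cls and y != cls for t, y in zip(true_ss, pred_ss))  # true negatives
    o = sum(t != cls and y == cls for t, y in zip(true_ss, pred_ss))  # false positives
    u = sum(t == cls and y != cls for t, y in zip(true_ss, pred_ss))  # false negatives
    denom = math.sqrt((n + u) * (n + o) * (p + u) * (p + o))
    return (p * n - u * o) / denom if denom else 0.0

true_ss = "HHHHEEECCCCHHH"
pred_ss = "HHHCEEECCCHHHH"
for cls in "HEC":
    print(cls, round(correlation(true_ss, pred_ss, cls), 3))
```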