* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download PDF
Protein (nutrient) wikipedia , lookup
Multi-state modeling of biomolecules wikipedia , lookup
Nuclear magnetic resonance spectroscopy of proteins wikipedia , lookup
List of types of proteins wikipedia , lookup
Protein domain wikipedia , lookup
Circular dichroism wikipedia , lookup
Intrinsically disordered proteins wikipedia , lookup
Protein structure prediction wikipedia , lookup
2/11/2010 What are Proteins? 2 Amino Acids (20 types) CS 6784: STRUCTURAL SVM FOR PROTEIN SEQUENCE ALIGNMENT Any two amino acids can form peptide bonds to join together Represented by 20 letters: A, C, D, E, F, G, H, I, K, L, M N, P, Q, R, S, T, V, W, Y Feb 11, 2010 Guest Lecture Chun-Nam Yu Secondary Structures Tertiary Structures (Folds) A typical protein sequence: MDSIQAEEWYFGKITRRESERLLLNAENPRGTFLVRESE TTKGAYCLSVSDFDNAKGLNVKHYKIRKLDSGGFYITS RTQFNSLQQLVAYYSKHADGLCHRLTTVCP typically between 100 to 1000 amino acids long Fold into secondary structures: Alpha helix On top of 2nd structures fold into stables shapes Shape determines functions: Oxygen transport Building hair, muscle, etc Loop Beta sheet Homology Modeling of Proteins 5 The Learning Task 6 Given: new sequence LYNWVAKDVEPPKFTEVTDVVLITRD ADKVLKGEKVQAKYPVDLKLVVKQ Similar sequence, known structure Want to know: structure Build model align LYNWVAKDVEPPKFTEVTDVVLITRD ADKV-LKGEKVQAKYPV-DLKLVVKQ 1 2/11/2010 Sequence Alignments Given sequence pairs: AECD EACC E Score = -3 + 4 – 4 + 9 - 3 =3 Substitution Cost Matrix A E C D A C C Consider More General Linear Scoring Rule: µ ¶ µ ¶ µ ¶ µ ¶ µ ¶ A E C D ©(x; y) = Á x; +Á x; +Á x; +Á x; +Á x; E A C C ~ ¢ ©(x; y) Score of alignment = w Can still use Smith-Waterman E as score decomposes nicely A into alignment operations in µ ¶ E y ~ ¢Á x; w µ ¶ C ~ ¢Á x; w C D B B B B B ©(x; y) = B B B µ ¶ B B ~ ¢Á x; w B E B @ µ ¶ C ~ ¢Á x; w C AECD EACC E In the simplest case the match score s(ai ; bj ) is just a simple lookup over the BLOSUM matrix 0 1 Align ‘A’ with ‘A’ 1 Consider the feature map: B 0 C B . C Equivalent to BLOSUM B . C E Edge score = -d D E C More Complex Substitution Scores -AECD EACC- More Complex Substitution Scores Weighted string edit distance: Match, insert, delete Find highest scoring path through the graph E A C C Dynamic Programming A Recurrence: F(i; j) 8 < F(i-1; j-1) + s(ai ; bj ) = max F(i-1; j) - d : F(i; j-1) - d . 1 0 .. . 1 0 .. . 2 C C C C C C C C C C C C A Edge score = s(A,A) Edge score = s(A,C) E C Gap penalty d = -3 Smith-Waterman Algorithm A C C A Score = -1 – 1 +9 + 6 = 13 Alignment 2: -AECD EACC- Weighted string edit distance: Match, insert, delete String Edit Distance Graph: Alignment 1: AECD EACC Smith-Waterman Algorithm D Suppose we also have information on 2nd structures: x = (A E C D, E A C C) Extra feature counts: E A E C Gap C More Complex Substitution Scores Align ‘C’ with ‘C’ Align ‘E’ with ‘C’ A C C D A C C 0 B B B ©(x; y) = B B B @ µ ¶ ~ ¢Á x; w E 1 .. . .. C . C C 1 C C Align 2 C A Align .. . with with Our substitution score includes µ ¶ nd C 2 structure information too! ~ ¢Á x; w C 2 2/11/2010 Structural Support Vector Machines 13 Structural Support Vector Machines 14 QLVESGGGVVQPGKSLRLSCAA PEPVVAVALGA----SRQLTCR Structural SVM [Tsochantaridis et al. ‘04] n min ~ ;~» w n X 1 k~ wk2 + C »i 2 i=1 X 1 1 · i · n, for all alignments y ^ 2 Y, min k~ wk2 + C s:t:»for i ~ ;~» 2 w ~ ¢ ©(x ; y ) - w ~ ¢ ©(x ; y w ^) ¸ ¢(y ; y ^) - » i=1 QLVESGGGVVQPGKSLRLSCAA PEPVVAVALGASR----QLTCR i QLVESGGGVVQPGKSLRLSCAA PEPVVAVALGA---S-RQLTCR for all alignments y ^2Y i i i i s:t: for 1 · i · n, for all output structures y ^ 2 Y, Convex optimization problem µ ¶ ~ ¢ ©(xi; yi) - w ~ ¢ ©(x n i+ w ;y ^m ) ¸ ¢(yi; y ^) - »i O( ) = exponentially many constraints m Can solve using cutting-plane algorithm ~ ¢ ©(xi; yi) - w ~ ¢ ©(xi; y w ^) ¸ ¢(yi; y ^) - »i Learning with Many Features Feature Vectors (2 examples) 15 Complex Alignment Model with Many Features Structural SVM Learning 3 basic structural features: Amino acid (i.e., A, N, P, etc) Secondary structures (i.e., , , Exposure to water (i.e., 1, 2, 3, 4, 5) Simple Alignment Model Anova2: Pairwise feature interaction, e.g., s(A and , E and ) Window: Loss Functions ) Consider neighborhood of aligned site, e.g., s(AEC, CED), s( , ) Experiments 18 Q - loss Correct Alignment y: Q4 – loss (shift < 4) -AECD EACC- -AECD EACC Incorrect Alignment y’: Q-loss = 1/3 Incorrect Alignment y’: Training Set: ~5000 alignments [Qiu & Elber ’06] Test Set: ~30000 alignments from deposits to Protein Data Bank between June 05 to June 06 All structural alignments produced by the program CE by superposition of 3D coordinates A-ECD EACC- A-ECD EACC Correct Alignment y: Q4-loss = 0 (from pdb.org) 3 2/11/2010 Results on Alignment Accuracy Summary 19 Q4 score From [Yu, Joachims, Elber & Pillardy. RECOMB‘07] 72 71 70 69 68 67 66 65 71.32 69.8 68.04 67.51 An application of Structural SVM to a problem in computational biology Discriminative training allows us to incorporate complex features into the alignment models to improve alignment accuracy Showed how to design the feature vectors, loss functions, and inference procedures Anova2 Window Profile SSALN A strong (49634) (447016) (448642) generative model with hand-tuned Structural SVM features 4