CS6772 Project Presentation, 12/03/2003
Protein Classification Using Averaged Perceptron SVM
Eugene Ie

Protein Sequence Classification
• Protein = $(\Sigma)^*$, $|\Sigma| = 20$ amino acids
• Easy to sequence proteins, difficult to obtain structure
[Figure: the sequence VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR, with a "?" leading to its 3D structure, class (globin family, globin-like superfamily) and function (oxygen transport)]

Sequence Alignment vs. Classification
• Sequence similarity through alignment
[Figure: alignment of SGFIEEDELKLFL and SGFIEEEELKFVL; labels: close homology, distant homology]
• Sequence classification for remote homology
[Figure label: Classifier]

Structural Hierarchy of Proteins
[Figure: SCOP hierarchy (fold, superfamily, family) with positive training set, positive test set, negative training set, negative test set]
• Remote homologs:
– Structure and function conserved
– Sequence similarity low

Remote Homology Detection
• Discriminative supervised learning approach to protein classification
Approach: Support Vector Machines with String Kernels
C. Leslie, E. Eskin, J. Weston, and W. Noble, Mismatch String Kernels for SVM Protein Classification.
C. Leslie and R. Kuang, Fast Kernels for Inexact String Matching.

QP SVM Training
Sequence training data (total: n sequences + n labels):
>VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
…
>TYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
• Kernel matrix: $[K(x_i, x_j)]_{i,j=1}^{n}$, which takes $O(n^2)$ evaluations of $K(x, y)$
• QP solver (slow)
• Learned weights and bias: $w = \sum_i \alpha_i y_i \Phi(x_i)$, $b = \mathrm{avg}(b')$ (from KKT)

Averaged Perceptron SVM Training
Training Algorithm: Y. Freund and R. Schapire, Large Margin Classification Using the Perceptron Algorithm.
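A rough sketch of the perceptron training loop cited above, in kernel form, may help make the bookkeeping concrete. This is an illustration based on the voted/averaged perceptron of Freund and Schapire, not the code behind these slides. The function names (`train`, `averaged_score`, `predict`), the toy 2-D data, the `linear_kernel` stand-in for a string kernel, and the choice to start from a zero vector with index 0 are all assumptions made for this example.

```python
def linear_kernel(a, b):
    """Plain dot product; an assumed stand-in for a string kernel."""
    return sum(p * q for p, q in zip(a, b))


def train(X, y, kernel, epochs=1):
    """Run the perceptron for a number of epochs.

    Returns the ordered list of mistake indices and, for each
    intermediate prediction vector, how many examples it survived
    (used later as its voting weight). Vector 0 is the zero vector.
    """
    mistakes = []   # mistake indices m_0, m_1, ... in the order they occur
    counts = [0]    # counts[i] = examples classified correctly by vector i
    for _ in range(epochs):
        for j in range(len(X)):
            # <v, x_j> expressed through kernel products with past mistakes
            score = sum(y[m] * kernel(X[m], X[j]) for m in mistakes)
            if y[j] * score > 0:
                counts[-1] += 1          # current vector survives
            else:
                mistakes.append(j)       # v <- v + y_j * x_j (implicitly)
                counts.append(1)
    return mistakes, counts


def averaged_score(x, X, y, mistakes, counts, kernel):
    """Count-weighted sum of <v_i, x> over all intermediate vectors.

    Uses one kernel product per mistake: each <v_{i+1}, x> is obtained
    from <v_i, x> by adding y_m * K(x_m, x).
    """
    inner = 0.0                  # <v_0, x> for the zero vector
    total = 0.0
    for i, m in enumerate(mistakes):
        inner += y[m] * kernel(X[m], x)
        total += counts[i + 1] * inner
    return total


def predict(x, X, y, mistakes, counts, kernel):
    return 1 if averaged_score(x, X, y, mistakes, counts, kernel) > 0 else -1


# Toy, linearly separable data (hypothetical, for illustration only).
X = [(2, 1), (1, 2), (-1, -2), (-2, -1)]
y = [1, 1, -1, -1]
mistakes, counts = train(X, y, linear_kernel, epochs=2)
print("mistake indices:", mistakes, "survival counts:", counts)
print("prediction for (3, 3):", predict((3, 3), X, y, mistakes, counts, linear_kernel))
```

Only the mistake indices and survival counts are stored, so no full kernel matrix is needed; swapping `linear_kernel` for a string kernel over protein sequences would leave the rest of the sketch unchanged.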
Averaged Perceptron SVM Training
Sequence training data (total: n sequences + n labels):
>VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSFPTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
…
>TYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLPAEFTPAVHASLDKFLASVSTVLTSKYR
• Iterate t epochs, running the perceptron algorithm: $O(kn)$ evaluations of $K(x, y)$
• Final weight vector $(w)_{i=0}^{s}$, voting weights $(v)_{i=1}^{k}$
• s = no. of dimensions in feature space
• k = no. of mistakes made during perceptron run
• SCOP experiments show: for average n ~ 1000, average k ~ 50-60

Generalized Bound for k
$$k = O\!\left(\left(\frac{R + D}{\gamma}\right)^{2}\right), \qquad R \ge \|x_i\|, \quad \gamma > 0$$
$$D^2 = \sum_{i=1}^{n} d_i^2, \qquad d_i = \max\{0,\ \gamma - y_i \langle u, x_i \rangle\}, \qquad u \in \mathbb{R}^s \ \text{s.t.}\ \|u\| = 1$$

Averaged Perceptron SVM Classification
Testing Algorithm:
• Note: only k kernel products with the unknown sequence x need to be computed.
• Recurrence relation: $\langle v_{i+1}, x \rangle = \langle v_i, x \rangle + y_{m_i} \langle x_{m_i}, x \rangle$, $m_i \in M$
• M is the set of "mistake indices"

Implementation Details
• Built on top of the protclass (Protein Classification) platform
[Figure labels: Java Platform, Classification Task, Classification Task]
• Hash table scan instead of mismatch trie
• Generate mismatch mappings once using shifts
• Dynamic kernel matrix storage
• Still needs debugging

Speed/Space Performance
• ~80% reduction in space requirement
• ~50% reduction in training time
• ~50% reduction in testing time
• Mainly from simple online algorithm
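The hash-table idea listed under Implementation Details can be illustrated with a small sketch of a (k, m)-mismatch kernel. This is not the protclass code, which is in Java and is not shown in the slides. It assumes the usual definition of the mismatch feature map, in which every k-mer in a sequence contributes a count to all k-mers within m substitutions of it. The names `neighbors`, `mismatch_features`, `mismatch_kernel`, and `kernel_matrix` are hypothetical, and the shift-based generation of mismatch mappings mentioned above is not reproduced.

```python
from collections import Counter
from itertools import combinations, product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20-letter alphabet


def neighbors(kmer, m, alphabet=AMINO_ACIDS):
    """Set of all strings within Hamming distance m of kmer."""
    result = {kmer}
    for r in range(1, m + 1):
        for positions in combinations(range(len(kmer)), r):
            for letters in product(alphabet, repeat=r):
                chars = list(kmer)
                for pos, letter in zip(positions, letters):
                    chars[pos] = letter
                result.add("".join(chars))
    return result


def mismatch_features(seq, k=3, m=1):
    """Sparse feature vector stored in a hash table (Counter)."""
    feats = Counter()
    for i in range(len(seq) - k + 1):
        for nb in neighbors(seq[i:i + k], m):
            feats[nb] += 1
    return feats


def dot(fx, fy):
    """Inner product of two sparse vectors; scans the smaller table."""
    if len(fx) > len(fy):
        fx, fy = fy, fx
    return sum(count * fy[key] for key, count in fx.items())


def mismatch_kernel(x, y, k=3, m=1):
    return dot(mismatch_features(x, k, m), mismatch_features(y, k, m))


def kernel_matrix(seqs, k=3, m=1):
    """Full symmetric matrix [K(x_i, x_j)]; quadratic in len(seqs)."""
    feats = [mismatch_features(s, k, m) for s in seqs]
    n = len(seqs)
    K = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):
            K[i][j] = K[j][i] = dot(feats[i], feats[j])
    return K


# The two short sequences from the alignment slide.
seqs = ["SGFIEEDELKLFL", "SGFIEEEELKFVL"]
K = kernel_matrix(seqs, k=3, m=1)
print(K)
```

For an online learner only the kernel values against mistake examples are needed, so in practice `mismatch_kernel` would be called on demand rather than filling `kernel_matrix` up front.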