CS6772 Project Presentation
12/03/2003
Protein Classification
Using Averaged Perceptron SVM
Eugene Ie
Protein Sequence Classification
• Protein = $(\Sigma)^*$, $|\Sigma| = 20$ amino acids
• Easy to sequence proteins, difficult to obtain structure
[Figure: 3D Structure]
Sequence:
VLSPADKTNVKAAWGKVGAHAGEYGAEALER
MFLSFPTTKTYFPHFDLSHGSAQVKGHGKKV
ADALTNAVAHVDDMPNALSALSDLHAHKLRV
DPVNFKLLSHCLLVTLAAHLPAEFTPAVHAS
LDKFLASVSTVLTSKYR
?
Class: Globin family, Globin-like superfamily
Function: Oxygen transport
Sequence Alignment vs. Classification
• Sequence similarity through alignment
[Figure: example alignment, labeled "distant homology" and "close homology"]
SGFIEEDELKLFL
SGFIEEEELKFVL
• Sequence classification for remote homology
[Figure: Classifier]
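To make "sequence similarity through alignment" concrete, here is a minimal sketch that scores the aligned pair shown on this slide by the fraction of identical positions. The class and method names are hypothetical, and the sketch assumes an ungapped alignment of equal-length sequences; real alignment tools also handle gaps and substitution scores.

```java
// Hypothetical helper, not part of the slides or of protclass.
final class AlignmentIdentity {
    /** Fraction of positions at which two pre-aligned, equal-length sequences agree. */
    static double identity(String a, String b) {
        if (a.isEmpty() || a.length() != b.length()) {
            throw new IllegalArgumentException("need non-empty sequences of equal length");
        }
        int matches = 0;
        for (int i = 0; i < a.length(); i++) {
            if (a.charAt(i) == b.charAt(i)) {
                matches++;
            }
        }
        return (double) matches / a.length();
    }

    public static void main(String[] args) {
        // The two fragments from the slide agree at 10 of their 13 positions.
        System.out.println(identity("SGFIEEDELKLFL", "SGFIEEEELKFVL"));
    }
}
```

Remote homologs typically score too low on a measure like this for alignment alone to be reliable, which is the motivation for the classifier-based approach.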
Structural Hierarchy of Proteins
SCOP: Fold > Superfamily > Family
[Figure: SCOP hierarchy marking the positive training set, positive test set, negative training set, and negative test set]
• Remote homologs:
– Structure and function conserved
– Sequence similarity - low
Remote Homology Detection
• Discriminative supervised learning approach to
protein classification
Approach: Support Vector Machines with String Kernels
C. Leslie, E. Eskin, J. Weston, and W. Noble, Mismatch String Kernels for SVM
Protein Classification.
C. Leslie and R. Kuang, Fast Kernels for Inexact String Matching.
QP SVM Training
Sequence Training Data
>VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSF
PTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHV
DDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAA
HLPAEFTPAVHASLDKFLASVSTVLTSKYR
…
>TYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDM
PNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLP
AEFTPAVHASLDKFLASVSTVLTSKYR
Total: n sequences + n labels
Kernel matrix: $[K(x_i, x_j)]_{i,j=1}^{n}$, which takes $O(n^2)$ evaluations of $K(x, y)$
QP Solver (slow)
Learned Weights and Bias
$w = \sum_i \alpha_i y_i \Phi(x_i)$
$b = \operatorname{avg}(b')$ (from KKT)
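The quadratic cost comes from filling the whole kernel matrix before the QP solver can run. A minimal sketch of that step follows. It swaps in the plain k-spectrum kernel (an inner product of k-mer counts) as a simpler stand-in for the mismatch kernel cited above, and every name in it is hypothetical.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: spectrum kernel used as a stand-in for the mismatch kernel.
final class KernelMatrixSketch {
    /** Counts every length-k substring of seq. */
    static Map<String, Integer> spectrum(String seq, int k) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + k <= seq.length(); i++) {
            counts.merge(seq.substring(i, i + k), 1, Integer::sum);
        }
        return counts;
    }

    /** K(x, y): inner product of the two k-mer count vectors. */
    static double kernel(String x, String y, int k) {
        Map<String, Integer> a = spectrum(x, k);
        Map<String, Integer> b = spectrum(y, k);
        double sum = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                sum += (double) e.getValue() * other;
            }
        }
        return sum;
    }

    /** Fills the n-by-n matrix [K(x_i, x_j)]; symmetry halves the work but it stays O(n^2). */
    static double[][] gram(String[] seqs, int k) {
        int n = seqs.length;
        double[][] g = new double[n][n];
        for (int i = 0; i < n; i++) {
            for (int j = i; j < n; j++) {
                g[i][j] = kernel(seqs[i], seqs[j], k);
                g[j][i] = g[i][j];
            }
        }
        return g;
    }
}
```

For example, `KernelMatrixSketch.gram(new String[] {"ABAB", "BABA"}, 2)` has 5 on the diagonal and 4 off the diagonal.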
Averaged Perceptron SVM Training
Training Algorithm:
Y. Freund and R. Schapire, Large Margin Classification Using the Perceptron
Algorithm.
Sequence Training Data
>VLSPADKTNVKAAWGKVGAHAGEYGAEALERMFLSF
PTTKTYFPHFDLSHGSAQVKGHGKKVADALTNAVAHV
DDMPNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAA
HLPAEFTPAVHASLDKFLASVSTVLTSKYR
…
>TYFPHFDLSHGSAQVKGHGKKVADALTNAVAHVDDM
PNALSALSDLHAHKLRVDPVNFKLLSHCLLVTLAAHLP
AEFTPAVHASLDKFLASVSTVLTSKYR
Total: n sequences + n labels
Run Perceptron Algorithm, iterate t epochs
Evaluations of $K(x, y)$: $O(kn)$
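The training loop can be sketched in kernel form, following the voted perceptron of Freund and Schapire. This is an illustrative reading of the slide and not the protclass code; the class, the `Model` container, and the tie rule (a score of zero counts as a mistake) are assumptions. Each step scores an example against the stored mistakes only, so one pass over n examples costs at most k kernel evaluations per example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleBiFunction;

// Hypothetical sketch of kernelized voted/averaged perceptron training.
final class KernelPerceptronTrainer {
    /** Mistake indices m_1..m_k plus survival counts c_0..c_k of the prediction vectors. */
    static final class Model {
        final List<Integer> mistakes = new ArrayList<>();
        final List<Integer> counts = new ArrayList<>();
    }

    static <T> Model train(List<T> xs, int[] ys, ToDoubleBiFunction<T, T> kernel, int epochs) {
        Model m = new Model();
        m.counts.add(0); // c_0 belongs to the initial vector v_0 = 0
        for (int t = 0; t < epochs; t++) {
            for (int i = 0; i < xs.size(); i++) {
                // <v, x_i> expressed through the mistakes made so far.
                double score = 0.0;
                for (int j : m.mistakes) {
                    score += ys[j] * kernel.applyAsDouble(xs.get(j), xs.get(i));
                }
                if (ys[i] * score > 0) {
                    // Correct: the current prediction vector survives one more example.
                    int last = m.counts.size() - 1;
                    m.counts.set(last, m.counts.get(last) + 1);
                } else {
                    // Mistake: implicitly v <- v + y_i * Phi(x_i), starting a new vector.
                    m.mistakes.add(i);
                    m.counts.add(1);
                }
            }
        }
        return m;
    }
}
```

Only the mistake indices and counts are stored, so no full kernel matrix is needed during training.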
Final Weight Vector, Voting Weights
$(w_i)_{i=0}^{s}$, $(v_i)_{i=1}^{k}$
s = no. of dimensions in feature space
k = no. of mistakes made during perceptron run
Generalized Bound for k
$$k = O\left(\left(\frac{R + D}{\gamma}\right)^{2}\right)$$
$R \ge \|x_i\|$, $\gamma > 0$
$D^2 = \sum_{i=1}^{n} d_i^2$
$d_i = \max\{0, \gamma - y_i \langle u, x_i \rangle\}$
$u \in \mathbb{R}^s$ s.t. $\|u\| = 1$
SCOP experiments show:
For average n ~ 1000
Average k ~ 50-60
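As a worked illustration of the bound, the sketch below evaluates $((R + D)/\gamma)^2$ for explicit feature vectors, taking $R$ as the largest norm in the sample. The class name and the toy inputs in the usage note are assumptions, and any constant hidden by the $O(\cdot)$ is ignored.

```java
// Hypothetical helper for evaluating the mistake bound on small explicit data.
final class MistakeBound {
    /** Returns ((R + D) / gamma)^2 for a unit vector u and margin gamma > 0. */
    static double bound(double[][] xs, int[] ys, double[] u, double gamma) {
        double r = 0.0;        // R = max_i ||x_i||
        double dSquared = 0.0; // D^2 = sum_i d_i^2
        for (int i = 0; i < xs.length; i++) {
            double normSquared = 0.0;
            double projection = 0.0; // <u, x_i>
            for (int j = 0; j < u.length; j++) {
                normSquared += xs[i][j] * xs[i][j];
                projection += u[j] * xs[i][j];
            }
            r = Math.max(r, Math.sqrt(normSquared));
            double d = Math.max(0.0, gamma - ys[i] * projection); // d_i
            dSquared += d * d;
        }
        double ratio = (r + Math.sqrt(dSquared)) / gamma;
        return ratio * ratio;
    }
}
```

With points (1, 0) labeled +1 and (-1, 0) labeled -1, u = (1, 0) and gamma = 1, every d_i is zero and the value reduces to (R / gamma)^2 = 1, the classical separable-case bound.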

Averaged Perceptron SVM Classification
Testing Algorithm:
Note: Only k kernel products with unknown sequence x need to be computed.
Recurrence relation:
vi 1 , x  vi , x  ymi xmi , x
mi  M
M is the set of “mistake indices”
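The recurrence means one pass over the mistake indices yields every $\langle v_i, x \rangle$, using exactly k kernel products. A minimal sketch of the averaged prediction built on it is below; the class, the argument layout (mistake indices plus survival counts c_0..c_k), and the convention that a zero score maps to +1 are assumptions, with the inner product in feature space computed by the kernel.

```java
import java.util.List;
import java.util.function.ToDoubleBiFunction;

// Hypothetical sketch of averaged-perceptron prediction via the recurrence.
final class AveragedPerceptronClassifier {
    /**
     * xs, ys: training data; mistakes: indices m_1..m_k; counts: c_0..c_k.
     * Returns +1 or -1 for the unknown example x.
     */
    static <T> int predict(List<T> xs, int[] ys, List<Integer> mistakes, List<Integer> counts,
                           ToDoubleBiFunction<T, T> kernel, T x) {
        double running = 0.0;                      // <v_0, x> = 0
        double averaged = counts.get(0) * running; // sum_i c_i <v_i, x>
        for (int i = 0; i < mistakes.size(); i++) {
            int mi = mistakes.get(i);
            // <v_{i+1}, x> = <v_i, x> + y_{m_i} K(x_{m_i}, x): one kernel product per mistake.
            running += ys[mi] * kernel.applyAsDouble(xs.get(mi), x);
            averaged += counts.get(i + 1) * running;
        }
        return averaged >= 0 ? 1 : -1;
    }
}
```

Replacing `counts.get(i + 1) * running` with `counts.get(i + 1) * Math.signum(running)` would give the voted variant instead of the averaged one.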
Implementation Details
• Built on top of protclass (Protein Classification) platform
• Java Platform
• Classification Task
• Hash table scan instead of Mismatch Trie
• Generate mismatch mappings once using shifts
• Dynamic kernel matrix storage
• Still needs debugging
• Speed/Space Performance
– ~80% reduction in space requirement
– ~50% reduction in training time
– ~50% reduction in testing time
– Mainly from simple online algorithm
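One way to compute a (k, m)-mismatch kernel with hash tables in place of a trie is sketched below. It is an illustration only and not the protclass implementation: the class and method names are hypothetical, and the "shifts" trick mentioned above is not reproduced. Each sequence is mapped once to a table of k-mer counts that includes every k-mer within m substitutions, and the kernel is the inner product of two such tables.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of a hash-table mismatch kernel.
final class HashMismatchKernel {
    static final char[] ALPHABET = "ACDEFGHIKLMNPQRSTVWY".toCharArray();

    /** Adds 1 for every k-mer reachable from kmer with at most budget substitutions at pos or later. */
    static void addNeighbors(char[] kmer, int pos, int budget, Map<String, Integer> counts) {
        if (pos == kmer.length) {
            counts.merge(new String(kmer), 1, Integer::sum);
            return;
        }
        addNeighbors(kmer, pos + 1, budget, counts); // keep the residue at pos
        if (budget > 0) {
            char original = kmer[pos];
            for (char c : ALPHABET) {
                if (c != original) {
                    kmer[pos] = c; // substitute, spending one mismatch
                    addNeighbors(kmer, pos + 1, budget - 1, counts);
                }
            }
            kmer[pos] = original;
        }
    }

    /** Mismatch feature map of one sequence; it can be built once and reused. */
    static Map<String, Integer> featureMap(String seq, int k, int m) {
        Map<String, Integer> counts = new HashMap<>();
        for (int i = 0; i + k <= seq.length(); i++) {
            addNeighbors(seq.substring(i, i + k).toCharArray(), 0, m, counts);
        }
        return counts;
    }

    /** K(x, y): scan one table and look each k-mer up in the other. */
    static double kernel(String x, String y, int k, int m) {
        Map<String, Integer> a = featureMap(x, k, m);
        Map<String, Integer> b = featureMap(y, k, m);
        double sum = 0.0;
        for (Map.Entry<String, Integer> e : a.entrySet()) {
            Integer other = b.get(e.getKey());
            if (other != null) {
                sum += (double) e.getValue() * other;
            }
        }
        return sum;
    }
}
```

The table for one k-mer grows roughly as $\binom{k}{m} 19^m$ entries, so this approach is only practical for small m.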