Protein Folding Recognition with Committee Machine
Mika Takata
Outline
Background
System Outline
Experiment
Experimental result
Reference
Background
SCOP: Structural Classification of Proteins database
Class: All alpha, All beta, a/b, a+b, ...
Fold: Globin-like, Cytochrome c, Cupredoxins, (TIM) barrel, β-grasp, ...
Computation + biology + chemistry + medicine + ... = significantly important
Fold level class: remote homology
Better recognition, better tertiary structure prediction
1. Chemical approaching parameter ( i )
i. 6 types of chemical features
ii. String windows N-grams
iii. Protein molecular weight value
iv. Protein sequential length value
1. Chemical approaching parameter ( ii ): Global parameter
Symbol C: frequencies of the 20 amino acid symbols in a protein sequence
Symbols S, H, V, P, Z: composition (3-dim), transition (3-dim), distribution (3×5-dim)
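The global parameters above can be sketched as follows. This is a minimal illustration, not the author's implementation: the three residue groups in `EXAMPLE_GROUPS`, the function names, and the exact definitions of transition and distribution are assumptions, since the slide only gives the dimensions (20 for C, and 3 + 3 + 3×5 = 21 for each of S, H, V, P, Z). The sequence is assumed to contain only the 20 standard one-letter symbols.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard one-letter symbols

# Assumed three-way grouping for one property; the slides do not list
# the groups, so treat this as a placeholder.
EXAMPLE_GROUPS = ["RKEDQN", "GASTPHY", "CVLIMFW"]


def composition_c(seq):
    """Symbol C: frequency of each of the 20 amino acids (20-dim)."""
    counts = Counter(seq)
    return [counts[a] / len(seq) for a in AMINO_ACIDS]


def property_descriptor(seq, groups=EXAMPLE_GROUPS):
    """Composition (3) + transition (3) + distribution (3x5) = 21-dim."""
    group_of = {aa: g for g, members in enumerate(groups) for aa in members}
    encoded = [group_of[aa] for aa in seq]
    n, k = len(encoded), len(groups)

    # Composition: fraction of residues that fall in each group.
    composition = [encoded.count(g) / n for g in range(k)]

    # Transition: fraction of adjacent pairs that switch between two groups.
    transition = []
    for a in range(k):
        for b in range(a + 1, k):
            switches = sum(
                1 for x, y in zip(encoded, encoded[1:]) if {x, y} == {a, b}
            )
            transition.append(switches / max(n - 1, 1))

    # Distribution: relative chain position of the first, 25%, 50%, 75%
    # and last residue of each group (0.0 if the group never occurs).
    distribution = []
    for g in range(k):
        positions = [i + 1 for i, x in enumerate(encoded) if x == g]
        for frac in (0.0, 0.25, 0.5, 0.75, 1.0):
            if positions:
                idx = max(math.ceil(frac * len(positions)) - 1, 0)
                distribution.append(positions[idx] / n)
            else:
                distribution.append(0.0)

    return composition + transition + distribution
```

Calling `property_descriptor` once per property (with a different grouping each time) and concatenating the results with `composition_c` would give one feature vector per symbol set.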
1. Chemical approaching parameter ( iii )
Protein molecular weight value
Sum of the molecular weights of the amino acids
Use of molecular weight (normalized): $y_i \leftarrow \frac{y_i - \bar{y}}{SD}$
Protein sequential length value
Use of sequential length (normalized): $l_i \leftarrow \frac{l_i - \bar{l}}{SD}$
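A short sketch of these two values and their normalization. The residue masses are approximate average values added here for illustration (they are not from the slides), and using the population standard deviation for SD is an assumption, since the slide does not say which estimator is used.

```python
import statistics

# Approximate average residue masses in daltons (assumed, for illustration).
RESIDUE_MASS = {
    "G": 57.05, "A": 71.08, "S": 87.08, "P": 97.12, "V": 99.13,
    "T": 101.10, "C": 103.14, "L": 113.16, "I": 113.16, "N": 114.10,
    "D": 115.09, "Q": 128.13, "K": 128.17, "E": 129.12, "M": 131.19,
    "H": 137.14, "F": 147.18, "R": 156.19, "Y": 163.18, "W": 186.21,
}


def molecular_weight(seq):
    """y_i: sum of the (approximate) masses of the residues in seq."""
    return sum(RESIDUE_MASS[aa] for aa in seq)


def standardize(values):
    """Map each value v to (v - mean) / SD over the whole data set."""
    mean = statistics.fmean(values)
    sd = statistics.pstdev(values)  # population SD (assumption)
    if sd == 0:
        return [0.0 for _ in values]  # all values identical
    return [(v - mean) / sd for v in values]


# Usage: one normalized weight and one normalized length per protein.
proteins = ["GAVL", "WWYF", "GASPVTC"]
weights = standardize([molecular_weight(p) for p in proteins])
lengths = standardize([len(p) for p in proteins])
```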
2. Feature parameter based on Sliding window
N-Gram
Proteomic fragment similarity
$c(\alpha, x) = \frac{\text{number of occurrences of } \alpha \text{ in } x}{\text{length of } x}$
…… NSDWTNNETRHAIVILIIIIIMLRHGKIPYWCMIPFAA…
(*) string length = 2
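The N-gram feature can be sketched as below, assuming that occurrences are counted with an overlapping sliding window and that the count is divided by the sequence length, as in the formula. The function names are illustrative, and `feature_dot` (a plain inner product of two feature maps) is an assumed way to compare two sequences, not a statement of the exact kernel used in the experiments.

```python
from collections import Counter


def ngram_features(x, n=2):
    """Return {alpha: c(alpha, x)} for every length-n substring alpha of x."""
    # Slide a window of width n over x, one position at a time.
    counts = Counter(x[i:i + n] for i in range(len(x) - n + 1))
    return {alpha: k / len(x) for alpha, k in counts.items()}


def feature_dot(fx, fy):
    """Inner product of two sparse N-gram feature maps."""
    return sum(v * fy.get(alpha, 0.0) for alpha, v in fx.items())


# Usage on the fragment shown on the slide, with string length = 2.
fragment = "NSDWTNNETRHAIVILIIIIIMLRHGKIPYWCMIPFAA"
features = ngram_features(fragment, n=2)
```

Storing the features as a dictionary keeps only the N-grams that actually occur, which matters once the string length grows (20² = 400 possible 2-grams, 20³ = 8000 possible 3-grams).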
3. Feature parameter based on HMM
Fig 1: feature parameter flow based on HMM
System Outline (figure)
Step 1: Training data and test data are passed to three models
Model Ⅰ: C, V, S, P, H, Z, Mol-Weight, Seq-Length
Model Ⅱ: Spectrum Kernel
Model Ⅲ: HMM
Step 2: Committee SVM_1, ..., Committee SVM_i, ..., Committee SVM_27 give decision_1, ..., decision_i, ..., decision_27
Evaluation measurement: "Accuracy Q"
C_i shows how correctly class i is recognized:
$C_i = \sum_{j \in \mathrm{class}\, i} \frac{TP_j}{TP_j + FP_j}$
The number of data differs from class to class, so Q weights each class:
$Q = \sum_{i=1}^{27} w_i Q_i = \sum_{i=1}^{27} w_i \frac{c_i}{n_i} = \frac{C}{N}, \qquad w_i = \frac{n_i}{N}$
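A small sketch of the weighted accuracy, assuming that c_i is the number of correctly recognized test proteins in class i, n_i is the number of test proteins in class i, and N and C are their totals. With these definitions the weighted sum of per-class accuracies collapses to C / N. The function name and the example counts are made up for illustration.

```python
def weighted_accuracy(correct, totals):
    """Q = sum_i w_i * Q_i with Q_i = c_i / n_i and w_i = n_i / N."""
    N = sum(totals)
    per_class = [c / n for c, n in zip(correct, totals)]  # Q_i
    weights = [n / N for n in totals]                      # w_i
    return sum(w * q for w, q in zip(weights, per_class))


# Usage: two classes of different size (hypothetical counts).
correct = [3, 1]   # c_i
totals = [4, 2]    # n_i
Q = weighted_accuracy(correct, totals)

# The weighted sum equals total correct over total tested, C / N.
assert abs(Q - sum(correct) / sum(totals)) < 1e-12
```

An unweighted mean of the per-class accuracies would give (0.75 + 0.5) / 2 = 0.625 here, whereas Q is 4 / 6, which is why the class sizes enter the formula.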

Experiment
Parameter
i. Chemical approaching parameter
ii. Feature parameter based on Sliding window kernel (string length = 2 & 3)
iii. Feature parameter based on HMM
Classification Methods
i. Independent SVM
ii. Committee SVM Array
Multi-class recognition approaches
i. One-vs-others
ii. All-vs-All method
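The two multi-class approaches differ in how binary decisions are combined. The sketch below assumes the binary classifiers have already been trained and only shows the final decision step; the function names, the score dictionary, and the pairwise-winner table are hypothetical inputs, and the tie-breaking rule is an assumption.

```python
from collections import Counter
from itertools import combinations


def one_vs_others(scores):
    """scores[c] is the decision value of the 'class c vs. the rest'
    classifier; pick the class with the largest value."""
    return max(scores, key=scores.get)


def all_vs_all(pairwise_winner, classes):
    """pairwise_winner[(a, b)] is the class preferred by the 'a vs. b'
    classifier; pick the class with the most votes (ties go to the
    class that received its first vote earliest)."""
    votes = Counter(pairwise_winner[(a, b)] for a, b in combinations(classes, 2))
    return votes.most_common(1)[0][0]


def count_binary_classifiers(num_classes):
    """Number of binary classifiers each approach needs."""
    return num_classes, num_classes * (num_classes - 1) // 2


# Usage with three hypothetical fold classes.
print(one_vs_others({"a": -0.3, "b": 1.2, "c": 0.1}))
print(all_vs_all({("a", "b"): "b", ("a", "c"): "a", ("b", "c"): "b"}, ["a", "b", "c"]))
print(count_binary_classifiers(27))
```

For 27 classes, one-vs-others needs 27 binary classifiers while all-vs-all needs 27 × 26 / 2 = 351.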

Data set
Training data: 341, test data: 353 (total: 694)
http://www.nersc.gov/~cding/protein
Cross Validation: 10 times

Result (1): Independent SVM- Model I
Result (2): CM- Model I
Result (3): CM- Model II
Result (3): Model I & II
Result (4): Model I & III
Result (5) : Model I & II & III
Conclusion
Using all models in the Committee Machine improves recognition
The spectrum kernel works when used with a string length of 2
Advantages
Takes advantage of sporadic data (e.g., chemical-based and HMM features)
Reduces computational cost
Reference ( i )
1. Takata, M., Matsuyama, Y.: Protein Folding Classification by Committee SVM Array. Lecture Notes in Computer Science, No. 5507, pp. 369-377, 2009.
2. Matsuyama, Y., Kawasaki, K., Hotta, T., Mizutani, Takata, M., Ishida, A.: Eukaryotic transcription start site recognition involving non-promoter model. Intelligent Systems for Molecular Biology, Toronto (2008) L05.
3. Matsuyama, Y., Ishihara, Y., Ito, Y., Hotta, T., Kawasaki, K., Hasegawa, T., Takata, M.: Promoter recognition involving motif detection: Studies on E. coli and human genes. Intelligent Systems for Molecular Biology, Vienna (2007) H06.
4. Dubchak, I., Muchnik, I., Holbrook, S.R., Kim, S-H.: Prediction of protein folding class using global description of amino acid sequence. Proc. Natl. Acad. Sci. USA 92 (1995) 8700–8704.
5. Dubchak, I., Muchnik, I., Mayor, C., Dralyuk, I., Kim, S-H.: Recognition of a Protein Fold in the Context of the SCOP Classification. Proteins: Structure, Function, and Genetics 35 (1999) 401–407.
Reference ( ii )
1. Ding, C.H.Q., Dubchak, I.: Multi-class protein fold recognition using support vector machines and neural networks. Bioinfo. 17 (2001) 349–358.
2. Mount, D.W.: Bioinformatics. Cold Spring Harbor Laboratory Press (2001).
3. Murzin, A.G., Brenner, S.E., Hubbard, T., Chothia, C.: SCOP: a structural classification of proteins database for the investigation of sequences and structures. J. Mol. Biol. 247 (1995) 536–540.
4. Leslie, C., Eskin, E., Noble, W.S.: The Spectrum kernel: A string kernel for SVM protein classification. Pacific Symposium on Biocomputing 7 (2002) 566–575.
5. Tabrez, M., Shamim, A., Anwaruddin, M., Nagarajaram, H.A.: Support vector machine-based classification of protein folds using the structural properties of amino acid residues and amino acid residue pairs. Bioinfo. 23 (2007) 3320–3327.
6. Lodhi, H., Saunders, C., Shawe-Taylor, J., Watkins, C.: Text classification using string kernels. J. of Machine Learning Research 2 (2002) 419–444.