Using a Mixture of Probabilistic Decision Trees for Direct Prediction of Protein Functions
Paper by Umar Syed and Golan Yona
Department of Computer Science, Cornell University
Presentation by Andrejus Parfionovas
Department of Math & Stat, USU
Classical methods to predict the structure of a new protein:
• Sequence comparison to known proteins in search of similarities
  – Sequences often diverge and become unrecognizable
• Structure comparison to known structures in the PDB database
  – Structural data is sparse and not available for newly sequenced genes
2
What other features can be used to improve prediction?
• Domain content
• Subcellular location
• Tissue specificity
• Species type
• Pairwise interaction
• Enzyme cofactors
• Catalytic activity
• Expression profiles, etc.
3
With so many features, it is important:
• To extract relevant information
  – Directly from the sequence
  – Predicted secondary structure
  – Features extracted from databases
• To combine the data in a feasible model
  – A mixture model of Probabilistic Decision Trees (PDT) was used
4
Features extracted directly from the sequence (percentages):
• 20 individual amino acids
• 16 amino acid groups (positively or negatively charged, polar, aromatic, hydrophobic, acidic, etc.)
• 20 most informative dipeptides
5
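As an illustration of the sequence-derived features above, the sketch below computes amino-acid composition, the composition of a few example residue groups, and dipeptide frequencies. The group definitions, function names, and the example sequence are my own assumptions; the paper's 16 groups and its choice of the 20 most informative dipeptides are not reproduced here.

```python
# Illustrative sketch (not the authors' code) of sequence-derived features:
# amino-acid composition, a few example group percentages, and dipeptide frequencies.
from collections import Counter
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Example groups only; the paper's 16 groups are not reproduced here.
GROUPS = {
    "positive": set("KRH"),
    "negative": set("DE"),
    "aromatic": set("FWY"),
    "hydrophobic": set("AVLIMFWC"),
}

def sequence_features(seq: str) -> dict:
    seq = seq.upper()
    n = len(seq)
    counts = Counter(seq)
    feats = {f"aa_{a}": counts[a] / n for a in AMINO_ACIDS}      # 20 composition features
    for name, members in GROUPS.items():                         # group percentages
        feats[f"grp_{name}"] = sum(counts[a] for a in members) / n
    dipep = Counter(seq[i:i + 2] for i in range(n - 1))          # dipeptide frequencies
    for a, b in product(AMINO_ACIDS, repeat=2):
        feats[f"dp_{a}{b}"] = dipep[a + b] / (n - 1)
    return feats  # a feature-selection step would keep only the most informative dipeptides

print(sequence_features("MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ")["grp_aromatic"])
```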
Features predicted from the sequence:
• Secondary structure predicted by PSIPRED:
  – Coil
  – Helix
  – Strand
6
Features extracted from the SWISSPROT database:
• Binary features (presence/absence)
  – Alternative products
  – Enzyme cofactors
  – Catalytic activity
• Nominal features
  – Tissue specificity (2 different definitions)
  – Subcellular location
  – Organism and species classification
• Continuous features
  – Number of patterns exhibited by each protein (the "complexity" of a protein)
7
Mixture model of PDT (Probabilistic Decision Trees)
• Can handle nominal data
• Robust to errors
• Missing data is allowed
8
How to select an attribute for a decision node?
• Use entropy to measure the impurity
• Impurity must decrease after the split

  \mathrm{Gain}(S, A) = i(S) - \sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, i(S_v),

  where  i(S) = -\sum_{j=1}^{s} p_j \log_2(p_j)  is the entropy.

• Alternative measure: the de Mántaras distance metric (less biased towards attributes with low split information).
9
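A minimal sketch of the split criterion on slide 9: an entropy impurity function and the information gain of a candidate attribute. The data layout (one dictionary of attribute values per example) and all names are illustrative assumptions, not the authors' code.

```python
# Entropy-based impurity and information gain for a candidate attribute.
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(examples, labels, attribute):
    """Gain(S, A) = i(S) - sum_v |S_v|/|S| * i(S_v)."""
    n = len(labels)
    by_value = {}
    for x, y in zip(examples, labels):
        by_value.setdefault(x[attribute], []).append(y)
    remainder = sum(len(ys) / n * entropy(ys) for ys in by_value.values())
    return entropy(labels) - remainder

X = [{"location": "nucleus"}, {"location": "membrane"}, {"location": "nucleus"}]
y = ["kinase", "transporter", "kinase"]
print(information_gain(X, y, "location"))  # ~0.918: the split is perfectly pure here
```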
Enhancements of the algorithm:
• Dynamic attribute filtering
• Discretizing numerical features
• Multiple values for attributes
• Missing attributes
• Binary splitting
• Leaf weighting
• Post-pruning
• 10-fold cross-validation
10
The probabilistic framework
• An attribute is selected with probability proportional to its information gain:

  \mathrm{Prob}(A) = \frac{\mathrm{Gain}(S, A)}{\sum_i \mathrm{Gain}(S, A_i)}

• Weight the trees by their performance Q(i):

  P_T(\omega \mid x) = \frac{\sum_{i \in \mathrm{Trees}} Q(i)\, P_i(\omega \mid x)}{\sum_{i \in \mathrm{Trees}} Q(i)}

11
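The sketch below illustrates the two probabilistic ingredients of slide 11 under assumed data structures of my own: sampling the split attribute with probability proportional to its information gain, and combining the class posteriors of several trees weighted by a per-tree quality score Q(i).

```python
# Illustrative sketch, not the authors' implementation.
import random

def sample_attribute(gains: dict) -> str:
    """gains maps attribute name -> Gain(S, A); sample A with Prob(A) = Gain / sum(Gain)."""
    total = sum(gains.values())
    return random.choices(list(gains), weights=[g / total for g in gains.values()], k=1)[0]

def mixture_posterior(tree_posteriors, quality):
    """P_T(class | x) = sum_i Q(i) P_i(class | x) / sum_i Q(i)."""
    z = sum(quality)
    classes = tree_posteriors[0].keys()
    return {c: sum(q * p[c] for q, p in zip(quality, tree_posteriors)) / z for c in classes}

posteriors = [{"enzyme": 0.9, "non-enzyme": 0.1}, {"enzyme": 0.6, "non-enzyme": 0.4}]
print(mixture_posterior(posteriors, quality=[0.8, 0.4]))  # weighted toward the better tree
```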
Evaluation of decision trees
• Accuracy = (tp + tn) / total
• Sensitivity = tp / (tp + fn)
• Selectivity = tp / (tp + fp)
• Jensen-Shannon divergence score

  D^{\lambda}[p \,\|\, q] = \lambda\, d[p \,\|\, r] + (1 - \lambda)\, d[q \,\|\, r],

  where  d[p \,\|\, r]  is the relative entropy (divergence) and  r = \lambda p + (1 - \lambda) q.

12
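For concreteness, a small sketch of the evaluation measures on slide 12. The symmetric choice λ = 0.5 in the Jensen-Shannon score is my assumption; the slide leaves λ unspecified.

```python
# Illustrative evaluation measures: confusion-matrix scores and JS divergence.
import math

def confusion_scores(tp, tn, fp, fn):
    return {
        "accuracy": (tp + tn) / (tp + tn + fp + fn),
        "sensitivity": tp / (tp + fn),
        "selectivity": tp / (tp + fp),
    }

def kl(p, q):
    """Relative entropy d[p || q] in bits."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q, lam=0.5):
    r = [lam * pi + (1 - lam) * qi for pi, qi in zip(p, q)]
    return lam * kl(p, r) + (1 - lam) * kl(q, r)

print(confusion_scores(tp=40, tn=30, fp=10, fn=20))
print(js_divergence([0.7, 0.3], [0.4, 0.6]))
```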
Handling skewed distributions (unequal class sizes)
• Re-weight cases by 1/(# of counts)
  – Increases the impurity and the number of false positives
• Mixed entropy
  – Uses an average of the weighted and unweighted information gain to split and prune trees
• Interlaced entropy
  – Start with weighted samples and later use the unweighted entropy
13
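A rough sketch of the mixed-entropy idea on slide 13, assuming that "weighted" means each case is re-weighted by one over its class count and that the two entropies are averaged with equal weight; both assumptions are mine, not details taken from the slide.

```python
# Illustrative mixed entropy: average of class-weighted and unweighted entropy.
import math
from collections import Counter

def entropy_from_mass(mass):
    total = sum(mass.values())
    return -sum((m / total) * math.log2(m / total) for m in mass.values() if m > 0)

def mixed_entropy(labels, alpha=0.5):
    counts = Counter(labels)
    unweighted = entropy_from_mass(dict(counts))
    # re-weight each case by 1/(class count), so every class contributes equal total mass
    weighted = entropy_from_mass({c: n * (1.0 / n) for c, n in counts.items()})
    return alpha * weighted + (1 - alpha) * unweighted

print(mixed_entropy(["a"] * 90 + ["b"] * 10))  # between the skewed and balanced entropies
```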
Model selection (simplification)
• Occam's razor: of two models with the same result, choose the simpler one
• Bayesian approach: the most probable model has the maximal posterior probability

  P(h \mid D) = \frac{P(h)\, P(D \mid h)}{P(D)},  where

  P(D \mid h) = \prod_{i=1}^{n} \sum_{j \in \mathrm{leaves}(T)} f_j(x_i)\, P_j(x_i)

14
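The sketch below evaluates the likelihood term P(D | h) from slide 14 (in log form, for numerical stability), summing over the leaves each example can reach. The data layout of (f_j, P_j) pairs per example is an illustrative assumption.

```python
# Illustrative log-likelihood of the data under a probabilistic tree:
# log P(D | h) = sum_i log( sum_j f_j(x_i) * P_j(x_i) ).
import math

def log_likelihood(examples):
    """examples: list of lists of (f_j, P_j) pairs, one list per example x_i."""
    return sum(math.log(sum(f * p for f, p in leaves)) for leaves in examples)

# In a probabilistic tree an example may be split across several leaves.
data = [
    [(1.0, 0.9)],               # example fully routed to one leaf
    [(0.7, 0.8), (0.3, 0.2)],   # example split between two leaves
]
print(log_likelihood(data))
```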
Learning strategy optimization

Configuration                Sensitivity (selectivity)   Accepted/rejected
Basic (C4.5)                 0.35                        initial
Mántaras metric              0.36                        accepted
Binary branching             0.45                        accepted
Weighted entropy             0.56                        accepted
10-fold cross-validation     0.65                        accepted
20-fold cross-validation     0.64                        rejected
JS-based post-pruning        0.70                        accepted
Sen/sel post-pruning         0.68                        rejected
Weighted leaves              0.68                        rejected
Mixed entropy                0.63                        rejected
Dipeptide information        0.73                        accepted
Probabilistic trees          0.81                        accepted
15
Pfam classification test (comparison to BLAST)
• PDT performance: 81%
• BLAST performance: 86%
• Main reasons:
  – Nodes become impure because weighted entropy stops learning too early
  – Important branches were eliminated by post-pruning when the validation set is small
16
EC classification test (comparison to BLAST)
• PDT performance on average: 71%
• BLAST performance was often lower
17
Conclusions
• Many protein families cannot be defined by sequence similarities alone
• The new method makes use of other features (structure, dipeptides, etc.)
• Besides classification, PDTs allow feature selection for further use
• Results are comparable to BLAST
18
Modifications and Improvements
• Use global optimization for pruning
• Use probabilities for attribute values
• Use boosting techniques (combine weighted trees)
• Use the Gini index to measure node impurity:

  \sum_{k=1}^{K} \hat{p}_{mk}\,(1 - \hat{p}_{mk}),  where
  \hat{p}_{mk} = \frac{1}{N_m} \sum_{x_i \in R_m} I(y_i = k)

19
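Finally, a short sketch of the Gini impurity suggested on slide 19, computed for a single node directly from its class labels; the names and the example are mine.

```python
# Gini impurity for one node m: sum_k p_mk (1 - p_mk), where p_mk is the
# fraction of the node's examples belonging to class k.
from collections import Counter

def gini(labels):
    n = len(labels)
    return sum((c / n) * (1 - c / n) for c in Counter(labels).values())

print(gini(["kinase", "kinase", "transporter", "kinase"]))  # 2 * 0.75 * 0.25 = 0.375
```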