Using a Mixture of Probabilistic Decision Trees for Direct Prediction of Protein Functions

Paper by Umar Syed and Golan Yona, Department of Computer Science, Cornell University
Presentation by Andrejus Parfionovas, Department of Mathematics & Statistics, USU

Slide 2 — Classical methods to predict the structure of a new protein
- Sequence comparison against known proteins in search of similarities
  - sequences often diverge and become unrecognizable
- Structure comparison against known structures in the PDB database
  - structural data are sparse and not available for newly sequenced genes

Slide 3 — What other features can be used to improve prediction?
- Domain content
- Subcellular location
- Tissue specificity
- Species type
- Pairwise interactions
- Enzyme cofactors
- Catalytic activity
- Expression profiles, etc.

Slide 4 — With so many features, it is important:
- to extract relevant information
  - directly from the sequence
  - from the predicted secondary structure
  - from features extracted from databases
- to combine the data in a feasible model
  - a mixture model of Probabilistic Decision Trees (PDTs) was used

Slide 5 — Features extracted directly from the sequence (as percentages)
- 20 individual amino acids
- 16 amino acid groups (positively or negatively charged, polar, aromatic, hydrophobic, acidic, etc.)
- 20 most informative dipeptides

Slide 6 — Features predicted from the sequence
- Secondary structure predicted by PSIPRED: coil, helix, strand

Slide 7 — Features extracted from the SWISS-PROT database
- Binary features (presence/absence): alternative products, enzyme cofactors, catalytic activity
- Nominal features: tissue specificity (two different definitions), subcellular location, organism and species classification
- Continuous features: the number of patterns exhibited by each protein (the "complexity" of a protein)

Slide 8 — Mixture model of PDTs (Probabilistic Decision Trees)
- Can handle nominal data
- Robust to errors
- Missing data is allowed

Slide 9 — How to select an attribute for a decision node?
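The standard answer in C4.5-style trees is to pick the attribute with the highest information gain. A minimal sketch in Python; the toy feature names and dataset below are hypothetical illustrations, not taken from the paper:

```python
import math
from collections import Counter

def entropy(labels):
    """i(S) = -sum_j p_j * log2(p_j) over the class frequencies in S."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Gain(S, A) = i(S) - sum_v (|S_v| / |S|) * i(S_v) for nominal attribute A."""
    total = entropy(labels)
    by_value = {}                      # partition S into subsets S_v by attribute value
    for row, y in zip(rows, labels):
        by_value.setdefault(row[attr], []).append(y)
    remainder = sum(len(ys) / len(labels) * entropy(ys) for ys in by_value.values())
    return total - remainder

# Toy nominal dataset: choose the attribute with the highest gain.
rows = [
    {"location": "nucleus",  "cofactor": "yes"},
    {"location": "nucleus",  "cofactor": "no"},
    {"location": "membrane", "cofactor": "yes"},
    {"location": "membrane", "cofactor": "no"},
]
labels = ["A", "A", "B", "B"]
best = max(rows[0], key=lambda a: information_gain(rows, labels, a))
print(best)  # "location" — it separates the two classes perfectly, gain = 1.0
```

Here "location" yields pure subsets (gain 1.0) while "cofactor" yields none (gain 0.0), so the split is made on "location".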
- Use entropy to measure impurity; the impurity must decrease after the split:

  Gain(S, A) = i(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} \, i(S_v),

  where i(S) = -\sum_{j=1}^{s} p_j \log_2(p_j) is the entropy.
- Alternative measure: the de Mántaras distance metric (it has a lower bias towards attributes with low split information).

Slide 10 — Enhancements of the algorithm
- Dynamic attribute filtering
- Discretizing numerical features
- Multiple values for attributes
- Missing attributes
- Binary splitting
- Leaf weighting
- Post-pruning
- 10-fold cross-validation

Slide 11 — The probabilistic framework
- An attribute is selected with a probability that depends on its information gain:

  Prob(A) = \frac{Gain(S, A)}{\sum_i Gain(S, A_i)}

- The trees are weighted by their performance:

  P^T(\lambda \mid x) = \frac{\sum_{i \in Trees} Q(i) \, P_i(\lambda \mid x)}{\sum_{i \in Trees} Q(i)}

Slide 12 — Evaluation of decision trees
- Accuracy = (tp + tn) / total
- Sensitivity = tp / (tp + fn)
- Selectivity = tp / (tp + fp)
- Jensen-Shannon divergence score:

  D_\lambda[p \| r] = \lambda \, d[p \| r] + (1 - \lambda) \, d[q \| r],

  where d[p \| r] is the relative entropy (divergence) and r = \lambda p + (1 - \lambda) q.

Slide 13 — Handling skewed distributions (unequal class sizes)
- Re-weight cases by 1 / (number of counts); this increases the impurity and the number of false positives
- Mixed entropy: use the average of the weighted and unweighted information gain to split and prune trees
- Interlaced entropy: start with weighted samples and later use the unweighted entropy

Slide 14 — Model selection (simplification)
- Occam's razor: of two models with the same result, choose the simpler one
- Bayesian approach: the most probable model has the maximum posterior probability

  P(h \mid D) = \frac{P(h) \, P(D \mid h)}{P(D)}, where P(D \mid h) = \prod_{i=1}^{n} \sum_{j \in leaves(T)} f_j(x_i) \, P_j(x_i)

Slide 15 — Learning strategy optimization

  Configuration              Sensitivity (selectivity)   Accepted/rejected
  Basic (C4.5)               0.35                        initial
  de Mántaras metric         0.36                        accepted
  Binary branching           0.45                        accepted
  Weighted entropy           0.56                        accepted
  10-fold cross-validation   0.65                        accepted
  20-fold cross-validation   0.64                        rejected
  JS-based post-pruning      0.7                         accepted
  sen/sel post-pruning       0.68                        rejected
  Weighted leaves            0.68                        rejected
  Mixed entropy              0.63                        rejected
  Dipeptide information      0.73                        accepted
  Probabilistic trees        0.81                        accepted

Slide 16 — Pfam classification test (comparison to BLAST)
- PDT performance: 81%
- BLAST performance: 86%
- Main reasons for the gap:
  - nodes become impure because the weighted entropy stops learning too early
  - important branches were eliminated by post-pruning when the validation set is small

Slide 17 — EC classification test (comparison to BLAST)
- PDT performance on average: 71%
- BLAST performance was often lower

Slide 18 — Conclusions
- Many protein families cannot be defined by sequence similarities alone
- The new method makes use of other features (structure, dipeptides, etc.)
- Besides classification, PDTs allow feature selection for further use
- The results are comparable to BLAST

Slide 19 — Modifications and improvements
- Use global optimization for pruning
- Use probabilities for attribute values
- Use boosting techniques (combine weighted trees)
- Use the Gini index to measure node impurity:

  \sum_{k=1}^{K} \hat{p}_{mk} (1 - \hat{p}_{mk}), where \hat{p}_{mk} = \frac{1}{N_m} \sum_{x_i \in R_m} I(y_i = k)
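The Gini-index impurity suggested as an improvement can be sketched as follows; this is an illustrative implementation, not the authors' code, and the example label lists are made up:

```python
from collections import Counter

def gini_impurity(labels):
    """Gini index: sum_k p_mk * (1 - p_mk), where p_mk is the fraction of
    class k among the N_m cases that reach node m."""
    n = len(labels)
    probs = [c / n for c in Counter(labels).values()]
    return sum(p * (1 - p) for p in probs)

# A pure node has impurity 0; a 50/50 two-class node has the maximum, 0.5.
print(gini_impurity(["A", "A", "A", "A"]))  # 0.0
print(gini_impurity(["A", "A", "B", "B"]))  # 0.5
```

Unlike entropy, the Gini index needs no logarithm, so it is slightly cheaper to evaluate at every candidate split while behaving very similarly as an impurity measure.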