Using a Mixture of Probabilistic Decision Trees for Direct Prediction of Protein Functions

Paper by Umar Syed and Golan Yona, Department of Computer Science, Cornell University
Presentation by Andrejus Parfionovas, Department of Mathematics & Statistics, USU

Slide 2 — Classical methods to predict the structure of a new protein
- Sequence comparison against known proteins in search of similarities
  - sequences often diverge and become unrecognizable
- Structure comparison against known structures in the PDB database
  - structural data are sparse and not available for newly sequenced genes

Slide 3 — What other features can be used to improve prediction?
- Domain content
- Subcellular location
- Tissue specificity
- Species type
- Pairwise interactions
- Enzyme cofactors
- Catalytic activity
- Expression profiles, etc.

Slide 4 — With so many features, it is important:
- to extract relevant information
  - directly from the sequence
  - from the predicted secondary structure
  - from features extracted from databases
- to combine the data in a feasible model
  - a mixture model of Probabilistic Decision Trees (PDTs) was used

Slide 5 — Features extracted directly from the sequence (as percentages)
- 20 individual amino acids
- 16 amino acid groups (positively or negatively charged, polar, aromatic, hydrophobic, acidic, etc.)
- 20 most informative dipeptides

Slide 6 — Features predicted from the sequence
- Secondary structure predicted by PSIPRED: coil, helix, strand

Slide 7 — Features extracted from the SWISS-PROT database
- Binary features (presence/absence): alternative products, enzyme cofactors, catalytic activity
- Nominal features: tissue specificity (two different definitions), subcellular location, organism and species classification
- Continuous features: the number of patterns exhibited by each protein (the "complexity" of a protein)

Slide 8 — Mixture model of PDTs (Probabilistic Decision Trees)
- Can handle nominal data
- Robust to errors
- Missing data is allowed

Slide 9 — How to select an attribute for a decision node?
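The standard answer in C4.5-style trees is to pick the attribute with the highest information gain. A minimal sketch in Python; the toy feature names and dataset below are hypothetical illustrations, not taken from the paper:

```python
import math
from collections import Counter

def entropy(labels):
    """i(S) = -sum_j p_j * log2(p_j) over the class frequencies in S."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(rows, labels, attr):
    """Gain(S, A) = i(S) - sum_v (|S_v| / |S|) * i(S_v) for nominal attribute A."""
    total = entropy(labels)
    by_value = {}                      # partition S into subsets S_v by attribute value
    for row, y in zip(rows, labels):
        by_value.setdefault(row[attr], []).append(y)
    remainder = sum(len(ys) / len(labels) * entropy(ys) for ys in by_value.values())
    return total - remainder

# Toy nominal dataset: choose the attribute with the highest gain.
rows = [
    {"location": "nucleus",  "cofactor": "yes"},
    {"location": "nucleus",  "cofactor": "no"},
    {"location": "membrane", "cofactor": "yes"},
    {"location": "membrane", "cofactor": "no"},
]
labels = ["A", "A", "B", "B"]
best = max(rows[0], key=lambda a: information_gain(rows, labels, a))
print(best)  # "location" — it separates the two classes perfectly, gain = 1.0
```

Here "location" yields pure subsets (gain 1.0) while "cofactor" yields none (gain 0.0), so the split is made on "location".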
- Use entropy to measure impurity; the impurity must decrease after the split:

  Gain(S, A) = i(S) - \sum_{v \in Values(A)} \frac{|S_v|}{|S|} \, i(S_v),

  where i(S) = -\sum_{j=1}^{s} p_j \log_2(p_j) is the entropy.
- Alternative measure: the de Mántaras distance metric (it has a lower bias towards attributes with low split information).

Slide 10 — Enhancements of the algorithm
- Dynamic attribute filtering
- Discretizing numerical features
- Multiple values for attributes
- Missing attributes
- Binary splitting
- Leaf weighting
- Post-pruning
- 10-fold cross-validation

Slide 11 — The probabilistic framework
- An attribute is selected with a probability that depends on its information gain:

  Prob(A) = \frac{Gain(S, A)}{\sum_i Gain(S, A_i)}

- The trees are weighted by their performance:

  P^T(\lambda \mid x) = \frac{\sum_{i \in Trees} Q(i) \, P_i(\lambda \mid x)}{\sum_{i \in Trees} Q(i)}

Slide 12 — Evaluation of decision trees
- Accuracy = (tp + tn) / total
- Sensitivity = tp / (tp + fn)
- Selectivity = tp / (tp + fp)
- Jensen-Shannon divergence score:

  D_\lambda[p \| r] = \lambda \, d[p \| r] + (1 - \lambda) \, d[q \| r],

  where d[p \| r] is the relative entropy (divergence) and r = \lambda p + (1 - \lambda) q.

Slide 13 — Handling skewed distributions (unequal class sizes)
- Re-weight cases by 1 / (number of counts); this increases the impurity and the number of false positives
- Mixed entropy: use the average of the weighted and unweighted information gain to split and prune trees
- Interlaced entropy: start with weighted samples and later use the unweighted entropy

Slide 14 — Model selection (simplification)
- Occam's razor: of two models with the same result, choose the simpler one
- Bayesian approach: the most probable model has the maximum posterior probability

  P(h \mid D) = \frac{P(h) \, P(D \mid h)}{P(D)}, where P(D \mid h) = \prod_{i=1}^{n} \sum_{j \in leaves(T)} f_j(x_i) \, P_j(x_i)

Slide 15 — Learning strategy optimization

  Configuration              Sensitivity (selectivity)   Accepted/rejected
  Basic (C4.5)               0.35                        initial
  de Mántaras metric         0.36                        accepted
  Binary branching           0.45                        accepted
  Weighted entropy           0.56                        accepted
  10-fold cross-validation   0.65                        accepted
  20-fold cross-validation   0.64                        rejected
  JS-based post-pruning      0.7                         accepted
  sen/sel post-pruning       0.68                        rejected
  Weighted leaves            0.68                        rejected
  Mixed entropy              0.63                        rejected
  Dipeptide information      0.73                        accepted
  Probabilistic trees        0.81                        accepted

Slide 16 — Pfam classification test (comparison to BLAST)
- PDT performance: 81%
- BLAST performance: 86%
- Main reasons for the gap:
  - nodes become impure because the weighted entropy stops learning too early
  - important branches were eliminated by post-pruning when the validation set is small

Slide 17 — EC classification test (comparison to BLAST)
- PDT performance on average: 71%
- BLAST performance was often lower

Slide 18 — Conclusions
- Many protein families cannot be defined by sequence similarities alone
- The new method makes use of other features (structure, dipeptides, etc.)
- Besides classification, PDTs allow feature selection for further use
- The results are comparable to BLAST

Slide 19 — Modifications and improvements
- Use global optimization for pruning
- Use probabilities for attribute values
- Use boosting techniques (combine weighted trees)
- Use the Gini index to measure node impurity:

  \sum_{k=1}^{K} \hat{p}_{mk} (1 - \hat{p}_{mk}), where \hat{p}_{mk} = \frac{1}{N_m} \sum_{x_i \in R_m} I(y_i = k)
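The Gini-index impurity suggested as an improvement can be sketched as follows; this is an illustrative implementation, not the authors' code, and the example label lists are made up:

```python
from collections import Counter

def gini_impurity(labels):
    """Gini index: sum_k p_mk * (1 - p_mk), where p_mk is the fraction of
    class k among the N_m cases that reach node m."""
    n = len(labels)
    probs = [c / n for c in Counter(labels).values()]
    return sum(p * (1 - p) for p in probs)

# A pure node has impurity 0; a 50/50 two-class node has the maximum, 0.5.
print(gini_impurity(["A", "A", "A", "A"]))  # 0.0
print(gini_impurity(["A", "A", "B", "B"]))  # 0.5
```

Unlike entropy, the Gini index needs no logarithm, so it is slightly cheaper to evaluate at every candidate split while behaving very similarly as an impurity measure.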