Download poster - Computer Science and Engineering

Using Global Sequence Similarity Improves Biological Site-Specific Classifiers

Jivko Sinapov, Cornelia Caragea, Drena Dobbs and Vasant Honavar

Artificial Intelligence Research Laboratory
Bioinformatics and Computational Biology Program
Computational Intelligence, Learning, and Discovery Program
Department of Computer Science

Biological Motivation:

Many problems in bioinformatics involve predicting a class label for each element of a protein sequence. Examples include:

• Prediction of RNA- and DNA-binding protein residues
• Prediction of post-translational modification sites
• Prediction of secondary structure elements in sequences

Typically, classifiers are trained on local features of each site in the training set of protein sequences, so no global sequence information is used when making predictions on the sites of the sequence to be annotated. In this work we seek to improve such classifiers by taking into account the global sequence similarity between the test sequence and the sequences in the training set.

Example Problems:

• Protein-RNA binding site prediction
• Glycosylation site prediction

[Figure: a peptide chain drawn from the H3N+ terminus to the COO- terminus, with candidate sites marked "O-Glycosylated?" or "Non O-Glycosylated?"]

Datasets:

1. Prediction of O-linked glycosylation sites
2. Prediction of RNA-binding protein residues
3. Prediction of protein-protein interface residues

| | O-GlycBase | Protein-RNA | Protein-Protein |
|---|---|---|---|
| Number of sequences | 216 | 147 | 42 |
| Number of + instances | 2168 | 4336 | 2350 |
| Number of - instances | 12147 | 27988 | 9204 |

Feature Representation:

• A window of 21 amino acids centered on the target residue (the illustration below uses 15-residue windows). The class label of a window is the label of its central residue.

    Sequence: DSNPKYLGVKKFGGEVVKAGNILVRQRGTKFKAGQGVGMGRDHTLFALSDGK
    Label:    1111110011111110011111001011111100000001111101000000

Data points used for training and testing a classifier (window, class label):

    . . .
    VKKFGGEVVKAGNIL,-1
    KKFGGEVVKAGNILV,-1
    KFGGEVVKAGNILVR,+1
    FGGEVVKAGNILVRQ,+1
    . . .

Naïve Bayes (NB):

Let $x_{test} = \{f_1, f_2, \ldots, f_n\}$ be an n-dimensional test data point.

• Apply Bayes rule:

$$P(C \mid f_1, f_2, \ldots, f_n) = \frac{P(f_1, f_2, \ldots, f_n \mid C)\, P(C)}{P(f_1, f_2, \ldots, f_n)}$$

• Independence assumption:

$$P(f_1, f_2, \ldots, f_n \mid C) = \prod_{i} P(f_i \mid C)$$

• Assign the class that maximizes:

$$P(C) \prod_{i} P(f_i \mid C)$$

Hierarchical Mixture of Naïve Bayes Experts (HME-NB):

Let $S_1, S_2, \ldots, S_N$ be a dataset of protein sequences.

1. Compute an N by N pairwise similarity matrix using global alignment scores with the BLOSUM62 substitution matrix.
2. Using a spectral clustering algorithm, recursively partition the set of training sequences to obtain a hierarchical clustering of the sequences.
3. Use the structure of the hierarchical partitioning to learn a Hierarchical Mixture of Experts model, as defined below.

[Figure: tree diagram of a hierarchical partitioning of training sequences; the node counts survive only as scattered numbers and the layout is not recoverable]

Let $V^l_1, V^l_2, \ldots, V^l_M$ be the leaf nodes in the hierarchical partitioning $T$.
Let $\theta_1, \theta_2, \ldots, \theta_M$ be the parameters of the trained Naïve Bayes models at each leaf node in $T$.
Let $x_{test} = \{f_1, f_2, \ldots, f_n\}$ be the input features for some residue in sequence $S_{test}$.

Each leaf node computes the class probability for $x_{test}$ according to:

$$P_{V^l_j}(C \mid x_{test}, S_{test}) \propto P(C \mid \theta_j) \prod_{i} P(f_i \mid C, \theta_j)$$

Each non-leaf node combines the predictions from its children:

$$P_{V^g}(C \mid x_{test}, S_{test}) = \sum_{V_i \in \mathrm{child}(V^g)} P_{V_i}(C \mid x_{test}, S_{test})\; P(S_{test} \in V_i \mid S_{test} \in \mathrm{par}(V_i))$$

Results:

1. Performed 10-fold sequence-based cross-validation.
2. Compared Naïve Bayes (NB) and Hierarchical Mixture of Naïve Bayes Experts (HME-NB).

| | O-Glycosylation: NB | O-Glycosylation: HME-NB | Protein-RNA interactions: NB | Protein-RNA interactions: HME-NB | Protein-Protein interface: NB | Protein-Protein interface: HME-NB |
|---|---|---|---|---|---|---|
| Accuracy | 0.89 | 0.89 | 0.83 | 0.84 | 0.79 | 0.81 |
| MCC | 0.57 | 0.58 | 0.32 | 0.37 | 0.08 | 0.25 |
| Sensitivity | 0.61 | 0.65 | 0.24 | 0.31 | 0.06 | 0.18 |
| Specificity | 0.65 | 0.63 | 0.65 | 0.66 | 0.38 | 0.60 |
| AUC | 0.88 | 0.91 | 0.74 | 0.76 | 0.62 | 0.72 |

[Figure panels: a) O-Glycosylation, b) Protein-RNA interaction sites, c) Protein-Protein interface sites]

[Figure: a qualitative comparison of Naïve Bayes (NB) and Hierarchical Mixture of Naïve Bayes Experts (HME-NB) on the task of predicting protein-protein interface sites of the Anionic trypsin-2 precursor of Rattus norvegicus (shown as spheres) interfaced with the Ecotin precursor of E. coli (in green). Each residue of the Anionic trypsin-2 precursor is colored according to whether the prediction is a True Positive (red), True Negative (gray), False Positive (blue) or False Negative (yellow). For both methods, the False Positive Rate (FPR) is fixed at 0.1. HME-NB achieves a higher True Positive Rate (0.88) than NB (0.56) at the same FPR.]

Conclusion:

• Developed a classifier that improves the labeling of biological sequence data by using global sequence similarity.

Acknowledgements:

This work is supported in part by a grant from the National Institutes of Health (GM 066387) to Vasant Honavar & Drena Dobbs.
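The window-based feature representation and the non-leaf combination rule of HME-NB can be illustrated with a minimal Python sketch. It is not the authors' implementation. The names `extract_windows`, `Node` and `predict` are hypothetical, and so are the leaf posteriors and gating weights in the toy tree: the poster does not say how P(S_test ∈ V_i | S_test ∈ par(V_i)) is estimated, so the weights are simply given as numbers here. The window size is a parameter because the poster states a 21-residue window while its illustration uses 15-residue windows. Residues too close to either end to have a full window are skipped, which is also an assumption.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


def extract_windows(sequence: str, labels: str, window: int = 15) -> List[Tuple[str, int]]:
    """Return (window, label) pairs for every residue that has a full window.

    `labels` is a string of '0'/'1' characters aligned with `sequence`.
    Each window takes the label of its central residue, mapped to -1/+1.
    """
    if len(sequence) != len(labels):
        raise ValueError("sequence and labels must have the same length")
    if window % 2 == 0:
        raise ValueError("window must be odd so that it has a central residue")
    half = window // 2
    points = []
    for center in range(half, len(sequence) - half):
        fragment = sequence[center - half:center + half + 1]
        points.append((fragment, +1 if labels[center] == "1" else -1))
    return points


@dataclass
class Node:
    """Hypothetical tree node.

    A leaf stores the class posterior produced by its Naive Bayes expert.
    An internal node stores (gating weight, child) pairs; the weights stand in
    for P(S_test in V_i | S_test in par(V_i)) and are assumed to sum to 1.
    """
    posterior: Dict[str, float] = field(default_factory=dict)
    children: List[Tuple[float, "Node"]] = field(default_factory=list)


def predict(node: Node) -> Dict[str, float]:
    """Apply the non-leaf rule recursively: a weighted sum of child predictions."""
    if not node.children:
        return node.posterior
    combined: Dict[str, float] = {}
    for weight, child in node.children:
        for cls, prob in predict(child).items():
            combined[cls] = combined.get(cls, 0.0) + weight * prob
    return combined


# Sequence and labels taken from the poster's feature-representation example.
seq = "DSNPKYLGVKKFGGEVVKAGNILVRQRGTKFKAGQGVGMGRDHTLFALSDGK"
lab = "1111110011111110011111001011111100000001111101000000"
points = extract_windows(seq, lab, window=15)
print(points[8])   # ('VKKFGGEVVKAGNIL', -1)
print(points[10])  # ('KFGGEVVKAGNILVR', 1)

# Toy tree with made-up leaf posteriors and gating weights.
leaf_a = Node(posterior={"+1": 0.8, "-1": 0.2})
leaf_b = Node(posterior={"+1": 0.3, "-1": 0.7})
leaf_c = Node(posterior={"+1": 0.5, "-1": 0.5})
inner = Node(children=[(0.6, leaf_a), (0.4, leaf_b)])
root = Node(children=[(0.75, inner), (0.25, leaf_c)])
root_posterior = predict(root)
print({cls: round(p, 3) for cls, p in root_posterior.items()})
```

Because every internal node forms a convex combination of its children's distributions, the root output remains a valid class distribution as long as each set of gating weights sums to 1.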
