Computational Biology, Part C
Family Pairwise Search and Cobbling
Robert F. Murphy
Copyright 2000, 2001. All rights reserved.

Overall Goals
  Find previously unrecognized members of a family
  Develop a model of a family

Possible Approaches
  Model-based
    Motif-based (MEME/MAST)
    Hidden Markov model-based (HMMER)
  Non-model-based
    Family Pairwise Search (FPS)

PSSMs
  Motifs can be summarized and searched for using Position-Specific Scoring Matrices (PSSMs)
  A PSSM is calculated from a multiple alignment of a conserved region for members of a family

Learning PSSMs
  Unsupervised learning methods can be used to find motifs in unaligned sequences
  The best-characterized algorithm is MEME
  T.L. Bailey & C. Elkan (1995) Unsupervised Learning of Multiple Motifs in Biopolymers Using Expectation Maximization. Machine Learning J. 21:51-83

Problems with PSSMs
  Some families are characterized by two or more “sub”-motifs with variable spacing between them
  Deciding upon motif boundaries is difficult
  Possible information in intervening sequences is lost if only motifs are used

Cobbling
  Pick the “most representative” protein sequence from a family
  Convert it to a profile by replacing each amino acid with the corresponding column from a similarity matrix
  For each recognized “motif” in the family, replace the corresponding section of the profile with the profile of the motif
  Advantage: at least some sequence information between motifs is retained
  S. Henikoff & J.G. Henikoff (1997) Embedding strategies for effective use of information from multiple sequence alignments. Protein Science 6:698-705

Cobbler Illustration
  [Figure: cobbled profile. Labels: scores from profiles of conserved motifs; sequence of “most representative” family member; similarity scores for sequence from “most representative” family member]

Family Pairwise Search
  For all known members of the family, calculate the (pairwise) homology to each sequence in the database (using BLAST) and sum those scores
  Does not generate a model of the motif
  Analogous to k-nearest-neighbor classification

Which method is best?
  Compare BLAST using a randomly chosen family member, BLAST FPS, MEME, HMMER
  W.N. Grundy (1998) Homology Detection via Family Pairwise Search. J. Comput. Biol. 5:479-492

Comparison Protocol
  For each method
    For each known protein family
      Train with family members
      Search database for matches
      Rank by score from search
      Determine how many known family members are ranked highly
  Evaluation metric: average ROC50
    ROC50 is the fraction of true positives detected at a threshold giving 50 false positives
    Average over all families
    Bigger is better!
  Caution!
    A true positive is defined as being listed as a member of the family in the PROSITE compilation
    Some false positives could be actual family members that were missed during PROSITE compilation! (Should be a minor effect)

Results
  [Figure: results for BLAST FPS, MAST, BLAST, HMMER]

Conclusion
  FPS is better than single-sequence BLAST
  FPS is better than model-based methods

Which is best (part 2)?
  Compare BLAST, BLAST FPS, cobbled BLAST, cobbled BLAST FPS
  W.N. Grundy and T.L. Bailey (1999) Family pairwise search with embedded motif models. Bioinformatics 15:463-470

Comparison Protocol
  Evaluation metric: rank sum
    Calculate the difference in ROC50 for two methods for a given family
    Sort by absolute value of difference
    Sum the ranks of families for which one method is better than the other
    Bigger is better!
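
The two evaluation metrics in the comparison protocols can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the cited studies. The function names `roc50` and `rank_sums` are hypothetical. The ROC50 here follows the simplified wording of the slide (fraction of true positives found before the 50th false positive); published evaluations often use the normalized area under the ROC curve up to the 50th false positive instead. The rank sum assigns ranks 1, 2, 3, ... in order of increasing absolute difference and, as a simplification, does not average the ranks of tied differences.

```python
def roc50(ranked_labels, max_false_positives=50):
    """Fraction of all true positives ranked above the cutoff.

    ranked_labels: booleans ordered by decreasing search score;
    True marks a known family member (a true positive).
    The cutoff is the position of the max_false_positives-th False.
    """
    total_true = sum(ranked_labels)
    if total_true == 0:
        return 0.0
    true_seen = 0
    false_seen = 0
    for is_member in ranked_labels:
        if is_member:
            true_seen += 1
        else:
            false_seen += 1
            if false_seen == max_false_positives:
                break
    return true_seen / total_true


def rank_sums(scores_a, scores_b):
    """Compare two methods over the same list of families.

    scores_a[i] and scores_b[i] are the ROC50 values of methods A and B
    for family i. Returns (rank sum where A is better,
    rank sum where B is better). Families with equal scores are dropped.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    diffs.sort(key=abs)  # smallest absolute difference gets rank 1
    sum_a = 0
    sum_b = 0
    for rank, diff in enumerate(diffs, start=1):
        if diff > 0:
            sum_a += rank
        else:
            sum_b += rank
    return sum_a, sum_b


# Hypothetical ROC50 values for four families under two methods.
method_a = [0.9, 0.5, 0.7, 0.4]
method_b = [0.6, 0.6, 0.7, 0.2]
print(rank_sums(method_a, method_b))
```

The method with the larger rank sum wins on more families, or by larger margins, which is why "bigger is better" for this metric as well.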
Results
  [Figure not recovered]

Conclusion
  For the task of finding members of a family given a reasonable number of known members of that family, cobbled FPS is the best currently available method!
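
As a rough illustration of how cobbling and Family Pairwise Search fit together, the sketch below builds a profile from a sequence, embeds a motif profile into it, and sums per-member scores against one database sequence. Everything in it is an assumption made for illustration: the three-letter alphabet and the values in `SIMILARITY` are toy data, the function names are hypothetical, and the ungapped sliding-window score stands in for BLAST, which the studies above actually used.

```python
# Toy similarity matrix over a three-letter alphabet (hypothetical values;
# a real run would use a 20-letter amino acid matrix).
SIMILARITY = {
    "A": {"A": 4, "C": 0, "D": -2},
    "C": {"A": 0, "C": 9, "D": -3},
    "D": {"A": -2, "C": -3, "D": 6},
}


def sequence_to_profile(sequence):
    """Replace each residue by its column from the similarity matrix."""
    return [dict(SIMILARITY[residue]) for residue in sequence]


def cobble(sequence, motifs):
    """Embed motif profiles into the profile of a representative sequence.

    motifs: list of (start, pssm) pairs; pssm is a list of per-position
    score dicts that overwrites profile[start:start + len(pssm)].
    """
    profile = sequence_to_profile(sequence)
    for start, pssm in motifs:
        profile[start:start + len(pssm)] = [dict(column) for column in pssm]
    return profile


def best_ungapped_score(profile, target):
    """Best score of the profile slid along the target without gaps.

    A stand-in for a real pairwise search tool; returns 0 when the
    target is shorter than the profile.
    """
    width = len(profile)
    window_scores = [
        sum(column[residue]
            for column, residue in zip(profile, target[offset:offset + width]))
        for offset in range(len(target) - width + 1)
    ]
    return max(window_scores, default=0)


def family_pairwise_score(profiles, target):
    """FPS: sum the pairwise scores of every family member against target."""
    return sum(best_ungapped_score(profile, target) for profile in profiles)


# Two hypothetical family members scored against one database sequence.
family = [sequence_to_profile("ACD"), sequence_to_profile("AC")]
print(family_pairwise_score(family, "DACD"))

# Cobbled variant: position 1 of "ACD" is replaced by a one-column motif.
cobbled = cobble("ACD", [(1, [{"A": -1, "C": 5, "D": -1}])])
print(best_ungapped_score(cobbled, "DACD"))
```

Ranking every database sequence by `family_pairwise_score` gives the ordered list that the comparison protocol evaluates. No family model is built at any point, which is the sense in which FPS resembles nearest-neighbor classification.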