Computational Biology, Part C
Family Pairwise Search and Cobbling
Robert F. Murphy
Copyright © 2000, 2001.
All rights reserved.
Overall Goals
Find previously unrecognized members of a family
Develop a model of a family

Possible Approaches

Model-based
  Motif-based (MEME/MAST)
  Hidden Markov model-based (HMMER)
Non-model-based
  Family Pairwise Search (FPS)
PSSMs
Motifs can be summarized and searched for using Position-Specific Scoring Matrices (PSSMs)
Calculated from a multiple alignment of a conserved region for members of a family
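
As a minimal sketch of how such a matrix might be calculated and used for searching: the code below turns a toy ungapped alignment into log-odds scores and slides the result along a sequence. The uniform background frequencies, the pseudocount of 1, the toy alignment, and the function names `build_pssm` and `best_window` are all illustrative assumptions, not part of the lecture.

```python
import math
from collections import Counter

ALPHABET = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard amino acids

def build_pssm(aligned, pseudocount=1.0):
    """Return one {residue: log-odds score} dict per alignment column."""
    background = 1.0 / len(ALPHABET)  # assumption: uniform background frequencies
    pssm = []
    for column in zip(*aligned):
        counts = Counter(column)
        total = len(column) + pseudocount * len(ALPHABET)
        pssm.append({
            aa: math.log2(((counts[aa] + pseudocount) / total) / background)
            for aa in ALPHABET
        })
    return pssm

def best_window(pssm, sequence):
    """Slide the PSSM along a sequence; return (best score, start position)."""
    width = len(pssm)
    scores = [
        (sum(col[aa] for col, aa in zip(pssm, sequence[i:i + width])), i)
        for i in range(len(sequence) - width + 1)
    ]
    return max(scores)

# Toy aligned block standing in for a conserved region of a family
motif_instances = ["GKSTL", "GKTTL", "GKSSL", "GRSTL"]
pssm = build_pssm(motif_instances)
score, start = best_window(pssm, "MAAGKSTLVV")
```

Here `start` is the offset of the best-scoring window in the query sequence. A real search would use observed background frequencies and a more careful pseudocount scheme.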

Learning PSSMs
Unsupervised learning methods can be used to find motifs in unaligned sequences
Best characterized algorithm is MEME
  T.L. Bailey & C. Elkan (1995) Unsupervised Learning of Multiple Motifs in Biopolymers Using Expectation Maximization. Machine Learning J. 21:51-83
Problems with PSSMs
Some families are characterized by two or more “sub”-motifs with variable spacing between them
Deciding upon motif boundaries difficult
Possible information in intervening sequences lost if only motifs are used

Cobbling
Pick “most representative” protein sequence from a family
Convert it to a profile by replacing each amino acid by the corresponding column from a similarity matrix

For each recognized “motif” in the family, replace the corresponding section of the profile with the profile of the motif
Advantage: At least some sequence information between motifs is retained.
  S. Henikoff & J.G. Henikoff (1997) Embedding strategies for effective use of information from multiple sequence alignments. Protein Science 6:698-705
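
The cobbling steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the Henikoffs' implementation: `toy_similarity` is a stand-in for a real substitution matrix such as BLOSUM62, the motif profile is hypothetical, and motif positions are assumed to be known in advance.

```python
ALPHABET = "ACDEFGHIKLMNPQRSTVWY"

def toy_similarity(a, b):
    # Stand-in for a real similarity matrix such as BLOSUM62 (assumption)
    return 4 if a == b else -1

def sequence_to_profile(sequence):
    """Replace each residue by its column of the similarity matrix."""
    return [{aa: toy_similarity(res, aa) for aa in ALPHABET} for res in sequence]

def cobble(sequence, motifs):
    """Embed motif profiles into the profile of the representative sequence.

    motifs: list of (start, motif_profile) pairs. Each motif_profile is a list
    of per-position score dicts that overwrites the profile from `start` on.
    """
    profile = sequence_to_profile(sequence)
    for start, motif_profile in motifs:
        end = start + len(motif_profile)
        if end > len(profile):
            raise ValueError("motif extends past the end of the sequence")
        profile[start:end] = motif_profile
    return profile

representative = "MAAGKSTLVV"  # toy "most representative" family member
# Hypothetical two-column motif profile favoring K/R, then S/T
motif_profile = [
    {aa: (5 if aa in "KR" else -2) for aa in ALPHABET},
    {aa: (5 if aa in "ST" else -2) for aa in ALPHABET},
]
cobbled = cobble(representative, [(4, motif_profile)])
```

Positions 4 and 5 of `cobbled` now carry the motif scores, while every other position keeps the similarity-matrix column for the representative sequence, which is how the information between motifs is retained.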

Cobbler Illustration
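[Figure: profile built from the sequence of the “most representative” family member. Regions covering conserved motifs hold scores from the motif profiles; the remaining regions hold similarity scores for the sequence of the “most representative” family member.]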
Family Pairwise Search

For all known members of the family, calculate a (pairwise) similarity score against each sequence in the database (using BLAST) and sum those scores
Does not generate a model of the motif
Analogous to k nearest neighbor classification
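
A minimal sketch of the summation step, assuming a pairwise scoring function is available. Running BLAST is outside the scope of this example, so `toy_pairwise_score` (a count of shared 3-mers) is a hypothetical stand-in for a BLAST score; the sequences and names are invented for illustration.

```python
def kmers(sequence, k):
    """Return the set of all length-k substrings of a sequence."""
    return {sequence[i:i + k] for i in range(len(sequence) - k + 1)}

def toy_pairwise_score(a, b, k=3):
    # Stand-in for a BLAST score (assumption): number of shared k-mers
    return len(kmers(a, k) & kmers(b, k))

def family_pairwise_search(family, database, score=toy_pairwise_score):
    """Rank database sequences by their summed score against all family members."""
    totals = {
        name: sum(score(member, seq) for member in family)
        for name, seq in database.items()
    }
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)

family = ["MAAGKSTLVV", "MSAGKTTLIV", "LAAGKSTLAV"]  # known family members
database = {
    "candidate_1": "QAAGKSTLVA",
    "candidate_2": "PPPWWWHHHC",
    "candidate_3": "MSAGKTTLPP",
}
ranking = family_pairwise_search(family, database)
```

No model of the family is built: every known member votes on every database sequence, which is what makes the approach resemble nearest-neighbor classification.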

Which method is best?
Compare BLAST using a randomly chosen family member, BLAST FPS, MEME, HMMER
  W.N. Grundy (1998) Homology Detection via Family Pairwise Search. J. Comput. Biol. 5:479-492

Comparison Protocol

For each method
  For each known protein family
    Train with family members
    Search database for matches
    Rank by score from search
    Determine how many known family members are ranked highly
Evaluation metric
  average ROC50
    ROC50 is the fraction of true positives detected at a threshold giving 50 false positives
    average over all families
  Bigger is better!
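
A sketch of how such a score might be computed from a ranked hit list. ROC-n scores are commonly calculated as the normalized area under the ROC curve up to the n-th false positive; the code follows that convention, which is an assumption about the exact formula used in the comparison. The toy rankings and the small `n=2` cutoff are illustrative only.

```python
from statistics import mean

def roc_n(ranked_labels, n=50):
    """Normalized area under the ROC curve up to the n-th false positive.

    ranked_labels: booleans in rank order (best score first). True marks a
    known family member, False a non-member.
    """
    total_true = sum(ranked_labels)
    if total_true == 0:
        return 0.0
    true_seen = 0
    false_seen = 0
    area = 0
    for is_member in ranked_labels:
        if is_member:
            true_seen += 1
        else:
            false_seen += 1
            area += true_seen  # true positives ranked above this false positive
            if false_seen == n:
                break
    # Fewer than n false positives in the list: extend the curve flat
    area += (n - false_seen) * true_seen
    return area / (n * total_true)

# Toy rankings for two hypothetical families (True = known family member)
rankings = {
    "family_A": [True, True, False, True, False, False, True],
    "family_B": [True, False, True, True, False],
}
per_family = {name: roc_n(labels, n=2) for name, labels in rankings.items()}
average_score = mean(list(per_family.values()))
```

A perfect ranking (all family members ahead of every non-member) scores 1.0, and a ranking with no family member ahead of the first n non-members scores 0.0.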
Caution!
  True positive defined as being listed as a member of the family in the PROSITE compilation
  Some false positives could be actual family members that were missed during PROSITE compilation!
  (Should be minor effect)
Results
[Figure: results for BLAST FPS, MAST, BLAST, and HMMER]
Conclusion
FPS better than single sequence BLAST
 FPS better than model-based methods

Which is best (part 2)?
Compare BLAST, BLAST FPS, cobbled BLAST, cobbled BLAST FPS
  W.N. Grundy and T.L. Bailey (1999) Family pairwise search with embedded motif models. Bioinformatics 15:463-470

Comparison Protocol

Evaluation metric
  rank sum
    calculate difference in ROC50 for two methods for a given family
    sort by absolute value of difference
    sum ranks of families for which one method is better than the other
  Bigger is better!
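
The rank-sum steps can be sketched as below. Two conventions are assumed here because the slide does not specify them: families on which the two methods score identically are dropped, and tied absolute differences share their average rank (as in the usual Wilcoxon signed-rank procedure). The per-family scores are invented for illustration.

```python
def signed_rank_sums(scores_a, scores_b):
    """Return (rank sum where method A is better, rank sum where B is better)."""
    # Per-family differences; families with identical scores are dropped (assumption)
    diffs = [a - b for a, b in zip(scores_a, scores_b) if a != b]
    ordered = sorted(diffs, key=abs)  # smallest absolute difference gets rank 1
    sum_a = sum_b = 0.0
    i = 0
    while i < len(ordered):
        j = i
        while j < len(ordered) and abs(ordered[j]) == abs(ordered[i]):
            j += 1
        mean_rank = (i + 1 + j) / 2  # average of ranks i+1 .. j for tied values
        for d in ordered[i:j]:
            if d > 0:
                sum_a += mean_rank
            else:
                sum_b += mean_rank
        i = j
    return sum_a, sum_b

# Hypothetical ROC50 scores for four families under two methods
method_a = [0.9, 0.5, 0.7, 0.6]
method_b = [0.8, 0.75, 0.7, 0.2]
rank_sum_a, rank_sum_b = signed_rank_sums(method_a, method_b)
```

The method with the larger rank sum wins on more families, or wins by larger margins, than the other.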
Results
Conclusion

For the task of finding members of a family given a reasonable number of known members of that family, cobbled FPS is the best currently available method!