Download Dynamic Programming: Sequence alignment

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

Multilocus sequence typing wikipedia , lookup

Nucleic acid analogue wikipedia , lookup

Non-coding DNA wikipedia , lookup

Two-hybrid screening wikipedia , lookup

Point mutation wikipedia , lookup

Artificial gene synthesis wikipedia , lookup

Ancestral sequence reconstruction wikipedia , lookup

Transcript
BLAST:
Basic Local Alignment Search Tool
Altschul et al. J. Mol Bio. 1990.
CS 466
Saurabh Sinha
Motivation
• Sequence homology to a known protein
suggest function of newly sequenced
protein
• Bioinformatics task is to find
homologous sequence in a database of
sequences
• Databases of sequences growing fast
Alignment
• Natural approach to check if the “query
sequence” is homologous to a
sequence in the database is to compute
alignment score of the two sequences
• Alignment score counts gaps
(insertions, deletions) and replacements
• Minimizing the evolutionary distance
Alignment
• Global alignment: optimize the overall
similarity of the two sequences
• Local alignment: find only relatively
conserved subsequences
• Local similarity measures preferred for
database searches
– Distantly related proteins may only share isolated
regions of similarity
Alignment
• Dynamic programming is the standard
approach to sequence alignment
• Algorithm is quadratic in length of the
two sequences
• Not practical for searches against very
large database of sequences (e.g.,
whole genome)
Scoring alignments
• Scoring matrix: 4 x 4 matrix (DNA) or 20
x 20 matrix (protein)
• Amino acid sequences: “PAM” matrix
– Consider amino acid sequence alignment for
very closely related proteins, extract
replacement frequencies (probabilities),
extrapolate to greater evolutionary distances
• DNA sequences: match = +5, mismatch
= -4
BLAST: the MSP
• Given two sequences of same length, the
similarity score of their alignment (without
gaps) is the sum of similarity values for each
pair of aligned residues
• Maximal segment pair (MSP): Highest scoring
pair of identical length segments from the two
sequences
• The similarity score of an MSP is called the
MSP score
• BLAST heuristically aims to find this
Locally maximal segment pair
• A molecular biologist may be interested in all
conserved regions shared by two proteins, not just
their highest scoring pair
• A segment pair (segments of identical lengths) is
locally maximal if its score cannot be improved by
extending or shortening in either direction
• BLAST attempts to find all locally maximal segment
pairs above some score cutoff.
Rapid approximation of MSP score
• Goal is to report those database sequences that
have MSP score above some threshold S.
• Statistics tells us what is the highest threshold S
at which “chance similarities” are likely to appear
• Tractability to statistical analysis is one of the
attractive features of the MSP score
Rapid approximation of MSP score
• BLAST minimizes time spent on database sequences
whose similarity with the query has little chance of
exceeding this cutoff S.
• Main strategy: seek only segment pairs (one from
database, one query) that contain a word pair with
score >= T
• Intuition: If the sequence pair has to score above S,
its most well matched word (of some predetermined
small length) must score above T
• Lower T => Fewer false negatives
• Lower T => More pairs to analyze
Implementation
1. Compile a list of high scoring words
2. Scan database for hits to this word list
3. Extend hits
Step 1: Compiling list of words
from query sequence
• For proteins: List of all w-length words that
score at least T when compared to some
word in query sequence
• Question: Does every word in the query
sequence make it to the list?
• For DNA: list of all w-length words in the
query sequence, often with w=12
Step 2: Scanning the
database for hits
• Find exact matches to list words
• Can be done in linear time
– two methods (next slides)
• Each word in list points to all
occurrences of the word in word list
from previous step
Scanning the database for hits
• Method 1: Let w=4, so 204 possible words
• Each integer in 0 … 204-1 is an index for an
array
• Array element point to list of all occurrences
of that word in query
• Not all 204 elements of array are populated
– only the ones in word list from previous
step
Scanning the database for hits
• Method 2: use “deterministic finite
automaton” or “finite state machine”.
• Similar to the keyword trees seen in
course.
• Build the finite state machine out of all
words in word list from previous step
Step 3: Extending hits
• Once a word pair with score >= T has been
found, extend it in each direction.
• Extend until score >= S is obtained
• During extension, score may go up, and then
down, and then up again
• Terminate if it goes down too much (a certain
distance below the best score found for
shorter extensions)
• One implementation allows gaps during
extension
BLAST: approximating the MSP
• BLAST may not find all segment pairs
above threshold S
• Trying to approximate the MSP
• Bounds on the error: not hard bounds,
but statistical bounds
– “Highly likely” to find the MSP
Statistics
• Suppose the MSP has been calculated by
BLAST (and suppose this is the true MSP)
• Suppose this observed MSP scores S.
• What are the chances that the MSP score for
two unrelated sequences would be >= S?
• If the chances are very low, then we can be
confident that the two sequences must not
have been unrelated
Statistics
• Given two random sequences of lengths
m and n
• Probability that they will produce an
MSP score of >= x ?
Statistics
• Number of separate SPs with score >= x is Poisson
distributed with mean y(x) = Kmn exp(-x), where
•  is the positive solution of
∑pipjexp(s(i,j)) = 1
• K is a constant
• s(i,j) is the scoring matrix, pi is the frequency of
i in random sequences
Statistics
• Poisson distribution:
Pr(x) = (e-  x)/x!
• Pr(#SPs >= )
= 1 - Pr(#SPs <= -1)
 1
y
i
 1
i
e y
 1 
i!
i 0
y
 1 e 
i 0 i!
y
Statistics
• For =1, Pr(#SPs >= 1) = 1-e-y(x)
• Choose S such that 1-e-y(S) is small
• Suppose the probability of having at least 1 SP with
score >= S is 0.001.
• This seems reasonably small
• However, if you test 10000 random sequences, you
expect 10 to cross the threshold
• Therefore, require “E-value” to be small.
• That is, expected number of random sequence pairs
with score >= S should be small.
More statistics
• We just saw how to choose threshold S
• How to choose T ?
• BLAST is trying to find segment pairs
(SPs) scoring above S
• If an SP scores S, what is the
probability that it will have a w-word
match of score T or more?
• We want this probability to be high
More statistics: Choosing T
• Given a segment pair (from two random
sequences) that scores S, what is the
probability q that it will have no w-word
match scoring above T?
• Want this q to be low
• Obtained from simulations
• Found to decrease exponentially as S
increases
BLAST is the universally used
bioinformatics tool
http://flybase.org/blast/