Download Lecture 15-16

Document related concepts

Statistics wikipedia , lookup

History of statistics wikipedia , lookup

Transcript
Lecture outline
• Database searches
– BLAST
– FASTA
• Statistical Significance of Sequence
Comparison Results
– Probability of matching runs
– Karin-Altschul statistics
– Extreme value distribution
1
The purpose of sequence alignment
• Homology
• Function identification
– about 70% of the genes of M. jannaschii were
assigned a function using sequence similarity
(1997)
2
Similarity
• How much similar do the sequences have to be
to infer homology?
• Two possibilities when similarity is detected:
– The similarity is by chance
– They evolved from a common ancestor – hence,
have similar functions
3
Measures of similarity
• Percent identity:
– 40% similar, 70% similar
– problems with percent identity?
• Scoring matrices
– matching of some amino acids may be more
significant than matching of other amino acids
– PAM matrix in 1970, BLOSUM in 1992
– problems?
4
Statistical Significance
• Goal: to provide a universal measure for inferring
homology
– How different is the result from a random match, or a
match between unrelated requences?
– Given a set of sequences not related to the query (or a set
of random sequences), what is the probability of finding a
match with the same alignment score by chance?
• Different statistical measures
– p-value
– E-value
– z-score
5
Statistical significance measures
• p-value: the probability that at least one sequence will
produce the same score by chance
• E-value: expected number of sequences that will
produce same or better score by chance
• z-score: measures how much standard deviations
above the mean of the score distribution
6
7
Search Significance Scores
• A search will always return some hits.
• How can we determine how “unusual” a
particular alignment score is?
– ORF’s
• Assumptions
8
Assessing significance requires a distribution
Frequency
• I have an apple of diameter 5”. Is that unusual?
Diameter (cm)
9
Is a match significant?
Frequency
• Match scores for aligning my sequence with random
sequences.
• Depends on:
– Scoring system
– Database
– Sequence to search for
• Length
• Composition
Match score
• How do we determine the random sequences?
10
Generating “random” sequences
• Random uniform model:
P(G) = P(A) = P(C) = P(T) = 0.25
– Doesn’t reflect nature
• Use sequences from a database
– Might have genuine homology
• We want unrelated sequences
• Random shuffling of sequences
– Preserves composition
– Removes true homology
11
What distribution do we expect to see?
• The mean of n random events tends towards a
Gaussian distribution.
– Example: Throw n dice and compute the mean.
– Distribution of means:
n=2
n = 1000
12
The extreme value distribution
• This means that if we get the match scores for
our sequence with n other sequences, the mean
would follow a Gaussian distribution.
• The maximum of n random events tends
towards the extreme value distribution as n
grows large.
13
Comparing distributions
Extreme Value:
Gaussian:


f x  
1


e
x

e
e

x

1
f x  
e
 2
  x   2
2 2
14
How to compute statistical significance?
• Significance of a match-run
– Erdös-Renyí
• Significance of local alignments without gaps
– Karlin-Altschul statistics
– Scoring matrices revisited
• Significance of local alignments with gaps
• Significance of global alignments
15
Analysis of coin tosses
• Let black circles indicate heads
• Let p be the probability of a “head”
– For a “fair” coin, p = 0.5
• Probability of 5 heads in a row is (1/2)^5=0.031
• The expected number of times that 5H occurs in
above 14 coin tosses is 10*0.031 = 0.31
16
Analysis of coin tosses
• The expected number of a length l run of heads in n
tosses.
E (l )  np
l
• What is the expected length R of the longest match
in n tosses?
1  np
R
R  log 1/ p (n)
17
Analysis of coin tosses
• (Erdös-Rényi) If there are n throws, then the
expected length R of the longest run of
heads is
R = log1/p (n)
18
Example
• Example: Suppose n = 20 for a “fair” coin
R=log2(20)=4.32
– In other words: in 20 coin tosses we expect a run of heads of
length 4.32, once.
• Trick is how to model DNA (or amino acid)
sequence alignments as coin tosses.
19
Analysis of an alignment
• Probability of an individual match p = 0.05
• Expected number of matches: 10x8x0.05 = 4
• Expected number of two successive matches
 10x8x0.05x0.05 = 0.2
20
Matching runs
in sequence alignments
• Consider two sequences a1..m and b1..n
• If the probability of occurrence for every
symbol is p, then a match of a residue ai
with bj is p, and a match of length l from
ai,bj to ai+l-1,bj+l-1 is pl.
• The head-run problem of coin tosses
corresponds to the longest run of matches
along the diagonals
21
Matching runs
in sequence alignments
• There are m-l+1 x n-l+1 places where the match
could start
E (l )  mnp
l
• The expected length of the longest match can be
approximated as
R=log1/p(mn)
where m and n are the lengths of the two sequences.
22
Matching runs
in sequence alignments
• So suppose m = n = 10 and we’re looking at
DNA sequences
R=log4(100)=3.32
• This analysis makes assumptions about the
base composition (uniform) and no gaps,
but it’s a good estimate.
23
Statistics for matching runs
• Statistics of matching runs:
E (l )  mnp
l
• Length versus score?
– Consider all mismatches receive a negative score of -∞ and
aibj match receives a positive score of si,j.
• What is the expected number of matching runs with a
score x or higher?
E (S  x)  mnp
x
– Using this theory of matching runs, Karlin and Altschul
developed a theory for statistics of local alignments without
gaps (extended this theory to allow for mismatches).
24
Statistics of local alignments
without gaps
• A scoring matrix which satisfy the following
constraint:
– The expected score of a single match obtained by a scoring
matrix should be negative.
E ( si , j )  i , j pi p j si , j  0
– Otherwise?
• Arbitrarily long random sequences will get higher scores just because
they are long, not because there’s a significant match.
• If this requirement is met then the expected number of
alignments with score x or higher is given by:
 x
E ( S  x)  Kmne
25
Statistics of local alignments
without gaps
 x
E ( S  x)  Kmne
– K < 1 is a proportionality constant that corrects the mn “space
factor” for the fact that there are not really mn independent
places that could have produced score S ≥ x.
– K has little effect on the statistical significance of a similarity
score
– λ is closely related to the scoring matrix used and it takes into
account that the scoring matrices do not contain actual
probabilities of co-occurence, but instead a scaled version of
those values. To understand how λ is computed, we have to
look at the construction of scoring matrices.
26
Scoring Matrices
• In 1970s there were few protein sequences available.
Dayhoff used a limited set of families of protein
sequences multiply aligned to infer mutation
likelihoods.
27
Scoring Matrices
• Dayhoff represented the similarity of amino acids as a
log odds ratio:
sij  log( qij / pi p j )
where qij is the observed frequency of co-occurrence, and pi, pj
are the individual frequencies.
28
Example
• If M occurs in the sequences with 0.01
frequency and L occurs with 0.1 frequency. By
random pairing, you expect 0.001 amino acid
pairs to be M-L. If the observed frequency of
M-L is actually 0.003, score of matching M-L
will be
– log2(3)=1.585 bits or loge(3) = ln(3) = 1.1 nats
• Since, scoring matrices are usually provided as
integer matrices, these values are scaled by a constant
factor. λ is approximately the inverse of the original
scaling factor.
29
How to compute λ
• Recall that:
sij  log( qij / pi p j )
 qij  pi p j e
and:
n
i
 q
i 1 j 1
n
i
ij
 1 Sum of observed frequencies is 1.
  pi p j e
i 1 j 1
sij
sij
1
Given the frequencies of
individual amino acids and
the scores in the matrix, λ
can be estimated.
30
Extreme value distribution
• Consider an experiment that obtains the
maximum value of locally aligning a random
string with query string (without gaps). Repeat
with another random string and so on. Plot the
distribution of these maximum values.
• The resulting distribution is an extreme value
distribution, called a Gumbel distribution.
31
Normal vs. Extreme Value Distribution
0.4
Normal
Normal distribution:
y=
Extreme
Value
2/2
-x
(1/√2π)e
Extreme value distribution:
y = e-x – e-x
0.0
-4
-3
-2
-1
0
1
2
3
4
x
32
Local alignments with gaps
• The EVD distribution
is not always observed.
Theory of local alignments
with gaps is not well studied
as in without gaps.
Mostly empirical results.
For example, BLAST allows
only a certain range of
gap penalties.
33
Comparing distributions
Extreme Value:
Gaussian:


f x  
1


e
x

e
e

x

f x  
1
e
 2
  x   2
2 2
34
Determining P-values
• If we can estimate  and , then we can determine, for a given
match score x, the probability that a random match with score
x or greater would have occurred in the database.
• For sequence matches, a scoring system and database can be
parameterized by two parameters, K and , related to  and .
– It would be nice if we could compare hit significance
without regard to the database and scoring system used!
35
Bit Scores
• The expected number of hits with score  S is:
E = Kmn e s
– Where m and n are the sequence lengths
• Normalize the raw score using:
S 
S  ln K
ln 2
• Obtains a “bit score” S’, with a standard set of units.
• The new E-value is:
E  mn 2
S
36
P values and E values
• Blast reports E-values
• E = 5, E = 10 versus P = 0.993 and P =
0.99995
• When E < 0.01 P-values and E-values are
nearly identical
37
BLAST parameters
• Lowering the neighborhood word threshold
(T) allows more distantly related sequences to
be found, at the expense of increased noise in
the results set.
• Raising the segment extension cutoff (X)
returns longer extensions for each hit.
• Changing the minimum E-value changes the
threshold for reporting a hit.
38
BLAST statistics
• Pre-computed λ and K values for different
scoring matrices and gap penalties are used for
faster computation.
• Raw score is converted to bit score:
Sbit 
S  ln K
ln 2
• E-value is computed using
 S b it
E  sss  2
sss  (m  L)( n  N  L)
• m is query size, n is database size and L is the
typical length of maximal scoring alignment.
39
Evaluating BLAST Results
• A BLAST search in a sequence database might produce hundreds
of candidate alignments.
• How to know which are meaningfull, i.e. homologous?
• BLAST provides with:
– Raw scores
– Bit scores
– E-values
Probability is the basic element of tests for statistical significance
40
• Raw scores: the sum of the scores of the maximal-scoring segment pairs
(MSPs) that makes up the alignment. Because of differences between scoring
matrices raw scores are not directly comparable
• Bit scores: these are raw scores that have been converted from the log base of
the scoring matrix that creates the alignment to log base 2. This rescaling
allows bit scores to be comparable.
• E-scores: is the likelihood that a given sequence alignment is significant. The
e-value indicates the number of alignments one expects to find with a score
equal or greater to the given one in a search against a random database.
• Large e-value (5 or 10) indicates that the alignment is probably by chance.
• E-values of 0.1 or 0.05 are typical cuttoff values for data base search
• Proteins with less than 25% similarity are not similar enough for a reliable
BLAST analysis and structural comparison must be used.
41
The probability density function of the extreme
value distribution (characteristic value u=0 and
decay constant =1)
0.40
0.35
probability
0.30
0.25
0.20
normal
distribution
extreme
value
distribution
0.15
0.10
0.05
0
-5
-4
-3
-2
-1
0
1
2
3
4
5
x
page 103
42
page 104
43
How to interpret a BLAST search: expect value
The expect value E is the number of alignments
with scores greater than or equal to score S
that are expected to occur by chance in a
database search.
An E value is related to a probability value p.
The key equation describing an E value is:
E = Kmn e-S
page 105
44
E = Kmn e-S
This equation is derived from a description
of the extreme value distribution
S = the score
E = the expect value = the number
of HSPs expected to occur with
a score of at least S
m, n = the length of two sequences
, K = Karlin Altschul statistics parameters
45
Some properties of the equation E = Kmn e-S
• The value of E decreases exponentially with increasing S
(higher S values correspond to better alignments). Very
high scores correspond to very low E values.
•The E value for aligning a pair of random sequences must
be negative! Otherwise, long random alignments would
acquire great scores
• Parameter K describes the search space (database).
• For E=1, one match with a similar score is expected to
occur by chance. For a very much larger or smaller
database, you would expect E to vary accordingly
page 105-106
46
From raw scores to bit scores
• There are two kinds of scores:
raw scores (calculated from a substitution matrix) and
bit scores (normalized scores)
• Bit scores are comparable between different searches
because they are normalized to account for the use
of different scoring matrices and different database sizes
S’ = bit score = (S - lnK) / ln2
The E value corresponding to a given bit score is:
E = mn 2 -S’
Bit scores allow you to compare results between different
database searches, even using different scoring matrices.
page 106
47
How to interpret BLAST: E values and p values
The expect value E is the number of alignments
with scores greater than or equal to score S
that are expected to occur by chance in a
database search. A p value is a different way of
representing the significance of an alignment.
p = 1 - e -E
page 106
48
How to interpret BLAST: E values and p values
Very small E values are very similar to p values.
E values of about 1 to 10 are far easier to interpret
than corresponding p values. (p = 1 - e-E)
E
10
5
2
1
0.1
0.05
0.001
0.0001
p
0.99995460
0.99326205
0.86466472
0.63212056
0.09516258 (about 0.1)
0.04877058 (about 0.05)
0.00099950 (about 0.001)
0.0001000
page 107
49
How to interpret BLAST: getting to the bottom
page 107
50
EVD parameters
matrix
gap penalties
10.0 is the E value
Effective search space
= mn
= length of query x db length
threshold score = 11
cut-off parameters
page 108
51
Changing E, T & matrix for blastp nr RBP
Expect
10
(T=11)
1
(T=11)
10,000
(T=11)
10
(T=5)
10
(T=11)
10
(T=16)
10
10
(BL45) (PAM70)
#hits to db
129m
129m
129m
112m
112m
112m
386m
129m
#sequences
1,043,455 1.0m
1.0m
907,000 907,000 907,000 1.0m
1.0m
#extensions
5.2m
5.2m
5.2m
508m
4.5m
73,788
30.2m
19.5m
#successful
extensions
8,367
8,367
8,367
11,484
7,288
1,147
9,088
13,873
#sequences
142
better than E
86
6,439
125
124
88
110
82
#HSPs>E
53
(no gapping)
46
6,099
48
48
48
60
66
#HSPs
gapped
145
88
6,609
127
126
90
113
99
X1, X2, X3
16 (7.4 bits)
38 (14.6 bits)
64 (24.7 bits)
16
38
64
16
38
64
22
51
85
15
35
59
52
Changing E, T & matrix for blastp nr RBP
Expect
10
(T=11)
1
(T=11)
10,000
(T=11)
10
(T=5)
10
(T=11)
10
(T=16)
10
10
(BL45) (PAM70)
#hits to db
129m
129m
129m
112m
112m
112m
386m
129m
#sequences
1,043,455 1.0m
1.0m
907,000 907,000 907,000 1.0m
1.0m
#extensions
5.2m
5.2m
5.2m
508m
4.5m
73,788
30.2m
19.5m
#successful
extensions
8,367
8,367
8,367
11,484
7,288
1,147
9,088
13,873
#sequences
142
better than E
86
6,439
125
124
88
110
82
#HSPs>E
53
(no gapping)
46
6,099
48
48
48
60
66
#HSPs
gapped
145
88
6,609
127
126
90
113
99
X1, X2, X3
16 (7.4 bits)
38 (14.6 bits)
64 (24.7 bits)
16
38
64
16
38
64
22
51
85
15
35
59
53
Changing E, T & matrix for blastp nr RBP
Expect
10
(T=11)
1
(T=11)
10,000
(T=11)
10
(T=5)
10
(T=11)
10
(T=16)
10
10
(BL45) (PAM70)
#hits to db
129m
129m
129m
112m
112m
112m
386m
129m
#sequences
1,043,455 1.0m
1.0m
907,000 907,000 907,000 1.0m
1.0m
#extensions
5.2m
5.2m
5.2m
508m
4.5m
73,788
30.2m
19.5m
#successful
extensions
8,367
8,367
8,367
11,484
7,288
1,147
9,088
13,873
#sequences
142
better than E
86
6,439
125
124
88
110
82
#HSPs>E
53
(no gapping)
46
6,099
48
48
48
60
66
#HSPs
gapped
145
88
6,609
127
126
90
113
99
X1, X2, X3
16 (7.4 bits)
38 (14.6 bits)
64 (24.7 bits)
16
38
64
16
38
64
22
51
85
15
35
59
54
Changing E, T & matrix for blastp nr RBP
Expect
10
(T=11)
1
(T=11)
10,000
(T=11)
10
(T=5)
10
(T=11)
10
(T=16)
10
10
(BL45) (PAM70)
#hits to db
129m
129m
129m
112m
112m
112m
386m
129m
#sequences
1,043,455 1.0m
1.0m
907,000 907,000 907,000 1.0m
1.0m
#extensions
5.2m
5.2m
5.2m
508m
4.5m
73,788
30.2m
19.5m
#successful
extensions
8,367
8,367
8,367
11,484
7,288
1,147
9,088
13,873
#sequences
142
better than E
86
6,439
125
124
88
110
82
#HSPs>E
53
(no gapping)
46
6,099
48
48
48
60
66
#HSPs
gapped
145
88
6,609
127
126
90
113
99
X1, X2, X3
16 (7.4 bits)
38 (14.6 bits)
64 (24.7 bits)
16
38
64
16
38
64
22
51
85
15
35
59
55
Changing E, T & matrix for blastp nr RBP
Expect
10
(T=11)
1
(T=11)
10,000
(T=11)
10
(T=5)
10
(T=11)
10
(T=16)
10
10
(BL45) (PAM70)
#hits to db
129m
129m
129m
112m
112m
112m
386m
129m
#sequences
1,043,455 1.0m
1.0m
907,000 907,000 907,000 1.0m
1.0m
#extensions
5.2m
5.2m
5.2m
508m
4.5m
73,788
30.2m
19.5m
#successful
extensions
8,367
8,367
8,367
11,484
7,288
1,147
9,088
13,873
#sequences
142
better than E
86
6,439
125
124
88
110
82
#HSPs>E
53
(no gapping)
46
6,099
48
48
48
60
66
#HSPs
gapped
145
88
6,609
127
126
90
113
99
X1, X2, X3
16 (7.4 bits)
38 (14.6 bits)
64 (24.7 bits)
16
38
64
16
38
64
22
51
85
15
35
59
56
Changing E, T & matrix for blastp nr RBP
Expect
10
(T=11)
1
(T=11)
10,000
(T=11)
10
(T=5)
10
(T=11)
10
(T=16)
10
10
(BL45) (PAM70)
#hits to db
129m
129m
129m
112m
112m
112m
386m
129m
#sequences
1,043,455 1.0m
1.0m
907,000 907,000 907,000 1.0m
1.0m
#extensions
5.2m
5.2m
5.2m
508m
4.5m
73,788
30.2m
19.5m
#successful
extensions
8,367
8,367
8,367
11,484
7,288
1,147
9,088
13,873
#sequences
142
better than E
86
6,439
125
124
88
110
82
#HSPs>E
53
(no gapping)
46
6,099
48
48
48
60
66
#HSPs
gapped
145
88
6,609
127
126
90
113
99
X1, X2, X3
16 (7.4 bits)
38 (14.6 bits)
64 (24.7 bits)
16
38
64
16
38
64
22
51
85
15
35
59
57
BLAST search strategies
General concepts
How to evaluate the significance of your results
How to handle too many results
How to handle too few results
BLAST searching with HIV-1 pol, a multidomain protein
BLAST searching with lipocalins using different matrices
page 108-12258
Sometimes a real match has an E value > 1
…try a reciprocal BLAST to confirm
page 110
59
Sometimes a similar E value occurs for a
short exact match and long less exact match
page 111
60
Assessing whether proteins are homologous
RBP4 and PAEP:
Low bit score, E value 0.49, 24% identity (“twilight zone”).
But they are indeed homologous. Try a BLAST search
with PAEP as a query, and find many other lipocalins.
page 111
61
page 112
62
Searching with a multidomain protein, pol
page 114
63
64
Searching bacterial sequences with pol
65
Protein sequence Motifs or Patterns
66