Comp. Genomics
Recitation 3
The statistics of database
searching
Substitution matrices
• Random model (R): x and y appear at random
$P(x, y \mid R) = \prod_i q_{x_i} \prod_j q_{y_j}$
• Match model (M): x and y are derived from a common ancestor
$P(x, y \mid M) = \prod_i p_{x_i y_i}$
• The odds ratio: $\frac{P(x, y \mid M)}{P(x, y \mid R)}$
• Using the log of the odds ratio gives an additive scoring system
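The additivity claim can be checked numerically. The sketch below is only an illustration: the two-letter alphabet and the values in `p` and `q` are hypothetical, chosen to keep the arithmetic small, and `log_odds_score` is a name introduced here.

```python
import math

def log_odds_score(x, y, p, q):
    """Sum the per-position log2 odds for two aligned, equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    return sum(math.log2(p[a, b] / (q[a] * q[b])) for a, b in zip(x, y))

# Hypothetical two-letter alphabet (assumption, not from the recitation).
q = {"A": 0.5, "B": 0.5}
p = {("A", "A"): 0.4, ("B", "B"): 0.4, ("A", "B"): 0.1, ("B", "A"): 0.1}

x, y = "AAB", "ABB"
additive = log_odds_score(x, y, p, q)

# The same quantity computed as the log of the ratio of the two products.
match_prob = math.prod(p[a, b] for a, b in zip(x, y))
random_prob = math.prod(q[a] for a in x) * math.prod(q[b] for b in y)
direct = math.log2(match_prob / random_prob)

print(additive, direct)
```

Both numbers agree up to floating-point error, which is the sense in which the log turns a product of ratios into a sum of per-position scores.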
Exercise
The following substitution matrix is given:
     T    C    A    G
T    1    0   -1   -1
C    0    1   -1   -1
A   -1   -1    1    0
G   -1   -1    0    1
What is the average score per nucleotide
pair? Assume each nucleotide appears with
equal probability.
Solution
$\sum_{i,j} \frac{1}{4} \cdot \frac{1}{4} \, s(i,j) = -\frac{4}{16}$
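The average can be verified with a few lines of Python. This is a minimal sketch; the names `SCORES`, `freq` and `expected_score` are introduced here for illustration, and the matrix is the one given in the exercise.

```python
NUCLEOTIDES = "TCAG"

# Substitution matrix from the exercise, indexed as SCORES[row][column].
SCORES = {
    "T": {"T": 1, "C": 0, "A": -1, "G": -1},
    "C": {"T": 0, "C": 1, "A": -1, "G": -1},
    "A": {"T": -1, "C": -1, "A": 1, "G": 0},
    "G": {"T": -1, "C": -1, "A": 0, "G": 1},
}

# Each nucleotide appears with equal probability, as the exercise assumes.
freq = {n: 0.25 for n in NUCLEOTIDES}

expected_score = sum(
    freq[i] * freq[j] * SCORES[i][j] for i in NUCLEOTIDES for j in NUCLEOTIDES
)
print(expected_score)  # -0.25
```

The four matches contribute +4, the eight -1 entries contribute -8, and the four zeros contribute nothing, giving -4/16.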
• What happens if the average score is not negative?
• What is the chance that in a pair of evolutionarily related sequences T is replaced by C?
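Solution
$S(i,j) = \log_2 \frac{p_{ij}}{q_i q_j}$
The matrix gives $S(T,C) = 0$, so
$\log_2 \frac{p_{TC}}{q_T q_C} = 0 \;\Rightarrow\; \frac{p_{TC}}{q_T q_C} = 1 \;\Rightarrow\; p_{TC} = q_T q_C = 1/16$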
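Inverting the log-odds formula gives $p_{ij} = q_i q_j 2^{S(i,j)}$ for every pair, not just T and C. The sketch below applies this to the exercise matrix, assuming base-2 logs with no extra scaling factor and uniform background frequencies; `implied_p` is a name introduced here.

```python
NUCLEOTIDES = "TCAG"

# Substitution matrix from the exercise, indexed as SCORES[row][column].
SCORES = {
    "T": {"T": 1, "C": 0, "A": -1, "G": -1},
    "C": {"T": 0, "C": 1, "A": -1, "G": -1},
    "A": {"T": -1, "C": -1, "A": 1, "G": 0},
    "G": {"T": -1, "C": -1, "A": 0, "G": 1},
}

q = 0.25  # uniform background frequency, as the exercise assumes

# p_ij = q_i * q_j * 2**S(i, j), obtained by inverting S = log2(p / (q_i q_j)).
implied_p = {
    (i, j): q * q * 2.0 ** SCORES[i][j] for i in NUCLEOTIDES for j in NUCLEOTIDES
}

print(implied_p["T", "C"])      # 0.0625, i.e. 1/16
print(sum(implied_p.values()))  # 1.0
```

The implied pair probabilities sum to 1, so this matrix is consistent with a proper match model under those assumptions.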
Does the optimal alignment change if we
multiply the matrix by a constant C?
How significant is my
score?
• Create a mathematical model of the
alignment of random sequences, and
derive the score distribution analytically
• Use simulation to estimate the score
distribution of the alignment of:
• Generated sequences
• Real sequences that are known to be nonhomologous, or that are shuffled
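The simulation approach can be sketched as follows. This is an illustrative toy, not the procedure used by any particular tool: it scores ungapped local alignments with the nucleotide matrix from the earlier exercise, and the function names, sequence length, trial count and seed are all assumptions made for the example.

```python
import random

PYRIMIDINES = {"T", "C"}

def s(a, b):
    """Exercise matrix: +1 for a match, 0 within T/C or within A/G, -1 otherwise."""
    if a == b:
        return 1
    if (a in PYRIMIDINES) == (b in PYRIMIDINES):
        return 0
    return -1

def best_ungapped_local_score(x, y):
    """Best-scoring ungapped segment pair, found by scanning every diagonal."""
    best = 0
    for offset in range(-(len(x) - 1), len(y)):
        i, j = max(0, -offset), max(0, offset)
        running = 0
        while i < len(x) and j < len(y):
            # Restart the segment whenever its running score drops below zero.
            running = max(0, running + s(x[i], y[j]))
            best = max(best, running)
            i += 1
            j += 1
    return best

def null_score_sample(length, trials, seed=0):
    """Scores of optimal ungapped local alignments between random sequence pairs."""
    rng = random.Random(seed)

    def random_sequence():
        return "".join(rng.choice("TCAG") for _ in range(length))

    return [
        best_ungapped_local_score(random_sequence(), random_sequence())
        for _ in range(trials)
    ]

sample = null_score_sample(length=50, trials=200)
print(min(sample), max(sample))
```

A histogram of `sample` approximates the null score distribution for this scoring system and these sequence lengths.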
Empirical score
distribution
• The picture shows a
distribution of scores
from a real database
search using BLAST.
• This distribution
contains scores from
non-homologous and
homologous pairs.
High scores from homology.
Empirical null score
distribution
• This distribution is
similar to the
previous one, but
generated using a
randomized sequence
database.
Statistical analysis
• What is a null hypothesis?
• An assumption that may be contradicted (but
not validated) by the data.
• The purpose of most statistical tests is to
determine whether the observed data can be
explained by the null hypothesis.
Overview
• What is a p-value?
• The probability of observing an effect as
strong or stronger than you observed, given
the null hypothesis. I.e., “How likely is this
effect to occur by chance?”
• Pr(x ≥ S | null)
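Given a sample of scores drawn under the null hypothesis, the p-value can be estimated as the fraction of null scores at least as large as the observed one. A minimal sketch, with `empirical_p_value` and the toy null scores being assumptions for illustration:

```python
def empirical_p_value(observed, null_scores):
    """Fraction of null scores that are as strong or stronger than the observed score."""
    if not null_scores:
        raise ValueError("null_scores must not be empty")
    as_extreme = sum(1 for score in null_scores if score >= observed)
    return as_extreme / len(null_scores)

# Toy null sample (hypothetical numbers).
toy_null = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
print(empirical_p_value(9, toy_null))  # 0.2
```

An estimate like this cannot resolve p-values smaller than one over the sample size, which is one reason an analytical score distribution is useful.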
Extreme value distribution
• What is the name of the distribution
created by local alignment scores, and
what does it look like?
• Extreme value distribution, or Gumbel
distribution.
• It looks similar to a normal distribution, but it
has a larger tail on the right.
• Ungapped local alignment max scores follow
this distribution, and gapped alignment
scores seem to follow it
Extreme value distribution
• The expected number of optimal alignments with a
score ≥S is given by the formula:
$E = K m n e^{-\lambda S}$ (E-value)
where m,n are sequence lengths, λ is a scaling
parameter for the scoring system and K is a scaling
parameter for the search space (e.g. accounts for
overlaps)
• For ungapped local alignments, the parameters can be
calculated directly from the substitution matrix scores
and the lengths of the aligned sequences
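The E-value formula is easy to evaluate directly. In the sketch below, `K_ASSUMED` and `LAMBDA_ASSUMED` are placeholder values chosen for illustration only; real values depend on the scoring system, and the query and database lengths are likewise hypothetical.

```python
import math

def e_value(score, m, n, K, lam):
    """Expected number of optimal alignments with score >= `score`: K*m*n*exp(-lam*score)."""
    return K * m * n * math.exp(-lam * score)

# Placeholder parameters (assumptions, not values from the recitation).
K_ASSUMED = 0.1
LAMBDA_ASSUMED = 0.3

for score in (20, 40, 60):
    print(score, e_value(score, m=300, n=1_000_000, K=K_ASSUMED, lam=LAMBDA_ASSUMED))
```

The E-value grows linearly with the search space m·n and falls exponentially with the score.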
Exercise
• Assuming that the probability for seeing x
optimal alignments with score ≥S is given
by the Poisson distribution:
$\frac{e^{-\mu} \mu^{x}}{x!}$
where μ is the mean, what is the p-value
of the score S?
Solution
• The p-value is the probability of seeing
the score ≥S by chance
• The probability of not seeing the score by chance is the Poisson probability of x = 0 with mean μ = E:
$e^{-K m n e^{-\lambda S}}$
• The probability of seeing the score by chance is
$1 - \exp(-K m n e^{-\lambda S})$
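The conversion from E-value to p-value can be written as a one-liner. A minimal sketch, with function names introduced here; `math.expm1` is used so that very small E-values do not lose precision.

```python
import math

def poisson_pmf(x, mu):
    """Poisson probability of exactly x events when the mean is mu."""
    return math.exp(-mu) * mu**x / math.factorial(x)

def p_value_from_e_value(e):
    """P(at least one alignment with score >= S) = 1 - exp(-E)."""
    return -math.expm1(-e)

for e in (10.0, 1.0, 0.01, 1e-6):
    print(e, p_value_from_e_value(e))
```

For small E the p-value is approximately equal to E, while for large E it saturates at 1.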
What p-value is
significant?
• The most common thresholds are 0.01 and 0.05.
• A threshold of 0.05 means that, when the null hypothesis is true, you will wrongly call the result significant 5% of the time.
• Is a 5% false positive rate acceptable? It depends upon the cost associated with making a mistake.
• Examples of costs:
• Doing expensive wet lab validation.
• Making clinical treatment decisions.
• Misleading the scientific community.
Database searching
A database contains many sequences.
Problem: multiple comparisons increase the chance of a random high score.
Multiple testing
• Say that you perform a statistical test with
a 0.05 threshold, but you repeat the test
on twenty different observations.
• Assume that all of the observations are
explainable by the null hypothesis.
• What is the chance that at least one of
the observations will receive a p-value
less than 0.05?
• Pr(making a mistake) = 0.05
• Pr(not making a mistake) = 0.95
• Pr(not making any mistake) = 0.95^20 = 0.358
• Pr(making at least one mistake) = 1 - 0.358 = 0.642
• There is a 64.2% chance of making at
least one mistake.
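The same calculation for any threshold and number of tests, as a small sketch. The function name is introduced here, and the tests are assumed to be independent, as in the calculation above.

```python
def prob_at_least_one_false_positive(alpha, n_tests):
    """Chance of at least one p-value below alpha across independent tests under the null."""
    return 1 - (1 - alpha) ** n_tests

print(round(prob_at_least_one_false_positive(0.05, 20), 3))  # 0.642
```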
Bonferroni correction
• Divide the desired p-value threshold by the
number of tests performed.
• For the previous example, 0.05 / 20 = 0.0025.
• Pr(making a mistake) = 0.0025
• Pr(not making a mistake) = 0.9975
• Pr(not making any mistake) = 0.9975^20 = 0.9512
• Pr(making at least one mistake) = 1 - 0.9512 = 0.0488
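A short sketch of the correction and its effect on the overall error rate. The function names are introduced here, and independence of the tests is assumed for the family-wise calculation.

```python
import math

def bonferroni_threshold(alpha, n_tests):
    """Per-test threshold that keeps the overall false positive rate at or below alpha."""
    return alpha / n_tests

def family_wise_error(per_test_alpha, n_tests):
    """Chance of at least one false positive across independent tests under the null."""
    return 1 - (1 - per_test_alpha) ** n_tests

per_test = bonferroni_threshold(0.05, 20)
print(round(family_wise_error(per_test, 20), 4))  # 0.0488
```

The corrected family-wise rate lands just under the desired 0.05, which is why the Bonferroni correction is described as conservative.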
Corrections for Database
searching
• Say that you search the non-redundant protein database at NCBI, containing roughly one million sequences. What p-value threshold should you use?
• What is the hidden assumption here?
Example
• Say that you want to use a conservative p-value
of 0.001.
• Recall that you would observe such a p-value by chance approximately once in every 1000 comparisons against a random database.
• A Bonferroni correction would suggest using a p-value threshold of 0.001 / 1,000,000 = 0.000000001 = 10^-9.
Exercise
• A sequence of size m is queried against a database. The database contains k sequences of lengths n1, n2, …, nk. The E-value for alignment i is $K m n_i e^{-\lambda S}$.
• What is the query E-value for score S?
• If we know that the p-value for alignment i is Pi,
what is the query p-value?
Solution
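• Let Ai denote the number of optimal alignments with database sequence i that score ≥S (Ai is either 0 or 1)
• $E(A_1 + A_2 + \dots + A_k) = E(A_1) + E(A_2) + \dots + E(A_k) = \sum_{i=1}^{k} K m n_i e^{-\lambda S} = K m e^{-\lambda S} \sum_{i=1}^{k} n_i = K m n e^{-\lambda S}$
where $n = \sum_{i=1}^{k} n_i$ is the total length of the database.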
Solution
• The probability of seeing an optimal alignment
with score ≥S by chance in the entire database
is
$1 - \prod_{i=1}^{k} (1 - P_i) = 1 - \prod_{i=1}^{k} \exp(-K m n_i e^{-\lambda S}) = 1 - \exp\left(-\sum_{i=1}^{k} K m n_i e^{-\lambda S}\right)$