Bioinformatics II
Probability and Statistics
Universität Zürich and ETH Zürich
Spring Semester 2009
Lecture 1: Basic Probability
Dr Fraser Daly
adapted from a course by Dr N Pétrélis
Course outline
• Basic probability (lecture 1)
• Statistical estimation and testing (lecture 2)
• Markov chains (lecture 3)
• Models and algorithms in bioinformatics (lectures 4 and 5)
Suppose we are given two sequences from two species:
is there a common ancestor?
ggagactgtagacagctaatgctata
gaacgccctagccacgagcccttatc
Sequence length: 26 nucleotides
11 of 26 positions agree
Conclusion? Generated ‘purely by chance’ or by some
other mechanism?
To answer this, we need to understand properties of
random sequences.
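The agreement count quoted above can be verified directly; a minimal sketch in Python:

```python
# Count positionwise matches between the two example sequences from the text.
s1 = "ggagactgtagacagctaatgctata"
s2 = "gaacgccctagccacgagcccttatc"

matches = sum(a == b for a, b in zip(s1, s2))
print(f"{matches} of {len(s1)} positions agree")  # 11 of 26 positions agree
```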
Why do we need probability and statistics in
bioinformatics?
• modeling sequence evolution (Markov chains);
• inferring phylogenetic trees (maximum likelihood trees);
• gene prediction (hidden Markov models);
• analysis of microarray data (multiple testing, multivariate statistics);
• evaluating sequence similarity in BLAST searches (extreme values, random walks);
• and much more!
Random variables
A random variable is a quantity whose value depends
on a random event. For example:
1. Toss a coin. Let X = 0 if we get heads, X = 1 if
we get tails. X is a random variable.
2. Select two DNA sequences at random from a database.
Let Y be the number of matches. Then Y is a random
variable.
3. Let Z be the lifetime of a newly fitted lightbulb. Z
is a random variable.
X and Y are discrete random variables, while Z is a continuous random variable.
The most important feature of a random variable is its
probability distribution: this tells us the probability of
the random variable taking particular values.
For discrete random variables, this can be specified by
the probability mass function or by the (cumulative)
distribution function.
For continuous random variables, this can be specified
by the probability density function or the (cumulative)
distribution function.
Discrete random variables
Let X be a discrete random variable taking values in a
sample space S.
For i ∈ S, we define the probability mass function of X to be
pX(i) = P(X = i).
The (cumulative) distribution function is
FX(i) = P(X ≤ i) = Σ_{j∈S, j≤i} pX(j).
Note that both pX(i) and FX(i) are probabilities, so lie between 0 and 1.
These functions contain all the essential information about the properties and behavior of X.
Continuous random variables
For a continuous random variable Y we can also define
the (cumulative) distribution function:
FY(t) = P(Y ≤ t) = ∫_{x≤t} fY(x) dx,
where fY(x) is the probability density function.
The probability density function is not a probability! It is never negative, but it can be larger than 1.
Example
Toss a fair coin. Let X = 0 if we get heads, X = 1 if
we get tails.
X is a discrete random variable with sample space {0, 1}.
The probability mass function is:
pX(0) = P(heads) = 1/2,  pX(1) = P(tails) = 1/2.
The cumulative distribution function is
FX(0) = P(X ≤ 0) = P(X = 0) = 1/2,
FX(1) = P(X ≤ 1) = P(X = 0) + P(X = 1) = 1.
Example
Suppose we generate a DNA sequence as follows: at
each position 1, . . . , N in the sequence choose either a,
c, g or t with probabilities pa, pc, pg and pt, independently
of any other position.
Compare 2 DNA sequences generated in this way. Let
Y be the number of matches between them.
What is the probability mass function of Y ?
Fix a position in the sequences. The probability of a
match at that position is
P(two ‘a’ or two ‘c’ or two ‘g’ or two ‘t’)
= P(two ‘a’) + P(two ‘c’) + P(two ‘g’) + P(two ‘t’)
= pa² + pc² + pg² + pt².
Call this probability p (the match probability).
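As a sketch, the match probability can be computed for any choice of nucleotide frequencies; the uniform values below are an illustrative assumption, not from the slides:

```python
# Hypothetical nucleotide frequencies; any probabilities summing to 1 work.
freq = {"a": 0.25, "c": 0.25, "g": 0.25, "t": 0.25}

# p = pa^2 + pc^2 + pg^2 + pt^2
p_match = sum(q ** 2 for q in freq.values())
print(p_match)  # 0.25 in the uniform case
```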
What is P (Y = k)?
We must have k matches and N − k mismatches. This
can happen in several different ways. Each one has
probability
p^k (1 − p)^(N−k).
We add the probabilities to obtain our answer:
P(Y = k) = p^k (1 − p)^(N−k) + · · · + p^k (1 − p)^(N−k).
How many terms do we add? The answer is
C(N, k) = N!/(k!(N − k)!),
the number of ways of choosing k positions from a total of N positions (‘N choose k’).
So:
P(Y = k) = C(N, k) p^k (1 − p)^(N−k),  for k = 0, . . . , N.
Any random variable with a probability mass function of this form is said to be binomially distributed with parameters N and p.
Think about N independent trials, each of which is either:
· a success (with probability p), or
· a failure (with probability 1 − p).
Let Z count the number of successes. Then Z has a binomial distribution with parameters N and p:
Z ∼ Bin(N, p).
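A minimal sketch of this probability mass function, using Python's standard-library `math.comb` for the binomial coefficient (the parameters below are illustrative):

```python
from math import comb

def binom_pmf(k, n, p):
    """P(Z = k) for Z ~ Bin(n, p)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

# Sanity check: the pmf sums to 1 over k = 0, ..., n.
total = sum(binom_pmf(k, 10, 0.25) for k in range(11))
print(total)  # very close to 1
```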
Think about our N independent trials again, each with
success probability p.
Let X be the trial on which we have our first success.
pX(k) = P(k − 1 failures, then a success) = (1 − p)^(k−1) p,
for k = 1, 2, . . ..
In this case, we say that X has a geometric distribution
with parameter p.
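A sketch of the geometric pmf; the success probability 0.3 is an arbitrary illustrative choice:

```python
def geom_pmf(k, p):
    """P(X = k): k - 1 failures followed by the first success."""
    return (1 - p) ** (k - 1) * p

# The probabilities over k = 1, 2, ... approach 1 as more terms are added.
partial = sum(geom_pmf(k, 0.3) for k in range(1, 100))
print(partial)  # approaches 1
```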
Yet another important probability distribution is the
uniform distribution.
Suppose we have N possible outcomes of our random
variable Z, each of which are equally likely. Then Z has
a uniform distribution, with
pZ(i) = 1/N,  for any i ∈ S.
For example, if we choose either a, c, g or t ‘uniformly
at random’, each has probability 0.25 of being chosen.
Of course, there are infinitely many different probability
distributions, but some occur time and time again in
applications.
As well as the binomial, uniform and geometric, these
include:
normal
Poisson
exponential
negative binomial
chi-square
Student’s t
beta
gamma
Events
Remember that S is our sample space. This is the set
of possible outcomes of our ‘experiment’.
An event A is something that either will or will not
occur in our ‘experiment’, so that A ⊆ S.
Examples: Roll a die once. Then S = {1, 2, 3, 4, 5, 6}.
The event A1 that ‘the number we roll is odd’ can be
written A1 = {1, 3, 5}.
The event A2 that the number we roll is at least three
can be written A2 = {3, 4, 5, 6}.
Suppose A, A1, A2 are events (subsets of our sample
space). We can use set operations to construct new
events.
• Ac : ‘A does not occur’ (complement)
Ac = {j ∈ S : j ∉ A}.
• A1 ∪ A2 : ‘either A1 or A2 occurs’ (union)
A1 ∪ A2 = {j ∈ S : j ∈ A1 or j ∈ A2}.
• A1 ∩ A2 : ‘both A1 and A2 occur’ (intersection)
A1 ∩ A2 = {j ∈ S : j ∈ A1 and j ∈ A2}.
Computing probabilities of events
Remember that 0 ≤ P (A) ≤ 1.
P (S) = 1,
P (Ac) = 1 − P (A),
P (A1 ∪ A2) = P (A1) + P (A2) − P (A1 ∩ A2).
Events A1 and A2 are called mutually exclusive if they
cannot happen together (the intersection A1 ∩ A2 is the
empty set). In this case
P (A1 ∪ A2) = P (A1) + P (A2).
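These rules can be checked on the die example from above; a sketch using exact fractions so no rounding intrudes:

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}
A1 = {1, 3, 5}       # the number rolled is odd
A2 = {3, 4, 5, 6}    # the number rolled is at least three

def prob(A):
    """Probability of an event under a fair die (uniform on S)."""
    return Fraction(len(A), len(S))

lhs = prob(A1 | A2)
rhs = prob(A1) + prob(A2) - prob(A1 & A2)
print(lhs, lhs == rhs)  # 5/6 True
```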
Conditional probability
Suppose we roll a fair die once. The probability of getting an odd number (either 1, 3 or 5) is 1/2.
But, suppose we are told that the number we rolled
was (strictly) bigger than 3. How does this affect the
probability?
We know the number rolled was either 4, 5 or 6. Only one of these is odd, so given our knowledge that we rolled a number bigger than 3, the probability we got an odd number is only 1/3.
P(odd number | number bigger than 3) = 1/3.
This is a conditional probability.
More generally, let A1 and A2 be two events, with
P (A2) > 0.
The conditional probability P (A1|A2) is defined to be
the probability that event A1 occurs, given that event
A2 occurs.
It can be calculated using the formula
P(A1|A2) = P(A1 ∩ A2) / P(A2).
Example
Consider our dice example again:
P(odd number | number bigger than 3)
= P(odd number bigger than 3) / P(number bigger than 3)
= P({5}) / P({4, 5, 6})
= 1/3.
Independence
Two events A1 and A2 are said to be independent if
P (A1 ∩ A2) = P (A1)P (A2).
This is equivalent to
P (A1|A2) = P (A1), and P (A2|A1) = P (A2).
That is, knowing whether A2 happened tells us nothing
about whether A1 happened (and vice versa).
Think of two random variables being independent if
knowing the value of one tells us nothing about the
value of the other.
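Independence can be verified directly from the definition; the two events below are my own illustrative choice, not from the slides:

```python
from fractions import Fraction

S = {1, 2, 3, 4, 5, 6}

def prob(A):
    """Probability of an event under a fair die (uniform on S)."""
    return Fraction(len(A), len(S))

A1 = {2, 4, 6}   # the number rolled is even
A2 = {1, 2}      # the number rolled is at most two

# Independent iff P(A1 ∩ A2) = P(A1) P(A2).
print(prob(A1 & A2) == prob(A1) * prob(A2))  # True: 1/6 = 1/2 · 1/3
```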
It is not always obvious when we have independence.
In most applications we have dependence:
independence is the exception, not the rule!
For example, two DNA sequences linked by evolution
are dependent.
Suppose we generate two random DNA sequences. Let
Y be the number of matches.
If the nucleotide in each position is independently chosen with probabilities pa, pc, pg and pt, we have already
seen that Y has a binomial distribution.
But, if there is dependence between the nucleotides
chosen, we no longer have a binomial distribution.
Why?
There could be lots of different types of dependence.
Consider an extreme example.
Choose the nucleotide in position 1 with probabilities
pa, pc, pg and pt. Then set the nucleotides in positions
2, . . . , N to be the same as that in position 1.
The only possible sequences are
aaa · · · aaa
ccc · · · ccc
ggg · · · ggg
ttt · · · ttt
So, the only possible numbers of matches are 0 and N.
Y cannot possibly have a binomial distribution.
Expectation and variance
With a random variable X (or a probability distribution)
we can associate some important quantities. These can
give us an idea of how the random variable behaves (its
location and spread).
• The expected value µ = E[X].
• The variance σ² = Var(X).
• The standard deviation σ = SD(X) = √Var(X).
Expected value
Suppose we generate two DNA sequences of length
N = 1000, with nucleotides chosen independently and
uniformly. That is,
pa = pc = pg = pt = 0.25
Let Y be the number of matches. From before, we
know that Y ∼ Bin(1000, 0.25).
How many matches do we expect to see ‘on average’?
Intuitively, it should be about 1000 × 0.25 = 250.
In fact, if Y ∼ Bin(N, p), we can show that its expected
value is E[Y ] = N p. This agrees with our intuitive
answer.
For any discrete random variable X with state space S,
we can define its expected value µ = E[X] by
E[X] = Σ_{i∈S} i · P(X = i).
Example: Roll a fair die and let X be the number we
see.
E[X] = 1 · (1/6) + 2 · (1/6) + · · · + 6 · (1/6) = 3.5.
Note that the expected value is not necessarily one of
the possible values of X.
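The die calculation above, sketched with exact arithmetic:

```python
from fractions import Fraction

# Fair die: pmf P(X = i) = 1/6 for i = 1, ..., 6.
pmf = {i: Fraction(1, 6) for i in range(1, 7)}

mean = sum(i * p for i, p in pmf.items())
print(float(mean))  # 3.5, not itself a possible value of X
```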
Example: If Y ∼ Bin(N, p) then
E[Y] = Σ_{i=0}^{N} i · C(N, i) p^i (1 − p)^(N−i) = N p.
One interpretation:
Repeat the experiment many times. Take independent
observations X1, . . . , Xn each with the same distribution
as X.
Take the mean of the results you obtain:
(1/n)(X1 + X2 + · · · + Xn).
As n gets large, this mean gets closer and closer to
E[X].
More precise statements can be made.
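A quick simulation illustrating this for the fair die; the sample size and seed are arbitrary choices of mine:

```python
import random

random.seed(1)  # fixed seed for reproducibility
n = 100_000
rolls = [random.randint(1, 6) for _ in range(n)]

sample_mean = sum(rolls) / n
print(sample_mean)  # close to E[X] = 3.5 for large n
```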
Properties of expectation
Expectation has a very nice linearity property:
Let X1, X2, . . . , Xn be random variables (dependent or
independent), and let c1, . . . , cn be real numbers. Then
E[c1X1 + · · · + cnXn] = c1E[X1] + · · · + cnE[Xn].
If X and Y are independent random variables then
E[XY] = E[X] · E[Y].
This last result is not true if X and Y are dependent.
Let X be a discrete random variable and let g be any
function.
We can define E[g(X)], the expected value of g(X):
E[g(X)] = Σ_{i∈S} g(i) P(X = i).
For example:
E[X²] = Σ_{i∈S} i² P(X = i).
There is a similar formula for X a continuous random
variable with probability density function fX (x):
E[g(X)] = ∫ g(x) fX(x) dx.
Warning:
E[g(X)] ≠ g(E[X]).
Suppose that X ∼ Bin(1000, 0.25). We know that
E[X] = 250. So, we expect observations around 250.
251 or 249 would not be surprising results, but what
about 240? 200? 170?
The expectation gives us a measure of location. It is
also useful to have a measure of spread.
How much variability do we expect in our observations
of X?
For a random variable X, we define the variance of X, σ² = Var(X):
Var(X) = E[(X − E[X])²] = E[X²] − E[X]².
The second form is usually easier for calculations.
The standard deviation of X, σ, is defined by
σ = SD(X) = √Var(X).
A high variance indicates a high deviation from the
mean.
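Both forms of the variance, checked on a fair die with exact fractions (a sketch):

```python
from fractions import Fraction

# Fair die pmf.
pmf = {i: Fraction(1, 6) for i in range(1, 7)}

EX = sum(i * p for i, p in pmf.items())
EX2 = sum(i * i * p for i, p in pmf.items())

var_def = sum((i - EX) ** 2 * p for i, p in pmf.items())  # E[(X - E[X])^2]
var_short = EX2 - EX ** 2                                 # E[X^2] - E[X]^2

print(var_def == var_short, var_short)  # True 35/12
```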
The term (X − E[X])² is the squared distance between X and its expected value.
In some sense, σ is the ‘average deviation of X from its
mean’.
Properties of variance
For any random variable X and constants a and b:
Var(aX + b) = a2Var(X).
If X and Y are independent random variables then
Var(X + Y ) = Var(X) + Var(Y ).
If X and Y are dependent random variables then
Var(X + Y ) = Var(X) + Var(Y ) + 2 · Cov(X, Y ).
Covariance
We define the covariance between random variables X
and Y :
Cov(X, Y) = E[(X − E[X]) · (Y − E[Y])] = E[XY] − E[X]E[Y].
Covariance measures the linear dependence between X
and Y .
Properties:
• If X and Y are independent then Cov(X, Y ) = 0.
• Cov(X, X) = Var(X).
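The second property can be checked on a small example; here Y = X for a fair die (a deliberately dependent pair), so Cov(X, Y) should equal Var(X):

```python
from fractions import Fraction

pmf = {i: Fraction(1, 6) for i in range(1, 7)}  # fair die

EX = sum(i * p for i, p in pmf.items())
EXY = sum(i * i * p for i, p in pmf.items())    # E[XY] with Y = X
cov = EXY - EX * EX                             # Cov(X, X) = E[XY] - E[X]E[Y]

var = sum((i - EX) ** 2 * p for i, p in pmf.items())
print(cov == var)  # True: Cov(X, X) = Var(X)
```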