PROBABILITY AND LIKELIHOOD: A BRIEF INTRODUCTION IN SUPPORT OF A
COURSE ON MOLECULAR EVOLUTION (BIOL 3046)
Probability
The subject of PROBABILITY is a branch of mathematics dedicated to building models to
describe conditions of uncertainty and providing tools to make decisions or draw
conclusions on the basis of such models.
In the broad sense, a PROBABILITY is a measure of the degree to which an occurrence
is certain [or uncertain].
A statistical definition of probability
People have thought about, and defined, probability in different ways. It is
important to note the consequences of the definition:
1. All definitions agree on the algebraic and arithmetic procedures that must be
followed; hence, the definition does not influence the outcome.
2. The definition has a fundamental impact on the meaning of the result!
We will consider the frequentist definition of probability, as it is currently the most
widely held. To do this we need to define two concepts: (i) sample space, and
(ii) relative frequency.
1. Sample space, S, is the collection [sometimes called universe] of all possible
outcomes. For a stochastic system, or an experiment, the sample space is a
set where each outcome comprises one element of the set.
2. Relative frequency is the proportion of the sample space on which an event E
occurs. In an experiment with 100 outcomes in which E occurs 81 times, the
relative frequency is 81/100 = 0.81.
The frequentist approach is based on the notion of statistical regularity; i.e., in the
long run, over replicates, the cumulative relative frequency of an event (E) stabilizes.
The best way to illustrate this is with an example experiment that we run many
times and measure the cumulative relative frequency (crf). The crf is simply the
relative frequency computed cumulatively over some number of replicates of
samples, each with a space S.
Let’s take a look at an example of statistical regularity.
Suppose we have a treatment for high blood pressure. The event, E, we are
interested in is successfully controlling the blood pressure. So, we want to be able to
make a prediction about the probability that a patient treated in the future will have
blood pressure under control, P(E). To estimate this probability we conduct an
experiment that is replicated over time in months. The data are presented in the
table below.
Month   Number of      Number           Cumulative S   Cumulative E    crf
        subjects (S)   controlled (E)
  1          100             80               100             80       0.800
  2          100             88               200            168       0.840
  3          100             75               300            243       0.810
  4          100             77               400            320       0.800
  5          100             80               500            400       0.800
  6          100             76               600            476       0.793
  7          100             82               700            558       0.797
  8          100             79               800            637       0.796
  9          100             80               900            717       0.797
 10          100             76              1000            793       0.793
 11          100             77              1100            870       0.791
 12          100             78              1200            948       0.790

[data for example is after McColl (1995)]
The crf values down the right-most column fluctuate the most in the beginning, but
rapidly stabilize. Statistical regularity is the stabilization of the crf in the face of
individual fluctuations from month to month in the relative frequency of E.
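For concreteness, here is a minimal Python sketch (not part of the original notes) that reproduces the cumulative columns and the crf from the monthly counts in the table above:

```python
# Monthly counts from the blood-pressure example (after McColl 1995).
subjects_per_month = [100] * 12
controlled_per_month = [80, 88, 75, 77, 80, 76, 82, 79, 80, 76, 77, 78]

cum_S = cum_E = 0
for month, (S, E) in enumerate(zip(subjects_per_month, controlled_per_month), start=1):
    cum_S += S                # cumulative number of subjects
    cum_E += E                # cumulative number of controlled cases (event E)
    crf = cum_E / cum_S       # cumulative relative frequency of E
    print(f"Month {month:2d}: cumulative S = {cum_S:4d}, cumulative E = {cum_E:3d}, crf = {crf:.3f}")
```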
Finally, we are in a position where we can obtain a definition of probability. Here
goes: In words, the probability of an event E, written as P(E), is the long run
(cumulative) relative frequency of E. More formally we define P(E) as follows:
\[
P(E) \;=\; \lim_{n \to \infty} \mathrm{crf}_n(E)
\]
We can get an idea of this by using an example with “nearly infinite” replications.
[Figure: hypothetical plot of the crf of an event over 0 to 10,000 replicates, with the
crf on the y-axis (0 to 1) and the number of replicates on the x-axis.]
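To see the "nearly infinite" case in action, here is a hypothetical Python simulation (the assumed true success probability of 0.79, roughly matching the treatment example, is my choice and not from the notes) that tracks the running crf over 10,000 replicates:

```python
import random

random.seed(1)                  # for a reproducible illustration
true_p = 0.79                   # assumed true probability of the event E (hypothetical)
successes = 0
for n in range(1, 10_001):
    successes += random.random() < true_p      # one Bernoulli replicate
    if n in (10, 100, 1_000, 10_000):
        print(f"n = {n:>6}: crf = {successes / n:.3f}")
```

The early values fluctuate, but by n = 10,000 the running crf has settled close to the true value, which is the statistical regularity described above.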
Probability models
For all probability models to give consistent results about the outcomes of future
events they need to obey four simple axioms (Kolmogorov 1933).
Probability axioms:
1. Probabilities lie on a scale from 0 to 1; hence, 0 ≤ P(E) ≤ 1.
2. Probabilities are derived from a relative frequency of an event (E) in the
“space of all possible outcomes” (S), where P(S) = 1. Hence, if the probability
of an event (E) is P(E), then the probability that E does not occur is 1 – P(E).
3. When events E and F are disjoint, they cannot occur together. For disjoint
events, P(E or F) = P(E) + P(F).
4. Axiom 3 above deals with a finite sequence of events. Axiom 4 is an
extension of axiom 3 to an infinite sequence of events.
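As a small illustration (not part of the original notes), the following Python sketch checks axioms 1 to 3 and the complement rule on an assumed discrete sample space: the number of heads in two tosses of a fair coin.

```python
from fractions import Fraction

# Sample space: number of heads in two tosses of a fair coin, with exact probabilities.
S = {0: Fraction(1, 4), 1: Fraction(1, 2), 2: Fraction(1, 4)}

def P(event):
    """Probability of an event, i.e. a subset of the sample space."""
    return sum(S[outcome] for outcome in event)

assert all(0 <= p <= 1 for p in S.values())   # axiom 1: 0 <= P(E) <= 1
assert P(S) == 1                              # axiom 2: P(S) = 1
assert P({0, 1}) == 1 - P({2})                # complement rule: P(not E) = 1 - P(E)
assert P({0, 2}) == P({0}) + P({2})           # axiom 3: disjoint events add
print("All checks pass.")
```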
For the purpose of modelling in molecular evolution, we need to assume these
probability axioms and just one additional theorem, the multiplication theorem. I
will not provide a detailed explanation of this theorem. However, a consequence of
this theorem is what is sometimes referred to as the “product rule” or “multiplication
rule”; see the box below for an explanation.
Product rule:
The product rule applies when two events E1 and E2 are independent. E1 and
E2 are independent if the occurrence or non-occurrence of E1 does not change
the probability of E2 [and vice versa]. [A further statistical definition requires the
use of the multiplication theorem]
It is important to note that a proof of statistical independence for a specific case
by using the multiplication theorem is rarely possible; hence, most models
incorporate independence as a model assumption.
When E1 and E2 occur together they are joint events. The joint probability of the
independent events E1 and E2 is P(E1, E2) = P(E1) × P(E2); hence the term
“product rule” or “multiplication principle” (a short numerical check follows below).

Typically, probability refers to the occurrence of some future event:
• For example, the probability that a tossed [fair] coin will be heads is ½.
• What is the probability of getting 5H and 6T if the coin is fair?
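Returning to the product rule, here is a minimal Python check (not part of the original notes) that enumerates the sample space of two independent tosses of a fair coin and confirms that the joint probability of two independent events equals the product of their probabilities:

```python
from itertools import product

outcomes = list(product("HT", repeat=2))      # sample space: 4 equally likely outcomes

def P(event):
    """Probability of an event defined as a predicate on outcomes."""
    return sum(1 for o in outcomes if event(o)) / len(outcomes)

E1 = lambda o: o[0] == "H"                    # first toss is heads
E2 = lambda o: o[1] == "H"                    # second toss is heads

print(P(E1), P(E2), P(lambda o: E1(o) and E2(o)))     # 0.5 0.5 0.25
assert P(lambda o: E1(o) and E2(o)) == P(E1) * P(E2)  # product rule for independent events
```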
Conditional probability is very useful as it allows us to express a probability given
some further information; specifically, it is the probability of event E2 assuming that
event E1 has already occurred. We assume the E1 and E2 events are in a given
sample space, S, and P(E1) > 0. We write the conditional probability as P(E2|E1);
the vertical bar is read as “given”.
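Although the notes do not spell out the formula, the standard definition is that P(E2|E1) equals the joint probability of E1 and E2 divided by P(E1). As a worked example with two tosses of a fair coin, take E1 = "at least one head" and E2 = "both tosses are heads":

\[
P(E_2 \mid E_1) \;=\; \frac{P(E_1 \text{ and } E_2)}{P(E_1)} \;=\; \frac{1/4}{3/4} \;=\; \frac{1}{3}.
\]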
Let’s look at an example of a probability model. The familiar binomial distribution
provides the appropriate model for describing the probability of the outcomes of
flipping a coin. The binomial model is as follows:
\[
P \;=\; \binom{n}{k}\, p^{k}\,(1-p)^{\,n-k},
\qquad
\binom{n}{k} \;=\; \frac{n!}{k!\,(n-k)!}
\]
If we had a fair coin we could predict the probability of specific outcomes (e.g., 1
head & 1 tail in two tosses) by setting the p parameter equal to 0.5. Note that the
model does not require this. In the case of the coin toss, we are interested in a
conditional probability; i.e., what is the probability of obtaining, say, 5 heads given a
fair coin (p = 0.5) and 12 tosses, or P(k=5 | p=0.5, n=12).
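A minimal Python sketch of this computation (the helper name binom_pmf is just illustrative):

```python
from math import comb

def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials with success probability p."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

# P(k=5 | p=0.5, n=12)
print(binom_pmf(5, 12, 0.5))   # 0.193359375
```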
Probability and likelihood are inverted
Probability refers to the occurrence of some future outcome.
• For example: “If I toss a fair coin 12 times, what is the probability that I
will obtain 5 heads and 7 tails?”
Likelihood refers to a past event with a known outcome.
• For example: “What is the probability that my coin is fair if I tossed it 12
times and observed 5 heads and 7 tails?”
Let’s continue to use the familiar coin tossing experiment to examine this inversion.
\[
P \;=\; \binom{n}{k}\left(\tfrac{1}{2}\right)^{k}\left(\tfrac{1}{2}\right)^{n-k},
\qquad
\binom{n}{k} \;=\; \frac{n!}{k!\,(n-k)!}
\]
where n is the number of flips and k is the number of successes.
CASE 1: PROBABILITY.
The question is the same: “If I toss a fair coin 12 times, what is the probability that I
will obtain 5 heads and 7 tails?”
The answer comes directly from the above formula where n = 12, and k = 5. The
probability of such a future event is 0.193359.
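For reference, the arithmetic behind this value:

\[
P(k=5 \mid p=0.5,\; n=12) \;=\; \binom{12}{5}\left(\tfrac{1}{2}\right)^{5}\left(\tfrac{1}{2}\right)^{7}
\;=\; \frac{792}{4096} \;\approx\; 0.193359.
\]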
From the probability perspective we can look at the distribution of all possible
outcomes.
[Figure: the binomial distribution of the number of heads in 12 tosses of a fair coin;
our outcome of 5 heads & 7 tails has probability 0.1933.]
This is the distribution of mutually exclusive outcomes that comprise the set of all
possible outcomes under the model where p = 0.5. Remember probability axiom 2
where P(S) = 1; the probabilities of all outcomes (i.e., 0 to 12 heads) sum to 1.
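A quick Python check of this point (using the same illustrative binom_pmf helper as above):

```python
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

# The mutually exclusive outcomes 0 to 12 heads under p = 0.5 sum to 1.
probs = [binom_pmf(k, 12, 0.5) for k in range(13)]
print(round(probs[5], 4))   # 0.1934, the probability of 5 heads (and 7 tails)
print(sum(probs))           # 1.0
```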
CASE 2: LIKELIHOOD.
The second question is: “What is the probability that my coin is fair if I tossed it 12
times and observed 5 heads and 7 tails?”
We have inverted the problem. In the previous case (1) we were interested in the
probability of a future outcome given that my coin is fair. In this case (2) we are
interested in the probability that my coin is fair, given a particular outcome.
So, in the likelihood framework we have inverted the question such that the
hypothesis (H) is variable, and the outcome (let’s call it the data, D) is constant.
A problem: What we want to measure is P(H|D). The problem is that we can’t work
with the probability of a hypothesis, only the relative frequencies of outcomes. The
solution comes from the knowledge that there is a relationship between P(H|D) and
P(D|H):
\[
P(H \mid D) \;=\; \alpha \, P(D \mid H)
\]
where α is the constant of proportionality.
The likelihood of the hypothesis given the data, L(H|D), is proportional to the
probability of the data given the hypothesis, P(D|H). As long as we stick to
comparing hypotheses on the same data and probability model, the constant remains
the same, and we can compare the likelihood scores. We cannot make comparisons
on different data using likelihoods.
Just remember: with likelihoods, the hypotheses are the variables!
Let’s use the binomial model to look at the application of probability as compared
with likelihood.
PROBABILITIES
                              Data
Hypotheses            D1: 1H & 1T        D2: 2H
H1: p(H) = 1/4        0.375              0.0625
H2: p(H) = 1/2        0.5                0.25
Following the probability axioms, and as we saw in the binomial distribution above,
given a single hypothesis (i.e., H2: p(H) = 0.5), the different outcomes can be
summed. For example, P(D1 or D2|H2) = P(D1|H2) + P(D2|H2), a well-known
result, with all possible outcomes summing to 1. However, we cannot use the
addition axiom over different hypotheses H1 and H2; i.e., P(D1|H1) and P(D2|H2)
cannot simply be added to give a probability.
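The table values can be reproduced with the binomial formula for two tosses (n = 2); a minimal Python sketch (helper name illustrative):

```python
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

# D1 = "1 head & 1 tail" (k = 1), D2 = "2 heads" (k = 2), for each hypothesis about p(H).
for label, p in [("H1: p(H) = 1/4", 0.25), ("H2: p(H) = 1/2", 0.5)]:
    print(label, binom_pmf(1, 2, p), binom_pmf(2, 2, p))
# H1: p(H) = 1/4 0.375 0.0625
# H2: p(H) = 1/2 0.5 0.25

# Within a single hypothesis (H2), all possible outcomes sum to 1:
print(sum(binom_pmf(k, 2, 0.5) for k in range(3)))   # 1.0
```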
LIKELIHOODS
                              Data
Hypotheses            D1: 1H & 1T        D2: 2H
H1: p(H) = 1/4        α1 × 0.375         α2 × 0.0625
H2: p(H) = 1/2        α1 × 0.5           α2 × 0.25
Under likelihood we can work with different hypotheses as long as we stick to the
same dataset. Take the likelihoods of H1 and H2 under D1: we can infer that
H1 is ¾ as likely as H2. Note that when working with likelihoods, we compute
the probabilities and drop the constant for convenience. The likelihoods do not
sum to 1 because the probability terms are for the same outcome drawn from
different distributions [the probabilities for the total set of outcomes S in the same
distribution sum to 1].
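A short Python check of the comparison above; on the same data D1 the unknown constant α1 cancels in the ratio:

```python
from math import comb

def binom_pmf(k, n, p):
    return comb(n, k) * p**k * (1 - p)**(n - k)

L_H1 = binom_pmf(1, 2, 0.25)   # proportional to L(H1 | D1); constant dropped
L_H2 = binom_pmf(1, 2, 0.5)    # proportional to L(H2 | D1); constant dropped
print(L_H1 / L_H2)             # 0.75, i.e. H1 is 3/4 as likely as H2 on these data
```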
An example of Likelihood in action
Let’s use likelihood to follow through on our question of the probability that the coin
is fair given 12 tosses with 5 heads and 7 tails. As always our tosses are
independent.
The L(p=0.5 | n=12, k=5) = α × P(k=5 | p=0.5, n=12)
[it’s easy to use the binomial formula to get the probability term]
L = α × 0.193
[we drop the constant for convenience]
L = 0.193
Perhaps there is an alternative hypothesis, i.e., where p ≠ 0.5, that has a higher
likelihood. To explore this possibility we take the binomial formula as our likelihood
function and evaluate the resulting likelihoods with respect to various values of p and
the given data. The results can be plotted as a curve; this curve is sometimes called
the likelihood surface. The curve for our data (12,5) is shown below.
[Figure: the likelihood surface for p given the data (12 tosses, 5 heads), with the
likelihood (y-axis, 0 to 0.25) plotted against p (x-axis, 0 to 1). Maximum likelihood
score = 0.228 at the ML estimate of p = 0.42.]
IMPORTANT NOTE: It looks like a distribution, but don’t be fooled: the area under the
curve does not equal 1. The curve reflects the probabilities of different values of p
(a parameter of the model) under the same data, and these are not mutually
exclusive outcomes within a single set of all the possible outcomes.
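Finally, a minimal Python sketch (a simple grid search, not necessarily how the figure was produced) that evaluates the binomial likelihood over values of p for the data (12 tosses, 5 heads) and recovers the maximum likelihood score and ML estimate quoted above:

```python
from math import comb

n, k = 12, 5                                       # the observed data
likelihood = lambda p: comb(n, k) * p**k * (1 - p)**(n - k)

grid = [i / 1000 for i in range(1001)]             # candidate values of p from 0 to 1
p_hat = max(grid, key=likelihood)
print(p_hat, likelihood(p_hat))   # about 0.417 (i.e. 5/12) and about 0.2286,
                                  # consistent with the 0.42 and 0.228 quoted above
```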