4 Conditional Probability
If we have partial information, this effectively shrinks the available sample space and changes the probabilities. We begin with some
examples.
Lost keys. Joe is 80% certain that his missing key is in one of the two pockets of his jacket: he is 40% certain that it is in the left pocket and 40% certain that it is in the right pocket. He checks his left pocket and does not find the key. What is the probability that it is in the right pocket?
Solution: Let R and L be the events that the key is in the right, resp. left pocket, then P(R) = P(L) = 0.4. We can also introduce the complement event N, when the key is in neither pocket; then P(N) = 0.2 and the three events R, L and N form a complete partition of the probability space Ω. The event we are looking at is R, but under the condition that we already know that we are on L^c = Ω \ L, the complement of L. So we need to compute the ratio
P(R) / P(L^c) = 0.4 / (1 − 0.4) = 2/3.
It is important to understand precisely what has changed that yielded a different probability. Although the common language hints that the probability of R has changed, in fact P(R) remained 0.4. What really happened is that the additional information ("not in the left pocket") has changed the probability measure on Ω. So it would be more correct to talk about two different probability measures on the same Ω. The original one ("a-priori"), denoted by P, assigns the following numbers to the three basic events:
P(R) = P(L) = 0.4,   P(N) = 0.2.
The information that L does not hold changes the probability measure on the whole space. We will denote this "a-posteriori" measure by Q, and we know that Q(L) = 0. What happens to the other probabilities? Now the whole probability is concentrated on the remaining two events, R and N, and the assumption is that they remain proportionally distributed: the ratio Q(R) : Q(N) should remain the same as before, P(R) : P(N), but now they have to add up to one. This results in
Q(R) = P(R) / P(L^c),   Q(N) = P(N) / P(L^c).
The new probability measure Q is called the conditional probability conditioned on the event L^c and it is often denoted by
Q(A) = P(A | L^c),   A ⊆ Ω.
It is easy to check that Q satisfies the basic properties of the probability, most importantly Q(R) + Q(N ) + Q(L) = 1.
One may discuss whether the probability space has also changed with conditioning (i.e. the new space consists of only the events R and N) or whether one keeps the original space that also includes L but assigns zero probability to it, Q(L) = 0. This is an irrelevant academic question; both interpretations are correct. As long as the new probability measure Q assigns zero probability to L, it does not matter any more in any probabilistic argument.
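As a quick numerical illustration (a minimal sketch of our own, not part of the original argument; the dictionary representation and the helper name are arbitrary choices), the renormalization that turns P into Q can be written in a few lines of Python:

# A-priori measure P on the three basic events of the lost-keys example.
P = {"R": 0.4, "L": 0.4, "N": 0.2}

def condition_on_not(P, excluded):
    """A-posteriori measure Q given that the excluded event did not occur.

    The excluded event gets probability zero; the remaining events keep their
    proportions and are rescaled so that the total is one again.
    """
    remaining = 1.0 - P[excluded]
    return {e: (0.0 if e == excluded else p / remaining) for e, p in P.items()}

Q = condition_on_not(P, "L")
print(Q)                 # {'R': 0.666..., 'L': 0.0, 'N': 0.333...}
print(sum(Q.values()))   # 1.0, as required of a probability measure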
The above example was somewhat special since the two events L and R were disjoint. Let's see an example where the event that we are interested in and the event on which we condition are not disjoint.
Coin flips. A coin is flipped twice, and assume that the four outcomes (head or tail for either flip) in the total sample space Ω = {(h, h), (h, t), (t, h), (t, t)} are equally likely:
P{(h, h)} = P{(h, t)} = P{(t, h)} = P{(t, t)} = 1/4.
What is the conditional probability that we have exactly one tail and one head, given that (a) the first flip is head; (b) at least one flip is
head?
Solution. In case (a), we have Qa {(t, h)} = Qa {(t, t)} = 0 and the total probability is equally shared on the remaining two
elementary events (since their proportion was equal under P as well), thus
Qa{(h, h)} = Qa{(h, t)} = 1/2.
The probability in question is
Qa{(h, t), (t, h)} = 1/2 + 0 = 1/2.
In case (b), we have Qb{(t, t)} = 0, so the total probability is shared on three events, and again equally:
Qb{(h, h)} = Qb{(h, t)} = Qb{(t, h)} = 1/3.
The probability in question is
Qb{(h, t), (t, h)} = 1/3 + 1/3 = 2/3.
While the outcome of the first conditioning is natural (if the first flip is surely head, the second must be tail and not head to get what we wanted, and this has exactly 1/2 probability), the second one might sound somewhat surprising. We might argue that if at least one flip is head then the other one can be either head or tail, "obviously with equal chance", and only the second case gives what we wanted, so we should again get 1/2. The error we make in this argument is that if at least one flip is head, then among the three possible elementary events (namely (h, h), (t, h), (h, t)) there are two that give what we wanted against only one that does not. So they do not have "equal chance". In other words, the statement in quotation marks above was wrong. This is a very common source of error in applied probability. The safest way is to go back to the elementary events (individual elements of Ω) and use the principle that the ratios of probabilities of elementary events that are compatible with the conditioning remain unchanged.
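The same bookkeeping can be carried out by brute force over the four elementary events. The following Python sketch is our own illustration (the helper name is arbitrary); it conditions the uniform measure on each of the two events and recovers 1/2 and 2/3.

from fractions import Fraction

# Uniform a-priori measure on the four elementary outcomes of two coin flips.
omega = [("h", "h"), ("h", "t"), ("t", "h"), ("t", "t")]
P = {w: Fraction(1, 4) for w in omega}

def conditional(event, given):
    """P(event | given), where both arguments are sets of elementary outcomes."""
    return sum(P[w] for w in event & given) / sum(P[w] for w in given)

exactly_one_head  = {("h", "t"), ("t", "h")}
first_is_head     = {w for w in omega if w[0] == "h"}
at_least_one_head = {w for w in omega if "h" in w}

print(conditional(exactly_one_head, first_is_head))      # 1/2, case (a)
print(conditional(exactly_one_head, at_least_one_head))  # 2/3, case (b)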
Monty Hall show. The setting is a game show in which a prize is hidden behind one of three curtains. Call the curtains X, Y, and Z. You can win the prize by guessing the right curtain.
Step 1. You choose a curtain.
This leaves two curtains you did not choose, and at least one of them does not hide the prize. Monty Hall opens such a curtain, demonstrating that there is no prize hidden there. Then he asks whether you would like to reconsider. Would you?
Step 2a. You stick to your initial choice.
Step 2b. You change to the other available curtain.
Perhaps surprisingly, Step 2b is the better strategy. As shown in Figure 6, it doubles your chance to win the prize.
Figure 6: Suppose the prize is behind curtain X. The chance of winning improves from 1/3 in 2a to 2/3 in 2b.
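A short Monte Carlo experiment (a sketch of our own; the function name and trial count are arbitrary) reproduces the 1/3 versus 2/3 split shown in Figure 6.

import random

def monty_hall(switch, trials=100_000):
    """Estimate the winning probability when sticking (switch=False) or switching (switch=True)."""
    wins = 0
    curtains = ["X", "Y", "Z"]
    for _ in range(trials):
        prize  = random.choice(curtains)
        choice = random.choice(curtains)
        # Monty opens a curtain that is neither the player's choice nor the prize.
        opened = random.choice([c for c in curtains if c != choice and c != prize])
        if switch:
            choice = next(c for c in curtains if c != choice and c != opened)
        wins += (choice == prize)
    return wins / trials

print(monty_hall(switch=False))  # about 1/3, strategy 2a
print(monty_hall(switch=True))   # about 2/3, strategy 2b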
Formalization and Bayes' Rule. We are given a sample space, Ω, and consider two events, A, B ⊆ Ω. The conditional probability of event A given event B is
P(A | B) = P(A ∩ B) / P(B),
but we will often write P(AB) for P(A ∩ B). We illustrate this definition in Figure 7. If we know that the outcome of the experiment is in B, the chance that it is also in A is the fraction of B occupied by A ∩ B. Notice that in general
P(A | B) / P(A′ | B) ≠ P(A) / P(A′),
but equality holds if both A and A′ are included in B. In particular, this holds for any elementary events A, A′ within B.
Figure 7: Assuming B, the probability of A is represented by the fraction of the shaded region, B, that is dark shaded, A ∩ B.
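The remark about ratios can be checked on the two-coin sample space from before; the short computation below is our own illustration (the event names are arbitrary), showing that the conditional ratio agrees with the unconditional one exactly when both events lie inside B.

# Uniform measure on the two-coin sample space from the earlier example.
P = {("h", "h"): 0.25, ("h", "t"): 0.25, ("t", "h"): 0.25, ("t", "t"): 0.25}

def prob(event):
    return sum(P[w] for w in event)

def cond(A, B):
    """P(A | B) = P(A ∩ B) / P(B)."""
    return prob(A & B) / prob(B)

B  = {("h", "h"), ("h", "t"), ("t", "h")}   # at least one head
A1 = {("h", "t"), ("t", "h")}               # exactly one head, A1 ⊆ B
A2 = {("h", "h")}                           # two heads, A2 ⊆ B
A3 = {("h", "h"), ("t", "t")}               # both flips equal, not contained in B

# Both events inside B: the conditional ratio equals the unconditional ratio (2 : 1).
print(cond(A1, B) / cond(A2, B), prob(A1) / prob(A2))   # 2.0 2.0
# A3 is not inside B: the two ratios differ.
print(cond(A1, B) / cond(A3, B), prob(A1) / prob(A3))   # 2.0 1.0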
Alternatively, we can write
P (AB) = P (A|B) · P (B),
P (ABC) = P (A|BC) · P (B|C) · P (C),
where in the second line B is conditioned on C, etc. Next, we rewrite the formula in a form that is often useful. Suppose Ω = H1 ⊔ H2 ⊔ . . . ⊔ Hn is a partition of the sample space. Then
P(A) = P(AH1) + P(AH2) + . . . + P(AHn) = ∑_{i=1}^{n} P(A | Hi) · P(Hi).
Now fix an integer k and note that the probability of Hk given A is P(Hk | A) = P(Hk A)/P(A). Of course, P(Hk A) = P(AHk), so we can write
P(Hk | A) = P(A | Hk) · P(Hk) / ∑_{i=1}^{n} P(A | Hi) · P(Hi),
which is usually referred to as Bayes' Rule. It is often used in applications, for example for selecting among competing interpretations of data. The following simple example shows a typical application.
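Both displayed formulas translate directly into code; the generic sketch below is our own (the function names are hypothetical), with prior[i] = P(Hi) and likelihood[i] = P(A | Hi) for a partition H1, . . . , Hn.

def total_probability(prior, likelihood):
    """P(A) = sum_i P(A | Hi) * P(Hi)."""
    return sum(l * p for l, p in zip(likelihood, prior))

def bayes(prior, likelihood, k):
    """P(Hk | A) = P(A | Hk) * P(Hk) / sum_i P(A | Hi) * P(Hi)."""
    return likelihood[k] * prior[k] / total_probability(prior, likelihood)

# Sanity check with arbitrary numbers: the posteriors over a partition sum to one.
prior, likelihood = [0.5, 0.3, 0.2], [0.1, 0.5, 0.9]
print(sum(bayes(prior, likelihood, k) for k in range(3)))   # 1.0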
Insurances and accidents. An insurance company classifies drivers into two groups: reckless and careful. Their long-term statistics show that the probability that a reckless driver has an accident within a year is 0.4, while for a careful driver this is only 0.2. They also know that 30% of the population is reckless.
(i) What is the probability that Ann, a new policyholder, will have an accident within a year?
(ii) Suppose Ann has an accident in the first year. What is the probability that she is reckless?
Solution. Let A be the event that Ann has an accident in the first year and let H1 and H2 be the events that she is reckless, resp. careful. Clearly H1 ⊔ H2 is a partition of the sample space. Using the formula above,
P(A) = P(A | H1) · P(H1) + P(A | H2) · P(H2) = (0.4)(0.3) + (0.2)(0.7) = 0.26
is the answer to part (i).
For part (ii) we compute
P(H1 | A) = P(A | H1) · P(H1) / P(A) = (0.3)(0.4) / 0.26 = 6/13 ≈ 46%.
This argument is a basic deduction in statistics, called Bayesian inference. A-priori the insurance company had no information on Ann at all, so when it tried to assess whether Ann is reckless or not, it could not do better than use the large population data, i.e. Ann, as anybody else without any prehistory, had a 30% chance to be reckless. However, having an accident changed, a-posteriori, the probability that Ann is reckless; in fact, it increased it from 30% to 46%. While it is common sense to understand that this probability goes up due to the accident, it requires a Bayesian argument to quantify it (and hence readjust the yearly premium!).
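In code, the two answers follow from the same two formulas (a small sketch of our own, plugging in the numbers of the insurance example):

# Partition: H1 = reckless, H2 = careful.
prior      = {"reckless": 0.3, "careful": 0.7}   # P(Hi)
p_accident = {"reckless": 0.4, "careful": 0.2}   # P(A | Hi)

# Part (i): law of total probability.
p_a = sum(p_accident[h] * prior[h] for h in prior)
print(p_a)   # 0.26

# Part (ii): a-posteriori probability that Ann is reckless, given an accident.
print(p_accident["reckless"] * prior["reckless"] / p_a)   # 0.4615... ≈ 46%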
Bayesian arguments are often subject to criticism, since their applicability intertwines rigorous mathematical arguments with interpretations. For example, in the insurance problem above, we implicitly assumed that being reckless or careful is a permanent attribute of Ann. In some applications this attribute is indeed objectively determined, but certainly not in the insurance example, as it excludes the scenario that after her accident Ann may change her driving habits. This is especially problematic if the insurance company uses its calculation to penalize Ann for the future (as formally an increased premium does) and not to penalize her for the past.
But the Bayesian approach is not without criticism for penalties of past actions either. In principle, Bayesian inference can be used in courts to accumulate evidence for or against a defendant, cumulatively evaluating various pieces of evidence to establish the verdict "beyond a reasonable doubt". However, the Bayesian approach starts with the assumption that a priori everyone shares equal guilt, and evidence is then used to update this probability. This may lead to a completely wrong verdict if the set of evidence is wrongly chosen (e.g. evidence that has no relevance to the crime). Moreover, the procedure assumes that the a priori probabilities are known, which depends on the incidence of the crime (and could be quite unreliable due to a small data set, depending on the crime and other circumstances). Interested students are encouraged to read about the Lindley paradox.
Law of succession. Consider a scenario in which we have n + 1 urns with n balls each. For k from 0 to n, the k-th urn contains k red balls and n − k blue balls.
Step 1. Select an urn randomly.
Step 2. Draw m balls with replacement from the selected urn.
Let H be the event that all m balls are red. Now we draw another ball from the same urn, and we ask what is the probability that the (m + 1)-st ball is also red. To analyze this question, we write A for the event that all m + 1 balls are red, so we are interested in P(A | H). By definition, we have P(A | H) = P(AH)/P(H) = P(A)/P(H). Assuming we select the k-th urn, the probability of drawing m red balls in sequence is (k/n)^m. Hence,
P(H) = (1^m + 2^m + . . . + n^m) / (n^m (n + 1)) ≈ (1 / (n^m (n + 1))) ∫_0^n x^m dx = n / ((n + 1)(m + 1)).
Assuming n is very large, we feel justified to take the limit, which is lim_{n→∞} P(H) = 1/(m + 1). Similarly, lim_{n→∞} P(A) = 1/(m + 2). Hence,
lim_{n→∞} P(A | H) = (m + 1)/(m + 2).
This is called the Law of Succession by Laplace (1812), in particular in the case in which the n + 1 urns are replaced by one urn whose fraction of red versus blue balls is unknown, so we assume this fraction is randomly distributed as explained.
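For a finite number of urns the probabilities are easy to evaluate exactly; the sketch below is our own check, with illustrative values n = 1000 and m = 5, showing P(A | H) close to Laplace's limit (m + 1)/(m + 2).

def p_all_red(n, m):
    """P(H): m draws with replacement from a uniformly chosen urn are all red."""
    return sum((k / n) ** m for k in range(n + 1)) / (n + 1)

n, m = 1000, 5
p_h = p_all_red(n, m)        # P(H): the first m draws are all red
p_a = p_all_red(n, m + 1)    # P(A): the first m + 1 draws are all red
print(p_a / p_h)             # P(A | H) = P(A)/P(H), close to the limit below
print((m + 1) / (m + 2))     # 6/7 ≈ 0.857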
Independent events and trial processes. We say A and B are independent if knowing B does not change the probability of A,
that is,
P (A | B) = P (A).
Since P(A | B) = P(A ∩ B)/P(B) = P(A), we have
P(B) = P(B ∩ A)/P(A) = P(B | A).
We thus see that independence is symmetric. Combining the definition of conditional probability with the condition of independence,
we get a formula for the probability of two events occurring at the same time: if A and B are independent then
P(A ∩ B) = P(A) · P(B).
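The product rule is easy to verify on the two-coin sample space from the earlier example; the check below is our own illustration.

# Uniform measure on two fair coin flips, as in the earlier example.
omega = [("h", "h"), ("h", "t"), ("t", "h"), ("t", "t")]

def prob(event):
    return sum(1 for w in omega if w in event) / len(omega)

A = {w for w in omega if w[0] == "h"}   # first flip is head
B = {w for w in omega if w[1] == "h"}   # second flip is head

# P(A ∩ B) equals P(A) · P(B), so the two flips are independent events.
print(prob(A & B), prob(A) * prob(B))   # 0.25 0.25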
More generally, n events A1, . . . , An are independent if for any subset Ai1, Ai2, . . . , Air it holds that
P(Ai1 Ai2 . . . Air) = P(Ai1) P(Ai2) · · · P(Air).
Note that it is not sufficient to assume pairwise independence; in fact, it is an amusing homework to construct a probability space with three events A1, A2, A3 such that P(Ai Aj) = P(Ai) P(Aj) holds for any pair of distinct indices i, j from the set {1, 2, 3} but P(A1 A2 A3) ≠ P(A1) P(A2) P(A3).
Independence is one of the most important concepts in probability; this is basically the concept that separates discrete probability from mere combinatorial counting (as we did up to now) and that separates continuous probability from analysis (measure theory). It expresses the primary paradigm of probability (and statistics), that useful information can be gained by postulating that certain events happen "chaotically", "independently", in contrast to Newtonian determinism. In many cases, the uncertainty that underlies probability is actually an asset in a world consisting of too many degrees of freedom to be suitable for deterministic predictions.
In many situations, a probabilistic experiment is repeated, possibly many times. We call this a trial process. It is independent if the i-th trial is not influenced by the outcomes of the preceding i − 1 trials:
P(Ai | A1 ∩ . . . ∩ Ai−1) = P(Ai),
for each i and for any possible outcomes A1, . . . , Ai−1 of the previous trials. A particular example is the Bernoulli trial process in which the probability of success is the same at each trial:
P(success) = p;   P(failure) = 1 − p.
If we do a sequence of n trials, we may define X equal to the number of successes. Hence, Ω is the space of possible outcomes for a sequence of n trials or, equivalently, the set of binary strings of length n. What is the probability of getting exactly k successes? The probability of having a sequence of k successes followed by n − k failures is p^k (1 − p)^(n−k). Now we just have to multiply with the number of binary sequences that contain k successes.
Binomial Probability Lemma. The probability of having exactly k successes in a sequence of n trials is P(X = k) = (n choose k) p^k (1 − p)^(n−k).
As a sanity check, we make sure that the probabilities add up to one. Using the Binomial Theorem, we get
∑_{k=0}^{n} P(X = k) = ∑_{k=0}^{n} (n choose k) p^k (1 − p)^(n−k),
which is equal to (p + (1 − p))^n = 1. Because of this connection, the probabilities in the Bernoulli trial process are called the binomial probabilities.
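The lemma and the sanity check can be reproduced with the standard library (a sketch of our own; the parameter values are arbitrary).

from math import comb

def binomial_pmf(n, k, p):
    """P(X = k): exactly k successes in n Bernoulli trials with success probability p."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)

n, p = 10, 0.3
probs = [binomial_pmf(n, k, p) for k in range(n + 1)]
print(probs[3])     # P(X = 3) ≈ 0.2668
print(sum(probs))   # 1.0 up to rounding, as the Binomial Theorem predicts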
Medical test example. Trial processes that are not independent are generally more complicated and we need more elaborate tools to compute the probabilities. A useful such tool is the tree diagram as shown in Figure 8. We explain this using a realistic medical test problem. The outcome will show that probabilities can be counterintuitive, even in situations in which they are important.
Figure 8: Tree diagram showing the conditional probabilities in the medical test question.
Consider a
medical test for a disease, D. The test mostly gives the right answer, but not always. Say its false-negative rate is 1% and its false-positive
rate is 2%:
P (y | D) = 0.99;
P (n | D) = 0.01;
P (y | ¬D) = 0.02;
P (n | ¬D) = 0.98.
Assume that the chance you have disease D is only one in a thousand, that is, P (D) = 0.001. Now you take the test and the outcome
is positive. What is the chance that you have the disease? In other words, what is P (D | y)? As illustrated in Figure 8,
P(D | y) = P(D ∩ y)/P(y) = 0.00099/0.02097 = 0.047 . . .
This is clearly a case in which you want to get a second opinion before starting a treatment. Notice that the test is almost useless. The main reason is that the probability of the disease is much smaller than the false-positive rate of the test. So most positive tests are just false alarms.
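The computation mirrors the tree diagram of Figure 8; the short script below is our own restatement of it.

# Prevalence and test characteristics from the medical test example.
p_d          = 0.001   # P(D)
p_y_given_d  = 0.99    # P(y | D), 1% false-negative rate
p_y_given_nd = 0.02    # P(y | ¬D), 2% false-positive rate

# Law of total probability: P(y) = P(y | D) P(D) + P(y | ¬D) P(¬D).
p_y = p_y_given_d * p_d + p_y_given_nd * (1 - p_d)

# Bayes' Rule: P(D | y) = P(y | D) P(D) / P(y).
print(p_y)                         # 0.02097
print(p_y_given_d * p_d / p_y)     # 0.0472..., less than 5% despite the positive test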