Towards conditional probability and Bayes' theorem

GOAL: Our goal is to understand the basic mathematical bits in the standard Kolmogorov approach to conditional probability. Intuitively, we so far have the notion of unconditional probability – the probability of an outcome (elementary or molecular), full stop. But what of the probability of an outcome given another outcome? Example: the probability of rolling a 2 on one throw of a fair die is one thing (and something our 'unconditional probability' machinery can nicely model); the probability of rolling a 2 on one throw of a fair die given that you rolled an even number is another thing – and something our 'unconditional probability' machinery doesn't clearly model. Kolmogorov suggested that conditional probability be seen as a derivative notion.[1] Our aim is to go through the basic definitions up through conditional probability and one of the key results involving conditional probability, namely, Bayes' theorem.

[1] For more on some of the ideas and bits of the target history, see Antony Eagle's Philosophy of Probability: Contemporary Readings (Routledge), which was useful to us in framing these notes.

0.1 Conditional probability (via unconditional probability)

In the account of (unconditional) probability that we have so far, the following ratio is well defined (provided that P(B) > 0, which we will always assume in these contexts):[2]

    P(A ∩ B) / P(B)

This ratio of unconditional probabilities is precisely what Kolmogorov advocates as the definition of conditional probability – in the above case, the probability of A conditional on (or 'given') B.

Definition 0.1 (Conditional probability). The probability of outcome A conditional on B is defined thus:

    P(A|B) =df P(A ∩ B) / P(B)

The expression 'P(A|B)' may be read 'the probability of A given B'.

[2] Recall from our discussion that there can be nonempty sets with probability 0, so requiring that P(B) > 0 is requiring more than just that B be nonempty. To refresh: you know that for any set S, ℘S is a σ field on S. Let S = {a, b}, and define a probability function P on ℘S as follows: P(∅) = P({a}) = 0, and P({b}) = P({a, b}) = 1. P is indeed a probability function, as you can check. (You can think of this as the case where there are two possible worlds, a and b, and we are positive we are in b.) But {a} is nonempty and has probability 0.
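Before moving on, it may help to see Def 0.1 at work numerically. The following is a minimal computational sketch (ours, not part of the official development), modelling the GOAL paragraph's fair die as a finite space with equally weighted elementary outcomes:

```python
from fractions import Fraction

# Toy model: a fair six-sided die, each elementary outcome equally
# weighted (the equal weighting is an assumption of this sketch).
S = {1, 2, 3, 4, 5, 6}

def p(event):
    """Unconditional probability of an event (a subset of S)."""
    return Fraction(len(event & S), len(S))

def p_given(a, b):
    """Conditional probability P(A|B) per Def 0.1, assuming P(B) > 0."""
    assert p(b) > 0, "conditional probability undefined when P(B) = 0"
    return p(a & b) / p(b)

two = {2}
even = {2, 4, 6}

print(p(two))              # 1/6 -- unconditional probability of a 2
print(p_given(two, even))  # 1/3 -- probability of a 2 given an even roll
```

The jump from 1/6 to 1/3 is just the ratio P(A ∩ B)/P(B) at work: conditioning on the even roll shrinks the relevant space to {2, 4, 6}.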
0.2 Is conditional probability probability?

Recall that any function from a σ field into [0, 1] which meets the three Kolmogorov axioms is a probability function. The question at hand is whether conditional probability, as defined in §0.1, satisfies our (Kolmogorov) definition of a probability function. The answer is: yes. (Pause: why does this question matter? Think about it.)

To see this, we have to show that the given definition of conditional probability is defined over a σ field, and meets the three conditions of being a probability function – normality, non-negativity, and additivity. We take all of these things in turn; but we begin by 'reframing' how we will think of P(A|B) as a (unary) function from sets (elements of a σ field) into R.

0.2.1 Conditional probability functions as unary

Our probability functions are unary functions from σ fields into the real numbers between 0 and 1 (inclusive). But P(A|B) looks like it takes more than one argument. Let us write 'P(·|B)' for the candidate conditional-probability function at the moment, just to avoid confusion. The issue is: if P(·|B), defined per Def 0.1, is to count as a probability function, it needs to be unary. How can it be? The answer is that we treat the B part as a constant (or a 'parameter'), and see A as the (single) variable of the function. In other words, for any set B such that P(B) > 0, we have one of our familiar (already-in-hand, unary) probability functions P(·|B), which takes a single set A from its domain (some σ field on which the function is defined) and delivers a unique real number between 0 and 1. Accordingly, we are thinking of not one but many unary functions P(·|B), one for each such B. And what we need to show is that each of these meets the conditions of being a genuine probability function (as we are understanding them, per Kolmogorov's axioms).

0.2.2 Domain and range?

That the domain of each such P(·|B) is a σ field is clear, given that every probability function P is a function from a σ field into a subset of [0, 1]. All we're doing is taking an already-defined probability function P, whose domain contains both A and B and, since the domain is a σ field, also contains A ∩ B. Hence, holding B fixed (and the value assigned to B, which we take to be non-nil), we are simply looking at what our given P does for each set A in the given σ field. This means that P(·|B), so understood, is a function, defined via our original P, from a σ field (our original one!) into a subset of [0, 1]. So far, so good.

0.2.3 Conditional probability function: normality?

Our so-called conditional probability functions meet the first hurdle of having the right domains and ranges to be probability functions. But what of the Kolmogorov constraint of normality – namely, that P(S|B) = 1 for our overall background space S (on which the σ field is defined) and for each element B of the given σ field with P(B) > 0? This constraint is met. To see this, recall the definition of conditional probability, applying it to the case of P(S|B):

    P(S|B) = P(S ∩ B) / P(B)

Recall also that, for any background space S and subset A ⊆ S of that space, we have that S ∩ A = A. Hence, we can substitute B for S ∩ B in the above equation, getting

    P(S|B) = P(B) / P(B)

And so P(S|B) = 1, since any nonzero real number – including P(B) – divided by itself is 1.

0.2.4 Conditional probability function: non-negativity?

This one falls out of basic arithmetic: namely, that a non-negative real number divided by a positive real number is itself non-negative. Since our probability functions P : ℘(S) → [0, 1] have only non-negative values as outputs, and our candidate conditional-probability functions P(·|B) are each defined as the division/ratio of the values of our background function (with a positive denominator P(B)), our candidate conditional-probability functions therefore satisfy the non-negativity constraint.
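Before the final constraint, a quick numerical sanity check may help. The sketch below uses invented toy weights (not from the notes), and it also makes the 'fix B, get a unary function' move of §0.2.1 literal: the (hypothetical) helper conditionalize returns a one-argument function.

```python
from fractions import Fraction
from itertools import chain, combinations

# Toy weighted space (the weights are an assumption of this sketch).
S = frozenset({'a', 'b', 'c', 'd'})
weight = {'a': Fraction(1, 2), 'b': Fraction(1, 4),
          'c': Fraction(1, 4), 'd': Fraction(0)}  # note: P({'d'}) = 0

def p(event):
    """Unconditional probability: sum of elementary weights."""
    return sum((weight[x] for x in event), Fraction(0))

def conditionalize(b):
    """Fix B, returning the unary candidate function P(.|B)."""
    assert p(b) > 0
    return lambda a: p(a & b) / p(b)

B = frozenset({'b', 'c', 'd'})
p_given_B = conditionalize(B)

# Normality: P(S|B) = 1.
assert p_given_B(S) == 1
# Non-negativity, checked over every event in the sigma field on S.
events = [frozenset(c) for c in chain.from_iterable(
    combinations(S, r) for r in range(len(S) + 1))]
assert all(p_given_B(e) >= 0 for e in events)
print("normality and non-negativity hold for P(.|B)")
```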
0.2.5 Conditional probability function: additivity?

The final constraint is additivity, where this is countable additivity. This is the axiom:

• Countable Additivity: if ⟨Ai⟩ is a countable sequence of pairwise disjoint elements of ℱ, then P(⋃_{i=0}^{∞} Ai) = Σ_{i=0}^{∞} P(Ai).

Towards seeing that this holds, first suppose (for conditional proof) that ⟨Ai⟩ is a countable sequence of pairwise disjoint elements of some σ field ℱ. We need to show that, where we are treating B ∈ ℱ as constant, our candidate probability function P(·|B) satisfies the given axiom, namely:[3]

    P(⋃_{i=0}^{∞} Ai | B) = Σ_{i=0}^{∞} P(Ai|B)

[3] B ∈ ℱ is the element we are holding as constant for purposes of establishing that the conditional probability functions P(·|B) are in fact probability functions. The hope is that the '·' notation, here, will help the eyes and mind focus on the main nuts and bolts of what needs to be proved.

That this equation holds may be seen via the following steps. By Def 0.1 in §0.1, we have, for the probability function P in terms of which our candidate (conditional) probability function P(·|B) is defined,

    P(⋃_{i=0}^{∞} Ai | B) = P((⋃_{i=0}^{∞} Ai) ∩ B) / P(B)

Now, ∩ distributes over ⋃, so (⋃_{i=0}^{∞} Ai) ∩ B = ⋃_{i=0}^{∞} (Ai ∩ B); and since the Ai are pairwise disjoint, so are the sets Ai ∩ B. But P is a probability function (i.e., already obeys the Kolmogorov axioms), so P obeys countable additivity, and hence:

    P((⋃_{i=0}^{∞} Ai) ∩ B) / P(B) = Σ_{i=0}^{∞} P(Ai ∩ B) / P(B)

The right side is equivalent (via arithmetic) to the right side of the following equation:

    Σ_{i=0}^{∞} P(Ai ∩ B) / P(B) = Σ_{i=0}^{∞} [P(Ai ∩ B) / P(B)]

But now we're there! Taking a careful look at the right side of the immediately preceding equation – together with the definition of our candidate (conditional) probability functions – delivers this:

    Σ_{i=0}^{∞} [P(Ai ∩ B) / P(B)] = Σ_{i=0}^{∞} P(Ai|B)

And now we're done, as the transitivity of identity (walk through the foregoing equations) delivers what we were after:

    P(⋃_{i=0}^{∞} Ai | B) = Σ_{i=0}^{∞} P(Ai|B)

Hence, conditional probability functions, when we think of them with some fixed second argument, are probability functions.[4]

[4] Note that this isn't the case if we fix the first argument, and consider P(A|·) as a function of its second argument! The probability of A conditional on B is a probability proper of A (as we have just shown), but not of B.

0.3 Definition of independence of two outcomes

Two outcomes being independent with respect to a probability assignment is an important notion.

Definition 0.2 (Independence of outcomes). Let P(A) > 0 and P(B) > 0. Then A and B are independent iff P(A|B) = P(A).

Note that, by definition, A and B are independent iff

    P(A ∩ B) / P(B) = P(A)

But, from arithmetic (and the requirement on conditional probability P(A|B) that P(B) > 0), this equation holds iff P(A ∩ B) = P(A) × P(B). Hence, an alternative way of thinking about the independence of two outcomes A and B, relative to a probability assignment P, is that P(A ∩ B) = P(A) × P(B).

0.4 Some basic results involving conditional probability

Here are a few useful results involving the notion of conditional probability.

Fact 0.3. P(A ∩ B) = P(B) × P(A|B). (Proof: exercise.)

Fact 0.4 (Symmetry of independence). If A is independent from B, then B is independent from A. (Proof: exercise.)

0.5 Another set-theoretic concept: partitions

Intuitively, a partition of a set A is a way of 'slicing up' A into pairwise exclusive (i.e., pairwise disjoint) and mutually exhaustive (wrt A) non-empty pieces – that is, taking elements of ℘(A) \ {∅} in such a way that all of your selected elements are pairwise disjoint and their general union is identical to A itself. The official definition:

Definition 0.5 (Partition). Let A be a set. Let C ⊆ ℘(A), such that, for some index set I, C = {Ci}i∈I.[5] Then C is a partition of A iff

1. ∅ ∉ C.
2. {Ci} is pairwise disjoint, that is, Cj ∩ Ck = ∅ unless j = k, for all j, k ∈ I.
3. ⋃{Ci} = A.

[5] We drop the explicit (subscripted) reference to I, leaving it implicit – like the already-implicit background function that indexes C.

The idea is fairly straightforward: you've partitioned a set A just when you've got a bunch of non-empty sets Ci that 'add up' to A if you unionize all of the Ci, and no two (distinct) Ci have anything in common (they're pairwise disjoint).
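Since the three clauses of Def 0.5 are directly checkable on finite sets, here is a small sketch (the helper name is ours, not the notes') that tests them:

```python
from itertools import combinations

def is_partition(cells, a):
    """Check Def 0.5: cells partitions the set a iff (1) no cell is
    empty, (2) the cells are pairwise disjoint, and (3) the union of
    the cells is exactly a."""
    cells = [frozenset(c) for c in cells]
    no_empty = all(c for c in cells)                    # clause 1
    disjoint = all(not (c & d)
                   for c, d in combinations(cells, 2))  # clause 2
    exhausts = frozenset().union(*cells) == frozenset(a)  # clause 3
    return no_empty and disjoint and exhausts

die = {1, 2, 3, 4, 5, 6}
print(is_partition([{1, 3, 5}, {2, 4, 6}], die))     # True: odds/evens
print(is_partition([{1, 2, 3}, {3, 4, 5, 6}], die))  # False: cells overlap
```

The first call slices the die's space into odds and evens; the second fails clause 2, since the two cells share 3.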
0.6 Total probability: a key result involving conditional probability

The probability of an outcome can be seen as a weighted average of conditional probabilities involving the given outcome. The key result is this:

Theorem 0.6 (Total probability). Let S be a set, and ℱ a σ field on S. Let {Ci} be a partition of S such that every such outcome Ci gets a non-nil probability – that is, P(Ci) > 0 for each Ci in the partition. Now, let A be any element of ℱ. Then:[6]

    P(A) = Σ_{i} (P(A|Ci) × P(Ci))

[6] Note that the theorem's equation (viz., P(A) = Σ_{i} (P(A|Ci) × P(Ci))) is equivalent to P(A) = Σ_{i} P(A ∩ Ci) by definition of P(A|B) together with arithmetic (viz., multiplying m/n by n equals m).

Proof. A key equation in the proof is the following lemma, which you will prove as an exercise:

Lemma 0.7. A = ⋃_{i∈I} (A ∩ Ci).

Note that because {Ci} is a partition, A ∩ Cj and A ∩ Ck are disjoint for all j ≠ k. But now we can invoke additivity via Lemma 0.7 to get that

    P(A) = Σ_{i∈I} P(A ∩ Ci)

which, by definition of P(A|Ci), is equivalent to

    P(A) = Σ_{i} (P(A|Ci) × P(Ci))

Remark. How is this 'total probability' result related to Fact 0.3 in §0.4? (Exercise.)

0.7 Bayes' theorem: a key result involving conditional probability

An important result left out of §0.4 is something you may've heard of: namely, Bayes' theorem. The proof largely falls out of definitions, but the result is an important one in applications of abstract probability theory (e.g., getting a sense of the probability of various incompatible hypotheses given some evidence) and in philosophical debates about probability.[7]

[7] We recommend, again, Eagle's Philosophy of Probability, which contains more elaboration of some of these ideas than we sketch here.

Theorem 0.8 (Bayes' theorem). Let S be a set, and ℱ a σ field on S. Let {Hi} be a partition of S such that every such outcome Hi gets a non-nil probability – that is, P(Hi) > 0 for each Hi in the partition. Now, let E be any element of ℱ with P(E) > 0 (per our standing assumption on conditional probability).[8] Then for each j ∈ I,

    P(Hj|E) = [P(E|Hj) × P(Hj)] / P(E) = [P(E|Hj) × P(Hj)] / Σ_{i} [P(E|Hi) × P(Hi)]

[8] The motivation for the choice of 'H' and 'E' here: many applications of Bayes' theorem involve the likelihood of a Hypothesis given Evidence. But note well: the theorem itself is not tied to any such interpretation; the variables involved are simply variables over sets (or 'events' in the background space and 'outcomes' in the σ field).

Proof of Bayes' theorem. Think of the equation in three terms 'split' by the two identity signs. The left identity follows from two steps: the definition of conditional probability P(Hj|E), which delivers

    P(Hj|E) = P(E ∩ Hj) / P(E)

and, for the second step, Fact 0.3 from §0.4 (with E for A and Hj for B), which tells us that P(E ∩ Hj) = P(E|Hj) × P(Hj), and so we may substitute P(E|Hj) × P(Hj) for P(E ∩ Hj) in the above equation to get

    P(Hj|E) = [P(E|Hj) × P(Hj)] / P(E)

To get the right-side identity (between middle and righthand terms) in the theorem, focus on the denominator P(E) of the middle term, and recall the total-probability theorem 0.6 from §0.6, which tells us that, since {Hi} is a partition,

    P(E) = Σ_{i} [P(E|Hi) × P(Hi)]
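To close, here is a small numerical sketch of Theorem 0.8, with invented priors and likelihoods over a three-cell partition of hypotheses: it computes P(E) via total probability (Theorem 0.6) and then each posterior P(Hj|E) via Bayes' theorem.

```python
from fractions import Fraction

# Invented illustration: priors over a three-cell partition {H1, H2, H3}
# (summing to 1), and a likelihood P(E|Hi) for each cell.
priors = {'H1': Fraction(1, 2), 'H2': Fraction(3, 10), 'H3': Fraction(1, 5)}
likelihoods = {'H1': Fraction(1, 10), 'H2': Fraction(1, 2),
               'H3': Fraction(9, 10)}

# Total probability (Theorem 0.6): P(E) = sum_i P(E|Hi) * P(Hi).
p_e = sum(likelihoods[h] * priors[h] for h in priors)

# Bayes' theorem (Theorem 0.8): P(Hj|E) = P(E|Hj) * P(Hj) / P(E).
posteriors = {h: likelihoods[h] * priors[h] / p_e for h in priors}

print(p_e)         # 19/50
print(posteriors)  # {'H1': 5/38, 'H2': 15/38, 'H3': 9/19}
assert sum(posteriors.values()) == 1
```

Note that the posteriors sum to 1: P(·|E) is probability proper, per §0.2.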