Notes for COMP11120 (Probability part)

Kees van Schaik
[email protected]
Alan Turing Building, office 2.142
2013-2014

Note: If you find any typos/errors/... I would be very grateful if you'd let me know at [email protected].

Contents

1 Building blocks of probability theory
  1.1 Sample spaces and events
  1.2 Basic event/set operations
  1.3 Probability measures
  1.4 Some more rules for set/event manipulations and computing probabilities

2 Equally likely outcomes
  2.1 Equally likely outcomes
  2.2 A little bit of combinatorics
  2.3 Examples

3 The power of conditioning
  3.1 Conditional probabilities
  3.2 Independent events
  3.3 Partitions
  3.4 The law of total probability
  3.5 Bayes' Theorem

4 Random variables
  4.1 What are random variables?
  4.2 Computing probabilities with random variables
  4.3 Constructing new random variables from existing ones
  4.4 Probability mass functions (pmf's)
  4.5 Cumulative distribution functions (cdf's)
  4.6 The mean (or: expected value) of a random variable
  4.7 The variance and standard deviation of a random variable

5 Some well known distributions
  5.1 What is a distribution?
  5.2 The Bernoulli distribution
  5.3 The Binomial distribution
  5.4 The Poisson distribution
  5.5 The geometric distribution

Chapter 1
Introduction: building blocks of probability theory

Many or even all of you will already have some understanding of what probability theory is about. In fact it's likely that quite a few of you have already seen some probability theory during your previous education, but otherwise you will all, for one reason or another, have thought about basic probabilistic questions. Probably the most boring example is the tossing of a coin, which yields either heads or tails. What makes the experiment of tossing a coin probabilistic is that you don't know whether you will get heads or tails: the outcome of the experiment is 'random'. Probability theory concerns the study of such experiments, and tries to provide tools that allow us to analyse such 'random' experiments. In the sequel, the word 'experiment' is to be taken very broadly: we essentially use it for any situation in which a 'random' outcome will occur, be it a game, a lottery, the number of car accidents in Manchester on any given day, the decay of radioactive material, etc.

1.1 Sample spaces and events

We begin by collecting all possible outcomes of the experiment in a set called the sample space:

Definition 1.1.1 (Sample space). All possible outcomes of an experiment are collected in a set called the sample space, usually denoted by the letter Ω. In general we will label the possible outcomes of the experiment as ω1, ω2, ... so that Ω = {ω1, ω2, ...
, ωn} if the experiment has n possible outcomes and Ω = {ω1, ω2, ...} if the experiment has infinitely many possible outcomes*. (In particular examples we may deviate from this general notation though and will use something more specific for the example in question.)

* Note that there is an underlying assumption here, namely that we can actually put all the outcomes in a list, i.e. we can order them. This is only possible when the set in question is countable. Examples of countable sets are the natural numbers N, the integers Z and (even) the rational numbers Q, or any subsets of those sets.

Example 1.1.2. Consider the experiment of tossing a coin. Then there are two possible outcomes, namely heads (H) and tails (T). Therefore the sample space is the set Ω = {H, T}. (Or, in general notation, Ω = {ω1, ω2} where ω1: coin shows heads and ω2: coin shows tails.)

Next consider rolling a dice and observing the number of dots. The resulting sample space is Ω = {1, 2, 3, 4, 5, 6}. (Or, in general notation, Ω = {ω1, ω2, ..., ω6} where ωi: dice shows i dots, for i = 1, ..., 6.)

Finally, suppose you throw two coins. Each coin can show either heads (H) or tails (T), and the outcome of the experiment consists of a combination of two of those. Hence Ω = {HH, HT, TH, TT}. (And the obvious analogue in general notation.)

In the sequel we are not only interested in the outcome of the experiment but, more broadly, in certain collections of outcomes. For instance, if we throw a dice we might be very interested in whether or not the number of dots is less than 3. This is true if the number of dots is either 1 or 2, and false otherwise. To answer this question, rather than in the outcome itself we are hence more interested in whether the outcome is in the set {1, 2} (in which case the answer to the question is 'yes') or not (in which case the answer to the question is 'no').
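Since this module is taught to computer scientists, it may help to see these notions in code. The sketch below is my own illustration, not part of the original notes: a sample space is modelled as a Python set, an event as a subset, and 'the event occurs' as membership of the observed outcome (the helper name `occurs` is made up for this illustration).

```python
# Sample spaces from Example 1.1.2, modelled as Python sets:
omega_coin = {"H", "T"}                     # tossing a coin
omega_dice = {1, 2, 3, 4, 5, 6}             # rolling a dice
omega_two_coins = {"HH", "HT", "TH", "TT"}  # tossing two coins

# An event is a subset of the sample space, e.g. "fewer than 3 dots":
A = {1, 2}
assert A <= omega_dice  # every event is a subset of Omega

def occurs(event, outcome):
    """An event occurs exactly if the observed outcome is an element of it."""
    return outcome in event

print(occurs(A, 2))  # True: with 2 dots the event {1, 2} occurs
print(occurs(A, 5))  # False
```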
In order to facilitate such questions we give a name to such collections of outcomes:

Definition 1.1.3 ((Occurring) events). An event is a set containing one or more outcomes from the sample space Ω (or none, in case it is the empty set, see below). Hence mathematically speaking an event is nothing but a subset of the sample space: A ⊆ Ω. In fact, any subset of Ω is (by definition) an event. This includes the empty set†, denoted ∅. Furthermore, we say that an event occurs (or happens) if the actual outcome of the experiment is an element of the event. (Hence we only know for sure whether an event occurs or not when we have done the experiment and observed the outcome!)

† Recall that the 'empty set' is the set that contains no elements at all. By convention the empty set is a subset of any set.

Remark 1.1.4 ('Extreme' events). There are two 'extreme' events in the sense that they are the smallest and largest possible events (in terms of the number of outcomes they contain). The smallest is ∅ (the empty set). This set does not contain any outcomes at all and hence this is an event that never occurs. The largest is Ω itself. This set clearly contains all possible outcomes and as such it always occurs. All other events contain at least some, but not all, elements of Ω. Hence these events may or may not occur.

* (cont'd) An example of a set that is uncountable, i.e. not countable, is the real numbers R or any (non-empty) interval [a, b] ⊆ R: no matter how hard you try, you can't list all the elements of R. Mathematically speaking, a set S is countable exactly if there exists an injective (one-to-one) function f: S → N. In this course we will only focus on the situation where the sample space Ω is countable. If Ω were uncountable, some results we discuss in these notes would seriously break down! A first proof that R is uncountable was given by the famous mathematician Georg Cantor around the 1890s.

Remark 1.1.5.
If Ω has n elements then there are 2^n events (or, equivalently, Ω has 2^n subsets). A way to see that is as follows. Write Ω = {ω1, ω2, ..., ωn}. You can associate each subset of Ω with a sequence of zeros and ones of length n. Namely, the i-th entry of the sequence is a 1 if ωi is an element of the subset and otherwise a 0, for i = 1, ..., n. In this way, the sequence with only 0's corresponds to the subset which contains no elements at all, i.e. the empty set; while the sequence with only 1's corresponds to the subset which contains all elements, i.e. Ω itself. All other sequences contain at least one 0 and one 1 and represent subsets somewhere between these two extremes (see also the remark above). Hence there is a one-to-one correspondence (i.e. a bijection) between all subsets of Ω and all these sequences. So rather than counting subsets we may as well count sequences, and as there are n entries, each to be filled with one of two choices (namely a 0 or a 1), we see that there are indeed 2^n such sequences.

Example 1.1.6 (Example 1.1.2 cont'd). In the experiment of tossing a coin we have Ω = {H, T}. There are 2^2 = 4 possible subsets of Ω, namely ∅, {H}, {T} and Ω itself. These are hence also the four possible events. We can express these events in words as follows:

∅: "nothing happens" (this event never occurs, see also Remark 1.1.4);
{H}: "coin shows heads" (this event occurs exactly if the outcome is heads);
{T}: "coin shows tails" (this event occurs exactly if the outcome is tails);
Ω: "coin shows either heads or tails" (this event always occurs, see also Remark 1.1.4).

Example 1.1.7 (Example 1.1.2 cont'd). When we throw a dice we have Ω = {1, 2, 3, 4, 5, 6}. There are hence 2^6 = 64 possible events. As always this includes ∅ and Ω. Three more examples of events, denoted A, B and C:

A: "No. of dots is less than 3", hence A = {1, 2};
B: "No. of dots is even", hence B = {2, 4, 6};
C: "No. of dots is 1", hence C = {1}.
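The bijection in Remark 1.1.5 between subsets and 0/1 sequences is easy to make concrete. In this sketch (my own, not from the notes) every number 0, ..., 2^n − 1 supplies one n-bit sequence, and hence one subset:

```python
def all_subsets(omega):
    """All 2^n subsets of a finite set, one per 0/1 sequence of length n."""
    outcomes = sorted(omega)  # fix an ordering omega_1, ..., omega_n
    n = len(outcomes)
    subsets = []
    for bits in range(2 ** n):  # each number 0 .. 2^n - 1 encodes one sequence
        # outcome i belongs to this subset exactly if bit i of the number is 1
        subsets.append({outcomes[i] for i in range(n) if bits >> i & 1})
    return subsets

print(len(all_subsets({"H", "T"})))          # 4, as in Example 1.1.6
print(len(all_subsets({1, 2, 3, 4, 5, 6})))  # 64, as in Example 1.1.7
```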
Recall that an event occurs when the actual outcome of the experiment is a member of the event in question. Hence we can compile the following table:

Observed outcome          Event occurs?
of the experiment         A     B     C
        1                 Y     N     Y
        2                 Y     Y     N
        3                 N     N     N
        4                 N     Y     N
        5                 N     N     N
        6                 N     Y     N

Remark 1.1.8. Shifting our focus to events rather than (only) outcomes might make you wonder how we can capture a particular outcome in terms of events. In fact, the event C in the above Example 1.1.7 shows how to do that. With the general notation Ω = {ω1, ω2, ..., ωn} we have that the outcome of the experiment is ωi exactly if the event {ωi} occurs.

1.2 Basic event/set operations

As events are just sets, we can use the usual set operations to create new sets/events from existing ones. For the sake of completeness we collect the usual set operations in the following definition, and we give an interpretation of what the set operations mean when translated to the world of outcomes and events.

Definition 1.2.1 (Basic set operations). If A, B ⊆ Ω are events then we have the following.

• We use the symbol #A to denote the number of outcomes in A, i.e. #A ∈ {0, 1, 2, ...}; we can also have #A = ∞.

• The complement of A, denoted A^c, consists of all outcomes that are not in A, i.e. A^c = {ω ∈ Ω | ω ∉ A}. Note that the event A^c occurs ⇐⇒ the event A does not occur.

• The union A ∪ B consists of all outcomes that are either in A or in B (or in both): A ∪ B = {ω ∈ Ω | ω ∈ A or ω ∈ B}. Note that the event A ∪ B occurs ⇐⇒ the event A occurs, or the event B occurs, or both.

• The intersection A ∩ B consists of all outcomes that are in both A and B: A ∩ B = {ω ∈ Ω | ω ∈ A and ω ∈ B}. Note that the event A ∩ B occurs ⇐⇒ both events A and B occur.

• If B ⊆ A we can define the (set) difference of A and B ('A minus B'): A \ B = {ω ∈ Ω | ω ∈ A and ω ∉ B}. Note that the event A \ B occurs ⇐⇒ the event A occurs, but the event B does not.
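Each operation in Definition 1.2.1 has a direct counterpart on Python's built-in set type, so you can experiment with them directly (a small sketch of mine, not part of the notes):

```python
omega = {1, 2, 3, 4, 5, 6}  # throwing a dice
A = {1, 2}                  # "No. of dots is less than 3"
B = {2, 4, 6}               # "No. of dots is even"

print(len(A))        # the number of outcomes #A = 2
print(omega - A)     # complement A^c = {3, 4, 5, 6}
print(A | B)         # union A ∪ B = {1, 2, 4, 6}
print(A & B)         # intersection A ∩ B = {2}
print(A - (A & B))   # difference A \ (A ∩ B) = {1}, with A ∩ B ⊆ A
```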
[Figure 1.1: A set A (left) and its complement A^c (right)]

[Figure 1.2: Two sets A (left top) and B (right top) with their union (left bottom) and intersection (right bottom)]

[Figure 1.3: With the set A as in the previous two figures and B as on the left, the set difference A \ B on the right]

See Figures 1.1–1.3 for a graphical representation of the above set operations. (These representations are called 'Venn diagrams' and can be very useful to gain some intuition for how (complicated) set operations work out.)

Definition 1.2.2 (Disjoint events). Two events A and B are called disjoint when A ∩ B = ∅. That is, two events are disjoint exactly when they cannot occur at the same time.

Example 1.2.3 (Example 1.1.7 cont'd). In the setting of throwing a dice with Ω = {1, 2, 3, 4, 5, 6} and the three events

A = {No. of dots is less than 3} = {1, 2};
B = {No. of dots is even} = {2, 4, 6};
C = {No. of dots is 1} = {1}

we have for instance

A^c = {3, 4, 5, 6} = {No. of dots is at least 3}, A ∪ C = {1, 2} = A, A ∩ C = {1} = C, A ∩ B = {2}, B ∩ C = ∅, A \ C = {2}.

1.3 Probability measures

We have so far discussed two of the main ingredients, namely the sample space and events. However, we would like to talk about the likelihood of outcomes of the experiment, and in particular the likelihood of events occurring. That is to say, we want to have a function that assigns to every possible event the likelihood that this event occurs. Such a function is called a probability measure. For its definition we need to know what disjoint events are:

Definition 1.3.1 (Disjoint events). Two events A and B are called disjoint when A ∩ B = ∅. That is, two events are disjoint exactly when they cannot both occur.

Definition 1.3.2 (Probability measure). A probability measure, denoted P, is a mapping

P : {collection of all events} → [0, 1]

i.e.
P assigns to any event A a number P(A) ∈ [0, 1]. This number we understand as the likelihood that the event A occurs, i.e. the likelihood that the actual outcome of the experiment is an element of A. Furthermore, a probability measure should satisfy the following properties:

(i) P(Ω) = 1,
(ii) if A and B are disjoint events then P(A ∪ B) = P(A) + P(B).

Remark 1.3.3. In the above Definition 1.3.2, there are two 'extreme' cases:

P(A) = 0 ⇐⇒ A will never occur,
P(A) = 1 ⇐⇒ A will always occur.

We will see in a bit that P(A) = 0 holds at least when A = ∅, while P(A) = 1 holds at least when A = Ω. All other events A typically have P(A) ∈ (0, 1) – i.e. A may or may not occur. (Compare also with Remark 1.1.4.)

Example 1.3.4 (Example 1.1.2 cont'd). Consider again the experiment of tossing a coin, with Ω = {H, T} and the four possible events ∅, {H}, {T} and Ω. We can define a (candidate) probability measure P by setting

P(∅) := 0, P({H}) := 1/2, P({T}) := 1/2 and P(Ω) := 1.

We need to verify this mapping P is indeed a probability measure, i.e. that all conditions in Definition 1.3.2 are satisfied. Point (i) holds just by definition. For point (ii), we need to verify that for all possible combinations of disjoint events A and B we indeed have

P(A ∪ B) = P(A) + P(B). (1.1)

You can verify all combinations yourself; here are two particular examples. If we take A = ∅ and B = {H} then

P(A ∪ B) = P(∅ ∪ {H}) = P({H}) = 1/2

while

P(A) + P(B) = P(∅) + P({H}) = 0 + 1/2 = 1/2

and hence we have verified that (1.1) indeed holds. If we take A = {H} and B = {T} then

P(A ∪ B) = P({H} ∪ {T}) = P({H, T}) = P(Ω) = 1

while

P(A) + P(B) = P({H}) + P({T}) = 1/2 + 1/2 = 1

so both are indeed equal. Note that P is the choice that corresponds to a 'fair' coin, i.e. when heads and tails both appear with probability 1/2. However P is certainly not the only possible probability measure.
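Because this sample space is so small, the 'verify all combinations yourself' step can be left to a computer. The following sketch is my own illustration (none of the helper names come from the notes): it stores the fair-coin P in a dictionary and brute-force checks both conditions of Definition 1.3.2.

```python
from itertools import combinations

omega = frozenset({"H", "T"})

def all_events(om):
    """All subsets of a finite set, as (hashable) frozensets."""
    xs = list(om)
    return [frozenset(c) for r in range(len(xs) + 1)
            for c in combinations(xs, r)]

# The candidate probability measure of Example 1.3.4 (the fair coin):
P = {frozenset(): 0.0, frozenset({"H"}): 0.5,
     frozenset({"T"}): 0.5, omega: 1.0}

assert P[omega] == 1.0  # condition (i): P(Omega) = 1

# Condition (ii): additivity over every pair of disjoint events.
for A in all_events(omega):
    for B in all_events(omega):
        if not (A & B):  # A and B are disjoint
            assert P[A | B] == P[A] + P[B]
print("P satisfies Definition 1.3.2")
```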
Indeed, if we define a mapping P′ as

P′(∅) := 0, P′({H}) := 1/4, P′({T}) := 3/4 and P′(Ω) := 1

you can verify this is indeed also a probability measure. This one corresponds to the situation where the coin isn't fair and heads only appears with probability 1/4. In fact, for any α ∈ [0, 1] define the mapping Pα by setting

Pα(∅) := 0, Pα({H}) := α, Pα({T}) := 1 − α and Pα(Ω) := 1;

then it can be verified (exercise) that these are indeed all probability measures. Note also that for α = 1/2 we find the P from above back, while for α = 1/4 we end up with the P′ from above. We see that even in this very basic example there are infinitely many probability measures! Finally, it's not very difficult to come up with an example of a mapping that is not a valid probability measure (exercise). Take for instance the mapping P1 defined as

P1(∅) := 0, P1({H}) := 1/3, P1({T}) := 1/3 and P1(Ω) := 1.

One disadvantage of Definition 1.3.2 is that it requires us to check that point (ii) of the definition holds for any disjoint events A and B – and as there may be many combinations of disjoint events this may be quite a task. Luckily there is a shortcut. For this, let us use the general notation Ω = {ω1, ω2, ..., ωn}. (We take for simplicity a finite sample space, but it also works for infinite ones. However, the argument we're going to use below does not work when the sample space is not countable. Not relevant for us now, but good to know in case you'd end up doing more probability later on – cf. the footnote at the beginning of this chapter.)

For any i = 1, ..., n define pi := P({ωi}) (i.e. pi is the probability that the experiment has ωi as outcome). The important thing to realise now is that in fact we only need to know these pi's to compute the probability of any event. Indeed, take for instance the event A = {ω1, ω2}; then

P(A) = P({ω1} ∪ {ω2}) (∗)= P({ω1}) + P({ω2}) = p1 + p2.
Note that (∗) uses that {ω1} and {ω2} are disjoint sets, i.e. {ω1} ∩ {ω2} = ∅, and hence we can use part (ii) of Definition 1.3.2. You can probably imagine this works in the same way for any event: its probability is computed by adding up the probabilities of each of the outcomes that are elements of the event. We can express this in a formula as follows. For any event A we have

P(A) = Σ_{i=1}^n 1_A(ωi) pi. (1.2)

The symbol 1 is an 'indicator function'. In general, for any set S the function 1_S(x) takes the value 1 if x ∈ S holds and the value 0 otherwise. Hence the indicator function appearing in (1.2) behaves as follows:

1_A(ωi) = 1 if ωi ∈ A, and 1_A(ωi) = 0 if ωi ∉ A. (1.3)

The formula (1.2) therefore reads as follows: the probability of A is obtained by looping over all outcomes in the sample space Ω; if the outcome is an element of A then add the corresponding pi to the number you had, and otherwise add nothing. Note that if we apply the formula (1.2) to the event A = Ω then 1_A(ωi) = 1 for every i = 1, ..., n and thus the formula reads

P(Ω) = Σ_{i=1}^n pi.

Since we know from Definition 1.3.2 that P(Ω) = 1 this implies that

Σ_{i=1}^n pi = 1.

So, if we put the above argument in a Result then we have derived the following:

Result 1.3.5. Suppose Ω = {ω1, ω2, ..., ωn}. Let P be a probability measure. Define pi := P({ωi}) for i = 1, ..., n. Then we have:

(i) pi ≥ 0 for all i = 1, ..., n and p1 + p2 + ... + pn = 1,

(ii) for any event A ⊆ Ω we have

P(A) = Σ_{i=1}^n 1_A(ωi) pi

where the indicator function 1 is defined in (1.3).

If Ω has infinitely many elements, (i) and (ii) are still true, with the understanding that the n has to be replaced by ∞, i.e.

(i') pi ≥ 0 for all i = 1, 2, ... and p1 + p2 + ... = 1,

(ii') for any event A ⊆ Ω we have

P(A) = Σ_{i=1}^∞ 1_A(ωi) pi.

In fact, if you think a bit about it you can also invert the above result. Still working with the sample space Ω = {ω1, ω2, ...
, ωn}, take any sequence of numbers pi for i = 1, ..., n such that pi ≥ 0 and p1 + p2 + ... + pn = 1. Now define the mapping P by setting for any event A

P(A) := Σ_{i=1}^n 1_A(ωi) pi. (1.4)

Note that this formula in particular yields P({ωi}) = pi for every i = 1, ..., n. Now this is indeed a probability measure. To verify point (i) of Definition 1.3.2, indeed we have

P(Ω) = Σ_{i=1}^n 1_Ω(ωi) pi = Σ_{i=1}^n pi = 1.

To verify point (ii), let A and B be disjoint events. First we make the following step:

1_A(ωi) + 1_B(ωi) = (1 if ωi ∈ A, else 0) + (1 if ωi ∈ B, else 0)
                  = { 2 if ωi ∈ A ∩ B; 1 if ωi ∈ (A ∪ B) \ (A ∩ B); 0 if ωi ∉ A ∪ B }
              (∗) = { 1 if ωi ∈ A ∪ B; 0 if ωi ∉ A ∪ B }
                  = 1_{A∪B}(ωi).

Note that the second equality follows by considering the different values both indicator functions can take. The crucial step is (∗), where we use that A and B are disjoint, i.e. that A ∩ B = ∅, to simplify the expression we had before. This computation hence tells us that on account of A ∩ B = ∅ we have 1_A(ωi) + 1_B(ωi) = 1_{A∪B}(ωi). Using this we see that

P(A ∪ B) = Σ_{i=1}^n 1_{A∪B}(ωi) pi = Σ_{i=1}^n (1_A(ωi) + 1_B(ωi)) pi = Σ_{i=1}^n 1_A(ωi) pi + Σ_{i=1}^n 1_B(ωi) pi = P(A) + P(B),

where the first and last equalities use (1.4), so that we have also verified point (ii) of Definition 1.3.2 and we can thus conclude that P as defined in (1.4) is indeed a probability measure. This all brings us to

Result 1.3.6. Suppose Ω = {ω1, ω2, ..., ωn}. Take any sequence of numbers pi for i = 1, ..., n such that pi ≥ 0 and p1 + p2 + ... + pn = 1. For any event A, define the mapping P as

P(A) := Σ_{i=1}^n 1_A(ωi) pi. (1.5)

(That is, P(A) is found by adding up all those pi's for which the corresponding outcome ωi is an element of A.) Then P is a probability measure, chosen in such a way that for all i = 1, ..., n the experiment has outcome ωi with probability pi = P({ωi}). If Ω = {ω1, ω2, ...
} the same result still holds true, with the obvious adjustments, such as that we need an infinite sequence of non-negative pi's summing to 1 rather than a finite one, and we need to replace the upper bound of the sum in (1.5) by ∞.

What we have actually done in this bit is connect two ways of thinking about probabilities. We pretty much all know from previous experience (secondary school etc.) how to compute probabilities: we assign a probability to each of the possible outcomes of the experiment, and then the probability of an event is computed by looking at which outcomes are an element of the event and adding up the probabilities of these outcomes. This is the first approach (and an intuitively appealing one). The second approach is the more mathematical one, where we prefer to work with simple and more abstract definitions. According to the mathematical approach, a probability measure is simply a mapping as defined in Definition 1.3.2. The above two Results 1.3.5 & 1.3.6 connect these two approaches. They tell us that the first, more intuitive approach is in fact consistent with the mathematical definition of a probability measure, as you would hope, so that we can essentially mix the use of both approaches to our liking.‡

‡ It is good to point out that the first approach above only works when the sample space is either finite or countable; the mathematical definition works for any possible sample space though, even though that situation has its own quirks, such as the fact that one has to be careful with how events are exactly defined. However this is of course all outside the scope of this course!

Example 1.3.7 (Example 1.2.3 cont'd). Consider again the example of throwing a dice, with Ω = {1, 2, 3, 4, 5, 6} and the events A, B and C we introduced before:

A = {No. of dots is less than 3} = {1, 2};
B = {No. of dots is even} = {2, 4, 6};
C = {No. of dots is 1} = {1}.
Let us now compute some probabilities, using Result 1.3.6. Assuming that the dice is fair it is natural to assume all six possible outcomes occur with probability 1/6. Hence we take p1 = p2 = ... = p6 = 1/6. Then

P(A) = p1 + p2 = 1/3, P(B) = p2 + p4 + p6 = 1/2 and P(C) = p1 = 1/6.

Alternatively, these probabilities could be computed using that by Definition 1.3.2 we have P(A ∪ B) = P(A) + P(B) for disjoint events A and B. The strategy is then to split the events A, B and C into disjoint events for which we know their probabilities:

P(A) = P({1} ∪ {2}) = P({1}) + P({2}) = p1 + p2 = 1/3,

P(B) = P({2} ∪ {4} ∪ {6}) = P({2}) + P({4}) + P({6}) = p2 + p4 + p6 = 1/2

and

P(C) = P({1}) = 1/6.

1.4 Some more rules for set/event manipulations and computing probabilities

We have now introduced three of the four core concepts, namely sample spaces, events and probability measures. The final one is random variables, and we'll get to those later. Let us conclude this chapter by stating some results that are helpful for manipulating events and computing probabilities. For events we have the following. These identities may come in very handy!

Result 1.4.1. Consider a sample space Ω and let A, B and C be events. Then we have:

(i) A ∩ A^c = ∅ and A ∪ A^c = Ω;

(ii) distributive rules (note that these are exactly the same laws as in arithmetic, provided you replace "∪" by "+" and "∩" by "×"):

• (A ∪ B) ∩ C = (A ∩ C) ∪ (B ∩ C)
• (A ∩ B) ∪ C = (A ∪ C) ∩ (B ∪ C)

(iii) De Morgan's laws:

• (A ∪ B)^c = A^c ∩ B^c
• (A ∩ B)^c = A^c ∪ B^c

Proof. Ad (i). This is clear from the definition A^c = {ω ∈ Ω | ω ∉ A}.

Ad (ii). For the first one,

ω ∈ (A ∪ B) ∩ C ⇐⇒ ω ∈ A ∪ B and ω ∈ C ⇐⇒ (ω ∈ A and ω ∈ C) or (ω ∈ B and ω ∈ C) ⇐⇒ ω ∈ A ∩ C or ω ∈ B ∩ C ⇐⇒ ω ∈ (A ∩ C) ∪ (B ∩ C).

The proof of the second one is similar.

Ad (iii). For the first one,

ω ∈ (A ∪ B)^c ⇐⇒ ω ∉ A ∪ B ⇐⇒ ω ∉ A and ω ∉ B ⇐⇒ ω ∈ A^c and ω ∈ B^c ⇐⇒ ω ∈ A^c ∩ B^c.
The proof of the second one is an exercise.

For computing probabilities we have the following — again very useful! For the proofs, note that all we know (at this point) about a probability measure is that the probability of the whole sample space is 1 and that the probability of the union of two disjoint events is the sum of their probabilities (cf. Definition 1.3.2). Therefore we will be looking for ways to make use of disjoint events.

Result 1.4.2. Consider a sample space Ω, let A and B be events and let P be a probability measure. Then we have:

(i) P(A^c) = 1 − P(A) ('complementary law');
(ii) if B ⊆ A then P(B) ≤ P(A) and P(A \ B) = P(A) − P(B);
(iii) P(A ∪ B) = P(A) + P(B) − P(A ∩ B) ('additive law');
(iv) P(A^c ∩ B) = P(B) − P(A ∩ B) ('2nd complementary law');
(v) P(A ∩ B^c) = P(A) − P(A ∩ B) ('3rd complementary law').

Proof. Ad (i). From Result 1.4.1 we know that A and A^c are disjoint events (∗) while we also have A ∪ A^c = Ω (∗∗). Recall from Definition 1.3.2 (i) that P(Ω) = 1. Using these facts together with Definition 1.3.2 (ii) we get

1 = P(Ω) (∗∗)= P(A ∪ A^c) (∗)= P(A) + P(A^c)

and the result follows.

Ad (ii). If B ⊆ A then we may cut A up into two disjoint parts: B and A \ B, i.e. A = B ∪ (A \ B) (make a Venn diagram!). Hence, again making use of Definition 1.3.2 (ii), we find

P(A) = P(B ∪ (A \ B)) = P(B) + P(A \ B)

and rearranging a bit we indeed get P(A \ B) = P(A) − P(B). Furthermore, we may also write this as P(B) = P(A) − P(A \ B). Since the probability of any event is bigger than or equal to 0, also P(A \ B) ≥ 0 and it hence indeed follows that P(B) ≤ P(A).

Ad (iii). We may write the event A ∪ B for instance as the union of the two disjoint events A and B \ (A ∩ B) (make a Venn diagram!). This yields:

P(A ∪ B) = P(A ∪ (B \ (A ∩ B))) = P(A) + P(B \ (A ∩ B)) = P(A) + P(B) − P(A ∩ B),

where we have again used Definition 1.3.2 (ii) and the last step makes use of part (ii) above.

Ad (iv).
Draw again a Venn diagram to convince yourself that we may write A^c ∩ B = B \ (A ∩ B), so that

P(A^c ∩ B) = P(B \ (A ∩ B)) = P(B) − P(A ∩ B),

where the last step uses part (ii) above.

Ad (v). This is exactly the same as part (iv), only with the roles of the events A and B reversed.

Remark 1.4.3. If we plug A = Ω into part (i) above it reads P(∅) = 1 − P(Ω). As P(Ω) = 1 (cf. Definition 1.3.2 (i)) it follows that P(∅) = 0 (as we promised to show in Remark 1.3.3). Also, this means that if we plug two disjoint events A and B into the 'additive law' above then it reads

P(A ∪ B) = P(A) + P(B) − P(A ∩ B) = P(A) + P(B) − P(∅) = P(A) + P(B),

which is exactly Definition 1.3.2 (ii). So the difference between the two is that the 'additive law' also holds when the intersection A ∩ B is not empty, in which case P(A) + P(B) would be too large as it counts the elements in A ∩ B twice (they are both in A and in B), and hence the correction term P(A ∩ B) is subtracted.

We conclude with two examples.

Example 1.4.4. Suppose that we have events A and B of which we know that P(A) = 0.4, P(B) = 0.3 and P(A ∩ B) = 0.2. We would like to compute P(A ∪ B), P(A^c) and P(A^c ∩ B^c). For the first, applying the 'additive law' from Result 1.4.2 we find

P(A ∪ B) = P(A) + P(B) − P(A ∩ B) = 0.4 + 0.3 − 0.2 = 0.5.

For the second, the 'complementary law' from Result 1.4.2 yields

P(A^c) = 1 − P(A) = 1 − 0.4 = 0.6.

Finally, the last one is more challenging. We (obviously) need to try to rewrite the event A^c ∩ B^c in terms of one or more of the events for which we are given the probabilities. One of De Morgan's laws from Result 1.4.1 gives us what we need, namely A^c ∩ B^c = (A ∪ B)^c:

P(A^c ∩ B^c) = P((A ∪ B)^c) = 1 − P(A ∪ B) = 1 − 0.5 = 0.5,

where we used the 'complementary law' from Result 1.4.2 and the fact that we computed P(A ∪ B) above.

Example 1.4.5.
Consider a loaded dice which has the property that 1 and 3 dots show up with probability 1/3 each, while 2, 4, 5 and 6 dots show up with probability 1/12 each. Compute the probability that the outcome is even, and the probability that at least 2 dots show up.

First note that we again have the sample space Ω = {1, 2, 3, 4, 5, 6}. Following the approach in Result 1.3.6 and Example 1.3.7, we define for i = 1, ..., 6 the probabilities of each of the outcomes as pi = P({ωi}). We are given that p1 = p3 = 1/3 and p2 = p4 = p5 = p6 = 1/12. Then

P({No. of dots is even}) = P({2, 4, 6}) = p2 + p4 + p6 = 1/4

and

P({No. of dots is at least 2}) = 1 − P({No. of dots is 1}) = 1 − p1 = 2/3,

where we used the 'complementary law' from Result 1.4.2 again. (In terms of sets, we have {No. of dots is at least 2} = {2, 3, 4, 5, 6}, so that its complement is {1} = {No. of dots is 1}.)

Chapter 2
Equally likely outcomes

In the previous chapter we developed some general theory for computing probabilities. Now, very often it is the case that we are dealing with an experiment that has the special property that every outcome is equally likely to occur. Think for instance about throwing a fair dice or a fair coin, randomly selecting a prize winner in a lottery, etc. Since this is such a prominent particular case of experiments we dedicate this chapter to discussing such experiments in more detail. We will see that computing probabilities becomes a matter of counting.

2.1 Equally likely outcomes

Consider an experiment with sample space Ω = {ω1, ω2, ..., ωn} and suppose that we assume that each outcome is equally likely. That is to say, if we set pi := P({ωi}), the probability that the experiment has outcome ωi for i = 1, ..., n, then we have p1 = p2 = ... = pn. Since we know that p1 + p2 + ... + pn = 1 (cf. Result 1.3.5 in Ch. 1) it must be the case that pi = 1/n for all i = 1, ..., n. Furthermore, recall from Result 1.3.5 in Ch.
1 that we can compute the probability of any event A by the formula

P(A) = Σ_{i=1}^n 1_A(ωi) pi.

If we now plug in that pi = 1/n for all i = 1, ..., n we get

P(A) = Σ_{i=1}^n 1_A(ωi) (1/n) = (1/n) Σ_{i=1}^n 1_A(ωi). (2.1)

The expression Σ_{i=1}^n 1_A(ωi) is a sum that loops over all elements ωi in the sample space, and adds 1 to the total if ωi ∈ A and 0 otherwise. Hence the number we end up with is exactly equal to the number of elements in A, denoted #A. Therefore we can further simplify (2.1) to

P(A) = #A / n = #A / #Ω,

where in the last step we used that since Ω = {ω1, ω2, ..., ωn} we have #Ω = n. So we have shown the following:

Result 2.1.1. Suppose that an experiment has equally likely outcomes. Then we may compute the probability of any event A using the formula

P(A) = #A / #Ω. (2.2)

(Recall that for any set S, #S denotes the number of elements in S.)

The above result essentially tells us that in the case of equally likely outcomes, computing probabilities boils down to (simply, or sometimes not so simply) counting how many outcomes there are in your event A and in the whole sample space Ω.

2.2 A little bit of combinatorics

Since computing probabilities boils down to counting outcomes in this case, let us very briefly discuss some (probably quite familiar) results on how to count in a 'clever' way. Within mathematics, such questions are studied in the field known as combinatorics. The first result is probably very familiar to you and is the basis of all counting we'll do in this chapter.

Result 2.2.1 (Multiplication rule). Suppose we need to select a total of n items. If there are k1 possible choices for the first item, k2 possible choices for the second item, ..., kn possible choices for the n-th item, then there are in total k1 · k2 · ... · kn ways to select the n items.

Next let us recall two basic quantities that we'll use throughout:

Definition 2.2.2 (Factorial). For any n ∈ N we define n!
= n · (n − 1) · (n − 2) · . . . · 2 · 1 and 0! = 1. Or, in recursive form: n! = n · (n − 1)! with initial value 0! = 1.

Definition 2.2.3 (Binomial coefficient). For any n ∈ N and k = 0, 1, . . . , n we define the binomial coefficient (n choose k) as

(n choose k) = n! / (k! (n − k)!).

The question we’ll be asking ourselves next is as follows: given a set of n distinct elements, how many different selections of k ≤ n items out of this set are there? The answer depends on the type of selecting you are doing. The selection can be made with or without replacement. ‘Without replacement’ means that a selected item is put aside and cannot be selected again, while ‘with replacement’ means that a selected item remains available in the set to be (potentially) selected again. Furthermore it matters whether the order is important. If the order is important then you count two selections that consist of the same items but contain those items in a different order as two different selections, while if the order is not important then you consider all selections consisting of the same items as one selection only.

Example 2.2.4. Consider the set {a, b, c} and suppose we would like to select 2 items out of this set. (Check the claims below for yourself!) First consider selections with replacement. If the order is important then all possible selections are

(a, a), (a, b), (a, c), (b, a), (b, b), (b, c), (c, a), (c, b), (c, c). (2.3)

If the order is not important then there are fewer possible selections, as we consider (a, b) and (b, a) to be the same selection, and analogously for (a, c) and (c, a), and for (b, c) and (c, b). We are hence only left with

{a, a}, {a, b}, {a, c}, {b, b}, {b, c}, {c, c}.

Next consider selections without replacement.
If the order is important then all possible selections are

(a, b), (a, c), (b, a), (b, c), (c, a), (c, b) (2.4)

while if the order is not important we again do not distinguish between (a, b) and (b, a), (a, c) and (c, a), (b, c) and (c, b), so that we end up with the possible selections

{a, b}, {a, c}, {b, c}. (2.5)

Here are some formulae (without proof) for computing how many selections are possible.

Result 2.2.5. Consider the question of how many different selections of k elements out of a set of n distinct items we can make (k ≤ n). The answer is given in the following table:

                        Order important: yes        Order important: no
Without replacement:    n!/(n − k)!   (1)           (n choose k)   (2)
With replacement:       n^k   (3)                   (complicated!)

If the order is important we typically talk about ‘permutations’, otherwise about ‘combinations’.

Example 2.2.6 (Example 2.2.4 cont’d). Let’s check that the formulae from Result 2.2.5 are consistent with the above example. We have n = 3 and k = 2. For (2.3) we get indeed n^k = 3^2 = 9 possible selections. For (2.4) we get indeed

n!/(n − k)! = 3!/1! = 6

possible selections, and for (2.5) we get indeed

(n choose k) = (3 choose 2) = 3!/(2! · 1!) = 3

possible selections.

Remark 2.2.7. If we take k = n in the above Result 2.2.5 we effectively look at selecting all items from the set of n distinct items. In that case we see that there are n!/0! = n! different ways to select all items from the set of n items without replacement and where the order is important. If the order is not important (but still without replacement) there is only (n choose n) = 1 way of selecting all items – indeed, if we select all items from the set and then do it again, then necessarily the second selection is just a reordered version of the first selection, and since the order is not important the second one is the same as the first one!

The above Result 2.2.5 deals only with sets of distinct items.
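The four cells of the table in Result 2.2.5 happen to correspond exactly to four selection-generating functions in Python's itertools module, so Example 2.2.4/2.2.6 can be verified by brute force. A quick sketch (the variable names are illustrative):

```python
from itertools import (product, permutations, combinations,
                       combinations_with_replacement)
from math import comb, factorial

items, k = ['a', 'b', 'c'], 2
n = len(items)

# With replacement, order important: n^k selections, cf. (2.3).
assert len(list(product(items, repeat=k))) == n ** k == 9
# Without replacement, order important: n!/(n-k)! selections, cf. (2.4).
assert len(list(permutations(items, k))) == factorial(n) // factorial(n - k) == 6
# Without replacement, order not important: (n choose k) selections, cf. (2.5).
assert len(list(combinations(items, k))) == comb(n, k) == 3
# With replacement, order not important (the 'complicated!' cell): 6 here.
assert len(list(combinations_with_replacement(items, k))) == 6
```

Listing the generated tuples (e.g. `list(permutations(items, 2))`) reproduces the explicit selections written out in Example 2.2.4.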
We conclude this section with one more very useful result (again without proof), which deals with a counting problem where the set contains duplicate items:

Result 2.2.8. Consider a set of n items, consisting of l subsets where each of these subsets consists of identical items (but the items are different between different subsets). Suppose that the first subset contains k1 (identical) items, the second one k2 (identical) items, ..., the l-th one kl (identical) items. (Hence k1 + k2 + · · · + kl = n.) Then the number of different permutations of all items of this set (i.e. the number of ways to select all n items without replacement and where the order matters) is given by:

n! / (k1! · k2! · . . . · kl!).

See Example 2.3.1 for an application illustrating this result.

2.3 Examples

The material in this chapter is typically best understood by going through examples. Here are a bunch.

Example 2.3.1. An explorer has rations for two weeks, consisting of 4 cans of Tuna, 7 cans of Spam, and 3 cans of Beanie Weenies. If he opens one can each day, in how many different ways can he consume the rations? This is an application of Result 2.2.8. We have in total n = 14 cans. If all the cans were different there would be 14! = 87178291200 different ways to consume his rations (cf. Remark 2.2.7). However, in reality there are far fewer different ways, because given a certain choice we can swap for instance two cans of Tuna in that choice and end up with exactly the same choice! This wouldn’t be possible if all cans were different. Now, the collection of cans consists of l = 3 subsets of identical items (the Tuna, Spam and Beanie Weenies). So k1 = 4, k2 = 7 and k3 = 3, and the number of different ways in which he can consume his rations is

14! / (4! · 7! · 3!) = 120120.

Example 2.3.2. A fair coin is tossed 10 times. What is the probability of: a) exactly 5 heads, b) at least 9 heads, c) less than 9 heads.
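The count in Example 2.3.1 above can be double-checked with Python's math module; a quick sketch applying Result 2.2.8 (the names `groups` and `ways` are illustrative):

```python
from math import factorial, comb

# Multinomial count from Result 2.2.8: n = 14 cans made up of
# groups of 4 (Tuna), 7 (Spam) and 3 (Beanie Weenies) identical cans.
n, groups = 14, [4, 7, 3]
assert sum(groups) == n

ways = factorial(n)
for k in groups:
    ways //= factorial(k)   # divide out k! for each group of identical cans

print(ways)  # 120120

# Equivalent view: choose positions for each kind of can in turn.
assert ways == comb(14, 4) * comb(10, 7) * comb(3, 3)
```

The second assertion illustrates why Result 2.2.8 holds: first pick 4 of the 14 days for Tuna, then 7 of the remaining 10 for Spam, and the last 3 days are forced.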
For the solution, first note that as the coin is fair this is indeed an experiment with equally likely outcomes. The sample space Ω consists of all possible sequences of length 10, each entry filled with either H or T. Hence, the ‘multiplication rule’ Result 2.2.1 tells us that #Ω = 2^10 = 1024. For a), the event A = {5 heads} consists of all sequences of length 10 with 5 entries filled with an H and the other entries with a T. Hence, #A equals the number of ways in which we can select 5 out of 10 positions to put our H in (the other entries are then automatically filled with a T as that’s the only other option). Applying formula (2) from Result 2.2.5 yields

#A = (10 choose 5) = 252.

So using (2.2) we find

P({5 heads}) = 252/1024 = 63/256.

Alternatively we could apply Result 2.2.8 to count the elements in A. We are interested in how many ways there are to fill a sequence of length 10 with 5 heads and 5 tails. Or, put differently, in how many permutations there are of a set containing 5 heads and 5 tails. Using Result 2.2.8 with n = 10, l = 2, k1 = 5, k2 = 5 we find that there are

10! / (5! · 5!) = 252

permutations, reassuringly the same number we found above. For b), we may write the event A = {at least 9 heads} as A = {9 heads} ∪ {10 heads} and since these events are disjoint (i.e. their intersection is empty) we have in fact that

#A = #{9 heads} + #{10 heads} = (10 choose 9) + (10 choose 10) = 11

(using the same counting method as in part a)) so that again by (2.2) we find

P({at least 9 heads}) = 11/1024.

Finally for c), given the work we did above already it is a good idea here to use the ‘complementary law’ (cf. Result 1.4.2 in Ch. 1) since {less than 9 heads}^c = {at least 9 heads} and hence

P({less than 9 heads}) = 1 − P({at least 9 heads}) = 1 − 11/1024 = 1013/1024,

where we used the result from b).

Example 2.3.3. Three fair dice are thrown.
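Returning for a moment to Example 2.3.2 above: its three answers can be confirmed with math.comb and exact rational arithmetic (a sketch; variable names are illustrative):

```python
from math import comb
from fractions import Fraction

n_omega = 2 ** 10   # all H/T sequences of length 10, so #Omega = 1024

p_exactly_5 = Fraction(comb(10, 5), n_omega)                  # part a)
p_at_least_9 = Fraction(comb(10, 9) + comb(10, 10), n_omega)  # part b)
p_less_than_9 = 1 - p_at_least_9                              # part c), complementary law

assert p_exactly_5 == Fraction(63, 256)
assert p_at_least_9 == Fraction(11, 1024)
assert p_less_than_9 == Fraction(1013, 1024)
```

Using Fraction keeps the answers exact, so they can be compared directly with the hand-computed fractions rather than with rounded decimals.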
Compute the probability that: a) the throw is (1, 1, 1), b) exactly two of the dice show 2 dots, c) the sum of the dice equals 16. First again note that as the dice are fair, all outcomes are equally likely. The sample space Ω consists of all sequences of length 3, each entry a natural number between 1 and 6. By the ‘multiplication rule’ Result 2.2.1 we have #Ω = 6^3 = 216. For a), the event A = {(1, 1, 1)} contains just one outcome and hence using (2.2)

P({the throw is (1, 1, 1)}) = 1/216.

For b), the event A = {exactly two of the dice show 2 dots} consists of any of the outcomes (2, 2, ?), (2, ?, 2) and (?, 2, 2) where ? is a number unequal to 2. By the ‘multiplication rule’ Result 2.2.1 there are 1 · 1 · 5 possibilities of the type (2, 2, ?), 1 · 5 · 1 possibilities of the type (2, ?, 2) and 5 · 1 · 1 possibilities of the type (?, 2, 2). Hence #A = 1 · 1 · 5 + 1 · 5 · 1 + 5 · 1 · 1 = 15 and thus by (2.2)

P({exactly two of the dice show 2 dots}) = 15/216 = 5/72.

For c), the sum of the dice is 16 exactly if we either have a 6, a 6 and a 4, or a 6, a 5 and a 5. Of course, this can happen in several different ways: the first case corresponds to one of the outcomes (6, 6, 4), (6, 4, 6) and (4, 6, 6), while the second case corresponds to one of the outcomes (6, 5, 5), (5, 6, 5) and (5, 5, 6). Hence

A = {the sum of the dice equals 16} = {(6, 6, 4), (6, 4, 6), (4, 6, 6), (6, 5, 5), (5, 6, 5), (5, 5, 6)}

so that #A = 6 and thus by (2.2)

P({the sum of the dice equals 16}) = 6/216 = 1/36.

An alternative way to count the elements in A is the following. If we would like to know how many outcomes there are containing a 6, a 6 and a 4 we could apply Result 2.2.8 with n = 3, l = 2, k1 = 2 and k2 = 1. This yields

3! / (2! · 1!) = 6/2 = 3

possible outcomes. In a similar way all outcomes consisting of a 6, a 5 and a 5 could be counted, which also gives 3, so that using this method we also arrive at #A = 6.

Example 2.3.4. Consider an urn containing 3 green, 4 blue and 5 red balls.
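Example 2.3.3 above is small enough to brute-force: enumerate all 216 equally likely throws with itertools.product and apply Result 2.1.1 directly. A sketch (the helper `prob` is illustrative, not part of the notes):

```python
from itertools import product
from fractions import Fraction

omega = list(product(range(1, 7), repeat=3))   # all 6^3 = 216 equally likely throws
assert len(omega) == 216

def prob(event):
    """Equally likely outcomes, Result 2.1.1: P(A) = #A / #Omega."""
    return Fraction(sum(1 for w in omega if event(w)), len(omega))

p_a = prob(lambda w: w == (1, 1, 1))   # part a)
p_b = prob(lambda w: w.count(2) == 2)  # part b): exactly two entries equal to 2
p_c = prob(lambda w: sum(w) == 16)     # part c)

assert p_a == Fraction(1, 216)
assert p_b == Fraction(5, 72)
assert p_c == Fraction(1, 36)
```

Exhaustive enumeration like this is a handy sanity check for any small equally-likely-outcomes computation done by hand.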
You take two balls out, without replacement (i.e. you don’t put the first ball back). Compute the probability of getting: a) two blue balls, b) the first ball red and the second ball not red, c) the second ball green. This is again an experiment with equally likely outcomes. The sample space Ω consists of ordered pairs of balls (think of the balls as distinguishable, e.g. numbered 1 to 12). As there are 12 balls in total and the drawing is without replacement, there are in total 12 · 11 = 132 possible pairs, i.e. #Ω = 132. (This can be seen from either the ‘multiplication rule’ Result 2.2.1 or alternatively from formula (1) in Result 2.2.5 with n = 12 and k = 2, which yields 12!/10! = 12 · 11 = 132.) For a), as there are 4 blue balls in the urn we have

#{both balls are blue} = 4 · 3 = 12

(again either by the ‘multiplication rule’ Result 2.2.1 or by formula (1) in Result 2.2.5 with n = 4 and k = 2) and hence by (2.2)

P({both balls are blue}) = 12/132 = 1/11.

For b), the first ball has to be red and the second not. For the second ball there are then 3 + 4 = 7 possibilities that are not red. This yields, again by the ‘multiplication rule’ Result 2.2.1:

#{first ball red, second ball not red} = 5 · 7 = 35

so that by (2.2)

P({first ball red, second ball not red}) = 35/132.

Finally for c), this is slightly more tricky as the number of possibilities for the second ball to be green depends on what the first ball was: if the first ball was green then there is one possibility fewer than when the first ball was not green. One way around this problem is to note that we have

{second ball green} = {first ball green, second ball green} ∪ {first ball not green, second ball green}

and as these two events are disjoint (i.e. their intersection is empty) we have

#{second ball green} = #{first ball green, second ball green} + #{first ball not green, second ball green}.
The point is that we can easily compute how many elements both events on the right hand side have; namely, using the same argument as above we have

#{first ball green, second ball green} = 3 · 2 = 6

and

#{first ball not green, second ball green} = 9 · 3 = 27.

Hence #{second ball green} = 6 + 27 = 33 and thus

P({second ball green}) = 33/132 = 1/4.

Chapter 3

The power of conditioning

In this chapter we will introduce the concept of conditioning on an event, which means that we consider the situation of performing the experiment given that we know in advance that a certain event will occur. This is a very powerful concept which leads to several important results in probability theory such as the law of total probability and Bayes’ Theorem.

3.1 Conditional probabilities

To introduce the concept, consider the example of throwing a (fair) dice. Hence the sample space is Ω = {1, 2, 3, 4, 5, 6}. As we have done many times before, since the dice is fair each of the outcomes is equally likely and hence pi := P({i}), the probability that the outcome is i, equals 1/6 for any i = 1, . . . , 6. Define the event B = {the outcome is at least 5} = {5, 6}. Now consider the experiment of throwing the dice given that we know that the event B occurs. If we know in advance that the event B occurs, then we know in advance that the outcome of the experiment will be either 5 or 6. Define

qi := the probability that the outcome is i given that we know that the event B occurs,

still for i = 1, . . . , 6. Now, the outcomes 1, 2, 3 and 4 cannot occur, since they are not elements of B and we know that B occurs. Hence q1 = q2 = q3 = q4 = 0. The outcomes 5 and 6 can both occur, and they are still equally likely but have become a lot more likely now that we know it must definitely be one of these two, so that q5 = q6 = 1/2. That is to say, we obtain q5 and q6 by taking p5 and p6 and scaling these up by a factor 3. This factor is equal to 1/P(B) = 1/(1/3) = 3.
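The rescaling just described — zero out the outcomes not in B, scale the rest up by 1/P(B) — is easy to carry out mechanically. A sketch for the fair-dice example (the names `p`, `q` match the notation above; the code itself is illustrative):

```python
from fractions import Fraction

p = {i: Fraction(1, 6) for i in range(1, 7)}   # fair dice pmf: p_i = 1/6
B = {5, 6}                                     # the conditioning event

P_B = sum(p[i] for i in B)                     # P(B) = 1/3

# Conditional probabilities q_i: outcomes outside B get 0,
# outcomes in B are scaled up by the factor 1/P(B).
q = {i: (p[i] / P_B if i in B else Fraction(0)) for i in p}

assert P_B == Fraction(1, 3)
assert q[1] == q[2] == q[3] == q[4] == 0
assert q[5] == q[6] == Fraction(1, 2)
assert sum(q.values()) == 1                    # q is again a probability mass function
```

The final assertion is worth noting: the scaled probabilities sum to 1 again, which is exactly why the factor 1/P(B) is the right one.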
So, the effect of knowing in advance that the event B occurs, or conditioning on the event B, is that the probabilities of the 6 possible outcomes are transformed as follows:

(1/6, 1/6, 1/6, 1/6, 1/6, 1/6) −→ (0, 0, 0, 0, 1/2, 1/2).

This transformation is obtained according to the following principle: (i) all outcomes not in the event B are assigned probability 0 (as they could not possibly occur), (ii) the probability of each outcome in the event B is scaled up by a factor 1/P(B) (to reflect the fact that they have become more likely to occur than in the original experiment without conditioning). As usual we don’t want to work with outcomes only but with events, and hence we would like to translate this principle to the level of events. It turns out the right way to do this is through the following formula: for any event A,

the probability that A occurs given that we know that B occurs = P(A ∩ B) / P(B). (3.1)

Let’s do some quick sanity checks of this formula, by considering what it gives us in three different cases for the event A. (A Venn diagram in each case should be helpful!)

• First suppose that A only consists of outcomes that are not in B (or, A ∩ B = ∅). Hence, following (i) above, each outcome in A should be assigned probability 0 and therefore the event A should have probability 0 of occurring. This is indeed also what formula (3.1) yields. Namely, using that A ∩ B = ∅ we see:

P(A ∩ B) / P(B) = P(∅) / P(B) = 0,

where we used that P(∅) = 0 (cf. Remark 1.4.3 in Ch. 1).

• Next consider the case that A only contains outcomes that are in B, i.e. A ⊆ B. Then following (ii) we would expect that A has a probability of occurring that is the old probability P(A) scaled up by a factor 1/P(B), i.e. P(A)/P(B). This is indeed what we get. Namely, using (3.1) with the fact that A ⊆ B implies A ∩ B = A yields in this case

P(A ∩ B) / P(B) = P(A) / P(B).

• Finally, the only other case is when A contains both outcomes that are in B and outcomes that are not in B. In that case, following (i) we should expect that all outcomes in A that are not in B get probability 0. Following (ii), the outcomes in A that are also in B (which is exactly all outcomes in A ∩ B) should get a new probability that consists of the old probability, i.e. P(A ∩ B), scaled up by a factor 1/P(B). So we would expect to find

0 + (1/P(B)) · P(A ∩ B) = P(A ∩ B) / P(B)

and reassuringly that’s exactly formula (3.1)!

Let us now put this in a neat definition:

Definition 3.1.1 (Conditional probability). Let B ⊆ Ω be an event with P(B) > 0. Take any other event A ⊆ Ω. We use the notation P(A | B) for the (conditional) probability of A given B, or, more elaborately, the probability that the event A occurs given that we know in advance that the event B occurs. This probability is given by the formula:

P(A | B) = P(A ∩ B) / P(B). (3.2)

Note that we can rewrite this formula in a form called the ‘multiplicative law’:

P(A ∩ B) = P(A | B) P(B). (3.3)

Example 3.1.2. Let’s do the good old dice example one more time, so Ω = {1, 2, 3, 4, 5, 6}. Define the events A = {# dots is odd} = {1, 3, 5} and B = {# dots is at least 4} = {4, 5, 6}. Then the probability of A given B is

P(A | B) = P(A ∩ B) / P(B) = P({5}) / P({4, 5, 6}) = (1/6) / (3/6) = 1/3.

Maybe we could have guessed this result: since B consists of 3 (equally likely) outcomes of which 2 are even and 1 is odd, it makes sense that if we know in advance that B occurs then the probability of an odd outcome is 1/3.

Example 3.1.3. Two cards are taken from a deck of cards (with the usual 52 cards in it). What is the probability that the first card is the Queen of Spades and the second is the King of Spades? Note you could of course solve this using the techniques from the previous chapter!
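Definition (3.2) applied to Example 3.1.2 can also be checked mechanically by counting outcomes, since for a fair dice all outcomes are equally likely. A sketch (the helper `prob` is illustrative):

```python
from fractions import Fraction

omega = {1, 2, 3, 4, 5, 6}
A = {1, 3, 5}   # number of dots is odd
B = {4, 5, 6}   # number of dots is at least 4

def prob(event):
    # Fair dice: equally likely outcomes, so P(E) = #E / #Omega.
    return Fraction(len(event), len(omega))

cond = prob(A & B) / prob(B)          # definition (3.2): P(A|B) = P(A ∩ B)/P(B)
assert cond == Fraction(1, 3)
assert prob(A & B) == cond * prob(B)  # the multiplicative law (3.3)
```

Python's set intersection operator `&` plays the role of ∩ here, which makes the code read very close to the formulas.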
If we first do that, we find that the event we’re interested in contains only 1 element while the whole sample space Ω contains 52 · 51 = 2652 possible outcomes, so the required probability is 1/2652. We may also use our new technique of conditioning. Namely, define the events B = {first card is Queen of Spades} and A = {second card is King of Spades}. Then P(B) = 1/52 and P(A | B) = 1/51 (because if we know in advance that B occurs, i.e. that the first card drawn was the Queen of Spades, then for the second draw there are 51 cards left of which the King of Spades is one). Hence by the ‘multiplicative law’ (3.3) we get

P(A ∩ B) = P(A | B) P(B) = (1/51) · (1/52) = 1/(51 · 52) = 1/2652.

As A ∩ B = {first card is Queen of Spades and second card is King of Spades} we have found again that the requested probability is 1/2652.

Remark 3.1.4. What we have actually done in (3.2) is define a mapping P′ that assigns to any possible event A the number

P′(A) := P(A | B) = P(A ∩ B) / P(B) ∈ [0, 1].

You can in fact verify that this mapping P′ satisfies both conditions of Definition 1.3.2 from Ch. 1, and hence P′ is a probability measure. It is a probability measure ‘concentrated’ on only the event B ⊆ Ω rather than on the whole sample space Ω, in the sense that P′(B) = 1.

3.2 Independent events

Take an event B with P(B) > 0. If the probability of an event A does not alter at all when we condition on B, i.e. when

P(A) = P(A | B) (3.4)

we say that the event A is independent of the event B. If we plug the formula for P(A | B) from (3.2) into (3.4) we see that

P(A) = P(A | B) ⇐⇒ P(A) = P(A ∩ B) / P(B) ⇐⇒ P(A) P(B) = P(A ∩ B). (3.5)

We typically take the last equation above, i.e. P(A) P(B) = P(A ∩ B), as the definition of independence rather than (3.4). (As we saw in (3.5) they are equivalent – at least as long as P(B) > 0.) The reason for this is the following. If P(A) > 0 we also have

P(A) P(B) = P(A ∩ B) ⇐⇒ P(B) = P(A ∩ B) / P(A) = P(B ∩ A) / P(A) = P(B | A).

If we concentrate on the last equation, i.e. P(B) = P(B | A), we see that it means that B is independent of A. Indeed it tells us that the probability of B does not alter when we condition on the event A. So, for any events A and B with P(A) > 0 and P(B) > 0 we have that

A is independent of B ⇐⇒ P(A) P(B) = P(A ∩ B) ⇐⇒ B is independent of A.

In particular this tells us that independence can never happen in only one direction: if A is independent of B then necessarily B is also independent of A (and vice versa)! Therefore we simply say that two events A and B are independent, meaning both that A is independent of B and that B is independent of A. Putting this in a nice definition:

Definition 3.2.1 (Independence). Two events A and B are said to be independent if it holds that

P(A) P(B) = P(A ∩ B). (3.6)

If P(A) > 0 and P(B) > 0 we have in particular

A and B are independent ⇐⇒ P(A) = P(A | B) ⇐⇒ P(B) = P(B | A). (3.7)

Remark 3.2.2 (Disjoint vs independent: warning!!). A very common mistake is to confuse the concepts of independence and being disjoint. However, they are really completely different. By definition A and B are disjoint exactly if A ∩ B = ∅, cf. Definition 1.2.2 in Ch. 1. This is clearly a very different definition from (3.6)! In fact, if A and B are disjoint then P(A ∩ B) = P(∅) = 0. Hence for them to be independent as well, equation (3.6) tells us that we would need to have P(A) = 0 or P(B) = 0, and this is (of course) usually not the case.

Also intuitively we can understand why independence and being disjoint cannot both hold. Two events are independent when the occurrence of the one event does not give you any extra information about how likely it is that the other event will occur as well. If the events are disjoint then you do have extra information: if the one event occurs then you know for sure the other event won’t occur, as they are disjoint and hence don’t share any outcomes!

Example 3.2.3.
A fair coin is tossed twice. Consider the events A = {1st throw shows heads} and B = {2nd throw shows tails}. Are A and B independent? It is an experiment with equally likely outcomes and the sample space is Ω = {HH, TH, HT, TT}. We have A = {HH, HT} and B = {HT, TT}, and hence A ∩ B = {HT}. We find

P(A) = 2/4 = 1/2, P(B) = 2/4 = 1/2 and P(A ∩ B) = 1/4,

and hence it indeed holds that P(A ∩ B) = P(A) P(B), as both sides equal 1/4. We therefore conclude that A and B are indeed independent. Intuitively, this makes a lot of sense. A and B are independent when the occurrence of the one event does not give you any extra information about the occurrence of the other. Now A only concerns the outcome of the first throw and B only concerns the outcome of the second throw, so these events give no information about each other.

3.3 Partitions

In the remaining part of this chapter we will use the concept of conditioning introduced above together with the concept of ‘partitions’ to derive two very important results in probability theory: the law of total probability and Bayes’ Theorem. In this section we first introduce the concept of partitions. The idea is very simple: imagine your sample space as a sheet of paper. Take a pair of scissors and cut the sheet up into different pieces. Then the collection of these pieces of paper is what we call a ‘partition’ of the original sheet of paper. The two crucial properties are: (i) the pieces do not overlap, (ii) putting all the pieces together we find the whole sheet of paper back again. The mathematical definition of a partition of a sample space (the ‘sheet of paper’) follows this image exactly: we take a collection of events E1, . . . , Em (the ‘pieces of paper’) such that they do not overlap, i.e. Ei ∩ Ej = ∅ for all i and j such that i ≠ j, and such that taking them all together we find the sample space back again, i.e. E1 ∪ E2 ∪ . . . ∪ Em = Ω.

Definition 3.3.1 (Partition).
Consider a sample space Ω. A partition is a collection of (non-empty) events E1, . . . , Em for some m ≥ 2 that satisfies the following two conditions: (i) Ei ∩ Ej = ∅ for all i = 1, . . . , m and j = 1, . . . , m with i ≠ j, (ii) E1 ∪ E2 ∪ . . . ∪ Em = Ω. Any collection of events that satisfies point (i) above is said to be mutually disjoint. So a collection of events is a partition exactly if they are mutually disjoint and if their union is the whole sample space. A partition can also very well consist of infinitely many events; in this case the definition gets extended in the obvious way: (i’) Ei ∩ Ej = ∅ for all i ≥ 1 and j ≥ 1 with i ≠ j, (ii’) E1 ∪ E2 ∪ E3 ∪ . . . = Ω. See also the two top diagrams in Figure 3.1.

Remark 3.3.2. If you think about it a bit, the following also holds true: a collection of events E1, . . . , Em is a partition exactly if any outcome ω in the sample space Ω belongs to exactly one of the events E1, . . . , Em. In terms of the sheet of paper: every molecule of the sheet of paper ends up in exactly one of the pieces (ignoring that in reality some will get stuck to the scissors etc...).

Example 3.3.3. Consider the sample space Ω = {ω1, ω2, ω3}. Which of the following collections of events are partitions? a) E1 = {ω1, ω2}, E2 = {ω2, ω3} b) E1 = {ω1}, E2 = {ω3} c) E1 = {ω1, ω2}, E2 = {ω3}. For a), this is not a partition because E1 ∩ E2 = {ω2} and this intersection should be empty according to Definition 3.3.1 (i). Or, to say the same thing another way, the element ω2 is in both E1 and E2 and this is not allowed (cf. Remark 3.3.2). For b), this is not a partition either because E1 ∪ E2 = {ω1, ω3} and this union should be equal to Ω (Definition 3.3.1 (ii)). Using Remark 3.3.2 again, it is not a partition since ω2 is not an element of either E1 or E2. For c), yes this is a partition: we have E1 ∩ E2 = ∅ and E1 ∪ E2 = Ω so the conditions of Definition 3.3.1 are indeed satisfied.
Alternatively, following Remark 3.3.2, indeed each of the possible outcomes ω1, ω2 and ω3 is an element of either E1 or E2 (but not of both).

Remark 3.3.4. The simplest partition we can think of is the following: take any event B with ∅ ≠ B ≠ Ω. Then set E1 = B and E2 = B^c. This is indeed a partition since E1 ∩ E2 = B ∩ B^c = ∅ and E1 ∪ E2 = B ∪ B^c = Ω (as is obvious from the definition of the complement, see Definition 1.2.1 in Ch. 1), so that the conditions of Definition 3.3.1 are indeed satisfied.

Remark 3.3.5. The number of partitions that are possible when the sample space consists of n outcomes is called a Bell number. (We don’t allow the empty set to be one of the events in a partition; if you did allow that, you could produce an infinite number of partitions for every sample space by just adding empty sets all the time.) For n = 2 it is 2, for n = 3 it is 5, for n = 4 it is 15, for n = 5 it is 52 and for n = 6 it is 203. So we see, as you probably had guessed already, that the number of possible partitions grows very quickly as the sample space gets larger.

We conclude with two results concerning computing probabilities with partitions.

Result 3.3.6 (Additive law for mutually disjoint events). For any collection of mutually disjoint events E1, E2, . . . , Em, i.e. events such that Ei ∩ Ej = ∅ for all i = 1, . . . , m and j = 1, . . . , m with i ≠ j (cf. Definition 3.3.1), we have

P(E1 ∪ E2 ∪ . . . ∪ Em) = P(E1) + P(E2) + . . . + P(Em).

Proof. This is essentially an extension of the ‘additive law’ (cf. Result 1.4.2 in Ch. 1) to more than two events, which is intuitively obvious: the probability of the union of a collection of events that do not overlap at all should simply be the sum of the probabilities of each of these events.
However, for a truly sound mathematical proof you would need to use the principle of induction (which I believe you will be doing next semester), so for now we will just be happy with the fact that it is an obvious result!

Result 3.3.7. Take any partition E1, E2, . . . , Em of a sample space Ω. Then we have that

P(E1) + P(E2) + . . . + P(Em) = 1.

Proof. As E1, E2, . . . , Em forms a partition, we know that these events are mutually disjoint (cf. Definition 3.3.1). Hence by the ‘additive law for mutually disjoint events’ above we have that

P(E1) + P(E2) + . . . + P(Em) = P(E1 ∪ E2 ∪ . . . ∪ Em). (3.8)

Furthermore, we also know from the definition of a partition that E1 ∪ E2 ∪ . . . ∪ Em = Ω, and we have by definition that P(Ω) = 1 (cf. Definition 1.3.2). So, plugging these two facts into (3.8), it indeed follows that

P(E1) + P(E2) + . . . + P(Em) = P(E1 ∪ E2 ∪ . . . ∪ Em) = P(Ω) = 1.

3.4 The law of total probability

Let us now combine the two ingredients, namely conditional probabilities and partitions, into a formula known as the ‘law of total probability’. Take any event A and a partition consisting of the events E1, . . . , Em. Then we have the following:

A = (A ∩ E1) ∪ (A ∩ E2) ∪ . . . ∪ (A ∩ Em). (3.9)

A way to see this is as follows. Take your sheet of paper again, and colour somewhere on the sheet a shape to represent your event A in a sample space. Then cut your piece of paper into m pieces. Each of the pieces of paper contains part of the coloured shape (or, of course, there may be none of it on some of the pieces). The effect is that we have cut A up into m different parts, namely the parts on each of the pieces of paper. Hence we could write something like:

A = Part of A on piece 1 + Part of A on piece 2 + . . . + Part of A on piece m,

which is exactly what (3.9) is saying!
Namely, translating this back to the sample space, we break the event A up into different parts by considering the parts A ∩ E1 (all outcomes in A that are also in E1), A ∩ E2 (all outcomes in A that are also in E2), ..., A ∩ Em (all outcomes in A that are also in Em), and then formula (3.9) is saying nothing but “we can find the event A back by collecting all the parts together”. See also Figure 3.1.

Remark 3.4.1 (Warning!). For equation (3.9) to hold it is crucial that the events E1, . . . , Em form a partition of the sample space. If that’s not true, so if the events are not mutually disjoint and/or their union is not the whole sample space, then in general equation (3.9) will not be true!

The law of total probability now follows quite easily from (3.9). First note that the events A ∩ E1, A ∩ E2, . . . , A ∩ Em are mutually disjoint (recall Definition 3.3.1). Indeed, for any i = 1, . . . , m and j = 1, . . . , m with i ≠ j we have that

(A ∩ Ei) ∩ (A ∩ Ej) = A ∩ Ei ∩ A ∩ Ej = A ∩ A ∩ Ei ∩ Ej = A ∩ Ei ∩ Ej = A ∩ ∅ = ∅

(note that in the first three equalities above nothing special happens, we are just working things out a bit and using that it doesn’t matter in which order you take intersections; then we use that as the events E1, . . . , Em form a partition they are mutually disjoint, in particular Ei ∩ Ej = ∅ for the values of i and j we have chosen here). Having established that the events A ∩ E1, A ∩ E2, . . . , A ∩ Em are mutually disjoint, we get from (3.9) and the ‘additive law for mutually disjoint events’ (cf. Result 3.3.6) that

Figure 3.1: Left top: a grey square. Oh no, a sample space. Right top: the sample space ‘cut up’ into four parts, thereby forming a partition consisting of the events E1, E2, E3 and E4. Left bottom: an event A in the sample space. Right bottom: as the sample space gets ‘cut up’ into the same four events E1, . . . , E4, each of them contains a portion of the event A: E1 contains the portion A ∩ E1, E2 contains the portion A ∩ E2 etc.

P(A) = P((A ∩ E1) ∪ (A ∩ E2) ∪ . . . ∪ (A ∩ Em)) = P(A ∩ E1) + P(A ∩ E2) + . . . + P(A ∩ Em).

Finally, using the ‘multiplicative law’ (cf. Definition 3.1.1) we may also write this as

P(A) = P(A | E1)P(E1) + P(A | E2)P(E2) + . . . + P(A | Em)P(Em) (3.10)

and this is exactly the formula known as the ‘law of total probability’. It may at first seem a bit strange that it would somehow be useful to express the very short and simple expression P(A) as the much more complicated right hand side of (3.10), but there are plenty of examples where the easiest (or even the only) way to compute P(A) is by using this formula!

Recall from Remark 3.3.4 that the simplest partition consists of two events only, and is obtained by taking any event B and setting E1 = B and E2 = B^c. If we take this particular partition, then formula (3.10) simplifies to P(A) = P(A | B)P(B) + P(A | B^c)P(B^c). So:

Result 3.4.2 (Law of total probability). Take a partition of the sample space consisting of the events E1, . . . , Em. Then we have for any event A the law of total probability:

P(A) = P(A | E1)P(E1) + P(A | E2)P(E2) + . . . + P(A | Em)P(Em). (3.11)

In the special case where the partition consists of only two elements, so E1 = B and E2 = B^c for some event B (cf. Remark 3.3.4), the formula becomes

P(A) = P(A | B)P(B) + P(A | B^c)P(B^c). (3.12)

Example 3.4.3. A company buys tyres from two suppliers, say supplier 1 and supplier 2. Suppose that 40% of all tyres come from supplier 1, and that the rate of defectives is 10% for supplier 1 and 5% for supplier 2. What is the probability that a randomly chosen tyre is defective? Define the event A = {tyre is defective}, so we want to compute P(A). There is no ‘direct’ way to do this, since the rate of defectives is different for the two suppliers.
So we need to 'split' into the different cases that the tyre comes from supplier 1 resp. supplier 2. For this, define the event B = {tyre comes from supplier 1}. Note that Bᶜ = {tyre does not come from supplier 1} = {tyre comes from supplier 2}. Now, since 40% of all tyres come from supplier 1 we know that P(B) = 0.4. Hence by the 'complementary law' P(Bᶜ) = 0.6. (Or use that Bᶜ = {tyre comes from supplier 2} and since 40% comes from supplier 1, it must be that 60% comes from supplier 2 and hence P(Bᶜ) = 0.6.) Also, using what we know about the defective rates we see that

P(A | B) = P({tyre is defective} | {tyre comes from supplier 1}) = 0.1

and

P(A | Bᶜ) = P({tyre is defective} | {tyre comes from supplier 2}) = 0.05.

The law of total probability, with the partition E1 = B and E2 = Bᶜ, i.e. formula (3.12), now yields

P(A) = P(A | B)P(B) + P(A | Bᶜ)P(Bᶜ) = 0.1 · 0.4 + 0.05 · 0.6 = 0.07.

In the above solution we haven't specified the sample space. It is perfectly possible to do so if you like, for instance as follows. Of all tyres we are interested in two properties: who the supplier was and whether or not the tyre was defective. Hence we might write the outcome of the experiment of randomly choosing a tyre as (D, 1) (defective and supplier was 1), (D, 2) (defective and supplier was 2), (¬D, 1) (not defective and supplier was 1) and (¬D, 2) (not defective and supplier was 2). The sample space is then

Ω = {(D, 1), (D, 2), (¬D, 1), (¬D, 2)}.

(Note that this is not an experiment with equally likely outcomes and hence we can't conclude that each of the four outcomes has probability 1/4 of occurring!) The events A, B and Bᶜ above can then be written as A = {(D, 1), (D, 2)}, B = {(D, 1), (¬D, 1)} and Bᶜ = {(D, 2), (¬D, 2)}. The rest of the solution works exactly the same (of course).

Example 3.4.4. Suppose that in a semiconductor manufacturing plant the probability of chip failure depends on the level of contamination.
If the level of contamination is high then the probability of failure is 0.1. If the level of contamination is medium then the probability of failure is 0.01. Finally, if the level of contamination is low then the probability of failure is 0.001. Every chip is exposed to one level of contamination only. Suppose that in a particular production run 20% of the chips are subjected to high levels of contamination, 30% to medium levels and 50% to low levels of contamination. What is the probability a randomly selected chip fails?

We need to compute the probability of the event A = {the chip fails}. Again we cannot do this directly as it depends on the level of contamination the chip has been exposed to. Therefore, define the events E1 = {contamination is high}, E2 = {contamination is medium} and E3 = {contamination is low}. Note that E1, E2 and E3 together form a partition of the sample space. Indeed, following Remark 3.3.2, every chip we can randomly select must be a member of exactly one of these events (as we assumed that every chip is exposed to one level of contamination only). We are given that P(E1) = 0.2, P(E2) = 0.3 and P(E3) = 0.5. Furthermore we are given that P(A | E1) = 0.1, P(A | E2) = 0.01 and P(A | E3) = 0.001. Hence applying the law of total probability (3.11) we find

P(A) = P(A | E1)P(E1) + P(A | E2)P(E2) + P(A | E3)P(E3) = 0.1 · 0.2 + 0.01 · 0.3 + 0.001 · 0.5 = 0.0235.

Analogously to what we did above, we could again be more precise in this case and define a probability space. Any chip is either defective (D) or not (¬D), and was subjected to a contamination level H, M or L. This means the sample space can be written as

Ω = {(D, H), (D, M), (D, L), (¬D, H), (¬D, M), (¬D, L)}.

The events above now become A = {(D, H), (D, M), (D, L)}, E1 = {(D, H), (¬D, H)}, E2 = {(D, M), (¬D, M)} and E3 = {(D, L), (¬D, L)}. In particular, you could now check in a more rigorous way that the events E1, E2, E3 indeed form a partition, i.e.
satisfy the conditions of Definition 3.3.1!

3.5 Bayes' Theorem

Finally in this chapter we look at Bayes' Theorem∗. This is quite a famous one, and used in several ways in both probability and statistics. It is quite easy to prove, using the law of total probability from the previous section. Take any event A with P(A) > 0, and a partition consisting of the events E1, . . . , Em. Take any i = 1, 2, . . . , m. Then we have the following:

P(Ei | A) = P(Ei ∩ A) / P(A) = P(A | Ei)P(Ei) / P(A) = P(A | Ei)P(Ei) / [P(A | E1)P(E1) + P(A | E2)P(E2) + . . . + P(A | Em)P(Em)].

Note that the first equality uses the definition of conditional probability (cf. Definition 3.1.1), the second uses the 'multiplicative law' (cf. Definition 3.1.1) and the final equality uses the law of total probability (cf. Result 3.4.2). The equality between the leftmost and rightmost sides is called Bayes' Theorem:

Result 3.5.1 (Bayes' Theorem). Let A be an event with P(A) > 0. Take a partition of the sample space consisting of the events E1, . . . , Em. Then we have for any i = 1, 2, . . . , m:

P(Ei | A) = P(A | Ei)P(Ei) / [P(A | E1)P(E1) + P(A | E2)P(E2) + . . . + P(A | Em)P(Em)]. (3.13)

In particular, if the partition consists of two events only, i.e. E1 = B and E2 = Bᶜ for some event B (cf. Remark 3.3.4), then this formula boils down to

P(B | A) = P(A | B)P(B) / [P(A | B)P(B) + P(A | Bᶜ)P(Bᶜ)] (3.14)

(if we take i = 1 in the above formula) and

P(Bᶜ | A) = P(A | Bᶜ)P(Bᶜ) / [P(A | B)P(B) + P(A | Bᶜ)P(Bᶜ)] (3.15)

(if we take i = 2 in the above formula).

Generally speaking, in our setting Bayes' Theorem is usually a good choice if you have to find the probability of some event A given an event B, and you only know the probability of B given A instead. That is, it helps to 'invert' the events in a conditional probability.
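Formulas (3.11) and (3.13) translate directly into a short computation. The following Python sketch (the function names are our own, not notation from these notes) encodes a partition as two lists, uses the law of total probability for the denominator, and applies it to the tyre example 3.4.3:

```python
# Law of total probability (3.11) over a partition E1, ..., Em.
def total_probability(p_A_given_E, p_E):
    """P(A) = sum_i P(A | Ei) P(Ei)."""
    return sum(pa * pe for pa, pe in zip(p_A_given_E, p_E))

# Bayes' Theorem (3.13): P(Ei | A) = P(A | Ei) P(Ei) / P(A).
def bayes(i, p_A_given_E, p_E):
    return p_A_given_E[i] * p_E[i] / total_probability(p_A_given_E, p_E)

# Example 3.4.3 (tyres): E1 = supplier 1, E2 = supplier 2, A = defective.
p_E = [0.4, 0.6]            # P(E1), P(E2)
p_A_given_E = [0.1, 0.05]   # P(A | E1), P(A | E2)
print(total_probability(p_A_given_E, p_E))  # P(A) = 0.07
# 'Inverting': given a defective tyre, probability it came from supplier 1.
print(bayes(0, p_A_given_E, p_E))           # = 0.04/0.07 = 4/7
```

Note how `bayes` only needs the 'forward' conditional probabilities P(A | Ei); the 'inverted' probability P(Ei | A) comes out, which is exactly the point of the theorem.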
Before we move on to an example, let us first prove that the 'complementary law' also holds for conditional probabilities, often a very useful result when working with Bayes' Theorem (the proof is left as an exercise):

∗ Named after the English mathematician Thomas Bayes

Result 3.5.2 (Complementary law for conditional probabilities). In a sample space Ω, let A and B be any events such that P(B) > 0. Then we have that P(A | B) = 1 − P(Aᶜ | B).

Example 3.5.3. A new medical test is designed to identify a rare illness. If somebody has the disease then the test correctly identifies this with probability 0.99. If somebody does not have the disease then the test correctly identifies this with probability 0.95. Suppose that the probability of having the illness is 0.00001. Given that for a randomly selected person the test is positive, what is the probability this person indeed has the disease?

For the solution, define the events A = {test is positive} and B = {person has disease}. Then we are asked to compute P(B | A). We are given that P(A | B) = 0.99, P(Aᶜ | Bᶜ) = 0.95 and P(B) = 0.00001. This is a situation in which we know P(A | B) and are asked to compute P(B | A). As pointed out above, Bayes' Theorem is then usually helpful. Formula (3.14) reads:

P(B | A) = P(A | B)P(B) / [P(A | B)P(B) + P(A | Bᶜ)P(Bᶜ)].

We see that in order to use this, in addition to what we know already we need to find P(Bᶜ) and P(A | Bᶜ). For the former we can simply use the 'complementary law' (cf. Result 1.4.2 in Ch. 1) to find P(Bᶜ) = 1 − 0.00001 = 0.99999. For the latter we use the above Result 3.5.2 to derive P(A | Bᶜ) = 1 − P(Aᶜ | Bᶜ) = 1 − 0.95 = 0.05. Plugging all this into the formula we find

P(B | A) = 0.99 · 0.00001 / (0.99 · 0.00001 + 0.05 · 0.99999) ≈ 0.0002.

If you think about it, would you be happy when your GP uses a test for a serious disease on you with the characteristic that when the test is positive there is only a 0.0002 probability you actually do have the disease? Nah.

What is most intriguing about this example is the following. If the probability that a randomly selected person has the disease increases from the 0.00001 we used above to 0.1, then you can verify that P(B | A) changes from approx 0.0002 to approx 0.69. That makes the test suddenly seem a lot more sensible, however the test itself hasn't changed, only the probability a person has the disease. So, Bayes' Theorem shows us that the quality of the test (as perceived by its users) does not only depend on how well the test is able to detect the disease but also on how large the percentage of the population having the disease is.

Finally, if you'd like you could again work with a sample space as well. In this case, what matters about the selected person is whether they have the disease (D) or not (H), and whether the test is positive (P) or not (N). So we could write the sample space as Ω = {(D, P), (D, N), (H, P), (H, N)} and express all events as subsets of this sample space.

Chapter 4

Random variables

So far we have introduced three fundamental concepts in probability theory: sample spaces, events and probability measures. In this and the next (final) chapter we discuss the last fundamental concept, namely random variables.

4.1 What are random variables?

The idea is that often you are not only interested in the outcome of the experiment in itself but (also) in some consequence that outcome has. For instance, in a betting game with dice you are typically not (only) interested in the outcome of the throw(s) but more in your overall win or loss.
Every outcome of the experiment yields a certain win or loss, and hence we can analyse the overall win or loss by introducing a mapping that assigns to every outcome the associated win or loss should that outcome indeed occur. That is to say, we are talking about a mapping from the sample space to (say) the real numbers. That is exactly what a random variable in general is:

Definition 4.1.1 (Random variables). A random variable is a mapping from the sample space Ω to the real numbers ℝ, usually denoted by X. That is, a random variable is a mapping

X : Ω → ℝ

so that every outcome ω ∈ Ω is assigned a real number X(ω) ∈ ℝ. We denote by R ⊆ ℝ the range of X, i.e. the values X can actually attain. (Sometimes the term 'image' is used rather than 'range'.)

Remark 4.1.2. In our case, where the sample space is countable – recall the discussion at the beginning of Ch. 1 – the range R will always be a countable subset of ℝ. In general, when this is the case X is called a discrete random variable. There are other types of random variables as well – which could occur if the sample space is not countable – but this is outside the scope of this course.

Remark 4.1.3. A random variable is in principle nothing more (or less) than a function with Ω as its domain. It is important to realise the following though. If we think back to our setup, which consists of an experiment with outcomes in Ω, then as long as the experiment is not yet performed we don't yet know what the outcome will be, and hence we also don't yet know what value the random variable will take! The random variable 'inherits' the uncertainty about the outcome of the experiment in the sense that it is uncertain what value it will take. In particular this also means that we can talk about the probability that a random variable takes a certain value etc. Hence the 'random' in the name! ;)

Example 4.1.4. Consider a betting game in which a coin is tossed.
When tails appears you receive £1 while if heads appears you have to pay £1. Clearly, more than in the outcome of the experiment itself, i.e. heads or tails, you are interested in the financial consequences. With Ω = {H, T}, if we define the mapping X : Ω → ℝ by setting

X(T) := 1 and X(H) := −1 (4.1)

then X represents the win you will make by playing this game. In this case the range of X is R = {−1, 1}. Note that since we don't know what outcome the experiment will have we also don't know which value the random variable X will take, as this value depends on the outcome. We only know (for sure) which value X will take after we have performed the experiment and know the outcome, so that we can apply the rule (4.1).

Example 4.1.5. Consider an experiment in which you roll three dice and you are interested in the sum of the dots of the three dice. We can see every outcome as a triple of numbers, each between 1 and 6. Hence, we may write

Ω = {(k1, k2, k3) | ki ∈ {1, 2, 3, 4, 5, 6} for i = 1, 2, 3}.

The random variable X giving us the sum of the dots of the three dice is then defined as the mapping X : Ω → ℝ given by

X((k1, k2, k3)) := k1 + k2 + k3 for every (k1, k2, k3) ∈ Ω.

Note that the possible values X can take make up the set {3, 4, . . . , 18} and hence the range of X is R = {3, 4, . . . , 18}.

4.2 Computing probabilities with random variables

When we have specified a probability measure P for our experiment then we know how likely each possible outcome of our experiment is to occur. This carries over to random variables: as the value a random variable turns out to take is determined by the outcome of the experiment, and each possible outcome occurs with a certain probability, also every value in the range of the random variable occurs with a certain probability! The key point to understand here is the following. Recall that X is a mapping from Ω to R ⊆ ℝ. Take some possible value r ∈ R.
Then X takes this value r if and only if the experiment yields an outcome ω ∈ Ω satisfying X(ω) = r. That is to say,

X takes the value r ⇐⇒ the outcome ω of the experiment satisfies X(ω) = r.

(Of course, there can be just one outcome, or there can be multiple outcomes ω such that X(ω) = r.) This means that the probability that X takes the value r should be equal to the probability of the event {ω ∈ Ω | X(ω) = r} ⊆ Ω, which is the event consisting of all those outcomes that are mapped to the value r by X:

the probability that X takes the value r = P({ω ∈ Ω | X(ω) = r}).

Remark 4.2.1. As probabilists are a bit lazy and prefer to write things as short as possible we use shorthands for expressing events and probabilities that involve random variables. For instance, when we write '{X = r}' we actually mean the event {ω ∈ Ω | X(ω) = r}, i.e. the event consisting of all outcomes that are mapped to the value r by the random variable X. Also the equals sign might be some other operator, for instance when we write '{X < 2}' we actually mean the event {ω ∈ Ω | X(ω) < 2} and when we write '{X ≠ 4}' we actually mean the event {ω ∈ Ω | X(ω) ≠ 4}. We do something similar when it comes to notation for computing probabilities with random variables. For instance, we write 'P(X = 2)' when we actually mean P({ω ∈ Ω | X(ω) = 2}), 'P(X > 10)' when we actually mean P({ω ∈ Ω | X(ω) > 10}) etc. etc. In this section we will write out those shorthands in full for your convenience, but later in the chapter we will use the shorthand only. Of course, if you find it confusing, try to translate the shorthand to the full form first and then do the computation!

Next up are some examples to illustrate how to compute simple probabilities involving random variables.

Example 4.2.2. Consider an experiment with sample space Ω = {ω1, ω2, ω3, ω4}. Suppose that

P({ω1}) = P({ω2}) = 1/3 and P({ω3}) = P({ω4}) = 1/6.

Figure 4.1: The situation in Example 4.2.2: we have an experiment that can yield any of the outcomes from the sample space Ω = {ω1, . . . , ω4}, and a random variable X that assigns to each possible outcome a value from its range R = {0, 1, 2}

Define the random variable X by setting X(ω1) = 0, X(ω2) = 1, X(ω3) = X(ω4) = 2. What is the range R of X? Compute each of the probabilities P(X = 0), P(X = 2) and P(X < 2). See also Figure 4.1.

We have R = {0, 1, 2}, as these are the only three values X can take. For computing the requested probabilities we find:

P(X = 0) (∗)= P({ω ∈ Ω | X(ω) = 0}) (∗∗)= P({ω1}) = 1/3,

P(X = 2) (∗)= P({ω ∈ Ω | X(ω) = 2}) (∗∗)= P({ω3, ω4}) (∗∗∗)= P({ω3}) + P({ω4}) = 1/3

and

P(X < 2) (∗)= P({ω ∈ Ω | X(ω) < 2}) (∗∗)= P({ω1, ω2}) (∗∗∗)= P({ω1}) + P({ω2}) = 2/3.

Note that in every computation, in step (∗) we write out the shorthand, in step (∗∗) we investigate which outcomes are in the event we are interested in by looking at the definition of X, and finally in step (∗∗∗) we apply the 'additive law'.

Example 4.2.3 (Example 4.1.4 cont'd). Consider again the experiment and random variable introduced in Example 4.1.4. Let us assume the coin is fair so that we have P({H}) = 1/2 and P({T}) = 1/2. The random variable X has range R = {−1, 1}. Compute both P(X = −1) and P(X = 1).

Following the same principle as in the above example, we find

P(X = −1) = P({ω ∈ Ω | X(ω) = −1}) = P({H}) = 1/2

and similarly

P(X = 1) = P({ω ∈ Ω | X(ω) = 1}) = P({T}) = 1/2.

Example 4.2.4 (Example 4.1.5 cont'd). Consider again the experiment and random variable introduced in Example 4.1.5. Assuming the three dice are fair this is an experiment with equally likely outcomes and hence (recall Chapter 2) we have for any event A that

P(A) = #A / #Ω (4.2)

where #Ω = 6³ = 216. Compute the probability that X is equal to 3 and the probability that X is at most 4.
For the probability that X is equal to 3 we get

P(X = 3) = P({ω ∈ Ω | X(ω) = 3}) = P({(1, 1, 1)}) (4.2)= 1/216

and for the probability that X is at most 4 we get

P(X ≤ 4) = P({ω ∈ Ω | X(ω) ≤ 4}) = P({(1, 1, 1), (1, 1, 2), (1, 2, 1), (2, 1, 1)}) (4.2)= 4/216 = 1/54.

We conclude this section with an important observation. Consider any random variable X and let us write its range in a generic form: R = {r1, r2, . . . , rm}. Define for i = 1, . . . , m the events Ei = {X = ri} = {ω ∈ Ω | X(ω) = ri}. Take any outcome ω ∈ Ω. Then we know that X maps ω to some value in its range, so ω must be an element of exactly one of the events E1, E2, . . . , Em. But this means, according to Remark 3.3.2 in Ch. 3, that the events E1, E2, . . . , Em form a partition of the sample space! This has some very useful consequences. For instance, Result 3.3.7 in Ch. 3 tells us that this implies that

P(E1) + P(E2) + . . . + P(Em) = P(X = r1) + P(X = r2) + . . . + P(X = rm) = 1,

or, using the sum notation:

∑_{i=1}^{m} P(X = ri) = 1.

Figure 4.2: The situation in Example 4.2.2: a partition of the sample space Ω = {ω1, . . . , ω4} consisting of the events E1, E2, E3 is generated by the random variable X with range R = {0, 1, 2} by setting E1 = {ω ∈ Ω | X(ω) = 0} = {ω1}, E2 = {ω ∈ Ω | X(ω) = 1} = {ω2}, E3 = {ω ∈ Ω | X(ω) = 2} = {ω3, ω4}

Result 4.2.5. Take any random variable X with range R = {r1, r2, . . . , rm}. Define the collection of events E1, E2, . . . , Em as Ei = {X = ri} = {ω ∈ Ω | X(ω) = ri} for i = 1, . . . , m. Then the collection E1, E2, . . . , Em is a partition of the sample space. One consequence of this fact is that

∑_{i=1}^{m} P(X = ri) = P(X = r1) + P(X = r2) + . . . + P(X = rm) = 1.

This result also holds true if the range of X has infinitely many elements∗, say R = {r1, r2, . . .}. Then we get infinitely many events in our partition, i.e.
Ei = {X = ri} for all i ≥ 1, and we have

∑_{i=1}^{∞} P(X = ri) = P(X = r1) + P(X = r2) + . . . = 1.

∗ As mentioned before, we're always assuming it is at most countable!

Example 4.2.6 (Example 4.2.2 cont'd). In Example 4.2.2 we had the sample space Ω = {ω1, . . . , ω4} and a random variable X with range R = {0, 1, 2} defined as X(ω1) = 0, X(ω2) = 1, X(ω3) = X(ω4) = 2. Following the above Result 4.2.5 we define the events E1 = {ω ∈ Ω | X(ω) = 0}, E2 = {ω ∈ Ω | X(ω) = 1} and E3 = {ω ∈ Ω | X(ω) = 2}. According to Result 4.2.5 these three events should form a partition of the sample space Ω. Indeed, using the definition of X we find E1 = {ω ∈ Ω | X(ω) = 0} = {ω1}, E2 = {ω ∈ Ω | X(ω) = 1} = {ω2} and E3 = {ω ∈ Ω | X(ω) = 2} = {ω3, ω4}, which is indeed a partition of Ω. See also Figure 4.2. Result 4.2.5 also tells us that we should have P(X = 0) + P(X = 1) + P(X = 2) = 1. We can indeed verify that this is true. We had already computed that P(X = 0) = 1/3 and P(X = 2) = 1/3. To compute the last one you can verify that P(X = 1) = P({ω2}) = 1/3 and we conclude that indeed

P(X = 0) + P(X = 1) + P(X = 2) = 1/3 + 1/3 + 1/3 = 1.

Example 4.2.7. Let Ω be some sample space and let X be a random variable of which we know that its range is R = {1, 2, 3, 4, 5} and that P(X = 1) = P(X = 2) = P(X = 4) = P(X = 5) = 1/6. Compute P(X = 3). Also compute P(X ≥ 4) and P(X < 3.7).

First note that we know the probabilities for all values in the range R, with the exception of 3. We know from Result 4.2.5 that the sum of all probabilities must equal 1, that is

P(X = 1) + P(X = 2) + P(X = 3) + P(X = 4) + P(X = 5) = 1.

Plugging in the probabilities we are given we can deduce

P(X = 3) = 1 − P(X = 1) − P(X = 2) − P(X = 4) − P(X = 5) = 1 − 4/6 = 1/3.

Next we compute P(X ≥ 4). For this, let us first write out the shorthand to realise that we are looking at the event {ω ∈ Ω | X(ω) ≥ 4}. Note that we know nothing about the sample space at all!
So we can't figure out which outcomes are in this event. But we don't need to. The first thing to realise is that since the range of X is {1, 2, 3, 4, 5}, the outcomes ω for which X(ω) ≥ 4 must be exactly those outcomes for which either X(ω) = 4 or X(ω) = 5. That is to say, we have

{ω ∈ Ω | X(ω) ≥ 4} = {ω ∈ Ω | X(ω) = 4} ∪ {ω ∈ Ω | X(ω) = 5}

and hence we have that

P({ω ∈ Ω | X(ω) ≥ 4}) = P({ω ∈ Ω | X(ω) = 4} ∪ {ω ∈ Ω | X(ω) = 5}),

or, in shorthand:

P(X ≥ 4) = P({X = 4} ∪ {X = 5}). (4.3)

From Result 4.2.5 we know that the events {X = 1}, {X = 2}, . . . , {X = 5} form a partition of the sample space. In particular that means that they are mutually disjoint (recall Definition 3.3.1 in Ch. 3), which implies that {X = 4} and {X = 5} are disjoint, i.e. {X = 4} ∩ {X = 5} = ∅. But that means that we can apply the 'additive law' from Ch. 1 to (4.3) to derive that

P(X ≥ 4) = P({X = 4}) + P({X = 5}) = 1/6 + 1/6 = 1/3.

Finally, to compute P(X < 3.7) we proceed in exactly the same way. First we note that since X has range {1, 2, 3, 4, 5}, any outcome in the event {ω ∈ Ω | X(ω) < 3.7} must be in one of the events {X = 1}, {X = 2} or {X = 3}. That is to say:

{X < 3.7} = {X = 1} ∪ {X = 2} ∪ {X = 3}.

Again we have that the three events in the above right hand side are mutually disjoint by Result 4.2.5 and hence the 'additive law' from Ch. 1 yields

P(X < 3.7) = P(X = 1) + P(X = 2) + P(X = 3) = 1/6 + 1/6 + 1/3 = 2/3.

Remark 4.2.8. In the above Example 4.2.7 something very new has happened: we have considered a situation in which we did not at all specify what the experiment is, nor what the sample space looks like, and hence neither what the probability of each of the outcomes in the sample space is. Instead we only specified the random variable, its range and the probabilities for each of the values in the range of X.
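Indeed, the whole of Example 4.2.7 can be carried out knowing only the range and the probabilities P(X = r). As a sketch (storing these probabilities in a Python dictionary is our own choice of representation, not notation from these notes):

```python
# Range and known probabilities of X from Example 4.2.7;
# P(X = 3) is recovered from the fact that everything sums to 1
# (Result 4.2.5).
probs = {1: 1/6, 2: 1/6, 4: 1/6, 5: 1/6}
probs[3] = 1 - sum(probs.values())  # = 1/3

# Events like {X >= 4} are disjoint unions of the events {X = r},
# so the 'additive law' becomes a sum over the qualifying range values.
p_at_least_4 = sum(p for r, p in probs.items() if r >= 4)   # = 1/3
p_below_3_7 = sum(p for r, p in probs.items() if r < 3.7)   # = 2/3
print(probs[3], p_at_least_4, p_below_3_7)
```

Note that no sample space Ω appears anywhere in this computation, which mirrors the point of the example.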
We will do this more often in the remaining part of this course: when dealing with random variables we will often leave the experiment and sample space unspecified – just because we don't need to specify them to do the computations we need to do!

4.3 Constructing new random variables from existing ones

If we have a random variable X, i.e. a mapping X : Ω → ℝ, and a function f : ℝ → ℝ, we can look at the composition of these two mappings to create one new mapping Y: first apply X to go from Ω to ℝ, then apply f to go from ℝ to ℝ again. That is to say, the mapping Y : Ω → ℝ is defined as

Y(ω) = f(X(ω)) for all ω ∈ Ω. (4.4)

This is also a random variable, as it is simply a mapping from Ω to ℝ (cf. Definition 4.1.1). Of course, the range of Y may be different from the range of X!

Example 4.3.1. Take the sample space Ω = {ω1, ω2, ω3} and suppose that all outcomes are equally likely to occur. Define the random variable X by X(ω1) = −1, X(ω2) = 0 and X(ω3) = 1. Then X has range RX = {−1, 0, 1} (we use the subscript 'X' to specify it is the range of X), and as we did in the previous subsection we may compute

P(X = −1) = P({ω1}) = 1/3, P(X = 0) = P({ω2}) = 1/3 and P(X = 1) = P({ω3}) = 1/3.

Now, define the random variable Y by setting Y(ω) = X(ω)² for all ω ∈ Ω. That is, we use (4.4) with the function f(x) = x². Then we find

Y(ω1) = X(ω1)² = (−1)² = 1, Y(ω2) = X(ω2)² = 0² = 0 and Y(ω3) = X(ω3)² = 1² = 1.

From this we deduce that the range of Y is RY = {0, 1} and we may compute

P(Y = 0) = P({ω2}) = 1/3 and P(Y = 1) = P({ω1, ω3}) = 2/3.

(Note that we have P(X = −1) + P(X = 0) + P(X = 1) = 1 and P(Y = 0) + P(Y = 1) = 1, as should be the case according to Result 4.2.5.)

Remark 4.3.2. Another case of laziness of notation: we normally simply write 'Y = X²' when we actually mean 'the random variable Y defined by Y(ω) = X(ω)² for all ω ∈ Ω'.
Just as another example, we write Y = 2X² − 3X + 10 just to mean that Y is given by Y(ω) = 2X(ω)² − 3X(ω) + 10 for all ω ∈ Ω – which is (4.4) with the function f(x) = 2x² − 3x + 10.

4.4 Probability mass functions (pmf's)

Every† random variable X has a 'probability mass function', usually abbreviated to 'pmf'. This is simply a function, usually denoted by p, containing all the information concerning the range R of X and the associated probabilities:

† discrete, see Remark 4.1.2

Definition 4.4.1 (Probability mass function (pmf)). The probability mass function (pmf) of a random variable X with range R = {r1, r2, . . . , rm} is the function p : R → [0, 1] defined as

p(ri) := P(X = ri) for all i = 1, . . . , m.

If the range has infinitely many elements, say R = {r1, r2, . . .}, then the definition is adjusted in the obvious way: p(ri) := P(X = ri) for all i ≥ 1.

Example 4.4.2. Consider a betting game where a (fair) dice is rolled. If one or two dots show you get £1, if three, four or five dots show you get £2 and otherwise you get £3. You are interested in the profit you make playing this game. We have Ω = {1, 2, 3, 4, 5, 6}, where all outcomes are equally likely. The random variable X representing your profit is given by X(1) = X(2) = 1, X(3) = X(4) = X(5) = 2 and X(6) = 3. Hence X has range R = {1, 2, 3}, and thus its pmf is a mapping p : {1, 2, 3} → [0, 1]. We can easily compute its values:

p(1) = P(X = 1) = P({1, 2}) = P({1}) + P({2}) = 1/3,

p(2) = P(X = 2) = P({3, 4, 5}) = P({3}) + P({4}) + P({5}) = 1/2

and

p(3) = P(X = 3) = P({6}) = 1/6.

Note that in each computation above, the first step is just to fill in the definition of the pmf, and from then on it's just the same arguments we used in the previous two sections as well. There's not much exciting to say about pmf's, except for the following. We know from Result 4.2.5 that for any random variable X with range R = {r1, r2, . . . , rm} (resp.
R = {r1, r2, . . .}) it holds that

∑_{i=1}^{m} P(X = ri) = 1 resp. ∑_{i≥1} P(X = ri) = 1.

If we plug in the definition of the pmf, this directly translates to

∑_{i=1}^{m} p(ri) = 1 resp. ∑_{i≥1} p(ri) = 1.

Hence:

Result 4.4.3. For any random variable X with range R = {r1, r2, . . . , rm} (or R = {r1, r2, . . .} if the range has infinitely many elements) we have that

∑_{i=1}^{m} p(ri) = 1

(or ∑_{i≥1} p(ri) = 1 in the infinite case).

4.5 Cumulative distribution functions (cdf's)

Like the probability mass function we introduced in the previous section, the 'cumulative distribution function' (abbreviated to 'cdf') of a random variable is a function that contains all the information about the range of the random variable and the associated probabilities for each value in its range. In that sense the cdf is interchangeable with the pmf: if you know the cdf you can derive the pmf, and vice versa. (The cdf has one big advantage over the pmf: the cdf exists for every random variable, not only for the discrete ones we limit ourselves to in this course. That is mainly why it is a popular tool to use.) The cdf is usually denoted F and it is a function F : ℝ → [0, 1] which assigns to every x ∈ ℝ the probability that the rv X takes a value that is less than or equal to x. That is to say:

Definition 4.5.1 (Cumulative distribution function (cdf)). The cumulative distribution function (cdf) of a random variable X is a mapping, usually denoted by F, of the form F : ℝ → [0, 1] and it is defined as follows:

F(x) = P(X ≤ x) for all x ∈ ℝ.

Remark 4.5.2 (Short intermezzo: drawing the graph of discontinuous functions). We will find in a moment that (in our case) the cdf is a discontinuous function, that is to say, if you draw its graph you will find there are 'jumps' in the graph. For drawing such a 'jump' there is a special agreement.
To explain this, consider two functions f and g, with the following formulae:

f(x) = 1 if x < 1, f(x) = 2 if x ≥ 1 and g(x) = 1 if x ≤ 1, g(x) = 2 if x > 1.

Figure 4.3: On the left a plot of the function f and on the right a plot of the function g (cf. Remark 4.5.2). The 'closed circle' indicates what the value is at a point where the function has a jump, while the 'open circle' shows what the value is not. So we read from the left graph that f(1) = 2 and from the right graph that g(1) = 1

These functions are almost identical, the only difference is for x = 1, since f(1) = 2 and g(1) = 1. (Mathematicians say that f is 'right continuous' while g is 'left continuous'.) To make the difference between these two clear in a graph we draw 'open' and 'closed' circles at the points where the graph has a jump. Probably staring at the graph is much more helpful than trying to explain in words, hence have a look at Figure 4.3 and match the location of the 'open' and 'closed' circles for both f and g up with the above formulae.

Example 4.5.3 (Example 4.4.2 cont'd). Consider again the random variable X we discussed in Example 4.4.2, which had range R = {1, 2, 3} and pmf p given by p(1) = 1/3, p(2) = 1/2 and p(3) = 1/6. Let us try to figure out what the cdf F of this random variable looks like. We work from 'left to right' through all possible values for x ∈ ℝ, where we will see that something interesting happens only when x crosses one of the values in the range R = {1, 2, 3} of X.

First pick any x ∈ ℝ such that x < 1 (the smallest value in the range of X). Then the event {X ≤ x} consists of all outcomes such that X(ω) ≤ x. However, the smallest value X can have is 1. And we have chosen x < 1. Hence there are no outcomes in the event {X ≤ x}! That is to say, {X ≤ x} = ∅. Hence plugging in the definition of the cdf we get for such values of x:

F(x) = P(X ≤ x) = P(∅) = 0 if x < 1. (4.5)

Next pick any x ∈ ℝ such that 1 ≤ x < 2 (so, x is at least as large as the smallest value in R but smaller than the second smallest value in R). Consider again the event {X ≤ x} in this situation. Every outcome ω ∈ Ω such that X(ω) = 1 also satisfies X(ω) ≤ x – by our choice of x, in particular since x ≥ 1. Any outcome ω ∈ Ω such that X(ω) = 2 or X(ω) = 3 does not satisfy X(ω) ≤ x – by our choice of x, in particular since x < 2. But this means that we have that {X ≤ x} = {X = 1} and hence we can compute:

F(x) = P(X ≤ x) = P(X = 1) = p(1) = 1/3 if 1 ≤ x < 2. (4.6)

We keep going in the same fashion until we have considered all possible values of x. The next one is any x ∈ ℝ such that 2 ≤ x < 3. Consider again the event {X ≤ x} in this situation. Every outcome ω ∈ Ω such that X(ω) = 1 or X(ω) = 2 also satisfies X(ω) ≤ x – by our choice of x, in particular since x ≥ 2. Any outcome ω ∈ Ω such that X(ω) = 3 however does not satisfy X(ω) ≤ x – by our choice of x, in particular since x < 3. So, in this case we arrive at {X ≤ x} = {X = 1} ∪ {X = 2} and using this we can compute:

F(x) = P(X ≤ x) = P({X = 1} ∪ {X = 2}) = P(X = 1) + P(X = 2) = p(1) + p(2) = 5/6 if 2 ≤ x < 3. (4.7)

Note that in the above computation we have used that the events {X = 1} and {X = 2} are disjoint, as we of course remember very well from Result 4.2.5! We have only one case left to consider, namely any x ∈ ℝ such that x ≥ 3. But in this case we simply have {X ≤ x} = Ω. Indeed, as X(ω) can be either 1, 2 or 3, it is always true that X(ω) ≤ x – since we have x ≥ 3. This now yields

F(x) = P(X ≤ x) = P(Ω) = 1 if x ≥ 3. (4.8)

Figure 4.4: A plot of the cdf F we found in Example 4.5.3

Finally, as the cdf F is now fully specified (we have computed its value for every possible x ∈ ℝ in (4.5)-(4.8)) we can make a graph of the function, see Figure 4.4.
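The 'left to right' procedure above can be mirrored in a short computation: since the events {X = r} for r in the range are mutually disjoint, F(x) is simply the sum of p(r) over all range values r ≤ x. A sketch (storing the pmf as a Python dictionary is our own representation):

```python
# pmf of the random variable from Examples 4.4.2 and 4.5.3.
pmf = {1: 1/3, 2: 1/2, 3: 1/6}

def cdf(x):
    """F(x) = P(X <= x) = sum of p(r) over range values r <= x."""
    return sum(p for r, p in pmf.items() if r <= x)

# Reproducing (4.5)-(4.8): one flat piece per interval between jumps.
print(cdf(0.5))  # 0, since x < 1          (4.5)
print(cdf(1.5))  # 1/3, since 1 <= x < 2   (4.6)
print(cdf(2.5))  # 5/6, since 2 <= x < 3   (4.7)
print(cdf(10))   # 1, since x >= 3         (4.8)
```

Evaluating `cdf` at a point inside each interval recovers exactly the four constant pieces found in Example 4.5.3, and the jumps at 1, 2 and 3 have heights p(1), p(2) and p(3).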
Figure 4.5: The jumps of the cdf sit at the range points r1 = 1, r2 = 2, r3 = 3 and have heights p(1) = 1/3, p(2) = 1/2, p(3) = 1/6 (see Remark 4.5.4).

Remark 4.5.4. If we think about what exactly happened in the above Example 4.5.3 we can write down the following general rule (in our case): the cdf F of a random variable X with range R = {r1, r2, …, rm} and pmf p is a 'piecewise constant' function: it consists of a collection of constant pieces (i.e. the flat bits in the graph) and 'jumps'. The 'jumps' occur at the points in the range of X, i.e. for x = r1, x = r2, …, x = rm, and the heights of these jumps are given by the corresponding probabilities p(r1), p(r2), …, p(rm). In the above Example 4.5.3 we have r1 = 1, r2 = 2, r3 = 3 and p(1) = 1/3, p(2) = 1/2, p(3) = 1/6, see Figure 4.5.

Remark 4.5.5. An important observation is that the cdf contains in one graph all information about a random variable X: given the graph of a cdf we can deduce what the range of X is and what the pmf is. For this we simply use Remark 4.5.4: the range consists of exactly the points where the graph 'jumps', and the corresponding probabilities can be read off from the sizes of the jumps.

4.6 The mean (or: expected value) of a random variable

We know by now very well that a random variable takes different values depending on the outcome of the underlying experiment. It is then quite natural to wonder what the 'average value' of a random variable is. The situation is very comparable to having a sequence of numbers, say a1, a2, …, an, where each number has a weight, say wi is the weight of the number ai. Then the average value of this sequence is given by the well known formula

Average value = (w1 a1 + w2 a2 + … + wn an) / (w1 + w2 + … + wn). (4.9)

If we think about a random variable X, then it takes a value from its range R = {r1, r2, …, rm}, each value ri with the probability P(X = ri).
If we think about these probabilities as the 'weights', then the analogue of (4.9) reads

Average value of X = (P(X = r1)·r1 + P(X = r2)·r2 + … + P(X = rm)·rm) / (P(X = r1) + P(X = r2) + … + P(X = rm)).

Since P(X = r1) + P(X = r2) + … + P(X = rm) = 1 (cf. Result 4.2.5) this simplifies to

Average value of X = P(X = r1)r1 + P(X = r2)r2 + … + P(X = rm)rm

and this is exactly how we define the average value or mean of a random variable:

Definition 4.6.1 (Mean). Let X be a random variable with range R = {r1, r2, …, rm} and pmf p. Then the mean of X is denoted by E[X] and is defined as

E[X] = Σ_{i=1}^m ri P(X = ri) = Σ_{i=1}^m ri p(ri). (4.10)

If the range has infinitely many elements, R = {r1, r2, …}, then the sum simply becomes infinite as well:

E[X] = Σ_{i=1}^∞ ri P(X = ri) = Σ_{i=1}^∞ ri p(ri).

In fact, we would like to extend this definition slightly. As we discussed in Section 4.3, if we take any function f : ℝ → ℝ then we may construct a new random variable Y by defining Y = f(X), i.e. Y(ω) = f(X(ω)) for all ω ∈ Ω. If the range of X is RX = {r1, r2, …, rm} then the random variable Y can take any of the values f(r1), f(r2), …, f(rm), where the value f(ri) occurs with probability P(X = ri) – note that these values are not necessarily all different, cf. Example 4.3.1 for an example of this – and hence, analogously to the reasoning above, it makes sense to set

Average value of f(X) = P(X = r1)f(r1) + P(X = r2)f(r2) + … + P(X = rm)f(rm).

This is indeed how we define the mean of the random variable f(X):

Definition 4.6.2 (Mean of a function of a random variable). Let X be a random variable with range R = {r1, r2, …, rm} and pmf p. Let f : ℝ → ℝ be a function. Then the mean of the random variable f(X) is defined as

E[f(X)] = Σ_{i=1}^m f(ri) P(X = ri) = Σ_{i=1}^m f(ri) p(ri).
(4.11)

Again, if the range has infinitely many elements, R = {r1, r2, …}, then the sum simply becomes infinite as well:

E[f(X)] = Σ_{i=1}^∞ f(ri) P(X = ri) = Σ_{i=1}^∞ f(ri) p(ri).

Remark 4.6.3. If we take the function f(x) = x then f(X) is simply equal to X. Reassuringly, if we plug f(x) = x into the formula (4.11) then this boils down to

E[X] = Σ_{i=1}^m ri p(ri),

which is exactly (4.10). So (4.11) is indeed an extension of (4.10): it applies more broadly, but in cases where both formulae apply they yield the same result.

Remark 4.6.4. You might find this last bit a bit fishy, and rightfully so. Namely, if Y = f(X) is a random variable – as we know it is – then it has its own range, say RY = {y1, y2, …, yn}, and its own pmf, say pY. Hence we wouldn't seem to need the above Definition 4.6.2, since we could simply apply the original Definition 4.6.1 to the random variable Y to get a formula for the mean of f(X):

E[f(X)] = E[Y] = Σ_{i=1}^n yi pY(yi).

The good news is that, if you think about it, there is no real problem, as this gives exactly the same result as using Definition 4.6.2. That is, we have

Σ_{i=1}^n yi pY(yi) = Σ_{i=1}^m f(ri) p(ri),

so it doesn't matter which of the two definitions you use! See also the example below.

Example 4.6.5 (Example 4.3.1 cont'd). In Example 4.3.1 we were looking at a random variable X with range RX = {−1, 0, 1} and pmf pX given by pX(−1) = pX(0) = pX(1) = 1/3. Then Definition 4.6.1 yields that the mean of X is given by

E[X] = −1 · (1/3) + 0 · (1/3) + 1 · (1/3) = 0.

The random variable Y was defined as Y = X², and we had already deduced that the range of Y is RY = {0, 1} with pmf pY given by pY(0) = 1/3 and pY(1) = 2/3. Now, we may compute the mean of Y = X² by two different approaches (see the above Remark). Firstly, we may use that Y is a random variable in its own right and directly apply Definition 4.6.1 to compute

E[Y] = 0 · (1/3) + 1 · (2/3) = 2/3.

On the other hand, since Y = f(X) with f(x) = x², we may also apply Definition 4.6.2 to compute

E[Y] = E[X²] = (−1)² · (1/3) + 0² · (1/3) + 1² · (1/3) = 2/3. (4.12)

So indeed both approaches give the same result. The advantage of using the second approach is that you do not first need to determine the range and pmf of Y: for (4.12) you only need to know the range and pmf of X!

Example 4.6.6. Consider a random variable X with range R = {0, 1, 2, …} and pmf p given by

p(k) = a (1/3)^k for k = 0, 1, 2, ….

What is a? Also compute the mean of X. We know that p must satisfy the property (cf. Result 4.4.3)

Σ_{k=0}^∞ p(k) = 1. (4.13)

Plugging in the formula for p we find that

Σ_{k=0}^∞ p(k) = a Σ_{k=0}^∞ (1/3)^k = a · 1/(1 − 1/3) = (3/2) a,

where we used the geometric series: Σ_{k=0}^∞ x^k = 1/(1 − x) for any x ∈ (−1, 1). Hence condition (4.13) boils down to

(3/2) a = 1 ⟹ a = 2/3.

So we arrive at

p(k) = (2/3)(1/3)^k for k = 0, 1, 2, ….

To compute the mean of X we apply Definition 4.6.1 and manipulate a bit to get

E[X] = Σ_{k=0}^∞ k p(k) = Σ_{k=0}^∞ k (2/3)(1/3)^k = (2/3) · (1/3) Σ_{k=1}^∞ k (1/3)^{k−1}. (4.14)

The reason that it is handy to 'move a factor 1/3 out of the sum', as we did above in the last step, is the following. If we differentiate the geometric series once we see that

Σ_{k=0}^∞ x^k = 1/(1 − x)  ⟹ (differentiate)  Σ_{k=1}^∞ k x^{k−1} = 1/(1 − x)².

If we plug x = 1/3 into the above right hand side, we find that

Σ_{k=1}^∞ k (1/3)^{k−1} = 1/(1 − 1/3)² = 9/4.

Plugging this result back into (4.14) yields

E[X] = (2/3) · (1/3) · (9/4) = 1/2.

We conclude this section with some properties of the mean.

Result 4.6.7. We have the following for any random variable X with range R.

(i) If the range contains only one element, say R = {c}, then E[X] = c.

(ii) For any a, b ∈ ℝ we have E[aX + b] = aE[X] + b.

Proof. Let p denote the pmf of X.

Ad (i). Since the range only contains the element c it must be that p(c) = 1 (cf.
Result 4.4.3). Hence we get from Definition 4.6.1 (where m = 1) that E[X] = c · 1 = c.

Ad (ii). By defining the function f(x) = ax + b we have that E[f(X)] = E[aX + b]. Invoking Definition 4.6.2 and then plugging in the definition of f we get

E[f(X)] = Σ_{i=1}^m f(ri) p(ri) = Σ_{i=1}^m (a ri + b) p(ri) = a Σ_{i=1}^m ri p(ri) + b Σ_{i=1}^m p(ri).

But by Definition 4.6.1, the first sum on the right hand side above is equal to E[X], so the first term equals aE[X], while we know from Result 4.4.3 that Σ_{i=1}^m p(ri) = 1 and hence the second term is just equal to b. So we indeed arrive at E[aX + b] = E[f(X)] = aE[X] + b, as required.

4.7 The variance and standard deviation of a random variable

In the previous section we extensively discussed the mean of a random variable, the 'average value' it takes. However, the mean of course does not give us a full picture of what a random variable looks like. For instance, consider two random variables: X with range RX = {−1, 0, 1} and pmf pX(−1) = pX(0) = pX(1) = 1/3, and Y with range RY = {−10, 0, 10} and pmf pY(−10) = pY(0) = pY(10) = 1/3. Then both random variables X and Y have mean equal to 0; indeed:

E[X] = −1 · (1/3) + 0 · (1/3) + 1 · (1/3) = 0 and E[Y] = −10 · (1/3) + 0 · (1/3) + 10 · (1/3) = 0.

However, there is a clear difference between X and Y: the range of Y is much 'wider'. Y can take the values −10 and 10, which are both 'far away' from its mean 0, while X has range {−1, 0, 1} and is hence always at most a distance of 1 away from its mean 0. The idea is now to introduce a measure for how far, on average, a random variable is away from its mean. Several such measures are possible, but one very common one is the variance. The variance works as follows. First the mean of a random variable X is computed; for convenience let's denote the mean by µ. Then we consider the random variable Y given by

Y = (X − µ)², that is, Y(ω) = (X(ω) − µ)² for all ω ∈ Ω.
Note that (X(ω) − µ)² is a measure of the distance between X(ω) and the mean µ of X. We could just as well have taken |X(ω) − µ| or (X(ω) − µ)⁴, for instance, but there are technical reasons why the square is a good choice. Now we have a random variable measuring the distance between X and its mean, but for easy comparison we would rather have just a number, and hence we take the mean of this distance:

E[(X − µ)²]

and this is exactly what the variance is.

Definition 4.7.1 (Variance). For a random variable X with mean µ we define the variance, denoted Var(X), as

Var(X) = E[(X − µ)²].

The variance is a measure of how far X is, on average, away from its mean µ: the larger the variance, the larger this average distance.

Example 4.7.2. Consider again the two random variables we introduced at the beginning of this section, namely X with range RX = {−1, 0, 1} and pmf pX(−1) = pX(0) = pX(1) = 1/3, and Y with range RY = {−10, 0, 10} and pmf pY(−10) = pY(0) = pY(10) = 1/3. We had already computed that both random variables have mean equal to 0. So the mean of X, say µX, satisfies µX = 0, and also the mean of Y, say µY, satisfies µY = 0. Let us compute the variance of each. Note that this is a straightforward application of Definition 4.6.2, where we set the function f equal to f(x) = (x − µX)²:

Var(X) = E[(X − µX)²] = f(−1)pX(−1) + f(0)pX(0) + f(1)pX(1)
= (−1 − 0)² · (1/3) + (0 − 0)² · (1/3) + (1 − 0)² · (1/3) = 2/3,

and now for Y, where we use f(x) = (x − µY)²:

Var(Y) = E[(Y − µY)²] = f(−10)pY(−10) + f(0)pY(0) + f(10)pY(10)
= (−10 − 0)² · (1/3) + (0 − 0)² · (1/3) + (10 − 0)² · (1/3) = 200/3.

At the beginning of this section we argued that Y could get much further away from its mean 0 than X could, and we see this confirmed by the above computation: the variance of Y is much larger than the variance of X. Here are some useful properties of the variance:

Result 4.7.3.
For any random variable X we have

(i) Var(X) = E[X²] − (E[X])², (4.15)

(ii) if the range of X contains only one element then Var(X) = 0,

(iii) for any a, b ∈ ℝ we have Var(aX + b) = a² Var(X).

Proof. Ad (i). Let us as usual write the range of X in the generic form R = {r1, r2, …, rm}. Also, let us denote the mean of X by µ for notational convenience, and let p be the pmf of X. Then we have

Var(X) = E[(X − µ)²] (*) = Σ_{i=1}^m (ri − µ)² p(ri) (**) = Σ_{i=1}^m (ri² − 2µri + µ²) p(ri)
(***) = Σ_{i=1}^m ri² p(ri) − 2µ Σ_{i=1}^m ri p(ri) + µ² Σ_{i=1}^m p(ri). (4.16)

Note that (*) uses Definition 4.6.2, in (**) we just work out the brackets, and in (***) we work out the brackets a bit further to break the single summation up. Now we can simplify the terms on the far right hand side quite a bit. Indeed, from Definition 4.6.2 we know that

Σ_{i=1}^m ri² p(ri) = E[X²],

from Definition 4.6.1 we know

Σ_{i=1}^m ri p(ri) = E[X] = µ

and from Result 4.4.3 we have

Σ_{i=1}^m p(ri) = 1.

Plugging this all back into (4.16) we see that we get

Var(X) = E[X²] − 2µ · µ + µ² = E[X²] − 2µ² + µ² = E[X²] − µ² = E[X²] − (E[X])².

Ad (ii). Suppose that the range of X contains only the element c. We have seen in Result 4.6.7 (i) that in this case E[X] = µ = c. Also, the pmf p of X must satisfy p(c) = 1. Hence we get

Var(X) = E[(X − µ)²] (*) = (c − c)² p(c) = 0 · 1 = 0,

where (*) uses Definition 4.6.2.

Ad (iii). Let us again write the range of X in the generic form R = {r1, r2, …, rm}, denote the mean of X by µ for notational convenience, and let p be the pmf of X. By part (i) we have that

Var(aX + b) = E[(aX + b)²] − (E[aX + b])². (4.17)

We'll just need to work this out.
For the first term we may use Definition 4.6.2 again, this time with the function f(x) = (ax + b)², to compute

E[(aX + b)²] = Σ_{i=1}^m (a ri + b)² p(ri) = a² E[X²] + 2abµ + b²,

where the steps for the last equality are very similar to what we did in the proof of part (i) above. For the second term we may use Result 4.6.7 to see that

(E[aX + b])² = (aE[X] + b)² = (aµ + b)² = a²µ² + 2abµ + b².

Plugging this all back into (4.17) we find that

Var(aX + b) = (a² E[X²] + 2abµ + b²) − (a²µ² + 2abµ + b²) = a² E[X²] − a²µ² = a²(E[X²] − µ²) = a²(E[X²] − (E[X])²) = a² Var(X),

as required. Note that the last step uses the above part (i) again.

Sometimes it is more convenient to use formula (4.15) in a computation of the variance; usually, however, both the original definition and that formula can be used. We conclude this chapter with a definition of the standard deviation. Essentially the standard deviation is fully equivalent to the variance: it measures exactly the same 'spread', just on a different scale. It is simply defined as the square root of the variance.

Definition 4.7.4 (Standard deviation). The standard deviation of a random variable X, denoted SD(X), is defined as SD(X) = √Var(X).

Chapter 5

Some well known distributions

In this final chapter we discuss some of the most well known distributions (discrete) random variables can have.

5.1 What is a distribution?

So far we have always discussed a situation in which we started with an experiment and a corresponding sample space, and in the previous chapter we have seen how we can define a random variable associated with a certain experiment. To be able to do computations with random variables we need to know what their range is and what their pmf is. Now, as you can probably imagine, there are several different experiments possible, with associated random variables, that all lead to the same range and the same pmf.
Since mathematicians love to make things as abstract as possible, they like to treat all these cases the same. This is done by focussing only on the range and pmf of a random variable.

Definition 5.1.1 (Distribution). A distribution is simply a combination of a range and a pmf. If two random variables have identical ranges and pmf's they are said to have the same distribution (or: to be equal in law).

In this chapter we will briefly discuss some of the most famous distributions, i.e. combinations of ranges and pmf's.

5.2 The Bernoulli distribution

The most elementary one is without any doubt the Bernoulli distribution:

Definition 5.2.1 (Bernoulli distribution). A random variable X has the Bernoulli distribution with parameter p ∈ [0, 1] (notation: X ∼ Bernoulli(p)) when its range is R = {0, 1} and its pmf pX is given by

pX(0) = 1 − p and pX(1) = p.

Example 5.2.2. Consider the experiment of tossing a coin, so Ω = {H, T}. Define the random variable X by setting X(H) = 0 and X(T) = 1. Then the range of X is R = {0, 1} and (if the coin is indeed fair) its pmf pX is given by pX(0) = P({H}) = 1/2 and pX(1) = P({T}) = 1/2. Hence X satisfies the above Definition 5.2.1 with p = 1/2. Therefore X has the Bernoulli distribution with parameter 1/2, i.e. X ∼ Bernoulli(1/2).

Example 5.2.3. A fair die is thrown, so Ω = {1, 2, 3, 4, 5, 6}. Define the random variable X by setting X(1) = X(2) = 0 and X(3) = X(4) = X(5) = X(6) = 1. Then X has range {0, 1} and pmf pX given by

pX(0) = P({1, 2}) = 2/6 = 1/3 and pX(1) = P({3, 4, 5, 6}) = 4/6 = 2/3.

This is again a Bernoulli distribution, now with parameter value p = 2/3. Hence X ∼ Bernoulli(2/3).

When X ∼ Bernoulli(p) it is quite straightforward to compute its mean and variance. Using Definition 4.6.1 from Ch. 4 we find

E[X] = 0 · pX(0) + 1 · pX(1) = 0 · (1 − p) + 1 · p = p.

To compute its variance we first compute, using Definition 4.6.2 from Ch.
4:

E[X²] = 0² · pX(0) + 1² · pX(1) = 0 · (1 − p) + 1 · p = p,

so that by Result 4.7.3 we find

Var(X) = E[X²] − (E[X])² = p − p² = p(1 − p).

Hence:

Result 5.2.4. Suppose that X ∼ Bernoulli(p). Then E[X] = p and Var(X) = p(1 − p).

5.3 The Binomial distribution

Consider a simple experiment that yields either 'success' with probability p or 'failure' with probability (hence) 1 − p. Suppose now that you perform this experiment n times, independently of each other, and suppose that you define the random variable X as the total number of successes you have. This is the typical setup for a binomial distribution. Note that X has range R = {0, 1, …, n}. Also, we can compute the pmf pX of X quite easily. For any k ∈ R, pX(k) = P(X = k) is the probability of getting k successes out of n tries. Any particular sequence of k successes and n − k failures has probability p^k (1 − p)^{n−k} of occurring (by independence of the experiments we may multiply the probabilities). There are C(n, k) sequences consisting of k successes and n − k failures, where C(n, k) denotes the binomial coefficient 'n choose k' (cf. Ch. 2), hence

pX(k) = P(X = k) = C(n, k) p^k (1 − p)^{n−k}.

Therefore we define:

Definition 5.3.1 (Binomial distribution). A random variable X has the Binomial distribution with parameters n ∈ ℕ and p ∈ [0, 1] (notation: X ∼ Binomial(n, p)) when its range is R = {0, 1, …, n} and its pmf pX is given by

pX(k) = C(n, k) p^k (1 − p)^{n−k} for all k ∈ {0, 1, …, n}.

To compute the mean and variance of a Binomial distribution it is easiest to make use of a fact we haven't proven, as it is outside the scope of our course: namely, that if we may write the random variable X as the sum of other random variables, say Y1, Y2, …, Yn, then

E[X] = E[Y1] + E[Y2] + … + E[Yn] (5.1)

and, if the Yi's are independent (we haven't even defined what that exactly means!), we in addition also have that

Var(X) = Var(Y1) + Var(Y2) + … + Var(Yn).
(5.2)

Now, if we believe for a moment that these facts are true, then we can argue as follows. We do n identical experiments. Define the random variable Yi as follows: Yi takes the value 1 if the i-th experiment yields 'success' and 0 if the i-th experiment yields 'failure'. Recalling the previous section, we then know that Yi ∼ Bernoulli(p), and in particular E[Yi] = p and Var(Yi) = p(1 − p) (cf. Result 5.2.4). Furthermore, if we want to know how many successes we have had in total we may just as well add up the random variables Y1, Y2, …, Yn; that is, we have

X = Y1 + Y2 + … + Yn.

Using the 'facts' (5.1) and (5.2) we hence arrive at

E[X] = p + p + … + p = np

and

Var(X) = p(1 − p) + p(1 − p) + … + p(1 − p) = np(1 − p).

Alternative proofs are possible – for instance, for the mean you could first write out the definition and then manipulate to get the result out – but that is a fair bit of work as well. Hence we leave it at this intuitive explanation.

Result 5.3.2. Suppose that X ∼ Binomial(n, p). Then E[X] = np and Var(X) = np(1 − p).

Example 5.3.3. You are starting your career as a darts player in a Manchester pub. As you are just starting, they give you a special darts board that contains only a big bullseye and otherwise nothing. If you hit the bullseye you get 50 points, otherwise you get no points. Suppose that each throw hits the bullseye with probability 1/10. If you throw 100 darts, what is the expected number of points? What is the probability of scoring at least 100 points? This is a case where you repeat a simple experiment – it yields 'success' (you hit the bullseye) with probability 1/10 and 'failure' (no points) with (hence) probability 9/10 – in total 100 times. Hence if X denotes the number of successes then X ∼ Binomial(100, 1/10). The expected number of successes is E[X], which is equal to np = 100 · 1/10 = 10 (cf. Result 5.3.2). So the expected number of points is 50 · 10 = 500.
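The darts numbers are easy to check by machine. Below is a small sketch (not part of the notes) that evaluates the Binomial pmf of Definition 5.3.1 for n = 100 and p = 1/10, confirms E[X] = np by summing k·pX(k) over the whole range, and computes the tail probability via the complement, as the example continues to do in the text.

```python
# Sketch (not from the notes): checking the darts example 5.3.3
# with X ~ Binomial(100, 1/10).
from math import comb

def binomial_pmf(n, p, k):
    # pX(k) = C(n, k) p^k (1 - p)^(n - k), as in Definition 5.3.1
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 100, 0.1
# E[X] by direct summation over the whole range {0, 1, ..., n}:
mean = sum(k * binomial_pmf(n, p, k) for k in range(n + 1))
print(round(mean, 6))   # 10.0, matching E[X] = np

# P(X >= 2) via the complement rule:
tail = 1 - binomial_pmf(n, p, 0) - binomial_pmf(n, p, 1)
print(round(tail, 4))   # 0.9997
```

Summing over all 101 range points is cheap here; the complement trick matters because summing P(X = 2) + … + P(X = 100) by hand would be tedious, while 1 − P(X = 0) − P(X = 1) needs only two pmf evaluations.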
For the probability of scoring at least 100 points: that is the same as the probability of at least 2 bullseyes, i.e. P(X ≥ 2). It is easiest here to use the 'complement rule'. Namely, since

{X ≥ 2} = {X = 2} ∪ {X = 3} ∪ … ∪ {X = 100}

we have

P(X ≥ 2) = P(X = 2) + P(X = 3) + … + P(X = 100).

Computing all these probabilities would be quite a lot of work. It is more clever to use that we know that

P(X = 0) + P(X = 1) + … + P(X = 100) = 1

(cf. Result 4.2.5 in Ch. 4), which yields

P(X = 0) + P(X = 1) + P(X ≥ 2) = 1 ⟹ P(X ≥ 2) = 1 − P(X = 0) − P(X = 1).

Using the formula for the pmf from Definition 5.3.1 we can compute

P(X = 0) = C(100, 0) (1/10)⁰ (9/10)¹⁰⁰ ≈ 0.0000266

and

P(X = 1) = C(100, 1) (1/10)¹ (9/10)⁹⁹ ≈ 0.000295,

so that we arrive at

P(X ≥ 2) = 1 − P(X = 0) − P(X = 1) ≈ 0.9997.

5.4 The Poisson distribution

The next one on the list is the Poisson distribution. This one has a vaguer interpretation. It has one parameter, denoted by λ > 0. One of its uses is to count the number of 'successes' when an experiment is repeated many times and the probability of 'success' is very small. Indeed, a Binomial distribution with a very large value of n and a very small value of p is very similar to a Poisson distribution with parameter λ = np. This fact is known as the Poisson limit theorem. The advantage of using the Poisson distribution over the (in principle more correct) Binomial distribution in such situations (i.e. when n is very large and p is very small) is that it is easier to do computations with the Poisson distribution. The Poisson distribution is the first one in our list with a range containing infinitely many elements.

Definition 5.4.1 (Poisson distribution). A random variable X has the Poisson distribution with parameter λ > 0 (notation: X ∼ Poisson(λ)) when its range is R = {0, 1, …} and its pmf pX is given by

pX(k) = (λ^k / k!) e^{−λ} for all k ∈ {0, 1, …}.
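The Poisson limit theorem mentioned above is easy to see numerically: for large n and small p = λ/n the Binomial(n, p) pmf is close to the Poisson(λ) pmf. The following sketch (not part of the notes; the parameter values λ = 2 and n = 10000 are just illustrative choices) compares the two side by side.

```python
# Sketch (not from the notes): the Poisson limit theorem in action.
# Binomial(n, lam/n) pmf values approach Poisson(lam) as n grows.
from math import comb, exp, factorial

def binomial_pmf(n, p, k):
    # Definition 5.3.1
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(lam, k):
    # pX(k) = (lam^k / k!) e^{-lam}, as in Definition 5.4.1
    return lam**k / factorial(k) * exp(-lam)

lam, n = 2.0, 10_000
for k in range(5):
    b = binomial_pmf(n, lam / n, k)
    q = poisson_pmf(lam, k)
    print(k, round(b, 4), round(q, 4))  # the two columns nearly agree
```

The agreement improves as n grows with λ = np held fixed; already at n = 10000 the first few pmf values agree to several decimal places.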
For deriving the mean and the variance of a Poisson distribution it is good to recall the power series of the exponential function:

e^x = Σ_{k=0}^∞ x^k / k!. (5.3)

For the mean, we get from Definition 4.6.1 in Ch. 4 that

E[X] = Σ_{k=0}^∞ k pX(k) = Σ_{k=0}^∞ k (λ^k / k!) e^{−λ} = Σ_{k=1}^∞ k (λ^k / k!) e^{−λ}.

Note that in the last step we let the sum start at k = 1 rather than at k = 0, which is absolutely fine since k (λ^k / k!) e^{−λ} = 0 when k = 0. Now, using that for any k ≥ 1

k / k! = k / (k(k − 1)(k − 2) ⋯ 2 · 1) = 1 / ((k − 1)(k − 2) ⋯ 2 · 1) = 1 / (k − 1)!,

we may further manipulate

Σ_{k=1}^∞ k (λ^k / k!) e^{−λ} = e^{−λ} λ Σ_{k=1}^∞ λ^{k−1} / (k − 1)!.

But

Σ_{k=1}^∞ λ^{k−1} / (k − 1)! = Σ_{k=0}^∞ λ^k / k! = e^λ,

the last step by (5.3). Putting the pieces together we now arrive at

E[X] = e^{−λ} λ e^λ = λ.

It turns out that the variance is equal to λ as well (exercise).

Result 5.4.2. Suppose that X ∼ Poisson(λ). Then E[X] = λ and Var(X) = λ.

5.5 The geometric distribution

We conclude with the geometric distribution. This one has a quite clear interpretation again. Suppose again that you perform a sequence of identical experiments, independent of each other, where each experiment generates 'success' with probability p ∈ (0, 1] and 'failure' with probability (hence) 1 − p. Let X be the number of experiments you have to do until you get 'success' for the first time. Then X has a geometric distribution with parameter p. It is clear from this description what the range is, namely R = {1, 2, …} (again with infinitely many elements). Also, the probability that X takes the value k ∈ R is equal to the probability that we first get k − 1 'failures' and then a 'success'; the corresponding probability is hence P(X = k) = (1 − p)^{k−1} p. So:

Definition 5.5.1 (Geometric distribution). A random variable X has the Geometric distribution with parameter p ∈ (0, 1] (notation: X ∼ Geometric(p)) when its range is R = {1, 2, …} and its pmf pX is given by

pX(k) = (1 − p)^{k−1} p for all k ∈ {1, 2, …}.
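The 'number of experiments until the first success' description translates directly into a simulation. The sketch below (not part of the notes; the value p = 0.3 and the sample size are just illustrative choices) repeats independent Bernoulli(p) trials until the first success and checks that the observed frequencies match the pmf of Definition 5.5.1.

```python
# Sketch (not from the notes): simulating 'trials until first success'
# and comparing with the Geometric(p) pmf (1 - p)^(k-1) p.
import random

def first_success(p, rng):
    """Number of independent Bernoulli(p) trials performed, up to and
    including the first success."""
    k = 1
    while rng.random() >= p:   # this trial is a failure (prob. 1 - p)
        k += 1
    return k

rng = random.Random(0)         # fixed seed for reproducibility
p, trials = 0.3, 100_000
samples = [first_success(p, rng) for _ in range(trials)]
for k in range(1, 5):
    empirical = samples.count(k) / trials
    exact = (1 - p)**(k - 1) * p
    print(k, round(empirical, 3), round(exact, 3))
```

With 100000 repetitions the empirical frequencies typically agree with the exact pmf to two or three decimal places, and the sample average is close to the mean 1/p derived below.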
To compute the mean of a geometric distribution we make use of the geometric series:

Σ_{k=0}^∞ x^k = 1/(1 − x) for any x ∈ (−1, 1).

(Indeed, we use the same arguments as in Example 4.6.6 in Ch. 4, which was essentially also about a geometric distribution!) If we differentiate* both sides of this equation we get

Σ_{k=1}^∞ k x^{k−1} = 1/(1 − x)² for any x ∈ (−1, 1). (5.4)

Now, again using Definition 4.6.1 from Ch. 4, we get

E[X] = Σ_{k=1}^∞ k pX(k) = Σ_{k=1}^∞ k (1 − p)^{k−1} p = p Σ_{k=1}^∞ k (1 − p)^{k−1}.

Using (5.4) with x = 1 − p yields

Σ_{k=1}^∞ k (1 − p)^{k−1} = 1/p²,

so that we arrive at

E[X] = p · 1/p² = 1/p.

In a similar fashion it can be derived that the variance is equal to (1 − p)/p².

Result 5.5.2. Suppose that X ∼ Geometric(p). Then E[X] = 1/p and Var(X) = (1 − p)/p².

* We are assuming here that when we differentiate the infinite sum, we may as well differentiate each of the terms of the infinite sum. This is certainly true when you have a finite sum, as you know very well, but actually doing it for an infinite sum would need some extra justification.
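Result 5.5.2 can be sanity-checked numerically by truncating the infinite sums that define E[X] and E[X²]: for any fixed p the tail of the geometric pmf decays so fast that a modest truncation point already gives the answer to machine precision. A small sketch (not part of the notes; p = 0.25 and the truncation point K = 1000 are illustrative choices):

```python
# Sketch (not from the notes): numerically checking Result 5.5.2,
# E[X] = 1/p and Var(X) = (1 - p)/p^2, by truncating the infinite sums.
p = 0.25
K = 1000   # truncation point; the tail beyond K is negligible here

pmf = [(1 - p)**(k - 1) * p for k in range(1, K + 1)]
mean = sum(k * pmf[k - 1] for k in range(1, K + 1))
second_moment = sum(k * k * pmf[k - 1] for k in range(1, K + 1))
var = second_moment - mean**2   # using Result 4.7.3 (i)

print(round(mean, 6))   # 4.0, i.e. 1/p
print(round(var, 6))    # 12.0, i.e. (1 - p)/p^2
```

Note the variance is computed via Var(X) = E[X²] − (E[X])² from Result 4.7.3 (i), which avoids having to know the mean before summing.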