Lecture 1: Basic Probability, Some Definitions, and Lots of Examples
CS2244-Randomness and Computation
January 8, 2015
We will define the terms sample space, outcome, event, probability function, random variable, and probability mass function. We will concentrate for
now on finite sample spaces, but later we will broaden the scope to include infinite
discrete and continuous sample spaces. Mathematical definitions, derivations, and
proofs cannot by themselves tell us anything about the behavior of real objects;
they just give us the mathematical language in which we formulate notions of probability. Most of the discussion will be taken up with examples of how we model
real-world problems in this language.
Background: We use the basic language and notation of sets and functions. If
you’ve taken CS2243 this should look at least a little bit familiar (which is why
we recommend that you take CS2243 first). If not, you may need to look at some
of the discussion of basic set theory in the recommended texts and in the posted
notes from 2243.
Recommended reading. Grinstead and Snell, Chapter 1. Observe that Grinstead and Snell start with random variables as the fundamental notion and proceed to the definition of the probability of events, whereas these notes start with a
probability function on the set of events; in the end, we wind up at the same place.
You should also be aware that some authors use ‘probability mass function’ of a
random variable to describe what we are calling the ‘probability distribution function’.
1 Sample Space, Outcome
A sample space is just a set, but we think of it as the set of possible outcomes of
an experiment involving chance. (’Experiment’ here is used in a broad sense; it
doesn’t necessarily mean a science experiment!)
The examples below show how we use this notion to model various situations.
Example 1 A coin is tossed. The possible outcomes are that the coin comes up
heads or tails. So we can model this with the sample space
Ω = {H, T }.
What we call these two elements is not especially relevant; we could just as well
call them 0 and 1.
Perhaps there is a remote possibility that the coin will land on its edge. By
modeling the experiment with the sample space {H, T }, we are excluding this
possibility, essentially treating such an outcome as impossible.
Example 2 You roll a single standard 6-sided die. Usually we’re interested in the
number of spots that show, so the natural sample space modeling this experiment
is
Ω = {1, 2, 3, 4, 5, 6}.
Example 3 You roll a pair of such dice. We tend to think of the outcome as ‘a two
and a three’, or ‘a pair of sixes’, and so we are in effect modeling the experiment
with the sample space
Ω = {{1}, {1, 2}, {1, 3}, {1, 4}, {1, 5}, {1, 6}, {2}, {2, 3}, . . . , {5, 6}, {6}}.
In other words, the outcome is the set of numbers that show. So the outcome {1}
means ‘snake-eyes’ (double 1’s), and there is no distinction made between ‘a two
and a three’ and ‘a three and a two’, both of which correspond to the outcome
{2, 3}. (Question: How many elements does this sample space have?)
But the other way to model the experiment is to keep in mind that there are
two different dice, so we might think of ‘the first die comes up 2 and the second
3’ and ‘the first die comes up 3 and the second 2’ as distinct outcomes. (If the
idea troubles you, as it did me when I first encountered it, think of two different-colored dice, white with black spots and black with white spots.) In this view, the
appropriate choice of sample space would be the set of ordered pairs
Ω = {1, 2, 3, 4, 5, 6} × {1, 2, 3, 4, 5, 6}
= {(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6), . . . , (5, 6), (6, 1), (6, 2), . . . , (6, 6)}
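To make the two models concrete, here is a short Python sketch (our code, not part of the original notes) that builds both sample spaces; it also answers the question posed above, since the unordered model turns out to have 21 outcomes versus 36 for the ordered one.

from itertools import product

# Ordered model: all pairs (first die, second die).
ordered = set(product(range(1, 7), repeat=2))

# Unordered model: the set of numbers that show, e.g. {2, 3} or {1}.
# frozenset makes each outcome hashable, so the outcomes can form a set.
unordered = {frozenset(pair) for pair in ordered}

len(ordered)    # 36
len(unordered)  # 21: fifteen two-element sets plus six singletons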
There doesn’t seem at first to be much reason for preferring one of these models over the other. Since in practice we make no distinction between ‘a 1 and a 2’ and ‘a 2 and a 1’, the first choice corresponds more closely to our mental
model of the problem. However, as we shall see below, there are strong reasons
for preferring the second formulation.
Example 4 In our coin-tossing experiments in class, we flipped (or pretended to
flip) a coin two hundred times in succession, and were very much interested in
the order in which the results occurred, so a natural sample space modeling this
experiment is the set of all sequences of two hundred H’s and T’s. For example,
one element of this set is the sequence
TTTHTTHTTTHTTHTTHHTTHTTTHHTHTTTHTTHTHHHTHHTHTHTTTH
THTTHTTHTHHTHHHTHHTTHTTHTTHTHTHTHHHTHTHTHTHTHTTTHT
TTTHTTHTTTHTTHTTHHTTHTTTHHTHTTTHTTHTHHHTHHTHTHTTTH
THTTHTTHTHHTHHHTHHTTHTTHTTHTHTHTHHHTHTHTHTHTHTTTHT.
The sample space itself is huge, with 2²⁰⁰ (about 1.6 × 10⁶⁰) different outcomes.
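Python’s arbitrary-precision integers make it easy to check this size (a quick sketch, not part of the original notes):

x = 2**200
len(str(x))  # 61 digits, so 2**200 is about 1.6 * 10**60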
Example 5 We flip a coin repeatedly until it comes up heads. The possible outcomes of the experiment are the elements of the set
{H, TH, TTH, TTTH, . . .}
but we could just as well denote it as {1, 2, 3, . . .}, the set of positive integers. This
is an infinite sample space. We call it an infinite discrete sample space. It will be
clearer what we mean by ‘discrete’ in the next example, when we see its opposite,
which is ‘continuous’.
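Although this sample space is infinite, individual outcomes are easy to generate. Here is a minimal sketch (our own code, using numpy’s randint as in the later examples) that runs the experiment once:

from numpy.random import randint

def flips_until_heads():
    # Flip a fair coin (1 = heads, 0 = tails) until heads appears;
    # the outcome reported is the number of flips, in {1, 2, 3, ...}.
    flips = 1
    while randint(0, 2) == 0:
        flips += 1
    return flips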
Example 6 We throw a dart at a circular dart board with a radius of one foot. If
we ignore darts that are aimed so badly that they miss the board completely, an
outcome is a point in the interior of this circle, and thus the sample space is the
disk
{(x, y) : x² + y² < 1}.
As in the preceding example, the sample space is infinite, but this is a continuous sample space rather than a discrete one.
2 Events
An event in a sample space is just a set of outcomes. That is, if we denote the sample
space by Ω, then an event E is just a subset E ⊆ Ω. (For some very technical
reasons, there is a slight problem with this definition in the case of continuous
sample spaces, which we’ll discuss later.)
Here we give examples of events in the sample spaces presented in the examples above.
Example 7 Roll a standard die. The event ‘a three is rolled’ is the one-element
subset {3} of the sample space {1, 2, 3, 4, 5, 6}. The event ‘an even number is
rolled’ is the subset {2, 4, 6}. We call an event with a single outcome a ‘simple
event’. (In more set-theoretic language, it is a ‘singleton set’.)
In all, there are 2⁶ = 64 events in this sample space, including the entire
sample space itself (‘some number is rolled’) and the empty event (‘no number is
rolled’).
Example 8 Roll a pair of standard dice. Consider the event ‘the sum of the
values on the dice is 6’. In the first model of this dice-rolling experiment, this event is the set
{{1, 5}, {2, 4}, {3}}
consisting of three different outcomes, but in the second it is
{(1, 5), (2, 4), (3, 3), (4, 2), (5, 1)}
consisting of five different outcomes.
Example 9 Instead of flipping a coin two hundred times in succession, let’s just
flip it three times in succession, so that the sample space is
Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}.
The event ‘exactly two heads come up’ is the subset {HHT, HTH, THH}. The
event ‘the second coin comes up heads’ is the set {HHH, HHT, THH, THT}.
Example 10 In the sample space representing the experiment of flipping a coin
until heads comes up, the event ‘heads appears in five or fewer flips’ is the finite
set
{H, TH, TTH, TTTH, TTTTH},
while the event ‘heads does not come up on the first three tosses’ is the infinite set
{TTTH, TTTTH, TTTTTH, . . .}.
Example 11 Suppose that the dartboard has a bulls-eye with a radius of one inch, that is, 1/12 of a foot. ‘The dart hits the bulls-eye’ is the event
{(x, y) : x² + y² ≤ 1/12²}.

3 Probability Function
A probability function on a sample space Ω assigns to each event a value between 0 and 1 inclusive, called the probability of the event. We think of this as a measure of the likelihood
of the event occurring, but formally a probability function is just a function
P : P(Ω) → R,
where P(Ω) denotes the power set of Ω (the set of all subsets of Ω) and R is the set of
real numbers. We denote the probability of an event E by P[E]. The probability
function is required to satisfy these three properties.
Axiom 1 P [E] ≥ 0 for every event E.
Axiom 2 P [Ω] = 1.
Axiom 3 If E, F are events and E ∩ F = ∅, then P[E ∪ F] = P[E] + P[F].
This is not quite the whole story—when we discuss infinite discrete sample
spaces and continuous sample spaces in more detail, we have to tweak the definition a bit.
3.1 Basic Consequences
There are some immediate consequences of the above properties. The proofs are
simple applications of the axioms. Again, we use Ω to denote the sample space.
• If E, F, G are pairwise disjoint events (meaning that no two of them have
an element in common), then
P[E ∪ F ∪ G] = P[E] + P[F] + P[G].
In probabilistic language, ‘disjoint subsets’ are ‘mutually exclusive events’.
• More generally, if E1, E2, . . . , En are pairwise mutually exclusive events,
then
P[E1 ∪ E2 ∪ · · · ∪ En] = P[E1] + P[E2] + · · · + P[En].
• If E ⊆ Ω then P[Ē] = 1 − P[E]. Here Ē is the complement of E, the set
of all outcomes not in E.
• If E ⊆ F ⊆ Ω then P[E] ≤ P[F].
• If E, F ⊆ Ω then P[E ∪ F] ≤ P[E] + P[F].
3.2 Examples for Finite Sample Spaces; equally likely events
If the sample space Ω is finite, then Axiom 3 implies that the probability function
is completely determined by its value on simple events (single-outcome events).
Thus any assignment of nonnegative values to the individual outcomes defines a
probability function, provided these values add up to 1. In that case, the probability of an event E is just the sum of the probabilities of the individual outcomes:
P[E] = Σ_{e ∈ E} P[{e}].
This assignment is sometimes called a probability distribution on Ω.
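As a sketch (our code, not from the notes), this definition translates directly into Python: represent a distribution as a dictionary mapping outcomes to probabilities, and sum over the outcomes in an event.

def prob(event, dist):
    # Probability of an event (a set of outcomes) under a distribution
    # given as a dict mapping each outcome to its probability.
    return sum(dist[e] for e in event)

# The fair die of Example 13 below: a uniform distribution on {1, ..., 6}.
die = {k: 1 / 6.0 for k in range(1, 7)}
prob({2, 4, 6}, die)  # 0.5, the probability of rolling an even number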
Of course, we want to think of probability as having some real-world meaning:
An event with probability 1 should occur in every run of the experiment, an event
with probability 0.4 in about 40% of the runs of the experiment.
Example 12 For the coin toss sample space {H, T}, we can pick any value 0 ≤
p ≤ 1 and assign it to {H}, and then assign q = 1 − p to {T}. This gives us a
probability function, and since we have infinitely many choices for p, there are
infinitely many probability functions on this two-element set. The case p = 1, q =
0 models a coin that never comes up tails (perhaps both faces of the coin are
stamped with heads), and the case p = 1/3, q = 2/3 models a coin that comes up tails
twice as often as it comes up heads. The case p = q = 0.5 is special. This is the
case of equally likely outcomes, a fair coin. In the absence of other information,
this is the way we usually model the toss of a coin.
Example 13 Similarly, with the roll of a single die with sample space Ω = {1, 2, 3, 4, 5, 6},
we typically assume that the outcomes are equally likely, so that each simple
event has probability 1/6. Thus the probability of an event E is |E|/6, where |E|
denotes the number of elements (number of outcomes) in E.
Example 14 More generally, if the sample space Ω has n elements with equally
likely outcomes, then the probability of an event E is |E|/n.
When outcomes are equally likely, we say that the probability distribution is
uniform.
Example 15 We saw that we could model the outcomes of the roll of two dice
as either the set of numbers that appear (e.g., ’a 3 and a 4’, or ’double 6s’) or
as the ordered pair of numbers (e.g. ‘3 on the first die and 4 on the second’).
The first model probably more accurately reflects the way we think of dice rolls.
However, the second is usually preferred, because the expected behavior of
dice makes the outcomes in this model equally likely.
To see why, consider the case of two fair dice, where each individual die has
equally likely outcomes, with probability 1/6. Imagine that we mark the dice, so
that one is red and one is green. If we roll the dice a thousand times, we would
expect the red die to come up 3 about one-sixth of the time, or 167 times. On
about one-sixth of those 167 rolls, something like 28 times, the
green die will come up 4. So ‘3 on the red die and 4 on the green die’ should come
up about 28 times, and by the same argument, double 3’s should come up about
28 times. The crucial assumption is that the outcome of the red die does not affect
in any way the outcome of the green die. On the other hand, ‘4 on the red die and 3
on the green die’ should also come up about 28 times, which means that ‘a 3 and
a 4’ is twice as likely as double 3’s. If this is really what happens with dice, then
our second sample space, which treats (3, 4) and (4, 3) as distinct outcomes, will have equally likely
outcomes, but our first will not.
Let’s test this with a simulation. The Python function below simulates n rolls
of a pair of dice and returns the counts of both ‘a 3 and a 4’, and ‘double 3s’.
from pylab import *

def rolldice(n):
    # Count occurrences of 'a 3 and a 4' (count1) and 'double 3s' (count2)
    # in n simulated rolls of a pair of fair dice.
    count1 = 0
    count2 = 0
    for j in range(n):
        dice = randint(1, 7, 2)  # two independent values from 1..6
        if (dice[0] == 3 and dice[1] == 4) or (dice[0] == 4 and dice[1] == 3):
            count1 = count1 + 1
        elif dice[0] == 3 and dice[1] == 3:
            count2 = count2 + 1
    return count1, count2
The result of a test of one hundred thousand rolls is consistent with the predicted frequencies of 2/36 ≈ 5.6 per cent for ‘a 3 and a 4’ and 1/36 ≈ 2.8 per cent for double 3s:
>>> rolldice(100000)
(5470, 2714)
>>>
Example 16 Exactly the same reasoning applies to the sequence of coin tosses.
For a fair coin tossed 3 times, we would expect each of the 8 outcomes in
Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
to be equally likely, each with probability 1/8. With this probability function the probability of exactly 2 heads coming up is 3/8.
Example 17 A standard deck of cards contains 52 distinct cards, in 13 ranks
(Ace, 2,3,...,10, Jack, Queen, King) and 4 suits (clubs, diamonds, hearts, spades).
It will be convenient to encode the cards by the integers {0, . . . , 51}, the ranks by
{0, . . . , 12} and the suits by {0, 1, 2, 3}. Then the rank of card n is ⌊n/4⌋, and the
suit is n mod 4. In Python these are computed by the expressions n // 4 and n % 4,
as long as n has integer type.
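A quick sketch of the encoding (the function names rank and suit are ours, not from the notes):

def rank(n):
    # Ranks 0..12 stand for Ace, 2, ..., 10, Jack, Queen, King.
    return n // 4

def suit(n):
    # Suits 0..3 stand for clubs, diamonds, hearts, spades.
    return n % 4

rank(0), suit(0)    # (0, 0): the Ace of clubs
rank(51), suit(51)  # (12, 3): the King of spades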
Draw twice from the deck. What is the probability that both cards are Aces?
This description of the experiment is deliberately ambiguous. It could be interpreted as ’draw a card at random, put it back in the deck, and draw a card at
random again’. (Imagine that we shuffle the deck between draws.) This is called
sampling with replacement. In this case we can model the set of outcomes as the
complete set of ordered pairs:
{(i, j) : 0 ≤ i, j < 52}.
This is for all intents and purposes identical to the way we modeled the roll of
two dice, with 52 replacing 6. Again, let’s suppose that the outcomes are equally
likely. Then each simple event has probability
1/52² ≈ 3.698 × 10⁻⁴.
What is the probability of getting an Ace on both draws? The Aces are encoded
by the integers {0, 1, 2, 3}, so the event in question is
{(i, j) : 0 ≤ i, j < 4},
which has 16 elements. The probability of getting Aces twice is thus
16 × 3.698 × 10⁻⁴ ≈ 0.005917.
If we want to know the probability of getting two cards of the same rank, the
answer is 13 times this amount, since the events ’two Aces’, ’two twos’, etc., are
pairwise mutually exclusive, and each has the same probability. The result is thus
13 × 0.005917 = 0.076923.
If you’re sharp, you’ll recognize that this is, to the precision shown, exactly 1/13.
That makes sense, since once we’ve drawn the first card there is a one in thirteen
chance that the second card will have the same rank.
On the other hand, we can interpret the problem as one of sampling without
replacement: This is what we usually mean when we say ‘draw two cards from
the deck’. So here we are playing a simplified kind of poker, and asking about the
probability of getting a pair of Aces, as well as the probability of getting a pair.
The sample space is now
{(i, j) : 0 ≤ i, j < 52, i ≠ j},
and the event ‘pair of aces’ is
{(i, j) : 0 ≤ i, j < 4, i ≠ j}.
Again we make the reasonable assumption of equally likely outcomes, so we need
to calculate the sizes of both these sets. It’s not hard to see (and we will make the
argument in more generality when we study combinatorics) that the sample space
has 52 × 51 elements, and the event ‘pair of aces’ contains 4 × 3 = 12 elements.
So the probability of getting two Aces is
(4 × 3)/(52 × 51) = 1/(13 × 17) ≈ 0.0045.
Once again, the probability of a pair is 13 times this amount, which is now
1/17 ≈ 0.0588.
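A simulation in the spirit of rolldice can check both answers. This sketch is our own code, assuming numpy’s randint and choice; the two estimates should come out near 1/13 ≈ 0.077 and 1/17 ≈ 0.059.

from numpy.random import randint, choice

def pairfreq(n):
    pairs_with = 0     # pairs when sampling with replacement
    pairs_without = 0  # pairs when sampling without replacement
    for j in range(n):
        a, b = randint(0, 52, 2)             # two independent draws
        if a // 4 == b // 4:                 # same rank?
            pairs_with += 1
        c, d = choice(52, 2, replace=False)  # two distinct cards
        if c // 4 == d // 4:
            pairs_without += 1
    return pairs_with / float(n), pairs_without / float(n)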
4 Random Variables for Finite Sample Spaces
A random variable X is just a function from a sample space Ω into the real numbers. The functions X, Y, Z, T, U defined below are all random variables implicit in the examples we’ve been discussing.
• Ω is the set of all sequences of 200 tosses of a coin. If e ∈ Ω, X(e) is the
number of heads in e.
• Same sample space as above; Y(e) is the number of distinct runs in e.
• Same sample space; Z(e) is the length of the longest run in e.
• Same sample space; T(e) is the difference between the number of heads
and the number of tails. (You might notice that this is the same as 2X(e) −
200. We write T = 2X − 200 to express the relation between these random
variables.)
• Now let Ω be the set {(i, j) : 1 ≤ i, j ≤ 6} of results of the roll of two dice.
Let U(e) denote the sum of the two dice.
4.1 Probability Distribution Function (also called Probability Mass Function) of a Random Variable
Given a sample space Ω, a random variable X on Ω, and a real number x, the
set
{e ∈ Ω : X(e) = x}
is an event. If we have a probability function P on Ω, then the event has a probability. We often write P[X(e) = x], or even just P[X = x], to denote this probability.
Example 18 Let’s look at the first few items (the coin-toss examples) in the list of
random variables above, but suppose that we are using sequences of three, rather
than two hundred, tosses.
For the random variable X giving the number of heads, the possible values
are 0, 1, 2 and 3. These correspond to the events
{TTT}, {HTT, THT, TTH}, {HHT, HTH, THH}, {HHH}.
Figure 1: Probability mass functions for number of heads and maximum run
length with toss of three fair coins
So P[X = 0] = P[X = 3] = 1/8, and P[X = 1] = P[X = 2] = 3/8.
For the random variable Z giving the length of the longest run, the possible
values are 1, 2, 3. These correspond to the respective events
{HTH, THT}, {HHT, TTH, HTT, THH}, {HHH, TTT}.
So P[Z = 1] = 1/4, P[Z = 2] = 1/2, and P[Z = 3] = 1/4.
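We can verify these values by brute force. The sketch below (our code, not from the notes) enumerates all eight sequences and tallies both random variables; dividing each count by 8 gives the PMFs just computed.

from itertools import product, groupby
from collections import Counter

outcomes = [''.join(seq) for seq in product('HT', repeat=3)]

def X(e):
    # Number of heads in the sequence e.
    return e.count('H')

def Z(e):
    # Length of the longest run of identical tosses in e.
    return max(len(list(run)) for key, run in groupby(e))

counts_X = Counter(X(e) for e in outcomes)  # {0: 1, 1: 3, 2: 3, 3: 1}
counts_Z = Counter(Z(e) for e in outcomes)  # {1: 2, 2: 4, 3: 2}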
Figure 1 is a plot of the functions
x ↦ P[X = x] and x ↦ P[Z = x]
for x in the range of X and Z. These functions are called the probability mass
functions, or the probability distribution functions for the random variables X and
Z. Observe that the values of the PMF must add up to 1.
Now let’s go back to a longer experiment, and let X be the number of heads in
100 tosses of a coin. X takes on the 101 different values 0, 1, . . . , 100. We cannot
list all the sequences with, say, 40 heads, but later we will see how to count them
without actually listing them, enabling us to determine the probability mass function for X very precisely. Even now, however, we can experimentally obtain an estimate of the probability mass function by simulating a large number of runs of 100 tosses and plotting the fraction of runs that yielded each value.
Figure 2 shows the result for 10,000 such runs (something we have seen before).

Figure 2: Probability mass function for number of heads on one hundred tosses of
a fair coin, estimated from 10,000 simulated trials
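Such an estimate can be reproduced in a few lines. This sketch is our own code, not the program that produced the figure, and it assumes numpy’s randint (1 = heads):

from numpy.random import randint
from collections import Counter

def estimate_pmf(trials, tosses=100):
    # Tally the number of heads in each simulated run of `tosses` fair flips.
    counts = Counter(int(randint(0, 2, tosses).sum()) for j in range(trials))
    # Relative frequencies estimate the probability mass function.
    return {k: v / float(trials) for k, v in counts.items()}

# estimate_pmf(10000) gives values near the exact PMF,
# e.g. about 0.08 at k = 50 heads.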
For the random variable U defined above, giving the sum of the two dice,
we can determine the probability mass function precisely: U (e) = 2 for just
one of the 36 outcomes e, namely when both dice show 1 (‘snake eyes’). Likewise U (e) = 12 for just one outcome. U (e) = 7 for 6 different outcomes
((1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1)). It is easy to fill in the other values. The
probability mass function is plotted in Figure 3 along with an experimental estimate of the function obtained from 10,000 simulated rolls.
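Filling in the other values is a short computation (a sketch, not from the notes):

from itertools import product
from collections import Counter

# Tally each possible sum over all 36 equally likely ordered outcomes.
counts = Counter(i + j for i, j in product(range(1, 7), repeat=2))
pmf_U = {s: counts[s] / 36.0 for s in sorted(counts)}
# pmf_U[2] == pmf_U[12] == 1/36, pmf_U[7] == 6/36, and so on.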
Figure 3: Probability mass function for the roll of two dice, both from exact calculation and a simulation of 10,000 rolls

4.2 Cumulative Distribution Function

Sometimes, instead of looking at the function x ↦ P[X = x], it is more convenient to use
x ↦ P[X ≤ x].
This is the cumulative distribution function (CDF) of the random variable X.
It gives precisely the same information as the PMF, although in a different
form. If we want to know the probability that the value of X lies in a certain
range, then it is likely to be easier to estimate this from the graph of the CDF than
from the graph of the PMF. Observe that the CDF is a nondecreasing function of
x, with maximum value 1.
Figure 4 shows the CDF for the dice example.
For instance, the probability that the sum of the dice is between 4 and 9 inclusive is the difference between the heights of the stems at x = 9 and x = 3. If we
only had the graph of the PMF we would have to add the heights of the stems at
x = 4, 5, 6, 7, 8 and 9.
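In code (a sketch continuing pmf_U from the earlier snippet, repeated here so it runs on its own), the CDF is a running sum of the PMF, and a range probability is a difference of two of its values:

from itertools import product
from collections import Counter

counts = Counter(i + j for i, j in product(range(1, 7), repeat=2))
pmf_U = {s: counts[s] / 36.0 for s in sorted(counts)}

def cdf(pmf):
    # Accumulate the PMF in increasing order of x to get x -> P[X <= x].
    total, F = 0.0, {}
    for x in sorted(pmf):
        total += pmf[x]
        F[x] = total
    return F

F = cdf(pmf_U)
F[9] - F[3]  # P[4 <= U <= 9] = 27/36 = 0.75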
Figure 4: Cumulative distribution function for the roll of two dice