Lecture 1: Basic Probability, Some Definitions, and Lots of Examples
CS2244-Randomness and Computation
January 8, 2015
We will define the terms sample space, outcome, event, probability function, random variable, and probability mass function. We will concentrate for
now on finite sample spaces, but later we will broaden the scope to include infinite
discrete and continuous sample spaces. Mathematical definitions, derivations, and
proofs cannot by themselves tell us anything about the behavior of real objects;
they just give us the mathematical language in which we formulate notions of probability. Most of the discussion will be taken up with examples of how we model
real-world problems in this language.
Background: We use the basic language and notation of sets and functions. If
you’ve taken CS2243 this should look at least a little bit familiar (which is why
we recommend that you take CS2243 first). If not, you may need to look at some
of the discussion of basic set theory in the recommended texts and in the posted
notes from 2243.
Recommended reading. Grinstead and Snell, Chapter 1. Observe that Grinstead and Snell start with random variables as the fundamental notion and proceed to the definition of the probability of events, whereas these notes start with a
probability function on the set of events; in the end, we wind up at the same place.
You should also be aware that some authors use ‘probability mass function’ of a
random variable to describe what we are calling the ‘probability distribution function’.
1 Sample Space, Outcome
A sample space is just a set, but we think of it as the set of possible outcomes of
an experiment involving chance. (’Experiment’ here is used in a broad sense; it
doesn’t necessarily mean a science experiment!)
The examples below show how we use this notion to model various situations.
Example 1 A coin is tossed. The possible outcomes are that the coin comes up
heads or tails. So we can model this with the sample space
Ω = {H, T }.
What we call these two elements is not especially relevant; we could just as well
call them 0 and 1.
Perhaps there is a remote possibility that the coin will land on its edge. By
modeling the experiment with the sample space {H, T }, we are excluding this
possibility, essentially treating such an outcome as impossible.
Example 2 You roll a single standard 6-sided die. Usually we’re interested in the
number of spots that show, so the natural sample space modeling this experiment
is
Ω = {1, 2, 3, 4, 5, 6}.
Example 3 You roll a pair of such dice. We tend to think of the outcome as ‘a two
and a three’, or ‘a pair of sixes’, and so we are in effect modeling the experiment
with the sample space
Ω = {{1}, {1, 2}, {1, 3}, {1, 4}, {1, 5}, {1, 6}, {2}, {2, 3}, . . . , {5, 6}, {6}}.
In other words, the outcome is the set of numbers that show. So the outcome {1}
means ‘snake-eyes’ (double 1’s), and there is no distinction made between ‘a two
and a three’ and ‘a three and a two’, both of which correspond to the outcome
{2, 3}. (Question: How many elements does this sample space have?)
But the other way to model the experiment is to keep in mind that there are
two different dice, so we might think of ‘the first die comes up 2 and the second
3’ and ‘the first die comes up 3 and the second 2’ as distinct outcomes. (If the
idea troubles you, as it did me when I first encountered it, think of two different-colored dice, white with black spots and black with white spots.) In this view, the
appropriate choice of sample space would be the set of ordered pairs
Ω = {1, 2, 3, 4, 5, 6} × {1, 2, 3, 4, 5, 6}
= {(1, 1), (1, 2), (1, 3), (1, 4), (1, 5), (1, 6), . . . , (5, 6), (6, 1), (6, 2), . . . , (6, 6)}
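To make the two models concrete, here is a short Python sketch (our code, not part of the original notes) that builds both sample spaces; it also answers the question posed above, since the unordered model turns out to have 21 outcomes versus 36 for the ordered one.

from itertools import product

# Ordered model: all pairs (first die, second die).
ordered = set(product(range(1, 7), repeat=2))

# Unordered model: the set of numbers that show, e.g. {2, 3} or {1}.
# frozenset makes each outcome hashable, so the outcomes can form a set.
unordered = {frozenset(pair) for pair in ordered}

len(ordered)    # 36
len(unordered)  # 21: fifteen two-element sets plus six singletons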
There doesn’t seem at first to be much reason for preferring one of these models over the other. Since in practice we make no distinction between ‘a 1 and a 2’ and ‘a 2 and a 1’, the first choice corresponds more closely to our mental
model of the problem. However, as we shall see below, there are strong reasons
for preferring the second formulation.
Example 4 In our coin-tossing experiments in class, we flipped (or pretended to
flip) a coin two hundred times in succession, and were very much interested in
the order in which the results occurred, so a natural sample space modeling this
experiment is the set of all sequences of two hundred H’s and T’s. For example,
one element of this set is the sequence
TTTHTTHTTTHTTHTTHHTTHTTTHHTHTTTHTTHTHHHTHHTHTHTTTH
THTTHTTHTHHTHHHTHHTTHTTHTTHTHTHTHHHTHTHTHTHTHTTTHT
TTTHTTHTTTHTTHTTHHTTHTTTHHTHTTTHTTHTHHHTHHTHTHTTTH
THTTHTTHTHHTHHHTHHTTHTTHTTHTHTHTHHHTHTHTHTHTHTTTHT.
The sample space itself is huge, with 2²⁰⁰ (about 1.6 × 10⁶⁰) different outcomes.
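Python’s arbitrary-precision integers make it easy to check this size (a quick sketch, not part of the original notes):

x = 2**200
len(str(x))  # 61 digits, so 2**200 is about 1.6 * 10**60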
Example 5 We flip a coin repeatedly until it comes up heads. The possible outcomes of the experiment are the elements of the set
{H, TH, TTH, TTTH, . . .}
but we could just as well denote it as {1, 2, 3, . . .}, the set of positive integers. This
is an infinite sample space. We call it an infinite discrete sample space. It will be
clearer what we mean by ‘discrete’ in the next example, when we see its opposite,
which is ‘continuous’.
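Although this sample space is infinite, individual outcomes are easy to generate. Here is a minimal sketch (our own code, using numpy’s randint as in the later examples) that runs the experiment once:

from numpy.random import randint

def flips_until_heads():
    # Flip a fair coin (1 = heads, 0 = tails) until heads appears;
    # the outcome reported is the number of flips, in {1, 2, 3, ...}.
    flips = 1
    while randint(0, 2) == 0:
        flips += 1
    return flips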
Example 6 We throw a dart at a circular dart board with a radius of one foot. If
we ignore darts that are aimed so badly that they miss the board completely, an
outcome is a point in the interior of this circle, and thus the sample space is the
disk
{(x, y) : x² + y² < 1}.
As in the preceding example, the sample space is infinite, but this is a continuous sample space rather than a discrete one.
2 Events
An event in a sample space is just a set of outcomes. That is, if we denote the sample
space by Ω, then an event E is just a subset E ⊆ Ω. (For some very technical
reasons, there is a slight problem with this definition in the case of continuous
sample spaces, which we’ll discuss later.)
Here we give examples of events in the sample spaces presented in the examples above.
Example 7 Roll a standard die. The event ‘a three is rolled’ is the one-element
subset {3} of the sample space {1, 2, 3, 4, 5, 6}. The event ‘an even number is
rolled’ is the subset {2, 4, 6}. We call an event with a single outcome a ‘simple
event’. (In more set-theoretic language, it is a ‘singleton set’.)
In all, there are 2⁶ = 64 events in this sample space, including the entire
sample space itself (‘some number is rolled’) and the empty event (‘no number is
rolled’).
Example 8 Roll a pair of standard dice. Consider the event ‘the sum of the
values on the dice is 6’. In the first model of this dice-rolling experiment, this event is the set
{{1, 5}, {2, 4}, {3}}
consisting of three different outcomes, but in the second it is
{(1, 5), (2, 4), (3, 3), (4, 2), (5, 1)}
consisting of five different outcomes.
Example 9 Instead of flipping a coin two hundred times in succession, let’s just
flip it three times in succession, so that the sample space is
Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}.
The event ‘exactly two heads come up’ is the subset {HHT, HTH, THH}. The
event ‘the second coin comes up heads’ is the set {HHH, HHT, THH, THT}.
Example 10 In the sample space representing the experiment of flipping a coin
until heads comes up, the event ‘heads appears in five or fewer flips’ is the finite
set
{H, TH, TTH, TTTH, TTTTH},
while the event ‘heads does not come up on the first three tosses’ is the infinite set
{TTTH, TTTTH, TTTTTH, . . .}.
Example 11 Suppose that the dartboard has a bulls-eye with a radius of one inch, that is, 1/12 of a foot. ‘The dart hits the bulls-eye’ is the event
{(x, y) : x² + y² ≤ 1/12²}.

3 Probability Function
A probability function on a sample space Ω assigns to each event a value between 0 and 1 inclusive, called the probability of the event. We think of this as a measure of the likelihood
of the event occurring, but formally a probability function is just a function
P : P(Ω) → R,
where P(Ω) denotes the power set of Ω (the set of all subsets of Ω) and R is the set of
real numbers. We denote the probability of an event E by P[E]. The probability
function is required to satisfy these three properties.
Axiom 1 P [E] ≥ 0 for every event E.
Axiom 2 P [Ω] = 1.
Axiom 3 If E, F are events and E ∩ F = ∅, then P[E ∪ F] = P[E] + P[F].
This is not quite the whole story—when we discuss infinite discrete sample
spaces and continuous sample spaces in more detail, we have to tweak the definition a bit.
3.1 Basic Consequences
There are some immediate consequences of the above properties. The proofs are
simple applications of the axioms. Again, we use Ω to denote the sample space.
• If E, F, G are pairwise disjoint events (meaning that no two of them have
an element in common), then
P[E ∪ F ∪ G] = P[E] + P[F] + P[G].
In probabilistic language, ‘disjoint subsets’ are ‘mutually exclusive events’.
• More generally, if E1, E2, . . . , En are pairwise mutually exclusive events,
then
P[E1 ∪ E2 ∪ · · · ∪ En] = P[E1] + P[E2] + · · · + P[En].
• If E ⊆ Ω then P[Ē] = 1 − P[E]. Here Ē is the complement of E, the set
of all outcomes not in E.
• If E ⊆ F ⊆ Ω then P[E] ≤ P[F].
• If E, F ⊆ Ω then P[E ∪ F] ≤ P[E] + P[F].
3.2 Examples for Finite Sample Spaces; equally likely events
If the sample space Ω is finite, then Axiom 3 implies that the probability function
is completely determined by its value on simple events (single-outcome events).
Thus any assignment of nonnegative values to the individual outcomes defines a
probability function, provided these values add up to 1. In that case, the probability of an event E is just the sum of the probabilities of the individual outcomes:
P[E] = Σ_{e ∈ E} P[{e}].
This assignment is sometimes called a probability distribution on Ω.
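As a sketch (our code, not from the notes), this definition translates directly into Python: represent a distribution as a dictionary mapping outcomes to probabilities, and sum over the outcomes in an event.

def prob(event, dist):
    # Probability of an event (a set of outcomes) under a distribution
    # given as a dict mapping each outcome to its probability.
    return sum(dist[e] for e in event)

# The fair die of Example 13 below: a uniform distribution on {1, ..., 6}.
die = {k: 1 / 6.0 for k in range(1, 7)}
prob({2, 4, 6}, die)  # 0.5, the probability of rolling an even number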
Of course, we want to think of probability as having some real-world meaning:
An event with probability 1 should occur in every run of the experiment, an event
with probability 0.4 in about 40% of the runs of the experiment.
Example 12 For the coin toss sample space {H, T}, we can pick any value 0 ≤
p ≤ 1 and assign it to {H}, and then assign q = 1 − p to {T}. This gives us a
probability function, and since we have infinitely many choices for p, there are
infinitely many probability functions on this two-element set. The case p = 1, q =
0 models a coin that never comes up tails (perhaps both faces of the coin are
stamped with heads), and the case p = 1/3, q = 2/3 models a coin that comes up tails
twice as often as it comes up heads. The case p = q = 0.5 is special. This is the
case of equally likely outcomes, a fair coin. In the absence of other information,
this is the way we usually model the toss of a coin.
Example 13 Similarly, with the roll of a single die with sample space Ω = {1, 2, 3, 4, 5, 6},
we typically assume that the outcomes are equally likely, so that each simple
event has probability 1/6. Thus the probability of an event E is |E|/6, where |E|
denotes the number of elements (number of outcomes) in E.
Example 14 More generally, if the sample space Ω has n elements with equally
likely outcomes, then the probability of an event E is |E|/n.
When outcomes are equally likely, we say that the probability distribution is
uniform.
Example 15 We saw that we could model the outcomes of the roll of two dice
as either the set of numbers that appear (e.g., ’a 3 and a 4’, or ’double 6s’) or
as the ordered pair of numbers (e.g. ‘3 on the first die and 4 on the second’).
The first model probably more accurately reflects the way we think of dice rolls.
However, the second is usually preferred, because the expected behavior of
dice makes the outcomes in this model equally likely.
To see why, consider the case of two fair dice, where each individual die has
equally likely outcomes, with probability 1/6. Imagine that we mark the dice, so
that one is red and one is green. If we roll the dice a thousand times, we would
expect the red die to come up 3 about one-sixth of the time, or 167 times. On
about one-sixth of those 167 rolls, something like 28 times, the
green die will come up 4. So ‘3 on the red die and 4 on the green die’ should come
up about 28 times, and by the same argument, double 3’s should come up about
28 times. The crucial assumption is that the outcome of the red die does not affect
in any way the outcome of the green die. On the other hand, ‘4 on the red die and 3
on the green die’ should also come up about 28 times, which means that ‘a 3 and
a 4’ is twice as likely as double 3’s. If this is really what happens with dice, then
our second sample space, which treats (3, 4) and (4, 3) as distinct outcomes, will have equally likely
outcomes, but our first will not.
Let’s test this with a simulation. The Python function below simulates n rolls
of a pair of dice and returns the counts of both ‘a 3 and a 4’, and ‘double 3s’.
from pylab import *

def rolldice(n):
    # Count occurrences of 'a 3 and a 4' (count1) and 'double 3s' (count2)
    # in n simulated rolls of a pair of fair dice.
    count1 = 0
    count2 = 0
    for j in range(n):
        dice = randint(1, 7, 2)  # two independent values from 1..6
        if (dice[0] == 3 and dice[1] == 4) or (dice[0] == 4 and dice[1] == 3):
            count1 = count1 + 1
        elif dice[0] == 3 and dice[1] == 3:
            count2 = count2 + 1
    return count1, count2
The result of a test of one hundred thousand rolls is consistent with the predicted frequencies of 2/36 ≈ 5.6 per cent for ‘a 3 and a 4’ and 1/36 ≈ 2.8 per cent for double 3s:
>>> rolldice(100000)
(5470, 2714)
>>>
Example 16 Exactly the same reasoning applies to the sequence of coin tosses.
For a fair coin tossed 3 times, we would expect each of the 8 outcomes in
Ω = {HHH, HHT, HTH, HTT, THH, THT, TTH, TTT}
to be equally likely, each with probability 1/8. With this probability function the probability of exactly 2 heads coming up is 3/8.
Example 17 A standard deck of cards contains 52 distinct cards, in 13 ranks
(Ace, 2,3,...,10, Jack, Queen, King) and 4 suits (clubs, diamonds, hearts, spades).
It will be convenient to encode the cards by the integers {0, . . . , 51}, the ranks by
{0, . . . , 12} and the suits by {0, 1, 2, 3}. Then the rank of card n is ⌊n/4⌋, and the
suit is n mod 4. In Python these are computed by the expressions n // 4 and n % 4,
as long as n has integer type.
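A quick sketch of the encoding (the function names rank and suit are ours, not from the notes):

def rank(n):
    # Ranks 0..12 stand for Ace, 2, ..., 10, Jack, Queen, King.
    return n // 4

def suit(n):
    # Suits 0..3 stand for clubs, diamonds, hearts, spades.
    return n % 4

rank(0), suit(0)    # (0, 0): the Ace of clubs
rank(51), suit(51)  # (12, 3): the King of spades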
Draw twice from the deck. What is the probability that both cards are Aces?
This description of the experiment is deliberately ambiguous. It could be interpreted as ’draw a card at random, put it back in the deck, and draw a card at
random again’. (Imagine that we shuffle the deck between draws.) This is called
sampling with replacement. In this case we can model the set of outcomes as the
complete set of ordered pairs:
{(i, j) : 0 ≤ i, j < 52}.
This is for all intents and purposes identical to the way we modeled the roll of
two dice, with 52 replacing 6. Again, let’s suppose that the outcomes are equally
likely. Then each simple event has probability
1/52² ≈ 3.698 × 10⁻⁴.
What is the probability of getting an Ace on both draws? The Aces are encoded
by the integers {0, 1, 2, 3}, so the event in question is
{(i, j) : 0 ≤ i, j < 4},
which has 16 elements. The probability of getting Aces twice is thus
16 × 3.698 × 10⁻⁴ ≈ 0.005917.
If we want to know the probability of getting two cards of the same rank, the
answer is 13 times this amount, since the events ’two Aces’, ’two twos’, etc., are
pairwise mutually exclusive, and each has the same probability. The result is thus
13 × 0.005917 = 0.076923.
If you’re sharp, you’ll recognize that this is, to the precision shown, exactly 1/13.
That makes sense, since once we’ve drawn the first card there is a one in thirteen
chance that the second card will have the same rank.
On the other hand, we can interpret the problem as one of sampling without
replacement: This is what we usually mean when we say ‘draw two cards from
the deck’. So here we are playing a simplified kind of poker, and asking about the
probability of getting a pair of Aces, as well as the probability of getting a pair.
The sample space is now
{(i, j) : 0 ≤ i, j < 52, i ≠ j},
and the event ‘pair of aces’ is
{(i, j) : 0 ≤ i, j < 4, i ≠ j}.
Again we make the reasonable assumption of equally likely outcomes, so we need
to calculate the sizes of both these sets. It’s not hard to see (and we will make the
argument in more generality when we study combinatorics) that the sample space
has 52 × 51 elements, and the event ‘pair of aces’ contains 4 × 3 = 12 elements.
So the probability of getting two Aces is
(4 × 3)/(52 × 51) = 1/(13 × 17) ≈ 0.0045.
Once again, the probability of a pair is 13 times this amount, which is now
1/17 ≈ 0.0588.
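A simulation in the spirit of rolldice can check both answers. This sketch is our own code, assuming numpy’s randint and choice; the two estimates should come out near 1/13 ≈ 0.077 and 1/17 ≈ 0.059.

from numpy.random import randint, choice

def pairfreq(n):
    pairs_with = 0     # pairs when sampling with replacement
    pairs_without = 0  # pairs when sampling without replacement
    for j in range(n):
        a, b = randint(0, 52, 2)             # two independent draws
        if a // 4 == b // 4:                 # same rank?
            pairs_with += 1
        c, d = choice(52, 2, replace=False)  # two distinct cards
        if c // 4 == d // 4:
            pairs_without += 1
    return pairs_with / float(n), pairs_without / float(n)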
4 Random Variables for Finite Sample Spaces
A random variable X is just a function from a sample space Ω into the real numbers. The functions X, Y, Z, T, U defined below are all random variables implicit in the examples we’ve been discussing.
• Ω is the set of all sequences of 200 tosses of a coin. If e ∈ Ω, X(e) is the
number of heads in e.
• Same sample space as above; Y(e) is the number of distinct runs in e.
• Same sample space; Z(e) is the length of the longest run in e.
• Same sample space; T(e) is the difference between the number of heads
and the number of tails. (You might notice that this is the same as 2X(e) −
200. We write T = 2X − 200 to express the relation between these random
variables.)
• Now let Ω be the set {(i, j) : 1 ≤ i, j ≤ 6} of results of the roll of two dice.
Let U(e) denote the sum of the two dice.
4.1 Probability Distribution Function (also called Probability Mass Function) of a Random Variable
Given a sample space Ω, a random variable X on Ω, and a real number x, the
set
{e ∈ Ω : X(e) = x}
is an event. If we have a probability function P on Ω, then the event has a probability. We often write P[X(e) = x], or even just P[X = x], to denote this probability.
Example 18 Let’s look at the first few items (the coin-toss examples) in the list of
random variables above, but suppose that we are using sequences of three, rather
than two hundred, tosses.
For the random variable X giving the number of heads, the possible values
are 0, 1, 2 and 3. These correspond to the events
{TTT}, {HTT, THT, TTH}, {HHT, HTH, THH}, {HHH}.
Figure 1: Probability mass functions for number of heads and maximum run
length with toss of three fair coins
So P[X = 0] = P[X = 3] = 1/8, and P[X = 1] = P[X = 2] = 3/8.
For the random variable Z giving the length of the longest run, the possible
values are 1, 2, 3. These correspond to the respective events
{HTH, THT}, {HHT, TTH, HTT, THH}, {HHH, TTT}.
So P[Z = 1] = 1/4, P[Z = 2] = 1/2, and P[Z = 3] = 1/4.
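We can verify these values by brute force. The sketch below (our code, not from the notes) enumerates all eight sequences and tallies both random variables; dividing each count by 8 gives the PMFs just computed.

from itertools import product, groupby
from collections import Counter

outcomes = [''.join(seq) for seq in product('HT', repeat=3)]

def X(e):
    # Number of heads in the sequence e.
    return e.count('H')

def Z(e):
    # Length of the longest run of identical tosses in e.
    return max(len(list(run)) for key, run in groupby(e))

counts_X = Counter(X(e) for e in outcomes)  # {0: 1, 1: 3, 2: 3, 3: 1}
counts_Z = Counter(Z(e) for e in outcomes)  # {1: 2, 2: 4, 3: 2}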
Figure 1 is a plot of the functions
x ↦ P[X = x] and x ↦ P[Z = x]
for x in the range of X and Z. These functions are called the probability mass
functions, or the probability distribution functions for the random variables X and
Z. Observe that the values of the PMF must add up to 1.
Now let’s go back to a longer experiment, and let X be the number of heads in
100 tosses of a coin. X takes on the 101 different values 0, 1, . . . , 100. We cannot
list all the sequences with, say, 40 heads, but later we will see how to count them
without actually listing them, enabling us to determine the probability mass function for X very precisely. Even now, however, we can experimentally obtain an estimate of the probability mass function by simulating a large number of runs of 100 tosses and plotting the fraction of runs that yielded each value.
Figure 2 shows the result for 10,000 such runs (something we have seen before).

Figure 2: Probability mass function for number of heads on one hundred tosses of
a fair coin, estimated from 10,000 simulated trials
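Such an estimate can be reproduced in a few lines. This sketch is our own code, not the program that produced the figure, and it assumes numpy’s randint (1 = heads):

from numpy.random import randint
from collections import Counter

def estimate_pmf(trials, tosses=100):
    # Tally the number of heads in each simulated run of `tosses` fair flips.
    counts = Counter(int(randint(0, 2, tosses).sum()) for j in range(trials))
    # Relative frequencies estimate the probability mass function.
    return {k: v / float(trials) for k, v in counts.items()}

# estimate_pmf(10000) gives values near the exact PMF,
# e.g. about 0.08 at k = 50 heads.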
For the random variable U defined above, giving the sum of the two dice,
we can determine the probability mass function precisely: U (e) = 2 for just
one of the 36 outcomes e, namely when both dice show 1 (‘snake eyes’). Likewise U (e) = 12 for just one outcome. U (e) = 7 for 6 different outcomes
((1, 6), (2, 5), (3, 4), (4, 3), (5, 2), (6, 1)). It is easy to fill in the other values. The
probability mass function is plotted in Figure 3 along with an experimental estimate of the function obtained from 10,000 simulated rolls.
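Filling in the other values is a short computation (a sketch, not from the notes):

from itertools import product
from collections import Counter

# Tally each possible sum over all 36 equally likely ordered outcomes.
counts = Counter(i + j for i, j in product(range(1, 7), repeat=2))
pmf_U = {s: counts[s] / 36.0 for s in sorted(counts)}
# pmf_U[2] == pmf_U[12] == 1/36, pmf_U[7] == 6/36, and so on.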
Figure 3: Probability mass function for the roll of two dice, both from exact calculation and a simulation of 10,000 rolls

4.2 Cumulative Distribution Function

Sometimes, instead of looking at the function x ↦ P[X = x], it is more convenient to use
x ↦ P[X ≤ x].
This is the cumulative distribution function (CDF) of the random variable X.
It gives precisely the same information as the PMF, although in a different
form. If we want to know the probability that the value of X lies in a certain
range, then it is likely to be easier to estimate this from the graph of the CDF than
from the graph of the PMF. Observe that the CDF is a nondecreasing function of
x, with maximum value 1.
Figure 4 shows the CDF for the dice example.
For instance, the probability that the sum of the dice is between 4 and 9 inclusive is the difference between the heights of the stems at x = 9 and x = 3. If we
only had the graph of the PMF we would have to add the heights of the stems at
x = 4, 5, 6, 7, 8 and 9.
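In code (a sketch continuing pmf_U from the earlier snippet, repeated here so it runs on its own), the CDF is a running sum of the PMF, and a range probability is a difference of two of its values:

from itertools import product
from collections import Counter

counts = Counter(i + j for i, j in product(range(1, 7), repeat=2))
pmf_U = {s: counts[s] / 36.0 for s in sorted(counts)}

def cdf(pmf):
    # Accumulate the PMF in increasing order of x to get x -> P[X <= x].
    total, F = 0.0, {}
    for x in sorted(pmf):
        total += pmf[x]
        F[x] = total
    return F

F = cdf(pmf_U)
F[9] - F[3]  # P[4 <= U <= 9] = 27/36 = 0.75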
Figure 4: Cumulative distribution function for the roll of two dice