Assignment 4: Permutations and Combinations
CS244-Randomness and Computation
Assigned February 18
Due February 27
March 10, 2015
Note: Python doesn’t have a nice built-in function to compute binomial coefficients, and surprisingly, matplotlib does not seem to either. If you want to
take the trouble you can download the scipy package (same site as numpy) and
then type
import scipy.special
scipy.special.binom(n,k)
It is probably simpler just to use the following function for computing these
coefficients:
def combinations(n,k):
    prod = 1.0
    for j in range(k):
        prod = prod*(n-j)/(j+1)
    return prod
I will post this code on the website.
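As a quick sanity check on the function above, the sketch below compares it with exact integer arithmetic. It assumes Python 3.8 or later, where `math.comb` exists; that is an assumption about your interpreter, not something the course requires.

```python
import math

def combinations(n, k):
    """Binomial coefficient n-choose-k as a float, as in the note above."""
    prod = 1.0
    for j in range(k):
        prod = prod * (n - j) / (j + 1)
    return prod

# Compare the floating-point product with the exact integer value.
# math.comb is only available in Python 3.8+ (assumed here).
for n, k in [(5, 2), (52, 5), (100, 50)]:
    approx = combinations(n, k)
    exact = math.comb(n, k)
    print(n, k, approx, exact, math.isclose(approx, exact, rel_tol=1e-12))
```

The function returns a float, so very large coefficients are only accurate to about 15 significant digits.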
I can’t seem to leave the birthday stuff alone, and I had to restrain myself from
making every problem about birthdays.
1. Use the exponential approximation to estimate how many people need to be
present in order for the probability of a coincidental birthday to be greater than
0.9, 0.95, 0.99. (So there are three answers here.) Then answer the questions
again using the exact probabilities; you may want to write just a little bit of Python
code. Compare the two results; they should be quite close.
Solution. We approximate the probability of no coincidental birthday in a
group of k people by
$$e^{-k^2/2N},$$
where N = 365. We thus have to solve equations of the form
$$e^{-k^2/730} = a,$$
where a = 0.1, 0.05, 0.01. Taking logs of both sides and simplifying a little gives
$$k = \sqrt{730 \cdot \ln\frac{1}{a}}.$$
Substituting 10, 20, and 100 for $1/a$ gives the solutions:
40.998, 46.76, 57.98.
Let’s round these up to 41, 47, 58. To check the answer against the exact
probabilities, we use the following code:
>>> def birthday_coincidence(numpeople):
...     j=1
...     for k in range(1,numpeople):
...         j *= 1.0*(365-k)/365
...     return 1-j
...
>>> birthday_coincidence(40)
0.891231809817949
>>> birthday_coincidence(41)
0.9031516114817354
>>> birthday_coincidence(46)
0.9482528433672548
>>> birthday_coincidence(47)
0.9547744028332994
>>> birthday_coincidence(57)
0.9901224593411699
>>> birthday_coincidence(58)
0.9916649793892612
In the first two cases, our approximation gave the best possible answer. In the
third case it was off by one (57 would have been a better answer).
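The whole comparison can be automated. The following is a minimal sketch, not the posted course code; the helper names `approx_threshold` and `exact_threshold` are made up for illustration.

```python
import math

N = 365  # days in the year, as in the solution above

def approx_threshold(target):
    """k solving 1 - exp(-k^2 / (2N)) = target (exponential approximation)."""
    return math.sqrt(2 * N * math.log(1.0 / (1.0 - target)))

def birthday_coincidence(numpeople):
    """Exact probability of a shared birthday among numpeople (uniform model)."""
    j = 1.0
    for k in range(1, numpeople):
        j *= (N - k) / N
    return 1 - j

def exact_threshold(target):
    """Smallest group size whose exact coincidence probability exceeds target."""
    k = 1
    while birthday_coincidence(k) <= target:
        k += 1
    return k

# For each target: the approximate solution and the exact smallest group size.
for target in (0.9, 0.95, 0.99):
    print(target, round(approx_threshold(target), 2), exact_threshold(target))
```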
2. Suppose you have a database of biographies of prominent people from the past.
Each biography contains a date of birth and a date of death. If there are 1000
records in the database, what is the probability that two of them share both a date
of birth and a date of death (we are ignoring the year of birth and the year of
death, and just looking at the month and the day)? You should use the exponential
estimate for the generalized birthday problem.
Solution. It’s just the birthday problem with k = 1000 and $N = 365^2$. The
probability of at least one shared birthday-death day pair is approximately
$$1 - e^{-1000^2/(2 \cdot 365^2)} = 0.9766.$$
It’s almost a sure thing. With only 100 people, though, the estimate is just under 4%;
it takes about 430 people before the probability of a shared pair passes 50%.
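The exponential estimate for the generalized birthday problem is easy to evaluate numerically. The sketch below is one way to do it; the function name is made up for illustration, and the group sizes in the loop are arbitrary choices. The one thing to watch is that the exponent is computed with true (floating-point) division.

```python
import math

def shared_pair_estimate(k, N):
    """Exponential estimate 1 - exp(-k^2 / (2N)) for k people and N equally likely values."""
    # Force floating-point division. With floor division on integers, an
    # exponent such as -3.75 would silently be rounded down to -4.
    return 1.0 - math.exp(-(k ** 2) / (2.0 * N))

N = 365 ** 2  # number of possible (birth day, death day) pairs

# Illustrative group sizes; 1000 is the size asked about in the problem.
for k in (100, 430, 1000):
    print(k, round(shared_pair_estimate(k, N), 4))
```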
3. You walk into a room with k people. What is the probability that someone
in the room has the same birthday as you? (Observe that this is very different
from the question we asked earlier, about whether there is any pair of people in
the room with the same birthday.) Express this exactly, and then approximate it
using the exponential approximation $1 - x \approx e^{-x}$ for small positive x. How many
people need to be in the room for the probability to exceed one-half?
Solution. The probability that a randomly chosen person has a different birthday
from me is $1 - \frac{1}{365}$. The probability that k people all have different birthdays from
me is $(1 - \frac{1}{365})^k$. With the exponential approximation, this is about
$$e^{-k/365}.$$
Let’s find the value of k that makes this one-half: We take logs and reciprocals
and get
$$k \approx 365 \cdot \ln 2 = 252.998.$$
An exact calculation shows
$$\left(1 - \frac{1}{365}\right)^{253} = 0.4995,$$
so our approximation was very accurate.
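A short check of both the exact expression and the approximation, as a sketch with illustrative function names:

```python
import math

def same_as_me_exact(k, days=365):
    """Exact probability that at least one of k people shares my birthday."""
    return 1.0 - (1.0 - 1.0 / days) ** k

def same_as_me_approx(k, days=365):
    """Exponential approximation 1 - exp(-k/days) of the same probability."""
    return 1.0 - math.exp(-k / days)

# Find the smallest k for which the exact probability exceeds one-half.
k = 1
while same_as_me_exact(k) <= 0.5:
    k += 1
print(k, same_as_me_exact(k), same_as_me_approx(k))
```

Compare this with the roughly 23 people needed for any pair in the room to match: matching one fixed birthday is a much rarer event.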
4. If I asked you to compute the probabilities of various poker hands, it would take
you less than a millisecond to find the Wikipedia page ’Poker odds’ with all the
answers, complete with the number of relevant outcomes for each hand expressed
in terms of binomial coefficients. So I had to make up some new poker hands and
ask you their probabilities. Explain your reasoning carefully, and try to express
your answers both in terms of binomial coefficients and powers, and as numerical
values.
It’s easy to be led astray here, and a very good way to check your answer is to
write a simulation. You are not required to do this for the homework, but it’s not
a bad idea if you want to see if you were right.
(a) The picture cards are the three ranks Jack, Queen, King. What is the probability of getting all picture cards?
Solution. There are 12 picture cards, so the total number of 5-card hands containing
only picture cards is $\binom{12}{5}$. The desired probability is thus
$$\binom{12}{5} \Big/ \binom{52}{5} = 0.0003047,$$
which is quite a lot smaller than I would have guessed!
(b) Two of the suits contain black cards, and two of the suits contain red cards.
What is the probability of having all 5 cards be the same color?
Solution. There are 26 red cards and 26 black cards. We can proceed as in (a) to
compute the probability of getting all red cards. The desired probability is twice
this value. So the answer is:
$$2 \cdot \binom{26}{5} \Big/ \binom{52}{5} = 0.0506.$$
(c) What is the probability of having all five cards belong to exactly two of the
suits? Remember there are two ways this can split: 3 of one suit and 2 of the
other, or 4 of one suit and 1 of the other.
Solution. For a 3-2 split, there are 4 ways to choose the 3-suit, and then 3 ways to choose
the 2-suit. Once the 3-suit is chosen, there are $\binom{13}{3}$ ways to choose 3 cards from
it, and also $\binom{13}{2}$ ways to choose 2 cards from the 2-suit. So the number of distinct
hands in which there are three cards from one suit and two from the other is
$$4 \times 3 \times \binom{13}{3} \times \binom{13}{2}.$$
By essentially identical reasoning, the number of distinct hands in which there are
four cards from one suit and one from another suit is
$$4 \times 3 \times \binom{13}{4} \times \binom{13}{1}.$$
Put it all together and the total number of hands in question is
$$4 \times 3 \times \binom{13}{3} \times \binom{13}{2} + 4 \times 3 \times \binom{13}{4} \times \binom{13}{1} = 379236.$$
So the desired probability is
$$379236 \Big/ \binom{52}{5} = 0.1459.$$
Here is another way to get the same result; it’s hard to say if this is simpler
or not. First choose our two suits: there are $\binom{4}{2} = 6$ ways to do this. Then choose
5 cards from the 26 cards in the 2 suits. This gives
$$6 \times \binom{26}{5} = 394680.$$
The problem is that in this tabulation, we have also counted the hands that consist
of cards from a single suit, and moreover, we have counted each of these hands
several times. For example, there are $\binom{13}{5}$ hands consisting entirely of hearts,
and in our tabulation, each of these has been counted three times (once for each of
the suit pairs hearts-spades, hearts-clubs, hearts-diamonds). That means we must
subtract
$$4 \times 3 \times \binom{13}{5} = 15444$$
from our total.
And, what do you know,
394680 − 15444 = 379236.
It always feels good when two different methods give the same answer!
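A third check is the simulation suggested at the start of the problem. The sketch below deals random 5-card hands and compares the observed frequencies with the counting formulas. The card encoding and function names are assumptions made for illustration, and `math.comb` needs Python 3.8 or later.

```python
import random
from math import comb  # Python 3.8+

RANKS = range(1, 14)   # 11, 12, 13 stand for Jack, Queen, King
SUITS = "SHDC"         # spades, hearts, diamonds, clubs
DECK = [(r, s) for r in RANKS for s in SUITS]

def all_pictures(hand):
    return all(r >= 11 for r, _ in hand)

def one_color(hand):
    colors = {"black" if s in "SC" else "red" for _, s in hand}
    return len(colors) == 1

def exactly_two_suits(hand):
    return len({s for _, s in hand}) == 2

def simulate(event, trials=50_000, seed=0):
    """Fraction of random 5-card hands (drawn without replacement) satisfying event."""
    rng = random.Random(seed)
    hits = sum(event(rng.sample(DECK, 5)) for _ in range(trials))
    return hits / trials

total = comb(52, 5)
exact = {
    "all pictures": comb(12, 5) / total,
    "one color": 2 * comb(26, 5) / total,
    "exactly two suits": 12 * (comb(13, 3) * comb(13, 2)
                               + comb(13, 4) * comb(13, 1)) / total,
}
events = {
    "all pictures": all_pictures,
    "one color": one_color,
    "exactly two suits": exactly_two_suits,
}
for name, event in events.items():
    print(name, round(exact[name], 5), round(simulate(event), 5))
```

The all-picture hand is so rare that 50,000 trials produce only a handful of hits, so expect that estimate to be noisy; the other two settle down quickly.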
5. There are two candidates in an election. Candidate A has received 55% of the
votes, candidate B 45%. There is a very large number of voters (several million,
let’s say). We randomly sample 100 voters. This is sampling without replacement, since we should not poll the same voter twice, but the voter pool is so large
that you can treat it as a problem of sampling with replacement, which makes the
calculation somewhat easier. What is the probability that in the sample, candidate B receives more votes? Express this answer as a formula using the binomial
coefficients, and then compute the probability exactly.
HINT: Think of the underlying experiment as flipping 100 biased coins in succession. We saw how to express the probability of getting exactly k heads in terms of
binomial coefficients, so here you will have a sum of about 50 such probabilities.
You will thus need to write a little code to answer the question.
Solution. Just as a reality check, we would expect this answer to be less than
one-half, because candidate A received more votes overall. By the coin analogy,
the probability that candidate B receives exactly k votes is
$$\binom{100}{k} \cdot 0.45^k \cdot 0.55^{100-k}.$$
Thus the probability that candidate B receives strictly more votes than candidate
A is the sum of all these values as k varies from 51 to 100. A quick computation
with Python gives
$$\sum_{k=51}^{100} \binom{100}{k} \cdot 0.45^k \cdot 0.55^{100-k} = 0.1346.$$
This shows you something about the effectiveness of polling–if we have a truly
representative sample, and a 55-45 margin, then we can predict the result of the
election correctly 87% of the time by sampling only 100 people. With a sample
of 200 people, the success rate rises to 93%.
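The sum is only a few lines of Python. This is a sketch rather than the code used for the solution; the function name is made up, and `math.comb` assumes Python 3.8 or later (the `combinations` function from the note at the top would work just as well).

```python
from math import comb  # Python 3.8+

def prob_b_leads(n, p_b):
    """P(B gets strictly more than half of n sampled votes) when B's true share is p_b."""
    return sum(comb(n, k) * p_b ** k * (1 - p_b) ** (n - k)
               for k in range(n // 2 + 1, n + 1))

# Sample sizes discussed in the solution above.
for n in (100, 200):
    print(n, round(prob_b_leads(n, 0.45), 4))
```

Note that the sum starts at n // 2 + 1, so a tied sample is not counted as B leading.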
6. (Real birthdays) This is the most involved problem in terms of programming,
although not all that deep in terms of math. One very useful part of the problem
concerns how to sample from a given nonuniform distribution.
I am going to give you actual data on birth dates in the United States from one
year. You are to simulate the birthday problem using this distribution, and then
superimpose a plot of the result on the one obtained from exact calculation using
the uniform distribution model.
The birth data for 1978 is posted on the course website. I found this at the
’Chance’ website from Dartmouth, which also has the Grinstead-Snell book, but
I don’t know the original source for the data. You will want to read the second
column into a Python list. If you’ve forgotten (or never knew) how to do this, you
can use the following code (of course you have to change the full path name for
the file.)
Figure 1: Distribution of US birthdays in 1978
infile = open('/Users/straubin/teaching/244/244website/birthday.txt','r')
bdaylist = []
for j in range(365):
    s=infile.readline().split()
    bdaylist.append(int(s[1]))
(a) Make a scatter plot or a stem plot of the data. You can see the nonuniformity
very clearly: it is somewhat exaggerated if you display the plot with the default
settings, so I suggest that you base the y-coordinates at 0, using ylim(0,11000).
I find the results astonishing. There is indeed a seasonal variation (explained
by what? planning for the optimal time for the baby to be born? seasonal variation
in sexual behavior? in fertility?) but the amazing thing is that it looks as if there
are two entirely separate data series, with roughly the same seasonal variation, but
one significantly lower than the other. (Speculation about the variation is not part
of the assignment, just some random musing.)
Solution. The code to produce the plot is posted on the website, and the plot is
in Figure 1. There definitely is a seasonal variation: the plot shows a spike in
births around the end of October and at the very end of the year, and a low in
mid-April. The October and April births point to more people conceiving around
Christmastime and fewer at the height of summer (I’m not sure why). I wonder if the year-end spike is there for tax purposes—that shows some very careful
planning!
But the real surprise is that the data falls into two series, and the difference
between these two series is larger than the variation within each series. What could
explain this extraordinary non-seasonal variation? If you study the data carefully,
you’ll realize that the lower series largely consists of two days out of every week.
This is probably due to scheduled Caesarean sections, which constitute a very
significant fraction of total births in the United States—the hospitals don’t like to
schedule these for weekends.
(b) You are to plot, for k = 1 to about 65, the probability of coincidental birthdays
using this probability distribution. The first hurdle is that you need some way of
generating random birthdays based on this probability model. You can code this
by hand using rand(), but there is a built-in method in numpy. You have to add
to your program
import numpy.random as npr
You then use a function called choice. To see how this works, a call to
npr.choice([1,2,3,4,5,6],p=[0.3, 0.3, 0.2, 0.1,0.05,0.05])
will generate a value in {1, 2, 3, 4, 5, 6} distributed according to the probability mass
function p: that is, 1 will occur with probability 0.3, 2 with probability 0.3, etc.
Use this function and the information read in from the file to randomly generate
birthdays according to the given distribution.
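If you would rather code the sampling by hand, one standard approach is to build a table of running totals and look up a uniform random number in it. The sketch below uses only the standard library; the function name is made up, and the counts are placeholders proportional to the example p above, not the 1978 data.

```python
import bisect
import itertools
import random

def make_sampler(counts, rng=random):
    """Return a function drawing index i with probability counts[i] / sum(counts)."""
    cumulative = list(itertools.accumulate(counts))  # running totals
    total = cumulative[-1]

    def draw():
        u = rng.random() * total  # uniform on [0, total)
        # First position whose running total exceeds u; the min() guards
        # against floating-point round-up at the very top of the range.
        return min(bisect.bisect_right(cumulative, u), len(counts) - 1)

    return draw

# Placeholder counts for illustration only (proportional to 0.3, 0.3, 0.2, ...).
counts = [30, 30, 20, 10, 5, 5]
draw = make_sampler(counts, random.Random(1))
samples = [draw() for _ in range(100_000)]
freqs = [samples.count(i) / len(samples) for i in range(len(counts))]
print(freqs)
```

The indices returned here start at 0. If you use `npr.choice` instead, remember that its `p` argument must be a list of probabilities summing to 1, so the raw counts from the file have to be divided by their total first.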
The second hurdle is efficiently performing the simulation. You can do this
any way you like, but there is a nice trick for speeding things up, based on the
following insight: Consider the experiment where we repeatedly sample people
from the population until we find a birthday that is the same as one we have already drawn, and look at the the random variable that gives the number of rounds
this experiment lasts. Our original plot of birthday probabilities is just the cumulative distribution function of this random variable assuming uniformly distributed
birthdays.
This means that we can repeatedly sample birthdays with the choice function, and make a cumulative histogram. You have to scale things so that the values
rise from 0 to 1, and use the value returned by the histogram function to produce
a line plot.
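Here is a minimal sketch of that trick using only the standard library, without the plotting. The function names are made up, and equal weights stand in for the real data; to run it on the 1978 figures, pass the list read from the file as `weights` instead.

```python
import itertools
import random

def rounds_until_repeat(cum_weights, rng):
    """Draw days until one repeats; return the number of draws made."""
    days = range(len(cum_weights))
    seen = set()
    while True:
        day = rng.choices(days, cum_weights=cum_weights)[0]
        if day in seen:
            return len(seen) + 1  # this draw produced the repeat
        seen.add(day)

def empirical_cdf(weights, max_k=65, trials=10_000, seed=0):
    """cdf[k] estimates P(a repeat occurs within the first k draws), k = 0..max_k."""
    rng = random.Random(seed)
    cum_weights = list(itertools.accumulate(weights))
    counts = [0] * (max_k + 1)
    for _ in range(trials):
        t = rounds_until_repeat(cum_weights, rng)
        if t <= max_k:
            counts[t] += 1
    # Running sum of the histogram, scaled by the number of trials.
    return [c / trials for c in itertools.accumulate(counts)]

def uniform_exact(k, days=365):
    """Exact probability of a repeat among k people with equally likely birthdays."""
    p_distinct = 1.0
    for j in range(1, k):
        p_distinct *= (days - j) / days
    return 1.0 - p_distinct

# Placeholder weights: equal counts stand in for the 1978 data (bdaylist).
weights = [1] * 365
cdf = empirical_cdf(weights)
for k in (10, 23, 41, 57):
    print(k, round(cdf[k], 3), round(uniform_exact(k), 3))
```

Each trial contributes to every point of the curve at once, which is why this is much cheaper than running a separate simulation for each value of k.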
(c) Add to your program the code that generates the plot for the probability of
coincidental birthdays under the uniform distribution, and superimpose the two.
Do you see much difference between the two plots? How adequately does the
uniform probability distribution model the real-life version of this problem?
Solution.
The code to produce the plot is posted on the website, and the plot itself is
shown in Figure 2. It performs 10,000 trials of the experiment of repeatedly sampling birthdays until a duplicate birthday is found, and recording the number of
samples. It then plots a cumulative histogram of the result (as a line graph, not a
bar graph). This is superimposed on a calculation of the exact probabilities assuming 365 equally likely birthdays. There is only a very slight difference between
the two plots. So for purposes of the repeated-birthday problem, the simple uniformity model gives accurate results, in spite of the nonuniformity present in the
real-life data.
Figure 2: Probability of a repeated birthday, showing the results of a simulation
based on the 1978 data, and the exact probabilities assuming uniform distribution
of birthdays