Discrete Distributions
STA 281 Fall 2011
1 Introduction
Previously we defined a random variable to be an experiment with numerical outcomes. Often different
random variables are related in that they have the same sample space or the same form for the
probabilities. For example, suppose X is a random variable with P(X=0)=0.3 and P(X=1)=0.7 while Y
is a random variable with P(Y=0)=0.4 and P(Y=1)=0.6. These random variables are similar in that
they have the same sample space S={0,1}. The random variables are different in that they assign
different probabilities to the elements of the sample space. In fact, there are many random variables
with the sample space S={0,1}. For every real number p in (0,1), we can define X such that P(X=1)=p
and P(X=0)=1-p. These probabilities sum to 1 and are both non-negative because 0<p<1; thus the random variable satisfies the axioms of probability.
When we have a set of similar random variables, we call this set a family of distributions. Typically
families are indexed by a parameter such as p in the previous example. For each value of the parameter,
we have a different random variable, but one that can be described by a common equation involving the
parameter. The family we described in the previous example is called the Bernoulli family with
parameter p. The parameter space is the possible values of the parameter. For the Bernoulli family, the
parameter space is p (0,1).
Bernoulli random variables are used in a variety of applications. The simplest example is flipping a
fair coin. If we create a random variable X by assigning 1 to heads and 0 to tails, then X ~ Bern(0.5)
(since each outcome is equally likely, p must be 0.5). Bernoulli random variables are also used in voter
polls. If Fred and Barney are running against each other in an election, there is some proportion p of
people who will vote for Fred. If Fred is unpopular, then p may be small; if the race is close, p may be near 0.5; or if Fred is winning by a landslide, p may be above 0.8. If we select one individual at
random from the population and ask whether they are intending to vote for Fred, we get a yes or no
answer. If we construct a random variable Y by assigning 1 to yes and 0 to no, then Y ~ Bern(p). Any
experiment resulting in two outcomes can be transformed to a Bernoulli by assigning one of the
outcomes to 1 and the other outcome to 0.
Because all Bernoulli random variables have a common set of possible values and a common form for
the probabilities, we can solve for the expectation and variance of a Bernoulli random variable in terms of the parameter p. Let X ~ Bern(p). The expectation of the random variable is defined to be E[X] = Σ x P(X=x). The random variable X has two possible outcomes, 0 and 1, that occur with probabilities 1-p and p, so

E[X] = 0(1-p) + 1(p) = p
E[X^2] = 0^2(1-p) + 1^2(p) = p
V[X] = E[X^2] - (E[X])^2 = p - p^2 = p(1-p)
So, for example, a Bern(0.7) random variable has mean 0.7 and variance 0.21.
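These formulas are easy to check numerically. Below is a minimal Python sketch (using scipy.stats, one choice among many; the library itself is not part of this handout) verifying the Bern(0.7) example.

from scipy.stats import bernoulli

p = 0.7
X = bernoulli(p)     # a frozen Bern(0.7) distribution
print(X.mean())      # 0.7, matching E[X] = p
print(X.var())       # 0.21, matching V[X] = p(1-p)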
2 Experiments Based on Bernoulli Distribution
Bernoulli distributions form the basis for several important families of distributions. Complicated
experiments can often be described by specific combinations of several Bernoulli random variables.
Since the Bernoulli distribution is simple, thinking of more complicated distributions in terms of
Bernoulli random variables often makes them easier to understand.
We will consider two experiments resulting from Bernoulli random variables. One experiment
consists of drawing a sample of n Bernoulli random variables and counting how many ones appear in the
sample. This experiment forms the basis of voter polling. Usually we don’t just ask one person who
they are going to vote for, but many. Gallup polls, for example, typically question around 1500 people.
The count of the number of ones in the sample is used to make inferences about who will win the
election. Another example is quality control, where we are interested in determining the proportion of
products that are of acceptable quality. To assess the quality of the products, we test products as they
come off the assembly line. Each product is classified as acceptable (1) or unacceptable (0). A count of
acceptable items corresponds to a count of the number of ones.
This description alone (take n Bernoulli’s and count the number of successes) is not sufficient to
specify the distribution. We will consider two specific variants. The first scenario, called a Binomial
experiment, states that the n Bernoulli random variables are all independent and have the same
probability of success p. If Y is the number of successes in this scenario, then Y has a Binomial
distribution with parameters n and p (see section 2.1). When n is large and p is small, the Binomial
distribution may be approximated by a Poisson distribution (see section 2.2). In contrast to a Binomial
experiment, the n Bernoulli random variables may arise from sampling n items from a finite population
divided into M successes and N-M failures. We have actually already considered this scenario. If Y
counts the number of successes in this scenario, then Y has a Hypergeometric distribution with
parameters N, M, and n (see section 2.3). Under some conditions, the Binomial distribution and the
Hypergeometric distribution closely approximate each other (see section 2.4).
The alternative experiment (as opposed to counting the number of successes in n trials) is to observe
successive independent Bernoulli random variables, all with the same parameter p, and count the
number of failures that occur before the rth success. Note in the previous experiment we fixed the
number of trials, n, and counted the number of successes. We are doing the opposite here in that we
have fixed the number of successes, r, and are counting the number of failures. We could equivalently
count the number of trials, which is the number of failures plus the number of successes r. The number
of failures before the rth success has a Negative Binomial distribution with parameters r and p (see
section 2.5).
2.1 Binomial
The Binomial distribution is one of the most commonly used distributions in statistics and, with the
possible exception of the normal distribution, the distribution we will concentrate on most in this
course. The binomial distribution describes the results of voter polls, clinical trials, and many other
sampling procedures. A Binomial experiment has the following properties:
1. We observe a fixed, known number n of Bernoulli trials.
2. The Bernoulli trials are all independent with the same p.
3. The random variable of interest is the number of ones observed.
The first requirement just says we know the maximum number of ones in advance. The second requirement is often satisfied in practice and provides mathematical convenience in describing a Binomial
distribution. By assuming the Bernoulli’s are independent, we may multiply probabilities together
rather than using conditional probabilities. Assuming all the Bernoulli’s have the same p further
simplifies the calculations.
In a Binomial experiment we call our n Bernoulli trials X1, X2, …, Xn. We know that each Xi ~ Bern(p). Our random variable of interest is the number of ones among the Xi, so let us construct another random variable Y which is the sum of the n Xi:

Y = X1 + X2 + … + Xn

Such a random variable Y has a Binomial distribution with parameters n and p, written Y ~ Bin(n,p). The Binomial random variable is just recording the number of ones in the sample, since each Xi=1 increments Y while each Xi=0 does nothing to Y.
Recall that the distribution of a random variable is a list of the possible values combined with the
probability of each value occurring. For simplicity, let us assume that n=4. If we observe 4 Bernoulli’s
and count the number of ones, we must observe 0, 1, 2, 3, or 4 ones. These are the possible values of Y.
Calculating the probabilities of each of these outcomes requires looking at the individual Bernoulli
probabilities.
There are 4 Xi random variables, each with two possible values, 0 and 1. This results in 16 (2^4) possible combinations of the set (X1,X2,X3,X4) (these are 0000, 0001, 0010, …, 1110, 1111). Using the assumption that all the Xi are independent and have the same probability p, we may compute the probability of observing any particular set of Xi values. For example, the probability of observing X1=1, X2=0, X3=1, X4=1 is

P(X1=1 ∩ X2=0 ∩ X3=1 ∩ X4=1) = P(X1=1)P(X2=0)P(X3=1)P(X4=1) = p(1-p)pp = p^3(1-p)

We may rewrite the intersection as a product because of the independence of the Xi. We may compute
the probability for all of the sixteen combinations. In addition, for each combination we may compute
the value of Y. For example, if X1=1, X2=0, X3=1, and X4=1, then Y=1+0+1+1=3. The table shows the
probabilities and Y for the sixteen combinations.
X1  X2  X3  X4   Probability    Y
0   0   0   0    (1-p)^4        0
0   0   0   1    p(1-p)^3       1
0   0   1   0    p(1-p)^3       1
0   0   1   1    p^2(1-p)^2     2
0   1   0   0    p(1-p)^3       1
0   1   0   1    p^2(1-p)^2     2
0   1   1   0    p^2(1-p)^2     2
0   1   1   1    p^3(1-p)       3
1   0   0   0    p(1-p)^3       1
1   0   0   1    p^2(1-p)^2     2
1   0   1   0    p^2(1-p)^2     2
1   0   1   1    p^3(1-p)       3
1   1   0   0    p^2(1-p)^2     2
1   1   0   1    p^3(1-p)       3
1   1   1   0    p^3(1-p)       3
1   1   1   1    p^4            4
Notice that all combinations with the same Y have the same probability. For example, all the rows where Y=2 have probability p^2(1-p)^2. This is because all rows with Y=2 have 2 ones and 2 zeros, the ones occurring with probability p and the zeros occurring with probability (1-p). The sixteen combinations in the table are the sample space for the four Bernoulli's. To find P(Y=2), we may just sum the probabilities of the outcomes corresponding to Y=2. There are six such combinations, each with probability p^2(1-p)^2, so P(Y=2)=6p^2(1-p)^2. We may also find P(Y=3) similarly. Each combination with Y=3 has 3 ones and 1 zero, so it occurs with probability p^3(1-p). There are 4 rows with Y=3, so P(Y=3)=4p^3(1-p). We could similarly determine the probabilities for the remaining values of Y.
While n is always known in a Binomial experiment, it varies from one experiment to the next, so we want formulas for arbitrary values of n. If we count the number of ones for n Bernoulli trials, we will find the possible values are the integers 0, 1, 2, …, n. To find their probabilities, we could construct a table as with n=4, but if n is even moderately large the table will be too large to write conveniently. In principle such a table could be constructed and we could find the probabilities of each combination of the Xi. Each combination with y ones and n-y zeros will have probability p^y(1-p)^(n-y) (recall each one occurs with probability p and each zero occurs with probability 1-p). We must then multiply p^y(1-p)^(n-y) by the number of rows with Y=y to determine the probability. Each row with Y=y has y ones and n-y zeros. How many rows are there with y ones? This is the number of ways y ones may be placed in n slots, or n choose y. In general, if Y ~ Bin(n,p), then

P(Y=y) = (n choose y) p^y (1-p)^(n-y),  y = 0, 1, …, n
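As a quick numerical check, the following Python sketch (using scipy.stats; the value p = 0.3 is an arbitrary choice) confirms that the formula reproduces the n=4 table, for example P(Y=2) = 6p^2(1-p)^2.

from scipy.stats import binom

n, p = 4, 0.3
print(6 * p**2 * (1 - p)**2)                # 0.2646, from the table: 6 p^2 (1-p)^2
print(binom.pmf(2, n, p))                   # 0.2646, from the general formula
print(binom.pmf(range(n + 1), n, p).sum())  # 1.0; the probabilities sum to 1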
To find the expectation of Y, we could look directly at the definition E[Y] = Σ y P(Y=y). This results in the complicated formula

E[Y] = Σ y (n choose y) p^y (1-p)^(n-y), summing over y = 0, 1, …, n

It is simpler to observe that Y is a linear combination of the Xi, and use the formulas for linear combinations to compute the mean and variance of Y. Recall that the mean and variance of each Xi are p and p(1-p) since each Xi is distributed Bern(p). Since

Y = X1 + X2 + … + Xn

we may derive

E[Y] = E[X1] + E[X2] + … + E[Xn] = p + p + … + p = np
V[Y] = V[X1] + V[X2] + … + V[Xn] = p(1-p) + … + p(1-p) = np(1-p)

(the variances add because the Xi are independent).
The expectation formula is intuitive. If we have 10 flips of a coin that is heads with probability 0.70, we
expect about 10(0.7)=7 heads.
Often we are more interested in the sample proportion than the actual number. For example, election polls usually report what percentage of those surveyed would vote for a particular candidate, not the actual count. If Y is the actual number of voters preferring the candidate, then the proportion that prefer the candidate is

p̂ = Y/n

We may compute probabilities for sample proportions just as we would for counts. If we interview 1000 people, the probability the proportion will be 0.712 is the same as the probability Y=712.
The sample proportion p̂ is a linear transformation of Y, since p̂ = (1/n)Y. Using the formulas for the mean and variance of a linear transformation, we may derive

E[p̂] = (1/n)E[Y] = np/n = p
V[p̂] = (1/n)^2 V[Y] = np(1-p)/n^2 = p(1-p)/n
The expectation of p̂ is p, which indicates that, on average, the sample proportion is equal to the population proportion, one of several reasons the sample proportion is typically used to estimate p. The property that the estimator, on average, is equal to the actual value is called unbiasedness. The variance of p̂ decreases as the sample size n increases, because n is in the denominator of V[p̂]. This indicates that as the sample size increases, the sample proportion becomes more precise (less variable). Combining the mean and variance of p̂, we may observe that for large samples, the sample proportion p̂ is very likely to be close to the population proportion p.
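A small simulation illustrates this concentration. The sketch below (p = 0.6 is an arbitrary choice) draws 5000 simulated polls at each of several sample sizes and compares the empirical variance of p̂ to the formula p(1-p)/n.

import numpy as np

rng = np.random.default_rng(0)
p = 0.6                                        # hypothetical population proportion
for n in [10, 100, 1000, 10000]:
    phat = rng.binomial(n, p, size=5000) / n   # 5000 simulated sample proportions
    print(n, phat.var(), p * (1 - p) / n)      # empirical vs. theoretical V[p-hat]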
2.2 Approximation of Binomial by Poisson
Often, we are faced with a binomial experiment with a large n and a small p. For example, consider the
number of pages in a book with typographical errors. After the book has been proofread many times,
there is a small probability any given page will have an error, but a book contains many pages. A
telephone company is often interested in how many people call their operator. They serve many
customers, but in a given day each customer has only a small probability of calling the operator. In a
football season, there is a small probability of a turnover on a given play, but a season consists of many
plays.
If we are interested in computing probabilities for the number of ones observed in a binomial
experiment, we could use the binomial distribution. However, if n is large and p is small, the Poisson
distribution is often used as a convenient approximation. As a rule of thumb, if n≥100, p≤0.05, and np≤20, then a Bin(n,p) distribution may be accurately approximated with a Poisson distribution with parameter λ = np, written Poi(λ).
As stated earlier, defining a random variable involves stating the possible values of the variable and giving their probabilities. If X ~ Poi(λ), then the possible values of X are all the nonnegative integers. Although the set of possible values differs between the Poisson and Binomial (the Binomial only has possible values 0, 1, …, n), the Poisson distribution places very little probability on values greater than n, so the approximation is still reasonable. The possible outcomes have probabilities

P(X=x) = e^(-λ) λ^x / x!,  x = 0, 1, 2, …

These probabilities sum to 1 since

Σ_{x=0}^{∞} e^(-λ) λ^x/x! = e^(-λ) Σ_{x=0}^{∞} λ^x/x! = e^(-λ) e^λ = 1
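Numerically, the pmf is easy to evaluate; the sketch below (with λ = 5) computes one probability and confirms that the probabilities sum to 1 once enough terms are included.

from scipy.stats import poisson

lam = 5
print(poisson.pmf(4, lam))                           # e^-5 5^4 / 4! = 0.1755
print(sum(poisson.pmf(x, lam) for x in range(100)))  # 1.0 to machine precision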
As an illustration, suppose we have a hotel with 100 rooms and that the probability any given room
will ask for room service from 2AM to 6AM is 0.05. Suppose we want to find exact and approximate
probabilities that zero through five rooms will ask for room service.
The exact distribution is Bin(n=100, p=0.05). The parameter values are sufficient to use a Poi(λ = np = 5) distribution as an approximation. We computed probabilities according to both the Binomial and Poisson families. For example, let X be the number of rooms asking for room service. The exact probability (Binomial) that X=4 is

P(X=4) = (100 choose 4)(0.05)^4(0.95)^96 ≈ 0.1781

and the approximate probability (Poisson) is

P(X=4) = e^(-5) 5^4/4! ≈ 0.1755

These are reasonably similar. Recall that the Poisson has an infinite number of possible values. We know that P(X=101)=0, since there are not 101 rooms in the hotel. While the Poisson does give the event X=101 nonzero probability, the approximate probability is

P(X=101) = e^(-5) 5^101/101!

a number smaller than 10^(-90). While this value is not exactly 0, it is extremely close, so the approximation is still reasonable. For a fixed λ, the larger n is, the better the approximation.
For X=0 through X=5, the approximate probabilities are the second row of the following table and
the exact probabilities are the third row. Although not an exact approximation, the Poisson distribution
is reasonable. The approximation improves as n increases, with np held constant. Suppose the hotel had
500 rooms and the probability any given room orders room service between 2AM and 6AM is 0.01.
The expected number of rooms ordering room service is also np=5, so a Poi(5) distribution could be
used as an approximation. The exact probabilities are shown in the fourth row of the table. The
approximation has improved accuracy for the larger n. If n is increased to 5000 and p decreased to 0.001
(exact probabilities shown in the fifth row of the table), where np is also 5, the approximation is almost
exact.
Distribution            0       1       2       3       4       5
Poi(λ=5)                0.0067  0.0337  0.0842  0.1404  0.1755  0.1755
Bin(n=100, p=0.05)      0.0059  0.0312  0.0812  0.1396  0.1781  0.1800
Bin(n=500, p=0.01)      0.0066  0.0332  0.0836  0.1402  0.1760  0.1764
Bin(n=5000, p=0.001)    0.0067  0.0336  0.0842  0.1404  0.1755  0.1756
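The entries of this table can be reproduced with a few lines of Python (a sketch using scipy.stats):

from scipy.stats import binom, poisson

ks = range(6)
print([round(poisson.pmf(k, 5), 4) for k in ks])        # the Poi(5) row
for n, p in [(100, 0.05), (500, 0.01), (5000, 0.001)]:  # the three Bin rows, each with np = 5
    print([round(binom.pmf(k, n, p), 4) for k in ks])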
If X ~ Poi(λ), then E[X] = λ and V[X] = λ. These quantities approximate the Binomial expectation and variance. The expectations are the same, since λ = np. The variances are close, with the Poisson variance λ = np and the Binomial variance np(1-p). If p is small, then 1-p will be close to 1, so np(1-p) will be close to np.
2.3 Hypergeometric Distribution
Suppose we have a population of N individuals consisting of M ones and N-M zeros (again, the ones and zeros may indicate any dichotomous (two-valued) variable, such as gender). We sample n individuals from the population at random and let X be the number of ones in the sample. The random variable X then has a hypergeometric distribution with parameters N, M, and n, written X ~ HyperGeo(N,M,n).
While we still have a sample of n individuals from a population and record the number of ones in the
sample (same as the Binomial), here our sampling scheme creates dependencies among the Bernoulli
trials. We could consider sampling n individuals at random to occur by picking a single individual at
random from the N individuals, then picking a second individual from the N-1 individuals remaining,
and so on. The Bernoulli’s are dependent because whether a one or a zero is chosen on the second pick
depends on whether a one or zero was chosen on the first pick. For example, suppose a class consists of
30 men and 10 women, and we select 5 students at random. The probability we select a man on the first
pick is 30/40. If a man is chosen with the first pick, the probability a man is chosen on the second pick
is 29/39. In contrast, if a woman is chosen with the first pick, the probability a man is chosen with the
second pick is 30/39. Since the probability of choosing a man with the second pick depends on the
gender of the first person picked, the Bernoulli trials are dependent and thus the Binomial distribution
does not apply.
Although this is not a Binomial experiment, we have already computed the probabilities for this
scenario. We want to find the probability of randomly selecting x ones from a population of M ones and
N-M zeros. The probability X=x is

P(X=x) = (M choose x)((N-M) choose (n-x)) / (N choose n)
The possible values of this distribution require some thought to determine. Clearly x must be an
integer, and must be at least 0 and at most n. However, depending on the number of ones and zeros in
the population, some values may not be possible. For example, if the population consists entirely of
zeros, then the only possible value is x=0.
The largest value x may be is n, provided there are enough ones in the population that n ones may be selected. If M is less than n, so there are fewer ones available than the number of individuals we are selecting, then we may observe at most M ones. The largest possible value of x is therefore min(n,M).
Similarly, if the number of zeros in the population is fewer than the number of individuals we sample,
then we must observe at least n-(N-M)=n-N+M ones. The possible values of x are therefore any integer
between max(0,n-N+M) and min(n,M).
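The support rule can be checked by brute force. The sketch below uses a small hypothetical population (N=10, M=8, n=5) and scipy.stats; note that scipy's hypergeom takes its arguments in the order (population size, number of ones, sample size).

from scipy.stats import hypergeom

N, M, n = 10, 8, 5                                  # 8 ones and 2 zeros; sample 5
probs = [hypergeom.pmf(x, N, M, n) for x in range(n + 1)]
print([x for x, pr in enumerate(probs) if pr > 0])  # [3, 4, 5]
print(max(0, n - N + M), min(n, M))                 # 3 5, matching the rule above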
Through some algebra, it is possible to determine the mean and variance of the hypergeometric distribution. If X ~ HyperGeo(N,M,n), then

E[X] = n(M/N)
V[X] = n(M/N)(1 - M/N)((N-n)/(N-1))

2.4 Comparison of Binomial and Hypergeometric Distribution
Recall we said that for a Hypergeometric distribution, the Bernoulli trials are dependent. We said that
in a population of 30 men and 10 women, the probability of selecting a man on the second pick depended
on the gender of the first pick. We said that if a man was picked first, there was a 29/39 probability of
selecting a man second. If a woman was picked first, there was a 30/39 probability of selecting a man
second. The difference between 29/39 and 30/39 is not that great. There is dependence, but is it
enough to make a difference?
In fact, the answer is no: the dependence between the picks is sufficiently small that we may ignore it. As a general rule, the larger the population, the easier it is to ignore the dependence. If, for example, the class consisted of 4000 individuals with 1000 women, the difference between 3000/3999 and 2999/3999 is extremely small (0.7502 versus 0.7499). For large populations, the difference between the
probabilities is so small that often we completely ignore the dependence and use the binomial
distribution. In voter polling, often there are millions of voters. Although technically the correct
distribution is Hypergeometric, the Binomial is always used to compute margins of error.
Suppose the class consists of 25 individuals with 5 ones and you select 5 individuals. Let X be the
number of ones observed in the sample. The possible values of X are the integers zero through five.
The Hypergeometric distribution is the exact distribution, since we are drawing without replacement
from a finite population. We may find the hypergeometric probability directly from the definition, such as

P(X=1) = (5 choose 1)(20 choose 4) / (25 choose 5) = 0.456
The full set of probabilities is shown in the second row of the table. We could approximate these
probabilities by a Binomial distribution. We are sampling n=5 individuals from a population that
contains a proportion p=5/25=0.2 of ones. The binomial probabilities are shown in the fifth row of the
table. The values are not very similar, although they are not vastly different, either. The accuracy of
the binomial distribution increases as N increases (with the proportions remaining equal). Suppose the
class consisted of 250 individuals with 50 ones (p=0.2) and we sample 5 individuals. The
hypergeometric probabilities are shown in the third row of the table and match the binomial probabilities fairly closely. As N is increased further, to 2500, also with M=500 so p=0.2, the probabilities
are almost exactly equal.
Distribution                    0      1      2      3      4      5
HyperGeo(N=25,M=5,n=5)          0.292  0.456  0.215  0.036  0.002  0.000
HyperGeo(N=250,M=50,n=5)        0.324  0.414  0.206  0.050  0.006  0.000
HyperGeo(N=2500,M=500,n=5)      0.327  0.410  0.205  0.051  0.006  0.000
Bin(n=5,p=0.2)                  0.328  0.410  0.205  0.051  0.006  0.000

As a rule of thumb, HyperGeo(N,M,n) may be approximated by a Bin(n,p=M/N) if (n/N)≤0.05, so we are sampling no more than 5% of the population.
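These rows can also be reproduced in Python (a sketch using scipy.stats, with the same hypergeom argument order as before):

from scipy.stats import binom, hypergeom

ks = range(6)
for N, M in [(25, 5), (250, 50), (2500, 500)]:       # the three HyperGeo rows
    print([round(hypergeom.pmf(k, N, M, 5), 3) for k in ks])
print([round(binom.pmf(k, 5, 0.2), 3) for k in ks])  # the Bin(5, 0.2) row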
When this approximation holds, the mean and variance of the approximating Binomial are close to
the mean and variance of the original Hypergeometric distribution. Recall the expectation of a Binomial
distribution is np, which for the approximation is nM/N. This is exactly equal to the expectation of the
Hypergeometric. The variance of the binomial is np(1-p), which for the approximation is

n(M/N)(1 - M/N)

Compare this to the exact variance of the Hypergeometric

V[X] = n(M/N)(1 - M/N)((N-n)/(N-1))

The only difference between the exact variance and the approximate is the term (N-n)/(N-1). However, remember we are using this approximation for n/N≤0.05, which would make the (N-n)/(N-1) term approximately 1.
2.5 Negative Binomial Distribution
So far we have concentrated on collecting a sample of n Bernoulli random variables and counting the
number of ones in the sample. Another experiment we could perform is to sequentially observe
Bernoulli random variables until we observe the first one, or the first r ones.
For simplicity, we first describe observing Bernoulli’s until we observe the first one. After
observing the first one, we let a random variable Y be the number of zeros observed. For example,
suppose the first three Bernoulli trials are X1=0, X2=0, and X3=1. The first one appeared on the third
trial, and we observed Y=2 zeros before the first one. The possible values of Y are the nonnegative integers 0, 1, 2, …. It could take an arbitrarily long time to observe the first one. Usually we will observe the first one fairly quickly, so large values of Y have small probabilities, but they are possible.
To complete the description of the distribution, we also require the probabilities for each of the
possible outcomes. If the Bernoulli trials we observe are independent and have the same probability p,
this is relatively easy to calculate. To find P(Y=0), we must realize that the event Y=0 corresponds to
observing no zeros before the first one. So Y=0 is the same event as X1=1, which we know occurs with
probability p. The event Y=1 occurs if we observe one zero before the first one, so X1=0 and X2=1.
This occurs with probability p(1-p). In general, the event Y=y corresponds to observing y zeros before the first one, so X1 through Xy are all 0 while the (y+1)st Bernoulli is a 1. This occurs with probability p(1-p)^y. This distribution, the number of zeros until the first one, is called the Geometric distribution.
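The geometric probabilities are simple enough to compute directly. The sketch below (p = 0.3 is arbitrary) tabulates P(Y=y) = p(1-p)^y and checks it against scipy's nbinom with r = 1, which uses the same zeros-before-the-rth-one convention.

from scipy.stats import nbinom

p = 0.3
for y in range(5):
    print(y, p * (1 - p)**y, nbinom.pmf(y, 1, p))  # formula vs. scipy; the columns agree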
More generally, instead of looking at the number of zeros observed before the first one, we may look
at the number of zeros before the second one, or third one, or rth one. This is called the Negative
Binomial distribution. The Geometric distribution is a Negative Binomial distribution with r=1. As
with the geometric, if Y is the number of zeros before the rth one, the possible values are all nonnegative
integers. We might find r ones in the first r trials, and therefore observe no zeros. Similarly, it could
take an arbitrarily large number of Bernoulli’s before the rth one.
Suppose we are waiting for the 5th one. What is the probability we observe exactly 7 zeros before the 5th one? We calculate this probability by noticing that the event "the 5th one occurs after exactly 7 zeros" requires that we observe 12 Bernoulli trials, with 5 ones and 7 zeros, and that the 12th Bernoulli trial must have been a one, since we are waiting for the 5th one to appear. Therefore, the first 11 Bernoulli's consist of 7 zeros and 4 ones, and the 12th Bernoulli is a one. There are no restrictions on the order of ones and zeros for the first 11 Bernoulli's, so this occurs with probability

(11 choose 7) p^4 (1-p)^7

This quantity must be multiplied by the probability that the 12th Bernoulli is a one, which is p, so the actual probability is

(11 choose 7) p^5 (1-p)^7
In general, we want the probability of observing y zeros before we observe the rth one. For the event Y=y to occur, we must observe y+r Bernoulli's, the first y+r-1 Bernoulli's consisting of y zeros and r-1 ones, and then observing a one on the next Bernoulli trial. The probability is

P(Y=y) = ((y+r-1) choose y) p^r (1-p)^y,  y = 0, 1, 2, …
The mean and variance of the negative binomial distribution are E[Y] = r(1-p)/p and V[Y] = r(1-p)/p^2.
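As a check, the sketch below evaluates the worked example (7 zeros before the 5th one) from the formula and with scipy's nbinom, which counts failures before the rth success; p = 0.4 is an arbitrary choice.

from math import comb
from scipy.stats import nbinom

r, y, p = 5, 7, 0.4
print(comb(11, 7) * p**5 * (1 - p)**7)     # (11 choose 7) p^5 (1-p)^7 = 0.0946
print(nbinom.pmf(y, r, p))                 # same value from scipy
print(nbinom.mean(r, p), r * (1 - p) / p)  # both 7.5, matching E[Y] = r(1-p)/p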
3 Recognizing Distributions
This handout discusses five distributions (Bernoulli, Binomial, Hypergeometric, Poisson, and Negative
Binomial). One skill you should have is the ability to look at a word problem and determine which
distribution is appropriate. After that, of course, you need to be able to apply formulas to find
probabilities, means, and variances.
The Bernoulli distribution is the building block of all other distributions. It should be the simplest to recognize, since it has just two outcomes, 1 and 0, which occur with probabilities p and 1-p. Bernoulli's may be combined in two ways in this course. Either we sample a group of n Bernoulli's
from a population and count the number of ones in the sample, or we observe a sequence of Bernoulli’s
until we observe r ones and then count the number of zeros before the rth one occurred. The Binomial,
Hypergeometric, and Poisson distributions describe counting the number of ones, while the Negative Binomial distribution describes counting how many zeros occur before the rth one.
If you are sampling n independent Bernoulli trials, all with the same probability p of success, then you should use a Bin(n,p) distribution. If you are randomly sampling n individuals from a population with M successes and N-M failures, use a HyperGeo(N,M,n) distribution. If you are doing neither, consult your friendly local statistician. There are two approximations we consider in this course. If the exact distribution is Bin(n,p) but n is large and p is small, the Poi(λ=np) distribution is an adequate approximation. If the exact distribution is HyperGeo(N,M,n) but n/N≤0.05, then the Bin(n,p=M/N) distribution is an adequate approximation.