Discrete Distributions
STA 281 Fall 2011

1 Introduction

Previously we defined a random variable to be an experiment with numerical outcomes. Often different random variables are related in that they have the same sample space or the same form for the probabilities. For example, suppose X is a random variable with P(X=0)=0.3 and P(X=1)=0.7 while Y is a random variable with P(Y=0)=0.4 and P(Y=1)=0.6. These random variables are similar in that they have the same sample space S={0,1}. The random variables are different in that they assign different probabilities to the elements of the sample space. In fact, there are many random variables with the sample space S={0,1}. For every real number p in (0,1), we can define X such that P(X=1)=p and P(X=0)=1-p. These probabilities sum to 1 and are both non-negative because 0<p<1, so the random variable satisfies the axioms of probability.

When we have a set of similar random variables, we call this set a family of distributions. Typically families are indexed by a parameter, such as p in the previous example. For each value of the parameter we have a different random variable, but one that can be described by a common equation involving the parameter. The family described in the previous example is called the Bernoulli family with parameter p. The parameter space is the set of possible values of the parameter. For the Bernoulli family, the parameter space is p ∈ (0,1).

Bernoulli random variables are used in a variety of applications. The simplest example is flipping a fair coin. If we create a random variable X by assigning 1 to heads and 0 to tails, then X ~ Bern(0.5) (since each outcome is equally likely, p must be 0.5). Bernoulli random variables are also used in voter polls. If Fred and Barney are running against each other in an election, there is some proportion p of people who will vote for Fred. If Fred is unpopular, p may be small; if the race is close, p may be near 0.5; and if Fred is winning by a landslide, p may be above 0.8. If we select one individual at random from the population and ask whether they intend to vote for Fred, we get a yes or no answer. If we construct a random variable Y by assigning 1 to yes and 0 to no, then Y ~ Bern(p). Any experiment resulting in two outcomes can be transformed into a Bernoulli random variable by assigning one of the outcomes to 1 and the other to 0.

Because all Bernoulli random variables have a common set of possible values and a common form for the probabilities, we can solve for the expectation and variance of a Bernoulli random variable in terms of the parameter p. Let X ~ Bern(p). The expectation of a random variable is defined to be E[X] = Σ x·P(X=x), where the sum runs over the possible values x. The random variable X has two possible outcomes, 0 and 1, occurring with probabilities 1-p and p, so

E[X] = 0(1-p) + 1(p) = p
E[X^2] = 0^2(1-p) + 1^2(p) = p
V[X] = E[X^2] - (E[X])^2 = p - p^2 = p(1-p)

So, for example, a Bern(0.7) random variable has mean 0.7 and variance 0.21.
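These two formulas are easy to check numerically. Below is a minimal sketch (not part of the original handout) verifying E[X] = p and V[X] = p(1-p) for the Bern(0.7) example; the use of scipy is an assumption for illustration only.

```python
# Sketch only: verify the Bernoulli mean and variance formulas numerically.
# Assumes scipy is available; the handout itself uses no software.
from scipy.stats import bernoulli

p = 0.7
X = bernoulli(p)
print(X.mean())  # 0.7, matching E[X] = p
print(X.var())   # 0.21, matching V[X] = p(1-p)
```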
2 Experiments Based on Bernoulli Distribution

Bernoulli distributions form the basis for several important families of distributions. Complicated experiments can often be described by specific combinations of several Bernoulli random variables. Since the Bernoulli distribution is simple, thinking of more complicated distributions in terms of Bernoulli random variables often makes them easier to understand. We will consider two experiments based on Bernoulli random variables. One experiment consists of drawing a sample of n Bernoulli random variables and counting how many ones appear in the sample. This experiment forms the basis of voter polling. Usually we don't ask just one person who they are going to vote for, but many. Gallup polls, for example, typically question around 1500 people. The count of the number of ones in the sample is used to make inferences about who will win the election. Another example is quality control, where we are interested in determining the proportion of products that are of acceptable quality. To assess the quality of the products, we test products as they come off the assembly line. Each product is classified as acceptable (1) or unacceptable (0). A count of acceptable items corresponds to a count of the number of ones.

This description alone (take n Bernoulli trials and count the number of successes) is not sufficient to specify the distribution. We will consider two specific variants. The first scenario, called a Binomial experiment, states that the n Bernoulli random variables are all independent and have the same probability of success p. If Y is the number of successes in this scenario, then Y has a Binomial distribution with parameters n and p (see section 2.1). When n is large and p is small, the Binomial distribution may be approximated by a Poisson distribution (see section 2.2). In contrast to a Binomial experiment, the n Bernoulli random variables may arise from sampling n items from a finite population divided into M successes and N-M failures. We have actually already considered this scenario. If Y counts the number of successes in this scenario, then Y has a Hypergeometric distribution with parameters N, M, and n (see section 2.3). Under some conditions, the Binomial distribution and the Hypergeometric distribution closely approximate each other (see section 2.4).

The alternative experiment (as opposed to counting the number of successes in n trials) is to observe successive independent Bernoulli random variables, all with the same parameter p, and count the number of failures that occur before the rth success. Note that in the previous experiment we fixed the number of trials, n, and counted the number of successes. Here we do the opposite: we fix the number of successes, r, and count the number of failures. We could equivalently count the number of trials, which is the number of failures plus the number of successes r. The number of failures before the rth success has a Negative Binomial distribution with parameters r and p (see section 2.5).

2.1 Binomial

The Binomial distribution is one of the most commonly used distributions in statistics and, with the possible exception of the normal distribution, the distribution we will concentrate on most in this course. The binomial distribution describes the results of voter polls, clinical trials, and many other sampling procedures. A Binomial experiment has the following properties:

1. We observe a fixed, known number n of Bernoulli trials.
2. The Bernoulli trials are all independent with the same p.
3. The random variable of interest is the number of ones observed.

The first requirement just says we know the maximum number of ones in advance. The second requirement is often satisfied in practice and provides mathematical convenience in describing a Binomial distribution. By assuming the Bernoulli trials are independent, we may multiply probabilities together rather than using conditional probabilities. Assuming all the Bernoulli trials have the same p further simplifies the calculations.

In a Binomial experiment we call our n Bernoulli trials X1, X2, ..., Xn. We know that each Xi ~ Bern(p). Our random variable of interest is the number of ones among the Xi, so we construct another random variable Y which is the sum of the n Xi:

Y = X1 + X2 + ... + Xn

Such a random variable Y has a Binomial distribution with parameters n and p, written Y ~ Bin(n,p). The Binomial random variable simply records the number of ones in the sample, since each Xi=1 increments Y while each Xi=0 contributes nothing to Y.
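As a concrete illustration, here is a minimal simulation sketch of this construction (not from the handout). The sample size 1500 echoes the Gallup example above; the value p = 0.52 is entirely made up.

```python
# Sketch: build a Bin(n, p) count Y as the sum of n independent Bern(p)
# trials. n = 1500 mirrors the Gallup example; p = 0.52 is a made-up value.
import random

random.seed(0)
n, p = 1500, 0.52
X = [1 if random.random() < p else 0 for _ in range(n)]  # X1, ..., Xn
Y = sum(X)  # the number of ones in the sample
print(Y)    # one realization of Y ~ Bin(1500, 0.52)
```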
Recall that the distribution of a random variable is a list of the possible values combined with the probability of each value occurring. For simplicity, let us assume that n=4. If we observe 4 Bernoulli trials and count the number of ones, we must observe 0, 1, 2, 3, or 4 ones. These are the possible values of Y. Calculating the probabilities of each of these outcomes requires looking at the individual Bernoulli probabilities. There are 4 Xi random variables, each with two possible values, 0 and 1. This results in 16 (2^4) possible combinations of the set (X1,X2,X3,X4): these are 0000, 0001, 0010, ..., 1110, 1111. Using the assumption that all the Xi are independent and have the same probability p, we may compute the probability of observing any particular set of Xi values. For example, the probability of observing X1=1, X2=0, X3=1, X4=1 is

P(X1=1 ∩ X2=0 ∩ X3=1 ∩ X4=1) = P(X1=1)P(X2=0)P(X3=1)P(X4=1) = p^3(1-p)

We may rewrite the intersection as a product because of the independence of the Xi. We may compute the probability for all sixteen combinations. In addition, for each combination we may compute the value of Y. For example, if X1=1, X2=0, X3=1, and X4=1, then Y=1+0+1+1=3. The table shows the probabilities and Y for the sixteen combinations.

X1  X2  X3  X4  Probability   Y
0   0   0   0   (1-p)^4       0
0   0   0   1   p(1-p)^3      1
0   0   1   0   p(1-p)^3      1
0   0   1   1   p^2(1-p)^2    2
0   1   0   0   p(1-p)^3      1
0   1   0   1   p^2(1-p)^2    2
0   1   1   0   p^2(1-p)^2    2
0   1   1   1   p^3(1-p)      3
1   0   0   0   p(1-p)^3      1
1   0   0   1   p^2(1-p)^2    2
1   0   1   0   p^2(1-p)^2    2
1   0   1   1   p^3(1-p)      3
1   1   0   0   p^2(1-p)^2    2
1   1   0   1   p^3(1-p)      3
1   1   1   0   p^3(1-p)      3
1   1   1   1   p^4           4

Notice that all combinations with the same Y have the same probability. For example, all the rows where Y=2 have probability p^2(1-p)^2. This is because all rows with Y=2 have 2 ones and 2 zeros, the ones occurring with probability p and the zeros occurring with probability (1-p). The sixteen combinations in the table are the sample space for the four Bernoulli trials. To find P(Y=2), we may simply sum the probabilities of the outcomes corresponding to Y=2. There are six such combinations, each with probability p^2(1-p)^2, so P(Y=2)=6p^2(1-p)^2. We may find P(Y=3) similarly. Each combination with Y=3 has 3 ones and 1 zero, so it occurs with probability p^3(1-p). There are 4 rows with Y=3, so P(Y=3)=4p^3(1-p). We could similarly determine the probabilities for the remaining values of Y.

While n is always known in a Binomial experiment, it differs from one experiment to the next, so we want formulas that work for arbitrary values of n. If we count the number of ones in n Bernoulli trials, the possible values are the integers 0, 1, 2, ..., n. To find their probabilities, we could construct a table as with n=4, but if n is even moderately large the table will be too large to write conveniently. In principle such a table could be constructed and we could find the probabilities of each combination of the Xi. Each combination with y ones and n-y zeros will have probability p^y(1-p)^(n-y) (recall each one occurs with probability p and each zero occurs with probability 1-p). We must then multiply p^y(1-p)^(n-y) by the number of rows with Y=y to determine the probability. Each row with Y=y has y ones and n-y zeros. How many rows are there with y ones? This is the number of ways y ones may be placed in n slots, or n choose y, written C(n,y). In general, if Y ~ Bin(n,p), then

P(Y=y) = C(n,y) p^y (1-p)^(n-y),  y = 0, 1, ..., n
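The n=4 table can be verified by brute force. The sketch below (an illustration, not part of the handout; p = 0.3 is an arbitrary choice) enumerates all 16 combinations and confirms that P(Y=2) = 6p^2(1-p)^2 = C(4,2)p^2(1-p)^2.

```python
# Sketch: enumerate all 2^4 combinations of (X1, X2, X3, X4) and sum the
# probabilities of those with Y = 2, exactly as in the table above.
from itertools import product
from math import comb

p = 0.3  # arbitrary illustrative value in (0,1)
prob_y2 = 0.0
for combo in product([0, 1], repeat=4):
    if sum(combo) == 2:
        # each one contributes a factor p, each zero a factor (1-p)
        prob_y2 += p**2 * (1 - p)**2

print(prob_y2)                         # sum over the six rows with Y = 2
print(comb(4, 2) * p**2 * (1 - p)**2)  # same value: 6 p^2 (1-p)^2
```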
To find the expectation of Y, we could work directly from the definition E[Y] = Σ y·P(Y=y). This results in the complicated formula

E[Y] = Σ (y=0 to n) y C(n,y) p^y (1-p)^(n-y)

It is simpler to observe that Y is a linear combination of the Xi and use the formulas for linear combinations to compute the mean and variance of Y. Recall that the mean and variance of each Xi are p and p(1-p), since each Xi is distributed Bern(p). Since Y = X1 + X2 + ... + Xn, we may derive

E[Y] = E[X1] + E[X2] + ... + E[Xn] = np
V[Y] = V[X1] + V[X2] + ... + V[Xn] = np(1-p)

where the variances add because the Xi are independent. The expectation formula is intuitive. If we have 10 flips of a coin that is heads with probability 0.70, we expect about 10(0.7)=7 heads.

Often we are more interested in the sample proportion than the actual count. For example, election polls usually report what percentage of those surveyed would vote for a particular candidate, not the actual count. If Y is the number of voters preferring the candidate, then the proportion who prefer the candidate is p̂ = Y/n. We may compute probabilities for sample proportions just as we would for counts. If we interview 1000 people, the probability the proportion will be 0.712 is the same as the probability that Y=712. The sample proportion p̂ is a linear transformation of Y, since p̂ = (1/n)Y. Using the formulas for the mean and variance of a linear transformation, we may derive

E[p̂] = E[Y]/n = np/n = p
V[p̂] = V[Y]/n^2 = np(1-p)/n^2 = p(1-p)/n

The expectation of p̂ is p, which indicates that, on average, the sample proportion is equal to the population proportion, one of several reasons the sample proportion is typically used to estimate p. The property that an estimator is, on average, equal to the actual value is called unbiasedness. The variance of p̂ decreases as the sample size n increases, because n is in the denominator of V[p̂]. This indicates that as the sample size increases, the sample proportion becomes more precise (less variable). Combining the mean and variance of p̂, we may observe that for large samples, the sample proportion p̂ is very likely to be close to the population proportion p.
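Here is a quick numeric check of these formulas (a sketch under the assumption scipy is available; n = 10, p = 0.7, y = 7 are arbitrary choices).

```python
# Sketch: check the Binomial pmf against scipy.stats.binom and verify the
# mean/variance of Y and of the sample proportion p-hat = Y/n.
from math import comb
from scipy.stats import binom

n, p, y = 10, 0.7, 7
print(comb(n, y) * p**y * (1 - p)**(n - y))  # C(n,y) p^y (1-p)^(n-y)
print(binom.pmf(y, n, p))                    # same value

Y = binom(n, p)
print(Y.mean(), Y.var())             # np = 7.0 and np(1-p) = 2.1
print(Y.mean() / n, Y.var() / n**2)  # E[p-hat] = p, V[p-hat] = p(1-p)/n
```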
2.2 Approximation of Binomial by Poisson

Often we are faced with a binomial experiment with a large n and a small p. For example, consider the number of pages in a book with typographical errors. After the book has been proofread many times, there is a small probability any given page will have an error, but a book contains many pages. A telephone company is often interested in how many people call their operator. They serve many customers, but in a given day each customer has only a small probability of calling the operator. In a football season, there is a small probability of a turnover on a given play, but a season consists of many plays. If we are interested in computing probabilities for the number of ones observed in a binomial experiment, we could use the binomial distribution. However, if n is large and p is small, the Poisson distribution is often used as a convenient approximation. As a rule of thumb, if n≥100, p≤0.05, and np≤20, then a Bin(n,p) distribution may be accurately approximated by a Poisson distribution with parameter λ=np, written Poi(λ).

As stated earlier, defining a random variable involves stating the possible values of the variable and giving their probabilities. If X ~ Poi(λ), then the possible values of X are all the nonnegative integers. Although the sets of possible values differ between the Poisson and Binomial (the Binomial only has possible values 0, 1, ..., n), the Poisson distribution places very little probability on values greater than n, so the approximation is still reasonable. The possible outcomes have probabilities

P(X=x) = e^(-λ) λ^x / x!,  x = 0, 1, 2, ...

These probabilities sum to 1 since Σ (x=0 to ∞) λ^x/x! = e^λ.

As an illustration, suppose we have a hotel with 100 rooms and that the probability any given room will ask for room service from 2AM to 6AM is 0.05. Suppose we want to find exact and approximate probabilities that zero through five rooms will ask for room service. The exact distribution is Bin(n=100, p=0.05). The parameter values are sufficient to use a Poi(λ=np=5) distribution as an approximation. We computed probabilities according to both the Binomial and Poisson families. For example, let X be the number of rooms asking for room service. The exact probability (Binomial) that X=4 is

P(X=4) = C(100,4) (0.05)^4 (0.95)^96 ≈ 0.1781

and the approximate probability (Poisson) is

P(X=4) = e^(-5) 5^4 / 4! ≈ 0.1755

These are reasonably similar. Recall that the Poisson has an infinite number of possible values. We know that P(X=101)=0, since there are not 101 rooms in the hotel. While the Poisson does give the event X=101 nonzero probability, the approximate probability is e^(-5) 5^101 / 101!. While this value is not exactly 0 (it is smaller than 10^(-90)), it is extremely close, so the approximation is still reasonable. For a fixed λ, the larger n is, the better the approximation.

For X=0 through X=5, the approximate probabilities are given in the second row of the table below and the exact probabilities in the third row. Although not exact, the Poisson approximation is reasonable. The approximation improves as n increases with np held constant. Suppose the hotel had 500 rooms and the probability any given room orders room service between 2AM and 6AM were 0.01. The expected number of rooms ordering room service is again np=5, so a Poi(5) distribution could be used as an approximation. The exact probabilities are shown in the fourth row of the table, and the approximation is more accurate for the larger n. If n is increased to 5000 and p decreased to 0.001 (exact probabilities shown in the fifth row of the table), where np is again 5, the approximation is almost exact.

Distribution           x=0     x=1     x=2     x=3     x=4     x=5
Poi(λ=5)               0.0067  0.0337  0.0842  0.1404  0.1755  0.1755
Bin(n=100, p=0.05)     0.0059  0.0312  0.0812  0.1396  0.1781  0.1800
Bin(n=500, p=0.01)     0.0066  0.0332  0.0836  0.1402  0.1760  0.1764
Bin(n=5000, p=0.001)   0.0067  0.0336  0.0842  0.1404  0.1755  0.1756

If X ~ Poi(λ), then E[X]=λ and V[X]=λ. These quantities approximate the Binomial expectation and variance. The expectations are the same, since λ=np. The variances are close, with the Poisson variance λ=np and the Binomial variance np(1-p). If p is small, then 1-p will be close to 1, so np(1-p) will be close to np.
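The comparison table above can be regenerated with a few lines of code (a sketch assuming scipy is available; the handout itself gives no code).

```python
# Sketch: Poi(5) approximation versus the exact Bin(n, p) probabilities
# for the room-service example, x = 0 through 5.
from scipy.stats import binom, poisson

print('Poi(5)  ', [round(poisson.pmf(x, 5), 4) for x in range(6)])
for n, p in [(100, 0.05), (500, 0.01), (5000, 0.001)]:
    # np = 5 in every case, so Poi(5) approximates each of these
    print(f'Bin({n})', [round(binom.pmf(x, n, p), 4) for x in range(6)])
```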
2.3 Hypergeometric Distribution

Suppose we have a population of N individuals consisting of M ones and N-M zeros (again, the ones and zeros may indicate any dichotomous (two-valued) variable, such as gender). We sample n individuals from the population at random and let X be the number of ones in the sample. The random variable X then has a hypergeometric distribution with parameters N, M, and n, written X ~ HyperGeo(N,M,n). While we still have a sample of n individuals from a population and record the number of ones in the sample (the same as the Binomial), here our sampling scheme creates dependencies among the Bernoulli trials. We could consider sampling n individuals at random to occur by picking a single individual at random from the N individuals, then picking a second individual from the N-1 individuals remaining, and so on. The Bernoulli trials are dependent because whether a one or a zero is chosen on the second pick depends on whether a one or a zero was chosen on the first pick. For example, suppose a class consists of 30 men and 10 women, and we select 5 students at random. The probability we select a man on the first pick is 30/40. If a man is chosen with the first pick, the probability a man is chosen on the second pick is 29/39. In contrast, if a woman is chosen with the first pick, the probability a man is chosen with the second pick is 30/39. Since the probability of choosing a man with the second pick depends on the gender of the first person picked, the Bernoulli trials are dependent and thus the Binomial distribution does not apply.

Although this is not a Binomial experiment, we have already computed the probabilities for this scenario. We want to find the probability of randomly selecting x ones from a population of M ones and N-M zeros. The probability that X=x is

P(X=x) = C(M,x) C(N-M,n-x) / C(N,n)

The possible values of this distribution require some thought to determine. Clearly x must be an integer, and it must be at least 0 and at most n. However, depending on the number of ones and zeros in the population, some values may not be possible. For example, if the population consists entirely of zeros, then the only possible value is x=0. The largest value x may take is n, provided there are enough ones in the population that n ones may be selected. If M is less than n, so there are fewer ones available than the number of individuals we are selecting, then we may observe at most M ones. The largest possible value of x is therefore min(n,M). Similarly, if the number of zeros in the population is fewer than the number of individuals we sample, then we must observe at least n-(N-M)=n-N+M ones. The possible values of x are therefore the integers between max(0,n-N+M) and min(n,M).

Through some algebra, it is possible to determine the mean and variance of the hypergeometric distribution. If X ~ HyperGeo(N,M,n), then

E[X] = n(M/N)
V[X] = n(M/N)(1-M/N)((N-n)/(N-1))
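The pmf can be evaluated directly or with scipy (a sketch, not part of the handout). One caveat worth flagging: scipy's hypergeom takes its arguments in the order (population size, number of ones, sample size), which corresponds to this handout's (N, M, n).

```python
# Sketch: P(X=3) for the class example, N = 40 students of whom M = 30 are
# men, sampling n = 5. scipy's argument order (pop size, #ones, #draws)
# matches the handout's (N, M, n) when passed in this sequence.
from math import comb
from scipy.stats import hypergeom

N, M, n = 40, 30, 5
x = 3
print(comb(M, x) * comb(N - M, n - x) / comb(N, n))  # direct formula
print(hypergeom.pmf(x, N, M, n))                     # same value
```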
2.4 Comparison of Binomial and Hypergeometric Distribution

Recall we said that for a Hypergeometric distribution, the Bernoulli trials are dependent. We said that in a population of 30 men and 10 women, the probability of selecting a man on the second pick depended on the gender of the first pick: if a man was picked first, there was a 29/39 probability of selecting a man second, while if a woman was picked first, there was a 30/39 probability of selecting a man second. The difference between 29/39 and 30/39 is not that great. There is dependence, but is it enough to make a difference? In fact, the answer is no; the dependence between the picks is sufficiently small that we may ignore it. As a general rule, the larger the population, the easier it is to ignore the dependence. If, for example, a class consisted of 4000 individuals with 1000 women, the difference between 3000/3999 and 2999/3999 is extremely small (0.7502 versus 0.7499). For large populations, the difference between the probabilities is so small that often we completely ignore the dependence and use the binomial distribution. In voter polling, there are often millions of voters. Although technically the correct distribution is Hypergeometric, the Binomial is always used to compute margins of error.

Suppose the class consists of 25 individuals with 5 ones and you select 5 individuals. Let X be the number of ones observed in the sample. The possible values of X are the integers zero through five. The Hypergeometric distribution is the exact distribution, since we are drawing without replacement from a finite population. We may find the hypergeometric probabilities directly from the definition, for example

P(X=1) = C(5,1) C(20,4) / C(25,5) = 0.456

The full set of probabilities is shown in the second row of the table below. We could approximate these probabilities by a Binomial distribution. We are sampling n=5 individuals from a population that contains a proportion p=5/25=0.2 of ones. The binomial probabilities are shown in the fifth row of the table. The values are not very similar, although they are not vastly different, either. The accuracy of the binomial approximation increases as N increases (with the proportions remaining equal). Suppose the class consisted of 250 individuals with 50 ones (p=0.2) and we sample 5 individuals. The hypergeometric probabilities are shown in the third row of the table and match the binomial probabilities fairly closely. As N is increased further, to 2500, also with M=500 so p=0.2, the probabilities are almost exactly equal.

Distribution                   x=0    x=1    x=2    x=3    x=4    x=5
HyperGeo(N=25, M=5, n=5)       0.292  0.456  0.215  0.036  0.002  0.000
HyperGeo(N=250, M=50, n=5)     0.324  0.414  0.206  0.050  0.006  0.000
HyperGeo(N=2500, M=500, n=5)   0.327  0.410  0.205  0.051  0.006  0.000
Bin(n=5, p=0.2)                0.328  0.410  0.205  0.051  0.006  0.000

As a rule of thumb, HyperGeo(N,M,n) may be approximated by a Bin(n, p=M/N) if (n/N)≤0.05, that is, if we are sampling no more than 5% of the population. When this approximation holds, the mean and variance of the approximating Binomial are close to the mean and variance of the original Hypergeometric distribution. Recall the expectation of a Binomial distribution is np, which for the approximation is nM/N. This is exactly equal to the expectation of the Hypergeometric. The variance of the binomial is np(1-p), which for the approximation is

n(M/N)(1-M/N)

Compare this to the exact variance of the Hypergeometric,

V[X] = n(M/N)(1-M/N)((N-n)/(N-1))

The only difference between the exact variance and the approximate one is the term (N-n)/(N-1). However, remember we are using this approximation when n/N≤0.05, which makes the (N-n)/(N-1) term approximately 1.
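The table above can be regenerated with the following sketch (scipy assumed, as before).

```python
# Sketch: hypergeometric probabilities for growing populations with
# p = M/N fixed at 0.2, compared against Bin(5, 0.2).
from scipy.stats import binom, hypergeom

n = 5
for N, M in [(25, 5), (250, 50), (2500, 500)]:
    print(N, [round(hypergeom.pmf(x, N, M, n), 3) for x in range(n + 1)])
print('Bin', [round(binom.pmf(x, n, 0.2), 3) for x in range(n + 1)])
```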
2.5 Negative Binomial Distribution

So far we have concentrated on collecting a sample of n Bernoulli random variables and counting the number of ones in the sample. Another experiment we could perform is to sequentially observe Bernoulli random variables until we observe the first one, or the first r ones. For simplicity, we first describe observing Bernoulli trials until we observe the first one. After observing the first one, we let a random variable Y be the number of zeros observed. For example, suppose the first three Bernoulli trials are X1=0, X2=0, and X3=1. The first one appeared on the third trial, and we observed Y=2 zeros before the first one. The possible values of Y are the nonnegative integers 0, 1, 2, .... It could take an arbitrarily long time to observe the first one. Usually we will observe the first one fairly quickly, so large values of Y have small probabilities, but they are possible. To complete the description of the distribution, we also require the probabilities for each of the possible outcomes. If the Bernoulli trials we observe are independent and have the same probability p, these are relatively easy to calculate.

To find P(Y=0), realize that the event Y=0 corresponds to observing no zeros before the first one. So Y=0 is the same event as X1=1, which we know occurs with probability p. The event Y=1 occurs if we observe one zero before the first one, so X1=0 and X2=1. This occurs with probability p(1-p). In general, the event Y=y corresponds to observing y zeros before the first one, so X1 through Xy are all 0 while X(y+1)=1. This occurs with probability p(1-p)^y. This distribution, the number of zeros until the first one, is called the Geometric distribution.

More generally, instead of looking at the number of zeros observed before the first one, we may look at the number of zeros before the second one, or third one, or rth one. This is called the Negative Binomial distribution. The Geometric distribution is a Negative Binomial distribution with r=1. As with the geometric, if Y is the number of zeros before the rth one, the possible values are all nonnegative integers. We might find r ones in the first r trials, and therefore observe no zeros. Similarly, it could take an arbitrarily large number of Bernoulli trials before the rth one.

Suppose we are waiting for the 5th one. What is the probability we observe exactly 7 zeros before the 5th one? We calculate this probability by noticing that the event "the 5th one occurs after exactly 7 zeros" requires that we observe 12 Bernoulli trials, with 5 ones and 7 zeros, and that the 12th Bernoulli trial must be a one, since we are waiting for the 5th one to appear. Therefore, the first 11 Bernoulli trials consist of 7 zeros and 4 ones, and the 12th Bernoulli trial is a one. There are no restrictions on the order of the ones and zeros among the first 11 Bernoulli trials, so this occurs with probability

C(11,4) p^4 (1-p)^7

This quantity must be multiplied by the probability that the 12th Bernoulli trial is a one, which is p, so the actual probability is

C(11,4) p^5 (1-p)^7

In general, we want the probability of observing y zeros before we observe the rth one. For the event Y=y to occur, we must observe y+r Bernoulli trials, the first y+r-1 of them consisting of y zeros and r-1 ones, followed by a one on the next Bernoulli trial. The probability is

P(Y=y) = C(y+r-1, r-1) p^r (1-p)^y,  y = 0, 1, 2, ...

The mean and variance of the negative binomial distribution are E[Y]=r(1-p)/p and V[Y]=r(1-p)/p^2.
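Here is a numeric check of the r=5, y=7 example (a sketch; p = 0.5 is an arbitrary choice). Conveniently, scipy's nbinom also counts failures before the rth success, so its pmf matches the formula above directly.

```python
# Sketch: P(Y=7) when waiting for the r = 5th one, checked against scipy,
# plus the mean formula E[Y] = r(1-p)/p.
from math import comb
from scipy.stats import nbinom

r, y, p = 5, 7, 0.5  # p = 0.5 is an arbitrary illustrative value
print(comb(y + r - 1, r - 1) * p**r * (1 - p)**y)  # C(11,4) p^5 (1-p)^7
print(nbinom.pmf(y, r, p))                         # same value
print(nbinom.mean(r, p), r * (1 - p) / p)          # both 5.0
```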
3 Recognizing Distributions

This handout discusses five distributions (Bernoulli, Binomial, Hypergeometric, Poisson, and Negative Binomial). One skill you should have is the ability to look at a word problem and determine which distribution is appropriate. After that, of course, you need to be able to apply the formulas to find probabilities, means, and variances.

The Bernoulli distribution is the building block of all the other distributions. It should be the simplest to recognize, since it has just two outcomes, 1 and 0, which occur with probabilities p and 1-p. Bernoulli trials may be combined in two ways in this course. Either we sample a group of n Bernoulli trials from a population and count the number of ones in the sample, or we observe a sequence of Bernoulli trials until we observe r ones and count the number of zeros before the rth one occurred. The Binomial, Hypergeometric, and Poisson distributions describe counting the number of ones, while the Negative Binomial distribution describes counting how many zeros occur before the rth one.

If you are sampling n independent Bernoulli trials, all with the same probability p of success, you should use a Bin(n,p) distribution. If you are randomly sampling n individuals from a population with M successes and N-M failures, use a HyperGeo(N,M,n) distribution. If you are doing neither, consult your friendly local statistician.

There are two approximations we consider in this course. If the exact distribution is Bin(n,p) but n is large and p is small, the Poi(λ=np) distribution is an adequate approximation. If the exact distribution is HyperGeo(N,M,n) but n/N≤0.05, then the Bin(n, p=M/N) distribution is an adequate approximation.
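The two rules of thumb can be written down as a pair of tiny helpers (a hypothetical illustration; the function names are mine, and only the thresholds come from the handout).

```python
# Hypothetical helpers encoding the two rules of thumb stated above.
def poisson_approx_ok(n, p):
    """Bin(n, p) may be approximated by Poi(np) per the rule of thumb."""
    return n >= 100 and p <= 0.05 and n * p <= 20

def binomial_approx_ok(N, n):
    """HyperGeo(N, M, n) may be approximated by Bin(n, M/N)."""
    return n / N <= 0.05

print(poisson_approx_ok(100, 0.05))  # True: the room-service example
print(binomial_approx_ok(2500, 5))   # True: the 2500-student example
```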