Bernoulli Distribution

A Bernoulli random variable is one which has only 0 and 1 as possible values. Let $p = P(X = 1)$. Thus a Bernoulli random variable X has the following "table":

Possible values of X:   0        1
Probabilities:          1 - p    p

Definition: Say that $X \sim B(1, p)$.

Generically, we say that X = 1 is a success and X = 0 is a failure. We say that p is the "success" probability.

A Bernoulli random variable is the simplest random variable. It models an experiment in which there are only two outcomes.

Example: A fair die is tossed. Let X = 1 only if the first toss shows a "4" or "5". Then $X \sim B(1, \tfrac{1}{3})$.

Mean and Variance: For a Bernoulli random variable with success probability p:

$$\mu_X = 0(1 - p) + 1 \cdot p = p$$

$$\sigma_X^2 = 0^2 (1 - p) + 1^2 \cdot p - p^2 = p - p^2 = p(1 - p)$$

Binomial Distribution

Consider n independent Bernoulli random variables, all with the same success probability p. The sum $Y = X_1 + \cdots + X_n$ is called a binomial random variable with parameters n and p.

Notation: Say $Y \sim B(n, p)$.

Binomial random variables arise in circumstances which warrant the name "binomial experiment". A binomial experiment is one in which:

1. n trials are to be performed. The number n is fixed in advance.
2. Each trial consists of an experiment in which exactly two outcomes are possible. The outcomes are labeled, generically, success and failure.
3. Each trial has the same probability p of success.
4. The trials are independent.

Binomial random variables address the following situations:

a) If I toss a coin 10 times, how many heads will appear?
b) If a machine produces, on average, 5% defectives, then how many defectives would there be among a sample of size 200?

Mean and Variance: A binomial random variable is a sum of n independent Bernoulli random variables, so its mean and variance are n times those of a single Bernoulli. If Y is binomial with parameters n and p, the mean and variance are given by:

$$\mu_Y = np$$

$$\sigma_Y^2 = np(1 - p)$$

Population Sampling: Suppose that a population of N people consists of M "successes" and $N - M$ "failures".
Suppose we wish to extract a sample of size n, one by one, from this population. We are interested in recording the number of "successes" we extract.

If, after each person is sampled and recorded as either a "success" or a "failure", we return him/her to the population to be sampled again, then the samples are all independent and all have the same probability of producing a success. Thus sampling with replacement is a binomial experiment. The number of successes then has a $B(n, \tfrac{M}{N})$ distribution.

If the people are not replaced after sampling (sampling without replacement), then the resulting trials are not independent. In this case, the distribution is not binomial. However, if the sample size n is much smaller than the total population N (say, N is at least 40 times the sample size n), then sampling without replacement is virtually indistinguishable from sampling with replacement. In this case, although the true distribution is not exactly binomial, the distribution is approximately binomial.

The actual interpretation of success and failure could vary. We could be talking about "men" and "women", "people in support of measure A" and "people not in support of measure A", and so on. The main thing is that, as long as we sample a portion that is much smaller than the total population, the number of "successes" will be considered to be binomial.

Estimating the success probability p

Suppose you have a binomial situation, except you don't know the true value of p. What is the best way to guess p? For instance, if you select (with replacement) 50 marbles from a jar of red and green ones, and 34 of them are green, then your best guess is that 68% are green. In fact, this is an unbiased strategy!

Fact: the sample proportion

$$\hat{p} = \frac{Y}{n} = \frac{\text{number of successes in } n \text{ trials}}{n}$$

is unbiased:

$$\mu_{\hat{p}} = \frac{E(Y)}{n} = \frac{np}{n} = p$$

and has standard deviation given by:

$$\sigma_{\hat{p}} = \sqrt{\frac{V(Y)}{n^2}} = \sqrt{\frac{p(1 - p)}{n}}$$

In practice, we replace p with $\hat{p}$ to estimate the standard deviation.
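As a small illustration of the estimate and its estimated standard deviation, here is a minimal Python sketch applied to the marble example (34 green out of 50 draws). The function name `sample_proportion` is a hypothetical helper introduced only for this illustration; it is not part of the original notes.

```python
import math

def sample_proportion(successes, n):
    """Return the sample proportion p_hat and its estimated standard deviation.

    Hypothetical helper for illustration: the standard deviation uses
    p_hat in place of the unknown p, as described in the text.
    """
    p_hat = successes / n
    sd_hat = math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, sd_hat

# Marble example from the text: 34 green marbles in 50 draws with replacement.
p_hat, sd_hat = sample_proportion(34, 50)
print(f"p_hat = {p_hat:.2f}, estimated sd = {sd_hat:.4f}")
```

With these numbers the estimate is 0.68 and the estimated standard deviation is roughly 0.066, so a guess of 68% green could plausibly be off by several percentage points.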
Notice that the standard deviation gets smaller as the sample size n increases!

Large Sample Binomial probabilities

For small samples, probabilities associated with a binomial random variable can be calculated directly or looked up in the so-called "binomial tables". We are going to focus on an approximation technique used when the sample size is large.

Fact: Let n be large. Let $X \sim B(n, p)$; that is, let X be the number of successes in a binomial experiment of n trials and success probability p. Then:

X is approximately normal with mean $np$ and standard deviation $\sqrt{np(1 - p)}$.

Furthermore, if $\hat{p} = \frac{X}{n}$ is the proportion of successes in n trials, then:

$\hat{p}$ is approximately normal with mean $p$ and standard deviation $\sqrt{\frac{p(1 - p)}{n}}$.

The rule of thumb for n to be large enough is that $np > 10$ and $n(1 - p) > 10$.

*Optional Topic*

When using the normal distribution to approximate a binomial distribution, there is a technique that is used to improve the accuracy of the approximation. This is called the continuity correction. The improvement is usually very minor.

Continuity Correction: Suppose X is binomial and n is large enough to use the normal approximation. Recall that X takes only natural number values 0, 1, 2, 3, ... and that the normal curve is continuous. If a and b are natural numbers,

a) Write a probability of the form $P(X \ge a)$ as $P(X \ge a - 0.5)$ and then use the normal approximation.
b) Write a probability of the form $P(X \le b)$ as $P(X \le b + 0.5)$ and then use the normal approximation.
c) Write a probability of the form $P(a \le X \le b)$ as $P(a - 0.5 \le X \le b + 0.5)$ and then use the normal approximation.

Before applying the continuity correction, be sure to write the appropriate binomial probability in one of the above 3 forms.

*End of Optional Topic*

Example: A multiple choice test consists of 100 questions. Each question has 4 possible responses, only one of which is correct. Jane randomly guesses at each question on the test. Let X denote the number of questions Jane answers correctly.

a) What are the mean and standard deviation of X?
b) What is the probability Jane answers at least 35 correctly?

Solution: X is binomial with n = 100 and p = .25. The mean and standard deviation are:

$$\mu_X = np = 100(.25) = 25$$

$$\sigma_X = \sqrt{np(1 - p)} = \sqrt{100(.25)(.75)} \approx 4.33$$

For part b, we note that $np = 25$ and $n(1 - p) = 75$ are both bigger than 10, so we can use the normal distribution to find the approximate probability. Since the z-score of 35 is 2.31 (check: (35 - 25)/4.33 = 2.31), we get

$$P(X \ge 35) \approx P(Z \ge 2.31) = .0104$$

Thus the probability that Jane answers at least 35 correctly is about 1.04%. (Note: part b was calculated without the continuity correction.)

Example: Suppose that, in a large city, 40% of all registered voters voted in the last election. Suppose we take a random sample of n = 1500 registered voters. Let $\hat{p}$ denote the sample proportion who voted in the last election. (Thus $\hat{p}$ is the proportion of this sample of size 1500, not the proportion of all registered voters.)

What is the probability that $.38 \le \hat{p} \le .42$? In English, we are asking: what is the probability that the sample proportion differs from the true proportion by no more than 2 percentage points?

Solution: Assuming that the population of registered voters is at least 40 times the size of the sample (which is very likely for a large city), the number of "successes" (people who voted in the last election) out of n = 1500 is modeled by a binomial random variable. Thus $\hat{p}$ has mean and standard deviation

$$\mu_{\hat{p}} = p = .40$$

$$\sigma_{\hat{p}} = \sqrt{\frac{p(1 - p)}{n}} = \sqrt{\frac{(.40)(.60)}{1500}} \approx 0.01265$$

Now, to calculate $P(.38 \le \hat{p} \le .42)$, we need to find the z-scores of .38 and .42 and use the normal distribution. (Here, n is clearly large enough since np = 1500(.40) = 600 and n(1 - p) = 1500(.60) = 900.)

The z-score of .38 is $\frac{.38 - .40}{0.01265} \approx -1.58$.

The z-score of .42 is $\frac{.42 - .40}{0.01265} \approx 1.58$.

Thus

$$P(.38 \le \hat{p} \le .42) \approx P(-1.58 \le Z \le 1.58) = .8858$$

Therefore, there is an 88.58% chance that the sample proportion will differ from the true value of 40% by no more than 2 percentage points. (Once again, the continuity correction was not used to calculate this probability.)
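The two worked examples above can be checked numerically. The following is a minimal Python sketch using only the standard library; the helper names `normal_cdf` and `binomial_upper_tail` are hypothetical and introduced only for this illustration. The exact binomial sum and the continuity-corrected value for Jane's test are included for comparison and are not part of the original solution.

```python
import math

def normal_cdf(z):
    """Standard normal CDF, computed from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def binomial_upper_tail(n, p, k):
    """Exact P(X >= k) for X ~ B(n, p), by summing binomial probabilities."""
    return sum(math.comb(n, j) * p**j * (1 - p)**(n - j)
               for j in range(k, n + 1))

# Jane's test: n = 100 questions, p = 0.25, probability of at least 35 correct.
n, p = 100, 0.25
mu = n * p                              # mean np
sigma = math.sqrt(n * p * (1 - p))      # standard deviation sqrt(np(1-p))
jane_normal = 1.0 - normal_cdf((35 - mu) / sigma)       # no continuity correction
jane_corrected = 1.0 - normal_cdf((34.5 - mu) / sigma)  # P(X >= 35) as P(X >= 34.5)
jane_exact = binomial_upper_tail(n, p, 35)              # exact binomial sum

# Voter sample: n = 1500, p = 0.40, probability that 0.38 <= p_hat <= 0.42.
n_v, p_v = 1500, 0.40
sd_phat = math.sqrt(p_v * (1 - p_v) / n_v)
z = (0.42 - p_v) / sd_phat
voters_normal = normal_cdf(z) - normal_cdf(-z)

print(f"Jane: mean = {mu}, sd = {sigma:.2f}")
print(f"Jane: normal = {jane_normal:.4f}, "
      f"corrected = {jane_corrected:.4f}, exact = {jane_exact:.4f}")
print(f"Voters: sd of p_hat = {sd_phat:.5f}, probability = {voters_normal:.4f}")
```

Because the code uses unrounded z-scores rather than a table lookup at z = 2.31 and z = 1.58, its normal-approximation results may differ from .0104 and .8858 in the third or fourth decimal place.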