Sampling Distribution Models – Chapter 18

How much faith should we put in our sample statistics?

From two lectures ago… Suppose all of us took a road trip to Las Vegas and I gave each of you $100 in $1 chips to play roulette, and suppose each one of us bet Red 100 times. Would each one of us walk away with a mean of $94.737 at the end of the betting? And what about that standard deviation of $9.8614? Think about that question, read pp. 436-439 of Chapter 17 (this is all we need), try Lab 4 (your TAs wrote it), and play some online roulette.

Computer simulation: 100 $1 bets made by 100 different people on roulette
[Histogram: density vs. $ after 100 $1 bets]
The mean was $95.90, the median was $96, the minimum was $68, the maximum was $120, Q1 was $90, Q3 was $102, the SD was $9.494, and the IQR was $12.

Computer simulation: 100 $1 bets made by 1000 different people playing roulette
[Histogram: density vs. $ after 100 $1 bets]
The mean was $94.70, the median was $94, the minimum was $68, the maximum was $126, Q1 was $88, Q3 was $102, the SD was $9.811, and the IQR was $14.

Computer simulation: 100 $1 bets made by 5000 different people playing roulette
[Histogram: density vs. $ after 100 $1 bets]
The mean was $94.71, the median was $94, the minimum was $58, the maximum was $132, Q1 was $88, Q3 was $102, the SD was $9.983, and the IQR was $14.

Compare the 100-bet results:

                                                Mean        SD
  Theoretical (from calculations)               $94.74 (µ)  $9.99 (σ)
  Empirical (observed), 100 bets, 100 people    $95.90      $9.49
  Empirical (observed), 100 bets, 1000 people   $94.70      $9.81
  Empirical (observed), 100 bets, 5000 people   $94.71      $9.98

(The 5000-person results are very close to the theoretical values.)

From last time, this situation: 1000 likely voters are surveyed about their vote on November 4th.
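A minimal sketch of the kind of simulation behind these histograms (this is not the actual Lab 4 code, just an illustration of the setup): each player starts with $100 and makes 100 $1 bets on Red, which wins with probability 18/38 on an American wheel.

```python
# Roulette simulation sketch: many players each make 100 $1 bets on Red.
# Assumptions: American wheel (18 red of 38 pockets), even-money payoff.
import numpy as np

rng = np.random.default_rng(seed=1)  # seed chosen arbitrarily for reproducibility

n_players, n_bets = 5000, 100
wins = rng.random((n_players, n_bets)) < 18 / 38   # True where the bet wins
outcomes = np.where(wins, 1, -1)                   # +$1 on a win, -$1 on a loss
final_money = 100 + outcomes.sum(axis=1)           # each player's ending bankroll

print(f"mean = {final_money.mean():.2f}")       # theory: 100 + 100*(18-20)/38 ≈ 94.74
print(f"SD   = {final_money.std(ddof=1):.2f}")  # theory: 2*sqrt(100*(18/38)*(20/38)) ≈ 9.99
```

With 5000 simulated players, the empirical mean and SD land close to the theoretical $94.74 and $9.99, just as in the table above.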
What is the chance that 530 of them were women? (Suppose we were expecting 500, because the electorate is typically 50% female, 50% male.)

When dealing with a large number of trials in a Binomial situation, direct calculation of the probabilities becomes tedious (or outright impossible). Fortunately, the Normal model comes to the rescue…

The Normal Model to the Rescue
As long as the Success/Failure Condition holds (the basic unit is binomial – a voter is either a man or a woman), we can use the Normal model to approximate Binomial probabilities.
Success/Failure Condition: a Binomial model is approximately Normal if we expect at least 10 successes and 10 failures: np ≥ 10 and nq ≥ 10.

Continuous Random Variables
When we use the Normal model to approximate the Binomial model, we are using a continuous random variable (the Normal) to approximate a discrete random variable (the Binomial). So when we use the Normal model, we no longer calculate the probability that the random variable equals an exact value (e.g. exactly 530 female voters – the Binomial can do it, but it is computationally intensive), only the probability that it lies in a range of values (e.g. at least 530 female voters – 530 to 1000).

So theory tells us to expect 500 but we get 530? Is something wrong?
The underlying population (think infinitely large and not countable) here is voters, and we are categorizing them by gender (assume 2 groups). Theory tells us that in a sample of 1000 people we should get 500 men and 500 women. Suppose in our one sample we get 530 women (think about this in the context of the roulette games in the first slides). Instead of calculating the exact probability with the Binomial formula of Chapter 17, we can approximate the probability with the Normal, because we have met the conditions.

Normal Approximation (p. 439)
We might restate the question: what was the probability (chance) of getting at least 530 women in a sample of 1000 people?
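This normal approximation is easy to check numerically. A minimal sketch using only the Python standard library (the standard normal upper tail is computed via the complementary error function, `math.erfc`):

```python
# Normal approximation to the Binomial: P(at least 530 women in n = 1000)
# when the true proportion of women is p = .50.
import math

n, p = 1000, 0.50
q = 1 - p
mu = n * p                    # expected count of women: 500
sigma = math.sqrt(n * p * q)  # SD of the count: sqrt(250) ≈ 15.81

z = (530 - mu) / sigma        # ≈ 1.90
p_tail = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z) for a standard normal

print(f"z = {z:.2f}, P(Z >= z) = {p_tail:.4f}")  # close to the table value .0287
```

The computed tail probability agrees with the z-table answer worked out below, up to rounding of z.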
Formally (note that E(X) = µ = np for a Binomial, and σ = √(npq)):

P(x ≥ 530) = P( z ≥ (x − np) / √(npq) ) = P( z ≥ (530 − 1000(.50)) / √(1000(.50)(.50)) ) ≈ P(z ≥ 1.90)

What's left is to find P(z ≥ 1.90) = 1 − .9713 = .0287. So the probability is .0287, or about a 2.87% chance of getting at least 530 women in a sample of size 1000 when you were expecting 500 women.

We enter Chapter 18 now, and the question changes a little bit
Here is a situation: 667 likely North Carolina voters are surveyed on October 23rd-28th, 2008 about their vote on November 4th.
Question: What is the chance that 52% (that's a sample percentage, so it's not 52 voters) of the 667 said they would vote for Obama when he actually got 50% of the vote? We need to review first.

Recall Chapter 12
We sample because it is often too expensive or simply impossible to measure the whole population. Election surveys are samples, but today, Nov 5th 2008, we know the parameter everyone tried to estimate. So let's see what we can learn from it.

Population → Sample → Sample Statistic → Inference about the Unknown Population Parameter

Really Quick Review, Chapter 12 – Sample Surveys

  Parameter (population characteristic)        Statistic (sample characteristic)
  µ (mean)                                     ȳ (sample mean)
  p (proportion)                               p̂ (sample proportion)
  np (count, number, or sum, as in roulette)   np̂ (sample count or sum)

Chapter 12 – Statistics will be different for each sample (think about 100 plays of roulette in Vegas).
Chapter 14 – Taking a sample from a population is a random phenomenon. That means the outcome (statistic) is unknown before the sampling occurs, BUT its long-term (think infinite) behavior is predictable.

[Histogram: under the right conditions, this is the long-term behavior of sample statistics (p. 459, Ch. 18)]

This particular normal distribution has a special name – it is a SAMPLING DISTRIBUTION, or a distribution of the statistic from all possible samples of size n.
The Central Limit Theorem (CLT) (pp. 458-459) for a sample proportion
If a random sample of n observations is selected from a population (any population which can be characterized by Bernoulli trials), and x "successes" are observed, then when n is sufficiently large, the sampling distribution of the sample proportion p̂ will be approximately a normal distribution. (See the previous slide – this is the theory which supports the sampling distribution.)

When we select simple random samples of size n, the sample proportions p̂ ("p-hat") that we obtain will vary from sample to sample, like all of the polls we see before an election. We can model the distribution of these sample proportions with the probability model

N( p, √(pq/n) ),  also written N( p, √(p(1 − p)/n) )

Things to note
As the sample size n gets larger, σ, the standard deviation, gets smaller. The formula tells us that larger samples are more accurate: as n grows larger, the standard deviation grows smaller. The assumptions being made here are that the sampled values (observations) must be independent of each other and that the sample size n must be large enough.

Sampling Distribution for p̂, the sample percentage
Shape: two assumptions must hold in order for us to be able to use the Normal distribution – the sampled values must be independent of each other, and the sample size, n, must be large enough. And again, as n gets larger the standard deviation gets smaller, so larger samples are more accurate.
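The shrinking spread is easy to see numerically. A quick sketch (p = .50 here is just an illustrative choice; any p shows the same pattern):

```python
# SD of the sampling distribution of p-hat, sqrt(p*(1-p)/n), for growing n.
import math

p = 0.50  # illustrative true proportion
for n in (100, 400, 1600, 6400):
    sd = math.sqrt(p * (1 - p) / n)
    print(f"n = {n:5d}  SD of p-hat = {sd:.5f}")
```

Each quadrupling of the sample size cuts the standard deviation in half (0.05, 0.025, 0.0125, 0.00625), which is why polls with only a few hundred respondents can already be fairly accurate.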
Sampling Distribution for p̂, the sample percentage (cont'd)
It is hard to check that independence and a large-enough sample size hold, so we typically settle for checking the following conditions:
10% Condition – the sample size, n, is less than 10% of the population size.
Success/Failure Condition – np > 10 and n(1 − p) > 10.
These conditions seem to contradict one another, but in practice they don't: populations are usually much larger than samples (many more than 10 times larger).

So, back to North Carolina
Let's assume that 50% is the true percentage that voted for Obama in North Carolina. How likely was it for CNN to have a poll of 667 voters in which 52% said they would vote for him? To answer this, we use the normal approximation to the binomial, which tells us what the sampling distribution looks like:

N( .50, √((.50)(.50)/667) ) ≈ N(.50, .02)

The sampling distribution is centered on .50 (the true population proportion, not a sample value) with an SD of .02. We note that the original poll reported a margin of error of 4%, or .04 – that is 2 SD (or 2σ, or 2 z-units). Recall the 68-95-99.7 Rule: for a normal model, 68% of the observations should be within ±1 SD and 95% within ±2 SD.

Now we can answer the question (pp. 464-466)
We need a z-score(!), and we need to phrase the question so that we are working with a range (e.g. at least 52% of the vote) instead of an exact value, because the normal is continuous and mathematically there is no area under an exact value.

Formally:

P(p̂ ≥ .52) = P( Z ≥ (.52 − .50) / √((.50)(.50)/667) ) ≈ P(Z ≥ +1.03) ≈ .1515

Understanding the Answer
First the formula (note that 1 − p = q in your text):

Z = (p̂ − p) / √(p(1 − p)/n)

Then the z-score. Since we wanted P(Z ≥ +1.03), there is a .1515 chance, or about a 15.2% chance, of getting a single survey of size 667 with 52% saying they would vote for Obama when in reality only 50% voted for him.
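The poll calculation works the same way in code as the 530-women example; a minimal numeric check using only the standard library:

```python
# Sampling distribution check for the NC poll: P(p-hat >= .52) when p = .50, n = 667.
import math

p, n = 0.50, 667   # assumed true proportion, poll size
p_hat = 0.52       # observed sample proportion

sd = math.sqrt(p * (1 - p) / n)             # SD of the sampling distribution ≈ .02
z = (p_hat - p) / sd                        # ≈ 1.03
p_tail = 0.5 * math.erfc(z / math.sqrt(2))  # P(Z >= z)

print(f"SD = {sd:.4f}, z = {z:.2f}, P = {p_tail:.4f}")
```

The unrounded z is about 1.03 and the tail probability about .15, matching the z-table answer above up to rounding.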
More interpretation
While the chance is low, it is not zero. Sampling error suggests that CNN could have done everything correctly and still been off: the chance of being off by as much as 2 percentage points was about 15.2%. If CNN had been farther off than that, they might want to take a look at their survey and their sampling procedure to see if anything needed adjustment.

Chapter 18 so far
While samples are drawn from populations, it is helpful to understand that a sample is just one possible realization of all the possible samples in the sampling distribution:

Population → Sampling Distribution → Samples