CHAPTER 8: Sampling Distribution Models

p is the true proportion of the population (the parameter: the probability of success)
p̂ is the observed proportion in a sample (it varies from sample to sample)
Sampling error: the variability that results from sampling
Sampling Distribution Model: shows the behavior of the statistic over all the possible samples of the same size n

For a proportion (categorical data):
o Provided that the sampled values are independent and the sample size is large enough, the sampling distribution of p̂ is modeled by a Normal model with mean p and standard deviation $SD(\hat{p}) = \sqrt{pq/n}$, where $q = 1 - p$.
o Centered at the population proportion

Assumptions and Conditions (Sampling Distribution Model for proportions):
Assumptions:
*The sampled values must be independent of each other
*The sample size, n, must be large enough
Conditions:
Randomization Condition: The sample should be a simple random sample (SRS) of the population. However, an SRS can be difficult, if not impossible, to obtain. At a minimum, be very confident that the sampling method was not biased and that the sample is REPRESENTATIVE of the population.
10% Condition (or 5% condition): If sampling is done without replacement, n must be no larger than 10% (or 5%) of the population.
Success/Failure Condition: The sample size must be big enough that both np and nq are at least 10. *checks that the sample size is large enough

**Just Checking – pg. 415
1. You want to poll a random sample of 100 students on campus to see if they are in favor of the proposed location for the new student center. Of course, you’ll get just one number, your sample proportion, p̂. But if you imagined the histogram of all the sample proportions from all such samples, what shape would it have?
A Normal model (approximately)
2. Where would the center of that histogram be?
At the actual proportion of all students who are in favor.
3. If you think that about half the students are in favor of the plan, what would the standard deviation of the sample proportions be?
$SD(\hat{p}) = \sqrt{\frac{(0.5)(0.5)}{100}} = 0.05$

**Step-by-Step – pg. 415
Suppose that about 13% of the population is left-handed. A 200-seat school auditorium has been built with 15 “lefty seats,” seats that have a built-in desk on the left rather than the right arm of the chair. In a class of 90 students, what’s the probability that there will not be enough seats for the left-handed students?

Means: pg. 417 (dice)
*means of repeated samples approach the Normal model, with the center approaching the population mean (Law of Large Numbers)

Sampling Distribution Model for means (quantitative data):
o The sampled values need to be independent
o The data need to be collected with randomization
o Centered at the population mean
o $SD(\bar{y}) = \sigma/\sqrt{n}$
o When a random sample is drawn from any population with mean μ and standard deviation σ, its sample mean, ȳ (you may also see it written x̄), has a sampling distribution with the same mean μ but whose standard deviation is σ/√n.
o No matter what population the random sample comes from, the shape of the sampling distribution is approximately Normal as long as the sample size is large enough.
o The larger the sample used, the more closely the Normal approximates the sampling distribution for the mean.

**A SAMPLE DISTRIBUTION AND A SAMPLING DISTRIBUTION ARE TWO DIFFERENT THINGS!!!

Central Limit Theorem: The mean of a random sample has a sampling distribution whose shape can be approximated by a Normal model. The larger the sample, the better the approximation will be. (*this is true regardless of the shape of the population distribution, even if the population distribution is skewed or bimodal!)

CLT Conditions:
Randomization Condition: The data values must be sampled randomly or be representative of the population.
Independence Assumption: The sampled values must be mutually independent. There’s no way to check in general whether the observations are independent.
However, when the sample is drawn without replacement (as is usually the case), you should check the 10% Condition.
10% Condition: The sample size, n, is no more than 10% of the population.
Large Enough Sample Condition: Although the CLT tells us that a Normal model is useful in thinking about the behavior of sample means when the sample size is large enough, it doesn’t tell us how large a sample we need. The truth is, it depends; there’s no one-size-fits-all rule. If the population is unimodal and symmetric, even a fairly small sample is okay. If the population is strongly skewed, like the compensation for CEOs we looked at in Ch. 4, it can take a pretty large sample to allow use of a Normal model to describe the distribution of sample means. We’ll discuss this issue in more detail in later sections. For now you’ll just need to think about your sample size in the context of what you know about the population, and then tell whether you believe the Large Enough Sample Condition has been met.

*Don’t confuse the distribution of data from a real-world sample (which takes on the shape of the population distribution, whether skewed, bimodal, etc., as the sample size increases) with the sampling distribution (which approaches the Normal model as the sample size increases)
*means have smaller standard deviations than individuals

*Just Checking – pg. 422
4. Human gestation times have a mean of about 266 days, with a standard deviation of about 16 days. Does this mean that if we record the gestation times of a sample of 100 women, the histogram will be well modeled by a Normal model?
No, that histogram will approximate the distribution of human gestation periods; that may not be Normal.
5. Suppose we look at the average gestation times for a sample of 100 women. If we imagined all the possible random samples of 100 women we could take and looked at the histogram of all the sample means, what shape would it have?
A Normal model (approximately)
6. Where would the center of that histogram be?
266 days
7. What would be the standard deviation of that histogram?
$SD(\bar{y}) = \frac{16}{\sqrt{100}} = 1.6$ days

Law of Diminishing Returns: (pg. 423)

*Step-by-Step – pg. 423
Suppose that mean adult weight is 175 pounds with a standard deviation of 25 pounds. An elevator in our building has a weight limit of 10 persons or 2000 pounds. What’s the probability that the 10 people who get on the elevator overload its weight limit?

Standard Error (the standard deviation estimated from sample data):
For a sample proportion, p̂: $SE(\hat{p}) = \sqrt{\hat{p}\hat{q}/n}$
For the sample mean, ȳ: $SE(\bar{y}) = s/\sqrt{n}$

*Since a statistic comes from a random sample, it is, itself, a random quantity
*Sampling distributions arise because samples vary. Each random sample will contain different cases and, so, a different value of the statistic
*Although we can always simulate a sampling distribution, the CLT saves us the trouble for means and proportions

Suggested Practice:
Old Book: Ch. 18 # 1, 3, 5, 9, 12, 14, 15, 17, 21, 22, 25, 27, 29, 30, 36, 41
New Book: 8.3, 8.10, 8.15, 8.16, 8.18, 8.21, 8.25, 8.30, 8.31

Examples:
1. “Toss Tacks”
Each group tosses the tack five times and computes the sample proportion of “ups.” Quickly sketch a histogram (or dotplot) of the results. Describe the distribution. Note that this is a distribution of sample proportions – the proportions are now the data.
Each group tosses the tack 20 more times and recomputes the sample proportion (out of 25). Make a new histogram (on the same scale). Describe the new histogram. What does the center represent? Students should also see that the variability is smaller. Why? Someone should recall the Law of Large Numbers.
Now toss an additional 25 times. Find the sample proportions out of 50 tosses. Do another histogram. Discuss center and spread again. And now the shape should also be getting clear.
You should be able to go back to the Normal approximation to the Binomial and derive the shape, center, and spread of the sampling model for sample proportions.
Be sure to remind students of the assumptions and conditions that allow use of this model.

2. Of all cars on the interstate, 80% exceed the speed limit. What proportions of speeders might we see among the next 50 cars?
Solution –
Think: We want to find the distribution of the proportion of the next 50 cars that may be speeding on the highway. 80% of all cars on the highway are speeding.
10% Condition: The 50 cars can be considered a representative sample of all cars on the highway, and 50 is less than 10% of all cars on the highway.
Success/Failure Condition: np = 50(0.80) = 40 and nq = 50(0.20) = 10 are both at least 10.
Therefore, the sampling model for the proportion of speeders in 50 cars has mean 0.80 and standard deviation $SD(\hat{p}) = \sqrt{\frac{pq}{n}} = \sqrt{\frac{(0.80)(0.20)}{50}} \approx 0.057$.
The model for p̂ is N(0.80, 0.057).
Show: Draw distribution using 68-95-99.7 Rule
Tell: According to the Normal model, we expect 68% of the samples of 50 cars to have proportions of speeders between 0.743 and 0.857, 95% of the samples to have proportions between 0.686 and 0.914, and 99.7% of the samples to have proportions between 0.629 and 0.971.

3. We don’t know it, but 52% of voters plan to vote “Yes” on the upcoming school budget. We poll a random sample of 300 voters. What might the percentage of yes-voters appear to be in our poll?
Solution –
Think: We want to find the distribution of the proportion of 300 voters that may vote “Yes” on the upcoming school budget. 52% of voters plan to vote “Yes”.
10% Condition: The 300 voters can be considered a representative sample of all voters, and 300 is less than 10% of all voters.
Success/Failure Condition: np = 300(0.52) = 156 and nq = 300(0.48) = 144 are both at least 10.
Therefore, the sampling model for the proportion of “Yes”-voters in a sample of 300 has mean 0.52 and standard deviation $SD(\hat{p}) = \sqrt{\frac{pq}{n}} = \sqrt{\frac{(0.52)(0.48)}{300}} \approx 0.029$.
The model for p̂ is N(0.52, 0.029).
Show: Draw distribution using 68-95-99.7 Rule
Tell: According to the Normal model, we expect 68% of the samples of 300 voters to have proportions of “Yes”-voters between 0.491 and 0.549, 95% of the samples to have proportions between 0.462 and 0.578, and 99.7% of the samples to have proportions between 0.433 and 0.607.

4. “Groovy” M&M’s are supposed to make up 30% of the candies sold. In a large bag of 250 M&M’s, what is the probability that we get at least 25% groovy candies?
Solution –
Think: We want to find the probability of getting a bag of 250 M&M’s that has at least 25% “groovy” candies.
10% Condition: The 250 M&M’s can be considered a representative sample of all M&M’s, and 250 is less than 10% of all M&M’s.
Success/Failure Condition: np = 250(0.30) = 75 and nq = 250(0.70) = 175 are both at least 10.
Therefore, the sampling model for the proportion of “groovy” M&M’s in a sample of 250 has mean 0.30 and standard deviation $SD(\hat{p}) = \sqrt{\frac{pq}{n}} = \sqrt{\frac{(0.30)(0.70)}{250}} \approx 0.029$.
The model for p̂ is N(0.30, 0.029).
Show: *Sketch distribution
$z = \frac{\hat{p} - p}{\sqrt{pq/n}} = \frac{0.25 - 0.30}{\sqrt{(0.30)(0.70)/250}} \approx -1.73$
Tell: According to the Normal model, the probability that our bag contains at least 25% “groovy” M&M’s is approximately 95.8%.

5. SAT scores should have a mean of 500 and a standard deviation of 100. What about the mean of random samples of 20 students? (Note that the small sample is okay because we believe a Normal model applies to the population.)
Solution –
Think: We are interested in the distribution of possible means from samples of SAT scores from 20 students. SAT scores have a mean of 500 and a standard deviation of 100, and since the SAT is standardized, it’s reasonable to assume that the model for all SAT scores is Normal.
Random Sampling Condition: The 20 students were sampled randomly.
Independence Assumption: It’s reasonable to think that the SAT scores of the 20 randomly sampled students will be mutually independent, as long as the students weren’t all from the same university.
10% Condition: 20 students represent less than 10% of all students.
Under these conditions, the sampling distribution of ȳ has a Normal model, with mean 500 and standard deviation $SD(\bar{y}) = \frac{\sigma}{\sqrt{n}} = \frac{100}{\sqrt{20}} \approx 22.36$.
Show: Draw distribution using 68-95-99.7 Rule
Tell: According to the Normal model, we can expect 68% of samples of 20 students to have mean SAT scores between 477.6 and 522.4, 95% of samples to have means between 455.3 and 544.7, and 99.7% of samples to have means between 432.9 and 567.1.

6. Speeds of cars on a highway have mean 52 mph and standard deviation 6 mph, and are likely to be skewed to the right (a few very fast drivers). Describe what we might see in random samples of 50 cars.
Solution –
Think: We are interested in the distribution of possible means from samples of speeds from 50 cars on the highway. Speeds have a mean of 52 mph and a standard deviation of 6 mph, with a distribution that is skewed to the right.
Random Sampling Condition: The 50 speeds were sampled randomly.
Independence Assumption: It’s reasonable to think that the speeds of the 50 randomly sampled cars will be mutually independent.
10% Condition: 50 cars represent less than 10% of all cars.
Even though the distribution is skewed, the Central Limit Theorem applies, since the sample size, 50 cars, is large.
Under these conditions, the sampling distribution of ȳ has a Normal model, with mean 52 mph and standard deviation $SD(\bar{y}) = \frac{\sigma}{\sqrt{n}} = \frac{6}{\sqrt{50}} \approx 0.85$ mph.
Show: Draw distribution using 68-95-99.7 Rule
Tell: According to the Normal model, we can expect 68% of samples of 50 cars to have mean speeds between 51.2 and 52.9 mph, 95% of samples to have means between 50.3 and 53.7 mph, and 99.7% of samples to have means between 49.5 and 54.6 mph.

7. At birth, babies average 7.8 pounds, with a standard deviation of 2.1 pounds. A random sample of 34 babies born to mothers living near a large factory that may be polluting the air and water shows a mean birthweight of only 7.2 pounds. Is that unusually low?
Solution –
Think: We are interested in the probability that a sample of 34 babies has a mean birthweight of less than 7.2 pounds. At birth, babies average 7.8 pounds, with a standard deviation of 2.1 pounds. The model for birthweights should be roughly unimodal and symmetric, if not Normal.
Random Sampling Condition: The 34 babies were sampled randomly.
Independence Assumption: It’s reasonable to think that the weights of the 34 randomly sampled babies will be mutually independent.
10% Condition: As long as more than 340 babies were born to mothers living in the vicinity of the factory, 34 babies represent less than 10% of all babies.
Since the model for all babies is unimodal and symmetric, the Central Limit Theorem applies, especially since the sample size, 34 babies, is large.
Under these conditions, the sampling distribution of ȳ has a Normal model, with mean 7.8 pounds and standard deviation $SD(\bar{y}) = \frac{\sigma}{\sqrt{n}} = \frac{2.1}{\sqrt{34}} \approx 0.36$ pounds.
Show: *Sketch distribution
$z = \frac{\bar{y} - \mu}{SD(\bar{y})} = \frac{7.2 - 7.8}{0.36} \approx -1.67$
Tell: According to the Normal model, only about 4.7% of samples of 34 babies are expected to have mean birthweights below 7.2 pounds. The sample of babies near the factory appears to have an unusually low mean birthweight.
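
Checking the worked examples by computer: the short Python sketch below is not part of the original notes. It recomputes Example 4 (M&M’s) and Example 7 (birthweights) from the formulas SD(p̂) = √(pq/n) and SD(ȳ) = σ/√n. The function names (normal_cdf, sd_proportion, sd_mean, success_failure_ok) are made up for illustration, and the Normal probabilities come from the standard library’s math.erf rather than a printed z-table, so the last digit may differ slightly from table-based answers.

```python
import math


def normal_cdf(z: float) -> float:
    """P(Z <= z) for a standard Normal variable, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def sd_proportion(p: float, n: int) -> float:
    """SD of the sampling distribution of a sample proportion: sqrt(p*q/n)."""
    return math.sqrt(p * (1.0 - p) / n)


def sd_mean(sigma: float, n: int) -> float:
    """SD of the sampling distribution of a sample mean: sigma/sqrt(n)."""
    return sigma / math.sqrt(n)


def success_failure_ok(p: float, n: int, threshold: float = 10.0) -> bool:
    """Success/Failure Condition: both np and nq are at least 10.

    A tiny tolerance guards against floating-point round-off,
    e.g. 50 * (1 - 0.80) evaluates to slightly less than 10.
    """
    eps = 1e-9
    return n * p >= threshold - eps and n * (1.0 - p) >= threshold - eps


# Example 4 (M&M's): P(p-hat >= 0.25) when p = 0.30 and n = 250
sd_mm = sd_proportion(0.30, 250)
z_mm = (0.25 - 0.30) / sd_mm
prob_mm = 1.0 - normal_cdf(z_mm)  # area to the right of z

# Example 7 (birthweights): P(y-bar < 7.2) when mu = 7.8, sigma = 2.1, n = 34
sd_bw = sd_mean(2.1, 34)
z_bw = (7.2 - 7.8) / sd_bw
prob_bw = normal_cdf(z_bw)  # area to the left of z

print(f"M&M's: SD = {sd_mm:.3f}, z = {z_mm:.2f}, P = {prob_mm:.3f}")
print(f"Birthweights: SD = {sd_bw:.2f}, z = {z_bw:.2f}, P = {prob_bw:.3f}")
```

The same helpers could be reused for the other examples by changing p, σ, and n. Because the sketch does not round z before looking up the probability, its birthweight result can differ from the table-based 4.7% in the third decimal place.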