
Chapter 9: Sampling Distributions

Section 9.1: Sampling Distribution

This chapter introduces statistical inference: using statistical concepts to interpret scientific results from studies and surveys.

Parameter: A number that describes an aspect of a population.

Statistic: A number that is computed from sample data; often used to estimate an unknown parameter.

Example: A census of all DHS seniors found that 10% got into college early. An SRS of 30 seniors was also taken, and in that sample 12% got into college early. The 10% is a parameter, while the 12% is a statistic.

Sampling Distributions and Sampling Variability: If we take repeated samples from the DHS senior population and measure the proportion of seniors in each sample who got into college early, we will undoubtedly get different numbers for the different samples. This is referred to as sampling variability. We can collect the proportions from all the samples we took and draw a histogram of their distribution.

I created a simulation that selected repeated samples (100, to be exact) of size 30 from a population in which the proportion of seniors who got into college early was p = 0.10. I then took the proportions from these samples and created a histogram.

[Figure not reproduced: histogram of the 100 simulated sample proportions.]

NOTE: We will use p to represent the population proportion. We will use p̂ to represent the sample proportion, which in turn is used to estimate p if p is unknown.

If we were to take all possible samples of the same size from the population, compute the sample proportion p̂ of each sample, and then create a distribution of those values, it would be called the sampling distribution of p̂. The following properties generally describe a sampling distribution of p̂ created from samples with a large size (usually n ≥ 30):

• The overall shape of the distribution is symmetric and approximately normal. The larger the sample size, the closer the shape is to a normal distribution.
A rule of thumb for deciding whether a normal curve can be used to approximate the sampling distribution of sample proportions is that both a) np > 10 and b) n(1 − p) > 10.

• There are no outliers or other important deviations from the main pattern.

• The mean (center) of the distribution is equal to the true population parameter, p.

• The variability (spread) of the sampling distribution depends on the sample size. The larger the sample size, the smaller the variability of the sampling distribution.

• The standard deviation of the sampling distribution is $\sqrt{\frac{p(1-p)}{n}}$ (as long as the population is at least ten times as large as the sample).

Not all sampling distributions have these properties (though most do). When a sampling distribution does not have its center equal to the true population parameter, the statistic used to create that sampling distribution is said to be biased. The goal when creating a sampling distribution is to have no bias and low variability.

Here is how bias and variability are related: [figure not reproduced]

• The variability of a sampling distribution is determined by the sampling design and the sample size used to create the sampling distribution. As long as the population is much larger than the sample (at least 10 times as large), the spread of the sampling distribution is the same for any population size.

• Contrary to popular belief and intuition, the behavior of a statistic from random samples is not influenced by the size of the population. To see why, think of taking a sample scoop of M&Ms from a well-shuffled 1-pound bag. If the M&Ms are well shuffled, does the scoop really know whether it was surrounded by a one-pound bag of M&Ms or a huge bin of M&Ms? Clearly it does not.

• The above realization, that the variability of a sampling distribution is controlled by the size of the sample, not the size of the population, has major implications for sampling design.
It means that a survey of, say, 2000 people is just as accurate when the sample is taken from the population of a small state like Rhode Island as when it is taken from the population of the entire United States. As long as the sample is an SRS, it can predict some aspect of the US population just as well as it could for the much smaller Rhode Island population. In other words, the ratio of the sample to the population is NOT important. As a matter of fact, we actually want the ratio of the population size to the sample size to be large (more than 10 to 1) in order to be able to conduct most of the statistical analyses we'll be learning about.

Section 9.2: Sample Proportions

Example: An SRS of 1500 high school seniors in CT was asked whether they applied to college early. Let's assume that there are 100,000 high school seniors in the state of Connecticut, and that in fact 35% of them apply to college early. What is the probability that your sample of 1500 seniors will give a result within 2 percentage points of the true value of 35%?

a) Since the population size (100,000) is greater than 10 times the sample size (10 × 1500 = 15,000), we are OK to proceed (we can use the formula for the standard deviation).

b) Since np = 1500(0.35) = 525 > 10 and n(1 − p) = 1500(0.65) = 975 > 10, we are OK in assuming that the distribution of sample proportions is approximately normal.

c) We know that the sampling distribution of sample proportions has a mean of 0.35 (equal to the p in the population) and that the standard deviation is:

$$\sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{0.35 \times 0.65}{1500}} \approx 0.0123$$

d) We are looking for the probability that p̂ falls between 0.33 and 0.37 (within 2 percentage points of 35%). So we are looking for $P(0.33 \le \hat{p} \le 0.37)$.

e) Draw a normal curve that approximates the sampling distribution of p̂, centered at 0.35 with standard deviation 0.0123, and shade the region between 0.33 and 0.37.

f) Standardizing the p̂ values we get:

$$z = \frac{0.33 - 0.35}{0.0123} \approx -1.63 \qquad \text{and} \qquad z = \frac{0.37 - 0.35}{0.0123} \approx 1.63$$

g) We can now find the area under the normal curve by using the z-score table in the back of the book (or our calculator):

$$P(-1.63 \le Z \le 1.63) = 0.9484 - 0.0516 = 0.8968$$
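Steps c) through g) can also be checked numerically. The following is a minimal Python sketch, not part of the original notes: the helper names `normal_cdf` and `prob_p_hat_between` are hypothetical, and the standard normal CDF is computed from the standard library's `math.erf` instead of a z-score table.

```python
import math


def normal_cdf(z):
    """Standard normal CDF, expressed through the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_p_hat_between(p, n, low, high):
    """Normal approximation to P(low <= p-hat <= high) for an SRS of size n.

    Assumes the population is at least 10 times the sample size, so the
    standard deviation formula sqrt(p(1 - p) / n) applies.
    """
    sd = math.sqrt(p * (1.0 - p) / n)   # step c): standard deviation of p-hat
    z_low = (low - p) / sd              # step f): standardize both endpoints
    z_high = (high - p) / sd
    return normal_cdf(z_high) - normal_cdf(z_low)  # step g): area under curve


p, n = 0.35, 1500

# Step b): rule-of-thumb check for using the normal approximation.
assert n * p > 10 and n * (1 - p) > 10

probability = prob_p_hat_between(p, n, 0.33, 0.37)
print(f"P(0.33 <= p-hat <= 0.37) is approximately {probability:.4f}")
```

With these inputs the printed probability should land just under 0.90. A table lookup can differ from it in the third decimal place, because the table method rounds z to two decimals before looking up the area.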
This result tells us that almost 90% of all samples of size 1500 will give a result within 2 percentage points of the truth.

Section 9.3: Sample Means

A couple of things to think about:
1.) Averages are less variable than individual observations.
2.) Averages are more normal than individual observations.

Why are these two things important to us? Because if we look at a histogram of averages, we will get a histogram that is more normal and less spread out than a histogram of individual observations. Data is much easier to work with if it is normal and has a small spread, so it is to our advantage to look at a distribution of averages.

Mean and Standard Deviation of a Sample Mean

Suppose that x̄ is the mean of an SRS of size n drawn from a large population with mean μ and standard deviation σ. Then the mean of the sampling distribution of x̄ is $\mu_{\bar{x}} = \mu$ and its standard deviation is $\sigma_{\bar{x}} = \sigma/\sqrt{n}$.

Sampling Distribution of a Sample Mean from a Normal Population

Draw an SRS of size n from a population that has a normal distribution with mean μ and standard deviation σ. Then the sample mean x̄ has a normal distribution $N(\mu, \sigma/\sqrt{n})$ with mean μ and standard deviation $\sigma/\sqrt{n}$.

Example: Suppose the heights of young women are normally distributed with μ = 64.5 inches and σ = 2.5 inches. What is the probability that the mean height of an SRS of 10 young women is greater than 66.5 inches?

The Central Limit Theorem

The CLT answers the question: what does the distribution of x̄ look like if the original population is not normal?

CLT: Draw an SRS of size n from any population whatsoever with mean μ and finite standard deviation σ. When n is large, the sampling distribution of the sample mean x̄ is close to the normal distribution $N(\mu, \sigma/\sqrt{n})$ with mean μ and standard deviation $\sigma/\sqrt{n}$.

NOTE: How large a sample size n is needed for x̄ to be close to normal depends on the population distribution.
More observations are required if the shape of the population distribution is far from normal.

Example: The number of flaws per square yard in a type of carpet material varies with mean 1.6 flaws per square yard and a standard deviation of 1.2 flaws per square yard. The population distribution cannot be normal because a count takes only whole-number values. An inspector studies 200 square yards of the material, records the number of flaws found in each square yard, and calculates x̄, the mean number of flaws per square yard inspected. What is the probability that the mean number of flaws exceeds 2 per square yard?

FINAL THOUGHTS ON CHAPTER 9: A sampling distribution describes the values a statistic would take in very many repetitions of a sample or an experiment under the same conditions. Understanding that idea is key to understanding statistical inference (which is the topic of the next, and last, four chapters of the course). The sampling distributions we looked at were of two important statistics: a sample proportion p̂ and a sample mean x̄.
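The repeated-sampling idea can be sketched as a short simulation. The Python block below is only an illustration, not the simulation used for the Section 9.1 histogram: the function name `simulate_sample_proportions`, the fixed seed, and the choice of 10,000 repetitions are assumptions made here. It treats each sampled senior as an independent draw with probability p = 0.10 of getting in early, which is reasonable when the population is at least 10 times the sample size.

```python
import math
import random
import statistics
from collections import Counter


def simulate_sample_proportions(p, n, num_samples, seed=0):
    """Return p-hat for num_samples simulated samples of size n.

    Assumption: each individual is an independent draw with success
    probability p (a stand-in for an SRS from a much larger population).
    """
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    proportions = []
    for _ in range(num_samples):
        successes = sum(1 for _ in range(n) if rng.random() < p)
        proportions.append(successes / n)
    return proportions


p, n = 0.10, 30
p_hats = simulate_sample_proportions(p, n, num_samples=10_000)

sim_mean = statistics.mean(p_hats)      # should be close to p
sim_sd = statistics.pstdev(p_hats)      # should be close to the formula below
theory_sd = math.sqrt(p * (1 - p) / n)  # sqrt(p(1 - p) / n)

print(f"mean of p-hat: {sim_mean:.4f} (p = {p})")
print(f"sd of p-hat:   {sim_sd:.4f} (formula: {theory_sd:.4f})")

# Crude text histogram: one '#' per 100 samples at each observed p-hat value.
for value, count in sorted(Counter(p_hats).items()):
    print(f"{value:.3f} {'#' * (count // 100)}")
```

In this setting np = 3, which fails the np > 10 rule of thumb, so the text histogram should look right-skewed, not normal, even though the simulated mean and standard deviation stay close to p and $\sqrt{p(1-p)/n}$. Raising n until both np and n(1 − p) exceed 10 should make the shape visibly more symmetric.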
