Unit 8 Summary

In this unit we are introduced to sampling distributions and estimation. Recall that in my summary for Unit 6 we demonstrated the Central Limit Theorem using a population of 4 values and all samples of size 2. Taking every possible sample of 2 values gave the following means for the 16 possible samples:

2, 3, 4, 5, 3, 4, 5, 6, 4, 5, 6, 7, 5, 6, 7, 8

If we use StatCrunch to create a histogram of these means, we have:

[Figure: histogram of the 16 sample means]

This is the Sampling Distribution of Means, which is a theoretical distribution. Notice that it is very close to a normal distribution. From the Central Limit Theorem we know that the mean of the sampling distribution of means is equal to the population mean (µ) and that the standard deviation of the sampling distribution is σ/√n. The standard deviation of the sampling distribution is called the Standard Error.

If (and it is an impossible if) every possible sample were perfectly representative of the population, then every sample mean would be equal to the population mean. If you plotted those means in a histogram, there would be only one bar, directly over the population mean (µ). The sampling distribution is instead spread out around µ because, under the assumption of randomness, the expected value (the average) of all sample means equals the population mean, but some of the sample means fall below the population mean and some fall above it. This occurs simply because the samples are NOT perfectly representative. We refer to this fact as sampling error.

In statistics, error is not "your fault"; it is the variability that is not explained or controlled for in our design. In this case, we know that each sample will not be perfectly representative, and that is due only to sampling error (it is random error). This is contrasted with bias, or non-random error.
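The sampling distribution described at the start of this unit can be checked by listing every sample. The sketch below is only an illustration: the four population values (2, 4, 6, 8) are an assumption, since this summary does not state them, but they do reproduce the 16 means listed above.

```python
from itertools import product
from math import sqrt
from statistics import mean, pstdev

# Assumed population: these four values are not given in this summary,
# but they reproduce the 16 sample means listed in the text.
population = [2, 4, 6, 8]
n = 2  # sample size

# Every ordered sample of size 2, drawn with replacement (4 * 4 = 16 samples).
samples = list(product(population, repeat=n))
sample_means = [mean(s) for s in samples]

mu = mean(population)             # population mean
sigma = pstdev(population)        # population standard deviation
standard_error = sigma / sqrt(n)  # sigma divided by the square root of n

print(sample_means)
print("mean of the sample means:", mean(sample_means), "| population mean:", mu)
print("SD of the sample means:", pstdev(sample_means), "| sigma/sqrt(n):", standard_error)
```

The mean of the 16 sample means matches the population mean, and their standard deviation matches σ/√n, which is what the Central Limit Theorem promises.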
An example of non-random error: if I "sampled" this class but only asked those who have IMed me during the course, I would have introduced bias into my sample. We cannot quantify non-random error the way we can random error. The standard deviation of the sampling distribution can be seen as the standard deviation of the sampling error.

Finally, because we know the sampling distribution is normal and the Central Limit Theorem gives us its mean and standard deviation (standard error), we can use our old friend the Z score to find probabilities associated with one sample from the sampling distribution. The Z score we use is identical in concept to the one we learned earlier but uses the information from the sampling distribution: the score is a sample mean (x̄), the mean is the mean of the population (µ), and the standard deviation is the standard error (σ/√n), so z = (x̄ − µ)/(σ/√n). If you have trouble calculating the standard error, remember to take the square root of n first and then divide that value into sigma.

To estimate a population mean we use Confidence Intervals. The textbook provides the general formula on page 347. The formula shows that the estimate of the population mean (µ) builds a band around the obtained sample mean (x̄) that is expected to contain µ with a defined level of certainty (the confidence level). The half-width of the band is the margin of error. The level of certainty is decided by the statistician/researcher. Once it is decided, the z scores that create the confidence interval can be obtained from the z-score table, but StatCrunch will do this for us automatically. For example, if one wanted a 95% confidence interval, the corresponding z score would be 1.96. {Important note: our textbook does not use a value of 1.96 for 95% confidence but instead uses a value of 2; this will result in slightly different intervals if you use that formula and compare the results to StatCrunch.} From the z-score table, that value has 97.5% of the curve below it and 2.5% above it.
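The Z score for a sample mean and the 1.96 critical value can be checked with a short sketch using Python's standard library. The helper name `z_for_sample_mean` and the numbers passed to it (µ = 100, σ = 15, n = 36, x̄ = 104) are hypothetical and chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist

std_normal = NormalDist()  # standard normal curve: mean 0, standard deviation 1

# Critical value for 95% confidence: the z with 97.5% of the curve below it.
z_95 = std_normal.inv_cdf(0.975)
print(round(z_95, 2))  # 1.96


def z_for_sample_mean(x_bar, mu, sigma, n):
    """Z score of one sample mean within the sampling distribution of means."""
    standard_error = sigma / sqrt(n)  # square root of n first, then divide into sigma
    return (x_bar - mu) / standard_error


# Hypothetical example values, not taken from the textbook.
z = z_for_sample_mean(x_bar=104, mu=100, sigma=15, n=36)
print("z =", z)
print("proportion of sample means below this one:", std_normal.cdf(z))
```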
Because the confidence interval uses both the positive and the negative z value (±1.96), 5% of the curve lies outside the band and 95% lies inside it. (See Example 1 on page 348.) The text defines E (the margin of error) as 2s/√n, but StatCrunch will use z_c(σ/√n), where z_c is the value of z that corresponds to the level of confidence desired.

The picture below shows the general 95% confidence interval when σ is known. The critical part to understand is that the confidence interval defines an area where we think the population parameter (µ) will be located. There is no reason to expect it to be in the middle, only that it will be within that area 95% of the time.

[Figure: the sampling distribution of the means, centered at µ. A red line marks the range µ − 1.96 σ/√n to µ + 1.96 σ/√n, which contains 95% of the sample means (x̄'s). A green line marks the interval x̄ − E to x̄ + E built around a sample mean that falls inside the red range.]

The question is whether the band I have constructed (green line) around the sample mean (x̄) has captured the true population mean (µ). The answer is yes: the population mean is inside our "confidence interval" and we have captured µ. How often will this be true? The answer is 95% of the time. Why? Because we know that 95% of the time the mean (x̄) of our sample will fall within the range indicated by the red line. Whenever x̄ falls inside that range, it is no more than E away from µ, so the band x̄ ± E reaches µ.

[Figure: the same sampling distribution, with a sample mean x̄ that falls outside the red range, so the interval x̄ − E to x̄ + E does not reach µ.]

What about the second situation? Did we capture the population mean (µ)? The answer is no. The population mean (µ) is not in our interval. How often would we expect to get this type of situation? Only 5% of the time. Why? Because only 5% of the time would we expect to obtain a sample mean that is more than 1.96 standard errors away from the population mean, so only 5% of the time would we fail to capture the population mean with our confidence interval.
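The "95% of the time" claim can be illustrated with a small simulation. This is a sketch under stated assumptions, not something from the textbook: the population is assumed to be normal with a known σ, and the function name `coverage` and the values µ = 50, σ = 10, n = 25 are hypothetical.

```python
import random
from math import sqrt
from statistics import NormalDist


def coverage(mu, sigma, n, trials, confidence=0.95, seed=0):
    """Fraction of simulated intervals x_bar +/- E that contain mu."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    z_c = NormalDist().inv_cdf(1 - (1 - confidence) / 2)  # 1.96 for 95%
    E = z_c * sigma / sqrt(n)  # margin of error when sigma is known
    hits = 0
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        x_bar = sum(sample) / n
        if x_bar - E <= mu <= x_bar + E:
            hits += 1  # this interval captured the population mean
    return hits / trials


# Hypothetical population: mu = 50, sigma = 10; samples of size 25.
rate = coverage(mu=50, sigma=10, n=25, trials=5000)
print("share of intervals that captured mu:", rate)
```

The printed share should land close to 0.95, with the remaining intervals (about 5%) missing µ, as in the second picture.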
When we construct confidence intervals, we are building a band that we believe contains the population mean, and we can quantify the likelihood that we are incorrect.

We can use the formula for the confidence interval to determine the minimum sample size we need to give us a predetermined margin of error (see page 351). The formula for the sample size needed for a given margin of error is:

n ≈ (2σ/E)^2

For our purposes we will use the critical value of 2 and use estimated values for the population standard deviation. Suppose we are interested in estimating the average time Kaplan students spend studying each week, we would like our estimate to be within 1 hour of the actual time, and we estimate the population standard deviation to be 4.5 hours. The minimum sample size we need is:

n = (2*4.5/1)^2 = 9^2 = 81

If we have a sample size of 81 and a standard deviation close to 4.5, we will have a margin of error of 1.

The final section of this chapter deals with estimating a population proportion. The basic logic is the same as for estimating a population mean. The standard deviation of a proportion is √(p(1 − p)/n). Consider the polls about the Republican candidates for president, and suppose we are trying to create the 95% confidence interval for the true population proportion from the following sample data: from a sample of 500 registered voters, 170 favored Candidate A. The sample proportion is 170/500 = .34, and the standard deviation is √(.34(.66)/500) = √(.2244/500) = √.0004488 ≈ 0.0212. We now use the formula for E and multiply the standard deviation by 2, so E ≈ .0424. Adding and subtracting E from our sample proportion gives the 95% confidence interval for the population proportion: 0.30 to 0.38.

Now let's assume that we are planning to conduct a poll and would like our estimate to be within 3 percentage points of the true proportion. What sample size should we have?
For the sample size needed to estimate a proportion, the formula on page 357 is:

n ≈ 1/E^2

Where did the 1 come from, you might ask? Before we start a study about proportions, we do not know the population standard deviation, so we use a proportion of .5 to make sure the sample is large enough whatever the true proportion turns out to be. The value of .5 gives us the largest possible standard deviation: p * (1 − p) = .25, and the square root of that value is .5. Multiplying that by 2 (for the 95% confidence interval) gives 1. Notice that we now divide by the square root of n, because we already took the square root of the top portion of the formula: E = 2(.5)/√n = 1/√n. Solving for n gives n = 1/E^2.

Now, back to our example where we want a margin of error of 3%. The minimum sample size would be 1/.03^2 = 1/.0009 = 1111.11, which we round up to 1112. This is why you see a lot of national polls that have around 1,000 respondents!
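The sample-size and proportion calculations in this unit can be reproduced with a few lines of code. This is a minimal sketch that follows the textbook's critical value of 2; the function names are hypothetical and chosen only for illustration.

```python
from math import ceil, sqrt


def sample_size_for_mean(sigma, E, z=2):
    """Minimum n for margin of error E when estimating a mean: (z*sigma/E)^2."""
    return ceil((z * sigma / E) ** 2)


def proportion_interval(successes, n, z=2):
    """Confidence interval for a proportion: p +/- z * sqrt(p(1 - p)/n)."""
    p = successes / n
    standard_deviation = sqrt(p * (1 - p) / n)
    E = z * standard_deviation  # margin of error
    return p - E, p + E


def sample_size_for_proportion(E):
    """Minimum n for margin of error E, using p = .5 and z = 2: 1/E^2."""
    return ceil(1 / E ** 2)


# Study-time example: sigma estimated at 4.5 hours, within 1 hour.
print(sample_size_for_mean(sigma=4.5, E=1))  # 81

# Poll example: 170 of 500 registered voters favor Candidate A.
low, high = proportion_interval(170, 500)
print(round(low, 2), round(high, 2))  # 0.3 0.38

# Planned poll with a margin of error of 3 percentage points.
print(sample_size_for_proportion(0.03))  # 1112
```

Rounding up with `ceil` reflects that a sample size must be a whole number at least as large as the computed value.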