Unit 8 Summary
In this unit we will be introduced to sampling distributions and estimation. If you recall, in my summary for unit 6 we
demonstrated the Central Limit Theorem using a population of 4 values and all samples of size 2. When we took all the possible
samples of 2 values (sampling with replacement), we obtained the following means for the 16 possible samples.
2, 3, 4, 5, 3, 4, 5, 6, 4, 5, 6, 7, 5, 6, 7, 8
Now, if we use StatCrunch to create a histogram of these means we have:
This is the Sampling Distribution of Means, which is a theoretical distribution. You will notice that it is very close to a normal
distribution, and from the Central Limit Theorem we know that the mean of the sampling distribution of means is equal to the
population mean (µ) and the standard deviation of the sampling distribution is σ/√n. The standard deviation of the sampling
distribution is called the standard error. If (and it is an impossible if) every possible sample were perfectly
representative of the population, then every sample mean would be equal to the population mean. If you plotted those means
in a histogram, there would be only one bar, directly over the population mean (µ). The reason the sampling distribution is
normally distributed is that, under the assumption of randomness, the expected value (the average) of all sample means will equal
the population mean, but some of the sample means will be below the population mean and some will be above. This occurs
simply because the samples are NOT perfectly representative. We refer to this fact as sampling error. In statistics, error is
not “your fault”; it is the variability that is not explained or controlled for in our design. In this case, we know that each sample
will not be perfectly representative, and that is due only to sampling error (it is random error). This is contrasted with bias, or
non-random error. For example, if I ‘sampled’ this class but only asked those who have IMed me during the course, I would have
introduced non-random bias into my sample. We cannot quantify non-random error the way we can random error. The standard
deviation of the sampling distribution can be seen as the standard deviation of the sampling error.
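The claims above about the mean and standard error can be checked by brute force. The following is a minimal Python sketch, not the StatCrunch procedure used in the course. The population values 2, 4, 6, 8 are an assumption: they are not stated in this summary, but they reproduce the 16 sample means listed earlier.

```python
from itertools import product
from math import sqrt
from statistics import mean, pstdev

# ASSUMPTION: these four values are a guess that reproduces the 16 means
# listed in this summary; the summary does not state the population.
population = [2, 4, 6, 8]
n = 2  # size of each sample

# All 4**2 = 16 ordered samples of size 2, drawn with replacement.
sample_means = [mean(sample) for sample in product(population, repeat=n)]

mu = mean(population)             # population mean
sigma = pstdev(population)        # population standard deviation
standard_error = sigma / sqrt(n)  # sigma divided by the square root of n

print("sample means:", sample_means)
print("mean of the sample means:", mean(sample_means), "| mu:", mu)
print("SD of the sample means:", round(pstdev(sample_means), 4),
      "| sigma/sqrt(n):", round(standard_error, 4))
```

The two printed pairs should agree: the mean of the sample means matches µ, and their standard deviation matches σ/√n.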
Finally, because we know the sampling distribution is normal and the Central Limit Theorem gives us its mean and standard
deviation (the standard error), we can use our old friend the Z score to find probabilities associated with one sample from the
sampling distribution. The Z score we use is identical in concept to the one we learned earlier but uses the information
from the sampling distribution: the score is a sample mean (x̄), the mean is the mean of the population (µ), and
the standard deviation is the standard error (σ/√n), so z = (x̄ − µ)/(σ/√n). If you have trouble calculating the standard error, remember to take the
square root of n first and then divide that value into sigma.
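Here is a short Python sketch of that Z score calculation. The numbers (µ = 50, σ = 10, n = 25, x̄ = 52) and the function name are hypothetical, chosen only for illustration, and the standard library's `NormalDist` stands in for the z-score table.

```python
from math import sqrt
from statistics import NormalDist

def z_for_sample_mean(x_bar, mu, sigma, n):
    """Z score of a sample mean within the sampling distribution of means."""
    standard_error = sigma / sqrt(n)  # square root of n first, then divide into sigma
    return (x_bar - mu) / standard_error

# HYPOTHETICAL values, not taken from the textbook.
mu, sigma, n = 50, 10, 25
x_bar = 52

z = z_for_sample_mean(x_bar, mu, sigma, n)
p_below = NormalDist().cdf(z)  # area under the standard normal curve below z

print("z =", z)
print("P(sample mean <", x_bar, ") is about", round(p_below, 4))
```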
To estimate a population mean we use Confidence Intervals. The textbook provides the general formula on page 347. The
formula shows that the estimate of the population mean (μ) builds a band around the obtained sample mean (x̄) that is
expected to contain μ with a defined level of certainty (the level of confidence). The half-width of that band is the margin of error (E). The level of certainty is decided by the
statistician/researcher. Once it is decided, the z scores that create the confidence interval can be obtained from
the z-score table, but StatCrunch will do this for us automatically. For example, if one wanted a 95% confidence interval, the
corresponding z score would be 1.96. {Important note: our textbook does not use a value of 1.96 for the 95% confidence interval
but instead uses a value of 2; this will result in slightly different intervals if you use that formula and compare the results to
StatCrunch.} From the z-score table, 1.96 has 97.5% of the curve below it and 2.5% above. Since we will use both
the positive and negative values, 5% of the curve lies outside the band and 95% inside it. (See Example 1 on
page 348.)
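To see where 1.96 comes from, and how the band is built when σ is known, here is a minimal Python sketch. It is not how StatCrunch is implemented; the function names and the sample values (x̄ = 20, σ = 4.5, n = 81) are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

def critical_z(confidence):
    """z value that leaves (1 - confidence)/2 of the curve in each tail."""
    tail = (1 - confidence) / 2
    return NormalDist().inv_cdf(1 - tail)

def mean_confidence_interval(x_bar, sigma, n, confidence=0.95):
    """Band of x_bar plus or minus E, with E = z_c * sigma / sqrt(n)."""
    margin = critical_z(confidence) * sigma / sqrt(n)
    return x_bar - margin, x_bar + margin

print("z for 95% confidence:", round(critical_z(0.95), 4))

# HYPOTHETICAL sample: mean of 20 from n = 81 values, sigma assumed known.
low, high = mean_confidence_interval(x_bar=20.0, sigma=4.5, n=81)
print("95% confidence interval:", round(low, 2), "to", round(high, 2))
```

Replacing the critical value with the textbook's rounded value of 2 widens the band slightly, which is the difference from StatCrunch mentioned in the note above.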
The text defines E (margin of error) as 2s/√n, but StatCrunch will use z_c(σ/√n), where z_c is the value of z that corresponds to
the level of confidence desired. The picture below shows the general 95% confidence interval when σ is known. The critical
part to understand is that the confidence interval defines an area where we think the population parameter (μ) will be located.
There is no reason to expect it to be in the middle; only that it will be within that area 95% of the time.
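[Figure: the sampling distribution of the means, centered at µ. 95% of the sample means (x̄'s) fall between µ − 1.96 σ/√n and µ + 1.96 σ/√n (red line). The confidence interval from x̄ − E to x̄ + E around the sample mean x̄ (green line) contains µ.]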
The question is whether the band I have constructed (green line) around the sample mean (x̄) has captured the true
population mean (µ).
The answer is yes: the population mean is inside our “confidence interval” and we have captured µ.
How often will this be true?
The answer is 95% of the time. Why? Because we know that 95% of the time the mean (x̄) of our sample will fall within the
range indicated by the red line, that is, within 1.96 standard errors of µ. Whenever the sample mean falls in that range, a band
extending the same distance (E) on either side of x̄ reaches far enough to include µ. So 95% of the samples we could take will
give a confidence interval that captures the population mean.
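[Figure: the same sampling distribution of the means, centered at µ, with 95% of the sample means (x̄'s) between µ − 1.96 σ/√n and µ + 1.96 σ/√n. Here the sample mean x̄ falls outside that range, and the confidence interval from x̄ − E to x̄ + E does not contain µ.]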
What about the second situation? Did we capture the population mean (µ)?
The answer is no. The population mean (µ) is not in our interval.
How often would we expect to get this type of situation?
Only 5% of the time. Why? Because only 5% of the time would we expect to obtain a sample mean that is more than 1.96 standard
errors away from the population mean, so only 5% of the time would we expect our confidence interval to miss it.
When we construct confidence intervals, we are building a band that we believe contains the population mean, and we can
quantify the likelihood that we are incorrect.
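The "95% of the time" claim can be illustrated with a small simulation. This is only a sketch under stated assumptions: the population is taken to be normal with hypothetical values µ = 100 and σ = 15, each sample has n = 25 values, and the function name `coverage` is made up for this example.

```python
import random
from math import sqrt

def coverage(mu, sigma, n, trials, z=1.96, seed=0):
    """Fraction of simulated samples whose interval x_bar +/- E contains mu."""
    rng = random.Random(seed)     # fixed seed so the run is repeatable
    margin = z * sigma / sqrt(n)  # E = z * sigma / sqrt(n)
    hits = 0
    for _ in range(trials):
        # Draw one sample of n values and compute its mean.
        x_bar = sum(rng.gauss(mu, sigma) for _ in range(n)) / n
        if x_bar - margin <= mu <= x_bar + margin:
            hits += 1
    return hits / trials

# HYPOTHETICAL population values, used only for the simulation.
rate = coverage(mu=100, sigma=15, n=25, trials=5000)
print("share of intervals that captured mu:", rate)
```

The printed share should land close to 0.95, with the remaining samples being the "second situation" where the interval misses µ.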
We can use the formula for the confidence interval to determine the minimum sample size we need to give us a predetermined margin of error (see page 351). The formula to determine the sample size needed for a given margin of error is:
n ≈ (2σ/E)^2
For our purposes we will use the critical value of 2 and an estimated value for the population standard deviation. Suppose we are
interested in estimating the average time Kaplan students spend studying each week. We would like our estimate to be within
1 hour of the actual time, and we estimate the population standard deviation to be 4.5 hours. The minimum sample size we
need is:
n = (2*4.5/1)^2 = 9^2 = 81
If we have a sample size of 81 and a standard deviation close to 4.5 we will have a margin of error of 1.
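The same arithmetic as a short Python sketch. The function name is hypothetical, and rounding up with `ceil` is an added convention so that the result is always a whole number of respondents; the second call shows what changes if 1.96 is used in place of the textbook's 2.

```python
from math import ceil

def sample_size_for_mean(sigma, margin, z=2):
    """Smallest whole n with z * sigma / sqrt(n) <= margin, from n = (z*sigma/E)^2."""
    return ceil((z * sigma / margin) ** 2)

# Study-time example from the text: sigma estimated at 4.5 hours, E = 1 hour.
n_textbook = sample_size_for_mean(sigma=4.5, margin=1)        # critical value of 2
n_exact = sample_size_for_mean(sigma=4.5, margin=1, z=1.96)   # critical value of 1.96

print("n with z = 2:", n_textbook)
print("n with z = 1.96:", n_exact)
```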
The final section of this chapter deals with estimating a population proportion. The basic logic is the same as for estimating
a population mean. The standard deviation of a sample proportion is √(p(1−p)/n). Suppose we consider the polls about the Republican
candidates for president and we are trying to create the 95% confidence interval for the true population proportion. We might
have the following sample data: from a sample of 500 registered voters, we found that 170 favored Candidate A. The sample
proportion is 170/500 = 0.34, and the standard deviation would be √(0.34(0.66)/500) = √(0.2244/500) = √0.0004488 ≈ 0.0212. So we
now use the formula for E and multiply the standard deviation by 2, giving E ≈ 0.0424.
Now we add and subtract E from our sample proportion, giving us the 95% confidence interval for the population proportion
of 0.30 to 0.38.
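The poll example can be reproduced with a few lines of Python. This is a sketch that follows the textbook's critical value of 2; the function name is hypothetical.

```python
from math import sqrt

def proportion_interval(successes, n, z=2):
    """Sample proportion, margin of error E, and the interval p_hat +/- E."""
    p_hat = successes / n
    standard_deviation = sqrt(p_hat * (1 - p_hat) / n)
    margin = z * standard_deviation
    return p_hat, margin, (p_hat - margin, p_hat + margin)

# Poll example from the text: 170 of 500 registered voters favored Candidate A.
p_hat, margin, (low, high) = proportion_interval(170, 500)

print("sample proportion:", p_hat)
print("margin of error E:", round(margin, 4))
print("95% confidence interval:", round(low, 2), "to", round(high, 2))
```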
Now let's assume that we are planning to conduct a poll and would like our estimate to be within 3 percentage points of the true
proportion. What sample size should we have? The textbook gives the following formula on page 357: n ≈ 1/E^2
Where did the 1 come from, you might ask? Before we start a study about proportions, we do not know the population
proportion, so we use a proportion of 0.5 to make sure the sample size is large enough. The value of 0.5 gives us the largest
possible standard deviation: p * (1-p) = 0.25, and the square root of that value is 0.5. Finally, we multiply that
by 2 (for the 95% confidence interval) and we get 1. That makes E = 1/√n, and solving for n gives n = 1/E^2. The square root is only on n because we already took
the square root of the top portion of the formula.
Now, back to our example where we want a margin of error of 3%. The minimum sample size would be 1/0.03^2 or 1/0.0009 =
1111.11, which we round up to 1112. This is why you see a lot of national polls that have around 1,000 respondents!
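A short Python sketch of this rule of thumb, with hypothetical function names. It assumes the worst case p = 0.5 and the critical value of 2, as in the text, and also runs the formula in the other direction to show the margin of error for a poll of 1,000 people.

```python
from math import ceil, sqrt

def sample_size_for_proportion(margin):
    """n = 1 / E^2, rounded up to a whole number of respondents."""
    return ceil(1 / margin ** 2)

def margin_for_sample_size(n):
    """E = 1 / sqrt(n): worst case p = 0.5 with a critical value of 2."""
    return 1 / sqrt(n)

n_needed = sample_size_for_proportion(0.03)
e_for_1000 = margin_for_sample_size(1000)

print("respondents needed for E = 3%:", n_needed)
print("margin of error with 1,000 respondents:", round(e_for_1000, 4))
```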