Statistics Confidence Intervals of the Mean Unit Plan
A confidence interval is an interval in which there is a certain probability of finding a parameter, given a
known statistic.
Common confidence intervals are the 90%, 95% and 99% intervals. We will find the endpoints of these
intervals using the skills we already have for finding the bounds of a “between” statement on a normal
random variable…
However, although the skills will be similar, we cannot always construct the interval on a normal curve.
Under certain conditions, described below, the sampling distribution is better modeled by a different sort
of curve, called the t-curve, which describes a distribution called the t-distribution or Student’s
Distribution.
So, when can we continue to use the normal curve? By the Central Limit Theorem, when n ≥ 30 the
sampling distribution of the mean is approximately normal. In addition, with samples this large the sample
standard deviation “s” will usually be quite close to the population standard deviation “ 𝜎 ”, so “s” can be
used to replace “ 𝜎 ” in our calculations without losing any significant amount of accuracy.
However, when n<30, this is not the case. We can only continue to use the normal curve if the original
distribution is known to be normal (as in Chapter 5) and 𝜎 is also known (in Chapter 5 it was assumed
that we always knew what 𝜎 was – a dubious assumption in real life!)
When the original distribution is known to be normal, but 𝜎 is unknown, the sampling distribution will be
t-distributed instead of normally distributed.
Calculus Connection: As n goes to infinity, the limit of the t-distribution is the normal distribution.
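For readers who want to see this limit written out, one way to state it uses the density of the t-distribution with ν degrees of freedom (here ν = n − 1). The symbol ν and the density formula are standard results brought in for illustration; they are not taken from the unit text.

```latex
\lim_{\nu \to \infty}
\frac{\Gamma\!\left(\frac{\nu+1}{2}\right)}{\sqrt{\nu\pi}\,\Gamma\!\left(\frac{\nu}{2}\right)}
\left(1 + \frac{t^{2}}{\nu}\right)^{-\frac{\nu+1}{2}}
= \frac{1}{\sqrt{2\pi}}\, e^{-t^{2}/2}
```

The right-hand side is the standard normal density. The key step is the familiar limit (1 + x/ν)^ν → e^x, applied with x = t².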
A flow chart might be helpful at this point: (there is a similar diagram on p. 299)
We can also describe this decision process in chart form:
𝑛 ≥ 30?    Original distribution normal?    𝜎 known?    Sampling distribution type
Y          Y                                Y           Normal
Y          Y                                N           Normal
Y          N                                Y           Normal
Y          N                                N           Normal
N          Y                                Y           Normal
N          Y                                N           𝑡-distributed
N          N                                Y           Good luck, cowboy
N          N                                N           Good luck, cowboy
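The chart above can also be read as a small decision function. The following is a minimal sketch in Python; the function name `sampling_distribution_type` and the return strings are illustrative choices, not part of the unit text.

```python
def sampling_distribution_type(n, original_is_normal, sigma_known):
    """Return the sampling distribution type given by the decision chart."""
    if n >= 30:
        # Large samples: the chart gives "Normal" in all four rows.
        return "normal"
    if original_is_normal:
        # Small sample from a normal population: depends on whether sigma is known.
        return "normal" if sigma_known else "t"
    # Small sample, non-normal population: the "Good luck, cowboy" rows.
    return "neither"


print(sampling_distribution_type(54, False, False))
print(sampling_distribution_type(20, True, True))
print(sampling_distribution_type(16, True, False))
```

The three calls correspond to the three worked examples in this unit (the magazine articles, the applicant ages, and the coffee temperatures).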
Technically however, in this section we will not be making statements about sampling distributions
directly. Instead, we’ll be dealing with a new kind of random variable called the sampling error.
Fortunately, the distribution of the sampling error will always be the same type as the associated
sampling distribution, and will have the same standard deviation (although its mean will differ.) The
symbol for the sampling error is 𝐸 .
The sampling error is the difference between a statistic and the parameter the statistic is meant to
estimate. If the sampling method is unbiased, the expected value of the sampling error is 0 (actually this is
the technical definition of bias: When 𝜇𝐸 ≠ 0, the sampling method is biased.)
In sections 6.1 and 6.2, we will be working with distributions of the sampling error of the mean, 𝐸𝑥̅ , the
difference between the sample and the population means. That is, 𝐸𝑥̅ = 𝑥̅ − 𝜇𝑥̅ (this is why 𝑥̅ and 𝐸 must
have the same distribution type and standard deviation: to go from one to the other, you just
subtract a constant).
(Note: For the rest of this unit, whenever the symbol 𝐸 is used, assume it to be 𝐸𝑥̅ . )
To summarize all this in equation form, we now have:
𝜇𝐸 = 0
and
𝜎𝐸 = 𝜎𝑥̅
Additionally, since we know from Chapter 5 that 𝜇𝑥̅ = 𝜇𝑥 , the equation for 𝐸 can be re-written as 𝐸 =
𝑥̅ − 𝜇𝑥
Once we construct a confidence interval for 𝑬 , we can use this to make a statement concerning the
location of the parameter. (THIS IS THE KEY IDEA!!!)
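The two summary equations for 𝐸 (mean 0, and the same standard deviation as 𝑥̅) can be checked with a quick simulation. This is only a sketch: the population values (mean 50, standard deviation 10), the sample size of 25, and the function name are made up for illustration.

```python
import random
import statistics


def simulate_sampling_errors(mu, sigma, n, trials, seed=0):
    """Draw `trials` random samples of size n and return each sampling error E = x_bar - mu."""
    rng = random.Random(seed)
    errors = []
    for _ in range(trials):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        errors.append(statistics.fmean(sample) - mu)
    return errors


# Hypothetical population: normal with mean 50 and standard deviation 10.
errors = simulate_sampling_errors(mu=50.0, sigma=10.0, n=25, trials=5000)
mean_E = statistics.fmean(errors)
sd_E = statistics.pstdev(errors)

print(f"mean of E: {mean_E:.3f} (theory: 0)")
print(f"sd of E:   {sd_E:.3f} (theory: {10.0 / 25 ** 0.5:.3f})")
```

The simulated mean of 𝐸 should land near 0 and its standard deviation near 𝜎/√𝑛 = 2, up to random variation.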
Ex: A Media Studies student sampled 54 sports magazine articles at random and recorded how many
sentences were in each. She found that the mean article contained 12.4 sentences, with a standard
deviation of 5.0. There is a 95% probability that the mean number of sentences per article in all sports
magazines is between _____________ and _____________ [S.A.M.]
In order to answer this question, we must know some facts about the sampling error E, the random
variable whose instances are the differences between the sample and the population means. How do we
do this?
Step 1: Find the type and parameters of the distribution of E: Since n ≥ 30, the distribution of E will be
approximately normal. Recall that 𝜇𝐸 = 0 and that 𝜎𝐸 = 𝜎𝑥̅ = 𝜎/√𝑛
So, 𝜇𝐸 = 0 and 𝜎𝐸 = 𝜎𝑥̅ = 5.0/√54 ≈ 0.68041
Step 2: Find the bounds of a “between” probability equation for E.
Here, we are asked to complete the statement P(a<E<b)=0.95. By sketching the normal curve just as we
did in Chapter 5, we can find that P(E<a)=0.025 and the z-scores associated with the boundaries are
±1.96.
Now, applying the same formula as we did in Chapter 5 when we found boundaries:
𝑎 = 𝜇𝐸 + 𝑧𝑎 𝜎𝐸 = 0 + −1.96(0.68041) = −1.33
and 𝑏 = 𝜇𝐸 + 𝑧𝑏 𝜎𝐸 = 0 + 1.96(0.68041) = 1.33
Conclusion: There is a 95% probability that the sampling error is between −1.33 and 1.33
Step 3: Construct the Confidence Interval
What does all this mean? Well, since the sampling error is the distance between the statistic and the
parameter (in this case, the sample mean and the population mean), we can also be 95% certain that the
population mean lies within the interval:
{𝑥̅ + 𝑎, 𝑥̅ + 𝑏} = {12.4 − 1.33, 12.4 + 1.33} = {11.07, 13.73}
Conclusion: There is a 95% probability that the value of 𝜇 is between 11.07 and 13.73. That is, we are 95%
certain that the mean number of sentences per article in all sports magazines is between 11.07 and 13.73
sentences.
More formally,
𝑃(11.07 < 𝜇 < 13.73) = 0.95
We have just related 𝑥̅ and 𝜇 mathematically, which has been our goal since the beginning of the year!!
Now that we have shown this process in all its detail, it is permissible to shorten some of the steps. We
can write a single formula for finding the confidence interval:
{𝑥̅ + 𝑧𝑎 𝜎𝑥̅ , 𝑥̅ + 𝑧𝑏 𝜎𝑥̅ }
If we want, we can shorten the process further, but be careful about putting all this in your calculator in
one step:
{𝑥̅ + (𝑧𝑎 𝜎𝑥)/√𝑛 , 𝑥̅ + (𝑧𝑏 𝜎𝑥)/√𝑛}
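The single-formula version can be sketched in Python as a check on hand calculations. This assumes the normal case applies (see the decision chart earlier); the function name `z_confidence_interval` is hypothetical, and `statistics.NormalDist` from the standard library stands in for the z-table.

```python
from math import sqrt
from statistics import NormalDist


def z_confidence_interval(x_bar, sigma, n, confidence):
    """Return (lower, upper) = (x_bar + z_a*sigma/sqrt(n), x_bar + z_b*sigma/sqrt(n))."""
    tail = (1 - confidence) / 2           # area in each tail, e.g. 0.025 for 95%
    z_b = NormalDist().inv_cdf(1 - tail)  # positive boundary z-score
    z_a = -z_b                            # negative boundary z-score, by symmetry
    sigma_x_bar = sigma / sqrt(n)         # standard deviation of the sampling error
    return x_bar + z_a * sigma_x_bar, x_bar + z_b * sigma_x_bar


# Sports magazine example: n = 54, mean 12.4, standard deviation 5.0, 95% confidence.
low, high = z_confidence_interval(12.4, 5.0, 54, 0.95)
print(f"{low:.2f} to {high:.2f}")
```

Keeping 𝜎𝑥̅ as its own intermediate value mirrors the warning above about entering everything into the calculator in one step.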
Ex: A college dean wants to know the mean age of all applicants. In a random sample of 20 applicants, the
mean age is found to be 22.9 years. From past studies, the population is known to be normally distributed
with a standard deviation of 1.5 years. The dean can be 90% certain that the mean age of the applicants is
between ___________ and __________ years.
Step 1: What kind of random variable is E?
Here the sample size is less than 30, which throws us into perilous territory. However, since the original
distribution is known to be normal, and the population standard deviation is known, E is still
approximately normal (note this would not be true if the 1.5 was the sample standard deviation…)
Step 2: Construct a “between” probability to find the z-scores
P(a<E<b) = 0.90, so P(E<a)=0.05 and the z-scores of the boundaries are ±1.645.
Remember we need to find the standard deviation of E (not of x), which is the same as the standard
deviation of the sampling distribution:
𝜎𝐸 = 𝜎𝑥̅ = 1.5/√20 ≈ 0.33541
And further recall that the mean of an (unbiased) sampling error is always zero. Therefore:
𝑎 = 0 + −1.645(0.33541) = −0.55
𝑏 = 0 + 1.645(0.33541) = 0.55
Step 3: Construct the Confidence Interval
{ 𝑥̅ + 𝑎, 𝑥̅ + 𝑏} = {22.9 − 0.55, 22.9 + 0.55} = {22.35, 23.45}
Conclusion: There is a 90% probability that 𝜇𝑥 is between 22.35 and 23.45 years of age.
Finding a minimum sample size: p. 286 example #6
Sometimes we have a confidence interval and a level of confidence in mind ahead of time, before we take
the sampling, and we want to know what the minimum sample size is that will give us these results.
More formally, we are asked to complete statements such as this:
𝑃(𝑟 < 𝜇 < 𝑠|𝑛 = ______) ≥ 𝑐
In these cases, we can re-arrange the formula for finding E so that n is isolated. Always round up, since
this is a lower bound, like Chebychev.
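As a sketch of that rearrangement: if the half-width of the interval is to be at most some value (called `max_error` here), then solving z·𝜎/√n ≤ max_error for n gives n ≥ (z·𝜎/max_error)². The code below assumes 𝜎 is known and the normal case applies. The function name and the numbers are hypothetical and are not the textbook's example.

```python
from math import ceil
from statistics import NormalDist


def minimum_sample_size(sigma, max_error, confidence):
    """Smallest n for which z * sigma / sqrt(n) <= max_error."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Round up: any smaller n would give an interval wider than requested.
    return ceil((z * sigma / max_error) ** 2)


# Hypothetical request: sigma = 1.5, interval half-width at most 0.5, 90% confidence.
n_needed = minimum_sample_size(sigma=1.5, max_error=0.5, confidence=0.90)
print(n_needed)
```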
HW: p.288-292 #33-59 odds (evens if necessary)
Using the t-Distribution
Ex: You are in charge of inspecting the temperature of coffee dispensed by a coffee machine. You test 16
cups of coffee dispensed by the machine and find that the mean temperature was 162 °F, with a sample
standard deviation of 10 °F. Assume coffee temperatures are normally distributed. From this data, you can
conclude with 95% certainty that the mean temperature of the coffee dispensed by this machine is
between ________________ and _______________ degrees.
Step 1: What kind of random variable is E?
Here, the sample size is < 30, so we are thrown into perilous territory… again. This time, however, we don’t
know the population standard deviation, so E is no longer approximately normal. Instead, it is
t-distributed.
The t-distributions are a family of distributions (as the normals are), but, unlike the normals, you can’t
construct a scoring system that reduces every t-distribution to a single standard curve: the shape of the
curve changes with the sample size (remember, when we look up probabilities on the z-table, we are actually
looking up areas under the standard normal curve, not the normal curve that is the actual distribution of x).
Each t-curve is identified by its degrees of freedom, which depend only on the sample size n. The sample
mean 𝑥̅ and the sample standard deviation s are then used to locate and scale the interval.
Look at the t-table in your book (it is behind the z-table). The degrees of freedom (d.f) is always n-1. The
level of confidence in this problem is 0.95, so following that column down to 15 degrees of freedom gives
us a t-score ( 𝑡𝐸 ) of 2.131. This will serve the same purpose as the positive z-value when E was normal.
Since the t-table gives us this value directly, we no longer need Step 2 as before, so we can move right on
ahead to constructing the confidence interval:
Step “3”: Constructing the confidence interval for E
Recall that the rule for sampling distributions, 𝜎𝑥̅ = 𝜎𝑥/√𝑛 , is a general rule for all sampling distributions,
regardless of the nature of either the original or the sampling distribution. This fact sometimes gets lost in
Chapter 5, when we are only dealing with sampling distributions that are normal.
So we can still use the fact that 𝜎𝐸 = 𝜎𝑥̅ = 𝜎𝑥/√𝑛 to find 𝜎𝐸 . The problem is that here we do not know 𝜎𝑥 .
However, our sample standard deviation “s” is the best available estimate of 𝜎𝑥 , so we will use it instead.
Some precision will be lost in this process, and in fact this is why we can’t use the normal distribution here.
So here, 𝜎𝐸 = 𝑠/√𝑛 = 10/√16 = 2.5 , and
{𝑥̅ − 𝑡𝐸 𝜎𝐸 , 𝑥̅ + 𝑡𝐸 𝜎𝐸 } = {162 − (2.131)(2.5), 162 + (2.131)(2.5)} = {156.67, 167.33}
Conclusion: There is a 95% likelihood that 𝜇𝑥 is between 156.67 °F and 167.33 °F
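The t-table lookup and the interval can be cross-checked with a short Python sketch. To stay within the standard library, the t-score is found numerically (Simpson's rule for the area under the t-curve, then bisection). The helper names below are hypothetical, and this is a classroom check, not a production routine. If SciPy is installed, `scipy.stats.t.ppf` can supply the t-score instead.

```python
import math


def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_c = math.lgamma((df + 1) / 2) - math.lgamma(df / 2) - 0.5 * math.log(df * math.pi)
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_cdf(x, df, steps=2000):
    """P(T <= x) for x >= 0: 0.5 plus the area from 0 to x by Simpson's rule."""
    if x == 0:
        return 0.5
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_critical(confidence, df):
    """Positive t-score that leaves (1 - confidence)/2 in each tail."""
    target = 1 - (1 - confidence) / 2
    lo, hi = 0.0, 1.0
    while t_cdf(hi, df) < target:  # widen the bracket until it contains the answer
        hi *= 2
    for _ in range(50):            # bisection
        mid = (lo + hi) / 2
        if t_cdf(mid, df) < target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def t_confidence_interval(x_bar, s, n, confidence):
    """Return (x_bar - t*s/sqrt(n), x_bar + t*s/sqrt(n)) with n - 1 degrees of freedom."""
    t_score = t_critical(confidence, n - 1)
    sigma_E = s / math.sqrt(n)
    return x_bar - t_score * sigma_E, x_bar + t_score * sigma_E


# Coffee example: n = 16, mean 162, sample standard deviation 10, 95% confidence.
low, high = t_confidence_interval(162, 10, 16, 0.95)
print(f"t-score: {t_critical(0.95, 15):.3f}")
print(f"{low:.2f} to {high:.2f}")
```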
Additional example: p.298 example #3
HW: p.300-2 #11-27 odds (evens if necessary)
Follow with: CLT worksheets, CLT test