ECON 309
Lecture 7A: Confidence Intervals
I. The Central Limit Theorem
The Central Limit Theorem is a convenient result from probability theory that allows us
to use the normal distribution even in many cases where variables are not normally
distributed.
Suppose we take a 6-sided die, and we throw it n times. We know that the probability
distribution for a die is not normal – it is discrete and uniform. But what is the
distribution of the sample mean? Notice that you’re much more likely to get a sample
mean close to the true mean (3.5) than a sample mean that is near 1 or 6. For example, what
if we had a small sample of n = 2? We could get a sample mean of 1 only by throwing
two 1’s. We could get a sample mean of 6 only by throwing two 6’s. But we could get a
sample mean of 3.5 by throwing (1 and 6), (6 and 1), (2 and 5), (5 and 2), (3 and 4), or (4
and 3). So there are many more ways to get a sample mean in the center than at the
edges. And that’s for a tiny sample of size n = 2. As n gets larger, it becomes even less
likely that you’ll get an average far from the center.
To understand the CLT, you need to understand that a sample mean is a random variable;
it just happens to be a random variable that is constructed from n other random variables
(the observations). And like any random variable, we can talk about its distribution.
The CLT says that as n gets larger, the distribution of the sample mean gets more and
more like the normal distribution.
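This convergence is easy to see by simulation. The following is a minimal sketch in Python (not part of the lecture itself) that rolls a fair die n times, repeats that experiment many times, and measures how often the sample mean lands within 0.5 of the true mean of 3.5:

```python
import random

def sample_mean_of_dice(n, trials=20_000):
    """Roll an n-die sample `trials` times; return the list of sample means."""
    return [sum(random.randint(1, 6) for _ in range(n)) / n
            for _ in range(trials)]

for n in (2, 10, 50):
    means = sample_mean_of_dice(n)
    # Fraction of sample means within 0.5 of the true mean (3.5);
    # this fraction rises as n grows, just as the CLT suggests
    close = sum(1 for m in means if abs(m - 3.5) < 0.5) / len(means)
    print(f"n = {n:2d}: fraction of sample means near 3.5 = {close:.2f}")
```

With n = 2 only about a sixth of the sample means land that close to 3.5 (you need a sum of exactly 7); by n = 50 nearly all of them do.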
II. Standard Error of the Mean
If the sample mean is a variable, we can talk about its mean and standard deviation. Not
surprisingly, the mean of the sample mean is the same as the mean of the population the
sample is taken from.
What about the standard deviation of the sample mean? It is not the same as the standard
deviation of the population the sample is taken from, but it is related to it like so:
σx̄ = σ/√n
The symbol on the left is the standard deviation of the sample mean, which is formally
known as the standard error of the mean. Notice that it gets smaller as the sample size
(n) grows. This makes sense. The bigger your sample is, the more likely its sample
mean is to be close to the true mean. If you had a sample size equal to the whole
population, it would be impossible for the sample mean to be anything other than the
population mean, so the standard error of the mean would be zero.
The CLT tells us that, for a large enough sample, the sample mean is normally distributed
with a mean of μ and a standard deviation (standard error) of σ/√n. [Show this with bell
curve; show curve getting tighter and taller as n gets larger.]
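The shrinking of the standard error is easy to verify numerically. Here is a short sketch (Python, added for illustration) that computes the exact standard deviation of one die roll and then σ/√n for several sample sizes:

```python
import math

# Exact variance of one fair die roll: E[X^2] - mu^2 = 35/12
faces = range(1, 7)
mu = sum(faces) / 6                            # 3.5
var = sum(f * f for f in faces) / 6 - mu ** 2  # ~2.917
sigma = math.sqrt(var)                         # ~1.708

# The standard error sigma/sqrt(n) falls as n grows
for n in (4, 16, 64, 256):
    print(f"n = {n:3d}  standard error = {sigma / math.sqrt(n):.4f}")
```

Quadrupling the sample size cuts the standard error in half, because of the square root in the denominator.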
III. Forming a Confidence Interval if σ Is Known
A confidence interval (CI) is a range of values that you believe is likely to contain the
true population mean.
To create a CI, you start with your sample mean. This is your best guess as to the value
of the population mean; we sometimes call it your “point estimate.” But it is only a
guess. We would like to give a range of values instead of a point, and then state how
confident we are that the range includes the population mean.
So you take your sample mean, and you add and subtract something from it to get the
endpoints of your range. If you happen to know the population variance, then you also
know the standard error of the mean, and you can use the following formula for your
confidence interval:
x̄ ± zc · σ/√n
We will discuss what to use for zc in a moment. For now, notice you’re multiplying it by
the standard error of the mean, which gets smaller as the sample gets larger. So our CI
will shrink as our sample gets larger, because we are more confident in our point
estimate.
So what is zc? This is “z-critical,” a z-score that represents how confident we want to be.
If we want to be very confident our CI includes the population mean, then we’ll want a
larger z-critical; if we don’t need to be that confident, we can have a smaller z-critical.
People often want to have a 90% CI. This means we want to construct our CI’s so that
90% of the time they will include the population mean – or that our CI will not include
the population mean only 10% of the time.
Notice that there are two ways the population mean can fall outside the CI: by being
below the lowest value, or by being above the highest value. So the 10% chance of the
CI not including the population mean needs to be split in two: 5% of the time our CI will
be too high, and 5% of the time it will be too low.
To find our z-critical, then, we need a value of z from Table 3 such that 0.95 of the
weight will be to the left of it. Looking in the values inside the table, we find 0.9495 at z
= 1.64, and we can’t get any closer to 0.95 than that. We set z-critical equal to 1.64 in the
CI formula above. This means that a weight of 0.05 falls to the right of z = 1.64, another 0.05
falls to the left of z = -1.64, and 0.90 falls in between -1.64 and 1.64.
Example: You want to find out CSUN students’ mean income. You take a sample of n =
49 students and find the sample mean is $26,000. You happen to know the standard
deviation is $7,000. You want a 90% CI for the mean income. It is:
26,000 ± (1.64)(7000/√49)
26,000 ± (1.64)(1000)
26,000 ± 1640
[$24,360, $27,640]
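The arithmetic above can be checked with a few lines of Python (a sketch; the numbers come straight from the example):

```python
import math

xbar, sigma, n = 26_000, 7_000, 49
z_crit = 1.64                    # z-critical for a 90% CI, from Table 3
se = sigma / math.sqrt(n)        # 7000 / 7 = 1000
margin = z_crit * se             # 1640
print(f"[{xbar - margin:.0f}, {xbar + margin:.0f}]")  # prints [24360, 27640]
```

Swapping in z_crit = 1.96 reproduces the 95% interval computed next.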
What if you’d wanted to be more confident? We could have constructed a 95% CI by
finding a different z-critical. You want a value of z from Table 3 such that 0.975 of the
weight falls to the left of it (that is, we want 0.025 in each tail). That z-critical is 1.96.
26,000 ± (1.96)(7000/√49)
26,000 ± (1.96)(1000)
26,000 ± 1960
[$24,040, $27,960]
Notice the CI is wider now, because you wanted to be more certain you wouldn’t leave
out the true mean.
IV. About Confidence Levels
We will sometimes call the probability that a CI will not include the population mean the
“significance level” or α. When we used a 90% confidence level, we had α = 0.10.
When we used a 95% confidence level, we had α = 0.05.
There is nothing magic about significance levels! You’ve probably had 5% or α = 0.05
drilled into your head in previous stats classes. Maybe sometimes you used α = 0.10 or α
= 0.01. But there’s no reason we have to use any of those. We could have picked any α
we wanted. It’s all a question of how confident we want to be.
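For any α, the matching z-critical can also be computed directly rather than read from Table 3. A small sketch using Python's standard-library NormalDist (not part of the lecture):

```python
from statistics import NormalDist

for conf in (0.90, 0.95, 0.99):
    alpha = 1 - conf
    # Put alpha/2 in each tail, so 1 - alpha/2 of the weight lies to the left
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    print(f"{conf:.0%} confidence (alpha = {alpha:.2f}): z-critical = {z_crit:.3f}")
```

This gives 1.645 and 1.960 for the 90% and 95% levels, matching the rounded table values 1.64 and 1.96 used above.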
Scientists tend to pick a rather small α. This means they want to be very confident in
their predictions. But those small α’s may not be appropriate for other contexts, such as
business. We will discuss this more later, when we’re doing significance tests.
V. About the Population Standard Deviation
In the above discussion of CI's, we assumed we knew the population's standard
deviation. What a bizarre assumption! Why would we already know the population’s
standard deviation but not know the population mean? Most of the time, you won’t.
So what can you do? When you have a large enough sample size (the rule of thumb is n
> 30), you can do the exact same procedure, but use the sample standard deviation in
place of the population standard deviation.
In other words, use this:
x̄ ± zc · s/√n
(The fraction there on the right is called the estimated standard error of the mean.)
Notice that we are still using a z-critical value. This is because, when n is large enough,
the CLT gives us relatively great confidence that our sample means are normally
distributed.
But what if we don’t have a large enough sample? In that case, we can no longer use the
normal distribution (and corresponding table). Instead, we use the Student’s t distribution
(and corresponding table – in our book, Table 4). The t-distribution is very similar to the
normal distribution, but it’s flatter. This reflects the fact that when your sample is small,
it’s more likely that you’ll get sample means far away from the population mean. Our
new CI formula is like so:
x̄ ± tc · s/√n
Same as the above, except with t-critical replacing z-critical. So now all we need to do is
find the t-critical. We do this by looking at Table 4.
We look for the row with degrees of freedom = n – 1. (Degrees of freedom is a
complicated concept, but it will always be equal to your sample size minus the number of
things you’re trying to estimate. In this case, you’re only estimating the mean, so you
subtract 1. When we get to regressions, we’ll be estimating more things and you’ll have
to subtract more.)
We look for the column that corresponds to the confidence level we want. This is the
second row of boldfaced column headings. Say we want 95% confidence; we choose the
5th column from the right, with the column headings 0.0250 and 0.9500. (The first
column heading is the area in the right tail; the second column heading is the confidence
level. The reason these don’t add up to 1 is that we want 0.0250 in both the left and right
tails.)
If we can use the t-distribution for small samples, why not for big samples? Well, that’s
actually what economists and statisticians do now. The only reason we still teach people
the z-distribution method any more is history. The t-table consumes a lot of space (you
need a row for every different sample size), and the z- and t-distributions are similar
enough when n is large that it was good enough to use the z-table instead. But now we
have computers with enough memory to have complete t-tables. So when you have
Excel and other statistical programs construct a CI, they will always use the t-distribution.
(So why do I teach the z-distribution? First, to be consistent with everyone else. Second,
so I can tell you what I just told you. Third, because this way you can use the sample
problems in the book for practice; some use z and some use t, but all rely on pretty much
the same method.)
Example: CSUN students’ incomes again, but now suppose you only have a sample of
16. The sample mean is $26,000 (like before), and you don’t know the true standard
deviation. Your sample standard deviation is s = 6800. To find t-critical, note that df =
16 – 1 = 15, and suppose we want a 90% CI; this gives us t-critical = 1.75. The CI is:
26,000 ± (1.75)(6800/√16)
26,000 ± (1.75)(1700)
26,000 ± 2975
[$23,025, $28,975]
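As before, the arithmetic can be verified in a few lines (a Python sketch; t-critical is the table value used above):

```python
import math

xbar, s, n = 26_000, 6_800, 16
t_crit = 1.75                    # Table 4, df = n - 1 = 15, 90% confidence
se_hat = s / math.sqrt(n)        # 6800 / 4 = 1700 (estimated standard error)
margin = t_crit * se_hat         # 2975
print(f"[{xbar - margin:.0f}, {xbar + margin:.0f}]")  # prints [23025, 28975]
```

Note the margin (±2975) is wider than the z-based 90% margin would be with the same numbers, reflecting the flatter t-distribution at small n.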