Chapter 4
Confidence Intervals
Recall that Statistics, the science of analyzing data, has two broad branches: Descriptive Statistics
and Inferential Statistics. In Descriptive Statistics we describe a given set of data, i.e. we tell the story
hidden behind a given set of data. In Inferential statistics, we use sample data to make inferences about
the population from which the sample was drawn. A population is the set of all entities of interest,
whereas a sample is a subset of the population. For example, let’s say I am interested in the average age
of all visitors at Disney’s Magic Kingdom in Orlando on Dec. 22nd 2012. All the visitors on that day will
make up my population. But if I took a random sample of say 100 visitors that day, I will have a sample.
If I find the age of these 100 visitors in my sample, and use this sample average to make an inference
about the average age of all visitors in my population, I am engaged in Inferential statistics.
Estimating certain parameters of a population, such as the population average, is one of the things
we do in inferential statistics. In this chapter, we will study how to estimate various parameters of a
population. What kinds of parameters can we estimate about a population? We can estimate the
average of a random variable of interest for our population of interest. We can estimate the standard
deviation. We can also estimate the proportions of various categories in a population. For example if we
are interested in knowing what proportion of college students in a particular college are part-time
students, we can estimate it using a sample.
Some jargon about population and sample and parameter and statistic
Before we go into much detail, let me introduce you to some terms frequently used in statistics.
When we talk about a summary measure of a population, such as the population average or the
population proportion or the population standard deviation, we are talking about a population
parameter. When we are talking about a summary measure of a sample, we are talking about a sample
statistic. There is no such thing as a statistic of a population or a parameter of a sample. The term
“parameter” goes with “population” and the term “statistic” (not statistics) goes with a “sample”. So,
the average of all values of a variable in a population is a parameter. The average of all values of a
variable in a sample is a statistic. We use a sample statistic to make inferences about a population
parameter.
One more very important thing about parameters and statistics:
Remember we learnt earlier that a variable is a random variable if its value depends upon chance?
Now I will tell you something you should never forget. A sample statistic is a random variable. Why is it a
random variable? Because the value of a sample statistic depends on what sample you happen to get,
which depends upon chance. A population parameter, on the other hand, is not a random variable. Why
is it not a random variable? It is not a random variable, because a population is fixed, because it contains
data about ALL the entities, so there is nothing left to chance. A population mean is what it is. A
population proportion is what it is. No matter how many times you measure it, you will get the same
value. When a value is a random variable, you can talk about its probability of being in a certain range
because it can assume any value described by a probability distribution. When a value is not a random
variable, you cannot talk about its probability of being within a range. Its value is fixed and therefore has
no probability distribution.
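To see this difference in action, here is a small Python sketch (Python is my choice for illustration; the rest of this chapter uses Excel). The population of visitor ages is made up and all the variable names are hypothetical. The population mean comes out the same every time we compute it, while each random sample gives its own sample mean.

```python
import random
import statistics

random.seed(0)  # fixed seed so the illustration is repeatable

# Hypothetical population: ages of 10,000 visitors (made-up numbers).
population = [random.randint(1, 80) for _ in range(10_000)]

# The population mean is a parameter: it is fixed once the population is fixed.
population_mean = statistics.mean(population)

# Each random sample of 100 visitors yields its own sample mean (a statistic).
sample_means = [
    statistics.mean(random.sample(population, 100)) for _ in range(5)
]

print("population mean:", population_mean)
print("sample means:   ", sample_means)
```

The five sample means differ from one another because each depends on which 100 visitors happened to be drawn.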
What good do Estimates do?
Making estimates about a population parameter is one of the most important things statisticians
do. Let me tell you why estimation is important. A lot of business decisions depend upon estimates
about the parameters of a population. For example, say I am a production manager at a factory that
produces T-shirts. Say I produce a million T-shirts a year. So an important production decision is how to
distribute the production of one million T-shirts into various sizes such as Small, Medium, Large, X-Large,
XX-Large etc. If I have information about the proportion of sizes of the customers in my population, that
information will be very useful in my production decision. Since obtaining this information about the
entire population is quite expensive, we rely on estimates using samples. What can we do to get the estimates?
We can hire some people to sit in front of a clothing store and eyeball every customer that walks in the
store in a week and tally their shirt sizes. Based on just one week’s worth of data, which acts as our
sample, we can make very good estimates about the proportions of various shirt sizes in the entire
population, which, in turn, will help in our production decision. Please realize the importance of this
information from a business point of view. If the information about the population is not accurate then
for certain sizes, we will have surplus stock and for other sizes we will have shortages. Both situations are
undesirable for our business. A surplus implies that we may have to put the surplus stock in clearance,
hence lose money. A shortage implies that we lost some potential business and hence some money. So
by hiring some people to collect data for a week and hiring a student of statistics, such as you, to make an
inference about the population, the company can avoid some losses, even after paying all the people
hired for the estimation task.
Two types of Estimates:
There are two types of estimates. A point estimate and an interval estimate. A point estimate just
gives you a single value, such as 12, whereas an interval estimate is a range of values, such as 10 to 14. I
will first discuss point estimates and then interval estimates.
Point Estimate
Let’s say I am interested in knowing the proportion of part-time students at a University. Asking
every student their status is not very practical, so I want an estimate (not the exact proportion). A point
estimate would be just one value, such as 20% part-time students. A point estimate, in our example, is
relatively easy to compute. You simply get a random sample of students, say 100 students, and ask each
one if they are full time or part time. If 20 of them said that they were part-time students, then your
point estimate is 20% part-time students.
Let’s look at another example. Let’s say this time I am interested in knowing the average age of all
part-time students at the same University. Again, I can ask a sample of part-time students their age. I
could simply ask the 20 students in the above example, their age and compute the mean. Let’s say the
mean of the sample of 20 part-time students comes out to be 28.4 years. So the point estimate for
average age of the entire population of part-time students at this university will be 28.4 years.
So what is so difficult about point estimates? All we did was get a sample and whatever was the
sample mean (or sample proportion) was the point estimate for the population mean (or population
proportion). Really – there is nothing difficult about getting a point estimate. I don’t know why people
have given such a bad name to Statistics. The only difficulty in getting a point estimate is getting a
random set of 100 students and asking them their full time status and their age. In a college course you
will never be required to collect data in such fashion. You will be given the sample mean and you will be
asked to give the point estimate of the population. Now how difficult is that? In fact, you don’t even
have to be a college student to learn this concept. You can learn this concept in 2nd grade, because really
– it is that simple.
Are Point Estimates Enough?
That’s a good question – are point estimates enough? Well, if we think about it, to obtain a point
estimate, we simply get a sample and whatever is the sample average (or proportion) becomes the
estimate for the population average (or proportion). So the accuracy of a point estimate depends largely
on the selection of your sample. If someone else is working with a different sample, they will get a
different estimate, because no two samples are exactly alike. So if I have two different point estimates,
which one is to be believed more? This is a legitimate concern people have when they see a point
estimate. There is not a whole lot of reliability for any given point estimate. For a decision maker, the
reliability of an estimate is important because, after all, the decision maker is making decisions based on
estimates. So point estimates, although easy to obtain, are not sufficient. We will now discuss interval
estimates, which take into account reliability.
Interval Estimates
Interval estimates allow us to specify a range of values in which we estimate the true value of a
population parameter lies, with a certain degree of confidence. For example, we can make a statement
like - we are 95% confident that the population mean lies between 10 and 14. We cannot make such
statements of confidence (or reliability) for a point estimate. For example we can never say that we are
95% confident that the true population mean is 12. In fact we are almost 100% confident that it is not
exactly 12 because a mean is a continuous variable and the probability that it will be any single value is
zero. So, we cannot make any reliability statements about a point estimate but can about interval
estimates.
Before I give you an example of an interval estimate, let me test your intuition about something.
Suppose I ask two students to go out and give me an estimate of the age of all part-time students at a
University. Suppose both students collect a sample of students and based on their sample they both tell
me that their sample mean is 28.4 years. These two students have never taken a course in statistics so
they didn’t know how to do an interval estimate. When I asked them to give me an interval estimate,
they both added and subtracted 2 from their point estimate and they both gave me an interval estimate
of 26.4 to 30.4 years. So I asked them how large their sample was. It turns out the first student had a
sample of 400 students, while the second one had a sample of 16 students. According to your intuition,
which student’s interval estimate is more reliable, the first or the second? In other words, will you have
more confidence in the first person’s estimate or the second, even though both estimates are identical?
Even for the same estimate, it is possible to have different degrees of confidence. If you are thinking – you
will have more confidence in the first student’s estimate, because his sample size was much larger than
the second student’s, you are absolutely right. In fact, whenever someone gives you an interval estimate,
they should also tell you the degree (or level) of confidence. The degree (or level) of confidence and the
interval go together. For this reason these intervals are called confidence intervals. In the rest of this
book, you will read about confidence intervals a lot. But now you know what it means. Remember, an
interval estimate without a specified confidence level is not very useful, just like point estimates were not
quite useful because they lacked a measure of reliability.
So how to compute a confidence interval?
Let me now give you an example of how to compute a confidence interval for a population
parameter. Whenever you are asked to give an interval estimate, you are also given the desired degree
(or level) of confidence. So let’s say we want an interval estimate for the average age of all part-time
students at a university with a 95% confidence. In other words we want a 95% confidence interval for the
average age of all part-time students at a given university. So I will go out and collect some sample data,
i.e. I will randomly select some students. Suppose I collect data for 400 part-time students. Say the
average age in the sample is 28.4 years and say the standard deviation is 4 years.
Now a confidence interval is really the point estimate ± a margin of error. Point estimate minus margin
of error gives you the lower confidence limit and point estimate plus the margin of error gives the upper
confidence limit. The two confidence limits (lower and upper) make up your confidence interval.
We all know how to get a point estimate. It is so easy, even a second grader can give it to you. Whatever
is the value of the sample statistic is the value of the point estimate of the population parameter. So to
obtain a confidence interval, all we really need to learn is how to compute the margin of error.
Using Excel, you can easily get the margin of error (using the function =CONFIDENCE()) and consequently
the two confidence limits. Figure 1 shows you calculations in Excel. Figure 2 shows the results of the
calculations.
Figure 1: Calculations for Margin of Error and Confidence Intervals
Figure 2: Results of the Calculations for Margin of Error and Confidence Intervals
Let’s study Figure 1. The margin of error is calculated in cell B8 using the =CONFIDENCE() function. This
function needs three things – alpha (in cell B6), standard deviation (in cell B3) and the sample size (in cell
B4). Alpha is nothing but 1 minus the desired confidence level expressed as a fraction. So in cell B5,
there is the desired confidence level expressed as 0.95, since we are interested in a 95% CI, so alpha is
0.05. So the margin of error is 0.392 (See Figure 2), so the two confidence limits are 28.4 ± 0.392 or
28.008 and 28.792. So what statement can we make about this CI? We can say - “we are 95% confident
that the true value of the population mean age of all part time students at this university lies between
28.008 and 28.792 years.” Let me also tell you a statement that many students of statistics would make
about the confidence interval in this example, but which would be incorrect. It would be incorrect to say
that – there is a 95% chance (or probability) that the true population mean lies between 28.008 and
28.792. Some students don’t see any difference between these two statements and don’t see what is
wrong with the second statement. It is very important to understand this difference. For example, if you
meet a statistician and you make the second statement about a confidence interval, they will
immediately know that you didn’t quite pay attention during your lesson on confidence intervals (It is
also possible that they will not know you made a mistake because even they didn’t pay attention during
their lesson on confidence intervals). On the other hand, if you make the first statement, they will be
highly impressed by you that you understood your confidence interval lesson better than most college
students. The reason the second statement is incorrect is because this statement is talking about the
probability of a population parameter being in a certain range. Remember that a population parameter is
not a random variable and hence we cannot make a statement about its probability. The first statement
(we are 95% confident that the true value of the population mean lies between 28.008 and 28.792)
makes no statement about the probability of the population parameter lying in a range. What it
is really saying is that if we were to draw 100 samples of the same size, compute the sample mean and
standard deviation of each, and use them to construct 100 different intervals, then about 95 of those
100 intervals would contain the true population mean. Therefore we are 95% confident that this
interval contains the true population mean.
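If you want to check this repeated-sampling idea yourself, a rough simulation sketch in Python follows. Everything in it is an assumption made for illustration: I pretend the true population mean is 28.4 and the true standard deviation is 4, I draw ages from a normal distribution, and I use the large-sample margin of error 1.96 * stdev/sqrt(n) that the next section explains.

```python
import math
import random
import statistics

random.seed(1)  # fixed seed so the run is repeatable

TRUE_MEAN = 28.4   # assumed population mean (a fixed parameter)
TRUE_SD = 4.0      # assumed population standard deviation
N = 400            # sample size, as in the example
TRIALS = 1000      # number of repeated samples

covered = 0
for _ in range(TRIALS):
    sample = [random.gauss(TRUE_MEAN, TRUE_SD) for _ in range(N)]
    mean = statistics.fmean(sample)
    moe = 1.96 * statistics.stdev(sample) / math.sqrt(N)
    # Does this sample's interval contain the fixed population mean?
    if mean - moe <= TRUE_MEAN <= mean + moe:
        covered += 1

coverage = covered / TRIALS
print(f"{coverage:.1%} of the intervals contain the true mean")
```

The share of intervals that contain the true mean should land near 95%. Notice that the population mean never moves; it is the intervals that change from sample to sample.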
How did statisticians compute the confidence interval before there was Excel?
Before Excel, statisticians used a simple formula for computing the margin of error. Excel also basically
uses the same formulas, but since it does it for you we never get to see the formulas. But it is important
you have an understanding of the underlying formulas. For a 95% confidence interval and for a large
sample size, the Margin of Error (or MOE) = 1.96 * stdev/sqrt(n)
Let’s apply this formula to obtain the margin of error in the above example. In our example, the stdev is
4 and n is 400, so the margin of error will be 1.96*4/sqrt(400) = 1.96 * 4/20 = 1.96/5 = 0.392, which is
what we got using Excel. For smaller sample sizes (< 30) the formulas are different and will be explained
later.
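The same arithmetic can be written as a short Python sketch. The function names margin_of_error and confidence_interval are my own, chosen for illustration; they are not Excel functions.

```python
import math

def margin_of_error(std_dev, n, multiplier=1.96):
    """Large-sample margin of error: multiplier * std_dev / sqrt(n)."""
    return multiplier * std_dev / math.sqrt(n)

def confidence_interval(point_estimate, std_dev, n, multiplier=1.96):
    """Return (lower limit, upper limit) = point estimate -/+ margin of error."""
    moe = margin_of_error(std_dev, n, multiplier)
    return point_estimate - moe, point_estimate + moe

# The part-time student example: mean 28.4, stdev 4, n = 400, 95% confidence.
moe = margin_of_error(std_dev=4, n=400)
lower, upper = confidence_interval(28.4, std_dev=4, n=400)
print(round(moe, 3), round(lower, 3), round(upper, 3))  # 0.392 28.008 28.792
```

Passing a different multiplier (for example 1.645 or 2.575) gives the margin of error for another confidence level.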
What about a confidence level other than 95%?
The most common confidence level used by statisticians is 95%. For example, whenever you hear on TV
that a poll shows a certain percentage of support for a presidential candidate (say 42%), they also tell you
that the margin of error is 3%. Essentially what they are telling you is that the interval estimate is 39% to
45%. They usually do not tell you the confidence level because most television viewers would not understand it.
But TV polls almost always assume a 95% confidence level. You, as a statistics student, now know this
little-known fact about TV polls. Just because a 95% confidence level is the most popular doesn’t mean you cannot
compute an interval for a different confidence level. Other popular confidence levels you may use are
90% or 99% or any other level that you desire.
For a 90% CI, for a large sample size, Margin of Error is 1.645 * std. dev. / Sqrt(n)
For a 99% CI, for a large sample size, Margin of Error is 2.575 * std. dev./ Sqrt(n)
Note that in these formulas, the only difference is the multiplier (1.96 for 95% CI, 1.645 for 90%CI and
2.575 for 99% CI)
For the above example, for 90% CI, the MOE is 1.645 * 4/Sqrt(400) = 1.645/5 = 0.329 and CI =
[28.071,28.729]
For 99% CI, the MOE is 2.575*4/Sqrt(400) = 2.575*4/20 = 2.575/5 = 0.515 and CI = [27.885, 28.915]
Figure 3: 90% and 99% Confidence Levels
Figure 3 shows these CIs using Excel. Note that Excel gives the same confidence intervals that we obtained
using the formulas above. All we had to do was change cell B5 to 0.9 and 0.99.
Using Excel, it is easy to get any confidence level, such as 98% or 92% or 88% or 99.5% etc. You simply
change the value in cell B5 (in the above example) and you will get the new margin of error and the
confidence interval. Using formulas, it becomes a little difficult, because we do not have readily available
formulas for various levels. It requires you to read the z-table. The multipliers 1.96 and 1.645 and 2.575
come from the z table. I will not explain it here.
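If you would rather not read the z-table, a short Python sketch can look up the multiplier for any confidence level. It uses NormalDist from Python's standard statistics module (available in Python 3.8 and later); the function name z_multiplier is my own.

```python
from statistics import NormalDist

def z_multiplier(confidence_level):
    """Multiplier z such that the area between -z and +z equals confidence_level."""
    alpha = 1 - confidence_level
    return NormalDist().inv_cdf(1 - alpha / 2)

for level in (0.90, 0.95, 0.98, 0.99):
    print(level, round(z_multiplier(level), 3))
```

For 90% and 95% this gives 1.645 and 1.96. For 99% the computed value is about 2.576; the 2.575 used in this chapter is the usual table reading, and the difference has almost no effect on the interval.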
In the above example, when we talked about the formulas, we said that they were good for large
sample sizes. Since in our example, the sample size was 400, which is considered quite large, we were
able to use the formulas. But what if the sample size was small – say 16? Any sample size less than 30 is
considered small.
How to compute confidence interval for a small sample?
When the sample size is small, the formulas require you to read a t-table. The multipliers come from the
t-table. Instead of using the formulas and the t-table, you can simply use Excel’s =CONFIDENCE.T()
function as shown in Figure 4. Please look at the formula in cell B8. Also, note that for a sample of 16
with the same sample mean (28.4) and standard deviation (4), the margin of error is 2.131, instead of the
0.392 we got when the sample size was 400. The confidence interval is much wider. It is
[26.269 to 30.531] instead of [28.008 to 28.792]. This means that when the sample size became smaller,
you needed a much wider interval to be 95% confident than when the sample size was large.
Figure 4: Using =CONFIDENCE.T() function to compute the confidence interval when sample size is small
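For readers who want to see where the 2.131 comes from without a t-table, here is a rough Python sketch. It is only an illustration, not what Excel does internally: it finds the t multiplier by numerically integrating the t-distribution's density (Simpson's rule) and searching by bisection, using only the standard library. All function names are my own, and I am assuming the usual small-sample margin of error, t multiplier * std_dev/sqrt(n), with n minus 1 degrees of freedom.

```python
import math

def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    coeff = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coeff * (1 + x * x / df) ** (-(df + 1) / 2)

def t_cdf(x, df, steps=2000):
    """P(T <= x) for x >= 0, using Simpson's rule on [0, x]."""
    h = x / steps
    total = t_pdf(0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3

def t_multiplier(confidence_level, df):
    """Find t such that the area between -t and +t equals confidence_level."""
    target = 1 - (1 - confidence_level) / 2
    low, high = 0.0, 100.0
    for _ in range(60):  # bisection search
        mid = (low + high) / 2
        if t_cdf(mid, df) < target:
            low = mid
        else:
            high = mid
    return (low + high) / 2

# Small-sample example: n = 16, sample mean 28.4, sample stdev 4, 95% confidence.
n, std_dev, mean = 16, 4, 28.4
t = t_multiplier(0.95, df=n - 1)
moe = t * std_dev / math.sqrt(n)
print(round(t, 3), round(moe, 3), round(mean - moe, 3), round(mean + moe, 3))
```

With 15 degrees of freedom the multiplier comes out close to 2.131, compared with 1.96 for a large sample, which is why the interval is so much wider. This sketch is meant for small degrees of freedom; for very large ones math.gamma overflows and the normal multiplier is the better choice anyway.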
The formula for margin of error explained
For large sample size, for a 95% confidence interval, the margin of error is 1.96*std_dev/sqrt(n).
There are two parts of this formula – the multiplier (1.96 in this case) and std_dev/sqrt(n). Let me first
talk about the multiplier. Why 1.96? We use 1.96 in this formula because in a normal distribution, the
area between +/- 1.96 standard deviations is 0.95. If we were interested in 90% CI, the multiplier would
be 1.645 because the area under a normal curve between +/- 1.645 standard deviations is 90%. Similarly
for a 99% CI, the multiplier is 2.575.
Why std_dev/sqrt(n)? This comes from the central limit theorem. The standard deviation of the
distribution of sample means (often called the standard error) is std_dev/sqrt(n), where std_dev is the
population standard deviation, which we estimate with the sample standard deviation, and n is
the sample size. You should be able to put these two concepts together to see how 1.96*std_dev/sqrt(n)
gives an interval such that we are 95% confident that the population mean lies within this interval.
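You can check the std_dev/sqrt(n) claim with a quick simulation. The Python sketch below is an illustration under made-up assumptions: the population is ages spread uniformly between 18 and 60 (deliberately not a normal distribution), and the constants are my own choices.

```python
import math
import random
import statistics

random.seed(2)  # fixed seed so the run is repeatable

LOW, HIGH = 18.0, 60.0                 # assumed age range of the population
POP_SD = (HIGH - LOW) / math.sqrt(12)  # standard deviation of a uniform distribution
N = 100                                # sample size
SAMPLES = 2000                         # number of samples drawn

# Draw many samples and record the mean of each one.
sample_means = [
    statistics.fmean(random.uniform(LOW, HIGH) for _ in range(N))
    for _ in range(SAMPLES)
]

observed = statistics.stdev(sample_means)  # spread of the sample means
predicted = POP_SD / math.sqrt(N)          # std_dev / sqrt(n)
print(f"observed {observed:.3f}, predicted {predicted:.3f}")
```

The two numbers should be close, even though the individual ages are not normally distributed.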
For a large sample, we can assume that the sampling distribution is normal, hence we can use the
multipliers of 1.96, 1.645 and 2.575. But if the sample size is small, we cannot assume that the sampling
distribution is normal. We assume that it has a t-distribution. Remember the t-distribution depends on
the degrees of freedom, which in turn depend on the sample size. We get the multiplier from the
t-table for degrees of freedom equal to n minus 1.
Confidence Intervals of Proportions
Just as you can estimate confidence intervals for a population mean, you can estimate confidence
intervals for population proportions. For example, you may want to know the proportion of Republican
supporters in the population of a county or the proportion of people in your market radius that buy your
product/service or the proportion of people in a population making over 100,000 per year. By just
selecting a few hundred people you can estimate the confidence interval of the population proportion.
The basic formula is the same, i.e., CI = point estimate +/- Margin of Error
The margin of error, for proportions, is given by the formula: $\text{MOE} = z_{\alpha/2} \sqrt{\hat{p}(1 - \hat{p})/n}$
where n is the sample size, $\hat{p}$ is the sample proportion, z is the standard normal random variable,
and α is 1 minus the confidence level.
Example
Suppose we are interested in the proportion of republicans in a country. Say you survey a sample of 400
people and ask if they are republican supporters or not and suppose you find that 240 of them say they
are republican supporters. Suppose I am interested in a 95% confidence interval.
In this example, α is 0.05, so $z_{\alpha/2} = 1.96$
$\hat{p}$ is 240/400 = 0.60
n = 400
So, the margin of error is $1.96 \sqrt{0.6 \times 0.4 / 400} = 0.048$
So, the confidence interval is 0.6 ± 0.048 = [0.552, 0.648]
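The same calculation as a short Python sketch (the function name proportion_ci is my own, chosen for illustration):

```python
import math

def proportion_ci(successes, n, z=1.96):
    """Return the sample proportion, margin of error and (lower, upper) limits."""
    p_hat = successes / n
    moe = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat, moe, (p_hat - moe, p_hat + moe)

# 240 of 400 surveyed people say they are supporters; 95% confidence.
p_hat, moe, (lower, upper) = proportion_ci(240, 400)
print(round(p_hat, 2), round(moe, 3), round(lower, 3), round(upper, 3))
# 0.6 0.048 0.552 0.648
```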
Determining the appropriate sample size
Say we want to control the margin of error. In the above example, say we are not happy with a margin of
error of 4.8% and we would like this error to be reduced to 3% or 2% or whatever. By increasing the
sample size, we know that the margin of error can be reduced.
Since $\text{MOE} = z_{\alpha/2} \sqrt{\hat{p}(1 - \hat{p})/n}$, by rearranging this equation we can write n in terms of MOE as
$n = (z_{\alpha/2})^2 \, \hat{p}(1 - \hat{p}) / (\text{MOE})^2$
For the above example, suppose we are interested in a margin of error of 0.03. For a 95% CI, what should
be the sample size?
$n = (1.96)^2 \times (0.6 \times 0.4)/(0.03)^2 = 1024.4$, which we round up to 1025
Note that this sample size is quite close to 1000. This is the reason that most television polls are based
on 1000 responses. They try to make the margin of error as close to 3% as possible.
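A small Python sketch of the sample size calculation follows (the function name required_sample_size is my own). It rounds up, because a fraction of a respondent is not possible and rounding down would leave the margin of error slightly above the target.

```python
import math

def required_sample_size(p_hat, moe, z=1.96):
    """Smallest n whose margin of error for a proportion is at most moe."""
    return math.ceil(z ** 2 * p_hat * (1 - p_hat) / moe ** 2)

print(required_sample_size(0.6, 0.03))  # 1025
print(required_sample_size(0.6, 0.02))  # 2305
```

Going from a 3% to a 2% margin of error more than doubles the required sample size, because n grows with 1/(MOE)^2.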
Summary of the concepts in this chapter:
- One of the things we do in Inferential Statistics is Estimation of population parameters.
- Estimates can be point estimates or interval estimates.
- Point estimates are easy to obtain. Even a 2nd grader can obtain one. A point estimate is basically the
  same as the sample statistic.
- With point estimates, you cannot talk about the reliability or the degree or level of confidence.
- Interval estimates give a range of values within which we estimate the population parameter lies. When we
  estimate an interval, we do it for a certain level of confidence.
- When you see a 95% or a 99% confidence interval, we do not say that there is a 95% (or a 99%)
  probability that the population parameter lies in that interval. We say that we are 95%
  confident that the population parameter lies in that interval because if we generated 100
  confidence intervals using 100 different samples, about 95 of those intervals would contain the
  population parameter.
- For large sample sizes (>30) we use the =CONFIDENCE() function to get the margin of error. CI is
  basically the point estimate ± margin of error. Note that Excel does not have a function called
  margin of error or anything like that. The function =CONFIDENCE() does not really give the
  confidence interval but it gives the margin of error for computing the confidence interval.
- For small sample sizes (<30) you use Excel’s =CONFIDENCE.T() function to get the margin of error
  for the confidence interval.
- 95% is the most popular confidence level used by statisticians. Other important levels are 90%
  and 99%.
- For large sample size, for 95% CI, the formula is Point Estimate ± 1.96* std dev / sqrt(n)
- For large sample size, for 90% CI, the formula is Point Estimate ± 1.645* std dev / sqrt(n)
- For large sample size, for 99% CI, the formula is Point Estimate ± 2.575* std dev / sqrt(n)
- For small sample sizes, the formula involves reading the t-table. These formulas were not
  discussed in detail here; it is recommended that you use Excel’s =CONFIDENCE.T() function.
- For small sample size, the multiplier in the formula comes from the t-table with n minus 1
  degrees of freedom.
- Confidence Intervals of Proportions can also be estimated using z values.
- Sample size necessary to generate a given MOE can also be computed.