Chapter 4 Confidence Intervals

Recall that Statistics, the science of analyzing data, has two broad branches: Descriptive Statistics and Inferential Statistics. In Descriptive Statistics we describe a given set of data, i.e. we tell the story hidden behind a given set of data. In Inferential Statistics, we use sample data to make inferences about the population from which the sample was drawn. A population is the set of all entities of interest, whereas a sample is a subset of the population. For example, let’s say I am interested in the average age of all visitors at Disney’s Magic Kingdom in Orlando on Dec. 22nd, 2012. All the visitors on that day make up my population. But if I took a random sample of, say, 100 visitors that day, I would have a sample. If I find the ages of these 100 visitors in my sample, and use the sample average to make an inference about the average age of all visitors in my population, I am engaged in Inferential Statistics.

Estimating certain parameters of a population, such as the population average, is one of the things we do in Inferential Statistics. In this chapter, we will study how to estimate various parameters of a population. What kinds of parameters can we estimate about a population? We can estimate the average of a random variable of interest for our population of interest. We can estimate the standard deviation. We can also estimate the proportions of various categories in a population. For example, if we are interested in knowing what proportion of college students in a particular college are part-time students, we can estimate it using a sample.

Some jargon about population and sample and parameter and statistic

Before we go into much detail, let me introduce you to some terms frequently used in statistics. When we talk about a summary measure of a population, such as the population average or the population proportion or the population standard deviation, we are talking about a population parameter.
When we are talking about a summary measure of a sample, we are talking about a sample statistic. There is no such thing as a statistic of a population or a parameter of a sample. The term “parameter” goes with “population” and the term “statistic” (not statistics) goes with “sample”. So, the average of all values of a variable in a population is a parameter. The average of all values of a variable in a sample is a statistic. We use a sample statistic to make inferences about a population parameter.

One more very important thing about parameters and statistics: remember we learnt earlier that a variable is a random variable if its value depends upon chance? Now I will tell you something you should never forget. A sample statistic is a random variable. Why is it a random variable? Because the value of a sample statistic depends on what sample you happen to get, which depends upon chance. A population parameter, on the other hand, is not a random variable. Why is it not a random variable? Because a population is fixed: it contains data about ALL the entities, so there is nothing left to chance. A population mean is what it is. A population proportion is what it is. No matter how many times you measure it, you will get the same value. When a value is a random variable, you can talk about its probability of being in a certain range because it can assume any value described by a probability distribution. When a value is not a random variable, you cannot talk about its probability of being within a range. Its value is fixed and therefore has no probability distribution.

What good do Estimates do?

Making estimates about a population parameter is one of the most important things statisticians do. Let me tell you why estimation is important. A lot of business decisions depend upon estimates about the parameters of a population. For example, say I am a production manager at a factory that produces T-shirts.
Say I produce a million T-shirts a year. So an important production decision is how to distribute the production of one million T-shirts into various sizes such as Small, Medium, Large, X-Large, XX-Large etc. If I have information about the proportion of sizes of the customers in my population, that information will be very useful in my production decision. Since obtaining this information about the whole population is quite expensive, we rely on estimates using samples. What can we do to get the estimates? We can hire some people to sit in front of a clothing store, eyeball every customer that walks into the store in a week, and tally their shirt sizes. Based on just one week’s worth of data, which acts as our sample, we can make good estimates about the proportions of various shirt sizes in the entire population, which, in turn, will help in our production decision.

Please realize the importance of this information from a business point of view. If the information about the population is not accurate, then for certain sizes we will have surplus stock and for other sizes we will have shortages. Both situations are undesirable for our business. A surplus implies that we may have to put the surplus stock on clearance, and hence lose money. A shortage implies that we lost some potential business, and hence some money. So by hiring some people to collect data for a week and hiring a student of statistics, such as you, to make an inference about the population, the company can avoid some losses, even after paying all the people hired for the estimation task.

Two types of Estimates

There are two types of estimates: a point estimate and an interval estimate. A point estimate just gives you a single value, such as 12, whereas an interval estimate is a range of values, such as 10 to 14. I will first discuss point estimates and then interval estimates.

Point Estimate

Let’s say I am interested in knowing the proportion of part-time students at a University.
Asking every student their status is not very practical, so I want an estimate (not the exact proportion). A point estimate would be just one value, such as 20% part-time students. A point estimate, in our example, is relatively easy to compute. You simply get a random sample of students, say 100 students, and ask each one if they are full time or part time. If 20 of them said that they were part-time students, then your point estimate is 20% part-time students.

Let’s look at another example. Let’s say this time I am interested in knowing the average age of all part-time students at the same University. Again, I can ask a sample of part-time students their age. I could simply ask the 20 part-time students in the above example their age and compute the mean. Let’s say the mean of the sample of 20 part-time students comes out to be 28.4 years. So the point estimate for the average age of the entire population of part-time students at this university will be 28.4 years.

So what is so difficult about point estimates? All we did was get a sample, and whatever the sample mean (or sample proportion) was became the point estimate for the population mean (or population proportion). Really, there is nothing difficult about getting a point estimate. I don’t know why people have given such a bad name to Statistics. The only difficulty in getting a point estimate is getting a random set of 100 students and asking them their full-time status and their age. In a college course you will never be required to collect data in such fashion. You will be given the sample mean and you will be asked to give the point estimate of the population mean. Now how difficult is that? In fact, you don’t even have to be a college student to learn this concept. You can learn this concept in 2nd grade, because really, it is that simple.

Are Point Estimates Enough?

That’s a good question: are point estimates enough?
Well, if we think about it, to obtain a point estimate, we simply get a sample and whatever is the sample average (or proportion) becomes the estimate for the population average (or proportion). So the accuracy of a point estimate depends largely on the selection of your sample. If someone else is working with a different sample, they will get a different estimate, because no two samples are exactly alike. So if I have two different point estimates, which one is to be believed more? This is a legitimate concern people have when they see a point estimate. A point estimate by itself tells you nothing about how reliable it is. For a decision maker, the reliability of an estimate is important because, after all, the decision maker is making decisions based on estimates. So point estimates, although easy to obtain, are not sufficient. We will now discuss interval estimates, which take reliability into account.

Interval Estimates

Interval estimates allow us to specify a range of values in which we estimate the true value of a population parameter lies, with a certain degree of confidence. For example, we can make a statement like: we are 95% confident that the population mean lies between 10 and 14. We cannot make such statements of confidence (or reliability) for a point estimate. For example, we can never say that we are 95% confident that the true population mean is 12. In fact we are almost 100% sure that it is not exactly 12, because a mean is measured on a continuous scale and there is essentially no chance that it equals any one exact value. So, we cannot make any reliability statements about a point estimate but we can about interval estimates.

Before I give you an example of an interval estimate, let me test your intuition about something. Suppose I ask two students to go out and give me an estimate of the average age of all part-time students at a University.
Suppose both students collect a sample of students and based on their sample they both tell me that their sample mean is 28.4 years. These two students have never taken a course in statistics so they didn’t know how to do an interval estimate. When I asked them to give me an interval estimate, they both added and subtracted 2 from their point estimate and they both gave me an interval estimate of 26.4 to 30.4 years. So I asked them how large their sample was. It turns out the first student had a sample of 400 students, while the second one had a sample of 16 students. According to your intuition, which student’s interval estimate is more reliable, the first or the second? In other words, will you have more confidence in the first student’s estimate or the second, even though both estimates are identical? Even for the same estimate, it is possible to have different degrees of confidence. If you are thinking that you will have more confidence in the first student’s estimate, because that sample was much larger than the second student’s, you are absolutely right.

In fact, whenever someone gives you an interval estimate, they should also tell you the degree (or level) of confidence. The degree (or level) of confidence and the interval go together. For this reason these intervals are called confidence intervals. In the rest of this book, you will read about confidence intervals a lot. But now you know what the term means. Remember, an interval estimate without a specified confidence level is not very useful, just like point estimates were not quite useful because they lacked a measure of reliability.

So how to compute a confidence interval?

Let me now give you an example of how to compute a confidence interval for a population parameter. Whenever you are asked to give an interval estimate, you are also given the desired degree (or level) of confidence.
So let’s say we want an interval estimate for the average age of all part-time students at a university with 95% confidence. In other words, we want a 95% confidence interval for the average age of all part-time students at a given university. So I will go out and collect some sample data, i.e. I will randomly select some students. Suppose I collect data for 400 part-time students. Say the average age in the sample is 28.4 years and say the standard deviation is 4 years.

Now a confidence interval is really the point estimate ± a margin of error. Point estimate minus margin of error gives you the lower confidence limit and point estimate plus the margin of error gives the upper confidence limit. The two confidence limits (lower and upper) make up your confidence interval. We all know how to get a point estimate. It is so easy, even a second grader can give it to you. Whatever is the value of the sample statistic is the value of the point estimate of the population parameter. So to obtain a confidence interval, all we really need to learn is how to compute the margin of error. Using Excel, you can easily get the margin of error (using the function =CONFIDENCE()) and consequently the two confidence limits. Figure 1 shows you the calculations in Excel. Figure 2 shows the results of the calculations.

Figure 1: Calculations for Margin of Error and Confidence Intervals

Figure 2: Results of the Calculations for Margin of Error and Confidence Intervals

Let’s study Figure 1. The margin of error is calculated in cell B8 using the =CONFIDENCE() function. This function needs three things: alpha (in cell B6), standard deviation (in cell B3) and the sample size (in cell B4). Alpha is nothing but 1 minus the desired confidence level expressed as a fraction. So in cell B5, there is the desired confidence level expressed as 0.95, since we are interested in a 95% CI, so alpha is 0.05.
So the margin of error is 0.392 (see Figure 2), and the two confidence limits are 28.4 ± 0.392, or 28.008 and 28.792. So what statement can we make about this CI? We can say: “we are 95% confident that the true value of the population mean age of all part-time students at this university lies between 28.008 and 28.792 years.”

Let me also tell you a statement that many students of statistics would make about the confidence interval in this example, but which would be incorrect. It would be incorrect to say that there is a 95% chance (or probability) that the true population mean lies between 28.008 and 28.792. Some students don’t see any difference between these two statements and don’t see what is wrong with the second statement. It is very important to understand this difference. For example, if you meet a statistician and you make the second statement about a confidence interval, they will immediately know that you didn’t quite pay attention during your lesson on confidence intervals (it is also possible that they will not know you made a mistake because even they didn’t pay attention during their lesson on confidence intervals). On the other hand, if you make the first statement, they will be highly impressed that you understood your confidence interval lesson better than most college students.

The reason the second statement is incorrect is that it talks about the probability of a population parameter being in a certain range. Remember that a population parameter is not a random variable and hence we cannot make a statement about its probability. The first statement (we are 95% confident that the true value of the population mean lies between 28.008 and 28.792) makes no statement about the probability of the population parameter lying in a range.
What it is really saying is that if we were to take 100 samples of the same size, compute the mean and standard deviation of each, and use them to build 100 different intervals, then about 95 of those 100 intervals would be expected to contain the true population mean. That is the sense in which we are 95% confident that this interval contains the true population mean.

How did statisticians compute the confidence interval before there was Excel?

Before Excel, statisticians used a simple formula for computing the margin of error. Excel basically uses the same formulas, but since it does the work for you, we never get to see them. It is important that you understand the underlying formulas. For a 95% confidence interval and for a large sample size:

Margin of Error (or MOE) = 1.96 * stdev/sqrt(n)

Let’s apply this formula to obtain the margin of error in the above example. In our example, the stdev is 4 and n is 400, so the margin of error will be 1.96 * 4/sqrt(400) = 1.96 * 4/20 = 1.96/5 = 0.392, which is what we got using Excel. For smaller sample sizes (< 30) the formulas are different and will be explained later.

What about a confidence level other than 95%?

The most common confidence level used by statisticians is 95%. For example, whenever you hear on TV that a poll shows a certain percentage of support for a presidential candidate (say 42%), they also tell you that the margin of error is 3%. Essentially what they are telling you is that the interval estimate is 39% to 45%. They do not tell you the confidence level because most television viewers would not understand it. But TV polls almost always assume a 95% confidence level. You, as a statistics student, now know this little-known fact about TV polls. Just because a 95% confidence level is the most popular doesn’t mean you cannot compute an interval for a different confidence level. Other popular confidence levels you may use are 90% or 99%, or any other level that you desire.
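The large-sample margin of error formula can be cross-checked outside Excel. The following is a minimal Python sketch, not the chapter's own method: the function name margin_of_error is made up for illustration, and the multiplier is taken from the standard library's NormalDist instead of being read from a table.

```python
from math import sqrt
from statistics import NormalDist


def margin_of_error(std_dev, n, confidence=0.95):
    """Large-sample margin of error: multiplier * std_dev / sqrt(n).

    Hypothetical helper for illustration; assumes a large sample (n >= 30).
    """
    alpha = 1 - confidence
    # Multiplier: the standard normal value that leaves alpha/2 in the upper tail
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return z * std_dev / sqrt(n)


# Chapter example: sample mean 28.4, standard deviation 4, sample size 400
mean, std_dev, n = 28.4, 4, 400
moe = margin_of_error(std_dev, n, confidence=0.95)
lower, upper = mean - moe, mean + moe
print(f"MOE = {moe:.3f}, CI = [{lower:.3f}, {upper:.3f}]")
# prints: MOE = 0.392, CI = [28.008, 28.792]
```

Passing a different confidence value, such as 0.90 or 0.99, changes only the multiplier; the rest of the calculation stays the same.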
For a 90% CI, for a large sample size, Margin of Error is 1.645 * std. dev. / sqrt(n)
For a 99% CI, for a large sample size, Margin of Error is 2.575 * std. dev. / sqrt(n)

Note that in these formulas, the only difference is the multiplier (1.96 for a 95% CI, 1.645 for a 90% CI and 2.575 for a 99% CI).

For the above example, for a 90% CI, the MOE is 1.645 * 4/sqrt(400) = 1.645 * 4/20 = 1.645/5 = 0.329 and CI = [28.071, 28.729]
For a 99% CI, the MOE is 2.575 * 4/sqrt(400) = 2.575 * 4/20 = 2.575/5 = 0.515 and CI = [27.885, 28.915]

Figure 3: 90% and 95% Confidence Levels

Figure 3 shows these CIs using Excel. Note that Excel gives the same confidence intervals that we obtained using the formulas above. All we had to do was change cell B5 to 0.9 and 0.99. Using Excel, it is easy to get an interval for any confidence level, such as 98% or 92% or 88% or 99.5% etc. You simply change the value in cell B5 (in the above example) and you will get the new margin of error and the confidence interval. Using formulas, it becomes a little difficult, because we do not have readily available multipliers for every level. You have to read them from the z-table. The multipliers 1.96, 1.645 and 2.575 come from the z-table. I will not explain how to read it here.

In the above example, when we talked about the formulas, we said that they were good for large sample sizes. Since in our example the sample size was 400, which is considered quite large, we were able to use the formulas. But what if the sample size was small, say 16? Any sample size less than 30 is considered small.

How to compute confidence interval for a small sample?

When the sample size is small, the formulas require you to read a t-table. The multipliers come from the t-table. Instead of using the formulas and the t-table, you can simply use Excel’s =CONFIDENCE.T() function as shown in Figure 4. Please look at the formula in cell B8. Also, note that the margin of error is 2.131, instead of the 0.392 we got when the sample size was 400. The confidence interval is much wider.
It is [26.269 to 30.531] instead of [28.008 to 28.792]. This means that when the sample size became smaller, you needed a much wider interval to be 95% confident than when the sample size was large.

Figure 4: Using =CONFIDENCE.T() function to compute the confidence interval when sample size is small

The formula for margin of error explained

For a large sample size, for a 95% confidence interval, the margin of error is 1.96*std_dev/sqrt(n). There are two parts to this formula: the multiplier (1.96 in this case) and std_dev/sqrt(n). Let me first talk about the multiplier.

Why 1.96?

We use 1.96 in this formula because in a normal distribution, the area within +/- 1.96 standard deviations of the mean is 0.95. If we were interested in a 90% CI, the multiplier would be 1.645 because the area under a normal curve within +/- 1.645 standard deviations of the mean is 0.90. Similarly, for a 99% CI, the multiplier is 2.575.

Why std_dev/sqrt(n)?

This comes from the central limit theorem. The standard deviation of the distribution of sample means is std_dev/sqrt(n), where std_dev is the standard deviation (which we estimate with the sample standard deviation) and n is the sample size. You should be able to put these two concepts together to see how 1.96*std_dev/sqrt(n) gives an interval such that we are 95% confident that the population mean lies within this interval.

For a large sample, we can assume that the sampling distribution is normal, hence we can use the multipliers of 1.96, 1.645 and 2.575. But if the sample size is small, we cannot assume that the sampling distribution is normal. We assume that it has a t-distribution. Remember the t-distribution depends on the degrees of freedom, which in turn depend on the sample size. We get the multiplier from the t-table for degrees of freedom equal to n minus 1.

Confidence Intervals of Proportions

Just as you can estimate confidence intervals for a population mean, you can estimate confidence intervals for population proportions.
For example, you may want to know the proportion of Republican supporters in the population of a county, or the proportion of people in your market radius that buy your product/service, or the proportion of people in a population making over 100,000 per year. By just selecting a few hundred people you can estimate the confidence interval of the population proportion. The basic formula is the same, i.e.,

CI = point estimate +/- Margin of Error

The margin of error, for proportions, is given by the formula:

MOE = z_{α/2} * sqrt(p̂ * (1 - p̂)/n)

where n is the sample size, p̂ is the sample proportion, z is the standard normal random variable and α is 1 minus the confidence level.

Example

Suppose we are interested in the proportion of Republicans in a country. Say you survey a sample of 400 people and ask if they are Republican supporters or not, and suppose you find that 240 of them say they are Republican supporters. Suppose I am interested in a 95% confidence interval. In this example,

α is 0.05, so z_{α/2} = 1.96
p̂ is 240/400 = 0.60
n = 400

So, the margin of error is 1.96 * sqrt(0.6 * 0.4/400) = 0.048
So, the confidence interval is 0.6 ± 0.048 = [0.552, 0.648]

Determining the appropriate sample size

Say we want to control the margin of error. In the above example, say we are not happy with a margin of error of 4.8% and we would like this error to be reduced to 3% or 2% or whatever. We know that the margin of error can be reduced by increasing the sample size. Since

MOE = z_{α/2} * sqrt(p̂ * (1 - p̂)/n)

by rearranging the above equation, we can write n in terms of MOE as

n = (z_{α/2})^2 * p̂ * (1 - p̂)/(MOE)^2

For the above example, suppose we are interested in a margin of error of 0.03. For a 95% CI, what should be the sample size?

n = (1.96)^2 * (0.6 * 0.4)/(0.03)^2 = 1024.4, or 1025 after rounding up

Note that this sample size is quite close to 1000. This is one reason so many television polls are based on about 1000 responses. They try to make the margin of error as close to 3% as possible.
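The proportion interval and the sample-size calculation above can be sketched in Python as a check on the arithmetic. This is an illustrative sketch only: the helper names z_multiplier, proportion_ci and required_sample_size are hypothetical, and the multiplier comes from the standard library's NormalDist, so it is slightly more precise than the rounded 1.96.

```python
from math import ceil, sqrt
from statistics import NormalDist


def z_multiplier(confidence):
    """Standard normal multiplier z_{alpha/2} for the given confidence level."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)


def proportion_ci(successes, n, confidence=0.95):
    """Return (lower, upper, margin_of_error) for a population proportion."""
    p_hat = successes / n
    moe = z_multiplier(confidence) * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - moe, p_hat + moe, moe


def required_sample_size(p_hat, target_moe, confidence=0.95):
    """Smallest whole n that gives at most target_moe, from the rearranged formula."""
    z = z_multiplier(confidence)
    return ceil(z ** 2 * p_hat * (1 - p_hat) / target_moe ** 2)


# Chapter example: 240 supporters out of 400 respondents, 95% confidence
lower, upper, moe = proportion_ci(240, 400)
print(f"MOE = {moe:.3f}, CI = [{lower:.3f}, {upper:.3f}]")
# prints: MOE = 0.048, CI = [0.552, 0.648]

# Sample size needed to bring the margin of error down to 0.03
n_needed = required_sample_size(0.6, 0.03)
print(n_needed)  # prints: 1025
```

The sample size is rounded up rather than to the nearest whole number, because rounding down would leave the margin of error slightly above the target.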
Summary of the concepts in this chapter:

- One of the things we do in Inferential Statistics is estimation of population parameters.
- Estimates can be point estimates or interval estimates.
- Point estimates are easy to obtain. Even a 2nd grader can obtain one. It is basically the same as the sample statistic.
- With point estimates, you cannot talk about the reliability or the degree or level of confidence.
- Interval estimates give a range of values within which we estimate the population parameter lies. When we estimate an interval, we do it for a certain level of confidence.
- When you see a 95% or a 99% confidence interval, we do not say that there is a 95% (or a 99%) probability that the population parameter lies in that interval. We say that we are 95% confident that the population parameter lies in that interval, because if we generated 100 confidence intervals using 100 different samples, about 95 of those intervals would contain the population parameter.
- For large sample sizes (30 or more) we use the =CONFIDENCE() function to get the margin of error. The CI is basically the point estimate ± margin of error. Note that Excel does not have a function called margin of error or anything like that. The function =CONFIDENCE() does not really give the confidence interval; it gives the margin of error for computing the confidence interval.
- For small sample sizes (<30) you use Excel’s =CONFIDENCE.T() function to get the margin of error.
- 95% is the most popular confidence level used by statisticians. Other important levels are 90% and 99%.
- For large sample size, for 95% CI, the formula is Point Estimate ± 1.96 * std dev / sqrt(n)
- For large sample size, for 90% CI, the formula is Point Estimate ± 1.645 * std dev / sqrt(n)
- For large sample size, for 99% CI, the formula is Point Estimate ± 2.575 * std dev / sqrt(n)
- For small sample sizes, the formula involves reading the t-table. These formulas were not discussed here; it is recommended that you use Excel’s =CONFIDENCE.T() function.
- For small sample size, the multiplier in the formula comes from the t-table with n minus 1 degrees of freedom.
- Confidence Intervals of Proportions can also be estimated using z values.
- Sample size necessary to generate a given MOE can also be computed.
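For readers without Excel or a t-table, the small-sample multiplier can be approximated numerically. The sketch below is an assumption-laden illustration, not how Excel's =CONFIDENCE.T() works internally: the function names are hypothetical, the t-distribution is integrated with Simpson's rule, and the multiplier is found by bisection. It assumes the same standard deviation of 4 with a sample of 16, which is consistent with the margin of error of 2.131 reported in the chapter.

```python
from math import exp, lgamma, pi, sqrt


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_coef = lgamma((df + 1) / 2) - lgamma(df / 2)
    return exp(log_coef) / sqrt(df * pi) * (1 + x * x / df) ** (-(df + 1) / 2)


def t_cdf(x, df, steps=2000):
    """P(T <= x) for x >= 0, integrating the density on [0, x] with Simpson's rule."""
    h = x / steps
    total = t_pdf(0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 + total * h / 3


def t_multiplier(confidence, df):
    """Find t such that P(-t <= T <= t) equals the confidence level (bisection)."""
    target = 1 - (1 - confidence) / 2
    low, high = 0.0, 100.0
    for _ in range(60):
        mid = (low + high) / 2
        if t_cdf(mid, df) < target:
            low = mid
        else:
            high = mid
    return (low + high) / 2


# Small-sample example: mean 28.4, standard deviation 4 (assumed), n = 16
mean, std_dev, n = 28.4, 4, 16
t_mult = t_multiplier(0.95, df=n - 1)  # degrees of freedom = n minus 1
moe = t_mult * std_dev / sqrt(n)
lower, upper = mean - moe, mean + moe
print(f"t = {t_mult:.3f}, MOE = {moe:.3f}, CI = [{lower:.3f}, {upper:.3f}]")
# prints: t = 2.131, MOE = 2.131, CI = [26.269, 30.531]
```

With 15 degrees of freedom the multiplier is about 2.131 instead of 1.96, which is why the small-sample interval is so much wider.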
