Statistics 215 Lab Materials

Confidence Intervals on mu

This topic officially moves us from probability to statistics. We begin to discuss making inferences about the population. One way to differentiate probability from statistics is to think of probability as the process of making an inference about a subset of the population (a sample) when we know the attributes of the entire population. This was the case in the previous chapters. We (pretended that we) knew the entire population or the entire distribution. We were then able to discuss probabilities for how a single observation might behave, as well as how the average of several observations might behave. Thus in probability we have the whole population and we want to know about an observation or a group of observations.

In statistics this process reverses. We have a sample, a collection of observations, from the population. From this subset we want to be able to make statements about the population.

Background

The ideas presented in this section depend heavily on concepts from the previous chapters. Specifically, the idea of variability from sample to sample is crucial. As we saw previously, each time we take a sample we get a different value for the sample mean, and these values differ from the population mean. One consequence is that the sample mean alone is not the best estimate of the population mean, because the sample mean fluctuates: each sample gives us a different value. The idea of a confidence interval is that instead of using a single number, we use an interval, a range of numbers, to estimate the population mean.

Definitions and preliminaries

Definition: A parameter is a numerical quantity that summarizes a characteristic of the entire population.

Definition: A statistic is a numerical quantity that summarizes a characteristic of a subset of the population; usually this subset is a sample.
Example: We differentiate here between µ and σ, which are parameters, and $\bar{x}$ and $s_x$, which are statistics. Recall that µ is the population mean and σ is the population standard deviation, while $\bar{x}$ is the sample mean and $s_x$ is the sample standard deviation. µ and σ are numerical summaries for the entire population. $\bar{x}$ and $s_x$ are calculated from a subset of the population.

Definition: A point estimate of a parameter is a single number estimate of that parameter.

Definition: An interval estimate of a parameter is a range of numbers used to estimate a parameter.

Example: Suppose that we are interested in estimating the mean weight of all black bears in West Virginia. We are able to weigh 38 black bears. From the 38 observations, we want to make statements about the entire population. That is, we want to estimate the average weight of the population of black bears in West Virginia. A point estimate for this parameter, the mean weight of all black bears in WV, is 457 pounds. An interval estimate would be that the mean weight of all black bears is between 428 and 497 pounds. It is important to note that the mean weight does not change. Rather, it is our knowledge that is imperfect. We have only 38 bears from the entire population, and as a consequence we do not know the weight of all bears.

That idea is worth reiterating. The need for interval estimates comes from the fact that we do not have all of the information that we want about the parameter (in the previous example, the population mean). That is, we have a subset of the population, a sample, not the entire population. We know that the sample mean varies and that it is often not the same as the population mean. Consequently, we cannot use the sample mean alone to estimate the population mean. What we will do instead is specify a range of values that are plausible based on our sample.
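The claim that the sample mean varies from sample to sample can be seen in a short simulation. The sketch below is only an illustration: the population mean (450 pounds) and standard deviation (60 pounds) are invented numbers, and the Normal shape of the population is an assumption, not something taken from real bear data. Only the sample size of 38 comes from the example above.

```python
import random
import statistics

rng = random.Random(38)  # fixed seed so the run is repeatable

# Hypothetical population of weights in pounds. These values are invented
# for the sketch; they are NOT estimates for real black bears.
POPULATION_MEAN = 450.0
POPULATION_SD = 60.0
SAMPLE_SIZE = 38  # same sample size as the bear example


def sample_mean():
    """Draw one sample of SAMPLE_SIZE weights and return its sample mean."""
    sample = [rng.gauss(POPULATION_MEAN, POPULATION_SD) for _ in range(SAMPLE_SIZE)]
    return statistics.mean(sample)


# Repeat the sampling process five times and compare each mean to the
# population mean, which we only know because we made the population up.
sample_means = [sample_mean() for _ in range(5)]
for i, m in enumerate(sample_means, start=1):
    print(f"sample {i}: mean = {m:.1f}  (differs from mu by {m - POPULATION_MEAN:+.1f})")
```

Each of the five samples should give a different sample mean, and none is likely to equal the population mean exactly, which is why a single number is a fragile estimate.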
It is worth noting that unless we specify a range of values that goes from negative infinity to positive infinity, we can never guarantee that the population mean will be in the interval estimate for that mean. No method will ‘capture’ the population mean (have it inside the interval) for every sample; however, statistical methods can specify the percentage of the time that an interval will ‘miss’ the parameter of interest.

When we use the word ‘capture’ in this topic it has a special meaning. It is important to remember that the value of the population mean, or of any parameter, is constant. Consequently, when we are considering an interval estimate, if the parameter is inside the interval we will consider the parameter’s value to be ‘captured’ by the interval. For the parameter to be captured, it must fall between the endpoints of the interval; an interval will have an upper and a lower endpoint.

Definition: A (1-α)*100% confidence interval for a parameter is an interval estimate produced by a procedure that, when the process of taking a sample and making a confidence interval from that sample is repeated, captures the parameter (1-α)*100% of the time.

Definition: The confidence level for a confidence interval is the percentage of times that it will capture the parameter of interest in repeated sampling.

Example: The confidence level for a 95% confidence interval is 0.95. The confidence level for an 84% confidence interval is 0.84. The confidence level is often denoted by (1-α)*100%. The reason for this is to allow each researcher to specify his or her own level of confidence. α = 0.05 yields a confidence level of 0.95, while α = 0.10 yields a confidence level of 0.90.

The definition of a confidence interval needs some explanation (maybe plenty of explanation). Each confidence interval is calculated from a sample. This sample is a subset of the population. Previously, we saw that each sample of n observations was different.
The idea of repeating the process of taking a sample, described in the definition above, is just that: each sample will be different. As we will see shortly (when we talk about calculations), since each sample is different, each confidence interval will be different. Because of the variability from sample to sample, some of the confidence intervals that we make will not ‘capture’ the parameter they are trying to estimate.

The difficulty with confidence intervals is this: we DON’T get to know the value of the parameter we are trying to estimate, so we don’t know which intervals capture the parameter and which don’t. Remember that the parameter is a quantity calculated from the population. In statistics, we only see a subset of the population, so we cannot know the value of the parameter. Thus, we will make a confidence interval and we will not know whether the parameter has been captured. Instead we must be content to know that the procedure works a certain percentage of the time, (1-α)*100% to be exact. The procedure begins with selecting units to be sampled, getting information from those units, and then making our calculations. However, we don’t know if the confidence interval that we have created is one of the (1-α)*100% of times that we ‘captured’ the parameter or one of the α*100% of times that we did not.

Consider this more concrete example. A 95% confidence interval is made for the mean height of Yellow Poplars in West Virginia. This interval goes from 85.68 feet to 94.39 feet. It is based upon a sample of 56 trees taken from around the state. We say that we are 95% confident that the mean height of all Yellow Poplars in WV is between 85.68 and 94.39 feet. However, since we don’t know the population mean, we cannot say with absolute certainty that it is between these numbers.

So why, if I can’t say anything with certainty using a confidence interval, would I use it at all?
The answer is simply that by using statistics we can accurately tell the percentage of times the process will fail. No other methodology allows you to specify that percentage. Statistics allows for this, but forces a layer of uncertainty into the discourse.

Confidence Intervals on mu (Small Sample Size)

When the number of observations in the sample is large (at least 30 observations), we can use the Central Limit Theorem to help us construct a confidence interval on the unknown value of the population mean mu. For large samples the CLT tells us that the distribution of the possible values of the sample mean is approximately normal. We can use the standard normal distribution to find the critical value (z) that is part of the (margin of) error term.

Let’s start with a 95% confidence interval for the population mean. 95% is a common choice for the confidence level. Note that

\[ 0.95 = P(-1.96 < Z < 1.96) = P\left(-1.96 < \frac{\bar{X} - \mu}{\sigma_x / \sqrt{n}} < 1.96\right) \]

Some algebra later:

\[ = P\left(\mu - 1.96\,\frac{\sigma_x}{\sqrt{n}} < \bar{X} < \mu + 1.96\,\frac{\sigma_x}{\sqrt{n}}\right) \]

The above expression is a statement of probability about $\bar{X}$, if we know the values for µ and $\sigma_x$. This follows what we did in the previous chapters when we pretended that we knew the value of µ or $\sigma_x$ or m or p. Some more algebra later, we can turn this into an interval for µ:

\[ P\left(\bar{X} - 1.96\,\frac{\sigma_x}{\sqrt{n}} < \mu < \bar{X} + 1.96\,\frac{\sigma_x}{\sqrt{n}}\right) = 0.95 \]

This expression provides the endpoints of the CI:

\[ \bar{X} - 1.96\,\frac{\sigma_x}{\sqrt{n}} < \mu < \bar{X} + 1.96\,\frac{\sigma_x}{\sqrt{n}} \]

which we can write more succinctly (and in a more general fashion) as

\[ \bar{X} \pm z(1-\alpha/2)\,\frac{\sigma_x}{\sqrt{n}} \]

But we don’t know the value of $\sigma_x$. When the value of $\sigma_x$ is unknown (always the case in the real world), we can substitute our estimate $s_x$ for $\sigma_x$. Our CI is computed as

\[ \bar{X} \pm t(n-1,\,\alpha/2)\,\frac{s_x}{\sqrt{n}} \]

Because we have computed $\bar{x}$ and $s_x$ (and not µ and $\sigma_x$), we can still construct a confidence interval on the unknown value of mu; however, there are two direct results of this substitution:
1. The critical value in the error term no longer comes from a normal distribution (z); it comes from something called a t-distribution.
2. We no longer have a statement of probability; we have a statement of confidence.

The t-distribution

The t-distribution is similar to the z distribution or ‘standard normal’ distribution. It is based upon taking a sample of n observations from a Normal distribution with mean µ and standard deviation σ. A random variable T will have a t-distribution with n-1 degrees of freedom if

\[ T = \frac{\bar{X} - \mu}{s_x / \sqrt{n}} \]

The t-distribution is defined by the number of degrees of freedom, much like the Poisson was indexed by λ. Degrees of freedom is a parameter just as λ was for the Poisson. The mean of a t random variable is 0. This distribution is symmetric and unimodal, but it has slightly more variability than the Normal distribution.

We will use the percentiles of the t-distribution quite frequently throughout the rest of the course. Consequently, we have specific notation for them. The kth percentile for a t-distribution with q degrees of freedom, with k written as a proportion, will be denoted by t(q, 1-k). The second argument is the area to the right of the percentile.

Example: t(25, 0.05) would be the 95th percentile of a t random variable with 25 degrees of freedom. t(38, 0.10) would be the 90th percentile of a t random variable with 38 degrees of freedom.

For calculating these percentiles we use Table 4, Page 516. This table has the degrees of freedom in the first column and percentiles in the other columns. The book uses α to represent the area to the right of the percentile; thus if we want the 90th percentile from the table, we need to look in the column with α = 0.10. Likewise the 99th percentile can be found in the column that is designated by α = 0.01.

Example:
t(14, 0.05) = 1.761
t(25, 0.01) = 2.485
t(30, 0.10) = 1.310

Table 4 does not contain all possible values for degrees of freedom. For example, if the degrees of freedom is 30 or more, then you would use the Standard Normal table (Table 3) to estimate the corresponding t-value.
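To get a feel for how close the Standard Normal stand-in is, the sketch below compares z percentiles with the three t-values quoted above. It uses `statistics.NormalDist` from the Python standard library to compute the z-values; the t-values are simply copied from the text rather than computed, and the helper name `z_upper` is made up for this illustration.

```python
from statistics import NormalDist

# t-values quoted in the text, keyed by (degrees of freedom, area to the right).
t_table = {(14, 0.05): 1.761, (25, 0.01): 2.485, (30, 0.10): 1.310}

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1


def z_upper(alpha):
    """Return the z-value that has area alpha to its right."""
    return std_normal.inv_cdf(1 - alpha)


for (df, alpha), t_value in sorted(t_table.items()):
    z_value = z_upper(alpha)
    print(
        f"df={df:2d}  alpha={alpha:.2f}  t={t_value:.3f}  "
        f"z={z_value:.3f}  t - z = {t_value - z_value:.3f}"
    )
```

In each row the t-value is larger than the matching z-value, which reflects the extra variability of the t-distribution, and the gap at 30 degrees of freedom is small.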
Example: Given a sample of 14 observations from a distribution that is known to be Normally distributed, construct a 99% confidence interval on the unknown value of the population parameter mu. From the data we have calculated $\bar{x} = 4.127$ and $s_x = 0.358$. We can use the formula

\[ \bar{X} \pm t(n-1,\,\alpha/2)\,\frac{s_x}{\sqrt{n}} \]

since there are more than 2 observations and the data come from a Normal distribution.

First, df = n-1 = 14-1 = 13 and α = 1 − C.L. = 1 − 0.99 = 0.01. So α/2 = 0.01/2 = 0.005. Then

\[
\begin{aligned}
\bar{X} \pm t(n-1,\,\alpha/2)\,\frac{s_x}{\sqrt{n}}
 &= 4.127 \pm t(13,\,0.005)\,\frac{0.358}{\sqrt{14}} \\
 &= 4.127 \pm 3.012\left(\frac{0.358}{\sqrt{14}}\right) \\
 &= 4.127 \pm 3.012 \times 0.096 \\
 &= 4.127 \pm 0.28915
\end{aligned}
\]

The endpoints of our confidence interval on mu are (3.838, 4.416).

So a 99% confidence interval for the population mean goes from 3.838 to 4.416. We interpret this by saying that we are 99% confident that the population mean’s value falls within the interval 3.838 to 4.416.

Confidence Intervals on mu (Large Sample Size)

When the sample size is large (at least 30 observations), a t-distribution with n-1 d.f. is virtually identical to the Standard Normal distribution. So we can obtain our critical value from the Standard Normal distribution table instead of the t-table. In the large sample situation, the formula for a (1-α)*100% confidence interval on mu becomes

\[ \bar{X} \pm z(1-\alpha/2)\,\frac{s_x}{\sqrt{n}} \]

The t-value in the error term is replaced by a z-value from a Standard Normal distribution. All else remains the same.

The formula above can be used to construct a (1-α)*100% confidence interval (CI) for the population mean when

1. n (the number of observations) is more than 2 and the original data (the values of the variable X) are approximately Normal, or
2. n is at least 30 (n≥30) (and we don’t know what distribution the data came from).

Example: Suppose that we want to estimate the mean of a population. We have a sample of 48 observations from the population.
The mean of these observations is 290.34 and the standard deviation of these observations is 41.22. A 95% confidence interval for the population mean is computed as follows. We can use the formula below since n>30.

\[ \bar{X} \pm z(1-\alpha/2)\,\frac{s_x}{\sqrt{n}} = 290.34 \pm 1.96 \times \frac{41.22}{\sqrt{48}} = 290.34 \pm 11.66 = (278.68,\ 302.00) \]

(278.68, 302.00) is mathematical notation for an interval that goes from 278.68 to 302.00. Thus a 95% confidence interval for the mean goes from 278.68 to 302.00. We interpret this by concluding that we are 95% confident the population mean is between 278.68 and 302.00.

Confidence instead of probability

When we are dealing with parameters such as µ or σ, we are dealing with fixed quantities. As a consequence, when we make a statement such as "the mean is between 8.5 and 19.4", that statement is either true or false. The mean is either in that interval or it is outside of it. This has implications for our interpretation of a confidence interval. If we create a 95% confidence interval for µ from, say, 175.46 to 176.32, then P(175.46 < µ < 176.32) ≠ 0.95. This probability, P(175.46 < µ < 176.32), is either 0 or 1. The value of the population mean is either inside the interval or it is not inside the interval.

The confidence that we assert comes with repetition of the process of taking samples and calculating the confidence interval for each sample. However, for any individual interval we do not know whether the mean is inside the interval or outside the interval. What we do know is that if we repeated the process of collecting samples and making 95% confidence intervals for the population mean from each sample, then 95% of those intervals would capture the population mean.

Summary: The basic form of a confidence interval for a population parameter is as follows:

Point estimate ± percentile from a sampling distribution * standard error.

The point estimate is the best single number estimate for the parameter.
The standard error is an estimate of the variability from sample to sample for the point estimate. The percentile that is used is based upon the confidence level that we want to use, and the sampling distribution is determined by the type of parameter that we are estimating.

(1-α)*100% Large Sample CI on mu:

\[ \bar{X} \pm z(1-\alpha/2)\,\frac{s_x}{\sqrt{n}} \]

(1-α)*100% Small Sample CI on mu:

\[ \bar{X} \pm t(n-1,\,\alpha/2)\,\frac{s_x}{\sqrt{n}} \]

The situation where σ is known NEVER happens in real world research. We skipped this section in the lecture notes due to this fact.
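The summary form (point estimate ± percentile * standard error) and the repeated-sampling meaning of confidence can both be checked with a few lines of Python. This is a minimal sketch, not part of the lab materials: the function names `mean_ci` and `coverage` are made up here, the critical values 3.012 and 1.96 are the table values used in the two worked examples, and the population in the simulation (Normal with mean 100 and standard deviation 15) is invented purely for illustration.

```python
import math
import random
import statistics


def mean_ci(xbar, s, n, critical_value):
    """Return (lower, upper): point estimate +/- critical value * standard error."""
    margin = critical_value * s / math.sqrt(n)
    return xbar - margin, xbar + margin


# Small-sample example from the text: n = 14, 99% level, t(13, 0.005) = 3.012.
small_ci = mean_ci(4.127, 0.358, 14, 3.012)

# Large-sample example from the text: n = 48, 95% level, z = 1.96.
large_ci = mean_ci(290.34, 41.22, 48, 1.96)

print("small-sample CI: (%.3f, %.3f)" % small_ci)
print("large-sample CI: (%.2f, %.2f)" % large_ci)


def coverage(mu, sigma, n, critical_value, repetitions, seed=215):
    """Fraction of repeated samples whose interval captures the pretend-known mu."""
    rng = random.Random(seed)
    captured = 0
    for _ in range(repetitions):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        lower, upper = mean_ci(
            statistics.mean(sample), statistics.stdev(sample), n, critical_value
        )
        if lower < mu < upper:
            captured += 1
    return captured / repetitions


# Hypothetical population (mu = 100, sigma = 15), chosen only for illustration.
rate = coverage(mu=100.0, sigma=15.0, n=48, critical_value=1.96, repetitions=2000)
print("fraction of intervals that captured mu: %.3f" % rate)
```

Because the code does not round the standard error to 0.096 along the way, its small-sample endpoints can differ from the hand calculation in the third decimal place. The simulated capture rate should land near 0.95, but any single interval in the simulation either captured mu or did not, which is the distinction between confidence and probability drawn above.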
