Chapter 6 Part 4
Confidence Intervals
October 21, 2008

Goal: To clearly understand the link between probability distributions and confidence intervals.

Skills:
    Be able to calculate a 100(1 - α)% confidence interval based on a sample mean, both for the case that the population variance is known and the case that it is not known.
    Be able to accurately interpret a confidence interval.

Contents:
    Central Limit Theorem
    Confidence interval using the normal distribution
        Formula
        What impacts the length of a CI
    Stata commands: invnormal

Usually we study samples from a population rather than the population itself because it is not possible to get our hands on the whole population (e.g. it is too big, the process is too costly, frequently some of the members of the population we are interested in haven't even been born yet). We have agreed that when possible, we should select a random sample. We also know that when we select a random sample of size n for a study, it is just one of many possible samples of size n that could have been selected from the population.

Suppose we want to know the average fasting triglycerides of the entire population of the U.S. that is 55 years old or older (55+). Some of the reasons why we'll have to select a sample would be: 1) usually the whole population is simply not available (e.g. the ALLHAT investigators were hoping that the results of their study would apply not only to those who were 55+ at the time of entry into the study but also to those who will later become 55+) and 2) even in cases where the population is available (an unusual case) the cost and time involved to study a whole population tend to be prohibitive. So we've decided to select a random sample from the population and use the mean of the fasting triglycerides of that sample to estimate the mean of the entire population.
What we learned earlier when studying the sampling distribution of means is the following:

Let $X$ be a random variable representing the distribution of the fasting triglycerides in the population of people aged 55+. Let $\mu_X$ represent the population mean for the fasting triglycerides and $\sigma_X^2$ the population variance for the fasting triglycerides. Then we usually denote the random variable representing the sampling distribution of means of samples of size $n$ by $\bar{X}$, the mean of the population of means by $\mu_{\bar{X}}$ and the variance of the population of means by $\sigma_{\bar{X}}^2$. For samples of size $n$ we have the following (facts 1 and 2 hold for any $n$; fact 3 is where $n$ must be large enough):

1) $\mu_{\bar{X}} = \mu_X$ (Fact 1 from before). The mean of the original distribution is equal to the mean of the sampling distribution.

2) $\sigma_{\bar{X}}^2 = \sigma_X^2 / n$, where $\sigma_{\bar{X}} = \sigma_X / \sqrt{n}$ is called the standard error of the mean (SEM) - Fact 2 from before. Note that $\sigma_X$ refers to variation among the individual values, as seen within a single sample, and $\sigma_X / \sqrt{n}$, the SEM, refers to variation among the means of different samples.

3) We also noticed that the larger the sample size $n$ got, the more the distribution of those sample means looked like a normal distribution.

The Central Limit Theorem states the following: Given the notation we have used above for the original population of fasting triglycerides and the notation for the sampling distribution of means, for large $n$, $\bar{X}$ is approximately normally distributed with mean $\mu_{\bar{X}} = \mu_X$ (Fact 1) and variance $\sigma_{\bar{X}}^2 = \sigma_X^2 / n$ (Fact 2), regardless of the distribution of $X$. If $X$ is distributed normally, then $\bar{X}$ is also distributed normally (as opposed to approximately normally).

Now our problem is: how do we know if the sample mean is a good estimate of the population mean? Let us say that the graph below is the distribution of the means for fasting triglycerides (AFTRIG) of all samples of size $n$ from the U.S. population of those aged 55+.
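The Central Limit Theorem statement above can be checked with a small simulation. What follows is only a minimal sketch in Stata, meant to be run as a do-file, and it is not part of the original notes: the exponential population with mean 100 (so that μ = 100 and σ = 100), the sample size n = 50, the 5,000 simulated samples and the seed are all assumptions chosen for illustration, not the AFTRIG data.

```stata
* Sketch: sampling distribution of the mean for a skewed population.
* Assumed for illustration only: exponential population with mean 100
* (mu = 100, sigma = 100), n = 50 per sample, 5,000 simulated samples.
clear
set seed 2008
set obs 5000                          // one row per simulated sample
local n = 50

generate double total = 0
forvalues i = 1/`n' {
    * add one exponential draw (inverse-CDF method) to every sample's total
    quietly replace total = total - 100*ln(runiform())
}
generate double xbar = total/`n'      // the mean of each simulated sample

summarize xbar                        // compare Mean with mu, Std. Dev. with the SEM
display "theoretical SEM = " 100/sqrt(`n')
histogram xbar, normal                // overlay a normal density for comparison
```

If the theorem holds, summarize should report a mean of xbar close to 100 and a standard deviation close to the theoretical SEM, and the histogram should look roughly bell-shaped even though the exponential population itself is strongly skewed.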
Looking at the histogram of the sampling distribution below we would probably be willing to say that the means represented by the bar on the far right end (the bar with square dots) of the distribution are not good estimates for the mean of the distribution of the original AFTRIG values because they are probably not what we would be willing to call "close" to the mean of the distribution of sample means (i.e. $\mu_{\bar{X}}$). But what about the means represented by the striped bar in the graph below? This is where our problems begin. We are clearly going to need some sort of measure of how certain we are that the mean of our sample is a reasonable estimate of the population mean. This is where confidence intervals come in.

Confidence intervals are going to be defined such that given a 95% confidence interval, we will be 95% confident that $\mu_{\bar{X}}$ (and hence $\mu_X$) lies within our interval. So in obtaining a 95% confidence interval for $\mu_{\bar{X}}$, we will have also obtained an interval for the original population mean $\mu_X$.

Just as we have only one sample and one sample mean, we will have only one confidence interval based on that sample and its mean. If, however, we had all possible samples, we could get a confidence interval for the mean of each sample. Then the interpretation of the 95% confidence interval is that 95% of these intervals contain the original population mean ($\mu_X$). Looking at the graph below of the confidence intervals, we notice that 3 of the intervals (the dashed ones) do not contain the population mean. The very top confidence interval does not contain the mean because confidence intervals will be defined as open intervals (i.e. intervals that do not contain their endpoints). The other two dashed confidence intervals don't even come particularly close to the mean.

[Figure: 95% CIs for the sample means assuming we know σ, with a reference line at $\mu_{\bar{X}} = \mu_X$. Each interval is centered about a sample mean. Each interval is the same length because σ is known.]

The intervals are all of the same length because (as we will show) the length of each interval depends on the sample size $n$ (remember all samples from the sampling distribution have the same size) and on the size of $\sigma$ when $\sigma$ is known. We'll show later that when $\sigma$ is not known, we can calculate the confidence interval using the sample estimate of $\sigma$, namely $s$. In this case the lengths of the intervals will vary as $s$ varies from sample to sample.

There are actually 3 kinds of intervals that we can use: prediction, confidence and tolerance intervals. We won't do much with prediction and tolerance intervals until we get to regression, but I will describe all three kinds of intervals here.

This example is taken from Forthofer and Lee's (2007) book Biostatistics. Dairies add vitamin D to milk for the purpose of fortification. The recommended amount of vitamin D to be added to a quart of milk is 400 IUs (10 μg). If a dairy adds too much vitamin D, perhaps over 5000 IUs, the amount of vitamin D could be toxic.

A prediction interval focuses on a single observation of the variable - for example, the amount of vitamin D in the next bottle of milk.

A confidence interval focuses on a population parameter - for example, the mean or median of vitamin D in a population of bottles of milk. Thus, the prediction interval is of more interest to the consumer of the next bottle of milk, whereas the confidence interval is of more interest to the dairy.

A tolerance interval provides limits such that there is a high level of confidence that a large portion of the values of the variable will fall within them. For example, besides being interested in the mean, the dairy owner or regulatory agency also wants to be confident that for a large portion of the bottles the vitamin D contents are within a specified tolerance of the value of 400 IUs.

So back to confidence intervals.
The picture of the confidence intervals above is a nice graphic, but how do we actually calculate the confidence interval for our sample mean?

Confidence Intervals

Below we give the confidence interval for the random variable $\bar{X}$ under the conditions that the random variable $\bar{X}$ is normally distributed, has an unknown mean $\mu_{\bar{X}}$ and has a known variance $\sigma_{\bar{X}}^2$. It is not usually the case that we know $\sigma_{\bar{X}}^2$ but we present this simplest version of the confidence interval first.

So let $\bar{X}$ be the random variable associated with the sampling distribution of samples of size $n$ drawn from the distribution with random variable $X$. The Central Limit Theorem says: for $n$ large enough

$$\bar{X} \approx N\left(\mu_{\bar{X}}, \sigma_{\bar{X}}^2\right) \quad \text{where } \mu_{\bar{X}} = \mu_X \text{ and } \sigma_{\bar{X}}^2 = \frac{\sigma_X^2}{n}$$

(regardless of the distribution of $X$). [$\approx$ means approximately.]

[Figure: density for $\bar{X} \approx N(\mu_{\bar{X}} = \mu_X,\ \sigma_{\bar{X}}^2 = \sigma_X^2/n)$. The central 95% of the area lies between $\mu_{\bar{X}} - 1.96\sigma_{\bar{X}}$ and $\mu_{\bar{X}} + 1.96\sigma_{\bar{X}}$, with 2.5% in each tail; the axis is marked at $\mu_{\bar{X}}$, $\mu_{\bar{X}} \pm 1\sigma_{\bar{X}}$ and $\mu_{\bar{X}} \pm 3\sigma_{\bar{X}}$.]

Note that the areas and standard deviations in the graph above were derived under the assumption that $\bar{X}$ is close enough to being normally distributed to not make any difference.

How did I decide that the area under the normal density associated with $\bar{X}$, above the x-axis and between $\mu_{\bar{X}} - 1.96\sigma_{\bar{X}}$ and $\mu_{\bar{X}} + 1.96\sigma_{\bar{X}}$, is 95% of the total area under the curve? Well, $\mu_{\bar{X}} - 1.96\sigma_{\bar{X}}$ is 1.96 standard deviations ($\sigma_{\bar{X}}$) below the mean ($\mu_{\bar{X}}$) of the normal distribution [$\bar{X} \approx N(\mu_{\bar{X}}, \sigma_{\bar{X}}^2)$] and $\mu_{\bar{X}} + 1.96\sigma_{\bar{X}}$ is 1.96 standard deviations above the mean.

We learned earlier that from 1.96 standard deviations below the mean to 1.96 standard deviations above the mean cuts off 95% of the area under the curve for any normal distribution (i.e. this is part of what we learned when we showed that any normal distribution could be mapped into the standard normal distribution [$Z \sim N(0, 1)$]).

So for $n$ large enough we have

$$\Pr\left(\mu_{\bar{X}} - 1.96\,\sigma_{\bar{X}} < \bar{X} < \mu_{\bar{X}} + 1.96\,\sigma_{\bar{X}}\right) = 0.95 \qquad \text{(Equation 1)}$$

[Aside: Notice above that I have used $<$ rather than $\le$ because although it doesn't make any difference which you use in terms of the probability of a continuous distribution, in these notes confidence intervals are always written as open intervals.]

But according to the Central Limit Theorem $\mu_{\bar{X}} = \mu_X$ and $\sigma_{\bar{X}} = \sigma_X / \sqrt{n}$. So Equation 1 becomes

$$\Pr\left(\mu_X - 1.96\,\frac{\sigma_X}{\sqrt{n}} < \bar{X} < \mu_X + 1.96\,\frac{\sigma_X}{\sqrt{n}}\right) = 0.95 \qquad \text{(Equation 2)}$$

But we want $\mu_X$ in the middle and $\bar{X}$ on the ends, so we subtract $\mu_X$ across all parts of the inequality in Equation 2 and get

$$\Pr\left(-1.96\,\frac{\sigma_X}{\sqrt{n}} < \bar{X} - \mu_X < 1.96\,\frac{\sigma_X}{\sqrt{n}}\right) = 0.95 \qquad \text{(Equation 3)}$$

Now subtract $\bar{X}$ across all parts of the inequality in Equation 3 and get

$$\Pr\left(-\bar{X} - 1.96\,\frac{\sigma_X}{\sqrt{n}} < -\mu_X < -\bar{X} + 1.96\,\frac{\sigma_X}{\sqrt{n}}\right) = 0.95 \qquad \text{(Equation 4)}$$

Now multiply by -1 across all parts of the inequality in Equation 4 (note this reverses the inequalities)

$$\Pr\left(\bar{X} + 1.96\,\frac{\sigma_X}{\sqrt{n}} > \mu_X > \bar{X} - 1.96\,\frac{\sigma_X}{\sqrt{n}}\right) = 0.95 \qquad \text{(Equation 5)}$$

Now just put the smaller endpoint of Equation 5 on the left and the larger on the right.

$$\Pr\left(\bar{X} - 1.96\,\frac{\sigma_X}{\sqrt{n}} < \mu_X < \bar{X} + 1.96\,\frac{\sigma_X}{\sqrt{n}}\right) = 0.95 \qquad \text{(Equation 6)}$$

Below we switch from probability to confidence because $\bar{X}$ is a random variable for which probability is appropriate but $\bar{x}$ is the mean of a particular sample. Once we use the sample mean, the population mean $\mu_X$ either is or is not in the interval and probability is no longer appropriate.

So our 95% confidence interval is

$$\left(\bar{x} - 1.96\,\frac{\sigma_X}{\sqrt{n}},\ \ \bar{x} + 1.96\,\frac{\sigma_X}{\sqrt{n}}\right)$$

On the N(0,1) curve the area to the right of 1.96 is 0.025 or 2.5%. Or the area to the left of 1.96 is 0.975 or 1 - 0.025. This means we could denote 1.96 as $z_{0.975} = z_{1-0.025}$. Or if we let $\alpha = 0.05$, so that $\alpha/2 = 0.025$, then more generally we have $z_{1-(\alpha/2)}$. This pattern will work regardless of the value of $\alpha$.

Well, what do we do about -1.96? We'll use $-z_{1-(\alpha/2)}$.

Therefore, the general form of the 100(1 - α)% confidence interval is

$$\left(\bar{x} - z_{1-(\alpha/2)}\,\frac{\sigma_X}{\sqrt{n}},\ \ \bar{x} + z_{1-(\alpha/2)}\,\frac{\sigma_X}{\sqrt{n}}\right)$$

Usually we don't have to work so hard to distinguish between $X$ and $\bar{X}$ and their means and variances. This is because the random variable $\bar{X}$ is not usually part of the conversation. We have only used it to derive the formula for the confidence interval. This means we can just say that the distribution for the random variable $X$ has mean $\mu$ and standard deviation $\sigma$. So the commonly used form of the 100(1 - α)% confidence interval is

$$\left(\bar{x} - z_{1-(\alpha/2)}\,\frac{\sigma}{\sqrt{n}},\ \ \bar{x} + z_{1-(\alpha/2)}\,\frac{\sigma}{\sqrt{n}}\right)$$

In the above formula $\bar{x}$ is the mean of a single sample and is not a random variable.

The confidence for the interval above is $1 - \alpha$. So if $\alpha = 0.10$, then $1 - \alpha = 0.90$ and we would have a 90% confidence interval. In that case $\alpha/2 = 0.05$, so an area equal to 0.05 is cut off each end of the distribution.

The length of the confidence interval is

$$2\, z_{1-(\alpha/2)}\,\frac{\sigma}{\sqrt{n}}$$

As we select different samples of size $n$, we get different values for $\bar{x}$. So the location of the confidence interval changes. However, the length of the confidence interval remains the same (this is because $\sigma$ is known and the samples are all of size $n$).

Find the 95% confidence interval for the baseline heart rate in beats/min for the Propranolol treatment group (Cardiology Problem 6.81 on page 222 of Rosner; also see the original description of the problem in Cardiovascular Disease on page 157). Let us suppose that $\sigma$, the standard deviation of the baseline heart rate for Propranolol, is known and is equal to 17 beats/minute. The Stata data set for this problem is nifed.dta.
    . des

    Contains data from C:\Stata\StataData\Myfiles\BiostatFall2003\Data\nifed.dta
      obs:            34
     vars:            10                          22 Oct 2002 20:53
     size:         1,496 (99.9% of memory free)
    ------------------------------------------------------------------------------
                  storage  display     value
    variable name   type   format      label      variable label
    ------------------------------------------------------------------------------
    id              float  %12.0g
    trtgrp          float  %11.0g      trt        Treatment Group
    heartlv0        float  %12.0g                 Baseline Heart Rate beats/min
    heartlv1        float  %12.0g                 Level 1 Heart Rate beats/min
    heartlv2        float  %12.0g                 Level 2 Heart Rate beats/min
    heartlv3        float  %12.0g                 Level 3 Heart Rate beats/min
    syslv0          float  %12.0g                 Baseline Systolic Blood Pressure mmHg
    syslv1          float  %12.0g                 Level 1 Systolic Blood Pressure mmHg
    syslv2          float  %12.0g                 Level 2 Systolic Blood Pressure mmHg
    syslv3          float  %12.0g                 Level 3 Systolic Blood Pressure mmHg
    ------------------------------------------------------------------------------

    . tab trtgrp

      Treatment |
          Group |      Freq.     Percent        Cum.
    ------------+-----------------------------------
     nifedipine |         18       52.94       52.94
    propranolol |         16       47.06      100.00
    ------------+-----------------------------------
          Total |         34      100.00

    . label list
    trt:
               0 nifedipine
               1 propranolol

Since we have not used this data set before, I have run codebook for treatment group and for baseline heart rate so we can see what we have.

    . codebook trtgrp

    ------------------------------------------------------------------------------
    trtgrp                                                         Treatment Group
    ------------------------------------------------------------------------------
                      type:  numeric (float)
                     label:  trt

                     range:  [0,1]                        units:  1
             unique values:  2                        missing .:  0/34

                tabulation:  Freq.   Numeric  Label
                                18         0  nifedipine
                                16         1  propranolol

    ------------------------------------------------------------------------------
    heartlv0                                         Baseline Heart Rate beats/min
    ------------------------------------------------------------------------------
                      type:  numeric (float)

                     range:  [51,116]                     units:  1
             unique values:  21                       missing .:  0/34

                      mean:   74.1176
                  std. dev:   18.6544

               percentiles:        10%       25%       50%       75%       90%
                                    54        56        71        90       100

The baseline heart rate in beats/minute is denoted heartlv0 and trtgrp = 1 is the propranolol treatment group.

    . sum(heartlv0) if trtgrp == 1

        Variable |       Obs        Mean    Std. Dev.       Min        Max
    -------------+--------------------------------------------------------
        heartlv0 |        16       76.81       17.95         54        105

So $\bar{x} = 76.81$ and $\sigma = 17$ (i.e. we don't use $s = 17.95$ because $\sigma$ is known).

Since we are assuming $n$ is large enough to assume normality, the 95% confidence interval is

$$\left(76.81 - 1.96\left(\frac{17}{\sqrt{16}}\right),\ \ 76.81 + 1.96\left(\frac{17}{\sqrt{16}}\right)\right) = (68.48,\ 85.14)$$

We are confident that 95% of all such confidence intervals cover $\mu$, the mean baseline heart rate of the population (i.e. all people treated with Propranolol). That is what we mean when we say we are 95% confident that $\mu$ lies between 68.48 and 85.14.

When assuming normality our equation for the confidence interval implies that the confidence interval is centered about the sample mean. So when you are carefully double-checking your work, you'll want to make sure that the confidence interval you have gotten actually contains the sample mean at its midpoint. Here (68.48 + 85.14)/2 = 76.81, as it should be.

What impacts the length of the confidence interval?

Remember that the length of the confidence interval is

$$2\, z_{1-(\alpha/2)}\,\frac{\sigma}{\sqrt{n}}$$

1) Sample size n

As $n$ increases, the length of the confidence interval decreases. So there is an inverse relationship between the sample size $n$ and the length of the confidence interval. More precisely, the length is proportional to $1/\sqrt{n}$, so quadrupling the sample size halves the length. Note that shorter confidence intervals are better.

x and y are inversely related if one increases as the other decreases.

2) The standard deviation or variance.

As the standard deviation or variance increases, the length of the confidence interval increases. So there is a direct relationship between the size of $\sigma$ and the length of the confidence interval.

x and y are directly related if they both increase or they both decrease.

3) The α-level.

As $\alpha$ increases (meaning the confidence decreases), the length of the confidence interval decreases. So there is an inverse relationship between the size of $\alpha$ and the length of the confidence interval.

Let us use the function invnormal(p) = z where p is the probability or area and z is the cutoff. We can write the equation as invnormal(1 - (α/2)) = z.

Suppose that α = 0.05 (i.e. we are talking about a 95% confidence interval). This means that an area of 0.025 will be cut off on each end of the normal distribution. So we have 1 - (α/2) = 1 - 0.025 = 0.975.

    . di invnormal(1-(0.05/2))
    1.959964

or

    . di invnormal(0.975)
    1.959964

So for α = 0.05, $z_{1-(\alpha/2)} = z_{0.975} = 1.96$.

If α = 0.10, then 1 - (α/2) = 1 - 0.05 = 0.95.

    . di invnormal(1 - (0.10/2))
    1.6448536

or

    . di invnormal(0.95)
    1.6448536

So $z_{1-(\alpha/2)} = z_{0.95} = 1.64$.

So $\alpha_1 = 0.05$ produces a z value of 1.96 and $\alpha_2 = 0.10$ produces a z value of 1.64. So the larger of the two α's produces the smaller z value and hence the shorter confidence interval.

If α = 0.05, then we have a 95% [i.e. 100(1 - α)%] confidence interval. If α = 0.10, then we have a 90% confidence interval. So less confidence and shorter confidence intervals go together.
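The known-σ formula and the invnormal cutoffs discussed above can be put together in a few lines of Stata. This is a hedged sketch meant to be run as a do-file, not the notes' own code: the values 76.81, 17 and 16 come from the propranolol example, while the scalar names (xbar, sigma, n, z, half, lo, hi) and the extra sample size of 64 are hypothetical choices made only for illustration.

```stata
* Known-sigma confidence interval for the propranolol baseline heart rate.
* xbar, sigma and n are the values from the worked example; the scalar
* names and the comparison sample size of 64 are illustrative assumptions.
scalar xbar  = 76.81
scalar sigma = 17
scalar n     = 16

foreach alpha in 0.05 0.10 {
    scalar z    = invnormal(1 - `alpha'/2)   // cutoff z_{1-(alpha/2)}
    scalar half = z*sigma/sqrt(n)            // half the length of the interval
    scalar lo   = xbar - half
    scalar hi   = xbar + half
    display "alpha = `alpha'   z = " %6.4f z
    display "   CI = (" %5.2f lo ", " %5.2f hi ")   length = " %5.2f 2*half
}

* Length of the 95% interval if the sample size were 64 instead of 16
scalar len64 = 2*invnormal(0.975)*sigma/sqrt(64)
display "95% CI length with n = 64: " %5.2f len64
```

With α = 0.05 this reproduces the interval (68.48, 85.14) found above, and with α = 0.10 it gives roughly (69.82, 83.80), which is shorter. The last line shows the effect of sample size: going from n = 16 to n = 64 cuts the length of the 95% interval in half.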
