NAVAL POSTGRADUATE SCHOOL
Statistics (OA3102), Prof. Fricker
Revision: February 2012

LAB #5: CONFIDENCE INTERVALS

Goal: Demonstrate how to calculate confidence intervals in R and illustrate how confidence intervals work.
Lab type: Interactive lab demonstration followed by group exercises.
Time allotted: Lecture for ~50 minutes followed by group exercises for ~50 minutes.
R libraries: bootstrap.
Data: mk48.down.csv and simulated data.

R HINT OF THE WEEK

1. Between the time you start R and the time it gives you the first prompt, R does a lot of things, one of which is setting up the "search list." The search list defines how R goes about looking for objects when you ask for them at the command line. You can see what is in your search list by typing

       search()

   The first item on the search list is the "global environment," followed by a number of packages, and the order of these packages is important. When you type a function like mean() into R (say you intend to calculate the average of some numbers), R starts by looking for it in the global environment (your workspace) and then proceeds through all the other listed packages in order. The first object it finds called mean is the one it will use. Note what happens when you load another library:

       library(bootstrap)
       search()

   That package then becomes the first one searched after the global environment. The key point is that if you give an object the same name as one in a package that comes later in the search list, you will mask it and will no longer be able to reach it by its bare name. For example, try this:

       mean <- function(x) cat("Oops - I just masked the mean function! \n")
       mean(1:10)
       rm(mean)
       mean(1:10)

   If you want to see what libraries are installed on your system (though not necessarily in your search path), type:

       library()

   And, if you want to see what data sets are installed on your system (lots of libraries come with data sets), type:

       data()

DEMONSTRATION

2. First, a review of confidence intervals.
   a. As we learned in class, suppose that $X_1, \ldots, X_n$ are iid from $N(\mu, \sigma^2)$ with $\mu$ unknown and $\sigma$ known. This is not likely in real life, but it is a good place to start.
      i. Under these assumptions, we know that $\bar{X} \sim N(\mu, \sigma^2/n)$, and therefore

         $$\Pr\left(-1.96 \le \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \le 1.96\right) = 0.95.$$

      ii. That is, the probability that a sample mean calculated from n random normal observations falls within 1.96 standard errors of the population mean is 0.95. This is a statement about a random point.
   b. We can re-write this inequality as

         $$\Pr\left(\bar{X} - 1.96\,\sigma/\sqrt{n} \le \mu \le \bar{X} + 1.96\,\sigma/\sqrt{n}\right) = 0.95,$$

      which is a statement about a random interval.
      i. That is, $\mu$ is still fixed; we don't know what it is, but it's not a random quantity, so it's wrong to say "the probability that $\mu$ is between 10 and 12 is…" It either is or it isn't.
      ii. The random quantity is the interval $\left(\bar{X} - 1.96\,\sigma/\sqrt{n},\ \bar{X} + 1.96\,\sigma/\sqrt{n}\right)$. If we took a different sample, we'd get a different interval.
      iii. Does our interval contain the true $\mu$? We don't know $\mu$, so we can't say. However, we've used a technique that produces an interval that is correct most of the time. We may be right or wrong this particular time, but we know that in the long run we'll be right 95% of the time, and that's not bad.
      iv. The general expression for a $(1-\alpha) \times 100\%$ confidence interval when $\sigma$ is known is

         $$\Pr\left(\bar{X} - z_{\alpha/2}\,\sigma/\sqrt{n} \le \mu \le \bar{X} + z_{\alpha/2}\,\sigma/\sqrt{n}\right) = 1 - \alpha.$$

   c. Gosset (who published under the pseudonym "Student") showed that when the $X_i$ are iid $N(\mu, \sigma^2)$, the random variable

         $$\frac{\bar{X} - \mu}{s/\sqrt{n}}$$

      follows the t distribution with $n - 1$ degrees of freedom.
      i. What this means is that if we have to estimate $\sigma$, then we should use critical values from the t distribution.
      ii. And the general expression for a $(1-\alpha) \times 100\%$ confidence interval when $\sigma$ is estimated by $s$ is

         $$\Pr\left(\bar{X} - t_{\alpha/2,\,n-1}\,s/\sqrt{n} \le \mu \le \bar{X} + t_{\alpha/2,\,n-1}\,s/\sqrt{n}\right) = 1 - \alpha.$$

3. Calculating confidence intervals in R.
   a. There is no function to directly calculate a confidence interval when $\sigma$ is known.
      However, it is easy to calculate the necessary pieces and put them together. To do this, we need:
      i. mean(),
      ii. qnorm(), and
      iii. of course, a known $\sigma$ and a specified $\alpha$.
      Then:

          se.mean <- sigma/sqrt(length(data))                # for predefined sigma
          ci.upper <- mean(data) + qnorm(1-alpha/2)*se.mean  # for predefined alpha
          ci.lower <- mean(data) - qnorm(1-alpha/2)*se.mean

   b. To compute a t distribution-based confidence interval in R, we can use the qt() function, and then the calculations become:

          se.mean <- sd(data)/sqrt(length(data))
          ci.upper <- mean(data) + qt(1-alpha/2, length(data)-1)*se.mean
          ci.lower <- mean(data) - qt(1-alpha/2, length(data)-1)*se.mean

      We can also use the t.test() function. We'll talk about what the "t test" is in a later module. For now, all you need to know is that this function returns a list whose conf.int element contains the confidence interval.
      i. As the name implies, the t-test uses the t distribution to calculate the critical values. Implicit in this calculation is that the population standard deviation is not known. This is the usual case in real life.
      ii. Since we almost always have to estimate $\sigma$ with $s$ in the real world, the t-test is the default in statistical software. Indeed, base R does not even have a command like z.test for a large sample confidence interval. So, while the large sample confidence interval is easier to calculate by hand, in software like R it's just as easy to do the calculation using the t distribution, even when the sample size is large.
   c. Note that you can use t.test to calculate confidence intervals with other confidence levels. Simply use the conf.level option. Similarly, you can calculate one-sided confidence bounds using the alternative option. See the on-line help for more details.

4. Example #1.
   a. Let's first generate a sample from a population we know. That is, let's assume that $X_1, \ldots, X_{10}$ are iid from $N(40, 1)$:

          norm.samp <- rnorm(10,40,1)

   b. So, we know that $\mu = 40$. But imagine you didn't know that. All you knew was that you had 10 data points from some distribution: $X_1, \ldots, X_{10}$. That's what you'll have in the real world.
   c. What's your best point estimate of $\mu$? It's $\bar{X}$:

          mean(norm.samp)

   d. How close is your sample mean to the true mean? In the real world, you don't know, because you don't know $\mu$!
   e. But you can estimate the variability of your point estimate, which gives you some idea of the precision:

          se.mean <- sd(norm.samp)/sqrt(length(norm.samp))

   f. And, if you assume that $\bar{X}$ has a normal distribution based on the CLT, you can compute a confidence interval as an interval estimate of $\mu$:

          ci.upper <- mean(norm.samp) + qt(0.975, length(norm.samp)-1)*se.mean
          ci.lower <- mean(norm.samp) - qt(0.975, length(norm.samp)-1)*se.mean

   g. Finally, you can call the t.test() function and read the confidence interval from the printed output,

          t.test(norm.samp)

      or extract the conf.int element if you saved the results, as in

          tt <- t.test(norm.samp)
          tt$conf.int

      or you can get it even more directly using

          t.test(norm.samp)[4]

5. Example #2.
   a. The act of capturing the mean (in this case, 40) is a random thing. Sometimes you catch it, other times you miss. But on average, for a 95 percent confidence interval, we'll catch it about 95 times out of a hundred.
   b. To demonstrate this, we'll use the normci.demo() function listed at the end of this lab and in the R script file. Before you use it, first copy and paste the following functions into R (they are called by normci.demo):

          rowVars <- function(x, na.rm=FALSE, dims=1, unbiased=TRUE,
                              SumSquares=FALSE, twopass=TRUE) {
            if (SumSquares) return(rowSums(x^2, na.rm, dims))
            N <- rowSums(!is.na(x), FALSE, dims)
            Nm1 <- if (unbiased) N-1 else N
            if (twopass) {x <- if (dims==0) x - mean(x, na.rm=na.rm) else
                               sweep(x, 1:dims, rowMeans(x,na.rm,dims))}
            (rowSums(x^2, na.rm, dims)-rowSums(x, na.rm, dims)^2/N) / Nm1
          }
          rowSds <- function(x) sqrt(rowVars(x))

      Now copy and paste normci.demo() into R so that you can use the function.
   c. To try out the function, just type normci.demo() to get a plot. Then, get creative and try different values of N, n, and $\alpha$; see the code for the function arguments. (Be careful: if you choose a large N, or n, or both, keep plotspeed=0.)
   d. What the function does is generate N random samples of size n from (by default) a $N(40, 1)$ distribution, compute the t-based confidence intervals, and then plot the N intervals, coloring the intervals that don't include $\mu$ red.
      i. This is much like some of the plots we looked at in class, but now you can generate your own in real time.
      ii. Note that sometimes you miss; that's just the way it is. In fact, on average you miss $100 \times \alpha$ percent of the time.
      iii. Remember that in the real world you will only have one set of data and will only be calculating one confidence interval. Thus, all you will know is that you are "$100(1-\alpha)$ percent confident" that your interval "covers" the true and unknown $\mu$. This is a long-run statement, meaning that if you were able to re-do your real-world experiment over and over (like we can in these simulations), then on average $100(1-\alpha)$ percent of the CIs would contain $\mu$. However, the one interval you calculate either does or does not contain $\mu$, and in most cases there's no way to know whether it does or not (since, if you knew $\mu$, why would you bother calculating the CI in the first place?).
      iv. Note that, all else being equal, wider intervals come with a higher level of confidence and narrower intervals with a lower level of confidence. To get 100% confidence, the interval must be the whole real line, which is not very informative.
      v. How could you make the interval smaller? Look at the formula again: decrease $\sigma$ (and therefore, probably, $s$); increase n; or decrease your confidence level and be prepared to make errors more often.

6. Illustrating bootstrap confidence intervals.
   a. All the calculations we have done in class presume you know the sampling distribution of the sample statistic (aka the point estimator) so that you can derive the expression for the confidence interval. These are parametric confidence intervals. But what if you don't know the distribution?
   b. That's when the computer and the bootstrap are your friends. As we did in point estimation, you can use the bootstrap to nonparametrically estimate the confidence interval. Let's illustrate with a simple example for which we know the answer.
      i. Let's draw a sample of size n=100 from a population distributed according to $N(\mu, \sigma^2)$ with $\mu = 5$ and $\sigma = 10$. For this, using the t distribution with 99 degrees of freedom, we know that the 95% confidence interval is

         $$\left(\bar{X} - t_{\alpha/2,\,n-1}\,s/\sqrt{n},\ \bar{X} + t_{\alpha/2,\,n-1}\,s/\sqrt{n}\right) = \left(\bar{X} - 1.984\,s/\sqrt{100},\ \bar{X} + 1.984\,s/\sqrt{100}\right) = \left(\bar{X} - 1.984\,s/10,\ \bar{X} + 1.984\,s/10\right).$$

      ii. So, let's draw some data and show that the bootstrap confidence interval is close to the one calculated from the above expression:

              rnorm.data <- rnorm(100,5,10)
              t.test(rnorm.data)[4]
              $conf.int
              [1] 3.009935 7.086407

          Now, before we do the bootstrap, let's just confirm that the above is correct:

              mean(rnorm.data)-qt(0.975,length(rnorm.data)-1)*sd(rnorm.data)/
                sqrt(length(rnorm.data))
              [1] 3.009935
              mean(rnorm.data)+qt(0.975,length(rnorm.data)-1)*sd(rnorm.data)/
                sqrt(length(rnorm.data))
              [1] 7.086407

      iii. Now, let's bootstrap the confidence interval using the bootstrap() function. The simplest approach, known as the "percentile bootstrap confidence interval," is to take the appropriate percentiles of the bootstrapped statistics, as in

              library(bootstrap)
              bootstrap.out <- bootstrap(rnorm.data,10000,mean)
              quantile(bootstrap.out$thetastar,c(0.025,0.975))
                  2.5%    97.5%
              3.064672 6.995723

          Pretty close, eh?
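          The same percentile idea can be sketched in base R, without the bootstrap package. This is only an illustration under assumptions, not the package's implementation: the helper name percentile.ci, its arguments, the variable demo.data, and the seed are all hypothetical. It resamples the data with replacement using sample(), recomputes the statistic each time, and takes the matching quantiles.

```r
# Hypothetical helper (not part of the bootstrap package): percentile
# bootstrap confidence interval computed with base R only.
percentile.ci <- function(data, stat = mean, B = 10000, conf.level = 0.95) {
  alpha <- 1 - conf.level
  # Resample the data with replacement B times and recompute the statistic.
  theta.star <- replicate(B, stat(sample(data, replace = TRUE)))
  # The interval is the alpha/2 and 1 - alpha/2 quantiles of those values.
  quantile(theta.star, c(alpha/2, 1 - alpha/2), names = FALSE)
}

set.seed(1)                      # assumed seed, only so the run is repeatable
demo.data <- rnorm(100, 5, 10)   # same population as above: mu = 5, sigma = 10
boot.ci <- percentile.ci(demo.data)
t.ci <- as.numeric(t.test(demo.data)$conf.int)
print(boot.ci)                   # percentile bootstrap interval
print(t.ci)                      # t-based interval for comparison
```

          Any statistic that maps a vector to a single number can be passed as stat, for example median.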
          However, while this type of "naïve" bootstrap confidence interval works well for symmetric distributions, its coverage properties can be overly optimistic (meaning the actual coverage percentage will be lower than what is specified), particularly for smaller sample sizes. Nonetheless, the simplicity of this approach is appealing.
      iv. Let's do another example in which we can calculate the analytical solution. It turns out that, if $X_i \sim \exp(\lambda)$, $i = 1, \ldots, n$, then

         $$2\lambda \sum_{i=1}^{n} X_i \sim \chi^2_{2n}.$$

          We can use this as a pivotal quantity to derive the following confidence interval for $\lambda$:

         $$\Pr\left(\frac{\chi^2_{1-\alpha/2,\,2n}}{2\sum X_i} \le \lambda \le \frac{\chi^2_{\alpha/2,\,2n}}{2\sum X_i}\right) = 1 - \alpha.$$

          So, let's randomly draw some data from an exponential distribution where we know the value of the parameter, $\lambda = 0.2$, and calculate a 95% confidence interval per the above expression.

              rexp.data <- rexp(100,0.2)
              qchisq(0.025,2*length(rexp.data))/(2*sum(rexp.data))
              [1] 0.1642723
              qchisq(0.975,2*length(rexp.data))/(2*sum(rexp.data))
              [1] 0.2433455

          Here we see that this particular interval covers the true parameter. Does yours? (Remember, there's a 1 in 20 chance that it will not.)

          Now let's bootstrap the 95% confidence interval for $\lambda$, where we know that the MLE is $\hat{\lambda} = 1/\bar{X}$:

              lambda.hat <- function(data) 1/mean(data)
              bootstrap.out <- bootstrap(rexp.data,10000,lambda.hat)
              quantile(bootstrap.out$thetastar,c(0.025,0.975))
                   2.5%     97.5%
              0.1700668 0.2448911

          Still pretty good! Now, while the percentile bootstrap performed well in these two examples, note that there are more sophisticated variants that improve the coverage of the bootstrap confidence interval under various conditions. If you want to delve into this more, see An Introduction to the Bootstrap by B. Efron and R. Tibshirani, Chapman & Hall, 1994.

GROUP #___ EXERCISES

Members: ______________, ______________, ______________, ______________

1. Returning to the LVS data (mk48.down.csv), for the purposes of this exercise consider the 9,505 observations as a population that you are trying to make inference about. In particular, imagine that you are trying to calculate an interval estimate of the mean down time in the population, which is

       > mean(mk48.down$down.days)
       [1] 63.2445

   To demonstrate confidence interval coverage properties, write a function to estimate the fraction of confidence intervals that do not contain the true mean as follows.
   o Repeat the following steps N times (perhaps within a loop or perhaps not): take a sample (without replacement) of size n from the population, calculate the appropriate confidence interval, and set a 0/1 indicator to 1 if the confidence interval does not contain 63.2445.
   o Then output the sum of the indicators divided by N as an estimate of $\alpha$, i.e., $\hat{\alpha}$, the probability that the CI does not cover the population parameter.
   (a) Once you get your function running, for a 95% confidence interval calculation explore how big a sample it takes for $\hat{\alpha}$ to get close to 0.05, the expected fraction of intervals that fail to cover $\mu$.
   (b) Are you surprised by the result? What do you attribute this finding to?
       i) The following function may be helpful in diagnosing what's going on. It creates a normal Q-Q plot of the means of N resamples (without replacement), each of size n, from a vector of data and then overlays on the plot a "best fit" straight line to help one visually judge whether the Q-Q plot follows a straight line or not.

              norm.plot <- function(data,n,N) {
                mean.data <- rep(0,N)
                # Collect the mean of each resample, then plot once at the end
                for(i in 1:N) {mean.data[i] <- mean(sample(data,n))}
                qqnorm(mean.data)
                qqline(mean.data)
              }

INDIVIDUAL EXERCISES

Name: _____________________________

1. Run normci.demo for increasing N, say 10, 50, 100, 500, 1,000, for X~N(100,20).
   (a) Include the graphs with your homework submission.
   (b) In your own words, describe what a confidence interval is and explain why some of the intervals do not contain the true mean.
   (c) What can you say about the fraction of confidence intervals that miss the true mean as N gets larger?

2. Write one or two functions in R: one that calculates the confidence interval for the mean when the population standard deviation is known, and another that calculates the confidence interval when the population standard deviation is not known. That is, you can write two separate functions, or you can combine them into one function that first determines whether $\sigma$ is known and then calculates the appropriate CI.
   (a) Note that, in the former case, you will have to pass the population standard deviation as an argument, but not in the latter case.
   (b) Try to make the function as general as possible, meaning don't "hard code" critical values, and let the function determine quantities like sample size.
   (c) In both cases, you will need an argument that specifies the confidence level.
   (d) Submit the code for your functions along with some empirical demonstrations that they work correctly. I.e., show that their calculations match some known results, such as those calculated in Part 4 of the lab, or other examples that you calculated by hand.

NORMCI.DEMO Function

    normci.demo <- function(plotspeed=0, N = 100, n = 10, mu = 40,
                            sigma = 1, alpha = 0.05)
    {
    # Demonstrates confidence intervals based on the Normal.
    # Adapted from code first written by Prof Sam Buttrey.
    #
    # Arguments:
    #   plotspeed: Controls speed of plotting using integers 0-4
    #   N:         Number of confidence intervals to compute
    #   n:         size of each sample
    #   mu:        population mean
    #   sigma:     population SD
    #   alpha:     (1 - confidence) level
    #
    # (1) Set up matrix with random Normal(mu, sigma^2)'s, N by n.
        mat <- matrix(rnorm(N * n, mean = mu, sd = sigma), nrow = N)
    #
    # (2) Compute x.bar and s; get lower and upper bounds
        x.bar <- rowMeans(mat)
        s <- rowSds(mat)
        t.crit <- qt(1 - (alpha/2), n - 1)
        lower <- x.bar - t.crit * (s/sqrt(n))
        upper <- x.bar + t.crit * (s/sqrt(n))
    #
    # (3) Next, set up the axes, but don't print anything.
        plot(c(min(lower), max(upper)), c(1, N), type = "n",
             xlab = "X-bar", ylab = "Sample Number",
             main = "Normal Confidence Interval Illustration")
    #
    # (4) Draw a vertical line at the mean mu.
        lines(c(mu,mu),c(-5,N+5),type="l")
    #
    # (5) Draw the lines, one at a time. Color the ones that miss red. Notice
    #     the "one-element or" is denoted by two vertical bars: "||".
        for(i in 1:N) {
            col <- 1 # default: black
            if(lower[i] > mu || upper[i] < mu) col <- "red"
            lines(c(lower[i], upper[i]), c(i, i), col = col)
            lines(c(x.bar[i], x.bar[i]),c(i-0.5,i+0.5), col = col)
            ps <- max(0,min(plotspeed+3,7))
            for(j in 1:10^ps) {} # Just eats up computing cycles to slow plotting down
        }
    #
    # (6) Just to be nice, let's label the graph with the number of misses. Here we use
    #     the "vectorized or" denoted by one vertical bar: "|".
        miss <- sum(lower > mu | upper < mu)
        mtext(paste(miss, "of the intervals missed"))
    }
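
The long-run coverage that normci.demo() shows graphically can also be checked numerically. The sketch below is an illustration under assumptions and is not part of the original lab code: the function name coverage.sim and the seed are hypothetical, and it builds each interval with t.test() rather than with rowSds(). It draws N samples of size n from a Normal(mu, sigma^2) population and reports the fraction of t-based intervals that miss mu, which should land near alpha.

```r
# Hypothetical helper: estimate how often a t-based confidence interval
# misses the true mean when the data really are Normal(mu, sigma^2).
coverage.sim <- function(N = 10000, n = 10, mu = 40, sigma = 1, alpha = 0.05) {
  miss <- replicate(N, {
    x <- rnorm(n, mean = mu, sd = sigma)              # one sample of size n
    ci <- t.test(x, conf.level = 1 - alpha)$conf.int  # its t-based interval
    ci[1] > mu || ci[2] < mu                          # TRUE if the interval misses mu
  })
  mean(miss)                                          # fraction of misses
}

set.seed(2012)               # assumed seed, only so the run is repeatable
miss.rate <- coverage.sim()
print(miss.rate)             # expected to be close to alpha = 0.05
```

Increasing N tightens the estimate around alpha; changing n should leave the miss rate near alpha as long as the population is normal.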