NAVAL POSTGRADUATE SCHOOL
LAB #5: CONFIDENCE INTERVALS
Statistics (OA3102)
Prof. Fricker
Lab #5: Confidence Intervals
Goal: Demonstrate how to calculate confidence intervals in R and illustrate how
confidence intervals work.
Lab type: Interactive lab demonstration followed by group exercises.
Time allotted: Lecture for ~50 minutes followed by group exercises for ~50 minutes.
R libraries: bootstrap.
Data: mk48.down.csv and simulated data.
R HINT OF THE WEEK
1. Between the time you start R and it gives you the first prompt, R does a lot of things,
one of which is loading the "search list." The search list defines how R goes about
looking for objects when you ask for them at the command line. You can see what
packages are in your search list by typing
search()
The first item on the search list is the "global environment" followed by a number of
packages, and the order of these packages is important. When you type a function
like mean() into R (say you are intending to calculate the average of some numbers),
R starts by looking for it in the global environment (your workspace, i.e., the objects you have created in the current session)
and it then proceeds through all the other listed packages in order. The first object it
finds called mean is what it will use.
Note what happens when you load another library
library(bootstrap)
search()
That package then becomes the first one searched after the global environment. The key
point is that if you give an object the same name as one in a package that comes later
in the search list, you will mask the packaged version, and plain calls by that name
will use yours instead. For example, try this:
mean <- function(x) cat("Oops – I just masked the mean function! \n")
mean(1:10)
rm(mean)
mean(1:10)
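If you do need the original function while your own object is masking it, you can still reach it through its package namespace with the :: operator; for example:
base::mean(1:10)   # explicitly calls the mean() from the base package, ignoring the masking object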
If you want to see what libraries are installed on your system (though not necessarily
in your search path), type:
library()
And, if you want to see what data sets are installed on your system (lots of libraries
come with data sets), type:
data()
DEMONSTRATION
2. First, a review of confidence intervals.
a. As we learned in class, suppose that X1, …, Xn are iid from N(μ, σ²) with μ
unknown and σ known. This is not likely in real life, but it is a good place to start.
i. Under these assumptions, we know that
\bar{X} \sim N(\mu, \sigma^2/n) ,
and therefore
\Pr\left( -1.96 \;\le\; \frac{\bar{X} - \mu}{\sigma/\sqrt{n}} \;\le\; 1.96 \right) = 0.95 .
ii. That is, the probability that a sample mean calculated from n random normal
observations falls within ±1.96 standard errors of the population mean μ is
0.95. This is a statement about a random point.
b. We can re-write this inequality as
\Pr\left( \bar{X} - 1.96\,\sigma/\sqrt{n} \;\le\; \mu \;\le\; \bar{X} + 1.96\,\sigma/\sqrt{n} \right) = 0.95 ,
which is a statement about a random interval.
i. That is, μ is still fixed; we don't know what it is, but it's not a random
quantity, so it's wrong to say "the probability that μ is between 10 and 12
is…" It either is or it isn't.
ii. The random quantity is the interval \left( \bar{X} - 1.96\,\sigma/\sqrt{n},\; \bar{X} + 1.96\,\sigma/\sqrt{n} \right).
If we took a different sample, we'd get a different interval.
iii. Does our interval contain the true μ? We don't know μ, so we can't say.
However, we've used a technique that produces an interval that is correct
most of the time. We may be right or wrong this particular time, but we know
that in the long run we'll be right 95% of the time, and that's not bad.
iv. The general expression for a (1-α)×100% confidence interval when σ is
known is
\Pr\left( \bar{X} - z_{\alpha/2}\,\sigma/\sqrt{n} \;\le\; \mu \;\le\; \bar{X} + z_{\alpha/2}\,\sigma/\sqrt{n} \right) = 1 - \alpha .
c. Gosset (who published under the pseudonym "Student") showed that when the Xi
are iid N(μ, σ²), then the random variable
\left( \bar{X} - \mu \right) \big/ \left( s/\sqrt{n} \right)
follows the t distribution with n – 1 degrees of freedom.
i. What this means is that if we have to estimate σ then we should use critical
values from the t distribution.
ii. And, the general expression for a (1-α)×100% confidence interval when σ is
estimated by s is
\Pr\left( \bar{X} - t_{\alpha/2,\,n-1}\, s/\sqrt{n} \;\le\; \mu \;\le\; \bar{X} + t_{\alpha/2,\,n-1}\, s/\sqrt{n} \right) = 1 - \alpha .
3. Calculating confidence intervals in R.
a. There is no function to directly calculate a confidence interval when σ is known.
However, it is easy to calculate the necessary pieces and put them together. To do
this, we need the following functions:
i. mean(),
ii. qnorm(), and
iii. of course, we need to know σ and specify α.
Then:
se.mean<-sigma/sqrt(length(data)) # for predefined sigma
ci.upper<-mean(data)+qnorm(1-alpha/2)*se.mean # for predefined alpha
ci.lower<-mean(data)-qnorm(1-alpha/2)*se.mean
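For example, here is a minimal sketch putting those pieces together; the data, sigma, and alpha values below are hypothetical, not from the lab data:
data <- rnorm(25, mean = 40, sd = 2)    # hypothetical sample
sigma <- 2                              # treated as known
alpha <- 0.05
se.mean <- sigma/sqrt(length(data))
c(mean(data) - qnorm(1-alpha/2)*se.mean, mean(data) + qnorm(1-alpha/2)*se.mean)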
b. To compute a t distribution-based confidence interval in R, we can use the qt()
function, and then the calculations become:
se.mean<-sd(data)/sqrt(length(data))
ci.upper<-mean(data)+qt(1-alpha/2,length(data)-1)*se.mean
ci.lower<-mean(data)-qt(1-alpha/2,length(data)-1)*se.mean
We can also use the t.test() function. We'll talk about what the "t test" is in a
later module. For now, all you need to know is that this function produces a list
whose conf.int element contains the confidence interval.
i. As the name implies, the t-test uses the t distribution to calculate the critical
values. Implicit in this calculation is that the population standard deviation σ
is not known. This is the usual case in real life.
ii. Since we almost always have to estimate σ with s in the real world, the use of
the t-test is the default in statistical software. Indeed, R does not even have a
command like z.test for a large-sample confidence interval. So, while the
large-sample confidence interval is easier to calculate by hand, when doing
the calculation using software like R, it's just as easy to do the calculation
using the t distribution, even when the sample size is large.
c. Note that you can use t.test to calculate confidence intervals with other
confidence levels. Simply use the conf.level option. Similarly, you can
calculate one-sided confidence bounds using the alternative option. See the
on-line help for more details.
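For instance, two short illustrative calls (with data standing in for whatever numeric vector you are analyzing, as above):
t.test(data, conf.level = 0.90)      # 90% two-sided confidence interval
t.test(data, alternative = "less")   # one-sided upper confidence bound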
4. Example #1.
a. Let’s first generate a sample from a population we know. That is, let’s assume
that X1, …, X10 are iid from N(40, 1):
norm.samp<-rnorm(10,40,1)
b. So, we know that μ = 40. But imagine you didn’t know that. All you knew was
that you had 10 data points from some distribution: X1, …, X10 – that’s what
you’ll have in the real world.
c. What’s your best point estimate of μ? It’s X̄:
mean(norm.samp)
d. How close is your sample mean to the true mean? In the real world, you don’t
know, because you don’t know μ!
e. But you can estimate the variability of your point estimate, which gives you some
idea of the precision:
se.mean<-sd(norm.samp)/sqrt(length(norm.samp))
f. And, if you assume that X̄ has a normal distribution based on the CLT, you can
compute a confidence interval as an interval estimate of μ:
ci.upper<-mean(norm.samp)+qt(0.975,length(norm.samp)-1)*se.mean
ci.lower<-mean(norm.samp)-qt(0.975,length(norm.samp)-1)*se.mean
g. Finally, you can call the t.test() function and read the confidence interval
t.test(norm.samp)
or extract the conf.int element if you saved the results, as in
tt<-t.test(norm.samp)
tt$conf.int
or you can get it even more directly using
t.test(norm.samp)[4]
5. Example #2.
a. The act of capturing the mean (in this case, 40) is a random thing. Sometimes you
catch it, other times you miss. But on average, for a 95 percent confidence
interval, we’ll catch it about 95 times out of a hundred.
b. To demonstrate this, we’ll use the normci.demo() function on the next page and
in the R script file. Before you use it, first cut and paste the following functions
into R (which will be called by normci.demo).
rowVars <- function(x, na.rm=FALSE, dims=1, unbiased=TRUE,
                    SumSquares=FALSE, twopass=TRUE) {
  if (SumSquares) return(rowSums(x^2, na.rm, dims))
  N <- rowSums(!is.na(x), FALSE, dims)
  Nm1 <- if (unbiased) N-1 else N
  if (twopass) {x <- if (dims==0) x - mean(x, na.rm=na.rm) else
                     sweep(x, 1:dims, rowMeans(x, na.rm, dims))}
  (rowSums(x^2, na.rm, dims) - rowSums(x, na.rm, dims)^2/N) / Nm1
}
rowSds <- function(x) sqrt(rowVars(x))
Now copy and paste normci.demo() into R so that you can use the function.
c. To try out the function, just type normci.demo() to get a plot. Then, get creative
and try different values of N, n, μ, and σ – see the code for the function arguments.
(Be careful: if you choose a large N, or n, or both, keep plotspeed=0.)
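For example, here are a couple of illustrative calls using the argument names from the function listing (the particular values are just suggestions):
normci.demo()                                                      # defaults: N=100 intervals, n=10, mu=40, sigma=1
normci.demo(N = 50, n = 25, mu = 100, sigma = 20, alpha = 0.10)    # fifty 90% intervals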
d. What the function does is generate N random samples of size n from (by default) a
N(40, 1) distribution, compute the t-based confidence intervals, and then plot the
N intervals, coloring the intervals that don’t include μ red.
i. This is much like some of the plots we looked at in class, but now you can
generate your own in real time.
ii. Note that sometimes you miss; that’s just the way it is. In fact, on average
you miss 100 × α percent of the time.
iii. Remember that in the real world you will only have one set of data and will
only be calculating one confidence interval. Thus, all you will know is that
you are “100(1-α) percent confident” that your interval “covers” the true and
unknown μ.
• This is a long-run statement, meaning that if you were able to re-do your
real-world experiment over and over (like we can in these simulations),
then on average 100(1-α) percent of the CIs would contain μ.
• However, the one interval you calculate either does or does not contain μ,
and in most cases there’s no way to know whether it does or not (since, if
you knew μ, why would you bother calculating the CI in the first place?).
iv. Note that, all else being equal, wider intervals come with a higher confidence
and narrower intervals with a lower level of confidence. To get 100%
confidence, the interval must be the whole real line – not very informative.
v. How could you make the interval smaller? Look at the formula again:
decrease σ (and therefore, probably, s); increase n; or decrease your
confidence level and be prepared to make errors more often.
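As a quick numerical check of that last point, compare the widths of 95% and 90% t-intervals on the same simulated (hypothetical) data:
x <- rnorm(25, mean = 40, sd = 1)
diff(t.test(x, conf.level = 0.95)$conf.int)   # width of the 95% interval
diff(t.test(x, conf.level = 0.90)$conf.int)   # the 90% interval is narrower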
6. Illustrating bootstrap confidence intervals.
a. All the calculations we have done in class presume you know the sampling
distribution of the sample statistic (aka the point estimator) so that you can derive
the expression for the confidence interval. These are parametric confidence
intervals. But what if you don't know the distribution?
b. That's when the computer and the bootstrap are your friends. As we did in point
estimation, you can use the bootstrap to nonparametrically estimate the
confidence interval. Let's illustrate with a simple example for which we know the
answer.
i. Let's draw a sample of size n = 100 from a population distributed according to
N(μ, σ²) with μ = 5 and σ = 10. For this, using the large sample
approximation, we know that the 95% confidence interval is
\left( \bar{X} - t_{\alpha/2,\,n-1}\, s/\sqrt{n},\; \bar{X} + t_{\alpha/2,\,n-1}\, s/\sqrt{n} \right)
  = \left( \bar{X} - 1.984\, s/\sqrt{100},\; \bar{X} + 1.984\, s/\sqrt{100} \right)
  = \left( \bar{X} - 1.984\, s/10,\; \bar{X} + 1.984\, s/10 \right) .
ii. So, let's draw some data and show that the bootstrap confidence interval is
close to the one calculated from the above expression:
rnorm.data <- rnorm(100,5,10)
t.test(rnorm.data)[4]
$conf.int
[1] 3.009935 7.086407
Now, before we do the bootstrap, let's just confirm that the above is correct:
mean(rnorm.data)-qt(0.975,length(rnorm.data)-1)*sd(rnorm.data)/
sqrt(length(rnorm.data))
[1] 3.009935
mean(rnorm.data)+qt(0.975,length(rnorm.data)-1)*sd(rnorm.data)/
sqrt(length(rnorm.data))
[1] 7.086407
iii. Now, let's bootstrap the confidence interval using the bootstrap() function.
The simplest approach, known as the "percentile bootstrap confidence interval,"
is simply to take the appropriate percentiles of the bootstrapped statistics, as in
library(bootstrap)
bootstrap.out <- bootstrap(rnorm.data,10000,mean)
quantile(bootstrap.out$thetastar,c(0.025,0.975))
     2.5%    97.5%
 3.064672 6.995723
Pretty close, eh? However, while this type of "naïve" bootstrap confidence
interval works well for symmetric distributions, its coverage properties can be
overly optimistic (meaning the actual coverage percentage will be lower than
what is specified), particularly for smaller sample sizes. Nonetheless, the
simplicity of this approach is appealing.
iv. Let's do another example in which we can calculate the analytical solution. It
turns out that, if Xi ~ exp(λ), i = 1, …, n, then
2\lambda \sum_{i=1}^{n} X_i \sim \chi^2_{2n} .
We can use this as a pivotal quantity to derive the following confidence interval for λ:
\Pr\left( \frac{\chi^2_{1-\alpha/2,\,2n}}{2\sum_{i=1}^{n} X_i} \;\le\; \lambda \;\le\; \frac{\chi^2_{\alpha/2,\,2n}}{2\sum_{i=1}^{n} X_i} \right) = 1 - \alpha ,
where the χ² subscripts denote upper-tail critical values (so, in R, the lower
endpoint uses qchisq(0.025, 2*n) and the upper endpoint uses qchisq(0.975, 2*n),
as in the code below).
So, let's randomly draw some data from an exponential distribution where we
know the value of the parameter, λ = 0.2, and calculate a 95% confidence
interval per the above expression.
rexp.data <- rexp(100,0.2)
qchisq(0.025,2*length(rexp.data))/(2*sum(rexp.data))
[1] 0.1642723
qchisq(0.975,2*length(rexp.data))/(2*sum(rexp.data))
[1] 0.2433455
Here we see that this particular interval covers the true parameter. Does
yours? (Remember, there's a 1 in 20 chance that it will not.)
Now let's bootstrap the 95% confidence interval for λ, where we know that the
MLE is λ̂ = 1/X̄:
lambda.hat <- function(data) 1/mean(data)
bootstrap.out <- bootstrap(rexp.data,10000,lambda.hat)
quantile(bootstrap.out$thetastar,c(0.025,0.975))
     2.5%     97.5%
0.1700668 0.2448911
Still pretty good! Now, while the percentile bootstraps performed well in these
two examples, note that there are more sophisticated variants that improve the
coverage of the bootstrap confidence interval under various conditions. If you
want to delve into this more, see An Introduction to the Bootstrap by B. Efron and
R. Tibshirani, Chapman & Hall, 1994.
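If you want to experiment with one such variant, the bootstrap package also provides bcanon(), which computes BCa (bias-corrected and accelerated) intervals. A minimal sketch using the rnorm.data sample from above (the endpoints you get will differ from run to run):
library(bootstrap)
bca.out <- bcanon(rnorm.data, 10000, mean, alpha = c(0.025, 0.975))
bca.out$confpoints   # BCa estimates of the 2.5% and 97.5% interval endpoints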
GROUP #___ EXERCISES
Members: ______________, ______________, ______________, ______________
1. Returning to the LVS data (mk48.down.csv), for the purposes of this exercise
consider the 9,505 observations as a population that you are trying to make inference
about. In particular, imagine that you are trying to calculate an interval estimate of
the mean down time in the population, which is
> mean(mk48.down$down.days)
[1] 63.2445
To demonstrate confidence interval coverage properties, write a function to estimate
the fraction of confidence intervals that do not contain the true mean as follows.
o Repeat the following steps N times (perhaps within a loop or perhaps not):
  • take a sample (without replacement) of size n from the population,
  • calculate the appropriate confidence interval, and
  • set a 0/1 indicator to 1 if the confidence interval does not contain 63.2445.
o Then output the sum of the indicators divided by N as an estimate of α, i.e.,
α̂, the probability the CI does not cover the population parameter.
(a) Once you get your function running, for a 95% confidence interval calculation
explore how big of a sample it takes for α̂ to get close to 0.05, the expected
fraction of intervals that fail to cover α.
(b) Are you surprised by the result? What do you attribute this finding to?
i) The following function may be helpful in diagnosing what's going on. It
creates a normal Q-Q plot for N resamples (without replacement) of size n
from a vector of data and then overlays on the plot a "best fit" straight line to
help one visually judge whether the Q-Q plot follows a straight line or not.
norm.plot <- function(data, n, N)
{
  mean.data <- rep(0, N)
  for(i in 1:N) {mean.data[i] <- mean(sample(data, n))}
  qqnorm(mean.data)
  qqline(mean.data)
}
Name: _____________________________
INDIVIDUAL EXERCISES
1. Run normci.demo for increasing N, say 10, 50, 100, 500, 1,000, for X~N(100,20).
(a) Include the graphs with your homework submission.
(b) In your own words, describe what a confidence interval is and explain why some
of the intervals do not contain the true mean.
(c) What can you say about the fraction of confidence intervals that miss the true
mean as N gets larger?
2. Write one or two functions in R that calculate the confidence interval for the
mean: one for the case where the population standard deviation is known and another
for the case where it is not known. That is, you can write two separate functions,
or you can combine them into one function that first determines whether σ is known
and then calculates the appropriate CI.
(a) Note that, in the former case, you will have to pass the population standard
deviation as an argument, but not in the latter case.
(b) Try to make the function as general as possible, meaning don't "hard code" in
critical values and let the function determine quantities like sample size.
(c) In both cases, you will need an argument that specifies the confidence level.
(d) Submit the code for your functions along with some empirical demonstrations that
they work correctly. I.e., show that their calculations match some known results,
such as those calculated in Part 4 of the lab, or other examples that you calculated
by hand.
NORMCI.DEMO Function
normci.demo <- function(plotspeed=0, N = 100, n = 10, mu = 40, sigma = 1, alpha = 0.05)
{
  # Demonstrates confidence intervals based on the Normal.
  # Adapted from code first written by Prof Sam Buttrey.
  #
  # Arguments:
  #   plotspeed: Controls speed of plotting using integers 0-4
  #   N:         Number of confidence intervals to compute
  #   n:         size of each sample
  #   mu:        population mean
  #   sigma:     population SD
  #   alpha:     (1 - confidence) level
  #
  # (1) Set up matrix with random Normal(mu, sigma^2)'s, N by n.
  mat <- matrix(rnorm(N * n, mean = mu, sd = sigma), nrow = N)
  #
  # (2) Compute x.bar and s; get lower and upper bounds
  x.bar <- rowMeans(mat)
  s <- rowSds(mat)
  t.crit <- qt(1 - (alpha/2), n - 1)
  lower <- x.bar - t.crit * (s/sqrt(n))
  upper <- x.bar + t.crit * (s/sqrt(n))
  #
  # (3) Next, set up the axes, but don't print anything.
  plot(c(min(lower), max(upper)), c(1, N), type = "n", xlab = "X-bar",
       ylab = "Sample Number", main = "Normal Confidence Interval Illustration")
  #
  # (4) Draw a vertical line at the mean mu.
  lines(c(mu, mu), c(-5, N+5), type = "l")
  #
  # (5) Draw the lines, one at a time. Color the ones that miss red. Notice
  #     the "one-element or" is denoted by two vertical bars: "||".
  for(i in 1:N) {
    col <- 1   # default: black
    if(lower[i] > mu || upper[i] < mu) col <- "red"
    lines(c(lower[i], upper[i]), c(i, i), col = col)
    lines(c(x.bar[i], x.bar[i]), c(i-0.5, i+0.5), col = col)
    ps <- max(0, min(plotspeed+3, 7))
    for(j in 1:10^ps) {}   # Just eats up computing cycles to slow plotting down
  }
  #
  # (6) Just to be nice, let's label the graph with the number of misses. Here we use
  #     the "vectorized or" denoted by one vertical bar: "|".
  miss <- sum(lower > mu | upper < mu)
  mtext(paste(miss, "of the intervals missed"))
}