Chapter 6
Part 4
Confidence Intervals
October 21, 2008
Goal:
To clearly understand the link between probability distributions and confidence intervals.
Skills:
Be able to calculate a 100(1 - α)% confidence interval for a population mean, based on a
sample mean, both for the case that the population variance is known and the case that it
is not known.
Be able to accurately interpret a confidence interval.
Contents:
Central Limit Theorem
Confidence interval using the normal distribution
Formula
What impacts the length of a CI
Stata commands:
invnormal
Usually we study samples from a population rather than the population itself because it
is not possible to get our hands on the whole population (e.g. it is too big, the process is
too costly, frequently some of the members of the population we are interested in
haven’t even been born yet).
We have agreed that when possible, we should select a random sample. We also
know that when we select a random sample of size n for a study, it is just one of many
possible samples of size n that could have been selected from the population.
Suppose we want to know the average fasting triglycerides of the entire population of
the U.S. that is 55 years old or older (55+). Some of the reasons why we’ll have to
select a sample would be: 1) usually the whole population is simply not available (e.g.
the ALLHAT investigators were hoping that the results of their study would apply not
only to those who were 55+ at the time of entry into the study but also to those who will
later become 55+) and 2) even in cases where the population is available (an unusual
case) the cost and time involved to study a whole population tends to be prohibitive. So
we’ve decided to select a random sample from the population and use the mean of the
fasting triglycerides of that sample to estimate the mean of the entire population.
What we learned earlier when studying the sampling distribution of means is the
following:
Let $X$ be a random variable representing the distribution of the fasting triglycerides
in the population of people aged 55+. Let $\mu_X$ represent the population mean for
the fasting triglycerides and $\sigma_X^2$ the population variance for the fasting triglycerides.
Then we usually denote the random variable representing the sampling
distribution of means of samples of size $n$ by $\bar{X}$, the mean of the population
of means by $\mu_{\bar{X}}$ and the variance of the population of means by $\sigma_{\bar{X}}^2$.
If the size, $n$, of the sample is large enough, we have
1) $\mu_{\bar{X}} = \mu_X$ (Fact 1 from before)
The mean of the original distribution is equal to the mean of the sampling
distribution.
2) $\sigma_{\bar{X}}^2 = \dfrac{\sigma_X^2}{n}$, where $\sigma_{\bar{X}} = \dfrac{\sigma_X}{\sqrt{n}}$
is called the standard error of the mean (SEM) - Fact 2 from before.
Note that $\sigma_X$ refers to variation related to a single sample and
$\sigma_X / \sqrt{n}$, the SEM, refers to variation among samples.
3) We also noticed that the larger the sample size n got, the more the distribution
of those sample means looked like a normal distribution.
The Central Limit Theorem states the following:
Given the notation we have used above for the original population of fasting
triglycerides and the notation for the sampling distribution of means, for large $n$,
$\bar{X}$ is approximately normally distributed with mean = $\mu_X$ (Fact 1: $\mu_{\bar{X}} = \mu_X$)
and variance = $\sigma_X^2 / n$ (Fact 2: $\sigma_{\bar{X}}^2 = \sigma_X^2 / n$) regardless of the
distribution of $X$.
If $X$ is distributed normally, then $\bar{X}$ is also distributed normally (as opposed to
approximately normally).
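The two facts and the theorem above can be checked with a small simulation. The sketch below is only an illustration in Python: the exponential population (mean 1, standard deviation 1), the sample size of 50, and the function name `simulate_sample_means` are all assumptions chosen for the example, not part of the triglyceride data.

```python
import random
import statistics

def simulate_sample_means(n, num_samples, seed=0):
    """Draw num_samples samples of size n from an exponential population
    (mean 1, standard deviation 1) and return the list of sample means."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(num_samples)
    ]

n = 50  # assumed sample size for the illustration
means = simulate_sample_means(n, num_samples=5000)

mean_of_means = statistics.fmean(means)  # Fact 1: should be near mu_X = 1
sd_of_means = statistics.stdev(means)    # Fact 2: should be near sigma_X / sqrt(n)

print(f"mean of sample means: {mean_of_means:.3f} (theory: 1.000)")
print(f"SD of sample means:   {sd_of_means:.3f} (theory: {1 / n ** 0.5:.3f})")
```

Even though the exponential population is strongly skewed, the simulated means cluster around the population mean with a spread close to the SEM.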
Now our problem is: how do we know if the sample mean is a good estimate of the
population mean? Let us say that the graph below is the distribution of the means for
fasting triglycerides (AFTRIG) of all samples of size n from the U.S. population of
those aged 55+.
Looking at the histogram of the sampling distribution below we would probably be
willing to say that the means represented by the bar on the far right end (the bar with
square dots) of the distribution are not good estimates for the mean of the distribution of
the original AFTRIG values because they are probably not what we would be willing to
call “close” to the mean of the distribution of sample means (i.e. $\mu_{\bar{X}}$).
But what about the means represented by the striped bar in the graph below? This is
where our problems begin.
We are clearly going to need some sort of measure of how certain we are that the mean
of our sample is a reasonable estimate of the population mean. This is where
confidence intervals come in.
Confidence intervals are going to be defined such that given a 95% confidence interval,
we will be 95% confident that $\mu_{\bar{X}}$ (and hence $\mu_X$) lies within our interval. So in
obtaining a 95% confidence interval for $\mu_{\bar{X}}$, we will have also obtained an interval for
the original population mean $\mu_X$.
Just as we have only one sample and one sample mean, we will have only one
confidence interval based on that sample and its mean.
If, however, we had all possible samples, we could get a confidence interval for the
mean of each sample. Then the interpretation of the 95% confidence interval is that we
are confident that 95% of these intervals contain the original population mean ($\mu_X$).
Looking at the graph below of the confidence intervals, we notice that 3 of the intervals
(the dashed ones) do not contain the population mean. The very top confidence interval
does not contain the mean because confidence intervals will be defined as open
intervals (i.e. intervals that do not contain their endpoints). The other two dashed
confidence intervals don’t even come particularly close to the mean.
[Figure: 95% CI's for the sample means assuming we know $\sigma$, with a reference line at
$\mu_{\bar{X}} = \mu_X$. Each interval is centered about a sample mean. Each interval is the
same length because $\sigma$ is known.]
The intervals are all of the same length because (as we will show) the length of each
interval depends on the sample size n (remember all samples from the sampling
distribution have the same size) and on the size of $\sigma$ when $\sigma$ is known.
We’ll show later that when $\sigma$ is not known, we can calculate the confidence interval
using the sample estimate of $\sigma$, namely s. In this case the lengths of the intervals will
vary as s varies from sample to sample.
There are actually 3 kinds of intervals that we can use: prediction, confidence and
tolerance intervals. We won’t do much with prediction and tolerance intervals until we
get to regression, but I will describe all three kinds of intervals here.
This example is taken from Forthofer and Lee’s (2007) book Biostatistics. Dairies
add vitamin D to milk for the purpose of fortification. The recommended amount of
vitamin D to be added to a quart of milk is 400 IUs (10 μg). If a dairy adds too much
vitamin D, perhaps over 5000 IUs, the amount of vitamin D could be toxic.
A prediction interval focuses on a single observation of the variable - for example, the
amount of vitamin D in the next bottle of milk.
A confidence interval focuses on a population parameter - for example, the mean or
median of vitamin D in a population of bottles of milk. Thus, the prediction interval is of
more interest to the consumer of the next bottle of milk, whereas the confidence interval
is of more interest to the dairy.
A tolerance interval provides limits such that there is a high level of confidence that a
large portion of the values of the variable will fall within them. For example, besides
being interested in the mean, the dairy owner or regulatory agency also wants to be
confident that for a large portion of the bottles the vitamin D contents are within a
specified tolerance of the value of 400 IUs.
So back to confidence intervals. The picture of the confidence intervals above is a nice
graphic, but how do we actually calculate the confidence interval for our sample mean?
Confidence Intervals
Below we give the confidence interval for the random variable $\bar{X}$ under the
conditions that the random variable $X$
   is normally distributed
   has an unknown mean $\mu_X$ and
   has a known variance $\sigma_X^2$
It is not usually the case that we know $\sigma_X^2$ but we present this simplest version of the
confidence interval first.
So let $\bar{X}$ be the random variable associated with the sampling distribution of means
of samples of size n drawn from the distribution with random variable $X$.
The Central Limit Theorem says: for n large enough
$\bar{X} \approx N(\mu_{\bar{X}}, \sigma_{\bar{X}}^2)$ where $\mu_{\bar{X}} = \mu_X$ and $\sigma_{\bar{X}}^2 = \dfrac{\sigma_X^2}{n}$ (regardless
of the distribution of $X$).
[ $\approx$ = approximately.]
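[Figure: Density for $\bar{X} \approx N\left(\mu_{\bar{X}} = \mu_X,\ \sigma_{\bar{X}}^2 = \sigma_X^2 / n\right)$. The area between
$\mu_{\bar{X}} - 1.96\sigma_{\bar{X}}$ and $\mu_{\bar{X}} + 1.96\sigma_{\bar{X}}$ is 95%, with 2.5% in each tail. The horizontal
axis is marked at $\mu_{\bar{X}}$, $\mu_{\bar{X}} \pm 1\sigma_{\bar{X}}$, $\mu_{\bar{X}} \pm 1.96\sigma_{\bar{X}}$ and $\mu_{\bar{X}} \pm 3\sigma_{\bar{X}}$.]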
Note that the areas and standard deviations in the graph above were derived under the
assumption that $\bar{X}$ is close enough to being normally distributed to not make any
difference.
How did I decide that the area under the normal density associated with $\bar{X}$, above the
x-axis and between $\mu_{\bar{X}} - 1.96\sigma_{\bar{X}}$ and $\mu_{\bar{X}} + 1.96\sigma_{\bar{X}}$ is 95% of the total
area under the curve? Well, $\mu_{\bar{X}} - 1.96\sigma_{\bar{X}}$ is 1.96 standard deviations ($\sigma_{\bar{X}}$)
below the mean ($\mu_{\bar{X}}$) of the normal distribution [$\bar{X} \approx N(\mu_{\bar{X}}, \sigma_{\bar{X}}^2)$] and
$\mu_{\bar{X}} + 1.96\sigma_{\bar{X}}$ is 1.96 standard deviations above the mean.
We learned earlier that from 1.96 standard deviations below the mean to 1.96 standard
deviations above the mean cuts off 95% of the area under the curve for any normal
distribution (i.e. this is part of what we learned when we showed that any normal
distribution could be mapped into the standard normal distribution $Z \sim N(0, 1)$).
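As a quick numerical check of this claim, the sketch below uses Python's standard-library `statistics.NormalDist` to compute the area within 1.96 standard deviations of the mean. The helper name `area_within` and the second distribution (mean 150, standard deviation 40) are hypothetical choices made only to show that the answer does not depend on which normal distribution is used.

```python
from statistics import NormalDist

def area_within(k, mu=0.0, sigma=1.0):
    """Area under the N(mu, sigma^2) density between mu - k*sigma and mu + k*sigma."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

# Standard normal distribution Z ~ N(0, 1)
print(round(area_within(1.96), 4))

# A hypothetical normal distribution with mean 150 and SD 40
print(round(area_within(1.96, mu=150.0, sigma=40.0), 4))
```

Both calls give an area of about 0.95, which is why the same cutoff of 1.96 can be used for the sampling distribution of the mean.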
So for n large enough we have
$$\Pr\left(\mu_{\bar{X}} - 1.96\,\sigma_{\bar{X}} < \bar{X} < \mu_{\bar{X}} + 1.96\,\sigma_{\bar{X}}\right) = 0.95 \qquad \text{(Equation 1)}$$
[Aside: Notice above that I have used < rather than ≤ because although it doesn’t
make any difference which you use in terms of the probability of a continuous
distribution, confidence intervals are always written as open intervals.]
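But according to the Central Limit Theorem
$\mu_{\bar{X}} = \mu_X$ and $\sigma_{\bar{X}} = \dfrac{\sigma_X}{\sqrt{n}}$
So Equation 1 becomes
$$\Pr\left(\mu_X - 1.96\,\frac{\sigma_X}{\sqrt{n}} < \bar{X} < \mu_X + 1.96\,\frac{\sigma_X}{\sqrt{n}}\right) = 0.95 \qquad \text{(Equation 2)}$$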
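But we want $\mu_X$ in the middle and $\bar{X}$ on the ends, so we subtract $\mu_X$ across all
parts of the inequality in Equation 2 and get
$$\Pr\left(-1.96\,\frac{\sigma_X}{\sqrt{n}} < \bar{X} - \mu_X < 1.96\,\frac{\sigma_X}{\sqrt{n}}\right) = 0.95 \qquad \text{(Equation 3)}$$
Now subtract $\bar{X}$ across all parts of the inequality in Equation 3 and get
$$\Pr\left(-\bar{X} - 1.96\,\frac{\sigma_X}{\sqrt{n}} < -\mu_X < -\bar{X} + 1.96\,\frac{\sigma_X}{\sqrt{n}}\right) = 0.95 \qquad \text{(Equation 4)}$$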
Now multiply by -1 across all parts of the inequality in Equation 4 (note this reverses the
inequalities)
$$\Pr\left(\bar{X} + 1.96\,\frac{\sigma_X}{\sqrt{n}} > \mu_X > \bar{X} - 1.96\,\frac{\sigma_X}{\sqrt{n}}\right) = 0.95 \qquad \text{(Equation 5)}$$
Now just put the smaller endpoint of Equation 5 on the left and the larger on the right.
$$\Pr\left(\bar{X} - 1.96\,\frac{\sigma_X}{\sqrt{n}} < \mu_X < \bar{X} + 1.96\,\frac{\sigma_X}{\sqrt{n}}\right) = 0.95 \qquad \text{(Equation 6)}$$
Below we switch from probability to confidence because $\bar{X}$ is a random variable for
which probability is appropriate but $\bar{x}$ is the mean of a particular sample. Once we use
the sample mean, the population mean $\mu_X$ either is or is not in the interval and
probability is no longer appropriate.
So our 95% confidence interval is
$$\left(\bar{x} - 1.96\,\frac{\sigma_X}{\sqrt{n}},\ \ \bar{x} + 1.96\,\frac{\sigma_X}{\sqrt{n}}\right)$$
On the N(0,1) curve the area to the right of 1.96 is 0.025 or 2.5%. Or the area to the left
of 1.96 is 0.975 or 1 - 0.025.
This means we could denote 1.96 as $z_{0.975} = z_{1-0.025}$. Or if we let $\alpha = 0.05$, so that
$\alpha/2 = 0.025$, then more generally we have $z_{1-(\alpha/2)}$. This pattern will work regardless
of the value of $\alpha$.
Well what do we do about -1.96? We’ll use $-z_{1-(\alpha/2)}$.
Therefore, the general form of the $100(1 - \alpha)\%$ confidence interval is
$$\left(\bar{x} - z_{1-(\alpha/2)}\,\frac{\sigma_X}{\sqrt{n}},\ \ \bar{x} + z_{1-(\alpha/2)}\,\frac{\sigma_X}{\sqrt{n}}\right)$$
Usually we don’t have to work so hard to distinguish between $X$ and $\bar{X}$ and their
means and variances. This is because the random variable $\bar{X}$ is not usually part of
the conversation. We have only used it to derive the formula for the confidence interval.
This means we can just say that the distribution for the random variable $X$ has
mean $\mu$ and standard deviation $\sigma$. So the commonly used form of the $100(1 - \alpha)\%$
confidence interval is
$$\left(\bar{x} - z_{1-(\alpha/2)}\,\frac{\sigma}{\sqrt{n}},\ \ \bar{x} + z_{1-(\alpha/2)}\,\frac{\sigma}{\sqrt{n}}\right)$$
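The formula above translates directly into a few lines of code. The following is a minimal Python sketch, not part of the course materials: the function name `z_confidence_interval` and the inputs in the usage lines (sample mean 100, known σ of 15, n = 36) are hypothetical and chosen only for illustration. It uses `statistics.NormalDist().inv_cdf` from the Python standard library to obtain the cutoff.

```python
from math import sqrt
from statistics import NormalDist

def z_confidence_interval(xbar, sigma, n, alpha=0.05):
    """Return (lower, upper) for the known-sigma confidence interval
    xbar -/+ z_{1 - alpha/2} * sigma / sqrt(n)."""
    if n <= 0 or sigma <= 0 or not 0 < alpha < 1:
        raise ValueError("need n > 0, sigma > 0 and 0 < alpha < 1")
    z = NormalDist().inv_cdf(1 - alpha / 2)  # cutoff z_{1 - alpha/2}
    margin = z * sigma / sqrt(n)             # half the length of the interval
    return xbar - margin, xbar + margin

# Hypothetical inputs: sample mean 100, known sigma 15, n = 36.
lower, upper = z_confidence_interval(100.0, 15.0, 36, alpha=0.05)
print(f"95% CI: ({lower:.2f}, {upper:.2f})")
```

Passing `alpha=0.10` instead gives the corresponding 90% interval from the same sample mean.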
In the above formula $\bar{x}$ is the mean of a single sample and is not a random variable.
The confidence for the interval above is $1 - \alpha$.
So if $\alpha = 0.10$, then $1 - \alpha = 0.90$ and we would have a 90% confidence interval.
So $\alpha/2 = 0.05$. Therefore, an area equal to 0.05 is cut off each end of the distribution.
The length of the confidence interval is
$$2\,z_{1-(\alpha/2)}\,\frac{\sigma}{\sqrt{n}}$$
As we select different samples of size n, we get different values for $\bar{x}$. So the location
of the confidence interval changes. However, the length of the confidence interval
remains the same because $\sigma$ is known and the samples are all of size n.
Find the 95% confidence interval for the baseline heart rate in beats/min for the
Propranolol treatment group (Cardiology Problem 6.81 on page 222 of Rosner; also
see the original description of the problem in Cardiovascular Disease on page 157).
Let us suppose that σ, the standard deviation of the baseline heart rate for Propranolol,
is known and is equal to 17 beats/minute. The Stata data set for this problem is
nifed.dta.
. des

Contains data from C:\Stata\StataData\Myfiles\BiostatFall2003\Data\nifed.dta
  obs:            34
 vars:            10                          22 Oct 2002 20:53
 size:         1,496 (99.9% of memory free)
------------------------------------------------------------------------------
              storage  display     value
variable name   type   format      label      variable label
------------------------------------------------------------------------------
id              float  %12.0g
trtgrp          float  %11.0g      trt        Treatment Group
heartlv0        float  %12.0g                 Baseline Heart Rate beats/min
heartlv1        float  %12.0g                 Level 1 Heart Rate beats/min
heartlv2        float  %12.0g                 Level 2 Heart Rate beats/min
heartlv3        float  %12.0g                 Level 3 Heart Rate beats/min
syslv0          float  %12.0g                 Baseline Systolic Blood Pressure mmHg
syslv1          float  %12.0g                 Level 1 Systolic Blood Pressure mmHg
syslv2          float  %12.0g                 Level 2 Systolic Blood Pressure mmHg
syslv3          float  %12.0g                 Level 3 Systolic Blood Pressure mmHg
------------------------------------------------------------------------------

. tab trtgrp

  Treatment |
      Group |      Freq.     Percent        Cum.
------------+-----------------------------------
 nifedipine |         18       52.94       52.94
propranolol |         16       47.06      100.00
------------+-----------------------------------
      Total |         34      100.00

. label list
trt:
0 nifedipine
1 propranolol
Since we have not used this data set before, I have run codebook for treatment group
and for baseline heart rate so we can see what we have.
. codebook

------------------------------------------------------------------------------
trtgrp                                                         Treatment Group
------------------------------------------------------------------------------
                  type:  numeric (float)
                 label:  trt

                 range:  [0,1]                        units:  1
         unique values:  2                        missing .:  0/34

            tabulation:  Freq.   Numeric  Label
                            18         0  nifedipine
                            16         1  propranolol

------------------------------------------------------------------------------
heartlv0                                         Baseline Heart Rate beats/min
------------------------------------------------------------------------------
                  type:  numeric (float)

                 range:  [51,116]                     units:  1
         unique values:  21                       missing .:  0/34

                  mean:   74.1176
              std. dev:   18.6544

           percentiles:        10%       25%       50%       75%       90%
                                54        56        71        90       100

The baseline heart rate in beats/minute is denoted heartlv0 and trtgrp = 1 is the
propranolol treatment group.
. sum(heartlv0) if trtgrp == 1

    Variable |       Obs        Mean    Std. Dev.       Min        Max
-------------+--------------------------------------------------------
    heartlv0 |        16       76.81       17.95         54        105

So $\bar{x} = 76.81$ and $\sigma = 17$ (i.e. we don’t use s = 17.95 because $\sigma$ is known).
Since we are assuming n is large enough to assume normality, the 95% confidence
interval is
$$\left(76.81 - 1.96\left(\frac{17}{\sqrt{16}}\right),\ \ 76.81 + 1.96\left(\frac{17}{\sqrt{16}}\right)\right) = (68.48,\ 85.14)$$
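For readers who want to check the arithmetic, the steps behind this interval can be written out as follows, using the rounded cutoff $z_{0.975} = 1.96$ and the values $\bar{x} = 76.81$, $\sigma = 17$ and $n = 16$ given above:

```latex
\begin{align*}
\frac{\sigma}{\sqrt{n}} &= \frac{17}{\sqrt{16}} = \frac{17}{4} = 4.25 \\[4pt]
z_{0.975}\,\frac{\sigma}{\sqrt{n}} &= 1.96 \times 4.25 = 8.33 \\[4pt]
\bar{x} - 8.33 &= 76.81 - 8.33 = 68.48 \\[4pt]
\bar{x} + 8.33 &= 76.81 + 8.33 = 85.14
\end{align*}
```

The first line is the standard error of the mean and the second is half the length of the interval.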
We are confident that 95% of all such confidence intervals cover μ, the mean of the
population (i.e. all people treated with Propranolol) baseline heart rate. That is what we
mean when we say we are 95% confident that μ lies between 68.48 and 85.14.
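This "95% of all such intervals" interpretation can be illustrated by simulation. The sketch below is a hedged illustration only: it assumes a normally distributed population with a made-up mean of 75, borrows σ = 17 and n = 16 from the example, and the function name `coverage` is hypothetical. It draws many samples, builds the known-σ interval for each, and counts how often the interval contains the population mean.

```python
import random
from math import sqrt
from statistics import NormalDist, fmean

def coverage(mu, sigma, n, alpha=0.05, num_samples=4000, seed=1):
    """Fraction of known-sigma confidence intervals, one per simulated
    sample of size n from N(mu, sigma^2), that contain mu."""
    rng = random.Random(seed)
    margin = NormalDist().inv_cdf(1 - alpha / 2) * sigma / sqrt(n)
    hits = 0
    for _ in range(num_samples):
        xbar = fmean([rng.gauss(mu, sigma) for _ in range(n)])
        if xbar - margin < mu < xbar + margin:  # open interval, as in the notes
            hits += 1
    return hits / num_samples

# mu = 75 is a made-up population mean; sigma = 17 and n = 16 echo the example.
share = coverage(75.0, 17.0, 16)
print(f"share of intervals covering mu: {share:.3f}")
```

The share comes out close to 0.95. Any single interval either contains μ or it does not, which is why we speak of confidence rather than probability once the sample is in hand.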
When assuming normality our equation for the confidence interval implies that the
confidence interval is centered about the sample mean. So when you are carefully
double-checking your work, you’ll want to make sure that the confidence interval you
have gotten actually contains the sample mean.
What impacts the length of the confidence interval?
Remember that the length of the confidence interval is
$$2\,z_{1-(\alpha/2)}\,\frac{\sigma}{\sqrt{n}}$$
1) Sample size n
As n increases, the length of the confidence interval decreases. So there is an
inverse relationship between the sample size n and the length of the confidence
interval. Note that, for a given level of confidence, shorter confidence intervals are better.
x and y are inversely related if one increases as the other decreases.
2) The standard deviation or variance.
As the standard deviation or variance increases, the length of the confidence interval
increases. So there is a direct relationship between the size of $\sigma$ and the length of
the confidence interval.
x and y are directly related if they both increase or they both decrease.
3) The $\alpha$-level.
As $\alpha$ increases (meaning the confidence decreases), the length of the confidence
interval decreases. So there is an inverse relationship between the size of $\alpha$ and the
length of the confidence interval.
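The three relationships above can be seen by changing one input at a time in the length formula. This is a small Python sketch under assumed inputs: the baseline (σ = 17, n = 16, α = 0.05) comes from the heart rate example, while the alternative values and the function name `ci_length` are chosen only for illustration.

```python
from math import sqrt
from statistics import NormalDist

def ci_length(sigma, n, alpha):
    """Length 2 * z_{1 - alpha/2} * sigma / sqrt(n) of the known-sigma interval."""
    return 2 * NormalDist().inv_cdf(1 - alpha / 2) * sigma / sqrt(n)

base = ci_length(sigma=17, n=16, alpha=0.05)
print(f"baseline (sigma=17, n=16, alpha=0.05): {base:.2f}")
print(f"larger n (n=64):                       {ci_length(17, 64, 0.05):.2f}")
print(f"larger sigma (sigma=34):               {ci_length(34, 16, 0.05):.2f}")
print(f"larger alpha (alpha=0.10):             {ci_length(17, 16, 0.10):.2f}")
```

Because n enters through its square root, quadrupling the sample size only halves the length, whereas doubling σ doubles it.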
Let us use the function invnormal(p) = z where p is the probability or area and z is
the cutoff.
We can write the equation as invnormal(1 - ( α /2)) = z.
Suppose that $\alpha = 0.05$ (i.e. we are talking about a 95% confidence interval).
This means that an area of 0.025 will be cut off on each end of the normal
distribution. So we have
$1 - (\alpha/2) = 1 - 0.025 = 0.975$.
. di invnormal(1-(0.05/2))
1.959964
or
. di invnormal(0.975)
1.959964
So for $\alpha = 0.05$, $z_{1-(\alpha/2)} = z_{0.975} = 1.96$.
If $\alpha = 0.10$, then $1 - (\alpha/2) = 1 - 0.05 = 0.95$
. di invnormal(1 - (0.10/2))
1.6448536
or
. di invnormal(0.95)
1.6448536
So $z_{1-(\alpha/2)} = z_{0.95} = 1.64$.
So $\alpha_1 = 0.05$ produces a z value of 1.96 and $\alpha_2 = 0.10$ produces a z value of 1.64.
So the larger of the two $\alpha$'s produces the smaller z value and hence the shorter
confidence interval.
If α = 0.05, then we have a 95% [i.e. 100(1 - α)%] confidence interval. If α = 0.10,
then we have a 90% confidence interval. So less confidence and shorter confidence
intervals go together.
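For readers working outside Stata, the same cutoffs can be obtained in Python, where `statistics.NormalDist().inv_cdf(p)` from the standard library plays the role of `invnormal(p)`. This is only a sketch; the helper name `z_cutoff` is hypothetical, and α = 0.01 is added purely as an extra illustration beyond the two levels discussed above.

```python
from statistics import NormalDist

def z_cutoff(alpha):
    """Cutoff z_{1 - alpha/2} of the standard normal distribution."""
    return NormalDist().inv_cdf(1 - alpha / 2)

for alpha in (0.10, 0.05, 0.01):
    confidence = 100 * (1 - alpha)
    print(f"alpha = {alpha:.2f}  confidence = {confidence:.0f}%  z = {z_cutoff(alpha):.4f}")
```

Reading down the output, the cutoff grows as α shrinks, so more confidence goes with a longer interval.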