Statistics 215 Lab Materials
Confidence Intervals on mu
This topic officially moves us from probability to statistics. We begin to discuss making inferences about
the population. One way to differentiate probability from statistics is to think of probability as the process
of making an inference about a subset of the population (a sample) when we know the attributes of the
entire population. This was the case in the previous chapters. We (pretended that we) knew the entire
population or the entire distribution. We then were able to discuss probabilities for how a single
observation might behave, as well as how the average of several observations might behave. Thus in
probability we had the whole population and we want to know about an observation or a group of
observations. Now in statistics this process reverses. In statistics we will have a sample, a collection of
observations, from the population. From this subset, the sample, we want to be able to make statements
about the population.
Background
The ideas presented in this section depend heavily on concepts from the previous chapters. Specifically the
idea of variability from sample to sample is crucial. As we saw previously, each time we take a sample we
get a different value for the sample mean, and these values differ from the population mean. One
consequence of this is that the sample mean alone is not the best estimate of the population mean,
because the sample mean fluctuates: each sample gives us a different value. The idea of a
confidence interval is that instead of simply using a single number, we use an interval, a range of numbers,
to estimate the population mean.
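The sample-to-sample variability described above can be seen in a short simulation. This is only a sketch: the population (Normal with mean 100 and standard deviation 15), the sample size of 38, and the function name `sample_means` are assumptions chosen for illustration, not values from these notes.

```python
import random
import statistics

# Hypothetical population: Normal with mean 100 and standard deviation 15.
# These values are assumptions made only for this illustration.
POP_MEAN = 100.0
POP_SD = 15.0

def sample_means(num_samples, n, seed=215):
    """Draw num_samples samples of size n and return each sample's mean."""
    rng = random.Random(seed)
    means = []
    for _ in range(num_samples):
        sample = [rng.gauss(POP_MEAN, POP_SD) for _ in range(n)]
        means.append(statistics.mean(sample))
    return means

means = sample_means(num_samples=5, n=38)
for i, m in enumerate(means, start=1):
    print(f"sample {i}: mean = {m:.2f}")
```

Each of the five samples gives a different sample mean, and none of them equals the population mean exactly, which is why a single number is not enough.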
Definitions and preliminaries
Definition: A parameter is a numerical quantity that summarizes a characteristic of the entire population.
Definition: A statistic is a numerical quantity that summarizes a characteristic of a subset of the population,
usually this is a sample.
Example:
We differentiate here between µ and σ, which are parameters, and x̄ and s_x, which are statistics. Recall that µ
is the population mean and σ is the population standard deviation, while x̄ is the sample mean and s_x is the
sample standard deviation. µ and σ are numerical summaries for the entire population. x̄ and s_x are
calculated from a subset of the population.
Definition: A point estimate of a parameter is a single number estimate of that parameter.
Definition: An interval estimate of a parameter is a range of numbers used to estimate a parameter.
Example:
Suppose that we are interested in estimating the mean weight of all black bears in West Virginia. We are
able to weigh 38 black bears. From the 38 observations, we want to make statements about the entire
population. That is we want to estimate the average weight of the population of black bears in West
Virginia. A point estimate for this parameter, the mean weight of all black bears in WV, is 457 pounds.
An interval estimate would be that the mean weight of all black bears is between 428 and 497 pounds.
Now it is important to note that the mean weight does not change. Rather it is our knowledge that is
imperfect. We have only 38 bears from the entire population and as a consequence we do not know the
weight of all bears.
That idea is worth reiterating. The need for interval estimates comes from the fact that we do not have all
of the information that we want about the parameter, in the previous example the population mean. That is,
we have a subset of the population or a sample not the entire population. We know that the sample mean
varies and that it is often not the same as the population mean. Consequently, we cannot use the sample
mean alone to estimate the population mean. Thus what we will do is specify a range of values that are
plausible based on our sample. It is worth noting that unless we specify a range of values that goes from
negative infinity to positive infinity, we can never guarantee that the population mean will be in the interval
estimate for that mean. No method will “capture” or have the population mean inside the interval for every
sample; however, statistical methods can specify the percentage of time that an interval will ‘miss’ the
parameter of interest.
When we use the word ‘capture’ in this topic it has a special meaning. It is important to remember that the
value of the population mean or any parameter is constant. Consequently, when we are considering an
interval estimate, if the parameter is inside the interval we will consider the parameter’s value to be
‘captured’ by the interval. For the parameter to be captured, it must fall between the endpoints of the
interval; an interval will have an upper and a lower endpoint.
Definition: A (1-α)*100% confidence interval for a parameter is an interval estimate that through repeating
the process of taking a sample and making a confidence interval from that sample, will capture the
parameter (1-α)*100% of the time.
Definition: The confidence level for a confidence interval is the percentage of times that it will capture the
parameter of interest in repeated sampling.
Example:
The confidence level for a 95% confidence interval is 0.95. The confidence level for an 84% confidence
interval is 0.84. The confidence level is often denoted by (1-α)*100%. The reason for this is to allow each
researcher to specify his or her own level of confidence. α = 0.05 yields a confidence level of 0.95, while α =
0.10 yields a confidence level of 0.90.
The definition of a confidence interval needs some explanation (maybe plenty of explanation). Each
confidence interval is calculated from a sample. This sample is a subset of the population. Previously, we
saw that each sample of n observations was different. The idea of repeating the process of taking a sample
described in the definition above is just that, each sample will be different. As we will see shortly (when
we talk about calculations), since each sample is different, each confidence interval will be different.
Because of the variability from sample to sample, some of the confidence intervals that we make will not
‘capture’ the parameter it is trying to estimate.
The difficulty with confidence intervals is this: We DON’T get to know the value of the parameter we are
trying to estimate; so we don’t know which intervals capture the parameter and which don’t. Remember
that the parameter is a quantity calculated from the population. In Statistics, we only see a subset of the
population, so we cannot know the value of the parameter. Thus, we will make a confidence interval and we
will not know whether the parameter has been captured. Instead we must be content to know that the
procedure works a certain percentage of the time, (1-α)*100% to be exact. The procedure begins with
selecting units to be sampled, getting information from those units and then making our calculations.
However, we don’t know if the confidence interval that we have created is one of those (1-α)*100% of
times that we ‘captured’ the parameter or one of the α*100% of the time that we do not.
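The repeated-sampling idea can be made concrete with a simulation in which, unlike in real research, we do know the parameter. The sketch below assumes a hypothetical Normal population (mean 100, standard deviation 15) and builds each interval with the large-sample formula developed later in these notes, sample mean ± 1.96 times s_x/√n. The function name `coverage` and all numeric settings are illustrative assumptions.

```python
import math
import random
import statistics

def coverage(num_samples=2000, n=40, mu=100.0, sigma=15.0, z=1.96, seed=1):
    """Fraction of simulated 95% intervals that capture the known mean mu."""
    rng = random.Random(seed)
    captured = 0
    for _ in range(num_samples):
        # Take a fresh sample and build an interval from it alone.
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = statistics.mean(sample)
        s = statistics.stdev(sample)
        margin = z * s / math.sqrt(n)
        # The interval 'captures' mu when mu lies between the endpoints.
        if xbar - margin < mu < xbar + margin:
            captured += 1
    return captured / num_samples

rate = coverage()
print(f"fraction of intervals that captured mu: {rate:.3f}")
```

The fraction printed should land close to 0.95. Any single interval either captured µ or it did not; the 95% describes the procedure over many repetitions.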
Consider this more concrete example. A 95% confidence interval is made for the mean height of Yellow
Poplars in West Virginia. This interval goes from 85.68 feet to 94.39 feet. This is based upon a sample of
56 trees taken from around the state. We say that we are 95% confident that the mean height of all Yellow
Poplars in WV is between 85.68 and 94.39 feet. However, since we don’t know the population mean we
cannot say with absolute certainty that it is between these numbers.
So why, if I can’t say anything with certainty using a confidence interval, would I use it at all? The answer
is simply that by using statistics we can accurately tell the percentage of times the process will fail. No
other methodology allows you to specify that percentage. Statistics allows for this, but forces a layer of
uncertainty into the discourse.
Confidence Intervals on mu (Small Sample Size)
When the number of observations in the sample is large (at least 30 observations), we can use the Central
Limit Theorem to help us construct a confidence interval on the unknown value of the population mean mu.
For large samples the CLT tells us that the distribution of the possible values of the sample mean is
approximately normal. We can use the standard normal distribution to find the critical value (z) that is
part of the (margin of) error term.
Let’s start with a 95% confidence interval for the population mean. 95% is a common choice for the
confidence level.
Note that
\[ 0.95 = P(-1.96 < Z < 1.96) = P\left(-1.96 < \frac{\bar{X} - \mu}{\sigma_x / \sqrt{n}} < 1.96\right) \]
Some algebra later:
\[ = P\left(\mu - 1.96\,\frac{\sigma_x}{\sqrt{n}} < \bar{X} < \mu + 1.96\,\frac{\sigma_x}{\sqrt{n}}\right) \]
The above expression is a statement of probability about X̄, if we know the values for µ and σ_x. This
follows what we did in the previous chapters when we pretended that we knew the value of µ or σ_x or m or
p.
Some more algebra later, we can turn this into an interval for µ,
\[ = P\left(\bar{X} - 1.96\,\frac{\sigma_x}{\sqrt{n}} < \mu < \bar{X} + 1.96\,\frac{\sigma_x}{\sqrt{n}}\right) = 0.95. \]
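This expression provides the endpoints of the CI:
\[ \bar{X} - 1.96\,\frac{\sigma_x}{\sqrt{n}} < \mu < \bar{X} + 1.96\,\frac{\sigma_x}{\sqrt{n}} \]
which we can write more succinctly (and in a more general fashion) as
\[ \bar{X} \pm z(1-\alpha/2)\,\frac{\sigma_x}{\sqrt{n}} \]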
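The critical value z(1-α/2) is usually read from the Standard Normal table, but it can also be computed. The sketch below uses `NormalDist` from Python's standard `statistics` module; the helper name `z_critical` is an assumption introduced here for illustration.

```python
from statistics import NormalDist

def z_critical(alpha):
    """Return z(1 - alpha/2): the standard normal value with area alpha/2 to its right."""
    return NormalDist().inv_cdf(1 - alpha / 2)

for alpha in (0.10, 0.05, 0.01):
    print(f"alpha = {alpha:.2f}: z = {z_critical(alpha):.3f}")
```

For α = 0.05 this reproduces the 1.96 used in the derivation above.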
But we don’t know the value of σ_x. When the value of σ_x is unknown (always the case in the real world),
we can substitute our estimate s_x for σ_x. Our CI is computed as
\[ \bar{X} \pm t(n-1,\,\alpha/2)\,\frac{s_x}{\sqrt{n}} \]
After we have computed x̄ and s_x (and not µ and σ_x), we can construct a confidence interval on the
unknown value of mu; however, there are two direct results of this.
1. The critical value in the error term no longer comes from a normal distribution; it comes from something
called a t-distribution.
2. We no longer have a statement of probability; we have a statement of confidence.
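The first result can be checked by simulation. The sketch below assumes samples of size 5 from a standard Normal population and standardizes each sample mean twice: once with the true σ and once with the sample standard deviation. The function name `exceed_rate` and all settings are illustrative assumptions.

```python
import math
import random
import statistics

def exceed_rate(n, use_sample_sd, num_samples=10000, mu=0.0, sigma=1.0, seed=7):
    """Fraction of samples whose standardized mean falls outside -1.96 to 1.96."""
    rng = random.Random(seed)
    outside = 0
    for _ in range(num_samples):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = statistics.mean(sample)
        # Standardize with either the sample SD or the (normally unknown) sigma.
        sd = statistics.stdev(sample) if use_sample_sd else sigma
        stat = (xbar - mu) / (sd / math.sqrt(n))
        if abs(stat) > 1.96:
            outside += 1
    return outside / num_samples

rate_sigma = exceed_rate(n=5, use_sample_sd=False)
rate_s = exceed_rate(n=5, use_sample_sd=True)
print(f"outside +/-1.96 using sigma: {rate_sigma:.3f}")
print(f"outside +/-1.96 using s_x:   {rate_s:.3f}")
```

With σ the statistic falls outside ±1.96 about 5% of the time, as the normal distribution predicts. With s_x it does so noticeably more often, so 1.96 is too small a critical value for small samples.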
The t-distribution
The t-distribution is similar to the z distribution or ‘standard normal’ distribution. It is based upon taking a
sample of n observations from a Normal distribution with mean µ and standard deviation σ.
A R.V. T will have a t-distribution with n-1 degrees of freedom if
\[ T = \frac{\bar{X} - \mu}{s_x / \sqrt{n}}. \]
The t-distribution is defined by the number of degrees of freedom much like the Poisson was indexed by
λ . Degrees of freedom is a parameter just as λ was for the Poisson. The mean of a t random variable is
0. This distribution is symmetric and unimodal, but it has slightly more variability than the Normal
distribution. We will use the percentiles of the t-distribution quite frequently throughout the rest of the
course. Consequently, we have specific notation for it.
The percentile of a t-distribution with q degrees of freedom that has area α to its right, that is, the
(1-α)*100th percentile, will be denoted by t(q, α).
Example:
t(25, 0.05) would be the 95th percentile of a t R.V. with 25 degrees of freedom.
t(38,0.10) would be the 90th percentile of a t R.V. with 38 degrees of freedom.
For calculating these percentiles we use Table 4, Page 516.
This table has the degrees of freedom in the first column and percentiles in the other columns. The book
uses α to represent the area to the right of the percentile, thus if we want the 90th percentile from the table,
we need to look in the column with α = 0.10. Likewise the 99th percentile can be found in the column that
is designated by α = 0.01.
Example:
t(14, 0.05) = 1.761
t(25, 0.01) = 2.485
t(30, 0.10) = 1.310
Table 4 does not contain all possible values for degrees of freedom. For example, if the degrees of freedom
are 30 or more, then you would use the Standard Normal table (Table 3) to estimate the corresponding t-value.
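When the printed table is not at hand, the same percentiles can be approximated numerically. The sketch below is one possible approach, not the method used by the textbook: it integrates the t density with Simpson's rule and finds the percentile by bisection, using only Python's standard library. The function names, the search range of 0 to 100, and the step counts are assumptions made for this illustration. The second argument follows the notation of these notes and is the area to the right of the value.

```python
import math

def t_pdf(x, df):
    """Density of the t-distribution with df degrees of freedom."""
    log_coef = math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
    coef = math.exp(log_coef) / math.sqrt(df * math.pi)
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def t_upper_area(x, df, steps=2000):
    """Area to the right of x >= 0, via Simpson's rule on the interval [0, x]."""
    h = x / steps
    total = t_pdf(0.0, df) + t_pdf(x, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    # Half of the area lies to the right of 0; subtract the part between 0 and x.
    return 0.5 - total * h / 3

def t_value(df, alpha):
    """Approximate t(df, alpha) for 0 < alpha < 0.5 by bisection on [0, 100]."""
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if t_upper_area(mid, df) > alpha:
            lo = mid  # too much area to the right, so move right
        else:
            hi = mid
    return (lo + hi) / 2

for df, alpha in [(14, 0.05), (25, 0.01), (30, 0.10)]:
    print(f"t({df}, {alpha}) is approximately {t_value(df, alpha):.3f}")
```

The three values printed agree with the table entries in the example above to three decimal places.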
Example:
Given a sample of 14 observations from a distribution that is known to be Normally distributed, construct a
99% confidence interval on the unknown value of the population parameter mu. From the data we have
calculated x̄ = 4.127 and s_x = 0.358. We can use the formula
\[ \bar{X} \pm t(n-1,\,\alpha/2)\,\frac{s_x}{\sqrt{n}} \]
since there are more than 2 observations and the data comes from a Normal distribution.
First, df = n - 1 = 14 - 1 = 13
and
α = 1 − C.L. = 1 − 0.99 = 0.01.
So
\[ \frac{\alpha}{2} = \frac{0.01}{2} = 0.005 \]
Then
\[
\begin{aligned}
\bar{X} \pm t(n-1,\,\alpha/2)\,\frac{s_x}{\sqrt{n}}
&= 4.127 \pm t(13,\,0.005)\,\frac{0.358}{\sqrt{14}} \\
&= 4.127 \pm 3.012\left(\frac{0.358}{\sqrt{14}}\right) \\
&= 4.127 \pm 3.012 \times 0.096 \\
&= 4.127 \pm 0.28915
\end{aligned}
\]
The endpoints of our confidence interval on mu are (3.838, 4.416).
So a 99% confidence interval for the population mean goes from 3.838 to 4.416. We interpret this by
saying that we are 99% confident that the population mean’s value falls within the interval 3.838 to 4.416.
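The arithmetic in this example can be reproduced in a few lines. The sketch below takes the critical value t(13, 0.005) = 3.012 from the table, as the example does; the variable names are assumptions chosen for readability.

```python
import math

n = 14          # number of observations
xbar = 4.127    # sample mean
s_x = 0.358     # sample standard deviation
t_crit = 3.012  # t(13, 0.005), read from the t-table as in the example

standard_error = s_x / math.sqrt(n)
margin = t_crit * standard_error
lower, upper = xbar - margin, xbar + margin
print(f"99% CI for mu: ({lower:.3f}, {upper:.3f})")
```

Because the code does not round the standard error to 0.096 before multiplying, its endpoints may differ from the hand calculation in the third decimal place.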
Confidence Intervals on mu (Large Sample Size)
When the sample size is large (at least 30 observations) a t-distribution with n-1 d.f. is virtually identical to
the Standard Normal distribution. So we can obtain our critical value from the Standard Normal
distribution table instead of the t-table. In the large sample situation, the formula for a (1-α)*100%
confidence interval on mu becomes
\[ \bar{X} \pm z(1-\alpha/2)\,\frac{s_x}{\sqrt{n}} \]
The t-value in the error term is replaced by a z-value from a Standard Normal distribution. All else remains
the same.
The formula above can be used to construct a (1-α)*100% confidence interval (CI) for the population
mean when
1. n (the number of observations) is more than 2 and the original data (the values of the variable X) are
approximately Normal, or
2. n is at least 30 (n≥30) (and we don’t know what distribution the data came from).
Example:
Suppose that we want to estimate the mean of a population. We have a sample of 48 observations from a
population. The mean of these observations is 290.34 and the standard deviation of these observations is
41.22. Then a 95% confidence interval for the population mean would be computed as follows.
We can use the formula below since n>30.
\[
\begin{aligned}
\bar{X} \pm z(1-\alpha/2)\,\frac{s_x}{\sqrt{n}}
&= 290.34 \pm 1.96 \times \frac{41.22}{\sqrt{48}} \\
&= 290.34 \pm 11.66 \\
&= (278.68,\ 302.00)
\end{aligned}
\]
(278.68, 302.00) is mathematical notation for an interval that goes from 278.68 to 302.00.
Thus a 95% confidence interval for the mean goes from 278.68 to 302.00. We interpret this by concluding
that we are 95% confident the population mean is between 278.68 and 302.00.
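The same calculation can be written as a small reusable function. This is a sketch under the large-sample assumption (n at least 30); the function name `large_sample_ci` and its parameters are illustrative, and the z-value comes from `NormalDist` in Python's standard `statistics` module rather than from Table 3.

```python
import math
from statistics import NormalDist

def large_sample_ci(xbar, s_x, n, confidence=0.95):
    """Return (lower, upper) for the large-sample CI: xbar +/- z * s_x / sqrt(n)."""
    alpha = 1 - confidence
    z = NormalDist().inv_cdf(1 - alpha / 2)  # z(1 - alpha/2)
    margin = z * s_x / math.sqrt(n)
    return xbar - margin, xbar + margin

lower, upper = large_sample_ci(xbar=290.34, s_x=41.22, n=48)
print(f"95% CI for mu: ({lower:.2f}, {upper:.2f})")
```

Changing `confidence` to 0.90 or 0.99 changes only the z-value, which matches the remark that all else remains the same.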
Confidence instead of probability
When we are dealing with parameters such as µ or σ, we are dealing with fixed quantities. As a
consequence, when we make a statement such as “the mean is between 8.5 and 19.4,” that statement is either
true or false. The mean is either in that interval or it is outside of it. This has implications for our
interpretation of a confidence interval.
If we create a 95% confidence interval for µ from say 175.46 to 176.32, then
P(175.46< µ< 176.32) ≠ 0.95. This probability, P(175.46< µ< 176.32) is either 0 or 1.
The value of the population mean is either inside the interval or it is not inside the interval. The confidence
that we assert comes with repetition of the process of taking samples and calculating the confidence
interval for each sample. However, for any individual interval we do not know whether the mean is
inside the interval or outside the interval. What we do know is that if we repeated the process of collecting
samples and making 95% confidence intervals for the population mean from each sample, then in the long
run 95% of those intervals would contain the population mean.
Summary:
The basic form of a confidence interval for a population parameter is as follows:
Point estimate ± percentile from a sampling distribution * standard error.
The point estimate is the best single number estimate for the parameter. The standard error is an estimate of
the variability from sample to sample for the point estimate. The percentile that is used is based upon the
confidence level that we want to use and the sampling distribution is determined by the type of parameter
that we are estimating.
(1-α)*100% Large Sample CI on mu:
\[ \bar{X} \pm z(1-\alpha/2)\,\frac{s_x}{\sqrt{n}} \]
(1-α)*100% Small Sample CI on mu:
\[ \bar{X} \pm t(n-1,\,\alpha/2)\,\frac{s_x}{\sqrt{n}} \]
The situation where σ is known NEVER happens in real world research.
We skipped this section in the lecture notes due to this fact.