CONFIDENCE INTERVALS I
ESTIMATION:
the sample mean x̄ is an estimate of the population mean µ
the point of sampling is to obtain estimates of population values
Example: of 55 students in Section 105, 45 work:
ps = 45/55 = 82%;
for those who work, the mean number of hours is x̄ = 14.76
inference: 82% of ASU students work, and those who work average 14.76 hours a week; that is, we
estimate p = 0.82 and µ = 14.76
Problem: the sampling distribution is a continuous distribution ⇒ the probability that x̄ exactly
equals µ is zero
x̄ is therefore almost never exactly equal to µ, and a single number gives us no way to state how
close to µ it is likely to be
x̄ is called a point estimate of µ
statisticians generally prefer to give an interval estimate:
"There is a 90% probability that µ is between 12 and 17.5."
interval estimate has two features
Ø the estimate in interval form
Ø a probability statement: taken as an assessment of the reliability or accuracy of the estimate
the probability hinges on the probabilities found in the sampling distribution of x̄
INTUITION
Consider the sampling distribution of sample means for samples of size 100 drawn from a
population of salaries in which µ = 33,000 and σ = 5,000
w E(x̄) = 33,000
w σx̄ = 5,000 ÷ √100 = 500.
In the z table find z values that demarcate the middle 95% of a normal distribution: z = ± 1.96
The interval µ − 1.96 × σx̄ to µ + 1.96 × σx̄ contains 95% of the sampling distribution
or contains 95% of all the possible x̄’s that could ever be drawn from this population
the interval noted is 33,000 ± 1.96 × 500 or 33,000 ± 980. Any sample mean in this interval differs
from the actual population mean by no more than 980.
⇒ 95% of all possible sample means differ from µ by no more than 980.
⇒ for any x̄, there is a 95% probability that it differs from µ by no more than 980; that is, by
no more than the amount 1.96 × σx̄
choose a sample and calculate x̄; now consider an interval of the form x̄ ± 1.96 × σx̄
necessarily, the population mean µ lies within the limits of 95% of all such intervals
⇒ there is a 95% probability that the population mean lies within the limits of an interval of the
form x̄ ± 1.96 × σx̄ computed from a randomly drawn sample
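The arithmetic in the salary example can be checked in a few lines. This is a minimal sketch using only Python's standard library; the variable names are illustrative and not part of the original notes.

```python
from math import sqrt
from statistics import NormalDist

mu, sigma, n = 33_000, 5_000, 100      # population values from the salary example

se = sigma / sqrt(n)                   # standard error of the mean
z = NormalDist().inv_cdf(0.975)        # z value that leaves 2.5% in the upper tail
margin = z * se                        # half-width of the middle-95% band

# Share of the sampling distribution of the mean lying within mu ± margin
sampling_dist = NormalDist(mu, se)
share = sampling_dist.cdf(mu + margin) - sampling_dist.cdf(mu - margin)

print(round(se), round(z, 2), round(margin), round(share, 2))
```

The printed values should match the 500, 1.96, 980 and 95% quoted above.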
CALCULATING CONFIDENCE INTERVALS FOR THE POPULATION MEAN
A C% confidence interval for the population mean is given by
x̄ ± zC × σx̄
where zC is chosen so that C% of the normal distribution lies within the interval −zC to +zC.
w C is the confidence level
w zC is found as a z value such that −zC to +zC incorporates the middle C% of a normal distribution;
±zC demarcates a symmetric interval which has area C
Example: Find the appropriate z value for a 92% confidence interval
Ø the interval must be symmetric: take out the middle 92% leaves 8% to be split between
upper and lower tails.
Ø The required z values demarcate the lower 4% and lower 96% of the z distribution.
Ø Alternatively, let L = (100 − C)/2; here L = (100 − 92)/2 = 4% or 0.04.
In the cumulative z table find area 0.0400. The closest entry is 0.0401; reading
back to the margins, z = −1.75 for area 0.0401; therefore the required zC = 1.75
Check by finding a z value such that 96% of the distribution is less than that value.
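The table lookup can also be done numerically. The sketch below assumes Python 3.8 or later, where the standard library provides statistics.NormalDist; the helper name z_for_confidence is made up for illustration.

```python
from statistics import NormalDist

def z_for_confidence(c_percent):
    """Return zC such that the middle c_percent% of the standard normal lies in (-zC, +zC)."""
    lower_tail = (100 - c_percent) / 2 / 100    # L, as a decimal fraction
    return -NormalDist().inv_cdf(lower_tail)    # z below which L lies, sign flipped

print(round(z_for_confidence(92), 2))
print(round(z_for_confidence(95), 2))
```

For 92% this agrees with the table value found above to two decimal places.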
Examples:
Ø A population of Christmas trees has unknown µ, but it is known that the population is
normally distributed with σ = 4. A sample of 25 trees has x̄ = 16.6. Find a 95%
confidence interval for the mean height of the population.
w given n = 25, so that σx̄ = σ ÷ √n = 4/5 = 0.8
w L = (100 − C)/2 = 5/2 = 2.5% or 0.025. From the z table 0.025 of the z
distribution is less than −1.96, so zC = 1.96.
w applying the formula above
x̄ ± zC × σx̄
16.6 ± 1.96 × 0.8
16.6 ± 1.568,
or the interval 15.032 to 18.168
Stating the interval:
w “A 95% confidence interval for the population mean is 16.6 ± 1.568”
w “We are 95% confident that the population mean is in the interval 15.032 to
18.168.”
w “There is a 95% probability that µ is at least 15.032 but no more than
18.168.”
Ø Find a 90% confidence interval for the same population, same sample
w find z90 by reference to z table.
„ L = (100 − C)/2 = 0.05.
„ the nearest entry is 0.0505 ⇒ z = −1.64.
w then we have 16.6 ± 1.64 × 0.8 = 16.6 ± 1.312, or the interval 15.288 to 17.912
Ø For a sample of 64 drawn from this population, we got the same x̄. Find a 90%
confidence interval for the population mean.
w since n = 64, σx̄ = 4 ÷ √64 = 0.5
w confidence interval: 16.6 ± 1.64 × 0.5 = 16.6 ± 0.82, or the interval 15.78 to 17.42
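The three Christmas-tree intervals can be reproduced with one small function. This is a sketch only: z_interval is a hypothetical helper, and the z values are passed in as read from the z table (1.96 and 1.64) so that the results match the hand calculations. A more exact z for 90% (about 1.645) would shift the last digits slightly.

```python
from math import sqrt

def z_interval(xbar, sigma, n, z):
    """Confidence interval xbar ± z × sigma/√n, for known sigma; z comes from the z table."""
    margin = z * sigma / sqrt(n)
    return xbar - margin, xbar + margin

for n, z in [(25, 1.96), (25, 1.64), (64, 1.64)]:
    low, high = z_interval(16.6, 4, n, z)
    print(n, z, round(low, 3), round(high, 3))
```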
Messages:
Ø the width of the confidence interval varies in the same direction as the confidence level
in our first example (95%), width = 18.168 − 15.032 = 3.136, while in the second example (90%), width =
17.912 − 15.288 = 2.624
„ width of the interval is 2 × zC × σx̄ : called the precision of the estimate
„ there is a trade-off between precision and confidence
• common sense: for very wide intervals, we can be quite confident that we've
captured µ, but as the interval narrows, the probability that it includes µ drops toward zero
Ø As the sample size increases, precision increases at the same level of confidence
the third interval above (n = 64, 90%) has width 1.64, against 2.624 for n = 25
„ with a sufficiently large sample, we can achieve whatever combination of confidence and
precision we desire
„ as n increases, σx̄ decreases
FINDING THE RIGHT SAMPLE SIZE
The distance e = zC × σx̄ is the error in the estimate
„ e is one-half the width of the confidence interval
„ within the limits of our confidence statement, we are sure that the population mean differs from
the sample mean by no more than e: we might say we're 90% confident that the true mean differs
from the sample mean by no more than e. Hence, e is the maximum error in the estimate
Suppose that there is some maximum tolerable value for e, or maximum tolerable error
for a given confidence level, we want the value of n necessary to keep e within tolerable limits
since σx̄ = σ/√n, the error is
e = zC × σ/√n
solve for n to find
n = (zC × σ / e)²
for given z, chosen for the appropriate confidence level, this formula gives us the sample size
necessary to achieve an error of no more than e
in general, the result of this calculation is not an integer, so the rule is to make the sample size
equal to the next largest integer.
NOTE CAREFULLY: This refers to the maximum tolerable error in the sampling procedure, or in
the estimate of µ, NOT to the tolerance in a manufacturing process.
Examples:
Ø Cigarette filters are supposed to have µ = 15 mm in length; σ = 0.1 mm. Machinery
will jam if the length of a filter exceeds 15.3 mm, and the probability of such a filter
increases as the mean length increases, so we must have an accurate estimate of the mean length
of filters. Let us require e ≤ 0.01 mm and 90% confidence intervals. How large must n
be?
n = (zC × σ / e)² = (1.64 × 0.1 / 0.01)² = 268.96
the next greatest integer is the required sample size (that is, ALWAYS round n
upwards in these problems); here n = 269
Ø ordering T-shirts to give to contestants in a road race; average chest size unknown but
for all chests everywhere σ = 4 in. Measure a sample of the participants when they
register, and require that the sample be accurate to within ±1.5 in. How large must the
sample be to have 99% confidence in the result?
w first, z99 = 2.58, from the z table with L = (100 − 99)/2 = 0.5% or 0.005
w n = (2.58 × 4 / 1.5)² = 47.33
w rounding upwards, we require n = 48
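The sample-size rule, including the instruction to round upwards, can be sketched as follows. The function name sample_size_for_mean is an assumption made for illustration; the z values are the table values used in the two examples.

```python
from math import ceil

def sample_size_for_mean(z, sigma, e):
    """Smallest n with z × sigma/√n <= e, that is n = (z × sigma / e)² rounded up."""
    return ceil((z * sigma / e) ** 2)

print(sample_size_for_mean(1.64, 0.1, 0.01))   # cigarette filters, 90% confidence
print(sample_size_for_mean(2.58, 4, 1.5))      # T-shirts, 99% confidence
```

Using ceil rather than round reflects the rule that n is always rounded upwards, even when the fractional part is small.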
CONFIDENCE INTERVALS II: σ UNKNOWN
WHEN TO USE A z VALUE IN CONSTRUCTING CONFIDENCE INTERVALS
To this point, we have assumed the population standard deviation known.
IF NOT
Population is normally distributed and σ NOT known ⇒ the standardized sample mean,
(x̄ − µ) ÷ (s/√n), is NOT normally distributed but rather conforms to Student's t distribution
If population is NOT normal and σ is NOT known but the sample is large (that is, n ≥ 30), then
the standardized sample mean approximately follows the t distribution
In either of these cases, s, the sample standard deviation, estimates σ.
RECALL: s² = Σ(x − x̄)²/(n − 1) and s = √s²
The standard error of the mean is estimated by
sx̄ = s/√n
confidence intervals have the form
x̄ ± tC × sx̄
the t values used here are numbers of standard deviations – in this case, numbers of standard
deviations on a t distribution
CHARACTERISTICS OF THE t DISTRIBUTION
Ø Continuous
Ø Symmetric
Ø Values near the mean are more probable than values further out
so that t distribution looks like a bell-shaped curve.
How is that any different from a normal distribution?
1. the t distribution has fatter tails and less mass in the center
Ø for a given number of standard deviations either side of the center, the probability enclosed is
higher on the normal distribution than on a t distribution
Ø put another way, a given probability level will be further from the center of a t distribution than
from the center of the normal distribution
Ø or, a given probability level will be more standard deviations (t values) away from the mean
than would be the case on a normal distribution
Note: t values will always be larger than z values for corresponding confidence level ⇒ intervals
constructed with t will always be wider (less precise) than those constructed with z
2. there is not one t distribution but a large number, depending on the number of "degrees of
freedom"
Digression: the concept of degrees of freedom
Ø Mechanically df = n − k, where n is the sample size and k the number of parameters that must
be estimated from the sample before estimating the standard deviation
w for example: s, the sample standard deviation, is an estimate of σ. To calculate s, we
must estimate µ. µ is estimated by x̄, and x̄ is the only statistic we must calculate before
we can calculate s. We must thus estimate one parameter, µ, before deriving an
estimate of σ, and there are thus n − 1 degrees of freedom in our estimate of σ
Ø more generally, degrees of freedom represents the number of independent (in the probability
sense) random variables in a problem
w in calculating s we must use x̄. Suppose we are given x̄ and n − 1 of the values in the
sample; then the n-th value is already determined and can be derived from what we know
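The last point can be made concrete with a short sketch. The seven values are borrowed from the beverage example later in these notes purely for illustration; any sample would do.

```python
from statistics import mean

sample = [3.7, 2.9, 3.2, 4.1, 4.6, 2.3, 2.5]   # illustrative values
n = len(sample)
xbar = mean(sample)

known = sample[:-1]                     # suppose only n − 1 of the values are given
recovered = n * xbar - sum(known)       # the n-th value is forced by the mean

print(round(recovered, 6), sample[-1])
```

Because the mean fixes the total, only n − 1 of the values are free to vary, which is why s has n − 1 degrees of freedom.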
The t-distribution: pages E-7 and E-8 in your textbook
how to read the table
Ø Upper tail (α) values across the top are the area in one tail of the distribution
Ø for a confidence interval use an upper–tail value corresponding to the area in one tail of the
distribution
„ this will be only half the difference between the confidence level and 1
For example: in preparing a 95% confidence interval, there will be 5% in the tails of the
distribution, thus 0.025 in each tail: we should use a t value for upper-tail area 0.025 and the
appropriate number of degrees of freedom
„ if C is the confidence level, expressed as decimal fraction, use α = (1 − C)/2
Ø degrees of freedom are in the left hand column
as df → infinity, the t-value → z value
Examples:
Ø Find the appropriate t value for 20 degrees of freedom and 90% confidence interval.
α = (1 − 0.9)/2 = 0.05 ⇒ t = 1.7247
Ø for a sample of size 37, find the t value for a 99% confidence interval
d.f. = n − 1 = 36; α = (1 − 0.99)/2 = 0.005 ⇒ t = 2.7195
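Where no t table is at hand, the table values can be approximated numerically. The sketch below is one possible approach, not the method used to build the textbook table: it integrates the t density with Simpson's rule and searches for the t value by bisection, using only the standard library. The function names are made up for illustration. Where SciPy is installed, scipy.stats.t.ppf(1 − α, df) should give the same values more directly.

```python
from math import exp, lgamma, pi, sqrt

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    const = exp(lgamma((df + 1) / 2) - lgamma(df / 2)) / sqrt(df * pi)
    return const * (1 + x * x / df) ** (-(df + 1) / 2)

def upper_tail_area(t, df, steps=2000):
    """Area to the right of t (for t >= 0), via Simpson's rule on [0, t]."""
    h = t / steps
    total = t_pdf(0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    return 0.5 - total * h / 3          # half the distribution lies above 0

def t_value(alpha, df):
    """t value whose upper-tail area is alpha (0 < alpha < 0.5), found by bisection."""
    lo, hi = 0.0, 1.0
    while upper_tail_area(hi, df) > alpha:   # widen the bracket until it contains t
        hi *= 2
    for _ in range(50):
        mid = (lo + hi) / 2
        if upper_tail_area(mid, df) > alpha:
            lo = mid                         # too much area in the tail: move right
        else:
            hi = mid
    return (lo + hi) / 2

print(round(t_value(0.05, 20), 4))    # 90% interval, 20 d.f.
print(round(t_value(0.005, 36), 4))   # 99% interval, n = 37
```

The results should agree with the table values in the two examples above to about four decimal places.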
CONFIDENCE INTERVAL FOR µ WITH NORMAL POPULATION AND σ UNKNOWN
Problem requires use of t with n − 1 degrees of freedom. Confidence intervals will have the form
x̄ ± tC × sx̄, with tC taken from the t distribution with n − 1 d.f.
where sx̄ = s/√n, s being the sample standard deviation
note similarity to earlier confidence intervals
Examples:
Ø 7 male students are selected at random and an alcoholic beverage is poured down them
in tenth-ounce increments until distinct signs of non-sobriety are observed. The following
results were obtained:
Individual    Amount of Beverage (oz)
1             3.7
2             2.9
3             3.2
4             4.1
5             4.6
6             2.3
7             2.5
Researchers feel safe in assuming that the distribution of ounces until non-sobriety is normal
in the population. Construct a 95% confidence interval for amount of drink it takes to get
the average member of the population drunk.
„ calculate x̄ and s: x̄ = 3.329, s² = Σ(x − x̄)²/(n − 1) = [(3.7 − 3.329)² + … +
(2.5 − 3.329)²] ÷ (7 − 1) = 0.7157
s = √0.7157 = 0.846
„ calculate sx̄ = s/√n = 0.846/√7 = 0.846/2.6458 = 0.3198
„ find the appropriate t value: for C = 0.95 (α = 0.025) and 6 d.f., t = 2.4469
„ multiply sx̄ by the t value to get 0.7824
x̄ ± t × sx̄ = 3.329 ± 0.7824 or the interval 2.546 to 4.111
Ø Each of 9 cars in a sample is driven 20,000 miles, the gallons of fuel used recorded, and
the fuel mileage calculated. For the sample, mean fuel mileage x̄ = 34.6 and s = 1.2.
Assuming that fuel mileage is normally distributed, find a 90% confidence
interval for the mileage to be expected from all cars of this make.
„ sx̄ = s/√n = (1.2)/3 = 0.4
„ α = (1 − .9)/2 = 0.05 and d.f. = n − 1 = 8 ⇒ t = 1.860
x̄ ± t × sx̄ = 34.6 ± 1.86 × 0.4
34.6 ± 0.744 or the interval 33.856 to 35.344
Ø In a sample of 41 students who work, x̄ = 16.561 and s = 5.7128. Find a 95%
confidence interval for the average hours worked by all ASU students who work.
„ sx̄ = s ÷ √n = 5.7128 ÷ √41 = 0.892189
„ for 40 degrees of freedom, t95 = 2.0211
„ confidence interval: 16.561 ± 2.0211 × 0.892189 = 16.561 ± 1.803,
or about 16.6 ± 1.8
Ø We wish to establish the average weight of a population of turkeys; we have chosen a
sample of 36, weighed them and have the following results:
18  13   6   7  26   8
20  12  22  10  19  11
 7  12  14  22  11  21
11  12  18  14   8  16
 9  18  16  17  13  14
21  16  11  15  10  15
Construct a 98% confidence interval for the population mean of these turkeys
„ first, find tC: for α = (1 − 0.98)/2 = 0.01 and 35 d.f., tC = 2.438
„ next, find x̄ and s: x̄ = 14.25, s = 4.9012
„ find sx̄ = s/√n = 4.9012/6 = 0.8169
x̄ ± tC × sx̄
14.25 ± 2.438 × 0.8169
14.25 ± 1.99 or
12.26 to 16.24
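The steps followed in these examples (find x̄ and s, estimate the standard error, multiply by the table t value) can be sketched as below. The helper t_interval is hypothetical; the t values are passed in as read from the t table, and statistics.stdev is used because it divides by n − 1, matching the definition of s in these notes.

```python
from math import sqrt
from statistics import mean, stdev

def t_interval(xbar, s, n, t):
    """xbar ± t × s/√n, with t read from the t table for n − 1 d.f."""
    margin = t * s / sqrt(n)
    return xbar - margin, xbar + margin

# From raw data: the beverage example (95% confidence, 6 d.f.)
ounces = [3.7, 2.9, 3.2, 4.1, 4.6, 2.3, 2.5]
xbar, s = mean(ounces), stdev(ounces)
low, high = t_interval(xbar, s, len(ounces), 2.4469)
print(round(xbar, 3), round(s, 3), round(low, 3), round(high, 3))

# From summary statistics: the fuel-mileage example (90% confidence, 8 d.f.)
car_low, car_high = t_interval(34.6, 1.2, 9, 1.860)
print(round(car_low, 3), round(car_high, 3))
```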
SAMPLING DISTRIBUTIONS FOR SMALL SAMPLES
The t distribution is often thought of as primarily of value with small samples
Ø applies whenever population is known to be normal and σ unknown, no matter how small n
footnote: who was "Student"? A pseudonym for William Sealy Gosset, a brewer at the Guinness brewery
in Dublin concerned with controlling biochemical processes in brewing
Ø with large samples, if population is not normal, we must rely on Central Limit Theorem
And many statisticians and other practitioners will use z procedures with any sample of 30 or
more: this is especially prevalent in older practice
Another possibility:
sample is small, so that CLT does not apply
population is not normally distributed or the distribution is unknown
Safest course is to take a larger sample and rely on CLT
Following schematic may be used to determine proper distribution to use in constructing
confidence intervals.
Population standard deviation known?
    Yes → Population normal?
        Yes → z value
        No → Sample size?
            n >= 30 → z or t (see note 1)
            n < 30 → ERROR (see note 2)
    No → Population normal?
        Yes → t value
        No → Sample size?
            n >= 30 → z or t (see note 1)
            n < 30 → ERROR (see note 2)
NOTES:
1. For a non-normal population and large samples, different practitioners may proceed
differently. Some argue that the Central Limit Theorem justifies use of a z value in this case, while
others feel that it is more appropriate to use a t value since that gives a less precise estimate (a
wider confidence interval). For purposes of this course, use a t in such cases.
2. For small samples from non-normal populations: there are techniques which can be used to
derive an interval estimate in this case, but they are beyond the scope of this course.
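The schematic can also be written as a small decision function. This is only one reading of the diagram: the function name and return strings are invented for illustration, and the "z or t" branch is where note 1 applies.

```python
def choose_distribution(sigma_known, population_normal, n):
    """Follow the schematic: return 'z', 't', 'z or t', or 'error'."""
    if population_normal:
        return "z" if sigma_known else "t"
    if n >= 30:
        return "z or t"      # note 1: for this course, use t
    return "error"           # note 2: methods beyond the scope of the course

print(choose_distribution(True, True, 25))
print(choose_distribution(False, True, 7))
print(choose_distribution(False, False, 41))
print(choose_distribution(False, False, 12))
```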
CONFIDENCE INTERVALS III
CONFIDENCE INTERVAL FOR THE POPULATION PROPORTION
Purpose: to use the sample proportion, ps , as the basis of an interval estimate of the population
proportion p
Reminders:
the sample proportion ps = x/n
the sampling distribution of ps has parameters
E(ps) = p
σps = √[p × (1 − p)/n]
for large samples ps is approximately normally distributed, so that probabilities are found by
reference to the z table
typically p is unknown, so that we must estimate σps by
sps = √[ps × (1 − ps)/n]
A confidence interval for p then will have the form
ps ± zC × sps
Examples:
Ø of 55 students in a sample, 45 work. Construct a 95% confidence interval for the proportion
in the population who work.
„ ps = 45/55 = 0.82
„ sps = √[.82 × (1 − .82)/55] = 0.0518
„ zC = ±1.96
„ confidence interval: 0.82 ± 1.96 × 0.0518 →
0.82 ± 0.10
We are 95% confident that in the population somewhere between 72% and 92%
work.
Ø In a sample of 800 North Carolinians 51% express the intention to vote for Jesse Helms in the
next election. Find a 98% confidence interval for the proportion in the population who intend to
vote for Helms.
„ ps = 51%; sps = √[(51 × 49)/800] = 1.76741 (in percentage points)
„ then we have 51 ± 2.33 × 1.76741 = 51% ± 4.12% or the interval 46.9% to 55.1%
„ from this, we can say, strictly and properly, "We are 98% sure that the proportion
in the population who intend to vote Helms is within 4.12% of 51%."
or, as we might loosely and a bit improperly put it, "Our survey shows that 51% of
the population intend to vote Helms, and this result is accurate to within plus or minus
4%."
„ since the interval includes 50%, the election is a toss-up or “too close to call.”
Ø Suppose same result with a sample of size n = 1600
„ sps = 1.2497, and sps × z = 1.2497 × 2.33 = 2.9%
„ confidence interval would be 51% ± 3%
POINT: is the minor increase in precision worth the extra cost?
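The proportion intervals above can be reproduced with a short function. This is a sketch with a hypothetical helper, proportion_interval, working in decimal fractions rather than percentages; the z values are the table values used in the examples.

```python
from math import sqrt

def proportion_interval(ps, n, z):
    """ps ± z × √[ps(1 − ps)/n]; returns (low, high, margin) as decimal fractions."""
    margin = z * sqrt(ps * (1 - ps) / n)
    return ps - margin, ps + margin, margin

for ps, n, z in [(0.82, 55, 1.96), (0.51, 800, 2.33), (0.51, 1600, 2.33)]:
    low, high, margin = proportion_interval(ps, n, z)
    print(n, round(low, 3), round(high, 3), round(margin, 4))
```

Comparing the last two lines of output shows how little the margin shrinks when the sample is doubled, since the error falls only with √n.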
FINDING THE NECESSARY SAMPLE SIZE IN PROPORTION PROBLEMS
Since the interval is ps ± zC × sps, at the chosen confidence level the estimate ps differs from p
by at most that amount
substituting the definition of the standard error, the error is at most
zC × √[p × (1 − p)/n]
notice the use of p in the above expression; the concepts advanced here involve what we
know about the sampling distribution before sampling begins
for a given confidence level, this error can be reduced by increasing n
in the last example above, we noted that doubling the sample size would reduce the error from
4% to 3%
Ø suppose we require e < 0.01, that is, accuracy to within ± 1%. How large must n be?
the maximum error in the estimate:
e = zC × √[p × (1 − p)/n]
solve for n, giving
n = [p × (1 − p) × zC²] ÷ e²
Ø A major problem: p, the population proportion is unknown
„ solution 1: assume p = 0.5
this will give largest possible value for n since p × (1 −p) reaches a maximum when p =
0.5
may result in an unnecessarily large and expensive sample
„ solution 2: use other information
do a pilot study on a small sample and use the resulting ps to estimate p
previous experience or knowledge of other populations may give an approximate value
for p
w lacks certainty of solution 1, but may result in somewhat smaller sample
Examples:
applying the formula above to solution 1, with z = 2.33 for 98% confidence, we have
n = [0.5 × (1 − 0.5) × 2.33²] ÷ 0.01² = 13,572.25, so n = 13,573
this is the sample size necessary to be sure that, whatever the value of p, a 98% confidence
interval is accurate to within ± 1%
Ø In the work example above, 95% confidence interval and sample of 55 gave accuracy
of ±0.10. What sample size is necessary to hold the error to ±0.015 (1.5%)?
„ solution 1: n = [(0.5 × 0.5) × 1.96²] ÷ 0.015² = 4268.44; taking the next greatest
integer, we have 4269
„ solution 2: for n = 55, we had ps = 0.82. Take that as an estimate of the unknown p.
Then n = [(0.82 × 0.18) × 1.96²] ÷ 0.015² = 2520.09; rounding upwards, we have 2521
using the pilot-study approach reduces the required sample size by 1,748
and might save a considerable amount of money
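The two solutions can be compared side by side with a small function. This is a sketch under the same conventions as the formula above (p and e as decimal fractions, n always rounded upwards); the name sample_size_for_proportion is an assumption made for illustration.

```python
from math import ceil

def sample_size_for_proportion(p, z, e):
    """n = p(1 − p)z²/e², rounded up; p and e are decimal fractions."""
    return ceil(p * (1 - p) * z ** 2 / e ** 2)

worst_case = sample_size_for_proportion(0.5, 1.96, 0.015)    # solution 1: assume p = 0.5
pilot = sample_size_for_proportion(0.82, 1.96, 0.015)        # solution 2: pilot ps = 0.82
print(worst_case, pilot, worst_case - pilot)

print(sample_size_for_proportion(0.5, 2.33, 0.01))           # 98% confidence, ± 1%
```

Because 2520.09 is rounded upwards, the pilot-study figure comes out at 2521.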
A footnote: in most proportion problems, it doesn’t matter whether you use percentages or decimal
fractions, as long as you keep them straight. In the sample-size formula above, however, you must
use decimal fractions. To use percentages, substitute 100 for 1, so the formula becomes
n = [p* × (100 − p*) × z²] ÷ e*² where p* and e* are defined as percentages.
THE z VS. THE t DISTRIBUTION
In constructing confidence intervals, use the z distribution whenever
Ø the population standard deviation σ is known AND the population is known to be normally
distributed
Ø you wish to calculate a confidence interval for a proportion
„ rule of thumb: n × p ≥ 5 AND n × (1 − p) ≥ 5 for sufficiently accurate approximation
In constructing confidence intervals, use the t distribution
Ø if the population is known to be normally distributed AND the population standard deviation σ
is UNKNOWN: this holds for any sample size
Ø if the population’s distribution is NOT normal AND the sample size is at least 30 AND the
population standard deviation σ is UNKNOWN