Download Confidence Interval Estimation - University of San Diego Home Pages

Confidence Interval Estimation
For statistical inference in
decision making:
Chapter 6
Objectives
• Central Limit Theorem
• Confidence Interval Estimation of the
Mean (σ known)
• Interpretation of the Confidence Interval
• Confidence Interval Estimation of the
Mean (σ unknown)
• Confidence Interval Estimation for the
Proportion
• Determining Sample Size
Central Limit Theorem
Irrespective of the shape of the
underlying distribution of the
population, sample means and
sample proportions will be
approximately normally distributed
if the sample size is sufficiently large.
Central Limit Theorem in action:
How large must a sample be for the
Central Limit theorem to apply?
The required sample size varies according to the
shape of the population.
However, for our use, a sample size of
30 or larger will suffice.
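A quick simulation can make the theorem concrete. The sketch below is only an illustration: it assumes a right-skewed exponential population (mean 1, standard deviation 1), and the function name, seed, and number of samples are made up for this example.

```python
import math
import random
import statistics

def simulate_sample_means(n, num_samples, seed=0):
    """Draw num_samples samples of size n from an exponential population
    (mean 1, standard deviation 1) and return the sample means."""
    rng = random.Random(seed)
    return [
        statistics.fmean([rng.expovariate(1.0) for _ in range(n)])
        for _ in range(num_samples)
    ]

means = simulate_sample_means(n=30, num_samples=5000)
std_error = 1 / math.sqrt(30)  # population sigma divided by sqrt(n)

# Share of sample means within 1.96 standard errors of the population mean.
share_within = sum(abs(m - 1.0) <= 1.96 * std_error for m in means) / len(means)

print("mean of sample means:   ", round(statistics.fmean(means), 3))
print("std dev of sample means:", round(statistics.stdev(means), 3))
print("share within 1.96 SE:   ", round(share_within, 3))
```

Even though the population is strongly skewed, the sample means should center near 1, spread out by roughly 1/√30, and fall within 1.96 standard errors about 95% of the time.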
Must sample sizes be 30 or larger for
populations that are normally distributed?
No. If the population is normally
distributed, the sample means are
normally distributed for sample sizes as
small as n=1.
Why not just always pick a sample size
of 30?
How can I tell the shape of the underlying
population?
CHECK FOR NORMALITY:
• Use descriptive statistics. Construct stem-and-leaf plots for small
or moderate-sized data sets, and frequency distributions and
histograms for large data sets.
• Compute measures of central tendency (mean and median) and
compare with the theoretical and practical properties of the normal
distribution. Compute the interquartile range. Does it approximate
1.33 times the standard deviation?
• How are the observations in the data set distributed? Do
approximately two-thirds of the observations lie within plus or
minus 1 standard deviation of the mean? Do approximately
four-fifths of the observations lie within plus or minus
1.28 standard deviations of the mean? Do approximately 19 out of
every 20 observations lie within plus or minus 2 standard
deviations of the mean?
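The "how are the observations distributed" check can be sketched in a few lines. This is a rough illustration, not a formal normality test: the helper names are invented here, and the data set is hypothetical, generated from a normal distribution purely so there is something to check.

```python
from statistics import NormalDist, fmean, stdev

def fraction_within(data, k):
    """Share of observations within k sample standard deviations of the sample mean."""
    m, s = fmean(data), stdev(data)
    return sum(1 for x in data if abs(x - m) <= k * s) / len(data)

def normal_fraction_within(k):
    """Share of a normal population within k standard deviations of its mean."""
    return 2 * NormalDist().cdf(k) - 1

# Hypothetical data set (assumption: generated only for illustration).
sample = NormalDist(mu=100, sigma=15).samples(2000, seed=1)

for k in (1, 1.28, 2):
    print(k, round(fraction_within(sample, k), 3), round(normal_fraction_within(k), 3))
```

If the observed shares are far from the normal shares (about two-thirds, four-fifths, and 19 out of 20), the data are probably not normally distributed.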
Why do I care if X-bar, the sample mean, is
normally distributed?
Because I want to use Z scores to analyze
sample means.
But to use Z scores, the data must be normally
distributed.
That’s where the Central Limit Theorem steps
in.
Recall that the Central Limit Theorem states that
sample means are normally distributed regardless
of the shape of the underlying population if the
sample size is sufficiently large.
Recall from Chapter 5:
• Z = (X - µ) ÷ σ
• If sample means are normally distributed, the Z
score formula applied to sample means would
be:
• Z = (X-bar - µX-bar) ÷ σX-bar
Background
• To determine µX-bar, we would need to randomly draw
out all possible samples of the given size from the
population, compute the sample means, and average
them. This task is unrealistic. Fortunately, µX-bar equals
the population mean µ, which is easier to access.
• Likewise, to compute the value of σX-bar, we would have to
take all possible samples of a given size from a
population, compute the sample means, and determine
the standard deviation of sample means. This task is
also unrealistic. Fortunately, σX-bar can be computed by
dividing the population standard deviation by the
square root of the sample size: σX-bar = σ / √n.
Note:
As the sample size increases,
the standard deviation of the sample means
becomes smaller and smaller
because the population standard deviation
is being divided by larger and larger
values of the square root of n.
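A small sketch of this shrinking effect follows. The value σ = 9 and the sample sizes are illustrative assumptions, and the function name is made up for this example.

```python
import math

def standard_error(sigma, n):
    """Standard deviation of the sample mean: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

# Each fourfold increase in n cuts the standard error in half.
for n in (10, 40, 160, 640):
    print(n, round(standard_error(9.0, n), 3))
```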
The ultimate benefit of the central
limit theorem is a useful version of
the Z formula for sample means.
Z Formula for Sample Means:
Z = (X-bar - µ) ÷ (σ / √n)
Example:
The mean expenditure per customer at a
tire store is $85.00, with a standard
deviation of $9.00.
If a random sample of 40 customers is
taken, what is the probability that the
sample average expenditure per
customer for this sample will be
$87.00 or more?
Because the sample size is greater than 30, the central
limit theorem says the sample means are normally
distributed.
Z = (X-bar - µ) ÷ (σ / √n)
Z = ($87.00 - $85.00) ÷ ($9.00 / √40)
Z = $2.00 ÷ $1.42 = 1.41
For Z = 1.41 in the Z distribution table, the
probability is .4207.
This represents the probability of getting a mean
between $87.00 and the population mean
$85.00.
Solving for the tail of the distribution yields
.5000 - .4207 = .0793
• This is the probability of X-bar ≥ $87.00.
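The same calculation can be checked without a Z table. This is a minimal sketch using Python's standard library; the function name is made up for this example. Because it does not round Z to 1.41 before the lookup, it gives a tail probability of about .080 rather than the table-based .0793.

```python
import math
from statistics import NormalDist

def prob_sample_mean_at_least(x_bar, mu, sigma, n):
    """Return (Z, P(X-bar >= x_bar)) assuming sample means are normally distributed."""
    z = (x_bar - mu) / (sigma / math.sqrt(n))
    return z, 1 - NormalDist().cdf(z)

z, p = prob_sample_mean_at_least(x_bar=87.0, mu=85.0, sigma=9.0, n=40)
print("Z =", round(z, 2))
print("P(X-bar >= 87) =", round(p, 4))
```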
Interpretations
Therefore, 7.93% of the time, a random
sample of 40 customers from this
population will yield a mean expenditure of
$87.00 or more.
Note: this is a statement about sample means, not about
individual customers. It does NOT say that 7.93% of the
customers in a given sample will spend $87.00 or more.
Solve:
Suppose that during any hour in a
large department store, the
average number of shoppers is
448, with a standard deviation
of 21 shoppers.
What is the probability that a
random sample of 49 different
shopping hours will yield a
sample mean between 441 and
446 shoppers?
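One way to check an answer to this exercise is sketched below. It assumes the sample means are normally distributed (n = 49 is larger than 30); the function name is made up for this example. Table-based answers may differ slightly because Z values are rounded to two decimals before the lookup.

```python
import math
from statistics import NormalDist

def prob_sample_mean_between(low, high, mu, sigma, n):
    """P(low <= X-bar <= high) when X-bar is normal with mean mu and std dev sigma/sqrt(n)."""
    sampling_dist = NormalDist(mu, sigma / math.sqrt(n))
    return sampling_dist.cdf(high) - sampling_dist.cdf(low)

p_between = prob_sample_mean_between(low=441, high=446, mu=448, sigma=21, n=49)
print(round(p_between, 4))
```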
Statistical Inference
Statistical Inference facilitates
decision making.
Via sample data,
we can estimate something about
our population,
such as its average value µ,
by using the corresponding
sample mean, X-bar.
Recall that µ,
the population mean to be estimated,
is a parameter,
while X-bar,
the sample mean, is a statistic.
Point Estimate
A point estimate is a statistic taken from a sample and is
used to estimate a population parameter.
However, a point estimate is only as good as the sample it
represents. If other random samples are taken from the
population, the point estimates derived from those
samples are likely to vary.
Because of variation in sample statistics, estimating a
population parameter with a confidence interval is often
preferable to using a point estimate.
Confidence Interval
A confidence interval is a range of values
within which it is estimated with some
confidence the population parameter lies.
Confidence intervals can be one- or two-tailed.
Confidence Interval to Estimate µ
• By rearranging the Z formula for sample means, a
confidence interval formula is constructed:
• X-bar +/- Z α/2 (σ / √n)
• Where:
• α = the area under the normal curve outside the
confidence interval
• α/2 = the area in one-tail of the distribution outside
the confidence interval
The confidence interval formula yields a
range (interval) within which we feel
with some confidence the population
mean is located.
It is not certain that the population mean
is in the interval unless we have a 100%
confidence interval that is infinitely
wide, so wide that it is meaningless.
Confidence interval estimates for five different
samples of n=25, taken from a population where
µ=368 and σ=15
Common levels of confidence
intervals used by analysts are
90%, 95%, 98%, and 99%.
95% Confidence Interval
• For 95% confidence, α = .05 and α/2 = .025. The value of
Z.025 is found by looking in the standard normal table under
.5000 - .025 = .4750. This area in the table is associated
with a Z value of 1.96.
• An alternate method: multiply the confidence level, 95%,
by ½ (since the distribution is symmetric and the intervals
are equal on each side of the population mean).
• (½)(95%) = .4750 (the area on each side of the mean) has a
corresponding Z value of 1.96.
In other words, of all the possible X-bar
values along the horizontal axis of the
normal distribution curve, 95% of them
should be within a Z score of 1.96 from
the mean.
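The table lookup can also be done in code. This is a small sketch using the inverse normal function from Python's standard library; the function name z_critical is made up for this example.

```python
from statistics import NormalDist

def z_critical(confidence):
    """Two-tailed critical Z value: leaves alpha/2 in each tail."""
    alpha = 1 - confidence
    return NormalDist().inv_cdf(1 - alpha / 2)

for level in (0.90, 0.95, 0.98, 0.99):
    print(f"{level:.0%} confidence: Z = {z_critical(level):.3f}")
```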
Margin of Error
Z [σ / √ n]
Example:
• A business analyst for a cellular telephone
company takes a random sample of 85 bills
for a recent month and from these bills
computes a sample mean of 153 minutes. If
the company uses the sample mean of 153
minutes as an estimate for the population
mean, then the sample mean is being used
as a POINT ESTIMATE. Past history and
similar studies indicate that the population
standard deviation is 46 minutes.
• The value of Z is decided by the level of
confidence desired. A confidence level of
95% has been selected.
153 +/- 1.96 (46 / √85) = 153 +/- 9.78
143.22 ≤ µ ≤ 162.78
• The confidence interval is constructed from the point
estimate, 153 minutes, and the margin of error of this
estimate, +/- 9.78 minutes.
• The resulting confidence interval is 143.22 ≤ µ ≤
162.78.
• The cellular telephone company business analyst is
95% confident that the average length of a call for
the population is between 143.22 and 162.78
minutes.
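The cellular telephone example can be reproduced with a short sketch. The function name is made up for this example, and it assumes σ is known, as in the problem statement.

```python
import math
from statistics import NormalDist

def mean_confidence_interval(x_bar, sigma, n, confidence=0.95):
    """Return (lower, upper, margin) for a Z-based interval with sigma known."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / math.sqrt(n)
    return x_bar - margin, x_bar + margin, margin

lower, upper, margin = mean_confidence_interval(x_bar=153, sigma=46, n=85)
print(round(lower, 2), round(upper, 2), round(margin, 2))
```

Raising the confidence level widens the interval, and increasing n narrows it.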
Interpreting a Confidence Interval
• For the previous 95% confidence interval, the following
conclusions are valid:
• I am 95% confident that the average length of a call for the
population, µ, lies between 143.22 and 162.78 minutes.
• If I repeatedly obtained samples of size 85, then 95% of the
resulting confidence intervals would contain µ and 5% would not.
QUESTION: Does this confidence interval [143.22 to 162.78]
contain µ? ANSWER: I don’t know. All I can say is that this
procedure leads to an interval containing µ 95% of the time.
• I am 95% confident that my estimate of µ [namely 153 minutes] is
within 9.78 minutes of the actual value of µ. RECALL: 9.78 is the
margin of error.
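The "95% of the resulting intervals would contain µ" statement can be illustrated by simulation. This is only a sketch under stated assumptions: a normal population with µ = 368 and σ = 15 and samples of n = 25 (the values from the five-sample illustration earlier in this chapter), with a made-up function name, seed, and number of trials.

```python
import math
import random
from statistics import NormalDist, fmean

def coverage_rate(mu, sigma, n, trials, confidence=0.95, seed=0):
    """Share of simulated confidence intervals that contain the true mean mu."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / math.sqrt(n)
    hits = 0
    for _ in range(trials):
        x_bar = fmean([rng.gauss(mu, sigma) for _ in range(n)])
        if x_bar - margin <= mu <= x_bar + margin:
            hits += 1
    return hits / trials

rate = coverage_rate(mu=368, sigma=15, n=25, trials=4000)
print("share of intervals containing mu:", round(rate, 3))
```

Any single interval either contains µ or it does not; the 95% describes how often the procedure succeeds over many samples.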
Be Careful! The following statement is
NOT true:
“The probability that µ lies between
143.22 and 162.78 is .95.”
Once you have inserted your sample
results into the confidence interval
formula, the word PROBABILITY
can no longer be used to describe
the resulting confidence interval.
Confidence Interval Estimation of
the Mean (σ Unknown)
In reality, the actual standard deviation of the population, σ, is
usually unknown.
Therefore, we use “s” (sample standard deviation) to compute
the confidence interval for the population mean, µ.
However, by using “s” in place of σ, the standard normal Z
distribution no longer applies.
Fortunately, the t-distribution will work, provided the
population from which we obtain the sample is normally distributed.
Assumptions necessary to use the t-distribution
• Assumes random variable x is normally distributed.
• However, if the sample size is large enough (> 30), the
t-distribution can still be used when σ is unknown.
• But if the sample size is small, evaluate the shape
of the sample data using a histogram or stem-and-leaf plot.
• As the sample size increases, the t-distribution
approaches the Z distribution.
Confidence Interval using a t-distribution
X-bar +/- t α/2, n-1 [s / √n]
α = 1 - confidence level (the area outside the confidence interval)
n-1 = degrees of freedom
Example:
• As a consultant I have been employed to estimate the
average amount of comp time accumulated per week for
managers in the aerospace industry.
• I randomly sample 18 managers and measure the
amount of extra time they work during a specific week
and obtain the following results (in hours). Use a
90% confidence level.
• AEROSPACE DATA
6, 21, 17, 20, 3, 8, 12, 11, 7, 9, 0, 21, 8, 25, 16, 15, 29, 16
Solution:
To construct a 90% confidence interval to estimate the
average amount of extra time per week worked by a
manager in the aerospace industry, I assume that comp
time is normally distributed in the population.
The sample size is 18, so df = 17.
A 90% level of confidence results in an α / 2 = .05 area in
each tail.
The table t-value is t .05,17 = 1.740.
With a sample mean of 13.56 hours, and a
sample standard deviation of 7.8 hours, the
confidence interval is computed:
X-bar +/- t α/2, n-1 [s / √n]
= 13.56 +/- 1.740 (7.8 / √18) = 13.56 +/- 3.20
= 10.36 ≤ µ ≤ 16.76
Interpretation:
The point estimate for this problem is
13.56 hours, with an error of +/- 3.20
hours.
I am 90% confident that the average
amount of comp time accumulated by a
manager per week in this industry is
between 10.36 and 16.76 hours.
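The aerospace interval can be recomputed directly from the 18 observations. This sketch uses the table value t = 1.740 quoted in the solution (Python's standard library has no t-distribution); the function name is made up for this example. The endpoints may differ from the slide in the last digit because the slide rounds the mean and standard deviation before computing.

```python
import math
from statistics import fmean, stdev

hours = [6, 21, 17, 20, 3, 8, 12, 11, 7, 9, 0, 21, 8, 25, 16, 15, 29, 16]
T_TABLE_VALUE = 1.740  # table value for alpha/2 = .05 and df = 17, as quoted in the solution

def t_interval(data, t_value):
    """Return (x_bar, s, lower, upper) for a t-based confidence interval."""
    n = len(data)
    x_bar = fmean(data)
    s = stdev(data)  # sample standard deviation (n - 1 in the denominator)
    margin = t_value * s / math.sqrt(n)
    return x_bar, s, x_bar - margin, x_bar + margin

x_bar, s, lower, upper = t_interval(hours, T_TABLE_VALUE)
print("mean:", round(x_bar, 2), "s:", round(s, 2))
print("interval:", round(lower, 2), "to", round(upper, 2))
```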
Recommendations:
From these figures, the aerospace
industry could attempt to build a
reward system for such extra work or
evaluate the regular 40-hour week to
determine how to use the normal work
hours more effectively and thus reduce
comp time.
Solve:
I own a large equipment rental company and I want to make
a quick estimate of the average number of days a piece of
ditch digging equipment is rented out per rental.
The company has records of all rentals, but the amount of
time required to conduct an audit of all accounts would be
prohibitive.
I decide to take a random sample of rental invoices.
Fourteen different rentals of ditch diggers are selected
randomly from the files.
Use the following data to construct a 99% confidence
interval to estimate the average number of days that a
ditch digger is rented and assume that the number of days
per rental is normally distributed in the population.
Ditch Digger Data:
3, 1, 2, 3, 1, 2, 3, 5, 1, 1, 1, 2, 1, 4
Stay-tuned
Estimating the Population Proportion
For most businesses, estimating market share (their
proportion of the market) is important because many company
decisions evolve from market share information:
• What proportion of my customers pay late?
• What proportion don’t pay at all?
• What proportion of the produced goods are
defective?
• What proportion of the population has cats, dogs,
horses, or kids? What proportion exercises or reads?
Confidence Interval Estimate for the
Proportion
• ps +/- Z √[ps(1 - ps) / n]
• ps - Z √[ps(1 - ps) / n] ≤ p ≤ ps + Z √[ps(1 - ps) / n]
• ps = sample proportion = X / n = number of successes ÷
sample size. This is the POINT ESTIMATE.
• p = population proportion
• Z = critical value from the standardized normal
distribution
• n = sample size
ps +/- Z √[ps(1 - ps) / n]
NOTE: This formula can be applied only
when np and n(1-p) are at least 5.
Example:
A study of 87 randomly selected companies with a
telemarketing operation revealed that 39% of
the sampled companies had used telemarketing
to assist them in order processing.
Using this information, how could a researcher
estimate the population proportion of
telemarketing companies that use their
telemarketing operation to assist them in order
processing?
Solution:
• The sample proportion = .39.
• This is the point estimate of the population
proportion, p.
• The Z value for 95% confidence is 1.96.
• The value of (1 - ps) = 1 - .39 = .61.
ps +/- Z √[ps(1 - ps) / n]
ps - Z √[ps(1 - ps) / n] ≤ p ≤ ps + Z √[ps(1 - ps) / n]
• The confidence interval estimate is:
.39 - 1.96 √[(.39)(.61) / 87] ≤ p ≤ .39 + 1.96 √[(.39)(.61) / 87]
.39 - .10 ≤ p ≤ .39 + .10
.29 ≤ p ≤ .49
Interpretation:
We are 95% confident that the population
proportion of telemarketing firms that use
their operation to assist order processing
is somewhere between .29 and .49.
There is a point estimate of .39 with a
margin of error of +/- .10.
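The telemarketing interval can be sketched in code as follows. The function name is made up for this example, and the check on np and n(1 - p) uses the sample proportion ps as a stand-in for the unknown p, which is an assumption of this sketch.

```python
import math

def proportion_interval(p_s, n, z):
    """Return (lower, upper) for the confidence interval around a sample proportion."""
    # Assumption: p_s stands in for p when checking that np and n(1 - p) are at least 5.
    if n * p_s < 5 or n * (1 - p_s) < 5:
        raise ValueError("sample too small for the normal approximation")
    margin = z * math.sqrt(p_s * (1 - p_s) / n)
    return p_s - margin, p_s + margin

low, high = proportion_interval(p_s=0.39, n=87, z=1.96)
print(round(low, 2), "<= p <=", round(high, 2))
```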
Solve:
A clothing company produces men’s jeans. The jeans are
made and sold with either a regular cut or a boot cut.
In an effort to estimate the proportion of their men’s jeans
market in Oklahoma City that is for boot-cut jeans, the
analyst takes a random sample of 212 jeans sales from
the company’s two Oklahoma City retail outlets.
Only 34 of the sales were for boot-cut jeans.
Construct a 90% confidence interval to estimate the
proportion of the population in Oklahoma City who prefer
boot-cut jeans.
Solution:
ps = 34/212 = .16
A point estimate for boot-cut jeans is .16 or 16%.
The Z value for 90% level of confidence is 1.645.
The confidence interval estimate is:
ps - Z √[ps(1 - ps) / n] ≤ p ≤ ps + Z √[ps(1 - ps) / n]
.16 - 1.645 √[(.16)(.84) / 212] ≤ p ≤ .16 + 1.645 √[(.16)(.84) / 212]
.16 - .04 ≤ p ≤ .16 + .04
.12 ≤ p ≤ .20
We are 90% confident that the proportion of boot-cut jeans
is between 12% and 20%.
Estimating Sample Size
The amount of sampling error you are
willing to accept and the level of
confidence desired determine the size
of your sample.
Sample size when Estimating µ
n = Z²σ² / e²
e = Z (σ / √n)
To determine sample size:
• Know the desired confidence level, which determines the
value of Z (the critical value from the standardized normal
distribution). Determining the confidence level is subjective.
• Know the acceptable sampling error, e: the amount of error
that can be tolerated.
• Know the standard deviation, σ. If unknown, estimate by:
• past data
• educated guess
• the estimate σ ≈ range ÷ 4. This estimate is derived from the
empirical rule stating that approximately 95% of the values
in a normal distribution are within +/- 2σ of the mean,
so the range covering most of the values spans about 4σ.
Example:
Suppose the marketing manager wishes to
estimate the population mean annual usage of
home heating oil to within +/- 50 gallons of the
true value, and he wants to be 95% confident of
correctly estimating the true mean.
On the basis of a study taken the previous year,
he believes that the standard deviation can be
estimated as 325 gallons.
Find the sample size needed.
Solution:
• With e = 50, σ = 325, and 95% confidence (Z = 1.96):
• n = Z²σ² / e² = (1.96)² (325)² / (50)²
• n = 162.31
• Therefore, n = 163. As a general rule for
determining sample size, always round up to the
next integer value in order to slightly
oversatisfy the criteria desired.
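The heating oil calculation, including the round-up rule, can be sketched like this. The function name is made up for this example.

```python
import math

def sample_size_for_mean(z, sigma, e):
    """Smallest integer n satisfying n >= Z^2 * sigma^2 / e^2."""
    return math.ceil((z * sigma / e) ** 2)

n_oil = sample_size_for_mean(z=1.96, sigma=325, e=50)
print(n_oil)
```

Halving the acceptable error e roughly quadruples the required sample size, because e is squared in the denominator.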
Solve:
Suppose you want to estimate the average age of all
Boeing 727 airplanes now in active domestic U.S.
service.
You want to be 95% confident, and you want your estimate
to be within 2 years of the actual figure.
The 727 was first placed in service about 30 years ago, but
you believe that no active 727s in the U.S. domestic fleet
are more than 25 years old.
How large a sample should you take?
Solution:
With e = 2 years,
the Z value for 95% confidence = 1.96,
and σ unknown,
σ must be estimated by using σ ≈ range ÷ 4. As
the range of ages is 0 to 25 years, σ ≈ 25 ÷ 4 =
6.25.
n = Z²σ² / e² = (1.96)² (6.25)² / (2)²
= 37.52 airplanes.
Because you cannot sample 37.52 units, the
required sample size is 38.
If you randomly sample 38 planes, you can
estimate the average age of active 727s
within 2 years and be 95% confident of the
results.
Solve:
Determine the sample size necessary to
estimate µ when values range from 80 to
500, error is to be within 10, and the
confidence level is 90 %.
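n = Z²σ² / e²
Answer: with σ ≈ range ÷ 4 = (500 - 80) ÷ 4 = 105 and Z = 1.645,
n = (1.645)² (105)² / (10)² = 298.34, so the required sample size is 299.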
Determining sample size for proportion
n = Z²p(1 - p) / e²
• p = population proportion (if unknown, analysts
use .5 as an estimate of p in the formula)
• e = error of estimation, equal to (ps - p), the
difference between the sample proportion and
the parameter to be estimated, p. It represents the
amount of error we are willing to tolerate.
Solve:
The Packer, a produce industry trade publication, wants to
survey Americans and ask whether they are eating more
fresh fruits and vegetables than they did 1 year ago.
The organization wants to be 90% confident in its results
and maintain an error within .05. How large a sample
should it take?
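One way to check an answer to this exercise is sketched below. Since the population proportion is unknown, the sketch assumes p = .5, which gives the largest value of p(1 - p) and therefore the most conservative sample size. The function name is made up for this example, and Z = 1.645 is the 90% value used earlier.

```python
import math

def sample_size_for_proportion(z, e, p=0.5):
    """Smallest integer n satisfying n >= Z^2 * p * (1 - p) / e^2."""
    return math.ceil(z ** 2 * p * (1 - p) / e ** 2)

n_survey = sample_size_for_proportion(z=1.645, e=0.05)
print(n_survey)
```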