Chapter 9
Confidence Intervals
Suppose we wanted to estimate the
proportion of blue candies in a VERY large
bowl.
How might we go about estimating this
proportion?
We could take a sample of candies and compute the proportion of blue candies in our sample.
We would have a sample proportion or a statistic: a single value for the estimate.
Point Estimate
• A single number (a statistic) based
on sample data that is used to
estimate a population characteristic
• But not always equal to the population characteristic, due to sampling variation
“Point” refers to the single value on a number line.
Different samples may produce different statistics.
Population characteristic
The paper “U.S. College Students’ Internet Use:
Race, Gender and Digital Divides” (Journal of
Computer-Mediated Communication, 2009)
reports the results of 7421 students at 40
colleges and universities. (The sample was
selected in such a way that it is representative
of the population of college students.)
The authors want to estimate the proportion (p) of college students who spend more than 3 hours a day on the Internet.
2998 out of 7421 students reported using the Internet more than 3 hours a day.
$\hat{p} = 2998/7421 = .404$
This is a point estimate for the population proportion of college students who spend more than 3 hours a day on the Internet.
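The point estimate above is plain arithmetic. A minimal Python sketch follows; the counts come from the slide, while the variable names are illustrative and not from the paper.

```python
# Point estimate of a population proportion from sample counts.
# Counts are from the slide; variable names are illustrative assumptions.
successes = 2998   # students reporting more than 3 hours a day online
n = 7421           # students surveyed

p_hat = successes / n
print(f"p-hat = {p_hat:.3f}")
```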
The paper “The Impact of Internet and Television Use on the Reading Habits and Practices of College Students” (Journal of Adolescence and Adult Literacy, 2009) investigates the reading habits of college students. The following observations represent the number of hours spent on academic reading in 1 week by 20 college students.
1.7 3.8 4.7 9.6 11.7 12.3 12.3 12.4 12.6 13.4
14.1 14.2 15.8 15.9 18.7 19.4 21.2 21.9 23.3 28.2
The dotplot suggests this data is approximately symmetrical.
If a point estimate of $\mu$, the mean academic reading time per week for all college students, is desired, an obvious choice of a statistic for estimating $\mu$ is the sample mean $\bar{x}$. However, there are other possibilities: a trimmed mean or the sample median.
College Reading Continued . . .
1.7 3.8 4.7 9.6 11.7 12.3 12.3 12.4 12.6 13.4
14.1 14.2 15.8 15.9 18.7 19.4 21.2 21.9 23.3 28.2
sample mean: $\bar{x} = \frac{287.2}{20} = 14.36$
sample median: $\frac{13.4 + 14.1}{2} = 13.75$ (the mean of the middle two observations)
10% trimmed mean: $\frac{230.2}{16} = 14.39$ (after dropping the 2 smallest and 2 largest observations)
So which of these point estimates should we use?
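The three point estimates can be reproduced with a short Python sketch. The data are the 20 observations from the slide; `trimmed_mean` is a hypothetical helper written for this illustration, not a library function.

```python
import statistics

# Hours of academic reading in one week for the 20 students on the slide.
hours = [1.7, 3.8, 4.7, 9.6, 11.7, 12.3, 12.3, 12.4, 12.6, 13.4,
         14.1, 14.2, 15.8, 15.9, 18.7, 19.4, 21.2, 21.9, 23.3, 28.2]

def trimmed_mean(data, percent):
    """Mean after dropping `percent`% of the observations from each end.

    Hypothetical helper for this example only.
    """
    ordered = sorted(data)
    k = len(ordered) * percent // 100   # observations trimmed per end
    kept = ordered[k:len(ordered) - k]
    return sum(kept) / len(kept)

sample_mean = statistics.mean(hours)
sample_median = statistics.median(hours)
trimmed = trimmed_mean(hours, 10)

print(f"mean = {sample_mean:.2f}")
print(f"median = {sample_median:.2f}")
print(f"10% trimmed mean = {trimmed:.2f}")
```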
Choosing a Statistic for
Computing an Estimate
• Choose a statistic that is unbiased (accurate)
A statistic whose mean value is equal to the value of the population characteristic being estimated is said to be an unbiased statistic.
[Figure: three sampling distributions compared with the true value. Two are labeled “Unbiased, since the distribution is centered at the true value”; one is labeled “Biased, since the distribution is NOT centered at the true value”.]
Choosing a Statistic for
Computing an Estimate
• Choose a statistic that is unbiased (accurate)
• Choose a statistic with the smallest standard deviation
[Figure: two sampling distributions centered at the true value. One is labeled “Unbiased, but has a smaller standard deviation so it is more precise”; the other is labeled “Unbiased, but has a larger standard deviation so it is not as precise”.]
If the population distribution is normal, then $\bar{x}$ has a smaller standard deviation than any other unbiased statistic for estimating $\mu$.
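The claim about precision can be explored with a small simulation. The sketch below draws many samples from a normal population and compares the spread of the sample mean with the spread of the sample median. The population mean and standard deviation (`mu=14.0`, `sigma=6.0`), the sample size, and the number of repetitions are all assumed values chosen only for illustration.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is repeatable

def simulate(statistic, n=20, reps=2000, mu=14.0, sigma=6.0):
    """Values of `statistic` over many samples from a normal population.

    mu, sigma, n and reps are illustrative assumptions, not slide values.
    """
    return [statistic([random.gauss(mu, sigma) for _ in range(n)])
            for _ in range(reps)]

means = simulate(statistics.mean)
medians = simulate(statistics.median)

sd_mean = statistics.stdev(means)
sd_median = statistics.stdev(medians)

print(f"std. dev. of sample means:   {sd_mean:.3f}")
print(f"std. dev. of sample medians: {sd_median:.3f}")
```

Both statistics center near the population mean, but under these assumptions the sample means should cluster more tightly than the sample medians.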
Suppose we wanted to estimate the
proportion of blue candies in a VERY large
bowl.
We could take a sample of candies and
compute the proportion of blue candies in
our sample.
How much confidence do you have in your point estimate?
Would you have more confidence if the answer were an interval?
Confidence intervals
A confidence interval (CI) for a population
characteristic is an interval of plausible values
for the characteristic.
It is constructed so that, with a chosen degree of confidence, the actual value of the characteristic will be between the lower and upper endpoints of the interval.
The primary goal of a confidence interval is to estimate an unknown population characteristic.
Rate your confidence
0 – 100%
What does it mean to be within 10 years?
How confident (%) are you that you can ...
Guess my age within 10 years?
. . . within 5 years?
. . . within 1 year?
What happened to
your level of
confidence as the
interval became
smaller?
Confidence level
The confidence level associated with a
confidence interval estimate is the success rate
of the method used to construct the interval.
If this method were used to generate an interval estimate over and over again from different samples, in the long run 95% of the resulting intervals would include the actual value of the characteristic being estimated.
Our confidence is in the method, NOT in any one particular interval!
The most common confidence levels are 90%, 95%, and 99% confidence.
Recall the General Properties for
Sampling Distributions of $\hat{p}$
1. $\mu_{\hat{p}} = p$
2. $\sigma_{\hat{p}} = \sqrt{\frac{p(1-p)}{n}}$, as long as the sample size is less than 10% of the population
3. As long as n is large (np > 10 and n(1-p) > 10) the sampling distribution of $\hat{p}$ is approximately normal.
These are the conditions that must be true in order to calculate a large-sample confidence interval for p.
Let’s develop the equation for the large-sample confidence interval.
To begin, we will use a 95% confidence level.
Use the table of standard normal curve areas to determine the value of z* such that a central area of .95 falls between -z* and z*.
[Figure: standard normal curve. Central area = .95, lower tail area = .025, upper tail area = .025, with boundaries at -1.96 and 1.96.]
About 95% of the values are within 1.96 standard deviations of the mean. We can generalize this to normal distributions other than the standard normal distribution.
For large random samples, the sampling distribution of $\hat{p}$ is approximately normal. So about 95% of the possible $\hat{p}$ will fall within
$$1.96\sqrt{\frac{p(1-p)}{n}}$$
of p.
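As an alternative to the printed table, the z* value for a given central area can be looked up in software. This is a minimal sketch using Python's standard-library `statistics.NormalDist`; the function name `z_critical` is an illustrative choice.

```python
from statistics import NormalDist

def z_critical(confidence):
    """z* such that the central area under the standard normal curve
    equals `confidence` (illustrative helper name)."""
    tail = (1 - confidence) / 2          # area in each tail
    return NormalDist().inv_cdf(1 - tail)

z_star = z_critical(0.95)
print(f"z* for 95% confidence: {z_star:.2f}")
```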
Developing a Confidence Interval Continued . . .
If $\hat{p}$ is within $1.96\sqrt{\frac{p(1-p)}{n}}$ of p, this means the interval
$$\hat{p} - 1.96\sqrt{\frac{p(1-p)}{n}} \quad\text{to}\quad \hat{p} + 1.96\sqrt{\frac{p(1-p)}{n}}$$
will capture p.
And this will happen for 95% of all possible samples!
Developing a Confidence Interval Continued . . .
[Figure: approximate sampling distribution of $\hat{p}$, centered at p (the mean of the sampling distribution), with one vertical line marking 1.96 standard deviations below the mean and another marking 1.96 standard deviations above the mean. Intervals are drawn around several sample values of $\hat{p}$.]
Suppose we get this $\hat{p}$ and create an interval around it. This $\hat{p}$ fell within 1.96 standard deviations of the mean AND its confidence interval “captures” p.
Suppose we get this $\hat{p}$ and create an interval. This $\hat{p}$ doesn’t fall within 1.96 standard deviations of the mean AND its confidence interval does NOT “capture” p.
Notice that the length of each half of the interval equals
$$1.96\sqrt{\frac{p(1-p)}{n}}$$
Using this method of calculation, the confidence interval will not capture p 5% of the time.
When n is large, a 95% confidence interval for p is
$$\hat{p} \pm 1.96\sqrt{\frac{p(1-p)}{n}}$$
The diagram to the
right shows 100
confidence intervals
for p computed from
100 different random
samples.
Note that the ones
with asterisks do not
capture p.
If we were to
compute 100 more
confidence intervals
for p from 100
different random
samples, would we get
the same results?
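One way to explore this question is to simulate it. The sketch below builds 100 intervals from 100 random samples and counts how many capture p. The population proportion (`p_true = 0.40`) and sample size (`n = 500`) are assumed values for illustration only; a different seed gives a different set of intervals, with the capture count typically near 95.

```python
import math
import random

random.seed(9)  # fixed seed so the run is repeatable; change it to see new results

def ci_95(successes, n):
    """Large-sample 95% confidence interval for a proportion."""
    p_hat = successes / n
    margin = 1.96 * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

p_true = 0.40   # hypothetical population proportion (assumption)
n = 500         # hypothetical sample size (assumption)

captured = 0
for _ in range(100):
    # Draw one random sample and count the "successes".
    successes = sum(random.random() < p_true for _ in range(n))
    low, high = ci_95(successes, n)
    if low <= p_true <= high:
        captured += 1

print(f"{captured} of 100 intervals captured p")
```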
The Large-Sample Confidence
Interval for p
Now let’s look at a more general formula.
The general formula for a confidence interval for a population proportion p applies when
• $\hat{p}$ is the sample proportion from a random sample,
• the sample size n is large ($n\hat{p} > 10$ and $n(1-\hat{p}) > 10$), and
• if the sample is selected without replacement, the sample size is small relative to the population size (at most 10% of the population)
The Large-Sample Confidence
Interval for p
The general formula for a confidence interval for a population proportion p . . . is
$$\hat{p} \pm (z\text{ critical value})\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
Here $\hat{p}$ is the point estimate, and $\sqrt{\hat{p}(1-\hat{p})/n}$ is the estimate of the standard deviation of $\hat{p}$, or standard error.
The standard error of a statistic is the estimated standard deviation of the statistic.
The Large-Sample Confidence
Interval for p
The general formula for a confidence interval for a population proportion p . . . is
$$\hat{p} \pm (z\text{ critical value})\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
The term $(z\text{ critical value})\sqrt{\hat{p}(1-\hat{p})/n}$ is called the bound on the error estimation. This is also called the margin of error.
The 95% confidence interval is based on the fact that, for approximately 95% of all random samples, $\hat{p}$ is within the bound on error estimation of p.
The article “How Well Are U.S. Colleges Run?” (USA Today, February 17, 2010) describes a survey of 1031 adult Americans. The survey was carried out by the National Center for Public Policy and the sample was selected in a way that makes it reasonable to regard the sample as representative of adult Americans. Of those surveyed, 567 indicated that they believe a college education is essential for success.
The point estimate is $\hat{p} = \frac{567}{1031} = .55$
Before computing the confidence interval, we need to verify the conditions.
What is a 95% confidence interval for the
population proportion of adult Americans who
believe that a college education is essential for
success?
College Education Continued . . .
What is a 95% confidence interval for the
population proportion of adult Americans who
believe that a college education is essential for success?
Conditions:
1) $n\hat{p}$ = 1031(.55) = 567 and $n(1-\hat{p})$ = 1031(.45) = 464. Since both of these are greater than 10, the sample size is large enough to proceed.
2) The sample size of n = 1031 is much smaller than 10% of the population size (adult Americans).
3) The sample was selected in a way designed to produce a representative sample. So we can regard the sample as a random sample from the population.
All our conditions are verified, so it is safe to proceed with the calculation of the confidence interval.
College Education Continued . . .
What is a 95% confidence interval for the
population proportion of adult Americans who
believe that a college education is essential for success?
Calculation:
$$\hat{p} \pm (z\text{ critical value})\sqrt{\frac{\hat{p}(1-\hat{p})}{n}}$$
$$.55 \pm 1.96\sqrt{\frac{.55(.45)}{1031}} = .55 \pm .030 = (.520, .580)$$
What does this interval mean in the context of this problem?
Conclusion:
We are 95% confident that the population proportion of adult Americans who believe that a college education is essential for success is between 52.0% and 58.0%.
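The calculation can be checked with a short Python sketch. The function name `proportion_ci` is an illustrative choice, and the sketch uses the unrounded $\hat{p}$ = 567/1031, so its endpoints may differ in the third decimal place from a hand calculation that starts from the rounded value .55.

```python
import math

def proportion_ci(successes, n, z_star):
    """Large-sample confidence interval for a population proportion.

    Illustrative helper; checks the large-sample condition with p-hat
    standing in for the unknown p.
    """
    p_hat = successes / n
    if n * p_hat <= 10 or n * (1 - p_hat) <= 10:
        raise ValueError("sample too small for the large-sample interval")
    margin = z_star * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin

low, high = proportion_ci(567, 1031, 1.96)
print(f"95% CI: ({low:.3f}, {high:.3f})")
```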
College Education Revisited . . .
Recall the “Rate your Confidence” Activity.
A 95% confidence interval for the population proportion of adult Americans who believe that a college education is essential for success is:
$$.55 \pm 1.96\sqrt{\frac{.55(.45)}{1031}} = (.520, .580)$$
Compute a 90% confidence interval for this proportion.
$$.55 \pm 1.645\sqrt{\frac{.55(.45)}{1031}} = (.525, .575)$$
Compute a 99% confidence interval for this proportion.
$$.55 \pm 2.58\sqrt{\frac{.55(.45)}{1031}} = (.510, .590)$$
What do you notice about the relationship between the confidence level of an interval and the width of the interval?
Choosing a Sample Size
The bound on error estimation for a 95% confidence interval is
$$B = 1.96\sqrt{\frac{p(1-p)}{n}}$$
Before collecting any data, an investigator may wish to determine a sample size needed to achieve a certain bound on error estimation.
If we solve this for n ...
$$n = p(1-p)\left(\frac{1.96}{B}\right)^2$$
What value should be used for the unknown value p?
Sometimes, it is feasible to perform a preliminary study to estimate the value for p.
In other cases, prior knowledge may suggest a reasonable estimate for p.
If there is no prior knowledge and a preliminary study is not feasible, then the conservative estimate for p is 0.5.
Why is the conservative
estimate for p = 0.5?
.1(.9) = .09
.2(.8) = .16
.3(.7) = .21
.4(.6) = .24
.5(.5) = .25
By using .5 for p, we
are using the largest
value for p(1 – p) in
our calculations.
In spite of the potential safety hazards,
some people would like to have an internet
connection in their car. Determine the
sample size required to estimate the
proportion of adult Americans who would like an
internet connection in their car to within 0.03
with 95% confidence.
$$n = p(1-p)\left(\frac{1.96}{B}\right)^2$$
What value should be used for p? With no prior estimate available, use the conservative value p = .5, so p(1 – p) = .25.
The value for the bound on error estimate B is .03.
$$n = .25\left(\frac{1.96}{.03}\right)^2 = 1067.111$$
Always round the sample size up to the next whole number.
n = 1068 people
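The sample-size formula, including the round-up step, can be sketched in Python as follows. The function name and its default arguments (1.96 for 95% confidence, p = 0.5 as the conservative estimate) are illustrative choices.

```python
import math

def sample_size_for_proportion(bound, z_star=1.96, p=0.5):
    """Smallest n whose bound on error is at most `bound`.

    p defaults to the conservative estimate 0.5 (illustrative helper).
    """
    n = p * (1 - p) * (z_star / bound) ** 2
    return math.ceil(n)   # always round up to the next whole number

n_needed = sample_size_for_proportion(0.03)
print(f"required sample size: {n_needed}")
```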
Now let’s look at confidence intervals to estimate the mean $\mu$ of a population.
Confidence intervals for $\mu$ when $\sigma$ is known
The general formula for a confidence interval for a population mean $\mu$ when . . .
1) $\bar{x}$ is the sample mean from a random sample,
2) the sample size n is large (n > 30), and
3) $\sigma$, the population standard deviation, is known (Is this typically known?)
is
$$\bar{x} \pm (z\text{ critical value})\left(\frac{\sigma}{\sqrt{n}}\right)$$
Here $\bar{x}$ is the point estimate, $\sigma/\sqrt{n}$ is the standard deviation of the statistic, and the whole term after the ± sign is the bound on error of estimation.
These are the properties of the sampling distribution of $\bar{x}$.
This confidence interval is appropriate even when n is small, as long as it is reasonable to think that the population distribution is normal in shape.
Cosmic radiation levels rise with increasing altitude, prompting researchers to consider how pilots and flight crews might be affected by increased exposure to cosmic radiation. A study reported a mean annual cosmic radiation dose of 219 mrems for a sample of flight personnel of Xinjiang Airlines. Suppose this mean is based on a random sample of 100 flight crew members. Let $\sigma$ = 35 mrems.
Calculate and interpret a 95% confidence interval for the actual mean annual cosmic radiation exposure for Xinjiang flight crew members.
First, verify that the conditions are met.
1) Data is from a random sample of crew members
2) Sample size n is large (n > 30)
3) $\sigma$ is known
Cosmic Radiation Continued . . .
Let $\bar{x}$ = 219 mrems, n = 100 flight crew members, $\sigma$ = 35 mrems.
Calculate and interpret a 95% confidence interval for the actual mean annual cosmic radiation exposure for Xinjiang flight crew members.
$$\bar{x} \pm (z\text{ critical value})\left(\frac{\sigma}{\sqrt{n}}\right)$$
$$219 \pm 1.96\left(\frac{35}{\sqrt{100}}\right) = (212.14, 225.86)$$
What would happen to the width of this interval if the confidence level was 90% instead of 95%?
What does this mean
in context?
We are 95% confident that the actual mean annual
cosmic radiation exposure for Xinjiang flight crew
members is between 212.14 mrems and 225.86 mrems.
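The interval can be reproduced with a few lines of Python. The function name `mean_ci_known_sigma` is an illustrative choice; the inputs are the values given on the slide. The second call uses 1.645 to show what happens at 90% confidence.

```python
import math

def mean_ci_known_sigma(x_bar, sigma, n, z_star):
    """Confidence interval for a population mean when sigma is known
    (illustrative helper)."""
    margin = z_star * sigma / math.sqrt(n)
    return x_bar - margin, x_bar + margin

low, high = mean_ci_known_sigma(219, 35, 100, 1.96)
print(f"95% CI: ({low:.2f}, {high:.2f}) mrems")

low90, high90 = mean_ci_known_sigma(219, 35, 100, 1.645)
print(f"90% CI: ({low90:.2f}, {high90:.2f}) mrems")
```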
Confidence intervals for $\mu$ when $\sigma$ is unknown
When $\sigma$ is unknown, we use the sample standard deviation s to estimate $\sigma$. In place of z-scores, we must use the following to standardize the values:
$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$
The use of the value of s introduces extra variability. Therefore the distribution of t values has more variability than a standard normal curve.
Important Properties of t
Distributions
t distributions are described by degrees of freedom (df).
1) The t distribution corresponding to any particular number of degrees of freedom is bell shaped and centered at zero (just like the standard normal (z) distribution).
2) Each t distribution is more spread out than the standard normal distribution.
[Figure: z curve and t curve for 2 df, both centered at 0]
Why is the z curve taller than the t curve for 2 df?
Important Properties of t
Distributions Continued . . .
3) As the number of degrees of freedom increases,
the spread of the corresponding t distribution
decreases.
[Figure: t curve for 8 df and t curve for 2 df, both centered at 0]
Important Properties of t
Distributions Continued . . .
3) As the number of degrees of freedom increases,
the spread of the corresponding t distribution
decreases.
4) As the number of degrees of freedom increases, the corresponding sequence of t distributions approaches the standard normal distribution.
[Figure: z curve, t curve for 5 df, and t curve for 2 df, all centered at 0]
For what df would the t distribution be approximately the same as a standard normal distribution?
Confidence intervals for $\mu$ when $\sigma$ is unknown
The general formula for a confidence interval for a population mean $\mu$ based on a sample of size n when . . .
1) $\bar{x}$ is the sample mean from a random sample,
2) the population distribution is normal, or the sample size n is large (n > 30), and
3) $\sigma$, the population standard deviation, is unknown
is
$$\bar{x} \pm (t\text{ critical value})\left(\frac{s}{\sqrt{n}}\right)$$
where the t critical value is based on df = n - 1. t critical values are found in Table 3.
This confidence interval is appropriate for small n ONLY when the population distribution is (at least approximately) normal.
The article “Chimps Aren’t Charitable” (Newsday, November 2, 2005) summarized the results of a research study published in the journal Nature. In this study, chimpanzees learned to use an apparatus that dispensed food when either of two ropes was pulled. When one of the ropes was pulled, only the chimp controlling the apparatus received food. When the other rope was pulled, food was dispensed both to the chimp controlling the apparatus and also a chimp in the adjoining cage. The accompanying data represent the number of times out of 36 trials that each of seven chimps chose the option that would provide food to both chimps (charitable response).
23 22 21 24 19 20 20
Compute a 99% confidence interval for the mean number of charitable responses for the population of all chimps.
First verify that the conditions for a t-interval are met.
Chimps Continued . . .
23 22 21 24 19 20 20
Let’s suppose it is reasonable to regard this sample of seven chimps as representative of the chimp population.
Since n is small, we need to verify if it is plausible that this sample is from a population that is approximately normal.
Let’s use a normal probability plot.
[Figure: normal probability plot. Vertical axis: Normal Scores; horizontal axis: Number of Charitable Responses]
The plot is reasonably straight, so it seems plausible that the population distribution of number of charitable responses is approximately normal.
Chimps Continued . . .
23 22 21 24 19 20 20
$\bar{x}$ = 21.29 and s = 1.80
df = 7 – 1 = 6
$$\bar{x} \pm (t\text{ critical value})\left(\frac{s}{\sqrt{n}}\right)$$
$$21.29 \pm 3.71\left(\frac{1.80}{\sqrt{7}}\right) = (18.77, 23.81)$$
We are 99% confident that the
mean number of charitable
responses for the population of all
chimps is between 18.77 and 23.81.
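The summary statistics and the interval can be checked with a short Python sketch. Python's standard library has no t distribution, so the t critical value 3.71 (99% confidence, df = 6) is simply copied from the slide rather than computed. Because the sketch uses the unrounded mean and standard deviation, its endpoints may differ from the hand calculation in the second decimal place.

```python
import math
import statistics

# Charitable responses (out of 36 trials) for the seven chimps on the slide.
responses = [23, 22, 21, 24, 19, 20, 20]

n = len(responses)
x_bar = statistics.mean(responses)
s = statistics.stdev(responses)    # sample standard deviation (divides by n - 1)
df = n - 1

t_star = 3.71   # taken from the slide's table value for 99% confidence, df = 6

margin = t_star * s / math.sqrt(n)
low, high = x_bar - margin, x_bar + margin

print(f"x-bar = {x_bar:.2f}, s = {s:.2f}, df = {df}")
print(f"99% CI: ({low:.2f}, {high:.2f})")
```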
Choosing a Sample Size
The bound on error of estimation associated with a 95% confidence interval is
$$B = 1.96\left(\frac{\sigma}{\sqrt{n}}\right)$$
We can use this to find the necessary sample size for a particular bound on error of estimation.
Solve this for n:
$$n = \left(\frac{1.96\,\sigma}{B}\right)^2$$
This requires $\sigma$ to be known, which is rarely the case!
When $\sigma$ is unknown, a preliminary study can be performed to estimate $\sigma$ OR make an educated guess of the value of $\sigma$.
A rough estimate for $\sigma$ (used with distributions that are not too skewed) is the range divided by 4.
The financial aid office wishes to
estimate the mean cost of textbooks per
quarter for students at a particular
university. For the estimate to be useful,
it should be within $20 of the true
population mean. How large a sample
should be used to be 95% confident of
achieving this level of accuracy?
The financial aid office believes that the amount spent on books varies, with most values between $150 and $550.
To estimate $\sigma$:
$$\sigma \approx \frac{550 - 150}{4} = \$100$$
The financial aid office wishes to
estimate the mean cost of textbooks per
quarter for students at a particular
university. For the estimate to be useful,
it should be within $20 of the true
population mean. How large a sample
should be used to be 95% confident of
achieving this level of accuracy?
$$n = \left(\frac{1.96(100)}{20}\right)^2 = 96.04$$
n = 97
Always round sample size up to the next whole number!
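The textbook example can be worked through in Python as a final check. The function name `sample_size_for_mean` is an illustrative choice; the range-divided-by-4 estimate of $\sigma$ follows the rule of thumb stated earlier.

```python
import math

def sample_size_for_mean(bound, sigma, z_star=1.96):
    """Smallest n whose bound on error of estimation is at most `bound`
    (illustrative helper; z_star defaults to 95% confidence)."""
    return math.ceil((z_star * sigma / bound) ** 2)   # always round up

# Rough estimate of sigma: range of typical values divided by 4.
sigma_guess = (550 - 150) / 4

n_needed = sample_size_for_mean(20, sigma_guess)
print(f"estimated sigma = ${sigma_guess:.0f}, required sample size = {n_needed}")
```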