Chapter 9
Estimation Using a Single Sample
In many practical problems, we want to estimate some population characteristic, for
example, the population mean $\mu$, the population standard deviation $\sigma$, the
proportion $\pi$ of S's in a population, and so on. In this chapter, we will introduce two
estimation techniques, point estimation and interval estimation.
9.1 Point Estimation
Definition 9.1: A point estimate of a population characteristic is a single number
computed from sample data and represents a plausible value of the characteristic.
Note: (1) The adjective point reflects the fact that the estimate corresponds to a single
point on the number line.
(2) A point estimate is obtained by (i) selecting an appropriate statistic; (ii) computing the
value of the statistic for the given sample. For example, the computed value of the sample
mean $\bar{x}$ provides a point estimate of a population mean $\mu$.
Sometimes, there may be several statistics that can reasonably be used to obtain a point
estimate of a specified population characteristic. For example, to obtain a point estimate
of a population mean $\mu$, we can use the sample mean $\bar{x}$, a trimmed mean, or the
sample median. Which one should we choose for computing an estimate?

Criteria for choosing among competing statistics
Generally, we choose the statistic that tends, on average, to produce an estimate closest to
the true value, that is, the most accurate estimate. Information about the accuracy of
estimation for a particular statistic is provided by the statistic’s sampling distribution.
(a) If a statistic whose sampling distribution is centered to the right of the true value is
used to compute an estimate, the estimate will tend to be larger than the true value.
(b) If a statistic whose sampling distribution is centered to the left of the true value is
used to compute an estimate, the estimate will tend to be smaller than the true value.
(c) When a statistic whose sampling distribution is centered at the true value is used to
compute an estimate, there will be no long-run tendency to over- or underestimate the
true value.
Definition 9.2: A statistic whose mean is equal to the value of the population
characteristic being estimated is said to be an unbiased statistic. A statistic that is not
unbiased is said to be biased.
Questions: (1) Is $\bar{x}$ an unbiased statistic for estimating $\mu$? (2) Is $p$ an
unbiased statistic for estimating a population proportion $\pi$?
Generally, we prefer an unbiased statistic. If several unbiased statistics could be used for
estimating a population characteristic, the best statistic to use is the one with the smallest
standard deviation.
Consider the problem of estimating a population mean, $\mu$. When the population
distribution is symmetric, the sample mean $\bar{x}$, the sample median, and any trimmed
mean are all unbiased statistics. Which statistic should be used? The following facts are
helpful in making a choice.
1. If the population distribution is normal, then $\bar{x}$ has a smaller standard deviation
than any other unbiased statistic for estimating $\mu$.
2. When the population is symmetric with heavier tails than the normal curve, a trimmed
mean has a smaller standard deviation than $\bar{x}$.
Thus, when the population distribution is normal, we should use $\bar{x}$ to estimate $\mu$.
But with a heavy-tailed distribution, a trimmed mean is a better statistic than $\bar{x}$ for
estimating $\mu$.
Generally, we estimate a population characteristic by the corresponding sample
characteristic, which is summarized in the following table.
Population characteristic to be estimated    Statistic to use    Unbiasedness
Population proportion, $\pi$                 $p$                 Unbiased
Population mean, $\mu$                       $\bar{x}$           Unbiased
Population variance, $\sigma^2$              $s^2$               Unbiased
Population standard deviation, $\sigma$      $s$                 Biased
Table 9.1 Statistics used to estimate some important population characteristics
Note: $s$ is not an unbiased statistic for estimating $\sigma$. It tends to underestimate
slightly the true value of $\sigma$. Since unbiasedness is not the only criterion to judge a
statistic, and there are other good reasons for using $s$ to estimate $\sigma$, we will use
$s$ to estimate $\sigma$ in the following discussion.
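The claim that $s^2$ is unbiased while $s$ slightly underestimates $\sigma$ can be illustrated with a small simulation. This is a sketch under assumed settings (a normal population with $\sigma = 1$ and samples of size 5); the function name is made up for the example.

```python
import random
import statistics


def average_s2_and_s(n=5, sigma=1.0, reps=20000, seed=2):
    """Average of s^2 and of s over many simulated samples of size n."""
    rng = random.Random(seed)
    s2_values = []
    s_values = []
    for _ in range(reps):
        sample = [rng.gauss(0.0, sigma) for _ in range(n)]
        s2 = statistics.variance(sample)  # sample variance, divisor n - 1
        s2_values.append(s2)
        s_values.append(s2 ** 0.5)
    return statistics.fmean(s2_values), statistics.fmean(s_values)


mean_s2, mean_s = average_s2_and_s()
print("average s^2:", round(mean_s2, 3), "(true variance is 1)")
print("average s:  ", round(mean_s, 3), "(true standard deviation is 1)")
```

The average of $s^2$ should land close to 1, while the average of $s$ should fall noticeably below 1 for a sample this small.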
9.2 A Large-Sample Confidence Interval for a Population Proportion
In Section 9.1, we saw how to use a statistic to produce a point estimate of a population
characteristic. However, because of sampling variability, rarely is the point estimate from
a sample exactly equal to the true value of the population characteristic. Although a point
estimate may represent our best single-number guess for the value of the population
characteristic, it is not the only plausible value. Thus we need to indicate in some way
how precisely the population characteristic has been estimated. A point estimate by itself
does not provide this information. As an alternative to a point estimate, we report an
interval of reasonable values based on the sample data. Then we can have some
“confidence” in the interval estimate.
Definition 9.3: A confidence interval for a population characteristic is an interval of
plausible values for the characteristic. It is constructed so that, with a chosen degree of
confidence, the value of the characteristic will be captured inside the interval.
Definition 9.4: The confidence level associated with a confidence interval estimate is the
success rate of the method used to construct the interval.
Note: The confidence level provides information on how much “confidence” we can have
in the method used to construct the interval, not our confidence in any one particular
interval. See Figure 9.4 on page 381 for interpretation.
We first consider a large-sample confidence interval for a population proportion. Let
$\pi$ = proportion of individuals in the population that possess the property of interest,
$p$ = (number of individuals in the sample that possess the property of interest) / $n$,
the sample proportion.
We know that the sampling distribution of the statistic $p$ has the following properties:
(1) The sampling distribution of $p$ is centered at $\pi$; that is, $\mu_p = \pi$. Therefore,
$p$ is an unbiased statistic for estimating $\pi$.
(2) The standard deviation of $p$ is $\sigma_p = \sqrt{\pi(1-\pi)/n}$.
(3) When $n\pi \ge 10$ and $n(1-\pi) \ge 10$, the sampling distribution of $p$ is
approximately normal with mean $\pi$ and standard deviation $\sqrt{\pi(1-\pi)/n}$.
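Properties (1) and (2) can be checked numerically. The following sketch simulates many samples and compares the mean and standard deviation of the resulting sample proportions with $\pi$ and $\sqrt{\pi(1-\pi)/n}$; the values $\pi = 0.3$ and $n = 100$ are arbitrary choices for illustration.

```python
import math
import random
import statistics


def simulate_sample_proportions(true_pi, n, reps=5000, seed=3):
    """Draw `reps` samples of size n and return the sample proportion of each."""
    rng = random.Random(seed)
    proportions = []
    for _ in range(reps):
        successes = sum(1 for _ in range(n) if rng.random() < true_pi)
        proportions.append(successes / n)
    return proportions


true_pi, n = 0.3, 100
proportions = simulate_sample_proportions(true_pi, n)
simulated_mean = statistics.fmean(proportions)
simulated_sd = statistics.stdev(proportions)
theoretical_sd = math.sqrt(true_pi * (1 - true_pi) / n)

print("mean of p:", round(simulated_mean, 4), "vs pi =", true_pi)
print("sd of p:  ", round(simulated_sd, 4), "vs", round(theoretical_sd, 4))
```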
The development of a confidence interval for $\pi$ with confidence level 95%
(i) Use Appendix Table 2 to determine a value $z^*$ such that $P(-z^* < z < z^*) = 0.95$.
This gives $z^* = 1.96$.
(ii) Since
$$-1.96 < \frac{p - \pi}{\sqrt{\pi(1-\pi)/n}} < 1.96$$
is equivalent to
$$p - 1.96\sqrt{\frac{\pi(1-\pi)}{n}} < \pi < p + 1.96\sqrt{\frac{\pi(1-\pi)}{n}},$$
we have
$$P\left(p - 1.96\sqrt{\frac{\pi(1-\pi)}{n}} < \pi < p + 1.96\sqrt{\frac{\pi(1-\pi)}{n}}\right) = P\left(-1.96 < \frac{p - \pi}{\sqrt{\pi(1-\pi)/n}} < 1.96\right) = 0.95.$$
This implies that in repeated sampling, 95% of the time the interval from
$$p - 1.96\sqrt{\frac{\pi(1-\pi)}{n}} \quad\text{to}\quad p + 1.96\sqrt{\frac{\pi(1-\pi)}{n}}$$
will contain $\pi$.
(iii) Since $\pi$ is unknown, the value of $\sqrt{\pi(1-\pi)/n}$ must be estimated. When the
sample size is large, $\sqrt{p(1-p)/n}$ should be close to $\sqrt{\pi(1-\pi)/n}$ and can be
used in its place.
Thus when $n$ is large, a 95% confidence interval for $\pi$ is
$$\left(p - 1.96\sqrt{\frac{p(1-p)}{n}},\; p + 1.96\sqrt{\frac{p(1-p)}{n}}\right)$$
An abbreviated formula for the interval is
$$p \pm 1.96\sqrt{\frac{p(1-p)}{n}}$$
where + gives the upper endpoint of the interval and – gives the lower endpoint of the
interval. The interval can be used as long as $np \ge 10$ and $n(1-p) \ge 10$.
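The phrase "in repeated sampling, 95% of the time" can be made concrete with a simulation. The sketch below repeatedly draws samples from a population with a known proportion, builds the 95% interval from each sample, and records how often the interval captures the true value. The true proportion 0.4 and sample size 200 are assumptions picked for the illustration.

```python
import math
import random


def coverage_rate(true_pi, n, reps=4000, z=1.96, seed=4):
    """Fraction of simulated large-sample intervals that contain true_pi."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(reps):
        successes = sum(1 for _ in range(n) if rng.random() < true_pi)
        p = successes / n
        margin = z * math.sqrt(p * (1 - p) / n)
        if p - margin < true_pi < p + margin:
            hits += 1
    return hits / reps


rate = coverage_rate(true_pi=0.4, n=200)
print("fraction of intervals containing pi:", rate)
```

The observed fraction should be near 0.95, though not exactly equal to it, because the interval relies on a normal approximation and the simulation itself has sampling variability.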
The formula given for a 95% confidence interval can easily be adapted for other
confidence levels.

The large-sample confidence interval for $\pi$
When
1. $p$ is the sample proportion from a random sample, and
2. the sample size $n$ is large ($np \ge 10$ and $n(1-p) \ge 10$),
the general formula for a confidence interval for a population proportion $\pi$ is
$$p \pm (z \text{ critical value})\sqrt{\frac{p(1-p)}{n}}$$
The desired confidence level determines the z critical value. The three most commonly
used confidence levels, 90%, 95%, and 99%, use z critical values 1.645, 1.96, and 2.58,
respectively.
Note: Some z critical values can be found in Appendix Table 3 on page 708.
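The general formula translates directly into a short function. The sketch below is one possible implementation, not the notes' own: it obtains the z critical value from Python's `statistics.NormalDist` instead of a printed table, and the function names and the data in the usage line are hypothetical.

```python
import math
from statistics import NormalDist


def z_critical_value(confidence_level):
    """z* such that the central area under the standard normal curve
    between -z* and z* equals confidence_level."""
    tail_area = (1 - confidence_level) / 2
    return NormalDist().inv_cdf(1 - tail_area)


def proportion_confidence_interval(successes, n, confidence_level=0.95):
    """Large-sample confidence interval for a population proportion."""
    # Sample-size check from the text: np >= 10 and n(1 - p) >= 10,
    # written in terms of counts to avoid rounding issues.
    if successes < 10 or n - successes < 10:
        raise ValueError("sample too small for the large-sample interval")
    p = successes / n
    margin = z_critical_value(confidence_level) * math.sqrt(p * (1 - p) / n)
    return p - margin, p + margin


for level in (0.90, 0.95, 0.99):
    print(level, round(z_critical_value(level), 3))

# Hypothetical data: 120 successes in a random sample of 300.
print(proportion_confidence_interval(120, 300, confidence_level=0.95))
```

The computed critical values carry more decimal places than the table values 1.645, 1.96, and 2.58, so endpoints can differ slightly from hand calculations in the last digit.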
Exercise in class: Discuss how each of the following factors affects the width of the
confidence interval for $\pi$: (1) the confidence level; (2) the sample size $n$; (3) the
value of $p$.
Generally, higher reliability of an interval (where "reliability" is specified by the
confidence level) entails a loss in precision (as indicated by a wider interval). For
example, the width of the 99% interval is $2\left(2.58\sqrt{p(1-p)/n}\right)$, which is
wider than the width of the 95% interval, $2\left(1.96\sqrt{p(1-p)/n}\right)$. In the opinion
of many investigators, a 95% interval gives a reasonable compromise between reliability and
precision.

The general form of a confidence interval
Many confidence intervals have the same general form as the large-sample interval for $\pi$:
1. (Point estimate using a specified statistic) ± (critical value) × (standard deviation of
the statistic), if the standard deviation of the statistic is known.
2. (Point estimate using a specified statistic) ± (critical value) × (estimated standard
deviation of the statistic), if the standard deviation of the statistic is unknown.
Definition 9.5: The standard error of a statistic is the estimated standard deviation of the
statistic.

Choosing the sample size
Definition 9.6: If the sampling distribution of a statistic is (approximately) normal, the
bound on error of estimation, B, associated with a confidence interval is
(z critical value)(standard deviation of the statistic).
When we use $p$ to construct a 95% confidence interval for $\pi$, the bound is
$$B = 1.96\sqrt{\frac{\pi(1-\pi)}{n}}.$$
Sometimes, we may wish to determine a sample size such that a particular value of the
bound B is achieved. Solving this equation for $n$, we obtain
$$n = \pi(1-\pi)\left(\frac{1.96}{B}\right)^2.$$
Generally, the sample size required to estimate a population proportion $\pi$ to within an
amount B with a given confidence level is
$$n = \pi(1-\pi)\left(\frac{z \text{ critical value}}{B}\right)^2.$$
The value of $\pi$ may be estimated using prior information. In the absence of any such
information, using $\pi = .5$ in this formula gives a conservatively large value for the
required sample size.
Note: The value $\pi = .5$ is conservative because
$$\left(\sqrt{\pi} - \sqrt{1-\pi}\right)^2 \ge 0 \;\Rightarrow\; (\sqrt{\pi})^2 - 2\sqrt{\pi}\sqrt{1-\pi} + (\sqrt{1-\pi})^2 \ge 0 \;\Rightarrow\; \sqrt{\pi(1-\pi)} \le \frac{1}{2}$$
$$\Rightarrow\; \pi(1-\pi) \le \frac{1}{4} \;\Rightarrow\; \pi(1-\pi)\left(\frac{1.96}{B}\right)^2 \le \frac{1}{4}\left(\frac{1.96}{B}\right)^2 \text{ for any } \pi.$$
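The sample-size formula and the conservative choice $\pi = .5$ can be sketched in a few lines. The function name and the bound B = 0.05 used here are illustrative assumptions; the result is rounded up because a sample size must be a whole number.

```python
import math


def sample_size_for_proportion(bound, z=1.96, pi_guess=0.5):
    """Smallest whole n with z * sqrt(pi(1 - pi)/n) <= bound,
    using pi_guess as the assumed value of pi."""
    n = pi_guess * (1 - pi_guess) * (z / bound) ** 2
    return math.ceil(n)


conservative_n = sample_size_for_proportion(0.05)           # pi = .5
informed_n = sample_size_for_proportion(0.05, pi_guess=0.2)  # prior guess pi = .2
print(conservative_n)  # 385
print(informed_n)      # 246

# No other guess for pi requires a larger sample than pi = .5.
sizes = [sample_size_for_proportion(0.05, pi_guess=g / 100) for g in range(1, 100)]
print(max(sizes) == conservative_n)
```

Using prior information about $\pi$ can reduce the required sample size considerably, but if that guess is wrong the achieved bound may be larger than intended, which is why the conservative value is the default here.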
Example 9.1 A survey designed to obtain information on $\pi$ = the proportion of registered
voters who are in favor of a constitutional amendment requiring a balanced budget results
in a sample of size n = 400. Of the 400 voters sampled, 272 are in favor of a constitutional
amendment requiring a balanced budget.
a) Give a point estimate of $\pi$.
b) Determine the estimated standard deviation of your estimate in part a).
c) Calculate a 99% confidence interval for $\pi$ and interpret the confidence interval.
d) Based on this confidence interval, do the majority of registered voters favor the
constitutional amendment?
e) How large would n have needed to be in order to have estimated $\pi$ to within .03 with
95% confidence?
a) The point estimate of $\pi$ is $p = 272 / 400 = 0.68$.
b) The estimated standard deviation of $p$ is
$$\sqrt{\frac{p(1-p)}{n}} = \sqrt{\frac{0.68(1-0.68)}{400}} = .0233$$
c) Since $np = 400 \times 0.68 = 272 > 10$ and $n(1-p) = 400(1-0.68) = 128 > 10$, we can use
the formula for a large-sample confidence interval to obtain a 99% confidence
interval for $\pi$.
$$p \pm (z \text{ critical value})\sqrt{\frac{p(1-p)}{n}} = 0.68 \pm 2.58\sqrt{\frac{0.68(1-0.68)}{400}} = 0.68 \pm 2.58 \times 0.0233 = 0.68 \pm 0.0601 = (0.6199, 0.7401).$$
We are 99% confident that $\pi$ is between 0.6199 and 0.7401.
d) Yes, since the entire interval is above 0.5.
e) Using a conservative value of $\pi = .5$ in the formula for required sample size gives
$$n = \pi(1-\pi)\left(\frac{1.96}{B}\right)^2 = 0.5(1-0.5)\left(\frac{1.96}{0.03}\right)^2 = 1067.11$$
Thus, n would need to be 1068 in order to estimate $\pi$ to within .03 with 95%
confidence.
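The arithmetic in Example 9.1 can be reproduced as a check. This sketch uses the same table values as the solution (2.58 and 1.96); the variable names are only illustrative. Because it carries the unrounded standard error, the endpoints come out as approximately 0.6198 and 0.7402, differing from the hand calculation only in the fourth decimal place.

```python
import math

n = 400
in_favor = 272

# a) point estimate of the population proportion
p = in_favor / n

# b) estimated standard deviation (standard error) of p
standard_error = math.sqrt(p * (1 - p) / n)

# c) 99% confidence interval, using the table value 2.58
z_99 = 2.58
lower = p - z_99 * standard_error
upper = p + z_99 * standard_error

# d) a majority is supported if the whole interval lies above 0.5
majority_supported = lower > 0.5

# e) sample size for bound B = .03 at 95% confidence, conservative pi = .5
bound = 0.03
n_required = math.ceil(0.5 * (1 - 0.5) * (1.96 / bound) ** 2)

print("a) p =", p)
print("b) standard error =", round(standard_error, 4))
print("c) interval =", (round(lower, 4), round(upper, 4)))
print("d) majority supported:", majority_supported)
print("e) required n =", n_required)
```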
Question: Are the following statements correct?
(1) Since (0.6199, 0.7401) is a 99% confidence interval for $\pi$, P((0.6199, 0.7401)
contains $\pi$) = 99%.
(2) If the process of selecting a sample of size 400 and then computing the corresponding
99% confidence interval is repeated 100 times, 99 of the resulting intervals will
include $\pi$.
9.3 A Confidence Interval for a Population Mean
In this section, we consider how to use information from a random sample to construct a
confidence interval estimate for a population mean.
Recall the four properties of the sampling distribution of $\bar{x}$:
1. The mean of $\bar{x}$ is $\mu_{\bar{x}} = \mu$.
2. The standard deviation of $\bar{x}$ is $\sigma_{\bar{x}} = \sigma/\sqrt{n}$.
3. When the population distribution is normal, the sampling distribution of $\bar{x}$ is also
normal.
4. When $n$ is sufficiently large (generally $n \ge 30$), the sampling distribution of
$\bar{x}$ is approximately normal.

The one-sample z confidence interval for $\mu$
When
1. $\bar{x}$ is the sample mean of a random sample from a population,
2. the population distribution is normal OR the sample size $n$ is large (generally
$n \ge 30$), and
3. the population standard deviation $\sigma$ is known,
the formula for a confidence interval for the population mean $\mu$ is
$$\bar{x} \pm (z \text{ critical value})\left(\frac{\sigma}{\sqrt{n}}\right)$$
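As a minimal sketch, the one-sample z interval can be written as a function. The critical value is computed with `statistics.NormalDist` rather than read from a table, and the function name and the numbers in the usage line are hypothetical.

```python
import math
from statistics import NormalDist


def z_interval_for_mean(xbar, sigma, n, confidence_level=0.95):
    """One-sample z confidence interval for a population mean.

    Assumes sigma is the known population standard deviation and that the
    population is normal or n is large (generally n >= 30).
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence_level) / 2)
    margin = z * sigma / math.sqrt(n)
    return xbar - margin, xbar + margin


# Hypothetical data: sample mean 100 from n = 36 observations, sigma = 15.
lower, upper = z_interval_for_mean(100, 15, 36)
print(round(lower, 2), round(upper, 2))
```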
Example 9.2 The McClatchy News Service reported on a sample of prime-time television
hours. The following table summarizes the information reported for two networks.
Network    Mean Number of Violent Acts per Hour
ABC        15.6
FOX        11.7
Suppose that each of these sample means was computed on the basis of viewing n = 50
randomly selected prime-time hours and that the population standard deviation for each
of the two networks is known to be $\sigma = 5$.
a) Compute a 95% confidence interval for $\mu_{ABC}$, the true mean number of violent acts
per prime-time hour for ABC.
b) Compute a 95% confidence interval for $\mu_{FOX}$, the true mean number of violent acts
per prime-time hour for FOX.
c) The National Coalition on Television Violence claims that shows on ABC are more
violent than on FOX. Based on the confidence intervals from parts a) and b), do you
agree with this conclusion? Explain.
Since n = 50 > 30 and $\sigma = 5$ is known, we can use the one-sample z confidence interval
formula.
a) The 95% confidence interval for $\mu_{ABC}$ is
$$\bar{x}_{ABC} \pm (z \text{ critical value})\left(\frac{\sigma}{\sqrt{n}}\right) = 15.6 \pm (1.96)\left(\frac{5}{\sqrt{50}}\right) = 15.6 \pm 1.39 = (14.21, 16.99)$$
b) The 95% confidence interval for $\mu_{FOX}$ is
$$\bar{x}_{FOX} \pm (z \text{ critical value})\left(\frac{\sigma}{\sqrt{n}}\right) = \;? \pm\; ? \times\; ? \;=\; ? \pm\; ? \;= (?, ?)$$
c) Yes, because the plausible values for $\mu_{ABC}$ are all at least 14.21, while the
plausible values for $\mu_{FOX}$ are no greater than 13.09.

The one-sample t confidence interval for $\mu$
Let us look at the development of the 95% confidence interval for $\mu$ when $\sigma$ is known.
When the population distribution is normal, the sampling distribution of $\bar{x}$ is normal.
Thus,
$$z = \frac{\bar{x} - \mu}{\sigma/\sqrt{n}}$$
has the standard normal distribution.
Since $-1.96 < \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} < 1.96$ is equivalent to
$\bar{x} - 1.96\left(\frac{\sigma}{\sqrt{n}}\right) < \mu < \bar{x} + 1.96\left(\frac{\sigma}{\sqrt{n}}\right)$,
$$P\left(\bar{x} - 1.96\left(\frac{\sigma}{\sqrt{n}}\right) < \mu < \bar{x} + 1.96\left(\frac{\sigma}{\sqrt{n}}\right)\right) = P\left(-1.96 < \frac{\bar{x} - \mu}{\sigma/\sqrt{n}} < 1.96\right) = 0.95$$
Then a 95% confidence interval for $\mu$ is $\bar{x} \pm 1.96\left(\frac{\sigma}{\sqrt{n}}\right)$.
When $\sigma$ is unknown, we must use the sample data to estimate $\sigma$. A natural estimate
of $\sigma$ is $s$. Now we use
$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$
To use $t$ to develop a confidence interval for $\mu$, we must know the probability
distribution of $t$.
Let $x_1, x_2, \ldots, x_n$ be a random sample from a normal population distribution. Then the
probability distribution of the standardized variable
$$t = \frac{\bar{x} - \mu}{s/\sqrt{n}}$$
is the t distribution with $n-1$ df.
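A simulation can show why replacing $\sigma$ by $s$ calls for a different distribution. The sketch below, with an assumed normal population and a deliberately small sample size of 5, computes both standardized variables for many samples and records how often each falls outside ±1.96.

```python
import math
import random
import statistics


def fraction_beyond(cutoff=1.96, n=5, reps=20000, seed=5):
    """Fraction of samples with |z| > cutoff and with |t| > cutoff."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # assumed population mean and standard deviation
    z_count = 0
    t_count = 0
    for _ in range(reps):
        sample = [rng.gauss(mu, sigma) for _ in range(n)]
        xbar = statistics.fmean(sample)
        s = statistics.stdev(sample)
        z = (xbar - mu) / (sigma / math.sqrt(n))
        t = (xbar - mu) / (s / math.sqrt(n))
        z_count += abs(z) > cutoff
        t_count += abs(t) > cutoff
    return z_count / reps, t_count / reps


z_fraction, t_fraction = fraction_beyond()
print("fraction with |z| > 1.96:", z_fraction)
print("fraction with |t| > 1.96:", t_fraction)
```

The fraction for $z$ should be close to 0.05, while the fraction for $t$ should be clearly larger, reflecting the heavier tails of the t distribution with few degrees of freedom. This is why t critical values exceed the corresponding z critical values.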
When
1. $\bar{x}$ is the sample mean of a random sample from a population,
2. the population distribution is normal OR the sample size $n$ is large (generally
$n \ge 30$), and
3. the population standard deviation $\sigma$ is unknown,
the formula for a confidence interval for the population mean $\mu$ is
$$\bar{x} \pm (t \text{ critical value})\left(\frac{s}{\sqrt{n}}\right)$$
where the t critical value is based on $n-1$ df and can be found in Appendix Table 3 on
page 708.
Note: Appendix Table 3 jumps from 30 df to 40 df, then 60 df, then 120 df, and finally to
the row of z critical values. If we need a critical value for a number of degrees of freedom
between those tabulated, we just use the critical value for the closest df. For df > 120, we
use the z critical values.
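The closest-df rule in the note can be sketched as a table lookup. The dictionary below is an excerpt, not the full Appendix Table 3: it holds only the 95% critical values for the rows the note mentions, written to two decimals using the standard t-table values. The tie-breaking rule (prefer the smaller df when two rows are equally close) is an assumption, since the notes do not address ties.

```python
import math

# Excerpt of 95% t critical values (standard values, two decimals).
T95_TABLE = {30: 2.04, 40: 2.02, 60: 2.00, 120: 1.98}
Z95 = 1.96


def t_critical_95(df):
    """95% critical value using the closest tabulated df (excerpt covers df >= 30)."""
    if df < 30:
        raise ValueError("this excerpt only covers df >= 30")
    if df > 120:
        return Z95
    # Assumption: on a tie, take the smaller df (the larger critical value).
    closest = min(T95_TABLE, key=lambda row: (abs(row - df), row))
    return T95_TABLE[closest]


def t_interval_95(xbar, s, n):
    """One-sample 95% t confidence interval for a population mean."""
    margin = t_critical_95(n - 1) * s / math.sqrt(n)
    return xbar - margin, xbar + margin


# Hypothetical data: n = 45 observations with mean 20.0 and s = 4.0.
print(t_critical_95(44))
print(t_interval_95(20.0, 4.0, 45))
```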
Example 9.3 A medical researcher from the National Institute of Health has collected
samples on the life expectancies of people who are long-time smokers and those who are
nonsmokers. The sample data is summarized in the table below.
Group         Sample Size    Sample Mean    Sample Standard Deviation
Smokers       50             67.6           5
Nonsmokers    60             74.5           3.5
a) Compute a 95% confidence interval for the mean life expectancy of a smoker.
b) Compute a 95% confidence interval for the mean life expectancy of a nonsmoker.
c) Do the confidence intervals in parts (a) and (b) provide convincing evidence that
nonsmokers live longer on the average than do smokers? Explain.
a) Since $n_1 = 50 > 30$ and $\sigma$ is unknown, we can use the one-sample t confidence
interval formula.
$$\bar{x}_S \pm (t_{50-1} \text{ critical value})\frac{s_1}{\sqrt{n_1}} = 67.6 \pm 2.02 \times \frac{5}{\sqrt{50}} = 67.6 \pm 1.4284 = (66.1716, 69.0284)$$
b) Since $n_2 = 60 > 30$ and $\sigma$ is unknown, we can use the one-sample t confidence
interval formula.
$$\bar{x}_N \pm (t_{60-1} \text{ critical value})\frac{s_2}{\sqrt{n_2}} = \;? \pm\; ? \times\; ? \;=\; ? \pm\; ? \;= (?, ?)$$
c) The confidence intervals in parts a) and b) do provide convincing evidence that
nonsmokers live longer than long-time smokers, since the largest value in the
confidence interval for smokers is roughly 4.5679 years less than the smallest value in
the confidence interval for nonsmokers.

Choosing the sample size
When we use $\bar{x}$ to construct a 95% confidence interval for $\mu$, the bound on error of
estimation is
$$B = 1.96\left(\frac{\sigma}{\sqrt{n}}\right)$$
Before collecting any data, an investigator may wish to determine a sample size for
which a particular value of the bound is achieved. Solving
$B = 1.96\left(\frac{\sigma}{\sqrt{n}}\right)$ for $n$, we obtain
$$n = \left[\frac{1.96\,\sigma}{B}\right]^2.$$
Generally, we have the following result.
The sample size required to estimate a population mean $\mu$ to within an amount B with a
given confidence level is
$$n = \left[\frac{(z \text{ critical value})\,\sigma}{B}\right]^2.$$
If $\sigma$ is unknown, it may be estimated based on previous information or, for a
population that is not too skewed, by using (range)/4.
Example 9.4 The financial aid office wishes to estimate the mean cost of textbooks per
semester for students at a university. For the estimate to be useful, it should be within $20
of the true population mean. How large a sample should be used to be 95% confident of
achieving this level of accuracy?
To determine the required sample size, we must have a value for $\sigma$. The financial aid
office is pretty sure that the amount spent on books varies widely, with most values
between $50 and $450. A reasonable estimate of $\sigma$ is then
(range) / 4 = (450 – 50) / 4 = 400 / 4 = 100.
The required sample size is
$$n = \left[\frac{1.96\,\sigma}{B}\right]^2 = \left[\frac{(1.96)(100)}{20}\right]^2 = [9.8]^2 = 96.04.$$
Rounding up, a sample size of 97 or larger is required.
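The calculation in Example 9.4 can be sketched as a short function. The function name is illustrative; the inputs are the values from the example, with $\sigma$ estimated by (range)/4.

```python
import math


def sample_size_for_mean(bound, sigma, z=1.96):
    """Smallest whole n with z * sigma / sqrt(n) <= bound."""
    return math.ceil((z * sigma / bound) ** 2)


# Example 9.4: most textbook costs fall between $50 and $450, bound B = $20.
sigma_estimate = (450 - 50) / 4
required_n = sample_size_for_mean(bound=20, sigma=sigma_estimate)
print(sigma_estimate, required_n)
```

Halving the bound B roughly quadruples the required sample size, because $n$ grows with the square of $1/B$.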