Download Estimation - Widener University

Estimation
Estimators & Estimates
Estimators are the random variables used to
estimate population parameters, while the specific
values of these variables are the estimates.
Example: the estimator of m is often
$$\bar{X} = \frac{1}{n}\sum_{i=1}^{n} X_i$$
but if the observed values of X are 1, 2, 3, and 6,
the estimate is 3.
So the estimator is a formula; the estimate is a
number.
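The distinction can be sketched in a few lines of Python. The function name `sample_mean` is only an illustrative choice: the function plays the role of the estimator (a rule), and the value it returns for one observed sample is the estimate.

```python
from statistics import mean

def sample_mean(observations):
    """Estimator: a rule that maps any sample to a number."""
    return mean(observations)

# Estimate: the number the estimator produces for one observed sample.
estimate = sample_mean([1, 2, 3, 6])
print(estimate)  # 3
```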
Properties of a Good Estimator
1. Unbiasedness
2. Efficiency
3. Sufficiency
4. Consistency
Unbiasedness
An estimator $\hat{\theta}$ (“theta hat”) is unbiased if its
expected value equals the value of the parameter
$\theta$ (theta) being estimated. That is,
$$E(\hat{\theta}) = \theta$$
In other words, on average the estimator is right
on target.
Examples
Since $E(\bar{X}) = m$, $\bar{X}$ is an unbiased estimator of m.
Since $E(X/n) = \pi$, X/n is an unbiased estimator of $\pi$.
Since $E(s^2) = \sigma^2$, $s^2$ is an unbiased estimator of $\sigma^2$.
Recall that
$$s^2 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}.$$
If we divided by n instead of by n-1, we would not have
an unbiased estimator of $\sigma^2$. That is why $s^2$ is defined
the way it is.
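The effect of dividing by n-1 rather than n can be checked exactly on a tiny case. The sketch below assumes a hypothetical population {1, 2, 3, 6} and enumerates every ordered sample of size 2 drawn with replacement, so no simulation is involved; the helper name `avg_variance_estimate` is made up for illustration.

```python
from fractions import Fraction
from itertools import product

population = [1, 2, 3, 6]  # hypothetical population, chosen for illustration
N = len(population)
pop_mean = Fraction(sum(population), N)
pop_var = sum((x - pop_mean) ** 2 for x in population) / N  # sigma^2

def avg_variance_estimate(divisor_offset, n=2):
    """Average of sum((x - xbar)^2) / (n - divisor_offset) over all
    ordered samples of size n drawn with replacement."""
    samples = list(product(population, repeat=n))
    total = Fraction(0)
    for sample in samples:
        xbar = Fraction(sum(sample), n)
        ss = sum((x - xbar) ** 2 for x in sample)
        total += ss / (n - divisor_offset)
    return total / len(samples)

expected_s2 = avg_variance_estimate(1)      # divide by n-1
expected_biased = avg_variance_estimate(0)  # divide by n
print(pop_var, expected_s2, expected_biased)
```

In this small case, dividing by n-1 gives an average equal to the population variance, while dividing by n gives an average that is too small.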
Bias
$$\text{bias} = E(\hat{\theta}) - \theta$$
The bias of an unbiased estimator is
zero.
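Mean Squared Error (MSE)
$$\text{MSE} = E[(\hat{\theta} - \theta)^2]$$
which happens to equal
$$\sigma^2 + \text{bias}^2$$
where $\sigma^2$ here is the variance of the estimator $\hat{\theta}$.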
Efficiency
The most efficient estimator is the one with the
smallest MSE.
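For readers who want to see why the MSE splits into a variance term and a squared bias term, here is a short derivation sketch. It adds and subtracts $E(\hat{\theta})$ inside the square; the cross term vanishes because $E[\hat{\theta} - E(\hat{\theta})] = 0$.

```latex
\begin{aligned}
\text{MSE} &= E[(\hat{\theta} - \theta)^2] \\
&= E[(\hat{\theta} - E(\hat{\theta}) + E(\hat{\theta}) - \theta)^2] \\
&= E[(\hat{\theta} - E(\hat{\theta}))^2]
   + 2\,(E(\hat{\theta}) - \theta)\,E[\hat{\theta} - E(\hat{\theta})]
   + (E(\hat{\theta}) - \theta)^2 \\
&= \sigma^2 + 0 + \text{bias}^2
\end{aligned}
```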
Efficiency
Since $\text{MSE} = \sigma^2 + \text{bias}^2$,
for unbiased estimators (where the bias is zero),
$\text{MSE} = \sigma^2$.
So if you are comparing unbiased estimators, the
most efficient one is the one with the smallest
variance.
If you have two estimators, one of which has a small
bias & a small variance and the other has no bias
but a large variance, the more efficient one may be
the one that is just slightly off on average, but that is
more frequently in the right vicinity.
Example: sample mean & median
As we have found, the sample mean is an unbiased
estimator of m.
It turns out that, for a symmetric population such as the normal,
the sample median is also an unbiased estimator of m.
We know the variance of the sample mean is $\sigma^2/n$.
For a normal population and large n, the variance of the sample
median is approximately $(\pi/2)(\sigma^2/n)$.
Since $\pi$ is about 3.14, $\pi/2 > 1$.
So the variance of the sample median is greater than $\sigma^2/n$,
the variance of the sample mean.
Since both estimators are unbiased, the one with the smaller
variance (the sample mean) is the more efficient one.
In fact, for a normal population, among all unbiased estimators of m,
the sample mean is the one with the smallest variance.
Sufficiency
An estimator is said to be sufficient if it
uses all the information about the population
parameter that the sample can provide.
Examples
Example 1: The sample median is not a
sufficient estimator because it uses only the
ranking of the observations, and not their
numerical values [with the exception of the
middle one(s)].
Example 2: The sample mean, however,
uses all the information, and therefore is a
sufficient estimator.
Consistency
An estimator is said to be consistent if it yields
estimates that converge in probability to the population
parameter being estimated as n approaches infinity.
In other words, as the sample size increases, the
estimator spends more and more of its time closer and
closer to the parameter value.
One way that an estimator can be consistent is for its
bias and its variance to approach zero as the sample
size approaches infinity.
Example of a consistent estimator
[Figure: distributions of the estimator when n = 5, n = 50, and n = 500, plotted around m. As the sample size increases, the bias & the variance are both shrinking.]
Example: Sample Mean
We know that the mean of $\bar{X}$ is m.
So its bias not only goes to zero as n
approaches infinity, its bias is always zero.
The variance of the sample mean is $\sigma^2/n$.
As n approaches infinity, that variance
approaches zero.
So, since both the bias and the variance go
to zero, as n approaches infinity, the
sample mean is a consistent estimator.
A great estimator: the sample mean $\bar{X}$
We have found that the sample mean is a great
estimator of the population mean m.
It is unbiased,
efficient,
sufficient,
& consistent.
Point Estimators versus Interval Estimators
Up until now we have considered point
estimators that provide us with a single
value as an estimate of a desired parameter.
It is unlikely, however, that our estimate will
precisely equal our parameter.
We, therefore, may prefer to report
something like this: We are 95% certain that
the parameter is between “a” and “b.”
This statement is a confidence interval.
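Building a
Confidence Interval
[Figure: standard normal (Z) curve with -1.96, 0, and 1.96 marked and the areas 0.0250 and 0.9750 labeled.]
We know that Pr(Z < 1.96) = 0.9750
Then Pr(-1.96 < Z < 1.96) = 0.95
We also know that
$$\frac{\bar{X} - m}{\sigma/\sqrt{n}}$$
is distributed as a standard normal (Z).
So there is a 95% probability that
$$-1.96 < \frac{\bar{X} - m}{\sigma/\sqrt{n}} < 1.96$$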
Continuing from: with 95% probability,
$$-1.96 < \frac{\bar{X} - m}{\sigma/\sqrt{n}} < 1.96$$
Multiplying through by $\sigma/\sqrt{n}$,
$$-1.96\frac{\sigma}{\sqrt{n}} < \bar{X} - m < 1.96\frac{\sigma}{\sqrt{n}}$$
Subtracting off $\bar{X}$,
$$-\bar{X} - 1.96\frac{\sigma}{\sqrt{n}} < -m < -\bar{X} + 1.96\frac{\sigma}{\sqrt{n}}$$
Multiplying by -1 and flipping the
inequalities appropriately,
$$\bar{X} + 1.96\frac{\sigma}{\sqrt{n}} > m > \bar{X} - 1.96\frac{\sigma}{\sqrt{n}}$$
Flipping the entire expression,
$$\bar{X} - 1.96\frac{\sigma}{\sqrt{n}} < m < \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}$$
So we have a 95% Confidence Interval
for the Population Mean m
$$\bar{X} - 1.96\frac{\sigma}{\sqrt{n}} < m < \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}$$
Example: Suppose a sample of 25 students at a
university has a sample mean IQ of 127. If the
population standard deviation is 5.4, calculate the
95% confidence interval for the population mean.
$$\bar{X} - 1.96\frac{\sigma}{\sqrt{n}} < m < \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}$$
$$127 - 1.96\frac{5.4}{\sqrt{25}} < m < 127 + 1.96\frac{5.4}{\sqrt{25}}$$
$$127 - 2.12 < m < 127 + 2.12$$
$$124.88 < m < 129.12$$
We are 95% certain that the population mean is between 124.88 & 129.12 .
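The IQ calculation can be reproduced with Python's standard library. This is a minimal sketch: the function name `z_interval` is an illustrative choice, and it gets the z value from `statistics.NormalDist` rather than from a printed Z table, so the z used is 1.95996... instead of the rounded 1.96.

```python
from math import sqrt
from statistics import NormalDist

def z_interval(xbar, sigma, n, confidence=0.95):
    """Confidence interval for the mean when the population sigma is known."""
    # Cumulative area to look up: middle area plus one tail.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * sigma / sqrt(n)
    return xbar - margin, xbar + margin

low, high = z_interval(127, 5.4, 25)
print(round(low, 2), round(high, 2))  # 124.88 129.12
```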
When we say we are 95% certain that the
population mean m is between 124.88 & 129.12,
it means this:
The population mean m is a fixed number, but we
don’t know what it is.
Our confidence intervals, however, vary with the
random sample that we take.
Sometimes we get a more typical sample,
sometimes a less typical one.
If we took 100 random samples and from them
calculated 100 confidence intervals, we would expect
about 95 of the intervals to contain the population mean
that we are trying to estimate.
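This repeated-sampling interpretation can be illustrated with a small simulation. The sketch below assumes a normal population whose true mean is 127 and whose standard deviation is 5.4 (numbers borrowed from the IQ example purely for illustration), draws many samples of size 25, and counts how often the 95% interval contains the true mean. The function name `coverage` and the seed are arbitrary choices.

```python
import random
from math import sqrt

def coverage(m=127.0, sigma=5.4, n=25, trials=2000, z=1.96, seed=0):
    """Fraction of simulated z-intervals that contain the true mean m."""
    rng = random.Random(seed)
    margin = z * sigma / sqrt(n)
    hits = 0
    for _ in range(trials):
        # Draw one sample and compute its sample mean.
        xbar = sum(rng.gauss(m, sigma) for _ in range(n)) / n
        if xbar - margin < m < xbar + margin:
            hits += 1
    return hits / trials

print(coverage())  # close to 0.95; the exact value depends on the seed
```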
What if we want a confidence level
other than 95%?
In our formula, the 1.96 came from the fact that the Z distribution will
be between -1.96 and 1.96 95% of the time.
$$\bar{X} - 1.96\frac{\sigma}{\sqrt{n}} < m < \bar{X} + 1.96\frac{\sigma}{\sqrt{n}}$$
To get a different confidence level, all we need to do is find the Z values
such that we are between them the desired percent of the time.
Using that Z value, we have the general formula for the
confidence interval for the population mean m :
$$\bar{X} - Z\frac{\sigma}{\sqrt{n}} < m < \bar{X} + Z\frac{\sigma}{\sqrt{n}}$$
Notice: In our confidence interval formula, we used “less than”
symbols:
$$\bar{X} - Z\frac{\sigma}{\sqrt{n}} < m < \bar{X} + Z\frac{\sigma}{\sqrt{n}}$$
Your textbook uses “less than or equal to” symbols:
$$\bar{X} - Z\frac{\sigma}{\sqrt{n}} \le m \le \bar{X} + Z\frac{\sigma}{\sqrt{n}}$$
Either of these is acceptable. Recall that the formula is built upon
the concept of the normal probability distribution. The probability
that a continuous variable is exactly equal to any particular number
is zero. So it doesn’t matter whether you include the endpoints of
the interval or not.
Determining Z values for
confidence intervals
[Figure: standard normal (Z) curve with middle area 0.9800 between -k = -2.33 and k = 2.33, and area 0.01 in each tail.]
Suppose we want a 98% confidence interval.
We need to find 2 values, call them –k and k, such that
Z is between them 98% of the time.
Then Z will be less than k with probability 0.99.
Look in the body of the Z table for the value closest to
0.99, which is 0.9901 .
The number on the border of the table corresponding
to 0.9901 is 2.33.
So that is your value of k, and the number you use for
Z in your confidence interval.
Sometimes 2 numbers in the Z table are
equally close to the value you want.
For example, suppose you want a 90% confidence
interval. Remember the Z table gives cumulative
values. So to get a value you can look up in the table,
you add the 0.90 from the middle area of your Z graph
plus 0.05 from the left tail for a total of 0.95. So you
look for 0.95 in the body of the Z table.
You find 0.9495 and 0.9505. Both are off by 0.0005.
The number on the border of the table corresponding to
0.9495 is 1.64.
The number corresponding to 0.9505 is 1.65.
Usually in these cases, we use the average of 1.64 and
1.65, which is 1.645.
Similarly for the 99% confidence interval, we usually use
2.575. (Draw your graph & work through the logic of
this number.)
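The table lookup described above can be cross-checked in Python. This sketch uses `statistics.NormalDist` from the standard library in place of a printed Z table; the function name `z_for_confidence` is an illustrative choice.

```python
from statistics import NormalDist

def z_for_confidence(confidence):
    """z such that Pr(-z < Z < z) equals the confidence level."""
    # Cumulative area = middle area plus the left tail.
    cumulative = confidence + (1 - confidence) / 2
    return NormalDist().inv_cdf(cumulative)

for level in (0.90, 0.95, 0.98, 0.99):
    print(level, round(z_for_confidence(level), 3))
```

The computed values are slightly more precise than the table values 1.645, 1.96, 2.33, and 2.575, but they agree to within table rounding.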
Which interval is wider: One with a higher
confidence level (such as 99%) or one with
a lower confidence level (such as 90%)?
Let’s think it through using an unrealistic but
slightly entertaining example.
You have the misfortune of being stranded on
an island, with a cannibal & a bunch of bears.
It gets worse…
You get captured by the cannibal.
The cannibal, who knows the island well,
decides to give you a chance to avoid being
dinner.
He says if you can correctly estimate the
number of bears, he’ll let you go.
To give you a fighting chance, he’ll let you
give him an interval estimate.
You think that there are probably about a
hundred bears on the island.
Would you be more confident of not being
dinner if you gave the cannibal a narrow
interval like 90 to 110 bears, or a wider one
like 75 to 125 bears?
You would definitely be more confident with
the wider interval.
Thus, when the confidence level needs to be
very high (such as 99%), the interval needs
to be wide.
Let’s redo the IQ example with a different confidence level.
We had a sample of 25 students with a sample mean IQ
of 127. The population standard deviation was 5.4 .
Calculate the 99% confidence interval for the population
mean.
Our general formula is:
$$\bar{X} - Z\frac{\sigma}{\sqrt{n}} < m < \bar{X} + Z\frac{\sigma}{\sqrt{n}}$$
We said that the Z value for 99% confidence is 2.575.
Putting in our values,
$$127 - 2.575\frac{5.4}{\sqrt{25}} < m < 127 + 2.575\frac{5.4}{\sqrt{25}}$$
or 124.22 < m < 129.78
We had for the 95% confidence interval:
124.88 < m < 129.12
We just got for the 99% confidence interval:
124.22 < m < 129.78
The 99% confidence interval starts a little
lower & ends a little higher than the 95%
interval.
So the 99% interval is wider than the 95%
interval, as we said it should be.
What do we do if we want to compute a
confidence interval for m, but we don’t know
the population standard deviation $\sigma$?
We use the next best thing, the sample standard
deviation s.
But with s, instead of a Z distribution, we have a t
(with n-1 degrees of freedom). So,
$$\bar{X} - Z\frac{\sigma}{\sqrt{n}} < m < \bar{X} + Z\frac{\sigma}{\sqrt{n}}$$
becomes
$$\bar{X} - t_{n-1}\frac{s}{\sqrt{n}} < m < \bar{X} + t_{n-1}\frac{s}{\sqrt{n}}$$
Example: From a large class of normally
distributed grades, sample 4 grades: 64, 66, 89,
& 77. Calculate the 95% confidence interval for
the class mean grade m.
$$\bar{X} - t_{n-1}\frac{s}{\sqrt{n}} < m < \bar{X} + t_{n-1}\frac{s}{\sqrt{n}}$$
is the appropriate formula.
So we need to determine the sample mean,
sample standard deviation, and the t-value.
4 grades: 64, 66, 89, & 77
95% confidence interval for m
Adding our X values, we get 296.
Dividing by 4, we find our sample mean is $\bar{X} = 74$.
Keep in mind that the sample standard deviation is
$$s = \sqrt{\frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}}$$
So, next we subtract our sample mean 74 from each of our X
values, square the differences and add them up.

  X      X - Xbar    (X - Xbar)^2
  64       -10          100
  66        -8           64
  89        15          225
  77         3            9
 ---                    ---
 296                    398

Then we divide by n-1 (which is 3) to get the
sample variance $s^2$ = 398/3 = 132.7,
and take the square root to get the sample
standard deviation s = 11.5.
So we have $\bar{X} = 74$ and s = 11.5
Since n = 4, dof = n-1 = 3
Since we want 95% confidence,
we want 0.95 as the middle area
of our graph, and 0.025 in each of
the 2 tails.
[Figure: t distribution with 3 degrees of freedom ($t_3$), middle area 0.95, area 0.025 in each tail, upper cutoff at 3.182.]
We find the 3.182 in our t table.
Our formula is
$$\bar{X} - t_{n-1}\frac{s}{\sqrt{n}} < m < \bar{X} + t_{n-1}\frac{s}{\sqrt{n}}$$
Putting in our numbers we have
$$74 - 3.182\frac{11.5}{\sqrt{4}} < m < 74 + 3.182\frac{11.5}{\sqrt{4}}$$
So our 95% confidence interval is 56 < m < 92.
The interval is very wide, because we only have 4 observations. If
we had more information, we’d be able to get a narrower interval.
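The whole calculation for the four grades can be sketched in Python. The standard library has no t distribution, so this sketch assumes the t value (3.182 for 3 degrees of freedom) is read from a t table and passed in; the function name `t_interval` is an illustrative choice.

```python
from math import sqrt
from statistics import mean, stdev

def t_interval(sample, t_value):
    """Confidence interval for the mean using the sample standard deviation.
    t_value is read from a t table for n-1 degrees of freedom."""
    n = len(sample)
    xbar = mean(sample)
    s = stdev(sample)  # sample standard deviation (divides by n-1)
    margin = t_value * s / sqrt(n)
    return xbar - margin, xbar + margin

low, high = t_interval([64, 66, 89, 77], t_value=3.182)
print(round(low, 1), round(high, 1))  # 55.7 92.3
```

Rounded to whole grades, this matches the interval 56 < m < 92 found by hand.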
From our previous confidence intervals, we can
see that we have a basic format, which can be
used when the point estimator is roughly normal.
$$\bar{X} - Z\frac{\sigma}{\sqrt{n}} < m < \bar{X} + Z\frac{\sigma}{\sqrt{n}}$$
$$\bar{X} - t_{n-1}\frac{s}{\sqrt{n}} < m < \bar{X} + t_{n-1}\frac{s}{\sqrt{n}}$$
Basic format:
(point estimate) - (z or t) × (std. dev. or estimate of the std. dev. of our pt. estimate)
< desired parameter <
(point estimate) + (z or t) × (std. dev. or estimate of the std. dev. of our pt. estimate)
Calculating confidence intervals for the
binomial proportion parameter $\pi$
When the number of events of interest (X) and the
number of events not of interest (n-X) are each
at least five, the binomial distribution can be
approximated by the normal and we can
develop a confidence interval for the binomial
proportion parameter $\pi$.
That is, we can develop a confidence interval
for $\pi$, if
X ≥ 5 and n-X ≥ 5 .
We need a point estimate for $\pi$, & the
standard deviation of our point estimate.
For the point estimate we will use the
binomial proportion variable X/n or p .
Its standard deviation was
$$\sqrt{\frac{\pi(1-\pi)}{n}}.$$
Since we don’t know $\pi$, we will use our
sample proportion p in the standard deviation
formula.
Use our format to get the confidence interval
for the binomial proportion $\pi$.
Basic format:
(point estimate) - (z or t) × (std. dev. or estimate of the std. dev. of our pt. estimate)
< desired parameter <
(point estimate) + (z or t) × (std. dev. or estimate of the std. dev. of our pt. estimate)
We have our confidence interval for the
binomial proportion $\pi$.
$$p - z\sqrt{\frac{p(1-p)}{n}} < \pi < p + z\sqrt{\frac{p(1-p)}{n}}$$
Example: Consider a random sample of 144
families; 48 have 2 or more cars. Compute the
95% confidence interval for the population
proportion of families with 2 or more cars.
$$p - z\sqrt{\frac{p(1-p)}{n}} < \pi < p + z\sqrt{\frac{p(1-p)}{n}}$$
n = 144, p = 48/144 = 1/3, and 1 - p = 2/3.
[Figure: standard normal (Z) curve with middle area 0.95 between -1.96 and 1.96, and area 0.0250 in each tail.]
Looking up the cumulative area
0.9500 + 0.0250 = 0.9750, we
find our z value is 1.96 .
We now have n = 144, z = 1.96, p = 1/3, and 1 - p = 2/3.
$$\frac{1}{3} - 1.96\sqrt{\frac{(1/3)(2/3)}{144}} < \pi < \frac{1}{3} + 1.96\sqrt{\frac{(1/3)(2/3)}{144}}$$
$$0.333 - 0.077 < \pi < 0.333 + 0.077$$
So our 95% confidence interval for $\pi$ is:
$0.256 < \pi < 0.410$ .
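The proportion interval above can be sketched as a small Python function. The name `proportion_interval` is an illustrative choice, the z value is passed in as read from the Z table, and the check on X and n-X mirrors the "at least five" condition stated earlier.

```python
from math import sqrt

def proportion_interval(x, n, z):
    """Confidence interval for a binomial proportion (normal approximation)."""
    if x < 5 or n - x < 5:
        raise ValueError("normal approximation needs X >= 5 and n - X >= 5")
    p = x / n  # sample proportion
    margin = z * sqrt(p * (1 - p) / n)
    return p - margin, p + margin

low, high = proportion_interval(48, 144, z=1.96)
print(round(low, 3), round(high, 3))  # 0.256 0.41
```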
Suppose we want a confidence interval
not for a mean but for
the difference in two means (m1-m2).
For example, we may be interested in
the difference in the mean income for two
counties, or
the difference in the mean exam scores for
two classes.
We will use the same basic format,
but it will be a bit more complicated.
Basic format:
(point estimate) - (z or t) × (std. dev. or estimate of the std. dev. of our pt. estimate)
< desired parameter <
(point estimate) + (z or t) × (std. dev. or estimate of the std. dev. of our pt. estimate)
Our “desired parameter” is $m_1 - m_2$ .
Our point estimate is $\bar{X}_1 - \bar{X}_2$ .
Initially, we will assume that we have the population standard
deviations, so we will use a z.
We need the standard deviation of the point estimate, $\bar{X}_1 - \bar{X}_2$ .
To get that we will first determine the variance of $\bar{X}_1 - \bar{X}_2$ .
Recall:
$$V(aX + bY) = a^2V(X) + b^2V(Y) + 2ab\,C(X,Y)$$
Letting a = 1, b = -1, $X = \bar{X}_1$ and $Y = \bar{X}_2$,
$$V(\bar{X}_1 - \bar{X}_2) = (1)^2V(\bar{X}_1) + (-1)^2V(\bar{X}_2) + 2(1)(-1)C(\bar{X}_1, \bar{X}_2)$$
If our samples are independent, the covariance
term is zero, and the expression becomes
$$V(\bar{X}_1 - \bar{X}_2) = (1)^2V(\bar{X}_1) + (-1)^2V(\bar{X}_2)$$
or
$$V(\bar{X}_1 - \bar{X}_2) = V(\bar{X}_1) + V(\bar{X}_2)$$
We now have $V(\bar{X}_1 - \bar{X}_2) = V(\bar{X}_1) + V(\bar{X}_2)$.
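Recall that $V(\bar{X}) = \sigma^2/n$ .
Applying subscripts for our samples,
$$V(\bar{X}_1 - \bar{X}_2) = \frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}$$
& the standard deviation of $\bar{X}_1 - \bar{X}_2$ is
$$\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$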
Apply our basic format
(point estimate) - (z or t) × (std. dev. or estimate of the std. dev. of our pt. estimate)
< desired parameter <
(point estimate) + (z or t) × (std. dev. or estimate of the std. dev. of our pt. estimate)
$$(\bar{X}_1 - \bar{X}_2) - z\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} < m_1 - m_2 < (\bar{X}_1 - \bar{X}_2) + z\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
Example: From 2 large classes,
with normally distributed grades, sample
4 grades (64, 66, 89, & 77) & 3 grades (56, 71, &
53). If the population variances for the 2 classes
are both 96, compute the 90% confidence interval
for the difference in means of the class grades.
We will use the formula we just developed:
$$(\bar{X}_1 - \bar{X}_2) - z\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} < m_1 - m_2 < (\bar{X}_1 - \bar{X}_2) + z\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
We need the 2 sample means & the z value.
Adding the observations & dividing by the number
of observations, our sample means are
(64 + 66 + 89 + 77) / 4 = 74
and
(56 + 71 + 53) / 3 = 60
The z value for 90% confidence,
as we found before, is 1.645 .
[Figure: standard normal (Z) curve with middle area 0.90 and area 0.05 in each tail; upper cutoff at 1.645.]
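Assembling our formula:
$$(\bar{X}_1 - \bar{X}_2) - z\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} < m_1 - m_2 < (\bar{X}_1 - \bar{X}_2) + z\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
$$(74 - 60) - 1.645\sqrt{\frac{96}{4} + \frac{96}{3}} < m_1 - m_2 < (74 - 60) + 1.645\sqrt{\frac{96}{4} + \frac{96}{3}}$$
$$14 - 12.31 < m_1 - m_2 < 14 + 12.31$$
$$1.69 < m_1 - m_2 < 26.31$$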
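The two-class calculation with known population variances can be sketched in Python. The function name `diff_means_z_interval` is an illustrative choice, and the z value 1.645 is passed in as read from the Z table.

```python
from math import sqrt
from statistics import mean

def diff_means_z_interval(sample1, sample2, var1, var2, z):
    """Confidence interval for m1 - m2 when both population variances are known."""
    diff = mean(sample1) - mean(sample2)
    # Standard deviation of the difference of two independent sample means.
    margin = z * sqrt(var1 / len(sample1) + var2 / len(sample2))
    return diff - margin, diff + margin

low, high = diff_means_z_interval([64, 66, 89, 77], [56, 71, 53], 96, 96, z=1.645)
print(round(low, 2), round(high, 2))  # 1.69 26.31
```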
Interpreting the results
$$1.69 < m_1 - m_2 < 26.31$$
We are 90% certain that the difference in class
mean grades is between 1.69 and 26.31 .
Notice that this interval does not include zero.
If m1 – m2 = 0, then m1 = m2 .
So, at the 90% confidence level, we can rule out the
possibility that the class mean grades are equal.
What do we do if we want to compare means,
but we don’t know the population variances?
As before, we use the sample variances & the t distribution.
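Our formula was
$$(\bar{X}_1 - \bar{X}_2) - z\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} < m_1 - m_2 < (\bar{X}_1 - \bar{X}_2) + z\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
Now the formula is
$$(\bar{X}_1 - \bar{X}_2) - t\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} < m_1 - m_2 < (\bar{X}_1 - \bar{X}_2) + t\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$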
For the t, the number of degrees of freedom is determined
by a very messy formula.
The degrees of freedom for the t for the confidence
interval for the difference between means with
unknown variances
dof = the integer part of
$$\frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}}$$
Let’s do the same example as before,
but without knowing the population variances.
From 2 large classes, with normally distributed
grades, sample 4 grades (64, 66, 89, & 77) &
3 grades (56, 71, & 53). Compute the 90%
confidence interval for the difference in means
of the class grades.
This time we need to calculate the sample variances.
Recall
$$s^2 = \frac{\sum_{i=1}^{n}(X_i - \bar{X})^2}{n-1}$$
We calculate the sample means as before.
Then subtract the sample mean from each observation,
square that difference, and add up.
Dividing by n-1, we have our sample variances.

Class 1
  X1     X1 - X1bar    (X1 - X1bar)^2
  64       -10            100
  66        -8             64
  89        15            225
  77         3              9
 ---                      ---
 296                      398
$\bar{X}_1$ = 296/4 = 74, $s_1^2$ = 398/3 = 132.67

Class 2
  X2     X2 - X2bar    (X2 - X2bar)^2
  56        -4             16
  71        11            121
  53        -7             49
 ---                      ---
 180                      186
$\bar{X}_2$ = 180/3 = 60, $s_2^2$ = 186/2 = 93.0
What are the dof & t value?
dof = the integer part of
$$\frac{\left(\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}\right)^2}{\dfrac{(s_1^2/n_1)^2}{n_1 - 1} + \dfrac{(s_2^2/n_2)^2}{n_2 - 1}} = \frac{\left(\dfrac{132.67}{4} + \dfrac{93.0}{3}\right)^2}{\dfrac{(132.67/4)^2}{3} + \dfrac{(93.0/3)^2}{2}} = 4.860$$
So the degrees of freedom is
the integer part of 4.86 or 4.
For 90% confidence & 4 dof,
the t value is 2.1318 .
[Figure: t distribution with 4 degrees of freedom ($t_4$), middle area 0.90, area 0.05 in the upper tail beyond 2.1318.]
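Assemble our formula
$$(\bar{X}_1 - \bar{X}_2) - t\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} < m_1 - m_2 < (\bar{X}_1 - \bar{X}_2) + t\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$
$$(74 - 60) - 2.1318\sqrt{\frac{132.67}{4} + \frac{93.0}{3}} < m_1 - m_2 < (74 - 60) + 2.1318\sqrt{\frac{132.67}{4} + \frac{93.0}{3}}$$
$$14 - 17.08 < m_1 - m_2 < 14 + 17.08$$
$$-3.08 < m_1 - m_2 < 31.08$$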
Notice here that zero is contained in our 90% confidence
interval.
So we can’t rule out the possibility that the class mean
grades are equal.
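The unknown-variance calculation, including the messy degrees-of-freedom formula, can be sketched in Python. The function names are illustrative; the t value 2.1318 (90% confidence, 4 dof) is assumed to come from a t table, since the standard library has no t distribution.

```python
from math import sqrt
from statistics import mean, variance

def unequal_var_dof(s1_sq, n1, s2_sq, n2):
    """Integer part of the degrees-of-freedom formula for unknown, unequal variances."""
    a, b = s1_sq / n1, s2_sq / n2
    return int((a + b) ** 2 / (a ** 2 / (n1 - 1) + b ** 2 / (n2 - 1)))

def diff_means_t_interval(sample1, sample2, t_value):
    """Confidence interval for m1 - m2 using the sample variances."""
    n1, n2 = len(sample1), len(sample2)
    s1_sq, s2_sq = variance(sample1), variance(sample2)  # divide by n-1
    diff = mean(sample1) - mean(sample2)
    margin = t_value * sqrt(s1_sq / n1 + s2_sq / n2)
    return diff - margin, diff + margin

class1, class2 = [64, 66, 89, 77], [56, 71, 53]
dof = unequal_var_dof(variance(class1), 4, variance(class2), 3)
low, high = diff_means_t_interval(class1, class2, t_value=2.1318)
print(dof, round(low, 2), round(high, 2))  # 4 -3.08 31.08
```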
Sometimes we believe the variances of
2 populations are equal, even though
we don’t know the actual values.
We have another confidence interval for this
situation.
$$(\bar{X}_1 - \bar{X}_2) - z\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}} < m_1 - m_2 < (\bar{X}_1 - \bar{X}_2) + z\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$
In our earlier formula above, we can drop the distinguishing subscripts
on our variances.
$$(\bar{X}_1 - \bar{X}_2) - z\sqrt{\frac{\sigma^2}{n_1} + \frac{\sigma^2}{n_2}} < m_1 - m_2 < (\bar{X}_1 - \bar{X}_2) + z\sqrt{\frac{\sigma^2}{n_1} + \frac{\sigma^2}{n_2}}$$
Factoring out the variance, we have
$$(\bar{X}_1 - \bar{X}_2) - z\sqrt{\sigma^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} < m_1 - m_2 < (\bar{X}_1 - \bar{X}_2) + z\sqrt{\sigma^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$
Next we replace the variance by a pooled sample variance, based
on information from both samples.
$$(\bar{X}_1 - \bar{X}_2) - t\sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} < m_1 - m_2 < (\bar{X}_1 - \bar{X}_2) + t\sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$
The dof for the t value is n1 + n2 - 2 .
The pooled sample variance
$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$
When the 2 samples are the same size, this
estimator gives an estimate that is halfway
between the two sample variances.
When the samples are not the same size, the
estimate will be closer to the sample variance
from the larger sample.
So our confidence interval for the difference in the
population means, when we don’t know the population
variances but we believe that they are equal is:
$$(\bar{X}_1 - \bar{X}_2) - t\sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} < m_1 - m_2 < (\bar{X}_1 - \bar{X}_2) + t\sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$
where
$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$
and the number of degrees of freedom is n1 + n2 - 2 .
Let’s do the same example as before,
assuming that the unknown population
variances are believed equal.
We had:
$\bar{X}_1 = 74$, $\bar{X}_2 = 60$, $s_1^2 = 132.67$, $s_2^2 = 93.0$
$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = \frac{(3)132.67 + (2)93.0}{4 + 3 - 2} = \frac{584}{5} = 116.8$$
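We have:
$\bar{X}_1 = 74$, $\bar{X}_2 = 60$, $s_p^2 = 116.8$
We want 90% confidence.
dof = n1 + n2 - 2 = 4 + 3 - 2 = 5
So our t value is 2.015 .
[Figure: t distribution with 5 degrees of freedom ($t_5$), middle area 0.90, area 0.05 in the upper tail beyond 2.015.]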
We have: $\bar{X}_1 = 74$, $\bar{X}_2 = 60$, $s_p^2 = 116.8$, t = 2.015
$$(\bar{X}_1 - \bar{X}_2) - t\sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)} < m_1 - m_2 < (\bar{X}_1 - \bar{X}_2) + t\sqrt{s_p^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)}$$
$$(74 - 60) - 2.015\sqrt{116.8\left(\frac{1}{4} + \frac{1}{3}\right)} < m_1 - m_2 < (74 - 60) + 2.015\sqrt{116.8\left(\frac{1}{4} + \frac{1}{3}\right)}$$
$$14 - 16.63 < m_1 - m_2 < 14 + 16.63$$
$$-2.63 < m_1 - m_2 < 30.63$$
We are 90% certain that the difference in the population
means is between -2.63 & 30.63.
Again, since zero is in this interval, we can’t rule out the
possibility that the class mean grades are equal.
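The pooled-variance calculation can be sketched the same way. The function names are illustrative; the t value 2.015 (90% confidence, 5 dof) is assumed to come from a t table.

```python
from math import sqrt
from statistics import mean, variance

def pooled_variance(sample1, sample2):
    """Pooled sample variance: weights each sample variance by its n-1."""
    n1, n2 = len(sample1), len(sample2)
    return ((n1 - 1) * variance(sample1) + (n2 - 1) * variance(sample2)) / (n1 + n2 - 2)

def pooled_interval(sample1, sample2, t_value):
    """Confidence interval for m1 - m2 when the unknown variances are believed equal."""
    n1, n2 = len(sample1), len(sample2)
    diff = mean(sample1) - mean(sample2)
    margin = t_value * sqrt(pooled_variance(sample1, sample2) * (1 / n1 + 1 / n2))
    return diff - margin, diff + margin

class1, class2 = [64, 66, 89, 77], [56, 71, 53]
sp_sq = pooled_variance(class1, class2)
low, high = pooled_interval(class1, class2, t_value=2.015)
print(round(sp_sq, 1), round(low, 2), round(high, 2))  # 116.8 -2.63 30.63
```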
We can also develop a confidence interval for
the difference in population proportions
$\pi_1 - \pi_2$
The point estimate is the difference in the sample proportions
$p_1 - p_2$
Next we need the standard deviation of our point estimate.
Similarly to the case of the difference in population means,
$$V(p_1 - p_2) = V(p_1) + V(p_2)$$
Recalling that our previous estimate of $V(p)$ was $\dfrac{p(1-p)}{n}$, we have
$$V(p_1 - p_2) = \frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}$$
The estimated standard deviation of our point estimate becomes
$$\sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$$
Using our basic format, we find the confidence
interval for the difference in population proportions.
(point estimate) - (z or t) × (std. dev. or estimate of the std. dev. of our pt. estimate)
< desired parameter <
(point estimate) + (z or t) × (std. dev. or estimate of the std. dev. of our pt. estimate)
$$(p_1 - p_2) - z\sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}} < \pi_1 - \pi_2 < (p_1 - p_2) + z\sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$$
Example: Samples from 2 states show proportions
of Democrats 1/3 & 1/5 with sample sizes 100 & 225.
Calculate the 99% confidence interval for the
difference in population proportions.
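$$(p_1 - p_2) - z\sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}} < \pi_1 - \pi_2 < (p_1 - p_2) + z\sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$$
The z value for a 99% confidence
interval is 2.575 .
[Figure: standard normal (Z) curve with middle area 0.99 and area 0.005 in each tail; upper cutoff at 2.575.]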
We have:
$p_1 = 0.33$, $1 - p_1 = 0.67$, $p_2 = 0.20$, $1 - p_2 = 0.80$, $n_1 = 100$, $n_2 = 225$, z = 2.575
Applying the formula
$$(p_1 - p_2) - z\sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}} < \pi_1 - \pi_2 < (p_1 - p_2) + z\sqrt{\frac{p_1(1-p_1)}{n_1} + \frac{p_2(1-p_2)}{n_2}}$$
yields
$$(0.33 - 0.20) - 2.575\sqrt{\frac{(0.33)(0.67)}{100} + \frac{(0.20)(0.80)}{225}} < \pi_1 - \pi_2 < (0.33 - 0.20) + 2.575\sqrt{\frac{(0.33)(0.67)}{100} + \frac{(0.20)(0.80)}{225}}$$
or
$$0.13 - 0.14 < \pi_1 - \pi_2 < 0.13 + 0.14$$
So the 99% confidence interval for the difference in population
proportions is
$$-0.01 < \pi_1 - \pi_2 < 0.27$$
Given our confidence interval: $-0.01 < \pi_1 - \pi_2 < 0.27$
Can we conclude that the two population proportions
are not equal?
No. Since zero is in the interval, $\pi_1$ may equal $\pi_2$ .
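The Democrats example can be checked with a short Python sketch. The function name `diff_proportions_interval` is an illustrative choice; it uses the rounded sample proportions 0.33 and 0.20 and the table z value 2.575, as in the worked example.

```python
from math import sqrt

def diff_proportions_interval(p1, n1, p2, n2, z):
    """Confidence interval for the difference in two population proportions."""
    diff = p1 - p2
    margin = z * sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff - margin, diff + margin

low, high = diff_proportions_interval(0.33, 100, 0.20, 225, z=2.575)
print(round(low, 2), round(high, 2))  # -0.01 0.27
```

The lower limit is slightly below zero, which is why equal proportions cannot be ruled out here.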
How do you decide the
appropriate sample size for a project?
2 Decisions:
• Desired confidence level
• Maximum difference D between the estimate of
the population parameter & the true value of the
population parameter (that is, the maximum
error you’re willing to accept)
For example, if you’re estimating the population
mean m using the sample mean $\bar{X}$,
D is the maximum difference between $\bar{X}$ and m.
Suppose you have chosen 95% as
your desired confidence level.
You know that there is a z value (call it $z_0$)
such that $-z_0 < Z < z_0$ 95% of the time.
You also know that $\dfrac{\bar{X} - m}{\sigma/\sqrt{n}}$ is distributed as a Z.
So, with 95% probability, we have
$$-z_0 < \frac{\bar{X} - m}{\sigma/\sqrt{n}} < z_0 .$$
Multiplying by $\sigma/\sqrt{n}$, we have
$$-z_0\frac{\sigma}{\sqrt{n}} < \bar{X} - m < z_0\frac{\sigma}{\sqrt{n}} .$$
We see here that the largest value of the difference between $\bar{X}$ and m,
which we called D, is $z_0\,\sigma/\sqrt{n}$.
So, $D = z_0\dfrac{\sigma}{\sqrt{n}}$.
We have now, $D = z_0\dfrac{\sigma}{\sqrt{n}}$,
and we can solve for the sample size n.
First, square both sides of the equation:
$$D^2 = z_0^2\frac{\sigma^2}{n}$$
Multiply through by n:
$$nD^2 = z_0^2\sigma^2$$
Divide through by $D^2$:
$$n = \frac{z_0^2\sigma^2}{D^2}$$
Dropping the subscript on z for convenience, we have the formula:
$$n = \frac{z^2\sigma^2}{D^2}$$
So we have a formula for determining the
appropriate sample size n when we want to
estimate the population mean.
$$n = \frac{z^2\sigma^2}{D^2}$$
Example: Suppose you’re trying to estimate the mean
monthly rent of 2-bedroom apartments in towns of 100,000
people or less. The population standard deviation is 20.
You want to be 95% sure that your estimate is within $3 of
the true mean. How large a sample should you take?
nz
2

2
2
D
2
2 (20 )
 (1.96)
(3) 2
 170.3
You need to sample
171 observations.
It’s not 170, because
sample sizes smaller than
170.3 provide you with less
information & therefore less
than the desired level of
confidence.
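The rent example can be sketched in Python. The function name `sample_size_for_mean` is an illustrative choice; it rounds up with `math.ceil`, which matches the reasoning that a fractional sample size must be rounded up, never down.

```python
from math import ceil

def sample_size_for_mean(sigma, max_error, z):
    """Smallest n with z * sigma / sqrt(n) <= max_error."""
    return ceil(z ** 2 * sigma ** 2 / max_error ** 2)

print(sample_size_for_mean(sigma=20, max_error=3, z=1.96))  # 171
```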
Our formula for n has
the population standard deviation $\sigma$ in it.
What do we do if we don’t know $\sigma$?
In the past, we used the sample standard deviation
s. Why can’t we do that here?
s came from the sample. We haven’t taken the
sample yet. We’re still trying to figure out how
many observations our sample should have.
If previous researchers have done related work,
you may be able to use their estimate for the
standard deviation.
Alternatively, you can do a small preliminary
sample, & based on that information, estimate
the standard deviation.
Determining the appropriate sample size n
for estimating the population proportion $\pi$.
Again we will use
$-z_0 < Z < z_0$ with the desired confidence level as
our starting point.
We know that
$$\frac{p - \pi}{\sqrt{\pi(1-\pi)/n}}$$
is approximately standard normal.
So with the desired level of confidence,
$$-z_0 < \frac{p - \pi}{\sqrt{\pi(1-\pi)/n}} < z_0 .$$
Starting from
$$-z_0 < \frac{p - \pi}{\sqrt{\pi(1-\pi)/n}} < z_0$$
Multiply through by $\sqrt{\pi(1-\pi)/n}$,
$$-z_0\sqrt{\frac{\pi(1-\pi)}{n}} < p - \pi < z_0\sqrt{\frac{\pi(1-\pi)}{n}}$$
We see here that the maximum difference D
between our estimator p & our parameter $\pi$ is:
$$z_0\sqrt{\frac{\pi(1-\pi)}{n}}$$
We have now, $D = z_0\sqrt{\dfrac{\pi(1-\pi)}{n}}$
and we can solve for the sample size n.
First, square both sides of the equation:
$$D^2 = z_0^2\frac{\pi(1-\pi)}{n}$$
Multiply through by n:
$$nD^2 = z_0^2\,\pi(1-\pi)$$
Divide through by $D^2$:
$$n = \frac{z_0^2\,\pi(1-\pi)}{D^2}$$
Dropping the subscript on z for convenience, we have the formula:
$$n = \frac{z^2\,\pi(1-\pi)}{D^2}$$
There’s one big problem with
this formula. What is it?
$$n = \frac{z^2\,\pi(1-\pi)}{D^2}$$
We want to collect a sample in order to estimate $\pi$,
but we have the unknown $\pi$ in our equation for
determining the sample size!
We can’t use the sample proportion as we did
before, because we haven’t taken the sample yet.
As it happens, we can resolve this problem fairly
easily.
The largest possible value for $\pi(1-\pi)$ occurs
when $\pi$ is ½, and that largest value is ¼.
Play with some values for $\pi$ & $1-\pi$, and
convince yourself that this is true. For example,
(1/3)(2/3) = 2/9 < 1/4
(3/10)(7/10) = 21/100 < 1/4
(1/100)(99/100) = 99/10,000< 1/4
If we know the largest possible value for $\pi(1-\pi)$, we can
determine the largest sample size we should need for
$$n = \frac{z^2\,\pi(1-\pi)}{D^2}$$
Plugging in the maximum value of ¼ for $\pi(1-\pi)$, we
have
$$n = \frac{z^2\left(\frac{1}{4}\right)}{D^2} = \frac{z^2}{4D^2}$$
So our formula for n is:
$$n = \frac{z^2}{4D^2}$$
Sometimes you have a rough idea of what $\pi$ is,
but you’re trying to get a more precise value.
You can use your rough idea to determine the
sample size.
If $\pi_1$ is your rough idea, then the sample size
formula becomes
$$n = \frac{z^2\,\pi_1(1-\pi_1)}{D^2}$$
So we have 2 formulae for determining the
appropriate sample size for estimating the
population proportion.
If you have no idea at all
what $\pi$ is, you use:
$$n = \frac{z^2}{4D^2}$$
If you have a rough idea
of $\pi_1$ for the value of $\pi$,
you use:
$$n = \frac{z^2\,\pi_1(1-\pi_1)}{D^2}$$
Example: We are estimating the proportion of families
with 2 or more cars. We want to be 95% certain that
the estimate is within 3% (0.03) of the correct
percentage. What is the necessary sample size?
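We’re clueless on the
proportion $\pi$, so we
use the formula
$$n = \frac{z^2}{4D^2}$$
The z value for 95%
confidence is 1.96 .
[Figure: standard normal (Z) curve with middle area 0.95 and area 0.0250 in each tail; upper cutoff at 1.96.]
Filling in our
values, we get
$$n = \frac{z^2}{4D^2} = \frac{(1.96)^2}{4(0.03)^2} = 1067.1$$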
So the needed sample size is 1068.
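Both proportion sample-size formulas can be sketched with one Python function. The name `sample_size_for_proportion` and the parameter `rough_pi` are illustrative: leaving `rough_pi` at 0.5 gives the "no idea at all" formula, since 0.5 × 0.5 = ¼, and passing a rough value gives the second formula.

```python
from math import ceil

def sample_size_for_proportion(max_error, z, rough_pi=0.5):
    """Sample size for estimating a proportion to within max_error.
    rough_pi = 0.5 is the worst case, equivalent to z^2 / (4 D^2)."""
    return ceil(z ** 2 * rough_pi * (1 - rough_pi) / max_error ** 2)

print(sample_size_for_proportion(max_error=0.03, z=1.96))  # 1068
```

Supplying a rough value other than ½ always gives a sample size no larger than the worst case.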