ST3239: Survey Methodology
by Wang ZHOU
Chapter 1
Elements of the sampling problem
1.1 Introduction
Often we are interested in some characteristics of a finite population, e.g. the average income of last year's graduates from NUS. Since the population is usually very large, we would like to say something (i.e. make inferences) about the population by collecting and analysing only a part of that population. The principles and methods of collecting and analysing data from a finite population form a branch of statistics known as Sample Survey Methods. The theory involved is called Sampling Theory. Sample surveys are widely used in many areas such as agriculture, education, industry, social affairs and medicine.
1.2 Some technical terms
1. An element is an object on which a measurement is taken.
2. A population is a collection of elements about which we require information.
3. A population characteristic is the aspect of the population we wish to measure, e.g. the average income of last year's graduates from NUS, or the total wheat yield of all farmers in a certain country.
4. Sampling units are nonoverlapping collections of elements from the population. Sampling units may be the individual members of the population, or they may be a coarser subdivision of the population, e.g. a household, which may contain more than one individual member.
5. A frame is a list of sampling units, e.g., a telephone directory.
6. A sample is a collection of sampling units drawn from a frame or frames.
1.3 Why sample?
If a sample is equal to the population, then we have a census, which contains all the information one wants. However, a census is rarely conducted, for several reasons:
• cost (money is limited),
• time (time is limited),
• destructiveness (testing a product can be destructive, e.g. light bulbs),
• accessibility (non-response can be a serious issue).
In such cases, sampling is the only alternative.
1.4 How to select the sample: the design of the sample survey
The procedure for selecting the sample is called the sample survey design. The general aim of a sample survey is to draw samples which are "representative" of the whole population. Broadly speaking, we can classify sampling schemes into two categories: probability sampling and other sampling schemes.
1. Probability sampling is a sampling scheme whereby the possible samples are enumerated and each has a non-zero probability of being selected. With probability built into the design, we can make statements such as "our estimate is unbiased and we are 95% confident that it is within 2 percentage points of the true proportion". In this course, we shall concentrate only on probability sampling.
2. Some other sampling schemes
a) 'volunteer sampling': TV telephone polls, medical volunteers for research.
b) 'subjective sampling': we choose samples that we consider to be typical or "representative" of the population.
c) 'quota sampling': one keeps sampling until a certain quota is filled.
All these sampling procedures provide some information about the population, but it is
hard to deduce the nature of the population from the studies as the samples are very subjective
and often very biased. Furthermore, it is hard to measure the precision of these estimates.
1.5 How to design a questionnaire and plan a survey
This can be the most important and perhaps most difficult part of the survey sampling problem. We shall come back to this point in more detail later.
Chapter 2
Simple random sampling
Definition: If a sample of size n is drawn from a population of size N in such a way that every
possible sample of size n has the same probability of being selected, the sampling procedure
is called simple random sampling. The sample thus obtained is called a simple random
sample. Simple random sampling is often written as s.r.s. for short and is the simplest
sampling procedure.
2.1 How to draw a simple random sample
Suppose that the population of size N has values
{u_1, u_2, ..., u_N}.
If we draw n (distinct) items without replacement from the population, there are altogether C(N, n) = N!/(n!(N−n)!) different ways of doing it. So if we assign probability 1/C(N, n) to each of the different samples, then each sample thus obtained is a simple random sample. We denote this sample by
{y_1, y_2, ..., y_n}.
Remark: In our previous statistics courses, we always used upper-case letters like X, Y etc. to denote random variables and lower-case letters like x, y etc. to represent fixed values. However, in this sample survey course, by convention, we use lower-case letters like y_1, y_2 etc. to denote random variables.
Theorem 2.1.1 For simple random sampling, we have
P(y_1 = u_{i_1}, y_2 = u_{i_2}, ..., y_n = u_{i_n}) = (1/N) · (1/(N−1)) ··· (1/(N−n+1)) = (N−n)!/N!,
where i_1, i_2, ..., i_n are mutually different.

Proof. By the definition of s.r.s., the probability of obtaining the sample {u_{i_1}, u_{i_2}, ..., u_{i_n}} (where the order is not important) is 1/C(N, n). There are n! ways of ordering {u_{i_1}, u_{i_2}, ..., u_{i_n}}. Therefore,
P(y_1 = u_{i_1}, y_2 = u_{i_2}, ..., y_n = u_{i_n}) = (1/C(N, n)) · (1/n!) = (N−n)! n!/(N! n!) = (N−n)!/N!.
Remark: Recall that the total number of all possible samples is C(N, n), which could be very large if N and n are large. Therefore, getting a simple random sample by first listing all possible samples and then drawing one at random would not be practical. An easier way to get a simple random sample is simply to draw n values at random without replacement from the N population values. That is, we first draw one value at random from the N population values, and then draw another value at random from the remaining N − 1 population values, and so on, until we get a sample of n (different) values.
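As a side note, this sequential procedure is easy to program. Here is a minimal Python sketch (the function name draw_srs is ours, purely for illustration); the standard library call random.sample(population, n) implements the same idea.

import random

def draw_srs(population, n):
    # Draw one value at a time from the remaining pool, each remaining
    # unit being equally likely at every step.
    pool = list(population)
    sample = []
    for _ in range(n):
        sample.append(pool.pop(random.randrange(len(pool))))
    return sample

print(draw_srs(range(10), 3))   # e.g. [7, 2, 5]

Theorem 2.1.2 below confirms that this sequential scheme indeed produces a simple random sample.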
Theorem 2.1.2 A sample obtained by drawing n values successively without replacement from
the N population values is a simple random sample.
Proof. Suppose that our sample obtained by drawing n values without replacement from the
N population values is
{a_1, a_2, ..., a_n},
where the order is not important. Let {a_{i_1}, a_{i_2}, ..., a_{i_n}} be any permutation of {a_1, a_2, ..., a_n}. Since the sample is drawn without replacement, we have
P(y_1 = a_{i_1}, ..., y_n = a_{i_n}) = (1/N) · (1/(N−1)) ··· (1/(N−n+1)) = (N−n)!/N!.
Hence, the probability of obtaining the sample {a_1, ..., a_n} (where the order is not important) is
Σ_{all (i_1,...,i_n)} P(y_1 = a_{i_1}, ..., y_n = a_{i_n}) = Σ_{all (i_1,...,i_n)} (N−n)!/N! = n! × (N−n)!/N! = 1/C(N, n).
The theorem is thus proved by the definition of simple random sampling.
Two special cases, corresponding to one and two fixed coordinates of the sample, will be used later.
Theorem 2.1.3 For any i, j = 1, ..., n and s, t = 1, ..., N,
(i) P(y_i = u_s) = 1/N;
(ii) P(y_i = u_s, y_j = u_t) = 1/(N(N−1)), for i ≠ j, s ≠ t.

Proof. For (i),
P(y_k = u_s) = Σ_{all (i_1,...,i_n) with i_k = s} P(y_1 = u_{i_1}, ..., y_k = u_{i_k}, ..., y_n = u_{i_n})
= C(N−1, n−1) (n−1)! × (N−n)!/N! = ((N−1)!/(N−n)!) × ((N−n)!/N!) = 1/N.
For (ii),
P(y_k = u_s, y_j = u_t) = Σ_{all (i_1,...,i_n) with i_k = s, i_j = t} P(y_1 = u_{i_1}, ..., y_n = u_{i_n})
= C(N−2, n−2) (n−2)! × (N−n)!/N! = ((N−2)!/(N−n)!) × ((N−n)!/N!) = 1/(N(N−1)).
Example 1. A population contains {a, b, c, d}. We wish to draw a s.r.s. of size 2. List all possible samples and find the probability of drawing {b, d}.
Solution. The possible samples of size 2 are
{a, b}, {a, c}, {a, d}, {b, c}, {b, d}, {c, d}.
The probability of drawing {b, d} is 1/6.
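The enumeration can be checked in a couple of lines of Python (a sketch; each tuple stands for an unordered sample):

from itertools import combinations

population = ['a', 'b', 'c', 'd']
samples = list(combinations(population, 2))   # all C(4, 2) = 6 samples
print(samples)
print(1 / len(samples))                       # P({b, d}) = 1/6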
2.2 Estimation of population mean and total

2.2.1 Estimation of population mean

Suppose that the population of size N has values {u_1, u_2, ..., u_N}. We can define
1) the population mean
μ = (u_1 + u_2 + ··· + u_N)/N = (1/N) Σ_{i=1}^N u_i,
2) the population variance
σ² = (1/N) Σ_{i=1}^N (u_i − μ)².
We wish to estimate the quantities μ and σ² and to study the accuracy of their estimators. Suppose that a simple random sample of size n is drawn, resulting in {y_1, y_2, ..., y_n}. Then an obvious estimator for μ is the sample mean:
μ̂ = ȳ = (1/n) Σ_{i=1}^n y_i.
Theorem 2.2.1
(i) E(y_i) = μ, Var(y_i) = σ²;
(ii) Cov(y_i, y_j) = −σ²/(N−1), for i ≠ j.

Proof. (i). By an earlier theorem (Theorem 2.1.3),
E(y_i) = Σ_{k=1}^N u_k P(y_i = u_k) = Σ_{k=1}^N u_k (1/N) = μ,
Var(y_i) = Σ_{k=1}^N (u_k − μ)² P(y_i = u_k) = Σ_{k=1}^N (u_k − μ)² (1/N) = σ².
(ii). By definition, Cov(y_i, y_j) = E(y_i y_j) − E(y_i)E(y_j) = E(y_i y_j) − μ². Now, by Theorem 2.1.3(ii),
E(y_i y_j) = Σ_{s≠t} u_s u_t P(y_i = u_s, y_j = u_t) = (1/(N(N−1))) Σ_{s≠t} u_s u_t
= (1/(N(N−1))) [ Σ_{all s,t} u_s u_t − Σ_{s=t} u_s u_t ]
= (1/(N(N−1))) [ (Σ_{s=1}^N u_s)(Σ_{t=1}^N u_t) − Σ_{s=1}^N u_s² ]
= (1/(N(N−1))) [ (Nμ)² − (Σ_{s=1}^N (u_s − μ)² + Nμ²) ]
= (1/(N(N−1))) [ (Nμ)² − Nσ² − Nμ² ]
= −σ²/(N−1) + μ².
Thus, Cov(y_i, y_j) = E(y_i y_j) − μ² = −σ²/(N−1).
Theorem 2.2.2
E(ȳ) = μ,  Var(ȳ) = (σ²/n) · (N−n)/(N−1).
Proof. Note ȳ = (1/n)(y_1 + ··· + y_n). So
E(ȳ) = (1/n)(E y_1 + ··· + E y_n) = (1/n)(nμ) = μ.
Now
Var(ȳ) = (1/n²) Cov(Σ_{i=1}^n y_i, Σ_{j=1}^n y_j) = (1/n²) Σ_{i=1}^n Σ_{j=1}^n Cov(y_i, y_j)
= (1/n²) [ Σ_{i≠j} Cov(y_i, y_j) + Σ_{i=j} Cov(y_i, y_j) ]
= (1/n²) [ Σ_{i≠j} (−σ²/(N−1)) + Σ_{i=1}^n Var(y_i) ]
= (1/n²) [ n(n−1) · (−σ²/(N−1)) + nσ² ]
= (σ²/n) [ (n−1) · (−1/(N−1)) + 1 ]
= (σ²/n) · (N−n)/(N−1).
Remark: From Theorem 2.2.2, we see that ȳ is an unbiased estimator for μ. Also, as n gets larger (but n ≤ N), Var(ȳ) decreases towards 0. This implies that ȳ becomes a more accurate estimator for μ as n grows. In particular, when n = N, we have a census and Var(ȳ) = 0.
Remark: In our previous statistics courses, we usually sampled {y_1, y_2, ..., y_n} from the population with replacement, so that {y_1, y_2, ..., y_n} are independent and identically distributed (i.i.d.). Recall that in that case
E_iid(ȳ) = μ,  Var_iid(ȳ) = σ²/n.
Notice that Var_iid(ȳ) is different from Var(ȳ) in Theorem 2.2.2. In fact, for n > 1,
Var(ȳ) = (σ²/n) · (N−n)/(N−1) < σ²/n = Var_iid(ȳ).
Thus, for the same sample size n, sampling without replacement produces a less variable estimator of μ. Why?
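The inequality is easy to see empirically. Below is a minimal Monte Carlo sketch (the population values are made up for illustration): the sample mean under sampling without replacement shows the smaller variance predicted by Theorem 2.2.2.

import random
import statistics

# A made-up population of N = 20 values, and sample size n = 5.
population = [3, 7, 1, 9, 4, 6, 2, 8, 5, 10, 3, 7, 1, 9, 4, 6, 2, 8, 5, 10]
N, n, reps = len(population), 5, 100_000

wor = [statistics.mean(random.sample(population, n)) for _ in range(reps)]    # without replacement
wr = [statistics.mean(random.choices(population, k=n)) for _ in range(reps)]  # with replacement

sigma2 = statistics.pvariance(population)                          # population variance (divisor N)
print(statistics.pvariance(wor), sigma2 / n * (N - n) / (N - 1))   # both ≈ 1.30
print(statistics.pvariance(wr), sigma2 / n)                        # both ≈ 1.65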
Summary
1. How to draw a simple random sample? (Purpose, method.) Simple random sampling is the basic survey methodology.
2. After getting a s.r.s., how to describe the population, or how to analyze the data? Estimate the population mean by the sample mean.

Estimation of σ² and Var(ȳ)
The population variance σ² is usually unknown. Now define
s² = (1/(n−1)) Σ_{i=1}^n (y_i − ȳ)² = (1/(n−1)) [ Σ_{i=1}^n y_i² − n ȳ² ].
Example. When a few data points are repeated in a data set, the results are often arrayed in a frequency table. For example, a quiz given to 25 students was graded on a 4-point scale 0, 1, 2, 3, with 3 being a perfect score. Here are the results:

Score (X)   Frequency (F)   Proportion (P)
3           16              0.64
2           4               0.16
1           2               0.08
0           3               0.12

(a) Calculate the average score by using frequencies.
(b) Calculate the average score by using proportions.
(c) Calculate the standard deviation.

Solution
(a) (3 × 16 + 2 × 4 + 1 × 2)/25 = 58/25 = 2.32.
(b) μ = Σ x p(x) = 3 × 0.64 + 2 × 0.16 + 1 × 0.08 + 0 × 0.12 = 2.32.
(c) Var(X) = Σ x² p(x) − μ² = 3² × 0.64 + 2² × 0.16 + 1² × 0.08 + 0² × 0.12 − 2.32² = 1.0976, so σ = √Var(X) = 1.05.
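The computations in (a) to (c) are easy to reproduce; a small Python sketch:

# Frequency-table arithmetic for the quiz example.
freq = {3: 16, 2: 4, 1: 2, 0: 3}          # score -> frequency
n = sum(freq.values())                     # 25 students

mean = sum(x * f for x, f in freq.items()) / n                  # 2.32
var = sum(x ** 2 * f for x, f in freq.items()) / n - mean ** 2  # 1.0976
print(mean, var, var ** 0.5)               # 2.32  1.0976  ≈ 1.05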
If the above 25 students constitute a random sample, then s² = (n/(n−1)) × 1.0976 = (25/24) × 1.0976 = 1.1433.
Let us look at some properties of s². Is it unbiased?

Theorem 2.2.3
E(s²) = (N/(N−1)) σ².
Proof.
E(s²) = (1/(n−1)) [ Σ_{i=1}^n E(y_i²) − n E(ȳ²) ]
= (1/(n−1)) [ Σ_{i=1}^n (Var(y_i) + (E y_i)²) − n (Var(ȳ) + (E ȳ)²) ]
= (1/(n−1)) [ n(σ² + μ²) − n((σ²/n) · (N−n)/(N−1) + μ²) ]
= (nσ²/(n−1)) [ 1 − (1/n) · (N−n)/(N−1) ]
= (nσ²/(n−1)) · (nN − n − (N−n))/(n(N−1))
= Nσ²/(N−1).
The next theorem is an easy consequence of the last theorem.

Theorem 2.2.4 σ̂² := ((N−1)/N) s² is an unbiased estimator of σ², i.e.
E[ ((N−1)/N) s² ] = σ².

We shall define
f = n/N to be the sampling fraction,
1 − f = 1 − n/N to be the finite population correction (abbreviated fpc).
Then we have the following theorem.
Theorem 2.2.5 An unbiased estimator for Var(ȳ) is
V̂ar(ȳ) = (s²/n)(1 − f).

Proof.
E[V̂ar(ȳ)] = (E(s²)/n)(1 − f) = (Nσ²/(n(N−1))) (1 − n/N) = (σ²/n) · (N−n)/(N−1) = Var(ȳ).
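For a small population, Theorems 2.2.3 and 2.2.5 can be verified exactly by enumerating all possible samples. A Python sketch with a made-up population of N = 5 values:

from itertools import combinations
from statistics import mean, pvariance, variance

u = [1, 4, 6, 7, 10]        # a toy population; the values are arbitrary
N, n = len(u), 3
sigma2 = pvariance(u)       # population variance σ² (divisor N)

samples = list(combinations(u, n))            # all C(5, 3) = 10 equally likely samples
print(mean(variance(s) for s in samples))     # E(s²), with s² using divisor n−1
print(N / (N - 1) * sigma2)                   # equals Nσ²/(N−1): Theorem 2.2.3

vhat = [variance(s) / n * (1 - n / N) for s in samples]
print(mean(vhat))                             # E[V̂ar(ȳ)]
print(pvariance([mean(s) for s in samples]))  # exact Var(ȳ): Theorem 2.2.5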
Confidence intervals for μ
It can be shown that the sample average ȳ under simple random sampling is approximately normally distributed provided n is large (≥ 30, say) and f = n/N is not too close to 0 or 1.

Central limit theorem: If n → ∞ such that n/N → λ ∈ (0, 1), then
(ȳ − μ)/√Var(ȳ) ∼ N(0, 1) approximately.
If Var(ȳ) is replaced by its estimator V̂ar(ȳ), we still have
(ȳ − μ)/√V̂ar(ȳ) ∼ N(0, 1) approximately, as n/N → λ > 0.
Thus,
1 − α ≈ P( |ȳ − μ|/√V̂ar(ȳ) ≤ z_{α/2} ) = P( ȳ − z_{α/2} √V̂ar(ȳ) ≤ μ ≤ ȳ + z_{α/2} √V̂ar(ȳ) ).
Therefore, an approximate (1 − α) confidence interval for μ is
ȳ ∓ z_{α/2} √V̂ar(ȳ) = ȳ ∓ z_{α/2} (s/√n) √(1 − f).
B := z_{α/2} √V̂ar(ȳ) is called the bound on the error of estimation.
Example. Suppose that a s.r.s. of size n = 200 is taken from a population of size N = 1000, resulting in ȳ = 94 and s² = 400. Find a 95% C.I. for μ.
Solution
94 ∓ 1.96 × (20/√200) × √(1 − 1/5) = 94 ∓ 2.479.

Example. A simple random sample of n = 100 water meters within a community is monitored to estimate the average daily water consumption per household over a specified dry spell. The sample mean and variance are found to be ȳ = 12.5 and s² = 1252. If we assume that there are N = 10,000 households within the community, estimate μ, the true average daily consumption, and find a 95% confidence interval for μ.
Solution
μ̂ = ȳ = 12.5.
V̂ar(ȳ) = (σ̂²/n) · (N−n)/(N−1) = (s²/n)(1 − n/N) = (1252/100)(1 − 100/10000) = 12.3948.
√V̂ar(ȳ) = 3.5206.
A 95% C.I. for μ is 12.5 ± 1.96 × 3.5206 = (5.6, 19.4).
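Both examples follow the same recipe, compact enough for a small Python helper (a sketch; the function name is ours). It uses (1 − n/N) rather than (N − n)/(N − 1), matching V̂ar(ȳ) above.

def srs_mean_ci(ybar, s2, n, N, z=1.96):
    # Approximate CI for μ under s.r.s.; z = 1.96 gives a 95% interval.
    vhat = s2 / n * (1 - n / N)      # estimated Var(ȳ), with fpc
    b = z * vhat ** 0.5              # bound on the error of estimation
    return ybar - b, ybar + b

print(srs_mean_ci(94, 400, 200, 1000))      # first example: (91.52, 96.48)
print(srs_mean_ci(12.5, 1252, 100, 10000))  # water meters: (5.6, 19.4)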
2.3 Selecting the sample size for estimating population means

We have seen that Var(ȳ) = (σ²/n) · (N−n)/(N−1). So the bigger the sample size n is (but n ≤ N), the more accurate our estimate ȳ is. It is of interest to find the minimum n such that our estimate is within an error bound B with certain probability 1 − α, say,
P(|ȳ − μ| < B) ≈ 1 − α,
i.e.,
P( |ȳ − μ|/√Var(ȳ) < B/√Var(ȳ) ) ≈ 1 − α.
By the central limit theorem,
B/√Var(ȳ) ≈ z_{α/2}
⇔ (σ²/n) · (N−n)/(N−1) = B²/z²_{α/2} = D
⇔ N/n − 1 = (N−1)D/σ²
⇔ N/n = 1 + (N−1)D/σ² = ((N−1)D + σ²)/σ².
Thus,
n ≈ Nσ²/((N−1)D + σ²), where D = B²/z²_{α/2}.

Remark 1: if α = 5%, then z_{α/2} = 1.96 ≈ 2, so D ≈ B²/4. This coincides with the formula in the textbook (page 93).
Remark 2: the above formula requires knowledge of the population variance σ², which is typically unknown in practice. However, we can approximate σ² by the following methods:
1) from pilot studies,
2) from previous surveys,
3) from other studies.
e.g. Suppose that a total of 1500 students are to graduate next year. Determine the sample size n needed to ensure that the sample average starting salary is within $40 of the population average with probability at least 0.9. From previous studies, we know that the standard deviation of the starting salary is approximately $400.
Solution. n = (1500 × 400²)/(1499 × 40²/1.645² + 400²) = 229.37 ≈ 230.
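The sample-size formula is one line of code; a Python sketch (function name ours), checked against the example above:

from math import ceil

def srs_sample_size(N, sigma2, B, z=1.96):
    # Minimum n with P(|ȳ − μ| < B) ≈ 1 − α, where z = z_{α/2};
    # round up, since a larger n can only improve the accuracy.
    D = B ** 2 / z ** 2
    return ceil(N * sigma2 / ((N - 1) * D + sigma2))

# Graduating students: N = 1500, σ ≈ 400, B = 40, probability 0.9 (z = 1.645)
print(srs_sample_size(1500, 400 ** 2, 40, z=1.645))   # 230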
e.g. Example 4.5 (p. 94, 5th edition). The average amount of money μ for a hospital's accounts receivable must be estimated. Although no prior data are available to estimate the population variance σ², it is known that most accounts lie within a $100 range. There are 1000 open accounts. Find the sample size needed to estimate μ with a bound on the error of estimation B = $3 with probability 0.95.
Remark. The solution depends on how one interprets "most accounts": whether it means 70%, 90%, 95% or 99% of all accounts.
Solution. We need an estimate of σ². For the normal distribution N(0, σ²), we have P(|N(0, σ²)| ≤ 1.96σ) = P(|N(0, 1)| ≤ 1.96) = 95% and P(|N(0, σ²)| ≤ 3σ) = P(|N(0, 1)| ≤ 3) = 99.73%. So 95% of accounts lie within a 4σ range and 99.73% of accounts lie within a 6σ range. Here B = 3 and N = 1000.
If "most" means 95%, we take 2 × (2σ) = 100, so σ = 25. Then n = 210.76 ≈ 211.
If "most" means 99.73%, we take 2 × (3σ) = 100, so σ = 50/3. Then n ≈ 107.
2.3.1 A quick summary on estimation of population mean

The population mean is defined to be
μ = (1/N)(u_1 + u_2 + ··· + u_N).
Suppose a simple random sample is {y_1, ..., y_n}.
1) Estimators of the population mean μ and variance σ² are
μ̂ = ȳ = (1/n) Σ_{i=1}^n y_i,  s² = (1/(n−1)) Σ_{i=1}^n (y_i − ȳ)².
2) The mean and variance of ȳ are
E(ȳ) = μ,  Var(ȳ) = (σ²/n) · (N−n)/(N−1).
3) An estimator of the variance of ȳ is
V̂ar(ȳ) = (s²/n)(1 − f), where f = n/N.
4) An approximate (1 − α) confidence interval for μ is
ȳ ∓ z_{α/2} √V̂ar(ȳ) = ȳ ∓ z_{α/2} (s/√n) √(1 − f).
5) The minimum sample size n needed to have an error bound B with probability 1 − α is
n ≈ Nσ²/((N−1)D + σ²), where D = B²/z²_{α/2}.
2.3.2 Estimation of population total

The population total is defined to be
τ = u_1 + u_2 + ··· + u_N = Nμ.
Suppose a simple random sample is {y_1, ..., y_n}.
1) An estimator of the population total τ is
τ̂ = N ȳ.
2) The mean and variance of τ̂ are
E(τ̂) = τ,  Var(τ̂) = N² (σ²/n) · (N−n)/(N−1).
3) An estimator of the variance of τ̂ is
V̂ar(τ̂) = V̂ar(N ȳ) = N² (s²/n)(1 − f).

Central limit theorem: If n → ∞ such that n/N → λ ∈ (0, 1), then
(τ̂ − τ)/√Var(τ̂) ∼ N(0, 1) approximately.
If Var(τ̂) is replaced by its estimator V̂ar(τ̂), we still have
(τ̂ − τ)/√V̂ar(τ̂) ∼ N(0, 1) approximately, as n/N → λ > 0.
Thus,
1 − α ≈ P( |τ̂ − τ|/√V̂ar(τ̂) ≤ z_{α/2} ) = P( τ̂ − z_{α/2} √V̂ar(τ̂) ≤ τ ≤ τ̂ + z_{α/2} √V̂ar(τ̂) ).
4) Therefore, an approximate (1 − α) confidence interval for τ is
τ̂ ∓ z_{α/2} √V̂ar(τ̂) = τ̂ ∓ z_{α/2} N (s/√n) √(1 − f) = N ( ȳ ∓ z_{α/2} (s/√n) √(1 − f) ).
B := z_{α/2} √V̂ar(τ̂) = N z_{α/2} √V̂ar(ȳ) is called the bound on the error of estimation.
5) The minimum sample size n needed to have an error bound B with probability 1 − α is
n ≈ Nσ²/((N−1)D + σ²), where D = B²/(N² z²_{α/2}).
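Everything above mirrors the formulas for the mean, with an extra factor of N; a Python sketch (helper names ours):

from math import ceil, sqrt

def srs_total_ci(ybar, s2, n, N, z=1.96):
    # Approximate CI for the total τ = Nμ under s.r.s.
    tau_hat = N * ybar
    b = z * sqrt(N ** 2 * s2 / n * (1 - n / N))   # bound on the error of estimation
    return tau_hat - b, tau_hat + b

def srs_total_sample_size(N, sigma2, B, z=1.96):
    # Note the extra N² in D compared with the formula for the mean.
    D = B ** 2 / (N ** 2 * z ** 2)
    return ceil(N * sigma2 / ((N - 1) * D + sigma2))

print(srs_total_sample_size(1000, 36.0, 1000))   # Example 4.6 below: 122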
Example 4.6 (page 95 of the textbook). An investigator is interested in estimating the total weight gain in 0 to 4 weeks for N = 1000 chicks fed on a new ration. Obviously, to weigh each bird would be time-consuming and tedious. Therefore, determine the number of chicks to be sampled in this study in order to estimate τ within a bound on the error of estimation equal to 1000 grams with probability 95%. Many similar studies on chick nutrition have been run in the past. Using data from these studies, the investigator found that σ², the population variance, was approximately 36.00 (grams)². Determine the required sample size.
Solution
D = B²/(1.96 N)² = 1000²/(1.96² × 1000²) = 0.26.
n = Nσ²/((N−1)D + σ²) = 1000 × 36/(999 × 0.26 + 36) = 121.72 ≈ 122.
2.4 Estimation of population proportion

Suppose we are interested in the proportion p of the population with a specified characteristic. Let
y_i = 1 if the i-th element has the characteristic, and y_i = 0 if not.
It is easy to see that E(y_i) = E(y_i²) = p (why?). Therefore, we have
μ = E(y_i) = p,  σ² = Var(y_i) = p − p² = pq, where q = 1 − p.
The total number of elements in the sample of size n possessing the specified characteristic is Σ_{i=1}^n y_i. Therefore:
1. An estimator of the population proportion p is
p̂ = ȳ = (1/n) Σ_{i=1}^n y_i.
An estimator of the population variance σ² = pq is
s² = (1/(n−1)) Σ_{i=1}^n (y_i − ȳ)² = (1/(n−1)) [ Σ_{i=1}^n y_i² − n ȳ² ] = (1/(n−1)) (n p̂ − n p̂²) = (n/(n−1)) p̂ q̂, where q̂ = 1 − p̂.
From Theorems 2.2.2 and 2.2.3, we have
E(p̂) = p,  E(s²) = (N/(N−1)) σ² = (N/(N−1)) pq.  (4.1)
2. Again, from Theorem 2.2.2, the variance of p̂ is
Var(p̂) = (σ²/n) · (N−n)/(N−1) = (pq/n) · (N−n)/(N−1).
3. From equation (4.1) and Theorem 2.2.5, an estimator of the variance of p̂ is
V̂ar(p̂) = (s²/n)(1 − f) = (p̂q̂/(n−1))(1 − f).
4. An approximate (1 − α) confidence interval for p is
p̂ ∓ z_{α/2} √V̂ar(p̂) = p̂ ∓ z_{α/2} (√(p̂q̂)/√(n−1)) √(1 − f).
5. The minimum sample size n required to estimate p such that our estimate p̂ is within an error bound B with probability 1 − α is
n ≈ Npq/((N−1)D + pq), where D = B²/z²_{α/2}.
Note that the right hand side is an increasing function of σ² = pq.
a) p is often unknown, so we can replace it by some estimate (from a previous study, pilot study, etc.).
b) If we don't have an estimate of p, we can use p = 1/2, so that pq = 1/4, its largest possible value; this gives a conservative sample size.
e.g. Suppose that a small town has a population of N = 800 people. Let p be the proportion of people with blood type A.
(1) What sample size n must be drawn in order to estimate p to within 0.04 with probability 0.95?
(2) Suppose that we know that no more than 10% of the population have blood type A. Find n again as in (1). Comment on the difference between (1) and (2).
(3) A simple random sample of size n = 200 is taken and it is found that 7% of the sample has blood type A. Find a 90% confidence interval for p.
Solution. N = 800, α = 0.05, B = 0.04.
(1) Taking p = 1/2 in the formula, we get n = 344.
(2) p ≤ 0.10, so σ² = pq ≤ 0.09. A simple calculation yields n = 171. Knowing a bound on p reduces the required sample size considerably.
(3) (0.044, 0.096).
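The three parts can be reproduced with the formulas above; a Python sketch (helper name ours):

from math import ceil, sqrt

def srs_prop_sample_size(N, p, B, z=1.96):
    # Minimum n to estimate p within B; p = 0.5 is the conservative choice.
    D = B ** 2 / z ** 2
    return ceil(N * p * (1 - p) / ((N - 1) * D + p * (1 - p)))

print(srs_prop_sample_size(800, 0.5, 0.04))   # (1): 344
print(srs_prop_sample_size(800, 0.1, 0.04))   # (2): 171

# (3): 90% C.I. with n = 200, p_hat = 0.07, z_{0.05} = 1.645
n, N, p_hat, z = 200, 800, 0.07, 1.645
vhat = p_hat * (1 - p_hat) / (n - 1) * (1 - n / N)
print(p_hat - z * sqrt(vhat), p_hat + z * sqrt(vhat))   # ≈ (0.044, 0.096)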
Example A simple random sample of n = 40 college students was interviewed to determine
the proportion of students in favor of converting from the semester to the quarter system. 25
students answered affirmatively. Estimate p, the proportion of students on campus in favor of
the change. (Assume N = 2000.) Find a 95% confidence interval for p.
Solution
p̂ = ȳ = 25/40 = 0.625.
V̂ar(p̂) = (p̂q̂/(n−1))(1 − n/N) = (0.625 × 0.375/39) × (1 − 40/2000) = 5.889 × 10⁻³.
√V̂ar(p̂) = 0.07674.
A 95% C.I. for p is 0.625 ± 1.96 × 0.0767 = (0.4746, 0.7754).
2.5 Comparing estimates

Suppose x_1, ..., x_m is a random sample from a population with mean μ_x and y_1, ..., y_n is a random sample from a population with mean μ_y. We are interested in the difference of means μ_y − μ_x, which can be estimated without bias by ȳ − x̄, since
E(ȳ − x̄) = μ_y − μ_x.
Further,
Var(ȳ − x̄) = Var(ȳ) + Var(x̄) − 2 Cov(ȳ, x̄).
Remark: If the two samples x_1, ..., x_m and y_1, ..., y_n are independent, then Cov(ȳ, x̄) = 0. However, a more interesting case is when the two samples are dependent, which is illustrated in the following example.
A dependent example
Suppose an opinion poll asks n people the question "Do you favor abortion?" The possible answers are
YES, NO, NO OPINION.
Let the proportions of people who answer 'YES', 'NO', 'NO OPINION' be p_1, p_2 and p_3, respectively. In particular, we are interested in comparing p_1 and p_2 by looking at p_1 − p_2. Clearly, the sample proportions of 'YES' and 'NO' answers are dependent, since if one is high, the other is likely to be low.
Let p̂_1, p̂_2 and p̂_3 be the three respective sample proportions amongst the sample of size n. Then (X, Y, Z) = (n p̂_1, n p̂_2, n p̂_3) follows a multinomial distribution with parameters (n, p_1, p_2, p_3). That is,
P(X = x, Y = y, Z = z) = (n!/(x! y! z!)) p_1^x p_2^y p_3^z.
Please note that
Σ_{x,y,z ≥ 0, x+y+z=n} (n!/(x! y! z!)) p_1^x p_2^y p_3^z = 1.
Question: What is the distribution of X? (Hint: Classify the people into “Yes” and “Not
Yes”)
Theorem 2.5.1
E(X) = n p_1,  E(Y) = n p_2,  E(Z) = n p_3,
Var(X) = n p_1 q_1,  Var(Y) = n p_2 q_2,  Cov(X, Y) = −n p_1 p_2.

Proof. X = number of people saying "YES" ∼ Bin(n, p_1). So E(X) = n p_1 and Var(X) = n p_1 q_1. Now Cov(X, Y) = E(XY) − (EX)(EY) = E(XY) − n² p_1 p_2. But
E(XY) = Σ_{x,y≥0, x+y≤n} x y P(X = x, Y = y)
= Σ_{x,y≥1, x+y≤n} x y P(X = x, Y = y, Z = n−x−y)
= Σ_{x,y≥1, x+y≤n} x y (n!/(x! y! (n−x−y)!)) p_1^x p_2^y p_3^{n−x−y}
= Σ_{x,y≥1, x+y≤n} (n!/((x−1)! (y−1)! (n−x−y)!)) p_1^x p_2^y p_3^{n−x−y}
= n(n−1) p_1 p_2 Σ_{x−1,y−1≥0, (x−1)+(y−1)≤n−2} ((n−2)!/((x−1)! (y−1)! ((n−2)−(x−1)−(y−1))!)) p_1^{x−1} p_2^{y−1} p_3^{(n−2)−(x−1)−(y−1)}
= n(n−1) p_1 p_2 Σ_{x_1,y_1≥0, x_1+y_1≤n−2} ((n−2)!/(x_1! y_1! ((n−2)−x_1−y_1)!)) p_1^{x_1} p_2^{y_1} p_3^{(n−2)−x_1−y_1}
= n(n−1) p_1 p_2 = n² p_1 p_2 − n p_1 p_2.
Therefore, Cov(X, Y) = E(XY) − n² p_1 p_2 = −n p_1 p_2.
Theorem 2.5.2
E(p̂_1) = p_1,  E(p̂_2) = p_2,
Var(p̂_1) = p_1 q_1/n,  Var(p̂_2) = p_2 q_2/n,  Cov(p̂_1, p̂_2) = −p_1 p_2/n.

Proof. Note that p̂_1 = X/n and p̂_2 = Y/n. Apply the last theorem.
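The negative covariance is easy to confirm by simulation; a Python sketch with made-up parameters:

import random
from statistics import mean

# Made-up parameters: n = 100 respondents, p1 = 0.3, p2 = 0.5.
n, p1, p2, reps = 100, 0.3, 0.5, 50_000
xs, ys = [], []
for _ in range(reps):
    draws = random.choices(['yes', 'no', 'none'], weights=[p1, p2, 1 - p1 - p2], k=n)
    xs.append(draws.count('yes'))   # X = number of YES answers
    ys.append(draws.count('no'))    # Y = number of NO answers

mx, my = mean(xs), mean(ys)
cov = mean((x - mx) * (y - my) for x, y in zip(xs, ys))
print(cov, -n * p1 * p2)            # both ≈ −15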
From the last theorem, we have
Var(p̂_1 − p̂_2) = Var(p̂_1) + Var(p̂_2) − 2 Cov(p̂_1, p̂_2) = p_1 q_1/n + p_2 q_2/n + 2 p_1 p_2/n.
One estimator of Var(p̂_1 − p̂_2) is
V̂ar(p̂_1 − p̂_2) = p̂_1 q̂_1/n + p̂_2 q̂_2/n + 2 p̂_1 p̂_2/n.
Is it unbiased? No! An unbiased estimator of the variance of p̂_1 is V̂ar(p̂_1) = (p̂_1 q̂_1/(n−1))(1 − f). Also, E(p̂_1 p̂_2) = E(XY)/n² = p_1 p_2 (1 − 1/n), which implies that an unbiased estimator of p_1 p_2 is p̂_1 p̂_2 (1 − 1/n)⁻¹. So
V̂ar(p̂_1) + V̂ar(p̂_2) + 2 n⁻¹ p̂_1 p̂_2 (1 − 1/n)⁻¹
is an unbiased estimator of Var(p̂_1 − p̂_2). But it is easier to use
V̂ar(p̂_1 − p̂_2) = p̂_1 q̂_1/n + p̂_2 q̂_2/n + 2 p̂_1 p̂_2/n.
Therefore, an approximate (1 − α) confidence interval for p_1 − p_2 is
(p̂_1 − p̂_2) ∓ z_{α/2} √V̂ar(p̂_1 − p̂_2) = (p̂_1 − p̂_2) ∓ z_{α/2} √( p̂_1 q̂_1/n + p̂_2 q̂_2/n + 2 p̂_1 p̂_2/n ).
e.g. (From the textbook.) Should smoking be banned from the workplace? A Time/Yankelovich poll of 800 adult Americans carried out on April 6-7, 1994 gave the following results:

             Banned   Special areas   No restrictions
Nonsmokers   44%      52%             3%
Smokers      8%       80%             11%

Based on a sample of 600 nonsmokers and 200 smokers, estimate and construct a 95% C.I. for
(1) the true difference between the proportions choosing "Banned" among nonsmokers and among smokers;
(2) the true difference among nonsmokers between the proportions choosing "Banned" and "Special areas".
Solution
A. The proportions choosing "Banned" among nonsmokers and among smokers are independent of each other; a high value of one does not force a low value of the other. Thus, an appropriate estimate of this difference, with its bound on the error of estimation, is
0.44 − 0.08 ± 2 √( 0.44 × 0.56/600 + 0.08 × 0.92/200 ) = 0.36 ± 0.06.
B. The proportion of nonsmokers choosing "Special areas" is dependent on the proportion choosing "Banned"; if the latter is large, the former must be small. These are multinomial proportions. Thus, an appropriate estimate of this difference is
0.52 − 0.44 ± 2 √( 0.44 × 0.56/600 + 0.52 × 0.48/600 + 2 × 0.44 × 0.52/600 ) = 0.08 ± 0.08.
Example. The major league baseball season in the US came to an abrupt end in the middle of 1994. In a poll of 600 adult Americans, 29% blamed the players for the strike, 34% blamed the owners, and the rest held various other opinions. Does the evidence suggest that the true proportions who blame the players and the owners, respectively, are really different?
Let p_1 be the proportion of Americans who blamed the players and p_2 the proportion who blamed the owners. Then
V̂ar(p̂_1 − p̂_2) = p̂_1 q̂_1/n + p̂_2 q̂_2/n + 2 p̂_1 p̂_2/n
= 0.29 × 0.71/600 + 0.34 × 0.66/600 + 2 × 0.29 × 0.34/600
= 1.0458 × 10⁻³.
So an approximate 95% C.I. for p_1 − p_2 is
0.29 − 0.34 ± z_{0.025} √V̂ar(p̂_1 − p̂_2) = −0.05 ± 1.96 × 0.03234 = (−0.11339, 0.01339).
Since this interval contains 0, the evidence does not suggest a real difference between the two proportions.
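As a closing sketch, the dependent-proportions interval fits in one Python helper (the name is ours), checked against the baseball poll:

from math import sqrt

def dependent_prop_ci(p1_hat, p2_hat, n, z=1.96):
    # Approximate CI for p1 − p2 with dependent multinomial proportions,
    # using the simpler (slightly biased) variance estimator above.
    vhat = (p1_hat * (1 - p1_hat) + p2_hat * (1 - p2_hat)
            + 2 * p1_hat * p2_hat) / n
    b = z * sqrt(vhat)
    return p1_hat - p2_hat - b, p1_hat - p2_hat + b

print(dependent_prop_ci(0.29, 0.34, 600))   # baseball poll: (−0.113, 0.013)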