Ch 2: probability sampling, SRS


Overview of probability sampling
Establish basic notation and concepts



Population distribution of Y : object of inference
Sampling distribution of an estimator under a
design: assessing the quality of the estimate used
to make inference
Apply these to SRS




Selecting a SRS sample
Estimating population parameters (means, totals,
proportions)
Estimating standard errors and confidence intervals
Determining the sample size
1
Assume ideal setting

Sampled population = target population



Measurement process is perfect



Sampling frame is complete and does not
contain any OUs beyond the target pop
No unit nonresponse
All measurements are accurate
No missing data (no item nonresponse)
That is, nonsampling error is absent
2
Survey error model
Total
Survey
Error
Assessed via
bias and
variance
=
Sampling
Error
Due to the
sampling
process (i.e.,
we observe
only part of
population)
+
Nonsampling
Error
Measurement error
Nonresponse error
Frame error
3
Probability sample


DEFN: A sample in which each unit in the
population has a known, nonzero probability
of being included in the sample
Known probability → we can quantify the
probability of a SU being included in the
sample
  Assign during design, use in estimation
Nonzero probability → every SU has a
positive chance of being included in the
sample
  Proper survey estimates represent entire target
  population (under our ideal setting)
4
Probability sampling relies on
random selection methods

Random sampling is NOT a haphazard
method of selection



Involves very specific rules that include an
element of chance as to which unit is selected
Only the outcome of the probability sampling
process (i.e., the resulting sample) is random
More complicated than non-random samples,
but provides important advantages


Avoid bias that can be induced by selector
Required to calculate valid statistical estimates
(e.g., mean) and measures of the quality of the
estimates (e.g., standard error of mean)
5
Representative sample


Goal is to have a “representative sample”
Probability sampling is used to achieve this by giving
each OU in target population an explicit chance to be
included in the sample



Sample reflects variability in the population
Applies to the sample, but does not apply to the OU/SU
(don’t expect each observation to be a “typical” pop unit)
Can create legitimate sample designs that
deliberately skew the sample to include adequate
numbers of important parts of the variation


Common example: oversampling minorities, women
MUST use estimation procedures that take into account the
sample design to make inferences about the target
population (e.g., sample weights)
6
Basic sampling designs

Simple selection methods
  Simple random sampling (Ch 2 & 3)
    Select the sample using, e.g., a random number table
  Systematic sampling (2.6, 5.6)
    Random start, take every k-th SU
  Probability proportional to size (6.2.3)
    “Larger” SUs have a higher chance of being included in sample
Selection methods with explicit structure
  Stratified sampling (Ch 4)
    Divide population into groups (strata)
    Take sample in every stratum
  Cluster sampling (Ch 5 & 6)
    OUs aggregated into larger units called clusters
    SU is a cluster
7
Examples

Select a sample of n faculty from the 1500
UNL faculty on campus


Goal: estimate total (or average) number of hours
faculty spend per week teaching courses
Simple random sampling (SRS)



Number faculty from 1 to 1500
Select a set of n random numbers (integers)
between 1 and 1500
Faculty with ids that match the random numbers
are included in the sample
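
The selection steps above can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed procedure; the sample size n = 30, the seed, and the function name `select_srs` are assumptions made only for the example.

```python
import random

def select_srs(N, n, seed=None):
    """Draw n distinct labels from 1..N without replacement.

    random.sample gives every subset of size n the same chance of
    being selected, which is the defining property of an SRS.
    """
    rng = random.Random(seed)
    return sorted(rng.sample(range(1, N + 1), n))

# Hypothetical run: 30 faculty ids out of the 1500 on the list
sample_ids = select_srs(N=1500, n=30, seed=1)
print(sample_ids)
```

Faculty whose ids appear in `sample_ids` would be the ones included in the sample.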
8
Examples - 2

Systematic sampling (SYS)
  Choose a random number between 1 and 1500/n
  Select faculty member with that id, and then take
  every k-th faculty member in the list, where the
  sampling interval is k = 1500/n
SRS / SYS
  Each faculty member has an equal chance of
  being included in sample
  Under SRS, each sample of n faculty is equally likely;
  under SYS, only the k samples defined by the random start are possible
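
A systematic selection can be sketched as follows. The sketch assumes that n divides N exactly so that the interval k is an integer; n = 30, the seed, and the name `select_sys` are illustrative assumptions.

```python
import random

def select_sys(N, n, seed=None):
    """Systematic sample: random start r in 1..k, then r, r+k, r+2k, ...

    Assumes N is a multiple of n so that k = N / n is an integer.
    """
    if N % n != 0:
        raise ValueError("this sketch assumes n divides N")
    k = N // n                      # sampling interval
    rng = random.Random(seed)
    start = rng.randint(1, k)       # random start, inclusive on both ends
    return list(range(start, N + 1, k))

# Hypothetical run: k = 1500 / 30 = 50
sys_ids = select_sys(N=1500, n=30, seed=1)
print(sys_ids[:5])
```

Because the random start is the only random element, this design can produce only k distinct samples.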
9
Examples - 3

Probability proportional to size (PPS)



With pps design, we assign a selection probability to each
faculty member that is proportional to the number of
courses taught by a faculty member that semester
“Size” measure = # of courses taught by faculty member
Faculty who teach more courses are more likely to be
included in the sample, but those that teach less still have a
positive chance of being included


Motivation: faculty that spend more hours on courses are more
critical to getting good estimate of total hours spent
Data from faculty with higher inclusion probabilities will be
“down weighted” relative to those with lower probabilities
during the estimation process

Typically accomplished using weights for each observation in
the dataset
10
Examples - 4

Stratified random sampling (STS)

Organize list of faculty by college
  Stratum = college
Allocate n (divide sample size) among colleges so
that we select $n_h$ faculty in the h-th college
  Sum of $n_h$ over strata equals n
Use SRS, e.g., to select sample in each of the
college strata
  Could use SYS or PPS rather than SRS
  Could have different selection methods in each stratum
11
Examples - 5

Cluster sampling (CS)

Aggregate faculty into departments
  OU = faculty member, SU = dept
Select a sample of departments, e.g., using SRS
Very common to use PPS for selecting clusters
  “Size” measure = number of OUs in the cluster SU
Many variants for cluster sampling



After selecting clusters, may want to select a sample of
OUs in the cluster rather than taking data on every OU
E.g., select 15 depts in the first stage of sampling, then
select 10 faculty in each dept in a second stage of
sampling
This is called 2-stage sampling
12
Examples - 6

Complex sample designs (Ch 7)


Combine basic selection methods (SRS, SYS, PPS) with
different methods of organizing the population for sampling
(strata, clusters)
Typically have more than one stage of sampling
(multi-stage design)

Often cannot create a frame of all OUs in the population
  Need to select larger units first and then construct a frame
Stratification and systematic sampling are often used to
encourage spread across the population
  This improves chances of obtaining a representative sample
Costs are often reduced by selecting clusters of OUs,
although cluster sampling may lead to less precision in
estimates
13
Notation for target population

The total number of OUs in the population (also called the
universe) is denoted by N





Note UPPER CASE
Ideally for SRS, sampling frame is list of N OUs in the pop
EX: there are N = 4 households in our class
Index set (labels) for all OUs in the population (or
universe) is called U

U = {1, 2, …, N}

A different index set could be our names, or our SSNs
Each person has a value for the characteristic of interest
or random variable Y , the number of people in the
household


The value of Y for household i is denoted by yi
Values in the population are y1 , y2 , …, yN
14
Notation for sample

Sample size is denoted by n



Note lower case
n is always less than or equal to N (n = N is a census)
Index set (labels) for OUs in the sample is denoted
by S



To select a sample, we are selecting n indices (labels) from
the universe U , consisting of N indices for the population
U is our sampling frame in this simple setting
Labels in S may not be sequential because we are selecting
a subset of U
15
Class example

Suppose n = 2 households are selected from a
population of N = 4 households in the class


Randomly select sample using SRS and get 2 and 3
  U = {1, 2, 3, 4}
  S = {2, 3}
The data collected on OUs in the sample are values
for Y = number of people in the household
  Data: y2, y3
16
Summary of probability
sampling framework

Assumptions (for now)


Target population = sampling universe =
sampling frame



Observation unit = sampling unit
N = finite number of OUs in the population
U = {1, 2, …, N} is the index set for the OUs in
the population
Sample


n = sample size (n is less than or equal to N )
S = index set for n elements selected from
population of N units (S is a subset of U)
17
Conceptual basis for
probability sampling

Conceptual framework for selecting samples


Enumerate all possible samples of size n from
the population of size N
Each sample has a known probability of being
selected



P(S) = probability of selecting sample S
Use this probability scheme to randomly choose the
sample
Using the probability scheme for the samples, can
determine the inclusion probability for each SU
  $\pi_i$ = probability that a sample is selected that
  includes unit i
18
Simple example


Population of 4 students in study group, take
a random sample of 2 students
Setting





U = {1, 2, 3, 4}
N = 4
n = 2
All possible samples of size n = 2 from N = 4
elements
Note: n < N and S ⊂ U
19
Simple example - 2

All possible samples
S1 = {1, 2}
S2 = {1, 3}

S3 = {1, 4}
S4 = {2, 3}
S5 = {2, 4}
S6 = {3, 4}
Design is determined by assigning a
selection probability to each possible
sample
P(S1) = 1/3
P(S2) = 1/6
P(S3) = 1/2
P(S4) = 0
P(S5) = 0
P(S6) = 0
20
Simple example - 3


Inclusion probability definition?
What is the probability that student 1 is
included in the sample?


Inclusion probability for student 2, 3, 4?
  $\pi_1$ =
  $\pi_2$ =
  $\pi_3$ =
  $\pi_4$ =
Is this a probability sample?
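
One way to check the answers to these questions is to add up P(S) over the samples that contain each unit. The sketch below uses the design from the simple example; the function name `inclusion_probabilities` is an illustrative assumption.

```python
from fractions import Fraction

# Design from the simple example: P(S) for each possible sample of size 2
design = {
    (1, 2): Fraction(1, 3),
    (1, 3): Fraction(1, 6),
    (1, 4): Fraction(1, 2),
    (2, 3): Fraction(0),
    (2, 4): Fraction(0),
    (3, 4): Fraction(0),
}

def inclusion_probabilities(design):
    """pi_i = sum of P(S) over all samples S that include unit i."""
    units = sorted({i for sample in design for i in sample})
    return {i: sum(p for sample, p in design.items() if i in sample)
            for i in units}

pi = inclusion_probabilities(design)
# A probability sample needs every inclusion probability to be known and nonzero
is_probability_sample = all(p > 0 for p in pi.values())
print(pi, is_probability_sample)
```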
21
Population distribution

Response variables represent values
associated with a characteristic of interest for
i-th OU



Y is the random variable for the characteristic of
interest (CAP Y)
yi = value of characteristic for OU i (small y)
The population distribution is the distribution
of Y for the target population


Y is a discrete random variable with a finite
number of possible values (<= N values)
Use discrete probability distribution to represent
the distribution of Y
22
Population distribution - 2

A discrete probability distribution is denoted
by a series of pairs corresponding to




Value of the random variable Y, denoted by y
Relative frequency of the value y for the random
variable Y in the population, denoted by P(Y = y)
Pair is { y , P(Y = y) }
Constructing a probability distribution


List all unique values y of random variable Y
Record the relative frequency of y in the
population, P(Y = y)
23
Class example - 2

Back to # of people in household for each
class member
What are the unique values in the pop?

What is the frequency of each value?

What is the relative frequency of each value?


Construct a histogram depicting the variation
in values
24
Summarizing the population
distribution





Use population parameters to summarize
population distribution
  Mean or expected value of y
  (parameter: $\bar{y}_U$)
  Proportion of population having a particular
  characteristic = mean of a binary (0, 1)
  variable (parameter: p)
  For finite populations, population total of y is
  often of interest (parameter: t)
  Variance of y (parameter: $S^2$)
25
Mean of Y for population

Expected value, or population mean, of Y
$\bar{y}_U = \frac{1}{N}\sum_{i=1}^{N} y_i = \frac{t}{N}$
  Mean is in y-units per OU-unit
  Measure of central tendency (middle of distn)
  Related to population total (t) and proportion (p)
Examples


Average number of miles driven per week by adults in
the US
Average number of phone lines per household
26
Class example - 3

What is the mean household size for
people in this classroom?
27
Total of Y in population

Population total of Y
$t = \sum_{i=1}^{N} y_i = N\bar{y}_U$


Total number of y-units in the population
Examples

Number of households in market area with DSL



yi =1 if household i has DSL, yi = 0 if not
N = number of households in market area
Number of deer in Iowa


yi =number of deer observed in area i
N = number of observation areas in Iowa
28
Class example - 4

What is the total number of people
living in households of people in the
classroom?
29
Proportion

Proportion (p) of population having a
particular characteristic


Mean of binary variable
$y_i = \begin{cases} 1, & \text{if OU } i \text{ has the characteristic} \\ 0, & \text{if OU } i \text{ doesn't have the characteristic} \end{cases}$
$p = \frac{1}{N}\sum_{i=1}^{N} y_i = \frac{t}{N}$
30
Class example - 5

What proportion of people in the
classroom have a cell phone?
31
Population variance of Y

Population variance of Y
$V[Y] = S^2 = \frac{1}{N-1}\sum_{i=1}^{N} (y_i - \bar{y}_U)^2$
Measure of spread or variability in population’s
response values
  Analogous to $\sigma^2$ in other stat classes
  Not the standard error of an estimate
  Note this is CAP $S^2$
32
Coefficient of variation for Y

Variation relative to mean (unitless)
$CV = \frac{S}{\bar{y}_U}$
33
Class example - 6


What is the population variance for
number of people in households of
people in the classroom?
What is the CV?
34
Summary of population
distribution of Y




Basic pop unit: OU (i)
Number of units or size of pop: N
Random variable: Y
Parameters: characterize the target population






Mean $\bar{y}_U$
Total t
Proportion (mean) p
Variance $S^2$
Coefficient of variation $CV = S / \bar{y}_U$
STATIC: it is the object of inference and never
changes with design or estimator
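
The population parameters summarized above can be computed directly when every value in the population is known. The sketch below uses a small hypothetical population of four household sizes; the values and the name `population_summary` are assumptions for illustration only.

```python
from math import sqrt

def population_summary(y):
    """Return N, total t, mean, variance S^2 (divisor N - 1), and CV."""
    N = len(y)
    t = sum(y)
    mean = t / N
    S2 = sum((yi - mean) ** 2 for yi in y) / (N - 1)
    return {"N": N, "total": t, "mean": mean, "S2": S2, "CV": sqrt(S2) / mean}

# Hypothetical household sizes for a population of N = 4
summary = population_summary([3, 5, 1, 3])
print(summary)
```

These quantities describe the population itself, so they do not change with the sample design or the estimator.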
35
What’s next


Population distribution of Y is object of inference
Use SRS to select a sample and estimate the
parameters of the population distribution


How to select a sample
Estimators for population parameters of Y under SRS




Sample mean estimates population mean
N x sample mean estimates population total
Sample variance estimates population variance
Assessing the quality of an estimator of a population
parameter under SRS


Sampling distribution
Bias, standard error, confidence intervals for the estimator
36
Simple random sample (SRS)

DEFN: A SRS is a sample in which every
possible subset of n SUs has an equal
chance of being selected as the sample
  ⇒ every sampling unit has equal chance of being
  included in the sample
  Example of an “equal probability” sample
  The converse does not hold: a sample in which each SU
  has the same inclusion probability is not necessarily a SRS
    Other non-SRS designs can generate equal probability
    samples
37
Simple random sampling (SRS)

Two types
  SRSWR (SRS with replacement)
    Return SU after each step in the selection process
  SRSWOR (SRS without replacement)
    Do not return SU after it has been selected
Selection probability

Probability that a unit is selected in a single draw



Constant throughout SRSWR process
Changes with each draw in the SRSWOR process
NOT an inclusion probability, which considers the
probability of drawing a sample that includes unit i
38
SRSWR
(SRS with replacement)

Selection procedure

Select one OU with probability 1/N from N OUs




This is the selection probability for each draw
Returning selected OU to universe
Repeat n times
Procedure is like drawing n independent
samples of size 1



Can draw a sampling unit twice – duplicate units
Unappealing for finite populations – no additional
info in having a duplicate unit
Useful in theoretical development for large
populations
39
Focus: SRSWOR
(SRS without replacement)

Selection procedure






Select one OU from universe of size N with
probability 1/N
DON’T return selected unit to universe
Select 2nd OU from remaining units in universe
with probability 1/(N - 1)
DON’T return selected unit to universe
Repeat until n sampling units have been selected
Selection probabilities change with each draw

1/N, then 1/(N -1), then 1/(N -2), …, 1/(N – n +1)
40
SRSWOR
(SRS without replacement)

Probability of selecting a sampling unit in a single
draw depends on number of SUs already selected
(conditional probability)


On the c-th step of the process, c-1 s.u.s have already
been selected for a sample of size n
Probability of selecting any of the remaining N – c + 1 s.u.s
in the next draw is
$\frac{1}{N - c + 1}$
Inclusion probability for SU i (unconditional
probability)
$\pi_i = \frac{n}{N}$
(see p. 44 in text)
41
SRSWOR
(SRS without replacement)

Number of possible SRSWOR samples of size
n from universe of size N
$\binom{N}{n} = \frac{N!}{n!\,(N-n)!}$, where $x! = x \times (x-1) \times (x-2) \times \dots \times 2 \times 1$
Probability of selecting a sample S
$P(S) = \frac{1}{\binom{N}{n}}$
(Probability is the same for all samples)
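
For a small universe the count and the common selection probability can be checked by listing every sample. The sketch below uses N = 4 and n = 2 as in the earlier simple example.

```python
from itertools import combinations
from math import comb

N, n = 4, 2

# Every possible SRSWOR sample, written as a tuple of unit labels
samples = list(combinations(range(1, N + 1), n))

# Under SRS each sample has the same probability 1 / C(N, n)
p_sample = 1 / comb(N, n)

print(len(samples), samples)
print(p_sample)
```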
42
Selecting a SRS using
SRSWOR

Create a sampling frame
  List of sampling units in the universe or population
  Assigns an index to each sampling unit
Determine a selection procedure that performs
SRSWOR
  Procedure must generate n unique sampling units such
  that each SU has an equal chance of being included in the
  sample
  Random number generator or table is common basis
  Need rules to identify when the selected unit is included in
  the sample or tossed
Select random numbers and determine sampled units
43
Using random numbers to
select a SRSWOR sample

Determine a rule to assign random numbers
to the sampling universe index set U


Rule must give each unit an equal chance of being
included in the sample
Select the set of random numbers, e.g., using
computer or printed random number table


Apply the rule to each random number to
determine the sampled OU
Check to see if this OU has already been selected


If already selected, ignore it
Keep going until you have n SUs in the sample
44
Census of Agriculture example
Select 300 counties from 3078 counties in the US




Sampling frame = ?
  N = 3078
  n = 300
Generate random numbers between 0 and 1 on the
computer
  Need n or more random numbers depending on rule
Multiply each random number by N = 3078 and
round up to the nearest integer
  Random number = .61663
  Multiply random # by N: 3078 x .61663 = 1897.98714
  Round up to 1898
  Take 1898th county in the frame
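
The multiply-and-round-up rule, together with the rule of ignoring units that were already selected, might be coded as below. This is a sketch under stated assumptions: the function names and the seed are hypothetical, and a draw of exactly 0.0 is discarded because it would map to no unit.

```python
import math
import random

def unit_from_uniform(u, N):
    """Map a random number u in [0, 1) to a frame position by rounding u * N up."""
    return math.ceil(u * N)

def select_srswor(N, n, seed=None):
    """Keep drawing until n distinct units are selected; duplicates are tossed."""
    rng = random.Random(seed)
    chosen, seen = [], set()
    while len(chosen) < n:
        idx = unit_from_uniform(rng.random(), N)
        if idx == 0 or idx in seen:   # u == 0.0 maps to no unit; repeats are ignored
            continue
        seen.add(idx)
        chosen.append(idx)
    return chosen

print(unit_from_uniform(0.61663, 3078))      # the worked value from the slide
counties = select_srswor(N=3078, n=300, seed=1)
```

More than n random numbers are usually consumed, since repeated units are skipped.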
45
Estimating population mean
under SRS


Target population mean
$\bar{y}_U = \frac{1}{N}\sum_{i=1}^{N} y_i$
Estimator of $\bar{y}_U$ for SRS sample of size
n is the sample mean
$\bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$

Note


“Estimator” refers to the formula
“Estimate” refers to the value obtained from using
the formula with data
46
Class example - 7

Estimate the average household size for
our classroom
47
Estimating population total

Target population total
$t = N\bar{y}_U = \sum_{i=1}^{N} y_i$

Estimator of t for SRS sample of size n
$\hat{t} = N\bar{y} = \frac{N}{n}\sum_{i=1}^{n} y_i$
48
Class example - 8

Estimate the total number of people
living in the households of people in
this classroom
49
Estimating population
proportion

Target population proportion
  Y takes on values 0 or 1, where 1 means
  the unit has the characteristic of interest
$p = \bar{y}_U = \frac{1}{N}\sum_{i=1}^{N} y_i$
Estimator of p for SRS sample of size n
$\hat{p} = \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i$
50
Class example - 9

Estimate the proportion of people with
cell phones in this class room
51
Estimating population variance

Target population variance
$V[Y] = S^2 = \frac{1}{N-1}\sum_{i=1}^{N} (y_i - \bar{y}_U)^2$
Estimator of $S^2$ for SRS sample of size n is
the sample variance
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y})^2$
(note lower case s)
52
Class example - 10

Estimate the variance of number of
people in households of people in this
class room
53
Estimating population
standard deviation and CV

Standard deviation of Y, S ?

Estimator of standard deviation of Y?

CV of population distribution?

Estimator of CV?
54
What would happen if we took
another sample?



S=
Data =
Estimates





Mean
Total
Proportion
Standard deviation
CV
55
Sampling distribution

Need to assess the quality of our estimates
  Is $\bar{y}$ a good estimator of $\bar{y}_U$?
  Is $\hat{p}$ a good estimator of p?
  Is $s^2$ a good estimator of $S^2$?
Use the sampling distribution to assess the
quality of the estimator
  Distribution of estimator over all possible samples
  EX: distribution of $\bar{y}$ over all possible SRS
  samples of size n from a population of size N
56
Sampling distribution

Simulation
57
Measures of quality

Denote
  Population parameter as $\theta$ [think pop mean $\bar{y}_U$]
  Estimator of $\theta$ as $\hat{\theta}$ [think sample mean $\bar{y}$]
Mean of the sampling distribution is the
expected value of the estimator $E\{\hat{\theta}\}$
  An estimator is unbiased if $E\{\hat{\theta}\} = \theta$
Variance of the sampling distribution $V\{\hat{\theta}\}$
  Precision: want variance of estimator to be small
Coefficient of variation
  Relative precision: want CV to be small
  $CV\{\hat{\theta}\} = \frac{\sqrt{V\{\hat{\theta}\}}}{E\{\hat{\theta}\}}$
58
Sampling distribution of
estimator


Basic pop unit: sample selected using a specific
design, S
Number of units or size of pop: number of possible
samples
  Need probability of selecting sample!
Random variable: estimator of parameter, $\hat{\theta}$
Parameters: characterize the quality of the estimator
  Mean (assesses bias of the estimator), $E\{\hat{\theta}\}$
  Variance, SE, CV (assesses precision of estimator)
DEPENDS on population parameter, estimator of
population parameter, sample design
59
Population distribution
  Basic unit: OU (i)
  Total number of units: N
  Random variable: characteristic
  of interest, Y
  Parameters: characterize
  the target population
    Mean $\bar{y}_U$, proportion p
    (central tendency)
    Total t
    Variance $S^2$, std dev S, CV
    (spread of distn)
  STATIC: once you identify Y,
  pop distribution is the object of
  inference and never changes
  with design or estimator
Sampling distribution
  Basic unit: sample selected
  using a specific design, S
  Total number of units:
  number of possible samples
  Random variable: estimator
  of parameter, $\hat{\theta}$
  Parameters: characterize
  the quality of the estimator
    Mean $E\{\hat{\theta}\}$ (used to assess
    bias of the estimator)
    Variance $V\{\hat{\theta}\}$, SE, CV
    (precision of estimator)
  DEPENDS on population
  parameter, estimator of
  population parameter,
  sample design
60
Conceptual framework for a
sampling distribution - 1

List out all possible samples of size n from
the population of size N


A sample is the BASIC UNIT for the population of
all possible samples
We determine the probability of selecting the
sample



Unequal probability sample (now)
Simple random sample
NOTE: sampling distribution depends on the
design selected
61
Simple example from earlier
lecture (not SRS!)

All possible samples
S1 = {1, 2}
S2 = {1, 3}

S3 = {1, 4}
S4 = {2, 3}
S5 = {2, 4}
S6 = {3, 4}
Design is determined by assigning a selection
probability to each possible sample
P(S1) = 1/3
P(S2) = 1/6
P(S3) = 1/2
P(S4) = 0
P(S5) = 0
P(S6) = 0
62
Conceptual framework for a
sampling distribution - 2


List
Using the n data values associated with each
sample, calculate the value of the estimator
for each sample



The estimator is the random variable of our
distribution
Example: sample mean y is calculated for each
of the possible samples
NOTE: the sampling distribution depends on the
estimator selected
63
Simple example from earlier
lecture - 2

Population values for Y

i     1   2   3   4
y_i   3   5   1   3

All possible samples of size n = 2
S1 = {1, 2}, S2 = {1, 3}, S3 = {1, 4},
S4 = {2, 3}, S5 = {2, 4}, S6 = {3, 4}

Values of $\bar{y}$ corresponding to each sample
$\bar{y}_1 = (3 + 5)/2 = 4.0$
$\bar{y}_2 = (3 + 1)/2 = 2.0$
$\bar{y}_3 = (3 + 3)/2 = 3.0$
$\bar{y}_4 = (5 + 1)/2 = 3.0$
$\bar{y}_5 = (5 + 3)/2 = 4.0$
$\bar{y}_6 = (1 + 3)/2 = 2.0$
64
Conceptual framework for a
sampling distribution - 3

List

Using

Sampling distribution is described by pairs of values
for estimator from the sample and relative frequency
of obtaining that value
 We are using the steps we used before for
creating a discrete distribution
65
Representing the
sampling distribution

Probability distribution: pairs of
$\{c, P(\bar{y} = c)\}$
$\bar{y}$ is a random variable, c is a value of $\bar{y}$
$P(\bar{y} = c) = \sum_{S:\, \bar{y}_S = c} P(S)$, where
$S: \bar{y}_S = c$ means "all samples S such that $\bar{y}_S = c$"
66
Simple example from previous
lecture - 3

Number of possible samples
$\binom{N}{n} = \binom{4}{2} = \frac{4 \times 3 \times 2 \times 1}{(2 \times 1)(2 \times 1)} = \frac{24}{4} = 6$

Probability of selecting sample
$P(S_1) = 1/3$, $\bar{y}_1 = 4.0$
$P(S_2) = 1/6$, $\bar{y}_2 = 2.0$
$P(S_3) = 1/2$, $\bar{y}_3 = 3.0$
$P(S_4) = 0$, $\bar{y}_4 = 3.0$
$P(S_5) = 0$, $\bar{y}_5 = 4.0$
$P(S_6) = 0$, $\bar{y}_6 = 2.0$
Probability distribution: unique values of $\bar{y}$ and
relative frequency

c      $P(\bar{y} = c)$
2.0    1/6
3.0    1/2
4.0    1/3
67
Conceptual framework for a
sampling distribution - 4

List
Using

Sampling distribution

Parameters summarize sampling distribution




Mean of sampling distribution
Variance, std dev (SE) of sampling distribution
CV of sampling distribution
68
Ex: mean and variance of sampling
distribution for y - 4

Mean of sampling distribution
  Same concept of expected value used with population
  distribution
$E\{\bar{y}\} = \sum_c c \cdot P(\bar{y} = c) = (2.0)\frac{1}{6} + (3.0)\frac{1}{2} + (4.0)\frac{1}{3} = \frac{2 + 9 + 8}{6} = \frac{19}{6} = 3.1\overline{6} \approx 3.17$
Variance of sampling distribution
  Use more general formula for variance
  Later, we’ll use reductions that are easier to calculate
$V\{\bar{y}\} = E\{(\bar{y} - E[\bar{y}])^2\} = \sum_c (c - E\{\bar{y}\})^2 \cdot P(\bar{y} = c)$
$= (2.0 - 3.1\overline{6})^2\,\frac{1}{6} + (3.0 - 3.1\overline{6})^2\,\frac{1}{2} + (4.0 - 3.1\overline{6})^2\,\frac{1}{3} = 0.47222$
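
The same mean and variance can be reproduced by building the sampling distribution from the design and the population values. This is a sketch using exact fractions; the function name `sampling_distribution` is an illustrative assumption.

```python
from collections import defaultdict
from fractions import Fraction

# Population values y_i and the unequal-probability design from the example
y = {1: 3, 2: 5, 3: 1, 4: 3}
design = {
    (1, 2): Fraction(1, 3), (1, 3): Fraction(1, 6), (1, 4): Fraction(1, 2),
    (2, 3): Fraction(0), (2, 4): Fraction(0), (3, 4): Fraction(0),
}

def sampling_distribution(design, y):
    """Return {c: P(ybar = c)} by summing P(S) over samples whose mean is c."""
    dist = defaultdict(Fraction)
    for sample, p in design.items():
        ybar = Fraction(sum(y[i] for i in sample), len(sample))
        dist[ybar] += p
    return dict(dist)

dist = sampling_distribution(design, y)
e_ybar = sum(c * p for c, p in dist.items())
v_ybar = sum((c - e_ybar) ** 2 * p for c, p in dist.items())
print(e_ybar, v_ybar, float(v_ybar))
```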
69
What if we took a SRS of size
n from N units?

List out all possible samples
  # possible samples: $\binom{N}{n} = \frac{N!}{(N-n)!\, n!}$
Determine the probability of a sample
  $P(S) = 1 / \binom{N}{n}$, constant for all samples
Calculate estimator for each sample
  Examples: $\bar{y}$ or $\hat{t}$ or $\hat{p}$
Create a discrete probability distribution
Calculate summary parameters
  For $\bar{y}$, $E\{\bar{y}\}$ and $V\{\bar{y}\}$
  For $\hat{t}$, $E\{\hat{t}\}$ and $V\{\hat{t}\}$
70
Back to example with SRS

Number of possible samples
$\binom{N}{n} = \binom{4}{2} = \frac{4 \times 3 \times 2 \times 1}{(2 \times 1)(2 \times 1)} = \frac{24}{4} = 6$

Probability of selecting sample
$P(S_1) = 1/6$, $\bar{y}_1 = 4.0$
$P(S_2) = 1/6$, $\bar{y}_2 = 2.0$
$P(S_3) = 1/6$, $\bar{y}_3 = 3.0$
$P(S_4) = 1/6$, $\bar{y}_4 = 3.0$
$P(S_5) = 1/6$, $\bar{y}_5 = 4.0$
$P(S_6) = 1/6$, $\bar{y}_6 = 2.0$
Probability distribution: unique values of $\bar{y}$ and
relative frequency

c      $P(\bar{y} = c)$
2.0    1/3
3.0    1/3
4.0    1/3
71
Example: mean of sampling
distribution for y under SRS

Mean of sampling distribution
$E\{\bar{y}\} = \sum_c c \cdot P(\bar{y} = c) = (2.0)\frac{1}{3} + (3.0)\frac{1}{3} + (4.0)\frac{1}{3} = \frac{9}{3} = 3.0$

Mean of population distribution
$\bar{y}_U = \frac{1}{N}\sum_{i=1}^{N} y_i = \frac{1}{4}(3 + 5 + 1 + 3) = \frac{12}{4} = 3.0$
72
Bias of an estimator

Estimation bias of $\hat{\theta}$
$Bias[\hat{\theta}] = E\{\hat{\theta}\} - \theta$
  Note that this is the mean of the estimator (from
  sampling distribution) minus the population
  parameter (from population distribution)
If $Bias[\hat{\theta}] = 0$ then $\hat{\theta}$ is said to be an
unbiased estimator of $\theta$
73
Variance of sample mean
under SRS


Don’t have to use the general formula
Variance of sample mean (derived stat using theory)
$V[\bar{y}] = \frac{S^2}{n}\left(1 - \frac{n}{N}\right)$, where
$S^2 = \frac{1}{N-1}\sum_{i=1}^{N} (y_i - \bar{y}_U)^2$
is the population variance
Similar to infinite population formula $\sigma^2 / n$
Has an extra factor called the finite population
correction factor (FPC)
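
As a check on the closed-form variance, the sketch below compares it with the variance obtained by listing every SRS sample of size n = 2 from the four-unit example population (y = 3, 5, 1, 3). Exact fractions are used so the two results can be compared without rounding.

```python
from fractions import Fraction
from itertools import combinations

y = [3, 5, 1, 3]          # population values from the simple example
N, n = len(y), 2

# Closed form: V[ybar] = (S^2 / n) * (1 - n / N)
ybar_U = Fraction(sum(y), N)
S2 = sum((Fraction(v) - ybar_U) ** 2 for v in y) / (N - 1)
v_formula = S2 / n * (1 - Fraction(n, N))

# Brute force: all C(N, n) samples are equally likely under SRS
means = [Fraction(sum(s), n) for s in combinations(y, n)]
e_ybar = sum(means) / len(means)
v_enum = sum((m - e_ybar) ** 2 for m in means) / len(means)

print(S2, v_formula, v_enum)
```

Both routes give the same value, which is the point of the theoretical formula: the full list of samples is not needed.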
74
Example

Variance of sampling distribution for $\bar{y}$
$S^2 = \frac{1}{N-1}\sum_{i=1}^{N} (y_i - \bar{y}_U)^2 = \frac{1}{4-1}\left[(1-3)^2 + 2(3-3)^2 + (5-3)^2\right] = \frac{8}{3}$
$V\{\bar{y}\} = \left(1 - \frac{n}{N}\right)\frac{S^2}{n} = \left(1 - \frac{2}{4}\right)\frac{8/3}{2} = 0.6667$

Other measures of dispersion for
sampling distribution
$SE\{\bar{y}\} = \sqrt{V\{\bar{y}\}} = \sqrt{0.6667} = 0.8165$
$CV\{\bar{y}\} = \frac{\sqrt{V\{\bar{y}\}}}{E\{\bar{y}\}} = \frac{0.8165}{3} = 0.2722$
75
$V[\bar{y}] = \frac{S^2}{n}\left(1 - \frac{n}{N}\right)$

Finite population correction factor (FPC)
$FPC = 1 - \frac{n}{N}$
  Sampling fraction is the proportion of the
  population sampled, or n/N
Larger sample ⇒
  Larger fraction of population
  Smaller FPC
  Smaller variance of sample mean
76
Impact of FPC on estimated
variance of parameter estimate

Often FPC is very close to 1
  Sample of 3000 households from total of 1,200,000
  households
  Sampling fraction $= \frac{n}{N} = \frac{3000}{1{,}200{,}000} = 0.0025$
  $FPC = 1 - \frac{n}{N} = 1 - \frac{3000}{1{,}200{,}000} = 1 - 0.0025 = 0.9975$
In cases where sampling fraction is very small and
FPC is very close to 1, FPC has no practical effect on
the SE or estimated variance of the param estimate
Sampling fraction n/N is not a good measure of
whether your estimate will be precise
The sample size n is the most important part of the
variance or SE formulas given variance $s^2$
77
Estimating population variance
under SRS

Do not know variance of population distribution, $S^2$
Unbiased estimator for $S^2$
$s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (y_i - \bar{y})^2$
Estimator for $V[\bar{y}]$
$\hat{V}[\bar{y}] = \frac{s^2}{n}\left(1 - \frac{n}{N}\right)$
Note that $SE(\bar{y}) = \sqrt{\hat{V}[\bar{y}]}$ is the standard error of the
sample mean
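
These estimators translate directly into code. The sketch below is a minimal illustration; the sample values (5 and 1, drawn from a population of N = 4) and the function name `srs_mean_estimate` are hypothetical.

```python
from math import sqrt
from statistics import mean, variance

def srs_mean_estimate(sample, N):
    """Return (ybar, estimated variance of ybar, SE of ybar) for an SRS."""
    n = len(sample)
    ybar = mean(sample)
    s2 = variance(sample)               # sample variance, divisor n - 1
    v_hat = s2 / n * (1 - n / N)        # includes the FPC
    return ybar, v_hat, sqrt(v_hat)

# Hypothetical SRS of n = 2 household sizes from a population of N = 4
ybar, v_hat, se = srs_mean_estimate([5.0, 1.0], N=4)
print(ybar, v_hat, se)
```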
78
Ag example



Interested in average number of acres per
county devoted to farms
Sample 300 counties from list of 3078
Collect data and get following summary
statistics
$\bar{y}$ = 297,897 farm acres per county in 1992
$s^2$ = 344,551.9

What are estimated mean and standard
error?
79
Rounding rules


Always keep all of the digits while you are
doing calculations
Round only when you get ready to report the
result at the end of the calculation …

Round the estimated SE to 2 significant digits
  107,789 is rounded to 110,000
  0.0325329 is rounded to 0.033
Round estimate to precision of the SE
  If SE is 110,000, round estimate to nearest 10,000
  (xx0,000)
  If SE is 0.033, round estimate to nearest 1/1000 (x.xxx)
Estimated variances are usually reported to 5
significant digits
80
Sampling distribution for $\bar{y}$
using SRS of size n from N

$\bar{y}$ is an unbiased estimator of $\bar{y}_U$
  Mean of sampling distribution is always equal to
  population mean under SRS
  $E\{\bar{y}\} = \bar{y}_U$
Variance of $\bar{y}$ is
$V[\bar{y}] = \frac{S^2}{n}\left(1 - \frac{n}{N}\right)$
Estimate the variance of $\bar{y}$ using sample
variance $s^2$
$\hat{V}[\bar{y}] = \frac{s^2}{n}\left(1 - \frac{n}{N}\right)$
81
Sampling distribution of $\hat{t}$
under SRS

Mean of $\hat{t}$ for population total t under SRS
$E\{\hat{t}\} = E\{N\bar{y}\} = N \cdot E\{\bar{y}\} = N \cdot \bar{y}_U = t$

Expectation of a linear function of a random
variable
  If a, b are constants & Y, $\hat{\theta}$ are random variables, then
  $E\{aY + b\} = aE\{Y\} + b$
  $E\{a\hat{\theta} + b\} = aE\{\hat{\theta}\} + b$

Is $\hat{t}$ an unbiased estimator of t?
82
Sampling distribution of $\hat{t}$
under SRS - 2

Variance of estimator of total under SRS
$V[\hat{t}] = V[N\bar{y}] = N^2 V[\bar{y}] = N^2\left(1 - \frac{n}{N}\right)\frac{S^2}{n}$

Variance of a linear function of a random
variable
  If a, b are constants & Y, $\hat{\theta}$ are random variables, then
  $V\{aY + b\} = a^2 V\{Y\}$
  $V\{a\hat{\theta} + b\} = a^2 V\{\hat{\theta}\}$
83
Sampling distribution of $\hat{t}$
under SRS - 3

Estimator for variance of $\hat{t}$ under SRS
$\hat{V}[\hat{t}] = N^2\left(1 - \frac{n}{N}\right)\frac{s^2}{n}$
84
Ag example - 2



Estimated total acres devoted to farms
in the US in 1992?
Estimated Variance of estimated total?
Other measures of dispersion for
sampling distribution?

Estimated SE
85
Sampling distribution of $\hat{p}$
under SRS

Mean of estimator $\hat{p}$ for population proportion
p under SRS
$E\{\hat{p}\} =$

Is $\hat{p}$ unbiased for p?
86
Sampling distribution of $\hat{p}$
under SRS - 2

Variance of sample proportion (derived stat using
theory)
$V[\hat{p}] = \left(\frac{N-n}{N-1}\right)\frac{p(1-p)}{n}$
  Very similar to infinite population formula $\frac{p(1-p)}{n}$
  Extra factor arises from finite pop and is NOT the
  same as the FPC
Estimator does have the FPC in the formula
$\hat{V}[\hat{p}] = \left(1 - \frac{n}{N}\right)\frac{\hat{p}(1-\hat{p})}{n-1}$

87
Ag example - 3




Suppose we are interested in the
proportion of counties with fewer than
200,000 acres devoted to farms in 1992
Data from our sample of 300 indicate
that 153 counties have less than
200,000 acres devoted to farms
Estimated population proportion?
Estimated SE of estimated proportion?
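
The two questions above can be worked through with the proportion estimator and its variance estimator under SRS. The sketch below plugs in the counts stated on this slide; the function name `srs_proportion_estimate` is an illustrative assumption.

```python
from math import sqrt

def srs_proportion_estimate(count, n, N):
    """Return (p_hat, SE of p_hat) under SRS, using the FPC and divisor n - 1."""
    p_hat = count / n
    v_hat = (1 - n / N) * p_hat * (1 - p_hat) / (n - 1)
    return p_hat, sqrt(v_hat)

# 153 of the 300 sampled counties, out of N = 3078 counties in the frame
p_hat, se_p = srs_proportion_estimate(count=153, n=300, N=3078)
print(round(p_hat, 3), round(se_p, 3))
```

Following the rounding rules given earlier, the SE would be reported to 2 significant digits and the estimate rounded to the same precision.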
88
Quality of estimates (Fig 2.2, p. 29)

Estimator under a given design is unbiased
  On average over a large number of samples, the mean of
  the estimates “hits” the target population parameter
  (centered on the bull’s eye)
Estimator under a given design is precise
  Over a large number of samples, estimates will tend to be
  close to one another, indicating that the variance of the
  sampling distribution for the estimator is small
  Clump pattern, but may not be centered on bull’s eye
  (precise but biased)
Estimator under a given design is accurate
  Estimator comes close to hitting target and is precise
  Assess this with the mean squared error (MSE)
89
Mean Squared Error of an Estimator $\hat{\theta}$

Mean squared error (MSE) of $\hat{\theta}$

$$\mathrm{MSE}[\hat{\theta}] = E\{(\hat{\theta} - \theta)^2\} = V[\hat{\theta}] + (\mathrm{Bias}[\hat{\theta}])^2$$

Combines measures of bias and precision to provide an index of the accuracy of an estimator under a given design
Sometimes we are willing to accept a little bias to get a more precise estimator, provided the MSE is improved
If $\mathrm{Bias}[\hat{\theta}] = 0$ then $\mathrm{MSE}[\hat{\theta}] = V[\hat{\theta}]$
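
The decomposition of MSE into variance plus squared bias can be checked by listing every possible SRS from a small population. This is a Python sketch under stated assumptions: the eight population values are hypothetical, all samples are taken to be equally likely, and the shrunken estimator `shrunk_t_hat` is invented here only to show a biased case.

```python
from itertools import combinations

def sampling_summary(population, n, estimator, target):
    """Bias, variance and MSE of an estimator over all equally likely SRS samples."""
    estimates = [estimator(s) for s in combinations(population, n)]
    k = len(estimates)
    mean_est = sum(estimates) / k
    bias = mean_est - target
    variance = sum((e - mean_est) ** 2 for e in estimates) / k
    mse = sum((e - target) ** 2 for e in estimates) / k
    return bias, variance, mse

population = [1, 2, 4, 4, 7, 7, 7, 8]   # hypothetical values, total t = 40
N, t = len(population), sum(population)

def t_hat(sample):
    """Usual SRS estimator of the total: N times the sample mean."""
    return N * sum(sample) / len(sample)

def shrunk_t_hat(sample):
    """Hypothetical biased estimator: shrink t_hat by 10%."""
    return 0.9 * t_hat(sample)

bias_u, var_u, mse_u = sampling_summary(population, 4, t_hat, t)
bias_s, var_s, mse_s = sampling_summary(population, 4, shrunk_t_hat, t)
print(bias_u, var_u, mse_u)
print(bias_s, var_s, mse_s)
```

For these toy values the shrinkage lowers the variance but raises the MSE, because the squared bias outweighs the variance reduction. Accepting bias pays off only when the opposite holds.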
90
MSE of SRS estimators


All of these estimators are unbiased
under SRS (Bias = 0)
So under SRS
$\mathrm{MSE}[\bar{y}] = V\{\bar{y}\}$
$\mathrm{MSE}[\hat{p}] = V\{\hat{p}\}$
$\mathrm{MSE}[\hat{t}] = V\{\hat{t}\}$
91
Confidence intervals


Estimate variance, SE, CV, MSE of
estimator under a design to provide
indication of quality of estimate
Another approach

Estimate a confidence interval to express
precision of estimate
92
Book example 2.7, p. 35-6





True parameter value: t = 40
CI of interest: $[\hat{t} - 4\,\widehat{se}(\hat{t}),\ \hat{t} + 4\,\widehat{se}(\hat{t})]$
List 70 possible samples of size n = 4
Each sample has a probability of selection P(S)
For each sample, record value of a variable u that indicates whether CI from sample S includes t = 40

$$u(S) = \begin{cases} 1, & \text{if } 40 \in [\hat{t} - 4\,\widehat{se}(\hat{t}),\ \hat{t} + 4\,\widehat{se}(\hat{t})] \\ 0, & \text{if } 40 \notin [\hat{t} - 4\,\widehat{se}(\hat{t}),\ \hat{t} + 4\,\widehat{se}(\hat{t})] \end{cases}$$

Confidence coefficient:

$$1 - \alpha = \sum_{k=1}^{70} P(S_k)\, u_k = 0.77$$
93
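
The enumeration idea in this example can be sketched in Python. This is an illustration under assumptions, not a reproduction of the book's numbers: the population values are hypothetical, every sample is given equal probability 1/70 (SRS), and the SE uses the SRS formula for a total. The book example may use different values or selection probabilities, so the result need not equal 0.77.

```python
from itertools import combinations
from math import sqrt

def exact_confidence_coefficient(population, n, multiplier):
    """Sum P(S) * u(S) over all SRS samples, where u(S) = 1 if the CI covers t."""
    N = len(population)
    t = sum(population)
    samples = list(combinations(range(N), n))   # index sets, so ties stay distinct
    prob = 1 / len(samples)                     # SRS: equal selection probability
    coverage = 0.0
    for idx in samples:
        y = [population[i] for i in idx]
        ybar = sum(y) / n
        s2 = sum((v - ybar) ** 2 for v in y) / (n - 1)
        t_hat = N * ybar
        se = N * sqrt((1 - n / N) * s2 / n)
        covers = t_hat - multiplier * se <= t <= t_hat + multiplier * se
        coverage += prob * (1 if covers else 0)
    return len(samples), coverage

population = [1, 2, 4, 4, 7, 7, 7, 8]   # hypothetical values with total 40
n_samples, conf_coef = exact_confidence_coefficient(population, n=4, multiplier=4)
print(n_samples, round(conf_coef, 3))
```

Changing `multiplier` shows how the exact confidence coefficient moves with the width of the stated interval.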
Ex – 2: Assume SRSWOR


If 60 of the 70 SRSWOR samples
resulted in CIs that included the true
total, what is the confidence coefficient?
What is alpha?
94
What is a 95% confidence
interval (CI) under SRS?

Heuristic definition





Take repeated samples of size n from population
of size N
Collect data on Y
Calculate an estimate of a population parameter
using data from n observations
Calculate 95% CI for parameter estimate using
data from n observations
Expect 95% of the CIs to contain the true
value of the parameter
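
The repeated-sampling description above can be simulated. The following Python sketch assumes a hypothetical, roughly normal population of 1,000 values and estimates the population mean; the function name, seeds, and sizes are illustrative choices.

```python
import random
from math import sqrt
from statistics import NormalDist, mean, variance

def ci_coverage(population, n, reps, alpha=0.05, seed=0):
    """Fraction of repeated SRSWOR samples whose CI for the mean covers the truth."""
    rng = random.Random(seed)
    N = len(population)
    true_mean = mean(population)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    hits = 0
    for _ in range(reps):
        y = rng.sample(population, n)               # SRS without replacement
        ybar = mean(y)
        se = sqrt((1 - n / N) * variance(y) / n)    # estimated SE with FPC
        if ybar - z * se <= true_mean <= ybar + z * se:
            hits += 1
    return hits / reps

# Hypothetical population, generated only for this illustration
pop_rng = random.Random(1)
population = [pop_rng.gauss(50, 10) for _ in range(1000)]
coverage = ci_coverage(population, n=50, reps=2000)
print(coverage)
```

The printed fraction should be near 0.95 for a population like this one, and can fall short for skewed populations or small n.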
95
Interpreting CIs in general

More generally (for any design), a $(1-\alpha)100\%$ CI has the interpretation

There is a $(1-\alpha)100\%$ chance of selecting a sample for which the CI will include the true population parameter
Note



The upper and lower limits of the CI are random
variables, calculated from the sample data
The true parameter value is either included or not
included in a single CI
Confidence coefficient of a CI has a relative
frequency interpretation across samples
96
Confidence interval definition

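Standard estimator for a $(1-\alpha)100\%$ confidence interval (CI):
$\hat{\theta} \pm z_{\alpha/2}\,\widehat{se}(\hat{\theta})$ or equivalently
$[\hat{\theta} - z_{\alpha/2}\,\widehat{se}(\hat{\theta}),\ \hat{\theta} + z_{\alpha/2}\,\widehat{se}(\hat{\theta})]$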
97
Standard normal distribution

Z ~ N(0, 1)



Z is the random variable
Mean E{Z} = 0 and variance V{Z} = 1
Two-sided $(1-\alpha)100\%$ confidence interval

Use critical value $z_{\alpha/2}$
$P\{|Z| > z_{\alpha/2}\} = \alpha$
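
A minimal Python sketch of the critical value and the standard CI estimator, using `statistics.NormalDist` from the standard library. The function names and the example estimate and SE are hypothetical.

```python
from statistics import NormalDist

def z_critical(alpha):
    """Two-sided critical value z_{alpha/2} of the standard normal distribution."""
    return NormalDist().inv_cdf(1 - alpha / 2)

def normal_ci(estimate, se, alpha=0.05):
    """Standard (1 - alpha)100% CI: estimate +/- z_{alpha/2} * se."""
    z = z_critical(alpha)
    return estimate - z * se, estimate + z * se

z = z_critical(0.05)                    # about 1.96
lower, upper = normal_ci(100.0, 10.0)   # hypothetical estimate and SE
print(round(z, 2), round(lower, 1), round(upper, 1))
```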
98
Infinite vs. finite populations

In other stat classes …



Assume SRS with replacement from infinite pop
Justify CI by applying the Central Limit Theorem
(CLT)
In sample surveys, we have a finite number
of possible samples


Can calculate exact confidence coefficient $1-\alpha$ for a stated interval (see previous example)
In practice, it is not possible to list all possible
samples, so we have a special CLT that relies on a
“superpopulation” framework
99
Superpopulation framework

Asymptotic framework for SRSWOR in finite
populations




Population is part of a larger superpopulation
There is a series of increasingly larger superpopulations
Use superpopulation concept to derive a
Central Limit Theorem for SRSWOR
Bottom line

We will use the standard CI estimator with a
different theoretical justification
100
When is CLT justified?

Confidence coefficient is approximate


Quality of approximation depends on n and the
distribution of the underlying random variable, Y
“n is large enough for CLT” is less clear for finite
populations


n = 30 rule in other stat classes does NOT apply
Rules of thumb



If distribution of Y is close to normal, n = 50
Need larger n if distribution of Y deviates from
normal, e.g., skewed
Y categorical: if p is proportion with characteristic of interest, $np \ge 5$ and $n(1-p) \ge 5$
101
Determining sample size –
a general approach





Specify tolerable error (level of precision,
level of confidence)
Identify appropriate equation relating tolerable error ($e$, $\alpha$) to sample size ($n$)
Estimate unknown parameters in equation
Solve for n
Evaluate (and return to first step)


Can you afford sample size?
What expectations can be altered?
102
Specify tolerable error

Two parameters



$e$ : margin of error or half-width of CI
$\alpha$ : $[1-\alpha]100\%$ is confidence level

Absolute expression (half-width of CI): estimate within $e$ of true pop parameter

$$P\{|\hat{\theta} - \theta| \le e\} = 1 - \alpha$$

Relative expression: $\hat{\theta}$ within $100e\%$ of $\theta$

$$P\left\{\left|\frac{\hat{\theta} - \theta}{\theta}\right| \le e\right\} = 1 - \alpha$$

103
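Equation linking $e$, $\alpha$, and $n$

Most common equation is half-width of CI

$$e = z_{\alpha/2}\, SE[\hat{\theta}]$$

Example: sample mean under SRSWOR

$$e = z_{\alpha/2} \sqrt{\left(1 - \frac{n}{N}\right) \frac{S^2}{n}}$$

Solving for $n$ gives

$$n = \frac{z_{\alpha/2}^2 S^2}{e^2 + \dfrac{z_{\alpha/2}^2 S^2}{N}} = \frac{n_0}{1 + \dfrac{n_0}{N}} \quad \text{for} \quad n_0 = \frac{z_{\alpha/2}^2 S^2}{e^2}$$

Note
For $p$, use $S^2 \approx p(1-p)$
For $\alpha = 0.05$, use $z_{\alpha/2} \approx 2$
$n_0$ is sample size under SRSWR (ignoring FPC)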
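
The sample size formula can be evaluated as in the following Python sketch. The function name `srs_sample_size` and the inputs (margin of error 0.03, p = 0.5, N = 3000) are hypothetical; results are rounded up because n must be a whole number.

```python
from math import ceil
from statistics import NormalDist

def srs_sample_size(e, S2, N=None, alpha=0.05):
    """Return (n0, n): sample size ignoring the FPC and adjusted for a finite N."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    n0 = z ** 2 * S2 / e ** 2                        # SRSWR size
    n = n0 if N is None else n0 / (1 + n0 / N)       # finite population adjustment
    return ceil(n0), ceil(n)

# Hypothetical planning values: proportion with p = 0.5, so S^2 is about p(1 - p)
n0, n = srs_sample_size(e=0.03, S2=0.5 * (1 - 0.5), N=3000)
print(n0, n)
```

The gap between the two numbers shrinks as N grows, which is why the FPC adjustment matters mainly for small populations.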
104
Estimate unknowns: population variance of y, $S^2$

Use estimator for variance, $s^2$
Pilot study
Previous study (careful about comparability)

Use CV from previous study (careful about comparability)

Guess variance under normality
estimate of S = range for 95% of values / 4
estimate of S = range for 99% of values / 6
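
The range-based guess can be written as a small Python helper. The function name and the example ranges are hypothetical; the divisors 4 and 6 are the ones given above and assume an approximately normal distribution of Y.

```python
def guess_sd_from_range(low, high, coverage=0.95):
    """Rough guess of S from the range expected to hold most values of Y."""
    divisors = {0.95: 4, 0.99: 6}   # divisors from the rules of thumb above
    return (high - low) / divisors[coverage]

# Hypothetical: about 95% of values are believed to lie between 20 and 60
S_guess = guess_sd_from_range(20, 60)
S2_guess = S_guess ** 2             # plug into the sample size equation as S^2
print(S_guess, S2_guess)
```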
105
Estimating unknowns:
population proportion, p


Use estimates from pilot or previous
study
If nothing is known about the true proportion



Use p = 0.5
Max possible variance for estimated
proportion under SRS, so this is
conservative
Commonly used
106
Practicalities for determining n

Sampling fraction rarely important
Most populations are large enough that sampling fraction n/N is small for practical values of n

Subpopulations should influence sample size
95% CI for a proportion ($\alpha = 0.05$, $p = 0.5$)
Implies $e \approx 1/\sqrt{n}$
n = 400 for $e \approx 0.05$ (whole sample)
n = 100 for $e \approx 0.10$ (subpopulation)
n = 50 for $e \approx 0.14$ (subpopulation)
n = 500 for $e \approx 0.045$ (little gain over 400)
107
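
The relation between n and the margin of error for a proportion can be tabulated with a few lines of Python. This sketch assumes p = 0.5, uses z = 2 as the rounded critical value for a 95% CI, and ignores the FPC; the function name is hypothetical.

```python
from math import sqrt

def margin_of_error(n, p=0.5, z=2.0):
    """Half-width e = z * sqrt(p(1 - p) / n), ignoring the FPC."""
    return z * sqrt(p * (1 - p) / n)

for n in (50, 100, 400, 500):
    print(n, round(margin_of_error(n), 3))
```

Going from 400 to 500 units narrows the half-width only slightly, while a subpopulation of 50 or 100 units has a much wider interval than the whole sample.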


SRS: pros and cons

Cons




SRS is rarely the “best” design
May not have list of all OUs → need different design
May have additional info on pop to create a more
efficient design (improve precision)
Pros / uses


Standard stat procedures can be used with little or
no bias
Mainly interested in regression rather than
estimating pop params (ignore sample design –
but could still get a better sample)
108