Statistics 522: Sampling and Survey Techniques
Topic 2
Topic Overview
This topic will cover
• Types of probability samples
• Framework for probability sampling
– Simple random sampling
– Confidence intervals
– Sample size determination
• Systematic sampling
• Randomization theory
Chapter 2: Simple Probability Samples
Properties of directly sampled populations
• Minimum
– All units are identified and indexed.
– All units can be found.
• Simplifying properties
– Frame is organized – for instance, by location.
• Desirable properties
– Additional information is available for each unit.
– Domain (subpopulation) membership is identified for each unit.
• Properties of frame vs. population
– Every element of the population occurs in the frame.
– Every element of the population occurs in the frame exactly once (no duplicates).
Probability sampling
• Each possible sample has a known probability of being the actual sample. (Usually the sample is selected one observation at a time.)
• A chance mechanism is used to select the sample.
– random number tables
– computer-generated pseudo-random numbers
– mechanical methods such as shuffled cards
• Complete enumeration is not possible; the number of possible samples is astronomical:
– (1000 choose 40) ≈ 5.6 × 10^71
– (5000 choose 200) ≈ 1.4 × 10^363
• Want sample design independent of possible trends in data
Assumptions (for now)
• Target population and sampled population are the same.
• The sampling frame is complete.
• None of the data is missing.
• There is no measurement bias.
• (Sample size is fixed.)
• We have no non-sampling errors. (All errors are sampling errors.)
Types of Probability Samples
• Simple random sample (SRS)
– Chapter 2
• Stratified random sample
– Chapter 4
• Cluster sample
– Chapter 5
Simple Random Sample (SRS)
• Every possible subset of the population of size n is equally likely to be the sample.
• This implies that each individual is equally likely to be in the sample.
• A probability sample with each individual being equally likely to be in the sample is
not necessarily an SRS.
– Why?
Stratified Random Sample
• First, partition the population into subgroups, called strata.
• Then, select an SRS from each stratum.
• Stratification is effective when the strata are relatively homogeneous with respect to
the characteristic of interest (smoothness assumption).
Cluster Sample
• Some or most sampling units contain more than one observation unit.
• These sampling units are called clusters.
• Take an SRS of clusters.
• Sample all observation units in the sampled clusters.
Example
• You want to estimate heights of children in an elementary school.
• Select an SRS of the students.
• Stratify by grade level.
• Use classrooms as clusters.
Framework for Probability Sampling
• U = {1, 2, . . . , N } are the units in the population.
• Let n be the sample size.
• There are (N choose n) – “N choose n” – different samples.
• (N choose n) = N!/((N − n)! n!)
Example
• N = 4, n = 2
• The population is U = {1, 2, 3, 4}.
• There are (4 choose 2) = 4!/(2! 2!) = (4 × 3 × 2 × 1)/((2 × 1) × (2 × 1)) = 6 different samples.
• S1 = {1, 2}; S2 = {1, 3}; S3 = {1, 4}; S4 = {2, 3}; S5 = {2, 4}; S6 = {3, 4}
Some probability sample designs
1. Each sample is equally likely, P (Si ) = 1/6 for all i.
2. P (S1 ) = P (S6 ) = 1/2; P (S2 ) = P (S3 ) = P (S4 ) = P (S5 ) = 0
3. P (S1 ) = P (S2 ) = P (S3 ) = 1/3; P (S4 ) = P (S5 ) = P (S6 ) = 0
Probabilities for individual units
• πi = P (unit i is in sample)
• For many designs, we want the πi to be equal.
• This implies πi = n/N .
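These inclusion probabilities can be verified by brute-force enumeration. A minimal Python sketch (not part of the course materials), using the N = 4, n = 2 designs from this topic:

```python
from itertools import combinations

N, n = 4, 2
# All (N choose n) = 6 possible samples, in the order S1..S6.
samples = list(combinations(range(1, N + 1), n))

# Design 1: every sample equally likely; designs 2 and 3 as defined above.
designs = {
    1: {s: 1/6 for s in samples},
    2: {samples[0]: 1/2, samples[5]: 1/2},
    3: {samples[0]: 1/3, samples[1]: 1/3, samples[2]: 1/3},
}

def inclusion_probs(design):
    """pi_i = sum of P(S) over all samples S that contain unit i."""
    return {i: sum(p for s, p in design.items() if i in s)
            for i in range(1, N + 1)}

pi1 = inclusion_probs(designs[1])  # all equal to n/N = 1/2
pi3 = inclusion_probs(designs[3])  # unit 1 is in every possible sample
```

Note that design 3 has unequal inclusion probabilities even though each of its possible samples is equally likely.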
Sample design 1
• Each sample is equally likely, P (Si ) = 1/6 for all i.
• S1 = {1, 2}; S2 = {1, 3}; S3 = {1, 4}; S4 = {2, 3}; S5 = {2, 4}; S6 = {3, 4}
• π1 = P (S1 ) + P (S2 ) + P (S3 ) = 3/6 = 1/2
• Similarly, πi = 1/2 = n/N for all i.
Sample design 2
• P (S1 ) = P (S6 ) = 1/2; P (S2 ) = P (S3 ) = P (S4 ) = P (S5 ) = 0
• S1 = {1, 2}; S2 = {1, 3}; S3 = {1, 4}; S4 = {2, 3}; S5 = {2, 4}; S6 = {3, 4}
• π1 = π2 = P (S1 ) = 1/2
• π3 = π4 = P (S6 ) = 1/2
Sample design 3
• P (S1 ) = P (S2 ) = P (S3 ) = 1/3; P (S4 ) = P (S5 ) = P (S6 ) = 0
• S1 = {1, 2}; S2 = {1, 3}; S3 = {1, 4}; S4 = {2, 3}; S5 = {2, 4}; S6 = {3, 4}
• π1 = P (S1 ) + P (S2 ) + P (S3 ) = 1
• π2 = P (S1 ) = 1/3
• π3 = P (S2 ) = 1/3
• π4 = P (S3 ) = 1/3
Sampling distribution
• A fundamental idea in statistics
• Using the data in the sample, we calculate a statistic.
• The distribution of this statistic is the sampling distribution.
– A random variable has a set of possible values with associated probabilities.
– The observed statistic is one realization of a random variable.
• The sampling distribution depends upon
– the population distribution
– the sample design.
Population total t
• Suppose we are interested in the population total.
• Let yi denote the value of the characteristic of interest for observation unit i.
• t = Σpopulation yi (the sum of yi over all units in the population)
• t̂ = N ȳsample
Parameters and statistics
• t is a constant.
– unknown
– a population parameter
• t̂ is a random variable.
– known after the sample is taken
– a statistic
– with a sampling distribution
N = 4 example
• y1 = 5; y2 = 10; y3 = 15; y4 = 10
• t = 5 + 10 + 15 + 10 = 40
• For S1 = {1, 2}, t̂ = 4(7.5) = 30
• For S2 = {1, 3}, t̂ = 4(10) = 40
• For S3 = {1, 4}, t̂ = 4(7.5) = 30
• For S4 = {2, 3}, t̂ = 4(12.5) = 50
• For S5 = {2, 4}, t̂ = 4(10) = 40
• For S6 = {3, 4}, t̂ = 4(12.5) = 50
Sampling distribution
• For sample design 1, the possible samples are equally likely (1/6).
• The sampling distribution of t̂ is
– P (t̂ = 30) = 1/3
– P (t̂ = 40) = 1/3
– P (t̂ = 50) = 1/3
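The sampling distribution above can be generated by enumerating all six samples. A short Python sketch:

```python
from itertools import combinations
from collections import defaultdict

y = {1: 5, 2: 10, 3: 15, 4: 10}   # the N = 4 example population
N, n = 4, 2

# Design 1: each of the 6 possible samples has probability 1/6.
dist = defaultdict(float)
for s in combinations(y, n):
    t_hat = N * sum(y[i] for i in s) / n   # t_hat = N * (sample mean)
    dist[t_hat] += 1/6

# dist is {30: 1/3, 40: 1/3, 50: 1/3}
```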
Mean and standard deviation
• We can compute the mean and the variance (or standard deviation) of this sampling
distribution
– using the probabilities for the sampling distribution
– using the probabilities for the possible samples.
Mean
• Using the sampling distribution
– E(t̂) = (1/3)(30) + (1/3)(40) + (1/3)(50) = 40.
• Using the sample probabilities
– E(t̂) = (1/6)(30) + (1/6)(40) + (1/6)(30) + (1/6)(50) + (1/6)(40) + (1/6)(50) = 40.
Bias
• t = 5 + 10 + 15 + 10 = 40
• E(t̂) = 40
• Bias(t̂) = E(t̂) − t = 40 − 40 = 0
• t̂ is unbiased.
Sample design 3
• P (S1 ) = P (S2 ) = P (S3 ) = 1/3
• y1 = 5; y2 = 10; y3 = 15; y4 = 10
• For S1 = {1, 2}, t̂ = 4 × 7.5 = 30.
• For S2 = {1, 3}, t̂ = 4 × 10 = 40.
• For S3 = {1, 4}, t̂ = 4 × 7.5 = 30.
• E(t̂) = (2/3)(30) + (1/3)(40) = 33.33
• Bias is 33.33 − 40 = −6.67.
Variance and standard deviation
• Var(t̂) = E(t̂ − E(t̂))²
• For sample design 1,
Var(t̂) = (1/3)(30 − 40)² + (1/3)(40 − 40)² + (1/3)(50 − 40)² = 200/3 = 66.67
• The standard deviation is √66.67 = 8.2.
Design 3
• For sample design 3,
Var(t̂) = (2/3)(30 − 33.33)² + (1/3)(40 − 33.33)² = 22.22
• The standard deviation is √22.22 = 4.7.
• Design 3 has a smaller SD (4.7) than Design 1 (8.2).
• But it is biased.
Mean squared error
• Among unbiased designs, the one with the smallest variance (SD) is best.
• To compare designs (including biased and unbiased designs) in general, we look at the
mean squared error (MSE).
• MSE = E(t̂ − t)² (t̂ is the random quantity.)
MSE, variance, and bias
• MSE = Var + (Bias)²
– See text, page 28.
• For design 1, MSE = Var + 0 = 66.67.
• For design 3, MSE = 22.22 + (−6.67)² ≈ 22.22 + 44.44 = 66.67.
Comparison of designs
• Design 3 has a smaller variance than Design 1.
• But Design 3 has the same MSE as Design 1 (66.67), and it is biased.
• Therefore, Design 1 is better.
• In general, designs with small bias can have smaller MSE and be better than unbiased designs.
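A quick numerical check of the bias/variance/MSE decomposition for designs 1 and 3 (Python sketch; the distributions are the ones worked out above):

```python
# Sampling distributions of t_hat (the population total is t = 40):
# design 1: values 30, 40, 50 each with probability 1/3;
# design 3: only S1, S2, S3 possible, giving 30 (prob 2/3) and 40 (prob 1/3).
t = 40
design1 = {30: 1/3, 40: 1/3, 50: 1/3}
design3 = {30: 2/3, 40: 1/3}

def summarize(dist):
    """Return (mean, variance, MSE) of a sampling distribution."""
    mean = sum(v * p for v, p in dist.items())
    var = sum((v - mean) ** 2 * p for v, p in dist.items())
    mse = sum((v - t) ** 2 * p for v, p in dist.items())
    return mean, var, mse

m1, v1, mse1 = summarize(design1)  # unbiased: mean 40
m3, v3, mse3 = summarize(design3)  # biased: mean 33.33
```

The check also confirms MSE = Var + (Bias)² for both designs.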
Population parameters: total and mean
• Population total
– t = Σpopulation Yi
• Population mean
– Ȳpopulation = (1/N) Σpopulation Yi = t/N
Population parameters: variability
• Population variance
S² = (1/(N − 1)) Σpop (Yi − Ȳpop)²
• Population standard deviation
S = √S²
• Coefficient of variation (CV)
– CV = S/Ȳpop
– Also called the relative standard deviation
Binary variables
• Proportions can be handled within this framework.
• Define
– y = 0 if the characteristic is absent
– y = 1 if the characteristic is present
• t is the total number of individuals with the characteristic.
• ȳ is the proportion of individuals with the characteristic.
Replacement
• Think about selecting the sample one observation unit at a time.
• For an SRS, we do not replace an observation unit once it has been selected.
• For an SRSWR (simple random sample with replacement), we replace each item after it is selected, so it can be selected again.
– Sometimes only the unique values in the sample are used.
– SRSWR sometimes has easier statistical properties, e.g., Var(ȳsam)WR = ((N − 1)/(nN)) S², so Var(ȳsam)WOR = ((N − n)/(N − 1)) Var(ȳsam)WR.
– Good as a point of contrast.
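The with/without-replacement variance relation can be checked by enumerating all possible samples of the small N = 4 population used earlier. A Python sketch:

```python
from itertools import combinations, product

y = [5, 10, 15, 10]
N, n = len(y), 2

def var_of_means(samples):
    """Variance of the sample mean over a set of equally likely samples."""
    means = [sum(s) / n for s in samples]
    mu = sum(means) / len(means)
    return sum((m - mu) ** 2 for m in means) / len(means)

# Without replacement: all (N choose n) subsets are equally likely.
var_wor = var_of_means(list(combinations(y, n)))
# With replacement: all N^n ordered draws are equally likely.
var_wr = var_of_means(list(product(y, repeat=n)))

# var_wor equals ((N - n)/(N - 1)) * var_wr
```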
Example 2.4
• Census of Agriculture is conducted by the U.S. government every five years.
• Population is all farms in the 50 states for which $1000 or more of agricultural products
were produced and sold.
• The data file agpop.dat on the text disk contains data summarized for each of the counties and county-equivalents in the U.S.
• We will view this data set as a population with N = 3078 counties.
First three records
1. COUNTY,STATE,ACRES92,ACRES87,ACRES82,FARMS92,FARMS87,FARMS82,LARGEF92,LARGEF87,LARGEF82,SMAL
2. ALEUTIAN ISLANDS AREA,AK,683533,726596,764514,26,27,28,14,16,20,6,4,1,W
3. ANCHORAGE AREA,AK,47146,59297,256709,217,245,223,9,10,11,41,52,38,W
agpop.dat
• Comma-delimited file
• 15 variables
• Identifiers (3)
– County
– State
– Region
• For 1982, 1987, and 1992
– total acres
– number of farms
– number of large farms (more than 1000 acres)
– number of small farms (less than 9 acres)
• SAS (SLL031.sas)
– Find file; import data
– data source: delimited file
– Browse to find location
– Select options and specify that the delimiter is the character “,”.
– Put data into ‘member’ a1. (This is the name of the SAS data set.)
proc print
options nocenter;
proc print data=a1;
run;
proc print data=a1 noobs;
var county state;
run;
proc print data=a1 noobs;
var acres92 farms92;
run;
Output
COUNTY                  STATE    ACRES92    FARMS92
ALEUTIAN ISLANDS AREA    AK       683533         26
ANCHORAGE AREA           AK        47146        217
FAIRBANKS AREA           AK       141338        168
JUNEAU AREA              AK          210          8
KENAI PENINSULA AREA     AK        50810         93
AUTAUGA COUNTY           AL       107259        322
Select an SRS (Shuffle and take first 300)
data a1; set a1;
  u=uniform(0);              /* assign a random number to each record */
proc sort data=a1; by u;     /* randomly order the data */
data a2;
  set a1;
  if _n_ le 300;             /* _n_ is the observation index; keep the first 300 */
proc print data=a2;
  var acres92 farms92;
run;
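For readers not using SAS, the same selection can be sketched in Python. `random.sample` draws an SRS directly, and the shuffle-and-take approach is shown for comparison (the frame of 3078 units is a stand-in for the county records):

```python
import random

random.seed(522)               # seeded only so the illustration is reproducible
N, n = 3078, 300
frame = list(range(1, N + 1))  # stand-in for the 3078 county records

# random.sample draws n distinct units with every subset equally likely: an SRS.
srs = random.sample(frame, n)

# Equivalent "shuffle and take the first n" approach, as in the SAS code above.
shuffled = frame[:]
random.shuffle(shuffled)
srs2 = shuffled[:n]
```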
Output
Obs    ACRES92    FARMS92
  1      81427        384
  2      23735        393
  3      52904        256
  4     787857        284
...        ...        ...
298     412673       1360
299     268043        529
300     335820        133
Examine the sample
proc univariate data=a2;
var acres92;
histogram acres92;
run;
N                       300
Mean              316552.08
Std Deviation    411912.356
Variance         1.69672E11
Standard error of the mean
• This is the standard deviation of the sampling distribution of the sample mean.
• Var(ȳsample) = (S²/n)(1 − n/N),
– where S² is the population variance (defined in equation 2.5 on page 29 with divisor N − 1).
• The standard error is the square root of the variance.
Finite population correction (f pc)
• The usual formula for the standard error of the mean does not have the term 1 − n/N.
• This is the finite population correction.
• Note that if n is small, relative to N , the correction is negligible.
Estimation of the standard error of the mean
• Replace the population variance S² with the sample variance s² in the formula Var(ȳsample) = (S²/n)(1 − n/N).
• And take the square root.
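As a sketch of this calculation in Python, with hypothetical numbers (sample variance 25, n = 100, N = 1000):

```python
import math

def se_mean(s2, n, N):
    """Estimated SE of the sample mean under an SRS:
    sqrt((1 - n/N) * s2 / n), where s2 is the sample variance."""
    fpc = 1 - n / N
    return math.sqrt(fpc * s2 / n)

# Hypothetical numbers: s2 = 25, n = 100 sampled from N = 1000.
se = se_mean(25.0, 100, 1000)   # sqrt(0.9 * 25 / 100) = sqrt(0.225)
```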
Example
• For our sample of n = 300, the output gave Variance = 1.69672E11.
• N = 3078
• fpc = 1 − n/N = 1 − 300/3078 = 0.9025
• Estimated variance is (1.69672 × 10^11/300)(0.9025) ≈ 4.95 × 10^8.
• The square root is 22,254.
Confidence Intervals
• 95% is the standard.
• The margin of error (MOE) is 1.96 times the standard error of the mean.
• 1.96(22254) = 43617
• The confidence interval is the sample mean plus or minus the MOE.
291766 ± 43617
or 290,000 ± 44,000
Asymptotics
• Reliability (consistency and efficiency) depends on the assumption that the sample size goes to infinity...
• But our population is finite.
• Theory of the superpopulation: n, N, and N − n all go to infinity in a predictable way.
• What is sufficiently large for the normality assumption?
Check on the results
• In this artificial example, we have data for the whole population so we can compare
our sample estimates with the population parameters.
• The population mean is 306676.971.
• The population variance is 1.80359 × 1011 .
Proportions
• Methods for estimation of proportions are similar.
• The estimated standard error for a sample proportion is
√( fpc × p̂(1 − p̂)/(n − 1) ).
• See pages 34-35 and page 38.
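A small Python helper for this formula (the numbers plugged in below are hypothetical, chosen to echo the agpop setting):

```python
import math

def se_proportion(p_hat, n, N):
    """Estimated SE of a sample proportion under an SRS:
    sqrt(fpc * p_hat * (1 - p_hat) / (n - 1))."""
    fpc = 1 - n / N
    return math.sqrt(fpc * p_hat * (1 - p_hat) / (n - 1))

# Hypothetical: 300 of 3078 counties sampled, observed proportion 0.5.
se = se_proportion(0.5, 300, 3078)
moe = 1.96 * se   # margin of error for an approximate 95% interval
```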
Totals
• Do the analysis for the mean.
• Then multiply each of the following by N:
– the sample mean
– the standard error of the mean
– margin of error
– the confidence limits
CI for total
• In the Census of Agriculture example, the confidence interval for the average number of acres was
290, 000 ± 44, 000
• For total acres, multiply by N = 3078
892 ± 135 million acres
• The actual number is 942 million acres.
Sample Size Determination
• Determine the margin of error that you need.
• Solve the equation for the margin of error for the sample size n.
• Substitute values for unknown quantities.
Quantities needed for calculation
• The confidence level (use 95%)
• The variance
– Use data from a pilot or similar study.
– Guess (use the idea that 95% of observations are within 2s of the mean for normal
populations)
• The population size N .
Margin of error formula
• MOE = z* S √(fpc/n)
– z* from the normal distribution; use 1.96
– S² is the population variance; we need a value
– fpc = 1 − n/N
• ⇒ n = n0/(1 + n0/N), where n0 = (1.96 S/MOE)².
• n0 is the corresponding value for an SRSWR.
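The two formulas can be wrapped in a small Python helper; the check value below uses the N = 1000, S = 100 setting of the SAS calculation later in this topic, where MOE = 27.0167 corresponds to n = 50:

```python
def sample_size(S, moe, N, z=1.96):
    """n0 = (z*S/MOE)^2 is the SRSWR size; the fpc shrinks it to
    n = n0 / (1 + n0/N)."""
    n0 = (z * S / moe) ** 2
    return n0 / (1 + n0 / N)

n = sample_size(100, 27.0167, 1000)   # should come back to (about) 50

def sample_size_proportion(moe, N=float("inf"), z=1.96):
    """Worst case p = 0.5 for a proportion: n0 = (z / (2*MOE))^2."""
    n0 = (z / (2 * moe)) ** 2
    return n0 / (1 + n0 / N)

# Ignoring the fpc (N -> infinity), MOE = 0.03 gives the classic "about 1067".
n_prop = sample_size_proportion(0.03)
```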
Binomial proportions
• Variance for a binomial is maximum at p = 0.5.
• Variance is p(1 − p) ≤ 1/4, so SD ≤ 1/2.
• This gives n0 = (1.96/(2 × MOE))².
• Then use n = n0/(1 + n0/N).
Relative precision
• For some problems, it is common to express the desired MOE relative to the mean.
n0 = (1.96 S/MOE)² = (1.96 (S/Ȳ)/(MOE/Ȳ))²
• S/Ȳ is the CV; MOE/Ȳ is the relative margin of error.
• n = n0/(1 + n0/N)
Details and examples
• See text pages 39-42.
• Note that increasing the sample size has a diminishing effect on the margin of error as the sample size gets larger.
• The effect of increasing the population size is much less pronounced.
• See graph on page 42.
A calculation in SAS (SLL042.sas)
data a1;
  popN=1000;
  z=1.96;
  s=100;
  do n=5 to 1000;
    fpc=(1-n/popN);
    moe=z*sqrt(fpc)*s/sqrt(n);
    output;
  end;
proc print data=a1;
symbol1 v=none i=join;
title1 'Plot of margin of error versus sample size';
title2 'N=1000, s=100, 95% confidence';
proc gplot data=a1;
  plot moe*n/frame;
run;
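The same calculation in Python, for readers without SAS (no plotting, just the MOE formula on a grid of n):

```python
import math

def moe(n, N=1000, S=100, z=1.96):
    """MOE = z * sqrt(1 - n/N) * S / sqrt(n), as in the SAS data step."""
    return z * math.sqrt(1 - n / N) * S / math.sqrt(n)

# Same grid as the printed output below: n = 50, 100, ..., 1000.
table = {n: round(moe(n), 4) for n in range(50, 1001, 50)}
```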
[Figure: plot of margin of error (moe) versus sample size (n) at 95% confidence, with curves for N = 1000, N = 10000, and N = 100000.]
Print some cases
proc print data=a1 noobs;
where n=50*int(n/50);
var n moe;
run;
   n        moe
  50    27.0167
 100    18.5942
 150    14.7543
 200    12.3961
 250    10.7354
 300     9.4677
 350     8.4465
 400     7.5910
 450     6.8522
 500     6.1981
 550     5.6064
 600     5.0607
 650     4.5481
 700     4.0576
 750     3.5785
 800     3.0990
 850     2.6037
 900     2.0660
 950     1.4219
1000     0.0000
Systematic sampling
• Basic idea
– random start then pick every kth observation unit
• Specifics
– Let k be the next integer after N/n.
– R is a random integer between 1 and k.
– Select units R, R + k, R + 2k, . . . , R + (n − 1)k.
Example
• N = 5000, n = 56
• N/n = 89.3, k = 90
• R is a random integer between 1 and 90.
• Suppose R = 11.
• The sample is the units numbered 11, 11 + 90, 11 + 180, . . . , 11 + 55(90) = 4961.
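The selection rule can be sketched in Python (here k is computed with ceil, which matches "the next integer after N/n" when N/n is not an integer):

```python
import math

def systematic_sample(N, n, R):
    """Units R, R + k, R + 2k, ..., R + (n-1)k, where k is the next
    integer after N/n (taken here as ceil(N/n))."""
    k = math.ceil(N / n)
    return [R + j * k for j in range(n)]

# The example: N = 5000, n = 56 gives k = 90; with R = 11 the sample
# runs 11, 101, ..., 4961.
s = systematic_sample(5000, 56, 11)
```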
Properties
• There are k possible samples that are equally likely (but mutually exclusive).
• The observation number of the last unit sampled is R + (n − 1)k.
• The maximum value of R is k, so the largest observation number that can be sampled is nk; if nk > N, replace k by k − 1.
• Often results will be similar to an SRS.
• If there is a cyclic pattern in the list, results can be very bad. (Safeguard: do two or
more systematic samples with different values of k.)
• If there is a natural ordering in the list related to the outcome, the results can be better
than an SRS.
• Systematic sampling is a special case of cluster sampling.
Randomization Theory
• This is a theoretical section (2.7).
• We can prove that ȳ is unbiased by showing that the average of all possible values of
ȳ is the population mean.
• Similarly, we can show that (fpc) s²/n is an unbiased estimator of the variance of the sample mean.
Proofs
• Use the indicator trick:
Zi = I(unit i is in sample)
• Zi is the random component.
• Use: ȳsample = Σsample yi/n = Σpopulation Zi yi/n
• πi = E(Zi) = P(Zi = 1) = (number of samples containing unit i)/(number of possible samples) = n/N
Other properties
• E(Zi²) = E(Zi) = n/N
• Var(Zi) = (n/N)(1 − n/N)
• E(ȳsamp) = Σpop (n/N)(yi/n) = ȳpop
• For i ≠ j,
Cov(Zi, Zj) = E(Zi Zj) − E(Zi)E(Zj)
= P(Zi = 1 and Zj = 1) − (n/N)²
= ((n − 1)/(N − 1))(n/N) − (n/N)²
= −(n/N)(1 − n/N)/(N − 1)
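These moments of the Zi can be verified by enumeration for the N = 4, n = 2 case. A Python sketch:

```python
from itertools import combinations

N, n = 4, 2
samples = [set(s) for s in combinations(range(1, N + 1), n)]
P = 1 / len(samples)   # each of the 6 samples equally likely under an SRS

E_Z1 = sum((1 in s) * P for s in samples)                # = n/N = 1/2
var_Z1 = E_Z1 - E_Z1 ** 2                                # Z1^2 = Z1, so Var = E(Z)(1 - E(Z))
E_Z1Z2 = sum((1 in s and 2 in s) * P for s in samples)   # P(both units in sample)
cov = E_Z1Z2 - E_Z1 ** 2

# Formula from the derivation: cov = -(n/N)(1 - n/N)/(N - 1) = -1/12
```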
Standard error of sample mean
Var(ȳsamp) = Var( (1/n) Σpop Zi yi )
= (1/n²) Cov( Σi=1..N Zi yi , Σj=1..N Zj yj )    [Var(X) = Cov(X, X)]
= (1/n²) [ Σi=1..N yi² Var(Zi) + 2 Σi=1..N Σj=i+1..N yi yj Cov(Zi, Zj) ]    [properties of covariance; expansion of (Σ xi)²; the yi are non-random]
= (1/n²)(n/N)(1 − n/N) [ Σpop yi² − (2/(N − 1)) Σi Σj>i yi yj ]    [plug in the formulas; notice the sign change]
= (1/(nN))(1 − n/N)(1/(N − 1)) [ (N − 1) Σpop yi² − 2 Σj>i yi yj ]
= (1/(nN))(1 − n/N)(1/(N − 1)) [ (N − 1) Σ yi² − (Σ yi)² + Σ yi² ]    [(Σ yi)² = Σ yi² + 2 Σj>i yi yj]
= (1/(nN))(1 − n/N)(1/(N − 1)) [ N Σ yi² − (Σ yi)² ]
= (1 − n/N) S²/n    [population variance S² = ( Σpop yi² − (1/N)(Σpop yi)² )/(N − 1)]
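The final identity Var(ȳsamp) = (1 − n/N) S²/n can be checked by direct enumeration on the small population used earlier:

```python
from itertools import combinations

y = [5, 10, 15, 10]
N, n = len(y), 2

# Direct enumeration over all (N choose n) equally likely SRS samples.
means = [sum(s) / n for s in combinations(y, n)]
mu = sum(means) / len(means)
var_direct = sum((m - mu) ** 2 for m in means) / len(means)

# The closed form from the derivation: (1 - n/N) S^2 / n.
ybar = sum(y) / N
S2 = sum((v - ybar) ** 2 for v in y) / (N - 1)
var_formula = (1 - n / N) * S2 / n
```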
Assumptions
• The approach is nonparametric; there are no assumptions on the distribution of the yi .
• We simply assume that the yi are a collection of unknown numbers.
Models
• Probability models are the foundation for statistical inference.
• They give a framework for evaluating estimators.
– Bias
– M SE
• Confidence intervals are probability statements.
A model for simple random sampling
• The randomization theory (also called design-based theory) provides one framework for
sampling methods.
• An alternative is model-based theory.
What is random?
• For the randomization theory, {yi , i = 1, . . . , N } is a collection of numbers.
• The random variables are the set {Zi : i = 1, . . . , N }, where Zi = 1 or 0 depending
upon whether or not unit i is included in the sample.
• For the model-based theory, {Yi : i = 1, . . . , N } is a collection of random variables.
• We can think in terms of an infinite superpopulation.
– Based on knowledge of the natural tendency of phenomena to have a particular distribution. (Example: lifetimes are exponential.)
– Yi : i = 1, . . . , N are independent random variables with expected value µ and
variance σ 2 .
– Text uses M as a subscript to denote the model-based expectation, EM and VM .
t and T
• t is the population total, the sum of yi for all items in the population (1 to N ).
• T is the population total, the sum of Yi for all items in the population (1 to N ), a
random sample from an infinite population.
• We want to estimate the number t.
Estimation of t
• t is the sum of the n observations in our sample plus the sum of the N − n observations not in our sample.
• We do not need to estimate the values in our sample.
• We use the data in our sample to estimate (or predict) the values not in our sample.
• The expected value of each of the N − n observations not in our sample is µ.
• The predicted value is our best guess at µ, the mean of the observations in our sample,
Ȳ .
• Our estimate of the total is therefore the sum of the observations in our sample plus
N − n times the sample mean.
• This is the same as N times the sample mean, the usual estimator.
• We call it T̂ for the model-based approach.
• We call it t̂ for the randomization approach.
Properties
• T̂ is model-unbiased: expected value of T̂ − T is zero.
• The MSE is the variance: fpc × N² σ²/n.
• Use the sample variance s² to estimate σ².
Confidence Intervals
• Calculation is the same.
• Interpretation is a bit different.
– For design-based, repeated sampling from the same population.
– For model-based, probabilities from the central limit theorem approximation.
Finite population correction
• Model for SRS provides some intuition.
• fpc = 1 − n/N = (N − n)/N
• This is the proportion of the population that we need to estimate or predict.
SRS Mean
• The estimator of the population mean is the sample mean ȳ.
• The estimator of the standard error (SE) of this estimator is √( fpc × s²/n ).
• Margin of error (MOE) is 2 times the SE.
• 95% CI is the estimate plus or minus the margin of error.
SRS Proportion
• The estimator of the population proportion is the sample proportion p̂.
• The estimator of the standard error (SE) of this estimator is √( fpc × s²/(n − 1) ), where s² = p̂(1 − p̂).
• Margin of error (MOE) is 2 times the SE.
• 95% CI is the estimate plus or minus the margin of error.
SRS Total
• The estimator of the population total is N times the sample mean N ȳ.
• The standard error (SE) of this estimator is N times the SE of ȳ.
• Margin of error (MOE) is 2 times the SE.
• 95% CI is the estimate plus or minus the margin of error.
Advantages of SRS
• Simple
• Estimation and inference are very similar to those used in elementary statistics.
• The fpc is a new idea.
Other designs may be better.
• Sample survey versus designed experiment (STAT 522 vs STAT 514)
• Frame not available (e.g., mosquitoes, liquid)
• Additional information available that could be used to reduce variance.
Simplicity is an advantage
• Some view statistics as a collection of methods for compiling numbers in contrast to
methods for model-based inference.
• More complicated calculations – such as those needed for more complex designs – are potentially confusing (e.g., in litigation).
Extra information may not be available
• More complex designs usually require that we have some information about the population.
• Without such information, an SRS is usually best.
Consider the use
• For some problems, we may want to do more sophisticated analyses such as multiple
regression, factor analysis, etc.
• Adjustments can be made to take into account the design, but this can get very messy
if we have a complex design.
• An SRS may be best under these circumstances.
Refinements and Alternatives
• Unequal probability sampling – probabilities proportional to size
• Rejection sampling
– Do I want this sample? (Are all elements unique? Part of target population?)
– If not, get a new sample.
• Bernoulli sampling
– Sample each unit independently with probability π of being selected.
– Sample size is Binomial random variable.
– If the inclusion probabilities differ across units, the design is known as Poisson sampling.
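A Bernoulli sampling sketch in Python (hypothetical N and π; the realized sample size varies from draw to draw and is Binomial(N, π)):

```python
import random

def bernoulli_sample(N, pi, rng):
    """Include each of the N units independently with probability pi.
    The realized sample size is a Binomial(N, pi) random variable."""
    return [i for i in range(N) if rng.random() < pi]

rng = random.Random(42)   # seeded for a reproducible illustration
sizes = [len(bernoulli_sample(100, 0.3, rng)) for _ in range(2000)]
mean_size = sum(sizes) / len(sizes)   # close to N * pi = 30
```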