Statistics 522: Sampling and Survey Techniques

Topic 2

Topic Overview

This topic will cover
• Types of probability samples
• Framework for probability sampling
  – Simple random sampling
  – Confidence intervals
  – Sample size determination
• Systematic sampling
• Randomization theory

Chapter 2: Simple Probability Samples

Properties of directly sampled populations
• Minimum
  – All units are identified and indexed.
  – All units can be found.
• Simplifying properties
  – Frame is organized – for instance, by location.
• Desirable properties
  – Additional information is available for each unit.
  – Domain (subpopulation) membership is identified for each unit.
• Properties of frame vs. population
  – Every element of the population occurs in the frame.
  – Every element of the population occurs in the frame exactly once.

Probability sampling
• Each possible sample has a known probability of being the actual sample. (Usually sampled one observation at a time.)
• A chance mechanism is used to select the sample.
  – random number tables
  – computer-generated pseudo-random numbers
  – mechanical methods such as shuffled cards
• Complete enumeration of the possible samples is not feasible:
  – (1000 choose 40) = 5.6 × 10^71
  – (5000 choose 200) = 1.4 × 10^363
• Want the sample design independent of possible trends in the data.

Assumptions (for now)
• The target population and the sampled population are the same.
• The sampling frame is complete.
• None of the data is missing.
• There is no measurement bias.
• (The sample size is fixed.)
• We have no non-sampling errors. (All errors are sampling errors.)

Types of Probability Samples
• Simple random sample (SRS) – Chapter 2
• Stratified random sample – Chapter 4
• Cluster sample – Chapter 5

Simple Random Sample (SRS)
• Every possible subset of size n from the population is equally likely to be the sample.
• This implies that each individual is equally likely to be in the sample.
• A probability sample with each individual being equally likely to be in the sample is not necessarily an SRS.
  – Why?
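The enumeration counts above can be checked exactly with integer arithmetic. A quick illustrative Python aside (the course software is SAS; this is only to verify the binomial coefficients):

```python
from math import comb

def sci(x, digits=2):
    """Scientific notation for integers too big to convert to float."""
    s = str(x)
    return s[0] + "." + s[1:1 + digits] + "e" + str(len(s) - 1)

# Number of possible samples, "N choose n":
print(sci(comb(1000, 40)))    # about 5.6e71
print(sci(comb(5000, 200)))   # about 1.4e363 -- float() would overflow here
```

`math.comb` returns an exact integer, which is why the helper formats the digits directly instead of converting to a float (10^363 exceeds the double-precision range).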
Stratified Random Sample
• First, partition the population into subgroups, called strata.
• Then, select an SRS from each stratum.
• Stratification is effective when the strata are relatively homogeneous with respect to the characteristic of interest (smoothness assumption).

Cluster Sample
• Some or most sampling units contain more than one observation unit.
• These sampling units are called clusters.
• Take an SRS of clusters.
• Sample all observation units in the sampled clusters.

Example
• You want to estimate the heights of children in an elementary school.
• Select an SRS of the students.
• Stratify by grade level.
• Use classrooms as clusters.

Framework for Probability Sampling
• U = {1, 2, . . . , N} are the units in the population.
• Let n be the sample size.
• There are (N choose n) different samples.
• (N choose n) = N! / ((N − n)! n!)

Example
• N = 4, n = 2
• The population is U = {1, 2, 3, 4}.
• There are (4 choose 2) = 4!/(2! 2!) = (4 × 3 × 2 × 1)/((2 × 1)(2 × 1)) = 6 different samples.
• S1 = {1, 2}; S2 = {1, 3}; S3 = {1, 4}; S4 = {2, 3}; S5 = {2, 4}; S6 = {3, 4}

Some probability sample designs
1. Each sample is equally likely: P(Si) = 1/6 for all i.
2. P(S1) = P(S6) = 1/2; P(S2) = P(S3) = P(S4) = P(S5) = 0
3. P(S1) = P(S2) = P(S3) = 1/3; P(S4) = P(S5) = P(S6) = 0

Probabilities for individual units
• πi = P(unit i is in the sample)
• For many designs, we want the πi to be equal.
• Since the πi sum to the expected sample size n, equal πi implies πi = n/N.

Sample design 1
• Each sample is equally likely: P(Si) = 1/6 for all i.
• S1 = {1, 2}; S2 = {1, 3}; S3 = {1, 4}; S4 = {2, 3}; S5 = {2, 4}; S6 = {3, 4}
• π1 = P(S1) + P(S2) + P(S3) = 3/6 = 1/2
• Similarly, πi = 1/2 = n/N for all i.

Sample design 2
• P(S1) = P(S6) = 1/2; P(S2) = P(S3) = P(S4) = P(S5) = 0
• S1 = {1, 2}; S2 = {1, 3}; S3 = {1, 4}; S4 = {2, 3}; S5 = {2, 4}; S6 = {3, 4}
• π1 = π2 = P(S1) = 1/2
• π3 = π4 = P(S6) = 1/2

Sample design 3
• P(S1) = P(S2) = P(S3) = 1/3; P(S4) = P(S5) = P(S6) = 0
• S1 = {1, 2}; S2 = {1, 3}; S3 = {1, 4}; S4 = {2, 3}; S5 = {2, 4}; S6 = {3, 4}
• π1 = P(S1) + P(S2) + P(S3) = 1
• π2 = P(S1) = 1/3
• π3 = P(S2) = 1/3
• π4 = P(S3) = 1/3

Sampling distribution
• A fundamental idea in statistics.
• Using the data in the sample, we calculate a statistic.
• The distribution of this statistic over repeated sampling is its sampling distribution.
  – A random variable is a set of possible values with associated probabilities.
  – A statistic, once computed, is one realization of a random variable.
• The sampling distribution depends upon
  – the population distribution
  – the sample design.

Population total t
• Suppose we are interested in the population total.
• Let yi denote the value of the characteristic of interest for observation unit i.
• t = Σ_population yi
• t̂ = N ȳ_sample

Parameters and statistics
• t is a constant.
  – unknown
  – a population parameter
• t̂ is a random variable.
  – known after the sample is taken
  – a statistic
  – with a sampling distribution

N = 4 example
• y1 = 5; y2 = 10; y3 = 15; y4 = 10
• t = 5 + 10 + 15 + 10 = 40
• For S1 = {1, 2}, t̂ = 4(7.5) = 30
• For S2 = {1, 3}, t̂ = 4(10) = 40
• For S3 = {1, 4}, t̂ = 4(7.5) = 30
• For S4 = {2, 3}, t̂ = 4(12.5) = 50
• For S5 = {2, 4}, t̂ = 4(10) = 40
• For S6 = {3, 4}, t̂ = 4(12.5) = 50

Sampling distribution
• For sample design 1, the possible samples are equally likely (1/6).
• The sampling distribution of t̂ is
  – P(t̂ = 30) = 1/3
  – P(t̂ = 40) = 1/3
  – P(t̂ = 50) = 1/3

Mean and standard deviation
• We can compute the mean and the variance (or standard deviation) of this sampling distribution
  – using the probabilities for the sampling distribution
  – using the probabilities for the possible samples.

Mean
• Using the sampling distribution
  – E(t̂) = (1/3)(30) + (1/3)(40) + (1/3)(50) = 40
• Using the sample probabilities
  – E(t̂) = (1/6)(30) + (1/6)(40) + (1/6)(30) + (1/6)(50) + (1/6)(40) + (1/6)(50) = 40

Bias
• t = 5 + 10 + 15 + 10 = 40
• E(t̂) = 40
• Bias(t̂) = E(t̂) − t = 40 − 40 = 0
• t̂ is unbiased.

Sample design 3
• P(S1) = P(S2) = P(S3) = 1/3
• y1 = 5; y2 = 10; y3 = 15; y4 = 10
• For S1 = {1, 2}, t̂ = 4 × 7.5 = 30.
• For S2 = {1, 3}, t̂ = 4 × 10 = 40.
• For S3 = {1, 4}, t̂ = 4 × 7.5 = 30.
• E(t̂) = (2/3)(30) + (1/3)(40) = 33.33
• The bias is 33.33 − 40 = −6.67.

Variance and standard deviation
• Var(t̂) = E[(t̂ − E(t̂))²]
• For sample design 1,
  Var(t̂) = (1/3)(30 − 40)² + (1/3)(40 − 40)² + (1/3)(50 − 40)² = 200/3 = 66.67
• The standard deviation is √66.67 = 8.2.

Design 3
• For sample design 3,
  Var(t̂) = (2/3)(30 − 33.33)² + (1/3)(40 − 33.33)² = 22.22
• The standard deviation is √22.22.
• Design 3 has a smaller SD (4.7) than Design 1 (8.2).
• But it is biased.

Mean squared error
• Among unbiased designs, the one with the smallest variance (SD) is best.
• To compare designs in general (biased or unbiased), we look at the mean squared error (MSE).
• MSE = E[(t̂ − t)²] (t̂ is the random quantity.)

MSE, variance, and bias
• MSE = Var + (Bias)²
  – See text, page 28.
• For design 1, MSE = Var + 0 = 66.67.
• For design 3, MSE = 22.22 + (−6.67)² = 22.22 + 44.44 = 66.67.

Comparison of designs
• Design 3 has a smaller variance than Design 1.
• In this example the two designs have equal MSE, but Design 3 is biased, so Design 1 is generally preferred.
• In general, designs with small bias can have smaller MSE than unbiased designs, and be better.
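The design 1 numbers above are easy to verify by brute-force enumeration. An illustrative Python sketch (not part of the course's SAS materials):

```python
from itertools import combinations
from math import sqrt

y = {1: 5, 2: 10, 3: 15, 4: 10}
N, n = 4, 2
t = sum(y.values())                 # population total, 40

# Design 1: all C(4,2) = 6 samples equally likely; t-hat = N * (sample mean).
t_hats = [N * (y[i] + y[j]) / n for i, j in combinations(y, 2)]

mean = sum(t_hats) / len(t_hats)    # E(t-hat)
var = sum((v - mean) ** 2 for v in t_hats) / len(t_hats)
bias = mean - t
mse = var + bias ** 2

print(sorted(t_hats))               # 30, 30, 40, 40, 50, 50
print(mean, round(var, 2), round(sqrt(var), 2), round(mse, 2))
```

This reproduces E(t̂) = 40, Var(t̂) = 66.67, SD ≈ 8.2, and, since the bias is 0, MSE = 66.67.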
Population parameters: total and mean
• Population total
  – t = Σ_population yi
• Population mean
  – ȳ_population = (1/N) Σ_population yi = t/N

Population parameters: variability
• Population variance
  S² = (1/(N − 1)) Σ_pop (yi − ȳ_pop)²
• Population standard deviation
  S = √S²
• Coefficient of variation (CV)
  – CV = S / ȳ_pop
  – Also called the relative standard deviation.

Binary variables
• Proportions can be handled within this framework.
• Define
  – y = 0 if the characteristic is absent
  – y = 1 if the characteristic is present
• t is then the total number of individuals with the characteristic.
• ȳ is the proportion.

Replacement
• Think about selecting the sample one observation unit at a time.
• For an SRS, we do not replace an observation unit once it has been selected.
• For an SRSWR (simple random sample with replacement), we replace the item, and it can be selected again.
  – Sometimes just use the unique values in the sample.
  – Sometimes the statistical properties are easier, e.g.
    Var(ȳ_sam)WR = ((N − 1)/(nN)) S² = ((N − 1)/(N − n)) Var(ȳ_sam)WOR
  – good for contrast

Example 2.4
• The Census of Agriculture is conducted by the U.S. government every five years.
• The population is all farms in the 50 states for which $1000 or more of agricultural products were produced and sold.
• The data file agpop.dat on the text disk contains data summarized for each of the counties and county-equivalents in the U.S.
• We will view this data set as a population with N = 3078 counties.

First three records
1. COUNTY,STATE,ACRES92,ACRES87,ACRES82,FARMS92,FARMS87,FARMS82,LARGEF92,LARGEF87,LARGEF82,SMAL
2. ALEUTIAN ISLANDS AREA,AK,683533,726596,764514,26,27,28,14,16,20,6,4,1,W
3.
ANCHORAGE AREA,AK,47146,59297,256709,217,245,223,9,10,11,41,52,38,W

agpop.dat
• Comma-delimited file
• 15 variables
• Identifiers (3)
  – County
  – State
  – Region
• For 1982, 1987, and 1992
  – total acres
  – number of farms
  – number of large farms (more than 1000 acres)
  – number of small farms (fewer than 9 acres)
• SAS (SLL031.sas)
  – Find the file; import the data.
  – Data source: delimited file.
  – Browse to find the location.
  – Select options and specify that the delimiter is the character “,”.
  – Put the data into member a1. (This is the name of the SAS data set.)

proc print

options nocenter;
proc print data=a1;
run;
proc print data=a1 noobs;
  var county state;
run;
proc print data=a1 noobs;
  var acres92 farms92;
run;

Output

COUNTY                  STATE   ACRES92   FARMS92
ALEUTIAN ISLANDS AREA   AK       683533        26
ANCHORAGE AREA          AK        47146       217
FAIRBANKS AREA          AK       141338       168
JUNEAU AREA             AK          210         8
KENAI PENINSULA AREA    AK        50810        93
AUTAUGA COUNTY          AL       107259       322

Select an SRS (shuffle and take the first 300)

data a1;
  set a1;
  u=uniform(0);       * assign a uniform random number to each county;
proc sort data=a1;
  by u;               * randomly order the data;
data a2;
  set a1;
  if _n_ le 300;      * keep the first 300 (_n_ is the observation index);
proc print data=a2;
  var acres92 farms92;
run;

Output

Obs    ACRES92   FARMS92
  1      81427       384
  2      23735       393
  3      52904       256
  4     787857       284
...
298     412673      1360
299     268043       529
300     335820       133

Examine the sample

proc univariate data=a2;
  var acres92;
  histogram acres92;
run;

  N         Mean   Std Deviation      Variance
300      316552.08     411912.356    1.69672E11

Standard error of the mean
• This is the standard deviation of the sampling distribution of the sample mean.
• Var(ȳ_sample) = (S²/n)(1 − n/N),
  – where S² is the population variance (defined in equation 2.5 on page 29 with divisor N − 1).
• The standard error is the square root of this variance.

Finite population correction (fpc)
• The usual formula for the standard error of the mean does not have the term 1 − n/N.
• This is the finite population correction.
• Note that if n is small relative to N, the correction is negligible.
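To see how the correction behaves, compare the fpc for the Census of Agriculture sample with the same n drawn from a much larger population (the second N is made up purely for illustration):

```python
# Finite population correction: fpc = 1 - n/N.
n = 300
for N in (3078, 3_000_000):   # agpop's 3078 counties vs. a hypothetical huge frame
    print(f"N = {N:>9,}: fpc = {1 - n / N:.4f}")
```

For the agpop population the fpc is about 0.90, a real reduction; for the huge frame it is essentially 1 and could be ignored.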
Estimation of the standard error of the mean
• Replace the population variance S² with the sample variance s² in the formula
  Var(ȳ_sample) = (S²/n)(1 − n/N).
• Then take the square root.

Example
• For our sample of n = 300, the output gave Variance = 1.69672E11.
• N = 3078
• fpc = 1 − n/N = 1 − 300/3078 = 0.9025
• The estimated variance is
  (1.69672 × 10^11 / 300)(0.9025) = 5.10 × 10^8.
• The square root is 22,593.

Confidence Intervals
• 95% is the standard.
• The margin of error (MOE) is 1.96 times the standard error of the mean.
• 1.96(22593) = 44282
• The confidence interval is the sample mean plus or minus the MOE:
  291766 ± 44282, or roughly 290,000 ± 44,000.

Asymptotics
• Reliability (consistency and efficiency) depends on the assumption that the sample size goes to infinity...
• But our population is finite.
• Superpopulation theory: n, N, and N − n all go to infinity in a predictable way.
• What is sufficiently large for the normality assumption?

Check on the results
• In this artificial example, we have data for the whole population, so we can compare our sample estimates with the population parameters.
• The population mean is 306676.971.
• The population variance is 1.80359 × 10^11.

Proportions
• Methods for the estimation of proportions are similar.
• The estimated standard error for a sample proportion is
  √( fpc × p̂(1 − p̂)/(n − 1) ).
• See pages 34-35 and page 38.

Totals
• Do the analysis for the mean.
• Then multiply each of the following by N:
  – the sample mean
  – the standard error of the mean
  – the margin of error
  – the confidence limits

CI for total
• In the Census of Agriculture example, the confidence interval for the average number of acres was 290,000 ± 44,000.
• For total acres, multiply by N = 3078:
  892 ± 135 million acres.
• The actual number is 942 million acres.

Sample Size Determination
• Determine the margin of error that you need.
• Solve the margin-of-error equation for the sample size n.
• Substitute values for the unknown quantities.
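The recipe just stated can be sketched as a pair of functions, using the MOE formula and the n = n0/(1 + n0/N) identity developed on the following slides. Illustrative Python (the function names are mine):

```python
from math import sqrt, ceil

def moe(n, N, S, z=1.96):
    """Margin of error for an SRS: z * S * sqrt(fpc / n), with fpc = 1 - n/N."""
    return z * S * sqrt((1 - n / N) / n)

def sample_size(target_moe, S, N, z=1.96):
    """Solve MOE = z*S*sqrt(fpc/n) for n: n0 = (z*S/MOE)^2, n = n0/(1 + n0/N)."""
    n0 = (z * S / target_moe) ** 2
    return ceil(n0 / (1 + n0 / N))

print(round(moe(100, 1000, 100), 4))    # 18.5942, matching the SAS table below
print(sample_size(18.5942, 100, 1000))  # inverting it recovers n = 100
```

The first call reproduces the n = 100 entry of the SAS MOE table printed further below (N = 1000, s = 100); the second inverts the formula to recover the sample size.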
Quantities needed for the calculation
• The confidence level (use 95%).
• The variance
  – Use data from a pilot or similar study.
  – Guess (use the idea that 95% of observations are within 2S of the mean for normal populations).
• The population size N.

Margin of error formula
• MOE = z* S √(fpc/n)
  – z* from the normal distribution; use 1.96.
  – S² is the population variance; a value is needed.
  – fpc = 1 − n/N
• ⇒ n = n0 / (1 + n0/N), where n0 = (1.96 S/MOE)².
• n0 is the corresponding sample size for an SRSWR.

Binomial proportions
• The variance for a binomial is maximized at p = 0.5.
• The variance is p(1 − p) ≤ 1/4, so the SD is ≤ 1/2.
• This gives n0 = (1.96/(2 × MOE))².
• Then use n = n0 / (1 + n0/N).

Relative precision
• For some problems, it is common to express the desired MOE relative to the mean:
  n0 = (1.96 S/MOE)² = (1.96 (S/Ȳ) / (MOE/Ȳ))²
• S/Ȳ is the CV; MOE/Ȳ is the relative margin of error.
• n = n0 / (1 + n0/N)

Details and examples
• See text pages 39-42.
• Note that increasing the sample size has a diminishing effect on the MOE as the sample size gets larger.
• The effect is not so pronounced for increasing population size.
• See the graph on page 42.

A calculation in SAS (SLL042.sas)

data a1;
  popN=1000; z=1.96; s=100;
  do n=5 to 1000;
    fpc=(1-n/popN);
    moe=z*sqrt(fpc)*s/sqrt(n);
    output;
  end;
proc print data=a1;
symbol1 v=none i=join;
title1 'Plot of margin of error versus sample size';
title2 'N=1000, s=100, 95% confidence';
proc gplot data=a1;
  plot moe*n/frame;
run;

[Figure: plot of MOE versus sample size n at 95% confidence, with curves for N = 1000, N = 10000, and N = 100000]

Print some cases

proc print data=a1 noobs;
  where n=50*int(n/50);
  var n moe;
run;

   n       moe
  50   27.0167
 100   18.5942
 150   14.7543
 200   12.3961
 250   10.7354
 300    9.4677
 350    8.4465
 400    7.5910
 450    6.8522
 500    6.1981
 550    5.6064
 600    5.0607
 650    4.5481
 700    4.0576
 750    3.5785
 800    3.0990
 850    2.6037
 900    2.0660
 950    1.4219
1000    0.0000

Systematic sampling
• Basic idea – random start, then pick every kth observation unit.
• Specifics
  – Let k be the next integer after N/n.
  – R is a random integer between 1 and k.
  – Select units R, R + k, R + 2k, . . . , R + (n − 1)k.

Example
• N = 5000, n = 56
• N/n = 89.3, so k = 90.
• R is a random integer between 1 and 90.
• Suppose R = 11.
• The sample is the units numbered 11, 11 + 90, 11 + 180, . . . , 11 + 55(90) = 4961.

Properties
• There are k possible samples, which are equally likely (and mutually exclusive).
• The observation number of the last unit sampled is R + (n − 1)k.
• The maximum value of R is k, so the largest observation number that can be sampled is nk; if nk > N, replace k by k − 1.
• Often the results will be similar to an SRS.
• If there is a cyclic pattern in the list, the results can be very bad. (Safeguard: take two or more systematic samples with different values of k.)
• If there is a natural ordering in the list related to the outcome, the results can be better than an SRS.
• Systematic sampling is a special case of cluster sampling.

Randomization Theory
• This is a theoretical section (2.7).
• We can prove that ȳ is unbiased by showing that the average of all possible values of ȳ is the population mean.
• Similarly, we can show that (fpc) s²/n is an unbiased estimator of the variance of the sample mean.

Proofs
• Use the indicator trick:
  Zi = I(unit i is in the sample)
• Zi is the random component.
• Use: ȳ_sample = Σ_sample yi/n = Σ_population Zi yi/n
• πi = E(Zi) = P(Zi = 1) = (number of samples containing unit i)/(number of possible samples)

Other properties
• E(Zi²) = E(Zi) = n/N
• Var(Zi) = (n/N)(1 − n/N)
• E(ȳ_samp) = Σ_pop (n/N)(yi/n) = ȳ_pop
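These indicator-variable facts can be verified by enumerating all six samples in the N = 4 example. An illustrative Python check:

```python
from itertools import combinations

y = {1: 5, 2: 10, 3: 15, 4: 10}
N, n = 4, 2
samples = list(combinations(y, 2))   # the 6 equally likely SRS samples

def E(f):
    """Expectation of f(sample) over the equally likely samples."""
    return sum(f(s) for s in samples) / len(samples)

pi_1 = E(lambda s: 1 in s)           # E(Z1): fraction of samples containing unit 1
var_Z1 = pi_1 - pi_1 ** 2            # Var(Z1) = E(Z1^2) - E(Z1)^2, and Z1^2 = Z1
ybar_pop = sum(y.values()) / N
E_ybar = E(lambda s: sum(y[i] for i in s) / n)
cov_12 = E(lambda s: (1 in s) and (2 in s)) - pi_1 ** 2

print(pi_1, var_Z1)      # 0.5 and 0.25, i.e., n/N and (n/N)(1 - n/N)
print(E_ybar, ybar_pop)  # both 10.0: the sample mean is unbiased
print(cov_12)            # -1/12 = -(1/(N-1)) (n/N)(1 - n/N)
```

The last line previews the covariance of two indicators, which drives the variance derivation that follows.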
• For i ≠ j,
  Cov(Zi, Zj) = E(Zi Zj) − E(Zi)E(Zj)
              = P(Zi = 1 and Zj = 1) − (n/N)²
              = (n/N)((n − 1)/(N − 1)) − (n/N)²
              = −(1/(N − 1))(n/N)(1 − n/N)

Standard error of the sample mean

Var(ȳ_samp)
 = Var((1/n) Σ_pop Zi yi)
 = (1/n²) Σ_{i=1..N} Σ_{j=1..N} yi yj Cov(Zi, Zj)
      [Var(X) = Cov(X, X)]
 = (1/n²) [ Σ_{i=1..N} yi² Var(Zi) + 2 Σ_{i=1..N} Σ_{j=i+1..N} yi yj Cov(Zi, Zj) ]
      [properties of covariance; expansion of (Σ xi)²; the yi are non-random]
 = (1/n²)(n/N)(1 − n/N) [ Σ_pop yi² − (2/(N − 1)) Σ_{i<j} yi yj ]
      [plug in the formulas; notice the sign change]
 = (1 − n/N) (1/(nN(N − 1))) [ (N − 1) Σ_pop yi² − 2 Σ_{i<j} yi yj ]
 = (1 − n/N) (1/(nN(N − 1))) [ (N − 1) Σ yi² − (Σ yi)² + Σ yi² ]
      [since (Σ yi)² = Σ yi² + 2 Σ_{i<j} yi yj]
 = (1 − n/N) (1/(nN(N − 1))) [ N Σ yi² − (Σ yi)² ]
 = (1 − n/N) S²/n
      [since the population variance is S² = (Σ_pop yi² − (1/N)(Σ_pop yi)²)/(N − 1)]

Assumptions
• The approach is nonparametric; there are no assumptions on the distribution of the yi.
• We simply assume that the yi are a collection of unknown numbers.

Models
• Probability models are the foundation for statistical inference.
• They give a framework for evaluating estimators:
  – Bias
  – MSE
• Confidence intervals are probability statements.

A model for simple random sampling
• Randomization theory (also called design-based theory) provides one framework for sampling methods.
• An alternative is model-based theory.

What is random?
• For randomization theory, {yi : i = 1, . . . , N} is a collection of fixed numbers.
• The random variables are the set {Zi : i = 1, . . . , N}, where Zi = 1 or 0 depending upon whether or not unit i is included in the sample.
• For model-based theory, {Yi : i = 1, . . . , N} is a collection of random variables.
• We can think in terms of an infinite superpopulation.
  – Based on knowledge of the natural tendency of phenomena to have a particular distribution. (Example: lifetimes are exponential.)
  – Yi, i = 1, . . . , N, are independent random variables with expected value µ and variance σ².
  – The text uses M as a subscript to denote the model-based expectation and variance, E_M and V_M.

t and T
• t is the population total: the sum of yi for all items in the population (1 to N).
• T is the population total as a random variable: the sum of Yi for all items in the population (1 to N), a random sample from an infinite superpopulation.
• We want to estimate the number t.

Estimation of t
• t is the sum of the n observations in our sample plus the sum of the N − n observations not in our sample.
• We do not need to estimate the values in our sample.
• We use the data in our sample to estimate (or predict) the values not in our sample.
• The expected value of each of the N − n observations not in our sample is µ.
• The predicted value is our best guess at µ: the mean of the observations in our sample, Ȳ.
• Our estimate of the total is therefore the sum of the observations in our sample plus (N − n) times the sample mean.
• This is the same as N times the sample mean, the usual estimator.
• We call it T̂ in the model-based approach.
• We call it t̂ in the randomization approach.

Properties
• T̂ is model-unbiased: the expected value of T̂ − T is zero.
• The MSE is the variance: fpc × N²σ²/n.
• Use the sample variance s² to estimate σ².

Confidence Intervals
• The calculation is the same.
• The interpretation is a bit different.
  – For design-based inference, probabilities refer to repeated sampling from the same population.
  – For model-based inference, probabilities come from the central limit theorem approximation.

Finite population correction
• The model for an SRS provides some intuition.
• fpc = 1 − n/N = (N − n)/N
• This is the proportion of the population that we need to estimate (predict).

SRS Mean
• The estimator of the population mean is the sample mean ȳ.
• The estimator of the standard error (SE) of this estimator is √(fpc × s²/n).
• The margin of error (MOE) is 2 times the SE.
• The 95% CI is the estimate plus or minus the margin of error.

SRS Proportion
• The estimator of the population proportion is the sample proportion p̂.
• The estimator of the standard error (SE) of this estimator is √(fpc × s²/(n − 1)), where s² = p̂(1 − p̂).
• The margin of error (MOE) is 2 times the SE.
• The 95% CI is the estimate plus or minus the margin of error.

SRS Total
• The estimator of the population total is N times the sample mean, N ȳ.
• The standard error (SE) of this estimator is N times the SE of ȳ.
• The margin of error (MOE) is 2 times the SE.
• The 95% CI is the estimate plus or minus the margin of error.

Advantages of SRS
• Simple
• Estimation and inference are very similar to those used in elementary statistics.
• The fpc is a new idea.

Other designs may be better
• Sample survey versus designed experiment (STAT 522 vs STAT 514)
• A frame may not be available (e.g., mosquitoes, liquid).
• Additional information may be available that could be used to reduce the variance.

Simplicity is an advantage
• Some view statistics as a collection of methods for compiling numbers, in contrast to methods for model-based inference.
• More complicated calculations – such as those needed for more complex designs – are potentially confusing (litigation).

Extra information may not be available
• More complex designs usually require that we have some information about the population.
• Without such information, an SRS is usually best.

Consider the use
• For some problems, we may want to do more sophisticated analyses such as multiple regression, factor analysis, etc.
• Adjustments can be made to take the design into account, but this can get very messy for a complex design.
• An SRS may be best under these circumstances.

Refinements and Alternatives
• Unequal probability sampling
  – probabilities proportional to size
• Rejection sampling
  – Do I want this sample? (Are all elements unique? Part of the target population?)
  – If not, get a new sample.
• Bernoulli sampling
  – Sample each unit independently with probability π of being selected.
  – The sample size is a Binomial random variable.
  – When the inclusion probabilities differ across the units in the population, the design is known as Poisson sampling.
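A Bernoulli sampling sketch in Python (illustrative only; N and π are made up, and the realized sample size varies from run to run):

```python
import random

random.seed(20522)            # arbitrary seed, purely for reproducibility
N, pi = 3078, 0.1             # made-up inclusion probability

# Each unit enters the sample independently with probability pi, so the
# realized sample size is Binomial(N, pi) with mean N * pi = 307.8.
sample = [i for i in range(1, N + 1) if random.random() < pi]
print(len(sample))            # random; typically within a few SDs of 308
```

Poisson sampling generalizes this by giving each unit i its own inclusion probability πi.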