Survey
1 Basic Concepts of Finite Population Sampling

A sample survey is a statistical procedure which

1. selects a sample of n units from a real, finite population of N units, each unit being identified (for the purposes of selection) by a distinct label (eg name and address for humans) (THE DESIGN STAGE);

2. measures the fixed values of a number of variables (eg income, expenditure) on each of the sampled units (THE MEASUREMENT STAGE);

3. attempts to estimate the value of a population parameter θ (eg the population mean, or the domain mean for men) and gives a measure of accuracy: a confidence interval, nominally 95% (THE ESTIMATION STAGE).

HOW?

1. (THE DESIGN STAGE): How do we select the sample? That is, what is the sample design or sampling mechanism? There are many possible sample designs, but the most important, even if little used in actual survey practice, is

Definition 1 (SRS): Simple Random Sampling. All possible samples of size n (ie with n distinct units) are equally likely to be drawn, so that each has probability 1/C(N, n), where C(N, n) = N!/(n!(N − n)!) is the number of such samples. The sampling fraction f = n/N is the probability that any particular unit is included in the sample (see later). It ranges from 0 (sampling from an infinite population, where sampling with replacement is equivalent to the sampling without replacement studied here) to 1 (a census, where the whole population is sampled and there is clearly no sampling error: inferences on population parameters should be exact in the absence of non-sampling errors; see below).

2. (THE MEASUREMENT STAGE): (Note: the process of measurement in surveys, especially those on populations of human beings, is a largely non-statistical problem and varies greatly in different survey contexts. In this course we will only touch on a small aspect of this huge topic, namely questionnaire design, in the latter part.)
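The counting claims in Definition 1 can be checked by brute-force enumeration. This is an illustrative sketch (Python assumed, not part of the notes); the population labels are the toy ones used again in Example 1 below.

```python
# Illustrative sketch (not from the notes): enumerating all simple random
# samples of size n to check the SRS probabilities stated above.
from itertools import combinations
from math import comb

labels = ["A", "B", "C", "D", "E"]  # a toy population of N = 5 labelled units
n = 2

samples = list(combinations(labels, n))
# Under SRS every sample is equally likely: probability 1/C(N, n).
assert len(samples) == comb(len(labels), n)  # C(5, 2) = 10 samples

# The probability that any particular unit is included works out to f = n/N.
inclusion_A = sum(1 for s in samples if "A" in s) / len(samples)
print(inclusion_A)  # 0.4 = n/N
```

Unit A appears in 4 of the 10 equally likely samples, so its inclusion probability is 4/10 = n/N, as the definition asserts.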
Of major priority is to minimize non-response (this is when sampled units fail to respond either totally, unit non-response, or just to some of the questions asked, item non-response), measurement error (for example, asking people how many cigarettes they smoked last week), and other NON-SAMPLING ERRORS. For further discussion see later in this course. (Footnote 1, on "nominally 95%" above: this means that although we say that the interval is 95%, we recognize that the coverage probability may well be somewhat less than 0.95; see later.)

3. In order to do statistical inference in surveys, we use the sampling distribution of our estimator, which is a suitable function of the sample data:

Definition 2. An estimator of a population parameter gives an estimate whatever the sample. Of course some estimators are sensible and give precise inferences, while others may be either not sensible (eg giving values outside the known range of a population parameter) or very imprecise (eg an estimator which ignores most of the sample data).

Definition 3. The sampling distribution of an estimator is the collection of all possible samples together with the corresponding estimates and their probabilities under the sampling design for the fixed population.

Let Var[e] be the variance of an estimator e under its sampling distribution for the fixed (unknown) population, and let v[e] be a sample estimator of this variance, called a variance estimator for e. Then under a normal approximation to the sampling distribution of e, a nominal 95% confidence interval for θ is given by

    e ± 1.96 √v[e]    (1.1)

Definition 4. The actual probability that this interval contains the value of θ is called the coverage probability for the estimator, design and population (usually < 0.95: undercoverage, but sometimes above 0.95: overcoverage).

Let E[·] denote expectation (or mean) under the sampling distribution; then

Definition 5. e is unbiased for θ if E[e] = θ for all populations; otherwise bias[e] = E[e] − θ is the bias of e.
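The interval (1.1) and its coverage check can be sketched in a few lines (Python assumed; the function names `nominal_ci` and `covers` are illustrative, not from the notes):

```python
# Sketch of the nominal 95% confidence interval (1.1): e ± 1.96 √v[e].
from math import sqrt

def nominal_ci(e, v_e, z=1.96):
    # e: the estimate; v_e: the variance estimate v[e]
    half = z * sqrt(v_e)
    return (e - half, e + half)

def covers(e, v_e, theta, z=1.96):
    lo, hi = nominal_ci(e, v_e, z)
    return lo <= theta <= hi

# With ȳ = 147.5 and v[ȳ] = 1717.35 (the farm sample A,D in Example 1 below),
# the interval covers Ȳ = 168.8, matching the YES entry in the table.
print(covers(147.5, 1717.35, 168.8))
```

Applying `covers` to every possible sample, weighted by its probability, gives exactly the coverage probability of Definition 4.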
Unbiasedness is a desirable property, but we should not get too worried about a very small bias which diminishes as the sample size increases. Rather, we should be worried about any estimator which has a large unknown bias. This applies not only to the bias (if any) of estimators but also to the bias of variance estimators. If E[v[e]] < Var[e], we might expect undercoverage of our basic C.I. (1.1), as it will tend to be too narrow, but we should not worry about this for a large sample if we know the bias decreases to zero as n → ∞ (this is true for all reasonable variance estimators).

Example 1 (Farms). A simple population of five farms is to be studied by drawing a simple random sample of size two in order to estimate Ȳ = (1/N) ∑_{i=1}^{N} Y_i, where Y_i is the area (in acres, devoted to growing wheat) of the ith farm. The farms are identified simply by the first five letters of the Roman alphabet and have areas A: 94, B: 129, C: 150, D: 201, E: 270. Thus Ȳ = 168.8, and the finite population variance S² = ∑_{i=1}^{N} (Y_i − Ȳ)²/(N − 1), where N = 5 is the finite population size, takes the value 4702.70. We will see in the next chapter that the variance of the sample mean as an estimator of Ȳ, that is the variance of the sampling distribution of this estimator, is given by

    Var[ȳ] = ((1 − f)/n) S²

for a sample size n under simple random sampling, where f = n/N (here equal to 0.4) is the sampling fraction.

FARMS: Sampling distribution of the sample mean and variance estimator under simple random sampling.

    Sample   prob.   ȳ        s²        v[ȳ]     Coverage?
    A,B      0.1     111.5    612.5     183.75   NO
    A,C      0.1     122.0    1568.0    470.40   NO
    A,D      0.1     147.5    5724.5    1717.35  YES
    A,E      0.1     182.0    15488.0   4646.40  YES
    B,C      0.1     139.5    220.5     66.15    NO
    B,D      0.1     165.0    2592.0    777.60   YES
    B,E      0.1     199.5    9940.5    2982.15  YES
    C,D      0.1     175.5    1300.5    390.15   YES
    C,E      0.1     210.0    7200.0    2160.00  YES
    D,E      0.1     235.5    2380.5    714.15   NO
    mean             168.8    4702.7    1410.81  0.6
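The FARMS table can be reproduced by enumerating all ten equally likely samples. This is a Python sketch, not part of the original notes:

```python
# Sketch reproducing the FARMS sampling distribution by full enumeration.
from itertools import combinations
from math import sqrt

areas = {"A": 94, "B": 129, "C": 150, "D": 201, "E": 270}
N, n = len(areas), 2
f = n / N
Ybar = sum(areas.values()) / N  # 168.8

rows = []
for sample in combinations(sorted(areas), n):
    y = [areas[u] for u in sample]
    ybar = sum(y) / n
    s2 = sum((yi - ybar) ** 2 for yi in y) / (n - 1)
    v = (1 - f) / n * s2                           # v[ȳ] = ((1 − f)/n) s²
    covered = abs(ybar - Ybar) <= 1.96 * sqrt(v)   # does the CI cover Ȳ?
    rows.append((",".join(sample), ybar, s2, v, covered))

# All 10 samples are equally likely (prob. 0.1), so means are simple averages.
mean_ybar = sum(r[1] for r in rows) / len(rows)    # 168.8 = Ȳ
mean_s2 = sum(r[2] for r in rows) / len(rows)      # 4702.7 = S²
coverage = sum(r[4] for r in rows) / len(rows)     # 0.6
print(mean_ybar, mean_s2, coverage)
```

The printed averages confirm that ȳ and s² are unbiased here and that only 6 of the 10 intervals cover Ȳ, i.e. a coverage probability of 0.6.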
This result will be verified empirically in the Table, as will the result that this sampling variance can be estimated by the corresponding sample formula

    v[ȳ] = ((1 − f)/n) s²,

where s² = ∑_{i=1}^{n} (y_i − ȳ)²/(n − 1) is the sample variance, calculated only from the n sample values, here denoted y_1, y_2, …, y_n. But the prime objective of this table is to examine a quantity for which there is no theoretical result: the coverage probability of the nominal 95% confidence interval formed by ȳ ± 1.96 √v[ȳ]. This is just the probability that an interval calculated from a sample will contain, or cover, the true value, that is, the actual population quantity to be estimated. For simple random sampling estimating the population mean, it is simply the proportion of samples giving intervals containing Ȳ.

Note that the means have been calculated by simple averaging, as all 10 possible samples have the same probability (0.1) of being selected. The mean or expectation of the sampling distributions of the sample mean and sample variance are therefore equal to the value of the population mean and population variance respectively. These agree with the general theorem in the next chapter. The coverage probability is considerably less than the nominal level of 0.95, a phenomenon known as undercoverage. However, if we had used a percentage point of the t-distribution with, say, one degree of freedom instead of the Normal, then the coverage probability would have been 1, the phenomenon of overcoverage. The variance of the (discrete) sampling distribution of ȳ also agrees with the theorem in the next chapter, as

    Var[ȳ] = E[ȳ²] − (E[ȳ])² = (1/10)(111.5² + 122.0² + · · · + 235.5²) − 168.8² = 1410.81.

However, it can be seen that the distribution of ȳ is a very poor approximation to the Normal! It is worthwhile noting the short-cut formula (y_1 − y_2)²/2 for the sample variance when n = 2.
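The identities quoted above can be checked numerically in a few lines (illustrative Python, using the FARMS areas):

```python
# Checking Var[ȳ] = ((1 − f)/n) S² and the n = 2 short-cut formula.
Y = [94, 129, 150, 201, 270]   # the FARMS areas
N, n = 5, 2
f = n / N
Ybar = sum(Y) / N                                  # 168.8
S2 = sum((y - Ybar) ** 2 for y in Y) / (N - 1)     # 4702.70

# The formula reproduces the directly computed sampling variance 1410.81.
print((1 - f) / n * S2)

# Short-cut when n = 2: s² = (y1 − y2)²/2, e.g. for sample A,B:
print((94 - 129) ** 2 / 2)  # 612.5, the s² entry for A,B in the table
```

Both values match the table, so the theoretical and empirical calculations agree.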
This further example illustrates the generality, to all sampling problems and not just SRS, of the definitions we have made:

Example 2 (Unequal probability sampling). For this population e is biased (with a bias of +0.02), and v[e] is also biased (with a bias of +0.0014). Note that the true variance Var[e] is calculated directly from the sampling distribution using the standard formulae for discrete distributions, giving

    E[e²] − (E[e])² = 0.5 × (1.2)² + 0.3 × (0.8)² + 0.2 × (0.9)² − (1.02)² = 0.0336.

The coverage is NOT 2/3, but is obtained by adding the probabilities of the covering samples, or equivalently by taking the mean of the coverage indicator (counting YES as 1 and NO as 0).

Sampling distribution of an estimator e of a parameter θ = 1.

    Sample    prob.   e      v[e]    Coverage?
    1         0.5     1.2    0.01    NO
    2         0.3     0.8    0.04    YES
    3         0.2     0.9    0.09    YES
    mean              1.02   0.035   0.5
    variance          0.0336

Compare unbiased estimators e by their variances Var[e]: e_1 is better than e_2 if Var[e_1] < Var[e_2], but this inequality often depends on the population; that is, for some populations e_1 is better than e_2, whereas for others it is e_2 which is the better estimator. One of the aims of sampling theory is to identify the types of population which make one estimator better than another. If an estimator is always worse than another, ie for every population, no matter what the values in it, then it is clearly not worth considering (we say it is inadmissible). More generally, compare biased and unbiased estimators by

Definition 6. The mean square error of an estimator, MSE[e], is given by MSE[e] = E[(e − θ)²] = Var[e] + (bias[e])².

However, one result beyond the scope of an intermediate course is Godambe's Theorem: there is no best (linear) (unbiased) estimator!

Summary

To summarise, the sampling distribution is used in at least two important ways:

1. to assess the accuracy of a given estimate, by a nominal 95% confidence interval based on a variance estimator (for the unknown variance of the estimator e of θ, the parameter of interest);

2.
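The quantities quoted in Example 2 follow directly from its discrete sampling distribution; here is a Python sketch of the calculations (not part of the original notes):

```python
# Computing the Example 2 quantities from the discrete sampling distribution.
from math import sqrt

theta = 1.0
dist = [(0.5, 1.2, 0.01),   # (probability, e, v[e]) for samples 1, 2, 3
        (0.3, 0.8, 0.04),
        (0.2, 0.9, 0.09)]

E_e = sum(p * e for p, e, _ in dist)                  # 1.02, so bias[e] = +0.02
Var_e = sum(p * e * e for p, e, _ in dist) - E_e**2   # 0.0336
E_v = sum(p * v for p, _, v in dist)                  # 0.035, so bias +0.0014
MSE = Var_e + (E_e - theta) ** 2                      # Definition 6

# Coverage is the probability-weighted sum over covering samples, not 2/3:
coverage = sum(p for p, e, v in dist if abs(e - theta) <= 1.96 * sqrt(v))
print(E_e, Var_e, coverage)
```

Sample 1 just fails to cover (|1.2 − 1| = 0.2 > 1.96 × 0.1), so the coverage is 0.3 + 0.2 = 0.5 rather than 2/3.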
to compare different estimators, perhaps using different sample designs and/or different sample sizes, by looking at variances or, more generally, mean square errors.
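Point 2 can be illustrated with a hypothetical comparison (invented for illustration, not from the notes): the unbiased sample mean ȳ against a deliberately shrunken estimator 0.9·ȳ, using Var[ȳ] = 1410.81 and Ȳ = 168.8 from the FARMS example.

```python
# Hypothetical MSE comparison: ȳ versus an invented shrunken estimator 0.9·ȳ.
Ybar, Var_ybar = 168.8, 1410.81

mse_ybar = Var_ybar                                # unbiased, so MSE = variance
bias_shrunk = 0.9 * Ybar - Ybar                    # E[0.9 ȳ] − Ȳ = −16.88
mse_shrunk = 0.9**2 * Var_ybar + bias_shrunk**2    # MSE = Var + bias²

# For this population ȳ has the smaller MSE; for other populations a biased
# estimator can win, which is Godambe's point: no estimator is best for all.
print(mse_ybar, mse_shrunk)
```

Here shrinking reduces the variance term but the squared bias more than cancels the gain; on a population with a mean nearer zero the shrunken estimator could come out ahead.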