* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project
Download File
Survey
Document related concepts
Transcript
Lecture-5 Sampling Methods 3. Systematic Sampling etc. Engr. Dr. Attaullah Shah RECAP: Simple Random Sampling Used when there is inadequate information for developing a conceptual model for a site or for stratifying a site Any sample in which the probabilities of selection are known Sampling units are chosen by using some method using chance to determine selection Estimation of population mean Assume that a simple random sample of size n is selected without replacement from a population of N units, and that the variable of interest has values y1, y2, …, yn, for the sampled units. Then the sample mean is: Sample variance Sample coefficient of variation: These values that are calculated from samples are often referred to as sample statistics. The corresponding population values are the population mean μ, the population variance σ2, the population standard deviation σ, and the population coefficient of variation σ/μ. The sample mean is an estimator of the population mean μ. The difference y − μ is then the sampling error in the mean. This error will vary from sample to sample if the sampling process is repeated, and it can be shown theoretically that if this is done a large number of times, then the error will average out to zero. For this reason, the sample mean is said to be an unbiased estimator of the population mean. It can also be shown theoretically that the distribution of that is obtained by repeating the process of simple random sampling without replacement has the variance. The factor [1 − (n/N)] is called the finite-population correction because it makes an allowance for the size of the sample relative to the size of the population. The square root of Var(y) is commonly called the standard error of the sample mean. It will be denoted here by Since population variance is not usually known, therefore the estimate of the variance of sample mean is given as: The square root of this quantity is the estimated standard error of the mean: The accuracy of a sample mean for estimating the population mean is often represented by a 100(1 − α)% confidence interval for the population mean of the form Commonly used confidence intervals are: For smaller samples n less than 25, we use t-statistics and the CI is given as: It is assumed that the variable being measured is approximately normally distributed in the population being sampled. It may not be satisfactory for samples from very non symmetric distributions. Estimation of Population Totals Total area damaged by an oil spill is likely to be of more concern than the average area damaged on individual sample units. If the population size N is known and an estimate of the population mean is available, then population total = mean x N It is obvious, for example, that if a population consists of 500 plots of land, with an estimated mean amount of oil-spill damage of 15 m2, then it is estimated that the total amount of damage for the whole population is 500 × 15 = 7500 m2. The sampling variance of this estimator is: Standard error (i.e., standard deviation) is Estimates of the variance and standard error are: Approximate 100(1 − α)% confidence interval for the true population total: Estimation of Proportions Two terms proportions measured on sample units and proportions of sample units. Proportions measured on sample units, such as the proportions of the units covered by a certain type of vegetation, can be treated like any other variables measured on the units. Proportions of sample units are different because the interest is in which units are of a particular type. An example of this situation is where the sample units are blocks of land, and it is required to estimate the proportion of all the blocks that show evidence of damage from pollution. Suppose that a simple random sample of size n is selected without replacement from a population of size N and contains r units with some characteristic of interest. Then the sample proportion is pˆ = r/n, and it can be shown that this has a sampling variance of Estimated values for the variance and standard error can be obtained by replacing the population proportion in previous equation with the sample proportion pˆ. Thus the estimated variance is and the estimated standard error is SÊ(pˆ ) = √Vâr(pˆ ). Using the estimated standard error, an approximate 100(1 − α)% confidence interval for the true proportion is Assume that K strata have been chosen, ith the ith of these having size Ni and the total population size being ΣNi = N. Then if a random sample with size ni is taken from the ith stratum, the sample mean yi will be an unbiased estimate of the true stratum mean μi, with estimated variance as: Where si is the sample standard deviation within the stratum. In terms of the true strata means, the overall population mean is the weighted average. And the corresponding sample estimate is with estimated variance The estimated standard error of is , the square root of the estimated variance, and an approximate 100(1 − α)% confidence interval for the population mean is given by: If the population total is of interest, then this can be estimated by The estimated standard error of population total: Again, an approximate 100(1 − α)% confidence interval takes the form When a stratified sample of points in a spatial region is carried out, it will often be the case that there are an unlimited number of sample points that can be taken from any of the strata, so that Ni and N are infinite. Equation can then be modified to and the equation becomes Where wi, the proportion of the total study area within the ith stratum, replaces Ni/N. Parameter Variance of sample mean Estimate of variance of sample mean Confidence interval for population mean Commonly used CI For sample less than 30 Est error of Var of pop total CI for total pop mean CI for true proportion of Random Sampling Stratified Sampling Statistical Sampling Systemic random sampling refers to a sampling technique that involves selecting the kth item in the population after randomly selecting a starting point between 1 and k. The value of k is determined as the ratio of the population size over the desired sample size. Sampling Design II: Systematic Sampling Design: A Grid Scheme is most common FOR 220 Aerial Photo Interpretation and Forest Measurements Sampling Design II: Systematic Sampling Arguments: For: Regular spacing of sample units may yield efficient estimates of populations under certain conditions. *** Against: Accuracy of population estimates can be low if there is periodic or cyclic variation inherent in the population. FOR 220 Aerial Photo Interpretation and Forest Measurements Sampling Design II: Systematic Sampling Arguments: For: There is no practical alternative to assuming that populations are distributed in a random order across the landscape. Against: Simple random sampling statistical techniques can’t logically be applied to a systematic design unless populations are assumed to be randomly distributed across the landscape. FOR 220 Aerial Photo Interpretation and Forest Measurements Systematic sampling is often used as an alternative to simple random sampling or stratified random sampling for two reasons. First, the process of selecting sample units is simpler for systematic sampling. Second, under certain circumstances, estimates can be expected to be more precise for systematic sampling because the population is covered more evenly. The It is common to analyze a systematic sample as if it were a simple random sample. In particular, population means, totals, and proportions are estimated using the equations already discussed for the estimation of standard errors and the determination of confidence limits. The assumption is then made that, because of the way that the systematic sample covers the population, this will, if anything, result in standard errors that tend to be somewhat too large and confidence limits that tend to be somewhat too wide. That is to say, the assessment of the level of sampling errors is assumed to be conservative. The only time that this procedure is liable to give a misleading impression about the true level of sampling errors is when the population being sampled has some cyclic variation in observations, so that the regularly spaced observations that are selected tend to all be either higher or lower than the population mean. Therefore, if there is a suspicion that regularly spaced sample points may follow some pattern in the population values, then systematic sampling should be avoided. Simple random sampling and stratified random sampling are not affected by any patterns in the population, and it is therefore safer to use these when patterns may be present Assuming that yi is the ith observation along the line, it is assumed that yi−1 and yi are both measures of the variable of interest in approximately the same location. The difference squared (yi − y 2 is then an estimate of twice the variance of what can be ) i−1 thought of as the local sampling errors. With a systematic sample of size n, there are n − 1 such squared differences, leading to a combined estimate of the variance of local sampling errors of On this basis, the estimate of the standard error of the mean of the systematic sample is No finite sampling correction is applied when estimating the standard error on the presumption that the number of potential sampling points in the study area is very large. Once the standard error is estimated the CI is estimated. To estimate the population mean, we may use Simple Random Sampling approach Stratified sampling approach Systematic approach as previously described. Simple Random Sampling approach: The mean and standard deviation of the 66 observations are y = 3937.7 and s = 6646.5, in units of pg⋅g−1 (picograms per gram, i.e., parts per 1012).Therefore, if the sample is treated as being equivalent to a simple random sample, then the estimated standard error is = 6646.5/√66 = 818.1, and the approximate 95% confidence limits for the mean over the sampled region are 3937.7 ± 1.96 × 818.1, or 2334.1 to 5541.2. The second method for assessing the accuracy of the mean of a systematic sample, as described previously, entails dividing the samples into strata. This division was done arbitrarily using 11 strata of six observations each, as shown in Figure 2.7, and the calculations for the resulting stratified sample are shown in Table 2.5. The estimated mean level for total PCBs in the area is still 3937.7 pg⋅g−1. However, the standard error calculated from the stratification is 674.2, which is lower than the value of 818.1 found by treating the data as coming from a simple random sample. The approximate 95% confidence limits from stratification are 3937.7 ± 1.96 × 674.2, or 2616.2 to 5259.1. Finally, the standard error can be estimated using equations With the sample points in the order shown in Figure 2.7, but with the closest points connected between the sets of six observations that formed the strata before. This produces an estimated standard deviation of sL = 5704.8 for small-scale sampling errors, and an estimated standard error for the mean in the study area of SÊ(y) = 5704.8/√66 = 702.2. By this method, the approximate 95% confidence limits for the area mean are 3937.7 ± 1.96 × 702.2, or 2561.3 to 5314.0 pg⋅g−1. This is quite close to what was obtained using the stratification method. Other Sampling Methods: Cluster sampling refers to a method by which the population is divided into groups, or clusters, that are each intended to be mini-populations. A random sample of m clusters is selected. With cluster sampling, groups of sample units that are close in some sense are randomly sampled together, and then all measured. The idea is that this will reduce the cost of sampling each unit, so that more units can be measured than would be possible if they were all sampled individually. This advantage is offset to some extent by the tendency of sample units that are close together to have similar measurements. Therefore, in general, a cluster sample of n units will give estimates that are less precise than a simple random sample of n units. Nevertheless, cluster sampling may give better value for money than the sampling of individual units. Multi-stage sampling: With multistage sampling, the sample units are regarded as falling within a hierarchical structure. Random sampling is then conducted at the various levels within this structure. For example, suppose that there is interest in estimating the mean of some waterquality variable in the lakes in a very large area such as a whole country. The country might then be divided into primary sampling units consisting of states or provinces; each primary unit might then consist of a number of counties, and each county might contain a certain number of lakes. A three-stage sample of lakes could then be obtained by first randomly selecting several primary sampling units, next randomly selecting one or more counties (second-stage units) within each sampled primary unit, and finally randomly selecting one or more lakes (third-stage units) from each sampled county. This type of sampling plan may be useful when a hierarchical structure already exists, or when it is simply convenient to sample at two or more levels. Sample Size Principles of Sample Size selection: First, it is worth noting that, as a general rule, the sample size for a study should be large enough so that important parameters are estimated with sufficient precision to be useful, but it should not be unnecessarily large. This is because, on the one hand, small samples with unacceptable levels of error are hardly worth doing at all while, on the other hand, very large samples giving more precision than is needed are a waste of time and resources. In fact, the reality is that the main danger in this respect is that samples will be too small. Another general approach to sample-size determination that can usually be used fairly easily is trial and error. For example, a spreadsheet can be set up to carry out the analysis that is intended for a study, and the results of using different sample sizes can be explored using simulated data drawn from the type of distribution or distributions that are thought likely to occur in practice. Determining Sample Size (for means) Z n 2 E 2 2 Z = desired level of confidence = population standard deviation E = acceptable amount of sampling error Example: Suppose we are interested to study the pollution level in Islamabad and we can tolerate the variation upto 10 ppm. If the population standard deviation is estimated at 5 ppm, then how large sample will be required for confidence level of 95%. Here Z = desired level of confidence= 1.96 = population standard deviation= 10 E = acceptable amount of sampling error= 3 Z 2 2 n E 2 = (1.96)2(10) 2 / (3)2 = 42.69 say 43 Approaches to estimate population standard deviation Use results from a prior survey Conduct a pilot survey Use secondary data Use judgment Proportions Often we are interested in estimating proportions or percentages rather than means – For example, the proportion of people who have tested their allergy – We refer to the population proportion as (e.g., 75% or .75) Conversely, the proportion that have not tested the allergy are referred to as 1 (e.g., 25% or .25) Proportions The sampling distribution of the proportion is a relative frequency distribution of the sample proportions of a large number of random samples of a given size drawn from a particular population. Determining Sample Size (for proportions) z [ (1 )] n 2 E 2 Z = desired level of confidence = proportion of successes in the population. E = acceptable amount of sampling error. Example 2-Solve yourself It is desired to workout the average age of students in NUST within an accuracy of 0.5 years. Suppose that an accuracy level of 99% has been selected. Find out the sample size if the population standard deviation is estimated as 1 year.