Lecture-5
Sampling Methods
3. Systematic Sampling etc.
Engr. Dr. Attaullah Shah
RECAP:
Simple Random Sampling
• Used when there is inadequate information for developing a conceptual
model for a site or for stratifying a site
• Any sample in which the probabilities of selection are known
• Sampling units are chosen by a method that uses chance to determine
selection

Estimation of population mean

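Assume that a simple random sample of size n is selected without
replacement from a population of N units, and that the variable of
interest has values y1, y2, …, yn for the sampled units. Then the
sample mean is:

\[ \bar{y} = \frac{1}{n} \sum_{i=1}^{n} y_i \]

Sample variance:

\[ s^2 = \frac{1}{n-1} \sum_{i=1}^{n} (y_i - \bar{y})^2 \]

Sample coefficient of variation:

\[ \widehat{CV}(y) = \frac{s}{\bar{y}} \]

where s, the square root of s², is the sample standard deviation.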

These values that are calculated from samples are often referred to
as sample statistics. The corresponding population values are the
population mean μ, the population variance σ², the population
standard deviation σ, and the population coefficient of variation
σ/μ.

The sample mean ȳ is an estimator of the population mean μ. The
difference ȳ − μ is then the sampling error in the mean. This error will
vary from sample to sample if the sampling process is repeated, and it
can be shown theoretically that if this is done a large number of times,
then the error will average out to zero. For this reason, the sample
mean is said to be an unbiased estimator of the population mean.

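It can also be shown theoretically that the distribution of ȳ that is
obtained by repeating the process of simple random sampling without
replacement has the variance

\[ \mathrm{Var}(\bar{y}) = \frac{\sigma^2}{n} \left( 1 - \frac{n}{N} \right) \]

The factor [1 − (n/N)] is called the finite-population correction
because it makes an allowance for the size of the sample relative to
the size of the population.
The square root of Var(ȳ) is commonly called the standard error of the
sample mean. It will be denoted here by

\[ SE(\bar{y}) = \sqrt{\mathrm{Var}(\bar{y})} \]

Since the population variance σ² is not usually known, it is replaced
by the sample variance s², and the estimate of the variance of the
sample mean is given as:

\[ \widehat{\mathrm{Var}}(\bar{y}) = \frac{s^2}{n} \left( 1 - \frac{n}{N} \right) \]

The square root of this quantity is the estimated standard error of
the mean:

\[ \widehat{SE}(\bar{y}) = \sqrt{\widehat{\mathrm{Var}}(\bar{y})} \]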

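The accuracy of a sample mean for estimating the population
mean is often represented by a 100(1 − α)% confidence interval
for the population mean of the form

\[ \bar{y} \pm z_{\alpha/2} \, \widehat{SE}(\bar{y}) \]

where SÊ(ȳ) is the estimated standard error of the mean and z_{α/2} is
the value exceeded with probability α/2 by the standard normal
distribution.

Commonly used confidence intervals are:

\[ \bar{y} \pm 1.64 \, \widehat{SE}(\bar{y}) \quad (90\%\ \text{confidence}) \]
\[ \bar{y} \pm 1.96 \, \widehat{SE}(\bar{y}) \quad (95\%\ \text{confidence}) \]
\[ \bar{y} \pm 2.58 \, \widehat{SE}(\bar{y}) \quad (99\%\ \text{confidence}) \]

For smaller samples (n less than 25), the t-distribution with n − 1
degrees of freedom is used and the CI is given as:

\[ \bar{y} \pm t_{\alpha/2,\,n-1} \, \widehat{SE}(\bar{y}) \]

It is assumed that the variable being measured is approximately
normally distributed in the population being sampled. The interval may
not be satisfactory for samples from very asymmetric distributions.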
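The estimation steps for the mean can be sketched in Python as follows. This is a minimal illustration, not a definitive implementation: the function name `srs_mean_ci`, the data values, and the population size N = 50 are all hypothetical, and the normal quantile is used even though the t-distribution would be preferred for a sample this small.

```python
import math
from statistics import NormalDist, mean, variance

def srs_mean_ci(sample, N, alpha=0.05):
    """Mean, estimated standard error, and approximate CI for a simple
    random sample drawn without replacement from N units."""
    n = len(sample)
    y_bar = mean(sample)
    s2 = variance(sample)              # sample variance, divisor n - 1
    var_hat = (s2 / n) * (1 - n / N)   # includes the finite-population correction
    se_hat = math.sqrt(var_hat)
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    return y_bar, se_hat, (y_bar - z * se_hat, y_bar + z * se_hat)

# Hypothetical data: 5 units sampled from a population of 50
sample = [12.0, 15.0, 9.0, 14.0, 10.0]
y_bar, se_hat, ci = srs_mean_ci(sample, N=50)
print(y_bar, se_hat, ci)
```

Dropping the factor `(1 - n / N)` gives the usual infinite-population standard error s/√n.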
Estimation of Population Totals

The total area damaged by an oil spill is likely to be of more concern
than the average area damaged on individual sample units.
If the population size N is known and an estimate ȳ of the population
mean is available, then the population total is estimated as

\[ \hat{T} = N \bar{y} \]

For example, if a population consists of 500 plots of land, with an
estimated mean amount of oil-spill damage of 15 m², then the estimated
total amount of damage for the whole population is 500 × 15 = 7500 m².
The sampling variance of this estimator is:

\[ \mathrm{Var}(\hat{T}) = N^2 \, \mathrm{Var}(\bar{y}) \]

Standard error (i.e., standard deviation) is

\[ SE(\hat{T}) = N \, SE(\bar{y}) \]

Estimates of the variance and standard error are:

\[ \widehat{\mathrm{Var}}(\hat{T}) = N^2 \, \widehat{\mathrm{Var}}(\bar{y}), \qquad \widehat{SE}(\hat{T}) = N \, \widehat{SE}(\bar{y}) \]

Approximate 100(1 − α)% confidence interval for the true
population total:

\[ \hat{T} \pm z_{\alpha/2} \, \widehat{SE}(\hat{T}) \]
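The oil-spill example can be worked through in a few lines of Python. The population size (500 plots) and mean (15 m²) come from the text; the standard error of the mean, 1.2 m², is an assumed value chosen only to make the sketch complete.

```python
N = 500          # number of plots in the population (from the example)
y_bar = 15.0     # estimated mean damage per plot, m^2 (from the example)
se_mean = 1.2    # ASSUMED estimated standard error of the mean, m^2
z = 1.96         # standard normal value for 95% confidence

total_hat = N * y_bar      # estimated population total
se_total = N * se_mean     # standard error scales by N
ci_total = (total_hat - z * se_total, total_hat + z * se_total)

print(total_hat, se_total, ci_total)
```

Because the total is just N times the mean, its confidence limits are N times the limits for the mean.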
Estimation of Proportions

Two situations must be distinguished: proportions measured on sample
units and proportions of sample units.
Proportions measured on sample units, such as the proportions of the
units covered by a certain type of vegetation, can be treated like any
other variables measured on the units.
Proportions of sample units are different because the interest is in
which units are of a particular type. An example of this situation is
where the sample units are blocks of land, and it is required to
estimate the proportion of all the blocks that show evidence of
damage from pollution.
Suppose that a simple random sample of size n is selected without
replacement from a population of size N and contains r units with
some characteristic of interest. Then the sample proportion is p̂ =
r/n, and it can be shown that this has a sampling variance of

\[ \mathrm{Var}(\hat{p}) = \frac{p(1-p)}{n} \left( 1 - \frac{n}{N} \right) \]

where p is the population proportion. Estimated values for the variance
and standard error can be obtained by replacing the population
proportion in this equation with the sample proportion p̂. Thus the
estimated variance is

\[ \widehat{\mathrm{Var}}(\hat{p}) = \frac{\hat{p}(1-\hat{p})}{n} \left( 1 - \frac{n}{N} \right) \]

and the estimated standard error is SÊ(p̂) = √Vâr(p̂).
Using the estimated standard error, an approximate 100(1 − α)%
confidence interval for the true proportion is

\[ \hat{p} \pm z_{\alpha/2} \, \widehat{SE}(\hat{p}) \]
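A short Python sketch of the proportion estimate and its approximate confidence interval follows. The function name `proportion_ci` and the counts (r = 30 damaged blocks in a sample of n = 100 from N = 1000) are hypothetical values used only for illustration.

```python
import math
from statistics import NormalDist

def proportion_ci(r, n, N, alpha=0.05):
    """Sample proportion, estimated variance, and approximate CI for a
    simple random sample of n units drawn without replacement from N."""
    p_hat = r / n
    var_hat = (p_hat * (1 - p_hat) / n) * (1 - n / N)
    se_hat = math.sqrt(var_hat)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return p_hat, var_hat, (p_hat - z * se_hat, p_hat + z * se_hat)

# Hypothetical survey: 30 of 100 sampled blocks show damage, 1000 blocks in all
p_hat, var_hat, ci_p = proportion_ci(r=30, n=100, N=1000)
print(p_hat, var_hat, ci_p)
```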





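Stratified Random Sampling

Assume that K strata have been chosen, with the ith
of these having size Ni and the total population
size being ΣNi = N.
Then if a random sample of size ni is taken from
the ith stratum, the sample mean ȳi will be an
unbiased estimate of the true stratum mean μi,
with estimated variance

\[ \widehat{\mathrm{Var}}(\bar{y}_i) = \frac{s_i^2}{n_i} \left( 1 - \frac{n_i}{N_i} \right) \]

where si is the sample standard deviation within
the stratum.
In terms of the true strata means, the overall
population mean is the weighted average

\[ \mu = \sum_{i=1}^{K} \frac{N_i}{N} \, \mu_i \]

and the corresponding sample estimate is

\[ \bar{y}_s = \sum_{i=1}^{K} \frac{N_i}{N} \, \bar{y}_i \]

with estimated variance

\[ \widehat{\mathrm{Var}}(\bar{y}_s) = \sum_{i=1}^{K} \left( \frac{N_i}{N} \right)^2 \frac{s_i^2}{n_i} \left( 1 - \frac{n_i}{N_i} \right) \]

The estimated standard error of ȳs is SÊ(ȳs), the square root of the
estimated variance, and an approximate 100(1 − α)% confidence
interval for the population mean is given by:

\[ \bar{y}_s \pm z_{\alpha/2} \, \widehat{SE}(\bar{y}_s) \]

If the population total is of interest, then this can be estimated by

\[ \hat{T}_s = N \bar{y}_s \]

The estimated standard error of the population total is

\[ \widehat{SE}(\hat{T}_s) = N \, \widehat{SE}(\bar{y}_s) \]

Again, an approximate 100(1 − α)% confidence interval takes the
form

\[ \hat{T}_s \pm z_{\alpha/2} \, \widehat{SE}(\hat{T}_s) \]

When a stratified sample of points in a spatial region is
carried out, it will often be the case that there are an
unlimited number of sample points that can be taken
from any of the strata, so that Ni and N are infinite.
The equation for the estimated mean can then be modified to

\[ \bar{y}_s = \sum_{i=1}^{K} w_i \, \bar{y}_i \]

and the equation for the estimated variance becomes

\[ \widehat{\mathrm{Var}}(\bar{y}_s) = \sum_{i=1}^{K} w_i^2 \, \frac{s_i^2}{n_i} \]

where wi, the proportion of the total study area within
the ith stratum, replaces Ni/N.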
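The stratified estimate of the mean can be sketched in Python as below. The function name `stratified_mean_ci` and the two strata (sizes 100 and 300, with three and four observations) are hypothetical; each stratum is passed as a pair of its size Ni and its sampled values.

```python
import math
from statistics import NormalDist, mean, variance

def stratified_mean_ci(strata, alpha=0.05):
    """strata: list of (N_i, sampled_values) pairs, one per stratum."""
    N = sum(N_i for N_i, _ in strata)
    y_s = 0.0
    var_s = 0.0
    for N_i, values in strata:
        n_i = len(values)
        w_i = N_i / N                       # stratum weight N_i / N
        y_s += w_i * mean(values)
        # within-stratum variance of the mean, with finite-population correction
        var_s += w_i ** 2 * (variance(values) / n_i) * (1 - n_i / N_i)
    se = math.sqrt(var_s)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return y_s, se, (y_s - z * se, y_s + z * se)

# Hypothetical strata
strata = [(100, [10.0, 12.0, 14.0]), (300, [20.0, 22.0, 24.0, 26.0])]
y_s, se_s, ci_s = stratified_mean_ci(strata)
print(y_s, se_s, ci_s)
```

For the spatial case with unlimited sample points, the same loop applies with `w_i` supplied directly as the area proportion and the factor `(1 - n_i / N_i)` omitted.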
Summary table (columns: Parameter, Random Sampling, Stratified Sampling)
Parameters listed:
• Variance of sample mean
• Estimate of variance of sample mean
• Confidence interval for population mean
• Commonly used CI
• For sample less than 30
• Est error of Var of pop total
• CI for total pop mean
• CI for true proportion of
Statistical Sampling
Systematic random sampling refers to a
sampling technique that involves selecting
every kth item in the population after
randomly selecting a starting point between
1 and k. The value of k is determined as the
ratio of the population size to the desired
sample size.
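A minimal Python sketch of this selection rule is shown below. It assumes the population is available as a list and takes k as the integer part of N/n; the function name `systematic_sample` is hypothetical.

```python
import random

def systematic_sample(population, n, rng=None):
    """Select every kth item after a random start between 1 and k."""
    rng = rng or random.Random()
    N = len(population)
    k = N // n                      # sampling interval (integer part of N/n)
    start = rng.randint(1, k)       # random 1-based starting position
    positions = range(start, N + 1, k)
    # Truncate to n items in case N is not an exact multiple of n
    return [population[i - 1] for i in positions][:n]

# Hypothetical population of 100 numbered units, sample of 10
units = list(range(1, 101))
chosen = systematic_sample(units, 10, random.Random(0))
print(chosen)
```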
Sampling Design II: Systematic Sampling
Design:
A Grid Scheme is most common
FOR 220 Aerial Photo Interpretation and Forest Measurements
Arguments:
For:
Regular spacing of sample units may yield efficient
estimates of populations under certain conditions.
Against:
Accuracy of population estimates can be low if there is
periodic or cyclic variation inherent in the population.
Arguments:
For:
There is no practical alternative to assuming that
populations are distributed in a random order across
the landscape.
Against:
Simple random sampling statistical techniques can’t
logically be applied to a systematic design unless
populations are assumed to be randomly
distributed across the landscape.





Systematic sampling is often used as an alternative to simple random
sampling or stratified random sampling for two reasons.
First, the process of selecting sample units is simpler for systematic
sampling.
Second, under certain circumstances, estimates can be expected to be
more precise for systematic sampling because the population is
covered more evenly.
It is common to analyze a systematic sample as if it were a simple
random sample. In particular, population means, totals, and
proportions are estimated using the equations already discussed,
including those for the estimation of standard errors and the
determination of confidence limits.
The assumption is then made that, because of the way that the
systematic sample covers the population, this will, if anything, result
in standard errors that tend to be somewhat too large and confidence
limits that tend to be somewhat too wide. That is to say, the
assessment of the level of sampling errors is assumed to be
conservative.
The only time that this procedure is liable to give a misleading
impression about the true level of sampling errors is when the
population being sampled has some cyclic variation in observations, so
that the regularly spaced observations that are selected tend to all be
either higher or lower than the population mean.
Therefore, if there is a suspicion that regularly spaced sample points
may follow some pattern in the population values, then systematic
sampling should be avoided. Simple random sampling and stratified
random sampling are not affected by any patterns in the population, and
it is therefore safer to use these when patterns may be present.

Assuming that yi is the ith observation along the line, it is assumed
that yi−1 and yi are both measures of the variable of interest in
approximately the same location. The squared difference (yi − yi−1)²
is then an estimate of twice the variance of what can be
thought of as the local sampling errors. With a systematic sample
of size n, there are n − 1 such squared differences, leading to a
combined estimate of the variance of local sampling errors of

\[ s_L^2 = \frac{1}{2} \sum_{i=2}^{n} \frac{(y_i - y_{i-1})^2}{n-1} \]

On this basis, the estimate of the standard error of the mean of the
systematic sample is

\[ \widehat{SE}(\bar{y}) = \frac{s_L}{\sqrt{n}} \]

No finite-population correction is applied when estimating the
standard error, on the presumption that the number of potential
sampling points in the study area is very large.
Once the standard error has been estimated, the confidence interval is
calculated in the usual way.
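The successive-differences calculation can be sketched in Python as follows. The function name `successive_difference_se` and the five data values are hypothetical; the observations are assumed to be listed in their order along the line joining the sample points.

```python
import math

def successive_difference_se(y):
    """Standard error of the mean from squared differences between
    successive observations (no finite-population correction)."""
    n = len(y)
    sum_sq = sum((y[i] - y[i - 1]) ** 2 for i in range(1, n))
    s_L2 = 0.5 * sum_sq / (n - 1)   # variance of local sampling errors
    return math.sqrt(s_L2) / math.sqrt(n)

# Hypothetical observations in order along the line
y = [3.0, 5.0, 4.0, 6.0, 5.0]
se_sys = successive_difference_se(y)
print(se_sys)
```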



To estimate the population mean, we may use
• Simple random sampling approach
• Stratified sampling approach
• Systematic approach as previously described.

Simple Random Sampling approach:

The mean and standard deviation of the 66 observations are ȳ
= 3937.7 and s = 6646.5, in units of pg⋅g−1 (picograms per
gram, i.e., parts per 10¹²). Therefore, if the sample is treated as
being equivalent to a simple random sample, then the
estimated standard error is SÊ(ȳ) = 6646.5/√66 = 818.1, and
the approximate 95% confidence limits for the mean over the
sampled region are 3937.7 ± 1.96 × 818.1, or 2334.1 to
5541.2.
The second method for assessing the accuracy of the mean of a systematic
sample, as described previously, entails dividing the samples into strata.
This division was done arbitrarily using 11 strata of six observations
each, as shown in Figure 2.7, and the calculations for the resulting
stratified sample are shown in Table 2.5. The estimated mean level for
total PCBs in the area is still 3937.7 pg⋅g−1. However, the standard error
calculated from the stratification is 674.2, which is lower than the value of
818.1 found by treating the data as coming from a simple random
sample. The approximate 95% confidence limits from stratification are
3937.7 ± 1.96 × 674.2, or 2616.2 to 5259.1.

Finally, the standard error can be estimated from the differences
between successive observations, as described above, with the sample
points taken in the order shown in Figure 2.7, but with
the closest points connected between the sets of six observations
that formed the strata before.
This produces an estimated standard deviation of sL = 5704.8 for
small-scale sampling errors, and an estimated standard error for
the mean in the study area of SÊ(ȳ) = 5704.8/√66 = 702.2.
By this method, the approximate 95% confidence limits for the area
mean are 3937.7 ± 1.96 × 702.2, or 2561.3 to 5314.0 pg⋅g−1.
This is quite close to what was obtained using the stratification
method.
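The three sets of confidence limits quoted in this example can be checked with a short Python calculation. All numbers are taken from the text; the stratified standard error (674.2) is quoted rather than recomputed because Table 2.5 is not reproduced here. Small differences in the last digit arise from rounding the standard errors.

```python
import math

n = 66            # number of observations
y_bar = 3937.7    # sample mean, pg/g
z = 1.96          # standard normal value for 95% confidence

def limits(se):
    """Approximate 95% confidence limits for the mean."""
    return (y_bar - z * se, y_bar + z * se)

se_srs = 6646.5 / math.sqrt(n)    # treated as a simple random sample
se_strat = 674.2                  # quoted from the stratified calculation
se_diff = 5704.8 / math.sqrt(n)   # successive-differences method

for label, se in [("simple random", se_srs),
                  ("stratified", se_strat),
                  ("successive differences", se_diff)]:
    low, high = limits(se)
    print(f"{label}: SE = {se:.1f}, limits {low:.1f} to {high:.1f}")
```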
Other Sampling Methods:
Cluster sampling refers to a method by which the population is
divided into groups, or clusters, that are each intended to be
mini-populations. A random sample of m clusters is selected.


With cluster sampling, groups of sample units that are close
in some sense are randomly sampled together, and then all
measured. The idea is that this will reduce the cost of
sampling each unit, so that more units can be measured
than would be possible if they were all sampled individually.
This advantage is offset to some extent by the tendency of
sample units that are close together to have similar
measurements. Therefore, in general, a cluster sample of n
units will give estimates that are less precise than a simple
random sample of n units. Nevertheless, cluster sampling
may give better value for money than the sampling of
individual units.
Multi-stage sampling:




With multistage sampling, the sample units are regarded as falling
within a hierarchical structure. Random sampling is then
conducted at the various levels within this structure. For example,
suppose that there is interest in estimating the mean of some water-quality variable in the lakes in a very large area such as a whole
country.
The country might then be divided into primary sampling units
consisting of states or provinces; each primary unit might then
consist of a number of counties, and each county might contain a
certain number of lakes.
A three-stage sample of lakes could then be obtained by first
randomly selecting several primary sampling units, next randomly
selecting one or more counties (second-stage units) within each
sampled primary unit, and finally randomly selecting one or more
lakes (third-stage units) from each sampled county.
This type of sampling plan may be useful when a hierarchical
structure already exists, or when it is simply convenient to sample
at two or more levels.
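The three-stage selection of lakes can be sketched in Python as below. The sampling frame is a hypothetical nested dictionary (state, then county, then list of lakes), and the function name `three_stage_sample` and all place names are invented for illustration.

```python
import random

def three_stage_sample(frame, n_primary, n_counties, n_lakes, seed=None):
    """Randomly select states, then counties within each selected state,
    then lakes within each selected county."""
    rng = random.Random(seed)
    chosen = []
    for state in rng.sample(sorted(frame), n_primary):
        counties = frame[state]
        k = min(n_counties, len(counties))
        for county in rng.sample(sorted(counties), k):
            lakes = counties[county]
            m = min(n_lakes, len(lakes))
            for lake in rng.sample(lakes, m):
                chosen.append((state, county, lake))
    return chosen

# Hypothetical frame: 3 states, 2 counties each, 3 lakes per county
frame = {
    f"state{s}": {
        f"county{s}{c}": [f"lake{s}{c}{l}" for l in range(3)]
        for c in range(2)
    }
    for s in range(3)
}
lakes_sample = three_stage_sample(frame, n_primary=2, n_counties=1, n_lakes=2, seed=1)
print(lakes_sample)
```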
Sample Size

Principles of Sample Size selection:



First, it is worth noting that, as a general rule, the sample size for
a study should be large enough so that important parameters are
estimated with sufficient precision to be useful, but it should not
be unnecessarily large. This is because, on the one hand, small
samples with unacceptable levels of error are hardly worth doing
at all while, on the other hand, very large samples giving more
precision than is needed are a waste of time and resources. In
fact, the reality is that the main danger in this respect is that
samples will be too small.
Another general approach to sample-size determination that can
usually be used fairly easily is trial and error. For example, a
spreadsheet can be set up to carry out the analysis that is
intended for a study, and the results of
using different sample sizes can be explored using simulated data
drawn from the type of distribution or distributions that are
thought likely to occur in practice.
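The trial-and-error approach can equally be carried out in code rather than a spreadsheet. The Python sketch below is one possible version: it assumes, purely for illustration, that the data are lognormal with the parameters shown, and it reports the average half-width of a 95% confidence interval for the mean at several candidate sample sizes. The function name `mean_ci_half_width` is hypothetical.

```python
import math
import random
from statistics import mean, stdev

def mean_ci_half_width(n, reps=200, seed=1):
    """Average half-width of the 95% CI for the mean over simulated samples."""
    rng = random.Random(seed)
    widths = []
    for _ in range(reps):
        # ASSUMED distribution: lognormal with mu = 3.0, sigma = 0.5
        sample = [rng.lognormvariate(3.0, 0.5) for _ in range(n)]
        widths.append(1.96 * stdev(sample) / math.sqrt(n))
    return mean(widths)

results = {n: mean_ci_half_width(n) for n in (10, 40, 160)}
for n, width in results.items():
    print(n, round(width, 2))
```

The smallest sample size whose half-width is acceptable for the study would then be the one chosen.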
Determining Sample Size (for means)

\[ n = \frac{Z^2 \sigma^2}{E^2} \]

Z = standard normal value for the desired level of confidence
σ = population standard deviation
E = acceptable amount of sampling error
Example:

Suppose we are interested in studying the pollution level in Islamabad
and we can tolerate a sampling error of up to 3 ppm. If the population
standard deviation is estimated at 10 ppm, how large a sample will
be required for a confidence level of 95%?
Here
Z = standard normal value for 95% confidence = 1.96
σ = population standard deviation = 10
E = acceptable amount of sampling error = 3

\[ n = \frac{Z^2 \sigma^2}{E^2} = \frac{(1.96)^2 (10)^2}{(3)^2} = 42.68, \text{ say } 43 \]
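The sample-size calculation for a mean can be written as a small Python function. The function name `sample_size_for_mean` is hypothetical; the inputs are those used in the calculation above (Z = 1.96, standard deviation 10, acceptable error 3), and the result is rounded up to the next whole unit.

```python
import math

def sample_size_for_mean(z, sigma, error):
    """Sample size needed to estimate a mean to within +/- error."""
    return math.ceil((z * sigma / error) ** 2)

n_required = sample_size_for_mean(z=1.96, sigma=10, error=3)
print(n_required)
```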
Approaches to estimate population standard
deviation
Use results from a prior survey
Conduct a pilot survey
Use secondary data
Use judgment
Proportions
Often we are interested in estimating proportions
or percentages rather than means
– For example, the proportion of people who have
been tested for an allergy
– We refer to the population proportion as π
(e.g., 75% or .75)
– Conversely, the proportion who have not
been tested is referred to as
1 − π (e.g., 25% or .25)
The sampling distribution of the proportion
is a relative frequency distribution of the
sample proportions of a large number of
random samples of a given size drawn from
a particular population.
Determining Sample Size (for proportions)

\[ n = \frac{Z^2 \, [\pi (1 - \pi)]}{E^2} \]

Z = standard normal value for the desired level of confidence
π = proportion of successes in the population
E = acceptable amount of sampling error
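The corresponding calculation for a proportion can be sketched in Python. The function name `sample_size_for_proportion` is hypothetical; the proportion 0.75 matches the illustrative value mentioned for the population proportion, while the acceptable error of 0.05 and the 95% confidence level are assumed values.

```python
import math

def sample_size_for_proportion(z, pi, error):
    """Sample size needed to estimate a proportion to within +/- error."""
    return math.ceil(z ** 2 * pi * (1 - pi) / error ** 2)

n_prop = sample_size_for_proportion(z=1.96, pi=0.75, error=0.05)
print(n_prop)
```

When no prior estimate of the proportion is available, using 0.5 gives the largest (most cautious) sample size, because π(1 − π) is greatest there.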
Example 2 (solve yourself)

It is desired to work out the average age of
students at NUST to within an accuracy of 0.5
years. Suppose that a confidence level of 99% has
been selected. Find the sample size if the
population standard deviation is estimated as 1
year.