Download n - IASRI

Survey
yes no Was this document useful for you?
   Thank you for your participation!

* Your assessment is very important for improving the workof artificial intelligence, which forms the content of this project

Document related concepts

History of statistics wikipedia , lookup

Bootstrapping (statistics) wikipedia , lookup

Taylor's law wikipedia , lookup

Resampling (statistics) wikipedia , lookup

Student's t-test wikipedia , lookup

Gibbs sampling wikipedia , lookup

Sampling (statistics) wikipedia , lookup

Transcript
2: An Overview of Various Sampling Schemes…..
AN OVERVIEW OF VARIOUS SAMPLING SCHEMES AND
DETERMINATION OF SAMPLE SIZES
K. K. Tyagi
Indian Agricultural Statistics Research Institute, New Delhi-110012
1. INTRODUCTION
The prime objective of a sample survey is to obtain inferences about the characteristic of a
population. Population is defined as a group of units defined according to the objectives of
the survey. The population may consist of all the households in a village /
locality, all the fields under a particular crop etc. We may also consider a population of
persons, families, fields, animals in a region, or a population of trees, birds in a forest
depending upon the nature of data required. The information that we seek about the
population is normally, the total number of units, aggregate values of various characteristics,
averages of these characteristics per unit, proportions of units possessing specified attributes
etc. The data can be collected in two different ways. The first one is complete enumeration
which means collection of data on the survey characteristics from each unit of the population.
This type of method is used in censuses of population, agriculture, retail stores, industrial
establishments etc. The other approach is based on the use of sampling methods and
consists of collection of data on survey characteristics from selected units of the
population. The first approach can be considered as its special case. A sampling method is a
scientific and objective procedure of selecting units from the population and provides a
sample that is expected to be representative of the population. A sampling method makes it
possible to estimate the population parameters while reducing at the same time the size of
survey operations. Some of the advantages of sample surveys as compared to complete
enumeration are reduction in cost, greater speed, wider scope and higher accuracy. A
function of the unit values of the sample is called an estimator. Various measures, like bias,
mean square errors, variance etc. are used to assess the performance of the estimator.
The main problem in sample surveys is the choice of a proper sampling strategy, which
essentially comprise of a sampling method and the estimation procedure. In the choice of a
sampling method there are some methods of selection while some others are control measures
which help in grouping the population before the selection process. In the methods of
selection, schemes such as simple random sampling, systematic sampling and varying
probability sampling are generally used. Among the control measures are procedures such as
stratified sampling, cluster sampling and multi-stage sampling etc. A combination of control
measures along with the method of selection is called the sampling scheme.
We shall describe in brief the different sampling methods in the following sections:
2. SIMPLE RANDOM SAMPLING
Simple random sampling (SRS) is a method of selecting 'n' units out of 'N' units such that
each one of the possible non-distinct samples has an equal chance of its being chosen. In
15
2: An Overview of Various Sampling Schemes…..
practice, a simple way of obtaining a simple random sample is to draw the units one by one
with a known probability of selection assigned to each unit of the population at the first and
each subsequent draw. The successive draws may be made with or without replacing the units
selected in the preceding draws. The former is called the procedure of sampling with
replacement while the latter sampling without replacement.
The units in the population are numbered from 1 to N. A series of random numbers between
1 and N are then drawn either by means of a Table of random numbers or by means of a
computer program that produces such a Table.
It can be seen that in simple random sampling without replacement (SRSWOR), the
probability of selecting the units in the sample is equal for all the units. Let Y be the
characteristic of interest. The N units that comprise the population are denoted by
 y  y2    y N 1  y N   1 N y , denote
y1 , y2 ,..., y N . Let the population mean, YN  1
 i
N
N i 1
the parameter of interest.
Let us draw a simple random sample, without replacement, of size n. We denote by
 y  y 2    yi    y n1  y n  1 n
yn  1
  yi , the sample mean, an unbiased estimator of
n
n i 1
the population mean YN . In other words, the average value of y n over all possible samples
( N C n in this case) is equal to YN .
Also, the sampling variance of y n is given by
1 1
V ( yn )  (  ) S 2
n N
where S 2 
(1.1)
1 N
( yi  YN ) 2 is the population mean square.

N 1 i 1
An unbiased estimator of this sampling variance is given by
1 1
Vˆ ( yn )  (  ) s 2
n N
where s 2 
(1.2)
1 n
( yi  yn ) 2 is the sample mean square.

n1 i 1
16
2: An Overview of Various Sampling Schemes…..
We now take an example to illustrate the concepts of unbiasedness of sample mean ( y n ) and
sample
mean
square
(s2).
Suppose
we
have
a
population
of
N = 5 units, say U1, U2, U3, U4, U5. The characteristic under study, Y is the yield in kg per
plot. We want to estimate YN , the population mean of Y on the basis of a sample of size n = 2
drawn by SRSWOR.
The data and the analysis are as given in the Table below:
Units ( U i )
U1
U2
U3
U4
U5
Total
Values of y i (kg / plot)
32
28
25
30
35
150
Population Mean (YN ) 
y
 y1  y 2  y3  y 4  y5 
5
1024
2
i

784
32  28  25  30  35 150

 30 kg
5
5
625
900
1225
4558
 N 2

  Yi  NYN2 
 = 4558  5  30  30  4558  4500  58  14.5
Population Mean Square ( S 2 )   i 1
4
4
4
N 1
Units in the
sample
Sample
Observations
U1, U2
U1, U3
U1, U4
U1, U5
U2, U3
U2, U4
U2, U5
U3, U4
U3, U5
U4, U5
32, 28
32, 25
32, 30
32, 35
28, 25
28, 30
28, 35
25, 30
25, 35
30, 35
Sample Mean
Sample Mean Square
(s2)
30.0
28.5
31.0
33.5
26.5
29.0
31.5
27.5
30.0
32.5
8.0
24.5
2.0
4.5
4.5
2.0
24.5
12.5
50.0
12.5
 yn 
300/10 = 30
Average
145.0/10 = 14.5
A similar approach applies when sampling is with replacement (SRSWR). In this case, there
are Nn possible samples. The estimator of population mean, sampling variance of the
estimator, and estimator of the sampling variance are given as
yn 
1 n
 yi
n i 1
V ( yn ) 
(1.3)
2
(1.4)
n
17
2: An Overview of Various Sampling Schemes…..
and
2
s
Vˆ ( yn ) 
n
where  2 
(1.5)
1 N
( yi  YN ) 2 is the population variance and s 2 is the sample mean square.

N i 1
Consider all possible samples of size N which can be drawn from a given population. In
SRSWOR scheme, there will be in all N C n possible samples. For each sample, one can
compute a statistic, such as the mean, standard deviation etc., which will vary from sample to
sample. In this manner, one can obtain a distribution of the statistic which is called its
sampling distribution.
N
For estimating the population total Y   yi , we have an estimator
i 1
n
Yˆ  N  yi / n  N . y n
(1.6)
i 1
i.e., the population size N multiplied by the sample mean y n .
n
n
i 1
i 1
This estimator can be expressed as Yˆ   wi yi  N / n  yi , where wi  N / n .
The constant N / n is the sampling weight and is the inverse of the sampling fraction n / N .
The estimator of sampling variance of Yˆ is given by

Vˆ Yˆ  Vˆ N . yn   N 2 .Vˆ  yn 
(1.7)
and the estimator of standard error of Yˆ is given by

SEˆ Yˆ  SEˆ N . yn   N . SEˆ  yn 
(1.8)
18
2: An Overview of Various Sampling Schemes…..
From the above, it is evident that under Simple Random Sampling With Replacement
(SRSWR),
 yn  is unbiased for the population mean Y  ,
i)
the sample mean
ii)
sample mean square (s2) is unbiased for the population variance  2 , and
iii)
V  yn  =
N
 
2
n
.
Like-wise, under Simple Random Sampling Without Replacement (SRSWOR),
 yn  is unbiased for the population mean Y  ,
i)
the sample mean
ii)
sample mean square (s2) is unbiased for the population mean square ( S 2 ) , and
iii)
V  yn  =  

1
n
N
1 2
S
N
3. ESTIMATION OF POPULATION PROPORTION
Sometimes, the units in the population are classified into two groups (i) having a particular
characteristic and (ii) not having that characteristic. For example, a crop field may be
irrigated or not irrigated. If it is irrigated, we say that it possesses the characteristic,
‘irrigation’. If it is not irrigated, we say that it does not possess the particular characteristic of
irrigation. If we are interested in estimating the proportion of irrigated fields, the population
of N fields can be defined with variate yi as having value 1 if the field is irrigated, a value 0,
otherwise. If the total number of irrigated fields be N1 out of N, then
N
y
i 1
i
 N1
(2.1)
Thus,
Y
N
1 N
yi  1  P  proportion of irrigated fields in the population

N i 1
N
19
(2.2)
2: An Overview of Various Sampling Schemes…..
and
N
y
i 1
2
i
 N1  N .P
(2.3)
Thus the problem of estimating a population proportion becomes that of estimating a
population mean by defining the variate as above. If n1 units out of a random sample of size n
n
possess that characteristic, the proportion of irrigated fields in the sample is given by p  1 .
n
Thus
n
n
i 1
i 1
 yi  n1   yi2  n. p
(2.4)
Hence an unbiased estimator of P is given by
n
1 n
Pˆ  ( 1 )  p   yi  sample mean
n
n i 1
(2.5)
In sampling without replacement, the variance of p is given by
V ( p) 
( N  n) P.Q
.
( N  1) n
(2.6)
where Q = 1 – P.
In sampling with replacement, the variance of p is given by
V ( p) 
P.Q
n
(2.7)
In sampling without replacement, an unbiased estimator of V(p) is given by
20
2: An Overview of Various Sampling Schemes…..
( N  n) p.q
Vˆ ( p) 
.
N
(n  1)
(2.8)
where q =1 – p.
In sampling with replacement, an unbiased estimator of V(p) is given by
p.q
Vˆ ( p) 
(n  1)
(2.9)
4. EMPIRICAL EXERCISE
The data given below pertains to the average yield of wheat crop (in quintals) pertaining to
108 villages in a Block of a District:
Village Sl. Nos.
1-10
11-20
21-30
31-40
41-50
51-60
61-70
71-80
81-90
91-100
101-108
20
25
16
29
16
24
25
15
28
24
16
21
25
28
19
29
32
31
19
25
34
28
32
24
30
27
32
23
38
32
31
18
30
Yield (in quintals)
41
55
22
32
75
28
29
29
19
42
39
11
40
63
30
8
35
27
31
43
21
19
50
10
6
4
22
28
10
70
29
29
19
64
29
37
26
21
35
36
27
24
20
37
42
38
34
21
35
25
30
36
39
32
34
28
19
31
45
28
29
37
28
71
42
35
19
35
61
18
29
47
43
44
47
i.
Select a random sample of size 10 by simple random sampling without
replacement (SRSWOR) and estimate the average yield along with its
standard error on the basis of selected sample units.
ii.
Set up 95% confidence interval for the population mean.
SOLUTION:
As the population size N = 108 is a three digit number, so for selecting a simple random
sample of size n = 10, we shall select three-digit random numbers from the Random Number
Table (from 000 to 972, which is the highest multiple of 108 up to 999) as follows:
21
2: An Overview of Various Sampling Schemes…..
Random Number
120
572
649
211
327
673
153
317
586
943
Sampling Unit Sl. No.
(Remainder of
Random Number/108)
12
32
01
103
03
25
45
101
46
79
Yield
(q)
25
19
20
30
32
29
63
16
30
28
n
Estimate of Population Average yield = YˆN  y n 
y
i 1
n
i

292
 29.2 q
10
Estimate of Population Total Yˆ  N  yn 108 29.2  3153.6 q
The estimate of standard error of Yˆ is given by
 
SEˆ Yˆ SEˆ N . y n  N . SEˆ  y n 
1 1 
where SEˆ  y n     . s
n N 
where s 2 
1
 yi  y n 2  1  1533.6  170.4 q 2

n 1
10  1
So, s  170.4  13.0537 q
1 
1
Hence, SEˆ  yn    
 13.0537  0.3012 13.0537  3.9322
 10 108 
ii) The 95% confidence interval for population mean is given by
y n  t 0.05 /(1019) df  SEˆ  y n   29.2  2.262  3.9322  29.2  8.89
22
2: An Overview of Various Sampling Schemes…..
So, the 95% confidence interval for population mean is (29.2 - 8.89 to
29.2 + 8.89) i.e. (20.31, 38.09). It can be clearly seen that the population mean
3320
YN 
 30.74 q is contained in this confidence interval. It may be mentioned here that
108
out of total number of possible samples i.e. 108C10 , the population mean will be contained in
such like confidence intervals corresponding to 95% of the total number of samples.
5. STRATIFIED RANDOM SAMPLING
In random sampling techniques, the selection is based on random mechanism. Most of the
times, random numbers are used for selection purposes. The simplest method of selection is
simple random sampling (SRS) in which every sample gets an equal chance of selection. In
SRS, units are selected with equal probability at every draw. We have seen that the precision of
a sample estimate of the population mean depends not only upon the size of the sample and the
sampling fraction but also on the population variability. Selection of a simple random sample
from the entire population may be desirable when we do not have any knowledge about the
nature of population, such as, population variability etc. However, if it is known that the
population has got differential behaviour regarding variability, in different pockets, this
information can be made use of in providing a control in the selection. The approach through
which such a controlled selection can be exercised is called stratified sampling.
In case of SRSWOR, the sampling variance of the sample mean is
1 1
V ( yn )  (  ) S 2
n N
Clearly, the variance decreases as the sample size ‘n’ increases while the population variability
S2 decreases. Now one of the objectives of a good sampling technique is to reduce the sampling
variance. So we have to either increase ‘n’ or decrease S2. Apart from the sample size,
therefore, the only way of increasing the precision of an estimate is to devise sampling
procedure which will effectively reduce S2 i.e. the heterogeneity in the population. In fact, S2 is
a population parameter and is inherent with the population, therefore, it cannot be decreased.
Instead, the population may be divided into number of groups (called strata), thereby,
controlling variability within each group. This procedure is known as stratification.
In stratified sampling, the population consisting of N units is first divided into K
sub-populations of sizes N1, N2,…,NK units respectively. These sub-populations are
K
non-overlapping and together they comprise the whole of the population i.e.

Ni = N. These
i 1
sub-populations are called strata. To obtain full benefit from stratification, the values of Ni’s
must be known. When the strata have been determined, a sample is drawn from each stratum,
the drawings being made independently in different strata. If a simple random sample is taken
in each stratum then the procedure is termed as stratified random sampling. As the sampling
variance of the estimate of population mean or total depends on within strata variation, the
23
2: An Overview of Various Sampling Schemes…..
stratification of population is done in such a way that strata are homogeneous within
themselves with respect to the variable under study. However, in many practical situations, it is
usually difficult to stratify with respect to the variable under consideration especially because
of physical and cost consideration. Generally the stratification is done according to
administrative groupings, geographical regions and on the basis of auxiliary characters
correlated with the character under study.
Let N1, N2,…,NK denote the sizes of K strata, such that
K
N
i 1
i
 N (total number of units in
the population).
Let y ij denotes the jth observation in the ith stratum.
Denote YN 
1
N
K
Ni
 y
i 1 j 1
ij
as population mean.
Again,
S2 
1 K Ni
( yij  YN ) 2

N 1 i 1 j 1
Si 
1 Ni
( yij  YN i )2

Ni 1 j 1
2
where YN i 
1 Ni
 yij
Ni j 1
= population mean square,
= population mean square in the ith stratum,
= population mean for the ith stratum.
We select a simple random sample of size n1 from the first stratum, of size n2 from the second
stratum,..., of size ni from the ith stratum and so on such that
n1 + n2 + ....+ nK = n (sample size).
The population mean YN can be written as
YN 
K
1 K
N
Y

Pi YN i
 i Ni 
N i 1
i 1
(4.1)
24
2: An Overview of Various Sampling Schemes…..
where Pi 
Ni
.
N
1 ni
Since within each stratum, the samples have been drawn by SRSWOR, yni (  yij ), sample
ni j 1
mean for the ith stratum is an unbiased estimator of YNi and obviously
yst 
K
1 K
N
y

Pi yni
 i ni 
N i 1
i 1
(4.2)
The weighted mean of the strata sample means with strata sizes as the weights, will be an
appropriate estimator of the population mean.
Clearly yst is an unbiased estimator of YN , since
K
K
K
i 1
i 1
i 1
E ( yst )  E ( Pi yni )   Pi E ( yni )   Pi YN i  YN
(4.3)
Since the sample in the ith stratum has been drawn by SRSWOR, so
1
1 2
V yni  
 Si
Ni 
 ni
 
The sampling variance of y st is given by
K
K
K
i 1
i 1 i 1
V ( yst )   Pi 2V ( yni )   Pi Pj Cov( yni , yn j )
(4.4)
Since the samples have been drawn independently in each stratum, so
Cov( yni , yn j )  0 and so
K
K
1 1
V ( yst )   Pi 2V ( yni )   Pi 2 (  ) Si2
ni Ni
i 1
i 1
(4.5)
25
2: An Overview of Various Sampling Schemes…..
Since the sample mean square for the ith stratum, si2 
1 ni
 ( yij  yni )2 unbiasedly estimates
ni 1 j 1
the population mean square for the ith stratum, S i2 , it follows that an unbiased estimator of
V ( yst ) is given by

1 1
Vˆ ( y st )  V ( y st )   Pi 2 (  ) si2
ni N i
(4.6)
From the above, we see that sampling variance of stratified sample mean depends on Si's,
variability within the strata which suggests that the smaller the Si's, i.e. the more homogeneous
the strata, greater will be the precision of the stratified sample.
6. EMPIRICAL ILLUSTRATION
The data given below pertains to the total geographical area in 20 villages of a Block. Treating
this as population of 20 units, stratify this population in three strata taking stratum sizes to be
villages with geographical area, 50 ha or less, villages with geographical area in between 50
and 100 ha and villages having geographical area more than 100 ha. A sample of 6 villages is to
be selected by taking 2 villages in each stratum. Compare the efficiency of stratified sampling
with corresponding unstratified simple random sampling.
Village Sl.No.: 01 02 03
04 05 06 07 08 09 10
Geographical : 020 080 050 100 150 070 020 250 220 010
Area (in ha.)
Village Sl.No.: 11 12 13 14 15
16 17 18 19 20
Geographical : 050 140 080 020 050 030 070 090 100 220
Area (in ha)
SOLUTION:
Clearly N = 20, n = 6 .
26
2: An Overview of Various Sampling Schemes…..
Population Mean YN 
1 20
1820
yi 
 91ha

20 i 1
20
Population Mean Square S 2 
1 N
1 N 2
96180
2
(
y

Y
)

( yi  N YN2 ) 
 5062 ha 2

i
N
N 1 i 1
N 1 i 1
19
1 1
Sampling Variance of Simple Random Sample Mean V ( yn )  (  ) S 2  590 ha 2
n N
Now stratify the population according to given strata sizes into following three strata:
STRATA Sl. No.
I (less than 50 ha)
UNITS
020 050 020 010 050 020 050 030
II (between 50 and 100 ha) 080 100 070 080 070 090 100
III (more than 100 ha)
Clearly,
N1 = 8,
N2 = 7,
150 250 220 140 220
YN1 = 031.3 ha,
YN 2 = 084.3 ha,
N3 = 5, YN 3 = 176.0 ha,
2
S12 = 0270 ha
S 22 = 0161 ha2
S 32
= 2330 ha2
From each stratum, a sample of 2 villages has been taken so n1 = n 2 = n 3 = 2.
Now,
K
1 1
V ( yst )   Pi 2 (  ) Si2  67 ha 2
ni Ni
i 1
Obviously the stratification has reduced the sampling variance of the sample mean from
590 ha2 (in case of SRSWOR) to 67 ha2 (in case of stratified sampling) i.e. a reduction of
about 89 per cent.
In stratified sampling, having decided the strata and the sample size, the next question which a
27
2: An Overview of Various Sampling Schemes…..
survey sampling expert has to face is regarding the method of selection within each stratum and
the allocation of sample to different strata. The expression for the variance of stratified sample
mean shows that the precision of a stratified sample for given strata depends upon the ni's
which can be fixed at will. The guiding principle in the determination of the ni's is to choose
them in such a manner so as to provide an estimate of the population mean with the desired
degree of precision for a minimum cost or to provide an estimate with maximum precision for a
given cost, thus making the most effective use of the available resources. The allocation of the
sample to different strata made according to this principle is called the principle of optimum
allocation.
The cost of a survey is a function of strata sample sizes just as the variance. The manner in
which cost will vary with total sample size and with its allocation among the different strata
will depend upon the type of survey. In yield estimation surveys, the major item in the survey
cost consists of labour charges for harvesting of produce and as such survey cost is found to be
approximately proportional to the number of crop cutting experiments (CCEs). Cost per CCE
may, however, vary in different strata depending upon labour availability. Under such
K
situations, the total cost may be represented by C   ci ni where ci is the cost per CCE in the
i 1
th
i stratum. When ci is same from stratum to stratum, say c, the total cost of a survey is given by
C  c n . The cost function will change in form, if travel cost, field staff salary, statistical
analysis etc. are to be paid for.
K
To find optimum values of ni (cost function being C   ci ni ), consider the function
i 1
 V ( yst )   C
(4.7)
where  is some constant.
Now,
K
V ( yst )   C   Pi 2 (
i 1
K
=

i 1
K
K
1 1 2
 ) Si    ci ni
ni Ni
i 1
1 2 2
Pi Si + . c i . ni + terms independent of ni
ni
=  ( Pi Si /
ni   ci ni ) 2 + terms independent of ni
i 1
28
2: An Overview of Various Sampling Schemes…..
Clearly, V ( y st ) is minimum for fixed cost C, or cost of a survey is minimum for a fixed value
of V ( y st ), when each of the square terms on right-hand side of the above equation is zero i.e.
Pi Si
=
ni
or
ni =
 ci ni
Pi Si
. ci
(i= 1,2,...,K)
(i= 1,2,...,K)
(4.8)
From the above, one can easily infer that:
 the larger the stratum size, the larger should be the size of the sample to be selected from
that stratum,

the larger the stratum variability, the larger should be the size of the sample from that
stratum, and

the cheaper the cost per sampling unit in a stratum, the larger should be the sample from
that stratum.
The exact value of ni for maximizing precision for a fixed cost C0 can be obtained by
evaluating 1 /  , the constant of proportionality as
ni 
Pi Si
ci
C0
(4.9)
K
P S
i 1
i
ci
i
and the total sample size as
K
K
n   ni 
i 1
C0  ( Pi Si / ci )
i 1
K
P S
i 1
i
(4.10)
i
ci
The allocation of the sample size 'n' to different strata according to the above equation is known
as optimum allocation.
When ci is the same from stratum to stratum i.e. ci = c (say), the cost function takes the form C
= c.n, or in other words, the cost of survey is proportional to the size of the sample, the
optimum values of ni 's are given by
29
2: An Overview of Various Sampling Schemes…..
ni  n
Pi S i
(i= 1,2,...,K)
K
P S
i 1
i
(4.11)
i
The allocation of the sample according to the above equation is known as Neyman Allocation.
On substituting for ni in V ( y st ) expression (4.5), we obtain
1 K
1
VN ( yst )  ( Pi Si ) 2 
n i 1
N
K
P S
i 1
i
2
i
(4.12)
where the subscript ‘N’ stands for stratification with Neyman Allocation.
Another logical approach of allocation appears to be to allocate larger sample sizes for larger
strata i.e. ni  Ni or ni = n.
Ni
N
(i=1,2,...,K). The allocation of sample size ‘n’ according to
this equation is known as proportional allocation and V( y st ) in this case becomes
1 1 K
VP ( y st )  (  )  Pi Si2
n N i 1
(4.13)
where the subscript ‘P’ indicates the stratification with proportional allocation.
7. CLUSTER SAMPLING
A cluster may be defined as a group of units. When the sampling units are clusters, the
method of sampling is known as cluster sampling. Cluster sampling is used when the frame
of units is not available or it is expensive to construct such a frame. Thus, a list of all the
farms in the districts may not be available but information on the list of villages is easily
available. For carrying out any district level survey aimed at estimating the yield of a crop, it
is practically feasible to select villages first and then enumerating the elements (in this case
farms) in the selected village. The method is operationally convenient, less time consuming
and more importantly such a method is cost-wise efficient.
The main disadvantage of cluster sampling is that it is less efficient than a method of
sampling in which the units are selected individually. The efficiency of cluster sampling
procedure increases as the heterogeneity between units belonging to same cluster increases.
Cluster sampling becomes more efficient than element sampling if the units pertaining to
same cluster are negatively correlated.
Let there be N clusters in the population. Further, let M be the size of each cluster. We denote
by Y the character under study. We define
30
2: An Overview of Various Sampling Schemes…..
yij = value of the characteristic under study for jth element, (j=1,2,…,M) in the ith cluster,
(i=1,2,…,N)
M
yi. 
1
M
YN . 
1 N
 yi. the mean of the cluster means in the population ,
N i 1
Y.. 
y
1
NM
j 1
N
ij
the mean per element of the i th cluster ,
M
 yij 
i 1 j 1
1
N
N
y
i 1
i.
the mean per element in the population .
It is clear that YN .  Y.. . This is so since the clusters are of equal size. If, however, the clusters
vary in size, the two means will generally not be equal.
We denote by ycl , the estimator of Y.. as
y cl 
1 n
1
yi. 

n i 1
nM
n
M
 y
i 1 j 1
ij
the mean of cluster means in a sample of n clusters (5.1) Clearl
y ycl is an unbiased estimate of Y.. and its variance is given by
1 1
V ( ycl )  (  ) Sb2
n N
(5.2)
where
Sb2 
1
 ( yi.  YN . ) 2 is the mean square between cluster means in the population.
N 1
The variance of the mean of n cluster means in terms of intra-class correlation  c (intracluster correlation between elements belonging to the same cluster), if N is sufficiently large,
is given by
S2
1  (M 1)  c 
V ( ycl ) 
nM
(5.3)
31
2: An Overview of Various Sampling Schemes…..
where
S2 
N M
1
 ( yij  Y.. ) 2 is the mean square between elements in the population.
NM 1 i 1 j 1
If an equivalent sample of nM elements were selected from the population of NM elements
by SRSWOR, the variance of the mean per element ynM would be
V ( ynM )  (
1
1

) S2
nM NM
(5.4)
Accordingly, the relative efficiency of a cluster as the unit of sampling compared with that of
an element is given by
R.E. 
V ( ynM )
S2

V ( ycl ) M Sb2
(5.5)
Thus, the relative efficiency of cluster sampling increases with decrease in both S b2 and M.
Since relative efficiency of cluster sampling depends upon M, it is natural to determine the
optimum size of the cluster. As in the case of stratified sampling, the optimum size of the
cluster can be determined by minimizing variance for a fixed cost or vice-versa. A simple
cost function can be
C = c1nM +c2d
(5.6)
where
C = total cost of survey operation,
c1 = cost of enumerating an element including the time spent and cost of transportation within
clusters,
c2 = per unit cost of travelling a unit distance between clusters, and
d = the distance between the clusters.
32
2: An Overview of Various Sampling Schemes…..
By minimizing the variance subject to the above cost function, we obtain
M
c22
c1 C
(5.7)
Thus, M will be smaller if c1 and C are large and c2 is small.
8. MULTI-STAGE SAMPLING
Generally, elements belonging to the same cluster are more homogeneous as compared to
those elements which belong to different clusters. Therefore, a comparatively representative
sample can be obtained by enumerating each cluster partially and distributing the entire
sample over more clusters. This will increase the cost of the survey but the proportionate
increase in cost vis-à-vis cluster sampling will be less as compared to increase in the
precision. This process of first selecting clusters and then further sampling units within a
cluster is called as two-stage sampling. The clusters in a two-stage sample are called as
primary stage units (psus) and elements within a cluster are called as secondary stage units
(ssus).
A two-stage sample has the advantage that after psus are selected, the frame of the ssus is
required for the sampled psus only. The procedure allows the flexibility of using different
sampling design at the different stages of selection of sampling units. A two-stage sampling
procedure can be easily generalized to multi-stage sampling designs. Such a sampling design
is commonly used in large scale surveys. It is operationally convenient, provides reasonable
degree of precision and is cost-wise efficient.
9.
SYSTEMATIC SAMPLING
In the method of systematic sampling, only the first unit is selected at random while rest of
the units are selected according to a pre-determined pattern. Let N and n be the population
size and sample size respectively. Further, N = n.k, where k is an integer. A random number
‘i’ is selected such that 1< i < k. Then the sample contains n units with serial number i, i+k,
i+2k,…, i+(n -1)k.
Systematic sampling can be used in situations such as selection of kth strip in forest sampling,
selection of corn fields every kth kilometre apart for observation on incidence of borers, or the
selection of every kth time interval for observing the number of fishing crafts landing at a
centre.
The method of systematic sampling is used on account of its low cost and simplicity in the
selection of the sample. It makes control of field work easier. Since every kth unit will be in
the sample, the method is expected to provide an evenly balanced sample.
33
2: An Overview of Various Sampling Schemes…..
When N is not an exact multiple of n i.e. N = nk + r, a method of sampling called circular
systematic sampling (given by Lahiri) is used to select the sample. This method consists of
selecting a random number from 1 to N and thereafter selecting cyclically every k th unit until
n units have been chosen in the sample. Cyclic selection means assigning number (N+1) to 1st
unit, (N+2) to 2nd unit and so on, in order to continue the selection procedure when the Nth
unit has been reached.
A drawback of systematic sampling is that it is not possible to get an unbiased estimate of the
variance of the estimator. However, it is possible to get an unbiased estimator of the variance
based on number of systematic samples.
10. VARYING PROBABILITY SAMPLING
In SRSWOR, the selection probabilities are equal for all the units in the population.
However, if the sampling units vary in size considerably, SRSWOR may not be
appropriate as it does not take into account the possible importance of the larger units in
the population. To give possible importance to larger units, there are various sampling
methods in which this can be achieved. A simple method is of assigning unequal
probabilities of selection to the different units in the population. Thus, when units vary in
size and the variable under study is correlated with size, probabilities of selection may be
assigned in proportion to the size of the unit e.g. villages having larger geographical areas
are likely to have larger populations and larger areas under food crops. For estimating the
crop production, it may be desirable to adopt a selection scheme in which villages are
selected with probabilities proportional to their population sizes or to their geographical
areas. A sampling procedure in which the units are selected with probabilities proportional to
some measure of their size is known as sampling with probability proportional to size (pps).
The units may be selected with or without replacement. In sampling with replacement, the
probability of drawing a specified unit at a given draw is the same.
Procedure of selecting a sample with varying probabilities
i) Cumulative Total Method
To draw a sample of size n from a population of size N with ppswr, the procedure is as
follows:
Let x i be an integer proportional to size of the ith unit (i=1,2....,N), we make successive
cumulative totals x1 , x1  x2 , x1  x2  x3 ,..., x1  x2  x3  ...  x N ,
N
and draw a random number R in between 1 and
x
i 1
34
i
from a random number Table.
2: An Overview of Various Sampling Schemes…..
Select the ith unit in the population for which
x1  x2  x3 ,..., xi 1  R  x1  x2  x3  ...  xi 1  xi
It is clear that this procedure of selection gives to the ith unit in the population, a
probability of selection proportional to x i . This procedure is to be repeated n times, if a
sample of size n is required.
ii) Lahiri’s Method
The main drawback of cumulative total method is that it involves writing down the
successive cumulative totals which is time consuming and tedious especially if the number of
units in the population is large. Lahiri (1951) had suggested an alternative procedure which
avoids the necessity of writing down cumulative totals. It consists in selecting a pair of
random numbers, say (i , j) such that 1  i  N and 1  j  M where M is the maximum of
the sizes of the N units in the population. If j  xi , the ith unit is selected; otherwise it is
rejected and another pair of random numbers is chosen. For selecting a sample of n units with
ppswr, the procedure is to be repeated till n units are selected. It can be seen that the
procedure leads to the required probabilities of selection.
11. DETERMINATION OF SAMPLE SIZE
In the planning of a sample survey, determination of sample size is an important decision
which a survey statistician has to take while deciding the sampling plan. One has to be
careful while deciding the sample size, because too large a sample implies waste of resources,
and too small a sample diminishes the utility of the results. An efficient sampling plan should
enable an optimum utilization of budgetary resources to provide the best estimators of the
population parameters. As is well known, efficiency of an estimator is normally measured by
inverse of mean square error (or variance in case of unbiased estimators). A desirable
proposition would be to minimize the cost as well as variance simultaneously. But,
unfortunately, it is not possible. With an increase in the sample size, normally the cost of the
survey increases while the variance decreases, thereby increasing the efficiency. Thus for
determination of sample size, a balance is required to be struck which is reasonable with
respect to cost as well as efficiency. Sampling theory provides a framework within which the
problem of determining sample size may be tackled reasonably.
We first consider the estimation of sample size in case of simple random sampling. The
problem has been analyzed in a very elegant way by considering hypothetical example by
Cochran (1977). We quote the example, “An Anthropologist is preparing to study the
inhabitants of some island. Among other things, he wishes to estimate the percentage of
inhabitants belonging to blood group O. Co-operation has been secured so that it is feasible to
take a simple random sample. How large should the sample be?” This is just a typical
example. In fact, in almost all the sampling investigations, one has to face such problems. An
answer to the question is not straight forward. First of all, one must be very clear about the
35
2: An Overview of Various Sampling Schemes…..
objective of the study. Or atleast, the user must know to what use their results are going to be
put, so that he should be able to answer as to what is the margin of error he is going to
tolerate in the results. In the above example, the Anthropologist should be able to answer as
to how accurately does he wish to know the percentage of people with blood group O? In this
case he is reported to be content with a 5% margin in the sense that if the sample shows 43%
to have blood group O, the percentage for the whole island is sure to be between 38 and 48.
Since a random sampling procedure has been used, every sample has got some chance of
selection and the possibility of getting the estimates lying outside the above specified range
cannot be ruled out. Aware of this fact, the Anthropologist is prepared to take a 1 in 20
chance of getting an unlucky sample with the estimate lying outside the above margin.
With the above information, ignoring fpc and assuming that the sample proportion p is
assumed to be normally distributed, a rough estimate of n may be obtained. In technical
terms, p is to lie in the range (P  5), except for a 1 in 20 chance. Since p is assumed to be
normally distributed about the population proportion P, it will lie in the range (P  2 .  p )
apart from a 1 in 20 chance (in 95% cases).
Further, since the standard error of p is approximately given by
p 
2 . p
or 2
or
 P.Q / n 
=
where Q = (1 - P), we get
5
 P.Q / n  =
5
n = 4 P . Q / 25
At this point a difficulty appears that is common to all problems in the estimation of sample
size. A formula for n has been obtained, but n depends on some property of the population
that is to be sampled. Here, it is the quantity P that we would like to measure. We therefore
ask the anthropologist if he can give us some idea of the likely value of P. He replies that
from previous data on other ethnic groups, and from his speculations about the racial history
of this island, he will be surprised if P lies outside the range 30 to 60%. This information is
sufficient to provide a usable answer. For any value of P between 30 and 60, the product P.Q
lies between 2100 and a maximum of 2500 at P = 50. The corresponding n lies between 336
and 400. To be on the safe side, 400 is taken as the initial estimate of n.
36
2: An Overview of Various Sampling Schemes…..
PRINCIPAL STEPS INVOLVED IN THE CHOICE OF A SAMPLE SIZE

A statement about the margin of error to be tolerated in the results.

Choice of desired confidence level.

Some equation that connects n with the desired precision of the sample should be
found.

This equation will contain, as parameters, certain unknown properties of the
population. This must be estimated in order to give specific results.

Usually in a sample survey, more than one characteristic is measured. Sometimes, the
number of characteristics is large. If a desired degree of precision is prescribed for
each characteristic, the calculation leads to a conflicting values of n, one for each
characteristic. Some method must be found for reconciling these values.

Finally, the chosen value of n must be appraised to see whether it is consistent with
the resources available to take the sample. This demands an estimation of the cost, labour,
time and material required to obtain the proposed size of sample. It sometimes becomes
apparent that n will have to be drastically reduced. One has to choose whether to proceed
with a much smaller sample size, thus reducing precision, or to abandon efforts until more
resources can be found. Regarding the choice of a level for tolerable margin of error and the
confidence level, the user normally has only a vague idea and it is only through the
discussions and clarifications that a quantitative specific measures are obtained. It may be
remarked that these measures are mainly subjective and depend largely on the judgment of
the user regarding the importance, applicability and vulnerability of the results.
Regarding the sample sizes in case of simple random sampling, the cases for qualitative and
quantitative data are presented below:
12. QUALITATIVE DATA: ESTIMATION OF PROPORTIONS
The units are classified into two classes, C and C  . Some margin of error d in the estimated
proportion p of units in class C has been agreed on, and there is a small risk  that we are
willing to incur that the actual error is larger than d; i.e., we want
Pr ( p - P   d ) = 
Simple random sampling is assumed, and p is taken as normally distributed. Now we know
that,
37
2: An Overview of Various Sampling Schemes…..
 N n 
P.Q
n

 p  
 N 1 
Hence the formula that connects n with the desired degree of precision is
 N n 

d  t 
 N 1 
P.Q
n
 at the tails.
where t is the abscissa of the normal curve that cuts off an area of
Solving for n, we find
t2 P Q
d2
n

1 t2 P Q
1  
 1
2
N  d

(12.1)
For practical use, an advance estimate p of P is substituted in this formula. If N is
large, a first approximation is
n0 
t2 p q p q

V
d2
(12.2)
where
V 
pq
= desired variance of the sample proportion.
n0
In practice, we first calculate n0. If n0/N is negligible, n0 is a satisfactory approximation to the
n of (1). If not, it is apparent on comparison of (1) and (2) that n is obtained as
n
n0
n0

1  (n0  1) / N 1  (n0 / N )
(12.3)
38
2: An Overview of Various Sampling Schemes…..
EXAMPLE
In the hypothetical blood groups example, we had
d = 0.05,
p = 0.5,  = 0.05, t = 2 .
Thus,
n0 
(4)(0.5)(0.5)
 400
(0.0025)
Let us assume that there are only 3200 people on the island. The fpc is needed, and we find
n
n0

1  (n 0  1) / N
400
 356
399
1
3200
The formula for n0 holds also if d, p and q are all expressed as percentages instead of
proportions. Since the product pq Increases as p moves towards 1/2, or 50 %, a conservative
estimate of n is obtained by choosing for p the value nearest to 1/2 in the range in which p is
thought likely to lie. If p seems likely to lie between 5 and 9 %, for instance, we assume 9 %
for the estimation of n.
13. QUANTITATIVE DATA: ESTIMATION OF POPULATION MEAN
Consider a population of size N from which a simple random sample is to be selected for
estimating the population mean Y . Suppose, we wish to control the relative error r in the
estimated population total or mean. As the sample mean y estimates the population mean Y ,
the margin of error and confidence level are specified as
Pr 
yY
Ny  NY
r Pr 
r
y
NY
 Pr  y Y r Y 
where  is a small probability. We assume that y is normally distributed. Its standard error
is
y
=
Nn
N
S
n
39
2: An Overview of Various Sampling Schemes…..
Hence
rY  t .
y
= t.
Nn S
N
n
(13.1)
Solving for n gives
n(
t S 2 
1 t  S 2
) / 1  (
)
r  Y  N r  Y 
Note that the population characteristic on which n depends is its coefficient of variation S
/ Y . This is often a more stable quantity and easier to guess in advance than S itself.
As a first approximation, we take
n0  (
t S 2 1 S 2
)  ( )
C Y
rY
(13.2)
substituting an advance estimate of (S / Y ). The quantity C is the desired (cv)2 of the
sample estimate.
If n0 / N is appreciable, we compute n as in (12.3)
n
n0
n
1 0
(13.3)
N
If instead of the relative error r, we wish to control the absolute error d in y , we take n0
= t2S2/ d2 = S2 / V, where V is the desired variance of Y .
Sometimes the specification error to be tolerated is only given in terms of desired per cent
S.E. of the estimator e.g. the estimate is desired with a maximum of say 5 % S.E. In such
cases, n is obtained from the corresponding formulae. In simple random sampling, if the
desired % S.E. is d, then n is given by
1
n
2
(
d
C2

1
)
N
40
2: An Overview of Various Sampling Schemes…..
where C is the % cv of the population.
METHODOLOGICAL
SAMPLE SIZE
ISSUES
RELATING
TO
DETERMINATION
OF
The determination of sample size is generally based on (a) the available financial and
manpower resources and (b) the required level of reliability in the estimates expected from
the sample. Generally, it would be preferable to start with the second consideration, and if the
budget is a constraint to assess the precision that can be achieved under that constraint in
order to decide whether the achievable precision would be acceptable, and if not, whether the
budget should be increased.
In a sample survey, the sampling error associated with a given sample size varies from item
to time. For major items of frequent occurrence, such as area under a crop, generally the
sampling error is less that than the minor items of infrequent occurrence such as use of
pesticides/ insecticides. Similarly, for items which have greater variability, the sampling error
would be larger than that for fewer variables. To decide on the sample size for the survey, it
would be necessary to calculate the sample size required for estimating with the requisite
precision a few major items of interest and take the largest of the indicated size requirements
as the sample size for the survey.
Estimates from a survey are often required not only at the national level but also for the main
region of a country and for certain domains of study such as rural and urban sectors and
specified groups such as “irrigated crop” and “un-irrigated crop”. Generally, the sample size
required for a given precision at the national level would not give estimates for the regions
and groups with the same precision because of effective sample size for the region or group.
Compromise solution involving this has to be accepted.
OVERALL SAMPLE SIZE
If a pilot survey is undertaken for testing questions and survey procedure before the main
survey is launched, it may be possible to estimate roughly the parameters (population mean
and standard deviation) required for the determination of sample size for the various items of
interest. However, that may not be always possible, unless the pilot survey is taken well in
advance, because estimates of financial resource required are to be made available to the
government quite in advance to make the requisite budget. Thus the determination of the
sample size in most cases may have to be done in advance without any pilot survey.
In such cases the survey Statistician has to make use of the information which is readily
available. He may often have to depend on the results of similar survey conducted in the past,
preferably in the same country or elsewhere in the neighboring countries. If no such results
are available, the Statistician has to make reasonable guesses of the different parameters
which enter the formula for determination of the sample size.
41
2: An Overview of Various Sampling Schemes…..
If a stratified multi-stage random sampling design is used, which is usually the case, the
problems are further compounded because information is required not merely on the
population mean and standard deviation (SD), but also its components of variance between
primary stage sampling units (PSUs) and within PSUs. What one can do in such
circumstances is to proceed in stages by working out the sample size required for a simple
random sample (SRS) and to make adjustments to the sample size to take into account the
effects of multi-stage sampling and possible stratification as illustrated in the paragraphs that
follow.
Generally the level of precision desired of an estimate is expressed as a percentage of itself
or, strictly speaking of the population parameter. But the subtle difference is usually ignored.
Let Y be the characteristic under study, Y be the population mean and y be the sample
mean. Clearly, y is an estimate of Y .
The required sampling precision is prescribed as a percentage of y . For example, sampling
precision of y should be E per cent of y i.e., the population average should lie between y Exy
Exy
.
and y 
100
l00
Taking 95% confidence interval, which is the usual case, this implies that
E x y /100 = 2 x SE ( y )
SE( y)
E
x 100 
y
2
Percentage relative SE =
E
2
If the sampling precision is set at 5% relative SE, we know that in simple random sampling
Relative SE ( y ) =
SE( y)
x 100
Y

=
n x 100 = 
Y
Y
42
x 100 x
1
n
2: An Overview of Various Sampling Schemes…..
=
Coefficient of var iation (CV) x 100
n
E CV

x 100
2
n
n = 4 x 10000 x
If E = 5,
n = 40000
CV 2
E2
= 40000
CV 2
E2
CV  2  1600 (CV) 2
25
n = 40000
If E = 10,
(CV) 2
 400 (CV) 2
100
Thus to determine the sample sizes we require the value of CV which can be estimated on the
basis of a previous survey on the subject or a closely related subject or a survey of a
neighboring or similarly placed country.
Coefficient of variation, generally being a stable quantity can also be approximated on the
basis of some related information. e.g. average household income can be taken to be equal
1
to per capita income x average household size and
of the known range of household
6
income can be taken as an approximation of the SD on the assumption of a normal
distribution. However, a better assumption is that of log-normal distribution, i.e. instead of
assuming y to be distributed normally, assume that loge y is distributed normally.
(e = 2 3 = 2.7)
Let us assume that loge y is distributed normally with mean a and standard deviation b, then
2
Mean of y = e
Variance of
(CV)2 of
a  b2
y = e2a  b
y =
2
2
eb  1
43
2
eb 1
2: An Overview of Various Sampling Schemes…..
Median of y = e a
2
Mode of y = e a  b
2
Mean of y
 eb / 2
Median of y
Mean of y 3b2 / 2
e
Mode of y
Thus if mean is approximated (as indicated for household income by per capita income x
average household size) and the median or mode is roughly guessed, it is possible to calculate
the value of ‘b’ and therefore, approximate C.V. It may also be noted that an error in the
estimate of mode effects the estimate of ‘b’ less than the same relative error in the estimate of
the median. Also it may be perhaps less difficult to make a good guess of the mode than
median.
Further, log normal distribution has the property that the proportion of population with values
less than or equal to the mean is given by P(b/2), where P(t) is the area to the left of ‘t’ of a
standard normal probability density function. Thus if we guess the proportion of households
whose income is less than or equal to the average, it is possible to obtain the value of ‘b’ by
referring to the corresponding proportion in the standard normal distribution tables and thus
arrive at an estimate of CV.
We shall now discuss a simple method that does not depend upon the estimation of either the
population mean or the standard deviation, but assumes log-normal distribution. Take the
case of estimation of household income. Based on empirical data collected from a large
number of countries, it is reported in a technical study on Household Income and Expenditure
Surveys (Published in 1989 by Statistics Office of the United Nations under National
Household Survey Capability Programme) that nearly two-thirds of the population in the case
of distribution of income or similar economic variables lie below the average value. Thus
with the property mentioned in above para gives
CV = 1.0492  1
It is not claimed that the observation mentioned above is universal. If the proportion between
the average is different, CV will be different as given below:
44
2: An Overview of Various Sampling Schemes…..
Percentage of population
below average
55
60
65
70
75
80
C.V.
0.2554
0.5409
0.9005
1.4157
2.2739
4.0000
It would be seen that as the percentage of population below the average changes, the value of
CV changes very fast. Thus, rather than working with the assumed proportion of two-thirds,
one may prefer to err on the safe side and take the proportion as 70% and use CV = 1.4157 
1.42 or (CV)2 = 2. Table 5 (pp180-187) of the UN publication mentioned above gives the
value of CV and the proportion of households below the average for some 54 countries. It can
be seen from that Table that in a few cases CV exceeded 1.42. Thus the assumption of
(CV) 2 = 2 is not unrealistic. If (CV)2 = 2, we get n = 3200 and if (CV)2 = 1, as it happens in
most cases, we get n = 1600.
In the above calculations, we assumed that finite population correction (fpc) factor can be
ignored, i.e. the population size is very large as compared to sample size. As a working rule,
when n / N < 1%¸ fpc can be ignored. However, if fpc can not be ignored, then the sample
n
size n  for a simple random sample will be n  =
, where n is the sample size for the
n
1
N
case when fpc can be ignored.
In large scale sample surveys, generally one uses a two-stage random sampling design. A
two-stage design is generally less efficient than SRS of the same ultimate size and to achieve
the same level of precision as in a SRS, a larger number of ultimate stage sampling units has
to be surveyed. This is called the design effect and the extent of the upward adjustment to the
sample size depends on the degree of similarity of second stage units within a PSU, which is
measured by intra-class correlation coefficient. As a good working rule, one can take the
value of 2 for the design effect as indicated in the Handbook of Household Surveys (Revised
Edition), Studies in Methods, Series F.No.31, 1984 of the United Nations. Using the value 2
for the design effect, we get n = 6400, if (CV)2 = 2 and n =3200, if (CV)2 = 1. Here again the
sample size required would be less if appropriate stratification is used at various stages.
SAMPLE SIZE FOR DOMAINS
Estimates are generally required not only at the national level but also for certain domains
such as geographical regions, rural and urban areas. One has then to work out the sample size
for each domain and add them to arrive at the national sample size. We shall assume that
domain-wise estimates are required with the same precision of 5%. Further, for sake of
simplicity, we will deal with the case of two domains of study. For sake of illustration, let us
45
2: An Overview of Various Sampling Schemes…..
consider a Household Income and Expenditure Survey (HIES) and rural and urban areas may
be taken as the two domains of study. Further, let us assume that 80% of households (hh) are
rural and 20% of the hh are urban. Suppose further that the average household income for the
urban area is twice the national average. With the 80 : 20 ratio between rural and urban hh, it
means that the average household income in the rural areas, is 75% of the national average.
Let (CV)r and (CV)u be the CV for rural and urban areas respectively. If (CV)2 = 2, it can be
shown that
0.45 (CV) r2 + 0.80 (CV)u2 = 1.75
(Please see Annexure)
If we assume that (CV)r = (CV)u , we get
(CV) 2r = (CV) 2u
= 1.4
Thus to obtain the same relative precision of 5% (i.e. relative SE of 2.5%), the sample size
for each of two sectors will be
1.4 x 1600 x 2 (design effect) = 4480 hh
Hence, National sample size = 8960 hh.
Thus the national sample size is increased by 40% from 6400 to 8960. If we do not have that
many resources, we can divide the national sample size of 6400 equally to rural and urban
areas. The effect of allocation of 3200 instead of 4480 hh would be that, instead of 5% we
shall have 5.9% precision (2.95% relative SE).
In the above we have assumed that (CV)r = (CV)u . It is likely that (CV)u > (CV)r.
The table below gives the sample size required with different (CV)u and (CV)r but satisfying
the equation (1).
Sample size ( number of households )
(CV) 2
(CV) 2
r
1.0
1.1
1.2
1.3
1.4
1.5
u
1.625000
1.56875
1.515250
1.45625
1.40000
1.34375
Rural
3200
3520
3840
4150
4480
4800
Urban
5200
5020
4840
4660
4480
4300
Total
8400
8540
8680
8820
8960
9100
For any other value of (CV) 2r , all other values can be determined by linear interpolation.
46
2: An Overview of Various Sampling Schemes…..
Annexure
Let us use the following notations
Population size
Mean
S.D.
Urban
Rural
Total
N1
1
1
N2
2
2
N


N1
N
We have assumed that
N2
N
20
100
We note that
1
x 100 = 20
and
x 100 = 80

80
100
2


Further we have assumed
 1  2  , So
20
100
0.8  2
x 2 
80
100
2
 
2
 0.6 
Let us now get the relationship amongst  ,  1 and  2 .
Let
x1i be the observation on the i-th unit in the urban area,
x2j be the observation on the jth unit in the rural area.
We know that
N1 12 
N1

i 1
x12i  N1 12
47
 0.75 
2: An Overview of Various Sampling Schemes…..
N2
 22

N2

j 1
x22 j
 N2
 22
N2
 N1

N.2    x12i   x 22 j   N 2
j1
 i1

2
2
 N112  N112  N2 2
2  N2 2  N
2 
So
2
2
N1 2 N2 2 N1 2 N2 2
1 
2 
1 
2 2
N
N
N
N
2
2
N1 1
N2  2
N1 1
N2  2
2
2




1
N 2
N 2
N 2
N 2
=
We note that
2
N1 12 12 N2  2
N1 12 N2  2
2 2
2
x

x


1
2
2
2
2
2
N 1 
N 2 
N 
N 2
2
N1
1 N2 4 1
2
3
9

,
 , 2  4, 2
( )2 
, and
2
N
5 N 5 
4
16

on substitution, (1) reduces to
(CV) 2 
4
5
(CV)u2 
4
5
x
9
4 4 9
(CV)r2   x
1
16
5 5 16
4 9

= 0.8 (CV)u2 + 0.45 (CV)r2 +  
 1
 5 20 
As (CV)2 = 2, so
2 = 0.8 (CV)u2 + 0.45 (CV)r2 +
1.75
1
4
= 0.8 (CV)u2 + 0.45 (CV)r2
48
(1)
2: An Overview of Various Sampling Schemes…..
REFERENCES
Cochran, W.G. (1977): Sampling Techniques. Third Edition. John Wiley and Sons.
Des Raj (1968): Sampling Theory. TATA McGRAW-HILL Publishing Co. Ltd.
Des Raj and Chandok, P. (1998): Sample Survey Theory. Narosa Publishing House.
Murthy, M.N. (1977): Sampling Theory and Methods. Statistical Publishing Society,
Calcutta.
Singh, D. and Chaudhary, F.S. (1986): Theory and Analysis of Sample Survey Designs.
Wiley Eastern Limited.
Singh, D., Singh, P. and Kumar, P. (1978): Handbook of Sampling Methods. I.A.S.R.I., New
Delhi.
Singh, R. and Mangat, N.S. (1996): Elements of Survey Sampling, Kluwer Academic
Publishers.
Sukhatme, P.V. and Sukhatme, B.V. (1970): Sampling Theory of Surveys with Application.
Second Edition. Iowa State University Press, USA
Sukhatme, P. V., Sukhatme, B.V., Sukhatme, S. and Asok, C. (1984): Sampling Theory of
Surveys with Applications. Third Revised Edition, Iowa State University Press, USA.
49