Download Sampling - IPEM Group of Institutions

SAMPLING DESIGN AND PROCEDURES
Sampling Terminology
Sample
A subset, or some part, of a larger population.
Population (universe)
Any complete group of entities that share some
common set of characteristics.
Population Element
An individual member of a population.
Census
An investigation of all the individual elements
that make up a population.
Sample Survey
A survey which is carried out using a sampling
method, i.e., in which a portion only, and not the
whole population, is surveyed.
Sampling Unit
One of the units into which an aggregate is divided
for the purposes of sampling, each unit being
regarded as individual and indivisible when the
selection is made. The definition of a unit may be
made on some natural basis, for example,
households or persons.
PARAMETER & STATISTIC
PARAMETER(S): A characteristic of a population.
STATISTIC(S): A characteristic of a sample.
(Estimation of a parameter from a statistic
is the prime objective of sampling analysis.)
Sampling Frame
A list, map or other specification of the units which
constitute the available information relating to the
population designated for a particular sampling
scheme. There is a frame corresponding to each stage of
sampling in a multi-stage sampling scheme. The frame
may or may not contain information about the size or
other supplementary information of the units, but it
should have enough detail so that a unit, if included in
the sample, may be located and taken up for inquiry.
Sampling Error
That part of the difference between a population
value and an estimate thereof, derived from a
random sample, which is due to the fact that only a
sample of values is observed; as distinct from errors
due to imperfect selection, bias in response or
estimation, errors of observation and recording, etc.
The totality of sampling errors in all possible
samples of the same size generates the sampling
distribution of the statistic which is being used to
estimate the parent value.
Why Sample?
Budget and time constraints.
Limited access to total population.
Accurate and Reliable Results
Destruction of Test Units
Sampling reduces the costs of research in
finite populations.
Sample Vs. Census
Type of Study: Conditions Favoring the Use of
                                      Sample          Census
1. Budget                             Small           Large
2. Time available                     Short           Long
3. Population size                    Large           Small
4. Variance in the characteristic     Small           Large
5. Cost of sampling errors            Low             High
6. Cost of nonsampling errors         High            Low
7. Nature of measurement              Destructive     Nondestructive
8. Attention to individual cases      Yes             No
Sampling Techniques
Nonprobability Sampling Techniques
• Convenience Sampling
• Judgmental Sampling
• Quota Sampling
• Snowball Sampling
Probability Sampling Techniques
• Simple Random Sampling
• Systematic Sampling
• Stratified Sampling
• Cluster Sampling
• Other Sampling Techniques
Convenience sampling attempts to obtain a
sample of convenient elements. Often, respondents
are selected because they happen to be in the right
place at the right time.
• use of students, and members of social
organizations
• mall intercept interviews without qualifying the
respondents
• department stores using charge account lists
• “people on the street” interviews
A    B    C    D    E
1    6    11   16   21
2    7    12   17   22
3    8    13   18   23
4    9    14   19   24
5    10   15   20   25
Group D happens to
assemble at a convenient
time and place. So all
the elements in this
group are selected.
The resulting sample
consists of elements 16,
17, 18, 19 and 20.
Note, no elements are
selected from groups A,
B, C and E.
Judgmental sampling is a form of convenience
sampling in which the population elements are
selected based on the judgment of the
researcher.
• test markets
• purchase engineers selected in industrial
marketing research
• bellwether precincts selected in voting
behavior research
• expert witnesses used in court
A    B    C    D    E
1    6    11   16   21
2    7    12   17   22
3    8    13   18   23
4    9    14   19   24
5    10   15   20   25
The researcher considers
groups B, C and E to be
typical and convenient.
Within each of these
groups one or two
elements are selected
based on typicality and
convenience. The
resulting sample
consists of elements 8,
10, 11, 13, and 24. Note,
no elements are selected
from groups A and D.
Quota sampling may be viewed as two-stage restricted
judgmental sampling.
• The first stage consists of developing control categories, or
quotas, of population elements.
• In the second stage, sample elements are selected based on
convenience or judgment.
Control           Population       Sample           Sample
Characteristic    composition      composition      composition
                  (Percentage)     (Percentage)     (Number)
Gender
  Male            48               48               480
  Female          52               52               520
                  ____             ____             ____
  Total           100              100              1000
A    B    C    D    E
1    6    11   16   21
2    7    12   17   22
3    8    13   18   23
4    9    14   19   24
5    10   15   20   25
A quota of one
element from each
group, A to E, is
imposed. Within each
group, one element is
selected based on
judgment or
convenience. The
resulting sample
consists of elements
3, 6, 13, 20 and 22.
Note, one element is
selected from each
column or group.
Simple Random Sampling
• Each element in the population has a known and
equal probability of selection.
• Each possible sample of a given size (n) has a
known and equal probability of being the
sample actually selected.
• This implies that every element is selected
independently of every other element.
A    B    C    D    E
1    6    11   16   21
2    7    12   17   22
3    8    13   18   23
4    9    14   19   24
5    10   15   20   25
Select five random
numbers from 1 to 25.
The resulting sample
consists of population
elements 3, 7, 9, 16,
and 24. Note, there is
no element from Group
C.
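The draw described above can be sketched in Python. This is only an illustration: the function name `simple_random_sample` and the seed value are assumptions, not part of the original slides, and the elements drawn will generally differ from 3, 7, 9, 16 and 24.

```python
import random

def simple_random_sample(population, n, seed=None):
    """Draw n distinct elements; every subset of size n is equally likely."""
    rng = random.Random(seed)  # seed only makes the illustration repeatable
    return sorted(rng.sample(list(population), n))

population = range(1, 26)  # elements 1..25, as in the grid
sample = simple_random_sample(population, 5, seed=42)  # seed 42 is arbitrary
print(sample)
```

Because the draw ignores the groups A to E, nothing guarantees that every group is represented, which is what the note about Group C points out.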



Systematic Sampling
• The sample is chosen by selecting a random
starting point and then picking every ith element in
succession from the sampling frame.
• The sampling interval, i, is determined by dividing
the population size N by the sample size n and
rounding to the nearest integer.
• When the ordering of the elements is related to
the characteristic of interest, systematic sampling
increases the representativeness of the sample.
• If the ordering of the elements produces a cyclical
pattern, systematic sampling may decrease the
representativeness of the sample.
For example, there are 100,000 elements in the
population and a sample of 1,000 is desired. In this
case the sampling interval, i, is 100. A random
number between 1 and 100 is selected. If, for
example, this number is 23, the sample consists of
elements 23, 123, 223, 323, 423, 523, and so on.
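A minimal Python sketch of this procedure follows. The function name `systematic_sample` and its parameters are illustrative assumptions; the frame is assumed to be numbered 1 to N.

```python
import random

def systematic_sample(N, n, start=None, seed=None):
    """Pick every i-th element of a frame numbered 1..N, with i = N / n rounded."""
    i = round(N / n)  # note: Python rounds exact halves to the even integer
    if start is None:
        start = random.Random(seed).randint(1, i)  # random start in 1..i
    # Step through the frame; truncate in case rounding yields extra elements.
    return list(range(start, N + 1, i))[:n]

print(systematic_sample(100_000, 1_000, start=23)[:6])
print(systematic_sample(25, 5, start=2))
```

With a fixed start of 23 the first call reproduces the sequence in the example, and the second call corresponds to a frame of 25 elements with interval 5.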
A    B    C    D    E
1    6    11   16   21
2    7    12   17   22
3    8    13   18   23
4    9    14   19   24
5    10   15   20   25
Select a random number
between 1 and 5, say 2.
The resulting sample
consists of population elements 2,
(2+5=) 7, (2+5x2=) 12,
(2+5x3=) 17, and (2+5x4=) 22.
Note, all the elements are
selected from a single row.




Stratified Sampling
• A two-step process in which the population is
partitioned into subpopulations, or strata.
• The strata should be mutually exclusive and
collectively exhaustive in that every population
element should be assigned to one and only one
stratum and no population elements should be
omitted.
• Next, elements are selected from each stratum by a
random procedure, usually SRS.
• A major objective of stratified sampling is to
increase precision without increasing cost.
• The elements within a stratum should be as
homogeneous as possible, but the elements in
different strata should be as heterogeneous as
possible.
• The stratification variables should also be
closely related to the characteristic of interest.
• Finally, the variables should decrease the cost of
the stratification process by being easy to
measure and apply.
A    B    C    D    E
1    6    11   16   21
2    7    12   17   22
3    8    13   18   23
4    9    14   19   24
5    10   15   20   25
Randomly select a number
from 1 to 5
for each stratum, A to E.
The resulting
sample consists of
population elements
4, 7, 13, 19 and 21.
Note, one element
is selected from each
column.
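The selection described above, one random element per stratum, might be sketched as follows. The function name `stratified_sample`, the dictionary layout and the seed are assumptions made for illustration.

```python
import random

def stratified_sample(strata, per_stratum=1, seed=None):
    """Draw a simple random sample of fixed size from every stratum."""
    rng = random.Random(seed)
    sample = []
    for elements in strata.values():
        sample.extend(rng.sample(elements, per_stratum))
    return sample

# Strata A..E hold elements 1-5, 6-10, 11-15, 16-20 and 21-25, as in the grid.
strata = {label: list(range(5 * k + 1, 5 * k + 6))
          for k, label in enumerate("ABCDE")}
picked = stratified_sample(strata, seed=7)  # seed 7 is arbitrary
print(picked)
```

Unlike the simple random draw, every stratum is represented by construction.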
HYPOTHESIS
A hypothesis is a formally stated expectation about how a behavior
operates.
It is a proposition that a researcher wants to verify.
A hypothesis is an assumption about the population
parameter.
• Formulate a Null Hypothesis (H0).
• Formulate an Alternative Hypothesis (H1)
• Select a suitable Test Statistic
• Specify a Level of Significance (α)
• Define a suitable Decision Criterion based on α and
the Test Statistic
• Make necessary Assumptions if required
• Experiment and Calculation of Test Statistic
• Conclusion or Decision
Central Limit Theorem
As the sample size gets large enough, the sampling
distribution of the sample mean becomes almost normal
regardless of the shape of the population.
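A small simulation can make this concrete. The sketch below is an illustration under assumptions of its own: a skewed exponential population with mean 1, samples of size 50, 2000 repetitions, and a helper named `sample_means`; none of these come from the slides.

```python
import random
import statistics

def sample_means(draw, n, trials, seed=0):
    """Return the means of `trials` samples, each of size n, drawn via draw(rng)."""
    rng = random.Random(seed)
    return [statistics.fmean(draw(rng) for _ in range(n)) for _ in range(trials)]

# Exponential population: strongly skewed, mean 1, standard deviation 1.
means = sample_means(lambda rng: rng.expovariate(1.0), n=50, trials=2000)
print(statistics.fmean(means))  # should be close to the population mean, 1.0
print(statistics.stdev(means))  # should be close to 1 / sqrt(50), about 0.14
```

A histogram of `means` would look roughly bell-shaped even though the population itself is far from normal.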
The Null Hypothesis, H0
• It is a statement about the hypothesized value of a
population parameter.
• States the (numerical) assumption to be
tested for possible rejection, the test being carried out
under the assumption that the null hypothesis is TRUE.
e.g. The average sale of a showroom is at least 3.0 lakh
(H0: μ ≥ 3.0)
• Always contains the ‘=’ sign
The Alternative Hypothesis, H1
e.g. The average sale of a showroom is less than
3.0 lakh (H1: μ < 3.0)
• Is the opposite of the null hypothesis
• Never contains the ‘=’ sign
• May or may not be accepted
• Is generally the hypothesis that is believed to be true
by the researcher
Level of Significance, α
• Defines unlikely values of the sample statistic if the null
hypothesis is true.
• If we assume that the hypothesis is correct, then the
significance level indicates the percentage of sample
statistics that fall outside certain limits.
• Typical values are 0.01, 0.05, 0.10
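Level of Significance, α, and the Rejection Region
• H0: μ ≥ 3, H1: μ < 3: rejection region of area α in the lower tail
• H0: μ ≤ 3, H1: μ > 3: rejection region of area α in the upper tail
• H0: μ = 3, H1: μ ≠ 3: rejection regions of area α/2 in each tail
The critical value(s) separate the rejection region(s) from the rest of the
sampling distribution.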
One-Tailed Hypothesis Test
The term one-tailed signifies that all values that
would cause us to reject H0 are in just one tail of the
sampling distribution.
Two-Tailed Hypothesis Test
A two-tailed test is one in which values of the test
statistic leading to rejection of the null hypothesis
fall in both tails of the sampling distribution curve.
Summary of Errors Involved in Testing of Hypothesis
                         Inference Based on Sample Data
Real State of Affairs    H0 is Accepted                H0 is Rejected
H0 is True               Correct decision              Type I error
                         Confidence level = 1 - α      Significance level = α*
H0 is False              Type II error                 Correct decision
                         P(Type II error) = β          Power = 1 - β
*Term α represents the maximum probability of
committing a Type I error
α & β Have an Inverse Relationship
Reduce the probability of one error and
the other one goes up.
How to choose between Type I and Type II
errors
• Reworking cost is low: prefer risking a Type I error
• Reworking cost is high: prefer risking a Type II error
TOSH of means when the population
standard deviation is known
• H0: μ = μ0 (or ≤, ≥) vs. HA: μ ≠ μ0 (or >, <)
• Calculate
$$ Z_{calc} = \frac{\bar{X} - \mu_0}{\sigma / \sqrt{n}} $$
Example
Bajaj Company claims that the length of life of its
electric bulb is 1000 hours with a standard
deviation of 30 hours. A random sample of 25
bulbs showed an average life of 960 hours. At the 5%
level of significance, can we conclude that the
sample has come from a population with a mean
life of 1000 hours?
Table value = 1.96
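The arithmetic for this example can be checked with a short Python sketch. The helper name `z_statistic` is an assumption; the numbers and the table value 1.96 are those stated in the example, and a two-tailed test is assumed because the question asks whether the mean equals 1000.

```python
import math

def z_statistic(sample_mean, mu0, sigma, n):
    """Z = (sample mean - hypothesized mean) / (sigma / sqrt(n))."""
    return (sample_mean - mu0) / (sigma / math.sqrt(n))

z = z_statistic(sample_mean=960, mu0=1000, sigma=30, n=25)
critical = 1.96                # table value given in the example
reject = abs(z) > critical     # two-tailed decision rule
print(round(z, 2), reject)     # prints: -6.67 True
```

Since |Z| exceeds 1.96, H0 is rejected: the sample does not support a mean life of 1000 hours.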
t-test, when the standard deviation is unknown and the sample is small
• H0: μ = μ0 (or ≤, ≥) vs. HA: μ ≠ μ0 (or >, <)
• Testing a hypothesis about a mean
• We do not know σ, which must be estimated by s
• Calculate
$$ t_{calc} = \frac{\bar{X} - \mu_0}{s / \sqrt{n}} $$
Example
The weight of a canned food product is specified as 500 g.
For a sample of 8 cans the weights were observed as 480, 475,
510, 500, 505, 495, 504 and 515 g. Test at the 5% level of
significance whether, on average, the weight is as per
specification.
Table value = 2.365
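A sketch of the calculation in Python, using only the standard library. The variable names are illustrative; the data, the hypothesized mean and the table value 2.365 (7 degrees of freedom, two-tailed) come from the example.

```python
import math
import statistics

weights = [480, 475, 510, 500, 505, 495, 504, 515]  # grams, from the example
mu0 = 500                                            # specified weight

n = len(weights)
mean = statistics.mean(weights)      # sample mean
s = statistics.stdev(weights)        # sample standard deviation (n - 1 divisor)
t = (mean - mu0) / (s / math.sqrt(n))

critical = 2.365                     # table value given in the example
reject = abs(t) > critical
print(mean, round(s, 2), round(t, 3), reject)
```

Here the sample mean is 498 and t is about -0.40, so |t| is below 2.365 and H0 is not rejected: the data are consistent with the 500 g specification.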
Two independent samples were collected. For the first sample of 42
items, the mean was 32.3 and the variance 9. The second sample of 57
items had a mean of 34 and a variance of 16. Using the 0.05 level of
significance, test whether there is sufficient evidence to show the second
population has a larger mean.
• H0: μ1 = μ2 (or ≤, ≥) vs. HA: μ1 ≠ μ2 (or >, <)
• n1 = ______, n2 = ______, α = _______
• Testing a hypothesis about two means
• Process performance measure is approximately normally
distributed
• We “know” σ1 & σ2
• Therefore this is a “Z-test”: use the normal distribution.
Calculate the test statistic
$$ Z_{calc} = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\sigma_1^2 / n_1 + \sigma_2^2 / n_2}} $$
• Decision rule (≠ in HA): Reject H0 in favor of HA if Zcalc < -Zα/2 or if Zcalc > +Zα/2.
Otherwise, fail to reject (FTR) H0.
• Decision rule (> in HA): Reject H0 in favor of HA if and only if Zcalc > +Zα. Otherwise, FTR H0.
• Decision rule (< in HA): Reject H0 in favor of HA if and only if Zcalc < -Zα. Otherwise, FTR H0.
Z-test to test two population means (μ1 & μ2) when the population standard
deviations are unknown & n is large
• H0: μ1 = μ2 (or ≤, ≥) vs. HA: μ1 ≠ μ2 (or >, <)
• n1 = ______, n2 = ______, α = _______
• Testing a hypothesis about two means
• Process performance measure is approximately normally
distributed
• We “know” S1 & S2
• Therefore this is a “Z-test”: use the normal distribution.
Calculate the test statistic
$$ Z_{calc} = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{S_1^2 / n_1 + S_2^2 / n_2}} $$
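As an illustration, the two-sample example with samples of 42 and 57 items fits this large-sample case, since only sample variances are given. The Python sketch below assumes H0: μ1 = μ2 against HA: μ1 < μ2 (the second population has the larger mean) and uses the standard one-tailed normal critical value 1.645 for α = 0.05; that critical value and the helper name `two_sample_z` are not stated in the slides.

```python
import math

def two_sample_z(mean1, var1, n1, mean2, var2, n2, hypothesized_diff=0.0):
    """Z = ((x1 - x2) - (mu1 - mu2)) / sqrt(var1/n1 + var2/n2)."""
    standard_error = math.sqrt(var1 / n1 + var2 / n2)
    return ((mean1 - mean2) - hypothesized_diff) / standard_error

z = two_sample_z(mean1=32.3, var1=9, n1=42, mean2=34, var2=16, n2=57)
z_alpha = 1.645          # assumed one-tailed critical value for alpha = 0.05
reject = z < -z_alpha    # decision rule for "<" in HA
print(round(z, 2), reject)
```

Z works out to about -2.42, which is below -1.645, so under these assumptions H0 is rejected in favor of a larger second mean.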
t-test, to test two population means
• H0: μ1 = μ2 (or ≤, ≥) vs. HA: μ1 ≠ μ2 (or >, <)
• n1 = ______, n2 = ______, α = _______
• Testing a hypothesis about two means
• Process performance measure is approximately normally distributed or
we have “small” samples
• We do not know σ, which must be estimated by s
• Therefore this is a “t-test”: use Student’s t distribution.
Calculate
$$ t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{s^* \sqrt{1/n_1 + 1/n_2}} $$
with d.f. = n1 + n2 - 2. In this expression, s* is the pooled standard
deviation, given by
$$ s^{*2} = \frac{(n_1 - 1) s_1^2 + (n_2 - 1) s_2^2}{n_1 + n_2 - 2} $$
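The pooled calculation can be sketched in Python as below. The function name `pooled_t` is an assumption, and the summary statistics in the usage lines are hypothetical values chosen only to exercise the formula; they do not come from the slides.

```python
import math

def pooled_t(mean1, var1, n1, mean2, var2, n2, hypothesized_diff=0.0):
    """Return (t, d.f.) for two independent samples using the pooled variance."""
    df = n1 + n2 - 2
    pooled_var = ((n1 - 1) * var1 + (n2 - 1) * var2) / df
    standard_error = math.sqrt(pooled_var) * math.sqrt(1 / n1 + 1 / n2)
    return ((mean1 - mean2) - hypothesized_diff) / standard_error, df

# Hypothetical summary statistics, for illustration only.
t, df = pooled_t(mean1=52.0, var1=16.0, n1=10, mean2=48.0, var2=20.0, n2=12)
print(round(t, 2), df)
```

The resulting t is compared with the table value of Student's t distribution for the returned degrees of freedom.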
Paired Samples
The difference in these cases is examined by a paired
samples t test. To compute t for paired samples, the
paired difference variable, denoted by D, is formed and its
mean and variance calculated. Then the t statistic is
computed. The degrees of freedom are n - 1, where n is
the number of pairs. The relevant
formulas are:
H0: μD = 0
H1: μD ≠ 0
$$ t_{n-1} = \frac{\bar{D} - \mu_D}{s_D / \sqrt{n}} $$
Cross-Tabulations:
Chi-square Test
Technique used for determining whether
there is a statistically significant relationship
between two categorical (nominal or ordinal)
variables
Telecommunications Company
Marketing manager of a telecommunications company is
reviewing the results of a study of potential users of a new
cell phone
Random sample of 200 respondents
A cross-tabulation of data on whether target consumers would
buy the phone (Yes or No) and whether the cell phone had
Bluetooth wireless technology (Yes or No)
Question
Can the marketing manager infer that an association exists
between Bluetooth technology and buying the cell phone?
Two-Way Tabulation of Bluetooth
Technology and Whether Customers
Would Buy Cell Phone
Cross Tabulations -Hypotheses
H0: There is no association between wireless
technology and buying the cell phone (the two
variables are independent of each other).
Ha: There is some association between the Bluetooth
feature and buying the cell phone (the two
variables are not independent of each other).
Conducting the Test
The test involves comparing the actual, or
observed, cell frequencies in the cross-tabulation with a corresponding set of expected
cell frequencies (Eij).
Expected Values
$$ E_{ij} = \frac{n_i n_j}{n} $$
Where ni and nj are the marginal frequencies, that is, the
total number of sample units in category i of the row
variable and category j of the column variable,
respectively
Computing Expected Values
The expected frequency for the first-row, first-column cell is given by
$$ E_{11} = \frac{100 \times 100}{200} = 50 $$
Observed and Expected Cell Frequencies
Chi-square Test Statistic
$$ \chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{(O_{ij} - E_{ij})^2}{E_{ij}} = 72.00 $$
Where r and c are the number of rows and columns,
respectively, in the contingency table. The number of
degrees of freedom associated with this chi-square
statistic is given by the product (r - 1)(c - 1).
Chi-square Test Statistic in a Contingency Test
For d.f. = 1, assuming α = .05, from Appendix 2, the critical
chi-square value (χ²c) = 3.84.
Decision rule is: “Reject H0 if χ² > 3.84.”
Computed χ² = 72.00
Since the computed chi-square value is greater than the
critical value of 3.84, reject H0.
The apparent relationship between “Bluetooth
technology” and “would buy the cellular phone” revealed
by the sample data is unlikely to have occurred because of
chance.
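The computation of expected frequencies and the chi-square statistic can be sketched in Python. The function name `chi_square` is an assumption. The observed counts in the usage lines are hypothetical: the cross-tabulation itself is not reproduced in this transcript, so the counts were chosen only to match the stated totals (row and column totals of 100, n = 200).

```python
def chi_square(observed):
    """Return (chi-square statistic, d.f.) for a table given as a list of rows."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected frequency E_ij
            statistic += (o - e) ** 2 / e
    df = (len(observed) - 1) * (len(observed[0]) - 1)
    return statistic, df

# Hypothetical 2 x 2 counts with marginals of 100 and n = 200 (an assumption).
observed = [[80, 20],
            [20, 80]]
statistic, df = chi_square(observed)
print(statistic, df, statistic > 3.84)
```

With these assumed counts every expected frequency is 50, as in the E11 calculation, and the statistic comes to 72 with 1 degree of freedom, so H0 would be rejected at the 3.84 critical value.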
EXAMPLE
In a management institute, the A+, A and B grades allocated to students
in their final examination were as follows. Using the 5% level of
significance, determine whether the grading scale is independent of the
specialization.
Table value = 9.488
Specialization
Grade
Finance Marketing
Operations
A+
A
20
25
10
15
20
08
05
15 B
07
Univariate Hypothesis:
Papa John’s restaurants are more likely
to be located in a stand-alone location
or in a shopping center.
Bivariate Hypothesis:
Stand-alone locations are
more likely to be profitable
than are shopping center
locations.