Sampling Design
Sampling

The process of obtaining information from a subset (sample)
of a larger group (population)

The results for the sample are then used to make estimates
of the larger group

Faster and cheaper than asking the entire population

Two keys
1. Selecting the right people
– They have to be selected scientifically so that they are
representative of the population
2. Selecting the right number of the right people
– To minimize sampling error, i.e., choosing the wrong
people by chance
SAMPLING
• Sample -- contacting a portion of the
population (e.g., 10% or 25%)
– best with a very large population (N)
– easiest with a homogeneous population
• Census -- the entire population
– most useful if the population (N) is small
– or the cost of making an error is high
Population Vs. Sample
[Figure: a sample drawn from the population of interest. The population is described by a parameter; the sample is described by a statistic.]
We measure the sample using statistics in order to draw
inferences about the parameters of the population.
Characteristics of Good Samples
• Representative
• Accessible
• Low cost
[Figures: a sample that represents the population poorly (bad), and one that represents it even worse (VERY bad)]
Terminology
Population
The entire group of people of interest from whom the
researcher needs to obtain information.
Element (sampling unit)
one unit from a population
Sampling
The selection of a subset of the population
Sampling Frame
Listing of population from which a sample is chosen
Census
A polling of the entire population
Survey
A polling of the sample
Terminology
Parameter
The characteristic of the population that is of interest
(e.g., a mean or a percentage)
Statistic
The information obtained from the sample about the
parameter
Goal
To be able to make inferences about the population
parameter from knowledge of the relevant statistic - to
draw general conclusions about the entire body of units
Critical Assumption
The sample chosen is representative of the population
Steps in Sampling Process
1. Define the population
2. Identify the sampling frame
3. Select a sampling design or procedure
4. Determine the sample size
5. Draw the sample
Sampling Design Process
1. Define Population
2. Determine Sampling Frame
3. Determine Sampling Method
Non-Probability Sampling
• Convenience
• Judgmental
• Quota
Probability Sampling
• Simple Random Sampling
• Stratified Sampling
• Cluster Sampling
4. Determine Appropriate Sample Size
5. Execute Sampling Design
1. Define the Target Population
Question: “Who, ideally, do you want to survey?”
Answer: those who have the information sought.
• What are their characteristics?
• Who should be excluded?
– age, gender, product use, those in industry
– Geographic area
It involves
– defining population units
– setting population boundaries
– Screening (e.g. security questions, product use)
1. Define the Target Population
Element: individuals; families; seminar groups
Sampling unit: individuals over 20; families with 2 kids; seminar groups at “new” university
Extent: individuals who have bought “one”; families who eat fast food; seminar groups doing MR
Timing: bought over the last seven days
1. Define the Target Population
The target population for a toy store can
be defined as all households with children
living in Calgary.
What’s wrong with this definition?
2. Determine the Sampling Frame
• Obtaining a “list” of the population (how will you reach the sample?)
– Students who eat at McDonald's?
– Young people at random in the street?
– Phone book
– Students' union listing
– University mailing list
• Problems with lists
– omissions
– ineligibles
– duplications
• Procedures
– E.g. individuals who have spent two or more hours on the
internet in the last week
2. Determine the Sampling Frame
Select “sample units”
• Individuals
• Households
• Streets
• Companies
3. Selecting a Sampling Design
• Probability sampling - every unit has a known, nonzero chance of
being included in the sample (random)
– simple random sampling
– systematic sampling
– stratified sampling
– cluster sampling
• Non-probability sampling - the chance of being included in the
sample is unknown, or zero, for some units (non-random)
– convenience sampling
– judgement sampling
– snowball sampling
– quota sampling
3. Selecting a Sampling Design
Probability Sampling
• An objective procedure in which the probability
of selection is nonzero and is known in advance
for each population unit.
• Also called random sampling.
• Helps ensure information is obtained from a
representative sample of the population
• Sampling error can be computed
• Survey results can be projected to the population
• More expensive than non-probability samples
3. Selecting a Sampling Design
Simple Random Sampling (SRS)
• Population members are selected directly from the
sampling frame
• Equal probability of selection for every member
(sample size/population size)
– 400/10,000 = .04
• Use a random number table or random number
generator
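A minimal sketch of the random number generator route, using Python's standard `random` module. The frame of ID numbers 1 to 10,000 is a hypothetical stand-in for a real list of population members; the 400 and 10,000 come from the slide.

```python
import random

N = 10_000  # cases in the sampling frame (figure from the slide)
n = 400     # cases in the sample (figure from the slide)

# Hypothetical frame: ID numbers stand in for a real list of members.
frame = list(range(1, N + 1))

rng = random.Random(42)        # fixed seed so the draw can be repeated
sample = rng.sample(frame, n)  # n distinct members, drawn without replacement

f = n / N  # sampling fraction, the selection probability for every member
print(f"sampling fraction f = {f}")
```

Fixing the seed is a convenience for checking the work; a real draw would normally record the seed rather than reuse this one.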
3. Selecting a Sampling Design
Simple Random Sampling
• N = the number of cases in the sampling frame
• n = the number of cases in the sample
• f = n/N = the sampling fraction
• NCn = the number of combinations (subsets) of n
from N
If you have a sampling frame of the 10,000 full-time
students at the U of L and you want to survey 1
percent of them (f = .01), how would you do it?
3. Selecting a Sampling Design
Objective: To select n units out of N such
that each of the NCn possible samples has an
equal chance of being selected
Procedure: Use a table of random
numbers, a computer random number
generator, or a mechanical device to
select the sample
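To make the objective concrete, here is a small sketch that lists every possible sample for a tiny, hypothetical frame of N = 5 units and n = 2. It only illustrates the counting; real frames are far too large to enumerate.

```python
from itertools import combinations
from math import comb

frame = ["A", "B", "C", "D", "E"]  # hypothetical frame, N = 5
n = 2

# Every subset of size n is one possible sample.
possible_samples = list(combinations(frame, n))
number_of_samples = comb(len(frame), n)  # NCn

# Under simple random sampling each possible sample is equally likely.
p_each_sample = 1 / number_of_samples

# Share of possible samples that contain unit "A": equals n/N.
inclusion_a = sum("A" in s for s in possible_samples) / number_of_samples
print(number_of_samples, p_each_sample, inclusion_a)
```

Because every subset is equally likely, each unit ends up with the same inclusion probability n/N, which is the sampling fraction f.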
3. Selecting a Sampling Design
Systematic Sampling
• Order all units in the sampling frame based
on some variable and number them from 1 to
N
• Choose a random starting place from 1 to k
and then sample every kth unit after that
Systematic random sample
1. Number the units in the population from 1 to N
2. Decide on the n (sample size) that you want or need
3. k = N/n = the interval size
4. Randomly select an integer between 1 and k
5. Then take every kth unit
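The systematic procedure can be sketched as follows. The function name `systematic_sample` and the frame of 100 numbered units are illustrative, and the sketch assumes N is an exact multiple of n so that k is a whole number.

```python
import random

def systematic_sample(frame, n, rng):
    """Take every k-th unit after a random start within the first interval."""
    N = len(frame)
    k = N // n                 # interval size; assumes N is a multiple of n
    start = rng.randint(1, k)  # random integer from 1 to k, inclusive
    # Positions are numbered from 1 on the slide, so subtract 1 for indexing.
    return [frame[pos - 1] for pos in range(start, N + 1, k)]

frame = list(range(1, 101))  # hypothetical frame numbered 1..100
sample = systematic_sample(frame, n=20, rng=random.Random(7))
print(sample)
```

Only the starting point is random; once it is chosen, the rest of the sample is fixed, which is why the ordering of the frame matters.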
3. Selecting a Sampling Design
Stratified Sampling (I)
• The chosen sample contains a number of distinct
categories, which are organized into segments, or
strata
– equalizing "important" variables
• year in school, geographic area, product use, etc.
• Steps:
– Population is divided into mutually exclusive and
exhaustive strata based on an appropriate population
characteristic (e.g. race, age, gender, etc.)
– Simple random samples are then drawn from each
stratum.
Stratified Random Sampling
• The sample size is usually proportional to the relative
size of the strata.
• Ensures that particular groups (e.g. males and
females) within a population are adequately represented
in the sample
• Usually has a smaller sampling error than a simple random
sample, since a source of variation is eliminated
3. Selecting a Sampling Design
Stratified Sampling (II)
• Direct Proportional Stratified Sampling
– The sample size in each stratum is proportional to the
stratum size in the population
• Disproportional Stratified Sampling
– The sample size in each stratum is NOT proportional
to the stratum size in the population
– Used if
1) some strata are too small
2) some strata are more important than others
3) some strata are more diversified than others
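A sketch of direct proportional stratified sampling: allocate the total sample across strata in proportion to stratum size, then draw a simple random sample within each stratum. The strata (year in school), their sizes, and the helper name `proportional_allocation` are assumptions made for illustration.

```python
import random

def proportional_allocation(stratum_sizes, n):
    """Split a total sample of n across strata in proportion to stratum size."""
    N = sum(stratum_sizes.values())
    # round() can leave the parts summing to slightly more or less than n.
    return {name: round(n * size / N) for name, size in stratum_sizes.items()}

# Hypothetical strata for a frame of 10,000 students
stratum_sizes = {"year 1": 4000, "year 2": 3000, "year 3": 2000, "year 4": 1000}
allocation = proportional_allocation(stratum_sizes, n=400)

rng = random.Random(1)
# Simple random sample of unit numbers within each stratum
sample = {
    name: rng.sample(range(size), allocation[name])
    for name, size in stratum_sizes.items()
}
print(allocation)
```

A disproportional design would replace `proportional_allocation` with stratum sample sizes chosen by the researcher, for example a larger share for a small or highly varied stratum.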
3. Selecting a Sampling Design
Cluster Sampling
• The population is divided into mutually
exclusive and exhaustive subgroups, or clusters,
usually based on geography or time period
• Each cluster should be representative of the
population, i.e. be heterogeneous.
• Means between clusters should be the same
(homogeneous)
• Then a sample of the clusters is selected.
• Then some randomly chosen units in the selected
clusters are studied.
Cluster or area random sampling
• divide population into clusters (usually along
geographic boundaries)
• randomly sample clusters
• measure units within sampled clusters
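A two-stage sketch of cluster sampling: first sample clusters, then sample units inside the chosen clusters. The ten districts of fifty households each are hypothetical, as are the stage sizes (3 clusters, 10 units per cluster).

```python
import random

rng = random.Random(3)

# Hypothetical population: 10 districts (clusters) of 50 households each
clusters = {
    f"district_{c}": [f"d{c}_household_{h}" for h in range(1, 51)]
    for c in range(1, 11)
}

# Stage 1: randomly sample clusters
chosen = rng.sample(sorted(clusters), 3)

# Stage 2: randomly choose units within each selected cluster
sample = [unit for c in chosen for unit in rng.sample(clusters[c], 10)]
print(chosen, len(sample))
```

Measuring every unit in the chosen clusters, as in the area sampling description, would simply skip stage 2 and take each selected cluster whole.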
3. Selecting a Sampling Design
When to use stratified sampling
• If primary research objective is to compare groups
• Using stratified sampling may reduce sampling
errors
When to use cluster sampling
• If there are substantial fixed costs associated with
each data collection location
• When there is a list of clusters but not of individual
population members
3. Selecting a Sampling Design
Non-Probability Sampling
• Subjective procedure in which the probability of
selection for some population units is zero or
unknown before drawing the sample.
• Information may be obtained from a non-representative
sample of the population
• Sampling error cannot be computed
• Survey results cannot be projected to the
population
3. Selecting a Sampling Design
Non-Probability Sampling
Advantages
• Cheaper and faster than probability sampling
• Can be reasonably representative if collected in a thorough
manner
Types of Non-Probability Sampling (I)
• Convenience Sampling
– A researcher's convenience forms the basis for
selecting a sample.
• people in my classes
• Mall intercepts
• People with some specific characteristic (e.g. bald)
• Judgement Sampling
– A researcher exerts some effort in selecting a
sample that seems to be most appropriate for
the study.
Types of Non-Probability Sampling
• Snowball Sampling
– Selection of additional respondents is based on referrals
from the initial respondents.
• friends of friends
– Used to sample from low-incidence or rare populations.
• Quota Sampling
– The population is divided into cells on the basis of relevant
control characteristics.
– A quota of sample units is established for each cell.
• 50 women, 50 men
– A convenience sample is drawn for each cell until the quota
is met.
(similar to stratified sampling)
Quota Sampling
Let us assume you wanted to interview tourists coming to a
community to study their activities and spending. Based on
national research you know that 60% come for
vacation/pleasure, 20% are VFR (visiting friends and relatives),
15% come for business and 5% for conventions and meetings.
You also know that 80% come from within the province, 10%
from other parts of Canada, and 10% are international. A total
of 500 tourists are to be intercepted at major tourist spots
(attractions, events, hotels, convention centre, etc.), as you
would in a convenience sample. The number of interviews could
therefore be determined based on the proportion a given
characteristic represents in the population. For instance, once
300 pleasure travellers have been interviewed, this category
would no longer be pursued, and only those who state that one
of the other purposes was their reason for coming would be
interviewed until these quotas were filled.
             Alberta   Canada   International   Totals
Pleasure       .48      .06         .06           .60
Visiting       .16      .02         .02           .20
Business       .12      .015        .015          .15
Convention     .04      .005        .005          .05
Totals         .80      .10         .10          1.00
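The cell proportions can be turned into interview quotas for the 500 intercepts. This is a sketch: the shares come from the example, and multiplying the purpose share by the origin share assumes the two characteristics are independent, which is what the table's cell values imply.

```python
TOTAL_INTERVIEWS = 500

# Shares in percent, taken from the example
purpose_pct = {"Pleasure": 60, "Visiting": 20, "Business": 15, "Convention": 5}
origin_pct = {"Alberta": 80, "Canada": 10, "International": 10}

# Cell quota = total x purpose share x origin share.
# Multiplying the marginal shares assumes purpose and origin are independent.
quotas = {
    (purpose, origin): TOTAL_INTERVIEWS * p * o / 10_000
    for purpose, p in purpose_pct.items()
    for origin, o in origin_pct.items()
}

pleasure_total = sum(q for (purpose, _), q in quotas.items() if purpose == "Pleasure")

for cell, quota in quotas.items():
    print(cell, quota)
```

Some cells come out fractional (for example 7.5 business travellers from elsewhere in Canada), so the researcher has to decide how to round them while keeping the total at 500.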
Probability Vs. Non-Probability Sampling
Disadvantages of non-probability sampling
• The probability of selecting one element over another
is not known and therefore the estimates cannot be
projected to the population with any specified level of
confidence.
• Quantitative generalizations about the population can
only be made under probability sampling.
• In practice, however, marketing researchers also
apply statistics to study non-probability samples.
Generalization
• You can only generalize to the population
from which you sampled
– U of L students not university students
• geographic, different majors, different jobs, etc.
– University students not Canadian population
• younger, poorer, etc.
– Canadians not people everywhere
• less traditional, more affluent, etc.
Drawing inferences from samples
• Population estimates
– % who smoke, buy your product, etc
• 25% of sample
• what % of population?
– very dangerous with a non-representative
sample or with low response rates
Errors in Survey
Random Sampling Error
– random error: the sample selected is not
representative of the population due to chance
– the level of it is controlled by sample size
– a larger sample size leads to a smaller sampling
error.
Population mean (μ) gross income = $42,300
Sample 1 (400/250,000) mean (x̄) = $41,100
Sample 2 (400/250,000) mean (x̄) = $43,400
Sample 3 (400/250,000) mean (x̄) = $36,400
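A small simulation can show how sample size controls random sampling error. Everything about the population here is an assumption for illustration: 250,000 incomes generated from a normal distribution centred on the $42,300 mean above, with a made-up spread of $15,000. The function name `spread_of_sample_means` is also illustrative.

```python
import random
import statistics

rng = random.Random(0)

# Hypothetical population of 250,000 incomes. The normal shape and the
# $15,000 standard deviation are assumptions, not figures from the slide.
population = [rng.gauss(42_300, 15_000) for _ in range(250_000)]

def spread_of_sample_means(n, draws=200):
    """Draw many simple random samples of size n; return the SD of their means."""
    means = [statistics.fmean(rng.sample(population, n)) for _ in range(draws)]
    return statistics.stdev(means)

spread_400 = spread_of_sample_means(400)
spread_1600 = spread_of_sample_means(1600)
print(f"n = 400:  sample means vary by about ${spread_400:,.0f}")
print(f"n = 1600: sample means vary by about ${spread_1600:,.0f}")
```

Quadrupling the sample size should roughly halve the spread of the sample means, since the standard error shrinks with the square root of n.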
Non-Sampling Errors (I)
Non-sampling Error
– systematic error
– the level of it is NOT controlled by sample size.
• The basic types of non-sampling error
– Non-response error
– Response or data error
• A non-response error occurs when units selected as
part of the sampling procedure do not respond in
whole or in part
– If non-respondents are not different from those that did
respond, there is no non-response error
Non-Sampling Errors (II)
• A response or data error is any systematic
bias that occurs during data collection,
analysis or interpretation
– Respondent error (e.g., lying, forgetting, etc.)
– Interviewer bias
– Recording errors
– Poorly designed questionnaires
Data Preparation
Steps in Data Preparation
1. Editing
2. Coding
3. Entering Data
4. Data Tabulation
5. Reviewing Tabulations
6. Statistically adjusting the data (e.g. weighting)
Editing
• Carefully checking survey data for
– Completeness (no omissions)
– Non-ambiguity (e.g. two boxes checked instead of
one)
– Right informant (e.g. under age, when all supposed
to be over 18)
– Consistency
• e.g. charging something on a credit card when
the person does not own a credit card
– Accuracy (e.g. no numbers out of range)
• Most important purpose is to eliminate or at
least reduce the number of errors in the raw
data.
Solutions
1. Ideally re-interview respondent
2. Eliminate all unacceptable surveys (casewise
deletion) (if sample is large and few are unacceptable)
3. In calculations only the cases with complete
responses are considered (pairwise deletion)
(means that some statistics will be based on
different sample sizes)
4. Code illegible or missing answers into a “no valid
response” category
5. Substitute a neutral value, typically the mean
response to the variable, so that the mean
remains unchanged
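Solutions 2, 3 and 5 can be sketched on a tiny, made-up data set in which `None` marks a missing answer. The variable names `q1` and `q2` and the helper `valid` are hypothetical.

```python
from statistics import fmean

# Hypothetical responses; None marks a missing answer.
rows = [
    {"q1": 4, "q2": 5},
    {"q1": 2, "q2": None},
    {"q1": None, "q2": 3},
    {"q1": 3, "q2": 4},
]

# Casewise deletion: drop any survey with a missing answer.
complete = [r for r in rows if None not in r.values()]

# Pairwise deletion: each statistic uses every valid answer to its own
# variable, so different statistics can rest on different sample sizes.
def valid(var):
    return [r[var] for r in rows if r[var] is not None]

q1_mean = fmean(valid("q1"))
q2_mean = fmean(valid("q2"))

# Mean substitution: fill the gap with the variable's mean.
filled_q1 = [r["q1"] if r["q1"] is not None else q1_mean for r in rows]
print(len(complete), q1_mean, q2_mean, fmean(filled_q1))
```

Mean substitution leaves the mean where it was but makes the variable look less spread out than it really is, which is worth keeping in mind before choosing it.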
Coding
• The process of systematically and consistently
assigning each response a numerical score.
• The key to a good coding system is for the coding
categories to be mutually exclusive and the entire
system to be collectively exhaustive.
• To be mutually exclusive, every response must fit
into only one category.
• To be collectively exhaustive, all possible
responses must fit into one of the categories.
• Exhaustive means that you have covered the entire
range of the variable with your measurement.
Coding
• Coding Missing Numbers: When respondents fail
to complete portions of the survey.
– Whatever the reason for incomplete surveys, you
must indicate that there was no response provided
by the respondent.
– For single-digit responses code as “9”; for two-digit
responses code as “99”
• Coding Open-Ended Questions: When open-ended
questions are used, you must create categories.
– All responses must fit into a category
– Similar responses should fall into the same
category.
e.g. Who services your car? ______________
Possible categories: self, garage, husband, wife,
friend, relative, etc.
• To make it collectively exhaustive add an “other” or
“none of the above” category
– Only a few (i.e. < 10%) should fit into this category
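A sketch of coding the open-ended car service question. The numeric codes, the `OTHER` code of 7, and the function name `code_answer` are illustrative choices; only the category names and the “9” for a missing answer come from the slides.

```python
# Hypothetical code map for "Who services your car?"
CODES = {"self": 1, "garage": 2, "husband": 3, "wife": 4, "friend": 5, "relative": 6}
OTHER = 7    # catch-all that keeps the scheme collectively exhaustive
MISSING = 9  # no response provided

def code_answer(text):
    """Return the numeric code for one written answer."""
    if text is None or not text.strip():
        return MISSING
    return CODES.get(text.strip().lower(), OTHER)

answers = ["Garage", "self", "", "my neighbour", " Wife "]
coded = [code_answer(a) for a in answers]
print(coded)
```

If the `OTHER` code starts collecting many answers, that is a sign the category list needs another entry.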
Precoded Questionnaires: Sometimes you can place
codes on the actual questionnaire, which simplifies
data entry.
This…
Are you:
Male
Female
How satisfied are you with our product?
___Very Satisfied
___Somewhat Satisfied
___Somewhat Dissatisfied
___Very Dissatisfied
___No opinion
Becomes this…
Are you: (1) Male
(2) Female
How satisfied are you with our product?
_1__Very Satisfied
_2__Somewhat Satisfied
_3__Somewhat Dissatisfied
_4__Very Dissatisfied
_5__No opinion
1. Are you solely responsible for taking care of your
automotive service needs? ___ Yes ___ No
2. If No, who performs the simple maintenance? ___________
3. If scheduled maintenance is done on your automobile,
how do you keep track of what has been done?
•Not tracked
•auto dealer records
•mental recollection
•other
4. How often is your automobile serviced?
•Once per month
•Once every three months
•Once every six months
•Once per year
•Other _______________
Code Book
Col. No. | Question No. | Question Des. | Range of permissible values
1 | N/A | ID # | 001-200 (this also means the surveys themselves should be numbered)
2 | 1 | Responsible for maintenance | 0=no, 1=yes, 9=blank
3 | 2 | Perform simple maintenance | 0=husband, 1=boyfriend, 2=father, 3=mother, 4=relative, 5=friend, 6=other, 9=blank
4 | 3 | How maintenance tracked | 0=not tracked, 1=auto dealer records, 2=personal records, 3=mental recollection, 4=other, 9=blank
5 | 4 | How often maintenance performed | 1=once per month, 2=3 months, 3=6 months, 4=year, 5=other, 9=blank
In questions that permit multiple responses, each possible
response option should be assigned a separate column
6. Which magazines do you read? Choose all that apply.
• Time
• National Geographic
• Readers Digest
• Chatelaine
• MacLean's
Col. No. | Question No. | Question Des. | Range of permissible values
15 | 6 | Time | 0=read, 1=not read
16 | 6 | Readers Dig. | 0=read, 1=not read
17 | 6 | MacLean's | 0=read, 1=not read
18 | 6 | National Geo. | 0=read, 1=not read
19 | 6 | Chatelaine | 0=read, 1=not read
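A sketch of turning one respondent's ticked boxes into one column per magazine. The codes follow the code book above (0 = read, 1 = not read); many analysts prefer the reverse, so the two constants are kept in one place. The function name `code_multi_response` is illustrative.

```python
# Column order follows the code book (columns 15 to 19).
MAGAZINES = ["Time", "Readers Digest", "MacLean's", "National Geographic", "Chatelaine"]

READ, NOT_READ = 0, 1  # codes as given in the code book

def code_multi_response(ticked):
    """Return one code per magazine column for a single respondent."""
    return [READ if magazine in ticked else NOT_READ for magazine in MAGAZINES]

# Hypothetical respondent who ticked two boxes
row = code_multi_response({"Time", "Chatelaine"})
print(row)
```

Giving each option its own column means a respondent who ticks nothing still produces a full, valid row.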
For rank order questions, separate columns are also needed
7. Please rank the following brands of toothpaste in order
of preference (1-5) with 1 being the most important
• Crest
• Aquafresh
• Aim
• Colgate
• Arm & Hammer
Col. No. | Q. No. | Question Des. | Range of permissible values
20 | 7 | Crest rank | 0=blank, 1=most important, 2=2nd most important, 3=third, 4=fourth, 5=fifth
21 | 7 | Colgate rank | 0=blank, 1=most important, 2=2nd most important, 3=third, 4=fourth, 5=fifth
22 | 7 | Aquafresh rank | 0=blank, 1=most important, 2=2nd most important, 3=third, 4=fourth, 5=fifth
23 | 7 | A & H rank | 0=blank, 1=most important, 2=2nd most important, 3=third, 4=fourth, 5=fifth
25 | 7 | Pepsodent rank | 0=blank, 1=most important, 2=2nd most important, 3=third, 4=fourth, 5=fifth
Preparing the Data for
Analysis
Variable Re-specification
• Existing data modified to create new variables
• Large number of variables collapsed into fewer
variables
• E.g. If 10 reasons for purchasing a car are given they
might be collapsed into four categories e.g.
performance, price, appearance, and service
• Creates variables that are consistent with research
questions
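A sketch of collapsing detailed purchase reasons into the four broader categories. The ten detailed reasons and their assignment to categories are invented for illustration; only the four category names come from the slide.

```python
from collections import Counter

# Hypothetical mapping from 10 detailed reasons to 4 categories
COLLAPSE = {
    "acceleration": "performance",
    "fuel economy": "performance",
    "handling": "performance",
    "sticker price": "price",
    "financing": "price",
    "styling": "appearance",
    "colour": "appearance",
    "warranty": "service",
    "dealer reputation": "service",
    "maintenance cost": "service",
}

# Hypothetical coded responses, one reason per respondent
responses = ["styling", "warranty", "acceleration", "financing", "colour"]

# New variable: the collapsed category for each response
collapsed = [COLLAPSE[reason] for reason in responses]
counts = Counter(collapsed)
print(counts)
```

Keeping the mapping in one table makes the re-specification easy to review and to change if the research questions shift.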
Entering Data
• Problems can occur during data entry, such as
transposing numbers and inputting an infeasible
code (e.g. out of range)
– E.g. if scores range from 1 to 5, then 0, 6, 7, and 8 are
unacceptable or out of range (might be due to
transcription error)
• Always check the data-entry work.
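One way to check data-entry work is an automatic range check. This sketch assumes every column is scored 1 to 5 with 9 reserved for a missing answer; the function name `out_of_range` and the sample records are hypothetical.

```python
def out_of_range(records, low=1, high=5, missing=9):
    """List (row, column, value) for codes outside low..high, ignoring the missing code."""
    return [
        (r, c, value)
        for r, row in enumerate(records)
        for c, value in enumerate(row)
        if value != missing and not low <= value <= high
    ]

# Hypothetical entered data: one list of codes per survey
records = [
    [3, 5, 1],
    [0, 4, 9],
    [2, 6, 8],
]

problems = out_of_range(records)
for row, col, value in problems:
    print(f"survey {row}, column {col}: infeasible code {value}")
```

A range check catches infeasible codes only; a transposed entry that still falls within 1 to 5 needs a comparison against the paper survey.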