Download Statistics for Marketing and Consumer Research

Sampling
Chapter 5
Statistics for Marketing & Consumer Research
Copyright © 2008 - Mario Mazzocchi
Sample
• A sub-set of the units in a population, expected to
represent the whole population
• By measuring data on a sample
• Information on the entire population is gathered at a
lower cost compared to censuses
• Some margin of error is necessarily accepted
Probability vs. non-probability sampling
Probability sampling: each unit in the sampling
frame is associated with a given probability of being
included in the sample, which means that the
probability of each potential sample is known
Non-probability sampling: extraction of sample units
is not based on probability rules
Inference and sampling error
• Prior knowledge on the probability of all potential samples
allows statistical inference
• Statistical inference is the generalization of sample
statistics (estimates of the population parameters) to the
target population, subject to a margin of uncertainty, or
sampling error.
• The probability laws ruling probability sampling allow one
to ascertain how much the sample estimates reflect the
true characteristic of the target population
• The sampling error can be estimated and used to assess the
precision of sample estimates
• The sampling error is only a portion of the survey error
(which also includes non-sampling error), but has the
advantage that it can be estimated and controlled using the
information on the sampling method
Characteristics and limits of non-probability sampling
• The use of non-probability samples is common in marketing
research, especially quota sampling (discussed later)
• It can be argued that the sampling error is often much
smaller than error from non-sampling sources
• However, the problems with non-probability samples are
that:
• Selection of the sampling units is subjective
• It is impossible to assess scientifically the ability to avoid the
potential biases of a non-probability sample
• Sampling error, precision and accuracy cannot be estimated
• Statistical methods to analyze sample data are based on probability
assumptions (e.g. normality of the data distribution), which can
only be ensured by probability extraction rules
Sampling concepts
• The sampling error depends on:
• The sampling fraction (ratio between sample size and
population size)
• Larger sampling fractions increase precision
• However, the gain in precision decreases as the sampling
fraction increases
• The data variability in the population
• The less variable the population data, the more precise the
sample estimates
• However, population variability is rarely known and generally
estimated
• The precision of the sample estimator
• Precision of an estimator (measured through the standard error
of the estimator) is the variability of an estimate across
multiple measurements (across different samples)
Standard deviation vs. standard error
• The standard deviation measures the variability of
a given variable (e.g. X) within the population or
sample
• The standard error is a precision measure which
refers to the variability of the sample estimator
(e.g. the sample mean) across multiple estimates
• The standard error depends on the standard
deviation but they are not the same concept
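The distinction can be illustrated with a small simulation. This is only a sketch: the population (10,000 normally distributed values), the sample size of 25 and the 2,000 repetitions are illustrative assumptions, not taken from the slides.

```python
import random
import statistics

random.seed(0)  # fixed seed so the sketch is reproducible

# Hypothetical population of 10,000 values (illustrative only).
population = [random.gauss(50, 10) for _ in range(10_000)]
n = 25  # sample size

# Standard deviation: variability of X within the population.
sd = statistics.pstdev(population)

# Standard error: variability of the sample mean across many samples.
sample_means = [
    statistics.fmean(random.sample(population, n)) for _ in range(2_000)
]
se_empirical = statistics.stdev(sample_means)

# For samples that are small relative to the population, SE is close to SD / sqrt(n).
se_theory = sd / n ** 0.5

print(f"SD of X:              {sd:.2f}")
print(f"SE of mean (simul.):  {se_empirical:.2f}")
print(f"SE of mean (SD/sqrt n): {se_theory:.2f}")
```

The standard deviation describes the spread of individual values, while the standard error is several times smaller because it describes the spread of the sample mean.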
Accuracy and precision
Accuracy: the degree to which the sample estimate is
close to the true population value – maximum
accuracy is obtained when the estimate equals the
true population value
Precision: the variability of the sample estimate in
repeated measurements (across different samples)
– maximum precision is obtained when the
estimate is the same across all samples
The standard error of an estimator is a measure of
precision
Samples and population: some terminology
Target population: the set of units which are the
object of the research and from which the sample is
extracted. The characteristics to be measured in the
population are usually called parameters
Sampling frame: a list of the population units
Sample size: the number of units (n) in the sample
Sample statistic: an estimate of the population
parameter based on the sample observations
Sampling distribution: the probability distribution of
the sample statistics around the true population
value
The indirect problem and inference
• If one extracted all potential samples from a population (the
sampling space) then the sampling distribution would be
exactly known
• However, this would be a pointless exercise, given that
the true population parameter would already be known
• Thus, statisticians are interested in the indirect problem
• Only one sample is extracted
• Only the sample statistics are known
• The sampling distribution is not known exactly, but it can be
ascertained from the probabilistic sampling method
• Given the sampling distribution and the sample statistics, one
obtains estimates of the true population parameters through
statistical inference
Example – sample mean
Extract two elements from a population of four
POPULATION: A=4; B=1; C=3; D=4 – Pop. Average=3
SAMPLING SPACE:
AB – Sample mean= 2.5
AC – 3.5
AD – 4
BC – 2
BD – 2.5
CD – 3.5
Sampling distribution
[Figure: bar chart of the number of samples (vertical axis) for each value of the sample mean (horizontal axis: 2, 2.5, 3.5, 4)]
Example – the sampling distribution
• The average of all sample means in the sampling
space is three (equal to the true population mean)
• None of the extracted samples exactly reflects the
population (none has a mean of 3)
• The mean absolute error which we commit by
observing only two out of four population units is
0.667 – this is a direct measure of sampling error
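The figures in this example can be checked by enumerating the sampling space directly. The sketch below uses only the four population values given in the example; the variable names are illustrative.

```python
from itertools import combinations
from statistics import fmean

population = {"A": 4, "B": 1, "C": 3, "D": 4}
pop_mean = fmean(population.values())

# Every possible sample of two distinct units (the sampling space).
sample_means = {
    "".join(units): fmean(population[u] for u in units)
    for units in combinations(population, 2)
}

# Average of all sample means, and average distance from the true mean.
mean_of_means = fmean(sample_means.values())
mean_abs_error = fmean(abs(m - pop_mean) for m in sample_means.values())

for name, m in sample_means.items():
    print(name, m)
print("mean of sample means:", mean_of_means)
print("mean absolute error:", round(mean_abs_error, 3))
```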
Example – the indirect problem
• Suppose we have observed only one sample which
is extracted randomly
• The probability extraction method and the sample
observations allow us to
• Obtain an estimate of the population mean
• Obtain a precision estimate (the sampling error)
• By combining the sample estimate with the
sampling error, one can draw inference on the true
population value, for example by defining a
bracket which is likely to include the true value
Example results
• Suppose we extract the sample AB through simple
random sampling
• The sample mean is 2.5
• The mean error within the sample is 0.5
• With a very rough (and inexact) assumption (that
the mean error within the sample reflects the
sampling error), we might claim that the true
population value lies between [2.5-0.5] and
[2.5+0.5], that is between two and three
• This is a rough example, but with large samples
and probability theory, knowledge based on a
single sample can lead to accurate conclusions on
the whole population, accounting for sampling
error
In practice…
• One sample is extracted
• The sample mean and sample standard deviation are
obtained
• An estimate of precision is obtained through an
estimate of the standard error of the mean, which
is a function of the sample standard deviation and
the sample size
• Using the sample mean and the measure of
precision, one draws conclusions on the population
mean (see lecture 6)
Population parameters (in a population of N elements)
• Mean: $\mu = \frac{1}{N}\sum_{i=1}^{N} x_i$
• Variance: $\sigma^2 = \frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2$
• Standard deviation: $\sigma = \sqrt{\sigma^2} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} (x_i - \mu)^2}$
Sample statistics
• Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
• Sample variance: $s^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2$ (dividing by n − 1 ensures unbiasedness)
• Sample standard deviation: $s = \sqrt{s^2} = \sqrt{\frac{1}{n-1}\sum_{i=1}^{n} (x_i - \bar{x})^2}$
Simple random sampling
• Each element of the population has a known and
equal probability of selection
• Every element is selected independently from
other elements
• The probability of selecting a given sample of n
elements is computable (known)
• The Central Limit Theorem implies that for
simple random samples with sample size (n)
sufficiently large (a common rule of thumb is
n > 40), the sampling distribution of the sample
mean is approximately normal
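A quick simulation can make the last point concrete. The sketch below assumes a deliberately skewed (exponential) population, a sample size of 50 and 4,000 repeated samples; none of these values come from the slides. If the sample means are approximately normal, about 95% of them should fall within 1.96 standard errors of the population mean.

```python
import random
import statistics

random.seed(1)  # fixed seed so the sketch is reproducible

# Hypothetical, strongly skewed population (exponential with mean 1).
population = [random.expovariate(1.0) for _ in range(20_000)]
mu = statistics.fmean(population)
sigma = statistics.pstdev(population)

n = 50
means = [statistics.fmean(random.sample(population, n)) for _ in range(4_000)]

# Share of sample means lying within 1.96 standard errors of the true mean.
se = sigma / n ** 0.5
share_within = sum(abs(m - mu) <= 1.96 * se for m in means) / len(means)
print(f"share of sample means within 1.96 SE: {share_within:.3f}")
```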
The normal distribution (again)
• Recall the curve of measurement error
• With simple random sampling the sample means
follow the same probability distribution
Basic SRS sample statistics (unknown pop. variance)
Mean case
• Sample mean: $\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$
• Sample standard deviation of X: $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$
• Standard error of the mean: $s_{\bar{x}} = \sqrt{\frac{s^2}{n}}$
Proportion case (p)
• Sample standard deviation of X: $s = \sqrt{\frac{n}{n-1}\, p(1-p)}$
• Standard error of the proportion: $s_p = \sqrt{\frac{p(1-p)}{n-1}}$
The standard errors measure the PRECISION of sample estimates
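The two standard errors can be computed as in the sketch below. The function names and the small data sets are hypothetical; the formulas are the SRS ones for the mean and for a proportion (with n − 1 in the denominator of the variance).

```python
import math
import statistics

def se_mean(values):
    """Standard error of the sample mean: s / sqrt(n), with s based on n - 1."""
    n = len(values)
    s = statistics.stdev(values)  # sample standard deviation (divides by n - 1)
    return s / math.sqrt(n)

def se_proportion(p, n):
    """Standard error of a sample proportion: sqrt(p(1 - p) / (n - 1))."""
    return math.sqrt(p * (1 - p) / (n - 1))

# Hypothetical data: weekly spending of 8 respondents, and 1 = buys the brand.
spending = [12.0, 15.5, 9.0, 20.0, 14.5, 11.0, 18.0, 16.0]
buys = [1, 0, 1, 1, 0, 0, 1, 0]
p = sum(buys) / len(buys)

print("SE of mean spending:", round(se_mean(spending), 3))
print("SE of proportion:   ", round(se_proportion(p, len(buys)), 3))
# For 0/1 data the two formulas give the same result.
print(math.isclose(se_mean(buys), se_proportion(p, len(buys))))
```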
Precision of estimators and sample size
• The standard error increases with higher
population variances and decreases with
larger sample sizes
• However, the relative gain in precision
decreases as sample size increases
• Very large sample sizes are not cost-effective,
because the gain in precision is very small
and the increase in costs is very large
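The diminishing gain in precision follows from the square root of n in the standard error. A minimal sketch, assuming a known standard deviation of 10 (an illustrative value) and ignoring the finite population correction:

```python
import math

def standard_error(sigma, n):
    """Standard error of the mean for a sample of size n."""
    return sigma / math.sqrt(n)

sigma = 10.0  # assumed population standard deviation (illustrative)

previous = None
for n in (100, 200, 400, 800, 1600):
    se = standard_error(sigma, n)
    note = "" if previous is None else f"  (reduction: {previous - se:.3f})"
    print(f"n={n:5d}  SE={se:.3f}{note}")
    previous = se
```

Each doubling of the sample size doubles the fieldwork but cuts the standard error by the same factor of about 0.71, so the absolute reduction shrinks at every step.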
Accuracy and confidence level (1)
• Suppose the sampling distribution is normal (as for simple
random sampling)
• The level α (further discussed in lecture 6) is
the probability that the relative difference between the
estimated sample mean and the true population mean is
larger than a given relative accuracy level r, so that:
$$\Pr\left( \left| \frac{\bar{x} - \bar{X}}{\bar{X}} \right| < r \right) = 1 - \alpha$$
where $\bar{X}$ is the population mean and 1 − α is the level of confidence
Accuracy and confidence level (2)
• The confidence level is chosen by the researcher
• For example suppose we want a 95% confidence
level – what is the value of the relative accuracy r?
• In other words, if we extracted 100 different
samples, in only 5 out of 100 would we commit a
relative error larger than r
• Relative accuracy is expressed in percentage terms with
respect to the population mean
Example – relative accuracy in SRS
• Suppose we set α = 0.05 (confidence level of 95%)
• The equation to compute relative accuracy with
simple random sampling is the following:
$$r = \frac{t_{\alpha/2}\, S_x}{\sqrt{n}\,\bar{X}} \left( 1 - \frac{n}{N} \right)$$
• Accuracy depends on:
• Sample size
• Population size
• Standard error of the mean
• A constant value ($t_{\alpha/2}$) which depends on the confidence
level and the sample size
Relative accuracy, sample size and population size
Relative accuracy by sample size (rows) and population size (columns):
Sample size |    100 |  1,000 |  2,000 |  5,000 | 100,000 | 1,000,000 | 100,000,000
30          | 11.75% | 16.28% | 16.53% | 16.68% |  16.77% |    16.78% |      16.78%
50          |  6.39% | 12.14% | 12.46% | 12.65% |  12.78% |    12.78% |      12.78%
100         |  0.00% |  8.04% |  8.48% |  8.75% |   8.92% |     8.93% |       8.93%
200         |        |  5.02% |  5.65% |  6.02% |   6.26% |     6.27% |       6.27%
500         |        |  1.98% |  2.97% |  3.56% |   3.93% |     3.95% |       3.95%
1000        |        |  0.00% |  1.40% |  2.23% |   2.76% |     2.79% |       2.79%
2000        |        |        |  0.00% |  1.18% |   1.93% |     1.97% |       1.97%
• For larger population sizes it is not necessary to increase sample size
• A sample size of 500 guarantees an error below 5% for any population
size
• Above a size of 500, it is better to consider spending money on reducing
non-sampling errors
Determining sample size
Factors influencing sample size (n)
• Size of the population (N)
• Variability of the population (σ)
• Desired level of accuracy ($s_{\bar{x}}$ or r)
• Level of confidence (1 − α)
• Budget constraint
Simple random sampling – determining sample size
Determining sample size for a given relative error:
$$n_0 = \left( \frac{t_{\alpha/2}\,\sigma}{r} \right)^2$$
σ needs to be estimated (or conservative assumptions can be
made)
r is the relative level of precision
t, as before, is a constant which depends on α and on the
sample size
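The calculation can be sketched as follows. Two assumptions are made here that the slide does not state: the normal quantile is used as a large-sample stand-in for the t value (which itself depends on the unknown n), and σ and r are expressed on the same scale (for instance both relative to the mean). The function name and the input values are illustrative.

```python
import math
from statistics import NormalDist

def required_sample_size(sigma, r, alpha=0.05):
    """Initial sample size n0 = (t * sigma / r)^2, rounded up.

    The normal quantile is an assumed large-sample approximation of the
    t value; sigma and r must be expressed on the same scale.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return math.ceil((z * sigma / r) ** 2)

# Illustrative inputs: variability of 0.45 relative to the mean, 5% accuracy.
print(required_sample_size(sigma=0.45, r=0.05))
# Halving the tolerated error roughly quadruples the required sample size.
print(required_sample_size(sigma=0.45, r=0.025))
```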
The sampling design process
1. Define the target population, its elements and the sampling
units
2. Determine the sampling frame (list)
3. Select a sampling technique
4. Determine the sample size
5. Execute the sampling process
Selection bias
• Improper selection of sample units (ignoring a relevant
control variable) so that the values observed in the sample
are biased and the sample is not representative.
• Some units have a higher probability of being selected,
without this being acknowledged in the sampling process.
• If the units with higher inclusion probabilities have specific
characteristics that differ from the rest of the population,
as is often the case, sample measurements will suffer
from a significant bias.
Example:
A survey is conducted to measure goat milk consumption,
but the interviewers only select people in urban areas, who
on average drink less goat milk.
The sampling techniques
• Probability sampling
  • Simple random sampling
  • Systematic sampling
  • Stratified sampling
  • Cluster sampling
  • Complex sampling techniques
• Non-probability sampling
  – Convenience sampling
  – Judgmental sampling
  – Quota sampling
  – Snowball sampling
Simple random sampling
• Each element of the population has a known and
equal probability of selection
• Every element is selected independently from other
elements
• The probability of selecting a given sample of n
elements is computable (known)
Advantages:
• Statistical inference is possible
• It is easily understood
Disadvantages:
• Representative samples are large and expensive
• Standard errors are larger than in other probabilistic sampling techniques
• Sometimes it is difficult to execute a truly random sampling
Systematic sampling
• A list of N elements in the population is compiled and
ordered according to a specified variable
• Unrelated to the target variable (similar to SRS)
• Related to the target variable (increased representativeness)
• A sample size n is chosen
• A systematic step of k=N/n is set
• A random number s between 1 and N is extracted and
represents the first element to be included
• Then the other elements selected are s+k, s+2k, s+3k…
Advantages:
• Cheaper and easier than SRS
• More representative if the order is related to the variable of interest (monotone)
• Sampling frame not always necessary
Disadvantages:
• Less representative (biased) if the order is cyclical
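The selection steps for systematic sampling can be sketched as below. The random start is taken anywhere in the list, as described in the steps; wrapping around the end of the list so that exactly n units are selected is an assumption added here, and the function name and frame are hypothetical.

```python
import random

def systematic_sample(frame, n, rng=random):
    """Select every k-th unit of an ordered frame from a random start."""
    N = len(frame)
    k = N // n                  # systematic step (integer part of N/n)
    start = rng.randrange(N)    # random start anywhere in the list (0-based)
    # Assumption: wrap around the end of the list so n units are selected.
    return [frame[(start + j * k) % N] for j in range(n)]

frame = list(range(1, 101))     # hypothetical ordered list of 100 unit ids
sample = systematic_sample(frame, 10, random.Random(42))
print(sample)
```

Because n × k never exceeds N, the selected positions are always distinct.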
Stratified sampling
• The population is partitioned into strata through control
variables (stratification variables) closely related to
the target variable, so that there is homogeneity within
each stratum and heterogeneity between strata
• Simple random sampling is applied within each
stratum of the population
– Proportionate sampling: the size of the sample from each stratum
is proportional to the relative size of the stratum in the total
population
– Disproportionate sampling: size is also proportional to the
standard deviation of the target variable in each stratum
Advantages:
• Gains in precision
• Includes all relevant subpopulations, even if small
Disadvantages:
• Stratification variables may not be easily identifiable
• Stratification can be expensive
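Proportionate stratified sampling can be sketched as follows. The strata, their sizes and the function name are hypothetical; note that rounding the stratum allocations may make the total differ slightly from n when the shares are not whole numbers.

```python
import random

def proportionate_stratified_sample(strata, n, rng=random):
    """Draw an SRS within each stratum, sized in proportion to the stratum."""
    N = sum(len(units) for units in strata.values())
    sample = {}
    for name, units in strata.items():
        n_h = round(n * len(units) / N)        # proportionate allocation
        sample[name] = rng.sample(units, n_h)  # SRS without replacement
    return sample

# Hypothetical population of 1,000 units split into three strata.
strata = {
    "urban": [f"u{i}" for i in range(600)],
    "suburban": [f"s{i}" for i in range(300)],
    "rural": [f"r{i}" for i in range(100)],
}
sample = proportionate_stratified_sample(strata, 50, random.Random(7))
print({name: len(units) for name, units in sample.items()})
```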
Post-stratification
• Typical obstacle to stratified sampling:
unavailability of a sampling frame for each of the
strata
• It may be useful to proceed through simple random
sampling and exploit the stratified estimator once
the sample has been extracted, which increases
efficiency.
• All that is required is the knowledge of the stratum
sizes in the population and that such post-stratum
sizes are sufficiently large. The advantage of post-stratification is two-fold:
Applying post-stratification (PS)
• It allows one to correct the potential bias due to insufficient coverage of
the survey (incomplete sampling frame)
• It allows one to correct the missing responses bias, provided that the
post-stratification variable is related both to the target variable and to the cause of non-response
• It is carried out by extracting a Simple Random Sample (SRS) of size n
and then classifying units into strata. Instead of the usual SRS mean, a
PS estimator is computed by weighting the means of the sub-groups by
the size of each sub-group.
• The procedure is identical to the one of stratified sampling and the only
difference is that the allocation into strata is made ex post.
• The standard error for the PS mean estimator is larger than the
stratified sampling one, because additional variability is given by the
fact that the sample stratum sizes are themselves the outcome of a
random process.
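The PS estimator described above can be sketched as follows. The sample, the stratum labels and the population stratum sizes are hypothetical, and the sketch assumes every stratum has at least one sampled unit (otherwise the weights would not sum to one).

```python
from statistics import fmean

def post_stratified_mean(sample, stratum_sizes):
    """Weight each stratum's sample mean by its known population share."""
    N = sum(stratum_sizes.values())
    by_stratum = {}
    for stratum, value in sample:          # classify units into strata ex post
        by_stratum.setdefault(stratum, []).append(value)
    return sum(stratum_sizes[h] / N * fmean(vals) for h, vals in by_stratum.items())

# Hypothetical SRS in which urban respondents are over-represented.
sample = [("urban", 2.0), ("urban", 3.0), ("urban", 1.0), ("rural", 8.0)]
stratum_sizes = {"urban": 500, "rural": 500}  # assumed known population sizes

srs_mean = fmean(value for _, value in sample)
ps_mean = post_stratified_mean(sample, stratum_sizes)
print("usual SRS mean:", srs_mean)
print("PS mean:       ", ps_mean)
```

In this made-up case the PS estimator moves the estimate towards the under-represented rural stratum.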
Cluster sampling
1. The population is partitioned into clusters
• Elements within the cluster should be as heterogeneous as possible with respect to the variable of interest (e.g. area sampling)
2. A random sample of clusters is extracted through SRS (with probability proportional to the cluster size)
2a. All the elements of the cluster are selected (one-stage)
2b. A probabilistic sample is extracted from the cluster (two-stage cluster sampling)
Advantages:
• Reduced costs
• Higher feasibility
Disadvantages:
• Less precision
• Inference can be difficult
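One-stage and two-stage cluster sampling can be sketched as below. For simplicity the clusters are drawn with equal probabilities rather than with probability proportional to size, which is a simplifying assumption; the cluster labels and the function name are hypothetical.

```python
import random

def cluster_sample(clusters, n_clusters, n_within=None, rng=random):
    """Draw clusters by SRS, then take all units or a sub-sample of each."""
    chosen = rng.sample(sorted(clusters), n_clusters)  # stage 1: SRS of clusters
    units = []
    for c in chosen:
        members = clusters[c]
        if n_within is None:
            units.extend(members)                      # one-stage: all elements
        else:
            size = min(n_within, len(members))
            units.extend(rng.sample(members, size))    # two-stage: SRS in cluster
    return chosen, units

# Hypothetical area frame: 6 blocks with 5 households each.
clusters = {f"block{i}": [f"block{i}-hh{j}" for j in range(5)] for i in range(6)}

one_stage = cluster_sample(clusters, 2, rng=random.Random(0))
two_stage = cluster_sample(clusters, 2, n_within=2, rng=random.Random(0))
print(one_stage[0], len(one_stage[1]))
print(two_stage[0], len(two_stage[1]))
```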
Complex sampling designs
• Combination of different sampling methods to increase efficiency or
reduce costs
• Two-stage sampling: two different sampling units, where the second-stage sampling units are a sub-set of the first-stage ones.
• Typically in household surveys a sample of cities or municipalities is
extracted in the first stage, while in the second stage the actual sample
of households is extracted out of the first-stage units.
• Any probability design can be applied within each stage.
• For example, municipalities can be stratified according to their
populations in the first stage to ensure that the sample will include
small and rural towns as well as large cities, while in the second stage
one could apply area sampling, a particular type of cluster sampling
where:
1) each sampled municipality is subdivided into blocks on a map through
geographical coordinates;
2) blocks are extracted through simple random sampling;
3) all households in a block are interviewed.
Probability sampling with SPSS (a)
Probability sampling with SPSS (b)
Sampling with SAS
• SAS/STAT component procedures for the extraction
of samples and statistical inference
• Proc SURVEYSELECT allows one to extract probability-based samples
• Proc SURVEYMEANS computes sample statistics taking
into account the sample design
• Proc SURVEYREG estimates sample-based regression
relationships
Non-probability sampling
• Non-probability sampling does not allow one to accompany
sample estimates with evaluations of their precision and
accuracy
• Still, non-probability sampling is a common practice in
marketing research, especially quota sampling.
• It is not necessarily biasing or uninformative
• In some circumstances – for example when there is no
sampling frame – it may be the only viable solution
• Key limit – in general, techniques for statistical inference
cannot be used to generalize sample results to the
population
Convenience sampling
• Only convenient elements enter the sample
Advantages:
• Cheapest method
• Quickest method
Disadvantages:
• Selection bias
• Non-representativeness
• Inference is not possible
Judgmental sampling
• Selection based on the judgment of the
researcher
Advantages:
• Low cost
• Quick
Disadvantages:
• Non-representativeness
• Inference is not possible
• Subjective (potential selection bias)
Quota sampling
1. Define control categories (quotas) for the
population elements, such as sex, age…
2. Apply a restricted judgmental sampling so that
quotas in the sample are the same as those in the
population
Advantages:
• Cheapest method
• Quickest method
Disadvantages:
• There is no guarantee that the sample is representative (this depends on the relevance of the control characteristics chosen)
• Many sources of selection bias
• No assessment of sampling error
Snowball sampling
• A first small sample is selected randomly
• Respondents are asked to identify others
who belong to the population of interest
• The referrals will have demographic and
psychographic characteristics similar to the
referrers
Advantages:
• Lower costs
• Low variability
• Useful for rare populations
Disadvantages:
• Inference is not possible